When the model is processing a prompt, add the ability to abort
on request cancellation. This also catches SIGINT.
Signed-off-by: kingbri <bdashore3@proton.me>
This will manage dependencies from now on since it's a more flexible
file that's similar to other packaging utilities like npm and cargo.
Signed-off-by: kingbri <bdashore3@proton.me>
Yielding the finish reason before the logging causes the function to
terminate early. Instead, log before yielding and breaking out of the
generation loop.
Signed-off-by: kingbri <bdashore3@proton.me>
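
A minimal sketch of why the ordering in the commit above matters, assuming
the consumer stops iterating once it receives the finish reason. All names
here are hypothetical and not taken from the project.

```python
def stream_log_after(log):
    """Logs after the final yield, so the log is skipped if the consumer stops."""
    yield "token"
    yield "finish_reason"
    log.append("metrics")  # not reached when the consumer breaks on finish


def stream_log_before(log):
    """Logs first, then yields the finish reason."""
    yield "token"
    log.append("metrics")
    yield "finish_reason"


def consume(stream):
    """Hypothetical consumer that stops as soon as it sees the finish reason."""
    for chunk in stream:
        if chunk == "finish_reason":
            break


late_log, early_log = [], []
consume(stream_log_after(late_log))
consume(stream_log_before(early_log))
```

Here late_log stays empty while early_log contains the metrics entry, since
a generator body only runs up to the last yield the consumer asks for.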
If max_tokens is None, it automatically scales to fill up the context.
This does not mean the generation will fill up that context since
EOS stops also exist.
Originally suggested by #86
Signed-off-by: kingbri <bdashore3@proton.me>
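
A rough sketch of the max_tokens fallback described above. The function and
parameter names (resolve_max_tokens, max_seq_len, prompt_len) are
assumptions for illustration, not the project's actual code.

```python
from typing import Optional


def resolve_max_tokens(
    max_tokens: Optional[int], max_seq_len: int, prompt_len: int
) -> int:
    """Return the token budget for a generation.

    When max_tokens is None, scale to whatever context is left after the
    prompt. This is only an upper bound; an EOS stop can end generation
    earlier.
    """
    if max_tokens is None:
        return max(max_seq_len - prompt_len, 0)
    return max_tokens
```

For example, resolve_max_tokens(None, 4096, 1000) gives a budget of 3096
tokens, while an explicit value is passed through unchanged.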
Hardcode the max output length to 16, since it is the number of
tokens to predict per forward pass. 16 is a good value for both
normal inference and speculative decoding, and it saves VRAM
compared to the previous default of 2048.
Signed-off-by: kingbri <bdashore3@proton.me>
OAI expects finish_reason to be "stop" or "length" (there are others,
but they're not in the current scope of this project).
Make all completions and chat completions responses return this
from the model generation itself rather than putting a placeholder.
Signed-off-by: kingbri <bdashore3@proton.me>
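
One way the two finish reasons could be derived from the generation state,
as a sketch only. The helper name and its parameters are hypothetical; the
commit does not show how the model generation reports them.

```python
from typing import Optional


def get_finish_reason(
    hit_stop: bool, generated_tokens: int, max_tokens: int
) -> Optional[str]:
    """Map generation state to an OAI-style finish_reason.

    Returns "stop" when an EOS or stop condition ended generation,
    "length" when the token budget ran out, and None while generation
    is still in progress.
    """
    if hit_stop:
        return "stop"
    if generated_tokens >= max_tokens:
        return "length"
    return None
```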
This is a definite way to check if an authorized key is API or admin.
The endpoint only runs if the key is valid in the first place to stay
in line with the API's security model.
Signed-off-by: kingbri <bdashore3@proton.me>
Add the ability to override uvicorn's signal handler in addition
to using main's signal handler for any SIGINTs before the API server
starts.
Signed-off-by: kingbri <bdashore3@proton.me>
If the model didn't load properly, the container still exists until
unload is called. However, the name check still registered as true.
Signed-off-by: kingbri <bdashore3@proton.me>
Run these iterators on the background thread. On startup, the API
spawns a background thread as needed to run sync code without blocking
the event loop.
Use asyncio's run_thread function since it allows for errors to be
propagated.
Signed-off-by: kingbri <bdashore3@proton.me>
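
A sketch of driving a sync iterator from async code, assuming the function
meant above is asyncio.to_thread (Python 3.9+). The helper names are
hypothetical. Each next() call runs in a worker thread, and an exception
raised inside the iterator surfaces at the await.

```python
import asyncio

_DONE = object()  # sentinel so StopIteration never crosses the thread boundary


async def iterate_in_thread(sync_iterable):
    """Yield items from a sync iterable without blocking the event loop."""
    iterator = iter(sync_iterable)
    while True:
        # next(iterator, _DONE) returns the sentinel instead of raising
        item = await asyncio.to_thread(next, iterator, _DONE)
        if item is _DONE:
            return
        yield item


def count_to(n):
    """Stand-in for a blocking, sync generation iterator."""
    for i in range(n):
        yield i


def failing():
    """Stand-in for an iterator that errors midway."""
    yield 1
    raise ValueError("boom")


async def collect(sync_iterable):
    return [item async for item in iterate_in_thread(sync_iterable)]


async def main():
    print(await collect(count_to(3)))
    try:
        await collect(failing())
    except ValueError as exc:
        print("propagated:", exc)


if __name__ == "__main__":
    asyncio.run(main())
```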
Async generation helps remove many roadblocks to managing tasks
using threads. It should allow for abortables and modern-day paradigms.
NOTE: Exllamav2 itself is not an asynchronous library. It has just
been wrapped in tabby's async code to allow for a fast and concurrent
API server. Whether to run stream_ex in a separate thread or to
manage it manually using asyncio.sleep(0) is still being debated.
Signed-off-by: kingbri <bdashore3@proton.me>
These are mainly used for some clients that ping to see if the request
is alive. However, we don't need this.
Signed-off-by: kingbri <bdashore3@proton.me>
Speculative ngram decoding is like speculative decoding without the
draft model. It's not as useful because it only speeds up predictable
sequences, but it depends on the use case.
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, generation functions were bundled with the request functions,
causing the overall code structure and API to look ugly and unreadable.
Split these up and clean up a lot of the methods that were previously
overlooked in the API itself.
Signed-off-by: kingbri <bdashore3@proton.me>
Moving the API into its own directory helps compartmentalize it
and allows for cleaning up the main file to just contain bootstrapping
and the entry point.
Signed-off-by: kingbri <bdashore3@proton.me>
Use the module singleton pattern to share global state. This can also
be a modified version of the Global Object Pattern. The main reason
this pattern is used is for ease of use when handling global state
rather than adding extra dependencies for a DI parameter.
Signed-off-by: kingbri <bdashore3@proton.me>
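
A minimal sketch of the module singleton pattern mentioned above. Python
caches imported modules, so module-level variables act as shared state for
every importer. The module name and the container variable are hypothetical
placeholders.

```python
"""model_state.py: hypothetical module that owns the shared state."""
from typing import Any, Optional

# Module-level state; every importer of this module sees the same object.
container: Optional[Any] = None


def set_container(new_container: Any) -> None:
    """Replace the shared container."""
    global container
    container = new_container


def get_container() -> Optional[Any]:
    """Return the shared container, or None if nothing is loaded."""
    return container
```

Callers would import the module and go through get_container() rather than
using "from model_state import container", since the latter copies the
binding at import time and misses later reassignment.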
This is a shared module which manages the model container and provides
extra utility functions around it to help slim down the API.
Signed-off-by: kingbri <bdashore3@proton.me>
Similar to Gradio, fall back to port + 1 if the config port isn't
bindable. If neither port is available, let the user know and exit.
An infinite loop of finding a port isn't advisable.
Signed-off-by: kingbri <bdashore3@proton.me>
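
The fallback above could look roughly like this. It is a sketch using the
standard socket module; the function names and the injectable check
parameter are assumptions made to keep it testable.

```python
import socket
from typing import Callable


def is_bindable(host: str, port: int) -> bool:
    """Return True if a TCP socket can currently bind to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def choose_port(
    host: str, port: int, check: Callable[[str, int], bool] = is_bindable
) -> int:
    """Try the configured port, then port + 1, and exit if both fail."""
    for candidate in (port, port + 1):
        if check(host, candidate):
            return candidate
    # Only two candidates are tried; no open-ended port search.
    raise SystemExit(f"Ports {port} and {port + 1} are unavailable on {host}.")
```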
Rich markup sequences inside the log string were causing issues
with printing. Fix this by using their escape function.
Signed-off-by: kingbri <bdashore3@proton.me>
Starlette's StreamingResponse has an issue where it yields after
a request has disconnected. A bugfix to starlette will fix this
issue, but FastAPI uses starlette <= 0.36 which isn't ideal.
Therefore, switch back to sse-starlette which handles these disconnects
correctly.
Also don't try yielding after the request is disconnected. Just return
out of the generator instead.
Signed-off-by: kingbri <bdashore3@proton.me>
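
A sketch of returning out of the generator on disconnect. FakeRequest is a
hypothetical stand-in that mimics the async is_disconnected() method on
Starlette's Request, so the example runs without a server.

```python
import asyncio


class FakeRequest:
    """Stand-in request that reports a disconnect after N checks."""

    def __init__(self, disconnect_after: int):
        self._remaining = disconnect_after

    async def is_disconnected(self) -> bool:
        self._remaining -= 1
        return self._remaining < 0


async def stream_chunks(request, chunks):
    """Yield chunks until the client disconnects, then return."""
    for chunk in chunks:
        if await request.is_disconnected():
            # Stop here instead of yielding to a closed connection
            return
        yield chunk


async def collect(request, chunks):
    return [chunk async for chunk in stream_chunks(request, chunks)]
```

With FakeRequest(2) and three chunks, only the first two are yielded before
the generator returns.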