Models do not fully unload if an exception is caught during load.
Therefore, leave it to the client to unload on cancel.
Also add handlers for the event that an SSE stream is cancelled. These
packets can't be sent back to the client since the client has severed
the connection, so print them in the terminal.
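A minimal sketch of the handler, assuming an async generator named
generate_gen on the model container (the names here are illustrative):

    import asyncio

    async def stream_response(request, model_container):
        try:
            async for packet in model_container.generate_gen(request):
                yield packet
        except asyncio.CancelledError:
            # The client severed the connection, so remaining packets
            # can't be sent back. Print to the terminal instead.
            print("SSE stream cancelled by the client.")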
Signed-off-by: kingbri <bdashore3@proton.me>
Chat completions previously always yielded a final packet to say that
a generation finished. However, this caused errors where a yield was
executed after GeneratorExit. That error is legitimate: Python's
garbage collector can't clean up the generator after exit because the
finally block still executes.
In addition, SSE endpoints close the connection themselves, so the
finish packet can only be yielded once the response has completed;
skip the yield on exception.
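Roughly, the shape of the fix (a sketch; generate_tokens and
finish_packet are illustrative stand-ins for the real helpers):

    async def completion_generator(request):
        try:
            async for chunk in generate_tokens(request):
                yield chunk
            # Only yield the finish packet after a clean completion
            yield finish_packet()
        except GeneratorExit:
            # The client already closed the connection; yielding here
            # would raise, so skip the finish packet.
            return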
Signed-off-by: kingbri <bdashore3@proton.me>
FastAPI's queueing is a bit odd: if an await is used inside an
async def endpoint, requests aren't executed sequentially. Restore
sequential handling by using a semaphore to limit concurrent execution
of the generator functions.
Also scaffold the framework to move generator functions to their own
file.
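The pattern, sketched (the semaphore lives at module scope so every
generator function shares it):

    import asyncio

    # Only allow one generation to run at a time
    generate_semaphore = asyncio.Semaphore(1)

    async def generate_completion(request, model_container):
        async with generate_semaphore:
            async for chunk in model_container.generate_gen(request):
                yield chunk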
Signed-off-by: kingbri <bdashore3@proton.me>
Low_mem doesn't work in exl2 and it was an experimental option to
begin with. Keep the loading code commented out in case it gets fixed
in the future.
A better alternative is the 8-bit cache, which works and helps save
VRAM.
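For reference, selecting the 8-bit cache in exllamav2 looks roughly
like this (a sketch, assuming a use_8bit_cache flag from the config;
the surrounding loader code is omitted):

    from exllamav2 import ExLlamaV2Cache, ExLlamaV2Cache_8bit

    # Trade a little precision for roughly half the cache VRAM
    cache_class = ExLlamaV2Cache_8bit if use_8bit_cache else ExLlamaV2Cache
    cache = cache_class(model)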
Signed-off-by: kingbri <bdashore3@proton.me>
* Enable automatic calculation of NTK-aware alpha scaling for models
when the rope_alpha arg is not passed in the config, using the same
formula used for draft models.
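A sketch of the calculation, assuming ratio = max_seq_len /
base_seq_len and the quadratic alpha fit commonly used for NTK-aware
scaling (function and parameter names are illustrative):

    def calculate_rope_alpha(base_seq_len: int, max_seq_len: int) -> float:
        ratio = max_seq_len / base_seq_len
        # No scaling needed when the target context fits the base context
        if ratio <= 1.0:
            return 1.0
        return -0.13436 + 0.80541 * ratio + 0.28833 * ratio ** 2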
sse_starlette kept firing ping responses when setting an event took
too long. Rather than using a hacky workaround, switch to FastAPI's
built-in streaming response and construct SSE packets with a utility
function.
This makes the API more robust and removes an extra requirement.
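The utility boils down to formatting each payload as an SSE data
frame (a sketch; generate is an illustrative stand-in for the real
generator functions):

    from fastapi import FastAPI
    from fastapi.responses import StreamingResponse

    app = FastAPI()

    def get_sse_packet(json_data: str) -> str:
        # Server-sent events are plain "data: <payload>\n\n" frames
        return f"data: {json_data}\n\n"

    @app.post("/v1/completions")
    async def create_completion(request: dict):
        async def generator():
            async for chunk in generate(request):  # illustrative
                yield get_sse_packet(chunk)

        return StreamingResponse(generator(), media_type="text/event-stream")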
Signed-off-by: kingbri <bdashore3@proton.me>
Some APIs require an OAI model name to be returned from the models
endpoint. Fix this by adding a gpt-3.5-turbo entry first in the list
to cover as many APIs as possible.
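The gist, sketched (the real endpoint wraps these IDs in OAI model
card objects):

    from pathlib import Path

    def get_model_ids(model_dir: Path) -> list[str]:
        # Dummy OAI entry first so client compatibility checks pass
        ids = ["gpt-3.5-turbo"]
        ids += sorted(p.name for p in model_dir.iterdir() if p.is_dir())
        return ids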
Signed-off-by: kingbri <bdashore3@proton.me>
The OAI spec requires chat completions to provide a finish reason
once streaming has completed. This is different from a non-streaming
chat completion response.
Also fix some errors that were raised from the endpoint.
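A sketch of the final streamed chunk (the field names follow the OAI
spec; the helper name is illustrative):

    import json

    def get_finish_chunk() -> str:
        # Streaming ends with an empty delta carrying finish_reason;
        # non-streaming responses put finish_reason on the choice with
        # the full message instead.
        chunk = {
            "object": "chat.completion.chunk",
            "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}],
        }
        return f"data: {json.dumps(chunk)}\n\n"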
References #15
Signed-off-by: kingbri <bdashore3@proton.me>
If the generator errors, there was no proper handling to send an
error packet and close the connection.
This is especially important for unloading models when a load fails
at any stage, so a user's VRAM can be reclaimed. Raising an exception
caused the model_container object to lock and never get freed by the
GC.
Given that, it made sense to propagate SSE errors across all generator
functions rather than relying on abort signals.
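Sketched, reusing the get_sse_packet helper from earlier (the wrapper
is illustrative; each generator function gets equivalent handling):

    import json

    async def wrap_sse_generator(inner):
        try:
            async for packet in inner:
                yield packet
        except Exception as exc:
            # Send one final error packet so the client knows the
            # stream failed, then let the connection close
            error = {"error": {"message": str(exc)}}
            yield get_sse_packet(json.dumps(error))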
Signed-off-by: kingbri <bdashore3@proton.me>
The default encoding when opening files on Windows is cp1252, which
doesn't support all of Unicode and can cause issues.
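The fix is to pass the encoding explicitly whenever a file is opened,
e.g. for a YAML config:

    import yaml

    with open("config.yml", "r", encoding="utf-8") as config_file:
        config = yaml.safe_load(config_file)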
Signed-off-by: kingbri <bdashore3@proton.me>
This reverts commit cad144126f.
Change this parameter back to repetition_decay. It is different from
the rep_pen_slope used in other backends such as Kobold and NAI.
Still keep the fallback condition, though.
Signed-off-by: kingbri <bdashore3@proton.me>
Unlike other backends, tabby attempts to generate even if the context
is greater than the max sequence length, by truncating the given
context.
Rather than artificially erroring out, warn that the console metrics
output will be incorrect and that the user should keep
context <= max_seq_len.
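The check, sketched with illustrative names:

    context_len = len(prompt_ids)
    if context_len > max_seq_len:
        print(
            f"WARNING: Context length {context_len} is greater than "
            f"max_seq_len {max_seq_len}. The console metrics below will "
            "be inaccurate. Make sure context <= max_seq_len."
        )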
Signed-off-by: kingbri <bdashore3@proton.me>
Alias repetition_penalty_range to repetition_range since the latter is
used as an internal variable. Perhaps in the future there should be a
function that iterates through request aliases and gives a default
value.
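One way to express the alias, assuming pydantic v2's AliasChoices is
available (a sketch, not necessarily how the request model does it):

    from typing import Optional

    from pydantic import AliasChoices, BaseModel, Field

    class CompletionRequest(BaseModel):
        # Accept both the internal name and the long-form alias
        repetition_range: Optional[int] = Field(
            default=None,
            validation_alias=AliasChoices(
                "repetition_range", "repetition_penalty_range"
            ),
        )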
Signed-off-by: kingbri <bdashore3@proton.me>
On /v1/model/load, some internal server errors weren't being sent,
so move the directory checking out and add a check to make sure the
proposed model path exists.
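The added check, sketched (the base directory and error message are
illustrative):

    from pathlib import Path

    from fastapi import HTTPException

    model_path = Path("models") / request.name
    if not model_path.exists():
        raise HTTPException(400, "Model path does not exist. Check the name?")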
Signed-off-by: kingbri <bdashore3@proton.me>
Use 2.3.4 from tgw. However, keep the 2.3.3 wheels in requirements
in case the newer wheels don't work, for now.
Signed-off-by: kingbri <bdashore3@proton.me>
Documented in previous commits. Also, for version checking, check the
value in kwargs instead of whether the key is present, since requests
pass default values.
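In short (penalty_decay is an illustrative key):

    # Wrong: the key is always present because requests pass defaults
    if "penalty_decay" in kwargs:
        ...

    # Right: check the value itself
    if kwargs.get("penalty_decay") is not None:
        ...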
Signed-off-by: kingbri <bdashore3@proton.me>
A simple batch script to activate a venv and start TabbyAPI. This
can be used with nssm on Windows for a systemd-like background
service.
Signed-off-by: kingbri <bdashore3@proton.me>