Use a queue-based system to get choices independently and send them
in the overall streaming payload. This method allows for unordered
streaming of generations.
The system is a bit redundant, so maybe make the code more optimized
in the future.
Signed-off-by: kingbri <bdashore3@proton.me>
For multiple generations in the same request, nested arrays kept their
original reference, resulting in duplications. This will occur with
any collection type.
For optimization purposes, a deepcopy isn't run for the first iteration
since original references are created.
This is not the most elegant solution, but it works for the described
cases.
Signed-off-by: kingbri <bdashore3@proton.me>
This adds the ability to add multiple choices to a generation. This
is only available for non-streaming gens for now, it requires some
more work to port over to streaming.
Signed-off-by: kingbri <bdashore3@proton.me>
While TabbyAPI doesn't need a config.yml to run, new users can get
confused by the task of copying config_sample.yml to config.yml.
Therefore, automatically do this in the start script to immediately
expose options to the user.
Signed-off-by: kingbri <bdashore3@proton.me>
Cache size must be a multiple of 256 to work properly in ExllamaV2.
Take the config value and set the cache size to one multiple above
the remainder of the cache size divided by 256.
This is because cache size can never be lower than max_seq_len.
If max_seq_len isn't a multiple of 256, this method will never
yield a number that's lower than max_seq_len since it's no longer
a source of truth.
Signed-off-by: kingbri <bdashore3@proton.me>
Waiting for request disconnect takes some extra time and allows
generation chunks to pile up, resulting in large payloads being sent
at once not making up a smooth stream.
Use the polling method in non-streaming requests by creating a background
task and then check if the task is done, signifying that the request
has been disconnected.
Signed-off-by: kingbri <bdashore3@proton.me>
Depending on the day of the week, Starlette can work with a CancelledError
or using await request.is_disconnected(). Run the same behavior for both
cases and allow cancellation.
Streaming requests now set an event to cancel the batched job and break
out of the generation loop.
Signed-off-by: kingbri <bdashore3@proton.me>
If a user is using GPU split, check compute capability on only those
GPUs. Autosplit assumes that all GPUs will be used.
Signed-off-by: kingbri <bdashore3@proton.me>
List comprehensions are the more "pythonic" way to approach mapping
values to a list. They're also more flexible across different collection
types rather than the inbuilt map method. It's best to keep one convention
rather than splitting down two.
Signed-off-by: kingbri <bdashore3@proton.me>
FA2 v2.5.7 and up is not supported below ampere and on AMD GPUs.
Clarify the error message and explain what happens as a result.
Signed-off-by: kingbri <bdashore3@proton.me>
Add a sequential lock and wait until jobs are completed before executing
any loading requests that directly alter the model. However, we also
need to block any new requests that come in until the load is finished,
so add a condition that triggers once the lock is free.
Signed-off-by: kingbri <bdashore3@proton.me>
The dynamic generator requires Flash attention 2.5.7 or higher to
be installed. This is only supported on Nvidia's 30 series and higher.
If a card is AMD or lower than the 30 series, switch to compatability
mode which functions the same way as the older generator, except
without parallel batching and any features that depend on it, such as
CFG.
Signed-off-by: kingbri <bdashore3@proton.me>
Dynamic generator needed multiple prompts to be tokenized and sent
for them to be sampled in serial, but generated in parallel.
Signed-off-by: kingbri <bdashore3@proton.me>
The new async dynamic job allows for native async support without the
need of threading. Also add logprobs and metrics back to responses.
Signed-off-by: kingbri <bdashore3@proton.me>
Dynamic gen takes in filters differently. Adjust to set the filter list
per class rather than in the generation function.
Signed-off-by: kingbri <bdashore3@proton.me>
Adds basic support for ExllamaV2's dynamic generator. Can generate
a streaming and non-streaming completion.
Signed-off-by: kingbri <bdashore3@proton.me>
At any point for any request cancellation, the semaphore will be
decremented. This is an issue since an arbitrary request can desync
the semaphore, causing multiple tasks to be processed at once and
break generation.
Remove this from the networking handlers and therefore, remove the
release_semaphore function itself.
Signed-off-by: kingbri <bdashore3@proton.me>
If an override was iterable, any modifications to the returned value
would alter the reference to the global storage dict.
Therefore, copy the structure if it's an iterable so any modification
won't alter the original override. Also apply this for the function
that checks for forced overrides.
Signed-off-by: kingbri <bdashore3@proton.me>
skip_special_tokens is in stable exl2. Also default the parameters
if they are not present in the function signature.
Signed-off-by: kingbri <bdashore3@proton.me>