FA2 v2.5.7 and up is not supported below ampere and on AMD GPUs.
Clarify the error message and explain what happens as a result.
Signed-off-by: kingbri <bdashore3@proton.me>
Add a sequential lock and wait until jobs are completed before executing
any loading requests that directly alter the model. However, we also
need to block any new requests that come in until the load is finished,
so add a condition that triggers once the lock is free.
Signed-off-by: kingbri <bdashore3@proton.me>
The dynamic generator requires Flash attention 2.5.7 or higher to
be installed. This is only supported on Nvidia's 30 series and higher.
If a card is AMD or lower than the 30 series, switch to compatability
mode which functions the same way as the older generator, except
without parallel batching and any features that depend on it, such as
CFG.
Signed-off-by: kingbri <bdashore3@proton.me>
Dynamic generator needed multiple prompts to be tokenized and sent
for them to be sampled in serial, but generated in parallel.
Signed-off-by: kingbri <bdashore3@proton.me>
The new async dynamic job allows for native async support without the
need of threading. Also add logprobs and metrics back to responses.
Signed-off-by: kingbri <bdashore3@proton.me>
Dynamic gen takes in filters differently. Adjust to set the filter list
per class rather than in the generation function.
Signed-off-by: kingbri <bdashore3@proton.me>
Adds basic support for ExllamaV2's dynamic generator. Can generate
a streaming and non-streaming completion.
Signed-off-by: kingbri <bdashore3@proton.me>
At any point for any request cancellation, the semaphore will be
decremented. This is an issue since an arbitrary request can desync
the semaphore, causing multiple tasks to be processed at once and
break generation.
Remove this from the networking handlers and therefore, remove the
release_semaphore function itself.
Signed-off-by: kingbri <bdashore3@proton.me>
If an override was iterable, any modifications to the returned value
would alter the reference to the global storage dict.
Therefore, copy the structure if it's an iterable so any modification
won't alter the original override. Also apply this for the function
that checks for forced overrides.
Signed-off-by: kingbri <bdashore3@proton.me>
skip_special_tokens is in stable exl2. Also default the parameters
if they are not present in the function signature.
Signed-off-by: kingbri <bdashore3@proton.me>
From exllamav2: List of strings that the generator will refuse to output. As soon as a partial match happens, a checkpoint is saved that the generator can rewind to if need be. Subsequent tokens are then held until the full string is resolved (match or no match) and either emitted or discarded, accordingly.
Bans the EOS token until the generation reaches a minimum length. This will not prevent the model from otherwise ending the generation early by outputting other stop conditions.
This reverts commit 7556dcf134.
The Optionals allowed requests to send "null" in the body for optional
parameters which should be allowed.
Signed-off-by: kingbri <bdashore3@proton.me>
These both take an array of glob strings to state what files or
directories to include or exclude when parsing the download list.
Signed-off-by: kingbri <bdashore3@proton.me>
Use None-ish coalescing instead of unwrap optional handling. This means
that any value that is "empty" for python will default to the fallback.
Ex. print("" or "test") will print out "test"
Signed-off-by: kingbri <bdashore3@proton.me>
Adds an asynchronous huggingface downloader that uses HF hub to fetch
all repo files. The current HF hub package has a snapshot_download
function that does not cancel on KeyboardInterrupt.
Instead, make a downloader that uses the Rich progress bar styling
along with a cancellable interface. Finally, link this to TabbyAPI.
Signed-off-by: kingbri <bdashore3@proton.me>