This adds the ability to add multiple choices to a generation. This
is only available for non-streaming gens for now, it requires some
more work to port over to streaming.
Signed-off-by: kingbri <bdashore3@proton.me>
The dynamic generator requires Flash attention 2.5.7 or higher to
be installed. This is only supported on Nvidia's 30 series and higher.
If a card is AMD or lower than the 30 series, switch to compatability
mode which functions the same way as the older generator, except
without parallel batching and any features that depend on it, such as
CFG.
Signed-off-by: kingbri <bdashore3@proton.me>
This reverts commit 7556dcf134.
The Optionals allowed requests to send "null" in the body for optional
parameters which should be allowed.
Signed-off-by: kingbri <bdashore3@proton.me>
These both take an array of glob strings to state what files or
directories to include or exclude when parsing the download list.
Signed-off-by: kingbri <bdashore3@proton.me>
Adds an asynchronous huggingface downloader that uses HF hub to fetch
all repo files. The current HF hub package has a snapshot_download
function that does not cancel on KeyboardInterrupt.
Instead, make a downloader that uses the Rich progress bar styling
along with a cancellable interface. Finally, link this to TabbyAPI.
Signed-off-by: kingbri <bdashore3@proton.me>
response_prefix is used to add a prefix before generating the next
message. This is used in many cases such as continuining a prompt
(see #96).
Also if a template has BOS token specified, add_bos_token will
append two BOS tokens. Add a check which strips a starting BOS token
from the prompt if it exists.
Signed-off-by: kingbri <bdashore3@proton.me>
A chat completion can now declare extra template_vars to pass when
a template is rendered, opening up the possibility of using state
outside of huggingface's parameters.
Signed-off-by: kingbri <bdashore3@proton.me>
response_format allows a user to request a valid, but arbitrary JSON
object from the API. This is a new part of the OAI spec.
Signed-off-by: kingbri <bdashore3@proton.me>
Wrong class attribute name used for max_attention_size and fixes
declaration of the draft model's chunk_size.
Also expose the parameter to the end user in both config and model
load.
Signed-off-by: kingbri <bdashore3@proton.me>
OAI expects finish_reason to be "stop" or "length" (there are others,
but they're not in the current scope of this project).
Make all completions and chat completions responses return this
from the model generation itself rather than putting a placeholder.
Signed-off-by: kingbri <bdashore3@proton.me>
This is a definite way to check if an authorized key is API or admin.
The endpoint only runs if the key is valid in the first place to keep
inline with the API's security model.
Signed-off-by: kingbri <bdashore3@proton.me>
Moving the API into its own directory helps compartmentalize it
and allows for cleaning up the main file to just contain bootstrapping
and the entry point.
Signed-off-by: kingbri <bdashore3@proton.me>