* returning stop str if exists from gen
* added chat template for firefunctionv2
* pulling tool vars from template
* adding parsing for tool inputs/outputs
* passing tool data from endpoint to chat template, adding tool_start to the stop list
* loosened typing on the response tool call, leaning more on the user supplying a quality schema if they want a particular format
* non streaming generation prototype
* cleaning template
* Continued work with type, ingestion into template, and chat template for fire func
* Correction - streaming toolcall comes back as delta obj not inside chatcomprespchoice per chat_completion_chunk.py inside OAI lib.
* Ruff Formating
* Moved stop string and tool updates out of prompt creation func
Updated tool pydantic to match OAI
Support for streaming
Updated generate tool calls to use flag within chat_template and insert tool reminder
* Llama 3.1 chat templates
Updated fire func template
* renamed llama3.1 to chatml_with_headers..
* update name of template
* Support for calling a tool start token rather than the string.
Simplified tool_params
Warning when gen_settings are being overidden becuase user set temp to 0
Corrected schema and tools to correct types for function args. Str for some reason
* draft groq tool use model template
* changed headers to vars for readablity (but mostly because some models are weird about newlines after headers, so this is an easier way to change globally)
* Clean up comments and code in chat comp
* Post processed tool call to meet OAI spec rather than forcing model to write json in a string in the middle of the call.
* changes example back to args as json rather than string of json
* Standardize chat templates to each other
* cleaning/rewording
* stop elements can also be ints (tokens)
* Cleaning/formatting
* added special tokens for tools and tool_response as specified in description
* Cleaning
* removing aux templates - going to live in llm-promp-templates repo instead
* Tree: Format
Signed-off-by: kingbri <bdashore3@proton.me>
* Chat Completions: Don't include internal tool variables in OpenAPI
Use SkipJsonSchema to supress inclusion with the OpenAPI JSON. The
location of these variables may need to be changed in the future.
Signed-off-by: kingbri <bdashore3@proton.me>
* Templates: Deserialize metadata on template load
Since we're only looking for specific template variables that are
static in the template, it makes more sense to render when the template
is initialized.
Signed-off-by: kingbri <bdashore3@proton.me>
* Tools: Fix comments
Adhere to the format style of comments in the rest of the project.
Signed-off-by: kingbri <bdashore3@proton.me>
---------
Co-authored-by: Ben Gitter <gitterbd@gmail.com>
Signed-off-by: kingbri <bdashore3@proton.me>
Place OAI specific routes in the appropriate folder. This is in
preperation for adding new API servers that can be optionally enabled.
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, the parameters under the "model" block in config.yml only
handled the loading of a model on startup. This meant that any subsequent
API request required each parameter to be filled out or use a sane default
(usually defaults to the model's config.json).
However, there are cases where admins may want an argument from the
config to apply if the parameter isn't provided in the request body.
To help alleviate this, add a mechanism that works like sampler overrides
where users can specify a flag that acts as a fallback.
Therefore, this change both preserves the source of truth of what
parameters the admin is loading and adds some convenience for users
that want customizable defaults for their requests.
This behavior may change in the future, but I think it solves the
issue for now.
Signed-off-by: kingbri <bdashore3@proton.me>
* Model: Clean up paged attention checks
* Model: Move cache_size checks after paged attn checks
Cache size is only relevant in paged mode
* Model: Fix no_flash_attention
* Model: Remove no_flash_attention
Ability to use flash attention is auto-detected, so this flag is unneeded. Uninstall flash attention to disable it on supported hardware.
This adds the ability to add multiple choices to a generation. This
is only available for non-streaming gens for now, it requires some
more work to port over to streaming.
Signed-off-by: kingbri <bdashore3@proton.me>
The dynamic generator requires Flash attention 2.5.7 or higher to
be installed. This is only supported on Nvidia's 30 series and higher.
If a card is AMD or lower than the 30 series, switch to compatability
mode which functions the same way as the older generator, except
without parallel batching and any features that depend on it, such as
CFG.
Signed-off-by: kingbri <bdashore3@proton.me>
This reverts commit 7556dcf134.
The Optionals allowed requests to send "null" in the body for optional
parameters which should be allowed.
Signed-off-by: kingbri <bdashore3@proton.me>
These both take an array of glob strings to state what files or
directories to include or exclude when parsing the download list.
Signed-off-by: kingbri <bdashore3@proton.me>
Adds an asynchronous huggingface downloader that uses HF hub to fetch
all repo files. The current HF hub package has a snapshot_download
function that does not cancel on KeyboardInterrupt.
Instead, make a downloader that uses the Rich progress bar styling
along with a cancellable interface. Finally, link this to TabbyAPI.
Signed-off-by: kingbri <bdashore3@proton.me>
response_prefix is used to add a prefix before generating the next
message. This is used in many cases such as continuining a prompt
(see #96).
Also if a template has BOS token specified, add_bos_token will
append two BOS tokens. Add a check which strips a starting BOS token
from the prompt if it exists.
Signed-off-by: kingbri <bdashore3@proton.me>
A chat completion can now declare extra template_vars to pass when
a template is rendered, opening up the possibility of using state
outside of huggingface's parameters.
Signed-off-by: kingbri <bdashore3@proton.me>
response_format allows a user to request a valid, but arbitrary JSON
object from the API. This is a new part of the OAI spec.
Signed-off-by: kingbri <bdashore3@proton.me>
Wrong class attribute name used for max_attention_size and fixes
declaration of the draft model's chunk_size.
Also expose the parameter to the end user in both config and model
load.
Signed-off-by: kingbri <bdashore3@proton.me>
OAI expects finish_reason to be "stop" or "length" (there are others,
but they're not in the current scope of this project).
Make all completions and chat completions responses return this
from the model generation itself rather than putting a placeholder.
Signed-off-by: kingbri <bdashore3@proton.me>
This is a definite way to check if an authorized key is API or admin.
The endpoint only runs if the key is valid in the first place to keep
inline with the API's security model.
Signed-off-by: kingbri <bdashore3@proton.me>
Moving the API into its own directory helps compartmentalize it
and allows for cleaning up the main file to just contain bootstrapping
and the entry point.
Signed-off-by: kingbri <bdashore3@proton.me>