Add the ability to use an unsafe config flag if needed and migrate
the exl2 check to a different file within the exl2 backend code.
Signed-off-by: kingbri <bdashore3@proton.me>
Exllamav2 is currently supported on all GPUs and versions. Therefore,
it should be expected that users use the latest version of exllamav2 to
get the latest features.
Doing this helps reduce checks that don't really serve any purpose.
Signed-off-by: kingbri <bdashore3@proton.me>
Allow users to switch the currently overridden samplers via the API
so a restart isn't required to switch the overrides.
Signed-off-by: kingbri <bdashore3@proton.me>
Unify API sampler params into a superclass which should make them
easier to manage and inherit generic functions from.
Not all frontends expose all sampling parameters because they are tied
to OAI (which handles sampling itself, with the exception of a few
sliders).
Add the ability for the user to customize fallback parameters
server-side.
In addition, parameters can be forced to a certain value server-side
in case the repo automatically sets other sampler values in the
background that the user doesn't want.
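
A minimal sketch of how fallback and forced values could be resolved.
The OVERRIDES table, its "override"/"force" keys, and resolve_sampler
are hypothetical names used for illustration, not the repo's actual
format:

```python
from typing import Any, Dict, Optional

# Hypothetical server-side override table (assumed layout).
OVERRIDES: Dict[str, Dict[str, Any]] = {
    "temperature": {"override": 0.7, "force": False},  # fallback only
    "top_p": {"override": 0.9, "force": True},  # always wins
}


def resolve_sampler(
    name: str,
    requested: Optional[Any],
    overrides: Dict[str, Dict[str, Any]] = OVERRIDES,
) -> Optional[Any]:
    """Pick the value to sample with for one parameter."""
    entry = overrides.get(name)
    if entry is None:
        # No server-side entry: pass the request value through untouched.
        return requested
    if entry.get("force") or requested is None:
        # Forced, or the frontend did not send the parameter at all.
        return entry["override"]
    return requested
```

In this sketch a forced entry ignores whatever the request sent, while
a non-forced entry only fills in parameters the frontend left out.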
Signed-off-by: kingbri <bdashore3@proton.me>
Move common functions into their own folder and refactor the backends
to use their own folder as well.
Also clean up imports and alphabetize the import statements themselves.
Finally, move colab and docker into their own folders as well.
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, if model_name was commented out, a load would not occur.
Handle the case where model_name or loras is blank, which returns None
when parsing the YAML.
Signed-off-by: kingbri <bdashore3@proton.me>
Add an argparser whose parsed arguments are cast to dictionaries of
subgroups so they integrate with the config.
This argparser doesn't contain everything in the config due to complexity
issues with CLI args, but will eventually progress to parity. In addition,
it's used to override the config.yml rather than replace it.
A config arg is also provided if the user wants to fully override the
config yaml with another file path.
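
A rough sketch of the idea, assuming the config is a dict of sections.
The GROUPS table, the option names, and both helper functions are
hypothetical and only illustrate grouping CLI args and overriding a
loaded config:

```python
import argparse
from typing import Any, Dict

# Hypothetical subgroups and their options (assumed names).
GROUPS = {
    "network": {"host": str, "port": int},
    "model": {"model_name": str},
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", type=str, help="Path to another config file")
    for group_name, options in GROUPS.items():
        group = parser.add_argument_group(group_name)
        for dest, kind in options.items():
            flag = "--" + dest.replace("_", "-")
            group.add_argument(flag, dest=dest, type=kind)
    return parser


def args_to_dict(namespace: argparse.Namespace) -> Dict[str, Dict[str, Any]]:
    """Cast parsed args to one dict per subgroup, skipping unset options."""
    values = vars(namespace)
    return {
        group_name: {
            key: values[key] for key in options if values[key] is not None
        }
        for group_name, options in GROUPS.items()
    }


def override_config(config: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Layer CLI values on top of the loaded config without replacing it."""
    merged = {name: dict(section or {}) for name, section in config.items()}
    for name, section in overrides.items():
        merged.setdefault(name, {}).update(section)
    return merged
```

Only options actually passed on the command line end up in the
override dicts, so everything else keeps its value from the YAML.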
Signed-off-by: kingbri <bdashore3@proton.me>
Similar to the transformers library, add an error handler when an
exception is fired. This relays the error to the user.
Signed-off-by: kingbri <bdashore3@proton.me>
These are commonly seen in huggingface provided chat templates and
aren't that difficult to add in.
For feature parity, honor the add_bos_token and ban_eos_token
parameters when constructing the prompt.
Signed-off-by: kingbri <bdashore3@proton.me>
This creates a massive security hole, but it's gated behind a flag
for users who only use localhost.
A warning will pop up when users disable authentication.
Signed-off-by: kingbri <bdashore3@proton.me>
Non-streaming tasks were not regulated by the semaphore, causing these
tasks to interfere with streaming generations. Add helper functions
to take in both sync and async functions for callbacks and sequential
blocking with the semaphore.
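
One possible shape for such a helper, as a sketch only. The name
call_with_semaphore and the two example callbacks are hypothetical:

```python
import asyncio
import inspect


async def call_with_semaphore(semaphore, callback, *args):
    """Run callback under the semaphore, awaiting it if it is async."""
    async with semaphore:
        result = callback(*args)
        # Async callbacks return an awaitable; sync ones return the value.
        if inspect.isawaitable(result):
            result = await result
        return result


def count_words(text):
    # Stand-in for a non-streaming (sync) task.
    return len(text.split())


async def generate(text):
    # Stand-in for an async generation task.
    await asyncio.sleep(0)
    return text.upper()


async def main():
    semaphore = asyncio.Semaphore(1)
    first = await call_with_semaphore(semaphore, count_words, "a b c")
    second = await call_with_semaphore(semaphore, generate, "hi")
    return first, second
```

Note that a sync callback still blocks the event loop while it runs;
the semaphore only keeps other guarded tasks from starting meanwhile.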
Signed-off-by: kingbri <bdashore3@proton.me>
Append generation prompts if given the flag on an OAI chat completion
request.
This appends the "assistant" message to the instruct prompt. Defaults
to true since this is intended behavior.
Signed-off-by: kingbri <bdashore3@proton.me>
Jinja2 is a lightweight template parser that's used in Transformers
for parsing chat completions. It's much more efficient than Fastchat
and can be imported as part of requirements.
Also allows for unblocking Pydantic's version.
Users now have to provide their own template if needed. A separate
repo may be usable for common prompt template storage.
Signed-off-by: kingbri <bdashore3@proton.me>
Mistakenly forgot that the user can choose what cache mode to use
when loading a model.
Also include the cache mode when fetching model info.
Signed-off-by: kingbri <bdashore3@proton.me>
Generations can be logged in the console along with sampling parameters
if the user enables it in config.
Metrics are always logged at the end of each prompt. In addition,
the model endpoint tells the user if they're being logged or not
for transparency purposes.
Signed-off-by: kingbri <bdashore3@proton.me>
Sometimes fastchat may not be able to detect the prompt template from
the model path. Therefore, add the ability to set it in config.yml or
via the request object itself.
Also send the provided prompt template on model info request.
Signed-off-by: kingbri <bdashore3@proton.me>
Python doesn't have proper handling of optionals. The only way to
handle them is checking via an if statement if the value is None or
by using the "or" keyword to unwrap optionals.
Previously, I used the "or" method to unwrap, but this caused issues
because falsy values fell back to the default. This is especially
the case with booleans, where "False" changed to "True".
Instead, add two new functions: unwrap and coalesce. Both provide a
functional way to properly implement "None" coalescing.
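
A minimal sketch of what the two helpers could look like. The
signatures here are an assumption, not necessarily the ones in the
repo:

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def unwrap(wrapped: Optional[T], default: Optional[T] = None) -> Optional[T]:
    """Return wrapped unless it is None, in which case return default."""
    return default if wrapped is None else wrapped


def coalesce(*args):
    """Return the first argument that is not None, or None if all are."""
    return next((arg for arg in args if arg is not None), None)


# The "or" approach loses an explicit False; unwrap keeps it.
lost = False or True
kept = unwrap(False, True)
```

Here "lost" ends up True while "kept" stays False, which is the
boolean issue described above.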
Signed-off-by: kingbri <bdashore3@proton.me>
* Model: Implement basic lora support
* Add ability to load loras from config on launch
* Supports loading multiple loras and lora scaling
* Add function to unload loras
* Colab: Update for basic lora support
* Model: Test vram alloc after lora load, add docs
* Git: Add loras folder to .gitignore
* API: Add basic lora-related endpoints
* Add /loras/ endpoint for querying available loras
* Add /model/lora endpoint for querying currently loaded loras
* Add /model/lora/load endpoint for loading loras
* Add /model/lora/unload endpoint for unloading loras
* Move lora config-checking logic to main.py for better compat with API endpoints
* Revert bad CRLF line ending changes
* API: Add basic lora-related endpoints (fixed)
* Add /loras/ endpoint for querying available loras
* Add /model/lora endpoint for querying currently loaded loras
* Add /model/lora/load endpoint for loading loras
* Add /model/lora/unload endpoint for unloading loras
* Move lora config-checking logic to main.py for better compat with API endpoints
* Model: Unload loras first when unloading model
* API + Models: Cleanup lora endpoints and functions
Condenses down endpoint and model load code. Also makes the routes
behave the same way as model routes to help not confuse the end user.
Signed-off-by: kingbri <bdashore3@proton.me>
* Loras: Optimize load endpoint
Return successes and failures along with consolidating the request
to the rewritten load_loras function.
Signed-off-by: kingbri <bdashore3@proton.me>
---------
Co-authored-by: kingbri <bdashore3@proton.me>
Co-authored-by: DocShotgun <126566557+DocShotgun@users.noreply.github.com>
Draft wasn't being parsed correctly with the new changes which removed
the draft_enabled bool. There's still some more work to be done with
returning exceptions.
Signed-off-by: kingbri <bdashore3@proton.me>
Models do not fully unload if an exception is caught in load. Therefore,
leave it to the client to unload on cancel.
Also add handlers in the event an SSE stream is cancelled. These packets
can't be sent back to the client since the client has severed the
connection, so print them in the terminal.
Signed-off-by: kingbri <bdashore3@proton.me>
Chat completions previously always yielded a final packet to say that
a generation finished. However, this raised errors stating that a yield
was executed after GeneratorExit. The error is correct: Python can't
clean up the generator after it exits because the finally block is
still trying to yield.
In addition, SSE endpoints close off the connection, so the finish packet
can only be yielded when the response has completed, so ignore yield on
exception.
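
The behavior can be reproduced with a plain (sync) generator. This is
a simplified sketch with hypothetical function names; the real
endpoints are async, but the rule about yielding after GeneratorExit
is the same:

```python
def bad_stream(chunks):
    try:
        for chunk in chunks:
            yield chunk
    finally:
        # Runs on cancellation too, so this yields after GeneratorExit.
        yield "[finish]"


def good_stream(chunks):
    try:
        for chunk in chunks:
            yield chunk
    except GeneratorExit:
        # Consumer went away: skip the finish packet entirely.
        return
    # Only reached when every chunk was delivered.
    yield "[finish]"


def close_early(stream):
    """Read one chunk, then close; report whether closing raised."""
    gen = stream(["a", "b"])
    next(gen)
    try:
        gen.close()
    except RuntimeError as exc:
        return str(exc)
    return None
```

Closing bad_stream early raises RuntimeError ("generator ignored
GeneratorExit"), while good_stream closes cleanly and still emits the
finish packet on a full run.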
Signed-off-by: kingbri <bdashore3@proton.me>
FastAPI is kinda weird with queueing. If an await is used within an
async def, requests aren't executed sequentially. Get the sequential
requests back by using a semaphore to limit concurrent execution from
generator functions.
Also scaffold the framework to move generator functions to their own
file.
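
A small sketch of the semaphore approach using plain asyncio. The
coroutine names are hypothetical and asyncio.sleep stands in for the
real generation work:

```python
import asyncio


async def generate(name, semaphore, log):
    # Only one coroutine may hold the semaphore, so the awaits inside
    # cannot interleave with another request's generation.
    async with semaphore:
        log.append(f"{name} start")
        await asyncio.sleep(0.01)  # stand-in for generation work
        log.append(f"{name} end")


async def main():
    semaphore = asyncio.Semaphore(1)
    log = []
    await asyncio.gather(
        *(generate(f"req{i}", semaphore, log) for i in range(3))
    )
    return log
```

Without the semaphore, all three requests would log "start" before any
of them logs "end"; with it, each request finishes before the next one
begins.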
Signed-off-by: kingbri <bdashore3@proton.me>