Non-streaming tasks were not regulated by the semaphore, causing these
tasks to interfere with streaming generations. Add helper functions
to take in both sync and async functions for callbacks and sequential
blocking with the semaphore.
Signed-off-by: kingbri <bdashore3@proton.me>
When stream is false, the generation can be empty, which means
that there's no chunks present in the final generation array, causing
an error.
Instead, return a dummy value if generation is falsy (empty array
or None)
Signed-off-by: kingbri <bdashore3@proton.me>
Some models (such as mistral and mixtral) set their base sequence
length to 32k due to assumptions of support for sliding window
attention.
Therefore, add this parameter to override the base sequence length
of a model which helps with auto-calculation of rope alpha.
If auto-calculation of rope alpha isn't being used, the max_seq_len
parameter works fine as is.
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, the max sequence length was overriden by the user's
config and never took the model's config.json into account.
Now, set the default to 4096, but include config.prepare when
selecting the max sequence length. The yaml and API request
now serve as overrides rather than parameters.
Signed-off-by: kingbri <bdashore3@proton.me>
Lets the user know if a file not found (OSError) occurs and prints
the applied template on model load.
Also fix some remaining references to fastchat.
Signed-off-by: kingbri <bdashore3@proton.me>
Use exllamav2's token bias which is the functional equivalent of
OAI's logit bias parameter.
Strings are casted to integers on request and errors if an invalid
value is passed.
Signed-off-by: kingbri <bdashore3@proton.me>
Append generation prompts if given the flag on an OAI chat completion
request.
This appends the "assistant" message to the instruct prompt. Defaults
to true since this is intended behavior.
Signed-off-by: kingbri <bdashore3@proton.me>
OSError means that a file wasn't found, which means auth tokens should
be rengenerated. Otherwise, fire the error and exit.
Signed-off-by: kingbri <bdashore3@proton.me>
Validation wasn't properly run on older pydantic, so ChatCompletionRespChoice
was being sent instead of a ChatCompletionMessage when streaming
responses.
Signed-off-by: kingbri <bdashore3@proton.me>
Adding field descriptions show which parameters are used solely for
OAI compliance and not actually parsed in the model code.
Signed-off-by: kingbri <bdashore3@proton.me>
Jinja2 is a lightweight template parser that's used in Transformers
for parsing chat completions. It's much more efficient than Fastchat
and can be imported as part of requirements.
Also allows for unblocking Pydantic's version.
Users now have to provide their own template if needed. A separate
repo may be usable for common prompt template storage.
Signed-off-by: kingbri <bdashore3@proton.me>
New parameter that's safe to edit in exllamav2 v0.0.11. Only recommended
for people who know what they're doing.
Signed-off-by: kingbri <bdashore3@proton.me>
Rope alpha changes don't require removing the 1.0 default
from Rope scale.
Keep defaults when possible to avoid errors.
Signed-off-by: kingbri <bdashore3@proton.me>
Mistakenly forgot that the user can choose what cache mode to use
when loading a model.
Also add when fetching model info.
Signed-off-by: kingbri <bdashore3@proton.me>
Also change how wheel_test works for safe import testing rather than
trying to import the package which can cause system issues.
Signed-off-by: kingbri <bdashore3@proton.me>
Generations can be logged in the console along with sampling parameters
if the user enables it in config.
Metrics are always logged at the end of each prompt. In addition,
the model endpoint tells the user if they're being logged or not
for transparancy purposes.
Signed-off-by: kingbri <bdashore3@proton.me>
Sometimes fastchat may not be able to detect the prompt template from
the model path. Therefore, add the ability to set it in config.yml or
via the request object itself.
Also send the provided prompt template on model info request.
Signed-off-by: kingbri <bdashore3@proton.me>
Python doesn't have proper handling of optionals. The only way to
handle them is checking via an if statement if the value is None or
by using the "or" keyword to unwrap optionals.
Previously, I used the "or" method to unwrap, but this caused issues
due to falsy values falling back to the default. This is especially
the case with booleans were "False" changed to "True".
Instead, add two new functions: unwrap and coalesce. Both function
to properly implement a functional way of "None" coalescing.
Signed-off-by: kingbri <bdashore3@proton.me>
* Model: Implement basic lora support
* Add ability to load loras from config on launch
* Supports loading multiple loras and lora scaling
* Add function to unload loras
* Colab: Update for basic lora support
* Model: Test vram alloc after lora load, add docs
* Git: Add loras folder to .gitignore
* API: Add basic lora-related endpoints
* Add /loras/ endpoint for querying available loras
* Add /model/lora endpoint for querying currently loaded loras
* Add /model/lora/load endpoint for loading loras
* Add /model/lora/unload endpoint for unloading loras
* Move lora config-checking logic to main.py for better compat with API endpoints
* Revert bad CRLF line ending changes
* API: Add basic lora-related endpoints (fixed)
* Add /loras/ endpoint for querying available loras
* Add /model/lora endpoint for querying currently loaded loras
* Add /model/lora/load endpoint for loading loras
* Add /model/lora/unload endpoint for unloading loras
* Move lora config-checking logic to main.py for better compat with API endpoints
* Model: Unload loras first when unloading model
* API + Models: Cleanup lora endpoints and functions
Condenses down endpoint and model load code. Also makes the routes
behave the same way as model routes to help not confuse the end user.
Signed-off-by: kingbri <bdashore3@proton.me>
* Loras: Optimize load endpoint
Return successes and failures along with consolidating the request
to the rewritten load_loras function.
Signed-off-by: kingbri <bdashore3@proton.me>
---------
Co-authored-by: kingbri <bdashore3@proton.me>
Co-authored-by: DocShotgun <126566557+DocShotgun@users.noreply.github.com>