Returns token offsets, selected tokens, probabilities of tokens
post-sampling, and normalized probability of selecting a token
pre-sampling (for efficiency purposes).
Only for text completions. Chat completions in a later commit.
Signed-off-by: kingbri <bdashore3@proton.me>
It makes more sense to use gpu split parameters when the user has
>1 GPUs. Otherwise, set split and split_auto to False and save
the user some VRAM.
Signed-off-by: kingbri <bdashore3@proton.me>
The model loader was using more VRAM on a single GPU compared to
base exllamav2's loader. This was because single GPUs were running
using the autosplit config which allocates an extra vram buffer
for safe loading. Turn this off for single-GPU setups (and turn
it off by default).
This change should allow users to run models which require the
entire card with hopefully faster T/s. For example, Mixtral with
3.75bpw increased from ~30T/s to 50T/s due to the extra vram headroom
on Windows.
Signed-off-by: kingbri <bdashore3@proton.me>
Cleanup how overrides are handled, class naming, and adopt exllamav2's
model class to enforce latest stable version methods rather than
adding multiple backwards compatability checks.
Signed-off-by: kingbri <bdashore3@proton.me>
Allow users to switch the currently overriden samplers via the API
so a restart isn't required to switch the overrides.
Signed-off-by: kingbri <bdashore3@proton.me>
Unify API sampler params into a superclass which should make them
easier to manage and inherit generic functions from.
Not all frontends expose all sampling parameters due to connections
with OAI (that handles sampling themselves with the exception of
a few sliders).
Add the ability for the user to customize fallback parameters from
server-side.
In addition, parameters can be forced to a certain value server-side
in case the repo automatically sets other sampler values in the
background that the user doesn't want.
Signed-off-by: kingbri <bdashore3@proton.me>
Move common functions into their own folder and refactor the backends
to use their own folder as well.
Also cleanup imports and alphabetize import statments themselves.
Finally, move colab and docker into their own folders as well.
Signed-off-by: kingbri <bdashore3@proton.me>
Helps with understanding API aliases. These aliases should not be
used but are helpful for developers who want frontend compat.
Signed-off-by: kingbri <bdashore3@proton.me>
CFG, or classifier-free guidance helps push a model in different
directions based on what the user provides.
Currently, CFG is ignored if the negative prompt is blank (it shouldn't
be used in that way anyways).
Signed-off-by: kingbri <bdashore3@proton.me>
All penalties can have a sustain (range) applied to them in exl2,
so clarify the parameter.
However, the default behaviors change based on if freq OR pres pen
is enabled. For the sanity of OAI users, have freq and pres pen only
apply on the output tokens when range is -1 (default).
But, repetition penalty still functions the same way where -1 means
the range is the max seq len.
Doing this prevents gibberish output when using the more modern freq
and presence penalties similar to llamacpp.
NOTE: This logic is still subject to change in the future, but I believe
it hits the happy medium for users who want defaults and users who want
to tinker around with the sampling knobs.
Signed-off-by: kingbri <bdashore3@proton.me>
Some models (such as mistral and mixtral) set their base sequence
length to 32k due to assumptions of support for sliding window
attention.
Therefore, add this parameter to override the base sequence length
of a model which helps with auto-calculation of rope alpha.
If auto-calculation of rope alpha isn't being used, the max_seq_len
parameter works fine as is.
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, the max sequence length was overriden by the user's
config and never took the model's config.json into account.
Now, set the default to 4096, but include config.prepare when
selecting the max sequence length. The yaml and API request
now serve as overrides rather than parameters.
Signed-off-by: kingbri <bdashore3@proton.me>
Use exllamav2's token bias which is the functional equivalent of
OAI's logit bias parameter.
Strings are casted to integers on request and errors if an invalid
value is passed.
Signed-off-by: kingbri <bdashore3@proton.me>
Append generation prompts if given the flag on an OAI chat completion
request.
This appends the "assistant" message to the instruct prompt. Defaults
to true since this is intended behavior.
Signed-off-by: kingbri <bdashore3@proton.me>
Validation wasn't properly run on older pydantic, so ChatCompletionRespChoice
was being sent instead of a ChatCompletionMessage when streaming
responses.
Signed-off-by: kingbri <bdashore3@proton.me>
Adding field descriptions show which parameters are used solely for
OAI compliance and not actually parsed in the model code.
Signed-off-by: kingbri <bdashore3@proton.me>
Jinja2 is a lightweight template parser that's used in Transformers
for parsing chat completions. It's much more efficient than Fastchat
and can be imported as part of requirements.
Also allows for unblocking Pydantic's version.
Users now have to provide their own template if needed. A separate
repo may be usable for common prompt template storage.
Signed-off-by: kingbri <bdashore3@proton.me>
New parameter that's safe to edit in exllamav2 v0.0.11. Only recommended
for people who know what they're doing.
Signed-off-by: kingbri <bdashore3@proton.me>
Rope alpha changes don't require removing the 1.0 default
from Rope scale.
Keep defaults when possible to avoid errors.
Signed-off-by: kingbri <bdashore3@proton.me>
Mistakenly forgot that the user can choose what cache mode to use
when loading a model.
Also add when fetching model info.
Signed-off-by: kingbri <bdashore3@proton.me>
Generations can be logged in the console along with sampling parameters
if the user enables it in config.
Metrics are always logged at the end of each prompt. In addition,
the model endpoint tells the user if they're being logged or not
for transparancy purposes.
Signed-off-by: kingbri <bdashore3@proton.me>
Sometimes fastchat may not be able to detect the prompt template from
the model path. Therefore, add the ability to set it in config.yml or
via the request object itself.
Also send the provided prompt template on model info request.
Signed-off-by: kingbri <bdashore3@proton.me>
Python doesn't have proper handling of optionals. The only way to
handle them is checking via an if statement if the value is None or
by using the "or" keyword to unwrap optionals.
Previously, I used the "or" method to unwrap, but this caused issues
due to falsy values falling back to the default. This is especially
the case with booleans were "False" changed to "True".
Instead, add two new functions: unwrap and coalesce. Both function
to properly implement a functional way of "None" coalescing.
Signed-off-by: kingbri <bdashore3@proton.me>
* Model: Implement basic lora support
* Add ability to load loras from config on launch
* Supports loading multiple loras and lora scaling
* Add function to unload loras
* Colab: Update for basic lora support
* Model: Test vram alloc after lora load, add docs
* Git: Add loras folder to .gitignore
* API: Add basic lora-related endpoints
* Add /loras/ endpoint for querying available loras
* Add /model/lora endpoint for querying currently loaded loras
* Add /model/lora/load endpoint for loading loras
* Add /model/lora/unload endpoint for unloading loras
* Move lora config-checking logic to main.py for better compat with API endpoints
* Revert bad CRLF line ending changes
* API: Add basic lora-related endpoints (fixed)
* Add /loras/ endpoint for querying available loras
* Add /model/lora endpoint for querying currently loaded loras
* Add /model/lora/load endpoint for loading loras
* Add /model/lora/unload endpoint for unloading loras
* Move lora config-checking logic to main.py for better compat with API endpoints
* Model: Unload loras first when unloading model
* API + Models: Cleanup lora endpoints and functions
Condenses down endpoint and model load code. Also makes the routes
behave the same way as model routes to help not confuse the end user.
Signed-off-by: kingbri <bdashore3@proton.me>
* Loras: Optimize load endpoint
Return successes and failures along with consolidating the request
to the rewritten load_loras function.
Signed-off-by: kingbri <bdashore3@proton.me>
---------
Co-authored-by: kingbri <bdashore3@proton.me>
Co-authored-by: DocShotgun <126566557+DocShotgun@users.noreply.github.com>
Draft wasn't being parsed correctly with the new changes which removed
the draft_enabled bool. There's still some more work to be done with
returning exceptions.
Signed-off-by: kingbri <bdashore3@proton.me>
Lots of bugs were unearthed when switching to the new fallback changes.
Fix them and make sure samplers are being set properly.
Signed-off-by: kingbri <bdashore3@proton.me>
Chat completions require a finish reason to be provided in the OAI
spec once the streaming is completed. This is different from a non-
streaming chat completion response.
Also fix some errors that were raised from the endpoint.
References #15
Signed-off-by: kingbri <bdashore3@proton.me>