Add support for supplying EBNF grammar strings, which are passed to
the Outlines library for parsing.
From there, a wrapper is created and a filter is passed to generation.
This should eventually be replaced with a more flexible in-house solution.
Signed-off-by: kingbri <bdashore3@proton.me>
Add the ability to constrain the return value of a model to be JSON.
Built using the JSON schema standard to define the properties of what
the model should return.
This feature should be more accurate than using GBNF/EBNF to yield
the same results due to the use of lmformatenforcer.
GBNF/EBNF will be added in a different commit/branch.
Signed-off-by: kingbri <bdashore3@proton.me>
The check was accidentally run against the token bias tensor, which
doesn't contain the token IDs. Check if the index exists in the
id_to_piece list instead.
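
A minimal sketch of the corrected check, assuming the bias arrives as a
mapping of token ID to bias value. `filter_token_bias` is a hypothetical
helper name used only for illustration, not the project's actual function:

```python
def filter_token_bias(token_bias: dict, id_to_piece: list) -> dict:
    """Keep only bias entries whose token ID is a valid index into id_to_piece."""
    return {
        token_id: bias
        for token_id, bias in token_bias.items()
        # Validate against the vocabulary list, not the bias tensor
        if 0 <= token_id < len(id_to_piece)
    }
```

With a three-entry vocabulary, IDs 5 and -1 would be dropped and ID 0 kept.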
Signed-off-by: kingbri <bdashore3@proton.me>
Automatically unload the existing model when calling /load. This was
requested many times, and does make more sense in the long run.
Signed-off-by: kingbri <bdashore3@proton.me>
For safety reasons, always use auto unless a manual split is provided
and auto is forced off.
If auto is forced off and a manual split isn't provided, a manual
split will be attempted.
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, pre-sampling logprobs were used from the raw logits,
but newer versions of exl2 allow for returning token probs post-sampling.
Convert these to logprobs and send to the user.
Signed-off-by: kingbri <bdashore3@proton.me>
This option saves some VRAM, but does have the chance to error out.
Add this in the experimental config section.
Signed-off-by: kingbri <bdashore3@proton.me>
Injecting into Pydantic fields caused issues with serialization for
documentation rendering. Rather than reinvent the wheel again,
switch to a chain of if statements for now. This may change in the
future if subclasses from the base sampler request need to be
validated as well.
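
A rough sketch of what a chain-of-if-statements validator can look like.
The function name, parameter names, defaults, and ranges below are
assumptions for illustration and are not taken from the actual sampler
request class:

```python
def validate_sampler_params(params: dict) -> list:
    """Return a list of problems found; an empty list means the params are valid."""
    errors = []
    temperature = params.get("temperature", 1.0)
    top_p = params.get("top_p", 1.0)
    top_k = params.get("top_k", 0)

    # Each check is independent, so every problem is reported at once
    if temperature < 0:
        errors.append("temperature must be >= 0")
    if not 0 <= top_p <= 1:
        errors.append("top_p must be between 0 and 1")
    if top_k < 0:
        errors.append("top_k must be >= 0")

    return errors
```

Keeping the checks in a plain function leaves the Pydantic field
definitions untouched, so schema serialization for the docs is unaffected.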
Signed-off-by: kingbri <bdashore3@proton.me>
Rather than maintaining yet another function to validate sampler
ranges/values, embed them in fields, which allows for less
maintenance in the future.
Also add validation for existing samplers that can corrupt
the sampling stack if set improperly.
Signed-off-by: kingbri <bdashore3@proton.me>
Creates templates for issues to help guide users in the right direction when making a bug report or request.
Signed-off-by: kingbri <bdashore3@proton.me>
The GPU reserve is used as a VRAM buffer to prevent GPU overflow
when automatically deciding how to load a model on multiple GPUs.
Make this configurable.
Signed-off-by: kingbri <bdashore3@proton.me>
Take the log of the token probs directly since they're already
normalized, which yields the proper value. Also, don't error out if a
token prob doesn't exist in the dict; return None from zip instead.
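
A minimal sketch of the conversion, assuming the probabilities are held
in a dict keyed by token string. `probs_to_logprobs` is a hypothetical
name, and the guard against a zero probability is an added assumption:

```python
import math


def probs_to_logprobs(tokens: list, token_probs: dict) -> list:
    """Pair each token with log(prob), or None when no prob is available."""
    pairs = []
    for token in tokens:
        prob = token_probs.get(token)  # missing tokens yield None, not KeyError
        if prob is None or prob <= 0:
            pairs.append((token, None))
        else:
            # Probs are already normalized, so a plain log is sufficient
            pairs.append((token, math.log(prob)))
    return pairs
```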
Signed-off-by: kingbri <bdashore3@proton.me>
Adds chat completion logprob support using OAI's spec. Tokens are
not converted to tiktoken here since that will add an extra dependency
for no real reason.
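
For reference, a sketch of the response shape, assuming OAI's
`logprobs.content` layout of per-token entries with `token`, `logprob`,
and `top_logprobs`. `build_chat_logprobs` is a hypothetical helper, the
spec's optional `bytes` field is omitted, and tokens are plain decoded
strings:

```python
def build_chat_logprobs(tokens: list, logprobs: list, top_logprobs: list) -> dict:
    """Assemble a chat-completion style logprobs object from parallel lists."""
    content = []
    for token, logprob, alternatives in zip(tokens, logprobs, top_logprobs):
        content.append(
            {
                "token": token,
                "logprob": logprob,
                # alternatives maps candidate token -> logprob
                "top_logprobs": [
                    {"token": alt, "logprob": alt_logprob}
                    for alt, alt_logprob in alternatives.items()
                ],
            }
        )
    return {"content": content}
```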
Signed-off-by: kingbri <bdashore3@proton.me>
Returns token offsets, selected tokens, probabilities of tokens
post-sampling, and normalized probability of selecting a token
pre-sampling (for efficiency purposes).
Only for text completions. Chat completions in a later commit.
Signed-off-by: kingbri <bdashore3@proton.me>
Split the get tokens function into separate wrapper encode and decode
functions for overall code cleanliness.
Signed-off-by: kingbri <bdashore3@proton.me>
Many APIs automatically ask for request streaming without giving
the user the option to turn it off. Therefore, give the user more
freedom by giving a server-side kill switch.
Signed-off-by: kingbri <bdashore3@proton.me>
It makes more sense to use GPU split parameters only when the user has
more than one GPU. Otherwise, set split and split_auto to False and
save the user some VRAM.
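
The rule can be sketched as follows. `resolve_gpu_split` and its
arguments are hypothetical names, and how the GPU count is obtained is
left out:

```python
def resolve_gpu_split(gpu_count: int, gpu_split_auto: bool, gpu_split):
    """Return (split_auto, split), honoring split settings only on multi-GPU setups."""
    if gpu_count > 1:
        return gpu_split_auto, gpu_split

    # Single GPU: splitting is pointless, so disable both options
    return False, False
```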
Signed-off-by: kingbri <bdashore3@proton.me>
The model loader was using more VRAM on a single GPU compared to
base exllamav2's loader. This was because single GPUs were running
with the autosplit config, which allocates an extra VRAM buffer
for safe loading. Turn this off for single-GPU setups (and turn
it off by default).
This change should allow users to run models which require the
entire card, hopefully with faster T/s. For example, Mixtral at
3.75bpw increased from ~30T/s to 50T/s on Windows due to the extra
VRAM headroom.
Signed-off-by: kingbri <bdashore3@proton.me>
Now that exllamav2 is required to be the latest, don't add attribute
checks unless the feature is not in the release build.
Signed-off-by: kingbri <bdashore3@proton.me>
Add the ability to use an unsafe config flag if needed and migrate
the exl2 check to a different file within the exl2 backend code.
Signed-off-by: kingbri <bdashore3@proton.me>
Clean up how overrides are handled, class naming, and adopt exllamav2's
model class to enforce latest stable version methods rather than
adding multiple backwards compatibility checks.
Signed-off-by: kingbri <bdashore3@proton.me>
Exllamav2 is currently supported on all GPUs and versions. Therefore,
it should be expected that users use the latest version of exllamav2 to
get the latest features.
Doing this helps reduce checks that don't really serve any purpose.
Signed-off-by: kingbri <bdashore3@proton.me>
Does not work if max_temp is less than or equal to min_temp. Sampler
validation will have to be refactored in the future, so the dynamic
temperature check will also be changed.
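
The condition described above amounts to a simple range check. A sketch,
with `dynatemp_is_usable` as a hypothetical name:

```python
def dynatemp_is_usable(min_temp: float, max_temp: float) -> bool:
    """Dynamic temperature needs a non-empty range: max_temp must exceed min_temp."""
    return max_temp > min_temp
```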
Signed-off-by: kingbri <bdashore3@proton.me>