The GPU reserve is used as a VRAM buffer to prevent GPU overflow
when automatically deciding how to load a model on multiple GPUs.
Make this configurable.
Signed-off-by: kingbri <bdashore3@proton.me>
Take a log of the token probs since they're already normalized which
reflects the proper value. Also, don't error out if a token prob
doesn't exist in the dict and return None instead from zip.
Signed-off-by: kingbri <bdashore3@proton.me>
Adds chat completion logprob support using OAI's spec. Tokens are
not converted to tiktoken here since that will add an extra dependency
for no real reason.
Signed-off-by: kingbri <bdashore3@proton.me>
Returns token offsets, selected tokens, probabilities of tokens
post-sampling, and normalized probability of selecting a token
pre-sampling (for efficiency purposes).
Only for text completions. Chat completions in a later commit.
Signed-off-by: kingbri <bdashore3@proton.me>
Split the get tokens function into separate wrapper encode and decode
functions for overall code cleanliness.
Signed-off-by: kingbri <bdashore3@proton.me>
It makes more sense to use gpu split parameters when the user has
>1 GPUs. Otherwise, set split and split_auto to False and save
the user some VRAM.
Signed-off-by: kingbri <bdashore3@proton.me>
The model loader was using more VRAM on a single GPU compared to
base exllamav2's loader. This was because single GPUs were running
using the autosplit config which allocates an extra vram buffer
for safe loading. Turn this off for single-GPU setups (and turn
it off by default).
This change should allow users to run models which require the
entire card with hopefully faster T/s. For example, Mixtral with
3.75bpw increased from ~30T/s to 50T/s due to the extra vram headroom
on Windows.
Signed-off-by: kingbri <bdashore3@proton.me>
Now that exllamav2 is required to be the latest, don't add attribute
checks unless the feature is not in the release build.
Signed-off-by: kingbri <bdashore3@proton.me>
Add the ability to use an unsafe config flag if needed and migrate
the exl2 check to a different file within the exl2 backend code.
Signed-off-by: kingbri <bdashore3@proton.me>
Cleanup how overrides are handled, class naming, and adopt exllamav2's
model class to enforce latest stable version methods rather than
adding multiple backwards compatability checks.
Signed-off-by: kingbri <bdashore3@proton.me>
Does not work if max_temp is less than or equal to min_temp. Sampler
validation will have to be refactored in the future, so the dynamic
temperature check will also be changed.
Signed-off-by: kingbri <bdashore3@proton.me>
The example JSON fields were changed because of the new sampler
default strategy. Fix these by manually changing the values.
Also add support for fasttensors and expose generate_window to
the API. It's recommended to not adjust generate_window as it's
dynamically scaled based on max_seq_len by default.
Signed-off-by: kingbri <bdashore3@proton.me>
The previous commit iterated through multiple try conditions which
made it so the user has to provide a dummy prompt template. Now,
template loading is fallback based.
Run through a loop of functions and return if one of them succeeds.
Signed-off-by: kingbri <bdashore3@proton.me>
Allows for adjustment of reservation space at the end of the context
before rolling it. This should be scaled as a model's max_seq_len
goes up.
Signed-off-by: kingbri <bdashore3@proton.me>
Move common functions into their own folder and refactor the backends
to use their own folder as well.
Also cleanup imports and alphabetize import statments themselves.
Finally, move colab and docker into their own folders as well.
Signed-off-by: kingbri <bdashore3@proton.me>