Commit Graph

77 Commits

Author SHA1 Message Date
kingbri
b0c295dd2f API: Add more methods to semaphore
The semaphore/queue model for Tabby is as follows:
- Any load requests go through the semaphore by default
- Any load request can include the skip_queue parameter to bypass
the semaphore
- Any unload requests are immediately executed
- All completion requests are placed inside the semaphore by default

This model preserves the parallelism of single-user mode with extra
convenience methods for queues in multi-user. It also helps mitigate
problems that were previously present in the concurrency stack.

Also change how the program's loop runs so it exits when the API thread
dies.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-04 23:21:40 -05:00
kingbri
f627485534 OAI: Fix completion token fetching
The generator returns generated_tokens in the dict.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-11 01:12:13 -05:00
kingbri
2f568ff573 Config: Expose auto GPU split reserve config
The GPU reserve is used as a VRAM buffer to prevent GPU overflow
when automatically deciding how to load a model on multiple GPUs.
Make this configurable.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 22:09:50 -05:00
kingbri
c7428f0bcd API: Add logprobs for chat completions
Adds chat completion logprob support using OAI's spec. Tokens are
not converted to tiktoken here since that will add an extra dependency
for no real reason.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 21:26:53 -05:00
kingbri
c02fe4d1db API: Fix response creation
Change chat completion and text completion responses to be more
flexible.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 21:26:53 -05:00
kingbri
0af6a38af3 Model: Add logprobs support
Returns token offsets, selected tokens, probabilities of tokens
post-sampling, and normalized probability of selecting a token
pre-sampling (for efficiency purposes).

Only for text completions. Chat completions in a later commit.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 21:26:53 -05:00
kingbri
2642ef7156 OAI: Update logprobs type
Some logprobs cannot exist, so make the type optional

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 21:26:53 -05:00
kingbri
c0ad647fa7 Model: Auto-detect a one GPU setup and fix gpu_split_auto
It makes more sense to use gpu split parameters when the user has
>1 GPUs. Otherwise, set split and split_auto to False and save
the user some VRAM.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-06 23:08:57 -05:00
kingbri
849179df17 Model: Make loading use less VRAM
The model loader was using more VRAM on a single GPU compared to
base exllamav2's loader. This was because single GPUs were running
using the autosplit config which allocates an extra vram buffer
for safe loading. Turn this off for single-GPU setups (and turn
it off by default).

This change should allow users to run models which require the
entire card with hopefully faster T/s. For example, Mixtral with
3.75bpw increased from ~30T/s to 50T/s due to the extra vram headroom
on Windows.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-06 22:29:56 -05:00
kingbri
b827bcbb44 Sampling: Cleanup and update
Cleanup how overrides are handled, class naming, and adopt exllamav2's
model class to enforce latest stable version methods rather than
adding multiple backwards compatability checks.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-02 23:36:17 -05:00
kingbri
d3781920b3 OAI: Split up utility functions
Just like types, put utility functions in their own separate module
based on the route.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-02 23:36:17 -05:00
kingbri
751627e571 OAI: Add fasttensors to model load endpoint
Also fix logging when loading prompt templates.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-25 01:08:02 -05:00
kingbri
b14c5443fd API: Add sampler override switching
Allow users to switch the currently overriden samplers via the API
so a restart isn't required to switch the overrides.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-25 00:15:40 -05:00
kingbri
de0ba7214c API: Add template switching and unload endpoints
Templates can be switched and unloaded without reloading the entire
model.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-25 00:15:40 -05:00
kingbri
6c30f24c83 Tree: Unify sampler parameters and add override support
Unify API sampler params into a superclass which should make them
easier to manage and inherit generic functions from.

Not all frontends expose all sampling parameters due to connections
with OAI (that handles sampling themselves with the exception of
a few sliders).

Add the ability for the user to customize fallback parameters from
server-side.

In addition, parameters can be forced to a certain value server-side
in case the repo automatically sets other sampler values in the
background that the user doesn't want.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-25 00:15:40 -05:00
kingbri
78f920eeda Tree: Refactor code organization
Move common functions into their own folder and refactor the backends
to use their own folder as well.

Also cleanup imports and alphabetize import statments themselves.

Finally, move colab and docker into their own folders as well.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-25 00:15:40 -05:00
kingbri
7a29664f06 API: Add alias names to field descriptions
Helps with understanding API aliases. These aliases should not be
used but are helpful for developers who want frontend compat.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-08 23:00:33 -05:00
kingbri
81b504e8c5 OAI: Fix typical alias
AliasChoices takes strings, not an array.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-05 16:38:39 -05:00
kingbri
2c57dafc59 OAI: Add alias for typical sampling
Typical can also be called typical_p

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-05 15:29:53 -05:00
kingbri
d4ed9f703d Tree: Format
Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-04 21:13:30 -05:00
kingbri
cd4bf99598 OAI: Fix autodoc examples for model loading
Some values weren't defaulting to correct values.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-04 20:53:56 -05:00
kingbri
6b04463051 API: Fix CFG reporting
THe model endpoint wasn't reporting if CFG is on.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-02 13:54:16 -05:00
kingbri
b378773d0a Model: Add CFG support
CFG, or classifier-free guidance helps push a model in different
directions based on what the user provides.

Currently, CFG is ignored if the negative prompt is blank (it shouldn't
be used in that way anyways).

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-02 01:46:51 -05:00
kingbri
79a57588d5 API: Add template list endpoint
Fetches all template names that a user has in the templates directory
for chat completions.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-29 22:58:55 -05:00
kingbri
dce8c74edc API: Add clarification and cleanup autodocs
It's possible to override parts of the example JSON to give proper
examples of values.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-29 10:28:06 -05:00
kingbri
5dc2df68be Model: Repetition penalty range -> penalty range
All penalties can have a sustain (range) applied to them in exl2,
so clarify the parameter.

However, the default behaviors change based on if freq OR pres pen
is enabled. For the sanity of OAI users, have freq and pres pen only
apply on the output tokens when range is -1 (default).

But, repetition penalty still functions the same way where -1 means
the range is the max seq len.

Doing this prevents gibberish output when using the more modern freq
and presence penalties similar to llamacpp.

NOTE: This logic is still subject to change in the future, but I believe
it hits the happy medium for users who want defaults and users who want
to tinker around with the sampling knobs.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-28 18:16:10 -05:00
kingbri
e92ef8f5c7 OAI: Fix rep pen range alias
No need to unwrap because the Pydantic alias does that for us.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-25 15:37:11 -05:00
kingbri
e256ff8182 Samplers: Add frequency and presence penalty
Un-alias repetition penalty from the frequency penalty parameter.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-25 15:27:32 -05:00
kingbri
3461f8294f Logging: Clarify preferences
Preferences are preferences, not a config.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-23 21:08:10 -05:00
kingbri
80ef379721 Sampling: Add top-a support
Currently in exllamav2 dev, but will be in the next release.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-22 23:50:24 -05:00
AlpinDale
fa47f51f85 feat: workflows for formatting/linting (#35)
* add github workflows for pylint and yapf

* yapf

* docstrings for auth

* fix auth.py

* fix generators.py

* fix gen_logging.py

* fix main.py

* fix model.py

* fix templating.py

* fix utils.py

* update formatting.sh to include subdirs for pylint

* fix model_test.py

* fix wheel_test.py

* rename utils to utils_oai

* fix OAI/utils_oai.py

* fix completion.py

* fix token.py

* fix lora.py

* fix common.py

* add pylintrc and fix model.py

* finish up pylint

* fix attribute error

* main.py formatting

* add formatting batch script

* Main: Remove unnecessary global

Linter suggestion.

Signed-off-by: kingbri <bdashore3@proton.me>

* switch to ruff

* Formatting + Linting: Add ruff.toml

Signed-off-by: kingbri <bdashore3@proton.me>

* Formatting + Linting: Switch scripts to use ruff

Also remove the file and recent file change functions from both
scripts.

Signed-off-by: kingbri <bdashore3@proton.me>

* Tree: Format and lint

Signed-off-by: kingbri <bdashore3@proton.me>

* Scripts + Workflows: Format

Signed-off-by: kingbri <bdashore3@proton.me>

* Tree: Remove pylint flags

We use ruff now

Signed-off-by: kingbri <bdashore3@proton.me>

* Tree: Format

Signed-off-by: kingbri <bdashore3@proton.me>

* Formatting: Line length is 88

Use the same value as Black.

Signed-off-by: kingbri <bdashore3@proton.me>

* Tree: Format

Update to new line length rules.

Signed-off-by: kingbri <bdashore3@proton.me>

---------

Authored-by: AlpinDale <52078762+AlpinDale@users.noreply.github.com>
Co-authored-by: kingbri <bdashore3@proton.me>
2023-12-22 16:20:35 +00:00
kingbri
ab10b263fd Model: Add override base seq len
Some models (such as mistral and mixtral) set their base sequence
length to 32k due to assumptions of support for sliding window
attention.

Therefore, add this parameter to override the base sequence length
of a model which helps with auto-calculation of rope alpha.

If auto-calculation of rope alpha isn't being used, the max_seq_len
parameter works fine as is.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-20 00:45:39 -05:00
kingbri
c9e43e51aa API: Add route for draft model list
Does the same thing as model list except with draft models.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-19 23:45:53 -05:00
kingbri
ce2602df9a Model: Fix max seq len handling
Previously, the max sequence length was overriden by the user's
config and never took the model's config.json into account.

Now, set the default to 4096, but include config.prepare when
selecting the max sequence length. The yaml and API request
now serve as overrides rather than parameters.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-19 23:37:52 -05:00
kingbri
c3f7898967 OAI: Add logit bias support
Use exllamav2's token bias which is the functional equivalent of
OAI's logit bias parameter.

Strings are casted to integers on request and errors if an invalid
value is passed.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-18 23:53:47 -05:00
kingbri
bc21f0bbc0 OAI: Add field aliasing
Repetition penalty range needs field aliases to support multiple
parameter calls.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-18 23:53:47 -05:00
kingbri
de9a19b5d3 Templating: Add generation prompt appending
Append generation prompts if given the flag on an OAI chat completion
request.

This appends the "assistant" message to the instruct prompt. Defaults
to true since this is intended behavior.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-18 23:53:47 -05:00
kingbri
a87e474660 OAI: Fix chat completion validation
Validation wasn't properly run on older pydantic, so ChatCompletionRespChoice
was being sent instead of a ChatCompletionMessage when streaming
responses.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-18 23:53:47 -05:00
kingbri
e895eaa4bd OAI: Clarify types in docs
Adding field descriptions show which parameters are used solely for
OAI compliance and not actually parsed in the model code.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-18 23:53:47 -05:00
kingbri
51ca1ff396 Tree: Switch to Pydantic 2
Pydantic 2 has more modern methods and stability compared to Pydantic 1

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-18 23:53:47 -05:00
kingbri
f631dd6ff7 Templates: Switch to Jinja2
Jinja2 is a lightweight template parser that's used in Transformers
for parsing chat completions. It's much more efficient than Fastchat
and can be imported as part of requirements.

Also allows for unblocking Pydantic's version.

Users now have to provide their own template if needed. A separate
repo may be usable for common prompt template storage.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-18 23:53:47 -05:00
kingbri
ad8807a830 Model: Add support for num_experts_by_token
New parameter that's safe to edit in exllamav2 v0.0.11. Only recommended
for people who know what they're doing.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-17 18:03:01 -05:00
kingbri
70fbee3edd OAI: Fix model parameter placement
Accidentally edited the Model Card parameters vs the model load request
ones.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-17 14:36:28 -05:00
kingbri
1d0bdfa77c Model + OAI: Fix parameter parsing
Rope alpha changes don't require removing the 1.0 default
from Rope scale.

Keep defaults when possible to avoid errors.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-17 14:28:18 -05:00
Veden
3e57125025 OAI: adding optional draft model properties for draft_rope alpha and scale (#28)
* OAI: adding optional draft model properties for draft_rope alpha and scale

* Forgot to set the properties to None
2023-12-17 19:23:45 +00:00
kingbri
1a331afe3a OAI: Add cache_mode parameter to model
Mistakenly forgot that the user can choose what cache mode to use
when loading a model.

Also add when fetching model info.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-16 02:47:50 -05:00
kingbri
ed868fd262 OAI: Remove unused parameters
Seed and low_mem aren't used, so comment them out.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-15 14:56:43 -05:00
kingbri
083df7d585 Tree: Add generation logging support
Generations can be logged in the console along with sampling parameters
if the user enables it in config.

Metrics are always logged at the end of each prompt. In addition,
the model endpoint tells the user if they're being logged or not
for transparancy purposes.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-12 23:43:35 -05:00
kingbri
db87efde4a OAI: Add ability to specify fastchat prompt template
Sometimes fastchat may not be able to detect the prompt template from
the model path. Therefore, add the ability to set it in config.yml or
via the request object itself.

Also send the provided prompt template on model info request.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-10 15:43:58 -05:00
kingbri
fd9f3eac87 Model: Add params to current model endpoint
Grabs the current model rope params, max seq len, and the draft model
if applicable.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-10 00:40:56 -05:00