Commit Graph

118 Commits

Author SHA1 Message Date
kingbri
2a33ebbf29 Model: Bypass lock checks when shutting down
Previously, when a SIGINT was emitted while a model load was running,
the API didn't shut down until the load finished because it was waiting
for the lock. However, when shutting down, the lock doesn't matter since
the process is being killed anyway.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-03 16:05:34 -04:00
kingbri
0bcb4e4a7d Model: Attach request ID to logs
If multiple logs come in at once, track which log corresponds to
which request.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-01 00:25:54 -04:00
kingbri
9390d362dd Model: Log generation params and metrics after the prompt/response
A user's prompt and response can be large in the console. Therefore,
always log the smaller payloads (ex. gen params + metrics) after
the large chunks.

However, it's recommended to keep prompt logging off anyways since
it'll result in console spam.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-01 00:19:21 -04:00
kingbri
46304ce875 Model: Properly pass in max_batch_size from config
The override wasn't being passed in before. Also, the default is now
none since Exl2 can automatically calculate the max batch size.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-30 18:42:25 -04:00
kingbri
7522b1447b Model: Add support for HuggingFace config and bad_words_ids
This is necessary for Kobold's API. Current models use bad_words_ids
in generation_config.json, but for some reason, they're also present
in the model's config.json.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-26 18:23:22 -04:00
kingbri
b7cb6f0b91 API: Add KoboldAI server
Used for interacting with applications that use KoboldAI's API
such as horde.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-26 16:37:30 -04:00
kingbri
3e8ffebdd3 Tree: Format
Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-23 14:32:50 -04:00
kingbri
9ad69e8ab6 API: Migrate universal routes to core
Place OAI specific routes in the appropriate folder. This is in
preparation for adding new API servers that can be optionally enabled.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-23 14:08:48 -04:00
kingbri
191600a150 Revert "Model: Skip empty token chunks"
This reverts commit 21516bd7b5.

This skips EOS and implementing it the proper way seems more
costly than necessary.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-22 18:34:00 -04:00
kingbri
21516bd7b5 Model: Skip empty token chunks
This helps make the generation loop more efficient by skipping past
chunks that aren't providing any tokens anyways. The offset isn't
affected.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-22 12:23:49 -04:00
kingbri
cae94b920c API: Add ability to use request IDs
Identify which request is being processed to help users disambiguate
which logs correspond to which request.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-21 21:01:05 -04:00
kingbri
933404c185 Model: Warn user if terminating jobs
If skip_wait is true, it's best to let the user know that all jobs
will be forcibly cancelled.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-15 11:34:16 -04:00
kingbri
9dae461142 Model: Attempt to recreate generator on a fatal error
If a job causes the generator to error, tabby stops working until
a relaunch. It's better to try establishing a system of redundancy
and remake the generator in the event that it fails.

May replace this with an exit signal for a fatal error instead, but
not sure.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-15 01:09:49 -04:00
kingbri
1f46a1130c OAI: Restrict list permissions for API keys
API keys are not allowed to view all the admin's models, templates,
draft models, loras, etc. Basically anything that can be viewed
on the filesystem outside of anything that's currently loaded is
not allowed to be returned unless an admin key is present.

This change helps preserve user privacy while not erroring out on
list endpoints that the OAI spec requires.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-11 14:22:50 -04:00
turboderp
4cf79c5ae1 Clear tokenizer_data cache when unloading model
2024-07-08 03:31:05 +02:00
turboderp
bb8b02a60a Wrap arch_compat_overrides in try block
Quick fix until exllamav2 0.1.7 releases, since the function isn't defined for 0.1.6.
2024-07-07 07:54:05 +02:00
kingbri
773639ea89 Model: Fix flash-attn checks
If flash attention is already turned off by exllamaV2 itself, don't
try creating a paged generator. Also condense all the redundant
logic into one if statement.

Also check arch_compat_overrides to see if flash attention should
be disabled for a model arch (ex. Gemma 2)

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-06 20:58:24 -04:00
kingbri
c575105e41 ExllamaV2: Cleanup log placements
Move the large import errors into the check functions themselves.
This helps reduce the difficulty in interpreting where errors are
coming from.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-06-16 00:16:03 -04:00
Glenn Maynard
8da7644571 Fix exception unloading models. (#138)
self.generator is None if a model load fails or is cancelled.
2024-06-15 23:44:29 +02:00
DocShotgun
85387d97ad Fix disabling flash attention in exl2 config (#136)
* Model: Fix disabling flash attention in exl2 config

* Model: Pass no_flash_attn to draft config

* Model: Force torch flash SDP off in compatibility mode
2024-06-12 20:00:46 +02:00
DocShotgun
156b74f3f0 Revision to paged attention checks (#133)
* Model: Clean up paged attention checks

* Model: Move cache_size checks after paged attn checks
Cache size is only relevant in paged mode

* Model: Fix no_flash_attention

* Model: Remove no_flash_attention
Ability to use flash attention is auto-detected, so this flag is unneeded. Uninstall flash attention to disable it on supported hardware.
2024-06-09 17:28:11 +02:00
DocShotgun
55d979b7a5 Update dependencies, support Python 3.12, update for exl2 0.1.5 (#134)
* Dependencies: Add wheels for Python 3.12

* Model: Switch fp8 cache to Q8 cache

* Model: Add ability to set draft model cache mode

* Dependencies: Bump exllamav2 to 0.1.5

* Model: Support Q6 cache

* Config: Add Q6 cache and draft_cache_mode to config sample
2024-06-09 17:27:39 +02:00
DocShotgun
dcd9428325 Model: Warn if cache size is too small for CFG (#132)
2024-06-05 19:40:14 +02:00
DocShotgun
e391d84e40 More extensive checks for paged mode support (#121)
* Model: More extensive checks for paged attention
Previously, TabbyAPI only checked whether the user's hardware supports flash attention before deciding whether to enable paged mode.
This adds checks for whether no_flash_attention is set, whether flash-attn is installed, and whether the installed version supports paged attention.

* Tree: Format

* Tree: Lint

* Model: Check GPU architecture first
Check GPU arch prior to checking whether flash attention 2 is installed
2024-06-05 09:33:21 +02:00
Brian Dashore
516b52b341 Merge pull request #112 from DocShotgun/main
Separate new prompt tokens from those reused from cache in metric logging
2024-05-27 18:04:43 -04:00
kingbri
116cf56c87 Model: Auto-round cache size on init
Cache size must be a multiple of 256 to work properly in ExllamaV2.
Take the config value and, if it isn't already a multiple of 256,
round it up to the next multiple of 256.

This is because cache size can never be lower than max_seq_len.
If max_seq_len isn't a multiple of 256, this method will never
yield a number that's lower than max_seq_len since it's no longer
a source of truth.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-26 21:24:54 -04:00
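Note on 116cf56c87: the rounding described in this commit can be sketched roughly as follows. This is a minimal illustration, assuming the intent is to round the configured value up to the next multiple of 256 and never drop below max_seq_len. The function name and signature are hypothetical and are not taken from the TabbyAPI source.

```python
def round_cache_size(cache_size: int, max_seq_len: int, multiple: int = 256) -> int:
    """Round cache_size up to a multiple of `multiple` (hypothetical helper)."""
    # The cache can never be smaller than the maximum sequence length.
    cache_size = max(cache_size, max_seq_len)

    remainder = cache_size % multiple
    if remainder == 0:
        # Already aligned, nothing to do.
        return cache_size

    # Drop the remainder, then step up by one full multiple.
    return cache_size - remainder + multiple


print(round_cache_size(4096, 4096))  # 4096
print(round_cache_size(5000, 4096))  # 5120
print(round_cache_size(1000, 4100))  # 4352
```

Rounding up rather than down is what keeps the result from falling below max_seq_len when max_seq_len itself is not a multiple of 256.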
DocShotgun
ce5e2ec8de Logging: Clarify new vs cached tokens in prompt processing
2024-05-26 18:21:17 -07:00
DocShotgun
767e6a798a API + Model: Add support for specifying k/v cache size
2024-05-26 14:17:01 -07:00
kingbri
660f9b8432 OAI: Fix request cancellation behavior
Depending on the day of the week, Starlette can work with a CancelledError
or using await request.is_disconnected(). Run the same behavior for both
cases and allow cancellation.

Streaming requests now set an event to cancel the batched job and break
out of the generation loop.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-26 13:00:33 -04:00
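Note on 660f9b8432: the event-based cancellation of streaming requests might look something like the sketch below. It uses only the standard library; the generator, the token list, and the point at which the event is set are all stand-ins for illustration, and the real disconnect signal (a CancelledError or a disconnect check on the request) is simulated here by setting the event by hand.

```python
import asyncio
from typing import AsyncIterator, List


async def stream_tokens(tokens: List[str], abort_event: asyncio.Event) -> AsyncIterator[str]:
    """Yield tokens until the abort event is set (hypothetical generation loop)."""
    for token in tokens:
        # Break out of the loop as soon as cancellation was requested.
        if abort_event.is_set():
            break
        yield token
        # Give other tasks (such as a disconnect watcher) a chance to run.
        await asyncio.sleep(0)


async def demo() -> List[str]:
    abort_event = asyncio.Event()
    received = []
    async for token in stream_tokens(["a", "b", "c", "d"], abort_event):
        received.append(token)
        if len(received) == 2:
            # Simulated client disconnect after two tokens.
            abort_event.set()
    return received


if __name__ == "__main__":
    print(asyncio.run(demo()))  # ['a', 'b']
```

Routing both cancellation paths through one event means the generation loop only has a single condition to check.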
kingbri
094c7b1734 Model: Fix paged and FA2 checks
If a user is using GPU split, check compute capability on only those
GPUs. Autosplit assumes that all GPUs will be used.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-26 11:29:31 -04:00
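Note on 094c7b1734: a rough sketch of checking only the GPUs that take part in a manual split. The compute capabilities are passed in as plain tuples so the example runs without a GPU. The helper name, the threshold of major version 8 (assumed here to correspond to Ampere), and the idea that a split list maps onto the first N devices in order are all assumptions for illustration, not details confirmed by this log.

```python
from typing import List, Optional, Tuple

# Assumption: Ampere and newer report a compute capability major version >= 8.
MIN_MAJOR_CAPABILITY = 8


def supports_paged_mode(
    capabilities: List[Tuple[int, int]],
    gpu_split: Optional[List[float]] = None,
) -> bool:
    """Return True if every GPU that will be used meets the minimum capability."""
    if gpu_split:
        # Manual split: only the devices named in the split are used
        # (assumed to be the first len(gpu_split) devices).
        devices = capabilities[: len(gpu_split)]
    else:
        # Autosplit: assume all visible GPUs will be used.
        devices = capabilities

    return all(major >= MIN_MAJOR_CAPABILITY for major, _minor in devices)


mixed = [(8, 6), (7, 5)]
print(supports_paged_mode(mixed))          # False
print(supports_paged_mode(mixed, [20.0]))  # True
```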
kingbri
9fbbc5afca Tree: Swap from map to list comprehensions
List comprehensions are the more "pythonic" way to approach mapping
values to a list. They're also more flexible across different collection
types rather than the inbuilt map method. It's best to keep one convention
rather than splitting down two.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
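Note on 9fbbc5afca: for readers unfamiliar with the two styles, the swap looks roughly like this. The variable names are made up for illustration and do not come from the TabbyAPI tree.

```python
raw_ids = ["3", "1", "2"]

# map returns a lazy iterator, so it has to be wrapped to get a list.
ids_via_map = list(map(int, raw_ids))

# The comprehension builds the list directly.
ids_via_comprehension = [int(raw_id) for raw_id in raw_ids]

# The same syntax extends to other collection types.
id_set = {int(raw_id) for raw_id in raw_ids}
id_lengths = {raw_id: len(raw_id) for raw_id in raw_ids}

print(ids_via_map == ids_via_comprehension)  # True
```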
kingbri
a46ee62d03 Model: Clarify warning and device check on load
FA2 v2.5.7 and up is not supported on GPUs older than Ampere or on AMD GPUs.
Clarify the error message and explain what happens as a result.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
43cd7f57e8 API + Model: Add blocks and checks for various load requests
Add a sequential lock and wait until jobs are completed before executing
any loading requests that directly alter the model. However, we also
need to block any new requests that come in until the load is finished,
so add a condition that triggers once the lock is free.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
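Note on 43cd7f57e8: the lock plus condition arrangement described here can be sketched with the standard asyncio primitives. This is an illustration under assumptions, not the project's implementation: the class and method names are hypothetical, and waiting for in-flight jobs before the load starts is left out to keep the example short.

```python
import asyncio
from typing import Awaitable, Callable, List


class LoadGate:
    """Serialize model loads and hold new requests until a load finishes."""

    def __init__(self) -> None:
        self.load_lock = asyncio.Lock()
        self.load_condition = asyncio.Condition()

    async def load(self, do_load: Callable[[], Awaitable[None]]) -> None:
        # Only one load may alter the model at a time.
        async with self.load_lock:
            await do_load()
        # Wake up every request that was waiting for the lock to be free.
        async with self.load_condition:
            self.load_condition.notify_all()

    async def wait_for_load(self) -> None:
        # New requests block here while a load holds the lock.
        async with self.load_condition:
            await self.load_condition.wait_for(lambda: not self.load_lock.locked())


async def demo() -> List[str]:
    gate = LoadGate()
    events: List[str] = []

    async def fake_load() -> None:
        events.append("load start")
        await asyncio.sleep(0.01)
        events.append("load end")

    load_task = asyncio.create_task(gate.load(fake_load))
    await asyncio.sleep(0)  # Let the load task acquire the lock first.

    await gate.wait_for_load()
    events.append("request served")
    await load_task
    return events


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The request that arrives mid-load is only served after the load has finished and the condition has been notified.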
kingbri
408c66a1f2 Model: Change FA2 and paged attention checks
The dynamic generator requires Flash attention 2.5.7 or higher to
be installed. This is only supported on Nvidia's 30 series and higher.

If a card is AMD or lower than the 30 series, switch to compatibility
mode which functions the same way as the older generator, except
without parallel batching and any features that depend on it, such as
CFG.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
c2d3675408 Model: Add min_tokens support
In the form of min_new_tokens. Stopping strings take priority.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
5f0fb9c4ff Model: Add CFG support
Dynamic generator needed multiple prompts to be tokenized and sent
for them to be sampled in serial, but generated in parallel.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
06ff47e2b4 Model: Use true async jobs and add logprobs
The new async dynamic job allows for native async support without the
need of threading. Also add logprobs and metrics back to responses.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
32ae62feac Model: Add filter support to dynamic gen
Dynamic gen takes in filters differently. Adjust to set the filter list
per class rather than in the generation function.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
8ccd8fe5f8 Model: Initial dynamic generator support
Adds basic support for ExllamaV2's dynamic generator. Can generate
a streaming and non-streaming completion.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
e4bb709305 Model: Fix usage stats in non-streaming gens
The wrong key was being returned from the model to the API.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-12 22:44:50 -04:00
DocShotgun
abe411c6fb API + Model: Add support for regex pattern constraints
Adds the ability to constrain generation via regex pattern using lm-format-enforcer.
2024-05-12 19:10:43 -07:00
Ycros
57525219d0 Fix: Properly handle banned_strings and decode_special tokens (#104)
* Fix: Actually pass banned_strings to the generation call.

* decode_special_tokens was missing as well.

* syntax
2024-05-12 20:47:45 +00:00
kingbri
c8ec742be9 Samplers: Expose skew sampling
Skew is an extra unused sampler in ExllamaV2. Add it in for coverage.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-12 01:41:01 -04:00
kingbri
b4bc941cbe Tree: Lint
Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-11 22:42:39 -04:00
kingbri
7bebc085ec Model: Remove legacy checks
v0.0.21 has these features implemented.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-11 19:26:23 -04:00
kingbri
366d57cf45 Tree: Format
Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-10 21:20:41 -04:00
kingbri
7eee936a3f Model: Remove old code and fix API handling
skip_special_tokens is in stable exl2. Also default the parameters
if they are not present in the function signature.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-10 21:20:00 -04:00
DocShotgun
c0b631ba92 API: Add banned_strings
From exllamav2: List of strings that the generator will refuse to output. As soon as a partial match happens, a checkpoint is saved that the generator can rewind to if need be. Subsequent tokens are then held until the full string is resolved (match or no match) and either emitted or discarded, accordingly.
2024-05-10 13:53:55 -07:00
DocShotgun
a1df22668b API: Add min_tokens
Bans the EOS token until the generation reaches a minimum length. This will not prevent the model from otherwise ending the generation early by outputting other stop conditions.
2024-05-10 12:30:17 -07:00
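Note on a1df22668b: the EOS ban can be pictured as a per-step check like the one below. This is a simplified sketch with a hypothetical helper name. It only models the EOS ban itself, so other stop conditions (such as stop strings) would still end the generation early, as the commit message says.

```python
from typing import List, Set


def banned_for_step(generated: List[int], min_tokens: int, eos_token_id: int) -> Set[int]:
    """Return the token IDs that must not be sampled at this step."""
    # Until the minimum length is reached, the EOS token is off limits.
    if len(generated) < min_tokens:
        return {eos_token_id}
    return set()


EOS = 2  # Hypothetical EOS token ID, for illustration only.
print(banned_for_step([10, 11], min_tokens=3, eos_token_id=EOS))      # {2}
print(banned_for_step([10, 11, 12], min_tokens=3, eos_token_id=EOS))  # set()
```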
kingbri
6f9da97114 API: Add banned_tokens
Appends the banned tokens to the generation. This is equivalent to
setting logit bias to -100 on a specific set of tokens.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-28 11:06:09 -04:00
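Note on 6f9da97114: the equivalence with logit bias mentioned in this commit can be illustrated as below. The helper is hypothetical and simply expresses a list of banned token IDs as a logit bias mapping; it is not how the sampler applies the ban internally.

```python
from typing import Dict, Iterable

BAN_BIAS = -100.0  # The bias value named in the commit message.


def apply_banned_tokens(
    logit_bias: Dict[int, float],
    banned_tokens: Iterable[int],
) -> Dict[int, float]:
    """Return a copy of logit_bias with every banned token pushed to -100."""
    merged = dict(logit_bias)
    for token_id in banned_tokens:
        # A ban overrides any bias the caller set for the same token.
        merged[token_id] = BAN_BIAS
    return merged


print(apply_banned_tokens({5: 1.5}, [5, 7]))  # {5: -100.0, 7: -100.0}
```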