Commit Graph

115 Commits

Author SHA1 Message Date
DocShotgun
e391d84e40 More extensive checks for paged mode support (#121)
* Model: More extensive checks for paged attention
Previously, TabbyAPI only checked for whether the user's hardware supports flash attention before deciding whether to enabled paged mode.
This adds checks for whether no_flash_attention is set, whether flash-attn is installed, and whether the installed version supports paged attention.

* Tree: Format

* Tree: Lint

* Model: Check GPU architecture first
Check GPU arch prior to checking whether flash attention 2 is installed
2024-06-05 09:33:21 +02:00
turboderp
dbdcb38ad7 Allow either "[" or "{" prefix to support JSON grammar with top level arrays (#129) 2024-06-04 02:32:39 +02:00
turboderp
e889fa3efe Bump exllamav2 to v0.1.4 (#128) 2024-06-04 02:32:08 +02:00
Brian Dashore
516b52b341 Merge pull request #112 from DocShotgun/main
Separate new prompt tokens from those reused from cache in metric logging
2024-05-27 18:04:43 -04:00
kingbri
19961f4126 Dependencies: Update ExllamaV2
v0.1.1

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-27 13:38:07 -04:00
kingbri
116cf56c87 Model: Auto-round cache size on init
Cache size must be a multiple of 256 to work properly in ExllamaV2.
Take the config value and set the cache size to one multiple above
the remainder of the cache size divided by 256.

This is because cache size can never be lower than max_seq_len.
If max_seq_len isn't a multiple of 256, this method will never
yield a number that's lower than max_seq_len since it's no longer
a source of truth.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-26 21:24:54 -04:00
DocShotgun
ce5e2ec8de Logging: Clarify new vs cached tokens in prompt processing 2024-05-26 18:21:17 -07:00
DocShotgun
767e6a798a API + Model: Add support for specifying k/v cache size 2024-05-26 14:17:01 -07:00
kingbri
660f9b8432 OAI: Fix request cancellation behavior
Depending on the day of the week, Starlette can work with a CancelledError
or using await request.is_disconnected(). Run the same behavior for both
cases and allow cancellation.

Streaming requests now set an event to cancel the batched job and break
out of the generation loop.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-26 13:00:33 -04:00
kingbri
094c7b1734 Model: Fix paged and FA2 checks
If a user is using GPU split, check compute capability on only those
GPUs. Autosplit assumes that all GPUs will be used.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-26 11:29:31 -04:00
kingbri
9fbbc5afca Tree: Swap from map to list comprehensions
List comprehensions are the more "pythonic" way to approach mapping
values to a list. They're also more flexible across different collection
types rather than the inbuilt map method. It's best to keep one convention
rather than splitting down two.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
46d0d13914 Model/Grammar: Fix filter append call
No need to use extend if the array is length 1.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
a46ee62d03 Model: Clarify warning and device check on load
FA2 v2.5.7 and up is not supported below ampere and on AMD GPUs.
Clarify the error message and explain what happens as a result.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
47582c2440 Dependencies: Update ExllamaV2
v0.1.0

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
43cd7f57e8 API + Model: Add blocks and checks for various load requests
Add a sequential lock and wait until jobs are completed before executing
any loading requests that directly alter the model. However, we also
need to block any new requests that come in until the load is finished,
so add a condition that triggers once the lock is free.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
408c66a1f2 Model: Change FA2 and paged attention checks
The dynamic generator requires Flash attention 2.5.7 or higher to
be installed. This is only supported on Nvidia's 30 series and higher.

If a card is AMD or lower than the 30 series, switch to compatability
mode which functions the same way as the older generator, except
without parallel batching and any features that depend on it, such as
CFG.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
c2d3675408 Model: Add min_tokens support
In the form of min_new_tokens. Stopping strings take priority.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
5f0fb9c4ff Model: Add CFG support
Dynamic generator needed multiple prompts to be tokenized and sent
for them to be sampled in serial, but generated in parallel.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
06ff47e2b4 Model: Use true async jobs and add logprobs
The new async dynamic job allows for native async support without the
need of threading. Also add logprobs and metrics back to responses.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
32ae62feac Model: Add filter support to dynamic gen
Dynamic gen takes in filters differently. Adjust to set the filter list
per class rather than in the generation function.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
8ccd8fe5f8 Model: Initial dynamic generator support
Adds basic support for ExllamaV2's dynamic generator. Can generate
a streaming and non-streaming completion.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
e4bb709305 Model: Fix usage stats in non-streaming gens
The wrong key was being returned from the model to the API.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-12 22:44:50 -04:00
kingbri
213430a122 Model/Grammar: Remove lmfe checks
lmfe is a required dependency, so checks are no longer needed.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-12 22:24:28 -04:00
DocShotgun
abe411c6fb API + Model: Add support for regex pattern constraints
Adds the ability to constrain generation via regex pattern using lm-format-enforcer.
2024-05-12 19:10:43 -07:00
Ycros
57525219d0 Fix: Properly handle banned_strings and decode_special tokens (#104)
* Fix: Actually pass banned_strings to the generation call.

* decode_special_tokens was missing as well.

* syntax
2024-05-12 20:47:45 +00:00
kingbri
c8ec742be9 Samplers: Expose skew sampling
Skew is an extra unused sampler in ExllamaV2. Add it in for coverage.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-12 01:41:01 -04:00
kingbri
b4bc941cbe Tree: Lint
Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-11 22:42:39 -04:00
kingbri
7bebc085ec Model: Remove legacy checks
v0.0.21 has these features implemented.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-11 19:26:23 -04:00
kingbri
cd78728a77 Dependencies: Update ExllamaV2
v0.0.21

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-11 19:26:03 -04:00
kingbri
366d57cf45 Tree: Format
Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-10 21:20:41 -04:00
kingbri
7eee936a3f Model: Remove old code and fix API handling
skip_special_tokens is in stable exl2. Also default the parameters
if they are not present in the function signature.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-10 21:20:00 -04:00
DocShotgun
c0b631ba92 API: Add banned_strings
From exllamav2: List of strings that the generator will refuse to output. As soon as a partial match happens, a checkpoint is saved that the generator can rewind to if need be. Subsequent tokens are then held until the full string is resolved (match or no match) and either emitted or discarded, accordingly.
2024-05-10 13:53:55 -07:00
DocShotgun
a1df22668b API: Add min_tokens
Bans the EOS token until the generation reaches a minimum length. This will not prevent the model from otherwise ending the generation early by outputting other stop conditions.
2024-05-10 12:30:17 -07:00
kingbri
0e015ad58e Dependencies: Update ExllamaV2
v0.0.20

ROCm 6.0 is now required

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-28 11:06:59 -04:00
kingbri
6f9da97114 API: Add banned_tokens
Appends the banned tokens to the generation. This is equivalent of
setting logit bias to -100 on a specific set of tokens.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-28 11:06:09 -04:00
kingbri
5750826120 Model: Remove extraneous print
Was printing IDs by accident.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-25 18:49:09 -04:00
kingbri
fb1d2f34c1 OAI: Add response_prefix and fix BOS token issues in chat completions
response_prefix is used to add a prefix before generating the next
message. This is used in many cases such as continuining a prompt
(see #96).

Also if a template has BOS token specified, add_bos_token will
append two BOS tokens. Add a check which strips a starting BOS token
from the prompt if it exists.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-25 00:54:43 -04:00
kingbri
88b0b6f4f1 Model: Cast autosplit_reserve to int
Torch errors if float values are passed (because bytes are not float
types). Therefore, overestimate and cast to an int type.

Resolves #97

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-21 23:49:01 -04:00
kingbri
cab789e685 Templates: Migrate to class
Having many utility functions for initialization doesn't make much sense.
Instead, handle anything regarding template creation inside the
class which reduces the amount of function imports.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-21 23:28:14 -04:00
kingbri
9f93505bc1 OAI: Add skip_special_tokens parameter
Allows the ability to decode special tokens if the user wishes.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-21 00:37:46 -04:00
kingbri
8824ea0205 Model: Add EOS token support from generation_config.json
GenerationConfig is meant to override various parts of the model
on generation within the transformers lib. Rather than implementing
the entire GenerationConfig framework (since it's pretty redundant),
add in multi eos_token support like VLLM.

The GenerationConfig is used only for generation, but can be used
for other uses if needed.

If there's more necessary parameters in the future, add those in
as well.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-19 22:52:32 -04:00
kingbri
515b3c2930 OAI: Tokenize chat completion messages
Since chat completion messages are a structure, format the prompt
before checking in the tokenizer.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-15 14:17:16 -04:00
kingbri
ed05f376d9 Dependencies: Switch to LM-format-enforcer fork
LM format enforcer has some latency on token ingestion, so use an
optimized fork instead. Also add this in as a base dependency since
the size is small.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-14 11:59:49 -04:00
kingbri
d759a15559 Model: Fix chunk size handling
Wrong class attribute name used for max_attention_size and fixes
declaration of the draft model's chunk_size.

Also expose the parameter to the end user in both config and model
load.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-07 18:39:19 -04:00
kingbri
30c4554572 Requirements: Update Exllamav2
v0.0.18

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-07 18:00:56 -04:00
kingbri
46ac3beea9 Templates: Support list style chat_template keys
HuggingFace updated transformers to provide templates in a list for
tokenizers. Update to support this new format. Providing the name
of a template for the "prompt_template" value in config.yml will also
look inside the template list.

In addition, log if there's a template exception, but continue model
loading since it shouldn't shut down the application.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-07 11:20:25 -04:00
kingbri
6ecce1604b Model: Fix log if exl2 version is too low
Switch to pyproject syntax.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-31 23:11:21 -04:00
kingbri
f534930270 Dependencies: Bump Exllamav2
v0.0.17

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-31 23:10:28 -04:00
kingbri
b11aac51e2 Model: Add torch.inference_mode() to generator function
Provides a speedup to model forward.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-30 10:45:28 -04:00
kingbri
190a0b26c3 Model: Fix generation when stream = false
References #91. Check if the length of the generation array is > 0
after popping the finish reason.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-29 02:15:56 -04:00