tabbyAPI

mirror of https://github.com/theroyallab/tabbyAPI.git synced 2026-04-20 14:28:54 +00:00

Author	SHA1	Message	Date
DocShotgun	e391d84e40	More extensive checks for paged mode support (#121) * Model: More extensive checks for paged attention Previously, TabbyAPI only checked for whether the user's hardware supports flash attention before deciding whether to enable paged mode. This adds checks for whether no_flash_attention is set, whether flash-attn is installed, and whether the installed version supports paged attention. * Tree: Format * Tree: Lint * Model: Check GPU architecture first Check GPU arch prior to checking whether flash attention 2 is installed	2024-06-05 09:33:21 +02:00
turboderp	dbdcb38ad7	Allow either "[" or "{" prefix to support JSON grammar with top level arrays (#129)	2024-06-04 02:32:39 +02:00
turboderp	e889fa3efe	Bump exllamav2 to v0.1.4 (#128)	2024-06-04 02:32:08 +02:00
Brian Dashore	516b52b341	Merge pull request #112 from DocShotgun/main Separate new prompt tokens from those reused from cache in metric logging	2024-05-27 18:04:43 -04:00
kingbri	19961f4126	Dependencies: Update ExllamaV2 v0.1.1 Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-27 13:38:07 -04:00
kingbri	116cf56c87	Model: Auto-round cache size on init Cache size must be a multiple of 256 to work properly in ExllamaV2. Take the config value and set the cache size to one multiple above the remainder of the cache size divided by 256. This is because cache size can never be lower than max_seq_len. If max_seq_len isn't a multiple of 256, this method will never yield a number that's lower than max_seq_len since it's no longer a source of truth. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-26 21:24:54 -04:00
DocShotgun	ce5e2ec8de	Logging: Clarify new vs cached tokens in prompt processing	2024-05-26 18:21:17 -07:00
DocShotgun	767e6a798a	API + Model: Add support for specifying k/v cache size	2024-05-26 14:17:01 -07:00
kingbri	660f9b8432	OAI: Fix request cancellation behavior Depending on the day of the week, Starlette can work with a CancelledError or using await request.is_disconnected(). Run the same behavior for both cases and allow cancellation. Streaming requests now set an event to cancel the batched job and break out of the generation loop. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-26 13:00:33 -04:00
kingbri	094c7b1734	Model: Fix paged and FA2 checks If a user is using GPU split, check compute capability on only those GPUs. Autosplit assumes that all GPUs will be used. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-26 11:29:31 -04:00
kingbri	9fbbc5afca	Tree: Swap from map to list comprehensions List comprehensions are the more "pythonic" way to approach mapping values to a list. They're also more flexible across different collection types rather than the inbuilt map method. It's best to keep one convention rather than splitting down two. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	46d0d13914	Model/Grammar: Fix filter append call No need to use extend if the array is length 1. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	a46ee62d03	Model: Clarify warning and device check on load FA2 v2.5.7 and up is not supported below ampere and on AMD GPUs. Clarify the error message and explain what happens as a result. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	47582c2440	Dependencies: Update ExllamaV2 v0.1.0 Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	43cd7f57e8	API + Model: Add blocks and checks for various load requests Add a sequential lock and wait until jobs are completed before executing any loading requests that directly alter the model. However, we also need to block any new requests that come in until the load is finished, so add a condition that triggers once the lock is free. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	408c66a1f2	Model: Change FA2 and paged attention checks The dynamic generator requires Flash attention 2.5.7 or higher to be installed. This is only supported on Nvidia's 30 series and higher. If a card is AMD or lower than the 30 series, switch to compatibility mode which functions the same way as the older generator, except without parallel batching and any features that depend on it, such as CFG. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	c2d3675408	Model: Add min_tokens support In the form of min_new_tokens. Stopping strings take priority. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	5f0fb9c4ff	Model: Add CFG support Dynamic generator needed multiple prompts to be tokenized and sent for them to be sampled in serial, but generated in parallel. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	06ff47e2b4	Model: Use true async jobs and add logprobs The new async dynamic job allows for native async support without the need of threading. Also add logprobs and metrics back to responses. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	32ae62feac	Model: Add filter support to dynamic gen Dynamic gen takes in filters differently. Adjust to set the filter list per class rather than in the generation function. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	8ccd8fe5f8	Model: Initial dynamic generator support Adds basic support for ExllamaV2's dynamic generator. Can generate a streaming and non-streaming completion. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	e4bb709305	Model: Fix usage stats in non-streaming gens The wrong key was being returned from the model to the API. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-12 22:44:50 -04:00
kingbri	213430a122	Model/Grammar: Remove lmfe checks lmfe is a required dependency, so checks are no longer needed. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-12 22:24:28 -04:00
DocShotgun	abe411c6fb	API + Model: Add support for regex pattern constraints Adds the ability to constrain generation via regex pattern using lm-format-enforcer.	2024-05-12 19:10:43 -07:00
Ycros	57525219d0	Fix: Properly handle banned_strings and decode_special tokens (#104) * Fix: Actually pass banned_strings to the generation call. * decode_special_tokens was missing as well. * syntax	2024-05-12 20:47:45 +00:00
kingbri	c8ec742be9	Samplers: Expose skew sampling Skew is an extra unused sampler in ExllamaV2. Add it in for coverage. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-12 01:41:01 -04:00
kingbri	b4bc941cbe	Tree: Lint Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-11 22:42:39 -04:00
kingbri	7bebc085ec	Model: Remove legacy checks v0.0.21 has these features implemented. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-11 19:26:23 -04:00
kingbri	cd78728a77	Dependencies: Update ExllamaV2 v0.0.21 Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-11 19:26:03 -04:00
kingbri	366d57cf45	Tree: Format Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-10 21:20:41 -04:00
kingbri	7eee936a3f	Model: Remove old code and fix API handling skip_special_tokens is in stable exl2. Also default the parameters if they are not present in the function signature. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-10 21:20:00 -04:00
DocShotgun	c0b631ba92	API: Add banned_strings From exllamav2: List of strings that the generator will refuse to output. As soon as a partial match happens, a checkpoint is saved that the generator can rewind to if need be. Subsequent tokens are then held until the full string is resolved (match or no match) and either emitted or discarded, accordingly.	2024-05-10 13:53:55 -07:00
DocShotgun	a1df22668b	API: Add min_tokens Bans the EOS token until the generation reaches a minimum length. This will not prevent the model from otherwise ending the generation early by outputting other stop conditions.	2024-05-10 12:30:17 -07:00
kingbri	0e015ad58e	Dependencies: Update ExllamaV2 v0.0.20 ROCm 6.0 is now required Signed-off-by: kingbri <bdashore3@proton.me>	2024-04-28 11:06:59 -04:00
kingbri	6f9da97114	API: Add banned_tokens Appends the banned tokens to the generation. This is equivalent to setting logit bias to -100 on a specific set of tokens. Signed-off-by: kingbri <bdashore3@proton.me>	2024-04-28 11:06:09 -04:00
kingbri	5750826120	Model: Remove extraneous print Was printing IDs by accident. Signed-off-by: kingbri <bdashore3@proton.me>	2024-04-25 18:49:09 -04:00
kingbri	fb1d2f34c1	OAI: Add response_prefix and fix BOS token issues in chat completions response_prefix is used to add a prefix before generating the next message. This is used in many cases such as continuing a prompt (see #96). Also if a template has BOS token specified, add_bos_token will append two BOS tokens. Add a check which strips a starting BOS token from the prompt if it exists. Signed-off-by: kingbri <bdashore3@proton.me>	2024-04-25 00:54:43 -04:00
kingbri	88b0b6f4f1	Model: Cast autosplit_reserve to int Torch errors if float values are passed (because bytes are not float types). Therefore, overestimate and cast to an int type. Resolves #97 Signed-off-by: kingbri <bdashore3@proton.me>	2024-04-21 23:49:01 -04:00
kingbri	cab789e685	Templates: Migrate to class Having many utility functions for initialization doesn't make much sense. Instead, handle anything regarding template creation inside the class which reduces the amount of function imports. Signed-off-by: kingbri <bdashore3@proton.me>	2024-04-21 23:28:14 -04:00
kingbri	9f93505bc1	OAI: Add skip_special_tokens parameter Allows the ability to decode special tokens if the user wishes. Signed-off-by: kingbri <bdashore3@proton.me>	2024-04-21 00:37:46 -04:00
kingbri	8824ea0205	Model: Add EOS token support from generation_config.json GenerationConfig is meant to override various parts of the model on generation within the transformers lib. Rather than implementing the entire GenerationConfig framework (since it's pretty redundant), add in multi eos_token support like VLLM. The GenerationConfig is used only for generation, but can be used for other uses if needed. If there's more necessary parameters in the future, add those in as well. Signed-off-by: kingbri <bdashore3@proton.me>	2024-04-19 22:52:32 -04:00
kingbri	515b3c2930	OAI: Tokenize chat completion messages Since chat completion messages are a structure, format the prompt before checking in the tokenizer. Signed-off-by: kingbri <bdashore3@proton.me>	2024-04-15 14:17:16 -04:00
kingbri	ed05f376d9	Dependencies: Switch to LM-format-enforcer fork LM format enforcer has some latency on token ingestion, so use an optimized fork instead. Also add this in as a base dependency since the size is small. Signed-off-by: kingbri <bdashore3@proton.me>	2024-04-14 11:59:49 -04:00
kingbri	d759a15559	Model: Fix chunk size handling Wrong class attribute name used for max_attention_size and fixes declaration of the draft model's chunk_size. Also expose the parameter to the end user in both config and model load. Signed-off-by: kingbri <bdashore3@proton.me>	2024-04-07 18:39:19 -04:00
kingbri	30c4554572	Requirements: Update Exllamav2 v0.0.18 Signed-off-by: kingbri <bdashore3@proton.me>	2024-04-07 18:00:56 -04:00
kingbri	46ac3beea9	Templates: Support list style chat_template keys HuggingFace updated transformers to provide templates in a list for tokenizers. Update to support this new format. Providing the name of a template for the "prompt_template" value in config.yml will also look inside the template list. In addition, log if there's a template exception, but continue model loading since it shouldn't shut down the application. Signed-off-by: kingbri <bdashore3@proton.me>	2024-04-07 11:20:25 -04:00
kingbri	6ecce1604b	Model: Fix log if exl2 version is too low Switch to pyproject syntax. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-31 23:11:21 -04:00
kingbri	f534930270	Dependencies: Bump Exllamav2 v0.0.17 Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-31 23:10:28 -04:00
kingbri	b11aac51e2	Model: Add torch.inference_mode() to generator function Provides a speedup to model forward. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-30 10:45:28 -04:00
kingbri	190a0b26c3	Model: Fix generation when stream = false References #91. Check if the length of the generation array is > 0 after popping the finish reason. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-29 02:15:56 -04:00
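
The cache-size rounding described in commit 116cf56c87 can be read as "round up to the next multiple of 256". The sketch below is one way to express that rule, not the repository's actual code: the function name round_up_cache_size and the multiple parameter are hypothetical and chosen only for illustration.

```python
def round_up_cache_size(cache_size: int, multiple: int = 256) -> int:
    """Round cache_size up to the nearest multiple (hypothetical helper).

    Assumption: the commit message says the cache size must be a multiple
    of 256 and must never fall below max_seq_len, so the value is rounded
    up rather than down.
    """
    remainder = cache_size % multiple
    if remainder == 0:
        # Already aligned, so leave the configured value untouched.
        return cache_size
    # Drop the remainder, then step up by one full multiple.
    return cache_size - remainder + multiple


if __name__ == "__main__":
    for requested in (4096, 4097, 1000):
        print(requested, "->", round_up_cache_size(requested))
```

Because the result is never smaller than the input, a cache size that starts at or above max_seq_len stays at or above it after rounding.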

115 Commits