tabbyAPI

mirror of https://github.com/theroyallab/tabbyAPI.git synced 2026-03-15 00:07:28 +00:00

Author	SHA1	Message	Date
kingbri	e95e67a000	OAI: Add validation to "n" n must be greater than 1 to generate. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-28 00:52:30 -04:00
kingbri	e2a8b6e8ae	OAI: Add "n" support for streaming generations Use a queue-based system to get choices independently and send them in the overall streaming payload. This method allows for unordered streaming of generations. The system is a bit redundant, so maybe make the code more optimized in the future. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-28 00:52:30 -04:00
kingbri	c8371e0f50	OAI: Copy gen params for "n" For multiple generations in the same request, nested arrays kept their original reference, resulting in duplications. This will occur with any collection type. For optimization purposes, a deepcopy isn't run for the first iteration since original references are created. This is not the most elegant solution, but it works for the described cases. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-28 00:52:30 -04:00
kingbri	b944f8d756	OAI: Add "n" for non-streaming generations This adds the ability to add multiple choices to a generation. This is only available for non-streaming gens for now, it requires some more work to port over to streaming. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-28 00:52:30 -04:00
kingbri	8d31a5aed1	Dependencies: Update Flash Attention 2 v2.5.9.post1 Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-28 00:45:35 -04:00
Brian Dashore	516b52b341	Merge pull request #112 from DocShotgun/main Separate new prompt tokens from those reused from cache in metric logging	2024-05-27 18:04:43 -04:00
kingbri	19961f4126	Dependencies: Update ExllamaV2 v0.1.1 Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-27 13:38:07 -04:00
kingbri	04cbed16e8	Update README Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-27 13:37:57 -04:00
kingbri	4087586449	Start: Create config.yml if it doesn't exist While TabbyAPI doesn't need a config.yml to run, new users can get confused by the task of copying config_sample.yml to config.yml. Therefore, automatically do this in the start script to immediately expose options to the user. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-26 21:37:52 -04:00
DocShotgun	7084081b1f	Tree: Lint	2024-05-26 18:27:30 -07:00
kingbri	116cf56c87	Model: Auto-round cache size on init Cache size must be a multiple of 256 to work properly in ExllamaV2. Take the config value and set the cache size to one multiple above the remainder of the cache size divided by 256. This is because cache size can never be lower than max_seq_len. If max_seq_len isn't a multiple of 256, this method will never yield a number that's lower than max_seq_len since it's no longer a source of truth. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-26 21:24:54 -04:00
DocShotgun	ce5e2ec8de	Logging: Clarify new vs cached tokens in prompt processing	2024-05-26 18:21:17 -07:00
Brian Dashore	3dcae8b023	Merge pull request #111 from DocShotgun/main Add support for specifying k/v cache size	2024-05-26 20:52:21 -04:00
kingbri	bec919e202	Config: Change cache_size description and location Makes more sense to place cache_size with the other cache options. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-26 20:50:56 -04:00
DocShotgun	7ab7ffd562	Tree: Format	2024-05-26 15:48:18 -07:00
DocShotgun	767e6a798a	API + Model: Add support for specifying k/v cache size	2024-05-26 14:17:01 -07:00
kingbri	d710a1b441	OAI: Switch to background task for disconnect checks Waiting for request disconnect takes some extra time and allows generation chunks to pile up, resulting in large payloads being sent at once not making up a smooth stream. Use the polling method in non-streaming requests by creating a background task and then check if the task is done, signifying that the request has been disconnected. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-26 13:52:20 -04:00
kingbri	660f9b8432	OAI: Fix request cancellation behavior Depending on the day of the week, Starlette can work with a CancelledError or using await request.is_disconnected(). Run the same behavior for both cases and allow cancellation. Streaming requests now set an event to cancel the batched job and break out of the generation loop. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-26 13:00:33 -04:00
kingbri	094c7b1734	Model: Fix paged and FA2 checks If a user is using GPU split, check compute capability on only those GPUs. Autosplit assumes that all GPUs will be used. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-26 11:29:31 -04:00
kingbri	9fbbc5afca	Tree: Swap from map to list comprehensions List comprehensions are the more "pythonic" way to approach mapping values to a list. They're also more flexible across different collection types rather than the inbuilt map method. It's best to keep one convention rather than splitting down two. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	46d0d13914	Model/Grammar: Fix filter append call No need to use extend if the array is length 1. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	a46ee62d03	Model: Clarify warning and device check on load FA2 v2.5.7 and up is not supported below ampere and on AMD GPUs. Clarify the error message and explain what happens as a result. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	47582c2440	Dependencies: Update ExllamaV2 v0.1.0 Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	43cd7f57e8	API + Model: Add blocks and checks for various load requests Add a sequential lock and wait until jobs are completed before executing any loading requests that directly alter the model. However, we also need to block any new requests that come in until the load is finished, so add a condition that triggers once the lock is free. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	408c66a1f2	Model: Change FA2 and paged attention checks The dynamic generator requires Flash attention 2.5.7 or higher to be installed. This is only supported on Nvidia's 30 series and higher. If a card is AMD or lower than the 30 series, switch to compatibility mode, which functions the same way as the older generator, except without parallel batching and any features that depend on it, such as CFG. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	c2d3675408	Model: Add min_tokens support In the form of min_new_tokens. Stopping strings take priority. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	5f0fb9c4ff	Model: Add CFG support Dynamic generator needed multiple prompts to be tokenized and sent for them to be sampled in serial, but generated in parallel. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	06ff47e2b4	Model: Use true async jobs and add logprobs The new async dynamic job allows for native async support without the need of threading. Also add logprobs and metrics back to responses. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	32ae62feac	Model: Add filter support to dynamic gen Dynamic gen takes in filters differently. Adjust to set the filter list per class rather than in the generation function. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	8ccd8fe5f8	Model: Initial dynamic generator support Adds basic support for ExllamaV2's dynamic generator. Can generate a streaming and non-streaming completion. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	c474076b22	Concurrency: Remove release_semaphore method At any point for any request cancellation, the semaphore will be decremented. This is an issue since an arbitrary request can desync the semaphore, causing multiple tasks to be processed at once and break generation. Remove this from the networking handlers and therefore, remove the release_semaphore function itself. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-19 10:42:26 -04:00
kingbri	b9fd8555fe	Sampling: Copy over iterable overrides If an override was iterable, any modifications to the returned value would alter the reference to the global storage dict. Therefore, copy the structure if it's an iterable so any modification won't alter the original override. Also apply this for the function that checks for forced overrides. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-17 21:38:28 -04:00
kingbri	0e9385e023	API: Fix usage reporting for chat completions Resolves #106 Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-17 00:03:15 -04:00
kingbri	e4bb709305	Model: Fix usage stats in non-streaming gens The wrong key was being returned from the model to the API. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-12 22:44:50 -04:00
kingbri	213430a122	Model/Grammar: Remove lmfe checks lmfe is a required dependency, so checks are no longer needed. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-12 22:24:28 -04:00
Brian Dashore	b255847c2a	Merge pull request #105 from DocShotgun/main Add support for regex pattern constraints	2024-05-12 22:22:12 -04:00
DocShotgun	abe411c6fb	API + Model: Add support for regex pattern constraints Adds the ability to constrain generation via regex pattern using lm-format-enforcer.	2024-05-12 19:10:43 -07:00
Ycros	57525219d0	Fix: Properly handle banned_strings and decode_special tokens (#104) * Fix: Actually pass banned_strings to the generation call. * decode_special_tokens was missing as well. * syntax	2024-05-12 20:47:45 +00:00
Brian Dashore	611f00818b	Merge pull request #103 from DocShotgun/main Minor fixes for sampler override	2024-05-12 16:47:12 -04:00
DocShotgun	dad34237ba	Samplers: Add example override for generate_window	2024-05-12 00:39:01 -07:00
DocShotgun	9463ecfa40	Samplers: Minor fixes for sampler override * Add missing settings to sample_preset.yml * Fix override for skip_special_tokens	2024-05-12 00:31:31 -07:00
kingbri	c8ec742be9	Samplers: Expose skew sampling Skew is an extra unused sampler in ExllamaV2. Add it in for coverage. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-12 01:41:01 -04:00
kingbri	6f4012d20d	API: Add preset listing for sampler overrides Querying the overrides list endpoint now returns the selected preset and a list of presets to use. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-12 01:34:51 -04:00
kingbri	b4bc941cbe	Tree: Lint Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-11 22:42:39 -04:00
kingbri	2da3fb2caf	Start: Bump ROCm error version ROCm support is for 6.0 now. Update that. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-11 21:57:51 -04:00
kingbri	7bebc085ec	Model: Remove legacy checks v0.0.21 has these features implemented. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-11 19:26:23 -04:00
kingbri	cd78728a77	Dependencies: Update ExllamaV2 v0.0.21 Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-11 19:26:03 -04:00
Brian Dashore	5432f523cb	Merge pull request #102 from DocShotgun/main Add support for min_tokens and banned_strings	2024-05-10 21:21:57 -04:00
kingbri	366d57cf45	Tree: Format Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-10 21:20:41 -04:00
kingbri	7eee936a3f	Model: Remove old code and fix API handling skip_special_tokens is in stable exl2. Also default the parameters if they are not present in the function signature. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-10 21:20:00 -04:00
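
Commit 116cf56c87 above describes rounding the configured cache size up to a multiple of 256. A minimal sketch of that arithmetic follows. The function name `round_up_cache_size` and its `multiple` parameter are hypothetical and chosen for illustration; the actual code in the repository may be structured differently.

```python
def round_up_cache_size(cache_size: int, multiple: int = 256) -> int:
    """Round cache_size up to the nearest multiple (hypothetical helper)."""
    remainder = cache_size % multiple
    if remainder == 0:
        # Already aligned, so keep the configured value.
        return cache_size
    # Move up to the next multiple so the result is never smaller
    # than the value that was passed in.
    return cache_size + (multiple - remainder)


if __name__ == "__main__":
    for value in (4096, 4097, 1000):
        print(value, "->", round_up_cache_size(value))
```

Because the value is only ever rounded upward, a cache size that starts at or above max_seq_len stays at or above it after rounding.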

650 Commits
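
Commit c8371e0f50 in the listing describes copying generation parameters for each of the "n" choices so that nested collections are not shared, while skipping the copy for the first iteration. The sketch below illustrates that idea under stated assumptions. The name `params_for_choices` and the plain dict used for parameters are hypothetical, not taken from the repository.

```python
from copy import deepcopy


def params_for_choices(gen_params: dict, n: int) -> list:
    """Return one parameter dict per choice (hypothetical helper).

    The first choice reuses the original dict. Every later choice gets a
    deep copy so nested lists or dicts are not shared between choices.
    """
    return [gen_params if i == 0 else deepcopy(gen_params) for i in range(n)]


if __name__ == "__main__":
    params = {"stop": ["###"], "temperature": 0.8}
    per_choice = params_for_choices(params, 3)
    # Mutating a later choice leaves the original untouched.
    per_choice[1]["stop"].append("</s>")
    print(params["stop"])
```

A shallow copy would not be enough here, since the nested `stop` list would still be the same object in every choice.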