tabbyAPI

mirror of https://github.com/theroyallab/tabbyAPI.git synced 2026-03-15 00:07:28 +00:00

Author	SHA1	Message	Date
kingbri	5055a98e41	Model: Wrap load in inference_mode Some tensors were being taken out of inference mode during each iteration of exllama's load_autosplit_gen. This causes errors since autograd is off. Therefore, make the shared load_gen_sync function have an overarching inference_mode context to prevent forward issues. This should allow for the generator to iterate across each thread call. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-21 18:06:50 -04:00
kingbri	56fdfb5f8e	OAI: Add stream to gen params Good for logging. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-21 00:55:44 -04:00
kingbri	69e41e994c	Model: Fix generation with non-streaming and logprobs Finish_reason was giving an empty offset. Fix this by grabbing the finish reason first and then handling the static generation as normal. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-21 00:47:24 -04:00
kingbri	7e669527ed	Model: Fix tokenizer bugs Some tokenizer variables don't get cleaned up on init, so these can persist. Clean these up manually before creating a new tokenizer for now. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-20 15:21:37 -04:00
kingbri	07d9b7cf7b	Model: Add abort on generation When the model is processing a prompt, add the ability to abort on request cancellation. This is also a catch for a SIGINT. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-20 15:21:37 -04:00
kingbri	7020a0a2d1	Dependencies: Update Exllamav2 v0.0.16 Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-20 15:21:37 -04:00
kingbri	b74603db59	Model: Log metrics before yielding a stop Yielding the finish reason before the logging causes the function to terminate early. Instead, log before yielding and breaking out of the generation loop. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-20 01:17:04 -04:00
kingbri	09a4c79847	Model: Auto-scale max_tokens by default If max_tokens is None, it automatically scales to fill up the context. This does not mean the generation will fill up that context since EOS stops also exist. Originally suggested by #86 Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-18 22:54:59 -04:00
kingbri	8cbb59d6e1	Model: Cleanup some comments Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-18 22:20:45 -04:00
kingbri	4f75fb5588	Model: Adjust max output len Max output len should be hardcoded to 16 since it's the amount of tokens to predict per forward pass. 16 is a good value for both normal inference and speculative decoding which also helps save vram compared to 2048 which was the previous default. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-18 22:16:53 -04:00
kingbri	5c7fc69ded	API: Fix finish_reason returns OAI expects finish_reason to be "stop" or "length" (there are others, but they're not in the current scope of this project). Make all completions and chat completions responses return this from the model generation itself rather than putting a placeholder. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-18 15:59:28 -04:00
kingbri	c9a6d9ae1f	Model: Switch to begin_stream_ex Allows for dynamically passing logprobs params instead of assuming on initialization of the generator. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-17 14:41:16 -04:00
kingbri	2755fd1af0	API: Fix blocking iterator execution Run these iterators on the background thread. On startup, the API spawns a background thread as needed to run sync code on without blocking the event loop. Use asyncio's run_thread function since it allows for errors to be propagated. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-16 23:23:31 -04:00
kingbri	7fded4f183	Tree: Switch to async generators Async generation helps remove many roadblocks to managing tasks using threads. It should allow for abortables and modern-day paradigms. NOTE: Exllamav2 itself is not an asynchronous library. It's just been added into tabby's async nature to allow for a fast and concurrent API server. It's still being debated to run stream_ex in a separate thread or manually manage it using asyncio.sleep(0) Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-16 23:23:31 -04:00
kingbri	efc01d947b	API + Model: Add speculative ngram decoding Speculative ngram decoding is like speculative decoding without the draft model. It's not as useful because it only decodes on predictable sequences, but it depends on the usecase. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-13 23:32:11 -04:00
kingbri	2ebefe8258	Logging: Move metrics to gen logging This didn't have a place in the generation function. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-13 23:13:55 -04:00
kingbri	1ec8eb9620	Tree: Format Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-13 00:02:55 -04:00
kingbri	b373b25235	API: Move to ModelManager This is a shared module which manages the model container and provides extra utility functions around it to help slim down the API. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-12 23:59:30 -04:00
kingbri	8b46282aef	Model: Fix state flag sets on unload The load state should be false only if the models are unloaded. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-12 23:59:30 -04:00
kingbri	53d889e0f0	Logging: Fix legacy warn statement Warn is not a valid method with loguru. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-11 01:31:43 -04:00
kingbri	228c227c1e	Logging: Switch to loguru Loguru is a flexible logger that allows for easier hooking and imports into Rich with no problems. Also makes progress bars stick to the bottom of the terminal window. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-08 01:00:48 -05:00
kingbri	39617adb65	Requirements: Update Exllamav2 v0.0.15 Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-06 22:29:55 -05:00
kingbri	9a007c4707	Model: Add support for Q4 cache Add this in addition to 8bit cache and 16bit cache. Passing "Q4" with the cache_mode request parameter will set this on model load. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-06 00:59:28 -05:00
kingbri	b0c295dd2f	API: Add more methods to semaphore The semaphore/queue model for Tabby is as follows: - Any load requests go through the semaphore by default - Any load request can include the skip_queue parameter to bypass the semaphore - Any unload requests are immediately executed - All completion requests are placed inside the semaphore by default This model preserves the parallelism of single-user mode with extra convenience methods for queues in multi-user. It also helps mitigate problems that were previously present in the concurrency stack. Also change how the program's loop runs so it exits when the API thread dies. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-04 23:21:40 -05:00
kingbri	c82697fef2	API: Fix issues with concurrent requests and queueing This is the first of many future commits that will overhaul the API to be more robust and concurrent. The model is admin-first where the admin can do anything in case something goes awry. Previously, calls to long-running synchronous background tasks would block the entire API, making it ignore any terminal signals until generation is completed. To fix this, leverage FastAPI's run_in_threadpool to offload the long-running tasks to another thread. However, signals to abort the process still kept the background thread running and made the terminal hang. This was due to an issue with Uvicorn not propagating the SIGINT signal across threads in its event loop. To fix this in a catch-all way, run the API processes in a separate thread so the main thread can still kill the process if needed. In addition, make request error logging more robust and refer to the console for full error logs rather than creating a long message on the client-side. Finally, add state checks to see if a model is fully loaded before generating a completion. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-04 23:21:40 -05:00
kingbri	fc857893ee	Model: Remove Exllamav2 patches These classes are in the newest version now. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-24 23:40:11 -05:00
kingbri	73a1d9ef78	Model: Fix imports Use the standard import ordering. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-24 23:40:11 -05:00
kingbri	f6d749c771	Model: Add EBNF grammar support Using the Outlines library, add support to supply EBNF strings and pass them to the library for parsing. From there, a wrapper is created and a filter is passed to generation. Replace with an in-house solution at some point that's more flexible. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-24 23:40:11 -05:00
kingbri	57b3d69949	API + Model: Add support for JSON schema constraints Add the ability to constrain the return value of a model to be JSON. Built using the JSON schema standard to define the properties of what the model should return. This feature should be more accurate than using GBNF/EBNF to yield the same results due to the use of lmformatenforcer. GBNF/EBNF will be added in a different commit/branch. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-24 23:40:11 -05:00
kingbri	ccd41d720d	Requirements: Bump ExllamaV2 v0.0.14 Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-24 12:26:08 -05:00
kingbri	360802762c	Model: Fix logit bias token checks Accidentally checked on the token bias tensor which didn't contain the token IDs. Check if the index exists on the id_to_piece list instead. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-22 21:44:15 -05:00
kingbri	bee26a2f2c	API: Auto-unload on a load request Automatically unload the existing model when calling /load. This was requested many times, and does make more sense in the long run. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-21 23:00:11 -05:00
kingbri	a19a4eb1be	Model: Format Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-18 18:31:31 -05:00
kingbri	7def32e4de	Model: Fix logit bias handling If the token doesn't exist, gracefully warn instead of erroring out. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-18 18:30:58 -05:00
kingbri	aa34b2e5fd	Model: Prefer auto over manual GPU split For safety reasons, always use auto unless a manual split is provided and auto is forced off. If auto is forced off and a manual split isn't provided, a manual split will be attempted. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-17 00:21:48 -05:00
kingbri	ea00a6bd45	Requirements: Update Exllamav2 Update to v0.0.13.post2 Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-14 21:51:25 -05:00
kingbri	cce97deea5	Model: Switch logprobs to use post-sampling Previously, pre-sampling logprobs were used from the raw logits, but newer versions of exl2 allow for returning token probs post-sampling. Convert these to logprobs and send to the user. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-14 21:51:25 -05:00
kingbri	664e2c417e	Model: Fix GPU split args loading Autosplit was overwriting a manual GPU split if the YAML parameter wasn't set. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-14 17:42:20 -05:00
kingbri	9f1d891490	Packages: Fix exllamav2 version check Post versions are ok to use for checking if the user is on the correct exllamav2 wheel. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-10 14:00:26 -05:00
kingbri	8d8cf5dc69	Model: Fix dynatemp fallback Set to 1.0 if the condition fails. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-10 12:02:31 -05:00
kingbri	2f568ff573	Config: Expose auto GPU split reserve config The GPU reserve is used as a VRAM buffer to prevent GPU overflow when automatically deciding how to load a model on multiple GPUs. Make this configurable. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-08 22:09:50 -05:00
kingbri	43bba526bf	Model: Fix logprobs unwrapping Take a log of the token probs since they're already normalized which reflects the proper value. Also, don't error out if a token prob doesn't exist in the dict and return None instead from zip. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-08 21:26:53 -05:00
kingbri	c7428f0bcd	API: Add logprobs for chat completions Adds chat completion logprob support using OAI's spec. Tokens are not converted to tiktoken here since that will add an extra dependency for no real reason. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-08 21:26:53 -05:00
kingbri	c02fe4d1db	API: Fix response creation Change chat completion and text completion responses to be more flexible. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-08 21:26:53 -05:00
kingbri	0af6a38af3	Model: Add logprobs support Returns token offsets, selected tokens, probabilities of tokens post-sampling, and normalized probability of selecting a token pre-sampling (for efficiency purposes). Only for text completions. Chat completions in a later commit. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-08 21:26:53 -05:00
kingbri	284f20263f	API: Clean up tokenizing endpoint Split the get tokens function into separate wrapper encode and decode functions for overall code cleanliness. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-08 21:26:53 -05:00
AliCat	bb48f77ca1	Neutralize samplers (#59) * Update sample_preset.yml Neutralized the samplers. * Sampling: Fix dynatemp defaults Default max temp and min temp is 1.0 * Sampling: Fix TFS defaults Default is 1.0 --------- Co-authored-by: AliCat <86847834+alicat22@users.noreply.github.com> Co-authored-by: kingbri <bdashore3@proton.me>	2024-02-08 00:23:09 -05:00
kingbri	c0ad647fa7	Model: Auto-detect a one GPU setup and fix gpu_split_auto It makes more sense to use gpu split parameters when the user has >1 GPUs. Otherwise, set split and split_auto to False and save the user some VRAM. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-06 23:08:57 -05:00
kingbri	849179df17	Model: Make loading use less VRAM The model loader was using more VRAM on a single GPU compared to base exllamav2's loader. This was because single GPUs were running using the autosplit config which allocates an extra vram buffer for safe loading. Turn this off for single-GPU setups (and turn it off by default). This change should allow users to run models which require the entire card with hopefully faster T/s. For example, Mixtral with 3.75bpw increased from ~30T/s to 50T/s due to the extra vram headroom on Windows. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-06 22:29:56 -05:00
kingbri	fedebadc81	Model: Fix generate window fallback Use max_seq_len as the numerator, not the max_tokens. Mismatched parameter. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-06 14:48:42 -05:00
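
Note on commit b0c295dd2f: the semaphore/queue model it describes (loads and completions queued by default, skip_queue bypassing the queue, unloads running immediately) can be sketched roughly as below. This is only an illustration under stated assumptions, not tabbyAPI's actual code: the RequestQueue class, its method names, and the log list are hypothetical, and asyncio.sleep stands in for real model work.

```python
import asyncio


class RequestQueue:
    """Hypothetical sketch of the queueing rules; not tabbyAPI's real class."""

    def __init__(self, max_concurrent: int = 1) -> None:
        # One shared semaphore gates both completions and queued loads.
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self.log: list[str] = []

    async def _run(self, name: str, delay: float) -> None:
        # Stand-in for real work (model load or generation).
        self.log.append(f"start {name}")
        await asyncio.sleep(delay)
        self.log.append(f"end {name}")

    async def completion(self, name: str, delay: float = 0.01) -> None:
        # Completion requests always wait on the semaphore.
        async with self._semaphore:
            await self._run(name, delay)

    async def load(self, name: str, delay: float = 0.01,
                   skip_queue: bool = False) -> None:
        # Loads are queued by default; skip_queue bypasses the semaphore.
        if skip_queue:
            await self._run(name, delay)
            return
        async with self._semaphore:
            await self._run(name, delay)

    async def unload(self, name: str) -> None:
        # Unloads never wait on the semaphore.
        self.log.append(f"unload {name}")


async def demo() -> list[str]:
    queue = RequestQueue()
    await asyncio.gather(
        queue.completion("c1", 0.05),
        queue.completion("c2", 0.01),
        queue.load("m1", 0.01, skip_queue=True),
        queue.unload("m0"),
    )
    return queue.log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the second completion only starts once the first has released the semaphore, while the skip_queue load and the unload proceed without waiting.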

63 Commits