Commit Graph

69 Commits

Author SHA1 Message Date
kingbri
6ecce1604b Model: Fix log if exl2 version is too low
Switch to pyproject syntax.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-31 23:11:21 -04:00
kingbri
f534930270 Dependencies: Bump Exllamav2
v0.0.17

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-31 23:10:28 -04:00
kingbri
b11aac51e2 Model: Add torch.inference_mode() to generator function
Provides a speedup to model forward.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-30 10:45:28 -04:00
kingbri
190a0b26c3 Model: Fix generation when stream = false
References #91. Check if the length of the generation array is > 0
after popping the finish reason.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-29 02:15:56 -04:00
kingbri
26496c4db2 Dependencies: Require tokenizers
This is used for some models and isn't too big in size (compared to
other huggingface dependencies), so include it by default.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-23 01:12:21 -04:00
kingbri
1755f284cf Model: Prompt users to install extras if dependencies don't exist
Ex: tokenizers, lmfe, outlines.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-22 22:13:55 -04:00
kingbri
5055a98e41 Model: Wrap load in inference_mode
Some tensors were being taken out of inference mode during each
iteration of exllama's load_autosplit_gen. This causes errors since
autograd is off.

Therefore, make the shared load_gen_sync function have an overarching
inference_mode context to prevent forward issues. This should allow for
the generator to iterate across each thread call.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 18:06:50 -04:00
kingbri
56fdfb5f8e OAI: Add stream to gen params
Good for logging.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 00:55:44 -04:00
kingbri
69e41e994c Model: Fix generation with non-streaming and logprobs
Finish_reason was giving an empty offset. Fix this by grabbing the
finish reason first and then handling the static generation as normal.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 00:47:24 -04:00
kingbri
7e669527ed Model: Fix tokenizer bugs
Some tokenizer variables don't get cleaned up on init, so these can
persist. Clean these up manually before creating a new tokenizer for
now.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 15:21:37 -04:00
kingbri
07d9b7cf7b Model: Add abort on generation
When the model is processing a prompt, add the ability to abort
on request cancellation. This is also a catch for a SIGINT.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 15:21:37 -04:00
kingbri
7020a0a2d1 Dependencies: Update Exllamav2
v0.0.16

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 15:21:37 -04:00
kingbri
b74603db59 Model: Log metrics before yielding a stop
Yielding the finish reason before the logging causes the function to
terminate early. Instead, log before yielding and breaking out of the
generation loop.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 01:17:04 -04:00
kingbri
09a4c79847 Model: Auto-scale max_tokens by default
If max_tokens is None, it automatically scales to fill up the context.
This does not mean the generation will fill up that context since
EOS stops also exist.

Originally suggested by #86

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-18 22:54:59 -04:00
kingbri
8cbb59d6e1 Model: Cleanup some comments
Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-18 22:20:45 -04:00
kingbri
4f75fb5588 Model: Adjust max output len
Max output len should be hardcoded to 16 since it's the amount of
tokens to predict per forward pass. 16 is a good value for both
normal inference and speculative decoding which also helps save
vram compared to 2048 which was the previous default.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-18 22:16:53 -04:00
kingbri
5c7fc69ded API: Fix finish_reason returns
OAI expects finish_reason to be "stop" or "length" (there are others,
but they're not in the current scope of this project).

Make all completions and chat completions responses return this
from the model generation itself rather than putting a placeholder.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-18 15:59:28 -04:00
kingbri
c9a6d9ae1f Model: Switch to begin_stream_ex
Allows for dynamically passing logprobs params instead of assuming
on initialization of the generator.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-17 14:41:16 -04:00
kingbri
2755fd1af0 API: Fix blocking iterator execution
Run these iterators on the background thread. On startup, the API
spawns a background thread as needed to run sync code on without blocking
the event loop.

Use asyncio's run_thread function since it allows for errors to be
propegated.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-16 23:23:31 -04:00
kingbri
7fded4f183 Tree: Switch to async generators
Async generation helps remove many roadblocks to managing tasks
using threads. It should allow for abortables and modern-day paradigms.

NOTE: Exllamav2 itself is not an asynchronous library. It's just
been added into tabby's async nature to allow for a fast and concurrent
API server. It's still being debated to run stream_ex in a separate
thread or manually manage it using asyncio.sleep(0)

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-16 23:23:31 -04:00
kingbri
efc01d947b API + Model: Add speculative ngram decoding
Speculative ngram decoding is like speculative decoding without the
draft model. It's not as useful because it only decodes on predictable
sequences, but it depends on the usecase.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-13 23:32:11 -04:00
kingbri
2ebefe8258 Logging: Move metrics to gen logging
This didn't have a place in the generation function.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-13 23:13:55 -04:00
kingbri
1ec8eb9620 Tree: Format
Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-13 00:02:55 -04:00
kingbri
b373b25235 API: Move to ModelManager
This is a shared module  which manages the model container and provides
extra utility functions around it to help slim down the API.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-12 23:59:30 -04:00
kingbri
8b46282aef Model: Fix state flag sets on unload
The load state should be false only if the models are unloaded.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-12 23:59:30 -04:00
kingbri
53d889e0f0 Logging: Fix legacy warn statement
Warn is not a valid method with loguru.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-11 01:31:43 -04:00
kingbri
228c227c1e Logging: Switch to loguru
Loguru is a flexible logger that allows for easier hooking and imports
into Rich with no problems. Also makes progress bars stick to the
bottom of the terminal window.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-08 01:00:48 -05:00
kingbri
39617adb65 Requirements: Update Exllamav2
v0.0.15

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-06 22:29:55 -05:00
kingbri
9a007c4707 Model: Add support for Q4 cache
Add this in addition to 8bit cache and 16bit cache. Passing "Q4" with
the cache_mode request parameter will set this on model load.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-06 00:59:28 -05:00
kingbri
b0c295dd2f API: Add more methods to semaphore
The semaphore/queue model for Tabby is as follows:
- Any load requests go through the semaphore by default
- Any load request can include the skip_queue parameter to bypass
the semaphore
- Any unload requests are immediately executed
- All completion requests are placed inside the semaphore by default

This model preserves the parallelism of single-user mode with extra
convenience methods for queues in multi-user. It also helps mitigate
problems that were previously present in the concurrency stack.

Also change how the program's loop runs so it exits when the API thread
dies.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-04 23:21:40 -05:00
kingbri
c82697fef2 API: Fix issues with concurrent requests and queueing
This is the first in many future commits that will overhaul the API
to be more robust and concurrent. The model is admin-first where the
admin can do anything in-case something goes awry.

Previously, calls to long running synchronous background tasks would
block the entire API, making it ignore any terminal signals until
generation is completed.

To fix this, levrage FastAPI's run_in_threadpool to offload the long
running tasks to another thread. However, signals to abort the process
still kept the background thread running and made the terminal hang.

This was due to an issue with Uvicorn not propegating the SIGINT signal
across threads in its event loop. To fix this in a catch-all way, run
the API processes in a separate thread so the main thread can still
kill the process if needed.

In addition, make request error logging more robust and refer to the
console for full error logs rather than creating a long message on the
client-side.

Finally, add state checks to see if a model is fully loaded before
generating a completion.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-04 23:21:40 -05:00
kingbri
fc857893ee Model: Remove Exllamav2 patches
These classes are in the newest version now.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-24 23:40:11 -05:00
kingbri
73a1d9ef78 Model: Fix imports
Use the standard import ordering.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-24 23:40:11 -05:00
kingbri
f6d749c771 Model: Add EBNF grammar support
Using the Outlines library, add support to supply EBNF strings and
pass them to the library for parsing.

From there, a wrapper is created and a filter is passed to generation.

Replace with an in-house solution at some point that's more flexible.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-24 23:40:11 -05:00
kingbri
57b3d69949 API + Model: Add support for JSON schema constraints
Add the ability to constrain the return value of a model to be JSON.
Built using the JSON schema standard to define the properties of what
the model should return.

This feature should be more accurate than using GBNF/EBNF to yield
the same results due to the use of lmformatenforcer.

GBNF/EBNF will be added in a different commit/branch.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-24 23:40:11 -05:00
kingbri
ccd41d720d Requirements: Bump ExllamaV2
v0.0.14

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-24 12:26:08 -05:00
kingbri
360802762c Model: Fix logit bias token checks
Accidentally checked on the token bias tensor which didn't contain
the token IDs. Check if the index exists on the id_to_piece list
instead.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-22 21:44:15 -05:00
kingbri
bee26a2f2c API: Auto-unload on a load request
Automatically unload the existing model when calling /load. This was
requested many times, and does make more sense in the long run.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-21 23:00:11 -05:00
kingbri
a19a4eb1be Model: Format
Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-18 18:31:31 -05:00
kingbri
7def32e4de Model: Fix logit bias handling
If the token doesn't exist, gracefully warn instead of erroring out.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-18 18:30:58 -05:00
kingbri
aa34b2e5fd Model: Prefer auto over manual GPU split
For safety reasons, always use auto unless a manual split is provided
and auto is forced off.

If auto is forced off and a manual split isn't provided, a manual
split will be attempted.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-17 00:21:48 -05:00
kingbri
ea00a6bd45 Requirements: Update Exllamav2
Update to v0.0.13.post2

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-14 21:51:25 -05:00
kingbri
cce97deea5 Model: Switch logprobs to use post-sampling
Previously, pre-sampling logprobs were used from the raw logits,
but newer versions of exl2 allow for returning token probs post-sampling.
Convert these to logprobs and send to the user.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-14 21:51:25 -05:00
kingbri
664e2c417e Model: Fix GPU split args loading
Autosplit was overwriting a manual GPU split if the YAML parameter
wasn't set.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-14 17:42:20 -05:00
kingbri
9f1d891490 Packages: Fix exllamav2 version check
Post versions are ok to use for checking if the user is on the correct
exllamav2 wheel.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-10 14:00:26 -05:00
kingbri
8d8cf5dc69 Model: Fix dynatemp fallback
Set to 1.0 if the condition fails.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-10 12:02:31 -05:00
kingbri
2f568ff573 Config: Expose auto GPU split reserve config
The GPU reserve is used as a VRAM buffer to prevent GPU overflow
when automatically deciding how to load a model on multiple GPUs.
Make this configurable.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 22:09:50 -05:00
kingbri
43bba526bf Model: Fix logprobs unwrapping
Take a log of the token probs since they're already normalized which
reflects the proper value. Also, don't error out if a token prob
doesn't exist in the dict and return None instead from zip.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 21:26:53 -05:00
kingbri
c7428f0bcd API: Add logprobs for chat completions
Adds chat completion logprob support using OAI's spec. Tokens are
not converted to tiktoken here since that will add an extra dependency
for no real reason.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 21:26:53 -05:00
kingbri
c02fe4d1db API: Fix response creation
Change chat completion and text completion responses to be more
flexible.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 21:26:53 -05:00