Commit Graph

319 Commits

kingbri
cad72315f4 Init: Switch to display redoc endpoint
Redoc looks much better than Swagger docs, so show that by default.
Both endpoints still exist.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-08 01:00:48 -05:00
kingbri
ef2dc326f5 Logging: Fix inconsistent formatting
Some colorization was incorrect, and separator insertion is now
more robust.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-08 01:00:48 -05:00
kingbri
228c227c1e Logging: Switch to loguru
Loguru is a flexible logger that allows for easier hooking and integrates
with Rich without problems. It also makes progress bars stick to the
bottom of the terminal window.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-08 01:00:48 -05:00
kingbri
fe0ff240e7 Progress: Switch to Rich
Rich is a more mature library for displaying progress bars, logging,
and console output. This should help properly align progress bars
within the terminal.

Side note: "We're Rich!"

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-08 01:00:48 -05:00
kingbri
39617adb65 Requirements: Update Exllamav2
v0.0.15

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-06 22:29:55 -05:00
Brian Dashore
47c42a23d4 Merge pull request #72 from djmaze/patch-1
Remove explicit install of pytorch & exllamav2 in Dockerfile
2024-03-06 01:13:37 -05:00
kingbri
9a007c4707 Model: Add support for Q4 cache
Add this in addition to the existing 8-bit and 16-bit caches. Passing "Q4" with
the cache_mode request parameter will set this on model load.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-06 00:59:28 -05:00
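
As a rough illustration of the cache_mode parameter described in the commit above, here is a minimal client-side load-request sketch; the endpoint path, model name, and auth header are assumptions, and only cache_mode="Q4" comes from the commit message.

```python
# Hypothetical load request that selects the Q4 cache on model load.
import requests

payload = {
    "name": "my-exl2-model",  # assumed model folder name
    "cache_mode": "Q4",       # alongside the existing 8-bit and 16-bit modes
}

response = requests.post(
    "http://localhost:5000/v1/model/load",         # assumed host, port, and route
    headers={"x-admin-key": "example-admin-key"},  # assumed admin auth header
    json=payload,
    timeout=60,
)
print(response.status_code, response.text)
```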
kingbri
0b25c208d6 API: Fix error reporting
Disconnect consistently on a load error. It should be safer to
warn the user to run unload (or re-run load) if a model does not
load correctly.

Also don't log the traceback for request errors that don't have one.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-05 18:16:02 -05:00
kingbri
165cc6fc2d API: Remove unnecessary endpoint
This used to be a shim for ooba, but it's no longer necessary.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-04 23:21:40 -05:00
kingbri
d2c6ae2d35 API: Back to async
According to the FastAPI docs, if you're using a generic function, running
it as async will make it more performant (which makes sense, since
plain def route functions automatically run the caller
through a threadpool).

Tested and everything works fine.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-04 23:21:40 -05:00
kingbri
b0c295dd2f API: Add more methods to semaphore
The semaphore/queue model for Tabby is as follows:
- Any load requests go through the semaphore by default
- Any load request can include the skip_queue parameter to bypass
the semaphore
- Any unload requests are immediately executed
- All completion requests are placed inside the semaphore by default

This model preserves the parallelism of single-user mode with extra
convenience methods for queues in multi-user. It also helps mitigate
problems that were previously present in the concurrency stack.

Also change how the program's loop runs so it exits when the API thread
dies.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-04 23:21:40 -05:00
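
The queueing model above can be sketched with a single asyncio semaphore; this is an illustrative toy rather than the project's code, and run_with_semaphore / skip_queue are stand-in names.

```python
import asyncio

generate_semaphore = asyncio.Semaphore(1)

async def run_with_semaphore(task, skip_queue: bool = False):
    if skip_queue:
        return await task()          # load requests may bypass the queue
    async with generate_semaphore:   # loads and completions normally wait here
        return await task()

async def fake_generation():
    await asyncio.sleep(0.1)
    return "done"

async def main():
    # Two queued tasks plus one that skips the queue entirely
    results = await asyncio.gather(
        run_with_semaphore(fake_generation),
        run_with_semaphore(fake_generation),
        run_with_semaphore(fake_generation, skip_queue=True),
    )
    print(results)

asyncio.run(main())
```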
kingbri
c82697fef2 API: Fix issues with concurrent requests and queueing
This is the first of many future commits that will overhaul the API
to be more robust and concurrent. The model is admin-first, where the
admin can do anything in case something goes awry.

Previously, calls to long running synchronous background tasks would
block the entire API, making it ignore any terminal signals until
generation is completed.

To fix this, leverage FastAPI's run_in_threadpool to offload the long
running tasks to another thread. However, signals to abort the process
still kept the background thread running and made the terminal hang.

This was due to an issue with Uvicorn not propagating the SIGINT signal
across threads in its event loop. To fix this in a catch-all way, run
the API processes in a separate thread so the main thread can still
kill the process if needed.

In addition, make request error logging more robust and refer to the
console for full error logs rather than creating a long message on the
client-side.

Finally, add state checks to see if a model is fully loaded before
generating a completion.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-04 23:21:40 -05:00
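
A hedged sketch of the two fixes described above: offloading a blocking generator with FastAPI's run_in_threadpool, and running the Uvicorn server in a worker thread so the main thread stays free to handle SIGINT. The route, port, and generate_blocking function are illustrative assumptions.

```python
import threading
import time

import uvicorn
from fastapi import FastAPI
from fastapi.concurrency import run_in_threadpool

app = FastAPI()

def generate_blocking(prompt: str) -> str:
    time.sleep(1)  # stand-in for a long synchronous generation
    return f"echo: {prompt}"

@app.post("/v1/completions")
async def completions(prompt: str):
    # Run the synchronous generator without blocking the event loop
    return {"text": await run_in_threadpool(generate_blocking, prompt)}

if __name__ == "__main__":
    api_thread = threading.Thread(
        target=lambda: uvicorn.run(app, host="127.0.0.1", port=5000),
        daemon=True,
    )
    api_thread.start()
    while api_thread.is_alive():   # main thread remains able to catch Ctrl+C
        api_thread.join(timeout=0.5)
```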
Brian Dashore
de91eade4b Merge pull request #75 from DocShotgun/main
Additional clarification for override_base_seq_len
2024-03-03 01:30:45 -05:00
DocShotgun
8245488926 Additional clarification for override_base_seq_len 2024-03-02 09:29:50 -08:00
Martin Honermeyer
4afb4137f7 Remove explicit pytorch & exllamav2 in Dockerfile
These packages are already installed via requirements.txt.
2024-02-25 18:03:01 +01:00
kingbri
fc857893ee Model: Remove Exllamav2 patches
These classes are in the newest version now.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-24 23:40:11 -05:00
kingbri
73a1d9ef78 Model: Fix imports
Use the standard import ordering.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-24 23:40:11 -05:00
kingbri
f6d749c771 Model: Add EBNF grammar support
Using the Outlines library, add support for supplying EBNF strings and
passing them to the library for parsing.

From there, a wrapper is created and a filter is passed to generation.

Replace this with a more flexible in-house solution at some point.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-24 23:40:11 -05:00
kingbri
57b3d69949 API + Model: Add support for JSON schema constraints
Add the ability to constrain the return value of a model to be JSON,
built on the JSON Schema standard to define the properties of what
the model should return.

This feature should be more accurate than using GBNF/EBNF to yield
the same results due to the use of lmformatenforcer.

GBNF/EBNF will be added in a different commit/branch.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-24 23:40:11 -05:00
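
For context, a minimal sketch of building a JSON-schema constraint with lm-format-enforcer; the schema is a made-up example, and wiring the parser into an ExLlamaV2 sampling filter (as the commit describes) is omitted since that glue is project-specific.

```python
from lmformatenforcer import JsonSchemaParser

answer_schema = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["answer"],
}

# The parser tracks which characters are allowed next while decoding,
# which is what a generation-time filter would consume.
parser = JsonSchemaParser(answer_schema)
print(parser.get_allowed_characters()[:20])
```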
kingbri
ccd41d720d Requirements: Bump ExllamaV2
v0.0.14

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-24 12:26:08 -05:00
kingbri
360802762c Model: Fix logit bias token checks
Accidentally checked against the token bias tensor, which didn't contain
the token IDs. Check whether the index exists in the id_to_piece list
instead.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-22 21:44:15 -05:00
kingbri
5a23b9ebc9 Tree: Format
Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-22 01:28:30 -05:00
kingbri
bee26a2f2c API: Auto-unload on a load request
Automatically unload the existing model when calling /load. This was
requested many times, and does make more sense in the long run.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-21 23:00:11 -05:00
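
A toy sketch of the auto-unload behavior: the load path drops any currently loaded model before loading the new one. All names here are hypothetical.

```python
current_model = None

async def unload_model():
    global current_model
    current_model = None

async def load_model(name: str):
    global current_model
    if current_model is not None:
        await unload_model()   # previously the caller had to do this explicitly
    current_model = name       # stand-in for the real load
```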
kingbri
368eb2e2d9 Update README
Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-20 00:19:31 -05:00
kingbri
a19a4eb1be Model: Format
Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-18 18:31:31 -05:00
kingbri
7def32e4de Model: Fix logit bias handling
If the token doesn't exist, gracefully warn instead of erroring out.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-18 18:30:58 -05:00
kingbri
aa34b2e5fd Model: Prefer auto over manual GPU split
For safety reasons, always use auto unless a manual split is provided
and auto is forced off.

If auto is forced off and a manual split isn't provided, a manual
split will be attempted.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-17 00:21:48 -05:00
kingbri
ea00a6bd45 Requirements: Update Exllamav2
Update to v0.0.13.post2

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-14 21:51:25 -05:00
kingbri
cce97deea5 Model: Switch logprobs to use post-sampling
Previously, pre-sampling logprobs were used from the raw logits,
but newer versions of exl2 allow for returning token probs post-sampling.
Convert these to logprobs and send to the user.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-14 21:51:25 -05:00
kingbri
949248fb94 Config: Add experimental torch cuda malloc backend
This option saves some VRAM but has a chance of erroring out.
Add this in the experimental config section.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-14 21:45:56 -05:00
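
Roughly, the experimental backend boils down to setting a documented PyTorch environment variable before CUDA is initialized; the config flag name below is a stand-in.

```python
import os

use_cuda_malloc_backend = True  # stand-in for the experimental config option

if use_cuda_malloc_backend:
    # Documented PyTorch setting; must be in place before CUDA is initialized
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "backend:cudaMallocAsync"

import torch  # imported after the environment variable is set
```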
kingbri
664e2c417e Model: Fix GPU split args loading
Autosplit was overwriting a manual GPU split if the YAML parameter
wasn't set.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-14 17:42:20 -05:00
kingbri
a79c42ff4c Sampling: Make validators simpler
Injecting into Pydantic fields caused issues with serialization for
documentation rendering. Rather than reinvent the wheel again,
switch to a chain of if statements for now. This may change in the
future if subclasses from the base sampler request need to be
validated as well.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-11 15:28:43 -05:00
kingbri
f627485534 OAI: Fix completion token fetching
The generator returns generated_tokens in the dict.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-11 01:12:13 -05:00
kingbri
7e730e3507 Sampling: Add universal validation system
Rather than maintaining yet another function to validate sampler
ranges/values, embed them in fields, which allows for less
maintenance in the future.

Also add validation for existing samplers that can corrupt
the sampling stack if set improperly.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-10 14:59:23 -05:00
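
A hedged sketch of validation embedded directly in Pydantic fields, as described above; the field names and bounds are illustrative rather than the project's actual sampler parameters.

```python
from pydantic import BaseModel, Field

class SamplerParams(BaseModel):
    temperature: float = Field(default=1.0, ge=0.0)
    top_p: float = Field(default=1.0, ge=0.0, le=1.0)
    top_k: int = Field(default=0, ge=0)

params = SamplerParams(top_p=0.9)   # passes validation
# SamplerParams(top_p=1.5)          # would raise a ValidationError
print(params)
```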
kingbri
9f1d891490 Packages: Fix exllamav2 version check
Post versions are ok to use for checking if the user is on the correct
exllamav2 wheel.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-10 14:00:26 -05:00
kingbri
8d8cf5dc69 Model: Fix dynatemp fallback
Set to 1.0 if the condition fails.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-10 12:02:31 -05:00
Brian Dashore
17636ed899 Create pull request template
Asks users to give more information when submitting a pull request.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-09 14:53:29 -05:00
Brian Dashore
c3601bdd18 Issues: Disable blank issues
Users must follow the appropriate issue templates

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-09 14:48:03 -05:00
Brian Dashore
aa56ff829f Add issue templates
Creates templates for issues to help guide users in the right direction when making a bug report or request.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-09 14:43:33 -05:00
kingbri
2f568ff573 Config: Expose auto GPU split reserve config
The GPU reserve is used as a VRAM buffer to prevent GPU overflow
when automatically deciding how to load a model on multiple GPUs.
Make this configurable.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 22:09:50 -05:00
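
As a small illustration, the reserve can be thought of as a per-GPU megabyte value converted to bytes before the auto split runs; the config key and list shape below are assumptions.

```python
autosplit_reserve_mb = [512]  # e.g. hold back 512 MB on the first GPU

# Convert to bytes; this list would then be handed to the auto GPU split
# routine at model load time.
autosplit_reserve_bytes = [int(mb * 1024 ** 2) for mb in autosplit_reserve_mb]
print(autosplit_reserve_bytes)
```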
kingbri
43bba526bf Model: Fix logprobs unwrapping
Take the log of the token probs, since they're already normalized, which
reflects the proper value. Also, don't error out if a token prob
doesn't exist in the dict; return None from zip instead.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 21:26:53 -05:00
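
The conversion described above amounts to taking the log of an already normalized probability and tolerating missing tokens; a small sketch with made-up values:

```python
import math

tokens = ["Hello", ",", " world"]
token_probs = {"Hello": 0.62, " world": 0.91}  # "," is intentionally missing

logprobs = [
    math.log(token_probs[t]) if token_probs.get(t) else None
    for t in tokens
]
print(list(zip(tokens, logprobs)))
# [('Hello', -0.478...), (',', None), (' world', -0.094...)]
```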
kingbri
c7428f0bcd API: Add logprobs for chat completions
Adds chat completion logprob support using OAI's spec. Tokens are
not converted to tiktoken here since that would add an extra dependency
for no real reason.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 21:26:53 -05:00
kingbri
c02fe4d1db API: Fix response creation
Change chat completion and text completion responses to be more
flexible.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 21:26:53 -05:00
kingbri
0af6a38af3 Model: Add logprobs support
Returns token offsets, selected tokens, probabilities of tokens
post-sampling, and normalized probability of selecting a token
pre-sampling (for efficiency purposes).

Only for text completions. Chat completions in a later commit.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 21:26:53 -05:00
kingbri
2642ef7156 OAI: Update logprobs type
Some logprobs may not exist, so make the type optional

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 21:26:53 -05:00
kingbri
284f20263f API: Clean up tokenizing endpoint
Split the get tokens function into separate wrapper encode and decode
functions for overall code cleanliness.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 21:26:53 -05:00
AliCat
bb48f77ca1 Neutralize samplers (#59)
* Update sample_preset.yml

Neutralized the samplers.

* Sampling: Fix dynatemp defaults

Default max temp and min temp are 1.0

* Sampling: Fix TFS defaults

Default is 1.0

---------

Co-authored-by: AliCat <86847834+alicat22@users.noreply.github.com>
Co-authored-by: kingbri <bdashore3@proton.me>
2024-02-08 00:23:09 -05:00
kingbri
321c9a1ea9 Requirements: Fix FA2 version number
The URL wasn't edited correctly

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-07 21:37:30 -05:00
kingbri
58590a6c57 Config: Add option to force streaming off
Many API clients automatically ask for request streaming without giving
the user the option to turn it off. Therefore, give the user more
freedom by providing a server-side kill switch.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-07 21:09:59 -05:00
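
A sketch of that server-side kill switch; the flag and function names are assumptions, but the behavior mirrors the commit message: the server can force streaming off even if the client requested it.

```python
disable_request_streaming = True  # stand-in for the server config option

def resolve_stream_flag(requested_stream: bool) -> bool:
    if disable_request_streaming:
        return False              # kill switch overrides the client request
    return requested_stream

print(resolve_stream_flag(True))  # -> False while the kill switch is enabled
```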
kingbri
d0027bce32 Requirements: Update flash attention 2 for Windows
Version 2.5.2

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-07 20:44:23 -05:00