Commit Graph

117 Commits

Author SHA1 Message Date
kingbri
d8f7b93c54 Model: Fix fetching of draft args
These were mistakenly fetched from the parent kwargs instead of the
scoped draft_config variable.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-05 22:24:27 -05:00
DocShotgun
3f2fcbcc45 Add fallback to draft_rope_scale to 1.0 2023-12-05 18:51:36 -08:00
DocShotgun
39f7a2aabd Expose draft_rope_scale 2023-12-05 12:59:32 -08:00
Brian Dashore
e085b806e8 Merge pull request #22 from DocShotgun/main
Update colab, expose additional args
2023-12-05 01:22:33 -05:00
DocShotgun
67507105d0 Update colab, expose additional args
* Exposed draft model args for speculative decoding
* Exposed int8 cache, dummy models, and no flash attention
* Resolved CUDA 11.8 dependency issue
2023-12-04 22:20:46 -08:00
Brian Dashore
37f8f3ef8b Merge pull request #20 from veryamazinglystupid/main
make colab better, fix libcudart errors
2023-12-05 01:14:21 -05:00
kingbri
621e11b940 Update documentation
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-05 00:33:43 -05:00
kingbri
8ba3bfa6b3 API: Fix load exception handling
Models do not fully unload if an exception is caught in load. Therefore,
leave it to the client to unload on cancel.

Also add handlers in the event an SSE stream is cancelled. These packets
can't be sent back to the client since the client has severed the
connection, so print them in the terminal.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-05 00:23:15 -05:00
kingbri
7c92968558 API: Fix mistaken debug statement
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-04 18:07:12 -05:00
kingbri
5e54911cc8 API: Fix semaphore handling and chat completion errors
Chat completions previously always yielded a final packet to say that
a generation finished. However, this caused errors because a yield was
executed after GeneratorExit. The error is legitimate: Python's
garbage collector can't clean up the generator after exit since the
finally block still executes.

In addition, SSE endpoints close off the connection, so the finish packet
can only be yielded once the response has completed; skip the yield on
exception.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-04 15:51:25 -05:00
kingbri
30fc5b3d29 Merge branch 'main' of github.com:theroyallab/tabbyAPI 2023-12-03 22:55:51 -05:00
kingbri
ed6c962aad API: Fix sequential requests
FastAPI is kinda weird with queueing. If an await is used within an
async def, requests aren't executed sequentially. Restore sequential
requests by using a semaphore to limit concurrent execution from
generator functions.

Also scaffold the framework to move generator functions to their own
file.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-03 22:54:34 -05:00
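The semaphore approach described in this commit can be sketched as follows (the names here are illustrative, not the actual TabbyAPI code): a semaphore of size 1 forces incoming generation requests to run one at a time even though the route handlers are async.

```python
import asyncio

# Allow only one generation to run at a time; further requests queue up
# behind the semaphore instead of interleaving.
generate_semaphore = asyncio.Semaphore(1)

async def generate_with_semaphore(generator_func):
    """Run an async generator function while holding the semaphore."""
    async with generate_semaphore:
        async for chunk in generator_func():
            yield chunk
```

Because the semaphore is held for the full lifetime of the generator, a second request awaits until the first stream finishes.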
veryamazinglystupid
ad1a12a0f2 make colab better, fix libcudart errors
:3
2023-12-03 14:07:52 +05:30
DocShotgun
2a9e4ca051 Add Colab example
* Note: this uses wheels for Python 3.10 and torch 2.1.0+cu118, which is the current default in Colab
2023-12-03 02:21:51 -05:00
kingbri
e740b53478 Requirements: Update Flash Attention 2
Bump to 2.3.6

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-03 01:56:29 -05:00
kingbri
c67c9f6d66 Model + Config: Remove low_mem option
Low_mem doesn't work in exl2 and it was an experimental option to
begin with. Keep the loading code commented out in case it gets fixed
in the future.

A better alternative is to use 8bit cache which works and helps save
VRAM.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-03 01:07:42 -05:00
Brian Dashore
109e4223e0 Merge pull request #18 from DocShotgun/main
Add automatic NTK-aware alpha scaling to model
2023-12-03 01:06:50 -05:00
kingbri
27fc0c0069 Model: Cleanup and compartmentalize auto rope functions
Also handle an edge case where ratio <= 1, since NTK scaling is only
used for values > 1.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-03 01:05:09 -05:00
DocShotgun
bd2c5d0d09 Force auto-alpha to 1.0 if config ctx == base ctx 2023-12-02 21:19:59 -08:00
DocShotgun
1c398b0be7 Add automatic NTK-aware alpha scaling to model
* Enables automatic calculation of NTK-aware alpha scaling for models when the rope_alpha arg is not passed in the config, using the same formula as for draft models
2023-12-02 21:02:29 -08:00
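The auto-alpha calculation these commits describe can be sketched as below. The quadratic coefficients are the fit commonly used in exllamav2-based projects; treat them (and the function name) as illustrative rather than the exact repo code. The ratio <= 1 edge case falls back to 1.0, since NTK scaling only applies when the target context exceeds the base context.

```python
def calculate_rope_alpha(base_seq_len: int, target_seq_len: int) -> float:
    """Estimate a RoPE NTK-aware alpha for a target context length.

    Uses a quadratic fit of alpha against the context ratio; for
    ratio <= 1 no scaling is needed, so alpha is forced to 1.0.
    """
    ratio = target_seq_len / base_seq_len
    if ratio <= 1:
        return 1.0
    return -0.13436 + 0.80541 * ratio + 0.28833 * ratio ** 2
```

For example, doubling a 4096-token base context yields an alpha of roughly 2.63 under this fit.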
kingbri
61f6e51fdb OAI: Add separator style fallback
Some models may return None for separator style with FastChat. Fall
back to LLAMA2 if this is the case.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-01 23:30:19 -05:00
kingbri
ae69b18583 API: Use FastAPI streaming instead of sse_starlette
sse_starlette kept firing a ping response if it was taking too long
to set an event. Rather than using a hacky workaround, switch to
FastAPI's inbuilt streaming response and construct SSE requests with
a utility function.

This helps the API become more robust and removes an extra requirement.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-01 01:54:35 -05:00
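The utility function mentioned in this commit amounts to formatting each payload as a server-sent-events `data:` packet and yielding it through FastAPI's built-in StreamingResponse. A minimal sketch (the helper name is an assumption):

```python
import json

def get_sse_packet(payload: dict) -> str:
    """Serialize a payload into the SSE wire format: a 'data:' line
    terminated by a blank line."""
    return f"data: {json.dumps(payload)}\n\n"

# In a FastAPI route this would be returned roughly as:
#   return StreamingResponse(generator(), media_type="text/event-stream")
```

Building packets by hand avoids sse_starlette's automatic ping events while keeping the wire format clients expect.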
kingbri
6493b1d2aa OAI: Add ability to send dummy models
Some APIs require an OAI model to be sent against the models endpoint.
Fix this by adding a GPT 3.5 turbo entry first in the list to cover
as many APIs as possible.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-01 00:27:28 -05:00
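The dummy-model idea can be sketched as follows (function and field names are illustrative, not the repo's actual code): list a `gpt-3.5-turbo` entry first so clients that hard-require an OpenAI model name still work against the models endpoint.

```python
def get_model_list(current_model: str) -> dict:
    """OAI-style models response with a dummy gpt-3.5-turbo entry first."""
    names = ["gpt-3.5-turbo", current_model]
    return {
        "object": "list",
        "data": [{"id": name, "object": "model"} for name in names],
    }
```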
kingbri
aef411bed5 OAI: Fix chat completion streaming
The OAI spec requires chat completions to provide a finish reason
once streaming has completed. This is different from a non-streaming
chat completion response.

Also fix some errors that were raised from the endpoint.

References #15

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-01 00:14:24 -05:00
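In OAI streaming, intermediate chunks carry a null `finish_reason` and only the final chunk sets it. A sketch of building that final chunk (the helper name is hypothetical):

```python
def build_final_chunk(model: str, reason: str = "stop") -> dict:
    """Final streamed chat-completion chunk: empty delta, finish_reason set."""
    return {
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{"index": 0, "delta": {}, "finish_reason": reason}],
    }
```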
Brian Dashore
c4d8c901e1 Merge pull request #13 from ziadloo/main
Adding the usage stat support (prompt_tokens, completion_tokens, and total_tokens)
2023-11-30 01:57:44 -05:00
kingbri
8a5ac5485b Model: Fix rounding
generated_tokens is always a whole number.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-30 01:55:46 -05:00
kingbri
e703c716ee Merge branch 'main' of https://github.com/ziadloo/tabbyAPI into ziadloo-main 2023-11-30 01:01:48 -05:00
kingbri
56f9b1d1a8 API: Add generator error handling
If the generator errors, there's no proper handling to send an error
packet and close the connection.

This is especially important for unloading models if the load fails
at any stage to reclaim a user's VRAM. Raising an exception caused
the model_container object to lock and not get freed by the GC.

It made sense to propagate SSE errors across all generator functions
rather than relying on abort signals.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-30 00:37:48 -05:00
kingbri
2bc3da0155 YAML: Force all files to open with utf8
The default encoding when opening files on Windows is cp1252,
which doesn't support all of Unicode and can cause issues.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-29 22:04:29 -05:00
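The fix described above amounts to always passing an explicit encoding when opening the YAML files, since Windows otherwise defaults to cp1252. A minimal stdlib-only sketch (the helper name is illustrative):

```python
def read_text_utf8(path: str) -> str:
    """Open with an explicit encoding; without it, Windows defaults to
    cp1252 and non-Latin text can raise UnicodeDecodeError."""
    with open(path, encoding="utf-8") as f:
        return f.read()
```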
kingbri
3957316b79 Revert "API: Rename repetition_decay -> repetition_slope"
This reverts commit cad144126f.

Change this parameter back to repetition_decay. This is different from
rep_pen_slope used in other backends such as kobold and NAI.

Still keep the fallback condition though.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-29 22:03:45 -05:00
kingbri
94696543bc Model: Warn user if context > max_seq_len
Unlike other backends, tabby attempts to generate even if the context
is greater than the max sequence length, by truncating the given
context.

Rather than artificially erroring out, give a warning that the console
metrics output will be incorrect and that the user should make sure
context <= max_seq_len.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-29 01:35:32 -05:00
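The warn-instead-of-error behavior can be sketched as below (names are illustrative): the check flags an oversized prompt but lets generation proceed with a truncated context.

```python
import logging

def check_context_length(prompt_tokens: int, max_seq_len: int) -> bool:
    """Warn rather than error when the prompt exceeds max_seq_len.

    Generation proceeds with a truncated context, but reported console
    metrics will be off. Returns True when the context fits.
    """
    if prompt_tokens > max_seq_len:
        logging.warning(
            "Context length %d exceeds max_seq_len %d; console metrics "
            "will be inaccurate.", prompt_tokens, max_seq_len,
        )
        return False
    return True
```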
kingbri
cad144126f API: Rename repetition_decay -> repetition_slope
Also fix the fallback to use 0 for sanity checking and validation.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-29 01:13:05 -05:00
kingbri
5cbf7f13da OAI: Fix repetition range
Alias repetition_penalty_range to repetition_range since that's used
as an internal variable. Perhaps in the future there should be a function
that iterates through request aliases and gives a default value.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-29 00:53:19 -05:00
Mehran Ziadloo
b0c42d0f05 Leveraging local variables 2023-11-27 20:56:56 -08:00
Mehran Ziadloo
ead503c75b Adding token usage support 2023-11-27 20:05:05 -08:00
kingbri
44e7f7b0ee Update README
Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-25 23:47:48 -05:00
Brian Dashore
0914bc313f Merge pull request #12 from DocShotgun/main
Add start-up shell script for Linux
2023-11-25 00:29:47 -05:00
kingbri
d929e0c826 API: Fix error points and exceptions
On /v1/model/load, some internal server errors weren't being sent,
so move the directory check out and also add a check to make sure
the proposed model path exists.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-25 00:27:02 -05:00
DocShotgun
cffd20f580 Add start-up shell script for Linux
- Requires the user to have already installed the prerequisites in a venv
2023-11-23 19:03:52 -08:00
kingbri
d47c39da54 API: Don't include draft directory in response
The draft directory should be returned for a draft model request (TBD).

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-23 00:07:56 -05:00
kingbri
13c9c09398 Update README
Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-22 00:20:21 -05:00
kingbri
d25310e55d Requirements: Update Flash Attention 2
Use 2.3.4 from tgw. However, keep the 2.3.3 wheels in requirements
in case the newer wheels don't work.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-21 22:12:55 -05:00
kingbri
71b9a53336 API: Add temperature_last support
Documented in previous commits. Also, for version checking, check the
value in kwargs instead of whether the key is present, since requests
pass default values.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-21 21:20:59 -05:00
turboderp
3337fe6acc Warning if unsupported samplers are used 2023-11-21 18:35:22 +01:00
turboderp
a54de11cf3 Add new samplers 2023-11-21 18:16:53 +01:00
kingbri
c92ee24bb4 Tree: Add batch script
A simple batch script to activate a venv and start TabbyAPI. This
can be used with nssm in Windows for a systemd-like background service.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-20 01:48:06 -05:00
kingbri
2aa9c145be Auth: Fix an oops with headers
I copy pasted the code wrong.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-20 00:16:12 -05:00
kingbri
39ea730be5 Auth: Allow admin keys to work with api key routes
Admin keys carry administrator permissions, so it makes sense to allow
them for API key routes as well.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-19 23:53:07 -05:00
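The key check described above reduces to accepting either credential on API-key routes, since admin permissions are a superset. A sketch (names are illustrative, not the repo's auth code):

```python
def check_api_key(header_key: str, api_key: str, admin_key: str) -> bool:
    """API-key routes accept the API key or the admin key; admin-only
    routes would check against admin_key alone."""
    return header_key in (api_key, admin_key)
```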
turboderp
8ef730f016 Merge pull request #11 from veden/patch-1
Fix incorrect ratio calculation for draft model
2023-11-20 04:23:34 +01:00
Veden
f960fac8ff Fix incorrect ratio calculation for draft model 2023-11-19 13:12:53 -08:00