Commit Graph

60 Commits

Author SHA1 Message Date
Aaron Veden
f53c98db94 Templates: Added automatic detection of chat templates from tokenizer_config.json 2023-12-20 22:45:55 -08:00
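The detection described in this commit might look roughly like the sketch below. The function name is illustrative, but "chat_template" is the standard key that Hugging Face tokenizers write to tokenizer_config.json:

```python
import json
from pathlib import Path

def find_chat_template(model_dir):
    """Return the chat template bundled with a model, if any.

    Function name is illustrative, not tabbyAPI's actual helper;
    "chat_template" is the conventional Hugging Face key in
    tokenizer_config.json.
    """
    config_path = Path(model_dir) / "tokenizer_config.json"
    if not config_path.exists():
        return None
    config = json.loads(config_path.read_text())
    # Returns None if the model ships no template
    return config.get("chat_template")
```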
kingbri
5728b9fffb Model: Don't error out if a generation is empty
When stream is false, the generation can be empty, which means
there are no chunks present in the final generation array, causing
an error.

Instead, return a dummy value if the generation is falsy (empty array
or None)

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-20 00:51:33 -05:00
kingbri
ab10b263fd Model: Add override base seq len
Some models (such as mistral and mixtral) set their base sequence
length to 32k due to assumptions of support for sliding window
attention.

Therefore, add this parameter to override the base sequence length
of a model which helps with auto-calculation of rope alpha.

If auto-calculation of rope alpha isn't being used, the max_seq_len
parameter works fine as is.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-20 00:45:39 -05:00
kingbri
ce2602df9a Model: Fix max seq len handling
Previously, the max sequence length was overridden by the user's
config and never took the model's config.json into account.

Now, set the default to 4096, but include config.prepare when
selecting the max sequence length. The yaml and API request
now serve as overrides rather than parameters.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-19 23:37:52 -05:00
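The precedence this commit describes can be sketched as follows. All names here are illustrative: `unwrap` mirrors the helper introduced in a later commit in this log, and `max_position_embeddings` is the conventional config.json key, assumed rather than confirmed against the repo:

```python
def unwrap(value, default=None):
    # Return value unless it is None (falsy values like 0 or False pass through)
    return value if value is not None else default

def get_max_seq_len(model_json: dict, yaml_value=None, request_value=None) -> int:
    """Sketch of the override precedence, not tabbyAPI's actual code:
    API request > yaml config > model config.json > default of 4096.
    """
    from_model = model_json.get("max_position_embeddings")
    return unwrap(request_value, unwrap(yaml_value, unwrap(from_model, 4096)))
```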
kingbri
d3246747c0 Templates: Attempt loading from model config
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-19 22:58:47 -05:00
kingbri
0a144688c6 Templates: Add clarity statements
Lets the user know when a file-not-found error (OSError) occurs,
and prints the applied template on model load.

Also fix some remaining references to fastchat.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-19 08:13:04 -05:00
kingbri
c3f7898967 OAI: Add logit bias support
Use exllamav2's token bias which is the functional equivalent of
OAI's logit bias parameter.

Strings are cast to integers on request, and an error is raised if
an invalid value is passed.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-18 23:53:47 -05:00
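A minimal sketch of the casting behavior described here, using a hypothetical helper name (OAI's logit_bias maps token-ID strings to bias values):

```python
def parse_logit_bias(logit_bias: dict) -> dict:
    """Convert OAI-style logit_bias keys (token IDs as strings) to ints.

    Name and shape are illustrative, not tabbyAPI's actual implementation.
    Raises ValueError if a key is not a valid integer token ID.
    """
    parsed = {}
    for token, bias in logit_bias.items():
        try:
            parsed[int(token)] = float(bias)
        except ValueError:
            raise ValueError(f"Invalid logit bias token ID: {token!r}")
    return parsed
```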
kingbri
7cbc08fc72 Templates: Add auto-detection from path
This replicates FastChat's model path detection.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-18 23:53:47 -05:00
kingbri
f631dd6ff7 Templates: Switch to Jinja2
Jinja2 is a lightweight template engine that Transformers uses for
parsing chat completions. It's much more efficient than FastChat
and can be installed as part of requirements.

Also allows for unblocking Pydantic's version.

Users now have to provide their own template if needed. A separate
repo may be usable for common prompt template storage.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-18 23:53:47 -05:00
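As a rough illustration of the Jinja2 approach (the template below is a hypothetical ChatML-style example, not one shipped with the project):

```python
from jinja2 import Template

# Hypothetical ChatML-style template; users supply their own per this commit
template_str = (
    "{% for message in messages %}"
    "<|im_start|>{{ message['role'] }}\n"
    "{{ message['content'] }}<|im_end|>\n"
    "{% endfor %}"
)

# Render a prompt from OAI-style chat messages
prompt = Template(template_str).render(
    messages=[{"role": "user", "content": "Hello"}]
)
```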
kingbri
95fd0f075e Model: Fix no flash attention
Was being called wrong from config.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-17 23:31:58 -05:00
kingbri
ad8807a830 Model: Add support for num_experts_by_token
New parameter that's safe to edit in exllamav2 v0.0.11. Only recommended
for people who know what they're doing.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-17 18:03:01 -05:00
kingbri
1d0bdfa77c Model + OAI: Fix parameter parsing
Rope alpha changes don't require removing the 1.0 default
from Rope scale.

Keep defaults when possible to avoid errors.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-17 14:28:18 -05:00
kingbri
eb8ccb9783 Tree: Fix linter issues
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-12 23:58:19 -05:00
kingbri
083df7d585 Tree: Add generation logging support
Generations can be logged in the console along with sampling parameters
if the user enables it in config.

Metrics are always logged at the end of each prompt. In addition,
the model endpoint tells the user if they're being logged or not
for transparency purposes.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-12 23:43:35 -05:00
kingbri
db87efde4a OAI: Add ability to specify fastchat prompt template
Sometimes fastchat may not be able to detect the prompt template from
the model path. Therefore, add the ability to set it in config.yml or
via the request object itself.

Also send the provided prompt template on model info request.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-10 15:43:58 -05:00
kingbri
fd9f3eac87 Model: Add params to current model endpoint
Grabs the current model rope params, max seq len, and the draft model
if applicable.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-10 00:40:56 -05:00
kingbri
0f4290f05c Model: Format
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-09 22:48:42 -05:00
kingbri
5ae2a91c04 Tree: Use unwrap and coalesce for optional handling
Python doesn't have proper handling of optionals. The only way to
handle them is to check via an if statement whether the value is None,
or to use the "or" keyword to unwrap optionals.

Previously, I used the "or" method to unwrap, but this caused issues
due to falsy values falling back to the default. This is especially
the case with booleans, where "False" changed to "True".

Instead, add two new functions: unwrap and coalesce. Both properly
implement None coalescing in a functional way.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-09 21:52:17 -05:00
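The two helpers described here might look like this minimal sketch (signatures assumed, not copied from the repo):

```python
def unwrap(value, default=None):
    """Return value unless it is None; falsy values like False pass through."""
    return value if value is not None else default

def coalesce(*values):
    """Return the first value that is not None (or None if all are)."""
    return next((v for v in values if v is not None), None)
```

This fixes the pitfall from the commit body: `kwargs.get("ban_eos_token") or True` silently flips an explicit False to True, whereas `unwrap(kwargs.get("ban_eos_token"), True)` preserves it.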
DocShotgun
7380a3b79a Implement lora support (#24)
* Model: Implement basic lora support

* Add ability to load loras from config on launch
* Supports loading multiple loras and lora scaling
* Add function to unload loras

* Colab: Update for basic lora support

* Model: Test vram alloc after lora load, add docs

* Git: Add loras folder to .gitignore

* API: Add basic lora-related endpoints

* Add /loras/ endpoint for querying available loras
* Add /model/lora endpoint for querying currently loaded loras
* Add /model/lora/load endpoint for loading loras
* Add /model/lora/unload endpoint for unloading loras
* Move lora config-checking logic to main.py for better compat with API endpoints

* Revert bad CRLF line ending changes

* API: Add basic lora-related endpoints (fixed)

* Add /loras/ endpoint for querying available loras
* Add /model/lora endpoint for querying currently loaded loras
* Add /model/lora/load endpoint for loading loras
* Add /model/lora/unload endpoint for unloading loras
* Move lora config-checking logic to main.py for better compat with API endpoints

* Model: Unload loras first when unloading model

* API + Models: Cleanup lora endpoints and functions

Condenses down the endpoint and model load code. Also makes the routes
behave the same way as the model routes to avoid confusing the end user.

Signed-off-by: kingbri <bdashore3@proton.me>

* Loras: Optimize load endpoint

Return successes and failures along with consolidating the request
to the rewritten load_loras function.

Signed-off-by: kingbri <bdashore3@proton.me>

---------

Co-authored-by: kingbri <bdashore3@proton.me>
Co-authored-by: DocShotgun <126566557+DocShotgun@users.noreply.github.com>
2023-12-08 23:38:08 -05:00
kingbri
fa1e99daf6 Model: Remove unused print statement
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-07 21:13:52 -05:00
kingbri
6a71890d45 Model: Fix sampler bugs
Lots of bugs were unearthed when switching to the new fallback changes.
Fix them and make sure samplers are being set properly.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-06 17:29:58 -05:00
kingbri
4c0e686e7d Model: Cleanup and fix fallbacks
Use the standard "dict.get("key") or default" to handle fetching values
from kwargs and get a fallback value without possible errors.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-05 23:28:16 -05:00
kingbri
d8f7b93c54 Model: Fix fetching of draft args
Mistakenly fetched these from parent kwargs instead of the scoped
draft_config var.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-05 22:24:27 -05:00
DocShotgun
3f2fcbcc45 Add fallback to draft_rope_scale to 1.0 2023-12-05 18:51:36 -08:00
DocShotgun
39f7a2aabd Expose draft_rope_scale 2023-12-05 12:59:32 -08:00
kingbri
c67c9f6d66 Model + Config: Remove low_mem option
Low_mem doesn't work in exl2 and it was an experimental option to
begin with. Keep the loading code commented out in case it gets fixed
in the future.

A better alternative is to use 8bit cache which works and helps save
VRAM.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-03 01:07:42 -05:00
kingbri
27fc0c0069 Model: Cleanup and compartmentalize auto rope functions
Also handle the edge case where ratio <= 1, since NTK scaling is
only used for values > 1.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-03 01:05:09 -05:00
DocShotgun
bd2c5d0d09 Force auto-alpha to 1.0 if config ctx == base ctx 2023-12-02 21:19:59 -08:00
DocShotgun
1c398b0be7 Add automatic NTK-aware alpha scaling to model
* enables automatic calculation of NTK-aware alpha scaling for models if the rope_alpha arg is not passed in the config, using the same formula used for draft models
2023-12-02 21:02:29 -08:00
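The auto-scaling in this commit and the ratio <= 1 edge case handled in the cleanup commit above might look roughly like the sketch below. The quadratic coefficients are an assumption based on fits seen in exllama-derived projects, not verified against this repo:

```python
def calculate_rope_alpha(base_seq_len: int, target_seq_len: int) -> float:
    """Estimate NTK-aware rope alpha for extending context (illustrative).

    The quadratic fit below is an assumption, not a verified constant
    from tabbyAPI's source.
    """
    ratio = target_seq_len / base_seq_len
    # NTK scaling only applies when extending beyond the base context
    if ratio <= 1:
        return 1.0
    return -0.13436 + 0.80541 * ratio + 0.28833 * ratio ** 2
```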
kingbri
ae69b18583 API: Use FastAPI streaming instead of sse_starlette
sse_starlette kept firing a ping response if it was taking too long
to set an event. Rather than using a hacky workaround, switch to
FastAPI's inbuilt streaming response and construct SSE requests with
a utility function.

This helps the API become more robust and removes an extra requirement.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-01 01:54:35 -05:00
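The SSE utility function mentioned here can be sketched in a few lines (the helper name is an assumption; the "data: ...\n\n" framing is the Server-Sent Events wire format):

```python
import json

def get_sse_packet(data: dict) -> str:
    """Format one Server-Sent Events frame.

    Each SSE event is a "data:" line followed by a blank line.
    Name is illustrative, not necessarily tabbyAPI's helper.
    """
    return f"data: {json.dumps(data)}\n\n"
```

An async generator yielding such frames can then be passed to FastAPI's `StreamingResponse` with `media_type="text/event-stream"`, which is the inbuilt streaming response the commit refers to.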
kingbri
8a5ac5485b Model: Fix rounding
generated_tokens is always a whole number.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-30 01:55:46 -05:00
kingbri
e703c716ee Merge branch 'main' of https://github.com/ziadloo/tabbyAPI into ziadloo-main 2023-11-30 01:01:48 -05:00
kingbri
3957316b79 Revert "API: Rename repetition_decay -> repetition_slope"
This reverts commit cad144126f.

Change this parameter back to repetition_decay. This is different from
rep_pen_slope used in other backends such as kobold and NAI.

Still keep the fallback condition though.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-29 22:03:45 -05:00
kingbri
94696543bc Model: Warn user if context > max_seq_len
Unlike other backends, tabby attempts to generate even if the context
is greater than the max sequence length via truncation of the given
context.

Rather than artificially erroring out, warn the user that the console
metrics output will be incorrect and that they should keep
context <= max_seq_len.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-29 01:35:32 -05:00
kingbri
cad144126f API: Rename repetition_decay -> repetition_slope
Also fix the fallback to use 0 for sanity checking and validation.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-29 01:13:05 -05:00
Mehran Ziadloo
b0c42d0f05 Leveraging local variables 2023-11-27 20:56:56 -08:00
Mehran Ziadloo
ead503c75b Adding token usage support 2023-11-27 20:05:05 -08:00
kingbri
d47c39da54 API: Don't include draft directory in response
The draft directory should be returned for a draft model request (TBD).

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-23 00:07:56 -05:00
kingbri
71b9a53336 API: Add temperature_last support
Documented in previous commits. Also, for version checking, check the
value in kwargs rather than whether the key is present, since requests
pass default values.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-21 21:20:59 -05:00
turboderp
3337fe6acc Warning if unsupported samplers are used 2023-11-21 18:35:22 +01:00
turboderp
a54de11cf3 Add new samplers 2023-11-21 18:16:53 +01:00
Veden
f960fac8ff Fix incorrect ratio calculation for draft model 2023-11-19 13:12:53 -08:00
kingbri
4cddd0400c Model: Fix draft model loading
Use draft_config to find the path instead of kwargs.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-19 02:04:02 -05:00
kingbri
31bc418795 Model: Add context in response output
When printing to the console, give information about the context
(ingestion token count).

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-19 00:49:32 -05:00
kingbri
6b9af58cc1 Tree: Fix extraneous bugs and update T/s print
Model: Add extra information to print and fix the divide by zero error.
Auth: Fix validation of API and admin keys to look for the entire key.

References #7 and #6

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-18 22:34:40 -05:00
Brian Dashore
b2410a0436 Merge pull request #4 from waldfee/config_samples
Adds draft model support to config.yml
2023-11-18 13:16:23 -05:00
kingbri
27ebec3b35 Model: Add speculative decoding support via config
Speculative decoding makes use of draft models that ingest the prompt
before forwarding it to the main model.

Add options in the config to support this. API options will occur
in a different commit.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-18 01:42:20 -05:00
kingbri
2ad79cb9ea Model: Add tokens in responses
Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-17 23:33:48 -05:00
kingbri
9dfa580b1e Model: Add tokens/second output
Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-17 01:16:20 -05:00
kingbri
d5551352bf Model: Fix parsing of stop conditions
Add the EOS token into stop strings after checking kwargs. If
ban_eos_token is on, don't add the EOS token in for extra measure.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-16 17:15:33 -05:00
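The stop-condition logic this commit describes can be sketched as follows (function name and parameter shapes are illustrative, not copied from the repo):

```python
def get_stop_conditions(stop_strings, eos_token_id, ban_eos_token=False):
    """Build the stop-condition list; names are illustrative.

    The EOS token joins the stop list unless ban_eos_token is on, in
    which case it is deliberately left out for extra measure.
    """
    stops = list(stop_strings or [])
    if not ban_eos_token:
        stops.append(eos_token_id)
    return stops
```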