549 Commits

Author SHA1 Message Date
Amgad Hasan
2e5cf0ea3f Fix docker compose volume mount 2024-07-12 13:23:58 +00:00
kingbri
073e9fa6f0 Dependencies: Bump ExllamaV2
v0.1.7

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-11 14:22:50 -04:00
kingbri
9fc3fc4c54 OAI: Amend comments
Clarify what the user can and can't see.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-11 14:22:50 -04:00
kingbri
1f46a1130c OAI: Restrict list permissions for API keys
API keys are not allowed to view all the admin's models, templates,
draft models, loras, etc. Basically anything that can be viewed
on the filesystem outside of anything that's currently loaded is
not allowed to be returned unless an admin key is present.

This change helps preserve user privacy while not erroring out on
list endpoints that the OAI spec requires.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-11 14:22:50 -04:00
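The restriction described in the commit above can be sketched roughly as follows. This is an illustration only: `visible_models`, its parameters, and the model names are hypothetical and are not taken from the repository.

```python
from typing import List, Optional


def visible_models(
    all_models: List[str], loaded_model: Optional[str], is_admin: bool
) -> List[str]:
    """Hypothetical helper: decide which model names a list endpoint returns."""
    if is_admin:
        # An admin key may see everything found on the filesystem.
        return list(all_models)
    # A plain API key only sees what is currently loaded, so the endpoint
    # still answers (as the OAI spec expects) without exposing the rest.
    return [loaded_model] if loaded_model is not None else []
```

The point of returning a (possibly empty) list instead of raising is that list endpoints keep working for clients that expect them.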
kingbri
10890913b8 Auth: Revert x-admin-key allowance in API key check
These kinda clash with each other. Use the correct header for the
correct endpoint.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-11 14:22:50 -04:00
kingbri
dfb4c51d5f OAI: Fix function idioms
Make functions mean the same thing to avoid confusion.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-11 14:22:50 -04:00
kingbri
b9a58ff01b Auth: Make key permission check work on Requests
Pass a request and internally unwrap the headers. In addition, allow
X-admin-key to get checked in an API key request.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-11 14:22:49 -04:00
Brian Dashore
ff15eed85d Update README.md
Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 21:26:11 +00:00
kingbri
5c293499bd OAI: Reorder functions
Reordering routes changes the order of appearance on documentation.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 15:27:08 -04:00
kingbri
521d21b9f2 OAI: Add return types for docs
Adding return types allows for responses to get included in the
autogenerated docs.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 15:23:41 -04:00
kingbri
62e495fc13 Model: Grammar: Fix lru_cache clear function
It's cache_clear not clear_cache.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 15:10:15 -04:00
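For reference, a small sketch of the method name that functools.lru_cache exposes. The function `build_tokenizer_data` is a hypothetical stand-in and is not the repository's code.

```python
from functools import lru_cache

calls = 0


@lru_cache(maxsize=4)
def build_tokenizer_data(name: str) -> str:
    """Hypothetical stand-in for an expensive, cacheable build step."""
    global calls
    calls += 1
    return f"data-for-{name}"


build_tokenizer_data("model-a")
build_tokenizer_data("model-a")  # served from the cache, no second call

# The wrapper exposes cache_clear(); it has no clear_cache() attribute.
build_tokenizer_data.cache_clear()
build_tokenizer_data("model-a")  # recomputed after the cache was cleared
```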
Brian Dashore
17438288c7 Merge pull request #146 from theroyallab/tokenizer_data_fix
Tokenizer data fix
2024-07-08 15:08:29 -04:00
kingbri
c7ce97f119 Tree: Ruff lint
Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 15:06:28 -04:00
kingbri
8a81fe2eb4 Actions: Add GitHub Pages deploy
Deploys OpenAPI documentation to pages.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 15:04:27 -04:00
kingbri
6613e38436 Main: Make openapi export store locally
This runs faster than always making a syscall to check if the env
var is set.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 14:54:06 -04:00
kingbri
ae66e8f9ba Ruff: Lint
Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 13:44:12 -04:00
kingbri
b907421285 Main: Fix launch if EXPORT_OPENAPI is unset
A default needs to be provided with getenv. Fix that with an empty
string.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 13:41:44 -04:00
kingbri
a59e8ef9e7 Main: Make EXPORT_OPENAPI only work if true or 1
Use truthy values instead of checking if the variable is set.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 12:51:24 -04:00
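The two EXPORT_OPENAPI commits above (accept only "true" or "1", and supply a default to getenv) can be sketched together like this. The helper name `env_flag` is hypothetical, and the case-insensitive comparison is an assumption rather than something the commit messages state.

```python
import os


def env_flag(name: str) -> bool:
    """Return True only when the variable is set to "true" or "1"."""
    # os.getenv returns None when the variable is unset; the empty-string
    # default keeps the .lower() call from failing in that case.
    value = os.getenv(name, "")
    # Assumption: compare case-insensitively so "True" also counts.
    return value.lower() in ("true", "1")
```

With this shape, a variable that is merely present but set to something else (for example "0" or "no") does not enable the export.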
kingbri
e58e197f0b Ruff: Remove deprecated rule E999
The syntax error rule is removed since syntax errors are always shown
when linting anyway.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 12:36:15 -04:00
kingbri
933268f7e2 API: Integrate OpenAPI export script
Move OpenAPI export as an env var within the main function. This
allows for easy export by running main.

In addition, an env variable provides global and explicit state to
disable conditional wheel imports (ex. Exl2 and torch) which caused
errors at first.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 12:34:32 -04:00
turboderp
e97ad9cb27 RUFF 2024-07-08 03:51:14 +02:00
turboderp
8bbce3455c RUFF 2024-07-08 03:49:26 +02:00
kingbri
5e82b7eb69 API: Add standalone method to fetch OpenAPI docs
Generates and stores an export of the openapi.json file for use in
static websites.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-07 21:35:52 -04:00
turboderp
4cf79c5ae1 Clear tokenizer_data cache when unloading model 2024-07-08 03:31:05 +02:00
turboderp
b7e7df1220 Move tokenizer_data cache to global scope 2024-07-08 02:54:49 +02:00
turboderp
4d0bb1ffc3 Cache creation tokenizer_data in LMFE 2024-07-08 00:51:59 +02:00
turboderp
bb8b02a60a Wrap arch_compat_overrides in try block
Quick fix until exllamav2 0.1.7 releases, since the function isn't defined for 0.1.6.
2024-07-07 07:54:05 +02:00
kingbri
773639ea89 Model: Fix flash-attn checks
If flash attention is already turned off by exllamaV2 itself, don't
try creating a paged generator. Also condense all the redundant
logic into one if statement.

Also check arch_compat_overrides to see if flash attention should
be disabled for a model arch (ex. Gemma 2)

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-06 20:58:24 -04:00
kingbri
27d2d5f3d2 Config + Model: Allow for default fallbacks from config for model loads
Previously, the parameters under the "model" block in config.yml only
handled the loading of a model on startup. This meant that any subsequent
API request required each parameter to be filled out or use a sane default
(usually defaults to the model's config.json).

However, there are cases where admins may want an argument from the
config to apply if the parameter isn't provided in the request body.
To help alleviate this, add a mechanism that works like sampler overrides
where users can specify a flag that acts as a fallback.

Therefore, this change both preserves the source of truth of what
parameters the admin is loading and adds some convenience for users
that want customizable defaults for their requests.

This behavior may change in the future, but I think it solves the
issue for now.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-06 17:50:58 -04:00
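A rough sketch of the fallback idea described above, assuming the admin names the config keys that may act as defaults. The function `resolve_load_params`, the `fallback_keys` argument, and the example keys are hypothetical and do not reflect the actual implementation.

```python
from typing import Any, Dict, List


def resolve_load_params(
    request: Dict[str, Any], config: Dict[str, Any], fallback_keys: List[str]
) -> Dict[str, Any]:
    """Fill parameters missing from a load request using opted-in config values."""
    resolved = dict(request)
    for key in fallback_keys:
        # Only fall back when the request left the parameter out (or null)
        # and the config actually defines a value for it.
        if resolved.get(key) is None and key in config:
            resolved[key] = config[key]
    return resolved
```

Config values that are not listed as fallbacks are ignored, so an explicit request value always wins and the config only fills gaps the admin opted into.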
kingbri
d03752e31b Issues: Fix template
Correct Discord invite link.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-06-23 21:52:01 -04:00
kingbri
45fae89af6 Update README
Signed-off-by: kingbri <bdashore3@proton.me>
2024-06-23 21:50:17 -04:00
kingbri
c5ea2abe24 Dependencies: Update ExllamaV2
v0.1.6

Signed-off-by: kingbri <bdashore3@proton.me>
2024-06-23 21:45:04 -04:00
kingbri
d85b526644 Dependencies: Pin numpy
v2.x breaks many upstream dependencies (torch). Pin until repos are
fixed.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-06-23 21:40:09 -04:00
DocShotgun
107436f601 Dependencies: Fix AMD triton (#139) 2024-06-18 15:19:27 +02:00
Brian Dashore
06ee610a97 Update README
Signed-off-by: kingbri <bdashore3@proton.me>
2024-06-17 03:56:47 +00:00
kingbri
c575105e41 ExllamaV2: Cleanup log placements
Move the large import errors into the check functions themselves.
This helps reduce the difficulty in interpreting where errors are
coming from.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-06-16 00:16:03 -04:00
Glenn Maynard
8da7644571 Fix exception unloading models. (#138)
self.generator is None if a model load fails or is cancelled.
2024-06-15 23:44:29 +02:00
DocShotgun
85387d97ad Fix disabling flash attention in exl2 config (#136)
* Model: Fix disabling flash attention in exl2 config

* Model: Pass no_flash_attn to draft config

* Model: Force torch flash SDP off in compatibility mode
2024-06-12 20:00:46 +02:00
DocShotgun
156b74f3f0 Revision to paged attention checks (#133)
* Model: Clean up paged attention checks

* Model: Move cache_size checks after paged attn checks
Cache size is only relevant in paged mode

* Model: Fix no_flash_attention

* Model: Remove no_flash_attention
Ability to use flash attention is auto-detected, so this flag is unneeded. Uninstall flash attention to disable it on supported hardware.
2024-06-09 17:28:11 +02:00
DocShotgun
55d979b7a5 Update dependencies, support Python 3.12, update for exl2 0.1.5 (#134)
* Dependencies: Add wheels for Python 3.12

* Model: Switch fp8 cache to Q8 cache

* Model: Add ability to set draft model cache mode

* Dependencies: Bump exllamav2 to 0.1.5

* Model: Support Q6 cache

* Config: Add Q6 cache and draft_cache_mode to config sample
2024-06-09 17:27:39 +02:00
DocShotgun
dcd9428325 Model: Warn if cache size is too small for CFG (#132) 2024-06-05 19:40:14 +02:00
DocShotgun
e391d84e40 More extensive checks for paged mode support (#121)
* Model: More extensive checks for paged attention
Previously, TabbyAPI only checked whether the user's hardware supports flash attention before deciding whether to enable paged mode.
This adds checks for whether no_flash_attention is set, whether flash-attn is installed, and whether the installed version supports paged attention.

* Tree: Format

* Tree: Lint

* Model: Check GPU architecture first
Check GPU arch prior to checking whether flash attention 2 is installed
2024-06-05 09:33:21 +02:00
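The order of checks described in the commit above could look roughly like this. The function name, its parameters, and both default thresholds (`min_arch` and `min_paged_version`) are assumptions made for illustration; the commit message does not state the exact values.

```python
from typing import Optional, Tuple


def supports_paged_mode(
    compute_capability: Tuple[int, int],
    no_flash_attention: bool,
    flash_attn_version: Optional[Tuple[int, int, int]],
    min_arch: Tuple[int, int] = (8, 0),  # assumed threshold
    min_paged_version: Tuple[int, int, int] = (2, 5, 7),  # assumed threshold
) -> bool:
    """Hypothetical check order: GPU arch, user flag, install, version."""
    # 1. Check the GPU architecture first.
    if compute_capability < min_arch:
        return False
    # 2. Respect an explicit opt-out.
    if no_flash_attention:
        return False
    # 3. flash-attn must be installed (None means it was not found).
    if flash_attn_version is None:
        return False
    # 4. The installed version must be new enough for paged attention.
    return flash_attn_version >= min_paged_version
```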
turboderp
dbdcb38ad7 Allow either "[" or "{" prefix to support JSON grammar with top level arrays (#129) 2024-06-04 02:32:39 +02:00
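As a small illustration of why both prefixes matter: a JSON document may have an array at the top level, so output starting with "[" can be as valid as output starting with "{". The helper `has_json_prefix` is hypothetical and only mimics the idea of a prefix check; it is not the grammar filter used in the repository.

```python
import json


def has_json_prefix(text: str) -> bool:
    """Accept text whose first non-space character opens an object or array."""
    return text.lstrip().startswith(("{", "["))


# Both a top-level object and a top-level array parse as valid JSON.
as_object = json.loads('{"items": [1, 2]}')
as_array = json.loads('[{"id": 1}, {"id": 2}]')
```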
turboderp
e889fa3efe Bump exllamav2 to v0.1.4 (#128) 2024-06-04 02:32:08 +02:00
Orion
6cc3bd9752 feat: list support in message.content (#122) 2024-06-03 19:57:15 +02:00
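One way list-valued message.content can be handled is to flatten the text parts into a single string. The sketch below assumes the OpenAI-style part shape `{"type": "text", "text": ...}`; the helper `content_to_text` is hypothetical and may differ from what the commit actually does.

```python
from typing import Any, Dict, List, Union


def content_to_text(content: Union[str, List[Dict[str, Any]]]) -> str:
    """Return plain text for either a string or a list of content parts."""
    if isinstance(content, str):
        return content
    # Keep only text parts; other part types are skipped in this sketch.
    return "".join(
        part.get("text", "") for part in content if part.get("type") == "text"
    )
```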
turboderp
1951f7521c Forward exceptions from _stream_collector to stream_generate_(chat)_completion (#126) 2024-06-03 19:42:45 +02:00
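The general pattern of forwarding exceptions from a background collector to the consuming stream can be sketched with an asyncio queue. The names `_collect` and `stream` are hypothetical and this is not the repository's implementation, only an instance of the technique.

```python
import asyncio
from typing import Any, AsyncIterator


async def _collect(queue: asyncio.Queue, source: AsyncIterator[Any]) -> None:
    """Copy items from source into the queue, forwarding any failure."""
    try:
        async for item in source:
            await queue.put(item)
    except Exception as exc:
        # Hand the exception to the consumer instead of losing it in the task.
        await queue.put(exc)
    finally:
        await queue.put(None)  # sentinel: nothing more will arrive


async def stream(source: AsyncIterator[Any]) -> AsyncIterator[Any]:
    """Yield items from the collector and re-raise forwarded exceptions."""
    queue: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(_collect(queue, source))
    try:
        while True:
            item = await queue.get()
            if isinstance(item, Exception):
                raise item
            if item is None:
                break
            yield item
    finally:
        task.cancel()
```

Without the forwarding step, an error inside the collector task would stay inside that task and the consumer would wait on the queue indefinitely.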
turboderp
0eb8fa5d1e [fix] Bring draft progress and model progress in sync with model loader (#125)
* Bring draft progress and model progress in sync with model loader

* Fix formatting
2024-06-03 19:41:02 +02:00
turboderp
a011c17488 Revert "Forward exceptions from _stream_collector to stream_generate_chat_completion"
This reverts commit 1bb8d1a312.
2024-06-02 15:37:37 +02:00
turboderp
1bb8d1a312 Forward exceptions from _stream_collector to stream_generate_chat_completion 2024-06-02 15:13:30 +02:00
kingbri
e95e67a000 OAI: Add validation to "n"
n must be at least 1 to generate.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-28 00:52:30 -04:00
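A minimal sketch of validating "n" at request construction time, assuming the intended lower bound is 1 (a request for zero completions would generate nothing). The `GenerationRequest` class is hypothetical and uses a plain dataclass rather than the project's own request models.

```python
from dataclasses import dataclass


@dataclass
class GenerationRequest:
    """Hypothetical request object with a validated completion count."""

    prompt: str
    n: int = 1

    def __post_init__(self) -> None:
        # Reject counts that cannot produce any completion.
        if self.n < 1:
            raise ValueError("n must be at least 1")
```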