Commit Graph

173 Commits

Author SHA1 Message Date
Brian Dashore
8524999284 Merge pull request #184 from SecretiveShell/Infinity-Embed-TODO
Complete conditional infinity import TODO
2024-09-04 21:47:49 -04:00
Jake
42a42caf43 remove logging
- remove logging statements
- format code with ruff
2024-09-04 16:14:09 +01:00
kingbri
4bf1a71d7b Model: Fix model override application for draft args
These have to be merged beforehand and the updated version needs to be
re-fetched. It's possible to prevent the fetch of draft_args in the
beginning of init.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-31 22:59:56 -04:00
kingbri
4aebe8a2a5 Config: Use an explicit "auto" value for rope_alpha
Using "auto" for rope alpha removes ambiguity on how to explicitly
enable automatic rope calculation. The same behavior of None -> auto
calculate still exists, but can be overwritten if a model's tabby_config.yml
includes `rope_alpha`.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-31 22:59:56 -04:00
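The rope_alpha behavior described above could be sketched as follows (a hypothetical resolver, not the project's actual code — both `None` and `"auto"` trigger automatic calculation, while a number from a model's tabby_config.yml wins):

```python
def resolve_rope_alpha(value):
    """Hypothetical resolver: None and "auto" both mean auto-calculate;
    an explicit number overrides automatic calculation."""
    if value is None or value == "auto":
        return "auto"
    return float(value)

print(resolve_rope_alpha(None))    # auto
print(resolve_rope_alpha("auto"))  # auto
print(resolve_rope_alpha(2.5))     # 2.5
```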
kingbri
a96fa5f138 API: Don't fallback to default values on model load request
It's best to pass them down the config stack.

API/User config.yml -> model config.yml -> model config.json -> fallback.

Doing this allows for seamless flow and yielding control to each
member in the stack.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-31 22:59:56 -04:00
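The fallback stack described in this commit could be sketched like this (a minimal illustration with hypothetical layer contents, not the project's actual code):

```python
def resolve(key, *layers, fallback=None):
    """Return the first explicitly-set value for key, walking the stack:
    API request -> model tabby_config.yml -> model config.json -> fallback."""
    for layer in layers:
        if layer.get(key) is not None:
            return layer[key]
    return fallback

# Hypothetical layer contents for illustration
api_args = {"max_seq_len": 8192}
model_yml = {"rope_alpha": "auto"}
config_json = {"max_seq_len": 4096, "rope_alpha": None}

print(resolve("max_seq_len", api_args, model_yml, config_json))  # 8192
print(resolve("rope_alpha", api_args, model_yml, config_json))   # auto
```

Each layer only yields control to the next when it has no explicit value, so nothing is silently replaced by a default.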
kingbri
4452d6f665 Model: Add support for overridable model config.yml
Like config.json in a model folder, providing a tabby_config.yml
will serve as a layer between user provided kwargs and the config.json
values.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-31 22:59:56 -04:00
kingbri
dd55b99af5 Model: Store directory paths
Storing a pathlib type makes it easier to manipulate the model
directory path in the long run without constantly fetching it
from the config.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-31 22:59:56 -04:00
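The convenience of storing a pathlib type can be seen in a small sketch (hypothetical paths, for illustration only):

```python
from pathlib import Path

# Hypothetical model directory, stored once at load time
model_dir = Path("models") / "my-model"

# Derived paths become one-liners instead of repeated config lookups
config_json = model_dir / "config.json"
tabby_yml = model_dir / "tabby_config.yml"

print(config_json.as_posix())  # models/my-model/config.json
```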
kingbri
523709741c Model: Reorder how configs are set up
Initialize the Exllama classes first then add user-specific params.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-31 22:59:56 -04:00
TerminalMan
43104e0d19 Complete conditional infinity import TODO
- add logging
- change declaration order
2024-08-31 21:48:43 +01:00
kingbri
21712578cf API: Add allowed_tokens support
This is the opposite of banned tokens. Exllama specific implementation
of #181.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-29 21:44:42 -04:00
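The allow-list idea can be sketched as simple logit masking (a toy illustration, not the Exllama-specific implementation the commit refers to):

```python
def mask_logits(logits, allowed_token_ids):
    """Inverse of banned tokens: every token outside the allow-list
    gets -inf so it can never be sampled."""
    return [
        score if token_id in allowed_token_ids else float("-inf")
        for token_id, score in enumerate(logits)
    ]

print(mask_logits([0.1, 0.9, 0.5], {1, 2}))  # [-inf, 0.9, 0.5]
```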
kingbri
10d9419f90 Model: Add BOS token to prompt logs
If add_bos_token is enabled, the BOS token gets appended to the logged
prompt if logging is enabled.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-29 21:15:09 -04:00
kingbri
4958c06813 Model: Remove and format comments
The comment in __init__ was outdated and all the kwargs are the
config options anyways.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-27 21:43:40 -04:00
turboderp
fe3253f3a9 Model: Account for tokenizer lazy init 2024-08-23 23:51:53 +02:00
turboderp
a676c4bf38 Model: Formatting 2024-08-23 11:15:30 +02:00
turboderp
a3733caeda Model: Fix draft model cache initialization 2024-08-23 11:08:49 +02:00
kingbri
565b0300d6 Dependencies: Update Exllamav2
v0.1.9

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-22 14:15:19 -04:00
kingbri
078fbf1080 Model: Add quantized cache support for tensor parallel
Newer versions of exl2 v1.9-dev have quantized cache implemented. Add
those APIs.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-22 14:15:19 -04:00
kingbri
871c89063d Model: Add Tensor Parallel support
Use the tensor parallel loader when the flag is enabled. The new loader
has its own autosplit implementation, so gpu_split_auto isn't valid
here.

Also make it easier to determine which cache type to use rather than
multiple if/else statements.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-22 14:15:19 -04:00
kingbri
5002617eac Model: Split cache creation into a common function
Unifies the switch statement across both draft and model caches.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-22 14:15:19 -04:00
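The unified switch might look something like this (placeholder class names standing in for the real ExLlamaV2 cache variants):

```python
# Placeholder classes standing in for the ExLlamaV2 cache variants
class CacheFP16: ...
class CacheQ8: ...
class CacheQ4: ...

def create_cache(cache_mode: str = "FP16"):
    """One switch shared by the draft and main model caches."""
    cache_classes = {"FP16": CacheFP16, "Q8": CacheQ8, "Q4": CacheQ4}
    if cache_mode not in cache_classes:
        raise ValueError(f"Unknown cache mode: {cache_mode}")
    return cache_classes[cache_mode]()
```

Both call sites then share one dispatch table instead of duplicating if/else chains.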
Ben Gitter
70b9fc95de [WIP] OpenAI Tools Support/Function calling (#154)
* returning stop str if exists from gen

* added chat template for firefunctionv2

* pulling tool vars from template

* adding parsing for tool inputs/outputs

* passing tool data from endpoint to chat template, adding tool_start to the stop list

* loosened typing on the response tool call, leaning more on the user supplying a quality schema if they want a particular format

* non streaming generation prototype

* cleaning template

* Continued work with type, ingestion into template, and chat template for fire func

* Correction - streaming toolcall comes back as delta obj not inside chatcomprespchoice per chat_completion_chunk.py inside OAI lib.

* Ruff Formatting

* Moved stop string and tool updates out of prompt creation func

Updated tool pydantic to match OAI

Support for streaming

Updated generate tool calls to use flag within chat_template and insert tool reminder

* Llama 3.1 chat templates

Updated fire func template

* renamed llama3.1 to chatml_with_headers..

* update name of template

* Support for calling a tool start token rather than the string.

Simplified tool_params

Warning when gen_settings are being overridden because user set temp to 0

Corrected schema and tools to correct types for function args. Str for some reason

* draft groq tool use model template

* changed headers to vars for readability (but mostly because some models are weird about newlines after headers, so this is an easier way to change globally)

* Clean up comments and code in chat comp

* Post processed tool call to meet OAI spec rather than forcing model to write json in a string in the middle of the call.

* changed example back to args as JSON rather than a string of JSON

* Standardize chat templates to each other

* cleaning/rewording

* stop elements can also be ints (tokens)

* Cleaning/formatting

* added special tokens for tools and tool_response as specified in description

* Cleaning

* removing aux templates - going to live in llm-promp-templates repo instead

* Tree: Format

Signed-off-by: kingbri <bdashore3@proton.me>

* Chat Completions: Don't include internal tool variables in OpenAPI

Use SkipJsonSchema to suppress inclusion in the OpenAPI JSON. The
location of these variables may need to be changed in the future.

Signed-off-by: kingbri <bdashore3@proton.me>

* Templates: Deserialize metadata on template load

Since we're only looking for specific template variables that are
static in the template, it makes more sense to render when the template
is initialized.

Signed-off-by: kingbri <bdashore3@proton.me>

* Tools: Fix comments

Adhere to the format style of comments in the rest of the project.

Signed-off-by: kingbri <bdashore3@proton.me>

---------

Co-authored-by: Ben Gitter <gitterbd@gmail.com>
Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-17 00:16:25 -04:00
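The SkipJsonSchema approach mentioned in the PR can be sketched with pydantic v2 (hypothetical field names; the real request model lives in the project):

```python
from pydantic import BaseModel
from pydantic.json_schema import SkipJsonSchema

class ChatCompletionRequest(BaseModel):
    messages: list[str]
    # Internal tool variable (hypothetical name), hidden from the OpenAPI schema
    tool_precursor: SkipJsonSchema[str] = ""

props = ChatCompletionRequest.model_json_schema()["properties"]
print("tool_precursor" in props)  # False
```

The field still works at runtime; it simply never appears in the generated schema.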
kingbri
63650d2c3c Model: Disable banned strings if grammar is used
ExllamaV2 filters don't allow for rewinding which is what banned
strings uses. Therefore, constrained generation via LMFE or outlines
is not compatible for now.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-05 11:08:58 -04:00
kingbri
8ff2586d45 Start: Fix pip update, method calls, and logging
platform.system() was not called in some places, breaking the
ternary on Windows.

Pip's --upgrade flag does not actually update dependencies to their
latest versions. That's what the --upgrade-strategy eager flag is for.

Tell the user where their start preferences are coming from.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-04 10:30:26 -04:00
kingbri
b6d2676f1c Start: Give the user a hint when a module can't be imported
If an ImportError or ModuleNotFoundError is raised, tell the user
to run the update scripts.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-03 21:59:06 -04:00
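The hint could be implemented along these lines (a minimal sketch with a hypothetical helper, not the project's actual start script):

```python
import importlib

def import_with_hint(module_name: str):
    """Import a module, pointing the user at the update script on failure."""
    try:
        return importlib.import_module(module_name)
    except (ImportError, ModuleNotFoundError) as exc:
        raise SystemExit(
            f"Could not import '{module_name}'. "
            "Try running the update script to reinstall dependencies."
        ) from exc
```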
kingbri
2a33ebbf29 Model: Bypass lock checks when shutting down
Previously, when a SIGINT was emitted and a model load is running,
the API didn't shut down until the load finished due to waiting for
the lock. However, when shutting down, the lock doesn't matter since
the process is being killed anyway.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-03 16:05:34 -04:00
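The bypass can be sketched with asyncio (hypothetical function names; a toy model of the shutdown path, not the project's code):

```python
import asyncio

async def unload_model(lock: asyncio.Lock, shutting_down: bool) -> str:
    """Skip the load lock when the process is being killed anyway."""
    if shutting_down:
        return "unloaded immediately"
    async with lock:
        return "unloaded after waiting for the lock"

async def main() -> str:
    lock = asyncio.Lock()
    await lock.acquire()  # simulate an in-flight model load holding the lock
    return await unload_model(lock, shutting_down=True)

print(asyncio.run(main()))  # unloaded immediately
```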
kingbri
0bcb4e4a7d Model: Attach request ID to logs
If multiple logs come in at once, track which log corresponds to
which request.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-01 00:25:54 -04:00
kingbri
9390d362dd Model: Log generation params and metrics after the prompt/response
A user's prompt and response can be large in the console. Therefore,
always log the smaller payloads (ex. gen params + metrics) after
the large chunks.

However, it's recommended to keep prompt logging off anyways since
it'll result in console spam.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-01 00:19:21 -04:00
Brian Dashore
1bf062559d Merge pull request #158 from AlpinDale/embeddings
feat: add embeddings support via Infinity-emb
2024-07-31 20:33:12 -04:00
kingbri
46304ce875 Model: Properly pass in max_batch_size from config
The override wasn't being passed in before. Also, the default is now
None since Exl2 can automatically calculate the max batch size.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-30 18:42:25 -04:00
kingbri
dc3dcc9c0d Embeddings: Update config, args, and parameter names
Use embeddings_device as the parameter for device to remove ambiguity.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-30 15:32:26 -04:00
kingbri
f13d0fb8b3 Embeddings: Add model load checks
Same as the normal model container.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-30 11:17:36 -04:00
kingbri
01c7702859 Signal: Fix async signal handling
Run unload async functions before exiting the program.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-30 11:11:05 -04:00
kingbri
fbf1455db1 Embeddings: Migrate and organize Infinity
Use Infinity as a separate backend and handle the model within the
common module. This separates out the embeddings model from the endpoint
which allows for model loading/unloading in core.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-30 11:00:23 -04:00
kingbri
7522b1447b Model: Add support for HuggingFace config and bad_words_ids
This is necessary for Kobold's API. Current models use bad_words_ids
in generation_config.json, but for some reason, they're also present
in the model's config.json.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-26 18:23:22 -04:00
kingbri
b7cb6f0b91 API: Add KoboldAI server
Used for interacting with applications that use KoboldAI's API
such as horde.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-26 16:37:30 -04:00
kingbri
3e8ffebdd3 Tree: Format
Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-23 14:32:50 -04:00
kingbri
9ad69e8ab6 API: Migrate universal routes to core
Place OAI specific routes in the appropriate folder. This is in
preparation for adding new API servers that can be optionally enabled.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-23 14:08:48 -04:00
kingbri
191600a150 Revert "Model: Skip empty token chunks"
This reverts commit 21516bd7b5.

This skips EOS and implementing it the proper way seems more
costly than necessary.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-22 18:34:00 -04:00
kingbri
21516bd7b5 Model: Skip empty token chunks
This helps make the generation loop more efficient by skipping past
chunks that aren't providing any tokens anyways. The offset isn't
affected.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-22 12:23:49 -04:00
kingbri
cae94b920c API: Add ability to use request IDs
Identify which request is being processed to help users disambiguate
which logs correspond to which request.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-21 21:01:05 -04:00
kingbri
933404c185 Model: Warn user if terminating jobs
If skip_wait is true, it's best to let the user know that all jobs
will be forcibly cancelled.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-15 11:34:16 -04:00
kingbri
9dae461142 Model: Attempt to recreate generator on a fatal error
If a job causes the generator to error, tabby stops working until
a relaunch. It's better to try establishing a system of redundancy
and remake the generator in the event that it fails.

May replace this with an exit signal for a fatal error instead, but
not sure.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-15 01:09:49 -04:00
kingbri
073e9fa6f0 Dependencies: Bump ExllamaV2
v0.1.7

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-11 14:22:50 -04:00
kingbri
1f46a1130c OAI: Restrict list permissions for API keys
API keys are not allowed to view all the admin's models, templates,
draft models, loras, etc. Basically anything that can be viewed
on the filesystem outside of anything that's currently loaded is
not allowed to be returned unless an admin key is present.

This change helps preserve user privacy while not erroring out on
list endpoints that the OAI spec requires.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-11 14:22:50 -04:00
kingbri
62e495fc13 Model: Grammar: Fix lru_cache clear function
It's cache_clear not clear_cache.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 15:10:15 -04:00
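The fix is easy to demonstrate: `functools.lru_cache` exposes `cache_clear()`, and a method named `clear_cache()` does not exist (toy function for illustration):

```python
from functools import lru_cache

@lru_cache(maxsize=32)
def compile_grammar(schema: str) -> str:
    """Stand-in for an expensive grammar compilation step."""
    return f"compiled:{schema}"

compile_grammar("person-schema")
print(compile_grammar.cache_info().currsize)  # 1
compile_grammar.cache_clear()  # the real method name; clear_cache() would raise AttributeError
print(compile_grammar.cache_info().currsize)  # 0
```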
turboderp
e97ad9cb27 RUFF 2024-07-08 03:51:14 +02:00
turboderp
8bbce3455c RUFF 2024-07-08 03:49:26 +02:00
turboderp
4cf79c5ae1 Clear tokenizer_data cache when unloading model 2024-07-08 03:31:05 +02:00
turboderp
b7e7df1220 Move tokenizer_data cache to global scope 2024-07-08 02:54:49 +02:00
turboderp
4d0bb1ffc3 Cache tokenizer_data creation in LMFE 2024-07-08 00:51:59 +02:00
turboderp
bb8b02a60a Wrap arch_compat_overrides in try block
Quick fix until exllamav2 0.1.7 releases, since the function isn't defined for 0.1.6.
2024-07-07 07:54:05 +02:00