Commit Graph

193 Commits

Author SHA1 Message Date
kingbri
7e007f0761 Model: Handle finish chunks and logprobs in separate functions
Helps split up and trim the generate_gen function.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-24 21:19:03 -04:00
kingbri
3f09fcd8c9 Model: Make model params return a model card
The model card is a unified structure for sharing model params.
Rather than kwargs, use this instead.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-21 23:15:46 -04:00
kingbri
13beef8021 Model: Move find_template function to templating
Makes sense to extract to a utility function instead.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 18:27:53 -04:00
kingbri
8e238fa8f6 Model: Move calculate_rope_alpha from backend
Makes more sense to use as a utility function. Also clarify how the
vars are set.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 18:20:19 -04:00
kingbri
b751e0a1d5 Model: Move inline overrides to common
This is applied across containers. Doesn't make sense to put this method
in the backend.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 17:51:57 -04:00
kingbri
034682fcf1 Backends: Add base model container
Base class for all model containers. Used in the shared model file
for interface.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 17:24:10 -04:00
kingbri
f15ac1f69d Model: Reject model requests when unloading
If a model is being unloaded, that means it's being shut down and
no requests should be accepted from then on.

Also, remove model_is_loaded since we simply check if the container
is None now.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-19 22:34:06 -04:00
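
The rejection described in f15ac1f69d could look roughly like the sketch below. The `ModelContainer` class, its `unloading` flag, and `check_model_available` are hypothetical names used for illustration only; they are not taken from the repository.

```python
from typing import Optional


class ModelContainer:
    """Hypothetical stand-in for a model container (not the project's class)."""

    def __init__(self) -> None:
        # Set to True once an unload has started.
        self.unloading = False


def check_model_available(container: Optional[ModelContainer]) -> None:
    """Raise if no model is loaded or the loaded model is shutting down."""
    # A missing container replaces a separate "is loaded" flag.
    if container is None:
        raise RuntimeError("No model is loaded.")
    # An unloading model must not accept new requests.
    if container.unloading:
        raise RuntimeError("Model is unloading; request rejected.")
```

Checking the container against None keeps a single source of truth for "is a model loaded", which matches the commit's removal of model_is_loaded.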
kingbri
3f1d5d396e Model: Store active jobs in tabby
Rather than relying on the generator, use tabby to store the active
job IDs.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 13:17:55 -04:00
kingbri
1afc9b983e Model: Remove generate_window
Not required since we error out when exceeding max_seq_len.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 12:59:02 -04:00
kingbri
2f5235e1a3 Model: Extract settings creation to a separate function
Maybe move this out of the class entirely, but for now, it makes
sense to encapsulate this logic.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 12:57:27 -04:00
kingbri
5697204e47 Merge branch 'main' into model-rewrite 2025-04-16 02:15:46 -04:00
kingbri
6bb5f8f599 Sampling: Rewrite mirostat_mode parameter
Apparently the "mirostat" parameter has been updated by frontends
to pass a number. ExllamaV2 expects a boolean, but most pass a number
anyway, so just alias mirostat_mode and mirostat together.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 02:13:55 -04:00
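
A minimal sketch of the aliasing described in 6bb5f8f599, written with plain dictionaries instead of the project's request models. The function name, the precedence of `mirostat_mode` over `mirostat`, and the rule that any non-zero value means "enabled" are all assumptions.

```python
from typing import Any, Dict


def resolve_mirostat(params: Dict[str, Any]) -> bool:
    """Treat `mirostat_mode` and `mirostat` as one setting and return a boolean."""
    # Assumption: `mirostat_mode` wins when both keys are present.
    value = params.get("mirostat_mode", params.get("mirostat", 0))
    # Assumption: frontends send 0 for "off" and a non-zero mode number for "on".
    return bool(value)
```

Either key then yields the boolean the backend expects, for example `resolve_mirostat({"mirostat": 2})` or `resolve_mirostat({"mirostat_mode": 0})`.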
kingbri
3084ef9fa1 Model + API: Migrate to use BaseSamplerParams
kwargs is pretty ugly when figuring out which arguments to use. The
base request falls back to defaults anyway, so pass in the params
object as is.

However, since Python's typing isn't like TypeScript where types
can be transformed, the type hinting has a possibility of None showing
up despite there always being a value for some params.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 00:50:05 -04:00
kingbri
dcb36e9ab2 Model: Remove extra unwraps
The base sampler request already specifies the defaults, so don't
unwrap in this way.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-15 23:38:46 -04:00
kingbri
11ed3cf5ee Model: Cleanup logging and remove extraneous declarations
Log the parameters passed into the generate gen function rather than
the generation settings to reduce complexity.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-15 23:31:12 -04:00
kingbri
79f9c6e854 Model: Remove num_experts_per_token
This shouldn't even be an exposed option since changing it always
breaks inference with the model. Let the model's config.json handle
it.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-19 11:52:10 -04:00
kingbri
9f649647f0 Model + API: GPU split updates and fixes
For the TP loader, GPU split cannot be an empty array. However,
defaulting the parameter to an empty array makes it easier to calculate
the device list. Therefore, cast an empty array to None using
falsy comparisons at load time.

Also add draft_gpu_split to the load request.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-15 21:50:14 -05:00
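
The falsy cast mentioned in 9f649647f0 can be illustrated as follows. This is only a sketch; `normalize_gpu_split` is a hypothetical helper name, and the real load path may differ.

```python
from typing import List, Optional


def normalize_gpu_split(gpu_split: Optional[List[float]]) -> Optional[List[float]]:
    """Return None for an empty or missing split, otherwise the list unchanged."""
    # An empty list is falsy, so `or` replaces it with None at load time.
    return gpu_split or None
```

The parameter can keep an empty-list default (convenient for `len()` when computing a device list) while the loader still receives None rather than an empty array.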
kingbri
beb6d8faa5 Model: Adjust draft_gpu_split and add to config
The previous code overrode the existing gpu split and device idx
values. This now sets an independent draft_gpu_split value and
adjusts the gpu_devices check only if the draft_gpu_split array
is larger than the gpu_split array.

Draft gpu split is not Tensor Parallel, and defaults to gpu_split_auto
if a split is not provided.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-08 16:09:46 -05:00
kingbri
bd8256d168 Merge branch 'main' into draft-split 2025-02-08 15:10:44 -05:00
kingbri
b994aae995 Model: Cleanup generation length and page checks
Reduce the number of if statements and combine parts of code.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-26 23:13:08 -05:00
kingbri
ba2579ff74 Merge branch 'main' into robust-length-checks 2024-12-26 18:00:26 -05:00
kingbri
7878d351a7 Endpoints: Add props endpoint and add more values to model params
The props endpoint is a standard used by llamacpp APIs which returns
various properties of a model to a server. It's still recommended to
use /v1/model to get all the parameters a TabbyAPI model has.

Also include the contents of a prompt template when fetching the current
model.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-26 17:32:19 -05:00
DocShotgun
4d11323c17 Tree: Format 2024-12-17 09:37:33 -08:00
DocShotgun
5da335eb3d Model: Robust request length checking in generator
* Ensure that length of positive/negative prompt + max_tokens does not exceed max_seq_len
* Ensure that total required pages for CFG request does not exceed allocated cache_size
2024-12-17 09:34:43 -08:00
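
The two checks listed in 5da335eb3d can be sketched as below. The page size of 256 tokens, the function name, the rule that each prompt reserves room for its own copy of max_tokens, and the error messages are assumptions made for illustration; the commit itself only states the two conditions.

```python
import math

PAGE_SIZE = 256  # Assumption: tokens per cache page.


def check_request_length(
    prompt_len: int,
    negative_prompt_len: int,
    max_tokens: int,
    max_seq_len: int,
    cache_size: int,
) -> int:
    """Validate a request and return the number of cache pages it needs."""
    # Check 1: the longer prompt plus the requested tokens must fit the context.
    longest_prompt = max(prompt_len, negative_prompt_len)
    if longest_prompt + max_tokens > max_seq_len:
        raise ValueError("Prompt length + max_tokens exceeds max_seq_len.")

    # Check 2: every prompt that is present needs its own pages in the cache
    # (a CFG request carries both a positive and a negative prompt).
    lengths = [n for n in (prompt_len, negative_prompt_len) if n > 0]
    pages = sum(math.ceil((n + max_tokens) / PAGE_SIZE) for n in lengths)
    if pages * PAGE_SIZE > cache_size:
        raise ValueError("Required cache pages exceed cache_size.")

    return pages
```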
DocShotgun
7f899734c0 Grammar: Cache the engine vocabulary
* Avoid rebuilding the KBNF engine vocabulary on every grammar-enabled request
2024-12-05 21:36:37 -08:00
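
One common way to avoid rebuilding an expensive object on every request, as 7f899734c0 describes for the engine vocabulary, is memoization. The sketch below uses `functools.lru_cache` from the standard library; `build_vocabulary` is a hypothetical stand-in and does not reflect how the vocabulary is actually built or cached in the repository.

```python
from functools import lru_cache
from typing import Tuple


def build_vocabulary(tokenizer_name: str) -> Tuple[str, ...]:
    """Hypothetical stand-in for an expensive vocabulary build."""
    return tuple(f"{tokenizer_name}:{i}" for i in range(4))


@lru_cache(maxsize=1)
def get_cached_vocabulary(tokenizer_name: str) -> Tuple[str, ...]:
    """Build the vocabulary once per tokenizer and reuse it afterwards."""
    return build_vocabulary(tokenizer_name)
```

With `maxsize=1`, switching to a different tokenizer evicts the old entry, so only one vocabulary is held in memory at a time.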
lucy
ab1f4b7a6a add draft_gpu_split option 2024-11-27 02:52:19 +01:00
DocShotgun
6f2dc2ea99 Grammar: Fix syntax, lint 2024-11-24 11:35:45 -08:00
DocShotgun
8f209efb99 Grammar: Clean up KBNF implementation
* Also remove empty cache clear function
2024-11-24 10:44:45 -08:00
DocShotgun
0836a9317f Grammar: Initial Formatron regex and JSON schema implementation
* Replace LMFE's regex and JSON schema filters with Formatron's
* Remove Outlines EBNF filter in preparation for Formatron KBNF filter
* TODO: Implement Formatron KBNF filter
2024-11-23 10:27:37 -08:00
kingbri
eadc71a4c3 Model: Add unload and error messages for vision
If vision is enabled and the model doesn't support it, send an
error asking the user to reload. Also, add a method to unload the
vision tower.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-22 14:25:03 -05:00
kingbri
0ab393f09c Model: Set vision load to False by default
Mistake in unwrapping. Vision should be false to allow normal model
loading when the flag isn't provided.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-21 17:54:42 -05:00
kingbri
902045edbb API: Fix chat completion formatting flow
Previously, the flow for parsing chat completion messages and rendering
from the prompt template was disconnected between endpoints. Now, create
a common function to render and handle everything appropriately afterwards.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-21 17:51:14 -05:00
kingbri
0fadb1e5e8 Merge branch 'main' into vision 2024-11-19 21:19:21 -05:00
DocShotgun
27d9af50a8 API: Report whether vision is enabled 2024-11-19 12:29:25 -08:00
DocShotgun
5611365c07 OAI: Allow /v1/encode endpoint to handle vision requests
* More robust checks for OAI chat completion message lists on /v1/encode endpoint
* Added TODO to support other aspects of chat completions
* Fix oversight where embeddings was not defined in advance on /v1/chat/completions endpoint
2024-11-19 11:14:37 -08:00
Brian
a69f86098a Merge pull request #243 from DocShotgun/chunk-size-fix
Enforce chunk_size as multiple of 256
2024-11-18 00:40:36 -05:00
DocShotgun
dd41eec8a4 OAI: Initial vision support in OAI chat completions
* Support image_url inputs containing URLs or base64 strings following OAI vision spec
* Use async lru cache for image embeddings
* Add generic wrapper class for multimodal embeddings
2024-11-17 21:23:09 -08:00
DocShotgun
5bb46df3c3 Model: Fix draft model non-FA2 fallback 2024-11-15 21:04:25 -08:00
DocShotgun
37cc701137 Model: Enforce chunk_size as multiple of 256 2024-11-15 20:35:18 -08:00
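
The commit above does not say how the multiple is enforced. One plausible approach, shown purely as an assumption, is to round down to the nearest multiple of 256 with a floor of 256; `align_chunk_size` is a hypothetical name.

```python
def align_chunk_size(chunk_size: int, multiple: int = 256) -> int:
    """Round chunk_size down to a multiple of `multiple`, never below one multiple."""
    # Integer division drops the remainder; max() guards against a result of 0.
    return max(multiple, (chunk_size // multiple) * multiple)
```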
kingbri
69ac0eb8aa Model: Add vision loading support
Adds the ability to load vision parts of text + image models. Requires
an explicit flag in config because there isn't a way to automatically
determine whether the vision tower should be used.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-11 12:10:11 -05:00
kingbri
cc2516790d Model: Add support for chat_template.json
HuggingFace separated the chat template in the newest transformers
versions.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-11 12:10:06 -05:00
kingbri
9530f8c8c7 Model: Add support for chat_template.json
HuggingFace separated the chat template in the newest transformers
versions.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-11 12:09:27 -05:00
DocShotgun
603760cecb Model: Remove override_base_seq_len 2024-10-30 10:03:08 +08:00
TerminalMan
7d18d2e2ca Refactor the sampling class (#199)
* improve validation

* remove to_gen_params functions

* update changes for all endpoint types

* OAI: Fix calls to generation

Chat completion and completion need to have prompt split out before
pushing to the backend.

Signed-off-by: kingbri <bdashore3@proton.me>

* Sampling: Convert Top-K values of -1 to 0

Some OAI implementations use -1 as disabled instead of 0. Therefore,
add a coalesce case.

Signed-off-by: kingbri <bdashore3@proton.me>

* Sampling: Format and space out

Make the code more readable.

Signed-off-by: kingbri <bdashore3@proton.me>

* Sampling: Fix mirostat

Field items are nested in data within a Pydantic FieldInfo

Signed-off-by: kingbri <bdashore3@proton.me>

* Sampling: Format

Signed-off-by: kingbri <bdashore3@proton.me>

* Sampling: Fix banned_tokens and allowed_tokens conversion

If the provided string has whitespace, trim it before splitting.

Signed-off-by: kingbri <bdashore3@proton.me>

* Sampling: Add helpful log to dry_sequence_breakers

Let the user know if the sequence errors out.

Signed-off-by: kingbri <bdashore3@proton.me>

* Sampling: Apply validators in right order

Validators need to be applied in order from top to bottom, which is why
the after validator was not being applied properly.

Set the model to validate default params for sampler override purposes.
This can be turned off if there are unclear errors.

Signed-off-by: kingbri <bdashore3@proton.me>

* Endpoints: Format

Cleanup and semantically fix field validators

Signed-off-by: kingbri <bdashore3@proton.me>

* Kobold: Update validators and fix parameter application

Validators on parent fields cannot see child fields. Therefore,
validate using the child fields instead and alter the parent field
data from there.

Also fix badwordsids casting.

Signed-off-by: kingbri <bdashore3@proton.me>

* Sampling: Remove validate defaults and fix mirostat

If a user sets an override to a non-default value, that's their
own fault.

Run validator on the actual mirostat_mode parameter rather than
the alternate mirostat parameter.

Signed-off-by: kingbri <bdashore3@proton.me>

* Kobold: Rework badwordsids

Currently, this serves to ban the EOS token. All other functionality
was legacy, so remove it.

Signed-off-by: kingbri <bdashore3@proton.me>

* Model: Remove HuggingfaceConfig

This was only necessary for badwordsids. All other fields are handled
by exl2. Keep the class as a stub if it's needed again.

Signed-off-by: kingbri <bdashore3@proton.me>

* Kobold: Bump kcpp impersonation

TabbyAPI supports XTC now.

Signed-off-by: kingbri <bdashore3@proton.me>

* Sampling: Change alias to validation_alias

Reduces the probability for errors and makes the class consistent.

Signed-off-by: kingbri <bdashore3@proton.me>

* OAI: Use constraints for validation

Instead of adding a model_validator, use greater than or equal to
constraints provided by Pydantic.

Signed-off-by: kingbri <bdashore3@proton.me>

* Tree: Lint

Signed-off-by: kingbri <bdashore3@proton.me>

---------

Co-authored-by: SecretiveShell <84923604+SecretiveShell@users.noreply.github.com>
Co-authored-by: kingbri <bdashore3@proton.me>
2024-10-27 11:43:41 -04:00
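
Two of the sampling fixes in 7d18d2e2ca (the Top-K coalesce and the banned_tokens/allowed_tokens conversion) can be sketched as plain functions. The function names and the comma-separated input format are assumptions; the actual code applies these rules through Pydantic validators.

```python
from typing import List


def coalesce_top_k(top_k: int) -> int:
    """Map -1, which some OAI implementations use for 'disabled', to 0."""
    return 0 if top_k == -1 else top_k


def parse_token_ids(value: str) -> List[int]:
    """Parse a comma-separated string of token IDs, trimming whitespace first."""
    stripped = value.strip()
    # A blank string means no tokens were provided.
    if not stripped:
        return []
    # int() tolerates surrounding spaces, so "1, 2" parses cleanly.
    return [int(part) for part in stripped.split(",") if part.strip()]
```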
Brian Dashore
6e48bb420a Model: Fix inline loading and draft key (#225)
* Model: Fix inline loading and draft key

There was a lack of foresight between the new config.yml and how
it was structured. The "draft" key became "draft_model" without updating
both the API request and inline loading keys.

For the API requests, still support "draft" as legacy, but the "draft_model"
key is preferred.

Signed-off-by: kingbri <bdashore3@proton.me>

* OAI: Add draft model dir to inline load

This was not pushed before, which caused errors from the kwargs being None.

Signed-off-by: kingbri <bdashore3@proton.me>

* Model: Fix draft args application

Draft model args weren't applying since there was a reset due to how
the old override behavior worked.

Signed-off-by: kingbri <bdashore3@proton.me>

* OAI: Change embedding model load params

Use embedding_model_name to be in line with the config.

Signed-off-by: kingbri <bdashore3@proton.me>

* API: Fix parameter for draft model load

Alias name to draft_model_name.

Signed-off-by: kingbri <bdashore3@proton.me>

* API: Fix parameter for template switch

Add prompt_template_name to be more descriptive.

Signed-off-by: kingbri <bdashore3@proton.me>

* API: Fix parameter for model load

Alias name to model_name for config parity.

Signed-off-by: kingbri <bdashore3@proton.me>

* API: Add alias documentation

Signed-off-by: kingbri <bdashore3@proton.me>

---------

Signed-off-by: kingbri <bdashore3@proton.me>
2024-10-24 23:35:05 -04:00
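
The legacy-key handling described in 6e48bb420a ("draft_model" preferred, "draft" still accepted) can be sketched with a plain dictionary lookup. `get_draft_config` is a hypothetical helper; the actual change relies on field aliases in the request models.

```python
from typing import Any, Dict, Optional


def get_draft_config(request: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Prefer the `draft_model` key and fall back to the legacy `draft` key."""
    if request.get("draft_model") is not None:
        return request["draft_model"]
    # Legacy clients may still send `draft`.
    return request.get("draft")
```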
kingbri
126a44483c Tree: Remove fasttensors
Now a noop in upstream.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-09-30 00:18:47 -04:00
kingbri
56ce82ef77 Sampling: Add XTC support
Matches with upstream.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-09-24 18:10:52 -04:00
TerminalMan
2cda890deb Add health check monitoring for EXL2 errors (#206)
* Add health check monitoring for EXL2 errors

* Health: Format and change status code

A status code of 503 makes more sense to use.
---------
2024-09-22 21:40:36 -04:00
kingbri
75af974c88 Model: Raise an error if the context length is too large
The dynamic generator gave a not-so-helpful exception already which
basically said to not exceed the max sequence length. Instead of
possible undefined behavior, error out.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-09-19 22:05:56 -04:00
kingbri
24ea85b3c5 Tree: Use safe loader for YAML
Loaders that read use a safe type while loaders that write use both
round-trip and safe options.

Also don't create module-level parsers where they're not needed.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-09-18 19:26:51 -04:00