286 Commits

Author SHA1 Message Date
turboderp
f1a2416da5 OAI endpoints: Add option to suppress header after reasoning start token (e.g. Gemma4's "thought\n") 2026-04-12 04:12:53 +02:00
turboderp
3a42c1756c ExLlamaV2: Use new disconnect handler 2026-04-10 22:04:21 +02:00
turboderp
55124d0fc6 Config: Add force_enable_thinking 2026-04-10 00:16:40 +02:00
turboderp
79d581e1f5 OAI endpoints: More rework
- remove disconnect_task
- move disconnect logic to a per-request handler that wraps the cleanup operation and directly polls the request state with throttling
- exclusively signal disconnect with CancelledError
- rework completions endpoint to follow same approach as chat completions, share some code
- refactor OAI endpoints a bit
- correct behavior for batched completion requests
- make sure logprobs work for completion and streaming completion requests
- more tests
2026-04-02 01:26:44 +02:00
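For illustration, a minimal sketch of the per-request disconnect handling described in the commit above, assuming a FastAPI/Starlette request object; the wrapper name and polling interval are invented, not the project's actual code:

```python
# Hypothetical per-request disconnect wrapper (names are assumptions).
import asyncio
from fastapi import Request


async def run_with_disconnect(request: Request, gen_coro, poll_interval: float = 0.5):
    """Run a generation coroutine, cancelling it if the client disconnects."""
    gen_task = asyncio.create_task(gen_coro)
    try:
        while not gen_task.done():
            # Directly poll the request state, throttled by poll_interval
            if await request.is_disconnected():
                # Disconnect is signalled exclusively with CancelledError
                gen_task.cancel()
                break
            await asyncio.sleep(poll_interval)
        return await gen_task
    except asyncio.CancelledError:
        # Cleanup (aborting the in-flight generation) would happen here
        raise
```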
turboderp
0409064028 Tools: Refactor and further simplify tool parsing
- remove ToolConfig, reduce to a single `tool_format` argument and hard-code extra args like start/end tokens
- dispatch to short, self-contained (and probably easily vibe coded) parser for each model type
- remove autodetection (seems infeasible since parsing effectively starts during streaming, and there is overlap between tool formats for different models)
- streamline xml parser and dedicate to qwen3_coder models
- add parsers for glm4.x, minimax-m2.x and mistral (seems shaky, probably because mistralai don't validate against hf)
- update docs
2026-04-01 00:07:44 +02:00
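As a rough illustration of the single `tool_format` dispatch described in the commit above (parser names and signatures are made up, not the project's real ones):

```python
from typing import Callable

def parse_qwen3_coder(text: str) -> list[dict]:
    """Streamlined XML-style parser (body omitted in this sketch)."""
    return []

def parse_glm4(text: str) -> list[dict]:
    return []

# One short, self-contained parser per model family, keyed by tool_format
TOOL_PARSERS: dict[str, Callable[[str], list[dict]]] = {
    "qwen3_coder": parse_qwen3_coder,
    "glm4": parse_glm4,
}

def parse_tool_calls(tool_format: str, text: str) -> list[dict]:
    if tool_format not in TOOL_PARSERS:
        raise ValueError(f"Unknown tool_format: {tool_format}")
    return TOOL_PARSERS[tool_format](text)
```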
turboderp
179479199b Rework tool calls and OAI chat completions
- move tool config from template_vars to separate yml config
- new per-gen stream collector used for both streaming and non-streaming requests to ensure logic is consistent for both
- move responsibility for switching between phases to stream collector
- collect tool calls during streaming and parse at the end of each gen
- prevent streaming empty content spans (be nice to clients)
- correctly aggregate usage stats for n>1 requests, always emitting them with the last chunk of the last gen to finish
- collect logprobs in model wrapper and correctly handle logprobs for multi-token chars etc.
- respect top_logprobs argument in request
- handle a number of edge cases like <think> tag being part of held string, etc.
- retain tool parsing and inference-abort fixes from #413, apply similar fix to non-stream request as well

Still TODO:
- testing and validation with more models and tool schemas (tested on Qwen so far)
- enable JSON constraint for JSON tool models
- possibly some pydantification
- documentation
2026-03-30 00:22:55 +02:00
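A loose sketch of the per-gen stream collector idea from the commit above, with invented names and only a fraction of the behavior it describes (held strings, logprobs and n>1 usage aggregation are omitted):

```python
class StreamCollector:
    """Collects one generation's stream so streaming and non-streaming
    requests share the same logic."""

    def __init__(self, parse_tools):
        self.parse_tools = parse_tools  # callable: full text -> tool call list
        self.full_text = ""

    def add_delta(self, delta: str) -> str | None:
        """Accumulate a streamed delta; return it for emission, or None so
        empty content spans are never streamed to the client."""
        self.full_text += delta
        return delta if delta else None

    def finish(self) -> dict:
        """At the end of the gen, parse tool calls from the collected text."""
        return {
            "content": self.full_text,
            "tool_calls": self.parse_tools(self.full_text),
        }
```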
turboderp
aa54098f26 Ruff: Format (line length) 2026-03-30 00:19:07 +02:00
turboderp
2a1503b283 Logging: Use debug level for Seq instead of verbose 2026-03-29 18:51:57 +02:00
turboderp
f3787de6a6 Ruff: Format 2026-03-27 21:47:24 +01:00
turboderp
83127ab4f8 Logging: Log messages via Seq wrapper 2026-03-27 21:38:47 +01:00
turboderp
8b1bfeaba7 Model: Make sure reasoning tokens are always defined 2026-03-20 20:41:44 +01:00
turboderp
6bccc70d94 Tree: Formatting 2026-03-18 03:29:15 +01:00
turboderp
d2117a7c3b Config: Pass reasoning settings in kwargs, allow for overrides via tabby_config.yml 2026-03-18 00:24:22 +01:00
turboderp
6bf3670372 Model: Correctly read max_position_embeddings in nested config
Rework how max_seq_len is determined from user settings, model defaults and cache size constraint
2026-03-17 02:58:47 +01:00
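A hedged sketch of that resolution order; the `text_config` nesting and key names are assumptions based on common HF config layouts, not the exact code:

```python
def resolve_max_seq_len(user_value: int | None, hf_config: dict, cache_size: int) -> int:
    """Determine max_seq_len from user settings, model defaults and the cache size."""
    if user_value is not None:
        max_seq_len = user_value  # explicit user setting wins
    else:
        # The model default may live in a nested section of config.json
        nested = hf_config.get("text_config", hf_config)
        max_seq_len = nested.get("max_position_embeddings", 4096)
    # Constrain to the cache size
    return min(max_seq_len, cache_size)
```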
turboderp
fece4791ad exllamav2: Make sure cache size is set in unpaged mode 2025-11-06 21:03:24 +01:00
turboderp
486dd0418e Formatting 2025-10-15 10:47:58 +02:00
turboderp
0af29d957a Fix #390 2025-10-15 10:40:19 +02:00
kingbri
6f73a0b388 Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-10-14 23:06:20 -04:00
kingbri
fdb86f4c63 ExllamaV2: Add max_seq_len empty case like ExllamaV3
Also remove the intermediate base_seq_len and target_seq_len variables
to make code clearer.

If paged mode is off, max_seq_len becomes the prime mover since batching
is unavailable.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-10-14 23:02:52 -04:00
turboderp
04ca346732 Fix formatting 2025-10-14 03:11:59 +02:00
turboderp
4235f98e83 Model: Change cache_size/max_seq_len behavior
- Cache size is now given only by the cache_size config option. Default is 4096 (user should always override to max out VRAM)
- max_seq_len, if not overridden in the config, will default to the model's config.json
- max_seq_len is reduced to be no larger than the cache
2025-10-05 22:16:01 +02:00
kingbri
0b4ca567f8 API: Persist request IDs and append full_text to finish chunk
Adding these to each generation chunk helps remove redundancy and
unnecessary request ID operations.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-25 12:27:44 -04:00
kingbri
a02d39de31 Model: Remove rogue print
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-17 23:09:07 -04:00
kingbri
2913ce29fc API: Add timings to usage stats
It's useful for the client to know the T/s and total generation time
per request.

Works with both completions and chat completions.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-17 22:54:51 -04:00
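For example, the per-request timing fields might be derived roughly like this (the key names are illustrative, not necessarily the ones the API emits):

```python
completion_tokens = 256  # example: tokens generated in this request
total_time = 3.2         # example: seconds measured around the generation loop

usage_extras = {
    "total_time": round(total_time, 3),
    "tokens_per_second": round(completion_tokens / total_time, 2),  # ~80 T/s
}
```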
kingbri
5d94d4d022 Merge branch 'main' into breaking 2025-06-17 22:24:32 -04:00
turboderp
1c9891bf04 Exl3: Add vision capability 2025-06-15 19:22:51 +02:00
turboderp
4605c0f6bd Common: Refactor get_image to common functions 2025-06-15 19:20:36 +02:00
turboderp
a0c16bba2a Exl2: Fix banned_strings (move outside of assign_gen_params) 2025-06-15 16:51:42 +02:00
kingbri
2096c9bad2 Model: Default max_seq_len to 4096
A common problem in TabbyAPI is that users who want to get up and
running with a model often hit OOMs caused by max_seq_len. This is
because model devs set max context values in the millions, which
require a lot of VRAM.

To idiot-proof first-time setup, make the fallback default 4096 so
users can run their models. If a user still wants to use the model's
max_seq_len, set it to -1.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-13 14:57:24 -04:00
turboderp
691a080ac7 Dependencies: Bump ExllamaV3 and ExllamaV2 2025-05-31 23:55:04 +02:00
kingbri
17f3dca6fc Packaging: Add agnostic method to check version of packages
Some packages such as ExllamaV2 and V3 require specific versions for
the latest features. Rather than creating repetitive functions, create
an agnostic function that checks the installed package version and
prompts the user to upgrade if it is too old.

This error is also returned from load and unload requests, so keep it
short.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 01:04:24 -04:00
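A minimal sketch of such an agnostic check, assuming the standard library's importlib.metadata and the packaging library; the function name and error wording are invented:

```python
from importlib.metadata import version, PackageNotFoundError
from packaging.version import parse


def check_package_version(package: str, required: str) -> None:
    """Raise a short, user-facing error if `package` is missing or too old."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        raise RuntimeError(f"{package} is not installed. Please install >= {required}.")
    if parse(installed) < parse(required):
        raise RuntimeError(
            f"{package} {installed} is too old. Please upgrade to >= {required}."
        )
```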
kingbri
0858b6d4b2 Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 00:46:40 -04:00
kingbri
390daeb92f Model: Create universal HFModel class
The HFModel class serves to coalesce all config files that contain
random keys which are required for model usage.

Adding this base class allows us to expand as HuggingFace randomly
changes their JSON schemas over time, reducing the burden on backend
devs when their next model isn't supported.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-13 18:12:38 -04:00
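A loose sketch of what such a coalescing class could look like; the file list and method names are guesses, not the actual HFModel:

```python
import json
from pathlib import Path


class HFModel:
    """One place to look up keys scattered across a model's HF config files."""

    CONFIG_FILES = ("config.json", "generation_config.json", "tokenizer_config.json")

    def __init__(self, model_dir: str):
        self.values: dict = {}
        for name in self.CONFIG_FILES:
            path = Path(model_dir) / name
            if path.exists():
                self.values.update(json.loads(path.read_text()))

    def get(self, key: str, default=None):
        return self.values.get(key, default)
```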
kingbri
bd3fec929c Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 11:32:27 -04:00
kingbri
a524ac3c0f Model: Fix cache mode again
If statements can be difficult to work with.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 11:30:47 -04:00
kingbri
20cad851e9 Model: Fix param call
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 09:52:28 -04:00
kingbri
d15eb55f20 Model: Fix exl2 cache mode check
FP16 was not included in the validation step.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 09:51:09 -04:00
kingbri
656af41b5d Model: Always enable decode_special_tokens
The frontend should handle the special tokens if they get emitted.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-09 22:25:50 -04:00
kingbri
42346c6b39 Sampling: Remove skip_special_tokens
This parameter is way too confusing and does not make sense in
the modern LLM space.

Change approved by all maintainers.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-09 22:11:33 -04:00
kingbri
25c77ebf77 Model: Remove exllamav2-specific version check
No longer necessary thanks to the agnostic check.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-09 22:08:15 -04:00
DocShotgun
9dcde59c57 Model: Check for unsupported cache mode in exllamav2 2025-05-06 01:18:15 -07:00
DocShotgun
68a660bdb3 Model: Initial Exl3 cache quantization support 2025-05-03 20:35:35 -07:00
kingbri
e8f00412f6 Model: Fetch from generation_config and tokenizer_config in Exl3
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00
kingbri
bdc5189a4b Exl3: Add chunk size, cache size, and model info
Use the same algorithm for estimating and adjusting the cache size:
round it to a multiple of 256 at or above max_seq_len.

The same applies to chunk size.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00
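In other words, roughly (a sketch of the rounding rule, not the exact implementation):

```python
def adjust_cache_size(cache_size: int, max_seq_len: int) -> int:
    """Round the cache size up to a multiple of 256 at or above max_seq_len."""
    adjusted = max(cache_size, max_seq_len)
    return -(-adjusted // 256) * 256  # ceil to the next multiple of 256
```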
randoentity
daae9ec43d Exl3: Couldn't wait
Just copied some stuff around and it ended up working for basic use.
2025-05-02 21:33:25 -04:00
kingbri
0c1d794390 Model: Add exl3 and associated load functions
Initial exl3 compat and loading functionality.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:32:39 -04:00
kingbri
242f6b7d2a Model: Simplify add_bos_token handling
Set add_bos_token to True by default in the tokenizer_config stub.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:32:28 -04:00
kingbri
4cb3e5d5b1 Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 00:23:15 -04:00
kingbri
47cb2a0de9 Model: Add TokenizerConfig stub and add_eos_token fallback
This stub fetches the add_eos_token field from the HF tokenizer config.
Ideally, this should be in the backend rather than tabby.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 00:08:01 -04:00
kingbri
aa657fa6e9 API: Ignore add_bos_token in chat completions
When fetching special tokens from the model, don't factor in the
add_bos_token and ban_eos_token parameters as switches.

In addition, change the internal handling of add_bos_token to an optional
boolean. This allows us to fall back to the model when selecting whether
or not to add the BOS token, especially for chat completions.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-01 22:51:15 -04:00
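The optional-boolean fallback described above could look roughly like this (names are assumptions):

```python
def resolve_add_bos(request_value: bool | None, model_default: bool) -> bool:
    """None means the client didn't specify add_bos_token: defer to the model's
    tokenizer_config, which matters most for chat completions."""
    return model_default if request_value is None else request_value
```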