Commit Graph

352 Commits

Author SHA1 Message Date
turboderp
f1a2416da5 OAI endpoints: Add option to suppress header after reasoning start token (e.g. Gemma4's "thought\n") 2026-04-12 04:12:53 +02:00
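A minimal sketch of the kind of header suppression this commit describes: once the reasoning start token has been streamed, an immediately following fixed header (e.g. "thought\n") is dropped. The token/header strings and the generator shape here are assumptions, not tabbyAPI's actual code.

```python
# Hypothetical filter: strip a fixed header if it directly follows the
# reasoning start token in a decoded text stream.
def suppress_reasoning_header(chunks, start_token="<think>", header="thought\n"):
    pending = ""
    strip_next = False
    for chunk in chunks:
        if strip_next:
            pending += chunk
            if len(pending) >= len(header):
                rest = pending[len(header):] if pending.startswith(header) else pending
                strip_next, pending = False, ""
                if rest:
                    yield rest
            continue
        yield chunk
        if chunk.endswith(start_token):
            strip_next = True

print(list(suppress_reasoning_header(["<think>", "thought\n", "step 1..."])))
# ['<think>', 'step 1...']
```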
turboderp
55124d0fc6 Config: Add force_enable_thinking 2026-04-10 00:16:40 +02:00
turboderp
79d581e1f5 OAI endpoints: More rework
- remove disconnect_task
- move disconnect logic to a per-request handler that wraps cleanup operation and directly polls the request state with throttling
- exclusively signal disconnect with CancelledError
- rework completions endpoint to follow same approach as chat completions, share some code
- refactor OAI endpoints a bit
- correct behavior for batched completion requests
- make sure logprobs work for completion and streaming completion requests
- more tests
2026-04-02 01:26:44 +02:00
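A rough sketch of the per-request disconnect handling described above, assuming a Starlette/FastAPI Request object: the request state is polled with throttling and the generation task is cancelled, so a disconnect only ever surfaces as CancelledError. The structure and polling interval are illustrative, not the actual tabbyAPI implementation.

```python
import asyncio
from starlette.requests import Request

async def run_with_disconnect_watch(request: Request, gen_coro, poll_interval: float = 0.5):
    """Run a generation coroutine; cancel it if the client disconnects."""
    gen_task = asyncio.create_task(gen_coro)

    async def watch():
        # Throttled poll of the request state instead of a dedicated disconnect task
        while not gen_task.done():
            if await request.is_disconnected():
                gen_task.cancel()
                return
            await asyncio.sleep(poll_interval)

    watch_task = asyncio.create_task(watch())
    try:
        # A disconnect is signalled exclusively as CancelledError from this await
        return await gen_task
    finally:
        watch_task.cancel()
```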
turboderp
0409064028 Tools: Refactor and further simplify tool parsing
- remove ToolConfig, reduce to a single `tool_format` argument and hard-code extra args like start/end tokens
- dispatch to a short, self-contained (and probably easily vibe-coded) parser for each model type
- remove autodetection (seems infeasible since parsing effectively starts during streaming, and there is overlap between tool formats for different models)
- streamline xml parser and dedicate to qwen3_coder models
- add parsers for glm4.x, minimax-m2.x and mistral (seems shaky, probably because mistralai don't validate against hf)
- update docs
2026-04-01 00:07:44 +02:00
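The single `tool_format` argument dispatching to short per-model parsers might look roughly like the table below; the parser names, signatures and stub bodies are hypothetical.

```python
from typing import Callable

# Hypothetical per-format parsers; each would return a list of tool-call dicts.
def parse_qwen3_coder(text: str) -> list[dict]:
    return []  # XML-style parsing dedicated to qwen3_coder models

def parse_glm4(text: str) -> list[dict]:
    return []

def parse_minimax_m2(text: str) -> list[dict]:
    return []

def parse_mistral(text: str) -> list[dict]:
    return []

TOOL_PARSERS: dict[str, Callable[[str], list[dict]]] = {
    "qwen3_coder": parse_qwen3_coder,
    "glm4": parse_glm4,
    "minimax_m2": parse_minimax_m2,
    "mistral": parse_mistral,
}

def parse_tool_calls(tool_format: str, text: str) -> list[dict]:
    try:
        return TOOL_PARSERS[tool_format](text)
    except KeyError:
        raise ValueError(f"Unknown tool_format: {tool_format}") from None
```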
turboderp
b6428b1676 Seq: Allow longer strings in log 2026-03-31 18:18:07 +02:00
turboderp
41ed1e4881 Seq: Sanitize extra log data 2026-03-30 03:36:30 +02:00
turboderp
a035bc9e94 Model: Fix regression 2026-03-30 02:37:27 +02:00
turboderp
357eebffd2 Logger: Fix invalid escape sequence (gave syntax warning) 2026-03-30 00:33:01 +02:00
turboderp
179479199b Rework tool calls and OAI chat completions
- move tool config from template_vars to separate yml config
- new per-gen stream collector used for both streaming and non-streaming requests to ensure logic is consistent for both
- move responsibility for switching between phases to stream collector
- collect tool calls during streaming and parse at the end of each gen
- prevent streaming empty content spans (be nice to clients)
- correctly aggregate usage stats for n>1 requests, always emit with last chunk in last gen to finish
- collect logprobs in model wrapper and correctly handle logprobs for multi-token chars etc.
- respect top_logprobs argument in request
- handle a number of edge cases like <think> tag being part of held string, etc.
- retain tool parsing and inference-abort fixes from #413, apply similar fix to non-stream request as well

Still TODO:
- testing and validation with more models and tool schemas (tested on Qwen so far)
- enable JSON constraint for JSON tool models
- possibly some pydantification
- documentation
2026-03-30 00:22:55 +02:00
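A toy version of the stream-collector idea from this commit: tool-call text is buffered during streaming and parsed only once the gen finishes, and empty content spans are never emitted. The class name, the tool start marker, and the single-marker assumption are all illustrative.

```python
class StreamCollector:
    """Illustrative collector shared by streaming and non-streaming paths."""

    def __init__(self, tool_start: str = "<tool_call>"):
        self.tool_start = tool_start
        self.in_tool_phase = False
        self.tool_buffer: list[str] = []

    def feed(self, chunk: str) -> str | None:
        """Return text to stream to the client, or None to hold the chunk back."""
        if not self.in_tool_phase and self.tool_start in chunk:
            content, _, tool_part = chunk.partition(self.tool_start)
            self.in_tool_phase = True
            self.tool_buffer.append(tool_part)
            return content or None      # never stream an empty content span
        if self.in_tool_phase:
            self.tool_buffer.append(chunk)
            return None
        return chunk or None

    def finish(self, parse) -> list[dict]:
        """Parse the collected tool-call text once at the end of the gen."""
        return parse("".join(self.tool_buffer)) if self.tool_buffer else []
```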
turboderp
aa54098f26 Ruff: Format (line length) 2026-03-30 00:19:07 +02:00
turboderp
2a1503b283 Logging: Use debug level for Seq instead of verbose 2026-03-29 18:51:57 +02:00
turboderp
56378b946d Merge branch 'fork/devnen/full-tool-calling-support' into main_seqlog
# Conflicts:
#	common/templating.py
#	endpoints/OAI/utils/chat_completion.py
#	endpoints/OAI/utils/tools.py
2026-03-28 01:06:54 +01:00
turboderp
f3787de6a6 Ruff: Format 2026-03-27 21:47:24 +01:00
turboderp
83127ab4f8 Logging: Log messages via Seq wrapper 2026-03-27 21:38:47 +01:00
turboderp
c32a628917 Logging: Add Seq wrapper 2026-03-27 21:38:47 +01:00
turboderp
da3d3338e8 Logging: Fix env var parsing, formatting 2026-03-27 02:31:36 +01:00
turboderp
a3eabecf39 Logging: Add TABBY_LOG_CONSOLE_WIDTH to enable wider console log 2026-03-27 01:30:13 +01:00
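Parsing that environment variable could look like the snippet below; the fallback behavior on missing or invalid values is an assumption.

```python
import os

def console_log_width(default: int | None = None) -> int | None:
    """Read TABBY_LOG_CONSOLE_WIDTH, falling back to the default on bad input."""
    raw = os.environ.get("TABBY_LOG_CONSOLE_WIDTH")
    if raw is None:
        return default
    try:
        width = int(raw)
    except ValueError:
        return default
    return width if width > 0 else default
```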
turboderp
803ca5c681 Tree: Format 2026-03-20 20:56:43 +01:00
turboderp
0d577b8121 Cleanup and formatting 2026-03-20 01:27:29 +01:00
turboderp
6bccc70d94 Tree: Formatting 2026-03-18 03:29:15 +01:00
turboderp
8eb6c65008 Merge branch 'main' into fork/Orion-zhen/feat_reasoning
# Conflicts:
#	config_sample.yml
2026-03-17 23:05:19 +01:00
turboderp
c2452414e1 Model: Ignore inline load requests if the requested model is already loaded 2026-03-17 03:00:27 +01:00
turboderp
6bf3670372 Model: Correctly read max_position_embeddings in nested config
Rework how max_seq_len is determined from user settings, model defaults and cache size constraint
2026-03-17 02:58:47 +01:00
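Reading max_position_embeddings from a nested HF config (multimodal models often keep the text settings one level down, e.g. under text_config) might look like this sketch; the fallback search is illustrative rather than the actual lookup tabbyAPI performs.

```python
import json

def read_max_position_embeddings(config_path: str) -> int | None:
    """Look for max_position_embeddings at the top level or one level down."""
    with open(config_path) as f:
        config = json.load(f)

    if "max_position_embeddings" in config:
        return config["max_position_embeddings"]

    # Multimodal configs often nest the text-model settings (e.g. "text_config")
    for value in config.values():
        if isinstance(value, dict) and "max_position_embeddings" in value:
            return value["max_position_embeddings"]
    return None
```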
devnen
a2c7d81686 Broader model compatibility, tool_choice support, bug fixes and cleanup 2026-02-14 16:19:59 +01:00
devnen
87bbe0fac2 Full tool-calling support: XML parsing, streaming compliance, Pydantic fix, inference abort fix 2026-02-14 14:26:57 +01:00
turboderp
54e3ea1fb3 Tree: Format 2026-01-20 22:57:36 +01:00
turboderp
0985c7f7b7 Sampling: Add adaptive-P params 2026-01-20 19:09:54 +01:00
kingbri
ad64942fa1 Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-10-14 23:49:13 -04:00
kingbri
f205349c81 Config: Fix use_as_default application
Apply the default overrides after inline config has been merged.

Do not require an inline config to apply use_as_default and other
overrides.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-10-14 23:45:39 -04:00
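One reading of the ordering described above (inline config merged first, use_as_default values filling in afterwards whether or not an inline config exists) could be sketched like this; the function and parameter names are hypothetical.

```python
def resolve_load_params(use_as_default: dict, inline_config: dict | None,
                        request_params: dict) -> dict:
    """Hypothetical merge: request args win, then inline config, then defaults."""
    merged = dict(use_as_default)       # defaults apply even without an inline config
    if inline_config:
        merged.update(inline_config)    # inline config overrides the defaults
    merged.update(request_params)       # explicit request params win last
    return merged
```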
kingbri
69a25d7fa6 Config + Endpoints: Make cache_size more prominent
Since cache_size is a more important parameter now for multi-user
setups, mark it as such by placing it below max_seq_len.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-10-14 21:53:33 -04:00
kingbri
62e9fa217a ExllamaV3: Handle max_seq_len defined and cache_size undefined case
The previous changes broke existing configs: max_seq_len was
force-overridden to 4096. This helps single-user setups, since they
do not really benefit from the split cache_size/max_seq_len mechanism
(except when batching).

cache_size is still the prime mover in exl3 due to its paging mechanism.
Ideally, for multi-user setups, cache_size should take as much VRAM
as possible and max_seq_len should be limited.

Breakdown:
cache_size and max_seq_len specified -> values
only one of cache_size/max_seq_len specified -> the other takes the same value
neither specified -> cache_size = 4096, max_seq_len = min(max_position_embeddings, cache_size)

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-10-14 21:48:36 -04:00
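The breakdown above, written out as a function; 4096 and the min() clamp are taken directly from the commit message, while the signature itself is illustrative.

```python
def resolve_cache_and_seq_len(cache_size: int | None, max_seq_len: int | None,
                              max_position_embeddings: int) -> tuple[int, int]:
    """Return (cache_size, max_seq_len) per the rules in the commit message."""
    if cache_size is not None and max_seq_len is not None:
        return cache_size, max_seq_len          # both specified -> use the values
    if cache_size is not None:                  # only cache_size specified
        return cache_size, cache_size
    if max_seq_len is not None:                 # only max_seq_len specified
        return max_seq_len, max_seq_len
    cache_size = 4096                           # neither specified
    return cache_size, min(max_position_embeddings, cache_size)
```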
turboderp
8abdfe7b13 Config: replace disable_output_chunking flag with output_chunking 2025-10-14 02:47:52 +02:00
turboderp
4235f98e83 Model: Change cache_size/max_seq_len behavior
- Cache size is now given only by the cache_size config option. Default is 4096 (user should always override to max out VRAM)
- max_seq_len, if not overridden in the config, will default to the model's config.json
- max_seq_len is reduced to be no larger than the cache
2025-10-05 22:16:01 +02:00
turboderp
52e093ae6c Model: Enable max_rq_tokens (output chunking) 2025-10-05 18:54:45 +02:00
kingbri
067d63773e Config: Move sampling higher in the list
This has become a bigger priority with the addition of the safe_defaults
noob-proofing.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-18 22:55:03 -04:00
DocShotgun
6fb0c2cdbd Config: Update description for override_preset default
* We provide safe_defaults as a default in config_sample.yml but not internally
2025-08-18 12:39:52 -07:00
DocShotgun
998abe5ad1 Config: Enable safe sampler overrides by default
* Provides safe fallback samplers, intended for better out-of-the-box support for clients that do not pass sampler params
2025-08-18 12:32:28 -07:00
kingbri
43f9483bc4 Model: Add tensor_parallel_backend option
This allows users to choose nccl or native depending on the GPU setup.
NCCL is only available with Linux-built wheels.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-17 22:35:10 -04:00
DocShotgun
81a115b781 Templating: Support chat_template.jinja 2025-08-03 16:10:08 -07:00
DocShotgun
102af306e5 Config: Remove developer arg cuda_malloc_backend
* cudaMallocAsync is now enabled by default on supported configurations
2025-08-01 10:59:13 -07:00
kingbri
879f4cee7e API: Modify tool calling for wider compat
When revisiting tool calls, the formats have more or less become standard.
For greater compatibility with templates, primarily use the message.tools
parameter and remove the extra custom metadata that is no longer required.

However, unlike other backends, tabbyAPI still uses template metadata
to declare what the tool start string is. This allows for template-level
customization and gives more power to the user, while the server simply
consumes the template rather than handling models on a case-by-case basis.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-05 14:28:12 -04:00
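A loose illustration of consuming the tools list directly in the template rather than through custom metadata; the Jinja snippet and message shapes are invented for the example, not taken from tabbyAPI's templates.

```python
from jinja2 import Template

# Invented minimal chat template: tools are rendered from the standard
# `tools` variable rather than injected via extra server-side metadata.
CHAT_TEMPLATE = Template(
    "{% if tools %}Available tools:\n"
    "{% for tool in tools %}- {{ tool['function']['name'] }}\n{% endfor %}"
    "{% endif %}"
    "{% for m in messages %}{{ m['role'] }}: {{ m['content'] }}\n{% endfor %}"
)

prompt = CHAT_TEMPLATE.render(
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=[{"type": "function", "function": {"name": "get_weather"}}],
)
print(prompt)
```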
kingbri
d23fefbecd API + Model: Fix application of defaults
use_as_default was not being properly applied to model overrides.
For compartmentalization's sake, apply all overrides in a single function
to avoid clutter.

In addition, fix where the traditional /v1/model/load endpoint checks
for draft options. These can be applied via an inline config, so let
any failures fall through.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-03 14:37:34 -04:00
kingbri
d339139fb6 Config: Deep merge model overrides
Anything below the first level of kwargs was not being merged properly.
A more bulletproof solution would be to refactor the loading code
to separate draft and normal model parameters.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-03 12:17:09 -04:00
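A generic deep merge of the kind this commit describes, shown with hypothetical nested draft options; a shallow dict.update() would drop keys that the override dict doesn't mention.

```python
def deep_merge(base: dict, overrides: dict) -> dict:
    """Merge nested dicts key by key instead of replacing whole sub-dicts."""
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {"draft": {"draft_model_name": "small-draft", "cache_mode": "Q4"}}
override = {"draft": {"draft_model_name": "tiny-draft"}}
# Keeps "cache_mode" from the defaults while applying the nested override
assert deep_merge(defaults, override)["draft"] == {
    "draft_model_name": "tiny-draft", "cache_mode": "Q4"}
```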
kingbri
0152a1665b Downloader: Switch to use API sizes
Rather than relying on Content-Length, which can be unreliable, query
the API for file sizes and work from there.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-30 12:49:53 -04:00
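Fetching sizes from the Hub API instead of Content-Length could be done roughly like this with huggingface_hub; the repo id is just an example and error handling is omitted.

```python
from huggingface_hub import HfApi

def repo_file_sizes(repo_id: str) -> dict[str, int | None]:
    """Map each file in the repo to the size reported by the Hub API."""
    info = HfApi().model_info(repo_id, files_metadata=True)
    return {sibling.rfilename: sibling.size for sibling in info.siblings}

# Example (requires network access):
# sizes = repo_file_sizes("Qwen/Qwen2.5-0.5B-Instruct")
```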
kingbri
03ff4c3128 Downloader: Handle if Content-Length is undefined
Usually, both the client and server know the file size because the
server sends a Content-Length header. However, HuggingFace has changed
its headers and no longer always sends Content-Length.

In this case, show an indeterminate progressbar and mark as complete
once the download finishes.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-30 11:43:22 -04:00
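A sketch of the fallback described above using requests and rich: if Content-Length is absent, the progress bar is created with an unknown total (rendered as indeterminate) and marked complete at the end. The chunk size and helper name are arbitrary.

```python
import requests
from rich.progress import Progress

def download(url: str, dest: str) -> None:
    """Stream a file; show an indeterminate bar when Content-Length is missing."""
    with requests.get(url, stream=True) as resp, open(dest, "wb") as f, Progress() as progress:
        resp.raise_for_status()
        length = resp.headers.get("Content-Length")
        # total=None renders an indeterminate (pulsing) bar in rich
        task = progress.add_task(dest, total=int(length) if length else None)
        written = 0
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)
            written += len(chunk)
            progress.update(task, advance=len(chunk))
        # Mark as complete once the download finishes, even if the size was unknown
        progress.update(task, total=written)
```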
kingbri
2913ce29fc API: Add timings to usage stats
It's useful for the client to know the T/s and total generation time
per request.

Works with both completions and chat completions.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-17 22:54:51 -04:00
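The usage payload with timings might look like the sketch below; the extra field names are illustrative, not necessarily the exact keys tabbyAPI emits.

```python
def usage_with_timings(prompt_tokens: int, completion_tokens: int,
                       total_time: float) -> dict:
    """OpenAI-style usage dict extended with per-request timing info."""
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
        # Non-standard extras: total generation time and tokens per second
        "total_time": round(total_time, 3),
        "tokens_per_second": round(completion_tokens / total_time, 2) if total_time else 0.0,
    }

print(usage_with_timings(128, 256, total_time=4.0))
# {'prompt_tokens': 128, 'completion_tokens': 256, 'total_tokens': 384,
#  'total_time': 4.0, 'tokens_per_second': 64.0}
```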
kingbri
5d94d4d022 Merge branch 'main' into breaking 2025-06-17 22:24:32 -04:00
turboderp
122d87ac36 Tree: Format 2025-06-15 19:33:14 +02:00
turboderp
21c5af48e1 Tree: Format 2025-06-15 19:30:38 +02:00
turboderp
1c9891bf04 Exl3: Add vision capability 2025-06-15 19:22:51 +02:00