184 Commits

Author SHA1 Message Date
turboderp
3e3d7ccd54 Tools: Add step3_5 alias (qwen3_coder tool format) 2026-04-18 19:55:34 +02:00
turboderp
ed41c51909 API: Prevent race condition when multiple chat requests try to inline-load the same model 2026-04-18 19:55:34 +02:00
turboderp
1a4896ce66 Tree: Format 2026-04-12 13:47:05 +02:00
turboderp
f1a2416da5 OAI endpoints: Add option to suppress header after reasoning start token (e.g. Gemma4's "thought\n") 2026-04-12 04:12:53 +02:00
turboderp
2636b445f0 Tree: Format 2026-04-12 03:33:14 +02:00
turboderp
b21100f971 ExLlamaV3: Fix disconnected request handling regression 2026-04-10 22:03:19 +02:00
mindkrypted
08f92167de Tools: Updated/fixed Gemma4 tool parser 2026-04-10 22:02:34 +02:00
turboderp
5517cb5b9e Templates: Revert add_bos_token fix 2026-04-10 03:53:58 +02:00
turboderp
7fedc179f0 Templates: Make sure add_bos_token=False is respected 2026-04-10 03:14:29 +02:00
turboderp
27d29209c6 Tools: Add Gemma4 parser 2026-04-10 00:16:58 +02:00
turboderp
55124d0fc6 Config: Add force_enable_thinking 2026-04-10 00:16:40 +02:00
turboderp
79d581e1f5 OAI endpoints: More rework
- remove disconnect_task
- move disconnect logic to a per-request handler that wraps cleanup operation and directly polls the request state with throttling
- exclusively signal disconnect with CancelledError
- rework completions endpoint to follow same approach as chat completions, share some code
- refactor OAI endpoints a bit
- correct behavior for batched completion requests
- make sure logprobs work for completion and streaming completion requests
- more tests
2026-04-02 01:26:44 +02:00
turboderp
c315f6b73e OAI endpoints: Correctly propagate exceptions in non-streaming mode 2026-04-01 12:27:07 +02:00
turboderp
455c09932f OAI endpoints: Fix regression for non-reasoning models 2026-04-01 00:08:39 +02:00
turboderp
0409064028 Tools: Refactor and further simplify tool parsing
- remove ToolConfig, reduce to a single `tool_format` argument and hard-code extra args like start/end tokens
- dispatch to short, self-contained (and probably easily vibe coded) parser for each model type
- remove autodetection (seems infeasible since parsing effectively starts during streaming, and there is overlap between tool formats for different models)
- streamline xml parser and dedicate to qwen3_coder models
- add parsers for glm4.x, minimax-m2.x and mistral (seems shaky, probably because mistralai don't validate against hf)
- update docs
2026-04-01 00:07:44 +02:00
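The single `tool_format` argument described in this commit suggests a plain dispatch table mapping each format name to a short, self-contained parser. The parser bodies below are placeholders, not the real per-model formats:

```python
import json

# Hypothetical sketch: one small parser per model family, selected by a
# single `tool_format` string instead of a ToolConfig object.
def parse_qwen3_coder(text):
    # Placeholder: real parsers handle each model's tool-call markup.
    return json.loads(text)

def parse_glm4(text):
    return json.loads(text)

TOOL_PARSERS = {
    "qwen3_coder": parse_qwen3_coder,
    "glm4": parse_glm4,
}

def parse_tool_calls(tool_format, raw_text):
    try:
        parser = TOOL_PARSERS[tool_format]
    except KeyError:
        raise ValueError(f"Unknown tool_format: {tool_format}") from None
    return parser(raw_text)
```

Dropping autodetection in favor of an explicit key also makes failures loud: an unrecognized format is an immediate error rather than a silent misparse.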
turboderp
112ab69002 Fix comments 2026-03-31 14:43:55 +02:00
turboderp
a7c7934ec3 Tool parsing: Include outer <tool_call> tags in raw text sent to parser 2026-03-30 04:05:15 +02:00
turboderp
02a700e065 ExLlamaV3: Limit MMEmbedding cache size 2026-03-30 03:35:46 +02:00
turboderp
9ee5ded218 OAI: Log raw requests 2026-03-30 01:23:16 +02:00
turboderp
179479199b Rework tool calls and OAI chat completions
- move tool config from template_vars to separate yml config
- new per-gen stream collector used for both streaming and non-streaming requests to ensure logic is consistent for both
- move responsibility for switching between phases to stream collector
- collect tool calls during streaming and parse at the end of each gen
- prevent streaming empty content spans (be nice to clients)
- correctly aggregate usage stats for n>1 requests, always emit with last chunk in last gen to finish
- collect logprobs in model wrapper and correctly handle logprobs for multi-token chars etc.
- respect top_logprobs argument in request
- handle a number of edge cases like <think> tag being part of held string, etc.
- retain tool parsing and inference-abort fixes from #413, apply similar fix to non-stream request as well

Still TODO:
- testing and validation with more models and tool schemas (tested on Qwen so far)
- enable JSON constraint for JSON tool models
- possibly some pydantification
- documentation
2026-03-30 00:22:55 +02:00
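The per-gen stream collector described above can be sketched as a small buffer that routes chunks to a content or tool-call phase and defers parsing until the generation finishes. Names and the `<tool_call>` marker here are illustrative assumptions:

```python
# Hypothetical sketch: a collector shared by streaming and non-streaming
# paths. It switches phase when the tool-call start tag appears, buffers
# tool-call text during streaming, and parses it once at the end.
class StreamCollector:
    def __init__(self, tool_start="<tool_call>"):
        self.tool_start = tool_start
        self.content = []
        self.tool_text = []
        self.in_tool_call = False

    def feed(self, chunk):
        """Route a streamed chunk to the content or tool-call buffer."""
        if not self.in_tool_call and self.tool_start in chunk:
            before, _, chunk = chunk.partition(self.tool_start)
            if before:
                self.content.append(before)
            self.in_tool_call = True
        (self.tool_text if self.in_tool_call else self.content).append(chunk)

    def finish(self, parse_fn):
        """Parse collected tool-call text at the end of the generation."""
        raw = "".join(self.tool_text)
        return "".join(self.content), (parse_fn(raw) if raw else None)
```

Running both request modes through one collector is what keeps the phase-switching logic consistent between streaming and non-streaming responses.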
turboderp
aa54098f26 Ruff: Format (line length) 2026-03-30 00:19:07 +02:00
turboderp
2a1503b283 Logging: Use debug level for Seq instead of verbose 2026-03-29 18:51:57 +02:00
turboderp
4b3c74782d Fix bad merge 2026-03-28 12:47:26 +01:00
turboderp
b4dfd2e86f Fix logging 2026-03-28 01:13:23 +01:00
turboderp
56378b946d Merge branch 'fork/devnen/full-tool-calling-support' into main_seqlog
# Conflicts:
#	common/templating.py
#	endpoints/OAI/utils/chat_completion.py
#	endpoints/OAI/utils/tools.py
2026-03-28 01:06:54 +01:00
turboderp
f3787de6a6 Ruff: Format 2026-03-27 21:47:24 +01:00
turboderp
83127ab4f8 Logging: Log messages via Seq wrapper 2026-03-27 21:38:47 +01:00
turboderp
40aa82da28 API: More robust test for whether generation starts in reasoning mode 2026-03-27 01:29:17 +01:00
turboderp
0d1a8ba784 API: Try to guess whether streaming response should start with content or reasoning_content 2026-03-21 01:11:01 +01:00
turboderp
0d577b8121 Cleanup and formatting 2026-03-20 01:27:29 +01:00
turboderp
6bccc70d94 Tree: Formatting 2026-03-18 03:29:15 +01:00
turboderp
d2117a7c3b Config: Pass reasoning settings in kwargs, allow for overrides via tabby_config.yml 2026-03-18 00:24:22 +01:00
turboderp
8eb6c65008 Merge branch 'main' into fork/Orion-zhen/feat_reasoning
# Conflicts:
#	config_sample.yml
2026-03-17 23:05:19 +01:00
devnen
a2c7d81686 Broader model compatibility, tool_choice support, bug fixes and cleanup 2026-02-14 16:19:59 +01:00
devnen
87bbe0fac2 Full tool-calling support: XML parsing, streaming compliance, Pydantic fix, inference abort fix 2026-02-14 14:26:57 +01:00
turboderp
d672dc2137 API: Fix race condition when client disconnects 2025-10-05 21:23:02 +02:00
kingbri
0b4ca567f8 API: Persist request IDs and append full_text to finish chunk
Adding these to each generation chunk helps remove redundancy and
unnecessary request ID operations.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-25 12:27:44 -04:00
kingbri
707d005aad API: Default tool call ID and type
Doing this helps reduce the model's burden of generating the tool
call ID and type (which is always "function"). Follow Mistral's spec
for tool call IDs by using a 9-character alphanumeric string.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-11 01:11:09 -04:00
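The defaults described in this commit are easy to sketch: a 9-character alphanumeric ID in the Mistral style, and a type that is always `"function"`. Function names here are hypothetical:

```python
import random
import string

# Hypothetical sketch: server-side defaults so the model doesn't have
# to generate the tool call ID or type itself.
def default_tool_call_id():
    """Mistral-style tool call ID: 9 alphanumeric characters."""
    return "".join(random.choices(string.ascii_letters + string.digits, k=9))

def default_tool_call(name, arguments):
    return {
        "id": default_tool_call_id(),
        "type": "function",  # the type is always "function"
        "function": {"name": name, "arguments": arguments},
    }
```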
kingbri
5b1db3ad83 API: Don't do a second re-render when tool calling
Re-rendering the template is an expensive operation when it's possible
to just concatenate the prompt and current generation text together.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-06 11:32:36 -04:00
kingbri
3dfa965019 API: Add tool_call_id for role = tool
If a message with role = tool is present, the tool_call_id should
also be given to the template.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-05 21:52:58 -04:00
kingbri
879f4cee7e API: Modify tool calling for wider compat
When revisiting tool calls, the formats have more or less become standard.
For greater compatibility with templates, primarily use the message.tools
parameter and remove the extra custom metadata that is no longer required.

However, unlike other backends, tabbyAPI still uses template metadata
to declare what the tool start string is. This allows for template-level
customization and gives more power to the user, while the server
consumes templates rather than handling models case by case.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-05 14:28:12 -04:00
kingbri
b6a26da50c API: Fix tool call serialization
To render tool call start tokens in the template, they needed fewer
checks, and the line converting message.tool_calls to a dict had to be
removed, since it broke the rest of the chain by disconnecting the types.
model_dump on the message itself already accomplishes the conversion.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-04 15:02:49 -04:00
kingbri
2913ce29fc API: Add timings to usage stats
It's useful for the client to know the tokens per second (T/s) and
total generation time per request.

Works with both completions and chat completions.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-17 22:54:51 -04:00
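The per-request timing stats described above amount to a simple computation over the generation window; field names below are illustrative, not the actual response schema:

```python
# Hypothetical sketch: per-request timing fields for the usage block,
# derived from token count and wall-clock start/end times.
def generation_timings(completion_tokens, start_time, end_time):
    total_time = end_time - start_time
    tokens_per_sec = completion_tokens / total_time if total_time > 0 else 0.0
    return {
        "total_time_sec": round(total_time, 3),
        "tokens_per_sec": round(tokens_per_sec, 2),
    }
```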
kingbri
2d89c96879 API: Re-add BOS token stripping in template render
Matching YALS, if the model has add_bos_token enabled, then remove
an extra BOS token at the start of the prompt. This usually happens
with misconfigured templates such as Llama 3.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-24 21:11:53 -04:00
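The BOS de-duplication described in this commit can be sketched as a check on the rendered prompt: if the tokenizer will add a BOS token itself and the template already emitted one, strip the duplicate. This is an illustrative sketch, not the tabbyAPI implementation:

```python
# Hypothetical sketch: drop a duplicate BOS token from the start of a
# rendered prompt when the tokenizer has add_bos_token enabled, since a
# misconfigured template (e.g. Llama 3) may have emitted one already.
def strip_leading_bos(prompt, bos_token, add_bos_token):
    if add_bos_token and bos_token and prompt.startswith(bos_token):
        return prompt[len(bos_token):]
    return prompt
```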
kingbri
10fbe043a4 API: Fix typing for chat templates in CC requests
Tools must be None by default. Chat completion message content can
be None, a string, or a list, so default to None. Exclude all None
values from a CC message since the template can say the variable
"exists" despite being None, causing an error.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-24 21:06:05 -04:00
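The None-exclusion described above can be sketched as a small conversion step before template render: dump the message (via `model_dump` if it is a Pydantic model) and drop None values so the template's "is defined" checks behave. The helper name is hypothetical:

```python
# Hypothetical sketch: convert a chat message to a plain dict for the
# Jinja template, excluding None values so the template doesn't treat a
# variable as "existing" when it is actually None.
def message_to_template_dict(message):
    data = message.model_dump() if hasattr(message, "model_dump") else dict(message)
    return {k: v for k, v in data.items() if v is not None}
```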
kingbri
54b8a20a19 API: Fix types for chat completions
Messages were mistakenly being sent as Pydantic objects, but templates
expect dictionaries. Properly convert these before render.

In addition, initialize all Optional lists as an empty list since
this will cause the least problems when interacting with other parts
of API code, such as templates.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 18:10:34 -04:00
kingbri
0858b6d4b2 Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 00:46:40 -04:00
kingbri
7900b72848 API: Add chat_template_kwargs alias for template_vars
This key is used in VLLM and SGLang.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 15:48:39 -04:00
Brian
b555eeb6e7 Merge pull request #339 from Maaaxiii/fix/tool-calling-embeddings
fix: Aligned Parameter Name in chat completions generate_tool_calls
2025-05-11 20:41:58 -04:00
kingbri
6379081dd8 Sampling: Make add_bos_token override concise
Also set the default to None so text completions follow the same
pattern.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-10 19:07:35 -04:00