81 Commits

Author SHA1 Message Date
turboderp
5b2b707af9 exllamav3: Account for bsz=2 in autosplit 2026-04-18 19:55:34 +02:00
turboderp
9ebbe06f29 exllamav3: Supply max_chunk_size when loading model 2026-04-18 13:20:12 +02:00
turboderp
f74f16a5c2 Config: Make recurrent cache size configurable 2026-04-17 02:40:22 +02:00
turboderp
f1a2416da5 OAI endpoints: Add option to suppress header after reasoning start token (e.g. Gemma4's "thought\n") 2026-04-12 04:12:53 +02:00
turboderp
b21100f971 ExLlamaV3: Fix disconnected request handling regression 2026-04-10 22:03:19 +02:00
turboderp
55124d0fc6 Config: Add force_enable_thinking 2026-04-10 00:16:40 +02:00
turboderp
79d581e1f5 OAI endpoints: More rework
- remove disconnect_task
- move disconnect logic to a per-request handler that wraps the cleanup operation and directly polls the request state with throttling
- exclusively signal disconnect with CancelledError
- rework completions endpoint to follow same approach as chat completions, share some code
- refactor OAI endpoints a bit
- correct behavior for batched completion requests
- make sure logprobs work for completion and streaming completion requests
- more tests
2026-04-02 01:26:44 +02:00
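The per-request disconnect handling described in this commit might look roughly like the sketch below. This is an illustrative reconstruction, not the repo's actual code; `request.is_disconnected()` is assumed to be a Starlette-style async check, and `with_disconnect_handling` and `poll_interval` are hypothetical names:

```python
import asyncio

async def with_disconnect_handling(request, gen_coro, poll_interval=0.5):
    """Run gen_coro, cancelling it if the client disconnects.

    Disconnect is signalled to callers exclusively via CancelledError,
    matching the approach described in the commit message.
    """
    gen_task = asyncio.ensure_future(gen_coro)

    async def poll_disconnect():
        # Throttled polling: check the request state at most once per interval
        while not await request.is_disconnected():
            await asyncio.sleep(poll_interval)
        gen_task.cancel()

    poll_task = asyncio.ensure_future(poll_disconnect())
    try:
        return await gen_task
    finally:
        poll_task.cancel()
```

Cancelling the generation task (rather than setting flags) lets ordinary `try/finally` cleanup run along the whole call chain, which is presumably why CancelledError was chosen as the single disconnect signal.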
turboderp
0409064028 Tools: Refactor and further simplify tool parsing
- remove ToolConfig, reduce to a single `tool_format` argument and hard-code extra args like start/end tokens
- dispatch to short, self-contained (and probably easily vibe coded) parser for each model type
- remove autodetection (seems infeasible since parsing effectively starts during streaming, and there is overlap between tool formats for different models)
- streamline xml parser and dedicate to qwen3_coder models
- add parsers for glm4.x, minimax-m2.x and mistral (seems shaky, probably because mistralai don't validate against hf)
- update docs
2026-04-01 00:07:44 +02:00
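The single-`tool_format`-argument dispatch described above could be sketched as a simple registry keyed by format name, with no autodetection. All names here (`TOOL_PARSERS`, `register_parser`, `parse_tool_calls`) are hypothetical, not the repo's API:

```python
from typing import Callable

# Registry mapping a tool_format string to a short, self-contained parser
TOOL_PARSERS: dict[str, Callable[[str], list[dict]]] = {}

def register_parser(tool_format: str):
    """Decorator that registers a parser for one model family."""
    def deco(fn):
        TOOL_PARSERS[tool_format] = fn
        return fn
    return deco

@register_parser("qwen3_coder")
def parse_qwen3_coder(text: str) -> list[dict]:
    # A real parser would walk the model's XML tool-call markup;
    # this stub only shows the dispatch shape.
    return [{"raw": text}]

def parse_tool_calls(tool_format: str, text: str) -> list[dict]:
    # No autodetection: an unknown format is an explicit error
    parser = TOOL_PARSERS.get(tool_format)
    if parser is None:
        raise ValueError(f"Unknown tool_format: {tool_format}")
    return parser(text)
```

Keeping each parser small and independent matches the commit's rationale: formats overlap between models, so an explicit `tool_format` setting is more reliable than guessing from the stream.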
turboderp
179479199b Rework tool calls and OAI chat completions
- move tool config from template_vars to separate yml config
- new per-gen stream collector used for both streaming and non-streaming requests to ensure logic is consistent for both
- move responsibility for switching between phases to stream collector
- collect tool calls during streaming and parse at the end of each gen
- prevent streaming empty content spans (be nice to clients)
- correctly aggregate usage stats for n>1 requests, always emit with last chunk in last gen to finish
- collect logprobs in model wrapper and correctly handle logprobs for multi-token chars etc.
- respect top_logprobs argument in request
- handle a number of edge cases like <think> tag being part of held string, etc.
- retain tool parsing and inference-abort fixes from #413, apply similar fix to non-stream request as well

Still TODO:
- testing and validation with more models and tool schemas (tested on Qwen so far)
- enable JSON constraint for JSON tool models
- possibly some pydantification
- documentation
2026-03-30 00:22:55 +02:00
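The per-gen stream collector described in this commit could be sketched as below: one object that both streaming and non-streaming paths feed chunks into, that suppresses empty content spans, and that parses tool calls once at the end of the generation. Class and method names are illustrative assumptions, not the repo's actual interface:

```python
from typing import Callable, Optional

class StreamCollector:
    """Minimal sketch of a per-generation stream collector."""

    def __init__(self, tool_parser: Optional[Callable[[str], list]] = None):
        self.parts: list[str] = []
        self.tool_parser = tool_parser

    def feed(self, chunk: str) -> Optional[str]:
        # Never stream empty content spans to clients
        if not chunk:
            return None
        self.parts.append(chunk)
        return chunk

    def finish(self) -> tuple[str, list]:
        # Tool calls are collected during streaming but parsed only once,
        # at the end of the generation
        text = "".join(self.parts)
        tools = self.tool_parser(text) if self.tool_parser else []
        return text, tools
```

Routing both request styles through one collector is what keeps streaming and non-streaming behavior consistent, per the commit's stated goal.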
turboderp
aa54098f26 Ruff: Format (line length) 2026-03-30 00:19:07 +02:00
turboderp
2a1503b283 Logging: Use debug level for Seq instead of verbose 2026-03-29 18:51:57 +02:00
turboderp
56378b946d Merge branch 'fork/devnen/full-tool-calling-support' into main_seqlog
# Conflicts:
#	common/templating.py
#	endpoints/OAI/utils/chat_completion.py
#	endpoints/OAI/utils/tools.py
2026-03-28 01:06:54 +01:00
turboderp
f3787de6a6 Ruff: Format 2026-03-27 21:47:24 +01:00
turboderp
83127ab4f8 Logging: Log messages via Seq wrapper 2026-03-27 21:38:47 +01:00
turboderp
da3d3338e8 Logging: Fix env var parsing, formatting 2026-03-27 02:31:36 +01:00
turboderp
ffca853d4c ExLlamaV3: Force minimum rep_decay of 1 token, pending update to backend 2026-03-22 14:51:08 +01:00
turboderp
92cb48c38d ExLlamaV3: Fix regression in max_seq_len limit 2026-03-22 00:34:47 +01:00
turboderp
088e196cbc ExLlamaV3: Change cache size fallback value to max_seq_len, add warning to configure manually 2026-03-20 20:42:14 +01:00
turboderp
8b1bfeaba7 Model: Make sure reasoning tokens are always defined 2026-03-20 20:41:44 +01:00
turboderp
78c5993c27 ExLlamaV3: Correctly report when vision is supported but not enabled 2026-03-20 01:33:38 +01:00
turboderp
0d577b8121 Cleanup and formatting 2026-03-20 01:27:29 +01:00
turboderp
6bccc70d94 Tree: Formatting 2026-03-18 03:29:15 +01:00
turboderp
d2117a7c3b Config: Pass reasoning settings in kwargs, allow for overrides via tabby_config.yml 2026-03-18 00:24:22 +01:00
turboderp
6bf3670372 Model: Correctly read max_position_embeddings in nested config
Rework how max_seq_len is determined from user settings, model defaults and cache size constraint
2026-03-17 02:58:47 +01:00
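Reading `max_position_embeddings` from a nested HF config, as this commit describes, might look like the sketch below. The fallback-scan of sub-dicts (e.g. a `text_config` block in multimodal models) is an assumption about the fix, not a copy of the repo's code:

```python
def read_max_position_embeddings(hf_config: dict, default: int = 4096) -> int:
    """Find max_position_embeddings at the top level or in a nested
    sub-config (e.g. "text_config" in multimodal models). Sketch only."""
    if "max_position_embeddings" in hf_config:
        return hf_config["max_position_embeddings"]
    for value in hf_config.values():
        if isinstance(value, dict) and "max_position_embeddings" in value:
            return value["max_position_embeddings"]
    return default
```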
devnen
a2c7d81686 Broader model compatibility, tool_choice support, bug fixes and cleanup 2026-02-14 16:19:59 +01:00
devnen
87bbe0fac2 Full tool-calling support: XML parsing, streaming compliance, Pydantic fix, inference abort fix 2026-02-14 14:26:57 +01:00
turboderp
0985c7f7b7 Sampling: Add adaptive-P params 2026-01-20 19:09:54 +01:00
Brian
685aca5a7d Merge pull request #397 from beep39/json-schema-for-exllamav3
Constrained generation with json schema for ExllamaV3
2025-11-24 22:34:31 -05:00
beep39
d53ca1345a Constrained generation with json schema for ExllamaV3 2025-11-18 02:01:31 +09:00
mefich
37aea9de83 Update exl3 backend model.py: fix for unloading vision models
This change ensures that when unloading VLMs, their vision component is also unloaded.
2025-10-30 12:30:23 +05:00
turboderp
0af29d957a Fix #390 2025-10-15 10:40:19 +02:00
kingbri
62e9fa217a ExllamaV3: Handle max_seq_len defined and cache_size undefined case
The previous changes broke existing configs: max_seq_len was
force-overridden to 4096. This helps single-user setups, since they
do not really benefit from the split cache_size/max_seq_len mechanism
(except when batching).

cache_size is still the prime mover in exl3 due to its paging mechanism.
Ideally, for multi-user setups, cache_size should take as much VRAM
as possible and max_seq_len should be limited.

Breakdown:
cache_size and max_seq_len specified -> values
only cache_size/max_seq_len specified -> max_seq_len = cache_size and vice versa
neither specified -> cache_size = 4096, max_seq_len = min(max_position_embeddings, cache_size)

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-10-14 21:48:36 -04:00
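The breakdown in this commit message translates directly into a small resolution function. This is a sketch mirroring the stated rules, not the repo's actual implementation (`resolve_lengths` is a hypothetical name):

```python
def resolve_lengths(cache_size, max_seq_len, max_position_embeddings,
                    default_cache_size=4096):
    """Resolve cache_size/max_seq_len per the commit's breakdown:

    - both specified           -> use the given values
    - only one specified       -> the other mirrors it
    - neither specified        -> cache_size = 4096,
                                  max_seq_len = min(max_position_embeddings,
                                                    cache_size)
    """
    if cache_size is None and max_seq_len is None:
        cache_size = default_cache_size
        max_seq_len = min(max_position_embeddings, cache_size)
    elif cache_size is None:
        cache_size = max_seq_len
    elif max_seq_len is None:
        max_seq_len = cache_size
    return cache_size, max_seq_len
```

The mirroring case is what restores single-user configs that only set `max_seq_len`, while still letting multi-user setups size `cache_size` independently for exl3's paged cache.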
turboderp
8abdfe7b13 Config: replace disable_output_chunking flag with output_chunking 2025-10-14 02:47:52 +02:00
kingbri
85459ce600 Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-10-09 22:33:53 -04:00
turboderp
4235f98e83 Model: Change cache_size/max_seq_len behavior
- Cache size is now given only by the cache_size config option. Default is 4096 (user should always override to max out VRAM)
- max_seq_len, if not overridden in the config, will default to the model's config.json
- max_seq_len is reduced to be no larger than the cache
2025-10-05 22:16:01 +02:00
turboderp
52e093ae6c Model: Enable max_rq_tokens (output chunking) 2025-10-05 18:54:45 +02:00
turboderp
e09a61969f Model: Fix NCCL detection 2025-10-05 18:52:37 +02:00
kingbri
a4d02c2b70 Model: Add log messages for model loading
It's useful to know which split method the model is being loaded
with.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-17 23:09:27 -04:00
kingbri
43f9483bc4 Model: Add tensor_parallel_backend option
This allows users to choose nccl or native depending on the GPU setup.
NCCL is only available with Linux built wheels.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-17 22:35:10 -04:00
Forkoz
60ae419746 Model.py TP changes 2025-08-12 21:01:54 +00:00
kingbri
fe149489af Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-05 01:22:18 -04:00
AUTOMATIC
056527ceb3 add logprobs support for exl3 2025-08-03 11:42:32 +03:00
kingbri
0b4ca567f8 API: Persist request IDs and append full_text to finish chunk
Adding these to each generation chunk helps remove redundancy and
unnecessary request ID operations.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-25 12:27:44 -04:00
turboderp
0ae878712e Exl3: Clear image embedding cache on unload 2025-06-25 23:56:21 +02:00
kingbri
2913ce29fc API: Add timings to usage stats
It's useful for the client to know the tokens-per-second rate and
total generation time per request.

Works with both completions and chat completions.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-17 22:54:51 -04:00
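The per-request timing stats added here amount to a tiny calculation over the generation's start/end timestamps. A minimal sketch, with hypothetical names and field layout (the repo's actual usage-stats schema may differ):

```python
def usage_timings(completion_tokens: int, start: float, end: float) -> dict:
    """Compute total generation time and T/s for one request.

    start/end are monotonic-clock timestamps in seconds
    (e.g. from time.monotonic()).
    """
    total = end - start
    tps = completion_tokens / total if total > 0 else 0.0
    return {
        "total_time_s": round(total, 3),
        "tokens_per_second": round(tps, 2),
    }
```

Attaching this to the usage object of the final chunk works for both completions and chat completions, since both already emit usage stats at the end of the stream.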
turboderp
21c5af48e1 Tree: Format 2025-06-15 19:30:38 +02:00
turboderp
1c9891bf04 Exl3: Add vision capability 2025-06-15 19:22:51 +02:00
turboderp
d357f100d0 Dependencies: Bump ExllamaV3 2025-06-15 19:12:45 +02:00
turboderp
691a080ac7 Dependencies: Bump ExllamaV3 and ExllamaV2 2025-05-31 23:55:04 +02:00
kingbri
0c4cc1eba3 Model: Add prompt logging to ExllamaV3
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 22:05:18 -04:00