1194 Commits

Author SHA1 Message Date
turboderp
6aa842a1b2 Dependencies: Update exllamav3 2026-04-20 23:11:30 +02:00
turboderp
3e3d7ccd54 Tools: Add step3_5 alias (qwen3_coder tool format) 2026-04-18 19:55:34 +02:00
turboderp
ed41c51909 API: Prevent race condition when multiple chat requests try to inline-load the same model 2026-04-18 19:55:34 +02:00
turboderp
5b2b707af9 exllamav3: Account for bsz=2 in autosplit 2026-04-18 19:55:34 +02:00
turboderp
9ebbe06f29 exllamav3: Supply max_chunk_size when loading model 2026-04-18 13:20:12 +02:00
turboderp
f74f16a5c2 Config: Make recurrent cache size configurable 2026-04-17 02:40:22 +02:00
turboderp
bd589272cc Config: Make cuda_malloc_async configurable again, change import order to make sure config is loaded before torch is imported 2026-04-17 02:39:16 +02:00
turboderp
32eed618dc Dependencies: Add requests 2026-04-12 13:51:54 +02:00
turboderp
1a4896ce66 Tree: Format 2026-04-12 13:47:05 +02:00
turboderp
510bf7bf6c Update README.md 2026-04-12 13:44:26 +02:00
turboderp
f1a2416da5 OAI endpoints: Add option to suppress header after reasoning start token (e.g. Gemma4's "thought\n") 2026-04-12 04:12:53 +02:00
turboderp
2636b445f0 Tree: Format 2026-04-12 03:33:14 +02:00
turboderp
bb64f8f18e Dependencies: Update exllamav3 2026-04-12 03:31:54 +02:00
turboderp
3a42c1756c ExLlamaV2: Use new disconnect handler 2026-04-10 22:04:21 +02:00
turboderp
b21100f971 ExLlamaV3: Fix disconnected request handling regression 2026-04-10 22:03:19 +02:00
mindkrypted
08f92167de Tools: Updated/fixed Gemma4 tool parser 2026-04-10 22:02:34 +02:00
turboderp
5517cb5b9e Templates: Revert add_bos_token fix 2026-04-10 03:53:58 +02:00
turboderp
7fedc179f0 Templates: Make sure add_bos_token=False is respected 2026-04-10 03:14:29 +02:00
turboderp
27d29209c6 Tools: Add Gemma4 parser 2026-04-10 00:16:58 +02:00
turboderp
55124d0fc6 Config: Add force_enable_thinking 2026-04-10 00:16:40 +02:00
turboderp
db9048e59b Docs: Tool calling 2026-04-08 19:39:42 +02:00
turboderp
79d581e1f5 OAI endpoints: More rework
- remove disconnect_task
- move disconnect logic to a per-request handler that wraps cleanup operation and directly polls the request state with throttling
- exclusively signal disconnect with CancelledError
- rework completions endpoint to follow same approach as chat completions, share some code
- refactor OAI endpoints a bit
- correct behavior for batched completion requests
- make sure logprobs work for completion and streaming completion requests
- more tests
2026-04-02 01:26:44 +02:00
turboderp
c315f6b73e OAI endpoints: Correctly propagate exceptions in non-streaming mode 2026-04-01 12:27:07 +02:00
turboderp
455c09932f OAI endpoints: Fix regression for non-reasoning models 2026-04-01 00:08:39 +02:00
turboderp
0409064028 Tools: Refactor and further simplify tool parsing
- remove ToolConfig, reduce to a single `tool_format` argument and hard-code extra args like start/end tokens
- dispatch to short, self-contained (and probably easily vibe coded) parser for each model type
- remove autodetection (seems infeasible since parsing effectively starts during streaming, and there is overlap between tool formats for different models)
- streamline xml parser and dedicate to qwen3_coder models
- add parsers for glm4.x, minimax-m2.x and mistral (seems shaky, probably because mistralai don't validate against hf)
- update docs
2026-04-01 00:07:44 +02:00
turboderp
b6428b1676 Seq: Allow longer strings in log 2026-03-31 18:18:07 +02:00
turboderp
112ab69002 Fix comments 2026-03-31 14:43:55 +02:00
turboderp
bc66ba4b8b Merge branch 'main' into main_tools 2026-03-30 23:07:53 +02:00
turboderp
c887ae88fc Dependencies: Update exllamav3 2026-03-30 23:07:00 +02:00
turboderp
a7c7934ec3 Tool parsing: Include outer <tool_call> tags in raw text sent to parser 2026-03-30 04:05:15 +02:00
turboderp
41ed1e4881 Seq: Sanitize extra log data 2026-03-30 03:36:30 +02:00
turboderp
02a700e065 ExLlamaV3: Limit MMEmbedding cache size 2026-03-30 03:35:46 +02:00
turboderp
ba4309b948 ExLlamaV3: Replace MMEmbedding lru_cache with dict to avoid storing arbitrarily large uuencoded images as keys 2026-03-30 02:55:21 +02:00
turboderp
a035bc9e94 Model: Fix regression 2026-03-30 02:37:27 +02:00
turboderp
9ee5ded218 OAI: Log raw requests 2026-03-30 01:23:16 +02:00
turboderp
357eebffd2 Logger: Fix invalid escape sequence (gave syntax warning) 2026-03-30 00:33:01 +02:00
turboderp
9f565562dd Add inference test scripts 2026-03-30 00:23:25 +02:00
turboderp
179479199b Rework tool calls and OAI chat completions
- move tool config from template_vars to separate yml config
- new per-gen stream collector used for both streaming and non-streaming requests to ensure logic is consistent for both
- move responsibility for switching between phases to stream collector
- collect tool calls during streaming and parse at the end of each gen
- prevent streaming empty content spans (be nice to clients)
- correctly aggregate usage stats for n>1 requests, always emit with last chunk in last gen to finish
- collect logprobs in model wrapper and correctly handle logprobs for multi-token chars etc.
- respect top_logprobs argument in request
- handle a number of edge cases like <think> tag being part of held string, etc.
- retain tool parsing and inference-abort fixes from #413, apply similar fix to non-stream request as well

Still TODO:
- testing and validation with more models and tool schemas (tested on Qwen so far)
- enable JSON constraint for JSON tool models
- possibly some pydantification
- documentation
2026-03-30 00:22:55 +02:00
turboderp
aa54098f26 Ruff: Format (line length) 2026-03-30 00:19:07 +02:00
turboderp
2a1503b283 Logging: Use debug level for Seq instead of verbose 2026-03-29 18:51:57 +02:00
turboderp
47d08729ed Ruff: Raise line length limit to 100 2026-03-28 19:49:17 +01:00
turboderp
4b3c74782d Fix bad merge 2026-03-28 12:47:26 +01:00
turboderp
b4dfd2e86f Fix logging 2026-03-28 01:13:23 +01:00
turboderp
56378b946d Merge branch 'fork/devnen/full-tool-calling-support' into main_seqlog
# Conflicts:
#	common/templating.py
#	endpoints/OAI/utils/chat_completion.py
#	endpoints/OAI/utils/tools.py
2026-03-28 01:06:54 +01:00
turboderp
f3787de6a6 Ruff: Format 2026-03-27 21:47:24 +01:00
turboderp
83127ab4f8 Logging: Log messages via Seq wrapper 2026-03-27 21:38:47 +01:00
turboderp
c32a628917 Logging: Add Seq wrapper 2026-03-27 21:38:47 +01:00
turboderp
1a7191702d Dependencies: Update exllamav3 2026-03-27 02:54:42 +01:00
turboderp
da3d3338e8 Logging: Fix env var parsing, formatting 2026-03-27 02:31:36 +01:00
turboderp
a3eabecf39 Logging: Add TABBY_LOG_CONSOLE_WIDTH to enable wider console log 2026-03-27 01:30:13 +01:00