Commit Graph

1203 Commits

Author SHA1 Message Date
RodriMora
54c1e56019 Update config_sample.yml (#418)
Fix a small typo in "content" in the reasoning config
2026-05-09 21:21:57 +02:00
Josh
09f36f9c05 fix: prevent xformers from pulling cu130 wheels on cu128 hosts (#420)
The default `pip install .[cu12,extras]` lets pip resolve xformers
transitively (via infinity-emb / sentence-transformers in the extras
group), which can pull a cu130-aligned wheel that requires
libcudart.so.13. On hosts with NVIDIA driver 590.x (cu128-only), this
fails at import time with:

    ImportError: libcudart.so.13: cannot open shared object file

Reproduced on K3s clusters running 12 exllamav2/exllamav3 deployment
pods × 6 hosts; all crash-looped on the published `:latest` image
which had transitively resolved xformers to a cu130 wheel.

Fix: split the install into two pip invocations. Install the cu12 group
first to lock torch + cu128 wheels for exllamav2 / exllamav3 / flash_attn,
then install the extras group with --no-deps so pip cannot resolve
xformers (or any other transitive dep) outside the cu128 lock.

Also align the Windows py3.12 flash_attn wheel version to v0.7.13 to
match the other Windows variants (py3.10, py3.11, py3.13). The py3.12
variant was pinned to v0.7.6 while the rest were on v0.7.13, leaving
py3.12 Windows users on an older flash_attn release with no semantic
reason for the divergence.

Tested on Hydra K3s cluster (NVIDIA 590.48.01-open + cu128 base image
nvidia/cuda:12.8.1-runtime-ubuntu24.04 + torch 2.9.0+cu128). All 12
exllamav2/v3 deployments now import cleanly and serve /v1/models.

Co-authored-by: Josh Jones <scoobydont-666@users.noreply.github.com>
2026-05-09 21:21:17 +02:00
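The two-invocation install described in the message above might look like this in a Dockerfile `RUN` step — a sketch based only on the commit text (group names `cu12` and `extras` are from the message; the project's actual Dockerfile may differ):

```sh
# 1) Install the cu12 group first: this locks torch and the cu128-built
#    wheels for exllamav2 / exllamav3 / flash_attn.
pip install .[cu12]

# 2) Install the extras group with --no-deps so pip cannot resolve
#    xformers (or any other transitive dependency) outside the cu128
#    lock, e.g. to a cu130-aligned wheel requiring libcudart.so.13.
pip install --no-deps .[extras]
```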
turboderp
bc5de12c82 Dependencies: Fix Windows FA2 wheel URL for cp312 2026-05-05 10:02:49 +02:00
turboderp
59494106c9 Dependencies: Update exllamav3 2026-05-03 00:01:59 +02:00
turboderp
51b67595f4 Dependencies: Switch to mjun0812 flash-attn wheels 2026-05-03 00:01:29 +02:00
turboderp
6e97aa5fc1 Model: Fix model loading progress display when draft enabled 2026-05-02 20:30:38 +02:00
turboderp
c06a6fbf7f API: Accept JSON schema in request.response_format.json_schema, delay JSON filter until start of content block 2026-05-02 20:29:59 +02:00
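The `request.response_format.json_schema` field this commit accepts follows the OpenAI structured-output convention; a hedged example request body (field names per that convention, the schema contents purely illustrative):

```json
{
  "model": "my-model",
  "messages": [{"role": "user", "content": "Name a city."}],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "city",
      "schema": {
        "type": "object",
        "properties": {"name": {"type": "string"}},
        "required": ["name"]
      }
    }
  }
}
```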
turboderp
d0103c19a7 Dependencies: Bump exllamav3 2026-04-29 00:55:59 +02:00
turboderp
e909f7ecdb ExLlamaV3: Respect device split when loading draft model 2026-04-25 01:51:46 +02:00
turboderp
6aa842a1b2 Dependencies: Update exllamav3 2026-04-20 23:11:30 +02:00
turboderp
3e3d7ccd54 Tools: Add step3_5 alias (qwen3_coder tool format) 2026-04-18 19:55:34 +02:00
turboderp
ed41c51909 API: Prevent race condition when multiple chat requests try to inline-load the same model 2026-04-18 19:55:34 +02:00
turboderp
5b2b707af9 exllamav3: Account for bsz=2 in autosplit 2026-04-18 19:55:34 +02:00
turboderp
9ebbe06f29 exllamav3: Supply max_chunk_size when loading model 2026-04-18 13:20:12 +02:00
turboderp
f74f16a5c2 Config: Make recurrent cache size configurable 2026-04-17 02:40:22 +02:00
turboderp
bd589272cc Config: Make cuda_malloc_async configurable again, change import order to make sure config is loaded before torch is imported 2026-04-17 02:39:16 +02:00
turboderp
32eed618dc Dependencies: Add requests 2026-04-12 13:51:54 +02:00
turboderp
1a4896ce66 Tree: Format 2026-04-12 13:47:05 +02:00
turboderp
510bf7bf6c Update README.md 2026-04-12 13:44:26 +02:00
turboderp
f1a2416da5 OAI endpoints: Add option to suppress header after reasoning start token (e.g. Gemma4's "thought\n") 2026-04-12 04:12:53 +02:00
turboderp
2636b445f0 Tree: Format 2026-04-12 03:33:14 +02:00
turboderp
bb64f8f18e Dependencies: Update exllamav3 2026-04-12 03:31:54 +02:00
turboderp
3a42c1756c ExLlamaV2: Use new disconnect handler 2026-04-10 22:04:21 +02:00
turboderp
b21100f971 ExLlamaV3: Fix disconnected request handling regression 2026-04-10 22:03:19 +02:00
mindkrypted
08f92167de Tools: Updated/fixed Gemma4 tool parser 2026-04-10 22:02:34 +02:00
turboderp
5517cb5b9e Templates: Revert add_bos_token fix 2026-04-10 03:53:58 +02:00
turboderp
7fedc179f0 Templates: Make sure add_bos_token=False is respected 2026-04-10 03:14:29 +02:00
turboderp
27d29209c6 Tools: Add Gemma4 parser 2026-04-10 00:16:58 +02:00
turboderp
55124d0fc6 Config: Add force_enable_thinking 2026-04-10 00:16:40 +02:00
turboderp
db9048e59b Docs: Tool calling 2026-04-08 19:39:42 +02:00
turboderp
79d581e1f5 OAI endpoints: More rework
- remove disconnect_task
- move disconnect logic to a per-request handler that wraps the cleanup operation and directly polls the request state with throttling
- exclusively signal disconnect with CancelledError
- rework completions endpoint to follow same approach as chat completions, share some code
- refactor OAI endpoints a bit
- correct behavior for batched completion requests
- make sure logprobs work for completion and streaming completion requests
- more tests
2026-04-02 01:26:44 +02:00
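A minimal sketch of the per-request disconnect pattern this commit describes (throttled polling of request state, disconnect signalled exclusively by `CancelledError`). The helper and callable names here are hypothetical, not the project's actual endpoint code:

```python
import asyncio

async def with_disconnect_poll(gen_coro, is_disconnected, poll_interval=0.5):
    """Run a generation coroutine, cancelling it if the client disconnects.

    Disconnect is signalled exclusively by raising CancelledError, as in
    the commit message; the disconnect check is throttled by poll_interval.
    """
    gen_task = asyncio.ensure_future(gen_coro)
    try:
        while not gen_task.done():
            if await is_disconnected():
                gen_task.cancel()
                raise asyncio.CancelledError("client disconnected")
            # Wait for the generation to finish, but wake up periodically
            # to re-check the connection state.
            await asyncio.wait({gen_task}, timeout=poll_interval)
        return gen_task.result()
    finally:
        if not gen_task.done():
            gen_task.cancel()

# Usage sketch: a fake generator and a client that never disconnects.
async def _demo():
    async def fake_gen():
        await asyncio.sleep(0.1)
        return "done"

    async def never_disconnected():
        return False

    return await with_disconnect_poll(fake_gen(), never_disconnected)

result = asyncio.run(_demo())
```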
turboderp
c315f6b73e OAI endpoints: Correctly propagate exceptions in non-streaming mode 2026-04-01 12:27:07 +02:00
turboderp
455c09932f OAI endpoints: Fix regression for non-reasoning models 2026-04-01 00:08:39 +02:00
turboderp
0409064028 Tools: Refactor and further simplify tool parsing
- remove ToolConfig, reduce to a single `tool_format` argument and hard-code extra args like start/end tokens
- dispatch to short, self-contained (and probably easily vibe coded) parser for each model type
- remove autodetection (seems infeasible since parsing effectively starts during streaming, and there is overlap between tool formats for different models)
- streamline xml parser and dedicate to qwen3_coder models
- add parsers for glm4.x, minimax-m2.x and mistral (seems shaky, probably because mistralai don't validate against hf)
- update docs
2026-04-01 00:07:44 +02:00
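The single `tool_format` argument described above can be read as a plain dispatch table with no autodetection. An illustrative sketch only — the format names come from the commit messages, while the parser bodies are placeholder stubs, not the project's parsers:

```python
# Hypothetical per-format parsers; each is short and self-contained,
# as the commit message describes.
def parse_qwen3_coder(raw: str) -> list[dict]:
    # The streamlined XML parser is dedicated to qwen3_coder models;
    # stubbed out here.
    return [{"format": "qwen3_coder", "raw": raw}]

def parse_glm4(raw: str) -> list[dict]:
    return [{"format": "glm4", "raw": raw}]

def parse_mistral(raw: str) -> list[dict]:
    return [{"format": "mistral", "raw": raw}]

# No autodetection: the caller must name the format explicitly.
TOOL_PARSERS = {
    "qwen3_coder": parse_qwen3_coder,
    "step3_5": parse_qwen3_coder,  # alias, per the step3_5 commit above
    "glm4": parse_glm4,
    "mistral": parse_mistral,
}

def parse_tool_calls(tool_format: str, raw: str) -> list[dict]:
    try:
        parser = TOOL_PARSERS[tool_format]
    except KeyError:
        raise ValueError(f"unknown tool_format: {tool_format!r}")
    return parser(raw)

calls = parse_tool_calls("step3_5", "<tool_call>example</tool_call>")
```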
turboderp
b6428b1676 Seq: Allow longer strings in log 2026-03-31 18:18:07 +02:00
turboderp
112ab69002 Fix comments 2026-03-31 14:43:55 +02:00
turboderp
bc66ba4b8b Merge branch 'main' into main_tools 2026-03-30 23:07:53 +02:00
turboderp
c887ae88fc Dependencies: Update exllamav3 2026-03-30 23:07:00 +02:00
turboderp
a7c7934ec3 Tool parsing: Include outer <tool_call> tags in raw text sent to parser 2026-03-30 04:05:15 +02:00
turboderp
41ed1e4881 Seq: Sanitize extra log data 2026-03-30 03:36:30 +02:00
turboderp
02a700e065 ExLlamaV3: Limit MMEmbedding cache size 2026-03-30 03:35:46 +02:00
turboderp
ba4309b948 ExLlamaV3: Replace MMEmbedding lru_cache with dict to avoid storing arbitrarily large uuencoded images as keys 2026-03-30 02:55:21 +02:00
turboderp
a035bc9e94 Model: Fix regression 2026-03-30 02:37:27 +02:00
turboderp
9ee5ded218 OAI: Log raw requests 2026-03-30 01:23:16 +02:00
turboderp
357eebffd2 Logger: Fix invalid escape sequence (gave syntax warning) 2026-03-30 00:33:01 +02:00
turboderp
9f565562dd Add inference test scripts 2026-03-30 00:23:25 +02:00
turboderp
179479199b Rework tool calls and OAI chat completions
- move tool config from template_vars to separate yml config
- new per-gen stream collector used for both streaming and non-streaming requests to ensure logic is consistent for both
- move responsibility for switching between phases to stream collector
- collect tool calls during streaming and parse at the end of each gen
- prevent streaming empty content spans (be nice to clients)
- correctly aggregate usage stats for n>1 requests, always emit with last chunk in last gen to finish
- collect logprobs in model wrapper and correctly handle logprobs for multi-token chars etc.
- respect top_logprobs argument in request
- handle a number of edge cases like <think> tag being part of held string, etc.
- retain tool parsing and inference-abort fixes from #413, apply similar fix to non-stream request as well

Still TODO:
- testing and validation with more models and tool schemas (tested on Qwen so far)
- enable JSON constraint for JSON tool models
- possibly some pydantification
- documentation
2026-03-30 00:22:55 +02:00
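One of the edge cases listed above — a `<think>` tag arriving split across stream chunks — can be handled by holding back any buffer suffix that is still a prefix of the tag, so the tag is never emitted in pieces. A minimal sketch under that assumption; the real stream collector does much more (phase switching, tool calls, usage stats):

```python
TAG = "<think>"

def split_held(buffer: str) -> tuple[str, str]:
    """Return (emit, held): emit is safe to stream to the client; held is
    the longest trailing substring that could still grow into TAG."""
    for k in range(min(len(TAG), len(buffer)), 0, -1):
        if buffer.endswith(TAG[:k]):
            return buffer[:-k], buffer[-k:]
    return buffer, ""

# Feed chunks; text is only emitted once it cannot be the start of <think>,
# so the complete tag always appears contiguously in a single emit.
held = ""
emitted = []
for chunk in ["Hello <thi", "nk>reasoning", " done"]:
    emit, held = split_held(held + chunk)
    emitted.append(emit)

text = "".join(emitted) + held
```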
turboderp
aa54098f26 Ruff: Format (line length) 2026-03-30 00:19:07 +02:00
turboderp
2a1503b283 Logging: Use debug level for Seq instead of verbose 2026-03-29 18:51:57 +02:00
turboderp
47d08729ed Ruff: Raise line length limit to 100 2026-03-28 19:49:17 +01:00