1247 Commits

Author SHA1 Message Date
kingbri
3cf468c283 Actions: Fix docker buildx casing issue
Add step to change the repo name to lowercase

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2026-06-26 21:21:05 -04:00
kingbri
7e4ccd5e8c Actions: Point to GHCR cache instead of GHA cache
Need a longer term cache storage

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2026-06-26 21:13:46 -04:00
kingbri
a3bd248e08 Actions: Use GHCR as Docker layer cache
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2026-06-26 20:53:52 -04:00
suspicious-pineapple
79126f904c API: Fix v1/token/encode endpoint after regression 2026-06-27 02:40:35 +02:00
turboderp
538654bfcb Add tests 2026-06-27 02:37:30 +02:00
turboderp
9b6ffdc3b2 exllamav3: Expose loop detect option, enable 800-token window by default 2026-06-16 23:30:16 +02:00
turboderp
d2d87bb9e0 API: Fix error message/code when context length exceeded 2026-06-15 20:36:25 +02:00
turboderp
c1655d1234 Dependencies: Update exllamav3 2026-06-14 20:07:08 +02:00
turboderp
202fbdc6d2 Merge remote-tracking branch 'origin/main' 2026-06-14 16:16:07 +02:00
turboderp
afec2e354f Fix git error, missing file 2026-06-14 16:15:54 +02:00
lavd
5485088231 add cu13 build (#423) 2026-06-13 01:47:33 +02:00
turboderp
671c12d78c API: Reject oversized prompts with error code 400 before committing to EventSourceResponse 2026-06-13 01:34:24 +02:00
turboderp
ddd2c409ad Logging: Add draft metrics 2026-06-13 00:29:53 +02:00
turboderp
26102e0251 Dependencies: Update exllamav3 2026-06-13 00:13:34 +02:00
turboderp
004e837412 exllamav3: Add draft_mode option, support MTP and n-gram drafting 2026-06-13 00:12:24 +02:00
turboderp
95d1278694 Model: Fix regression when no draft_gpu_split specified 2026-06-12 23:31:34 +02:00
turboderp
637b595bb6 Merge branch 'fork/baronrabban/fix/draft-model-gpu-split' 2026-06-12 21:13:39 +02:00
turboderp
9726fbf0a0 Dependencies: Update exllamav3 2026-06-12 21:10:41 +02:00
baronrabban
4c7249e98d Fix draft model ignoring draft_gpu_split on load
The exllamav3 backend parses the user-configured draft_gpu_split into
self.draft_gpu_split, but load_model_sync passed self.gpu_split (the main
model's split) when loading the draft model, so the draft split was
silently ignored. Use self.draft_gpu_split instead.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 18:45:52 -04:00
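
The fix described in 4c7249e98d can be illustrated with a minimal sketch. Only the attribute names `gpu_split` and `draft_gpu_split` and the method name `load_model_sync` come from the commit message; the `ModelContainer` class, the `load_fn` parameter, and `fake_load` are hypothetical stand-ins and do not reflect the project's actual code.

```python
class ModelContainer:
    """Hypothetical stand-in for the backend container, for illustration only."""

    def __init__(self, gpu_split, draft_gpu_split):
        self.gpu_split = gpu_split              # split configured for the main model
        self.draft_gpu_split = draft_gpu_split  # split configured for the draft model

    def load_model_sync(self, load_fn):
        # Per the commit message, the draft load previously received
        # self.gpu_split, so the user's draft split was silently ignored.
        draft = load_fn("draft", self.draft_gpu_split)
        main = load_fn("main", self.gpu_split)
        return draft, main


def fake_load(name, split):
    # Hypothetical loader that only records which split it was given.
    return (name, split)


container = ModelContainer(gpu_split=[20.0, 24.0], draft_gpu_split=[4.0, 0.0])
draft, main = container.load_model_sync(fake_load)
```

With the corrected attribute, each model is loaded with its own configured split.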
turboderp
2e50555d37 Dependencies: Update exllamav3 2026-06-06 23:15:43 +02:00
turboderp
624f2baebd Dependencies: Don't try to import exllamav2 2026-06-05 01:58:47 +02:00
turboderp
8822b886ea Tools: Add Step 3.7 tool format alias (qwen3_coder compatible) 2026-06-02 17:21:53 +02:00
turboderp
210bbe78f5 exllamav3: Include stop conditions from backend tokenizer 2026-06-02 15:28:56 +02:00
turboderp
ff4160051f Dependencies: Update exllamav3 2026-06-01 03:47:31 +02:00
turboderp
7a23e48fc1 Dependencies: Enable flash-linear-attention on Windows 2026-05-31 20:50:22 +02:00
turboderp
95c1101bd2 Dependencies: Update exllamav3 2026-05-29 22:17:24 +02:00
turboderp
510367d1ab Logging: Add comprehensive request logging option 2026-05-27 00:33:45 +02:00
turboderp
dd792e1916 Dependencies: Update exllamav3 2026-05-24 20:33:33 +02:00
turboderp
20cd52371a Docker: Update compose service 2026-05-24 20:33:03 +02:00
turboderp
fef811d484 Dependencies: Add cu13 install option and Dockerfile (exllamav3 only) 2026-05-23 01:21:18 +02:00
turboderp
539289375c Dependencies: Add flash-linear-attention 2026-05-23 01:20:18 +02:00
turboderp
ed97bbb2af Model: Add draft_num_tokens config option, update model container to forward draft and bsz args to backend 2026-05-23 00:40:42 +02:00
turboderp
a430dce6f3 Config: Fix incorrect description of gpu_split as integer list 2026-05-22 23:33:34 +02:00
turboderp
2593fb79a2 Merge remote-tracking branch 'origin/main' 2026-05-14 12:56:51 +02:00
turboderp
857f9e21dd Merge pull request #422
Start.py: improve dependency installation check and cleanup uv logging
2026-05-14 12:56:40 +02:00
Optimal
52bc74b3f9 Start.py: improve dependency installation check and cleanup uv logging
- Added check=True to subprocess.run in run_pip.
- Wrapped installation in try-except to set first_run_done = True only on success.
- Added error message and sys.exit(1) on installation failure.
- Fix uv version logging by calling subprocess.run directly.
2026-05-14 01:56:30 +09:00
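
The behaviour listed in 52bc74b3f9 might look roughly like the sketch below. The names `run_pip` and `first_run_done` appear in the commit message; the `install` function, the `state` dictionary, and the error text are assumptions made for illustration, not the contents of Start.py.

```python
import subprocess
import sys


def run_pip(args):
    # check=True makes subprocess.run raise CalledProcessError
    # when the command exits with a non-zero status.
    subprocess.run(args, check=True)


def install(args, state):
    # Hypothetical wrapper: mark the first run as done only on success.
    try:
        run_pip(args)
    except (subprocess.CalledProcessError, OSError) as exc:
        print(f"Dependency installation failed: {exc}", file=sys.stderr)
        sys.exit(1)
    state["first_run_done"] = True
```

Setting the flag after the `try` block means a failed install leaves `first_run_done` unset, so the next start attempts the installation again.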
turboderp
4de923d8b3 Add docker instructions to README.md 2026-05-10 11:26:53 +02:00
turboderp
838df5a3c7 Docker: Remove version from example docker-compose.yml 2026-05-10 11:25:15 +02:00
turboderp
64ad702416 Dependencies: Pin pydantic again (>2.11 breaks docker image) 2026-05-10 01:41:03 +02:00
turboderp
5818311d06 Dependencies: Pin correct xformers version torch 2.9 2026-05-10 01:21:37 +02:00
turboderp
553c4e7cbb Docker: Serve on 0.0.0.0 by default 2026-05-09 23:22:56 +02:00
turboderp
5d964494b6 Merge remote-tracking branch 'origin/main' 2026-05-09 23:18:52 +02:00
turboderp
4a8cb08a24 Dependencies: Include triton and xformers 2026-05-09 23:14:30 +02:00
turboderp
fd9591133d Dependencies: Update exllamav3, unpin pydantic 2026-05-09 23:01:07 +02:00
RodriMora
54c1e56019 Update config_sample.yml (#418)
small typo in "content" on the reasoning config
2026-05-09 21:21:57 +02:00
Josh
09f36f9c05 fix: prevent xformers from pulling cu130 wheels on cu128 hosts (#420)
The default `pip install .[cu12,extras]` lets pip resolve xformers
transitively (via infinity-emb / sentence-transformers in the extras
group), which can pull a cu130-aligned wheel that requires
libcudart.so.13. On hosts with NVIDIA driver 590.x (cu128-only), this
fails at import time with:

    ImportError: libcudart.so.13: cannot open shared object file

Reproduced on K3s clusters running 12 exllamav2/exllamav3 deployment
pods × 6 hosts; all crash-looped on the published `:latest` image
which had transitively resolved xformers to a cu130 wheel.

Fix: split the install into two pip invocations. Install the cu12 group
first to lock torch + cu128 wheels for exllamav2 / exllamav3 / flash_attn,
then install the extras group with --no-deps so pip cannot resolve
xformers (or any other transitive dep) outside the cu128 lock.

Also align the Windows py3.12 flash_attn wheel version to v0.7.13 to
match the other Windows variants (py3.10, py3.11, py3.13). The py3.12
variant was pinned to v0.7.6 while the rest were on v0.7.13, leaving
py3.12 Windows users on an older flash_attn release with no semantic
reason for the divergence.

Tested on Hydra K3s cluster (NVIDIA 590.48.01-open + cu128 base image
nvidia/cuda:12.8.1-runtime-ubuntu24.04 + torch 2.9.0+cu128). All 12
exllamav2/v3 deployments now import cleanly and serve /v1/models.

Co-authored-by: Josh Jones <scoobydont-666@users.noreply.github.com>
2026-05-09 21:21:17 +02:00
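
The two-step install described in 09f36f9c05 can be sketched as a pair of pip command lines. The extras specifiers `.[cu12]` and `.[extras]` are inferred from the original `pip install .[cu12,extras]` command and are an assumption, as is the helper name `build_install_commands`; the commit's actual invocations may differ.

```python
import sys


def build_install_commands(python=sys.executable):
    # Hypothetical helper: returns the two pip invocations in order.
    pip = [python, "-m", "pip", "install"]
    return [
        # Step 1: install the cu12 group so torch and the CUDA wheels are fixed first.
        pip + [".[cu12]"],
        # Step 2: install the extras group without dependency resolution,
        # so pip cannot pull in a different xformers build transitively.
        pip + ["--no-deps", ".[extras]"],
    ]
```

Each list could be passed to `subprocess.run(cmd, check=True)` in turn. Note that `--no-deps` skips every dependency of the second step, so anything the extras need must already be present after step 1.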
turboderp
bc5de12c82 Dependencies: Fix Windows FA2 wheel URL for cp312 2026-05-05 10:02:49 +02:00
turboderp
59494106c9 Dependencies: Update exllamav3 2026-05-03 00:01:59 +02:00
turboderp
51b67595f4 Dependencies: Switch to mjun0812 flash-attn wheels 2026-05-03 00:01:29 +02:00
turboderp
6e97aa5fc1 Model: Fix model loading progress display when draft enabled 2026-05-02 20:30:38 +02:00