tabbyAPI

mirror of https://github.com/theroyallab/tabbyAPI.git synced 2026-06-29 02:37:13 +00:00

Author	SHA1	Message	Date
kingbri	3cf468c283	Actions: Fix docker buildx casing issue. Add step to change the repo name to lowercase. Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2026-06-26 21:21:05 -04:00
kingbri	7e4ccd5e8c	Actions: Point to GHCR cache instead of GHA cache. Need a longer term cache storage. Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2026-06-26 21:13:46 -04:00
kingbri	a3bd248e08	Actions: Use GHCR as Docker layer cache. Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2026-06-26 20:53:52 -04:00
suspicious-pineapple	79126f904c	API: Fix v1/token/encode endpoint after regression	2026-06-27 02:40:35 +02:00
turboderp	538654bfcb	Add tests	2026-06-27 02:37:30 +02:00
turboderp	9b6ffdc3b2	exllamav3: Expose loop detect option, enable 800-token window by default	2026-06-16 23:30:16 +02:00
turboderp	d2d87bb9e0	API: Fix error message/code when context length exceeded	2026-06-15 20:36:25 +02:00
turboderp	c1655d1234	Dependencies: Update exllamav3	2026-06-14 20:07:08 +02:00
turboderp	202fbdc6d2	Merge remote-tracking branch 'origin/main'	2026-06-14 16:16:07 +02:00
turboderp	afec2e354f	Fix git error, missing file	2026-06-14 16:15:54 +02:00
lavd	5485088231	add cu13 build (#423)	2026-06-13 01:47:33 +02:00
turboderp	671c12d78c	API: Reject oversized prompts with error code 400 before committing to EventSourceResponse	2026-06-13 01:34:24 +02:00
turboderp	ddd2c409ad	Logging: Add draft metrics	2026-06-13 00:29:53 +02:00
turboderp	26102e0251	Dependencies: Update exllamav3	2026-06-13 00:13:34 +02:00
turboderp	004e837412	exllamav3: Add draft_mode option, support MTP and n-gram drafting	2026-06-13 00:12:24 +02:00
turboderp	95d1278694	Model: Fix regression when no draft_gpu_split specified	2026-06-12 23:31:34 +02:00
turboderp	637b595bb6	Merge branch 'fork/baronrabban/fix/draft-model-gpu-split'	2026-06-12 21:13:39 +02:00
turboderp	9726fbf0a0	Dependencies: Update exllamav3	2026-06-12 21:10:41 +02:00
baronrabban	4c7249e98d	Fix draft model ignoring draft_gpu_split on load. The exllamav3 backend parses the user-configured draft_gpu_split into self.draft_gpu_split, but load_model_sync passed self.gpu_split (the main model's split) when loading the draft model, so the draft split was silently ignored. Use self.draft_gpu_split instead. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 18:45:52 -04:00
turboderp	2e50555d37	Dependencies: Update exllamav3	2026-06-06 23:15:43 +02:00
turboderp	624f2baebd	Dependencies: Don't try to import exllamav2	2026-06-05 01:58:47 +02:00
turboderp	8822b886ea	Tools: Add Step 3.7 tool format alias (qwen3_coder compatible)	2026-06-02 17:21:53 +02:00
turboderp	210bbe78f5	exllamav3: Include stop conditions from backend tokenizer	2026-06-02 15:28:56 +02:00
turboderp	ff4160051f	Dependencies: Update exllamav3	2026-06-01 03:47:31 +02:00
turboderp	7a23e48fc1	Dependencies: Enable flash-linear-attention on Windows	2026-05-31 20:50:22 +02:00
turboderp	95c1101bd2	Dependencies: Update exllamav3	2026-05-29 22:17:24 +02:00
turboderp	510367d1ab	Logging: Add comprehensive request logging option	2026-05-27 00:33:45 +02:00
turboderp	dd792e1916	Dependencies: Update exllamav3	2026-05-24 20:33:33 +02:00
turboderp	20cd52371a	Docker: Update compose service	2026-05-24 20:33:03 +02:00
turboderp	fef811d484	Dependencies: Add cu13 install option and Dockerfile (exllamav3 only)	2026-05-23 01:21:18 +02:00
turboderp	539289375c	Dependencies: Add flash-linear-attention	2026-05-23 01:20:18 +02:00
turboderp	ed97bbb2af	Model: Add draft_num_tokens config option, update model container to forward draft and bsz args to backend	2026-05-23 00:40:42 +02:00
turboderp	a430dce6f3	Config: Fix incorrect description of gpu_split as integer list	2026-05-22 23:33:34 +02:00
turboderp	2593fb79a2	Merge remote-tracking branch 'origin/main'	2026-05-14 12:56:51 +02:00
turboderp	857f9e21dd	Merge pull request #422 Start.py: improve dependency installation check and cleanup uv logging	2026-05-14 12:56:40 +02:00
Optimal	52bc74b3f9	Start.py: improve dependency installation check and cleanup uv logging. Added check=True to subprocess.run in run_pip; wrapped installation in try-except to set first_run_done = True only on success; added error message and sys.exit(1) on installation failure; fixed uv version logging by calling subprocess.run directly.	2026-05-14 01:56:30 +09:00
turboderp	4de923d8b3	Add docker instructions to README.md	2026-05-10 11:26:53 +02:00
turboderp	838df5a3c7	Docker: Remove version from example docker-compose.yml	2026-05-10 11:25:15 +02:00
turboderp	64ad702416	Dependencies: Pin pydantic again (>2.11 breaks docker image)	2026-05-10 01:41:03 +02:00
turboderp	5818311d06	Dependencies: Pin correct xformers version torch 2.9	2026-05-10 01:21:37 +02:00
turboderp	553c4e7cbb	Docker: Serve on 0.0.0.0 by default	2026-05-09 23:22:56 +02:00
turboderp	5d964494b6	Merge remote-tracking branch 'origin/main'	2026-05-09 23:18:52 +02:00
turboderp	4a8cb08a24	Dependencies: Include triton and xformers	2026-05-09 23:14:30 +02:00
turboderp	fd9591133d	Dependencies: Update exllamav3, unpin pydantic	2026-05-09 23:01:07 +02:00
RodriMora	54c1e56019	Update config_sample.yml (#418): small typo in "content" on the reasoning config	2026-05-09 21:21:57 +02:00
Josh	09f36f9c05	fix: prevent xformers from pulling cu130 wheels on cu128 hosts (#420). The default `pip install .[cu12,extras]` lets pip resolve xformers transitively (via infinity-emb / sentence-transformers in the extras group), which can pull a cu130-aligned wheel that requires libcudart.so.13. On hosts with NVIDIA driver 590.x (cu128-only), this fails at import time with: `ImportError: libcudart.so.13: cannot open shared object file`. Reproduced on K3s clusters running 12 exllamav2/exllamav3 deployment pods × 6 hosts; all crash-looped on the published `:latest` image which had transitively resolved xformers to a cu130 wheel. Fix: split the install into two pip invocations. Install the cu12 group first to lock torch + cu128 wheels for exllamav2 / exllamav3 / flash_attn, then install the extras group with --no-deps so pip cannot resolve xformers (or any other transitive dep) outside the cu128 lock. Also align the Windows py3.12 flash_attn wheel version to v0.7.13 to match the other Windows variants (py3.10, py3.11, py3.13). The py3.12 variant was pinned to v0.7.6 while the rest were on v0.7.13, leaving py3.12 Windows users on an older flash_attn release with no semantic reason for the divergence. Tested on Hydra K3s cluster (NVIDIA 590.48.01-open + cu128 base image nvidia/cuda:12.8.1-runtime-ubuntu24.04 + torch 2.9.0+cu128). All 12 exllamav2/v3 deployments now import cleanly and serve /v1/models. Co-authored-by: Josh Jones <scoobydont-666@users.noreply.github.com>	2026-05-09 21:21:17 +02:00
turboderp	bc5de12c82	Dependencies: Fix Windows FA2 wheel URL for cp312	2026-05-05 10:02:49 +02:00
turboderp	59494106c9	Dependencies: Update exllamav3	2026-05-03 00:01:59 +02:00
turboderp	51b67595f4	Dependencies: Switch to mjun0812 flash-attn wheels	2026-05-03 00:01:29 +02:00
turboderp	6e97aa5fc1	Model: Fix model loading progress display when draft enabled	2026-05-02 20:30:38 +02:00

1247 Commits