The default `pip install .[cu12,extras]` lets pip resolve xformers
transitively (via infinity-emb / sentence-transformers in the extras
group), which can pull a cu130-aligned wheel that requires
libcudart.so.13. On hosts with NVIDIA driver 590.x (cu128-only), this
fails at import time with:
`ImportError: libcudart.so.13: cannot open shared object file`
Reproduced on K3s clusters running 12 exllamav2/exllamav3 deployment
pods across 6 hosts; all crash-looped on the published `:latest` image,
which had transitively resolved xformers to a cu130 wheel.
Fix: split the install into two pip invocations. Install the cu12 group
first to lock torch + cu128 wheels for exllamav2 / exllamav3 / flash_attn,
then install the extras group with --no-deps so pip cannot resolve
xformers (or any other transitive dep) outside the cu128 lock.
Also align the Windows py3.12 flash_attn wheel version to v0.7.13 to
match the other Windows variants (py3.10, py3.11, py3.13). The py3.12
variant was pinned to v0.7.6 while the rest were on v0.7.13, leaving
py3.12 Windows users on an older flash_attn release with no semantic
reason for the divergence.
Tested on the Hydra K3s cluster (NVIDIA driver 590.48.01-open + cu128 base
image nvidia/cuda:12.8.1-runtime-ubuntu24.04 + torch 2.9.0+cu128). All 12
exllamav2/v3 deployments now import cleanly and serve /v1/models.
Co-authored-by: Josh Jones <scoobydont-666@users.noreply.github.com>
- remove `disconnect_task`
- move disconnect logic to a per-request handler that wraps the cleanup operation and directly polls the request state with throttling (see the sketch after this list)
- signal disconnects exclusively with `CancelledError`
- rework the completions endpoint to follow the same approach as chat completions and share some code
- refactor OAI endpoints a bit
- correct behavior for batched completion requests
- make sure logprobs work for completion and streaming completion requests
- more tests
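
For context, a minimal sketch of the shape the new per-request handling takes; this is not the actual implementation, and `run_with_disconnect`, the poll interval, and the `cleanup` callable are illustrative. It assumes a Starlette/FastAPI `Request`, polls `is_disconnected()` with throttling, and uses `CancelledError` as the only disconnect signal:

```python
import asyncio

from fastapi import Request

DISCONNECT_POLL_INTERVAL = 0.5  # seconds between request-state checks (illustrative)

async def run_with_disconnect(request: Request, gen_coro, cleanup):
    """Run one generation, cancelling it when the client goes away.

    CancelledError is the only disconnect signal: cancelling gen_task raises
    it inside the generation, and this wrapper runs the cleanup operation
    before re-raising it to the endpoint.
    """
    gen_task = asyncio.create_task(gen_coro)
    try:
        while True:
            done, _ = await asyncio.wait({gen_task}, timeout=DISCONNECT_POLL_INTERVAL)
            if done:
                return gen_task.result()
            # Throttled poll of the underlying request state
            if await request.is_disconnected():
                gen_task.cancel()
                await gen_task  # re-raises CancelledError here
    except asyncio.CancelledError:
        # Client disconnect or server-side abort: release resources, then re-raise
        gen_task.cancel()
        await cleanup()
        raise
```

The endpoint then only ever has to handle `CancelledError`, whether the cancel came from a client disconnect or a server-side abort.
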
- remove `ToolConfig`, reduce to a single `tool_format` argument and hard-code extra args like start/end tokens
- dispatch to a short, self-contained (and probably easily vibe coded) parser for each model type (dispatch sketched after this list)
- remove autodetection (seems infeasible since parsing effectively starts during streaming, and there is overlap between tool formats for different models)
- streamline the xml parser and dedicate it to qwen3_coder models
- add parsers for glm4.x, minimax-m2.x and mistral (the mistral one seems shaky, probably because mistralai don't validate against hf)
- update docs
- move tool config from template_vars to separate yml config
- new per-gen stream collector used for both streaming and non-streaming requests to ensure the logic is consistent for both (see the collector sketch after this list)
- move responsibility for switching between phases to stream collector
- collect tool calls during streaming and parse at the end of each gen
- prevent streaming empty content spans (be nice to clients)
- correctly aggregate usage stats for n>1 requests, always emitting them with the last chunk of the last gen to finish
- collect logprobs in the model wrapper and correctly handle logprobs for multi-token chars etc.
- respect the `top_logprobs` argument in the request
- handle a number of edge cases, like a `<think>` tag being part of the held string, etc.
- retain the tool-parsing and inference-abort fixes from #413 and apply a similar fix to non-stream requests as well
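
Rough shape of the `tool_format` dispatch described above; the format keys and parser names are illustrative and the parser bodies are elided:

```python
from typing import Callable

# Illustrative registry: the `tool_format` key comes straight from the model's
# yml config, with no autodetection. Each parser is a small, self-contained
# function that takes the raw text collected for one gen and returns tool calls.
def parse_qwen3_coder(text: str) -> list[dict]:
    return []  # real parser: XML-style <tool_call> blocks

def parse_glm4(text: str) -> list[dict]:
    return []  # real parser elided

TOOL_PARSERS: dict[str, Callable[[str], list[dict]]] = {
    "qwen3_coder": parse_qwen3_coder,
    "glm4": parse_glm4,
    # ... minimax_m2, mistral, etc.
}

def parse_tool_calls(tool_format: str, text: str) -> list[dict]:
    return TOOL_PARSERS[tool_format](text)
```
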
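And a condensed sketch of the per-gen stream collector; again the names are illustrative, and the real collector handles more phases and markers (e.g. `<think>`) than shown here:

```python
class GenStreamCollector:
    """Per-gen collector shared by the streaming and non-streaming paths.

    Owns the phase switching, holds back text that might be the start of a
    marker tag so partial tags are never streamed, never yields empty content
    spans, and accumulates tool-call text to parse once the gen finishes.
    """

    def __init__(self, tool_format: str | None, tool_start: str = "<tool_call>"):
        self.tool_format = tool_format
        self.tool_start = tool_start   # hard-coded per format in practice
        self.phase = "content"
        self.held = ""                 # possible partial marker tag
        self.tool_text = ""            # raw tool-call text, parsed at end of gen

    def push(self, delta: str) -> str:
        """Feed one decoded chunk; return only the text that is safe to stream now."""
        text = self.held + delta
        self.held = ""
        if self.phase == "tool_calls":
            self.tool_text += text
            return ""
        start = text.find(self.tool_start)
        if start != -1:
            # Switch phase; everything after the marker belongs to the tool call
            self.phase = "tool_calls"
            self.tool_text += text[start + len(self.tool_start):]
            return text[:start]
        idx = text.rfind("<")
        if idx != -1 and self.tool_start.startswith(text[idx:]):
            # Hold back what might be a partial marker instead of streaming it
            self.held = text[idx:]
            text = text[:idx]
        return text  # caller skips emitting a chunk when this is empty

    def finish(self) -> list[dict]:
        """Parse the collected tool-call text at the end of the gen."""
        if self.tool_format and self.tool_text:
            return parse_tool_calls(self.tool_format, self.tool_text)  # dispatch above
        return []
```
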
Still TODO:
- testing and validation with more models and tool schemas (tested on Qwen so far)
- enable JSON constraint for JSON tool models
- possibly some pydantification
- documentation