tabbyAPI

mirror of https://github.com/theroyallab/tabbyAPI.git synced 2026-05-11 16:30:16 +00:00

Author	SHA1	Message	Date
turboderp	c06a6fbf7f	API: Accept JSON schema in request.response_format.json_schema, delay JSON filter until start of content block	2026-05-02 20:29:59 +02:00
turboderp	e909f7ecdb	ExLlamaV3: Respect device split when loading draft model	2026-04-25 01:51:46 +02:00
turboderp	5b2b707af9	exllamav3: Account for bsz=2 in autosplit	2026-04-18 19:55:34 +02:00
turboderp	9ebbe06f29	exllamav3: Supply max_chunk_size when loading model	2026-04-18 13:20:12 +02:00
turboderp	f74f16a5c2	Config: Make recurrent cache size configurable	2026-04-17 02:40:22 +02:00
turboderp	f1a2416da5	OAI endpoints: Add option to suppress header after reasoning start token (e.g. Gemma4's "thought\n")	2026-04-12 04:12:53 +02:00
turboderp	3a42c1756c	ExLlamaV2: Use new disconnect handler	2026-04-10 22:04:21 +02:00
turboderp	b21100f971	ExLlamaV3: Fix disconnected request handling regression	2026-04-10 22:03:19 +02:00
turboderp	55124d0fc6	Config: Add force_enable_thinking	2026-04-10 00:16:40 +02:00
turboderp	79d581e1f5	OAI endpoints: More rework - remove disconnect_task - move disconnect logic to a per-request handler that wraps cleanup operation and directly polls the request state with throttling - exclusively signal disconnect with CancelledError - rework completions endpoint to follow same approach as chat completions, share some code - refactor OAI endpoints a bit - correct behavior for batched completion requests - make sure logprobs work for completion and streaming completion requests - more tests	2026-04-02 01:26:44 +02:00
turboderp	0409064028	Tools: Refactor and further simplify tool parsing - remove ToolConfig, reduce to a single `tool_format` argument and hard-code extra args like start/end tokens - dispatch to short, self-contained (and probably easily vibe coded) parser for each model type - remove autodetection (seems infeasible since parsing effectively starts during streaming, and there is overlap between tool formats for different models) - streamline xml parser and dedicate to qwen3_coder models - add parsers for glm4.x, minimax-m2.x and mistral (seems shaky, probably because mistralai don't validate against hf) - update docs	2026-04-01 00:07:44 +02:00
turboderp	02a700e065	ExLlamaV3: Limit MMEmbedding cache size	2026-03-30 03:35:46 +02:00
turboderp	ba4309b948	ExLlamaV3: Replace MMEmbedding lru_cache with dict to avoid storing arbitrarily large uuencoded images as keys	2026-03-30 02:55:21 +02:00
turboderp	179479199b	Rework tool calls and OAI chat completions - move tool config from template_vars to separate yml config - new per-gen stream collector used for both streaming and non-streaming requests to ensure logic is consistent for both - move responsibility for switching between phases to stream collector - collect tool calls during streaming and parse at the end of each gen - prevent streaming empty content spans (be nice to clients) - correctly aggregate usage stats for n>1 requests, always emit with last chunk in last gen to finish - collect logprobs in model wrapper and correctly handle logprobs for multi-token chars etc. - respect top_logprobs argument in request - handle a number of edge cases like <think> tag being part of held string, etc. - retain tool parsing and inference-abort fixes from #413, apply similar fix to non-stream request as well Still TODO: - testing and validation with more models and tool schemas (tested on Qwen so far) - enable JSON constraint for JSON tool models - possibly some pydantification - documentation	2026-03-30 00:22:55 +02:00
turboderp	aa54098f26	Ruff: Format (line length)	2026-03-30 00:19:07 +02:00
turboderp	2a1503b283	Logging: Use debug level for Seq instead of verbose	2026-03-29 18:51:57 +02:00
turboderp	56378b946d	Merge branch 'fork/devnen/full-tool-calling-support' into main_seqlog # Conflicts: # common/templating.py # endpoints/OAI/utils/chat_completion.py # endpoints/OAI/utils/tools.py	2026-03-28 01:06:54 +01:00
turboderp	f3787de6a6	Ruff: Format	2026-03-27 21:47:24 +01:00
turboderp	83127ab4f8	Logging: Log messages via Seq wrapper	2026-03-27 21:38:47 +01:00
turboderp	da3d3338e8	Logging: Fix env var parsing, formatting	2026-03-27 02:31:36 +01:00
turboderp	ffca853d4c	ExLlamaV3: Force minimum rep_decay of 1 token, pending update to backend	2026-03-22 14:51:08 +01:00
turboderp	92cb48c38d	ExLlamaV3: Fix regression in max_seq_len limit	2026-03-22 00:34:47 +01:00
turboderp	088e196cbc	ExLlamaV3: Change cache size fallback value to max_seq_len, add warning to configure manually	2026-03-20 20:42:14 +01:00
turboderp	8b1bfeaba7	Model: Make sure reasoning tokens are always defined	2026-03-20 20:41:44 +01:00
turboderp	78c5993c27	ExLlamaV3: Correctly report when vision is supported but not enabled	2026-03-20 01:33:38 +01:00
turboderp	0d577b8121	Cleanup and formatting	2026-03-20 01:27:29 +01:00
turboderp	6bccc70d94	Tree: Formatting	2026-03-18 03:29:15 +01:00
turboderp	d2117a7c3b	Config: Pass reasoning settings in kwargs, allow for overrides via tabby_config.yml	2026-03-18 00:24:22 +01:00
turboderp	6bf3670372	Model: Correctly read max_position_embeddings in nested config Rework how max_seq_len is determined from user settings, model defaults and cache size constraint	2026-03-17 02:58:47 +01:00
devnen	a2c7d81686	Broader model compatibility, tool_choice support, bug fixes and cleanup	2026-02-14 16:19:59 +01:00
devnen	87bbe0fac2	Full tool-calling support: XML parsing, streaming compliance, Pydantic fix, inference abort fix	2026-02-14 14:26:57 +01:00
turboderp	54e3ea1fb3	Tree: Format	2026-01-20 22:57:36 +01:00
turboderp	0985c7f7b7	Sampling: Add adaptive-P params	2026-01-20 19:09:54 +01:00
Brian	685aca5a7d	Merge pull request #397 from beep39/json-schema-for-exllamav3 Constrained generation with json schema for ExllamaV3	2025-11-24 22:34:31 -05:00
kingbri	126759034e	Tree: Format Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-11-24 22:32:19 -05:00
Brian	df724fdc78	Merge pull request #393 from mefich/main Unloading vision model of VLMs for Exllamav3 backend	2025-11-19 22:46:59 -05:00
beep39	d53ca1345a	Constrained generation with json schema for ExllamaV3	2025-11-18 02:01:31 +09:00
turboderp	fece4791ad	exllamav2: Make sure cache size is set in unpaged mode	2025-11-06 21:03:24 +01:00
mefich	37aea9de83	Update exl3 backend model.py: fix for unloading vision models This change ensures that when unloading vlm their vision part is also unloaded.	2025-10-30 12:30:23 +05:00
turboderp	486dd0418e	Formatting	2025-10-15 10:47:58 +02:00
turboderp	0af29d957a	Fix #390	2025-10-15 10:40:19 +02:00
kingbri	6f73a0b388	Tree: Format Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-10-14 23:06:20 -04:00
kingbri	fdb86f4c63	ExllamaV2: Add max_seq_len empty case like ExllamaV3 Also remove the intermediate base_seq_len and target_seq_len variables to make code clearer. If paged mode is off, max_seq_len becomes the prime mover since batching is unavailable. Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-10-14 23:02:52 -04:00
kingbri	62e9fa217a	ExllamaV3: Handle max_seq_len defined and cache_size undefined case. The previous changes broke existing configs and max_seq_len was force-overridden to 4096. This helps single-user setups since they do not really benefit from the split cache_size/max_seq_len mechanism (except if batching). cache_size is still the prime mover in exl3 due to its paging mechanism. Ideally, for multi-user setups, cache_size should take as much VRAM as possible and max_seq_len should be limited. Breakdown: cache_size and max_seq_len specified -> values; only cache_size/max_seq_len specified -> max_seq_len = cache_size and vice versa; neither specified -> cache_size = 4096, max_seq_len = min(max_position_embeddings, cache_size). Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-10-14 21:48:36 -04:00
turboderp	04ca346732	Fix formatting	2025-10-14 03:11:59 +02:00
turboderp	8abdfe7b13	Config: replace disable_output_chunking flag with output_chunking	2025-10-14 02:47:52 +02:00
kingbri	85459ce600	Tree: Format Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-10-09 22:33:53 -04:00
turboderp	4235f98e83	Model: Change cache_size/max_seq_len behavior - Cache size is now given only by the cache_size config option. Default is 4096 (user should always override to max out VRAM) - max_seq_len, if not overridden in the config, will default to the model's config.json - max_seq_len is reduced to be no larger than the cache	2025-10-05 22:16:01 +02:00
turboderp	52e093ae6c	Model: Enable max_rq_tokens (output chunking)	2025-10-05 18:54:45 +02:00
turboderp	e09a61969f	Model: Fix NCCL detection	2025-10-05 18:52:37 +02:00

360 Commits
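
The cache_size/max_seq_len breakdown in the message of commit 62e9fa217a can be read as a small decision table. The following is a minimal sketch of that table only, written from the commit message and not taken from the repository: the function name `resolve_lengths`, its signature, and the constant `DEFAULT_CACHE_SIZE` are hypothetical. Newer commits in the list (for example 088e196cbc and 6bf3670372) change the fallback behavior, so this should not be taken as the current logic.

```python
from typing import Optional, Tuple

# Fallback value named in the commit message (hypothetical constant name).
DEFAULT_CACHE_SIZE = 4096


def resolve_lengths(
    cache_size: Optional[int],
    max_seq_len: Optional[int],
    max_position_embeddings: int,
) -> Tuple[int, int]:
    """Return (cache_size, max_seq_len) per the breakdown in commit 62e9fa217a.

    Hypothetical helper for illustration; not the project's actual code.
    """
    # Both specified: use the given values as-is.
    if cache_size is not None and max_seq_len is not None:
        return cache_size, max_seq_len
    # Only cache_size specified: max_seq_len follows it.
    if cache_size is not None:
        return cache_size, cache_size
    # Only max_seq_len specified: cache_size follows it.
    if max_seq_len is not None:
        return max_seq_len, max_seq_len
    # Neither specified: default cache, sequence length capped by the model limit.
    cache_size = DEFAULT_CACHE_SIZE
    return cache_size, min(max_position_embeddings, cache_size)


if __name__ == "__main__":
    print(resolve_lengths(None, 16384, 32768))
    print(resolve_lengths(None, None, 2048))
```

The sketch keeps the two values independent when both are given, as the message states; the clamp of max_seq_len to the cache size described in commit 4235f98e83 is deliberately left out.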