tabbyAPI

mirror of https://github.com/theroyallab/tabbyAPI.git synced 2026-04-20 06:19:15 +00:00

Author	SHA1	Message	Date
turboderp	f1a2416da5	OAI endpoints: Add option to suppress header after reasoning start token (e.g. Gemma4's "thought\n")	2026-04-12 04:12:53 +02:00
turboderp	3a42c1756c	ExLlamaV2: Use new disconnect handler	2026-04-10 22:04:21 +02:00
turboderp	55124d0fc6	Config: Add force_enable_thinking	2026-04-10 00:16:40 +02:00
turboderp	79d581e1f5	OAI endpoints: More rework - remove disconnect_task - move disconnect logic to a per-request handler that wraps cleanup operation and directly polls the request state with throttling - exclusively signal disconnect with CancelledError - rework completions endpoint to follow same approach as chat completions, share some code - refactor OAI endpoints a bit - correct behavior for batched completion requests - make sure logprobs work for completion and streaming completion requests - more tests	2026-04-02 01:26:44 +02:00
turboderp	0409064028	Tools: Refactor and further simplify tool parsing - remove ToolConfig, reduce to a single `tool_format` argument and hard-code extra args like start/end tokens - dispatch to short, self-contained (and probably easily vibe coded) parser for each model type - remove autodetection (seems infeasible since parsing effectively starts during streaming, and there is overlap between tool formats for different models) - streamline xml parser and dedicate to qwen3_coder models - add parsers for glm4.x, minimax-m2.x and mistral (seems shaky, probably because mistralai don't validate against hf) - update docs	2026-04-01 00:07:44 +02:00
turboderp	179479199b	Rework tool calls and OAI chat completions - move tool config from template_vars to separate yml config - new per-gen stream collector used for both streaming and non-streaming requests to ensure logic is consistent for both - move responsibility for switching between phases to stream collector - collect tool calls during streaming and parse at the end of each gen - prevent streaming empty content spans (be nice to clients) - correctly aggregate usage stats for n>1 requests, always emit with last chunk in last gen to finish - collect logprobs in model wrapper and correctly handle logprobs for multi-token chars etc. - respect top_logprobs argument in request - handle a number of edge cases like <think> tag being part of held string, etc. - retain tool parsing and inference-abort fixes from #413, apply similar fix to non-stream request as well Still TODO: - testing and validation with more models and tool schemas (tested on Qwen so far) - enable JSON constraint for JSON tool models - possibly some pydantification - documentation	2026-03-30 00:22:55 +02:00
turboderp	aa54098f26	Ruff: Format (line length)	2026-03-30 00:19:07 +02:00
turboderp	2a1503b283	Logging: Use debug level for Seq instead of verbose	2026-03-29 18:51:57 +02:00
turboderp	f3787de6a6	Ruff: Format	2026-03-27 21:47:24 +01:00
turboderp	83127ab4f8	Logging: Log messages via Seq wrapper	2026-03-27 21:38:47 +01:00
turboderp	8b1bfeaba7	Model: Make sure reasoning tokens are always defined	2026-03-20 20:41:44 +01:00
turboderp	6bccc70d94	Tree: Formatting	2026-03-18 03:29:15 +01:00
turboderp	d2117a7c3b	Config: Pass reasoning settings in kwargs, allow for overrides via tabby_config.yml	2026-03-18 00:24:22 +01:00
turboderp	6bf3670372	Model: Correctly read max_position_embeddings in nested config Rework how max_seq_len is determined from user settings, model defaults and cache size constraint	2026-03-17 02:58:47 +01:00
turboderp	fece4791ad	exllamav2: Make sure cache size is set in unpaged mode	2025-11-06 21:03:24 +01:00
turboderp	486dd0418e	Formatting	2025-10-15 10:47:58 +02:00
turboderp	0af29d957a	Fix #390	2025-10-15 10:40:19 +02:00
kingbri	6f73a0b388	Tree: Format Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-10-14 23:06:20 -04:00
kingbri	fdb86f4c63	ExllamaV2: Add max_seq_len empty case like ExllamaV3 Also remove the intermediate base_seq_len and target_seq_len variables to make code clearer. If paged mode is off, max_seq_len becomes the prime mover since batching is unavailable. Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-10-14 23:02:52 -04:00
turboderp	04ca346732	Fix formatting	2025-10-14 03:11:59 +02:00
turboderp	4235f98e83	Model: Change cache_size/max_seq_len behavior - Cache size is now given only by the cache_size config option. Default is 4096 (user should always override to max out VRAM) - max_seq_len, if not overridden in the config, will default to the model's config.json - max_seq_len is reduced to be no larger than the cache	2025-10-05 22:16:01 +02:00
kingbri	0b4ca567f8	API: Persist request IDs and append full_text to finish chunk Adding these to each generation chunk helps remove redundancy and unnecessary request ID operations. Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-07-25 12:27:44 -04:00
kingbri	a02d39de31	Model: Remove rogue print Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-06-17 23:09:07 -04:00
kingbri	2913ce29fc	API: Add timings to usage stats It's useful for the client to know what the T/s and total time for generation are per-request. Works with both completions and chat completions. Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-06-17 22:54:51 -04:00
kingbri	5d94d4d022	Merge branch 'main' into breaking	2025-06-17 22:24:32 -04:00
turboderp	1c9891bf04	Exl3: Add vision capability	2025-06-15 19:22:51 +02:00
turboderp	4605c0f6bd	Common: Refactor get_image to common functions	2025-06-15 19:20:36 +02:00
turboderp	a0c16bba2a	Exl2: Fix banned_strings (move outside of assign_gen_params)	2025-06-15 16:51:42 +02:00
kingbri	2096c9bad2	Model: Default max_seq_len to 4096 A common problem in TabbyAPI is that users who want to get up and running with a model always had issues with max_seq_len causing OOMs. This is because model devs set max context values in the millions which requires a lot of VRAM. To idiot-proof first time setup, make the fallback default 4096 so users can run their models. If a user still wants to use the model's max_seq_len, set it to -1. Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-06-13 14:57:24 -04:00
turboderp	691a080ac7	Dependencies: Bump ExllamaV3 and ExllamaV2	2025-05-31 23:55:04 +02:00
kingbri	17f3dca6fc	Packaging: Add agnostic method to check version of packages Some packages such as ExllamaV2 and V3 require specific versions for the latest features. Rather than creating repetitive functions, create an agnostic function to check the installed package and then report to the user to upgrade. This is also sent to requests for loading and unloading, so keep the error short. Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-05-17 01:04:24 -04:00
kingbri	0858b6d4b2	Tree: Format Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-05-17 00:46:40 -04:00
kingbri	390daeb92f	Model: Create universal HFModel class The HFModel class serves to coalesce all config files that contain random keys which are required for model usage. Adding this base class allows us to expand as HuggingFace randomly changes their JSON schemas over time, reducing the brunt that backend devs need to feel when their next model isn't supported. Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-05-13 18:12:38 -04:00
kingbri	bd3fec929c	Tree: Format Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-05-12 11:32:27 -04:00
kingbri	a524ac3c0f	Model: Fix cache mode again If statements can be difficult to work with. Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-05-12 11:30:47 -04:00
kingbri	20cad851e9	Model: Fix param call Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-05-12 09:52:28 -04:00
kingbri	d15eb55f20	Model: Fix exl2 cache mode check FP16 was not included in the validation step. Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-05-12 09:51:09 -04:00
kingbri	656af41b5d	Model: Always enable decode_special_tokens The frontend should handle the special tokens if they get emitted. Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-05-09 22:25:50 -04:00
kingbri	42346c6b39	Sampling: Remove skip_special_tokens This parameter is way too confusing and does not make sense in the modern LLM space. Change approved by all maintainers. Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-05-09 22:11:33 -04:00
kingbri	25c77ebf77	Model: Remove exllamav2-specific version check No longer necessary thanks to the agnostic check. Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-05-09 22:08:15 -04:00
DocShotgun	9dcde59c57	Model: Check for unsupported cache mode in exllamav2	2025-05-06 01:18:15 -07:00
DocShotgun	68a660bdb3	Model: Initial Exl3 cache quantization support	2025-05-03 20:35:35 -07:00
kingbri	e8f00412f6	Model: Fetch from generation_config and tokenizer_config in Exl3 Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-05-02 21:33:25 -04:00
kingbri	bdc5189a4b	Exl3: Add chunk size, cache size, and model info Use the same algorithm for estimating and adjusting cache size based on multiples of 256 and above max seq len. Same applies for chunk size. Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-05-02 21:33:25 -04:00
randoentity	daae9ec43d	Exl3: Couldn't wait Just copied some stuff around and it ended up working for basic use.	2025-05-02 21:33:25 -04:00
kingbri	0c1d794390	Model: Add exl3 and associated load functions Initial exl3 compat and loading functionality. Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-05-02 21:32:39 -04:00
kingbri	242f6b7d2a	Model: Simplify add_bos_token handling Set add_bos_token to True by default in the tokenizer_config stub. Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-05-02 21:32:28 -04:00
kingbri	4cb3e5d5b1	Tree: Format Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-05-02 00:23:15 -04:00
kingbri	47cb2a0de9	Model: Add TokenizerConfig stub and add_eos_token fallback This stub fetches the add_eos_token field from the HF tokenizer config. Ideally, this should be in the backend rather than tabby. Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-05-02 00:08:01 -04:00
kingbri	aa657fa6e9	API: Ignore add_bos_token in chat completions When fetching special tokens from the model, don't factor in the add_bos_token and ban_eos_token parameters as switches. In addition, change the internal handling of add_bos_token to an optional boolean. This allows us to fallback to the model when selecting whether or not to add the BOS token, especially for chat completions. Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>	2025-05-01 22:51:15 -04:00
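
The cache_size/max_seq_len rule described in commit 4235f98e83 can be read as a small resolution step. The sketch below is only an illustration of that commit message, not code from tabbyAPI: the function name `resolve_lengths` and its parameter names are hypothetical, and it takes the model's `max_position_embeddings` value as the "model's config.json" default. Commit 6bf3670372 later reworks how max_seq_len is determined, so this may not match current behavior.

```python
from typing import Optional, Tuple

# Default cache size stated in the message of commit 4235f98e83.
DEFAULT_CACHE_SIZE = 4096


def resolve_lengths(
    model_max_position_embeddings: int,
    cache_size: Optional[int] = None,
    max_seq_len: Optional[int] = None,
) -> Tuple[int, int]:
    """Hypothetical helper: return (cache_size, max_seq_len).

    Follows the three rules listed in the commit message:
    1. cache size comes only from the cache_size option (default 4096);
    2. max_seq_len falls back to the model's own limit when not overridden;
    3. max_seq_len is reduced so it is never larger than the cache.
    """
    resolved_cache = DEFAULT_CACHE_SIZE if cache_size is None else cache_size
    resolved_seq = (
        model_max_position_embeddings if max_seq_len is None else max_seq_len
    )
    # Rule 3: clamp the sequence length to the cache size.
    resolved_seq = min(resolved_seq, resolved_cache)
    return resolved_cache, resolved_seq


if __name__ == "__main__":
    # No overrides: a long-context model is clamped to the default cache.
    print(resolve_lengths(32768))
    # Larger cache, explicit shorter max_seq_len.
    print(resolve_lengths(32768, cache_size=16384, max_seq_len=2048))
```

Under these assumptions, a model advertising a very long context with no overrides ends up with a 4096-token sequence limit, which is why the commit message advises users to always raise cache_size to fill available VRAM.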

286 Commits