- remove disconnect_task
- move disconnect logic to a per-request handler that wraps the cleanup operation and directly polls the request state with throttling
- exclusively signal disconnect with CancelledError
- rework completions endpoint to follow same approach as chat completions, share some code
- refactor OAI endpoints a bit
- correct behavior for batched completion requests
- make sure logprobs work for completion and streaming completion requests
- more tests
- remove ToolConfig, reduce to a single `tool_format` argument and hard-code extra args like start/end tokens
- dispatch to short, self-contained (and probably easily vibe coded) parser for each model type
- remove autodetection (seems infeasible since parsing effectively starts during streaming, and there is overlap between tool formats for different models)
- streamline xml parser and dedicate to qwen3_coder models
- add parsers for glm4.x, minimax-m2.x and mistral (seems shaky, probably because mistralai don't validate against hf)
- update docs
- move tool config from template_vars to separate yml config
- new per-gen stream collector used for both streaming and non-streaming requests to ensure logic is consistent for both
- move responsibility for switching between phases to stream collector
- collect tool calls during streaming and parse at the end of each gen
- prevent streaming empty content spans (be nice to clients)
- correctly aggregate usage stats for n>1 requests, always emit with last chunk in last gen to finish
- collect logprobs in model wrapper and correctly handle logprobs for multi-token chars etc.
- respect top_logprobs argument in request
- handle a number of edge cases like <think> tag being part of held string, etc.
- retain tool parsing and inference-abort fixes from #413, apply similar fix to non-stream request as well
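
The disconnect handling from the first items above (per-request handler, throttled polling of the request state, CancelledError as the only signal) could look roughly like the sketch below. This is an illustration under assumptions, not the code in this change: `FakeRequest` and `run_with_disconnect_check` are hypothetical names, and the only thing assumed of the real request object is an async `is_disconnected()` method.

```python
import asyncio
import time


class FakeRequest:
    """Hypothetical stand-in for a request; reports a disconnect after a delay."""

    def __init__(self, disconnect_after: float):
        self._deadline = time.monotonic() + disconnect_after

    async def is_disconnected(self) -> bool:
        return time.monotonic() >= self._deadline


async def run_with_disconnect_check(request, work, cleanup, poll_interval=0.05):
    """Run the `work` coroutine, polling the request at most once per interval.

    A disconnect is signalled only by raising CancelledError. The cleanup
    coroutine function is awaited on every exit path.
    """
    task = asyncio.create_task(work)
    try:
        while not task.done():
            if await request.is_disconnected():
                raise asyncio.CancelledError("client disconnected")
            # Throttle: return when the task finishes or the interval elapses.
            await asyncio.wait({task}, timeout=poll_interval)
        return task.result()
    finally:
        if not task.done():
            # Stop the in-flight work and let its cancellation settle.
            task.cancel()
            await asyncio.gather(task, return_exceptions=True)
        await cleanup()
```

Polling inside the handler keeps the check and the cleanup in one place, so there is no separate task whose lifetime has to be tracked.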
Still TODO:
- testing and validation with more models and tool schemas (tested on Qwen so far)
- enable JSON constraint for JSON tool models
- possibly some pydantification
- documentation
Apply the default overrides after inline config has been merged.
Do not require an inline config to apply use_as_default and other
overrides.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Since cache_size is a more important parameter now for multi-user
setups, mark it as such by placing it below max_seq_len.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The previous changes broke existing configs and max_seq_len was
force-overridden to 4096. This helps single-user setups since they
do not really benefit from the split cache_size max_seq_len mechanism
(except if batching).
cache_size is still the prime mover in exl3 due to its paging mechanism.
Ideally, for multi-user setups, cache_size should take as much VRAM
as possible and max_seq_len should be limited.
Breakdown:
cache_size and max_seq_len specified -> values
only one of cache_size/max_seq_len specified -> the other is set to the same value
neither specified -> cache_size = 4096, max_seq_len = min(max_position_embeddings, cache_size)
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
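
For reference, the breakdown in this commit can be written as a small function. This is a sketch of the three rules as stated, not the loader's actual code: `resolve_sizes` and `DEFAULT_CACHE_SIZE` are hypothetical names, and `None` stands for "not specified in the config".

```python
from typing import Optional, Tuple

DEFAULT_CACHE_SIZE = 4096  # fallback when neither value is specified


def resolve_sizes(
    cache_size: Optional[int],
    max_seq_len: Optional[int],
    max_position_embeddings: int,
) -> Tuple[int, int]:
    """Return (cache_size, max_seq_len) following the breakdown above."""
    if cache_size is not None and max_seq_len is not None:
        # Both specified: use the values as given.
        return cache_size, max_seq_len
    if cache_size is not None:
        # Only cache_size specified: max_seq_len follows it.
        return cache_size, cache_size
    if max_seq_len is not None:
        # Only max_seq_len specified: cache_size follows it.
        return max_seq_len, max_seq_len
    # Neither specified: default cache, sequence length capped by the model.
    return DEFAULT_CACHE_SIZE, min(max_position_embeddings, DEFAULT_CACHE_SIZE)
```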
- Cache size is now given only by the cache_size config option. Default is 4096 (user should always override to max out VRAM)
- max_seq_len, if not overridden in the config, will default to the model's config.json
- max_seq_len is reduced to be no larger than the cache
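
The three rules above amount to a default plus a clamp. A minimal sketch, assuming `None` means "not set in the config" and that `model_max_seq_len` is the value read from the model's config.json (`clamp_to_cache` is a hypothetical name):

```python
from typing import Optional, Tuple


def clamp_to_cache(
    cache_size: Optional[int],
    max_seq_len: Optional[int],
    model_max_seq_len: int,
) -> Tuple[int, int]:
    """Return (cache_size, max_seq_len) with max_seq_len capped by the cache."""
    cache = 4096 if cache_size is None else cache_size
    seq = model_max_seq_len if max_seq_len is None else max_seq_len
    # max_seq_len can never exceed what the cache is able to hold.
    return cache, min(seq, cache)
```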
This allows users to choose nccl or native depending on the GPU setup.
NCCL is only available with Linux-built wheels.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
When revisiting tool calls, the formats have more or less become standard.
For greater compatibility with templates, primarily use the message.tools
parameter and remove the extra custom metadata that is no longer required.
However, unlike other backends, tabbyAPI still uses template metadata
to declare the tool start string. This allows template-level
customization and gives more power to the user, while the server
simply consumes that metadata rather than handling each case itself.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
use_as_default was not being properly applied to model overrides.
For compartmentalization's sake, apply all overrides in a single function
to avoid clutter.
In addition, fix where the traditional /v1/model/load endpoint checks
for draft options. These can be applied via an inline config, so let
any failures fall through.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Anything below the first level of kwargs was not being merged properly.
A more bulletproof solution would be to refactor the loading code
to separate draft and normal model parameters.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
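
To illustrate the nested-merge problem: a shallow `dict.update` replaces a nested dict wholesale, so keys below the first level are lost. A recursive merge keeps them. The sketch below shows the general technique only; `deep_merge` and the example keys are assumptions, not the code or option names from this commit.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Merge override into a copy of base, recursing into nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge them instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Example keys are illustrative only.
defaults = {"max_seq_len": 4096, "draft": {"name": "small", "rope_scale": 1.0}}
request = {"draft": {"rope_scale": 2.0}}
result = deep_merge(defaults, request)
```

Here `result["draft"]` still contains `name`, which a first-level-only merge would have dropped.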
Rather than relying on Content-Length which can be unreliable, ping
the API to get file sizes and work from there.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Usually, the client learns the file size from the Content-Length
header sent by the server. However, HuggingFace has changed
their headers and now does not always send Content-Length.
In this case, show an indeterminate progressbar and mark as complete
once the download finishes.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
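
The fallback described in this commit boils down to treating the total size as optional. A rough sketch, assuming headers arrive as a plain mapping; `total_bytes` and `progress_label` are hypothetical helpers, not functions from the downloader:

```python
from typing import Mapping, Optional


def total_bytes(headers: Mapping[str, str]) -> Optional[int]:
    """Return the Content-Length as an int, or None if missing or malformed."""
    raw = next(
        (v for k, v in headers.items() if k.lower() == "content-length"), None
    )
    if raw is None or not raw.strip().isdigit():
        return None
    return int(raw)


def progress_label(done: int, total: Optional[int]) -> str:
    """Show a percentage when the size is known, else an indeterminate label."""
    if total:
        return f"{done / total:.0%}"
    return f"{done} bytes (size unknown)"
```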
It's useful for the client to know what the T/s and total time for
generation are per-request.
Works with both completions and chat completions.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
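
As an illustration of the per-request numbers, T/s is just generated tokens divided by elapsed generation time. The function and field names below are assumptions for the sketch and may not match the keys in the actual response:

```python
def generation_stats(completion_tokens: int, elapsed_seconds: float) -> dict:
    """Compute total time and tokens per second for one request."""
    # Guard against a zero or negative duration to avoid dividing by zero.
    tps = completion_tokens / elapsed_seconds if elapsed_seconds > 0 else 0.0
    return {
        "total_time": round(elapsed_seconds, 2),
        "tokens_per_sec": round(tps, 2),
    }
```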