- remove disconnect_task
- move disconnect logic to a per-request handler that wraps cleanup operation and directly polls the request state with throttling
- exclusively signal disconnect with CancelledError
- rework completions endpoint to follow same approach as chat completions, share some code
- refactor OAI endpoints a bit
- correct behavior for batched completion requests
- make sure logprobs work for completion and streaming completion requests
- more tests
- remove ToolConfig, reduce to a single `tool_format` argument and hard-code extra args like start/end tokens
- dispatch to short, self-contained (and probably easily vibe coded) parser for each model type
- remove autodetection (seems infeasible since parsing effectively starts during streaming, and there is overlap between tool formats for different models)
- streamline xml parser and dedicate to qwen3_coder models
- add parsers for glm4.x, minimax-m2.x and mistral (seems shaky, probably because mistralai don't validate against hf)
- update docs
- move tool config from template_vars to separate yml config
- new per-gen stream collector used for both streaming and non-streaming requests to ensure logic is consistent for both
- move responsibility for switching between phases to stream collector
- collect tool calls during streaming and parse at the end of each gen
- prevent streaming empty content spans (be nice to clients)
- correctly aggregate usage stats for n>1 requests, always emit with last chunk in last gen to finish
- collect logprobs in model wrapper and correctly handle logprobs for multi-token chars etc.
- respect top_logprobs argument in request
- handle a number of edge cases like <think> tag being part of held string, etc.
- retain tool parsing and inference-abort fixes from #413, apply similar fix to non-stream request as well
Still TODO:
- testing and validation with more models and tool schemas (tested on Qwen so far)
- enable JSON constraint for JSON tool models
- possibly some pydantification
- documentation
Adding these to each generation chunk helps remove redundancy and
unecessary request ID operations.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Doing this helps reduce the model's burden of generating the tool
call ID and type (which is always "function"). Follow mistral's spec
for tool call IDs by using a 9 character alphanumeric string.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Re-rendering the template is an expensive operation when it's possible
to just concatenate the prompt and current generation text together.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
When revisiting tool calls, the formats have more or less become standard.
For greater compatibility with templates, primarily use the message.tools
parameter and remove the extra custom metadata that is no longer required.
However, unlike other backends, tabbyAPI still uses template metadata
to declare what the tool start string is. This allows for template-level
customization along with giving more power to the user while the server
exists to consume rather than work on a case-by-case basis.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
To render in the template, tool call start tokens needed to have less
checks and remove the line to convert message.tool_calls to a dict
since that breaks the rest of the chain by disconnecting the types.
model_dump on the message itself already accomplishes this.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
It's useful for the client to know what the T/s and total time for
generation are per-request.
Works with both completions and chat completions.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Matching YALS, if the model has add_bos_token enabled, then remove
an extra BOS token at the start of the prompt. This usually happens
with misconfigured templates such as Llama 3.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Tools must be None by default. Chat completion message content can
be None, a string, or a list, so default to None. Exclude all None
values from a CC message since the template can say the variable
"exists" despite being None, causing an error.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Messages were mistakenly being sent as Pydantic objects, but templates
expect dictionaries. Properly convert these before render.
In addition, initialize all Optional lists as an empty list since
this will cause the least problems when interacting with other parts
of API code, such as templates.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>