ik_llama.cpp/examples/server/server-task.cpp
commit ab1d74074b by firecoperana: common : introduce composable PEG parser combinators for chat parsing and new jinja template engine (#1369)
---------

Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>

common : add nemotron 3 parsing (#18077)

common : add parser for ministral/mistral large 3/devstral 2 (#17713)

common : default content to an empty string (#18485)

chat: make tool description and parameters optional per OpenAI spec (#18478)

Per the OpenAI API specification, both 'description' and 'parameters'
fields in tool function definitions are optional. Previously, the parser
would throw an exception if these fields were missing.

Attempts to fix #17667

common : implement new jinja template engine (#18462)
---------

Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

jinja: correct member access rule (#18905)

jinja : fix lexing of float literals with sign (#18901)

jinja : add missing tojson filter for bool (#18900)

jinja : attribute support for join, map and sort (#18883)

jinja : fix object item order (and properly implement dictsort) (#18904)

tests : add test-jinja -py option for cross-checking (#18906)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

ci : run test-jinja -py on high perf [no ci] (#18916)

jinja : fix undefined keys and attributes and int/float as bool (#18924)

jinja: support none|string (#18995)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

jinja : implement mixed type object keys (#18955)

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (#19147)

`tojson` is not a supported `undefined` filter

keep it DRY and fix some types

jinja : do not pass empty tools and add some none filters (#19176)

jinja : add unordered_map include to value.h [no ci] (#19205)

jinja : add missing 'in' test to template engine (#19004) (#19239)

The jinja template parser was missing the 'in' test from
global_builtins(), causing templates using reject("in", ...),
select("in", ...), or 'x is in(y)' to fail with
"selectattr: unknown test 'in'".

This broke tool-calling for Qwen3-Coder and any other model
whose chat template uses the 'in' test.

Added test_is_in supporting array, string, and object containment
checks, mirroring the existing 'in' operator logic in runtime.cpp.

Includes test cases for all three containment types plus
reject/select filter usage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Sid Mohan <sidmohan0@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

Add Jinja support for "indent" string filter (#19529)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

add vendor

refactor chat

server : support preserving reasoning_content in assistant message (#18994)

chat : fix translategemma crash on common_chat_format_example (#19019)

chat: fix language input for translategemma (#19052)

Co-authored-by: Aldehir Rojas <hello@alde.dev>

chat: fix case where template accepts type content only (#19419)

mtmd : chat : Fix extra \n between text and media marker (#19595)

Thanks to @tugot17 for detecting and reporting the issue.

For vision models (e.g. LFM2.5-VL-1.6B and Qwen/Qwen3-VL-4B-Instruct) `llama-mtmd-cli` produces output identical to the HF implementation.

However `llama-server` doesn't. I traced it down to an extra newline
inserted after `<__media__>`.

This happens in `to_json_oaicompat`, which treats media markers as text
and joins all parts with a `\n` separator.

This PR introduces a new type, `media_marker`, and uses it for media markers.
Extra logic is added to prevent insertion of newlines before and after
media markers.

With this change the number of input tokens matches the HF
implementation, and as a result the output is also identical.

I explored other ways to address the issue:
* remove the `\n` between text parts in `to_json_oaicompat` entirely
* merge text messages in server-common.cpp before sending them to `to_json_oaicompat`

Please propose alternative ways of fixing this issue.

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>

common : merge qwen3-coder and nemotron nano 3 parsers (#19765)

common : fix improper trimming in XML parser on complete message (#19805)

Co-authored-by: Jules LEIDELINGER <11395311+julio75012@users.noreply.github.com>

jinja: correct stats for tojson and string filters (#19785)

jinja : correct default size for string slices (#19913)

common : handle unicode during partial json parsing (#16526)

common : fix json schema with '\' in literals (#17307)

add back qwen_coder_xml and mirothinker

Co-authored-by: Aldehir Rojas <hello@alde.dev>
2026-03-09 11:03:33 +01:00


#include "server-task.h"
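// Serialization helpers for server task results. Each to_json_* method below
// renders the same completion state in a different wire format: the legacy
// non-OAI JSON, OpenAI text completions, OpenAI chat completions (final object
// and streamed chunks), the OpenAI Responses API (SSE events), and the
// Anthropic Messages API. The prompt-cache helpers at the end manage saved
// llama sequence states so prompts can be reused across requests.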
json result_timings::to_json() const {
json base = {
{"prompt_n", prompt_n},
{"prompt_ms", prompt_ms},
{"prompt_per_token_ms", prompt_per_token_ms},
{"prompt_per_second", prompt_per_second},
{"predicted_n", predicted_n},
{"predicted_ms", predicted_ms},
{"predicted_per_token_ms", predicted_per_token_ms},
{"predicted_per_second", predicted_per_second},
{"n_ctx", n_ctx},
{"n_past", n_past},
};
if (draft_n > 0) {
base["draft_n"] = draft_n;
base["draft_n_accepted"] = draft_n_accepted;
}
return base;
}
//json server_task_result_cmpl_partial::to_json_non_oaicompat_partial() {
// // non-OAI-compat JSON
// json res = json{
// {"index", index},
// {"content", content},
// {"tokens", tokens},
// {"stop", false},
// {"id_slot", id_multi},
// {"tokens_predicted", n_decoded},
// {"tokens_evaluated", n_prompt_tokens},
// };
// // populate the timings object when needed (usually for the last response or with timings_per_token enabled)
// if (timings.prompt_n > 0) {
// res.push_back({ "timings", timings.to_json() });
// }
// if (!probs_output.empty()) {
// res["completion_probabilities"] = completion_token_output::probs_vector_to_json(probs_output, post_sampling_probs);
// }
// return res;
//}
//json server_task_result_cmpl_final::to_json_non_oaicompat_final() {
// json res = json{
// {"index", index},
// {"content", stream ? "" : content}, // in stream mode, content is already in last partial chunk
// {"tokens", stream ? std::vector<llama_token> {} : tokens},
// {"id_slot", id_multi},
// {"stop", true},
// {"model", oaicompat_model},
// {"tokens_predicted", n_decoded},
// {"tokens_evaluated", n_prompt_tokens},
// //{"generation_settings", default_generation_settings_for_props.to_json()},
// {"prompt", prompt},
// {"has_new_line", has_new_line},
// {"truncated", truncated},
// //{"stop_type", stop_type_to_str(STOP_TYPE_EOS)},
// {"stopping_word", stopping_word},
// {"tokens_cached", n_tokens_cached},
// {"timings", timings.to_json()},
// };
// if (!stream && !probs_output.empty()) {
// res["completion_probabilities"] = completion_token_output::probs_vector_to_json(probs_output, post_sampling_probs);
// }
// return response_fields.empty() ? res : json_get_nested_values(response_fields, res);
//}
json server_task_result_cmpl_partial::to_json_non_oaicompat_partial() {
// non-OAI-compat JSON
return data;
}
json server_task_result_cmpl_final::to_json_non_oaicompat_final() {
// non-OAI-compat JSON
return data;
}
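// OpenAI-compatible /v1/completions ("text_completion") objects. Illustrative
// shape of a partial chunk (field values depend on the request):
//   { "object": "text_completion", "id": "...", "model": "...", "created": ...,
//     "choices": [ { "text": "...", "index": ..., "logprobs": null, "finish_reason": null } ],
//     "usage":   { "prompt_tokens": ..., "completion_tokens": ..., "total_tokens": ... } }
// A "timings" object is appended when timings.prompt_n >= 0, and "__verbose"
// carries the raw non-OAI payload in verbose mode.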
json server_task_result_cmpl_partial::to_json_oaicompat_partial() {
std::time_t t = std::time(0);
json logprobs = json(nullptr); // OAI default to null
if (probs_output.size() > 0) {
logprobs = json{
{"content", completion_token_output::probs_vector_to_json(probs_output, post_sampling_probs)},
};
}
json res = json{
{"choices", json::array({
json{
{"text", content},
{"index", index},
{"logprobs", logprobs},
{"finish_reason", nullptr},
}
})},
{"created", t},
{"model", oaicompat_model},
{"object", "text_completion"},
{"usage", json {
{"completion_tokens", n_decoded},
{"prompt_tokens", n_prompt_tokens},
{"total_tokens", n_decoded + n_prompt_tokens}
}},
{"id", oaicompat_cmpl_id}
};
// extra fields for debugging purposes
if (verbose) {
res["__verbose"] = to_json_non_oaicompat_partial();
}
if (timings.prompt_n >= 0) {
res.push_back({ "timings", timings.to_json() });
}
return res;
}
json server_task_result_cmpl_final::to_json_oaicompat_final() {
std::time_t t = std::time(0);
json logprobs = json(nullptr); // OAI default to null
if (!stream && probs_output.size() > 0) {
logprobs = json{
{"content", completion_token_output::probs_vector_to_json(probs_output, post_sampling_probs)},
};
}
json finish_reason = "length";
if (stop == STOP_TYPE_WORD || stop == STOP_TYPE_EOS) {
finish_reason = "stop";
}
json res = json{
{"choices", json::array({
json{
{"text", stream ? "" : content}, // in stream mode, content is already in last partial chunk
{"index", index},
{"logprobs", logprobs},
{"finish_reason", finish_reason},
}
})},
{"created", t},
{"model", oaicompat_model},
{"object", "text_completion"},
{"usage", json {
{"completion_tokens", n_decoded},
{"prompt_tokens", n_prompt_tokens},
{"total_tokens", n_decoded + n_prompt_tokens}
}},
{"id", oaicompat_cmpl_id}
};
// extra fields for debugging purposes
if (verbose) {
res["__verbose"] = to_json_non_oaicompat_final();
}
if (timings.prompt_n >= 0) {
res.push_back({ "timings", timings.to_json() });
}
return res;
}
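// OpenAI-compatible chat streaming: returns an array of "chat.completion.chunk"
// objects. The first chunk (n_decoded == 1) carries the assistant role with null
// content to match OpenAI behavior; the remaining chunks carry message diffs.
// logprobs and timings, when available, are attached to the last chunk.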
json server_task_result_cmpl_partial::to_json_oaicompat_chat_partial() {
bool first = n_decoded == 1;
std::time_t t = std::time(0);
json choices;
std::vector<json> deltas;
auto add_delta = [&](const json& delta) {
deltas.push_back({
{"choices", json::array({
json {
{"finish_reason", nullptr},
{"index", 0},
{"delta", delta},
},
})},
{"created", t},
{"id", oaicompat_cmpl_id},
{"model", oaicompat_model},
{"object", "chat.completion.chunk"},
{"usage", json {
{"completion_tokens", n_decoded},
{"prompt_tokens", n_prompt_tokens},
{"total_tokens", n_decoded + n_prompt_tokens},
}},
});
};
// We have to send an initial update to conform to openai behavior
if (first) {
add_delta({
{"role", "assistant"},
{"content", nullptr},
});
}
for (const auto& diff : oaicompat_msg_diffs) {
add_delta(common_chat_msg_diff_to_json_oaicompat(diff));
}
if (!deltas.empty()) {
GGML_ASSERT(deltas[deltas.size() - 1].at("choices").size() >= 1);
if (probs_output.size() > 0) {
deltas[deltas.size() - 1].at("choices").at(0)["logprobs"] = json{
{"content", completion_token_output::probs_vector_to_json(probs_output, post_sampling_probs)},
};
}
if (timings.prompt_n >= 0) {
deltas[deltas.size() - 1].push_back({ "timings", timings.to_json() });
}
}
return deltas;
}
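// OpenAI Responses API streaming: emits SSE events. The first token produces
// response.created and response.in_progress; subsequent diffs produce
// response.output_item.added / response.content_part.added when a reasoning,
// text or function_call item starts, followed by response.reasoning_text.delta,
// response.output_text.delta and response.function_call_arguments.delta.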
json server_task_result_cmpl_partial::to_json_oaicompat_resp_partial() {
std::vector<json> events;
if (n_decoded == 1) {
events.push_back(json{
{"event", "response.created"},
{"data", json{
{"type", "response.created"},
{"response", json{
{"id", oai_resp_id},
{"object", "response"},
{"status", "in_progress"},
}},
}},
});
events.push_back(json{
{"event", "response.in_progress"},
{"data", json{
{"type", "response.in_progress"},
{"response", json{
{"id", oai_resp_id},
{"object", "response"},
{"status", "in_progress"},
}},
}},
});
}
for (const auto& diff : oaicompat_msg_diffs) {
if (!diff.reasoning_content_delta.empty()) {
if (!oai_resp_thinking_block_started) {
events.push_back(json{
{"event", "response.output_item.added"},
{"data", json{
{"type", "response.output_item.added"},
{"item", json{
{"id", oai_resp_reasoning_id},
{"summary", json::array()},
{"type", "reasoning"},
{"content", json::array()},
{"encrypted_content", ""},
{"status", "in_progress"},
}},
}},
});
oai_resp_thinking_block_started = true;
}
events.push_back(json{
{"event", "response.reasoning_text.delta"},
{"data", json{
{"type", "response.reasoning_text.delta"},
{"delta", diff.reasoning_content_delta},
{"item_id", oai_resp_reasoning_id},
}},
});
}
if (!diff.content_delta.empty()) {
if (!oai_resp_text_block_started) {
events.push_back(json{
{"event", "response.output_item.added"},
{"data", json{
{"type", "response.output_item.added"},
{"item", json{
{"content", json::array()},
{"id", oai_resp_message_id},
{"role", "assistant"},
{"status", "in_progress"},
{"type", "message"},
}},
}},
});
events.push_back(json{
{"event", "response.content_part.added"},
{"data", json{
{"type", "response.content_part.added"},
{"item_id", oai_resp_message_id},
{"part", json{
{"type", "output_text"},
{"text", ""},
}},
}},
});
oai_resp_text_block_started = true;
}
events.push_back(json{
{"event", "response.output_text.delta"},
{"data", json{
{"type", "response.output_text.delta"},
{"item_id", oai_resp_message_id},
{"delta", diff.content_delta},
}},
});
}
if (!diff.tool_call_delta.name.empty()) {
events.push_back(json{
{"event", "response.output_item.added"},
{"data", json{
{"type", "response.output_item.added"},
{"item", json{
{"arguments", ""},
{"call_id", "fc_" + diff.tool_call_delta.id},
{"name", diff.tool_call_delta.name},
{"type", "function_call"},
{"status", "in_progress"},
}},
}},
});
oai_resp_fc_id = diff.tool_call_delta.id;
}
if (!diff.tool_call_delta.arguments.empty()) {
events.push_back(json{
{"event", "response.function_call_arguments.delta"},
{"data", json{
{"type", "response.function_call_arguments.delta"},
{"delta", diff.tool_call_delta.arguments},
{"item_id", "fc_" + oai_resp_fc_id},
}},
});
}
}
return events;
}
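// Final (non-streamed) OpenAI-compatible "chat.completion" object. finish_reason
// is "length" unless generation stopped, in which case it is "stop" or
// "tool_calls" depending on whether the parsed message contains tool calls.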
json server_task_result_cmpl_final::to_json_oaicompat_chat_final() {
std::string finish_reason = "length";
common_chat_msg msg;
if (!oaicompat_msg.empty()) {
msg = oaicompat_msg;
}
else {
msg.role = "assistant";
msg.content = content;
}
if (stop) {
finish_reason = msg.tool_calls.empty() ? "stop" : "tool_calls";
}
json choice{
{"finish_reason", finish_reason},
{"index", 0},
{"message", msg.to_json_oaicompat()},
};
if (!stream && probs_output.size() > 0) {
choice["logprobs"] = json{
{"content", completion_token_output::probs_vector_to_json(probs_output, post_sampling_probs)},
};
}
std::time_t t = std::time(0);
json res = json{
{"choices", json::array({choice})},
{"created", t},
{"model", oaicompat_model},
{"object", "chat.completion"},
{"usage", json {
{"completion_tokens", n_decoded},
{"prompt_tokens", n_prompt_tokens},
{"total_tokens", n_decoded + n_prompt_tokens}
}},
{"id", oaicompat_cmpl_id}
};
// extra fields for debugging purposes
if (verbose) {
res["__verbose"] = to_json_non_oaicompat_final();
}
if (timings.prompt_n >= 0) {
res.push_back({ "timings", timings.to_json() });
}
return res;
}
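// Closing chunks for OpenAI-compatible chat streaming: any remaining message
// diffs, a chunk carrying the final finish_reason with an empty delta, and,
// when include_usage is set, a usage-only chunk with an empty "choices" array
// as required by the OpenAI streaming spec. Timings go on the last chunk.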
json server_task_result_cmpl_final::to_json_oaicompat_chat_stream() {
std::time_t t = std::time(0);
std::string finish_reason = "length";
if (stop) {
//if (stop == STOP_TYPE_WORD || stop == STOP_TYPE_EOS) {
finish_reason = oaicompat_msg.tool_calls.empty() ? "stop" : "tool_calls";
}
json deltas = json::array();
for (const auto& diff : oaicompat_msg_diffs) {
deltas.push_back({
{"choices", json::array({
json {
{"finish_reason", nullptr},
{"index", 0},
{"delta", common_chat_msg_diff_to_json_oaicompat(diff)},
},
})},
{"created", t},
{"id", oaicompat_cmpl_id},
{"model", oaicompat_model},
{"object", "chat.completion.chunk"},
});
}
deltas.push_back({
{"choices", json::array({
json {
{"finish_reason", finish_reason},
{"index", 0},
{"delta", json::object()},
},
})},
{"created", t},
{"id", oaicompat_cmpl_id},
{"model", oaicompat_model},
{"object", "chat.completion.chunk"},
});
if (include_usage) {
// OpenAI API spec for chat.completion.chunks specifies an empty `choices` array for the last chunk when including usage
// https://platform.openai.com/docs/api-reference/chat_streaming/streaming#chat_streaming/streaming-choices
deltas.push_back({
{"choices", json::array()},
{"created", t},
{"id", oaicompat_cmpl_id},
{"model", oaicompat_model},
{"object", "chat.completion.chunk"},
{"usage", json {
{"completion_tokens", n_decoded},
{"prompt_tokens", n_prompt_tokens},
{"total_tokens", n_decoded + n_prompt_tokens},
}},
});
}
if (timings.prompt_n >= 0) {
deltas.back().push_back({ "timings", timings.to_json() });
}
// extra fields for debugging purposes
if (verbose && !deltas.empty()) {
deltas.front()["__verbose"] = to_json_non_oaicompat_final();
}
return deltas;
}
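// Final (non-streamed) OpenAI Responses API object. The "output" array holds a
// reasoning item and a message item (each only when non-empty) plus one
// function_call item per tool call; "usage" reports input/output token counts.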
json server_task_result_cmpl_final::to_json_oaicompat_resp_final() {
common_chat_msg msg;
if (!oaicompat_msg.empty()) {
msg = oaicompat_msg;
}
else {
msg.role = "assistant";
msg.content = content;
}
std::vector<json> output;
if (!msg.reasoning_content.empty()) {
output.push_back(json{
{"id", oai_resp_reasoning_id},
{"summary", json::array()},
{"type", "reasoning"},
{"content", json::array({json{
{"text", msg.reasoning_content},
{"type", "reasoning_text"},
}})},
{"encrypted_content", ""},
{"status", "completed"},
});
}
if (!msg.content.empty()) {
output.push_back(json{
{"content", json::array({json{
{"type", "output_text"},
{"annotations", json::array()},
{"logprobs", json::array()},
{"text", msg.content},
}})},
{"id", oai_resp_message_id},
{"role", msg.role},
{"status", "completed"},
{"type", "message"},
});
}
for (const auto& tool_call : oaicompat_msg.tool_calls) {
output.push_back(json{
{"type", "function_call"},
{"status", "completed"},
{"arguments", tool_call.arguments},
{"call_id", "fc_" + tool_call.id},
{"name", tool_call.name},
});
}
std::time_t t = std::time(0);
json res = {
{"completed_at", t},
{"created_at", t},
{"id", oai_resp_id},
{"model", oaicompat_model},
{"object", "response"},
{"output", output},
{"status", "completed"},
{"usage", json{
{"input_tokens", n_prompt_tokens},
{"output_tokens", n_decoded},
{"total_tokens", n_decoded + n_prompt_tokens},
}},
};
return res;
}
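// Closing events for OpenAI Responses API streaming: response.output_item.done
// for the reasoning item, response.output_text.done / response.content_part.done
// plus an output_item.done for the message, an output_item.done per tool call,
// and a final response.completed event that repeats the full output and usage.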
json server_task_result_cmpl_final::to_json_oaicompat_resp_stream() {
std::vector<json> events;
std::vector<json> output;
if (!oaicompat_msg.reasoning_content.empty()) {
const json output_item = json{
{"id", oai_resp_reasoning_id},
{"summary", json::array()},
{"type", "reasoning"},
{"content", json::array({json{
{"text", oaicompat_msg.reasoning_content},
{"type", "reasoning_text"},
}})},
{"encrypted_content", ""},
};
events.push_back(json{
{"event", "response.output_item.done"},
{"data", json{
{"type", "response.output_item.done"},
{"item", output_item},
}},
});
output.push_back(output_item);
}
if (!oaicompat_msg.content.empty()) {
events.push_back(json{
{"event", "response.output_text.done"},
{"data", json{
{"type", "response.output_text.done"},
{"item_id", oai_resp_message_id},
{"text", oaicompat_msg.content},
}},
});
const json content_part = {
{"type", "output_text"},
{"annotations", json::array()},
{"logprobs", json::array()},
{"text", oaicompat_msg.content},
};
events.push_back(json{
{"event", "response.content_part.done"},
{"data", json{
{"type", "response.content_part.done"},
{"item_id", oai_resp_message_id},
{"part", content_part},
}},
});
const json output_item = {
{"type", "message"},
{"status", "completed"},
{"id", oai_resp_message_id},
{"content", json::array({content_part})},
{"role", "assistant"},
};
events.push_back(json{
{"event", "response.output_item.done"},
{"data", json{
{"type", "response.output_item.done"},
{"item", output_item},
}},
});
output.push_back(output_item);
}
for (const auto& tool_call : oaicompat_msg.tool_calls) {
const json output_item = {
{"type", "function_call"},
{"status", "completed"},
{"arguments", tool_call.arguments},
{"call_id", "fc_" + tool_call.id},
{"name", tool_call.name},
};
events.push_back(json{
{"event", "response.output_item.done"},
{"data", json{
{"type", "response.output_item.done"},
{"item", output_item},
}},
});
output.push_back(output_item);
}
std::time_t t = std::time(0);
events.push_back(json{
{"event", "response.completed"},
{"data", json{
{"type", "response.completed"},
{"response", json{
{"id", oai_resp_id},
{"object", "response"},
{"created_at", t},
{"status", "completed"},
{"model", oaicompat_model},
{"output", output},
{"usage", json{
{"input_tokens", n_prompt_tokens},
{"output_tokens", n_decoded},
{"total_tokens", n_decoded + n_prompt_tokens},
}},
}},
}},
});
return events;
}
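// Final Anthropic Messages API object: content blocks in the order thinking,
// text, tool_use (tool arguments are parsed as JSON, falling back to an empty
// object on parse failure). stop_reason maps word/EOS stops to "end_turn" or
// "tool_use" and everything else to "max_tokens".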
json server_task_result_cmpl_final::to_json_anthropic_final() {
std::string stop_reason = "max_tokens";
if (stop == STOP_TYPE_WORD || stop == STOP_TYPE_EOS) {
stop_reason = oaicompat_msg.tool_calls.empty() ? "end_turn" : "tool_use";
}
json content_blocks = json::array();
common_chat_msg msg;
if (!oaicompat_msg.empty()) {
msg = oaicompat_msg;
}
else {
msg.role = "assistant";
msg.content = content;
}
if (!msg.reasoning_content.empty()) {
content_blocks.push_back({
{"type", "thinking"},
{"thinking", msg.reasoning_content},
{"signature", ""}
});
}
if (!msg.content.empty()) {
content_blocks.push_back({
{"type", "text"},
{"text", msg.content}
});
}
for (const auto& tool_call : msg.tool_calls) {
json tool_use_block = {
{"type", "tool_use"},
{"id", tool_call.id},
{"name", tool_call.name}
};
try {
tool_use_block["input"] = json::parse(tool_call.arguments);
}
catch (const std::exception&) {
tool_use_block["input"] = json::object();
}
content_blocks.push_back(tool_use_block);
}
json res = {
{"id", oaicompat_cmpl_id},
{"type", "message"},
{"role", "assistant"},
{"content", content_blocks},
{"model", oaicompat_model},
{"stop_reason", stop_reason},
{"stop_sequence", stopping_word.empty() ? nullptr : json(stopping_word)},
{"usage", {
{"input_tokens", n_prompt_tokens},
{"output_tokens", n_decoded}
}}
};
return res;
}
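// Closing events for Anthropic streaming: flushes any remaining diffs as
// content_block_start / content_block_delta events, closes each started block
// with content_block_stop (block indices: thinking, then text, then one per
// tool call), and ends with message_delta (stop_reason, usage) and message_stop.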
json server_task_result_cmpl_final::to_json_anthropic_stream() {
json events = json::array();
std::string stop_reason = "max_tokens";
if (stop == STOP_TYPE_WORD || stop == STOP_TYPE_EOS) {
stop_reason = oaicompat_msg.tool_calls.empty() ? "end_turn" : "tool_use";
}
bool has_thinking = !oaicompat_msg.reasoning_content.empty();
bool has_text = !oaicompat_msg.content.empty();
size_t num_tool_calls = oaicompat_msg.tool_calls.size();
size_t thinking_block_index = 0;
size_t text_block_index = has_thinking ? 1 : 0;
bool thinking_block_started = false;
bool text_block_started = false;
std::set<size_t> tool_calls_started;
for (const auto& diff : oaicompat_msg_diffs) {
if (!diff.reasoning_content_delta.empty()) {
if (!thinking_block_started) {
events.push_back({
{"event", "content_block_start"},
{"data", {
{"type", "content_block_start"},
{"index", thinking_block_index},
{"content_block", {
{"type", "thinking"},
{"thinking", ""}
}}
}}
});
thinking_block_started = true;
}
events.push_back({
{"event", "content_block_delta"},
{"data", {
{"type", "content_block_delta"},
{"index", thinking_block_index},
{"delta", {
{"type", "thinking_delta"},
{"thinking", diff.reasoning_content_delta}
}}
}}
});
}
if (!diff.content_delta.empty()) {
if (!text_block_started) {
events.push_back({
{"event", "content_block_start"},
{"data", {
{"type", "content_block_start"},
{"index", text_block_index},
{"content_block", {
{"type", "text"},
{"text", ""}
}}
}}
});
text_block_started = true;
}
events.push_back({
{"event", "content_block_delta"},
{"data", {
{"type", "content_block_delta"},
{"index", text_block_index},
{"delta", {
{"type", "text_delta"},
{"text", diff.content_delta}
}}
}}
});
}
if (diff.tool_call_index != std::string::npos) {
size_t content_block_index = (has_thinking ? 1 : 0) + (has_text ? 1 : 0) + diff.tool_call_index;
if (tool_calls_started.find(diff.tool_call_index) == tool_calls_started.end()) {
const auto& full_tool_call = oaicompat_msg.tool_calls[diff.tool_call_index];
events.push_back({
{"event", "content_block_start"},
{"data", {
{"type", "content_block_start"},
{"index", content_block_index},
{"content_block", {
{"type", "tool_use"},
{"id", full_tool_call.id},
{"name", full_tool_call.name}
}}
}}
});
tool_calls_started.insert(diff.tool_call_index);
}
if (!diff.tool_call_delta.arguments.empty()) {
events.push_back({
{"event", "content_block_delta"},
{"data", {
{"type", "content_block_delta"},
{"index", content_block_index},
{"delta", {
{"type", "input_json_delta"},
{"partial_json", diff.tool_call_delta.arguments}
}}
}}
});
}
}
}
if (has_thinking) {
events.push_back({
{"event", "content_block_delta"},
{"data", {
{"type", "content_block_delta"},
{"index", thinking_block_index},
{"delta", {
{"type", "signature_delta"},
{"signature", ""}
}}
}}
});
events.push_back({
{"event", "content_block_stop"},
{"data", {
{"type", "content_block_stop"},
{"index", thinking_block_index}
}}
});
}
if (has_text) {
events.push_back({
{"event", "content_block_stop"},
{"data", {
{"type", "content_block_stop"},
{"index", text_block_index}
}}
});
}
for (size_t i = 0; i < num_tool_calls; i++) {
size_t content_block_index = (has_thinking ? 1 : 0) + (has_text ? 1 : 0) + i;
events.push_back({
{"event", "content_block_stop"},
{"data", {
{"type", "content_block_stop"},
{"index", content_block_index}
}}
});
}
events.push_back({
{"event", "message_delta"},
{"data", {
{"type", "message_delta"},
{"delta", {
{"stop_reason", stop_reason},
{"stop_sequence", stopping_word.empty() ? nullptr : json(stopping_word)}
}},
{"usage", {
{"output_tokens", n_decoded}
}}
}}
});
events.push_back({
{"event", "message_stop"},
{"data", {
{"type", "message_stop"}
}}
});
// extra fields for debugging purposes
if (verbose && !events.empty()) {
events.front()["data"]["__verbose"] = to_json_non_oaicompat_final();
}
// Don't add timings for Anthropic API (breaks spec compliance)
if (oaicompat != OAICOMPAT_TYPE_ANTHROPIC && timings.prompt_n >= 0 && !events.empty()) {
events.back()["data"]["timings"] = timings.to_json();
}
return events;
}
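// Incremental Anthropic streaming events: message_start on the first decoded
// token, then content_block_start / content_block_delta events for thinking,
// text and tool_use deltas as they arrive.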
json server_task_result_cmpl_partial::to_json_anthropic_partial() {
json events = json::array();
bool first = n_decoded == 1;
size_t thinking_block_index = 0;
size_t text_block_index = anthropic_has_reasoning ? 1 : 0;
bool thinking_started = anthropic_thinking_block_started;
bool text_started = anthropic_text_block_started;
if (first) {
events.push_back({
{"event", "message_start"},
{"data", {
{"type", "message_start"},
{"message", {
{"id", oaicompat_cmpl_id},
{"type", "message"},
{"role", "assistant"},
{"content", json::array()},
{"model", oaicompat_model},
{"stop_reason", nullptr},
{"stop_sequence", nullptr},
{"usage", {
{"input_tokens", n_prompt_tokens},
{"output_tokens", 0}
}}
}}
}}
});
}
for (const auto& diff : oaicompat_msg_diffs) {
if (!diff.reasoning_content_delta.empty()) {
if (!thinking_started) {
events.push_back({
{"event", "content_block_start"},
{"data", {
{"type", "content_block_start"},
{"index", thinking_block_index},
{"content_block", {
{"type", "thinking"},
{"thinking", ""}
}}
}}
});
thinking_started = true;
}
events.push_back({
{"event", "content_block_delta"},
{"data", {
{"type", "content_block_delta"},
{"index", thinking_block_index},
{"delta", {
{"type", "thinking_delta"},
{"thinking", diff.reasoning_content_delta}
}}
}}
});
}
if (!diff.content_delta.empty()) {
if (!text_started) {
events.push_back({
{"event", "content_block_start"},
{"data", {
{"type", "content_block_start"},
{"index", text_block_index},
{"content_block", {
{"type", "text"},
{"text", ""}
}}
}}
});
text_started = true;
}
events.push_back({
{"event", "content_block_delta"},
{"data", {
{"type", "content_block_delta"},
{"index", text_block_index},
{"delta", {
{"type", "text_delta"},
{"text", diff.content_delta}
}}
}}
});
}
if (diff.tool_call_index != std::string::npos) {
size_t content_block_index = (anthropic_has_reasoning ? 1 : 0) + (text_started ? 1 : 0) + diff.tool_call_index;
if (!diff.tool_call_delta.name.empty()) {
events.push_back({
{"event", "content_block_start"},
{"data", {
{"type", "content_block_start"},
{"index", content_block_index},
{"content_block", {
{"type", "tool_use"},
{"id", diff.tool_call_delta.id},
{"name", diff.tool_call_delta.name}
}}
}}
});
}
if (!diff.tool_call_delta.arguments.empty()) {
events.push_back({
{"event", "content_block_delta"},
{"data", {
{"type", "content_block_delta"},
{"index", content_block_index},
{"delta", {
{"type", "input_json_delta"},
{"partial_json", diff.tool_call_delta.arguments}
}}
}}
});
}
}
}
if (verbose && !events.empty() && first) {
events.front()["data"]["__verbose"] = to_json_non_oaicompat_partial();
}
if (timings.prompt_n >= 0 && !events.empty()) {
events.back()["data"]["timings"] = timings.to_json();
}
//if (is_progress && !events.empty()) {
// events.back()["data"]["prompt_progress"] = progress.to_json();
//}
return events;
}
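// Prompt cache helpers: server_prompt holds a tokenized prompt plus its saved
// llama sequence state (and optional checkpoints); server_prompt_cache keeps
// previously saved prompts and tries to reuse the one most similar to an
// incoming request.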
size_t server_prompt::size() const {
size_t res = data.size();
for (const auto& checkpoint : checkpoints) {
res += checkpoint.size();
}
return res;
}
size_t server_prompt_cache::size() const {
size_t res = 0;
for (const auto& state : states) {
res += state.size();
}
return res;
}
size_t server_prompt_cache::n_tokens() const {
size_t res = 0;
for (const auto& state : states) {
res += state.n_tokens();
}
return res;
}
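// Look for a cached prompt that matches tokens_new better than the current slot
// prompt, comparing common prefix and token similarity (optionally ignoring
// thinking tokens). If one is found, its saved sequence state is restored into
// the slot via llama_state_seq_set_data and the entry is moved out of the cache.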
bool server_prompt_cache::load(server_prompt& prompt, const server_tokens& tokens_new, llama_context* ctx, int32_t id_slot) {
thinking_tokens think_tokens;
for (auto it = states.begin(); it != states.end(); ++it) {
think_tokens = it->think_tokens;
break;
}
server_tokens prompt_tokens;
server_tokens tokens_new_ex;
if (think_tokens.exclude) {
prompt_tokens = server_tokens(prompt.tokens.get_text_tokens_exclude_think(ctx, think_tokens), false);
tokens_new_ex = server_tokens(tokens_new.get_text_tokens_exclude_think(ctx, think_tokens), false);
}
else {
prompt_tokens = std::move(prompt.tokens); //server_tokens(prompt.tokens.get_text_tokens(), false);
tokens_new_ex = server_tokens(tokens_new.get_text_tokens(), false);
}
const auto lcp_best = prompt_tokens.get_common_prefix(ctx, tokens_new_ex);
float f_keep_best = float(lcp_best.second) / prompt_tokens.size();
float sim_best = prompt_tokens.get_tokens_similarity(ctx, tokens_new_ex, prompt.n_kept_prompt, prompt.n_discarded_prompt);
LLAMA_LOG_INFO(" - looking for better prompt, base f_keep = %.3f, sim = %.3f, n_keep = %d, n_discarded_prompt = %d\n", f_keep_best, sim_best, prompt.n_kept_prompt, prompt.n_discarded_prompt);
auto it_best = states.end();
// find the most similar cached prompt, that would also preserve the most context
for (auto it = states.begin(); it != states.end(); ++it) {
server_tokens tokens;
if (think_tokens.exclude) {
tokens = server_tokens(it->tokens.get_text_tokens_exclude_think(ctx, think_tokens), false);
}
else {
tokens = std::move(it->tokens);
}
const auto lcp_cur = tokens.get_common_prefix(ctx, tokens_new_ex);
const float f_keep_cur = float(lcp_cur.first) / tokens.size();
const float sim_cur = tokens.get_tokens_similarity(ctx, tokens_new_ex, it->n_kept_prompt, it->n_discarded_prompt);
if (sim_best < sim_cur) {
f_keep_best = f_keep_cur;
sim_best = sim_cur;
it_best = it;
}
}
if (it_best != states.end()) {
LLAMA_LOG_INFO(" - found better prompt with f_keep = %.3f, sim = %.3f, n_keep = %d, n_discarded_prompt = %d\n", f_keep_best, sim_best, it_best->n_kept_prompt, it_best->n_discarded_prompt);
const size_t size = it_best->data.size();
const size_t n = llama_state_seq_set_data(ctx, it_best->data.data(), size, id_slot, 0);
if (n != size) {
LLAMA_LOG_INFO("failed to restore state with size %zu\n", size);
return false;
}
it_best->data.clear();
it_best->data.shrink_to_fit();
prompt = std::move(*it_best);
states.erase(it_best);
}
return true;
}
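// Reserve a cache entry for the current prompt: skip if an existing entry
// already covers it, evict entries that are fully contained in it, then try to
// allocate the state buffer. On std::bad_alloc the size limit is reduced to
// roughly 40% of the current cache size and update() trims the cache instead.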
server_prompt* server_prompt_cache::alloc(const server_prompt& prompt, size_t state_size) {
for (auto it = states.begin(); it != states.end();) {
auto tokens_ctx_shift = server_tokens(prompt.tokens.get_text_tokens(), false); // copy cache tokens
tokens_ctx_shift.discard_n_tokens(prompt.n_kept_prompt, prompt.n_discarded_prompt);
auto prefix = it->tokens.get_common_prefix(ctx, tokens_ctx_shift);
const size_t len = prefix.first;
const size_t len_prompt = prefix.second;
// first check if the current state is contained fully in the cache
if (len_prompt == tokens_ctx_shift.size()) {
LLAMA_LOG_INFO("%s", " - prompt is already in the cache, skipping\n");
return nullptr;
}
// next, remove any cached prompts that are fully contained in the current prompt
else if (len == it->tokens.size()) {
LLAMA_LOG_INFO(" - removing obsolete cached prompt with length %d\n", (int)len);
it = states.erase(it);
}
else {
++it;
}
}
std::vector<uint8_t> state_data;
// check if we can allocate enough memory for the new state
try {
state_data.resize(state_size);
}
catch (const std::bad_alloc& e) {
LLAMA_LOG_INFO("failed to allocate memory for prompt cache state: %s\n", e.what());
limit_size = std::max<size_t>(1, 0.4 * size());
LLAMA_LOG_INFO(" - cache size limit reduced to %.3f MiB\n", limit_size / (1024.0 * 1024.0));
update();
return nullptr;
}
// TODO: for some reason we can't copy server_tokens, so we have to do this workaround
auto& cur = states.emplace_back();
cur = {
/*.tokens =*/ server_tokens(prompt.tokens.get_text_tokens(), false),
/*.n_kept_prompt =*/ prompt.n_kept_prompt,
/*.n_discarded_prompt =*/ prompt.n_discarded_prompt,
/*.think_tokens =*/ prompt.think_tokens,
/*.data =*/ std::move(state_data),
/*.checkpoints =*/ prompt.checkpoints,
};
return &cur;
}
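// Enforce the cache limits: while over limit_size, evict the oldest entry
// (always keeping at least one), then log the cache contents and the effective
// token limit derived from the average state size per cached token.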
void server_prompt_cache::update() {
if (limit_size > 0) {
// always keep at least one state, regardless of the limits
while (states.size() > 1 && size() > limit_size) {
if (states.empty()) {
break;
}
LLAMA_LOG_INFO(" - cache size limit reached, removing oldest entry (size = %.3f MiB)\n", states.front().size() / (1024.0 * 1024.0));
states.pop_front();
}
}
// average size per token
const float size_per_token = std::max<float>(1.0f, float(size()) / (std::max<size_t>(1, n_tokens())));
// dynamically increase the token limit if it can fit in the memory limit
const size_t limit_tokens_cur = limit_size > 0 ? std::max<size_t>(limit_tokens, limit_size / size_per_token) : limit_tokens;
LLAMA_LOG_INFO(" - cache state: %zu prompts, %.3f MiB (limits: %.3f MiB, %zu tokens, %zu est)\n",
states.size(), size() / (1024.0 * 1024.0), limit_size / (1024.0 * 1024.0), limit_tokens, limit_tokens_cur);
for (const auto& state : states) {
LLAMA_LOG_INFO(" - prompt %p: %7d tokens, %7d discarded, checkpoints: %2zu, %9.3f MiB\n",
(const void*)&state, state.n_tokens(), state.n_discarded_prompt, state.checkpoints.size(), state.size() / (1024.0 * 1024.0));
}
}