ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-01-30 11:09:51 +00:00

Author	SHA1	Message	Date
firecoperana	1cad1ec1cc	Update grammar (#1023 ) * grammar : fix JSON Schema for string regex with top-level alt. (#9903) Prior to this commit, using a JSON Schema containing a string with `pattern` regular expression that uses top-level alternation (e.g. `"pattern": "^A\|B\|C\|D$"`) would result in invalid JSON output from the constrained sampling grammar, because it ended up creating a grammar rule like this for the string: ``` thing ::= "\"" "A" \| "B" \| "C" \| "D" "\"" space ``` Note that this rule will only match a starting quote for the "A" case, and will only match an ending quote for the "D" case, so this rule will always produce invalid JSON when used for sampling (that is, the JSON will always be lacking the starting quote, the ending quote, or both). This was fixed in a simple way by adding parentheses to the generated rule (for all string pattern rules, to keep it simple), such that the new generated rule looks like this (correct): ``` thing ::= "\"" ("A" \| "B" \| "C" \| "D") "\"" space ``` * grammars : add English-only grammar (#10612) * grammar : handle maxItems == 0 in JSON schema (#13117) Co-authored-by: Richard Lyons <frob@cloudstaff.com> * grammar-parser : fix possible null-deref (#9004) Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70680 Signed-off-by: David Korczynski <david@adalogics.com> * llama : fix typo in llama-grammar.h [no ci] (#11816) * * server: fix "--grammar-file" parameter (#12285) * common : use std::string_view now that we target c++17 (#14319) * json : support `enum` values within `allOf` (#15830) * grammar : use int64_t to avoid int overflows in int schema to grammar conversion logic (#16626) * grammar : support array references in json schema (#16792) * grammar : support array references in json schema * Update json-schema-to-grammar.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * grammar : improve regex when naming ref derived rules * grammar : replace non-conformant definitions array with anyOf test case --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> # Conflicts: # tests/test-json-schema-to-grammar.cpp * merge fix * llama : minor grammar refactor (#10897) * llama: fix error on bad grammar (#12628) * grammar : fix integer overflow (#17381) * Fix DoS / integer overflow * Remove optional, use INT64_MAX instead as placeholder value (it's technically -1, so it fits :) * White space * Actually, since it's unsigned, use UINT64_MAX # Conflicts: # src/llama-grammar.cpp * grammar: fix regression caused by #17381 (#17412) * grammar: fix regression caused by #17381 * more readable # Conflicts: # src/llama-grammar.cpp * Merge Fix * Fix warnings --------- Signed-off-by: David Korczynski <david@adalogics.com> Co-authored-by: Joe Eli McIlvain <joe.eli.mac@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: frob <rick+github@frob.com.au> Co-authored-by: Richard Lyons <frob@cloudstaff.com> Co-authored-by: DavidKorczynski <david@adalogics.com> Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com> Co-authored-by: firecoperana <firecoperana> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Aldehir Rojas <hello@alde.dev> Co-authored-by: Olivier Chafik <olivier.chafik@gmail.com> Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com> Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-30 18:45:38 +01:00
hksdpc255	d24ea9e48e	server : add Anthropic Messages API support (#1012 )	2025-11-29 07:24:59 +01:00
firecoperana	0b126b2ca6	Fix prompt tokenization issue during prompt processing (#1008 ) * Find common tokens between prompt and cache Fix wrong context size usage for mtmd Use start position of common part server: handle context shift * Add size check for inexact match * Change --------- Co-authored-by: firecoperana <firecoperana>	2025-11-26 10:34:26 +01:00
Yap Sok Ann	7505165dee	Fix truncated logprobs when streaming is off (#998 ) The logic to skip the logprobs of the stop token was originally from ggml-org/llama.cpp#2849, and was later modified as part of ggml-org/llama.cpp#10643 to be applied only to STOP_TYPE_WORD. The latter change wasn't included in #723. Then, after #958 got merged, the logic got inadvertently applied to GLM-4.5/4.6 and Kimi K2, resulting in truncated logprobs when streaming is off. This commit reverts the logic from ggml-org/llama.cpp#2849, such that the logprobs of the stop token will always be included in the response, when logprobs is enabled. From testing, this matches with the behavior of Fireworks inference server, for both chat completions and text completions endpoints. Also fix logprobs param handling for the text completion endpoint.	2025-11-24 06:52:15 +01:00
firecoperana	bacb8fb79f	Server: Handle context shift better to reduce prompt processing time (#973 ) * Handle context shift better to reduce pp Add context-shift args Add back ga_n in context shift * optimize discard function and bring back n_keep = -1 --------- Co-authored-by: firecoperana <firecoperana>	2025-11-19 16:04:48 +01:00
firecoperana	bb358223cd	server: cache prompt to host memory (#954 ) * server : host-memory prompt caching change similarity calculation and prompt save conditions Remove unneeded token limit rename variable Separate prompt save and load logic change default values change log remove truncate prompt logic * add description * bug fixes * remove token limit in init --------- Co-authored-by: firecoperana <firecoperana>	2025-11-14 18:40:13 +02:00
firecoperana	b63309a918	Fix embedding missing, CORS and crash using verbose in server (#924 ) * server: fix crash when prompt has image and is too long * server: fix CORS * server: fix empty result for embedding * change error message to truncate prompt * server: fix slot id for save and load state * bug fix * server: update slot similarity to handle mtmd * server: quick hack to calculate number of token processed with image * server: fix out of range error when detokenizing prompt under verbose * Add back Access-Control-Allow-Origin * Server: Add prompt tokens in embedding results --------- Co-authored-by: firecoperana <firecoperana>	2025-11-09 14:16:03 +02:00
Kawrakow	66ef68bc14	Fix compiler warning	2025-11-06 07:12:07 +02:00
firecoperana	18f5a6caef	Bug fixes for completions and prompt caching in server (#906 ) * Bug fixes for completions and prompt caching in server * Fix compiler warning about redefinition --------- Co-authored-by: firecoperana <firecoperana>	2025-11-06 07:10:51 +02:00
firecoperana	7978f04996	Add vision support in llama-server (#901 ) * server: add support for vision model webui: add support for vision model * server : remove hack for extra parallel slot#10187 * llama : fix KV shift for qwen2vl #13870 * add no-context-shift parameter --------- Co-authored-by: firecoperana <firecoperana>	2025-11-05 10:43:46 +02:00
jarrodfeaks	92517e74ad	fix v1/chat/completions assistant prefill (#874 )	2025-10-29 17:21:05 +02:00
Kawrakow	e2f21c8dc8	Move minja and nlohmann/json to vendor (#802 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-27 09:12:35 +02:00
Yap Sok Ann	4f9b0ec4f0	Fix logprobs (#787 ) This commit is mostly a cherry-pick of ggml-org/llama.cpp#10783, plus optimization to do partial sort when sorting the logits. That mainline PR and friends were partially cherry-picked by #723, but wasn't really in a working state yet. A couple of additional changes: * Include timing information in response, which was (unintentionally?) done in mainline since ggml-org/llama.cpp#10643. * Also return the actual logprobs for accepted draft tokens. This is still a TODO in mainline [1]. Note that there is a TG performance penalty to return the logprobs, as we need to sort the logits. By doing partial sort, the penalty is quite small. Here are some numbers I got using the same prompt: This PR with partial sort: * no draft, no logprobs: 12.87 tok/s * no draft, with logprobs: 12.61 tok/s (2.0% drop) * with draft, no logprobs: 36.74 tok/s * with draft, with logprobs: 36.12 tok/s (1.7% drop) If cherry-pick the full sort from mainline PR: * no draft, no logprobs: 12.81 tok/s * no draft, with logprobs: 12.02 tok/s (6.2% drop) * with draft, no logprobs: 36.59 tok/s * with draft, with logprobs: 29.08 tok/s (20.5% drop) [1] https://github.com/ggml-org/llama.cpp/blob/b6548/tools/server/server.cpp#L4019 Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-25 15:43:30 +02:00
firecoperana	a6da22beb2	Deepseek V3.1 native tool calling support (OpenAI Style) (#771 )	2025-09-13 07:51:40 +02:00
firecoperana	d7882c3cf8	Tool calls support from mainline (#723 ) * Tool calls support from mainline * update cmake * revert api for /completions * Fix broken thinking process for gpt-oss * add missing args and fix webui bugs * add missing args and fix webui bugs2 * Fix reasoning format error * add usage * change default post_sampling_probs to true * add back generated_text * Remove server endpoints tests * add log * Chat fixes * Remove logs * webui: revert extra handling of thinking process --------- Co-authored-by: firecoperana <firecoperana> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-01 08:38:49 +03:00
firecoperana	21ced1e3c1	Fix completions endpoint (#684 ) Co-authored-by: firecoperana <firecoperana>	2025-08-11 09:43:20 +03:00
firecoperana	ff024df079	add jinja template support (#677 ) Co-authored-by: firecoperana <firecoperana>	2025-08-09 12:50:30 +00:00
Anton Sokolchenko	9ee72225dc	Function calling support for Kimi-K2 (#628 ) * Implement function calling / tools for ik_llama.cpp for Kimi K2 * Implement basic tool choice * Backport llama.cpp tool calls support * Enhance function calls with improved chat parser and string utilities - Add new chat.h/chat.cpp and chat-parser.h/chat-parser.cpp for better chat handling - Improve function calls parsing with fallback to llama.cpp builder pattern - Add string utility functions (starts_with, ends_with, find_partial_stop) - Update README with function calls testing instructions - Enhance Kimi K2 parser and function calls documentation - Add comprehensive test suite for function calls - Update CMakeLists.txt and Makefile for new components * Enhance function calling with unified streaming and parser improvements - Fix streaming content cleanup to prevent function syntax in output - Unify content extraction patterns with llama.cpp approach - Improve Kimi K2 parser robustness and partial content handling - Add comprehensive test coverage for function call scenarios - Optimize chat message parsing and diff computation * Replace hardcoded values in kimi_k2_parser.hpp with named constants - Add compile-time constants for all token format markers - Add compile-time constants for XML format markers - Add compile-time constants for simple format patterns - Replace all hardcoded string literals with named constants - Use compile-time length calculation to avoid manual counting - Improve maintainability and reduce magic numbers throughout parser * Fix duplicate common_chat_parse definition - Remove duplicate implementation from chat-parser.cpp - Keep single implementation in chat.cpp following llama.cpp patterns - Resolves linker error: multiple definition of common_chat_parse * Fix JSON assertion failure in function call parsing - Add proper validation that 'function' field is an object before accessing nested keys - Handle missing 'arguments' field gracefully with default "{}" - Prevents crash when parsing malformed tool call JSON structures * Add comprehensive Qwen3 XML tool calling support with unit tests - Implement Qwen3 XML parser with <tool_call>{"name": "func", "arguments": {...}}</tool_call> format - Add model detection and routing for Qwen3 vs Kimi-K2 formats - Create 8 comprehensive unit tests covering parsing, streaming, error handling - Fix token format cleaning bug in kimi_k2_parser.hpp processing order - Remove progressive parsing code and related utilities - Add tool injection support for Qwen3 format in server utils * Add DeepSeek R1 function calling support with comprehensive unit tests - Implement complete DeepSeek R1 tool call parsing in common_chat_parser.cpp - Add DeepSeek R1 model detection and tool injection in deepseek_r1_tools.hpp - Update function_calls.hpp with DeepSeek R1 integration and content extraction - Update documentation to reflect support for Kimi-K2, Qwen3, and DeepSeek R1 models - Add comprehensive unit tests for DeepSeek R1 reasoning, tool calls, and integration - Port exact implementation patterns from original llama.cpp for compatibility Key features: - Native DeepSeek R1 format: <｜tool▁calls▁begin｜>function<｜tool▁sep｜>name```json{}```<｜tool▁call▁end｜><｜tool▁calls▁end｜> - Reasoning content extraction from <think>...</think> tags - Multiple tool calls support with separate call blocks - Model detection for deepseek-r1, deepseek_r1 naming patterns - Integration with incremental parsing and streaming support * Add partial parsing support for JSON and regex - json-partial.h/cpp: JSON partial parsing functionality - regex-partial.h/cpp: Regex partial parsing functionality * Add format_chat integration tests for Qwen3 tool injection - Add test_qwen3_format_chat_integration() to validate tool injection pipeline - Test tool injection conditions and system message enhancement - Verify JSON formatting and anti-preamble instructions - Add comprehensive test documentation Tests confirm tool injection works correctly - conversational preamble issue is not in ik_llama.cpp but likely in UI configuration. * Fix Qwen3 tool call parsing - pass model name to parser Server was not passing model name to parse_chat_message_incremental(), causing Qwen3 to fall back to Kimi-K2 parser and return tool calls as content instead of proper tool_calls array. * Fix non-streaming path to use model-specific parsing Non-streaming responses were hardcoded to use Kimi-K2 format, causing Qwen3 XML tool calls to be returned as content instead of proper tool_calls array. Now uses same model detection as streaming path for consistency.	2025-07-23 18:11:42 +02:00
firecoperana	d1ae1504c6	Fix non rpc build error (#506 ) * Add RPC backend in device list to override tensors. * rpc : prevent crashes on invalid input (#9040) Add more checks which prevent RPC server from crashing if invalid input is received from client # Conflicts: # ggml/src/ggml-rpc.cpp * rpc : print error message when failed to connect endpoint (#9042) * Fix RPC error * Add vulkan, sycl to rpc backend * add thread in rpc cpu backend * add cache folder and other improvement in rpc * add header file * support for models with non-512 aligned tensors * rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943) RPC_CMD_SET_TENSOR always returns an empty response and we send this 4 times per token. We can improve TG speed if we don't wait for this empty response. The performance impact of this change depends on the network latency. # Conflicts: # ggml/src/ggml-rpc.cpp * fix(rpc): Improve input validation and error handling (#13069) * fix(rpc): Improve input validation and error handling The `rpc-server` was vulnerable to Denial of Service attacks via several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed messages could trigger failed assertions (e.g., invalid `ggml_type`) or out-of-bounds reads/writes leading to `GGML_ABORT` calls, crashing the server process. This PR introduces robust input validation and replaces `abort()` calls with graceful error handling: - Type Validation: `deserialize_tensor` now checks if the `tensor->type` is within the valid `GGML_TYPE_COUNT` range before calling `ggml_new_tensor_4d`. Returns `nullptr` on invalid type. - Bounds Checks: Replaced `GGML_ABORT` in `set_tensor`, `set_tensor_hash`, and `get_tensor` handlers with error logging and returning `false` when data/offset parameters are out of buffer bounds. - Size Checks: Added safe arithmetic checks (for overflow) in `graph_compute` when calculating required message sizes based on client-provided `n_nodes` and `n_tensors`. Returns early if the reported sizes conflict with the actual message size or would lead to overflow. - Error Propagation: - `create_node` now checks for `nullptr` return values from `deserialize_tensor` and its recursive calls, propagating `nullptr` upwards on failure. Uses `find` instead of `at` for safer map access. - `copy_tensor` now checks for `nullptr` from `deserialize_tensor` and sets the response status to failure if deserialization or bounds checks fail. - `graph_compute` now checks for `nullptr` return from `create_node` and returns failure status correctly. The final return value now reflects the actual computation status. These changes improve the RPC server's resilience against malformed client requests, preventing crashes and ensuring errors are handled more gracefully. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): address pr comments removed comments and unnecessary returns Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): ambiguous nullptr from create_node rpc_server::create_node could previously return nullptr if the input ID was 0 (valid) or if an internal error (deserialization, recursion failure) occurred (invalid). This ambiguity made error handling difficult for the caller (`graph_compute`). This commit clarifies the meaning of nullptr: - `graph_compute` now checks if the input 'id' was non-zero when `create_node` returns nullptr, correctly identifying failures versus intentional null links. - `create_node` avoids recursive calls for zero IDs and propagates nullptr unambiguously on failure during recursion. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): initial zero check in create_node The caller (`graph_compute`) already checks `id != 0` when handling a `nullptr` return from `create_node`, correctly distinguishing intentional null links from actual errors. This makes the initial `if (id == 0)` check redundant. Also removes the log message when a tensor ID is not found in the provided map which was added in this branch. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * fix(rpc): Handle get_alloc_size failure in server Check the return value of `server.get_alloc_size` in the RPC server loop. If the call fails, return early to close the connection. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): input size validation in graph_compute Removes detailed, step-by-step size calculations and overflow checks in favor of simpler direct comparisons, assuming 64-bit overflow is unlikely. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): remove extra status code setting Removes the explicit setting of `response.result = GGML_STATUS_FAILED` when `create_node` returns `nullptr` within `graph_compute`. Primary signal is the `false` return value in case of failure. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): remove redundant check for tensor->type Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus the check is not needed. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> --------- Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> # Conflicts: # ggml/src/ggml-rpc.cpp * rpc : fix cache directory initialization (#13188) Signed-off-by: xiaofei <hbuxiaofei@gmail.com> # Conflicts: # examples/rpc/rpc-server.cpp * rpc : avoid uninitialized memory in serialize_tensor (#13210) Zero out the name and padding buffers. * fix merge error * Add hello command in RPC * bug fix * add rpc header * fix bug for missing rpc names * add tpc no delay for rpc * add back webui * fix rpc function not found error --------- Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> Signed-off-by: xiaofei <hbuxiaofei@gmail.com> Co-authored-by: firecoperana <firecoperana> Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com> Co-authored-by: matt23456 <matt23456> Co-authored-by: Ville Vesilehto <ville@vesilehto.fi> Co-authored-by: xiaofei <hbuxiaofei@gmail.com> Co-authored-by: Justin Santa Barbara <justinsb@google.com>	2025-06-08 17:27:00 +03:00
Kawrakow	7fc39ae7e8	Revert "Rpc improvement (#480 )" This reverts commit `8a5f8573ae`.	2025-06-08 14:49:50 +03:00
firecoperana	ed9e8ecc9b	Rpc improvement (#480 ) * Add RPC backend in device list to override tensors. * rpc : prevent crashes on invalid input (#9040) Add more checks which prevent RPC server from crashing if invalid input is received from client # Conflicts: # ggml/src/ggml-rpc.cpp * rpc : print error message when failed to connect endpoint (#9042) * Fix RPC error * Add vulkan, sycl to rpc backend * add thread in rpc cpu backend * add cache folder and other improvement in rpc * add header file * support for models with non-512 aligned tensors * rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943) RPC_CMD_SET_TENSOR always returns an empty response and we send this 4 times per token. We can improve TG speed if we don't wait for this empty response. The performance impact of this change depends on the network latency. # Conflicts: # ggml/src/ggml-rpc.cpp * fix(rpc): Improve input validation and error handling (#13069) * fix(rpc): Improve input validation and error handling The `rpc-server` was vulnerable to Denial of Service attacks via several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed messages could trigger failed assertions (e.g., invalid `ggml_type`) or out-of-bounds reads/writes leading to `GGML_ABORT` calls, crashing the server process. This PR introduces robust input validation and replaces `abort()` calls with graceful error handling: - Type Validation: `deserialize_tensor` now checks if the `tensor->type` is within the valid `GGML_TYPE_COUNT` range before calling `ggml_new_tensor_4d`. Returns `nullptr` on invalid type. - Bounds Checks: Replaced `GGML_ABORT` in `set_tensor`, `set_tensor_hash`, and `get_tensor` handlers with error logging and returning `false` when data/offset parameters are out of buffer bounds. - Size Checks: Added safe arithmetic checks (for overflow) in `graph_compute` when calculating required message sizes based on client-provided `n_nodes` and `n_tensors`. Returns early if the reported sizes conflict with the actual message size or would lead to overflow. - Error Propagation: - `create_node` now checks for `nullptr` return values from `deserialize_tensor` and its recursive calls, propagating `nullptr` upwards on failure. Uses `find` instead of `at` for safer map access. - `copy_tensor` now checks for `nullptr` from `deserialize_tensor` and sets the response status to failure if deserialization or bounds checks fail. - `graph_compute` now checks for `nullptr` return from `create_node` and returns failure status correctly. The final return value now reflects the actual computation status. These changes improve the RPC server's resilience against malformed client requests, preventing crashes and ensuring errors are handled more gracefully. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): address pr comments removed comments and unnecessary returns Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): ambiguous nullptr from create_node rpc_server::create_node could previously return nullptr if the input ID was 0 (valid) or if an internal error (deserialization, recursion failure) occurred (invalid). This ambiguity made error handling difficult for the caller (`graph_compute`). This commit clarifies the meaning of nullptr: - `graph_compute` now checks if the input 'id' was non-zero when `create_node` returns nullptr, correctly identifying failures versus intentional null links. - `create_node` avoids recursive calls for zero IDs and propagates nullptr unambiguously on failure during recursion. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): initial zero check in create_node The caller (`graph_compute`) already checks `id != 0` when handling a `nullptr` return from `create_node`, correctly distinguishing intentional null links from actual errors. This makes the initial `if (id == 0)` check redundant. Also removes the log message when a tensor ID is not found in the provided map which was added in this branch. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * fix(rpc): Handle get_alloc_size failure in server Check the return value of `server.get_alloc_size` in the RPC server loop. If the call fails, return early to close the connection. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): input size validation in graph_compute Removes detailed, step-by-step size calculations and overflow checks in favor of simpler direct comparisons, assuming 64-bit overflow is unlikely. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): remove extra status code setting Removes the explicit setting of `response.result = GGML_STATUS_FAILED` when `create_node` returns `nullptr` within `graph_compute`. Primary signal is the `false` return value in case of failure. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): remove redundant check for tensor->type Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus the check is not needed. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> --------- Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> # Conflicts: # ggml/src/ggml-rpc.cpp * rpc : fix cache directory initialization (#13188) Signed-off-by: xiaofei <hbuxiaofei@gmail.com> # Conflicts: # examples/rpc/rpc-server.cpp * rpc : avoid uninitialized memory in serialize_tensor (#13210) Zero out the name and padding buffers. * fix merge error * Add hello command in RPC * bug fix * add rpc header * fix bug for missing rpc names * add tpc no delay for rpc * add back webui --------- Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> Signed-off-by: xiaofei <hbuxiaofei@gmail.com> Co-authored-by: firecoperana <firecoperana> Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com> Co-authored-by: matt23456 <matt23456> Co-authored-by: Ville Vesilehto <ville@vesilehto.fi> Co-authored-by: xiaofei <hbuxiaofei@gmail.com> Co-authored-by: Justin Santa Barbara <justinsb@google.com>	2025-06-08 14:43:21 +03:00
firecoperana	6cbd73c60e	Webui improvement (#481 ) * update webui * add token/s in webui * add webui files * fix webui first message disappear in some browser * add missing html files --------- Co-authored-by: firecoperana <firecoperana>	2025-06-08 14:38:47 +03:00
Kawrakow	1a4cfbcc53	Merge mainline - Aug 12 2024 (#17 ) * Merge mainline * Fix after merge * Remove CI check --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-08-12 15:14:32 +02:00
Kawrakow	0ceeb11721	Merge mainline llama.cpp (#3 ) * Merging mainline - WIP * Merging mainline - WIP AVX2 and CUDA appear to work. CUDA performance seems slightly (~1-2%) lower as it is so often the case with llama.cpp/ggml after some "improvements" have been made. * Merging mainline - fix Metal * Remove check --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-07-27 07:55:01 +02:00
sasha0552	66217bbac6	server : smart slot selection using Longest Common Prefix (#7728 ) * server : Smart selection of available slot using Longest Common Substring * add usage * remove trailing whitespaces * Use Longest Common Prefix (LCP) instead of LCS * Rename argument	2024-06-08 10:50:31 +03:00
Georgi Gerganov	8822dcce8d	common : refactor cli arg parsing (#7675 ) * common : gpt_params_parse do not print usage * common : rework usage print (wip) * common : valign * common : rework print_usage * infill : remove cfg support * common : reorder args * server : deduplicate parameters ggml-ci * common : add missing header ggml-ci * common : remote --random-prompt usages ggml-ci * examples : migrate to gpt_params ggml-ci * batched-bench : migrate to gpt_params * retrieval : migrate to gpt_params * common : change defaults for escape and n_ctx * common : remove chatml and instruct params ggml-ci * common : passkey use gpt_params	2024-06-04 21:23:39 +03:00
Benjamin Findley	9b9b4eb946	change default temperature of OAI compat API from 0 to 1 (#7226 ) * change default temperature of OAI compat API from 0 to 1 * make tests explicitly send temperature to OAI API	2024-05-13 12:40:08 +10:00
Johannes Gäßler	4e708b6d16	JSON: [key] -> .at(key), assert() -> GGML_ASSERT (#7143 )	2024-05-08 21:53:08 +02:00
Xuan Son Nguyen	777e128db5	clean up json_value & server_log (#7142 )	2024-05-08 13:24:14 +02:00
Pedro Cuenca	a512902ca8	llama : support Llama 3 HF conversion (#6745 ) * Support Llama 3 conversion The tokenizer is BPE. * style * Accept suggestion Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com> * llama : add llama_token_is_eog() ggml-ci * llama : auto-detect more EOT tokens when missing in KV data * convert : replacing EOS token is a hack * llama : fix codegemma EOT token + add TODOs * llama : fix model type string for 8B model --------- Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-21 14:50:41 +03:00
Pierrick Hymbert	73e01781d4	ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response (#6495 ) * ci: bench: support sse and fix prompt processing time server: add tokens usage in stream mode * ci: bench: README.md EOL * ci: bench: remove total pp and tg as it is not accurate * ci: bench: fix case when there is no token generated * ci: bench: change to the 95 percentile for pp and tg as it is closer to what the server exports in metrics * ci: bench: fix finish reason rate	2024-04-06 05:40:47 +02:00
JH23X	0f4e3af782	server : handle exception on wrong type in request (#6452 ) Co-authored-by: Jonas Holzner <jonas.holzner.external@hensoldt.net>	2024-04-03 21:09:52 +03:00
Xuan Son Nguyen	242d1c1e90	Server: clean up OAI params parsing function (#6284 ) * server: clean up oai parsing function * fix response_format * fix empty response_format * minor fixes * add TODO for logprobs * update docs	2024-03-25 09:42:17 +01:00
Pierrick Hymbert	6331d10b51	server: flush stdout after logging in both text and json layout (#6253 )	2024-03-23 13:18:45 +01:00
Olivier Chafik	f41d10fa62	json-schema-to-grammar : fix order of props + non-str const/enum (#6232 ) * json: ordered json in server/schema converter to respect orig order * json: ws nits * json: support non-string const / enums	2024-03-22 15:07:44 +02:00
Olivier Chafik	c5b204162d	json-schema-to-grammar improvements (+ added to server) (#5978 ) * json: fix arrays (disallow `[,1]`) * json: support tuple types (`[number, string]`) * json: support additionalProperties (`{[k: string]: [string,number][]}`) * json: support required / optional properties * json: add support for pattern * json: resolve $ref (and support https schema urls) * json: fix $ref resolution * join: support union types (mostly for nullable types I think) * json: support allOf + nested anyOf * json: support any (`{}` or `{type: object}`) * json: fix merge * json: temp fix for escapes * json: spaces in output and unrestricted output spaces * json: add typings * json:fix typo * Create ts-type-to-grammar.sh * json: fix _format_literal (json.dumps already escapes quotes) * json: merge lit sequences and handle negatives {"type": "string", "pattern": "^({\"question\": \"[^\"]+\", \"response\": \"[^\"]+\"}\\n)+$"} * json: handle pattern repetitions * Update json-schema-to-grammar.mjs * Create regex-to-grammar.py * json: extract repeated regexp patterns to subrule * Update json-schema-to-grammar.py * Update json-schema-to-grammar.py * Update json-schema-to-grammar.py * json: handle schema from pydantic Optional fields * Update json-schema-to-grammar.py * Update json-schema-to-grammar.py * Update ts-type-to-grammar.sh * Update ts-type-to-grammar.sh * json: simplify nullable fields handling * json: accept duplicate identical rules * json: revert space to 1 at most * json: reuse regexp pattern subrules * json: handle uuid string format * json: fix literal escapes * json: add --allow-fetch * json: simplify range escapes * json: support negative ranges in patterns * Delete commit.txt * json: custom regex parser, adds dot support & JS-portable * json: rm trailing spaces * Update json-schema-to-grammar.mjs * json: updated server & chat `( cd examples/server && ./deps.sh )` * json: port fixes from mjs to python * Update ts-type-to-grammar.sh * json: support prefixItems alongside array items * json: add date format + fix uuid * json: add date, time, date-time formats * json: preserve order of props from TS defs * json: port schema converter to C++, wire in ./server * json: nits * Update json-schema-to-grammar.cpp * Update json-schema-to-grammar.cpp * Update json-schema-to-grammar.cpp * json: fix mjs implementation + align outputs * Update json-schema-to-grammar.mjs.hpp * json: test C++, JS & Python versions * json: nits + regen deps * json: cleanup test * json: revert from c++17 to 11 * json: nit fixes * json: dirty include for test * json: fix zig build * json: pass static command to std::system in tests (fixed temp files) * json: fix top-level $refs * json: don't use c++20 designated initializers * nit * json: basic support for reserved names `{number:{number:{root:number}}}` * Revamp test cmake to allow args (WORKING_DIRECTORY needed for JSON test) * json: re-ran server deps.sh * json: simplify test * json: support mix of additional props & required/optional * json: add tests for some expected failures * json: fix type=const in c++, add failure expectations for non-str const&enum * json: test (& simplify output of) empty schema * json: check parsing in test + fix value & string refs * json: add server tests for OAI JSON response_format * json: test/fix top-level anyOf * json: improve grammar parsing failures * json: test/fix additional props corner cases * json: fix string patterns (was missing quotes) * json: ws nit * json: fix json handling in server when there's no response_format * json: catch schema conversion errors in server * json: don't complain about unknown format type in server if unset * json: cleaner build of test * json: create examples/json-schema-pydantic-example.py * json: fix date pattern * json: move json.hpp & json-schema-to-grammar.{cpp,h} to common * json: indent 4 spaces * json: fix naming of top-level c++ function (+ drop unused one) * json: avoid using namespace std * json: fix zig build * Update server.feature * json: iostream -> fprintf * json: space before & refs for consistency * json: nits	2024-03-21 11:50:43 +00:00
Karthick	1d1248fb8c	Server: Handle n_keep parameter in the request (#6174 )	2024-03-20 12:02:34 +01:00
Xuan Son Nguyen	2534910086	Server: Use multi-task for embeddings endpoint (#6001 ) * use multitask for embd endpoint * specify types * remove redundant {"n_predict", 0}	2024-03-13 11:39:11 +01:00
Xuan Son Nguyen	2375a2e6d9	Server: format error to json (#5961 ) * server: format error to json * server: do not crash on grammar error * fix api key test case * revert limit max n_predict * small fix * correct coding style * update completion.js * launch_slot_with_task * update docs * update_slots * update webui * update readme	2024-03-11 10:56:41 +01:00
Minsoo Cheong	1feee1846b	server : maintain chat completion id for streaming responses (#5988 ) * server: maintain chat completion id for streaming responses * Update examples/server/utils.hpp * Update examples/server/utils.hpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-11 10:09:32 +02:00
Georgi Gerganov	8259cb3547	server : refactor (#5882 ) * server : refactoring (wip) * server : remove llava/clip objects from build * server : fix empty prompt handling + all slots idle logic * server : normalize id vars * server : code style * server : simplify model chat template validation * server : code style * server : minor * llama : llama_chat_apply_template support null buf * server : do not process embedding requests when disabled * server : reorganize structs and enums + naming fixes * server : merge oai.hpp in utils.hpp * server : refactor system prompt update at start * server : disable cached prompts with self-extend * server : do not process more than n_batch tokens per iter * server: tests: embeddings use a real embeddings model (#5908) * server, tests : bump batch to fit 1 embedding prompt * server: tests: embeddings fix build type Debug is randomly failing (#5911) * server: tests: embeddings, use different KV Cache size * server: tests: embeddings, fixed prompt do not exceed n_batch, increase embedding timeout, reduce number of concurrent embeddings * server: tests: embeddings, no need to wait for server idle as it can timout * server: refactor: clean up http code (#5912) * server : avoid n_available var ggml-ci * server: refactor: better http codes * server : simplify json parsing + add comment about t_last * server : rename server structs * server : allow to override FQDN in tests ggml-ci * server : add comments --------- Co-authored-by: Pierrick Hymbert <pierrick.hymbert@gmail.com>	2024-03-07 11:41:53 +02:00
Pierrick Hymbert	17a641c829	server: tests: passkey challenge / self-extend with context shift demo (#5832 ) * server: tests: add models endpoint scenario * server: /v1/models add some metadata * server: tests: add debug field in context before scenario * server: tests: download model from HF, add batch size * server: tests: add passkey test * server: tests: add group attention params * server: do not truncate prompt tokens if self-extend through group attention is enabled * server: logs: do not truncate log values * server: tests - passkey - first good working value of nga * server: tests: fix server timeout * server: tests: fix passkey, add doc, fix regex content matching, fix timeout * server: tests: fix regex content matching * server: tests: schedule slow tests on master * server: metrics: fix when no prompt processed * server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1 * server: tests: increase timeout for completion * server: tests: keep only the PHI-2 test * server: tests: passkey add a negative test	2024-03-02 22:00:14 +01:00
Xuan Son Nguyen	f304cb4978	Server: normalize naming (#5779 ) * server: normalize naming * fix spacing	2024-02-29 21:42:11 +01:00
Pierrick Hymbert	6c8868e093	server: logs - unified format and --log-format option (#5700 ) * server: logs - always use JSON logger, add add thread_id in message, log task_id and slot_id * server : skip GH copilot requests from logging * server : change message format of server_log() * server : no need to repeat log in comment * server : log style consistency * server : fix compile warning * server : fix tests regex patterns on M2 Ultra * server: logs: PR feedback on log level * server: logs: allow to choose log format in json or plain text * server: tests: output server logs in text * server: logs switch init logs to server logs macro * server: logs ensure value json value does not raised error * server: logs reduce level VERBOSE to VERB to max 4 chars * server: logs lower case as other log messages * server: logs avoid static in general Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * server: logs PR feedback: change text log format to: LEVEL [function_name] message \| additional=data --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-25 13:50:32 +01:00
Pierrick Hymbert	7f6af033d4	server: concurrency fix + monitoring - add /metrics prometheus compatible endpoint (#5708 ) * server: monitoring - add /metrics prometheus compatible endpoint * server: concurrency issue, when 2 task are waiting for results, only one call thread is notified * server: metrics - move to a dedicated struct	2024-02-25 13:49:43 +01:00
Pierrick Hymbert	bceef5565a	server: health: fix race condition on slots data using tasks queue (#5634 ) * server: health: fix race condition on slots data using tasks queue * server: health: * include_slots only if slots_endpoint * fix compile warning task.target_id not initialized.	2024-02-21 15:47:48 +01:00
Xuan Son Nguyen	8410e862e7	Server: use llama_chat_apply_template (#5593 ) * server: use llama_chat_apply_template * server: remove trailing space * server: fix format_chat * server: fix help message Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * server: fix formatted_chat --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-20 15:58:27 +01:00
Daniel Hiltgen	42c5518ad5	server : graceful server shutdown (#5244 ) This updates the server queue to support graceful shutdown of the server on signals.	2024-02-18 18:23:16 +02:00
Xuan Son Nguyen	71993ba165	server : add llama2 chat template (#5425 ) * server: add mistral chat template * server: fix typo * server: rename template mistral to llama2 * server: format_llama2: remove BOS * server: validate "--chat-template" argument * server: clean up using_chatml variable Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2024-02-11 12:16:22 +02:00
Georgi Gerganov	32e84d74f1	sync : ggml	2024-01-27 17:00:24 +02:00

1 2

51 Commits