ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-04-21 15:09:40 +00:00

Author	SHA1	Message	Date
Kawrakow	18fdd80eaf	Hadamard transforms for K-cache - CPU only (#1033 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-04 06:51:11 +01:00
Kawrakow	a719349982	POC: CUDA tensor parallel (MoE models) (#1022 ) * Remove most of split mode row * WIP * WIP: also allocate the KV cache using tensor split * WIP: it runs with wrong result But it also looks like the backend scheduler is not going to help: * It copies mask and input positions to GPU 0 * => RoPE ops must run on GPU 0 * => To proceed attn evaluation, GPU 1 must wait for GPU 0 to finish its entire attn calculation * Same with FFN. The rms_norm gets scheduled on GPU 0. Hence, GPU 1 must wait for GPU 0 to finish its entire FFN calculation before it can start (as it needs to copy the result of rms_norm from GPU 0) * => Seems useless without writing a bespoke TP scheduling * WIP * This works, but it is slow * This is slightly better the graph is still not being computed in parallel. Why? Because the scheduler creates graph splits where the result of the computation on one GPU becomes an input for the other split. Hence, to trigger the computation on the second GPU one needs to wait for the computation on the first GPU to finish, even though the two can be done in parallel up to the synchronization point. So, all that is left to do is to trick the scheduler to create two splits that can be done in parallel, and then have a graph split where the results get combined. * Playing games with the scheduler This change tricks it into doing the right thing^TM. Still quite a bit slower than split mode layer for the 8B LlaMA model. But for the 70B LlaMA it now beats split mode layer for TG: 28 t/s vs 24.4 t/s. PP is 627 t/s vs 744 t/s. In comparison, split mode "row" in mainline gets 484 t/s PP and 19.3 t/s TG. * Fix attn split Granularity for Wq, Wo is not just head size, but head size * gqa_ratio. Else the Wk, Wv tensors end up not being a multiple of the head size when we divide the split determined by Wo with the gqa_ratio. * Show memory used per device * Make it work with partial offload but no tensor overrides yet, just ngl < num_layers. 
* Allow for f16 source in fused_rms_norm * This results in faster PP. Now PP is faster than split mode layer for L3-70B. * Rename split mode "row" to split mode "graph" * Leave FFN partial results as f16 * WIP GLM4.5 - runs with wrong results * WIP GLM4.5 - this works PP is already better than split mode layer, but TG for zero context is kind of low - 60 vs 92 t/s. TG becomes better than split mode layer at around 20k tokens. PP at 26k tokens is 1.55X of sm layer. * Work around compiler bug It issues a warning that there is an extra semicolon outside of a function, but there isn't. If I remove the anonymous namespace and turn the functions inside into static, the warning disappears, so clearly a compiler bug. * Make graph reuse work with split mode graph * Remove more split mode row remnants * WIP tensor overrides Runs with wrong results, don't see where the issue could be. * This works but is slow Still does not work for row-interleaved quants * Slightly better * Slightly better * Row-interleaved quants work * Better * Minor * Guard against using split mode "graph" for unsupported models * Guards against using merge_qkv with split mode "graph" * WIP split mode attn Works for LlaMA models, but not for GLM-4.5. Doesn't seem to improve performance, so I guess no point in trying to fix it. * Split mode graph for qwen3moe * Try to better distribute the splits --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-01 19:25:40 +01:00
Kawrakow	02b717c8c6	Fix build with RPC not enabled (#1025 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-30 19:04:54 +01:00
firecoperana	15771072c7	RPC: support multiple devices including cpu (#1024 ) * RPC support multiple devices * rpc : update documentation (#16441) Update the README file to match the newly added functionality of exposing multiple devices from a single server. Co-authored-by: Diego Devesa <slarengh@gmail.com> # Conflicts: # examples/rpc/README.md * Remove memory settings * rpc : cache and reuse compute graphs (#15405) Store the last computed graph and reuse it when possible. Also do not return response from GRAPH_COMPUTE and assume it always completes successfully. If this is not the case, the server closes the connection. This saves us a network round trip to the server. * Add -cpu to include cpu backend --------- Co-authored-by: firecoperana <firecoperana> Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>	2025-11-30 18:48:02 +01:00
firecoperana	1cad1ec1cc	Update grammar (#1023 ) * grammar : fix JSON Schema for string regex with top-level alt. (#9903) Prior to this commit, using a JSON Schema containing a string with `pattern` regular expression that uses top-level alternation (e.g. `"pattern": "^A\|B\|C\|D$"`) would result in invalid JSON output from the constrained sampling grammar, because it ended up creating a grammar rule like this for the string: ``` thing ::= "\"" "A" \| "B" \| "C" \| "D" "\"" space ``` Note that this rule will only match a starting quote for the "A" case, and will only match an ending quote for the "D" case, so this rule will always produce invalid JSON when used for sampling (that is, the JSON will always be lacking the starting quote, the ending quote, or both). This was fixed in a simple way by adding parentheses to the generated rule (for all string pattern rules, to keep it simple), such that the new generated rule looks like this (correct): ``` thing ::= "\"" ("A" \| "B" \| "C" \| "D") "\"" space ``` * grammars : add English-only grammar (#10612) * grammar : handle maxItems == 0 in JSON schema (#13117) Co-authored-by: Richard Lyons <frob@cloudstaff.com> * grammar-parser : fix possible null-deref (#9004) Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70680 Signed-off-by: David Korczynski <david@adalogics.com> * llama : fix typo in llama-grammar.h [no ci] (#11816) * * server: fix "--grammar-file" parameter (#12285) * common : use std::string_view now that we target c++17 (#14319) * json : support `enum` values within `allOf` (#15830) * grammar : use int64_t to avoid int overflows in int schema to grammar conversion logic (#16626) * grammar : support array references in json schema (#16792) * grammar : support array references in json schema * Update json-schema-to-grammar.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * grammar : improve regex when naming ref derived rules * grammar : replace non-conformant definitions array with 
anyOf test case --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> # Conflicts: # tests/test-json-schema-to-grammar.cpp * merge fix * llama : minor grammar refactor (#10897) * llama: fix error on bad grammar (#12628) * grammar : fix integer overflow (#17381) * Fix DoS / integer overflow * Remove optional, use INT64_MAX instead as placeholder value (it's technically -1, so it fits :) * White space * Actually, since it's unsigned, use UINT64_MAX # Conflicts: # src/llama-grammar.cpp * grammar: fix regression caused by #17381 (#17412) * grammar: fix regression caused by #17381 * more readable # Conflicts: # src/llama-grammar.cpp * Merge Fix * Fix warnings --------- Signed-off-by: David Korczynski <david@adalogics.com> Co-authored-by: Joe Eli McIlvain <joe.eli.mac@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: frob <rick+github@frob.com.au> Co-authored-by: Richard Lyons <frob@cloudstaff.com> Co-authored-by: DavidKorczynski <david@adalogics.com> Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com> Co-authored-by: firecoperana <firecoperana> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Aldehir Rojas <hello@alde.dev> Co-authored-by: Olivier Chafik <olivier.chafik@gmail.com> Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com> Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-30 18:45:38 +01:00
firecoperana	869557c8fd	Update mtmd to improve accuracy of M-RoPE (#993 ) * model : Granite docling + Idefics3 preprocessing (SmolVLM) (#16206) * feat: Add granite-docling conversion using trillion pretokenizer Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add granite-docling vocab pre enum Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use granite-docling pre Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add clip_is_idefics3 Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow multi-token boundary sequences for image templating Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add tiling support for idefics3 in clip.cpp This should likely be moved into llava_uhd::get_slice_instructions, but for now this avoids disrupting the logic there. Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Partial support for full templating for idefics3 in mtmd There are still errors encoding some of the image chunks, but the token sequence now matches transformers _almost_ perfectly, except for the double newline before the global image which shows up as two consecutive newline tokens instead of a single double-newline token. I think this is happening because the blocks are tokenized separately then concatenated. 
Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Fully working image preprocessing for idefics3 w/ resize and slicing Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Parse the preprocessor config's longest side and add it to the mmproj hparams Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use the longest side instead of size * scale_factor For Granite Docling, these come out to the same value, but that was just a coincidence. Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Allow batch encoding and remove clip_is_idefics3 Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove unnecessary conditionals for empty token vectors Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use image_manipulation util Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * add test model --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> # Conflicts: # convert_hf_to_gguf.py # convert_hf_to_gguf_update.py # gguf-py/gguf/constants.py # gguf-py/gguf/gguf_writer.py # src/llama-vocab.cpp # src/llama-vocab.h * mtmd : support home-cooked Mistral Small Omni (#14928) * model : add LightOnOCR-1B model (#16764) * model : add LightOnOCR-1B model * add test # Conflicts: # convert_hf_to_gguf.py # gguf-py/gguf/constants.py * mtmd : fix idefics3 preprocessing (#16806) * mtmd : fix idefics3 preprocessing * disable granite test * fix test for granite * model: Add support for CogVLM model (#15002) * Added GGUF mappings for CogVLM model * Add tensor mapping for CogVLM visual encoder * Add CogVLM to conversion script, no vision part yet * Added CogVLM vision model to conversion script * Add graph for CogVLM CLIP model * Add graph for CogVLM * Fixes for CogVLM. Now compiles. 
* Model now runs * Fixes for cogvlm graph * Account for graph context change after rebase * Changes for whitespace * Changes in convert script according to comments * Switch CogVLM LLM graph to merged QKV tensor * Use rope_type variable instead of direct definition * Change CogVLM CLIP encoder to use SWIGLU * Switch CogVLM CLIP to use merged QKV * Apply rebase edits and remove ggml_cont call that is now unnecessary * clean up --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> # Conflicts: # convert_hf_to_gguf.py # examples/mtmd/clip.cpp # gguf-py/gguf/constants.py # gguf-py/gguf/tensor_mapping.py # src/llama-arch.cpp # src/llama-arch.h # src/llama-model.cpp # src/llama-model.h * mtmd: refactor preprocessing + support max/min pixels (#16878) * mtmd: refactor preprocessing + support max/min pixels * fix mlp type * implement mix/max pixels * improve hparams * better image preproc for qwen * fix * fix out of bound composite * fix (2) * fix token calculation * get_merge_kernel_size() * fix llama4 and lfm2 * gonna fix them all * use simple resize for qwen * qwen: increase min tokens * no resize if dst size == src size * restore to initial min/max tokens value for qwen # Conflicts: # examples/mtmd/clip.cpp * clip : use FA (#16837) * clip : use FA * cont : add warning about unsupported ops * implement "auto" mode for clip flash attn * clip : print more detailed op support info during warmup * cont : remove obsolete comment [no ci] * improve debugging message * trailing space * metal : remove stray return --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> * model: add Janus Pro for image understanding (#16906) * Add support for Janus Pro * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Address reviewer suggestions Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Add JANUS_PRO 
constant * Update clip model handling Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> * Update tools/mtmd/clip.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Refactor JANUS_PRO handling in clip.cpp Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> * Update tools/mtmd/clip.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * em whitespace --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> # Conflicts: # convert_hf_to_gguf.py # gguf-py/gguf/constants.py # gguf-py/gguf/tensor_mapping.py * mtmd: pad mask for qwen2.5vl (#16954) * mtmd: pad mask for qwen2.5vl * improve * mtmd: add --image-min/max-tokens (#16921) * mtmd: improve struct initialization (#16981) * mtmd: allow QwenVL to process larger image by default (#17020) * Disable flash attention * mtmd : fix embedding size for image input (#17123) * mtmd: fix patch_size initialized to random value in audio models (#17128) * mtmd: fix patch_size initialized to random value in audio models * add default hparams * add llama_model_n_embd_inp * Fix load qwen3 vl Change batch size * Add description * Fix cli build error --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Tianyue-Zhao <zhaotianyue@outlook.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Zhiyong Wang <85110830+ravenouse@users.noreply.github.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> Co-authored-by: firecoperana <firecoperana>	2025-11-29 07:27:15 +01:00
Kawrakow	ada5a92241	Disable RoPE cache (#1004 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-24 07:09:46 +01:00
hksdpc255	15695a0617	fix kimi-k2 tool call (#996 )	2025-11-24 06:51:16 +01:00
Kawrakow	1128a55b0a	Fix Kimi2 parsing issues (#989 ) * Fix Kimi2 chat parse * Add @hksdpc255's junja templates * Fix junja -> jinja --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-20 10:08:02 +01:00
Kawrakow	0f6986a33c	Disable split mode "row" (#987 ) * Disable split mode "row" * Also llama-bench --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-19 16:15:50 +01:00
firecoperana	bacb8fb79f	Server: Handle context shift better to reduce prompt processing time (#973 ) * Handle context shift better to reduce pp Add context-shift args Add back ga_n in context shift * optimize discard function and bring back n_keep = -1 --------- Co-authored-by: firecoperana <firecoperana>	2025-11-19 16:04:48 +01:00
hksdpc255	2ebd715fa0	common: Generalized XML-style tool-call parsing with streaming support (#958 ) * port upstream https://github.com/ggml-org/llama.cpp/pull/16932 * Add fixed chat templates. * fix grammar when tool have no argument * Insert additional stops for Kimi-K2 * Fix `no triggers set for lazy grammar!` for GLM4.5/4.6 * update chat.cpp * fix grammar for GLM 4.5/4.6 * chat: Fix streaming parser for granite models (#15682) * fix(chat): fix streaming parser for granite models * tests: add test cases for Granite models chat parser * common : Fix corrupted memory error on json grammar initialization (#16038) Initalizing RESERVED_NAME in is_reserved_name() is not thread safe and leads to corrupted memory when used from multiple threads as can be seen in the asan trace below. This fixes the initialization to make it thread-safe. #0 0x000100abd018 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void>>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) __hash_table:1565 #1 0x000100ab0320 in SchemaConverter::visit(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, 
std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) json-schema-to-grammar.cpp:802 #2 0x000100aafc48 in std::__1::__function::__func<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2, std::__1::allocator<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319 #3 0x000100a2c938 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) 
const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&), std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>, void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319 #4 0x000100a139f8 in foreach_function(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, 
double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::function<void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)> const&) chat.cpp:762 #5 0x000100a2a7f4 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0, std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0>, void (common_grammar_builder const&)>::operator()(common_grammar_builder const&) function.h:319 #6 0x000100aa98f4 in build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&) json-schema-to-grammar.cpp:982 #7 0x0001009c9314 in common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool) chat.cpp:1110 #8 0x0001009b8afc in common_chat_templates_apply_jinja(common_chat_templates const, common_chat_templates_inputs const&) chat.cpp:1992 #9 0x0001009b533c in common_chat_templates_apply(common_chat_templates const, common_chat_templates_inputs const&) chat.cpp:2074 #10 0x000100810120 in llamacpp_apply_chat_template+0x724 (predict_oai-98384e17fb94e863:arm64+0x100090120) ... 
==45482==Register values: x[0] = 0x00006020004147f8 x[1] = 0x00006080000013c8 x[2] = 0x0000000000000000 x[3] = 0x0000604006289738 x[4] = 0x0000000000000002 x[5] = 0x0000000000000001 x[6] = 0x04034000004b4000 x[7] = 0x0000000000000001 x[8] = 0xbebebebebebebebe x[9] = 0x17d7d7d7d7d7d7d7 x[10] = 0x00000c04000828ff x[11] = 0x0000000000000001 x[12] = 0x000000002018d383 x[13] = 0x0000000000000000 x[14] = 0xfa0000000000fafa x[15] = 0x000010700001ffff x[16] = 0x000000019dc012c0 x[17] = 0x00000001021284f8 x[18] = 0x0000000000000000 x[19] = 0x00000001700acdc0 x[20] = 0x0000000000000002 x[21] = 0x000000002018d384 x[22] = 0x16dd16fd2e731151 x[23] = 0x0000007000020000 x[24] = 0x0000000100c69c08 x[25] = 0x0000000100c69c20 x[26] = 0x00006080000013c7 x[27] = 0x0000000100c69c00 x[28] = 0x00000001700acd60 fp = 0x00000001700aceb0 lr = 0x0000000100abce30 sp = 0x00000001700acd60 AddressSanitizer can not provide additional info. SUMMARY: AddressSanitizer: SEGV __hash_table:1565 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void>>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) Thread T5 created by T0 here: #0 
0x0001020b99d4 in pthread_create+0x5c (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x359d4) #1 0x000100873910 in std::sys::pal::unix::thread::Thread::new::h77254fdd87a28e05+0x118 (predict_oai-98384e17fb94e863:arm64+0x1000f3910) #2 0x0001007c7a1c in test::run_test::haeb3c2bcd5ed6cf6+0x76c (predict_oai-98384e17fb94e863:arm64+0x100047a1c) #3 0x0001007aedb0 in test::console::run_tests_console::he9d142d704f3a986+0x149c (predict_oai-98384e17fb94e863:arm64+0x10002edb0) #4 0x0001007c5758 in test::test_main::hf86a5e20735245b9+0x118 (predict_oai-98384e17fb94e863:arm64+0x100045758) #5 0x0001007c5da0 in test::test_main_static::h61ee9c8fd30abca0+0x54 (predict_oai-98384e17fb94e863:arm64+0x100045da0) ... ==45482==ABORTING * common : fix reasoning before forced tool call via tool_choice = required (#16264) * common : fix reasoning before forced tool call via tool_choice = required * common : improve reasoning and commentary handling when tool_choice is required (cherry picked from commit c746984956d6882c2de73d53ae2bb3bdf889e475) --------- Co-authored-by: Alde Rojas <hello@alde.dev> * Try fix Jinja template for GLM * Improve Kimi-K2 chat template * Fix "Invalid tool call arguments passed" in a rare case. In a rare case, the model may emit a raw string that begins with a valid JSON string. This commit adds unit tests to cover that scenario and fixes the regression introduced during the Kimi-K2 adaptation. --------- Co-authored-by: shun095 <8069181+shun095@users.noreply.github.com> Co-authored-by: David Ribeiro Alves <davidralves@gmail.com> Co-authored-by: crat0z <11581854+crat0z@users.noreply.github.com> Co-authored-by: Alde Rojas <hello@alde.dev>	2025-11-18 15:29:58 +01:00
Kawrakow	412e4f6e23	Add usage for -vq, --validate-quants (#977 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-17 16:02:14 +01:00
firecoperana	bb358223cd	server: cache prompt to host memory (#954 ) * server : host-memory prompt caching change similarity calculation and prompt save conditions Remove unneeded token limit rename variable Separate prompt save and load logic change default values change log remove truncate prompt logic * add description * bug fixes * remove token limit in init --------- Co-authored-by: firecoperana <firecoperana>	2025-11-14 18:40:13 +02:00
Kawrakow	00dffb5e68	Add --chat-template-file to usage (#959 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-14 11:08:58 +02:00
Kawrakow	6b9d1bf4b4	Graph reuse (#947 ) * Add mainline compatible FA command line option * Graph reuse: add command line argument to turn it on * WIP * This seems to work * This is perhaps cleaner * Change the command line option to -gr --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-14 06:58:19 +02:00
Kawrakow	22c20fcd6d	Fix flash attention long argument for mainline compatibility	2025-11-13 19:22:16 +02:00
Kawrakow	874926800f	Add mainline compatible FA command line option (#944 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-13 08:55:33 +02:00
Kawrakow	ddc88bac17	Set mla=3 by default (#943 ) so more recent users that haven't followed the history of FlashMLA evolution and hence don't know about the MLA options get the best setting without having to add -mla 3 on the command line. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-12 11:00:58 +02:00
firecoperana	eea6cc4433	Server: Add --draft-params to set draft model parameter via command line args (#932 ) * Add command line argument for draft model * Remove second context of draft model * Format print * print usage if parsing -draft fails --------- Co-authored-by: firecoperana <firecoperana>	2025-11-10 09:51:07 +02:00
Kawrakow	7df9947923	Fix compiler warning	2025-11-09 14:35:59 +02:00
firecoperana	b63309a918	Fix embedding missing, CORS and crash using verbose in server (#924 ) * server: fix crash when prompt has image and is too long * server: fix CORS * server: fix empty result for embedding * change error message to truncate prompt * server: fix slot id for save and load state * bug fix * server: update slot similarity to handle mtmd * server: quick hack to calculate number of token processed with image * server: fix out of range error when detokenizing prompt under verbose * Add back Access-Control-Allow-Origin * Server: Add prompt tokens in embedding results --------- Co-authored-by: firecoperana <firecoperana>	2025-11-09 14:16:03 +02:00
Kawrakow	532a05e466	CUDA: set compute parameters via command line arguments (#910 ) * cuda: set compute parameters via command line arguments * Also llama-bench --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-07 07:11:23 +02:00
firecoperana	7978f04996	Add vision support in llama-server (#901 ) * server: add support for vision model webui: add support for vision model * server : remove hack for extra parallel slot#10187 * llama : fix KV shift for qwen2vl #13870 * add no-context-shift parameter --------- Co-authored-by: firecoperana <firecoperana>	2025-11-05 10:43:46 +02:00
Kawrakow	c23fda2103	Disable some fusion, RoPE cache off by default (#894 ) * Disable some fusion and make rope cache off by default * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-04 07:50:14 +02:00
Kawrakow	fb0d5a995c	RoPE cache (#887 ) * Introducing rope cache When computing RoPE, the rotation angles in each layer are exactly the same, and only depend on the token positions (and other constant, model dependent parameters). So, I wonder, why don't we compute the angles just once and then reuse for the Q and K RoPE in each layer? This commit does it as a POC on the CPU, and uses it in the Qwen3-MoE compute graph. * cuda: neox works * WIP * rope_cache: norm works * Fused rope+rope * Fused rope+rope (norm) * Fused rms+rms+rope+rope (neox) - not working * WIP * Also qwen3 * Add command line arg to disable rope cache * Disable RoPE cache if rope type is not neox or norm * Add missing break after merge with main * Fused fused_rms+fused_rms+rope+rope (with -mqkv) * Fused fused_rms+fused_rms+rope+rope (without -mqkv) --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-03 18:42:20 +02:00
firecoperana	a3bd0158f7	Disable pipeline parallel for tensor override or allocation failed (#879 ) * disable pipeline parallelism when tensor override present * disable pipeline parallel if allocation failed --------- Co-authored-by: firecoperana <firecoperana>	2025-10-31 14:20:48 +02:00
Kawrakow	56fc5454ff	Merge Q, K, V (#878 ) * POC: merge Q, K, V into a single, contiguous tensor Done just for Qwen3-MoE, where I see a 4% uplift in TG. PP performance gain is sub-percent, if any. Still, it seems it makes sense to do it in general given the TG performance gain. * WIP * merge_qkv: it works for gpt-oss ...but we see a smaller TG gain (~1.5%) * WIP * Don't ignore the return value of create_tensors() else, when q, k, v get merged and we are running on the CPU, we get a crash because the backend is trying to use mmap, but that no longer works. * merge_qkv: bias can be required, optional, or mandatory * merge_qkv: glm4.5moe * merge_qkv: add command line argument to enable * merge_qkv: fix tensor dimensions * merge_qkv: llama-4 * merge_qkv: qwen3 (dense) * merge_qkv: simplify build_qwen3moe * cohere2 - simplify graph building --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-30 10:49:48 +02:00
Nexes the Elder	e68dabc242	A few server commits from mainline. (#872 ) server : handle models with missing EOS token (#8997) server : fix segfault on long system prompt (#8987) * server : fix segfault on long system prompt * server : fix parallel generation with very small batch sizes * server : fix typo in comment server : init stop and error fields of the result struct (#9026) server : fix duplicated n_predict key in the generation_settings (#8994) server : support reading arguments from environment variables (#9105) * server : support reading arguments from environment variables * add -fa and -dt * readme : specify non-arg env var server : add some missing env variables (#9116) * server : add some missing env variables * add LLAMA_ARG_HOST to server dockerfile * also add LLAMA_ARG_CONT_BATCHING Credits are to the respective authors. Not a single merge conflict occurred. Compiled, then tested without bug.	2025-10-28 09:58:31 +02:00
firecoperana	904e994bfb	Support --device and --device-draft parameter (#866 ) * add --device and --device-draft parameter * don't print debug message in release mode * fix * bug fix to throw exception when no device specified * add const --------- Co-authored-by: firecoperana <firecoperana>	2025-10-27 18:13:28 +02:00
firecoperana	bf991ba60a	Add --webui arg to launch llama.cpp new webui (#786 ) * Add new webui from llama.cpp * Add new webui * feat: Improve mobile UI for Settings Dialog (#16084) * feat: Improve mobile UI for Settings Dialog * chore: update webui build output * fix: Linting errors * chore: update webui build output # Conflicts: # examples/server/webui_llamacpp/src/lib/components/app/chat/ChatSettings/ChatSettingsFields.svelte # examples/server/webui_llamacpp/src/lib/components/app/chat/ChatSettings/ChatSettingsSection.svelte # tools/server/public/index.html.gz * webui : fix handling incomplete chunks (#16107) * Always show message actions for mobile UI + improvements for user message sizing (#16076) # Conflicts: # .gitignore # examples/server/webui_llamacpp/package.json # examples/server/webui_llamacpp/scripts/dev.sh # tools/server/webui/scripts/post-build.sh * webui: switch to hash-based routing (alternative of #16079) (#16157) * Switched web UI to hash-based routing * Added hash to missed goto function call * Removed outdated SPA handling code * Fixed broken sidebar home link # Conflicts: # examples/server/webui_llamacpp/src/routes/+layout.ts # tools/server/server.cpp * Allow viewing conversations even when llama server is down (#16255) * webui: allow viewing conversations and sending messages even if llama-server is down - Cached llama.cpp server properties in browser localStorage on startup, persisting successful fetches and reloading them when refresh attempts fail so the chat UI continues to render while the backend is unavailable. - Cleared the stored server properties when resetting the store to prevent stale capability data after cache-backed operation. - Kept the original error-splash behavior when no cached props exist so fresh installs still surface a clear failure state instead of rendering stale data. 
* feat: Add UI for `props` endpoint unavailable + cleanup logic * webui: extend cached props fallback to offline errors Treat connection failures (refused, DNS, timeout, fetch) the same way as server 5xx so the warning banner shows up when cache is available, instead of falling back to a full error screen. * webui: Left the chat form enabled when a server warning is present so operators can keep sending messages e.g., to restart the backend over llama-swap, even while cached /props data is in use * chore: update webui build output --------- Co-authored-by: Pascal <admin@serveurperso.com> # Conflicts: # examples/server/webui_llamacpp/src/lib/components/app/chat/ChatScreen/ChatScreenWarning.svelte # examples/server/webui_llamacpp/src/lib/constants/localstorage-keys.ts * Enhance text file detection logic for file attachments (#16199) * feat: Enhances text file detection logic * chore: Build static `webui` output * chore: update webui build output # Conflicts: # examples/server/webui_llamacpp/src/lib/constants/binary-detection.ts * Show message actions by default (#16289) * fix: preserved zero values in chat settings inputs and textareas by switching to nullish coalescing for field values and default placeholders (#16312) * Improve Mobile UI for dialogs and action dropdowns (#16222) * fix: Always show conversation item actions * feat: Improve Alert Dialog and Dialog mobile UI * feat: Add settings reset to default confirmation * fix: Close Edit dialog on save * chore: update webui build output * webui: implement proper z-index system and scroll management - Add CSS variable for centralized z-index control - Fix dropdown positioning with Settings dialog conflicts - Prevent external scroll interference with proper event handling - Clean up hardcoded z-index values for maintainable architecture * webui: ensured the settings dialog enforces dynamic viewport height on mobile while retaining existing desktop sizing overrides * feat: Use `dvh` instead of computed px height for 
dialogs max height on mobile * chore: update webui build output * feat: Improve Settings fields UI * chore: update webui build output * chore: update webui build output --------- Co-authored-by: Pascal <admin@serveurperso.com> * Fix thinking blocks with quotes + add handling `[THINK]...[/THINK]` blocks (#16326) * fix: prevent reasoning blocks with quotes from being truncated * chore: update webui build output * feat: Improve thinking content parsing * test: Adds ChatMessage component stories for different thinking blocks * chore: update webui build output * fix: ChatMessage story fix --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Chatapi ignore empty sampling (#16330) * fix: skip empty sampling fields instead of coercing to 0 in chat API options * chore: update webui build output * webui: Remove running `llama-server` within WebUI `dev.sh` script (#16363) * Add optional setting for showing "Model used:" information (#16337) * feat: Add a setting to include model name used to generate the message * feat: UI improvements * feat: Save model info along with the database message entry creation * chore: Build webui static output * Improve code block color theming (#16325) * feat: Improve code block theming * chore: update webui build output * chore: Update webui static build * Conversation action dialogs as singletons from Chat Sidebar + apply conditional rendering for Actions Dropdown for Chat Conversation Items (#16369) * fix: Render Conversation action dialogs as singletons from Chat Sidebar level * chore: update webui build output * fix: Render Actions Dropdown conditionally only when user hovers conversation item + remove unused markup * chore: Update webui static build * fix: Always truncate conversation names * chore: Update webui static build * fix: track viewportHeight via window.innerHeight to avoid unwanted scrolling (#16356) Use <svelte:window bind:innerHeight> instead of manual resize listener Co-authored-by: Aleksander Grygier 
<aleksander.grygier@gmail.com> * webui : Fix messages payload sent to chat completions (#16402) * fix: Include just the currently active message branches instead of all in chat completions request * chore: Build webui static output * chore: Formatting * chore: update webui build output * Capture model name only after first token (streaming) or completed request (#16405) * feat: Capture model name only after first token (streaming) or completed request (non-streaming) * chore: update webui build output * chore: update webui build output * Fix missing messages on sibling navigation (#16408) * fix: resolve message disappearing issue when navigating between regenerated siblings by using current leaf nodes instead of cached sibling IDs * chore: update webui build output * chore: update webui build output * webui : added download action (#13552) (#16282) * webui : added download action (#13552) * webui : import and export (for all conversations) * webui : fixed download-format, import of one conversation * webui : add ExportedConversations type for chat import/export * feat: Update naming & order * chore: Linting * webui : Updated static build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * refactor: centralize CoT parsing in backend for streaming mode (#16394) * refactor: unify reasoning handling via backend reasoning_content, drop frontend tag parsing - Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing - Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops - Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing legacy <think> parsing logic - Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages * 
refactor: implement streaming-aware universal reasoning parser Remove the streaming mode limitation from --reasoning-format by refactoring try_parse_reasoning() to handle incremental parsing of <think> tags across all formats. - Rework try_parse_reasoning() to track whitespace, partial tags, and multiple reasoning segments, allowing proper separation of reasoning_content and content in streaming mode - Parse reasoning tags before tool call handling in content-only and Llama 3.x formats to ensure inline <think> blocks are captured correctly - Change default reasoning_format from 'auto' to 'deepseek' for consistent behavior - Add 'deepseek-legacy' option to preserve old inline behavior when needed - Update CLI help and documentation to reflect streaming support - Add parser tests for inline <think>...</think> segments The parser now continues processing content after </think> closes instead of stopping, enabling proper message.reasoning_content and message.content separation in both streaming and non-streaming modes. Fixes the issue where streaming responses would dump everything (including post-thinking content) into reasoning_content while leaving content empty. 
* refactor: address review feedback from allozaur - Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component - Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse - Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * refactor: restore forced reasoning prefix to pass test-chat ([chat] All tests passed) - store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block - inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication - repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows * refactor: address review feedback from ngxson * debug: say goodbye to curl -N, hello one-click raw stream - adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering * Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessage.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: add Storybook example for raw LLM output and scope reasoning format toggle per story - Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample - Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example * npm run format * chat-parser: address review feedback from ngxson Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> # Conflicts: # common/arg.cpp # examples/server/webui_llamacpp/src/lib/utils/thinking.ts 
# tools/server/README.md * No markdown in cot (#16483) * fix: let the model think in plaintext * chore: npm run format + npm run build * webui: updated the chat service to only include max_tokens in the req… (#16489) * webui: updated the chat service to only include max_tokens in the request payload when the setting is explicitly provided, while still mapping explicit zero or null values to the infinite-token sentinel * chore: update webui build output * feat: render user content as markdown option (#16358) * feat: render user content as markdown option - Add a persisted 'renderUserContentAsMarkdown' preference to the settings defaults and info metadata so the choice survives reloads like other options - Surface the new 'Render user content as Markdown' checkbox in the General section of the chat settings dialog, beneath the PDF toggle - Render user chat messages with 'MarkdownContent' when the new setting is enabled, matching assistant formatting while preserving the existing card styling otherwise - chore: update webui build output * chore: update webui build output * webui: remove client-side context pre-check and rely on backend for limits (#16506) * fix: make SSE client robust to premature [DONE] in agentic proxy chains * webui: remove client-side context pre-check and rely on backend for limits Removed the client-side context window pre-check and now simply sends messages while keeping the dialog imports limited to core components, eliminating the maximum context alert path Simplified streaming and non-streaming chat error handling to surface a generic 'No response received from server' error whenever the backend returns no content Removed the obsolete maxContextError plumbing from the chat store so state management now focuses on the core message flow without special context-limit cases * webui: cosmetic rename of error messages * Update tools/server/webui/src/lib/stores/chat.svelte.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * 
Update tools/server/webui/src/lib/stores/chat.svelte.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> # Conflicts: # examples/server/webui_llamacpp/src/lib/components/app/dialogs/ChatErrorDialog.svelte # examples/server/webui_llamacpp/src/lib/components/app/dialogs/MaximumContextAlertDialog.svelte # examples/server/webui_llamacpp/src/lib/services/context.ts * fix: add remark plugin to render raw HTML as literal text (#16505) * fix: add remark plugin to render raw HTML as literal text Implemented a missing MDAST stage to neutralize raw HTML like major LLM WebUIs do ensuring consistent and safe Markdown rendering Introduced 'remarkLiteralHtml', a plugin that converts raw HTML nodes in the Markdown AST into plain-text equivalents while preserving indentation and line breaks. 
This ensures consistent rendering and prevents unintended HTML execution, without altering valid Markdown structure Kept 'remarkRehype' in the pipeline since it performs the required conversion from MDAST to HAST for KaTeX, syntax highlighting, and HTML serialization Refined the link-enhancement logic to skip unnecessary DOM rewrites, fixing a subtle bug where extra paragraphs were injected after the first line due to full innerHTML reconstruction, and ensuring links open in new tabs only when required Final pipeline: remarkGfm -> remarkMath -> remarkBreaks -> remarkLiteralHtml -> remarkRehype -> rehypeKatex -> rehypeHighlight -> rehypeStringify * fix: address review feedback from allozaur * chore: update webui build output # Conflicts: # examples/server/webui_llamacpp/src/lib/constants/literal-html.ts * Add server-driven parameter defaults and syncing (#16515) # Conflicts: # examples/server/webui_llamacpp/src/lib/components/app/chat/ChatSettings/ParameterSourceIndicator.svelte # examples/server/webui_llamacpp/src/lib/constants/precision.ts # examples/server/webui_llamacpp/src/lib/services/parameter-sync.spec.ts # examples/server/webui_llamacpp/src/lib/services/parameter-sync.ts # examples/server/webui_llamacpp/src/lib/utils/config-helpers.ts # examples/server/webui_llamacpp/src/lib/utils/precision.ts * fix: added a normalization step for MathJax-style \[\] and delimiters (#16599) * fix: added a normalization step for MathJax-style \[\] and delimiters So inline and block equations are converted before KaTeX rendering, enabling proper display of model-generated LaTeX in the WebUI * chore: update webui build output * webui: reorganize settings layout (#16607) * webui: reorganize settings layout * chore: update webui build output * fix: remove unused variable * chore: update webui build output * Enable per-conversation loading states to allow having parallel conversations (#16327) * feat: Per-conversation loading states and tracking streaming stats * chore: update 
webui build output * refactor: Chat state management Consolidates loading state management by using a global `isLoading` store synchronized with individual conversation states. This change ensures proper reactivity and avoids potential race conditions when updating the UI based on the loading status of different conversations. It also improves the accuracy of statistics displayed. Additionally, slots service methods are updated to use conversation IDs for per-conversation state management, avoiding global state pollution. * feat: Adds loading indicator to conversation items * chore: update webui build output * fix: Fix aborting chat streaming Improves the chat stream abortion process by ensuring that partial responses are saved before the abort signal is sent. This avoids a race condition where the onError callback could clear the streaming state before the partial response is saved. Additionally, the stream reading loop and callbacks are now checked for abort signals to prevent further processing after abortion. 
* refactor: Remove redundant comments * chore: build webui static output * refactor: Cleanup * chore: update webui build output * chore: update webui build output * fix: Conversation loading indicator for regenerating messages * chore: update webui static build * feat: Improve configuration * feat: Install `http-server` as dev dependency to not need to rely on `npx` in CI * Import/Export UX improvements (#16619) * webui : added download action (#13552) * webui : import and export (for all conversations) * webui : fixed download-format, import of one conversation * webui : add ExportedConversations type for chat import/export * feat: Update naming & order * chore: Linting * feat: Import/Export UX improvements * chore: update webui build output * feat: Update UI placement of Import/Export tab in Chat Settings Dialog * refactor: Cleanup chore: update webui build output * feat: Enable shift-click multiple conversation items selection * chore: update webui static build * chore: update webui static build --------- Co-authored-by: Sascha Rogmann <github@rogmann.org> # Conflicts: # examples/server/webui_llamacpp/src/lib/components/app/chat/ChatSettings/ConversationSelectionDialog.svelte # examples/server/webui_llamacpp/src/lib/components/app/chat/ChatSettings/ImportExportTab.svelte # examples/server/webui_llamacpp/src/lib/utils/conversation-utils.ts * Prevent premature submission on IME input (#16673) * fix: Prevent premature submission on IME input * chore: update webui static build * refactor: Put IME completion checker in a helper function and add checking for `KeyboardEvent.eventKey === 229` * chore: update webui static build * chore: update webui static build * chore: update webui static build # Conflicts: # examples/server/webui_llamacpp/src/lib/utils/is-ime-composing.ts * Handle legacy 'context' attachments (#16687) * webui: introduce OpenAI-compatible model selector in JSON payload (#16562) * webui: introduce OpenAI-compatible model selector in JSON payload * 
webui: restore OpenAI-Compatible model source of truth and unify metadata capture This change re-establishes a single, reliable source of truth for the active model: fully aligned with the OpenAI-Compat API behavior It introduces a unified metadata flow that captures the model field from both streaming and non-streaming responses, wiring a new onModel callback through ChatService The model name is now resolved directly from the API payload rather than relying on server /props or UI assumptions ChatStore records and persists the resolved model for each assistant message during streaming, ensuring consistency across the UI and database Type definitions for API and settings were also extended to include model metadata and the onModel callback, completing the alignment with OpenAI-Compat semantics * webui: address review feedback from allozaur * webui: move model selector into ChatForm (idea by @allozaur) * webui: make model selector more subtle and integrated into ChatForm * webui: replaced the Flowbite selector with a native Svelte dropdown * webui: add developer setting to toggle the chat model selector * webui: address review feedback from allozaur Normalized streamed model names during chat updates by trimming input and removing directory components before saving or persisting them, so the conversation UI shows only the filename Forced model names within the chat form selector dropdown to render as a single-line, truncated entry with a tooltip revealing the full name * webui: toggle displayed model source for legacy vs OpenAI-Compat modes When the selector is disabled, it falls back to the active server model name from /props When the model selector is enabled, the displayed model comes from the message metadata (the one explicitly selected and sent in the request) * Update tools/server/webui/src/lib/components/app/chat/ChatForm/ChatFormActions.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update 
tools/server/webui/src/lib/constants/localstorage-keys.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/components/app/chat/ChatForm/ChatFormModelSelector.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/services/chat.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/services/chat.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: refactor model selector and persistence helpers - Replace inline portal and event listeners with proper Svelte bindings - Introduce 'persisted' store helper for localStorage sync without runes - Extract 'normalizeModelName' utils + Vitest coverage - Simplify ChatFormModelSelector structure and cleanup logic Replaced the persisted store helper's use of '$state/$effect' runes with a plain TS implementation to prevent orphaned effect runtime errors outside component context Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: document normalizeModelName usage with inline examples * Update tools/server/webui/src/lib/components/app/chat/ChatForm/ChatFormModelSelector.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/stores/models.svelte.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/stores/models.svelte.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: extract ModelOption type into dedicated models.d.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: refine ChatMessageAssistant displayedModel source logic * webui: stabilize dropdown, simplify model extraction, and init assistant model field * chore: 
update webui static build * Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * chore: npm format, update webui static build * webui: align sidebar trigger position, remove z-index glitch * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> # Conflicts: # examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormModelSelector.svelte # examples/server/webui_llamacpp/src/lib/services/models.ts # examples/server/webui_llamacpp/src/lib/stores/models.svelte.ts # examples/server/webui_llamacpp/src/lib/stores/persisted.svelte.ts # examples/server/webui_llamacpp/src/lib/types/models.d.ts # examples/server/webui_llamacpp/src/lib/utils/model-names.test.ts # examples/server/webui_llamacpp/src/lib/utils/model-names.ts # examples/server/webui_llamacpp/src/lib/utils/portal-to-body.ts * webui: support q URL parameter (#16728) * webui: support q URL parameter Fixes #16722 I’ve checked that it works with Firefox’s AI tools * webui: apply suggestions from code review Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * chore: update webui static build --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * build fix --------- Co-authored-by: firecoperana <firecoperana> Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> Co-authored-by: Quentin Bramas <quentin.bramas@gmail.com> Co-authored-by: Isaac McFadyen <isaac@imcf.me> Co-authored-by: Pascal <admin@serveurperso.com> Co-authored-by: Sascha Rogmann <59577610+srogmann@users.noreply.github.com> Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> Co-authored-by: Sascha Rogmann <github@rogmann.org> Co-authored-by: Florian Badie <florianbadie@odrling.xyz>	2025-10-27 14:22:02 +02:00
Kawrakow	41d6c42b96	Change flash attention and fmoe to be on by default (#863 ) * Change fmoe to be on by default * Change default fmoe also in llama-bench * Change flash attention to be on by default --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-25 09:37:28 +03:00
Kawrakow	0549be76e5	Fused mul + multi_add op (#858 ) * Adding fused mul+multi_add + CPU implementation * fused mul+multi_add: CUDA * fused mul+multi_add: command line argument to disable it --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-24 07:40:35 +03:00
Kawrakow	06ad8d1b2d	Fix PR #842 (#844 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-20 11:35:57 +03:00
Kawrakow	36f9601e8d	Make ooae on by default and add to llama-bench (#842 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-20 08:32:41 +03:00
Kawrakow	0c050638b6	Change --n-cpu-moe to not keep expert biases on CPU (#841 ) * Change --n-cpu-moe to not keep expert biases on CPU * Also for --cpu-moe --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-19 19:03:03 +03:00
Kawrakow	cde642e591	Grouped expert routing (CPU only) (#836 ) * Better argsort (CPU) * Attempt at grouped topk * This seems to do the trick for grouped experts routing * Cleanup * Trying to merge, something is not right * Working merged grouped top_k (CPU) * Add command line option to enable grouped expert routing * Add grouped expert routing option to llama-bench --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-16 14:57:02 +03:00
Kawrakow	e2f21c8dc8	Move minja and nlohmann/json to vendor (#802 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-27 09:12:35 +02:00
Kawrakow	346f580267	Remove stb_image.h copy in common - it is now in vendor (#801 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-27 08:55:42 +02:00
Kawrakow	c1a0e15377	Port mtmd from mainline + Qwen2/2.5-VL support (#798 ) * Add mtmd: the beginning * Add mtmd: mtmd.cpp compiles * Add mtmd: clip initialization compiles * Add mtmd: clip.cpp compiles * Add mtmd: builds successfully * Add CPU implementation for GGML_OP_GLU * Add CUDA implementation for GGML_OP_GLU * Add CPU implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW * Add CUDA implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW * Add mtmd: refresh CPU rope * Add mtmd: refresh CUDA rope * Add mtmd: add Qwen2-VL * Add mtmd: Qwen2.5-VL text seems to work with this change * Add mtmd: fix swiglu * Add mtmd: use LOG_TEE so generated tokens show up in terminal * Add mtmd: do not attempt to load a GPU backend if none are available * GLU, not GPU * Fix typo * Fix new/free mismatch * LOG stuff * Add mtmd: this fixes gibberish on second image --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-27 08:45:29 +02:00
firecoperana	7d8d232896	sync: vendor (#799 ) Co-authored-by: firecoperana <firecoperana>	2025-09-26 18:22:47 +02:00
firecoperana	17f7f1ed18	Update webui to handle reasoning content and include usage stats in server only when requested (#791 ) * handle reasoning content in webui server : include usage statistics only when user request them (#16052) server : only attempt to enable thinking if using jinja (#15967) * config reasoning_content in webui and change default to auto --------- Co-authored-by: firecoperana <firecoperana>	2025-09-24 07:45:09 +02:00
firecoperana	079231c291	model : add grok-2 support (#782 ) Co-authored-by: firecoperana <firecoperana>	2025-09-23 16:31:01 +02:00
firecoperana	a6da22beb2	Deepseek V3.1 native tool calling support (OpenAI Style) (#771 )	2025-09-13 07:51:40 +02:00
Kawrakow	13c3b6412e	Offload only activated experts to the GPU (#698 ) * Offload only activated experts * This seems to do the trick for -fmoe * Do not recalculate activated experts for fused up/gate * Log out of bounds access details * Add a command line argument --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-04 12:22:30 +02:00
firecoperana	d7882c3cf8	Tool calls support from mainline (#723 ) * Tool calls support from mainline * update cmake * revert api for /completions * Fix broken thinking process for gpt-oss * add missing args and fix webui bugs * add missing args and fix webui bugs2 * Fix reasoning format error * add usage * change default post_sampling_probs to true * add back generated_text * Remove server endpoints tests * add log * Chat fixes * Remove logs * webui: revert extra handling of thinking process --------- Co-authored-by: firecoperana <firecoperana> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-01 08:38:49 +03:00
Kawrakow	8de297b795	Fused FFN_UP+FFN_GATE op (#741 ) * Fused up+gate+unary for regular (not MoE) FFN - CPU * WIP CUDA * Seems to be working on CUDA For a dense model we get 2-3% speedup for PP and ~0.6% for TG. * Add command line option This time the option is ON by default, and one needs to turn it off via -no-fug or --no-fused-up-gate --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-31 18:16:36 +03:00
Kawrakow	e760b4dc41	Check for NaNs while loading the model. (#727 ) * Check for NaNs while loading the model. * Also tell which experts have NaNs. * Add command line option to validate quants * Add checks for more quantization types * Add checks for more quantization types --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-27 19:00:17 +03:00
saood06	7a68553487	Add mikupad to ik_llama as an alternative WebUI (#558 ) * mikupad.html in ik_llama.cpp (functional but WIP) * Remove hardcoded extension and add error handling to extension loading * Update version number and add features array to version * Make version endpoint always accessible * Fix case with empty sql * Add useful error message when launched without sql file * Add sigma sampler * Update sigma step and max based on docs * Remove selectedSessionId and handle it with URL fragment * Export All (code only, no UI) * Add compression to server.cpp * Major UI work (and also add update backend endpoints to accommodate) * Finalize UI * Fix visual bug * fix merge conflict issue * Pull in full sqlite_modern_cpp repo for the license as it is not attached to source files * Make compression not show in sidebar if extension is not loaded * Finalize build, Put support behind LLAMA_SERVER_SQLITE3 build option, and update error message to include the build option is not passed situation * Fix compile without flag on systems without it installed	2025-08-24 08:27:29 -05:00
g2mt	06bed7e01b	Port universal assisted decoding to llama-server (#699 ) * port universal assisted decoding to server * fix calls * fix LOG_INFO * fix llama_detokenize call * use emplace_back	2025-08-18 09:22:23 +03:00


313 Commits