ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-03-13 07:20:15 +00:00

Author	SHA1	Message	Date
firecoperana	c03ee1a4d2	server: improve speed of speculative decoding (#1119 ) * server: improve speed of speculative decoding change logs rpc: add recompute spec dec fix * Fix n_batch_size not set to context size for draft model --------- Co-authored-by: firecoperana <firecoperana>	2026-01-10 08:01:22 +02:00
dungquixote42	52ad1c6421	Implement Adaptive-P Sampler (#1100 ) * initial implementation of adaptive-p sampler * explicitly mark candidates unsorted + cleanup qualifiers * cosmetic update * reorg prototypes * lockstep with mainline * add _impl for _init + reorg * add LLAMA_API to prototypes * update sharpness to 10 * lockstep: rng seed * delete llama_sampling member in llama_sampler_adaptive_p * fix LLAMA_API return type * lockstep: rng seed cont * actually correct implementation * lockstep: sorting behavior * const -> constexpr for known constants * add missing space * fix softmax usage in adaptive p sampler * cosmetic changes * implement do-not-sort version of softmax * simpify rng seed, add static to constexpr * refactor: remove iface + use shared rng + use actually original probabilities * adaptive-p: add dedicated rng back in * fix initial max_logit + add float vector to adaptive p sampler context + stochastic sampling * adaptive-p: fuse first softmax with transformation * adaptive-p: implement binary search selection * adaptive-p: update comment	2026-01-10 07:58:53 +02:00
firecoperana	56dceefd6b	Fix windows build with CUDA (#1101 ) Co-authored-by: firecoperana <firecoperana>	2026-01-05 07:59:23 +02:00
Kawrakow	17a5a80946	Fix Windows build (#1097 )	2025-12-29 14:18:27 +01:00
firecoperana	903377bc34	Webui: improve scroll and bug fixes (#1082 ) * Webui: fix message scroll back due to setPending smooth scroll remove throttle increase scroll margin # Conflicts: # examples/server/public/index.html.gz # examples/server/webui/dist/index.html # examples/server/webui/src/utils/app.context.tsx * webui: don't scroll to bottom when conversation changes or edit message # Conflicts: # examples/server/public/index.html.gz # examples/server/webui/dist/index.html * Webui: fix save config error # Conflicts: # examples/server/public/index.html.gz # examples/server/webui/dist/index.html * Webui: add api key to request model name # Conflicts: # examples/server/public/index.html.gz # examples/server/webui/dist/index.html * Update * webui: fix loading dots display issue # Conflicts: # examples/server/public/index.html.gz # examples/server/webui/dist/index.html # examples/server/webui/src/components/ChatMessage.tsx * Webui: cancel scroll when user moves up --------- Co-authored-by: firecoperana <firecoperana>	2025-12-24 12:30:26 +01:00
firecoperana	2a633c4357	server: exclude thinking tokens when finding the slot (#1079 ) refactor find slot enable by default Fix load prompt rename variables Co-authored-by: firecoperana <firecoperana>	2025-12-22 09:46:45 +01:00
firecoperana	0e91b89cd3	Refactor chat and server file (#1062 ) * Add alternative log functions * chat: fix int overflow, prevent size calculation in float/double (#17357) * chat: fix int overflow, prevent size calculation in float/double * Update common/chat.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * common : move all common_chat_parse_* to chat-parser.cpp. (#17481) # Conflicts: # common/chat.cpp * server: split server.cpp code into server/common/task/queue/context * Fix compiler warning * Clean up code * common: use native MultiByteToWideChar * move server prompt to server task * Clean code * delete utils.hpp --------- Co-authored-by: firecoperana <firecoperana> Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: DAN™ <dranger003@gmail.com>	2025-12-15 08:27:20 +01:00
Kawrakow	279b4c44fc	Fix the fix (#1054 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-11 08:05:33 +01:00
Kawrakow	a2efa22f10	Fix llama-bench - missing buffer override comparison operator (#1053 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-11 07:21:06 +01:00
i4TsU	e61cd0303a	QoL/bugfixes for llama-bench (#1052 ) * include cuda-params and -ot in llama-bench output * cleanup redundant type mapping * fix wrong field name * fix preexisting mistake in cuda_params help text (default value) * fix preexisting mistake in kompute column header * adjust code style to match current norms * simplify/fix inverted columns * fix field->value pairings/order * remove dead field `f16_kv` * sql printer deserves a way out too * actually enable the new improvements....	2025-12-11 07:04:15 +01:00
Kawrakow	a719349982	POC: CUDA tensor parallel (MoE models) (#1022 ) * Remove most of split mode row * WIP * WIP: also allocate the KV cache using tensor split * WIP: it runs with wrong result But it also looks like the backend scheduler is not going to help: * It copies mask and input positions to GPU 0 * => RoPE ops must run on GPU 0 * => To proceed attn evaluation, GPU 1 must wait for GPU 0 to finish its entire attn calculation * Same with FFN. The rms_norm gets scheduled on GPU 0. Hence, GPU 1 must wait for GPU 0 to finish its entore FFN calculation before it can start (as it needs to copy the result of rms_norm from GPU 0) * => Seems useless without writing a bespoke TP scheduling * WIP * This works, but it is slow * This is slightly better the graph is still not being computed in parallel. Why? Because the scheduler creates graph splits where the result of the computation on one GPU becomes an input for the other split. Hence, to trigger the computation on the second GPU one needs to wait for the computation on the first GPU to finish, even thiough the two can be done in parallel up to the sunchronization point. So, all that is left to do is to trick the scheduler to create to splits that can be done in parallel, and then have a graph split where the results get combined. * Playing games with the scheduler This change tricks it into doing the right thing^TM. Still quite a bit slower than split mode layer for the 8B LlaMA model. But for the 70B LlaMA it now beats split mode layer for TG: 28 t/s vs 24.4 t/s. PP is 627 t/s vs 744 t/s. In comparison, split mode "row" in mainline gets 484 t/s PP and 19.3 t/s TG. * Fix attn split Granularity for Wq, Wo is not just head size, but head size * gqa_ratio. Else the Wk, Wv tensors end up not being a multiple of the head size when we divide the split determined by Wo with the gqa_ratio. * Show memory used per device * Make it work with partial offload but no tensor overrides yet, just ngl < num_layers. * Allow for f16 source in fused_rms_norm * This results in faster PP. Now PP is faster than split mode layer for L3-70B. * Rename split mode "row" to split mode "graph" * Leave FFN partial results as f16 * WIP GLM4.5 - runs with wrong results * WIP GLM4.5 - this works PP is already better than split mode layer, but TG for zero context is kind of low - 60 vs 92 t/s. TG becomes better than split mode layer at around 20k tokens. PP at 26k tokens is 1.55X of sm layer. * Work around compiler bug It issues a warning that there is an extra semicolon outside of a function, but there isn't. If I remove the anonymous namespace and turn the functions inside into static, the warning disapears, so clearly a compiler bug. * Make graph reuse work with split mode graph * Remove more split mode row remnants * WIP tensor overrides Runs with wrong results, don't see where the issue could be. * This works but is slow Still does not work for row-interleaved quants * Slightly better * Slightly better * Row-interleaved quants work * Better * Minor * Guarad against using split mode "graph" for unsupported models * Guards against using merge_qkv with split mode "graph" * WIP split mode attn Works for LlaMA models, but not for GLM-4.5. Doesn't seem to improve performance, so I guess no point in trying to fix it. * Split mode graph for qwen3moe * Try to better distribute the splits --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-01 19:25:40 +01:00
firecoperana	15771072c7	RPC: support multiple devices including cpu (#1024 ) * RPC support multiple devices * rpc : update documentation (#16441) Update the README file to match the newly added functionality of exposing multiple devices from a single server. Co-authored-by: Diego Devesa <slarengh@gmail.com> # Conflicts: # examples/rpc/README.md * Remove memory settings * rpc : cache and reuse compute graphs (#15405) Store the last computed graph and reuse it when possible. Also do not return response from GRAPH_COMPUTE and assume it always completes successfully. If this this is not the case, the server closes the connection. This saves us a network round trip to the server. * Add -cpu to include cpu backend --------- Co-authored-by: firecoperana <firecoperana> Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>	2025-11-30 18:48:02 +01:00
firecoperana	1cad1ec1cc	Update grammar (#1023 ) * grammar : fix JSON Schema for string regex with top-level alt. (#9903) Prior to this commit, using a JSON Schema containing a string with `pattern` regular expression that uses top-level alternation (e.g. `"pattern": "^A\|B\|C\|D$"`) would result in invalid JSON output from the constrained sampling grammar, because it ended up creating a grammar rule like this for the string: ``` thing ::= "\"" "A" \| "B" \| "C" \| "D" "\"" space ``` Note that this rule will only match a starting quote for the "A" case, and will only match an ending quote for the "D" case, so this rule will always produce invalid JSON when used for sampling (that is, the JSON will always be lacking the starting quote, the ending quote, or both). This was fixed in a simple way by adding parentheses to the generated rule (for all string pattern rules, to keep it simple), such that the new generated rule looks like this (correct): ``` thing ::= "\"" ("A" \| "B" \| "C" \| "D") "\"" space ``` * grammars : add English-only grammar (#10612) * grammar : handle maxItems == 0 in JSON schema (#13117) Co-authored-by: Richard Lyons <frob@cloudstaff.com> * grammar-parser : fix possible null-deref (#9004) Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70680 Signed-off-by: David Korczynski <david@adalogics.com> * llama : fix typo in llama-grammar.h [no ci] (#11816) * * server: fix "--grammar-file" parameter (#12285) * common : use std::string_view now that we target c++17 (#14319) * json : support `enum` values within `allOf` (#15830) * grammar : use int64_t to avoid int overflows in int schema to grammar conversion logic (#16626) * grammar : support array references in json schema (#16792) * grammar : support array references in json schema * Update json-schema-to-grammar.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * grammar : improve regex when naming ref derived rules * grammar : replace non-conformant definitions array with anyOf test case --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> # Conflicts: # tests/test-json-schema-to-grammar.cpp * merge fix * llama : minor grammar refactor (#10897) * llama: fix error on bad grammar (#12628) * grammar : fix integer overflow (#17381) * Fix DoS / integer overflow * Remove optional, use INT64_MAX instead as placeholder value (it's technically -1, so it fits :) * White space * Actually, since it's unsigned, use UINT64_MAX # Conflicts: # src/llama-grammar.cpp * grammar: fix regression caused by #17381 (#17412) * grammar: fix regression caused by #17381 * more readable # Conflicts: # src/llama-grammar.cpp * Merge Fix * Fix warnings --------- Signed-off-by: David Korczynski <david@adalogics.com> Co-authored-by: Joe Eli McIlvain <joe.eli.mac@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: frob <rick+github@frob.com.au> Co-authored-by: Richard Lyons <frob@cloudstaff.com> Co-authored-by: DavidKorczynski <david@adalogics.com> Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com> Co-authored-by: firecoperana <firecoperana> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Aldehir Rojas <hello@alde.dev> Co-authored-by: Olivier Chafik <olivier.chafik@gmail.com> Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com> Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-30 18:45:38 +01:00
firecoperana	869557c8fd	Update mtmd to improve accuracy of M-RoPE (#993 ) * model : Granite docling + Idefics3 preprocessing (SmolVLM) (#16206) * feat: Add granite-docling conversion using trillion pretokenizer Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add granite-docling vocab pre enum Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use granite-docling pre Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add clip_is_idefics3 Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow multi-token boundary sequences for image templating Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add tiling support for idefices3 in clip.cpp This should likely be moved into llava_uhd::get_slice_instructions, but for now this avoids disrupting the logic there. Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Partial support for full templating for idefics3 in mtmd There are still errors encoding some of the image chunks, but the token sequence now matches transformers _almost_ perfectly, except for the double newline before the global image which shows up as two consecutive newline tokens instead of a single double-newline token. I think this is happening because the blocks are tokenized separately then concatenated. Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Fully working image preprocessing for idefics3 w/ resize and slicing Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Parse the preprocessor config's longest side and add it to the mmproj hparams Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use the longest side instead of size * scale_factor For Granite Docling, these come out to the same value, but that was just a conicidence. Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Allow batch encoding and remove clip_is_idefics3 Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove unnecessary conditionals for empty token vectors Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use image_manipulation util Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * add test model --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> # Conflicts: # convert_hf_to_gguf.py # convert_hf_to_gguf_update.py # gguf-py/gguf/constants.py # gguf-py/gguf/gguf_writer.py # src/llama-vocab.cpp # src/llama-vocab.h * mtmd : support home-cooked Mistral Small Omni (#14928) * model : add LightOnOCR-1B model (#16764) * model : add LightOnOCR-1B model * add test # Conflicts: # convert_hf_to_gguf.py # gguf-py/gguf/constants.py * mtmd : fix idefics3 preprocessing (#16806) * mtmd : fix idefics3 preprocessing * disable granite test * fix test for granite * model: Add support for CogVLM model (#15002) * Added GGUF mappings for CogVLM model * Add tensor mapping for CogVLM visual encoder * Add CogVLM to conversion script, no vision part yet * Added CogVLM vision model to conversion script * Add graph for CogVLM CLIP model * Add graph for CogVLM * Fixes for CogVLM. Now compiles. * Model now runs * Fixes for cogvlm graph * Account for graph context change after rebase * Changes for whitespace * Changes in convert script according to comments * Switch CogVLM LLM graph to merged QKV tensor * Use rope_type variable instead of direct definition * Change CogVLM CLIP encoder to use SWIGLU * Switch CogVLM CLIP to use merged QKV * Apply rebase edits and remove ggml_cont call that is now unnecessary * clean up --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> # Conflicts: # convert_hf_to_gguf.py # examples/mtmd/clip.cpp # gguf-py/gguf/constants.py # gguf-py/gguf/tensor_mapping.py # src/llama-arch.cpp # src/llama-arch.h # src/llama-model.cpp # src/llama-model.h * mtmd: refactor preprocessing + support max/min pixels (#16878) * mtmd: refactor preprocessing + support max/min pixels * fix mlp type * implement mix/max pixels * improve hparams * better image preproc for qwen * fix * fix out of bound composite * fix (2) * fix token calculation * get_merge_kernel_size() * fix llama4 and lfm2 * gonna fix them all * use simple resize for qwen * qwen: increase min tokens * no resize if dst size == src size * restore to initial min/max tokens value for qwen # Conflicts: # examples/mtmd/clip.cpp * clip : use FA (#16837) * clip : use FA * cont : add warning about unsupported ops * implement "auto" mode for clip flash attn * clip : print more detailed op support info during warmup * cont : remove obsolete comment [no ci] * improve debugging message * trailing space * metal : remove stray return --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> * model: add Janus Pro for image understanding (#16906) * Add support for Janus Pro * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Address reviewer suggestions Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Add JANUS_PRO constant * Update clip model handling Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> * Update tools/mtmd/clip.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Refactor JANUS_PRO handling in clip.cpp Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> * Update tools/mtmd/clip.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * em whitespace --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> # Conflicts: # convert_hf_to_gguf.py # gguf-py/gguf/constants.py # gguf-py/gguf/tensor_mapping.py * mtmd: pad mask for qwen2.5vl (#16954) * mtmd: pad mask for qwen2.5vl * improve * mtmd: add --image-min/max-tokens (#16921) * mtmd: improve struct initialization (#16981) * mtmd: allow QwenVL to process larger image by default (#17020) * Disable flash attention * mtmd : fix embedding size for image input (#17123) * mtmd: fix patch_size initialized to random value in audio models (#17128) * mtmd: fix patch_size initialized to random value in audio models * add default hparams * add llama_model_n_embd_inp * Fix load qwen3 vl Change batch size * Add description * Fix cli build error --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Tianyue-Zhao <zhaotianyue@outlook.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Zhiyong Wang <85110830+ravenouse@users.noreply.github.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> Co-authored-by: firecoperana <firecoperana>	2025-11-29 07:27:15 +01:00
hksdpc255	d24ea9e48e	server : add Anthropic Messages API support (#1012 )	2025-11-29 07:24:59 +01:00
Kawrakow	45cd1a70f5	Fix llama-bench mla parameter (#1016 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-27 09:33:30 +01:00
firecoperana	0b126b2ca6	Fix prompt tokenization issue during prompt processing (#1008 ) * Find common tokens between prompt and cache Fix wrong context size usage for mtmd Use start position of common part server: handle context shift * Add size check for inexact match * Change --------- Co-authored-by: firecoperana <firecoperana>	2025-11-26 10:34:26 +01:00
firecoperana	07d08e15ad	webui update (#1003 ) webui: add system message in export conversation, support upload conversation with system message Webui: show upload only when in new conversation Webui: Add model name webui: increase height of chat message window when clicking editing Webui: autoclose settings dialog dropdown and maximze screen width when zoom in webui: fix date issues and add more dates webui: change error to toast.error. server: add n_past and slot_id in props_simple webui: add cache tokens, context and prompt speed in chat webui: modernize ui webui: change welcome message webui: change speed display webui: change run python icon webui: add config to use server defaults for sampler webui: put speed on left and context on right webui: recognize AsciiDoc files as valid text files (#16850) * webui: recognize AsciiDoc files as valid text files * webui: add an updated static webui build * webui: add the updated dependency list * webui: re-add an updated static webui build Add a setting to display message generation statistics (#16901) * feat: Add setting to display message generation statistics * chore: build static webui output webui: add HTML/JS preview support to MarkdownContent with sandboxed iframe (#16757) * webui: add HTML/JS preview support to MarkdownContent with sandboxed iframe dialog Extended MarkdownContent to flag previewable code languages, add a preview button alongside copy controls, manage preview dialog state, and share styling for the new button group Introduced CodePreviewDialog.svelte, a sandboxed iframe modal for rendering HTML/JS previews with consistent dialog controls * webui: fullscreen HTML preview dialog using bits-ui * Update tools/server/webui/src/lib/components/app/misc/CodePreviewDialog.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/components/app/misc/MarkdownContent.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: pedantic style tweak for CodePreviewDialog close button * webui: remove overengineered preview language logic * chore: update webui static build --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> webui: auto-refresh /props on inference start to resync model metadata (#16784) * webui: auto-refresh /props on inference start to resync model metadata - Add no-cache headers to /props and /slots - Throttle slot checks to 30s - Prevent concurrent fetches with promise guard - Trigger refresh from chat streaming for legacy and ModelSelector - Show dynamic serverWarning when using cached data * fix: restore proper legacy behavior in webui by using unified /props refresh Updated assistant message bubbles to show each message's stored model when available, falling back to the current server model only when the per-message value is missing When the model selector is disabled, now fetches /props and prioritizes that model name over chunk metadata, then persists it with the streamed message so legacy mode properly reflects the backend configuration * fix: detect first valid SSE chunk and refresh server props once * fix: removed the slots availability throttle constant and state * webui: purge ai-generated cruft * chore: update webui static build feat(webui): improve LaTeX rendering with currency detection (#16508) * webui : Revised LaTeX formula recognition * webui : Further examples containg amounts * webui : vitest for maskInlineLaTeX * webui: Moved preprocessLaTeX to lib/utils * webui: LaTeX in table-cells * chore: update webui build output (use theirs) * webui: backslash in LaTeX-preprocessing * chore: update webui build output * webui: look-behind backslash-check * chore: update webui build output * Apply suggestions from code review Code maintenance (variable names, code formatting, string handling) Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: Moved constants to lib/constants. * webui: package woff2 inside base64 data * webui: LaTeX-line-break in display formula * chore: update webui build output * webui: Bugfix (font embedding) * webui: Bugfix (font embedding) * webui: vite embeds assets * webui: don't suppress 404 (fonts) * refactor: KaTeX integration with SCSS Moves KaTeX styling to SCSS for better customization and font embedding. This change includes: - Adding `sass` as a dev dependency. - Introducing a custom SCSS file to override KaTeX variables and disable TTF/WOFF fonts, relying solely on WOFF2 for embedding. - Adjusting the Vite configuration to resolve `katex-fonts` alias and inject SCSS variables. * fix: LaTeX processing within blockquotes * webui: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> server : add props.model_alias (#16943) * server : add props.model_alias webui: fix keyboard shortcuts for new chat & edit chat title (#17007) Better UX for handling multiple attachments in WebUI (#17246) webui: add OAI-Compat Harmony tool-call streaming visualization and persistence in chat UI (#16618) * webui: add OAI-Compat Harmony tool-call live streaming visualization and persistence in chat UI - Purely visual and diagnostic change, no effect on model context, prompt construction, or inference behavior - Captured assistant tool call payloads during streaming and non-streaming completions, and persisted them in chat state and storage for downstream use - Exposed parsed tool call labels beneath the assistant's model info line with graceful fallback when parsing fails - Added tool call badges beneath assistant responses that expose JSON tooltips and copy their payloads when clicked, matching the existing model badge styling - Added a user-facing setting to toggle tool call visibility to the Developer settings section directly under the model selector option * webui: remove scroll listener causing unnecessary layout updates (model selector) * Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * chore: npm run format & update webui build output * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> webui: Fix clickability around chat processing statistics UI (#17278) * fix: Better pointer events handling in chat processing info elements * chore: update webui build output Fix merge error webui: Add a "Continue" Action for Assistant Message (#16971) * feat: Add "Continue" action for assistant messages * feat: Continuation logic & prompt improvements * chore: update webui build output * feat: Improve logic for continuing the assistant message * chore: update webui build output * chore: Linting * chore: update webui build output * fix: Remove synthetic prompt logic, use the prefill feature by sending the conversation payload ending with assistant message * chore: update webui build output * feat: Enable "Continue" button based on config & non-reasoning model type * chore: update webui build output * chore: Update packages with `npm audit fix` * fix: Remove redundant error * chore: update webui build output * chore: Update `.gitignore` * fix: Add missing change * feat: Add auto-resizing for Edit Assistant/User Message textareas * chore: update webui build output Improved file naming & structure for UI components (#17405) * refactor: Component iles naming & structure * chore: update webui build output * refactor: Dialog titles + components namig * chore: update webui build output * refactor: Imports * chore: update webui build output webui: hide border of button webui: update webui: update webui: update add vision webui: minor settings reorganization and add disable autoscroll option (#17452) * webui: added a dedicated 'Display' settings section that groups visualization options * webui: added a Display setting to toggle automatic chat scrolling * chore: update webui build output Co-authored-by: firecoperana <firecoperana>	2025-11-24 07:03:45 +01:00
gapeleon	f6163dd58f	Fix: Register missing /apply-template endpoint (#999 ) The handle_apply_template handler was defined but never registered as an HTTP endpoint, causing 404 errors when calling POST /apply-template. This commit adds the missing endpoint registration to match the functionality available in llama.cpp mainline. Co-authored-by: Gapeleon <gapeleon@users.noreply.github.com>	2025-11-24 06:53:15 +01:00
Yap Sok Ann	7505165dee	Fix truncated logprobs when streaming is off (#998 ) The logic to skip the logprobs of the stop token was originally from ggml-org/llama.cpp#2849, and was later modified as part of ggml-org/llama.cpp#10643 to be applied only to STOP_TYPE_WORD. The latter change wasn't included in #723. Then, after #958 got merged, the logic got inadvertently applied to GLM-4.5/4.6 and Kimi K2, resulting in truncated logprobs when streaming is off. This commit reverts the logic from ggml-org/llama.cpp#2849, such that the logprobs of the stop token will always be included in the response, when logprobs is enabled. From testing, this matches with the behavior of Fireworks inference server, for both chat completions and text completions endpoints. Also fix logprobs param handling for the text completion endpoint.	2025-11-24 06:52:15 +01:00
Kawrakow	0f6986a33c	Disable split mode "row" (#987 ) * Disable split mode "row" * Also llama-bench --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-19 16:15:50 +01:00
firecoperana	bacb8fb79f	Server: Handle context shift better to reduce prompt processing time (#973 ) * Handle context shift better to reduce pp Add context-shift args Add back ga_n in context shift * optimize discard function and bring back n_keep = -1 --------- Co-authored-by: firecoperana <firecoperana>	2025-11-19 16:04:48 +01:00
firecoperana	b40d11b22d	Fix kv cache save and load for GLM model (#965 ) Co-authored-by: firecoperana <firecoperana>	2025-11-15 17:04:16 +02:00
firecoperana	5ec0def0ef	Fix compiler warnings (#963 ) * Fix changes meaning warnings * A couple of more warnings and formatting --------- Co-authored-by: firecoperana <firecoperana> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-15 07:07:15 +02:00
firecoperana	bb358223cd	server: cache prompt to host memory (#954 ) * server : host-memory prompt caching change similarity calculation and prompt save conditions Remove unneeded token limit rename variable Separate prompt save and load logic change default values change log remove truncate prompt logic * add description * bug fixes * remove token limit in init --------- Co-authored-by: firecoperana <firecoperana>	2025-11-14 18:40:13 +02:00
firecoperana	177b5d2a47	Fix cuda init error in rpc (#957 ) Co-authored-by: firecoperana <firecoperana>	2025-11-14 06:59:54 +02:00
Kawrakow	6b9d1bf4b4	Graph reuse (#947 ) * Add mainline compatible FA command line option * Graph reuse: add command line argument to turn it on * WIP * This seems to work * This is perhaps cleaner * Change the command line option to -gr --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-14 06:58:19 +02:00
Kawrakow	88c02fa108	Set default MLA to 3 also in llama-bench (#949 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-13 09:52:06 +02:00
Kawrakow	25cd985c9b	Add --n-cpu-moe to llama_bench (#937 ) * Add --n-cpu-moe to llama_banch * Add usage --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-11 08:44:59 +02:00
Kawrakow	121ed91165	Add rcache to llama-bench (#936 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-11 08:06:18 +02:00
firecoperana	eea6cc4433	Server: Add --draft-params to set draft model parameter via command line args (#932 ) * Add command line argument for draft model * Remove second context of draft model * Format print * print usage if parsing -draft fails --------- Co-authored-by: firecoperana <firecoperana>	2025-11-10 09:51:07 +02:00
firecoperana	73c28dbef4	server: bug fix for preserved_tokens not preserved in process_token (#926 ) Co-authored-by: firecoperana <firecoperana>	2025-11-09 14:16:29 +02:00
firecoperana	b63309a918	Fix embedding missing, CORS and crash using verbose in server (#924 ) * server: fix crash when prompt has image and is too long * server: fix CORS * server: fix empty result for embedding * change error message to truncate prompt * server: fix slot id for save and load state * bug fix * server: update slot similarity to handle mtmd * server: quick hack to calculate number of token processed with image * server: fix out of range error when detokenizing prompt under verbose * Add back Access-Control-Allow-Origin * Server: Add prompt tokens in embedding results --------- Co-authored-by: firecoperana <firecoperana>	2025-11-09 14:16:03 +02:00
Nexes the Elder	f9a411e5db	More informative PPL readout line (#914 ) * More informative PPL readout line * trailing whitespace..	2025-11-07 16:41:24 +02:00
Kawrakow	532a05e466	CUDA: set compute parameters via command line arguments (#910 ) * cuda: set compute parameters via command line arguments * Also llama-bench --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-07 07:11:23 +02:00
Kawrakow	66ef68bc14	Fix compiler warning	2025-11-06 07:12:07 +02:00
firecoperana	18f5a6caef	Bug fixes for completions and prompt caching in server (#906 ) * Bug fixes for completions and prompt caching in server * Fix compiler warning about redefinition --------- Co-authored-by: firecoperana <firecoperana>	2025-11-06 07:10:51 +02:00
Kawrakow	e68f50be9a	Allow quantization of ffn_gate_inp (#896 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-05 10:44:32 +02:00
firecoperana	7978f04996	Add vision support in llama-server (#901 ) * server: add support for vision model webui: add support for vision model * server : remove hack for extra parallel slot#10187 * llama : fix KV shift for qwen2vl #13870 * add no-context-shift parameter --------- Co-authored-by: firecoperana <firecoperana>	2025-11-05 10:43:46 +02:00
Thireus ☠	86597623a5	Port of Qwen3-VL support from mainline (#883 ) * Port of Qwen3-VL for latest ik_llama.cpp - convert_hf_to_gguf.py - Not touched, use llama.cpp to convert model instead - sysl and metal support for imrope not added - Vulkan support for imrope not tested - Code not tested * Bugfix n_embd was declared multiple times https://github.com/ikawrakow/ik_llama.cpp/pull/883#issuecomment-3471179655 * Fix n_embd issue with qwen3vl * model.output tensor not required https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2480388389 * Improved logic for qkv combined tensors `59ceaf8fcb (r2480395800)` `59ceaf8fcb (r2480398187)` * Fix n_embd for merge_qkv() + cleaner code https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2481227395 * Revert TENSOR_NOT_REQUIRED	2025-11-04 19:20:54 +02:00
Kawrakow	efcb5f9d9e	sweep-bench: be able to set TG tokens via -n (#897 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-04 14:39:30 +02:00
Kawrakow	cfb840379f	Biased mmvq: minor optimization (#880 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-31 14:21:18 +02:00
jarrodfeaks	92517e74ad	fix v1/chat/completions assistant prefill (#874 )	2025-10-29 17:21:05 +02:00
Nexes the Elder	e68dabc242	A few server commits from mainline. (#872 ) server : handle models with missing EOS token (#8997) server : fix segfault on long system prompt (#8987) * server : fix segfault on long system prompt * server : fix parallel generation with very small batch sizes * server : fix typo in comment server : init stop and error fields of the result struct (#9026) server : fix duplicated n_predict key in the generation_settings (#8994) server : support reading arguments from environment variables (#9105) * server : support reading arguments from environment variables * add -fa and -dt * readme : specify non-arg env var server : add some missing env variables (#9116) * server : add some missing env variables * add LLAMA_ARG_HOST to server dockerfile * also add LLAMA_ARG_CONT_BATCHING Credits are to the respective authors. Not a single merge conflict occurred. Compiled, then tested without bug.	2025-10-28 09:58:31 +02:00
firecoperana	904e994bfb	Support --device and --device-draft parameter (#866 ) * add --device and --device-draft parameter * don't print debug message in release mode * fix * bug fix to throw exception when no device specified * add const --------- Co-authored-by: firecoperana <firecoperana>	2025-10-27 18:13:28 +02:00
firecoperana	bf991ba60a	Add --webui arg to launch llama.cpp new webui (#786 ) * Add new webui from llama.cpp * Add new webui * feat: Improve mobile UI for Settings Dialog (#16084) * feat: Improve mobile UI for Settings Dialog * chore: update webui build output * fix: Linting errors * chore: update webui build output # Conflicts: # examples/server/webui_llamacpp/src/lib/components/app/chat/ChatSettings/ChatSettingsFields.svelte # examples/server/webui_llamacpp/src/lib/components/app/chat/ChatSettings/ChatSettingsSection.svelte # tools/server/public/index.html.gz * webui : fix handling incomplete chunks (#16107) * Always show message actions for mobile UI + improvements for user message sizing (#16076) # Conflicts: # .gitignore # examples/server/webui_llamacpp/package.json # examples/server/webui_llamacpp/scripts/dev.sh # tools/server/webui/scripts/post-build.sh * webui: switch to hash-based routing (alternative of #16079) (#16157) * Switched web UI to hash-based routing * Added hash to missed goto function call * Removed outdated SPA handling code * Fixed broken sidebar home link # Conflicts: # examples/server/webui_llamacpp/src/routes/+layout.ts # tools/server/server.cpp * Allow viewing conversations even when llama server is down (#16255) * webui: allow viewing conversations and sending messages even if llama-server is down - Cached llama.cpp server properties in browser localStorage on startup, persisting successful fetches and reloading them when refresh attempts fail so the chat UI continues to render while the backend is unavailable. - Cleared the stored server properties when resetting the store to prevent stale capability data after cache-backed operation. - Kept the original error-splash behavior when no cached props exist so fresh installs still surface a clear failure state instead of rendering stale data. * feat: Add UI for `props` endpoint unavailable + cleanup logic * webui: extend cached props fallback to offline errors Treat connection failures (refused, DNS, timeout, fetch) the same way as server 5xx so the warning banner shows up when cache is available, instead of falling back to a full error screen. * webui: Left the chat form enabled when a server warning is present so operators can keep sending messages e.g., to restart the backend over llama-swap, even while cached /props data is in use * chore: update webui build output --------- Co-authored-by: Pascal <admin@serveurperso.com> # Conflicts: # examples/server/webui_llamacpp/src/lib/components/app/chat/ChatScreen/ChatScreenWarning.svelte # examples/server/webui_llamacpp/src/lib/constants/localstorage-keys.ts * Enhance text file detection logic for file attachments (#16199) * feat: Enhances text file detection logic * chore: Build static `webui` output * chore: update webui build output # Conflicts: # examples/server/webui_llamacpp/src/lib/constants/binary-detection.ts * Show message actions by default (#16289) * fix: preserved zero values in chat settings inputs and textareas by switching to nullish coalescing for field values and default placeholders (#16312) * Improve Mobile UI for dialogs and action dropdowns (#16222) * fix: Always show conversation item actions * feat: Improve Alert Dialog and Dialog mobile UI * feat: Add settings reset to default confirmation * fix: Close Edit dialog on save * chore: update webui build output * webui: implement proper z-index system and scroll management - Add CSS variable for centralized z-index control - Fix dropdown positioning with Settings dialog conflicts - Prevent external scroll interference with proper event handling - Clean up hardcoded z-index values for maintainable architecture * webui: ensured the settings dialog enforces dynamic viewport height on mobile while retaining existing desktop sizing overrides * feat: Use `dvh` instead of computed px height for dialogs max height on mobile * chore: update webui build output * feat: Improve Settings fields UI * chore: update webui build output * chore: update webui build output --------- Co-authored-by: Pascal <admin@serveurperso.com> * Fix thinking blocks with quotes + add handling `[THINK]...[/THINK]` blocks (#16326) * fix: prevent reasoning blocks with quotes from being truncated * chore: update webui build output * feat: Improve thinking content parsing * test: Adds ChatMessage component stories for different thinking blocks * chore: update webui build output * fix: ChatMessage story fix --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Chatapi ignore empty sampling (#16330) * fix: skip empty sampling fields instead of coercing to 0 in chat API options * chore: update webui build output * webui: Remove running `llama-server` within WebUI `dev.sh` script (#16363) * Add optional setting for showing "Model used:" information (#16337) * feat: Add a setting to include model name used to generate the message * feat: UI improvements * feat: Save model info along with the database message entry creation * chore: Build webui static output * Improve code block color theming (#16325) * feat: Improve code block theming * chore: update webui build output * chore: Update webui static build * Conversation action dialogs as singletons from Chat Sidebar + apply conditional rendering for Actions Dropdown for Chat Conversation Items (#16369) * fix: Render Conversation action dialogs as singletons from Chat Sidebar level * chore: update webui build output * fix: Render Actions Dropdown conditionally only when user hovers conversation item + remove unused markup * chore: Update webui static build * fix: Always truncate conversation names * chore: Update webui static build * fix: track viewportHeight via window.innerHeight to avoid unwanted scrolling (#16356) Use <svelte:window bind:innerHeight> instead of manual resize listener Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui : Fix messages payload sent to chat completions (#16402) * fix: Include just the currently active message branches instead of all in chat completions request * chore: Build webui static output * chore: Formatting * chore: update webui build output * Capture model name only after first token (streaming) or completed request (#16405) * feat: Capture model name only after first token (streaming) or completed request (non-streaming) * chore: update webui build output * chore: update webui build output * Fix missing messages on sibling navigation (#16408) * fix: resolve message disappearing issue when navigating between regenerated siblings by using current leaf nodes instead of cached sibling IDs * chore: update webui build output * chore: update webui build output * webui : added download action (#13552) (#16282) * webui : added download action (#13552) * webui : import and export (for all conversations) * webui : fixed download-format, import of one conversation * webui : add ExportedConversations type for chat import/export * feat: Update naming & order * chore: Linting * webui : Updated static build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * refactor: centralize CoT parsing in backend for streaming mode (#16394) * refactor: unify reasoning handling via backend reasoning_content, drop frontend tag parsing - Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing - Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops - Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing legacy <think> parsing logic - Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages * refactor: implement streaming-aware universal reasoning parser Remove the streaming mode limitation from --reasoning-format by refactoring try_parse_reasoning() to handle incremental parsing of <think> tags across all formats. - Rework try_parse_reasoning() to track whitespace, partial tags, and multiple reasoning segments, allowing proper separation of reasoning_content and content in streaming mode - Parse reasoning tags before tool call handling in content-only and Llama 3.x formats to ensure inline <think> blocks are captured correctly - Change default reasoning_format from 'auto' to 'deepseek' for consistent behavior - Add 'deepseek-legacy' option to preserve old inline behavior when needed - Update CLI help and documentation to reflect streaming support - Add parser tests for inline <think>...</think> segments The parser now continues processing content after </think> closes instead of stopping, enabling proper message.reasoning_content and message.content separation in both streaming and non-streaming modes. Fixes the issue where streaming responses would dump everything (including post-thinking content) into reasoning_content while leaving content empty. * refactor: address review feedback from allozaur - Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component - Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse - Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * refactor: restore forced reasoning prefix to pass test-chat ([chat] All tests passed) - store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block - inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication - repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows * refactor: address review feedback from ngxson * debug: say goodbye to curl -N, hello one-click raw stream - adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering * Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessage.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: add Storybook example for raw LLM output and scope reasoning format toggle per story - Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample - Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example * npm run format * chat-parser: address review feedback from ngxson Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> # Conflicts: # common/arg.cpp # examples/server/webui_llamacpp/src/lib/utils/thinking.ts # tools/server/README.md * No markdown in cot (#16483) * fix: let the model think in plaintext * chore: npm run format + npm run build * webui: updated the chat service to only include max_tokens in the req… (#16489) * webui: updated the chat service to only include max_tokens in the request payload when the setting is explicitly provided, while still mapping explicit zero or null values to the infinite-token sentinel * chore: update webui build output * feat: render user content as markdown option (#16358) * feat: render user content as markdown option - Add a persisted 'renderUserContentAsMarkdown' preference to the settings defaults and info metadata so the choice survives reloads like other options - Surface the new 'Render user content as Markdown' checkbox in the General section of the chat settings dialog, beneath the PDF toggle - Render user chat messages with 'MarkdownContent' when the new setting is enabled, matching assistant formatting while preserving the existing card styling otherwise - chore: update webui build output * chore: update webui build output * webui: remove client-side context pre-check and rely on backend for limits (#16506) * fix: make SSE client robust to premature [DONE] in agentic proxy chains * webui: remove client-side context pre-check and rely on backend for limits Removed the client-side context window pre-check and now simply sends messages while keeping the dialog imports limited to core components, eliminating the maximum context alert path Simplified streaming and non-streaming chat error handling to surface a generic 'No response received from server' error whenever the backend returns no content Removed the obsolete maxContextError plumbing from the chat store so state management now focuses on the core message flow without special context-limit cases * webui: cosmetic rename of error messages * Update tools/server/webui/src/lib/stores/chat.svelte.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/stores/chat.svelte.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> # Conflicts: # examples/server/webui_llamacpp/src/lib/components/app/dialogs/ChatErrorDialog.svelte # examples/server/webui_llamacpp/src/lib/components/app/dialogs/MaximumContextAlertDialog.svelte # examples/server/webui_llamacpp/src/lib/services/context.ts * fix: add remark plugin to render raw HTML as literal text (#16505) * fix: add remark plugin to render raw HTML as literal text Implemented a missing MDAST stage to neutralize raw HTML like major LLM WebUIs do ensuring consistent and safe Markdown rendering Introduced 'remarkLiteralHtml', a plugin that converts raw HTML nodes in the Markdown AST into plain-text equivalents while preserving indentation and line breaks. This ensures consistent rendering and prevents unintended HTML execution, without altering valid Markdown structure Kept 'remarkRehype' in the pipeline since it performs the required conversion from MDAST to HAST for KaTeX, syntax highlighting, and HTML serialization Refined the link-enhancement logic to skip unnecessary DOM rewrites, fixing a subtle bug where extra paragraphs were injected after the first line due to full innerHTML reconstruction, and ensuring links open in new tabs only when required Final pipeline: remarkGfm -> remarkMath -> remarkBreaks -> remarkLiteralHtml -> remarkRehype -> rehypeKatex -> rehypeHighlight -> rehypeStringify * fix: address review feedback from allozaur * chore: update webui build output # Conflicts: # examples/server/webui_llamacpp/src/lib/constants/literal-html.ts * Add server-driven parameter defaults and syncing (#16515) # Conflicts: # examples/server/webui_llamacpp/src/lib/components/app/chat/ChatSettings/ParameterSourceIndicator.svelte # examples/server/webui_llamacpp/src/lib/constants/precision.ts # examples/server/webui_llamacpp/src/lib/services/parameter-sync.spec.ts # examples/server/webui_llamacpp/src/lib/services/parameter-sync.ts # examples/server/webui_llamacpp/src/lib/utils/config-helpers.ts # examples/server/webui_llamacpp/src/lib/utils/precision.ts * fix: added a normalization step for MathJax-style \[\] and delimiters (#16599) * fix: added a normalization step for MathJax-style \[\] and delimiters So inline and block equations are converted before KaTeX rendering, enabling proper display of model-generated LaTeX in the WebUI * chore: update webui build output * webui: reorganize settings layout (#16607) * webui: reorganize settings layout * chore: update webui build output * fix: remove unused variable * chore: update webui build output * Enable per-conversation loading states to allow having parallel conversations (#16327) * feat: Per-conversation loading states and tracking streaming stats * chore: update webui build output * refactor: Chat state management Consolidates loading state management by using a global `isLoading` store synchronized with individual conversation states. This change ensures proper reactivity and avoids potential race conditions when updating the UI based on the loading status of different conversations. It also improves the accuracy of statistics displayed. Additionally, slots service methods are updated to use conversation IDs for per-conversation state management, avoiding global state pollution. * feat: Adds loading indicator to conversation items * chore: update webui build output * fix: Fix aborting chat streaming Improves the chat stream abortion process by ensuring that partial responses are saved before the abort signal is sent. This avoids a race condition where the onError callback could clear the streaming state before the partial response is saved. Additionally, the stream reading loop and callbacks are now checked for abort signals to prevent further processing after abortion. * refactor: Remove redundant comments * chore: build webui static output * refactor: Cleanup * chore: update webui build output * chore: update webui build output * fix: Conversation loading indicator for regenerating messages * chore: update webui static build * feat: Improve configuration * feat: Install `http-server` as dev dependency to not need to rely on `npx` in CI * Import/Export UX improvements (#16619) * webui : added download action (#13552) * webui : import and export (for all conversations) * webui : fixed download-format, import of one conversation * webui : add ExportedConversations type for chat import/export * feat: Update naming & order * chore: Linting * feat: Import/Export UX improvements * chore: update webui build output * feat: Update UI placement of Import/Export tab in Chat Settings Dialog * refactor: Cleanup chore: update webui build output * feat: Enable shift-click multiple conversation items selection * chore: update webui static build * chore: update webui static build --------- Co-authored-by: Sascha Rogmann <github@rogmann.org> # Conflicts: # examples/server/webui_llamacpp/src/lib/components/app/chat/ChatSettings/ConversationSelectionDialog.svelte # examples/server/webui_llamacpp/src/lib/components/app/chat/ChatSettings/ImportExportTab.svelte # examples/server/webui_llamacpp/src/lib/utils/conversation-utils.ts * Prevent premature submission on IME input (#16673) * fix: Prevent premature submission on IME input * chore: update webui static build * refactor: Put IME completion checker in a helper function and add checking for `KeyboardEvent.eventKey === 229` * chore: update webui static build * chore: update webui static build * chore: update webui static build # Conflicts: # examples/server/webui_llamacpp/src/lib/utils/is-ime-composing.ts * Handle legacy 'context' attachments (#16687) * webui: introduce OpenAI-compatible model selector in JSON payload (#16562) * webui: introduce OpenAI-compatible model selector in JSON payload * webui: restore OpenAI-Compatible model source of truth and unify metadata capture This change re-establishes a single, reliable source of truth for the active model: fully aligned with the OpenAI-Compat API behavior It introduces a unified metadata flow that captures the model field from both streaming and non-streaming responses, wiring a new onModel callback through ChatService The model name is now resolved directly from the API payload rather than relying on server /props or UI assumptions ChatStore records and persists the resolved model for each assistant message during streaming, ensuring consistency across the UI and database Type definitions for API and settings were also extended to include model metadata and the onModel callback, completing the alignment with OpenAI-Compat semantics * webui: address review feedback from allozaur * webui: move model selector into ChatForm (idea by @allozaur) * webui: make model selector more subtle and integrated into ChatForm * webui: replaced the Flowbite selector with a native Svelte dropdown * webui: add developer setting to toggle the chat model selector * webui: address review feedback from allozaur Normalized streamed model names during chat updates by trimming input and removing directory components before saving or persisting them, so the conversation UI shows only the filename Forced model names within the chat form selector dropdown to render as a single-line, truncated entry with a tooltip revealing the full name * webui: toggle displayed model source for legacy vs OpenAI-Compat modes When the selector is disabled, it falls back to the active server model name from /props When the model selector is enabled, the displayed model comes from the message metadata (the one explicitly selected and sent in the request) * Update tools/server/webui/src/lib/components/app/chat/ChatForm/ChatFormActions.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/constants/localstorage-keys.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/components/app/chat/ChatForm/ChatFormModelSelector.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/services/chat.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/services/chat.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: refactor model selector and persistence helpers - Replace inline portal and event listeners with proper Svelte bindings - Introduce 'persisted' store helper for localStorage sync without runes - Extract 'normalizeModelName' utils + Vitest coverage - Simplify ChatFormModelSelector structure and cleanup logic Replaced the persisted store helper's use of '$state/$effect' runes with a plain TS implementation to prevent orphaned effect runtime errors outside component context Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: document normalizeModelName usage with inline examples * Update tools/server/webui/src/lib/components/app/chat/ChatForm/ChatFormModelSelector.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/stores/models.svelte.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/stores/models.svelte.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: extract ModelOption type into dedicated models.d.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: refine ChatMessageAssistant displayedModel source logic * webui: stabilize dropdown, simplify model extraction, and init assistant model field * chore: update webui static build * Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * chore: npm format, update webui static build * webui: align sidebar trigger position, remove z-index glitch * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> # Conflicts: # examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormModelSelector.svelte # examples/server/webui_llamacpp/src/lib/services/models.ts # examples/server/webui_llamacpp/src/lib/stores/models.svelte.ts # examples/server/webui_llamacpp/src/lib/stores/persisted.svelte.ts # examples/server/webui_llamacpp/src/lib/types/models.d.ts # examples/server/webui_llamacpp/src/lib/utils/model-names.test.ts # examples/server/webui_llamacpp/src/lib/utils/model-names.ts # examples/server/webui_llamacpp/src/lib/utils/portal-to-body.ts * webui: support q URL parameter (#16728) * webui: support q URL parameter Fixes #16722 I’ve checked that it works with Firefox’s AI tools * webui: apply suggestions from code review Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * chore: update webui static build --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * build fix --------- Co-authored-by: firecoperana <firecoperana> Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> Co-authored-by: Quentin Bramas <quentin.bramas@gmail.com> Co-authored-by: Isaac McFadyen <isaac@imcf.me> Co-authored-by: Pascal <admin@serveurperso.com> Co-authored-by: Sascha Rogmann <59577610+srogmann@users.noreply.github.com> Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> Co-authored-by: Sascha Rogmann <github@rogmann.org> Co-authored-by: Florian Badie <florianbadie@odrling.xyz>	2025-10-27 14:22:02 +02:00
firecoperana	40eabe5dd8	fix bugs (#870 ) Co-authored-by: firecoperana <firecoperana>	2025-10-27 14:17:48 +02:00
Kawrakow	41d6c42b96	Change flash attention and fmoe to be on by default (#863 ) * Change fmoe to be on by default * Change default fmoe also in llama-bench * Change flash attention to be on by default --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-25 09:37:28 +03:00
Kawrakow	36f9601e8d	Make ooae on by default and add to llama-bench (#842 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-20 08:32:41 +03:00
Kawrakow	cde642e591	Grouped expert routing (CPU only) (#836 ) * Better argsort (CPU) * Attemt at grouped topk * This seems to do the trick for grouped experts routing * Cleanup * Trying to merge, something is not right * Working merged grouped top_k (CPU) * Add command line option to enable grouped expert routing * Add grouped expert routing option to llama-bench --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-16 14:57:02 +03:00

1 2 3 4 5 ...

1106 Commits