ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-01-26 17:20:01 +00:00

Author	SHA1	Message	Date
Kawrakow	844a8b0bfa	Fix sync logic (#1064 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-13 18:40:49 +01:00
Kawrakow	2e04b7cbef	Undo sync reduction (#1063 ) I'm finding issues for Qwen3-MoE Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-13 16:58:32 +01:00
Kawrakow	093cc7c380	Do not use split mode graph scheduling if there are tensor overrides (#1060 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-12 14:48:38 +01:00
Kawrakow	b3a19a6f37	Fix overflow in offset calculation in mmq (#1059 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-12 14:31:06 +01:00
Kawrakow	53fb7a4118	Be able to enable or disable P2P via command line argument (#1058 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-12 13:36:42 +01:00
Kawrakow	f65fefa36c	Slightly faster TG for split mode "graph" (#1057 ) * Rearrange graph nodes So that we can do graph portions that are the same on 2 or more GPUs at the same time. * Separate graph compute implementation for split mode graph * This is better --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-12 07:54:37 +01:00
Kawrakow	bf03f63c34	Fix #1055 (#1056 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-11 14:44:32 +01:00
abc-nix	37e41d22dc	enable peer access (NVlink) (#1050 ) * enable peer access for cuda * Remove redundant loop	2025-12-11 08:31:56 +01:00
Kawrakow	279b4c44fc	Fix the fix (#1054 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-11 08:05:33 +01:00
Kawrakow	22863cf9c9	Be able to set a max. number of GPUs to be used in split mode graph (#1051 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-11 07:22:53 +01:00
Kawrakow	a2efa22f10	Fix llama-bench - missing buffer override comparison operator (#1053 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-11 07:21:06 +01:00
Kawrakow	02206cff46	Reduce back-end syncs (#1049 ) * Reduce backend synchronization calls * Also this --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-11 07:04:44 +01:00
i4TsU	e61cd0303a	QoL/bugfixes for llama-bench (#1052 ) * include cuda-params and -ot in llama-bench output * cleanup redundant type mapping * fix wrong field name * fix preexisting mistake in cuda_params help text (default value) * fix preexisting mistake in kompute column header * adjust code style to match current norms * simplify/fix inverted columns * fix field->value pairings/order * remove dead field `f16_kv` * sql printer deserves a way out too * actually enable the new improvements....	2025-12-11 07:04:15 +01:00
Kawrakow	5fe3979951	KV cache read/write for split mode "graph" (#1048 ) * Handle split cache (write) * Handle split cache (read) * Fix writing the data twice --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-09 06:50:53 +01:00
Djip007	5669d39036	Unroll for loop for repacked BF16 MATMUL (#1047 ) see https://github.com/ikawrakow/ik_llama.cpp/discussions/1028 for detail	2025-12-08 06:09:45 +01:00
Kawrakow	2f645f2579	Fix annoying compiler warnings (#1042 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-06 09:59:07 +01:00
Kawrakow	e02b71f89e	Automatically disable CUDA graphs for split mode "graph" (#1040 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-06 07:38:02 +01:00
Kawrakow	0383dfb177	CUDA: set current device in compute_forward (#1039 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-05 16:47:50 +01:00
firecoperana	42e4c61243	CUDA: Fix FA for Pascal GPU (#1036 ) Co-authored-by: firecoperana <firecoperana>	2025-12-05 16:42:14 +01:00
Kawrakow	2125f68636	Don't split the output tensor (#1038 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-05 15:56:53 +01:00
Kawrakow	9264abfbaf	Fix debug build (#1037 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-05 14:06:22 +01:00
Kawrakow	efc8c8ef8d	K-cache Hadamard transforms (CUDA) (#1034 ) * Hadamard transforms for K-cache on CUDA * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-04 18:46:22 +01:00
Kawrakow	18fdd80eaf	Hadamard transforms for K-cache - CPU only (#1033 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-04 06:51:11 +01:00
Kawrakow	0581f90c0f	Allow empty splits (#1029 ) * Allow empty splits * Fix type, add additional asserts * Fix also output --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-03 13:52:41 +01:00
Kawrakow	90f36eb517	Use standard attention for Ministral3 (#1032 ) Required adding the "temperature scaling" to the standard attention implementation. But in this way split mode "graph" is automatically supported. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-03 13:43:31 +01:00
Kawrakow	7fbe8d3ac2	Fix bug in ggml_cuda_op_scale_tensor (#1031 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-03 11:32:19 +01:00
Kawrakow	cf20d0c756	Adding ministral3: this seems to work (#1030 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-03 11:01:21 +01:00
Kawrakow	92410bbd1e	Slightly better graph split strategy (#1026 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-02 18:50:52 +01:00
Kawrakow	a719349982	POC: CUDA tensor parallel (MoE models) (#1022 ) * Remove most of split mode row * WIP * WIP: also allocate the KV cache using tensor split * WIP: it runs with wrong result But it also looks like the backend scheduler is not going to help: * It copies mask and input positions to GPU 0 * => RoPE ops must run on GPU 0 * => To proceed attn evaluation, GPU 1 must wait for GPU 0 to finish its entire attn calculation * Same with FFN. The rms_norm gets scheduled on GPU 0. Hence, GPU 1 must wait for GPU 0 to finish its entore FFN calculation before it can start (as it needs to copy the result of rms_norm from GPU 0) * => Seems useless without writing a bespoke TP scheduling * WIP * This works, but it is slow * This is slightly better the graph is still not being computed in parallel. Why? Because the scheduler creates graph splits where the result of the computation on one GPU becomes an input for the other split. Hence, to trigger the computation on the second GPU one needs to wait for the computation on the first GPU to finish, even thiough the two can be done in parallel up to the sunchronization point. So, all that is left to do is to trick the scheduler to create to splits that can be done in parallel, and then have a graph split where the results get combined. * Playing games with the scheduler This change tricks it into doing the right thing^TM. Still quite a bit slower than split mode layer for the 8B LlaMA model. But for the 70B LlaMA it now beats split mode layer for TG: 28 t/s vs 24.4 t/s. PP is 627 t/s vs 744 t/s. In comparison, split mode "row" in mainline gets 484 t/s PP and 19.3 t/s TG. * Fix attn split Granularity for Wq, Wo is not just head size, but head size * gqa_ratio. Else the Wk, Wv tensors end up not being a multiple of the head size when we divide the split determined by Wo with the gqa_ratio. * Show memory used per device * Make it work with partial offload but no tensor overrides yet, just ngl < num_layers. * Allow for f16 source in fused_rms_norm * This results in faster PP. Now PP is faster than split mode layer for L3-70B. * Rename split mode "row" to split mode "graph" * Leave FFN partial results as f16 * WIP GLM4.5 - runs with wrong results * WIP GLM4.5 - this works PP is already better than split mode layer, but TG for zero context is kind of low - 60 vs 92 t/s. TG becomes better than split mode layer at around 20k tokens. PP at 26k tokens is 1.55X of sm layer. * Work around compiler bug It issues a warning that there is an extra semicolon outside of a function, but there isn't. If I remove the anonymous namespace and turn the functions inside into static, the warning disapears, so clearly a compiler bug. * Make graph reuse work with split mode graph * Remove more split mode row remnants * WIP tensor overrides Runs with wrong results, don't see where the issue could be. * This works but is slow Still does not work for row-interleaved quants * Slightly better * Slightly better * Row-interleaved quants work * Better * Minor * Guarad against using split mode "graph" for unsupported models * Guards against using merge_qkv with split mode "graph" * WIP split mode attn Works for LlaMA models, but not for GLM-4.5. Doesn't seem to improve performance, so I guess no point in trying to fix it. * Split mode graph for qwen3moe * Try to better distribute the splits --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-01 19:25:40 +01:00
Kawrakow	02b717c8c6	Fix build with RPC not enabled (#1025 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-30 19:04:54 +01:00
firecoperana	15771072c7	RPC: support multiple devices including cpu (#1024 ) * RPC support multiple devices * rpc : update documentation (#16441) Update the README file to match the newly added functionality of exposing multiple devices from a single server. Co-authored-by: Diego Devesa <slarengh@gmail.com> # Conflicts: # examples/rpc/README.md * Remove memory settings * rpc : cache and reuse compute graphs (#15405) Store the last computed graph and reuse it when possible. Also do not return response from GRAPH_COMPUTE and assume it always completes successfully. If this this is not the case, the server closes the connection. This saves us a network round trip to the server. * Add -cpu to include cpu backend --------- Co-authored-by: firecoperana <firecoperana> Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>	2025-11-30 18:48:02 +01:00
firecoperana	1cad1ec1cc	Update grammar (#1023 ) * grammar : fix JSON Schema for string regex with top-level alt. (#9903) Prior to this commit, using a JSON Schema containing a string with `pattern` regular expression that uses top-level alternation (e.g. `"pattern": "^A\|B\|C\|D$"`) would result in invalid JSON output from the constrained sampling grammar, because it ended up creating a grammar rule like this for the string: ``` thing ::= "\"" "A" \| "B" \| "C" \| "D" "\"" space ``` Note that this rule will only match a starting quote for the "A" case, and will only match an ending quote for the "D" case, so this rule will always produce invalid JSON when used for sampling (that is, the JSON will always be lacking the starting quote, the ending quote, or both). This was fixed in a simple way by adding parentheses to the generated rule (for all string pattern rules, to keep it simple), such that the new generated rule looks like this (correct): ``` thing ::= "\"" ("A" \| "B" \| "C" \| "D") "\"" space ``` * grammars : add English-only grammar (#10612) * grammar : handle maxItems == 0 in JSON schema (#13117) Co-authored-by: Richard Lyons <frob@cloudstaff.com> * grammar-parser : fix possible null-deref (#9004) Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70680 Signed-off-by: David Korczynski <david@adalogics.com> * llama : fix typo in llama-grammar.h [no ci] (#11816) * * server: fix "--grammar-file" parameter (#12285) * common : use std::string_view now that we target c++17 (#14319) * json : support `enum` values within `allOf` (#15830) * grammar : use int64_t to avoid int overflows in int schema to grammar conversion logic (#16626) * grammar : support array references in json schema (#16792) * grammar : support array references in json schema * Update json-schema-to-grammar.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * grammar : improve regex when naming ref derived rules * grammar : replace non-conformant definitions array with anyOf test case --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> # Conflicts: # tests/test-json-schema-to-grammar.cpp * merge fix * llama : minor grammar refactor (#10897) * llama: fix error on bad grammar (#12628) * grammar : fix integer overflow (#17381) * Fix DoS / integer overflow * Remove optional, use INT64_MAX instead as placeholder value (it's technically -1, so it fits :) * White space * Actually, since it's unsigned, use UINT64_MAX # Conflicts: # src/llama-grammar.cpp * grammar: fix regression caused by #17381 (#17412) * grammar: fix regression caused by #17381 * more readable # Conflicts: # src/llama-grammar.cpp * Merge Fix * Fix warnings --------- Signed-off-by: David Korczynski <david@adalogics.com> Co-authored-by: Joe Eli McIlvain <joe.eli.mac@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: frob <rick+github@frob.com.au> Co-authored-by: Richard Lyons <frob@cloudstaff.com> Co-authored-by: DavidKorczynski <david@adalogics.com> Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com> Co-authored-by: firecoperana <firecoperana> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Aldehir Rojas <hello@alde.dev> Co-authored-by: Olivier Chafik <olivier.chafik@gmail.com> Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com> Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-30 18:45:38 +01:00
firecoperana	869557c8fd	Update mtmd to improve accuracy of M-RoPE (#993 ) * model : Granite docling + Idefics3 preprocessing (SmolVLM) (#16206) * feat: Add granite-docling conversion using trillion pretokenizer Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add granite-docling vocab pre enum Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use granite-docling pre Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add clip_is_idefics3 Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow multi-token boundary sequences for image templating Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add tiling support for idefices3 in clip.cpp This should likely be moved into llava_uhd::get_slice_instructions, but for now this avoids disrupting the logic there. Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Partial support for full templating for idefics3 in mtmd There are still errors encoding some of the image chunks, but the token sequence now matches transformers _almost_ perfectly, except for the double newline before the global image which shows up as two consecutive newline tokens instead of a single double-newline token. I think this is happening because the blocks are tokenized separately then concatenated. Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Fully working image preprocessing for idefics3 w/ resize and slicing Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Parse the preprocessor config's longest side and add it to the mmproj hparams Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use the longest side instead of size * scale_factor For Granite Docling, these come out to the same value, but that was just a conicidence. Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Allow batch encoding and remove clip_is_idefics3 Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove unnecessary conditionals for empty token vectors Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use image_manipulation util Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * add test model --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> # Conflicts: # convert_hf_to_gguf.py # convert_hf_to_gguf_update.py # gguf-py/gguf/constants.py # gguf-py/gguf/gguf_writer.py # src/llama-vocab.cpp # src/llama-vocab.h * mtmd : support home-cooked Mistral Small Omni (#14928) * model : add LightOnOCR-1B model (#16764) * model : add LightOnOCR-1B model * add test # Conflicts: # convert_hf_to_gguf.py # gguf-py/gguf/constants.py * mtmd : fix idefics3 preprocessing (#16806) * mtmd : fix idefics3 preprocessing * disable granite test * fix test for granite * model: Add support for CogVLM model (#15002) * Added GGUF mappings for CogVLM model * Add tensor mapping for CogVLM visual encoder * Add CogVLM to conversion script, no vision part yet * Added CogVLM vision model to conversion script * Add graph for CogVLM CLIP model * Add graph for CogVLM * Fixes for CogVLM. Now compiles. * Model now runs * Fixes for cogvlm graph * Account for graph context change after rebase * Changes for whitespace * Changes in convert script according to comments * Switch CogVLM LLM graph to merged QKV tensor * Use rope_type variable instead of direct definition * Change CogVLM CLIP encoder to use SWIGLU * Switch CogVLM CLIP to use merged QKV * Apply rebase edits and remove ggml_cont call that is now unnecessary * clean up --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> # Conflicts: # convert_hf_to_gguf.py # examples/mtmd/clip.cpp # gguf-py/gguf/constants.py # gguf-py/gguf/tensor_mapping.py # src/llama-arch.cpp # src/llama-arch.h # src/llama-model.cpp # src/llama-model.h * mtmd: refactor preprocessing + support max/min pixels (#16878) * mtmd: refactor preprocessing + support max/min pixels * fix mlp type * implement mix/max pixels * improve hparams * better image preproc for qwen * fix * fix out of bound composite * fix (2) * fix token calculation * get_merge_kernel_size() * fix llama4 and lfm2 * gonna fix them all * use simple resize for qwen * qwen: increase min tokens * no resize if dst size == src size * restore to initial min/max tokens value for qwen # Conflicts: # examples/mtmd/clip.cpp * clip : use FA (#16837) * clip : use FA * cont : add warning about unsupported ops * implement "auto" mode for clip flash attn * clip : print more detailed op support info during warmup * cont : remove obsolete comment [no ci] * improve debugging message * trailing space * metal : remove stray return --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> * model: add Janus Pro for image understanding (#16906) * Add support for Janus Pro * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Address reviewer suggestions Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Add JANUS_PRO constant * Update clip model handling Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> * Update tools/mtmd/clip.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Refactor JANUS_PRO handling in clip.cpp Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> * Update tools/mtmd/clip.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * em whitespace --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> # Conflicts: # convert_hf_to_gguf.py # gguf-py/gguf/constants.py # gguf-py/gguf/tensor_mapping.py * mtmd: pad mask for qwen2.5vl (#16954) * mtmd: pad mask for qwen2.5vl * improve * mtmd: add --image-min/max-tokens (#16921) * mtmd: improve struct initialization (#16981) * mtmd: allow QwenVL to process larger image by default (#17020) * Disable flash attention * mtmd : fix embedding size for image input (#17123) * mtmd: fix patch_size initialized to random value in audio models (#17128) * mtmd: fix patch_size initialized to random value in audio models * add default hparams * add llama_model_n_embd_inp * Fix load qwen3 vl Change batch size * Add description * Fix cli build error --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Tianyue-Zhao <zhaotianyue@outlook.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Zhiyong Wang <85110830+ravenouse@users.noreply.github.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> Co-authored-by: firecoperana <firecoperana>	2025-11-29 07:27:15 +01:00
hksdpc255	d24ea9e48e	server : add Anthropic Messages API support (#1012 )	2025-11-29 07:24:59 +01:00
Kawrakow	d6daee337c	Attempt to fix #1014 (#1017 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-27 15:58:18 +01:00
Kawrakow	45cd1a70f5	Fix llama-bench mla parameter (#1016 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-27 09:33:30 +01:00
firecoperana	8c39ff966d	Change default RPC order and fix wrong RPC server order in --device arg (#1011 ) * Change default RPC order and fix wrong RPC order in --device arg * Update --------- Co-authored-by: firecoperana <firecoperana>	2025-11-26 16:51:51 +01:00
firecoperana	0b126b2ca6	Fix prompt tokenization issue during prompt processing (#1008 ) * Find common tokens between prompt and cache Fix wrong context size usage for mtmd Use start position of common part server: handle context shift * Add size check for inexact match * Change --------- Co-authored-by: firecoperana <firecoperana>	2025-11-26 10:34:26 +01:00
Kawrakow	9337229274	Add MXFP4 to gguf-py constants (#1007 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-24 15:43:49 +01:00
Kawrakow	a3b8efd687	Enable iq4_nl KV cache on CUDA (#1006 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-24 09:41:19 +01:00
Kawrakow	0243356650	Fix q6_0 dequantize (#1005 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-24 09:13:46 +01:00
Nexes the Elder	9a63e768ea	Legacy quants cpy_blck_q_f16 function for K cache (#1001 ) Shortfixes the bug : ggml\src\ggml-cuda\cpy.cu:614: ggml_cuda_cpy_fn: unsupported type combination (q6_0 to f16) encountered when trying to use deepseek lite v2 with quantized K cache. Note: I compile my IK_Llama with GGML_CUDA_F16. To fix this, I added a cpy_blck_q_f16 function devised by comparing the cpy_blck_q8_0_f32 and cpy_blck_q8_0_f16, and transposing the difference for the other legacy quants on the basis of the cpy_blck_q_f32 function. A "rule of three" of sorts. Perplexity test and inference now works consistantly on -ctk q4_0 ; q4_1 ; q5_0 ; q5_1 in that scenario, with expected values and behavior. Except on Q6_0, which sees its perplexity multiplied by 100. (I suspect the Cuda dequantize_q6_0 to be incompatible with this PR for some reason, but that's beyond what I can fix) -ctk iq4_nl, which doesn't have yet a dequantize_iq4_nl function, is not usable that way for now.	2025-11-24 08:56:38 +01:00
Kawrakow	ada5a92241	Disable RoPE cache (#1004 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-24 07:09:46 +01:00
firecoperana	07d08e15ad	webui update (#1003 ) webui: add system message in export conversation, support upload conversation with system message Webui: show upload only when in new conversation Webui: Add model name webui: increase height of chat message window when clicking editing Webui: autoclose settings dialog dropdown and maximze screen width when zoom in webui: fix date issues and add more dates webui: change error to toast.error. server: add n_past and slot_id in props_simple webui: add cache tokens, context and prompt speed in chat webui: modernize ui webui: change welcome message webui: change speed display webui: change run python icon webui: add config to use server defaults for sampler webui: put speed on left and context on right webui: recognize AsciiDoc files as valid text files (#16850) * webui: recognize AsciiDoc files as valid text files * webui: add an updated static webui build * webui: add the updated dependency list * webui: re-add an updated static webui build Add a setting to display message generation statistics (#16901) * feat: Add setting to display message generation statistics * chore: build static webui output webui: add HTML/JS preview support to MarkdownContent with sandboxed iframe (#16757) * webui: add HTML/JS preview support to MarkdownContent with sandboxed iframe dialog Extended MarkdownContent to flag previewable code languages, add a preview button alongside copy controls, manage preview dialog state, and share styling for the new button group Introduced CodePreviewDialog.svelte, a sandboxed iframe modal for rendering HTML/JS previews with consistent dialog controls * webui: fullscreen HTML preview dialog using bits-ui * Update tools/server/webui/src/lib/components/app/misc/CodePreviewDialog.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/components/app/misc/MarkdownContent.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: pedantic style tweak for CodePreviewDialog close button * webui: remove overengineered preview language logic * chore: update webui static build --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> webui: auto-refresh /props on inference start to resync model metadata (#16784) * webui: auto-refresh /props on inference start to resync model metadata - Add no-cache headers to /props and /slots - Throttle slot checks to 30s - Prevent concurrent fetches with promise guard - Trigger refresh from chat streaming for legacy and ModelSelector - Show dynamic serverWarning when using cached data * fix: restore proper legacy behavior in webui by using unified /props refresh Updated assistant message bubbles to show each message's stored model when available, falling back to the current server model only when the per-message value is missing When the model selector is disabled, now fetches /props and prioritizes that model name over chunk metadata, then persists it with the streamed message so legacy mode properly reflects the backend configuration * fix: detect first valid SSE chunk and refresh server props once * fix: removed the slots availability throttle constant and state * webui: purge ai-generated cruft * chore: update webui static build feat(webui): improve LaTeX rendering with currency detection (#16508) * webui : Revised LaTeX formula recognition * webui : Further examples containg amounts * webui : vitest for maskInlineLaTeX * webui: Moved preprocessLaTeX to lib/utils * webui: LaTeX in table-cells * chore: update webui build output (use theirs) * webui: backslash in LaTeX-preprocessing * chore: update webui build output * webui: look-behind backslash-check * chore: update webui build output * Apply suggestions from code review Code maintenance (variable names, code formatting, string handling) Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: Moved constants to lib/constants. * webui: package woff2 inside base64 data * webui: LaTeX-line-break in display formula * chore: update webui build output * webui: Bugfix (font embedding) * webui: Bugfix (font embedding) * webui: vite embeds assets * webui: don't suppress 404 (fonts) * refactor: KaTeX integration with SCSS Moves KaTeX styling to SCSS for better customization and font embedding. This change includes: - Adding `sass` as a dev dependency. - Introducing a custom SCSS file to override KaTeX variables and disable TTF/WOFF fonts, relying solely on WOFF2 for embedding. - Adjusting the Vite configuration to resolve `katex-fonts` alias and inject SCSS variables. * fix: LaTeX processing within blockquotes * webui: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> server : add props.model_alias (#16943) * server : add props.model_alias webui: fix keyboard shortcuts for new chat & edit chat title (#17007) Better UX for handling multiple attachments in WebUI (#17246) webui: add OAI-Compat Harmony tool-call streaming visualization and persistence in chat UI (#16618) * webui: add OAI-Compat Harmony tool-call live streaming visualization and persistence in chat UI - Purely visual and diagnostic change, no effect on model context, prompt construction, or inference behavior - Captured assistant tool call payloads during streaming and non-streaming completions, and persisted them in chat state and storage for downstream use - Exposed parsed tool call labels beneath the assistant's model info line with graceful fallback when parsing fails - Added tool call badges beneath assistant responses that expose JSON tooltips and copy their payloads when clicked, matching the existing model badge styling - Added a user-facing setting to toggle tool call visibility to the Developer settings section directly under the model selector option * webui: remove scroll listener causing unnecessary layout updates (model selector) * Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * chore: npm run format & update webui build output * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> webui: Fix clickability around chat processing statistics UI (#17278) * fix: Better pointer events handling in chat processing info elements * chore: update webui build output Fix merge error webui: Add a "Continue" Action for Assistant Message (#16971) * feat: Add "Continue" action for assistant messages * feat: Continuation logic & prompt improvements * chore: update webui build output * feat: Improve logic for continuing the assistant message * chore: update webui build output * chore: Linting * chore: update webui build output * fix: Remove synthetic prompt logic, use the prefill feature by sending the conversation payload ending with assistant message * chore: update webui build output * feat: Enable "Continue" button based on config & non-reasoning model type * chore: update webui build output * chore: Update packages with `npm audit fix` * fix: Remove redundant error * chore: update webui build output * chore: Update `.gitignore` * fix: Add missing change * feat: Add auto-resizing for Edit Assistant/User Message textareas * chore: update webui build output Improved file naming & structure for UI components (#17405) * refactor: Component iles naming & structure * chore: update webui build output * refactor: Dialog titles + components namig * chore: update webui build output * refactor: Imports * chore: update webui build output webui: hide border of button webui: update webui: update webui: update add vision webui: minor settings reorganization and add disable autoscroll option (#17452) * webui: added a dedicated 'Display' settings section that groups visualization options * webui: added a Display setting to toggle automatic chat scrolling * chore: update webui build output Co-authored-by: firecoperana <firecoperana>	2025-11-24 07:03:45 +01:00
Kawrakow	920f424929	Support GigaChat3 (#995 ) * Fixing Gigachat support * Gigachat: CUDA FA (needs 192 x 192 for MLA = 3) * Gigachat: CPU FA (needs 192 x 192 for MLA = 3) --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-24 06:55:14 +01:00
gapeleon	f6163dd58f	Fix: Register missing /apply-template endpoint (#999 ) The handle_apply_template handler was defined but never registered as an HTTP endpoint, causing 404 errors when calling POST /apply-template. This commit adds the missing endpoint registration to match the functionality available in llama.cpp mainline. Co-authored-by: Gapeleon <gapeleon@users.noreply.github.com>	2025-11-24 06:53:15 +01:00
Yap Sok Ann	7505165dee	Fix truncated logprobs when streaming is off (#998 ) The logic to skip the logprobs of the stop token was originally from ggml-org/llama.cpp#2849, and was later modified as part of ggml-org/llama.cpp#10643 to be applied only to STOP_TYPE_WORD. The latter change wasn't included in #723. Then, after #958 got merged, the logic got inadvertently applied to GLM-4.5/4.6 and Kimi K2, resulting in truncated logprobs when streaming is off. This commit reverts the logic from ggml-org/llama.cpp#2849, such that the logprobs of the stop token will always be included in the response, when logprobs is enabled. From testing, this matches with the behavior of Fireworks inference server, for both chat completions and text completions endpoints. Also fix logprobs param handling for the text completion endpoint.	2025-11-24 06:52:15 +01:00
hksdpc255	15695a0617	fix kimi-k2 tool call (#996 )	2025-11-24 06:51:16 +01:00
Kawrakow	bf12f502a4	Fix requatizing from row-interleaved quants (#992 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-20 11:50:09 +01:00
Kawrakow	60227f4433	Make gguf-py stuff work with numpy 2.0 (#991 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-20 10:20:55 +01:00

1 2 3 4 5 ...

4113 Commits