ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-26 16:14:10 +00:00

Author	SHA1	Message	Date
Kawrakow	216f44363f	Fix KT quantization yet again (#1321 ) * Fix KT quantization yet again * Add same 1e-16f check for all quants in iqk_quantize.cpp * Fixes for k-quants * Also this one	2026-02-25 18:07:12 +01:00
Kawrakow	c77ec4b8b8	Fused delta-net (#1315 ) * Revive fused delta-net * Add command line argument for fused delta net * Simplify/improve CUDA delta-net * Add -fdn to llama-bench * More CUDA fused delta net optimizations * CPU optimizations * Much faster fused delta-net on the CPU It seems it is faster than the chunked implementation! * Change meaning of fdn from bool flag to threshold value * Use eps = 1e-6 * Give some nodes a name	2026-02-25 14:12:48 +01:00
Nexes the Elder	0bf7043a7b	Display the size of the tensors overridden during the tensor loading (#1318 ) * Display the size of the tensors overridden during the tensor loading Ex: `Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU` becomes `Tensor blk.60.ffn_up_exps.weight (size = 668467200 bytes) buffer type overriden to CPU Tensor blk.60.ffn_gate_exps.weight (size = 668467200 bytes) buffer type overriden to CPU` And pass in debug the later displayed size of the unnamed buffer overrides. Ex: `llm_load_tensors: CPU buffer size = XXX.XX MiB` That double display is cluttering the screen without being very informative. * change bytes display to MiB. Co-authored-by: Kawrakow <iwankawrakow@gmail.com> --------- Co-authored-by: Kawrakow <iwankawrakow@gmail.com>	2026-02-25 07:36:27 +01:00
Nexes the Elder	170467e835	Llama-quantize: Partial requant feature (#1313 ) * Partial Requant feature for llama-quantize - Inspired by the recently portcopied --dry-run feature. - Allows to partially requantize a split quantized .gguf by requantizing only the missing splits in the destination directory. - Works both for GGUF which are split tensors by tensors, or by group of several tensors (though this one is not very much tested beyond 2 tensors by split). - Vibe coded. * Create output directory if it doesn't exist in llama-quantize * Create output directory if it doesn't exist in gguf-split * Add exit when directory fails to be created on Windows * Use std::filesystem * cleanup	2026-02-25 07:25:15 +01:00
Joshua Jolley	68431b049a	server: propagate task index to response objects for batch requests (#1303 ) When multiple prompts are sent in a single /v1/completions request, each response needs to carry the correct index so the client can match results to their corresponding prompts. The index field was not being set on partial responses, final responses, or embedding responses, causing batch results to all report index 0. Set res->index = slot.task->index in send_partial_response, send_final_response, and send_embedding. Generated with [Devin](https://cli.devin.ai/docs) Co-authored-by: Joshua Jolley <jjolley@clearwateranalytics.com> Co-authored-by: Devin <noreply@cognition.ai>	2026-02-24 15:39:38 +01:00
dungquixote42	aaa545c3dc	adaptive p: collect probability before logit bias (#1314 )	2026-02-24 15:39:17 +01:00
Kawrakow	38ca19d828	Minor delta-net tweak (#1308 ) * Make sure we pick the reduced tensor from the right GPU * Minor * Minor delta-net tweak	2026-02-24 15:22:57 +01:00
Kawrakow	7065488135	Slightly better graph parallel for Qwen3-Next (#1307 ) * Make sure we pick the reduced tensor from the right GPU * Minor	2026-02-24 15:22:30 +01:00
Kawrakow	cfb6747776	llama-quantize: --dry-run option (#1309 )	2026-02-24 15:21:52 +01:00
TheAIGuyFromAR	96b8298472	Fix typo in merge-up-gate-experts argument (#1311 )	2026-02-24 15:13:22 +01:00
Kawrakow	68bd30d99c	Fix max nodes (again) (#1306 )	2026-02-23 11:17:37 +01:00
Kawrakow	2bb40f8c35	Fix llm_arch_is_hybrid (#1305 )	2026-02-23 08:55:53 +01:00
Kawrakow	5dacb5355a	Graph parallel for Qwen3-Next (#1292 ) * WIP * This works, but is slower than split mode layer	2026-02-23 07:58:00 +01:00
Yap Sok Ann	dcf50d8279	Fix tool call for Qwen3.5 (#1300 ) * Fix tool call for Qwen3.5 Loosely based on mainline changes from: * https://github.com/ggml-org/llama.cpp/pull/19635 * https://github.com/ggml-org/llama.cpp/pull/19765 Also need to change the grammar to allow the model to make multiple tool calls in a row. This was likely broken for Qwen3 Coder prior to this commit. * Fix the grammar for the subsequent parameters after the first one	2026-02-23 07:54:56 +01:00
firecoperana	efc294cc39	server: fix crash from adaptive p (#1304 ) Co-authored-by: firecoperana <firecoperana>	2026-02-23 07:25:52 +01:00
Kawrakow	89b1e2b518	Better estimate for max. number of compute nodes (#1296 ) * Better estimate for max. number of compute nodes * Just in case	2026-02-22 18:16:49 +01:00
Samuel Oliveira Alves	09a88c9ae5	Add MTP decoding support for GLM-4.x MoE (#1270 ) * wip: port MTP architecture Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ikllama`. Changes include: - Updating `llama_batch` to support `mtp_params`. - Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft). - Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`). - Adapting the embedding extraction logic to skip MTP update passes. * Refactors `server_slot` to support generic speculative decoding (MTP or Draft Model). * core: enable hybrid outputs (logits + embeddings) for MTP support * fix(mtp): correct KV-cache slot finding for updates * fix(mtp): persist hidden states to prevent context corruption during drafting * refactor(mtp): clean unused code * fix(mtp): update server to new functions name * fix(mtp): fix graph and save hidden state * mtp: refactor integration, context params and kv cache search * mtp: fix hidden state extraction and speculative acceptance flow * server: fix MTP warmup for long prompts and reset token buffer * llama: refactor MTP operation state to context parameters * server: fix n_past calculation in MTP acceptance * llama: fix mtp enable flags * speculative: refactor MTP to use common_speculative interface * context: remove unused signatures * clip: fix deprecated enum-enum conversion warning * common: fix format string crash in help message * context: fix mtp activation logic	2026-02-22 18:14:39 +01:00
Kawrakow	cbf7fc7e2f	Update README with warning about '_XL' models from Unsloth Added important note regarding quantized models from Unsloth.	2026-02-22 07:42:17 +01:00
Kawrakow	bd387a279a	Add new authors to the AUTHORS file	2026-02-21 19:20:31 +01:00
firecoperana	66323b92f7	Qwen3.5-MoE: fix regenerating message error (#1295 ) Co-authored-by: firecoperana <firecoperana>	2026-02-21 18:24:12 +01:00
Kawrakow	13c3d83ce7	Qwen3.5-MoE support (#1288 ) * WIP: loads and runs, but not correct Very high PPL, empty TG. * This appears to work	2026-02-21 08:33:06 +01:00
mcm007	b2cb4512c5	Create parameters overview (#1269 ) * raw parameters.md * fix small typos in common.cpp * Update build args in parameters.md * Update parameters.md - format as table - sections * Update README.md - quickstart - build and run * Update parameters.md other tools examples * add PR links * multiple updates to parameters.md - description - add jargon section - add suggestions from feedbacks * don't imply that only linux is supported in README.md * add alias to parameters.md * Update README.md with recent models and features * Update parameters.md with latest features * address suggestions - no-ooae - placeholder for common commands - no-kv-offload - llama-sweep-bench - placeholder for unique parameters * specify Linux distro in README.md	2026-02-20 07:20:56 +01:00
dungquixote42	0f411b02e2	Fix adaptive p sampler bug with string ban (#1287 ) * adaptive p: update internal state only if not rewinding * adaptive p: conditional update for speculative decoding * adaptive p: refactor to rewind instead of update * adaptive p fix: better comments * fix rewind check * add record to handle multi-token rewind * better comment	2026-02-20 07:11:36 +01:00
rkozuch	b855bf92de	Fix slot prompt updating. (#1285 ) Co-authored-by: Rkozuch <you@example.com>	2026-02-19 08:15:49 +01:00
Kawrakow	d81cde5cea	Fix very low bpw missing imatrix check (#1284 )	2026-02-19 08:15:26 +01:00
Samuel Oliveira Alves	51df09be8a	Feat - add kimi 2.5 Vision (#1280 ) * port kimi 25-vision from upstream * feat(clip): add support for Kimi K2.5 vision model	2026-02-19 08:15:03 +01:00
Kawrakow	04cf685e82	Factor out delta net (#1286 ) * WIP: factor out delta net implementation * WIP * Use the standard FFN functions * More standard attn for Qwen3-Next	2026-02-18 17:16:17 +01:00
Kawrakow	d2d65c0d64	Better CPU performance for Qwen3-Next (#1283 ) * Better CPU silu - +4% PP * Improve ggml_compute_forward_dup_bytes	2026-02-18 15:55:11 +01:00
Kawrakow	84831fc3ee	Don't disable CUDA graphs for Qwen3-Next (#1278 )	2026-02-18 08:47:45 +01:00
Kawrakow	cafeef484c	More Qwen3-Next optimizations (#1277 ) * Optimizing q3next TG * Fused add -> softplus -> mul on CUDA * Remove forgotten debug log * Increase ggml context size Required for Qwen3-Next with batch/u-batch size of 4096 * WIP * Avoid some contiguous ops * Avoid some repeats * Avoid some more repeats	2026-02-17 16:03:51 +01:00
Samuel Oliveira Alves	88f98c891d	server: add string ban in speculative path (#1274 )	2026-02-17 12:33:28 +01:00
Kawrakow	16fe459a49	Faster CPU PP performance for Qwen3-Next - optimize concat (#1276 )	2026-02-17 11:46:27 +01:00
Kawrakow	35c99f9f41	Faster Qwen3-Next PP on CUDA - optimize concat (#1275 )	2026-02-16 11:46:39 +01:00
Kawrakow	97e7c091cd	Update AUTHORS file with new contributors Added new contributors to the AUTHORS file.	2026-02-16 07:13:25 +01:00
firecoperana	868ac2128e	fix build error (#1272 ) Co-authored-by: firecoperana <firecoperana>	2026-02-16 06:51:03 +01:00
Kawrakow	e30198a553	WIP: Qwen3Next (#1266 ) * qwen3next: add architecture support and recurrent-state fixes * qwen3next: optimize broadcast sub and single-seq ssm conv * cuda: build MoE row mapping on device in mul_mat_id * cuda: add guarded multi-seq fast path for ssm_conv * docs: update qwen3next perf report for cuda MoE/SSM tuning * cuda: reduce qwen3next moe/ssm sync overhead and refresh eval * qwen3next: split cpu/cuda eval builds and tune PP scheduling * qwen3next: harden seq-state flow and support optional dense FFN layers * qwen3next: trim delta-net graph overhead in chunking path * qwen3next: remove redundant v_conv cont in delta path * qwen3next: avoid extra cont on linear attention output * qwen3next: drop redundant cont before recurrent state flatten * qwen3next: keep recurrent state in 4d layout through delta path * qwen3next: add fused delta-net op and wire model path * tests: add backend-op coverage for ggml_delta_net * qwen3next: add runtime switch for fused delta-net path * docs: refresh qwen3next perf review and benchmark matrix * qwen3next: default fused delta-net off and document quality checks * qwen3next: add decode-only fused delta mode * qwen3next: make fused delta safe by default and fix fused tensor layout * qwen3next: warn when forcing fused decode mode * qwen3next: add fused-delta regression runner script * qwen3next: integrate fused regression into eval harness * qwen3next: clean up chunked delta-net shape handling * qwen3next: add absolute sanity guards to fused regression * qwen3next: add unified regression runner script * qwen3next: disable flash-attn for cpu-only contexts * docs: reconcile qwen3next status and remaining upstream gaps * common: add qwen3next fused-delta runtime flag * cuda: add qwen3next delta-net kernel dispatch override * docs: update qwen3next quality and serving baseline findings * qwen3next: keep fused delta on safe path and remove PR artifacts * qwen3next: align autoregressive delta-net decode layout * Revert 
"qwen3next: align autoregressive delta-net decode layout" This reverts commit `9241164a5e`. * cuda: port solve-tri fast-paths for qwen3next delta-net * qwen3next: add fused-delta runtime flag and drop env toggle * qwen3next: make fused delta single-flag and default on * Account for GPU arch differences * Revert "cuda: build MoE row mapping on device in mul_mat_id" This reverts commit `89e9ecfa84`. * qwen3next: drop non-essential MoE scheduling and split heuristics * qwen3next: avoid generic ggml_sub broadcast changes * llama: restore only_active_experts log message * Remove unnecessary hacks, disable fusion for now. * qwen3next: port hybrid recurrent state memory semantics * qwen3next: clean up recurrent state slot plumbing * qwen3next: fix hybrid V-cache layout plumbing * qwen3next: guard recurrent state slots against kv capacity * qwen3next: persist recurrent state in session data - serialize/restore qwen3next cache.s_l in state/session paths\n- bump session and sequence-state file versions for format change\n- fallback to single-token chunking for mixed repeated seq_id batches * qwen3next: drop unused fused-delta builder path - remove dead build_delta_net_fused lambda\n- remove unused llm_build_context::fused_delta member * qwen3next: remove unused fused-delta CLI/context plumbing - drop -fd/-no-fd options and related YAML dump field\n- remove fused_delta fields from public/internal context params\n- remove fused_delta assignment and logging in context init * ggml: remove unused DELTA_NET operator stack * Missing include * Reorder ops/unary ops So we don't change again the enum values of the mul mat ops * Minor * Discard unnecessary changes in llama-build-context.cpp * Minor * Revert "Discard unnecessary changes in llama-build-context.cpp" This reverts commit `edadb80ed6`. 
* Increase GGML_SCHED_MAX_SPLITS - required for larger u-batches * Fix CPU concat in the TG case: 7.25 -> 10.5 t/s for Qwen3Next * Fix CPU sum_rows: 10.5 -> 13.6 t/s for Qwen3Next It was single-threaded and was taking ~25% of the computation time during TG. It is now down to 2%. Strangely enough, I measure 13.6 t/s with llama-bench, but if I let the model give me an actual response with llama-cli, I get close to 17 t/s. * Fix CPU scale: 13.6 -> 16.7 t/s for Qwen3Next For Qwen3Next there is a scale op on a largish tensor (548k elements) that has a single row for TG, so was done in a single thread. We now simply use blocks of 1024 elements. * Optimize CPU mul: 16.7 -> 17.6 t/s for Qwen3Next * CPU: fuse transpose -> cont -> sum_rows -> transpose: 17.6 -> 23.1 t/s for Qwen3Next * Optimize CPU repeat: 176 -> 200 t/s for Qwen3Next PP-512 * Multithreading for OP_SUB * Don't commit with timing trace on * Multithread neg and sigmoid * Be able to turn on/off fusion more easily (CPU) * Name the mul_mat ops so we know where the time goes * WIP * Much better PP on CUDA * CUDA: fuse transpose -> cont -> sum_rows -> transpose Needs non-contiguous variant of sum_rows. On the CPU this gave 30+% improvement in TG performance, on CUDA it is a disappointing 6-7%. I guess, this is because Georgi's cont CPU implementation was so bad that skipping it made such a big difference. * CUDA: faster mul for special case relevant for Qwen3Next Worth 1% in TG * Fix CPU OP_CONT --------- Co-authored-by: yurko <yurko@local> Co-authored-by: Yurko <yurko@example.com> Co-authored-by: yurko <yurko@pop-os.tail5a1a6b.ts.net> Co-authored-by: Yurko Hoshko <YurkoHoshko@users.noreply.github.com>	2026-02-16 06:50:28 +01:00
Kawrakow	528cadb07b	GLM-5 support (#1268 )	2026-02-15 07:49:44 +01:00
mcm007	f5fe33b7a9	Update README.md (#1263 ) * Update README.md Add new models and few of the features, quants and improvements * Update README.md ministral3 and split mode "graph"	2026-02-14 09:02:33 +01:00
mcm007	f80505911d	Improve README.md (#1260 )	2026-02-14 09:01:52 +01:00
RodriMora	102f77b7d3	server: add /v1/responses support (#1184 ) * server: add /v1/responses support * server: fix Responses API model fallback and SSE branching	2026-02-14 08:30:18 +01:00
firecoperana	1cb7e1bf39	spec : add self speculative decoding, ngram and refactor (#1261 ) * spec : add self speculative decoding and ngram-mod and refactor common : use common_ prefix for common library function llama : use LLAMA_TOKEN_NULL spec : add self speculative decoding (no draft model required) + refactor spec : add ngram-mod spec : various improvements to ngram-map + docs spec : fix the check-rate logic of ngram-simple common : add common_speculative_is_compat() spec : simplify time measurement using common_time_meas refactor common_sampler_init refactor common_token_to_piece refactor and fix cur_p bug clean up * spec : remove check rate * spec: show warnings instead of abort --------- Co-authored-by: firecoperana <firecoperana> Co-authored-by: Sascha Rogmann <59577610+srogmann@users.noreply.github.com>	2026-02-13 19:04:55 +01:00
Kawrakow	1fdbc0dafe	Fix #1222 (#1257 ) * Fix #1222 * Typo	2026-02-09 16:20:16 +01:00
Kawrakow	494d70626f	Allow missing rope_frequency_base_swa in Step-3.5 models	2026-02-08 08:59:39 +00:00
Kawrakow	e22b2d1246	Be able to read uint32_t and bool arrays from GGUFs (#1252 )	2026-02-07 19:20:15 +02:00
firecoperana	f1ccf340dd	fix model name missing in final response (#1250 ) Co-authored-by: firecoperana <firecoperana>	2026-02-07 18:31:39 +02:00
mcm007	dbcbfdb0ef	Ik llama swap in container step by step guide (#1249 ) * Create README.md * Add container files and llama-swap configs * Update main README.md * Build without GGML_IQK_FA_ALL_QUANTS Otherwise fails with CUDA_DOCKER_ARCH=default * Mention GGML_IQK_FA_ALL_QUANTS usage * First step more explicit	2026-02-07 18:30:19 +02:00
Kawrakow	82c4f27332	Fuse the attention gate in Step-3.5-Flash (#1244 ) * WIP * This works but is slow * Turn off the up / gate clamps for now * OK we need the clamping * Fuse the clamp (CUDA) * Fuse the clamp (CPU) * WIP * Be able to use merged q, k, v * Be able to use merged up/gate experts * Fuse the clamp (CUDA mmvq) * WIP: graph parallel for Step-3.5 * WIP * This should be it * Cleanup * Fix merge * Not working attempt to extend fused_mul_unary to the Step-3.5 case * It works now, but performance gain is very minor	2026-02-07 07:56:58 +02:00
Kawrakow	90d7499c2c	Step-3.5: llama.cpp compatibility changes (#1240 ) * Step-3.5: llama.cpp compatibility changes * Also read rope_freq_base_train_swa from the GGUF	2026-02-07 07:56:11 +02:00
Kawrakow	c5d74f66e2	Fix graph parallel when ngl < n_layers (#1241 ) * Fix graph parallel when ngl < n_layers * Fix using ffn_norm When using graph parallel with ngl < n_layers, the ffn_norm tensor may have ended up being split, while the ffn tensors are on the CPU. In that case we will get a crash because we attempt to use the not-split buffer of ffn_norm, which is invalid. This commit fixes that. * Cleanup	2026-02-06 11:48:24 +02:00
Kawrakow	4d86907b18	Remove forgotten printf	2026-02-06 07:43:18 +00:00


4232 Commits