Commit Graph

4247 Commits

Author SHA1 Message Date
Kawrakow
a568e12c8f Minor delta-net tweak (#1337) 2026-03-01 17:45:02 +01:00
Kawrakow
04c140fe54 Make vision work with Qwen-3.5 models (#1345) 2026-03-01 17:44:37 +01:00
Kawrakow
0ff3a43289 Bring back #1333 and #1335 (#1340)
* Bring back fused delta net 3

* Remove autoregressive and chunking
2026-02-28 14:31:42 +01:00
Kawrakow
1922449b2c Revert delta net 3 (#1339)
* Revert "Simplify delta-net (#1335)"

This reverts commit e5fc30244c.

* Revert "Fused delta net 3 (#1333)"

This reverts commit 7b68353e09.
2026-02-28 13:12:08 +01:00
Kawrakow
e5fc30244c Simplify delta-net (#1335)
* Simplify delta-net

* Minor

* Minor
2026-02-28 11:12:19 +01:00
Kawrakow
702e0765b8 Update README with clarification on '_XL' models
Clarified warning about Unsloth '_XL' models in README.
2026-02-27 16:22:10 +01:00
Kawrakow
7b68353e09 Fused delta net 3 (#1333)
* This is better than chunked

* Keep the state in registers

* Cleanup

* Remove unused stuff

* Minor

* Make fused delta-net the default

* Fix race
2026-02-27 15:02:56 +01:00
Kawrakow
1e6d36b1b4 Graph parallel for dense Qwen-3.5 models (#1331)
* Graph parallel for dense Qwen-3.5 models

* Cleanup
2026-02-27 07:03:25 +01:00
Kawrakow
facc8fdc44 Very slightly better fused delta-net (#1330) 2026-02-27 07:03:09 +01:00
Kawrakow
62a7dcac5a Move the Qwen-3.5 models to the standard attention mechanism (#1329) 2026-02-26 15:50:51 +01:00
Kawrakow
757bee6238 Add special FA handling for dense Qwen3.5 (#1328) 2026-02-26 11:27:41 +01:00
Kawrakow
0aa6f7e7cd Adding support for dense Qwen-3.5 models (#1326) 2026-02-26 08:51:01 +01:00
Kawrakow
2616efa296 Fused delta net 2 (#1320)
* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

* Don't re-apply L2 norm - it has already been done

* This seems quite a bit better

* More tweaks

* Restore per context buffer size log

Not everybody uses models split into 2000 parts, and those who do
actually want to see the buffer sizes.
2026-02-26 06:53:43 +01:00
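The "change meaning of fdn from bool flag to threshold value" step above can be sketched as follows; the function name, parameter names, and the direction of the comparison are assumptions for illustration, not the actual implementation:

```cpp
// Hypothetical illustration: instead of a boolean on/off flag, the fused
// delta-net path is taken only when the batch is small enough relative to
// a user-supplied threshold. Names and comparison direction are assumptions.
static bool use_fused_delta_net(int n_tokens, int fdn_threshold) {
    if (fdn_threshold <= 0) {
        return false; // a non-positive threshold disables the fused path
    }
    return n_tokens <= fdn_threshold;
}
```

A threshold subsumes the old boolean: 0 behaves like "off", a very large value like "always on", and anything in between lets the fused path win where it is fastest (token generation) while the chunked path handles large prompt batches.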
Kawrakow
87b35dac0c Faster quantization for MoE models with many experts (#1322) 2026-02-26 06:52:28 +01:00
firecoperana
3fac78c48b server: enable checkpoint for recurrent models (#1310)
* server: enable checkpoint for recurrent models

create checkpoint after cancel

fix ban string and rm context during rewind

add checkpoint interval

only save recurrent cache

* save checkpoint during pp

---------

Co-authored-by: firecoperana <firecoperana>
2026-02-26 06:51:18 +01:00
Kawrakow
216f44363f Fix KT quantization yet again (#1321)
* Fix KT quantization yet again

* Add same 1e-16f check for all quants in iqk_quantize.cpp

* Fixes for k-quants

* Also this one
2026-02-25 18:07:12 +01:00
Kawrakow
c77ec4b8b8 Fused delta-net (#1315)
* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name
2026-02-25 14:12:48 +01:00
Nexes the Elder
0bf7043a7b Display the size of the tensors overridden during tensor loading (#1318)
* Display the size of the tensors overridden during tensor loading

Ex:

`Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU`

become

`Tensor blk.60.ffn_up_exps.weight (size = 668467200 bytes) buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight (size = 668467200 bytes) buffer type overriden to CPU`

And demote to debug level the later-displayed size of the unnamed buffer overrides.

Ex : `llm_load_tensors:        CPU buffer size =   XXX.XX MiB`

That double display clutters the screen without being very informative.

* change bytes display to MiB.

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

---------

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
2026-02-25 07:36:27 +01:00
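The "change bytes display to MiB" follow-up above amounts to a one-line conversion before logging; a minimal sketch, where the helper name is an assumption and not code from the PR:

```cpp
#include <cstdint>

// Hypothetical helper illustrating the "change bytes display to MiB" step
// mentioned above; the function name is an assumption, not code from the PR.
static double bytes_to_mib(uint64_t bytes) {
    return (double)bytes / (1024.0 * 1024.0);
}
```

With the 668467200-byte tensor quoted in the commit message, the logged size becomes 637.50 MiB, which is easier to scan than a raw byte count.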
Nexes the Elder
170467e835 Llama-quantize: Partial requant feature (#1313)
* Partial Requant feature for llama-quantize

- Inspired by the recently ported --dry-run feature.
- Allows partial requantization of a split quantized .gguf by requantizing only the splits missing from the destination directory.
- Works for GGUFs split tensor by tensor as well as by groups of several tensors (though the latter is not well tested beyond 2 tensors per split).
- Vibe coded.

* Create output directory if it doesn't exist in llama-quantize

* Create output directory if it doesn't exist in gguf-split

* Add exit when directory fails to be created on Windows

* Use std::filesystem

* cleanup
2026-02-25 07:25:15 +01:00
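The "create output directory if it doesn't exist" steps above, with the final move to std::filesystem, can be sketched like this; the function name and error handling are assumptions, not the actual llama-quantize code:

```cpp
#include <filesystem>
#include <system_error>

// Sketch of the "create output directory if it doesn't exist" step from the
// commit above, using std::filesystem as the later commits indicate. The
// function name and error handling are assumptions, not the actual code.
static bool ensure_output_dir(const std::filesystem::path &dir) {
    std::error_code ec;
    // create_directories creates all missing parents and is a no-op when the
    // directory already exists; the error_code overload never throws.
    std::filesystem::create_directories(dir, ec);
    return !ec && std::filesystem::is_directory(dir);
}
```

Using the error_code overload keeps the portable behavior the commits were after: the same call works on Windows and Linux, and the caller can exit cleanly when creation fails instead of catching an exception.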
Joshua Jolley
68431b049a server: propagate task index to response objects for batch requests (#1303)
When multiple prompts are sent in a single /v1/completions request,
each response needs to carry the correct index so the client can
match results to their corresponding prompts. The index field was
not being set on partial responses, final responses, or embedding
responses, causing batch results to all report index 0.

Set res->index = slot.task->index in send_partial_response,
send_final_response, and send_embedding.

Generated with [Devin](https://cli.devin.ai/docs)

Co-authored-by: Joshua Jolley <jjolley@clearwateranalytics.com>
Co-authored-by: Devin <noreply@cognition.ai>
2026-02-24 15:39:38 +01:00
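The index-propagation fix above reduces to copying the task's index onto every outgoing response object; a minimal sketch with hypothetical stand-in structs (the real change sets res->index = slot.task->index in send_partial_response, send_final_response, and send_embedding):

```cpp
// Hypothetical stand-ins for the server's task and response types.
struct server_task     { int index; };
struct server_response { int index = 0; };

// Without this copy, every response in a batch reports index 0 and the
// client cannot match results to their prompts.
static server_response make_response(const server_task &task) {
    server_response res;
    res.index = task.index; // the one-line fix, applied in all three senders
    return res;
}
```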
dungquixote42
aaa545c3dc adaptive p: collect probability before logit bias (#1314) 2026-02-24 15:39:17 +01:00
Kawrakow
38ca19d828 Minor delta-net tweak (#1308)
* Make sure we pick the reduced tensor from the right GPU

* Minor

* Minor delta-net tweak
2026-02-24 15:22:57 +01:00
Kawrakow
7065488135 Slightly better graph parallel for Qwen3-Next (#1307)
* Make sure we pick the reduced tensor from the right GPU

* Minor
2026-02-24 15:22:30 +01:00
Kawrakow
cfb6747776 llama-quantize: --dry-run option (#1309) 2026-02-24 15:21:52 +01:00
TheAIGuyFromAR
96b8298472 Fix typo in merge-up-gate-experts argument (#1311) 2026-02-24 15:13:22 +01:00
Kawrakow
68bd30d99c Fix max nodes (again) (#1306) 2026-02-23 11:17:37 +01:00
Kawrakow
2bb40f8c35 Fix llm_arch_is_hybrid (#1305) 2026-02-23 08:55:53 +01:00
Kawrakow
5dacb5355a Graph parallel for Qwen3-Next (#1292)
* WIP

* This works, but is slower than split mode layer
2026-02-23 07:58:00 +01:00
Yap Sok Ann
dcf50d8279 Fix tool call for Qwen3.5 (#1300)
* Fix tool call for Qwen3.5

Loosely based on mainline changes from:
* https://github.com/ggml-org/llama.cpp/pull/19635
* https://github.com/ggml-org/llama.cpp/pull/19765

Also need to change the grammar to allow the model to make multiple
tool calls in a row. This was likely broken for Qwen3 Coder prior to
this commit.

* Fix the grammar for the subsequent parameters after the first one
2026-02-23 07:54:56 +01:00
firecoperana
efc294cc39 server: fix crash from adaptive p (#1304)
Co-authored-by: firecoperana <firecoperana>
2026-02-23 07:25:52 +01:00
Kawrakow
89b1e2b518 Better estimate for max. number of compute nodes (#1296)
* Better estimate for max. number of compute nodes

* Just in case
2026-02-22 18:16:49 +01:00
Samuel Oliveira Alves
09a88c9ae5 Add MTP decoding support for GLM-4.x MoE (#1270)
* wip: port MTP architecture

Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ikllama`.

Changes include:
- Updating `llama_batch` to support `mtp_params`.
- Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft).
- Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`).
- Adapting the embedding extraction logic to skip MTP update passes.

* Refactors `server_slot` to support generic speculative decoding (MTP or Draft Model).

* core: enable hybrid outputs (logits + embeddings) for MTP support

* fix(mtp): correct KV-cache slot finding for updates

* fix(mtp): persist hidden states to prevent context corruption during drafting

* refactor(mtp): clean unused code

* fix(mtp): update server to new functions name

* fix(mtp): fix graph and save hidden state

* mtp: refactor integration, context params and kv cache search

* mtp: fix hidden state extraction and speculative acceptance flow

* server: fix MTP warmup for long prompts and reset token buffer

* llama: refactor MTP operation state to context parameters

* server: fix n_past calculation in MTP acceptance

* llama: fix mtp enable flags

* speculative: refactor MTP to use common_speculative interface

* context: remove unused signatures

* clip: fix deprecated enum-enum conversion warning

* common: fix format string crash in help message

* context: fix mtp activation logic
2026-02-22 18:14:39 +01:00
Kawrakow
cbf7fc7e2f Update README with warning about '_XL' models from Unsloth
Added important note regarding quantized models from Unsloth.
2026-02-22 07:42:17 +01:00
Kawrakow
bd387a279a Add new authors to the AUTHORS file 2026-02-21 19:20:31 +01:00
firecoperana
66323b92f7 Qwen3.5-MoE: fix regenerating message error (#1295)
Co-authored-by: firecoperana <firecoperana>
2026-02-21 18:24:12 +01:00
Kawrakow
13c3d83ce7 Qwen3.5-MoE support (#1288)
* WIP: loads and runs, but not correct

Very high PPL, empty TG.

* This appears to work
2026-02-21 08:33:06 +01:00
mcm007
b2cb4512c5 Create parameters overview (#1269)
* raw parameters.md

* fix small typos in common.cpp

* Update build args in parameters.md

* Update parameters.md

- format as table
- sections

* Update README.md

- quickstart
- build and run

* Update parameters.md

other tools examples

* add PR links

* multiple updates to parameters.md

- description
- add jargon section
- add suggestions from feedbacks

* don't imply that only linux is supported in README.md

* add alias to parameters.md

* Update README.md with recent models and features

* Update parameters.md with latest features

* address suggestions

- no-ooae
- placeholder for common commands
- no-kv-offload
- llama-sweep-bench
- placeholder for unique parameters

* specify Linux distro in README.md
2026-02-20 07:20:56 +01:00
dungquixote42
0f411b02e2 Fix adaptive p sampler bug with string ban (#1287)
* adaptive p: update internal state only if not rewinding

* adaptive p: conditional update for speculative decoding

* adaptive p: refactor to rewind instead of update

* adaptive p fix: better comments

* fix rewind check

* add record to handle multi-token rewind

* better comment
2026-02-20 07:11:36 +01:00
rkozuch
b855bf92de Fix slot prompt updating. (#1285)
Co-authored-by: Rkozuch <you@example.com>
2026-02-19 08:15:49 +01:00
Kawrakow
d81cde5cea Fix very low bpw missing imatrix check (#1284) 2026-02-19 08:15:26 +01:00
Samuel Oliveira Alves
51df09be8a Feat - add Kimi 2.5 Vision (#1280)
* port Kimi 2.5 vision from upstream

* feat(clip): add support for Kimi K2.5 vision model
2026-02-19 08:15:03 +01:00
Kawrakow
04cf685e82 Factor out delta net (#1286)
* WIP: factor out delta net implementation

* WIP

* Use the standard FFN functions

* More standard attn for Qwen3-Next
2026-02-18 17:16:17 +01:00
Kawrakow
d2d65c0d64 Better CPU performance for Qwen3-Next (#1283)
* Better CPU silu - +4% PP

* Improve ggml_compute_forward_dup_bytes
2026-02-18 15:55:11 +01:00
Kawrakow
84831fc3ee Don't disable CUDA graphs for Qwen3-Next (#1278) 2026-02-18 08:47:45 +01:00
Kawrakow
cafeef484c More Qwen3-Next optimizations (#1277)
* Optimizing q3next TG

* Fused add -> softplus -> mul on CUDA

* Remove forgotten debug log

* Increase ggml context size

Required for Qwen3-Next with batch/u-batch size of 4096

* WIP

* Avoid some contiguous ops

* Avoid some repeats

* Avoid some more repeats
2026-02-17 16:03:51 +01:00
Samuel Oliveira Alves
88f98c891d server: add string ban in speculative path (#1274) 2026-02-17 12:33:28 +01:00
Kawrakow
16fe459a49 Faster CPU PP performance for Qwen3-Next - optimize concat (#1276) 2026-02-17 11:46:27 +01:00
Kawrakow
35c99f9f41 Faster Qwen3-Next PP on CUDA - optimize concat (#1275) 2026-02-16 11:46:39 +01:00
Kawrakow
97e7c091cd Update AUTHORS file with new contributors
Added new contributors to the AUTHORS file.
2026-02-16 07:13:25 +01:00
firecoperana
868ac2128e fix build error (#1272)
Co-authored-by: firecoperana <firecoperana>
2026-02-16 06:51:03 +01:00