ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-03-07 04:20:03 +00:00

Author	SHA1	Message	Date
firecoperana	2add439e43	grammar: fix trigger pattern init error (#1365 ) Co-authored-by: firecoperana <firecoperana>	2026-03-05 07:54:41 +01:00
dungquixote42	a903409a5e	fix adaptive p sampler rewinding too far back (#1359 ) * fix adaptive p sampler rewinding too far back * update comments * correct default value for total_weight, more comments * new variables/names * update comment for n_rewind * move null pointer check back to common_sampler_review() * refactor weighted_sum and total_weight to vector<pair>, better boundary check in llama_review_adaptive_p_impl()	2026-03-04 13:26:25 +01:00
Kawrakow	f27678d39b	ARM_NEON fused delta-net implementation (#1361 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-03-04 13:24:59 +01:00
mullecofo	2f93bf7563	Fix compilation on clang-cl.exe (#1355 ) Fixes https://github.com/ikawrakow/ik_llama.cpp/issues/1169 See bitwise arithmetic here: https://clang.llvm.org/doxygen/avx512fintrin_8h_source.html Clang (and GCC) support a language feature called Vector Extensions. To Clang, `__m512i` is not just a "struct" or a "bag of bits"; it is recognized by the compiler as a native vector type. Because it is a native vector type, Clang automatically maps standard C operators to the corresponding hardware instructions. When you write `a | b`, Clang sees that a and b are 512-bit integer vectors. It implicitly understands that the bitwise OR operator (|) applies to these vectors. It automatically generates the VPORQ (or VPORD) instruction without needing any helper function. MSVC follows a stricter, more traditional C++ model regarding intrinsics. In MSVC, __m512i is defined in the header files (<immintrin.h>) as a struct or union (e.g., typedef struct __m512i { ... } __m512i). To the MSVC compiler, it is essentially a user-defined data type, not a fundamental language primitive like int or float. Standard C++ does not define what `|` means for a user-defined struct. MSVC does not have the same "Vector Extensions" that automatically apply operators to these structs. When you write `a | b` in MSVC, the compiler looks for a definition of `operator|` for the __m512i struct. Since the standard headers don't provide one, the compiler throws an error. You must use the explicit intrinsic function provided by Intel/MSVC: _mm512_or_si512(a, b). To get the nice syntax `(a | b)` in MSVC, you have to manually "teach" the compiler what `|` means by defining the `operator|` overload yourself.	2026-03-04 08:00:28 +01:00
Kawrakow	fd16a418de	Fix clang warnings on macOS (#1354 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-03-03 16:27:16 +01:00
Yap Sok Ann	ea3e8e30e1	Allow arbitrary arguments order for Q3C, Q3CN, and Qwen3.5 (#1352 ) This should fix the read file at offset/limit issue, where the tool definition has offset before limit, while the model sets limit before offset.	2026-03-03 15:39:16 +01:00
Kawrakow	505e2c57f9	Reduce memory use when processing large images (#1349 )	2026-03-02 17:54:56 +01:00
Kawrakow	3735e88925	Remove unused tensors from delta-net (#1350 )	2026-03-02 16:02:40 +01:00
Nexes the Elder	d4ac5f1566	gguf-split: fix the split output files naming (#1336 ) * Fix gguf-split.cpp splits output naming With this fix, the initial extension of the source .gguf file is not included in the naming of the output file before the numeration of the splits. ex: No more model.gguf-00001-of-00200.gguf Instead, model-00001-of-00200.gguf * increase ggml_max_context to 2048 * Revert GGML_MAX_CONTEXTS to 64	2026-03-02 08:43:47 +01:00
Kawrakow	d239dabcc6	Graph parallel for Qwen-3.5-MoE (#1347 ) * Graph parallel for Qwen3.5-MoE * Add --max-gpu to llama-bench * Fix graph reuse when not all GPUs participate in self-attention	2026-03-02 07:48:43 +01:00
firecoperana	8f9e19d57c	server: add checkpoint tolerance and fix grammar_trigger init (#1346 ) Co-authored-by: firecoperana <firecoperana>	2026-03-02 07:45:32 +01:00
Kawrakow	a568e12c8f	Minor delta-net tweak (#1337 )	2026-03-01 17:45:02 +01:00
Kawrakow	04c140fe54	Make vision work with Qwen-3.5 models (#1345 )	2026-03-01 17:44:37 +01:00
Kawrakow	0ff3a43289	Bring back #1333 and #1335 (#1340 ) * Bring back fused delta net 3 * Remove autoregressive and chunking	2026-02-28 14:31:42 +01:00
Kawrakow	1922449b2c	Revert delta net 3 (#1339 ) * Revert "Simplify delta-net (#1335)" This reverts commit `e5fc30244c`. * Revert "Fused delta net 3 (#1333)" This reverts commit `7b68353e09`.	2026-02-28 13:12:08 +01:00
Kawrakow	e5fc30244c	Simplify delta-net (#1335 ) * Simplify delta-net * Minor * Minor	2026-02-28 11:12:19 +01:00
Kawrakow	702e0765b8	Update README with clarification on '_XL' models Clarified warning about Unsloth '_XL' models in README.	2026-02-27 16:22:10 +01:00
Kawrakow	7b68353e09	Fused delta net 3 (#1333 ) * This is better than chunked * Keep the state in registers * Cleanup * Remove unused stuff * Minor * Make fused delta-net the default * Fix race	2026-02-27 15:02:56 +01:00
Kawrakow	1e6d36b1b4	Graph parallel for dense Qwen-3.5 models (#1331 ) * Graph parallel for dense Qwen-3.5 models * Cleanup	2026-02-27 07:03:25 +01:00
Kawrakow	facc8fdc44	Very slightly better fused delta-net (#1330 )	2026-02-27 07:03:09 +01:00
Kawrakow	62a7dcac5a	Move the Qwen-3.5 models to the standard attention mechanism (#1329 )	2026-02-26 15:50:51 +01:00
Kawrakow	757bee6238	Add special FA handling for dense Qwen3.5 (#1328 )	2026-02-26 11:27:41 +01:00
Kawrakow	0aa6f7e7cd	Adding support for dense Qwen-3.5 models (#1326 )	2026-02-26 08:51:01 +01:00
Kawrakow	2616efa296	Fused delta net 2 (#1320 ) * Revive fused delta-net * Add command line argument for fused delta net * Simplify/improve CUDA delta-net * Add -fdn to llama-bench * More CUDA fused delta net optimizations * CPU optimizations * Much faster fused delta-net on the CPU It seems it is faster than the chunked implementation! * Change meaning of fdn from bool flag to threshold value * Use eps = 1e-6 * Give some nodes a name * Don't re-apply L2 norm - it has already been done * This seems quite a bit better * More tweaks * Restore per context buffer size log Not everybody uses models split in 2000 parts, and those who do, actually want to see the buffer sizes.	2026-02-26 06:53:43 +01:00
Kawrakow	87b35dac0c	Faster quantization for MoE models with many experts (#1322 )	2026-02-26 06:52:28 +01:00
firecoperana	3fac78c48b	server: enable checkpoint for recurrent models (#1310 ) * server: enable checkpoint for recurrent models create checkpoint after cancel fix ban string and rm context during rewind add checkpoint interval only save recurrent cache * save checkpoint during pp --------- Co-authored-by: firecoperana <firecoperana>	2026-02-26 06:51:18 +01:00
Kawrakow	216f44363f	Fix KT quantization yet again (#1321 ) * Fix KT quantization yet again * Add same 1e-16f check for all quants in iqk_quantize.cpp * Fixes for k-quants * Also this one	2026-02-25 18:07:12 +01:00
Kawrakow	c77ec4b8b8	Fused delta-net (#1315 ) * Revive fused delta-net * Add command line argument for fused delta net * Simplify/improve CUDA delta-net * Add -fdn to llama-bench * More CUDA fused delta net optimizations * CPU optimizations * Much faster fused delta-net on the CPU It seems it is faster than the chunked implementation! * Change meaning of fdn from bool flag to threshold value * Use eps = 1e-6 * Give some nodes a name	2026-02-25 14:12:48 +01:00
Nexes the Elder	0bf7043a7b	Display the size of the tensors overridden during the tensor loading (#1318 ) * Display the size of the tensors overridden during the tensor loading Ex: `Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU` becomes `Tensor blk.60.ffn_up_exps.weight (size = 668467200 bytes) buffer type overriden to CPU Tensor blk.60.ffn_gate_exps.weight (size = 668467200 bytes) buffer type overriden to CPU` And pass in debug the later displayed size of the unnamed buffer overrides. Ex: `llm_load_tensors: CPU buffer size = XXX.XX MiB` That double display is cluttering the screen without being very informative. * change bytes display to MiB. Co-authored-by: Kawrakow <iwankawrakow@gmail.com> --------- Co-authored-by: Kawrakow <iwankawrakow@gmail.com>	2026-02-25 07:36:27 +01:00
Nexes the Elder	170467e835	Llama-quantize: Partial requant feature (#1313 ) * Partial Requant feature for llama-quantize - Inspired by the recently portcopied --dry-run feature. - Allows to partially requantize a split quantized .gguf by requantizing only the missing splits in the destination directory. - Works both for GGUF which are split tensors by tensors, or by group of several tensors (though this one is not very much tested beyond 2 tensors by split). - Vibe coded. * Create output directory if it doesn't exist in llama-quantize * Create output directory if it doesn't exist in gguf-split * Add exit when directory fails to be created on Windows * Use std::filesystem * cleanup	2026-02-25 07:25:15 +01:00
Joshua Jolley	68431b049a	server: propagate task index to response objects for batch requests (#1303 ) When multiple prompts are sent in a single /v1/completions request, each response needs to carry the correct index so the client can match results to their corresponding prompts. The index field was not being set on partial responses, final responses, or embedding responses, causing batch results to all report index 0. Set res->index = slot.task->index in send_partial_response, send_final_response, and send_embedding. Generated with [Devin](https://cli.devin.ai/docs) Co-authored-by: Joshua Jolley <jjolley@clearwateranalytics.com> Co-authored-by: Devin <noreply@cognition.ai>	2026-02-24 15:39:38 +01:00
dungquixote42	aaa545c3dc	adaptive p: collect probability before logit bias (#1314 )	2026-02-24 15:39:17 +01:00
Kawrakow	38ca19d828	Minor delta-net tweak (#1308 ) * Make sure we pick the reduced tensor from the right GPU * Minor * Minor delta-net tweak	2026-02-24 15:22:57 +01:00
Kawrakow	7065488135	Slightly better graph parallel for Qwen3-Next (#1307 ) * Make sure we pick the reduced tensor from the right GPU * Minor	2026-02-24 15:22:30 +01:00
Kawrakow	cfb6747776	llama-quantize: --dry-run option (#1309 )	2026-02-24 15:21:52 +01:00
TheAIGuyFromAR	96b8298472	Fix typo in merge-up-gate-experts argument (#1311 )	2026-02-24 15:13:22 +01:00
Kawrakow	68bd30d99c	Fix max nodes (again) (#1306 )	2026-02-23 11:17:37 +01:00
Kawrakow	2bb40f8c35	Fix llm_arch_is_hybrid (#1305 )	2026-02-23 08:55:53 +01:00
Kawrakow	5dacb5355a	Graph parallel for Qwen3-Next (#1292 ) * WIP * This works, but is slower than split mode layer	2026-02-23 07:58:00 +01:00
Yap Sok Ann	dcf50d8279	Fix tool call for Qwen3.5 (#1300 ) * Fix tool call for Qwen3.5 Loosely based on mainline changes from: * https://github.com/ggml-org/llama.cpp/pull/19635 * https://github.com/ggml-org/llama.cpp/pull/19765 Also need to change the grammar to allow the model to make multiple tool calls in a row. This was likely broken for Qwen3 Coder prior to this commit. * Fix the grammar for the subsequent parameters after the first one	2026-02-23 07:54:56 +01:00
firecoperana	efc294cc39	server: fix crash from adaptive p (#1304 ) Co-authored-by: firecoperana <firecoperana>	2026-02-23 07:25:52 +01:00
Kawrakow	89b1e2b518	Better estimate for max. number of compute nodes (#1296 ) * Better estimate for max. number of compute nodes * Just in case	2026-02-22 18:16:49 +01:00
Samuel Oliveira Alves	09a88c9ae5	Add MTP decoding support for GLM-4.x MoE (#1270 ) * wip: port MTP architecture Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ikllama`. Changes include: - Updating `llama_batch` to support `mtp_params`. - Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft). - Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`). - Adapting the embedding extraction logic to skip MTP update passes. * Refactors `server_slot` to support generic speculative decoding (MTP or Draft Model). * core: enable hybrid outputs (logits + embeddings) for MTP support * fix(mtp): correct KV-cache slot finding for updates * fix(mtp): persist hidden states to prevent context corruption during drafting * refactor(mtp): clean unused code * fix(mtp): update server to new functions name * fix(mtp): fix graph and save hidden state * mtp: refactor integration, context params and kv cache search * mtp: fix hidden state extraction and speculative acceptance flow * server: fix MTP warmup for long prompts and reset token buffer * llama: refactor MTP operation state to context parameters * server: fix n_past calculation in MTP acceptance * llama: fix mtp enable flags * speculative: refactor MTP to use common_speculative interface * context: remove unused signatures * clip: fix deprecated enum-enum conversion warning * common: fix format string crash in help message * context: fix mtp activation logic	2026-02-22 18:14:39 +01:00
Kawrakow	cbf7fc7e2f	Update README with warning about '_XL' models from Unsloth Added important note regarding quantized models from Unsloth.	2026-02-22 07:42:17 +01:00
Kawrakow	bd387a279a	Add new authors to the AUTHORS file	2026-02-21 19:20:31 +01:00
firecoperana	66323b92f7	Qwen3.5-MoE: fix regenerating message error (#1295 ) Co-authored-by: firecoperana <firecoperana>	2026-02-21 18:24:12 +01:00
Kawrakow	13c3d83ce7	Qwen3.5-MoE support (#1288 ) * WIP: loads and runs, but not correct Very high PPL, empty TG. * This appears to work	2026-02-21 08:33:06 +01:00
mcm007	b2cb4512c5	Create parameters overview (#1269 ) * raw parameters.md * fix small typos in common.cpp * Update build args in parameters.md * Update parameters.md - format as table - sections * Update README.md - quickstart - build and run * Update parameters.md other tools examples * add PR links * multiple updates to parameters.md - description - add jargon section - add suggestions from feedbacks * don't imply that only linux is supported in README.md * add alias to parameters.md * Update README.md with recent models and features * Update parameters.md with latest features * address suggestions - no-ooae - placeholder for common commands - no-kv-offload - llama-sweep-bench - placeholder for unique parameters * specify Linux distro in README.md	2026-02-20 07:20:56 +01:00
dungquixote42	0f411b02e2	Fix adaptive p sampler bug with string ban (#1287 ) * adaptive p: update internal state only if not rewinding * adaptive p: conditional update for speculative decoding * adaptive p: refactor to rewind instead of update * adaptive p fix: better comments * fix rewind check * add record to handle multi-token rewind * better comment	2026-02-20 07:11:36 +01:00
rkozuch	b855bf92de	Fix slot prompt updating. (#1285 ) Co-authored-by: Rkozuch <you@example.com>	2026-02-19 08:15:49 +01:00
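
The message of commit 2f93bf7563 (#1355) explains that a compiler which treats `__m512i` as an ordinary struct has no built-in meaning for `a | b`, so the operator has to be defined by hand on top of the explicit intrinsic. The following is a minimal, portable sketch of that idea only. `Vec512` and `vec512_or` are hypothetical stand-ins for `__m512i` and `_mm512_or_si512`, chosen so the example compiles without AVX-512 support; this is not the code from the commit.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 512-bit integer vector that the compiler sees
// as a plain user-defined struct (the situation described for MSVC's __m512i).
struct Vec512 {
    std::array<std::uint64_t, 8> lanes{};  // 8 x 64 bits = 512 bits
};

// Hypothetical stand-in for the explicit intrinsic _mm512_or_si512(a, b):
// a lane-by-lane bitwise OR.
inline Vec512 vec512_or(const Vec512& a, const Vec512& b) {
    Vec512 r;
    for (std::size_t i = 0; i < r.lanes.size(); ++i) {
        r.lanes[i] = a.lanes[i] | b.lanes[i];
    }
    return r;
}

// Without this overload, `a | b` does not compile for a user-defined struct.
// Defining it "teaches" the compiler to forward the operator to the helper.
inline Vec512 operator|(const Vec512& a, const Vec512& b) {
    return vec512_or(a, b);
}
```

With the overload in scope, `Vec512 c = a | b;` compiles and is equivalent to `vec512_or(a, b)`. In real code the overload would presumably be guarded so that it is only defined for compilers that lack the native vector operators, since defining it where `|` already works on the type would conflict.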


4258 Commits
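
Commit d4ac5f1566 (#1336) describes the split naming fix in words: the source file's `.gguf` extension is dropped before the split numbering is appended, giving `model-00001-of-00200.gguf` instead of `model.gguf-00001-of-00200.gguf`. A rough sketch of that naming rule is shown below. The helper names `strip_gguf_extension` and `split_file_name` are hypothetical and are not taken from `gguf-split.cpp`; the five-digit padding is inferred from the example in the commit message.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: remove a trailing ".gguf" if present, otherwise
// return the path unchanged.
inline std::string strip_gguf_extension(const std::string& path) {
    const std::string ext = ".gguf";
    if (path.size() >= ext.size() &&
        path.compare(path.size() - ext.size(), ext.size(), ext) == 0) {
        return path.substr(0, path.size() - ext.size());
    }
    return path;
}

// Hypothetical helper: build the name of split `index` out of `total`,
// using the "-NNNNN-of-NNNNN.gguf" pattern from the commit message.
inline std::string split_file_name(const std::string& source, int index, int total) {
    char suffix[64];
    std::snprintf(suffix, sizeof(suffix), "-%05d-of-%05d.gguf", index, total);
    return strip_gguf_extension(source) + suffix;
}
```

Under these assumptions, `split_file_name("model.gguf", 1, 200)` yields `model-00001-of-00200.gguf`, and a source name without the extension is numbered the same way.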