* Introducing rope cache
When computing RoPE, the rotation angles in each layer
are exactly the same and depend only on the token positions
(and other constant, model-dependent parameters).
So I wondered: why not compute the angles just once
and then reuse them for the Q and K RoPE in each layer?
A sketch of the idea is included at the end of this list.
This commit does it as a POC on the CPU, and uses it in
the Qwen3-MoE compute graph.
* cuda: neox works
* WIP
* rope_cache: norm works
* Fused rope+rope
* Fused rope+rope (norm)
* Fused rms+rms+rope+rope (neox) - not working
* WIP
* Also qwen3
* Add command line arg to disable rope cache
* Disable RoPE cache if rope type is not neox or norm
* Add missing break after merge with main
* Fused fused_rms+fused_rms+rope+rope (with -mqkv)
* Fused fused_rms+fused_rms+rope+rope (without -mqkv)
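A minimal sketch of the RoPE cache idea described above, with hypothetical names (`RopeCache`, `build`); the real code lives in the ggml compute graph, this only shows that the cos/sin table depends solely on the positions and constant parameters and can therefore be shared by the Q and K rotations in every layer.
```cpp
#include <cmath>
#include <cstdint>
#include <vector>

struct RopeCache {
    std::vector<float> cos_sin; // [n_tokens, n_dims/2, 2]
    int n_dims = 0;

    void build(const std::vector<int32_t> & positions, int n_dims_, float freq_base) {
        n_dims = n_dims_;
        cos_sin.resize(positions.size() * n_dims);
        for (size_t i = 0; i < positions.size(); ++i) {
            for (int j = 0; j < n_dims/2; ++j) {
                // theta depends only on the position and constant parameters
                const float theta = positions[i] * std::pow(freq_base, -2.0f*j/n_dims);
                cos_sin[i*n_dims + 2*j + 0] = std::cos(theta);
                cos_sin[i*n_dims + 2*j + 1] = std::sin(theta);
            }
        }
    }
    // An apply step would then rotate Q or K pairs using the cached values,
    // skipping the per-layer cos/sin recomputation.
};
```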
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Biased mmvq: minor optimization
* Fusing Q and K rms_norm for TG on CUDA
* Remove commented out code
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Don't use vector kernels if K or V are quantized
* Correctly determine if FA is supported
* Also wmma
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse Q, K, V gemv+add
* More gemv+add fusing
* Faster copy when tensors are contiguous
Relevant for storing data into the KV cache. I see ~1% speedup
for fast models (Ling-mini-2.0, gpt-oss-20b, etc.)
* Cleanup
* Make sure the bias really is 1 row to use fusion
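A minimal sketch of the contiguous-copy fast path mentioned above (illustrative, not the actual ggml kernel): when source and destination are both contiguous, the strided row loop collapses into a single memcpy, which is what matters when writing into the KV cache.
```cpp
#include <cstddef>
#include <cstring>

void copy_rows(const char * src, char * dst, size_t row_size, size_t n_rows,
               size_t src_stride, size_t dst_stride) {
    if (src_stride == row_size && dst_stride == row_size) {
        std::memcpy(dst, src, row_size * n_rows); // contiguous: one big copy
        return;
    }
    for (size_t i = 0; i < n_rows; ++i) {         // generic strided fallback
        std::memcpy(dst + i*dst_stride, src + i*src_stride, row_size);
    }
}
```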
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Args for MMVQ functions
* WIP
* Fused ffn_up*unary_op(ffn_gate) for MMVQ (no bias)
We see nearly 2% TG speedup for Ling-mini-2.0 and
about 1% for DeepSeek-Lite.
* Fused ffn_up*unary_op(ffn_gate) for MMVQ (with bias)
* Fusing also for iqk/trellis/repacked quants
* Fusing mmvq also in non-MoE up+gate
* Fuse mul_mat_id and add_id into a single kernel for mmvq
* Also iqk quants
* Split mmvq.cu and iqk_mmvq.cu into separate template instances
* Put iqk mmvq implementations into template instances
* Somehow I forgot to change the ggml_type in the legacy template calls
* Add diagnostics
* Disable assert
* Fix TG fused up*unary(gate) when down cannot be fused
The wrong memory buffer was used in that case
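A minimal sketch of the fused ffn_up * unary_op(ffn_gate) idea for a single token, written as plain C++ with SiLU as the unary op (the real kernels are MMVQ CUDA code operating on quantized weights): both matrix-vector products are computed in one pass and the activation is applied immediately, so the intermediate gate result is never written out and re-read.
```cpp
#include <cmath>
#include <vector>

// y[i] = up_row_i.x * silu(gate_row_i.x); up and gate are row-major [n_out, n_in]
std::vector<float> fused_up_gate(const std::vector<float> & x,
                                 const std::vector<float> & up,
                                 const std::vector<float> & gate,
                                 int n_in, int n_out) {
    std::vector<float> y(n_out);
    for (int i = 0; i < n_out; ++i) {
        float su = 0.0f, sg = 0.0f;
        for (int j = 0; j < n_in; ++j) {
            su += up  [i*n_in + j] * x[j];
            sg += gate[i*n_in + j] * x[j];
        }
        y[i] = su * (sg / (1.0f + std::exp(-sg))); // up * silu(gate)
    }
    return y;
}
```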
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Adding fused mul+multi_add + CPU implementation
* fused mul+multi_add: command line argument to disable it
* Faster tensor name formatting
We gain ~1% for Ling-mini-2.0 when running on CUDA.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Adding fused mul+multi_add + CPU implementation
* fused mul+multi_add: CUDA
* fused mul+multi_add: command line argument to disable it
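A minimal sketch of the fused mul + multi_add idea (illustrative only, hypothetical function name): the scale-and-accumulate over several inputs is done in one pass instead of materializing each product and summing the results in a separate op.
```cpp
#include <algorithm>
#include <vector>

void fused_mul_multi_add(const std::vector<std::vector<float>> & inputs,
                         const std::vector<float> & weights, // one scale per input
                         std::vector<float> & out) {
    std::fill(out.begin(), out.end(), 0.0f);
    for (size_t k = 0; k < inputs.size(); ++k) {
        for (size_t i = 0; i < out.size(); ++i) {
            out[i] += weights[k] * inputs[k][i]; // multiply and accumulate, fused
        }
    }
}
```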
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse add+add+fused_rms
* Try this
* Macro to easily enable/disable fusion
* Various:
* Check that all tensors involved are on the same device before applying fusion
* Fuse sigmoid+scale+sum_rows+div
* Fix the fused bailingmoe2 experts selection
The issue there was that the bias was not per row but per
expert group, so only the first n_per_group biases were used
for all experts.
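A minimal sketch of the fused sigmoid+scale+sum_rows+div chain from this list (illustrative; the exact placement of the scale follows the model, here it is applied before the row normalization):
```cpp
#include <cmath>
#include <vector>

// Turn the selected expert scores into normalized per-token weights in one pass.
void fused_expert_weights(std::vector<float> & w, float scale) {
    float sum = 0.0f;
    for (float & v : w) {
        v = scale / (1.0f + std::exp(-v)); // sigmoid + scale
        sum += v;                          // sum_rows
    }
    for (float & v : w) v /= sum;          // div
}
```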
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Combine all calls to llm_build_norm into a single line
so it is easier to check what kind of arguments are being passed
by simply using grep.
* Combine add + fused_rms_norm
For many models this happens at each layer: the result of the
layer is added to the layer input, which then becomes the input
to the next layer, where it is typically normalized via
fused_rms_norm. A sketch of the combined step follows below.
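A minimal sketch of the combined add + fused_rms_norm step (illustrative, hypothetical function name): the residual add and the RMS normalization of the sum are done in a single pass over the activations.
```cpp
#include <cmath>
#include <vector>

// residual is updated in place; normed must have the same size as residual.
void fused_add_rms_norm(std::vector<float> & residual,
                        const std::vector<float> & layer_out,
                        const std::vector<float> & norm_weight,
                        std::vector<float> & normed, float eps = 1e-6f) {
    float sumsq = 0.0f;
    for (size_t i = 0; i < residual.size(); ++i) {
        residual[i] += layer_out[i];        // add: new residual stream
        sumsq += residual[i] * residual[i];
    }
    const float inv_rms = 1.0f / std::sqrt(sumsq / residual.size() + eps);
    for (size_t i = 0; i < residual.size(); ++i) {
        normed[i] = residual[i] * inv_rms * norm_weight[i]; // fused_rms_norm
    }
}
```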
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse sigmoid+add+grouped_topk+get_rows (CPU)
* Fix CPU + CUDA
but CUDA is somehow not 100% correct, as I get a slightly
different (lower!) PPL
* Minor
* Fuse sigmoid+add+topk+get_rows (CUDA)
* Fuse sigmoid+add+topk+get_rows (CPU)
* Fuse topk+view+get_rows+reshape+softmax (CPU)
* Fuse topk+view+get_rows+reshape+softmax (CUDA)
* cpu: turn off the openai topk fusing for now
Something is not right and I don't see the bug.
On the CPU one doesn't gain much, if anything, so it is not a big loss.
* Also fuse sum_rows and div
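A minimal sketch of the fused sigmoid+add+topk+get_rows routing (one plausible reading, DeepSeek-style selection where a per-expert bias is added only for ranking, while the gathered weights are the unbiased sigmoid scores):
```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

std::vector<std::pair<int,float>> route_topk(const std::vector<float> & logits,
                                             const std::vector<float> & bias, int k) {
    const int n = (int)logits.size();
    std::vector<float> p(n), ranked(n);
    for (int i = 0; i < n; ++i) {
        p[i]      = 1.0f / (1.0f + std::exp(-logits[i])); // sigmoid
        ranked[i] = p[i] + bias[i];                       // add selection bias
    }
    std::vector<int> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin()+k, idx.end(),
                      [&](int a, int b) { return ranked[a] > ranked[b]; }); // topk
    std::vector<std::pair<int,float>> out;
    for (int i = 0; i < k; ++i) out.emplace_back(idx[i], p[idx[i]]);        // get_rows
    return out;
}
```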
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Better argsort (CPU)
* Attempt at grouped topk
* This seems to do the trick for grouped expert routing
* Cleanup
* Trying to merge, something is not right
* Working merged grouped top_k (CPU)
* Add command line option to enable grouped expert routing
* Add grouped expert routing option to llama-bench
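A minimal sketch of grouped expert routing (illustrative; the group score is simplified here to the group's best expert, real models may use e.g. the sum of the top two experts per group): the best groups are kept first, and the final top-k is taken only from experts in the surviving groups.
```cpp
#include <algorithm>
#include <numeric>
#include <vector>

std::vector<int> grouped_topk(const std::vector<float> & scores,
                              int n_groups, int top_groups, int top_k) {
    const int n = (int)scores.size(), per_group = n / n_groups;
    // 1. score each group and keep the top_groups best groups
    std::vector<int> g(n_groups);
    std::iota(g.begin(), g.end(), 0);
    auto group_score = [&](int gi) {
        return *std::max_element(scores.begin() + gi*per_group,
                                 scores.begin() + (gi+1)*per_group);
    };
    std::partial_sort(g.begin(), g.begin()+top_groups, g.end(),
                      [&](int a, int b) { return group_score(a) > group_score(b); });
    // 2. take the global top_k only from experts inside the kept groups
    std::vector<int> candidates;
    for (int i = 0; i < top_groups; ++i)
        for (int j = 0; j < per_group; ++j) candidates.push_back(g[i]*per_group + j);
    std::partial_sort(candidates.begin(), candidates.begin()+top_k, candidates.end(),
                      [&](int a, int b) { return scores[a] > scores[b]; });
    candidates.resize(top_k);
    return candidates;
}
```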
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Parallelize mask
We see non-negligible PP gains for long contexts.
More importantly, the strange drop in performance
observed for GPT-OSS for context >= 32k tokens is gone.
* With FA on, create the mask as f16 directly
* WIP
* Reduce KQ mask padding to 16
Why was it 64 in the first place?
I don't observe any issues, while TG performance
for long contexts improves by 2-4%.
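A minimal sketch of building the KQ mask in parallel (illustrative: a plain std::thread split over mask rows, with f32 values standing in for the f16 the real code writes directly when FA is on).
```cpp
#include <limits>
#include <thread>
#include <vector>

// mask must have n_q*n_kv entries; entry is 0 if key j is visible to query i, -inf otherwise.
void build_causal_mask(std::vector<float> & mask, int n_q, int n_kv, int n_threads) {
    const float neg_inf = -std::numeric_limits<float>::infinity();
    auto work = [&](int t) {
        for (int i = t; i < n_q; i += n_threads) {            // rows split across threads
            const int last_visible = n_kv - n_q + i;          // causal bound for this query
            for (int j = 0; j < n_kv; ++j) {
                mask[(size_t)i*n_kv + j] = j <= last_visible ? 0.0f : neg_inf;
            }
        }
    };
    std::vector<std::thread> threads;
    for (int t = 0; t < n_threads; ++t) threads.emplace_back(work, t);
    for (auto & th : threads) th.join();
}
```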
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Add mtmd: the beginning
* Add mtmd: mtmd.cpp compiles
* Add mtmd: clip initialization compiles
* Add mtmd: clip.cpp compiles
* Add mtmd: builds successfully
* Add CPU implementation for GGML_OP_GLU
* Add CUDA implementation for GGML_OP_GLU
* Add CPU implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW
* Add CUDA implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW
* Add mtmd: refresh CPU rope
* Add mtmd: refresh CUDA rope
* Add mtmd: add Qwen2-VL
* Add mtmd: Qwen2.5-VL text seems to work with this change
* Add mtmd: fix swiglu
* Add mtmd: use LOG_TEE so generated tokens show up in terminal
* Add mtmd: do not attempt to load a GPU backend if none are available
* GLU, not GPU
* Fix typo
* Fix new/free mismatch
* LOG stuff
* Add mtmd: this fixes gibberish on second image
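A minimal CPU sketch of a swiglu-style GGML_OP_GLU row (illustrative; which half of the row acts as the gate is an assumption here): the row is split in half, the first half goes through SiLU and is multiplied element-wise by the second half.
```cpp
#include <cmath>
#include <vector>

// row layout: [x_0..x_{n-1}, g_0..g_{n-1}]  ->  out_i = silu(x_i) * g_i
std::vector<float> swiglu_row(const std::vector<float> & row) {
    const size_t n = row.size() / 2;
    std::vector<float> out(n);
    for (size_t i = 0; i < n; ++i) {
        const float x = row[i], g = row[n + i];
        out[i] = (x / (1.0f + std::exp(-x))) * g; // silu(x) * gate
    }
    return out;
}
```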
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Avoid computing FA chunks where the mask is -infinity
* Avoid computing FA chunks where the mask is -infinity also for f16/bf16
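A minimal sketch of the check that lets FA skip fully masked chunks (illustrative): if every mask entry for a block of KV positions is -infinity, the block contributes nothing and its K*Q and V accumulation can be skipped.
```cpp
#include <limits>

bool chunk_fully_masked(const float * mask_row, int j0, int j1) {
    const float neg_inf = -std::numeric_limits<float>::infinity();
    for (int j = j0; j < j1; ++j) {
        if (mask_row[j] != neg_inf) return false; // at least one visible key
    }
    return true; // entire chunk masked out -> skip it
}
```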
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Quick attempt to fuse the Q, K, V GEMMs
Doesn't do much on the CPU
* Doesn't do much on the GPU either
* Use llm_build_mul_mat_qkv
* This is not needed
* Revert timing code committed by mistake
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Offload only activated experts
* This seems to do the trick for -fmoe
* Do not recalculate activated experts for fused up/gate
* Log out of bounds access details
* Add a command line argument
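A minimal sketch of one plausible reading of "offload only activated experts" (hypothetical helper name): collect the set of experts actually selected by the routing for the current batch, so only those expert tensors need to be offloaded and computed instead of all n_expert of them.
```cpp
#include <cstdint>
#include <vector>

std::vector<int32_t> activated_experts(const std::vector<int32_t> & selected_ids,
                                       int n_expert) {
    std::vector<uint8_t> used(n_expert, 0);
    for (int32_t id : selected_ids) {
        if (id >= 0 && id < n_expert) used[id] = 1; // ids from the top-k routing
    }
    std::vector<int32_t> active;
    for (int32_t e = 0; e < n_expert; ++e) {
        if (used[e]) active.push_back(e);           // offload/compute only these
    }
    return active;
}
```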
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Bounds for flash attention
* Add n_swa to FA parameters
* Fix it
* This seems very slightly better
* Using vec kernel when we have SWA
* Need also this
* f32 vec kernel
* This is slightly better
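A minimal sketch of the per-query KV bounds implied by a sliding window (illustrative; the exact window convention may differ): the FA kernel only needs to loop over [first, last) instead of scanning the whole KV cache.
```cpp
#include <algorithm>
#include <utility>

std::pair<int,int> swa_kv_bounds(int pos, int n_swa, int n_kv) {
    const int first = std::max(0, pos - n_swa + 1); // oldest key inside the window
    const int last  = std::min(n_kv, pos + 1);      // causal upper bound
    return {first, last};
}
```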
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fused up+gate+unary for regular (not MoE) FFN - CPU
* WIP CUDA
* Seems to be working on CUDA
For a dense model we get 2-3% speedup for PP and ~0.6% for TG.
* Add command line option
This time the option is ON by default, and one needs to turn it
off via -no-fug or --no-fused-up-gate
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Skip the row id computation for the ffn_down op
Sadly, almost negligible performance gain.
* Also this doesn't do much
* Also this barely moves the needle
* This is slightly better
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Check for NaNs while loading the model.
* Also tell which experts have NaNs.
* Add command line option to validate quants
* Add checks for more quantization types
* Add checks for more quantization types
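A minimal sketch of the NaN check while loading (illustrative; dequantize_row is a hypothetical stand-in for the per-type ggml dequantization functions): each row is dequantized to f32 and scanned for NaNs.
```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <vector>

bool tensor_has_nan(const void * data, int64_t n_rows, int64_t n_per_row,
                    const std::function<void(const void *, float *, int64_t, int64_t)> & dequantize_row) {
    std::vector<float> buf(n_per_row);
    for (int64_t r = 0; r < n_rows; ++r) {
        dequantize_row(data, buf.data(), r, n_per_row); // dequantize one row to f32
        for (int64_t i = 0; i < n_per_row; ++i) {
            if (std::isnan(buf[i])) {
                std::fprintf(stderr, "NaN found in row %lld\n", (long long)r);
                return true;
            }
        }
    }
    return false;
}
```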
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>