Commit Graph

427 Commits

Author SHA1 Message Date
Iwan Kawrakow
525dda2e80 Add command line arg to disable rope cache 2025-11-03 08:31:18 +02:00
Iwan Kawrakow
60d56fa2d0 WIP 2025-11-03 08:31:18 +02:00
Iwan Kawrakow
332c4d6680 Fused rms+rms+rope+rope (neox) - not working 2025-11-03 08:31:18 +02:00
Iwan Kawrakow
623d775929 Fused rope+rope (norm) 2025-11-03 08:31:18 +02:00
Iwan Kawrakow
f5ac78de5c Fused rope+rope 2025-11-03 08:31:18 +02:00
Iwan Kawrakow
209bf1d29c WIP 2025-11-03 08:31:18 +02:00
Iwan Kawrakow
f2c4b3a8d1 cuda: neox works 2025-11-03 08:31:14 +02:00
Iwan Kawrakow
9a790a8905 Introducing rope cache
When computing RoPE, the rotation angles in each layer
are exactly the same, and only depend on the token positions
(and other constant, model-dependent parameters).
So, I wonder, why don't we compute the angles just once
and then reuse them for the Q and K RoPE in each layer?

This commit does it as a POC on the CPU, and uses it in
the Qwen3-MoE compute graph.
2025-11-03 08:30:32 +02:00
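
A minimal CPU sketch of the caching idea described in the commit above, assuming standard "norm"-style RoPE with base frequency `theta_base` (names and layout are illustrative, not the actual ggml API): the cos/sin values depend only on the token positions and the head dimension, so they can be built once per batch and reused for every layer's Q and K rotation.

```cpp
#include <cmath>
#include <vector>

// Build the RoPE angle cache once per batch: for each position and each
// frequency index, store cos(theta) and sin(theta). These values are the
// same in every layer, so each layer's Q and K rotation can reuse them.
struct RopeCache {
    int head_dim;
    std::vector<float> cos_sin; // [n_pos][head_dim/2][{cos,sin}]
    RopeCache(const std::vector<int>& positions, int head_dim,
              float theta_base = 10000.0f)
        : head_dim(head_dim), cos_sin(positions.size() * head_dim) {
        for (size_t i = 0; i < positions.size(); ++i) {
            for (int j = 0; j < head_dim / 2; ++j) {
                float theta = positions[i] * std::pow(theta_base, -2.0f * j / head_dim);
                cos_sin[i * head_dim + 2 * j + 0] = std::cos(theta);
                cos_sin[i * head_dim + 2 * j + 1] = std::sin(theta);
            }
        }
    }
    // Rotate one head vector in place ("norm" style: adjacent pairs).
    void apply(float* x, size_t pos_idx) const {
        const float* cs = &cos_sin[pos_idx * head_dim];
        for (int j = 0; j < head_dim / 2; ++j) {
            float c = cs[2 * j], s = cs[2 * j + 1];
            float x0 = x[2 * j], x1 = x[2 * j + 1];
            x[2 * j]     = x0 * c - x1 * s;
            x[2 * j + 1] = x0 * s + x1 * c;
        }
    }
};
```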
Kawrakow
d890b9fee0 cuda: add missing backwards RoPE op (#889)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-03 07:45:18 +02:00
Kawrakow
8c8a7fb7c8 Fused Q and K fused_rms_norm for TG on CUDA (#882)
* Biased mmvq: minor optimization

* Fusing Q and K rms_norm for TG on CUDA

* Remove commented out code

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-31 14:41:28 +02:00
Kawrakow
fd3757d4ee Biased mmvq: minor optimization (#880)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-31 14:21:18 +02:00
Kawrakow
c33f39d58f CUDA: correctly detect if flash attention is supported (#875)
* Don't use vector kernels if K or V are quantized

* Correctly determine if FA is supported

* Also wmma

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-29 13:56:16 +02:00
Nexes the Elder
0ba5424fbf correct typo (#876) 2025-10-28 19:01:45 +02:00
firecoperana
6dc5bd847b Support --device and --device-draft parameters (#866)
* add --device and --device-draft parameters

* don't print debug message in release mode

* fix

* bug fix: throw an exception when no device is specified

* add const

---------

Co-authored-by: firecoperana <firecoperana>
2025-10-27 18:13:28 +02:00
Kawrakow
bdf4f0ddce Even more fused ops (#868)
* Fuse Q, K, V gemv+add

* More gemv+add fusing

* Faster copy when tensors are contiguous

Relevant for storing data into the KV cache. I see ~1% speedup
for fast models (Ling-mini-2.0, gpt-oss-20b, etc.)

* Cleanup

* Make sure the bias really is 1 row to use fusion

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-27 16:09:01 +02:00
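
The "faster copy" bullet above comes down to a contiguity fast path. A sketch, assuming both tensors share the same element type (illustrative, not the actual ggml copy code):

```cpp
#include <cstddef>
#include <cstring>

// When source and destination are both contiguous, the row-by-row strided
// copy collapses into one memcpy. K/V cache stores happen in every layer for
// every token, so removing the per-row loop overhead is worth on the order
// of 1% end-to-end for fast models, as the commit above reports.
void copy_tensor(const void* src, void* dst, size_t n_rows, size_t row_bytes,
                 size_t src_stride, size_t dst_stride) {
    if (src_stride == row_bytes && dst_stride == row_bytes) {
        std::memcpy(dst, src, n_rows * row_bytes);  // contiguous: one call
        return;
    }
    const char* s = static_cast<const char*>(src);  // strided fallback
    char*       d = static_cast<char*>(dst);
    for (size_t i = 0; i < n_rows; ++i) {
        std::memcpy(d + i * dst_stride, s + i * src_stride, row_bytes);
    }
}
```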
Kawrakow
f76e98536f CUDA: fuse ffn_up*unary_op(ffn_gate) for MMVQ (V2) (#864)
* Args for MMVQ functions

* WIP

* Fused ffn_up*unary_op(ffn_gate) for MMVQ (no bias)

We see nearly 2% TG speedup for Ling-mini-2.0 and
about 1% for DeepSeek-Lite.

* Fused ffn_up*unary_op(ffn_gate) for MMVQ (with bias)

* Fusing also for iqk/trellis/repacked quants

* Fusing mmvq also in non-MoE up+gate

* Fuse mul_mat_id and add_id into a single kernel for mmvq

* Also iqk quants

* Split mmvq.cu and iqk_mmvq.cu into separate template instances

* Put iqk mmvq implementations into template instances

* Somehow I forgot to change the ggml_type in the legacy template calls

* Add diagnostics

* Disable assert

* Fix TG fused up*unary(gate) when down cannot be fused

The wrong memory buffer got used in that case

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-26 17:08:50 +02:00
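
For reference, the math the fused ffn_up*unary_op(ffn_gate) kernel computes, shown as a scalar CPU sketch with SiLU as the unary op (the real code is a quantized CUDA GEMV; the bias variant is analogous):

```cpp
#include <cmath>
#include <vector>

// For each output element i, both dot products are formed in the same pass,
// so the activation and the elementwise product need no intermediate
// tensors:  y[i] = silu(gate_i . x) * (up_i . x)
std::vector<float> fused_up_gate(const std::vector<std::vector<float>>& up,
                                 const std::vector<std::vector<float>>& gate,
                                 const std::vector<float>& x) {
    auto silu = [](float v) { return v / (1.0f + std::exp(-v)); };
    std::vector<float> y(up.size());
    for (size_t i = 0; i < up.size(); ++i) {
        float du = 0.0f, dg = 0.0f;
        for (size_t j = 0; j < x.size(); ++j) {  // one pass over x for both rows
            du += up[i][j]   * x[j];
            dg += gate[i][j] * x[j];
        }
        y[i] = silu(dg) * du;
    }
    return y;
}
```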
Kawrakow
2522c97dc9 Faster tensor name formatting (#860)
* Adding fused mul+multi_add + CPU implementation

* fused mul+multi_add: command line argument to disable it

* Faster tensor name formatting

We gain ~1% for Ling-mini-2.0 when running on CUDA.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-24 07:46:18 +03:00
Kawrakow
db3ba4999f Fused mul + multi_add op (#858)
* Adding fused mul+multi_add + CPU implementation

* fused mul+multi_add: CUDA

* fused mul+multi_add: command line argument to disable it

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-24 07:40:35 +03:00
Kawrakow
0e1d33ca4a Fuse add+add+fused_rms (#853)
* Fuse add+add+fused_rms

* Try this

* Macro to easily enable/disable fusion

* Various:

* Check that all tensors involved are on the same device before applying fusion
* Fuse sigmoid+scale+sum_rows+div
* Fix the fused bailingmoe2 experts selection

The issue there was that the bias was not per row, but per
expert group, so only the first n_per_group biases were used
for all experts.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-22 16:18:11 +03:00
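
One of the fusions listed above, sigmoid+scale+sum_rows+div, is the usual MoE routing-weight normalization. A scalar sketch of the combined chain (illustrative names):

```cpp
#include <cmath>

// Turn the selected experts' logits into routing weights that sum to
// `scale` in a single pass, instead of four separate graph ops with
// intermediate tensors in between.
void routing_weights(const float* logits, float* w, int n_sel, float scale = 1.0f) {
    float sum = 0.0f;
    for (int i = 0; i < n_sel; ++i) {
        w[i] = 1.0f / (1.0f + std::exp(-logits[i]));  // sigmoid
        sum += w[i];                                  // sum_rows
    }
    for (int i = 0; i < n_sel; ++i) w[i] *= scale / sum;  // scale + div
}
```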
Kawrakow
8aa3c2ec5e Hopefully this fixes #854 (#855)
* Hopefully this fixes #854

* Also this one

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-21 19:07:23 +03:00
Kawrakow
caf9759c97 Fuse add + fused_rms_norm (CUDA) (#852)
* Combine all calls to llm_build_norm into a single line

so one can more easily check what kinds of arguments are being
passed, simply by using grep.

* Combine add + fused_rms_norm

For many models this happens at each layer: the result of the
layer is added to the layer input; the sum becomes the input
to the next layer, which is typically normalized via
fused_rms_norm.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-21 14:29:50 +03:00
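
The residual pattern described in the commit above, fused into a single pass. This is a CPU sketch for clarity; the actual change is a CUDA kernel:

```cpp
#include <cmath>

// out_residual = x + y is the input to the next layer, and
// norm = rms_norm(out_residual) * weight is what attention/FFN consume.
// Fusing forms the sum once and reuses it for the mean of squares.
void fused_add_rms_norm(const float* x, const float* y, const float* weight,
                        float* residual, float* norm, int n, float eps = 1e-6f) {
    float ssq = 0.0f;
    for (int i = 0; i < n; ++i) {
        residual[i] = x[i] + y[i];        // the add
        ssq += residual[i] * residual[i];
    }
    const float inv_rms = 1.0f / std::sqrt(ssq / n + eps);
    for (int i = 0; i < n; ++i) {
        norm[i] = residual[i] * inv_rms * weight[i];  // the fused_rms_norm
    }
}
```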
Kawrakow
92231460cf Fix fused grouped topk (#851)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-21 10:10:38 +03:00
Kawrakow
f5571e241e cuda: use better block sizes for rms_norm (#845)
* cuda: use better block sizes for rms_norm

* Minor

* Remove forgotten printf

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-21 08:12:48 +03:00
Kawrakow
28d3e63805 Various fused ops around expert selection (#840)
* Fuse sigmoid+add+grouped_topk+get_rows (CPU)

* Fix CPU + CUDA

but CUDA is somehow not 100% correct as I get a slightly different
PPL (lower!)

* Minor

* Fuse sigmoid+add+topk+get_rows (CUDA)

* Fuse sigmoid+add+topk+get_rows (CPU)

* Fuse topk+view+get_rows+reshape+softmax (CPU)

* Fuse topk+view+get_rows+reshape+softmax (CUDA)

* cpu: turn off the openai topk fusing for now

Something is not right and I don't see the bug.
On the CPU one doesn't gain much, if anything, so it is not a big loss.

* Also fuse sum_rows and div

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-19 19:02:46 +03:00
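
A plain C++ sketch of the fused sigmoid+add+topk+get_rows chain for one token, assuming DeepSeek-style gating where the additive bias affects the ranking only and the final weights come from the unbiased sigmoid (names are illustrative, not the CUDA kernel):

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

struct Selection { std::vector<int> ids; std::vector<float> weights; };

// Bias the gating scores, pick the top-k experts, and return their ids
// together with their (unbiased) gate weights, all without materializing
// the intermediate score/rows tensors.
Selection select_experts(const std::vector<float>& logits,
                         const std::vector<float>& bias, int k) {
    const int n = (int)logits.size();
    std::vector<float> score(n);
    std::vector<int>   idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    for (int e = 0; e < n; ++e) {
        score[e] = 1.0f / (1.0f + std::exp(-logits[e])) + bias[e]; // sigmoid+add
    }
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),    // topk
                      [&](int a, int b) { return score[a] > score[b]; });
    Selection sel;
    for (int i = 0; i < k; ++i) {                                 // get_rows
        sel.ids.push_back(idx[i]);
        sel.weights.push_back(1.0f / (1.0f + std::exp(-logits[idx[i]])));
    }
    return sel;
}
```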
Kawrakow
747f411da5 Grouped expert routing (CUDA) (#838)
* WIP

* cuda: grouped top_k

* This is very slightly better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-18 07:22:35 +03:00
Kawrakow
dbfd151594 Grouped expert routing (CPU only) (#836)
* Better argsort (CPU)

* Attempt at grouped topk

* This seems to do the trick for grouped experts routing

* Cleanup

* Trying to merge, something is not right

* Working merged grouped top_k (CPU)

* Add command line option to enable grouped expert routing

* Add grouped expert routing option to llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-16 14:57:02 +03:00
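
A sketch of what the merged grouped top_k does: rank expert groups, keep only the best groups, then run the usual top-k over the survivors. Here the group score is taken as the group's best expert score; some models use the sum of the top two per group, so treat that as a model-dependent assumption.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Returns the ids of the k selected experts. Experts outside the top
// n_group_used groups are masked out before the final top-k.
std::vector<int> grouped_topk(std::vector<float> score, int n_groups,
                              int n_group_used, int k) {
    const int per_group = (int)score.size() / n_groups;
    auto group_score = [&](int gi) {
        return *std::max_element(score.begin() + gi * per_group,
                                 score.begin() + (gi + 1) * per_group);
    };
    std::vector<int> g(n_groups);
    std::iota(g.begin(), g.end(), 0);
    std::partial_sort(g.begin(), g.begin() + n_group_used, g.end(),
                      [&](int a, int b) { return group_score(a) > group_score(b); });
    std::vector<char> keep(n_groups, 0);
    for (int i = 0; i < n_group_used; ++i) keep[g[i]] = 1;
    for (size_t e = 0; e < score.size(); ++e) {
        if (!keep[e / per_group]) score[e] = -1e30f;  // mask dropped groups
    }
    std::vector<int> idx(score.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return score[a] > score[b]; });
    idx.resize(k);
    return idx;
}
```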
Kawrakow
ecf8f931ea Better argsort (CPU) (#835)
* Better argsort (CPU)

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-16 11:31:03 +03:00
Kawrakow
9d364b88ba Adding Ling/Ring (a.k.a., Bailing-MoE2) support (#833)
* Adding Ling/Ring (a.k.a., Bailing-MoE2)

* Add expert group selection (not working, so turned off)

* BailingMoE2 conversion

* WIP

* Bits and pieces

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-15 14:20:40 +03:00
Kawrakow
8d0d01a593 gpt-oss: duplicate experts biases when necessary (#829)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-14 14:38:40 +03:00
Kawrakow
9724ea9213 Attention mask tweaks for better long context performance (#825)
* Parallelize mask

We see non-negligible PP gains for long contexts.
More importantly, the strange drop in performance
observed for GPT-OSS for context >= 32k tokens is gone.

* With FA on, create the mask as f16 directly

* WIP

* Reduce KQ mask padding to 16

Why was it 64 in the first place?

I don't observe any issues, while TG performance
for long contexts improves by 2-4%.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-13 14:01:11 +03:00
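
The padding change above, in concrete terms (a sketch; the constant lives in the KQ mask setup): the mask's KV dimension is rounded up so kernels can run in fixed-size blocks, and rounding to 16 instead of 64 means fewer padded mask columns that flash attention would otherwise read, score, and discard.

```cpp
#include <cstdint>

constexpr int64_t pad_to(int64_t n, int64_t align) {
    return (n + align - 1) / align * align;
}
static_assert(pad_to(1000, 64) == 1024, "old padding: 24 wasted columns");
static_assert(pad_to(1000, 16) == 1008, "new padding:  8 wasted columns");
```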
Kawrakow
e94d1a92a5 Attempt to fix AVX2 FA (#807)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-30 08:06:53 +02:00
Kawrakow
3d4977cb6e Fix gemma3 vision (#803)
* Remove unnecessary assert in im2col

* Remove unnecessary assert in im2col (CPU)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-27 11:15:32 +02:00
Kawrakow
87e4762720 Port mtmd from mainline + Qwen2/2.5-VL support (#798)
* Add mtmd: the beginning

* Add mtmd: mtmd.cpp compiles

* Add mtmd: clip initialization compiles

* Add mtmd: clip.cpp compiles

* Add mtmd: builds successfully

* Add CPU implementation for GGML_OP_GLU

* Add CUDA implementation for GGML_OP_GLU

* Add CPU implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW

* Add CUDA implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW

* Add mtmd: refresh CPU rope

* Add mtmd: refresh CUDA rope

* Add mtmd: add Qwen2-VL

* Add mtmd: Qwen2.5-VL text seems to work with this change

* Add mtmd: fix swiglu

* Add mtmd: use LOG_TEE so generated tokens show up in terminal

* Add mtmd: do not attempt to load a GPU backend if none are available

* GLU, not GPU

* Fix typo

* Fix new/free mismatch

* LOG stuff

* Add mtmd: this fixes gibberish on second image

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-27 08:45:29 +02:00
Kawrakow
c108e4b7c9 CPU: faster FA (#797)
* Avoid computing FA chunks where the mask is -infinity

* Avoid computing FA chunks where the mask is -infinity also for f16/bf16

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-26 09:00:25 +02:00
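
A sketch of the skip test described above (illustrative; the real code works on vectorized chunks): for causal or SWA masks, long runs of a mask row are -inf, and an all-masked chunk contributes exp(-inf) = 0 to every attention weight, so its K*Q, softmax, and V accumulation can be skipped without changing the result.

```cpp
#include <limits>

bool chunk_is_masked_out(const float* mask_row, int chunk_start, int chunk_size) {
    constexpr float minus_inf = -std::numeric_limits<float>::infinity();
    for (int j = 0; j < chunk_size; ++j) {
        if (mask_row[chunk_start + j] != minus_inf) return false;
    }
    return true;  // every entry is -inf: the chunk contributes nothing
}
```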
Kawrakow
8e497e704e Fused matrix multiplications (CUDA and CPU) (#796)
* Quick attempt to fuse the Q, K, V GEMMs

Doesn't do much on the CPU

* Doesn't do much on the GPU either

* Use llm_build_mul_mat_qkv

* This is not needed

* Revert timing code committed by mistake

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-24 16:52:54 +02:00
Kawrakow
cde2eb5e95 cpu: fused softmax+topk (#794)
* cpu: fused softmax+topk

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-24 09:02:21 +02:00
Kawrakow
45afaf3391 Fix #772 (#790)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-23 16:43:02 +02:00
Kawrakow
18f04350e9 cuda: fused top_k+softmax as used in most MoE models (#789)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-23 13:45:57 +02:00
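
Why top_k and softmax can fuse at all: softmax preserves order, so the top-k of softmax(logits) are the top-k logits, and renormalizing the selected probabilities is exactly a softmax over just those k logits. A CPU sketch of the fused op (illustrative names):

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Pick the top-k logits, then softmax only over the winners. The
// non-selected experts are never exponentiated.
void topk_softmax(const std::vector<float>& logits, int k,
                  std::vector<int>& ids, std::vector<float>& probs) {
    ids.resize(logits.size());
    std::iota(ids.begin(), ids.end(), 0);
    std::partial_sort(ids.begin(), ids.begin() + k, ids.end(),
                      [&](int a, int b) { return logits[a] > logits[b]; });
    ids.resize(k);
    probs.resize(k);
    float max = logits[ids[0]], sum = 0.0f;
    for (int i = 0; i < k; ++i) {
        probs[i] = std::exp(logits[ids[i]] - max);
        sum += probs[i];
    }
    for (int i = 0; i < k; ++i) probs[i] /= sum;
}
```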
Kawrakow
c519d4177b This is very slightly better (#762)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-05 21:31:02 +02:00
Kawrakow
c15f8ac508 Fix ggml_is_contiguously_allocated (#764)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-05 19:05:02 +02:00
firecoperana
cec8b70a7e llama: enable K-shift for quantized KV cache for cuda (#760)
cuda: add q8_0->f32 cpy operation (#9571)
It will fail on unsupported backends or quant types.

Co-authored-by: Ivan <nekotekina@gmail.com>
2025-09-05 11:54:18 +02:00
Kawrakow
0c15494c30 Offload only activated experts to the GPU (#698)
* Offload only activated experts

* This seems to do the trick for -fmoe

* Do not recalculate activated experts for fused up/gate

* Log out of bounds access details

* Add a command line argument

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-04 12:22:30 +02:00
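
A sketch of the offloading idea, under the assumption that the selected expert slices are gathered into a staging buffer before a single host-to-device transfer (illustrative, not the actual scheduler code): for k-of-N routing with small batches this shrinks the upload by roughly a factor of N/k.

```cpp
#include <cstring>
#include <vector>

// Copy only the weight slices of experts the router actually selected for
// this ubatch, instead of the whole [n_expert, ...] tensor. The staging
// buffer is then sent to the GPU in one transfer.
void gather_active_experts(const char* all_experts, size_t bytes_per_expert,
                           const std::vector<int>& active_ids, char* staging) {
    for (size_t i = 0; i < active_ids.size(); ++i) {
        std::memcpy(staging + i * bytes_per_expert,
                    all_experts + (size_t)active_ids[i] * bytes_per_expert,
                    bytes_per_expert);
    }
}
```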
Kawrakow
06cc7c6894 Better CPU SWA (#757)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-04 11:58:16 +02:00
Kawrakow
f5e68bf8b6 Alternative CUDA FA for SWA models (#754)
* Bounds for flash attention

* Add n_swa to FA parameters

* Fix it

* This seems very slightly better

* Using vec kernel when we have SWA

* Need also this

* f32 vec kernel

* This is slightly better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-04 08:42:18 +02:00
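
The role of n_swa in the FA parameters above: with a sliding window, each query row only needs a bounded slice of the KV cache, so the kernel can iterate over just that slice instead of scanning all n_kv entries and relying on the mask to zero the rest. A sketch of the bound computation (illustrative names):

```cpp
#include <algorithm>

struct KvRange { int first, last; };

// A query at absolute position `pos` attends only to key positions in
// [pos - n_swa + 1, pos], clamped to the valid KV cache range.
KvRange swa_bounds(int pos, int n_swa, int n_kv) {
    KvRange r;
    r.first = std::max(0, pos - n_swa + 1);
    r.last  = std::min(pos, n_kv - 1);
    return r;
}
```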
Kawrakow
3433c7b56d Refactor CUDA flash attention (#745)
* Factor out mma

* Factor out wmma

* Factor out vec

* Remove unnecessary includes from fattn.cu

* Move mma launch to fattn-mma-f16.cuh

* Slightly better PP

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-02 10:12:56 +02:00
Kawrakow
1f4346381f Set default value of GGML_SCHED_MAX_COPIES to 1 (#751)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-02 07:04:39 +02:00
Kawrakow
62f5382c2b Revert "CUDA: prompt processing optimizations for MoE models (#739)" (#748)
This reverts commit f22a9ef95a.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-02 06:55:48 +02:00
Kawrakow
b66cecca45 Fused FFN_UP+FFN_GATE op (#741)
* Fused up+gate+unary for regular (not MoE) FFN - CPU

* WIP CUDA

* Seems to be working on CUDA

For a dense model we get 2-3% speedup for PP and ~0.6% for TG.

* Add command line option

This time the option is ON by default, and one needs to turn it
off via -no-fug or --no-fused-up-gate

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-31 18:16:36 +03:00
Kawrakow
f22a9ef95a CUDA: prompt processing optimizations for MoE models (#739)
* Skip the row id computation for the ffn_down op

Sadly, almost negligible performance gain.

* Also this doesn't do much

* Also this barely moves the needle

* This is slightly better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-30 12:09:41 +03:00
Kawrakow
46968d4ab1 Sanitize imatrix (#735)
* sanitize importance matrix: WIP

* sanitize importance matrix: iq4_k

* sanitize importance matrix: iq5_k, iq6_k

* sanitize imatrix: iq4_ks

* sanitize imatrix: iq4_kss

* sanitize imatrix: iq2_ks and iq2_kl

* sanitize imatrix: iq5_ks

* sanitize imatrix: iq4_nl_r4

* sanitize imatrix: q4_0_r8

* sanitize imatrix: q6_0_r4

* sanitize imatrix: iq4_xs_r8

* sanitize imatrix: iq4_xs_r8 and q3_k_r4 with a template

* sanitize imatrix: q2_k_r4, q4_k_r4, q5_k_r4, q6_k_r4

* sanitize imatrix: repacked i-quants

* Minor

* Add more checks for iq3_k, iq3_ks

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-29 09:08:15 +03:00
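
What "sanitizing" the importance matrix amounts to, as a hedged sketch (hypothetical helper; the per-quant-type checks in the commit are more specific): validate each importance weight before the quantization fit, so one bad entry cannot corrupt the scale search for a whole block.

```cpp
#include <cmath>

// Replace non-finite or non-positive imatrix entries with a neutral weight
// before they are used as per-weight importances in quantization.
void sanitize_imatrix(float* w, int n, float fallback = 1.0f) {
    for (int i = 0; i < n; ++i) {
        if (!std::isfinite(w[i]) || w[i] <= 0.0f) {
            w[i] = fallback;
        }
    }
}
```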