ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-04-24 08:29:29 +00:00

Author	SHA1	Message	Date
Kawrakow	2f645f2579	Fix annoying compiler warnings (#1042 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-06 09:59:07 +01:00
Kawrakow	e02b71f89e	Automatically disable CUDA graphs for split mode "graph" (#1040 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-06 07:38:02 +01:00
firecoperana	42e4c61243	CUDA: Fix FA for Pascal GPU (#1036 ) Co-authored-by: firecoperana <firecoperana>	2025-12-05 16:42:14 +01:00
Kawrakow	efc8c8ef8d	K-cache Hadamard transforms (CUDA) (#1034 ) * Hadamard transforms for K-cache on CUDA * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-04 18:46:22 +01:00
Kawrakow	7fbe8d3ac2	Fix bug in ggml_cuda_op_scale_tensor (#1031 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-03 11:32:19 +01:00
Kawrakow	a719349982	POC: CUDA tensor parallel (MoE models) (#1022 ) * Remove most of split mode row * WIP * WIP: also allocate the KV cache using tensor split * WIP: it runs with wrong result But it also looks like the backend scheduler is not going to help: * It copies mask and input positions to GPU 0 * => RoPE ops must run on GPU 0 * => To proceed with attn evaluation, GPU 1 must wait for GPU 0 to finish its entire attn calculation * Same with FFN. The rms_norm gets scheduled on GPU 0. Hence, GPU 1 must wait for GPU 0 to finish its entire FFN calculation before it can start (as it needs to copy the result of rms_norm from GPU 0) * => Seems useless without writing a bespoke TP scheduling * WIP * This works, but it is slow * This is slightly better, but the graph is still not being computed in parallel. Why? Because the scheduler creates graph splits where the result of the computation on one GPU becomes an input for the other split. Hence, to trigger the computation on the second GPU one needs to wait for the computation on the first GPU to finish, even though the two can be done in parallel up to the synchronization point. So, all that is left to do is to trick the scheduler to create two splits that can be done in parallel, and then have a graph split where the results get combined. * Playing games with the scheduler This change tricks it into doing the right thing^TM. Still quite a bit slower than split mode layer for the 8B LlaMA model. But for the 70B LlaMA it now beats split mode layer for TG: 28 t/s vs 24.4 t/s. PP is 627 t/s vs 744 t/s. In comparison, split mode "row" in mainline gets 484 t/s PP and 19.3 t/s TG. * Fix attn split Granularity for Wq, Wo is not just head size, but head size * gqa_ratio. Else the Wk, Wv tensors end up not being a multiple of the head size when we divide the split determined by Wo with the gqa_ratio. * Show memory used per device * Make it work with partial offload but no tensor overrides yet, just ngl < num_layers. 
* Allow for f16 source in fused_rms_norm * This results in faster PP. Now PP is faster than split mode layer for L3-70B. * Rename split mode "row" to split mode "graph" * Leave FFN partial results as f16 * WIP GLM4.5 - runs with wrong results * WIP GLM4.5 - this works PP is already better than split mode layer, but TG for zero context is kind of low - 60 vs 92 t/s. TG becomes better than split mode layer at around 20k tokens. PP at 26k tokens is 1.55X of sm layer. * Work around compiler bug It issues a warning that there is an extra semicolon outside of a function, but there isn't. If I remove the anonymous namespace and turn the functions inside into static, the warning disappears, so clearly a compiler bug. * Make graph reuse work with split mode graph * Remove more split mode row remnants * WIP tensor overrides Runs with wrong results, don't see where the issue could be. * This works but is slow Still does not work for row-interleaved quants * Slightly better * Slightly better * Row-interleaved quants work * Better * Minor * Guard against using split mode "graph" for unsupported models * Guards against using merge_qkv with split mode "graph" * WIP split mode attn Works for LlaMA models, but not for GLM-4.5. Doesn't seem to improve performance, so I guess no point in trying to fix it. * Split mode graph for qwen3moe * Try to better distribute the splits --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-01 19:25:40 +01:00
Kawrakow	d6daee337c	Attempt to fix #1014 (#1017 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-27 15:58:18 +01:00
Kawrakow	a3b8efd687	Enable iq4_nl KV cache on CUDA (#1006 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-24 09:41:19 +01:00
Kawrakow	0243356650	Fix q6_0 dequantize (#1005 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-24 09:13:46 +01:00
Nexes the Elder	9a63e768ea	Legacy quants cpy_blck_q_f16 function for K cache (#1001 ) Short fix for the bug: ggml\src\ggml-cuda\cpy.cu:614: ggml_cuda_cpy_fn: unsupported type combination (q6_0 to f16) encountered when trying to use deepseek lite v2 with quantized K cache. Note: I compile my IK_Llama with GGML_CUDA_F16. To fix this, I added a cpy_blck_q_f16 function devised by comparing the cpy_blck_q8_0_f32 and cpy_blck_q8_0_f16, and transposing the difference for the other legacy quants on the basis of the cpy_blck_q_f32 function. A "rule of three" of sorts. Perplexity tests and inference now work consistently with -ctk q4_0, q4_1, q5_0, and q5_1 in that scenario, with expected values and behavior. The exception is Q6_0, which sees its perplexity multiplied by 100. (I suspect the CUDA dequantize_q6_0 to be incompatible with this PR for some reason, but that's beyond what I can fix) -ctk iq4_nl, which doesn't yet have a dequantize_iq4_nl function, is not usable that way for now.	2025-11-24 08:56:38 +01:00
Kawrakow	920f424929	Support GigaChat3 (#995 ) * Fixing Gigachat support * Gigachat: CUDA FA (needs 192 x 192 for MLA = 3) * Gigachat: CPU FA (needs 192 x 192 for MLA = 3) --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-24 06:55:14 +01:00
Kawrakow	232050b473	Attempt to fix #974 (#983 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-19 15:48:39 +01:00
Kawrakow	d764edd652	Fuse sum_rows and div with topk-moe (#984 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-19 13:44:09 +01:00
Kawrakow	054c31cf8f	Fuse Q and K RoPE (#980 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-19 09:08:42 +01:00
Kawrakow	0157f78061	Minor	2025-11-18 08:55:36 +00:00
Kawrakow	03da76eb05	Fix RoPE cache on multi-GPU setup (#966 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-16 11:50:48 +02:00
Kawrakow	37d72f9878	Fix ggml_cuda_fattn_is_supported (#968 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-16 11:50:29 +02:00
Kawrakow	32edcb4b74	Fix rope_norm_fast_cuda (#945 ) * Fix rope_norm_fast_cuda * One more * Also fix mrope and vision --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-13 08:54:37 +02:00
Kawrakow	219fe93973	Opt from #880 also for iqk cuda gemv (#938 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-11 10:01:34 +02:00
Kawrakow	86e2bec04e	DeepSeek FA optimizations (#929 ) * Use new-new-mma also for MLA=3, and use mask bounds This gives us ~25% better PP at 32k tokens compared to main * This seems better --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-10 09:55:30 +02:00
Kawrakow	adba641347	DeepSeek TG optimizations for TG (#928 ) * Fuse concat and copy into K cache * Avoid ggml_cont() when n_token = 1 Combined effect: about +2% in TG performance with full GPU offload Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-10 09:52:07 +02:00
Kawrakow	5cc15d0ecf	CUDA MoE improvements (#923 ) * Use mmq_id in mul_mat_id * Better * Also use it in the fused up+gate op * Better -no-fmoe TG on CUDA Still much slower than -fmoe, but about 20-25% faster than what we had before. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-09 11:34:33 +02:00
Kawrakow	defa6945b3	CUDA: fuse copies to K and V cache (#921 ) * Fuse copies to K- and V-cache on CUDA * Adapt to latest main --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-08 18:13:58 +02:00
Kawrakow	3614c4f098	Adopt fix from mainline PR 17089 (#920 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-08 07:44:20 +02:00
Kawrakow	1c31b25380	Fix PPL increase caused by mmq_id (#913 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-07 18:58:09 +02:00
Kawrakow	532a05e466	CUDA: set compute parameters via command line arguments (#910 ) * cuda: set compute parameters via command line arguments * Also llama-bench --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-07 07:11:23 +02:00
Thireus ☠	86597623a5	Port of Qwen3-VL support from mainline (#883 ) * Port of Qwen3-VL for latest ik_llama.cpp - convert_hf_to_gguf.py - Not touched, use llama.cpp to convert model instead - sycl and metal support for imrope not added - Vulkan support for imrope not tested - Code not tested * Bugfix n_embd was declared multiple times https://github.com/ikawrakow/ik_llama.cpp/pull/883#issuecomment-3471179655 * Fix n_embd issue with qwen3vl * model.output tensor not required https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2480388389 * Improved logic for qkv combined tensors `59ceaf8fcb (r2480395800)` `59ceaf8fcb (r2480398187)` * Fix n_embd for merge_qkv() + cleaner code https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2481227395 * Revert TENSOR_NOT_REQUIRED	2025-11-04 19:20:54 +02:00
Kawrakow	fb0d5a995c	RoPE cache (#887 ) * Introducing rope cache When computing RoPE, the rotation angles in each layer are exactly the same, and only depend on the token positions (and other constant, model dependent parameters). So, I wonder, why don't we compute the angles just once and then reuse for the Q and K RoPE in each layer? This commit does it as a POC on the CPU, and uses it in the Qwen3-MoE compute graph. * cuda: neox works * WIP * rope_cache: norm works * Fused rope+rope * Fused rope+rope (norm) * Fused rms+rms+rope+rope (neox) - not working * WIP * Also qwen3 * Add command line arg to disable rope cache * Disable RoPE cache if rope type is not neox or norm * Add missing break after merge with main * Fused fused_rms+fused_rms+rope+rope (with -mqkv) * Fused fused_rms+fused_rms+rope+rope (without -mqkv) --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-03 18:42:20 +02:00
Kawrakow	55a704b67a	Fused Q and K fused_rms_norm for TG on CUDA (#882 ) * Biased mmvq: minor optimization * Fusing Q and K rms_norm for TG on CUDA * Remove commented out code --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-31 14:41:28 +02:00
Kawrakow	cfb840379f	Biased mmvq: minor optimization (#880 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-31 14:21:18 +02:00
Kawrakow	0459f595d7	CUDA: correctly detect if flash attention is supported (#875 ) * Don't use vector kernels if K or V are quantized * Correctly determine if FA is supported * Also wmma * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-29 13:56:16 +02:00
Nexes the Elder	d50c2490fc	correct typo (#876 )	2025-10-28 19:01:45 +02:00
Kawrakow	eb8116b097	Even more fused ops (#868 ) * Fuse Q, K, V gemv+add * More gemv+add fusing * Faster copy when tensors are contiguous Relevant for storing data into the KV cache. I see ~1% speedup for fast models (Ling-mini-2.0, gpt-oss-20b, etc.) * Cleanup * Make sure the bias really is 1 row to use fusion --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-27 16:09:01 +02:00
Kawrakow	e34399c116	CUDA: fuse ffn_up*unary_op(ffn_gate) for MMVQ (V2) (#864 ) Args for MMVQ functions * WIP * Fused ffn_up*unary_op(ffn_gate) for MMVQ (no bias) We see nearly 2% TG speedup for Ling-mini-2.0 and about 1% for DeepSeek-Lite. Fused ffn_up*unary_op(ffn_gate) for MMVQ (with bias) Fusing also for iqk/trellis/repacked quants * Fusing mmvq also in non-MoE up+gate * Fuse mul_mat_id and add_id into a single kernel for mmvq * Also iqk quants * Split mmvq.cu and iqk_mmvq.cu into separate template instances * Put iqk mmvq implementations into template instances * Somehow I forgot to change the ggml_type in the legacy template calls * Add diagnostics * Disable assert * Fix TG fused up*unary(gate) when down cannot be fused The wrong memory buffer got used in that case --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-26 17:08:50 +02:00
Kawrakow	0549be76e5	Fused mul + multi_add op (#858 ) * Adding fused mul+multi_add + CPU implementation * fused mul+multi_add: CUDA * fused mul+multi_add: command line argument to disable it --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-24 07:40:35 +03:00
Kawrakow	ed4e1a6588	Fuse add+add+fused_rms (#853 ) * Fuse add+add+fused_rms * Try this * Macro to easily enable/disable fusion * Various: * Check that all tensors involved are on the same device before applying fusion * Fuse sigmoid+scale+sum_rows+div * Fix the fused bailingmoe2 experts selection The issue there was that the bias was not per row, but per expert group, so only the first n_per_group biases were used for all experts. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-22 16:18:11 +03:00
Kawrakow	af5bf60cc8	Hopefully this fixes #854 (#855 ) * Hopefully this fixes #854 * Also this one --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-21 19:07:23 +03:00
Kawrakow	366d66bc1a	Fuse add + fused_rms_norm (CUDA) (#852 ) * Combine all calls to llm_build_norm to a single line to more easily check what kind of arguments are being passed by simply using grep. * Combine add + fused_rms_norm For many models this happens at each layer: the result of the layer is added to the layer input, which then becomes the input to the next layer, which then is typically normalized via fused_rms_norm. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-21 14:29:50 +03:00
Kawrakow	c23a17b6fe	cuda: use better block sizes for rms_norm (#845 ) * cuda: use better block sizes for rms_norm * Minor * Remove forgotten printf --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-21 08:12:48 +03:00
Kawrakow	7a41b3b1f5	Various fused ops around expert selection (#840 ) * Fuse sigmoid+add+grouped_topk+get_rows (CPU) * Fix CPU + CUDA but CUDA is somehow not 100% correct as I get a slightly different PPL (lower!) * Minor * Fuse sigmoid+add+topk+get_rows (CUDA) * Fuse sigmoid+add+topk+get_rows (CPU) * Fuse topk+view+get_rows+reshape+softmax (CPU) * Fuse topk+view+get_rows+reshape+softmax (CUDA) * cpu: turn off the openai topk fusing for now Something is not right and I don't see the bug. On the CPU one doesn't gain much if anything, so not a big loss. * Also fuse sum_rows and div --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-19 19:02:46 +03:00
Kawrakow	1dcc044134	Grouped expert routing (CUDA) (#838 ) * WIP * cuda: grouped top_k * This is very slightly better --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-18 07:22:35 +03:00
Kawrakow	f7adde1043	Adding Ling/Ring (a.k.a., Bailing-MoE2) support (#833 ) * Adding Ling/Ring (a.k.a., Bailing-MoE2) * Add expert group selection (not working, so turned off) * BailingMoE2 conversion * WIP * Bits and pieces --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-15 14:20:40 +03:00
Kawrakow	9932e6b102	Fix gemma3 vision (#803 ) * Remove unnecessary assert in im2col * Remove unnecessary assert in im2col (CPU) --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-27 11:15:32 +02:00
Kawrakow	c1a0e15377	Port mtmd from mainline + Qwen2/2.5-VL support (#798 ) * Add mtmd: the beginning * Add mtmd: mtmd.cpp compiles * Add mtmd: clip initialization compiles * Add mtmd: clip.cpp compiles * Add mtmd: builds successfully * Add CPU implementation for GGML_OP_GLU * Add CUDA implementation for GGML_OP_GLU * Add CPU implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW * Add CUDA implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW * Add mtmd: refresh CPU rope * Add mtmd: refresh CUDA rope * Add mtmd: add Qwen2-VL * Add mtmd: Qwen2.5-VL text seems to work with this change * Add mtmd: fix swiglu * Add mtmd: use LOG_TEE so generated tokens show up in terminal * Add mtmd: do not attempt to load a GPU backend if none are available * GLU, not GPU * Fix typo * Fix new/free mismatch * LOG stuff * Add mtmd: this fixes gibberish on second image --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-27 08:45:29 +02:00
Kawrakow	8b4208e789	Fix #772 (#790 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-23 16:43:02 +02:00
Kawrakow	4591e83825	cuda: fused top_k+softmax as used in most MoE models (#789 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-23 13:45:57 +02:00
Kawrakow	4a6a6f17ee	Alternative CUDA FA for SWA models (#754 ) * Bounds for flash attention * Add n_swa to FA parameters * Fix it * This seems very slightly better * Using vec kernel when we have SWA * Need also this * f32 vec kernel * This is slightly better --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-04 08:42:18 +02:00
Kawrakow	727f7b7d9f	Refactor CUDA flash attention (#745 ) * Factor out mma * Factor out wmma * Factor out vec * Remove unnecessary includes from fattn.cu * Move mma launch to fattn-mma-f16.cuh * Slightly better PP --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-02 10:12:56 +02:00
Kawrakow	56e0f897ae	Revert "CUDA: prompt processing optimizations for MoE models (#739 )" (#748 ) This reverts commit `f22a9ef95a`. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-02 06:55:48 +02:00
Kawrakow	8de297b795	Fused FFN_UP+FFN_GATE op (#741 ) * Fused up+gate+unary for regular (not MoE) FFN - CPU * WIP CUDA * Seems to be working on CUDA For a dense model we get 2-3% speedup for PP and ~0.6% for TG. * Add command line option This time the option is ON by default, and one needs to turn it off via -no-fug or --no-fused-up-gate --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-31 18:16:36 +03:00


157 Commits