ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-04-28 02:11:50 +00:00

Author	SHA1	Message	Date
Kawrakow	0b76f23334	This results in faster PP. Now PP is faster than split mode layer for L3-70B.	2025-11-27 14:58:48 +00:00
Kawrakow	ed67bcbb2a	Allow for f16 source in fused_rms_norm	2025-11-27 14:58:48 +00:00
Kawrakow	c9e129b3db	Playing games with the scheduler This change tricks it into doing the right thing^TM. Still quite a bit slower than split mode layer for the 8B LlaMA model. But for the 70B LlaMA it now beats split mode layer for TG: 28 t/s vs 24.4 t/s. PP is 627 t/s vs 744 t/s. In comparison, split mode "row" in mainline gets 484 t/s PP and 19.3 t/s TG.	2025-11-27 14:58:48 +00:00
Kawrakow	7376d1c6eb	This works, but it is slow	2025-11-27 14:58:48 +00:00
Kawrakow	b703d00edc	WIP: it runs with wrong result But it also looks like the backend scheduler is not going to help: * It copies mask and input positions to GPU 0 * => RoPE ops must run on GPU 0 * => To proceed attn evaluation, GPU 1 must wait for GPU 0 to finish its entire attn calculation * Same with FFN. The rms_norm gets scheduled on GPU 0. Hence, GPU 1 must wait for GPU 0 to finish its entire FFN calculation before it can start (as it needs to copy the result of rms_norm from GPU 0) * => Seems useless without writing a bespoke TP scheduling	2025-11-27 14:58:47 +00:00
Kawrakow	93cdd71673	WIP: also allocate the KV cache using tensor split	2025-11-27 14:58:47 +00:00
Kawrakow	135fc5f4c1	WIP	2025-11-27 14:58:47 +00:00
Kawrakow	df8704ca78	Remove most of split mode row	2025-11-27 14:58:47 +00:00
Kawrakow	d6daee337c	Attempt to fix #1014 (#1017 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-27 15:58:18 +01:00
Kawrakow	a3b8efd687	Enable iq4_nl KV cache on CUDA (#1006 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-24 09:41:19 +01:00
Kawrakow	0243356650	Fix q6_0 dequantize (#1005 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-24 09:13:46 +01:00
Nexes the Elder	9a63e768ea	Legacy quants cpy_blck_q_f16 function for K cache (#1001 ) Shortfixes the bug: ggml\src\ggml-cuda\cpy.cu:614: ggml_cuda_cpy_fn: unsupported type combination (q6_0 to f16) encountered when trying to use deepseek lite v2 with quantized K cache. Note: I compile my IK_Llama with GGML_CUDA_F16. To fix this, I added a cpy_blck_q_f16 function devised by comparing the cpy_blck_q8_0_f32 and cpy_blck_q8_0_f16, and transposing the difference for the other legacy quants on the basis of the cpy_blck_q_f32 function. A "rule of three" of sorts. Perplexity test and inference now work consistently on -ctk q4_0 ; q4_1 ; q5_0 ; q5_1 in that scenario, with expected values and behavior. Except on Q6_0, which sees its perplexity multiplied by 100. (I suspect the CUDA dequantize_q6_0 to be incompatible with this PR for some reason, but that's beyond what I can fix) -ctk iq4_nl, which doesn't yet have a dequantize_iq4_nl function, is not usable that way for now.	2025-11-24 08:56:38 +01:00
Kawrakow	920f424929	Support GigaChat3 (#995 ) * Fixing Gigachat support * Gigachat: CUDA FA (needs 192 x 192 for MLA = 3) * Gigachat: CPU FA (needs 192 x 192 for MLA = 3) --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-24 06:55:14 +01:00
Kawrakow	232050b473	Attempt to fix #974 (#983 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-19 15:48:39 +01:00
Kawrakow	d764edd652	Fuse sum_rows and div with topk-moe (#984 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-19 13:44:09 +01:00
Kawrakow	054c31cf8f	Fuse Q and K RoPE (#980 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-19 09:08:42 +01:00
Kawrakow	0157f78061	Minor	2025-11-18 08:55:36 +00:00
Kawrakow	03da76eb05	Fix RoPE cache on multi-GPU setup (#966 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-16 11:50:48 +02:00
Kawrakow	37d72f9878	Fix ggml_cuda_fattn_is_supported (#968 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-16 11:50:29 +02:00
Kawrakow	c64e3e3482	Fix fused up+gate when mmq is not supported (#952 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-14 06:59:27 +02:00
Kawrakow	a1f60b3535	Add missing AVX512 operators for MSVC (#948 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-14 06:58:51 +02:00
Kawrakow	ce3ce97a29	Fix repacked legacy quants (#951 ) * Fix q5_0_r4 The issue was in the tail part. As almost all models have tensor rows that are multiple of 128, that part was never triggered in testing. But the gpt-oss models have an embedding size of 2880, so we end up there and trigger the bug. * Fix q6_0_r4 Same fix as q5_0_r4 * Fix q4_0_r8 * Fix q5_0_r4 and q6_0_r4 also on Zen4 * Fix q4_0_r8 also on Zen4 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-13 15:35:37 +02:00
Kawrakow	32edcb4b74	Fix rope_norm_fast_cuda (#945 ) * Fix rope_norm_fast_cuda * One more * Also fix mrope and vision --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-13 08:54:37 +02:00
Kawrakow	0d97b9c0bf	Enable fusion by default (#939 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-11 10:35:48 +02:00
Kawrakow	219fe93973	Opt from #880 also for iqk cuda gemv (#938 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-11 10:01:34 +02:00
Kawrakow	86e2bec04e	DeepSeek FA optimizations (#929 ) * Use new-new-mma also for MLA=3, and use mask bounds This gives us ~25% better PP at 32k tokens compared to main * This seems better --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-10 09:55:30 +02:00
Kawrakow	adba641347	DeepSeek TG optimizations for TG (#928 ) * Fuse concat and copy into K cache * Avoid ggml_cont() when n_token = 1 Combined effect: about +2% in TG performance with full GPU offload Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-10 09:52:07 +02:00
Kawrakow	bf474e9bff	Use fused gemv+add only for TG (#933 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-10 08:34:24 +02:00
Kawrakow	56ee303254	Make biased gemv fusion optional (#931 ) * Make biased gemv fusion optional * Fix one path through gemv fusion * Remove forgotten printf --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-09 19:09:47 +02:00
Lennart Lopin	fd37776584	Add ARM Grace Blackwell (NVIDIA DGX Spark) support (#922 ) This commit enables IQK quantization operations on ARM-based systems, specifically tested on NVIDIA DGX Spark with GB10 Grace Blackwell. Changes: - Enable IQK_IMPLEMENT macro for ARM NEON operations - Add arm_neon.h header include for ARM SIMD intrinsics - Fix compilation errors related to missing NEON types and functions Build requirements for ARM: cmake .. -DGGML_CUDA=ON \ -DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+fp16" \ -DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod+fp16" Tested on: - Platform: NVIDIA DGX Spark (aarch64) - CPU: GB10 Grace Blackwell Superchip - Memory: 128GB unified memory Fixes build errors: - 'float32x4_t' does not name a type - 'vld1q_f32' was not declared in this scope - 'v_expf' was not declared in this scope - Missing FP16 NEON intrinsics	2025-11-09 14:22:40 +02:00
Kawrakow	5cc15d0ecf	CUDA MoE improvements (#923 ) * Use mmq_id in mul_mat_id * Better * Also use it in the fused up+gate op * Better -no-fmoe TG on CUDA Still much slower than -fmoe, but about 20-25% faster than what we had before. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-09 11:34:33 +02:00
Kawrakow	defa6945b3	CUDA: fuse copies to K and V cache (#921 ) * Fuse copies to K- and V-cache on CUDA * Adapt to latest main --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-08 18:13:58 +02:00
Kawrakow	3614c4f098	Adopt fix from mainline PR 17089 (#920 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-08 07:44:20 +02:00
Kawrakow	d0850dccc8	Disable add + fused_rms_norm fusion (#916 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-07 19:38:18 +02:00
Kawrakow	1c31b25380	Fix PPL increase caused by mmq_id (#913 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-07 18:58:09 +02:00
Kawrakow	532a05e466	CUDA: set compute parameters via command line arguments (#910 ) * cuda: set compute parameters via command line arguments * Also llama-bench --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-07 07:11:23 +02:00
Kawrakow	49befdd4fb	Fix iqk_mul_mat when number of rows is not multiple of repack rows (#911 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-06 19:07:46 +02:00
Kawrakow	50f95d7bf3	Disable CUDA fusion by default for now (#903 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-05 10:58:12 +02:00
Kawrakow	92607d44c4	Much better CPU TG performance at long context for GLM-4.5 (#899 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-05 10:20:26 +02:00
Kawrakow	98357d9aa5	Adding cmake option to disable CUDA fusion (#902 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-05 07:09:27 +02:00
Kawrakow	11feb49562	Fix compilation failure after merging #883 (#900 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-04 19:28:52 +02:00
Thireus ☠	86597623a5	Port of Qwen3-VL support from mainline (#883 ) * Port of Qwen3-VL for latest ik_llama.cpp - convert_hf_to_gguf.py - Not touched, use llama.cpp to convert model instead - SYCL and Metal support for imrope not added - Vulkan support for imrope not tested - Code not tested * Bugfix n_embd was declared multiple times https://github.com/ikawrakow/ik_llama.cpp/pull/883#issuecomment-3471179655 * Fix n_embd issue with qwen3vl * model.output tensor not required https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2480388389 * Improved logic for qkv combined tensors `59ceaf8fcb (r2480395800)` `59ceaf8fcb (r2480398187)` * Fix n_embd for merge_qkv() + cleaner code https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2481227395 * Revert TENSOR_NOT_REQUIRED	2025-11-04 19:20:54 +02:00
Kawrakow	c23fda2103	Disable some fusion, RoPE cache off by default (#894 ) * Disable some fusion and make rope cache off by default * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-04 07:50:14 +02:00
Kawrakow	fb0d5a995c	RoPE cache (#887 ) * Introducing rope cache When computing RoPE, the rotation angles in each layer are exactly the same, and only depend on the token positions (and other constant, model dependent parameters). So, I wonder, why don't we compute the angles just once and then reuse for the Q and K RoPE in each layer? This commit does it as a POC on the CPU, and uses it in the Qwen3-MoE compute graph. * cuda: neox works * WIP * rope_cache: norm works * Fused rope+rope * Fused rope+rope (norm) * Fused rms+rms+rope+rope (neox) - not working * WIP * Also qwen3 * Add command line arg to disable rope cache * Disable RoPE cache if rope type is not neox or norm * Add missing break after merge with main * Fused fused_rms+fused_rms+rope+rope (with -mqkv) * Fused fused_rms+fused_rms+rope+rope (without -mqkv) --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-03 18:42:20 +02:00
Kawrakow	846e736e85	cuda: add missing backwards RoPE op (#889 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-03 07:45:18 +02:00
Kawrakow	55a704b67a	Fused Q and K fused_rms_norm for TG on CUDA (#882 ) * Biased mmvq: minor optimization * Fusing Q and K rms_norm for TG on CUDA * Remove commented out code --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-31 14:41:28 +02:00
Kawrakow	cfb840379f	Biased mmvq: minor optimization (#880 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-31 14:21:18 +02:00
Kawrakow	0459f595d7	CUDA: correctly detect if flash attention is supported (#875 ) * Don't use vector kernels if K or V are quantized * Correctly determine if FA is supported * Also wmma * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-29 13:56:16 +02:00
Nexes the Elder	d50c2490fc	correct typo (#876 )	2025-10-28 19:01:45 +02:00
firecoperana	904e994bfb	Support --device and --device-draft parameter (#866 ) * add --device and --device-draft parameter * don't print debug message in release mode * fix * bug fix to throw exception when no device specified * add const --------- Co-authored-by: firecoperana <firecoperana>	2025-10-27 18:13:28 +02:00
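
Commit ce3ce97a29 above notes that the repacked-quant bug sat in the tail path, which is only reached when a row length is not a multiple of 128. The following is a minimal Python sketch of that situation, not the project's kernel: `chunked_dot` and its `block` parameter are hypothetical names, and the 128-element block size is assumed from the commit message. It only shows why a 4096-wide row never reaches the tail while a 2880-wide row does.

```python
def chunked_dot(a, b, block=128):
    """Dot product computed in full blocks plus a separate tail path.

    Returns (result, tail_length). `block=128` is an assumption taken
    from the commit message, not from the actual kernel source.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")

    n_full = len(a) // block
    total = 0.0

    # Main path: full blocks only. Rows that are a multiple of `block`
    # are handled entirely here.
    for j in range(n_full):
        start = j * block
        total += sum(x * y for x, y in zip(a[start:start + block],
                                           b[start:start + block]))

    # Tail path: leftover elements. A bug here stays hidden until a row
    # length that is not a multiple of `block` shows up.
    tail_start = n_full * block
    total += sum(x * y for x, y in zip(a[tail_start:], b[tail_start:]))

    return total, len(a) - tail_start


if __name__ == "__main__":
    for n in (4096, 2880):
        result, tail = chunked_dot([1.0] * n, [2.0] * n)
        print(n, result, tail)
```

A test suite that only uses row lengths such as 4096 exercises the main loop alone, which matches the commit's explanation of why the problem went unnoticed until a model with an embedding size of 2880 was run.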

463 Commits
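
The RoPE cache commit (fb0d5a995c) observes that the rotation angles depend only on token positions and constant model parameters, so they can be computed once and reused for the Q and K RoPE in every layer. Below is a small Python sketch of that idea under stated assumptions: the function names, the frequency base of 10000, and the NeoX-style pairing of element i with element i + head_dim/2 are illustrative choices, not code from ik_llama.cpp.

```python
import math


def rope_angles(positions, head_dim, base=10000.0):
    """Build cos/sin tables: one row per position, head_dim // 2 entries each.

    `base` is an assumed, commonly used default, not a value from the source.
    """
    half = head_dim // 2
    inv_freq = [base ** (-2.0 * i / head_dim) for i in range(half)]
    cos = [[math.cos(p * f) for f in inv_freq] for p in positions]
    sin = [[math.sin(p * f) for f in inv_freq] for p in positions]
    return cos, sin


def apply_rope_neox(x, cos, sin):
    """Rotate one vector per position, pairing element i with i + half."""
    out = []
    for vec, c, s in zip(x, cos, sin):
        half = len(vec) // 2
        lo = [vec[i] * c[i] - vec[i + half] * s[i] for i in range(half)]
        hi = [vec[i] * s[i] + vec[i + half] * c[i] for i in range(half)]
        out.append(lo + hi)
    return out


def run_layers(q_per_layer, k_per_layer, positions, head_dim):
    """Apply RoPE to Q and K of every layer using a single cached table."""
    cos, sin = rope_angles(positions, head_dim)  # computed once, reused below
    return [
        (apply_rope_neox(q, cos, sin), apply_rope_neox(k, cos, sin))
        for q, k in zip(q_per_layer, k_per_layer)
    ]


if __name__ == "__main__":
    positions = [0, 1, 5]
    q = [[1.0, 2.0, 3.0, 4.0] for _ in positions]
    k = [[0.5, -1.0, 2.0, 0.0] for _ in positions]
    for layer, (rq, rk) in enumerate(run_layers([q, q], [k, k], positions, 4)):
        print(layer, rq[1], rk[1])
```

The trigonometric work scales with the number of positions rather than with positions times layers, which is the saving the commit message describes; the follow-up commit c23fda2103 in the list above turns the cache off by default.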