* FA: provide work buffer for K repacking
* Add header to avoid compiler warnings
* WIP
* WIP
* WIP
* WIP
* Slightly better
* WIP (Zen4)
* WIP
* Try to improve for unusual numbers of heads and threads
* Use mul_mat_qX_0_q8_2_Tx for q6_0 in FA
* Use mul_mat_qX_0_q8_2_Tx for q4_0 in FA
* Use Sum4q4 for q4_0
* WIP
* WIP
* Much better FA TG with q8_0 KV cache
Just repack it even for TG. But do the repacking for k_step rows,
not the whole K tensor.
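A minimal sketch of the idea, with hypothetical names (the real code repacks
quantized K blocks inside the iqk FA kernel; plain floats stand in here):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Interleave groups of up to 4 rows column-by-column into dst
    // (stand-in for the actual quantized repacking).
    static void repack_rows(const float * src, int nrows, int ncols, float * dst) {
        for (int r0 = 0; r0 < nrows; r0 += 4) {
            const int nr = std::min(4, nrows - r0);
            for (int c = 0; c < ncols; ++c)
                for (int r = 0; r < nr; ++r)
                    *dst++ = src[(size_t)(r0 + r)*ncols + c];
        }
    }

    // Repack K one k_step-sized chunk at a time into the work buffer,
    // instead of repacking the whole K tensor up front.
    void fa_over_k_chunks(const float * K, int n_kv, int ncols, int k_step,
                          std::vector<float> & work) {
        work.resize((size_t)k_step*ncols);
        for (int j0 = 0; j0 < n_kv; j0 += k_step) {
            const int nrows = std::min(k_step, n_kv - j0);
            repack_rows(K + (size_t)j0*ncols, nrows, ncols, work.data());
            // ... run the K*Q part of this FA step on work.data() ...
        }
    }

This keeps the work buffer at k_step rows rather than the full cache, which
is what makes repacking affordable for TG.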
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Slightly better CPU TG performance for GQA
* Better CPU FA implementation for TG when GQA
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* iq3_k: slightly better quantization
Not much of a difference for most models, but this change
avoids what looks like a catastrophic failure for DeepSeek-Lite
(PPL is now 7.041 vs 7.314 on main).
* Small improvement for type-1 quants
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* WIP - not working
* q8_0 without bells and whistles works
* It works for q8_0
* Use bf16 instead of f16, int16
* q4_0_r8
* q5_0_r4
* q6_0_r4
* Also q4_1 and q5_1
* q8_0_r8 on avx2
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Improve DeepSeek batched processing speed
* Revert the commented-out section in iqk_mul_mat.cpp
It does have some benefit at long contexts.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Repack a model with the quantize tool
* WIP
* Fixed various issues
As we don't have a way to tell if a repacked quant has been modified,
I had to remove the modification at the expense of a slight decrease
in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and
q4_0_r8 on ARM.
* Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved
* Fix GCC 13.3 compilation error
* Another one
* Add missing include
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* FlashMLA-2: eliminate intermediate f32 tensors
This works on the CPU. PP performance is ~13% better for 16k tokens
and the compute buffer is quite a bit smaller.
* FlashMLA-2: enable fast path only on the CPU for now
I did implement the necessary ops on CUDA, but something is
still wrong there, so for now we only use it when running
CPU-only.
* FlashMLA-2: slightly smaller compute buffer size
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* This is a better FA for TG
It should benefit MLA and GQA. Tested to work with
DeepSeek-Lite MLA, not yet for GQA.
For tg64@pp8192 it is ~13% faster than MLA without FA,
and 57% faster than the main branch FA.
* WIP
* Cleanup
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* FlashMLA - it finally works (on the CPU)
* FlashMLA: allow for f16 and bf16 cache in addition to q8_0
* It works with ggml FA, not with iqk FA
* WIP
* FlashMLA: it now works with iqk
I had forgotten to divide the Q stride by sizeof(float), and
that's why, very confusingly, it was working for TG but not for PP
(see the sketch below).
* WIP
* FlashMLA: that should be it for now
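As an illustration of the stride bug (sketch only, not the actual kernel):
ggml strides such as nb[1] are in bytes, while the kernel indexes float
elements, so the byte stride has to be divided by sizeof(float).

    #include <cstddef>

    // Correct: convert the byte stride nbq1 to an element stride.
    static const float * q_row(const float * q, size_t nbq1, int i) {
        return q + (size_t)i*(nbq1/sizeof(float));
        // return q + (size_t)i*nbq1;  // the bug: byte stride used as elements
    }
    // For TG nq = 1, so i is always 0 and the bug is invisible;
    // for PP i > 0 and the wrong rows get loaded.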
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fusing MoE up * unary(gate)
* Fusing MoE up * unary(gate): CUDA
We get ~13% speedup for PP-512 and ~2% for TG-128
for DeepSeek-Lite
* On CUDA also fuse MoE down * (up * unary(gate))
in case the MUL_MAT_ID op for the down experts is the next
op in the graph (a sketch of the unfused graph follows below).
* Command line option to enable fused MoE up*unary(gate)
* Add fmoe option to llama-bench
* Adding forgotten gelu, relu, silu on ARM
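For context, here is the unfused graph that the fused op replaces (standard
ggml calls; the tensor names are illustrative):

    #include "ggml.h"

    // Unfused MoE FFN: three graph nodes that the fused op collapses into one.
    static struct ggml_tensor * moe_up_unary_gate(
            struct ggml_context * ctx,
            struct ggml_tensor  * up_exps,    // per-expert up projections
            struct ggml_tensor  * gate_exps,  // per-expert gate projections
            struct ggml_tensor  * cur,        // input activations
            struct ggml_tensor  * ids) {      // selected expert ids
        struct ggml_tensor * up   = ggml_mul_mat_id(ctx, up_exps,   cur, ids);
        struct ggml_tensor * gate = ggml_mul_mat_id(ctx, gate_exps, cur, ids);
        return ggml_mul(ctx, ggml_silu(ctx, gate), up); // up * unary(gate)
    }

On CUDA, when the down-experts MUL_MAT_ID immediately follows, it is pulled
into the same fused op as well.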
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* This seems to be a better way
to do the attention matrix multiplications in the TG case.
* Cleanup
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Adding q8_KV - Basics + AVX2 gemm/gemv
* q8_KV: Better AVX2 gemm
* q8_KV: Better Zen4 gemm
We get 225.7 t/s for L3-8B. In comparison, q8_0 without
run-time repacking is at 169 t/s.
* q8_KV: AVX2 gemm/gemv
We get 254 t/s for L3-8B vs 194 t/s for q8_0 without rtr.
* q8_KV: be able to use it for K cache
This required quite a few fixes in ggml and llama.cpp:
* ggml: do not calculate row size as n/block_size*type_size. I had
removed most of it when implementing the quants with per-row scale,
but it was still lurking in ggml_copy. Not sure if these were the last
remnants of ggml-style row sizes, or if there are still places left.
* llama.cpp: get rid of the 1d K cache assumption. Create and manage
the K-cache as a 2D tensor so we can have per-row metadata as needed
by q8_KV. (A sketch of both fixes follows at the end of these notes.)
Using q8_KV for K-cache results in non-negligible performance gains.
More details to follow, but for DeepSeek-Lite with MLA, we get
18% speedup for PP-8192 compared to q8_0 K-cache.
* q8_KV: be able to use it for K cache in FA
* q8_KV: repack it for K*Q in FA
* q8_KV: slightly faster gemv on Zen4
* q8_KV: slightly faster gemv on Zen4
* q8_KV: ARM_NEON
We get PP-512 = 167 t/s for L3-8B without interleaving!
We do the interleaving on the fly, so I wonder if this
could be done for other quants as well.
* q8_KV: use it in FA on NEON
* q8_KV_r8 - repacked q8_KV
On Zen4 it is slower than q8_k_r8 (292 vs 370 t/s).
This makes no sense whatsoever as the q8_KV_r8 GEMM is
basically the q8_k_r8 GEMM with the unnecessary block stuff
removed (so, one would think that it would be faster).
* q8_KV_r8: don't use nrc_y = 16 on Zen4
This is faster - 350 t/s. Why?
Much better than the 290 t/s we had before, but still slower
than the 370 t/s for q8_k_r8.
* q8_KV: nrc_y = 16 also doesn't pay off in FA
* Minor
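A sketch of both fixes mentioned above (ggml_row_size, ggml_new_tensor_2d,
etc. are the existing ggml API; the wrapper functions are illustrative):

    #include "ggml.h"

    // Row size: the old ggml-style n/block_size*type_size misses the
    // per-row metadata (scales) that types like q8_KV carry.
    static size_t k_row_bytes(enum ggml_type type, int64_t n_embd_k) {
        // size_t bad = n_embd_k/ggml_blck_size(type)*ggml_type_size(type);
        return ggml_row_size(type, n_embd_k);  // accounts for per-row metadata
    }

    // K cache as a 2D tensor (one row per cache slot) instead of a 1D blob,
    // so per-row metadata has a well-defined place.
    static struct ggml_tensor * new_k_cache(struct ggml_context * ctx,
                                            enum ggml_type type,
                                            int64_t n_embd_k, int64_t kv_size) {
        return ggml_new_tensor_2d(ctx, type, n_embd_k, kv_size);
    }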
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
This allows us to optimize TG performance for GQA models.
E.g., for IQ4_XS L3-8B at 8k context, TG-64 goes from 8.6 to 10.26 t/s.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Adding support for K head size != V head size
This is relevant for DeepSeek models.
At this point ggml CPU FA works.
Now I need to go and change iqk FA to make it work
with Dk != Dv.
* iqk support for K head size != V head size
To keep compilation time from exploding, only
Dk = 192, Dv = 128 are instantiated for now (DeepSeek);
see the sketch below.
* FA: very slightly faster for nq = 1 (TG)
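A self-contained sketch of the shape change (hypothetical kernel, not the
iqk code): templating on both head sizes lets K rows be Dk wide and V rows
Dv wide, and only the DeepSeek combination is instantiated.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Naive single-query (TG) attention with distinct K/V head sizes.
    template <int Dk, int Dv>
    void naive_fa_tg(const float * q, const float * k, const float * v,
                     float * out, int n_kv, float scale) {
        std::vector<float> s(n_kv);
        float smax = -INFINITY;
        for (int j = 0; j < n_kv; ++j) {           // scores: K rows are Dk wide
            float d = 0;
            for (int i = 0; i < Dk; ++i) d += q[i]*k[(size_t)j*Dk + i];
            s[j] = d*scale;
            smax = std::max(smax, s[j]);
        }
        float sum = 0;
        for (int j = 0; j < n_kv; ++j) { s[j] = std::exp(s[j] - smax); sum += s[j]; }
        for (int i = 0; i < Dv; ++i) out[i] = 0;   // output is Dv wide
        for (int j = 0; j < n_kv; ++j) {
            const float w = s[j]/sum;
            for (int i = 0; i < Dv; ++i) out[i] += w*v[(size_t)j*Dv + i];
        }
    }

    // Only the DeepSeek shape is instantiated, keeping compile time in check.
    template void naive_fa_tg<192, 128>(const float *, const float *,
                                        const float *, float *, int, float);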
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Similar to the CUDA situation.
It is OFF by default.
If OFF, only F16, Q8_0, Q6_0, and, if the CPU provides native
BF16 support, BF16 FA kernels will be included.
To enable all, cmake -DGGML_IQK_FA_ALL_QUANTS=1 ...
This cuts compilation time for iqk_mul_mat.cpp by almost half
(45 seconds vs 81 seconds on my Ryzen-7950X).
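A sketch of how the option gates what gets compiled (the helper and the
extra-quant list are illustrative, not the actual code in iqk_mul_mat.cpp):

    #include "ggml.h"

    // Which KV-cache types get FA kernels compiled in.
    static bool iqk_fa_type_enabled(enum ggml_type type) {
        switch (type) {
            case GGML_TYPE_F16:
            case GGML_TYPE_Q8_0:
            case GGML_TYPE_Q6_0:        // ik_llama.cpp-specific type
                return true;
    #if defined(__AVX512BF16__)         // CPU has native bf16 support
            case GGML_TYPE_BF16:
                return true;
    #endif
    #ifdef GGML_IQK_FA_ALL_QUANTS       // OFF by default
            case GGML_TYPE_Q4_0:        // example extra quants
            case GGML_TYPE_Q4_1:
            case GGML_TYPE_IQ4_NL:
                return true;
    #endif
            default:
                return false;
        }
    }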
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>