ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-04-25 17:09:22 +00:00

Author	SHA1	Message	Date
Iwan Kawrakow	8b102792d8	iq4_nl 84.5 t/s -> 128.1 t/s. iq4_nl_r4 is at 120.4 t/s	2025-06-21 15:13:45 +02:00
Iwan Kawrakow	a78bed0f8e	q8_0 84.5 -> 128.7 t/s. q8_0_r8 is at 131 t/s.	2025-06-21 12:43:30 +02:00
Iwan Kawrakow	a834e4be15	q6_0 74.2 t/s -> 128.8 t/s. q6_0_r4 is at 107.2 t/s.	2025-06-21 12:33:52 +02:00
Iwan Kawrakow	f8efac6295	q5_0 74.2 t/s -> 128.5 t/s. q5_0_r4 is at 111.4 t/s.	2025-06-21 12:21:02 +02:00
Iwan Kawrakow	1f31789b96	q4_0 83.6 t/s -> 128.4 t/s. q4_0_r8 is at 123.5 t/s	2025-06-21 11:58:38 +02:00
Kawrakow	1843ed22c5	New integer trellis on ARM_NEON (#544 ) * Adapt iq3_kt to new trellis on NEON * iq3_kt is now working on NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-20 09:26:36 +03:00
Kawrakow	d85c64428e	New IQ2_KT, IQ3_KT and IQ4_KT, V2 (#529 ) * New iq4_kt trellis The new trellis generates int8_t values via sum_as_uint8_t[(ka * idx + kb) & 0x3f33f3f3f] - 126. CUDA dequantize works. AVX2 case Ny > 32 works, and we get 273 t/s for L3-8B. PPL is on par or even slightly lower than original QTIP trellis. * Something is not working with the AVX2 dot product * New iq4_kt: CUDA MMVQ * New iq4_kt: CUDA MMQ * For now have only iq4_kt use the new trellis * Fix iq2_kt that got broken along the way * New iq4_kt: AVX2 dot product finally works We get 13.6 t/s vs 8.4 t/s with the f16 trellis and f32 arithmetic. Still somewhat slower than other quants, but no longer pathetic. * New iq4_kt: fix vanilla AVX2 * New iq4_kt: NEON implementation We get very respectable PP-512 = 120 t/s. TG-128 is pathetic at 5.3 t/s, so 20+% slower than the f16 variant. * New iq4_kt: slightly faster NEON * New iq4_kt: slightly faster NEON * New iq4_kt: faster NEON We are now at 9.4 t/s, up from 6.6 t/s for the f16 trellis. * Minor * New iq4_kt trellis: not working Metal implementation * Remove the extra 4 bytes of row meta data that is no longer used * Cleanup * Adding forgottent file * Switching iq2_kt to new trellis - CUDA MMQ * New iq2_kt: CUDA GEMV * New iq2_kt: AVX2 dequantize * New iq2_kt: AVX2 GEMM/GEMV * Adding forgotten file * New iq2_kt: NEON GEMM/GEMV * New iq2_kt: slightly faster NEON GEMM * New iq2_kt: Metal - very slow. It seems Apple Silicon cannot quickly add 4 8-bit ints. Or I don't know how to do it - but I didn't find anything in the Metal Shading Language Specification. So, performance is quite a bit worse than the original trellis. * Add missing break * Trying @louiehelm's multiplier * CPU * iq3_kt: use integer trellis + CUDA dequantize and MMVQ * iq3_kt: MMQ * iq3_kt: AVX2 GEMM * iq3_kt: AVX2 GEMV * The trellis quants now need super-blocks of 256, so we need a check --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-18 16:20:54 +03:00
Kawrakow	c410cc72bb	Much faster CPU prompt processing (part 3) (#534 ) * Repack q4_0 and q8_0 to q8_0_R8 q8_0 is fine, but I observe a very significant PPL increase for q4_0. Best guess: precision loss with the 32 bit <-> 16 bit scale conversions. * Change q8_2_x4 to store in16_t sums With that q4_0 now works. I need to check all quants that use q8_2_x4! * q5_0 and use a dequntizing template * q6_0 129 t/s -> 296 t/s. q6_0_r4 is at 244 t/s. * iq4_nl 137 t/s -> 293 t/s. iq4_nl is at 251 t/s. * q4_1: 135 t/s -> 262 t/s * q5_1: 125 t/s -> 253 t/s * iq3_xs 178 t/s -> 363 t/s. iq4_xs_r4 is at 275 t/s. * q2_K 202 t/s -> 364 t/s. q2_k_r4 is at 247 t/s. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-18 15:30:56 +03:00
Kawrakow	dc96820ddb	Much faster CPU prompt processing (part 2) (#533 ) * iq4_ks 203 t/s -> 357 t/s. iq4_ks_r4 is 242 t/s. * iq4_k 175 t/s -> 353 t/s. iq4_k_r4 is 208 t/s. PPL is actually lower! * iq5_ks 180 t/s -> 359 t/s. iq5_ks_r4 is 210 t/s. PPL is actually lower - 7.4160 vs 7.4494 for LlaMA-3.1-8B-Instruct * iq5_k - accuracy loss is too big * iq5_k - there was a bug with the shifts ...and that's why PPL was so high. It is also high on main. This fixes it. * iq6_k 148 t/s -> 350 t/s. There is no iq6_k_r4 PPL is actually lower because we have a bug in the existing implementation! * iq3_k 169 t/s -> 363 t/s. iq3_k_r4 is at 200 t/s. * iq2_k 190 t/s -> 364 t/s. iq2_k_r4 is at 232 t/s. * iq2_ks 200 t/s -> 367 t/s. There is no iq2_ks_r4. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-18 07:29:33 +03:00
Kawrakow	0f8f8b32e2	Much faster CPU prompt processing (part 1) (#531 ) * q6_K dequantizing GEMM * Much easier: just use different vec_dot types! * WIP * Finally q6_K x q8_2_x4 dot product works * Very slightly better * We don't need the changes in ggml.c * Fix AVX2 * iq2_xs * Fix AVX2 * iq2_s * q3_K * Fix q8_k_r8 on Zen4 * q3_K: repack to q8_k_r8 instead of q8_0_r8 With that we hit 360 t/s for LlaMA-3.1-8B on a Ryzen-7950X. q8_k_r8 is 386 t/s, so for a batch size of 512 repacking costs ~7% of the time taken by the actual GEMM. * q3_K: don't scale when all quants in a block are <= 127 when repacking * iq2_s: repack to q8_k_r8 instead of q8_0_r8 * iq2_xs: rapck to q8_k_r8 * WIP * iq2_xs: repack to q8_k_r8 * iq3_xxs: repack to q8_k_r8 * iq3_s: use q8_k_r8 * iq1_s: repack to q8_k_r8 * iq1_m: repack to q8_k_r8 * iq1_m: slightly faster * Slightly faster --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-17 07:12:48 +03:00
Kawrakow	6fc5bbb657	Call iqk_convert_repack in MoE GEMM (#528 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-14 05:52:46 +03:00
Kawrakow	066ed4fd11	Faster CPU prompt processing for Q4_K and Q5_K (#525 ) * q4_K: dequantize to q8_1_r8 for batch >= 32 We get 268 t/s, up from 186 t/s. * q4_K: GEMM with q8_2_X4 * q5_K: GEMM with q8_2_X4 and repack to q8_1_r8 * Remove the scales, they are not needed --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-13 07:58:15 +03:00
Kawrakow	4fc3cb4a47	iq3_s: much faster GEMM via repacking to q8_0_r8 (#518 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-12 08:16:12 +03:00
Kawrakow	3f54b49786	Faster iq1_s GEMM via repacking to Q8_0_R8 (#517 ) TG is slightly faster too - 24.4 vs 23.1 t/s on the Ryzen-5975WX Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-11 15:01:34 +03:00
Kawrakow	69af3f5990	Much faster iq3_xxs GEMM via repacking to q8_0_r8 (AVX2) (#516 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-11 13:05:26 +03:00
Kawrakow	e56061fa12	IQ2_XXS: much faster CPU prompt processing (#515 ) * Much faster iq2_xxs GEMM PP-512 = 290 t/s vs ~110 t/s (iq2_xxs) or 148 t/s (iq2_xxs_r4) on main. * iq2_xxs: q8_2_x4 GEMM * iq2_xxs: use template for q8_2_x4 GEMM * Fix AVX2 * Cleanup * NEON is not working yet, so still use Q8_K GEMM --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-11 11:12:30 +03:00
Kawrakow	0b10f7418f	Faster CPU prompt processing for Trellis quants and MoE models (#488 ) * Also do the dequantize approach for mul_mat_id * Also do the dequantize approach for iqk_moe_fused_up_gate --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-05 08:30:35 +03:00
Kawrakow	3df1a3a44d	Trellis quants: faster CPU prompt processing (#482 ) * Experimenting with dequant + f32 GEMM For iq4_kt this results in a massive PP improvement from PP512 = ~42 t/s to PP512 = 128 t/s. * Experimenting with dequant + f32 GEMM iq2_kt: from PP512 = 57.3 t/s to PP512 = 135.0 t/s iq3_kt: from PP512 = 43.8 t/s to PP512 = 131.4 t/s * Experimenting with dequant + f16 GEMM on NEON iq2_kt: PP512 = 79 t/s from 42 t/s iq3_kt: PP512 = 81 t/s from 35 t/s Also, found the reason why the f16 implementation for iq4_kt was not working: it overflows. It works after mltiplying with the row scale before doing the multiply-adds. * Experimenting with dequant + f16 GEMM on NEON iq4_kt: PP512 = 86 t/s from 29 t/s * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-01 15:24:05 +03:00
Kawrakow	1eac9e8487	NEON implementation for trellis quants (#471 ) * iq2_kt: NEON implementation * iq3_kt: NEON implementation * iq4_kt: not working NEON implementation * iq4_kt: NEON implementation Have to use f32 arithmetic else I get gibberish? Correspondigly ridiculously slow. * Cleanup * iq4_kt: slightly faster TG on NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-29 18:57:41 +03:00
Andrew Chan	a1c931c30c	Trellis quants with CPU inference (#441 ) * WIP * WIP * WIP * Testing Trellis quantization Using 12 bits per 8 weights I get a better rmse than iq2_xxs. I still need to see how quantizing the group-of-8 scales will affect accuracy. By AVX2 SIMDifying the search for the best code, LLaMA-3.1-8B gets quantized in 130 seconds on the Ryzen-7950X CPU - sluggish but still acceptable. * Testing Trellis quantization: 4-bit quantized block scales rmse increases by just 3%, so this is beating iq2_xss in terms of rmse at the same 2.0625 bpw. * Testing Trellis quantization: playing with scales and generators * iq2_kt: quantize / dequantize I now see that I was comparing apples to oranges: iq2_xxs was using a weight of sigma^2/4 + x^2, while the Trellis approach wasn't (weight = 1). Once I use the same weight, iq2_kt is actually slightly worse than iq2_xxs in terms of rmse, so does not look promising at this point. Also, once each group of 8 Trellis values no longer has a constant sum(q^2) that we can precompute, quantization becomes significantly slower (476 seconds for LLaMA-3.1-8B). * iq2_kt: CUDA dequantize so we can run perplexity calcs. As already indicated by rmse, the 2-bit trellis approach is quite a bit worse than iq2_xxs. * WIP * WIP * WIP - try larger blocks With blocks of 32 and 16 bits per groups of 8 the brute force seach becomes prohibitive in terms of CPU time (30+ minutes for 8B LLaMA after SIMDifying with AVX2). The trick is to group the points in clusters, find the nearest cluster, and only search within the cluster. * iq2_kt - this is better Using blocks of 32 and 16 bits per group of 8 weights it beats iq2_xxs in terms of PPL by a significant margin. It is 0.0625 bpw larger, but even if we go to 15 bits per group od 8 (so 0.0625 bpw less than iq2_xxs), PPL is still lower. * iq2_kt - even better Re-quantize after determining block scales (at the epxense of much longer quantization time). * iq2_kt: CUDA dot product Implemented as DMMV. Very slow - just 81 t/s for LLaMA-3.1-8B. Then again, Q2_K_S with forced to use DMMV only gets 112 t/s vs 145 t/s via MMVQ. My memory is that when the DMMV kernels were properly maintained/used, DMMV was about on par with MMVQ for k-quants on my GPU. * iq2_kt: very slightly faster CUDA dot product * iq2_kt: f16 CUDA dot product We arrive at 112 t/s. * iq2_kt: faster f16 CUDA dot product We arrive at 139 t/s (no FA), and 149 t/s (FA). My RTX-4080 is ~20% slower than the RTX-6000 quoted in the QTIP repository, so with FA (which I'm sure they also used) we are at around ~180 t/s on their GPU, so almost matching their performance. * iq2_kt: faster f16 CUDA dot product We arrive at 146 t/s (no FA), and 158 t/s (FA). This is measured for LLaMA-3.1-8B with output.weight left as f16. * Minor * Adding iq3_kt 3.125 bpw. So far does not look good on the PPL vs bpw plot. * Forgotten change * WIP * WIP * iq3_kt WIP: slowly improving PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is starting to be competitive/slightly better than other quants. * WIP * iq3_kt WIP: slowly improving PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892 * iq3_kt WIP: slowly improving PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking by 0.015 bpw by using iq4_k instead of q5_k for attn_v. * iq3_kt WIP: speed up quantization Nearly 60% improvement of quantization speed by having the points nelonging to a cluster copied to contiguous memory during initialization, and then accessed sequantially while searching for the closest point. LLaMA-3.1-8B now gets quantized in ~150 seconds on the Ryzen-5975WX. * iq3_kt speed up quantization Same trick as last commit applied to iq2_kt. Here we get an even larger speedup: quantization time on the Ryzen-5975WX for LLaMA-3.1-8B drops to 195 seconds from 375 seconds! * iq3_kt: CUDA dot product * iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406 PPL(LLaMA-2-7B, 4096) = 6.4179 * iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642 PPL(LLaMA-2-7B, 4096) = 6.3920 * Adding iq4_kt - not competitive at this point * WIP * WIP * iq4_kt: CUDA dot product * iq4_kt: minor tweaks * iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642 PPL(LLaMA-2-7B, 4096) = 6.3920 * iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.0297 PPL(LLaMA-2-7B, 4096) = 6.3913 Ah, quantization is faster too. About 20% faster. * iq3_kt: small improvements and faster quantization * iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 8.9627 PPL(LLaMA-2-7B, 4096) = 6.3825 Quantization is faster too: ~200 seconds for LLaMA-3.1-8B on Ryzen-5975WX. * iq3_kt: small progress * WIP * iq4_kt: go to 4.0 bpw 15 bits per group of 4, plus 8 bit scales ifor blocks of 32. This gives a slightly better PPL than iq4_kss. * iq4_kt: very slightly better at the expense of much longer quantization time. * iq4_kt: failed attemt to adjust CUDA dot product It was working for 4.125 bpw. But after changing to 4.0 bpw there is something wrong and I don't see the bug. * DRY * DRY * iq4_kt: CUDA dot product works * DRY * Report actual bpw * Minor tweaks * Checkpoint Go to groups of 8 for iq3_kt. 2 x 8 = 16 bits for the magnitude plus 1 bpw for the sign. It goves a visible improvement in the PPL vs bpw plot, but that comes at the expense of much longer quantization time (7.5 minutes for LLaMA-3.1-8B on the Ryzen-5975WX). I also notices that the 3INST generator is not actually generating a Gaussian distribution. But going to a better generator means readjusting all the hyper-parameters, so leaving it for later. * WIP for IQ2_KT * WIP - working basic iq2_kt * still super slow (0.17t/s eval) * flatten 3inst iters + avx2 (0.3t/s eval) * iq3_kt (0.3t/s eval) and renames * wip buggy iq4_KT * fix (0.22t/s eval) * naming and remove unused fn * cleanup * more cleanup * delete unused and noncompiling mmvq functions * Some performance tweaks * Slighty faster iq2_kt * port Trellis struct to iq3_kt, iq4_kt * oops untracked files --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-23 09:17:52 +03:00
Kawrakow	b94cd3b632	Refactor iqk_mul_mat.cpp (#435 ) * Refactor iqk: WIP * Refactor iqk: Factor out float GEMM (AVX2/AVX512) * Refactor iqk: Factor out GEMM for legacy quants (AVX2/AVX512) * Refactor iqk: Factor out GEMM for k-quants (AVX2/AVX512) * Refactor iqk: fix AVX2 * Refactor iqk: Factor out GEMM for i-quants (AVX2/AVX512) * Refactor iqk: fix AVX2 * Refactor iqk: Factor out GEMM for iqk-quants (AVX2/AVX512) * Refactor iqk: fix AVX2 * Refactor iqk: Factor out GEMM for 1-bit quants (ABX2/AVX512) * Refactor iqk: fix AVX2 * Refactor iqk: Factor out GEMM for iq1_bn, iq2_bn, iq2_bn_r4 * Refactor iqk: Factor out GEMM for repacked legacy quants * Refactor iqk: Factor out GEMM for q8_K_R8, q8_KV * Refactor iqk: Factor out GEMM for repacked i-quants * Refactor iqk: GEMM kernels are refactored on AVX2/AVX512 * Refactor iqk: factor out 1-bit quants (NEON) * Refactor iqk: factor out k-quants (NEON) * Refactor iqk: factor out floats (NEON) * Also iq4_xs belongs to k-quants * Refactor iqk: factor out iqk quants (NEON) * Refactor iqk: factor out legacy quants (NEON) * Refactor iqk: factor out repacked legacy quants (NEON) * Refactor iqk: factor out repacked k-quants (NEON) * Refactor iqk: factor out repacked iqk quants (NEON) * Refactor iqk: GEMM kernels are refactored on NEON * Refactor iqk: FA compiles If it works is a different story. Current compile time: 107.3 sesonds on the Ryzen-7950X * Refactor iqk: FA refactored (Zen4) Compile time for the FA files is now ~21 seconds on my Ryzen-7950X, so still slightly too long for my taste but much better than the 142 seconds we had before. * Adding forgotten file * Most helpers don't need to be templates Also hide Q4_0 and Q8_KV behind IQK_FA_ALL_QUANTS. Compilation time drops to 14 second on the Ryzen-5975WX * Fix bf16 * Refactor iqk: FA refactored (NEON) * Forgotten MMQ ref and typo (#431) * Adding forgotten iq5_k_r4 * Fix iq4_k_r4 on NEON * Fix iq4_ks on NEON It was broken before the refactoring (the shifts were not correctly applied). * Fix q8_0 on NEON * Fix q6_0 K cache --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Nexes the Elder <124105151+Nexesenex@users.noreply.github.com>	2025-05-22 10:05:51 +03:00
Kawrakow	b3036a872f	Option to enable disable the IQK CPU FA kernels (#429 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-17 11:21:58 +03:00
Kawrakow	c35a383bcd	Zen4: Faster PP for IQ2_KS, IQ4_KS, IQ5_KS (#428 ) * Zen4: faster PP for iq4_ks and iq5_ks * Zen4: faster PP for iq2_ks --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-17 10:42:33 +03:00
Kawrakow	7abdf2b099	IQ5_KS_R4: row-interleaved IQ5_KS (#426 ) * iq5_ks_r4: basics * iq5_ks_r4: Zen4 works * iq5_ks_r4: AVX2 works * iq5_ks_r4: NEON * Fix iq5_ks on NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-17 08:57:26 +03:00
Kawrakow	134d548173	Fix AVX2 implementation of IQ4_K, IQ4_KS, IQ5_K, IQ6_K (#427 ) * Fix IQ4_K on AVX2 * Fix IQ4_KS on AVX2 * Fix IQ5_K on AVX2 * Fix IQ6_K on AVX2 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-16 17:25:15 +03:00
Kawrakow	3d92d7f802	Adding IQ5_KS - 5.25 bpw quants (#422 ) * iq5_ks: basics * iq5_ks: quantize * iq5_ks: CUDA dequantize works * iq5_ks: dot product works on CUDA * iq5_ks: MMQ works * iq5_ks: Zen4 * iq5_ks: AVX2 But is is not quite right, just like iq4_k, iq5_k, iq6_k, iq4_ks. All these need fixing on AVX2. * iq5_ks: NEON * iq5_ks: Metal dequantize * iq5_ks: Metal dot product --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-15 16:02:39 +03:00
Kawrakow	3f8c865b92	Fix standard attention on the CPU (#421 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-15 08:43:39 +03:00
Kawrakow	13740622e9	Fix SER (CPU) (#415 ) * Fixing SER bugs * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-13 17:55:04 +03:00
Kawrakow	553c08b6b4	Better CPU FA performance for DeepSeek-Lite (#410 ) * Better CPU FA performance for DeepSeek-Lite * It must be like this --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-13 17:53:20 +03:00
Kawrakow	8a5c0410e1	Fix DeepSeek q8_0 cache (#391 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-07 12:06:49 +03:00
Kawrakow	090eae4d69	Fix build for Xeon Gold 6226R (#390 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-07 10:33:27 +03:00
Kawrakow	1ea1df4b2d	Fix FA bug on AVX2 (#364 ) * Fix FA bug on AVX2 * Also this was wrong --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-02 07:09:09 +02:00
Kawrakow	4c2bee0bed	Fix IQK_FA_ALL_QUANTS on AVX2 (#360 ) * Fix IQK_FA_ALL_QUANTS on AVX2 * Make it also work, not just compile --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-30 10:45:43 +02:00
Kawrakow	cda24b58cb	CPU FA improvements (#351 ) * FA: provide work buffer for K repacking * Add header to avoid comp0iler warnings * WIP * WIP * WIP * WIP * Slightly better * WIP (Zen4) * WIP * Try to improve for unusual number of heads/number of threads * Use mul_mat_qX_0_q8_2_Tx for q6_0 in FA * Use mul_mat_qX_0_q8_2_Tx for q4_0 in FA * Use Sum4q4 for q4_0 * WIP * WIP * Much better FA TG with q8_0 KV cache Just repack it even for TG. But do the repacking for k_step rows, not the whole K tensor. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-29 07:19:43 +02:00
Kawrakow	715fc552ad	Add support for Cohere2 (#341 ) * Add support for Cohere2 * Fixe IQ4_NL on AVX2 * Command-A needs fp32 precision for K*Q --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-26 08:13:25 +02:00
Kawrakow	25d1a0dca8	Fix FA on ARM (#346 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-25 11:01:08 +02:00
saood06	93cd77b655	Fix termux/android build (#336 ) * Attempt fix * Attempt fix 2 * Attempt fix 3 * Attempt fix 4 * Attempt fix 5 * Attempt fix 6 * Attempt fix 7 * Attempt fix 8 * Attempt fix 9 * Attempt fix 10 * Attempt fix 11 * Attempt fix 12 * Attempt fix 13	2025-04-21 09:13:46 +02:00
Kawrakow	3bb64d9330	Better TG performance for GQA models (CPU) (#332 ) * Slightly better CPU TG performance for GQA * Better CPU FA implementation for TG when GQA * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-17 08:08:40 +02:00
Kawrakow	f7c5a94e75	Better gemm/gemv on AVX2 fr q4_0_r8 (#331 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-15 17:18:50 +02:00
Kawrakow	2ee6263e24	Fix GCC compilation errors on ARM (#309 ) * Fix GCC compilation errors on ARM * One more --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-03 15:50:53 +02:00
Kawrakow	6e5156cab5	Fix #300 (#301 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-01 08:29:25 +02:00
Kawrakow	d0b52076da	Use bf16 instead of fp16 block scales for q8_1 (#292 ) * WIP - not working * q8_0 without bells and wistles works * It works for q8_0 * Use bf16 instead of f16,int16 * q4_0_r8 * q5_0_r4 * q6_0_r4 * Also q4_1 and q5_1 * q8_0_r8 on avx2 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-27 05:49:16 +01:00
Kawrakow	f9307d7907	Improve DeepSeek batched processing speed (#282 ) * Improve DeepSeek batched processing speed * Revert the commented out section in iqk_mul_mat.cpp It does have some benefit at long contexts. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-23 17:10:52 +01:00
Kawrakow	b8d1fac97b	Convert models to row-interleaved quants using the quantize tool (#272 ) * Repack a model with the quantize tool * WIP * Fixed various issues As we don't have a way to tell if a repacked quant has been modified, I had to remove the modification at the expense of a slight decrease in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and q4_0_r8 on ARM. * Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved * Fix GCC 13.3 compilation error * Another one * Add missing include --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-21 07:23:36 +01:00
Kawrakow	3d85a1d663	Better FlashMLA (#243 ) * This is a better FA for TG It should benefit MLA and GQA. Tested to work with DeepSeek-Lite MLA, not yet for GQA. For tg64@pp8192 it is ~13% faster than MLA without FA, and 57% faster that the main branch FA. * WIP * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-07 09:46:58 +02:00
Kawrakow	a87e54db6e	Flash MLA (CPU only) (#240 ) * FlashMLA - it finally works (on the CPU) * FlashMLA: allow for f16 and bf16 cache in addition to q8_0 * It works with ggml FA, not with iqk FA * WIP * FlashMLA: it now works with iqk I had forgotten to divide the Q stride by sizeof(float) and that's why, very cobfusingly, it was working for TG but not for PP. * WIP * FlashMLA: that should be it for now --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-03 15:17:51 +02:00
Kawrakow	547eee81d9	Fix #230 (#231 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-24 09:29:58 +02:00
Kawrakow	ac1d259b93	Fused MoE ffn_up and ffn_gate (#229 ) * Fusing MoE up * unary(gate) * Fusing MoE up * unary(gate): CUDA We get ~13% speedup for PP-512 and ~2% for TG-128 for DeepSeek-Lite * On CUDA also fuse MoE down * (up * unary(gate)) in case the MUL_MAT_ID op for the down experts is the next op in the graph. * Command line option to enable fused MoE upunary(gate) Add fmoe option to llama-bench * Adding forgotten gelu, relu, silu on ARM --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-23 14:31:11 +02:00
Kawrakow	71b7b510c2	Fix compilation error with IQK_FA_ALL_QUANTS enabled (#226 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-23 08:02:16 +02:00
Kawrakow	4926105844	Fix #217 (#220 ) * Fix #217 * Remove stuff commited by mistake --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-22 14:25:38 +02:00

1 2 3 4

164 Commits