ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-01 20:19:52 +00:00

Author	SHA1	Message	Date
Kawrakow	6028362ef6	Native build option for CUDA when GGML_NATIVE is set (#280 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-22 18:17:51 +01:00
Kawrakow	13ecc5332e	Fighting with cmake (#279 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-22 16:58:30 +01:00
Kawrakow	b8d1fac97b	Convert models to row-interleaved quants using the quantize tool (#272 ) * Repack a model with the quantize tool * WIP * Fixed various issues As we don't have a way to tell if a repacked quant has been modified, I had to remove the modification at the expense of a slight decrease in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and q4_0_r8 on ARM. * Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved * Fix GCC 13.3 compilation error * Another one * Add missing include --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-21 07:23:36 +01:00
Kawrakow	22c84a126f	Fix ggml_compute_forward_dup_q (#269 ) I broke it with PR #265. I was testing with a model where the wk_b and wk_v tensors were present, so didn't need to be computed, so didn't notice that the change I made to ggml_compute_forward_dup_q breaks that computation. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-19 15:47:24 +01:00
Kawrakow	c3b75c531c	Prevent FlashMLA-1 from running on CUDA (#268 ) as it is not supported. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-19 13:03:59 +01:00
Kawrakow	8e549b4234	Allow q8_0 cache on the CPU for FlashMLA-2 (#265 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-18 15:41:05 +01:00
Kawrakow	68a5b60408	Make Q8_0 KV cache work with mla=2,fa on CUDA (#264 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-18 15:40:47 +01:00
Kawrakow	f4ebf13b6a	Fix #261 (#262 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-18 07:44:43 +01:00
Kawrakow	bdcae905c4	Compile time option to use bf16 for quants without MMQ kernels (#261 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-18 07:37:10 +01:00
Kawrakow	dcdfad29f7	FlashMLA-2: reduce compute buffer size (CUDA and CPU) (#260 ) * FlashMLA-2: eliminate intermediate f32 tensors This works on the CPU. PP performance is ~13% better for 16k tokens and compute buffer is quite a bit smaller. * FlashMLA-2: enable fast path only on the CPU for now I did implement the necessary ops on CUDA, but something is still wrong there, so for now we only use it when running CPU-only. * FlashMLA-2: slightly smaller compute buffer size * Prepare wk_b when loading DeepSeek models (if wk_b is missing) * Add some comments * Fix case where wkv_b is quantized with k- or i-quants. * Fix CUDA There is an issue with quantized GEMV on CUDA when the left operand (the matrix) is not contiguous. So, for now, we also create wv_b during model loading and use that instead of the 3D view of wkv_b. * FlashMLA-2: avoid conversions to f32 also on CUDA * Be able to compute for more than 65535 tokens On CUDA just a quick hack that allows us to concatenate tensors with more than 65535 rows along zeroth dimension as needed by FlashMLA-2. Also needed some care in the perplexity tool to avoid int overflows when evaluating the computed logits. * Reduce memory usage for FlashMLA-2 Oh, also fix int overflow in the CUDA concat implementation. It is funny how the llama.cpp 64-bit police has gone (almost) everywhere and replaced 32-bit ints with 64-bit ints, needed or not, but hasn't done it where it is actually needed. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-18 07:36:42 +01:00
Kawrakow	305fabfc3b	FlashMLA-2 (CPU): faster and smaller compute buffer size (#253 ) * FlashMLA-2: eliminate intermediate f32 tensors This works on the CPU. PP performance is ~13% better for 16k tokens and compute buffer is quite a bit smaller. * FlashMLA-2: enable fast path only on the CPU for now I did implement the necessary ops on CUDA, but something is still wrong there, so for now we only use it when running CPU-only. * FlashMLA-2: slightly smaller compute buffer size --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-13 12:07:43 +02:00
Kawrakow	3f23ed68f1	MLA-2: Allow usage of q8_0 for KV cache on CUDA (#252 ) * FlashMLA(CUDA): WIP to allow q8_0 quantized cache * WIP * FlashMLA(CUDA) - allow q8_0 for KV cache This works, and PP is not bad, but TG is still quite a bit slower. * FlashMLA(CUDA) - allow q8_0 for KV cache This is better. ~9% slower than f16 cache for short contexts, nearly on par at 16k tokens. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-12 07:21:46 +02:00
Kawrakow	a48e163247	DeepSeek imatrix stuff (#250 ) * This gives us ~20% TG speedup for DeepSeek on CUDA * Slightly better * Also do it for plain (not fused) mul_mat_id * Guard against numerical precision issues for MLA on CUDA * imatrix: wv_b <-> wkv_b --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-10 16:19:09 +02:00
Kawrakow	699c9cb7f6	Faster MoE token generation on CUDA (#248 ) * This gives us ~20% TG speedup for DeepSeek on CUDA * Slightly better * Also do it for plain (not fused) mul_mat_id * Guard against numerical precision issues for MLA on CUDA --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-10 16:16:51 +02:00
Kawrakow	81748fb55e	Faster FlashMLA prompt processing (#246 ) * FlashMLA-2: faster prompt processing The current MLA implementation computes wv_b * (k_cache * softmax(k_cache * (wk_b * q))) This leads to 3.4X more multiply-adds (madds) compared to standard attention. Due to the resulting tensor shapes, TG is still faster than standard attention because the k_cache * (wk_b * q) and k_cache * (softmax(k_cache * (wk_b * q))) multiplications become GEMMs, so the additional madds are more than compensated for due to the much higher performance of GEMMs compared to GEMVs. But for PP, where we are dealing with GEMMs in both cases, the additional madds needed for MLA lead to lower performance, with the performance gap increasing with context length. So, then, when we are dealing with PP, we can rearrange the above to (wv_b * k_cache) * softmax((wk_b^T * k_cache) * q), thus transforming it into the standard attention mechanism. We do need two additional matrix multiplications (which in practice is done as a single wkv_b * k_cache GEMM) with the entire K cache. But this is still cheaper than MLA, as we end up with 1.8X the madds required by standard attention. Oh, these figures are for the DeepSeek-V3/R1/Lite attention architecture. This leads to a significant PP performance increase compared to standard MLA with FA. There are many upsides to this: * If we only apply the above trick when we are processing more than X tokens (with suitably chosen X), TG performance stays the same as MLA with FA * We still need to store just the K-cache, so 576 entries per layer for DeepSeek-V3/R1/Lite * We get significantly better PP performance * We can use MLA+FA on CUDA. It works already with this commit for PP, something is not yet quite right for TG. The downside is that it only works with fp16 cache (for now). This is so because we need to convert the cache to fp32, else we cannot do the wkv_b * k_cache matrix multiplication (which in ggml requires the second operand to be fp32). But converting (copying) to fp32 only works for f16, bf16 and f32 tensors, so no luck with quantized cache. Another reason that we need to convert to fp32 is that the cache contains the RoPE'd portion, which we need to concatenate to the result of the wkv_b * k_cache matrix multiplication. Also this op works only when the tensors being concatenated are both fp32. So much about ggml being a general purpose ML library. * FlashMLA-2: on the CPU it now works for quantized cache except for q8_KV (q8_KV has row meta data, and there is still some confusion with row sizes because of that). * FlashMLA-2: on the CPU it now works also with q8_KV --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-08 19:33:41 +02:00
Kawrakow	3d85a1d663	Better FlashMLA (#243 ) * This is a better FA for TG It should benefit MLA and GQA. Tested to work with DeepSeek-Lite MLA, not yet for GQA. For tg64@pp8192 it is ~13% faster than MLA without FA, and 57% faster than the main branch FA. * WIP * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-07 09:46:58 +02:00
Kawrakow	7bdbf99bbd	DeepSeek CUDA Flash Attention (#241 ) * WIP CUDA FA with Dk != Dv * WIP * CUDA FA WIP - It actually works! No TG yet, but for PP I can run FA with fp16 cache and it gets the same answer. * CUDA FA WIP - it now works for Q8_0 + Q8_0 for KV cache * CUDA FA WIP - TG, not working yet. * CUDA FA with Dk != Dv: it works now for DeepSeek --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-05 07:27:49 +02:00
Kawrakow	a87e54db6e	Flash MLA (CPU only) (#240 ) * FlashMLA - it finally works (on the CPU) * FlashMLA: allow for f16 and bf16 cache in addition to q8_0 * It works with ggml FA, not with iqk FA * WIP * FlashMLA: it now works with iqk I had forgotten to divide the Q stride by sizeof(float) and that's why, very confusingly, it was working for TG but not for PP. * WIP * FlashMLA: that should be it for now --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-03 15:17:51 +02:00
Kawrakow	a89adaa78f	SER - Smart Expert Reduction (#239 ) * A better way to measure the cost of ggml_barrier * Smart expert selection * Add ser option to llama-bench --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-02 13:47:38 +02:00
Kawrakow	ef9a3d17b5	A better way to measure the cost of ggml_barrier (#238 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-01 17:12:58 +02:00
Kawrakow	a79ab8f342	Reduce size of compute buffers (#237 ) * This reduces compute buffer size for MLA * This should accomplish it for standard attention * Much better * Better concat for contiguous tensors If all the op does is to concatenate the second tensor to the first, why would we want to have a loop? --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-01 08:25:27 +02:00
Kawrakow	b762db7c92	Option to use MLA without a transposed cache (#235 ) The `-mla` command line option turns into an int from a bool. mla = 0: use standard attention mla = 1: use MLA with transposed cache mla > 1: use MLA without transposed cache Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-27 16:40:49 +02:00
Kawrakow	51029edfdf	Faster MLA on CUDA (#234 ) * Slight MLA TG performance improvement on CUDA The low MLA performance on CUDA is due to the wk_b * q_nope operation. It turns into n_head matrix multiplications with n_head separate quantization and GEMV steps. The associated overhead is just too much for TG where each GEMV is very fast (512 x 128 = 131 KFLOP for DeepSeek-Lite, 4X that for DeepSeekV3/R1). The way it was done there was also a copy of each q_nope row before quantization, which I have now eliminated. This results in a ~2.5% speedup. What needs to happen instead is to launch a single computation that quantizes all heads, and then have a kernel that does the GEMV for all heads instead of n_head sequential GEMVs. * Slightly better * CUDA: Quantize non-contiguous tensors * Much better MLA It is a total hack, but it works. * Cleanup Remove duplicated gemv's. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-27 08:42:18 +02:00
Kawrakow	94b659a2f1	Give the user the option to override where model weights are stored (#232 ) * Give the user the option to override where model weights are stored * Fix ggml_nbytes() problem and cleanup For a tensor with zero elements ggml_nbytes() was returning uint64_t::max, and this was causing graph allocation failure. * Add timing info to CUDA graph evaluation * Add more timing info --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-25 17:55:58 +02:00
Kawrakow	547eee81d9	Fix #230 (#231 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-24 09:29:58 +02:00
Kawrakow	ac1d259b93	Fused MoE ffn_up and ffn_gate (#229 ) * Fusing MoE up * unary(gate) * Fusing MoE up * unary(gate): CUDA We get ~13% speedup for PP-512 and ~2% for TG-128 for DeepSeek-Lite * On CUDA also fuse MoE down * (up * unary(gate)) in case the MUL_MAT_ID op for the down experts is the next op in the graph. * Command line option to enable fused MoE up * unary(gate) * Add fmoe option to llama-bench * Adding forgotten gelu, relu, silu on ARM --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-23 14:31:11 +02:00
Kawrakow	71b7b510c2	Fix compilation error with IQK_FA_ALL_QUANTS enabled (#226 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-23 08:02:16 +02:00
Kawrakow	4926105844	Fix #217 (#220 ) * Fix #217 * Remove stuff committed by mistake --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-22 14:25:38 +02:00
Kawrakow	33646fc409	Fuse MoE up and gate matrix multiplications (#219 ) * This seems to be a better way to do the attention matrix multiplications in the TG case. * Cleanup * Fuse up and gate gemms in MoE models Small (~1-2%) but measurable performance gain --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-22 09:41:40 +02:00
Kawrakow	c4a5103299	Better strategy for attention matrix multiplications when generating tokens (#218 ) * This seems to be a better way to do the attention matrix multiplications in the TG case. * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-22 09:38:51 +02:00
Kawrakow	b9a6639ac3	Hopefully this really fixes the confusion between AVX512 and FANCY_SIMD (#216 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-21 15:33:25 +02:00
Kawrakow	a45da7bfbf	Fix NEON gemm/gemv for legacy quants when row size is not divisible by 128 (#213 ) * Fix gemm/gemv for legacy quants when row size is not divisible by 128 * Fix typo --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-20 13:55:13 +02:00
Kawrakow	498a582919	Optimized GEMM/GEMV for IQ1_S (#212 ) * Adding iq1_s to iqk_mul_mat (Zen4) * iq1_s: slightly better on Zen4 * iq1_s: AVX2 * iq1s: NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-20 12:41:45 +02:00
Kawrakow	a0ebfdd661	Q8_KV: 8-bit quantization type targeting the KV cache (#208 ) * Adding q8_KV - Basics + AVX2 gemm/gemv * q8_KV: Better AVX2 gemm * q8_KV: Better Zen4 gemm We get 225.7 t/s for L3-8B. In comparison q8_0 without run-time-repacking is at 169 t/s. * q8_KV: AVX2 gemm/gemv We get 254 t/s for L3-8B vs 194 t/s for q8_0 without rtr. * q8_KV: be able to use it for K cache This required quite a few fixes in ggml and llama.cpp: * ggml: do not calculate row size as n/block_size*type_size. I had removed most of it when implementing the quants with per row scale, but it was still lurking in ggml_copy. Not sure if these were the last remnants of ggml-style row sizes, or if there are still places left * llama.cpp: get rid of the 1d K cache assumption. Create and manage the K-cache as a 2D tensor so we can have per row meta data as needed by q8_KV. Using q8_KV for K-cache results in non-negligible performance gains. More details to follow, but for DeepSeek-Lite with MLA, we get 18% speedup for PP-8192 compared to q8_0 K-cache. * q8_KV: be able to use it for K cache in FA * q8_KV: repack it for KQ in FA q8_KV: slightly faster gemv on Zen4 * q8_KV: slightly faster gemv on Zen4 * q8_KV: ARM_NEON We get PP-512 = 167 t/s for L3-8B without interleaving! We do the interleaving on the fly, so I wonder if this could be done for other quants as well. * q8_KV: use it in FA on NEON * q8_KV_r8 - repacked q8_KV On Zen4 it is slower than q8_k_r8 (292 vs 370 t/s) This makes no sense whatsoever as the q8_KV_r8 GEMM is basically the q8_k_r8 GEMM with the unnecessary block stuff removed (so, one would think that it would be faster). * q8_KV_r8: don't use nrc_y = 16 on Zen4 This is faster - 350 t/s. Why? Much better than the 290 t/s we had before, but still slower than the 370 t/s for q8_k_r8. * q8_KV: nrc_y = 16 also doesn't pay off in FA * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-19 11:47:07 +02:00
Kawrakow	047ba895bb	Repack also experts (#210 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-19 10:01:49 +02:00
Iwan Kawrakow	d44aba79ea	Bug fix in activation quantization I added a change in the last PR how activations are quantized. It looked like it is working and slightly improving performance. But I now hit an edge case where I get gibberish that goes away if I remove the change. I absolutely don't see what goes wrong, so leaving the change in commented out for now.	2025-02-15 19:50:53 +02:00
Kawrakow	0551e7630b	Moving 4D gemm logic from ggml.c to iqk_mul_mat.cpp (#207 ) This allows us to optimize TG performance for GQA models. E.g., for IQ4_XS L3-8B with 8k TG-64 goes from 8.6 to 10.26 t/s. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-15 08:45:45 +02:00
Kawrakow	1bbb543478	Fix iqk_mul_mat on AVX512 systems that are missing BF16 support (#204 ) * Fix iqk_mul_mat on AVX512 systems that are missing BF16 support * One more --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-12 14:22:26 +02:00
Kawrakow	3c98bfb33d	DeepSeek FA support (CPU only) (#200 ) * Adding support for K head size != V head size This is relevant for DeepSeek models. At this point ggml CPU FA works. Now I need to go and change iqk FA to make it work with Dk != Dv. * iqk support for K head size != V head size To not have compilation time explode, just Dk = 192, Dv = 128 for now (DeepSeek) * FA: very slightly faster for nq = 1 (TG) --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-11 14:46:30 +02:00
Kawrakow	c12f73ba61	Add optional MLA (#188 ) * Deepseek MLA Optimizations Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> * Make MLA optional * Remove some unnecessary copies in the MLA attention * Deepseek MLA Optimizations V2 (#195) * Avoid allocating MHA KV cache when MLA is turned on * Added missing gguf-py file * Added final optimizations Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> * Make sure we do have wk_b and wv_b before enabling MLA --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> * Use type_k and type_v to set the types of the MLA caches They were hard-coded at f16. On my Ryzen-7950X with native bf16 support I get a fairly significant PP performance boost with bf16 KV-cache: PP-4096 = 320 t/s up from 292 t/s with fp16 KV-cache. * Better gemm strategy when nth > nhead It gives a ~10% PP performance boost for DeepSeek-Lite with 32 threads (with or without MLA). Before this commit, when nth > nhead heads were processed sequentially with all nth threads participating in each matrix multiplication. Now we find the gcd of nhead and nth and split threads into nth/gcd groups, each group processing nhead/gcd heads. --------- Co-authored-by: Saood Karim <saood05@gmail.com> Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-09 19:48:44 +02:00
Kawrakow	cae2b81155	FA: Add option to build all FA kernels (#197 ) Similar to the CUDA situation. It is OFF by default. If OFF, only F16, Q8_0, Q6_0, and, if the CPU provides native BF16 support, BF16 FA kernels will be included. To enable all, cmake -DGGML_IQK_FA_ALL_QUANTS=1 ... This cuts compilation time for iqk_mul_mat.cpp by almost half (45 seconds vs 81 seconds on my Ryzen-7950X). Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-09 18:59:33 +02:00
Kawrakow	33390c4b74	Use Q8_K_128 for IQ1_S_R4 and IQ1_M_R4 matrix multiplications (#194 ) * iq1_s_r4: Use Q8_K_128 instead of Q8_1_X4 for gemm (AVX2/Zen4) * iq1_m_r4: Use Q8_K_128 instead of Q8_1_X4 for gemm (AVX2/Zen4) * iq1_s_r4: Use Q8_K_128 instead of Q8_1_X4 for gemm (Neon) * iq1_m_r4: Use Q8_K_128 instead of Q8_0_X4 for gemm (Neon) * Simdify q8_K128 quantization also on Neon * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-09 09:14:52 +02:00
Kawrakow	6d7b58eade	Revert #79 (#192 ) * Revert "Do not quantize activations if not necessary (#79)" This reverts commit `0bf4d99774`. * Fixed compilation after revert --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-08 09:48:59 +02:00
Kawrakow	4601a8c373	cuda: non-contiguous rms norm (#190 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-07 08:33:42 +02:00
Kawrakow	b08a2e9dfc	Add additional checks for iq1_s_r4 quantization (#191 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-07 08:33:28 +02:00
Kawrakow	a08501ee52	Rename q4_0_r4, q8_0_r4 and iq4_xs_r4 to _r8 (#189 ) * Rename q4_0_r4 to q4_0_r8 to reflect actual row interleaving * Rename q8_0_r4 to q8_0_r8 to reflect actual row interleaving * Rename iq4_xs_r4 to iq4_xs_r8 to reflect actual row interleaving --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-06 18:45:28 +02:00
Kawrakow	7f61b3068e	IQ1_M_R4: better 1.75 bpw quants (#187 ) * iq1_m_r4: basics (quantize/dequantize) * iq1_m_r4: Zen4 gemm * iq1_m_r4: neon gemm * iq1_m_r4: switch to q8_0_x4 also on AVX2/Zen4 With the deltas being per group of 8, we cannot make use of the q8 sums stored in q8_1, so we get a tiny gain by using q8_0_x4. * iq1_m_r4: rename mul_mat_iq1_m_r4_q8_1 to mul_mat_iq1_m_r4_q8_0 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-06 14:08:52 +02:00
Kawrakow	a6f9f2ec9a	iq1_s_r4: slightly faster NEON gemm/gemv (#186 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-05 14:45:51 +02:00
Kawrakow	8b7536bda8	IQ1_S_R4: better 1.5 bpw quants (#185 ) * iq1_s_r4: basics - quantize/dequantize * iq1_s_r4: gemm/gemv works on AVX2/Zen4 * Don't forget to make sure we have a multiple of 4 rows per thread * iq1_s_r4: this is better * iq1_s_r4: fix Zen4 after AVX2 changes * iq1_s_r4: NEON gemm/gemv * iq1_s_r4: more bits for shared experts With this mix we arrive at PPL(512) = 9.4140 for Deepseek-Lite using 1.766 bpw for the repeating layers. On the Ryzen-7950X we get PP-512 = 494 t/s and TG-128 = 52 t/s @ 16 threads. * Forgotten counter increment * iq1_s_r4: slightly faster AVX2/Zen4 gemm/gemv * Compiler warnings --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-05 13:49:39 +02:00
Kawrakow	ecf111a11c	Deepseek-Lite (#184 ) * Quantization mixes tweaks * Make iq4_nl_r4 work with row size that are not a multiple of 128 ... on Zen4 * Make iq4_nl_r4 work with row size that are not a multiple of 128 ... on AVX2 * Make iq4_nl_r4 work with row size that are not a multiple of 128 ... on AVX2 * Make q6_0_w4 work with row size that are not a multiple of 128 ... on Zen4 * Make q6_0_w4 work with row size that are not a multiple of 128 ... on Zen4 * Make q5_0_r4 work with row size that are not a multiple of 128 ... on Zen4 and AVX2 * Make q5,6_0_r4, iq4_nl_e4 work with row size that are not a multiple of 128 also on NEON. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-01-30 18:36:24 +02:00

221 Commits