Mirror of https://github.com/ikawrakow/ik_llama.cpp.git, synced 2026-01-26 09:09:50 +00:00
🔀 #187 - IQ1_M_R4: better 1.75 bpw quants
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2025-02-06 |
| Updated | 2025-02-06 |
Description
Following in the footsteps of #185, this PR adds `IQ1_M_R4`, a 4-row interleaved version of `IQ1_M`.
- I have removed the `f16` super-block scale (replaced with a `f16` per-row scale) and have changed the 3-bit `IQ1_M` block scales to 4 bit. Hence, we end up using the same 1.75 bpw as `IQ1_M` (a rough layout sketch is given after this list).
- The above change allows implementing `IQ1_M_R4` with a block size of 32. I wanted to have this because DeepSeek-Lite, the model I'm testing with, has a lot of tensors with row sizes not divisible by 256, so a significant fraction of tensors gets quantized to `IQ4_NL` when using `IQ1_M`.
- Quantization mixes for MoE models are adjusted. Today's mainline `llama.cpp` arrives at a context-512 perplexity (`PPL(512)` in what follows) of 20.75 for DeepSeek-Lite using 2.74 bpw with `IQ1_M`. The `IQ1_M_R4` quantization in this PR gets `PPL(512) = 8.85` with 1.966 bpw for the repeating layers.
- `IQ1_M_R4` is much faster on the CPU compared to `IQ1_M` (see tables below). I never implemented iqk-style GEMM for `IQ1_S`/`IQ1_M`, so these quantization types run at the snail speed of mainline `llama.cpp`.
- Caveat: it is CPU only for now.
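To make the layout change in the first bullet concrete, here is a minimal sketch of what a 4-row interleaved block with these parameters could look like. Everything in it (the struct name `block_iq1_m_r4_sketch`, the exact field packing) is an assumption for illustration; the actual type in ik_llama.cpp may be packed differently.

```c++
// Illustrative layout only -- NOT the actual ik_llama.cpp definition.
// Assumptions: 4 interleaved rows, 32 weights per row per block, grid indices
// packed as in IQ1_M (low 8 bits in qs, high bits plus shift bits in qh), and
// two 4-bit scales per row (one per group of 16 weights). The f16 per-row
// scale is assumed to live outside the block, stored once per row.
#include <cstdint>

struct block_iq1_m_r4_sketch {
    uint8_t qs[4 * 4];   // low 8 bits of the grid indices: 4 groups of 8 weights x 4 rows
    uint8_t qh[4 * 2];   // high index bits and grid shift bits: 2 bytes per row x 4 rows
    uint8_t scales[4];   // two 4-bit block scales per row, packed into one byte per row
};
// 28 bytes for 4 x 32 weights = 1.75 bpw, matching IQ1_M: the 3-bit scales plus
// packed f16 super-block scale of IQ1_M (64 bits per 256 weights) are traded
// for 4-bit scales here (also 64 bits per 256 weights).
static_assert(sizeof(block_iq1_m_r4_sketch) == 28, "1.75 bpw for 4 rows of 32 weights");
```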
The following table compares prompt processing (pp512) and token generation (tg128) speed for LLaMA-3.1-8B on AVX2 (Ryzen-5975WX), Zen4 (Ryzen-7950X) and ARM_NEON (M2-Max CPU). I didn't use DeepSeek-Lite for this comparison to avoid the differences in quantization type mix that arise because not all of its tensors have row sizes that are a multiple of 256.
| platform | threads | test | t/s (IQ1_M) | t/s (IQ1_M_R4) | Speedup |
|---|---|---|---|---|---|
| AVX2 | 32 | pp512 | 43.98 ± 0.09 | 187.94 ± 0.21 | 4.273 |
| Zen4 | 16 | pp512 | 26.70 ± 0.03 | 149.57 ± 0.31 | 5.602 |
| NEON | 8 | pp512 | 17.61 ± 0.03 | 95.04 ± 0.16 | 5.397 |
| AVX2 | 2 | tg128 | 2.66 ± 0.00 | 3.96 ± 0.00 | 1.489 |
| | 4 | tg128 | 5.25 ± 0.00 | 7.76 ± 0.00 | 1.478 |
| | 8 | tg128 | 9.93 ± 0.16 | 13.71 ± 0.01 | 1.381 |
| | 16 | tg128 | 17.14 ± 0.00 | 22.60 ± 0.01 | 1.319 |
| | 32 | tg128 | 23.91 ± 0.01 | 25.39 ± 0.02 | 1.062 |
| Zen4 | 2 | tg128 | 3.39 ± 0.00 | 5.29 ± 0.00 | 1.560 |
| | 4 | tg128 | 6.50 ± 0.00 | 10.19 ± 0.00 | 1.568 |
| | 8 | tg128 | 11.68 ± 0.01 | 17.54 ± 0.01 | 1.502 |
| | 16 | tg128 | 19.13 ± 0.05 | 25.91 ± 0.43 | 1.354 |
| NEON | 2 | tg128 | 4.16 ± 0.00 | 5.27 ± 0.01 | 1.267 |
| | 4 | tg128 | 7.88 ± 0.00 | 9.99 ± 0.01 | 1.268 |
| | 8 | tg128 | 14.74 ± 0.26 | 19.19 ± 0.01 | 1.302 |
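As general background on why row interleaving tends to help CPU matrix multiplications (this is not a description of this PR's iqk kernels, which work on the quantized data directly with SIMD): with four rows stored interleaved, each loaded activation value can be reused for four output rows before moving on. Below is a deliberately simplified scalar sketch of that reuse pattern, with made-up names.

```c++
// Toy illustration of the data-reuse idea behind 4-row interleaving.
// Not the actual kernel: scalar float math, no quantization, invented names.
#include <cstddef>
#include <cstdio>
#include <vector>

// Dot products of four interleaved rows against one activation vector.
// 'rows' stores the weights interleaved: rows[4*i + r] holds element i of row r.
static void gemv_4row_interleaved(const std::vector<float>& rows,
                                  const std::vector<float>& x,
                                  float out[4]) {
    out[0] = out[1] = out[2] = out[3] = 0.0f;
    for (size_t i = 0; i < x.size(); ++i) {
        const float xi = x[i];          // each activation is loaded once ...
        out[0] += rows[4*i + 0] * xi;   // ... and reused for four output rows
        out[1] += rows[4*i + 1] * xi;
        out[2] += rows[4*i + 2] * xi;
        out[3] += rows[4*i + 3] * xi;
    }
}

int main() {
    const size_t n = 8;                          // toy row length
    std::vector<float> rows(4 * n), x(n, 1.0f);  // activations all 1.0
    for (size_t i = 0; i < n; ++i)
        for (size_t r = 0; r < 4; ++r)
            rows[4*i + r] = float(r + 1);        // row r is filled with r+1
    float out[4];
    gemv_4row_interleaved(rows, x, out);
    std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]); // prints: 8 16 24 32
    return 0;
}
```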