mirror of https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-04-28 10:21:48 +00:00
3.4 KiB
🔀 #139 - Faster R4 quants on Zen4
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2024-12-13 |
| Updated | 2024-12-13 |
Description
Use integer accumulators for dot products within superblocks. I did not use this originally because, according to this Intel reference, the _mm256_mullo_epi32() instruction has an extremely high latency. But given that on ARM_NEON the use of integer dot product accumulation resulted in a significant performance boost (see #135), I decided to try anyway. Outcome: it is faster, despite the high latency of the integer multiplication.
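The idea can be illustrated with a minimal scalar sketch (this is hypothetical illustration code, not the actual vectorized ik_llama.cpp kernel): products of quantized values are accumulated in an `int32_t` within each block, and the float scale is applied once per block, instead of converting every partial result to float.

```c
#include <stdint.h>

/* Hypothetical sketch: integer accumulation of quantized dot products.
 * x, y: quantized values; scales: per-block float scales.
 * The int32 accumulator cannot overflow for realistic block sizes,
 * so only one int->float conversion and multiply is needed per block. */
float dot_blocks(const int8_t *x, const int8_t *y, const float *scales,
                 int n_blocks, int block_size) {
    float sum = 0.0f;
    for (int b = 0; b < n_blocks; ++b) {
        int32_t isum = 0;                          /* integer accumulator */
        for (int i = 0; i < block_size; ++i)
            isum += (int32_t)x[b*block_size + i] * (int32_t)y[b*block_size + i];
        sum += scales[b] * (float)isum;            /* one float op per block */
    }
    return sum;
}
```

In the vectorized Zen4 implementation the inner multiply-accumulate maps onto integer SIMD instructions such as _mm256_mullo_epi32(); the per-lane latency is hidden by processing multiple independent accumulators, which is why the change wins overall despite the slow multiply.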
Here are PP-512 and TG-128 measurements for LLaMA-3.1-8B on Zen4 (Ryzen-7950X CPU):
| Quant | Threads | Task | t/s (main) | t/s (PR) | Speedup |
|---|---|---|---|---|---|
| Q2_K_R4 | 16 | pp512 | 256.19 ± 0.26 | 272.69 ± 0.13 | 1.064 |
| | 1 | tg128 | 9.08 ± 0.12 | 9.95 ± 0.0 | 1.096 |
| | 2 | tg128 | 16.40 ± 0.00 | 17.44 ± 0.01 | 1.063 |
| | 4 | tg128 | 20.72 ± 0.12 | 20.97 ± 0.08 | 1.012 |
| Q3_K_R4 | 16 | pp512 | 236.77 ± 0.35 | 255.84 ± 0.20 | 1.081 |
| | 1 | tg128 | 6.78 ± 0.00 | 7.16 ± 0.07 | 1.056 |
| | 2 | tg128 | 12.46 ± 0.00 | 13.00 ± 0.01 | 1.043 |
| | 4 | tg128 | 17.02 ± 0.09 | 17.20 ± 0.24 | 1.012 |
| Q4_K_R4 | 16 | pp512 | 262.40 ± 0.28 | 268.09 ± 0.12 | 1.022 |
| IQ4_XS_R4 | 16 | pp512 | 256.80 ± 0.35 | 271.95 ± 0.39 | 1.059 |
| Q5_K_R4 | 16 | pp512 | 248.30 ± 0.29 | 256.68 ± 0.31 | 1.034 |
| Q6_K_R4 | 16 | pp512 | 243.25 ± 0.31 | 261.33 ± 0.38 | 1.074 |
| | 1 | tg128 | 7.94 ± 0.00 | 8.34 ± 0.00 | 1.050 |
| | 2 | tg128 | 10.38 ± 0.00 | 10.38 ± 0.00 | 1.000 |
For Q4_K_R4, Q5_K_R4 and IQ4_XS_R4, matrix-vector multiplications use a different implementation to which this change does not apply, so there are no TG results for those.