
🔀 #139 - Faster R4 quants on Zen4

Author ikawrakow
State Closed
Created 2024-12-13
Updated 2024-12-13

Description

Use integer accumulators for the dot products within superblocks. I did not do this originally because, according to this Intel reference, the _mm256_mullo_epi32() instruction has a very high latency. But given that integer dot-product accumulation produced a significant performance boost on ARM_NEON (see #135), I decided to try it anyway. Outcome: it is faster, despite the high latency of the integer multiplication.

Here are PP-512 and TG-128 measurements for LLaMA-3.1-8B on Zen4 (Ryzen-7950X CPU):

| Quant | Threads | Task | t/s (main) | t/s (PR) | Speedup |
| --- | --- | --- | --- | --- | --- |
| Q2_K_R4 | 16 | pp512 | 256.19 ± 0.26 | 272.69 ± 0.13 | 1.064 |
| | 1 | tg128 | 9.08 ± 0.12 | 9.95 ± 0.0 | 1.096 |
| | 2 | tg128 | 16.40 ± 0.00 | 17.44 ± 0.01 | 1.063 |
| | 4 | tg128 | 20.72 ± 0.12 | 20.97 ± 0.08 | 1.012 |
| Q3_K_R4 | 16 | pp512 | 236.77 ± 0.35 | 255.84 ± 0.20 | 1.081 |
| | 1 | tg128 | 6.78 ± 0.00 | 7.16 ± 0.07 | 1.056 |
| | 2 | tg128 | 12.46 ± 0.00 | 13.00 ± 0.01 | 1.043 |
| | 4 | tg128 | 17.02 ± 0.09 | 17.20 ± 0.24 | 1.012 |
| Q4_K_R4 | 16 | pp512 | 262.40 ± 0.28 | 268.09 ± 0.12 | 1.022 |
| IQ4_XS_R4 | 16 | pp512 | 256.80 ± 0.35 | 271.95 ± 0.39 | 1.059 |
| Q5_K_R4 | 16 | pp512 | 248.30 ± 0.29 | 256.68 ± 0.31 | 1.034 |
| Q6_K_R4 | 16 | pp512 | 243.25 ± 0.31 | 261.33 ± 0.38 | 1.074 |
| | 1 | tg128 | 7.94 ± 0.00 | 8.34 ± 0.00 | 1.050 |
| | 2 | tg128 | 10.38 ± 0.00 | 10.38 ± 0.00 | 1.000 |

For Q4_K_R4, Q5_K_R4 and IQ4_XS_R4, matrix-vector multiplications use a different implementation to which this change does not apply, so no TG results are given for those quants.