
🔀 #135 - Better ARM_NEON implementation for R4 quants

Author ikawrakow
State Closed
Created 2024-12-11
Updated 2024-12-11

Description

We get improved performance for IQ4_XS_R4, Q4_K_R4, Q5_K_R4, and Q6_K_R4. The trick was to accumulate super-blocks in int32_t, thus avoiding expensive int -> float conversions.
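The idea can be sketched in plain C (the function name, shapes, and scale handling below are illustrative only, not the PR's actual ik_llama.cpp kernels, which use ARM_NEON intrinsics): partial products within a super-block stay in an `int32_t` accumulator, and only one int -> float conversion happens per super-block instead of one per block.

```c
#include <stdint.h>

/* Hypothetical sketch of super-block integer accumulation.
 * x, y: quantized int8 values laid out as n_blocks super-blocks
 * of block_size elements each; scale: per-tensor float scale.   */
float dot_superblock(const int8_t *x, const int8_t *y,
                     int n_blocks, int block_size, float scale) {
    float total = 0.0f;
    for (int sb = 0; sb < n_blocks; ++sb) {
        int32_t acc = 0;  /* stays integer for the whole super-block */
        for (int i = 0; i < block_size; ++i)
            acc += (int32_t)x[sb*block_size + i] * (int32_t)y[sb*block_size + i];
        /* single int -> float conversion per super-block */
        total += scale * (float)acc;
    }
    return total;
}
```

A naive version would convert each block's partial sum to float before scaling and adding; on NEON, that per-block int -> float conversion is what the PR describes avoiding.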

Here are performance comparisons for LLaMA-3.1-8B on M2-Max between the previous implementation (main) and this PR:

| Quant | Task | Threads | t/s (main) | t/s (PR) | Speedup |
|---|---|---|---|---|---|
| IQ4_XS_R4 | pp512 | 8 | 115.43 ± 0.57 | 131.28 ± 0.51 | 1.137 |
| | tg128 | 2 | 12.71 ± 0.01 | 13.44 ± 0.01 | 1.057 |
| | tg128 | 4 | 22.35 ± 0.17 | 22.98 ± 0.05 | 1.028 |
| Q4_K_R4 | pp512 | 8 | 110.02 ± 1.31 | 122.12 ± 1.28 | 1.110 |
| | tg128 | 2 | 12.17 ± 0.01 | 13.72 ± 0.01 | 1.127 |
| | tg128 | 4 | 21.56 ± 0.06 | 22.46 ± 0.20 | 1.042 |
| Q5_K_R4 | pp512 | 8 | 96.90 ± 0.79 | 108.66 ± 0.27 | 1.121 |
| | tg128 | 2 | 8.22 ± 0.01 | 8.66 ± 0.01 | 1.054 |
| | tg128 | 4 | 15.54 ± 0.09 | 16.13 ± 0.05 | 1.038 |
| Q6_K_R4 | pp512 | 8 | 83.25 ± 0.81 | 104.19 ± 1.96 | 1.252 |
| | tg128 | 2 | 7.35 ± 0.01 | 8.05 ± 0.00 | 1.095 |
| | tg128 | 4 | 13.80 ± 0.01 | 14.92 ± 0.03 | 1.081 |

TG results are shown only up to 4 threads because at 8 threads inference is fully memory bound, so main and the PR perform the same within noise.