mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-04-30 11:21:56 +00:00
1.7 KiB
🔀 #135 - Better ARM_NEON implementation for R4 quants
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2024-12-11 |
| Updated | 2024-12-11 |
Description
We get improved performance for IQ4_XS_R4, Q4_K_R4, Q5_K_R4, and Q6_K_R4. The trick was to accumulate the super-block sums in int32_t, thus avoiding expensive int -> float conversions.
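The PR itself uses ARM NEON intrinsics; as a rough scalar sketch of the idea (the struct layout and names below are invented for illustration, not the actual ik_llama.cpp types), the change amounts to keeping the per-block scale multiplies in the integer domain and converting to float once per super-block instead of once per block:

```c
#include <stdint.h>

enum { QK = 32, NBLOCKS = 8 };  /* 8 blocks of 32 weights = one 256-weight super-block */

/* Hypothetical super-block layout: one float scale for the whole super-block
 * plus small integer per-block scales, as in the k-quants family. */
typedef struct {
    float  d;                  /* super-block scale */
    int8_t scales[NBLOCKS];    /* per-block integer scales */
    int8_t qs[QK * NBLOCKS];   /* quantized weights */
} superblock_t;

/* Baseline: convert each block's integer dot product to float immediately. */
static float dot_per_block_float(const superblock_t *x, const int8_t *y) {
    float sum = 0.0f;
    for (int b = 0; b < NBLOCKS; ++b) {
        int32_t s = 0;
        for (int j = 0; j < QK; ++j)
            s += (int32_t)x->qs[b*QK + j] * (int32_t)y[b*QK + j];
        sum += x->d * (float)x->scales[b] * (float)s;  /* int->float every block */
    }
    return sum;
}

/* Optimized: apply the per-block scales in int32_t and accumulate the whole
 * super-block as an integer; a single int->float conversion happens at the
 * end. On NEON this keeps the hot loop in integer multiply-accumulate
 * instructions instead of interleaving float converts. */
static float dot_int32_accum(const superblock_t *x, const int8_t *y) {
    int32_t acc = 0;
    for (int b = 0; b < NBLOCKS; ++b) {
        int32_t s = 0;
        for (int j = 0; j < QK; ++j)
            s += (int32_t)x->qs[b*QK + j] * (int32_t)y[b*QK + j];
        acc += (int32_t)x->scales[b] * s;              /* stays in integer domain */
    }
    return x->d * (float)acc;                          /* single conversion */
}
```

With 8-bit values, 32-weight blocks, and 8 blocks per super-block, the integer accumulator is bounded well below INT32_MAX, so the integer path loses no information relative to the per-block float path.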
Here are performance comparisons for LLaMA-3.1-8B on an M2-Max between the previous implementation (main) and this PR:
| Quant | Task | Threads | t/s (main) | t/s (PR) | Speedup |
|---|---|---|---|---|---|
| IQ4_XS_R4 | pp512 | 8 | 115.43 ± 0.57 | 131.28 ± 0.51 | 1.137 |
| IQ4_XS_R4 | tg128 | 2 | 12.71 ± 0.01 | 13.44 ± 0.01 | 1.057 |
| IQ4_XS_R4 | tg128 | 4 | 22.35 ± 0.17 | 22.98 ± 0.05 | 1.028 |
| Q4_K_R4 | pp512 | 8 | 110.02 ± 1.31 | 122.12 ± 1.28 | 1.110 |
| Q4_K_R4 | tg128 | 2 | 12.17 ± 0.01 | 13.72 ± 0.01 | 1.127 |
| Q4_K_R4 | tg128 | 4 | 21.56 ± 0.06 | 22.46 ± 0.20 | 1.042 |
| Q5_K_R4 | pp512 | 8 | 96.90 ± 0.79 | 108.66 ± 0.27 | 1.121 |
| Q5_K_R4 | tg128 | 2 | 8.22 ± 0.01 | 8.66 ± 0.01 | 1.054 |
| Q5_K_R4 | tg128 | 4 | 15.54 ± 0.09 | 16.13 ± 0.05 | 1.038 |
| Q6_K_R4 | pp512 | 8 | 83.25 ± 0.81 | 104.19 ± 1.96 | 1.252 |
| Q6_K_R4 | tg128 | 2 | 7.35 ± 0.01 | 8.05 ± 0.00 | 1.095 |
| Q6_K_R4 | tg128 | 4 | 13.80 ± 0.01 | 14.92 ± 0.03 | 1.081 |
TG results are shown only up to 4 threads because at 8 threads the computation is 100% memory bound, so the results are the same within noise.