ik_llama.cpp/github-data/pull_requests/145 - IQ3_K_R4.md
2025-07-23 13:31:53 +02:00

🔀 #145 - IQ3_K_R4

| **Author** | ikawrakow |
| :--- | :--- |
| **State** | Closed |
| **Created** | 2024-12-17 |
| **Updated** | 2024-12-17 |

Description

Adding IQ3_K with 4 interleaved rows.
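The PR does not spell out the memory layout, so the following is only a rough illustration of what interleaving 4 rows generally means: blocks from 4 consecutive rows are stored next to each other, so a kernel can load the matching block of all 4 rows in one pass. The function name `interleave_rows` and the string block labels are hypothetical and are not taken from the ik_llama.cpp sources; the real IQ3_K_R4 packing also rearranges bits within blocks, which this sketch ignores.

```python
def interleave_rows(rows, group=4):
    """Reorder per-row blocks so that block b of `group` consecutive rows is contiguous.

    `rows` is a list of equally long lists of blocks (any objects).
    Illustrative only: this is NOT the actual IQ3_K_R4 layout.
    """
    if len(rows) % group != 0:
        raise ValueError("number of rows must be a multiple of the group size")
    out = []
    for start in range(0, len(rows), group):
        chunk = rows[start:start + group]
        n_blocks = len(chunk[0])
        if any(len(row) != n_blocks for row in chunk):
            raise ValueError("all rows must have the same number of blocks")
        # Emit block 0 of every row in the group, then block 1, and so on.
        for b in range(n_blocks):
            for row in chunk:
                out.append(row[b])
    return out


# Four rows with two blocks each, labelled by row and block index.
rows = [[f"r{r}b{b}" for b in range(2)] for r in range(4)]
print(interleave_rows(rows))
```

With this ordering, the first four stored elements are block 0 of rows 0 to 3, followed by block 1 of the same rows.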

We get very significant performance gains on ARM_NEON and more modest gains on AVX2/Zen4. Overall it is slower than the other _R4 quants, which is expected, as 3-bit quantization is always kind of slow.

Here is PP-512 for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max) and AVX2 (Ryzen-5975WX):

| Platform | Threads | IQ3_K | IQ3_K_R4 | Speedup |
| :--- | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 54.94 ± 0.79 | 93.83 ± 0.09 | 1.708 |
| Zen4 | 16 | 180.13 ± 0.48 | 230.33 ± 0.13 | 1.279 |
| AVX2 | 32 | 197.59 ± 0.43 | 253.36 ± 0.50 | 1.282 |

We get decent performance gains for TG as well. Here are the results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | IQ3_K | IQ3_K_R4 | Speedup |
| :--- | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 5.84 ± 0.00 | 6.71 ± 0.05 | 1.149 |
| | 4 | 11.14 ± 0.00 | 12.83 ± 0.01 | 1.152 |
| | 8 | 20.59 ± 0.17 | 23.07 ± 0.16 | 1.120 |
| Zen4 | 1 | 5.06 ± 0.00 | 5.64 ± 0.00 | 1.115 |
| | 2 | 9.58 ± 0.01 | 10.50 ± 0.01 | 1.096 |
| | 4 | 16.56 ± 0.05 | 16.77 ± 0.32 | 1.013 |
| AVX2 | 2 | 4.45 ± 0.00 | 6.83 ± 0.00 | 1.535 |
| | 4 | 8.24 ± 0.00 | 12.51 ± 0.00 | 1.518 |
| | 8 | 14.59 ± 0.04 | 16.23 ± 0.00 | 1.112 |
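
The Speedup column in both tables appears to be the plain ratio of the IQ3_K_R4 mean to the IQ3_K mean, which assumes the figures are throughput values where higher is better. A minimal sketch that recomputes it from the PP-512 means quoted above (the `speedup` helper and the `pp512` dictionary are illustrative names, not part of any benchmark tool):

```python
def speedup(baseline, interleaved):
    """Ratio of the interleaved (_R4) result to the baseline result."""
    return interleaved / baseline


# Mean PP-512 values copied from the table: (IQ3_K, IQ3_K_R4).
pp512 = {
    "ARM_NEON": (54.94, 93.83),
    "Zen4": (180.13, 230.33),
    "AVX2": (197.59, 253.36),
}

for platform, (iq3_k, iq3_k_r4) in pp512.items():
    print(f"{platform}: {speedup(iq3_k, iq3_k_r4):.3f}")
```

The ratio uses only the means; the ± uncertainties are not propagated here.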