Files
ik_llama.cpp/github-data/pull_requests/162 - IQ3_S_R4.md
2025-07-23 13:31:53 +02:00

1.7 KiB

🔀 #162 - IQ3_S_R4

Author ikawrakow
State Closed
Created 2024-12-23
Updated 2024-12-23

Description

Sub-4 bpw i-quants have a terrible CPU performance, so I was curious to see if we can improve by interleaving rows.

This PR adds IQ3_S_R4, a 4-row interleaved version of IQ3_S.

We get significant performance gains. Here is PP-512 for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max) and AVX2 (Ryzen-5975WX)

Platform Threads IQ3_S IQ3_S_R4 Speedup
ARM_NEON 8 42.97 ± 1.28 80.61 ± 0.41 1.876
Zen4 16 104.66 ± 0.68 159.08 ± 0.57 1.520
AVX2 32 132.50 ± 0.37 231.41 ± 0.45 1.746

We get decent performance gains for TG as well, especially on AVX2. Here results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

Platform Threads IQ3_S IQ3_S_R4 Speedup
ARM_NEON 2 3.00 ± 0.00 3.40 ± 0.00 1.133
4 5.74 ± 0.02 6.60 ± 0.01 1.150
8 9.25 ± 0.83 12.27 ± 0.33 1.326
Zen4 2 4.17 ± 0.00 4.38 ± 0.01 1.050
4 7.82 ± 0.05 8.14 ± 0.01 1.041
8 14.29 ± 0.02 14.41 ± 0.02 1.008
AVX2 2 1.98 ± 0.00 3.31 ± 0.00 1.672
4 3.87 ± 0.00 6.49 ± 0.00 1.677
8 7.13 ± 0.01 11.63 ± 0.02 1.631
16 12.97 ± 0.00 15.81 ± 0.00 1.219