Files
ik_llama.cpp/github-data/pull_requests/155 - IQ2_XS_R4.md
2025-07-23 13:31:53 +02:00

1.9 KiB

🔀 #155 - IQ2_XS_R4

Author ikawrakow
State Closed
Created 2024-12-21
Updated 2024-12-21

Description

Sub-4 bpw i-quants have a terrible CPU performance, so I was curious to see if we can improve by interleaving rows.

This PR adds IQ2_XS_R4, a 4-row interleaved version of IQ2_XS.

We get very modest performance gains. I guess, the combination of loading data from a large table, blocks of 16 quants, and perhaps me not having found the optimum bit packing kills the performance.

Anyway, here is PP-512 for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max) and AVX2 (Ryzen-5975WX)

Platform Threads IQ2_XS IQ2_XS_R4 Speedup
ARM_NEON 8 45.55 ± 0.28 54.13 ± 0.19 1.188
Zen4 16 135.43 ± 0.65 156.55 ± 0.51 1.156
AVX2 32 157.34 ± 0.27 192.60 ± 0.37 1.224

We get some performance gains for TG as well, especially on AVX2. Here results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

Platform Threads IQ2_XS IQ2_XS_R4 Speedup
ARM_NEON 2 5.10 ± 0.02 5.91 ± 0.01 1.159
4 9.71 ± 0.09 10.90 ± 0.03 1.123
8 17.21 ± 0.77 19.30 ± 0.56 1.121
Zen4 2 6.54 ± 0.01 6.90 ± 0.00 1.055
4 12.23 ± 0.02 12.79 ± 0.00 1.046
8 21.19 ± 0.01 22.12 ± 0.01 1.044
AVX2 2 3.16 ± 0.00 4.54 ± 0.00 1.437
4 6.13 ± 0.00 8.75 ± 0.00 1.427
8 11.31 ± 0.05 15.67 ± 0.05 1.385
16 19.41 ± 0.01 22.28 ± 0.00 1.148