ik_llama.cpp/162 - IQ3_S_R4.md at main - ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-01-26 17:20:01 +00:00

Sub-4 bpw i-quants have terrible CPU performance, so I was curious to see whether interleaving rows would improve it.

This PR adds IQ3_S_R4, a 4-row interleaved version of IQ3_S.
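To illustrate what row interleaving means in general, here is a minimal sketch. It is not the actual IQ3_S_R4 packing, which this description does not spell out: `Block` and `interleave_rows_x4` are hypothetical names, and the block is a stand-in rather than the real IQ3_S block. The idea shown is only the reordering: block `b` of four consecutive rows is stored contiguously, so a kernel can presumably work on four rows at once while reusing the same activation data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one quantized block.
// The real IQ3_S block holds packed quants and scales; here we only track
// where the block came from so the reordering is easy to inspect.
struct Block {
    int row;    // source row index
    int index;  // block position within the source row
};

// Reorder a row-major matrix of blocks so that, within each group of 4 rows,
// block b of rows r0, r0+1, r0+2, r0+3 becomes adjacent in memory.
// Assumes n_rows is a multiple of 4.
std::vector<Block> interleave_rows_x4(const std::vector<Block>& rows,
                                      std::size_t n_rows,
                                      std::size_t blocks_per_row) {
    assert(n_rows % 4 == 0);
    assert(rows.size() == n_rows * blocks_per_row);

    std::vector<Block> out;
    out.reserve(rows.size());
    for (std::size_t r0 = 0; r0 < n_rows; r0 += 4) {            // each group of 4 rows
        for (std::size_t b = 0; b < blocks_per_row; ++b) {      // each block column
            for (std::size_t k = 0; k < 4; ++k) {               // the 4 rows of the group
                out.push_back(rows[(r0 + k) * blocks_per_row + b]);
            }
        }
    }
    return out;
}
```

With 4 rows of 2 blocks each, the output order is (row 0, block 0), (row 1, block 0), (row 2, block 0), (row 3, block 0), then the same four rows for block 1.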

We get significant performance gains. Here is PP-512 for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max) and AVX2 (Ryzen-5975WX):

| Platform | Threads | IQ3_S (t/s) | IQ3_S_R4 (t/s) | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 42.97 ± 1.28 | 80.61 ± 0.41 | 1.876 |
| Zen4 | 16 | 104.66 ± 0.68 | 159.08 ± 0.57 | 1.520 |
| AVX2 | 32 | 132.50 ± 0.37 | 231.41 ± 0.45 | 1.746 |
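For reference, the Speedup column is consistent with the ratio of the IQ3_S_R4 throughput to the IQ3_S throughput. A minimal sketch of that arithmetic, where `speedup` is a hypothetical helper and not part of the code base:

```cpp
// Speedup as new throughput divided by baseline throughput.
// Both arguments are mean throughputs in the same unit.
double speedup(double baseline, double interleaved) {
    return interleaved / baseline;
}
```

For example, `speedup(42.97, 80.61)` reproduces the ARM_NEON entry to three decimals.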

We get decent performance gains for TG as well, especially on AVX2. Here are the results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | IQ3_S (t/s) | IQ3_S_R4 (t/s) | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 3.00 ± 0.00 | 3.40 ± 0.00 | 1.133 |
| | 4 | 5.74 ± 0.02 | 6.60 ± 0.01 | 1.150 |
| | 8 | 9.25 ± 0.83 | 12.27 ± 0.33 | 1.326 |
| Zen4 | 2 | 4.17 ± 0.00 | 4.38 ± 0.01 | 1.050 |
| | 4 | 7.82 ± 0.05 | 8.14 ± 0.01 | 1.041 |
| | 8 | 14.29 ± 0.02 | 14.41 ± 0.02 | 1.008 |
| AVX2 | 2 | 1.98 ± 0.00 | 3.31 ± 0.00 | 1.672 |
| | 4 | 3.87 ± 0.00 | 6.49 ± 0.00 | 1.677 |
| | 8 | 7.13 ± 0.01 | 11.63 ± 0.02 | 1.631 |
| | 16 | 12.97 ± 0.00 | 15.81 ± 0.00 | 1.219 |