ikawrakow/ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-01-26 17:20:01 +00:00

🔀 #155 - IQ2_XS_R4

| Author | `ikawrakow` |
| :--- | :--- |
| State | ❌ Closed |
| Created | 2024-12-21 |
| Updated | 2024-12-21 |

Description

Sub-4 bpw i-quants have terrible CPU performance, so I was curious to see whether interleaving rows could improve it.

This PR adds IQ2_XS_R4, a 4-row interleaved version of IQ2_XS.
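As a rough illustration of what "4-row interleaved" means, the sketch below regroups every four consecutive rows so that the blocks at the same column position end up next to each other in memory. This is only a generic layout sketch under simplifying assumptions: it is not the actual IQ2_XS_R4 bit packing, and `interleave_rows` and the string block labels are hypothetical names used purely for illustration.

```python
def interleave_rows(rows, group=4):
    """Regroup `rows` (each a list of blocks) into bundles of `group` rows.

    Within a bundle, block j of every row is stored contiguously:
    r0b0 r1b0 r2b0 r3b0 r0b1 r1b1 ...
    """
    if not rows:
        return []
    if len(rows) % group != 0:
        raise ValueError("number of rows must be a multiple of the group size")
    n_blocks = len(rows[0])
    if any(len(r) != n_blocks for r in rows):
        raise ValueError("all rows must have the same number of blocks")

    bundles = []
    for start in range(0, len(rows), group):
        bundle = rows[start:start + group]
        # Iterate over block positions first, then over the rows of the bundle.
        bundles.append([bundle[r][j] for j in range(n_blocks) for r in range(group)])
    return bundles


# Four rows with two blocks each, labelled "r<row>b<block>" (hypothetical labels).
rows = [[f"r{r}b{j}" for j in range(2)] for r in range(4)]
interleaved = interleave_rows(rows)
print(interleaved[0])
```

With a layout like this, a kernel can in principle reuse one loaded chunk of activations against four weight rows at once, which is the usual motivation for row interleaving.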

The performance gains are very modest. My guess is that the combination of loading data from a large table, blocks of 16 quants, and perhaps my not having found the optimum bit packing kills the performance.

Anyway, here is PP-512 for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max), and AVX2 (Ryzen-5975WX):

| Platform | Threads | IQ2_XS (t/s) | IQ2_XS_R4 (t/s) | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 45.55 ± 0.28 | 54.13 ± 0.19 | 1.188 |
| Zen4 | 16 | 135.43 ± 0.65 | 156.55 ± 0.51 | 1.156 |
| AVX2 | 32 | 157.34 ± 0.27 | 192.60 ± 0.37 | 1.224 |
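The Speedup column appears to be the plain ratio of the IQ2_XS_R4 throughput to the IQ2_XS throughput, with the ± uncertainties ignored. A minimal check of that reading, using the mean PP-512 values from the table (the `speedup` helper and the `pp512` dictionary are illustrative names, not part of the PR):

```python
# Mean PP-512 throughput per platform: (IQ2_XS, IQ2_XS_R4), copied from the table.
pp512 = {
    "ARM_NEON": (45.55, 54.13),
    "Zen4": (135.43, 156.55),
    "AVX2": (157.34, 192.60),
}


def speedup(baseline, interleaved):
    """Ratio of interleaved throughput to baseline throughput."""
    return interleaved / baseline


speedups = {platform: round(speedup(*values), 3) for platform, values in pp512.items()}
for platform, ratio in speedups.items():
    print(f"{platform}: {ratio:.3f}")
```

The rounded ratios match the table's Speedup column for all three platforms.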

We get some performance gains for TG as well, especially on AVX2. Here are the results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

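| Platform | Threads | IQ2_XS (t/s) | IQ2_XS_R4 (t/s) | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 5.10 ± 0.02 | 5.91 ± 0.01 | 1.159 |
| ARM_NEON | 4 | 9.71 ± 0.09 | 10.90 ± 0.03 | 1.123 |
| ARM_NEON | 8 | 17.21 ± 0.77 | 19.30 ± 0.56 | 1.121 |
| Zen4 | 2 | 6.54 ± 0.01 | 6.90 ± 0.00 | 1.055 |
| Zen4 | 4 | 12.23 ± 0.02 | 12.79 ± 0.00 | 1.046 |
| Zen4 | 8 | 21.19 ± 0.01 | 22.12 ± 0.01 | 1.044 |
| AVX2 | 2 | 3.16 ± 0.00 | 4.54 ± 0.00 | 1.437 |
| AVX2 | 4 | 6.13 ± 0.00 | 8.75 ± 0.00 | 1.427 |
| AVX2 | 8 | 11.31 ± 0.05 | 15.67 ± 0.05 | 1.385 |
| AVX2 | 16 | 19.41 ± 0.01 | 22.28 ± 0.00 | 1.148 |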
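One way to read the TG-128 numbers is to look at how throughput scales when going from 2 to 8 threads, the range measured on every platform. The sketch below computes that scaling factor from the mean values in the table; it is a reader's aid rather than an analysis from the PR, and `tg128` and `scaling_2_to_8` are hypothetical names.

```python
# Mean TG-128 throughput: platform -> quant type -> {threads: tokens/s}, from the table.
tg128 = {
    "ARM_NEON": {
        "IQ2_XS": {2: 5.10, 4: 9.71, 8: 17.21},
        "IQ2_XS_R4": {2: 5.91, 4: 10.90, 8: 19.30},
    },
    "Zen4": {
        "IQ2_XS": {2: 6.54, 4: 12.23, 8: 21.19},
        "IQ2_XS_R4": {2: 6.90, 4: 12.79, 8: 22.12},
    },
    "AVX2": {
        "IQ2_XS": {2: 3.16, 4: 6.13, 8: 11.31, 16: 19.41},
        "IQ2_XS_R4": {2: 4.54, 4: 8.75, 8: 15.67, 16: 22.28},
    },
}


def scaling_2_to_8(by_threads):
    """Throughput at 8 threads divided by throughput at 2 threads (ideal: 4.0)."""
    return by_threads[8] / by_threads[2]


scaling = {
    platform: {quant: scaling_2_to_8(values) for quant, values in quants.items()}
    for platform, quants in tg128.items()
}
for platform, quants in scaling.items():
    for quant, factor in quants.items():
        print(f"{platform:9s} {quant:10s} x{factor:.2f}")
```

On these figures the interleaved variant scales slightly less well with thread count than plain IQ2_XS on every platform, which is consistent with the Speedup column shrinking as threads are added.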
