ikawrakow/ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-01-26 17:20:01 +00:00

🔀 #156 - IQ2_S_R4

| Author | `ikawrakow` |
| :--- | :--- |
| State | ❌ Closed |
| Created | 2024-12-21 |
| Updated | 2024-12-21 |

Description

Sub-4 bpw i-quants have terrible CPU performance, so I was curious to see whether we can improve it by interleaving rows.

This PR adds IQ2_S_R4, a 4-row interleaved version of IQ2_S.
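To illustrate the general idea of row interleaving, here is a minimal sketch. It assumes each row is already split into packed quantization blocks and simply stores block `b` of four consecutive rows next to each other, so that a kernel can read the same block position of all four rows from contiguous memory. The function name `interleave_rows` and the byte-level layout are hypothetical; the actual bit packing used by IQ2_S_R4 is not described here and will differ.

```python
from typing import List, Sequence


def interleave_rows(rows: Sequence[Sequence[bytes]], group: int = 4) -> List[bytes]:
    """Interleave the blocks of `group` consecutive rows (illustrative only).

    rows[r][b] holds the packed bytes of block b in row r.
    The result has one byte string per group of rows, laid out as:
    block 0 of rows 0..group-1, then block 1 of rows 0..group-1, and so on.
    """
    if len(rows) % group != 0:
        raise ValueError("number of rows must be a multiple of the group size")

    out: List[bytes] = []
    for start in range(0, len(rows), group):
        chunk = rows[start:start + group]
        n_blocks = len(chunk[0])
        if any(len(row) != n_blocks for row in chunk):
            raise ValueError("all rows in a group must have the same number of blocks")
        # Outer loop over block position, inner loop over the rows of the group.
        out.append(b"".join(chunk[r][b] for b in range(n_blocks) for r in range(group)))
    return out


# Toy example: 4 rows with 2 "blocks" each (2 bytes per block).
toy_rows = [
    [b"a0", b"a1"],
    [b"b0", b"b1"],
    [b"c0", b"c1"],
    [b"d0", b"d1"],
]
interleaved = interleave_rows(toy_rows)
print(interleaved)  # [b'a0b0c0d0a1b1c1d1']
```

In this toy layout the matrix multiplication kernel would walk one interleaved buffer per group of four rows instead of four separate row buffers.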

We get very modest performance gains. I guess the combination of loading data from a large table, blocks of 16 quants, and perhaps my not having found the optimum bit packing kills the performance.

Anyway, here is PP-512 (in tokens per second) for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max), and AVX2 (Ryzen-5975WX):

| Platform | Threads | IQ2_S | IQ2_S_R4 | Speedup |
| :--- | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 44.68 ± 0.20 | 50.40 ± 0.18 | 1.128 |
| Zen4 | 16 | 117.47 ± 0.47 | 148.51 ± 0.51 | 1.264 |
| AVX2 | 32 | 150.92 ± 0.25 | 177.59 ± 0.40 | 1.177 |
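As a quick check of how the Speedup column can be read, the sketch below assumes it is the ratio of the IQ2_S_R4 throughput to the IQ2_S throughput, using the mean values from the PP-512 table (the ± spread is ignored). The helper name `speedup` is just for illustration.

```python
def speedup(baseline: float, interleaved: float) -> float:
    """Ratio of the interleaved throughput to the baseline throughput."""
    return interleaved / baseline


# Mean PP-512 values from the table: (IQ2_S, IQ2_S_R4).
pp512 = {
    "ARM_NEON": (44.68, 50.40),
    "Zen4": (117.47, 148.51),
    "AVX2": (150.92, 177.59),
}

for platform, (iq2_s, iq2_s_r4) in pp512.items():
    print(f"{platform}: {speedup(iq2_s, iq2_s_r4):.3f}")
# ARM_NEON: 1.128
# Zen4: 1.264
# AVX2: 1.177
```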

We get some performance gains for TG as well, especially on AVX2. Here are the results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | IQ2_S | IQ2_S_R4 | Speedup |
| :--- | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 4.30 ± 0.00 | 4.56 ± 0.01 | 1.084 |
| | 4 | 8.20 ± 0.03 | 8.64 ± 0.02 | 1.054 |
| | 8 | 15.07 ± 0.35 | 16.12 ± 0.17 | 1.070 |
| Zen4 | 2 | 5.31 ± 0.01 | 5.56 ± 0.0 | 1.047 |
| | 4 | 9.53 ± 0.29 | 10.52 ± 0.02 | 1.104 |
| | 8 | 17.80 ± 0.03 | 18.66 ± 0.05 | 1.048 |
| AVX2 | 2 | 2.60 ± 0.00 | 3.83 ± 0.0 | 1.473 |
| | 4 | 5.02 ± 0.00 | 7.40 ± 0.00 | 1.474 |
| | 8 | 9.69 ± 0.04 | 13.97 ± 0.03 | 1.442 |
| | 16 | 16.70 ± 0.00 | 19.52 ± 0.00 | 1.169 |
