ik_llama.cpp/145 - IQ3_K_R4.md at 993cb00a347fc77632b73126f614092d659727de - ik_llama.cpp

ikawrakow/ik_llama.cpp

Fork 0

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-01 20:19:52 +00:00

Files

Thomas eaa2510a28 Add GitHub data: filename sanitization (#640 )

2025-07-23 13:31:53 +02:00

1.7 KiB

Raw Blame History

🔀 #145 - IQ3_K_R4

Author	`ikawrakow`
State	❌ Closed
Created	2024-12-17
Updated	2024-12-17

Description

Adding IQ3_K with 4 interleaved rows.

We get very signifiant performance gains on ARM_NEON and more modest gains on AVX2/Zen4. Overall slower than other _R4 quants, which is expected as 3-bit quantization is always kind of slow.

Here is PP-512 for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max) and AVX2 (Ryzen-5975WX)

Platform	Threads	IQ3_K	IQ3_K_R4	Speedup
ARM_NEON	8	54.94 ± 0.79	93.83 ± 0.09	1.708
Zen4	16	180.13 ± 0.48	230.33 ± 0.13	1.279
AVX2	32	197.59 ± 0.43	253.36 ± 0.50	1.282

We get decent performance gains for TG as well. Here results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

Platform	Threads	IQ3_K	IQ3_K_R4	Speedup
ARM_NEON	2	5.84 ± 0.00	6.71 ± 0.05	1.149
	4	11.14 ± 0.00	12.83 ± 0.01	1.152
	8	20.59 ± 0.17	23.07 ± 0.16	1.120
Zen4	1	5.06 ± 0.00	5.64 ± 0.00	1.115
	2	9.58 ± 0.01	10.50 ± 0.01	1.096
	4	16.56 ± 0.05	16.77 ± 0.32	1.013
AVX2	2	4.45 ± 0.00	6.83 ± 0.00	1.535
	4	8.24 ± 0.00	12.51 ± 0.00	1.518
	8	14.59 ± 0.04	16.23 ± 0.00	1.112

1.7 KiB Raw Blame History

🔀 #145 - IQ3_K_R4

Description

1.7 KiB

Raw Blame History