ik_llama.cpp/138 - IQ4_K_R4.md at main - ik_llama.cpp

ikawrakow/ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-01-26 17:20:01 +00:00

Thomas eaa2510a28 Add GitHub data: filename sanitization (#640)

2025-07-23 13:31:53 +02:00

🔀 #138 - IQ4_K_R4

| Author | `ikawrakow` |
| :--- | :--- |
| State | ❌ Closed |
| Created | 2024-12-12 |
| Updated | 2024-12-12 |

Description

On to the R4 implementation of the new iqk quants.

First up is IQ4_K.

We get very significant performance gains on ARM_NEON and more modest gains on AVX2/Zen4. I suspect my AVX2/Zen4 implementation is not optimal, but I have not found a better approach so far.

Here is PP-512 for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max), and AVX2 (Ryzen-5975WX):

| Platform | Threads | IQ4_K | IQ4_K_R4 | Speedup |
| :--- | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 58.20 ± 1.03 | 108.02 ± 1.10 | 1.856 |
| Zen4 | 16 | 182.20 ± 0.38 | 232.63 ± 0.39 | 1.277 |
| AVX2 | 32 | 206.43 ± 0.49 | 227.60 ± 0.46 | 1.103 |
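
For readers who want to check the Speedup column, here is a minimal sketch of the arithmetic. It assumes the IQ4_K and IQ4_K_R4 columns report throughput (higher is better), so the speedup is simply the R4 value divided by the baseline. The names `pp512` and `speedup` are illustrative and not part of the repository.

```python
# PP-512 mean values copied from the table above; the ± uncertainties are ignored.
# Assumption: both columns are throughput figures, so a ratio > 1 means R4 is faster.
pp512 = {
    "ARM_NEON": (58.20, 108.02),   # (IQ4_K, IQ4_K_R4)
    "Zen4": (182.20, 232.63),
    "AVX2": (206.43, 227.60),
}


def speedup(baseline: float, r4: float) -> float:
    """Return the ratio of R4 throughput to baseline throughput."""
    return r4 / baseline


# Round to three decimals to match the precision used in the table.
speedups = {
    platform: round(speedup(base, r4), 3)
    for platform, (base, r4) in pp512.items()
}

for platform, value in speedups.items():
    print(f"{platform}: {value:.3f}")
```

The same ratio applies to the TG-128 results, using the per-thread-count rows instead of one row per platform.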

We get decent performance gains for TG as well. Here are the results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | Q2_K_S | Q2_K_R4 | Speedup |
| :--- | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 8.44 ± 0.02 | 10.56 ± 0.01 | 1.251 |
| | 4 | 15.90 ± 0.05 | 19.32 ± 0.14 | 1.215 |
| | 8 | 24.54 ± 0.15 | 25.16 ± 0.03 | 1.025 |
| Zen4 | 1 | 5.26 ± 0.00 | 6.73 ± 0.00 | 1.279 |
| | 2 | 9.71 ± 0.01 | 12.43 ± 0.00 | 1.269 |
| | 4 | 13.48 ± 0.06 | 14.00 ± 0.03 | 1.039 |
| AVX2 | 2 | 4.02 ± 0.00 | 6.91 ± 0.00 | 1.719 |
| | 4 | 8.03 ± 0.00 | 11.13 ± 0.00 | 1.386 |
| | 8 | 11.81 ± 0.00 | 12.75 ± 0.00 | 1.079 |

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High
