ik_llama.cpp/149 - IQ5_K_R4.md at 993cb00a347fc77632b73126f614092d659727de - ik_llama.cpp

ikawrakow/ik_llama.cpp

Fork 0

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-01 20:19:52 +00:00

Files

Thomas eaa2510a28 Add GitHub data: filename sanitization (#640 )

2025-07-23 13:31:53 +02:00

1.4 KiB

Raw Blame History

🔀 #149 - IQ5_K_R4

Author	`ikawrakow`
State	❌ Closed
Created	2024-12-18
Updated	2025-03-27

Description

Adding IQ5_K with 4 interleaved rows.

We get very signifiant performance gains on ARM_NEON and more modest gains on AVX2/Zen4.

Here is PP-512 for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max) and AVX2 (Ryzen-5975WX)

Platform	Threads	IQ5_K	IQ5_K_R4	Speedup
ARM_NEON	8	53.80 ± 1.08	93.33 ± 2.02	1.735
Zen4	16	168.09 ± 0.58	230.23 ± 0.23	1.370
AVX2	32	177.16 ± 0.31	231.50 ± 0.43	1.307

TG does not look good on AVX2/Zen4. On ARM_NEON we get a decent performance gain. Here results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

Platform	Threads	IQ5_K	IQ5_K_R4	Speedup
ARM_NEON	2	5.92 ± 0.07	6.98 ± 0.00	1.179
	4	11.53 ± 0.01	13.35 ± 0.01	1.158
	8	20.29 ± 0.46	21.17 ± 0.18	1.043

💬 Conversation

👤 saood06 commented the 2025-03-27 at 06:53:47:

TG does not look good on AVX2/Zen4

Does this mean regression compared to non-interleaved or just no benefit?

1.4 KiB Raw Blame History

🔀 #149 - IQ5_K_R4

Description

💬 Conversation

1.4 KiB

Raw Blame History