🔀 #132 - Q5_K_R4

Author ikawrakow
State Closed
Created 2024-12-10
Updated 2024-12-10

Description

Follow up of #118, #119, #120, #121, #122, #123, #129, #130 for Q5_K.
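As in those PRs, the gain comes from repacking the quantized blocks of 4 consecutive rows into an interleaved order, so the matrix-multiplication kernels always find the corresponding block of all 4 rows in adjacent memory. A minimal sketch of the idea (names are hypothetical; this is not the actual Q5_K_R4 block layout or the ik_llama.cpp repacking code, which also rearranges the quants and scales inside each block):

```cpp
// Illustration only: generic 4-row block interleaving.
#include <cstddef>
#include <vector>

template <typename Block>
std::vector<Block> interleave_4_rows(const Block * rows, size_t blocks_per_row) {
    // 'rows' holds 4 * blocks_per_row quantized blocks, stored row after row.
    std::vector<Block> out(4 * blocks_per_row);
    for (size_t ib = 0; ib < blocks_per_row; ++ib) {
        for (size_t r = 0; r < 4; ++r) {
            // block ib of row r lands next to block ib of the other 3 rows
            out[4*ib + r] = rows[r*blocks_per_row + ib];
        }
    }
    return out;
}
```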

We get a large speedup on ARM_NEON and non-negligible gains on AVX2/Zen4. Here is PP-512 for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max) and AVX2 (Ryzen-5975WX):

| Platform | Threads | Q5_K | Q5_K_R4 | Speedup |
| --- | --- | --- | --- | --- |
| ARM_NEON | 8 | 61.07 ± 0.95 | 96.13 ± 2.38 | 1.574 |
| Zen4 | 16 | 188.73 ± 0.75 | 248.30 ± 0.29 | 1.316 |
| AVX2 | 32 | 188.11 ± 0.29 | 269.18 ± 0.40 | 1.431 |

On AVX2/Zen4 we gain even for TG. Here are results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | Q5_K | Q5_K_R4 | Speedup |
| --- | --- | --- | --- | --- |
| Zen4 | 1 | 5.12 ± 0.00 | 7.07 ± 0.01 | 1.380 |
| Zen4 | 2 | 9.31 ± 0.00 | 11.54 ± 0.0 | 1.240 |
| Zen4 | 4 | 11.33 ± 0.37 | 11.89 ± 0.00 | 1.049 |
| AVX2 | 2 | 4.04 ± 0.00 | 6.40 ± 0.00 | 1.584 |
| AVX2 | 4 | 7.57 ± 0.00 | 9.95 ± 0.00 | 1.314 |
| AVX2 | 8 | 9.75 ± 0.00 | 11.00 ± 0.00 | 1.128 |

I decided to check the current state of mainline llama.cpp for Q5_K_S.

Hahaha - here is what we get on my M2-Max (build: 7736837d (4274))

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 8 | pp512 | 27.69 ± 0.09 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 2 | tg128 | 6.39 ± 0.01 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 4 | tg128 | 12.18 ± 0.02 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 8 | tg128 | 19.68 ± 0.64 |

The performance gap to mainline in prompt processing for Q5_K has now grown to 3.5X, and mainline is ~30% slower for TG with 2 threads.
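The 3.5X figure follows from the ARM_NEON pp512 numbers (my arithmetic, using the PP-512 table above):

```
96.13 / 27.69 ≈ 3.47
```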

Here is what I get on my Ryzen-7950X (build: 26a8406b (4295))

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 16 | pp512 | 75.88 ± 0.26 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 1 | tg128 | 4.10 ± 0.00 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 2 | tg128 | 7.66 ± 0.01 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 4 | tg128 | 11.26 ± 0.00 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 8 | tg128 | 11.20 ± 0.22 |

Mainline is 3.26X slower for prompt processing, and 72%/51% slower for TG at 1/2 threads.
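For reference, relating these figures to the Zen4 numbers in the tables above (my arithmetic):

```
pp512:             248.30 / 75.88 ≈ 3.3
tg128, 1 thread:     7.07 /  4.10 ≈ 1.72
tg128, 2 threads:   11.54 /  7.66 ≈ 1.51
```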