ik_llama.cpp/130 - Q6_K_R4.md at 69fdd041c1ccd0e307bdc3b2264583b54f6dfea2 - ik_llama.cpp

ikawrakow/ik_llama.cpp

Fork 0

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-01-26 17:20:01 +00:00

Files

Thomas eaa2510a28 Add GitHub data: filename sanitization (#640 )

2025-07-23 13:31:53 +02:00

1.9 KiB

Raw Blame History

🔀 #130 - Q6_K_R4

Author	`ikawrakow`
State	❌ Closed
Created	2024-12-10
Updated	2024-12-10

Description

Follow up of #118, #119, #120, #121, #122, #123, #129 for Q6_K.

If nothing else Q6_K is routinely used for the output tensor, so having a better Q6_K performance would be useful.

We get a large speedup on ARM_NEON and non-negligible gains on AVX2/Zen4. Here is PP-512 for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max) and AVX2 (Ryzen-5975WX)

Platform	Threads	Q6_K	Q6_K_R4	Speedup
ARM_NEON	8	57.57 ± 0.61	83.25 ± 0.81	1.446
Zen4	16	195.20 ± 0.74	243.25 ± 0.31	1.246
AVX2	32	194.51 ± 0.35	264.16 ± 0.44	1.358

Except on ARM_NEON, where TG performance is slightly lower for small numbers of threads, we gain even for TG. Here results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

Platform	Threads	Q6_K	Q6_K_R4	Speedup
ARM_NEON	2	7.46 ± 0.03	7.35 ± 0.01	0.985
	4	13.88 ± 0.02	13.80 ± 0.01	0.994
	8	18.31 ± 0.16	18.57 ± 0.14	1.014
Zen4	1	5.38 ± 0.00	7.94 ± 0.00	1.476
	2	8.93 ± 0.00	10.38 ± 0.00	1.162
	4	9.97 ± 0.27	10.18 ± 0.01	1.021
AVX2	2	4.75 ± 0.00	5.78 ± 0.01	1.217
	4	7.57 ± 0.00	8.47 ± 0.00	1.119
	8	8.23 ± 0.00	9.14 ± 0.00	1.111

With this Zen4 implementation, for TG the available memory bandwidth is fully saturated with just 2 threads!

1.9 KiB Raw Blame History

🔀 #130 - Q6_K_R4

Description

1.9 KiB

Raw Blame History