ikawrakow/ik_llama.cpp

🔀 #136 - Q2_K_R4

| Author | `ikawrakow` |
| :--- | :--- |
| State | ❌ Closed |
| Created | 2024-12-11 |
| Updated | 2024-12-11 |

Description

Follow-up to #118, #119, #120, #121, #122, #123, #129, #130, #132, #134 for Q2_K.

This completes the R4 implementation for k-quants on ARM_NEON, AVX2, and Zen4.

We get significant performance gains on all platforms. Here is PP-512 (throughput in tokens per second) for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max), and AVX2 (Ryzen-5975WX):

| Platform | Threads | Q2_K_S | Q2_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 73.79 ± 1.92 | 109.07 ± 0.58 | 1.478 |
| Zen4 | 16 | 205.95 ± 0.77 | 256.19 ± 0.26 | 1.244 |
| AVX2 | 32 | 214.42 ± 0.54 | 286.91 ± 0.63 | 1.338 |
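
The Speedup column can be reproduced from the two throughput columns. The following is a minimal sketch, assuming that Speedup is the ratio of the Q2_K_R4 mean to the Q2_K_S mean and that the ± spreads are ignored; the names `pp512` and `speedup` are illustrative and do not come from the repository.

```python
# PP-512 mean throughput (tokens/s) copied from the table above.
# Each entry is (Q2_K_S, Q2_K_R4); the ± spreads are ignored.
pp512 = {
    "ARM_NEON": (73.79, 109.07),
    "Zen4": (205.95, 256.19),
    "AVX2": (214.42, 286.91),
}


def speedup(baseline: float, candidate: float) -> float:
    """Ratio of candidate throughput to baseline throughput."""
    return candidate / baseline


# Round to three decimals, matching the precision used in the table.
pp_speedups = {
    platform: round(speedup(base, r4), 3)
    for platform, (base, r4) in pp512.items()
}

for platform, value in pp_speedups.items():
    print(f"{platform}: {value:.3f}")
```

Under that assumption, the computed ratios agree with the table to three decimals.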

Because Q2_K is smaller than the other k-quants, the CPU can do more work before the available memory bandwidth saturates when running TG. Hence, we also get non-negligible TG performance gains on all platforms. Here are the results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | Q2_K_S | Q2_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 10.34 ± 0.01 | 12.81 ± 0.01 | 1.239 |
| | 4 | 19.32 ± 0.02 | 23.40 ± 0.08 | 1.211 |
| | 8 | 32.36 ± 0.59 | 36.02 ± 0.40 | 1.113 |
| Zen4 | 1 | 6.60 ± 0.02 | 9.08 ± 0.12 | 1.376 |
| | 2 | 12.12 ± 0.01 | 16.40 ± 0.00 | 1.353 |
| | 4 | 19.12 ± 0.56 | 20.72 ± 0.19 | 1.084 |
| AVX2 | 2 | 5.93 ± 0.02 | 10.16 ± 0.30 | 1.713 |
| | 4 | 11.24 ± 0.00 | 17.59 ± 0.01 | 1.565 |
| | 8 | 18.62 ± 0.03 | 21.44 ± 0.00 | 1.151 |
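
One way to read the TG table against the bandwidth argument is to divide each throughput by its thread count. The sketch below does this using only the mean values from the table; the helper `scaling_rows` and the dictionary `tg128` are illustrative names, and interpreting a falling per-thread rate as a sign of bandwidth saturation is an assumption, not something measured here.

```python
# TG-128 mean throughput (tokens/s) from the table above.
# Layout: platform -> thread count -> (Q2_K_S, Q2_K_R4); ± spreads are ignored.
tg128 = {
    "ARM_NEON": {2: (10.34, 12.81), 4: (19.32, 23.40), 8: (32.36, 36.02)},
    "Zen4": {1: (6.60, 9.08), 2: (12.12, 16.40), 4: (19.12, 20.72)},
    "AVX2": {2: (5.93, 10.16), 4: (11.24, 17.59), 8: (18.62, 21.44)},
}


def scaling_rows(platform: str) -> list[tuple[int, float, float, float]]:
    """Return (threads, Q2_K_S per thread, Q2_K_R4 per thread, speedup) rows."""
    rows = []
    for threads, (base, r4) in sorted(tg128[platform].items()):
        rows.append((
            threads,
            round(base / threads, 2),  # Q2_K_S tokens/s per thread
            round(r4 / threads, 2),    # Q2_K_R4 tokens/s per thread
            round(r4 / base, 3),       # same ratio as the Speedup column
        ))
    return rows


for platform in tg128:
    print(platform)
    for threads, base_pt, r4_pt, ratio in scaling_rows(platform):
        print(f"  {threads:>2} threads: {base_pt:5.2f} vs {r4_pt:5.2f} per thread, x{ratio:.3f}")
```

On every platform in the table, the speedup shrinks as the thread count grows, which is what one would expect if both variants approach the same memory-bandwidth ceiling.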

It is actually too bad that Q2_K is such a low-quality quantization, as its performance is really good. Perhaps I should try to improve it? When I developed it, it was much better than any other 2-bit attempt at the time, so I was quite pleased with the result. But with today's knowledge that we can do much better at 2 bpw, perhaps a fresh look could be useful.
