mirror of https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-04-28 10:21:48 +00:00
3.4 KiB
🔀 #139 - Faster R4 quants on Zen4
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2024-12-13 |
| Updated | 2024-12-13 |
Description
Use integer accumulators for dot products within superblocks. I did not use this originally because, according to this Intel reference, the _mm256_mullo_epi32() instruction has an extremely high latency. But given that on ARM_NEON the use of integer dot product accumulation resulted in a significant performance boost (see #135), I decided to try anyway. Outcome: it is faster, despite the high latency of the integer multiplication.
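The idea can be illustrated with a minimal scalar sketch (this is hypothetical illustration code, not the actual vectorized ik_llama.cpp kernel): products of quantized values are accumulated in an `int32_t` within each block, and the float scale is applied once per block, instead of converting every partial result to float.

```c
#include <stdint.h>

/* Hypothetical sketch: integer accumulation of quantized dot products.
 * x, y: quantized values; scales: per-block float scales.
 * The int32 accumulator cannot overflow for realistic block sizes,
 * so only one int->float conversion and multiply is needed per block. */
float dot_blocks(const int8_t *x, const int8_t *y, const float *scales,
                 int n_blocks, int block_size) {
    float sum = 0.0f;
    for (int b = 0; b < n_blocks; ++b) {
        int32_t isum = 0;                          /* integer accumulator */
        for (int i = 0; i < block_size; ++i)
            isum += (int32_t)x[b*block_size + i] * (int32_t)y[b*block_size + i];
        sum += scales[b] * (float)isum;            /* one float op per block */
    }
    return sum;
}
```

In the vectorized Zen4 implementation the inner multiply-accumulate maps onto integer SIMD instructions such as _mm256_mullo_epi32(); the per-lane latency is hidden by processing multiple independent accumulators, which is why the change wins overall despite the slow multiply.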
Here are PP-512 and TG-128 measurements for LLaMA-3.1-8B on Zen4 (Ryzen-7950X CPU):
| Quant | Threads | Task | t/s (main) | t/s (PR) | Speedup |
|---|---|---|---|---|---|
| Q2_K_R4 | 16 | pp512 | 256.19 ± 0.26 | 272.69 ± 0.13 | 1.064 |
| | 1 | tg128 | 9.08 ± 0.12 | 9.95 ± 0.0 | 1.096 |
| | 2 | tg128 | 16.40 ± 0.00 | 17.44 ± 0.01 | 1.063 |
| | 4 | tg128 | 20.72 ± 0.12 | 20.97 ± 0.08 | 1.012 |
| Q3_K_R4 | 16 | pp512 | 236.77 ± 0.35 | 255.84 ± 0.20 | 1.081 |
| | 1 | tg128 | 6.78 ± 0.00 | 7.16 ± 0.07 | 1.056 |
| | 2 | tg128 | 12.46 ± 0.00 | 13.00 ± 0.01 | 1.043 |
| | 4 | tg128 | 17.02 ± 0.09 | 17.20 ± 0.24 | 1.012 |
| Q4_K_R4 | 16 | pp512 | 262.40 ± 0.28 | 268.09 ± 0.12 | 1.022 |
| IQ4_XS_R4 | 16 | pp512 | 256.80 ± 0.35 | 271.95 ± 0.39 | 1.059 |
| Q5_K_R4 | 16 | pp512 | 248.30 ± 0.29 | 256.68 ± 0.31 | 1.034 |
| Q6_K_R4 | 16 | pp512 | 243.25 ± 0.31 | 261.33 ± 0.38 | 1.074 |
| | 1 | tg128 | 7.94 ± 0.00 | 8.34 ± 0.00 | 1.050 |
| | 2 | tg128 | 10.38 ± 0.00 | 10.38 ± 0.00 | 1.000 |
For Q4_K_R4, Q5_K_R4 and IQ4_XS_R4, matrix-vector multiplications use a different implementation to which this change does not apply, so there are no TG results for those.