mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-04-30 11:21:56 +00:00
1.7 KiB
🔀 #135 - Better ARM_NEON implementation for R4 quants
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2024-12-11 |
| Updated | 2024-12-11 |
Description
We get improved performance for IQ4_XS_R4, Q4_K_R4, Q5_K_R4, and Q6_K_R4. The trick was to accumulate the super-block sums in int32_t, thus avoiding expensive int -> float conversions.
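The PR itself uses ARM NEON intrinsics; as a rough scalar sketch of the idea (the struct layout and names below are invented for illustration, not the actual ik_llama.cpp types), the change amounts to keeping the per-block scale multiplies in the integer domain and converting to float once per super-block instead of once per block:

```c
#include <stdint.h>

enum { QK = 32, NBLOCKS = 8 };  /* 8 blocks of 32 weights = one 256-weight super-block */

/* Hypothetical super-block layout: one float scale for the whole super-block
 * plus small integer per-block scales, as in the k-quants family. */
typedef struct {
    float  d;                  /* super-block scale */
    int8_t scales[NBLOCKS];    /* per-block integer scales */
    int8_t qs[QK * NBLOCKS];   /* quantized weights */
} superblock_t;

/* Baseline: convert each block's integer dot product to float immediately. */
static float dot_per_block_float(const superblock_t *x, const int8_t *y) {
    float sum = 0.0f;
    for (int b = 0; b < NBLOCKS; ++b) {
        int32_t s = 0;
        for (int j = 0; j < QK; ++j)
            s += (int32_t)x->qs[b*QK + j] * (int32_t)y[b*QK + j];
        sum += x->d * (float)x->scales[b] * (float)s;  /* int->float every block */
    }
    return sum;
}

/* Optimized: apply the per-block scales in int32_t and accumulate the whole
 * super-block as an integer; a single int->float conversion happens at the
 * end. On NEON this keeps the hot loop in integer multiply-accumulate
 * instructions instead of interleaving float converts. */
static float dot_int32_accum(const superblock_t *x, const int8_t *y) {
    int32_t acc = 0;
    for (int b = 0; b < NBLOCKS; ++b) {
        int32_t s = 0;
        for (int j = 0; j < QK; ++j)
            s += (int32_t)x->qs[b*QK + j] * (int32_t)y[b*QK + j];
        acc += (int32_t)x->scales[b] * s;              /* stays in integer domain */
    }
    return x->d * (float)acc;                          /* single conversion */
}
```

With 8-bit values, 32-weight blocks, and 8 blocks per super-block, the integer accumulator is bounded well below INT32_MAX, so the integer path loses no information relative to the per-block float path.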
Here are performance comparisons for LLaMA-3.1-8B on an M2-Max between the previous implementation (main) and this PR:
| Quant | Task | Threads | t/s (main) | t/s (PR) | Speedup |
|---|---|---|---|---|---|
| IQ4_XS_R4 | pp512 | 8 | 115.43 ± 0.57 | 131.28 ± 0.51 | 1.137 |
| IQ4_XS_R4 | tg128 | 2 | 12.71 ± 0.01 | 13.44 ± 0.01 | 1.057 |
| IQ4_XS_R4 | tg128 | 4 | 22.35 ± 0.17 | 22.98 ± 0.05 | 1.028 |
| Q4_K_R4 | pp512 | 8 | 110.02 ± 1.31 | 122.12 ± 1.28 | 1.110 |
| Q4_K_R4 | tg128 | 2 | 12.17 ± 0.01 | 13.72 ± 0.01 | 1.127 |
| Q4_K_R4 | tg128 | 4 | 21.56 ± 0.06 | 22.46 ± 0.20 | 1.042 |
| Q5_K_R4 | pp512 | 8 | 96.90 ± 0.79 | 108.66 ± 0.27 | 1.121 |
| Q5_K_R4 | tg128 | 2 | 8.22 ± 0.01 | 8.66 ± 0.01 | 1.054 |
| Q5_K_R4 | tg128 | 4 | 15.54 ± 0.09 | 16.13 ± 0.05 | 1.038 |
| Q6_K_R4 | pp512 | 8 | 83.25 ± 0.81 | 104.19 ± 1.96 | 1.252 |
| Q6_K_R4 | tg128 | 2 | 7.35 ± 0.01 | 8.05 ± 0.00 | 1.095 |
| Q6_K_R4 | tg128 | 4 | 13.80 ± 0.01 | 14.92 ± 0.03 | 1.081 |
TG results are shown only up to 4 threads because at 8 threads the computation is 100% memory bound, so the results are the same within noise.