
🔀 #121 - Q5_0_R4

Author ikawrakow
State Closed
Created 2024-12-03
Updated 2024-12-03

Description

Follow-up to #118, #119, and #120, this time for Q5_0.
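For readers landing here without the earlier PRs: the `_R4` variants repack four consecutive rows so that the quantized blocks at the same column position are stored interleaved, letting one kernel pass produce four row results. Below is a minimal sketch of that idea; the names `block_q5_0_x4` and `repack_q5_0_r4` and the exact byte interleaving are illustrative, not the actual ik_llama.cpp definitions, and the scale is kept as `float` (fp16 in the real code) to keep the sketch self-contained.

```c
#include <stdint.h>

#define QK5_0 32   // weights per Q5_0 block

// Standard Q5_0 block (scale is fp16 in ggml; float here for brevity).
typedef struct {
    float   d;              // per-block scale
    uint8_t qh[4];          // 5th (high) bit of each of the 32 weights
    uint8_t qs[QK5_0 / 2];  // low 4 bits, two weights per byte
} block_q5_0;

// Hypothetical interleaved block holding the same column-block of 4 rows.
typedef struct {
    float   d[4];               // scales of the 4 rows
    uint8_t qh[4 * 4];          // high bits, byte-interleaved across rows
    uint8_t qs[4 * QK5_0 / 2];  // low nibbles, byte-interleaved across rows
} block_q5_0_x4;

// Repack 4 rows of nblock blocks each: for every column position j, gather
// the j-th block of each row and interleave the bytes row by row.
static void repack_q5_0_r4(const block_q5_0 *rows[4], int nblock,
                           block_q5_0_x4 *dst) {
    for (int j = 0; j < nblock; ++j) {
        for (int r = 0; r < 4; ++r) {
            const block_q5_0 *b = rows[r] + j;
            dst[j].d[r] = b->d;
            for (int k = 0; k < 4; ++k)       dst[j].qh[4*k + r] = b->qh[k];
            for (int k = 0; k < QK5_0/2; ++k) dst[j].qs[4*k + r] = b->qs[k];
        }
    }
}
```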

Here is PP-512 for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max), and AVX2 (Ryzen-5975WX):

| Platform | Threads | Q5_0 (t/s) | Q5_0_R4 (t/s) | Speedup |
| --- | --- | --- | --- | --- |
| ARM_NEON | 8 | 71.04 ± 0.83 | 99.59 ± 1.06 | 1.402 |
| Zen4 | 16 | 157.46 ± 0.50 | 256.70 ± 0.42 | 1.630 |
| AVX2 | 32 | 171.99 ± 0.50 | 236.33 ± 0.56 | 1.374 |

Here I see a benefit even for TG. E.g., on the Ryzen-7950X I get the following for TG-128:

| Threads | Q5_0 (t/s) | Q5_0_R4 (t/s) | Speedup |
| --- | --- | --- | --- |
| 2 | 9.06 ± 0.00 | 9.87 ± 0.00 | 1.089 |
| 4 | 11.06 ± 0.15 | 11.73 ± 0.00 | 1.061 |
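The TG path is matrix-vector multiplication, and the interleaved layout lets a single sweep over the activation vector update four row sums at once, which is plausibly where this modest TG gain comes from. Here is a scalar sketch continuing the hypothetical `block_q5_0_x4` layout from above (the bit decode matches that illustrative packing, not necessarily the actual kernel):

```c
// Dot 4 interleaved rows with activation vector y; each activation value
// y[i] is loaded once and reused for all 4 row sums.
static void vec_dot_q5_0_x4(int nblock, const block_q5_0_x4 *x,
                            const float *y, float sums[4]) {
    for (int j = 0; j < nblock; ++j) {
        for (int r = 0; r < 4; ++r) {
            float s = 0.0f;
            for (int k = 0; k < QK5_0; ++k) {
                // low nibble: weights 0..15 in low halves, 16..31 in high halves
                int lo = (x[j].qs[4*(k % 16) + r] >> (4 * (k / 16))) & 0xF;
                // high (5th) bit: bit k of the row's 32-bit qh field
                int hi = (x[j].qh[4*(k / 8) + r] >> (k % 8)) & 1;
                s += (float)((lo | (hi << 4)) - 16) * y[QK5_0*j + k];
            }
            sums[r] += s * x[j].d[r];  // apply the per-block scale
        }
    }
}
```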

It is worth comparing Q5_0_R4 to mainline llama.cpp (build: 3420909d (4234)) on the M2-Max:

| Task | Threads | t/s (mainline) | t/s (PR) | Speedup |
| --- | --- | --- | --- | --- |
| pp512 | 8 | 26.49 ± 0.61 | 99.59 ± 1.06 | 3.758 |
| tg128 | 2 | 6.38 ± 0.01 | 8.75 ± 0.01 | 1.371 |
| tg128 | 4 | 12.27 ± 0.10 | 16.46 ± 0.08 | 1.341 |
| tg128 | 8 | 20.60 ± 0.14 | 22.07 ± 0.32 | 1.071 |