Mirror of https://github.com/ikawrakow/ik_llama.cpp.git (synced 2026-01-26 17:20:01 +00:00)
🔀 #119 - Q4_0_R4
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2024-12-02 |
| Updated | 2024-12-02 |
Description
Q4_0 repacked with 4 interleaved rows, in the same way as IQ4_NL_X4 (see PR #118).
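For readers unfamiliar with the repacking idea, here is a rough sketch, assuming the standard Q4_0 block layout (one fp16 scale plus 16 bytes holding 32 4-bit quants). The names `block_q4_0_r4` and `repack_q4_0_r4` and the exact byte interleave are illustrative guesses, not the PR's actual code; the point is only that quants from 4 consecutive rows end up adjacent in memory so a GEMM kernel can process 4 rows per pass.

```cpp
#include <cstdint>
#include <cstring>

typedef uint16_t ggml_half;          // fp16 stored as raw bits

struct block_q4_0 {                  // standard Q4_0 block: 32 weights
    ggml_half d;                     // scale
    uint8_t   qs[16];                // 32 x 4-bit quants, two per byte
};

struct block_q4_0_r4 {               // one interleaved block covering 4 rows
    ggml_half d[4];                  // one scale per source row
    uint8_t   qs[64];                // the 4 rows' quants, interleaved
};

// Repack nblk blocks from each of 4 consecutive rows into r4 blocks.
// The 4-byte chunk interleave below is one plausible layout choice,
// chosen so that a SIMD kernel sees the 4 rows in consecutive loads.
static void repack_q4_0_r4(const block_q4_0 * rows[4],
                           block_q4_0_r4 * dst, int nblk) {
    for (int ib = 0; ib < nblk; ++ib) {
        for (int r = 0; r < 4; ++r) {
            dst[ib].d[r] = rows[r][ib].d;
            for (int chunk = 0; chunk < 4; ++chunk) {
                std::memcpy(dst[ib].qs + 16*chunk + 4*r,
                            rows[r][ib].qs + 4*chunk, 4);
            }
        }
    }
}
```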
PP-512 (prompt processing, 512-token prompts) for LLaMA-3.1-8B on ARM_NEON (M2-Max), Zen4 (Ryzen-7950X) and AVX2 (Ryzen-5975WX):
| Platform | Threads | Q4_0 (t/s) | Q4_0_R4 (t/s) | Speedup |
|---|---|---|---|---|
| ARM_NEON | 8 | 84.57 ± 0.94 | 115.79 ± 0.86 | 1.369 |
| Zen4 | 16 | 185.89 ± 0.84 | 278.15 ± 0.39 | 1.496 |
| AVX2 | 32 | 190.73 ± 0.39 | 251.00 ± 0.51 | 1.316 |
On Zen4, Q4_0_R4 is now the prompt processing champion.
For comparison, the hand-written assembly for Q4_0_4_4 in mainline llama.cpp achieves 122.8 t/s on my M2-Max, so it beats Q4_0_R4 by a small margin. My guess is that Q4_0_4_4 is slightly faster because there the 0x88 XOR mask (which converts the unsigned 4-bit quants to signed 4-bit quants shifted 4 bits to the left) is already applied at repack time. But this trick only pays off with the ARM instruction set and is of no use on x86_64, so I did not use it.
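To make the mask concrete, here is a minimal, self-contained check (an illustration, not code from this PR): XOR-ing a byte that packs two unsigned Q4_0 nibbles with 0x88 flips bit 3 of each nibble, which in 4-bit two's complement is the same as subtracting Q4_0's implicit offset of 8, and the high nibble then already sits 4 bits to the left.

```cpp
#include <cassert>
#include <cstdint>

int main() {
    for (int lo = 0; lo < 16; ++lo) {
        for (int hi = 0; hi < 16; ++hi) {
            const uint8_t packed = uint8_t(lo | (hi << 4)); // two unsigned quants
            const uint8_t x      = packed ^ 0x88;           // the mask from the text

            // low nibble: shift up, then arithmetic-shift back to sign-extend
            const int8_t lo_signed = int8_t((x & 0x0f) << 4) >> 4;
            // high nibble: already "shifted 4 bits to the left"; one
            // arithmetic shift recovers the signed value
            const int8_t hi_signed = int8_t(x & 0xf0) >> 4;

            assert(lo_signed == lo - 8);  // matches Q4_0's q - 8 convention
            assert(hi_signed == hi - 8);
        }
    }
    return 0;
}
```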