Files
ik_llama.cpp/github-data/pull_requests/119 - Q4_0_R4.md
2025-07-23 13:31:53 +02:00

1.2 KiB

🔀 #119 - Q4_0_R4

Author ikawrakow
State Closed
Created 2024-12-02
Updated 2024-12-02

Description

Q4_0 repacked with 4 interleaved rows as IQ4_NL_X4 (see PR #118).

PP-512 for LLaMA-3.1-8B for ARM_NEON (M2-Max), Zen4 (Ryzen-7950X) and AVX2 (Risen-5975WX):

Platform Threads Q4_0 Q4_0_R4 Speedup
ARM_NEON 8 84.57 ± 0.94 115.79 ± 0.86 1.369
Zen4 16 185.89 ± 0.84 278.15 ± 0.39 1.496
AVX2. 32 190.73 ± 0.39 251.00 ± 0.51 1.316

On Zen4 Q4_0_R4 is now the prompt processing champion.

Here the hand-written assembly for Q4_0_4_4 in mainline llama.cpp achieves 122.8 t/s on my M2-Max, so beats Q4_0_R4 by a small margin. My guess is that Q4_0_4_4 is slightly better because there the 0x88 xor mask (which converts the unsigned 4-bit quants to signed 4-bit quants shifted 4 bits to the left) is already applied. But this trick is only useful for the ARM instruction set, and is absolutely not useful on x86_64, so I did not use it.