Mirror of https://github.com/ikawrakow/ik_llama.cpp.git (synced 2026-01-26 17:20:01 +00:00)
🔀 #119 - Q4_0_R4
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2024-12-02 |
| Updated | 2024-12-02 |
Description
Q4_0 repacked with 4 interleaved rows, in the same way as IQ4_NL_X4 (see PR #118).
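For readers unfamiliar with the repacking idea, here is a rough sketch, assuming the standard Q4_0 block layout (one fp16 scale plus 16 bytes holding 32 4-bit quants). The names `block_q4_0_r4` and `repack_q4_0_r4` and the exact byte interleave are illustrative guesses, not the PR's actual code; the point is only that quants from 4 consecutive rows end up adjacent in memory so a GEMM kernel can process 4 rows per pass.

```cpp
#include <cstdint>
#include <cstring>

typedef uint16_t ggml_half;          // fp16 stored as raw bits

struct block_q4_0 {                  // standard Q4_0 block: 32 weights
    ggml_half d;                     // scale
    uint8_t   qs[16];                // 32 x 4-bit quants, two per byte
};

struct block_q4_0_r4 {               // one interleaved block covering 4 rows
    ggml_half d[4];                  // one scale per source row
    uint8_t   qs[64];                // the 4 rows' quants, interleaved
};

// Repack nblk blocks from each of 4 consecutive rows into r4 blocks.
// The 4-byte chunk interleave below is one plausible layout choice,
// chosen so that a SIMD kernel sees the 4 rows in consecutive loads.
static void repack_q4_0_r4(const block_q4_0 * rows[4],
                           block_q4_0_r4 * dst, int nblk) {
    for (int ib = 0; ib < nblk; ++ib) {
        for (int r = 0; r < 4; ++r) {
            dst[ib].d[r] = rows[r][ib].d;
            for (int chunk = 0; chunk < 4; ++chunk) {
                std::memcpy(dst[ib].qs + 16*chunk + 4*r,
                            rows[r][ib].qs + 4*chunk, 4);
            }
        }
    }
}
```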
PP-512 (prompt processing, 512-token prompts) for LLaMA-3.1-8B on ARM_NEON (M2-Max), Zen4 (Ryzen-7950X) and AVX2 (Ryzen-5975WX):
| Platform | Threads | Q4_0 (t/s) | Q4_0_R4 (t/s) | Speedup |
|---|---|---|---|---|
| ARM_NEON | 8 | 84.57 ± 0.94 | 115.79 ± 0.86 | 1.369 |
| Zen4 | 16 | 185.89 ± 0.84 | 278.15 ± 0.39 | 1.496 |
| AVX2 | 32 | 190.73 ± 0.39 | 251.00 ± 0.51 | 1.316 |
On Zen4, Q4_0_R4 is now the prompt processing champion.
For comparison, the hand-written assembly for Q4_0_4_4 in mainline llama.cpp achieves 122.8 t/s on my M2-Max, so it beats Q4_0_R4 by a small margin. My guess is that Q4_0_4_4 is slightly faster because there the 0x88 XOR mask (which converts the unsigned 4-bit quants to signed 4-bit quants shifted 4 bits to the left) is already applied at repack time. But this trick only pays off with the ARM instruction set and is of no use on x86_64, so I did not use it.
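To make the mask concrete, here is a minimal, self-contained check (an illustration, not code from this PR): XOR-ing a byte that packs two unsigned Q4_0 nibbles with 0x88 flips bit 3 of each nibble, which in 4-bit two's complement is the same as subtracting Q4_0's implicit offset of 8, and the high nibble then already sits 4 bits to the left.

```cpp
#include <cassert>
#include <cstdint>

int main() {
    for (int lo = 0; lo < 16; ++lo) {
        for (int hi = 0; hi < 16; ++hi) {
            const uint8_t packed = uint8_t(lo | (hi << 4)); // two unsigned quants
            const uint8_t x      = packed ^ 0x88;           // the mask from the text

            // low nibble: shift up, then arithmetic-shift back to sign-extend
            const int8_t lo_signed = int8_t((x & 0x0f) << 4) >> 4;
            // high nibble: already "shifted 4 bits to the left"; one
            // arithmetic shift recovers the signed value
            const int8_t hi_signed = int8_t(x & 0xf0) >> 4;

            assert(lo_signed == lo - 8);  // matches Q4_0's q - 8 convention
            assert(hi_signed == hi - 8);
        }
    }
    return 0;
}
```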