🔀 #118 - IQ4_NL_X4
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2024-12-02 |
| Updated | 2024-12-02 |
Description
Mainline llama.cpp has added several types where Q4_0 or IQ4_NL is repacked by interleaving the quants of 4 or 8 consecutive rows. This gives a significant improvement in prompt processing speed on ARM, so I decided to see whether interleaved rows can further improve the iqk_mul_mat matrix-matrix multiplication speed.
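For illustration, here is a minimal sketch of what such a repacking could look like, assuming an IQ4_NL-style block of 32 4-bit quants with an fp16 scale. The names (`Block`, `BlockX4`, `repack_x4`) and the 4-byte interleaving granularity are illustrative assumptions, not the actual layout used in this PR.

```cpp
#include <cstdint>
#include <cstring>

constexpr int kBlockSize = 32;            // quants per block (QK4_NL in ggml)

struct Block {                            // IQ4_NL-style block, one row
    uint16_t d;                           // fp16 block scale (stored as bits)
    uint8_t  qs[kBlockSize/2];            // 32 x 4-bit quants, two per byte
};

struct BlockX4 {                          // repacked: same block of 4 rows
    uint16_t d[4];                        // one scale per row
    uint8_t  qs[4*kBlockSize/2];          // quants of all 4 rows, interleaved
};

// Repack the blocks of 4 consecutive rows into interleaved X4 blocks.
// Quant bytes are interleaved in 4-byte chunks so that a GEMM kernel can
// load the data of all 4 rows from contiguous memory and multiply them
// against the same activation values.
void repack_x4(const Block *rows[4], BlockX4 *dst, int blocks_per_row) {
    for (int ib = 0; ib < blocks_per_row; ++ib) {
        for (int r = 0; r < 4; ++r) dst[ib].d[r] = rows[r][ib].d;
        for (int chunk = 0; chunk < kBlockSize/2/4; ++chunk) {  // 4 chunks/row
            for (int r = 0; r < 4; ++r) {
                std::memcpy(dst[ib].qs + 16*chunk + 4*r,
                            rows[r][ib].qs + 4*chunk, 4);
            }
        }
    }
}
```

The point of such a layout is that the kernel's inner loop reads only contiguous memory while producing 4 output rows at once, which pays off in prompt processing, where many activation columns are multiplied against the same weights.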
This PR adds IQ4_NL_X4, a repacked variant of IQ4_NL. The table below shows a PP-512 comparison between IQ4_NL and IQ4_NL_X4 for LLaMA-3.1-8B-Instruct on ARM (M2-Max), Zen4 (Ryzen-7950X), and AVX2 (Ryzen-5975WX). Somewhat surprisingly, the speedup on Zen4 is larger than on M2-Max. On Zen4, IQ4_NL_X4 is now the fastest quantization type for prompt processing, beating even bf16 (237 t/s on the Ryzen-7950X, which has native bf16 support).
| Platform | Threads | IQ4_NL (t/s) | IQ4_NL_X4 (t/s) | Speedup |
|---|---|---|---|---|
| ARM_NEON | 8 | 85.11 ± 0.47 | 110.32 ± 0.53 | 1.296 |
| Zen4 | 16 | 168.21 ± 0.60 | 262.69 ± 0.65 | 1.562 |
| AVX2 | 32 | 186.81 ± 0.17 | 231.45 ± 0.61 | 1.239 |
For reference: on my M2-Max, mainline llama.cpp (build 3420909d) achieves 92.3 t/s for IQ4_NL_4_4, its corresponding 4-row interleaved type.
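To make the mechanism behind these speedups concrete, below is a simplified scalar model of the micro-kernel such a layout enables; it is not the iqk_mul_mat code. It assumes 8-bit quants interleaved element-wise for readability (the real type packs two 4-bit quants per byte and interleaves in larger chunks), but it shows the essential win: each activation value is loaded once and reused for 4 rows.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical scalar stand-in for a 4-row interleaved dot product:
// weight quants are stored as [r0 r1 r2 r3 r0 r1 r2 r3 ...], so a single
// pass over the activations updates 4 accumulators. A real kernel keeps
// the accumulators in SIMD registers and processes whole blocks per step.
void dot_4rows(const int8_t *wq,          // k*4 interleaved weight quants
               const int8_t *aq,          // k quantized activations
               const float   scale[4],    // combined per-row scales
               float         out[4],
               size_t        k) {
    int32_t acc[4] = {0, 0, 0, 0};
    for (size_t i = 0; i < k; ++i) {
        const int8_t a = aq[i];           // one load, four uses
        for (int r = 0; r < 4; ++r) acc[r] += a * wq[4*i + r];
    }
    for (int r = 0; r < 4; ++r) out[r] = scale[r] * (float)acc[r];
}
```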