mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-01-26 09:09:50 +00:00
1.2 KiB
1.2 KiB
🔀 #123 - IQ4_XS_R4
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2024-12-04 |
| Updated | 2024-12-04 |
Description
Follow up of #118, #119, #120, #121, #122 for IQ4_XS.
I was curious to see if one can make the interleaved rows strategy work for i- and k-quants with their super-blocks & blocks and two levels of scales. IQ4_XS seemed easiest, so I tackled that one first. We get a massive speedup on ARM_NEON and a more modest (but still significant) gain on AVX2/Zen4. I'm not 100% happy with the Zen4 implementation, but shuffling scale bits for 4 rows at once is tricky, so for now I have settled on a sub-optimal solution.
Anyway, here is PP-512 for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max) and AVX2 (Ryzen-5975WX)
| Platform | Threads | IQ4_XS | IQ4_XS_R4 | Speedup |
|---|---|---|---|---|
| ARM_NEON | 8 | 68.23 ± 1.06 | 115.43 ± 0.57 | 1.692 |
| Zen4 | 16 | 183.43 ± 0.60 | 223.98 ± 0.12 | 1.221 |
| AVX2 | 32 | 195.20 ± 0.40 | 248.25 ± 0.43 | 1.272 |