mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-01-26 17:20:01 +00:00
🔀 #138 - IQ4_K_R4
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2024-12-12 |
| Updated | 2024-12-12 |
Description
On to the R4 implementations of the new iqk quants.
First up: IQ4_K.
We get very significant performance gains on ARM_NEON and more modest gains on AVX2/Zen4. I suspect my AVX2/Zen4 implementation is not optimal, but I have not found a better way for now.
Here is PP-512 performance for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max) and AVX2 (Ryzen-5975WX):
| Platform | Threads | IQ4_K | IQ4_K_R4 | Speedup |
|---|---|---|---|---|
| ARM_NEON | 8 | 58.20 ± 1.03 | 108.02 ± 1.10 | 1.856 |
| Zen4 | 16 | 182.20 ± 0.38 | 232.63 ± 0.39 | 1.277 |
| AVX2 | 32 | 206.43 ± 0.49 | 227.60 ± 0.46 | 1.103 |
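For anyone re-checking the numbers: the Speedup column appears to be the ratio of the mean IQ4_K_R4 throughput to the mean IQ4_K throughput. The sketch below recomputes it from the PP-512 means in the table above. It assumes the values are tokens per second and ignores the reported ± uncertainties; the `pp512` dictionary and the `speedup` helper are illustrative names, not part of the repository.

```python
# Mean PP-512 throughput copied from the table above as
# (IQ4_K, IQ4_K_R4). Units are assumed to be tokens/s.
pp512 = {
    "ARM_NEON": (58.20, 108.02),
    "Zen4": (182.20, 232.63),
    "AVX2": (206.43, 227.60),
}


def speedup(baseline: float, interleaved: float) -> float:
    """Ratio of the row-interleaved throughput to the baseline throughput."""
    return interleaved / baseline


# Round to three decimals, matching the precision used in the table.
speedups = {name: round(speedup(*tps), 3) for name, tps in pp512.items()}

for name, value in speedups.items():
    print(f"{name}: {value:.3f}")
```

The same helper can be applied to the TG-128 rows by swapping in the corresponding means.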
We get decent performance gains for TG as well. Here are the results for TG-128 on LLaMA-3.1-8B with different numbers of threads:
| Platform | Threads | IQ4_K | IQ4_K_R4 | Speedup |
|---|---|---|---|---|
| ARM_NEON | 2 | 8.44 ± 0.02 | 10.56 ± 0.01 | 1.251 |
| | 4 | 15.90 ± 0.05 | 19.32 ± 0.14 | 1.215 |
| | 8 | 24.54 ± 0.15 | 25.16 ± 0.03 | 1.025 |
| Zen4 | 1 | 5.26 ± 0.00 | 6.73 ± 0.00 | 1.279 |
| | 2 | 9.71 ± 0.01 | 12.43 ± 0.00 | 1.269 |
| | 4 | 13.48 ± 0.06 | 14.00 ± 0.03 | 1.039 |
| AVX2 | 2 | 4.02 ± 0.00 | 6.91 ± 0.00 | 1.719 |
| | 4 | 8.03 ± 0.00 | 11.13 ± 0.00 | 1.386 |
| | 8 | 11.81 ± 0.00 | 12.75 ± 0.00 | 1.079 |
- I have read the contributing guidelines
- Self-reported review complexity:
  - Low
  - Medium
  - High