🔀 #138 - IQ4_K_R4

Author ikawrakow
State Closed
Created 2024-12-12
Updated 2024-12-12

Description

On to the R4 implementations of the new iqk quants.

First up: IQ4_K.

We get very significant performance gains on ARM_NEON and more modest gains on AVX2/Zen4. I suspect my AVX2/Zen4 implementation is not optimal, but I did not see a better way for now.

Here is PP-512 for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max), and AVX2 (Ryzen-5975WX):

| Platform | Threads | IQ4_K | IQ4_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 58.20 ± 1.03 | 108.02 ± 1.10 | 1.856 |
| Zen4 | 16 | 182.20 ± 0.38 | 232.63 ± 0.39 | 1.277 |
| AVX2 | 32 | 206.43 ± 0.49 | 227.60 ± 0.46 | 1.103 |
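
As a quick sanity check, the Speedup column can be reproduced as the ratio of the IQ4_K_R4 mean to the IQ4_K mean. Below is a minimal sketch in Python. It assumes the table values are throughput figures where higher is better (the unit is not stated here), and the `pp512` dictionary and `speedup` helper are names made up for this illustration only:

```python
# PP-512 mean throughput per platform, copied from the table above.
# Tuple layout: (IQ4_K, IQ4_K_R4). The "± x" run-to-run spread is ignored here.
pp512 = {
    "ARM_NEON": (58.20, 108.02),
    "Zen4": (182.20, 232.63),
    "AVX2": (206.43, 227.60),
}


def speedup(baseline: float, r4: float) -> float:
    """Hypothetical helper: ratio of R4 throughput to baseline throughput."""
    return r4 / baseline


# Round to three decimals to match the precision of the Speedup column.
speedups = {name: round(speedup(base, r4), 3) for name, (base, r4) in pp512.items()}

for name, value in speedups.items():
    print(f"{name}: {value:.3f}")
```

Since the ratio uses only the means, it says nothing about whether a small gain is within the measurement noise; for the rows above the quoted spreads are well below the differences between the two columns.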

We get decent performance gains for TG as well. Here are the results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | Q2_K_S | Q2_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 8.44 ± 0.02 | 10.56 ± 0.01 | 1.251 |
| | 4 | 15.90 ± 0.05 | 19.32 ± 0.14 | 1.215 |
| | 8 | 24.54 ± 0.15 | 25.16 ± 0.03 | 1.025 |
| Zen4 | 1 | 5.26 ± 0.00 | 6.73 ± 0.00 | 1.279 |
| | 2 | 9.71 ± 0.01 | 12.43 ± 0.00 | 1.269 |
| | 4 | 13.48 ± 0.06 | 14.00 ± 0.03 | 1.039 |
| AVX2 | 2 | 4.02 ± 0.00 | 6.91 ± 0.00 | 1.719 |
| | 4 | 8.03 ± 0.00 | 11.13 ± 0.00 | 1.386 |
| | 8 | 11.81 ± 0.00 | 12.75 ± 0.00 | 1.079 |