🔀 #149 - IQ5_K_R4

Author ikawrakow
State Closed
Created 2024-12-18
Updated 2025-03-27

Description

Adding IQ5_K with 4 interleaved rows.
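
As a rough illustration of what "4 interleaved rows" can mean, the sketch below repacks a row-major matrix of quantized blocks so that block `b` of four consecutive rows ends up contiguous in memory. It is a sketch under stated assumptions, not the layout implemented in this PR: `interleave_rows4` and the `Block` template parameter are hypothetical names, and the actual IQ5_K_R4 format may also rearrange scales and quants inside each block.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration only: interleave groups of 4 rows, block by block.
// `Block` stands in for one quantized block; the real IQ5_K_R4 layout is not
// described in the PR text and may differ from this.
template <typename Block>
std::vector<Block> interleave_rows4(const std::vector<Block>& src,
                                    std::size_t n_rows,
                                    std::size_t blocks_per_row) {
    // Assumption: src is row-major and the row count is a multiple of 4.
    assert(n_rows % 4 == 0);
    assert(src.size() == n_rows * blocks_per_row);

    std::vector<Block> dst(src.size());
    for (std::size_t r0 = 0; r0 < n_rows; r0 += 4) {              // each group of 4 rows
        for (std::size_t b = 0; b < blocks_per_row; ++b) {        // each block position
            for (std::size_t k = 0; k < 4; ++k) {                 // each row in the group
                dst[r0 * blocks_per_row + 4 * b + k] =
                    src[(r0 + k) * blocks_per_row + b];
            }
        }
    }
    return dst;
}
```

With 4 rows of 2 blocks each, an input of `{0,1, 10,11, 20,21, 30,31}` comes out as `{0,10,20,30, 1,11,21,31}`. A contiguous layout like this is the kind of arrangement that lets a matrix-multiplication kernel work on four rows per pass, though the actual kernels are not shown here.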

We get very significant performance gains on ARM_NEON and more modest gains on AVX2/Zen4.

Here is PP-512 (prompt processing, 512 tokens) for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max) and AVX2 (Ryzen-5975WX):

| Platform | Threads | IQ5_K | IQ5_K_R4 | Speedup |
| --- | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 53.80 ± 1.08 | 93.33 ± 2.02 | 1.735 |
| Zen4 | 16 | 168.09 ± 0.58 | 230.23 ± 0.23 | 1.370 |
| AVX2 | 32 | 177.16 ± 0.31 | 231.50 ± 0.43 | 1.307 |
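
The Speedup column appears to be the ratio of the IQ5_K_R4 mean to the IQ5_K mean, ignoring the ± spread. The short sketch below reproduces it; `speedup` and `round3` are hypothetical helper names, and treating the measurements as tokens per second is an assumption, since the PR text does not state the unit.

```cpp
#include <cassert>
#include <cmath>

// Ratio of interleaved to baseline throughput.
// Assumption: both values are mean throughputs in the same unit
// (presumably tokens per second); the ± spread is not used.
double speedup(double baseline, double interleaved) {
    assert(baseline > 0.0);
    return interleaved / baseline;
}

// Round to three decimals, the precision shown in the table.
double round3(double x) {
    return std::round(x * 1000.0) / 1000.0;
}
```

For the ARM_NEON row, `round3(speedup(53.80, 93.33))` gives 1.735, which matches the table.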

TG does not look good on AVX2/Zen4. On ARM_NEON we get a decent performance gain. Here are the results for TG-128 (token generation, 128 tokens) on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | IQ5_K | IQ5_K_R4 | Speedup |
| --- | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 5.92 ± 0.07 | 6.98 ± 0.00 | 1.179 |
| | 4 | 11.53 ± 0.01 | 13.35 ± 0.01 | 1.158 |
| | 8 | 20.29 ± 0.46 | 21.17 ± 0.18 | 1.043 |

💬 Conversation

👤 saood06 commented on 2025-03-27 at 06:53:47:

> TG does not look good on AVX2/Zen4

Does this mean a regression compared to non-interleaved, or just no benefit?