Files
ik_llama.cpp/github-data/pull_requests/146 - IQ2_K_R4.md
2025-07-23 13:31:53 +02:00

1.6 KiB

🔀 #146 - IQ2_K_R4

Author ikawrakow
State Closed
Created 2024-12-17
Updated 2024-12-17

Description

Adding IQ2_K with 4 interleaved rows.

We get very signifiant performance gains on ARM_NEON and more modest gains on AVX2/Zen4.

Here is PP-512 for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max) and AVX2 (Ryzen-5975WX)

Platform Threads IQ2_K IQ2_K_R4 Speedup
ARM_NEON 8 59.71 ± 0.91 107.93 ± 0.75 1.808
Zen4 16 198.79 ± 0.58 250.19 ± 0.42 1.259
AVX2 32 209.02 ± 0.16 287.17 ± 0.64 1.374

We get decent performance gains for TG as well. Here results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

Platform Threads IQ2_K IQ2_K_R4 Speedup
ARM_NEON 2 8.22 ± 0.01 9.79 ± 0.00 1.191
4 15.12 ± 0.01 18.25 ± 0.02 1.207
8 28.01 ± 0.13 32.33 ± 0.26 1.154
Zen4 1 6.56 ± 0.00 7.13 ± 0.11 1.087
2 11.89 ± 0.00 13.35 ± 0.01 1.123
4 19.37 ± 1.84 21.55 ± 0.86 1.113
AVX2 2 5.06 ± 0.00 8.83 ± 0.00 1.745
4 9.63 ± 0.00 16.28 ± 0.00 1.691
8 17.45 ± 0.08 22.11 ± 0.00 1.267