Files
ik_llama.cpp/github-data/pull_requests/129 - Q4_K_R4.md
2025-07-23 13:31:53 +02:00

1.9 KiB

🔀 #129 - Q4_K_R4

Author ikawrakow
State Closed
Created 2024-12-09
Updated 2024-12-09

Description

Follow up of #118, #119, #120, #121, #122, #123 for Q4_K.

After having demonstrated interleaved rows with blocks and super-blocks for IQ4_XS in #123, here the corresponding implementation for Q4_K. To not have an explosion of quantization types, Q4_K_R4 corresponds to Q4_K_S (and there is no _R4 variant for Q4_K_M).

We get a massive speedup on ARM_NEON and quite significant gain on AVX2/Zen4. The Zen4 implementation could probably be optimized further. Here is PP-512 for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max) and AVX2 (Ryzen-5975WX)

Platform Threads Q4_K_S Q4_K_R4 Speedup
ARM_NEON 8 68.73 ± 0.88 110.02 ± 1.31 1.601
Zen4 16 198.92 ± 0.69 259.19 ± 0.24 1.303
AVX2 32 206.39 ± 0.28 282.78 ± 0.54 1.370

Here we gain even for TG. Here results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

Platform Threads Q4_K_S Q4_K_R4 Speedup
ARM_NEON 2 11.38 ± 0.00 12.17 ± 0.01 1.069
4 18.08 ± 0.44 21.56 ± 0.06 1.192
8 25.02 ± 0.17 25.39 ± 0.14 1.015
Zen4 1 5.73 ± 0.01 8.95 ± 0.00 1.562
2 10.47 ± 0.01 13.37 ± 0.00 1.277
4 13.38 ± 0.63 14.03 ± 0.01 1.049
AVX2 2 4.60 ± 0.00 7.61 ± 0.00 1.370
4 8.55 ± 0.00 12.01 ± 0.00 1.403
8 11.67 ± 0.00 13.83 ± 0.00 1.185