ik_llama.cpp/github-data/pull_requests/136 - Q2_K_R4.md
2025-07-23 13:31:53 +02:00

🔀 #136 - Q2_K_R4

Author ikawrakow
State Closed
Created 2024-12-11
Updated 2024-12-11

Description

Follow up of #118, #119, #120, #121, #122, #123, #129, #130, #132, #134 for Q2_K.

This completes the R4 implementation for k-quants on ARM_NEON, AVX2, and Zen4.

We get significant performance gains on all platforms. Here is PP-512 for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max) and AVX2 (Ryzen-5975WX):

| Platform | Threads | Q2_K_S | Q2_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 73.79 ± 1.92 | 109.07 ± 0.58 | 1.478 |
| Zen4 | 16 | 205.95 ± 0.77 | 256.19 ± 0.26 | 1.244 |
| AVX2 | 32 | 214.42 ± 0.54 | 286.91 ± 0.63 | 1.338 |
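The Speedup column is the ratio of the two throughput columns. As a minimal sketch of that arithmetic (the function names `speedup` and `speedup_stddev` are hypothetical, and the error estimate assumes the ± values are independent standard deviations, which the PR does not state):

```cpp
#include <cassert>
#include <cmath>

// Ratio of the new throughput to the baseline throughput (both in tokens/second).
double speedup(double baseline_tps, double new_tps) {
    assert(baseline_tps > 0.0);
    return new_tps / baseline_tps;
}

// First-order error propagation for a ratio r = n / b.
// Assumption: b_err and n_err are independent standard deviations.
double speedup_stddev(double b, double b_err, double n, double n_err) {
    assert(b > 0.0 && n > 0.0);
    const double rel_b = b_err / b;  // relative error of the baseline
    const double rel_n = n_err / n;  // relative error of the new result
    return (n / b) * std::sqrt(rel_b * rel_b + rel_n * rel_n);
}
```

For the ARM_NEON row, `speedup(73.79, 109.07)` reproduces the 1.478 in the table, and under the stated assumption the uncertainty of that ratio is roughly 0.04, dominated by the noisier Q2_K_S measurement.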

As Q2_K is smaller than other k-quants, here the CPU can do more work before available memory bandwidth saturates when running TG. Hence, we get non-negligible performance gains on all platforms also for TG. Here are results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | Q2_K_S | Q2_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 10.34 ± 0.01 | 12.81 ± 0.01 | 1.239 |
| | 4 | 19.32 ± 0.02 | 23.40 ± 0.08 | 1.211 |
| | 8 | 32.36 ± 0.59 | 36.02 ± 0.40 | 1.113 |
| Zen4 | 1 | 6.60 ± 0.02 | 9.08 ± 0.12 | 1.376 |
| | 2 | 12.12 ± 0.01 | 16.40 ± 0.00 | 1.353 |
| | 4 | 19.12 ± 0.56 | 20.72 ± 0.19 | 1.084 |
| AVX2 | 2 | 5.93 ± 0.02 | 10.16 ± 0.30 | 1.713 |
| | 4 | 11.24 ± 0.00 | 17.59 ± 0.01 | 1.565 |
| | 8 | 18.62 ± 0.03 | 21.44 ± 0.00 | 1.151 |
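The shrinking speedup at higher thread counts is consistent with TG approaching the memory bandwidth limit. One rough way to see this from the table is to compare measured throughput against perfect linear scaling from a lower thread count. The helper below is a hypothetical illustration, not part of the PR:

```cpp
#include <cassert>

// Fraction of ideal linear scaling achieved when going from `ref_threads`
// to `threads`. A value near 1.0 suggests the workload is still compute-bound;
// a value well below 1.0 suggests another limit (such as memory bandwidth).
double scaling_efficiency(double ref_tps, int ref_threads,
                          double tps, int threads) {
    assert(ref_tps > 0.0 && ref_threads > 0 && threads > 0);
    const double ideal_tps = ref_tps * threads / ref_threads;
    return tps / ideal_tps;
}
```

With the Zen4 rows, `scaling_efficiency(9.08, 1, 20.72, 4)` is about 0.57 for Q2_K_R4, while `scaling_efficiency(6.60, 1, 19.12, 4)` is about 0.72 for Q2_K_S. The faster per-thread kernel falls off linear scaling sooner, which is why the two formats converge as threads are added.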

It is actually too bad that Q2_K is such a low-quality quantization, as performance is really good. Perhaps I should try to improve it? When I was developing it back then, it was much better than any other 2-bit attempt at the time, so I was quite pleased with the result. But with today's knowledge that we can do much better at 2 bpw, perhaps a fresh look could be useful.