🔀 #132 - Q5_K_R4

Author ikawrakow
State Closed
Created 2024-12-10
Updated 2024-12-10

Description

Follow up of #118, #119, #120, #121, #122, #123, #129, #130 for Q5_K.
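As in those PRs, the gain comes from repacking the quantized blocks of 4 consecutive rows into an interleaved order, so the matrix-multiplication kernels always find the corresponding block of all 4 rows in adjacent memory. A minimal sketch of the idea (names are hypothetical; this is not the actual Q5_K_R4 block layout or the ik_llama.cpp repacking code, which also rearranges the quants and scales inside each block):

```cpp
// Illustration only: generic 4-row block interleaving.
#include <cstddef>
#include <vector>

template <typename Block>
std::vector<Block> interleave_4_rows(const Block * rows, size_t blocks_per_row) {
    // 'rows' holds 4 * blocks_per_row quantized blocks, stored row after row.
    std::vector<Block> out(4 * blocks_per_row);
    for (size_t ib = 0; ib < blocks_per_row; ++ib) {
        for (size_t r = 0; r < 4; ++r) {
            // block ib of row r lands next to block ib of the other 3 rows
            out[4*ib + r] = rows[r*blocks_per_row + ib];
        }
    }
    return out;
}
```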

We get a large speedup on ARM_NEON and non-negligible gains on AVX2/Zen4. Here is PP-512 for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max) and AVX2 (Ryzen-5975WX):

| Platform | Threads | Q5_K | Q5_K_R4 | Speedup |
| --- | --- | --- | --- | --- |
| ARM_NEON | 8 | 61.07 ± 0.95 | 96.13 ± 2.38 | 1.574 |
| Zen4 | 16 | 188.73 ± 0.75 | 248.30 ± 0.29 | 1.316 |
| AVX2 | 32 | 188.11 ± 0.29 | 269.18 ± 0.40 | 1.431 |

On AVX2/Zen4 we gain even for TG. Here are results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | Q5_K | Q5_K_R4 | Speedup |
| --- | --- | --- | --- | --- |
| Zen4 | 1 | 5.12 ± 0.00 | 7.07 ± 0.01 | 1.380 |
| Zen4 | 2 | 9.31 ± 0.00 | 11.54 ± 0.0 | 1.240 |
| Zen4 | 4 | 11.33 ± 0.37 | 11.89 ± 0.00 | 1.049 |
| AVX2 | 2 | 4.04 ± 0.00 | 6.40 ± 0.00 | 1.584 |
| AVX2 | 4 | 7.57 ± 0.00 | 9.95 ± 0.00 | 1.314 |
| AVX2 | 8 | 9.75 ± 0.00 | 11.00 ± 0.00 | 1.128 |

I decided to check the current state of mainline llama.cpp for Q5_K_S.

Hahaha - here is what we get on my M2-Max (build: 7736837d (4274))

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 8 | pp512 | 27.69 ± 0.09 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 2 | tg128 | 6.39 ± 0.01 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 4 | tg128 | 12.18 ± 0.02 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 8 | tg128 | 19.68 ± 0.64 |

The performance gap to mainline in prompt processing for Q5_K has now grown to 3.5X, and mainline is ~30% slower for TG with 2 threads.
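The 3.5X figure follows from the ARM_NEON pp512 numbers (my arithmetic, using the PP-512 table above):

```
96.13 / 27.69 ≈ 3.47
```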

Here is what I get on my Ryzen-7950X (build: 26a8406b (4295))

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 16 | pp512 | 75.88 ± 0.26 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 1 | tg128 | 4.10 ± 0.00 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 2 | tg128 | 7.66 ± 0.01 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 4 | tg128 | 11.26 ± 0.00 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 8 | tg128 | 11.20 ± 0.22 |

Mainline is 3.26X slower for prompt processing, and 72%/51% slower for TG at 1/2 threads.
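For reference, relating these figures to the Zen4 numbers in the tables above (my arithmetic):

```
pp512:             248.30 / 75.88 ≈ 3.3
tg128, 1 thread:     7.07 /  4.10 ≈ 1.72
tg128, 2 threads:   11.54 /  7.66 ≈ 1.51
```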