
🔀 #121 - Q5_0_R4

Author ikawrakow
State Closed
Created 2024-12-03
Updated 2024-12-03

Description

Follow-up to #118, #119, and #120, this time for Q5_0.
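For readers landing here without the earlier PRs: the `_R4` variants repack four consecutive rows so that the quantized blocks at the same column position are stored interleaved, letting one kernel pass produce four row results. Below is a minimal sketch of that idea; the names `block_q5_0_x4` and `repack_q5_0_r4` and the exact byte interleaving are illustrative, not the actual ik_llama.cpp definitions, and the scale is kept as `float` (fp16 in the real code) to keep the sketch self-contained.

```c
#include <stdint.h>

#define QK5_0 32   // weights per Q5_0 block

// Standard Q5_0 block (scale is fp16 in ggml; float here for brevity).
typedef struct {
    float   d;              // per-block scale
    uint8_t qh[4];          // 5th (high) bit of each of the 32 weights
    uint8_t qs[QK5_0 / 2];  // low 4 bits, two weights per byte
} block_q5_0;

// Hypothetical interleaved block holding the same column-block of 4 rows.
typedef struct {
    float   d[4];               // scales of the 4 rows
    uint8_t qh[4 * 4];          // high bits, byte-interleaved across rows
    uint8_t qs[4 * QK5_0 / 2];  // low nibbles, byte-interleaved across rows
} block_q5_0_x4;

// Repack 4 rows of nblock blocks each: for every column position j, gather
// the j-th block of each row and interleave the bytes row by row.
static void repack_q5_0_r4(const block_q5_0 *rows[4], int nblock,
                           block_q5_0_x4 *dst) {
    for (int j = 0; j < nblock; ++j) {
        for (int r = 0; r < 4; ++r) {
            const block_q5_0 *b = rows[r] + j;
            dst[j].d[r] = b->d;
            for (int k = 0; k < 4; ++k)       dst[j].qh[4*k + r] = b->qh[k];
            for (int k = 0; k < QK5_0/2; ++k) dst[j].qs[4*k + r] = b->qs[k];
        }
    }
}
```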

Here is PP-512 for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max), and AVX2 (Ryzen-5975WX):

| Platform | Threads | Q5_0 (t/s) | Q5_0_R4 (t/s) | Speedup |
| --- | --- | --- | --- | --- |
| ARM_NEON | 8 | 71.04 ± 0.83 | 99.59 ± 1.06 | 1.402 |
| Zen4 | 16 | 157.46 ± 0.50 | 256.70 ± 0.42 | 1.630 |
| AVX2 | 32 | 171.99 ± 0.50 | 236.33 ± 0.56 | 1.374 |

Here I see a benefit even for TG. E.g., on the Ryzen-7950X I get the following for TG-128:

| Threads | Q5_0 (t/s) | Q5_0_R4 (t/s) | Speedup |
| --- | --- | --- | --- |
| 2 | 9.06 ± 0.00 | 9.87 ± 0.00 | 1.089 |
| 4 | 11.06 ± 0.15 | 11.73 ± 0.00 | 1.061 |
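The TG path is matrix-vector multiplication, and the interleaved layout lets a single sweep over the activation vector update four row sums at once, which is plausibly where this modest TG gain comes from. Here is a scalar sketch continuing the hypothetical `block_q5_0_x4` layout from above (the bit decode matches that illustrative packing, not necessarily the actual kernel):

```c
// Dot 4 interleaved rows with activation vector y; each activation value
// y[i] is loaded once and reused for all 4 row sums.
static void vec_dot_q5_0_x4(int nblock, const block_q5_0_x4 *x,
                            const float *y, float sums[4]) {
    for (int j = 0; j < nblock; ++j) {
        for (int r = 0; r < 4; ++r) {
            float s = 0.0f;
            for (int k = 0; k < QK5_0; ++k) {
                // low nibble: weights 0..15 in low halves, 16..31 in high halves
                int lo = (x[j].qs[4*(k % 16) + r] >> (4 * (k / 16))) & 0xF;
                // high (5th) bit: bit k of the row's 32-bit qh field
                int hi = (x[j].qh[4*(k / 8) + r] >> (k % 8)) & 1;
                s += (float)((lo | (hi << 4)) - 16) * y[QK5_0*j + k];
            }
            sums[r] += s * x[j].d[r];  // apply the per-block scale
        }
    }
}
```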

It is worth comparing Q5_0_R4 to mainline llama.cpp (build: 3420909d (4234)) on the M2-Max:

| Task | Threads | t/s (mainline) | t/s (PR) | Speedup |
| --- | --- | --- | --- | --- |
| pp512 | 8 | 26.49 ± 0.61 | 99.59 ± 1.06 | 3.758 |
| tg128 | 2 | 6.38 ± 0.01 | 8.75 ± 0.01 | 1.371 |
| tg128 | 4 | 12.27 ± 0.10 | 16.46 ± 0.08 | 1.341 |
| tg128 | 8 | 20.60 ± 0.14 | 22.07 ± 0.32 | 1.071 |