Files
ik_llama.cpp/github-data/pull_requests/134 - Q3_K_R4.md
2025-07-23 13:31:53 +02:00

1.4 KiB

🔀 #134 - Q3_K_R4

Author ikawrakow
State Closed
Created 2024-12-11
Updated 2024-12-11

Description

Follow up of #118, #119, #120, #121, #122, #123, #129, #130, #132 for Q3_K.

We get a massive speedup on ARM_NEON and non-negligible gains on AVX2/Zen4. Here is PP-512 for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max) and AVX2 (Ryzen-5975WX)

Platform Threads Q3_K Q3_K_R4 Speedup
ARM_NEON 8 55.42 ± 1.00 106.89 ± 1.14 1.929
Zen4 16 193.89 ± 0.43 236.77 ± 0.35 1.221
AVX2 32 199.22 ± 0.41 262.34 ± 0.50 1.317

On AVX2/Zen4 we gain even for TG. Here results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

Platform Threads Q3_K Q3_K_R4 Speedup
Zen4 1 5.47 ± 0.01 6.78 ± 0.00 1.239
2 10.25 ± 0.00 12.46 ± 0.00 1.216
4 15.21 ± 0.59 17.02 ± 0.09 1.119
AVX2 2 5.02 ± 0.01 8.21 ± 0.00 1.635
4 9.33 ± 0.00 13.67 ± 0.00 1.465
8 14.85 ± 0.02 16.67 ± 0.00 1.123