🔀 #132 - Q5_K_R4
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2024-12-10 |
| Updated | 2024-12-10 |
Description
Follow up of #118, #119, #120, #121, #122, #123, #129, #130 for Q5_K.
We get a large speedup on ARM_NEON and non-negligible gains on AVX2/Zen4. Here are PP-512 results for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max) and AVX2 (Ryzen-5975WX):
| Platform | Threads | Q5_K (t/s) | Q5_K_R4 (t/s) | Speedup |
|---|---|---|---|---|
| ARM_NEON | 8 | 61.07 ± 0.95 | 96.13 ± 2.38 | 1.574 |
| Zen4 | 16 | 188.73 ± 0.75 | 248.30 ± 0.29 | 1.316 |
| AVX2 | 32 | 188.11 ± 0.29 | 269.18 ± 0.40 | 1.431 |
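For readers landing on this PR without the earlier ones in the series: the `_R4` idea is to repack the quants of four consecutive rows into a single interleaved block, so the matrix kernels can stream the data of all four rows with contiguous loads. The sketch below is only illustrative: the simplified `block_q5_K` mirrors the upstream layout, but the repacked `block_q5_k_r4` struct and the exact interleaving pattern shown here are assumptions for illustration, not necessarily the layout used in this PR.

```cpp
// Illustrative sketch of the row-interleaving behind the _R4 formats.
// block_q5_K follows the upstream ggml layout (simplified); block_q5_k_r4
// is a hypothetical repacked layout used only to show the idea.
#include <cstdint>
#include <cstddef>

constexpr int QK_K = 256;

struct block_q5_K {              // one super-block of 256 weights
    uint16_t d, dmin;            // fp16 super-block scale and min
    uint8_t  scales[12];         // packed 6-bit sub-block scales/mins
    uint8_t  qh[QK_K/8];         // high bits of the 5-bit quants
    uint8_t  qs[QK_K/2];         // low 4 bits of the quants
};

struct block_q5_k_r4 {           // hypothetical: 4 super-blocks interleaved
    uint16_t d[4], dmin[4];      // the 4 rows' scales side by side
    uint8_t  scales[4*12];
    uint8_t  qh[4*QK_K/8];
    uint8_t  qs[4*QK_K/2];
};

// Repack 4 consecutive rows (each n_blocks super-blocks long) so that the
// GEMM/GEMV kernels can load matching blocks of all 4 rows contiguously.
void repack_q5_k_r4(const block_q5_K *rows[4], block_q5_k_r4 *dst, size_t n_blocks) {
    for (size_t ib = 0; ib < n_blocks; ++ib) {
        for (int r = 0; r < 4; ++r) {
            const block_q5_K &src = rows[r][ib];
            dst[ib].d[r]    = src.d;
            dst[ib].dmin[r] = src.dmin;
            for (int j = 0; j < 12;     ++j) dst[ib].scales[4*j + r] = src.scales[j];
            for (int j = 0; j < QK_K/8; ++j) dst[ib].qh[4*j + r]     = src.qh[j];
            for (int j = 0; j < QK_K/2; ++j) dst[ib].qs[4*j + r]     = src.qs[j];
        }
    }
}
```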
On AVX2/Zen4 we gain even for TG. Here are TG-128 results for LLaMA-3.1-8B with different numbers of threads:
| Platform | Threads | Q5_K (t/s) | Q5_K_R4 (t/s) | Speedup |
|---|---|---|---|---|
| Zen4 | 1 | 5.12 ± 0.00 | 7.07 ± 0.01 | 1.380 |
|  | 2 | 9.31 ± 0.00 | 11.54 ± 0.00 | 1.240 |
|  | 4 | 11.33 ± 0.37 | 11.89 ± 0.00 | 1.049 |
| AVX2 | 2 | 4.04 ± 0.00 | 6.40 ± 0.00 | 1.584 |
|  | 4 | 7.57 ± 0.00 | 9.95 ± 0.00 | 1.314 |
|  | 8 | 9.75 ± 0.00 | 11.00 ± 0.00 | 1.128 |
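Where the TG gain comes from, in a nutshell: with four rows interleaved, one pass over the activations produces four output values, so the activation loads and the per-block scale bookkeeping are amortized over four dot products. The scalar sketch below uses a made-up 8-bit interleaved block purely to show the loop structure; it is not the actual Q5_K_R4 AVX2/NEON kernel.

```cpp
// Scalar illustration (not the real SIMD kernel) of a GEMV over a 4-row
// interleaved layout: each activation value is loaded once and used for
// four accumulators, one per interleaved row.
#include <cstdint>
#include <cstddef>

// Hypothetical simplified block: 32 8-bit quants per row, 4 rows interleaved.
struct block_r4 {
    float  d[4];        // per-row block scales
    int8_t qs[4*32];    // qs[4*j + r] = quant j of row r
};

// y: activations (floats for simplicity), n_blocks blocks of 32 values.
// out[r] receives the dot product of interleaved row r with y.
void gemv_r4(const block_r4 *x, const float *y, size_t n_blocks, float out[4]) {
    float acc[4] = {0, 0, 0, 0};
    for (size_t ib = 0; ib < n_blocks; ++ib) {
        float sum[4] = {0, 0, 0, 0};
        for (int j = 0; j < 32; ++j) {
            const float yj = y[32*ib + j];          // loaded once, used 4 times
            for (int r = 0; r < 4; ++r) sum[r] += x[ib].qs[4*j + r] * yj;
        }
        for (int r = 0; r < 4; ++r) acc[r] += x[ib].d[r] * sum[r];
    }
    for (int r = 0; r < 4; ++r) out[r] = acc[r];
}
```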
I decided to check the current state of mainline llama.cpp for Q5_K_S.
Hahaha - here is what we get on my M2-Max (build: 7736837d (4274)):
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 8 | pp512 | 27.69 ± 0.09 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 2 | tg128 | 6.39 ± 0.01 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 4 | tg128 | 12.18 ± 0.02 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 8 | tg128 | 19.68 ± 0.64 |
The prompt-processing performance gap for Q5_K has now grown to 3.5X (96.13 vs. 27.69 t/s), and mainline is ~30% slower for TG with 2 threads.
Here is what I get on my Ryzen-7950X (build: 26a8406b (4295)):
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 16 | pp512 | 75.88 ± 0.26 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 1 | tg128 | 4.10 ± 0.00 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 2 | tg128 | 7.66 ± 0.01 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 4 | tg128 | 11.26 ± 0.00 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 8 | tg128 | 11.20 ± 0.22 |
Mainline is 3.26X slower for prompt processing, and Q5_K_R4 here is 72%/51% faster for TG at 1/2 threads.