ik_llama.cpp/github-data/pull_requests/132-Q5_K_R4.md
2025-07-22 18:18:40 +02:00

### 🔀 [#132](https://github.com/ikawrakow/ik_llama.cpp/pull/132) - Q5_K_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-10 |
| **Updated** | 2024-12-10 |
---
#### Description
Follow up of #118, #119, #120, #121, #122, #123, #129, #130 for `Q5_K`.
We get a large speedup on `ARM_NEON` and non-negligible gains on `AVX2/Zen4`. Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX)
| Platform | Threads | Q5_K | Q5_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 61.07 ± 0.95 | 96.13 ± 2.38 | 1.574 |
| Zen4 | 16 | 188.73 ± 0.75 | 248.30 ± 0.29 | 1.316 |
| AVX2 | 32 | 188.11 ± 0.29 | 269.18 ± 0.40 | 1.431 |
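Assuming the same interleaving scheme as the earlier `_R4` PRs in this series (#118 and friends), the speedup comes from repacking the quantized blocks of 4 consecutive rows into a row-interleaved layout, so a matrix-multiplication kernel can load one block from each of 4 rows with a single contiguous read. A minimal, hypothetical sketch of the repacking order (names are illustrative, not the actual implementation):

```python
def repack_r4(rows):
    """Interleave the blocks of 4 rows:
    out = [r0b0, r1b0, r2b0, r3b0, r0b1, r1b1, ...]"""
    assert len(rows) == 4, "R4 repacking operates on groups of 4 rows"
    nblocks = len(rows[0])
    out = []
    for ib in range(nblocks):        # for each block index...
        for r in range(4):           # ...take that block from each row
            out.append(rows[r][ib])
    return out

# 4 rows of 3 quantized blocks each, labeled "r<row>b<block>"
rows = [[f"r{r}b{b}" for b in range(3)] for r in range(4)]
print(repack_r4(rows)[:4])  # ['r0b0', 'r1b0', 'r2b0', 'r3b0']
```

With this layout the kernel walks the interleaved buffer once, multiplying each group of 4 blocks against the same activation block, instead of striding across 4 separate rows.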
On `AVX2/Zen4` we gain even for TG. Here are results for TG-128 on LLaMA-3.1-8B with different numbers of threads:
| Platform | Threads | Q5_K | Q5_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| Zen4 | 1 | 5.12 ± 0.00 | 7.07 ± 0.01 | 1.380 |
| | 2 | 9.31 ± 0.00 | 11.54 ± 0.00 | 1.240 |
| | 4 | 11.33 ± 0.37 | 11.89 ± 0.00 | 1.049 |
| AVX2 | 2 | 4.04 ± 0.00 | 6.40 ± 0.00 | 1.584 |
| | 4 | 7.57 ± 0.00 | 9.95 ± 0.00 | 1.314 |
| | 8 | 9.75 ± 0.00 | 11.00 ± 0.00 | 1.128 |
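The `Speedup` columns in both tables are simply the ratio of the two throughput measurements. A quick check against the PP-512 numbers above (values copied from the first table):

```python
# Sanity-checking the Speedup column: speedup = (R4 t/s) / (base t/s).
pp512 = {
    "ARM_NEON": (61.07, 96.13),   # (Q5_K, Q5_K_R4), 8 threads
    "Zen4":     (188.73, 248.30), # 16 threads
    "AVX2":     (188.11, 269.18), # 32 threads
}
for name, (base, r4) in pp512.items():
    print(f"{name}: {r4 / base:.3f}")  # matches the table: 1.574, 1.316, 1.431
```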
I decided to check the current state of mainline `llama.cpp` for `Q5_K_S`.
Hahaha - here is what we get on my M2-Max (`build: 7736837d (4274)`)
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 8 | pp512 | 27.69 ± 0.09 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 2 | tg128 | 6.39 ± 0.01 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 4 | tg128 | 12.18 ± 0.02 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 8 | tg128 | 19.68 ± 0.64 |
The performance gap in prompt processing for `Q5_K` has now grown to 3.5X, and it is ~30% slower for TG with 2 threads.
Here is what I get on my Ryzen-7950X (`build: 26a8406b (4295)`)
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 16 | pp512 | 75.88 ± 0.26 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 1 | tg128 | 4.10 ± 0.00 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 2 | tg128 | 7.66 ± 0.01 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 4 | tg128 | 11.26 ± 0.00 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 8 | tg128 | 11.20 ± 0.22 |
Mainline is 3.27X slower for prompt processing, and 72%/51% slower for TG at 1/2 threads.
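For reference, these closing ratios follow directly from the Ryzen-7950X numbers above ("ik" values are `Q5_K_R4` from the Zen4 rows, "main" values are mainline `Q5_K_S`; all copied from the tables):

```python
# Cross-checking the closing claims from the benchmark tables.
ik_pp,  main_pp  = 248.30, 75.88   # pp512, 16 threads
ik_tg1, main_tg1 = 7.07,   4.10    # tg128, 1 thread
ik_tg2, main_tg2 = 11.54,  7.66    # tg128, 2 threads

print(f"{ik_pp / main_pp:.2f}X")                       # prompt-processing gap
print(f"{(ik_tg1 - main_tg1) / main_tg1 * 100:.0f}%")  # TG, 1 thread
print(f"{(ik_tg2 - main_tg2) / main_tg2 * 100:.0f}%")  # TG, 2 threads
```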