mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-03-11 14:30:02 +00:00
Add GitHub data (#637)
40
github-data/pull_requests/136-Q2_K_R4.md
Normal file
@@ -0,0 +1,40 @@
### 🔀 [#136](https://github.com/ikawrakow/ik_llama.cpp/pull/136) - Q2_K_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-11 |
| **Updated** | 2024-12-11 |

---

#### Description

Follow-up to #118, #119, #120, #121, #122, #123, #129, #130, #132, #134 for `Q2_K`.

This completes the R4 implementation for k-quants on `ARM_NEON`, `AVX2`, and `Zen4`.

We get significant performance gains on all platforms. Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max), and `AVX2` (Ryzen-5975WX):

| Platform | Threads | Q2_K_S | Q2_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 73.79 ± 1.92 | 109.07 ± 0.58 | 1.478 |
| Zen4 | 16 | 205.95 ± 0.77 | 256.19 ± 0.26 | 1.244 |
| AVX2 | 32 | 214.42 ± 0.54 | 286.91 ± 0.63 | 1.338 |

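For readers who want to check the Speedup column, here is a minimal sketch in Python. It assumes the speedup is simply the ratio of the mean `Q2_K_R4` throughput to the mean `Q2_K_S` throughput and ignores the ± uncertainties. The names `pp512` and `speedup` are made up for this illustration and are not part of the repository.

```python
# PP-512 throughput in tokens/s as (Q2_K_S, Q2_K_R4), copied from the table above.
# The "±" uncertainties are deliberately ignored in this sketch.
pp512 = {
    "ARM_NEON": (73.79, 109.07),
    "Zen4": (205.95, 256.19),
    "AVX2": (214.42, 286.91),
}


def speedup(baseline, r4):
    """Ratio of Q2_K_R4 to Q2_K_S throughput, rounded to 3 decimals."""
    return round(r4 / baseline, 3)


speedups = {platform: speedup(*values) for platform, values in pp512.items()}

for platform, ratio in speedups.items():
    print(f"{platform}: {ratio:.3f}")
```

Under that assumption the computed ratios agree with the Speedup column to three decimals.
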
Because `Q2_K` is smaller than the other k-quants, the CPU can do more work before the available memory bandwidth saturates when running TG. Hence, we also get non-negligible TG performance gains on all platforms.
Here are the results for `TG-128` on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | Q2_K_S | Q2_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 10.34 ± 0.01 | 12.81 ± 0.01 | 1.239 |
| | 4 | 19.32 ± 0.02 | 23.40 ± 0.08 | 1.211 |
| | 8 | 32.36 ± 0.59 | 36.02 ± 0.40 | 1.113 |
| Zen4 | 1 | 6.60 ± 0.02 | 9.08 ± 0.12 | 1.376 |
| | 2 | 12.12 ± 0.01 | 16.40 ± 0.00 | 1.353 |
| | 4 | 19.12 ± 0.56 | 20.72 ± 0.19 | 1.084 |
| AVX2 | 2 | 5.93 ± 0.02 | 10.16 ± 0.30 | 1.713 |
| | 4 | 11.24 ± 0.00 | 17.59 ± 0.01 | 1.565 |
| | 8 | 18.62 ± 0.03 | 21.44 ± 0.00 | 1.151 |

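One way to see the saturation effect in these numbers is to look at how per-thread throughput falls as threads are added. The sketch below uses only the `Q2_K_R4` means from the table above. The names `tg128_r4` and `scaling_efficiency` are hypothetical, and the metric (per-thread throughput relative to the smallest measured thread count) is an assumption chosen for illustration, not something defined in this PR.

```python
# TG-128 throughput (tokens/s) for Q2_K_R4, copied from the table above,
# keyed by platform and then by thread count. Uncertainties are ignored.
tg128_r4 = {
    "ARM_NEON": {2: 12.81, 4: 23.40, 8: 36.02},
    "Zen4": {1: 9.08, 2: 16.40, 4: 20.72},
    "AVX2": {2: 10.16, 4: 17.59, 8: 21.44},
}


def scaling_efficiency(results):
    """Per-thread throughput relative to the smallest thread count measured."""
    base_threads = min(results)
    base_per_thread = results[base_threads] / base_threads
    return {
        threads: round((tps / threads) / base_per_thread, 2)
        for threads, tps in sorted(results.items())
    }


efficiency = {platform: scaling_efficiency(r) for platform, r in tg128_r4.items()}

for platform, eff in efficiency.items():
    print(platform, eff)
```

On `Zen4`, for example, this efficiency drops to about 0.57 at 4 threads. A falling value is consistent with memory bandwidth becoming the limit, although this arithmetic alone does not establish the cause.
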
It is actually too bad `Q2_K` is such a low-quality quantization, as performance is really good. Perhaps I should try to improve it? When I was developing it back then, it was much better than any other 2-bit attempt at the time, so I was quite pleased with the result. But with today's knowledge that we can do much better at 2 bpw, perhaps a fresh look could be useful.