mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-02-18 20:30:11 +00:00
25 lines
1.2 KiB
Markdown
25 lines
1.2 KiB
Markdown
### 🔀 [#119](https://github.com/ikawrakow/ik_llama.cpp/pull/119) - Q4_0_R4
|
|
|
|
| **Author** | `ikawrakow` |
|
|
| :--- | :--- |
|
|
| **State** | ❌ **Closed** |
|
|
| **Created** | 2024-12-02 |
|
|
| **Updated** | 2024-12-02 |
|
|
|
|
---
|
|
|
|
#### Description
|
|
|
|
`Q4_0` repacked with 4 interleaved rows as `IQ4_NL_X4` (see PR #118).
|
|
|
|
PP-512 for LLaMA-3.1-8B for `ARM_NEON` (M2-Max), `Zen4` (Ryzen-7950X) and `AVX2` (Risen-5975WX):
|
|
|
|
| Platform | Threads | Q4_0 | Q4_0_R4 | Speedup |
|
|
| ---: | ---: | ---: | ---: | ---: |
|
|
| ARM_NEON | 8 | 84.57 ± 0.94 | 115.79 ± 0.86 | 1.369 |
|
|
| Zen4 | 16 | 185.89 ± 0.84 | 278.15 ± 0.39 | 1.496 |
|
|
| AVX2. | 32 | 190.73 ± 0.39 | 251.00 ± 0.51 | 1.316 |
|
|
|
|
On `Zen4` `Q4_0_R4` is now the prompt processing champion.
|
|
|
|
Here the hand-written assembly for `Q4_0_4_4` in mainline `llama.cpp` achieves 122.8 t/s on my M2-Max, so beats `Q4_0_R4` by a small margin. My guess is that `Q4_0_4_4` is slightly better because there the `0x88` xor mask (which converts the unsigned 4-bit quants to signed 4-bit quants shifted 4 bits to the left) is already applied. But this trick is only useful for the `ARM` instruction set, and is absolutely not useful on `x86_64`, so I did not use it. |