ik_llama.cpp/github-data/pull_requests/119 - Q4_0_R4.md

### 🔀 [#119](https://github.com/ikawrakow/ik_llama.cpp/pull/119) - Q4_0_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-02 |
| **Updated** | 2024-12-02 |

---

#### Description

`Q4_0` repacked with 4 interleaved rows as `IQ4_NL_X4` (see PR #118).

PP-512 for LLaMA-3.1-8B for `ARM_NEON` (M2-Max), `Zen4` (Ryzen-7950X) and `AVX2` (Risen-5975WX):

| Platform |  Threads | Q4_0 | Q4_0_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON |  8 |   84.57 ± 0.94 | 115.79 ± 0.86 | 1.369 |
| Zen4            | 16 | 185.89 ± 0.84 | 278.15 ± 0.39 | 1.496 |
| AVX2.          | 32 | 190.73 ± 0.39 | 251.00 ± 0.51 | 1.316 |

On `Zen4` `Q4_0_R4` is now the prompt processing champion.

Here the hand-written assembly for `Q4_0_4_4` in mainline `llama.cpp` achieves 122.8 t/s on my M2-Max, so beats `Q4_0_R4` by a small margin. My guess is that `Q4_0_4_4` is slightly better because there the `0x88` xor mask (which converts the unsigned 4-bit quants to signed 4-bit quants shifted 4 bits to the left) is already applied. But this trick is only useful for the `ARM` instruction set, and is absolutely not useful on `x86_64`, so I did not use it.