### 🔀 [#129](https://github.com/ikawrakow/ik_llama.cpp/pull/129) - Q4_K_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-09 |
| **Updated** | 2024-12-09 |

---

#### Description

Follow-up to #118, #119, #120, #121, #122, #123 for `Q4_K`.

After having demonstrated interleaved rows with blocks and super-blocks for `IQ4_XS` in #123, here is the corresponding implementation for `Q4_K`. To avoid an explosion of quantization types, `Q4_K_R4` corresponds to `Q4_K_S` (and there is no `_R4` variant for `Q4_K_M`).
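
To make "interleaved rows" concrete, here is a minimal sketch of the general idea, assuming that `_R4` means rows are packed in groups of 4 and that block `b` of each of the 4 rows is stored contiguously. The function name `interleave_rows` and the string placeholders for blocks are hypothetical. The real `Q4_K_R4` layout also rearranges scales and quants inside the super-blocks, which this sketch does not attempt to reproduce.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def interleave_rows(rows: Sequence[Sequence[T]], group: int = 4) -> List[List[T]]:
    """Pack `group` consecutive rows block by block (illustrative only).

    Each row is a sequence of blocks. In the packed output, block 0 of
    rows 0..group-1 comes first, then block 1 of the same rows, and so on.
    """
    if len(rows) % group != 0:
        raise ValueError("number of rows must be a multiple of the group size")

    packed: List[List[T]] = []
    for start in range(0, len(rows), group):
        chunk = rows[start:start + group]
        n_blocks = len(chunk[0])
        if any(len(row) != n_blocks for row in chunk):
            raise ValueError("all rows in a group must have the same number of blocks")
        # Outer loop over blocks, inner loop over rows: this is the interleaving.
        packed.append([chunk[r][b] for b in range(n_blocks) for r in range(group)])
    return packed


# Four rows with two blocks each; strings stand in for quantized blocks.
rows = [[f"r{r}b{b}" for b in range(2)] for r in range(4)]
print(interleave_rows(rows)[0])
```

With this ordering, a matrix multiplication kernel can load the same block position for 4 rows in one contiguous read and reuse the activation block across all of them.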

We get a massive speedup on `ARM_NEON` (about 1.6x) and a quite significant gain on `AVX2/Zen4` (1.3x to 1.37x). The `Zen4` implementation could probably be optimized further. Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX):

| Platform | Threads | Q4_K_S (t/s) | Q4_K_R4 (t/s) | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 68.73 ± 0.88 | 110.02 ± 1.31 | 1.601 |
| Zen4 | 16 | 198.92 ± 0.69 | 259.19 ± 0.24 | 1.303 |
| AVX2 | 32 | 206.39 ± 0.28 | 282.78 ± 0.54 | 1.370 |
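
For reference, the Speedup column is simply the ratio of the `Q4_K_R4` mean to the `Q4_K_S` mean. The short sketch below recomputes it from the PP-512 means in the table above. It assumes the values are throughput figures where higher is better, and the helper name `speedup` is hypothetical.

```python
def speedup(baseline: float, new: float) -> float:
    """Ratio of new throughput to baseline throughput (higher is better)."""
    return new / baseline


# (Q4_K_S mean, Q4_K_R4 mean) for PP-512, copied from the table above.
pp512 = {
    "ARM_NEON": (68.73, 110.02),
    "Zen4": (198.92, 259.19),
    "AVX2": (206.39, 282.78),
}

speedups = {platform: round(speedup(*means), 3) for platform, means in pp512.items()}
for platform, value in speedups.items():
    print(f"{platform}: {value:.3f}")
```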

Here we gain even for TG. Here are the results for `TG-128` on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | Q4_K_S (t/s) | Q4_K_R4 (t/s) | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 11.38 ± 0.00 | 12.17 ± 0.01 | 1.069 |
| | 4 | 18.08 ± 0.44 | 21.56 ± 0.06 | 1.192 |
| | 8 | 25.02 ± 0.17 | 25.39 ± 0.14 | 1.015 |
| Zen4 | 1 | 5.73 ± 0.01 | 8.95 ± 0.00 | 1.562 |
| | 2 | 10.47 ± 0.01 | 13.37 ± 0.00 | 1.277 |
| | 4 | 13.38 ± 0.63 | 14.03 ± 0.01 | 1.049 |
| AVX2 | 2 | 4.60 ± 0.00 | 7.61 ± 0.00 | 1.654 |
| | 4 | 8.55 ± 0.00 | 12.01 ± 0.00 | 1.403 |
| | 8 | 11.67 ± 0.00 | 13.83 ± 0.00 | 1.185 |