### 🔀 [#136](https://github.com/ikawrakow/ik_llama.cpp/pull/136) - Q2_K_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-11 |
| **Updated** | 2024-12-11 |

---

#### Description

Follow up of #118, #119, #120, #121, #122, #123, #129, #130, #132, #134  for `Q2_K`. 

This completes R4 implementation for k-quants on `ARM_NEON`, `AVX2`, and `Zen4`.

We get signifiant performance gains on all platforms.  Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX)

| Platform |  Threads | Q2_K_S | Q2_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON |  8 |  73.79 ± 1.92  | 109.07 ± 0.58 | 1.478 |
| Zen4            | 16 | 205.95 ± 0.77  | 256.19 ± 0.26  | 1.244 |
| AVX2           | 32 | 214.42 ± 0.54 |  286.91 ± 0.63  | 1.338 |

As `Q2_K` is smaller than other k-quants, here the CPU can do more work before available memory bandwidth saturates when running TG. Hence, we get non-negligible performance gains on all platforms also for TG. 
Here results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform |  Threads | Q2_K_S | Q2_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 10.34 ± 0.01 | 12.81 ± 0.01 | 1.239 |
|                      | 4 | 19.32 ± 0.02 | 23.40 ± 0.08 | 1.211 |
|                      | 8 | 32.36 ± 0.59 | 36.02 ± 0.40 | 1.113 |
| Zen4            | 1 |  6.60 ± 0.02  | 9.08 ± 0.12  |  1.376 |
|                      | 2 |  12.12 ± 0.01 | 16.40 ± 0.00  |  1.353 |
|                      | 4 |  19.12 ± 0.56  | 20.72 ± 0.19  |  1.084 |
| AVX2           | 2 | 5.93 ± 0.02   | 10.16 ± 0.30  | 1.713 |
|                     | 4 | 11.24 ± 0.00    |  17.59 ± 0.01 | 1.565 |
|                     | 8 |  18.62 ± 0.03  | 21.44 ± 0.00  | 1.151 |

It is actually too bad `Q2_K` is such a low quality quantization as performance is really good. Perhaps I should try to improve it? When I was developing it back then it was much better than any other 2-bit attempt at the time, so I was quite pleased with the result. But with today's knowledge that we can do much better at 2 bpw, perhaps a fresh look could be useful.