ik_llama.cpp/github-data/pull_requests/132-Q5_K_R4.md
2025-07-22 18:18:40 +02:00

### 🔀 [#132](https://github.com/ikawrakow/ik_llama.cpp/pull/132) - Q5_K_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-10 |
| **Updated** | 2024-12-10 |
---
#### Description
Follow up of #118, #119, #120, #121, #122, #123, #129, #130 for `Q5_K`.
We get a large speedup on `ARM_NEON` and non-negligible gains on `AVX2/Zen4`. Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX)
| Platform | Threads | Q5_K | Q5_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 61.07 ± 0.95 | 96.13 ± 2.38 | 1.574 |
| Zen4 | 16 | 188.73 ± 0.75 | 248.30 ± 0.29 | 1.316 |
| AVX2 | 32 | 188.11 ± 0.29 | 269.18 ± 0.40 | 1.431 |
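Assuming the same interleaving scheme as the earlier `_R4` PRs in this series (#118 and friends), the speedup comes from repacking the quantized blocks of 4 consecutive rows into a row-interleaved layout, so a matrix-multiplication kernel can load one block from each of 4 rows with a single contiguous read. A minimal, hypothetical sketch of the repacking order (names are illustrative, not the actual implementation):

```python
def repack_r4(rows):
    """Interleave the blocks of 4 rows:
    out = [r0b0, r1b0, r2b0, r3b0, r0b1, r1b1, ...]"""
    assert len(rows) == 4, "R4 repacking operates on groups of 4 rows"
    nblocks = len(rows[0])
    out = []
    for ib in range(nblocks):        # for each block index...
        for r in range(4):           # ...take that block from each row
            out.append(rows[r][ib])
    return out

# 4 rows of 3 quantized blocks each, labeled "r<row>b<block>"
rows = [[f"r{r}b{b}" for b in range(3)] for r in range(4)]
print(repack_r4(rows)[:4])  # ['r0b0', 'r1b0', 'r2b0', 'r3b0']
```

With this layout the kernel walks the interleaved buffer once, multiplying each group of 4 blocks against the same activation block, instead of striding across 4 separate rows.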
On `AVX2/Zen4` we gain even for TG. Here are results for TG-128 on LLaMA-3.1-8B with different numbers of threads:
| Platform | Threads | Q5_K | Q5_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| Zen4 | 1 | 5.12 ± 0.00 | 7.07 ± 0.01 | 1.380 |
| | 2 | 9.31 ± 0.00 | 11.54 ± 0.00 | 1.240 |
| | 4 | 11.33 ± 0.37 | 11.89 ± 0.00 | 1.049 |
| AVX2 | 2 | 4.04 ± 0.00 | 6.40 ± 0.00 | 1.584 |
| | 4 | 7.57 ± 0.00 | 9.95 ± 0.00 | 1.314 |
| | 8 | 9.75 ± 0.00 | 11.00 ± 0.00 | 1.128 |
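The `Speedup` columns in both tables are simply the ratio of the two throughput measurements. A quick check against the PP-512 numbers above (values copied from the first table):

```python
# Sanity-checking the Speedup column: speedup = (R4 t/s) / (base t/s).
pp512 = {
    "ARM_NEON": (61.07, 96.13),   # (Q5_K, Q5_K_R4), 8 threads
    "Zen4":     (188.73, 248.30), # 16 threads
    "AVX2":     (188.11, 269.18), # 32 threads
}
for name, (base, r4) in pp512.items():
    print(f"{name}: {r4 / base:.3f}")  # matches the table: 1.574, 1.316, 1.431
```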
I decided to check the current state of mainline `llama.cpp` for `Q5_K_S`.
Hahaha - here is what we get on my M2-Max (`build: 7736837d (4274)`)
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 8 | pp512 | 27.69 ± 0.09 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 2 | tg128 | 6.39 ± 0.01 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 4 | tg128 | 12.18 ± 0.02 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 8 | tg128 | 19.68 ± 0.64 |
The performance gap in prompt processing for `Q5_K` has now grown to 3.5X, and it is ~30% slower for TG with 2 threads.
Here is what I get on my Ryzen-7950X (`build: 26a8406b (4295)`)
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 16 | pp512 | 75.88 ± 0.26 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 1 | tg128 | 4.10 ± 0.00 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 2 | tg128 | 7.66 ± 0.01 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 4 | tg128 | 11.26 ± 0.00 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 8 | tg128 | 11.20 ± 0.22 |
Mainline is 3.27X slower for prompt processing, and 72%/51% slower for TG at 1/2 threads.
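For reference, these closing ratios follow directly from the Ryzen-7950X numbers above ("ik" values are `Q5_K_R4` from the Zen4 rows, "main" values are mainline `Q5_K_S`; all copied from the tables):

```python
# Cross-checking the closing claims from the benchmark tables.
ik_pp,  main_pp  = 248.30, 75.88   # pp512, 16 threads
ik_tg1, main_tg1 = 7.07,   4.10    # tg128, 1 thread
ik_tg2, main_tg2 = 11.54,  7.66    # tg128, 2 threads

print(f"{ik_pp / main_pp:.2f}X")                       # prompt-processing gap
print(f"{(ik_tg1 - main_tg1) / main_tg1 * 100:.0f}%")  # TG, 1 thread
print(f"{(ik_tg2 - main_tg2) / main_tg2 * 100:.0f}%")  # TG, 2 threads
```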