### 🔀 [#55](https://github.com/ikawrakow/ik_llama.cpp/pull/55) - Improve Q5_0 performance on AVX2
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-09-14 |
| **Updated** | 2024-09-14 |
---
#### Description
The main purpose of the [previous PR](https://github.com/ikawrakow/ik_llama.cpp/pull/54) was to try to improve `K*Q` matrix multiplications for flash attention with a `Q8_0`-quantized K-cache. Sadly, the performance improvement we got for `Q8_0` did not translate into better FA performance. It is a rainy Saturday, so I needed something to brighten my day. The approach from that PR applies just as easily to `Q5_0`, so here we are.
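
For context, here is a minimal sketch of what unpacking a `Q5_0` block looks like with AVX2. It is not the actual change in this PR; it only illustrates the format being operated on, assuming the `block_q5_0` layout from `llama.cpp` (16 bytes of packed low nibbles `qs` plus 32 "fifth" bits in `qh`, with each quant `q` decoding to `q - 16` times the block scale). The helper names are mine and purely illustrative.

```c
// Sketch: unpack the 32 quants of one Q5_0 block into signed bytes with AVX2.
// Assumed layout (as in llama.cpp): quant i sits in the low nibble of qs[i],
// quant i+16 in the high nibble, and bit i of the 32-bit qh word is quant i's
// fifth bit. The decoded value of quant q is q - 16.
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Spread the 16 packed nibbles into 32 bytes, each in 0..15.
static inline __m256i nibbles_to_bytes(const uint8_t *qs) {
    const __m128i lo = _mm_loadu_si128((const __m128i *)qs);
    __m256i bytes = _mm256_inserti128_si256(_mm256_castsi128_si256(lo),
                                            _mm_srli_epi16(lo, 4), 1);
    return _mm256_and_si256(bytes, _mm256_set1_epi8(0x0F));
}

// Expand 32 packed bits into 32 bytes: 0xFF where the bit is set, else 0x00.
static inline __m256i bits_to_byte_mask(const uint8_t *qh) {
    uint32_t bits;
    memcpy(&bits, qh, sizeof(bits));
    const __m256i shuf = _mm256_set_epi64x(0x0303030303030303LL, 0x0202020202020202LL,
                                           0x0101010101010101LL, 0x0000000000000000LL);
    __m256i b = _mm256_shuffle_epi8(_mm256_set1_epi32((int)bits), shuf);
    b = _mm256_or_si256(b, _mm256_set1_epi64x(0x7FBFDFEFF7FBFDFELL));
    return _mm256_cmpeq_epi8(b, _mm256_set1_epi64x(-1));
}

// Where the fifth bit is clear, OR-ing 0xF0 into the nibble sign-extends it
// to nibble - 16; where it is set, the value is just the nibble. This yields
// (nibble | 16*bit) - 16 without an explicit subtraction.
static inline __m256i unpack_q5_0(const uint8_t *qs, const uint8_t *qh) {
    const __m256i q   = nibbles_to_bytes(qs);
    const __m256i hi  = bits_to_byte_mask(qh);
    const __m256i ext = _mm256_andnot_si256(hi, _mm256_set1_epi8((char)0xF0));
    return _mm256_or_si256(q, ext);
}

int main(void) {
    uint8_t qs[16], qh[4] = {0xA5, 0x3C, 0x0F, 0x81};   // arbitrary test bits
    for (int i = 0; i < 16; ++i) qs[i] = (uint8_t)(i | ((15 - i) << 4));
    int8_t out[32];
    _mm256_storeu_si256((__m256i *)out, unpack_q5_0(qs, qh));
    for (int i = 0; i < 32; ++i) {                      // scalar reference
        const int nib = i < 16 ? (qs[i] & 0x0F) : (qs[i - 16] >> 4);
        const int bit = (qh[i / 8] >> (i % 8)) & 1;
        if (out[i] != nib + 16 * bit - 16) { printf("mismatch at %d\n", i); return 1; }
    }
    printf("all 32 quants unpacked correctly\n");
    return 0;
}
```

(Compile with `-mavx2`; the self-check in `main` verifies the SIMD path against a scalar decode.)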
The table below compares performance against mainline `llama.cpp` for LLaMA-3.1-8B on a Ryzen-7950X:
| model | backend | threads | test | t/s (llama.cpp) | t/s (PR) | Speedup |
| ------------- | ------- | ------: | ----: | --------------: | ------------: | ------: |
| llama 8B Q5_0 | CPU | 16 | pp512 | 55.72 ± 0.25 | 152.10 ± 0.74 | 2.793 |
| llama 8B Q5_0 | CPU | 2 | tg128 | 5.22 ± 0.01 | 8.88 ± 0.01 | 1.701 |
| llama 8B Q5_0 | CPU | 4 | tg128 | 9.24 ± 0.01 | 11.57 ± 0.00 | 1.252 |