### 🔀 [#132](https://github.com/ikawrakow/ik_llama.cpp/pull/132) - Q5_K_R4 | **Author** | `ikawrakow` | | :--- | :--- | | **State** | ❌ **Closed** | | **Created** | 2024-12-10 | | **Updated** | 2024-12-10 | --- #### Description Follow up of #118, #119, #120, #121, #122, #123, #129, #130 for `Q5_K`. We get a large speedup on `ARM_NEON` and non-negligible gains on `AVX2/Zen4`. Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX) | Platform | Threads | Q5_K | Q5_K_R4 | Speedup | | ---: | ---: | ---: | ---: | ---: | | ARM_NEON | 8 | 61.07 ± 0.95 | 96.13 ± 2.38 | 1.574 | | Zen4 | 16 | 188.73 ± 0.75 | 248.30 ± 0.29 | 1.316 | | AVX2 | 32 | 188.11 ± 0.29 | 269.18 ± 0.40 | 1.431 | On `AVX2/Zen4` we gain even for TG. Here results for TG-128 on LLaMA-3.1-8B with different numbers of threads: | Platform | Threads | Q6_K | Q6_K_R4 | Speedup | | ---: | ---: | ---: | ---: | ---: | | Zen4 | 1 | 5.12 ± 0.00 | 7.07 ± 0.01 | 1.380 | | | 2 | 9.31 ± 0.00 | 11.54 ± 0.0 | 1.240 | | | 4 | 11.33 ± 0.37 | 11.89 ± 0.00 | 1.049 | | AVX2 | 2 | 4.04 ± 0.00 | 6.40 ± 0.00 | 1.584 | | | 4 | 7.57 ± 0.00 | 9.95 ± 0.00 | 1.314 | | | 8 | 9.75 ± 0.00 | 11.00 ± 0.00 | 1.128 | I decided to check the current state of mainline `llama.cpp` for `Q5_K_S`. Hahaha - here is what we get on my M2-Max (`build: 7736837d (4274)`) | model | size | params | backend | threads | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: | | llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 8 | pp512 | 27.69 ± 0.09 | | llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 2 | tg128 | 6.39 ± 0.01 | | llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 4 | tg128 | 12.18 ± 0.02 | | llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 8 | tg128 | 19.68 ± 0.64 | The performance gap in prompt processing for `Q5_K` has now grown to 3.5X, and it is ~30% slower for TG with 2 threads. Here is what I get on my Ryzen-7950X (`build: 26a8406b (4295)`) | model | size | params | backend | threads | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: | | llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 16 | pp512 | 75.88 ± 0.26 | | llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 1 | tg128 | 4.10 ± 0.00 | | llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 2 | tg128 | 7.66 ± 0.01 | | llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 4 | tg128 | 11.26 ± 0.00 | | llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 8 | tg128 | 11.20 ± 0.22 | 3.26X slower for prompt processing, 72%/51% slower for TG at 1/2 thread.