### 🔀 [#136](https://github.com/ikawrakow/ik_llama.cpp/pull/136) - Q2_K_R4 | **Author** | `ikawrakow` | | :--- | :--- | | **State** | ❌ **Closed** | | **Created** | 2024-12-11 | | **Updated** | 2024-12-11 | --- #### Description Follow up of #118, #119, #120, #121, #122, #123, #129, #130, #132, #134 for `Q2_K`. This completes R4 implementation for k-quants on `ARM_NEON`, `AVX2`, and `Zen4`. We get signifiant performance gains on all platforms. Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX) | Platform | Threads | Q2_K_S | Q2_K_R4 | Speedup | | ---: | ---: | ---: | ---: | ---: | | ARM_NEON | 8 | 73.79 ± 1.92 | 109.07 ± 0.58 | 1.478 | | Zen4 | 16 | 205.95 ± 0.77 | 256.19 ± 0.26 | 1.244 | | AVX2 | 32 | 214.42 ± 0.54 | 286.91 ± 0.63 | 1.338 | As `Q2_K` is smaller than other k-quants, here the CPU can do more work before available memory bandwidth saturates when running TG. Hence, we get non-negligible performance gains on all platforms also for TG. Here results for TG-128 on LLaMA-3.1-8B with different numbers of threads: | Platform | Threads | Q2_K_S | Q2_K_R4 | Speedup | | ---: | ---: | ---: | ---: | ---: | | ARM_NEON | 2 | 10.34 ± 0.01 | 12.81 ± 0.01 | 1.239 | | | 4 | 19.32 ± 0.02 | 23.40 ± 0.08 | 1.211 | | | 8 | 32.36 ± 0.59 | 36.02 ± 0.40 | 1.113 | | Zen4 | 1 | 6.60 ± 0.02 | 9.08 ± 0.12 | 1.376 | | | 2 | 12.12 ± 0.01 | 16.40 ± 0.00 | 1.353 | | | 4 | 19.12 ± 0.56 | 20.72 ± 0.19 | 1.084 | | AVX2 | 2 | 5.93 ± 0.02 | 10.16 ± 0.30 | 1.713 | | | 4 | 11.24 ± 0.00 | 17.59 ± 0.01 | 1.565 | | | 8 | 18.62 ± 0.03 | 21.44 ± 0.00 | 1.151 | It is actually too bad `Q2_K` is such a low quality quantization as performance is really good. Perhaps I should try to improve it? When I was developing it back then it was much better than any other 2-bit attempt at the time, so I was quite pleased with the result. But with today's knowledge that we can do much better at 2 bpw, perhaps a fresh look could be useful.