🔀 #136 - Q2_K_R4
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2024-12-11 |
| Updated | 2024-12-11 |
Description
Follow-up to #118, #119, #120, #121, #122, #123, #129, #130, #132, #134, this time for Q2_K.
This completes the R4 implementation for k-quants on ARM_NEON, AVX2, and Zen4.
We get significant performance gains on all platforms. Here is PP-512 for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max), and AVX2 (Ryzen-5975WX):
| Platform | Threads | Q2_K_S | Q2_K_R4 | Speedup |
|---|---|---|---|---|
| ARM_NEON | 8 | 73.79 ± 1.92 | 109.07 ± 0.58 | 1.478 |
| Zen4 | 16 | 205.95 ± 0.77 | 256.19 ± 0.26 | 1.244 |
| AVX2 | 32 | 214.42 ± 0.54 | 286.91 ± 0.63 | 1.338 |
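
For readers who want to check the Speedup column, here is a minimal sketch that recomputes it from the PP-512 means in the table above. It assumes the speedup is simply the ratio of the mean Q2_K_R4 value to the mean Q2_K_S value, with the ± spread ignored; the names `pp512` and `speedup` are illustrative only and not part of the PR.

```python
# Mean PP-512 throughput copied from the table above as (Q2_K_S, Q2_K_R4).
# Assumption: the Speedup column is the plain ratio of these means; the
# "±" spread is ignored here.
pp512 = {
    "ARM_NEON": (73.79, 109.07),
    "Zen4": (205.95, 256.19),
    "AVX2": (214.42, 286.91),
}


def speedup(baseline: float, r4: float) -> float:
    """Return the ratio of Q2_K_R4 throughput to Q2_K_S throughput."""
    return r4 / baseline


# Round to three decimals to match the precision used in the table.
speedups = {platform: round(speedup(*means), 3) for platform, means in pp512.items()}

for platform, value in speedups.items():
    print(f"{platform}: {value:.3f}")
```

The same ratio applies to the TG-128 rows; only the input pairs change.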
Because Q2_K is smaller than the other k-quants, the CPU can do more work during TG before the available memory bandwidth saturates. Hence, we also get non-negligible TG performance gains on all platforms.
Here are the results for TG-128 on LLaMA-3.1-8B with different numbers of threads:
| Platform | Threads | Q2_K_S | Q2_K_R4 | Speedup |
|---|---|---|---|---|
| ARM_NEON | 2 | 10.34 ± 0.01 | 12.81 ± 0.01 | 1.239 |
| | 4 | 19.32 ± 0.02 | 23.40 ± 0.08 | 1.211 |
| | 8 | 32.36 ± 0.59 | 36.02 ± 0.40 | 1.113 |
| Zen4 | 1 | 6.60 ± 0.02 | 9.08 ± 0.12 | 1.376 |
| | 2 | 12.12 ± 0.01 | 16.40 ± 0.00 | 1.353 |
| | 4 | 19.12 ± 0.56 | 20.72 ± 0.19 | 1.084 |
| AVX2 | 2 | 5.93 ± 0.02 | 10.16 ± 0.30 | 1.713 |
| | 4 | 11.24 ± 0.00 | 17.59 ± 0.01 | 1.565 |
| | 8 | 18.62 ± 0.03 | 21.44 ± 0.00 | 1.151 |
It is actually too bad that Q2_K is such a low-quality quantization, as its performance is really good. Perhaps I should try to improve it? When I developed it, it was much better than any other 2-bit attempt at the time, so I was quite pleased with the result. But with today's knowledge that we can do much better at 2 bpw, perhaps a fresh look could be useful.