mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-04-25 17:09:22 +00:00
1.8 KiB
1.8 KiB
🔀 #453 - Faster IQ3_KT and IQ4_KT
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2025-05-24 |
| Updated | 2025-05-24 |
Description
The PR improves AVX2 performance for the trellis quants IQ3_KT and IQ4_KT recently added in PR #441.
The results below are for LLaMA-3.1-8B on a Ryzen-5975WX CPU.
IQ3_KT
| N_KV | S_PP t/s (main) | S_PP t/s (PR) | PP speedup | S_TG t/s (main) | S_TG t/s (PR) | TG speedup |
|---|---|---|---|---|---|---|
| 0 | 61.98 | 71.59 | 1.155 | 11.17 | 13.30 | 1.191 |
| 512 | 61.27 | 70.79 | 1.155 | 11.10 | 13.19 | 1.188 |
| 1024 | 60.48 | 69.93 | 1.156 | 11.04 | 13.10 | 1.187 |
| 1536 | 59.94 | 69.15 | 1.154 | 10.95 | 12.96 | 1.184 |
| 2048 | 59.48 | 68.55 | 1.152 | 10.87 | 12.85 | 1.182 |
IQ4_KT
| N_KV | S_PP t/s (main) | S_PP t/s (PR) | PP speedup | S_TG t/s (main) | S_TG t/s (PR) | TG speedup |
|---|---|---|---|---|---|---|
| 0 | 44.32 | 64.91 | 1.465 | 9.36 | 11.69 | 1.249 |
| 512 | 43.90 | 64.12 | 1.461 | 9.26 | 11.56 | 1.248 |
| 1024 | 43.60 | 63.39 | 1.454 | 9.19 | 11.47 | 1.248 |
| 1536 | 43.32 | 62.86 | 1.451 | 9.12 | 11.37 | 1.247 |
| 2048 | 43.07 | 62.37 | 1.448 | 9.06 | 11.28 | 1.245 |
CPU performance is still much lower than other quantization types. But memory bandwidth is far from saturated, so PP and TG will be better on a faster CPU with more cores.