mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-05-02 12:21:42 +00:00
🔀 #516 - Much faster iq3_xxs GEMM via repacking to q8_0_r8 (AVX2)
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2025-06-11 |
| Updated | 2025-06-11 |
Description
This PR is a follow-up to #515 and applies the same technique to IQ3_XXS. We see a nearly 3X increase in prompt processing speed compared to IQ3_XXS on the main branch (2.6X to 2.9X across the sweep below), and about 2X compared to IQ3_XXS_R4.
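Sweep-bench for pure IQ3_XXS quantization of LLaMA-3.1-8B on a Ryzen-7950X CPU (PP = prompt tokens per batch, TG = generated tokens, N_KV = tokens already in the KV cache; T_* are times in seconds, S_* are speeds in tokens per second):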
IQ3_XXS, main branch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 5.023 | 101.94 | 7.365 | 17.38 |
| 512 | 128 | 512 | 5.281 | 96.96 | 8.088 | 15.83 |
| 512 | 128 | 1024 | 5.170 | 99.03 | 7.977 | 16.05 |
| 512 | 128 | 1536 | 5.324 | 96.16 | 7.942 | 16.12 |
| 512 | 128 | 2048 | 5.389 | 95.02 | 8.043 | 15.91 |
IQ3_XXS_R4, main branch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 3.836 | 133.47 | 7.675 | 16.68 |
| 512 | 128 | 512 | 3.687 | 138.87 | 8.279 | 15.46 |
| 512 | 128 | 1024 | 3.805 | 134.57 | 8.245 | 15.53 |
| 512 | 128 | 1536 | 3.906 | 131.08 | 8.252 | 15.51 |
| 512 | 128 | 2048 | 4.076 | 125.61 | 8.545 | 14.98 |
IQ3_XXS, PR
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 1.730 | 296.01 | 7.641 | 16.75 |
| 512 | 128 | 512 | 1.807 | 283.30 | 8.333 | 15.36 |
| 512 | 128 | 1024 | 1.896 | 269.98 | 8.070 | 15.86 |
| 512 | 128 | 1536 | 1.978 | 258.78 | 8.481 | 15.09 |
| 512 | 128 | 2048 | 2.062 | 248.32 | 8.514 | 15.03 |
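The speedup figures quoted in the description can be checked directly from the S_PP columns of the three tables above. The following is a minimal sketch of that arithmetic, not part of the PR itself; the names `s_pp` and `speedups` are made up for illustration, and the numbers are copied verbatim from the tables.

```python
# S_PP (prompt-processing speed, tokens/s) copied from the three tables above,
# keyed by N_KV. The dictionary and function names are illustrative only.
s_pp = {
    "IQ3_XXS, main branch": {
        0: 101.94, 512: 96.96, 1024: 99.03, 1536: 96.16, 2048: 95.02,
    },
    "IQ3_XXS_R4, main branch": {
        0: 133.47, 512: 138.87, 1024: 134.57, 1536: 131.08, 2048: 125.61,
    },
    "IQ3_XXS, PR": {
        0: 296.01, 512: 283.30, 1024: 269.98, 1536: 258.78, 2048: 248.32,
    },
}


def speedups(baseline, candidate="IQ3_XXS, PR"):
    """Return the PP speed ratio candidate/baseline for each N_KV."""
    return {
        n_kv: s_pp[candidate][n_kv] / s_pp[baseline][n_kv]
        for n_kv in s_pp[baseline]
    }


if __name__ == "__main__":
    for baseline in ("IQ3_XXS, main branch", "IQ3_XXS_R4, main branch"):
        print(f"PR vs {baseline}:")
        for n_kv, ratio in speedups(baseline).items():
            print(f"  N_KV={n_kv:5d}  {ratio:.2f}x")
```

With these inputs the ratio against IQ3_XXS on the main branch stays between roughly 2.6 and 2.9, and the ratio against IQ3_XXS_R4 stays close to 2, with the largest gain at N_KV = 0. Token generation (S_TG) is essentially unchanged across the three tables, so it is left out of the comparison.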