Mirror of https://github.com/ikawrakow/ik_llama.cpp.git (synced 2026-01-26 17:20:01 +00:00)
🔀 #517 - IQ1_S: much faster CPU prompt processing
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2025-06-11 |
| Updated | 2025-06-11 |
Description
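This PR is a follow-up to #515 and #516 and applies the same technique to IQ1_S. We see a nearly 2X increase in prompt processing speed compared to IQ1_S and IQ1_S_R4 on the main branch (1.55-1.76X and 1.51-1.67X, respectively, in the sweep below). Token generation speed is essentially unchanged.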
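Sweep-bench results for IQ1_S quantization of LLaMA-3.1-8B on a Ryzen-7950X CPU. PP and TG are the numbers of prompt tokens processed and tokens generated per measurement, N_KV is the number of tokens already in the KV cache, T_PP/T_TG are times in seconds, and S_PP/S_TG are speeds in tokens per second: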
IQ1_S, main branch
| PP | TG | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s) |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 3.272 | 156.47 | 4.605 | 27.79 |
| 512 | 128 | 512 | 3.351 | 152.77 | 5.092 | 25.14 |
| 512 | 128 | 1024 | 3.402 | 150.52 | 5.084 | 25.18 |
| 512 | 128 | 1536 | 3.677 | 139.25 | 5.201 | 24.61 |
| 512 | 128 | 2048 | 3.586 | 142.79 | 5.515 | 23.21 |
IQ1_S_R4, main branch
| PP | TG | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s) |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 3.101 | 165.10 | 4.543 | 28.18 |
| 512 | 128 | 512 | 3.166 | 161.74 | 4.836 | 26.47 |
| 512 | 128 | 1024 | 3.309 | 154.75 | 5.282 | 24.23 |
| 512 | 128 | 1536 | 3.348 | 152.92 | 5.093 | 25.13 |
| 512 | 128 | 2048 | 3.447 | 148.55 | 5.265 | 24.31 |
IQ1_S, PR
| PP | TG | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s) |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 1.855 | 275.94 | 4.643 | 27.57 |
| 512 | 128 | 512 | 1.940 | 263.87 | 5.056 | 25.32 |
| 512 | 128 | 1024 | 2.188 | 234.05 | 5.099 | 25.10 |
| 512 | 128 | 1536 | 2.097 | 244.20 | 5.112 | 25.04 |
| 512 | 128 | 2048 | 2.184 | 234.42 | 5.368 | 23.85 |
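As a quick check on the speedup figures, the short Python sketch below computes the PR's prompt-processing speedup at each N_KV from the S_PP columns of the three tables. The throughput values are copied from the tables. The dictionary names and the `speedups` helper are hypothetical and exist only for this illustration.

```python
# Prompt-processing throughput (S_PP, tokens/s) copied from the three tables,
# keyed by N_KV (tokens already in the KV cache when the measurement starts).
iq1_s_main = {0: 156.47, 512: 152.77, 1024: 150.52, 1536: 139.25, 2048: 142.79}
iq1_s_r4_main = {0: 165.10, 512: 161.74, 1024: 154.75, 1536: 152.92, 2048: 148.55}
iq1_s_pr = {0: 275.94, 512: 263.87, 1024: 234.05, 1536: 244.20, 2048: 234.42}


def speedups(new, baseline):
    """Return {n_kv: new / baseline} for every N_KV present in both tables."""
    return {n_kv: new[n_kv] / baseline[n_kv] for n_kv in baseline if n_kv in new}


# Hypothetical helper usage: PR vs. each main-branch variant.
vs_main = speedups(iq1_s_pr, iq1_s_main)
vs_r4 = speedups(iq1_s_pr, iq1_s_r4_main)

for n_kv in sorted(vs_main):
    print(f"N_KV={n_kv:5d}  vs IQ1_S: {vs_main[n_kv]:.2f}x  vs IQ1_S_R4: {vs_r4[n_kv]:.2f}x")
```

Every row processes the same 512-token batch, so S_PP is 512 / T_PP. Dividing the S_PP values is therefore equivalent to dividing the T_PP values the other way round, up to rounding in the reported numbers.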