🔀 #517 - IQ1_S: much faster CPU prompt processing

Author ikawrakow
State Closed
Created 2025-06-11
Updated 2025-06-11

Description

This PR is a follow-up to #515 and #516 and applies the same technique to `IQ1_S`. We see a nearly 2X increase in prompt processing speed compared to `IQ1_S` and `IQ1_S_R4` on the main branch.

Sweep-bench for `IQ1_S` quantization of LLaMA-3.1-8B on a Ryzen-7950X CPU:

IQ1_S, main branch

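| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 3.272 | 156.47 | 4.605 | 27.79 |
| 512 | 128 | 512 | 3.351 | 152.77 | 5.092 | 25.14 |
| 512 | 128 | 1024 | 3.402 | 150.52 | 5.084 | 25.18 |
| 512 | 128 | 1536 | 3.677 | 139.25 | 5.201 | 24.61 |
| 512 | 128 | 2048 | 3.586 | 142.79 | 5.515 | 23.21 |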

IQ1_S_R4, main branch

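| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 3.101 | 165.10 | 4.543 | 28.18 |
| 512 | 128 | 512 | 3.166 | 161.74 | 4.836 | 26.47 |
| 512 | 128 | 1024 | 3.309 | 154.75 | 5.282 | 24.23 |
| 512 | 128 | 1536 | 3.348 | 152.92 | 5.093 | 25.13 |
| 512 | 128 | 2048 | 3.447 | 148.55 | 5.265 | 24.31 |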

IQ1_S, PR

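| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 1.855 | 275.94 | 4.643 | 27.57 |
| 512 | 128 | 512 | 1.940 | 263.87 | 5.056 | 25.32 |
| 512 | 128 | 1024 | 2.188 | 234.05 | 5.099 | 25.10 |
| 512 | 128 | 1536 | 2.097 | 244.20 | 5.112 | 25.04 |
| 512 | 128 | 2048 | 2.184 | 234.42 | 5.368 | 23.85 |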
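
To put a number on the speedup at each context length, the prompt-processing throughput (S_PP) columns of the three tables above can be divided row by row. The following is a small sketch for reading the results, not part of the PR itself; the variable and function names are made up for illustration, and the values are copied from the tables.

```python
# S_PP (prompt-processing tokens/s) copied from the three tables, one entry per N_KV row.
n_kv = [0, 512, 1024, 1536, 2048]
s_pp_main_iq1_s = [156.47, 152.77, 150.52, 139.25, 142.79]     # IQ1_S, main branch
s_pp_main_iq1_s_r4 = [165.10, 161.74, 154.75, 152.92, 148.55]  # IQ1_S_R4, main branch
s_pp_pr_iq1_s = [275.94, 263.87, 234.05, 244.20, 234.42]       # IQ1_S, PR


def speedups(baseline, candidate):
    """Return the per-row throughput ratio candidate / baseline."""
    return [c / b for b, c in zip(baseline, candidate)]


vs_iq1_s = speedups(s_pp_main_iq1_s, s_pp_pr_iq1_s)
vs_iq1_s_r4 = speedups(s_pp_main_iq1_s_r4, s_pp_pr_iq1_s)

for kv, a, b in zip(n_kv, vs_iq1_s, vs_iq1_s_r4):
    print(f"N_KV={kv:5d}  vs IQ1_S: {a:.2f}x  vs IQ1_S_R4: {b:.2f}x")
```

On these numbers the PR is roughly 1.5X to 1.8X faster in prompt processing than either main-branch variant, with the largest gain at N_KV = 0. Token generation speed (S_TG) is essentially unchanged across the three tables.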