🔀 #517 - IQ1_S: much faster CPU prompt processing

Author	`ikawrakow`
State	❌ Closed
Created	2025-06-11
Updated	2025-06-11

Description

This PR is a follow-up to #515 and #516, and applies the same technique to `IQ1_S`. We see a roughly 1.5X to 1.8X increase in prompt processing speed compared to `IQ1_S` and `IQ1_S_R4` on the main branch.

Sweep-bench results for the `IQ1_S` quantization of LLaMA-3.1-8B on a Ryzen-7950X CPU:

IQ1_S, main branch

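| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 512 | 128 | 0 | 3.272 | 156.47 | 4.605 | 27.79 |
| 512 | 128 | 512 | 3.351 | 152.77 | 5.092 | 25.14 |
| 512 | 128 | 1024 | 3.402 | 150.52 | 5.084 | 25.18 |
| 512 | 128 | 1536 | 3.677 | 139.25 | 5.201 | 24.61 |
| 512 | 128 | 2048 | 3.586 | 142.79 | 5.515 | 23.21 |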

IQ1_S_R4, main branch

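| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 512 | 128 | 0 | 3.101 | 165.10 | 4.543 | 28.18 |
| 512 | 128 | 512 | 3.166 | 161.74 | 4.836 | 26.47 |
| 512 | 128 | 1024 | 3.309 | 154.75 | 5.282 | 24.23 |
| 512 | 128 | 1536 | 3.348 | 152.92 | 5.093 | 25.13 |
| 512 | 128 | 2048 | 3.447 | 148.55 | 5.265 | 24.31 |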

IQ1_S, PR

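| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 512 | 128 | 0 | 1.855 | 275.94 | 4.643 | 27.57 |
| 512 | 128 | 512 | 1.940 | 263.87 | 5.056 | 25.32 |
| 512 | 128 | 1024 | 2.188 | 234.05 | 5.099 | 25.10 |
| 512 | 128 | 1536 | 2.097 | 244.20 | 5.112 | 25.04 |
| 512 | 128 | 2048 | 2.184 | 234.42 | 5.368 | 23.85 |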
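
To make the comparison across the three tables easier to read, here is a minimal Python sketch that computes the prompt-processing speedup of the PR over each main-branch variant, row by row. The `S_PP` values are copied from the tables above. The names `S_PP`, `N_KV` and `speedups` are illustrative only and are not part of ik_llama.cpp or its sweep-bench tool.

```python
# S_PP (prompt-processing tokens/s) copied from the three tables above,
# one value per N_KV row. All names here are hypothetical, for illustration.
N_KV = [0, 512, 1024, 1536, 2048]
S_PP = {
    "IQ1_S, main": [156.47, 152.77, 150.52, 139.25, 142.79],
    "IQ1_S_R4, main": [165.10, 161.74, 154.75, 152.92, 148.55],
    "IQ1_S, PR": [275.94, 263.87, 234.05, 244.20, 234.42],
}


def speedups(baseline, candidate):
    """Return candidate/baseline throughput ratios, one per N_KV row."""
    return [c / b for b, c in zip(baseline, candidate)]


vs_iq1_s = speedups(S_PP["IQ1_S, main"], S_PP["IQ1_S, PR"])
vs_iq1_s_r4 = speedups(S_PP["IQ1_S_R4, main"], S_PP["IQ1_S, PR"])

for n_kv, a, b in zip(N_KV, vs_iq1_s, vs_iq1_s_r4):
    print(f"N_KV={n_kv:5d}  vs IQ1_S: {a:.2f}x  vs IQ1_S_R4: {b:.2f}x")
```

On these numbers the ratios fall between about 1.5x and 1.8x, depending on `N_KV` and on which main-branch variant serves as the baseline. The token generation columns (`S_TG t/s`) are essentially unchanged between the main branch and the PR.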
