ikawrakow/ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-01-26 17:20:01 +00:00

🔀 #525 - Faster CPU prompt processing for Q4_K and Q5_K

Author	`ikawrakow`
State	❌ Closed
Created	2025-06-12
Updated	2025-06-13

Description

These two quantization types are quite popular, so I thought it made sense to improve their performance. The repacked variants Q4_K_R4 and Q5_K_R4 do not have a CUDA implementation, so repacking is not useful in a hybrid CPU/GPU setup, where it may be better to offload tensors stored in RAM to the GPU when processing large batches.

The PR uses the same trick as #515, #516, #517, #518. When processing batches >= 32 tokens, Q4_K or Q5_K quantized tensors are repacked on-the-fly to Q8_1_R8.
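The dispatch rule described above can be sketched as follows. This is only an illustration of the stated threshold, not the actual ik_llama.cpp code: the function name, the set of types, and the returned labels are all hypothetical.

```python
# Hypothetical sketch of the batch-size rule stated in the PR description.
# None of these names come from ik_llama.cpp; they are illustrative only.

REPACK_MIN_BATCH = 32                  # threshold given in the description
REPACK_ON_THE_FLY = {"Q4_K", "Q5_K"}   # types this PR is said to cover


def pick_matmul_path(tensor_type: str, n_tokens: int) -> str:
    """Return which (hypothetical) matmul path a batch would take."""
    if tensor_type in REPACK_ON_THE_FLY and n_tokens >= REPACK_MIN_BATCH:
        # Large batch: the one-time repacking cost is spread over many tokens.
        return "repack_to_Q8_1_R8"
    # Small batch (for example single-token generation): use the tensor as stored.
    return "direct"


if __name__ == "__main__":
    for n in (1, 31, 32, 512):
        print(n, pick_matmul_path("Q4_K", n))
```

Under this reading, token generation with a batch of one token would never take the repacking path, which would be consistent with the TG columns below being essentially unchanged between the main branch and the PR.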

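Here are some sweep-bench results for LLaMA-3.1-8B-Instruct on a Ryzen-7950X CPU. In each table, PP and TG are the number of prompt and generated tokens, N_KV is the number of tokens already in the KV cache, T_PP and T_TG are times in seconds, and S_PP and S_TG are speeds in tokens per second.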

Q4_K, main branch

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
512	128	0	2.853	179.49	9.792	13.07
512	128	512	2.745	186.52	10.119	12.65
512	128	1024	2.806	182.49	10.118	12.65
512	128	1536	2.905	176.22	10.273	12.46
512	128	2048	3.434	149.08	10.492	12.20

Q4_K_R4

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
512	128	0	2.015	254.10	9.808	13.05
512	128	512	2.051	249.65	9.992	12.81
512	128	1024	2.131	240.28	10.145	12.62
512	128	1536	2.247	227.84	10.297	12.43
512	128	2048	2.338	219.02	10.478	12.22

Q4_K, PR

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
512	128	0	1.903	269.00	9.719	13.17
512	128	512	1.974	259.37	9.975	12.83
512	128	1024	2.004	255.47	10.024	12.77
512	128	1536	2.351	217.73	10.033	12.76
512	128	2048	2.114	242.19	10.150	12.61

Q5_K, main branch

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
512	128	0	2.894	176.89	11.650	10.99
512	128	512	3.461	147.93	11.760	10.88
512	128	1024	2.986	171.44	11.818	10.83
512	128	1536	3.026	169.22	11.875	10.78
512	128	2048	3.172	161.39	11.967	10.70

Q5_K_R4

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
512	128	0	2.149	238.30	11.712	10.93
512	128	512	2.189	233.89	11.899	10.76
512	128	1024	2.269	225.62	11.953	10.71
512	128	1536	2.328	219.90	12.044	10.63
512	128	2048	2.343	218.54	12.050	10.62

Q5_K, PR

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
512	128	0	1.929	265.41	11.599	11.04
512	128	512	2.042	250.69	11.810	10.84
512	128	1024	2.051	249.64	11.888	10.77
512	128	1536	2.350	217.91	11.888	10.77
512	128	2048	2.133	240.00	11.998	10.67

The performance gains here are not as large as in #514, #515, #516, #518, because k-quants are much faster than sub-4 bpw i-quants to begin with. Nevertheless, we see a nearly 50% PP performance improvement compared to the non-interleaved variants, and a 5-10% improvement compared to the _R4 variants.
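As a quick check of these percentages, the relative PP gains can be recomputed from the N_KV = 0 rows of the tables above. The S_PP values are copied from the tables; the dictionary layout and the helper name `gain_pct` are just illustrative choices.

```python
# S_PP (t/s) at N_KV = 0, copied from the sweep-bench tables above.
S_PP = {
    "Q4_K": {"main": 179.49, "R4": 254.10, "PR": 269.00},
    "Q5_K": {"main": 176.89, "R4": 238.30, "PR": 265.41},
}


def gain_pct(new: float, old: float) -> float:
    """Relative speedup of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


def summarize(table: dict) -> dict:
    """Compute the PR's gain over main and over the _R4 variant per type."""
    return {
        qtype: {
            "vs_main": gain_pct(row["PR"], row["main"]),
            "vs_R4": gain_pct(row["PR"], row["R4"]),
        }
        for qtype, row in table.items()
    }


if __name__ == "__main__":
    for qtype, gains in summarize(S_PP).items():
        print(f"{qtype}: {gains['vs_main']:.1f}% vs main, "
              f"{gains['vs_R4']:.1f}% vs _R4")
```

At N_KV = 0 this gives roughly 50% over the main branch for both types, and about 6% (Q4_K) and 11% (Q5_K) over the _R4 variants. The rows at larger N_KV can be compared the same way.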