ik_llama.cpp/282 - Improve DeepSeek batched processing speed.md at 993cb00a347fc77632b73126f614092d659727de - ik_llama.cpp

ikawrakow/ik_llama.cpp

Fork 0

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-18 20:30:11 +00:00

Files

Thomas eaa2510a28 Add GitHub data: filename sanitization (#640 )

2025-07-23 13:31:53 +02:00

13 KiB

Raw Blame History

🔀 #282 - Improve DeepSeek batched processing speed

Author	`ikawrakow`
State	❌ Closed
Created	2025-03-23
Updated	2025-03-23

Description

I was looking into the batched processing performance dips observed by @saood06 here and I saw this for DeepSeek-Lite:

Commandline was

./bin/llama-batched-bench -m junk1.bin -npp 512 -ntg 128 -npl 4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,100,104,108,112,116,120 -pps -fmoe -fa -mla 3 -t 16

It took me a while to figure out the reason for the dramatic drop in performance between a batch size of 16 and a batch size of 20. I was suspecting that something goes wrong how the work is being distributed between the threads. But at the end it turned out that it is due to the way the compute graph is built: when n_token > n_head we switch to "PP optimized" processing, which means we go from FA with Dk = 576, Dv = 512 to Dk = 192, Dv = 128, which requires two additional matrix multiplications. For DeepSeek-Lite n_head = 16, so with steps of 4 for the batch size 20 is exactly where the switch is made. I'm not sure what the rationale was for selecting this specific transition point (the optimization came from the mainline llama.cpp PR, but it clearly kills performance. If we look at prompt processing performance using "PP optimized" vs "TG optimized" DeepSeek compute graphs, we see this picture:

I.e., "TG optimized" is better than "PP optimized" for prompt lengths up to 64 tokens, and is not too far behind at 128 tokens. So, we can easily solve the performance drop by using "TG optimized" up to n_prompt = 128. By doing that, we get this result:

The calculations take quite some time, so I didn't have the patience to run beyond batch size of 100 to see the exact crossover point. But eyeballing the graph, it looks like 128 is a good choice for DeepSeek-Lite. DeepSeek-V3/R1 have 128 heads, so this PR will not change the behavior for this models. But it isn't clear to me if one shouldn't use a larger threshold for the "TG optimized" -> "PP optimized" transition.

Concerning DeepSeek-R1, there is a small change in this PR that I hope will reduce the performance dips observed by @saood06

💬 Conversation

👤 saood06 commented the 2025-03-23 at 11:32:00:

Concerning DeepSeek-R1, there is a small change in this PR that I hope will reduce the performance dips observed by @saood06

Running sweep bench and will post full results with graph when they finish, but right now but early results look promising showing

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
512	128	0	49.636	10.32	39.574	3.23
512	128	512	57.011	8.98	43.246	2.96
512	128	1024	62.986	8.13	42.916	2.98
512	128	1536	63.400	8.08	44.014	2.91
512	128	2048	66.228	7.73	47.167	2.71
512	128	2560	72.508	7.06	46.553	2.75
512	128	3072	74.616	6.86	47.772	2.68
512	128	3584	80.675	6.35	50.907	2.51
512	128	4096	87.558	5.85	50.432	2.54
512	128	4608	88.584	5.78	53.859	2.38
512	128	5120	92.838	5.52	54.277	2.36
512	128	5632	99.437	5.15	54.257	2.36

I see you pushed another commit, should I stop this test and recompile and run the new commit?

👤 ikawrakow commented the 2025-03-23 at 11:34:40:

I see you pushed another commit, should I stop this test and recompile and run the new commit?

This will only affect results for B > 128, so beyond the range where you are testing, so no need to rerun.

👤 ikawrakow commented the 2025-03-23 at 11:51:34:

What would be very interesting is to run PP benchmarks with DeepSeek-V3/R1 with ./bin/llama-bench -mla 3 -fa 1 -fmoe 1 -p 32,64,128,192,256,320,384,448,512,576,640,704,768 with

This line changed to pp_opt = true;
The same line changed to pp_opt = false;

This will help understand if the crossover between "TG optimized" and "PP optimized" is somehow dependent on the number of heads, or if it is just a (perhaps somewhat computer dependent) constant. I can see arguments for both options, so the only way to understand is to just test.

👤 saood06 commented the 2025-03-23 at 13:28:00:

What would be very interesting is to run PP benchmarks with DeepSeek-V3/R1 with ./bin/llama-bench -mla 3 -fa 1 -fmoe 1 -p 32,64,128,192,256,320,384,448,512,576,640,704,768 with
* [This line](https://github.com/ikawrakow/ik_llama.cpp/blob/5a4855e61c05b0c54ecad3f4155074d8f344b6f6/src/llama.cpp#L13899) changed to `pp_opt = true`;

* The same line changed to `pp_opt = false`;
This will help understand if the crossover between "TG optimized" and "PP optimized" is somehow dependent on the number of heads, or if it is just a (perhaps somewhat computer dependent) constant. I can see arguments for both options, so the only way to understand is to just test.

Running now, each config is going to take ~50 minutes.

👤 saood06 commented the 2025-03-23 at 16:56:44:

@ikawrakow Here's the benchmark you asked for:

On d12f4a12aa with bool pp_opt = n_tokens > n_head;

model	size	params	backend	threads	fa	mla	fmoe	test	t/s
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp32	10.30 ± 0.12
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp64	10.46 ± 0.66
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp128	11.25 ± 0.69
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp192	9.35 ± 0.34
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp256	9.46 ± 0.13
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp320	9.15 ± 0.29
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp384	9.43 ± 0.33
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp448	10.05 ± 0.16
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp512	10.30 ± 0.11
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp576	9.97 ± 0.20
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp640	9.62 ± 0.20
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp704	9.43 ± 0.14
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp768	9.51 ± 0.16
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	tg128	2.84 ± 0.00

On ec4bc75f with bool pp_opt = true;

model	size	params	backend	threads	fa	mla	fmoe	test	t/s
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp32	9.15 ± 0.06
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp64	9.91 ± 0.61
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp128	11.20 ± 0.38
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp192	9.25 ± 0.48
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp256	9.11 ± 0.29
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp320	8.96 ± 0.18
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp384	9.17 ± 0.12
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp448	9.93 ± 0.13
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp512	10.07 ± 0.31
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp576	9.66 ± 0.21
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp640	9.37 ± 0.10
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp704	9.26 ± 0.11
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp768	9.44 ± 0.20
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	tg128	0.99 ± 0.02

On ec4bc75f with bool pp_opt = false;

model	size	params	backend	threads	fa	mla	fmoe	test	t/s
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp32	10.09 ± 0.17
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp64	10.09 ± 0.53
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp128	10.50 ± 0.60
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp192	8.79 ± 0.37
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp256	8.70 ± 0.12
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp320	8.39 ± 0.17
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp384	8.74 ± 0.09
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp448	8.85 ± 0.15
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp512	9.48 ± 0.15
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp576	9.28 ± 0.02
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp640	8.89 ± 0.30
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp704	8.67 ± 0.10
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	pp768	8.69 ± 0.13
deepseek2 671B IQ4_K_R4 - 4.5 bpw	353.53 GiB	672.05 B	CPU	48	1	3	1	tg128	2.87 ± 0.00

I'm going to reboot my machine now to enable 1GB hugepages and mitigations=off and run a sweep-bench to see if TG performance increases.

👤 ikawrakow commented the 2025-03-23 at 17:10:32:

Thanks, this is great! It looks like a threshold of 128 tokens is not a bad choice for DeepSeek-R1 as well.

13 KiB Raw Blame History

🔀 #282 - Improve DeepSeek batched processing speed

Description

💬 Conversation

13 KiB

Raw Blame History