
🐛 #490 - Bug: Performance drop with 14292913 #461

Author nux
State Closed
Created 2025-06-03
Updated 2025-06-05

Description

What happened?

Performance dropped with commit 14292913 (#461).

To identify which commit the performance dropped with, I was running:

for i in $(cut -d " " -f1 commits.txt); do git checkout $i; ./cmd-build.sh; ./start-bench.sh >> results.txt; done

start-bench.sh is: ./build/bin/llama-bench -m /mnt/nvme/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf -p 512 -t 32 -mla 2 -fa 1 -fmoe 1 -ngl 99 --override-tensor "exps=CPU" -amb 512
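
(For reference, an equivalent, more automated bisection could use `git bisect run` with a small helper script. The sketch below assumes the `cmd-build.sh` and `start-bench.sh` helpers above and a hypothetical `bisect-perf.sh` name; the 8 t/s threshold is simply chosen between the good (~10 t/s) and bad (~4.8 t/s) tg128 results.)

```bash
#!/usr/bin/env bash
# bisect-perf.sh (hypothetical) -- classify the current commit for `git bisect run`.
# Assumes the cmd-build.sh and start-bench.sh helpers from above.
./cmd-build.sh || exit 125   # exit code 125 tells bisect to skip commits that fail to build
# Extract the tg128 t/s value from the llama-bench table row (the number just before the "±").
tps=$(./start-bench.sh | grep tg128 | grep -oE '[0-9]+\.[0-9]+ ±' | awk '{print $1}' | head -n1)
# Good builds reach ~10 t/s, regressed ones ~4.8 t/s, so split at 8 t/s.
awk -v t="${tps:-0}" 'BEGIN { exit (t + 0 > 8 ? 0 : 1) }'
```

Run as `git bisect start 14292913 24c010b3 && git bisect run ./bisect-perf.sh` to walk the commit range automatically.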

Relevant results.txt:

| model | size | params | backend | ngl | fa | mla | amb | fmoe | test | t/s |
| ----- | ---- | ------ | ------- | --- | -- | --- | --- | ---- | ---- | --- |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 99 | 1 | 2 | 512 | 1 | pp512 | 26.74 ± 0.05 |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 99 | 1 | 2 | 512 | 1 | tg128 | 4.80 ± 0.00 |

build: 09764678 (3715)

| model | size | params | backend | ngl | fa | mla | amb | fmoe | test | t/s |
| ----- | ---- | ------ | ------- | --- | -- | --- | --- | ---- | ---- | --- |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 99 | 1 | 2 | 512 | 1 | pp512 | 26.75 ± 0.04 |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 99 | 1 | 2 | 512 | 1 | tg128 | 4.81 ± 0.00 |

build: 14292913 (3714)

| model | size | params | backend | ngl | fa | mla | amb | fmoe | test | t/s |
| ----- | ---- | ------ | ------- | --- | -- | --- | --- | ---- | ---- | --- |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 99 | 1 | 2 | 512 | 1 | pp512 | 76.24 ± 1.44 |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 99 | 1 | 2 | 512 | 1 | tg128 | 10.08 ± 0.06 |

build: 24c010b3 (3713)

| model | size | params | backend | ngl | fa | mla | amb | fmoe | test | t/s |
| ----- | ---- | ------ | ------- | --- | -- | --- | --- | ---- | ---- | --- |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 99 | 1 | 2 | 512 | 1 | pp512 | 77.25 ± 0.70 |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 99 | 1 | 2 | 512 | 1 | tg128 | 10.07 ± 0.06 |

build: c7ecd4e2 (3712)

Building like this:

cmake -B build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build build --config Release -j --clean-first

Running on 2x 9115 CPUs, 768 GB RAM, and an RTX 3090 GPU.

Name and Version

version: 3710 (9fb82af3) built with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output



💬 Conversation

👤 ikawrakow commented the 2025-06-03 at 14:24:50:

Are all tensors IQ4_K_R4? If not, what is the quantization mix in this model?


👤 nux commented the 2025-06-03 at 14:30:39:

This is https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF IQ4_K_R4

They are not all IQ4_K_R4 - I believe this is the summary:

llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q8_0: 612 tensors
llama_model_loader: - type iq4_k_r4: 116 tensors
llama_model_loader: - type iq5_k_r4: 58 tensors


👤 ikawrakow commented the 2025-06-03 at 15:08:10:

I cannot run DeepSeek-V3, but as a surrogate here are some results with Qwen3-30B-A3B, quantized with the same mix of IQ4_K_R4 and IQ5_K_R4 for the experts and Q8_0 for everything else, just like the model you have. My system is a Ryzen-7950X + RTX-4080. I'm leaving all experts on the CPU (-ot exps=CPU).

To make things more interesting I'm using pp2048 instead of pp512.

The "good" build 24c010b3

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ----- | ---- | ------ | ------- | --- | -------- | -- | ---- | --- |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 512 | 1 | pp2048 | 606.31 ± 3.88 |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 1024 | 1 | pp2048 | 622.61 ± 8.59 |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 | 616.80 ± 7.54 |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 1024 | 1 | tg128 | 34.48 ± 0.03 |

And now the "bad" build (f6d5fbdc, which is the latest master)

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ----- | ---- | ------ | ------- | --- | -------- | -- | ---- | --- |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 512 | 1 | pp2048 | 481.03 ± 3.55 |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 1024 | 1 | pp2048 | 893.92 ± 1.59 |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 | 1554.57 ± 2.93 |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 512 | 1 | tg128 | 34.45 ± 0.41 |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 1024 | 1 | tg128 | 34.50 ± 0.27 |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg128 | 34.69 ± 0.01 |

I see zero difference in TG. PP on main is indeed slower for u-batch of 512, but becomes 2.5X faster for u-batch = 2048!
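
(As a concrete way to try this on the DeepSeek setup from the report, the same u-batch sweep could be added to the original llama-bench command. This is a sketch only, untested on that hardware, with all other flags taken from the report above.)

```bash
./build/bin/llama-bench \
  -m /mnt/nvme/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
  -p 2048 -t 32 -mla 2 -fa 1 -fmoe 1 -ngl 99 -amb 512 \
  --override-tensor "exps=CPU" \
  -ub 512,1024,2048   # sweep the u-batch size; -p 2048 so prompt processing can use the larger u-batches
```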


👤 ikawrakow commented the 2025-06-03 at 15:36:46:

If for some reason you don't want to use large u-batches, you can recover the pre-#461 behavior using -op 26,0,27,0,29,0. This disables offloading to the GPU of tensors that reside on the CPU. It has not been implemented in llama-bench, which has its own command line argument parsing, but it is available in llama-sweep-bench.

Here is what I get with

./bin/llama-sweep-bench -m $model -c 16384 -up 2048 -t 16 -ngl 100 -ot exps=CPU

"Good build"

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
| -- | -- | ---- | ------ | -------- | ------ | -------- |
| 2048 | 512 | 0 | 3.158 | 648.45 | 14.698 | 34.84 |
| 2048 | 512 | 2048 | 3.275 | 625.28 | 14.792 | 34.61 |
| 2048 | 512 | 4096 | 3.235 | 633.05 | 15.047 | 34.03 |
| 2048 | 512 | 6144 | 3.262 | 627.77 | 15.252 | 33.57 |
| 2048 | 512 | 8192 | 3.308 | 619.06 | 15.425 | 33.19 |
| 2048 | 512 | 10240 | 3.368 | 608.10 | 15.702 | 32.61 |
| 2048 | 512 | 12288 | 4.105 | 498.92 | 15.776 | 32.45 |
| 2048 | 512 | 14336 | 3.596 | 569.58 | 15.549 | 32.93 |

Main branch

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
| -- | -- | ---- | ------ | -------- | ------ | -------- |
| 2048 | 512 | 0 | 1.352 | 1514.60 | 14.926 | 34.30 |
| 2048 | 512 | 2048 | 1.345 | 1523.06 | 15.034 | 34.06 |
| 2048 | 512 | 4096 | 1.378 | 1486.27 | 15.232 | 33.61 |
| 2048 | 512 | 6144 | 1.413 | 1449.21 | 15.413 | 33.22 |
| 2048 | 512 | 8192 | 1.445 | 1417.62 | 15.612 | 32.79 |
| 2048 | 512 | 10240 | 1.482 | 1381.74 | 15.875 | 32.25 |
| 2048 | 512 | 12288 | 1.516 | 1350.95 | 15.973 | 32.05 |
| 2048 | 512 | 14336 | 1.546 | 1324.99 | 16.158 | 31.69 |

Main branch with -op 26,0,27,0,29,0

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
| -- | -- | ---- | ------ | -------- | ------ | -------- |
| 2048 | 512 | 0 | 3.293 | 621.93 | 14.868 | 34.44 |
| 2048 | 512 | 2048 | 3.588 | 570.87 | 15.029 | 34.07 |
| 2048 | 512 | 4096 | 3.452 | 593.34 | 15.157 | 33.78 |
| 2048 | 512 | 6144 | 3.463 | 591.43 | 15.380 | 33.29 |
| 2048 | 512 | 8192 | 3.359 | 609.71 | 15.564 | 32.90 |
| 2048 | 512 | 10240 | 3.375 | 606.87 | 15.802 | 32.40 |
| 2048 | 512 | 12288 | 3.622 | 565.51 | 15.918 | 32.17 |
| 2048 | 512 | 14336 | 3.439 | 595.48 | 15.675 | 32.66 |

👤 nux commented the 2025-06-03 at 22:24:24:

I don't mind using larger batch sizes. I mostly leave things as they are when they're working and only look at them when there's a problem :-D

That is good to know about ubatch. It seems to work very well for Qwen3.

nux@red ~/dev/ik_llama.cpp $ ./build/bin/llama-bench -m /mnt/nvme/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf -p 2048 -t 32 -mla 3 -fa 1 -fmoe 1 -ngl 99 -amb 512 -ub 512,1024,2048 -ot blk.1[2-9].ffn.*=CPU -ot blk.[2-8][0-9].ffn.*=CPU -ot blk.9[0-3].ffn.*=CPU
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | n_ubatch | fa | mla | amb | fmoe | test | t/s |
| ----- | ---- | ------ | ------- | --- | -------- | -- | --- | --- | ---- | ---- | --- |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 99 | 512 | 1 | 3 | 512 | 1 | pp2048 | 103.22 ± 14.97 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 99 | 512 | 1 | 3 | 512 | 1 | tg128 | 19.01 ± 0.01 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 99 | 1024 | 1 | 3 | 512 | 1 | pp2048 | 195.53 ± 0.19 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 99 | 1024 | 1 | 3 | 512 | 1 | tg128 | 18.92 ± 0.05 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 99 | 2048 | 1 | 3 | 512 | 1 | pp2048 | 321.14 ± 0.48 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 99 | 2048 | 1 | 3 | 512 | 1 | tg128 | 18.49 ± 0.55 |

build: f6d5fbdc (3725)

If I'm the only one having problems, I'll keep using 24c010b3 for deepseek-r1 and deepseek-v3.


👤 ikawrakow commented the 2025-06-04 at 04:47:10:

> If I'm the only one having problems, I'll keep using 24c010b391 for deepseek-r1 and deepseek-v3.

Did you try any of the options available to you with DeepSeek?

I'll close the issue then.


👤 nux commented the 2025-06-04 at 05:47:54:

What do you mean by the options available with DeepSeek? I tried ubatch and have been running mla 3.

Would any of them cause this decrease in performance (~10 t/s down to ~4.8 t/s) for this command?

./build/bin/llama-bench -m /mnt/nvme/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf -p 512 -t 32 -mla 2 -fa 1 -fmoe 1 -ngl 99 --override-tensor "exps=CPU" -amb 512

This issue came up originally when trying to figure out why ubergarm's deepseek-r1 was performing poorly. The older deepseek-v3 benchmarks that I had sitting around in a .txt file made it easy to compare.

If you would like me to try anything specific I can, but I don't know where to start diagnosing my issue any further.

I wouldn't consider the issue resolved. Using commit 24c010b3 for DeepSeek seems more of a short-term workaround than a resolution.

That being said I don't think we pay you enough. I appreciate all the work you've done.


👤 ikawrakow commented the 2025-06-04 at 05:52:12:

I didn't see your performance values for -ub 2048 (or even -b 4096 -ub 4096).

Neither did I see results for your regular way of using DeepSeek but adding -op 26,0,27,0,29,0 to your command line. This latter option should match what you had prior to #461.
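
(Since -op is not wired into llama-bench, a concrete way to check this would be a llama-sweep-bench run that mirrors the reported DeepSeek flags with the override appended. This is only a sketch that combines flags quoted elsewhere in this issue, not a tested command.)

```bash
./build/bin/llama-sweep-bench \
  -m /mnt/nvme/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
  -c 16384 -t 32 -mla 2 -fa 1 -fmoe 1 -ngl 99 -amb 512 \
  --override-tensor "exps=CPU" \
  -op 26,0,27,0,29,0   # disable CPU->GPU offload for these op types, matching pre-#461 behavior
```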


👤 nux commented the 2025-06-05 at 13:53:10:

-op 26,0,27,0,29,0 brought back the performance. I hadn't tried that one since my PCIe link is x16 - but it's working now.

Thanks