
🐛 #490 - Bug: Performance drop with 14292913 #461

Author nux
State Closed
Created 2025-06-03
Updated 2025-06-05

Description

What happened?

Performance dropped with commit 14292913 (#461).

To identify which commit the performance dropped with, I was running:

for i in $(cut -d " " -f1 commits.txt); do git checkout $i; ./cmd-build.sh; ./start-bench.sh >> results.txt; done

start-bench.sh is: ./build/bin/llama-bench -m /mnt/nvme/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf -p 512 -t 32 -mla 2 -fa 1 -fmoe 1 -ngl 99 --override-tensor "exps=CPU" -amb 512
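
(For reference, an equivalent, more automated bisection could use `git bisect run` with a small helper script. The sketch below assumes the `cmd-build.sh` and `start-bench.sh` helpers above and a hypothetical `bisect-perf.sh` name; the 8 t/s threshold is simply chosen between the good (~10 t/s) and bad (~4.8 t/s) tg128 results.)

```bash
#!/usr/bin/env bash
# bisect-perf.sh (hypothetical) -- classify the current commit for `git bisect run`.
# Assumes the cmd-build.sh and start-bench.sh helpers from above.
./cmd-build.sh || exit 125   # exit code 125 tells bisect to skip commits that fail to build
# Extract the tg128 t/s value from the llama-bench table row (the number just before the "±").
tps=$(./start-bench.sh | grep tg128 | grep -oE '[0-9]+\.[0-9]+ ±' | awk '{print $1}' | head -n1)
# Good builds reach ~10 t/s, regressed ones ~4.8 t/s, so split at 8 t/s.
awk -v t="${tps:-0}" 'BEGIN { exit (t + 0 > 8 ? 0 : 1) }'
```

Run as `git bisect start 14292913 24c010b3 && git bisect run ./bisect-perf.sh` to walk the commit range automatically.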

Relevant results.txt:

| model | size | params | backend | ngl | fa | mla | amb | fmoe | test | t/s |
| ----- | ---- | ------ | ------- | --- | -- | --- | --- | ---- | ---- | --- |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 99 | 1 | 2 | 512 | 1 | pp512 | 26.74 ± 0.05 |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 99 | 1 | 2 | 512 | 1 | tg128 | 4.80 ± 0.00 |

build: 09764678 (3715)

| model | size | params | backend | ngl | fa | mla | amb | fmoe | test | t/s |
| ----- | ---- | ------ | ------- | --- | -- | --- | --- | ---- | ---- | --- |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 99 | 1 | 2 | 512 | 1 | pp512 | 26.75 ± 0.04 |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 99 | 1 | 2 | 512 | 1 | tg128 | 4.81 ± 0.00 |

build: 14292913 (3714)

| model | size | params | backend | ngl | fa | mla | amb | fmoe | test | t/s |
| ----- | ---- | ------ | ------- | --- | -- | --- | --- | ---- | ---- | --- |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 99 | 1 | 2 | 512 | 1 | pp512 | 76.24 ± 1.44 |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 99 | 1 | 2 | 512 | 1 | tg128 | 10.08 ± 0.06 |

build: 24c010b3 (3713)

| model | size | params | backend | ngl | fa | mla | amb | fmoe | test | t/s |
| ----- | ---- | ------ | ------- | --- | -- | --- | --- | ---- | ---- | --- |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 99 | 1 | 2 | 512 | 1 | pp512 | 77.25 ± 0.70 |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 99 | 1 | 2 | 512 | 1 | tg128 | 10.07 ± 0.06 |

build: c7ecd4e2 (3712)

Building like this:

cmake -B build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build build --config Release -j --clean-first

Running on 2x 9115 CPUs, 768 GB RAM, and an RTX 3090 GPU.

Name and Version

version: 3710 (9fb82af3) built with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output



💬 Conversation

👤 ikawrakow commented the 2025-06-03 at 14:24:50:

Are all tensors IQ4_K_R4? If not, what is the quantization mix in this model?


👤 nux commented the 2025-06-03 at 14:30:39:

This is https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF IQ4_K_R4

They are not all IQ4_K_R4 - I believe this is the summary:

llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q8_0: 612 tensors
llama_model_loader: - type iq4_k_r4: 116 tensors
llama_model_loader: - type iq5_k_r4: 58 tensors


👤 ikawrakow commented the 2025-06-03 at 15:08:10:

I cannot run DeepSeek-V3, but as a surrogate here are some results with Qwen3-30B-A3B, quantized with the same mix of IQ4_K_R4 and IQ5_K_R4 for the experts and Q8_0 for everything else, just like the model you have. My system is a Ryzen-7950X + RTX-4080. I'm leaving all experts on the CPU (-ot exps=CPU).

To make things more interesting I'm using pp2048 instead of pp512.

The "good" build 24c010b3

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ----- | ---- | ------ | ------- | --- | -------- | -- | ---- | --- |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 512 | 1 | pp2048 | 606.31 ± 3.88 |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 1024 | 1 | pp2048 | 622.61 ± 8.59 |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 | 616.80 ± 7.54 |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 1024 | 1 | tg128 | 34.48 ± 0.03 |

And now the "bad" build (f6d5fbdc, which is the latest master)

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ----- | ---- | ------ | ------- | --- | -------- | -- | ---- | --- |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 512 | 1 | pp2048 | 481.03 ± 3.55 |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 1024 | 1 | pp2048 | 893.92 ± 1.59 |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 | 1554.57 ± 2.93 |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 512 | 1 | tg128 | 34.45 ± 0.41 |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 1024 | 1 | tg128 | 34.50 ± 0.27 |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg128 | 34.69 ± 0.01 |

I see zero difference in TG. PP on main is indeed slower for u-batch of 512, but becomes 2.5X faster for u-batch = 2048!
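
(As a concrete way to try this on the DeepSeek setup from the report, the same u-batch sweep could be added to the original llama-bench command. This is a sketch only, untested on that hardware, with all other flags taken from the report above.)

```bash
./build/bin/llama-bench \
  -m /mnt/nvme/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
  -p 2048 -t 32 -mla 2 -fa 1 -fmoe 1 -ngl 99 -amb 512 \
  --override-tensor "exps=CPU" \
  -ub 512,1024,2048   # sweep the u-batch size; -p 2048 so prompt processing can use the larger u-batches
```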


👤 ikawrakow commented the 2025-06-03 at 15:36:46:

If for some reason you don't want to use large u-batches, you can recover the pre-#461 behavior using -op 26,0,27,0,29,0. This disables offloading to the GPU of tensors that reside on the CPU. It has not been implemented in llama-bench, which has its own command line argument parsing, but it is available in llama-sweep-bench.

Here is what I get with

./bin/llama-sweep-bench -m $model -c 16384 -up 2048 -t 16 -ngl 100 -ot exps=CPU

"Good build"

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
| -- | -- | ---- | ------ | -------- | ------ | -------- |
| 2048 | 512 | 0 | 3.158 | 648.45 | 14.698 | 34.84 |
| 2048 | 512 | 2048 | 3.275 | 625.28 | 14.792 | 34.61 |
| 2048 | 512 | 4096 | 3.235 | 633.05 | 15.047 | 34.03 |
| 2048 | 512 | 6144 | 3.262 | 627.77 | 15.252 | 33.57 |
| 2048 | 512 | 8192 | 3.308 | 619.06 | 15.425 | 33.19 |
| 2048 | 512 | 10240 | 3.368 | 608.10 | 15.702 | 32.61 |
| 2048 | 512 | 12288 | 4.105 | 498.92 | 15.776 | 32.45 |
| 2048 | 512 | 14336 | 3.596 | 569.58 | 15.549 | 32.93 |

Main branch

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
| -- | -- | ---- | ------ | -------- | ------ | -------- |
| 2048 | 512 | 0 | 1.352 | 1514.60 | 14.926 | 34.30 |
| 2048 | 512 | 2048 | 1.345 | 1523.06 | 15.034 | 34.06 |
| 2048 | 512 | 4096 | 1.378 | 1486.27 | 15.232 | 33.61 |
| 2048 | 512 | 6144 | 1.413 | 1449.21 | 15.413 | 33.22 |
| 2048 | 512 | 8192 | 1.445 | 1417.62 | 15.612 | 32.79 |
| 2048 | 512 | 10240 | 1.482 | 1381.74 | 15.875 | 32.25 |
| 2048 | 512 | 12288 | 1.516 | 1350.95 | 15.973 | 32.05 |
| 2048 | 512 | 14336 | 1.546 | 1324.99 | 16.158 | 31.69 |

Main branch with -op 26,0,27,0,29,0

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
| -- | -- | ---- | ------ | -------- | ------ | -------- |
| 2048 | 512 | 0 | 3.293 | 621.93 | 14.868 | 34.44 |
| 2048 | 512 | 2048 | 3.588 | 570.87 | 15.029 | 34.07 |
| 2048 | 512 | 4096 | 3.452 | 593.34 | 15.157 | 33.78 |
| 2048 | 512 | 6144 | 3.463 | 591.43 | 15.380 | 33.29 |
| 2048 | 512 | 8192 | 3.359 | 609.71 | 15.564 | 32.90 |
| 2048 | 512 | 10240 | 3.375 | 606.87 | 15.802 | 32.40 |
| 2048 | 512 | 12288 | 3.622 | 565.51 | 15.918 | 32.17 |
| 2048 | 512 | 14336 | 3.439 | 595.48 | 15.675 | 32.66 |

👤 nux commented the 2025-06-03 at 22:24:24:

I don't mind using larger batch sizes. I mostly leave things as they are when they're working and only look at them when there's a problem :-D

That is good to know about ubatch. It seems to work very well for Qwen3.

nux@red ~/dev/ik_llama.cpp $ ./build/bin/llama-bench -m /mnt/nvme/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf -p 2048 -t 32 -mla 3 -fa 1 -fmoe 1 -ngl 99 -amb 512 -ub 512,1024,2048 -ot blk.1[2-9].ffn.*=CPU -ot blk.[2-8][0-9].ffn.*=CPU -ot blk.9[0-3].ffn.*=CPU
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | n_ubatch | fa | mla | amb | fmoe | test | t/s |
| ----- | ---- | ------ | ------- | --- | -------- | -- | --- | --- | ---- | ---- | --- |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 99 | 512 | 1 | 3 | 512 | 1 | pp2048 | 103.22 ± 14.97 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 99 | 512 | 1 | 3 | 512 | 1 | tg128 | 19.01 ± 0.01 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 99 | 1024 | 1 | 3 | 512 | 1 | pp2048 | 195.53 ± 0.19 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 99 | 1024 | 1 | 3 | 512 | 1 | tg128 | 18.92 ± 0.05 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 99 | 2048 | 1 | 3 | 512 | 1 | pp2048 | 321.14 ± 0.48 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 99 | 2048 | 1 | 3 | 512 | 1 | tg128 | 18.49 ± 0.55 |

build: f6d5fbdc (3725)

If I'm the only one having problems, I'll keep using 24c010b3 for deepseek-r1 and deepseek-v3.


👤 ikawrakow commented the 2025-06-04 at 04:47:10:

> If I'm the only one having problems, I'll keep using 24c010b391 for deepseek-r1 and deepseek-v3.

Did you try any of the options available to you with DeepSeek?

I'll close the issue then.


👤 nux commented the 2025-06-04 at 05:47:54:

What do you mean by the options available with DeepSeek? I tried ubatch and have been running mla 3.

Would any of them cause this decrease in performance (~10 t/s down to ~4.8 t/s) for this command?

./build/bin/llama-bench -m /mnt/nvme/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf -p 512 -t 32 -mla 2 -fa 1 -fmoe 1 -ngl 99 --override-tensor "exps=CPU" -amb 512

This issue came up originally when trying to figure out why ubergarm's deepseek-r1 was performing poorly. The older deepseek-v3 benchmarks that I had sitting around in a .txt file made it easy to compare.

If you would like me to try anything specific I can, but I don't know where to start diagnosing my issue any further.

I wouldn't consider the issue resolved. Using commit 24c010b3 for DeepSeek seems more of a short-term workaround than a resolution.

That being said I don't think we pay you enough. I appreciate all the work you've done.


👤 ikawrakow commented the 2025-06-04 at 05:52:12:

I didn't see your performance values for -ub 2048 (or even -b 4096 -ub 4096).

Neither did I see results for your regular way of using DeepSeek but adding -op 26,0,27,0,29,0 to your command line. This latter option should match what you had prior to #461.
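
(Since -op is not wired into llama-bench, a concrete way to check this would be a llama-sweep-bench run that mirrors the reported DeepSeek flags with the override appended. This is only a sketch that combines flags quoted elsewhere in this issue, not a tested command.)

```bash
./build/bin/llama-sweep-bench \
  -m /mnt/nvme/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
  -c 16384 -t 32 -mla 2 -fa 1 -fmoe 1 -ngl 99 -amb 512 \
  --override-tensor "exps=CPU" \
  -op 26,0,27,0,29,0   # disable CPU->GPU offload for these op types, matching pre-#461 behavior
```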


👤 nux commented the 2025-06-05 at 13:53:10:

-op 26,0,27,0,29,0 brought back the performance. I hadn't tried that one since my PCIe link is x16 - but it's working now.

Thanks