### 🐛 [#490](https://github.com/ikawrakow/ik_llama.cpp/issues/490) - Bug: Performance drop with 14292913 [#461](https://github.com/ikawrakow/ik_llama.cpp/issues/461)
| **Author** | `nux` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-06-03 |
| **Updated** | 2025-06-05 |
---
#### Description
### What happened?
Performance dropped with commit 14292913 (#461).
To identify the commit where the performance dropped, I was running:
```shell
for i in `cut -d " " -f1 commits.txt`; do git checkout $i; ./cmd-build.sh; ./start-bench.sh >> results.txt; done
```
start-bench.sh is:
```shell
./build/bin/llama-bench -m /mnt/nvme/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf -p 512 -t 32 -mla 2 -fa 1 -fmoe 1 -ngl 99 --override-tensor "exps=CPU" -amb 512
```
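For reference, the same search could also be automated with `git bisect run`; a sketch, assuming the `cmd-build.sh` and `start-bench.sh` scripts above and a hypothetical `bisect-test.sh` helper that rebuilds, benchmarks, and exits non-zero when pp512 throughput falls below a chosen threshold:
```shell
# Sketch: bisect between the known-bad and known-good commits from the results below
git bisect start 14292913 24c010b3   # <bad> <good>
git bisect run ./bisect-test.sh      # hypothetical helper: build, bench, compare t/s to a threshold
git bisect reset
```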
Relevant results.txt:
| model | size | params | backend | ngl | fa | mla | amb | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --: | ----: | ---: | ------------: | ---------------: |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 99 | 1 | 2 | 512 | 1 | pp512 | 26.74 ± 0.05 |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 99 | 1 | 2 | 512 | 1 | tg128 | 4.80 ± 0.00 |
build: 09764678 (3715)
| model | size | params | backend | ngl | fa | mla | amb | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --: | ----: | ---: | ------------: | ---------------: |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 99 | 1 | 2 | 512 | 1 | pp512 | 26.75 ± 0.04 |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 99 | 1 | 2 | 512 | 1 | tg128 | 4.81 ± 0.00 |
build: 14292913 (3714)
| model | size | params | backend | ngl | fa | mla | amb | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --: | ----: | ---: | ------------: | ---------------: |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 99 | 1 | 2 | 512 | 1 | pp512 | 76.24 ± 1.44 |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 99 | 1 | 2 | 512 | 1 | tg128 | 10.08 ± 0.06 |
build: 24c010b3 (3713)
| model | size | params | backend | ngl | fa | mla | amb | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --: | ----: | ---: | ------------: | ---------------: |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 99 | 1 | 2 | 512 | 1 | pp512 | 77.25 ± 0.70 |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 99 | 1 | 2 | 512 | 1 | tg128 | 10.07 ± 0.06 |
build: c7ecd4e2 (3712)
Building like this:
```shell
cmake -B build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build build --config Release -j --clean-first
```
Running on 2x 9115, 768 GB RAM, and a 3090 GPU.
### Name and Version
```
version: 3710 (9fb82af3)
built with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu
```
### What operating system are you seeing the problem on?
Linux
### Relevant log output
```shell
```
---
#### 💬 Conversation
👤 **ikawrakow** commented the **2025-06-03** at **14:24:50**:<br>
Are all tensors `IQ4_K_R4`? If not, what is the quantization mix in this model?
---
👤 **nux** commented the **2025-06-03** at **14:30:39**:<br>
This is https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF IQ4_K_R4.
They are not all IQ4_K_R4 - I believe this is the summary:
```
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q8_0: 612 tensors
llama_model_loader: - type iq4_k_r4: 116 tensors
llama_model_loader: - type iq5_k_r4: 58 tensors
```
---
👤 **ikawrakow** commented the **2025-06-03** at **15:08:10**:<br>
I cannot run DeepSeek-V3, but as a surrogate here are some results with Qwen3-30B-A3B, quantized with the same mix of `IQ4_K_R4` and `IQ5_K_R4` for the experts and `Q8_0` for everything else, just like the model you have. My system is a Ryzen-7950X + RTX-4080. I'm leaving all experts on the CPU (`-ot exps=CPU`).
To make things more interesting I'm using `pp2048` instead of `pp512`.
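The sweeps below were produced with something along these lines (a reconstruction from the table columns; the model path is a placeholder and the thread count is an assumption):
```shell
# Reconstructed sketch of the benchmark runs below (model path is a placeholder, -t 16 is assumed)
./build/bin/llama-bench -m /path/to/Qwen3-30B-A3B-IQ4_K_R4.gguf \
    -p 2048 -t 16 -fa 1 -ngl 99 -ub 512,1024,2048 -ot exps=CPU
```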
The "good" build 24c010b3
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ------------: | ---------------: |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 512 | 1 | pp2048 | 606.31 ± 3.88 |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 1024 | 1 | pp2048 | 622.61 ± 8.59 |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 | 616.80 ± 7.54 |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 1024 | 1 | tg128 | 34.48 ± 0.03 |
And now the "bad" build (f6d5fbdc, which is latest master)
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ------------: | ---------------: |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 512 | 1 | pp2048 | 481.03 ± 3.55 |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 1024 | 1 | pp2048 | 893.92 ± 1.59 |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 | 1554.57 ± 2.93 |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 512 | 1 | tg128 | 34.45 ± 0.41 |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 1024 | 1 | tg128 | 34.50 ± 0.27 |
| qwen3moe ?B IQ4_K_R4 - 4.5 bpw | 17.57 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg128 | 34.69 ± 0.01 |
I see zero difference in TG. PP on main is indeed slower for a u-batch of 512, but becomes 2.5X faster for a u-batch of 2048!
---
👤 **ikawrakow** commented the **2025-06-03** at **15:36:46**:<br>
If you don't want to use large u-batches for some reason, you can recover the pre-#461 behavior using `-op 26,0,27,0,29,0`. This disables offloading of tensors that reside on the CPU to the GPU. It has not been implemented in `llama-bench`, which has its own command-line argument parsing, but it is available in `llama-sweep-bench`.
Here is what I get with
```
./bin/llama-sweep-bench -m $model -c 16384 -up 2048 -t 16 -ngl 100 -ot exps=CPU
```
### "Good build"
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 2048 | 512 | 0 | 3.158 | 648.45 | 14.698 | 34.84 |
| 2048 | 512 | 2048 | 3.275 | 625.28 | 14.792 | 34.61 |
| 2048 | 512 | 4096 | 3.235 | 633.05 | 15.047 | 34.03 |
| 2048 | 512 | 6144 | 3.262 | 627.77 | 15.252 | 33.57 |
| 2048 | 512 | 8192 | 3.308 | 619.06 | 15.425 | 33.19 |
| 2048 | 512 | 10240 | 3.368 | 608.10 | 15.702 | 32.61 |
| 2048 | 512 | 12288 | 4.105 | 498.92 | 15.776 | 32.45 |
| 2048 | 512 | 14336 | 3.596 | 569.58 | 15.549 | 32.93 |
### Main branch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 2048 | 512 | 0 | 1.352 | 1514.60 | 14.926 | 34.30 |
| 2048 | 512 | 2048 | 1.345 | 1523.06 | 15.034 | 34.06 |
| 2048 | 512 | 4096 | 1.378 | 1486.27 | 15.232 | 33.61 |
| 2048 | 512 | 6144 | 1.413 | 1449.21 | 15.413 | 33.22 |
| 2048 | 512 | 8192 | 1.445 | 1417.62 | 15.612 | 32.79 |
| 2048 | 512 | 10240 | 1.482 | 1381.74 | 15.875 | 32.25 |
| 2048 | 512 | 12288 | 1.516 | 1350.95 | 15.973 | 32.05 |
| 2048 | 512 | 14336 | 1.546 | 1324.99 | 16.158 | 31.69 |
### Main branch with -op 26,0,27,0,29,0
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 2048 | 512 | 0 | 3.293 | 621.93 | 14.868 | 34.44 |
| 2048 | 512 | 2048 | 3.588 | 570.87 | 15.029 | 34.07 |
| 2048 | 512 | 4096 | 3.452 | 593.34 | 15.157 | 33.78 |
| 2048 | 512 | 6144 | 3.463 | 591.43 | 15.380 | 33.29 |
| 2048 | 512 | 8192 | 3.359 | 609.71 | 15.564 | 32.90 |
| 2048 | 512 | 10240 | 3.375 | 606.87 | 15.802 | 32.40 |
| 2048 | 512 | 12288 | 3.622 | 565.51 | 15.918 | 32.17 |
| 2048 | 512 | 14336 | 3.439 | 595.48 | 15.675 | 32.66 |
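The last table presumably corresponds to the same `llama-sweep-bench` invocation with the offload policy appended, i.e. something like:
```shell
# Same sweep-bench command as above, with the offload policy flag added
./bin/llama-sweep-bench -m $model -c 16384 -up 2048 -t 16 -ngl 100 -ot exps=CPU -op 26,0,27,0,29,0
```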
---
👤 **nux** commented the **2025-06-03** at **22:24:24**:<br>
I don't mind using larger batch sizes. I mostly leave things as they are when it's working and only look at it when there's a problem :-D
That is good to know about ubatch. It seems to work very well for Qwen3:
```shell
nux@red ~/dev/ik_llama.cpp $ ./build/bin/llama-bench -m /mnt/nvme/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf -p 2048 -t 32 -mla 3 -fa 1 -fmoe 1 -ngl 99 -amb 512 -ub 512,1024,2048 -ot blk\.1[2-9]\.ffn.*=CPU -ot blk\.[2-8][0-9]\.ffn.*=CPU -ot blk\.9[0-3]\.ffn.*=CPU
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
```
| model | size | params | backend | ngl | n_ubatch | fa | mla | amb | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --: | ----: | ---: | ------------: | ---------------: |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 99 | 512 | 1 | 3 | 512 | 1 | pp2048 | 103.22 ± 14.97 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 99 | 512 | 1 | 3 | 512 | 1 | tg128 | 19.01 ± 0.01 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 99 | 1024 | 1 | 3 | 512 | 1 | pp2048 | 195.53 ± 0.19 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 99 | 1024 | 1 | 3 | 512 | 1 | tg128 | 18.92 ± 0.05 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 99 | 2048 | 1 | 3 | 512 | 1 | pp2048 | 321.14 ± 0.48 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 99 | 2048 | 1 | 3 | 512 | 1 | tg128 | 18.49 ± 0.55 |
build: f6d5fbdc (3725)
If I'm the only one having problems, I'll keep using 24c010b3 for deepseek-r1 and deepseek-v3.
---
👤 **ikawrakow** commented the **2025-06-04** at **04:47:10**:<br>
>If I'm the only one having problems, I'll keep using https://github.com/ikawrakow/ik_llama.cpp/commit/24c010b3916b5f1bb9d712d610d1fe9308ef7df4 for deepseek-r1 and deepseek-v3.
Did you try any of the options available to you with DeepSeek?
I'll close the issue then.
---
👤 **nux** commented the **2025-06-04** at **05:47:54**:<br>
What do you mean by the options available with DeepSeek? I tried ubatch and have been running `-mla 3`.
Would any of them cause this decrease in performance (~10 t/s to ~4.8 t/s) for this command?
```shell
./build/bin/llama-bench -m /mnt/nvme/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf -p 512 -t 32 -mla 2 -fa 1 -fmoe 1 -ngl 99 --override-tensor "exps=CPU" -amb 512
```
This issue came up originally when trying to figure out why ubergarm's DeepSeek-R1 was performing poorly. The older DeepSeek-V3 benchmarks that I had sitting around in a .txt made it easy to compare.
If you would like me to try anything specific I can, but I don't know where to start diagnosing my issue any further.
I wouldn't consider the issue resolved. Using commit 24c010b3 for deepseek seems more of a short term workaround than resolution.
That being said I don't think we pay you enough. I appreciate all the work you've done.
---
👤 **ikawrakow** commented the **2025-06-04** at **05:52:12**:<br>
I didn't see your performance values for `-ub 2048` (or even `-b 4096 -ub 4096`).
Neither did I see results for your regular way of using DeepSeek but with `-op 26,0,27,0,29,0` added to your command line. This latter option should match what you had prior to #461.
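For example, something along these lines, based on your earlier commands (a sketch; as noted above, `-op` itself is only parsed by `llama-sweep-bench`, not `llama-bench`):
```shell
# Sketch: the earlier llama-bench command with a larger u-batch (-p raised to 2048 so the u-batch is actually exercised)
./build/bin/llama-bench -m /mnt/nvme/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
    -p 2048 -t 32 -mla 2 -fa 1 -fmoe 1 -ngl 99 -ot exps=CPU -amb 512 -ub 2048
# or, larger still: append -b 4096 -ub 4096 instead of -ub 2048
```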
---
👤 **nux** commented the **2025-06-05** at **13:53:10**:<br>
`-op 26,0,27,0,29,0` brought the performance back. I hadn't tried that one since my PCIe speed is 16x, but it's working now.
Thanks