21 KiB
🗣️ #548 - Poor performance with bf16 model on Qwen3 30B-A3B
| Author | Gaolingx |
|---|---|
| Created | 2025-06-22 |
| Updated | 2025-07-02 |
Description
Introduction
I tried to run model Qwen3-30B-A3B-GGUF with ik_llama.cpp. Because I have a nvidia GPU(RTX 4060Ti) with 8G VRAM on my PC, so I compiled ik_llama.cpp with the cuda backend, and run with -ot exps=CPU to offload experts(ffn_down_exps, ffn_up_exps, gate_exps) to CPU.
Build options:
cmake -B build -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_CUDA=ON -DGGML_AVX2=ON -DGGML_AVX512=OFF -DBUILD_SHARED_LIBS=ON
I tested q8_0 quantization and bf16 models, on q8_0 model, the prompt processing speed(PP) the token generate speed(TG) are very quickly, I got a speed of up to 165 token/s PP and 18 token/s TG, that's a good start. but when I ran bf16 model, the PP speed is much slower than before, It just 30-40token/s PP, 11-12 token/s TG, It's not even as good as only CPU ggml backend(about 51 token/s PP, 11 token/s TG), This performance is obviously not normal on bf16 models. It makes me confused. I've also found that the GPU spends quite a bit of time on the copy every time the token processing phase is processed, but quantization modes(like q8_0) don't have the above problem.
cpu backend, bf16 model(Qwen3-30B-A3B-BF16)
cuda backend, bf16 model(Qwen3-30B-A3B-BF16)
cuda backend, q8_0 model(Qwen3-30B-A3B-Q8_0)
System Info
Here are my SystemInfo(include hardware and software)
- Hardware
- CPU: Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz(20c, 40t) x2
- GPU: NVIDIA GeForce RTX 4060Ti 8G
- RAM: RDIMM DDR4 2666 2Rx4 32G x16(12 Channels total)
- Motherboard: Supermicro X11DPi-N
- SSD: ZHITAI TiPlus7100 1TB
- Software
- OS: Microsoft Windows 10 Pro
- BIOS: Hyper-Threading-Enable, SNC-Disable
- Model: Qwen3-235B-A22B-128K-Q8_0(unsloth/Qwen3-235B-A22B-128K-GGUF)
- ik_llama.cpp:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes INFO [ main] build info | tid="54808" timestamp=1750526676 build=3761 commit="144ee1c4" INFO [ main] system info | tid="54808" timestamp=1750526676 n_threads=16 n_threads_batch=-1 total_threads=40 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
Benchmark
Here are the results of my initialllama-sweep-bench testing for PP speed and TG speed, the command line for is ik_llama.cpp
llama-sweep-bench:
./llama-sweep-bench -m "%MODEL_PATH%" -c 16384 -t 16 -ngl 48 -fa -rtr -fmoe -ser 6,1 -ot exps=CPU
ik_llama.cpp cuda backed (Model: Qwen3-30B-A3B-Q8_0)
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 2.308 | 221.84 | 6.463 | 19.81 |
| 512 | 128 | 512 | 2.437 | 210.11 | 7.741 | 16.54 |
| 512 | 128 | 1024 | 2.295 | 223.07 | 7.040 | 18.18 |
| 512 | 128 | 1536 | 2.537 | 201.81 | 7.739 | 16.54 |
| 512 | 128 | 2048 | 2.327 | 220.05 | 7.006 | 18.27 |
| 512 | 128 | 2560 | 2.523 | 202.97 | 7.766 | 16.48 |
| 512 | 128 | 3072 | 2.571 | 199.15 | 7.901 | 16.20 |
| 512 | 128 | 3584 | 2.531 | 202.26 | 7.717 | 16.59 |
| 512 | 128 | 4096 | 2.600 | 196.89 | 8.016 | 15.97 |
| 512 | 128 | 4608 | 2.602 | 196.80 | 7.962 | 16.08 |
| 512 | 128 | 5120 | 2.623 | 195.21 | 7.880 | 16.24 |
| 512 | 128 | 5632 | 2.614 | 195.86 | 8.090 | 15.82 |
| 512 | 128 | 6144 | 2.647 | 193.44 | 8.055 | 15.89 |
| 512 | 128 | 6656 | 2.647 | 193.43 | 7.963 | 16.07 |
| 512 | 128 | 7168 | 2.686 | 190.62 | 7.975 | 16.05 |
| 512 | 128 | 7680 | 2.687 | 190.54 | 8.069 | 15.86 |
| 512 | 128 | 8192 | 2.691 | 190.28 | 7.990 | 16.02 |
| 512 | 128 | 8704 | 2.713 | 188.69 | 8.030 | 15.94 |
| 512 | 128 | 9216 | 2.690 | 190.33 | 8.081 | 15.84 |
| 512 | 128 | 9728 | 2.706 | 189.24 | 8.015 | 15.97 |
| 512 | 128 | 10240 | 2.712 | 188.80 | 8.034 | 15.93 |
| 512 | 128 | 10752 | 2.777 | 184.35 | 8.097 | 15.81 |
| 512 | 128 | 11264 | 2.728 | 187.69 | 8.142 | 15.72 |
| 512 | 128 | 11776 | 2.651 | 193.15 | 8.040 | 15.92 |
| 512 | 128 | 12288 | 2.715 | 188.57 | 8.032 | 15.94 |
| 512 | 128 | 12800 | 2.727 | 187.76 | 8.091 | 15.82 |
| 512 | 128 | 13312 | 2.693 | 190.12 | 8.145 | 15.72 |
| 512 | 128 | 13824 | 2.692 | 190.22 | 8.137 | 15.73 |
| 512 | 128 | 14336 | 2.579 | 198.54 | 7.770 | 16.47 |
| 512 | 128 | 14848 | 2.688 | 190.45 | 8.211 | 15.59 |
| 512 | 128 | 15360 | 2.592 | 197.57 | 8.075 | 15.85 |
| 512 | 128 | 15872 | 2.660 | 192.47 | 8.132 | 15.74 |
ik_llama.cpp cuda backed (Model: Qwen3-30B-A3B-BF16)
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 18.004 | 28.44 | 10.550 | 12.13 |
| 512 | 128 | 512 | 17.938 | 28.54 | 10.384 | 12.33 |
| 512 | 128 | 1024 | 17.859 | 28.67 | 10.370 | 12.34 |
| 512 | 128 | 1536 | 17.924 | 28.57 | 10.399 | 12.31 |
| 512 | 128 | 2048 | 17.989 | 28.46 | 10.386 | 12.32 |
| 512 | 128 | 2560 | 17.935 | 28.55 | 10.435 | 12.27 |
| 512 | 128 | 3072 | 18.006 | 28.44 | 10.513 | 12.18 |
| 512 | 128 | 3584 | 18.030 | 28.40 | 10.495 | 12.20 |
| 512 | 128 | 4096 | 18.063 | 28.35 | 10.578 | 12.10 |
| 512 | 128 | 4608 | 17.570 | 29.14 | 10.613 | 12.06 |
| 512 | 128 | 5120 | 17.685 | 28.95 | 10.600 | 12.08 |
| 512 | 128 | 5632 | 17.744 | 28.86 | 10.682 | 11.98 |
| 512 | 128 | 6144 | 17.911 | 28.59 | 10.640 | 12.03 |
| 512 | 128 | 6656 | 17.727 | 28.88 | 10.719 | 11.94 |
| 512 | 128 | 7168 | 17.529 | 29.21 | 10.636 | 12.03 |
| 512 | 128 | 7680 | 17.547 | 29.18 | 10.660 | 12.01 |
| 512 | 128 | 8192 | 17.517 | 29.23 | 10.708 | 11.95 |
| 512 | 128 | 8704 | 17.572 | 29.14 | 10.814 | 11.84 |
| 512 | 128 | 9216 | 17.542 | 29.19 | 10.813 | 11.84 |
| 512 | 128 | 9728 | 17.615 | 29.07 | 10.815 | 11.84 |
| 512 | 128 | 10240 | 17.573 | 29.14 | 10.839 | 11.81 |
| 512 | 128 | 10752 | 17.616 | 29.06 | 10.858 | 11.79 |
| 512 | 128 | 11264 | 17.670 | 28.98 | 10.899 | 11.74 |
| 512 | 128 | 11776 | 17.764 | 28.82 | 11.194 | 11.44 |
| 512 | 128 | 12288 | 17.622 | 29.05 | 10.960 | 11.68 |
| 512 | 128 | 12800 | 17.658 | 28.99 | 11.039 | 11.60 |
| 512 | 128 | 13312 | 17.661 | 28.99 | 11.036 | 11.60 |
| 512 | 128 | 13824 | 17.624 | 29.05 | 11.093 | 11.54 |
| 512 | 128 | 14336 | 17.587 | 29.11 | 11.094 | 11.54 |
| 512 | 128 | 14848 | 17.650 | 29.01 | 11.174 | 11.45 |
| 512 | 128 | 15360 | 17.648 | 29.01 | 11.190 | 11.44 |
| 512 | 128 | 15872 | 17.645 | 29.02 | 11.204 | 11.42 |
🗣️ Discussion
👤 ikawrakow replied the 2025-06-22 at 15:16:00:
Don't use -rtr for the bf16 model.
👤 Gaolingx replied the 2025-06-22 at 15:31:07:
wow, thanks a lot for your suggestion, the speed is normal now, I got ~65 PP speed and ~11.8 TG speed, but the cpu+cuda(-ot exps=CPU) speed doesn't seem to be much faster than the pure cpu, although it is a moe model. maybe I should do a more detailed benchmark.👤 ikawrakow replied the 2025-06-22 at 15:35:13:
You need larger u-batch size for better PP performance. The experts are in RAM and need to be offloaded to the GPU, which takes a while. If you runllama-sweep-benchwith-ub 2048you will see much better PP performance.👤 Gaolingx replied the 2025-07-02 at 10:54:41:
Hi, we all know that runtime repack(-rtr) is good to use with hybrid GPU + CPU, according to my research in the last few days, if we don't add the '-rtr' parameter, when we input long prompts, the cuda device needs to spend a long time on copying (and you can see in the task manager that the usage of 'Copy1' is quite high, but the usage ofCPUandCUDAis insufficient), and the processing speed of prompt words is also significantly lower than the performance with the '-rtr' parameter, or even worse than the cpu only, what is the reason for this?👤 ikawrakow replied the 2025-07-02 at 12:14:25:
I'm not sure I understand what could be the issue from the description. Can you tell us what is the model you are using and post your command line?👤 Gaolingx replied the 2025-07-02 at 14:23:58:
I'm not sure I understand what could be the issue from the description. Can you tell us what is the model you are using and post your command line?
Ok. I ran llama-sweep-bench again and tested the 16k context length data of three sets of qwen3 30ba3b models. They are that the q8_0 model with
-rtrparameter, the q8_0 model without-rtrparameter, and the bf16 model without-rtrparameter. To control the variables, in the test group without the -rtr parameter, I added the--no-mmapparameter. The rest of the startup parameters remained the same. The llama-sweep-bench startup parameters and test results are as follows.I have come to the conclusion that, whether it is the q8_0 or bf16 model, on my platform, if the
-rtrparameter is not used, the prompt processing performance during cpu+gpu hybrid inference will be significantly affected. The larger the model, the more serious this situation is. However, The token generation speed is normal and in line with expectations. What causes this? How does the runtime repack tensors(-rtr) to improve prompt processing performance?
ik_llama.cpp cuda backed (Model: Qwen3-30B-A3B-Q8_0 with
-rtr)./llama-sweep-bench -m "D:\Downloads\Qwen3-30B-A3B-Q8_0.gguf" -c 16384 -t 16 -ngl 49 -fa -rtr -fmoe -ser 6,1 -ot exps=CPU
main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 49, n_threads = 16, n_threads_batch = 16
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s 512 128 0 2.247 227.87 5.941 21.54 512 128 512 2.293 223.28 6.718 19.05 512 128 1024 2.354 217.46 6.981 18.34 512 128 1536 2.382 214.94 7.088 18.06 512 128 2048 2.406 212.81 7.011 18.26 512 128 2560 2.394 213.84 7.078 18.09 512 128 3072 2.408 212.61 7.080 18.08 512 128 3584 2.383 214.83 7.127 17.96 512 128 4096 2.415 211.97 7.083 18.07 512 128 4608 2.391 214.12 7.170 17.85 512 128 5120 2.461 208.03 7.216 17.74 512 128 5632 2.448 209.11 7.233 17.70 512 128 6144 2.458 208.31 7.286 17.57 512 128 6656 2.456 208.48 7.251 17.65 512 128 7168 2.413 212.17 7.160 17.88 512 128 7680 2.450 208.98 7.310 17.51 512 128 8192 2.482 206.26 7.302 17.53 512 128 8704 2.365 216.50 7.130 17.95 512 128 9216 2.371 215.94 7.109 18.01 512 128 9728 2.381 214.99 7.264 17.62 512 128 10240 2.395 213.81 7.192 17.80 512 128 10752 2.402 213.16 7.103 18.02 512 128 11264 2.402 213.18 7.005 18.27 512 128 11776 2.372 215.87 7.023 18.22 512 128 12288 2.474 206.92 6.762 18.93 512 128 12800 2.457 208.42 6.808 18.80 512 128 13312 2.442 209.64 6.740 18.99 512 128 13824 2.447 209.22 6.824 18.76 512 128 14336 2.473 207.03 6.704 19.09 512 128 14848 2.524 202.86 6.695 19.12 512 128 15360 2.573 199.00 7.093 18.05 512 128 15872 2.520 203.17 6.611 19.36
ik_llama.cpp cuda backed (Model: Qwen3-30B-A3B-Q8_0 without
-rtr)./llama-sweep-bench -m "D:\Downloads\Qwen3-30B-A3B-Q8_0.gguf" -c 16384 -t 16 -ngl 49 -fa --no-mmap -fmoe -ser 6,1 -ot exps=CPU
main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 49, n_threads = 16, n_threads_batch = 16
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s 512 128 0 9.527 53.74 6.171 20.74 512 128 512 9.556 53.58 6.117 20.93 512 128 1024 9.554 53.59 6.184 20.70 512 128 1536 9.551 53.61 6.149 20.82 512 128 2048 9.590 53.39 6.255 20.46 512 128 2560 9.523 53.76 6.230 20.55 512 128 3072 9.509 53.84 6.257 20.46 512 128 3584 9.555 53.58 6.274 20.40 512 128 4096 9.640 53.11 6.705 19.09 512 128 4608 9.638 53.12 6.409 19.97 512 128 5120 9.615 53.25 6.388 20.04 512 128 5632 9.652 53.04 6.360 20.12 512 128 6144 9.662 52.99 6.430 19.91 512 128 6656 9.702 52.77 6.480 19.75 512 128 7168 9.609 53.28 6.494 19.71 512 128 7680 9.606 53.30 6.485 19.74 512 128 8192 9.622 53.21 6.521 19.63 512 128 8704 9.620 53.22 6.546 19.55 512 128 9216 9.559 53.56 6.602 19.39 512 128 9728 9.538 53.68 6.542 19.57 512 128 10240 9.563 53.54 6.626 19.32 512 128 10752 9.610 53.28 6.561 19.51 512 128 11264 9.689 52.85 6.618 19.34 512 128 11776 9.619 53.23 6.628 19.31 512 128 12288 9.654 53.03 6.452 19.84 512 128 12800 9.800 52.24 6.578 19.46 512 128 13312 9.641 53.11 6.613 19.35 512 128 13824 9.638 53.12 6.513 19.65 512 128 14336 9.686 52.86 6.555 19.53 512 128 14848 9.729 52.62 6.609 19.37 512 128 15360 9.702 52.77 6.624 19.32 512 128 15872 9.697 52.80 6.636 19.29
ik_llama.cpp cuda backed (Model: Qwen3-30B-A3B-BF16 without
-rtr)./llama-sweep-bench -m "D:\Downloads\unsloth\Qwen3-30B-A3B-GGUF\BF16\Qwen3-30B-A3B-BF16-00001-of-00002.gguf" -c 16384 -t 16 -ngl 49 -fa --no-mmap -fmoe -ser 6,1 -ot exps=CPU
main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 49, n_threads = 16, n_threads_batch = 16
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s 512 128 0 17.771 28.81 9.791 13.07 512 128 512 17.398 29.43 9.025 14.18 512 128 1024 17.305 29.59 9.030 14.17 512 128 1536 17.367 29.48 9.054 14.14 512 128 2048 17.859 28.67 9.342 13.70 512 128 2560 17.700 28.93 9.143 14.00 512 128 3072 17.696 28.93 9.170 13.96 512 128 3584 17.939 28.54 9.241 13.85 512 128 4096 17.926 28.56 9.212 13.89 512 128 4608 17.714 28.90 9.280 13.79 512 128 5120 17.822 28.73 9.226 13.87 512 128 5632 17.830 28.72 9.273 13.80 512 128 6144 17.749 28.85 9.121 14.03 512 128 6656 17.581 29.12 9.356 13.68 512 128 7168 17.517 29.23 9.401 13.62 512 128 7680 17.408 29.41 9.393 13.63 512 128 8192 17.451 29.34 9.371 13.66 512 128 8704 17.409 29.41 9.544 13.41 512 128 9216 17.443 29.35 9.476 13.51 512 128 9728 17.449 29.34 10.037 12.75 512 128 10240 17.370 29.48 9.480 13.50 512 128 10752 17.472 29.30 9.504 13.47 512 128 11264 17.612 29.07 9.500 13.47 512 128 11776 17.492 29.27 9.580 13.36 512 128 12288 17.384 29.45 9.569 13.38 512 128 12800 18.000 28.44 9.436 13.56 512 128 13312 17.759 28.83 9.493 13.48 512 128 13824 17.905 28.60 9.442 13.56 512 128 14336 17.843 28.69 9.372 13.66 512 128 14848 17.928 28.56 9.538 13.42 512 128 15360 17.902 28.60 9.436 13.57 512 128 15872 17.971 28.49 9.336 13.71 👤 ikawrakow replied the 2025-07-02 at 14:40:43:
When you use-rtr, the tensors not offloaded to the GPU get repacked to a row-interleaved version.Q8_0becomesQ8_0_R8, andBF16becomesBF16_R16.Q8_0_R8andBF16_R16are not supported by the CUDA backend, so matrix multiplications with these tensors are done on the CPU. When you do not use-rtr, there is no repacking, CUDA supportsQ8_0andBF16, so the tensors stored in RAM get copied to the GPU to do matrix multiplications. If the model is large, and your PCI-E is not very fast, the copying to VRAM takes a long time, so your PP performance becomes low. You can improve the performance by using larger u-batches because more work is done per copy to the GPU (tensors are copied once, but multiply 2048 tokens with-ub 2048. To accomplish the same with the u-batch of 512 you are using, tensors need to get copied 4 times). If you don't want to repack, and don't want to use larger u-batches, you can prevent copying to the GPU using-op 26,0,27,0,29,0. In that casebf16performance will be slightly lower than with-rtr, butQ8_0performance will be somewhere in the middle between-rtrand no-rtr.