
📝 #476 - Research: performance divergence

Author VinnyG9
State Open
Created 2025-05-30
Updated 2025-06-14

Description

Research Stage

  • Background Research (Let's try to avoid reinventing the wheel)
  • Hypothesis Formed (How do you think this will work and what will its effect be?)
  • Strategy / Implementation Forming
  • Analysis of results
  • Debrief / Documentation (So people in the future can learn from us)

Previous existing literature and research

When I ran benchmarks previously I got pretty good results on CPU inference, around 30-40 t/s on Qwen3 30B. Now that I am trying to run the server for aider, the speed is less than half. Is this expected?

Hypothesis

No response

Implementation

No response

Analysis

No response

Relevant log output



💬 Conversation

👤 ikawrakow commented the 2025-05-31 at 05:22:24:

Please be specific in your issue. Provide quantization type used, system information, full command line to start the server and, ideally, last good/first bad commit where you observe the performance change. See #474 for an example.
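
For anyone gathering that information on Linux, something along these lines covers most of it (the nvidia-smi line only applies if a GPU is present):

lscpu | head -n 25      # CPU model, sockets, NUMA nodes
free -h                 # installed RAM
nvidia-smi -L           # GPU(s), if any
git log -1 --oneline    # the ik_llama.cpp commit the build came from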


👤 VinnyG9 commented the 2025-05-31 at 10:18:09:

I've been testing ik_llama.cpp for about a month, mostly benchmarks, and can't report any regression. I'm running bare build-time flags (NATIVE=1, CUDA=1, CUDA_ARCH) and runtime flags (rtr, fa, fmoe, numa).

OS: latest Ubuntu/Mint.

No matter the model I try (dense, MoE, etc.), the server gets less than 50% of the performance the benchmarks show. When running mainline, the benchmark numbers are way lower but are consistent with the performance numbers when running the server.


👤 ikawrakow commented the 2025-05-31 at 14:28:49:

So, the issue is that the performance you observe when running llama-server is 2X lower than the performance you observe when running one of the benchmark tools?

Is the PP performance affected, or the TG performance, or both?

Generic statements will lead nowhere (other than the issue getting closed).


👤 VinnyG9 commented the 2025-06-01 at 03:16:25:

So, the issue is that the performance you observe when running llama-server is 2X lower than the performance you observe when running one of the benchmark tools?

yes

Is the PP performance affected, or the TG performance, or both?

Both, literally a bit less than half for PP/TG. Do you think it could be a NUMA issue? I tried with stock BIOS settings but got worse results, albeit with closer bench/server numbers.


👤 saood06 commented the 2025-06-01 at 03:26:18:

Both, literally a bit less than half for PP/TG. Do you think it could be a NUMA issue? I tried with stock BIOS settings but got worse results, albeit with closer bench/server numbers.

Please provide the exact commands used for benching and for server.

I brought llama-sweep-bench over to this repo and use it regularly because in my experience it does accurately reflect server performance (including how it changes across different depths), to the point where I run it to validate that the model has warmed up and loaded into RAM correctly (as my server is very sensitive to memory placement and the model is stored on HDDs so performance is unusable until the model is warmed up).
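
A warm-up/validation run along those lines can be as simple as something like the following (model path, context size, and thread count are placeholders):

./build/bin/llama-sweep-bench -m /path/to/model.gguf -c 8192 -t 32 -fa -fmoe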


👤 Ph0rk0z commented the 2025-06-01 at 13:24:16:

It's funny because I often get slightly better speeds on the server than in sweep-bench. Nowhere near half, though, so something is wrong.

The only NUMA thing that helps is adding interleave=all to the command line you run. Setting NUMA balancing in the kernel to 0 doesn't move the needle one way or the other, despite the warning.
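
For reference, the two knobs described here look roughly like this on Linux (model path and the remaining flags are placeholders):

numactl --interleave=all ./build/bin/llama-server -m /path/to/model.gguf ...
echo 0 | sudo tee /proc/sys/kernel/numa_balancing    # silences the numa_balancing warning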

One thing I did notice is that the bench can be a little irregular at times by a t/s here or there. May I also suggest getting ccmake when compiling so you can set your flags and forget it.

edit:

So playing with 4096 batches showed me something. In the server, prompt speed on smaller prompts prints as half speed or less. I was getting 110 max, and on a 2k-token prompt it would report 60-70. A large enough prompt still returns the correct speed. Can't explain the 1/2 TG though.


👤 VinnyG9 commented the 2025-06-02 at 16:34:43:

Both, literally a bit less than half for PP/TG. Do you think it could be a NUMA issue? I tried with stock BIOS settings but got worse results, albeit with closer bench/server numbers.

Please provide the exact commands used for benching and for server.

I brought llama-sweep-bench over to this repo and use it regularly because in my experience it does accurately reflect server performance (including how it changes across different depths), to the point where I run it to validate that the model has warmed up and loaded into RAM correctly (as my server is very sensitive to memory placement and the model is stored on HDDs so performance is unusable until the model is warmed up).

server:

  "Qwen3-30B-MoE":
    env:
      - "CUDA_VISIBLE_DEVICES= "
    proxy: "http://192.168.15.101:9999"
    cmd: |
      /ssd/share/Software/backends/ik_llama.cpp/build/bin/llama-server
      --host localhost --port 9999 --flash-attn
      --cache-type-k f16 --cache-type-v f16
      --ctx-size 40960
      --samplers "top_k;top_p;min_p;temperature;typ_p;xtc"
      --temp 0.6 --repeat-penalty 1.0
      --min-p 0.01 --top-k 20 --top-p 0.95
      -ngl 0 -rtr -fmoe -ser 7,1 --threads 31 --numa distribute
      --model /models/gguf/MoE/Qwen3-30B-A3B-128K-UD-Q4_K_XL.gguf

output

INFO [   launch_slot_with_task] slot is processing task | tid="123321346306048" timestamp=1748881719 id_slot=0 id_task=399
INFO [            update_slots] kv cache rm [p0, end) | tid="123321346306048" timestamp=1748881719 id_slot=0 id_task=399 p0=450
INFO [           print_timings] prompt eval time     =    3133.99 ms /   388 tokens (    8.08 ms per token,   123.80 tokens per second) | tid="123321346306048" timestamp=1748881752 id_slot=0 id_task=399 t_prompt_processing=3133.991 n_prompt_tokens_processed=388 t_token=8.077296391752578 n_tokens_second=123.80380160632241
INFO [           print_timings] generation eval time =   29634.35 ms /   400 runs   (   74.09 ms per token,    13.50 tokens per second) | tid="123321346306048" timestamp=1748881752 id_slot=0 id_task=399 t_token_generation=29634.354 n_decoded=400 t_token=74.085885 n_tokens_second=13.497847801912604
INFO [           print_timings]           total time =   32768.35 ms | tid="123321346306048" timestamp=1748881752 id_slot=0 id_task=399 t_prompt_processing=3133.991 t_token_generation=29634.354 t_total=32768.345
INFO [            update_slots] slot released | tid="123321346306048" timestamp=1748881752 id_slot=0 id_task=399 n_ctx=40960 n_past=1237 n_system_tokens=0 n_cache_tokens=1237 truncated=false
INFO [            update_slots] all slots are idle | tid="123321346306048" timestamp=1748881752

sweep-bench:

CUDA_VISIBLE_DEVICES= numactl --interleave=all /ssd/share/Software/backends/ik_llama.cpp/build/bin/llama-sweep-bench -m /models/gguf/MoE/Qwen3-30B-A3B-128K-UD-Q4_K_XL.gguf -rtr -fa -fmoe --numa distribute -t 31 -c 8196 -b 2048 -ub 512

main: n_kv_max = 8448, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = -1, n_threads = 31, n_threads_batch = 31

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    1.981 |   258.43 |    4.757 |    26.91 |
|   512 |    128 |    512 |    2.157 |   237.35 |    4.642 |    27.57 |
|   512 |    128 |   1024 |    2.421 |   211.47 |    5.040 |    25.40 |
|   512 |    128 |   1536 |    2.844 |   180.04 |    4.951 |    25.85 |
|   512 |    128 |   2048 |    2.991 |   171.20 |    5.313 |    24.09 |
|   512 |    128 |   2560 |    3.222 |   158.89 |    5.136 |    24.92 |
|   512 |    128 |   3072 |    3.525 |   145.24 |    5.442 |    23.52 |
|   512 |    128 |   3584 |    3.758 |   136.25 |    5.559 |    23.03 |
|   512 |    128 |   4096 |    4.089 |   125.20 |    5.580 |    22.94 |
|   512 |    128 |   4608 |    4.262 |   120.14 |    5.563 |    23.01 |
|   512 |    128 |   5120 |    4.832 |   105.96 |    6.061 |    21.12 |
|   512 |    128 |   5632 |    4.954 |   103.36 |    6.060 |    21.12 |
|   512 |    128 |   6144 |    5.218 |    98.12 |    6.202 |    20.64 |
|   512 |    128 |   6656 |    5.664 |    90.40 |    6.193 |    20.67 |
|   512 |    128 |   7168 |    5.776 |    88.65 |    6.122 |    20.91 |
|   512 |    128 |   7680 |    6.135 |    83.46 |    7.535 |    16.99 |
failed to decode the batch, n_batch = 2048, ret = 1
main: llama_decode() failed

llama-bench

CUDA_VISIBLE_DEVICES= /ssd/share/Software/backends/ik_llama.cpp/build/bin/llama-bench -m /models/gguf/MoE/Qwen3-30B-A3B-128K-UD-Q4_K_XL.gguf -t 31 --numa distribute -rtr 1 -fa 1 -fmoe 1 -n 64,128,256,512 -p 512,1024,2048,4096
ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected
WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance
============ Repacked 337 tensors

| model                     |      size |  params | backend | ngl | threads | fa | rtr | fmoe |   test |           t/s |
| ------------------------- | --------: | ------: | ------- | --: | ------: | -: | --: | ---: | -----: | ------------: |
| qwen3moe ?B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA    |  99 |      31 |  1 |   1 |    1 |  pp512 | 254.35 ± 7.37 |
| qwen3moe ?B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA    |  99 |      31 |  1 |   1 |    1 | pp1024 | 226.91 ± 7.88 |
| qwen3moe ?B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA    |  99 |      31 |  1 |   1 |    1 | pp2048 | 206.85 ± 6.65 |
| qwen3moe ?B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA    |  99 |      31 |  1 |   1 |    1 | pp4096 | 163.43 ± 2.74 |
| qwen3moe ?B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA    |  99 |      31 |  1 |   1 |    1 |   tg64 |  32.71 ± 0.89 |
| qwen3moe ?B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA    |  99 |      31 |  1 |   1 |    1 |  tg128 |  33.11 ± 0.54 |
| qwen3moe ?B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA    |  99 |      31 |  1 |   1 |    1 |  tg256 |  31.07 ± 1.80 |
| qwen3moe ?B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA    |  99 |      31 |  1 |   1 |    1 |  tg512 |  27.44 ± 3.13 |
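
For a closer apples-to-apples comparison with the server numbers above, the same sweep-bench run could be repeated with the server's runtime flags, notably the 40960-token context and -ser 7,1 (assuming sweep-bench accepts -ser as a common runtime flag), e.g.:

CUDA_VISIBLE_DEVICES= numactl --interleave=all /ssd/share/Software/backends/ik_llama.cpp/build/bin/llama-sweep-bench -m /models/gguf/MoE/Qwen3-30B-A3B-128K-UD-Q4_K_XL.gguf -rtr -fa -fmoe -ser 7,1 --numa distribute -t 31 -c 40960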

👤 VinnyG9 commented the 2025-06-02 at 16:55:21:

and running on GPU

INFO [            update_slots] all slots are idle | tid="123510069428224" timestamp=1748883135
INFO [   launch_slot_with_task] slot is processing task | tid="123510069428224" timestamp=1748883146 id_slot=0 id_task=408
INFO [            update_slots] kv cache rm [p0, end) | tid="123510069428224" timestamp=1748883146 id_slot=0 id_task=408 p0=456
INFO [           print_timings] prompt eval time     =    1171.24 ms /   381 tokens (    3.07 ms per token,   325.30 tokens per second) | tid="123510069428224" timestamp=1748883151 id_slot=0 id_task=408 t_prompt_processing=1171.238 n_prompt_tokens_processed=381 t_token=3.0741154855643047 n_tokens_second=325.2968226782259
INFO [           print_timings] generation eval time =    3364.22 ms /   124 runs   (   27.13 ms per token,    36.86 tokens per second) | tid="123510069428224" timestamp=1748883151 id_slot=0 id_task=408 t_token_generation=3364.215 n_decoded=124 t_token=27.13076612903226 n_tokens_second=36.8585242025257
INFO [           print_timings]           total time =    4535.45 ms | tid="123510069428224" timestamp=1748883151 id_slot=0 id_task=408 t_prompt_processing=1171.238 t_token_generation=3364.215 t_total=4535.453
INFO [            update_slots] slot released | tid="123510069428224" timestamp=1748883151 id_slot=0 id_task=408 n_ctx=40960 n_past=960 n_system_tokens=0 n_cache_tokens=960 truncated=false

main: n_kv_max = 8448, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 31, n_threads_batch = 31

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    1.241 |   412.49 |    2.392 |    53.51 |
|   512 |    128 |    512 |    1.186 |   431.64 |    2.512 |    50.96 |
|   512 |    128 |   1024 |    1.223 |   418.59 |    2.613 |    48.99 |
|   512 |    128 |   1536 |    1.232 |   415.57 |    2.713 |    47.18 |
|   512 |    128 |   2048 |    1.275 |   401.43 |    2.819 |    45.41 |
|   512 |    128 |   2560 |    1.267 |   403.95 |    2.933 |    43.64 |
|   512 |    128 |   3072 |    1.318 |   388.40 |    3.042 |    42.07 |
|   512 |    128 |   3584 |    1.332 |   384.52 |    3.146 |    40.68 |
|   512 |    128 |   4096 |    1.366 |   374.89 |    3.267 |    39.18 |
|   512 |    128 |   4608 |    1.377 |   371.86 |    3.386 |    37.80 |
|   512 |    128 |   5120 |    1.389 |   368.53 |    3.533 |    36.23 |
|   512 |    128 |   5632 |    1.409 |   363.27 |    3.633 |    35.23 |
|   512 |    128 |   6144 |    1.432 |   357.60 |    3.710 |    34.51 |
|   512 |    128 |   6656 |    1.458 |   351.20 |    3.796 |    33.72 |
|   512 |    128 |   7168 |    1.492 |   343.10 |    3.905 |    32.78 |
|   512 |    128 |   7680 |    1.493 |   342.86 |    4.017 |    31.86 |
failed to decode the batch, n_batch = 2048, ret = 1
main: llama_decode() failed

👤 saood06 commented the 2025-06-03 at 00:42:25:

Is there any reason why you use 31 threads? I would say try using 32 threads and see if that helps your performance (but I don't think that is the reason for the gap in performance between server and sweep).

See this comment about why that might be a bad choice: https://github.com/ikawrakow/ik_llama.cpp/discussions/223#discussioncomment-12292591


👤 VinnyG9 commented the 2025-06-03 at 01:37:17:

Is there any reason why you use 31 threads? I would say try using 32 threads and see if that helps your performance (but I don't think that is the reason for the gap in performance between server and sweep).

See this comment about why that might be a bad choice: #223 (comment)

Yeah, when I benched it, performance improved with the number of (physical) threads up to 31-32, though only for the MoEs.

Is it normal that during generation the model pauses on every comma? I find it funny.


👤 nux commented the 2025-06-03 at 02:28:40:

Not sure if relevant here - the topic name seems so. Was looking into some performance issues and found this thread.

nux@red ~/dev/ik_llama.cpp $ ./build/bin/llama-bench -m /mnt/nvme/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf -p 512 -t 32 -mla 2 -fa 1 -fmoe 1 -ngl 99 --override-tensor "exps=CPU" -amb 512

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes

| model                             |       size |   params | backend | ngl | fa | mla | amb | fmoe |  test |          t/s |
| --------------------------------- | ---------: | -------: | ------- | --: | -: | --: | --: | ---: | ----: | -----------: |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA    |  99 |  1 |   2 | 512 |    1 | pp512 | 28.36 ± 0.03 |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA    |  99 |  1 |   2 | 512 |    1 | tg128 |  4.80 ± 0.01 |

I had a .txt file of an older benchmark showing this:

nux@red ~/dev/ik_llama.cpp $ ./build/bin/llama-bench -m /mnt/nvme/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf -p 512 -t 32 -mla 2 -fa 1 -fmoe 1 -ngl 99 --override-tensor "exps=CPU" -amb 512

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes

| model                             |       size |   params | backend | ngl | fa | mla | amb | fmoe |  test |          t/s |
| --------------------------------- | ---------: | -------: | ------- | --: | -: | --: | --: | ---: | ----: | -----------: |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA    |  99 |  1 |   2 | 512 |    1 | pp512 | 76.13 ± 2.43 |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA    |  99 |  1 |   2 | 512 |    1 | tg128 |  9.79 ± 0.08 |

build: 1ea1df4b (3659)

I checked out the same commit and ran it again, and the results were the same.

Am I missing anything obvious here? Did something change that I need to adjust for?

Building like this:

cmake -B build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build build --config Release -j --clean-first

Running on 2x 9115, 768 GB RAM, and a 3090 GPU.

I ended up troubleshooting the performance I am seeing now when trying to figure out why ubergarm/DeepSeek-R1-0528-GGUF/IQ4_KS_R4 was running slowly.

If you think it makes sense for me to open a new issue I can, as my tg/pp have both slowed down, unlike what I'm reading about above

Thanks


👤 ikawrakow commented the 2025-06-03 at 05:06:02:

@Fuckingnameless

Your system seems to be one of those that are extremely finicky about tensor placement in RAM. Looking at the llama-bench vs llama-sweep-bench TG results I see a 20% difference in TG performance at zero context. There is an obvious difference between your server command and your benchmark runs: context is 41k tokens for the server and 8k or less in the benchmarks. This potentially changes where things go in RAM. Also, seeing numactl and --numa involved immediately raises red flags. Do you have a dual socket system? (I don't remember the system configurations of all users, so adding the details of your system to the issue would be useful.)

Having said all this, I still find a factor of 2 difference in CPU performance strange. The difference is much less on CUDA, so I would focus on trying to resolve the CPU performance first.


👤 ikawrakow commented the 2025-06-03 at 05:20:03:

@nux

There was PR #461 that added CUDA implementation for some of the row-interleaved quants. This results in a change in behavior for your IQ4_K_R4 quantized model: prior to PR #461 all matrix multiplications for X_R4 tensors had to be done on the CPU. After PR #461, for batch size >= 32 they get offloaded to the GPU to perform the matrix multiplications. If the PCI-E speed is low for some reason, this can make PP slower. You can try adding -op 26,0,27,0,29,0 to the command line to see what happens. This will disable the offload to the GPU.

I have no explanation for the 2X lower TG performance. Try using -mla 3, which has been supported on the GPU since PR #408/#413.
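
A quick way to try both suggestions against the llama-bench command above (they can of course also be tested one at a time) would be something like:

./build/bin/llama-bench -m /mnt/nvme/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf -p 512 -t 32 -mla 3 -fa 1 -fmoe 1 -ngl 99 --override-tensor "exps=CPU" -amb 512 -op 26,0,27,0,29,0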


👤 nux commented the 2025-06-03 at 12:56:00:

I will put together a script to go through commits and benchmark to figure out exactly when this started. What I'm noticing right now is that while llama-bench is running, the GPU utilization drops to 38-39% for about 10 seconds and then goes back up to 99%. I see this pattern repeating with GPU usage % for as long as llama-bench runs.

I have been using -mla 3, but ran the benchmark above with -mla 2 for comparison purposes. PCI-E is x16. Will post when I figure out at what commit performance went down.


👤 nux commented the 2025-06-03 at 14:00:55:

Commit 0976467 is when the performance went down for me. I was running:

for i in $(cut -d " " -f1 commits.txt); do git checkout $i; ./cmd-build.sh; ./start-bench.sh >> results.txt; done

start-bench is: ./build/bin/llama-bench -m /mnt/nvme/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf -p 512 -t 32 -mla 2 -fa 1 -fmoe 1 -ngl 99 --override-tensor "exps=CPU" -amb 512

build: ccd6d9cd (3716)

| model                             |       size |   params | backend | ngl | fa | mla | amb | fmoe |  test |          t/s |
| --------------------------------- | ---------: | -------: | ------- | --: | -: | --: | --: | ---: | ----: | -----------: |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA    |  99 |  1 |   2 | 512 |    1 | pp512 | 26.74 ± 0.05 |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA    |  99 |  1 |   2 | 512 |    1 | tg128 |  4.80 ± 0.00 |

build: 09764678 (3715)

| model                             |       size |   params | backend | ngl | fa | mla | amb | fmoe |  test |          t/s |
| --------------------------------- | ---------: | -------: | ------- | --: | -: | --: | --: | ---: | ----: | -----------: |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA    |  99 |  1 |   2 | 512 |    1 | pp512 | 26.75 ± 0.04 |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA    |  99 |  1 |   2 | 512 |    1 | tg128 |  4.81 ± 0.00 |

build: 14292913 (3714)

| model                             |       size |   params | backend | ngl | fa | mla | amb | fmoe |  test |          t/s |
| --------------------------------- | ---------: | -------: | ------- | --: | -: | --: | --: | ---: | ----: | -----------: |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA    |  99 |  1 |   2 | 512 |    1 | pp512 | 76.24 ± 1.44 |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA    |  99 |  1 |   2 | 512 |    1 | tg128 | 10.08 ± 0.06 |

build: 24c010b3 (3713)

| model                             |       size |   params | backend | ngl | fa | mla | amb | fmoe |  test |          t/s |
| --------------------------------- | ---------: | -------: | ------- | --: | -: | --: | --: | ---: | ----: | -----------: |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA    |  99 |  1 |   2 | 512 |    1 | pp512 | 77.25 ± 0.70 |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA    |  99 |  1 |   2 | 512 |    1 | tg128 | 10.07 ± 0.06 |
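
The same search could also be automated with git bisect run; a minimal sketch, assuming a check-perf.sh script and treating ~50 t/s for pp512 as the good/bad threshold (both the script name and the threshold are hypothetical):

git bisect start
git bisect bad ccd6d9cd      # slow build (3716)
git bisect good 24c010b3     # fast build (3713)
git bisect run ./check-perf.sh

where check-perf.sh is roughly:

#!/bin/sh
# exit code 125 tells bisect to skip commits that fail to build
cmake --build build --config Release -j || exit 125
# run the same llama-bench command and pull the pp512 t/s value out of the markdown table
tps=$(./build/bin/llama-bench -m /mnt/nvme/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf -p 512 -t 32 -mla 2 -fa 1 -fmoe 1 -ngl 99 --override-tensor "exps=CPU" -amb 512 2>/dev/null | awk -F'|' '/pp512/ { split($(NF-1), a, " "); print a[1] }')
# exit 0 (good) while pp512 is still above the threshold, 1 (bad) once it drops
awk -v t="$tps" 'BEGIN { exit !(t > 50) }'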

👤 ikawrakow commented the 2025-06-03 at 14:10:23:

@nux Maybe it is better if you open a new issue with your findings. You can also add the tensors being used in your model when you do so. This issue is about a discrepancy between performance observed with llama-bench/llama-sweep-bench and performance observed with llama-server.


👤 VinnyG9 commented the 2025-06-10 at 00:38:42:

@Fuckingnameless

Your system seems to be one of those that are extremely finicky about tensor placement in RAM. Looking at the llama-bench vs llama-sweep-bench TG results I see a 20% difference in TG performance at zero context. There is an obvious difference between your server command and your benchmark runs: context is 41k tokens for the server and 8k or less in the benchmarks. This potentially changes where things go in RAM. Also, seeing numactl and --numa involved immediately raises red flags. Do you have a dual socket system? (I don't remember the system configurations of all users, so adding the details of your system to the issue would be useful.)

Having said all this, I still find a factor of 2 difference in CPU performance strange. The difference is much less on CUDA, so I would focus on trying to resolve the CPU performance first.

It happens on Qwen3 30B but not on the 235B MoE, where I see almost 1:1 numbers for sweep-bench vs the API. I also ran a dense 70B test and TG was 1:1.

I make sure the runtime flags are equal between runs. Should I be building with -DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=ON?
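
For reference, a configure-and-build along the lines nux posted above would be something like the following (flags beyond -DGGML_CUDA=ON are optional; -DBUILD_SHARED_LIBS and -DLLAMA_CURL control linking and download support and are unlikely to change inference speed):

cmake -B build -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=ON
cmake --build build --config Release -j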


👤 cg10036 commented the 2025-06-14 at 15:58:34:

Hi, I'm leaving a comment because I seem to be experiencing a similar issue.

Quantization type: IQ4_XS
System: a clean Debian 12 install with only llama.cpp and ik_llama.cpp, single Xeon E5-2686v4 CPU, 128GB (16GBx8) DDR3 RAM, no GPU, no swap.

Command:

~/ik_llama.cpp/build_cpu/bin/llama-cli --no-mmap --model ~/unsloth/Qwen3-235B-A22B-128K-GGUF/IQ4_XS/Qwen3-235B-A22B-128K-IQ4_XS-00001-of-00003.gguf --threads 16 --ctx-size 16384 --seed 3407 --temp 0.6 --min-p 0.0 --top-p 0.95 --top-k 20 --prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section. /no_think<|im_end|>\n<|im_start|>assistant\n" -fa -rtr -fmoe

I am also experiencing an issue where the output briefly pauses for 1-2 seconds at commas. This problem does not occur with llama.cpp.

While it doesn't happen at every single comma, the output generation is perfectly smooth in parts of the text without commas.

Interestingly, if I add "9. Replace every comma with a pipe in python code" to the prompt, the pausing issue disappears.

What could be the problem?

  1. with comma: https://youtu.be/n7tr2N_2DK8
  2. without comma: https://youtu.be/Zy4r61EKq18

Additionally, I've confirmed this issue is present in the initial commit that added Qwen3 support: 9ba362706c.