mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-04-30 19:31:48 +00:00
docs: update qwen3next perf report for cuda MoE/SSM tuning
@@ -199,3 +199,119 @@ This is substantially improved versus earlier observed TG (`~2` to `~4` t/s) in
- Some runs still offload few/no layers depending on available VRAM at run time, which can mask CUDA-path gains.
- `SCHED_MAX_SPLITS` limits at very aggressive `(b, ub)` settings are still a separate scaling constraint.
- Additional backend-level profiling is still needed to determine whether the remaining gap to top-end mainline numbers is dominated by offload limits, scheduler split overhead, or other kernels.
## 2026-02-06 CUDA MoE/SSM Optimization Update

### Applied changes in this update

1. MoE row mapping in CUDA `mul_mat_id` paths (`ggml/src/ggml-cuda.cu`):
   - Replaced the per-call `ids` device->host copy, the host-side count/build step, and the mapping host->device copy.
   - Added device-side count + exclusive prefix sum + scatter kernels:
     - `k_moe_row_count`
     - `k_moe_row_exclusive_scan`
     - `k_moe_row_scatter`
   - Kept existing call-site logic intact by copying only compact metadata back (`moe_counts`, `cum_moe_counts`, invalid-id flag).
   - Net effect: removes large host round-trip traffic from a hot MoE routing path.
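The device-side replacement follows the standard count -> exclusive prefix sum -> scatter pattern. A minimal host-side C++ sketch of the same three steps (serial here for clarity; the struct and function names are illustrative, not the actual CUDA code):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side sketch of the pipeline that k_moe_row_count,
// k_moe_row_exclusive_scan, and k_moe_row_scatter implement on-device.
struct MoeRowMap {
    std::vector<int32_t> counts;      // rows routed to each expert
    std::vector<int32_t> cum_counts;  // exclusive prefix sum of counts
    std::vector<int32_t> row_map;     // row indices grouped by expert
    bool has_invalid_id = false;
};

MoeRowMap build_row_map(const std::vector<int32_t>& ids, int n_experts) {
    MoeRowMap m;
    // Step 1: count rows per expert (k_moe_row_count analogue).
    m.counts.assign(n_experts, 0);
    for (int32_t id : ids) {
        if (id < 0 || id >= n_experts) { m.has_invalid_id = true; continue; }
        m.counts[id]++;
    }
    // Step 2: exclusive prefix sum gives each expert's output offset.
    m.cum_counts.assign(n_experts, 0);
    for (int e = 1; e < n_experts; ++e)
        m.cum_counts[e] = m.cum_counts[e - 1] + m.counts[e - 1];
    // Step 3: scatter row indices into per-expert contiguous ranges.
    m.row_map.assign(ids.size(), -1);
    std::vector<int32_t> cursor = m.cum_counts;
    for (size_t row = 0; row < ids.size(); ++row) {
        int32_t id = ids[row];
        if (id >= 0 && id < n_experts) m.row_map[cursor[id]++] = (int32_t)row;
    }
    return m;
}
```

Only `counts`, `cum_counts`, and the invalid-id flag need to travel back to the host, which is why the device version can limit the device->host copy to that compact metadata while the full row map stays resident on the GPU.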
2. Qwen3Next SSM conv path for `n_kv > 1` (`ggml/src/ggml-cuda/ssm-conv.cu`):
   - Added a guarded fast path for decode-like multi-sequence batches where each token maps to one unique sequence (no multi-sequence fan-out per token).
   - Added:
     - `ssm_conv_validate_unique_seq_map`
     - `ssm_conv_multi_seq_unique_f32_kernel`
     - `ssm_conv_multi_seq_unique_f32_kernel_nc4`
   - If the input pattern does not satisfy the fast-path constraints, execution falls back to the existing kernel path unchanged.
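The guard condition amounts to a uniqueness check over the token-to-sequence mapping. A C++ sketch of what such a validation does, assuming a per-token sequence-id table (the data layout and names here are hypothetical, not the real kernel interface):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of a check like ssm_conv_validate_unique_seq_map: the fast path
// requires every token to reference exactly one in-range sequence and no
// two tokens to share a sequence, so the kernel can index the conv state
// directly by sequence id without per-token fan-out.
bool validate_unique_seq_map(const std::vector<int32_t>& token_seq_id, int n_kv) {
    std::vector<bool> seen(n_kv, false);
    for (int32_t s : token_seq_id) {
        if (s < 0 || s >= n_kv) return false;  // out-of-range sequence id
        if (seen[s]) return false;             // two tokens share a sequence
        seen[s] = true;
    }
    return true;  // safe to take the unique-sequence fast path
}
```

When the check fails, the caller simply dispatches the pre-existing general kernel, which is why the fallback leaves behavior unchanged.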
3. Top-k MoE fusion verification:
   - No matcher change was required in this update.
   - The Qwen3Next MoE build path still emits the expected `SOFT_MAX -> ... -> ARGSORT -> VIEW -> GET_ROWS` form used by current CUDA fusion checks.
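To illustrate what such a verification does (purely a sketch: the real fusion check walks the ggml graph, and the `...` in the pattern above is kept as a wildcard here rather than filled in):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch: check that an op sequence starts at SOFT_MAX and ends with the
// ARGSORT -> VIEW -> GET_ROWS tail the CUDA fusion check expects, with
// arbitrary ops in between (the elided "..." of the report's pattern).
bool matches_topk_moe_pattern(const std::vector<std::string>& ops) {
    if (ops.size() < 4) return false;
    const size_t n = ops.size();
    return ops.front() == "SOFT_MAX" &&
           ops[n - 3] == "ARGSORT" &&
           ops[n - 2] == "VIEW" &&
           ops[n - 1] == "GET_ROWS";
}
```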
### Parity validation (required checks)

Tests were run in Docker (`iktest-dev:latest`) with:

- model: `/models/qwen3-next-coder.gguf`
- text corpus: `/tmp/qnext_ppl.txt` (same file for `ik` and mainline)
- params: `-c 256 -b 64 -ub 64 --no-warmup`
CPU parity (`-ngl 0`, threshold `<= 5e-4`):

- `chunks=1`: `ik 1.0041` vs `mainline 1.0037` (`delta=4e-4`) -> PASS
- `chunks=2`: `ik 1.0025` vs `mainline 1.0023` (`delta=2e-4`) -> PASS

CUDA sanity parity (`-ngl 1`, threshold `<= 1e-3`):

- `chunks=1`: `ik 1.0041` vs `mainline 1.0037` (`delta=4e-4`) -> PASS
- `chunks=2`: `ik 1.0025` vs `mainline 1.0023` (`delta=2e-4`) -> PASS
### Quick performance matrix (`llama-sweep-bench`)

Config: `-c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe`

| Profile | Baseline maxPP | Baseline maxTG | New maxPP | New maxTG | Delta maxPP | Delta maxTG |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| 16GB a) `CUDA_VISIBLE_DEVICES=0` | 129.83 | 26.45 | 122.91 | 26.79 | -6.92 | +0.34 |
| 16GB b) `CUDA_VISIBLE_DEVICES=0 -no-ooae` | n/a | n/a | 132.02 | 26.84 | n/a | n/a |
| 28GB a) `CUDA_VISIBLE_DEVICES=0,1 --tensor-split 0.85,0.15` | 127.66 | 22.95 | 127.48 | 23.97 | -0.18 | +1.02 |
| 28GB b) `CUDA_VISIBLE_DEVICES=0,1` | n/a | n/a | 104.61 | 21.17 | n/a | n/a |
### Command log (exact forms)

Build:

```bash
docker run --rm --gpus all \
  -v /home/yurko/Code/ik_llama.cpp:/ik_llama.cpp \
  iktest-dev:latest \
  bash -lc 'cmake --build /ik_llama.cpp/build-cuda13-fresh --config Release -j 56 --target llama-perplexity llama-bench'
```
Parity (`ik`):

```bash
docker run --rm --gpus all \
  -v /home/yurko/Code/ik_llama.cpp:/ik_llama.cpp \
  -v /home/yurko/.cache/llama.cpp:/models \
  -v /tmp:/tmp \
  iktest-dev:latest \
  bash -lc 'export LD_LIBRARY_PATH=/ik_llama.cpp/build-cuda13-fresh/src:/ik_llama.cpp/build-cuda13-fresh/ggml/src:$LD_LIBRARY_PATH; \
    /ik_llama.cpp/build-cuda13-fresh/bin/llama-perplexity -m /models/qwen3-next-coder.gguf -f /tmp/qnext_ppl.txt -c 256 -b 64 -ub 64 --no-warmup --chunks {1|2} -ngl {0|1} -ctk f16 -ctv f16'
```
Parity (mainline):

```bash
docker run --rm --gpus all \
  -v /home/yurko/Code/llama.cpp:/llama.cpp \
  -v /home/yurko/.cache/llama.cpp:/models \
  -v /tmp:/tmp \
  iktest-dev:latest \
  bash -lc 'export LD_LIBRARY_PATH=/llama.cpp/build/src:/llama.cpp/build/ggml/src:$LD_LIBRARY_PATH; \
    /llama.cpp/build/bin/llama-perplexity -m /models/qwen3-next-coder.gguf -f /tmp/qnext_ppl.txt -c 256 -b 64 -ub 64 --no-warmup --chunks {1|2} -ngl {0|1} -ctk f16 -ctv f16'
```
Quick matrix:

```bash
# 16GB a
CUDA_VISIBLE_DEVICES=0 /ik_llama.cpp/build-cuda13-fresh/bin/llama-sweep-bench \
  -m /models/qwen3-next-coder.gguf -c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe

# 16GB b
CUDA_VISIBLE_DEVICES=0 /ik_llama.cpp/build-cuda13-fresh/bin/llama-sweep-bench \
  -m /models/qwen3-next-coder.gguf -c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe -no-ooae

# 28GB a
CUDA_VISIBLE_DEVICES=0,1 /ik_llama.cpp/build-cuda13-fresh/bin/llama-sweep-bench \
  -m /models/qwen3-next-coder.gguf -c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe --tensor-split 0.85,0.15

# 28GB b
CUDA_VISIBLE_DEVICES=0,1 /ik_llama.cpp/build-cuda13-fresh/bin/llama-sweep-bench \
  -m /models/qwen3-next-coder.gguf -c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe
```
### Status after this update

- Precision parity: PASS on all required checks.
- Performance:
  - The 16GB profile improved TG but regressed PP vs baseline.
  - The 28GB split profile improved TG and preserved PP.
- Remaining likely bottlenecks for 16GB PP:
  - MoE routing is still limited by per-expert launches and the host-side per-expert loop in `mul_mat_id`.
  - Scheduler split / backend-crossing overhead remains visible at this config.