mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-04-30 19:31:48 +00:00
docs: update qwen3next perf report for cuda MoE/SSM tuning
@@ -199,3 +199,119 @@ This is substantially improved versus earlier observed TG (`~2` to `~4` t/s) in
- Some runs still offload few/no layers depending on available VRAM at run time, which can mask CUDA-path gains.
- `SCHED_MAX_SPLITS` limits at very aggressive `(b, ub)` settings are still a separate scaling constraint.
- Additional backend-level profiling is still needed to determine whether the remaining gap to top-end mainline numbers is dominated by offload limits, scheduler split overhead, or other kernels.
## 2026-02-06 CUDA MoE/SSM Optimization Update

### Applied changes in this update

1. MoE row mapping in CUDA `mul_mat_id` paths (`ggml/src/ggml-cuda.cu`):
   - Replaced the per-call `ids` device->host copy, the host-side count/build step, and the mapping host->device copy.
   - Added device-side count + exclusive prefix sum + scatter kernels:
     - `k_moe_row_count`
     - `k_moe_row_exclusive_scan`
     - `k_moe_row_scatter`
   - Kept existing call-site logic intact by copying only compact metadata back (`moe_counts`, `cum_moe_counts`, invalid-id flag).
   - Net effect: removes large host round-trip traffic from a hot MoE routing path.
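The device-side replacement follows the standard count -> exclusive prefix sum -> scatter pattern. A minimal host-side C++ sketch of the same three steps (serial here for clarity; the struct and function names are illustrative, not the actual CUDA code):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side sketch of the pipeline that k_moe_row_count,
// k_moe_row_exclusive_scan, and k_moe_row_scatter implement on-device.
struct MoeRowMap {
    std::vector<int32_t> counts;      // rows routed to each expert
    std::vector<int32_t> cum_counts;  // exclusive prefix sum of counts
    std::vector<int32_t> row_map;     // row indices grouped by expert
    bool has_invalid_id = false;
};

MoeRowMap build_row_map(const std::vector<int32_t>& ids, int n_experts) {
    MoeRowMap m;
    // Step 1: count rows per expert (k_moe_row_count analogue).
    m.counts.assign(n_experts, 0);
    for (int32_t id : ids) {
        if (id < 0 || id >= n_experts) { m.has_invalid_id = true; continue; }
        m.counts[id]++;
    }
    // Step 2: exclusive prefix sum gives each expert's output offset.
    m.cum_counts.assign(n_experts, 0);
    for (int e = 1; e < n_experts; ++e)
        m.cum_counts[e] = m.cum_counts[e - 1] + m.counts[e - 1];
    // Step 3: scatter row indices into per-expert contiguous ranges.
    m.row_map.assign(ids.size(), -1);
    std::vector<int32_t> cursor = m.cum_counts;
    for (size_t row = 0; row < ids.size(); ++row) {
        int32_t id = ids[row];
        if (id >= 0 && id < n_experts) m.row_map[cursor[id]++] = (int32_t)row;
    }
    return m;
}
```

Only `counts`, `cum_counts`, and the invalid-id flag need to travel back to the host, which is why the device version can limit the device->host copy to that compact metadata while the full row map stays resident on the GPU.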
2. Qwen3Next SSM conv path for `n_kv > 1` (`ggml/src/ggml-cuda/ssm-conv.cu`):
   - Added a guarded fast path for decode-like multi-sequence batches where each token maps to one unique sequence (no multi-sequence fan-out per token).
   - Added:
     - `ssm_conv_validate_unique_seq_map`
     - `ssm_conv_multi_seq_unique_f32_kernel`
     - `ssm_conv_multi_seq_unique_f32_kernel_nc4`
   - If the input pattern does not satisfy the fast-path constraints, execution falls back to the existing kernel path unchanged.
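The guard condition amounts to a uniqueness check over the token-to-sequence mapping. A C++ sketch of what such a validation does, assuming a per-token sequence-id table (the data layout and names here are hypothetical, not the real kernel interface):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of a check like ssm_conv_validate_unique_seq_map: the fast path
// requires every token to reference exactly one in-range sequence and no
// two tokens to share a sequence, so the kernel can index the conv state
// directly by sequence id without per-token fan-out.
bool validate_unique_seq_map(const std::vector<int32_t>& token_seq_id, int n_kv) {
    std::vector<bool> seen(n_kv, false);
    for (int32_t s : token_seq_id) {
        if (s < 0 || s >= n_kv) return false;  // out-of-range sequence id
        if (seen[s]) return false;             // two tokens share a sequence
        seen[s] = true;
    }
    return true;  // safe to take the unique-sequence fast path
}
```

When the check fails, the caller simply dispatches the pre-existing general kernel, which is why the fallback leaves behavior unchanged.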
3. Top-k MoE fusion verification:
   - No matcher change was required in this update.
   - The Qwen3Next MoE build path still emits the expected `SOFT_MAX -> ... -> ARGSORT -> VIEW -> GET_ROWS` form used by current CUDA fusion checks.
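To illustrate what such a verification does (purely a sketch: the real fusion check walks the ggml graph, and the `...` in the pattern above is kept as a wildcard here rather than filled in):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch: check that an op sequence starts at SOFT_MAX and ends with the
// ARGSORT -> VIEW -> GET_ROWS tail the CUDA fusion check expects, with
// arbitrary ops in between (the elided "..." of the report's pattern).
bool matches_topk_moe_pattern(const std::vector<std::string>& ops) {
    if (ops.size() < 4) return false;
    const size_t n = ops.size();
    return ops.front() == "SOFT_MAX" &&
           ops[n - 3] == "ARGSORT" &&
           ops[n - 2] == "VIEW" &&
           ops[n - 1] == "GET_ROWS";
}
```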
### Parity validation (required checks)

Tests were run in Docker (`iktest-dev:latest`) with:

- model: `/models/qwen3-next-coder.gguf`
- text corpus: `/tmp/qnext_ppl.txt` (same file for `ik` and mainline)
- params: `-c 256 -b 64 -ub 64 --no-warmup`
CPU parity (`-ngl 0`, threshold `<= 5e-4`):

- `chunks=1`: `ik 1.0041` vs `mainline 1.0037` (`delta=4e-4`) -> PASS
- `chunks=2`: `ik 1.0025` vs `mainline 1.0023` (`delta=2e-4`) -> PASS

CUDA sanity parity (`-ngl 1`, threshold `<= 1e-3`):

- `chunks=1`: `ik 1.0041` vs `mainline 1.0037` (`delta=4e-4`) -> PASS
- `chunks=2`: `ik 1.0025` vs `mainline 1.0023` (`delta=2e-4`) -> PASS
### Quick performance matrix (`llama-sweep-bench`)

Config: `-c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe`

| Profile | Baseline maxPP | Baseline maxTG | New maxPP | New maxTG | Delta maxPP | Delta maxTG |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| 16GB a) `CUDA_VISIBLE_DEVICES=0` | 129.83 | 26.45 | 122.91 | 26.79 | -6.92 | +0.34 |
| 16GB b) `CUDA_VISIBLE_DEVICES=0 -no-ooae` | n/a | n/a | 132.02 | 26.84 | n/a | n/a |
| 28GB a) `CUDA_VISIBLE_DEVICES=0,1 --tensor-split 0.85,0.15` | 127.66 | 22.95 | 127.48 | 23.97 | -0.18 | +1.02 |
| 28GB b) `CUDA_VISIBLE_DEVICES=0,1` | n/a | n/a | 104.61 | 21.17 | n/a | n/a |
### Command log (exact forms)

Build:

```bash
docker run --rm --gpus all \
  -v /home/yurko/Code/ik_llama.cpp:/ik_llama.cpp \
  iktest-dev:latest \
  bash -lc 'cmake --build /ik_llama.cpp/build-cuda13-fresh --config Release -j 56 --target llama-perplexity llama-bench'
```
Parity (`ik`):

```bash
docker run --rm --gpus all \
  -v /home/yurko/Code/ik_llama.cpp:/ik_llama.cpp \
  -v /home/yurko/.cache/llama.cpp:/models \
  -v /tmp:/tmp \
  iktest-dev:latest \
  bash -lc 'export LD_LIBRARY_PATH=/ik_llama.cpp/build-cuda13-fresh/src:/ik_llama.cpp/build-cuda13-fresh/ggml/src:$LD_LIBRARY_PATH; \
    /ik_llama.cpp/build-cuda13-fresh/bin/llama-perplexity -m /models/qwen3-next-coder.gguf -f /tmp/qnext_ppl.txt -c 256 -b 64 -ub 64 --no-warmup --chunks {1|2} -ngl {0|1} -ctk f16 -ctv f16'
```
Parity (mainline):

```bash
docker run --rm --gpus all \
  -v /home/yurko/Code/llama.cpp:/llama.cpp \
  -v /home/yurko/.cache/llama.cpp:/models \
  -v /tmp:/tmp \
  iktest-dev:latest \
  bash -lc 'export LD_LIBRARY_PATH=/llama.cpp/build/src:/llama.cpp/build/ggml/src:$LD_LIBRARY_PATH; \
    /llama.cpp/build/bin/llama-perplexity -m /models/qwen3-next-coder.gguf -f /tmp/qnext_ppl.txt -c 256 -b 64 -ub 64 --no-warmup --chunks {1|2} -ngl {0|1} -ctk f16 -ctv f16'
```
Quick matrix:

```bash
# 16GB a
CUDA_VISIBLE_DEVICES=0 /ik_llama.cpp/build-cuda13-fresh/bin/llama-sweep-bench \
  -m /models/qwen3-next-coder.gguf -c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe

# 16GB b
CUDA_VISIBLE_DEVICES=0 /ik_llama.cpp/build-cuda13-fresh/bin/llama-sweep-bench \
  -m /models/qwen3-next-coder.gguf -c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe -no-ooae

# 28GB a
CUDA_VISIBLE_DEVICES=0,1 /ik_llama.cpp/build-cuda13-fresh/bin/llama-sweep-bench \
  -m /models/qwen3-next-coder.gguf -c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe --tensor-split 0.85,0.15

# 28GB b
CUDA_VISIBLE_DEVICES=0,1 /ik_llama.cpp/build-cuda13-fresh/bin/llama-sweep-bench \
  -m /models/qwen3-next-coder.gguf -c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe
```
### Status after this update

- Precision parity: PASS on all required checks.
- Performance:
  - The 16GB profile improved TG but regressed PP vs baseline.
  - The 28GB split profile improved TG and preserved PP.
- Remaining likely bottlenecks for 16GB PP:
  - MoE routing is still limited by per-expert launches and the host-side per-expert loop in `mul_mat_id`.
  - Scheduler split / backend-crossing overhead remains visible at this config.