mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-02-21 13:44:10 +00:00
qwen3next: default fused delta-net off and document quality checks
115
docs/development/qwen3next_bench_16k_pp16384_tg128.md
Normal file
@@ -0,0 +1,115 @@
# Qwen3Next Benchmark: PP 16384 / TG 128 (`ik_llama.cpp` vs `llama.cpp`)

Date: 2026-02-08

## Setup

- Container: `iktest2`
- Model: `/models/qwen3-next-coder.gguf`
- Prompt processing: `-p 16384`
- Token generation: `-n 128`
- Batch settings: `-b 3072 -ub 768`
- Threads: `-t 8`
- Repetitions: `-r 1`
- Mmap: `-mmp 0`

CUDA runs:

- `CUDA_VISIBLE_DEVICES=0`
- `-fa 1 -ngl 999 --n-cpu-moe 47`

CPU-only runs:

- `-fa 0 -ngl 0 --n-cpu-moe 0`

Hardware note:

- GPU0 (bench target): `NVIDIA GeForce RTX 5060 Ti`, `16311 MiB` total (`CUDA_VISIBLE_DEVICES=0` for CUDA runs).
- GPU1 (not used for these runs): `NVIDIA GeForce RTX 3060`, `12288 MiB` total.
- Observed during active `ik` CUDA run (`p=8192,b=2048,ub=512,n-cpu-moe=45`): GPU0 memory used `~12074 MiB` (`~3775 MiB` free), from `nvidia-smi`.

## Results

| Build | Backend | PP 16384 (tok/s) | TG 128 (tok/s) |
|---|---|---:|---:|
| `ik_llama.cpp` | CUDA | 207.891304 | 27.263562 |
| `llama.cpp` | CUDA | 185.764649 | 24.145662 |
| `ik_llama.cpp` | CPU-only | 45.739881 | 12.172113 |
| `llama.cpp` | CPU-only | 47.835420 | 6.991398 |

## Relative (`ik` vs `llama.cpp`)

- CUDA PP: `+11.91%`
- CUDA TG: `+12.91%`
- CPU PP: `-4.38%`
- CPU TG: `+74.10%`

## Raw outputs

- `/tmp/ik_cuda_bench_16k.json`
- `/tmp/mainline_cuda_bench_16k.json`
- `/tmp/ik_cpu_bench_16k.json`
- `/tmp/mainline_cpu_bench_16k.json`

## Additional CUDA rerun (requested lower `n-cpu-moe` ballpark)

Adjusted config:

- `-p 8192 -n 128 -b 2048 -ub 512 -t 8 -fa 1 -ngl 999 -mmp 0`
- single GPU: `CUDA_VISIBLE_DEVICES=0`

Fit checks on `ik`:

- `--n-cpu-moe 25` -> fails to load the model
- `--n-cpu-moe 40` -> fails to create the context
- `--n-cpu-moe 45` -> works

Working comparison at `--n-cpu-moe 45`:

| Build | Backend | PP 8192 (tok/s) | TG 128 (tok/s) |
|---|---|---:|---:|
| `ik_llama.cpp` | CUDA | 201.613283 | 24.884600 |
| `llama.cpp` | CUDA | 145.100895 | 24.595058 |

`ik` rerun with `-rtr 1` at the same config (`--n-cpu-moe 45`):

| Build | Backend | PP 8192 (tok/s) | TG 128 (tok/s) |
|---|---|---:|---:|
| `ik_llama.cpp` (`-rtr 1`) | CUDA | 232.340508 | 27.895722 |

## Fused DeltaNet Quality Check (GPU, `-c 2048`, `--chunks 1`)

Date: 2026-02-08

Setup:

- Container: `iktest2`
- Device: `CUDA_VISIBLE_DEVICES=0` (RTX 5060 Ti)
- Common args: `-c 2048 -b 2048 -ub 512 --chunks 1 --no-warmup -ngl 999 --n-cpu-moe 47 -t 8 -fa on`
- Switch under test: `LLAMA_QWEN3NEXT_FUSED_DELTA`

Results (Wikitext2 sample file `/tmp/ppl_wikitext2_test.txt`):

| Model | `LLAMA_QWEN3NEXT_FUSED_DELTA=0` | `LLAMA_QWEN3NEXT_FUSED_DELTA=1` |
|---|---:|---:|
| `/models/qwen3-next-coder.gguf` | `PPL 3.9378` | `PPL 15.3628` |
| `/models/qwen-3-coder-next-mxfp4.gguf` | `PPL 3.9860` | `PPL 15.0740` |

Conclusion:

- The fused DeltaNet path is currently numerically wrong for both tested quants on CUDA in this setup: perplexity rises from about 4 to about 15.
- To preserve model quality, the non-fused path must be the default and the fused path must stay opt-in (`LLAMA_QWEN3NEXT_FUSED_DELTA=1`).

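As a reading aid for the table above: perplexity is conventionally the exponentiated mean negative log-likelihood per token. The block below states that standard definition and works out the ratio between the two columns. The symbols `N`, `x_i`, and `p` are notation introduced here, and it is an assumption that the perplexity tool used for these runs follows this definition.

```latex
\[
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log p\left(x_i \mid x_{<i}\right)\right)
\]
\[
\frac{15.3628}{3.9378} \approx 3.90,
\qquad
\frac{15.0740}{3.9860} \approx 3.78
\]
```

Under that definition, a ratio of about 3.8 to 3.9 corresponds to roughly 1.33 to 1.36 additional nats of loss per token with the fused path enabled.
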
## Upstream PR #19375 Trial (Selective Port) Outcome

Date: 2026-02-08

What was tried:

- Ported selected non-fused qwen3next graph changes from `ggml-org/llama.cpp#19375` (broadcast/repeat and autoregressive matmul rewrite), then benchmarked and re-tested perplexity.

Outcome:

- No stable speed win in our setup after repeated runs.
- Autoregressive rewrite specifically hurt TG throughput in non-fused mode and was reverted.
- Final code keeps only the fused-default safety fix (non-fused by default).

@@ -4178,22 +4178,23 @@ ggml_cgraph * llm_build_context::build_qwen3next() {

     const bool reset_state = batch.pos != nullptr && batch.pos[0] == 0;

-    // Default to fused DeltaNet path; set LLAMA_QWEN3NEXT_FUSED_DELTA=0 to force legacy graph path.
+    // Keep legacy DeltaNet path as the default for correctness; enable fused path explicitly
+    // with LLAMA_QWEN3NEXT_FUSED_DELTA=1 for controlled testing.
     const bool use_fused_delta_net = []() {
         const char * env = std::getenv("LLAMA_QWEN3NEXT_FUSED_DELTA");
         if (env == nullptr || env[0] == '\0') {
-            return true;
+            return false;
         }

         switch (env[0]) {
-            case '0':
-            case 'n':
-            case 'N':
-            case 'f':
-            case 'F':
-                return false;
-            default:
+            case '1':
+            case 'y':
+            case 'Y':
+            case 't':
+            case 'T':
                 return true;
+            default:
+                return false;
         }
     }();

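The opt-in rule in the hunk above can be sketched as a standalone function that takes the raw value, so the behavior can be checked without touching the process environment. This is an illustrative sketch only: `parse_opt_in_flag` and `fused_delta_enabled` are hypothetical names, and the actual change keeps the logic inline in a lambda.

```cpp
#include <cstdlib>

// Hypothetical helper mirroring the new lambda: only the first character is
// inspected, and anything other than 1/y/Y/t/T leaves the flag disabled.
bool parse_opt_in_flag(const char * value) {
    if (value == nullptr || value[0] == '\0') {
        return false; // unset or empty: keep the legacy (non-fused) path
    }
    switch (value[0]) {
        case '1':
        case 'y':
        case 'Y':
        case 't':
        case 'T':
            return true;  // explicit opt-in
        default:
            return false; // any other value is treated as "off"
    }
}

// Hypothetical wrapper that reads the environment variable named in the diff.
bool fused_delta_enabled() {
    return parse_opt_in_flag(std::getenv("LLAMA_QWEN3NEXT_FUSED_DELTA"));
}
```

Because only the first character is checked, values such as `yes` or `True` enable the fused path, while `on` does not, since it starts with `o`. Defaulting unknown values to off is the conservative choice given the perplexity results.
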