mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-04-26 09:29:27 +00:00
qwen3next: keep fused delta on safe path and remove PR artifacts
@@ -1,182 +0,0 @@

# Qwen3Next Benchmark: PP 16384 / TG 128 (`ik_llama.cpp` vs `llama.cpp`)

Date: 2026-02-08

## Setup

- Container: `iktest2`
- Model: `/models/qwen3-next-coder.gguf`
- Prompt processing: `-p 16384`
- Token generation: `-n 128`
- Batch settings: `-b 3072 -ub 768`
- Threads: `-t 8`
- Repetitions: `-r 1`
- Mmap: `-mmp 0`

CUDA runs:

- `CUDA_VISIBLE_DEVICES=0`
- `-fa 1 -ngl 999 --n-cpu-moe 47`

CPU-only runs:

- `-fa 0 -ngl 0 --n-cpu-moe 0`

Hardware note:

- GPU0 (bench target): `NVIDIA GeForce RTX 5060 Ti`, `16311 MiB` total (`CUDA_VISIBLE_DEVICES=0` for CUDA runs).
- GPU1 (not used for these runs): `NVIDIA GeForce RTX 3060`, `12288 MiB` total.
- Observed during active `ik` CUDA run (`p=8192,b=2048,ub=512,n-cpu-moe=45`): GPU0 memory used `~12074 MiB` (`~3775 MiB` free), from `nvidia-smi`.

## Results

| Build | Backend | PP 16384 (tok/s) | TG 128 (tok/s) |
|---|---|---:|---:|
| `ik_llama.cpp` | CUDA | 207.891304 | 27.263562 |
| `llama.cpp` | CUDA | 185.764649 | 24.145662 |
| `ik_llama.cpp` | CPU-only | 45.739881 | 12.172113 |
| `llama.cpp` | CPU-only | 47.835420 | 6.991398 |

## Relative (`ik` vs `llama.cpp`)

- CUDA PP: `+11.91%`
- CUDA TG: `+12.91%`
- CPU PP: `-4.38%`
- CPU TG: `+74.10%`
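
The percentages above can be reproduced from the Results table. A minimal Python sketch, assuming "relative" means `ik / llama.cpp - 1`; the function and variable names are illustrative and not part of either project:

```python
# Throughput values (tok/s) copied from the Results table.
results = {
    ("CUDA", "PP"): {"ik": 207.891304, "mainline": 185.764649},
    ("CUDA", "TG"): {"ik": 27.263562, "mainline": 24.145662},
    ("CPU", "PP"): {"ik": 45.739881, "mainline": 47.835420},
    ("CPU", "TG"): {"ik": 12.172113, "mainline": 6.991398},
}


def relative_gain_pct(ik: float, mainline: float) -> float:
    """Percent change of the ik throughput relative to mainline."""
    return (ik / mainline - 1.0) * 100.0


for (backend, phase), row in results.items():
    gain = relative_gain_pct(row["ik"], row["mainline"])
    print(f"{backend} {phase}: {gain:+.2f}%")
```

Because every run used `-r 1`, these are single-sample ratios and small differences (such as the CPU PP gap) may not be repeatable.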

## Raw outputs

- `/tmp/ik_cuda_bench_16k.json`
- `/tmp/mainline_cuda_bench_16k.json`
- `/tmp/ik_cpu_bench_16k.json`
- `/tmp/mainline_cpu_bench_16k.json`

## Additional CUDA rerun (requested lower `n-cpu-moe` ballpark)

Adjusted config:

- `-p 8192 -n 128 -b 2048 -ub 512 -t 8 -fa 1 -ngl 999 -mmp 0`
- single GPU: `CUDA_VISIBLE_DEVICES=0`

Fit checks on `ik`:

- `--n-cpu-moe 25` -> fails to load the model
- `--n-cpu-moe 40` -> fails to create the context
- `--n-cpu-moe 45` -> works
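
The fit checks above were done by hand at three values. If the search were automated, a bisection over `--n-cpu-moe` is one option. The sketch below is hypothetical: `fits` stands for any user-supplied probe (for example, one that launches the benchmark and reports whether the model loaded and the context was created), and it assumes that once a value fits, every larger value also fits.

```python
from typing import Callable


def min_fitting_n_cpu_moe(fits: Callable[[int], bool], lo: int, hi: int) -> int:
    """Return the smallest value in [lo, hi] for which fits(value) is True.

    Assumes monotonicity: if a value fits, all larger values fit as well.
    """
    if not fits(hi):
        raise ValueError("even the upper bound does not fit")
    while lo < hi:
        mid = (lo + hi) // 2
        if fits(mid):
            hi = mid       # mid fits, so the answer is mid or smaller
        else:
            lo = mid + 1   # mid does not fit, so the answer is larger
    return lo


# Placeholder probe for illustration only. The real threshold on this host is
# only known to lie between 41 and 45; the value 45 here is not a measurement.
def fake_probe(n_cpu_moe: int) -> bool:
    return n_cpu_moe >= 45


print(min_fitting_n_cpu_moe(fake_probe, lo=25, hi=47))
```

Each probe call would be a full model load, so the logarithmic number of probes is the main reason to bisect rather than scan.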

Working comparison at `--n-cpu-moe 45`:

| Build | Backend | PP 8192 (tok/s) | TG 128 (tok/s) |
|---|---|---:|---:|
| `ik_llama.cpp` | CUDA | 201.613283 | 24.884600 |
| `llama.cpp` | CUDA | 145.100895 | 24.595058 |

`ik` rerun with `-rtr 1` at the same config (`--n-cpu-moe 45`):

| Build | Backend | PP 8192 (tok/s) | TG 128 (tok/s) |
|---|---|---:|---:|
| `ik_llama.cpp` (`-rtr 1`) | CUDA | 232.340508 | 27.895722 |

## Historical Fused DeltaNet Check (obsolete)

Date: 2026-02-08

Setup:

- Container: `iktest2`
- Device: `CUDA_VISIBLE_DEVICES=0` (RTX 5060 Ti)
- Common args: `-c 2048 -b 2048 -ub 512 --chunks 1 --no-warmup -ngl 999 --n-cpu-moe 47 -t 8 -fa on`
- Switch under test: `LLAMA_QWEN3NEXT_FUSED_DELTA`

Results (Wikitext2 sample file `/tmp/ppl_wikitext2_test.txt`):

| Model | `LLAMA_QWEN3NEXT_FUSED_DELTA=0` | `LLAMA_QWEN3NEXT_FUSED_DELTA=1` |
|---|---:|---:|
| `/models/qwen3-next-coder.gguf` | `PPL 3.9378` | `PPL 15.3628` |
| `/models/qwen-3-coder-next-mxfp4.gguf` | `PPL 3.9860` | `PPL 15.0740` |

Conclusion:

- This run is kept for history only and is superseded by the later `Fused DeltaNet Safety Update (Superseding)` section below.
- Use the superseding section as source of truth for mode mapping and quality guidance.

## Upstream PR #19375 Trial (Selective Port) Outcome

Date: 2026-02-08

What was tried:

- Ported selected non-fused qwen3next graph changes from `ggml-org/llama.cpp#19375` (broadcast/repeat and autoregressive matmul rewrite), then benchmarked and re-tested perplexity.

Outcome:

- No stable speed win in our setup after repeated runs.
- Direct autoregressive rewrite attempts from PR #19375 were not compatible with current ik graph-layout/contiguity assumptions and were reverted.
- Final code keeps only safe chunk-shape fixes plus fused-mode safety controls.

## Decode-Only Fused Mode Trial (`LLAMA_QWEN3NEXT_FUSED_DELTA=2`)

Date: 2026-02-08

Code change:

- Added mode `2` for `LLAMA_QWEN3NEXT_FUSED_DELTA`:
  - prompt / multi-token path: non-fused
  - single-token decode path: fused

Perplexity validation (`-c 2048`, GPU config as above):

| Model | `=0` non-fused | `=2` decode-only fused |
|---|---:|---:|
| `/models/qwen3-next-coder.gguf` | `3.9378` | `3.9378` |
| `/models/qwen-3-coder-next-mxfp4.gguf` | `3.9860` | `3.9860` |

`llama-bench` at `-p 8192 -n 128 -b 2048 -ub 512 -r 3 -rtr 1`:

| Mode | PP 8192 (tok/s) | TG 128 (tok/s) |
|---|---:|---:|
| `LLAMA_QWEN3NEXT_FUSED_DELTA=0` | `170.090` | `25.465` |
| `LLAMA_QWEN3NEXT_FUSED_DELTA=2` | `166.212` | `29.599` |

Notes:

- Decode-only fused mode leaves perplexity unchanged in this test: both models match the non-fused values.
- TG improved by about 16% in this run (`25.465` -> `29.599` tok/s); PP variance was higher, so the PP delta should be treated as noise.

## Fused DeltaNet Safety Update (Superseding)

Date: 2026-02-08

This section supersedes the earlier `LLAMA_QWEN3NEXT_FUSED_DELTA` mode mapping.

Updated env behavior in `src/llama-build-context.cpp`:

- `0` / unset: non-fused for all token counts
- `1`: fused only for `n_tok > 1` (prefill/chunking), non-fused for single-token decode
- `2`: fused for all token counts (experimental)
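
The mapping above can be read as a small decision function. The Python sketch below only mirrors the documented behavior; it is not the C++ code in `src/llama-build-context.cpp`, the function names are hypothetical, and treating unrecognized or non-numeric values as non-fused is an assumption.

```python
from typing import Mapping


def fused_mode_from_env(env: Mapping[str, str]) -> int:
    """Read LLAMA_QWEN3NEXT_FUSED_DELTA; unset or non-numeric counts as 0 (assumption)."""
    raw = env.get("LLAMA_QWEN3NEXT_FUSED_DELTA", "0")
    try:
        return int(raw)
    except ValueError:
        return 0


def use_fused_delta(mode: int, n_tok: int) -> bool:
    """Return True when the fused DeltaNet path would be selected."""
    if mode == 1:
        return n_tok > 1   # prefill / chunked batches only
    if mode == 2:
        return True        # all token counts (experimental)
    return False           # 0, unset, or unrecognized (assumption): non-fused


# Example: mode 1 is fused for a 512-token chunk but not for single-token decode.
mode = fused_mode_from_env({"LLAMA_QWEN3NEXT_FUSED_DELTA": "1"})
print(use_fused_delta(mode, n_tok=512), use_fused_delta(mode, n_tok=1))
```

In practice `fused_mode_from_env(os.environ)` would be the natural call; the explicit mapping argument just keeps the sketch easy to test.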

Reason:

- Fused path has a known decode-path quality regression when forced on single-token steps.
- The safer default acceleration is therefore prefill-only fused mode (`=1`).

Validation (CUDA, `qwen3-next-coder.gguf`, `-c 2048 -b 1 -ub 1 -fa on -ngl 47 --n-cpu-moe 40 --chunks 1 --no-warmup`):

| Mode | PPL |
|---|---:|
| `LLAMA_QWEN3NEXT_FUSED_DELTA=0` | `3.9148 +/- 0.31093` |
| `LLAMA_QWEN3NEXT_FUSED_DELTA=1` | `3.9148 +/- 0.31093` |
| `LLAMA_QWEN3NEXT_FUSED_DELTA=2` | `6.1277 +/- 0.54810` |

Quick throughput check (`-p 8192 -n 128 -b 2048 -ub 512 -r 1 -rtr 1`, same CUDA settings):

| Mode | PP 8192 (tok/s) | TG 128 (tok/s) |
|---|---:|---:|
| `0` | `179.30` | `24.69` |
| `1` | `252.12` | `22.99` |
| `2` | `245.71` | `27.94` |

Interpretation:

- Use `=1` for production: perplexity matches non-fused, and PP improved by about 41% in this single-run check (`179.30` -> `252.12` tok/s).
- Reserve `=2` for experiments only until decode-path correctness is fixed.

@@ -1,165 +0,0 @@

# Qwen3Next Review and Benchmark Summary (`ik_llama.cpp` vs `llama.cpp`)

Date: 2026-02-08

## Scope

This document captures:

- Current upstream PR alignment for Qwen3Next-related work.
- What is already strong in `ik_llama.cpp` and what still needs adjustment.
- Recommended runtime settings for this machine (single GPU target, long context).
- Final apples-to-apples benchmark matrix for `ik_llama.cpp` vs `../llama.cpp`.

## Upstream PR Check (as of 2026-02-08)

Reviewed PRs:

- https://github.com/ggml-org/llama.cpp/pull/18102 (`open`): Delta-Net CUDA op + integration.
- https://github.com/ggml-org/llama.cpp/pull/18792 (`open`): unified DeltaNet handling (`src/models/delta.cpp`).
- https://github.com/ggml-org/llama.cpp/pull/19375 (`open`, `draft`): Qwen3Next graph optimization in model builder.

### Current alignment in `ik_llama.cpp`

Already present and/or functionally covered:

- CUDA DeltaNet op path exists in GGML (`ggml/src/ggml-cuda/delta-net.cu`).
- Solve-tri and backend op support are present for the fused path.
- Qwen3Next fused DeltaNet builder path exists (and is now runtime-toggleable via env).
- Existing ik optimizations remain available (`-rtr`, grouped/fused paths, no-offload-only-active-experts switches).

Not directly mirrored yet (by design divergence from mainline model layout):

- Mainline `src/models/delta.cpp` structure from PR #18792.
- Mainline `src/models/qwen3next.cpp` graph-form from PR #19375.

## Required Adjustments (remaining)

1. Keep non-fused as the strict safety baseline in defaults, and use `LLAMA_QWEN3NEXT_FUSED_DELTA=1` (prefill-only fused) as the explicit acceleration mode.
2. Continue using `scripts/qwen3next-regression.sh` as the release gate for this model path, and wire it into CI or pre-merge checks.
3. Treat the remaining PR #19375 autoregressive rewrite as deferred: direct porting into current ik graph builder is not layout-compatible without broader contiguity/reshape refactoring.
4. Revisit PR #18792 (`src/models/delta.cpp`) only if we need unified GDA/KDA support for additional architectures; for Qwen3Next-only it is optional.

## Strong Points of `ik_llama.cpp` to Preserve

- More runtime controls than mainline for this workload (`-rtr`, backend toggles, MoE/OOAE controls).
- Strong CUDA path for this model family once offload routing is tuned (`--n-cpu-moe` thresholding).
- Better TG throughput than current mainline in matched CUDA and CPU tests on this host.

## Best Runtime Configuration (this host)

Model: `/models/qwen3-next-coder.gguf`

Single-GPU long-context finding:

- `-c 65536` on GPU0 (16 GB) requires at least `--n-cpu-moe 47` to fit reliably.

8k sweep proxy (single GPU, tuned path):

- `b=2048,ub=512` -> `pp8192=142.85`, `tg128=24.81`
- `b=3072,ub=768` -> `pp8192=229.31`, `tg128=27.29` (best)
- `b=4096,ub=1024` -> `pp8192=211.53`, `tg128=23.85`

Recommended serving baseline:

- `CUDA_VISIBLE_DEVICES=0`
- `-c 65536 -b 3072 -ub 768 -t 8 -fa on -ngl 999 --n-cpu-moe 47 -rtr --qwen3next-fused-delta 1`

## Final Benchmark Matrix (8k context proxy)

All four build/backend combinations were benchmarked with matched parameters and explicit `-mmp 0` for fairness.

Common args:

- `-m /models/qwen3-next-coder.gguf -p 8192 -n 128 -b 3072 -ub 768 -t 8 -r 1`
- CUDA runs: `CUDA_VISIBLE_DEVICES=0 -fa 1 -ngl 999 --n-cpu-moe 47 -mmp 0`
- CPU runs: `-fa 0 -ngl 0 --n-cpu-moe 0 -mmp 0`

| Build | PP (tok/s) | TG (tok/s) |
|---|---:|---:|
| `ik` CUDA | 204.614 | 28.979 |
| mainline CUDA | 184.521 | 22.012 |
| `ik` CPU | 49.795 | 12.681 |
| mainline CPU | 51.674 | 7.299 |

Relative (`ik` vs mainline):

- CUDA PP: `+10.9%`
- CUDA TG: `+31.7%`
- CPU PP: `-3.6%`
- CPU TG: `+73.7%`

## Notes

- CPU-only Qwen3Next with `-fa 1` is now guarded in ik: FA is auto-disabled with a warning for `n_gpu_layers == 0` to avoid the prior `iqk_fa_templates.h` assert path.
- `ik` benchmark JSON currently includes some non-JSON log lines in stdout around context creation; parsing should tolerate that.
- Fused DeltaNet mode mapping has been updated in code:
  - `0` / unset: non-fused
  - `1`: fused only for `n_tok > 1` (safe mode)
  - `2`: fused on all token counts (experimental; decode-quality regression observed)
- Added manual regression runner for fused-mode safety checks:
  - `scripts/qwen3next-fused-regression.sh`
  - Example:
    - `BIN=./build-qwen3next-fix/bin/llama-perplexity scripts/qwen3next-fused-regression.sh --model /models/qwen3-next-coder.gguf --ctx 2048 --decode-b 1 --decode-ub 1 --prefill-b 2048 --prefill-ub 512 --ngl 47 --n-cpu-moe 40`
- Also integrated into the broader eval harness:
  - `scripts/qwen3next-eval.sh --with-gpu --with-fused-regression ...`
  - Results are surfaced in `SUMMARY.md` under `IK Fused Delta Regression`.
- Fused regression now enforces absolute non-fused sanity too:
  - mode0 decode/prefill PPL must stay below configurable thresholds (defaults: `10.0` / `10.0`).
- Added unified Qwen3Next regression entrypoint for ongoing checks:
  - `scripts/qwen3next-regression.sh --model /path/to/qwen3-next-coder.gguf`
  - Outputs `SUMMARY.md` + per-step logs under `/tmp/qwen3next-regression/<timestamp>/`.
- Added CLI plumbing for fused mode control (no raw env required):
  - `--qwen3next-fused-delta {0|1|2}`
  - This sets `LLAMA_QWEN3NEXT_FUSED_DELTA` for the current process.
- Added experimental CUDA DeltaNet dispatch control:
  - `GGML_CUDA_DELTA_NET_OPT={0|1|2|3|4}`
  - `0`: baseline dispatch (default)
  - `1`: force fp16 recurrent kernel (`head_dim=128`)
  - `2`: force multiblock kernel
  - `3`: force Blackwell optimized kernel
  - `4`: conservative auto mode (pre-Blackwell only)
- RTX 5060 Ti spot checks (`p=2048,n=64,b=1024,ub=256,--n-cpu-moe 47,-rtr 1`) did not show a reliable win from forced kernels:
  - mode `2` and mode `3` reduced TG in single-run checks versus baseline.
  - mode `4` tracks baseline on Blackwell (by design, no forced optimized-kernel switch there).
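
For the note above about non-JSON log lines in `ik` benchmark stdout, one tolerant approach is to scan for the first position where a JSON array of objects decodes cleanly. This is a minimal Python sketch that assumes the payload is a single JSON array of objects; the function name and the sample text are illustrative, not taken from an actual run.

```python
import json


def extract_json_array(text: str) -> list:
    """Return the first non-empty JSON array of objects embedded in noisy text."""
    decoder = json.JSONDecoder()
    pos = text.find("[")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at pos and ignores trailing text.
            value, _end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            value = None
        # Require a non-empty list of dicts so bracketed log tags are skipped.
        if isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
            return value
        pos = text.find("[", pos + 1)
    raise ValueError("no JSON array of objects found in output")


# Made-up sample: log lines before and after the JSON payload.
noisy = (
    "[info] creating context\n"
    '[{"n_prompt": 8192, "avg_ts": 1.0}]\n'
    "done\n"
)
print(extract_json_array(noisy))
```

Scanning from each `[` keeps the parser independent of how many log lines appear or where they land, at the cost of re-trying on bracketed log prefixes.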

## Decode Quality Diagnosis (Wikitext-2, `--chunks 1`, CUDA)

Real-data perplexity checks on `/tmp/ppl_wikitext2_test.txt` confirm the decode regression source:

- `qwen3-next-coder.gguf`
  - mode `0`, opt `0`: `PPL=3.9148`
  - mode `1`, opt `0`: `PPL=3.9148` (parity with mode 0)
  - mode `2`, opt `0/1/2/4`: `PPL=6.1277` (consistently regressed)
  - mode `2`, opt `3`: `PPL=302221.3639` (catastrophic instability)
- `qwen-3-coder-next-mxfp4.gguf`
  - mode `0`, opt `0`: `PPL=3.9832`
  - mode `1`, opt `0`: `PPL=3.9832` (parity with mode 0)
  - mode `2`, opt `0`: `PPL=6.2362` (same regression pattern)
  - mode `2`, opt `3`: `PPL=795964.1118` (catastrophic instability)

Conclusion:

- Decode-quality regression is tied to fused-all mode (`LLAMA_QWEN3NEXT_FUSED_DELTA=2`), not fixed by kernel dispatch overrides.
- `GGML_CUDA_DELTA_NET_OPT=3` should not be used on this path.

## Safe Speed Gain (mode 1)

With decode-safe mode (`LLAMA_QWEN3NEXT_FUSED_DELTA=1`), throughput on the serving proxy profile improved while preserving perplexity:

- Profile:
  - `llama-bench -m /models/qwen3-next-coder.gguf -p 8192 -n 128 -b 3072 -ub 768 -t 8 -fa 1 -ngl 999 --n-cpu-moe 47 -r 3 -rtr 1 -mmp 0`
- Mode `0` (`r=3`):
  - `pp8192 = 175.639 +/- 0.221 tok/s`
  - `tg128 = 26.393 +/- 1.469 tok/s`
- Mode `1` (`r=3`):
  - `pp8192 = 237.014 +/- 1.199 tok/s`
  - `tg128 = 27.111 +/- 1.395 tok/s`
- Relative (`mode1` vs `mode0`):
  - PP: `+34.9%`
  - TG: `+2.7%`
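
To judge how much of each delta is likely noise, the difference can be compared with the combined spread of the two runs. This is a rough Python sketch that assumes the `+/-` values are standard deviations over the three repetitions; with only three samples per mode it is a sanity check, not a statistical test, and the function name is illustrative.

```python
import math


def delta_in_sigmas(mean_a: float, std_a: float, mean_b: float, std_b: float) -> float:
    """Difference (b - a) divided by the combined standard deviation of both runs."""
    return (mean_b - mean_a) / math.hypot(std_a, std_b)


# Mode 0 vs mode 1 values copied from the list above.
pp_sigmas = delta_in_sigmas(175.639, 0.221, 237.014, 1.199)
tg_sigmas = delta_in_sigmas(26.393, 1.469, 27.111, 1.395)

print(f"PP delta: {pp_sigmas:.1f} combined sigma")
print(f"TG delta: {tg_sigmas:.2f} combined sigma")
```

Under that assumption the PP gain is far outside the run-to-run spread, while the TG gain is well under one combined standard deviation, so TG is better read as unchanged than as improved.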

Additional A/B for `GGML_CUDA_DELTA_NET_OPT=2` under mode `1` (`r=3`) did not improve performance:

- opt `0`: `pp8192=238.352`, `tg128=24.709`
- opt `2`: `pp8192=237.680`, `tg128=24.566`