Mirror of https://github.com/ikawrakow/ik_llama.cpp.git (synced 2026-03-12 06:50:08 +00:00)
qwen3next: make fused delta safe by default and fix fused tensor layout
@@ -142,3 +142,41 @@ Notes:
- Decode-only fused mode preserves prompt-quality metrics in this test.
- TG improved significantly in this run; PP variance was higher, so PP delta should be treated as noisy.

## Fused DeltaNet Safety Update (Superseding)

Date: 2026-02-08

This section supersedes the earlier `LLAMA_QWEN3NEXT_FUSED_DELTA` mode mapping.

Updated env behavior in `src/llama-build-context.cpp`:

- `0` / unset: non-fused for all token counts
- `1`: fused only for `n_tok > 1` (prefill/chunking), non-fused for single-token decode
- `2`: fused for all token counts (experimental)
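
The mapping above can be read as a small decision function. The Python sketch below only models the documented behavior; it is not the code in `src/llama-build-context.cpp`. The names `parse_fused_mode` and `use_fused_delta` are hypothetical, and falling back to `0` for non-numeric or out-of-range values is an assumption.

```python
import os

ENV_VAR = "LLAMA_QWEN3NEXT_FUSED_DELTA"


def parse_fused_mode(env=None):
    """Return the fused mode (0, 1, or 2) read from an environment mapping.

    Assumption: unset, non-numeric, or out-of-range values fall back to 0.
    """
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR, "0")
    try:
        mode = int(raw)
    except ValueError:
        return 0
    return mode if mode in (0, 1, 2) else 0


def use_fused_delta(mode, n_tok):
    """Decide whether a step that processes n_tok tokens takes the fused path."""
    if mode == 2:
        return True       # fused for all token counts (experimental)
    if mode == 1:
        return n_tok > 1  # fused for prefill/chunking only
    return False          # 0 / unset: always non-fused
```

Under this model, `use_fused_delta(1, 1)` is `False` (single-token decode stays non-fused) while `use_fused_delta(1, 512)` is `True`.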

Reason:

- Fused path has a known decode-path quality regression when forced on single-token steps.
- The safer default acceleration is therefore prefill-only fused mode (`=1`).

Validation (CUDA, `qwen3-next-coder.gguf`, `-c 2048 -b 1 -ub 1 -fa on -ngl 47 --n-cpu-moe 40 --chunks 1 --no-warmup`):

| Mode | PPL |
|---|---:|
| `LLAMA_QWEN3NEXT_FUSED_DELTA=0` | `3.9148 +/- 0.31093` |
| `LLAMA_QWEN3NEXT_FUSED_DELTA=1` | `3.9148 +/- 0.31093` |
| `LLAMA_QWEN3NEXT_FUSED_DELTA=2` | `6.1277 +/- 0.54810` |

Because `-b 1 -ub 1` processes one token per step, `=1` never takes the fused path in this run and matches `=0` exactly; the `=2` result isolates the single-token fused path.
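
To put the PPL figures above in relative terms, the following Python sketch computes each mode's change against mode `0` and checks whether the reported ranges overlap. Treating the `+/-` values as symmetric error ranges is an assumption about how they should be read, and the helper names are illustrative.

```python
def relative_increase(baseline, value):
    """Fractional change of value relative to baseline."""
    return (value - baseline) / baseline


def intervals_overlap(mean_a, err_a, mean_b, err_b):
    """True if [mean - err, mean + err] ranges of a and b intersect."""
    return abs(mean_a - mean_b) <= err_a + err_b


# (PPL, +/-) per mode, copied from the validation table.
ppl = {0: (3.9148, 0.31093), 1: (3.9148, 0.31093), 2: (6.1277, 0.54810)}

base_mean, base_err = ppl[0]
for mode, (mean, err) in ppl.items():
    rel = relative_increase(base_mean, mean)
    overlap = intervals_overlap(base_mean, base_err, mean, err)
    print(f"mode {mode}: {rel:+.1%} vs mode 0, ranges overlap: {overlap}")
```

With these inputs, mode `2` comes out roughly 56% above the baseline PPL and its range does not overlap the baseline range.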

Quick throughput check (`-p 8192 -n 128 -b 2048 -ub 512 -r 1 -rtr 1`, same CUDA settings):

| Mode | PP 8192 (tok/s) | TG 128 (tok/s) |
|---|---:|---:|
| `0` | `179.30` | `24.69` |
| `1` | `252.12` | `22.99` |
| `2` | `245.71` | `27.94` |
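
The throughput numbers above are easier to compare as percentage changes against mode `0`. This is a minimal Python sketch using only the table values; the function names are illustrative.

```python
# tok/s per mode, copied from the throughput table.
throughput = {
    0: {"pp": 179.30, "tg": 24.69},
    1: {"pp": 252.12, "tg": 22.99},
    2: {"pp": 245.71, "tg": 27.94},
}


def percent_change(baseline, value):
    """Percentage change of value relative to baseline."""
    return 100.0 * (value - baseline) / baseline


def compare_to_baseline(results, baseline_mode=0):
    """Map each non-baseline mode to its per-metric percentage change."""
    base = results[baseline_mode]
    return {
        mode: {key: percent_change(base[key], val) for key, val in vals.items()}
        for mode, vals in results.items()
        if mode != baseline_mode
    }


for mode, delta in compare_to_baseline(throughput).items():
    print(f"mode {mode}: PP {delta['pp']:+.1f}%, TG {delta['tg']:+.1f}%")
```

Modes `0` and `1` both decode on the non-fused path, so the TG gap between them may reflect run-to-run variance from the single repetition (`-r 1`) rather than a mode effect.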

Interpretation:

- Use `=1` for production-safe quality with a strong PP gain (about 40% over `=0` in the run above).
- Reserve `=2` for experiments only until decode-path correctness is fixed.
@@ -35,7 +35,7 @@ Not directly mirrored yet (by design divergence from mainline model layout):
## Required Adjustments (remaining)

1. Keep fused DeltaNet as default, but preserve safe fallback path (`LLAMA_QWEN3NEXT_FUSED_DELTA=0`) for debugging/regression checks.
1. Keep non-fused as the strict safety baseline, and use `LLAMA_QWEN3NEXT_FUSED_DELTA=1` (prefill-only fused) as the practical acceleration mode.
2. Port selective graph-shape optimizations from PR #19375 into `src/llama-build-context.cpp` where they map cleanly (avoid blind copy due to architectural divergence).
3. Add one dedicated Qwen3Next perf regression target in CI/dev docs (single-GPU 8k proxy + 65k fit sanity).
4. Investigate ik CPU Flash-Attn assertion path for Qwen3Next (`iqk_fa_templates.h`, `S > 0`) before enabling `-fa 1` for CPU benchmark profiles.
@@ -93,3 +93,7 @@ Relative (`ik` vs mainline):
- `ik` CPU benchmark with `-fa 1` currently aborts for this model in `iqk_fa_templates.h` (`GGML_ASSERT(S > 0)`), so CPU matrix uses `-fa 0` for both repos.
- `ik` benchmark JSON currently includes some non-JSON log lines in stdout around context creation; parsing should tolerate that.
- Fused DeltaNet mode mapping has been updated in code:
  - `0` / unset: non-fused
  - `1`: fused only for `n_tok > 1` (safe mode)
  - `2`: fused on all token counts (experimental; decode-quality regression observed)
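
For the note above about non-JSON log lines in `ik` benchmark stdout, one tolerant approach is to scan for the first position where a JSON value decodes cleanly. The Python sketch below assumes the payload is a non-empty object or a non-empty array of objects; the sample log lines and keys are invented for illustration and are not actual benchmark output.

```python
import json


def looks_like_payload(value):
    """Assumption: results are a non-empty dict or a non-empty list of dicts."""
    if isinstance(value, dict):
        return bool(value)
    if isinstance(value, list):
        return bool(value) and all(isinstance(item, dict) for item in value)
    return False


def extract_json(stdout_text):
    """Return the first JSON payload embedded in noisy text.

    Raises ValueError if no candidate decodes to a plausible payload.
    """
    decoder = json.JSONDecoder()
    for pos, ch in enumerate(stdout_text):
        if ch not in "[{":
            continue
        try:
            value, _end = decoder.raw_decode(stdout_text, pos)
        except json.JSONDecodeError:
            continue  # bracket belonged to a log line, keep scanning
        if looks_like_payload(value):
            return value
    raise ValueError("no JSON payload found in output")


# Hypothetical noisy output: log lines before and after the JSON array.
noisy = (
    "[1] creating context\n"
    '[\n  {"test": "pp8192", "tok_per_s": 252.12}\n]\n'
    "context freed\n"
)
results = extract_json(noisy)
```

The `looks_like_payload` filter is what keeps a bracketed log prefix such as `[1]` from being mistaken for the result array.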