docs: reconcile qwen3next status and remaining upstream gaps

2026-02-21 13:44:10 +00:00 · 2026-02-08 01:12:40 -08:00
parent 627d46912c
commit bd0dd7804b
2 changed files with 11 additions and 12 deletions
--- a/docs/development/qwen3next_bench_16k_pp16384_tg128.md
+++ b/docs/development/qwen3next_bench_16k_pp16384_tg128.md
@@ -77,7 +77,7 @@ Working comparison at `--n-cpu-moe 45`:
 |---|---|---:|---:|
 | `ik_llama.cpp` (`-rtr 1`) | CUDA | 232.340508 | 27.895722 |

-## Fused DeltaNet Quality Check (GPU, `-c 2048`, `--chunks 1`)
+## Historical Fused DeltaNet Check (obsolete)

 Date: 2026-02-08

@@ -97,8 +97,8 @@ Results (Wikitext2 sample file `/tmp/ppl_wikitext2_test.txt`):

 Conclusion:

- Fused DeltaNet path is currently numerically bad for both tested quants on CUDA in this setup.
- Keeping fused path opt-in (`LLAMA_QWEN3NEXT_FUSED_DELTA=1`) and defaulting to non-fused is required for model quality.
+- This run is kept for history only and is superseded by the later `Fused DeltaNet Safety Update (Superseding)` section below.
+- Use the superseding section as source of truth for mode mapping and quality guidance.

 ## Upstream PR #19375 Trial (Selective Port) Outcome

@@ -111,8 +111,8 @@ What was tried:
 Outcome:

 - No stable speed win in our setup after repeated runs.
- Autoregressive rewrite specifically hurt TG throughput in non-fused mode and was reverted.
- Final code keeps only the fused-default safety fix (non-fused by default).
+- Direct autoregressive rewrite attempts from PR #19375 were not compatible with current ik graph-layout/contiguity assumptions and were reverted.
+- Final code keeps only safe chunk-shape fixes plus fused-mode safety controls.

 ## Decode-Only Fused Mode Trial (`LLAMA_QWEN3NEXT_FUSED_DELTA=2`)

--- a/docs/development/qwen3next_perf_diff_report.md
+++ b/docs/development/qwen3next_perf_diff_report.md
@@ -35,12 +35,11 @@ Not directly mirrored yet (by design divergence from mainline model layout):

 ## Required Adjustments (remaining)

-1. Keep non-fused as the strict safety baseline, and use `LLAMA_QWEN3NEXT_FUSED_DELTA=1` (prefill-only fused) as the practical acceleration mode.
-2. Port selective graph-shape optimizations from PR #19375 into `src/llama-build-context.cpp` where they map cleanly (avoid blind copy due architectural divergence).
-3. Added dedicated Qwen3Next regression target for dev/CI-style checks:
-   - `scripts/qwen3next-regression.sh`
-   - combines fused safety regression + single-GPU proxy sweep + long-context fit sanity.
-4. Investigate ik CPU Flash-Attn assertion path for Qwen3Next (`iqk_fa_templates.h`, `S > 0`) before enabling `-fa 1` for CPU benchmark profiles.
+1. Keep non-fused as the strict safety baseline in defaults, and use `LLAMA_QWEN3NEXT_FUSED_DELTA=1` (prefill-only fused) as the explicit acceleration mode.
+2. Add first-class runtime flag/CLI plumbing for Qwen3Next fused mode (`LLAMA_QWEN3NEXT_FUSED_DELTA`) so serving does not depend on raw env wiring.
+3. Continue using `scripts/qwen3next-regression.sh` as the release gate for this model path, and wire it into CI or pre-merge checks.
+4. Treat the remaining PR #19375 autoregressive rewrite as deferred: direct porting into the current ik graph builder is not layout-compatible without broader contiguity/reshape refactoring.
+5. Revisit PR #18792 (`src/models/delta.cpp`) only if we need unified GDA/KDA support for additional architectures; for Qwen3Next-only it is optional.

 ## Strong Points of `ik_llama.cpp` to Preserve

@@ -93,7 +92,7 @@ Relative (`ik` vs mainline):

 ## Notes

- `ik` CPU benchmark with `-fa 1` currently aborts for this model in `iqk_fa_templates.h` (`GGML_ASSERT(S > 0)`), so CPU matrix uses `-fa 0` for both repos.
+- CPU-only Qwen3Next with `-fa 1` is now guarded in ik: FA is auto-disabled with a warning for `n_gpu_layers == 0` to avoid the prior `iqk_fa_templates.h` assert path.
 - `ik` benchmark JSON currently includes some non-JSON log lines in stdout around context creation; parsing should tolerate that.
 - Fused DeltaNet mode mapping has been updated in code:
  - `0` / unset: non-fused