diff --git a/docs/development/qwen3next_bench_16k_pp16384_tg128.md b/docs/development/qwen3next_bench_16k_pp16384_tg128.md
index 8e91e562..82ede13c 100644
--- a/docs/development/qwen3next_bench_16k_pp16384_tg128.md
+++ b/docs/development/qwen3next_bench_16k_pp16384_tg128.md
@@ -77,7 +77,7 @@ Working comparison at `--n-cpu-moe 45`:
 |---|---|---:|---:|
 | `ik_llama.cpp` (`-rtr 1`) | CUDA | 232.340508 | 27.895722 |
 
-## Fused DeltaNet Quality Check (GPU, `-c 2048`, `--chunks 1`)
+## Historical Fused DeltaNet Check (obsolete)
 
 Date: 2026-02-08
 
@@ -97,8 +97,8 @@ Results (Wikitext2 sample file `/tmp/ppl_wikitext2_test.txt`):
 
 Conclusion:
 
-- Fused DeltaNet path is currently numerically bad for both tested quants on CUDA in this setup.
-- Keeping fused path opt-in (`LLAMA_QWEN3NEXT_FUSED_DELTA=1`) and defaulting to non-fused is required for model quality.
+- This run is kept for history only and is superseded by the later `Fused DeltaNet Safety Update (Superseding)` section below.
+- Use the superseding section as source of truth for mode mapping and quality guidance.
 
 ## Upstream PR #19375 Trial (Selective Port) Outcome
 
@@ -111,8 +111,8 @@ What was tried:
 Outcome:
 
 - No stable speed win in our setup after repeated runs.
-- Autoregressive rewrite specifically hurt TG throughput in non-fused mode and was reverted.
-- Final code keeps only the fused-default safety fix (non-fused by default).
+- Direct autoregressive rewrite attempts from PR #19375 were not compatible with current ik graph-layout/contiguity assumptions and were reverted.
+- Final code keeps only safe chunk-shape fixes plus fused-mode safety controls.
 
 ## Decode-Only Fused Mode Trial (`LLAMA_QWEN3NEXT_FUSED_DELTA=2`)
 
diff --git a/docs/development/qwen3next_perf_diff_report.md b/docs/development/qwen3next_perf_diff_report.md
index 25d59f3a..cc2a9765 100644
--- a/docs/development/qwen3next_perf_diff_report.md
+++ b/docs/development/qwen3next_perf_diff_report.md
@@ -35,12 +35,11 @@ Not directly mirrored yet (by design divergence from mainline model layout):
 
 ## Required Adjustments (remaining)
 
-1. Keep non-fused as the strict safety baseline, and use `LLAMA_QWEN3NEXT_FUSED_DELTA=1` (prefill-only fused) as the practical acceleration mode.
-2. Port selective graph-shape optimizations from PR #19375 into `src/llama-build-context.cpp` where they map cleanly (avoid blind copy due architectural divergence).
-3. Added dedicated Qwen3Next regression target for dev/CI-style checks:
-   `scripts/qwen3next-regression.sh`
-   combines fused safety regression + single-GPU proxy sweep + long-context fit sanity.
-4. Investigate ik CPU Flash-Attn assertion path for Qwen3Next (`iqk_fa_templates.h`, `S > 0`) before enabling `-fa 1` for CPU benchmark profiles.
+1. Keep non-fused as the strict safety baseline in defaults, and use `LLAMA_QWEN3NEXT_FUSED_DELTA=1` (prefill-only fused) as the explicit acceleration mode.
+2. Add a first-class runtime flag/CLI plumb for Qwen3Next fused mode (`LLAMA_QWEN3NEXT_FUSED_DELTA`) so serving does not depend on raw env wiring.
+3. Continue using `scripts/qwen3next-regression.sh` as the release gate for this model path, and wire it into CI or pre-merge checks.
+4. Treat the remaining PR #19375 autoregressive rewrite as deferred: direct porting into current ik graph builder is not layout-compatible without broader contiguity/reshape refactoring.
+5. Revisit PR #18792 (`src/models/delta.cpp`) only if we need unified GDA/KDA support for additional architectures; for Qwen3Next-only it is optional.
 
 ## Strong Points of `ik_llama.cpp` to Preserve
 
@@ -93,7 +92,7 @@ Relative (`ik` vs mainline):
 
 ## Notes
 
-- `ik` CPU benchmark with `-fa 1` currently aborts for this model in `iqk_fa_templates.h` (`GGML_ASSERT(S > 0)`), so CPU matrix uses `-fa 0` for both repos.
+- CPU-only Qwen3Next with `-fa 1` is now guarded in ik: FA is auto-disabled with a warning for `n_gpu_layers == 0` to avoid the prior `iqk_fa_templates.h` assert path.
 - `ik` benchmark JSON currently includes some non-JSON log lines in stdout around context creation; parsing should tolerate that.
 - Fused DeltaNet mode mapping has been updated in code:
   - `0` / unset: non-fused