Qwen3Next Review and Benchmark Summary (`ik_llama.cpp` vs `llama.cpp`)

Date: 2026-02-08

Scope

This document captures:

Current upstream PR alignment for Qwen3Next-related work.
What is already strong in ik_llama.cpp and what still needs adjustment.
Recommended runtime settings for this machine (single GPU target, long context).
Final apples-to-apples benchmark matrix for ik_llama.cpp vs ../llama.cpp.

Reviewed PRs:

https://github.com/ggml-org/llama.cpp/pull/18102 (open): Delta-Net CUDA op + integration.
https://github.com/ggml-org/llama.cpp/pull/18792 (open): unified DeltaNet handling (src/models/delta.cpp).
https://github.com/ggml-org/llama.cpp/pull/19375 (open, draft): Qwen3Next graph optimization in model builder.

Already present and/or functionally covered:

CUDA DeltaNet op path exists in GGML (ggml/src/ggml-cuda/delta-net.cu).
Solve-tri and backend op support are present for the fused path.
Qwen3Next fused DeltaNet builder path exists (and is now runtime-toggleable via env).
Existing ik optimizations remain available (-rtr, grouped/fused paths, no-offload-only-active-experts switches).

Not directly mirrored yet (by design divergence from mainline model layout):

Keep non-fused as the strict safety baseline, and use LLAMA_QWEN3NEXT_FUSED_DELTA=1 (prefill-only fused) as the practical acceleration mode.
Port selective graph-shape optimizations from PR #19375 into src/llama-build-context.cpp where they map cleanly (avoid blind copy due architectural divergence).
Add one dedicated Qwen3Next perf regression target in CI/dev docs (single-GPU 8k proxy + 65k fit sanity).
Investigate ik CPU Flash-Attn assertion path for Qwen3Next (iqk_fa_templates.h, S > 0) before enabling -fa 1 for CPU benchmark profiles.

More runtime controls than mainline for this workload (-rtr, backend toggles, MoE/OOAE controls).
Strong CUDA path for this model family once offload routing is tuned (--n-cpu-moe thresholding).
Better TG throughput than current mainline in matched CUDA and CPU tests on this host.

Model: /models/qwen3-next-coder.gguf

Single-GPU long-context finding:

8k sweep proxy (single GPU, tuned path):

Recommended serving baseline:

All four builds were benchmarked with matched parameters and explicit -mmp 0 for fairness.

Common args:

Relative (ik vs mainline):

ik CPU benchmark with -fa 1 currently aborts for this model in iqk_fa_templates.h (GGML_ASSERT(S > 0)), so CPU matrix uses -fa 0 for both repos.
ik benchmark JSON currently includes some non-JSON log lines in stdout around context creation; parsing should tolerate that.
Fused DeltaNet mode mapping has been updated in code:
- 0 / unset: non-fused
- 1: fused only for n_tok > 1 (safe mode)
- 2: fused on all token counts (experimental; decode-quality regression observed)
Added manual regression runner for fused-mode safety checks:
- scripts/qwen3next-fused-regression.sh
- Example:
  - BIN=./build-qwen3next-fix/bin/llama-perplexity scripts/qwen3next-fused-regression.sh --model /models/qwen3-next-coder.gguf --ctx 2048 --decode-b 1 --decode-ub 1 --prefill-b 2048 --prefill-ub 512 --ngl 47 --n-cpu-moe 40
Also integrated into the broader eval harness:
- scripts/qwen3next-eval.sh --with-gpu --with-fused-regression ...
- Results are surfaced in SUMMARY.md under IK Fused Delta Regression.
Fused regression now enforces absolute non-fused sanity too:
- mode0 decode/prefill PPL must stay below configurable thresholds (defaults: 10.0 / 10.0).