docs: refresh qwen3next perf review and benchmark matrix

Author: yurko
Date: 2026-02-07 17:31:17 -08:00
Parent: b33cef68ad
Commit: 81e788e2f6


@@ -1,482 +1,95 @@
# Qwen3Next Performance-Differences Report (`ik_llama.cpp` vs `llama.cpp`)
This report documents:
- Measured behavior observed during bring-up and benchmarking.
- Code-level differences likely affecting performance.
- Fixes already applied in `ik_llama.cpp`.
- Remaining bottlenecks and concrete next steps.
All numbers below were collected on this machine in Docker with the model:
- `/models/qwen3-next-coder.gguf`
Date of measurements: 2026-02-06.
## Environment Notes
- GPU setup: RTX 5060 Ti + RTX 3060.
- Early slow runs were partially confounded by low free memory on GPU1 in one session (`~201 MiB` free at init).
- Later checks confirmed the GPUs can be mostly free (`~15.8 GiB` and `~11.9 GiB` free) before starting runs.
## What Was Validated
### Numerical sanity/parity check (perplexity)
Using identical prompt text, `c=256`, `b=64`, `ub=64`, CPU model weights (`-ngl 0`), no warmup:
- `ik` (`llama-perplexity`), `chunks=1`:
  - `[1]1.0009`
  - `Final estimate: PPL over 1 chunks for n_ctx=256 = 1.0009 +/- 0.00045`
- mainline (`llama-perplexity`), `chunks=1`:
  - `[1]1.0008`
  - `Final estimate: PPL = 1.0008 +/- 0.00036`
And for `chunks=2`:
- `ik`: `[1]1.0009,[2]1.0009`, `Final estimate ... = 1.0009 +/- 0.00026`
- mainline: `[1]1.0008,[2]1.0008`, `Final estimate ... = 1.0008 +/- 0.00020`
Interpretation: the current `ik` Qwen3Next path is numerically very close to mainline on this test.
## Measured Performance Signals
### `ik` sweep at long context
`llama-sweep-bench` with `c=65536`, `b=1024`, `ub=128` started successfully but produced low TG in the observed rows (roughly `~2.2` to `~4.1` t/s), with PP mostly in the `~27` to `~60` t/s range depending on `n_kv` occupancy.
This run was intentionally stopped by the user before completion.
### Scheduler limits hit at larger batch
`ik` with `c=65536`, `b=4096`, `ub=1024` failed with:
- `GGML_ASSERT(i_split < GGML_SCHED_MAX_SPLITS)` in `ggml-backend.cpp`.
This indicates high graph-split pressure for this configuration.
## Code-Level Differences Relevant to Performance
## 1) Recurrent-state storage model differs from mainline
Mainline Qwen3Next uses recurrent memory abstractions (`llama_memory_recurrent`) with `R` and `S` state buffers in F32:
- `llama.cpp/src/llama-model.cpp:7505`
- `llama.cpp/src/models/qwen3next.cpp:686`
- `llama.cpp/src/models/qwen3next.cpp:687`
The `ik` path originally used KV cache-tail handling; this was changed to dedicated per-layer state tensors (`s_l`) in F32:
- `ik_llama.cpp/src/llama-context.h:59`
- `ik_llama.cpp/src/llama.cpp:771`
- `ik_llama.cpp/src/llama.cpp:817`
- `ik_llama.cpp/src/llama-build-context.cpp:4617`
Impact: avoids repeated casts into and out of the recurrent state for Qwen3Next and aligns more closely with mainline state-precision behavior.
## 2) `ggml_sub` broadcast semantics differ
Mainline allows repeat/broadcast in `ggml_sub`:
- `llama.cpp/ggml/src/ggml.c:2129`
`ik` currently enforces same-shape inputs:
- `ik_llama.cpp/ggml/src/ggml.c:6406`
Consequence: in the Qwen3Next chunking path, `ik` must materialize explicit repeats for tensors used in `sub`, adding graph nodes and memory traffic.
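For intuition, a self-contained sketch of the two shape rules; the `tensor` struct, both helper functions, and the example shapes are simplified stand-ins (ggml's actual broadcast rule is the `ggml_can_repeat`-style per-dimension divisibility check):
```cpp
// Standalone illustration only: not the actual ggml structs or functions.
#include <cstdint>
#include <cstdio>
struct tensor { int64_t ne[4]; }; // dimension sizes only; data omitted
// same-shape rule (current ik `ggml_sub`): inputs must match exactly
static bool same_shape(const tensor & a, const tensor & b) {
    for (int i = 0; i < 4; ++i) if (a.ne[i] != b.ne[i]) return false;
    return true;
}
// broadcast rule (mainline `ggml_sub`): b repeats into a when every
// dimension of a is a multiple of the corresponding dimension of b
static bool can_repeat(const tensor & a, const tensor & b) {
    for (int i = 0; i < 4; ++i) if (a.ne[i] % b.ne[i] != 0) return false;
    return true;
}
int main() {
    tensor g_cumsum = {{64, 16, 8, 1}}; // hypothetical chunk cumsum
    tensor g_last   = {{ 1, 16, 8, 1}}; // hypothetical per-row last value
    printf("broadcast sub ok:  %d\n", can_repeat(g_cumsum, g_last)); // 1
    printf("same-shape sub ok: %d\n", same_shape(g_cumsum, g_last)); // 0
}
```
Under the same-shape rule, the second call returning 0 is what forces the caller to insert a `ggml_repeat` before the subtract.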
## 3) Qwen3Next chunking path has extra explicit repeats in `ik`
Current `ik` chunking path repeats `g_cumsum` and `g_last` before subtraction:
- `ik_llama.cpp/src/llama-build-context.cpp:4234`
- `ik_llama.cpp/src/llama-build-context.cpp:4287`
Mainline path uses broadcasted subtraction without those explicit materializations:
- `llama.cpp/src/models/qwen3next.cpp:200`
- `llama.cpp/src/models/qwen3next.cpp:264`
Consequence: additional memory traffic and extra graph nodes on a high-frequency path.
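As a rough back-of-envelope (all dimensions below are invented for illustration), each materialized repeat costs one full-tensor write plus one full-tensor read per graph evaluation:
```cpp
// Hypothetical shapes, not taken from the model; only the formula matters.
#include <cstdint>
#include <cstdio>
int main() {
    const int64_t n_chunk = 64, n_head = 16, n_seq = 8; // assumed dims
    const int64_t elems   = n_chunk * n_head * n_seq;   // broadcast target
    // the repeat writes the full tensor once; the subtract reads it back
    const int64_t extra_bytes = 2 * elems * sizeof(float);
    printf("extra traffic per repeat: %lld bytes\n", (long long)extra_bytes);
    // a broadcasted sub would instead re-read only the small source rows
}
```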
## 4) Graph split count is higher in `ik` for tested Qwen3Next context
Observed logs for `c=256` showed:
- `ik`: graph splits `1227`
- mainline: graph splits `975`
Higher split count usually implies more sync/copy overhead and can reduce PP/TG.
## Fixes Already Applied in `ik`
These are included in commit:
- `a7df116` (`qwen3next: add architecture support and recurrent-state fixes`)
Applied items:
- Added Qwen3Next architecture and kernels in `ik`.
- Added dedicated F32 recurrent-state storage (`s_l`) for Qwen3Next recurrent layers.
- Updated Qwen3Next build path to read/write from dedicated state storage when available.
- Ensured numerical sanity vs mainline with perplexity checks above.
- Kept conservative explicit-repeat logic in chunking where `ik`'s `ggml_sub` currently requires same-shape inputs (testing showed that a global broadcast change caused instability in this fork).
## Why Current `ik` Can Still Be Slower
Most probable remaining reasons:
- Extra repeat materializations in chunking path.
- Higher graph split count in scheduler/backend path.
- Less optimized Qwen3Next integration path compared to mainline recurrent-memory abstractions.
- Run configuration sensitivity at long context and very large batch (`SCHED_MAX_SPLITS` boundary).
## Priority Next Fixes
1. Reduce split pressure and keep benchmark configs inside the stable split envelope at 64k.
2. Eliminate or fuse high-cost repeat materializations in Qwen3Next chunking path without changing math.
3. Align more of Qwen3Next recurrent memory/update flow with mainline memory-recurrent pattern where possible.
4. Validate after each change:
- PPL/outputs against mainline.
- PP/TG against the same benchmark parameters.
## Current Status
- Qwen3Next is integrated and functionally running in `ik`.
- Precision is close to mainline on tested perplexity cases.
- Performance gap remains and requires targeted optimization work listed above.
## 2026-02-06 Optimization Update
### Newly applied performance changes
1. Enabled broadcast-capable `ggml_sub` and aligned it with existing `ggml_mul` broadcast behavior.
2. Reworked CPU `ggml_compute_forward_sub_f32` to use threaded row-splitting and contiguous broadcast loops (see the sketch after this list).
3. Enabled `GGML_OP_SUB` multi-task scheduling in `ggml_get_n_tasks`.
4. Removed two avoidable repeat materializations in Qwen3Next chunking path:
- `gcs_i = repeat(g_cumsum, ...)` -> `gcs_i = g_cumsum`
- `g_last_repeat` in `g_diff` path removed, using direct broadcasted subtract.
5. Added a CUDA fast path in `ggml_cuda_op_ssm_conv` for single-sequence recurrent updates (`n_kv == 1`), with token-block parallelization and explicit final-state reconstruction.
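To make items 1-3 concrete, here is a minimal, self-contained sketch of a row-split broadcast subtract; the shapes, the row-only broadcast rule, and all names are assumptions, while ggml's real `ith`/`nth` partitioning is analogous but more general:
```cpp
// Illustrative sketch only, not the ggml source.
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>
static void sub_f32_rows(const float * a, const float * b, float * dst,
                         int64_t nrows, int64_t ncols, int64_t b_nrows,
                         int ith, int nth) {
    // each thread takes one contiguous block of rows, ggml-style
    const int64_t per = (nrows + nth - 1) / nth;
    const int64_t r0  = ith * per;
    const int64_t r1  = std::min(nrows, r0 + per);
    for (int64_t r = r0; r < r1; ++r) {
        const float * br = b + (r % b_nrows) * ncols; // broadcast b over rows
        for (int64_t c = 0; c < ncols; ++c) {         // contiguous inner loop
            dst[r*ncols + c] = a[r*ncols + c] - br[c];
        }
    }
}
int main() {
    const int64_t nrows = 8, ncols = 4;
    std::vector<float> a(nrows*ncols, 2.0f), b(ncols, 0.5f), dst(nrows*ncols);
    const int nth = 4; // stand-in for ggml's task count
    std::vector<std::thread> pool;
    for (int ith = 0; ith < nth; ++ith)
        pool.emplace_back(sub_f32_rows, a.data(), b.data(), dst.data(),
                          nrows, ncols, int64_t(1), ith, nth);
    for (auto & t : pool) t.join();
}
```
The contiguous inner loop is what lets the compiler auto-vectorize the per-row subtract once the broadcast indexing is hoisted out of it.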
### Post-change validation
#### CPU parity vs mainline (`-ngl 0`)
`c=256`, `b=64`, `ub=64`, `--no-warmup`:
- `chunks=1`
- `ik`: `[1]1.0007`, final `1.0007 +/- 0.00042`
- mainline: `[1]1.0007`, final `1.0007 +/- 0.00049`
- `chunks=2`
- `ik`: `[1]1.0007,[2]1.0007`, final `1.0007 +/- 0.00023`
- mainline: `[1]1.0007,[2]1.0008`, final `1.0008 +/- 0.00028`
#### CUDA sanity parity vs mainline (`CUDA_VISIBLE_DEVICES=1`, `-ngl 1`)
`c=256`, `b=64`, `ub=64`, `--no-warmup`, `chunks=1`:
- `ik`: `[1]1.0011`, final `1.0011 +/- 0.00071`
- mainline: `[1]1.0011`, final `1.0011 +/- 0.00074`
Interpretation: precision parity remains intact after CPU and CUDA optimizations.
### Updated long-context speed signal (`ik`, no KV quantization)
Config: `llama-sweep-bench -c 65536 -b 1024 -ub 128 -ctk f16 -ctv f16`
Observed rows after the changes show:
- PP generally in the `~82` to `~91` t/s range once `n_kv` grows (`~768` to `~3328` in sampled rows).
- TG generally in the `~6.2` to `~6.6` t/s range in the same sampled region.
This is a substantial improvement over the TG observed in the earlier slow run (`~2` to `~4` t/s).
### Remaining performance risks
- Some runs still offload few/no layers depending on available VRAM at run time, which can mask CUDA-path gains.
- `SCHED_MAX_SPLITS` limits at very aggressive `(b, ub)` settings are still a separate scaling constraint.
- Additional backend-level profiling is still needed to determine whether remaining gap to top-end mainline numbers is dominated by offload limits, scheduler split overhead, or other kernels.
## 2026-02-06 CUDA MoE/SSM Optimization Update
### Applied changes in this update
1. MoE row mapping in CUDA `mul_mat_id` paths (`ggml/src/ggml-cuda.cu`):
- Replaced per-call `ids` device->host copy, host-side count/build, and mapping host->device copy.
- Added device-side count + exclusive prefix sum + scatter kernels:
- `k_moe_row_count`
- `k_moe_row_exclusive_scan`
- `k_moe_row_scatter`
- Kept existing call-site logic intact by copying only compact metadata back (`moe_counts`, `cum_moe_counts`, invalid-id flag).
- Net effect: removes large host round-trip traffic from a hot MoE routing path (a sketch of the device-side pattern follows after this list).
2. Qwen3Next SSM conv path for `n_kv > 1` (`ggml/src/ggml-cuda/ssm-conv.cu`):
- Added a guarded fast path for decode-like multi-sequence batches where each token maps to one unique sequence (no multi-sequence fan-out per token).
- Added:
- `ssm_conv_validate_unique_seq_map`
- `ssm_conv_multi_seq_unique_f32_kernel`
- `ssm_conv_multi_seq_unique_f32_kernel_nc4`
- If the input pattern does not satisfy fast-path constraints, execution falls back to the existing kernel path unchanged.
3. Top-k MoE fusion verification:
- No matcher change was required in this update.
- Qwen3Next MoE build path still emits the expected `SOFT_MAX -> ... -> ARGSORT -> VIEW -> GET_ROWS` form used by current CUDA fusion checks.
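To make the item-1 pattern concrete, a hedged sketch of device-side count + exclusive scan + scatter; the kernel names mirror the report, but the bodies, launch shapes, and `main` harness are illustrative assumptions:
```cuda
#include <cuda_runtime.h>
// per-row expert ids -> per-expert counts, entirely on device
__global__ void k_moe_row_count(const int * ids, int n_rows,
                                int * counts /* n_experts, zeroed */) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_rows) atomicAdd(&counts[ids[i]], 1);
}
// single-thread exclusive prefix sum; fine for small expert counts
__global__ void k_moe_row_exclusive_scan(const int * counts, int n_experts,
                                         int * offsets) {
    if (threadIdx.x == 0) {
        int acc = 0;
        for (int e = 0; e < n_experts; ++e) { offsets[e] = acc; acc += counts[e]; }
    }
}
// scatter row indices into expert-grouped order using a cursor per expert
__global__ void k_moe_row_scatter(const int * ids, int n_rows,
                                  int * cursor /* copy of offsets */,
                                  int * row_map /* n_rows */) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_rows) {
        int slot = atomicAdd(&cursor[ids[i]], 1);
        row_map[slot] = i; // rows grouped by expert, ready for batched GEMM
    }
}
int main() {
    const int n_rows = 8, n_experts = 4;
    const int h_ids[n_rows] = {2, 0, 2, 1, 0, 3, 2, 1};
    int *ids, *counts, *offsets, *cursor, *row_map;
    cudaMalloc(&ids, n_rows*sizeof(int));
    cudaMalloc(&counts, n_experts*sizeof(int));
    cudaMalloc(&offsets, n_experts*sizeof(int));
    cudaMalloc(&cursor, n_experts*sizeof(int));
    cudaMalloc(&row_map, n_rows*sizeof(int));
    cudaMemcpy(ids, h_ids, sizeof(h_ids), cudaMemcpyHostToDevice);
    cudaMemset(counts, 0, n_experts*sizeof(int));
    k_moe_row_count<<<1, 256>>>(ids, n_rows, counts);
    k_moe_row_exclusive_scan<<<1, 1>>>(counts, n_experts, offsets);
    cudaMemcpy(cursor, offsets, n_experts*sizeof(int), cudaMemcpyDeviceToDevice);
    k_moe_row_scatter<<<1, 256>>>(ids, n_rows, cursor, row_map);
    cudaDeviceSynchronize();
    return 0;
}
```
The mapping data never leaves the device; per the report, only compact metadata (`moe_counts`, `cum_moe_counts`, the invalid-id flag) crosses back to the host.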
### Parity validation (required checks)
Tests were run in Docker (`iktest-dev:latest`) with:
- model: `/models/qwen3-next-coder.gguf`
- text corpus: `/tmp/qnext_ppl.txt` (same file for `ik` and mainline)
- params: `-c 256 -b 64 -ub 64 --no-warmup`
CPU parity (`-ngl 0`, threshold `<= 5e-4`):
- `chunks=1`: `ik 1.0041` vs `mainline 1.0037` (`delta=4e-4`) -> PASS
- `chunks=2`: `ik 1.0025` vs `mainline 1.0023` (`delta=2e-4`) -> PASS
CUDA sanity parity (`-ngl 1`, threshold `<= 1e-3`):
- `chunks=1`: `ik 1.0041` vs `mainline 1.0037` (`delta=4e-4`) -> PASS
- `chunks=2`: `ik 1.0025` vs `mainline 1.0023` (`delta=2e-4`) -> PASS
### Quick performance matrix (`llama-sweep-bench`)
Config: `-c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe`
| Profile | Baseline maxPP | Baseline maxTG | New maxPP | New maxTG | Delta maxPP | Delta maxTG |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| 16GB a) `CUDA_VISIBLE_DEVICES=0` | 129.83 | 26.45 | 122.91 | 26.79 | -6.92 | +0.34 |
| 16GB b) `CUDA_VISIBLE_DEVICES=0 -no-ooae` | n/a | n/a | 132.02 | 26.84 | n/a | n/a |
| 28GB a) `CUDA_VISIBLE_DEVICES=0,1 --tensor-split 0.85,0.15` | 127.66 | 22.95 | 127.48 | 23.97 | -0.18 | +1.02 |
| 28GB b) `CUDA_VISIBLE_DEVICES=0,1` | n/a | n/a | 104.61 | 21.17 | n/a | n/a |
### Command log (exact forms)
Build:
```bash
docker run --rm --gpus all \
-v /home/yurko/Code/ik_llama.cpp:/ik_llama.cpp \
iktest-dev:latest \
bash -lc 'cmake --build /ik_llama.cpp/build-cuda13-fresh --config Release -j 56 --target llama-perplexity llama-bench'
```
Parity (`ik`):
```bash
docker run --rm --gpus all \
-v /home/yurko/Code/ik_llama.cpp:/ik_llama.cpp \
-v /home/yurko/.cache/llama.cpp:/models \
-v /tmp:/tmp \
iktest-dev:latest \
bash -lc 'export LD_LIBRARY_PATH=/ik_llama.cpp/build-cuda13-fresh/src:/ik_llama.cpp/build-cuda13-fresh/ggml/src:$LD_LIBRARY_PATH; \
/ik_llama.cpp/build-cuda13-fresh/bin/llama-perplexity -m /models/qwen3-next-coder.gguf -f /tmp/qnext_ppl.txt -c 256 -b 64 -ub 64 --no-warmup --chunks {1|2} -ngl {0|1} -ctk f16 -ctv f16'
```
Parity (mainline):
```bash
docker run --rm --gpus all \
-v /home/yurko/Code/llama.cpp:/llama.cpp \
-v /home/yurko/.cache/llama.cpp:/models \
-v /tmp:/tmp \
iktest-dev:latest \
bash -lc 'export LD_LIBRARY_PATH=/llama.cpp/build/src:/llama.cpp/build/ggml/src:$LD_LIBRARY_PATH; \
/llama.cpp/build/bin/llama-perplexity -m /models/qwen3-next-coder.gguf -f /tmp/qnext_ppl.txt -c 256 -b 64 -ub 64 --no-warmup --chunks {1|2} -ngl {0|1} -ctk f16 -ctv f16'
```
Quick matrix:
```bash
# 16GB a
CUDA_VISIBLE_DEVICES=0 /ik_llama.cpp/build-cuda13-fresh/bin/llama-sweep-bench \
-m /models/qwen3-next-coder.gguf -c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe
# 16GB b
CUDA_VISIBLE_DEVICES=0 /ik_llama.cpp/build-cuda13-fresh/bin/llama-sweep-bench \
-m /models/qwen3-next-coder.gguf -c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe -no-ooae
# 28GB a
CUDA_VISIBLE_DEVICES=0,1 /ik_llama.cpp/build-cuda13-fresh/bin/llama-sweep-bench \
-m /models/qwen3-next-coder.gguf -c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe --tensor-split 0.85,0.15
# 28GB b
CUDA_VISIBLE_DEVICES=0,1 /ik_llama.cpp/build-cuda13-fresh/bin/llama-sweep-bench \
-m /models/qwen3-next-coder.gguf -c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe
```
### Status after this update
- Precision parity: PASS on all required checks.
- Performance:
- 16GB profile improved TG but not PP vs baseline.
- 28GB split profile improved TG and preserved PP.
- Remaining likely bottlenecks for 16GB PP:
- MoE routing still limited by per-expert launches/host-side per-expert loop in `mul_mat_id`.
- Scheduler split / backend-crossing overhead remains visible at this config.
## 2026-02-06 Follow-up Hotspot Pass (this session)
### Additional code changes
1. `ggml/src/ggml-cuda.cu`
- Removed an unused `ids` device->host copy + stream sync in `ggml_cuda_moe_up_gate_unary` fallback path.
- Reduced row-mapping host transfer volume by deriving `moe_counts` from host-side prefix bounds (`cum_moe_counts`) instead of copying both arrays from device.
- Added `build_active_experts(...)` and switched per-expert loops to iterate only active experts.
2. `ggml/src/ggml-cuda/ssm-conv.cu`
- Removed host-side `cudaMemcpyAsync(...D2H...) + cudaStreamSynchronize` for multi-seq fast-path eligibility.
- Made fast/fallback dispatch fully async by gating both kernels with a device-side `fast_path_ok` flag (see the sketch after this list).
3. `ggml/src/ggml-backend.cpp`
- Reduced unnecessary split churn when a weight tensor is on another backend but the current backend can consume that buffer type directly.
- Increased `GGML_SCHED_MAX_SPLITS` from `2048` to `4096` for large-graph headroom.
4. `src/llama.cpp`
- Added a Qwen3Next-specific default split guard for heterogeneous dual-GPU layer mode: clamp to at least `75/25` on 2-GPU auto-split when GPU0 has more free memory.
5. `scripts/qwen3next-eval.sh`
- Fixed CLI compatibility (`mainline: llama-completion`, `ik: llama-cli` completion path).
- Made evaluation resilient to missing binaries (`gpu_sweep_mainline` is skipped if unavailable).
- Fixed complexity-token regex.
- Switched PPL corpus generation to a stable deterministic pattern to reduce chunk-level variance.
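The device-flag gating described in item 2 can be illustrated with a toy op; the scale kernel and the divisibility eligibility rule below are stand-ins, and only the `fast_path_ok`-style gating of both enqueued kernels reflects the actual change:
```cuda
#include <cuda_runtime.h>
// eligibility check writes a device flag instead of reporting to the host
__global__ void k_check(int n, int * fast_path_ok) {
    if (threadIdx.x == 0) *fast_path_ok = (n % 4 == 0);
}
__global__ void k_scale_fast(const float4 * x, float4 * y, int n4,
                             const int * ok) {
    if (!*ok) return; // whole kernel no-ops when ineligible
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) { float4 v = x[i]; y[i] = make_float4(2*v.x, 2*v.y, 2*v.z, 2*v.w); }
}
__global__ void k_scale_fallback(const float * x, float * y, int n,
                                 const int * ok) {
    if (*ok) return; // runs only when the fast path declined
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = 2.0f * x[i];
}
int main() {
    const int n = 1024;
    float *x, *y; int *ok;
    cudaMalloc(&x, n*sizeof(float)); cudaMalloc(&y, n*sizeof(float));
    cudaMalloc(&ok, sizeof(int));
    cudaMemset(x, 0, n*sizeof(float));
    // all three launches are enqueued back-to-back on one stream:
    // no D2H copy, no cudaStreamSynchronize between check and dispatch
    k_check<<<1, 1>>>(n, ok);
    k_scale_fast<<<(n/4 + 255)/256, 256>>>((const float4 *)x, (float4 *)y, n/4, ok);
    k_scale_fallback<<<(n + 255)/256, 256>>>(x, y, n, ok);
    cudaDeviceSynchronize();
    return 0;
}
```
Because the flag lives in device memory and both kernels are enqueued unconditionally, the host never has to synchronize to pick a branch; the ineligible launch retires as a near-no-op.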
### Validation rerun
Run artifact: `/tmp/qwen3next-eval/20260206_064339`
- CPU PPL parity:
- chunks=1: mainline `1.0009`, ik `1.0009`, delta `0.000000`
- chunks=2: mainline `1.0005`, ik `1.0005`, delta `0.000000`
- CUDA sanity parity:
- `gpu_ppl_chunks1_mainline`: `OK`
- `gpu_ppl_chunks1_ik`: `OK`
- Generation smoke:
- both mainline and ik outputs contain Fibonacci token(s)
- mainline output contains complexity token(s); the ik output did not in this sample
- Notes:
- `gpu_sweep_mainline` skipped in this environment because `/home/yurko/Code/llama.cpp/build/bin/llama-sweep-bench` is not present.
- `gpu_sweep_ik` (`c=2048`, `n=32`) in this run peaked at approximately `maxPP=137.02`, `maxTG=24.81`.
### Quick matrix (exact required configs)
Run artifact: `/tmp/qwen3next-matrix/20260206_063957`
| Profile | Baseline maxPP | Baseline maxTG | New maxPP | New maxTG | Delta maxPP | Delta maxTG |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| 16GB a) `CUDA_VISIBLE_DEVICES=0 --cpu-moe` | 129.83 | 26.45 | 115.56 | 25.74 | -14.27 | -0.71 |
| 16GB b) `CUDA_VISIBLE_DEVICES=0 --cpu-moe -no-ooae` | n/a | n/a | 136.21 | 26.00 | n/a | n/a |
| 28GB a) `CUDA_VISIBLE_DEVICES=0,1 --cpu-moe --tensor-split 0.85,0.15` | 127.66 | 22.95 | 129.70 | 22.72 | +2.04 | -0.23 |
| 28GB b) `CUDA_VISIBLE_DEVICES=0,1 --cpu-moe` | n/a | n/a | 117.54 | 22.99 | n/a | n/a |
### Variance note for single-GPU default (`--cpu-moe`)
Repeated measurements show substantial run-to-run variance in this environment:
Run artifact: `/tmp/qwen3next-repeat-20260206_064133`
- `single_cpu_moe` maxPP/maxTG:
- run1: `113.84 / 25.86`
- run2: `135.29 / 26.88`
- run3: `113.95 / 23.54`
- `single_cpu_moe_no_ooae` maxPP/maxTG:
- run1: `135.33 / 26.49`
- run2: `133.64 / 24.92`
- run3: `126.33 / 23.42`
Interpretation: in this setup, `-no-ooae` is currently more stable and generally faster for PP; default OOAE shows large variance and occasional severe PP drops.
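To put a number on that variance, a quick mean/standard-deviation computation over the three `single_cpu_moe` maxPP samples listed above:
```cpp
#include <cmath>
#include <cstdio>
int main() {
    const double pp[3] = {113.84, 135.29, 113.95}; // runs 1-3 from above
    double mean = 0, var = 0;
    for (double v : pp) mean += v;
    mean /= 3;
    for (double v : pp) var += (v - mean) * (v - mean);
    var /= 2; // n-1 for a sample estimate
    printf("mean=%.2f stddev=%.2f (~%.0f%% of mean)\n",
           mean, std::sqrt(var), 100.0 * std::sqrt(var) / mean);
    // prints roughly mean=121.03 stddev=12.35 (~10% of mean)
}
```
A coefficient of variation near 10% on maxPP is large for a fixed config, which supports preferring the more stable `-no-ooae` profile here.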
## 2026-02-06 Dual-Build Split + Context Sweep + PP Profiling
### Code updates in this pass
1. `src/llama.cpp`
- Added a Qwen3Next-specific guard that disables `only_active_experts` for large-batch hybrid MoE prompt paths:
- condition: `arch == QWEN3NEXT`, tensor overrides enabled, `n_batch >= 512`
- Rationale: avoid extra scheduling/sync/copy overhead in this PP-heavy path (a sketch of the guard shape follows after this list).
2. `scripts/qwen3next-eval.sh`
- Added build-dir selection so CPU and CUDA trees can be reused without rebuild toggling:
- `--main-build-dir`
- `--ik-build-dir`
- Fixed runtime loader paths to include both `bin` and `src` shared-library locations.
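For reference, a hedged sketch of the guard's shape; the condition mirrors item 1 above, but every identifier here is hypothetical rather than the actual `src/llama.cpp` code:
```cpp
// Illustrative guard only: names are assumptions; the condition comes from
// the report (arch == QWEN3NEXT, tensor overrides enabled, n_batch >= 512).
#include <cstdio>
enum llm_arch { LLM_ARCH_QWEN3NEXT, LLM_ARCH_OTHER };
static bool use_only_active_experts(llm_arch arch, bool has_tensor_overrides,
                                    int n_batch, bool user_requested_ooae) {
    const bool pp_heavy_hybrid_moe =
        arch == LLM_ARCH_QWEN3NEXT && has_tensor_overrides && n_batch >= 512;
    return user_requested_ooae && !pp_heavy_hybrid_moe;
}
int main() {
    printf("%d\n", use_only_active_experts(LLM_ARCH_QWEN3NEXT, true, 1024, true)); // 0
    printf("%d\n", use_only_active_experts(LLM_ARCH_QWEN3NEXT, true,  128, true)); // 1
}
```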
### Separate build setup (requested)
Built and validated two persistent trees in `ik_llama.cpp`:
- `build-cpu`: `GGML_CUDA=OFF`, `GGML_BLAS=ON`
- `build-cuda`: `GGML_CUDA=ON`, `GGML_BLAS=OFF`
Command used:
```bash
docker run --rm --gpus all \
-v /home/yurko/Code/ik_llama.cpp:/ik \
-w /ik \
iktest-dev:latest \
bash -lc '
cmake -S /ik -B /ik/build-cpu -DGGML_CUDA=OFF -DGGML_BLAS=ON -DCMAKE_BUILD_TYPE=Release
cmake --build /ik/build-cpu --config Release -j 56 --target llama-cli llama-sweep-bench llama-perplexity
cmake -S /ik -B /ik/build-cuda -DGGML_CUDA=ON -DGGML_BLAS=OFF -DCMAKE_BUILD_TYPE=Release
cmake --build /ik/build-cuda --config Release -j 56 --target llama-cli llama-sweep-bench llama-perplexity
'
```
### Parity rerun after this pass
Run artifact: `/tmp/qwen3next-eval/20260206_191050`
- CPU PPL parity:
- chunks=1: mainline `1.0009`, ik `1.0009`, delta `0.000000`
- chunks=2: mainline `1.0005`, ik `1.0005`, delta `0.000000`
- CUDA sanity parity:
- `gpu_ppl_chunks1_mainline`: `OK`
- `gpu_ppl_chunks1_ik`: `OK`
### Requested runs: CPU `c=512`, CUDA up to `c=8192`
Run artifact: `/tmp/qwen3next-dual-build-20260206_191427`
Config:
- CPU: `build-cpu`, `-c 512 -b 1024 -ub 128 -n 16 -ngl 0`
- CUDA: `build-cuda`, `-c {512,1024,2048,4096,8192} -b 1024 -ub 128 -n 16 -ngl 999 --cpu-moe`
| Case | maxPP (t/s) | maxTG (t/s) | graph splits |
| --- | ---: | ---: | ---: |
| `cpu_c512` | 98.31 | 6.58 | 1 |
| `cuda_c512` | 137.09 | 25.69 | 530 |
| `cuda_c1024` | 135.74 | 27.68 | 530 |
| `cuda_c2048` | 134.87 | 26.71 | 530 |
| `cuda_c4096` | 136.62 | 27.37 | 530 |
| `cuda_c8192` | 137.50 | 27.53 | 530 |
Observation: PP remains roughly flat (`~135-137 t/s`) from `c=512` through `c=8192`, so this is not primarily a long-context KV-scaling bottleneck.
### Prompt-processing bottleneck profiling
Run artifact: `/tmp/qwen3next-profile-20260206_192018`
| Case | maxPP (t/s) | maxTG (t/s) | splits | threads | offloaded layers |
| --- | ---: | ---: | ---: | --- | --- |
| `single_default` | 125.77 | 24.01 | 530 | `t=8,tb=8` | `49/49` |
| `single_t16_tb16` | 37.00 | 0.85 | 530 | `t=16,tb=16` | `49/49` |
| `dual_default` | 128.94 | 22.75 | 531 | `t=8,tb=8` | `49/49` |
| `dual_t16_tb16` | 37.68 | 0.82 | 531 | `t=16,tb=16` | `49/49` |
Key findings:
1. Increasing CPU threads to 16 for this CPU-MoE path is strongly harmful on this machine.
2. Dual-GPU (`0,1`) does not materially improve PP over single-GPU for this config.
3. Main logs still show all expert tensors overridden to CPU and a large CPU expert buffer (`~45.8 GiB`), so PP is dominated by CPU-MoE path behavior rather than GPU-context growth.
4. Graph splits remain high (`~530`) and stable across contexts, indicating persistent scheduler/backend overhead.
### Additional variance check (`default` vs `-no-ooae`)
Run artifact: `/tmp/qwen3next-ooae-repeat-20260206_192523`
- `default` (with auto Qwen3Next guard): `112.64/23.88`, `135.73/26.40`, `135.30/27.19` (PP/TG)
- `-no-ooae`: `131.87/25.97`, `113.80/23.77`, `114.25/23.79`
Interpretation: run-to-run variance is still significant in this environment; however, the new auto-guard removes the worst observed OOAE collapse mode in the default path while preserving parity.
### Why this is still below ~400 PP
Given this exact setup, the dominant limiter is CPU-MoE expert execution (large expert tensors on CPU + routing/scheduler overhead), not context length. With `--cpu-moe`, this hardware/config currently lands around `~125-137` PP in stable runs. Reaching `~400` PP on this model likely requires reducing or eliminating CPU-MoE dependence (more VRAM / different placement strategy) rather than only kernel micro-tuning.
# Qwen3Next Review and Benchmark Summary (`ik_llama.cpp` vs `llama.cpp`)
Date: 2026-02-08
## Scope
This document captures:
- Current upstream PR alignment for Qwen3Next-related work.
- What is already strong in `ik_llama.cpp` and what still needs adjustment.
- Recommended runtime settings for this machine (single GPU target, long context).
- Final apples-to-apples benchmark matrix for `ik_llama.cpp` vs `../llama.cpp`.
## Upstream PR Check (as of 2026-02-08)
Reviewed PRs:
- https://github.com/ggml-org/llama.cpp/pull/18102 (`open`): Delta-Net CUDA op + integration.
- https://github.com/ggml-org/llama.cpp/pull/18792 (`open`): unified DeltaNet handling (`src/models/delta.cpp`).
- https://github.com/ggml-org/llama.cpp/pull/19375 (`open`, `draft`): Qwen3Next graph optimization in the model builder.
### Current alignment in `ik_llama.cpp`
Already present and/or functionally covered:
- CUDA DeltaNet op path exists in GGML (`ggml/src/ggml-cuda/delta-net.cu`).
- Solve-tri and backend op support are present for the fused path.
- Qwen3Next fused DeltaNet builder path exists (and is now runtime-toggleable via env).
- Existing ik optimizations remain available (`-rtr`, grouped/fused paths, no-offload-only-active-experts switches).
Not directly mirrored yet (by design divergence from the mainline model layout):
- Mainline `src/models/delta.cpp` structure from PR #18792.
- Mainline `src/models/qwen3next.cpp` graph form from PR #19375.
## Required Adjustments (remaining)
1. Keep fused DeltaNet as the default, but preserve the safe fallback path (`LLAMA_QWEN3NEXT_FUSED_DELTA=0`) for debugging/regression checks (see the sketch after this list).
2. Port selective graph-shape optimizations from PR #19375 into `src/llama-build-context.cpp` where they map cleanly (avoid a blind copy due to architectural divergence).
3. Add one dedicated Qwen3Next perf regression target in CI/dev docs (single-GPU 8k proxy + 65k fit sanity).
4. Investigate the ik CPU Flash-Attn assertion path for Qwen3Next (`iqk_fa_templates.h`, `S > 0`) before enabling `-fa 1` for CPU benchmark profiles.
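For item 1, a minimal sketch of such a runtime toggle; the environment variable name is from this document, while the helper and its default-on reading are assumptions:
```cpp
// Hypothetical helper: reads LLAMA_QWEN3NEXT_FUSED_DELTA, defaulting to the
// fused path; "0" selects the fallback for A/B debugging and regression runs.
#include <cstdlib>
#include <cstring>
#include <cstdio>
static bool qwen3next_fused_delta_enabled() {
    const char * v = std::getenv("LLAMA_QWEN3NEXT_FUSED_DELTA");
    return !(v && std::strcmp(v, "0") == 0);
}
int main() {
    printf("fused delta: %s\n", qwen3next_fused_delta_enabled() ? "on" : "off");
}
```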
## Strong Points of `ik_llama.cpp` to Preserve
- More runtime controls than mainline for this workload (`-rtr`, backend toggles, MoE/OOAE controls).
- Strong CUDA path for this model family once offload routing is tuned (`--n-cpu-moe` thresholding).
- Better TG throughput than current mainline in matched CUDA and CPU tests on this host.
## Best Runtime Configuration (this host)
Model: `/models/qwen3-next-coder.gguf`
Single-GPU long-context finding:
- `-c 65536` on GPU0 (16 GB) requires at least `--n-cpu-moe 47` to fit reliably.
8k sweep proxy (single GPU, tuned path):
- `b=2048,ub=512` -> `avg_tg ~27.9 tok/s`
- `b=3072,ub=768` -> `avg_tg ~28.4 tok/s` (best TG)
- `b=4096,ub=1024` -> `avg_tg ~26.9 tok/s`
Recommended serving baseline:
- `CUDA_VISIBLE_DEVICES=0`
- `-c 65536 -b 3072 -ub 768 -t 8 -fa on -ngl 999 --n-cpu-moe 47 -rtr`
## Final Benchmark Matrix (8k context proxy)
All four builds were benchmarked with matched parameters and explicit `-mmp 0` for fairness.
Common args:
- `-m /models/qwen3-next-coder.gguf -p 8192 -n 128 -b 3072 -ub 768 -t 8 -r 1`
- CUDA runs: `CUDA_VISIBLE_DEVICES=0 -fa 1 -ngl 999 --n-cpu-moe 47 -mmp 0`
- CPU runs: `-fa 0 -ngl 0 --n-cpu-moe 0 -mmp 0`
| Build | PP (tok/s) | TG (tok/s) |
|---|---:|---:|
| `ik` CUDA | 204.614 | 28.979 |
| mainline CUDA | 184.521 | 22.012 |
| `ik` CPU | 49.795 | 12.681 |
| mainline CPU | 51.674 | 7.299 |
Relative (`ik` vs mainline):
- CUDA PP: `+10.9%`
- CUDA TG: `+31.7%`
- CPU PP: `-3.6%`
- CPU TG: `+73.7%`
## Notes
- The `ik` CPU benchmark with `-fa 1` currently aborts for this model in `iqk_fa_templates.h` (`GGML_ASSERT(S > 0)`), so the CPU matrix uses `-fa 0` for both repos.
- The `ik` benchmark JSON output currently includes some non-JSON log lines in stdout around context creation; parsing should tolerate that (see the sketch below).
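For the second note, one naive but workable approach is to extract the first balanced top-level object from mixed stdout; the helper below is an illustrative assumption (it does not handle braces inside JSON strings), not the project's parser:
```cpp
#include <cstdio>
#include <optional>
#include <string>
// scan for the first '{' and return the substring up to its matching '}'
static std::optional<std::string> extract_json_object(const std::string & out) {
    size_t start = out.find('{');
    if (start == std::string::npos) return std::nullopt;
    int depth = 0;
    for (size_t i = start; i < out.size(); ++i) {
        if (out[i] == '{') depth++;
        else if (out[i] == '}' && --depth == 0)
            return out.substr(start, i - start + 1);
    }
    return std::nullopt; // truncated output
}
int main() {
    std::string mixed = "llama_new_context: ...\n{\"pp\": 204.6, \"tg\": 28.9}\n";
    if (auto j = extract_json_object(mixed)) std::printf("%s\n", j->c_str());
}
```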