# Qwen3Next Performance-Differences Report (`ik_llama.cpp` vs `llama.cpp`)

## Scope

This report documents:

- Measured behavior observed during bring-up and benchmarking.
- Code-level differences likely affecting performance.
- Fixes already applied in `ik_llama.cpp`.
- Remaining bottlenecks and concrete next steps.

All numbers below were collected on this machine in Docker with the model:

- `/models/qwen3-next-coder.gguf`

Date of measurements: 2026-02-06.

## Environment Notes

- GPU setup: RTX 5060 Ti + RTX 3060.
- Early slow runs were partially confounded by low free memory on GPU1 in one session (`~201 MiB` free at init).
- Later checks confirmed the GPUs can be mostly free (`~15.8 GiB` and `~11.9 GiB` free) before starting runs.

## What Was Validated

### Numerical sanity/parity check (perplexity)

Using identical prompt text, `c=256`, `b=64`, `ub=64`, CPU model weights (`-ngl 0`), no warmup:

- `ik` (`llama-perplexity`), `chunks=1`:
  - `[1]1.0009`
  - `Final estimate: PPL over 1 chunks for n_ctx=256 = 1.0009 +/- 0.00045`
- mainline (`llama-perplexity`), `chunks=1`:
  - `[1]1.0008`
  - `Final estimate: PPL = 1.0008 +/- 0.00036`

And for `chunks=2`:

- `ik`: `[1]1.0009,[2]1.0009`, `Final estimate ... = 1.0009 +/- 0.00026`
- mainline: `[1]1.0008,[2]1.0008`, `Final estimate ... = 1.0008 +/- 0.00020`

Interpretation: the current `ik` Qwen3Next path is numerically very close to mainline for this test.

## Measured Performance Signals

### `ik` sweep at long context

`llama-sweep-bench` with `c=65536`, `b=1024`, `ub=128` started successfully but produced low TG values in the observed rows (roughly `~2.2` to `~4.1` t/s), with PP mostly in the `~27` to `~60` t/s range depending on `n_kv` occupancy. This run was intentionally stopped by the user before completion.

### Scheduler limits hit at larger batch

`ik` with `c=65536`, `b=4096`, `ub=1024` failed with:

- `GGML_ASSERT(i_split < GGML_SCHED_MAX_SPLITS)` in `ggml-backend.cpp`.

This indicates high graph-split pressure for this configuration.
## Code-Level Differences Relevant to Performance

### 1) Recurrent-state storage model differs from mainline

Mainline Qwen3Next uses recurrent memory abstractions (`llama_memory_recurrent`) with `R` and `S` state buffers in F32:

- `llama.cpp/src/llama-model.cpp:7505`
- `llama.cpp/src/models/qwen3next.cpp:686`
- `llama.cpp/src/models/qwen3next.cpp:687`

The `ik` path originally used KV cache-tail handling; this was adjusted to dedicated per-layer state tensors (`s_l`) in F32:

- `ik_llama.cpp/src/llama-context.h:59`
- `ik_llama.cpp/src/llama.cpp:771`
- `ik_llama.cpp/src/llama.cpp:817`
- `ik_llama.cpp/src/llama-build-context.cpp:4617`

Impact: avoids repeated casts in/out of the recurrent state for Qwen3Next and aligns more closely with mainline's state-precision behavior.

### 2) `ggml_sub` broadcast semantics differ

Mainline allows repeat/broadcast in `ggml_sub`:

- `llama.cpp/ggml/src/ggml.c:2129`

`ik` currently enforces same-shape inputs:

- `ik_llama.cpp/ggml/src/ggml.c:6406`

Consequence: in Qwen3Next chunking, `ik` must materialize explicit repeats for tensors used in `sub`, increasing graph materialization overhead.

### 3) Qwen3Next chunking path has extra explicit repeats in `ik`

The current `ik` chunking path repeats `g_cumsum` and `g_last` before subtraction:

- `ik_llama.cpp/src/llama-build-context.cpp:4234`
- `ik_llama.cpp/src/llama-build-context.cpp:4287`

The mainline path uses broadcasted subtraction without those explicit materializations:

- `llama.cpp/src/models/qwen3next.cpp:200`
- `llama.cpp/src/models/qwen3next.cpp:264`

Consequence: additional memory traffic and extra nodes in a high-frequency path.

### 4) Graph split count is higher in `ik` for the tested Qwen3Next context

Observed logs for `c=256` showed:

- `ik`: graph splits `1227`
- mainline: graph splits `975`

A higher split count usually implies more sync/copy overhead and can reduce both PP and TG.
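The difference can be illustrated with a small CPU sketch of broadcast-capable subtraction. This is illustrative code, not ggml source: the smaller operand's indices are wrapped instead of materializing a repeated copy, which is exactly the extra materialization that a broadcast-capable `ggml_sub` avoids.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative sketch (not ggml code): elementwise a - b where b may be
// broadcast along either of two dimensions, in the spirit of ggml's
// repeat/broadcast rules. Row-major storage: index = i1 * ne0 + i0.
static std::vector<float> broadcast_sub(
        const std::vector<float>& a, int64_t a_ne0, int64_t a_ne1,
        const std::vector<float>& b, int64_t b_ne0, int64_t b_ne1) {
    // Broadcasting is legal only if each dim of b divides the matching dim of a.
    assert(a_ne0 % b_ne0 == 0 && a_ne1 % b_ne1 == 0);
    std::vector<float> out(a.size());
    for (int64_t i1 = 0; i1 < a_ne1; ++i1) {
        for (int64_t i0 = 0; i0 < a_ne0; ++i0) {
            // Wrap b's indices instead of materializing an explicit repeat.
            const float bv = b[(i1 % b_ne1) * b_ne0 + (i0 % b_ne0)];
            out[i1 * a_ne0 + i0] = a[i1 * a_ne0 + i0] - bv;
        }
    }
    return out;
}
```

With same-shape-only `sub`, the `b` operand would first have to be expanded to `a`'s shape with an explicit `repeat` node, adding both memory traffic and graph nodes on every evaluation.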
## Fixes Already Applied in `ik`

These are included in commit:

- `a7df116` (`qwen3next: add architecture support and recurrent-state fixes`)

Applied items:

- Added Qwen3Next architecture and kernels in `ik`.
- Added dedicated F32 recurrent-state storage (`s_l`) for Qwen3Next recurrent layers.
- Updated the Qwen3Next build path to read/write from dedicated state storage when available.
- Ensured numerical sanity vs mainline with the perplexity checks above.
- Kept conservative explicit-repeat logic in chunking where `ik` `ggml_sub` currently requires same-shape inputs (testing showed that a global broadcast change caused instability in this fork).

## Why Current `ik` Can Still Be Slower

The most probable remaining reasons:

- Extra repeat materializations in the chunking path.
- Higher graph-split count in the scheduler/backend path.
- A less optimized Qwen3Next integration path compared to mainline's recurrent-memory abstractions.
- Run-configuration sensitivity at long context and very large batch (the `SCHED_MAX_SPLITS` boundary).

## Priority Next Fixes

1. Reduce split pressure and keep benchmark configs inside the stable split envelope at 64k context.
2. Eliminate or fuse the high-cost repeat materializations in the Qwen3Next chunking path without changing the math.
3. Align more of the Qwen3Next recurrent memory/update flow with mainline's memory-recurrent pattern where possible.
4. Validate after each change:
   - PPL/outputs against mainline.
   - PP/TG against the same benchmark parameters.

## Current Status

- Qwen3Next is integrated and functionally running in `ik`.
- Precision is close to mainline on the tested perplexity cases.
- A performance gap remains and requires the targeted optimization work listed above.

## 2026-02-06 Optimization Update

### Newly applied performance changes

1. Enabled broadcast-capable `ggml_sub` and aligned it with the existing `ggml_mul` broadcast behavior.
2. Reworked CPU `ggml_compute_forward_sub_f32` to use threaded row-splitting and contiguous broadcast loops.
3. Enabled `GGML_OP_SUB` multi-task scheduling in `ggml_get_n_tasks`.
4. Removed two avoidable repeat materializations in the Qwen3Next chunking path:
   - `gcs_i = repeat(g_cumsum, ...)` -> `gcs_i = g_cumsum`
   - `g_last_repeat` in the `g_diff` path removed, using a direct broadcasted subtract.
5. Added a CUDA fast path in `ggml_cuda_op_ssm_conv` for single-sequence recurrent updates (`n_kv == 1`), with token-block parallelization and explicit final-state reconstruction.

### Post-change validation

#### CPU parity vs mainline (`-ngl 0`)

`c=256`, `b=64`, `ub=64`, `--no-warmup`:

- `chunks=1`
  - `ik`: `[1]1.0007`, final `1.0007 +/- 0.00042`
  - mainline: `[1]1.0007`, final `1.0007 +/- 0.00049`
- `chunks=2`
  - `ik`: `[1]1.0007,[2]1.0007`, final `1.0007 +/- 0.00023`
  - mainline: `[1]1.0007,[2]1.0008`, final `1.0008 +/- 0.00028`

#### CUDA sanity parity vs mainline (`CUDA_VISIBLE_DEVICES=1`, `-ngl 1`)

`c=256`, `b=64`, `ub=64`, `--no-warmup`, `chunks=1`:

- `ik`: `[1]1.0011`, final `1.0011 +/- 0.00071`
- mainline: `[1]1.0011`, final `1.0011 +/- 0.00074`

Interpretation: precision parity remains intact after the CPU and CUDA optimizations.

### Updated long-context speed signal (`ik`, no KV quantization)

Config: `llama-sweep-bench -c 65536 -b 1024 -ub 128 -ctk f16 -ctv f16`

Observed rows after the changes show:

- PP generally in the `~82` to `~91` t/s range once `n_kv` grows (`~768` to `~3328` in the sampled rows).
- TG generally in the `~6.2` to `~6.6` t/s range in the same sampled region.

This is a substantial improvement over the TG observed in the prior slow run (`~2` to `~4` t/s).

### Remaining performance risks

- Some runs still offload few or no layers depending on available VRAM at run time, which can mask CUDA-path gains.
- `SCHED_MAX_SPLITS` limits at very aggressive `(b, ub)` settings remain a separate scaling constraint.
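The threaded row-splitting used for the reworked CPU `ggml_compute_forward_sub_f32` (change 2 above) can be sketched as follows. This is an illustrative `std::thread` version; ggml's actual kernels are driven by its own thread pool through `ith`/`nth` parameters rather than spawning threads per op.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Illustrative sketch (not the actual ggml kernel): each worker `ith` of `nth`
// handles a contiguous block of rows, and the broadcast operand's row index is
// wrapped, so the inner loop stays contiguous and no repeat is materialized.
static void sub_rows_threaded(const std::vector<float>& a,
                              const std::vector<float>& b,
                              std::vector<float>& dst,
                              int nr, int nc, int b_nr, int nth) {
    auto worker = [&](int ith) {
        const int dr  = (nr + nth - 1) / nth;   // rows per thread
        const int ir0 = dr * ith;
        const int ir1 = std::min(ir0 + dr, nr);
        for (int ir = ir0; ir < ir1; ++ir) {
            const float * ap = a.data()   + (size_t) ir * nc;
            const float * bp = b.data()   + (size_t)(ir % b_nr) * nc; // row broadcast
            float       * dp = dst.data() + (size_t) ir * nc;
            for (int ic = 0; ic < nc; ++ic) {
                dp[ic] = ap[ic] - bp[ic];
            }
        }
    };
    std::vector<std::thread> pool;
    for (int ith = 0; ith < nth; ++ith) pool.emplace_back(worker, ith);
    for (auto & t : pool) t.join();
}
```

Because workers write disjoint contiguous row ranges, no synchronization is needed inside the loop, which is what makes `GGML_OP_SUB` safe to schedule across multiple tasks.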
- Additional backend-level profiling is still needed to determine whether the remaining gap to top-end mainline numbers is dominated by offload limits, scheduler split overhead, or other kernels.

## 2026-02-06 CUDA MoE/SSM Optimization Update

### Applied changes in this update

1. MoE row mapping in CUDA `mul_mat_id` paths (`ggml/src/ggml-cuda.cu`):
   - Replaced the per-call `ids` device->host copy, host-side count/build, and mapping host->device copy.
   - Added device-side count + exclusive prefix sum + scatter kernels:
     - `k_moe_row_count`
     - `k_moe_row_exclusive_scan`
     - `k_moe_row_scatter`
   - Kept existing call-site logic intact by copying back only compact metadata (`moe_counts`, `cum_moe_counts`, invalid-id flag).
   - Net effect: removes large host round-trip traffic from a hot MoE routing path.
2. Qwen3Next SSM conv path for `n_kv > 1` (`ggml/src/ggml-cuda/ssm-conv.cu`):
   - Added a guarded fast path for decode-like multi-sequence batches where each token maps to one unique sequence (no multi-sequence fan-out per token).
   - Added:
     - `ssm_conv_validate_unique_seq_map`
     - `ssm_conv_multi_seq_unique_f32_kernel`
     - `ssm_conv_multi_seq_unique_f32_kernel_nc4`
   - If the input pattern does not satisfy the fast-path constraints, execution falls back to the existing kernel path unchanged.
3. Top-k MoE fusion verification:
   - No matcher change was required in this update.
   - The Qwen3Next MoE build path still emits the expected `SOFT_MAX -> ... -> ARGSORT -> VIEW -> GET_ROWS` form used by the current CUDA fusion checks.
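The count -> exclusive-prefix-sum -> scatter pattern behind the new row-mapping kernels can be shown with a CPU analogue. This is an illustrative sketch, not the CUDA code; `MoeRowMap`, `build_moe_row_map`, and the field names are invented for the example, and on-device the three stages run as separate kernels so only the compact metadata needs copying back to the host.

```cpp
#include <cassert>
#include <vector>

// CPU analogue of building the per-expert row mapping from routing ids:
// 1) count rows per expert, 2) take an exclusive prefix sum to get each
// expert's segment start, 3) scatter row indices into their segments.
struct MoeRowMap {
    std::vector<int> counts;      // rows routed to each expert
    std::vector<int> cum_counts;  // exclusive prefix sum over counts
    std::vector<int> row_map;     // row indices grouped by expert
};

static MoeRowMap build_moe_row_map(const std::vector<int>& ids, int n_experts) {
    MoeRowMap m;
    m.counts.assign(n_experts, 0);
    for (int id : ids) m.counts[id]++;                    // 1) count
    m.cum_counts.assign(n_experts + 1, 0);
    for (int e = 0; e < n_experts; ++e)                   // 2) exclusive scan
        m.cum_counts[e + 1] = m.cum_counts[e] + m.counts[e];
    m.row_map.resize(ids.size());
    std::vector<int> cursor(m.cum_counts.begin(), m.cum_counts.end() - 1);
    for (int row = 0; row < (int) ids.size(); ++row)      // 3) scatter
        m.row_map[cursor[ids[row]]++] = row;
    return m;
}
```

Doing all three stages on-device means the full `ids` tensor and the full row mapping never cross the PCIe bus, which is the host round-trip traffic the change removes.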
### Parity validation (required checks)

Tests were run in Docker (`iktest-dev:latest`) with:

- model: `/models/qwen3-next-coder.gguf`
- text corpus: `/tmp/qnext_ppl.txt` (same file for `ik` and mainline)
- params: `-c 256 -b 64 -ub 64 --no-warmup`

CPU parity (`-ngl 0`, threshold `<= 5e-4`):

- `chunks=1`: `ik 1.0041` vs `mainline 1.0037` (`delta=4e-4`) -> PASS
- `chunks=2`: `ik 1.0025` vs `mainline 1.0023` (`delta=2e-4`) -> PASS

CUDA sanity parity (`-ngl 1`, threshold `<= 1e-3`):

- `chunks=1`: `ik 1.0041` vs `mainline 1.0037` (`delta=4e-4`) -> PASS
- `chunks=2`: `ik 1.0025` vs `mainline 1.0023` (`delta=2e-4`) -> PASS

### Quick performance matrix (`llama-sweep-bench`)

Config: `-c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe`

| Profile | Baseline maxPP | Baseline maxTG | New maxPP | New maxTG | Delta maxPP | Delta maxTG |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| 16GB a) `CUDA_VISIBLE_DEVICES=0` | 129.83 | 26.45 | 122.91 | 26.79 | -6.92 | +0.34 |
| 16GB b) `CUDA_VISIBLE_DEVICES=0 -no-ooae` | n/a | n/a | 132.02 | 26.84 | n/a | n/a |
| 28GB a) `CUDA_VISIBLE_DEVICES=0,1 --tensor-split 0.85,0.15` | 127.66 | 22.95 | 127.48 | 23.97 | -0.18 | +1.02 |
| 28GB b) `CUDA_VISIBLE_DEVICES=0,1` | n/a | n/a | 104.61 | 21.17 | n/a | n/a |

### Command log (exact forms)

Build:

```bash
docker run --rm --gpus all \
  -v /home/yurko/Code/ik_llama.cpp:/ik_llama.cpp \
  iktest-dev:latest \
  bash -lc 'cmake --build /ik_llama.cpp/build-cuda13-fresh --config Release -j 56 --target llama-perplexity llama-bench'
```

Parity (`ik`):

```bash
docker run --rm --gpus all \
  -v /home/yurko/Code/ik_llama.cpp:/ik_llama.cpp \
  -v /home/yurko/.cache/llama.cpp:/models \
  -v /tmp:/tmp \
  iktest-dev:latest \
  bash -lc 'export LD_LIBRARY_PATH=/ik_llama.cpp/build-cuda13-fresh/src:/ik_llama.cpp/build-cuda13-fresh/ggml/src:$LD_LIBRARY_PATH; \
    /ik_llama.cpp/build-cuda13-fresh/bin/llama-perplexity -m /models/qwen3-next-coder.gguf -f /tmp/qnext_ppl.txt \
    -c 256 -b 64 -ub 64 --no-warmup --chunks {1|2} -ngl {0|1} -ctk f16 -ctv f16'
```

Parity (mainline):

```bash
docker run --rm --gpus all \
  -v /home/yurko/Code/llama.cpp:/llama.cpp \
  -v /home/yurko/.cache/llama.cpp:/models \
  -v /tmp:/tmp \
  iktest-dev:latest \
  bash -lc 'export LD_LIBRARY_PATH=/llama.cpp/build/src:/llama.cpp/build/ggml/src:$LD_LIBRARY_PATH; \
    /llama.cpp/build/bin/llama-perplexity -m /models/qwen3-next-coder.gguf -f /tmp/qnext_ppl.txt \
    -c 256 -b 64 -ub 64 --no-warmup --chunks {1|2} -ngl {0|1} -ctk f16 -ctv f16'
```

Quick matrix:

```bash
# 16GB a
CUDA_VISIBLE_DEVICES=0 /ik_llama.cpp/build-cuda13-fresh/bin/llama-sweep-bench \
  -m /models/qwen3-next-coder.gguf -c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe

# 16GB b
CUDA_VISIBLE_DEVICES=0 /ik_llama.cpp/build-cuda13-fresh/bin/llama-sweep-bench \
  -m /models/qwen3-next-coder.gguf -c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe -no-ooae

# 28GB a
CUDA_VISIBLE_DEVICES=0,1 /ik_llama.cpp/build-cuda13-fresh/bin/llama-sweep-bench \
  -m /models/qwen3-next-coder.gguf -c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe --tensor-split 0.85,0.15

# 28GB b
CUDA_VISIBLE_DEVICES=0,1 /ik_llama.cpp/build-cuda13-fresh/bin/llama-sweep-bench \
  -m /models/qwen3-next-coder.gguf -c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe
```

### Status after this update

- Precision parity: PASS on all required checks.
- Performance:
  - The 16GB profile improved TG but not PP vs baseline.
  - The 28GB split profile improved TG and preserved PP.
- Remaining likely bottlenecks for 16GB PP:
  - MoE routing is still limited by per-expert launches and the host-side per-expert loop in `mul_mat_id`.
  - Scheduler-split / backend-crossing overhead remains visible at this config.

## 2026-02-06 Follow-up Hotspot Pass (this session)

### Additional code changes

1. `ggml/src/ggml-cuda.cu`
   - Removed an unused `ids` device->host copy + stream sync in the `ggml_cuda_moe_up_gate_unary` fallback path.
   - Reduced row-mapping host transfer volume by deriving `moe_counts` from host-side prefix bounds (`cum_moe_counts`) instead of copying both arrays from the device.
   - Added `build_active_experts(...)` and switched per-expert loops to iterate only over active experts.
2. `ggml/src/ggml-cuda/ssm-conv.cu`
   - Removed the host-side `cudaMemcpyAsync(...D2H...) + cudaStreamSynchronize` for multi-seq fast-path eligibility.
   - Made fast/fallback dispatch fully async by gating both kernels with a device-side `fast_path_ok` flag.
3. `ggml/src/ggml-backend.cpp`
   - Reduced unnecessary split churn when a weight tensor is on another backend but the current backend can consume that buffer type directly.
   - Increased `GGML_SCHED_MAX_SPLITS` from `2048` to `4096` for large-graph headroom.
4. `src/llama.cpp`
   - Added a Qwen3Next-specific default split guard for heterogeneous dual-GPU layer mode: clamp to at least `75/25` on 2-GPU auto-split when GPU0 has more free memory.
5. `scripts/qwen3next-eval.sh`
   - Fixed CLI compatibility (mainline: `llama-completion`; ik: `llama-cli` completion path).
   - Made evaluation resilient to missing binaries (`gpu_sweep_mainline` is skipped if unavailable).
   - Fixed the complexity-token regex.
   - Switched PPL corpus generation to a stable deterministic pattern to reduce chunk-level variance.

### Validation rerun

Run artifact: `/tmp/qwen3next-eval/20260206_064339`

- CPU PPL parity:
  - `chunks=1`: mainline `1.0009`, ik `1.0009`, delta `0.000000`
  - `chunks=2`: mainline `1.0005`, ik `1.0005`, delta `0.000000`
- CUDA sanity parity:
  - `gpu_ppl_chunks1_mainline`: `OK`
  - `gpu_ppl_chunks1_ik`: `OK`
- Generation smoke:
  - Both mainline and ik outputs contain Fibonacci token(s).
  - Mainline contains complexity token(s); ik did not in this sample output.
- Notes:
  - `gpu_sweep_mainline` was skipped in this environment because `/home/yurko/Code/llama.cpp/build/bin/llama-sweep-bench` is not present.
  - `gpu_sweep_ik` (`c=2048`, `n=32`) in this run peaked at approximately `maxPP=137.02`, `maxTG=24.81`.
### Quick matrix (exact required configs)

Run artifact: `/tmp/qwen3next-matrix/20260206_063957`

| Profile | Baseline maxPP | Baseline maxTG | New maxPP | New maxTG | Delta maxPP | Delta maxTG |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| 16GB a) `CUDA_VISIBLE_DEVICES=0 --cpu-moe` | 129.83 | 26.45 | 115.56 | 25.74 | -14.27 | -0.71 |
| 16GB b) `CUDA_VISIBLE_DEVICES=0 --cpu-moe -no-ooae` | n/a | n/a | 136.21 | 26.00 | n/a | n/a |
| 28GB a) `CUDA_VISIBLE_DEVICES=0,1 --cpu-moe --tensor-split 0.85,0.15` | 127.66 | 22.95 | 129.70 | 22.72 | +2.04 | -0.23 |
| 28GB b) `CUDA_VISIBLE_DEVICES=0,1 --cpu-moe` | n/a | n/a | 117.54 | 22.99 | n/a | n/a |

### Variance note for single-GPU default (`--cpu-moe`)

Repeated measurements show substantial run-to-run variance in this environment.

Run artifact: `/tmp/qwen3next-repeat-20260206_064133`

- `single_cpu_moe` maxPP/maxTG:
  - run1: `113.84 / 25.86`
  - run2: `135.29 / 26.88`
  - run3: `113.95 / 23.54`
- `single_cpu_moe_no_ooae` maxPP/maxTG:
  - run1: `135.33 / 26.49`
  - run2: `133.64 / 24.92`
  - run3: `126.33 / 23.42`

Interpretation: in this setup, `-no-ooae` is currently more stable and generally faster for PP; the default OOAE path shows large variance and occasional severe PP drops.

## 2026-02-06 Dual-Build Split + Context Sweep + PP Profiling

### Code updates in this pass

1. `src/llama.cpp`
   - Added a Qwen3Next-specific guard that disables `only_active_experts` for large-batch hybrid-MoE prompt paths:
     - condition: `arch == QWEN3NEXT`, tensor overrides enabled, `n_batch >= 512`
   - Rationale: avoid extra scheduling/sync/copy overhead in this PP-heavy path.
2. `scripts/qwen3next-eval.sh`
   - Added build-dir selection so the CPU and CUDA trees can be reused without rebuild toggling:
     - `--main-build-dir`
     - `--ik-build-dir`
   - Fixed runtime loader paths to include both the `bin` and `src` shared-library locations.
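The guard added in `src/llama.cpp` (change 1 above) amounts to a simple predicate over the run configuration. The sketch below is illustrative; the struct and function names are invented for the example and are not the actual `ik_llama.cpp` symbols:

```cpp
#include <cassert>

// Illustrative guard predicate (names are assumptions, not ik_llama.cpp
// symbols): disable the only-active-experts path for Qwen3Next large-batch
// hybrid-MoE prompt processing, where it adds sched/sync/copy overhead.
struct RunCfg {
    bool is_qwen3next;          // arch == QWEN3NEXT
    bool has_tensor_overrides;  // expert tensors overridden (e.g. --cpu-moe)
    int  n_batch;
};

static bool use_only_active_experts(const RunCfg& cfg, bool user_default) {
    if (cfg.is_qwen3next && cfg.has_tensor_overrides && cfg.n_batch >= 512) {
        return false; // PP-heavy hybrid path: force the guard off
    }
    return user_default; // otherwise respect the user/default setting
}
```

Small-batch decode and non-overridden configurations fall through to the user default, so the guard only affects the PP-heavy hybrid path it targets.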
### Separate build setup (requested)

Built and validated two persistent trees in `ik_llama.cpp`:

- `build-cpu`: `GGML_CUDA=OFF`, `GGML_BLAS=ON`
- `build-cuda`: `GGML_CUDA=ON`, `GGML_BLAS=OFF`

Command used:

```bash
docker run --rm --gpus all \
  -v /home/yurko/Code/ik_llama.cpp:/ik \
  -w /ik \
  iktest-dev:latest \
  bash -lc '
    cmake -S /ik -B /ik/build-cpu -DGGML_CUDA=OFF -DGGML_BLAS=ON -DCMAKE_BUILD_TYPE=Release
    cmake --build /ik/build-cpu --config Release -j 56 --target llama-cli llama-sweep-bench llama-perplexity
    cmake -S /ik -B /ik/build-cuda -DGGML_CUDA=ON -DGGML_BLAS=OFF -DCMAKE_BUILD_TYPE=Release
    cmake --build /ik/build-cuda --config Release -j 56 --target llama-cli llama-sweep-bench llama-perplexity
  '
```

### Parity rerun after this pass

Run artifact: `/tmp/qwen3next-eval/20260206_191050`

- CPU PPL parity:
  - `chunks=1`: mainline `1.0009`, ik `1.0009`, delta `0.000000`
  - `chunks=2`: mainline `1.0005`, ik `1.0005`, delta `0.000000`
- CUDA sanity parity:
  - `gpu_ppl_chunks1_mainline`: `OK`
  - `gpu_ppl_chunks1_ik`: `OK`

### Requested runs: CPU `c=512`, CUDA up to `c=8192`

Run artifact: `/tmp/qwen3next-dual-build-20260206_191427`

Config:

- CPU: `build-cpu`, `-c 512 -b 1024 -ub 128 -n 16 -ngl 0`
- CUDA: `build-cuda`, `-c {512,1024,2048,4096,8192} -b 1024 -ub 128 -n 16 -ngl 999 --cpu-moe`

| Case | maxPP (t/s) | maxTG (t/s) | graph splits |
| --- | ---: | ---: | ---: |
| `cpu_c512` | 98.31 | 6.58 | 1 |
| `cuda_c512` | 137.09 | 25.69 | 530 |
| `cuda_c1024` | 135.74 | 27.68 | 530 |
| `cuda_c2048` | 134.87 | 26.71 | 530 |
| `cuda_c4096` | 136.62 | 27.37 | 530 |
| `cuda_c8192` | 137.50 | 27.53 | 530 |

Observation: PP remains roughly flat (`~135-137 t/s`) from `c=512` through `c=8192`, so this is not primarily a long-context KV-scaling bottleneck.
### Prompt-processing bottleneck profiling

Run artifact: `/tmp/qwen3next-profile-20260206_192018`

| Case | maxPP (t/s) | maxTG (t/s) | splits | threads | offloaded layers |
| --- | ---: | ---: | ---: | --- | --- |
| `single_default` | 125.77 | 24.01 | 530 | `t=8,tb=8` | `49/49` |
| `single_t16_tb16` | 37.00 | 0.85 | 530 | `t=16,tb=16` | `49/49` |
| `dual_default` | 128.94 | 22.75 | 531 | `t=8,tb=8` | `49/49` |
| `dual_t16_tb16` | 37.68 | 0.82 | 531 | `t=16,tb=16` | `49/49` |

Key findings:

1. Increasing CPU threads to 16 for this CPU-MoE path is strongly harmful on this machine.
2. Dual-GPU (`0,1`) does not materially improve PP over single-GPU for this config.
3. The main logs still show all expert tensors overridden to CPU and a large CPU expert buffer (`~45.8 GiB`), so PP is dominated by CPU-MoE path behavior rather than GPU-context growth.
4. Graph splits remain high (`~530`) and stable across contexts, indicating persistent scheduler/backend overhead.

### Additional variance check (`default` vs `-no-ooae`)

Run artifact: `/tmp/qwen3next-ooae-repeat-20260206_192523`

- `default` (with the auto Qwen3Next guard), PP/TG: `112.64/23.88`, `135.73/26.40`, `135.30/27.19`
- `-no-ooae`, PP/TG: `131.87/25.97`, `113.80/23.77`, `114.25/23.79`

Interpretation: run-to-run variance is still significant in this environment; however, the new auto-guard removes the worst observed OOAE collapse mode in the default path while preserving parity.

### Why this is still below ~400 PP

Given this exact setup, the dominant limiter is CPU-MoE expert execution (large expert tensors on CPU plus routing/scheduler overhead), not context length. With `--cpu-moe`, this hardware/config currently lands around `~125-137` PP in stable runs. Reaching `~400` PP on this model would likely require reducing or eliminating the CPU-MoE dependence (more VRAM or a different placement strategy) rather than kernel micro-tuning alone.
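As a rough sanity check on that conclusion, a back-of-envelope streaming bound reproduces the observed order of magnitude. All inputs are assumptions: the `~45.8 GiB` expert buffer comes from the logs above, `ub=128` from the sweep config, and the `~50 GB/s` sustained CPU memory bandwidth is a hypothetical figure, not a measurement on this machine.

```cpp
// Back-of-envelope sketch (all inputs are assumptions, not measurements):
// if a microbatch of ub tokens activates enough distinct experts that most of
// the CPU-resident expert buffer must be streamed from RAM once per
// microbatch, PP is bounded by ub / (bytes_streamed / mem_bw).
static double pp_upper_bound_tps(double ub_tokens,
                                 double expert_bytes_streamed,
                                 double mem_bw_bytes_per_s) {
    return ub_tokens * mem_bw_bytes_per_s / expert_bytes_streamed;
}
// Assumed inputs: ub = 128, expert buffer ~45.8 GiB, bandwidth ~50e9 B/s.
```

With those assumed numbers the bound lands near `~130 t/s`, the same order as the observed `~125-137` PP, which is at least consistent with a memory-bandwidth-limited CPU-MoE path; getting to `~400` PP would then require streaming far fewer expert bytes per microbatch (i.e., moving experts off the CPU).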