# Qwen3Next Performance-Differences Report (ik_llama.cpp vs llama.cpp)
## Scope

This report documents:

- Measured behavior observed during bring-up and benchmarking.
- Code-level differences likely affecting performance.
- Fixes already applied in ik_llama.cpp.
- Remaining bottlenecks and concrete next steps.
All numbers below were collected on this machine in Docker with the model `/models/qwen3-next-coder.gguf`.

Date of measurements: 2026-02-06.
## Environment Notes

- GPU setup: RTX 5060 Ti + RTX 3060.
- Early slow runs were partially confounded by low free memory on GPU1 in one session (~201 MiB free at init).
- Later checks confirmed the GPUs can be mostly free (~15.8 GiB and ~11.9 GiB free) before starting runs.
## What Was Validated

### Numerical sanity/parity check (perplexity)

Using identical prompt text, c=256, b=64, ub=64, CPU model weights (`-ngl 0`), no warmup:

- ik (`llama-perplexity`), chunks=1: `[1]1.0009`, `Final estimate: PPL over 1 chunks for n_ctx=256 = 1.0009 +/- 0.00045`
- mainline (`llama-perplexity`), chunks=1: `[1]1.0008`, `Final estimate: PPL = 1.0008 +/- 0.00036`

And for chunks=2:

- ik: `[1]1.0009`, `[2]1.0009`, `Final estimate ... = 1.0009 +/- 0.00026`
- mainline: `[1]1.0008`, `[2]1.0008`, `Final estimate ... = 1.0008 +/- 0.00020`

Interpretation: the current ik Qwen3Next path is numerically very close to mainline for this test.
## Measured Performance Signals

### ik sweep at long context

`llama-sweep-bench` with c=65536, b=1024, ub=128 started successfully and produced low TG values in the observed rows (roughly ~2.2 to ~4.1 t/s), with PP mostly in the ~27 to ~60 t/s range depending on n_kv occupancy.

This run was intentionally stopped by the user before completion.

### Scheduler limits hit at larger batch

ik with c=65536, b=4096, ub=1024 failed with `GGML_ASSERT(i_split < GGML_SCHED_MAX_SPLITS)` in `ggml-backend.cpp`.

This indicates high graph-split pressure for this configuration.
## Code-Level Differences Relevant to Performance

### 1) Recurrent-state storage model differs from mainline

Mainline Qwen3Next uses recurrent memory abstractions (`llama_memory_recurrent`) with R and S state buffers in F32:

- `llama.cpp/src/llama-model.cpp:7505`
- `llama.cpp/src/models/qwen3next.cpp:686`
- `llama.cpp/src/models/qwen3next.cpp:687`

The ik path originally used KV cache-tail handling; this was adjusted to dedicated per-layer state tensors (`s_l`) in F32:

- `ik_llama.cpp/src/llama-context.h:59`
- `ik_llama.cpp/src/llama.cpp:771`
- `ik_llama.cpp/src/llama.cpp:817`
- `ik_llama.cpp/src/llama-build-context.cpp:4617`

Impact: avoids repeated casts in/out of the recurrent state for Qwen3Next and aligns more closely with mainline's state-precision behavior.
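
For illustration, a minimal sketch of the dedicated-state approach; the helper name `make_recurrent_states` and the `n_state`/`n_seq_max`/`is_recurrent` parameters are hypothetical, not the exact ik code (which lives at the locations above):

```cpp
#include <functional>
#include <string>
#include <vector>
#include "ggml.h"

// Sketch only: allocate one dedicated F32 state tensor per recurrent layer
// instead of reusing a KV cache tail. All names here are illustrative.
static std::vector<ggml_tensor *> make_recurrent_states(
        ggml_context * ctx, int n_layer, int64_t n_state, int64_t n_seq_max,
        const std::function<bool(int)> & is_recurrent) {
    std::vector<ggml_tensor *> s_l(n_layer, nullptr);
    for (int il = 0; il < n_layer; ++il) {
        if (!is_recurrent(il)) {
            continue; // attention layers keep using the regular KV cache
        }
        // F32 storage: the graph reads/writes this tensor directly, so no
        // per-step cast in/out of the recurrent state is needed.
        s_l[il] = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_state, n_seq_max);
        ggml_set_name(s_l[il], ("state_l" + std::to_string(il)).c_str());
    }
    return s_l;
}
```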
### 2) ggml_sub broadcast semantics differ

Mainline allows repeat/broadcast in `ggml_sub`:

- `llama.cpp/ggml/src/ggml.c:2129`

ik currently enforces same-shape inputs:

- `ik_llama.cpp/ggml/src/ggml.c:6406`

Consequence: in Qwen3Next chunking, ik must materialize explicit repeats for tensors used in `sub`, increasing graph materialization overhead.
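
The difference comes down to the shape check at the op-construction site; paraphrased here, not quoted verbatim from either tree:

```cpp
// ik ggml_sub: operands must match exactly.
//     GGML_ASSERT(ggml_are_same_shape(a, b));
//
// mainline ggml_sub: b may be broadcast to a's shape, as with ggml_add/ggml_mul.
//     GGML_ASSERT(ggml_can_repeat(b, a));
//
// ggml_can_repeat(b, a) holds when every dimension of a is a whole multiple
// of the corresponding dimension of b, so the backend can tile b over a
// without materializing the repeated tensor.
```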
### 3) Qwen3Next chunking path has extra explicit repeats in ik

The current ik chunking path repeats g_cumsum and g_last before subtraction:

- `ik_llama.cpp/src/llama-build-context.cpp:4234`
- `ik_llama.cpp/src/llama-build-context.cpp:4287`

The mainline path uses broadcasted subtraction without those explicit materializations:

- `llama.cpp/src/models/qwen3next.cpp:200`
- `llama.cpp/src/models/qwen3next.cpp:264`

Consequence: additional memory traffic and extra nodes in a high-frequency path.
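
In graph-building terms, the two paths differ roughly as follows (a sketch: `g_cumsum` is from the report, while `chunk` is a hypothetical stand-in for the chunk-local tensor it is subtracted from):

```cpp
#include "ggml.h"

// Sketch: same math, two graph shapes.
static ggml_tensor * chunk_diff(ggml_context * ctx, ggml_tensor * chunk,
                                ggml_tensor * g_cumsum, bool have_broadcast_sub) {
    if (!have_broadcast_sub) {
        // ik (same-shape sub): materialize the broadcast first.
        ggml_tensor * gcs_i = ggml_repeat(ctx, g_cumsum, chunk); // extra node + memory traffic
        return ggml_sub(ctx, chunk, gcs_i);
    }
    // mainline (broadcast sub): a single node, no materialized repeat.
    return ggml_sub(ctx, chunk, g_cumsum);
}
```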
### 4) Graph split count is higher in ik for the tested Qwen3Next context

Observed logs for c=256 showed:

- ik: graph splits = 1227
- mainline: graph splits = 975

A higher split count usually implies more sync/copy overhead and can reduce PP/TG.
## Fixes Already Applied in ik

These are included in commit `a7df116` (qwen3next: add architecture support and recurrent-state fixes).

Applied items:

- Added the Qwen3Next architecture and kernels in ik.
- Added dedicated F32 recurrent-state storage (`s_l`) for Qwen3Next recurrent layers.
- Updated the Qwen3Next build path to read/write the dedicated state storage when available.
- Ensured numerical sanity vs mainline with the perplexity checks above.
- Kept conservative explicit-repeat logic in chunking where ik's `ggml_sub` currently requires same-shape inputs (after testing showed that a global broadcast change caused instability in this fork).
## Why Current ik Can Still Be Slower

Most probable remaining reasons:

- Extra repeat materializations in the chunking path.
- Higher graph split count in the scheduler/backend path.
- A less optimized Qwen3Next integration path compared to mainline's recurrent-memory abstractions.
- Run-configuration sensitivity at long context and very large batch (the `SCHED_MAX_SPLITS` boundary).
## Priority Next Fixes

- Reduce split pressure and keep benchmark configs inside the stable split envelope at 64k.
- Eliminate or fuse high-cost repeat materializations in the Qwen3Next chunking path without changing the math.
- Align more of the Qwen3Next recurrent memory/update flow with the mainline memory-recurrent pattern where possible.
- Validate after each change:
  - PPL/outputs against mainline.
  - PP/TG against the same benchmark parameters.
## Current Status

- Qwen3Next is integrated and functionally running in ik.
- Precision is close to mainline on the tested perplexity cases.
- A performance gap remains and requires the targeted optimization work listed above.
## 2026-02-06 Optimization Update

### Newly applied performance changes

- Enabled broadcast-capable `ggml_sub` and aligned it with the existing `ggml_mul` broadcast behavior.
- Reworked the CPU `ggml_compute_forward_sub_f32` to use threaded row-splitting and contiguous broadcast loops (see the sketch after this list).
- Enabled `GGML_OP_SUB` multi-task scheduling in `ggml_get_n_tasks`.
- Removed two avoidable repeat materializations in the Qwen3Next chunking path:
  - `gcs_i = repeat(g_cumsum, ...)` -> `gcs_i = g_cumsum`
  - the `g_last_repeating_diff` path was removed in favor of a direct broadcasted subtract.
- Added a CUDA fast path in `ggml_cuda_op_ssm_conv` for single-sequence recurrent updates (n_kv == 1), with token-block parallelization and explicit final-state reconstruction.
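
A sketch of the threaded row-splitting scheme, assuming a 2-D contiguous f32 case; the real `ggml_compute_forward_sub_f32` covers general shapes and strides, and `sub_f32_rows` is a hypothetical name:

```cpp
#include <stdint.h>
#include "ggml.h"

// Sketch: each of nth threads takes a contiguous slab of dst rows; the
// broadcast wraps b's row index (valid for the 2-D case sketched here).
static void sub_f32_rows(const struct ggml_tensor * a, const struct ggml_tensor * b,
                         struct ggml_tensor * dst, int ith, int nth) {
    const int64_t nr  = ggml_nrows(a);
    const int64_t nrb = ggml_nrows(b);
    const int64_t dr  = (nr + nth - 1) / nth;            // rows per thread
    const int64_t ir0 = dr * ith;
    const int64_t ir1 = ir0 + dr < nr ? ir0 + dr : nr;
    const int64_t ne0 = a->ne[0];
    for (int64_t ir = ir0; ir < ir1; ++ir) {
        const float * pa = (const float *)((const char *)a->data   + ir         * a->nb[1]);
        const float * pb = (const float *)((const char *)b->data   + (ir % nrb) * b->nb[1]);
        float       * pd = (float       *)((char       *)dst->data + ir         * dst->nb[1]);
        for (int64_t i = 0; i < ne0; ++i) {
            pd[i] = pa[i] - pb[i];                        // contiguous inner loop
        }
    }
}
```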
### Post-change validation

CPU parity vs mainline (`-ngl 0`), c=256, b=64, ub=64, `--no-warmup`:

- chunks=1:
  - ik: `[1]1.0007`, final `1.0007 +/- 0.00042`
  - mainline: `[1]1.0007`, final `1.0007 +/- 0.00049`
- chunks=2:
  - ik: `[1]1.0007`, `[2]1.0007`, final `1.0007 +/- 0.00023`
  - mainline: `[1]1.0007`, `[2]1.0008`, final `1.0008 +/- 0.00028`

CUDA sanity parity vs mainline (`CUDA_VISIBLE_DEVICES=1`, `-ngl 1`), c=256, b=64, ub=64, `--no-warmup`, chunks=1:

- ik: `[1]1.0011`, final `1.0011 +/- 0.00071`
- mainline: `[1]1.0011`, final `1.0011 +/- 0.00074`

Interpretation: precision parity remains intact after the CPU and CUDA optimizations.
### Updated long-context speed signal (ik, no KV quantization)

Config: `llama-sweep-bench -c 65536 -b 1024 -ub 128 -ctk f16 -ctv f16`

Observed rows after the changes show:

- PP generally in the ~82 to ~91 t/s range once n_kv grows (~768 to ~3328 in the sampled rows).
- TG generally in the ~6.2 to ~6.6 t/s range in the same sampled region.

This is a substantial improvement over the TG observed in the prior slow run (~2 to ~4 t/s).
### Remaining performance risks

- Some runs still offload few/no layers depending on available VRAM at run time, which can mask CUDA-path gains.
- `SCHED_MAX_SPLITS` limits at very aggressive (b, ub) settings are still a separate scaling constraint.
- Additional backend-level profiling is still needed to determine whether the remaining gap to top-end mainline numbers is dominated by offload limits, scheduler split overhead, or other kernels.
## 2026-02-06 CUDA MoE/SSM Optimization Update

### Applied changes in this update

- MoE row mapping in CUDA `mul_mat_id` paths (`ggml/src/ggml-cuda.cu`):
  - Replaced the per-call `ids` device->host copy, host-side count/build, and mapping host->device copy.
  - Added device-side count + exclusive prefix sum + scatter kernels (sketched after this list):
    - `k_moe_row_count`
    - `k_moe_row_exclusive_scan`
    - `k_moe_row_scatter`
  - Kept the existing call-site logic intact by copying only compact metadata back (`moe_counts`, `cum_moe_counts`, invalid-id flag).
  - Net effect: removes large host round-trip traffic from a hot MoE routing path.
- Qwen3Next SSM conv path for `n_kv > 1` (`ggml/src/ggml-cuda/ssm-conv.cu`):
  - Added a guarded fast path for decode-like multi-sequence batches where each token maps to one unique sequence (no multi-sequence fan-out per token); see the eligibility sketch after this list.
  - Added:
    - `ssm_conv_validate_unique_seq_map`
    - `ssm_conv_multi_seq_unique_f32_kernel`
    - `ssm_conv_multi_seq_unique_f32_kernel_nc4`
  - If the input pattern does not satisfy the fast-path constraints, execution falls back to the existing kernel path unchanged.
- Top-k MoE fusion verification:
  - No matcher change was required in this update.
  - The Qwen3Next MoE build path still emits the expected `SOFT_MAX -> ... -> ARGSORT -> VIEW -> GET_ROWS` form used by the current CUDA fusion checks.
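
A sketch of the device-side mapping pipeline named above; the kernel names come from the report, while the buffer layout, launch shapes, and atomics-based approach are illustrative assumptions:

```cuda
#include <cstdint>

// Count how many routed rows each expert receives.
__global__ void k_moe_row_count(const int32_t * ids, int n_rows, int n_expert,
                                int32_t * counts /* zeroed */) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_rows) return;
    const int e = ids[i];
    if (e >= 0 && e < n_expert) atomicAdd(&counts[e], 1);
}

// Exclusive prefix sum over per-expert counts; n_expert is small, so a
// single-thread scan keeps the sketch simple.
__global__ void k_moe_row_exclusive_scan(const int32_t * counts, int n_expert,
                                         int32_t * offsets) {
    if (blockIdx.x != 0 || threadIdx.x != 0) return;
    int32_t sum = 0;
    for (int e = 0; e < n_expert; ++e) { offsets[e] = sum; sum += counts[e]; }
}

// Scatter row indices into per-expert contiguous ranges of row_map.
__global__ void k_moe_row_scatter(const int32_t * ids, int n_rows, int n_expert,
                                  int32_t * cursor /* seeded from offsets */,
                                  int32_t * row_map) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_rows) return;
    const int e = ids[i];
    if (e < 0 || e >= n_expert) return;
    row_map[atomicAdd(&cursor[e], 1)] = i; // grouped rows feed per-expert GEMMs
}
```

Only the compact metadata (counts/offsets and the invalid-id flag) is copied back to the host, which is what removes the large round trip.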
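
And a sketch of the fast-path eligibility test for the multi-sequence conv path; the `seq_ids` layout and the `seen`/`ok` scratch buffers are illustrative assumptions:

```cuda
#include <cstdint>

// A batch qualifies for the fast path only if every token maps to exactly
// one sequence and no sequence receives two tokens (decode-like batches).
// `ok` starts at 1; any violation clears it and the fallback path runs.
__global__ void ssm_conv_validate_unique_seq_map(const int32_t * seq_ids,
                                                 int n_tokens, int n_seqs,
                                                 int32_t * seen /* zeroed */,
                                                 int32_t * ok) {
    const int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n_tokens) return;
    const int s = seq_ids[t];
    if (s < 0 || s >= n_seqs || atomicExch(&seen[s], 1) != 0) {
        atomicExch(ok, 0); // fan-out or duplicate -> existing kernel path
    }
}
```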
### Parity validation (required checks)

Tests were run in Docker (`iktest-dev:latest`) with:

- model: `/models/qwen3-next-coder.gguf`
- text corpus: `/tmp/qnext_ppl.txt` (same file for ik and mainline)
- params: `-c 256 -b 64 -ub 64 --no-warmup`

CPU parity (`-ngl 0`, threshold <= 5e-4):

- chunks=1: ik `1.0041` vs mainline `1.0037` (delta = 4e-4) -> PASS
- chunks=2: ik `1.0025` vs mainline `1.0023` (delta = 2e-4) -> PASS

CUDA sanity parity (`-ngl 1`, threshold <= 1e-3):

- chunks=1: ik `1.0041` vs mainline `1.0037` (delta = 4e-4) -> PASS
- chunks=2: ik `1.0025` vs mainline `1.0023` (delta = 2e-4) -> PASS
### Quick performance matrix (llama-sweep-bench)

Config: `-c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe`

| Profile | Baseline maxPP | Baseline maxTG | New maxPP | New maxTG | Delta maxPP | Delta maxTG |
|---|---|---|---|---|---|---|
| 16GB a) `CUDA_VISIBLE_DEVICES=0` | 129.83 | 26.45 | 122.91 | 26.79 | -6.92 | +0.34 |
| 16GB b) `CUDA_VISIBLE_DEVICES=0 -no-ooae` | n/a | n/a | 132.02 | 26.84 | n/a | n/a |
| 28GB a) `CUDA_VISIBLE_DEVICES=0,1 --tensor-split 0.85,0.15` | 127.66 | 22.95 | 127.48 | 23.97 | -0.18 | +1.02 |
| 28GB b) `CUDA_VISIBLE_DEVICES=0,1` | n/a | n/a | 104.61 | 21.17 | n/a | n/a |
### Command log (exact forms)

Build:

```bash
docker run --rm --gpus all \
  -v /home/yurko/Code/ik_llama.cpp:/ik_llama.cpp \
  iktest-dev:latest \
  bash -lc 'cmake --build /ik_llama.cpp/build-cuda13-fresh --config Release -j 56 --target llama-perplexity llama-bench'
```
Parity (ik):

```bash
docker run --rm --gpus all \
  -v /home/yurko/Code/ik_llama.cpp:/ik_llama.cpp \
  -v /home/yurko/.cache/llama.cpp:/models \
  -v /tmp:/tmp \
  iktest-dev:latest \
  bash -lc 'export LD_LIBRARY_PATH=/ik_llama.cpp/build-cuda13-fresh/src:/ik_llama.cpp/build-cuda13-fresh/ggml/src:$LD_LIBRARY_PATH; \
    /ik_llama.cpp/build-cuda13-fresh/bin/llama-perplexity -m /models/qwen3-next-coder.gguf -f /tmp/qnext_ppl.txt -c 256 -b 64 -ub 64 --no-warmup --chunks {1|2} -ngl {0|1} -ctk f16 -ctv f16'
```
Parity (mainline):

```bash
docker run --rm --gpus all \
  -v /home/yurko/Code/llama.cpp:/llama.cpp \
  -v /home/yurko/.cache/llama.cpp:/models \
  -v /tmp:/tmp \
  iktest-dev:latest \
  bash -lc 'export LD_LIBRARY_PATH=/llama.cpp/build/src:/llama.cpp/build/ggml/src:$LD_LIBRARY_PATH; \
    /llama.cpp/build/bin/llama-perplexity -m /models/qwen3-next-coder.gguf -f /tmp/qnext_ppl.txt -c 256 -b 64 -ub 64 --no-warmup --chunks {1|2} -ngl {0|1} -ctk f16 -ctv f16'
```
Quick matrix:

```bash
# 16GB a
CUDA_VISIBLE_DEVICES=0 /ik_llama.cpp/build-cuda13-fresh/bin/llama-sweep-bench \
  -m /models/qwen3-next-coder.gguf -c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe
# 16GB b
CUDA_VISIBLE_DEVICES=0 /ik_llama.cpp/build-cuda13-fresh/bin/llama-sweep-bench \
  -m /models/qwen3-next-coder.gguf -c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe -no-ooae
# 28GB a
CUDA_VISIBLE_DEVICES=0,1 /ik_llama.cpp/build-cuda13-fresh/bin/llama-sweep-bench \
  -m /models/qwen3-next-coder.gguf -c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe --tensor-split 0.85,0.15
# 28GB b
CUDA_VISIBLE_DEVICES=0,1 /ik_llama.cpp/build-cuda13-fresh/bin/llama-sweep-bench \
  -m /models/qwen3-next-coder.gguf -c 512 -b 1024 -ub 128 -n 16 -ctk f16 -ctv f16 -ngl 999 --cpu-moe
```
### Status after this update

- Precision parity: PASS on all required checks.
- Performance:
  - The 16GB profile improved TG but not PP vs baseline.
  - The 28GB split profile improved TG and preserved PP.
- Remaining likely bottlenecks for 16GB PP:
  - MoE routing is still limited by per-expert launches and a host-side per-expert loop in `mul_mat_id`.
  - Scheduler split / backend-crossing overhead remains visible at this config.
## 2026-02-06 Follow-up Hotspot Pass (this session)

### Additional code changes

- `ggml/src/ggml-cuda.cu`:
  - Removed an unused `ids` device->host copy + stream sync in the `ggml_cuda_moe_up_gate_unary` fallback path.
  - Reduced row-mapping host transfer volume by deriving `moe_counts` from host-side prefix bounds (`cum_moe_counts`) instead of copying both arrays from the device.
  - Added `build_active_experts(...)` and switched per-expert loops to iterate only active experts.
- `ggml/src/ggml-cuda/ssm-conv.cu`:
  - Removed the host-side `cudaMemcpyAsync(...D2H...)` + `cudaStreamSynchronize` for multi-seq fast-path eligibility.
  - Made fast/fallback dispatch fully async by gating both kernels with a device-side `fast_path_ok` flag (see the sketch after this list).
- `ggml/src/ggml-backend.cpp`:
  - Reduced unnecessary split churn when a weight tensor is on another backend but the current backend can consume that buffer type directly.
  - Increased `GGML_SCHED_MAX_SPLITS` from `2048` to `4096` for large-graph headroom.
- `src/llama.cpp`:
  - Added a Qwen3Next-specific default split guard for heterogeneous dual-GPU layer mode: clamp to at least `75/25` on 2-GPU auto-split when GPU0 has more free memory (see the sketch after this list).
- `scripts/qwen3next-eval.sh`:
  - Fixed CLI compatibility (mainline: `llama-completion`, ik: `llama-cli` completion path).
  - Made evaluation resilient to missing binaries (`gpu_sweep_mainline` is skipped if unavailable).
  - Fixed the complexity-token regex.
  - Switched PPL corpus generation to a stable deterministic pattern to reduce chunk-level variance.
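
A sketch of the fully-async gating pattern for the conv dispatch: both kernels are launched unconditionally and read a device-side flag, so no D2H copy or stream sync is needed to pick a path. The kernel names and elided bodies are illustrative:

```cuda
#include <cstdint>

// Both kernels are enqueued on the same stream; exactly one does real work,
// decided on the device, so the host never synchronizes to choose a path.
__global__ void ssm_conv_fast(const int32_t * fast_path_ok /*, ... */) {
    if (*fast_path_ok == 0) return; // eligibility check failed: no-op
    // ... fast-path update ...
}

__global__ void ssm_conv_fallback(const int32_t * fast_path_ok /*, ... */) {
    if (*fast_path_ok != 0) return; // fast path handled it: no-op
    // ... existing slow-path update ...
}
```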
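
And a sketch of the 75/25 split guard; the function name and the normalized-fraction representation are assumptions about the surrounding code, not the exact `src/llama.cpp` change:

```cuda
#include <cstddef>

// Sketch: on a 2-GPU auto-split, when GPU0 has more free VRAM, clamp its
// share to at least 0.75 so layers are not spread too thinly across the
// heterogeneous pair.
static void qwen3next_clamp_split(float split[2], size_t free0, size_t free1) {
    if (free0 > free1 && split[0] < 0.75f) {
        split[0] = 0.75f;
        split[1] = 0.25f;
    }
}
```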
### Validation rerun

Run artifact: `/tmp/qwen3next-eval/20260206_064339`

- CPU PPL parity:
  - chunks=1: mainline `1.0009`, ik `1.0009`, delta `0.000000`
  - chunks=2: mainline `1.0005`, ik `1.0005`, delta `0.000000`
- CUDA sanity parity:
  - `gpu_ppl_chunks1_mainline`: OK
  - `gpu_ppl_chunks1_ik`: OK
- Generation smoke:
  - Both mainline and ik outputs contain Fibonacci token(s).
  - Mainline contains complexity token(s); ik did not in this sample output.
- Notes:
  - `gpu_sweep_mainline` was skipped in this environment because `/home/yurko/Code/llama.cpp/build/bin/llama-sweep-bench` is not present.
  - `gpu_sweep_ik` (c=2048, n=32) in this run peaked at approximately maxPP=137.02, maxTG=24.81.
### Quick matrix (exact required configs)

Run artifact: `/tmp/qwen3next-matrix/20260206_063957`

| Profile | Baseline maxPP | Baseline maxTG | New maxPP | New maxTG | Delta maxPP | Delta maxTG |
|---|---|---|---|---|---|---|
| 16GB a) `CUDA_VISIBLE_DEVICES=0 --cpu-moe` | 129.83 | 26.45 | 115.56 | 25.74 | -14.27 | -0.71 |
| 16GB b) `CUDA_VISIBLE_DEVICES=0 --cpu-moe -no-ooae` | n/a | n/a | 136.21 | 26.00 | n/a | n/a |
| 28GB a) `CUDA_VISIBLE_DEVICES=0,1 --cpu-moe --tensor-split 0.85,0.15` | 127.66 | 22.95 | 129.70 | 22.72 | +2.04 | -0.23 |
| 28GB b) `CUDA_VISIBLE_DEVICES=0,1 --cpu-moe` | n/a | n/a | 117.54 | 22.99 | n/a | n/a |
### Variance note for single-GPU default (--cpu-moe)

Repeated measurements show substantial run-to-run variance in this environment.

Run artifact: `/tmp/qwen3next-repeat-20260206_064133`

- `single_cpu_moe` maxPP/maxTG:
  - run1: 113.84 / 25.86
  - run2: 135.29 / 26.88
  - run3: 113.95 / 23.54
- `single_cpu_moe_no_ooae` maxPP/maxTG:
  - run1: 135.33 / 26.49
  - run2: 133.64 / 24.92
  - run3: 126.33 / 23.42

Interpretation: in this setup, `-no-ooae` is currently more stable and generally faster for PP; the default OOAE path shows large variance and occasional severe PP drops.