* Optimizing q3next TG
* Fused add -> softplus -> mul on CUDA
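For reference, the fused op computes softplus(a + b) * c per element. A minimal scalar C++ sketch of that math (not the actual CUDA kernel, and the operand order is an assumption):

```cpp
#include <cmath>
#include <cstdint>

// Scalar reference of the fused add -> softplus -> mul:
// dst[i] = softplus(a[i] + b[i]) * c[i]
// softplus(x) = log(1 + exp(x)), written in a numerically stable form.
static inline float softplus(float x) {
    return x > 20.0f ? x : std::log1p(std::exp(x));
}

static void fused_add_softplus_mul_ref(const float * a, const float * b,
                                       const float * c, float * dst, int64_t n) {
    for (int64_t i = 0; i < n; ++i) {
        dst[i] = softplus(a[i] + b[i]) * c[i];
    }
}
```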
* Remove forgotten debug log
* Increase ggml context size
Required for Qwen3-Next with batch/u-batch size of 4096
* WIP
* Avoid some contiguous ops
* Avoid some repeats
* Avoid some more repeats
* qwen3next: add architecture support and recurrent-state fixes
* qwen3next: optimize broadcast sub and single-seq ssm conv
* cuda: build MoE row mapping on device in mul_mat_id
* cuda: add guarded multi-seq fast path for ssm_conv
* docs: update qwen3next perf report for cuda MoE/SSM tuning
* cuda: reduce qwen3next moe/ssm sync overhead and refresh eval
* qwen3next: split cpu/cuda eval builds and tune PP scheduling
* qwen3next: harden seq-state flow and support optional dense FFN layers
* qwen3next: trim delta-net graph overhead in chunking path
* qwen3next: remove redundant v_conv cont in delta path
* qwen3next: avoid extra cont on linear attention output
* qwen3next: drop redundant cont before recurrent state flatten
* qwen3next: keep recurrent state in 4d layout through delta path
* qwen3next: add fused delta-net op and wire model path
* tests: add backend-op coverage for ggml_delta_net
* qwen3next: add runtime switch for fused delta-net path
* docs: refresh qwen3next perf review and benchmark matrix
* qwen3next: default fused delta-net off and document quality checks
* qwen3next: add decode-only fused delta mode
* qwen3next: make fused delta safe by default and fix fused tensor layout
* qwen3next: warn when forcing fused decode mode
* qwen3next: add fused-delta regression runner script
* qwen3next: integrate fused regression into eval harness
* qwen3next: clean up chunked delta-net shape handling
* qwen3next: add absolute sanity guards to fused regression
* qwen3next: add unified regression runner script
* qwen3next: disable flash-attn for cpu-only contexts
* docs: reconcile qwen3next status and remaining upstream gaps
* common: add qwen3next fused-delta runtime flag
* cuda: add qwen3next delta-net kernel dispatch override
* docs: update qwen3next quality and serving baseline findings
* qwen3next: keep fused delta on safe path and remove PR artifacts
* qwen3next: align autoregressive delta-net decode layout
* Revert "qwen3next: align autoregressive delta-net decode layout"
This reverts commit 9241164a5e.
* cuda: port solve-tri fast-paths for qwen3next delta-net
* qwen3next: add fused-delta runtime flag and drop env toggle
* qwen3next: make fused delta single-flag and default on
* Account for GPU arch differences
* Revert "cuda: build MoE row mapping on device in mul_mat_id"
This reverts commit 89e9ecfa84.
* qwen3next: drop non-essential MoE scheduling and split heuristics
* qwen3next: avoid generic ggml_sub broadcast changes
* llama: restore only_active_experts log message
* Remove unnecessary hacks, disable fusion for now.
* qwen3next: port hybrid recurrent state memory semantics
* qwen3next: clean up recurrent state slot plumbing
* qwen3next: fix hybrid V-cache layout plumbing
* qwen3next: guard recurrent state slots against kv capacity
* qwen3next: persist recurrent state in session data
- serialize/restore qwen3next cache.s_l in state/session paths
- bump session and sequence-state file versions for format change
- fallback to single-token chunking for mixed repeated seq_id batches
* qwen3next: drop unused fused-delta builder path
- remove dead build_delta_net_fused lambda
- remove unused llm_build_context::fused_delta member
* qwen3next: remove unused fused-delta CLI/context plumbing
- drop -fd/-no-fd options and related YAML dump field
- remove fused_delta fields from public/internal context params
- remove fused_delta assignment and logging in context init
* ggml: remove unused DELTA_NET operator stack
* Missing include
* Reorder ops/unary ops
So we don't change the enum values of the mul_mat ops again
* Minor
* Discard unnecessary changes in llama-build-context.cpp
* Minor
* Revert "Discard unnecessary changes in llama-build-context.cpp"
This reverts commit edadb80ed6.
* Increase GGML_SCHED_MAX_SPLITS - required for larger u-batches
* Fix CPU concat in the TG case: 7.25 -> 10.5 t/s for Qwen3Next
* Fix CPU sum_rows: 10.5 -> 13.6 t/s for Qwen3Next
It was single-threaded and was taking ~25% of the computation time
during TG. It is now down to 2%.
Strangely enough, I measure 13.6 t/s with llama-bench, but if I
let the model give me an actual response with llama-cli, I get close
to 17 t/s.
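A rough sketch of the kind of change involved: instead of one thread summing every row, the rows are divided among the worker threads (the ith/nth parameter names are illustrative, not the actual ggml signatures):

```cpp
#include <cstdint>

// Sum each row of a contiguous [ne0 x nrows] float matrix into dst[row],
// with rows statically partitioned across nth threads (ith = this thread).
static void sum_rows_mt(const float * src, float * dst,
                        int64_t ne0, int64_t nrows, int ith, int nth) {
    for (int64_t r = ith; r < nrows; r += nth) {   // round-robin row assignment
        const float * row = src + r * ne0;
        float sum = 0.0f;
        for (int64_t i = 0; i < ne0; ++i) {
            sum += row[i];
        }
        dst[r] = sum;
    }
}
```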
* Fix CPU scale: 13.6 -> 16.7 t/s for Qwen3Next
For Qwen3Next there is a scale op on a largish tensor (548k elements)
that has a single row for TG, so it was done in a single thread.
We now simply process it in blocks of 1024 elements.
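A minimal sketch of the blocking scheme, assuming a flat float tensor and the usual ith/nth threading convention (names are illustrative):

```cpp
#include <algorithm>
#include <cstdint>

// Scale n elements in place, handing out fixed-size blocks to the threads
// so that even a single-row tensor is processed in parallel.
static void scale_blocked(float * x, int64_t n, float s, int ith, int nth) {
    constexpr int64_t kBlock = 1024;
    const int64_t nblocks = (n + kBlock - 1) / kBlock;
    for (int64_t b = ith; b < nblocks; b += nth) {
        const int64_t i0 = b * kBlock;
        const int64_t i1 = std::min(i0 + kBlock, n);
        for (int64_t i = i0; i < i1; ++i) {
            x[i] *= s;
        }
    }
}
```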
* Optimize CPU mul: 16.7 -> 17.6 t/s for Qwen3Next
* CPU: fuse transpose -> cont -> sum_rows -> transpose: 17.6 -> 23.1 t/s for Qwen3Next
* Optimize CPU repeat: 176 -> 200 t/s for Qwen3Next PP-512
* Multithreading for OP_SUB
* Don't commit with timing trace on
* Multithread neg and sigmoid
* Be able to turn on/off fusion more easily (CPU)
* Name the mul_mat ops so we know where the time goes
* WIP
* Much better PP on CUDA
* CUDA: fuse transpose -> cont -> sum_rows -> transpose
Needs a non-contiguous variant of sum_rows.
On the CPU this gave a 30+% improvement in TG performance;
on CUDA it is a disappointing 6-7%. I guess this is because
Georgi's cont CPU implementation was so bad that skipping
it made such a big difference.
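The idea behind the fusion: summing the rows of a transposed view equals summing each column of the original tensor with strided reads, so the intermediate cont can be skipped. A hedged sketch of the non-contiguous reduction (strides in elements, names illustrative):

```cpp
#include <cstdint>

// Equivalent of transpose -> cont -> sum_rows -> transpose on a 2D tensor:
// for a row-major [ne1 x ne0] matrix, accumulate each column sum directly
// via strided reads instead of materializing the transposed copy first.
// For a contiguous row-major source: stride0 = 1, stride1 = ne0.
static void sum_cols_strided(const float * src, float * dst,
                             int64_t ne0, int64_t ne1,
                             int64_t stride0, int64_t stride1) {
    for (int64_t j = 0; j < ne0; ++j) {     // column index of the original
        float sum = 0.0f;
        for (int64_t i = 0; i < ne1; ++i) { // row index of the original
            sum += src[j * stride0 + i * stride1];
        }
        dst[j] = sum;
    }
}
```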
* CUDA: faster mul for a special case relevant for Qwen3Next
Worth 1% in TG
* Fix CPU OP_CONT
---------
Co-authored-by: yurko <yurko@local>
Co-authored-by: Yurko <yurko@example.com>
Co-authored-by: yurko <yurko@pop-os.tail5a1a6b.ts.net>
Co-authored-by: Yurko Hoshko <YurkoHoshko@users.noreply.github.com>
* spec : add self speculative decoding and ngram-mod and refactor
common : use common_ prefix for common library functions
llama : use LLAMA_TOKEN_NULL
spec : add self speculative decoding (no draft model required) + refactor
spec : add ngram-mod
spec : various improvements to ngram-map + docs
spec : fix the check-rate logic of ngram-simple
common : add common_speculative_is_compat()
spec : simplify time measurement using common_time_meas
refactor common_sampler_init
refactor common_token_to_piece
refactor and fix cur_p bug
clean up
* spec : remove check rate
* spec : show warnings instead of aborting
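For context on the draft-model-free speculation added above: the general idea of ngram-based self-drafting is to find the most recent earlier occurrence of the current tail n-gram in the already-generated tokens and propose the tokens that followed it as the draft. A minimal sketch of that generic technique (not necessarily how ngram-mod is actually implemented; the names are illustrative):

```cpp
#include <cstdint>
#include <vector>

using llama_token_t = int32_t; // stand-in for llama_token

// Propose up to n_draft tokens by matching the trailing n-gram of `ctx`
// against earlier positions in `ctx` and reusing what followed the match.
static std::vector<llama_token_t> ngram_self_draft(
        const std::vector<llama_token_t> & ctx, size_t ngram, size_t n_draft) {
    std::vector<llama_token_t> draft;
    if (ctx.size() <= ngram) {
        return draft;
    }
    const size_t tail = ctx.size() - ngram; // start of the trailing n-gram
    // Scan backwards so the most recent match wins.
    for (size_t pos = tail; pos-- > 0; ) {
        bool match = true;
        for (size_t k = 0; k < ngram; ++k) {
            if (ctx[pos + k] != ctx[tail + k]) { match = false; break; }
        }
        if (!match) continue;
        // Reuse the continuation of the earlier occurrence as the draft.
        for (size_t k = pos + ngram; k < ctx.size() && draft.size() < n_draft; ++k) {
            draft.push_back(ctx[k]);
        }
        break;
    }
    return draft;
}
```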
---------
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Sascha Rogmann <59577610+srogmann@users.noreply.github.com>
* WIP
* This works but is slow
* Turn off the up / gate clamps for now
* OK we need the clamping
* Fuse the clamp (CUDA)
* Fuse the clamp (CPU)
* WIP
* Be able to use merged q, k, v
* Be able to use merged up/gate experts
* Fuse the clamp (CUDA mmvq)
* WIP: graph parallel for Step-3.5
* WIP
* This should be it
* Cleanup
* Fix merge
* Non-working attempt to extend fused_mul_unary to the Step-3.5 case
* It works now, but the performance gain is very minor
* Fix graph parallel when ngl < n_layers
* Fix using ffn_norm
When using graph parallel with ngl < n_layers, the ffn_norm tensor
may end up being split while the ffn tensors are on the CPU.
In that case we get a crash because we attempt to use the not-split
buffer of ffn_norm, which is invalid. This commit fixes that.
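To illustrate the shape of the fix (a hypothetical sketch, not the real llm_build_context code): when a layer's FFN weights stayed on the CPU, the graph must also use the non-split copy of ffn_norm rather than its split buffer.

```cpp
struct ggml_tensor; // opaque; only used as a pointer here

// Hypothetical illustration of the guard: if this layer's FFN weights were
// not offloaded (they live on the CPU), use the full, non-split ffn_norm
// tensor; only use the per-GPU split view when the FFN itself is split.
struct layer_tensors {
    bool          ffn_on_cpu;      // FFN up/gate/down stayed on the host
    ggml_tensor * ffn_norm_full;   // non-split copy
    ggml_tensor * ffn_norm_split;  // per-GPU split view (invalid if FFN is on CPU)
};

static ggml_tensor * pick_ffn_norm(const layer_tensors & l) {
    return l.ffn_on_cpu ? l.ffn_norm_full : l.ffn_norm_split;
}
```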
* Cleanup
* adaptive_p: fix history update + use current probability for high temp
* adaptive_p: fix history update bug, update with current probability if temp is high
* replace temp-as-signal with server argument
* adaptive_p: rename ema_w_cur_p to updt_w_cur
* delete test code
* Copy reduce result to other GPUs if necessary
* Avoid ggml_get_rows for TG
* For the output ops use the result of the split that ran on the main GPU
* More models
* A hopefully more efficient adaptive_p sampling
* While at it, let's fix the formatting too
* More formatting
* Hopefully better
* This should be better
* Correctly accumulate adaptive_p sampling time
* AVX2
* A hopefully more efficient adaptive_p sampling
* While at it, let's fix the formatting too
* More formatting
* Correctly accumulate sampling time for adaptive_p
* adaptive-p sampler: fix zeroed orig_probs bug and refactor
- Fix bug where original probabilities were captured as zero by calculating
them from logits in llama_prep_adaptive_p (new).
- Replace vector with unordered_map to track candidate probabilities,
filtering for relevance via logit delta (16.6f).
- Standardize API naming: llama_<action/verb>_<focus/name/topic>_<extra/info>
- Update function signatures to follow most other samplers.
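A hedged sketch of the two pieces described above: original probabilities recomputed from the raw logits with a stable softmax, and candidates kept in an unordered_map only if their logit is within the 16.6f delta of the maximum (function and type names here are illustrative, not the actual llama.cpp API):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using token_id = int32_t;

// Recompute "original" probabilities from raw logits and keep only the
// candidates whose logit is within `delta` of the maximum logit. Candidates
// below the cutoff contribute less than exp(-16.6) ~ 6e-8, so normalizing
// over the retained set is effectively the full softmax.
static std::unordered_map<token_id, float> prep_adaptive_p(
        const std::vector<std::pair<token_id, float>> & logits, float delta = 16.6f) {
    float max_logit = -INFINITY;
    for (const auto & [id, logit] : logits) {
        max_logit = std::max(max_logit, logit);
    }
    std::unordered_map<token_id, float> probs;
    float sum = 0.0f;
    for (const auto & [id, logit] : logits) {
        if (max_logit - logit > delta) continue; // irrelevant candidate
        const float p = std::exp(logit - max_logit);
        probs[id] = p;
        sum += p;
    }
    for (auto & kv : probs) {
        kv.second /= sum;
    }
    return probs;
}
```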
* resolve merge bug
* adaptive-p: revert reordering function definitions
* Attempt to fix the many-GPU issue in split mode graph
* WIP: this seems more stable
Still hanging after a while if I try to use all 7 GPUs
* Reenable OpenMP in scheduler async
Seems solid up to 4 GPUs. It did hang with --max-gpu 6.
* printf cleanup
* Add ability to merge up+gate exps to more models
* We of course need to pass the merged tensor to build_ffn
* All the others
* Also Qwen3VL-MoE
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* WIP - not working
* WIP - not working
* WIP - GPT-OSS working
However, the approach is extremely crude. The only way I could correctly
repack the up/gate experts is to copy up and gate into host buffers, repack
them into another host buffer, and copy the result back into the
ffn_up_gate_exps tensor.
This is going to be very slow for giant 500 GB models.
My attempts to do this via a compute graph on the backend holding
the tensors were unsuccessful.
For GPT-OSS-20B I see ~6-7% better PP when using the original
ik_llama.cpp fused_up_gate CUDA implementation, and ~10% when
using the small batch size implementation.
Other models are not working yet on CUDA as I need to fix the
fused mul-unary implementation.
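A simplified sketch of the host-side repack described above, assuming the merged tensor simply stores each expert's up block followed by its gate block (the real layout and the ggml copy calls may differ):

```cpp
#include <cstddef>
#include <cstring>

// Repack separate per-expert up and gate weights into one merged host buffer:
// for each expert, the up block is followed by the gate block. The merged
// result is then uploaded back into the ffn_up_gate_exps tensor.
static void repack_up_gate(const char * up, const char * gate, char * merged,
                           int n_expert, size_t up_bytes_per_expert,
                           size_t gate_bytes_per_expert) {
    char * dst = merged;
    for (int e = 0; e < n_expert; ++e) {
        std::memcpy(dst, up   + (size_t)e * up_bytes_per_expert,   up_bytes_per_expert);
        dst += up_bytes_per_expert;
        std::memcpy(dst, gate + (size_t)e * gate_bytes_per_expert, gate_bytes_per_expert);
        dst += gate_bytes_per_expert;
    }
}
```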
* WIP
* WIP - Qwen3-MoE (and hopefully all others) working
But when I say here and in the previous commit "working",
I mean PP is working. TG is still broken.
* WIP: TG seems to be working
* Minor
* Add command line option to merge experts up/gate
* Add merge up/gate command line parameter to llama-bench
* Turn off merge_up_gate_exps if split mode graph
It is not yet implemented
* When no bias, allow merging up/gate with tensor overrides
* Arghh, we need to increase the context size again
* Cleanup