* server: enable checkpoint for recurrent models
create checkpoint after cancel
fix ban string and rm context during rewind
add checkpoint interval
only save recurrent cache
* save checkpoint during pp
---------
Co-authored-by: firecoperana <firecoperana>
* wip: port MTP architecture
Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ikllama`.
Changes include:
- Updating `llama_batch` to support `mtp_params`.
- Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft).
- Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`).
- Adapting the embedding extraction logic to skip MTP update passes.
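The draft/verify cycle these changes enable can be sketched as follows. This is an illustrative stand-in, not the actual ikllama API: `accept_draft`, `accept_result`, and the argmax-comparison rule are assumptions made for the example; only the general shape (drafted tokens verified by the target model, with the cache position advanced by the accepted prefix) comes from the commits.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of speculative acceptance: the target model has
// re-scored the drafted tokens; we keep the longest prefix where the
// draft agrees with the target's own argmax, then advance n_past.
// Names mirror llama.cpp style but the function itself is illustrative.
using llama_token = int;

struct accept_result {
    size_t n_accepted; // length of the agreed prefix
    size_t n_past;     // new cache position after acceptance
};

accept_result accept_draft(const std::vector<llama_token> & drafted,
                           const std::vector<llama_token> & target_argmax,
                           size_t n_past) {
    size_t n = 0;
    while (n < drafted.size() && n < target_argmax.size() &&
           drafted[n] == target_argmax[n]) {
        ++n;
    }
    // the target model's own sample at position n is always kept, so the
    // cache advances by the accepted prefix plus one token
    return { n, n_past + n + 1 };
}
```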
* Refactors `server_slot` to support generic speculative decoding (MTP or Draft Model).
* core: enable hybrid outputs (logits + embeddings) for MTP support
* fix(mtp): correct KV-cache slot finding for updates
* fix(mtp): persist hidden states to prevent context corruption during drafting
* refactor(mtp): clean unused code
* fix(mtp): update server to new function names

* fix(mtp): fix graph and save hidden state
* mtp: refactor integration, context params and kv cache search
* mtp: fix hidden state extraction and speculative acceptance flow
* server: fix MTP warmup for long prompts and reset token buffer
* llama: refactor MTP operation state to context parameters
* server: fix n_past calculation in MTP acceptance
* llama: fix mtp enable flags
* speculative: refactor MTP to use common_speculative interface
* context: remove unused signatures
* clip: fix deprecated enum-enum conversion warning
* common: fix format string crash in help message
* context: fix mtp activation logic
* Optimizing q3next TG
* Fused add -> softplus -> mul on CUDA
* Remove forgotten debug log
* Increase ggml context size
Required for Qwen3-Next with batch/u-batch size of 4096
* WIP
* Avoid some contiguous ops
* Avoid some repeats
* Avoid some more repeats
* qwen3next: add architecture support and recurrent-state fixes
* qwen3next: optimize broadcast sub and single-seq ssm conv
* cuda: build MoE row mapping on device in mul_mat_id
* cuda: add guarded multi-seq fast path for ssm_conv
* docs: update qwen3next perf report for cuda MoE/SSM tuning
* cuda: reduce qwen3next moe/ssm sync overhead and refresh eval
* qwen3next: split cpu/cuda eval builds and tune PP scheduling
* qwen3next: harden seq-state flow and support optional dense FFN layers
* qwen3next: trim delta-net graph overhead in chunking path
* qwen3next: remove redundant v_conv cont in delta path
* qwen3next: avoid extra cont on linear attention output
* qwen3next: drop redundant cont before recurrent state flatten
* qwen3next: keep recurrent state in 4d layout through delta path
* qwen3next: add fused delta-net op and wire model path
* tests: add backend-op coverage for ggml_delta_net
* qwen3next: add runtime switch for fused delta-net path
* docs: refresh qwen3next perf review and benchmark matrix
* qwen3next: default fused delta-net off and document quality checks
* qwen3next: add decode-only fused delta mode
* qwen3next: make fused delta safe by default and fix fused tensor layout
* qwen3next: warn when forcing fused decode mode
* qwen3next: add fused-delta regression runner script
* qwen3next: integrate fused regression into eval harness
* qwen3next: clean up chunked delta-net shape handling
* qwen3next: add absolute sanity guards to fused regression
* qwen3next: add unified regression runner script
* qwen3next: disable flash-attn for cpu-only contexts
* docs: reconcile qwen3next status and remaining upstream gaps
* common: add qwen3next fused-delta runtime flag
* cuda: add qwen3next delta-net kernel dispatch override
* docs: update qwen3next quality and serving baseline findings
* qwen3next: keep fused delta on safe path and remove PR artifacts
* qwen3next: align autoregressive delta-net decode layout
* Revert "qwen3next: align autoregressive delta-net decode layout"
This reverts commit 9241164a5e.
* cuda: port solve-tri fast-paths for qwen3next delta-net
* qwen3next: add fused-delta runtime flag and drop env toggle
* qwen3next: make fused delta single-flag and default on
* Account for GPU arch differences
* Revert "cuda: build MoE row mapping on device in mul_mat_id"
This reverts commit 89e9ecfa84.
* qwen3next: drop non-essential MoE scheduling and split heuristics
* qwen3next: avoid generic ggml_sub broadcast changes
* llama: restore only_active_experts log message
* Remove unnecessary hacks, disable fusion for now.
* qwen3next: port hybrid recurrent state memory semantics
* qwen3next: clean up recurrent state slot plumbing
* qwen3next: fix hybrid V-cache layout plumbing
* qwen3next: guard recurrent state slots against kv capacity
* qwen3next: persist recurrent state in session data
- serialize/restore qwen3next cache.s_l in state/session paths
- bump session and sequence-state file versions for format change
- fallback to single-token chunking for mixed repeated seq_id batches
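The version bump that accompanies the new serialized field can be sketched like this. The constants, helper names, and buffer layout here are made up for the example; only the idea (write the recurrent state after a bumped version number, reject older files on read) comes from the commit above.

```cpp
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Illustrative versioned state round-trip: the recurrent state tensor
// (cache.s_l in the commit) is written after a bumped version number,
// and the reader rejects files with any other version.
static constexpr uint32_t SESSION_VERSION = 2; // bumped for the new field

void write_state(std::vector<uint8_t> & out, const std::vector<float> & s_l) {
    auto put = [&](const void * p, size_t n) {
        const uint8_t * b = static_cast<const uint8_t *>(p);
        out.insert(out.end(), b, b + n);
    };
    uint32_t ver = SESSION_VERSION;
    uint64_t n   = s_l.size();
    put(&ver, sizeof ver);
    put(&n,   sizeof n);
    put(s_l.data(), n * sizeof(float));
}

std::vector<float> read_state(const std::vector<uint8_t> & in) {
    size_t off = 0;
    auto get = [&](void * p, size_t n) {
        std::memcpy(p, in.data() + off, n);
        off += n;
    };
    uint32_t ver; get(&ver, sizeof ver);
    if (ver != SESSION_VERSION) throw std::runtime_error("bad session version");
    uint64_t n; get(&n, sizeof n);
    std::vector<float> s_l(n);
    get(s_l.data(), n * sizeof(float));
    return s_l;
}
```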
* qwen3next: drop unused fused-delta builder path
- remove dead build_delta_net_fused lambda
- remove unused llm_build_context::fused_delta member
* qwen3next: remove unused fused-delta CLI/context plumbing
- drop -fd/-no-fd options and related YAML dump field
- remove fused_delta fields from public/internal context params
- remove fused_delta assignment and logging in context init
* ggml: remove unused DELTA_NET operator stack
* Missing include
* Reorder ops/unary ops
So we don't change the enum values of the mul mat ops again
* Minor
* Discard unnecessary changes in llama-build-context.cpp
* Minor
* Revert "Discard unnecessary changes in llama-build-context.cpp"
This reverts commit edadb80ed6.
* Increase GGML_SCHED_MAX_SPLITS - required for larger u-batches
* Fix CPU concat in the TG case: 7.25 -> 10.5 t/s for Qwen3Next
* Fix CPU sum_rows: 10.5 -> 13.6 t/s for Qwen3Next
It was single-threaded and was taking ~25% of the computation time
during TG. It is now down to 2%.
Strangely enough, I measure 13.6 t/s with llama-bench, but if I
let the model give me an actual response with llama-cli, I get close
to 17 t/s.
* Fix CPU scale: 13.6 -> 16.7 t/s for Qwen3Next
For Qwen3Next there is a scale op on a largish tensor (548k elements)
that has a single row for TG, so was done in a single thread.
We now simply use blocks of 1024 elements.
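The blocked split described above can be sketched as follows. The 1024-element block size comes from the commit text; the round-robin distribution over threads and the function itself are illustrative, not the actual ggml implementation.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Sketch of the blocked split: a single large row is cut into fixed-size
// blocks and the blocks are distributed round-robin over the threads,
// instead of handing the whole row to one thread.
void scale_blocked(float * x, size_t n, float s, int n_threads) {
    constexpr size_t kBlock = 1024;
    const size_t n_blocks = (n + kBlock - 1) / kBlock;
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([=] {
            for (size_t b = t; b < n_blocks; b += n_threads) {
                const size_t beg = b * kBlock;
                const size_t end = beg + kBlock < n ? beg + kBlock : n;
                for (size_t i = beg; i < end; ++i) x[i] *= s;
            }
        });
    }
    for (auto & w : workers) w.join();
}
```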
* Optimize CPU mul: 16.7 -> 17.6 t/s for Qwen3Next
* CPU: fuse transpose -> cont -> sum_rows -> transpose: 17.6 -> 23.1 t/s for Qwen3Next
* Optimize CPU repeat: 176 -> 200 t/s for Qwen3Next PP-512
* Multithreading for OP_SUB
* Don't commit with timing trace on
* Multithread neg and sigmoid
* Be able to turn on/off fusion more easily (CPU)
* Name the mul_mat ops so we know where the time goes
* WIP
* Much better PP on CUDA
* CUDA: fuse transpose -> cont -> sum_rows -> transpose
Needs a non-contiguous variant of sum_rows.
On the CPU this gave a 30+% improvement in TG performance;
on CUDA it is a disappointing 6-7%. I guess this is because
Georgi's cont CPU implementation was so bad that skipping
it made such a big difference.
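The non-contiguous sum_rows this fusion needs can be sketched in a minimal form: each logical row has `ne` elements spaced `nb_el` floats apart, as they would be in a transposed view, so the transpose/cont pair can be skipped entirely. The function and its parameter names are illustrative, not the actual ggml kernel.

```cpp
#include <cstddef>

// Minimal strided row-sum: sums n_rows logical rows of ne elements each,
// where consecutive elements of a row are nb_el floats apart and
// consecutive rows start nb_row floats apart. With nb_el > 1 this reads
// a transposed view directly, without materializing a contiguous copy.
void sum_rows_strided(const float * src, float * dst,
                      size_t n_rows, size_t ne, size_t nb_row, size_t nb_el) {
    for (size_t r = 0; r < n_rows; ++r) {
        float acc = 0.0f;
        const float * p = src + r * nb_row;
        for (size_t i = 0; i < ne; ++i) {
            acc += p[i * nb_el];
        }
        dst[r] = acc;
    }
}
```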
* CUDA: faster mul for special case relevant for Qwen3Next
Worth 1% in TG
* Fix CPU OP_CONT
---------
Co-authored-by: yurko <yurko@local>
Co-authored-by: Yurko <yurko@example.com>
Co-authored-by: yurko <yurko@pop-os.tail5a1a6b.ts.net>
Co-authored-by: Yurko Hoshko <YurkoHoshko@users.noreply.github.com>
* WIP
* This works but is slow
* Turn off the up / gate clamps for now
* OK we need the clamping
* Fuse the clamp (CUDA)
* Fuse the clamp (CPU)
* WIP
* Be able to use merged q, k, v
* Be able to use merged up/gate experts
* Fuse the clamp (CUDA mmvq)
* WIP: graph parallel for Step-3.5
* WIP
* This should be it
* Cleanup
* Fix merge
* Not working attempt to extend fused_mul_unary to the Step-3.5 case
* It works now, but performance gain is very minor
* Fix graph parallel when ngl < n_layers
* Fix using ffn_norm
When using graph parallel with ngl < n_layers, the ffn_norm tensor
may have ended up being split, while the ffn tensors are on the CPU.
In that case we get a crash because we attempt to use the not-split
buffer of ffn_norm, which is invalid. This commit fixes that.
* Cleanup
* Copy reduce result to other GPUs if necessary
* Avoid ggml_get_rows for TG
* For the output ops use the result of the split that ran on the main GPU
* More models
* Add ability to merge up+gate exps to more models
* We of course need to pass the merged tensor to build_ffn
* All the others
* Also Qwen3VL-MoE
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* WIP - not working
* WIP - not working
* WIP - GPT-OSS working
However, the approach is very clumsy. The only way I could correctly
repack the up/gate experts is to copy up and gate into host buffers,
repack into another host buffer, and copy back into the
ffn_up_gate_exps tensor.
This is going to be very slow for giant 500 GB models.
My attempts to do this via a compute graph on the backend holding
the tensors was unsuccessful.
For GPT-OSS-20B I see ~6-7% better PP when using the original
ik_llama.cpp fused_up_gate CUDA implementation, and ~10% when
using the small batch size implementation.
Other models are not working yet on CUDA as I need to fix the
fused mul-unary implementation.
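The host-side repack described above can be sketched like this. Only the staging idea (copy up and gate to host, interleave into one merged buffer, copy back) comes from the commit; the row-interleaved layout, the function, and its names are assumptions made for the example.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Illustrative host-side repack: the up and gate expert matrices are
// staged in host memory and interleaved row-by-row into one merged
// buffer (gate row, then up row, per logical row), which a fused
// up/gate matmul kernel could then consume in one pass. The interleaved
// layout is an assumption for illustration.
std::vector<float> repack_up_gate(const std::vector<float> & up,
                                  const std::vector<float> & gate,
                                  size_t n_rows, size_t row_size) {
    std::vector<float> merged(2 * n_rows * row_size);
    for (size_t r = 0; r < n_rows; ++r) {
        std::memcpy(&merged[(2*r + 0) * row_size], &gate[r * row_size],
                    row_size * sizeof(float));
        std::memcpy(&merged[(2*r + 1) * row_size], &up[r * row_size],
                    row_size * sizeof(float));
    }
    return merged;
}
```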
* WIP
* WIP - Qwen3-MoE (and hopefully all others) working
But when I say here and in the previous commit "working",
I mean PP is working. TG is still broken.
* WIP: TG seems to be working
* Minor
* Add command line option to merge experts up/gate
* Add merge up/gate command line parameter to llama-bench
* Turn off merge_up_gate_exps if split mode graph
It is not yet implemented
* When no bias, allow merging up/gate with tensor overrides
* Arghh, we need to increase the context size again
* Cleanup
* Mimo-2 support
* Fix bug for head sizes not being the same
It still does not solve the Mimo-2 quantized cache issue.
* Fix quantized cache
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* WIP: absorb adding input into std_attn and std_ffn
* WIP: NCCL infra
* WIP: add reduce and fake_cpy ops
* WIP
* WIP: graph appears to work, layer is broken
* WIP: Qwen3-MoE works with graph, layer still broken
* WIP: GLM-4.5 graph works
* WIP: fix sm layer (dense)
* WIP: fix sm layer (MoE)
* WIP: fast PP with bespoke 4-GPU NCCL
I guess I'm not using NCCL the right way, as PP is very
low with a single communicator group for 3 or more GPUs.
But if I create 4 communicator groups for pairs of GPUs
(0,1, 2,3, 0,2, 1,3) and use that, PP is fast: I'm hitting
1500 t/s for L3-70B on the 4x3090 system, which is
~20% better than the previous sm graph without NCCL.
But that cannot be the solution (I cannot be creating pairwise
communicators and associated logic for every possible number of GPUs).
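The four pairwise groups (0,1), (2,3), (0,2), (1,3) amount to a two-stage recursive-doubling all-reduce. The simulation below is a plain C++ stand-in for the NCCL calls, added to show why every rank ends up with the full sum; it is not the actual implementation.

```cpp
#include <cstddef>
#include <vector>

// Simulation of the pairwise scheme for 4 ranks: stage 1 sums within
// pairs (0,1) and (2,3) (partner = rank ^ 1); stage 2 sums across pairs
// (0,2) and (1,3) (partner = rank ^ 2). After both stages every rank
// holds the total - the classic recursive-doubling all-reduce that the
// four communicator groups implement.
void allreduce_pairs(std::vector<float> & vals) {
    for (int stage : {1, 2}) {
        std::vector<float> next(vals.size());
        for (size_t r = 0; r < vals.size(); ++r) {
            next[r] = vals[r] + vals[r ^ stage];
        }
        vals = next;
    }
}
```

This generalizes to 2^k GPUs with k stages (partner = rank ^ 2^s at stage s), which is one way around hand-writing pairwise groups for every GPU count.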
* WIP: Cohere2
* Explicitly set device
* Bespoke 3-GPU case
* WIP
* Do not repeat get_rows multiple times
* Fix 3 GPUs
* OK, let's leave it in
* Simple async
* This sync seems enough
* Only do async for 4 or more backends
With 2 GPUs (so, 3 backends) not using async is slightly faster
* Scheduler changes
* Use OpenMP if available
Surprisingly (at least to me), this is quite a bit faster than
std::thread and std::barrier. GLM-4.5-AIR with 4 GPUs is now
at 105 t/s at zero context!
* Do not use OpenMP if there are tensor overrides
* Set omp max active levels
* Be more careful with having set the device before using a stream
* Command line option to turn on async. Set to false by default for now
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Adding fused_norm - same idea as fused_rms_norm
* Avoid computing the attention reduce op for cohere2
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Implement the reduce op without NCCL available
* Be able to build without NCCL
cmake -DGGML_NCCL=OFF disables it
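A build guard of roughly this shape would implement the switch. Only the `GGML_NCCL` option name comes from the commit message; the `GGML_USE_NCCL` define, the `find_package` module, and the target names are assumptions made for the sketch.

```cmake
# Sketch: -DGGML_NCCL=OFF skips the NCCL lookup, so the reduce op falls
# back to the non-NCCL path (assumed here to be guarded by GGML_USE_NCCL).
option(GGML_NCCL "ggml: use NCCL for multi-GPU reduce" ON)

if (GGML_NCCL)
    find_package(NCCL)          # assumes a FindNCCL module is available
    if (NCCL_FOUND)
        target_compile_definitions(ggml PRIVATE GGML_USE_NCCL)
        target_link_libraries(ggml PRIVATE ${NCCL_LIBRARIES})
    endif()
endif()
```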
* Make --max-gpu work again
* Slightly better for 4 GPUs without NCCL
* Cleanup
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Use -smgs or --split-mode-graph-scheduling on the CLI to bypass the disabling of split-mode graph scheduling when tensor overrides are used.
Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
* This works and TG is decent, but PP is low
* Better
* Apply f_logit_scale before mul mat with output tensor
* This is better for PP: 600 t/s -> 700 t/s
* To not lose this again
* WIP
* Equal split
* WIP
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Rearrange graph nodes
So that we can do graph portions that are the same on 2 or more
GPUs at the same time.
* Separate graph compute implementation for split mode graph
* This is better
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>