* wip: port MTP architecture
Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ikllama`.
Changes include:
- Updating `llama_batch` to support `mtp_params`.
- Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft).
- Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`).
- Adapting the embedding extraction logic to skip MTP update passes.
* Refactors `server_slot` to support generic speculative decoding (MTP or Draft Model).
* core: enable hybrid outputs (logits + embeddings) for MTP support
* fix(mtp): correct KV-cache slot finding for updates
* fix(mtp): persist hidden states to prevent context corruption during drafting
* refactor(mtp): clean unused code
* fix(mtp): update server to new function names
* fix(mtp): fix graph and save hidden state
* mtp: refactor integration, context params and kv cache search
* mtp: fix hidden state extraction and speculative acceptance flow
* server: fix MTP warmup for long prompts and reset token buffer
* llama: refactor MTP operation state to context parameters
* server: fix n_past calculation in MTP acceptance
* llama: fix mtp enable flags
* speculative: refactor MTP to use common_speculative interface
* context: remove unused signatures
* clip: fix deprecated enum-enum conversion warning
* common: fix format string crash in help message
* context: fix mtp activation logic
* Optimizing q3next TG
* Fused add -> softplus -> mul on CUDA
* Remove forgotten debug log
* Increase ggml context size
Required for Qwen3-Next with batch/u-batch size of 4096
* WIP
* Avoid some contiguous ops
* Avoid some repeats
* Avoid some more repeats
* qwen3next: add architecture support and recurrent-state fixes
* qwen3next: optimize broadcast sub and single-seq ssm conv
* cuda: build MoE row mapping on device in mul_mat_id
* cuda: add guarded multi-seq fast path for ssm_conv
* docs: update qwen3next perf report for cuda MoE/SSM tuning
* cuda: reduce qwen3next moe/ssm sync overhead and refresh eval
* qwen3next: split cpu/cuda eval builds and tune PP scheduling
* qwen3next: harden seq-state flow and support optional dense FFN layers
* qwen3next: trim delta-net graph overhead in chunking path
* qwen3next: remove redundant v_conv cont in delta path
* qwen3next: avoid extra cont on linear attention output
* qwen3next: drop redundant cont before recurrent state flatten
* qwen3next: keep recurrent state in 4d layout through delta path
* qwen3next: add fused delta-net op and wire model path
* tests: add backend-op coverage for ggml_delta_net
* qwen3next: add runtime switch for fused delta-net path
* docs: refresh qwen3next perf review and benchmark matrix
* qwen3next: default fused delta-net off and document quality checks
* qwen3next: add decode-only fused delta mode
* qwen3next: make fused delta safe by default and fix fused tensor layout
* qwen3next: warn when forcing fused decode mode
* qwen3next: add fused-delta regression runner script
* qwen3next: integrate fused regression into eval harness
* qwen3next: clean up chunked delta-net shape handling
* qwen3next: add absolute sanity guards to fused regression
* qwen3next: add unified regression runner script
* qwen3next: disable flash-attn for cpu-only contexts
* docs: reconcile qwen3next status and remaining upstream gaps
* common: add qwen3next fused-delta runtime flag
* cuda: add qwen3next delta-net kernel dispatch override
* docs: update qwen3next quality and serving baseline findings
* qwen3next: keep fused delta on safe path and remove PR artifacts
* qwen3next: align autoregressive delta-net decode layout
* Revert "qwen3next: align autoregressive delta-net decode layout"
This reverts commit 9241164a5e.
* cuda: port solve-tri fast-paths for qwen3next delta-net
* qwen3next: add fused-delta runtime flag and drop env toggle
* qwen3next: make fused delta single-flag and default on
* Account for GPU arch differences
* Revert "cuda: build MoE row mapping on device in mul_mat_id"
This reverts commit 89e9ecfa84.
* qwen3next: drop non-essential MoE scheduling and split heuristics
* qwen3next: avoid generic ggml_sub broadcast changes
* llama: restore only_active_experts log message
* Remove unnecessary hacks, disable fusion for now.
* qwen3next: port hybrid recurrent state memory semantics
* qwen3next: clean up recurrent state slot plumbing
* qwen3next: fix hybrid V-cache layout plumbing
* qwen3next: guard recurrent state slots against kv capacity
* qwen3next: persist recurrent state in session data
- serialize/restore qwen3next cache.s_l in state/session paths
- bump session and sequence-state file versions for format change
- fallback to single-token chunking for mixed repeated seq_id batches
* qwen3next: drop unused fused-delta builder path
- remove dead build_delta_net_fused lambda
- remove unused llm_build_context::fused_delta member
* qwen3next: remove unused fused-delta CLI/context plumbing
- drop -fd/-no-fd options and related YAML dump field
- remove fused_delta fields from public/internal context params
- remove fused_delta assignment and logging in context init
* ggml: remove unused DELTA_NET operator stack
* Missing include
* Reorder ops/unary ops
So we don't change the enum values of the mul mat ops again
* Minor
* Discard unnecessary changes in llama-build-context.cpp
* Minor
* Revert "Discard unnecessary changes in llama-build-context.cpp"
This reverts commit edadb80ed6.
* Increase GGML_SCHED_MAX_SPLITS - required for larger u-batches
* Fix CPU concat in the TG case: 7.25 -> 10.5 t/s for Qwen3Next
* Fix CPU sum_rows: 10.5 -> 13.6 t/s for Qwen3Next
It was single-threaded and was taking ~25% of the computation time
during TG. It is now down to 2%.
Strangely enough, I measure 13.6 t/s with llama-bench, but if I
let the model give me an actual response with llama-cli, I get close
to 17 t/s.
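A minimal sketch of how a row-wise sum can be spread over threads, assuming the usual "thread `ith` of `nth`" work split. The helper name `sum_rows_slice` and the strided row assignment are illustrative assumptions, not the actual kernel from this commit.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not the actual ggml kernel): thread `ith` of `nth`
// sums rows ith, ith + nth, ith + 2*nth, ... of a row-major matrix.
void sum_rows_slice(const std::vector<float>& src, std::size_t n_rows, std::size_t n_cols,
                    std::vector<float>& dst, std::size_t ith, std::size_t nth) {
    for (std::size_t r = ith; r < n_rows; r += nth) {
        float sum = 0.0f;
        for (std::size_t c = 0; c < n_cols; ++c) {
            sum += src[r * n_cols + c];
        }
        dst[r] = sum;  // each output row is written by exactly one thread
    }
}
```

Each thread calls the helper with its own `ith`; since the threads write disjoint rows, no synchronization is needed inside the op.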
* Fix CPU scale: 13.6 -> 16.7 t/s for Qwen3Next
For Qwen3Next there is a scale op on a largish tensor (548k elements)
that has a single row for TG, so was done in a single thread.
We now simply use blocks of 1024 elements.
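The block-based split described above could look roughly like this. Only the block size of 1024 comes from the commit message; the helper name `scale_blocks` and the strided assignment of blocks to threads are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kBlockSize = 1024;  // block size quoted in the commit message

// Hypothetical helper: thread `ith` of `nth` scales blocks ith, ith + nth, ...
// Working on flat blocks instead of rows keeps all threads busy even when
// the tensor has a single row.
void scale_blocks(std::vector<float>& x, float s, std::size_t ith, std::size_t nth) {
    const std::size_t n = x.size();
    const std::size_t n_blocks = (n + kBlockSize - 1) / kBlockSize;  // round up
    for (std::size_t b = ith; b < n_blocks; b += nth) {
        const std::size_t first = b * kBlockSize;
        const std::size_t last = std::min(first + kBlockSize, n);  // last block may be short
        for (std::size_t i = first; i < last; ++i) {
            x[i] *= s;
        }
    }
}
```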
* Optimize CPU mul: 16.7 -> 17.6 t/s for Qwen3Next
* CPU: fuse transpose -> cont -> sum_rows -> transpose: 17.6 -> 23.1 t/s for Qwen3Next
* Optimize CPU repeat: 176 -> 200 t/s for Qwen3Next PP-512
* Multithreading for OP_SUB
* Don't commit with timing trace on
* Multithread neg and sigmoid
* Be able to turn on/off fusion more easily (CPU)
* Name the mul_mat ops so we know where the time goes
* WIP
* Much better PP on CUDA
* CUDA: fuse transpose -> cont -> sum_rows -> transpose
Needs non-contiguous variant of sum_rows.
On the CPU this gave 30+% improvement in TG performance,
on CUDA it is a disappointing 6-7%. I guess, this is because
Georgi's cont CPU implementation was so bad that skipping
it made such a big difference.
* CUDA: faster mul for special case relevant for Qwen3Next
Worth 1% in TG
* Fix CPU OP_CONT
---------
Co-authored-by: yurko <yurko@local>
Co-authored-by: Yurko <yurko@example.com>
Co-authored-by: yurko <yurko@pop-os.tail5a1a6b.ts.net>
Co-authored-by: Yurko Hoshko <YurkoHoshko@users.noreply.github.com>
* WIP
* This works but is slow
* Turn off the up / gate clamps for now
* OK we need the clamping
* Fuse the clamp (CUDA)
* Fuse the clamp (CPU)
* WIP
* Be able to use merged q, k, v
* Be able to use merged up/gate experts
* Fuse the clamp (CUDA mmvq)
* WIP: graph parallel for Step-3.5
* WIP
* This should be it
* Cleanup
* Fix merge
* WIP
* This works but is slow
* Turn off the up / gate clamps for now
* OK we need the clamping
* Fuse the clamp (CUDA)
* Fuse the clamp (CPU)
* WIP
* Be able to use merged q, k, v
* Be able to use merged up/gate experts
* Fuse the clamp (CUDA mmvq)
* WIP - not working
* WIP - not working
* WIP - GPT-OSS working
However, extremely stupid. The only way I could correctly repack the
up/gate experts is to copy up and gate into host buffers, repack
into another host buffer, copy back into the ffn_up_gate_exps tensor.
This is going to be very slow for giant 500 GB models.
My attempts to do this via a compute graph on the backend holding
the tensors were unsuccessful.
For GPT-OSS-20B I see ~6-7% better PP when using the original
ik_llama.cpp fused_up_gate CUDA implementation, and ~10% when
using the small batch size implementation.
Other models are not working yet on CUDA as I need to fix the
fused mul-unary implementation.
* WIP
* WIP - Qwen3-MoE (and hopefully all others) working
But when I say here and in the previous commit "working",
I mean PP is working. TG is still broken.
* WIP: TG seems to be working
* Minor
* Add command line option to merge experts up/gate
* Add merge up/gate command line parameter to llama-bench
* Turn off merge_up_gate_exps if split mode graph
It is not yet implemented
* When no bias, allow merging up/gate with tensor overrides
* Arghh, we need to increase the context size again
* Cleanup
* Mimo-2 support
* Fix bug for head sizes not being the same
It still does not solve the Mimo-2 quantized cache issue.
* Fix quantized cache
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Remove most of split mode row
* WIP
* WIP: also allocate the KV cache using tensor split
* WIP: it runs with wrong result
But it also looks like the backend scheduler is not going to help:
- It copies mask and input positions to GPU 0
- => RoPE ops must run on GPU 0
- => To proceed with attn evaluation, GPU 1 must wait for GPU 0 to finish its
entire attn calculation
- Same with FFN. The rms_norm gets scheduled on GPU 0. Hence, GPU 1 must
wait for GPU 0 to finish its entire FFN calculation before it can
start (as it needs to copy the result of rms_norm from GPU 0)
- => Seems useless without writing a bespoke TP scheduling
* WIP
* This works, but it is slow
* This is slightly better
but the graph is still not being computed in parallel.
Why? Because the scheduler creates graph splits where the
result of the computation on one GPU becomes an input for the
other split. Hence, to trigger the computation on the second GPU
one needs to wait for the computation on the first GPU to finish,
even though the two can be done in parallel up to the synchronization
point. So, all that is left to do is to trick the scheduler into creating
two splits that can be done in parallel, and then have a graph split
where the results get combined.
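The shape of the desired graph (two independent splits followed by one combining split) can be sketched with a plain matrix-vector product. The helpers `partial_matvec` and `combine` are hypothetical and run on the host; they only illustrate the dependency structure, not the scheduler change itself.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: partial product over columns [c0, c1) of a row-major
// n_rows x n_cols matrix. Two calls with disjoint column ranges do not depend
// on each other, so they could run on different devices at the same time.
std::vector<float> partial_matvec(const std::vector<float>& w, const std::vector<float>& x,
                                  std::size_t n_rows, std::size_t n_cols,
                                  std::size_t c0, std::size_t c1) {
    std::vector<float> y(n_rows, 0.0f);
    for (std::size_t r = 0; r < n_rows; ++r) {
        for (std::size_t c = c0; c < c1; ++c) {
            y[r] += w[r * n_cols + c] * x[c];
        }
    }
    return y;
}

// The combining step is the only point that needs both partial results.
std::vector<float> combine(const std::vector<float>& a, const std::vector<float>& b) {
    std::vector<float> y(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        y[i] = a[i] + b[i];
    }
    return y;
}
```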
* Playing games with the scheduler
This change tricks it into doing the right thing^TM.
Still quite a bit slower than split mode layer for the 8B LlaMA model.
But for the 70B LlaMA it now beats split mode layer for TG:
28 t/s vs 24.4 t/s. PP is 627 t/s vs 744 t/s.
In comparison, split mode "row" in mainline gets
484 t/s PP and 19.3 t/s TG.
* Fix attn split
Granularity for Wq, Wo is not just head size, but
head size * gqa_ratio.
Else the Wk, Wv tensors end up not being a multiple of the
head size when we divide the split determined by Wo with
the gqa_ratio.
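A sketch of the granularity rule, assuming rows are handed out in units of one KV head plus its `gqa_ratio` query heads. The struct `AttnSplit`, the function `split_attention`, and the rounding scheme are illustrative assumptions, not the code from this commit.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct AttnSplit {
    std::size_t q_rows;   // rows of Wq (and Wo) assigned to the device
    std::size_t kv_rows;  // rows of Wk / Wv assigned to the device
};

// Hypothetical helper: one unit is head_size * gqa_ratio rows of Wq, so dividing
// a device's Wq share by gqa_ratio always yields a whole number of KV heads.
// `fractions` are non-negative relative device weights with a positive sum.
std::vector<AttnSplit> split_attention(std::size_t n_head, std::size_t n_head_kv,
                                       std::size_t head_size,
                                       const std::vector<float>& fractions) {
    const std::size_t gqa_ratio = n_head / n_head_kv;
    const std::size_t n_units = n_head_kv;  // one unit per KV head
    float total = 0.0f;
    for (float f : fractions) total += f;

    std::vector<AttnSplit> result;
    std::size_t assigned = 0;
    float cum = 0.0f;
    for (std::size_t i = 0; i < fractions.size(); ++i) {
        cum += fractions[i];
        std::size_t upto = n_units;  // the last device takes whatever remains
        if (i + 1 < fractions.size()) {
            upto = std::min(n_units, static_cast<std::size_t>(cum / total * n_units + 0.5f));
        }
        const std::size_t units = upto - assigned;
        assigned = upto;
        result.push_back({units * head_size * gqa_ratio, units * head_size});
    }
    return result;
}
```

With 64 query heads, 8 KV heads and a head size of 128, an even split over two devices gives each device 4096 rows of Wq and 512 rows of Wk/Wv, i.e. exactly 4 KV heads.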
* Show memory used per device
* Make it work with partial offload
but no tensor overrides yet, just ngl < num_layers.
* Allow for f16 source in fused_rms_norm
* This results in faster PP.
Now PP is faster than split mode layer for L3-70B.
* Rename split mode "row" to split mode "graph"
* Leave FFN partial results as f16
* WIP GLM4.5 - runs with wrong results
* WIP GLM4.5 - this works
PP is already better than split mode layer, but TG for zero context
is kind of low - 60 vs 92 t/s. TG becomes better than split mode layer
at around 20k tokens. PP at 26k tokens is 1.55X of sm layer.
* Work around compiler bug
It issues a warning that there is an extra semicolon outside of a function,
but there isn't. If I remove the anonymous namespace and turn the
functions inside into static, the warning disappears, so clearly
a compiler bug.
* Make graph reuse work with split mode graph
* Remove more split mode row remnants
* WIP tensor overrides
Runs with wrong results, don't see where the issue could be.
* This works but is slow
Still does not work for row-interleaved quants
* Slightly better
* Slightly better
* Row-interleaved quants work
* Better
* Minor
* Guard against using split mode "graph" for unsupported models
* Guards against using merge_qkv with split mode "graph"
* WIP split mode attn
Works for LlaMA models, but not for GLM-4.5.
Doesn't seem to improve performance, so I guess no point in trying to
fix it.
* Split mode graph for qwen3moe
* Try to better distribute the splits
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* RPC support multiple devices
* rpc : update documentation (#16441)
Update the README file to match the newly added functionality of
exposing multiple devices from a single server.
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* Remove memory settings
* rpc : cache and reuse compute graphs (#15405)
Store the last computed graph and reuse it when possible.
Also do not return response from GRAPH_COMPUTE and assume it always
completes successfully. If this is not the case, the server closes
the connection. This saves us a network round trip to the server.
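A rough sketch of the "store the last graph and reuse it" idea on the client side. How the RPC backend actually detects a matching graph is not described here; the byte-wise comparison and the `GraphCache` type are assumptions for illustration only.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical client-side cache: remembers the last serialized graph so an
// identical graph does not have to be transferred again.
struct GraphCache {
    std::vector<std::uint8_t> last;
    bool has_last = false;

    // Returns true when the serialized graph differs from the cached one and
    // must be sent in full; returns false when the stored graph can be reused.
    bool needs_upload(const std::vector<std::uint8_t>& serialized) {
        if (has_last && serialized == last) {
            return false;
        }
        last = serialized;  // remember it for the next compute call
        has_last = true;
        return true;
    }
};
```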
* Add -cpu to include cpu backend
---------
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>
* Merge Q and K into a single tensor
* Make V mul mat follow QK mul mat
so they can be fused, which gives a slightly better TG performance.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* llama_model and llama_hparams
* llama_build_context
Surprisingly small reduction in llama.cpp compile time given
the reduction in LOCs (22k -> 14k)
* LLM_TN
llama.cpp compilation: 50 s -> 33 s
* llama_quantize
* arch names
* All graph building is now in llm-build-context.cpp
* hparams loading
llama.cpp is now just 9300 LOC, but still takes 32 seconds to compile.
* We are now at 6 seconds to build the src folder
* load -> create
We are not actually loading the tensors, but just creating them.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>