* wip: port MTP architecture
Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ikllama`.
Changes include:
- Updating `llama_batch` to support `mtp_params`.
- Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft).
- Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`).
- Adapting the embedding extraction logic to skip MTP update passes.
* Refactors `server_slot` to support generic speculative decoding (MTP or Draft Model).
* core: enable hybrid outputs (logits + embeddings) for MTP support
* fix(mtp): correct KV-cache slot finding for updates
* fix(mtp): persist hidden states to prevent context corruption during drafting
* refactor(mtp): clean unused code
* fix(mtp): update server to new function names
* fix(mtp): fix graph and save hidden state
* mtp: refactor integration, context params and kv cache search
* mtp: fix hidden state extraction and speculative acceptance flow
* server: fix MTP warmup for long prompts and reset token buffer
* llama: refactor MTP operation state to context parameters
* server: fix n_past calculation in MTP acceptance
* llama: fix mtp enable flags
* speculative: refactor MTP to use common_speculative interface
* context: remove unused signatures
* clip: fix deprecated enum-enum conversion warning
* common: fix format string crash in help message
* context: fix mtp activation logic
* qwen3next: add architecture support and recurrent-state fixes
* qwen3next: optimize broadcast sub and single-seq ssm conv
* cuda: build MoE row mapping on device in mul_mat_id
* cuda: add guarded multi-seq fast path for ssm_conv
* docs: update qwen3next perf report for cuda MoE/SSM tuning
* cuda: reduce qwen3next moe/ssm sync overhead and refresh eval
* qwen3next: split cpu/cuda eval builds and tune PP scheduling
* qwen3next: harden seq-state flow and support optional dense FFN layers
* qwen3next: trim delta-net graph overhead in chunking path
* qwen3next: remove redundant v_conv cont in delta path
* qwen3next: avoid extra cont on linear attention output
* qwen3next: drop redundant cont before recurrent state flatten
* qwen3next: keep recurrent state in 4d layout through delta path
* qwen3next: add fused delta-net op and wire model path
* tests: add backend-op coverage for ggml_delta_net
* qwen3next: add runtime switch for fused delta-net path
* docs: refresh qwen3next perf review and benchmark matrix
* qwen3next: default fused delta-net off and document quality checks
* qwen3next: add decode-only fused delta mode
* qwen3next: make fused delta safe by default and fix fused tensor layout
* qwen3next: warn when forcing fused decode mode
* qwen3next: add fused-delta regression runner script
* qwen3next: integrate fused regression into eval harness
* qwen3next: clean up chunked delta-net shape handling
* qwen3next: add absolute sanity guards to fused regression
* qwen3next: add unified regression runner script
* qwen3next: disable flash-attn for cpu-only contexts
* docs: reconcile qwen3next status and remaining upstream gaps
* common: add qwen3next fused-delta runtime flag
* cuda: add qwen3next delta-net kernel dispatch override
* docs: update qwen3next quality and serving baseline findings
* qwen3next: keep fused delta on safe path and remove PR artifacts
* qwen3next: align autoregressive delta-net decode layout
* Revert "qwen3next: align autoregressive delta-net decode layout"
This reverts commit 9241164a5e.
* cuda: port solve-tri fast-paths for qwen3next delta-net
* qwen3next: add fused-delta runtime flag and drop env toggle
* qwen3next: make fused delta single-flag and default on
* Account for GPU arch differences
* Revert "cuda: build MoE row mapping on device in mul_mat_id"
This reverts commit 89e9ecfa84.
* qwen3next: drop non-essential MoE scheduling and split heuristics
* qwen3next: avoid generic ggml_sub broadcast changes
* llama: restore only_active_experts log message
* Remove unnecessary hacks, disable fusion for now.
* qwen3next: port hybrid recurrent state memory semantics
* qwen3next: clean up recurrent state slot plumbing
* qwen3next: fix hybrid V-cache layout plumbing
* qwen3next: guard recurrent state slots against kv capacity
* qwen3next: persist recurrent state in session data
- serialize/restore qwen3next cache.s_l in state/session paths
- bump session and sequence-state file versions for format change
- fallback to single-token chunking for mixed repeated seq_id batches
* qwen3next: drop unused fused-delta builder path
- remove dead build_delta_net_fused lambda
- remove unused llm_build_context::fused_delta member
* qwen3next: remove unused fused-delta CLI/context plumbing
- drop -fd/-no-fd options and related YAML dump field
- remove fused_delta fields from public/internal context params
- remove fused_delta assignment and logging in context init
* ggml: remove unused DELTA_NET operator stack
* Missing include
* Reorder ops/unary ops
So that we don't change the enum values of the mul mat ops again
* Minor
* Discard unnecessary changes in llama-build-context.cpp
* Minor
* Revert "Discard unnecessary changes in llama-build-context.cpp"
This reverts commit edadb80ed6.
* Increase GGML_SCHED_MAX_SPLITS - required for larger u-batches
* Fix CPU concat in the TG case: 7.25 -> 10.5 t/s for Qwen3Next
* Fix CPU sum_rows: 10.5 -> 13.6 t/s for Qwen3Next
It was single-threaded and was taking ~25% of the computation time
during TG. It is now down to 2%.
Strangely enough, I measure 13.6 t/s with llama-bench, but if I
let the model give me an actual response with llama-cli, I get close
to 17 t/s.
* Fix CPU scale: 13.6 -> 16.7 t/s for Qwen3Next
For Qwen3Next there is a scale op on a largish tensor (548k elements)
that has a single row for TG, so was done in a single thread.
We now simply use blocks of 1024 elements.
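A minimal sketch of the block partitioning described above, not the actual ggml kernel. The name `scale_blocks` and the `ith`/`nth` worker-index convention are assumptions made for illustration; only the block size of 1024 comes from this commit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: worker `ith` out of `nth` scales every nth block of
// `block` elements. The work is split by flat element index rather than by
// row, so a tensor with a single row is still spread over all workers.
static void scale_blocks(float * x, size_t n, float s, int ith, int nth, size_t block = 1024) {
    assert(nth > 0 && ith >= 0 && ith < nth && block > 0);
    const size_t nblocks = (n + block - 1) / block;   // round up for a partial last block
    for (size_t ib = ith; ib < nblocks; ib += nth) {
        const size_t first = ib * block;
        const size_t last  = std::min(first + block, n);
        for (size_t i = first; i < last; ++i) {
            x[i] *= s;
        }
    }
}
```

Each worker thread would call `scale_blocks` with its own `ith`. The blocks are disjoint, so no synchronization is needed between workers.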
* Optimize CPU mul: 16.7 -> 17.6 t/s for Qwen3Next
* CPU: fuse transpose -> cont -> sum_rows -> transpose: 17.6 -> 23.1 t/s for Qwen3Next
* Optimize CPU repeat: 176 -> 200 t/s for Qwen3Next PP-512
* Multithreading for OP_SUB
* Don't commit with timing trace on
* Multithread neg and sigmoid
* Be able to turn on/off fusion more easily (CPU)
* Name the mul_mat ops so we know where the time goes
* WIP
* Much better PP on CUDA
* CUDA: fuse transpose -> cont -> sum_rows -> transpose
Needs non-contiguous variant of sum_rows.
On the CPU this gave 30+% improvement in TG performance,
on CUDA it is a disappointing 6-7%. I guess this is because
Georgi's cont CPU implementation was so bad that skipping
it made such a big difference.
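A rough sketch of what the fusion saves, written on plain arrays rather than ggml tensors. It assumes a 2D tensor stored with `ne0` as the fastest-varying dimension and that sum_rows reduces along that dimension; `sum_unfused` and `sum_fused` are hypothetical names, not functions in this codebase.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused reference: materialize the transposed copy (transpose + cont),
// then sum each row of the copy. The result has ne0 entries.
static std::vector<float> sum_unfused(const std::vector<float> & x, size_t ne0, size_t ne1) {
    assert(x.size() == ne0 * ne1);
    std::vector<float> t(ne0 * ne1);
    for (size_t i1 = 0; i1 < ne1; ++i1) {
        for (size_t i0 = 0; i0 < ne0; ++i0) {
            t[i0 * ne1 + i1] = x[i1 * ne0 + i0];
        }
    }
    std::vector<float> out(ne0, 0.0f);
    for (size_t i0 = 0; i0 < ne0; ++i0) {
        for (size_t i1 = 0; i1 < ne1; ++i1) {
            out[i0] += t[i0 * ne1 + i1];
        }
    }
    return out;
}

// Fused variant: accumulate directly from the original layout, which is
// the non-contiguous sum_rows mentioned above. No temporary copy is made.
static std::vector<float> sum_fused(const std::vector<float> & x, size_t ne0, size_t ne1) {
    assert(x.size() == ne0 * ne1);
    std::vector<float> out(ne0, 0.0f);
    for (size_t i1 = 0; i1 < ne1; ++i1) {
        for (size_t i0 = 0; i0 < ne0; ++i0) {
            out[i0] += x[i1 * ne0 + i0];
        }
    }
    return out;
}
```

Both variants add the same values in the same order for each output element, so the fused path skips the copy without changing the result.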
* CUDA: faster mul for special case relevant for Qwen3Next
Worth 1% in TG
* Fix CPU OP_CONT
---------
Co-authored-by: yurko <yurko@local>
Co-authored-by: Yurko <yurko@example.com>
Co-authored-by: yurko <yurko@pop-os.tail5a1a6b.ts.net>
Co-authored-by: Yurko Hoshko <YurkoHoshko@users.noreply.github.com>
* WIP
* This works but is slow
* Turn off the up / gate clamps for now
* OK we need the clamping
* Fuse the clamp (CUDA)
* Fuse the clamp (CPU)
* WIP
* Be able to use merged q, k, v
* Be able to use merged up/gate experts
* Fuse the clamp (CUDA mmvq)
* Mimo-2 support
* Fix bug for head sizes not being the same
It still does not solve the Mimo-2 quantized cache issue.
* Fix quantized cache
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* model : Granite docling + Idefics3 preprocessing (SmolVLM) (#16206)
* feat: Add granite-docling conversion using trillion pretokenizer
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add granite-docling vocab pre enum
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use granite-docling pre
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add clip_is_idefics3
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Allow multi-token boundary sequences for image templating
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add tiling support for idefics3 in clip.cpp
This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Partial support for full templating for idefics3 in mtmd
There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately then concatenated.
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Fully working image preprocessing for idefics3 w/ resize and slicing
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use the longest side instead of size * scale_factor
For Granite Docling, these come out to the same value, but that was just a
coincidence.
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Allow batch encoding and remove clip_is_idefics3
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Remove unnecessary conditionals for empty token vectors
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Use image_manipulation util
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* add test model
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
# Conflicts:
# convert_hf_to_gguf.py
# convert_hf_to_gguf_update.py
# gguf-py/gguf/constants.py
# gguf-py/gguf/gguf_writer.py
# src/llama-vocab.cpp
# src/llama-vocab.h
* mtmd : support home-cooked Mistral Small Omni (#14928)
* model : add LightOnOCR-1B model (#16764)
* model : add LightOnOCR-1B model
* add test
# Conflicts:
# convert_hf_to_gguf.py
# gguf-py/gguf/constants.py
* mtmd : fix idefics3 preprocessing (#16806)
* mtmd : fix idefics3 preprocessing
* disable granite test
* fix test for granite
* model: Add support for CogVLM model (#15002)
* Added GGUF mappings for CogVLM model
* Add tensor mapping for CogVLM visual encoder
* Add CogVLM to conversion script, no vision part yet
* Added CogVLM vision model to conversion script
* Add graph for CogVLM CLIP model
* Add graph for CogVLM
* Fixes for CogVLM. Now compiles.
* Model now runs
* Fixes for cogvlm graph
* Account for graph context change after rebase
* Changes for whitespace
* Changes in convert script according to comments
* Switch CogVLM LLM graph to merged QKV tensor
* Use rope_type variable instead of direct definition
* Change CogVLM CLIP encoder to use SWIGLU
* Switch CogVLM CLIP to use merged QKV
* Apply rebase edits and remove ggml_cont call that is now unnecessary
* clean up
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
# Conflicts:
# convert_hf_to_gguf.py
# examples/mtmd/clip.cpp
# gguf-py/gguf/constants.py
# gguf-py/gguf/tensor_mapping.py
# src/llama-arch.cpp
# src/llama-arch.h
# src/llama-model.cpp
# src/llama-model.h
* mtmd: refactor preprocessing + support max/min pixels (#16878)
* mtmd: refactor preprocessing + support max/min pixels
* fix mlp type
* implement min/max pixels
* improve hparams
* better image preproc for qwen
* fix
* fix out of bound composite
* fix (2)
* fix token calculation
* get_merge_kernel_size()
* fix llama4 and lfm2
* gonna fix them all
* use simple resize for qwen
* qwen: increase min tokens
* no resize if dst size == src size
* restore to initial min/max tokens value for qwen
# Conflicts:
# examples/mtmd/clip.cpp
* clip : use FA (#16837)
* clip : use FA
* cont : add warning about unsupported ops
* implement "auto" mode for clip flash attn
* clip : print more detailed op support info during warmup
* cont : remove obsolete comment [no ci]
* improve debugging message
* trailing space
* metal : remove stray return
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* model: add Janus Pro for image understanding (#16906)
* Add support for Janus Pro
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Address reviewer suggestions
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Add JANUS_PRO constant
* Update clip model handling
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
* Update tools/mtmd/clip.cpp
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* Refactor JANUS_PRO handling in clip.cpp
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
* Update tools/mtmd/clip.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* em whitespace
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
# Conflicts:
# convert_hf_to_gguf.py
# gguf-py/gguf/constants.py
# gguf-py/gguf/tensor_mapping.py
* mtmd: pad mask for qwen2.5vl (#16954)
* mtmd: pad mask for qwen2.5vl
* improve
* mtmd: add --image-min/max-tokens (#16921)
* mtmd: improve struct initialization (#16981)
* mtmd: allow QwenVL to process larger image by default (#17020)
* Disable flash attention
* mtmd : fix embedding size for image input (#17123)
* mtmd: fix patch_size initialized to random value in audio models (#17128)
* mtmd: fix patch_size initialized to random value in audio models
* add default hparams
* add llama_model_n_embd_inp
* Fix load qwen3 vl
Change batch size
* Add description
* Fix cli build error
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Tianyue-Zhao <zhaotianyue@outlook.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Zhiyong Wang <85110830+ravenouse@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: firecoperana <firecoperana>
* Fixing Gigachat support
* Gigachat: CUDA FA (needs 192 x 192 for MLA = 3)
* Gigachat: CPU FA (needs 192 x 192 for MLA = 3)
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* llama_model and llama_hparams
* llama_build_context
Surprisingly small reduction in llama.cpp compile time given
the reduction in LOCs (22k -> 14k)
* LLM_TN
llama.cpp compilation: 50 s -> 33 s
* llama_quantize
* arch names
* All graph building is now in llm-build-context.cpp
* hparams loading
llama.cpp is now just 9300 LOC, but still takes 32 seconds to compile.
* We are now at 6 seconds to build the src folder
* load -> create
We are not actually loading the tensors, but just creating them.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>