* fix adaptive p sampler rewinding too far back
* update comments
* correct default value for total_weight, more comments
* new variables/names
* update comment for n_rewind
* move null pointer check back to common_sampler_review()
* refactor weighted_sum and total_weight to vector<pair>, better boundary check in llama_review_adaptive_p_impl()
* Fix gguf-split.cpp splits output naming
With this fix, the .gguf extension of the source file is no longer included in the output file name before the split numbering.
ex:
No more model.gguf-00001-of-00200.gguf
Instead, model-00001-of-00200.gguf
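A minimal sketch of the corrected naming logic (the helper below is illustrative, not the actual gguf-split.cpp code): strip a trailing .gguf extension from the source path before appending the split numbering.
```cpp
#include <cstdio>
#include <filesystem>
#include <string>

// Illustrative helper: "model.gguf" + (1, 200) -> "model-00001-of-00200.gguf"
static std::string make_split_name(const std::string & src, int i_split, int n_split) {
    std::filesystem::path p(src);
    std::string base = p.extension() == ".gguf"
        ? (p.parent_path() / p.stem()).string()   // drop the ".gguf" suffix first
        : src;
    char suffix[64];
    std::snprintf(suffix, sizeof(suffix), "-%05d-of-%05d.gguf", i_split, n_split);
    return base + suffix;
}
```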
* increase GGML_MAX_CONTEXTS to 2048
* Revert GGML_MAX_CONTEXTS to 64
* server: enable checkpoint for recurrent models
create checkpoint after cancel
fix ban string and rm context during rewind
add checkpoint interval
only save recurrent cache
* save checkpoint during pp
---------
Co-authored-by: firecoperana <firecoperana>
* Revive fused delta-net
* Add command line argument for fused delta net
* Simplify/improve CUDA delta-net
* Add -fdn to llama-bench
* More CUDA fused delta net optimizations
* CPU optimizations
* Much faster fused delta-net on the CPU
It seems it is faster than the chunked implementation!
* Change meaning of fdn from bool flag to threshold value
* Use eps = 1e-6
* Give some nodes a name
* Partial Requant feature for llama-quantize
- Inspired by the recently ported --dry-run feature.
- Allows partially requantizing a split quantized .gguf by requantizing only the splits missing from the destination directory.
- Works both for GGUFs split tensor by tensor and for GGUFs split by groups of several tensors (though the latter is not well tested beyond 2 tensors per split).
- Vibe coded.
* Create output directory if it doesn't exist in llama-quantize
* Create output directory if it doesn't exist in gguf-split
* Add exit when directory fails to be created on Windows
* Use std::filesystem
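A hedged sketch of the directory-creation step using std::filesystem (the function name and error handling below are illustrative, not the exact llama-quantize/gguf-split code):
```cpp
#include <cstdio>
#include <cstdlib>
#include <filesystem>
#include <string>
#include <system_error>

// Illustrative helper: create the parent directory of the output path if needed,
// and bail out when creation fails (matches the "exit when directory fails to be
// created" behavior described above).
static void ensure_output_dir(const std::string & out_path) {
    const auto dir = std::filesystem::path(out_path).parent_path();
    if (dir.empty() || std::filesystem::exists(dir)) {
        return;
    }
    std::error_code ec;
    if (!std::filesystem::create_directories(dir, ec)) {
        std::fprintf(stderr, "failed to create directory '%s': %s\n",
                dir.string().c_str(), ec.message().c_str());
        std::exit(1);
    }
}
```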
* cleanup
When multiple prompts are sent in a single /v1/completions request,
each response needs to carry the correct index so the client can
match results to their corresponding prompts. The index field was
not being set on partial responses, final responses, or embedding
responses, causing batch results to all report index 0.
Set res->index = slot.task->index in send_partial_response,
send_final_response, and send_embedding.
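A self-contained illustration of the fix; the structs below are simplified stand-ins for the server's task/result types, not the actual definitions.
```cpp
#include <cstdio>
#include <vector>

struct server_task_sketch   { int index; };      // position of the prompt in the batch
struct server_result_sketch { int index = 0; };  // previously always left at 0

// The one-line idea applied in send_partial_response, send_final_response
// and send_embedding: copy the task's prompt index into the response.
static server_result_sketch make_result(const server_task_sketch & task) {
    server_result_sketch res;
    res.index = task.index;
    return res;
}

int main() {
    std::vector<server_task_sketch> tasks = {{0}, {1}, {2}};
    for (const auto & t : tasks) {
        std::printf("response index: %d\n", make_result(t).index); // 0, 1, 2
    }
}
```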
Generated with [Devin](https://cli.devin.ai/docs)
Co-authored-by: Joshua Jolley <jjolley@clearwateranalytics.com>
Co-authored-by: Devin <noreply@cognition.ai>
* wip: port MTP architecture
Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ikllama`.
Changes include:
- Updating `llama_batch` to support `mtp_params`.
- Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft).
- Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`).
- Adapting the embedding extraction logic to skip MTP update passes.
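A hedged sketch of what the batch-level MTP parameters could look like; the type and field names below are illustrative stand-ins, not the actual ik_llama definitions.
```cpp
#include <cstdint>

// The decode path distinguishes the three MTP passes mentioned above.
enum class mtp_op_sketch : uint8_t { NONE, WARMUP, UPDATE, DRAFT };

struct mtp_params_sketch {
    mtp_op_sketch op = mtp_op_sketch::NONE; // which MTP pass this batch performs
};

struct llama_batch_sketch {
    int32_t           n_tokens = 0;
    mtp_params_sketch mtp;                  // carried into llama_decode_internal
};
```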
* Refactors `server_slot` to support generic speculative decoding (MTP or Draft Model).
* core: enable hybrid outputs (logits + embeddings) for MTP support
* fix(mtp): correct KV-cache slot finding for updates
* fix(mtp): persist hidden states to prevent context corruption during drafting
* refactor(mtp): clean unused code
* fix(mtp): update server to new functions name
* fix(mtp): fix graph and save hidden state
* mtp: refactor integration, context params and kv cache search
* mtp: fix hidden state extraction and speculative acceptance flow
* server: fix MTP warmup for long prompts and reset token buffer
* llama: refactor MTP operation state to context parameters
* server: fix n_past calculation in MTP acceptance
* llama: fix mtp enable flags
* speculative: refactor MTP to use common_speculative interface
* context: remove unused signatures
* clip: fix deprecated enum-enum conversion warning
* common: fix format string crash in help message
* context: fix mtp activation logic
* adaptive p: update internal state only if not rewinding
* adaptive p: conditional update for speculative decoding
* adaptive p: refactor to rewind instead of update
* adaptive p fix: better comments
* fix rewind check
* add record to handle multi-token rewind
* better comment
* spec : add self speculative decoding and ngram-mod and refactor
common : use common_ prefix for common library function
llama : use LLAMA_TOKEN_NULL
spec : add self speculative decoding (no draft model required) + refactor
spec : add ngram-mod
spec : various improvements to ngram-map + docs
spec : fix the check-rate logic of ngram-simple
common : add common_speculative_is_compat()
spec : simplify time measurement using common_time_meas
refactor common_sampler_init
refactor common_token_to_piece
refactor and fix cur_p bug
clean up
* spec : remove check rate
* spec: show warnings instead of aborting
---------
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Sascha Rogmann <59577610+srogmann@users.noreply.github.com>
This implements the ability to load, unload, and scale control vectors
(representation engineering) mid-inference, following the existing
task-queue pattern used by LoRA adapters.
New Endpoints:
- GET /control-vectors
- POST /control-vectors/load
- POST /control-vectors/unload
- POST /control-vectors/apply (handles scaling)
Technical Notes:
- Centralizes vector aggregation logic to share implementation between
load, unload, and apply tasks.
- Vectors are applied globally to the model context.
- Enforces dimension validation on load to safely reject incompatible
vectors.
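A hedged sketch of the centralized aggregation step (struct and function names are illustrative, not the actual server code): every loaded vector contributes scale * data, and the summed buffer is what gets applied to the model context.
```cpp
#include <cstddef>
#include <string>
#include <vector>

struct loaded_cvec_sketch {
    std::string        name;
    float              scale = 0.0f;  // 0 means loaded but currently inactive
    std::vector<float> data;          // n_layers * n_embd values, validated at load time
};

// Sum all active vectors into one buffer; an empty result means "clear".
static std::vector<float> aggregate_cvecs(const std::vector<loaded_cvec_sketch> & vecs) {
    std::vector<float> out;
    for (const auto & v : vecs) {
        if (v.scale == 0.0f || v.data.empty()) {
            continue;
        }
        if (out.empty()) {
            out.assign(v.data.size(), 0.0f);
        }
        for (std::size_t i = 0; i < v.data.size(); ++i) {
            out[i] += v.scale * v.data[i];
        }
    }
    return out;
}
```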
Co-authored-by: Gapeleon <gapeleon@users.noreply.github.com>
* WIP - not working
* WIP - not working
* WIP - GPT-OSS working
However, extremely stupid. The only way I could correctly repack the
up/gate experts is to copy up and gate into host buffers, repack
into another host buffer, copy back into the ffn_up_gate_exps tensor.
This is going to be very slow for giant 500 GB models.
My attempts to do this via a compute graph on the backend holding
the tensors were unsuccessful.
For GPT-OSS-20B I see ~6-7% better PP when using the original
ik_llama.cpp fused_up_gate CUDA implementation, and ~10% when
using the small batch size implementation.
Other models are not working yet on CUDA as I need to fix the
fused mul-unary implementation.
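A hedged sketch of the host round-trip described above; the function name is illustrative and the fused layout (up data followed by gate data) is an assumption made for brevity, not the actual ik_llama.cpp layout.
```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

#include "ggml.h"
#include "ggml-backend.h"

static void repack_up_gate_via_host(ggml_tensor * up, ggml_tensor * gate, ggml_tensor * fused) {
    const size_t up_size   = ggml_nbytes(up);
    const size_t gate_size = ggml_nbytes(gate);

    std::vector<char> up_host(up_size), gate_host(gate_size), fused_host(up_size + gate_size);

    // device -> host
    ggml_backend_tensor_get(up,   up_host.data(),   0, up_size);
    ggml_backend_tensor_get(gate, gate_host.data(), 0, gate_size);

    // repack on the host (assumed layout: up data followed by gate data)
    std::copy(up_host.begin(),   up_host.end(),   fused_host.begin());
    std::copy(gate_host.begin(), gate_host.end(), fused_host.begin() + up_size);

    // host -> device, into the fused ffn_up_gate_exps tensor
    ggml_backend_tensor_set(fused, fused_host.data(), 0, fused_host.size());
}
```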
* WIP
* WIP - Qwen3-MoE (and hopefully all others) working
But when I say here and in the previous commit "working",
I mean PP is working. TG is still broken.
* WIP: TG seems to be working
* Minor
* Add command line option to merge experts up/gate
* Add merge up/gate command line parameter to llama-bench
* Turn off merge_up_gate_exps if split mode graph
It is not yet implemented
* When no bias, allow merging up/gate with tensor overrides
* Arghh, we need to increase the context size again
* Cleanup
* server: improve speed of speculative decoding
change logs
rpc: add recompute
spec dec fix
* Fix n_batch_size not set to context size for draft model
---------
Co-authored-by: firecoperana <firecoperana>
* include cuda-params and -ot in llama-bench output
* cleanup redundant type mapping
* fix wrong field name
* fix preexisting mistake in cuda_params help text (default value)
* fix preexisting mistake in kompute column header
* adjust code style to match current norms
* simplify/fix inverted columns
* fix field->value pairings/order
* remove dead field `f16_kv`
* sql printer deserves a way out too
* actually enable the new improvements....
* Remove most of split mode row
* WIP
* WIP: also allocate the KV cache using tensor split
* WIP: it runs with wrong results
But it also looks like the backend scheduler is not going to help:
* It copies mask and input positions to GPU 0
* => RoPE ops must run on GPU 0
* => To proceed with the attn evaluation, GPU 1 must wait for GPU 0 to finish its
entire attn calculation
* Same with FFN. The rms_norm gets scheduled on GPU 0. Hence, GPU 1 must
wait for GPU 0 to finish its entire FFN calculation before it can
start (as it needs to copy the result of rms_norm from GPU 0)
* => Seems useless without writing a bespoke TP scheduling
* WIP
* This works, but it is slow
* This is slightly better
the graph is still not being computed in parallel.
Why? Because the scheduler creates graph splits where the
result of the computation on one GPU becomes an input for the
other split. Hence, to trigger the computation on the second GPU
one needs to wait for the computation on the first GPU to finish,
even though the two can be done in parallel up to the synchronization
point. So, all that is left to do is to trick the scheduler into creating
two splits that can be done in parallel, and then have a graph split
where the results get combined.
* Playing games with the scheduler
This change tricks it into doing the right thing^TM.
Still quite a bit slower than split mode layer for the 8B LlaMA model.
But for the 70B LlaMA it now beats split mode layer for TG:
28 t/s vs 24.4 t/s. PP is 627 t/s vs 744 t/s.
In comparison, split mode "row" in mainline gets
484 t/s PP and 19.3 t/s TG.
* Fix attn split
Granularity for Wq, Wo is not just head size, but
head size * gqa_ratio.
Otherwise the Wk, Wv tensors end up not being a multiple of the
head size when we divide the split determined by Wo by
the gqa_ratio.
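A small arithmetic sketch of the alignment constraint (the numbers are illustrative):
```cpp
#include <cstdio>

int main() {
    const int head_size = 128;
    const int n_head    = 64;                       // query heads
    const int n_head_kv = 8;                        // KV heads
    const int gqa_ratio = n_head / n_head_kv;       // 8

    // Split granularity for Wq/Wo rows must be head_size * gqa_ratio ...
    const int granularity = head_size * gqa_ratio;  // 1024

    const int rows_q   = head_size * n_head;                        // 8192
    const int split_q  = (rows_q / 2 / granularity) * granularity;  // 4096, aligned
    // ... so that the derived Wk/Wv split stays a multiple of head_size.
    const int split_kv = split_q / gqa_ratio;                       // 512 = 4 * 128

    std::printf("Wq/Wo split: %d rows, Wk/Wv split: %d rows\n", split_q, split_kv);
    return 0;
}
```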
* Show memory used per device
* Make it work with partial offload
but no tensor overrides yet, just ngl < num_layers.
* Allow for f16 source in fused_rms_norm
* This results in faster PP.
Now PP is faster than split mode layer for L3-70B.
* Rename split mode "row" to split mode "graph"
* Leave FFN partial results as f16
* WIP GLM4.5 - runs with wrong results
* WIP GLM4.5 - this works
PP is already better than split mode layer, but TG for zero context
is kind of low - 60 vs 92 t/s. TG becomes better than split mode layer
at around 20k tokens. PP at 26k tokens is 1.55X of sm layer.
* Work around compiler bug
It issues a warning that there is an extra semicolon outside of a function,
but there isn't. If I remove the anonymous namespace and make the
functions inside static, the warning disappears, so this is clearly
a compiler bug.
* Make graph reuse work with split mode graph
* Remove more split mode row remnants
* WIP tensor overrides
Runs with wrong results, don't see where the issue could be.
* This works but is slow
Still does not work for row-interleaved quants
* Slightly better
* Slightly better
* Row-interleaved quants work
* Better
* Minor
* Guard against using split mode "graph" for unsupported models
* Guards against using merge_qkv with split mode "graph"
* WIP split mode attn
Works for LlaMA models, but not for GLM-4.5.
Doesn't seem to improve performance, so I guess no point in trying to
fix it.
* Split mode graph for qwen3moe
* Try to better distribute the splits
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* RPC support multiple devices
* rpc : update documentation (#16441)
Update the README file to match the newly added functionality of
exposing multiple devices from a single server.
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* Remove memory settings
* rpc : cache and reuse compute graphs (#15405)
Store the last computed graph and reuse it when possible.
Also do not return a response from GRAPH_COMPUTE and assume it always
completes successfully. If this is not the case, the server closes
the connection. This saves us a network round trip to the server.
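A hedged, simplified illustration of the caching idea (not the actual rpc-server code): remember the serialized form of the last graph and reuse it when the incoming GRAPH_COMPUTE payload is identical.
```cpp
#include <cstdint>
#include <vector>

struct rpc_graph_cache_sketch {
    std::vector<uint8_t> last_blob; // serialized graph from the previous request

    // true when the previously built graph can be reused as-is
    bool can_reuse(const std::vector<uint8_t> & blob) const {
        return !last_blob.empty() && blob == last_blob;
    }

    void remember(const std::vector<uint8_t> & blob) {
        last_blob = blob;
    }
};
```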
* Add -cpu to include cpu backend
---------
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>