* server: spec checkpoints for recurrent models
* fix: save/restore sampler state during speculative checkpoint
When speculative decoding rejects draft tokens and restores the
recurrent state checkpoint, the sampler (RNG, grammar, prev tokens)
must also be restored to maintain consistency. Without this, the
sampler state reflects the rejected draft tokens, leading to
potential divergence.
Uses common_sampler_clone() to snapshot the sampler before the
speculative batch decode, and restores it on rejection.
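A minimal sketch of that flow, assuming hypothetical checkpoint/verification helpers around the real `common_sampler_clone()`/`common_sampler_free()` calls:
```cpp
// roll back after a rejected speculative batch: restore both the
// recurrent-state checkpoint and the sampler snapshot taken beforehand
void spec_decode_with_checkpoint(llama_context * ctx, server_slot & slot, llama_batch & batch_draft) {
    common_sampler * smpl_snapshot = common_sampler_clone(slot.smpl);
    recurrent_checkpoint_save(ctx, slot);          // hypothetical helper

    llama_decode(ctx, batch_draft);                // speculative batch decode

    if (!verify_draft(ctx, slot)) {                // hypothetical verification
        recurrent_checkpoint_restore(ctx, slot);   // hypothetical helper
        common_sampler_free(slot.smpl);            // this sampler saw the rejected tokens
        slot.smpl = smpl_snapshot;                 // restore RNG/grammar/prev tokens
    } else {
        common_sampler_free(smpl_snapshot);        // draft accepted, snapshot unneeded
    }
}
```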
* server: snapshot recurrent state in tensor
* reset ngram mod state for rejected tokens
* server: refactor checkpoint state logic
* speculative: fix sampler for checkpoints
* recurrent model: implement recurrent kernel checkpoint
* recurrent model: refactor api
* spec: free rbudget before overwriting
* wip: separate llama_context for MTP with graph reuse
* wip: fix KV cache desync with separate MTP context
* refactor: remove dead mtp logic code, encapsulate KV mirroring
* mtp-context: derive args directly from the main model's context
* mtp: fix kv cache positions
* clean up small comments
* minor refactor for context shift
* wip: build spec tuner for specific args
* wip: test different reward system
* spec-tune: fix the reward to find best params given a good TPS
* spec-tune: refactor logic into its own file
* minor cleanup of comments and modules
* wip: port MTP architecture
Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ikllama`.
Changes include:
- Updating `llama_batch` to support `mtp_params`.
- Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft); see the sketch after this list.
- Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`).
- Adapting the embedding extraction logic to skip MTP update passes.
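A rough illustration of how these pieces fit together; the `llama_mtp_params` layout and enum values below are assumptions based on the operations listed, not the exact ikllama definitions:
```cpp
// assumed shape of the per-batch MTP control added to llama_batch
enum llama_mtp_op {
    LLAMA_MTP_OP_NONE,    // normal decode, no MTP work
    LLAMA_MTP_OP_WARMUP,  // prime the MTP head on the prompt
    LLAMA_MTP_OP_UPDATE,  // refresh MTP state after accepted tokens
    LLAMA_MTP_OP_DRAFT,   // produce draft tokens from the MTP head
};

struct llama_mtp_params {
    enum llama_mtp_op op;
};

// usage: feed the main pass's last hidden state into the MTP draft pass,
// then tag the batch so llama_decode_internal takes the MTP path
void mtp_draft_step(llama_context * ctx, llama_batch & batch, const float * hidden_state) {
    llama_set_draft_input_hidden_state(ctx, hidden_state); // signature assumed
    batch.mtp_params.op = LLAMA_MTP_OP_DRAFT;
    llama_decode(ctx, batch);
}
```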
* Refactors `server_slot` to support generic speculative decoding (MTP or Draft Model).
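A rough picture of the resulting slot state; member and enum names here are assumed for illustration, not the actual `server_slot` fields:
```cpp
// assumed: one slot can speculate either via a separate draft model or via MTP
enum class speculative_mode { none, draft_model, mtp };

struct server_slot {
    speculative_mode spec_mode = speculative_mode::none;

    llama_context  * ctx_dft = nullptr; // draft-model context (draft_model mode only)
    common_sampler * smpl    = nullptr; // sampling state shared by both modes

    bool can_speculate() const { return spec_mode != speculative_mode::none; }
};
```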
* core: enable hybrid outputs (logits + embeddings) for MTP support
* fix(mtp): correct KV-cache slot finding for updates
* fix(mtp): persist hidden states to prevent context corruption during drafting
* refactor(mtp): remove unused code
* fix(mtp): update server to new function names
* fix(mtp): fix graph and save hidden state
* mtp: refactor integration, context params and kv cache search
* mtp: fix hidden state extraction and speculative acceptance flow
* server: fix MTP warmup for long prompts and reset token buffer
* llama: refactor MTP operation state to context parameters
* server: fix n_past calculation in MTP acceptance
* llama: fix mtp enable flags
* speculative: refactor MTP to use common_speculative interface
* context: remove unused signatures
* clip: fix deprecated enum-enum conversion warning
* common: fix format string crash in help message
* context: fix mtp activation logic
* llama: always use the extracted embedding
* llama: write all embeddings to the KV cache
* llama: revert logic so MTP does not run on unsupported archs
* llama: allocate all the n_outputs for MTP
* wip
* server-context: get only the last embedding for hidden state
* ggml-backend: fix out-of-bounds array access in debug build
* server-context: run MTP KV update for each prompt batch
* revert segmentation fault fixes
* glm-mtp(feat): optimize graph embedding and recursive drafting
* spec : add self speculative decoding and ngram-mod and refactor
common : use common_ prefix for common library functions
llama : use LLAMA_TOKEN_NULL
spec : add self speculative decoding (no draft model required) + refactor
spec : add ngram-mod (see the sketch below)
spec : various improvements to ngram-map + docs
spec : fix the check-rate logic of ngram-simple
common : add common_speculative_is_compat()
spec : simplify time measurement using common_time_meas
refactor common_sampler_init
refactor common_token_to_piece
refactor and fix cur_p bug
clean up
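For context, self speculative decoding drafts tokens without a second model by matching n-grams in the already-generated sequence; a minimal standalone sketch of the idea (not the actual `common_speculative` implementation):
```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using llama_token = int32_t;

// propose a draft by finding the most recent earlier occurrence of the
// last n tokens and replaying the tokens that followed that occurrence
static std::vector<llama_token> ngram_draft(
        const std::vector<llama_token> & history,
        size_t n,         // n-gram length to match
        size_t n_draft) { // max draft tokens to propose
    if (history.size() < n) {
        return {};
    }

    const auto pat = history.end() - n; // the pattern: last n tokens

    // scan backwards for an earlier occurrence of the pattern
    for (size_t i = history.size() - n; i-- > 0; ) {
        if (std::equal(pat, history.end(), history.begin() + i)) {
            const size_t start = i + n; // tokens that followed the match
            const size_t count = std::min(n_draft, history.size() - start);
            return std::vector<llama_token>(history.begin() + start, history.begin() + start + count);
        }
    }
    return {}; // no match: fall back to plain decoding
}
```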
* spec : remove check rate
* spec: show warnings instead of aborting
---------
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Sascha Rogmann <59577610+srogmann@users.noreply.github.com>
* server : integrate speculative decoding
* server: Fix field names
* server: fix include, whitespace
* fix compile errors in speculative.cpp
* add llama_sampling_sample_and_accept_n to sampling
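The helper mirrors mainline's `common_sampler_sample_and_accept_n()`: sample from the target at each draft position and continue only while the samples agree with the draft. A schematic sketch using mainline-style names (the ported `llama_sampling_*` signature may differ); it assumes `idxs.size() == draft.size() + 1`:
```cpp
#include "sampling.h" // common sampling helpers (common_sampler_sample/accept)

// sample up to idxs.size() tokens; stop after the first position where the
// target's sample disagrees with the draft token, since later draft
// positions were conditioned on the rejected token
std::vector<llama_token> sample_and_accept_n(
        common_sampler * smpl, llama_context * ctx,
        const std::vector<int> & idxs,            // output indices in the batch
        const std::vector<llama_token> & draft) {
    std::vector<llama_token> result;
    result.reserve(idxs.size());

    for (size_t i = 0; i < idxs.size(); ++i) {
        const llama_token id = common_sampler_sample(smpl, ctx, idxs[i]);
        common_sampler_accept(smpl, id, true);
        result.push_back(id);

        if (i < draft.size() && draft[i] != id) {
            break; // draft token rejected
        }
    }
    return result; // result.size() - 1 == number of accepted draft tokens
}
```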
* finish porting speculative decoding in server
* port functions from common/speculative, common/sampling
* remove arg
* fix function names
* init params_dft to none
* correct value for n_ctx
* prefix kv cache tensors with model name to avoid conflict
* fix call arguments
* fix spec decoding args
* correct slot.id
* use n_max
* port the rest of sampling funcs
* fix func arguments
* slot.id starts at 1?
* Revert "prefix kv cache tensors with model name to avoid conflict"
This reverts commit fbd5dfd866.
* disable draft logging
* disable logging in speculative.cpp
In mainline these would be LOG_DEBUG calls, but since ik_llama doesn't support that, logging is disabled entirely.
* add more draft model parameters
* fix
* pass flash_attn
* add speculative params for parity
* set speculative params in launch_slot_with_task instead