server: improve speed of speculative decoding (#1119)

* server: improve speed of speculative decoding

Change log:

- rpc: add recompute
- speculative decoding fix

* Fix n_batch_size not being set to the context size for the draft model (see the sketch below)
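
A minimal sketch of what the batch-size fix amounts to, assuming the standard llama.cpp context API; the variable names n_ctx_dft, model_dft, and ctx_dft are illustrative placeholders, not the actual server code from this commit:

// Sketch: size the draft model's batch to its full context so speculative
// drafting is not capped by the default n_batch. All names other than the
// llama.cpp API calls are assumptions.
llama_context_params cparams_dft = llama_context_default_params();
cparams_dft.n_ctx   = n_ctx_dft;          // draft model context size (placeholder)
cparams_dft.n_batch = cparams_dft.n_ctx;  // the fix: match batch size to context size
llama_context * ctx_dft = llama_new_context_with_model(model_dft, cparams_dft);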

---------

Co-authored-by: firecoperana <firecoperana>
Author: firecoperana <firecoperana>
Committed by: GitHub on 2026-01-10 00:01:22 -06:00
Parent: 6695c6c945
Commit: c1931663ad

7 changed files with 164 additions and 135 deletions


@@ -3568,6 +3568,7 @@ void llama_batch_add(
         llama_pos pos,
         const std::vector<llama_seq_id> & seq_ids,
         bool logits) {
+    GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded");
     batch.token   [batch.n_tokens] = id;
     batch.pos     [batch.n_tokens] = pos;
     batch.n_seq_id[batch.n_tokens] = seq_ids.size();
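
For context, a hedged usage sketch of the helper the new assert guards: llama_batch_init and llama_batch_free are the real llama.cpp batch API, while the tokens array and loop bound are illustrative placeholders. In upstream llama.cpp, llama_batch_init allocates one extra seq_id slot and leaves it null, which is the sentinel the assert checks:

// Sketch of the guarded helper in use (llama_batch_init/llama_batch_free
// are the llama.cpp batch API; tokens and n_tokens are placeholders).
llama_batch batch = llama_batch_init(/*n_tokens*/ 512, /*embd*/ 0, /*n_seq_max*/ 1);

for (int i = 0; i < n_tokens; ++i) {
    // With the added GGML_ASSERT, exceeding the 512-token capacity now
    // aborts with "llama_batch size exceeded" instead of silently writing
    // past the allocated arrays.
    llama_batch_add(batch, tokens[i], i, { 0 }, i == n_tokens - 1);
}

llama_batch_free(batch);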