Mirror of https://github.com/ikawrakow/ik_llama.cpp.git, synced 2026-04-30 11:21:56 +00:00
Port speculative decoding from upstream to llama-server (#645)
* server : integrate speculative decoding
* server: Fix field names
* server: fix include, whitespace
* fix compile errors in speculative.cpp
* add llama_sampling_sample_and_accept_n to sampling
* finish porting speculative decoding in server
* port functions from common/speculative, common/sampling
* remove arg
* fix function names
* init params_dft to none
* correct value for n_ctx
* prefix kv cache tensors with model name to avoid conflict
* fix call arguments
* fix spec decoding args
* correct slot.id
* use n_max
* port the rest of sampling funcs
* fix func arguments
* slot.id starts at 1?
* Revert "prefix kv cache tensors with model name to avoid conflict"
This reverts commit fbd5dfd866.
* disable draft logging
* disable logging in speculative.cpp
in mainline these would be LOG_DEBUG, but since ik_llama doesn't support that, logging is disabled entirely
* add more draft model parameters
* fix
* pass flash_attn
* add speculative params for parity
* set speculative params in launch_slot_with_task instead
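
Taken together, these commits wire the usual draft-and-verify loop into llama-server: the draft model proposes up to n_draft tokens, the target model scores the whole draft in a single decode, and a sample-and-accept step (the new llama_sampling_sample_and_accept_n) keeps the prefix the target agrees with. The acceptance rule itself is small; the sketch below is illustrative only (self-contained C++, not the ported code), assuming the target has already produced one sampled token per evaluated draft position:

#include <cstdint>
#include <vector>

using llama_token = int32_t;

// Greedy acceptance: keep the target's token at every position, and stop at
// the first position where it disagrees with the draft (or once the draft is
// exhausted, which yields one extra "bonus" token).
static std::vector<llama_token> accept_draft(
        const std::vector<llama_token> & draft,
        const std::vector<llama_token> & target_samples) {
    std::vector<llama_token> accepted;
    for (size_t i = 0; i < target_samples.size(); ++i) {
        accepted.push_back(target_samples[i]);   // the target's choice is always kept
        if (i >= draft.size() || target_samples[i] != draft[i]) {
            break;                               // draft exhausted or first mismatch
        }
    }
    return accepted;
}

Each accepted draft token is one sequential target-model decode saved, which is where the speedup comes from.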
common/speculative.h (new file, 29 lines added)
@@ -0,0 +1,29 @@
#pragma once

#include "llama.h"

#include <vector>

struct llama_speculative;

struct llama_speculative_params {
    int n_draft = 16;  // max drafted tokens
    int n_reuse = 256;

    float p_min = 0.75f; // min probability required to accept a token in the draft
};

struct llama_speculative * llama_speculative_init(struct llama_context * ctx_dft);

void llama_speculative_free(struct llama_speculative * spec);

bool llama_speculative_are_compatible(
        const struct llama_context * ctx_tgt,
        const struct llama_context * ctx_dft);

// sample up to n_draft tokens and add them to the batch using the draft model
std::vector<llama_token> llama_speculative_gen_draft(
        struct llama_speculative * spec,
        struct llama_speculative_params params,
        const std::vector<llama_token> & prompt,
        llama_token id_last);
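
For context, a minimal sketch of how a caller might drive this API; only the llama_speculative_* declarations come from the header above, while the include path, the surrounding function, and the parameter values are illustrative assumptions:

#include "llama.h"
#include "speculative.h"   // assumed include path for the header above

#include <vector>

// Draft a continuation of (prompt_tokens + id_last) using the draft-model
// context ctx_dft. In the server, the spec object would be kept alive per
// slot rather than re-created on every call; this sketch keeps it local.
static std::vector<llama_token> draft_once(
        struct llama_context * ctx_tgt,
        struct llama_context * ctx_dft,
        const std::vector<llama_token> & prompt_tokens,
        llama_token id_last) {
    if (!llama_speculative_are_compatible(ctx_tgt, ctx_dft)) {
        return {};   // the two models do not line up, skip drafting
    }

    struct llama_speculative * spec = llama_speculative_init(ctx_dft);

    struct llama_speculative_params params;
    params.n_draft = 16;    // cap the number of drafted tokens per call
    params.p_min   = 0.75f; // min draft-token probability (see the header comment)

    std::vector<llama_token> draft =
            llama_speculative_gen_draft(spec, params, prompt_tokens, id_last);

    llama_speculative_free(spec);
    return draft;   // the caller verifies these tokens with the target model
}

The returned tokens are then evaluated by the target model in one batch and trimmed with the acceptance step sketched earlier.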