This patch addresses two latent bugs in examples/speculative/speculative.cpp
that prevented llama-speculative.exe from running at all with greedy sampling
(temp=0) or from producing rejection-sampling output (temp>0):
1. Line 191: `params.sparams.grammar = { COMMON_GRAMMAR_TYPE_NONE, "" };`
invokes `common_grammar(type, grammar)`, which asserts
`type != NONE || !grammar.empty()`. Both disjuncts are false for the
intended-to-be-empty grammar, so every speculative run hits a hard
`GGML_ASSERT` in common/sampling.h:63 immediately after model load.
Fix: default-construct via `common_grammar{}`, which bypasses the
field-init constructor and its assert.
2. Lines 293-294: `GGML_ASSERT(dist_tgt.sorted)` and
`GGML_ASSERT(dist_dft.sorted)` fire whenever the draft sampler does
not set the `.sorted` flag, which most modern sampler paths do not.
The ~10 lines that follow re-sort both distributions by token id
explicitly, so the asserts are redundant as well as incorrect.
Fix: replace the asserts with an explanatory comment.
After both fixes, `llama-speculative.exe` runs to completion. The
acceptance-rate measurement at temp=0 still looks suspicious (0%
across same-family draft/target pairs), but that is a different
issue out of scope for this PR.
Tested with Qwen3-0.6B-IQ4_XS drafting for Qwen3-1.7B-IQ4_XS, both base
models from `bartowski/Qwen_Qwen3-*-GGUF`, on Windows with an
ik_llama.cpp build at the HEAD of the windows-mingw-default-win10
branch (itself a follow-up to PR #1755).