Alex 51331f4973 Fix two speculative-decoding crashes that prevent any usage (#1760)
This patch addresses two latent bugs in examples/speculative/speculative.cpp
that prevent llama-speculative.exe from running with greedy sampling
(temp=0) or from producing rejection-sampling output (temp>0):

1. Line 191: `params.sparams.grammar = { COMMON_GRAMMAR_TYPE_NONE, "" };`
   invokes `common_grammar(type, grammar)` which asserts
   `type != NONE || !grammar.empty()`. Both conditions fail with the
   intended-to-be-empty grammar, so every speculative run hits a hard
   `GGML_ASSERT` in common/sampling.h:63 immediately after model load.

   Fix: default-construct via `common_grammar{}`, which calls the
   default constructor instead of the asserting two-argument one.

2. Lines 293-294: `GGML_ASSERT(dist_tgt.sorted)` and
   `GGML_ASSERT(dist_dft.sorted)` fire whenever the draft sampler does
   not set the .sorted flag, which most modern sampler paths do not.
   The assertions are incorrect anyway: the next ~10 lines explicitly
   re-sort both distributions by id.

   Fix: replace the asserts with an explanatory comment.

After both fixes, `llama-speculative.exe` runs to completion. The
acceptance-rate measurement at temp=0 still looks suspicious (0%
acceptance across same-family draft/target pairs), but that is a
separate issue, out of scope for this PR.

Tested with Qwen3-0.6B-IQ4_XS as the draft model for a
Qwen3-1.7B-IQ4_XS target, both base models from
`bartowski/Qwen_Qwen3-*-GGUF`, on Windows with an ik_llama.cpp build at
HEAD of windows-mingw-default-win10 (itself a follow-up to PR #1755).
2026-05-09 08:36:38 +03:00