mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-04-27 18:01:45 +00:00
This commit is mostly a cherry-pick of ggml-org/llama.cpp#10783, plus an optimization that does a partial sort when sorting the logits. That mainline PR and friends were partially cherry-picked by #723, but were not really in a working state yet.

A couple of additional changes:
* Include timing information in the response, which was (unintentionally?) done in mainline since ggml-org/llama.cpp#10643.
* Also return the actual logprobs for accepted draft tokens. This is still a TODO in mainline [1].

Note that there is a token-generation (TG) performance penalty for returning the logprobs, as we need to sort the logits. By doing a partial sort, the penalty is quite small. Here are some numbers I got using the same prompt:

This PR with partial sort:
* no draft, no logprobs: 12.87 tok/s
* no draft, with logprobs: 12.61 tok/s (2.0% drop)
* with draft, no logprobs: 36.74 tok/s
* with draft, with logprobs: 36.12 tok/s (1.7% drop)

With the full sort cherry-picked from the mainline PR:
* no draft, no logprobs: 12.81 tok/s
* no draft, with logprobs: 12.02 tok/s (6.2% drop)
* with draft, no logprobs: 36.59 tok/s
* with draft, with logprobs: 29.08 tok/s (20.5% drop)

[1] https://github.com/ggml-org/llama.cpp/blob/b6548/tools/server/server.cpp#L4019

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
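
For readers unfamiliar with the partial-sort idea mentioned in the message, here is a minimal sketch of how the top-N logprobs can be obtained without fully sorting the vocabulary. It is not the code from this commit: the `TokenLogprob` struct and the `top_logprobs` function are hypothetical names chosen for illustration, and the sketch assumes the logits are available as a plain `std::vector<float>`.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical result type for illustration; not the server's actual struct.
struct TokenLogprob {
    int   id;       // token id (index into the logits vector)
    float logprob;  // log-softmax value for that token
};

// Return the n_top most likely tokens with their log-probabilities,
// ordered from most to least likely.
std::vector<TokenLogprob> top_logprobs(const std::vector<float> & logits, std::size_t n_top) {
    const std::size_t n_vocab = logits.size();
    n_top = std::min(n_top, n_vocab);
    if (n_top == 0) {
        return {};
    }

    // Log-sum-exp over the whole vocabulary: linear passes, no sorting needed.
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    double sum_exp = 0.0;
    for (float l : logits) {
        sum_exp += std::exp(static_cast<double>(l - max_logit));
    }
    const float log_norm = max_logit + static_cast<float>(std::log(sum_exp));

    // Order only the first n_top entries by descending logit.
    // A full std::sort would order all n_vocab entries instead.
    std::vector<int> ids(n_vocab);
    std::iota(ids.begin(), ids.end(), 0);
    std::partial_sort(ids.begin(), ids.begin() + n_top, ids.end(),
                      [&logits](int a, int b) { return logits[a] > logits[b]; });

    std::vector<TokenLogprob> result;
    result.reserve(n_top);
    for (std::size_t i = 0; i < n_top; ++i) {
        result.push_back({ids[i], logits[ids[i]] - log_norm});
    }
    return result;
}
```

Calling `top_logprobs(logits, 10)` orders only the ten best candidates. `std::partial_sort` needs roughly N log K comparisons for a vocabulary of N tokens and K requested entries, against roughly N log N for a full sort, which is consistent with the smaller drop reported above when logprobs are requested.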
25 KiB