ik_llama.cpp/examples/server/utils.hpp at 292c934ac8efa8c2933119a58ee907c93426eb0b

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-04-27 18:01:45 +00:00

Yap Sok Ann 6bb76b142d Fix logprobs (#787)

This commit is mostly a cherry-pick of ggml-org/llama.cpp#10783, plus an
optimization that does a partial sort instead of a full sort of the logits.

That mainline PR and friends were partially cherry-picked by #723, but
the result wasn't really in a working state yet.

A couple of additional changes:
* Include timing information in the response, which mainline has done
  (unintentionally?) since ggml-org/llama.cpp#10643.
* Also return the actual logprobs for accepted draft tokens. This is
  still a TODO in mainline [1].

Note that returning the logprobs carries a token generation (TG)
performance penalty, as we need to sort the logits. With a partial sort,
the penalty is quite small. Here are some numbers I got using the same prompt:

This PR with partial sort:
* no draft, no logprobs: 12.87 tok/s
* no draft, with logprobs: 12.61 tok/s (2.0% drop)
* with draft, no logprobs: 36.74 tok/s
* with draft, with logprobs: 36.12 tok/s (1.7% drop)

With the full sort cherry-picked from the mainline PR:
* no draft, no logprobs: 12.81 tok/s
* no draft, with logprobs: 12.02 tok/s (6.2% drop)
* with draft, no logprobs: 36.59 tok/s
* with draft, with logprobs: 29.08 tok/s (20.5% drop)
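
The partial-sort idea can be sketched as follows. This is a minimal
illustration, not the code from this commit: `token_logprob` and
`top_n_logprobs` are hypothetical names, and the sketch assumes the raw
logits are available as a plain `std::vector<float>`. Only the first
`n_top` entries are put in order, which is what makes it cheaper than
sorting the whole vocabulary.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical type for illustration; not the struct used in the server code.
struct token_logprob {
    int   id;
    float logprob;
};

// Return the n_top most likely tokens with their log-probabilities,
// ordered from most to least likely.
std::vector<token_logprob> top_n_logprobs(const std::vector<float> & logits, size_t n_top) {
    std::vector<token_logprob> cur;
    if (logits.empty()) {
        return cur;
    }

    // log-sum-exp normalizer, computed relative to the max logit for stability
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    double sum_exp = 0.0;
    for (float l : logits) {
        sum_exp += std::exp((double) l - max_logit);
    }
    const float log_z = max_logit + (float) std::log(sum_exp);
    assert(std::isfinite(log_z)); // assumes at least one finite logit

    cur.reserve(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) {
        cur.push_back({ (int) i, logits[i] });
    }

    // Order only the first n_top entries; the rest stay unsorted.
    n_top = std::min(n_top, cur.size());
    std::partial_sort(cur.begin(), cur.begin() + n_top, cur.end(),
        [](const token_logprob & a, const token_logprob & b) {
            return a.logprob > b.logprob;
        });
    cur.resize(n_top);

    // Convert the surviving logits to log-probabilities.
    for (auto & c : cur) {
        c.logprob -= log_z;
    }
    return cur;
}
```

`std::partial_sort` needs roughly N log(n_top) comparisons versus N log N
for `std::sort`, so with a vocabulary of N tokens and a small `n_top` the
ordering step does far less work, which is consistent with the smaller
drops measured above.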

[1] https://github.com/ggml-org/llama.cpp/blob/b6548/tools/server/server.cpp#L4019

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2025-09-25 15:43:30 +02:00

25 KiB
