🔀 #284 - llama-bench: enable having different number of threads for tg and pp
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2025-03-24 |
| Updated | 2025-03-25 |
Description
All applications in the examples folder except llama-bench accept -t (to specify the number of threads for token generation) and -tb (to specify the number of threads for prompt processing, a.k.a. prefill) as command line arguments. This is handy because peak TG performance is often reached at a lower number of threads, so one wants to use that for TG instead of the number of cores, which is what gives maximum prompt processing speed. llama-bench, inherited from upstream, has its own command line argument parsing, which offers -t but not -tb.
This PR adds a new command line argument to llama-bench: -tgb (or --threads-gen-batch). One can use it as, e.g.,
./bin/llama-bench -tgb 4,16 -p 512 -n 128 other_arguments
where 4 threads will be used for the tg128 test, and 16 threads will be used for the pp512 test. For tests that are a combination of prefill and gen (-pg, -gp), the batch number of threads will be used for prefill, and the gen number of threads will be used for token generation. One can also specify multiple pairs of {t_gen, t_batch} for the -tgb argument, separating them with a semicolon. E.g.,
./bin/llama-bench -tgb "2,16;4,16;8,32"
The -t argument continues to work as before. It adds a pair of the same integer to the list of {t_gen, t_batch} thread pairs.
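To make the argument format concrete, here is a minimal sketch of how a -tgb value could be turned into {t_gen, t_batch} pairs, and how a -t value maps onto the same list. This is not the code from the PR: `ThreadPair`, `parse_tgb` and `pair_from_t` are hypothetical names used only for illustration, and the error handling is an assumption.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type for illustration: first = t_gen, second = t_batch.
using ThreadPair = std::pair<int, int>;

// Parse a -tgb style value such as "2,16;4,16;8,32".
// Pairs are separated by ';', the two numbers of a pair by ','.
std::vector<ThreadPair> parse_tgb(const std::string & value) {
    std::vector<ThreadPair> pairs;
    std::istringstream groups(value);
    std::string group;
    while (std::getline(groups, group, ';')) {
        const auto comma = group.find(',');
        if (comma == std::string::npos) {
            // Assumed behavior: reject a group that is not "t_gen,t_batch".
            throw std::invalid_argument("expected t_gen,t_batch but got: " + group);
        }
        const int t_gen   = std::stoi(group.substr(0, comma));
        const int t_batch = std::stoi(group.substr(comma + 1));
        pairs.emplace_back(t_gen, t_batch);
    }
    return pairs;
}

// A plain -t value n contributes the pair {n, n}.
ThreadPair pair_from_t(int n) {
    return ThreadPair(n, n);
}
```

With this sketch, `parse_tgb("2,16;4,16;8,32")` yields three pairs, and `pair_from_t(4)` yields {4, 4}, so both arguments can feed the same list.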
Caveat: For -p the batch number of threads is added to the table. For all other tests the gen number of threads is printed. This is of course appropriate for -n and -gp, but it becomes confusing for -pg, where the batch and gen number of threads both matter for the reported performance. I guess it would be better to print both thread numbers in this case, but this is not done in this PR.
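The reporting rule from the caveat can be sketched as follows. This is only an illustration under assumed names: `TestKind`, `ThreadPair`, `reported_threads` and `reported_threads_both` do not come from llama-bench, and the "gen/batch" string format in the second function is a made-up example of what printing both numbers might look like, not something this PR implements.

```cpp
#include <string>

// Hypothetical names for illustration; they do not come from llama-bench.
enum class TestKind { PP, TG, PG, GP };  // tests requested via -p, -n, -pg, -gp

struct ThreadPair {
    int t_gen;    // threads used for token generation
    int t_batch;  // threads used for prompt processing (prefill)
};

// Thread count shown in the results table, following the caveat:
// the batch number for -p, the gen number for every other test.
int reported_threads(TestKind kind, const ThreadPair & tp) {
    return kind == TestKind::PP ? tp.t_batch : tp.t_gen;
}

// Assumed alternative for mixed tests: show both numbers as "t_gen/t_batch".
std::string reported_threads_both(const ThreadPair & tp) {
    return std::to_string(tp.t_gen) + "/" + std::to_string(tp.t_batch);
}
```

For the pair {4, 16}, `reported_threads` gives 16 for a -p test and 4 for a -pg test, which is exactly why the -pg row can be misleading: the 16 prefill threads never show up in the table.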
💬 Conversation
👤 ubergarm commented the 2025-03-25 at 16:27:02:
Thanks for this one, should help optimize the big xeon 6980P given previous testing suggests that pp likes more threads than tg.