ikawrakow/ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-01-26 09:09:50 +00:00

Thomas eaa2510a28 Add GitHub data: filename sanitization (#640 )

2025-07-23 13:31:53 +02:00

🔀 #284 - llama-bench: enable having different number of threads for tg and pp

Author	`ikawrakow`
State	❌ Closed
Created	2025-03-24
Updated	2025-03-25

Description

All applications in the examples folder except llama-bench accept -t (to specify the number of threads for token generation) and -tb (to specify the number of threads for prompt processing, a.k.a. prefill) as command line arguments. This is handy because peak TG performance is often reached at a lower number of threads than the number of cores, whereas using all cores is good for maximum prompt processing speed. llama-bench, inherited from upstream, has its own command line argument parsing, which offers only -t and not -tb.

This PR adds a new command line argument to llama-bench: -tgb (or --threads-gen-batch). One can use it as, e.g.,

./bin/llama-bench -tgb 4,16 -p 512 -n 128 other_arguments

where 4 threads will be used for the tg128 test, and 16 threads will be used for the pp512 test. For tests that are a combination of prefill and gen (-pg, -gp), the batch number of threads will be used for prefill, and the gen number of threads will be used for token generation. One can also specify multiple pairs of {t_gen, t_batch} for the -tgb argument, separating them with a semicolon. E.g.,

./bin/llama-bench -tgb "2,16;4,16;8,32"

The -t argument continues to work as before. It adds a pair of the same integer to the list of {t_gen, t_batch} thread-count pairs.
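
The pair syntax described above can be illustrated with a small parser. This is a minimal Python sketch, not the C++ code in llama-bench: the function names `parse_tgb` and `pair_from_t` are hypothetical, and the handling of whitespace, empty items, and malformed input is an assumption rather than documented behavior.

```python
from __future__ import annotations


def parse_tgb(value: str) -> list[tuple[int, int]]:
    """Parse a -tgb style value such as '2,16;4,16;8,32'.

    Returns a list of (t_gen, t_batch) pairs. Pairs are separated by ';'
    and the two thread counts inside a pair by ','.
    """
    pairs = []
    for item in value.split(";"):
        item = item.strip()
        if not item:
            # Assumption: empty items (e.g. a trailing ';') are skipped.
            continue
        parts = item.split(",")
        if len(parts) != 2:
            raise ValueError(f"expected 't_gen,t_batch', got {item!r}")
        t_gen, t_batch = int(parts[0]), int(parts[1])
        if t_gen <= 0 or t_batch <= 0:
            raise ValueError(f"thread counts must be positive, got {item!r}")
        pairs.append((t_gen, t_batch))
    return pairs


def pair_from_t(n_threads: int) -> tuple[int, int]:
    """Model of -t: the same count is used for generation and batch."""
    return (n_threads, n_threads)


if __name__ == "__main__":
    print(parse_tgb("2,16;4,16;8,32"))
    print(pair_from_t(8))
```

Note that in POSIX-style shells an unquoted `;` ends the command, so a value containing several pairs needs to be quoted on the command line.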

Caveat: For -p the batch number of threads is added to the table. For all other tests the gen number of threads is printed. This is of course appropriate for -n and -gp, but it becomes confusing for -pg, where the batch and gen numbers of threads both matter for the reported performance. I guess it would be better to print both thread numbers in this case, but this is not done in this PR.
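
The rules for which thread count is used in each phase, and which one ends up in the results table, can be summarized in a short sketch. This Python model only restates the behavior described in this PR; `ThreadPlan` and `plan_threads` are hypothetical names, and nothing here is taken from the llama-bench source.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ThreadPlan:
    prefill_threads: Optional[int]  # None if the test has no prefill phase
    gen_threads: Optional[int]      # None if the test generates no tokens
    reported_threads: int           # the single value shown in the table


def plan_threads(n_prompt: int, n_gen: int, t_gen: int, t_batch: int) -> ThreadPlan:
    """Model thread usage for one test given a {t_gen, t_batch} pair."""
    if n_prompt <= 0 and n_gen <= 0:
        raise ValueError("a test needs prompt tokens, generated tokens, or both")
    prefill = t_batch if n_prompt > 0 else None
    gen = t_gen if n_gen > 0 else None
    # Pure prompt-processing tests report t_batch; every other test
    # reports t_gen, including mixed tests where both values matter.
    reported = t_batch if n_gen == 0 else t_gen
    return ThreadPlan(prefill, gen, reported)


if __name__ == "__main__":
    print(plan_threads(512, 0, t_gen=4, t_batch=16))    # pp512
    print(plan_threads(0, 128, t_gen=4, t_batch=16))    # tg128
    print(plan_threads(512, 128, t_gen=4, t_batch=16))  # mixed prefill + gen
```

In the mixed case the model makes the caveat visible: prefill runs with 16 threads, but only the gen value of 4 is reported.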

💬 Conversation

👤 ubergarm commented on 2025-03-25 at 16:27:02:

Thanks for this one, should help optimize the big xeon 6980P given previous testing suggests that pp likes more threads than tg.
