# ik_llama.cpp/example/sweep-bench
Benchmark the prompt processing and token generation performance of ik_llama.cpp
by doing a sweep over the whole context size and gathering performance metrics
in each ubatch-sized window. Only a single token sequence is used.
The benchmark steps are, for each ubatch-sized window in the context (a code sketch of this loop follows the list):

1. generate ubatch/4 tokens (not the whole window, to save some time)
2. measure token generation performance
3. remove the generated tokens from the KV cache
4. prepare a ubatch-sized batch of random tokens
5. process the prepared batch
6. measure prompt processing performance
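For orientation, here is a minimal sketch of that loop in terms of the public llama.cpp API. This is not the actual sweep-bench source: the `sweep` helper and its parameters are illustrative, it assumes the 2024-era `llama_batch_get_one` signature with explicit position/sequence arguments, and it feeds a fixed token where the real benchmark uses random ones.

```cpp
// Minimal sketch of the sweep loop, NOT the actual sweep-bench source.
#include "llama.h"

#include <chrono>
#include <cstdio>
#include <vector>

static double now_s() {
    using clk = std::chrono::steady_clock;
    return std::chrono::duration<double>(clk::now().time_since_epoch()).count();
}

static void sweep(llama_context * ctx, int n_ctx, int n_ubatch, llama_token tok) {
    const int tg = n_ubatch / 4; // generate only a quarter of the window to save time

    for (int n_kv = 0; n_kv + n_ubatch <= n_ctx; n_kv += n_ubatch) {
        // steps 1-2: decode tg single-token batches on top of the current cache
        // (the real benchmark uses random tokens; a fixed one works for timing)
        double t0 = now_s();
        for (int i = 0; i < tg; i++) {
            llama_token t = tok;
            llama_decode(ctx, llama_batch_get_one(&t, 1, n_kv + i, 0));
        }
        const double t_tg = now_s() - t0;

        // step 3: drop the generated tokens so the next window starts clean
        llama_kv_cache_seq_rm(ctx, 0, n_kv, -1);

        // steps 4-6: process one full ubatch of prompt tokens and time it
        std::vector<llama_token> prompt(n_ubatch, tok);
        t0 = now_s();
        llama_decode(ctx, llama_batch_get_one(prompt.data(), n_ubatch, n_kv, 0));
        const double t_pp = now_s() - t0;

        printf("N_KV=%5d  S_PP=%8.2f t/s  S_TG=%7.2f t/s\n",
               n_kv, n_ubatch / t_pp, tg / t_tg);
    }
}
```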
The purpose of the benchmark is to visualize how performance changes with context size, without averaging the metric values over the whole context.
## Usage

```sh
./llama-sweep-bench -c 8704 -ub 512 -m models/Meta-Llama-3.2-3B-Instruct-Q8_0.gguf
```
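With `-c 8704` and `-ub 512`, the sweep covers 8704/512 = 17 windows, one row per window in the sample output below (`N_KV` from 0 to 8192).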
## Sample results
- `PP` - prompt tokens per ubatch
- `TG` - generated tokens per ubatch
- `N_KV` - current KV cache size
- `T_PP` - prompt processing time (i.e. time to first token)
- `S_PP` - prompt processing speed (`(B*PP)/T_PP` or `PP/T_PP`)
- `T_TG` - time to generate all batches
- `S_TG` - text generation speed (`(B*TG)/T_TG`)
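For example, in the first row below, `S_PP = 512/1.100 ≈ 465.5 t/s` and `S_TG = 128/2.311 ≈ 55.4 t/s` (here `B = 1`, since only a single sequence is used).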
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 1.100 | 465.51 | 2.311 | 55.38 |
| 512 | 128 | 512 | 1.183 | 432.97 | 1.895 | 67.55 |
| 512 | 128 | 1024 | 1.305 | 392.38 | 2.071 | 61.81 |
| 512 | 128 | 1536 | 1.279 | 400.42 | 2.164 | 59.14 |
| 512 | 128 | 2048 | 1.571 | 325.96 | 2.280 | 56.14 |
| 512 | 128 | 2560 | 1.431 | 357.87 | 2.418 | 52.94 |
| 512 | 128 | 3072 | 1.515 | 337.93 | 2.566 | 49.88 |
| 512 | 128 | 3584 | 1.588 | 322.34 | 2.722 | 47.03 |
| 512 | 128 | 4096 | 1.675 | 305.70 | 2.864 | 44.69 |
| 512 | 128 | 4608 | 1.769 | 289.50 | 2.999 | 42.68 |
| 512 | 128 | 5120 | 1.845 | 277.48 | 3.102 | 41.26 |
| 512 | 128 | 5632 | 1.893 | 270.46 | 3.219 | 39.76 |
| 512 | 128 | 6144 | 1.953 | 262.20 | 3.348 | 38.23 |
| 512 | 128 | 6656 | 2.018 | 253.71 | 3.474 | 36.84 |
| 512 | 128 | 7168 | 2.078 | 246.34 | 3.589 | 35.66 |
| 512 | 128 | 7680 | 2.140 | 239.22 | 3.717 | 34.43 |
| 512 | 128 | 8192 | 2.196 | 233.15 | 3.854 | 33.21 |
## JSONL output
Pass `--output-format jsonl` to output JSONL instead of Markdown, à la

```json
{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 0, "t_pp": 1.093814, "speed_pp": 468.086884, "t_tg": 1.780312, "speed_tg": 71.897514 }
{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 512, "t_pp": 1.169302, "speed_pp": 437.868073, "t_tg": 1.897474, "speed_tg": 67.458099 }
{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 1024, "t_pp": 1.183700, "speed_pp": 432.542053, "t_tg": 2.059179, "speed_tg": 62.160694 }
{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 1536, "t_pp": 1.428625, "speed_pp": 358.386566, "t_tg": 2.160639, "speed_tg": 59.241734 }
{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 2048, "t_pp": 1.360647, "speed_pp": 376.291595, "t_tg": 2.274003, "speed_tg": 56.288403 }
```