# ik_llama.cpp/example/sweep-bench
Benchmark the prompt processing and token generation performance of ik_llama.cpp
by doing a sweep over the whole context size and gathering performance metrics
in each ubatch-sized window. Only a single token sequence is used.
The benchmark steps are, for each ubatch-sized window in the context (a schematic sketch follows this list):

1. generate `ubatch/4` tokens (not the whole window, to save some time)
2. measure token generation performance
3. remove the generated tokens from the KV cache
4. prepare a ubatch-sized batch of random tokens
5. process the prepared batch
6. measure prompt processing performance
The purpose of the benchmark is to visualize how the performance changes with the context size, without averaging the metric values over the whole context.
## Usage
```sh
./llama-sweep-bench -c 8704 -ub 512 -m models/Meta-Llama-3.2-3B-Instruct-Q8_0.gguf
```
## Sample results

- `PP` - prompt tokens per ubatch
- `TG` - generated tokens per ubatch
- `N_KV` - current KV cache size
- `T_PP` - prompt processing time (i.e. time to first token)
- `S_PP` - prompt processing speed (`(B*PP)/T_PP` or `PP/T_PP`)
- `T_TG` - time to generate all batches
- `S_TG` - text generation speed (`(B*TG)/T_TG`)
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 1.100 | 465.51 | 2.311 | 55.38 |
| 512 | 128 | 512 | 1.183 | 432.97 | 1.895 | 67.55 |
| 512 | 128 | 1024 | 1.305 | 392.38 | 2.071 | 61.81 |
| 512 | 128 | 1536 | 1.279 | 400.42 | 2.164 | 59.14 |
| 512 | 128 | 2048 | 1.571 | 325.96 | 2.280 | 56.14 |
| 512 | 128 | 2560 | 1.431 | 357.87 | 2.418 | 52.94 |
| 512 | 128 | 3072 | 1.515 | 337.93 | 2.566 | 49.88 |
| 512 | 128 | 3584 | 1.588 | 322.34 | 2.722 | 47.03 |
| 512 | 128 | 4096 | 1.675 | 305.70 | 2.864 | 44.69 |
| 512 | 128 | 4608 | 1.769 | 289.50 | 2.999 | 42.68 |
| 512 | 128 | 5120 | 1.845 | 277.48 | 3.102 | 41.26 |
| 512 | 128 | 5632 | 1.893 | 270.46 | 3.219 | 39.76 |
| 512 | 128 | 6144 | 1.953 | 262.20 | 3.348 | 38.23 |
| 512 | 128 | 6656 | 2.018 | 253.71 | 3.474 | 36.84 |
| 512 | 128 | 7168 | 2.078 | 246.34 | 3.589 | 35.66 |
| 512 | 128 | 7680 | 2.140 | 239.22 | 3.717 | 34.43 |
| 512 | 128 | 8192 | 2.196 | 233.15 | 3.854 | 33.21 |
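As a sanity check, the speed columns are simply tokens divided by time; for the first row of the table above:

```python
PP, TG = 512, 128
T_PP, T_TG = 1.100, 2.311

print(PP / T_PP)  # 465.45 - close to the reported S_PP of 465.51 t/s
                  # (small difference comes from T_PP being rounded in the table)
print(TG / T_TG)  # 55.39  - matches the reported S_TG of 55.38 t/s
```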
## JSONL output

Pass `--output-format jsonl` to output JSONL instead of Markdown, à la
{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 0, "t_pp": 1.093814, "speed_pp": 468.086884, "t_tg": 1.780312, "speed_tg": 71.897514 }
{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 512, "t_pp": 1.169302, "speed_pp": 437.868073, "t_tg": 1.897474, "speed_tg": 67.458099 }
{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 1024, "t_pp": 1.183700, "speed_pp": 432.542053, "t_tg": 2.059179, "speed_tg": 62.160694 }
{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 1536, "t_pp": 1.428625, "speed_pp": 358.386566, "t_tg": 2.160639, "speed_tg": 59.241734 }
{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 2048, "t_pp": 1.360647, "speed_pp": 376.291595, "t_tg": 2.274003, "speed_tg": 56.288403 }