ik_llama.cpp/examples/sweep-bench
Kawrakow 633e0617b0 Enable CUDA graphs for MoE models + GPT-OSS support (#689)
* gpt-oss: common

* gpt-oss: attention sinks, swiglu_oai

* gpt-oss: WIP llama

Model loads and runs (CPU only), but PPL is much too high
(~1500 for 1st batch vs ~200 in mainline).
Is it because of SWA, because of vocab, or did I introduce a bug somewhere?

* gpt-oss: CPU seems to be working

It was the SWA that was missing in the previous commit.

There are issues with EOG tokens, so this still needs to be added.

* CUDA: ADD_ID

Just a copy from mainline

* gpt-oss: Seems to be working on CUDA

* gpt-oss: add sinks to the attn-vec kernels

* CUDA: add head size of 64 to new mma

Haven't turned it on yet, but observe slightly better PP and slightly
worse TG performance with that.

* gpt-oss: add ability to use -fmoe (only CUDA for now)

* Move row sums to the right place

* Add sinks to iqk flash attention

* gpt_oss: Implement -fmoe on the CPU

* Simdify swiglu_oai

Turning it off for now as performance becomes more variable,
so perhaps I'm running into thermal throttling more often
because of making the CPU work too hard.

* llama: factor out model loader

* Builds successfully

* It runs, but mmap does not work

* Fix llama_mmap so mmap works

* Minor

* Fix CUDA after latest changes

* Attempt to use CUDA graphs with MoE models - not working

* CUDA graphs WIP - still not working

* CUDA graphs - seems to be working

Likely not all MLA variants are working.
I no longer remember why I added the q8_0 cpy that
transposes the tensor, but if really needed, this is now
missing. Also missing is q6_0.

* Make q8_0 cache work for DeepSeek models with CUDA graphs

* cuda: cpy for q6_0

* Fix llama_mmap on non-Linux platforms

* Adding forgotten file

* Iterating on Windows build failures

* cuda: re-add q8_0 -> q8_0 transpose

so mla = 2 can be used with CUDA graphs and q8_0 cache.

* Disable graphs without -fmoe

* Minor

* Turn graphs on by default

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-15 09:18:07 +03:00

ik_llama.cpp/examples/sweep-bench

Benchmark the prompt processing and token generation performance of ik_llama.cpp by doing a sweep over the whole context size and gathering performance metrics in each ubatch-sized window. Only a single token sequence is used.

The benchmark steps are:

for each ubatch-sized window in context:

1. generate ubatch/4 tokens (not the whole window, to save some time)
2. measure generation performance
3. remove the generated tokens from the KV cache
4. prepare a ubatch-sized batch of random tokens
5. process the prepared batch
6. measure prompt processing performance

The purpose of the benchmark is to visualize how the performance changes with the context size, without averaging the metric values over the whole context.
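
For orientation, here is a minimal Python sketch of the sweep loop described above. It is illustrative only, not the actual C++ implementation; generate_tokens, clear_generated_tokens and process_batch are hypothetical callbacks standing in for the llama.cpp calls made by the example.

```python
# Sketch of the sweep loop: one ubatch-sized window per iteration.
# The three callbacks are hypothetical stand-ins for the real llama.cpp calls.
import random
import time

def sweep(n_ctx, n_ubatch, n_vocab,
          generate_tokens, clear_generated_tokens, process_batch):
    results = []
    for n_kv in range(0, n_ctx, n_ubatch):
        n_tg = n_ubatch // 4                 # generate only ubatch/4 tokens

        t0 = time.perf_counter()             # steps 1-2: generation + timing
        generate_tokens(n_tg)
        t_tg = time.perf_counter() - t0

        clear_generated_tokens(n_tg)         # step 3: drop them from the KV cache

        batch = [random.randrange(n_vocab) for _ in range(n_ubatch)]
        t0 = time.perf_counter()             # steps 4-6: prompt processing + timing
        process_batch(batch)
        t_pp = time.perf_counter() - t0

        results.append({"pp": n_ubatch, "tg": n_tg, "n_kv": n_kv,
                        "t_pp": t_pp, "speed_pp": n_ubatch / t_pp,
                        "t_tg": t_tg, "speed_tg": n_tg / t_tg})
    return results
```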

Usage

./llama-sweep-bench -c 8704 -ub 512 -m models/Meta-Llama-3.2-3B-Instruct-Q8_0.gguf

Sample results

  • PP - prompt tokens per ubatch
  • TG - generated tokens per ubatch
  • N_KV - current KV cache size
  • T_PP - prompt processing time (i.e. time to first token)
  • S_PP - prompt processing speed ((B*PP)/T_PP or PP/T_PP)
  • T_TG - time to generate all batches
  • S_TG - text generation speed ((B*TG)/T_TG)
| PP  | TG  | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|----:|----:|-----:|-------:|---------:|-------:|---------:|
| 512 | 128 |    0 |  1.100 |   465.51 |  2.311 |    55.38 |
| 512 | 128 |  512 |  1.183 |   432.97 |  1.895 |    67.55 |
| 512 | 128 | 1024 |  1.305 |   392.38 |  2.071 |    61.81 |
| 512 | 128 | 1536 |  1.279 |   400.42 |  2.164 |    59.14 |
| 512 | 128 | 2048 |  1.571 |   325.96 |  2.280 |    56.14 |
| 512 | 128 | 2560 |  1.431 |   357.87 |  2.418 |    52.94 |
| 512 | 128 | 3072 |  1.515 |   337.93 |  2.566 |    49.88 |
| 512 | 128 | 3584 |  1.588 |   322.34 |  2.722 |    47.03 |
| 512 | 128 | 4096 |  1.675 |   305.70 |  2.864 |    44.69 |
| 512 | 128 | 4608 |  1.769 |   289.50 |  2.999 |    42.68 |
| 512 | 128 | 5120 |  1.845 |   277.48 |  3.102 |    41.26 |
| 512 | 128 | 5632 |  1.893 |   270.46 |  3.219 |    39.76 |
| 512 | 128 | 6144 |  1.953 |   262.20 |  3.348 |    38.23 |
| 512 | 128 | 6656 |  2.018 |   253.71 |  3.474 |    36.84 |
| 512 | 128 | 7168 |  2.078 |   246.34 |  3.589 |    35.66 |
| 512 | 128 | 7680 |  2.140 |   239.22 |  3.717 |    34.43 |
| 512 | 128 | 8192 |  2.196 |   233.15 |  3.854 |    33.21 |
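
The speed columns can be reproduced from the token counts and times (with a single token sequence, S_PP is just PP/T_PP and S_TG is TG/T_TG); small differences are due to the times being rounded to three decimals in the table. For example, for the first row:

```python
# Reproduce the reported speeds from the first table row (N_KV = 0).
pp, tg = 512, 128
t_pp, t_tg = 1.100, 2.311              # seconds, as printed in the table

print(f"S_PP = {pp / t_pp:.2f} t/s")   # ~465.45 (table: 465.51, T_PP was rounded)
print(f"S_TG = {tg / t_tg:.2f} t/s")   # ~55.39  (table: 55.38, T_TG was rounded)
```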

JSONL output

Pass --output-format jsonl to output JSONL instead of Markdown, à la

{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 0, "t_pp": 1.093814, "speed_pp": 468.086884, "t_tg": 1.780312, "speed_tg": 71.897514 }
{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 512, "t_pp": 1.169302, "speed_pp": 437.868073, "t_tg": 1.897474, "speed_tg": 67.458099 }
{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 1024, "t_pp": 1.183700, "speed_pp": 432.542053, "t_tg": 2.059179, "speed_tg": 62.160694 }
{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 1536, "t_pp": 1.428625, "speed_pp": 358.386566, "t_tg": 2.160639, "speed_tg": 59.241734 }
{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 2048, "t_pp": 1.360647, "speed_pp": 376.291595, "t_tg": 2.274003, "speed_tg": 56.288403 }