mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-05-19 03:59:13 +00:00

Kawrakow a719349982 POC: CUDA tensor parallel (MoE models) (#1022)

* Remove most of split mode row

* WIP

* WIP: also allocate the KV cache using tensor split

* WIP: it runs with wrong result

But it also looks like the backend scheduler is not going to help:
* It copies mask and input positions to GPU 0
* => RoPE ops must run on GPU 0
* => To proceed with attn evaluation, GPU 1 must wait for GPU 0 to finish its
     entire attn calculation
* Same with FFN. The rms_norm gets scheduled on GPU 0. Hence, GPU 1 must
  wait for GPU 0 to finish its entire FFN calculation before it can
  start (as it needs to copy the result of rms_norm from GPU 0)
* => Seems useless without writing a bespoke TP scheduling

* WIP

* This works, but it is slow

* This is slightly better

The graph is still not being computed in parallel.
Why? Because the scheduler creates graph splits where the
result of the computation on one GPU becomes an input for the
other split. Hence, to trigger the computation on the second GPU
one needs to wait for the computation on the first GPU to finish,
even though the two can be done in parallel up to the synchronization
point. So, all that is left to do is to trick the scheduler into creating
two splits that can be done in parallel, and then have a graph split
where the results get combined.
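
The effect described above can be illustrated with a toy model. This is only a sketch: the `makespan` helper, the split names, and the time units are invented for illustration and do not correspond to the actual backend scheduler code.

```python
def makespan(splits):
    """Return the finish time of the last split.

    `splits` is a list of (name, device, duration, deps) tuples in
    topological order. A split starts once its device is free and all
    of its dependencies have finished.
    """
    finish = {}        # split name -> finish time
    device_free = {}   # device id  -> time at which the device becomes idle
    for name, device, duration, deps in splits:
        start = max([device_free.get(device, 0)] + [finish[d] for d in deps])
        finish[name] = start + duration
        device_free[device] = finish[name]
    return max(finish.values())


# The split on GPU 1 consumes the result of the split on GPU 0,
# so the two halves run one after the other.
chained = [
    ("half_gpu0", 0, 10, []),
    ("half_gpu1", 1, 10, ["half_gpu0"]),
    ("combine", 0, 1, ["half_gpu1"]),
]

# Two independent splits, followed by one split that combines the results.
independent = [
    ("half_gpu0", 0, 10, []),
    ("half_gpu1", 1, 10, []),
    ("combine", 0, 1, ["half_gpu0", "half_gpu1"]),
]

print(makespan(chained))      # 21
print(makespan(independent))  # 11
```

In this toy model the total work is identical in both cases; only the dependency between the two halves changes.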

* Playing games with the scheduler

This change tricks it into doing the right thing^TM.
Still quite a bit slower than split mode layer for the 8B LlaMA model.
But for the 70B LlaMA it now beats split mode layer for TG:
28 t/s vs 24.4 t/s. PP is 627 t/s vs 744 t/s.
In comparison, split mode "row" in mainline gets
484 t/s PP and 19.3 t/s TG.
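
As a quick check of what these figures mean in relative terms, the ratios can be worked out as below. The dictionary keys (`new`, `layer`, `mainline_row`) are labels chosen for this sketch; the numbers are the ones quoted above for the 70B model.

```python
# Throughput figures quoted above for the 70B model, in tokens/second.
tg = {"new": 28.0, "layer": 24.4, "mainline_row": 19.3}
pp = {"new": 627.0, "layer": 744.0, "mainline_row": 484.0}


def ratio(results, baseline):
    """Throughput of the new code path relative to `baseline`."""
    return results["new"] / results[baseline]


for name, results in (("TG", tg), ("PP", pp)):
    for baseline in ("layer", "mainline_row"):
        print(f"{name} vs {baseline}: {ratio(results, baseline):.2f}x")
# TG vs layer: 1.15x
# TG vs mainline_row: 1.45x
# PP vs layer: 0.84x
# PP vs mainline_row: 1.30x
```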

* Fix attn split

Granularity for Wq, Wo is not just head size, but
head size * gqa_ratio.
Else the Wk, Wv tensors end up not being a multiple of the
head size when we divide the split determined by Wo with
the gqa_ratio.
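
A minimal sketch of the granularity argument, not the code from this commit: `split_attention` and its rounding policy are hypothetical, and the model dimensions (64 heads, 8 K/V heads, head size 128) are assumed values for a 70B LLaMA-style model with grouped-query attention.

```python
def split_attention(n_head, n_head_kv, head_size, fractions):
    """Split attention rows across devices.

    Returns one (wq_wo_rows, wk_wv_rows) pair per device. Wq/Wo rows are
    assigned in units of head_size * gqa_ratio, so dividing a device's
    share by gqa_ratio always yields a whole number of K/V heads.
    """
    assert n_head % n_head_kv == 0
    gqa_ratio = n_head // n_head_kv
    granularity = head_size * gqa_ratio
    n_units = n_head * head_size // granularity   # equals n_head_kv

    # Hand out whole units according to the requested fractions; the last
    # device takes whatever is left so the units always sum to n_units.
    total = sum(fractions)
    units, assigned = [], 0
    for f in fractions[:-1]:
        u = min(round(n_units * f / total), n_units - assigned)
        units.append(u)
        assigned += u
    units.append(n_units - assigned)

    return [(u * granularity, u * granularity // gqa_ratio) for u in units]


print(split_attention(64, 8, 128, [0.6, 0.4]))
# [(5120, 640), (3072, 384)]  -> 640 and 384 are multiples of 128

# Splitting Wq/Wo at head-size granularity instead can break Wk/Wv:
naive_wq_rows = round(64 * 0.6) * 128   # 38 heads -> 4864 rows
print(naive_wq_rows // 8 % 128)         # 96, i.e. not a whole K/V head
```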

* Show memory used per device

* Make it work with partial offload

but no tensor overrides yet, just ngl < num_layers.

* Allow for f16 source in fused_rms_norm

* This results in faster PP.

Now PP is faster than split mode layer for L3-70B.

* Rename split mode "row" to split mode "graph"

* Leave FFN partial results as f16

* WIP GLM4.5 - runs with wrong results

* WIP GLM4.5 - this works

PP is already better than split mode layer, but TG for zero context
is kind of low - 60 vs 92 t/s. TG becomes better than split mode layer
at around 20k tokens. PP at 26k tokens is 1.55X of sm layer.

* Work around compiler bug

It issues a warning that there is an extra semicolon outside of a function,
but there isn't. If I remove the anonymous namespace and turn the
functions inside into static, the warning disappears, so clearly
a compiler bug.

* Make graph reuse work with split mode graph

* Remove more split mode row remnants

* WIP tensor overrides

Runs with wrong results, don't see where the issue could be.

* This works but is slow

Still does not work for row-interleaved quants

* Slightly better

* Slightly better

* Row-interleaved quants work

* Better

* Minor

* Guard against using split mode "graph" for unsupported models

* Guards against using merge_qkv with split mode "graph"

* WIP split mode attn

Works for LlaMA models, but not for GLM-4.5.
Doesn't seem to improve performance, so I guess no point in trying to
fix it.

* Split mode graph for qwen3moe

* Try to better distribute the splits

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2025-12-01 19:25:40 +01:00

CMakeLists.txt

build : link against build info instead of compiling against it (#3879)

2023-11-02 08:50:16 +02:00

llama-bench.cpp

POC: CUDA tensor parallel (MoE models) (#1022)

2025-12-01 19:25:40 +01:00

README.md

build: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809)

2024-06-13 00:41:52 +01:00

README.md

llama.cpp/examples/llama-bench

Performance testing tool for llama.cpp.

Syntax
Examples
Output formats
1. Markdown
2. CSV
3. JSON
4. SQL

Syntax

usage: ./llama-bench [options]

options:
  -h, --help
  -m, --model <filename>              (default: models/7B/ggml-model-q4_0.gguf)
  -p, --n-prompt <n>                  (default: 512)
  -n, --n-gen <n>                     (default: 128)
  -pg <pp,tg>                         (default: 512,128)
  -b, --batch-size <n>                (default: 2048)
  -ub, --ubatch-size <n>              (default: 512)
  -ctk, --cache-type-k <t>            (default: f16)
  -ctv, --cache-type-v <t>            (default: f16)
  -t, --threads <n>                   (default: 16)
  -ngl, --n-gpu-layers <n>            (default: 99)
  -sm, --split-mode <none|layer|row>  (default: layer)
  -mg, --main-gpu <i>                 (default: 0)
  -nkvo, --no-kv-offload <0|1>        (default: 0)
  -fa, --flash-attn <0|1>             (default: 0)
  -mmp, --mmap <0|1>                  (default: 1)
  --numa <distribute|isolate|numactl> (default: disabled)
  -embd, --embeddings <0|1>           (default: 0)
  -ts, --tensor-split <ts0/ts1/..>    (default: 0)
  -r, --repetitions <n>               (default: 5)
  -o, --output <csv|json|md|sql>      (default: md)
  -v, --verbose                       (default: 0)

Multiple values can be given for each parameter by separating them with ',' or by specifying the parameter multiple times.

llama-bench can perform three types of tests:

Prompt processing (pp): processing a prompt in batches (-p)
Text generation (tg): generating a sequence of tokens (-n)
Prompt processing + text generation (pg): processing a prompt followed by generating a sequence of tokens (-pg)

With the exception of -r, -o and -v, all options can be specified multiple times to run multiple tests. Each pp and tg test is run with all combinations of the specified options. To specify multiple values for an option, the values can be separated by commas (e.g. -n 16,32), or the option can be specified multiple times (e.g. -n 16 -n 32).
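
The way multiple values expand into a test matrix can be sketched as follows. This is an illustration of the idea only: `parse_values` and `expand_tests` are hypothetical helpers, and the order in which llama-bench itself enumerates the combinations is not implied.

```python
from itertools import product


def parse_values(arg):
    """Split a comma-separated option value such as "16,32" into a list."""
    return [v for v in arg.split(",") if v]


def expand_tests(options):
    """Yield one dict per combination of the given option values.

    `options` maps an option name to the list of strings passed for it;
    repeated options (-n 16 -n 32) and comma lists (-n 16,32) are merged.
    """
    names = list(options)
    value_lists = [
        [v for arg in options[name] for v in parse_values(arg)]
        for name in names
    ]
    for combo in product(*value_lists):
        yield dict(zip(names, combo))


tests = list(expand_tests({"-n": ["16,32"], "-t": ["1", "2", "4"]}))
print(len(tests))   # 6
print(tests[0])     # {'-n': '16', '-t': '1'}
```

Two values for -n and three for -t give six tests, which is why adding values to several options at once can lengthen a benchmark run considerably.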

Each test is repeated the number of times given by -r, and the results are averaged. The results are given in average tokens per second (t/s) and standard deviation. Some output formats (e.g. json) also include the individual results of each repetition.
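
The relation between the per-repetition samples and the reported averages can be checked with a few lines of Python. The sample values are the `samples_ns` of the pp 512 test in the JSON example of this README; using the sample (n-1) standard deviation is an assumption that approximately reproduces the reported `stddev_ns` and `stddev_ts`, not a statement about the exact formula in llama-bench.

```python
from statistics import mean, stdev

# samples_ns of the pp 512 test from the JSON example in this README.
samples_ns = [213837238, 211635853, 212328053, 211329715, 212698907]
n_tokens = 512  # n_prompt of that test

# Tokens per second for each repetition: tokens / (nanoseconds * 1e-9).
samples_ts = [n_tokens * 1e9 / ns for ns in samples_ns]

avg_ns = mean(samples_ns)
avg_ts = mean(samples_ts)

print(round(avg_ns))                  # 212365953, the reported avg_ns
print(f"{avg_ts:.2f} t/s")            # compare with the reported avg_ts
print(f"{stdev(samples_ts):.2f} t/s") # compare with the reported stddev_ts
```

Note that `avg_ts` here is the mean of the per-repetition throughputs, which is not the same as `n_tokens / avg_ns`.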

For a description of the other options, see the main example.

Note:

When using the SYCL backend, llama-bench may hang in some cases. If that happens, set -mmp 0 (--mmap 0).

Examples

Text generation with different models

$ ./llama-bench -m models/7B/ggml-model-q4_0.gguf -m models/13B/ggml-model-q4_0.gguf -p 0 -n 128,256,512

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 128 | 132.19 ± 0.55 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 256 | 129.37 ± 0.54 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 512 | 123.83 ± 0.25 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | tg 128 | 82.17 ± 0.31 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | tg 256 | 80.74 ± 0.23 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | tg 512 | 78.08 ± 0.07 |

Prompt processing with different batch sizes

$ ./llama-bench -n 0 -p 1024 -b 128,256,512,1024

| model | size | params | backend | ngl | n_batch | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | --- | ---: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 128 | pp 1024 | 1436.51 ± 3.66 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 256 | pp 1024 | 1932.43 ± 23.48 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 512 | pp 1024 | 2254.45 ± 15.59 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1024 | pp 1024 | 2498.61 ± 13.58 |

Different numbers of threads

$ ./llama-bench -n 0 -n 16 -p 64 -t 1,2,4,8,16,32

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 1 | pp 64 | 6.17 ± 0.07 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 1 | tg 16 | 4.05 ± 0.02 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 2 | pp 64 | 12.31 ± 0.13 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 2 | tg 16 | 7.80 ± 0.07 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 4 | pp 64 | 23.18 ± 0.06 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 4 | tg 16 | 12.22 ± 0.07 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | pp 64 | 32.29 ± 1.21 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg 16 | 16.71 ± 0.66 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | pp 64 | 33.52 ± 0.03 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | tg 16 | 15.32 ± 0.05 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 32 | pp 64 | 59.00 ± 1.11 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 32 | tg 16 | 16.41 ± 0.79 |

Different numbers of layers offloaded to the GPU

$ ./llama-bench -ngl 10,20,30,31,32,33,34,35

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 10 | pp 512 | 373.36 ± 2.25 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 10 | tg 128 | 13.45 ± 0.93 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 20 | pp 512 | 472.65 ± 1.25 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 20 | tg 128 | 21.36 ± 1.94 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 30 | pp 512 | 631.87 ± 11.25 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 30 | tg 128 | 40.04 ± 1.82 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 31 | pp 512 | 657.89 ± 5.08 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 31 | tg 128 | 48.19 ± 0.81 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 32 | pp 512 | 688.26 ± 3.29 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 32 | tg 128 | 54.78 ± 0.65 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 33 | pp 512 | 704.27 ± 2.24 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 33 | tg 128 | 60.62 ± 1.76 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 34 | pp 512 | 881.34 ± 5.40 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 34 | tg 128 | 71.76 ± 0.23 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 35 | pp 512 | 2400.01 ± 7.72 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 35 | tg 128 | 131.66 ± 0.49 |

Output formats

By default, llama-bench outputs the results in markdown format. The results can be output in other formats by using the -o option.

Markdown

$ ./llama-bench -o md

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | pp 512 | 2368.80 ± 93.24 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 128 | 131.42 ± 0.59 |

CSV

$ ./llama-bench -o csv

build_commit,build_number,cuda,metal,gpu_blas,blas,cpu_info,gpu_info,model_filename,model_type,model_size,model_n_params,n_batch,n_threads,f16_kv,n_gpu_layers,main_gpu,mul_mat_q,tensor_split,n_prompt,n_gen,test_time,avg_ns,stddev_ns,avg_ts,stddev_ts
"3469684","1275","1","0","0","1","1","13th Gen Intel(R) Core(TM) i9-13900K","NVIDIA GeForce RTX 3090 Ti","models/7B/ggml-model-q4_0.gguf","llama 7B mostly Q4_0","3825065984","6738415616","512","16","1","99","0","1","0.00","512","0","2023-09-23T12:09:01Z","212155977","732372","2413.341687","8.305961"
"3469684","1275","1","0","0","1","1","13th Gen Intel(R) Core(TM) i9-13900K","NVIDIA GeForce RTX 3090 Ti","models/7B/ggml-model-q4_0.gguf","llama 7B mostly Q4_0","3825065984","6738415616","512","16","1","99","0","1","0.00","0","128","2023-09-23T12:09:02Z","969320879","2728399","132.052051","0.371342"

JSON

$ ./llama-bench -o json

[
  {
    "build_commit": "3469684",
    "build_number": 1275,
    "cuda": true,
    "metal": false,
    "gpu_blas": true,
    "blas": true,
    "cpu_info": "13th Gen Intel(R) Core(TM) i9-13900K",
    "gpu_info": "NVIDIA GeForce RTX 3090 Ti",
    "model_filename": "models/7B/ggml-model-q4_0.gguf",
    "model_type": "llama 7B mostly Q4_0",
    "model_size": 3825065984,
    "model_n_params": 6738415616,
    "n_batch": 512,
    "n_threads": 16,
    "f16_kv": true,
    "n_gpu_layers": 99,
    "main_gpu": 0,
    "mul_mat_q": true,
    "tensor_split": "0.00",
    "n_prompt": 512,
    "n_gen": 0,
    "test_time": "2023-09-23T12:09:57Z",
    "avg_ns": 212365953,
    "stddev_ns": 985423,
    "avg_ts": 2410.974041,
    "stddev_ts": 11.163766,
    "samples_ns": [ 213837238, 211635853, 212328053, 211329715, 212698907 ],
    "samples_ts": [ 2394.34, 2419.25, 2411.36, 2422.75, 2407.16 ]
  },
  {
    "build_commit": "3469684",
    "build_number": 1275,
    "cuda": true,
    "metal": false,
    "gpu_blas": true,
    "blas": true,
    "cpu_info": "13th Gen Intel(R) Core(TM) i9-13900K",
    "gpu_info": "NVIDIA GeForce RTX 3090 Ti",
    "model_filename": "models/7B/ggml-model-q4_0.gguf",
    "model_type": "llama 7B mostly Q4_0",
    "model_size": 3825065984,
    "model_n_params": 6738415616,
    "n_batch": 512,
    "n_threads": 16,
    "f16_kv": true,
    "n_gpu_layers": 99,
    "main_gpu": 0,
    "mul_mat_q": true,
    "tensor_split": "0.00",
    "n_prompt": 0,
    "n_gen": 128,
    "test_time": "2023-09-23T12:09:59Z",
    "avg_ns": 977425219,
    "stddev_ns": 9268593,
    "avg_ts": 130.965708,
    "stddev_ts": 1.238924,
    "samples_ns": [ 984472709, 974901233, 989474741, 970729355, 967548060 ],
    "samples_ts": [ 130.019, 131.295, 129.362, 131.86, 132.293 ]
  }
]
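
JSON output is convenient for post-processing. The sketch below parses two trimmed records in the shape shown above and prints one summary line per test. The `label_for` helper is hypothetical, and the label it uses for combined prompt + generation tests is an assumption, not the format llama-bench prints.

```python
import json

# Two trimmed records in the shape of the `-o json` output shown above.
raw = """
[
  {"model_type": "llama 7B mostly Q4_0", "n_prompt": 512, "n_gen": 0,
   "avg_ts": 2410.974041, "stddev_ts": 11.163766},
  {"model_type": "llama 7B mostly Q4_0", "n_prompt": 0, "n_gen": 128,
   "avg_ts": 130.965708, "stddev_ts": 1.238924}
]
"""


def label_for(record):
    """Build a short test label from n_prompt and n_gen."""
    if record["n_gen"] == 0:
        return f"pp {record['n_prompt']}"
    if record["n_prompt"] == 0:
        return f"tg {record['n_gen']}"
    # Assumed label for combined tests; chosen for this sketch only.
    return f"pp {record['n_prompt']} + tg {record['n_gen']}"


records = json.loads(raw)
for record in records:
    print(f"{record['model_type']:<22} {label_for(record):<8} "
          f"{record['avg_ts']:.2f} ± {record['stddev_ts']:.2f}")
```

In practice `raw` would be read from a file or from the standard output of `./llama-bench -o json`.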

SQL

SQL output is suitable for importing into a SQLite database. The output can be piped into the sqlite3 command line tool to add the results to a database.

$ ./llama-bench -o sql

CREATE TABLE IF NOT EXISTS test (
  build_commit TEXT,
  build_number INTEGER,
  cuda INTEGER,
  metal INTEGER,
  gpu_blas INTEGER,
  blas INTEGER,
  cpu_info TEXT,
  gpu_info TEXT,
  model_filename TEXT,
  model_type TEXT,
  model_size INTEGER,
  model_n_params INTEGER,
  n_batch INTEGER,
  n_threads INTEGER,
  f16_kv INTEGER,
  n_gpu_layers INTEGER,
  main_gpu INTEGER,
  mul_mat_q INTEGER,
  tensor_split TEXT,
  n_prompt INTEGER,
  n_gen INTEGER,
  test_time TEXT,
  avg_ns INTEGER,
  stddev_ns INTEGER,
  avg_ts REAL,
  stddev_ts REAL
);

INSERT INTO test (build_commit, build_number, cuda, metal, gpu_blas, blas, cpu_info, gpu_info, model_filename, model_type, model_size, model_n_params, n_batch, n_threads, f16_kv, n_gpu_layers, main_gpu, mul_mat_q, tensor_split, n_prompt, n_gen, test_time, avg_ns, stddev_ns, avg_ts, stddev_ts) VALUES ('3469684', '1275', '1', '0', '0', '1', '1', '13th Gen Intel(R) Core(TM) i9-13900K', 'NVIDIA GeForce RTX 3090 Ti', 'models/7B/ggml-model-q4_0.gguf', 'llama 7B mostly Q4_0', '3825065984', '6738415616', '512', '16', '1', '99', '0', '1', '0.00', '512', '0', '2023-09-23T12:10:30Z', '212693772', '743623', '2407.240204', '8.409634');
INSERT INTO test (build_commit, build_number, cuda, metal, gpu_blas, blas, cpu_info, gpu_info, model_filename, model_type, model_size, model_n_params, n_batch, n_threads, f16_kv, n_gpu_layers, main_gpu, mul_mat_q, tensor_split, n_prompt, n_gen, test_time, avg_ns, stddev_ns, avg_ts, stddev_ts) VALUES ('3469684', '1275', '1', '0', '0', '1', '1', '13th Gen Intel(R) Core(TM) i9-13900K', 'NVIDIA GeForce RTX 3090 Ti', 'models/7B/ggml-model-q4_0.gguf', 'llama 7B mostly Q4_0', '3825065984', '6738415616', '512', '16', '1', '99', '0', '1', '0.00', '0', '128', '2023-09-23T12:10:31Z', '977925003', '4037361', '130.891159', '0.537692');
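
As an alternative to piping into the sqlite3 command line tool, the statements can be loaded with Python's built-in `sqlite3` module. This is a minimal sketch: the SQL below is a trimmed stand-in that keeps only a few of the columns shown above, and an in-memory database is used so that the example leaves no file behind.

```python
import sqlite3

# Trimmed stand-in for the `-o sql` output: only a few of the columns
# shown above are kept so that the example stays short.
sql_output = """
CREATE TABLE IF NOT EXISTS test (
  build_commit TEXT,
  model_type TEXT,
  n_prompt INTEGER,
  n_gen INTEGER,
  avg_ts REAL,
  stddev_ts REAL
);
INSERT INTO test (build_commit, model_type, n_prompt, n_gen, avg_ts, stddev_ts)
  VALUES ('3469684', 'llama 7B mostly Q4_0', '512', '0', '2407.240204', '8.409634');
INSERT INTO test (build_commit, model_type, n_prompt, n_gen, avg_ts, stddev_ts)
  VALUES ('3469684', 'llama 7B mostly Q4_0', '0', '128', '130.891159', '0.537692');
"""

conn = sqlite3.connect(":memory:")   # pass a file path to keep the results
conn.executescript(sql_output)

# The quoted values are converted by SQLite's column type affinity,
# so n_prompt comes back as an integer and avg_ts as a float.
rows = conn.execute(
    "SELECT n_prompt, n_gen, avg_ts FROM test ORDER BY avg_ts DESC"
).fetchall()
for n_prompt, n_gen, avg_ts in rows:
    print(n_prompt, n_gen, avg_ts)
conn.close()
```

Because the table is created with IF NOT EXISTS, results from several runs can be appended to the same database and compared with ordinary SQL queries.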
