mirror of https://github.com/NVIDIA/nvbench.git synced 2026-06-29 10:47:36 +00:00

Oleksandr Pavlyk 1d13b49996 Add scoped filtering and device pairing to nvbench_compare

Teach nvbench_compare to keep the order of --benchmark and --axis arguments so
axis filters can apply either globally or to the most recent benchmark. Build a
filter plan from the ordered CLI arguments and apply the same plan to table
output and plotting labels.

Add explicit --reference-devices and --compare-devices filters. The filters
accept all, a single device id, or a comma-separated list of ids; ordered lists
and duplicates are preserved so selected reference and compare devices can be
paired by position. Device-section mismatches remain fatal for unfiltered
all-vs-all comparisons, but become warnings when the user explicitly selects
devices and the selected device counts match.
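
The filter syntax described above (all, a single id, or a comma-separated list that keeps order and duplicates) could be parsed roughly as follows. This is a minimal sketch, not the code from nvbench_compare: the function names `parse_device_filter` and `pair_devices` are hypothetical, and returning `None` for "all" is an assumption made for illustration.

```python
import re


def parse_device_filter(text):
    """Parse 'all', 'ID', or 'ID,ID,...'.

    Returns None for 'all' (assumed convention), otherwise a list of ints.
    Order and duplicates are preserved so lists can be paired by position.
    """
    value = text.strip()
    if value == "all":
        return None
    ids = []
    for token in value.split(","):
        token = token.strip()
        # Accept plain decimal digits only; reject empty or signed tokens.
        if not re.fullmatch(r"[0-9]+", token):
            raise ValueError(f"invalid device id {token!r} in {text!r}")
        ids.append(int(token))
    return ids


def pair_devices(ref_ids, cmp_ids):
    """Pair selected reference and compare devices by position."""
    if len(ref_ids) != len(cmp_ids):
        raise ValueError(
            f"selected device counts differ: {len(ref_ids)} vs {len(cmp_ids)}"
        )
    return list(zip(ref_ids, cmp_ids))
```

With this sketch, `pair_devices(parse_device_filter("1,0"), parse_device_filter("0,1"))` pairs reference device 1 with compare device 0, and reference device 0 with compare device 1.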

Match duplicate benchmark states by occurrence within each filtered device
section instead of matching only by state name across the whole benchmark. This
keeps repeated axis values and filtered duplicate states aligned between the
reference and compare inputs, and reports mismatched occurrence counts instead
of silently dropping extra states.
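
Occurrence-based matching can be illustrated with a small sketch. It assumes each state is a dict with a "name" key and that the lists passed in already belong to one filtered device section; `match_by_occurrence` is a hypothetical name, not a function from nvbench_compare.

```python
from collections import defaultdict


def match_by_occurrence(ref_states, cmp_states):
    """Pair the n-th compare state named X with the n-th reference state named X.

    Returns (pairs, count_mismatches), where count_mismatches maps a state
    name to (reference count, compare count) whenever the two differ.
    """
    ref_by_name = defaultdict(list)
    for state in ref_states:
        ref_by_name[state["name"]].append(state)

    pairs = []
    cmp_seen = defaultdict(int)  # occurrences of each name seen so far
    for state in cmp_states:
        name = state["name"]
        index = cmp_seen[name]
        cmp_seen[name] += 1
        candidates = ref_by_name.get(name, [])
        if index < len(candidates):
            pairs.append((candidates[index], state))

    # Report differing occurrence counts instead of dropping extras silently.
    count_mismatches = {}
    for name in sorted(set(ref_by_name) | set(cmp_seen)):
        ref_count = len(ref_by_name.get(name, []))
        cmp_count = cmp_seen.get(name, 0)
        if ref_count != cmp_count:
            count_mismatches[name] = (ref_count, cmp_count)
    return pairs, count_mismatches
```

Matching only by name would pair every duplicate compare state with the first reference state; tracking the occurrence index keeps the second with the second, and so on.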

Add Python tests for duplicate-state matching, axis filtering before matching,
device filter parsing and validation, explicit cross-device pairing, and
benchmark-scoped axis filters.

Original commit messages folded into this change:

Tweaks for nvbench_compare

1. When JSON files contain multiple entries with the same name and axis values,
   make sure that the script compares corresponding entries.

   The previous logic would extract the first entry from the ref data and
   compare the measurements for each state in cmp against that first entry. The
   change introduces a counter that tracks which nth entry is being processed
   for a particular axis value, and retrieves the corresponding entry in ref.

Scope occurrence matching by device.

Device pairing in nvbench_compare.py is strictly index-based under
--ignore-devices, so reused IDs in a different order no longer pair against the
wrong reference device.

Require devices in ref and cmp to have the same cardinality

Handle mismatch when the number of duplicates in ref data is not the same as in
cmp data

Use pytest monkeypatch fixture to pretend third-party package dependencies are
available during test run for nvbench_compare without introducing test-time
dependency
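
One way to make optional imports succeed in a test without installing anything is to register placeholder modules in `sys.modules` through the pytest `monkeypatch` fixture, which restores the mapping after the test. The sketch below is an illustration only: the helper name and the module name `hypothetical_plotting_lib` are made up and do not refer to the packages the real tests stub.

```python
import sys
import types


def install_stub_modules(monkeypatch, names):
    """Register empty placeholder modules so `import <name>` succeeds."""
    for name in names:
        # monkeypatch.setitem undoes the change when the test finishes.
        monkeypatch.setitem(sys.modules, name, types.ModuleType(name))


def test_import_with_stubbed_dependency(monkeypatch):
    install_stub_modules(monkeypatch, ["hypothetical_plotting_lib"])
    # The import system finds the stub in sys.modules and returns it.
    import hypothetical_plotting_lib

    assert hypothetical_plotting_lib.__name__ == "hypothetical_plotting_lib"
```

A stub created this way has no attributes, so code under test that actually calls into the package would still need those attributes set on the placeholder module.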

Added the happy-path test and fixed its direct-call setup by initializing the
device globals that main() normally populates.

Fix to filter-before-matching.

 - compare_benches() now pairs devices by selected position instead of taking a
   device id.
 - For each device pair, compare_benches() now builds:
     - ref_device_states: matching reference device and axis filters
     - cmp_device_states: matching compare device and axis filters
 - State occurrence counts and duplicate occurrence matching now operate only
   on those filtered per-device lists.
 - Removed the later matches_axis_filters() skip inside the compare-state loop
   because filtering now happens before matching.

Added a regression test where ref/cmp have duplicate state names in opposite
order, and --axis keeps only one of them. The test verifies the kept compare
state is matched against the kept reference state, not the first unfiltered
occurrence.
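
The regression scenario can be reproduced in a few lines. This is a sketch under assumptions: states are modelled as dicts with "name", "axes", and "time" keys, axis filters as a plain dict, and the helper names are hypothetical rather than taken from nvbench_compare.

```python
def state_passes(state, axis_filters):
    """True when the state has the requested value for every filtered axis."""
    return all(state["axes"].get(k) == v for k, v in axis_filters.items())


def filter_then_match(ref_states, cmp_states, axis_filters):
    """Apply axis filters first, then pair duplicates by occurrence."""
    ref_kept = [s for s in ref_states if state_passes(s, axis_filters)]
    cmp_kept = [s for s in cmp_states if state_passes(s, axis_filters)]

    pairs = []
    seen = {}  # state name -> number of compare occurrences processed
    for cmp_state in cmp_kept:
        name = cmp_state["name"]
        index = seen.get(name, 0)
        seen[name] = index + 1
        candidates = [s for s in ref_kept if s["name"] == name]
        if index < len(candidates):
            pairs.append((candidates[index], cmp_state))
    return pairs


# Duplicate state names appear in opposite order in ref and cmp.
ref = [
    {"name": "bench", "axes": {"N": 1}, "time": 10.0},
    {"name": "bench", "axes": {"N": 2}, "time": 20.0},
]
cmp_ = [
    {"name": "bench", "axes": {"N": 2}, "time": 21.0},
    {"name": "bench", "axes": {"N": 1}, "time": 11.0},
]
pairs = filter_then_match(ref, cmp_, {"N": 2})
```

Here `pairs` holds a single pair whose reference state has N equal to 2. Matching before filtering would instead have paired the kept compare state with the first unfiltered reference occurrence, the one with N equal to 1.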

Introduce device filtering in nvbench_compare

 - --reference-devices all|ID|ID,ID,...
 - --compare-devices all|ID|ID,ID,...
 - Integer lists preserve order and duplicates.
 - Requested IDs are validated against the file-level device list.
 - Filtered reference/compare device counts must match before comparison.
 - compare_benches() pairs selected reference and compare devices by position.
 - Each benchmark validates that requested device IDs are present in its own
   devices list.

Implemented benchmark-scoped --axis handling.

  - --axis and --benchmark now share an ordered argparse action, so their
    relative CLI order is preserved.
  - -a before any -b becomes a global axis filter.
  - -a after -b <name> applies to that most recent benchmark only.
  - Repeated -b entries are treated as separate filter scopes and combined as
    alternatives for that benchmark.
  - Device filtering remains global and is applied independently.
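
The scoping rules above depend on keeping the relative order of -a and -b on the command line, which a shared custom argparse action can provide. The following is a minimal sketch, not the actual implementation: `OrderedAppend`, `build_filter_plan`, the `ordered_filters` attribute, and the NAME=VALUE axis strings are assumptions made for illustration.

```python
import argparse


class OrderedAppend(argparse.Action):
    """Append (dest, value) to one shared list so CLI order is preserved."""

    def __call__(self, parser, namespace, values, option_string=None):
        ordered = getattr(namespace, "ordered_filters", None)
        if ordered is None:
            ordered = []
            setattr(namespace, "ordered_filters", ordered)
        ordered.append((self.dest, values))


def build_filter_plan(ordered):
    """Split ordered arguments into global axes and per-benchmark scopes."""
    global_axes = []
    scopes = []  # one (benchmark name, [axis filters]) entry per -b
    for kind, value in ordered:
        if kind == "benchmark":
            scopes.append((value, []))
        elif scopes:
            scopes[-1][1].append(value)  # -a after -b: most recent benchmark
        else:
            global_axes.append(value)  # -a before any -b: global filter
    return global_axes, scopes


parser = argparse.ArgumentParser()
parser.add_argument("-b", "--benchmark", action=OrderedAppend)
parser.add_argument("-a", "--axis", action=OrderedAppend)

args = parser.parse_args(
    ["-a", "T=I32", "-b", "copy", "-a", "N=4", "-b", "copy", "-a", "N=8"]
)
global_axes, scopes = build_filter_plan(getattr(args, "ordered_filters", []))
```

In this run `global_axes` is `["T=I32"]` and `scopes` is `[("copy", ["N=4"]), ("copy", ["N=8"])]`, so the repeated -b entries stay separate scopes that can be treated as alternatives for the same benchmark.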

Allow non-matching devices for explicit device selection

Now the device-section equality check remains fatal only for unfiltered
all-vs-all comparisons. If either --reference-devices or --compare-devices is
explicit, mismatched selected device metadata is printed as a warning, but
comparison proceeds after the selected device counts have been validated.
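
The policy can be summarised as a small decision function. This is a sketch of the rule as described in the message, with a hypothetical function name and with device sections modelled as plain lists of dicts.

```python
def check_device_sections(ref_devices, cmp_devices, explicit_selection):
    """Return None when sections agree, a warning string when a mismatch is
    tolerated, and raise ValueError when the mismatch is fatal."""
    if len(ref_devices) != len(cmp_devices):
        # Counts must match before any comparison can proceed.
        raise ValueError(
            f"device counts differ: {len(ref_devices)} vs {len(cmp_devices)}"
        )
    if ref_devices == cmp_devices:
        return None
    if not explicit_selection:
        # Unfiltered all-vs-all comparison: a mismatch stays fatal.
        raise ValueError("device sections do not match")
    # The user explicitly selected devices: warn and continue.
    return "warning: selected device metadata differs; comparing anyway"
```

Here `explicit_selection` stands for "either --reference-devices or --compare-devices was given explicitly".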

Fix for resolve_benchmark_device_ids, add comments

The return value of resolve_benchmark_device_ids now always owns its list.

Use monkeypatch class in set_test_devices helper

Stricter device id validation

Test for device id validation

2026-06-02 11:48:01 -05:00

| Name | Last commit | Date |
|------|-------------|------|
| .devcontainer | Updated devcontainer imageset to 26.08 (#381) | 2026-06-01 11:02:40 -05:00 |
| .github | Add Windows support (#354) | 2026-05-19 15:10:58 -05:00 |
| | Remove unused python_wheel section (#382) | 2026-06-01 14:04:38 -05:00 |
| cmake | Add Windows support (#354) | 2026-05-19 15:10:58 -05:00 |
| docs | Reduce stdrel criterion complexity and ensure termination (#374) | 2026-05-29 17:06:28 +00:00 |
| examples | Add Windows support (#354) | 2026-05-19 15:10:58 -05:00 |
| exec | Add Windows support (#354) | 2026-05-19 15:10:58 -05:00 |
| nvbench | Reduce stdrel criterion complexity and ensure termination (#374) | 2026-05-29 17:06:28 +00:00 |
| python | Add scoped filtering and device pairing to nvbench_compare | 2026-06-02 11:48:01 -05:00 |
| testing | Reduce stdrel criterion complexity and ensure termination (#374) | 2026-05-29 17:06:28 +00:00 |
| .clang-format | Formatting updates. | 2025-04-14 17:26:12 +00:00 |
| .clangd | Add initial CI, presets, and devcontainers. | 2024-04-04 21:42:43 +00:00 |
| .coderabbit.yaml | Add .coderabbit.yaml following in footsteps of CCCL (#359) | 2026-05-12 13:55:46 -05:00 |
| .git-blame-ignore-revs | Update blame-ignore file. | 2025-04-14 17:31:13 +00:00 |
| .gitignore | Add multi-cuda wheel build (#289) | 2026-01-28 10:37:55 -05:00 |
| .pre-commit-config.yaml | [pre-commit.ci] pre-commit autoupdate (#333) | 2026-04-13 16:24:55 +00:00 |
| CMakeLists.txt | Add Windows support (#354) | 2026-05-19 15:10:58 -05:00 |
| CMakePresets.json | Squashed commit of the following: | 2024-04-06 22:05:40 +00:00 |
| CODE_OF_CONDUCT.md | Added CODE_OF_CONDUCT.md. | 2021-03-04 16:17:53 -05:00 |
| LICENSE | Add LICENSE file and boilerplate. | 2021-03-04 16:03:02 -05:00 |
| pyproject.toml | Add pre-commit.ci configs. | 2025-04-14 12:23:44 -04:00 |
| README.md | Update youtube link URL. (#226) | 2025-05-10 10:11:58 -04:00 |
| requirements.txt | Initial implementation of nvbench_histogram. | 2022-01-11 15:06:54 -05:00 |

README.md

Overview

This project is a work-in-progress. Everything is subject to change.

NVBench is a C++17 library designed to simplify CUDA kernel benchmarking. It features:

- Parameter sweeps: a powerful and flexible "axis" system explores a kernel's configuration space. Parameters may be dynamic numbers/strings or static types.
- Runtime customization: A rich command-line interface allows redefinition of parameter axes, CUDA device selection, locking GPU clocks (Volta+), changing output formats, and more.
- Throughput calculations: Compute and report:
  - Item throughput (elements/second)
  - Global memory bandwidth usage (bytes/second and per-device %-of-peak-bw)
- Multiple output formats: Currently supports markdown (default) and CSV output.
- Manual timer mode: (optional) Explicitly start/stop timing in a benchmark implementation.
- Multiple measurement types:
  - Cold Measurements:
    - Each sample runs the benchmark once with a clean device L2 cache.
    - GPU and CPU times are reported.
  - Batch Measurements:
    - Executes the benchmark multiple times back-to-back and records total time.
    - Reports the average execution time (total time / number of executions).
  - CPU-only Measurements:
    - Measures the host-side execution time of a non-GPU benchmark.
    - Not suitable for microbenchmarking.

Check out this talk for an overview of the challenges inherent to CUDA kernel benchmarking and how NVBench solves them for you!

Supported Compilers and Tools

CMake > 3.30.4
CUDA Toolkit + nvcc: 12.0 and above
g++: 7 -> 14
clang++: 14 -> 19
Headers are tested with C++17 -> C++20.

Getting Started

Minimal Benchmark

A basic kernel benchmark can be created with just a few lines of CUDA C++:

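#include <nvbench/nvbench.cuh>

// my_kernel and num_blocks are placeholders for your own kernel and grid size.
void my_benchmark(nvbench::state& state) {
  // NVBench times the work launched on the stream it provides.
  state.exec([](nvbench::launch& launch) {
    my_kernel<<<num_blocks, 256, 0, launch.get_stream()>>>();
  });
}
NVBENCH_BENCH(my_benchmark);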

See Benchmarks for information on customizing benchmarks and implementing parameter sweeps.

Command Line Interface

Each benchmark executable produced by NVBench provides a rich set of command-line options for configuring benchmark execution at runtime. See the CLI overview and CLI axis specification for more information.

Examples

This repository provides a number of examples that demonstrate various NVBench features and use cases.

Building Examples

To build the examples:

mkdir -p build
cd build
cmake -DNVBench_ENABLE_EXAMPLES=ON -DCMAKE_CUDA_ARCHITECTURES=70 .. && make

Be sure to set CMAKE_CUDA_ARCHITECTURES based on the GPU you are running on.

Examples are built by default into build/bin and are prefixed with nvbench.example.

Example output from `nvbench.example.throughput`

# Devices

## [0] `Quadro GV100`
* SM Version: 700 (PTX Version: 700)
* Number of SMs: 80
* SM Default Clock Rate: 1627 MHz
* Global Memory: 32163 MiB Free / 32508 MiB Total
* Global Memory Bus Peak: 870 GiB/sec (4096-bit DDR @850MHz)
* Max Shared Memory: 96 KiB/SM, 48 KiB/Block
* L2 Cache Size: 6144 KiB
* Maximum Active Blocks: 32/SM
* Maximum Active Threads: 2048/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: No

# Log

Run:  throughput_bench [Device=0]
Warn: Current measurement timed out (15.00s) while over noise threshold (1.26% > 0.50%)
Pass: Cold: 0.262392ms GPU, 0.267860ms CPU, 7.19s total GPU, 27393x
Pass: Batch: 0.261963ms GPU, 7.18s total GPU, 27394x

# Benchmark Results

## throughput_bench

### [0] Quadro GV100

| NumElements |  DataSize  | Samples |  CPU Time  | Noise |  GPU Time  | Noise | Elem/s  | GlobalMem BW  | BWPeak | Batch GPU  | Batch  |
|-------------|------------|---------|------------|-------|------------|-------|---------|---------------|--------|------------|--------|
|    16777216 | 64.000 MiB |  27393x | 267.860 us | 1.25% | 262.392 us | 1.26% | 63.940G | 476.387 GiB/s | 58.77% | 261.963 us | 27394x |
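
The Elem/s and GlobalMem BW columns can be cross-checked from the other columns of this row. The sketch below assumes that each 4-byte element is read once and written once (an assumption inferred from the numbers; the output itself does not state it), so the bytes moved are twice the 64 MiB DataSize.

```python
def throughput(num_elements, bytes_moved, gpu_time_s):
    """Return (elements per second, bytes per second) for one measurement."""
    return num_elements / gpu_time_s, bytes_moved / gpu_time_s


num_elements = 16_777_216        # NumElements column
gpu_time_s = 262.392e-6          # GPU Time column, in seconds
element_size = 4                 # 64 MiB / 16777216 elements = 4 bytes each
# Assumption: one read plus one write per element.
bytes_moved = 2 * element_size * num_elements

elem_per_s, bytes_per_s = throughput(num_elements, bytes_moved, gpu_time_s)
print(f"{elem_per_s / 1e9:.3f} G elem/s")
print(f"{bytes_per_s / 2**30:.3f} GiB/s")
```

Both values agree with the table to within rounding of the displayed GPU time.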

Demo Project

To get started using NVBench with your own kernels, consider trying out the NVBench Demo Project.

nvbench_demo provides a simple CMake project that uses NVBench to build an example benchmark. It's a great way to experiment with the library without a lot of investment.

Contributing

Contributions are welcome!

For current issues, see the issue board. Issues carrying the label for first-time contributors are a good place to start.

Tests

To build nvbench tests:

mkdir -p build
cd build
cmake -DNVBench_ENABLE_TESTING=ON .. && make

Tests are built by default into build/bin and prefixed with nvbench.test.

To run all tests:

make test

or

ctest

License

NVBench is released under the Apache 2.0 License with LLVM exceptions. See LICENSE.

Scope and Related Projects

NVBench will measure the CPU and CUDA GPU execution time of a single host-side critical region per benchmark. It is intended for regression testing and parameter tuning of individual kernels. For in-depth analysis of end-to-end performance of multiple applications, the NVIDIA Nsight tools are more appropriate.

NVBench is focused on evaluating the performance of CUDA kernels. It also provides CPU-only benchmarking facilities intended for non-trivial CPU workloads, but is not optimized for CPU microbenchmarks. This may change in the future, but for now, consider using Google Benchmark for high resolution CPU benchmarks.

Languages

Cuda 55.2%

Python 19.6%

C++ 14.4%

CMake 4.9%

Shell 4.2%

Other 1.7%
