commit4b309e6ad8Author: Allison Piper <alliepiper16@gmail.com> Date: Sat Apr 6 13:19:14 2024 +0000 Minor cleanups commit476ed2ceaeAuthor: Allison Piper <alliepiper16@gmail.com> Date: Sat Apr 6 12:53:37 2024 +0000 WAR compiler ice in nlohmann json. Only seeing this on GCC 9 + CTK 11.1. Seems to be having trouble with the `[[no_unique_address]]` optimization. commita9bf1d3e42Author: Allison Piper <alliepiper16@gmail.com> Date: Sat Apr 6 00:24:47 2024 +0000 Bump nlohmann json. commit80980fe373Author: Allison Piper <alliepiper16@gmail.com> Date: Sat Apr 6 00:22:07 2024 +0000 Fix llvm filesystem support commitf6099e6311Author: Allison Piper <alliepiper16@gmail.com> Date: Fri Apr 5 23:18:44 2024 +0000 Drop MSVC 2017 testing. commit5ae50a8ef5Author: Allison Piper <alliepiper16@gmail.com> Date: Fri Apr 5 23:02:32 2024 +0000 Add mroe missing headers. commitb2a9ae04d9Author: Allison Piper <alliepiper16@gmail.com> Date: Fri Apr 5 22:37:56 2024 +0000 Remove old CUDA+MSVC builds and make windows build-only. commit5b18c26a28Author: Allison Piper <alliepiper16@gmail.com> Date: Fri Apr 5 22:37:07 2024 +0000 Fix header for std::min/max. Why do I always think it's utility instead of algorithm.... commit6a409efa2dAuthor: Allison Piper <alliepiper16@gmail.com> Date: Fri Apr 5 22:18:18 2024 +0000 Temporarily disable CUPTI on all windows builds. commitf432f88866Author: Allison Piper <alliepiper16@gmail.com> Date: Fri Apr 5 21:42:52 2024 +0000 Fix warnings on MSVC. commit829787649bAuthor: Allison Piper <alliepiper16@gmail.com> Date: Fri Apr 5 21:03:16 2024 +0000 More flailing about in powershell. commit21742e6beaAuthor: Allison Piper <alliepiper16@gmail.com> Date: Fri Apr 5 20:36:08 2024 +0000 Cleanup filesystem header handling. commitde3d202635Author: Allison Piper <alliepiper16@gmail.com> Date: Fri Apr 5 20:09:00 2024 +0000 Windows CI debugging. commita4151667ffAuthor: Allison Piper <alliepiper16@gmail.com> Date: Fri Apr 5 19:45:40 2024 +0000 Quotation mark madness commitdd04f3befeAuthor: Allison Piper <alliepiper16@gmail.com> Date: Fri Apr 5 19:27:27 2024 +0000 Temporarily disable NVML on windows CI until new containers are ready. commitf3952848c4Author: Allison Piper <alliepiper16@gmail.com> Date: Fri Apr 5 19:25:22 2024 +0000 WAR issues on gcc-7. commit198986875eAuthor: Allison Piper <alliepiper16@gmail.com> Date: Fri Apr 5 19:25:04 2024 +0000 More matrix/devcontainer updates. commitb9712f8696Author: Allison Piper <alliepiper16@gmail.com> Date: Fri Apr 5 18:30:35 2024 +0000 Fix windows build scripts. commit943f268280Author: Allison Piper <alliepiper16@gmail.com> Date: Fri Apr 5 18:18:33 2024 +0000 Fix warnings with clang host compiler. commit7063e1d60aAuthor: Allison Piper <alliepiper16@gmail.com> Date: Fri Apr 5 18:14:28 2024 +0000 More devcontainer hijinks. commit06532fde81Author: Allison Piper <alliepiper16@gmail.com> Date: Fri Apr 5 17:51:25 2024 +0000 More matrix updates. commit78a265ea55Author: Allison Piper <alliepiper16@gmail.com> Date: Fri Apr 5 17:34:00 2024 +0000 Support CLI CMake options for windows ci scripts. commit670895c867Author: Allison Piper <alliepiper16@gmail.com> Date: Fri Apr 5 17:31:59 2024 +0000 Add missing devcontainers. commitb121823e74Author: Allison Piper <alliepiper16@gmail.com> Date: Fri Apr 5 17:22:54 2024 +0000 Build for `all-major` architectures in presets. We can get away with this because we require CMake 3.23.1. This was added in 3.23. commitfccfd44685Author: Allison Piper <alliepiper16@gmail.com> Date: Fri Apr 5 17:22:08 2024 +0000 Update matrix file. commite7d43ba90eAuthor: Allison Piper <alliepiper16@gmail.com> Date: Fri Apr 5 16:23:48 2024 +0000 Consolidate build/test jobs. commitc4044056ecAuthor: Allison Piper <alliepiper16@gmail.com> Date: Fri Apr 5 16:04:11 2024 +0000 Add missing build script.
Overview
This project is a work-in-progress. Everything is subject to change.
NVBench is a C++17 library designed to simplify CUDA kernel benchmarking. It features:
- Parameter sweeps: a powerful and flexible "axis" system explores a kernel's configuration space. Parameters may be dynamic numbers/strings or static types.
- Runtime customization: A rich command-line interface allows redefinition of parameter axes, CUDA device selection, locking GPU clocks (Volta+), changing output formats, and more.
- Throughput calculations: Compute
and report:
- Item throughput (elements/second)
- Global memory bandwidth usage (bytes/second and per-device %-of-peak-bw)
- Multiple output formats: Currently supports markdown (default) and CSV output.
- Manual timer mode: (optional) Explicitly start/stop timing in a benchmark implementation.
- Multiple measurement types:
- Cold Measurements:
- Each sample runs the benchmark once with a clean device L2 cache.
- GPU and CPU times are reported.
- Batch Measurements:
- Executes the benchmark multiple times back-to-back and records total time.
- Reports the average execution time (total time / number of executions).
- Cold Measurements:
Getting Started
Minimal Benchmark
A basic kernel benchmark can be created with just a few lines of CUDA C++:
void my_benchmark(nvbench::state& state) {
state.exec([](nvbench::launch& launch) {
my_kernel<<<num_blocks, 256, 0, launch.get_stream()>>>();
});
}
NVBENCH_BENCH(my_benchmark);
See Benchmarks for information on customizing benchmarks and implementing parameter sweeps.
Command Line Interface
Each benchmark executable produced by NVBench provides a rich set of command-line options for configuring benchmark execution at runtime. See the CLI overview and CLI axis specification for more information.
Examples
This repository provides a number of examples that demonstrate various NVBench features and usecases:
- Runtime and compile-time parameter sweeps
- Enums and compile-time-constant-integral parameter axes
- Reporting item/sec and byte/sec throughput statistics
- Skipping benchmark configurations
- Benchmarking on a specific stream
- Benchmarks that sync CUDA devices:
nvbench::exec_tag::sync - Manual timing:
nvbench::exec_tag::timer
Building Examples
To build the examples:
mkdir -p build
cd build
cmake -DNVBench_ENABLE_EXAMPLES=ON -DCMAKE_CUDA_ARCHITECTURES=70 .. && make
Be sure to set CMAKE_CUDA_ARCHITECTURE based on the GPU you are running on.
Examples are built by default into build/bin and are prefixed with nvbench.example.
Example output from `nvbench.example.throughput`
# Devices
## [0] `Quadro GV100`
* SM Version: 700 (PTX Version: 700)
* Number of SMs: 80
* SM Default Clock Rate: 1627 MHz
* Global Memory: 32163 MiB Free / 32508 MiB Total
* Global Memory Bus Peak: 870 GiB/sec (4096-bit DDR @850MHz)
* Max Shared Memory: 96 KiB/SM, 48 KiB/Block
* L2 Cache Size: 6144 KiB
* Maximum Active Blocks: 32/SM
* Maximum Active Threads: 2048/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: No
# Log
Run: throughput_bench [Device=0]
Warn: Current measurement timed out (15.00s) while over noise threshold (1.26% > 0.50%)
Pass: Cold: 0.262392ms GPU, 0.267860ms CPU, 7.19s total GPU, 27393x
Pass: Batch: 0.261963ms GPU, 7.18s total GPU, 27394x
# Benchmark Results
## throughput_bench
### [0] Quadro GV100
| NumElements | DataSize | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWPeak | Batch GPU | Batch |
|-------------|------------|---------|------------|-------|------------|-------|---------|---------------|--------|------------|--------|
| 16777216 | 64.000 MiB | 27393x | 267.860 us | 1.25% | 262.392 us | 1.26% | 63.940G | 476.387 GiB/s | 58.77% | 261.963 us | 27394x |
Demo Project
To get started using NVBench with your own kernels, consider trying out the NVBench Demo Project.
nvbench_demo provides a simple CMake project that uses NVBench to build an
example benchmark. It's a great way to experiment with the library without a lot
of investment.
Contributing
Contributions are welcome!
For current issues, see the issue board. Issues labeled with are good for first time contributors.
Tests
To build nvbench tests:
mkdir -p build
cd build
cmake -DNVBench_ENABLE_TESTING=ON .. && make
Tests are built by default into build/bin and prefixed with nvbench.test.
To run all tests:
make test
or
ctest
License
NVBench is released under the Apache 2.0 License with LLVM exceptions. See LICENSE.
Scope and Related Projects
NVBench will measure the CPU and CUDA GPU execution time of a single host-side critical region per benchmark. It is intended for regression testing and parameter tuning of individual kernels. For in-depth analysis of end-to-end performance of multiple applications, the NVIDIA Nsight tools are more appropriate.
NVBench is focused on evaluating the performance of CUDA kernels and is not optimized for CPU microbenchmarks. This may change in the future, but for now, consider using Google Benchmark for high resolution CPU benchmarks.