mirror of https://github.com/NVIDIA/nvbench.git synced 2026-06-29 18:57:44 +00:00

Files

Oleksandr Pavlyk 48b7f61da3 Implement clear-gap comparison for early FAST/SLOW decision

Implemented the clear-gap comparison, with the log-distance-equivalent
algebra and pessimistic SM-clock fallback.

What changed:

 - Added TimingInterval and interval construction from summaries:
    - robust interval: [min, q3], centered at median
    - fallback interval: [mean - stdev, mean + stdev] clipped to [min, max]
 - Added CLEAR_GAP_RELATIVE_THRESHOLD = 0.005.
 - FAST gap uses:

   (ref.lower - cmp.upper) / cmp.upper >= delta
   which is equivalent to log(ref.lower / cmp.upper) >= log(1 + delta).
 - SLOW gap uses:

   (cmp.lower - ref.upper) / ref.upper >= delta
 - FAST/SLOW now requires SM clock summaries on both sides and the same clear-gap result after scaling intervals by sm_clock_rate_mean.
 - If intervals are missing, overlap, fail the gap threshold, have missing/invalid clock summaries, or time/cycle comparison disagrees, status is UNDECIDED.
 - Existing center/noise values are still computed and displayed, but no longer drive FAST/SLOW/SAME classification.
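
The decision rule described above can be sketched in a few lines of Python. This is a minimal illustration written from the commit message alone, not the code in the commit: the summary key names (`min`, `q3`, `mean`, `stdev`, `max`), the helper names, and the choice to multiply time intervals by the mean SM clock rate to get cycle intervals are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Threshold value quoted in the commit message.
CLEAR_GAP_RELATIVE_THRESHOLD = 0.005


@dataclass(frozen=True)
class TimingInterval:
    lower: float
    upper: float

    def scaled(self, factor: float) -> "TimingInterval":
        # Assumption: time * clock rate gives an interval in SM cycles.
        return TimingInterval(self.lower * factor, self.upper * factor)


def robust_interval(summary: dict) -> Optional[TimingInterval]:
    # Robust interval [min, q3]; key names are hypothetical.
    if "min" not in summary or "q3" not in summary:
        return None
    return TimingInterval(summary["min"], summary["q3"])


def fallback_interval(summary: dict) -> Optional[TimingInterval]:
    # [mean - stdev, mean + stdev] clipped to [min, max]; key names are hypothetical.
    if any(key not in summary for key in ("mean", "stdev", "min", "max")):
        return None
    lower = max(summary["mean"] - summary["stdev"], summary["min"])
    upper = min(summary["mean"] + summary["stdev"], summary["max"])
    return TimingInterval(lower, upper)


def clear_gap(ref: TimingInterval, cmp: TimingInterval,
              delta: float = CLEAR_GAP_RELATIVE_THRESHOLD) -> Optional[str]:
    # FAST: cmp lies entirely below ref by a relative gap of at least delta.
    # (a - b) / b >= delta is the same test as log(a / b) >= log(1 + delta).
    if cmp.upper > 0 and (ref.lower - cmp.upper) / cmp.upper >= delta:
        return "FAST"
    # SLOW: cmp lies entirely above ref by a relative gap of at least delta.
    if ref.upper > 0 and (cmp.lower - ref.upper) / ref.upper >= delta:
        return "SLOW"
    return None  # overlapping intervals, or gap below the threshold


def classify(ref: Optional[TimingInterval], cmp: Optional[TimingInterval],
             ref_clock: Optional[float], cmp_clock: Optional[float]) -> str:
    if ref is None or cmp is None:
        return "UNDECIDED"
    time_verdict = clear_gap(ref, cmp)
    if time_verdict is None:
        return "UNDECIDED"
    # Missing or invalid clock summaries cannot confirm the time verdict.
    if not ref_clock or not cmp_clock or ref_clock <= 0 or cmp_clock <= 0:
        return "UNDECIDED"
    cycle_verdict = clear_gap(ref.scaled(ref_clock), cmp.scaled(cmp_clock))
    return time_verdict if cycle_verdict == time_verdict else "UNDECIDED"
```

With a reference interval of [1.00, 1.02] and a comparison interval of [0.90, 0.95], equal clock rates give FAST. Dropping either clock rate, or using a clock shift large enough to flip the ordering in cycles, gives UNDECIDED, which matches the behaviour the commit message lists for the updated tests.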

Updated tests to cover:

 - center/noise-only comparisons becoming UNDECIDED
 - clear FAST/SLOW with matching clock evidence
 - missing clock fallback to UNDECIDED
 - frequency-shift disagreement becoming UNDECIDED
 - regression reporting with robust interval and clock evidence

2026-06-03 07:13:46 -05:00

cuda/bench

Fix docutil error when building docs (#365)

2026-05-18 10:57:19 -05:00

examples

Implement Timer, and support State.exec(fn, timer=True) (#364)

2026-05-15 10:19:40 -05:00

scripts

Implement clear-gap comparison for early FAST/SLOW decision

2026-06-03 07:13:46 -05:00

src

Add python api for cold warmup parameters (#363)

2026-05-18 10:56:44 -05:00

test

Implement clear-gap comparison for early FAST/SLOW decision

2026-06-03 07:13:46 -05:00

.gitignore

Draft of Python API for NVBench

2025-07-28 15:37:04 -05:00

CMakeLists.txt

Disable CUPTI in cmake file

2026-02-02 16:03:15 -06:00

pyproject.toml

Provide BenchmarkResult class for parsing JSON output of NVBench-instrumented benchmarks (#356)

2026-05-13 13:23:58 -05:00

README.md

Add installation instructions

2026-01-30 09:32:44 -06:00

README.md

CUDA Kernel Benchmarking Package

This package provides a Python API to the CUDA Kernel Benchmarking Library NVBench.

Installation

Install from PyPI

pip install cuda-bench[cu13]  # For CUDA 13.x
pip install cuda-bench[cu12]  # For CUDA 12.x

Building from source

Ensure a recent version of CMake

NVBench requires a fairly recent version of CMake (>=3.30.4). Either build CMake from source, or create a conda environment that provides a recent CMake:

conda create -n build_env --yes cmake ninja
conda activate build_env
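
To confirm that the active environment meets the CMake requirement before building, a small check like the following may help. It is only a sketch: the helper names are hypothetical, and it assumes `cmake --version` prints a first line of the form `cmake version X.Y.Z`.

```python
import re
import subprocess
from typing import Optional, Tuple

# Minimum version stated in the README.
MIN_CMAKE = (3, 30, 4)


def parse_cmake_version(text: str) -> Optional[Tuple[int, ...]]:
    # Extract (major, minor, patch) from the output of `cmake --version`.
    match = re.search(r"cmake version (\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def cmake_is_recent_enough(text: str, minimum: Tuple[int, ...] = MIN_CMAKE) -> bool:
    # Tuples compare element by element, so (3, 31, 0) >= (3, 30, 4) is True.
    version = parse_cmake_version(text)
    return version is not None and version >= minimum


def installed_cmake_output() -> str:
    # Runs the cmake found on PATH; raises if cmake is missing or exits with an error.
    result = subprocess.run(["cmake", "--version"],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `cmake_is_recent_enough(installed_cmake_output())` inside the activated environment returns `True` when the CMake on `PATH` is new enough.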

Ensure a CUDA compiler is available

Building the NVBench library requires a CUDA compiler, so make sure the appropriate environment variables are set. For example, assuming the CUDA toolkit is installed system-wide and the target is an Ampere GPU (compute capability 8.6):

export CUDACXX=/usr/local/cuda/bin/nvcc
export CUDAARCHS=86

Build the Python project

Switch to the python folder, then configure and install the NVBench library and install the package in editable mode:

cd nvbench/python
pip install -e .

Verify that the package works

python test/run_1.py

Run examples

# Example benchmarking numba.cuda kernel
python examples/throughput.py

# Example benchmarking kernels authored using cuda.core
python examples/axes.py

# Example benchmarking algorithms from cuda.cccl.parallel
python examples/cccl_parallel_segmented_reduce.py

# Example benchmarking CuPy function
python examples/cupy_extract.py