mirror of
https://github.com/NVIDIA/nvbench.git
synced 2026-07-01 11:47:33 +00:00
Treat matched states with unusable timing data as UNKNOWN instead of dropping them from the comparison. This includes missing, non-finite, or non-positive timing centers, skipped states, and states with missing GPU timing summaries. Add explicit reason codes for these cases so the summary points users at the underlying data issue. Preserve available timing data from the other side when only one side is missing, and render unavailable durations as n/a in all display modes. Also sort values returned by np.unique_counts before nearest-neighbor coverage checks so the distance algorithm receives ordered inputs. Add regression coverage for UNKNOWN counting, skipped states, missing summaries, unavailable center formatting, and the updated coverage helper.
CUDA Kernel Benchmarking Package
This package provides a Python API to the CUDA Kernel Benchmarking
Library NVBench.
Installation
Install from PyPi
pip install cuda-bench[cu13] # For CUDA 13.x
pip install cuda-bench[cu12] # For CUDA 12.x
Building from source
Ensure recent version of CMake
Since nvbench requires a rather new version of CMake (>=3.30.4), either build CMake from sources, or create a conda environment with a recent version of CMake, using
conda create -n build_env --yes cmake ninja
conda activate build_env
Ensure CUDA compiler
Since building NVBench library requires CUDA compiler, ensure that appropriate environment variables
are set. For example, assuming CUDA toolkit is installed system-wide, and assuming Ampere GPU architecture:
export CUDACXX=/usr/local/cuda/bin/nvcc
export CUDAARCHS=86
Build Python project
Now switch to python folder, configure and install NVBench library, and install the package in editable mode:
cd nvbench/python
pip install -e .
Verify that package works
python test/run_1.py
Run examples
# Example benchmarking numba.cuda kernel
python examples/throughput.py
# Example benchmarking kernels authored using cuda.core
python examples/axes.py
# Example benchmarking algorithms from cuda.cccl.parallel
python examples/cccl_parallel_segmented_reduce.py
# Example benchmarking CuPy function
python examples/cupy_extract.py