mirror of
https://github.com/NVIDIA/nvbench.git
synced 2026-03-14 20:27:24 +00:00
Changed title, fixed references, added intro borrowed from README
@@ -4,9 +4,12 @@ CLI Options
 Every benchmark created with NVBench supports a command-line interface
 with a variety of options.
 
+.. _cli-overview:
+
 .. include:: ../cli_help.md
    :parser: myst_parser.sphinx_
 
+.. _cli-overview-axes:
+
 .. include:: ../cli_help_axis.md
    :parser: myst_parser.sphinx_
@@ -1,6 +1,6 @@
 import os
 
-project = "NVBench API"
+project = "NVBench: CUDA Kernel Benchmarking Library"
 author = "NVIDIA Corporation"
 
 extensions = [
@@ -74,6 +74,7 @@ NVBENCH_BENCH(my_benchmark);
 A full example can be found in [examples/stream.cu][CppExample_Stream].
 
+(parameter-axes)=
 ## Parameter Axes
 
 Some kernels will be used with a variety of options, input data types/sizes, and
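The axis sweep this hunk's section introduces can be pictured, independently of NVBench itself, as iterating the cartesian product of each axis's values. A minimal Python sketch (the axis names and values below are illustrative, not the NVBench API):

```python
from itertools import product

# Hypothetical axes, in the spirit of NVBench's add_int64_axis /
# add_string_axis: each axis is a name plus a list of values.
axes = {
    "Elements": [1000, 10000, 100000],
    "RNG Distribution": ["Uniform", "Gaussian"],
}

# The benchmark runs once per point in the cartesian product
# of all axis values.
configs = [dict(zip(axes, values)) for values in product(*axes.values())]

for cfg in configs:
    # A real benchmark would launch the kernel with this config here.
    print(cfg)

print(len(configs))  # 3 * 2 = 6 configurations
```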
@@ -166,6 +167,7 @@ NVBENCH_BENCH(benchmark).add_string_axis("RNG Distribution", {"Uniform", "Gaussi
 A common use for string axes is to encode enum values, as shown in
 [examples/enums.cu][CppExample_Enums].
 
+(type-axes)=
 ### Type Axes
 
 Another common situation involves benchmarking a templated kernel with multiple
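The string-axis-encodes-enums idea mentioned above can be sketched generically: the axis carries human-readable names, and the benchmark body maps each name back to the enum value the kernel needs. `RngDistribution` and `to_enum` here are illustrative stand-ins, not NVBench API:

```python
from enum import Enum

# Hypothetical enum, mirroring the pattern examples/enums.cu
# demonstrates in C++.
class RngDistribution(Enum):
    UNIFORM = 0
    GAUSSIAN = 1

# Values as they would appear on a string axis and in reports.
axis_values = ["Uniform", "Gaussian"]

def to_enum(name: str) -> RngDistribution:
    # Map the human-readable axis value back to the enum.
    return RngDistribution[name.upper()]

resolved = [to_enum(v) for v in axis_values]
```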
@@ -244,6 +246,7 @@ times. Keep the rapid growth of these combinations in mind when choosing the
 number of values in an axis. See the section about combinatorial explosion for
 more examples and information.
 
+(throughput-measurements)=
 ## Throughput Measurements
 
 In addition to raw timing information, NVBench can track a kernel's
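The "rapid growth of combinations" warned about above is just the product of the axis sizes; a quick sketch with made-up axis sizes:

```python
from math import prod

# The number of benchmark invocations is the product of the axis
# sizes, so it grows multiplicatively ("combinatorial explosion").
# Axis sizes below are illustrative.
axis_sizes = {"Elements": 3, "DataType": 4, "Distribution": 2}

total_runs = prod(axis_sizes.values())
print(total_runs)  # 3 * 4 * 2 = 24

# One more 5-value axis multiplies the total again:
with_extra_axis = total_runs * 5
print(with_extra_axis)  # 120
```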
@@ -1,7 +1,45 @@
-NVBench: CUDA Kernel Benchmarking Library
-=========================================
+CUDA Kernel Benchmarking Library
+================================
 
-The library presently supports kernel benchmarking in C++ and in Python.
+The library, NVBench, presently supports writing benchmarks in C++ and in Python.
+It is designed to simplify CUDA kernel benchmarking. It features:
+
+* :ref:`Parameter sweeps <parameter-axes>`: a powerful and
+  flexible "axis" system explores a kernel's configuration space. Parameters may
+  be dynamic numbers/strings or :ref:`static types <type-axes>`.
+* :ref:`Runtime customization <cli-overview>`: A rich command-line interface
+  allows :ref:`redefinition of parameter axes <cli-overview-axes>`, CUDA device
+  selection, locking GPU clocks (Volta+), changing output formats, and more.
+* :ref:`Throughput calculations <throughput-measurements>`: Compute
+  and report:
+
+  * Item throughput (elements/second)
+  * Global memory bandwidth usage (bytes/second and per-device %-of-peak-bw)
+
+* Multiple output formats: Currently supports markdown (default) and CSV output.
+* :ref:`Manual timer mode <explicit-timer-mode>`:
+  (optional) Explicitly start/stop timing in a benchmark implementation.
+* Multiple measurement types:
+
+  * Cold Measurements:
+
+    * Each sample runs the benchmark once with a clean device L2 cache.
+    * GPU and CPU times are reported.
+
+  * Batch Measurements:
+
+    * Executes the benchmark multiple times back-to-back and records total time.
+    * Reports the average execution time (total time / number of executions).
+
+  * :ref:`CPU-only Measurements <cpu-only-benchmarks>`:
+
+    * Measures the host-side execution time of a non-GPU benchmark.
+    * Not suitable for microbenchmarking.
+
+Check out `GPU Mode talk #56 <https://www.youtube.com/watch?v=CtrqBmYtSEki>`_ for an overview
+of the challenges inherent to CUDA kernel benchmarking and how NVBench solves them for you!
 
 -------
 
 .. toctree::
    :maxdepth: 2
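The throughput quantities in the feature list above (item throughput and %-of-peak bandwidth) reduce to simple arithmetic; a sketch with made-up numbers, not measured values:

```python
# All numbers below are illustrative.
elements = 1_000_000          # items processed by one kernel run
bytes_moved = elements * 8    # e.g. one 4-byte read + one 4-byte write per item
elapsed_s = 0.5e-3            # hypothetical measured kernel time

item_throughput = elements / elapsed_s   # elements/second
bandwidth = bytes_moved / elapsed_s      # bytes/second

peak_bw = 900e9               # hypothetical device peak memory bandwidth
percent_of_peak = 100.0 * bandwidth / peak_bw

print(item_throughput)   # ~2e9 elements/s
print(percent_of_peak)   # ~1.78 % of peak
```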
@@ -21,14 +21,18 @@ def benchmark_impl(state: State) -> None:
     data = generate(n, state.get_stream())
 
     # body that is being timed. Must execute
-    # on the stream handed over by NVBench
-    launchable_fn : Callable[[Launch], None] =
+    # on the stream handed over by NVBench.
+    # Typically launches a kernel of interest
+    launch_fn : Callable[[Launch], None] =
         lambda launch: impl(data, launch.get_stream())
 
-    state.exec(launchable_fn)
+    state.exec(launch_fn)
 
 
 bench = register(benchmark_impl)
+# provide the kernel a name
 bench.set_name("my_package_kernel")
+# specify default parameter values to run the benchmark with
 bench.add_int64_axis("Elements", [1000, 10000, 100000])
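The control flow in the example above (the benchmark hands `state.exec` a callable, and the framework invokes it with a `Launch` carrying the stream to run on) can be sketched with plain-Python stand-ins. `Launch` and `State` here are hypothetical mocks, not the real `cuda.bench` classes:

```python
from typing import Callable

class Launch:
    # Mock: in cuda.bench this carries the CUDA stream for one invocation.
    def __init__(self, stream: int) -> None:
        self._stream = stream

    def get_stream(self) -> int:
        return self._stream

class State:
    # Mock: the real framework would repeat exec()'s call under a timer.
    def __init__(self, stream: int) -> None:
        self._stream = stream
        self.calls = 0

    def exec(self, fn: Callable[[Launch], None]) -> None:
        self.calls += 1
        fn(Launch(self._stream))

streams_seen = []

def benchmark_impl(state: State) -> None:
    # The timed body: must run on the stream NVBench hands over.
    launch_fn: Callable[[Launch], None] = \
        lambda launch: streams_seen.append(launch.get_stream())
    state.exec(launch_fn)

state = State(stream=7)
benchmark_impl(state)
print(streams_seen)  # [7]
```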
@@ -1,5 +1,14 @@
-cuda.bench Python API Reference
-===============================
+`cuda.bench` Python API Reference
+=================================
 
+The Python package ``cuda.bench`` is designed to empower
+users to write CUDA kernel benchmarks in Python.
+
+Alignment with the behavior of benchmarks written in C++
+allows for meaningful comparison between them.
 
 Classes and functions
 ---------------------
 
 .. automodule:: cuda.bench
    :members: