* Implement warmup-runs count, supported as a CLI option
CLI option --warmup-runs is implemented and documented.
The warm-up count is enforced to always be positive.
This is necessary to ensure that JIT compilation has occurred,
and that use of the blocking kernel does not result in time-outs.
A test of the option parser is added.
* Ensure that measure_cold::run_warmup instantiates the blocking kernel
Because warm-up runs are executed without the blocking kernel,
the blocking kernel was not JIT-compiled until actual measurements were
collected. The module loading cost incurred during the first run
shows up as an elevated CPU time noise value for the first measurement,
as noted in https://github.com/NVIDIA/nvbench/pull/339
This PR adds `this->block_stream(); this->unblock_stream();` prior
to executing the warm-up loop, which runs with the blocking kernel disabled.
This ensures that the blocking kernel is instantiated during the warm-up,
but no other kernel is launched between its launch and the stream sync,
thus avoiding deadlock.
* Rename --warmup-runs to --cold-warmup-runs, add --cold-max-warmup-walltime
Since the configurable number of warm-ups only applies to measure_cold.cuh,
rename the CLI option to reflect that.
Also add --cold-max-warmup-walltime (defaults to -1, i.e. disabled).
If enabled, the warm-up loop exits before the requested count is reached if
the wall-time expended executing warm-ups exceeds this max-warmup-walltime
value.
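The interaction between the warm-up count and the wall-time cap can be sketched in Python as follows. This is only an illustration of the logic described above, not the C++ code in measure_cold.cuh: the function name `run_warmup`, its parameters, and the choice to check the elapsed time after each run are assumptions.

```python
import time


def run_warmup(launch, warmup_runs=1, max_warmup_walltime=-1.0):
    """Run `launch` up to `warmup_runs` times; return the number of runs executed."""
    if warmup_runs < 1:
        # At least one warm-up run is needed so that JIT compilation and
        # module loading happen before timed measurements start.
        raise ValueError("warmup_runs must be positive")

    start = time.perf_counter()
    completed = 0
    for _ in range(warmup_runs):
        launch()
        completed += 1
        # A negative limit means the wall-time cap is disabled.
        if 0 <= max_warmup_walltime < time.perf_counter() - start:
            break
    return completed
```

With the default `max_warmup_walltime=-1.0` the loop always runs the requested number of times; with a non-negative cap it may stop early, but never before the first run completes.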
This example demonstrates using cuda.bench and cuda.bench.results
to implement simple auto-tuning: selecting the tile shape
hyperparameter for a naive stencil kernel implemented
in numba-cuda.
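The selection step of such an auto-tuner might look like the sketch below. It does not use cuda.bench at all: the helper `pick_best_tile_shape`, the use of the median as the ranking statistic, and the timing numbers are all made up for illustration.

```python
import statistics


def pick_best_tile_shape(timings):
    """Return the tile shape with the smallest median time.

    `timings` maps a tile shape, e.g. (16, 16), to a list of measured
    times in seconds.
    """
    if not timings:
        raise ValueError("no timing data provided")
    return min(timings, key=lambda shape: statistics.median(timings[shape]))


# Made-up numbers, for illustration only.
example_timings = {
    (8, 8): [1.30e-3, 1.28e-3, 1.31e-3],
    (16, 16): [1.01e-3, 1.02e-3, 1.00e-3],
    (32, 8): [1.10e-3, 1.12e-3, 1.09e-3],
}
best = pick_best_tile_shape(example_timings)
```

In a real run the per-shape samples would come from the parsed benchmark results rather than from a hand-written dictionary.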
Move NVBench JSON result parsing into cuda.bench.results with explicit
BenchmarkResult, BenchmarkResultDevice, BenchmarkResultSummary,
SubBenchmarkResult, and SubBenchmarkState types. Remove the result reader
from the top-level cuda.bench namespace and require construction through
BenchmarkResult.from_json() or BenchmarkResult.empty().
Preserve bulk sample/frequency parsing and estimator helpers while making
summaries rich objects that retain tag/name/hint/hide/description metadata.
Add nvbench-json-summary to render NVBench JSON output as an NVBench-style
markdown summary table, including axis formatting, device sections, hidden
summary filtering, and summary hint formatting.
Update packaging, type stubs, and tests for the new namespace, renamed
classes, Python 3.10-compatible annotations, and summary-table generation.
This allows for a Pythonic way of working with BenchResult
as if it were a dictionary.
```
In [1]: import array, numpy as np, cuda.bench
In [2]: r = cuda.bench.BenchResult("temp_data/axes_run1.json")
In [3]: list(r)
Out[3]:
['simple',
'single_float64_axis',
'copy_sweep_grid_shape',
'copy_type_sweep',
'copy_type_conversion_sweep',
'copy_type_and_block_size_sweep']
In [4]: r["simple"].centers(lambda t: np.percentile(t, [25,75]))
Out[4]: {'Device=0': array([0.00100966, 0.00101299])}
In [5]: r.centers(lambda t: np.percentile(t, [25,75]))["simple"]
Out[5]: {'Device=0': array([0.00100966, 0.00101299])}
In [6]: len(r)
Out[6]: 6
In [7]: "fake" in r
Out[7]: False
```
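One common way to get this dictionary-like behavior is to derive from `collections.abc.Mapping`, sketched below. Whether BenchResult is actually implemented this way is not stated here; the class `ResultMapping` and its contents are hypothetical stand-ins.

```python
from collections.abc import Mapping


class ResultMapping(Mapping):
    """Hypothetical read-only container keyed by sub-benchmark name."""

    def __init__(self, subbenchmarks):
        self._subbenchmarks = dict(subbenchmarks)

    def __getitem__(self, name):
        return self._subbenchmarks[name]

    def __iter__(self):
        return iter(self._subbenchmarks)

    def __len__(self):
        return len(self._subbenchmarks)


# Stand-in data; real entries would be sub-benchmark result objects.
r = ResultMapping({"simple": [1.0e-3], "copy_type_sweep": [2.0e-3]})
names = list(r)
```

Defining only `__getitem__`, `__iter__`, and `__len__` is enough: the `Mapping` base class supplies `in`, `keys()`, `items()`, `values()`, and `get()`.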
Add arbitrary BenchResult metadata and explicit parse control, replacing
the previous code/elapsed fields. Make BenchResult subscriptable by
subbenchmark name and make SubBenchResult list-like over its states.
Extend SubBenchState parsing to expose summaries by tag, read paired
sample frequency data, return None for unavailable sample/frequency
files, and validate matching sample/frequency lengths.
Harden parsing for NVBench JSON output with no-axis benchmarks, null
axis_values, skipped states with null summaries, float axis input_string
lookups, and recorded sidecar binary paths.
Expand BenchResult tests to cover metadata, parse=False, sequence-style
access, frequency-aware centers, missing binary data, skipped states,
and mismatched sample/frequency counts.
Example usage:
```
import numpy as np
import cuda.bench
r = cuda.bench.BenchResult("perf_data/axes_run1.json")
# Median of the elementwise product of sample times and frequencies
r["copy_sweep_grid_shape"].centers_with_frequencies(
    lambda t, f: np.median(np.asarray(t) * np.asarray(f))
)
```
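The sample/frequency handling described above (None for unavailable files, an error on mismatched lengths) can be sketched as follows. This is not the cuda.bench implementation: the helper names are hypothetical, and the sketch assumes the sidecar files hold raw native-endian 32-bit floats.

```python
import array
import os


def read_float32_file(path):
    """Return the file contents as array('f'), or None if the file is unavailable."""
    if path is None or not os.path.isfile(path):
        return None
    values = array.array("f")
    with open(path, "rb") as f:
        # Raises ValueError if the byte count is not a multiple of the item size.
        values.frombytes(f.read())
    return values


def read_paired(samples_path, freqs_path):
    """Read samples and frequencies, checking that their lengths agree."""
    samples = read_float32_file(samples_path)
    freqs = read_float32_file(freqs_path)
    if samples is not None and freqs is not None and len(samples) != len(freqs):
        raise ValueError(
            f"sample/frequency length mismatch: {len(samples)} != {len(freqs)}"
        )
    return samples, freqs
```

Returning None rather than raising for a missing file lets callers tell "no data recorded" apart from "data recorded but inconsistent".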
* Correct Python API signature of State.get_axis_values_as_strings
The C++ API has a default boolean argument `color`, but the Python API
declared no arguments.
Closes #345
* Also exercise invocation of get_axis_values_as_strings with keyword argument value
* Remove use of cuda.core.experimental
Fixed relative text alignment in docstrings to fix autodoc warnings
Renamed cuda.bench.test_cpp_exception and cuda.bench.test_py_exception functions
to start with an underscore, signaling that these functions are internal and should
not be documented
Account for test_cpp_exception -> _test_cpp_exception, same for *_py_*
Make sure to reset __module__ of reexported symbols to be cuda.bench
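Resetting `__module__` on re-exported symbols can be done with a small loop, as in the sketch below. The helper `reset_module` and the stand-in objects `ExampleState` and `example_launch` are hypothetical and are not the actual cuda.bench code.

```python
def reset_module(objects, module_name):
    """Make each object report `module_name` as its defining module."""
    for obj in objects:
        obj.__module__ = module_name


# Stand-ins for symbols defined in a private implementation module.
class ExampleState:
    pass


def example_launch():
    pass


reset_module([ExampleState, example_launch], "cuda.bench")
```

After this, documentation tools and `repr()` show the public package name instead of the private module where the objects were defined.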
* Introduce function colorize to modularize colorization/no-color handling
* Use sns.set_theme instead of deprecated sns.set()
* Use str.format instead of legacy % syntax
* Simplified iteration over list
Use f-string (supported since Python 3.6) instead of str.format for
better readability and performance
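A colorization helper of this kind, combined with f-string formatting, could look like the sketch below. The signature of `colorize`, the color table, and the sample values are assumptions made for illustration; the function added in this change may differ.

```python
RESET = "\033[0m"
COLORS = {"red": "\033[31m", "green": "\033[32m", "yellow": "\033[33m"}


def colorize(text, color, enabled=True):
    """Wrap `text` in ANSI color codes, or return it unchanged when disabled."""
    if not enabled:
        return text
    return f"{COLORS[color]}{text}{RESET}"


label, value = "speedup", 1.2345
formatted = f"{value:.2f}"  # f-string instead of "%.2f" % value
plain = f"{label}: {colorize(formatted, 'green', enabled=False)}"
colored = f"{label}: {colorize(formatted, 'green')}"
```

Keeping the no-color decision inside one function means call sites never need to branch on whether color output is enabled.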
Fix GCC16 sfinae incomplete warnings.
GCC16 started requiring that the type `T` used in `std::reference_wrapper<T>` be complete when using `-std=c++17`. Since NVBench has to forward declare some types in header files to break circular dependencies, the use of an incomplete type triggers the `-Wsfinae-incomplete` warning in GCC16, which breaks the build because of the `-Werror` flag.
This commit replaces the affected uses of `std::reference_wrapper<const nvbench::benchmark_base>` in state.cxx and `std::reference_wrapper<nvbench::printer_base>` in benchmark_base.cxx with raw pointers.
* Mark NVBench headers as SYSTEM for consuming targets.
Fixes #30.
* As nvbench.main links to nvbench as INTERFACE only, it no longer consumes usage requirements of nvbench
Because of this, nvbench.main no longer picked up the dependency on CUDA::toolkit include dirs.
This PR links nvbench.main to ${ctk_libraries} privately to reestablish that dependency.
* Implement use of pragma system_header in NVBench
1. Add code to nvbench/config.cuh.in to define the NVBENCH_IMPLICIT_SYSTEM_HEADER_*
preprocessor variable depending on the compiler, unless NVBENCH_NO_IMPLICIT_SYSTEM_HEADER
is defined.
2. Build NVBench targets with -DNVBENCH_NO_IMPLICIT_SYSTEM_HEADER
3. Modify each header file in nvbench/ folder to
- include <nvbench/config.cuh>
- Execute pragma <OPTIONAL_CMPLR> system_header guarded
by checks for defined preprocessor variables
- Do the above two steps before any other headers are included
---------
Co-authored-by: Allison Piper <apiper@nvidia.com>
* Reworked cupti_profiler to use Host + Range Profiler APIs end-to-end
The NVPW_* API has been deprecated since CTK 13.0. Followed the advice in the compilation
message to replace the NVPW_* API with the CUPTI Profiler Host API.
`libnvbench.so` no longer links to `nvperf_host` directly, only to `libcupti`.
NVBench uses the **CUPTI Host API** to build a config image from metric names,
and the **Range Profiler API** to collect and decode counters. The host API never
collects data directly; it prepares and evaluates data produced by range profiling.
Introduce `host_impl`/`profiler_init_guard` to manage CUPTI Host object and
initialization/deinitialization, including safe move-assignment cleanup.
`profiler_init_guard` initializes the profiler and throws if CUPTI returns
an error code. `profiler_init_guard::finalize_profiler()` de-initializes the profiler
and returns the error code. The destructor calls `finalize_profiler()` but ignores
the status code. Users who want to explicitly de-initialize the profiler and handle
the error should call `finalize_profiler()` directly. The guard
has a boolean member variable so that the destructor works correctly even if the user
has already called `finalize_profiler()`.
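The guard's behavior can be illustrated with a Python analog, where a context manager plays the role of the C++ destructor. This is a sketch of the pattern only: the class `InitGuard`, its callbacks, and the convention that status 0 means success are assumptions, and no CUPTI calls are involved.

```python
class InitGuard:
    """Python analog of an init/finalize guard with an idempotent finalize."""

    def __init__(self, init, deinit):
        status = init()
        if status != 0:
            raise RuntimeError(f"initialization failed with status {status}")
        self._deinit = deinit
        self._active = True  # plays the role of the boolean member variable

    def finalize(self):
        """De-initialize once and return the status code; later calls return 0."""
        if not self._active:
            return 0
        self._active = False
        return self._deinit()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Like the destructor described above: finalize, but ignore the status.
        self.finalize()
        return False
```

Calling `finalize()` explicitly inside the `with` block lets the caller inspect the status, and the later implicit call on exit becomes a no-op.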
The old counter-data prefix/scratch flow was replaced with the Range Profiler counter
data image sizing/initialization path and decode flow.
Host API metric filtering (base metrics + context scope) and Host-side evaluation to
GPU values via cuptiProfilerHostEvaluateToGpuValues are implemented.
- **Host object**: `host_impl::object` in `nvbench/cupti_profiler.cxx`.
- **Range profiler object**: `host_impl::range_profiler_object`.
- **Config image**: `m_config_image`.
- **Counter data image**: `m_data_image`.
1) **Host init + config image**
- `initialize_profiler_host()` creates the host object.
- `initialize_config_image_host()` adds metrics and builds the config image.
2) **Range profiler enable + counter data image**
- `enable_range_profiler()` creates the range profiler object.
- `initialize_counter_data_image()` sizes and initializes the data image using
the range profiler object, matching the CUPTI samples.
3) **Config + collect + decode**
- `set_range_profiler_config()` binds the config image + data image.
- `start_user_loop()` / `stop_user_loop()` push/pop the user range and
start/stop the range profiler.
- `process_user_loop()` decodes counter data via
`cuptiRangeProfilerDecodeData()`.
4) **Evaluate metrics**
- `get_counter_values()` calls `cuptiProfilerHostEvaluateToGpuValues()` to
convert counter data into metric values.
* Use class instead of struct in profiler_init_guard; forward declaration
* Add SFINAE guards before accessing members not present in earlier CTK versions
* Check if cupti_profiler_host.h exists, use old/new implementation based on that check
1. Reintroduced legacy `cupti_profiler_nvpw.cuh` and `cupti_profiler_nvpw.cxx`.
2. Moved profiler-host-API implementation to `cupti_profiler_host.cuh`, `cupti_profiler_host.cxx`.
3. Add `nvbench/cupti_profiler.cuh` which checks if `cupti_profiler_host.h` header is known and
includes `cupti_profiler_host.cuh` or `cupti_profiler_nvpw.cuh` respectively.
4. In CMake, we check whether the ${nvbench_cupti_root}/include/cupti_profiler_host.h file exists.
If it does not, `libnvbench.so` depends on libnvperf_host and libnvperf_target
in addition to libcupti. If the header exists, it depends only on libcupti.