767 Commits

Author SHA1 Message Date
Oleksandr Pavlyk
18632f9e30 Fix break due to file name change 2026-05-12 11:30:55 -05:00
Oleksandr Pavlyk
00a5d1d33f Split tests in test_benchmark_result into smaller tests 2026-05-12 10:25:34 -05:00
Oleksandr Pavlyk
4a22923721 Add BenchmarkResult JSON results namespace and summary CLI
Move NVBench JSON result parsing into cuda.bench.results with explicit
BenchmarkResult, BenchmarkResultDevice, BenchmarkResultSummary,
SubBenchmarkResult, and SubBenchmarkState types. Remove the result reader
from the top-level cuda.bench namespace and require construction through
BenchmarkResult.from_json() or BenchmarkResult.empty().

Preserve bulk sample/frequency parsing and estimator helpers while making
summaries rich objects that retain tag/name/hint/hide/description metadata.

Add nvbench-json-summary to render NVBench JSON output as an NVBench-style
markdown summary table, including axis formatting, device sections, hidden
summary filtering, and summary hint formatting.

Update packaging, type stubs, and tests for the new namespace, renamed
classes, Python 3.10-compatible annotations, and summary-table generation.
2026-05-12 09:58:30 -05:00
Oleksandr Pavlyk
231e9d7b1d Refactoring: BenchResult->BenchmarkResult 2026-05-11 17:36:12 -05:00
Oleksandr Pavlyk
de8d0b7368 Implement mapping interface for BenchResult
This allows a Pythonic way of working with BenchResult,
as if it were a dictionary.

```
In [1]: import array, numpy as np, cuda.bench

In [2]: r = cuda.bench.BenchResult("temp_data/axes_run1.json")

In [3]: list(r)
Out[3]:
['simple',
 'single_float64_axis',
 'copy_sweep_grid_shape',
 'copy_type_sweep',
 'copy_type_conversion_sweep',
 'copy_type_and_block_size_sweep']

In [4]: r["simple"].centers(lambda t: np.percentile(t, [25,75]))
Out[4]: {'Device=0': array([0.00100966, 0.00101299])}

In [5]: r.centers(lambda t: np.percentile(t, [25,75]))["simple"]
Out[5]: {'Device=0': array([0.00100966, 0.00101299])}

In [6]: len(r)
Out[6]: 6

In [7]: "fake" in r
Out[7]: False
```
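
For readers unfamiliar with the mapping protocol, here is a minimal sketch of how a container can get this dictionary-like behaviour by subclassing `collections.abc.Mapping`. `ResultSketch` and its contents are hypothetical and are not the `BenchResult` implementation.

```python
from collections.abc import Mapping


class ResultSketch(Mapping):
    """Hypothetical stand-in for a result container keyed by benchmark name."""

    def __init__(self, benchmarks):
        # A dict keeps insertion order, so iteration follows the input order.
        self._benchmarks = dict(benchmarks)

    def __getitem__(self, name):
        return self._benchmarks[name]

    def __iter__(self):
        return iter(self._benchmarks)

    def __len__(self):
        return len(self._benchmarks)


# Made-up data for illustration only.
r = ResultSketch({"simple": [1.0, 2.0], "copy_type_sweep": [3.0]})
names = list(r)        # iteration yields the keys
has_fake = "fake" in r  # __contains__ is supplied by Mapping
```

Only `__getitem__`, `__iter__`, and `__len__` need to be written; `Mapping` derives `__contains__`, `keys()`, `items()`, `values()`, and `get()` from them.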
2026-05-08 16:10:28 -05:00
Oleksandr Pavlyk
2604547eeb Improve Python BenchResult parsing and container APIs
Add arbitrary BenchResult metadata and explicit parse control, replacing
the previous code/elapsed fields. Make BenchResult subscriptable by
subbenchmark name and make SubBenchResult list-like over its states.

Extend SubBenchState parsing to expose summaries by tag, read paired
sample frequency data, return None for unavailable sample/frequency
files, and validate matching sample/frequency lengths.

Harden parsing for NVBench JSON output with no-axis benchmarks, null
axis_values, skipped states with null summaries, float axis input_string
lookups, and recorded sidecar binary paths.

Expand BenchResult tests to cover metadata, parse=False, sequence-style
access, frequency-aware centers, missing binary data, skipped states,
and mismatched sample/frequency counts.

Example usage:

```
import array, numpy as np, cuda.bench

r = cuda.bench.BenchResult("perf_data/axes_run1.json")

r["copy_sweep_grid_shape"].centers_with_frequencies(
     lambda t, f: np.median(np.asarray(t)*np.asarray(f)))

```
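
The behaviours listed above (returning None for unavailable data, rejecting mismatched sample/frequency lengths, applying a two-argument reducer) can be sketched with the standard library alone. This is an illustration under assumptions: `center_with_frequencies`, `median_cycles`, the units, and the numbers are hypothetical and are not taken from `cuda.bench`.

```python
from statistics import median


def center_with_frequencies(samples, frequencies, reducer):
    """Apply reducer(samples, frequencies) after basic validation.

    Returns None when either input is unavailable and raises
    ValueError when the two sequences differ in length.
    """
    if samples is None or frequencies is None:
        return None
    if len(samples) != len(frequencies):
        raise ValueError(
            f"expected matching lengths, got {len(samples)} samples "
            f"and {len(frequencies)} frequencies"
        )
    return reducer(samples, frequencies)


def median_cycles(times, freqs):
    # Elapsed time multiplied by clock rate approximates a cycle count.
    return median(t * f for t, f in zip(times, freqs))


times = [1.0e-3, 2.0e-3, 4.0e-3]  # made-up timings, assumed to be seconds
freqs = [1.0e9, 1.0e9, 0.5e9]     # made-up clock rates, assumed to be Hz
result = center_with_frequencies(times, freqs, median_cycles)
```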
2026-05-08 16:10:28 -05:00
Oleksandr Pavlyk
e26fe6bda2 Initial implementation of cuda.bench.BenchResult class 2026-05-08 16:10:17 -05:00
Oleksandr Pavlyk
d13a0fde32 Correct cuda cccl examples per change in api (#353) 2026-05-06 13:30:44 -05:00
Oleksandr Pavlyk
f392725015 Correct Python API signature of State.get_axis_values_as_strings (#346)
* Correct Python API signature of State.get_axis_values_as_strings

The C++ API has default boolean argument color, but Python API
declared no arguments.

Closes #345

* Also exercise invocation of get_axis_values_as_string with keyword argument value

* Remove use of cuda.core.experimental
2026-05-04 08:40:29 -05:00
Oleksandr Pavlyk
a3364ca5c7 Port changes to the package from #323 (#337)
Fixed relative text alignment in docstrings to fix autodoc warnings

Renamed cuda.bench.test_cpp_exception and cuda.bench.test_py_exception functions
to start with underscore, signaling that these functions are internal and should
not be documented

Account for test_cpp_exceptions -> _test_cpp_exception, same for *_py_*

Make sure to reset __module__ of reexported symbols to be cuda.bench
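
A minimal sketch of the `__module__` reset, assuming symbols are defined in a private implementation module and re-exported from a public package. The names `reexport`, `pkg`, and `pkg._impl` are hypothetical; this is not the code used in the package.

```python
def reexport(namespace, public_module, names):
    """Set __module__ on the named objects so they report `public_module`."""
    for name in names:
        obj = namespace[name]
        try:
            obj.__module__ = public_module
        except (AttributeError, TypeError):
            # Some objects do not allow __module__ to be reassigned.
            pass


class State:
    """Hypothetical class standing in for a re-exported symbol."""


# Pretend the class was defined in a private implementation module.
State.__module__ = "pkg._impl"

reexport({"State": State}, "pkg", ["State"])
```

After the call, `State.__module__` reads `pkg`, so documentation tools and `repr` output point at the public name rather than the private module.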
2026-04-22 08:28:15 -05:00
Oleksandr Pavlyk
b0a46f44c2 Modularize color handling (#336)
* Introduce function colorize to modularize colorization/no-color handling

* Use sns.set_theme instead of deprecated sns.set()

* Use str.format instead of legacy % syntax

* Simplified iteration over list

Use f-string (supported since Python 3.6) instead of str.format for
better readability and performance
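
As an illustration of the idea only, a helper of this kind might look like the sketch below. The signature, the `use_color` parameter, and the use of raw ANSI escape codes are assumptions, not the script's actual implementation.

```python
def colorize(text, color_code, use_color=True):
    """Wrap text in an ANSI escape sequence unless color is disabled."""
    if not use_color:
        return text
    # f-string formatting, as opposed to the legacy % syntax or str.format
    return f"\033[{color_code}m{text}\033[0m"


GREEN = 32  # standard ANSI foreground color code

plain = colorize("FAST", GREEN, use_color=False)
colored = colorize("FAST", GREEN)
```

Keeping the no-color decision in one function means callers never need to branch on the flag themselves.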
2026-04-14 08:09:44 -05:00
pre-commit-ci[bot]
8d23e3e73c [pre-commit.ci] pre-commit autoupdate (#333)
* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/pre-commit/mirrors-clang-format: v21.1.8 → v22.1.2](https://github.com/pre-commit/mirrors-clang-format/compare/v21.1.8...v22.1.2)
- [github.com/astral-sh/ruff-pre-commit: v0.14.10 → v0.15.9](https://github.com/astral-sh/ruff-pre-commit/compare/v0.14.10...v0.15.9)
- [github.com/codespell-project/codespell: v2.4.1 → v2.4.2](https://github.com/codespell-project/codespell/compare/v2.4.1...v2.4.2)

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-13 16:24:55 +00:00
Oleksandr Pavlyk
e62c5b6f79 Correct description/hint entries for summaries with name "Noise" (#335)
See #334
2026-04-13 11:13:37 -05:00
Nader Al Awar
373970323f Merge pull request #331 from oleksandr-pavlyk/update-python-examples
Update python examples
2026-04-02 15:20:24 -04:00
Oleksandr Pavlyk
39730efbc3 Update requirements to reflect packages used by examples 2026-04-02 10:37:17 -05:00
Oleksandr Pavlyk
9f75642387 Add patch to cutlass.base_dsl.dsl.BaseDSL to work-around a bug
See https://github.com/NVIDIA/cutlass/issues/3142
2026-04-02 10:29:31 -05:00
Nader Al Awar
488173a242 Add --no-color flag to nvbench_compare.py which can be used for github issues and PRs python-0.2.1 2026-04-01 18:27:54 -04:00
Nader Al Awar
7a68e53df0 Rename flag from markdown to no-color 2026-04-01 17:01:29 -05:00
Nader Al Awar
7e5e784855 Add --markdown flag to nvbench_compare.py which can be used for github issues/prs 2026-04-01 14:53:13 -05:00
Oleksandr Pavlyk
93bc59d05c Renamed CUTLASS example to reflect that it uses CuteDSL 2026-04-01 08:24:29 -05:00
Oleksandr Pavlyk
e4cfddeb87 Rewrote cutlass_gemm example to use CuteDSL 2026-04-01 08:23:41 -05:00
Oleksandr Pavlyk
3f284b4004 Renamed cccl_* examples
cccl_parallel_* -> cuda_compute_*
cccl_cooperative_* -> cuda_coop_*
2026-04-01 08:20:20 -05:00
Oleksandr Pavlyk
5bdb30f4b6 Update to cccl_parallel_segmented_reduce example per changes in API
Update namespace changes. Use make_segmented_reduce factory function,
and update call signatures.
2026-04-01 08:18:15 -05:00
Oleksandr Pavlyk
d8739fc208 Update to cccl_cooperative_block_reduce example 2026-04-01 08:17:52 -05:00
Oleksandr Pavlyk
974eb5ee0f Replace use of cupy.cuda.ExternalStream with cupy.cuda.Stream.from_external 2026-04-01 08:17:12 -05:00
Oleksandr Pavlyk
7c60edcc0a cuda.core.experimental -> cuda.core 2026-04-01 08:16:04 -05:00
Oleksandr Pavlyk
836a6c12f4 Merge pull request #326 from oleksandr-pavlyk/fix-sfinae-incomplete
Fix GCC16 sfinae incomplete warnings.

GCC16 started requiring that the type `T` used in `std::reference_wrapper<T>` be complete when using `-std=c++17`. Since NVBench has to forward-declare some types in header files to break circular dependencies, the use of an incomplete type breaks the build: GCC16 emits a `-Wsfinae-incomplete` warning, which the `-Werror` flag turns into an error.

This commit replaced affected uses of `std::reference_wrapper<const nvbench::benchmark_base>` in state.cxx, and `std::reference_wrapper<nvbench::printer_base>` in benchmark_base.cxx with raw pointers.
2026-03-24 16:02:28 -05:00
Oleksandr Pavlyk
317dc6824e Mark NVBench headers as SYSTEM for consuming targets + FIX (#330)
* Mark NVBench headers as SYSTEM for consuming targets.

Fixes #30.

* As nvbench.main links to nvbench as INTERFACE only, it no longer consumes usage reqs of nvbench

Because of this nvbench.main was no longer consuming dependence on CUDA::toolkit include dirs.

This PR links nvbench.main to ${ctk_libraries} privately to reestablish that dependency

* Implement use of pragma system_header in NVBench

1. Add code to nvbench/config.cuh.in to define NVBENCH_IMPLICIT_SYSTEM_HEADER_*
preprocessor variable depending on compiler, unless NVBENCH_NO_IMPLICIT_SYSTEM_HEADER
was defined.

2. Build NVBench targets with -DNVBENCH_NO_IMPLICIT_SYSTEM_HEADER

3. Modify each header file in nvbench/ folder to
   - include <nvbench/config.cuh>
   - Execute pragma <OPTIONAL_CMPLR> system_header guarded
     by checks for defined preprocessor variables
   - Do the above two steps before any other headers are included

---------

Co-authored-by: Allison Piper <apiper@nvidia.com>
2026-03-23 15:10:41 -04:00
Oleksandr Pavlyk
9a91b9ef0c Reworked cupti_profiler to use Host + Range Profiler APIs end-to-end (#327)
* Reworked cupti_profiler to use Host + Range Profiler APIs end-to-end

NVPW_* API has been deprecated since CTK 13.0. Followed advice in compilation
message to replace NVPW_* API with CUPTI Profiler Host API.

`libnvbench.so` no longer links to `nvperf_host` directly, only to `libcupti`.

NVBench uses the **CUPTI Host API** to build a config image from metric names,
and the **Range Profiler API** to collect and decode counters. The host API never
collects data directly; it prepares and evaluates data produced by range profiling.

Introduce `host_impl`/`profiler_init_guard` to manage CUPTI Host object and
initialization/deinitialization, including safe move-assignment cleanup.

`profiler_init_guard` initializes the profiler and throws if CUPTI returns
an error code. `profiler_init_guard::finalize_profiler()` de-initializes the
profiler and returns the error code. The destructor calls `finalize_profiler()`
but ignores the status code. Users who want to de-initialize the profiler
explicitly and handle the error should call `finalize_profiler()` directly.
The guard has a boolean member variable so that the destructor still works
after an explicit call to `finalize_profiler()`.
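
The guard pattern described above can be sketched in Python for illustration. The real guard is a C++ class; `InitGuardSketch`, its callbacks, and the status value 7 are hypothetical, and a context manager stands in for the destructor.

```python
class InitGuardSketch:
    """Hypothetical guard: initialize on construction, finalize at most once."""

    def __init__(self, init_fn, deinit_fn):
        self._deinit_fn = deinit_fn
        self._active = False
        status = init_fn()
        if status != 0:
            raise RuntimeError(f"initialization failed with status {status}")
        self._active = True

    def finalize(self):
        """De-initialize and return the status code; later calls are no-ops."""
        if not self._active:
            return 0
        self._active = False
        return self._deinit_fn()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Like the destructor described above, ignore the status code here.
        self.finalize()
        return False


calls = []
with InitGuardSketch(lambda: 0, lambda: calls.append("deinit") or 7) as guard:
    status = guard.finalize()  # explicit call: the caller sees the status code
```

The boolean flag is what makes the second, implicit finalization on scope exit harmless: the de-initialization callback runs once.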

The old counter-data prefix/scratch flow was replaced with the Range Profiler counter
data image sizing/initialization path and decode flow.

Host API metric filtering (base metrics + context scope) and host-side evaluation to
GPU values via `cuptiProfilerHostEvaluateToGpuValues` are implemented.

   - **Host object**: `host_impl::object` in `nvbench/cupti_profiler.cxx`.
   - **Range profiler object**: `host_impl::range_profiler_object`.
   - **Config image**: `m_config_image`.
   - **Counter data image**: `m_data_image`.

1) **Host init + config image**
   - `initialize_profiler_host()` creates the host object.
   - `initialize_config_image_host()` adds metrics and builds the config image.

2) **Range profiler enable + counter data image**
   - `enable_range_profiler()` creates the range profiler object.
   - `initialize_counter_data_image()` sizes and initializes the data image using
     the range profiler object, matching the CUPTI samples.

3) **Config + collect + decode**
   - `set_range_profiler_config()` binds the config image + data image.
   - `start_user_loop()` / `stop_user_loop()` push/pop the user range and
     start/stop the range profiler.
   - `process_user_loop()` decodes counter data via
     `cuptiRangeProfilerDecodeData()`.

4) **Evaluate metrics**
   - `get_counter_values()` calls `cuptiProfilerHostEvaluateToGpuValues()` to
     convert counter data into metric values.

* Use class instead of struct in profiler_init_guard; forward declaration

* Add SFINAE guards before accessing members not present in earlier CTK versions

* Check if cupti_profiler_host.h exists, use old/new implementation based on that check

1. Reintroduced legacy `cupti_profiler_nvpw.cuh` and `cupti_profiler_nvpw.cxx`.
2. Moved profiler-host-API implementation to `cupti_profiler_host.cuh`, `cupti_profiler_host.cxx`.
3. Add `nvbench/cupti_profiler.cuh` which checks if `cupti_profiler_host.h` header is known and
   includes `cupti_profiler_host.cuh` or `cupti_profiler_nvpw.cuh` respectively.
4. In cmake, we check if ${nvbench_cupti_root}/include/cupti_profiler_host.h file exists.
   If it does not, `libnvbench.so` would have dependency on libnvperf_host and libnvperf_target
   in addition to dependency on libcupti. If the header exists, it would only depend on libcupti
2026-03-23 11:51:16 -04:00
Oleksandr Pavlyk
1d823c6975 Merge pull request #328 from oleksandr-pavlyk/set-type-axes-names-in-auto-throughput-example 2026-03-20 18:44:03 -05:00
Oleksandr Pavlyk
56cdaed0af Merge pull request #299 from NVIDIA/pre-commit-ci-update-config
[pre-commit.ci] pre-commit autoupdate
2026-03-20 16:15:20 -05:00
Oleksandr Pavlyk
a6e570083d Merge pull request #329 from oleksandr-pavlyk/fix-fmt-target-name-in-tests
Link against fmt::fmt target, not fmt
2026-03-20 08:49:05 -05:00
Oleksandr Pavlyk
4c278b08b3 Link against fmt::fmt target, not fmt. Consistent with nvbench/CMakeLists.txt
Co-authored-by: Dominic Charrier <docharri@amd.com>
2026-03-19 14:53:06 -05:00
Oleksandr Pavlyk
49636c70b3 Set type-axes name to ItemsPerThread to replace auto-generated T 2026-03-19 14:35:46 -05:00
Bernhard Manfred Gruber
728212f9f1 Merge pull request #315 from bernhardmgruber/plot_diff_script
Extend `nvbench_compare.py` with `--plot`, axis/benchmark filtering, and dark mode
2026-02-28 01:38:27 +01:00
Bernhard Manfred Gruber
4164909c52 Feedback 2026-02-28 01:19:18 +01:00
Oleksandr Pavlyk
5387d2005b Merge pull request #322 from oleksandr-pavlyk/feature/save-frequencies
Save frequencies when bulk-saving of times is enabled

SM clock rates are now always collected, even if throttling threshold is set to zero
2026-02-27 13:30:11 -06:00
Oleksandr Pavlyk
c9705de4a4 Reserve enough space for clock rates for min samples, if specified 2026-02-27 12:49:35 -06:00
Bernhard Manfred Gruber
0abc8ec82b Extend nvbench_compare.py with --plot, axis/benchmark filtering, and dark mode
Co-authored-by: Oleksandr Pavlyk <21087696+oleksandr-pavlyk@users.noreply.github.com>
2026-02-27 11:06:20 +01:00
Oleksandr Pavlyk
ba7150e447 Merge pull request #314 from bernhardmgruber/plot_script
Add a script to plot benchmark results
2026-02-26 12:59:16 -06:00
Bernhard Manfred Gruber
800f640c20 Apply reviewer feedback 2026-02-26 19:23:51 +01:00
Oleksandr Pavlyk
998ab125ce Don't override m_check_throttling if throttling threshold is non-positive
measure_cold class now directly inherits m_check_throttling from state.
This ensures that when `--jsonbin` is specified frequency data corresponding
to timing data are available to write out.
2026-02-20 16:34:53 -06:00
Oleksandr Pavlyk
731e0c2c30 Swapped data members m_sm_clock_rates and m_sm_clock_rate_accumulator
This places all std::vector members together. Added default initialization
to all std::vector members, and all other members with default constructors.

Exceptions are references and the nvbench::launch m_launch member
2026-02-19 15:33:57 -06:00
Oleksandr Pavlyk
4da9f431c0 Templatize write_out_values for different storage formats
This could be used to save data as float32_t, or float64_t.
This flexibility is useful for experimentation.
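
To illustrate what a selectable storage format means for readers of the binary output, here is a Python sketch that writes the same values as C `float` or C `double` (typically 32 and 64 bits). The function names mirror the commit title for readability only; the native-byte-order layout and the `typecode` parameter are assumptions, not the C++ implementation or the NVBench file format.

```python
import array
import io


def write_out_values(stream, values, typecode="f"):
    """Write values to a binary stream as C float ('f') or C double ('d')."""
    if typecode not in ("f", "d"):
        raise ValueError("typecode must be 'f' (float) or 'd' (double)")
    stream.write(array.array(typecode, values).tobytes())


def read_values(stream, typecode="f"):
    """Read back all values written with the same typecode."""
    out = array.array(typecode)
    out.frombytes(stream.read())
    return list(out)


values = [0.5, 1.25, 2.0]  # made-up values, exactly representable in both formats
buf32, buf64 = io.BytesIO(), io.BytesIO()
write_out_values(buf32, values, "f")
write_out_values(buf64, values, "d")
size32, size64 = len(buf32.getvalue()), len(buf64.getvalue())

buf64.seek(0)
roundtrip = read_values(buf64, "d")
```

A reader has to know which format was used, since the byte stream itself carries no type information.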
2026-02-19 15:32:00 -06:00
Oleksandr Pavlyk
988420b5b1 Use write_out_values utility to save frequencies
The utility was already used to save times
2026-02-13 10:19:06 -06:00
Georgy Evtushenko
40b2f4ece2 Better place to stop freq timer? 2026-02-13 09:53:59 -06:00
Georgy Evtushenko
a487a38895 Dump frequencies 2026-02-13 08:49:41 -06:00
Bernhard Manfred Gruber
d3a0bec4a8 Feedback from review 2026-02-05 14:13:16 +01:00
Bernhard Manfred Gruber
28ed32bb47 Implement dark mode using style sheets 2026-02-05 14:00:33 +01:00
Bernhard Manfred Gruber
ec9759037d I have no idea what I am doing 2026-02-05 11:15:27 +01:00