760 Commits

Author SHA1 Message Date
Oleksandr Pavlyk
d13a0fde32 Correct cuda cccl examples per change in api (#353) 2026-05-06 13:30:44 -05:00
Oleksandr Pavlyk
f392725015 Correct Python API signature of State.get_axis_values_as_strings (#346)
* Correct Python API signature of State.get_axis_values_as_strings

The C++ API has default boolean argument color, but Python API
declared no arguments.

Closes #345

* Also exercise invocation of get_axis_values_as_string with keyword argument value

* Remove use of cuda.core.experimental
2026-05-04 08:40:29 -05:00
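The signature fix above can be pictured with a small stand-in class. This is only a sketch: `StateSketch` is a hypothetical placeholder, not the real `cuda.bench` type, and the default value of `color` and the `name=value` output format are assumptions made for illustration.

```python
class StateSketch:
    """Hypothetical stand-in for the bound State class (not the real type)."""

    def __init__(self, axis_values):
        # Axis name -> value for the current benchmark configuration.
        self._axis_values = dict(axis_values)

    def get_axis_values_as_strings(self, color=False):
        # The default of False and the "name=value" format are assumptions
        # made for this illustration.
        parts = []
        for name, value in self._axis_values.items():
            text = f"{name}={value}"
            if color:
                # Wrap in ANSI bold when color is requested.
                text = f"\033[1m{text}\033[0m"
            parts.append(text)
        return " ".join(parts)


state = StateSketch({"Elements": 1024, "T": "F32"})
plain = state.get_axis_values_as_strings()                  # no argument
also_plain = state.get_axis_values_as_strings(color=False)  # keyword argument
colored = state.get_axis_values_as_strings(color=True)
```

Declaring the parameter with a default keeps existing zero-argument calls working while allowing the keyword form that the second bullet of the commit exercises.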
Oleksandr Pavlyk
a3364ca5c7 Port changes to the package from #323 (#337)
Fixed relative text alignment in docstrings to fix autodoc warnings

Renamed cuda.bench.test_cpp_exception and cuda.bench.test_py_exception functions
to start with underscore, signaling that these functions are internal and should
not be documented

Account for test_cpp_exceptions -> _test_cpp_exception, same for *_py_*

Make sure to reset __module__ of reexported symbols to be cuda.bench
2026-04-22 08:28:15 -05:00
Oleksandr Pavlyk
b0a46f44c2 Modularize color handling (#336)
* Introduce function colorize to modularize colorization/no-color handling

* Use sns.set_theme instead of deprecated sns.set()

* Use str.format instead of legacy % syntax

* Simplified iteration over list

Use f-string (supported since Python 3.6) instead of str.format for
better readability and performance
2026-04-14 08:09:44 -05:00
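As a rough illustration of the changes listed in this commit, the sketch below shows one possible shape for a `colorize` helper and compares the three string-formatting styles mentioned. The signature of `colorize`, the `use_color` parameter, and the sample values are assumptions, not the script's actual code.

```python
RED = "31"  # ANSI foreground color code


def colorize(text, code, use_color=True):
    """Wrap text in an ANSI color escape, or return it unchanged."""
    if not use_color:
        return text
    return f"\033[{code}m{text}\033[0m"


name, delta = "copy_kernel", 0.0312

# Three equivalent ways to build the same message; the f-string is the
# form the commit settles on.
legacy = "%s: %.2f%%" % (name, delta * 100)
formatted = "{}: {:.2f}%".format(name, delta * 100)
modern = f"{name}: {delta * 100:.2f}%"

colored = colorize(modern, RED)
plain = colorize(modern, RED, use_color=False)
```

Routing every colored string through one helper means the no-color decision lives in a single place instead of being repeated at each print site.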
pre-commit-ci[bot]
8d23e3e73c [pre-commit.ci] pre-commit autoupdate (#333)
* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/pre-commit/mirrors-clang-format: v21.1.8 → v22.1.2](https://github.com/pre-commit/mirrors-clang-format/compare/v21.1.8...v22.1.2)
- [github.com/astral-sh/ruff-pre-commit: v0.14.10 → v0.15.9](https://github.com/astral-sh/ruff-pre-commit/compare/v0.14.10...v0.15.9)
- [github.com/codespell-project/codespell: v2.4.1 → v2.4.2](https://github.com/codespell-project/codespell/compare/v2.4.1...v2.4.2)

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-13 16:24:55 +00:00
Oleksandr Pavlyk
e62c5b6f79 Correct description/hint entries for summaries with name "Noise" (#335)
See #334
2026-04-13 11:13:37 -05:00
Nader Al Awar
373970323f Merge pull request #331 from oleksandr-pavlyk/update-python-examples
Update python examples
2026-04-02 15:20:24 -04:00
Oleksandr Pavlyk
39730efbc3 Update requirements to reflect packages used by examples 2026-04-02 10:37:17 -05:00
Oleksandr Pavlyk
9f75642387 Add patch to cutlass.base_dsl.dsl.BaseDSL to work-around a bug
See https://github.com/NVIDIA/cutlass/issues/3142
2026-04-02 10:29:31 -05:00
Nader Al Awar
488173a242 Add --no-color flag to nvbench_compare.py which can be used for github issues and PRs python-0.2.1 2026-04-01 18:27:54 -04:00
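One way such a flag could be wired up with `argparse` is sketched here. Only the flag name `--no-color` comes from the commit; the help text, the `build_parser` function, and the `use_color` variable are illustrative assumptions.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="nvbench_compare.py")
    # store_true makes the flag default to False, so color stays on unless
    # the user opts out.
    parser.add_argument(
        "--no-color",
        action="store_true",
        help="emit plain text without ANSI escapes",  # illustrative wording
    )
    return parser


args = build_parser().parse_args(["--no-color"])
use_color = not args.no_color
default_args = build_parser().parse_args([])
```

Plain output of this kind pastes cleanly into GitHub issues and PRs, where ANSI escape sequences would otherwise show up as stray characters.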
Nader Al Awar
7a68e53df0 Rename flag from markdown to no-color 2026-04-01 17:01:29 -05:00
Nader Al Awar
7e5e784855 Add --markdown flag to nvbench_compare.py which can be used for github issues/prs 2026-04-01 14:53:13 -05:00
Oleksandr Pavlyk
93bc59d05c Renamed CUTLASS example to reflect that it uses CuteDSL 2026-04-01 08:24:29 -05:00
Oleksandr Pavlyk
e4cfddeb87 Rewrote cutlass_gemm example to use CuteDSL 2026-04-01 08:23:41 -05:00
Oleksandr Pavlyk
3f284b4004 Renamed cccl_* examples
cccl_parallel_* -> cuda_compute_*
cccl_cooperative_* -> cuda_coop_*
2026-04-01 08:20:20 -05:00
Oleksandr Pavlyk
5bdb30f4b6 Update to cccl_parallel_segmented_reduce example per changes in API
Update namespace changes. Use make_segmented_reduce factory function,
and update call signatures.
2026-04-01 08:18:15 -05:00
Oleksandr Pavlyk
d8739fc208 Update to cccl_cooperative_block_reduce example 2026-04-01 08:17:52 -05:00
Oleksandr Pavlyk
974eb5ee0f Replace use of cupy.cuda.ExternalStream with cupy.cuda.Stream.from_external 2026-04-01 08:17:12 -05:00
Oleksandr Pavlyk
7c60edcc0a cuda.core.experimental -> cuda.core 2026-04-01 08:16:04 -05:00
Oleksandr Pavlyk
836a6c12f4 Merge pull request #326 from oleksandr-pavlyk/fix-sfinae-incomplete
Fix GCC16 sfinae incomplete warnings.

GCC16 started requiring that the type `T` used in `std::reference_wrapper<T>` is complete when using `-std=c++17`. Since NVBench has to forward declare some types in header files to break circular dependencies, using an incomplete type breaks the build: GCC16 emits a `-Wsfinae-incomplete` warning, which the `-Werror` flag turns into an error.

This commit replaced affected uses of `std::reference_wrapper<const nvbench::benchmark_base>` in state.cxx, and `std::reference_wrapper<nvbench::printer_base>` in benchmark_base.cxx with raw pointers.
2026-03-24 16:02:28 -05:00
Oleksandr Pavlyk
317dc6824e Mark NVBench headers as SYSTEM for consuming targets + FIX (#330)
* Mark NVBench headers as SYSTEM for consuming targets.

Fixes #30.

* As nvbench.main links to nvbench as INTERFACE only, it no longer consumes usage reqs of nvbench

Because of this nvbench.main was no longer consuming dependence on CUDA::toolkit include dirs.

This PR links nvbench.main to ${ctk_libraries} privately to reestablish that dependency

* Implement use of pragma system_header in NVBench

1. Add code to nvbench/config.cuh.in to define NVBENCH_IMPLICIT_SYSTEM_HEADER_*
preprocessor variable depending on compiler, unless NVBENCH_NO_IMPLICIT_SYSTEM_HEADER
was defined.

2. Build NVBench targets with -DNVBENCH_NO_IMPLICIT_SYSTEM_HEADER

3. Modify each header file in nvbench/ folder to
   - include <nvbench/config.cuh>
   - Execute pragma <OPTIONAL_CMPLR> system_header guarded
     by checks for defined preprocessor variables
   - Do the above two steps before any other headers are included

---------

Co-authored-by: Allison Piper <apiper@nvidia.com>
2026-03-23 15:10:41 -04:00
Oleksandr Pavlyk
9a91b9ef0c Reworked cupti_profiler to use Host + Range Profiler APIs end-to-end (#327)
* Reworked cupti_profiler to use Host + Range Profiler APIs end-to-end

NVPW_* API has been deprecated since CTK 13.0. Followed advice in compilation
message to replace NVPW_* API with CUPTI Profiler Host API.

`libnvbench.so` no longer links to `nvperf_host` directly, only to `libcupti`.

NVBench uses the **CUPTI Host API** to build a config image from metric names,
and the **Range Profiler API** to collect and decode counters. The host API never
collects data directly; it prepares and evaluates data produced by range profiling.

Introduce `host_impl`/`profiler_init_guard` to manage CUPTI Host object and
initialization/deinitialization, including safe move-assignment cleanup.

`profiler_init_guard` initializes profiler, and throws if CUPTI returns
an error code. `profiler_init_guard::finalize_profiler()` de-inits profiler
and returns the error code. Destructor calls finalize_profiler, but ignores
the status code. To explicitly de-initialize the profiler and handle the error,
call `finalize_profiler()` directly. The guard has a boolean member variable that
lets the destructor work correctly even if `finalize_profiler()` was already
called explicitly.

The old counter-data prefix/scratch flow was replaced with the Range Profiler counter
data image sizing/initialization path and decode flow.

Host API metric filtering (base metrics + context scope) and Host-side evaluation to
GPU values via `cuptiProfilerHostEvaluateToGpuValues` are implemented.

   - **Host object**: `host_impl::object` in `nvbench/cupti_profiler.cxx`.
   - **Range profiler object**: `host_impl::range_profiler_object`.
   - **Config image**: `m_config_image`.
   - **Counter data image**: `m_data_image`.

1) **Host init + config image**
   - `initialize_profiler_host()` creates the host object.
   - `initialize_config_image_host()` adds metrics and builds the config image.

2) **Range profiler enable + counter data image**
   - `enable_range_profiler()` creates the range profiler object.
   - `initialize_counter_data_image()` sizes and initializes the data image using
     the range profiler object, matching the CUPTI samples.

3) **Config + collect + decode**
   - `set_range_profiler_config()` binds the config image + data image.
   - `start_user_loop()` / `stop_user_loop()` push/pop the user range and
     start/stop the range profiler.
   - `process_user_loop()` decodes counter data via
     `cuptiRangeProfilerDecodeData()`.

4) **Evaluate metrics**
   - `get_counter_values()` calls `cuptiProfilerHostEvaluateToGpuValues()` to
     convert counter data into metric values.

* Use class instead of struct in profiler_init_guard; forward declaration

* Add SFINAE guards before accessing members not present in earlier CTK versions

* Check if cupti_profiler_host.h exists, use old/new implementation based on that check

1. Reintroduced legacy `cupti_profiler_nvpw.cuh` and `cupti_profiler_nvpw.cxx`.
2. Moved profiler-host-API implementation to `cupti_profiler_host.cuh`, `cupti_profiler_host.cxx`.
3. Add `nvbench/cupti_profiler.cuh` which checks if `cupti_profiler_host.h` header is known and
   includes `cupti_profiler_host.cuh` or `cupti_profiler_nvpw.cuh` respectively.
4. In cmake, we check if ${nvbench_cupti_root}/include/cupti_profiler_host.h file exists.
   If it does not, `libnvbench.so` would have dependency on libnvperf_host and libnvperf_target
   in addition to dependency on libcupti. If the header exists, it would only depend on libcupti
2026-03-23 11:51:16 -04:00
Oleksandr Pavlyk
1d823c6975 Merge pull request #328 from oleksandr-pavlyk/set-type-axes-names-in-auto-throughput-example 2026-03-20 18:44:03 -05:00
Oleksandr Pavlyk
56cdaed0af Merge pull request #299 from NVIDIA/pre-commit-ci-update-config
[pre-commit.ci] pre-commit autoupdate
2026-03-20 16:15:20 -05:00
Oleksandr Pavlyk
a6e570083d Merge pull request #329 from oleksandr-pavlyk/fix-fmt-target-name-in-tests
Link against fmt::fmt target, not fmt
2026-03-20 08:49:05 -05:00
Oleksandr Pavlyk
4c278b08b3 Link against fmt::fmt target, not fmt. Consistent with nvbench/CMakeLists.txt
Co-authored-by: Dominic Charrier <docharri@amd.com>
2026-03-19 14:53:06 -05:00
Oleksandr Pavlyk
49636c70b3 Set type-axes name to ItemsPerThread to replace auto-generated T 2026-03-19 14:35:46 -05:00
Bernhard Manfred Gruber
728212f9f1 Merge pull request #315 from bernhardmgruber/plot_diff_script
Extend `nvbench_compare.py` with `--plot`, axis/benchmark filtering, and dark mode
2026-02-28 01:38:27 +01:00
Bernhard Manfred Gruber
4164909c52 Feedback 2026-02-28 01:19:18 +01:00
Oleksandr Pavlyk
5387d2005b Merge pull request #322 from oleksandr-pavlyk/feature/save-frequencies
Save frequencies when bulk-saving of times is enabled

SM clock rates are now always collected, even if throttling threshold is set to zero
2026-02-27 13:30:11 -06:00
Oleksandr Pavlyk
c9705de4a4 Reserve enough space for clock rates for min samples, if specified 2026-02-27 12:49:35 -06:00
Bernhard Manfred Gruber
0abc8ec82b Extend nvbench_compare.py with --plot, axis/benchmark filtering, and dark mode
Co-authored-by: Oleksandr Pavlyk <21087696+oleksandr-pavlyk@users.noreply.github.com>
2026-02-27 11:06:20 +01:00
Oleksandr Pavlyk
ba7150e447 Merge pull request #314 from bernhardmgruber/plot_script
Add a script to plot benchmark results
2026-02-26 12:59:16 -06:00
Bernhard Manfred Gruber
800f640c20 Apply reviewer feedback 2026-02-26 19:23:51 +01:00
Oleksandr Pavlyk
998ab125ce Don't override m_check_throttling if throttling threshold is non-positive
measure_cold class now directly inherits m_check_throttling from state.
This ensures that when `--jsonbin` is specified frequency data corresponding
to timing data are available to write out.
2026-02-20 16:34:53 -06:00
Oleksandr Pavlyk
731e0c2c30 Swapped data members m_sm_clock_rates and m_sm_clock_rate_accumulator
This places all std::vector members together. Added default initialization
to all std::vector members, and all other members with default constructors.

Exceptions are references and the nvbench::launch m_launch member
2026-02-19 15:33:57 -06:00
Oleksandr Pavlyk
4da9f431c0 Templatize write_out_values for different storage formats
This could be used to save data as float32_t, or float64_t.
This flexibility is useful for experimentation.
2026-02-19 15:32:00 -06:00
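The actual change is a C++ template. The Python analogue below only illustrates the effect of choosing a storage format when writing out samples. The function names are hypothetical, and the sketch says nothing about the real on-disk layout.

```python
from array import array


def write_out_values_sketch(values, typecode="f"):
    """Pack values as raw binary; 'f' is a C float, 'd' is a C double."""
    return array(typecode, values).tobytes()


def read_back_sketch(buffer, typecode="f"):
    """Decode a buffer written with the same typecode."""
    out = array(typecode)
    out.frombytes(buffer)
    return list(out)


times = [0.5, 0.25, 1.5]  # exactly representable in both formats
as_f32 = write_out_values_sketch(times, "f")
as_f64 = write_out_values_sketch(times, "d")
```

The narrower format halves the output size at the cost of precision, which is the kind of trade-off the commit leaves open for experimentation.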
Oleksandr Pavlyk
988420b5b1 Use write_out_values utility to save frequencies
The utility was already used to save times
2026-02-13 10:19:06 -06:00
Georgy Evtushenko
40b2f4ece2 Better place to stop freq timer? 2026-02-13 09:53:59 -06:00
Georgy Evtushenko
a487a38895 Dump frequencies 2026-02-13 08:49:41 -06:00
Bernhard Manfred Gruber
d3a0bec4a8 Feedback from review 2026-02-05 14:13:16 +01:00
Bernhard Manfred Gruber
28ed32bb47 Implement dark mode using style sheets 2026-02-05 14:00:33 +01:00
Bernhard Manfred Gruber
ec9759037d I have no idea what I am doing 2026-02-05 11:15:27 +01:00
Bernhard Manfred Gruber
ccde9fc4d4 More 2026-02-05 10:56:36 +01:00
Bernhard Manfred Gruber
0be190b407 Add a script to plot benchmark results 2026-02-05 10:36:52 +01:00
Nader Al Awar
dc59f98ecd Remove cupti from cuda-bench dependencies (#311) python-0.2.0 2026-02-03 14:16:26 -06:00
Bernhard Manfred Gruber
90ad8bcbc7 Merge pull request #296 from bernhardmgruber/compare_sub_results
Allow partial comparison in `nvbench_compare.py`
2026-02-03 20:02:34 +01:00
Bernhard Manfred Gruber
c6ef87575c Allow partial comparison in nvbench_compare.py
Fixes: #295
2026-02-03 16:32:11 +01:00
Nader Al Awar
d75fc74162 Merge branch 'main' into remove-cupti-python 2026-02-03 08:58:41 -06:00
Oleksandr Pavlyk
867d5d4276 Merge pull request #294 from oleksandr-pavlyk/add-docstrings 2026-02-03 08:51:55 -06:00