nvbench

mirror of https://github.com/NVIDIA/nvbench.git synced 2026-05-12 09:15:47 +00:00

Author	SHA1	Message	Date
Oleksandr Pavlyk	8d1b316765	Require at least 5 samples to begin estimating noise level	2026-05-05 07:44:25 -05:00
Oleksandr Pavlyk	e53a1a2654	Use median and IR/relative as cmp_time/ref_time and cmp_noise/ref_noise. These measures are less sensitive to outliers.	2026-05-04 16:14:56 -05:00
Oleksandr Pavlyk	ea592b6444	Tweaks for nvbench_compare 1. For JSON files that contain repeated measurements of run-time axis values, make sure that the script compares corresponding reference entries. If cmp had two states with the same name and ref had two, we would compare measurements for each state in cmp against the first state in ref. The change here introduces counters tracking how many times each particular axis value has been seen, and retrieves the corresponding entry in ref. Previously, I had ``` \| BlockSize \| NumBlocks \| Ref Time \| Ref Noise \| Cmp Time \| Cmp Noise \| Diff \| %Diff \| Status \| \|-------------\|-------------\|------------\|-------------\|------------\|-------------\|-----------\|---------\|----------\| \| 2^8 \| 64 \| 1.776 ms \| 0.46% \| 1.777 ms \| 0.40% \| 1.024 us \| 0.06% \| SAME \| \| 2^8 \| 64 \| 1.776 ms \| 0.46% \| 1.774 ms \| 0.52% \| -2.048 us \| -0.12% \| SAME \| \| 2^8 \| 64 \| 1.776 ms \| 0.46% \| 1.773 ms \| 0.52% \| -3.072 us \| -0.17% \| SAME \| \| 2^8 \| 64 \| 1.776 ms \| 0.46% \| 1.774 ms \| 0.58% \| -2.048 us \| -0.12% \| SAME \| \| 2^8 \| 64 \| 1.776 ms \| 0.46% \| 1.773 ms \| 0.58% \| -3.072 us \| -0.17% \| SAME \| ``` and now it becomes ``` \| BlockSize \| NumBlocks \| Ref Time \| Ref Noise \| Cmp Time \| Cmp Noise \| Diff \| %Diff \| Status \| \|-------------\|-------------\|------------\|-------------\|------------\|-------------\|-----------\|---------\|----------\| \| 2^8 \| 64 \| 1.776 ms \| 0.46% \| 1.777 ms \| 0.40% \| 1.024 us \| 0.06% \| SAME \| \| 2^8 \| 64 \| 1.773 ms \| 0.64% \| 1.774 ms \| 0.52% \| 1.024 us \| 0.06% \| SAME \| \| 2^8 \| 64 \| 1.774 ms \| 0.46% \| 1.773 ms \| 0.52% \| -1.024 us \| -0.06% \| SAME \| \| 2^8 \| 64 \| 1.773 ms \| 0.46% \| 1.774 ms \| 0.58% \| 1.024 us \| 0.06% \| SAME \| \| 2^8 \| 64 \| 1.774 ms \| 0.52% \| 1.773 ms \| 0.58% \| -1.024 us \| -0.06% \| SAME \| ``` This is expected given the following raw data ``` (py313) opavlyk@NV-22T4X34:~/repos/nvbench$ jq '. \| .benchmarks[] \| .states[] \| .summaries[] \| select(.tag == "nv/cold/time/gpu/median") \| .data[] \| .value' base.json "0.0017756160497665405" "0.0017725440263748169" "0.001773568034172058" "0.0017725440263748169" "0.001773568034172058" (py313) opavlyk@NV-22T4X34:~/repos/nvbench$ jq '. \| .benchmarks[] \| .states[] \| .summaries[] \| select(.tag == "nv/cold/time/gpu/median") \| .data[] \| .value' test.json "0.0017766400575637818" "0.001773568034172058" "0.0017725440263748169" "0.001773568034172058" "0.0017725440263748169" ``` 2. nvbench_compare changes from using min_noise = min(ref_noise, cmp_noise) to using max_noise = max(ref_noise, cmp_noise). Using the larger of the ref and cmp noise levels as the reference against which to gauge the timing difference ratio makes more sense.	2026-05-04 16:14:56 -05:00
Oleksandr Pavlyk	e292bb4eec	Add statistics::compute_percentiles, use it in summaries of measure_cold. Percentiles on an empty dataset are NaN, not infinity. Add robust statistics of CPU times to summary. Fixed name for nv/cold/time/gpu/q3, corrected value reported for nv/cold/time/gpu/ir/relative. Use median and IR to compute location and noise in measure_cold. Also in stdrel_criterion, compute noise as IR / median.	2026-05-04 16:14:18 -05:00
Oleksandr Pavlyk	e9daaba0f9	Implement sample-count stopping criterion with parameter target-samples. `--stopping-criterion sample-count --target-samples 100` would stop once max(--min-samples, --target-samples) samples are collected.	2026-05-04 08:52:46 -05:00
Oleksandr Pavlyk	bf0d2a807d	Ensure that measure_cold::run_warmup instantiates blocking kernel. Because warm-up runs are executed without use of the blocking kernel, the blocking kernel was not jitted until actual measurements were collected. The module loading cost incurred during the first run shows up as an elevated CPU time noise value for the first measurement, as noted in https://github.com/NVIDIA/nvbench/pull/339 This PR adds `this->block_stream(); this->unblock_stream();` prior to executing the warm-up loop with use of the blocking kernel disabled. This ensures that the blocking kernel is instantiated during the warm-up, but no other kernel is launched between its launch and the stream sync, thus avoiding deadlock.	2026-05-04 08:52:46 -05:00
Oleksandr Pavlyk	81e27660b8	Implement warmup-runs count, supported as CLI. CLI option --warmup-runs implemented and documented. The warm-up count is enforced to always be positive. This is necessary to ensure that JIT-ting has occurred, and use of blocking kernel would not result in time-outs. A test of the option parser is added.	2026-05-04 08:52:46 -05:00
Oleksandr Pavlyk	f392725015	Correct Python API signature of State.get_axis_values_as_strings (#346 ) * Correct Python API signature of State.get_axis_values_as_strings The C++ API has default boolean argument color, but Python API declared no arguments. Closes #345 * Also exercise invocation of get_axis_values_as_string with keyword argument value * Remove use of cuda.core.experimental	2026-05-04 08:40:29 -05:00
Oleksandr Pavlyk	a3364ca5c7	Port changes to the package from #323 (#337) Fixed relative text alignment in docstrings to fix autodoc warnings. Renamed cuda.bench.test_cpp_exception and cuda.bench.test_py_exception functions to start with underscore, signaling that these functions are internal and should not be documented. Account for test_cpp_exceptions -> _test_cpp_exception, same for _py_. Make sure to reset __module__ of reexported symbols to be cuda.bench.	2026-04-22 08:28:15 -05:00
Oleksandr Pavlyk	b0a46f44c2	Modularize color handling (#336 ) * Introduce function colorize to modularize colorization/no-color handling * Use sns.set_theme instead of deprecated sns.set() * Use str.format instead of legacy % syntax * Simplified iteration over list Use f-string (supported since Python 3.6) instead of str.format for better readability and performance	2026-04-14 08:09:44 -05:00
pre-commit-ci[bot]	8d23e3e73c	[pre-commit.ci] pre-commit autoupdate (#333 ) * [pre-commit.ci] pre-commit autoupdate updates: - [github.com/pre-commit/mirrors-clang-format: v21.1.8 → v22.1.2](https://github.com/pre-commit/mirrors-clang-format/compare/v21.1.8...v22.1.2) - [github.com/astral-sh/ruff-pre-commit: v0.14.10 → v0.15.9](https://github.com/astral-sh/ruff-pre-commit/compare/v0.14.10...v0.15.9) - [github.com/codespell-project/codespell: v2.4.1 → v2.4.2](https://github.com/codespell-project/codespell/compare/v2.4.1...v2.4.2) * [pre-commit.ci] auto code formatting --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-13 16:24:55 +00:00
Oleksandr Pavlyk	e62c5b6f79	Correct description/hint entries for summaries with name "Noise" (#335) See #334	2026-04-13 11:13:37 -05:00
Nader Al Awar	373970323f	Merge pull request #331 from oleksandr-pavlyk/update-python-examples Update python examples	2026-04-02 15:20:24 -04:00
Oleksandr Pavlyk	39730efbc3	Update requirements to reflect packages used by examples	2026-04-02 10:37:17 -05:00
Oleksandr Pavlyk	9f75642387	Add patch to cutlass.base_dsl.dsl.BaseDSL to work-around a bug See https://github.com/NVIDIA/cutlass/issues/3142	2026-04-02 10:29:31 -05:00
Nader Al Awar	488173a242	Add `--no-color` flag to nvbench_compare.py which can be used for github issues and PRs python-0.2.1	2026-04-01 18:27:54 -04:00
Nader Al Awar	7a68e53df0	Rename flag from markdown to no-color	2026-04-01 17:01:29 -05:00
Nader Al Awar	7e5e784855	Add --markdown flag to nvbench_compare.py which can be used for github issues/prs	2026-04-01 14:53:13 -05:00
Oleksandr Pavlyk	93bc59d05c	Renamed CUTLASS example to reflect that it uses CuteDSL	2026-04-01 08:24:29 -05:00
Oleksandr Pavlyk	e4cfddeb87	Rewrote cutlass_gemm example to use CuteDSL	2026-04-01 08:23:41 -05:00
Oleksandr Pavlyk	3f284b4004	Renamed cccl_* examples cccl_parallel_* -> cuda_compute_* cccl_cooperative_* -> cuda_coop_*	2026-04-01 08:20:20 -05:00
Oleksandr Pavlyk	5bdb30f4b6	Update to cccl_parallel_segmented_reduce example per changes in API Update namespace changes. Use make_segmented_reduce factory function, and update call signatures.	2026-04-01 08:18:15 -05:00
Oleksandr Pavlyk	d8739fc208	Update to cccl_cooperative_block_reduce example	2026-04-01 08:17:52 -05:00
Oleksandr Pavlyk	974eb5ee0f	Replace use of cupy.cuda.ExternalStream with cupy.cuda.Stream.from_external	2026-04-01 08:17:12 -05:00
Oleksandr Pavlyk	7c60edcc0a	cuda.core.experimental -> cuda.core	2026-04-01 08:16:04 -05:00
Oleksandr Pavlyk	836a6c12f4	Merge pull request #326 from oleksandr-pavlyk/fix-sfinae-incomplete Fix GCC16 sfinae incomplete warnings. GCC16 started requiring that the type `T` used in `std::reference_wrapper<T>` is complete when using `-std=c++17`. Since NVBench has to forward declare some types in header files to break circular dependencies, use of an incomplete type breaks the build: GCC16 emits a `-Wsfinae-incomplete` warning, which the `-Werror` flag turns into an error. This commit replaced affected uses of `std::reference_wrapper<const nvbench::benchmark_base>` in state.cxx, and `std::reference_wrapper<nvbench::printer_base>` in benchmark_base.cxx with raw pointers.	2026-03-24 16:02:28 -05:00
Oleksandr Pavlyk	317dc6824e	Mark NVBench headers as SYSTEM for consuming targets + FIX (#330) * Mark NVBench headers as SYSTEM for consuming targets. Fixes #30. * As nvbench.main links to nvbench as INTERFACE only, it no longer consumes usage reqs of nvbench. Because of this nvbench.main was no longer consuming dependence on CUDA::toolkit include dirs. This PR links nvbench.main to ${ctk_libraries} privately to reestablish that dependency * Implement use of pragma system_header in NVBench 1. Add code to nvbench/config.cuh.in to define NVBENCH_IMPLICIT_SYSTEM_HEADER_* preprocessor variable depending on compiler, unless NVBENCH_NO_IMPLICIT_SYSTEM_HEADER was defined. 2. Build NVBench targets with -DNVBENCH_NO_IMPLICIT_SYSTEM_HEADER 3. Modify each header file in nvbench/ folder to - include <nvbench/config.cuh> - Execute pragma <OPTIONAL_CMPLR> system_header guarded by checks for defined preprocessor variables - Do the above two steps before any other headers are included --------- Co-authored-by: Allison Piper <apiper@nvidia.com>	2026-03-23 15:10:41 -04:00
Oleksandr Pavlyk	9a91b9ef0c	Reworked cupti_profiler to use Host + Range Profiler APIs end-to-end (#327) * Reworked cupti_profiler to use Host + Range Profiler APIs end-to-end NVPW_* API has been deprecated since CTK 13.0. Followed advice in compilation message to replace NVPW_* API with CUPTI Profiler Host API. `libnvbench.so` no longer links to `nvperf_host` directly, only to `libcupti`. NVBench uses the CUPTI Host API to build a config image from metric names, and the Range Profiler API to collect and decode counters. The host API never collects data directly; it prepares and evaluates data produced by range profiling. Introduce `host_impl`/`profiler_init_guard` to manage CUPTI Host object and initialization/deinitialization, including safe move-assignment cleanup. `profiler_init_guard` initializes profiler, and throws if CUPTI returns an error code. `profiler_init_guard::finalize_profiler()` de-inits profiler and returns the error code. Destructor calls finalize_profiler, but ignores the status code. If user wants to explicitly de-initialize profiler and handle the error, he/she is advised to call `finalize_profiler()` directly. The guard has a boolean member variable to allow destructor to work even if user explicitly called finalize_profiler() method. The old counter-data prefix/scratch flow was replaced with the Range Profiler counter data image sizing/initialization path and decode flow. Host API metric filtering (base metrics + context scope) and Host-side evaluation to GPU values via cuptiProfilerHostEvaluateToGpuValues are implemented. - Host object: `host_impl::object` in `nvbench/cupti_profiler.cxx`. - Range profiler object: `host_impl::range_profiler_object`. - Config image: `m_config_image`. - Counter data image: `m_data_image`. 1) Host init + config image - `initialize_profiler_host()` creates the host object. - `initialize_config_image_host()` adds metrics and builds the config image. 
2) Range profiler enable + counter data image - `enable_range_profiler()` creates the range profiler object. - `initialize_counter_data_image()` sizes and initializes the data image using the range profiler object, matching the CUPTI samples. 3) Config + collect + decode - `set_range_profiler_config()` binds the config image + data image. - `start_user_loop()` / `stop_user_loop()` push/pop the user range and start/stop the range profiler. - `process_user_loop()` decodes counter data via `cuptiRangeProfilerDecodeData()`. 4) Evaluate metrics - `get_counter_values()` calls `cuptiProfilerHostEvaluateToGpuValues()` to convert counter data into metric values. The * Use class instead of struct in profiler_init_guard; forward declaration * Add SFINAE guards before accessing members not present in earlier CTK versions * Check if cupti_profiler_host.h exists, use old/new implementation based on that check 1. Reintroduced legacy `cupti_profiler_nvpw.cuh` and `cupti_profiler_nvpw.cuh`. 2. Moved profiler-host-API implementation to `cupti_profiler_host.cuh`, `cupti_profiler_host.cxx`. 3. Add `nvbench/cupti_profiler.cuh` which checks if `cupti_profiler_host.h` header is known and includes `cupti_profiler_host.cuh` or `cupti_profiler_nvpw.cuh` respectively. 4. In cmake, we check if ${nvbench_cupti_root}/include/cupti_profiler_host.h file exists. If it does not, `libnvbench.so` would have dependency on libnvperf_host and libnvperf_target in addition to dependency on libcupti. If the header exists, it would only depend on libcupti	2026-03-23 11:51:16 -04:00
Oleksandr Pavlyk	1d823c6975	Merge pull request #328 from oleksandr-pavlyk/set-type-axes-names-in-auto-throughput-example	2026-03-20 18:44:03 -05:00
Oleksandr Pavlyk	56cdaed0af	Merge pull request #299 from NVIDIA/pre-commit-ci-update-config [pre-commit.ci] pre-commit autoupdate	2026-03-20 16:15:20 -05:00
Oleksandr Pavlyk	a6e570083d	Merge pull request #329 from oleksandr-pavlyk/fix-fmt-target-name-in-tests Link against fmt::fmt target, not fmt	2026-03-20 08:49:05 -05:00
Oleksandr Pavlyk	4c278b08b3	Link against fmt::fmt target, not fmt. Consistent with nvbench/CMakeLists.txt Co-authored-by: Dominic Charrier <docharri@amd.com>	2026-03-19 14:53:06 -05:00
Oleksandr Pavlyk	49636c70b3	Set type-axes name to ItemsPerThread to replace auto-generated T	2026-03-19 14:35:46 -05:00
Bernhard Manfred Gruber	728212f9f1	Merge pull request #315 from bernhardmgruber/plot_diff_script Extend `nvbench_compare.py` with `--plot`, axis/benchmark filtering, and dark mode	2026-02-28 01:38:27 +01:00
Bernhard Manfred Gruber	4164909c52	Feedback	2026-02-28 01:19:18 +01:00
Oleksandr Pavlyk	5387d2005b	Merge pull request #322 from oleksandr-pavlyk/feature/save-frequencies Save frequencies when bulk-saving of times is enabled SM clock rates are now always collected, even if throttling threshold is set to zero	2026-02-27 13:30:11 -06:00
Oleksandr Pavlyk	c9705de4a4	Reserve enough space for clock rates for min samples, if specified	2026-02-27 12:49:35 -06:00
Bernhard Manfred Gruber	0abc8ec82b	Extend nvbench_compare.py with `--plot`, axis/benchmark filtering, and dark mode Co-authored-by: Oleksandr Pavlyk <21087696+oleksandr-pavlyk@users.noreply.github.com>	2026-02-27 11:06:20 +01:00
Oleksandr Pavlyk	ba7150e447	Merge pull request #314 from bernhardmgruber/plot_script Add a script to plot benchmark results	2026-02-26 12:59:16 -06:00
Bernhard Manfred Gruber	800f640c20	Apply reviewer feedback	2026-02-26 19:23:51 +01:00
Oleksandr Pavlyk	998ab125ce	Don't override m_check_throttling if throttling threshold is non-positive measure_cold class now directly inherits m_check_throttling from state. This ensures that when `--jsonbin` is specified frequency data corresponding to timing data are available to write out.	2026-02-20 16:34:53 -06:00
Oleksandr Pavlyk	731e0c2c30	Swapped data members m_sm_clock_rates and m_sm_clock_rate_accumulator This places all std::vector members together. Added default initialization to all std::vector members, and all other members with default constructors. Exceptions are references and nvbench::launch m_launch; member	2026-02-19 15:33:57 -06:00
Oleksandr Pavlyk	4da9f431c0	Templatize write_out_values for different storage formats This could be used to save data as float32_t, or float64_t. This flexibility is useful for experimentation.	2026-02-19 15:32:00 -06:00
Oleksandr Pavlyk	988420b5b1	Use write_out_values utility to save frequencies The utility was already used to save times	2026-02-13 10:19:06 -06:00
Georgy Evtushenko	40b2f4ece2	Better place to stop freq timer?	2026-02-13 09:53:59 -06:00
Georgy Evtushenko	a487a38895	Dump frequencies	2026-02-13 08:49:41 -06:00
Bernhard Manfred Gruber	d3a0bec4a8	Feedback from review	2026-02-05 14:13:16 +01:00
Bernhard Manfred Gruber	28ed32bb47	Implement dark mode using style sheets	2026-02-05 14:00:33 +01:00
Bernhard Manfred Gruber	ec9759037d	I have no idea what I am doing	2026-02-05 11:15:27 +01:00
Bernhard Manfred Gruber	ccde9fc4d4	More	2026-02-05 10:56:36 +01:00
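
Commits 8d1b316765, e53a1a2654, e292bb4eec and ea592b6444 above describe estimating location as the median, noise as IR / median, requiring at least 5 samples before noise is estimated, and gauging the timing difference against max(ref_noise, cmp_noise). The following is a minimal Python sketch of that arithmetic, not the NVBench implementation. The function names `robust_summary` and `within_noise`, the quartile interpolation method, and the exact comparison rule are assumptions made for illustration.

```python
import statistics

# Threshold taken from commit 8d1b316765; the constant name is hypothetical.
MIN_SAMPLES_FOR_NOISE = 5


def robust_summary(samples):
    """Return (median, relative_noise) for a list of timings.

    relative_noise is IR / median, where IR is the interquartile range
    (q3 - q1). It is None when fewer than MIN_SAMPLES_FOR_NOISE samples
    are available. The "inclusive" quartile method is an assumption;
    NVBench's statistics::compute_percentiles may interpolate differently.
    """
    median = statistics.median(samples)
    if len(samples) < MIN_SAMPLES_FOR_NOISE:
        return median, None
    q1, _, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return median, (q3 - q1) / median


def within_noise(ref_time, ref_noise, cmp_time, cmp_noise):
    """Assumed rule: the relative difference is compared to the larger noise."""
    max_noise = max(ref_noise, cmp_noise)
    frac_diff = (cmp_time - ref_time) / ref_time
    return abs(frac_diff) <= max_noise


if __name__ == "__main__":
    # A single large outlier leaves both the median and the IR unchanged.
    print(robust_summary([1.0, 2.0, 3.0, 4.0, 5.0]))
    print(robust_summary([1.0, 2.0, 3.0, 4.0, 100.0]))
    # First row of the table in commit ea592b6444: 1.776 ms / 0.46% vs 1.777 ms / 0.40%.
    print(within_noise(1.776e-3, 0.0046, 1.777e-3, 0.0040))
```

Both example sample sets give the same median and the same IR, which illustrates why these measures are described as less sensitive to outliers than a mean and standard deviation would be.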


766 Commits
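
Commit ea592b6444 describes pairing repeated states by counting how many times each state name has been seen in cmp and retrieving the ref entry with the same occurrence index. A minimal Python sketch of that idea follows. The function `pair_states`, the dictionary layout with `name` and `median` keys, and the state name string are hypothetical; the actual nvbench_compare.py code and JSON schema may differ.

```python
from collections import defaultdict


def pair_states(ref_states, cmp_states):
    """Pair the k-th cmp state of a given name with the k-th ref state of that name.

    cmp states with no corresponding ref occurrence are skipped.
    """
    # Group reference states by name, preserving file order.
    ref_by_name = defaultdict(list)
    for state in ref_states:
        ref_by_name[state["name"]].append(state)

    seen = defaultdict(int)  # how many times each name has occurred in cmp so far
    pairs = []
    for state in cmp_states:
        name = state["name"]
        k = seen[name]
        seen[name] += 1
        candidates = ref_by_name.get(name, [])
        if k < len(candidates):
            pairs.append((candidates[k], state))
    return pairs


if __name__ == "__main__":
    name = "BlockSize=2^8 NumBlocks=64"  # hypothetical state name
    # Median values copied from the jq output quoted in commit ea592b6444.
    ref = [{"name": name, "median": v} for v in
           (0.0017756160497665405, 0.0017725440263748169, 0.001773568034172058)]
    cmp_ = [{"name": name, "median": v} for v in
            (0.0017766400575637818, 0.001773568034172058, 0.0017725440263748169)]
    for r, c in pair_states(ref, cmp_):
        print(r["median"], c["median"], c["median"] - r["median"])
```

With this pairing, each repeated measurement in cmp is compared against its own counterpart in ref rather than always against the first ref entry.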