nvbench

mirror of https://github.com/NVIDIA/nvbench.git synced 2026-06-29 10:47:36 +00:00

Author	SHA1	Message	Date
Oleksandr Pavlyk	1d13b49996	Add scoped filtering and device pairing to nvbench_compare Teach nvbench_compare to keep the order of --benchmark and --axis arguments so axis filters can apply either globally or to the most recent benchmark. Build a filter plan from the ordered CLI arguments and apply the same plan to table output and plotting labels. Add explicit --reference-devices and --compare-devices filters. The filters accept all, a single device id, or a comma-separated list of ids; ordered lists and duplicates are preserved so selected reference and compare devices can be paired by position. Device-section mismatches remain fatal for unfiltered all-vs-all comparisons, but become warnings when the user explicitly selects devices and the selected device counts match. Match duplicate benchmark states by occurrence within each filtered device section instead of matching only by state name across the whole benchmark. This keeps repeated axis values and filtered duplicate states aligned between the reference and compare inputs, and reports mismatched occurrence counts instead of silently dropping extra states. Add Python tests for duplicate-state matching, axis filtering before matching, device filter parsing and validation, explicit cross-device pairing, and benchmark-scoped axis filters. Original commit messages folded into this change: Tweaks for nvbench_compare 1. When JSON files contain multiple entries with the same name and axis values, make sure that the script compares corresponding entries. The previous logic would extract the first entry from ref data and compare measurements for each state in cmp against that first entry. The change introduces a counter that tracks which nth entry is being processed for a particular axis value and retrieves the corresponding entry in ref. Scope occurrence matching by device. 
Device pairing in nvbench_compare.py is strictly index-based under --ignore-devices, so reused IDs in a different order no longer pair against the wrong reference device. Require devices in ref and cmp to have the same cardinality Handle mismatch when number of duplicates in ref data is not the same as in cmp data Use pytest monkeypatch fixture to pretend third-party package dependencies are available during test run for nvbench_compare without introducing test-time dependency Added the happy-path test and fixed its direct-call setup by initializing the device globals that main() normally populates. Fix to filter-before-matching. - compare_benches() now pairs devices by selected position instead of taking a device id. - For each device pair, compare_benches() now builds: - ref_device_states: matching reference device and axis filters - cmp_device_states: matching compare device and axis filters - State occurrence counts and duplicate occurrence matching now operate only on those filtered per-device lists. - Removed the later matches_axis_filters() skip inside the compare-state loop because filtering now happens before matching. Added a regression test where ref/cmp have duplicate state names in opposite order, and --axis keeps only one of them. The test verifies the kept compare state is matched against the kept reference state, not the first unfiltered occurrence. Introduce device filtering in nvbench_compare - --reference-devices all|ID|ID,ID,... - --compare-devices all|ID|ID,ID,... - Integer lists preserve order and duplicates. - Requested IDs are validated against the file-level device list. - Filtered reference/compare device counts must match before comparison. - compare_benches() pairs selected reference and compare devices by position. - Each benchmark validates that requested device IDs are present in its own devices list. Implemented benchmark-scoped --axis handling. 
- --axis and --benchmark now share an ordered argparse action, so their relative CLI order is preserved. - -a before any -b becomes a global axis filter. - -a after -b <name> applies to that most recent benchmark only. - Repeated -b entries are treated as separate filter scopes and combined as alternatives for that benchmark. - Device filtering remains global and is applied independently. Allow non-matching devices for explicit device selection Now the device-section equality check remains fatal only for unfiltered all-vs-all comparisons. If either --reference-devices or --compare-devices is explicit, mismatched selected device metadata is printed as a warning, but comparison proceeds after the selected device counts have been validated. Fix for resolve_benchmark_device_ids, add comments The return value of resolve_benchmark_device_ids now always owns its list. Use monkeypatch class in set_test_devices helper Stricter device id validation Test for device id validation	2026-06-02 11:48:01 -05:00
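The occurrence-based matching described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in nvbench_compare: the function name `pair_by_occurrence` and the `{"name": ...}` state dictionaries are hypothetical, and the inputs are assumed to be already filtered to one reference/compare device pair.

```python
from collections import defaultdict


def pair_by_occurrence(ref_states, cmp_states):
    """Pair states that share a name by their order of appearance.

    Hypothetical helper: returns (pairs, mismatched), where mismatched maps
    a state name to (ref_count, cmp_count) whenever the two counts differ.
    """
    ref_by_name = defaultdict(list)
    for state in ref_states:
        ref_by_name[state["name"]].append(state)

    seen = defaultdict(int)  # how many times each name has appeared in cmp
    pairs = []
    for state in cmp_states:
        name = state["name"]
        k = seen[name]
        seen[name] += 1
        candidates = ref_by_name.get(name, [])
        if k < len(candidates):
            # The k-th compare occurrence is matched to the k-th reference one.
            pairs.append((candidates[k], state))

    names = set(ref_by_name) | set(seen)
    mismatched = {
        n: (len(ref_by_name.get(n, [])), seen.get(n, 0))
        for n in names
        if len(ref_by_name.get(n, [])) != seen.get(n, 0)
    }
    return pairs, mismatched
```

Returning the mismatched counts alongside the pairs is one way to report extra states instead of dropping them silently.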
Oleksandr Pavlyk	ca1d60610c	Use robust summaries in nvbench_compare classification Teach nvbench_compare to parse GPU timing summaries into structured values and prefer the robust median/IQR summaries when both compared measurements provide them. Fall back to the existing mean/stdev summaries when robust summaries are not available. Classify comparisons with the larger available relative noise estimate instead of the smaller one, keep unavailable noise distinct from encoded infinite noise, and report improvements separately from regressions. Keep the process exit code as success for completed comparisons; regression counts are reported in the summary instead of being used as the process status. Make plotting tolerate unavailable noise by leaving gaps in confidence bands, sort plotted series by the plotted axis, and avoid reusing pyplot state across plot calls. Add focused Python tests for robust-summary preference, unavailable-noise classification, non-finite timing centers, plot-along handling when the selected axis is absent, and the exit-code contract.	2026-06-02 11:47:47 -05:00
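A rough sketch of classifying a comparison with the larger available noise estimate is shown below. The function name, the labels, and the choice of `None` for unavailable noise versus `math.inf` for encoded infinite noise are assumptions made for illustration; the script's actual rules may differ.

```python
import math


def classify(ref_center, cmp_center, ref_noise, cmp_noise):
    """Return 'SAME', 'FAST', 'SLOW', or 'UNKNOWN' (illustrative labels).

    Noise values are relative (0.05 means 5%). None means "unavailable",
    which is kept distinct from math.inf ("encoded infinite noise").
    """
    centers_ok = math.isfinite(ref_center) and math.isfinite(cmp_center)
    if not centers_ok or ref_center <= 0:
        return "UNKNOWN"

    available = [n for n in (ref_noise, cmp_noise) if n is not None]
    if not available:
        return "UNKNOWN"  # no noise estimate at all: cannot judge significance

    threshold = max(available)  # the larger, more conservative estimate
    rel_diff = (cmp_center - ref_center) / ref_center
    if abs(rel_diff) <= threshold:
        return "SAME"
    return "FAST" if rel_diff < 0 else "SLOW"
```

With this convention an infinite noise estimate makes every finite difference insignificant, while a missing estimate yields no verdict.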
Oleksandr Pavlyk	ee4b9f0963	Remove unused python_wheel section (#382 ) ci/matrix.yaml contains unused section once intended for Python wheels	2026-06-01 14:04:38 -05:00
Oleksandr Pavlyk	97c8b29f5a	Updated devcontainer imageset to 26.08 (#381 ) Add CTK 13.2 with compact support for host compilers: - gcc 11 (min), gcc 13 (working), gcc 15 (max) - llvm15 (min), llvm 21 (max) - CL 14.44	2026-06-01 11:02:40 -05:00
Oleksandr Pavlyk	7ba2b79d5b	Reduce stdrel criterion complexity and ensure termination (#374 ) * Reduce stdrel criterion complexity and ensure termination Replace the stdrel criterion's growing sample history with an online mean/variance accumulator. This keeps the stopping criterion based on relative standard deviation, preserves the unbiased standard-deviation estimate used for convergence, and reduces per-sample update work from recomputing over the full history to constant time. Add a bounded invalid-noise path so measurements that persistently produce non-finite relative noise, such as all-zero timings, can terminate without waiting for the wall-time timeout. Keep the normal min-time gate for ordinary stdrel convergence. Add focused tests for the online accumulator, stdrel sample-count threshold, sample-standard-deviation behavior, deterministic convergence inputs, and persistent invalid-noise termination. Update the CLI help for the stdrel termination behavior. * change max-noise to for consistency * Use online_mean_variance on m_noise_tracker in is_finished() Previously, the standard deviation was computed relative to the current noise level instead of the mean noise level. Because of the identity E[ (N - C)^2 ] = E[ (N - E[N])^2 ] + (E[N] - C)^2 >= E[ (N - E[N])^2 ] this led to the criterion terminating later than it could have, because the estimated expectation is always greater than or equal to the estimate relative to the mean. The code used the current noise level instead of the mean to avoid making two passes through the m_noise_tracker container. Using online_mean_variance improves the accuracy of estimating the dispersion of the noise signal while maintaining a single pass through the container. * Address review feedback Fixed misleading commit. Introduce private methods to refactor computation of repeated expressions. Renamed m_cuda_times_summary to m_measurements_summary, since criterion can be applied for CPU-only measurements too. 
Introduced is_close utility for checking whether two floating-point numbers are close to one another. Introduced descriptive constexpr variables for hard-wired constants	2026-05-29 17:06:28 +00:00
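The constant-time update mentioned in this commit is commonly done with Welford's algorithm. The Python sketch below illustrates that technique under the assumption that the C++ accumulator behaves similarly; the class and method names are hypothetical and do not mirror the NVBench implementation.

```python
import math


class OnlineMeanVariance:
    """Welford-style accumulator: O(1) work per sample, no stored history."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    def sample_variance(self):
        # Unbiased estimate (divides by n - 1); undefined for n < 2.
        if self.count < 2:
            return math.nan
        return self._m2 / (self.count - 1)

    def relative_stdev(self):
        # Non-finite when the mean is zero, e.g. for all-zero timings.
        if self.count < 2 or self.mean == 0.0:
            return math.nan
        return math.sqrt(self.sample_variance()) / self.mean
```

A stopping criterion built on this can count consecutive non-finite results and give up after a bound, which is one way to read the "bounded invalid-noise path" above.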
omribz156	ec025d7e0d	docs: separate measurement options from stopping criteria (#373 ) Signed-off-by: Omri SirComp <omribz156@gmail.com>	2026-05-28 16:51:12 -05:00
Oleksandr Pavlyk	6bdbff7f21	include cleanup across nvbench/ (#377 ) Added missing direct standard includes for entities such as std::size_t, std::move, std::vector, std::optional, std::exception, std::memcpy, etc. Added missing project include in nvbench/internal/table_builder.cuh for nvbench::detail::transform_reduce. Fixed nvbench/detail/gpu_frequency.cuh to forward-declare nvbench::cuda_stream in nvbench namespace instead of in nvbench::detail namespace.	2026-05-28 16:40:30 -05:00
Oleksandr Pavlyk	84c7952f8b	nvbench::cpu_timer changed to use steady_clock (#371 ) Using steady_clock is more appropriate for timing measurements. It guarantees that duration computed from two time-points will not contain correction deltas.	2026-05-20 10:22:22 -05:00
mfranzrebsal	4a33a61591	Add Windows support (#354 )	2026-05-19 15:10:58 -05:00
Oleksandr Pavlyk	3d82e58170	Fix docutil error when building docs (#365 )	2026-05-18 10:57:19 -05:00
Oleksandr Pavlyk	4472e7b59b	Add python api for cold warmup parameters (#363 )	2026-05-18 10:56:44 -05:00
Oleksandr Pavlyk	ce75dab94b	Add stopping criterion sample count (#341 ) * Implement sample-count stopping criterion with parameter target-samples --stopping-criterion sample-count --target-samples 100 would stop once max(--min-samples, --target-samples) samples are collected * Address review nitpicks	2026-05-15 15:15:12 -05:00
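As a small illustration of the rule stated in this commit, the check below stops once max(--min-samples, --target-samples) samples have been collected. The function name is hypothetical and the real criterion lives in the C++ library.

```python
def sample_count_reached(collected, min_samples, target_samples):
    """True once max(min_samples, target_samples) samples are collected."""
    return collected >= max(min_samples, target_samples)
```

For example, with a target of 100 and a minimum of 120, the run continues until 120 samples exist.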
Oleksandr Pavlyk	6dd27aedfd	Fix exception safety (#358 ) Improve exception safety of timer structs by using local scope guards to ensure that cleanup steps, such as signaling blocking kernel to unblock and making sure that the stream is synchronized, are performed even if the launch object throws an exception. Tests of exception safety were added. -- * blocking_kernel.unblock_noexcept() noexcept method added This decouples the logic of signaling to unblock from checking of the timeout. * Improve exception safety in kernel_launch_timer Introduce noexcept cleanup methods. Place body of start() and stop() methods in the try/catch block and execute noexcept clean-up on exception before rethrowing. * Improve exception safety of measure_hot * Make sure that throwing methods call noexcept ones instead of duplicating functionality * Use cleanup_guard in measure_cold_base::kernel_launch_timer Replace try/catch pattern with cleaner use of cleanup_guard class. * cpu_timer::start, cpu_timer::stop methods marked noexcept These methods do not throw, and marking them noexcept explicitly makes it fine to call them from other noexcept methods, such as cleanup_noexcept in measure_cold. * Address remaining exception safety issue in measure_hot * Renamed guard variables to reflect their purpose, apply arm-then-do to ops queueing kernels Set m_block_stream_armed = true; before launching the kernel. Doing so signals cleanup guard that stream must be unblocked, even if launching of the kernel failed. Same for operation launching time-stamps kernel. * Add testing/device/exception_safety.cu This test adds a benchmark that throws. It verifies that the benchmark did not time out and that the control counters it maintains are at the expected values. * Refactor measurement cleanup guards for testability Extract hot stream cleanup and cold launch timer cleanup into reusable detail helpers. 
Keep measure_hot and measure_cold using those helpers through thin adapters so the tested cleanup logic matches the production path. Add driver-free cleanup guard tests using a fake measure object to verify cleanup ordering when exceptions occur after blocking stream setup, after hot unblock, and around cold GPU frequency start/stop paths. * Implement cpu_timer_stop_noexcept in terms of cpu_timer_stop The cpu_timer_stop is already noexcept by nature of implementation, but we maintain cpu_timer_stop_noexcept method for symmetry with other pairs sync_stream()/sync_stream_noexcept(). The cpu_timer_stop_noexcept() is implemented via cpu_timer_stop(). These methods are annotated __forceinline__, so the same code should be generated. * More readable initialization of bool members * Moved exception_safety.cu back to testing/ folder testing/device is reserved for tests that require locking of GPU frequency per CMake option description. * Fixed nitpick and bug it discovered Changed testing/exception_safety.cu:237 so run_benchmark now iterates over every state from bench.get_states() and checks each one is skipped with a reason containing "requested". That exposed a real runner behavior gap, so I also made a minimal fix in nvbench/runner.cuh:120: after stop_runner_loop, remaining states are now explicitly marked skipped with a reason instead of only printing a skip notification. * Move static assertions (pertaining to cleanup guards) to testing/cleanup_guards.cu The CI failure with CTK 12.0 and certain version of GCC is caused by OOM in cudafe++ process tripped by compiling instantiation of contract verification on cold_launch_timer_probe struct. As a work-around, this instantiation is excluded for CTK 12.0-12.6	2026-05-15 15:14:30 -05:00
Oleksandr Pavlyk	d63a2761eb	Implement Timer, and support State.exec(fn, timer=True) (#364 ) * Add type annotations for future functionality ```python class Timer: def start(self) -> None: ... def stop(self) -> None: ... ``` and overloaded `State.exec` so: - normal mode accepts `Callable[[Launch], None]` - `timer=True` accepts `Callable[[Launch, Timer], None]` No implementation yet. Type annotation checked with ``` (py313) :~/repos/nvbench/python$ python -m mypy --ignore-missing-imports /tmp/check_timer.py /tmp/check_timer.py:24: error: No overload variant of "exec" of "State" matches argument types "Callable[[Launch], None]", "bool" [call-overload] /tmp/check_timer.py:24: note: Possible overload variants: /tmp/check_timer.py:24: note: def exec(self, Callable[[Launch], None], /, *, batched: bool | None = ..., sync: bool | None = ..., timer: Literal[False] = ...) -> None /tmp/check_timer.py:24: note: def exec(self, Callable[[Launch, Timer], None], /, *, timer: Literal[True], sync: bool | None = ...) -> None /tmp/check_timer.py:25: error: Argument 1 to "exec" of "State" has incompatible type "Callable[[Launch, Timer], None]"; expected "Callable[[Launch], None]" [arg-type] /tmp/check_timer.py:26: error: No overload variant of "exec" of "State" matches argument types "Callable[[Launch, int], None]", "bool" [call-overload] /tmp/check_timer.py:26: note: Possible overload variants: /tmp/check_timer.py:26: note: def exec(self, Callable[[Launch], None], /, *, batched: bool | None = ..., sync: bool | None = ..., timer: Literal[False] = ...) -> None /tmp/check_timer.py:26: note: def exec(self, Callable[[Launch, Timer], None], /, *, timer: Literal[True], sync: bool | None = ...) 
-> None Found 3 errors in 1 file (checked 1 source file) (py313) :~/repos/nvbench/python$ nl -ba /tmp/check_timer.py 1 # /tmp/check_nvbench_timer.py 2 import cuda.bench as bench 3 4 def normal_ok(launch: bench.Launch) -> None: 5 pass 6 7 def timer_ok(launch: bench.Launch, timer: bench.Timer) -> None: 8 timer.start() 9 timer.stop() 10 11 def missing_timer(launch: bench.Launch) -> None: 12 pass 13 14 def extra_timer(launch: bench.Launch, timer: bench.Timer) -> None: 15 pass 16 17 def wrong_timer_type(launch: bench.Launch, timer: int) -> None: 18 pass 19 20 def state_bench(state: bench.State) -> None: 21 state.exec(normal_ok) 22 state.exec(normal_ok, timer=False) 23 state.exec(timer_ok, timer=True) 24 state.exec(missing_timer, timer=True) # should fail 25 state.exec(extra_timer) # should fail 26 state.exec(wrong_timer_type, timer=True) # should fail ``` * Implement cuda.bench.Timer object The Timer class is not user-constructible. It exposes two nullary methods timer.start() and timer.stop(). The instance of Timer class would be provided to launchable object passed to State.exec with timer=True. * Implement support for State.exec( launch_fn, timer=True) * Change type annotation for batch to default to None None is interpreted as `not timer`, i.e., it effectively defaults to True (as before) for usage without timer set, but starts defaulting to `False` if `timer=True` is set. The batched keyword type is `bool | None`. * Implement default batched=None behavior API allows one to specify all 3 keywords, sync, batched, and timer. batched is None by default, run-time interpreted as `(not timer)`. * Update tests for new behavior of batched/timer combination * Add python/examples/exec_tag_timer.py * Expand Timer class and methods docstrings * Reworked python/example/exec_tag_timer.py to align with C++ example. * Replace ::cuda::std::name with cuda::std::name * Resolve review feedback	2026-05-15 10:19:40 -05:00
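The `batched=None` default described in this commit can be illustrated with a stand-alone helper. `resolve_batched` is a hypothetical name; it only mirrors the stated rule that `None` is interpreted at run time as `not timer`, and it is not part of `cuda.bench`.

```python
def resolve_batched(batched=None, timer=False):
    """Resolve the effective batched flag: None means 'not timer'."""
    return (not timer) if batched is None else batched
```

So a plain call keeps batched measurements on, a call with `timer=True` turns them off by default, and an explicit `batched` value is passed through unchanged.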
Oleksandr Pavlyk	44ec7de6bd	Implement decorators to register benchmarks add axis and options (#347 ) * Add decorators for registering benchmarks and adding axis cuda.bench.register(fn) continues returning Benchmark, and supports legacy use. New signature added: cuda.bench.register(): Returns a decorator ``` @bench.register() @bench.axis.float64("Duration (s)", [7e-5, 1e-4, 5e-4]) @bench.option.min_samples(120) def single_float64_axis(state: bench.State): ... ``` * Remove example/auto_throughput.py The C++ counterpart's purpose is to demonstrate use of CUPTI metrics, but these are not supported in Python bindings, so this example is a duplicate of example/throughput.py * Add wrong decorator order test for bench.axis.* * Strengthen type annotation for register function Acting on code rabbit nit-pick require that function being registered take cuda.bench.State object as an argument. Verified the fix as ``` (py313) :~/repos/nvbench/python$ python -m mypy --ignore-missing-import /tmp/t.py /tmp/t.py:8: error: Argument 1 has incompatible type "Callable[[], None]"; expected "Callable[[State], None]" [arg-type] Found 1 error in 1 file (checked 1 source file) (py313) :~/repos/nvbench/python$ nl -ba /tmp/t.py 1 # /tmp/check_nvbench_register.py 2 import cuda.bench as bench 3 4 @bench.register() 5 def good(state: bench.State) -> None: 6 pass 7 8 @bench.register() 9 def bad() -> None: 10 pass ``` * Replace use of global variable with thread-safe lru_cache This improves thread-safety of module initialization. * Abide by RUF005 linting rule * Expand docstrings regarding cuda.bench.register() decorator It explains to the user what the decorator does and provides a concise usage example. * Sharpen wording on exception maybe-thrown by decorator	2026-05-14 15:41:30 -05:00
Oleksandr Pavlyk	338936b6fe	Provide BenchmarkResult class for parsing JSON output of NVBench-instrumented benchmarks (#356 ) Implements `cuda.bench.results.BenchmarkResult` class to represent data from JSON output of benchmark execution. The class implements two class methods `BenchmarkResult.from_json(filename : str | os.PathLike, *, metadata : Any = None)` which expects well-formed JSON filename and `BenchmarkResult.empty(*, metadata : Any = None)` intended to represent failed result with reasons that can be recorded in metadata at user's discretion. The `BenchmarkResult` implements mapping interface, supporting `.keys()`, `.values()`, `.items()` methods, `__len__`, `__contains__`, `__getitem__` and `__iter__` special methods. Values in `BenchmarkResult` have type `cuda.bench.results.SubBenchmarkResult` which implements a list-like interface, i.e. implements `__len__`, `__getitem__`, and `__iter__` special methods. Values in this list-like structure correspond to measurements of individual states of a particular benchmark (the key in `BenchmarkResult`). Elements of `SubBenchmarkResult` structure have type `SubBenchmarkState` that supports mapping protocol with axis_values as a key and represents data corresponding to measurements for a particular state (combination of settings for each axis). The state provides `.samples` and `.frequencies` attributes storing raw execution duration values and estimates for average GPU frequencies. 
Example usage: ``` import array, numpy as np, cuda.bench.results r = cuda.bench.results.BenchmarkResult("perf_data/axes_run1.json") r["copy_sweep_grid_shape"].centers_with_frequencies( lambda t, f: np.median(np.asarray(t) * np.asarray(f))) ``` ``` In [1]: import array, numpy as np, cuda.bench.results In [2]: r = cuda.bench.results.BenchmarkResult("temp_data/axes_run1.json") In [3]: list(r) Out[3]: ['simple', 'single_float64_axis', 'copy_sweep_grid_shape', 'copy_type_sweep', 'copy_type_conversion_sweep', 'copy_type_and_block_size_sweep'] In [4]: r["simple"].centers(lambda t: np.percentile(t, [25,75])) Out[4]: {'Device=0': array([0.00100966, 0.00101299])} In [5]: r.centers(lambda t: np.percentile(t, [25,75]))["simple"] Out[5]: {'Device=0': array([0.00100966, 0.00101299])} In [6]: len(r) Out[6]: 6 In [7]: "fake" in r Out[7]: False ``` Each `SubBenchmarkState` implements `.summaries` attribute - rich object that retains tag/name/hint/hide/description metadata. Add nvbench-json-summary to render NVBench JSON output as an NVBench-style markdown summary table, including axis formatting, device sections, hidden summary filtering, and summary hint formatting. Update packaging, type stubs, and tests for the new namespace, renamed classes, Python 3.10-compatible annotations, and summary-table generation. * Split tests in test_benchmark_result into smaller tests * Fix break due to file name change * Add python/examples/benchmark_result_autotune.py This example demonstrates using cuda.bench and cuda.bench.results to implement simple auto-tuning, demonstrated on selecting the tile shape hyperparameter for a naive stencil kernel implemented in numba-cuda. * Resolve ruff PLE0604 * Fix for format_axis_value in json format script to handle None value Add tests to cover such input. * Address code rabbit review feedback * Fix license header, add validation * Addressed both issues raised in review Malformed values are now represented in result as None. 
Skipped benchmarks are no longer dropped, i.e., they are present in BenchmarkResult data, but they are not reflected in summary table in line with what NVBench-instrumented benchmarks do.	2026-05-13 13:23:58 -05:00
Oleksandr Pavlyk	6df6dc8d89	Enable building of NVBench on Windows (#362 ) * Enable building of NVBench on Windows, no testing * Add guard to disable nvbench-windows for now	2026-05-13 13:16:41 -04:00
Oleksandr Pavlyk	f14055d5cc	Change CMake's nvbench::main exported target to correspond to static library (#350 ) Previously, it corresponded to main.cu.o object file. Now it corresponds to static library libnvbench_main.a which is archive file with main.cu.o object in it. This closes #349	2026-05-13 13:10:44 -04:00
Oleksandr Pavlyk	9ea77bccaa	Implement CLI option to control warmups for cold measurements (#339 ) * Implement warmup-runs count, supported as CLI CLI option --warmup-runs implemented and documented. The warm-up count is enforced to always be positive. This is necessary to ensure that JIT-ting has occurred, and use of blocking kernel would not result in time-outs. A test for the option parser is added. * Ensure that measure_cold::run_warmup instantiates blocking kernel Because warm-up runs are executed without use of blocking kernel, the blocking kernel was not jitted until actual measurements were collected. The module loading cost incurred during the first run shows as elevated CPU time noise value for the first measurement as noted in https://github.com/NVIDIA/nvbench/pull/339 This PR adds `this->block_stream(); this->unblock_stream();` prior to executing warm-up loop with use of blocking kernel disabled. This ensures that blocking kernel is instantiated during the warm-up, but no other kernel is launched between its launch and stream sync, thus avoiding deadlocking. * Rename --warmup-runs to --cold-warmup-runs, add --cold-max-warmup-walltime Since configurable number of warmups only applies to measure_cold.cuh rename the CLI option to reflect that. Also add --cold-max-warmup-walltime (defaults to -1, i.e. disabled). If enabled, exits warmup loop before the requested count is reached if the wall-time expended executing warmups exceeds this max-warmup-walltime value.	2026-05-12 14:30:08 -05:00
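The interaction between a warm-up count and an optional wall-time cap can be sketched in Python as below. This is an illustration only: the function name, the `clock` parameter, checking the cap after each run, and treating any negative cap as disabled are assumptions, not the logic in measure_cold.cuh.

```python
import time


def run_warmup(launch, warmup_runs, max_walltime=-1.0, clock=time.monotonic):
    """Run up to warmup_runs warm-ups; stop early once max_walltime is exceeded.

    A negative max_walltime disables the cap. At least one warm-up always
    runs, matching the requirement that the warm-up count be positive.
    Returns the number of warm-up runs actually executed.
    """
    if warmup_runs < 1:
        raise ValueError("warmup_runs must be positive")

    start = clock()
    completed = 0
    for _ in range(warmup_runs):
        launch()
        completed += 1
        if max_walltime >= 0 and clock() - start > max_walltime:
            break  # wall-time budget spent before the requested count
    return completed
```

A monotonic clock is used so that elapsed time is not affected by system clock adjustments.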
Oleksandr Pavlyk	ebf9f9a087	Add .coderabbit.yaml following in footsteps of CCCL (#359 )	2026-05-12 13:55:46 -05:00
Oleksandr Pavlyk	7dfbcad27c	Create directories for output files (#360 ) * QOL UX, NVBench creates directories for output JSON, MD, CSV files This closes #185 and supports specifying `--json path/to/nonexistent/folder/result.json` This would create sequence of folders where to place result.json ``` (py313) :~/repos/nvbench$ rm -rf /tmp/nested/ (py313) :~/repos/nvbench$ ./build2/bin/nvbench.example.cpp20.axes -b copy_type_and_block_size_sweep -a Type=I32 -a BlockSize=64 --jsonbin /tmp/nested/json/axes.json --md /tmp/nested/md/res.md --csv /tmp/nested/csv/res.csv > /dev/null 2>&1 (py313) :~/repos/nvbench$ tree /tmp/nested/ /tmp/nested/ ├── csv │ └── res.csv ├── json │ ├── axes.json │ ├── axes.json-bin │ │ └── 0.bin │ └── axes.json-freqs-bin │ └── 0.bin └── md └── res.md 6 directories, 5 files ``` * Add a test that non-existent output folder is created * Remove throwing custom error message. Use default * Replace static_assert(false, ...) with #error	2026-05-12 10:26:28 -05:00
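The behavior described in this commit, creating missing parent folders before writing an output file, looks roughly like this when expressed in Python. NVBench implements it in C++; the helper name `open_output` is hypothetical.

```python
import tempfile
from pathlib import Path


def open_output(path, mode="w"):
    """Open path for writing, creating any missing parent directories."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target.open(mode)


# Usage: write into a nested folder that does not exist yet.
with tempfile.TemporaryDirectory() as root:
    out = Path(root) / "nested" / "json" / "axes.json"
    with open_output(out) as f:
        f.write("{}")
    created = out.is_file()
```

Passing `exist_ok=True` keeps the call idempotent when several outputs share a folder.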
Oleksandr Pavlyk	d13a0fde32	Correct cuda cccl examples per change in api (#353 )	2026-05-06 13:30:44 -05:00
Oleksandr Pavlyk	f392725015	Correct Python API signature of State.get_axis_values_as_strings (#346 ) * Correct Python API signature of State.get_axis_values_as_strings The C++ API has default boolean argument color, but Python API declared no arguments. Closes #345 * Also exercise invocation of get_axis_values_as_string with keyword argument value * Remove use of cuda.core.experimental	2026-05-04 08:40:29 -05:00
Oleksandr Pavlyk	a3364ca5c7	Port changes to the package from #323 (#337 ) Fixed relative text alignment in docstrings to fix autodoc warnings Renamed cuda.bench.test_cpp_exception and cuda.bench.test_py_exception functions to start with underscore, signaling that these functions are internal and should not be documented Account for test_cpp_exceptions -> _test_cpp_exception, same for _py_ Make sure to reset __module__ of reexported symbols to be cuda.bench	2026-04-22 08:28:15 -05:00
Oleksandr Pavlyk	b0a46f44c2	Modularize color handling (#336 ) * Introduce function colorize to modularize colorization/no-color handling * Use sns.set_theme instead of deprecated sns.set() * Use str.format instead of legacy % syntax * Simplified iteration over list Use f-string (supported since Python 3.6) instead of str.format for better readability and performance	2026-04-14 08:09:44 -05:00
pre-commit-ci[bot]	8d23e3e73c	[pre-commit.ci] pre-commit autoupdate (#333 ) * [pre-commit.ci] pre-commit autoupdate updates: - [github.com/pre-commit/mirrors-clang-format: v21.1.8 → v22.1.2](https://github.com/pre-commit/mirrors-clang-format/compare/v21.1.8...v22.1.2) - [github.com/astral-sh/ruff-pre-commit: v0.14.10 → v0.15.9](https://github.com/astral-sh/ruff-pre-commit/compare/v0.14.10...v0.15.9) - [github.com/codespell-project/codespell: v2.4.1 → v2.4.2](https://github.com/codespell-project/codespell/compare/v2.4.1...v2.4.2) * [pre-commit.ci] auto code formatting --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-13 16:24:55 +00:00
Oleksandr Pavlyk	e62c5b6f79	Correct description/hint entries for summaries with name "Noise" (#335 ) See #334	2026-04-13 11:13:37 -05:00
Nader Al Awar	373970323f	Merge pull request #331 from oleksandr-pavlyk/update-python-examples Update python examples	2026-04-02 15:20:24 -04:00
Oleksandr Pavlyk	39730efbc3	Update requirements to reflect packages used by examples	2026-04-02 10:37:17 -05:00
Oleksandr Pavlyk	9f75642387	Add patch to cutlass.base_dsl.dsl.BaseDSL to work-around a bug See https://github.com/NVIDIA/cutlass/issues/3142	2026-04-02 10:29:31 -05:00
Nader Al Awar	488173a242	Add `--no-color` flag to nvbench_compare.py which can be used for github issues and PRs python-0.2.1	2026-04-01 18:27:54 -04:00
Nader Al Awar	7a68e53df0	Rename flag from markdown to no-color	2026-04-01 17:01:29 -05:00
Nader Al Awar	7e5e784855	Add --markdown flag to nvbench_compare.py which can be used for github issues/prs	2026-04-01 14:53:13 -05:00
Oleksandr Pavlyk	93bc59d05c	Renamed CUTLASS example to reflect that it uses CuteDSL	2026-04-01 08:24:29 -05:00
Oleksandr Pavlyk	e4cfddeb87	Rewrote cutlass_gemm example to use CuteDSL	2026-04-01 08:23:41 -05:00
Oleksandr Pavlyk	3f284b4004	Renamed cccl_* examples cccl_parallel_* -> cuda_compute_* cccl_cooperative_* -> cuda_coop_*	2026-04-01 08:20:20 -05:00
Oleksandr Pavlyk	5bdb30f4b6	Update to cccl_parallel_segmented_reduce example per changes in API Update namespace changes. Use make_segmented_reduce factory function, and update call signatures.	2026-04-01 08:18:15 -05:00
Oleksandr Pavlyk	d8739fc208	Update to cccl_cooperative_block_reduce example	2026-04-01 08:17:52 -05:00
Oleksandr Pavlyk	974eb5ee0f	Replace use of cupy.cuda.ExternalStream with cupy.cuda.Stream.from_external	2026-04-01 08:17:12 -05:00
Oleksandr Pavlyk	7c60edcc0a	cuda.core.experimental -> cuda.core	2026-04-01 08:16:04 -05:00
Oleksandr Pavlyk	836a6c12f4	Merge pull request #326 from oleksandr-pavlyk/fix-sfinae-incomplete Fix GCC16 sfinae incomplete warnings. GCC16 started requiring that the type `T` used in `std::reference_wrapper<T>` is complete when using `-std=c++17`. Since NVBench has to forward declare some types in header files to break circular dependency, use of an incomplete type breaks the build because the `-Wsfinae-incomplete` warning emitted by GCC16 is promoted to an error by the `-Werror` flag. This commit replaced affected uses of `std::reference_wrapper<const nvbench::benchmark_base>` in state.cxx, and `std::reference_wrapper<nvbench::printer_base>` in benchmark_base.cxx with raw pointers.	2026-03-24 16:02:28 -05:00
Oleksandr Pavlyk	317dc6824e	Mark NVBench headers as SYSTEM for consuming targets + FIX (#330 ) * Mark NVBench headers as SYSTEM for consuming targets. Fixes #30. * As nvbench.main links to nvbench as INTERFACE only, it no longer consumes usage reqs of nvbench Because of this nvbench.main was no longer consuming dependence on CUDA::toolkit include dirs. This PR links nvbench.main to ${ctk_libraries} privately to reestablish that dependency * Implement use of pragma system_header in NVBench 1. Add code to nvbench/config.cuh.in to define NVBENCH_IMPLICIT_SYSTEM_HEADER_* preprocessor variable depending on compiler, unless NVBENCH_NO_IMPLICIT_SYSTEM_HEADER was defined. 2. Build NVBench targets with -DNVBENCH_NO_IMPLICIT_SYSTEM_HEADER 3. Modify each header file in nvbench/ folder to - include <nvbench/config.cuh> - Execute pragma <OPTIONAL_CMPLR> system_header guarded by checks for defined preprocessor variables - Do the above two steps before any other headers are included --------- Co-authored-by: Allison Piper <apiper@nvidia.com>	2026-03-23 15:10:41 -04:00
Oleksandr Pavlyk	9a91b9ef0c	Reworked cupti_profiler to use Host + Range Profiler APIs end-to-end (#327 ) * Reworked cupti_profiler to use Host + Range Profiler APIs end-to-end NVPW_* API has been deprecated since CTK 13.0. Followed advice in compilation message to replace NVPW_* API with CUPTI Profiler Host API. `libnvbench.so` no longer links to `nvperf_host` directly, only to `libcupti`. NVBench uses the CUPTI Host API to build a config image from metric names, and the Range Profiler API to collect and decode counters. The host API never collects data directly; it prepares and evaluates data produced by range profiling. Introduce `host_impl`/`profiler_init_guard` to manage CUPTI Host object and initialization/deinitialization, including safe move-assignment cleanup. `profiler_init_guard` initializes profiler, and throws if CUPTI returns an error code. `profiler_init_guard::finalize_profiler()` de-inits profiler and returns the error code. Destructor calls finalize_profiler, but ignores the status code. Users who want to explicitly de-initialize the profiler and handle the error are advised to call `finalize_profiler()` directly. The guard has a boolean member variable to allow destructor to work even if user explicitly called finalize_profiler() method. The old counter-data prefix/scratch flow was replaced with the Range Profiler counter data image sizing/initialization path and decode flow. Host API metric filtering (base metrics + context scope) and Host-side evaluation to GPU values via cuptiProfilerHostEvaluateToGpuValues is implemented. - Host object: `host_impl::object` in `nvbench/cupti_profiler.cxx`. - Range profiler object: `host_impl::range_profiler_object`. - Config image: `m_config_image`. - Counter data image: `m_data_image`. 1) Host init + config image - `initialize_profiler_host()` creates the host object. - `initialize_config_image_host()` adds metrics and builds the config image. 
2) Range profiler enable + counter data image - `enable_range_profiler()` creates the range profiler object. - `initialize_counter_data_image()` sizes and initializes the data image using the range profiler object, matching the CUPTI samples. 3) Config + collect + decode - `set_range_profiler_config()` binds the config image + data image. - `start_user_loop()` / `stop_user_loop()` push/pop the user range and start/stop the range profiler. - `process_user_loop()` decodes counter data via `cuptiRangeProfilerDecodeData()`. 4) Evaluate metrics - `get_counter_values()` calls `cuptiProfilerHostEvaluateToGpuValues()` to convert counter data into metric values. The * Use class instead of struct in profiler_init_guard; forward declaration * Add SFINAE guards before accessing members not present in earlier CTK versions * Check if cupti_profiler_host.h exists, use old/new implementation based on that check 1. Reintroduced legacy `cupti_profiler_nvpw.cuh` and `cupti_profiler_nvpw.cuh`. 2. Moved profiler-host-API implementation to `cupti_profiler_host.cuh`, `cupti_profiler_host.cxx`. 3. Add `nvbench/cupti_profiler.cuh` which checks if `cupti_profiler_host.h` header is known and includes `cupti_profiler_host.cuh` or `cupti_profiler_nvpw.cuh` respectively. 4. In cmake, we check if ${nvbench_cupti_root}/include/cupti_profiler_host.h file exists. If it does not, `libnvbench.so` would have dependency on libnvperf_host and libnvperf_target in addition to dependency on libcupti. If the header exists, it would only depend on libcupti	2026-03-23 11:51:16 -04:00
Oleksandr Pavlyk	1d823c6975	Merge pull request #328 from oleksandr-pavlyk/set-type-axes-names-in-auto-throughput-example	2026-03-20 18:44:03 -05:00
Oleksandr Pavlyk	56cdaed0af	Merge pull request #299 from NVIDIA/pre-commit-ci-update-config [pre-commit.ci] pre-commit autoupdate	2026-03-20 16:15:20 -05:00
Oleksandr Pavlyk	a6e570083d	Merge pull request #329 from oleksandr-pavlyk/fix-fmt-target-name-in-tests Link against fmt::fmt target, not fmt	2026-03-20 08:49:05 -05:00
Oleksandr Pavlyk	4c278b08b3	Link against fmt::fmt target, not fmt. Consistent with nvbench/CMakeLists.txt Co-authored-by: Dominic Charrier <docharri@amd.com>	2026-03-19 14:53:06 -05:00
Oleksandr Pavlyk	49636c70b3	Set type-axes name to ItemsPerThread to replace auto-generated T	2026-03-19 14:35:46 -05:00
Bernhard Manfred Gruber	728212f9f1	Merge pull request #315 from bernhardmgruber/plot_diff_script Extend `nvbench_compare.py` with `--plot`, axis/benchmark filtering, and dark mode	2026-02-28 01:38:27 +01:00
Bernhard Manfred Gruber	4164909c52	Feedback	2026-02-28 01:19:18 +01:00
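
The guard behaviour described in commit 9a91b9ef0c (initialize in the constructor and throw on error, an explicit finalize call that returns the status, a destructor that finalizes but ignores the status, and a boolean flag that makes a second finalize a no-op) can be sketched roughly as follows. This is an illustration only, not the NVBench code: `fake_api`, `init_guard`, `finalize()` and `demo_explicit_then_destructor()` are hypothetical names, and the stand-in API just counts live sessions instead of calling CUPTI.

```cpp
#include <cassert>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for a C-style init/deinit API (NOT CUPTI).
namespace fake_api
{
// Number of currently initialized sessions.
inline int &live_sessions()
{
  static int count = 0;
  return count;
}
inline int init()   { ++live_sessions(); return 0; } // 0 means success
inline int deinit() { --live_sessions(); return 0; } // 0 means success
} // namespace fake_api

// Hypothetical RAII guard modelled on the commit description.
class init_guard
{
public:
  init_guard()
  {
    if (fake_api::init() != 0)
    {
      throw std::runtime_error("profiler initialization failed");
    }
    m_active = true;
  }

  init_guard(const init_guard &)            = delete;
  init_guard &operator=(const init_guard &) = delete;

  // Moving transfers responsibility for de-initialization.
  init_guard(init_guard &&other) noexcept
      : m_active(std::exchange(other.m_active, false))
  {}

  init_guard &operator=(init_guard &&other) noexcept
  {
    if (this != &other)
    {
      (void)finalize(); // release what this guard currently owns
      m_active = std::exchange(other.m_active, false);
    }
    return *this;
  }

  // The destructor finalizes but deliberately drops the status code.
  ~init_guard() { (void)finalize(); }

  // Explicit de-initialization: returns the status so the caller can
  // handle it. The flag makes repeated calls harmless no-ops.
  int finalize() noexcept
  {
    if (!m_active)
    {
      return 0;
    }
    m_active = false;
    return fake_api::deinit();
  }

private:
  bool m_active{false};
};

// Explicit finalize followed by the destructor must de-initialize once.
inline int demo_explicit_then_destructor()
{
  int status = -1;
  {
    init_guard guard;
    status = guard.finalize(); // caller observes the status here
  } // destructor runs; the flag prevents a second deinit
  return status;
}
```

Clearing the flag before calling the underlying de-init routine keeps the guard idempotent even if that routine reports an error, which is one plausible reading of why the commit mentions a boolean member.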


781 Commits