nvbench

mirror of https://github.com/NVIDIA/nvbench.git synced 2026-06-29 18:57:44 +00:00

Author	SHA1	Message	Date
Oleksandr Pavlyk	d63a2761eb	Implement Timer, and support State.exec(fn, timer=True) (#364)

* Add type annotations for future functionality

```python
class Timer:
    def start(self) -> None: ...
    def stop(self) -> None: ...
```

and overloaded `State.exec` so:
- normal mode accepts `Callable[[Launch], None]`
- `timer=True` accepts `Callable[[Launch, Timer], None]`

No implementation yet.

Type annotation checked with

```
(py313) :~/repos/nvbench/python$ python -m mypy --ignore-missing-imports /tmp/check_timer.py
/tmp/check_timer.py:24: error: No overload variant of "exec" of "State" matches argument types "Callable[[Launch], None]", "bool"  [call-overload]
/tmp/check_timer.py:24: note: Possible overload variants:
/tmp/check_timer.py:24: note:     def exec(self, Callable[[Launch], None], /, *, batched: bool | None = ..., sync: bool | None = ..., timer: Literal[False] = ...) -> None
/tmp/check_timer.py:24: note:     def exec(self, Callable[[Launch, Timer], None], /, *, timer: Literal[True], sync: bool | None = ...) -> None
/tmp/check_timer.py:25: error: Argument 1 to "exec" of "State" has incompatible type "Callable[[Launch, Timer], None]"; expected "Callable[[Launch], None]"  [arg-type]
/tmp/check_timer.py:26: error: No overload variant of "exec" of "State" matches argument types "Callable[[Launch, int], None]", "bool"  [call-overload]
/tmp/check_timer.py:26: note: Possible overload variants:
/tmp/check_timer.py:26: note:     def exec(self, Callable[[Launch], None], /, *, batched: bool | None = ..., sync: bool | None = ..., timer: Literal[False] = ...) -> None
/tmp/check_timer.py:26: note:     def exec(self, Callable[[Launch, Timer], None], /, *, timer: Literal[True], sync: bool | None = ...) -> None
Found 3 errors in 1 file (checked 1 source file)
(py313) :~/repos/nvbench/python$ nl -ba /tmp/check_timer.py
     1  # /tmp/check_nvbench_timer.py
     2  import cuda.bench as bench
     3
     4  def normal_ok(launch: bench.Launch) -> None:
     5      pass
     6
     7  def timer_ok(launch: bench.Launch, timer: bench.Timer) -> None:
     8      timer.start()
     9      timer.stop()
    10
    11  def missing_timer(launch: bench.Launch) -> None:
    12      pass
    13
    14  def extra_timer(launch: bench.Launch, timer: bench.Timer) -> None:
    15      pass
    16
    17  def wrong_timer_type(launch: bench.Launch, timer: int) -> None:
    18      pass
    19
    20  def state_bench(state: bench.State) -> None:
    21      state.exec(normal_ok)
    22      state.exec(normal_ok, timer=False)
    23      state.exec(timer_ok, timer=True)
    24      state.exec(missing_timer, timer=True)  # should fail
    25      state.exec(extra_timer)  # should fail
    26      state.exec(wrong_timer_type, timer=True)  # should fail
```

* Implement cuda.bench.Timer object

The Timer class is not user-constructible. It exposes two nullary methods, timer.start() and timer.stop(). An instance of the Timer class is passed to the launchable object given to State.exec with timer=True.

* Implement support for State.exec(launch_fn, timer=True)

* Change type annotation for batched to default to None

None is interpreted as `not timer`, i.e., it effectively defaults to `True` (as before) for usage without timer set, but defaults to `False` if `timer=True` is set. The batched keyword type is `bool | None`.

* Implement default batched=None behavior

The API allows one to specify all 3 keywords: sync, batched, and timer. batched is None by default, interpreted at run time as `(not timer)`.

* Update tests for new behavior of batched/timer combination

* Add python/examples/exec_tag_timer.py

* Expand Timer class and methods docstrings

* Reworked python/examples/exec_tag_timer.py to align with C++ example.

* Replace ::cuda::std::name with cuda::std::name

* Resolve review feedback	2026-05-15 10:19:40 -05:00
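The `batched=None` rule and the two `exec` overloads described in the commit above can be sketched in plain Python. This is only an illustration of the argument handling, not the cuda.bench implementation: `Launch`, `Timer`, and `State` below are stand-in classes that touch no GPU, and `resolve_batched` and `last_batched` are hypothetical names introduced here. The overload signatures follow the mypy output quoted in the message.

```python
from typing import Callable, Literal, Optional, overload


class Launch:
    """Stand-in for cuda.bench.Launch (hypothetical, no GPU involved)."""


class Timer:
    """Stand-in for cuda.bench.Timer: it only records start/stop calls."""

    def __init__(self) -> None:
        self.events: list[str] = []

    def start(self) -> None:
        self.events.append("start")

    def stop(self) -> None:
        self.events.append("stop")


def resolve_batched(batched: Optional[bool], timer: bool) -> bool:
    """Hypothetical helper: None is interpreted as `not timer`."""
    return (not timer) if batched is None else batched


class State:
    """Stand-in for cuda.bench.State that models argument handling only."""

    def __init__(self) -> None:
        self.last_batched: Optional[bool] = None

    @overload
    def exec(self, fn: Callable[[Launch], None], /, *,
             batched: Optional[bool] = ..., sync: Optional[bool] = ...,
             timer: Literal[False] = ...) -> None: ...

    @overload
    def exec(self, fn: Callable[[Launch, Timer], None], /, *,
             timer: Literal[True], sync: Optional[bool] = ...) -> None: ...

    def exec(self, fn, /, *, batched=None, sync=None, timer=False):
        # Record the effective batched value so the default can be inspected.
        self.last_batched = resolve_batched(batched, timer)
        launch = Launch()
        if timer:
            fn(launch, Timer())  # timer mode: callable receives a Timer
        else:
            fn(launch)           # normal mode: callable receives only Launch
```

With this sketch, `State().exec(fn)` leaves `last_batched` as `True`, while `State().exec(fn, timer=True)` leaves it as `False`, matching the default described in the message.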
Oleksandr Pavlyk	44ec7de6bd	Implement decorators to register benchmarks, add axis and options (#347)

* Add decorators for registering benchmarks and adding axis

cuda.bench.register(fn) continues returning Benchmark, and supports legacy use.

New signature added: cuda.bench.register(): Returns a decorator

```
@bench.register()
@bench.axis.float64("Duration (s)", [7e-5, 1e-4, 5e-4])
@bench.option.min_samples(120)
def single_float64_axis(state: bench.State):
    ...
```

* Remove example/auto_throughput.py

The C++ counterpart's purpose is to demonstrate use of CUPTI metrics, but these are not supported in the Python bindings, so this example is a duplicate of example/throughput.py

* Add wrong decorator order test for bench.axis.*

* Strengthen type annotation for register function

Acting on a CodeRabbit nit-pick, require that the function being registered take a cuda.bench.State object as an argument.

Verified the fix as

```
(py313) :~/repos/nvbench/python$ python -m mypy --ignore-missing-import /tmp/t.py
/tmp/t.py:8: error: Argument 1 has incompatible type "Callable[[], None]"; expected "Callable[[State], None]"  [arg-type]
Found 1 error in 1 file (checked 1 source file)
(py313) :~/repos/nvbench/python$ nl -ba /tmp/t.py
     1  # /tmp/check_nvbench_register.py
     2  import cuda.bench as bench
     3
     4  @bench.register()
     5  def good(state: bench.State) -> None:
     6      pass
     7
     8  @bench.register()
     9  def bad() -> None:
    10      pass
```

* Replace use of global variable with thread-safe lru_cache

This improves thread-safety of module initialization.

* Abide by RUF005 linting rule

* Expand docstrings regarding cuda.bench.register() decorator

It explains to the user what the decorator does and provides a concise usage example.

* Sharpen wording on exception maybe-thrown by decorator	2026-05-14 15:41:30 -05:00
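One way a `register` function can serve both the legacy call `register(fn)` and the decorator form `register()` is sketched below. This is a generic pattern under stated assumptions, not the cuda.bench code: `Benchmark`, `REGISTRY`, `float64_axis`, and the `_pending` attribute are hypothetical stand-ins, and raising `TypeError` for a wrong decorator order is an assumption made for illustration (the commit only says such a case is tested).

```python
from typing import Callable, Optional

REGISTRY: list = []  # hypothetical stand-in for the library's benchmark list


class Benchmark:
    """Stand-in for the object returned by registration."""

    def __init__(self, fn: Callable) -> None:
        self.fn = fn
        self.axes: dict = {}


def register(fn: Optional[Callable] = None):
    """register(fn) returns a Benchmark; register() returns a decorator."""

    def do_register(f: Callable) -> Benchmark:
        bench = Benchmark(f)
        # Apply settings stashed on the function by decorators placed below.
        for apply in getattr(f, "_pending", []):
            apply(bench)
        REGISTRY.append(bench)
        return bench

    return do_register if fn is None else do_register(fn)


def float64_axis(name: str, values):
    """Hypothetical axis decorator; it must sit below register()."""

    def apply(bench: Benchmark) -> None:
        bench.axes[name] = list(values)

    def decorator(f):
        if isinstance(f, Benchmark):
            # Assumption: wrong order is reported with TypeError in this sketch.
            raise TypeError("axis decorators must be placed below register()")
        f._pending = [*getattr(f, "_pending", []), apply]
        return f

    return decorator


@register()
@float64_axis("Duration (s)", [7e-5, 1e-4, 5e-4])
def single_float64_axis(state) -> None:
    ...
```

Because decorators apply bottom-up, the axis decorator runs first and only annotates the function; `register()` runs last and consumes those annotations, which is why the order matters.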
Oleksandr Pavlyk	338936b6fe	Provide BenchmarkResult class for parsing JSON output of NVBench-instrumented benchmarks (#356)

Implements the `cuda.bench.results.BenchmarkResult` class to represent data from the JSON output of benchmark execution. The class implements two class methods: `BenchmarkResult.from_json(filename: str | os.PathLike, *, metadata: Any = None)`, which expects the name of a well-formed JSON file, and `BenchmarkResult.empty(*, metadata: Any = None)`, intended to represent a failed result with reasons that can be recorded in metadata at the user's discretion.

`BenchmarkResult` implements the mapping interface, supporting the `.keys()`, `.values()`, `.items()` methods and the `__len__`, `__contains__`, `__getitem__` and `__iter__` special methods.

Values in `BenchmarkResult` have type `cuda.bench.results.SubBenchmarkResult`, which implements a list-like interface, i.e. the `__len__`, `__getitem__`, and `__iter__` special methods. Values in this list-like structure correspond to measurements of individual states of a particular benchmark (the key in `BenchmarkResult`).

Elements of a `SubBenchmarkResult` have type `SubBenchmarkState`, which supports the mapping protocol with axis_values as a key and represents data corresponding to measurements for a particular state (a combination of settings for each axis). The state provides `.samples` and `.frequencies` attributes storing raw execution duration values and estimates for average GPU frequencies.

Example usage:

```
import array, numpy as np, cuda.bench.results
r = cuda.bench.results.BenchmarkResult("perf_data/axes_run1.json")
r["copy_sweep_grid_shape"].centers_with_frequencies(
    lambda t, f: np.median(np.asarray(t) * np.asarray(f)))
```

```
In [1]: import array, numpy as np, cuda.bench.results

In [2]: r = cuda.bench.results.BenchmarkResult("temp_data/axes_run1.json")

In [3]: list(r)
Out[3]:
['simple',
 'single_float64_axis',
 'copy_sweep_grid_shape',
 'copy_type_sweep',
 'copy_type_conversion_sweep',
 'copy_type_and_block_size_sweep']

In [4]: r["simple"].centers(lambda t: np.percentile(t, [25,75]))
Out[4]: {'Device=0': array([0.00100966, 0.00101299])}

In [5]: r.centers(lambda t: np.percentile(t, [25,75]))["simple"]
Out[5]: {'Device=0': array([0.00100966, 0.00101299])}

In [6]: len(r)
Out[6]: 6

In [7]: "fake" in r
Out[7]: False
```

Each `SubBenchmarkState` implements a `.summaries` attribute, a rich object that retains tag/name/hint/hide/description metadata.

Add nvbench-json-summary to render NVBench JSON output as an NVBench-style markdown summary table, including axis formatting, device sections, hidden summary filtering, and summary hint formatting.

Update packaging, type stubs, and tests for the new namespace, renamed classes, Python 3.10-compatible annotations, and summary-table generation.

* Split tests in test_benchmark_result into smaller tests

* Fix break due to file name change

* Add python/examples/benchmark_result_autotune.py

This example demonstrates using cuda.bench and cuda.bench.results to implement simple auto-tuning, shown by selecting the tile-shape hyperparameter for a naive stencil kernel implemented in numba-cuda.

* Resolve ruff PLE0604

* Fix for format_axis_value in json format script to handle None value

Add tests to cover such input.

* Address code rabbit review feedback

* Fix license header, add validation

* Addressed both issues raised in review

Malformed values are now represented in the result as None. Skipped benchmarks are no longer dropped, i.e., they are present in BenchmarkResult data, but they are not reflected in the summary table, in line with what NVBench-instrumented benchmarks do.	2026-05-13 13:23:58 -05:00
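The mapping and list-like interfaces described in the commit above can be obtained with little code by building on `collections.abc.Mapping`. The sketch below is not the `cuda.bench.results` implementation and does not parse NVBench JSON: `Result` and `SubResult` are hypothetical stand-ins fed with an in-memory dict, and each state is simplified to an `(axis_label, samples)` pair.

```python
from collections.abc import Mapping
from statistics import median


class SubResult:
    """List-like stand-in: one (axis_label, samples) entry per state."""

    def __init__(self, states) -> None:
        self._states = list(states)

    def __len__(self) -> int:
        return len(self._states)

    def __getitem__(self, index):
        return self._states[index]

    def __iter__(self):
        return iter(self._states)

    def centers(self, estimator):
        # Reduce each state's raw samples with a user-supplied estimator.
        return {label: estimator(samples) for label, samples in self._states}


class Result(Mapping):
    """Mapping stand-in: benchmark name -> SubResult.

    Mapping supplies keys(), values(), items() and __contains__ once
    __getitem__, __iter__ and __len__ are defined.
    """

    def __init__(self, data, metadata=None) -> None:
        self._data = {name: SubResult(states) for name, states in data.items()}
        self.metadata = metadata

    @classmethod
    def empty(cls, *, metadata=None):
        # Represents a failed run; the reason can be kept in metadata.
        return cls({}, metadata=metadata)

    def __getitem__(self, name):
        return self._data[name]

    def __iter__(self):
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)

    def centers(self, estimator):
        return {name: sub.centers(estimator) for name, sub in self._data.items()}


r = Result({"simple": [("Device=0", [1.0, 3.0, 2.0])]})
per_benchmark = r["simple"].centers(median)
per_result = r.centers(median)["simple"]
```

As in the session quoted in the commit message, reducing through the sub-result and reducing through the whole result give the same per-state dictionary.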
Oleksandr Pavlyk	d13a0fde32	Correct cuda cccl examples per change in api (#353 )	2026-05-06 13:30:44 -05:00
Oleksandr Pavlyk	39730efbc3	Update requirements to reflect packages used by examples	2026-04-02 10:37:17 -05:00
Oleksandr Pavlyk	9f75642387	Add patch to cutlass.base_dsl.dsl.BaseDSL to work-around a bug See https://github.com/NVIDIA/cutlass/issues/3142	2026-04-02 10:29:31 -05:00
Oleksandr Pavlyk	93bc59d05c	Renamed CUTLASS example to reflect that it uses CuteDSL	2026-04-01 08:24:29 -05:00
Oleksandr Pavlyk	e4cfddeb87	Rewrote cutlass_gemm example to use CuteDSL	2026-04-01 08:23:41 -05:00
Oleksandr Pavlyk	3f284b4004	Renamed cccl_* examples cccl_parallel_* -> cuda_compute_* cccl_cooperative_* -> cuda_coop_*	2026-04-01 08:20:20 -05:00
Oleksandr Pavlyk	5bdb30f4b6	Update to cccl_parallel_segmented_reduce example per changes in API Update namespace changes. Use make_segmented_reduce factory function, and update call signatures.	2026-04-01 08:18:15 -05:00
Oleksandr Pavlyk	d8739fc208	Update to cccl_cooperative_block_reduce example	2026-04-01 08:17:52 -05:00
Oleksandr Pavlyk	974eb5ee0f	Replace use of cupy.cuda.ExternalStream with cupy.cuda.Stream.from_external	2026-04-01 08:17:12 -05:00
Oleksandr Pavlyk	7c60edcc0a	cuda.core.experimental -> cuda.core	2026-04-01 08:16:04 -05:00
Nader Al Awar	6df5fc8c67	Remove cupti from cuda-bench dependencies	2026-02-02 15:37:13 -06:00
Jaya Venkatesh	0f997271f7	added numba-cuda to requirements Signed-off-by: Jaya Venkatesh <jjayabaskar@nvidia.com>	2025-09-16 14:54:08 -07:00
Jaya Venkatesh	bfa6a6c7c6	remove pynvjitlink references in examples Signed-off-by: Jaya Venkatesh <jjayabaskar@nvidia.com>	2025-09-08 16:00:19 -07:00
Oleksandr Pavlyk	b5e4b4ba31	cuda.nvbench -> cuda.bench Per PR review suggestion: - `cuda.parallel` - device-wide algorithms/Thrust - `cuda.cooperative` - Cooperative algorithms/CUB - `cuda.bench` - Benchmarking/NVBench	2025-08-04 13:42:43 -05:00
Oleksandr Pavlyk	584f48ac97	Remove warm-up invocations outside of launcher in examples/throughput and auto_throughput	2025-08-04 12:14:44 -05:00
Oleksandr Pavlyk	453a1648aa	Improvements to readability of examples per PR review	2025-07-31 16:20:52 -05:00
Oleksandr Pavlyk	6b9050e404	Add example of benchmarking pytorch code	2025-07-28 15:57:11 -05:00
Oleksandr Pavlyk	985db4f144	Add examples/cccl_cooperative_block_reduce.py	2025-07-28 15:37:05 -05:00
Oleksandr Pavlyk	5e8c17c740	Fix mypy error in import statement used in cutlass_gemm example	2025-07-28 15:37:05 -05:00
Oleksandr Pavlyk	5c01c34793	Fix mypy error in cutlass_gemm example	2025-07-28 15:37:05 -05:00
Oleksandr Pavlyk	a69a3647b2	CUTLASS example added, license headers added, fixes - Add license header to each example file. - Fixed broken runs caused by type declarations. - Fixed hang in throughput.py when --run-once by doing a manual warm-up step, like in auto_throughput.py	2025-07-28 15:37:05 -05:00
Oleksandr Pavlyk	fc0249d188	Updated examples/axes.py to use get_float64_or_default	2025-07-28 15:37:05 -05:00
Oleksandr Pavlyk	a535a1d173	Fix type annotations in cuda.nvbench, and in examples	2025-07-28 15:37:05 -05:00
Oleksandr Pavlyk	bd2b536ab4	cpu_only -> cpu_activity Change example to illustrate timing CPU work. The first example does only CPU work (sleeps) and uses the CPU-only timer. The second example does both CPU and GPU work (sleeps in either case) and uses the cold-run timer with/without the sync tag to measure both CPU and GPU times.	2025-07-28 15:37:05 -05:00
Oleksandr Pavlyk	d09df0f754	Expand examples/cpu_only.py Benchmark a function that sleeps for 1 second on the host using the CPU-only timer, as well as the CPU/GPU timer that does/doesn't use a blocking kernel. All three methods must report consistent values close to 1 second.	2025-07-28 15:37:05 -05:00
Oleksandr Pavlyk	e589518376	Change test and examples from using camelCase to using snake_case as implementation changed	2025-07-28 15:37:05 -05:00
Oleksandr Pavlyk	c960ef75cc	Add examples/cpu_only.py based on code from PR feedback https://github.com/NVIDIA/nvbench/pull/237#issuecomment-3058594793	2025-07-28 15:37:05 -05:00
Oleksandr Pavlyk	203ef2046e	Add warm-up call to auto_throughput.py Add throughput.py example, which is based on the same kernel as auto_throughput.py but records global memory reads/writes amounts to output BWUtil metric measuring %SOL in bandwidth utilization.	2025-07-28 15:37:04 -05:00
Oleksandr Pavlyk	8589511f61	Corrected broken cccl_parallel_segmented_reduce.py	2025-07-28 15:37:04 -05:00
Oleksandr Pavlyk	394324023f	Add example for benchmarking CuPy function	2025-07-28 15:37:04 -05:00
Oleksandr Pavlyk	707b24ffb5	Add examples/cccl_parallel_segmented_reduce.py	2025-07-28 15:37:04 -05:00
Oleksandr Pavlyk	883e5819b6	Use cuda.Stream.from_handle to create core.Stream from nvbench.CudaStream	2025-07-28 15:37:04 -05:00
Oleksandr Pavlyk	b357af0092	Add examples/skip.py	2025-07-28 15:37:04 -05:00
Oleksandr Pavlyk	964ec2e1bc	Add examples/exec_tag_sync.py	2025-07-28 15:37:04 -05:00
Oleksandr Pavlyk	4f15840832	Use state.add_summary to supplement integral TypeID with meaningful type name	2025-07-28 15:37:04 -05:00
Oleksandr Pavlyk	df426a0bad	Add examples/axes.py	2025-07-28 15:37:04 -05:00
Oleksandr Pavlyk	2507bc2263	Add Python example based on C++ example/auto_throughput.cpp	2025-07-28 15:37:04 -05:00

40 Commits