nvbench

mirror of https://github.com/NVIDIA/nvbench.git synced 2026-05-14 02:02:16 +00:00

Author	SHA1	Message	Date
Oleksandr Pavlyk	df4cec071f	Remove example/auto_throughput.py. The C++ counterpart's purpose is to demonstrate use of CUPTI metrics, but these are not supported in the Python bindings, so this example is a duplicate of example/throughput.py.	2026-05-13 13:34:48 -05:00
Oleksandr Pavlyk	63019a7e42	Add decorators for registering benchmarks and adding axes

cuda.bench.register(fn) continues to return a Benchmark and supports legacy use. New signature added: cuda.bench.register() returns a decorator.

```
@bench.register()
@bench.axis.float64("Duration (s)", [7e-5, 1e-4, 5e-4])
@bench.option.min_samples(120)
def single_float64_axis(state: bench.State):
    ...
```
	2026-05-13 13:34:48 -05:00
Oleksandr Pavlyk	338936b6fe	Provide BenchmarkResult class for parsing JSON output of NVBench-instrumented benchmarks (#356)

Implements the `cuda.bench.results.BenchmarkResult` class to represent data from the JSON output of a benchmark execution. The class implements two class methods: `BenchmarkResult.from_json(filename : str | os.PathLike, *, metadata : Any = None)`, which expects the name of a well-formed JSON file, and `BenchmarkResult.empty(*, metadata : Any = None)`, intended to represent a failed result, with reasons that can be recorded in metadata at the user's discretion.

`BenchmarkResult` implements the mapping interface, supporting the `.keys()`, `.values()`, and `.items()` methods and the `__len__`, `__contains__`, `__getitem__`, and `__iter__` special methods.

Values in `BenchmarkResult` have type `cuda.bench.results.SubBenchmarkResult`, which implements a list-like interface, i.e. the `__len__`, `__getitem__`, and `__iter__` special methods. Values in this list-like structure correspond to measurements of individual states of a particular benchmark (the key in `BenchmarkResult`).

Elements of a `SubBenchmarkResult` have type `SubBenchmarkState`, which supports the mapping protocol with axis_values as a key, and represent data corresponding to measurements for a particular state (a combination of settings for each axis). The state provides `.samples` and `.frequencies` attributes storing raw execution duration values and estimates for average GPU frequencies.

Example usage:

```
import array, numpy as np, cuda.bench.results
r = cuda.bench.results.BenchmarkResult("perf_data/axes_run1.json")
r["copy_sweep_grid_shape"].centers_with_frequencies(
    lambda t, f: np.median(np.asarray(t)*np.asarray(f)))
```

```
In [1]: import array, numpy as np, cuda.bench.results

In [2]: r = cuda.bench.results.BenchmarkResult("temp_data/axes_run1.json")

In [3]: list(r)
Out[3]:
['simple',
 'single_float64_axis',
 'copy_sweep_grid_shape',
 'copy_type_sweep',
 'copy_type_conversion_sweep',
 'copy_type_and_block_size_sweep']

In [4]: r["simple"].centers(lambda t: np.percentile(t, [25,75]))
Out[4]: {'Device=0': array([0.00100966, 0.00101299])}

In [5]: r.centers(lambda t: np.percentile(t, [25,75]))["simple"]
Out[5]: {'Device=0': array([0.00100966, 0.00101299])}

In [6]: len(r)
Out[6]: 6

In [7]: "fake" in r
Out[7]: False
```

Each `SubBenchmarkState` implements a `.summaries` attribute, a rich object that retains tag/name/hint/hide/description metadata.

Add nvbench-json-summary to render NVBench JSON output as an NVBench-style markdown summary table, including axis formatting, device sections, hidden summary filtering, and summary hint formatting.

Update packaging, type stubs, and tests for the new namespace, renamed classes, Python 3.10-compatible annotations, and summary-table generation.

* Split tests in test_benchmark_result into smaller tests
* Fix break due to file name change
* Add python/examples/benchmark_result_autotune.py
  This example demonstrates using cuda.bench and cuda.bench.results to implement simple auto-tuning, demonstrated by selecting the tile shape hyperparameter for a naive stencil kernel implemented in numba-cuda.
* Resolve ruff PLE0604
* Fix for format_axis_value in json format script to handle None value
  Add tests to cover such input.
* Address code rabbit review feedback
* Fix license header, add validation
* Addressed both issues raised in review
  Malformed values are now represented in the result as None. Skipped benchmarks are no longer dropped, i.e., they are present in BenchmarkResult data, but they are not reflected in the summary table, in line with what NVBench-instrumented benchmarks do.	2026-05-13 13:23:58 -05:00
Oleksandr Pavlyk	d13a0fde32	Correct cuda cccl examples per change in api (#353)	2026-05-06 13:30:44 -05:00
Oleksandr Pavlyk	39730efbc3	Update requirements to reflect packages used by examples	2026-04-02 10:37:17 -05:00
Oleksandr Pavlyk	9f75642387	Add patch to cutlass.base_dsl.dsl.BaseDSL to work-around a bug See https://github.com/NVIDIA/cutlass/issues/3142	2026-04-02 10:29:31 -05:00
Oleksandr Pavlyk	93bc59d05c	Renamed CUTLASS example to reflect that it uses CuteDSL	2026-04-01 08:24:29 -05:00
Oleksandr Pavlyk	e4cfddeb87	Rewrote cutlass_gemm example to use CuteDSL	2026-04-01 08:23:41 -05:00
Oleksandr Pavlyk	3f284b4004	Renamed cccl_* examples cccl_parallel_* -> cuda_compute_* cccl_cooperative_* -> cuda_coop_*	2026-04-01 08:20:20 -05:00
Oleksandr Pavlyk	5bdb30f4b6	Update to cccl_parallel_segmented_reduce example per changes in API Update namespace changes. Use make_segmented_reduce factory function, and update call signatures.	2026-04-01 08:18:15 -05:00
Oleksandr Pavlyk	d8739fc208	Update to cccl_cooperative_block_reduce example	2026-04-01 08:17:52 -05:00
Oleksandr Pavlyk	974eb5ee0f	Replace use of cupy.cuda.ExternalStream with cupy.cuda.Stream.from_external	2026-04-01 08:17:12 -05:00
Oleksandr Pavlyk	7c60edcc0a	cuda.core.experimental -> cuda.core	2026-04-01 08:16:04 -05:00
Nader Al Awar	6df5fc8c67	Remove cupti from cuda-bench dependencies	2026-02-02 15:37:13 -06:00
Jaya Venkatesh	0f997271f7	added numba-cuda to requirements Signed-off-by: Jaya Venkatesh <jjayabaskar@nvidia.com>	2025-09-16 14:54:08 -07:00
Jaya Venkatesh	bfa6a6c7c6	remove pynvjitlink references in examples Signed-off-by: Jaya Venkatesh <jjayabaskar@nvidia.com>	2025-09-08 16:00:19 -07:00
Oleksandr Pavlyk	b5e4b4ba31	cuda.nvbench -> cuda.bench Per PR review suggestion: - `cuda.parallel` - device-wide algorithms/Thrust - `cuda.cooperative` - Cooperative algorithms/CUB - `cuda.bench` - Benchmarking/NVBench	2025-08-04 13:42:43 -05:00
Oleksandr Pavlyk	584f48ac97	Remove warm-up invocations outside of launcher in examples/throughput and auto_throughput	2025-08-04 12:14:44 -05:00
Oleksandr Pavlyk	453a1648aa	Improvements to readability of examples per PR review	2025-07-31 16:20:52 -05:00
Oleksandr Pavlyk	6b9050e404	Add example of benchmarking pytorch code	2025-07-28 15:57:11 -05:00
Oleksandr Pavlyk	985db4f144	Add examples/cccl_cooperative_block_reduce.py	2025-07-28 15:37:05 -05:00
Oleksandr Pavlyk	5e8c17c740	Fix mypy error in import statement used in cutlass_gemm example	2025-07-28 15:37:05 -05:00
Oleksandr Pavlyk	5c01c34793	Fix mypy error in cutlass_gemm example	2025-07-28 15:37:05 -05:00
Oleksandr Pavlyk	a69a3647b2	CUTLASS example added, license headers added, fixes - Add license header to each example file. - Fixed broken runs caused by type declarations. - Fixed hang in throughput.py when --run-once by doing a manual warm-up step, like in auto_throughput.py	2025-07-28 15:37:05 -05:00
Oleksandr Pavlyk	fc0249d188	Updated examples/axes.py to use get_float64_or_default	2025-07-28 15:37:05 -05:00
Oleksandr Pavlyk	a535a1d173	Fix type annotations in cuda.nvbench, and in examples	2025-07-28 15:37:05 -05:00
Oleksandr Pavlyk	bd2b536ab4	cpu_only -> cpu_activity Change example to illustrate timing CPU work. The first example does only CPU work (sleeps) and uses the CPU-only timer. The second example does both CPU and GPU work (sleeps in either case) and uses the cold-run timer with/without the sync tag to measure both CPU and GPU times.	2025-07-28 15:37:05 -05:00
Oleksandr Pavlyk	d09df0f754	Expand examples/cpu_only.py Benchmark a function that sleeps for 1 second on the host using the CPU-only timer, as well as the CPU/GPU timer that does/doesn't use a blocking kernel. All three methods must report consistent values close to 1 second.	2025-07-28 15:37:05 -05:00
Oleksandr Pavlyk	e589518376	Change test and examples from using camelCase to using snake_case as implementation changed	2025-07-28 15:37:05 -05:00
Oleksandr Pavlyk	c960ef75cc	Add examples/cpu_only.py based on code from PR feedback https://github.com/NVIDIA/nvbench/pull/237#issuecomment-3058594793	2025-07-28 15:37:05 -05:00
Oleksandr Pavlyk	203ef2046e	Add warm-up call to auto_throughput.py Add throughput.py example, which is based on the same kernel as auto_throughput.py but records global memory reads/writes amounts to output BWUtil metric measuring %SOL in bandwidth utilization.	2025-07-28 15:37:04 -05:00
Oleksandr Pavlyk	8589511f61	Corrected broken cccl_parallel_segmented_reduce.py	2025-07-28 15:37:04 -05:00
Oleksandr Pavlyk	394324023f	Add example for benchmarking CuPy function	2025-07-28 15:37:04 -05:00
Oleksandr Pavlyk	707b24ffb5	Add examples/cccl_parallel_segmented_reduce.py	2025-07-28 15:37:04 -05:00
Oleksandr Pavlyk	883e5819b6	Use cuda.Stream.from_handle to create core.Stream from nvbench.CudaStream	2025-07-28 15:37:04 -05:00
Oleksandr Pavlyk	b357af0092	Add examples/skip.py	2025-07-28 15:37:04 -05:00
Oleksandr Pavlyk	964ec2e1bc	Add examples/exec_tag_sync.py	2025-07-28 15:37:04 -05:00
Oleksandr Pavlyk	4f15840832	Use state.add_summary to supplement integral TypeID with meaningful type name	2025-07-28 15:37:04 -05:00
Oleksandr Pavlyk	df426a0bad	Add examples/axes.py	2025-07-28 15:37:04 -05:00
Oleksandr Pavlyk	2507bc2263	Add Python example based on C++ example/auto_throughput.cpp	2025-07-28 15:37:04 -05:00

40 Commits