Commit Graph

28 Commits

Author SHA1 Message Date
Oleksandr Pavlyk
65abfbcfb2 Implement DecisionReason, tracking and summarisation
- Add DecisionReason(code, message) and internal
  TimingDecision(status, reason).
- SummaryComparison now carries reason
- ComparisonStats now aggregates undecided reasons.
- Final summary prints a reason breakdown only when
  undecided reasons exist, e.g.:

  - Undecided   (comparison requires more evidence): 3
    - Reasons:
      - noise_too_high: 2 (relative dispersion is too
                           high to declare same)
      - weak_interval_overlap: 1 (timing intervals do not
                 overlap strongly enough to declare same)
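
A minimal sketch of how the reason tracking described above could be modelled. Only the names DecisionReason(code, message) and TimingDecision(status, reason) come from this commit message; the helper `summarize_undecided`, the status strings, and the exact output layout are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DecisionReason:
    code: str      # machine-readable key, e.g. "noise_too_high"
    message: str   # human-readable explanation


@dataclass(frozen=True)
class TimingDecision:
    status: str                              # e.g. "SAME" or "UNDECIDED"
    reason: Optional[DecisionReason] = None


def summarize_undecided(decisions):
    """Hypothetical helper: the reason breakdown is added only if reasons exist."""
    undecided = [d for d in decisions if d.status == "UNDECIDED"]
    counts = Counter(d.reason for d in undecided if d.reason is not None)
    lines = [f"- Undecided: {len(undecided)}"]
    if counts:
        lines.append("  - Reasons:")
        for reason, n in counts.most_common():
            lines.append(f"    - {reason.code}: {n} ({reason.message})")
    return lines
```

Frozen dataclasses are hashable, which is what lets `Counter` aggregate identical reasons directly.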
2026-06-03 07:52:25 -05:00
Oleksandr Pavlyk
6de54fa07a Implement early SAME check
If the SLOW/FAST check returned undecided, we attempt a conservative
SAME check based on summary data alone (bulk data are not read).

Reference and compare measurements are considered SAME if
   - both centers are positive finite values;
   - abs(ref - cmp) / min(ref, cmp) <= 0.5%.
     This is equivalent to max(ref, cmp) / min(ref, cmp) <= 1 + delta;
   - interval overlap must cover at least 50% of the smaller interval;
   - relative dispersion must be finite on both sides and no more than 2%;
   - if SM clock summaries are available, the same check must also pass in cycle space.

Otherwise UNDECIDED remains the working decision, to be refined by further checks.
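
The SAME criteria listed above can be sketched roughly as follows. The thresholds (0.5%, 50%, 2%) are taken from this message; the `Summary` container, its field names, and the functions `overlap_fraction` and `looks_same` are hypothetical, and the cycle-space repeat of the check is left to the caller.

```python
import math
from dataclasses import dataclass

DELTA = 0.005           # max relative difference between centers (0.5%)
MIN_OVERLAP = 0.5       # overlap must cover 50% of the smaller interval
MAX_DISPERSION = 0.02   # max relative dispersion on either side (2%)


@dataclass(frozen=True)
class Summary:           # hypothetical container; bounds are assumed finite
    center: float
    lower: float
    upper: float
    rel_dispersion: float


def overlap_fraction(a, b):
    """Overlap length divided by the length of the smaller interval."""
    overlap = min(a.upper, b.upper) - max(a.lower, b.lower)
    smaller = min(a.upper - a.lower, b.upper - b.lower)
    if not smaller > 0.0:
        return 0.0       # degenerate interval: treated as no evidence (assumption)
    return max(overlap, 0.0) / smaller


def looks_same(ref, cmp):
    centers = (ref.center, cmp.center)
    if not all(math.isfinite(c) and c > 0.0 for c in centers):
        return False
    if abs(ref.center - cmp.center) / min(centers) > DELTA:
        return False
    if overlap_fraction(ref, cmp) < MIN_OVERLAP:
        return False
    return all(
        math.isfinite(s.rel_dispersion) and s.rel_dispersion <= MAX_DISPERSION
        for s in (ref, cmp)
    )
```

When SM clock summaries are available, the same function would be applied a second time to summaries scaled into cycles, and both calls would need to return True.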
2026-06-03 07:38:00 -05:00
Oleksandr Pavlyk
48b7f61da3 Implement clear-gap comparison for early FAST/SLOW decision
Implemented the clear-gap comparison, with the log-distance-equivalent
algebra and pessimistic SM-clock fallback.

What changed:

 - Added TimingInterval and interval construction from summaries:
    - robust interval: [min, q3], centered at median
    - fallback interval: clipped [mean - stdev, mean + stdev] intersected with [min, max]
 - Added CLEAR_GAP_RELATIVE_THRESHOLD = 0.005.
 - FAST gap uses:

   (ref.lower - cmp.upper) / cmp.upper >= delta
   which is equivalent to log(ref.lower / cmp.upper) >= log(1 + delta).
 - SLOW gap uses:

   (cmp.lower - ref.upper) / ref.upper >= delta
 - FAST/SLOW now requires SM clock summaries on both sides and the same clear-gap result after scaling intervals by sm_clock_rate_mean.
 - If intervals are missing, overlap, fail the gap threshold, have missing/invalid clock summaries, or time/cycle comparison disagrees, status is UNDECIDED.
 - Existing center/noise values are still computed and displayed, but no longer drive FAST/SLOW/SAME classification.
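
As a rough illustration of the clear-gap rule and the clock cross-check described above: the names TimingInterval and CLEAR_GAP_RELATIVE_THRESHOLD and the two gap formulas come from this message, while `clear_gap`, `scale`, `classify`, and the status strings are assumptions of this sketch rather than the script's actual functions.

```python
import math
from dataclasses import dataclass

CLEAR_GAP_RELATIVE_THRESHOLD = 0.005


@dataclass(frozen=True)
class TimingInterval:
    lower: float
    upper: float


def clear_gap(ref, cmp, delta=CLEAR_GAP_RELATIVE_THRESHOLD):
    """Return "FAST", "SLOW", or "UNDECIDED" for one pair of intervals."""
    if cmp.upper > 0.0 and (ref.lower - cmp.upper) / cmp.upper >= delta:
        return "FAST"    # compare interval lies clearly below the reference
    if ref.upper > 0.0 and (cmp.lower - ref.upper) / ref.upper >= delta:
        return "SLOW"    # compare interval lies clearly above the reference
    return "UNDECIDED"   # overlap, or a gap smaller than delta


def scale(interval, clock_rate):
    """Convert a time interval to cycles using a mean clock rate."""
    return TimingInterval(interval.lower * clock_rate, interval.upper * clock_rate)


def classify(ref, cmp, ref_clock, cmp_clock):
    """Require the same clear-gap verdict in time and in cycle space."""
    clocks = (ref_clock, cmp_clock)
    if any(c is None or not math.isfinite(c) or c <= 0.0 for c in clocks):
        return "UNDECIDED"   # pessimistic fallback on missing/invalid clocks
    in_time = clear_gap(ref, cmp)
    in_cycles = clear_gap(scale(ref, ref_clock), scale(cmp, cmp_clock))
    return in_time if in_time == in_cycles else "UNDECIDED"
```

Dividing by the upper bound of the faster interval makes the test equivalent to comparing `log(ref.lower / cmp.upper)` against `log(1 + delta)`, as stated above.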

Updated tests to cover:

 - center/noise-only comparisons becoming UNDECIDED
 - clear FAST/SLOW with matching clock evidence
 - missing clock fallback to UNDECIDED
 - frequency-shift disagreement becoming UNDECIDED
 - regression reporting with robust interval and clock evidence
2026-06-03 07:13:46 -05:00
Oleksandr Pavlyk
71823e2f4f Add q1/q3 quartiles to GPUTimeData struct
The quantile values are not currently used, but are plumbed through
2026-06-03 06:35:24 -05:00
Oleksandr Pavlyk
a8704103a7 Add "nv/cold/sm_clock_rate/mean" to GPU time summary data
It is intended as a cheaply retrievable metric of the average
SM clock frequency over the entire sample
2026-06-02 16:21:39 -05:00
Oleksandr Pavlyk
debde4f4b2 Lazy-load nvbench-compare bulk timing data
Store JSON-bin sample time and frequency metadata in GpuTimingData instead of
reading the binary files during summary extraction.

Add Float32BinarySource and lazy cached accessors for samples and frequencies.
Use np.fromfile by default, but allow tests and alternate callers to inject a
float32 reader returning any buffer-compatible object convertible to the "<f4"
data type.

Treat optional bulk-data failures as unavailable evidence instead of aborting
comparison: unreadable files, invalid buffers, count mismatches, and mismatched
sample/frequency metadata now emit RuntimeWarning and return None.

Update nvbench_compare tests to verify lazy loading, cache reuse, injected
reader behavior, warning-based degradation, and count mismatch handling.
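
A dependency-free sketch of the lazy, cached, injectable loading pattern described above. The class name Float32BinarySource is from this message; its constructor arguments, the `values()` method, and the `read_float32` helper are assumptions. The stdlib `array` module stands in for np.fromfile here only to keep the sketch self-contained.

```python
import warnings
from array import array
from pathlib import Path


def read_float32(path):
    """Stand-in default reader (the commit uses np.fromfile instead).

    array("f") uses native byte order, whereas the commit targets "<f4".
    """
    values = array("f")
    values.frombytes(Path(path).read_bytes())  # ValueError on a partial item
    return values


class Float32BinarySource:
    def __init__(self, path, expected_count, reader=read_float32):
        self.path = path
        self.expected_count = expected_count
        self._reader = reader      # injectable for tests
        self._loaded = False
        self._values = None

    def values(self):
        """Read on first access only; later calls reuse the cached result."""
        if not self._loaded:
            self._loaded = True
            self._values = self._load()
        return self._values

    def _load(self):
        try:
            values = self._reader(self.path)
        except (OSError, ValueError) as exc:
            warnings.warn(f"cannot read {self.path}: {exc}", RuntimeWarning)
            return None
        if len(values) != self.expected_count:
            warnings.warn(
                f"{self.path}: expected {self.expected_count} values, "
                f"got {len(values)}",
                RuntimeWarning,
            )
            return None
        return values
```

Caching the failure as well as the success means a broken file produces one warning rather than one per access.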
2026-06-02 15:55:02 -05:00
Oleksandr Pavlyk
6d8aa878cf Introduce UNDECIDED comparison status
It is not emitted yet, but the code is now ready to handle it
once it is.
2026-06-02 15:23:47 -05:00
Oleksandr Pavlyk
d4283f77a5 Refactor nvbench-compare timing comparison state
Introduce GpuTimingData, SummaryComparison, ComparisonStats, and
ComparisonRunData to make timing extraction, classification, and run-level
state explicit.

Load sample-time and SM-frequency bulk data from JSON binary output into
GpuTimingData when available, preserving count validation between paired
sample and frequency arrays.

Move GPU timing comparison logic into compare_gpu_timings(), prefer robust
median/IQR data when available, and fall back to mean/stdev summaries otherwise.
Keep missing or invalid noise on the unknown path.

Replace module-level comparison counters and selected-device globals with
per-run data passed into compare_benches(). Update tests to validate timing
classification, bulk-data loading, device pairing, filtered duplicate matching,
and summary counters through the new structures.
2026-06-02 15:04:39 -05:00
Oleksandr Pavlyk
0b2dd26625 Make nvbench_compare read bulk data, if available 2026-06-02 13:38:53 -05:00
Oleksandr Pavlyk
1d13b49996 Add scoped filtering and device pairing to nvbench_compare
Teach nvbench_compare to keep the order of --benchmark and --axis arguments so
axis filters can apply either globally or to the most recent benchmark. Build a
filter plan from the ordered CLI arguments and apply the same plan to table
output and plotting labels.

Add explicit --reference-devices and --compare-devices filters. The filters
accept all, a single device id, or a comma-separated list of ids; ordered lists
and duplicates are preserved so selected reference and compare devices can be
paired by position. Device-section mismatches remain fatal for unfiltered
all-vs-all comparisons, but become warnings when the user explicitly selects
devices and the selected device counts match.

Match duplicate benchmark states by occurrence within each filtered device
section instead of matching only by state name across the whole benchmark. This
keeps repeated axis values and filtered duplicate states aligned between the
reference and compare inputs, and reports mismatched occurrence counts instead
of silently dropping extra states.

Add Python tests for duplicate-state matching, axis filtering before matching,
device filter parsing and validation, explicit cross-device pairing, and
benchmark-scoped axis filters.

Original commit messages folded into this change:

Tweaks for nvbench_compare

1. When JSON files contain multiple entries with the same name and axis values,
   make sure that the script compares corresponding entries.

   Previous logic would extract the first entry from ref data, and would compare
   measurements for each state in cmp against the first entry from ref. The
   change introduces a counter to know which nth entry we process for a
   particular axis value, and retrieve corresponding entry in ref.
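
The occurrence counter described above might look like the sketch below. It assumes each state is a dict with a "name" key; that shape and the function name `pair_by_occurrence` are illustrative and are not taken from the script.

```python
from collections import defaultdict


def pair_by_occurrence(ref_states, cmp_states):
    """Pair the n-th compare state of a given name with the n-th
    reference state of the same name, instead of always the first."""
    ref_by_name = defaultdict(list)
    for state in ref_states:
        ref_by_name[state["name"]].append(state)

    seen = defaultdict(int)   # how many times each name has appeared in cmp
    pairs = []
    for state in cmp_states:
        name = state["name"]
        index = seen[name]
        seen[name] += 1
        candidates = ref_by_name.get(name, [])
        if index < len(candidates):
            pairs.append((candidates[index], state))
    return pairs
```

This sketch silently skips unmatched compare states; later messages in this change describe reporting mismatched occurrence counts instead.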

Scope occurrence matching by device.

Device pairing in nvbench_compare.py is strictly index-based under
--ignore-devices, so reused IDs in a different order no longer pair against the
wrong reference device.

Require devices in ref and cmp to have the same cardinality

Handle mismatch when number of duplicates in ref data is not same as in cmp data

Use pytest monkeypatch fixture to pretend third-party package dependencies are
available during test run for nvbench_compare without introducing test-time
dependency

Added the happy-path test and fixed its direct-call setup by initializing the
device globals that main() normally populates.

Fix to filter-before-matching.

 - compare_benches() now pairs devices by selected position instead of taking a
   device id.
 - For each device pair, compare_benches() now builds:
     - ref_device_states: matching reference device and axis filters
     - cmp_device_states: matching compare device and axis filters
 - State occurrence counts and duplicate occurrence matching now operate only
   on those filtered per-device lists.
 - Removed the later matches_axis_filters() skip inside the compare-state loop
   because filtering now happens before matching.

Added a regression test where ref/cmp have duplicate state names in opposite
order, and --axis keeps only one of them. The test verifies the kept compare
state is matched against the kept reference state, not the first unfiltered
occurrence.

Introduce device filtering in nvbench_compare

 - --reference-devices all|ID|ID,ID,...
 - --compare-devices all|ID|ID,ID,...
 - Integer lists preserve order and duplicates.
 - Requested IDs are validated against the file-level device list.
 - Filtered reference/compare device counts must match before comparison.
 - compare_benches() pairs selected reference and compare devices by position.
 - Each benchmark validates that requested device IDs are present in its own
   devices list.
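
A small sketch of the `all|ID|ID,ID,...` parsing and positional pairing rules listed above. The function names and error messages are hypothetical; only the accepted syntax, the order-and-duplicate preservation, the validation against the available devices, and the equal-count requirement come from this message.

```python
def parse_device_filter(text):
    """Return None for "all", else the ids in the order given (duplicates kept)."""
    if text == "all":
        return None
    ids = []
    for part in text.split(","):
        part = part.strip()
        if not (part.isascii() and part.isdigit()):
            raise ValueError(f"invalid device id: {part!r}")
        ids.append(int(part))
    return ids


def select_devices(requested, available):
    """Validate requested ids against the device list found in the file."""
    if requested is None:
        return list(available)
    missing = [d for d in requested if d not in available]
    if missing:
        raise ValueError(f"unknown device ids: {missing}")
    return list(requested)


def pair_devices(ref_ids, cmp_ids):
    """Pair selected reference and compare devices by position."""
    if len(ref_ids) != len(cmp_ids):
        raise ValueError("reference and compare device counts differ")
    return list(zip(ref_ids, cmp_ids))
```

For example, `pair_devices(parse_device_filter("1,0"), parse_device_filter("0,1"))` pairs reference device 1 with compare device 0, and reference device 0 with compare device 1.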

Implemented benchmark-scoped --axis handling.

  - --axis and --benchmark now share an ordered argparse action, so their
    relative CLI order is preserved.
  - -a before any -b becomes a global axis filter.
  - -a after -b <name> applies to that most recent benchmark only.
  - Repeated -b entries are treated as separate filter scopes and combined as
    alternatives for that benchmark.
  - Device filtering remains global and is applied independently.
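
One way to preserve the relative order of two options with argparse is a shared custom action, sketched below. The class name `OrderedAppend`, the attribute `ordered_filters`, the function `build_filter_plan`, and the axis strings in the usage note are all assumptions; the script's real names and filter syntax may differ.

```python
import argparse


class OrderedAppend(argparse.Action):
    """Append (option name, value) to one shared list so CLI order survives."""

    def __call__(self, parser, namespace, values, option_string=None):
        ordered = getattr(namespace, "ordered_filters", None)
        if ordered is None:
            ordered = []
            namespace.ordered_filters = ordered
        ordered.append((self.dest, values))


def build_filter_plan(ordered):
    """Split ordered (kind, value) pairs into global axis filters and
    per-benchmark scopes; each -b opens a new scope."""
    global_axes, scopes = [], []
    for kind, value in ordered:
        if kind == "benchmark":
            scopes.append((value, []))
        elif scopes:
            scopes[-1][1].append(value)   # -a after -b: most recent benchmark
        else:
            global_axes.append(value)     # -a before any -b: global filter
    return global_axes, scopes


parser = argparse.ArgumentParser()
parser.add_argument("-b", "--benchmark", action=OrderedAppend)
parser.add_argument("-a", "--axis", action=OrderedAppend)
```

Parsing `-a T=F32 -b copy -a N=10 -b copy -a N=20` with this sketch yields one global axis filter and two separate scopes for `copy`, which a caller could then combine as alternatives.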

Allow non-matching devices for explicit device selection

Now the device-section equality check remains fatal only for unfiltered
all-vs-all comparisons. If either --reference-devices or --compare-devices is
explicit, mismatched selected device metadata is printed as a warning, but
comparison proceeds after the selected device counts have been validated.

Fix for resolve_benchmark_device_ids, add comments

The return value of resolve_benchmark_device_ids now always owns its list.

Use monkeypatch class in set_test_devices helper

Stricter device id validation

Test for device id validation
2026-06-02 11:48:01 -05:00
Oleksandr Pavlyk
ca1d60610c Use robust summaries in nvbench_compare classification
Teach nvbench_compare to parse GPU timing summaries into structured values and
prefer the robust median/IQR summaries when both compared measurements provide
them. Fall back to the existing mean/stdev summaries when robust summaries are
not available.

Classify comparisons with the larger available relative noise estimate instead
of the smaller one, keep unavailable noise distinct from encoded infinite noise,
and report improvements separately from regressions. Keep the process exit code
as success for completed comparisons; regression counts are reported in the
summary instead of being used as the process status.

Make plotting tolerate unavailable noise by leaving gaps in confidence bands,
sort plotted series by the plotted axis, and avoid reusing pyplot state across
plot calls.

Add focused Python tests for robust-summary preference, unavailable-noise
classification, non-finite timing centers, plot-along handling when the selected
axis is absent, and the exit-code contract.
2026-06-02 11:47:47 -05:00
Oleksandr Pavlyk
4472e7b59b Add python api for cold warmup parameters (#363) 2026-05-18 10:56:44 -05:00
Oleksandr Pavlyk
d63a2761eb Implement Timer, and support State.exec(fn, timer=True) (#364)
* Add type annotations for future functionality

```python
class Timer:
    def start(self) -> None: ...
    def stop(self) -> None: ...
```

and overloaded `State.exec` so:

  - normal mode accepts `Callable[[Launch], None]`
  - `timer=True` accepts `Callable[[Launch, Timer], None]`

No implementation yet. Type annotation checked with

```
(py313) :~/repos/nvbench/python$ python -m mypy --ignore-missing-imports /tmp/check_timer.py
/tmp/check_timer.py:24: error: No overload variant of "exec" of "State" matches argument types "Callable[[Launch], None]", "bool"  [call-overload]
/tmp/check_timer.py:24: note: Possible overload variants:
/tmp/check_timer.py:24: note:     def exec(self, Callable[[Launch], None], /, *, batched: bool | None = ..., sync: bool | None = ..., timer: Literal[False] = ...) -> None
/tmp/check_timer.py:24: note:     def exec(self, Callable[[Launch, Timer], None], /, *, timer: Literal[True], sync: bool | None = ...) -> None
/tmp/check_timer.py:25: error: Argument 1 to "exec" of "State" has incompatible type "Callable[[Launch, Timer], None]"; expected "Callable[[Launch], None]"  [arg-type]
/tmp/check_timer.py:26: error: No overload variant of "exec" of "State" matches argument types "Callable[[Launch, int], None]", "bool"  [call-overload]
/tmp/check_timer.py:26: note: Possible overload variants:
/tmp/check_timer.py:26: note:     def exec(self, Callable[[Launch], None], /, *, batched: bool | None = ..., sync: bool | None = ..., timer: Literal[False] = ...) -> None
/tmp/check_timer.py:26: note:     def exec(self, Callable[[Launch, Timer], None], /, *, timer: Literal[True], sync: bool | None = ...) -> None
Found 3 errors in 1 file (checked 1 source file)

(py313) :~/repos/nvbench/python$ nl -ba /tmp/check_timer.py
     1  # /tmp/check_nvbench_timer.py
     2  import cuda.bench as bench
     3
     4  def normal_ok(launch: bench.Launch) -> None:
     5      pass
     6
     7  def timer_ok(launch: bench.Launch, timer: bench.Timer) -> None:
     8      timer.start()
     9      timer.stop()
    10
    11  def missing_timer(launch: bench.Launch) -> None:
    12      pass
    13
    14  def extra_timer(launch: bench.Launch, timer: bench.Timer) -> None:
    15      pass
    16
    17  def wrong_timer_type(launch: bench.Launch, timer: int) -> None:
    18      pass
    19
    20  def state_bench(state: bench.State) -> None:
    21      state.exec(normal_ok)
    22      state.exec(normal_ok, timer=False)
    23      state.exec(timer_ok, timer=True)
    24      state.exec(missing_timer, timer=True)       # should fail
    25      state.exec(extra_timer)                     # should fail
    26      state.exec(wrong_timer_type, timer=True)    # should fail
```

* Implement cuda.bench.Timer object

The Timer class is not user-constructible. It exposes two nullary
methods timer.start() and timer.stop().

An instance of the Timer class is provided to the launchable object
passed to State.exec with timer=True.

* Implement support for State.exec( launch_fn, timer=True)

* Change type annotation for batched to default to None

None is interpreted as `not timer`, i.e., it effectively
defaults to True (as before) for usage without timer set,
but starts defaulting to `False` if `timer=True` is set.

The batched keyword type is `bool | None`.

* Implement default batched=None behavior

API allows one to specify all 3 keywords, sync, batched,
and timer. batched is None by default, run-time interpreted
as `(not timer)`.
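
The default resolution described above amounts to the following. The helper name `resolve_batched` is hypothetical, and the sketch does not model whether the real API accepts or rejects particular keyword combinations.

```python
from typing import Optional


def resolve_batched(batched: Optional[bool] = None, timer: bool = False) -> bool:
    """Hypothetical helper: batched=None is interpreted as `not timer`."""
    return (not timer) if batched is None else batched
```

So a call without `timer` keeps the previous default of True, while `timer=True` alone resolves to False.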

* Update tests for new behavior of batched/timer combination

* Add python/examples/exec_tag_timer.py

* Expand Timer class and methods docstrings

* Reworked python/examples/exec_tag_timer.py to align with C++ example.

* Replace ::cuda::std::name with cuda::std::name

* Resolve review feedback
2026-05-15 10:19:40 -05:00
Oleksandr Pavlyk
44ec7de6bd Implement decorators to register benchmarks add axis and options (#347)
* Add decorators for registering benchmarks and adding axis

cuda.bench.register(fn) continues returning Benchmark, and supports
legacy use.

New signature added:
   cuda.bench.register():
      Returns a decorator

```
@bench.register()
@bench.axis.float64("Duration (s)", [7e-5, 1e-4, 5e-4])
@bench.option.min_samples(120)
def single_float64_axis(state: bench.State):
   ...
```

* Remove example/auto_throughput.py

The C++ counterpart's purpose is to demonstrate use of CUPTI
metrics, but these are not supported in Python bindings, so
this example is a duplicate of example/throughput.py

* Add wrong decorator order test for bench.axis.*

* Strengthen type annotation for register function

Acting on a code rabbit nit-pick, require that the function being
registered take a cuda.bench.State object as an argument.

Verified the fix as

```
(py313) :~/repos/nvbench/python$ python -m mypy --ignore-missing-import /tmp/t.py
/tmp/t.py:8: error: Argument 1 has incompatible type "Callable[[], None]"; expected "Callable[[State], None]"  [arg-type]
Found 1 error in 1 file (checked 1 source file)
(py313) :~/repos/nvbench/python$ nl -ba /tmp/t.py
     1  # /tmp/check_nvbench_register.py
     2  import cuda.bench as bench
     3
     4  @bench.register()
     5  def good(state: bench.State) -> None:
     6      pass
     7
     8  @bench.register()
     9  def bad() -> None:
    10      pass
```

* Replace use of global variable with thread-safe lru_cache

This improves thread-safety of module initialization.
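
A generic sketch of the pattern named above, swapping a module-level global for a cached zero-argument function. The function `get_registry` and its return value are placeholders, not the module's actual state.

```python
import functools


@functools.lru_cache(maxsize=None)
def get_registry():
    """Placeholder initializer; the result is cached after the first call."""
    return {"benchmarks": []}
```

Every later call returns the same cached object. The cache itself is safe to use from multiple threads, although the Python documentation notes that the wrapped function may still run more than once if a second thread calls it before the first call has completed and been cached.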

* Abide by RUF005 linting rule

* Expand docstrings regarding cuda.bench.register() decorator

It explains to the user what the decorator does and provides
a concise usage example.

* Sharpen wording on exception maybe-thrown by decorator
2026-05-14 15:41:30 -05:00
Oleksandr Pavlyk
338936b6fe Provide BenchmarkResult class for parsing JSON output of NVBench-instrumented benchmarks (#356)
Implements `cuda.bench.results.BenchmarkResult` class to represent data from JSON output of benchmark execution.

The class provides two class methods: `BenchmarkResult.from_json(filename : str | os.PathLike, *, metadata : Any = None)`, which expects the name of a well-formed JSON file, and `BenchmarkResult.empty(*, metadata : Any = None)`, intended to represent a failed result whose reasons can be recorded in metadata at the user's discretion.

The `BenchmarkResult` implements mapping interface, supporting `.keys()`, `.values()`, `.items()` methods, `__len__`, `__contains__`, `__getitem__` and `__iter__` special methods. 

Values in `BenchmarkResult` have type `cuda.bench.results.SubBenchmarkResult`, which implements a list-like interface, i.e. the `__len__`, `__getitem__`, and `__iter__` special methods. Values in this list-like structure correspond to measurements of individual states of a particular benchmark (the key in `BenchmarkResult`).

Elements of a `SubBenchmarkResult` have type `SubBenchmarkState`, which supports the mapping protocol with axis_values as a key and represents data corresponding to measurements for a particular state (a combination of settings for each axis).

The state provides `.samples` and `.frequencies` attributes storing raw execution duration values and estimates for average GPU frequencies. 

Example usage:

```
import array, numpy as np, cuda.bench.results

r = cuda.bench.results.BenchmarkResult("perf_data/axes_run1.json")

r["copy_sweep_grid_shape"].centers_with_frequencies(
     lambda t, f: np.median(np.asarray(t)*np.asarray(f)))

```

```
In [1]: import array, numpy as np, cuda.bench.results

In [2]: r = cuda.bench.results.BenchmarkResult("temp_data/axes_run1.json")

In [3]: list(r)
Out[3]:
['simple',
 'single_float64_axis',
 'copy_sweep_grid_shape',
 'copy_type_sweep',
 'copy_type_conversion_sweep',
 'copy_type_and_block_size_sweep']

In [4]: r["simple"].centers(lambda t: np.percentile(t, [25,75]))
Out[4]: {'Device=0': array([0.00100966, 0.00101299])}

In [5]: r.centers(lambda t: np.percentile(t, [25,75]))["simple"]
Out[5]: {'Device=0': array([0.00100966, 0.00101299])}

In [6]: len(r)
Out[6]: 6

In [7]: "fake" in r
Out[7]: False
```

Each `SubBenchmarkState` provides a `.summaries` attribute, a rich object
that retains tag/name/hint/hide/description metadata.

* Add nvbench-json-summary to render NVBench JSON output as an NVBench-style
markdown summary table, including axis formatting, device sections, hidden
summary filtering, and summary hint formatting.

Update packaging, type stubs, and tests for the new namespace, renamed
classes, Python 3.10-compatible annotations, and summary-table generation.

* Split tests in test_benchmark_result into smaller tests

* Fix break due to file name change

* Add python/examples/benchmark_result_autotune.py

This example uses cuda.bench and cuda.bench.results to implement
simple auto-tuning, demonstrated by selecting the tile shape
hyperparameter for a naive stencil kernel implemented in
numba-cuda.

* Resolve ruff PLE0604

* Fix for format_axis_value in json format script to handle None value

Add tests to cover such input.

* Address code rabbit review feedback

* Fix license header, add validation

* Addressed both issues raised in review

Malformed values are now represented in result as None.

Skipped benchmarks are no longer dropped, i.e., they are present
in BenchmarkResult data, but they are not reflected in summary
table in line with what NVBench-instrumented benchmarks do.
2026-05-13 13:23:58 -05:00
Oleksandr Pavlyk
f392725015 Correct Python API signature of State.get_axis_values_as_strings (#346)
* Correct Python API signature of State.get_axis_values_as_strings

The C++ API has default boolean argument color, but Python API
declared no arguments.

Closes #345

* Also exercise invocation of get_axis_values_as_strings with keyword argument value

* Remove use of cuda.core.experimental
2026-05-04 08:40:29 -05:00
Oleksandr Pavlyk
a3364ca5c7 Port changes to the package from #323 (#337)
Fixed relative text alignment in docstrings to fix autodoc warnings

Renamed cuda.bench.test_cpp_exception and cuda.bench.test_py_exception functions
to start with underscore, signaling that these functions are internal and should
not be documented

Account for test_cpp_exception -> _test_cpp_exception, same for *_py_*

Make sure to reset __module__ of reexported symbols to be cuda.bench
2026-04-22 08:28:15 -05:00
Oleksandr Pavlyk
39c29026fd Move docstrings from PYI file to implementation
Added tests that docstrings exist and are not empty.

This closes #291
2026-02-02 11:55:48 -06:00
Nader Al Awar
fa1eed69c0 Rename test file to refer to cuda_bench 2026-01-29 13:53:29 -06:00
Oleksandr Pavlyk
b5e4b4ba31 cuda.nvbench -> cuda.bench
Per PR review suggestion:
   - `cuda.parallel`    - device-wide algorithms/Thrust
   - `cuda.cooperative` - Cooperative algorithms/CUB
   - `cuda.bench`       - Benchmarking/NVBench
2025-08-04 13:42:43 -05:00
Oleksandr Pavlyk
9dfdd8af89 Minimal test file 2025-08-04 11:59:17 -05:00
Oleksandr Pavlyk
6aff4712f8 Change permissions of test/run_1.py 2025-08-04 10:13:08 -05:00
Oleksandr Pavlyk
453a1648aa Improvements to readability of examples per PR review 2025-07-31 16:20:52 -05:00
Oleksandr Pavlyk
88a3ad0138 Add test/stub.py
The following static analysis run should run green

```
mypy --ignore-missing-imports test/stub.py
```
2025-07-30 13:54:37 -05:00
Oleksandr Pavlyk
b97e27cbf2 Add use of add_axis_values and add_axis_values_as_string to test/run_1.py 2025-07-28 15:37:05 -05:00
Oleksandr Pavlyk
526856db4e Fix typo in the method spelling 2025-07-28 15:37:05 -05:00
Oleksandr Pavlyk
e589518376 Change test and examples from using camelCase to using snake_case as implementation changed 2025-07-28 15:37:05 -05:00
Oleksandr Pavlyk
6552ef503c Draft of Python API for NVBench
The prototype is based on pybind11 to minimize boiler-plate
code needed to deal with move-only semantics of many nvbench
classes.
2025-07-28 15:37:04 -05:00