nvbench

mirror of https://github.com/NVIDIA/nvbench.git synced 2026-06-29 18:57:44 +00:00

Author	SHA1	Message	Date
Oleksandr Pavlyk	a385ee5335	Support rename of tags /ir/(absolute\|relative) to /iqr/(absolute\|relative)	2026-06-04 11:15:10 -05:00
Oleksandr Pavlyk	841bd87638	Add TOML configuration for nvbench-compare thresholds Add versioned TOML configuration support for nvbench-compare threshold settings. The new --config option reads grouped settings for clear-gap, same-result, bulk coverage, and rare-support filtering thresholds. The parser validates the schema strictly so unknown tables, unknown keys, invalid types, unsupported versions, and out-of-range values fail early. Add --dump-config to print the effective configuration without requiring input JSON files. This makes the currently selected preset and resolved threshold values discoverable and gives users a starting point for custom configuration. Preset resolution is: - default is used when neither TOML nor CLI selects a preset - [preset] name = "..." in TOML selects the base preset - --preset ... overrides the TOML preset selection - explicit threshold values in TOML override whichever base preset was selected For example: - nvbench-compare --dump-config Prints the built-in default settings as grouped TOML. - nvbench-compare --preset permissive --dump-config Prints the permissive preset values as TOML. - nvbench-compare --config compare.toml ref.json cmp.json Compares using the preset named in compare.toml, plus any explicit TOML threshold overrides. - nvbench-compare --config compare.toml --preset strict ref.json cmp.json Uses the strict preset as the base, while preserving explicit threshold overrides from compare.toml. Keep TOML parsing lazy: Python 3.11+ uses tomllib, while Python 3.10 only requires tomli when --config is used. Add focused tests for grouped config dumping, strict validation, preset/override precedence, and CLI dump behavior.	2026-06-04 09:55:58 -05:00
Oleksandr Pavlyk	4cf75dcaf5	Add nvbench_compare display modes and interval-based table views Extend nvbench_compare with multiple table display modes and richer interval formatting for timing comparisons. Highlights: - add `--display` with `intervals`, `legacy`, and `explain` modes - keep `legacy` output using scalar Diff/%Diff - make `intervals` the default, showing compact center-plus-delta timing intervals - add `explain` mode with explicit `[L \| C \| H]` interval rendering and self-describing headers - compute and store diff and relative-diff intervals in SummaryComparison - add formatting helpers for absolute and relative interval displays - make default preset slightly more permissive by lowering `bulk_same_sample_coverage` to 0.97 Add focused tests covering: - diff/%diff interval computation - compact and explicit interval formatting - default, legacy, and explain table layouts - CLI propagation of `--display` and preset selection	2026-06-04 08:49:06 -05:00
Oleksandr Pavlyk	2a515c2569	Change how the FAST/SLOW decision is arrived at. Now: - establish a candidate clear timing gap from summary timing intervals, as before - if bulk sample times and frequencies are available on both sides, compute cycles = time * frequency - derive bulk cycle intervals from min/q1/median/q3 - confirm the gap direction from those bulk cycle intervals - only fall back to summary sm_clock_rate_mean confirmation when bulk cycle data is unavailable I also split the reason codes so the evidence source is visible: - clear_gap_confirmed_by_bulk_cycles - bulk_cycle_gap_not_confirmed - clear_gap_confirmed_by_summary_cycles - summary_cycle_gap_not_confirmed Updated tests in python/test/test_nvbench_compare.py cover both the bulk-confirmed and bulk-rejected paths, along with the renamed summary reason codes.	2026-06-03 15:57:34 -05:00
Oleksandr Pavlyk	20b3bd3148	Add nvbench_compare presets and rare-support-aware bulk coverage Introduce comparison threshold presets in nvbench_compare and thread the selected preset through main() into compare_benches. Refine bulk nearest-neighbor support handling by: - adding rare-support filtering thresholds - ignoring low-count support values only when removed sample mass is small - falling back to full support for all-unique or otherwise unusable support - keeping sample-weight coverage over all values Tighten bulk mismatch reporting to show compact min(ref, cmp) coverage summaries, and add tests covering: - rare-tail filtering - strict fallback when too much support mass would be removed - all-unique support preservation - preset lookup and CLI preset propagation	2026-06-03 15:21:26 -05:00
Oleksandr Pavlyk	b791522d48	Group nvbench-compare thresholds into a config object Replace the scattered module-level comparison threshold constants with a ComparisonThresholds value object. Thread this object through compare_benches, compare_gpu_timings, and the lower-level clear-gap, summary-SAME, and bulk-SAME decision helpers. Keep existing behavior by constructing default ComparisonThresholds when callers do not provide one. This prepares nvbench-compare for future CLI-configurable decision thresholds while keeping one consistent configuration for an entire comparison run. Add test coverage that passes custom thresholds through compare_benches and verifies they affect the SAME decision.	2026-06-03 10:02:46 -05:00
Oleksandr Pavlyk	8c85393ee2	Use bulk samples to confirm same comparisons Add a bulk-data SAME path to nvbench_compare for cases where summary intervals do not provide a clear FAST/SLOW decision. The new path compares sample times and SM-clock-adjusted cycles with symmetric nearest-neighbor coverage over unique values and sample counts. The comparison now requires both sample-weight coverage and unique-support coverage to pass before declaring SAME. If bulk data is available but coverage does not pass, the result remains UNDECIDED instead of falling back to the summary-only SAME rule. Also improve undecided diagnostics by aggregating reason codes while preserving the most severe representative detail, including observed coverage values and thresholds for bulk support mismatches. Add tests for: - bulk data confirming SAME despite changed mode weights; - bulk time mismatch overriding summary-only SAME; - cycle coverage vetoing time-only agreement; - sample-weight and unique-support coverage diagnostics; - aggregation of undecided reason details.	2026-06-03 09:36:05 -05:00
Oleksandr Pavlyk	65abfbcfb2	Implement DecisionReason, tracking and summarisation - Add DecisionReason(code, message) and internal TimingDecision(status, reason). - SummaryComparison now carries reason - ComparisonStats now aggregates undecided reasons. - Final summary prints a reason breakdown only when undecided reasons exist, e.g.: - Undecided (comparison requires more evidence): 3 - Reasons: - noise_too_high: 2 (relative dispersion is too high to declare same) - weak_interval_overlap: 1 (timing intervals do not overlap strongly enough to declare same)	2026-06-03 07:52:25 -05:00
Oleksandr Pavlyk	6de54fa07a	Implement early SAME check If the SLOW/FAST check returns UNDECIDED, we attempt a conservative SAME check based on summary data alone (bulk data are not read). Reference and compare measurements are considered SAME if - both centers are positive finite values; - abs(ref - cmp) / min(ref, cmp) <= 0.5%, which is equivalent to max(ref, cmp) / min(ref, cmp) <= 1 + delta with delta = 0.005; - the interval overlap covers at least 50% of the smaller interval; - relative dispersion is finite on both sides and no more than 2%; - if SM clock summaries are available, the same check also passes in cycle space. Otherwise UNDECIDED remains the working decision, to be refined by further checks	2026-06-03 07:38:00 -05:00
Oleksandr Pavlyk	48b7f61da3	Implement clear-gap comparison for early FAST/SLOW decision Implemented the clear-gap comparison, with the log-distance-equivalent algebra and pessimistic SM-clock fallback. What changed: - Added TimingInterval and interval construction from summaries: - robust interval: [min, q3], centered at median - fallback interval: clipped [mean - stdev, mean + stdev] intersected with [min, max] - Added CLEAR_GAP_RELATIVE_THRESHOLD = 0.005. - FAST gap uses: (ref.lower - cmp.upper) / cmp.upper >= delta which is equivalent to log(ref.lower / cmp.upper) >= log(1 + delta). - SLOW gap uses: (cmp.lower - ref.upper) / ref.upper >= delta - FAST/SLOW now requires SM clock summaries on both sides and the same clear-gap result after scaling intervals by sm_clock_rate_mean. - If intervals are missing, overlap, fail the gap threshold, have missing/invalid clock summaries, or time/cycle comparison disagrees, status is UNDECIDED. - Existing center/noise values are still computed and displayed, but no longer drive FAST/SLOW/SAME classification. Updated tests to cover: - center/noise-only comparisons becoming UNDECIDED - clear FAST/SLOW with matching clock evidence - missing clock fallback to UNDECIDED - frequency-shift disagreement becoming UNDECIDED - regression reporting with robust interval and clock evidence	2026-06-03 07:13:46 -05:00
Oleksandr Pavlyk	71823e2f4f	Add q1/q3 quartiles to GPUTimeData struct The quantile values are not currently used, but are plumbed through	2026-06-03 06:35:24 -05:00
Oleksandr Pavlyk	a8704103a7	Add "nv/cold/sm_clock_rate/mean" to GPU time summary data It is intended as a cheaply retrievable metric of the average SM clock frequency over the entire sample	2026-06-02 16:21:39 -05:00
Oleksandr Pavlyk	debde4f4b2	Lazy-load nvbench-compare bulk timing data Store JSON-bin sample time and frequency metadata in GpuTimingData instead of reading the binary files during summary extraction. Add Float32BinarySource and lazy cached accessors for samples and frequencies. Use np.fromfile by default, but allow tests and alternate callers to inject a float32 reader returning any buffer-compatible object convertible to the "<f4" data type. Treat optional bulk-data failures as unavailable evidence instead of aborting comparison: unreadable files, invalid buffers, count mismatches, and mismatched sample/frequency metadata now emit RuntimeWarning and return None. Update nvbench_compare tests to verify lazy loading, cache reuse, injected reader behavior, warning-based degradation, and count mismatch handling.	2026-06-02 15:55:02 -05:00
Oleksandr Pavlyk	6d8aa878cf	Introduce UNDECIDED comparison status It is not emitted yet, but the code is now ready for when it is	2026-06-02 15:23:47 -05:00
Oleksandr Pavlyk	d4283f77a5	Refactor nvbench-compare timing comparison state Introduce GpuTimingData, SummaryComparison, ComparisonStats, and ComparisonRunData to make timing extraction, classification, and run-level state explicit. Load sample-time and SM-frequency bulk data from JSON binary output into GpuTimingData when available, preserving count validation between paired sample and frequency arrays. Move GPU timing comparison logic into compare_gpu_timings(), prefer robust median/IQR data when available, and fall back to mean/stdev summaries otherwise. Keep missing or invalid noise on the unknown path. Replace module-level comparison counters and selected-device globals with per-run data passed into compare_benches(). Update tests to validate timing classification, bulk-data loading, device pairing, filtered duplicate matching, and summary counters through the new structures.	2026-06-02 15:04:39 -05:00
Oleksandr Pavlyk	0b2dd26625	Make nvbench_compare read bulk data, if available	2026-06-02 13:38:53 -05:00
Oleksandr Pavlyk	1d13b49996	Add scoped filtering and device pairing to nvbench_compare Teach nvbench_compare to keep the order of --benchmark and --axis arguments so axis filters can apply either globally or to the most recent benchmark. Build a filter plan from the ordered CLI arguments and apply the same plan to table output and plotting labels. Add explicit --reference-devices and --compare-devices filters. The filters accept all, a single device id, or a comma-separated list of ids; ordered lists and duplicates are preserved so selected reference and compare devices can be paired by position. Device-section mismatches remain fatal for unfiltered all-vs-all comparisons, but become warnings when the user explicitly selects devices and the selected device counts match. Match duplicate benchmark states by occurrence within each filtered device section instead of matching only by state name across the whole benchmark. This keeps repeated axis values and filtered duplicate states aligned between the reference and compare inputs, and reports mismatched occurrence counts instead of silently dropping extra states. Add Python tests for duplicate-state matching, axis filtering before matching, device filter parsing and validation, explicit cross-device pairing, and benchmark-scoped axis filters. Original commit messages folded into this change: Tweaks for nvbench_compare 1. When JSON files contain multiple entries with the same name and axis values, make sure that scripts compares corresponding entries. Previous logic would extract the first entry from ref data, and would compare measurements for each state in cmp against the first entry from ref. The change introduces a counter to know which nth entry we process for a particular axis value, and retrieve corresponding entry in ref. Scope occurrence matching by device. 
Device pairing in nvbench_compare.py is strictly index-based under --ignore-devices, reused IDs in a different order no longer pair against the wrong reference device. Require devices in ref and cmp to have the same cardinality Handle mismatch when number of duplicates in ref data is not same as in cmp data Use pytest monkeypatch fixture to pretend third-party package dependencies are available during test run for nvbench_compare without introducing test-time dependency Added the happy-path test and fixed its direct-call setup by initializing the device globals that main() normally populates. Fix to filter-before-matching. - compare_benches() now pairs devices by selected position instead of taking a device id. - For each device pair, compare_benches() now builds: - ref_device_states: matching reference device and axis filters - cmp_device_states: matching compare device and axis filters - State occurrence counts and duplicate occurrence matching now operate only on those filtered per-device lists. - Removed the later matches_axis_filters() skip inside the compare-state loop because filtering now happens before matching. Added a regression test where ref/cmp have duplicate state names in opposite order, and --axis keeps only one of them. The test verifies the kept compare state is matched against the kept reference state, not the first unfiltered occurrence. Introduce device filtering in nvbench_compare - --reference-devices all\|ID\|ID,ID,... - --compare-devices all\|ID\|ID,ID,... - Integer lists preserve order and duplicates. - Requested IDs are validated against the file-level device list. - Filtered reference/compare device counts must match before comparison. - compare_benches() pairs selected reference and compare devices by position. - Each benchmark validates that requested device IDs are present in its own devices list. Implemented benchmark-scoped --axis handling. 
- --axis and --benchmark now share an ordered argparse action, so their relative CLI order is preserved. - -a before any -b becomes a global axis filter. - -a after -b <name> applies to that most recent benchmark only. - Repeated -b entries are treated as separate filter scopes and combined as alternatives for that benchmark. - Device filtering remains global and is applied independently. Allow non-matching devices for explicit device selection Now the device-section equality check remains fatal only for unfiltered all-vs-all comparisons. If either --reference-devices or --compare-devices is explicit, mismatched selected device metadata is printed as a warning, but comparison proceeds after the selected device counts have been validated. Fix for resolve_benchmark_device_ids, add comments The return value of resolve_benchmark_device_ids now always owns its list. Use monkeypatch class in set_test_devices helper Stricter device id validation Test for device id validation	2026-06-02 11:48:01 -05:00
Oleksandr Pavlyk	ca1d60610c	Use robust summaries in nvbench_compare classification Teach nvbench_compare to parse GPU timing summaries into structured values and prefer the robust median/IQR summaries when both compared measurements provide them. Fall back to the existing mean/stdev summaries when robust summaries are not available. Classify comparisons with the larger available relative noise estimate instead of the smaller one, keep unavailable noise distinct from encoded infinite noise, and report improvements separately from regressions. Keep the process exit code as success for completed comparisons; regression counts are reported in the summary instead of being used as the process status. Make plotting tolerate unavailable noise by leaving gaps in confidence bands, sort plotted series by the plotted axis, and avoid reusing pyplot state across plot calls. Add focused Python tests for robust-summary preference, unavailable-noise classification, non-finite timing centers, plot-along handling when the selected axis is absent, and the exit-code contract.	2026-06-02 11:47:47 -05:00
Oleksandr Pavlyk	338936b6fe	Provide BenchmarkResult class for parsing JSON output of NVBench-instrumented benchmarks (#356) Implements `cuda.bench.results.BenchmarkResult` class to represent data from JSON output of benchmark execution. The class implements two class methods: `BenchmarkResult.from_json(filename : str \| os.PathLike, *, metadata : Any = None)`, which expects a well-formed JSON filename, and `BenchmarkResult.empty(*, metadata : Any = None)`, intended to represent a failed result with reasons that can be recorded in metadata at the user's discretion. The `BenchmarkResult` implements the mapping interface, supporting `.keys()`, `.values()`, `.items()` methods, `__len__`, `__contains__`, `__getitem__` and `__iter__` special methods. Values in `BenchmarkResult` have type `cuda.bench.results.SubBenchmarkResult`, which implements a list-like interface, i.e. implements `__len__`, `__getitem__`, and `__iter__` special methods. Values in this list-like structure correspond to measurements of individual states of a particular benchmark (the key in `BenchmarkResult`). Elements of the `SubBenchmarkResult` structure have type `SubBenchmarkState`, which supports the mapping protocol with axis_values as a key and represents data corresponding to measurements for a particular state (combination of settings for each axis). The state provides `.samples` and `.frequencies` attributes storing raw execution duration values and estimates for average GPU frequencies. 
Example usage:
```
import array, numpy as np, cuda.bench.results
r = cuda.bench.results.BenchmarkResult("perf_data/axes_run1.json")
r["copy_sweep_grid_shape"].centers_with_frequencies(
    lambda t, f: np.median(np.asarray(t) * np.asarray(f)))
```
```
In [1]: import array, numpy as np, cuda.bench.results

In [2]: r = cuda.bench.results.BenchmarkResult("temp_data/axes_run1.json")

In [3]: list(r)
Out[3]:
['simple',
 'single_float64_axis',
 'copy_sweep_grid_shape',
 'copy_type_sweep',
 'copy_type_conversion_sweep',
 'copy_type_and_block_size_sweep']

In [4]: r["simple"].centers(lambda t: np.percentile(t, [25,75]))
Out[4]: {'Device=0': array([0.00100966, 0.00101299])}

In [5]: r.centers(lambda t: np.percentile(t, [25,75]))["simple"]
Out[5]: {'Device=0': array([0.00100966, 0.00101299])}

In [6]: len(r)
Out[6]: 6

In [7]: "fake" in r
Out[7]: False
```
Each `SubBenchmarkState` implements a `.summaries` attribute, a rich object that retains tag/name/hint/hide/description metadata. Add nvbench-json-summary to render NVBench JSON output as an NVBench-style markdown summary table, including axis formatting, device sections, hidden summary filtering, and summary hint formatting. Update packaging, type stubs, and tests for the new namespace, renamed classes, Python 3.10-compatible annotations, and summary-table generation.
* Split tests in test_benchmark_result into smaller tests
* Fix break due to file name change
* Add python/examples/benchmark_result_autotune.py This example demonstrates using cuda.bench and cuda.bench.results to implement simple auto-tuning, demonstrated by selecting the tile-shape hyperparameter for a naive stencil kernel implemented in numba-cuda.
* Resolve ruff PLE0604
* Fix for format_axis_value in json format script to handle None value Add tests to cover such input.
* Address code rabbit review feedback
* Fix license header, add validation
* Addressed both issues raised in review Malformed values are now represented in result as None. 
Skipped benchmarks are no longer dropped, i.e., they are present in BenchmarkResult data, but they are not reflected in summary table in line with what NVBench-instrumented benchmarks do.	2026-05-13 13:23:58 -05:00
Oleksandr Pavlyk	b0a46f44c2	Modularize color handling (#336) * Introduce function colorize to modularize colorization/no-color handling * Use sns.set_theme instead of deprecated sns.set() * Use str.format instead of legacy % syntax * Simplified iteration over list Use f-string (supported since Python 3.6) instead of str.format for better readability and performance	2026-04-14 08:09:44 -05:00
Nader Al Awar	7a68e53df0	Rename flag from markdown to no-color	2026-04-01 17:01:29 -05:00
Nader Al Awar	7e5e784855	Add --markdown flag to nvbench_compare.py which can be used for GitHub issues/PRs	2026-04-01 14:53:13 -05:00
Bernhard Manfred Gruber	4164909c52	Feedback	2026-02-28 01:19:18 +01:00
Bernhard Manfred Gruber	0abc8ec82b	Extend nvbench_compare.py with `--plot`, axis/benchmark filtering, and dark mode Co-authored-by: Oleksandr Pavlyk <21087696+oleksandr-pavlyk@users.noreply.github.com>	2026-02-27 11:06:20 +01:00
Bernhard Manfred Gruber	800f640c20	Apply reviewer feedback	2026-02-26 19:23:51 +01:00
Bernhard Manfred Gruber	d3a0bec4a8	Feedback from review	2026-02-05 14:13:16 +01:00
Bernhard Manfred Gruber	28ed32bb47	Implement dark mode using style sheets	2026-02-05 14:00:33 +01:00
Bernhard Manfred Gruber	ec9759037d	I have no idea what I am doing	2026-02-05 11:15:27 +01:00
Bernhard Manfred Gruber	ccde9fc4d4	More	2026-02-05 10:56:36 +01:00
Bernhard Manfred Gruber	0be190b407	Add a script to plot benchmark results	2026-02-05 10:36:52 +01:00
Bernhard Manfred Gruber	c6ef87575c	Allow partial comparison in nvbench_compare.py Fixes: #295	2026-02-03 16:32:11 +01:00
Nader Al Awar	5e7adc5c3f	Build multi architecture cuda wheels (#302) * Add cuda architectures to build wheel for * Package scripts in wheel * Separate cuda major version extraction to fix architecture selection logic * Add back statement printing cuda version * [pre-commit.ci] auto code formatting --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-01-29 01:13:24 +00:00
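
The clear-gap rule described in commit 48b7f61da3 can be illustrated with a short sketch. This is not the code in nvbench_compare: the `TimingInterval` fields, the helper names `clear_gap`, `scaled`, and `classify`, and the positivity guards are assumptions made for illustration. Only the two inequalities, the 0.005 threshold, and the requirement that time and cycle space agree come from the commit message.

```python
from dataclasses import dataclass
from typing import Optional

# Threshold value quoted in the commit message.
CLEAR_GAP_RELATIVE_THRESHOLD = 0.005


@dataclass(frozen=True)
class TimingInterval:
    """Hypothetical stand-in for the interval type named in the commit."""
    lower: float
    center: float
    upper: float


def clear_gap(ref: TimingInterval, cmp: TimingInterval,
              delta: float = CLEAR_GAP_RELATIVE_THRESHOLD) -> str:
    """Return FAST, SLOW, or UNDECIDED from the two gap inequalities."""
    # FAST: the compare interval lies entirely below the reference interval.
    if cmp.upper > 0 and (ref.lower - cmp.upper) / cmp.upper >= delta:
        return "FAST"
    # SLOW: the compare interval lies entirely above the reference interval.
    if ref.upper > 0 and (cmp.lower - ref.upper) / ref.upper >= delta:
        return "SLOW"
    return "UNDECIDED"


def scaled(interval: TimingInterval, clock_hz: float) -> TimingInterval:
    """Convert a time interval to cycles using a mean SM clock rate."""
    return TimingInterval(interval.lower * clock_hz,
                          interval.center * clock_hz,
                          interval.upper * clock_hz)


def classify(ref: TimingInterval, cmp: TimingInterval,
             ref_clock_hz: Optional[float],
             cmp_clock_hz: Optional[float]) -> str:
    """Require the same clear gap in time space and in cycle space."""
    time_gap = clear_gap(ref, cmp)
    if time_gap == "UNDECIDED" or ref_clock_hz is None or cmp_clock_hz is None:
        return "UNDECIDED"
    cycle_gap = clear_gap(scaled(ref, ref_clock_hz), scaled(cmp, cmp_clock_hz))
    return time_gap if cycle_gap == time_gap else "UNDECIDED"
```

In this sketch a missing clock summary or a frequency shift that reverses the gap direction both yield UNDECIDED, matching the fallback behaviour the commit message lists.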

32 Commits
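
The summary-only SAME check from commit 6de54fa07a can be sketched as follows. The function names, the tuple representation of intervals, and the handling of zero-width intervals are assumptions for illustration. The thresholds (0.5% center difference, 50% overlap of the smaller interval, 2% relative dispersion) are the ones quoted in the commit message. The additional cycle-space check is omitted here.

```python
import math
from typing import Tuple

Interval = Tuple[float, float]  # (lower, upper); hypothetical representation


def overlap_fraction(ref: Interval, cmp: Interval) -> float:
    """Overlap length divided by the length of the smaller interval."""
    overlap = min(ref[1], cmp[1]) - max(ref[0], cmp[0])
    smaller = min(ref[1] - ref[0], cmp[1] - cmp[0])
    if smaller <= 0:
        # Assumption: a zero-width interval counts as fully covered
        # when it touches or lies inside the other interval.
        return 1.0 if overlap >= 0 else 0.0
    return max(overlap, 0.0) / smaller


def is_same(ref_center: float, cmp_center: float,
            ref_interval: Interval, cmp_interval: Interval,
            ref_noise: float, cmp_noise: float,
            max_rel_diff: float = 0.005,
            min_overlap: float = 0.5,
            max_noise: float = 0.02) -> bool:
    """Conservative SAME decision from summary data only."""
    centers = (ref_center, cmp_center)
    # Both centers must be positive finite values.
    if not all(math.isfinite(c) and c > 0 for c in centers):
        return False
    # Relative center difference, measured against the smaller center.
    if abs(ref_center - cmp_center) / min(centers) > max_rel_diff:
        return False
    # The intervals must overlap strongly enough.
    if overlap_fraction(ref_interval, cmp_interval) < min_overlap:
        return False
    # Relative dispersion must be finite and small on both sides.
    return all(math.isfinite(n) and n <= max_noise
               for n in (ref_noise, cmp_noise))
```

A caller would treat a `False` result as UNDECIDED rather than as evidence of a difference, since the commit describes this check as a refinement applied only after the FAST/SLOW check fails to decide.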