nvbench

mirror of https://github.com/NVIDIA/nvbench.git synced 2026-07-01 11:47:33 +00:00

Author	SHA1	Message	Date
Oleksandr Pavlyk	5fd21dd7fa	Load script tooling dependencies lazily Add a shared nvbench_tooling_deps helper for importing packages required by NVBench console tools. Missing tooling packages now raise a dedicated error with an install recipe instead of failing with a raw ImportError. Update script imports to work both as installed package modules and as direct source-tree scripts by using the __package__ import pattern for nvbench_json and the new tooling helper. Defer nvbench-compare dependencies to the points where they are needed: NumPy/colorama during normal comparison setup, tabulate during table rendering, jsondiff only for device mismatch reporting, and plotting packages only for plot modes. Update tests to initialize compare tooling when calling internals directly and add coverage for the tooling dependency loader. Closes #384	2026-06-30 06:40:45 -05:00
Oleksandr Pavlyk	6dae814da5	Handle legacy nvbench-compare timing summaries Derive absolute standard deviation from relative stdev and mean when nv/cold/time/gpu/stdev/absolute is absent. This lets older JSON files that only contain mean and relative stdev still construct timing intervals. Also allow mean/stdev intervals to be built without min/max summaries, using min/max only as optional clipping bounds when present. This restores SAME classification for legacy fixture data instead of treating those rows as missing-interval AMBG cases. Update nvbench_compare tests to cover derived stdev handling and the legacy mean/stdev comparison path.	2026-06-30 06:40:45 -05:00
Oleksandr Pavlyk	5e609dab78	Respect custom bulk readers in materiality checks Move Float32BinarySource material-payload detection into the source object. Default file-backed sources still use resolved file size so missing or empty sidecars remain unavailable, but positive-count sources with custom readers are treated as material and proceed through the lazy read path. Add regression coverage for virtual bulk sources whose custom reader provides data without a local sidecar file.	2026-06-30 06:40:45 -05:00
Oleksandr Pavlyk	9bb5021752	--threshold-diff is treated as % value, as documented	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	a1c78275ad	Harden nvbench-compare parsing and plot tests Reject boolean summary float payloads instead of coercing them to 1.0/0.0, while keeping numeric strings accepted for NVBench JSON compatibility. Add regression coverage for generated bulk-debug Python filenames that require escaping, and strengthen the plot-along test to assert log-log axes and confidence-band rendering.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	9ace525a2c	Test to use --bulk-debug-python stdout, not STDOUT Also add a test to check that STDOUT also works.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	a0b89f4a3b	Match benchmark-axis scoping to NVBench CLI When any --benchmark filter is present, keep comparison limited to the explicitly selected benchmarks. Leading --axis filters are still replayed onto each selected benchmark, matching native NVBench option parsing, but they no longer cause unrelated benchmarks to be compared. E.g., `-a A=2 -b bench1` now compares only bench1, and `-a A=2 -b bench1 -b bench2` applies A=2 to both selected benchmarks. Update tests for global axis filters with benchmark scopes and document the selection behavior.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	be00f12033	Include axis values in duplicate-state matching before falling back to occurrence order.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	a2a5c8c91f	Compute quantiles the same way C++ does Use rounded-rank method, rather than NumPy's quantile	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	fa1a0253df	Validate bulk binary sizes as integral metadata Reject boolean and floating-point values for int64 bulk binary sizes instead of silently converting them with int(). Keep integer strings accepted for existing NVBench JSON compatibility, and add regression coverage for valid and malformed size payloads.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	0228d46c1d	Keep robust timing comparisons per side Select robust timing inputs independently for reference and compare data: prefer robust summaries when present, otherwise recompute robust statistics from that side's bulk samples. Fall back to mean/stdev summaries only when both sides cannot provide robust timing inputs. This allows modern JSON data with robust summaries to compare against legacy JSON data that lacks robust summaries but includes bulk sample data, without mixing summary families or unnecessarily falling back to mean/stdev.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	1f7e9458cc	Do not resolve missing bulk files against CWD Introduce the has_robust_interval predicate and use it to check whether a robust interval can be constructed before attempting that, falling back to mean/stdev construction otherwise.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	1451e3e885	Wrap JSON parsing in try/except This reports a meaningful error if the JSON file is not formed as nvbench_compare expects, without replicating the JSON schema in code.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	0ecd6a4a0b	Validate type of values for expected keys in JSON file	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	dc4d6e8e37	Run the plot test with plot=True keyword to make sure it is exercised	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	816dd9e57d	Implement review feedback on test_nvbench_compare 1. Skip test when running with Py3.10 if tomli is unavailable before nvbench_compare fixture is constructed, that is at collection time rather than at execution time 2. Check that test of plotting options calls plot 3. Test actual output to verify that warning about device mismatch is absent when device selection is requested	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	3a2ef4c550	Fix nvbench-compare filter and plot validation Keep leading --axis filters global even when later --benchmark scopes are present, so commands like "-a A=2 -b bench" still compare other benchmarks matching the global axis filter. Tighten --plot-along validation for the log-log plot path by rejecting non-numeric, non-positive, and non-finite axis values with targeted errors. Add regression coverage for global axis scoping and invalid plot-axis values.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	1d787b7088	Introduce helper to read JSON files for tests	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	75fa3062ce	Reject non-numeric --plot-along axes Add explicit validation for plot-axis values so string/type axes fail with a clear CLI error instead of a raw float conversion exception. Add regression coverage for a type axis.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	a81c1adc00	Replace Pass bucket in the summary output with Unchanged, clarified description	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	1f374f7b86	Harden nvbench-compare input and noise handling Route NVBench JSON read failures and missing required root keys through the documented user-facing error path so malformed inputs return 1 instead of producing a traceback. Allow deterministic mean-based timing summaries with zero standard deviation to form zero-width intervals, while still rejecting negative or non-finite dispersion values. Reuse the same non-negative finite predicate for relative noise validation. Add regression coverage for unreadable inputs, missing root keys, and identical stable timing summaries.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	665eccc543	Reject negative values for noise	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	c2dec6cd05	Tighten nvbench-compare argument parsing Let argparse derive the program name from the actual invocation instead of hardcoding nvbench_compare, so help and error output match the installed nvbench-compare entry point. Declare comparison inputs as explicit positional arguments and use parse_args() instead of parse_known_args(). This preserves --dump-config without input files while rejecting unknown options through argparse rather than treating typoed flags as JSON paths. Add regression coverage for rejecting an unknown CLI option.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	10a5d1fcaa	Harden nvbench-compare plotting and filter docs Skip UNKNOWN rows when collecting summary plot entries so non-numeric fractional differences cannot reach the plotting path. Add a regression test that exercises compare_benches(..., plot=True) with an UNKNOWN row. Document the supported pow2 axis-filter syntax and update the CLI help example to use NAME[pow2]=EXP, matching the parser behavior for axes displayed as 2^N. * Document when status ???? (UNKNOWN) is emitted * Clarify --no-color behavior * nvbench_compare.md: clarify --no-color behavior, fix example * Document display options in nvbench_compare.md * Small mention of plotting capabilities in nvbench_compare.md * Call out that example requires shell with process substitution capabilities	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	d18706cc24	ComparisonThresholds no longer provides constructor with defaults Test file changed to use get_default_thresholds() function instead of call to constructor. This is to make sure that default preset values do not diverge from values encoded in the constructor.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	1db6d8bd38	Use legacy np.unique(..., return_counts=True) This is to support older versions of NumPy	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	4744d26d26	Keep nvbench-compare bulk debug output executable * Define nan and inf in generated --bulk-debug-python scripts so pprint output for non-finite timing values remains valid Python code. Add a regression test that executes the generated script and verifies nan/inf values round-trip. * Sharpen bulk-cycle confirmation gating. Only suppress summary-clock fallback when both reference and compare inputs provide paired, non-empty bulk sample/frequency payloads. Missing or empty bulk files are treated as unavailable evidence and still allow sm_clock_rate/mean fallback, while malformed non-empty payloads continue to produce AMBG. Add regression coverage for missing bulk files falling back to summary-cycle confirmation. These changes resolve automated review feedback	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	d58119d7c6	Harden nvbench-compare tests for diagnostics paths * Register the dynamically loaded nvbench_compare module in sys.modules before executing it so tests better match normal import behavior. * Add shared tabulate-capture helpers and select rendered comparison tables by header suffix instead of relying on the last tabulate call. This makes display tests robust to future summary or legend table output. * Add coverage for unusable bulk cycle data forcing an ambiguous result instead of falling back to summary clock confirmation. * Rename the TOML parser integration test to clarify that it exercises whichever parser is available in the environment, and document the Python 3.10 tomli skip behavior.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	df15de4b7a	Treat unusable bulk cycle data as ambiguous When bulk sample or frequency sources are present, do not silently fall back to summary SM clock confirmation if the bulk cycle data cannot be used. Report the clear-gap decision as AMBG with a bulk_cycle_data_unusable reason instead. Still allow summary-clock fallback when no bulk sample/frequency sources are present. Also update the Unknown summary label to describe the broader set of input-data failures now counted as UNKNOWN.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	b34dfbb348	Explicitly handle unavailable timings in nvbench-compare Treat matched states with unusable timing data as UNKNOWN instead of dropping them from the comparison. This includes missing, non-finite, or non-positive timing centers, skipped states, and states with missing GPU timing summaries. Add explicit reason codes for these cases so the summary points users at the underlying data issue. Preserve available timing data from the other side when only one side is missing, and render unavailable durations as n/a in all display modes. Also sort values returned by np.unique_counts before nearest-neighbor coverage checks so the distance algorithm receives ordered inputs. Add regression coverage for UNKNOWN counting, skipped states, missing summaries, unavailable center formatting, and the updated coverage helper.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	17536fd4ff	Ensure that bulk-debug-python script is enclosed in markers This permits extracting Python script using Unix CLI tools when `--bulk-debug-python stdout` is used. Added example of using this to nvbench_compare.md doc.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	78f70b097f	Replaced UNDECIDED with AMBG, use Gray color/shrug emoji	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	7a582db94e	Improve nvbench-compare interval display readability Add compact reason labels for explain-mode tables while keeping canonical reason codes in the undecided summary. Emit a one-line legend only for non-trivial abbreviations. Refine interval displays so timing values align across table rows: - align Lo/Ce/Hi values in explain mode - align center values in intervals mode when some rows lack interval bounds - avoid repeating units when center and interval deltas use the same unit Add a Change column for non-legacy displays so FAST/SLOW rows show the signed interval-bound relative change, while SAME and UNDECIDED rows remain blank. Extend nvbench_compare tests to cover reason legend filtering, interval alignment, missing-interval alignment, and Change column formatting.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	70d728cba6	Implement --bulk-debug-python option Use this option to generate Python script with information needed to load bulk data from reference/compare datasets for further drill-down into data.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	2b656a94a7	Support rename of tags /ir/(absolute\|relative) to /iqr/(absolute\|relative)	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	732d227be1	Add TOML configuration for nvbench-compare thresholds Add versioned TOML configuration support for nvbench-compare threshold settings. The new --config option reads grouped settings for clear-gap, same-result, bulk coverage, and rare-support filtering thresholds. The parser validates the schema strictly so unknown tables, unknown keys, invalid types, unsupported versions, and out-of-range values fail early. Add --dump-config to print the effective configuration without requiring input JSON files. This makes the currently selected preset and resolved threshold values discoverable and gives users a starting point for custom configuration. Preset resolution is: - default is used when neither TOML nor CLI selects a preset - [preset] name = "..." in TOML selects the base preset - --preset ... overrides the TOML preset selection - explicit threshold values in TOML override whichever base preset was selected For example: - nvbench-compare --dump-config Prints the built-in default settings as grouped TOML. - nvbench-compare --preset permissive --dump-config Prints the permissive preset values as TOML. - nvbench-compare --config compare.toml ref.json cmp.json Compares using the preset named in compare.toml, plus any explicit TOML threshold overrides. - nvbench-compare --config compare.toml --preset strict ref.json cmp.json Uses the strict preset as the base, while preserving explicit threshold overrides from compare.toml. Keep TOML parsing lazy: Python 3.11+ uses tomllib, while Python 3.10 only requires tomli when --config is used. Add focused tests for grouped config dumping, strict validation, preset/override precedence, and CLI dump behavior.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	2585842cf5	Add nvbench_compare display modes and interval-based table views Extend nvbench_compare with multiple table display modes and richer interval formatting for timing comparisons. Highlights: - add `--display` with `intervals`, `legacy`, and `explain` modes - keep `legacy` output using scalar Diff/%Diff - make `intervals` the default, showing compact center-plus-delta timing intervals - add `explain` mode with explicit `[L \| C \| H]` interval rendering and self-describing headers - compute and store diff and relative-diff intervals in SummaryComparison - add formatting helpers for absolute and relative interval displays - make default preset slightly more permissive by lowering `bulk_same_sample_coverage` to 0.97 Add focused tests covering: - diff/%diff interval computation - compact and explicit interval formatting - default, legacy, and explain table layouts - CLI propagation of `--display` and preset selection	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	cc1c40b777	Change how the FAST/SLOW decision is arrived at Now: - establish a candidate clear timing gap from summary timing intervals, as before - if bulk sample times and frequencies are available on both sides, compute cycles = time * frequency - derive bulk cycle intervals from min/q1/median/q3 - confirm the gap direction from those bulk cycle intervals - only fall back to summary sm_clock_rate_mean confirmation when bulk cycle data is unavailable I also split the reason codes so the evidence source is visible: - clear_gap_confirmed_by_bulk_cycles - bulk_cycle_gap_not_confirmed - clear_gap_confirmed_by_summary_cycles - summary_cycle_gap_not_confirmed Updated tests in python/test/test_nvbench_compare.py cover both the bulk-confirmed and bulk-rejected paths, along with the renamed summary reason codes.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	9104c58d63	Add nvbench_compare presets and rare-support-aware bulk coverage Introduce comparison threshold presets in nvbench_compare and thread the selected preset through main() into compare_benches. Refine bulk nearest-neighbor support handling by: - adding rare-support filtering thresholds - ignoring low-count support values only when removed sample mass is small - falling back to full support for all-unique or otherwise unusable support - keeping sample-weight coverage over all values Tighten bulk mismatch reporting to show compact min(ref, cmp) coverage summaries, and add tests covering: - rare-tail filtering - strict fallback when too much support mass would be removed - all-unique support preservation - preset lookup and CLI preset propagation	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	d8efe3dd9e	Group nvbench-compare thresholds into a config object Replace the scattered module-level comparison threshold constants with a ComparisonThresholds value object. Thread this object through compare_benches, compare_gpu_timings, and the lower-level clear-gap, summary-SAME, and bulk-SAME decision helpers. Keep existing behavior by constructing default ComparisonThresholds when callers do not provide one. This prepares nvbench-compare for future CLI-configurable decision thresholds while keeping one consistent configuration for an entire comparison run. Add test coverage that passes custom thresholds through compare_benches and verifies they affect the SAME decision.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	0f091438a5	Use bulk samples to confirm same comparisons Add a bulk-data SAME path to nvbench_compare for cases where summary intervals do not provide a clear FAST/SLOW decision. The new path compares sample times and SM-clock-adjusted cycles with symmetric nearest-neighbor coverage over unique values and sample counts. The comparison now requires both sample-weight coverage and unique-support coverage to pass before declaring SAME. If bulk data is available but coverage does not pass, the result remains UNDECIDED instead of falling back to the summary-only SAME rule. Also improve undecided diagnostics by aggregating reason codes while preserving the most severe representative detail, including observed coverage values and thresholds for bulk support mismatches. Add tests for: - bulk data confirming SAME despite changed mode weights; - bulk time mismatch overriding summary-only SAME; - cycle coverage vetoing time-only agreement; - sample-weight and unique-support coverage diagnostics; - aggregation of undecided reason details.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	ed98d3d950	Implement DecisionReason, tracking and summarisation - Add DecisionReason(code, message) and internal TimingDecision(status, reason). - SummaryComparison now carries reason - ComparisonStats now aggregates undecided reasons. - Final summary prints a reason breakdown only when undecided reasons exist, e.g.: - Undecided (comparison requires more evidence): 3 - Reasons: - noise_too_high: 2 (relative dispersion is too high to declare same) - weak_interval_overlap: 1 (timing intervals do not overlap strongly enough to declare same)	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	917a950e78	Implement early SAME check If the SLOW/FAST check returns undecided, attempt a conservative SAME check based on summary data alone (bulk data are not read). Reference and compare measurements are considered SAME if: - both centers are positive finite values; - abs(ref - cmp) / min(ref, cmp) <= 0.5%, which is equivalent to max(ref, cmp) / min(ref, cmp) <= 1 + delta; - the interval overlap covers at least 50% of the smaller interval; - relative dispersion is finite on both sides and no more than 2%; - if SM clock summaries are available, the same check also passes in cycle space. Otherwise UNDECIDED remains the working decision, to be refined by further checks.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	0e7b9815cf	Implement clear-gap comparison for early FAST/SLOW decision Implemented the clear-gap comparison, with the log-distance-equivalent algebra and pessimistic SM-clock fallback. What changed: - Added TimingInterval and interval construction from summaries: - robust interval: [min, q3], centered at median - fallback interval: clipped [mean - stdev, mean + stdev] intersected with [min, max] - Added CLEAR_GAP_RELATIVE_THRESHOLD = 0.005. - FAST gap uses: (ref.lower - cmp.upper) / cmp.upper >= delta which is equivalent to log(ref.lower / cmp.upper) >= log(1 + delta). - SLOW gap uses: (cmp.lower - ref.upper) / ref.upper >= delta - FAST/SLOW now requires SM clock summaries on both sides and the same clear-gap result after scaling intervals by sm_clock_rate_mean. - If intervals are missing, overlap, fail the gap threshold, have missing/invalid clock summaries, or time/cycle comparison disagrees, status is UNDECIDED. - Existing center/noise values are still computed and displayed, but no longer drive FAST/SLOW/SAME classification. Updated tests to cover: - center/noise-only comparisons becoming UNDECIDED - clear FAST/SLOW with matching clock evidence - missing clock fallback to UNDECIDED - frequency-shift disagreement becoming UNDECIDED - regression reporting with robust interval and clock evidence	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	12750221b5	Add q1/q3 quartiles to GPUTimeData struct The quantile values are not currently used, but plumbed through	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	cbe9a5b2fd	Add "nv/cold/sm_clock_rate/mean" to GPU time summary data It is intended to be a cheaply retrievable metric of the average SM clock frequency over the entire sample.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	2502d29ece	Lazy-load nvbench-compare bulk timing data Store JSON-bin sample time and frequency metadata in GpuTimingData instead of reading the binary files during summary extraction. Add Float32BinarySource and lazy cached accessors for samples and frequencies. Use np.fromfile by default, but allow tests and alternate callers to inject a float32 reader returning any buffer-compatible object convertible to the "<f4" data type. Treat optional bulk-data failures as unavailable evidence instead of aborting comparison: unreadable files, invalid buffers, count mismatches, and mismatched sample/frequency metadata now emit RuntimeWarning and return None. Update nvbench_compare tests to verify lazy loading, cache reuse, injected reader behavior, warning-based degradation, and count mismatch handling.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	a11b54101a	Introduce UNDECIDED comparison status It is not emitted just yet, but the code becomes ready for it when it starts being emitted	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	d9db53504e	Refactor nvbench-compare timing comparison state Introduce GpuTimingData, SummaryComparison, ComparisonStats, and ComparisonRunData to make timing extraction, classification, and run-level state explicit. Load sample-time and SM-frequency bulk data from JSON binary output into GpuTimingData when available, preserving count validation between paired sample and frequency arrays. Move GPU timing comparison logic into compare_gpu_timings(), prefer robust median/IQR data when available, and fall back to mean/stdev summaries otherwise. Keep missing or invalid noise on the unknown path. Replace module-level comparison counters and selected-device globals with per-run data passed into compare_benches(). Update tests to validate timing classification, bulk-data loading, device pairing, filtered duplicate matching, and summary counters through the new structures.	2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk	0baa699b64	Make nvbench_compare read bulk data, if available	2026-06-30 06:40:44 -05:00
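
The clear-gap rule described in commit 0e7b9815cf can be illustrated with a minimal Python sketch. It is not the code from the repository: the `Interval` class, the `clear_gap` and `classify` functions, and their signatures are hypothetical names chosen for this example. Only the threshold value 0.005, the two gap formulas, and the requirement that the cycle-space result agree with the time-space result are taken from the commit message. The sketch uses the status name UNDECIDED as in that commit; a later commit (78f70b097f) renames it to AMBG.

```python
import math
from dataclasses import dataclass
from typing import Optional

# Threshold value quoted in the commit message for 0e7b9815cf.
CLEAR_GAP_RELATIVE_THRESHOLD = 0.005


@dataclass(frozen=True)
class Interval:
    """Hypothetical stand-in for the TimingInterval mentioned in the log."""

    lower: float
    center: float
    upper: float

    def scaled(self, factor: float) -> "Interval":
        # Multiplying seconds by a clock rate in Hz gives cycles.
        return Interval(self.lower * factor, self.center * factor, self.upper * factor)


def clear_gap(ref: Interval, cmp: Interval,
              delta: float = CLEAR_GAP_RELATIVE_THRESHOLD) -> str:
    """Return FAST, SLOW, or UNDECIDED from two intervals alone."""
    # FAST: the compare interval lies entirely below the reference interval
    # by at least the relative margin delta.
    if cmp.upper > 0 and (ref.lower - cmp.upper) / cmp.upper >= delta:
        return "FAST"
    # SLOW: the mirror-image condition.
    if ref.upper > 0 and (cmp.lower - ref.upper) / ref.upper >= delta:
        return "SLOW"
    return "UNDECIDED"


def _usable_clock(rate: Optional[float]) -> bool:
    # A clock summary is usable only if present, finite, and positive.
    return rate is not None and math.isfinite(rate) and rate > 0


def classify(ref: Interval, cmp: Interval,
             ref_clock: Optional[float], cmp_clock: Optional[float]) -> str:
    """Require the same clear gap in time space and in cycle space."""
    time_result = clear_gap(ref, cmp)
    if time_result == "UNDECIDED":
        return time_result
    if not (_usable_clock(ref_clock) and _usable_clock(cmp_clock)):
        return "UNDECIDED"
    cycle_result = clear_gap(ref.scaled(ref_clock), cmp.scaled(cmp_clock))
    return time_result if cycle_result == time_result else "UNDECIDED"


ref = Interval(10.0, 10.2, 10.4)
cmp = Interval(9.0, 9.1, 9.2)
print(classify(ref, cmp, 1.5e9, 1.5e9))  # gap confirmed in cycle space
print(classify(ref, cmp, 1.0e9, 1.2e9))  # clock shift reverses the gap
print(classify(ref, cmp, None, 1.5e9))   # missing clock summary
```

With equal clock rates the time-space gap carries over to cycle space, so the first call reports FAST. In the second call the compare side ran at a higher clock, the cycle-space comparison points the other way, and the result stays UNDECIDED, which matches the "frequency-shift disagreement" case listed in the commit's tests.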


185 Commits