Commit Graph

185 Commits

Author SHA1 Message Date
Oleksandr Pavlyk
5fd21dd7fa Load script tooling dependencies lazily
Add a shared nvbench_tooling_deps helper for importing packages required
by NVBench console tools. Missing tooling packages now raise a dedicated
error with an install recipe instead of failing with a raw ImportError.

Update script imports to work both as installed package modules and as
direct source-tree scripts by using the __package__ import pattern for
nvbench_json and the new tooling helper.

Defer nvbench-compare dependencies to the points where they are needed:
NumPy/colorama during normal comparison setup, tabulate during table
rendering, jsondiff only for device mismatch reporting, and plotting
packages only for plot modes.

Update tests to initialize compare tooling when calling internals
directly and add coverage for the tooling dependency loader.

Closes #384
2026-06-30 06:40:45 -05:00
Oleksandr Pavlyk
6dae814da5 Handle legacy nvbench-compare timing summaries
Derive absolute standard deviation from relative stdev and mean when
nv/cold/time/gpu/stdev/absolute is absent. This lets older JSON files
that only contain mean and relative stdev still construct timing
intervals.
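
A minimal sketch of the derivation described above, assuming the relative stdev is stored as a fraction of the mean (if the JSON stores a percentage, divide by 100 first). The helper name is hypothetical and is not taken from nvbench_compare.

```python
import math


def derive_absolute_stdev(mean, relative_stdev):
    """Hypothetical helper: absolute stdev = relative stdev * mean."""
    # Non-finite inputs cannot produce a usable interval half-width.
    if not (math.isfinite(mean) and math.isfinite(relative_stdev)):
        return None
    # Assumption: timing means are positive and dispersion is non-negative.
    if mean <= 0 or relative_stdev < 0:
        return None
    return relative_stdev * mean
```

Returning None rather than raising keeps the caller free to treat the interval as unavailable.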

Also allow mean/stdev intervals to be built without min/max summaries,
using min/max only as optional clipping bounds when present. This
restores SAME classification for legacy fixture data instead of treating
those rows as missing-interval AMBG cases.

Update nvbench_compare tests to cover derived stdev handling and the
legacy mean/stdev comparison path.
2026-06-30 06:40:45 -05:00
Oleksandr Pavlyk
5e609dab78 Respect custom bulk readers in materiality checks
Move Float32BinarySource material-payload detection into the source object.
Default file-backed sources still use resolved file size so missing or empty
sidecars remain unavailable, but positive-count sources with custom readers are
treated as material and proceed through the lazy read path.

Add regression coverage for virtual bulk sources whose custom reader provides
data without a local sidecar file.
2026-06-30 06:40:45 -05:00
Oleksandr Pavlyk
9bb5021752 --threshold-diff is treated as % value, as documented 2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
a1c78275ad Harden nvbench-compare parsing and plot tests
Reject boolean summary float payloads instead of coercing them to 1.0/0.0,
while keeping numeric strings accepted for NVBench JSON compatibility.
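
The pitfall here is that Python's bool is a subclass of int, so float(True) quietly returns 1.0. A sketch of a parser that rejects booleans but still accepts numeric strings follows; the function name is hypothetical and the error types are an assumption.

```python
def parse_summary_float(value):
    """Hypothetical parser: accept numbers and numeric strings, reject bools."""
    # Check bool first, because isinstance(True, int) is also True.
    if isinstance(value, bool):
        raise TypeError(f"expected a number, got boolean {value!r}")
    if isinstance(value, (int, float, str)):
        # float() raises ValueError for strings that are not numeric.
        return float(value)
    raise TypeError(f"expected a number, got {type(value).__name__}")
```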

Add regression coverage for generated bulk-debug Python filenames that require
escaping, and strengthen the plot-along test to assert log-log axes and
confidence-band rendering.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
9ace525a2c Test to use --bulk-debug-python stdout, not STDOUT
Also add a test to check that STDOUT also works.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
a0b89f4a3b Match benchmark-axis scoping to NVBench CLI
When any --benchmark filter is present, keep comparison limited to the
explicitly selected benchmarks. Leading --axis filters are still replayed onto
each selected benchmark, matching native NVBench option parsing, but they no
longer cause unrelated benchmarks to be compared.

E.g., `-a A=2 -b bench1` now compares only bench1,
`-a A=2 -b bench1 -b bench2` applies A=2 to both selected benchmarks

Update tests for global axis filters with benchmark scopes and document the
selection behavior.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
be00f12033 Include axis values in duplicate-state matching before falling back to occurrence order. 2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
a2a5c8c91f Compute quantiles the same way C++ does
Use rounded-rank method, rather than NumPy's quantile
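
For contrast, NumPy's quantile interpolates linearly between neighbouring samples by default, whereas a rounded-rank method returns one of the observed samples. The exact rank formula used by the C++ code is not given in this log, so the rounding rule below (round half up on q * (n - 1)) is an assumption, and the function name is hypothetical.

```python
import math


def rounded_rank_quantile(samples, q):
    """Return the sample whose rank is nearest to q * (n - 1)."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    ordered = sorted(samples)
    # floor(x + 0.5) rounds half up; Python's round() rounds half to even.
    rank = math.floor(q * (len(ordered) - 1) + 0.5)
    return ordered[rank]
```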
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
fa1a0253df Validate bulk binary sizes as integral metadata
Reject boolean and floating-point values for int64 bulk binary sizes instead of
silently converting them with int(). Keep integer strings accepted for existing
NVBench JSON compatibility, and add regression coverage for valid and malformed
size payloads.
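
A sketch of the stricter size validation, under the assumption that only plain integers and base-10 integer strings are acceptable. The function name is hypothetical.

```python
def parse_bulk_size(value):
    """Hypothetical validator for an integral bulk-binary size."""
    # int(True) == 1 and int(2.9) == 2, so both types are rejected up front.
    if isinstance(value, (bool, float)):
        raise TypeError(f"size must be an integer, got {value!r}")
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        # int() raises ValueError for strings such as "1.5" or "abc".
        return int(value, 10)
    raise TypeError(f"size must be an integer, got {type(value).__name__}")
```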
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
0228d46c1d Keep robust timing comparisons per side
Select robust timing inputs independently for reference and compare data:
prefer robust summaries when present, otherwise recompute robust statistics from
that side's bulk samples. Fall back to mean/stdev summaries only when both sides
cannot provide robust timing inputs.

This allows modern JSON data with robust summaries to compare against legacy JSON
data that lacks robust summaries but includes bulk sample data, without mixing
summary families or unnecessarily falling back to mean/stdev.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
1f7e9458cc Do not resolve missing bulk files against CWD
Introduce a has_robust_interval predicate and use it to check
whether a robust interval can be constructed before attempting
construction, falling back to mean/stdev construction otherwise.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
1451e3e885 Wrap JSON parsing in try/except
This reports a meaningful error if the JSON file is not structured as
nvbench_compare expects, without replicating the JSON schema in code.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
0ecd6a4a0b Validate type of values for expected keys in JSON file 2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
dc4d6e8e37 Run the plot test with plot=True keyword to make sure it is exercised 2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
816dd9e57d Implement review feedback on test_nvbench_compare
1. Skip test when running with Py3.10 if tomli is unavailable
   before nvbench_compare fixture is constructed, that is at
   collection time rather than at execution time
2. Check that test of plotting options calls plot
3. Test actual output to verify that warning about device
   mismatch is absent when device selection is requested
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
3a2ef4c550 Fix nvbench-compare filter and plot validation
Keep leading --axis filters global even when later --benchmark scopes are
present, so commands like "-a A=2 -b bench" still compare other benchmarks
matching the global axis filter.

Tighten --plot-along validation for the log-log plot path by rejecting
non-numeric, non-positive, and non-finite axis values with targeted errors.

Add regression coverage for global axis scoping and invalid plot-axis values.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
1d787b7088 Introduce helper to read JSON files for tests 2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
75fa3062ce Reject non-numeric --plot-along axes
Add explicit validation for plot-axis values so string/type axes fail with a
clear CLI error instead of a raw float conversion exception. Add regression
coverage for a type axis.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
a81c1adc00 Replace Pass bucket in the summary output with Unchanged, clarified description 2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
1f374f7b86 Harden nvbench-compare input and noise handling
Route NVBench JSON read failures and missing required root keys through the
documented user-facing error path so malformed inputs return 1 instead of
producing a traceback.

Allow deterministic mean-based timing summaries with zero standard deviation to
form zero-width intervals, while still rejecting negative or non-finite
dispersion values. Reuse the same non-negative finite predicate for relative
noise validation.

Add regression coverage for unreadable inputs, missing root keys, and identical
stable timing summaries.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
665eccc543 Reject negative values for noise 2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
c2dec6cd05 Tighten nvbench-compare argument parsing
Let argparse derive the program name from the actual invocation instead of
hardcoding nvbench_compare, so help and error output match the installed
nvbench-compare entry point.

Declare comparison inputs as explicit positional arguments and use parse_args()
instead of parse_known_args(). This preserves --dump-config without input files
while rejecting unknown options through argparse rather than treating typoed
flags as JSON paths.
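
The difference between the two argparse calls can be sketched as follows. Only the --dump-config flag comes from the text above; the positional name and its nargs setting are assumptions made for illustration.

```python
import argparse


def build_parser():
    # Omitting prog= lets argparse derive the name from sys.argv[0].
    parser = argparse.ArgumentParser()
    parser.add_argument("--dump-config", action="store_true")
    # Assumption: inputs are optional so --dump-config works without files.
    parser.add_argument("inputs", nargs="*", help="NVBench JSON files")
    return parser


def parse_cli(argv):
    # parse_args() exits with an error on unknown options, whereas
    # parse_known_args() would return them as leftover arguments.
    return build_parser().parse_args(argv)
```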

Add regression coverage for rejecting an unknown CLI option.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
10a5d1fcaa Harden nvbench-compare plotting and filter docs
Skip UNKNOWN rows when collecting summary plot entries so non-numeric
fractional differences cannot reach the plotting path. Add a regression test
that exercises compare_benches(..., plot=True) with an UNKNOWN row.

Document the supported pow2 axis-filter syntax and update the CLI help example
to use NAME[pow2]=EXP, matching the parser behavior for axes displayed as 2^N.

* Document when status ???? (UNKNOWN) is emitted
* Clarify --no-color behavior

* nvbench_compare.md: clarify --no-color behavior, fix example

* Document display options in nvbench_compare.md

* Small mention of plotting capabilities in nvbench_compare.md
* Call out that example requires shell with process substitution capabilities
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
d18706cc24 ComparisonThresholds no longer provides constructor with defaults
Test file changed to use get_default_thresholds() function instead
of call to constructor.

This is to make sure that default preset values do not diverge from
values encoded in the constructor.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
1db6d8bd38 Use legacy np.unique(..., return_counts=True)
This is to support older versions of NumPy
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
4744d26d26 Keep nvbench-compare bulk debug output executable
* Define nan and inf in generated --bulk-debug-python scripts so pprint output
for non-finite timing values remains valid Python code. Add a regression test
that executes the generated script and verifies nan/inf values round-trip.

* Sharpen bulk-cycle confirmation gating. Only suppress summary-clock
fallback when both reference and compare inputs provide paired, non-empty bulk
sample/frequency payloads. Missing or empty bulk files are treated as
unavailable evidence and still allow sm_clock_rate/mean fallback, while
malformed non-empty payloads continue to produce AMBG.

Add regression coverage for missing bulk files falling back to summary-cycle
confirmation.

These changes resolve automated review feedback
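
The nan/inf fix in the first bullet can be illustrated with a small sketch. pprint renders non-finite floats as the bare names nan and inf, which are not Python literals, so the generated script has to define them before use. The generator below is hypothetical and much simpler than the real --bulk-debug-python output.

```python
import math
from pprint import pformat


def render_debug_script(payload):
    """Hypothetical generator: emit Python source that rebuilds `payload`."""
    lines = [
        # Without these two definitions the pformat output would raise NameError.
        'nan = float("nan")',
        'inf = float("inf")',
        "data = " + pformat(payload),
    ]
    return "\n".join(lines) + "\n"


script = render_debug_script({"times": [1.0, float("nan"), float("-inf")]})
namespace = {}
exec(script, namespace)  # the generated source executes as ordinary Python
```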
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
d58119d7c6 Harden nvbench-compare tests for diagnostics paths
* Register the dynamically loaded nvbench_compare module in sys.modules before
executing it so tests better match normal import behavior.

* Add shared tabulate-capture helpers and select rendered comparison tables by
header suffix instead of relying on the last tabulate call. This makes display
tests robust to future summary or legend table output.

* Add coverage for unusable bulk cycle data forcing an ambiguous result instead
of falling back to summary clock confirmation.

* Rename the TOML parser integration test to clarify that it exercises whichever
parser is available in the environment, and document the Python 3.10 tomli
skip behavior.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
df15de4b7a Treat unusable bulk cycle data as ambiguous
When bulk sample or frequency sources are present, do not silently fall
back to summary SM clock confirmation if the bulk cycle data cannot be
used. Report the clear-gap decision as AMBG with a
bulk_cycle_data_unusable reason instead.

Still allow summary-clock fallback when no bulk sample/frequency sources
are present.

Also update the Unknown summary label to describe the broader set of
input-data failures now counted as UNKNOWN.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
b34dfbb348 Explicitly handle unavailable timings in nvbench-compare
Treat matched states with unusable timing data as UNKNOWN instead of
dropping them from the comparison. This includes missing, non-finite, or
non-positive timing centers, skipped states, and states with missing GPU
timing summaries.

Add explicit reason codes for these cases so the summary points users at
the underlying data issue. Preserve available timing data from the other
side when only one side is missing, and render unavailable durations as
n/a in all display modes.

Also sort values returned by np.unique_counts before nearest-neighbor
coverage checks so the distance algorithm receives ordered inputs.

Add regression coverage for UNKNOWN counting, skipped states, missing
summaries, unavailable center formatting, and the updated coverage helper.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
17536fd4ff Ensure that bulk-debug-python script is enclosed in markers
This permits extracting Python script using Unix CLI tools
when `--bulk-debug-python stdout` is used.

Added example of using this to nvbench_compare.md doc.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
78f70b097f Replaced UNDECIDED with AMBG, use Gray color/shrug emoji 2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
7a582db94e Improve nvbench-compare interval display readability
Add compact reason labels for explain-mode tables while keeping canonical
reason codes in the undecided summary. Emit a one-line legend only for
non-trivial abbreviations.

Refine interval displays so timing values align across table rows:
  - align Lo/Ce/Hi values in explain mode
  - align center values in intervals mode when some rows lack interval bounds
  - avoid repeating units when center and interval deltas use the same unit

Add a Change column for non-legacy displays so FAST/SLOW rows show the
signed interval-bound relative change, while SAME and UNDECIDED rows remain
blank.

Extend nvbench_compare tests to cover reason legend filtering, interval
alignment, missing-interval alignment, and Change column formatting.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
70d728cba6 Implement --bulk-debug-python option
Use this option to generate a Python script with the information needed
to load bulk data from the reference/compare datasets for further
drill-down into the data.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
2b656a94a7 Support rename of tags */ir/(absolute|relative) to */iqr/(absolute|relative) 2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
732d227be1 Add TOML configuration for nvbench-compare thresholds
Add versioned TOML configuration support for nvbench-compare threshold
settings. The new --config option reads grouped settings for clear-gap,
same-result, bulk coverage, and rare-support filtering thresholds. The parser
validates the schema strictly so unknown tables, unknown keys, invalid types,
unsupported versions, and out-of-range values fail early.

Add --dump-config to print the effective configuration without requiring input
JSON files. This makes the currently selected preset and resolved threshold
values discoverable and gives users a starting point for custom configuration.

Preset resolution is:
  - default is used when neither TOML nor CLI selects a preset
  - [preset] name = "..." in TOML selects the base preset
  - --preset ... overrides the TOML preset selection
  - explicit threshold values in TOML override whichever base preset was selected

For example:
  - nvbench-compare --dump-config
    Prints the built-in default settings as grouped TOML.

  - nvbench-compare --preset permissive --dump-config
    Prints the permissive preset values as TOML.

  - nvbench-compare --config compare.toml ref.json cmp.json
    Compares using the preset named in compare.toml, plus any explicit TOML
    threshold overrides.

  - nvbench-compare --config compare.toml --preset strict ref.json cmp.json
    Uses the strict preset as the base, while preserving explicit threshold
    overrides from compare.toml.
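
The preset resolution order listed above can be sketched as a small function. The [preset] name key and the preset names default, strict, and permissive appear in this message; the threshold keys, their values, and the flat "thresholds" table are placeholders invented for illustration, since the real configuration groups settings differently.

```python
# Illustrative values only; these are not the real preset contents.
PRESETS = {
    "default": {"clear_gap": 0.005, "max_noise": 0.02},
    "strict": {"clear_gap": 0.010, "max_noise": 0.01},
    "permissive": {"clear_gap": 0.002, "max_noise": 0.05},
}


def resolve_thresholds(toml_config=None, cli_preset=None):
    """Return (preset_name, thresholds) following the documented precedence."""
    toml_config = toml_config or {}
    name = "default"  # used when neither TOML nor CLI selects a preset
    toml_name = toml_config.get("preset", {}).get("name")
    if toml_name is not None:
        name = toml_name  # TOML selects the base preset
    if cli_preset is not None:
        name = cli_preset  # --preset overrides the TOML selection
    resolved = dict(PRESETS[name])
    # Explicit TOML values override whichever base preset was selected.
    resolved.update(toml_config.get("thresholds", {}))
    return name, resolved
```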

Keep TOML parsing lazy: Python 3.11+ uses tomllib, while Python 3.10 only
requires tomli when --config is used. Add focused tests for grouped config
dumping, strict validation, preset/override precedence, and CLI dump behavior.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
2585842cf5 Add nvbench_compare display modes and interval-based table views
Extend nvbench_compare with multiple table display modes and richer interval
formatting for timing comparisons.

Highlights:
  - add `--display` with `intervals`, `legacy`, and `explain` modes
  - keep `legacy` output using scalar Diff/%Diff
  - make `intervals` the default, showing compact center-plus-delta timing
    intervals
  - add `explain` mode with explicit `[L | C | H]` interval rendering and
    self-describing headers
  - compute and store diff and relative-diff intervals in SummaryComparison
  - add formatting helpers for absolute and relative interval displays
  - make default preset slightly more permissive by lowering
    `bulk_same_sample_coverage` to 0.97

Add focused tests covering:
  - diff/%diff interval computation
  - compact and explicit interval formatting
  - default, legacy, and explain table layouts
  - CLI propagation of `--display` and preset selection
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
cc1c40b777 Change how the FAST/SLOW decision is arrived at
Now:

  - establish a candidate clear timing gap from summary timing intervals, as before
  - if bulk sample times and frequencies are available on both sides,
    compute cycles = time * frequency
  - derive bulk cycle intervals from min/q1/median/q3
  - confirm the gap direction from those bulk cycle intervals
  - only fall back to summary sm_clock_rate_mean confirmation when bulk cycle data
    is unavailable
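
A rough sketch of the bulk-cycle step, assuming paired per-sample times and frequencies. The function name is hypothetical, and the quartiles come from the standard library's interpolating method purely as a stand-in; it is not claimed to match how nvbench_compare computes them.

```python
import statistics


def bulk_cycle_summary(times, frequencies):
    """Return (min, q1, median, q3) of cycles = time * frequency, or None."""
    # Paired arrays must have equal length; quantiles() needs >= 2 points.
    if len(times) != len(frequencies) or len(times) < 2:
        return None
    cycles = [t * f for t, f in zip(times, frequencies)]
    q1, median, q3 = statistics.quantiles(cycles, n=4, method="inclusive")
    return (min(cycles), q1, median, q3)
```

Working in cycles rather than seconds removes the effect of SM clock differences between the two runs, which is why the gap direction is confirmed in that space.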

  I also split the reason codes so the evidence source is visible:

  - clear_gap_confirmed_by_bulk_cycles
  - bulk_cycle_gap_not_confirmed
  - clear_gap_confirmed_by_summary_cycles
  - summary_cycle_gap_not_confirmed

Updated tests in python/test/test_nvbench_compare.py cover both the bulk-confirmed
and bulk-rejected paths, along with the renamed summary reason codes.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
9104c58d63 Add nvbench_compare presets and rare-support-aware bulk coverage
Introduce comparison threshold presets in nvbench_compare and thread the
selected preset through main() into compare_benches.

Refine bulk nearest-neighbor support handling by:
  - adding rare-support filtering thresholds
  - ignoring low-count support values only when removed sample mass is small
  - falling back to full support for all-unique or otherwise unusable support
  - keeping sample-weight coverage over all values

Tighten bulk mismatch reporting to show compact min(ref, cmp) coverage
summaries, and add tests covering:
  - rare-tail filtering
  - strict fallback when too much support mass would be removed
  - all-unique support preservation
  - preset lookup and CLI preset propagation
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
d8efe3dd9e Group nvbench-compare thresholds into a config object
Replace the scattered module-level comparison threshold constants
with a ComparisonThresholds value object. Thread this object through
compare_benches, compare_gpu_timings, and the lower-level clear-gap,
summary-SAME, and bulk-SAME decision helpers.

Keep existing behavior by constructing default ComparisonThresholds
when callers do not provide one. This prepares nvbench-compare for
future CLI-configurable decision thresholds while keeping one consistent
configuration for an entire comparison run.

Add test coverage that passes custom thresholds through compare_benches and
verifies they affect the SAME decision.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
0f091438a5 Use bulk samples to confirm same comparisons
Add a bulk-data SAME path to nvbench_compare for cases where summary
intervals do not provide a clear FAST/SLOW decision. The new path compares
sample times and SM-clock-adjusted cycles with symmetric nearest-neighbor
coverage over unique values and sample counts.

The comparison now requires both sample-weight coverage and unique-support
coverage to pass before declaring SAME. If bulk data is available but coverage
does not pass, the result remains UNDECIDED instead of falling back to the
summary-only SAME rule.

Also improve undecided diagnostics by aggregating reason codes while preserving
the most severe representative detail, including observed coverage values and
thresholds for bulk support mismatches.

Add tests for:
 - bulk data confirming SAME despite changed mode weights;
 - bulk time mismatch overriding summary-only SAME;
 - cycle coverage vetoing time-only agreement;
 - sample-weight and unique-support coverage diagnostics;
 - aggregation of undecided reason details.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
ed98d3d950 Implement DecisionReason, tracking and summarisation
- Add DecisionReason(code, message) and internal
  TimingDecision(status, reason).
- SummaryComparison now carries reason
- ComparisonStats now aggregates undecided reasons.
- Final summary prints a reason breakdown only when
  undecided reasons exist, e.g.:

  - Undecided   (comparison requires more evidence): 3
    - Reasons:
      - noise_too_high: 2 (relative dispersion is too
                           high to declare same)
      - weak_interval_overlap: 1 (timing intervals do not
                 overlap strongly enough to declare same)
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
917a950e78 Implement early SAME check
If the SLOW/FAST check returns undecided, attempt a conservative
SAME check based on summary data alone (bulk data are not read).

Reference and compare measurements are considered SAME if
   - both centers are positive finite values;
   - abs(ref - cmp) / min(ref, cmp) <= 0.5%.
     This is equivalent to max(ref, cmp) / min(ref, cmp) <= 1 + delta;
   - interval overlap must cover at least 50% of the smaller interval;
   - relative dispersion must be finite on both sides and no more than 2%;
   - if SM clock summaries are available, the same check must also pass in cycle space.
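
The time-space part of these conditions can be sketched as below. The 0.5%, 50%, and 2% limits come from the list above; the function names, the tuple representation of intervals, and the treatment of zero-width intervals are assumptions, and the cycle-space repeat of the check is omitted.

```python
import math


def overlap_fraction(a, b):
    """Overlap length divided by the width of the smaller interval."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    smaller = min(a[1] - a[0], b[1] - b[0])
    if smaller <= 0.0:
        # Assumption: a zero-width interval is covered if it touches the other.
        return 1.0 if overlap >= 0.0 else 0.0
    return max(0.0, overlap) / smaller


def summary_same(ref_center, cmp_center, ref_iv, cmp_iv, ref_noise, cmp_noise,
                 delta=0.005, min_overlap=0.5, max_noise=0.02):
    centers = (ref_center, cmp_center)
    if not all(math.isfinite(c) and c > 0 for c in centers):
        return False
    # Same as max(ref, cmp) / min(ref, cmp) <= 1 + delta.
    if abs(ref_center - cmp_center) / min(centers) > delta:
        return False
    noises = (ref_noise, cmp_noise)
    if not all(math.isfinite(n) and n <= max_noise for n in noises):
        return False
    return overlap_fraction(ref_iv, cmp_iv) >= min_overlap
```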

Otherwise UNDECIDED remains the working decision, to be refined by further checks.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
0e7b9815cf Implement clear-gap comparison for early FAST/SLOW decision
Implemented the clear-gap comparison, with the log-distance-equivalent
algebra and pessimistic SM-clock fallback.

What changed:

 - Added TimingInterval and interval construction from summaries:
    - robust interval: [min, q3], centered at median
    - fallback interval: clipped [mean - stdev, mean + stdev] intersected with [min, max]
 - Added CLEAR_GAP_RELATIVE_THRESHOLD = 0.005.
 - FAST gap uses:

   (ref.lower - cmp.upper) / cmp.upper >= delta
   which is equivalent to log(ref.lower / cmp.upper) >= log(1 + delta).
 - SLOW gap uses:

   (cmp.lower - ref.upper) / ref.upper >= delta
 - FAST/SLOW now requires SM clock summaries on both sides and the same clear-gap result after scaling intervals by sm_clock_rate_mean.
 - If intervals are missing, overlap, fail the gap threshold, have missing/invalid clock summaries, or time/cycle comparison disagrees, status is UNDECIDED.
 - Existing center/noise values are still computed and displayed, but no longer drive FAST/SLOW/SAME classification.
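
The gap tests and the clock confirmation listed above can be sketched as follows. The threshold constant and the lower/upper attributes follow the message; the dataclass layout, the helper names, and the string statuses are assumptions for illustration.

```python
from dataclasses import dataclass

CLEAR_GAP_RELATIVE_THRESHOLD = 0.005


@dataclass(frozen=True)
class TimingInterval:
    lower: float
    center: float
    upper: float

    def scaled(self, factor):
        # Multiplying times by a clock rate converts the interval to cycles.
        return TimingInterval(self.lower * factor, self.center * factor,
                              self.upper * factor)


def clear_gap(ref, cmp, delta=CLEAR_GAP_RELATIVE_THRESHOLD):
    if (ref.lower - cmp.upper) / cmp.upper >= delta:
        return "FAST"  # compare run finishes clearly below the reference
    if (cmp.lower - ref.upper) / ref.upper >= delta:
        return "SLOW"
    return "UNDECIDED"


def classify(ref, cmp, ref_clock, cmp_clock):
    time_status = clear_gap(ref, cmp)
    if time_status == "UNDECIDED" or ref_clock is None or cmp_clock is None:
        return "UNDECIDED"
    cycle_status = clear_gap(ref.scaled(ref_clock), cmp.scaled(cmp_clock))
    # Time-space and cycle-space results must agree.
    return time_status if cycle_status == time_status else "UNDECIDED"
```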

Updated tests to cover:

 - center/noise-only comparisons becoming UNDECIDED
 - clear FAST/SLOW with matching clock evidence
 - missing clock fallback to UNDECIDED
 - frequency-shift disagreement becoming UNDECIDED
 - regression reporting with robust interval and clock evidence
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
12750221b5 Add q1/q3 quartiles to GPUTimeData struct
The quantile values are not currently used, but plumbed through
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
cbe9a5b2fd Add "nv/cold/sm_clock_rate/mean" to GPU time summary data
Its intent is to be a cheaply retrievable metric of the average
SM clock frequency over the entire sample
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
2502d29ece Lazy-load nvbench-compare bulk timing data
Store JSON-bin sample time and frequency metadata in GpuTimingData instead of
reading the binary files during summary extraction.

Add Float32BinarySource and lazy cached accessors for samples and frequencies.
Use np.fromfile by default, but allow tests and alternate callers to inject a
float32 reader returning any buffer-compatible object convertible to "<f4" data
type.

Treat optional bulk-data failures as unavailable evidence instead of aborting
comparison: unreadable files, invalid buffers, count mismatches, and mismatched
sample/frequency metadata now emit RuntimeWarning and return None.
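
A simplified sketch of lazy, cached loading with an injected reader and warning-based degradation. This class is a hypothetical stand-in and not the real Float32BinarySource: it decodes with the standard library's struct module instead of NumPy, and the set of caught exceptions is an assumption.

```python
import struct
import warnings
from functools import cached_property


class LazyFloat32Source:
    """Hypothetical stand-in: reads little-endian float32 data on first use."""

    def __init__(self, count, reader):
        self.count = count
        self._reader = reader  # callable returning a buffer-compatible object

    @cached_property
    def values(self):
        # cached_property stores the result, so the reader runs at most once.
        try:
            raw = memoryview(self._reader()).tobytes()
        except (OSError, TypeError, ValueError) as exc:
            warnings.warn(f"bulk data unavailable: {exc}", RuntimeWarning)
            return None
        if len(raw) != 4 * self.count:
            warnings.warn("bulk data count mismatch", RuntimeWarning)
            return None
        return struct.unpack(f"<{self.count}f", raw)
```

Returning None instead of raising lets the comparison continue with summary data when the optional bulk evidence cannot be read.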

Update nvbench_compare tests to verify lazy loading, cache reuse, injected
reader behavior, warning-based degradation, and count mismatch handling.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
a11b54101a Introduce UNDECIDED comparison status
It is not emitted yet, but the code is now ready to handle it
once it is.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
d9db53504e Refactor nvbench-compare timing comparison state
Introduce GpuTimingData, SummaryComparison, ComparisonStats, and
ComparisonRunData to make timing extraction, classification, and run-level
state explicit.

Load sample-time and SM-frequency bulk data from JSON binary output into
GpuTimingData when available, preserving count validation between paired
sample and frequency arrays.

Move GPU timing comparison logic into compare_gpu_timings(), prefer robust
median/IQR data when available, and fall back to mean/stdev summaries otherwise.
Keep missing or invalid noise on the unknown path.

Replace module-level comparison counters and selected-device globals with
per-run data passed into compare_benches(). Update tests to validate timing
classification, bulk-data loading, device pairing, filtered duplicate matching,
and summary counters through the new structures.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
0baa699b64 Make nvbench_compare read bulk data, if available 2026-06-30 06:40:44 -05:00