Commit Graph

822 Commits

Author SHA1 Message Date
Oleksandr Pavlyk
17536fd4ff Ensure that bulk-debug-python script is enclosed in markers
This permits extracting Python script using Unix CLI tools
when `--bulk-debug-python stdout` is used.

Added example of using this to nvbench_compare.md doc.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
78f70b097f Replaced UNDECIDED with AMBG, use Gray color/shrug emoji 2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
7a582db94e Improve nvbench-compare interval display readability
Add compact reason labels for explain-mode tables while keeping canonical
reason codes in the undecided summary. Emit a one-line legend only for
non-trivial abbreviations.

Refine interval displays so timing values align across table rows:
  - align Lo/Ce/Hi values in explain mode
  - align center values in intervals mode when some rows lack interval bounds
  - avoid repeating units when center and interval deltas use the same unit

Add a Change column for non-legacy displays so FAST/SLOW rows show the
signed interval-bound relative change, while SAME and UNDECIDED rows remain
blank.

Extend nvbench_compare tests to cover reason legend filtering, interval
alignment, missing-interval alignment, and Change column formatting.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
70d728cba6 Implement --bulk-debug-python option
Use this option to generate Python script with information needed to load
bulk data from reference/compare datasets for further drill-down into
data.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
70e810eca9 Add document for nvbench_compare
It documents use, documents decision tree, configurability,
use of --display option, benchmark/axis filtering and device filtering
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
2b656a94a7 Support rename of tags */ir/(absolute|relative) to */iqr/(absolute|relative) 2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
732d227be1 Add TOML configuration for nvbench-compare thresholds
Add versioned TOML configuration support for nvbench-compare threshold
settings. The new --config option reads grouped settings for clear-gap,
same-result, bulk coverage, and rare-support filtering thresholds. The parser
validates the schema strictly so unknown tables, unknown keys, invalid types,
unsupported versions, and out-of-range values fail early.

Add --dump-config to print the effective configuration without requiring input
JSON files. This makes the currently selected preset and resolved threshold
values discoverable and gives users a starting point for custom configuration.

Preset resolution is:
  - default is used when neither TOML nor CLI selects a preset
  - [preset] name = "..." in TOML selects the base preset
  - --preset ... overrides the TOML preset selection
  - explicit threshold values in TOML override whichever base preset was selected

For example:
  - nvbench-compare --dump-config
    Prints the built-in default settings as grouped TOML.

  - nvbench-compare --preset permissive --dump-config
    Prints the permissive preset values as TOML.

  - nvbench-compare --config compare.toml ref.json cmp.json
    Compares using the preset named in compare.toml, plus any explicit TOML
    threshold overrides.

  - nvbench-compare --config compare.toml --preset strict ref.json cmp.json
    Uses the strict preset as the base, while preserving explicit threshold
    overrides from compare.toml.
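
The precedence rules above can be sketched in a few lines of Python. This is an illustration only, not the nvbench-compare implementation: `PRESETS`, `resolve_thresholds`, the `[thresholds]` table name, the key `clear_gap_relative`, and every numeric value are assumptions made for the example. The input is assumed to be an already-parsed TOML dictionary.

```python
# Hypothetical preset table: key names and values are placeholders.
PRESETS = {
    "default": {"clear_gap_relative": 0.005, "bulk_same_sample_coverage": 0.97},
    "strict": {"clear_gap_relative": 0.01, "bulk_same_sample_coverage": 0.99},
    "permissive": {"clear_gap_relative": 0.003, "bulk_same_sample_coverage": 0.95},
}


def resolve_thresholds(toml_config=None, cli_preset=None):
    """Return (preset_name, thresholds) following the documented precedence."""
    toml_config = toml_config or {}
    toml_preset = toml_config.get("preset", {}).get("name")
    # --preset beats [preset] name, which beats the built-in default.
    preset_name = cli_preset or toml_preset or "default"
    thresholds = dict(PRESETS[preset_name])
    # Explicit TOML values override whichever base preset was selected.
    thresholds.update(toml_config.get("thresholds", {}))
    return preset_name, thresholds
```

With this sketch, passing `cli_preset="strict"` together with a TOML dictionary that names `permissive` and sets one threshold yields the strict preset with that single threshold replaced.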

Keep TOML parsing lazy: Python 3.11+ uses tomllib, while Python 3.10 only
requires tomli when --config is used. Add focused tests for grouped config
dumping, strict validation, preset/override precedence, and CLI dump behavior.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
2585842cf5 Add nvbench_compare display modes and interval-based table views
Extend nvbench_compare with multiple table display modes and richer interval
formatting for timing comparisons.

Highlights:
  - add `--display` with `intervals`, `legacy`, and `explain` modes
  - keep `legacy` output using scalar Diff/%Diff
  - make `intervals` the default, showing compact center-plus-delta timing
    intervals
  - add `explain` mode with explicit `[L | C | H]` interval rendering and
    self-describing headers
  - compute and store diff and relative-diff intervals in SummaryComparison
  - add formatting helpers for absolute and relative interval displays
  - make default preset slightly more permissive by lowering
    `bulk_same_sample_coverage` to 0.97

Add focused tests covering:
  - diff/%diff interval computation
  - compact and explicit interval formatting
  - default, legacy, and explain table layouts
  - CLI propagation of `--display` and preset selection
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
cc1c40b777 Change how the FAST/SLOW decision is arrived at
Now:

  - establish a candidate clear timing gap from summary timing intervals, as before
  - if bulk sample times and frequencies are available on both sides,
    compute cycles = time * frequency
  - derive bulk cycle intervals from min/q1/median/q3
  - confirm the gap direction from those bulk cycle intervals
  - only fall back to summary sm_clock_rate_mean confirmation when bulk cycle data
    is unavailable

  I also split the reason codes so the evidence source is visible:

  - clear_gap_confirmed_by_bulk_cycles
  - bulk_cycle_gap_not_confirmed
  - clear_gap_confirmed_by_summary_cycles
  - summary_cycle_gap_not_confirmed

Updated tests in python/test/test_nvbench_compare.py cover both the bulk-confirmed
and bulk-rejected paths, along with the renamed summary reason codes.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
9104c58d63 Add nvbench_compare presets and rare-support-aware bulk coverage
Introduce comparison threshold presets in nvbench_compare and thread the
selected preset through main() into compare_benches.

Refine bulk nearest-neighbor support handling by:
  - adding rare-support filtering thresholds
  - ignoring low-count support values only when removed sample mass is small
  - falling back to full support for all-unique or otherwise unusable support
  - keeping sample-weight coverage over all values

Tighten bulk mismatch reporting to show compact min(ref, cmp) coverage
summaries, and add tests covering:
  - rare-tail filtering
  - strict fallback when too much support mass would be removed
  - all-unique support preservation
  - preset lookup and CLI preset propagation
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
d8efe3dd9e Group nvbench-compare thresholds into a config object
Replace the scattered module-level comparison threshold constants
with a ComparisonThresholds value object. Thread this object through
compare_benches, compare_gpu_timings, and the lower-level clear-gap,
summary-SAME, and bulk-SAME decision helpers.

Keep existing behavior by constructing default ComparisonThresholds
when callers do not provide one. This prepares nvbench-compare for
future CLI-configurable decision thresholds while keeping one consistent
configuration for an entire comparison run.

Add test coverage that passes custom thresholds through compare_benches and
verifies they affect the SAME decision.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
0f091438a5 Use bulk samples to confirm same comparisons
Add a bulk-data SAME path to nvbench_compare for cases where summary
intervals do not provide a clear FAST/SLOW decision. The new path compares
sample times and SM-clock-adjusted cycles with symmetric nearest-neighbor
coverage over unique values and sample counts.

The comparison now requires both sample-weight coverage and unique-support
coverage to pass before declaring SAME. If bulk data is available but coverage
does not pass, the result remains UNDECIDED instead of falling back to the
summary-only SAME rule.

Also improve undecided diagnostics by aggregating reason codes while preserving
the most severe representative detail, including observed coverage values and
thresholds for bulk support mismatches.

Add tests for:
 - bulk data confirming SAME despite changed mode weights;
 - bulk time mismatch overriding summary-only SAME;
 - cycle coverage vetoing time-only agreement;
 - sample-weight and unique-support coverage diagnostics;
 - aggregation of undecided reason details.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
ed98d3d950 Implement DecisionReason, tracking and summarisation
- Add DecisionReason(code, message) and internal
  TimingDecision(status, reason).
- SummaryComparison now carries reason
- ComparisonStats now aggregates undecided reasons.
- Final summary prints a reason breakdown only when
  undecided reasons exist, e.g.:

  - Undecided   (comparison requires more evidence): 3
    - Reasons:
      - noise_too_high: 2 (relative dispersion is too
                           high to declare same)
      - weak_interval_overlap: 1 (timing intervals do not
                 overlap strongly enough to declare same)
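
A breakdown like the one above could be produced roughly as follows. This is a hedged sketch: `DecisionReason(code, message)` mirrors the fields named in this commit, while `summarize_undecided` and the exact line layout are assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionReason:
    code: str
    message: str


def summarize_undecided(reasons):
    """Return summary lines, or an empty list when nothing is undecided."""
    counts = Counter(reason.code for reason in reasons)
    if not counts:
        return []
    # Keep the first message seen for each code as its description.
    messages = {}
    for reason in reasons:
        messages.setdefault(reason.code, reason.message)
    total = sum(counts.values())
    lines = [f"- Undecided   (comparison requires more evidence): {total}",
             "  - Reasons:"]
    for code, count in counts.most_common():
        lines.append(f"    - {code}: {count} ({messages[code]})")
    return lines
```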
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
917a950e78 Implement early SAME check
If the FAST/SLOW check returns UNDECIDED, we attempt a conservative
SAME check based on summary data alone (bulk data are not read).

Reference and compare measurements are considered SAME if
   - both centers are positive finite values;
   - abs(ref - cmp) / min(ref, cmp) <= 0.5%.
     This is equivalent to max(ref, cmp) / min(ref, cmp) <= 1 + delta;
   - interval overlap must cover at least 50% of the smaller interval;
   - relative dispersion must be finite on both sides and no more than 2%;
   - if SM clock summaries are available, the same check must also pass in cycle space.

Otherwise, UNDECIDED remains the working decision, to be refined by further checks.
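
The time-space part of the SAME conditions listed above can be sketched as a single predicate. The function name, argument layout, and the treatment of zero-width intervals are assumptions for illustration; the cycle-space repeat of the check is omitted.

```python
import math


def summary_same(ref_center, cmp_center, ref_interval, cmp_interval,
                 ref_noise, cmp_noise,
                 center_tol=0.005, min_overlap=0.5, max_noise=0.02):
    """Conservative SAME check on summary data; intervals are (lower, upper)."""
    centers = (ref_center, cmp_center)
    # Both centers must be positive finite values.
    if not all(math.isfinite(c) and c > 0 for c in centers):
        return False
    # abs(ref - cmp) / min(ref, cmp) <= 0.5%
    if abs(ref_center - cmp_center) / min(centers) > center_tol:
        return False
    (ref_lo, ref_hi), (cmp_lo, cmp_hi) = ref_interval, cmp_interval
    overlap = min(ref_hi, cmp_hi) - max(ref_lo, cmp_lo)
    smaller = min(ref_hi - ref_lo, cmp_hi - cmp_lo)
    # Assumption: a zero-width interval cannot demonstrate overlap.
    if smaller <= 0 or overlap < min_overlap * smaller:
        return False
    # Relative dispersion must be finite and no more than 2% on both sides.
    return all(math.isfinite(n) and 0 <= n <= max_noise
               for n in (ref_noise, cmp_noise))
```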
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
0e7b9815cf Implement clear-gap comparison for early FAST/SLOW decision
Implemented the clear-gap comparison, with the log-distance-equivalent
algebra and pessimistic SM-clock fallback.

What changed:

 - Added TimingInterval and interval construction from summaries:
    - robust interval: [min, q3], centered at median
    - fallback interval: clipped [mean - stdev, mean + stdev] intersected with [min, max]
 - Added CLEAR_GAP_RELATIVE_THRESHOLD = 0.005.
 - FAST gap uses:

   (ref.lower - cmp.upper) / cmp.upper >= delta
   which is equivalent to log(ref.lower / cmp.upper) >= log(1 + delta).
 - SLOW gap uses:

   (cmp.lower - ref.upper) / ref.upper >= delta
 - FAST/SLOW now requires SM clock summaries on both sides and the same clear-gap result after scaling intervals by sm_clock_rate_mean.
 - If intervals are missing, overlap, fail the gap threshold, have missing/invalid clock summaries, or time/cycle comparison disagrees, status is UNDECIDED.
 - Existing center/noise values are still computed and displayed, but no longer drive FAST/SLOW/SAME classification.
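
The FAST and SLOW gap formulas above, together with the requirement that time and cycle space agree, can be sketched as follows. Function names and the tuple-based interval layout are assumptions; only the two inequalities and the 0.005 threshold come from this commit message.

```python
import math


def clear_gap(ref, cmp, delta=0.005):
    """Return 'FAST', 'SLOW', or None for (lower, upper) intervals."""
    ref_lo, ref_hi = ref
    cmp_lo, cmp_hi = cmp
    # FAST: the compare interval lies entirely below the reference interval.
    if cmp_hi > 0 and (ref_lo - cmp_hi) / cmp_hi >= delta:
        return "FAST"
    # SLOW: the compare interval lies entirely above the reference interval.
    if ref_hi > 0 and (cmp_lo - ref_hi) / ref_hi >= delta:
        return "SLOW"
    return None


def classify(ref_time, cmp_time, ref_clock, cmp_clock, delta=0.005):
    """Require the same clear gap in time and after scaling by the mean SM clock."""
    time_gap = clear_gap(ref_time, cmp_time, delta)
    clocks = (ref_clock, cmp_clock)
    if time_gap is None or any(
            c is None or not math.isfinite(c) or c <= 0 for c in clocks):
        return "UNDECIDED"
    ref_cycles = tuple(t * ref_clock for t in ref_time)
    cmp_cycles = tuple(t * cmp_clock for t in cmp_time)
    cycle_gap = clear_gap(ref_cycles, cmp_cycles, delta)
    return time_gap if cycle_gap == time_gap else "UNDECIDED"
```

In this sketch a compare run that is faster in time but ran at a proportionally higher clock rate ends up UNDECIDED, because the cycle-space gap points the other way or disappears.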

Updated tests to cover:

 - center/noise-only comparisons becoming UNDECIDED
 - clear FAST/SLOW with matching clock evidence
 - missing clock fallback to UNDECIDED
 - frequency-shift disagreement becoming UNDECIDED
 - regression reporting with robust interval and clock evidence
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
12750221b5 Add q1/q3 quartiles to GPUTimeData struct
The quantile values are not currently used, but plumbed through
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
cbe9a5b2fd Add "nv/cold/sm_clock_rate/mean" to GPU time summary data
It is intended as a cheaply retrievable metric of the average
SM clock frequency over the entire sample
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
2502d29ece Lazy-load nvbench-compare bulk timing data
Store JSON-bin sample time and frequency metadata in GpuTimingData instead of
reading the binary files during summary extraction.

Add Float32BinarySource and lazy cached accessors for samples and frequencies.
Use np.fromfile by default, but allow tests and alternate callers to inject a
float32 reader returning any buffer-compatible object convertible to "<f4" data
type.

Treat optional bulk-data failures as unavailable evidence instead of aborting
comparison: unreadable files, invalid buffers, count mismatches, and mismatched
sample/frequency metadata now emit RuntimeWarning and return None.
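
The lazy, cached, warning-on-failure pattern described above might look like this in outline. It is a sketch under assumptions: `LazyFloat32Source` and `read_float32` are hypothetical names, and the standard-library `array` module stands in for `np.fromfile` so the example has no third-party dependency.

```python
import array
import sys
import warnings
from functools import cached_property


def read_float32(path, count):
    """Read `count` little-endian float32 values using only the standard library."""
    values = array.array("f")
    with open(path, "rb") as handle:
        values.fromfile(handle, count)  # raises EOFError if the file is too short
    if sys.byteorder == "big":
        values.byteswap()
    return values


class LazyFloat32Source:
    """Hypothetical stand-in for a lazily loaded bulk-data source."""

    def __init__(self, path, count, reader=read_float32):
        self.path = path
        self.count = count
        self._reader = reader

    @cached_property
    def values(self):
        # Runs once on first access; the result (including None) is cached.
        try:
            data = self._reader(self.path, self.count)
        except (OSError, EOFError) as exc:
            warnings.warn(f"cannot read {self.path}: {exc}", RuntimeWarning)
            return None
        if len(data) != self.count:
            warnings.warn(
                f"{self.path}: expected {self.count} values, got {len(data)}",
                RuntimeWarning)
            return None
        return data
```

Injecting `reader` keeps tests independent of the file system, and caching `None` means a failed read warns once instead of on every access.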

Update nvbench_compare tests to verify lazy loading, cache reuse, injected
reader behavior, warning-based degradation, and count mismatch handling.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
a11b54101a Introduce UNDECIDED comparison status
It is not emitted just yet, but the code becomes ready for it
when it starts being emitted
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
d9db53504e Refactor nvbench-compare timing comparison state
Introduce GpuTimingData, SummaryComparison, ComparisonStats, and
ComparisonRunData to make timing extraction, classification, and run-level
state explicit.

Load sample-time and SM-frequency bulk data from JSON binary output into
GpuTimingData when available, preserving count validation between paired
sample and frequency arrays.

Move GPU timing comparison logic into compare_gpu_timings(), prefer robust
median/IQR data when available, and fall back to mean/stdev summaries otherwise.
Keep missing or invalid noise on the unknown path.

Replace module-level comparison counters and selected-device globals with
per-run data passed into compare_benches(). Update tests to validate timing
classification, bulk-data loading, device pairing, filtered duplicate matching,
and summary counters through the new structures.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
0baa699b64 Make nvbench_compare read bulk data, if available 2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
cdb06e1a57 Add scoped filtering and device pairing to nvbench_compare
Teach nvbench_compare to keep the order of --benchmark and --axis arguments so
axis filters can apply either globally or to the most recent benchmark. Build a
filter plan from the ordered CLI arguments and apply the same plan to table
output and plotting labels.

Add explicit --reference-devices and --compare-devices filters. The filters
accept all, a single device id, or a comma-separated list of ids; ordered lists
and duplicates are preserved so selected reference and compare devices can be
paired by position. Device-section mismatches remain fatal for unfiltered
all-vs-all comparisons, but become warnings when the user explicitly selects
devices and the selected device counts match.

Match duplicate benchmark states by occurrence within each filtered device
section instead of matching only by state name across the whole benchmark. This
keeps repeated axis values and filtered duplicate states aligned between the
reference and compare inputs, and reports mismatched occurrence counts instead
of silently dropping extra states.

Add Python tests for duplicate-state matching, axis filtering before matching,
device filter parsing and validation, explicit cross-device pairing, and
benchmark-scoped axis filters.

Original commit messages folded into this change:

Tweaks for nvbench_compare

1. When JSON files contain multiple entries with the same name and axis values,
   make sure that the script compares corresponding entries.

   Previous logic would extract the first entry from ref data, and would compare
   measurements for each state in cmp against the first entry from ref. The
   change introduces a counter to know which nth entry we process for a
   particular axis value, and retrieve corresponding entry in ref.

Scope occurrence matching by device.

Device pairing in nvbench_compare.py is strictly index-based under
--ignore-devices, so reused IDs in a different order no longer pair against the
wrong reference device.

Require devices in ref and cmp to have the same cardinality

Handle mismatch when the number of duplicates in ref data is not the same as in cmp data

Use pytest monkeypatch fixture to pretend third-party package dependencies are
available during test run for nvbench_compare without introducing test-time
dependency

Added the happy-path test and fixed its direct-call setup by initializing the
device globals that main() normally populates.

Fix to filter-before-matching.

 - compare_benches() now pairs devices by selected position instead of taking a
   device id.
 - For each device pair, compare_benches() now builds:
     - ref_device_states: matching reference device and axis filters
     - cmp_device_states: matching compare device and axis filters
 - State occurrence counts and duplicate occurrence matching now operate only
   on those filtered per-device lists.
 - Removed the later matches_axis_filters() skip inside the compare-state loop
   because filtering now happens before matching.

Added a regression test where ref/cmp have duplicate state names in opposite
order, and --axis keeps only one of them. The test verifies the kept compare
state is matched against the kept reference state, not the first unfiltered
occurrence.

Introduce device filtering in nvbench_compare

 - --reference-devices all|ID|ID,ID,...
 - --compare-devices all|ID|ID,ID,...
 - Integer lists preserve order and duplicates.
 - Requested IDs are validated against the file-level device list.
 - Filtered reference/compare device counts must match before comparison.
 - compare_benches() pairs selected reference and compare devices by position.
 - Each benchmark validates that requested device IDs are present in its own
   devices list.
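
The `all|ID|ID,ID,...` syntax and positional pairing described above can be sketched as follows. The function names and error messages are hypothetical; validation against the file-level device list is left out.

```python
def parse_device_filter(text):
    """Return None for 'all', else a list of ids preserving order and duplicates."""
    text = text.strip()
    if text == "all":
        return None
    ids = []
    for part in text.split(","):
        part = part.strip()
        # Accept plain ASCII non-negative integers only.
        if not (part.isascii() and part.isdigit()):
            raise ValueError(f"invalid device id: {part!r}")
        ids.append(int(part))
    return ids


def pair_devices(ref_ids, cmp_ids):
    """Pair selected reference and compare devices by position."""
    if len(ref_ids) != len(cmp_ids):
        raise ValueError("selected device counts must match")
    return list(zip(ref_ids, cmp_ids))
```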

Implemented benchmark-scoped --axis handling.

  - --axis and --benchmark now share an ordered argparse action, so their
    relative CLI order is preserved.
  - -a before any -b becomes a global axis filter.
  - -a after -b <name> applies to that most recent benchmark only.
  - Repeated -b entries are treated as separate filter scopes and combined as
    alternatives for that benchmark.
  - Device filtering remains global and is applied independently.
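
One way to preserve the relative order of `-a` and `-b` is a custom `argparse.Action` that records every occurrence in a shared list. The sketch below assumes the attribute name `ordered_filters`, the helper `build_filter_plan`, and the `NAME=VALUE` axis strings; none of these are taken from the nvbench_compare source.

```python
import argparse


class OrderedAppend(argparse.Action):
    """Append (dest, value) to one shared list so CLI order is kept."""

    def __call__(self, parser, namespace, values, option_string=None):
        ordered = getattr(namespace, "ordered_filters", None)
        if ordered is None:
            ordered = []
            namespace.ordered_filters = ordered
        ordered.append((self.dest, values))


def build_filter_plan(ordered):
    """Split ordered arguments into global axes and per-benchmark scopes."""
    global_axes = []
    scopes = []  # one (benchmark, axes) entry per -b occurrence
    for kind, value in ordered:
        if kind == "benchmark":
            scopes.append((value, []))
        elif scopes:
            scopes[-1][1].append(value)  # -a after -b: most recent benchmark
        else:
            global_axes.append(value)    # -a before any -b: global filter
    return global_axes, scopes


parser = argparse.ArgumentParser()
parser.add_argument("-b", "--benchmark", action=OrderedAppend)
parser.add_argument("-a", "--axis", action=OrderedAppend)

args = parser.parse_args(["-a", "T=I32", "-b", "copy", "-a", "N=1024", "-b", "copy"])
global_axes, scopes = build_filter_plan(args.ordered_filters)
```

Here the first `-a` becomes a global filter, the second applies only to the first `copy` scope, and the repeated `-b copy` opens a separate scope with no axis filters.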

Allow non-matching devices for explicit device selection

Now the device-section equality check remains fatal only for unfiltered
all-vs-all comparisons. If either --reference-devices or --compare-devices is
explicit, mismatched selected device metadata is printed as a warning, but
comparison proceeds after the selected device counts have been validated.

Fix for resolve_benchmark_device_ids, add comments

The return value of resolve_benchmark_device_ids now always owns its list.

Use monkeypatch class in set_test_devices helper

Stricter device id validation

Test for device id validation
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
613ee08d76 Use robust summaries in nvbench_compare classification
Teach nvbench_compare to parse GPU timing summaries into structured values and
prefer the robust median/IQR summaries when both compared measurements provide
them. Fall back to the existing mean/stdev summaries when robust summaries are
not available.

Classify comparisons with the larger available relative noise estimate instead
of the smaller one, keep unavailable noise distinct from encoded infinite noise,
and report improvements separately from regressions. Keep the process exit code
as success for completed comparisons; regression counts are reported in the
summary instead of being used as the process status.

Make plotting tolerate unavailable noise by leaving gaps in confidence bands,
sort plotted series by the plotted axis, and avoid reusing pyplot state across
plot calls.

Add focused Python tests for robust-summary preference, unavailable-noise
classification, non-finite timing centers, plot-along handling when the selected
axis is absent, and the exit-code contract.
2026-06-30 06:40:44 -05:00
Oleksandr Pavlyk
454e3fe649 Test for timeout warnings for min-samples and min-time 2026-06-30 06:37:00 -05:00
Oleksandr Pavlyk
d230a16e2b Tighten statistics and timeout warning tests
Document that percentile helpers return quiet NaNs for NaN-containing inputs.

Make quartile expected-value tests compute ranks from the documented
round(p / 100 * (n - 1)) rule instead of reusing statistics::percentile_rank(),
so rank regressions are caught independently.

Extend timeout-warning coverage to exercise the too-few-samples max-noise path
in addition to unavailable, invalid, and infinite stdev-noise inputs.
2026-06-28 08:50:21 -05:00
Oleksandr Pavlyk
36d8c5ba46 Test nullopt explicitly in warning check test
check_noise_warning() now takes std::optional<nvbench::float64_t>,
matching the production helper, and the test now covers
std::nullopt explicitly in addition to NaN, negative, and +inf.
2026-06-28 08:22:03 -05:00
Oleksandr Pavlyk
e99ae66989 timeout_warnings now treats engaged NaN and negative stdev noise as unavailable
Add a focused test target, nvbench.test.measure_timeout_warnings, covering:

  - NaN stdev noise -> “unable to estimate noise”
  - negative stdev noise -> “unable to estimate noise”
  - +inf stdev noise -> “over noise threshold”
2026-06-28 07:43:29 -05:00
Oleksandr Pavlyk
bb0f90f1a0 Preserve stdev noise summaries for low sample counts
Keep legacy stdev/relative summary tags present even when too few
samples are available to compute a meaningful standard-deviation noise
estimate. Use the standard-deviation unavailable sentinel for those
values so existing summary consumers continue to see the expected tags.

Factor the sentinel into the statistics helpers and use it from both
standard_deviation() and stdev_noise_or_sentinel(), keeping the schema
compatibility behavior explicit and tested.
2026-06-28 07:14:02 -05:00
Oleksandr Pavlyk
55467266d9 test_compute_standard_deviation_noise exercises other invalid inputs 2026-06-28 07:14:02 -05:00
Oleksandr Pavlyk
b6b0dd1dd4 Collapsed two branches with identical bodies 2026-06-28 07:14:02 -05:00
Oleksandr Pavlyk
caa2f466c8 Check consistency of sort- vs. select-based quartiles using threshold constant
Expose quartile threshold value, use it in testing to test around that value.
2026-06-26 17:02:45 -05:00
Oleksandr Pavlyk
6b85a9b709 Add static assertion that ValueType is a floating-point type 2026-06-26 16:35:24 -05:00
Oleksandr Pavlyk
b0932b09f0 Refactor logic of emitting warnings between cold and cpu-only measures
Introduce new header file with inline implementation. Use it
from measure_cold.cuh and measure_cpu_only.cxx
2026-06-26 16:28:34 -05:00
Oleksandr Pavlyk
7069a6b888 Add comment re magic sort/select threshold value 2026-06-26 16:28:34 -05:00
Oleksandr Pavlyk
86eb2a8ddd Add tests for handling of NaNs in quartile routine inputs 2026-06-26 16:28:34 -05:00
Oleksandr Pavlyk
04290dd71c Add NaN guards to percentiles and quartiles computation routines 2026-06-26 16:28:34 -05:00
Oleksandr Pavlyk
d9cdd8bd1e Test quartile values across selection threshold
Add fixed expected-value assertions for quartile tests around the
sort/selection switch point, including duplicate-heavy inputs. This keeps the
tests from only proving that both implementations agree with each other.
2026-06-26 16:28:34 -05:00
Oleksandr Pavlyk
8dc36a6d79 Generate cold summaries only if some accepted samples have been accumulated
Cold measurement can discard throttled trials before incrementing the accepted
sample count, then stop on timeout with zero recorded samples. In that case,
only emit the sample-size summary and skip derived timing, bandwidth, clock, and
bulk summaries that require accepted samples.

This avoids divide-by-zero mean calculations and quartile/IQR computation over
empty sample vectors.

Keep timeout diagnostics reachable for zero-sample runs and add an explicit
warning when no accepted cold samples were recorded. Factor timeout warning
emission into a private helper so the zero-sample and normal paths share the
same diagnostic logic.

Suppress low-sample relative stdev noise

Add a statistics helper that returns no relative standard-deviation noise until
there are enough samples for a meaningful estimate. Use it for cold CPU/GPU and
CPU-only summaries so the low-sample +inf stdev sentinel is not published as
real relative noise or used for max-noise timeout warnings.

Add statistics coverage for suppressing the low-sample sentinel and computing
relative stdev noise once the sample threshold is reached.

compute_standard_deviation_noise returns nullopt if standard deviation is not finite

Test verifies that noise is nullopt when not enough samples are accumulated

Added statistics::has_enough_samples_for_noise_estimate(...)

Used it in standard_deviation, compute_standard_deviation_noise,
compute_robust_noise.

Added timeout diagnostics in cold and CPU-only paths.
if max-noise is configured and the run timed out before enough
samples exist to estimate noise, the log now says that explicitly,
otherwise the existing “over noise threshold” warning remains
unchanged.

Added a statistics test assertion for the new sample-count
predicate.
2026-06-26 16:28:34 -05:00
Oleksandr Pavlyk
86e1c2c881 Duplicate-heavy boundary test is added
Prepare duplicate-heavy input and check the sort-based
quartile computation result against the selection-based one.

std::nth_element only guarantees that the nth element
is the value that would appear there in sorted order;
it does not fully sort equal partitions. Bugs in the
selection implementation, especially when selecting Q1
from the left half and Q3 from the right half after
selecting the median, are more likely to show up when
many samples equal the quartile values.
2026-06-26 16:28:34 -05:00
Oleksandr Pavlyk
214d286247 Replace forwarding with semantically more accurate std::move
Also add comment within percentile_rank to document precondition
on input values checked with assert statement.

Also, sharpened the comment around percentile_rank function
2026-06-26 16:28:34 -05:00
Oleksandr Pavlyk
b1c61e5109 Rename IR to IQR in */ir/absolute and */ir/relative tags (#385) 2026-06-22 10:22:29 -07:00
Oleksandr Pavlyk
56d552687e Build and test cuda-bench wheels for Python 3.10-3.14 (#380)
Updated devcontainer image to 26.08 and CUDA 13.0.2 for Python 3.11-3.14,
but continue with 25.12 and CUDA 13.0.1 for Python 3.10, since the RAPIDS
team maintaining the ci-wheel images has dropped Python 3.10 support in
newer versions of the container
2026-06-04 10:14:35 -04:00
Oleksandr Pavlyk
0dc93b0c0e Introduce robust metrics (#379)
* Add statistics utilities to compute quartiles

Quartiles are computed using nearest rank method.

Two implementations are provided:
  1. Sort-based:
     a. sort array
     b. extract values at ranks of interest
  2. Selection based:
     a. Run nth_element to find median on whole range
     b. Run nth_element on left side to find first quartile
     c. Run nth_element on right side to find third quartile

Public API copies input into temporary vector which is mutated as needed.

Public API uses sort-based implementation for small arrays ( <= 4096 elements),
and selection-based implementation for larger arrays.

Sort-based implementation can support computation of arbitrary percentiles,
which could be useful later if more extreme statistics are needed.
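
For illustration, the sort-based variant can be sketched in Python (the actual helpers are C++). The rank rule `round(p / 100 * (n - 1))` is an assumption here, as is rounding halves upward; Python's built-in `round` rounds halves to even, so the sketch uses `floor(x + 0.5)` instead. Raising on empty input is also an assumption.

```python
import math


def percentile_rank(p, n):
    """Nearest-rank index for percentile p in a sorted array of n elements."""
    return int(math.floor(p / 100 * (n - 1) + 0.5))


def quartiles_sorted(values):
    """Return (q1, median, q3) using the sort-based approach."""
    data = sorted(values)  # works on a copy, the input is left untouched
    n = len(data)
    if n == 0:
        raise ValueError("quartiles of an empty input are undefined in this sketch")
    return tuple(data[percentile_rank(p, n)] for p in (25, 50, 75))
```

Because every result is an element of the input, no interpolation between samples takes place.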

Add tests covering percentile and quartile edge cases, input iterators,
selection-vs-sorting agreement, empty and singleton inputs, and relative
dispersion validation.

* Add quartiles information to summaries

Use the quartile helpers to report robust cold and CPU-only timing summaries:
Q1, median, Q3, interquartile range, and relative interquartile range.
These values stay hidden.

Summary tags are nv/cold/time/gpu/q1, nv/cold/time/gpu/median,
nv/cold/time/gpu/q3, nv/cold/time/gpu/ir/absolute, nv/cold/time/gpu/ir/relative

ir/absolute = q3 - q1, ir/relative = (q3 - q1)/median

Similar tags added for nv/cold/time/cpu and for CPU-only measures.

Validate relative-dispersion calculations before publishing relative noise
summaries so invalid centers or dispersion values do not produce misleading
summary entries.

* Prefer robust summaries in default output

Only flip visibility for nv/cold/cpu/time, nv/cold/gpu/time,
and nv/cpu_only/only:
  - hide mean
  - hide stdev/relative
  - show median
  - show ir/relative

* Use is_close where std::abs(act-exp) was used

* Revert "Prefer robust summaries in default output"

This reverts commit 9a0afc361c.

In effect, all robust statistics summary entries are hidden again,
and mean + stdev/relative are back to being the items displayed by default

* Address PR review feedback
2026-06-02 13:20:15 -05:00
Oleksandr Pavlyk
ee4b9f0963 Remove unused python_wheel section (#382)
ci/matrix.yaml contains unused section once intended for Python wheels
2026-06-01 14:04:38 -05:00
Oleksandr Pavlyk
97c8b29f5a Updated devcontainer imageset to 26.08 (#381)
Add CTK 13.2 with compact support for host compilers:
   - gcc 11 (min), gcc 13 (working), gcc 15 (max)
   - llvm15 (min), llvm 21 (max)
   - CL 14.44
2026-06-01 11:02:40 -05:00
Oleksandr Pavlyk
7ba2b79d5b Reduce stdrel criterion complexity and ensure termination (#374)
* Reduce stdrel criterion complexity and ensure termination

Replace the stdrel criterion's growing sample history with an online
mean/variance accumulator. This keeps the stopping criterion based on
relative standard deviation, preserves the unbiased standard-deviation
estimate used for convergence, and reduces per-sample update work from
recomputing over the full history to constant time.

Add a bounded invalid-noise path so measurements that persistently produce
non-finite relative noise, such as all-zero timings, can terminate without
waiting for the wall-time timeout. Keep the normal min-time gate for ordinary
stdrel convergence.

Add focused tests for the online accumulator, stdrel sample-count threshold,
sample-standard-deviation behavior, deterministic convergence inputs, and
persistent invalid-noise termination. Update the CLI help for the stdrel
termination behavior.
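
A constant-time online accumulator of this kind can be sketched in Python with Welford's update. The commit does not say which update formula the C++ accumulator uses, so Welford's algorithm is an assumption here, and the class and method names are illustrative.

```python
import math


class OnlineMeanVariance:
    """Running mean and unbiased variance with O(1) work per sample."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    def sample_variance(self):
        # The unbiased estimate needs at least two samples.
        if self.count < 2:
            return math.nan
        return self._m2 / (self.count - 1)

    def relative_stdev(self):
        """Sample standard deviation divided by the mean, NaN if undefined."""
        variance = self.sample_variance()
        if math.isnan(variance) or self.mean == 0:
            return math.nan
        return math.sqrt(variance) / self.mean
```

An all-zero input gives a NaN relative noise in this sketch, which is the kind of persistently invalid value the bounded invalid-noise path is meant to handle.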

* change max-noise to  for consistency

* Use online_mean_variance on m_noise_tracker in is_finished()

Previously, standard deviation call was made using current
noise level instead of mean noise level. Because of identity

E[ (N - C)^2 ] =
    E[ (N - E[N])^2 ] + (E[N] - C)^2 >= E[ (N - E[N])^2 ]

this led to the criterion terminating later than it could have, because
the estimated expectation is always greater than or equal to the
estimate relative to the mean.
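
For completeness, the identity follows from adding and subtracting E[N] inside the square, treating C as a constant with respect to the expectation (an assumption of this sketch):

```latex
\begin{aligned}
\mathbb{E}\left[(N - C)^2\right]
  &= \mathbb{E}\left[\bigl((N - \mathbb{E}[N]) + (\mathbb{E}[N] - C)\bigr)^2\right] \\
  &= \mathbb{E}\left[(N - \mathbb{E}[N])^2\right]
     + 2\,(\mathbb{E}[N] - C)\,\mathbb{E}\left[N - \mathbb{E}[N]\right]
     + (\mathbb{E}[N] - C)^2 \\
  &= \mathbb{E}\left[(N - \mathbb{E}[N])^2\right] + (\mathbb{E}[N] - C)^2 .
\end{aligned}
```

The cross term vanishes because E[N - E[N]] = 0, and the remaining extra term is a square, hence non-negative.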

The code used the current noise level instead of the mean to avoid needing to
make two passes through the m_noise_tracker container.

Using online_mean_variance improves the accuracy of estimating the
dispersion of the noise signal while maintaining a single pass through the
container.

* Address review feedback

Fixed misleading commit. Introduce private methods to refactor
computation of repeated expressions.

Renamed m_cuda_times_summary to m_measurements_summary, since
criterion can be applied for CPU-only measurements too.

Introduced is_close utility for checking whether two floating
point numbers are close to one another.

Introduced descriptive constexpr variables for hard-wired
constants
2026-05-29 17:06:28 +00:00
omribz156
ec025d7e0d docs: separate measurement options from stopping criteria (#373)
Signed-off-by: Omri SirComp <omribz156@gmail.com>
2026-05-28 16:51:12 -05:00
Oleksandr Pavlyk
6bdbff7f21 include cleanup across nvbench/ (#377)
Added missing direct standard includes for entities such as std::size_t,
std::move, std::vector, std::optional, std::exception, std::memcpy, etc.

Added missing project include in nvbench/internal/table_builder.cuh for
nvbench::detail::transform_reduce.

Fixed nvbench/detail/gpu_frequency.cuh to forward-declare nvbench::cuda_stream
in nvbench namespace instead of in nvbench::detail namespace.
2026-05-28 16:40:30 -05:00
Oleksandr Pavlyk
84c7952f8b nvbench::cpu_timer changed to use steady_clock (#371)
Using steady_clock is more appropriate for timing measurements.
It guarantees that duration computed from two time-points will not
contain correction deltas.
2026-05-20 10:22:22 -05:00
mfranzrebsal
4a33a61591 Add Windows support (#354) 2026-05-19 15:10:58 -05:00