Commit Graph

798 Commits

Author SHA1 Message Date
Oleksandr Pavlyk
d230a16e2b Tighten statistics and timeout warning tests
Document that percentile helpers return quiet NaNs for NaN-containing inputs.

Make quartile expected-value tests compute ranks from the documented
round(p / 100 * (n - 1)) rule instead of reusing statistics::percentile_rank(),
so rank regressions are caught independently.

Extend timeout-warning coverage to exercise the too-few-samples max-noise path
in addition to unavailable, invalid, and infinite stdev-noise inputs.
2026-06-28 08:50:21 -05:00
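A minimal Python sketch of the rank rule quoted in d230a16e2b, for readers who want to check expected values by hand. The function names mirror the commit message, but the bodies are illustrative and not the nvbench C++ code. Two assumptions are made: halves round up (the commit only says `round(...)`), and empty input yields NaN.

```python
import math

def percentile_rank(p: float, n: int) -> int:
    """Zero-based nearest rank round(p / 100 * (n - 1)) for n sorted samples."""
    assert n > 0 and 0.0 <= p <= 100.0
    # Assumption: halves round up; the commit message does not say which way.
    return int(math.floor(p / 100.0 * (n - 1) + 0.5))

def percentile(samples, p: float) -> float:
    values = list(samples)
    # Any NaN in the input makes the result NaN.
    # Assumption: empty input is treated the same way.
    if not values or any(math.isnan(v) for v in values):
        return math.nan
    values.sort()
    return values[percentile_rank(p, len(values))]
```

For five samples the quartile ranks come out as 1, 2, and 3, which is the kind of fixed expectation a test can state without calling the helper under test.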
Oleksandr Pavlyk
36d8c5ba46 Test nullopt explicitly in warning check test
check_noise_warning() now takes std::optional<nvbench::float64_t>,
matching the production helper, and the test now covers
std::nullopt explicitly in addition to NaN, negative, and +inf.
2026-06-28 08:22:03 -05:00
Oleksandr Pavlyk
e99ae66989 timeout_warnings now treats engaged NaN and negative stdev noise as unavailable
Add a focused test target, nvbench.test.measure_timeout_warnings, covering:

  - NaN stdev noise -> “unable to estimate noise”
  - negative stdev noise -> “unable to estimate noise”
  - +inf stdev noise -> “over noise threshold”
2026-06-28 07:43:29 -05:00
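The mapping listed in e99ae66989 can be sketched as a small classifier. This is a hedged illustration: `classify_noise_warning` is a hypothetical name, `None` stands in for `std::nullopt`, and returning no warning for a value within the threshold is an assumption.

```python
import math
from typing import Optional

def classify_noise_warning(stdev_noise: Optional[float],
                           max_noise: float) -> Optional[str]:
    """Pick the timeout warning text for a stdev-noise estimate (hypothetical helper)."""
    # A missing value, NaN, or a negative value all mean "no usable estimate".
    if stdev_noise is None or math.isnan(stdev_noise) or stdev_noise < 0.0:
        return "unable to estimate noise"
    # +inf compares greater than any finite threshold.
    if stdev_noise > max_noise:
        return "over noise threshold"
    # Assumption: no noise warning when the estimate is within the threshold.
    return None
```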
Oleksandr Pavlyk
bb0f90f1a0 Preserve stdev noise summaries for low sample counts
Keep legacy stdev/relative summary tags present even when too few
samples are available to compute a meaningful standard-deviation noise
estimate. Use the standard-deviation unavailable sentinel for those
values so existing summary consumers continue to see the expected tags.

Factor the sentinel into the statistics helpers and use it from both
standard_deviation() and stdev_noise_or_sentinel(), keeping the schema
compatibility behavior explicit and tested.
2026-06-28 07:14:02 -05:00
Oleksandr Pavlyk
55467266d9 test_compute_standard_deviation_noise exercises other invalid inputs 2026-06-28 07:14:02 -05:00
Oleksandr Pavlyk
b6b0dd1dd4 Collapsed two branches with identical bodies 2026-06-28 07:14:02 -05:00
Oleksandr Pavlyk
caa2f466c8 Check consistency of sort- vs. select-based quartiles using threshold constant
Expose quartile threshold value, use it in testing to test around that value.
2026-06-26 17:02:45 -05:00
Oleksandr Pavlyk
6b85a9b709 Add static assertion that ValueType is a floating-point type 2026-06-26 16:35:24 -05:00
Oleksandr Pavlyk
b0932b09f0 Refactor logic of emitting warnings between cold and cpu-only measures
Introduce new header file with inline implementation. Use it
from measure_cold.cuh and measure_cpu_only.cxx
2026-06-26 16:28:34 -05:00
Oleksandr Pavlyk
7069a6b888 Add comment re magic sort/select threshold value 2026-06-26 16:28:34 -05:00
Oleksandr Pavlyk
86eb2a8ddd Add tests for handling of NaNs in quartile routine inputs 2026-06-26 16:28:34 -05:00
Oleksandr Pavlyk
04290dd71c Add NaN guards to percentiles and quartiles computation routines 2026-06-26 16:28:34 -05:00
Oleksandr Pavlyk
d9cdd8bd1e Test quartile values across selection threshold
Add fixed expected-value assertions for quartile tests around the
sort/selection switch point, including duplicate-heavy inputs. This keeps the
tests from only proving that both implementations agree with each other.
2026-06-26 16:28:34 -05:00
Oleksandr Pavlyk
8dc36a6d79 Generate cold summaries only if some accepted samples have been accumulated
Cold measurement can discard throttled trials before incrementing the accepted
sample count, then stop on timeout with zero recorded samples. In that case,
only emit the sample-size summary and skip derived timing, bandwidth, clock, and
bulk summaries that require accepted samples.

This avoids divide-by-zero mean calculations and quartile/IQR computation over
empty sample vectors.

Keep timeout diagnostics reachable for zero-sample runs and add an explicit
warning when no accepted cold samples were recorded. Factor timeout warning
emission into a private helper so the zero-sample and normal paths share the
same diagnostic logic.

Suppress low-sample relative stdev noise

Add a statistics helper that returns no relative standard-deviation noise until
there are enough samples for a meaningful estimate. Use it for cold CPU/GPU and
CPU-only summaries so the low-sample +inf stdev sentinel is not published as
real relative noise or used for max-noise timeout warnings.

Add statistics coverage for suppressing the low-sample sentinel and computing
relative stdev noise once the sample threshold is reached.

compute_standard_deviation_noise returns nullopt if the standard deviation is not finite

Tests verify that noise is nullopt when not enough samples are accumulated

Added statistics::has_enough_samples_for_noise_estimate(...)

Used it in standard_deviation, compute_standard_deviation_noise,
compute_robust_noise.

Added timeout diagnostics in cold and CPU-only paths.
If max-noise is configured and the run timed out before enough
samples exist to estimate noise, the log now says that explicitly;
otherwise the existing “over noise threshold” warning remains
unchanged.

Added a statistics test assertion for the new sample-count
predicate.
2026-06-26 16:28:34 -05:00
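A rough Python sketch of the low-sample guard described in 8dc36a6d79. The function names follow the commit message, but the threshold value (`MIN_SAMPLES_FOR_NOISE = 5`) and the zero-mean check are assumptions made for illustration; the commit does not state them.

```python
import math
import statistics
from typing import Optional, Sequence

MIN_SAMPLES_FOR_NOISE = 5  # hypothetical threshold; not taken from the source

def has_enough_samples_for_noise_estimate(n: int) -> bool:
    return n >= MIN_SAMPLES_FOR_NOISE

def compute_standard_deviation_noise(samples: Sequence[float]) -> Optional[float]:
    """Relative stdev (stdev / mean), or None when no meaningful estimate exists."""
    if not has_enough_samples_for_noise_estimate(len(samples)):
        return None  # also covers zero samples, so the mean is never 0 / 0
    mean = statistics.fmean(samples)
    stdev = statistics.stdev(samples)  # sample (n - 1) standard deviation
    # Assumption: a zero mean makes the relative value meaningless.
    if not math.isfinite(stdev) or mean == 0.0:
        return None
    return stdev / mean
```

Returning `None` rather than a sentinel such as +inf keeps a missing estimate from being compared against a max-noise threshold by accident.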
Oleksandr Pavlyk
86e1c2c881 Duplicate-heavy boundary test is added
Prepare duplicate-heavy input and check the sort-based
quartile computation result against the selection-based one.

std::nth_element only guarantees that the nth element
is the value that would appear there in sorted order;
it does not fully sort equal partitions. Bugs in the
selection implementation, especially when selecting Q1
from the left half and Q3 from the right half after
selecting the median, are more likely to show up when
many samples equal the quartile values.
2026-06-26 16:28:34 -05:00
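The concern in 86e1c2c881 can be reproduced outside C++. The sketch below is an illustrative Python stand-in: `nth_element` here is a hand-written quickselect with a three-way partition, not `std::nth_element`, and the rank rule assumes halves round up. It compares the sort-based and selection-based quartiles on an input where many samples equal the quartile values.

```python
import math
import random

def rank(p, n):
    # Nearest rank round(p / 100 * (n - 1)); assumption: halves round up.
    return int(math.floor(p / 100.0 * (n - 1) + 0.5))

def nth_element(a, lo, nth, hi):
    """Rearrange a[lo:hi] in place so a[nth] holds its sorted-order value."""
    while hi - lo > 1:
        pivot = a[random.randrange(lo, hi)]
        lt, i, gt = lo, lo, hi
        # Three-way partition: [lo, lt) < pivot, [lt, gt) == pivot, [gt, hi) > pivot.
        while i < gt:
            if a[i] < pivot:
                a[lt], a[i] = a[i], a[lt]
                lt += 1
                i += 1
            elif a[i] > pivot:
                gt -= 1
                a[gt], a[i] = a[i], a[gt]
            else:
                i += 1
        if nth < lt:
            hi = lt
        elif nth >= gt:
            lo = gt
        else:
            return  # nth falls inside the run of values equal to the pivot

def quartiles_by_sorting(samples):
    a = sorted(samples)
    return tuple(a[rank(p, len(a))] for p in (25, 50, 75))

def quartiles_by_selection(samples):
    a = list(samples)  # work on a copy, input stays untouched
    n = len(a)
    r1, r2, r3 = (rank(p, n) for p in (25, 50, 75))
    nth_element(a, 0, r2, n)           # median over the whole range
    if r1 < r2:
        nth_element(a, 0, r1, r2)      # Q1 from the part left of the median
    if r3 > r2:
        nth_element(a, r2 + 1, r3, n)  # Q3 from the part right of the median
    return a[r1], a[r2], a[r3]
```

Excluding the median position from the two follow-up selections matters: a second selection over a range that still contains the median slot could move a different value into it.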
Oleksandr Pavlyk
214d286247 Replace forwarding with semantically more accurate std::move
Also add comment within percentile_rank to document precondition
on input values checked with assert statement.

Also, sharpened the comment around percentile_rank function
2026-06-26 16:28:34 -05:00
Oleksandr Pavlyk
b1c61e5109 Rename IR to IQR in */ir/absolute and */ir/relative tags (#385) 2026-06-22 10:22:29 -07:00
Oleksandr Pavlyk
56d552687e Build and test cuda-bench wheels for Python 3.10-3.14 (#380)
Updated devcontainer image to 26.08 and CUDA 13.0.2 for 3.11-3.14,
but continue with 25.12 and CUDA 13.0.1 for Python 3.10, because the RAPIDS
team maintaining the ci-wheel images has dropped support for it in newer
versions of the container
2026-06-04 10:14:35 -04:00
Oleksandr Pavlyk
0dc93b0c0e Introduce robust metrics (#379)
* Add statistics utilities to compute quartiles

Quartiles are computed using nearest rank method.

Two implementations are provided:
  1. Sort-based:
     a. sort array
     b. extract values at ranks of interest
  2. Selection based:
     a. Run nth_element to find median on whole range
     b. Run nth_element on left side to find first quartile
     c. Run nth_element on right side to find third quartile

Public API copies input into temporary vector which is mutated as needed.

Public API uses sort-based implementation for small arrays ( <= 4096 elements),
and selection-based implementation for larger arrays.

Sort-based implementation can support computation of arbitrary percentiles,
which could be useful later if more extreme statistics are needed.

Add tests covering percentile and quartile edge cases, input iterators,
selection-vs-sorting agreement, empty and singleton inputs, and relative
dispersion validation.

* Add quartiles information to summaries

Use the quartile helpers to report robust cold and CPU-only timing summaries:
Q1, median, Q3, interquartile range, and relative interquartile range.
These values stay hidden.

Summary tags are nv/cold/time/gpu/q1, nv/cold/time/gpu/median,
nv/cold/time/gpu/q3, nv/cold/time/gpu/ir/absolute, nv/cold/time/gpu/ir/relative

ir/absolute = q3 - q1, ir/relative = (q3 - q1)/median

Similar tags added for nv/cold/time/cpu and for CPU-only measures.

Validate relative-dispersion calculations before publishing relative noise
summaries so invalid centers or dispersion values do not produce misleading
summary entries.

* Prefer robust summaries in default output

Only flip visibility for nv/cold/cpu/time, nv/cold/gpu/time,
and nv/cpu_only/only:
  - hide mean
  - hide stdev/relative
  - show median
  - show ir/relative

* Use is_close where std::abs(act-exp) was used

* Revert "Prefer robust summaries in default output"

This reverts commit 9a0afc361c.

In effect, all robust-statistics summary entries are hidden again,
and mean + stdev/relative are back to being the items displayed by default.

* Address PR review feedback
2026-06-02 13:20:15 -05:00
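The two dispersion formulas stated in 0dc93b0c0e (`ir/absolute = q3 - q1`, `ir/relative = (q3 - q1)/median`; the tags were later renamed to IQR in b1c61e5109) can be written down directly. The validation rules below are an assumption about what "invalid centers or dispersion values" means, and the function names are hypothetical.

```python
import math
from typing import Optional

def iqr_absolute(q1: float, q3: float) -> float:
    return q3 - q1

def iqr_relative(q1: float, median: float, q3: float) -> Optional[float]:
    """(q3 - q1) / median, or None when the ratio would be misleading."""
    spread = iqr_absolute(q1, q3)
    # Assumption: a valid center is finite and positive.
    if not (math.isfinite(median) and median > 0.0):
        return None
    # Assumption: a valid dispersion is finite and non-negative.
    if not (math.isfinite(spread) and spread >= 0.0):
        return None
    return spread / median
```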
Oleksandr Pavlyk
ee4b9f0963 Remove unused python_wheel section (#382)
ci/matrix.yaml contains unused section once intended for Python wheels
2026-06-01 14:04:38 -05:00
Oleksandr Pavlyk
97c8b29f5a Updated devcontainer imageset to 26.08 (#381)
Add CTK 13.2 with compact support for host compilers:
   - gcc 11 (min), gcc 13 (working), gcc 15 (max)
   - llvm15 (min), llvm 21 (max)
   - CL 14.44
2026-06-01 11:02:40 -05:00
Oleksandr Pavlyk
7ba2b79d5b Reduce stdrel criterion complexity and ensure termination (#374)
* Reduce stdrel criterion complexity and ensure termination

Replace the stdrel criterion's growing sample history with an online
mean/variance accumulator. This keeps the stopping criterion based on
relative standard deviation, preserves the unbiased standard-deviation
estimate used for convergence, and reduces per-sample update work from
recomputing over the full history to constant time.

Add a bounded invalid-noise path so measurements that persistently produce
non-finite relative noise, such as all-zero timings, can terminate without
waiting for the wall-time timeout. Keep the normal min-time gate for ordinary
stdrel convergence.

Add focused tests for the online accumulator, stdrel sample-count threshold,
sample-standard-deviation behavior, deterministic convergence inputs, and
persistent invalid-noise termination. Update the CLI help for the stdrel
termination behavior.

* change max-noise to  for consistency

* Use online_mean_variance on m_noise_tracker in is_finished()

Previously, standard deviation call was made using current
noise level instead of mean noise level. Because of identity

E[ (N - C)^2 ] =
    E[ (N - E[N])^2 ] + (E[N] - C)^2 >= E[ (N - E[N])^2 ]

this led to the criterion terminating later than it could have, because
the estimated expectation is always greater than or equal to the
estimate relative to the mean.

The code used the current noise level instead of the mean to avoid needing to
make two passes through the m_noise_tracker container.

Use of online_mean_variance improves the accuracy of estimating the
dispersion of the noise signal while maintaining a single pass through the
container.

* Address review feedback

Fixed misleading commit. Introduce private methods to refactor
computation of repeated expressions.

Renamed m_cuda_times_summary to m_measurements_summary, since
criterion can be applied for CPU-only measurements too.

Introduced is_close utility for checking whether two floating
point numbers are close to one another.

Introduced descriptive constexpr variables for hard-wired
constants
2026-05-29 17:06:28 +00:00
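A single-pass mean/variance accumulator of the kind 7ba2b79d5b describes can be sketched with Welford's update. The commit does not name the algorithm, so treating it as Welford's is an assumption, and `OnlineMeanVariance` is an illustrative Python class rather than the nvbench `online_mean_variance` type.

```python
import math

class OnlineMeanVariance:
    """Single-pass accumulator using Welford's update (illustrative only)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        # Constant work per sample; no history is stored.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    def variance(self) -> float:
        """Sample variance with the n - 1 divisor; NaN until two samples are seen."""
        return self._m2 / (self.count - 1) if self.count > 1 else math.nan

    def stdev(self) -> float:
        return math.sqrt(self.variance())

    def mean_square_about(self, c: float) -> float:
        """E[(N - c)^2] computed via the identity Var + (E[N] - c)^2."""
        return self._m2 / self.count + (self.mean - c) ** 2
```

As a worked check of the identity in the commit message, take the samples 2, 4, 4, 4, 5, 5, 7, 9 and C = 9. The mean squared deviation from C is 160 / 8 = 20, while the spread about the mean is 32 / 8 = 4 and (5 - 9)^2 = 16, so measuring against the latest value overstates the dispersion by 16.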
omribz156
ec025d7e0d docs: separate measurement options from stopping criteria (#373)
Signed-off-by: Omri SirComp <omribz156@gmail.com>
2026-05-28 16:51:12 -05:00
Oleksandr Pavlyk
6bdbff7f21 include cleanup across nvbench/ (#377)
Added missing direct standard includes for entities such as std::size_t,
std::move, std::vector, std::optional, std::exception, std::memcpy, etc.

Added missing project include in nvbench/internal/table_builder.cuh for
nvbench::detail::transform_reduce.

Fixed nvbench/detail/gpu_frequency.cuh to forward-declare nvbench::cuda_stream
in nvbench namespace instead of in nvbench::detail namespace.
2026-05-28 16:40:30 -05:00
Oleksandr Pavlyk
84c7952f8b nvbench::cpu_timer changed to use steady_clock (#371)
Using steady_clock is more appropriate for timing measurements.
It guarantees that duration computed from two time-points will not
contain correction deltas.
2026-05-20 10:22:22 -05:00
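The reasoning in 84c7952f8b has a direct Python analog: `time.monotonic()` plays the role of `std::chrono::steady_clock`, while `time.time()` follows the adjustable system clock. The class below is only an analogy for the idea and is not the nvbench `cpu_timer`.

```python
import time

class CpuTimer:
    """Interval timer on a monotonic clock (illustrative Python analog)."""

    def __init__(self) -> None:
        self._start = 0.0
        self._stop = 0.0

    def start(self) -> None:
        self._start = time.monotonic()

    def stop(self) -> None:
        self._stop = time.monotonic()

    def duration(self) -> float:
        # Never negative: the monotonic clock cannot go backwards.
        return self._stop - self._start
```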
mfranzrebsal
4a33a61591 Add Windows support (#354) 2026-05-19 15:10:58 -05:00
Oleksandr Pavlyk
3d82e58170 Fix docutil error when building docs (#365) 2026-05-18 10:57:19 -05:00
Oleksandr Pavlyk
4472e7b59b Add python api for cold warmup parameters (#363) 2026-05-18 10:56:44 -05:00
Oleksandr Pavlyk
ce75dab94b Add stopping criterion sample count (#341)
* Implement sample-count stopping criterion with parameter target-samples

--stopping-criterion sample-count --target-samples 100 would stop once
max(--min-samples, --target-samples) samples are collected

* Address review nitpicks
2026-05-15 15:15:12 -05:00
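The stopping rule stated in ce75dab94b reduces to a single comparison. A minimal sketch, with a hypothetical function name:

```python
def sample_count_is_finished(collected: int, min_samples: int, target_samples: int) -> bool:
    """Stop once max(min_samples, target_samples) samples have been collected."""
    return collected >= max(min_samples, target_samples)
```

So `--target-samples 100` combined with a larger `--min-samples` value stops at the larger of the two.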
Oleksandr Pavlyk
6dd27aedfd Fix exception safety (#358)
Improve exception safety of timer structs by using local scope guards to ensure that cleanup steps, such as signaling the blocking kernel to unblock and making sure that the stream is synchronized, are performed even if the launch object throws an exception.

Tests of exception safety were added.

--

* blocking_kernel.unblock_noexcept() noexcept method added

This decouples the logic of signaling to unblock from checking
of the timeout.

* Improve exception safety in kernel_launch_timer

Introduce noexcept cleanup methods. Place body of start()
and stop() methods in the try/catch block and execute
noexcept clean-up on exception before rethrowing.

* Improve exception safety of measure_hot

* Make sure that throwing methods call noexcept ones instead of duplicating functionality

* Use cleanup_guard in measure_cold_base::kernel_launch_timer

Replace try/catch pattern with cleaner use of cleanup_guard
class.

* cpu_timer::start, cpu_timer::stop methods marked noexcept

These methods do not throw, and marking them noexcept explicitly
makes it fine to call them from other noexcept methods, such as
cleanup_noexcept in measure_cold.

* Address remaining exception safety issue in measure_hot

* Renamed guard variables to reflect their purpose, apply arm-then-do to ops queueing kernels

Set m_block_stream_armed = true; before launching the kernel. Doing so signals
cleanup guard that stream must be unblocked, even if launching of the kernel failed.

Same for operation launching time-stamps kernel.

* Add testing/device/exception_safety.cu

This test adds a benchmark that throws. It verifies that the benchmark did not
time out and that the control counters it maintains are at
the expected values.

* Refactor measurement cleanup guards for testability

Extract hot stream cleanup and cold launch timer cleanup into reusable
detail helpers. Keep measure_hot and measure_cold using those helpers through
thin adapters so the tested cleanup logic matches the production path.

Add driver-free cleanup guard tests using a fake measure object to verify
cleanup ordering when exceptions occur after blocking stream setup, after hot
unblock, and around cold GPU frequency start/stop paths.

* Implement cpu_timer_stop_noexcept in terms of cpu_timer_stop

The cpu_timer_stop is already noexcept by nature of implementation,
but we maintain cpu_timer_stop_noexcept method for symmetry with
other pairs sync_stream()/sync_stream_noexcept().

The cpu_timer_stop_noexcept() is implemented via cpu_timer_stop().
These methods are annotated __forceinline__, so the same code should be
generated.

* More readable initialization of bool members

* Moved exception_safety.cu back to testing/ folder

testing/device is reserved for tests that require locking
of GPU frequency per CMake option description.

* Fixed nitpick and bug it discovered

Changed testing/exception_safety.cu:237 so run_benchmark now iterates over every state
from bench.get_states() and checks each one is skipped with a reason
containing "requested".

That exposed a real runner behavior gap, so I also made a minimal fix in
nvbench/runner.cuh:120: after stop_runner_loop, remaining states are now explicitly
marked skipped with a reason instead of only printing a skip notification.

* Move static assertions (pertaining to cleanup guards) to
testing/cleanup_guards.cu

The CI failure with CTK 12.0 and certain version of GCC is caused
by OOM in cudafe++ process tripped by compiling instantiation
of contract verification on cold_launch_timer_probe struct.

As a work-around, this instantiation is excluded for CTK 12.0-12.6
2026-05-15 15:14:30 -05:00
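The arm-then-do pattern from 6dd27aedfd can be illustrated with a Python context manager. Everything here is a stand-in: `FakeStream`, `CleanupGuard`, and `run_once` are hypothetical names, and the fake stream only records the order of calls, in the spirit of the driver-free cleanup guard tests the commit mentions.

```python
class FakeStream:
    """Records calls instead of touching a GPU."""

    def __init__(self, fail_on_block: bool = False) -> None:
        self.fail_on_block = fail_on_block
        self.events = []

    def block(self) -> None:
        if self.fail_on_block:
            raise RuntimeError("launch of blocking kernel failed")
        self.events.append("block")

    def unblock_noexcept(self) -> None:
        self.events.append("unblock")

    def sync_noexcept(self) -> None:
        self.events.append("sync")

class CleanupGuard:
    def __init__(self, stream: FakeStream) -> None:
        self.stream = stream
        self.block_armed = False

    def __enter__(self) -> "CleanupGuard":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        if self.block_armed:
            self.stream.unblock_noexcept()
        self.stream.sync_noexcept()
        return False  # never swallow the exception

def run_once(stream: FakeStream, launch) -> None:
    with CleanupGuard(stream) as guard:
        guard.block_armed = True  # arm before the call that may fail
        stream.block()
        launch()

def failing_launch() -> None:
    raise RuntimeError("benchmark threw")

def events_after_failure(stream: FakeStream, launch) -> list:
    try:
        run_once(stream, launch)
    except RuntimeError:
        pass  # the guard has already cleaned up by the time we get here
    return stream.events
```

Arming before `stream.block()` means the unblock step still runs when the blocking launch itself fails, which is the case the commit calls out.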
Oleksandr Pavlyk
d63a2761eb Implement Timer, and support State.exec(fn, timer=True) (#364)
* Add type annotations for future functionality

```python
class Timer:
    def start(self) -> None: ...
    def stop(self) -> None: ...
```

and overloaded `State.exec` so:

  - normal mode accepts `Callable[[Launch], None]`
  - `timer=True` accepts `Callable[[Launch, Timer], None]`

No implementation yet. Type annotation checked with

```
(py313) :~/repos/nvbench/python$ python -m mypy --ignore-missing-imports /tmp/check_timer.py
/tmp/check_timer.py:24: error: No overload variant of "exec" of "State" matches argument types "Callable[[Launch], None]", "bool"  [call-overload]
/tmp/check_timer.py:24: note: Possible overload variants:
/tmp/check_timer.py:24: note:     def exec(self, Callable[[Launch], None], /, *, batched: bool | None = ..., sync: bool | None = ..., timer: Literal[False] = ...) -> None
/tmp/check_timer.py:24: note:     def exec(self, Callable[[Launch, Timer], None], /, *, timer: Literal[True], sync: bool | None = ...) -> None
/tmp/check_timer.py:25: error: Argument 1 to "exec" of "State" has incompatible type "Callable[[Launch, Timer], None]"; expected "Callable[[Launch], None]"  [arg-type]
/tmp/check_timer.py:26: error: No overload variant of "exec" of "State" matches argument types "Callable[[Launch, int], None]", "bool"  [call-overload]
/tmp/check_timer.py:26: note: Possible overload variants:
/tmp/check_timer.py:26: note:     def exec(self, Callable[[Launch], None], /, *, batched: bool | None = ..., sync: bool | None = ..., timer: Literal[False] = ...) -> None
/tmp/check_timer.py:26: note:     def exec(self, Callable[[Launch, Timer], None], /, *, timer: Literal[True], sync: bool | None = ...) -> None
Found 3 errors in 1 file (checked 1 source file)

(py313) :~/repos/nvbench/python$ nl -ba /tmp/check_timer.py
     1  # /tmp/check_nvbench_timer.py
     2  import cuda.bench as bench
     3
     4  def normal_ok(launch: bench.Launch) -> None:
     5      pass
     6
     7  def timer_ok(launch: bench.Launch, timer: bench.Timer) -> None:
     8      timer.start()
     9      timer.stop()
    10
    11  def missing_timer(launch: bench.Launch) -> None:
    12      pass
    13
    14  def extra_timer(launch: bench.Launch, timer: bench.Timer) -> None:
    15      pass
    16
    17  def wrong_timer_type(launch: bench.Launch, timer: int) -> None:
    18      pass
    19
    20  def state_bench(state: bench.State) -> None:
    21      state.exec(normal_ok)
    22      state.exec(normal_ok, timer=False)
    23      state.exec(timer_ok, timer=True)
    24      state.exec(missing_timer, timer=True)       # should fail
    25      state.exec(extra_timer)                     # should fail
    26      state.exec(wrong_timer_type, timer=True)    # should fail
```

* Implement cuda.bench.Timer object

The Timer class is not user-constructible. It exposes two nullary
methods timer.start() and timer.stop().

The instance of Timer class would be provided to launchable object
passed to State.exec with timer=True.

* Implement support for State.exec( launch_fn, timer=True)

* Change type annotation for batch to default to None

None is interpreted as `not timer`, i.e., it effectively
defaults to True (as before) for usage without timer set,
but defaults to `False` if `timer=True` is set.

The batched keyword type is `bool | None`.

* Implement default batched=None behavior

API allows one to specify all 3 keywords, sync, batched,
and timer. batched is None by default, run-time interpreted
as `(not timer)`.

* Update tests for new behavior of batched/timer combination

* Add python/examples/exec_tag_timer.py

* Expand Timer class and methods docstrings

* Reworked python/example/exec_tag_timer.py to align with C++ example.

* Replace ::cuda::std::name with cuda::std::name

* Resolve review feedback
2026-05-15 10:19:40 -05:00
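The default described in d63a2761eb, where `batched=None` is interpreted at run time as `(not timer)`, amounts to the following. `resolve_batched` is a hypothetical helper used only to spell out the rule; it is not part of `cuda.bench`.

```python
from typing import Optional

def resolve_batched(batched: Optional[bool], timer: bool) -> bool:
    """An explicit value wins; None falls back to `not timer`."""
    return (not timer) if batched is None else batched
```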
Oleksandr Pavlyk
44ec7de6bd Implement decorators to register benchmarks add axis and options (#347)
* Add decorators for registering benchmarks and adding axis

cuda.bench.register(fn) continues returning Benchmark, and supports
legacy use.

New signature added:
   cuda.bench.register():
      Returns a decorator

```
@bench.register()
@bench.axis.float64("Duration (s)", [7e-5, 1e-4, 5e-4])
@bench.option.min_samples(120)
def single_float64_axis(state: bench.State):
   ...
```

* Remove example/auto_throughput.py

The C++ counterpart's purpose is to demonstrate use of CUPTI
metrics, but these are not supported in Python bindings, so
this example is a duplicate of example/throughput.py

* Add wrong decorator order test for bench.axis.*

* Strengthen type annotation for register function

Acting on a CodeRabbit nit-pick, require that the function being
registered take a cuda.bench.State object as an argument.

Verified the fix as

```
(py313) :~/repos/nvbench/python$ python -m mypy --ignore-missing-import /tmp/t.py
/tmp/t.py:8: error: Argument 1 has incompatible type "Callable[[], None]"; expected "Callable[[State], None]"  [arg-type]
Found 1 error in 1 file (checked 1 source file)
(py313) :~/repos/nvbench/python$ nl -ba /tmp/t.py
     1  # /tmp/check_nvbench_register.py
     2  import cuda.bench as bench
     3
     4  @bench.register()
     5  def good(state: bench.State) -> None:
     6      pass
     7
     8  @bench.register()
     9  def bad() -> None:
    10      pass
```

* Replace use of global variable with thread-safe lru_cache

This improves thread-safety of module initialization.

* Abide by RUF005 linting rule

* Expand docstrings regarding cuda.bench.register() decorator

It explains to the user what the decorator does and provides
a concise usage example.

* Sharpen wording on exception maybe-thrown by decorator
2026-05-14 15:41:30 -05:00
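The "replace global variable with lru_cache" step in 44ec7de6bd follows a common Python idiom for lazy one-time initialization. The sketch below shows the idiom in general; `get_registry` and `init_log` are hypothetical names and say nothing about what `cuda.bench` actually caches.

```python
import functools

init_log = []  # only here to make the number of initializations observable

@functools.lru_cache(maxsize=None)
def get_registry() -> list:
    """Create the shared object on first use; later calls return the cached one."""
    init_log.append("initialized")
    return []
```

One caveat from the `functools` documentation: the cache itself is thread-safe, but the wrapped function may still run more than once if a second thread calls it before the first call has been cached.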
Oleksandr Pavlyk
338936b6fe Provide BenchmarkResult class for parsing JSON output of NVBench-instrumented benchmarks (#356)
Implements `cuda.bench.results.BenchmarkResult` class to represent data from JSON output of benchmark execution.

The class implements two class methods: `BenchmarkResult.from_json(filename : str | os.PathLike, *, metadata : Any = None)`, which expects the name of a well-formed JSON file, and `BenchmarkResult.empty(*, metadata : Any = None)`, intended to represent a failed result with reasons that can be recorded in metadata at the user's discretion.

The `BenchmarkResult` implements mapping interface, supporting `.keys()`, `.values()`, `.items()` methods, `__len__`, `__contains__`, `__getitem__` and `__iter__` special methods. 

Values in `BenchmarkResult` have type `cuda.bench.results.SubBenchmarkResult`, which implements a list-like interface, i.e. the `__len__`, `__getitem__`, and `__iter__` special methods. Values in this list-like structure correspond to measurements of individual states of a particular benchmark (the key in `BenchmarkResult`).

Elements of the `SubBenchmarkResult` structure have type `SubBenchmarkState`, which supports the mapping protocol with axis_values as a key and represents the data corresponding to measurements for a particular state (combination of settings for each axis).

The state provides `.samples` and `.frequencies` attributes storing raw execution duration values and estimates for average GPU frequencies. 

Example usage:

```
import array, numpy as np, cuda.bench.results

r = cuda.bench.results.BenchmarkResult("perf_data/axes_run1.json")

r["copy_sweep_grid_shape"].centers_with_frequencies(
     lambda t, f: np.median(np.asarray(t)*np.asarray(f)))

```

```
In [1]: import array, numpy as np, cuda.bench.results

In [2]: r = cuda.bench.results.BenchmarkResult("temp_data/axes_run1.json")

In [3]: list(r)
Out[3]:
['simple',
 'single_float64_axis',
 'copy_sweep_grid_shape',
 'copy_type_sweep',
 'copy_type_conversion_sweep',
 'copy_type_and_block_size_sweep']

In [4]: r["simple"].centers(lambda t: np.percentile(t, [25,75]))
Out[4]: {'Device=0': array([0.00100966, 0.00101299])}

In [5]: r.centers(lambda t: np.percentile(t, [25,75]))["simple"]
Out[5]: {'Device=0': array([0.00100966, 0.00101299])}

In [6]: len(r)
Out[6]: 6

In [7]: "fake" in r
Out[7]: False
```

Each `SubBenchmarkState` exposes a `.summaries` attribute, a rich object that retains tag/name/hint/hide/description metadata.

* Add nvbench-json-summary to render NVBench JSON output as an NVBench-style
markdown summary table, including axis formatting, device sections, hidden
summary filtering, and summary hint formatting.

Update packaging, type stubs, and tests for the new namespace, renamed
classes, Python 3.10-compatible annotations, and summary-table generation.

* Split tests in test_benchmark_result into smaller tests

* Fix break due to file name change

* Add python/examples/benchmark_result_autotune.py

This example demonstrates using cuda.bench and cuda.bench.results
to implement simple auto-tuning: selecting the
tile shape hyperparameter for a naive stencil kernel implemented
in numba-cuda.

* Resolve ruff PLE0604

* Fix for format_axis_value in json format script to handle None value

Add tests to cover such input.

* Address code rabbit review feedback

* Fix license header, add validation

* Addressed both issues raised in review

Malformed values are now represented in result as None.

Skipped benchmarks are no longer dropped, i.e., they are present
in BenchmarkResult data, but they are not reflected in summary
table in line with what NVBench-instrumented benchmarks do.
2026-05-13 13:23:58 -05:00
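The container shape described in 338936b6fe (a mapping of benchmark names to list-like per-state results, each with a `centers` reduction) can be mimicked with the standard `collections.abc` base classes. `ToyResult` and `ToyStates` are made-up stand-ins that copy only the shape of the calls shown in the transcript above; they are not the `cuda.bench.results` classes and they do not parse NVBench JSON.

```python
import statistics
from collections.abc import Mapping, Sequence

class ToyStates(Sequence):
    """List-like container of (state name, samples) pairs."""

    def __init__(self, states) -> None:
        self._states = list(states)

    def __len__(self) -> int:
        return len(self._states)

    def __getitem__(self, index):
        return self._states[index]

    def centers(self, fn) -> dict:
        # Reduce each state's samples to one value with the user-supplied function.
        return {name: fn(samples) for name, samples in self._states}

class ToyResult(Mapping):
    """Mapping from benchmark name to ToyStates."""

    def __init__(self, data) -> None:
        self._data = {name: ToyStates(states) for name, states in data.items()}

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)

    def centers(self, fn) -> dict:
        return {name: states.centers(fn) for name, states in self._data.items()}
```

Deriving from `Mapping` supplies `keys()`, `values()`, `items()`, and `__contains__` once the three abstract methods are defined, which matches the list of methods in the commit message.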
Oleksandr Pavlyk
6df6dc8d89 Enable building of NVBench on Windows (#362)
* Enable building of NVBench on Windows, no testing

* Add guard to disable nvbench-windows for now
2026-05-13 13:16:41 -04:00
Oleksandr Pavlyk
f14055d5cc Change CMake's nvbench::main exported target to correspond to static library (#350)
Previously, it corresponded to main.cu.o object file. Now it corresponds to
static library libnvbench_main.a which is archive file with main.cu.o object
in it.

This closes #349
2026-05-13 13:10:44 -04:00
Oleksandr Pavlyk
9ea77bccaa Implement CLI option to control warmups for cold measurements (#339)
* Implement warmup-runs count, supported as CLI

CLI option --warmup-runs implemented and documented.

The warm-up count is enforced to always be positive.
This is necessary to ensure that JIT-ting has occurred,
and use of blocking kernel would not result in time-outs.

A test for the option parser is added.

* Ensure that measure_cold::run_warmup instantiates blocking kernel

Because warm-up runs are executed without use of blocking kernel,
the blocking kernel was not jitted until actual measurements were
collected. The module loading cost incurred during the first run
shows as elevated CPU time noise value for the first measurement
as noted in https://github.com/NVIDIA/nvbench/pull/339

This PR adds `this->block_stream(); this->unblock_stream();` prior
to executing warm-up loop with use of blocking kernel disabled.

This ensures that the blocking kernel is instantiated during the warm-up,
but no other kernel is launched between its launch and the stream sync,
thus avoiding deadlock.

* Rename --warmup-runs to --cold-warmup-runs, add --cold-max-warmup-walltime

Since configurable number of warmups only applies to measure_cold.cuh
rename the CLI option to reflect that.

Also add --cold-max-warmup-walltime (defaults to -1, i.e. disabled).
If enabled, exits the warmup loop before the requested count is reached if
the wall-time expended executing warmups exceeds this max-warmup-walltime
value.
2026-05-12 14:30:08 -05:00
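The interaction between `--cold-warmup-runs` and `--cold-max-warmup-walltime` in 9ea77bccaa can be sketched as a bounded loop. This is a hedged illustration: `run_warmup` here is a plain Python function rather than `measure_cold::run_warmup`, and checking the wall-time after each run (so at least one warm-up always executes) is an assumption.

```python
import time

def run_warmup(launch, warmup_runs: int, max_walltime: float = -1.0) -> int:
    """Run up to warmup_runs warm-ups and return how many actually ran."""
    if warmup_runs < 1:
        raise ValueError("warm-up count must be positive")
    start = time.monotonic()
    completed = 0
    for _ in range(warmup_runs):
        launch()
        completed += 1
        # Assumption: any negative limit disables the wall-time check.
        if max_walltime >= 0.0 and time.monotonic() - start > max_walltime:
            break
    return completed
```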
Oleksandr Pavlyk
ebf9f9a087 Add .coderabbit.yaml following in footsteps of CCCL (#359) 2026-05-12 13:55:46 -05:00
Oleksandr Pavlyk
7dfbcad27c Create directories for output files (#360)
* QOL UX, NVBench creates directories for output JSON, MD, CSV files

This closes #185 and supports specifying
`--json path/to/nonexistent/folder/result.json`

This creates the sequence of folders in which to place result.json

```
(py313) :~/repos/nvbench$ rm -rf /tmp/nested/
(py313) :~/repos/nvbench$ ./build2/bin/nvbench.example.cpp20.axes -b copy_type_and_block_size_sweep -a Type=I32 -a BlockSize=64 --jsonbin /tmp/nested/json/axes.json --md /tmp/nested/md/res.md --csv /tmp/nested/csv/res.csv > /dev/null 2>&1
(py313) :~/repos/nvbench$ tree /tmp/nested/
/tmp/nested/
├── csv
│   └── res.csv
├── json
│   ├── axes.json
│   ├── axes.json-bin
│   │   └── 0.bin
│   └── axes.json-freqs-bin
│       └── 0.bin
└── md
    └── res.md

6 directories, 5 files
```

* Add a test that non-existent output folder is created

* Remove throwing custom error message. Use default

* Replace static_assert(false, ...) with #error
2026-05-12 10:26:28 -05:00
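The behavior added in 7dfbcad27c, creating missing parent directories before writing an output file, looks like this in Python. It illustrates the idea only; NVBench does this in C++, and `open_output` and `demo` are hypothetical helpers.

```python
import tempfile
from pathlib import Path

def open_output(path):
    """Open path for writing, creating any missing parent directories first."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target.open("w", encoding="utf-8")

def demo() -> bool:
    # Write to a nested path that does not exist yet, inside a throwaway directory.
    with tempfile.TemporaryDirectory() as root:
        target = Path(root) / "nested" / "json" / "result.json"
        with open_output(target) as f:
            f.write("{}")
        return target.is_file()
```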
Oleksandr Pavlyk
d13a0fde32 Correct cuda cccl examples per change in api (#353) 2026-05-06 13:30:44 -05:00
Oleksandr Pavlyk
f392725015 Correct Python API signature of State.get_axis_values_as_strings (#346)
* Correct Python API signature of State.get_axis_values_as_strings

The C++ API has default boolean argument color, but Python API
declared no arguments.

Closes #345

* Also exercise invocation of get_axis_values_as_string with keyword argument value

* Remove use of cuda.core.experimental
2026-05-04 08:40:29 -05:00
Oleksandr Pavlyk
a3364ca5c7 Port changes to the package from #323 (#337)
Fixed relative text alignment in docstrings to fix autodoc warnings

Renamed cuda.bench.test_cpp_exception and cuda.bench.test_py_exception functions
to start with underscore, signaling that these functions are internal and should
not be documented

Account for test_cpp_exceptions -> _test_cpp_exception, same for *_py_*

Make sure to reset __module__ of reexported symbols to be cuda.bench
2026-04-22 08:28:15 -05:00
Oleksandr Pavlyk
b0a46f44c2 Modularize color handling (#336)
* Introduce function colorize to modularize colorization/no-color handling

* Use sns.set_theme instead of deprecated sns.set()

* Use str.format instead of legacy % syntax

* Simplified iteration over list

Use f-string (supported since Python 3.6) instead of str.format for
better readability and performance
2026-04-14 08:09:44 -05:00
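A helper of the kind b0a46f44c2 introduces might look like the sketch below. The signature is an assumption (the commit only names the function `colorize`), and the ANSI escape format is the standard SGR sequence rather than anything taken from the script.

```python
def colorize(text: str, color_code: str, use_color: bool = True) -> str:
    """Wrap text in an ANSI color sequence, or return it unchanged when color is off."""
    if not use_color:
        return text
    return f"\033[{color_code}m{text}\033[0m"
```

Routing every colored string through one function keeps the color/no-color decision in a single place.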
pre-commit-ci[bot]
8d23e3e73c [pre-commit.ci] pre-commit autoupdate (#333)
* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/pre-commit/mirrors-clang-format: v21.1.8 → v22.1.2](https://github.com/pre-commit/mirrors-clang-format/compare/v21.1.8...v22.1.2)
- [github.com/astral-sh/ruff-pre-commit: v0.14.10 → v0.15.9](https://github.com/astral-sh/ruff-pre-commit/compare/v0.14.10...v0.15.9)
- [github.com/codespell-project/codespell: v2.4.1 → v2.4.2](https://github.com/codespell-project/codespell/compare/v2.4.1...v2.4.2)

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-13 16:24:55 +00:00
Oleksandr Pavlyk
e62c5b6f79 Correct description/hint entries for summaries with name "Noise" (#335)
See #334
2026-04-13 11:13:37 -05:00
Nader Al Awar
373970323f Merge pull request #331 from oleksandr-pavlyk/update-python-examples
Update python examples
2026-04-02 15:20:24 -04:00
Oleksandr Pavlyk
39730efbc3 Update requirements to reflect packages used by examples 2026-04-02 10:37:17 -05:00
Oleksandr Pavlyk
9f75642387 Add patch to cutlass.base_dsl.dsl.BaseDSL to work-around a bug
See https://github.com/NVIDIA/cutlass/issues/3142
2026-04-02 10:29:31 -05:00
Nader Al Awar
488173a242 Add --no-color flag to nvbench_compare.py which can be used for github issues and PRs python-0.2.1 2026-04-01 18:27:54 -04:00
Nader Al Awar
7a68e53df0 Rename flag from markdown to no-color 2026-04-01 17:01:29 -05:00
Nader Al Awar
7e5e784855 Add --markdown flag to nvbench_compare.py which can be used for github issues/prs 2026-04-01 14:53:13 -05:00