* Add statistics utilities to compute quartiles
Quartiles are computed using nearest rank method.
Two implementations are provided:
1. Sort-based:
a. sort array
b. extract values at ranks of interest
2. Selection based:
a. Run nth_element to find median on whole range
b. Run nth_element on left side to find first quartile
c. Run nth_element on right side to find third quartile
Public API copies input into temporary vector which is mutated as needed.
Public API uses sort-based implementation for small arrays (<= 4096 elements),
and selection-based implementation for larger arrays.
Sort-based implementation can support computation of arbitrary percentiles,
which could be useful later if more extreme statistics are needed.
Add tests covering percentile and quartile edge cases, input iterators,
selection-vs-sorting agreement, empty and singleton inputs, and relative
dispersion validation.
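
A minimal sketch of the two strategies described above, for illustration
only. The names quartiles_sketch, nearest_rank_index, quartiles_by_sort and
quartiles_by_selection are hypothetical and are not the helpers added by this
commit. The sketch assumes a non-empty input and takes the vector by value so
the caller's data is not mutated:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct quartiles_sketch
{
  double q1;
  double median;
  double q3;
};

// Zero-based index of the nearest-rank p-th percentile among n sorted
// values: ceil(p / 100 * n) - 1, computed in integer arithmetic.
inline std::size_t nearest_rank_index(std::size_t n, std::size_t p)
{
  assert(n > 0 && p > 0 && p <= 100);
  return (p * n + 99) / 100 - 1;
}

// Sort-based variant: sort the copy, then read the ranks of interest.
inline quartiles_sketch quartiles_by_sort(std::vector<double> v)
{
  std::sort(v.begin(), v.end());
  const std::size_t n = v.size();
  return {v[nearest_rank_index(n, 25)],
          v[nearest_rank_index(n, 50)],
          v[nearest_rank_index(n, 75)]};
}

// Selection-based variant: partition around the median, then select Q1 in
// the left part and Q3 in the right part.
inline quartiles_sketch quartiles_by_selection(std::vector<double> v)
{
  const std::size_t n = v.size();
  const auto i1 = static_cast<std::ptrdiff_t>(nearest_rank_index(n, 25));
  const auto im = static_cast<std::ptrdiff_t>(nearest_rank_index(n, 50));
  const auto i3 = static_cast<std::ptrdiff_t>(nearest_rank_index(n, 75));

  const auto first = v.begin();
  std::nth_element(first, first + im, v.end());
  // For tiny inputs the quartile ranks can coincide with the median rank,
  // in which case no further selection is needed on that side.
  if (i1 < im) { std::nth_element(first, first + i1, first + im); }
  if (i3 > im) { std::nth_element(first + im + 1, first + i3, v.end()); }
  return {first[i1], first[im], first[i3]};
}
```

Both variants read the same ranks, so they should agree on any input; the
selection variant avoids ordering elements that are never read.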
* Add quartiles information to summaries
Use the quartile helpers to report robust cold and CPU-only timing summaries:
Q1, median, Q3, interquartile range, and relative interquartile range.
These values stay hidden.
Summary tags are nv/cold/time/gpu/q1, nv/cold/time/gpu/median,
nv/cold/time/gpu/q3, nv/cold/time/gpu/ir/absolute, nv/cold/time/gpu/ir/relative
ir/absolute = q3 - q1, ir/relative = (q3 - q1)/median
Similar tags added for nv/cold/time/cpu and for CPU-only measures.
Validate relative-dispersion calculations before publishing relative noise
summaries so invalid centers or dispersion values do not produce misleading
summary entries.
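
One way the validation described above could look, as a sketch. The function
name relative_interquartile_range and the exact rejection rules (non-finite
inputs, negative range, non-positive median) are assumptions made for
illustration, not the checks implemented in this commit:

```cpp
#include <cassert>
#include <cmath>
#include <optional>

// Returns ir/relative = (q3 - q1) / median, or std::nullopt when the center
// or the dispersion is unusable and the entry should not be published.
inline std::optional<double>
relative_interquartile_range(double q1, double median, double q3)
{
  const double ir_absolute = q3 - q1;
  if (!std::isfinite(ir_absolute) || ir_absolute < 0.0) { return std::nullopt; }
  if (!std::isfinite(median) || median <= 0.0) { return std::nullopt; }

  const double ir_relative = ir_absolute / median;
  if (!std::isfinite(ir_relative)) { return std::nullopt; }

  assert(ir_relative >= 0.0); // follows from the checks above
  return ir_relative;
}
```

A caller would add the relative summary entry only when a value is returned.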
* Prefer robust summaries in default output
Only flip visibility for nv/cold/cpu/time, nv/cold/gpu/time,
and nv/cpu_only/only:
- hide mean
- hide stdev/relative
- show median
- show ir/relative
* Use is_close where std::abs(act-exp) was used
* Revert "Prefer robust summaries in default output"
This reverts commit 9a0afc361c.
As a result, all robust-statistics summary entries are hidden again,
and mean and stdev/relative are once more displayed by default.
* Address PR review feedback
* Reduce stdrel criterion complexity and ensure termination
Replace the stdrel criterion's growing sample history with an online
mean/variance accumulator. This keeps the stopping criterion based on
relative standard deviation, preserves the unbiased standard-deviation
estimate used for convergence, and reduces per-sample update work from
recomputing over the full history to constant time.
Add a bounded invalid-noise path so measurements that persistently produce
non-finite relative noise, such as all-zero timings, can terminate without
waiting for the wall-time timeout. Keep the normal min-time gate for ordinary
stdrel convergence.
Add focused tests for the online accumulator, stdrel sample-count threshold,
sample-standard-deviation behavior, deterministic convergence inputs, and
persistent invalid-noise termination. Update the CLI help for the stdrel
termination behavior.
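
For reference, a constant-time update of this kind can be written with
Welford's algorithm. The sketch below uses a hypothetical struct name,
running_stats_sketch, and is not the accumulator added by this commit; it only
illustrates the technique and the unbiased (n - 1) variance estimate:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

struct running_stats_sketch
{
  std::size_t count = 0;
  double mean       = 0.0;
  double m2         = 0.0; // sum of squared deviations from the running mean

  // O(1) update per sample; no sample history is stored.
  void add(double x)
  {
    ++count;
    const double delta = x - mean;
    mean += delta / static_cast<double>(count);
    m2 += delta * (x - mean);
  }

  // Unbiased sample variance; needs at least two samples.
  double sample_variance() const
  {
    assert(count >= 2);
    return m2 / static_cast<double>(count - 1);
  }

  // Relative standard deviation. All-zero samples give 0 / 0, which is
  // non-finite and must be handled by the caller.
  double relative_stdev() const { return std::sqrt(sample_variance()) / mean; }
};
```

The all-zero case shows why a bounded invalid-noise path is useful: the
relative noise never becomes finite, so a threshold comparison alone would
never succeed.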
* change max-noise to for consistency
* Use online_mean_variance on m_noise_tracker in is_finished()
Previously, the standard deviation was computed about the current
noise level instead of the mean noise level. Because of the identity
E[ (N - C)^2 ] =
E[ (N - E[N])^2 ] + (E[N] - C)^2 >= E[ (N - E[N])^2 ]
this led to the criterion terminating later than it could have, because
the dispersion estimated about the current level is always greater than
or equal to the dispersion estimated about the mean.
The code used the current noise level instead of the mean to avoid
making two passes through the m_noise_tracker container.
Using online_mean_variance improves the accuracy of the estimated
dispersion of the noise signal while keeping a single pass through the
container.
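
The identity can be checked numerically with a small sketch. The helper names
mean_of and mean_square_about are hypothetical and exist only for this
illustration; expectations are taken as plain averages over the samples:

```cpp
#include <cassert>
#include <vector>

// Arithmetic mean of the samples, standing in for E[N].
inline double mean_of(const std::vector<double> &values)
{
  assert(!values.empty());
  double sum = 0.0;
  for (double v : values) { sum += v; }
  return sum / static_cast<double>(values.size());
}

// Mean squared deviation of the samples about an arbitrary reference c,
// standing in for E[(N - c)^2].
inline double mean_square_about(const std::vector<double> &values, double c)
{
  assert(!values.empty());
  double sum = 0.0;
  for (double v : values) { sum += (v - c) * (v - c); }
  return sum / static_cast<double>(values.size());
}
```

With samples {1, 2, 3, 4} and C = 4 (the latest value), the deviation about C
is 3.5, while the deviation about the mean 2.5 is 1.25; the difference is
(2.5 - 4)^2 = 2.25, as the identity predicts.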
* Address review feedback
Fixed misleading commit. Introduce private methods to refactor
computation of repeated expressions.
Renamed m_cuda_times_summary to m_measurements_summary, since
criterion can be applied for CPU-only measurements too.
Introduced is_close utility for checking whether two floating
point numbers are close to one another.
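
A sketch of what such a closeness check typically looks like. The name
is_close_sketch, its parameters, and the default tolerances are assumptions
for illustration; the actual is_close signature is not shown in this message:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// True when a and b differ by at most abs_tol, or by at most rel_tol times
// the larger magnitude. NaN never compares close to anything.
inline bool is_close_sketch(double a,
                            double b,
                            double rel_tol = 1e-9,
                            double abs_tol = 0.0)
{
  assert(rel_tol >= 0.0 && abs_tol >= 0.0);
  if (a == b) { return true; } // also covers equal infinities
  if (!std::isfinite(a) || !std::isfinite(b)) { return false; }

  const double diff  = std::fabs(a - b);
  const double scale = std::max(std::fabs(a), std::fabs(b));
  return diff <= std::max(abs_tol, rel_tol * scale);
}
```

A purely relative tolerance cannot match values near zero, which is why the
sketch also accepts an absolute tolerance.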
Introduced descriptive constexpr variables for hard-wired
constants.
Improve exception safety of timer structs by using local scope guards.
The guards ensure that cleanup steps, such as signaling the blocking
kernel to unblock and synchronizing the stream, are performed even if
the launch object throws an exception.
Tests of exception safety were added.
--
* blocking_kernel.unblock_noexcept() noexcept method added
This decouples the logic of signaling to unblock from checking
of the timeout.
* Improve exception safety in kernel_launch_timer
Introduce noexcept cleanup methods. Place body of start()
and stop() methods in the try/catch block and execute
noexcept clean-up on exception before rethrowing.
* Improve exception safety of measure_hot
* Make sure that throwing methods call noexcept ones instead of duplicating functionality
* Use cleanup_guard in measure_cold_base::kernel_launch_timer
Replace try/catch pattern with cleaner use of cleanup_guard
class.
* cpu_timer::start, cpu_timer::stop methods marked noexcept
These methods do not throw, and marking them noexcept explicitly
makes it fine to call them from other noexcept methods, such as
cleanup_noexcept in measure_cold.
* Address remaining exception safety issue in measure_hot
* Renamed guard variables to reflect their purpose, apply arm-then-do to ops queueing kernels
Set m_block_stream_armed = true; before launching the kernel. Doing so signals
cleanup guard that stream must be unblocked, even if launching of the kernel failed.
Same for operation launching time-stamps kernel.
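
The arm-then-do pattern with a scope guard can be sketched as follows. The
scope_cleanup class and the fake_measure struct are hypothetical stand-ins
written for this illustration; they are not the cleanup_guard class or the
measurement classes in the code base:

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical scope guard: runs a noexcept callable when the scope exits,
// whether normally or through an exception.
template <typename F>
class scope_cleanup
{
  static_assert(std::is_nothrow_invocable_v<F &>, "cleanup must be noexcept");

public:
  explicit scope_cleanup(F f)
      : m_f(std::move(f))
  {}
  ~scope_cleanup() { m_f(); }

  scope_cleanup(const scope_cleanup &)            = delete;
  scope_cleanup &operator=(const scope_cleanup &) = delete;

private:
  F m_f;
};

// Hypothetical stand-in for a measurement object that blocks a stream.
struct fake_measure
{
  bool block_stream_armed = false;
  int unblock_calls       = 0;

  void unblock_stream_noexcept() noexcept
  {
    assert(block_stream_armed);
    ++unblock_calls;
    block_stream_armed = false;
  }

  template <typename Launch>
  void run(Launch &&launch_blocking_kernel)
  {
    scope_cleanup guard{[this]() noexcept {
      if (block_stream_armed) { unblock_stream_noexcept(); }
    }};

    block_stream_armed = true; // arm first ...
    launch_blocking_kernel();  // ... then do; a throw here still unblocks
  }
};
```

Arming before the launch errs on the side of an extra unblock, which is
harmless, rather than a missed unblock, which would leave the stream blocked
until the timeout.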
* Add testing/device/exception_safety.cu
This test adds a benchmark that throws. It verifies that the benchmark
did not time out and that the control counters the benchmark maintains
are at the expected values.
* Refactor measurement cleanup guards for testability
Extract hot stream cleanup and cold launch timer cleanup into reusable
detail helpers. Keep measure_hot and measure_cold using those helpers through
thin adapters so the tested cleanup logic matches the production path.
Add driver-free cleanup guard tests using a fake measure object to verify
cleanup ordering when exceptions occur after blocking stream setup, after hot
unblock, and around cold GPU frequency start/stop paths.
* Implement cpu_timer_stop_noexcept in terms of cpu_timer_stop
cpu_timer_stop is already noexcept by nature of its implementation,
but we maintain a cpu_timer_stop_noexcept method for symmetry with
other pairs such as sync_stream()/sync_stream_noexcept().
cpu_timer_stop_noexcept() is implemented via cpu_timer_stop().
These methods are annotated __forceinline__, so the same code should be
generated.
* More readable initialization of bool members
* Moved exception_safety.cu back to testing/ folder
testing/device is reserved for tests that require locking
of GPU frequency per CMake option description.
* Fixed nitpick and bug it discovered
Changed testing/exception_safety.cu:237 so run_benchmark now iterates over every state
from bench.get_states() and checks each one is skipped with a reason
containing "requested".
That exposed a real runner behavior gap, so I also made a minimal fix in
nvbench/runner.cuh:120: after stop_runner_loop, remaining states are now explicitly
marked skipped with a reason instead of only printing a skip notification.
* Move static assertions (pertaining to cleanup guards) to
testing/cleanup_guards.cu
The CI failure with CTK 12.0 and certain versions of GCC is caused
by an OOM in the cudafe++ process, triggered by compiling the
instantiation of contract verification on the cold_launch_timer_probe struct.
As a work-around, this instantiation is excluded for CTK 12.0-12.6.
* Implement warmup-runs count, supported as CLI
CLI option --warmup-runs implemented and documented.
The warm-up count is enforced to always be positive.
This is necessary to ensure that JIT-ting has occurred,
and that use of the blocking kernel does not result in time-outs.
A test of the option parser is added.
* Ensure that measure_cold::run_warmup instantiates blocking kernel
Because warm-up runs are executed without use of blocking kernel,
the blocking kernel was not jitted until actual measurements were
collected. The module loading cost incurred during the first run
shows as elevated CPU time noise value for the first measurement
as noted in https://github.com/NVIDIA/nvbench/pull/339
This PR adds `this->block_stream(); this->unblock_stream();` prior
to executing warm-up loop with use of blocking kernel disabled.
This ensures that the blocking kernel is instantiated during the warm-up,
but no other kernel is launched between its launch and the stream sync,
thus avoiding deadlock.
* Rename --warmup-runs to --cold-warmup-runs, add --cold-max-warmup-walltime
Since configurable number of warmups only applies to measure_cold.cuh
rename the CLI option to reflect that.
Also add --cold-max-warmup-walltime (defaults to -1, i.e. disabled).
If enabled, exits the warmup loop before the requested count is reached if
the wall time spent executing warmups exceeds this max-warmup-walltime
value.
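
The combined count and wall-time limit can be sketched as a loop like the one
below. The function name run_warmups_sketch and its parameters are
hypothetical; the sketch also assumes the limit is checked after each run, so
at least one warmup always executes, which this message does not confirm:

```cpp
#include <cassert>
#include <chrono>

// Runs `body` up to `count` times and returns how many runs were made.
// A negative `max_walltime_s` disables the wall-time limit.
template <typename Body>
int run_warmups_sketch(int count, double max_walltime_s, Body &&body)
{
  assert(count > 0);
  using clock      = std::chrono::steady_clock;
  const auto start = clock::now();

  int done = 0;
  while (done < count)
  {
    body();
    ++done;

    if (max_walltime_s >= 0.0)
    {
      const std::chrono::duration<double> elapsed = clock::now() - start;
      if (elapsed.count() > max_walltime_s) { break; }
    }
  }
  return done;
}
```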
These are now owned by the stdrel stopping criterion, and should not be exposed directly in the benchmark/state/etc APIs.
This will affect users that are calling
`NVBENCH_BENCH(...).set_min_time(...)` or
`NVBENCH_BENCH(...).set_max_noise(...)`.
These can be updated to
`NVBENCH_BENCH(...).set_criterion_param_float64(["min-time"|"max-noise"], ...)`.
* Create and use NVBENCH_CUDA_CALL_RESET_ERROR.
* Moved cudaGetLastError() call to NVBENCH_CUDA_CALL macro
---------
Co-authored-by: Sergey Pavlov <psvvsp89@gmail.com>
* Refactor main implementation to improve reusability and customization.
Move the implementation of `main` out of macros and into separate
functions. This allows for easier reuse and customization of the macros.
Existing macro usage should still work as expected, and new
customization points will simplify common tasks like argument parsing
going forward.
* Add tests that validate common main customizations.
The string used when constructing a summary is no longer a human
readable name, but rather a tag string (e.g. "nv/cold/time/gpu/mean").
These will make lookup easier and more stable going forward.
name vs. short_name no longer exists. Now there is just "name", which
is used for column headings. The "description" string may still be
used for detailed information.
Updated the json tests and compare script to reflect these changes.
Previously, convergence was tested by waiting for the relative stdev
of cuda timings ("noise") to drop below a certain percentage
(`max_noise`).
This assumed that all benchmarks would eventually see their noise drop
to some threshold, but this is not the case. In practice, many benchmarks
never converge to the default 0.5% relative stdev and instead will always
run to the 15s timeout -- even if the means have converged in a second
or two.
Added a new check that tests when the noise itself stabilizes and ends
the benchmark, even if noise > max_noise.
After testing, this patch alone significantly reduces the runtime of the
Thrust+CUB benchmark suite (from 30 hours to 5 hours) and produces similar
timing results.
The parameters used to tune this feature are not exposed -- if this
approach works long-term and there's a strong motivation to let users
tweak them, then we can worry about names/APIs/CLI/docs later.
This will provide functionality such as clock locking (--lgm),
persistence mode (--pm), device querying (--list), version checking
(--version), and documentation (--help).
This is possible already with any nvbench executable, but having
one with a reliable name will be helpful for scripting and writing
documentation.
- /W4 on MSVC
- -Wall -Wextra + others on gcc/clang
- New NVBench_ENABLE_WERROR option to toggle "warnings as errors"
- Mark the nlohmann_json library as IMPORTED to switch to system includes
- Rename nvbench_main -> nvbench.main to follow target name conventions
- Explicitly suppress some cudafe warnings when compiling templates in
nlohmann_json headers.
- Explicitly suppress some warnings from Thrust headers.
- Various fixes for warnings exposed by new flags.
- Disable CUPTI on CTK < 11.3 (See #52).
- Add export sets
- Add install rules
- Remove manual CPM import, port to rapids_cpm_*, etc
- Organize CMake code into cmake/*.cmake files.
- NVBench is now a shared library.
Human-readable outputs (md) and CLI inputs still use percentages.
In-memory and machine-readable outputs (csv, json) use ratios.
This is the convention that spreadsheet apps expect. Fixes #2.