* Reduce stdrel criterion complexity and ensure termination
Replace the stdrel criterion's growing sample history with an online
mean/variance accumulator. This keeps the stopping criterion based on
relative standard deviation, preserves the unbiased standard-deviation
estimate used for convergence, and reduces per-sample update work from
recomputing over the full history to constant time.
Add a bounded invalid-noise path so measurements that persistently produce
non-finite relative noise, such as all-zero timings, can terminate without
waiting for the wall-time timeout. Keep the normal min-time gate for ordinary
stdrel convergence.
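One way such a bounded path could look is sketched below. It assumes that "persistently" means a run of consecutive non-finite values and that the bound is a fixed count; both the struct and the default bound of 16 are hypothetical, not taken from the actual change.

```cpp
#include <cmath>

// Hypothetical guard; the bound and the "consecutive" policy are assumptions.
struct invalid_noise_guard_sketch
{
  int max_invalid    = 16; // illustrative bound, not the value used upstream
  int invalid_streak = 0;

  // Returns true once the relative noise has been non-finite for
  // max_invalid consecutive observations; a finite value resets the streak.
  bool observe(double rel_noise)
  {
    if (std::isfinite(rel_noise))
    {
      invalid_streak = 0;
      return false;
    }
    ++invalid_streak;
    return invalid_streak >= max_invalid;
  }
};
```

A finite observation leaves the decision to the ordinary stdrel check, so the min-time gate is unaffected in this sketch.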
Add focused tests for the online accumulator, stdrel sample-count threshold,
sample-standard-deviation behavior, deterministic convergence inputs, and
persistent invalid-noise termination. Update the CLI help for the stdrel
termination behavior.
* change max-noise to for consistency
* Use online_mean_variance on m_noise_tracker in is_finished()
Previously, the standard deviation was computed around the current
noise level instead of the mean noise level. Because of the identity
E[ (N - C)^2 ] =
E[ (N - E[N])^2 ] + (E[N] - C)^2 >= E[ (N - E[N])^2 ]
this led to the criterion terminating later than it could have, because
the dispersion estimated around the current value C is always greater
than or equal to the estimate relative to the mean.
The code used the current noise level instead of the mean to avoid
making two passes through the m_noise_tracker container.
Using online_mean_variance improves the accuracy of the estimated
dispersion of the noise signal while maintaining a single pass through
the container.
* Address review feedback
Fixed misleading commit. Introduce private methods to refactor
computation of repeated expressions.
Renamed m_cuda_times_summary to m_measurements_summary, since
criterion can be applied for CPU-only measurements too.
Introduced is_close utility for checking whether two floating
point numbers are close to one another.
Introduced descriptive constexpr variables for hard-wired
constants.
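For illustration, a closeness check of this kind might combine a relative and an absolute tolerance. The function name, signature, and default tolerances below are assumptions and may differ from the actual is_close utility.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical sketch; tolerances and signature are illustrative only.
inline bool is_close_sketch(double a, double b, double rel_tol = 1e-9, double abs_tol = 0.0)
{
  if (a == b)
  {
    return true; // also covers equal infinities
  }
  if (!std::isfinite(a) || !std::isfinite(b))
  {
    return false; // NaN or mismatched infinities are never close
  }
  const double diff  = std::fabs(a - b);
  const double scale = std::max(std::fabs(a), std::fabs(b));
  return diff <= std::max(rel_tol * scale, abs_tol);
}
```

The absolute tolerance matters only near zero, where a purely relative test would reject any two distinct values.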
* Implement warmup-runs count, supported as CLI
CLI option --warmup-runs implemented and documented.
The warm-up count is enforced to always be positive.
This is necessary to ensure that JIT-ting has occurred,
and that use of the blocking kernel does not result in time-outs.
A test for the option parser is added.
* Ensure that measure_cold::run_warmup instantiates blocking kernel
Because warm-up runs are executed without use of blocking kernel,
the blocking kernel was not jitted until actual measurements were
collected. The module loading cost incurred during the first run
shows as elevated CPU time noise value for the first measurement
as noted in https://github.com/NVIDIA/nvbench/pull/339
This PR adds `this->block_stream(); this->unblock_stream();` prior
to executing the warm-up loop with use of the blocking kernel disabled.
This ensures that the blocking kernel is instantiated during the warm-up,
but no other kernel is launched between its launch and the stream sync,
thus avoiding deadlock.
* Rename --warmup-runs to --cold-warmup-runs, add --cold-max-warmup-walltime
Since the configurable number of warmups only applies to measure_cold.cuh,
rename the CLI option to reflect that.
Also add --cold-max-warmup-walltime (defaults to -1, i.e. disabled).
If enabled, the warmup loop exits before the requested count is reached if
the wall time expended executing warmups exceeds this max-warmup-walltime
value.
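The interaction between the requested count and the wall-time budget can be sketched as follows. The function name and parameters are hypothetical; in particular, checking the budget after each run (so that at least one warmup always executes) is an assumption, not a description of measure_cold.

```cpp
#include <chrono>
#include <functional>

// Hypothetical sketch; returns the number of warmup runs actually executed.
// A negative max_walltime_s disables the budget, mirroring the -1 default.
inline int run_warmups_sketch(int requested_runs,
                              double max_walltime_s,
                              const std::function<void()> &warmup)
{
  using clock      = std::chrono::steady_clock;
  const auto start = clock::now();
  int done         = 0;
  while (done < requested_runs)
  {
    warmup();
    ++done;
    if (max_walltime_s >= 0.0)
    {
      const std::chrono::duration<double> elapsed = clock::now() - start;
      if (elapsed.count() > max_walltime_s)
      {
        break; // budget exceeded: stop before the requested count
      }
    }
  }
  return done;
}
```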
The option sets m_skip_batched boolean member in benchmark_base class.
Methods `bool get_skip_batched()` and `void set_skip_batched(bool)` added.
m_skip_batched is also added to state class. Similarly named methods
are added.
CLI help file documents `--no-batched` option.
Text for --profile modified to be self-consistent, i.e., not to refer
to the removed --run-once and --disable-blocking-kernel for an explanation
of what it does.
Locking clocks is currently only implemented for Volta+ devices.
Example usage:
my_bench -d [0,1,3] --persistence-mode 1 --lock-gpu-clocks base
See the cli_help.md docs for more info.
Fixes #10.
Adds a mode that forces a benchmark to only run once, simplifying
profiling use cases. This can be enabled by any of the following methods:
* Passing `--run-once` on the command line
* `NVBENCH_CREATE(...).set_run_once(true)` when declaring a benchmark
* `state.set_run_once(true)` from within the benchmark implementation.
Human-readable outputs (md) and CLI inputs still use percentages.
In-memory and machine-readable outputs (csv, json) use ratios.
This is the convention that spreadsheet apps expect. Fixes #2.