Make batch/sync arguments of State.exec keyword-only
Provide a default column_name value for the State.addElementCount method,
so that it can be called as state.addElementCount(count), or as
state.addElementCount(count, column_name="Descriptive Name")
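The two calling conventions above can be sketched with a stand-in class. This is not the real binding: `FakeState`, the default column name `"Elements"`, and the parameter names `batched` and `sync` are assumptions made for illustration only.

```python
class FakeState:
    """Stand-in for the bound State class; not the real nvbench API."""

    def __init__(self):
        self.element_counts = []

    # column_name has a default, so callers may omit it.
    # The default value "Elements" is hypothetical.
    def addElementCount(self, count, column_name="Elements"):
        self.element_counts.append((column_name, count))

    # The bare `*` makes every following parameter keyword-only.
    # The names `batched` and `sync` are assumed for this sketch.
    def exec(self, fn, *, batched=True, sync=False):
        fn()
        return {"batched": batched, "sync": sync}


state = FakeState()
state.addElementCount(1024)
state.addElementCount(2048, column_name="Descriptive Name")
options = state.exec(lambda: None, sync=True)

try:
    # Passing the flags positionally raises TypeError.
    state.exec(lambda: None, False, True)
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

Keyword-only flags keep call sites readable, since a bare `True` or `False` at the end of a call says nothing about which option it sets.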
Fix run-time exception:
```
Fail: Unexpected error: RuntimeError: return_value_policy = copy, but type is non-copyable! (#define PYBIND11_DETAILED_ERROR_MESSAGES or compile in debug mode for details)
```
caused by an attempt to return an instance of the move-only
`nvbench::cuda_stream` class using the default
`pybind11::return_value_policy::copy`.
* Measure cold must not use the blocking kernel for single runs
Per https://github.com/NVIDIA/nvbench/issues/242, we should not
use the blocking kernel when --run-once or --profile is used. This avoids
possible deadlocks when profiling with external tools, and also avoids
deadlocking when Python programs load the program on the first execution.
* Measure hot should not use blocking kernel during warmup
This change follows suit with measure_cold, where the same change was
prompted by a deadlock, see https://github.com/NVIDIA/nvbench/pull/241
* Remove setting of CUDA_MODULE_LOADING=EAGER
This is no longer necessary, as neither the warm-up runs in regular runs
nor the single run in run-once/profile mode use the blocking kernel anymore.
Text for --profile modified to be self-consistent, i.e., not to refer
to removed --run-once and --disable-blocking-kernel for explanation
of what it does.
The entry with tag "nv/cold/time/cpu/stdev/absolute" stores
the standard deviation of the execution duration measurements,
not the relative standard deviation.
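To make the distinction concrete, here is a small sketch with made-up durations. It assumes the sample standard deviation (n - 1 denominator); which estimator nvbench actually uses is not stated here.

```python
import statistics

# Made-up execution durations in seconds, for illustration only.
durations = [1.0, 1.5, 2.0]

mean = statistics.mean(durations)

# Absolute standard deviation: same unit as the samples (seconds).
stdev_absolute = statistics.stdev(durations)

# Relative standard deviation: dimensionless ratio to the mean.
stdev_relative = stdev_absolute / mean
```

The absolute value is a duration in seconds, while the relative value is a unitless fraction of the mean, so the two cannot share a tag.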
Based on the findings of https://github.com/NVIDIA/nvbench/issues/249,
m_cpu_timer.start() is now called from the kernel_launcher_timer.start()
method.
Previously it was called from kernel_launcher_timer.stop(), just before the
unblock_stream() call, with the intention of homing in on the time to execute
GPU work, but this excluded any host work performed by the benchmarked function
from the CPU time.
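The effect of moving the timer start can be sketched with a simulated clock. This is a simplified model, not nvbench code: `FakeClock`, `measure_cpu_time`, and the 3.0 s / 2.0 s durations are hypothetical, and the model assumes host work happens between start() and stop() while GPU work completes after the stream is unblocked.

```python
class FakeClock:
    """Deterministic clock so the example does not depend on real timing."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


def measure_cpu_time(start_timer_in_start, host_work=3.0, gpu_work=2.0):
    clock = FakeClock()
    cpu_start = None

    # Models kernel_launcher_timer.start().
    if start_timer_in_start:
        cpu_start = clock.now

    # Host-side work done by the benchmarked function.
    clock.advance(host_work)

    # Models kernel_launcher_timer.stop(): the old behaviour started the
    # CPU timer here, just before unblocking the stream.
    if not start_timer_in_start:
        cpu_start = clock.now

    # GPU work runs once the stream is unblocked.
    clock.advance(gpu_work)

    return clock.now - cpu_start


old_cpu_time = measure_cpu_time(start_timer_in_start=False)
new_cpu_time = measure_cpu_time(start_timer_in_start=True)
```

In this model the old placement reports only the 2.0 s of GPU work, while the new placement reports 5.0 s, which includes the host work.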
In Python, the kernel generator is a user-defined callable.
We need to capture the Python object of that callable in
the kernel generator provided for each benchmark.
To this end, nvbench::benchmark has been modified to have a member of
kernel_generator type (which must be copy-constructible). The constructor
accepts an optional parameter of type `kernel_generator`, which defaults
to a default-constructed instance.
nvbench::runner was modified to store a kernel_generator instance as well.
Its run method creates a fresh copy of the stored instance for each invocation,
just as it did before.
nvbench tests/examples pass with this change.
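The store-and-copy pattern described above can be sketched in Python as follows. The real change is in C++; the `KernelGenerator` and `Runner` classes here are hypothetical stand-ins that only mirror the shape of the design.

```python
import copy


class KernelGenerator:
    """Copyable wrapper that carries an optional user-defined callable."""

    def __init__(self, fn=None):
        self.fn = fn

    def __call__(self, state):
        # A default-constructed generator has nothing to call.
        if self.fn is None:
            return None
        return self.fn(state)


class Runner:
    """Stores a generator and copies it for every run."""

    def __init__(self, generator=None):
        # Fall back to a default-constructed generator when none is given.
        self.generator = generator if generator is not None else KernelGenerator()

    def run(self, state):
        # Shallow copy: a fresh wrapper that shares the captured callable.
        fresh = copy.copy(self.generator)
        return fresh(state)


calls = []


def user_callable(state):
    calls.append(state)
    return len(calls)


runner = Runner(KernelGenerator(user_callable))
first = runner.run("state-a")
second = runner.run("state-b")
default_result = Runner().run("ignored")
```

Each run gets its own generator object, but every copy refers to the same captured callable, which is why the generator type has to be copyable.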
These are now owned by the stdrel stopping criterion, and should not be exposed directly in the benchmark/state/etc APIs.
This will affect users that are calling
`NVBENCH_BENCH(...).set_min_time(...)` or
`NVBENCH_BENCH(...).set_max_noise(...)`.
These can be updated to
`NVBENCH_BENCH(...).set_criterion_param_float64(["min-time"|"max-noise"], ...)`.