Benchmark function that sleeps for 1 second on the host using a CPU-only
timer, as well as a CPU/GPU timer with and without the blocking kernel.
All three methods must report consistent values close to 1 second.
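
A minimal host-only sketch of the check described above, using only the
Python standard library. `timed_sleep` is a hypothetical helper, not part of
nvbench, and the shortened sleep duration is an assumption made to keep the
sketch quick. The CPU/GPU timer variants need nvbench itself and are not
reproduced here.

```python
import time


def timed_sleep(seconds: float) -> float:
    """Sleep on the host and return the wall-clock time that elapsed."""
    start = time.perf_counter()
    time.sleep(seconds)
    return time.perf_counter() - start


# The benchmark sleeps for 1 second; a shorter duration keeps this sketch quick.
requested = 0.1
elapsed = timed_sleep(requested)
print(f"requested {requested:.3f}s, measured {elapsed:.3f}s")
```

The measured value should sit slightly above the requested duration, since
the timer also covers the call overhead around the sleep.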
Running
    python examples/cpu_only.py --run-once -d 0 --output foo.md
used to trip a SystemError ("returned a result with an exception set").
It now raises a clean NVBenchmarkError exception.
Change the explicit constructor of benchmark_wrapper_t to use the move
constructor of py::object instead of the copy constructor by replacing
`py::object(o)` with `py::object(std::move(o))`.
Add throughput.py example, which is based on the same kernel as
auto_throughput.py but records the amounts of global memory reads/writes
in order to output the BWUtil metric, which measures bandwidth
utilization as %SOL.
state.add_summary(column_name: str, value: Union[int, float, str])
This is used in examples/axes.py to map an integral value from Int64Axis
to a string description.
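
The mapping from an integer axis value to a descriptive string could look
roughly like the sketch below. `FakeState` and the `DESCRIPTIONS` table are
hypothetical stand-ins so that the snippet runs without nvbench or a GPU;
only the `add_summary(column_name, value)` call shape is taken from the
signature above.

```python
from typing import Dict, Union

SummaryValue = Union[int, float, str]


class FakeState:
    """Hypothetical stand-in for the benchmark state; it only records summaries."""

    def __init__(self) -> None:
        self.summaries: Dict[str, SummaryValue] = {}

    def add_summary(self, column_name: str, value: SummaryValue) -> None:
        self.summaries[column_name] = value


# Hypothetical lookup table from an Int64Axis value to a readable label.
DESCRIPTIONS = {0: "baseline", 1: "unrolled", 2: "vectorized"}

state = FakeState()
variant = 1  # in a real benchmark this would come from the Int64Axis
state.add_summary("Variant", DESCRIPTIONS.get(variant, "unknown"))
state.add_summary("Variant id", variant)
```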
Make batch/sync arguments of State.exec keyword-only
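
A sketch of what keyword-only arguments mean for callers, in plain Python.
`exec_sketch` is a hypothetical stand-in for State.exec; the parameter names
`batched` and `sync` and their defaults are assumptions made for
illustration, not the actual binding signature.

```python
from typing import Callable, Dict


def exec_sketch(
    fn: Callable[[], None], *, batched: bool = True, sync: bool = False
) -> Dict[str, bool]:
    """Run fn once and report the options; the bare '*' makes them keyword-only."""
    fn()
    return {"batched": batched, "sync": sync}


# Options must be spelled out by name at the call site.
options = exec_sketch(lambda: None, sync=True)

# Passing an option positionally is rejected by Python itself.
try:
    exec_sketch(lambda: None, True)
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

Requiring names at the call site keeps calls readable and prevents the two
boolean options from being swapped by accident.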
Provide a default column_name value for the State.addElementCount method,
so that it can be called either as state.addElementCount(count) or as
state.addElementCount(count, column_name="Descriptive Name").
Fix run-time exception:
```
Fail: Unexpected error: RuntimeError: return_value_policy = copy, but type is non-copyable! (#define PYBIND11_DETAILED_ERROR_MESSAGES or compile in debug mode for details)
```
caused by an attempt to return an instance of the move-only
`nvbench::cuda_stream` class using the default
`pybind11::return_value_policy::copy`.
* Measure cold must not use block_kernel for single runs
Per https://github.com/NVIDIA/nvbench/issues/242, we should not
use the blocking kernel when --run-once or --profile is used. This avoids
possible deadlocks when profiling with external tools, and also avoids
deadlocking when Python programs load the program on the first execution.
* Measure hot should not use blocking kernel during warmup
This change follows suit with measure_cold, where it was prompted
by a deadlock; see https://github.com/NVIDIA/nvbench/pull/241
* Remove setting of CUDA_MODULE_LOADING=EAGER
This is no longer necessary, as neither the warm-up runs in regular runs
nor the single run in --run-once/--profile mode use the blocking kernel
any longer.
Text for --profile modified to be self-consistent, i.e., not to refer
to the removed --run-once and --disable-blocking-kernel for an explanation
of what it does.
The entry with tag "nv/cold/time/cpu/stdev/absolute" stores
the value of the standard deviation of execution duration measurements,
not the relative standard deviation.
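
To make the distinction concrete, the sketch below computes both quantities
with the Python standard library. The sample durations are made up, and the
use of the sample (n - 1) estimator is an assumption; it is not a statement
about which estimator nvbench applies.

```python
import statistics

# Made-up execution durations in seconds.
durations = [2.0, 2.2, 1.8, 2.0]

mean = statistics.mean(durations)

# Absolute standard deviation: same unit as the measurements (seconds).
absolute_stdev = statistics.stdev(durations)

# Relative standard deviation: unitless ratio of the spread to the mean.
relative_stdev = absolute_stdev / mean

print(f"mean={mean:.4f}s absolute={absolute_stdev:.4f}s relative={relative_stdev:.2%}")
```

The absolute value scales with the duration being measured, while the
relative value does not, so the two must not be stored under the same tag.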
Based on findings of https://github.com/NVIDIA/nvbench/issues/249,
m_cpu_timer.start() is now called from the
kernel_launcher_timer.start() method.
Previously it was called from kernel_launcher_timer.stop() just before
the unblock_stream() call, with the intention of homing in on the time to
execute GPU work, but this excluded any host work performed by the
benchmarked function from the CPU time.
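
The effect of moving the timer start can be illustrated with a host-only
simulation. `simulate_launch` is a hypothetical function in which sleeps
stand in for host work and for waiting on GPU work; it mirrors only the
ordering described above and does not use any nvbench API.

```python
import time


def simulate_launch(host_work_s: float, gpu_work_s: float, start_early: bool) -> float:
    """Return the simulated CPU time for one launch; sleeps stand in for work."""
    start = None
    if start_early:
        # New ordering: the CPU timer starts together with the launcher timer.
        start = time.perf_counter()
    time.sleep(host_work_s)  # host work done by the benchmarked function
    if not start_early:
        # Old ordering: the CPU timer starts just before the stream is unblocked.
        start = time.perf_counter()
    time.sleep(gpu_work_s)  # stand-in for waiting on the GPU work
    return time.perf_counter() - start


early = simulate_launch(0.2, 0.01, start_early=True)
late = simulate_launch(0.2, 0.01, start_early=False)
print(f"timer started early: {early:.3f}s, timer started late: {late:.3f}s")
```

With the late start, the simulated host work is missing from the reported
time, which is the discrepancy the change addresses.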