Edit wheel.packages metadata to include namespace package "cuda".
Updated README to remove the work-around of setting PYTHONPATH,
as it is no longer necessary.
Change example to illustrate timing CPU work.
First example does only CPU work (sleeps), use CPU-only timer.
Second examples does both CPU and GPU work (sleeps in either case).
Use cold-run timer with/without sync tag to measure both CPU and GPU times.
Benchmark function that sleeps for 1 seconda on the host using CPU-only
timer, as well as CPU/GPU timer that does/doesn't use blocking kernel.
All three methods must report consistent values close to 1 second.
python examples/cpu_only.py --run-once -d 0 --output foo.md
used to trip SystemError, returned a result with an exception set.
It now returns a clean NVBenchmarkError exception.
Change explicit constructor of benchmark_wrapper_t to use move-constructor
of py::object instead of copy constructor by replacing `py::object(o)` with
`py::object(std::move(o))`.
Add throughput.py example, which is based on the same kernel as
auto_throughput.py but records global memory reads/writes amounts
to output BWUtil metric measuring %SOL in bandwidth utilization.
state.add_summary(column_name: str, value: Union[int, float, str])
This is used in examples/axes.py to map integral value from Int64Axis
to string description.
make batch/sync arguments of State.exec keyword-only
Provide default column_name value for State.addElementCount method,
so that it can be called state.addElementCount(count), or as
state.addElementCount(count, column_name="Descriptive Name")
Fix run-time exception:
```
Fail: Unexpected error: RuntimeError: return_value_policy = copy, but type is non-copyable! (#define PYBIND11_DETAILED_ERROR_MESSAGES or compile in debug mode for details)
```
caused by attempt to returning move-only `nvbench::cuda_stream` class
instance using default `pybind11::return_value_policy::copy`.