Introduce get_int64_or_default method, and counterparts for
float64 and string.
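
A minimal sketch of the intended "value or default" semantics, using a plain
dict as a stand-in for the benchmark state. The function name
`get_int64_or_default_sketch` and the dict-based storage are illustrative
assumptions, not the actual binding code:
```python
from typing import Mapping, Union

AxisValue = Union[int, float, str]


def get_int64_or_default_sketch(
    axis_values: Mapping[str, AxisValue], name: str, default: int
) -> int:
    """Return the integer stored under `name`, or `default` if it is absent."""
    if name not in axis_values:
        return default
    value = axis_values[name]
    # bool is a subclass of int in Python; reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"axis {name!r} does not hold an integer value")
    return value
```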
Provided names for Python arguments.
Tried generating Python stubs automatically with
```
stubgen -m cuda.nvbench._nvbench
```
Gave up on this, since it does not include doc-strings.
It would be nice to compare auto-generated _nvbench.pyi with
__init__.pyi for discrepancies though.
Edit wheel.packages metadata to include namespace package "cuda".
Updated README to remove the work-around of setting PYTHONPATH,
as it is no longer necessary.
Change example to illustrate timing CPU work.
The first example does only CPU work (sleeps) and uses the CPU-only timer.
The second example does both CPU and GPU work (sleeps in either case) and
uses the cold-run timer with/without the sync tag to measure both CPU and GPU times.
Benchmark function that sleeps for 1 second on the host using CPU-only
timer, as well as CPU/GPU timer that does/doesn't use blocking kernel.
All three methods must report consistent values close to 1 second.
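
The consistency check can be illustrated independently of nvbench with a
wall-clock measurement of a host sleep. The helper names and the 10% tolerance
below are assumptions chosen for illustration only:
```python
import time


def time_host_sleep(seconds: float) -> float:
    """Sleep on the host and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    time.sleep(seconds)
    return time.perf_counter() - start


def is_consistent(measured: float, expected: float, rel_tol: float = 0.1) -> bool:
    """Return True if `measured` is within `rel_tol` (relative) of `expected`."""
    return abs(measured - expected) <= rel_tol * expected
```
A timer reporting, say, 1.02 s for the 1 second sleep would pass this check,
while one reporting 1.5 s would not.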
Running
python examples/cpu_only.py --run-once -d 0 --output foo.md
used to trip a SystemError (a result was returned with an exception set).
It now raises a clean NVBenchmarkError exception.
Change explicit constructor of benchmark_wrapper_t to use move-constructor
of py::object instead of copy constructor by replacing `py::object(o)` with
`py::object(std::move(o))`.
Add throughput.py example, which is based on the same kernel as
auto_throughput.py but records global memory reads/writes amounts
to output BWUtil metric measuring %SOL in bandwidth utilization.
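
As a rough sketch of what such a metric expresses: achieved global memory
bandwidth divided by the device's peak bandwidth. The function name, its
parameters, and the exact formula are assumptions for illustration; the
value nvbench reports may be computed differently:
```python
def bandwidth_utilization(
    bytes_read: int,
    bytes_written: int,
    elapsed_seconds: float,
    peak_bytes_per_second: float,
) -> float:
    """Return achieved bandwidth as a fraction of the peak (1.0 == 100% SOL)."""
    if elapsed_seconds <= 0 or peak_bytes_per_second <= 0:
        raise ValueError("elapsed time and peak bandwidth must be positive")
    achieved = (bytes_read + bytes_written) / elapsed_seconds
    return achieved / peak_bytes_per_second
```
For example, moving 1 GB in and 1 GB out in 10 ms on a device with a
400 GB/s peak corresponds to roughly 50% utilization.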
state.add_summary(column_name: str, value: Union[int, float, str])
This is used in examples/axes.py to map integral value from Int64Axis
to string description.
Make batch/sync arguments of State.exec keyword-only.
Provide default column_name value for State.addElementCount method,
so that it can be called state.addElementCount(count), or as
state.addElementCount(count, column_name="Descriptive Name")
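
The calling conventions described above can be sketched in pure Python.
`StateSketch` is a hypothetical stand-in, not the real binding; the keyword
names `batched` and `sync` and the `None` default for `column_name` are
assumptions:
```python
from typing import Callable, List, Optional, Tuple


class StateSketch:
    """Hypothetical stand-in that mimics the calling conventions only."""

    def __init__(self) -> None:
        self.exec_calls: List[Tuple[bool, bool]] = []
        self.element_counts: List[Tuple[Optional[str], int]] = []

    def exec(
        self, launcher: Callable[[], None], *, batched: bool = True, sync: bool = False
    ) -> None:
        # The bare `*` makes `batched` and `sync` keyword-only.
        self.exec_calls.append((batched, sync))
        launcher()

    def addElementCount(self, count: int, column_name: Optional[str] = None) -> None:
        # `column_name` may be omitted; a default is used in that case.
        self.element_counts.append((column_name, count))
```
With this shape, `state.exec(fn, sync=True)` is accepted, while
`state.exec(fn, True, False)` raises a TypeError.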
Fix run-time exception:
```
Fail: Unexpected error: RuntimeError: return_value_policy = copy, but type is non-copyable! (#define PYBIND11_DETAILED_ERROR_MESSAGES or compile in debug mode for details)
```
caused by an attempt to return a move-only `nvbench::cuda_stream` class
instance using the default `pybind11::return_value_policy::copy`.
* Measure cold must not use block_kernel for single runs
Per https://github.com/NVIDIA/nvbench/issues/242, we should not
use the blocking kernel when --run-once or --profile is used, to avoid
possible deadlocks when profiling with external tools, and also to avoid
deadlocking when Python programs load the program on the first execution.
* Measure hot should not use blocking kernel during warmup
This change follows suit with measure_cold, where it was prompted
by a deadlock; see https://github.com/NVIDIA/nvbench/pull/241
* Remove setting of CUDA_MODULE_LOADING=EAGER
This is no longer necessary, since neither the warm-up runs in regular runs
nor the single run in run-once/profile mode uses the blocking kernel any more.
Text for --profile modified to be self-consistent, i.e., not to refer
to removed --run-once and --disable-blocking-kernel for explanation
of what it does.