- Add license header to each example file.
- Fixed broken runs caused by type declarations.
- Fixed hang in throughput.py when --run-once by doing a
manual warm-up step, like in auto_throughput.py
Change example to illustrate timing CPU work.
First example does only CPU work (sleeps), use CPU-only timer.
Second examples does both CPU and GPU work (sleeps in either case).
Use cold-run timer with/without sync tag to measure both CPU and GPU times.
Benchmark function that sleeps for 1 seconda on the host using CPU-only
timer, as well as CPU/GPU timer that does/doesn't use blocking kernel.
All three methods must report consistent values close to 1 second.
Add throughput.py example, which is based on the same kernel as
auto_throughput.py but records global memory reads/writes amounts
to output BWUtil metric measuring %SOL in bandwidth utilization.