Add throughput.py example, which is based on the same kernel as auto_throughput.py but records global memory reads/writes amounts to output BWUtil metric measuring %SOL in bandwidth utilization.