* manual control of MAC cluster for improved 2-wave performance
ensure setprio's order; ensure inner loop size >= local read size
synchronize when there is only a single MAC cluster
* format
* use value field from ck::integral_constant
* roll out inter-wave loop scheduler to c-shuffle gemm variants
will gradually roll out to other applicable device ops once the occasional register spill is resolved
* additional comments
* format
* fix mismatch between inter-wave pipeline and interwave blockwise gemm
* address review feedback
* amend
* Add ThreadwiseReduction functor as per-thread reduction api
* Use the ThreadwiseReduce API and adjust usage of the PartitionedBlockwiseReduction API to simplify the kernels
* Add comments and remove useless declarations in the kernels
* Tiny updates
* Use thread cluster descriptor and explicit M_K 2d descriptor to simplify Blockwise Reduction
* Replace ReduceDims with NumReduceDims as the Device Reduce interface template parameter
* Rename the folders for the pool2d and reduce examples
* Update reduction test scripts
* Add Readme for pool2d_fwd and reduce_blockwise examples
* Tiny fix in reduce profiler and tiny update in reduce testing scripts
* Tiny fix in testing script profile_reduce_no_index.sh
* Tiny change in script/profile_reduce_with_index.sh
* Rename and refine Reduction profiler/device layer/examples
* Rename all NumReduceDims to NumReduceDim