* manual control of MAC cluster for improved 2-wave performance
ensure setprio ordering; ensure inner loop size >= local read size
synchronize when there is only a single MAC cluster
* format
* use value field from ck::integral_constant
* roll out inter-wave loop scheduler to c-shuffle gemm variants
will gradually roll out to other applicable device ops once the occasional register spill is resolved
* additional comments
* format
* fix mismatch between inter-wave pipeline and inter-wave blockwise gemm
* address review feedback
* amend