* Summary:
  - Refactor the epilogue (with CShuffle) to support fused operations:
    - EpilogueCShuffleBase: holds the common parts
    - EpilogueCShuffle: runs CShuffle and writes out
    - EpilogueWelfordCShuffle: holds the Welford-specific arguments, runs CShuffle, writes out, then runs the first part of Welford and the Welford write-out
  - Extend thread transfer v7r3:
    - Support an intermediate data type that differs from the src and dst types
    - New functionality to write to the dst buffer and keep the data (so it can be used for additional operations)
* Address review comments
[ROCm/composable_kernel commit: 4ebc48a3cd]
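
For readers unfamiliar with the fused pattern described above, here is a minimal host-side sketch of the idea: values are converted through an intermediate type, written to dst, and kept so a Welford update can consume them. It is not the composable_kernel implementation. Every name in it (`WelfordState`, `welford_update`, `welford_merge`, `write_and_keep`) is hypothetical and chosen for illustration only, and the real code runs per thread on the GPU rather than over `std::vector`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical running state for Welford's algorithm (illustrative names,
// not taken from composable_kernel).
struct WelfordState
{
    double mean       = 0.0; // running mean
    double m2         = 0.0; // running sum of squared deviations from the mean
    std::size_t count = 0;   // number of samples folded in so far
};

// Fold one sample into the running state (standard Welford update).
inline void welford_update(WelfordState& s, double x)
{
    ++s.count;
    const double delta = x - s.mean;
    s.mean += delta / static_cast<double>(s.count);
    s.m2 += delta * (x - s.mean);
}

// Merge two partial states, e.g. results produced by different threads.
inline WelfordState welford_merge(const WelfordState& a, const WelfordState& b)
{
    if(a.count == 0)
        return b;
    if(b.count == 0)
        return a;

    WelfordState r;
    r.count            = a.count + b.count;
    const double na    = static_cast<double>(a.count);
    const double nb    = static_cast<double>(b.count);
    const double n     = static_cast<double>(r.count);
    const double delta = b.mean - a.mean;
    r.mean             = a.mean + delta * nb / n;
    r.m2               = a.m2 + b.m2 + delta * delta * na * nb / n;
    return r;
}

// Assumed shape of "write to dst and keep the data": each source value is
// converted to an intermediate type, stored to dst in the dst type, and the
// intermediate value is returned so later operations can reuse it.
template <typename Inter, typename Src, typename Dst>
std::vector<Inter> write_and_keep(const std::vector<Src>& src, std::vector<Dst>& dst)
{
    assert(dst.size() == src.size());
    std::vector<Inter> kept(src.size());
    for(std::size_t i = 0; i < src.size(); ++i)
    {
        kept[i] = static_cast<Inter>(src[i]); // src type -> intermediate type
        dst[i]  = static_cast<Dst>(kept[i]);  // intermediate type -> dst type
    }
    return kept;
}
```

In this sketch the kept intermediate values feed `welford_update`, and `welford_merge` combines partial results. The population variance is `m2 / count`.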
* Adding support for TiledPermuteN
* Adding test
* Moving shuffle functions to a common place
* resolving commit hook
* fix formatting
[ROCm/composable_kernel commit: b11f53a484]