* Sync with the function interface of CShuffleEpilogue; fix the flatmm build failure

* Move code from solin/flatmm, which adds mfma 16x16x32 fp8 and optimizes flatmm

---------

Co-authored-by: solin <bingzhou@amd.com>

[ROCm/composable_kernel commit: 6a3960c1e1]