Description:
1. Changed all post-ops in u8s8s32o<s32|s8|u8|f32|bf16> to operate
on float data. All the post-ops are updated to operate on f32
by converting s32 accumulator registers to float at the end of k
loop. Changed all post-ops to operate on float data.
2. Added u8s8s32ou8 API which uses u8s8s32os32 kernels but store
the output in u8
AMD-Internal - SWLCSG-3366
Change-Id: Iab1db696d3c457fb06045cbd15ea496fd4b732a5