- Updated the BF16-to-F32 conversion function (used
  when the inputs are column-stored) so that it uses
  the correct strides while storing.
- The conversion of B can be multithreaded using the
  threads meant for IC compute. With the wrong strides
  in the kernel, this results in incorrect writes to
  the miscellaneous buffer.
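
The stride issue described above can be illustrated with a minimal sketch. It is not the actual kernel: the names `bf16_t`, `bf16_to_f32`, and `cvt_bf16_f32_slice`, the parameter list, and the choice to partition along rows are all assumptions made for illustration. It shows the source and destination each being indexed with their own row and column strides, so that slices converted by different threads land in disjoint parts of the destination buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical storage type: a BF16 value is the upper 16 bits of an F32. */
typedef uint16_t bf16_t;

/* Widen one BF16 value to F32 by placing its bits in the upper half. */
float bf16_to_f32(bf16_t v)
{
    uint32_t bits = (uint32_t)v << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Convert rows [k_begin, k_end) of a matrix with n columns.
 * Loads use the source strides (rs_src, cs_src); stores use the
 * destination strides (rs_dst, cs_dst). Mixing the two sets up is
 * the kind of error that makes one slice write outside its region. */
void cvt_bf16_f32_slice(const bf16_t *src, size_t rs_src, size_t cs_src,
                        float *dst, size_t rs_dst, size_t cs_dst,
                        size_t k_begin, size_t k_end, size_t n)
{
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = k_begin; i < k_end; ++i) {
            dst[i * rs_dst + j * cs_dst] =
                bf16_to_f32(src[i * rs_src + j * cs_src]);
        }
    }
}
```

In this sketch each thread would call `cvt_bf16_f32_slice` with its own `[k_begin, k_end)` range; for a column-stored source, `rs_src` is 1 and `cs_src` is the leading dimension.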
AMD-Internal: [CPUPL-7675]
Co-authored-by: Vishal-A <Vishal.Akula@amd.com>
Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>