Adding SGEMM tiny path for Zen architectures.
Needed to cover some performance gaps seen wrt MKL
Only allowing matrices that all fit into the L1 cache to the tiny path
Only tuned for single threaded operation at the moment
Todo: Tune cases where AVX2 performs better than AVX512 on Zen4
Todo: The current ranges are very conservative, there may be scope to increase the matrix sizes that go into the tiny path
AMD-Internal: CPUPL-7555
Co-authored-by: Rohan Rayan rohrayan@amd.com