blis/kernels/zen
Nallani Bhaskar 3a2e4c3db8 Added optimized single-threaded dtrsm small for left-side cases
Details:

1. Added optimized dtrsm kernels for all 8 left-side cases.
   Below are a few notable optimizations that improved performance:

   a. Loading, transposing (for transa cases), packing, and reusing
      the a10 block required for the GEMM operation. The block size
      grows from 0 to 8x(m-8) in steps of 8x8 as the TRSM solve
      proceeds from one end of the triangular matrix A to the other.
   b. Performing in-register transposes wherever required.
   c. Packing the 8 diagonal elements into one location, which uses
      the cache line efficiently.
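
   The blocking scheme in (a) and (c) can be sketched in plain C for
   one of the eight cases (left side, lower triangular, no transpose,
   non-unit diagonal). This is only an illustrative scalar sketch, not
   the intrinsics kernel from this commit: the function name
   trsm_lln_blocked, the column-major layout, and the diag[] buffer
   are assumptions made for the example.

```c
#include <stddef.h>

#define NB 8  /* block size, matching the 8x8 steps described above */

/* Solve A * X = B in place; B is overwritten with X.
 * A: m x m lower triangular, non-unit diagonal, column-major, leading dim lda.
 * B: m x n, column-major, leading dim ldb.
 * Hypothetical helper for illustration only; not the BLIS kernel. */
static void trsm_lln_blocked(size_t m, size_t n,
                             const double *a, size_t lda,
                             double *b, size_t ldb)
{
    for (size_t i = 0; i < m; i += NB) {
        size_t ib = (m - i < NB) ? (m - i) : NB;  /* rows in this block */

        /* Pack the diagonal elements of the current block into one
         * small buffer so the solve step reads them from one place. */
        double diag[NB];
        for (size_t r = 0; r < ib; ++r)
            diag[r] = a[(i + r) + (i + r) * lda];

        for (size_t j = 0; j < n; ++j) {
            double *bj = b + j * ldb;

            /* GEMM step: b1 -= a10 * x0, where a10 is the ib x i panel
             * to the left of the diagonal block. It is empty for the
             * first block and grows by 8 columns per iteration. */
            for (size_t k = 0; k < i; ++k) {
                double xk = bj[k];
                for (size_t r = 0; r < ib; ++r)
                    bj[i + r] -= a[(i + r) + k * lda] * xk;
            }

            /* TRSM step: forward substitution on the diagonal block. */
            for (size_t c = 0; c < ib; ++c) {
                bj[i + c] /= diag[c];
                double xc = bj[i + c];
                for (size_t r = c + 1; r < ib; ++r)
                    bj[i + r] -= a[(i + r) + (i + c) * lda] * xc;
            }
        }
    }
}
```

   For a 10x10 A, the loop runs twice: an 8x8 diagonal solve with an
   empty a10 panel, then a 2x2 diagonal solve preceded by a 2x8 GEMM
   update.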

2. Enabled calling dtrsm small for small sizes directly at the cblas
   level to avoid framework overhead, which is significant for very
   small sizes.
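
   A size-based dispatch of this kind might look like the sketch
   below. The cutoff value and the names SMALL_TRSM_MAX_DIM and
   use_small_trsm are hypothetical; the commit message does not state
   the actual threshold or the condition used.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cutoff; the real threshold is not given in the commit. */
#define SMALL_TRSM_MAX_DIM 100

/* Return true when the problem is a left-side case small enough to go
 * straight to the small kernel, skipping the general framework path. */
static bool use_small_trsm(char side, size_t m, size_t n)
{
    bool left = (side == 'L' || side == 'l');
    return left && m <= SMALL_TRSM_MAX_DIM && n <= SMALL_TRSM_MAX_DIM;
}
```

   A caller would check use_small_trsm() first and fall back to the
   regular framework path when it returns false.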

3. Thanks to SatishKumar.Nuggu@amd.com for implementing lln, llt, and
   lun, and to manideep.kurumella@amd.com for implementing the lut
   kernels using intrinsics.

4. Removed all older implementations of strsm that were not
   developed as per the guidelines; they can be referred to in
   older releases if required.

Change-Id: I66ad6ef364cbcf5c99a3c4a4dcac12929865ade6
2021-05-18 16:16:00 +05:30