Files
blis/kernels
Harsh Dave a7f600b3a4 Implemented ztrsm small kernels
Details:
-- AMD Internal Id: CPUPL-1702
-- Used 4x3 ZGEMM kernel with vector fma by utilizing ymm registers
   efficiently to produce 12 dcomplex outputs at a time
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Added ztrsm_small for in ztrsm_ BLAS path for single thread
   when (m,n)<500 and multithread (m+n)<128
-- Taken care of --disable_pre_inversion configuration
-- Achieved 10% average performance improvement for sizes less than 500
-- modularized all 16 combinations of trsm into 4 kernels

Change-Id: I3cb42a1385f6b3b82d6c470912242675789cce75
2021-11-12 08:58:54 +05:30
..
2021-11-12 08:58:52 +05:30
2020-09-29 16:52:18 -05:00
2021-11-12 08:58:54 +05:30
2021-04-27 11:09:48 +05:30
2020-07-22 18:24:26 +05:30
2021-03-08 19:04:17 +05:30