blis/kernels
Harsh Dave 590c763e22 Implemented ctrsm small kernels
Details:
-- AMD Internal Id: CPUPL-1702
-- Used an 8x3 CGEMM kernel with vector FMA, utilizing ymm registers
   efficiently to produce 24 scomplex outputs at a time
-- Packed matrix A so that it is cached effectively and reused
-- Implemented kernels using a macro-based modular approach
-- Added ctrsm_small in the ctrsm_ BLAS path for single-threaded runs
   when (m,n)<1000 and for multithreaded runs when (m+n)<320
-- Handled the --disable_pre_inversion configuration
-- Achieved 13% average performance improvement for sizes less than 1000
-- Modularized all 16 combinations of trsm into 4 kernels
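
The size thresholds and the tile arithmetic above can be sketched in C as follows. This is only an illustration under stated assumptions, not the code from this commit: `use_ctrsm_small` and the constant names are hypothetical, and "(m,n)<1000" is read here as "both m and n are below 1000". The accumulator count is simply what an 8x3 scomplex tile occupies in 256-bit registers; the actual kernel's register allocation may differ.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdbool.h>

/* Tile shape named in the commit message: 8x3 scomplex outputs. */
enum { TILE_ROWS = 8, TILE_COLS = 3 };

/* A 256-bit ymm register holds 8 floats, i.e. 4 single-precision
 * complex values (each scomplex is two 4-byte floats). */
enum { SCOMPLEX_PER_YMM = 4 };

/* Registers needed just to hold one 8x3 output tile:
 * 2 ymm per column times 3 columns. */
enum { YMM_ACCUMULATORS = (TILE_ROWS / SCOMPLEX_PER_YMM) * TILE_COLS };

static_assert(TILE_ROWS * TILE_COLS == 24,
              "an 8x3 tile yields 24 scomplex outputs");

/* Hypothetical helper (not a BLIS function): decides whether a ctrsm
 * call would take the small-kernel path described above.
 * Assumption: "(m,n)<1000" means m < 1000 and n < 1000. */
static bool use_ctrsm_small(long m, long n, bool single_threaded)
{
    if (single_threaded)
        return m < 1000 && n < 1000;
    return (m + n) < 320;
}
```

With these assumptions, a single-threaded 999x999 problem would take the small path, while a multithreaded 160x160 problem (m+n = 320) would not.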

Change-Id: I557c5bcd8cb7c034acd99ce0666bc411e9c4fe64
2021-11-12 08:58:55 +05:30