Files
blis/frame/compat
Harsh Dave 590c763e22 Implemented ctrsm small kernels
Details:
-- AMD Internal Id: CPUPL-1702
-- Used 8x3 CGEMM kernel with vector fma by utilizing ymm registers
   efficiently to produce 24 scomplex outputs at a time
-- Packed matrix A so it stays in cache and can be reused
-- Implemented kernels using macro based modular approach
-- Added ctrsm_small to the ctrsm_ BLAS path: single-threaded when
   (m,n)<1000, multithreaded when (m+n)<320
-- Handled the --disable_pre_inversion configuration
-- Achieved 13% average performance improvement for sizes less than 1000
-- Modularized all 16 combinations of trsm into 4 kernels

Change-Id: I557c5bcd8cb7c034acd99ce0666bc411e9c4fe64
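
The size thresholds in the commit message can be read as a simple dispatch predicate. The sketch below is only an illustration of that reading, not the code from this commit: the helper name `use_ctrsm_small` and its `nthreads` parameter are hypothetical, and it assumes that "(m,n)<1000" means both m and n are below 1000.

```c
#include <stdbool.h>

/* Hypothetical helper, not part of BLIS: decides whether a ctrsm call
 * would take the small-kernel path under the thresholds quoted in the
 * commit message. */
static bool use_ctrsm_small(long m, long n, int nthreads)
{
    if (nthreads <= 1) {
        /* Single thread: assumed to mean both dimensions below 1000. */
        return m < 1000 && n < 1000;
    }
    /* Multithreaded: the sum of the dimensions must be below 320. */
    return (m + n) < 320;
}
```

Under these assumptions a 500x500 solve would use the small kernels on one thread but not on several threads, since 500+500 is not below 320.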
2021-11-12 08:58:55 +05:30