mirror of
https://github.com/amd/blis.git
synced 2026-05-12 01:59:59 +00:00
Details:
-- AMD Internal Id: CPUPL-1702
-- Used an 8x3 CGEMM kernel with vector FMA that uses the ymm registers efficiently to produce 24 scomplex outputs at a time
-- Packed matrix A so that it is cached and reused effectively
-- Implemented the kernels using a macro-based modular approach
-- Added ctrsm_small to the ctrsm_ BLAS path for single-threaded runs when (m,n)<1000 and for multithreaded runs when (m+n)<320
-- Handled the --disable_pre_inversion configuration
-- Achieved a 13% average performance improvement for sizes less than 1000
-- Modularized all 16 combinations of trsm into 4 kernels

Change-Id: I557c5bcd8cb7c034acd99ce0666bc411e9c4fe64
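
The size thresholds quoted for the ctrsm_small path can be read as a simple dispatch check. The sketch below is illustrative only and is not taken from the BLIS sources: the function name `ctrsm_small_eligible` and its parameters are hypothetical, and it assumes that "(m,n)<1000" means both m and n must be below 1000.

```c
#include <stdbool.h>

/* Hypothetical helper, not part of BLIS: decides whether a ctrsm call
 * would be routed to the small-matrix path under the thresholds stated
 * in the commit message.
 * Assumption: "(m,n)<1000" means m < 1000 AND n < 1000. */
static bool ctrsm_small_eligible(long m, long n, int num_threads)
{
    if (num_threads <= 1) {
        /* Single-threaded: both dimensions must be below 1000. */
        return m < 1000 && n < 1000;
    }
    /* Multithreaded: the sum of the dimensions must be below 320. */
    return (m + n) < 320;
}
```

Under these assumptions, a 500x500 problem would qualify on one thread but not on several threads, since 500 + 500 is not below 320.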