mirror of
https://github.com/amd/blis.git
synced 2026-05-21 17:08:17 +00:00
Details:
-- AMD Internal Id: CPUPL-1702
-- Used a 4x3 ZGEMM kernel with vector FMA, using ymm registers efficiently to produce 12 dcomplex outputs at a time
-- Packed matrix A so that it is cached and reused effectively
-- Implemented kernels using a macro-based modular approach
-- Added ztrsm_small to the ztrsm_ BLAS path for single-threaded runs when (m,n)<500 and multithreaded runs when (m+n)<128
-- Handled the --disable_pre_inversion configuration
-- Achieved a 10% average performance improvement for sizes less than 500
-- Modularized all 16 combinations of trsm into 4 kernels

Change-Id: I3cb42a1385f6b3b82d6c470912242675789cce75
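
The size thresholds for the ztrsm_small path can be read as a simple dispatch predicate. The sketch below is illustrative only and is not taken from the BLIS sources: the function name `use_ztrsm_small`, the constant names, and the `num_threads` parameter are hypothetical, and it assumes that "(m,n)<500" means both m and n are below 500.

```c
#include <stdbool.h>

/* Hypothetical constants; the values come from the commit message. */
enum { SMALL_ST_LIMIT = 500, SMALL_MT_SUM_LIMIT = 128 };

/* Returns true when the small-matrix ztrsm path would be selected.
 * Assumption: "(m,n)<500" is read as m < 500 AND n < 500. */
static bool use_ztrsm_small(long m, long n, int num_threads)
{
    if (num_threads <= 1) {
        /* Single-threaded: both dimensions must be under the limit. */
        return m < SMALL_ST_LIMIT && n < SMALL_ST_LIMIT;
    }
    /* Multithreaded: the sum of the dimensions must be under the limit. */
    return (m + n) < SMALL_MT_SUM_LIMIT;
}
```

Under this reading, a 100x100 single-threaded call would take the small path, while a 64x64 call with four threads would not, since 64 + 64 is not below 128.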