mirror of
https://github.com/amd/blis.git
synced 2026-05-25 19:04:32 +00:00
1. The 5 LOOP LPGEMM path is in-efficient when A or B is a vector (i.e, m == 1 or n == 1). 2. An efficient implementation of lpgemv_rowvar_f32 is developed considering the b matrix reorder in case of m=1 and post-ops fusion. 3. When m = 1 the algorithm divide the GEMM workload in n dimension intelligently at a granularity of NR. Each thread work on A:1xk B:kx(>=NR) and produce C=1x(>NR). K is unrolled by 4 along with remainder loop. 4. When n = 1 the algorithm divide the GEMM workload in m dimension intelligently at a granularity of MR. Each thread work on A:(>=MR)xk B:kx1 and produce C = (>=MR)x1. When n=1 reordering of B is avoided to efficiently process in n one kernel. 5. Fixed few warnings while loading 2 f32 bias elements using _mm_load_sd using float pointer. Typecasted to (const double *) AMD-Internal: [SWLCSG-2391, SWLCSG-2353] Change-Id: If1d0b8d59e0278f5f16b499de1d629e63da5b599