mirror of
https://github.com/amd/blis.git
synced 2026-05-12 01:59:59 +00:00
Details:
1. Added optimized dtrsm kernels for all 8 right side cases
Below are few notable optimizations which improved performance
a. Loading, transposing (for transa cases), packing and reusing
of a01 block required for GEMM operation. The block size
increases from 0 to 6X(n-6) in steps of 6x6 while solving TRSM
from one end of A to other end of triangular A
b. Packing of 6 diagonal elements in one location helped to utilize
cache line efficiently
AMD-Internal: [CPUPL-1563]
Change-Id: Iabd37536216d5215fc69ee1f8ec671b52f1be9d3