Optimisation of DTRSM and ZTRSM

1. Replaced the extract instruction with a cast when accessing the first 128 bits,
   since the cast instruction needs no cycles whereas the extract takes a few cycles
2. Added prefetch of the A buffer during the GEMM computation
3. Added prefetch of the C11 buffer before the TRSM operation, with an offset of 7 to cs_c
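
The changes listed above can be illustrated with a minimal sketch. It is not
the code from this commit. The helper names (low128_extract, low128_cast,
prefetch_c11_columns) and the AVX_FN macro are hypothetical, and reading
"offset of 7 to cs_c" as "prefetch the address 7 elements into each column,
with columns cs_c elements apart" is an assumption. The target attribute is
GCC/Clang specific and is used only so the sketch builds without -mavx.

```c
#include <immintrin.h>
#include <stddef.h>

/* Assumption: GCC or Clang on x86-64; enables AVX for these functions only. */
#define AVX_FN __attribute__((target("avx")))

/* Before: take the first 128 bits with an extract (vextractf128),
 * which costs a few cycles when it is actually emitted. */
AVX_FN static void low128_extract(const double *src, double *dst)
{
    __m256d v = _mm256_loadu_pd(src);
    _mm_storeu_pd(dst, _mm256_extractf128_pd(v, 0));
}

/* After: take the first 128 bits with a cast, which only reinterprets
 * the register and generates no instruction. */
AVX_FN static void low128_cast(const double *src, double *dst)
{
    __m256d v = _mm256_loadu_pd(src);
    _mm_storeu_pd(dst, _mm256_castpd256_pd128(v));
}

/* Hypothetical prefetch of a C11 block before the solve: one prefetch per
 * column, 7 elements past the column start (assumed meaning of the offset).
 * The caller must ensure c11 + j * cs_c + 7 lies inside the buffer. */
static void prefetch_c11_columns(const double *c11, size_t cs_c, size_t n_cols)
{
    for (size_t j = 0; j < n_cols; ++j) {
        _mm_prefetch((const char *)(c11 + j * cs_c + 7), _MM_HINT_T0);
    }
}
```

Both low128_* helpers return the same two values; only the instruction used
differs. The same _mm_prefetch call pattern would apply to the A buffer
inside a GEMM loop, with the prefetch distance left as a tuning choice.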

With the above changes, performance improvements were observed in the single-threaded case

Change-Id: Id377c490ddac8b06384acfa9a6d89dbe11bbc7be
Mangala.V
2022-07-22 14:52:24 +05:30
committed by Mangala V
parent 737e08cd7a
commit 8504ef013d

File diff suppressed because it is too large