1. Replaced the extract instruction with a cast when accessing the first
(lower) 128 bits, as the cast costs no cycles but the extract takes a few cycles
2. Added prefetch of the A buffer while computing the GEMM operation
3. Added prefetch of the C11 buffer before the TRSM operation, with an offset of 7 to cs_c
With the above changes, performance improvements were observed in the single-threaded case
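
The first and third changes can be illustrated with a minimal sketch; it is not the kernel code from this change. The function names, the float data type, the _MM_HINT_T0 hint, and the reading of "offset of 7 to cs_c" as "7 elements past the start of each C11 column" are all assumptions made for illustration. The target attribute assumes GCC or Clang.

```c
#include <immintrin.h>
#include <stddef.h>

/* Assumption: GCC/Clang. Enables AVX for these functions only. */
#define AVX_FN __attribute__((target("avx")))

/* Sum the four floats of a 128-bit vector (helper for the demo). */
AVX_FN static float sum4(__m128 lo)
{
    float out[4];
    _mm_storeu_ps(out, lo);
    return out[0] + out[1] + out[2] + out[3];
}

/* Before: lower 128 bits taken with an extract (may emit vextractf128). */
AVX_FN static float low_sum_extract(const float *p)
{
    __m256 v = _mm256_loadu_ps(p);
    return sum4(_mm256_extractf128_ps(v, 0));
}

/* After: lower 128 bits taken with a cast, which generates no instruction. */
AVX_FN static float low_sum_cast(const float *p)
{
    __m256 v = _mm256_loadu_ps(p);
    return sum4(_mm256_castps256_ps128(v));
}

/* Hypothetical helper: prefetch each C11 column before it is used,
 * 7 elements past the column start (assumed meaning of the offset). */
static void prefetch_c11_columns(const float *c11, size_t cs_c, size_t n_cols)
{
    for (size_t j = 0; j < n_cols; ++j)
        _mm_prefetch((const char *)(c11 + j * cs_c + 7), _MM_HINT_T0);
}
```

Both low_sum_* variants return the same value; only the instruction used to reach the lower lane differs. The cast form applies only to the lower 128 bits, so the upper half still needs an extract.
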
Change-Id: Id377c490ddac8b06384acfa9a6d89dbe11bbc7be