- Vectorized alpha scaling of X vector using SSE instructions. This
can be done irrespective of incx.
- Added code to prefetch A matrix and Y vector to L1 cache
- Vectorized fringe case computation and non-unit stride computation
with SSE instructions.
- Increased unroll in unit stride cases for better register
utilization.
AMD-Internal: [CPUPL-2773]
Change-Id: I217e6ce9e3f5753ebe271c684abd9a2274fd2715