Previously, the DGEMM implementation used `dscalv` for cases
where the M dimension of matrix A is not a multiple of 24,
resulting in a ~40% performance drop.
This commit introduces specialized edge-case handling in the pack
kernel, so that these cases are packed directly by the kernel.
The new packing support significantly improves performance for
M dimensions that are not a multiple of 24.
- Removed reliance on `dscalv` for edge cases, addressing the
performance bottleneck.
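
For illustration only, the idea of handling the M edge case inside the
pack routine can be sketched as below. This is a minimal sketch, not the
kernel added by this commit: the names `pack_panel_sketch` and
`pack_a_sketch` are hypothetical, `MR = 24` is assumed from the
"multiple of 24" condition above, and zero-padding the partial panel to
full height is an assumption about the packed layout.

```c
#include <stddef.h>
#include <string.h>

#define MR 24 /* assumed micro-panel height (the "multiple of 24" above) */

/* Hypothetical helper, not a BLIS API: pack m_panel (<= MR) rows of a
 * column-major block of A with leading dimension lda into a buffer that
 * is always MR rows tall. Rows m_panel..MR-1 are zero-filled, so a
 * partial panel has the same layout as a full one. */
static void pack_panel_sketch(const double *a, size_t lda,
                              size_t m_panel, size_t k, double *p)
{
    for (size_t j = 0; j < k; ++j) {
        const double *src = a + j * lda;
        double *dst = p + j * MR;
        memcpy(dst, src, m_panel * sizeof(double));
        for (size_t i = m_panel; i < MR; ++i)
            dst[i] = 0.0;
    }
}

/* Pack an m x k column-major matrix into ceil(m / MR) micro-panels.
 * The last panel takes the same code path even when m % MR != 0. */
static void pack_a_sketch(const double *a, size_t lda,
                          size_t m, size_t k, double *p)
{
    for (size_t i = 0; i < m; i += MR) {
        size_t m_panel = (m - i < MR) ? (m - i) : MR;
        pack_panel_sketch(a + i, lda, m_panel, k, p);
        p += MR * k; /* each packed panel occupies MR * k doubles */
    }
}
```

With m = 25, for example, the sketch produces one full panel and one
panel holding a single valid row followed by 23 zero rows, without a
separate per-vector fallback for the remainder.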
AMD-Internal: [CPUPL-6677]
Change-Id: I150d13eb536d84f8eb439d7f4a77a04a0d0e6d60