Mirror of https://github.com/amd/blis.git (synced 2026-05-12 10:05:38 +00:00)
* Enhance Prefetching in 6x8m DGEMM SUP Kernel for Improved Performance

  This update optimizes the DGEMM kernel with well-suited prefetching techniques. Key changes include:

  - **Prefetching Strategy**:
    - Introduced prefetch instructions that load matrix data into cache ahead of computation.
    - Prefetching for matrix A is driven by the k-loop, targeting columns close to those currently being loaded and computed.
    - Prefetching for matrix B follows the same approach, targeting rows close to those currently being loaded and computed.
  - **Unrolling Optimization**:
    - Increased the k-loop unroll factor from 4 to 8, allowing more effective prefetching of matrices A and B.
    - This adjustment improves data locality and reduces loop-control overhead.
  - **Performance Improvements**:
    - Reduced memory-access latency by ensuring data is preloaded into cache.
    - Increased computational throughput by minimizing stalls caused by memory-access delays.
    - Improved overall efficiency of matrix multiplication operations.

  These enhancements speed up DGEMM computations by combining improved cache utilization with loop unrolling.

  AMD-Internal: [CPUPL-6435]

* Added an unroll-by-4 k-loop alongside the unroll-by-8 k-loop

* Added descriptive comments explaining the prefetch strategy

* More optimizations in 6x8m DGEMM SUP Kernel using prefetching

  - Restructured the main loop with 8× and 4× unrolling (k_iter_8, k_iter_4, k_left) for deeper pipeline utilization.
  - Introduced forward prefetching of A and of future B rows to better align with the unrolled access patterns.
  - Interleaved alpha scaling with the FMAs to compute alpha*AB + C more efficiently.

  These enhancements lead to faster DGEMM computations, leveraging improved cache utilization and loop unrolling to boost overall performance.
  AMD-Internal: [CPUPL-6435]

---------

Co-authored-by: Harsh Dave <harsdave@amd.com>
Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>