blis/kernels/haswell at dc06cdb621783bbb5498e02cdad13cf0ef6db1b0 - blis

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-05-11 17:50:00 +00:00

Dave, Harsh 7c6c04a457 More optimizations in 6x8m DGEMM SUP Kernel using prefetching (#34)

* Enhance Prefetching in 6x8m DGEMM SUP Kernel for Improved Performance

This update optimizes the DGEMM kernel by adding prefetching suited to its access pattern.

Key changes include:

- **Prefetching Strategy**:
  - Introduced prefetching instructions to load matrix data into cache ahead of computation.
  - Prefetching for matrix A is based on the k-loop, starting from columns close to the ones being loaded and computed.
  - Prefetching for matrix B follows a similar approach, focusing on rows close to the ones being loaded and computed.

- **Unrolling Optimization**:
  - Increased the unroll factor of the k-loop from 4 to 8, allowing for more efficient prefetching of matrices A and B.
  - This adjustment enhances data locality and reduces the overhead associated with loop control.

- **Performance Improvements**:
  - Reduced memory access latency by ensuring data is preloaded into cache.
  - Enhanced computational throughput by minimizing stalls due to memory access delays.
  - Improved overall efficiency of matrix multiplication operations.

These enhancements lead to faster DGEMM computations, leveraging improved cache utilization and loop unrolling to boost overall performance.
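
The prefetching and unrolling described above can be illustrated with a minimal C sketch. It is not the kernel from this commit, which is written in inline assembly. It assumes a 6x8 tile, A stored one column (6 doubles) per k step, B stored one row (8 doubles) per k step, and the GCC/Clang builtin `__builtin_prefetch`. The names `gemm_6x8_prefetch_sketch`, `rank1`, and `UNROLL` are hypothetical, and the prefetch distance of one unrolled trip is an untuned assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed tile shape and unroll factor; UNROLL doubles as the prefetch
   distance here, which is an illustrative choice, not a tuned value. */
enum { MR = 6, NR = 8, UNROLL = 8 };

/* One k step: ab += a_col * b_row (a rank-1 update of the MR x NR tile). */
static inline void rank1(const double *a_col, const double *b_row,
                         double ab[MR][NR])
{
    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j)
            ab[i][j] += a_col[i] * b_row[j];
}

/* Accumulate ab += A * B, where a holds k columns of MR doubles and
   b holds k rows of NR doubles. */
void gemm_6x8_prefetch_sketch(size_t k, const double *a, const double *b,
                              double ab[MR][NR])
{
    assert(a != NULL && b != NULL);
    size_t p = 0;

    /* Main loop: eight k steps per trip. */
    for (; p + UNROLL <= k; p += UNROLL) {
        /* Hint the first cache line of the A columns and B rows that the
           next trip will read. A real kernel would issue several hints
           per trip; one per matrix keeps the sketch short. */
        if (p + UNROLL < k) {
            __builtin_prefetch(a + (p + UNROLL) * MR, 0, 3);
            __builtin_prefetch(b + (p + UNROLL) * NR, 0, 3);
        }
        rank1(a + (p + 0) * MR, b + (p + 0) * NR, ab);
        rank1(a + (p + 1) * MR, b + (p + 1) * NR, ab);
        rank1(a + (p + 2) * MR, b + (p + 2) * NR, ab);
        rank1(a + (p + 3) * MR, b + (p + 3) * NR, ab);
        rank1(a + (p + 4) * MR, b + (p + 4) * NR, ab);
        rank1(a + (p + 5) * MR, b + (p + 5) * NR, ab);
        rank1(a + (p + 6) * MR, b + (p + 6) * NR, ab);
        rank1(a + (p + 7) * MR, b + (p + 7) * NR, ab);
    }

    /* Remainder: fewer than eight k steps left. */
    for (; p < k; ++p)
        rank1(a + p * MR, b + p * NR, ab);
}
```

The longer loop body gives each prefetch more independent work to hide behind, which is one plausible reading of why the unroll factor and the prefetching were changed together.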

AMD-Internal: [CPUPL-6435]

* Added k-loop unrolling by 4 alongside unrolling by 8

* Added descriptive comments explaining prefetch strategy

* More optimizations in 6x8m DGEMM SUP Kernel using prefetching

- Restructured main loop with 8× and 4× unrolling (k_iter_8, k_iter_4, k_left) for deeper pipeline utilization.
- Introduced forward prefetching for A and future B rows to better align with unrolled access patterns.
- Interleaved alpha scaling with FMA to compute alpha*AB + C more efficiently.

These enhancements lead to faster DGEMM computations, leveraging improved cache utilization and loop unrolling
to boost overall performance.
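
As a rough illustration of the loop structure named above, the following C sketch splits k into `k_iter_8`, `k_iter_4`, and `k_left` and then applies the update C = alpha*AB + C. It is a sketch under assumptions, not the assembly kernel itself: the function name `gemm_6x8_update_sketch`, the helper `k_step`, the packed layout of A and B, and the row-major C with leading dimension `ldc` are all hypothetical. The final update is written as a plain multiply-add, which a compiler targeting FMA hardware may contract into a single instruction.

```c
#include <assert.h>
#include <stddef.h>

enum { MR = 6, NR = 8 };  /* assumed tile shape */

/* One k step: accumulate a rank-1 update, then advance both pointers. */
static inline void k_step(const double **a, const double **b,
                          double ab[MR][NR])
{
    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j)
            ab[i][j] += (*a)[i] * (*b)[j];
    *a += MR;  /* next column of A */
    *b += NR;  /* next row of B */
}

/* C := alpha * A * B + C for one MR x NR tile of a row-major C. */
void gemm_6x8_update_sketch(size_t k, double alpha,
                            const double *a, const double *b,
                            double *c, size_t ldc)
{
    assert(ldc >= NR);
    double ab[MR][NR] = {{0.0}};

    size_t k_iter_8 = k / 8;        /* trips of the 8x-unrolled loop  */
    size_t k_iter_4 = (k % 8) / 4;  /* 0 or 1 trip of the 4x loop     */
    size_t k_left   = k % 4;        /* leftover single steps (0 to 3) */

    for (size_t it = 0; it < k_iter_8; ++it) {
        k_step(&a, &b, ab); k_step(&a, &b, ab);
        k_step(&a, &b, ab); k_step(&a, &b, ab);
        k_step(&a, &b, ab); k_step(&a, &b, ab);
        k_step(&a, &b, ab); k_step(&a, &b, ab);
    }
    for (size_t it = 0; it < k_iter_4; ++it) {
        k_step(&a, &b, ab); k_step(&a, &b, ab);
        k_step(&a, &b, ab); k_step(&a, &b, ab);
    }
    for (size_t it = 0; it < k_left; ++it)
        k_step(&a, &b, ab);

    /* Scale and accumulate in one expression per element instead of a
       separate alpha pass followed by an add. */
    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j)
            c[i * ldc + j] = alpha * ab[i][j] + c[i * ldc + j];
}
```

With k = 23, for example, the split is two 8x trips, one 4x trip, and three leftover steps, so every k step is covered exactly once.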

AMD-Internal: [CPUPL-6435]

---------

Co-authored-by: Harsh Dave <harsdave@amd.com>
Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>

2025-07-01 15:02:50 +05:30

Code cleanup: Copyright notices

2024-08-05 15:35:08 -04:00

More optimizations in 6x8m DGEMM SUP Kernel using prefetching (#34)

2025-07-01 15:02:50 +05:30

bli_kernels_haswell.h

Code cleanup: AMD copyright notice

2023-11-23 08:54:31 -05:00