blis/kernels/zen at 8f99d8a5bb95fcbbbdee03ea344866aee51220df - blis

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-06-29 18:57:23 +00:00

Files

harsdave 54b46ec1ed Enhance 24x8 DGEMM SUP/Tiny Kernel Performance with Optimized Loops and Edge Kernels

This patch introduces comprehensive optimizations to the DGEMM kernel, focusing on loop
efficiency and edge kernel performance. The following technical improvements have been implemented:

1. **IR Loop Optimization:**
- The IR loop has been re-implemented in hand-written assembly to eliminate the overhead associated
with `begin_asm` and `end_asm` calls, resulting in more efficient execution.

2. **JR Loop Integration:**
- The JR loop is now incorporated into the micro kernel. This integration avoids the repetitive overhead
of stack frame management for each JR iteration, thereby enhancing loop performance.

3. **Kernel Decomposition Strategy:**
- The m dimension is decomposed into specific sizes: 20, 18, 17, 16, 12, 11, 10, 9, 8, 4, 2, and 1.
- For remaining cases, masked variants of edge kernels are utilized to handle the decomposition efficiently.

1. **Interleaved Scaling by Alpha:**
- Scaling by the alpha factor is interleaved with load instructions to optimize the instruction pipeline
and reduce latency.

2. **Efficient Mask Preparation:**
- Masks are prepared within inline assembly code only at points where masked load-store operations are necessary,
minimizing unnecessary overhead.

3. **Broadcast Instruction Optimization:**
- In edge kernels where each FMA (Fused Multiply-Add) operation requires a broadcast without subsequent reuse,
the broadcast instruction is replaced with `mem_1to8`.
- This allows the compiler to optimize by assigning separate vector registers for broadcasting, thus avoiding
dependency chains and improving execution efficiency.

4. **C Matrix Update Optimization:**
- During the update of the C matrix in edge kernels, columns are pre-loaded into multiple vector registers.
This approach breaks dependency chains during FMA operations following the scaling by alpha, thereby mitigating
performance bottlenecks and enhancing throughput.

These optimizations collectively improve the performance of the DGEMM kernel, particularly in handling edge cases and
reducing overhead in critical loops. The changes are expected to yield significant performance gains in matrix multiplication
operations.

This patch also involves changes for tiny gemm interface. A light
interface for calling kernels and removing calls to avx2 dgemm kernels
as we use avx512 dgemm kernels for all the sizes for zen4 and zen5.

For zen4 and zen5 when A matrix transposed(CRC, RRC), tiny kernel does not have
the support to handle such inputs and thus such inputs are routed to
gemm_small path.

AMD-Internal: [CPUPL-6054]
Change-Id: I57b430f9969ca39aa111b54fa169e4225b900c4a

2024-12-13 00:03:00 -05:00

SCALV alpha=zero BLAS compliance

2024-09-16 07:10:28 -04:00

Fixed compiler warnings due to prefetch in AVX2 and AVX512 kernels

2024-09-12 11:37:06 +05:30

DGEMV Optimizations for NO_TRANSPOSE Cases

2024-12-12 10:26:50 -05:00

Enhance 24x8 DGEMM SUP/Tiny Kernel Performance with Optimized Loops and Edge Kernels

2024-12-13 00:03:00 -05:00

lpgemm

Sigmoid and Tanh post-operation support for int8 API's.