blis/kernels
Dave, Harsh e39cf64708 Optimized avx512 ZGEMM kernel and edge-case handling (#147)
* Optimized avx512 ZGEMM kernel and edge-case handling
  Edge kernel implementation:
   - Refactored all of the zgemm kernels to process micro-tiles efficiently
   - Specialized sub-kernels are added to handle the leftover m dimension:
     12MASK, 8, 8MASK, 4, 4MASK, 2.
   - 12MASK edge kernel handles 11, 10, 9 m_left using 2 full zmm
     load/store and 1 masked load/store.
   - Similarly 8MASK handles 7, 6, 5 m_left using 1 full zmm load/store and
     1 masked load/store.
   - 4MASK handles 3, 1 m_left using 1 masked load/store.

   - The ZGEMM kernel now internally decomposes the m dimension as follows.
     The main kernel is 12x4, with the following edge kernels handling the
     leftover m dimension:
     12MASKx4 (handles 11x4, 10x4, 9x4)
     8x4      (handles 8x4)
     8MASKx4  (handles 7x4, 6x4, 5x4)
     4x4      (handles 4x4)
     4MASKx4  (handles 3x4, 1x4)
     2x4      (handles 2x4)

   - Similarly, the n_left kernels (12x3, 12x2 and 12x1) decompose the m
     dimension, with the edge kernels 12MASKxN_LEFT(3, 2, 1),
     8xN_LEFT(3, 2, 1), 8MASKxN_LEFT(3, 2, 1), 4xN_LEFT(3, 2, 1),
     4MASKxN_LEFT(3, 2, 1) and 2xN_LEFT(3, 2, 1) handling the leftover
     m dimension.
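
The m_left decomposition described above can be modelled with a short
Python sketch. This is an illustration of the mapping only, not the BLIS
code: the function names are hypothetical, and the mask layout (one bit per
double, so two bits per dcomplex element) is an assumption about how the
masked load/store might be encoded. It relies on a 512-bit zmm register
holding four 16-byte dcomplex values.

```python
# Illustrative model only; names and mask layout are assumptions.
ZMM_DCOMPLEX = 4  # a 512-bit zmm register holds four 16-byte dcomplex values
MR = 12           # row count of the main 12x4 micro-kernel


def select_m_edge_kernel(m_left):
    """Map a leftover row count (1..11) to the edge kernel family covering it."""
    if not 1 <= m_left < MR:
        raise ValueError("m_left must be in 1..11")
    if m_left >= 9:
        return "12MASK"   # 11, 10, 9
    if m_left == 8:
        return "8"
    if m_left >= 5:
        return "8MASK"    # 7, 6, 5
    if m_left == 4:
        return "4"
    if m_left == 2:
        return "2"
    return "4MASK"        # 3, 1


def masked_access_plan(m_left):
    """Return (full_zmm_ops, mask) per column for the MASK kernels.

    The mask is assumed to carry one bit per double in the partial register.
    """
    if not select_m_edge_kernel(m_left).endswith("MASK"):
        raise ValueError("m_left is handled by an unmasked kernel")
    full = m_left // ZMM_DCOMPLEX
    rem = m_left % ZMM_DCOMPLEX
    return full, (1 << (2 * rem)) - 1


def kernel_name(m_left, n):
    """Combine the m edge kernel with the column count (4 or n_left)."""
    return f"{select_m_edge_kernel(m_left)}x{n}"
```

For example, masked_access_plan(11) gives two full zmm accesses plus a
mask covering three dcomplex elements, which matches the 12MASK
description above.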

  Threshold tuning:
   - Inputs with an odd m dimension are forced onto the avx512 kernels in
     the tiny path, because the avx2 kernels invoke gemv calls for
     m_left=1 (odd m dimension of the matrix). The gemv function call adds
     overhead for very small sizes and results in suboptimal performance.

   - A condition check "m%2 == 0" is added alongside the threshold checks
     so that inputs with an odd m dimension use the avx512 zgemm kernel.

   - Thresholds are changed to route all inputs to the tiny path. This
     eliminates the dependency on the avx2 zgemm_small path when the A and
     B matrix storage is 'N' (no transpose) or 'T' (transpose).

   - However, the tiny path re-uses the zgemm sup kernels, which do not
     support conjugate-transpose storage of matrices. For such storage of
     the A or B matrix we still rely on the avx2 zgemm_small kernel.
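
The routing rules above can be summarised in a small Python sketch. It is
one reading of the commit message, not the BLIS dispatch code: the function
name route_zgemm, the use of 'C' for conjugate transpose, and the
fits_avx2_threshold predicate (standing in for threshold values that the
message does not state) are all assumptions made for illustration.

```python
# Illustrative model only; names and the threshold predicate are hypothetical.
def route_zgemm(m, n, k, transa, transb, fits_avx2_threshold):
    """Return (path, isa) for a zgemm call under the rules described above.

    transa/transb are 'N', 'T' or 'C' (conjugate transpose, assumed encoding).
    fits_avx2_threshold(m, n, k) is a caller-supplied stand-in for the
    unspecified size thresholds that favour the avx2 kernels.
    """
    # The sup kernels reused by the tiny path do not support conjugate
    # transpose, so those inputs stay on the avx2 zgemm_small kernel.
    if transa == "C" or transb == "C":
        return ("small", "avx2")
    # avx2 is considered only for even m; odd m would hit the gemv call
    # for m_left=1, so it is forced onto the avx512 kernel.
    if m % 2 == 0 and fits_avx2_threshold(m, n, k):
        return ("tiny", "avx2")
    return ("tiny", "avx512")
```

With a predicate that always returns True, route_zgemm(5, 4, 4, "N", "N",
...) still selects the avx512 kernel because m is odd, while m = 4 selects
avx2.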

  gtest changes:
   - Removed the zgemm edge kernel functions (8x4, 4x4, 2x4 and fx4) and
     their respective testing instances from gtest.

AMD-Internal: [CPUPL-7203]

---------

Co-authored-by: harsdave <harsdave@amd.com>
2025-08-21 09:46:10 +05:30