mirror of
https://github.com/amd/blis.git
synced 2026-04-19 23:28:52 +00:00
* Optimized avx512 ZGEMM kernel and edge-case handling
Edge kernel implementation:
- Refactored all of the zgemm kernels to process micro-tiles efficiently
- Added specialized sub-kernels to handle the leftover m dimension:
12MASK, 8, 8MASK, 4, 4MASK, 2.
- The 12MASK edge kernel handles m_left of 11, 10, 9 using 2 full zmm
load/stores and 1 masked load/store.
- Similarly, 8MASK handles m_left of 7, 6, 5 using 1 full zmm load/store
and 1 masked load/store.
- 4MASK handles m_left of 3, 1 using 1 masked load/store.
- The ZGEMM kernel now decomposes the m dimension internally. The main
kernel is 12x4, with the following edge kernels to handle the leftover
m dimension:
12MASKx4 (handles 11x4, 10x4, 9x4)
8x4 (handles 8x4)
8MASKx4 (handles 7x4, 6x4, 5x4)
4x4 (handles 4x4)
4MASKx4 (handles 3x4, 1x4)
2x4 (handles 2x4)
- Similarly, each n_left kernel (12x3, 12x2 and 12x1) has the edge kernels
12MASKxN_LEFT(3, 2, 1), 8xN_LEFT(3, 2, 1), 8MASKxN_LEFT(3, 2, 1),
4xN_LEFT(3, 2, 1), 4MASKxN_LEFT(3, 2, 1) and 2xN_LEFT(3, 2, 1) to handle
the leftover m dimension.
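The m_left decomposition above can be sketched in plain C. This is only an
illustration of the mapping listed in this message, not the kernel code
itself: the names zgemm_m_edge_t, zgemm_pick_m_edge, zgemm_full_zmm_count
and zgemm_tail_mask are hypothetical. The mask layout (one bit per double
lane, so two bits per dcomplex element, as an 8-bit mask for
_mm512_maskz_loadu_pd would need) is an assumption; the real kernels may
build their masks differently.

```c
#include <stdint.h>

/* Edge kernels listed in this commit message for the 12x4 main kernel.
   The enum and function names are hypothetical, for illustration only. */
typedef enum {
    EDGE_NONE = 0, /* m_left outside 1..11: no edge kernel needed */
    EDGE_12MASK,   /* m_left = 11, 10, 9 */
    EDGE_8,        /* m_left = 8 */
    EDGE_8MASK,    /* m_left = 7, 6, 5 */
    EDGE_4,        /* m_left = 4 */
    EDGE_4MASK,    /* m_left = 3, 1 */
    EDGE_2         /* m_left = 2 */
} zgemm_m_edge_t;

/* Map the leftover row count to the edge kernel that handles it. */
static zgemm_m_edge_t zgemm_pick_m_edge(int m_left)
{
    if (m_left <= 0 || m_left >= 12) return EDGE_NONE;
    if (m_left >= 9) return EDGE_12MASK;
    if (m_left == 8) return EDGE_8;
    if (m_left >= 5) return EDGE_8MASK;
    if (m_left == 4) return EDGE_4;
    if (m_left == 2) return EDGE_2;
    return EDGE_4MASK; /* m_left is 3 or 1 */
}

/* One zmm register holds 4 dcomplex values (8 doubles), so the number of
   full zmm load/stores per column is m_left / 4. */
static int zgemm_full_zmm_count(int m_left)
{
    return m_left / 4;
}

/* Assumed mask layout for the final masked load/store of a MASK kernel:
   one bit per double lane, lowest lanes first. Only meaningful for
   m_left values handled by 12MASK, 8MASK or 4MASK. */
static uint8_t zgemm_tail_mask(int m_left)
{
    int rem = m_left % 4; /* dcomplex elements in the partial register */
    return (uint8_t)((1u << (2 * rem)) - 1u);
}
```

For example, m_left = 11 maps to 12MASK with 2 full zmm load/stores and a
tail mask covering 3 dcomplex elements, which matches the description
above.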
Threshold tuning:
- Forced inputs with an odd m dimension onto the avx512 kernels in the
tiny path, because the avx2 kernels invoke gemv for m_left=1 (odd m
dimension of the matrix). The gemv function call adds overhead for very
small sizes and results in suboptimal performance.
- Added the condition check "m%2 == 0" alongside the threshold checks so
that inputs with an odd m dimension use the avx512 zgemm kernel.
- Changed the thresholds to route all inputs to the tiny path, eliminating
the dependency on the avx2 zgemm_small path when the A and B matrix
storage is 'N' (no transpose) or 'T' (transpose).
- However, the tiny path reuses the zgemm sup kernels, which do not
support conjugate-transpose storage of matrices. For such storage of the
A or B matrix we still rely on the avx2 zgemm_small kernel.
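One reading of the routing rules above, as a minimal C sketch. All names
here (zgemm_path_t, zgemm_pick_path, zgemm_tiny_isa_t, zgemm_tiny_pick_isa)
are hypothetical and are not BLIS functions. The actual size thresholds are
not given in this message, so the sketch takes the result of the avx2
threshold check as a boolean argument rather than inventing numbers. The
'C' flag for conjugate transpose follows the usual BLAS convention and is
an assumption here.

```c
#include <stdbool.h>

/* Hypothetical labels for the two code paths named in this message. */
typedef enum { PATH_AVX512_TINY, PATH_AVX2_SMALL } zgemm_path_t;

/* Hypothetical labels for the kernel families inside the tiny path. */
typedef enum { KERNEL_AVX2, KERNEL_AVX512 } zgemm_tiny_isa_t;

/* Assumed BLAS-style flag: 'C' (or 'c') means conjugate transpose. */
static bool is_conj_trans(char trans)
{
    return trans == 'C' || trans == 'c';
}

/* The tiny path reuses the sup kernels, which do not support
   conjugate-transpose storage, so those inputs stay on the avx2
   zgemm_small path. 'N' and 'T' inputs go to the tiny path. */
static zgemm_path_t zgemm_pick_path(char transa, char transb)
{
    if (is_conj_trans(transa) || is_conj_trans(transb))
        return PATH_AVX2_SMALL;
    return PATH_AVX512_TINY;
}

/* Inside the tiny path: avx2 kernels are considered only when the
   (unspecified) threshold check passes AND m is even. An odd m always
   goes to avx512, which avoids the gemv call for m_left=1. */
static zgemm_tiny_isa_t zgemm_tiny_pick_isa(long m, bool avx2_threshold_met)
{
    if (avx2_threshold_met && (m % 2 == 0))
        return KERNEL_AVX2;
    return KERNEL_AVX512;
}
```

Passing the threshold result in as a flag keeps the sketch independent of
the tuned values, which live in the source and may differ per
architecture.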
gtest changes:
- Removed the zgemm edge kernel functions (8x4, 4x4, 2x4 and fx4) and
their respective test instances from gtest.
AMD-Internal: [CPUPL-7203]
---------
Co-authored-by: harsdave <harsdave@amd.com>