Optimize ZGEMM Packing Kernel for M-Dimension Edge Cases (cdim0 1–11) (#135)

* Optimize ZGEMM Packing Kernel for M-Dimension Edge Cases (cdim0 1–11)

- Introduced specialized AVX-512 assembly paths for cdim0 edge cases (1–11), replacing the inefficient zscalv fallback.
- Refactored the cdim0 == mnr condition into a switch statement to support multiple optimized cases.
- Added three new macros for column-stored packing with distinct masking patterns.
- Implemented 11 dedicated handlers for row-stored and column-stored A matrix packing
  with efficient masked loads/stores for partial data.

    AMD-Internal: [CPUPL-6677]

Co-authored-by: harsh dave <harsdave@amd.com>

* Update bli_packm_zen4_asm_z12xk.c

---------

Co-authored-by: harsh dave <harsdave@amd.com>
Co-authored-by: Sharma, Shubham <Shubham.Sharma3@amd.com>
Dave, Harsh
2025-08-18 12:38:45 +05:30
committed by GitHub
parent 33ea09d967
commit b88bea6e72


@@ -246,6 +246,10 @@ void bli_zpackm_zen4_asm_12xk
const uint64_t lda = lda0;
const uint64_t ldp = ldp0;
// Note: k_left is currently computed as k % 4, which keeps the mask calculation safe.
// Take care if this logic changes (e.g., taking k modulo a larger value): the shift count
// (k_left * 2) must stay at or below 8 for the result to fit in the uint8_t mask, and
// below the bit width of int to avoid undefined behavior in the shift.
uint8_t mask = ((1 << (k_left*2)) - 1);
if (mask == 0) mask = 0xff;
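
The mask arithmetic in this hunk can be checked in isolation. The following is a minimal sketch, not code from the commit: `k_left_mask` is a hypothetical helper name, and the reading that each complex element occupies two mask bits (one per double) is an assumption based on the `k_left*2` shift count.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of the commit): reproduces the mask
 * expression from the hunk for a given k_left.
 * Assumption: one double-complex element covers two mask bits. */
static uint8_t k_left_mask(uint64_t k_left)
{
    /* With k_left = k % 4 the shift count is at most 6. A value of 4
     * still works (256 - 1 = 0xff); anything larger no longer fits in
     * 8 bits, so it is rejected here. */
    assert(k_left <= 4);

    uint8_t mask = (uint8_t)((1u << (k_left * 2)) - 1u);

    /* k_left == 0 yields an empty mask, which the hunk replaces with
     * a full 8-bit mask. */
    if (mask == 0) mask = 0xff;
    return mask;
}
```

Under these assumptions, `k_left` values 0, 1, 2, and 3 give masks `0xff`, `0x03`, `0x0f`, and `0x3f` respectively. The sketch uses an unsigned literal (`1u`) so the shift never touches the sign bit; the hunk itself uses a plain `1`, which behaves identically for the shift counts that `k % 4` can produce.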