Optimize ZGEMM Packing Kernel for M-Dimension Edge Cases (cdim0 1–11) (#135)

* Optimize ZGEMM Packing Kernel for M-Dimension Edge Cases (cdim0 1–11)

  - Introduced specialized AVX-512 assembly paths for cdim0 edge cases (1–11), replacing inefficient zscalv fallback.
  - Refactored cdim0 == mnr condition into a switch statement to support multiple optimized cases.
  - Added three new macros for column-stored packing with distinct masking patterns.
  - Implemented 11 dedicated handlers for row and column stored A matrix packing with efficient masked loads/stores for partial data.

  AMD-Internal: [CPUPL-6677]

  Co-authored-by: harsh dave <harsdave@amd.com>

* Update bli_packm_zen4_asm_z12xk.c

---------

Co-authored-by: harsh dave <harsdave@amd.com>
Co-authored-by: Sharma, Shubham <Shubham.Sharma3@amd.com>
2026-04-19 23:28:52 +00:00 · 2025-08-18 12:38:45 +05:30
parent 33ea09d967
commit b88bea6e72
1 changed files with 4 additions and 0 deletions
--- a/kernels/zen4/1m/bli_packm_zen4_asm_z12xk.c
+++ b/kernels/zen4/1m/bli_packm_zen4_asm_z12xk.c
@@ -246,6 +246,10 @@ void bli_zpackm_zen4_asm_12xk
    const uint64_t lda    = lda0;
    const uint64_t ldp    = ldp0;

+	 // Note: k_left is currently initialized as k % 4, which keeps the mask calculation safe.
+	 // Be cautious if modifying this logic in the future (e.g., taking k modulo a larger value),
+	 // as (k_left * 2) may overflow when used as a shift count, potentially causing undefined
+	 // behavior or incorrect masks for uint8_t. Ensure k_left remains within a safe range (e.g., < 128).
    uint8_t mask = ((1 << (k_left*2)) - 1);
    if (mask == 0) mask = 0xff;
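
The mask computation in this hunk can be checked in isolation. The following is a minimal sketch, not code from the kernel: `k_left_mask` is a hypothetical helper introduced only for illustration, and it assumes `k_left` is derived as `k % 4`, as the added comment states. It shows that the four possible remainders produce masks that all fit in a `uint8_t`, and that a remainder of zero falls back to the full mask `0xff`.

```c
#include <stdint.h>

/* Hypothetical helper (not part of BLIS): mirrors the mask computation
   shown in the hunk, assuming k_left = k % 4. */
static uint8_t k_left_mask(uint64_t k)
{
    const uint64_t k_left = k % 4;   /* always in the range 0..3 */

    /* Two mask bits per leftover k iteration; the shift count is at most 6,
       so the result fits in 8 bits and the shift is well defined. */
    uint8_t mask = (uint8_t)((1u << (k_left * 2)) - 1u);

    /* k_left == 0 yields an empty mask, which is replaced by the full mask. */
    if (mask == 0) mask = 0xff;

    return mask;
}
```

With this helper, `k` values of 4, 1, 6 and 7 give `0xff`, `0x03`, `0x0f` and `0x3f` respectively. If the modulus were ever raised so that `k_left * 2` could exceed 8, the value would no longer fit in the 8-bit mask and would be truncated on assignment, which is the hazard the comment warns about.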