Added bf16s4f32 kernels to handle m=4 cases

Details:
- In WOQ, if m = 4, special-case kernels are added where the s4->bf16 conversion happens inside the compute kernel and packing is avoided. For all other cases, the B matrix is dequantized and packed at the KC loop level, and the native bf16 kernels are reused at the compute level.
- Fixes in bench to avoid accuracy failures when the output datatype is bf16.

Change-Id: Ie8db42da536891693d5e82a5336b66514a50ccb2
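The idea of converting s4 to bf16 inside the compute kernel, rather than dequantizing and packing B beforehand, can be illustrated with a scalar sketch. This is not the kernel code from this commit: the helper names (`s4_to_int`, `f32_to_bf16`, `bf16_to_f32`, `dot_bf16_s4`), the low-nibble-first packing order, and the single float `scale` are all assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: widen a bf16 bit pattern to float.
   bf16 is the upper 16 bits of an IEEE-754 float32. */
static float bf16_to_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical helper: narrow float to bf16 with round-to-nearest-even.
   NaN handling is omitted to keep the sketch short. */
static uint16_t f32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)(bits >> 16);
}

/* Sign-extend a 4-bit two's-complement value (range -8..7). */
static int s4_to_int(uint8_t nibble)
{
    nibble &= 0x0F;
    return (nibble & 0x08) ? (int)nibble - 16 : (int)nibble;
}

/* Dot product of k bf16 elements of A with k packed s4 elements of B.
   Each s4 value is dequantized to bf16 on the fly, so no dequantized
   or packed copy of B is ever materialized. Accumulation is in f32.
   Assumption: two s4 values per byte, low nibble first. */
static float dot_bf16_s4(const uint16_t *a, const uint8_t *b_s4,
                         size_t k, float scale)
{
    assert(a != NULL && b_s4 != NULL);
    float acc = 0.0f;
    for (size_t p = 0; p < k; ++p) {
        uint8_t byte = b_s4[p / 2];
        uint8_t nib = (p % 2 == 0) ? (uint8_t)(byte & 0x0F)
                                   : (uint8_t)(byte >> 4);
        uint16_t b_bf16 = f32_to_bf16((float)s4_to_int(nib) * scale);
        acc += bf16_to_f32(a[p]) * bf16_to_f32(b_bf16);
    }
    return acc;
}
```

For small m, each B element is used only a few times, so a separate dequantize-and-pack pass over B is harder to amortize; that is presumably the motivation for fusing the conversion into the m = 4 kernels, though the commit message does not state it explicitly.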
2026-05-03 22:11:12 +00:00 · 2024-09-02 10:39:49 +00:00
parent 711dce14d0
commit 2e1cc2f14a
8 changed files with 4961 additions and 17 deletions
--- a/addon/aocl_gemm/frame/lpgemm_5loop_interface_apis.h
+++ b/addon/aocl_gemm/frame/lpgemm_5loop_interface_apis.h
@@ -85,7 +85,7 @@ void lpgemm_rowvar_ ## LP_SFX \
       const B_type*         b, \
       const dim_t           rs_b, \
       const dim_t           cs_b, \
-       const AOCL_MEMORY_TAG mtag_b, \
+       AOCL_MEMORY_TAG       mtag_b, \
       C_type*               c, \
       const dim_t           rs_c, \
       const dim_t           cs_c, \