Added bf16s4f32 kernels to handle m=4 cases

Details:
- In WOQ (weight-only quantization), special-case kernels are
  added for m = 4: the s4->bf16 conversion happens inside the
  compute kernel, so packing is avoided. For all other cases, the
  B matrix is dequantized and packed at the KC loop level, and the
  native bf16 kernels are reused at the compute level.
- Fixed the bench to avoid accuracy failures when the output
  datatype is bf16.
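
The s4->bf16 conversion mentioned above can be illustrated with a minimal scalar sketch. It is not the kernel code from this commit. All names here (bf16_t, s4_to_s8, f32_to_bf16, bf16_to_f32, dequant_s4_to_bf16) are hypothetical, and the layout is assumed: two s4 values per byte with the low nibble first, a single scale factor, and no zero point.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t bf16_t; /* raw bf16 bit pattern (hypothetical typedef) */

/* Sign-extend a 4-bit two's-complement value held in the low nibble. */
static int8_t s4_to_s8(uint8_t nibble)
{
    nibble &= 0x0F;
    return (int8_t)((nibble ^ 0x08) - 0x08);
}

/* Convert f32 to bf16 with round-to-nearest-even.
 * NaN handling is omitted to keep the sketch short. */
static bf16_t f32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return (bf16_t)(bits >> 16);
}

/* Widen bf16 back to f32 (exact: bf16 is the upper half of an f32). */
static float bf16_to_f32(bf16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Dequantize n s4 values into bf16.
 * Assumed layout: two values per byte, low nibble first. */
static void dequant_s4_to_bf16(const uint8_t *packed, size_t n,
                               float scale, bf16_t *out)
{
    for (size_t i = 0; i < n; ++i) {
        uint8_t byte = packed[i / 2];
        uint8_t nib = (i % 2 == 0) ? (uint8_t)(byte & 0x0F)
                                   : (uint8_t)(byte >> 4);
        out[i] = f32_to_bf16((float)s4_to_s8(nib) * scale);
    }
}
```

In the general path such a routine would fill a packed bf16 buffer once per KC block. In the m = 4 path the same per-element conversion would instead be done inline in the compute loop, which skips the intermediate buffer.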

Change-Id: Ie8db42da536891693d5e82a5336b66514a50ccb2
Meghana Vankadari
2024-09-02 10:39:49 +00:00
committed by Nallani Bhaskar
parent 711dce14d0
commit 2e1cc2f14a
8 changed files with 4961 additions and 17 deletions


@@ -85,7 +85,7 @@ void lpgemm_rowvar_ ## LP_SFX \
const B_type* b, \
const dim_t rs_b, \
const dim_t cs_b, \
-const AOCL_MEMORY_TAG mtag_b, \
+AOCL_MEMORY_TAG mtag_b, \
C_type* c, \
const dim_t rs_c, \
const dim_t cs_c, \