Commit Graph

188 Commits

Author SHA1 Message Date
Vlachopoulou, Eleni
1f8a7d2218 Renaming CMAKE_SOURCE_DIR to PROJECT_SOURCE_DIR so that BLIS can be built properly via FetchContent() (#65) 2025-08-07 15:51:59 +01:00
Bhaskar, Nallani
9d571bb5d3 Fixed few Coverity warnings in aocl gemm addon

AMD-Internal: CPUPL-6913
2025-08-06 15:37:40 +05:30
V, Varsha
68d47281df Fixing some copying bugs in Batch-Matmul code
- Removed duplicate calls to BATCH_GEMM_CHECK().
 - Refactored freeing of post-op pointer in bench code and verified the
    functionality.
 - Modified indexing of the array to take the correct values.
2025-08-01 18:42:10 +05:30
Bhaskar, Nallani
46aac600ec Added f32 kernels without post-ops to avoid overhead
Description:

1. Created f32 intrinsic kernels without post-ops to support f32 gemm
   without post-ops optimally.
2. Invoked the no post-ops kernels from the main kernel when the post-ops
   handler has no post-ops to do.
3. The kernels are redundant but added to get the best perf
   for pure GEMM call.

AMD-Internal : SWLCSG-3692
2025-07-25 23:14:23 +05:30
Balasubramanian, Vignesh
93414f56c8 Bugfix : Guarded AOCL_ENABLE_INSTRUCTIONS support based on AVX512-ISA support
- As part of rerouting to AVX2 code-paths on ZEN4/ZEN5(or similar)
  architectures, the code-base established a contingency when
  deploying fat binary on ZEN/ZEN2/ZEN3 systems. Due to this,
  it was required that we always set AOCL_ENABLE_INSTRUCTIONS to
  'ZEN3'(or similar values) to make sure we don't run AVX512
  code on such architectures. This issue existed on FP32 and BF16
  APIs.

- Added checks to detect the AVX512-ISA support to enable rerouting
  based on AOCL_ENABLE_INSTRUCTIONS. This removes the incorrect
  constraint that was put forth.

AMD-Internal: [CPUPL-7020]

Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
2025-07-24 12:20:05 +05:30
V, Varsha
8a86620753 Bug Fix in INT8 reference un-reorder API
- For int8/uint8 reorder function, the k dimension is made multiple of 4 to
 meet the alignment requirements.
 - Modified the logic to update the k_updated to use multiples of 4.

[AMD - Internal : SWLCSG - 3686 ]
2025-07-24 11:26:49 +05:30
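The k-dimension padding mentioned in commit 8a86620753 can be illustrated with a small rounding helper. This is a minimal sketch, not the library's code: the function name is hypothetical, and only the idea of making `k_updated` a multiple of 4 comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of the AOCL GEMM addon): rounds k up to
 * the next multiple of 4, the alignment the commit message describes for
 * int8/uint8 reordered buffers. */
static int64_t round_up_to_multiple_of_4(int64_t k)
{
    assert(k >= 0);
    /* Adding 3 and clearing the two low bits rounds up to a multiple of 4. */
    return (k + 3) & ~(int64_t)3;
}
```

With such a helper, `k_updated = round_up_to_multiple_of_4(k)` leaves values that are already multiples of 4 unchanged and pads every other value upwards.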
V, Varsha
9e8c9e2764 Fixed compiler warnings in LPGEMM
- Modified the correct variables to be passed for the batch_gemm_thread_decorator() for
 u8s8s32os32 API.
 - Removed commented lines in f32 GEMV_M kernels.
 - Modified some instructions in F32 GEMV M and N Kernels to re-use the existing macros.
 - Re-aligned the BIAS macro in the macro definition file.

[ AMD - Internal : CPUPL - 7013 ]
2025-07-18 16:15:52 +05:30
V, Varsha
2f54bc1e14 Added F32 reference Unreorder function
- Implemented unpackb_f32f32f32of32_reference function.
 - Modified const pointer declaration in aocl_reorder_reference() to avoid compiler warnings.

[AMD-Internal: SWLCSG-3618 ]
2025-07-18 14:52:03 +05:30
Bhaskar, Nallani
76c08fe81d Implemented f32 reference reorder function
Implemented aocl_reorder_f32f32f32of32_reference( ) function and tested.

Implemented framework changes required and placeholder for kernels for aocl_unreorder_f32f32f32of32_reference( ) function. It is not tested completely and will be taken care of in subsequent commits.

[AMD-Internal: SWLCSG-3618 ]
2025-07-15 12:26:05 +05:30
V, Varsha
837d3974d4 Bug Fixes for GEMV AVX2 BF16 to F32 path
- Added the correct strides to be used while unreorder/convert B matrix in m=1 cases.
 - Modified Zero point vector loads to proper instructions.
 - Modified bf16 store in AVX2 GEMV M kernel

AMD Internal - [SWLCSG - 3602 ]
2025-07-10 16:23:46 +05:30
V, Varsha
98901847f1 Enabled GEMV path for BF16 GEMV operations on non-BF16 supporting machines
- Added new GEMV_AVX2 5-Loop for handling BF16 inputs, for n = 1 and m = 1 conditions.
 - Modified Re-order and Un-reorder functions to cater to default n=1 reorder conditions.
 - Added bf16 beta and store support in F32 GEMV N AVX2 and 256_512 kernels.
 - Added bf16 beta support for F32 GEMV M kernels, and modified bf16 store conditions for
   GEMV M kernels.
 -  Modified the n=1 re-order guards for reference bf16 re-order API.
 - Added an additional path in the un-reorder case for handling n=1 vector conversion

AMD-Internal: [ SWLCSG - 3602 ]
2025-07-09 19:45:40 +05:30
V, Varsha
1f9d1a85d3 Updated aocl_batch_gemm_ APIs aligning to CBLAS batch API. (#58)
* Updated aocl_batch_gemm_ APIs aligning to CBLAS batch API.

 - Modified Batch-Gemm API to align with cblas_?gemm_batch_ API,
 and added a parameter group_size to the existing APIs.
 - Updated bench batch_gemm code to align to the new API definition.
 - Modified the hardcoded number in lpgemm_postop file.
 - Added necessary early return condition to account for group_count/group_size < 0.

AMD-Internal: [ SWLCSG - 3592 ]
2025-06-30 11:16:04 +05:30
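The early-return condition for negative `group_count`/`group_size` described in commit 1f9d1a85d3 could look roughly like the sketch below. It assumes the CBLAS batch convention in which `group_size` is an array with one entry per group; the function name and the `-1` error value are hypothetical and are not taken from the AOCL API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the total number of matrices in a grouped
 * batch, or -1 if group_count or any group_size entry is negative. */
static int64_t total_batch_matrices(int64_t group_count,
                                    const int64_t *group_size)
{
    if (group_count < 0) {
        return -1; /* early return: invalid group count */
    }
    assert(group_count == 0 || group_size != NULL);

    int64_t total = 0;
    for (int64_t g = 0; g < group_count; ++g) {
        if (group_size[g] < 0) {
            return -1; /* early return: invalid group size */
        }
        total += group_size[g];
    }
    return total;
}
```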
Vankadari, Meghana
c81408c805 Modified reorder and pack code in sym quant API (#59)
Details:
- In s8 APIs with symmetric quantization, existing kernels are
  reused to avoid duplication of reorder code.
- Since the existing kernels are designed assuming that entire
  KCxNC block is packed at once, to handle grouping in symmetric
  quantization, we have to add JR and group loop outside the
  function call to existing packB function.
- Though this was being done before, the cases where n_rem < 64
  were not handled properly.
- Modified reorder and pack code to first divide the n_fringe part
  into multiples-of-16 part and n_lt_16 part and then calling the
  pack kernel twice to handle both parts separately.
- All the strides to access the reordered/pack buffer are updated
  accordingly.
2025-06-24 11:36:35 +05:30
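The n_fringe split described in commit c81408c805 can be sketched as follows. Only the idea of dividing the fringe into a multiple-of-16 part and an `n_lt_16` remainder comes from the commit message; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: the two parts that would each be passed to a
 * separate pack-kernel call. */
typedef struct {
    int64_t n_mult_16; /* largest multiple of 16 not exceeding n_fringe */
    int64_t n_lt_16;   /* remainder, always in the range [0, 15] */
} fringe_split_t;

/* Splits the n fringe into a multiple-of-16 part and a remainder. */
static fringe_split_t split_n_fringe(int64_t n_fringe)
{
    assert(n_fringe >= 0);
    fringe_split_t split;
    split.n_lt_16 = n_fringe % 16;
    split.n_mult_16 = n_fringe - split.n_lt_16;
    return split;
}
```

For example, a fringe of 37 columns would be handled as a 32-column call followed by a 5-column call.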
Vankadari, Meghana
26e5c63781 Disabled default packing of matrices in batch_gemm of FP32 (#55)
AMD-Internal: SWLCSG-3527
2025-06-17 10:53:05 +05:30
Vankadari, Meghana
8649cdc14b Removed unnecessary pack checks in FP32 GEMV (#54)
Details:
- In FP32 GEMM, when threading is disabled, rntm_pack_a and rntm_pack_b
  were set to true by default. This leads to perf regression for smaller
  sizes. Modified FP32 interface API to not overwrite the packA and
  packB variables in rntm structure.
- In FP32 GEMV, Removed the decision making code based on mtag_A/B
  and should_pack_A/B for packing. Matrices will be packed only
  if the storage format of the matrices doesn't match the storage
  format required by the kernel.
- Changed the control flow that checks the value of mtag to determine
  whether the matrix is "reordered", "to-be-packed" or "unpacked". It
  checks for "reorder" first, followed by "pack". This ensures that
  packing doesn't happen when the matrix is already reordered, even
  if the user forces packing by setting "BLIS_PACK_A/B".
- Modified python script to generate testcases based on block sizes.

AMD-Internal: SWLCSG-3527
2025-06-16 12:34:11 +05:30
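The check ordering described in commit 8649cdc14b ("reorder" first, then "pack", otherwise "unpacked") can be sketched like this. The enum, its values, and the function are hypothetical stand-ins and do not reproduce the library's actual mtag type.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tag values mirroring the three states named in the commit. */
typedef enum {
    TAG_UNPACKED,
    TAG_PACK,
    TAG_REORDERED
} matrix_tag_t;

/* Resolves the tag by testing "reordered" before "pack", so a forced pack
 * request never overrides an already reordered matrix. */
static matrix_tag_t resolve_matrix_tag(bool is_reordered, bool pack_requested)
{
    if (is_reordered) {
        return TAG_REORDERED;
    }
    if (pack_requested) {
        return TAG_PACK;
    }
    return TAG_UNPACKED;
}
```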
Balasubramanian, Vignesh
1847a1e8c6 Bugfix : Segmentation fault at the topology detection layer (#51)
- The current implementation of the topology detector establishes
      a contingency, wherein it is expected that the parallel region
      uses all the threads queried through omp_get_max_threads(). In
      case the actual parallelism in the function is limited(lower than
      this expectation), the code may access unallocated memory section
      (using uninitialized pointers).

    - This was because every thread (having its own pointer) sets its
      initial value to NULL inside the parallel section, thereby leaving
      some pointers uninitialized if the associated thread is not spawned.

    - Also, the current implementation would use negative indexing(with -1)
      if any associated thread was not spawned.

    - Fix : Set every thread-specific pointer to NULL outside the parallel
            region, using calloc(). As long as we have NULL checks for pointers
            before accessing through them, no issues will be observed. Avoid
            incurring the topology detection cost if all the required threads
            are not spawned(thereby avoiding potential negative indexing).
            (when using core-group ID).

AMD-Internal: [SWLCSG-3573]

Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
Co-authored-by: Bhaskar, Nallani <Nallani.Bhaskar@amd.com>
2025-06-14 21:55:02 +05:30
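The fix described in commit 1847a1e8c6 (zero-initialising every per-thread pointer outside the parallel region and NULL-checking before use) can be illustrated without OpenMP. In this sketch a plain loop stands in for the parallel region, and all names are hypothetical. It also assumes a platform where all-zero bytes represent a NULL pointer, which holds on common targets but is not guaranteed by the C standard.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical illustration: allocates one pointer slot per possible
 * thread, lets only `spawned_threads` of them fill their slot, and then
 * counts the slots that are safe to dereference. Returns -1 on
 * allocation failure. */
static int count_initialized_slots(int max_threads, int spawned_threads)
{
    assert(max_threads > 0);
    assert(spawned_threads >= 0 && spawned_threads <= max_threads);

    int *core_ids = calloc((size_t)max_threads, sizeof *core_ids);
    /* calloc() zero-fills the array, so every slot starts out as NULL
     * even if its thread never runs. */
    int **slots = calloc((size_t)max_threads, sizeof *slots);
    if (core_ids == NULL || slots == NULL) {
        free(core_ids);
        free(slots);
        return -1;
    }

    /* Stand-in for the parallel region: only spawned threads write. */
    for (int tid = 0; tid < spawned_threads; ++tid) {
        core_ids[tid] = tid;
        slots[tid] = &core_ids[tid];
    }

    /* Consumers check for NULL before accessing a slot. */
    int filled = 0;
    for (int tid = 0; tid < max_threads; ++tid) {
        if (slots[tid] != NULL) {
            ++filled;
        }
    }

    free(slots);
    free(core_ids);
    return filled;
}
```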
Vankadari, Meghana
8968973c2d Performance fix for FP32 GEMV (#47)
Details:
- In FP32 GEMM interface, mtag_b is being set to PACK by default.
  This is leading to packing of B matrix even though packing is not
  absolutely required leading to perf regression.
- mtag_b is now set to PACK only if it is absolutely necessary to pack
  the B matrix; modified the check conditions before packing accordingly.

AMD-Internal - [SWLCSG-3575]
2025-06-10 14:54:01 +05:30
V, Varsha
875375a362 Bug Fixes in FP32 Kernels: (#41)
* Bug Fixes in FP32 Kernels:

 - The current implementation lets m=1 tiny cases inside LPGEMV_TINY loop,
 but the m=1 GEMV kernel call doesn't have the call to GEMV_M_ONE kernels.
 Added the m=1 path in LPGEMV_TINY loop by handling the pack A/Pack B/reorder B
 conditions.
- Added BF16 support for BIAS, Matrix-Add and Matrix-Mul for AVX512 F32
 main and GEMV kernels
- Added BF16 Matrix-Add and Matrix-Mul support for AVX512_256 F32 kernels.
- Modified the condition check in FP32 Zero point in AVX512 kernels, and
 fixed few bugs in Col-major Zero point evaluation.

AMD Internal: [ CPUPL - 6748 ]

* Bug Fixes in FP32 Kernels:

 - The current implementation lets m=1 tiny cases inside LPGEMV_TINY loop,
 but doesn't have the call to GEMV_M_ONE kernels. Added the m=1 path in
 LPGEMV_TINY loop by handling the pack A/Pack B/reorder B conditions.

- Added BF16 support for BIAS, Matrix-Add and Matrix-Mul for AVX512 F32
 main and GEMV kernels.

- Added BF16 Downscale, BIAS, Matrix-Add and Matrix-Mul support in AVX2 GEMV_N
 and AVX512_256 GEMV kernels.

- Added BF16 Matrix-Add and Matrix-Mul support for AVX512_256 F32 kernels.

- Modified the condition check in FP32 Zero point in AVX512 kernels, and
 fixed few bugs in Col-major Zero point evaluation and instruction usage.

AMD Internal: [ CPUPL - 6748 ]

---------

Co-authored-by: VarshaV <varshav2@amd.com>
2025-06-06 17:48:50 +05:30
Vankadari, Meghana
37efbd284e Added 6x16 and 6xlt16 main kernels for f32 using AVX512 instructions (#38)
* Implemented 6xlt8 AVX2 kernel for n<8 inputs

* Implemented fringe kernels for 6x16 and 6xlt16 AVX512 kernels for FP32

* Implemented m-fringe kernels for 6xlt8 kernel for AVX2

* Added the deleted kernels and fixed bias bug

AMD-Internal: SWLCSG-3556
2025-06-05 15:17:02 +05:30
V, Varsha
532eab12d3 Bug Fixes in LPGEMM for AVX512(SkyLake) machine (#24)
* Bug Fixes in LPGEMM for AVX512(SkyLake) machine

 - B-matrix in bf16bf16f32obf16/f32 API is re-ordered. For machines that
  don't support BF16 instructions, the BF16 input is unre-ordered and
  converted to FP32 to use FP32 kernels.

 - For n = 1 and k = 1 sized matrices, re-ordering in BF16 is copying the
  matrix to the re-ordered buffer array. But the un-reordering to FP32
  requires the matrix to have size multiple of 16 along n and multiple
  of 2 along k dimension.

 - The entry condition to the above has been modified for AVX512 configuration.

 - In bf16 API, the tiny path entry check has been modified to prevent
  seg fault while AOCL_ENABLE_INSTRUCTIONS=AVX2 is set in BF16 supporting
  machines.

 - Modified existing store instructions in FP32 AVX512 kernels to support
  execution in machines that have AVX512 support but not BF16/VNNI(SkyLake).

 - Added Bf16 beta and store types in FP32 avx512_256 kernels

AMD Internal: [SWLCSG-3552]

* Bug Fixes in LPGEMM for AVX512(SkyLake) machine

 - B-matrix in bf16bf16f32obf16/f32 API is re-ordered. For machines that
  don't support BF16 instructions, the BF16 input is unre-ordered and
  converted to FP32 to use FP32 kernels.

 - For n = 1 and k = 1 sized matrices, re-ordering in BF16 is copying the
  matrix to the re-ordered buffer array. But the un-reordering to FP32
  requires the matrix to have size multiple of 16 along n and multiple
  of 2 along k dimension.

 - The entry condition to the above has been modified for AVX512 configuration.

 - In bf16 API, the tiny path entry check has been modified to prevent
  seg fault while AOCL_ENABLE_INSTRUCTIONS=AVX2 is set in BF16 supporting
  machines.

 - Modified existing store instructions in FP32 AVX512 kernels to support
  execution in machines that have AVX512 support but not BF16/VNNI(SkyLake).

 - Added Bf16 beta and store types, along with BIAS and ZP in FP32 avx512_256
  kernels

AMD Internal: [SWLCSG-3552]

* Bug Fixes in LPGEMM for AVX512(SkyLake) machine

 - Support added in FP32 512_256 kernels for: Beta, BIAS, Zero-point and
   BF16 store types for bf16bf16f32obf16 API execution in AVX2 mode.

 - B-matrix in bf16bf16f32obf16/f32 API is re-ordered. For machines that
  don't support BF16 instructions, the BF16 input is unre-ordered and
  converted to FP32 type to use FP32 kernels.

 - For n = 1 and k = 1 sized matrices, re-ordering in BF16 is copying the
  matrix to the re-ordered buffer array. But the un-reordering to FP32
  requires the matrix to have size multiple of 16 along n and multiple
  of 2 along k dimension. The entry condition here has been modified for
  AVX512 configuration.

 - Fix for seg fault with AOCL_ENABLE_INSTRUCTIONS=AVX2 mode in BF16/VNNI
   ISA supporting configurations:
   - BF16 tiny path entry check has been modified to take into account arch_id
     to prevent improper entry into the tiny kernel.
   - The store in BF16->FP32 col-major for m = 1 conditions was updated to
     the correct storage pattern.
   - BF16 beta load macro was modified to account for data in unaligned memory.

 - Modified existing store instructions in FP32 AVX512 kernels to support
  execution in machines that have AVX512 support but not BF16/VNNI(SkyLake)

AMD Internal: [SWLCSG-3552]

---------

Co-authored-by: VarshaV <varshav2@amd.com>
2025-05-30 17:22:49 +05:30
Arnav Sharma
62d4fcb398 Bugfix: Group Size Validation for s8s8s32o32_sym_quant
- Fixed the group size validation logic to correctly check if the
  group_size is a multiple of 4.

- Previously the condition was incorrectly performing bitwise AND with
  decimal 11 instead of binary 11 (decimal 3).

AMD-Internal: [CPUPL-6754]
2025-05-30 11:53:23 +05:30
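The bug described in commit 62d4fcb398 is easy to reproduce in isolation. The sketch below contrasts a mask of decimal 11 (binary 1011) with the intended mask of binary 11 (decimal 3). The function names are hypothetical and the code is an illustration, not the library's validation routine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Intended check: the two low bits must be clear for a multiple of 4. */
static bool is_multiple_of_4(int64_t group_size)
{
    assert(group_size >= 0);
    return (group_size & 3) == 0; /* 3 == 0b11 */
}

/* Buggy variant: decimal 11 is 0b1011, so bit 3 is tested as well and
 * valid sizes such as 8 or 12 are rejected. */
static bool is_multiple_of_4_buggy(int64_t group_size)
{
    assert(group_size >= 0);
    return (group_size & 11) == 0;
}
```

Because the buggy mask also covers the two low bits, it never accepts a size that is not a multiple of 4; it only rejects valid sizes whose bit 3 is set.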
Bhaskar, Nallani
42a0d74ced Fixed configuration issues in AOCL_GEMM addon (#4)
* Fixed configuration issues in AOCL_GEMM addon

Description:

Fixed aocl_gemm addon initialization of kernels and block sizes
for machines which support only AVX512 but not
AVX512_VNNI/VNNI_BF16.

Aligned NC, KC blocking variables between ZEN and ZEN4

AMD-Internal: [SWLCSG-3527]
2025-05-13 17:19:19 +05:30
Negi, Deepak
121d81df16 Implemented GEMV kernel for m=1 case. (#5)
* Implemented GEMV kernel for m=1 case.

Description:

- Added a new GEMV kernel for AVX2 where m=1.
- Added a new GEMV kernel for AVX512 with ymm registers where m=1.
2025-05-13 16:33:04 +05:30
Meghana Vankadari
8557e2f7b9 Implemented GEMV for n=1 case using 32 YMM registers
Details:
- This implementation is picked from cntx when GEMM is invoked on
  machines that support AVX512 instructions by forcing the
  AVX2 path using AOCL_ENABLE_INSTRUCTIONS=AVX2 during run-time.
- This implementation uses MR=16 for GEMV.

AMD-Internal: [SWLCSG-3519]
Change-Id: I8598ce6b05c3d5a96c764d96089171570fbb9e1a
2025-05-05 05:31:13 -04:00
Meghana Vankadari
21aa63eca1 Implemented AVX2 based GEMV for n=1 case.
- Added a new GEMV kernel with MR = 8 which will be used
  for cases where n=1.
- Modified GEMM and GEMV framework to choose right GEMV kernel
  based on compile-time and run-time architecture parameters. This
  had to be done since GEMV kernels are not stored-in/retrieved-from
  the cntx.
- Added a pack kernel that packs A matrix from col-major to row-major
  using AVX2 instructions.

AMD-Internal: [SWLCSG-3519]
Change-Id: Ibf7a8121d0bde37660eac58a160c5b9c9ebd2b5c
2025-05-05 08:56:22 +00:00
Meghana Vankadari
4745cf876e Implemented a new set of kernels for f32 using 32 YMM regs
Details:
- These kernels are picked from cntx when GEMM is invoked
  on machines that support AVX512 instructions by forcing the
  AVX2 path using AOCL_ENABLE_INSTRUCTIONS=AVX2 during run-time.
- This path uses the same blocksizes and pack kernels as AVX512
  path.
- GEMV is disabled currently as AVX2 kernels for GEMV are not
  implemented.

AMD-Internal: [SWLCSG-3519]
Change-Id: I75401fac48478fe99edb8e71fa44d36dd7513ae5
2025-04-23 12:02:01 +00:00
Deepak Negi
48c7452b08 Beta and Downscale support for F32 AVX-512 kernels
Description
- To enable AVX512 VNNI support without native BF16 in BF16 kernels, the
  BF16 C_type is converted to F32 for computation and then cast back to
  BF16 before storing the result.
- Added support for handling zero-point values of BF16 type.
- Added a condition to disable the tiny path for the BF16 code path
  where native BF16 is not supported.

AMD Internal : [CPUPL-6627]

Change-Id: I1e0cfefd24c5ffbcc95db73e7f5784a957c79ab9
2025-04-23 06:12:14 -05:00
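Commit 48c7452b08 describes converting BF16 values to F32 for computation and casting back to BF16 before the store. A scalar sketch of that round trip is shown below. It uses the standard bit-level relationship between the two formats with round-to-nearest-even, does not handle NaN inputs, and is not the vectorised code the kernels use; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* BF16 is the upper 16 bits of an IEEE-754 binary32 value, so widening
 * is a left shift by 16 bits. */
static float bf16_to_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Narrows with round-to-nearest-even. NaN inputs are outside the scope
 * of this sketch and are rejected by the assertion. */
static uint16_t f32_to_bf16(float f)
{
    assert(f == f); /* false only for NaN */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    /* Add 0x7FFF plus the lowest kept bit so exact ties round to even. */
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)(bits >> 16);
}
```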
Arnav Sharma
8b0593f88d Optimizations and Improved Support for FP32 RD Kernels
- Updated the decision logic for taking the RD path for FP32.

- Since the 5-loop was designed specifically for RV kernels, added a
  boolean flag to specify when RD path is to be taken, and set ps_b_use
  to cs_b_use in case B matrix is unpacked.

AMD-Internal: [SWLCSG-3497]
Change-Id: I94ed28304a71b759796edcdd4edf65b9bad22bea
2025-04-23 12:26:51 +05:30
Arnav Sharma
87c9230cac Bugfix: Disable A Packing for FP32 RD kernels and Post-Ops Fix
- For single-threaded configuration of BLIS, packing of A and B matrices
  is enabled by default. But packing of A is only supported for RV
  kernels, where elements from matrix A are broadcast. Since
  elements are loaded in RD kernels, packing of A results in
  failures. Hence, disabled packing of matrix A for RD kernels.

- Fixed the issue where the c_i index pointer was incorrectly being reset
  when exceeding the MC block, resulting in failures for certain
  Post-Ops.

- Fixed the FP32 reorder case where, for the n == 1 and rs_b == 1 condition, it
  was incorrectly using sizeof(BLIS_FLOAT) instead of sizeof(float).

AMD-Internal: [SWLCSG-3497]
Change-Id: I6d18afa996c253d79f666ea9789270bb59b629dd
2025-04-18 14:31:03 +05:30
Meghana Vankadari
1ff96343f1 Fixed Early return checks in reorder function for f32 & int8 APIs.
Details:
- In reorder functions, validity of strides is being checked assuming
  that the matrix to be reordered is always row-major. Modified the code
  to take stor_order into consideration while checking for validity of
  strides.
- This does not directly impact the functionality of GEMM as we don't
  support GEMM on col-major matrices where A and/or B matrices are
  reordered before GEMM computation. But this change makes sense when
  reordering is viewed as an independent functionality irrespective of
  what the reordered buffers will be used for.

Change-Id: If2cc4a353bca2f998ad557d6f128881bc9963330
2025-04-15 09:45:48 +00:00
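The storage-order-aware stride check described in commit 1ff96343f1 might be sketched as below. This assumes the usual leading-dimension convention (at least the number of columns for row-major, at least the number of rows for column-major); the function name and the `'r'`/`'c'` encoding are hypothetical, and the library's actual checks may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical validity check for the leading dimension of a
 * rows x cols matrix stored in the given order. */
static bool is_leading_dim_valid(char stor_order, int64_t rows,
                                 int64_t cols, int64_t ld)
{
    assert(rows >= 0 && cols >= 0);
    if (stor_order == 'r') {
        return ld >= cols; /* row-major: each row holds cols elements */
    }
    if (stor_order == 'c') {
        return ld >= rows; /* column-major: each column holds rows elements */
    }
    return false; /* unknown storage order */
}
```

A check that always assumes row-major would reject a valid column-major 4 x 8 matrix with a leading dimension of 4, which is the kind of case the commit addresses.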
Arnav Sharma
267aae80ea Added Post-Ops Support for F32 RD Kernels
- Support for Post-Ops has been added for all F32 RD AVX512 and AVX2
  kernels.

AMD-Internal: [SWLCSG-3497]
Change-Id: Ia2967417303d8278c547957878d93c42c887109e
2025-04-11 05:25:30 -04:00
Arnav Sharma
c68c258fad Added AVX512 and AVX2 FP32 RD Kernels
- Added FP32 RD (dot-product) kernels for both, AVX512 and AVX2 ISAs.
- The FP32 AVX512 primary RD kernel has blocking of dimensions 6x64
  (MRxNR) whereas it is 6x16 (MRxNR) for the AVX2 primary RD kernel.
- Updated f32 framework to accommodate rd kernels in case of B trans
  with thresholds
- Updated data gen python script
TODO:
    - Post-Ops not yet supported.

Change-Id: Ibf282741f58a1446321273d5b8044db993f23714
2025-04-05 20:16:51 -05:00
varshav
81d219e3f8 Added destination scale type check in INT8 APIs
- Updated the S8 main, GEMV, m_, n_ and mn_ fringe kernels to support
   multiple scale types for vector and scalar scales

 - Updated the U8 main, GEMV, m_, n_, extMR_ and mn_ fringe kernels to
   support multiple scale types for vector and scalar scales

 - Updated the bench to accommodate multiple scale type input, and
   modified the downscale_accuracy_check_ to verify with multiple scale
   type inputs.

AMD Internal: [ SWLCSG-3304 ]

Change-Id: I7b9f3ec8ea830d3265f72d18a0aa36086e14a86e
2025-03-28 00:51:17 -05:00
Arnav Sharma
6d1afeae95 Column-Major Support Added for F32 Tiny Path
- Updated the F32 tiny path to support column-major inputs.
- Tuned the tiny-path thresholds to redirect additional inputs to the
  tiny path based on the m*n*k value.

AMD-Internal: [SWLCSG-3380]
Change-Id: If3476b17cc5eaf4f4e1cf820af0a32ede3e1742e
2025-03-13 05:54:50 -04:00
varshav
acee9c7d4e Added column-major support for BF16 tiny path
- Added column major path for BF16 tiny path
 - Tuned tiny-path thresholds to support few more inputs to the
   tiny path.

AMD-Internal: [SWLCSG-3380]
Change-Id: I9a5578c9f0d689881fc5a67ab778e6a917c4fce1
2025-03-13 05:45:33 -04:00
varshav
4d22451fbb Bug Fix in BF16 Re-order/unreorder with AOCL_ENABLE_INSTRUCTIONS
- Currently, the bf16 reorder function does not add padding for
   n=1 cases. But, the bf16 AVX2 Unreorder path considers the input
   re-ordered B matrix to be padded along the n and k dimension.
 - Hence, modified the conditions to make sure the path doesn't break
   while the AVX2 kernels are executed in AVX512 machines when the
   B matrix is reordered.

Change-Id: I7dd3d37a24758a8e93e80945b533abfcf15f65a1
2025-03-05 06:31:19 +00:00
Mithun Mohan
37d590e53f Tid spread threshold update in LPGEMM thread decorator.
-Currently the Tid spread does not happen for n=4096 even if there
are threads available to facilitate the same. Update the threshold
to account for the same.

AMD-Internal: [SWLCSG-3185]
Change-Id: I281b1639c32ba2145bd84062324f1f11b1167eeb
2025-03-04 10:53:51 +00:00
Meghana Vankadari
7243a5d521 Implemented group level static quantization for s8s8s32of32|bf16 APIs
Details:
- Group quantization is a technique to improve accuracy
  where the scale factors used to quantize inputs and weights
  vary at group level instead of at per-channel
  or per-tensor level.
- Added new bench files to test GEMM with symmetric static
  quantization.
- Added new get_size and reorder functions to account for
  storing sum of col-values separately per group.
- Added new framework, kernels to support the same.
- The scalefactors could be of type float or bf16.

AMD-Internal:[SWLCSG-3274]

Change-Id: I3e69ecd56faa2679a4f084031d35ffb76556230f
2025-02-28 04:44:44 -05:00
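The group-level quantization idea from commit 7243a5d521 can be illustrated with a scalar dot product. This is a generic sketch of symmetric group-wise dequantization, assuming one float scale per group for each operand and a k dimension that divides evenly into groups; it is not the kernel or API added by the commit, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Dot product of two int8 vectors whose scale factors change once per
 * group along k. Each group is accumulated in int32 and then scaled by
 * that group's pair of scale factors. */
static float group_quant_dot(const int8_t *a, const int8_t *b, int64_t k,
                             int64_t group_size, const float *a_scale,
                             const float *b_scale)
{
    assert(group_size > 0 && k >= 0 && k % group_size == 0);

    float result = 0.0f;
    for (int64_t g = 0; g < k / group_size; ++g) {
        int32_t acc = 0;
        for (int64_t i = g * group_size; i < (g + 1) * group_size; ++i) {
            acc += (int32_t)a[i] * (int32_t)b[i];
        }
        /* Dequantize the group's partial sum with its own scales. */
        result += (float)acc * a_scale[g] * b_scale[g];
    }
    return result;
}
```

With per-tensor quantization a single scale pair would multiply the whole sum; here each group contributes with its own scales, which is what lets the scale factors track local value ranges.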
Nallani Bhaskar
0e6b562711 Implemented s8 unreorder reference API
Description:
1. Implement s8 unreorder API function which performs
   unreordering of int8 matrix which is reordered
2. Removed bf16vnni check for bf16 unreorder reference API
   because it can work on any architecture as it is reference
   code
3. Tested the reference code for all main and fringe paths.

AMD-Internal: [SWLCSG-3426]

Change-Id: I920f807be870e1db5f9d0784cdcec7b366e1eff5
2025-02-27 13:06:40 +00:00
Deepak Negi
cc321fb95d Added support for different types of zero-point in f32 eltwise APIs.
Description
 - Zero point support for <s32/s8/bf16/u8> datatype in element-wise
   postop only f32o<f32/s8/u8/s32/bf16> APIs.

 AMD-Internal: [SWLCSG-3390]

Change-Id: I2fdb308b05c1393013294df7d8a03cdcd7978379
2025-02-26 04:04:13 -05:00
Mithun Mohan
7394aafd1e New A packing kernels for F32 API in LPGEMM.
-New packing kernels for A matrix, both based on AVX512 and AVX2 ISA,
for both row and column major storage are added as part of this change.
Dependency on haswell A packing kernels is removed by this.
-Tiny GEMM thresholds are further tuned for BF16 and F32 APIs.

AMD-Internal: [SWLCSG-3380, SWLCSG-3415]

Change-Id: I7330defacbacc9d07037ce1baf4a441f941e59be
2025-02-26 05:23:35 +00:00
varshav
8a69141294 Bug fix in BF16-F32 supported AVX2 Kernels
- Bug fix in Matrix Mul post op.
 - Updated the config in AVX512_VNNI_BF16 context
   to work in AVX2 kernels

Change-Id: I25980508facc38606596402dba4cfce88f4eb173
2025-02-25 14:42:45 +00:00
varshav
a0005c60ce Add col-major pack kernels and BF16 output support in F32 AVX-2 kernels.
- Added column major pack kernels, which will transpose and store the
   BF16 matrix input to F32 input matrix
 - Added BF16 Zero point Downscale support to F32 main and fringe
   kernels.
 - Updated Matrix Add and Matrix Mul post-ops in f32-AVX2 main and
   fringe kernels to support BF16 input.
 - Modified the f32 tiny kernels loop to update the buf_downscale
   parameter.
 - Modified bf16bf16f32obf16 framework to work with AVX-2 system.
 - Added wrapper in bf16 5-Loop to call the corresponding AVX-2/AVX-512
   5 Loop functions.
 - Bug fixes in the f32-AVX2 kernels BIAS post-ops.
 - Bug fixes in the Convert function, and the bf16 5-loop
   for multi-threaded inputs.

AMD-Internal:[SWLCSG-3281 , CPUPL-6447]

Change-Id: I4191fbe6f79119410c2328cd61d9b4d87b7a2bcd
2025-02-24 09:51:12 +05:30
Nallani Bhaskar
5a3c58b315 Fixed column major case of bf16 un-reorder reference function
Description:

1. Fixed bf16 un-reorder column major kernel
2. Fixed a bug in nrlt16 case of f32obf16 reorder function
3. Unit testing done.

AMD-internal: [SWLCSG-3279]

Change-Id: I65024342935ae65186b95885eb010baf3269aa7d
2025-02-20 06:26:31 -05:00
Mithun Mohan
ae182c3fcc Using GEN_BUF buffer instead of <A|B>_PANEL for pack buffer in F32/BF16.
-When bli_pba_acquire_m is invoked to get a buffer for packing, if
buffer type is BLIS_BUFFER_FOR_B_PANEL, then the memory is returned
from a memory pool. In order to ensure thread safety, this memory
pool is protected using locks. Instead if buffer type was
BLIS_BUFFER_FOR_GEN_USE, then memory is allocated using malloc.
-However it was observed that for relatively small input dimensions,
if on the go packing is required, and if jc_ways is sufficiently
large, then there was contention at the lock on the memory pool for
B_PANEL buffer type. This turned out to be an overhead and is now
avoided by checking out GEN_USE buffer type for packing.

AMD-Internal: [SWLCSG-3398]

Change-Id: I781ad5da2a2f75997b58d6c3db70f6277250bd99
2025-02-14 06:12:51 -05:00
Meghana Vankadari
17634d7ae8 Fixed compiler errors and warning for gcc < 11.2
Description:

1. When the gcc version is less than 11.2, a few BF16 instructions
   are not supported by the compiler even though the processor
   architectures zen4 and zen5 support them.

2. These instructions are guarded now with a macro.


Change-Id: Ib07d41ff73d8fe14937af411843286c0e80c4131
2025-02-13 10:18:13 -05:00
Mithun Mohan
d61c54dc26 Enable BF16 tiny GEMM path only for Zen4/5 arch id.
-BF16 tiny GEMM path is only enabled for Zen4 or Zen5 arch id as
returned by the bli_arch_query_id function. Additionally it is
disabled if JIT kernels are used.

-Fixed nrlt16 case in bf16_unreorder_ref function

AMD-Internal: [SWLCSG-3380, SWLCSG-3258]

Change-Id: I8af638a85e949f12181bc56c63e5e983c24ca3af
2025-02-12 06:39:53 -05:00
Mithun Mohan
4cfbb47b87 Initialize block sizes for F32 element wise post-op APIs.
-The block sizes and micro kernel dimensions for the F32OF32 group
of APIs are updated in the element wise operations cntx map.

AMD-Internal: [SWLCSG-3390]

Change-Id: Ic5690b7eb4f7b2559d893f374dd811b00e31e329
2025-02-11 06:47:24 -05:00
varshav
f4e3a4b1c3 AVX2 Support for BF16 Kernels - Bug fixes
- Added early return checks for A/B transpose cases and Column major
  support, as it is not currently supported.
- Enabled the JIT kernels for the Zen4 architecture.

AMD Internal: [SWLCSG - 3281]

Change-Id: Ie671676c51c739dd18709892414fd34d26a540df
2025-02-11 12:40:43 +05:30
Nallani Bhaskar
0acb5eb9a4 Implemented reference unreorder bf16 function
Description:

Implemented a C reference for the
aocl_gemm_unreorder_bf16bf16f32of32 function.

The implementation works for row major;
column major is yet to be enabled.

AMD-Internal: [ SWLCSG-3279 ]

Change-Id: Ibcce4180bb897a40252140012d8d6886c38cb77a
2025-02-11 02:04:42 +00:00