956 Commits

Bhaskar, Nallani
46aac600ec Added f32 kernels without post-ops to avoid overhead
Description:

1. Created f32 intrinsic kernels without post-ops support, to handle f32 GEMM
   without post-ops optimally.
2. Invoked the no-post-ops kernels from the main kernel when the post-ops
   handler has no post-ops to apply.
3. The kernels are redundant, but were added to get the best performance
   for pure GEMM calls.

AMD-Internal : SWLCSG-3692
2025-07-25 23:14:23 +05:30
S, Hari Govind
273a05f0bd Fix for performance regression caused by non-unit stride y in DGEMV API (#91)
- Temporary fix for regression in DGEMV for non-unit stride y inputs. The code
  section responsible for handling non-unit stride y has been removed from the
  frame.

- The kernel code is extended with an if condition to handle both unit and
  non-unit stride y.

AMD-Internal: [CPUPL-6869]
2025-07-25 10:57:57 +05:30
V, Varsha
9e8c9e2764 Fixed compiler warnings in LPGEMM
- Passed the correct variables to batch_gemm_thread_decorator() for the
  u8s8s32os32 API.
- Removed commented lines in f32 GEMV_M kernels.
- Modified some instructions in F32 GEMV M and N kernels to re-use the existing macros.
- Re-aligned the BIAS macro in the macro definition file.

[ AMD - Internal : CPUPL - 7013 ]
2025-07-18 16:15:52 +05:30
Sharma, Shubham
355018e739 Fixed Extra reads in DTRSM small kernels.
In DTRSM small code path lower triangular kernels, extra data from upper triangular region is being read.
To fix this, new macros have been added to make sure only relevant data is read.
AMD-Internal: [SWLCSG-3611]
2025-07-17 10:17:13 +05:30
V, Varsha
837d3974d4 Bug Fixes for GEMV AVX2 BF16 to F32 path
- Added the correct strides to be used while un-reordering/converting the B matrix in m=1 cases.
- Modified zero-point vector loads to use the proper instructions.
- Modified the bf16 store in the AVX2 GEMV M kernel.

AMD Internal - [SWLCSG - 3602 ]
2025-07-10 16:23:46 +05:30
Balasubramanian, Vignesh
ab4bb2f1e8 Threshold tuning for code-paths and optimal thread selection for ZGEMM(ZEN4)
- Updated the thresholds to enter the AVX512 Tiny and SUP codepaths
  for ZGEMM(on ZEN4). This caters to inputs that perform well on
  a single-threaded execution(in the Tiny-path), and inputs that
  scale well with multithreaded-execution(in the SUP path).

- Also updated the thresholds to decide ideal threads, based on
  'm', 'n' and 'k' values. The thread-setting logic involves
  determining the number of tiles for computation, and using them
  to further tune for the optimal number of threads.

AMD-Internal: [CPUPL-6378][CPUPL-6661]

Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
2025-07-10 15:35:22 +05:30
V, Varsha
98901847f1 Enabled GEMV path for BF16 GEMV operations on non-BF16 supporting machines
- Added a new GEMV_AVX2 5-Loop for handling BF16 inputs, for the n = 1 and m = 1 conditions.
- Modified Re-order and Un-reorder functions to cater to default n=1 reorder conditions.
- Added bf16 beta and store support in F32 GEMV N AVX2 and 256_512 kernels.
- Added bf16 beta support for F32 GEMV M kernels, and modified bf16 store conditions for
  GEMV M kernels.
- Modified the n=1 re-order guards for the reference bf16 re-order API.
- Added an additional path in the un-reorder case for handling n=1 vector conversion.

AMD-Internal: [ SWLCSG - 3602 ]
2025-07-09 19:45:40 +05:30
Bhaskar, Nallani
9b02201b5b Updated Poly 16 in Gelu Erf to double precision
Updated the poly Gelu Erf precision to double to keep the error within the 1e-5 limit when compared to the reference gelu_erf; this also increases the compute to 2x compared to float.

AMD-Internal: SWLCSG-3551
2025-07-07 14:05:40 +05:30
Balasubramanian, Vignesh
c0d33879ec Bugfix : Integer typecast inside CGEMM AVX512 24xk packing kernel (#68)
- When building the library with LP64 configuration, it is expected
  that we typecast integers to 64-bit internally, before loading them
  onto 64-bit GPRs. This ensures that the upper 32-bit lane is zeroed
  out, to avoid any possible junk values. The current change enforces
  this typecast inside the 24xk packing kernel for CGEMM(AVX512), which
  was missing before.

AMD-Internal: [CPUPL-6907]

Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
2025-07-01 17:05:54 +05:30
Dave, Harsh
7c6c04a457 More optimizations in 6x8m DGEMM SUP Kernel using prefetching (#34)
* Enhance Prefetching in 6x8m DGEMM SUP Kernel for Improved Performance

This update optimizes the DGEMM kernel by implementing well-suited prefetching techniques.

Key changes include:

- **Prefetching Strategy**:
  - Introduced prefetching instructions to load matrix data into cache ahead of computation.
  - Prefetching for matrix A is based on the k-loop, starting from columns close to the ones being loaded and computed.
  - Prefetching for matrix B follows a similar approach, focusing on rows close to the ones being loaded and computed.

- **Unrolling Optimization**:
  - Increased the unroll factor of the k-loop from 4 to 8, allowing for more efficient prefetching of matrices A and B.
  - This adjustment enhances data locality and reduces the overhead associated with loop control.

- **Performance Improvements**:
  - Reduced memory access latency by ensuring data is preloaded into cache.
  - Enhanced computational throughput by minimizing stalls due to memory access delays.
  - Improved overall efficiency of matrix multiplication operations.

These enhancements lead to faster DGEMM computations, leveraging improved cache utilization and loop unrolling to boost overall performance.

AMD-Internal: [CPUPL-6435]

* added unroll K by 4 along with unroll K by 8

* Added descriptive comments explaining prefetch strategy

* More optimizations in 6x8m DGEMM SUP Kernel using prefetching

- Restructured main loop with 8× and 4× unrolling (k_iter_8, k_iter_4, k_left) for deeper pipeline utilization.
- Introduced forward prefetching for A and future B rows to better align with unrolled access patterns.
- Interleaved alpha scaling with FMA for computation of alpha*AB + C more efficiently.

These enhancements lead to faster DGEMM computations, leveraging improved cache utilization and loop unrolling
to boost overall performance.

AMD-Internal: [CPUPL-6435]

---------

Co-authored-by: Harsh Dave <harsdave@amd.com>
Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
2025-07-01 15:02:50 +05:30
Smyth, Edward
8a8d3f43d5 Improve consistency of optimized BLAS3 code (#64)
* Improve consistency of optimized BLAS3 code

Tidy AMD optimized GEMM and TRSM framework code to reduce
differences between different data type variants:
- Improve consistency of code indentation and white space
- Added some missing AOCL_DTL calls
- Removed some dead code
- Consistent naming of variables for function return status
- GEMM: More consistent early return when k=1
- Correct data type of literal values used for single precision data

In kernels/zen/3/bli_gemm_small.c and bli_family_*.h files:
- Set default values for thresholds if not set in the relevant
  bli_family_*.h file
- Remove unused definitions and commented out code

AMD-Internal: [CPUPL-6579]
2025-07-01 09:29:52 +01:00
Balasubramanian, Vignesh
98bc1d80e7 Support for Tiny-GEMM interface(CGEMM)
- Added the support for Tiny-CGEMM as part of the existing
  macro based Tiny-GEMM interface. This involved defining
  the appropriate AVX2/AVX512 lookup tables and functions for
  the target architectures(as per the design), for compile-time
  instantiation and runtime usage.

- Also extended the current Tiny-GEMM design to incorporate packing
  kernels as part of its lookup tables. These kernels will be queried
  through lookup functions and used in case of wanting to support
  non-trivial storage schemes(such as dot-product computation).

- This allows for a plug-and-play fashion of experimenting with
  pack and outer product method against native inner product implementations.

- Further updated the existing AVX512 pack routine that packs the A matrix
  (in blocks of 24xk). This utilizes masked loads/stores instructions to
  handle fringe cases of the input(i.e, when m < 24).

- Also added the AVX512 outer product kernels for CGEMM as part of the
  ZEN4 and ZEN5 contexts, to handle RRC and CRC storage schemes. This is
  facilitated through optional packing of A matrix in the SUP framework.

AMD-Internal: [CPUPL-6498]

Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
2025-06-30 12:14:44 +05:30
Sharma, Arnav
5193433141 Disable GCC 11.4 tree loop optimization for AVX512 F32 Sigmoid Post-Op (#63)
- Disabled tree loop optimizations for all AVX512 F32 fringe kernels
  when compiled with GCC 11.4 to address numerical inaccuracies in
  Sigmoid post-op caused by aggressive loop optimizations.

- The fix uses function-level GCC attribute
  __attribute__((optimize("no-tree-loop-optimize"))) to selectively
  disable tree loop optimizations only for the affected kernels based on
  GCC version check.

AMD-Internal: [SWLCSG-3559]
2025-06-26 16:31:55 +05:30
V, Varsha
e05d24315e Bug Fixes for Accuracy issues in Int8 API (#62)
- In U8 GEMV n=1 kernels, the default zp condition was the S8 ZP type,
  which leads to accuracy issues when the u8s8s32u8 API is used.
- A few modifications in the bench code to take the correct path for
  the accuracy check.
2025-06-25 17:01:22 +05:30
V, Varsha
8fd7060b2f Matrix Add and Matrix Mul Post-op addition in F32 AVX512_256 kernels (#50)
Added Matrix-mul and Matrix-add postops in FP32 AVX512_256 GEMV kernels

 - Matrix-add and Matrix-mul post-ops in FP32 AVX512_256 GEMV m = 1 and
 n = 1 kernels have been added.

Co-authored-by: VarshaV <varshav2@amd.com>
2025-06-17 16:17:13 +05:30
S, Hari Govind
e097346658 Implemented Multithreading Support and Optimization of DGEMV API (#10)
- Implemented multithreading framework for the DGEMV API on Zen architectures. Architecture specific AOCL-dynamic logic determines the optimal number of threads for improved performance.

- The condition check for the value of beta is optimized by utilizing masked operations. The mask value is set based on value of beta, and the masked operations are applied when the vector y is loaded or scaled with beta.

AMD-Internal: [CPUPL-6746]
2025-06-17 12:39:48 +05:30
V, Varsha
875375a362 Bug Fixes in FP32 Kernels: (#41)
* Bug Fixes in FP32 Kernels:

 - The current implementation lets m=1 tiny cases into the LPGEMV_TINY loop,
 but the m=1 GEMV kernel call doesn't have the call to the GEMV_M_ONE kernels.
 Added the m=1 path in the LPGEMV_TINY loop by handling the pack A/Pack B/
 reorder B conditions.
- Added BF16 support for BIAS, Matrix-Add and Matrix-Mul for AVX512 F32
 main and GEMV kernels
- Added BF16 Matrix-Add and Matrix-Mul support for AVX512_256 F32 kernels.
- Modified the condition check in FP32 Zero point in AVX512 kernels, and
 fixed few bugs in Col-major Zero point evaluation.

AMD Internal: [ CPUPL - 6748 ]

* Bug Fixes in FP32 Kernels:

 - The current implementation lets m=1 tiny cases inside LPGEMV_TINY loop,
 but doesn't have the call to GEMV_M_ONE kernels. Added the m=1 path in
 LPGEMV_TINY loop by handling the pack A/Pack B/reorder B conditions.

- Added BF16 support for BIAS, Matrix-Add and Matrix-Mul for AVX512 F32
 main and GEMV kernels.

- Added BF16 Downscale, BIAS, Matrix-Add and Matrix-Mul support in AVX2 GEMV_N
 and AVX512_256 GEMV kernels.

- Added BF16 Matrix-Add and Matrix-Mul support for AVX512_256 F32 kernels.

- Modified the condition check in FP32 Zero point in AVX512 kernels, and
 fixed few bugs in Col-major Zero point evaluation and instruction usage.

AMD Internal: [ CPUPL - 6748 ]

---------

Co-authored-by: VarshaV <varshav2@amd.com>
2025-06-06 17:48:50 +05:30
Vankadari, Meghana
9e9441db47 Fix for n_fringe in AVX512 FP32 6x64 kernel (#42)
Details:
- Fixed the problem decomposition for n-fringe case of
  6x64 AVX512 FP32 kernel by updating the pointers
  correctly after each fringe kernel call.

-  AMD-Internal: SWLCSG-3556
2025-06-06 11:33:25 +05:30
Vankadari, Meghana
37efbd284e Added 6x16 and 6xlt16 main kernels for f32 using AVX512 instructions (#38)
* Implemented 6xlt8 AVX2 kernel for n<8 inputs

* Implemented fringe kernels for 6x16 and 6xlt16 AVX512 kernels for FP32

* Implemented m-fringe kernels for 6xlt8 kernel for AVX2

* Added the deleted kernels and fixed bias bug

AMD-Internal: SWLCSG-3556
2025-06-05 15:17:02 +05:30
Dave, Harsh
3c8b7895f7 Fixed functionality failure of DGEMM pack kernel. (#31)
* Fixed functionality failure of DGEMM pack kernel.

- Corrected the mask preparation needed for load/store
in edge kernel where m = 18.

- Corrected the usage of right vector registers while
storing data back to buffer in edge kernels.

AMD-Internal: [CPUPL-6773]

* Update bli_packm_zen4_asm_d24xk.c

---------

Co-authored-by: Harsh Dave <harsdave@amd.com>
2025-06-03 17:33:16 +05:30
V, Varsha
532eab12d3 Bug Fixes in LPGEMM for AVX512(SkyLake) machine (#24)
* Bug Fixes in LPGEMM for AVX512(SkyLake) machine

 - The B-matrix in the bf16bf16f32obf16/f32 API is re-ordered. For machines
  that don't support BF16 instructions, the BF16 input is un-reordered and
  converted to FP32 to use the FP32 kernels.

 - For n = 1 and k = 1 sized matrices, re-ordering in BF16 copies the
  matrix to the re-ordered buffer array. But the un-reordering to FP32
  requires the matrix size to be a multiple of 16 along n and a multiple
  of 2 along k.

 - The entry condition to the above has been modified for AVX512 configuration.

 - In bf16 API, the tiny path entry check has been modified to prevent
  seg fault while AOCL_ENABLE_INSTRUCTIONS=AVX2 is set in BF16 supporting
  machines.

 - Modified existing store instructions in FP32 AVX512 kernels to support
  execution on machines that have AVX512 support but not BF16/VNNI(SkyLake).

 - Added Bf16 beta and store types in FP32 avx512_256 kernels

AMD Internal: [SWLCSG-3552]

* Bug Fixes in LPGEMM for AVX512(SkyLake) machine

 - Support added in FP32 512_256 kernels for: beta, BIAS, zero-point and
   BF16 store types for bf16bf16f32obf16 API execution in AVX2 mode.

 - The B-matrix in the bf16bf16f32obf16/f32 API is re-ordered. For machines
  that don't support BF16 instructions, the BF16 input is un-reordered and
  converted to the FP32 type to use the FP32 kernels.

 - For n = 1 and k = 1 sized matrices, re-ordering in BF16 copies the
  matrix to the re-ordered buffer array. But the un-reordering to FP32
  requires the matrix size to be a multiple of 16 along n and a multiple
  of 2 along k. The entry condition here has been modified for the
  AVX512 configuration.

 - Fix for seg fault with AOCL_ENABLE_INSTRUCTIONS=AVX2 mode in BF16/VNNI
   ISA supporting configurations:
   - The BF16 tiny path entry check has been modified to take arch_id into
     account, to prevent improper entry into the tiny kernel.
   - The store in BF16->FP32 col-major for m = 1 conditions was updated to
     the correct storage pattern.
   - The BF16 beta load macro was modified to account for data in unaligned
     memory.

 - Modified existing store instructions in FP32 AVX512 kernels to support
  execution on machines that have AVX512 support but not BF16/VNNI(SkyLake).

AMD Internal: [SWLCSG-3552]

---------

Co-authored-by: VarshaV <varshav2@amd.com>
2025-05-30 17:22:49 +05:30
Negi, Deepak
ffd7c5c3e0 Postop support for Static Quant and Integer APIs (#20)
Support for S32 Zero point type is added for aocl_gemm_s8s8s32os32_sym_quant
Support for BF16 scale factors type is added for aocl_gemm_s8s8s32os32_sym_quant
U8 buffer type support is added for matadd, matmul, bias post-ops in all int8 APIs.

AMD-Internal: SWLCSG-3503
2025-05-27 16:29:32 +05:30
Negi, Deepak
121d81df16 Implemented GEMV kernel for m=1 case. (#5)
* Implemented GEMV kernel for m=1 case.

Description:

- Added a new GEMV kernel for AVX2 where m=1.
- Added a new GEMV kernel for AVX512 with ymm registers where m=1.
2025-05-13 16:33:04 +05:30
harsdave
cd83fc38b5 Add packing support for M edge cases in DGEMM 24xk pack kernel
Previously, the DGEMM implementation used `dscalv` for cases
where the M dimension of matrix A is not a multiple of 24,
resulting in a ~40% performance drop.

This commit introduces specialized edge-case handling in the pack kernel
to optimize performance for these cases.

The new packing support significantly improves the performance.

- Removed reliance on `dscalv` for edge cases, addressing the
  performance bottleneck.

AMD-Internal: [CPUPL-6677]

Change-Id: I150d13eb536d84f8eb439d7f4a77a04a0d0e6d60
2025-05-06 09:22:49 +05:30
Meghana Vankadari
8557e2f7b9 Implemented GEMV for n=1 case using 32 YMM registers
Details:
- This implementation is picked from the cntx when GEMM is invoked on
  machines that support AVX512 instructions by forcing the
  AVX2 path using AOCL_ENABLE_INSTRUCTIONS=AVX2 during run-time.
- This implementation uses MR=16 for GEMV.

AMD-Internal: [SWLCSG-3519]
Change-Id: I8598ce6b05c3d5a96c764d96089171570fbb9e1a
2025-05-05 05:31:13 -04:00
Meghana Vankadari
21aa63eca1 Implemented AVX2 based GEMV for n=1 case.
- Added a new GEMV kernel with MR = 8 which will be used
  for cases where n=1.
- Modified GEMM and GEMV framework to choose right GEMV kernel
  based on compile-time and run-time architecture parameters. This
  had to be done since GEMV kernels are not stored-in/retrieved-from
  the cntx.
- Added a pack kernel that packs A matrix from col-major to row-major
  using AVX2 instructions.

AMD-Internal: [SWLCSG-3519]
Change-Id: Ibf7a8121d0bde37660eac58a160c5b9c9ebd2b5c
2025-05-05 08:56:22 +00:00
Chandrashekara K R
b06c6f921b CMake: compiler flags updated for lpgemm kernels under zen folder.
The "-mno-avx512f" compiler flag has been added for zen/lpgemm
source files to address an issue observed with the znver4 compiler
flag when using GCC installed through Spack. The error message
"unsupported instruction `vpcmpeqd'" was encountered, indicating
unsupported AVX-512F instructions. As a workaround, the
"-mno-avx512f" flag was introduced, ensuring that AVX-512F
instructions are disabled during compilation.

AMD-Internal: [CPUPL-6694]

Change-Id: I546475226fbfea4931d568fc1b928cf6c8699b61
2025-04-30 06:09:36 -04:00
Hari Govind S
29f30c7863 Optimisation for DCOPY API
-  Introduced a new assembly kernel that copies data from source
   to destination from the front and back of the vector at the
   same time. This kernel provides better performance for larger
   input sizes.

-  Added a wrapper function responsible for selecting the kernel
   used by DCOPYV API to handle the given input for zen5
   architecture.

-  Updated AOCL-dynamic threshold for DCOPYV API in zen4 and
   zen5 architectures.

-  New unit-tests were included in the gtestsuite for the new
   kernel.

AMD-Internal: [CPUPL-6650]
Change-Id: Ie2af88b8e97196b6aa02c089e59247742002f568
2025-04-28 05:58:21 -04:00
Meghana Vankadari
4745cf876e Implemented a new set of kernels for f32 using 32 YMM regs
Details:
- These kernels are picked from cntx when GEMM is invoked
  on machines that support AVX512 instructions by forcing the
  AVX2 path using AOCL_ENABLE_INSTRUCTIONS=AVX2 during run-time.
- This path uses the same blocksizes and pack kernels as AVX512
  path.
- GEMV is disabled currently as AVX2 kernels for GEMV are not
  implemented.

AMD-Internal: [SWLCSG-3519]
Change-Id: I75401fac48478fe99edb8e71fa44d36dd7513ae5
2025-04-23 12:02:01 +00:00
Deepak Negi
48c7452b08 Beta and Downscale support for F32 AVX-512 kernels
Description
- To enable AVX512 VNNI support without native BF16 in BF16 kernels, the
  BF16 C_type is converted to F32 for computation and then cast back to
  BF16 before storing the result.
- Added support for handling BF16 zero-point values of BF16 type.
- Added a condition to disable the tiny path for the BF16 code path
  where native BF16 is not supported.

AMD Internal : [CPUPL-6627]

Change-Id: I1e0cfefd24c5ffbcc95db73e7f5784a957c79ab9
2025-04-23 06:12:14 -05:00
Vignesh Balasubramanian
b4b0887ca4 Additional optimizations to ZGEMM SUP and Tiny codepaths(ZEN4 and ZEN5)
- Added a set of AVX512 fringe kernels(using masked loads and
  stores) in order to avoid rerouting to the GEMV typed API
  interface(when m = 1). This ensures uniformity in performance
  across the main and fringe cases, when the calls are multithreaded.

- Further tuned the thresholds to decide between ZGEMM Tiny, Small
  SUP and Native paths for ZEN4 and ZEN5 architectures(in case
  of parallel execution). This would account for additional
  combinations of the input dimensions.

- Moved the call to Tiny-ZGEMM before the BLIS object creation,
  since this code-path operates on raw buffers.

- Added the necessary test-cases for functional and memory testing
  of the newly added kernels.

AMD-Internal: [CPUPL-6378][CPUPL-6661]
Change-Id: I9af73d1b6ef82b26503d4fc373111132aee3afd6
2025-04-23 00:56:58 -04:00
Arnav Sharma
87c9230cac Bugfix: Disable A Packing for FP32 RD kernels and Post-Ops Fix
- For single-threaded configuration of BLIS, packing of A and B matrices
  are enabled by default. But, packing of A is only supported for RV
  kernels where elements from matrix A are being broadcasted. Since
  elements are being loaded in RD kernels, packing of A results in
  failures. Hence, disabled packing of matrix A for RD kernels.

- Fixed the issue where c_i index pointer was incorrectly being reset
  when exceeding MC block thus, resulting in failures for certain
  Post-Ops.

- Fixed the FP32 reorder case where, for the n == 1 and rs_b == 1 condition, it
  was incorrectly using sizeof(BLIS_FLOAT) instead of sizeof(float).

AMD-Internal: [SWLCSG-3497]
Change-Id: I6d18afa996c253d79f666ea9789270bb59b629dd
2025-04-18 14:31:03 +05:30
Deepak Negi
f76f37cc11 Bug Fix in F32 eltwise Api with post ops(clip, swish, relu_scale).
Description
1. In the cases of clip, swish, and relu_scale, constants are currently
   loaded as float. However, they are of the C type, so the handling has been
   adjusted: for integer types these constants are first loaded as integers
   and then converted to float.

Change-Id: I176b805b69679df42be5745b6306f75e23de274d
2025-04-11 08:14:34 -04:00
varshav2
b9998a1d7f Added multiple ZP type checks in INT8 APIs
- Currently the int8/uint8 APIs do not support multiple ZP types,
   but work only with the int8 type or the uint8 type.
 - The support is added to enable multiple zp types in these kernels
   and added additional macros to support the operations.
 - Modified the bench downscale reference code to support the updated
   types.

AMD-Internal : [ SWLCSG-3304 ]

Change-Id: Ia5e40ee3705a38d09262086d20731e8f0a126987
2025-04-11 07:25:10 -04:00
Arnav Sharma
267aae80ea Added Post-Ops Support for F32 RD Kernels
- Support for Post-Ops has been added for all F32 RD AVX512 and AVX2
  kernels.

AMD-Internal: [SWLCSG-3497]
Change-Id: Ia2967417303d8278c547957878d93c42c887109e
2025-04-11 05:25:30 -04:00
Arnav Sharma
c68c258fad Added AVX512 and AVX2 FP32 RD Kernels
- Added FP32 RD (dot-product) kernels for both, AVX512 and AVX2 ISAs.
- The FP32 AVX512 primary RD kernel has blocking of dimensions 6x64
  (MRxNR) whereas it is 6x16 (MRxNR) for the AVX2 primary RD kernel.
- Updated the f32 framework to accommodate RD kernels in the case of B
  transpose, with thresholds.
- Updated the data gen python script.
TODO:
    - Post-Ops not yet supported.

Change-Id: Ibf282741f58a1446321273d5b8044db993f23714
2025-04-05 20:16:51 -05:00
varshav
81d219e3f8 Added destination scale type check in INT8 API's
- Updated the S8 main, GEMV, m_, n_ and mn_ fringe kernels to support
   multiple scale types for vector and scalar scales

 - Updated the U8 main, GEMV, m_, n_, extMR_ and mn_ fringe kernels to
   support multiple scale types for vector and scalar scales

 - Updated the bench to accommodate multiple scale type input, and
   modified the downscale_accuracy_check_ to verify with multiple scale
   type inputs.

AMD Internal: [ SWLCSG-3304 ]

Change-Id: I7b9f3ec8ea830d3265f72d18a0aa36086e14a86e
2025-03-28 00:51:17 -05:00
Deepak Negi
fb4617d7c3 Bug fix in gemv_n kernel of f32 api.
Description:
1. For the column major case when m=1, there was an accuracy mismatch with
   post-ops (bias, matrix_add, matrix_mul).
2. Added a check for the column major case and replaced _mm512_loadu_ps with
   _mm512_maskz_loadu_ps.

AMD-Internal: [CPUPL-6585]

Change-Id: I8d98e2cb0b9dd445c9868f4c8af3abbc6c2dfc95
2025-03-12 06:22:47 -05:00
harsh dave
a359a25765 Fix typo in 24x8m DGEMM sup kernel causing incorrect result.
- Corrected a typo in the dgemm kernel implementation, in the beta=0 and
  n_left=6 edge kernel.

Thanks to Shubham Sharma<shubham.sharma3@amd.com> for helping with debugging.

AMD-Internal: [CPUPL-6443]
Change-Id: Ifa1e16ec544b7e85c21651bc23c4c27e86d6730b
2025-03-07 04:43:17 -05:00
Vignesh Balasubramanian
c4b84601da AVX512 optimizations for CGEMM(rank-1 kernel)
- Implemented an AVX512 rank-1 kernel that is
  expected to handle column-major storage schemes
  of A, B and C(without transposition) when k = 1.

- This kernel is single-threaded, and acts as a direct
  call from the BLAS layer for its compatible inputs.

- Defined custom BLAS and BLIS_IMPLI layers for CGEMM
  (instead of using the macro definition), in order to
  integrate the call to this kernel at runtime(based on
  the corresponding architecture and input constraints).

- Added unit-tests for functional and memory testing of the
  kernel.

- Updated the ZEN5 context to include the AVX512 CGEMM
  SUP kernels, with its cache-blocking parameters.

AMD-Internal: [CPUPL-6498]
Change-Id: I42a66c424325bd117ceb38970726a05e2896a46b
2025-03-06 20:14:05 +05:30
Vignesh Balasubramanian
07df9f471e AVX512 optimizations for CGEMM(SUP)
- Implemented the following AVX512 SUP
  column-preferential kernels(m-variant) for CGEMM :
  Main kernel    : 24x4m
  Fringe kernels : 24x3m, 24x2m, 24x1m,
                   16x4, 16x3, 16x2, 16x1,
                   8x4, 8x3, 8x2, 8x1,
                   fx4, fx3, fx2, fx1(where 0<f<8).

- Utilized the packing kernel to pack A when
  handling inputs with CRC storage scheme. This
  would in turn handle RRC with operation transpose
  in the framework layer.

- Further added C prefetching to the main kernel,
  and updated the cache-blocking parameters for
  ZEN4 and ZEN5 contexts.

- Added a set of decision logics to choose between
  SUP and Native AVX512 code-paths for ZEN4 and ZEN5
  architectures.

- Updated the testing interface for complex GEMMSUP
  to accept the kernel dimension(MR) as a parameter, in
  order to set the appropriate panel stride for functional
  and memory testing. Also updated the existing instantiators
  to send their kernel dimensions as a parameter.

- Added unit tests for functional and memory testing of these
  newly added kernels.

AMD-Internal: [CPUPL-6498]

Change-Id: Ie79d3d0dc7eed7edf30d8d4f74b888135f31d6b4
2025-03-06 06:03:39 -05:00
Hari Govind S
8998839c71 Optimisation of DGEMV Transpose Case for unit stride
- Included a new code section to handle input having non-unit strided y
  vector for dgemv transpose case. Removed the same from the respective
  kernels to avoid repeated branching caused by condition checks within
  the 'for' loop.

- The condition check for beta equal to zero in the primary kernels
  is moved outside the 'for' loop to avoid repeated branching.

- The '_mm512_reduce_pd' operations in the primary kernel are replaced by
  a series of operations to reduce the number of instructions required
  to reduce the 8 registers.

- Changed the naming convention for DGEMV transpose kernels.

- Modified the unit kernel test to avoid the y increment for DGEMV
  transpose kernels during the test.

AMD-Internal: [CPUPL-6565]
Change-Id: I1ac516d6b8f156ac53ac9f6eb18badd50e152e05
2025-03-06 05:15:58 -05:00
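For the reduction change described above, the result the kernel must produce can be modeled in scalar code: eight 8-lane accumulators each collapse to one dot-product sum. This is only a semantic model using a pairwise add tree; the actual kernel replaces eight independent `_mm512_reduce_pd` calls with a shared shuffle/add sequence across registers, and `reduce_8x8` is a hypothetical name:

```python
# Scalar model of the horizontal reduction in the DGEMV transpose
# kernel: eight 8-lane accumulators -> eight scalar sums. A pairwise
# add tree needs log2(8) = 3 stages per accumulator; the AVX512 code
# interleaves the eight registers so the stages are shared.
def reduce_8x8(acc):
    """acc: eight 8-element lanes -> list of eight reduced sums."""
    out = []
    for lanes in acc:
        lanes = list(lanes)           # work on a copy
        width = len(lanes) // 2
        while width >= 1:             # fold the upper half onto the lower
            for i in range(width):
                lanes[i] += lanes[i + width]
            width //= 2
        out.append(lanes[0])
    return out
```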
Meghana Vankadari
7243a5d521 Implemented group level static quantization for s8s8s32of32|bf16 APIs
Details:
- Group quantization is a technique to improve accuracy
  where the scale factors used to quantize inputs and weights
  vary at group level instead of at per-channel
  or per-tensor level.
- Added new bench files to test GEMM with symmetric static
  quantization.
- Added new get_size and reorder functions to account for
  storing sum of col-values separately per group.
- Added new framework, kernels to support the same.
- The scale factors can be of type float or bf16.

AMD-Internal:[SWLCSG-3274]

Change-Id: I3e69ecd56faa2679a4f084031d35ffb76556230f
2025-02-28 04:44:44 -05:00
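The group-level scaling described above can be sketched as a scalar dot product where the k dimension is split into groups, each with its own pair of scale factors. Names and shapes here are illustrative, not the LPGEMM API's:

```python
# Sketch of group-level static quantization for an s8 x s8 dot product:
# integer accumulation happens within a group, then each group's partial
# sum is scaled by that group's scale_a/scale_b pair (instead of a single
# per-channel or per-tensor scale). Hypothetical helper, not the API.
def dot_group_quant(a, b, group_size, scale_a, scale_b):
    total = 0.0
    for g in range(0, len(a), group_size):
        s32 = sum(int(x) * int(y)
                  for x, y in zip(a[g:g + group_size], b[g:g + group_size]))
        gi = g // group_size
        total += s32 * scale_a[gi] * scale_b[gi]   # scales vary per group
    return total
```

This also shows why the commit stores per-group column sums separately: any zero-point correction must be applied with that group's scale.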
Vignesh Balasubramanian
99770558bb AVX512 optimizations for CGEMM(Native)
- Implemented the following AVX512 native
  computational kernels for CGEMM :
  Row-preferential    : 4x24
  Column-preferential : 24x4

- The implementations use a common set of macros,
  defined in a separate header, since the
  implementations differ solely in the matrix
  chosen for load/broadcast operations.

- Added the associated AVX512 based packing kernels,
  packing 24xk and 4xk panels of input.

- Registered the column-preferential kernel(24x4) in
  ZEN4 and ZEN5 contexts. Further updated the cache-blocking
  parameters.

- Removed redundant BLIS object creation and its contingencies
  in the native micro-kernel testing interface (for complex types).
  Added the required unit-tests for memory and functionality
  checks of the new kernels.

AMD-Internal: [CPUPL-6498]
Change-Id: I520ff17dba4c2f9bc277bf33ba9ab4384408ffe1
2025-02-28 03:18:24 -05:00
Meghana Vankadari
6c29236166 Bug fixes in bench and pack code for s8 and bf16 datatypes
Details:
- Fixed the logic to identify an API that has int4 weights in
  bench files for gemm and batch_gemm.
- Eliminated the memcpy instructions used in pack functions of
  zen4 kernels and replaced them with masked load instruction.
  This ensures that the load register is populated with
  zeroes at locations where the mask is set to zero.

Change-Id: I8dd1ea7779c8295b7b4adec82069e80c6493155e
AMD-Internal:[SWLCSG-3274]
2025-02-28 01:18:11 -05:00
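The masked-load replacement above can be modeled in scalar terms: lanes whose mask bit is clear are zero-filled instead of being read, which is why a single masked load can stand in for the memcpy-into-a-zeroed-buffer pattern. This is a semantic model of an instruction like `_mm512_maskz_loadu_epi8`, not the kernel code:

```python
# Scalar model of an AVX512 zero-masking load: lane i gets src[i] only
# when bit i of the mask is set, and 0 otherwise. Lanes with a clear
# mask bit are never read, so no out-of-bounds access occurs on the
# fringe (src here is only 3 elements long in the test below).
def maskz_load(src, mask, lanes=16):
    return [src[i] if (mask >> i) & 1 else 0 for i in range(lanes)]
```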
Arnav Sharma
b4c1026ec2 Added Support for General Stride in DGEMV
- Updated the bli_dgemv_zen_ref( ... ) kernel to support general stride.
- Since the latest dgemv kernels don't support general stride, added
  checks to invoke bli_dgemv_zen_ref( ... ) when A matrix has a general
  stride.
- Thanks to Vignesh Balasubramanian <vignesh.balasubramanian@amd.com>
  for finding this issue.

AMD-Internal: [CPUPL-6492]
Change-Id: Ia987ce7674cb26cb32eea4a6e9bd6623f2027328
2025-02-27 12:47:21 -05:00
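The fallback check described above amounts to detecting a general-stride A, i.e. neither the row stride nor the column stride is 1. A minimal sketch (`needs_ref_kernel` is a hypothetical name; the actual check lives in the framework dispatch):

```python
# Hypothetical sketch of the dispatch guard: the optimized DGEMV kernels
# assume a unit stride in one dimension of A, so a general-stride A
# (neither stride equal to 1) routes to bli_dgemv_zen_ref( ... ).
def needs_ref_kernel(rs_a: int, cs_a: int) -> bool:
    return rs_a != 1 and cs_a != 1
```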
Shubham Sharma
e6ca01c1ba Fixed C prefetch in 8x24 DGEMM kernel
- In 8x24 DGEMM kernel, prefetch is always done assuming
  row major C.
- For TRSM, the DGEMM kernel can be called with column major C also.
- Current prefetch logic results in suboptimal performance.
- Changed C prefetch logic so that correct C is prefetched for both row
  and column major C.

 AMD-Internal: [CPUPL-6493]

Change-Id: I7c732ceac54d1056159b3749544c5380340aacd2
2025-02-27 12:17:29 -05:00
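The storage-aware prefetch above can be sketched as choosing which dimension of the 8x24 C block supplies the contiguous cache-line runs: rows when C is row-major (cs_c == 1), columns when it is column-major. This sketch returns the element offsets a kernel would prefetch; it is illustrative, not the kernel's code:

```python
# Sketch of storage-aware C prefetching for an mr x nr DGEMM block.
# Row-major C: each row is contiguous, so prefetch one address per row.
# Column-major C: each column is contiguous, so prefetch one per column.
# Prefetching by the wrong dimension (the original bug) touches only a
# fraction of the block's cache lines.
def c_prefetch_offsets(rs_c, cs_c, mr=8, nr=24):
    if cs_c == 1:                          # row-major C
        return [i * rs_c for i in range(mr)]
    return [j * cs_c for j in range(nr)]   # column-major C
```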
Mithun Mohan
9906fd7b91 F32 eltwise kernel updates to use masks in scale factor load.
- Currently the scale factor is loaded without using a mask in the
downscale and matrix add/mul ops in the F32 eltwise kernels. This
results in out-of-bounds memory reads when n is not a multiple of
NR (64).
- The loads are updated to masked loads to fix this.

AMD-Internal: [SWLCSG-3390]

Change-Id: Ib2fc555555861800c591344dc28ac0e3f63fd7cb
2025-02-27 08:17:58 -05:00
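The fix above amounts to building a mask whose low n-remainder bits are set, so the fringe load touches only the valid scale-factor elements. A sketch of the mask computation, shown for a 16-lane register for brevity (the commit's NR is 64; `fringe_mask` is a hypothetical name):

```python
# Sketch of the k-mask for a fringe load when n is not a multiple of
# the register width: set the low (n % lanes) bits, or the full mask
# when n divides evenly. Passing this mask to a masked load keeps the
# access inside the valid n elements.
def fringe_mask(n, lanes=16):
    n_left = n % lanes
    return (1 << n_left) - 1 if n_left else (1 << lanes) - 1
```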
Deepak Negi
cc321fb95d Added support for different types of zero-point in f32 eltwise APIs.
Description
 - Zero-point support for <s32/s8/bf16/u8> datatypes in element-wise
   postop-only f32o<f32/s8/u8/s32/bf16> APIs.

 AMD-Internal: [SWLCSG-3390]

Change-Id: I2fdb308b05c1393013294df7d8a03cdcd7978379
2025-02-26 04:04:13 -05:00
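Zero-point application in a downscale can be sketched as: convert the zero point (whatever its storage type) to f32, add it after scaling, then saturate to the output range. An illustrative s8-output version; the rounding mode and operation order here are assumptions, not taken from the source:

```python
# Sketch of a downscale with zero point in an f32 element-wise post-op
# chain: scale in f32, add the (converted-to-f32) zero point, round,
# then saturate to the s8 output range. Hypothetical helper.
def downscale_f32_to_s8(x: float, scale: float, zero_point) -> int:
    y = round(x * scale + float(zero_point))
    return max(-128, min(127, y))   # saturate to s8
```

The point of supporting <s32/s8/bf16/u8> zero points is only the conversion step; the arithmetic after conversion is the same f32 path.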
Mithun Mohan
7394aafd1e New A packing kernels for F32 API in LPGEMM.
- New packing kernels for the A matrix, based on both AVX512 and AVX2
ISAs, for both row- and column-major storage, are added as part of this
change. This removes the dependency on the haswell A packing kernels.
- Tiny GEMM thresholds are further tuned for the BF16 and F32 APIs.

AMD-Internal: [SWLCSG-3380, SWLCSG-3415]

Change-Id: I7330defacbacc9d07037ce1baf4a441f941e59be
2025-02-26 05:23:35 +00:00
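The A packing the new kernels perform with SIMD can be modeled scalar-wise: element (i, p) of an mr x k panel lands at panel[p*mr + i], with zero padding for short edge panels, and the rs_a/cs_a strides cover both row- and column-major storage. A semantic sketch with illustrative names:

```python
# Sketch of packing an mr x k panel of A into the contiguous layout a
# BLIS-style microkernel consumes: k "slivers" of mr elements each,
# so the kernel can stream mr contiguous values per k-iteration.
# Rows beyond m (edge panels) are zero-padded.
def pack_a_panel(a_flat, rs_a, cs_a, m, mr, k):
    return [a_flat[i * rs_a + p * cs_a] if i < m else 0.0
            for p in range(k) for i in range(mr)]
```

With rs_a equal to the row length and cs_a = 1 this packs row-major A; swapping the strides covers column-major A with the same code, which is the shape of the row/column-major support the commit mentions.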