403 Commits

Author SHA1 Message Date
Smyth, Edward
4ee6f75292 GCC 16 fixes
Changes to fix errors and warnings when using gcc 16.1.0:
- Copy changes from 5c2b22da81 in upstream BLIS to extend disabling of
  tree-vectorization in affected kernels to gcc 16 and later.
- Remove unused variables in bli_packm_blk_var1_md.c and bli_util_unb_var1.c
  to fix warning messages.

Background
bp (base pointer) is the %rbp/%ebp register on x86/x86-64. Inline assembly
kernels in BLIS use asm volatile blocks where they manually manage registers
- including saving and restoring bp themselves to use it as a general-purpose
register for holding loop counters or matrix pointers.

When GCC's tree-vectorizer (specifically the superword-level parallelism (SLP)
pass) runs on a translation unit containing inline asm, it can generate code
that itself needs bp as a frame pointer or in the vectorized prologue/epilogue.
At that point GCC internally marks bp as unavailable and then, when it tries to
compile the inline asm block that also references bp, it throws an error.

As a workaround, disabling tree vectorization for the entire file removes the
conflict - with no vectorizer-generated code, bp stays free for the inline asm.
2026-05-18 15:54:29 +01:00
S, Hari Govind
cdd181a7d7 Optimize complex scalv kernels with inline assembly and FMA instructions
This PR optimizes the complex scalar vector multiplication kernels by replacing
intrinsics with inline assembly and leveraging FMA (Fused Multiply-Add) instructions
for improved performance.

Changes:
- Replaced intrinsic-based implementation with inline assembly
- Utilizes `vfmaddsub231ps`, `vfmadd231ss`, and `vfmsub231ss` FMA instructions
- Improved instruction scheduling and register usage
- Handles both unit-stride (vectorized) and non-unit-stride (scalar) cases
- Processes up to 16 complex elements per iteration in the main loop
2026-04-13 16:06:02 +05:30
Rayan, Rohan
d512e3a736 Converting mul+add to FMA for ddot, daxpy and daxpyf zen kernels
- Replace mul+add with FMA for ddot, daxpy and daxpyf
- Using masked operations where possible
- Non-unit stride code paths still use scalar loops, but use FMAs for accuracy

AMD-Internal: CPUPL-8055
Co-authored-by: Rohan Rayan rohrayan@amd.com
2026-03-20 11:07:12 +05:30
Rayan, Rohan
af4c3ef1a5 Fixing memory issues in sgemm SUP kernels on AVX2 and AVX512
* Resolved memory-access issues in the SGEMM SUP kernels on AVX2 and AVX-512 by correcting instructions that could read invalid addresses in the C matrix.
* Removed k=0 kernel gtests for the native SGEMM and DGEMM paths as these tests caused spurious failures for kernels that are not intended to handle this case.
* Standardized all instruction macros to lowercase in the Zen4 kernel to improve readability and code consistency.

---------

AMD-Internal: CPUPL-8117
Co-authored-by: Rayan <rohrayan@amd.com>
2026-03-16 16:05:34 +05:30
Dave, Harsh
48a6db6c69 add support for conjugate transpose in avx512 zgemm sup kernel (#300)
* ZGEMM SUP: Add conjugate support for AVX-512 kernels on Zen4/Zen5/Zen6

- Add CONJA, CONJB and CONJA_CONJB variants to zgemm SUP micro-tiles
- Enable SUP path for conjugate cases when both are same type
- Unify RRC/CRC storage to use CV kernel variant
- Update SUP dispatch to handle conjugate flags correctly

Note: CONJ_NO_TRANSPOSE + CONJ_NO_TRANSPOSE and
      CONJ_TRANSPOSE + CONJ_TRANSPOSE remain unsupported

---------

Co-authored-by: harsdave <harsdave@amd.com>
2026-03-12 19:00:53 +05:30
Dave, Harsh
fbd45e8eab optimize edge and non-unit stride path with intrinsics ins/c axpyv kernels (#334)
* optimize edge and non-unit stride path with intrinsics ins/c axpyv kernels

- Updated the edge and non-unit stride path in c/saxpy to use intrinsics.
- This ensures that the edge and non-unit stride cases maintain numerical
   consistency with the optimized Zen assembly/intrinsic path.

---------

Co-authored-by: harsdave <harsdave@amd.com>
2026-03-12 17:20:44 +05:30
Sharma, Shubham
4e84bbfb68 Ensure Accumulation Consistency Across DOTXF and DOTXV kernels (#325)
APIs like GEMV use DOTXF (for parts of problem which are multiple of fuse_factor) and DOTXV (for parts not multiple of fuse_factor).

DOTXF and DOTXV use different numbers of temporary accumulation registers(rho).

This results in different round offs which can be significant when sizes are small and problem is about equally divided between DOTXF and DOTXV.

To fix this, the number of temporary accumulation (and therefore roundoffs) and have made identical across both kernels.

Known related GCC bugs to reference

GCC Bug #56812 — incorrect code with vzeroupper and register allocation
GCC Bug #95483 — vzeroupper clobbers live values
GCC Bug #101617 — wrong code generation with AVX intrinsics and transitions

AMD-Internal:CPUPL-8015
2026-03-03 16:38:55 +05:30
Sharma, Shubham
3bbf12665c Ensure consistency across AVX2 and AVX512 AMAX kernels
Fix NaN handling in AVX2 amax kernel by initializing global max to 0

In the AVX2 kernel, the global maximum (curr_max_val) was previously
initialized to -1, while the local maximum (temp) is initialized to 0.

The kernel determines the maximum using the condition:
max(abs(x[i]), temp) > curr_max_val

Because curr_max_val started at -1, this initial check always evaluates
to true for the first iteration ( 0 > -1), which is incorrect if the
first element is a NaN.

If subsequent elements are valid numbers, curr_max_val eventually recovers
and updates to the correct value. However, if the entire array consists of
NaNs, this logic fails to properly update the trackers, meaning we never
correctly return index 0 as required by the BLAS standard.

To fix this and align the behavior with the AVX-512 kernel, curr_max_val
is now initialized to 0. This ensures that the initial condition evaluates
correctly and all-NaN arrays return the proper index (0 if all values are NaN).

Window start and end index are also updated to 0 which is the minimum valid
value of index.

AMD-Internal: CPUPL-8047
2026-03-03 13:01:52 +05:30
Smyth, Edward
011c75dddb Remove unnecessary OpenMP include (AOCL)
Copy of similar change in upstream BLIS (843a5e8) to fix issues
https://github.com/flame/blis/issues/873 and
https://github.com/amd/blis/issues/50

Details:
- Previously, `<omp.h>` was included in `bli_thrcomm_openmp.h` so that the
  framework could access the necessary OpenMP functions.
- As @melven reported (#873), this causes issues when `blis.h` is included
  in C++ code since the `<omp.h>` include happens with `extern "C"`.
- Move the include from the header to the necessary .c files so that it
  does not "pollute" `blis.h`.

Thanks to @DaAwesomeP and @bartoldeman for reporting this issue in
AOCL BLIS

AMD-Internal: [CPUPL-7303]
2026-02-06 10:41:38 +00:00
Rayan, Rohan
ebf8721a5c Optimizing sgemm rd kernels on zen3 (#293)
Fixing some inefficiencies on the zen (AVX2) SUP RD kernel for SGEMM.
After performing the iteration for the 8 loop, the next loop that was being performed was the 1 loop for the k-direction.
This caused a lot of unnecessary iterations when the remainder of k < 8.
This has been fixed by introducing masked operations for k < 8
When remainder of k == 1, we handle this with the original non-masked code (with a branch) as the masked code introduces more penalty because of the masking operation.
There were also some unnecessary instructions in the zen4 kernels which have been removed.

AMD-Internal: https://amd.atlassian.net/browse/CPUPL-7775
Co-authored-by: rohrayan@amd.com
2026-02-04 09:08:11 +05:30
S, Hari Govind
ec6f4e96cd Replace intrinsics with inline assembly for bli_saxpyv_zen4_int and bli_saxpyf_zen_int_5
GCC over-optimizes intrinsics code by reordering and interleaving
instructions, making it difficult to verify correctness and causing
potential accuracy issues in certain cases. This change replaces
intrinsics-based implementations with inline assembly to ensure
one-to-one mapping between source and generated assembly.

Changes:
- bli_saxpyv_zen4_int: Converted AVX-512 intrinsics to inline assembly
  * Processes blocks of 128, 64, 32, 16, and 8 elements
  * Handles fringe cases with masked operations
  * Preserves scalar path for non-unit strides

- bli_saxpyf_zen_int_5: Converted AVX2 intrinsics to inline assembly
  * Processes blocks of 16 and 8 elements with 5-way fusion
  * Handles fringe cases with masked operations
  * Preserves scalar path for non-unit strides

Benefits:
- Predictable code generation with no compiler reordering
- Better numerical accuracy by preventing unexpected transformations
- Easier verification of generated assembly against specifications
- Explicit control over instruction sequence and register allocation
2026-01-29 11:48:47 +05:30
Rayan, Rohan
a22e0022c2 SGEMM tiny path tuning for zen4 and zen5 (#267)
* Adding a model to determine which matrices enter the SGEMM tiny path
* This extends the sizes of matrices that enter the tiny path, which was constrained to the L1 cache size previously
* Now matrices that fit in L2 are also allowed into the tiny path, provided they are determined to be faster than the SUP path
* Adding thresholds based on the SUP path sizes
* Added for Zen4 and Zen5

---------
AMD-Internal: CPUPL-7555
Co-authored-by: Rohan Rayan <rohrayan@amd.com>
2025-12-10 15:58:54 +05:30
Balasubramanian, Vignesh
54ac36c8bc Bugfix: BF16 to F32 conversion in AVX2 F32 codepath
- Updated the conversion function(in case of receiving
  column stored inputs) from BF16 to F32, in order to
  use the correct strides while storing.

- Conversion of B is potentially multithreaded using
  the threads meant for IC compute. With the wrong
  strides in the kernel, this gives rise to incorrect
  writes onto the miscellaneous buffer.

AMD-Internal: [CPUPL-7675]

Co-authored-by: Vishal-A <Vishal.Akula@amd.com>
Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
2025-12-01 15:06:13 +05:30
S, Hari Govind
4ecfbde082 Fix extreme values handling in GEMV
- When alpha == 0, we are expected to only scale y vector with beta and not read A or X at all.
- This scenario is not handled properly in all code paths which causes NAN and INF from A and X being wrongly propagated. For example, for non-zen architecture (default block in switch case) no such check is present, similarly some of the avx512 kernels are also missing these checks.
- When beta == 0, we are not expected to read Y at all, this also is not handled correctly in one of the avx512 kernel.
- To fix these, early return condition for alpha == 0 is added to bla layer itself so that each kernel does not have to implement the logic.
- DGEMV AVX512 transpose kernel has been fixed to load vector Y only when beta != 0.

AMD-Internal: [CPUPL-7585]
2025-11-08 12:30:03 +05:30
Rayan, Rohan
e85be22da0 Adding tiny path for SGEMM (#237)
Adding SGEMM tiny path for Zen architectures.
Needed to cover some performance gaps seen wrt MKL
Only allowing matrices that all fit into the L1 cache to the tiny path
Only tuned for single threaded operation at the moment
Todo: Tune cases where AVX2 performs better than AVX512 on Zen4
Todo: The current ranges are very conservative, there may be scope to increase the matrix sizes that go into the tiny path

AMD-Internal: CPUPL-7555
Co-authored-by: Rohan Rayan rohrayan@amd.com
2025-10-24 13:14:33 +05:30
V, Varsha
fecb1aa7a5 Bug Fix in BF16 AVX2 conversion path (#236)
- In the current implementation of bf16 to f32 conversion for packed data
 we handle both GEMM and GEMV conditions in the same function separated
 with conditions.
 - But, when n = (NC+1) the function would execute GEMV conversion logic
 and write back the data inaccurately leading to accuracy issues.
 - Hence, modified the convert function and reorder functions to have
 separate conversion logic to make it cleaner and avoid confusions.
 -  Also, updated the API calls to adhere to the changes appropriately.

[AMD-Internal: CPUPL-7540]
2025-10-17 15:38:02 +05:30
Bhaskar, Nallani
db3134ed6d Disabled no post-ops path in lpgemm f32 kernels for few gcc versions
Guarded np (no post-ops) path in f32 API with a macro 
 as a workaround as gcc 11.4 and 11.2 are giving accuracy issues 
 with np path.
2025-09-22 15:52:21 +05:30
Smyth, Edward
e59eabaf58 Compiler warnings fixes (2)
Fix compiler warning messages in LPGEMM code:
- Removed extraneous parentheses in aocl_batch_gemm_s8s8s32os32.c
- Removed unused variables in lpgemv_{m,n}_kernel_s8_grp_amd512vnni.c
- Changed ERR_UBOUND in math_utils_avx2.h and math_utils_avx512.h
  to match how it is specified in AOCL libm erff.c

AMD-Internal: [CPUPL-6579]
2025-09-17 18:28:34 +01:00
KadavilMadanaMohanan, MithunMohan (Mithun Mohan)
5de25ce9a7 Fixed high priority coverity issues in LPGEMM. (#178)
* Fixed high priority coverity issues in LPGEMM.

-Out of bounds issue and uninitialized variables fixed in aocl_gemm addon.
2025-09-11 18:27:19 +05:30
Smyth, Edward
a4db661b44 GCC 15 SUP kernel workaround (2)
Previous commit (30c42202d7) for this problem turned off
-ftree-slp-vectorize optimizations for all kernels. Instead, copy
the approach of upstream BLIS commit 36effd70b6a323856d98 and disable
these optimizations only for the affected files by using GCC pragmas

AMD-Internal: [CPUPL-6579]
2025-09-04 17:14:06 +01:00
Smyth, Edward
fb2a682725 Miscellaneous changes
- Change begin_asm and end_asm comments and unused code in files
     kernels/haswell/3/sup/s6x16/bli_gemmsup_rv_haswell_asm_sMx6.c
     kernels/zen4/3/sup/bli_gemmsup_cd_zen4_asm_z12x4m.c
  to avoid problems in clobber checking script.
- Add missing clobbers in files
     kernels/zen4/1m/bli_packm_zen4_asm_d24xk.c
     kernels/zen4/1m/bli_packm_zen4_asm_z12xk.c
     kernels/zen4/3/sup/bli_gemmsup_cv_zen4_asm_z12x4m.c
- Add missing newline at end of files.
- Update some copyright years for recent changes.
- Standardize license text formatting.

AMD-Internal: [CPUPL-6579]
2025-08-26 16:37:43 +01:00
Vankadari, Meghana
a05279cd97 Bug fix in F32 AVX2 kernels (#164)
- corrected the loading strides used to load matadd and matmul
  pointers in F32 AVX2 kernels.

AMD-Internal: CPUPL-7221
2025-08-26 19:52:50 +05:30
Vankadari, Meghana
7d42db73e5 Fixed out-of-bound memory accesses in F32 API (#161)
- Corrected the BF16 data handling in post-ops for F32 API.
- Verified and ensured that mask-loads are used wherever necessary.

AMD-Internal: CPUPL-7221
2025-08-26 17:53:13 +05:30
Vankadari, Meghana
5044b69d3d Bug fix in LPGEMV m=1 AVX2 kernel for post-ops
Details:
- Fixed loading of matadd and matmul pointers in GEMV
 lt16 kernel for AVX2 M=1 case.
- Hard-set row-stride of B to 1(inside GEMV), when it has
   already been reordered.

AMD-Internal:CPUPL-7197, CPUPL-7221
Co-authored-by:Balasubramanian, Vignesh <Vignesh.Balasubramanian@amd.com>
2025-08-22 18:15:05 +05:30
S, Hari Govind
d29f3f0b5e Fix GCC 12+ instruction scheduling issue in complex scalv kernel (#149)
Replace fused multiply-add (FMA) intrinsics with explicit multiply and add/subtract operations in bli_cscalv_zen_int to resolve incorrect results with GCC 12 and later compilers.

The original code used register reuse pattern with _mm256_fmaddsub_ps() that causes GCC 12+ instruction scheduler to generate assembly with corrupted intermediate values due to register allocation conflicts. GCC 11 and earlier handled the same pattern correctly.

Changes:
- Replace _mm256_fmaddsub_ps() with _mm256_mul_ps() + _mm256_addsub_ps()
- Eliminate temp register reuse to fix instruction scheduling conflicts

AMD-Internal: [CPUPL-6445]
2025-08-22 14:23:43 +05:30
Smyth, Edward
509aa07785 Standardize Zen kernel names
Naming of Zen kernels and associated files was inconsistent with BLIS
conventions for other sub-configurations and between different Zen
generations. Other anomalies existed, e.g. dgemmsup 24x column
preferred kernels names with _rv_ instead of _cv_. This patch renames
kernels and file names to address these issues.

AMD-Internal: [CPUPL-6579]
2025-08-19 18:19:51 +01:00
Sharma, Shubham
3a14417ce1 DGEMV BugFixes and code cleanup (#134)
- Modified gemv (matrix-vector multiply) reference for better handling of transpose flags.
- Modified Zen4 kernel implementations for better handling of transpose flags and vector stride (incy).
- The changes refine kernel selection logic and move variable definition in macro guards.
2025-08-14 12:54:06 +05:30
Sharma, Shubham
b0a4914417 Added DGEMV no transpose multithreaded Implementations (#12)
* Added DGEMV no transpose multithreaded Implementations
- Added new avx512 M and N kernels for DGEMV.
- Added multiple MT implementations for same kernels.
- Added AOCL_dynamic logic for L2 apis.
- Tuned AOCL_dynamic and code path selection for DGEMV on ZEN5.
- Added same kernels for SGEMV, but these kernels are not enabled yet.
- Added SGEMV reference kernel.

AMD-Internal: [SWLCSG-3408]

Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
2025-08-12 10:39:12 +05:30
Bhaskar, Nallani
46aac600ec Added f32 kernels without post-ops to avoid overhead
Description:

1. Crated f32 intrinsic kernels without post-ops support f32 gemm
   without post-ops optimally.
2. Initiated the no post-ops kernels from main kernel when post-ops
   hander has no post-ops to do.
3. The kernels are redundant but added to get the best perf
   for pure GEMM call.

AMD-Internal : SWLCSG-3692
2025-07-25 23:14:23 +05:30
S, Hari Govind
273a05f0bd Fix for performance regression caused by non-unit stride y in DGEMV API (#91)
- Temperory fix for regression in DGEMV for non-unit stride y inputs. The code
  section responsible for handling non-unit stride y has been removed from the
  frame.

- The kernel code is extended with if condition to handle both unit and non-unit
  stride y.

AMD-Internal: [CPUPL-6869]
2025-07-25 10:57:57 +05:30
V, Varsha
9e8c9e2764 Fixed compiler warnings in LPGEMM
- Modified the correct variables to be passed for the batch_gemm_thread_decorator() for
 u8s8s32os32 API.
 - Removed commented lines in f32 GEMV_M kernels.
 - Modified some instructions in F32 GEMV M and N Kernels to re-use the existing macros.
 - Re-aligned the BIAS macro in the macro definition file.

[ AMD - Internal : CPUPL - 7013 ]
2025-07-18 16:15:52 +05:30
Sharma, Shubham
355018e739 Fixed Extra reads in DTRSM small kernels.
In DTRSM small code path lower triangular kernels, extra data from upper triangular region is being read.
To fix this, new macros have been added to make sure only relevant data is read.
AMD-Internal: [SWLCSG-3611]
2025-07-17 10:17:13 +05:30
V, Varsha
837d3974d4 Bug Fixes for GEMV AVX2 BF16 to F32 path
- Added the correct strides to be used while unreorder/convert B matrix in m=1 cases.
 - Modified Zero point vector loads to proper instructions.
 - Modified bf16 store in AVX2 GEMV M kenrel

AMD Internal - [SWLCSG - 3602 ]
2025-07-10 16:23:46 +05:30
V, Varsha
98901847f1 Enabled GEMV path for BF16 GEMV operations on non-BF16 supporting machines
- Added new GEMV_AVX2 5-Loop for handling BF16 inputs, for n = 1 and m = 1 conditions.
 - Modified Re-order and Un-reorder functions to cater to default n=1 reorder conditions.
 - Added bf16 beta and store support in F32 GEMV N AVX2 and 256_512 kernels.
 - Added bf16 beta support for F32 GEMV M kernels, and modified bf16 store conditions for
   GEMV M kernels.
 -  Modified the n=1 re-order guards for reference bf16 re-order API.
 - Added an additional path in the un-reorder case for handling n=1 vector conversion

AMD-Internal: [ SWLCSG - 3602 ]
2025-07-09 19:45:40 +05:30
Bhaskar, Nallani
9b02201b5b Updated Poly 16 in Gelu Erf to double precision
Updated poly Gelu Erf precision to double to keep the error with in 1e-5 limit when compared to reference gelu_erf, which is also increased the compute to 2x compared to float.

AMD-Internal: SWLCSG-3551
2025-07-07 14:05:40 +05:30
Smyth, Edward
8a8d3f43d5 Improve consistency of optimized BLAS3 code (#64)
* Improve consistency of optimized BLAS3 code

Tidy AMD optimized GEMM and TRSM framework code to reduce
differences between different data type variants:
- Improve consistency of code indentation and white space
- Added some missing AOCL_DTL calls
- Removed some dead code
- Consistent naming of variables for function return status
- GEMM: More consistent early return when k=1
- Correct data type of literal values used for single precision data

In kernels/zen/3/bli_gemm_small.c and bli_family_*.h files:
- Set default values for thresholds if not set in the relevant
  bli_family_*.h file
- Remove unused definitions and commented out code

AMD-Internal: [CPUPL-6579]
2025-07-01 09:29:52 +01:00
Balasubramanian, Vignesh
98bc1d80e7 Support for Tiny-GEMM interface(CGEMM)
- Added the support for Tiny-CGEMM as part of the existing
  macro based Tiny-GEMM interface. This involved definining
  the appropriate AVX2/AVX512 lookup tables and functions for
  the target architectures(as per the design), for compile-time
  instantiation and runtime usage.

- Also extended the current Tiny-GEMM design to incorporate packing
  kernels as part of its lookup tables. These kernels will be queried
  through lookup functions and used in case of wanting to support
  non-trivial storage schemes(such as dot-product computation).

- This allows for a plug-and-play fashion of experimenting with
  pack and outer product method against native inner product implementations.

- Further updated the existing AVX512 pack routine that packs the A matrix
  (in blocks of 24xk). This utilizes masked loads/stores instructions to
  handle fringe cases of the input(i.e, when m < 24).

- Also added the AVX512 outer product kernels for CGEMM as part of the
  ZEN4 and ZEN5 contexts, to handle RRC and CRC storage schemes. This is
  facilitated through optional packing of A matrix in the SUP framework.

AMD-Internal: [CPUPL-6498]

Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
2025-06-30 12:14:44 +05:30
V, Varsha
8fd7060b2f Matrix Add and Matrix Mul Post-op addition in F32 AVX512_256 kernels (#50)
Added Matrix-mul and Matrix-add postops in FP32 AVX512_256 GEMV kernels

 - Matrix-add and Matrix-mul post ops in FP32 AVX512_256 GEMV m = 1 and
 n = 1 kernels has been added.

Co-authored-by: VarshaV <varshav2@amd.com>
2025-06-17 16:17:13 +05:30
S, Hari Govind
e097346658 Implemented Multithreading Support and Optimization of DGEMV API (#10)
- Implemented multithreading framework for the DGEMV API on Zen architectures. Architecture specific AOCL-dynamic logic determines the optimal number of threads for improved performance.

- The condition check for the value of beta is optimized by utilizing masked operations. The mask value is set based on value of beta, and the masked operations are applied when the vector y is loaded or scaled with beta.

AMD-Internal: [CPUPL-6746]
2025-06-17 12:39:48 +05:30
V, Varsha
875375a362 Bug Fixes in FP32 Kernels: (#41)
* Bug Fixes in FP32 Kernels:

 - The current implementation lets m=1 tiny cases inside LPGEMV_TINY loop,
 but the m=1 GEMV kernel call doesn't have the call to GEMV_M_ONE kernels.
 Added the m=1 path in LPGEMV_TINY loop by handling the pack A/Pack B/reorder B
 conditions.
- Added BF16 support for BIAS, Matrix-Add and Matrix-Mul for AVX512 F32
 main and GEMV kernels
- Added BF16 Matrix-Add and Matrix-Mul support for AVX512_256 F32 kernels.
- Modified the condition check in FP32 Zero point in AVX512 kernels, and
 fixed few bugs in Col-major Zero point evaluation.

AMD Internal: [ CPUPL - 6748 ]

* Bug Fixes in FP32 Kernels:

 - The current implementation lets m=1 tiny cases inside LPGEMV_TINY loop,
 but doesn't have the call to GEMV_M_ONE kernels. Added the m=1 path in
 LPGEMV_TINY loop by handling the pack A/Pack B/reorder B conditions.

- Added BF16 support for BIAS, Matrix-Add and Matrix-Mul for AVX512 F32
 main and GEMV kernels.

- Added BF16 Downscale, BIAS, Matrix-Add and Matrix-Mul support in AVX2 GEMV_N
 and AVX512_256 GEMV kernels.

- Added BF16 Matrix-Add and Matrix-Mul support for AVX512_256 F32 kernels.

- Modified the condition check in FP32 Zero point in AVX512 kernels, and
 fixed few bugs in Col-major Zero point evaluation and instruction usage.

AMD Internal: [ CPUPL - 6748 ]

* Bug Fixes in FP32 Kernels:

 - The current implementation lets m=1 tiny cases inside LPGEMV_TINY loop,
 but doesn't have the call to GEMV_M_ONE kernels. Added the m=1 path in
 LPGEMV_TINY loop by handling the pack A/Pack B/reorder B conditions.

- Added BF16 support for BIAS, Matrix-Add and Matrix-Mul for AVX512 F32
 main and GEMV kernels.

- Added BF16 Downscale, BIAS, Matrix-Add and Matrix-Mul support in AVX2 GEMV_N
 and AVX512_256 GEMV kernels.

- Added BF16 Matrix-Add and Matrix-Mul support for AVX512_256 F32 kernels.

- Modified the condition check in FP32 Zero point in AVX512 kernels, and
 fixed few bugs in Col-major Zero point evaluation and instruction usage.

AMD Internal: [ CPUPL - 6748 ]

* Bug Fixes in FP32 Kernels:

 - The current implementation lets m=1 tiny cases inside LPGEMV_TINY loop,
 but doesn't have the call to GEMV_M_ONE kernels. Added the m=1 path in
 LPGEMV_TINY loop by handling the pack A/Pack B/reorder B conditions.

- Added BF16 support for BIAS, Matrix-Add and Matrix-Mul for AVX512 F32
 main and GEMV kernels.

- Added BF16 Downscale, BIAS, Matrix-Add and Matrix-Mul support in AVX2 GEMV_N
 and AVX512_256 GEMV kernels.

- Added BF16 Matrix-Add and Matrix-Mul support for AVX512_256 F32 kernels.

- Modified the condition check in FP32 Zero point in AVX512 kernels, and
 fixed few bugs in Col-major Zero point evaluation and instruction usage.

AMD Internal: [ CPUPL - 6748 ]

---------

Co-authored-by: VarshaV <varshav2@amd.com>
2025-06-06 17:48:50 +05:30
Vankadari, Meghana
37efbd284e Added 6x16 and 6xlt16 main kernels for f32 using AVX512 instructions (#38)
* Implemented 6xlt8 AVX2 kernel for n<8 inputs

* Implemented fringe kernels for 6x16 and 6xlt16 AVX512 kernels for FP32

* Implemented m-fringe kernels for 6xlt8 kernel for AVX2

* Implemented m-fringe kernels for 6xlt8 kernel for AVX2

* Added the deleted kernels and fixed bias bug

AMD-Internal: SWLCSG-3556
2025-06-05 15:17:02 +05:30
V, Varsha
532eab12d3 Bug Fixes in LPGEMM for AVX512(SkyLake) machine (#24)
* Bug Fixes in LPGEMM for AVX512(SkyLake) machine

 - B-matrix in bf16bf16f32obf16/f32 API is re-ordered. For machines that
  doesn't support BF16 instructions, the BF16 input is unre-ordered and
  converted to FP32 to use FP32 kernels.

 - For n = 1 and k = 1 sized matrices, re-ordering in BF16 is copying the
  matrix to the re-ordered buffer array. But the un-reordering to FP32
  requires the matrix to have size multiple of 16 along n and multiple
  of 2 along k dimension.

 - The entry condition to the above has been modified for AVX512 configuration.

 - In bf16 API, the tiny path entry check has been modified to prevent
  seg fault while AOCL_ENABLE_INSTRUCTIONS=AVX2 is set in BF16 supporting
  machines.

 - Modified existing store instructions in FP32 AVX512 kernels to support
  execution in machines that has AVX512 support but not BF16/VNNI(SkyLake).

 - Added Bf16 beta and store types in FP32 avx512_256 kernels

AMD Internal: [SWLCSG-3552]

* Bug Fixes in LPGEMM for AVX512(SkyLake) machine

 - B-matrix in bf16bf16f32obf16/f32 API is re-ordered. For machines that
  doesn't support BF16 instructions, the BF16 input is unre-ordered and
  converted to FP32 to use FP32 kernels.

 - For n = 1 and k = 1 sized matrices, re-ordering in BF16 is copying the
  matrix to the re-ordered buffer array. But the un-reordering to FP32
  requires the matrix to have size multiple of 16 along n and multiple
  of 2 along k dimension.

 - The entry condition to the above has been modified for AVX512 configuration.

 - In bf16 API, the tiny path entry check has been modified to prevent
  seg fault while AOCL_ENABLE_INSTRUCTIONS=AVX2 is set in BF16 supporting
  machines.

 - Modified existing store instructions in FP32 AVX512 kernels to support
  execution in machines that has AVX512 support but not BF16/VNNI(SkyLake).

 - Added Bf16 beta and store types, along with BIAS and ZP in FP32 avx512_256
  kernels

AMD Internal: [SWLCSG-3552]

* Bug Fixes in LPGEMM for AVX512(SkyLake) machine

 - Support added in FP32 512_256 kerenls for : Beta, BIAS, Zero-point and
   BF16 store types for bf16bf16f32obf16 API execution in AVX2 mode.

 - B-matrix in bf16bf16f32obf16/f32 API is re-ordered. For machines that
  doesn't support BF16 instructions, the BF16 input is unre-ordered and
  converted to FP32 type to use FP32 kernels.

 - For n = 1 and k = 1 sized matrices, re-ordering in BF16 is copying the
  matrix to the re-ordered buffer array. But the un-reordering to FP32
  requires the matrix to have size multiple of 16 along n and multiple
  of 2 along k dimension. The entry condition here has been modified for
  AVX512 configuration.

 - Fix for seg fault with AOCL_ENABLE_INSTRUCTIONS=AVX2 mode in BF16/VNNI
   ISA supporting configruations:
   - BF16 tiny path entry check has been modified to take into account arch_id
     to ensure improper entry into the tiny kernel.
   - The store in BF16->FP32 col-major for m = 1 conditions were updated to
     correct storage pattern,
   - BF16 beta load macro was modified to account for data in unaligned memory.

 - Modified existing store instructions in FP32 AVX512 kernels to support
  execution in machines that has AVX512 support but not BF16/VNNI(SkyLake)

AMD Internal: [SWLCSG-3552]

---------

Co-authored-by: VarshaV <varshav2@amd.com>
2025-05-30 17:22:49 +05:30
Negi, Deepak
121d81df16 Implemented GEMV kernel for m=1 case. (#5)
* Implemented GEMV kernel for m=1 case.

Description:

- Added a new GEMV kernel for AVX2 where m=1.
- Added a new GEMV kernel for AVX512 with ymm registers where m=1.
2025-05-13 16:33:04 +05:30
Meghana Vankadari
8557e2f7b9 Implemented GEMV for n=1 case using 32 YMM registers
Details:
- This implementation is picked form cntx when GEMM is invoked on
  machines that support AVX512 instructions by forcing the
  AVX2 path using AOCL_ENABLE_INSTRUCTIONS=AVX2 during run-time.
- This implementation uses MR=16 for GEMV.

AMD-Internal: [SWLCSG-3519]
Change-Id: I8598ce6b05c3d5a96c764d96089171570fbb9e1a
2025-05-05 05:31:13 -04:00
Meghana Vankadari
21aa63eca1 Implemented AVX2 based GEMV for n=1 case.
- Added a new GEMV kernel with MR = 8 which will be used
  for cases where n=1.
- Modified GEMM and GEMV framework to choose right GEMV kernel
  based on compile-time and run-time architecture parameters. This
  had to be done since GEMV kernels are not stored-in/retrieved-from
  the cntx.
- Added a pack kernel that packs A matrix from col-major to row-major
  using AVX2 instructions.

AMD-Internal: [SWLCSG-3519]
Change-Id: Ibf7a8121d0bde37660eac58a160c5b9c9ebd2b5c
2025-05-05 08:56:22 +00:00
Meghana Vankadari
4745cf876e Implemented a new set of kernels for f32 using 32 YMM regs
Details:
- These kernels are picked from cntx when GEMM is invoked
  on machines that support AVX512 instructions by forcing the
  AVX2 path using AOCL_ENABLE_INSTRUCTIONS=AVX2 during run-time.
- This path uses the same blocksizes and pack kernels as AVX512
  path.
- GEMV is disabled currently as AVX2 kernels for GEMV are not
  implemented.

AMD-Internal: [SWLCSG-3519]
Change-Id: I75401fac48478fe99edb8e71fa44d36dd7513ae5
2025-04-23 12:02:01 +00:00
Deepak Negi
48c7452b08 Beta and Downscale support for F32 AVX-512 kernels
Description
- To enable AVX512 VNNI support without native BF16 in BF16 kernels, the
  BF16 C_type is converted to F32 for computation and then cast back to
  BF16 before storing the result.
- Added support for handling BF16 zero-point values of BF16 type.
- Added a condition to disable the tiny path for the BF16 code path
  where native BF16 is not supported.

AMD Internal : [CPUPL-6627]

Change-Id: I1e0cfefd24c5ffbcc95db73e7f5784a957c79ab9
2025-04-23 06:12:14 -05:00
Arnav Sharma
87c9230cac Bugfix: Disable A Packing for FP32 RD kernels and Post-Ops Fix
- For single-threaded configuration of BLIS, packing of A and B matrices
  are enabled by default. But, packing of A is only supported for RV
  kernels where elements from matrix A are being broadcasted. Since
  elements are being loaded in RD kernels, packing of A results in
  failures. Hence, disabled packing of matrix A for RD kernels.

- Fixed the issue where c_i index pointer was incorrectly being reset
  when exceeding MC block thus, resulting in failures for certain
  Post-Ops.

- Fixed the FP32 reoder case were for n == 1 and rs_b == 1 condition, it
  was incorrectly using sizeof(BLIS_FLOAT) instead of sizeof(float).

AMD-Internal: [SWLCSG-3497]
Change-Id: I6d18afa996c253d79f666ea9789270bb59b629dd
2025-04-18 14:31:03 +05:30
Arnav Sharma
267aae80ea Added Post-Ops Support for F32 RD Kernels
- Support for Post-Ops has been added for all F32 RD AVX512 and AVX2
  kernels.

AMD-Internal: [SWLCSG-3497]
Change-Id: Ia2967417303d8278c547957878d93c42c887109e
2025-04-11 05:25:30 -04:00
Arnav Sharma
c68c258fad Added AVX512 and AVX2 FP32 RD Kernels
- Added FP32 RD (dot-product) kernels for both, AVX512 and AVX2 ISAs.
- The FP32 AVX512 primary RD kernel has blocking of dimensions 6x64
  (MRxNR) whereas it is 6x16 (MRxNR) for the AVX2 primary RD kernel.
- Updatd f32 framework to accomodate rd kernels in case of B trans
  with thresholds
- Updated data gen python script
TODO:
    - Post-Ops not yet supported.

Change-Id: Ibf282741f58a1446321273d5b8044db993f23714
2025-04-05 20:16:51 -05:00