Commit Graph

3773 Commits

Author SHA1 Message Date
Smyth, Edward
0b9e846fee DTL logging fixes and improvements (5)
More improvements to DTL coverage and coding:
- Expand logging and tracing coverage to IxAMIN and GEMM_BATCH APIs
- Expand logging and performance states to GEMM3M APIs
- Expand logging coverage to matrix copy, transpose and add APIs
- Misc tidying of code

AMD-Internal: [CPUPL-7010]
2025-08-20 11:37:03 +01:00
Smyth, Edward
509aa07785 Standardize Zen kernel names
Naming of Zen kernels and associated files was inconsistent with BLIS
conventions for other sub-configurations and between different Zen
generations. Other anomalies existed, e.g. dgemmsup 24x column
preferred kernels names with _rv_ instead of _cv_. This patch renames
kernels and file names to address these issues.

AMD-Internal: [CPUPL-6579]
2025-08-19 18:19:51 +01:00
Sharma, Shubham
aa95a8ce4a Added Compiler flags to improve Security (#136)
Following Flags have been added.

1. D_FORTIFY_SOURCE=2
What it does
• At compile time the header files replace certain libc calls (strcpy, sprintf, …) with inline wrappers that perform a compile-time length check whenever the size of the destination buffer is known.
• At run time an extra check is executed only if the compiler could not prove the copy is safe.

Cost
• Only functions that call those specific libc routines pay anything.

2. fstack-protector-strong
What it does
• Functions that contain local arrays, address‐taken locals, or alloca get a canary word inserted into the stack frame.
• The function prologue writes the canary; the epilogue verifies it before the ret.

Cost
• 8 bytes of additional stack per protected function frame.
• Two or three extra instructions per entry/exit.

4. Wl,-z,relro
What it does
• Marks the relocation tables read-only after relocation is finished.
• No effect once the library is fully loaded.

Cost
• None at run time.

5. Wl,-z,now
What it does
• Forces the dynamic loader to resolve all external symbols in the library up-front instead of lazily on first call.

Cost
• Startup: one extra relocation pass.
• Steady-state execution: zero or slightly faster, because PLT stubs are bypassed.

Usage:
cmake -DENABLE_SECURITY_FLAGS=off
cmake -DENABLE_SECURITY_FLAGS=on
configure --enable-security-flags
configure --disable-security-flags

AMD-Internal: [CPUPL-6886]
2025-08-18 16:11:02 +05:30
Dave, Harsh
b88bea6e72 Optimize ZGEMM Packing Kernel for M-Dimension Edge Cases (cdim0 1–11) (#135)
* Optimize ZGEMM Packing Kernel for M-Dimension Edge Cases (cdim0 1–11)

- Introduced specialized AVX-512 assembly paths for cdim0 edge cases (1–11), replacing inefficient zscalv fallback.
- Refactored cdim0 == mnr condition into a switch statement to support multiple optimized cases.
- Added three new macros for column-stored packing with distinct masking patterns.
- Implemented 11 dedicated handlers for row and column stored A matrix packing
  with efficient masked loads/stores for partial data.

    AMD-Internal: [CPUPL-6677]

Co-authored-by: harsh dave <harsdave@amd.com>

* Update bli_packm_zen4_asm_z12xk.c

---------

Co-authored-by: harsh dave <harsdave@amd.com>
Co-authored-by: Sharma, Shubham <Shubham.Sharma3@amd.com>
2025-08-18 12:38:45 +05:30
Sharma, Shubham
33ea09d967 Fix ZTRSM accuracy for conjugate transpose (#133)
- For Conjugate inputs, ZTRSM small code path is less accurate than native codepath.
- Redirected the conjugate inputs to native code path on ZEN4 if TRSM preinversion is disabled.
- Tuned AOCL_DYNAMIC to handle the new inputs redirected to ZTRSM native.
2025-08-14 19:49:28 +05:30
Dave, Harsh
1b1b19486b Add packing support M edge cases in ZGEMM 12xk pack kernel (#89)
Previously, the ZGEMM implementation used `zscalv` for cases
    where the M dimension of matrix A is not in multiple of 24,
    resulting in a ~40% performance drop.

    This commit introduces a specialized edge cases in pack kernel
    to optimize performance for these cases.

    The new packing support significantly improves the performance.

    - Removed reliance on `zscalv` for edge cases, addressing the
      performance bottleneck.

    AMD-Internal: [CPUPL-6677]

Co-authored-by: harsh dave <harsdave@amd.com>
2025-08-14 14:29:03 +05:30
Sharma, Arnav
76c4872718 GEMV support for S8S8S32O32 Symmetric Quantization
Introduced support for GEMV operations with group-level symmetric quantization for the S8S8S32032 API.

Framework Changes:
- Added macro definitions and function prototypes for GEMV with symmetric quantization in lpgemm_5loop_interface_apis.h and lpgemm_kernels.h.
  - LPGEMV_M_EQ1_KERN2 for the lpgemv_m_one_s8s8s32os32_sym_quant kernel, and
  - LPGEMV_N_EQ1_KERN2 for the lpgemv_n_one_s8s8s32os32_sym_quant kernel.
- Implemented the main GEMV framework for symmetric quantization in lpgemm_s8s8s32_sym_quant.c.

Kernel Changes:
- lpgemv_m_one_s8s8s32os32_sym_quant for handling the case where M = 1 and implemented in lpgemv_m_kernel_s8_grp_amd512vnni.c.
- lpgemv_n_one_s8s8s32os32_sym_quant for handling the case where N = 1 and implemented in lpgemv_n_kernel_s8_grp_amd512vnni.c.
- Updated the buffer reordering logic for group quantization for N=1 cases in aocl_gemm_s8s8s32os32_utils.c.

Notes
- Ensure that group_size is a factor of both K (and KC when K > KC).
- The B matrix must be provided in reordered format (mtag_b == REORDERED).

AMD-Internal: [SWLCSG-3604]
2025-08-14 13:41:25 +05:30
Sharma, Shubham
3a14417ce1 DGEMV BugFixes and code cleanup (#134)
- Modified gemv (matrix-vector multiply) reference for better handling of transpose flags.
- Modified Zen4 kernel implementations for better handling of transpose flags and vector stride (incy).
- The changes refine kernel selection logic and move variable definition in macro guards.
2025-08-14 12:54:06 +05:30
S, Hari Govind
9a7bacb30c Improve numerical precision in ZGEMV API (#130)
- Replaced separate real and imaginary accumulators (real_acc, imag_acc) with a column-wise accumulator array (row_acc[2]), making accumulation and updates to the target Y vector more direct, concise, and unified.

- Leveraged AVX-512 fused multiply-add/subtract operations (_mm512_fmaddsub_pd, _mm512_fmsubadd_pd) and efficient permutations (_mm512_permute_pd) to enable accurate and efficient computation of real and imaginary components in a single instruction, while reducing code complexity for both code paths.

- Removed redundant instructions (such as unnecessary permutations and zero-register operations) and simplified the control flow.

AMD-Internal: [CPUPL-7015]
2025-08-14 11:19:51 +05:30
Dave, Harsh
fa69528a3b Bugfix: Tuned zgemm threshold for zen4 (#129)
* Bugfix: Tuned zgemm threshold for zen4

Threshold tuning that determines whether SUP or native path should
be used for given input matrix size.

This tuning forces skinny matrices to take SUP path to ensure better
performance.

* Bugfix: Tuned zgemm threshold for zen4 and zen5

Threshold tuning that determines whether SUP or native path should
be used for given input matrix size.

This tuning forces skinny matrices to take SUP path to ensure better
performance.

---------

Co-authored-by: harsdave <harsdave@amd.com>
2025-08-13 19:02:39 +05:30
Smyth, Edward
da875888d7 DTL logging fixes and improvements (4)
More improvements to DTL coverage and coding:
- Removed some DTL overheads from performance stats timing for all APIs
  where it is currently implemented (i.e. gemm, gemmt, trsm, nrm2)
- Expand logging coverage to gemm pack and compute APIs, including
  performance stats for gemm_compute
- Expand logging coverage to rot, rotg, rotm and rotmg APIs
- Tidied order of function prototypes in aocl_dtl/aocldtl_blis.h

AMD-Internal: [CPUPL-7010]
2025-08-13 11:12:16 +01:00
Sharma, Shubham
b7b9d3ec53 Exported AVX512 DGEMV kernels (#131)
- Exported DGEMV AVX512 kernels so that they can be directly called by libflame to avoid blis and omp overhead.
2025-08-12 17:17:41 +05:30
Smyth, Edward
021f6bc960 GEMMTR full set of APIs
Commit eaa76dfe28 added LAPACK 3.12 GEMMTR
interfaces as aliases to existing BLIS GEMMT. Here we add full set of
Fortran upper case and no underscore API aliases and _blis_impl variants.

AMD-Internal: [CPUPL-6581]
2025-08-12 10:24:24 +01:00
Smyth, Edward
dc06cdb621 DTL logging fixes and improvements (3)
More improvements to DTL coverage and coding:
- Expand logging coverage to banded matrix APIs in frame/compat/f2c
- Expand logging coverage to packed matrix APIs in frame/compat/f2c
- Commit b8aa5c2894 was wrong to
  remove calls to AOCL_DTL_INITIALIZE for APIs where bli_init_auto()
  is not called. AOCL_DTL_INITIALIZE is essential when logging is
  enabled but tracing is not, otherwise the ICV gbIsLoggingEnabled
  will not be initialized based on logging status and remain as
  the default FALSE value.

AMD-Internal: [CPUPL-7010]
2025-08-12 09:42:40 +01:00
Sharma, Shubham
b0a4914417 Added DGEMV no transpose multithreaded Implementations (#12)
* Added DGEMV no transpose multithreaded Implementations
- Added new avx512 M and N kernels for DGEMV.
- Added multiple MT implementations for same kernels.
- Added AOCL_dynamic logic for L2 apis.
- Tuned AOCL_dynamic and code path selection for DGEMV on ZEN5.
- Added same kernels for SGEMV, but these kernels are not enabled yet.
- Added SGEMV reference kernel.

AMD-Internal: [SWLCSG-3408]

Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
2025-08-12 10:39:12 +05:30
Vlachopoulou, Eleni
1f8a7d2218 Renaming CMAKE_SOURCE_DIR to PROJECT_SOURCE_DIR so that BLIS can be built properly via FetchContent() (#65) 2025-08-07 15:51:59 +01:00
Smyth, Edward
563b161933 Standardize Python files to use Python 3
Python 2 is no longer maintained, and using python3 avoids accidental invocation of outdated interpreters.

AMD-Internal: [CPUPL-6579]
2025-08-06 12:04:26 +01:00
Bhaskar, Nallani
9d571bb5d3 Fixed few Coverity warnings in aocl gemm addon
Fixed few Coverity warnings in aocl gemm addon 


AMD-Internal: CPUPL-6913
2025-08-06 15:37:40 +05:30
Smyth, Edward
efdb5a06df Changes to default choices of sub-configuration
amdzen and x86_64 configuration family: On Intel processors supporting
AVX-512, the zen4 sub-configuration was dispatched by default, as even
though it is not optimized specifically for Intel processors, it includes
a range of additional optimizations than are present in the the older skx
sub-configuration. However, the zen4 data path is 256 bit, thus this
sub-configuration uses a mixture of AVX2 and AVX-512 kernels. Now that
zen5 sub-configuration is available, with more extensive use of AVX-512
kernels, switch to use this by default on relevant Intel processors.

intel64 configuration family: On AMD processors supporting AVX-512 or
AVX2, the generic sub-configuation was dispatched by default. Change to
dispatch skx or haswell sub-configuation, based on the available ISA
support.

AMD-Internal: [CPUPL-6743]
2025-08-06 07:56:53 +01:00
Sharma, Shubham
6db8639284 Fix coverity issue in ZTRSM kernels (#112)
Static analysis issues in ZTRSM (triangular solve with matrix) kernels for Zen5 architecture by initializing variables to prevent potential use of uninitialized values.
Initialize loop variables i, j, and k_iter to 0 to prevent potential uninitialized access
Initialize mask variables and remainder variables to 0 across multiple kernel functions
2025-08-05 15:00:59 +05:30
Smyth, Edward
09592bf4f3 Windows TLS with Microsoft compiler
We currently use clang compiler on Windows, so no problem with current
builds. However, if we wanted to use Microsoft compiler, add different
definition of BLIS_THREAD_LOCAL as __declspec(thread)

AMD-Internal: [CPUPL-6958]
2025-08-04 15:09:25 +01:00
Smyth, Edward
b8aa5c2894 DTL logging fixes and improvements (2)
More improvements to DTL coverage and coding:
- Tidy functions and prototypes in aocl_dtl/aocldtl_blis.{c,h} into
  alphabetical groups within different BLAS categories.
- Expand tracing coverage to APIs in frame/compat/f2c
- Remove calls to AOCL_DTL_INITIALIZE (added in
  c56dcb6ffb) as DTL_Trace calls
  bli_init_auto which will call AOCL_DTL_INITIALIZE

AMD-Internal: [CPUPL-7010]
2025-08-01 15:27:35 +01:00
V, Varsha
68d47281df Fixing some copying bugs in Batch-Matmul code
- Removed duplicate calls to BATCH_GEMM_CHECK().
 - Refactored freeing of post-op pointer in bench code and verified the
    functionality.
 - Modified indexing of the array to take the correct values.
2025-08-01 18:42:10 +05:30
Balasubramanian, Vignesh
c96e7eb197 Threshold tuning for code-paths and optimal thread selection for ZGEMM(ZEN5)
- Updated the thresholds to enter the AVX512 SUP codepath in
  ZGEMM(on ZEN5). This caters to inputs that scale well with
  multithreaded-execution(in the SUP path).

- Also updated the thresholds to decide ideal threads, based on
  'm', 'n' and 'k' values. The thread-setting logic involves
  determining the number of tiles for computation, and using them
  to further tune for the optimal number of threads.

- This logic builds over the assumption that the current thread
  factorization logic is optimal. Thus, an additional data analysis
  was performed(on the existing ZEN4 and the new ZEN5 thresholds),
  to also cover the corner cases, where this assumption doesn't hold
  true.

- As part of the future work, we could reimplement the thread
  factorization for GEMM, which would additionally require a new
  set of threshold tuning for every datatype.

AMD-Internal: [CPUPL-7028]

Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
AOCL-Aug2025-b1
2025-08-01 16:02:12 +05:30
Rayan, Rohan
1bb1160061 Fixing bench overflow in GFLOPS computation
Fixing a bug in some bench applications where GFLOPS computation ran into integer overflows because explicit type casting to double was not done in the computation

removing all multiplies by 1.0 during GFLOP computation

AMD-Internal: CPUPL-7016

---------
Co-authored-by: Rayan <rohrayan@amd.com>
2025-08-01 11:08:53 +05:30
Rayan, Rohan
a26f0a93f9 Removing some unnecessary branches in CGEMM's aocl-dynamic logic.
Code cleanup: Removing some redundant if-else code in the CGEMM
aocl-dynamic logic. This should ensure that multiple branching is
avoided, while preserving existing heuristics.

AMD Internal: [CPUPL - 6579]

Co-authored-by: Rayan <rohrayan@amd.com>
2025-08-01 10:31:45 +05:30
Chandrashekara K R
2ff3e02309 Add platform-specific alignment macro for rntm_s struct
Introduced BLIS_ATTRIB_ALIGN to standardize 64-byte alignment across platforms.
On Windows, alignment is enabled only for Clang and disabled for other compilers.
Replaced direct usage of __attribute__((aligned(64))) in rntm_s with the macro.
2025-07-30 21:15:44 +05:30
Smyth, Edward
e3dcc15a80 Add configure and CMake options to enable DTL logging and tracing (#86)
Instead of editing a header file, add options to build systems to allow
DTL tracing and/or logging output to be generated. For most users
logging is recommended, producing a line of output per application
thread of every BLAS call made. Tracing provides more detailed info
of internal BLIS calls, and is aimed more at expert users and BLIS
developers. Different tracing levels from 1 to 10 provide control of
the granularity of information produced. The default level is 5. Note
that tracing, especially at higher tracing levels, will impose a
significant runtime cost overhead.

Example usage:

Using configure:

  ./configure ... --enable-aocl-dtl=log amdzen

  ./configure ... --enable-aocl-dtl=trace --aocl-dtl-trace-level=6 amdzen

  ./configure ... --enable-aocl-dtl=all amdzen

Using CMake:

  cmake ... -DENABLE_AOCL_DTL=LOG

  cmake ... -DENABLE_AOCL_DTL=TRACE -DAOCL_DTL_TRACE_LEVEL=6

  cmake ... -DENABLE_AOCL_DTL=ALL

Also, modify function AOCL_get_requested_threads_count to correct
reported thread count in cases where internal value is recorded as -1

AMD-Internal: [CPUPL-7010]
2025-07-28 15:24:10 +01:00
Bhaskar, Nallani
46aac600ec Added f32 kernels without post-ops to avoid overhead
Description:

1. Crated f32 intrinsic kernels without post-ops support f32 gemm
   without post-ops optimally.
2. Initiated the no post-ops kernels from main kernel when post-ops
   hander has no post-ops to do.
3. The kernels are redundant but added to get the best perf
   for pure GEMM call.

AMD-Internal : SWLCSG-3692
2025-07-25 23:14:23 +05:30
Smyth, Edward
c56dcb6ffb DTL logging fixes and improvements
The environment variable AOCL_VERBOSE was inconsistent in its
behaviour, sometimes producing a single line of output per file from
multiple BLAS calls, when it should be all or nothing. Note that:
- AOCL_VERBOSE is only active when DTL logging has been enabled at
  compile time. Otherwise, this environment variable is not read.
- When logging is enable at compile time, logging output is produced
  by default. Thus AOCL_VERBOSE is more of use to turn output off,
  rather than on.
- For production runs without logging, it is recommended to recompile
  with DTL disabled, as this minimizes overheads within the BLIS code.
- AOCL_VERBOSE should be set to 0 or 1, and not values such as FALSE
  or TRUE.

Changes to improve consistency when AOCL_VERBOSE is set:
- Change DTL variables from Bool (unsigned char) datatype to bool, as
  used elsewhere in BLIS.
- Ensure bli_init_auto() is called before AOCL_DTL_TRACE_ENTRY() and
  AOCL_DTL_LOG_*_INPUTS(), as bli_init_auto calls AOCL_DTL_INITIALIZE()
- In APIs which avoid calling bli_init_auto(), add explicit calls to
  AOCL_DTL_INITIALIZE(). Also, make a proper comment about not calling
  bli_init_auto(), rather than just commenting out call, which looks like
  dead code.

Other DTL logging control changes:
- Make gbIsLoggingEnabled ICV thread local as this can be updated by
  calls to AOCL_DTL_Enable_Logs and AOCL_DTL_Disable_Logs APIs
- After recent changes to hide some internal BLIS definitions behind
  ifdef BLIS_IS_BUILDING_LIBRARY guard, change BLIS_THREAD_LOCAL
  definition to be exported again.

Logging output changes:
- Standardize printing of datatype to be lower case.
- Don't force printing of GEMM transa and transb to upper case, instead
  print in the case provided by the application code.
- Add logging output to all variants (in terms of AMD/non-AMD optimized
  and datatype) of SWAP and SCAL.

AMD-Internal: [CPUPL-7010]
2025-07-25 11:27:00 +01:00
S, Hari Govind
273a05f0bd Fix for performance regression caused by non-unit stride y in DGEMV API (#91)
- Temperory fix for regression in DGEMV for non-unit stride y inputs. The code
  section responsible for handling non-unit stride y has been removed from the
  frame.

- The kernel code is extended with if condition to handle both unit and non-unit
  stride y.

AMD-Internal: [CPUPL-6869]
AOCL-Weekly-250725
2025-07-25 10:57:57 +05:30
Balasubramanian, Vignesh
93414f56c8 Bugfix : Guarded AOCL_ENABLE_INSTRUCTONS support based on AVX512-ISA support
- As part of rerouting to AVX2 code-paths on ZEN4/ZEN5(or similar)
  architectures, the code-base established a contingency when
  deploying fat binary on ZEN/ZEN2/ZEN3 systems. Due to this,
  it was required that we always set AOCL_ENABLE_INSTRUCTIONS to
  'ZEN3'(or similar values) to make sure we don't run AVX512
  code on such architectures. This issue existed on FP32 and BF16
  APIs.

- Added checks to detect the AVX512-ISA support to enable rerouting
  based on AOCL_ENABLE_INSTRUCTIONS. This removes the incorrect
  constraint that was put forth.

AMD-Internal: [CPUPL-7020]

Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
2025-07-24 12:20:05 +05:30
V, Varsha
8a86620753 Bug Fix in INT8 reference un-reorder API
- For int8/uint8 reorder function, the k dimension is made multiple of 4 to
 meet the alignment requirements.
 - Modified the logic to update the k_updated to use multiples of 4.

[AMD - Internal : SWLCSG - 3686 ]
2025-07-24 11:26:49 +05:30
Smyth, Edward
4bc5287f72 Support applications using Intel icc and icx compilers on Windows (#82)
The blis.h header file includes a lot of BLIS internal definitions. Some of these caused problems
when using a BLIS library compiled with clang on Windows from an applications compiled with
the Intel icc and icx compilers. Workaround is to use "#ifdef BLIS_IS_BUILDING_LIBRARY" to
guard these definitions from being exposed to applications including blis.h. (The BLIS configure
and cmake builds systems automatically define BLIS_IS_BUILDING_LIBRARY only for compiling
the BLIS library.)

This patch implements the minimum changes to resolve the issue. Longer term, similar changes
may need to be added around all BLIS internal definitions in blis.h.

AMD-Internal: [CPUPL-6953]
2025-07-22 10:22:45 +01:00
V, Varsha
9e8c9e2764 Fixed compiler warnings in LPGEMM
- Modified the correct variables to be passed for the batch_gemm_thread_decorator() for
 u8s8s32os32 API.
 - Removed commented lines in f32 GEMV_M kernels.
 - Modified some instructions in F32 GEMV M and N Kernels to re-use the existing macros.
 - Re-aligned the BIAS macro in the macro definition file.

[ AMD - Internal : CPUPL - 7013 ]
2025-07-18 16:15:52 +05:30
V, Varsha
2f54bc1e14 Added F32 reference Unreorder function
- Implemeneted unpackb_f32f32f32of32_reference function.
 - Modified const pointer declaration in aocl_reorder_reference() to avoid compiler warnings.

[AMD-Internal: SWLCSG-3618 ]
2025-07-18 14:52:03 +05:30
Sharma, Shubham
355018e739 Fixed Extra reads in DTRSM small kernels.
In DTRSM small code path lower triangular kernels, extra data from upper triangular region is being read.
To fix this, new macros have been added to make sure only relevant data is read.
AMD-Internal: [SWLCSG-3611]
2025-07-17 10:17:13 +05:30
Bhaskar, Nallani
76c08fe81d Implemented f32 reference reorder function
Implemented aocl_reorder_f32f32f32of32_reference( ) function and tested.

Implemented framework changes required and place holder for kernels for aocl_unreorder_f32f32f32of32_reference( ) function. It is not tested completely and will be taken care in subsequent commits.

[AMD-Internal: SWLCSG-3618 ]
2025-07-15 12:26:05 +05:30
V, Varsha
837d3974d4 Bug Fixes for GEMV AVX2 BF16 to F32 path
- Added the correct strides to be used while unreorder/convert B matrix in m=1 cases.
 - Modified Zero point vector loads to proper instructions.
 - Modified bf16 store in AVX2 GEMV M kenrel

AMD Internal - [SWLCSG - 3602 ]
AOCL-Weekly-100725
2025-07-10 16:23:46 +05:30
Balasubramanian, Vignesh
ab4bb2f1e8 Threshold tuning for code-paths and optimal thread selection for ZGEMM(ZEN4)
- Updated the thresholds to enter the AVX512 Tiny and SUP codepaths
  for ZGEMM(on ZEN4). This caters to inputs that perform well on
  a single-threaded execution(in the Tiny-path), and inputs that
  scale well with multithreaded-execution(in the SUP path).

- Also updated the thresholds to decide ideal threads, based on
  'm', 'n' and 'k' values. The thread-setting logic involves
  determining the number of tiles for computation, and using them
  to further tune for the optimal number of threads.

AMD-Internal: [CPUPL-6378][CPUPL-6661]

Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
2025-07-10 15:35:22 +05:30
V, Varsha
98901847f1 Enabled GEMV path for BF16 GEMV operations on non-BF16 supporting machines
- Added new GEMV_AVX2 5-Loop for handling BF16 inputs, for n = 1 and m = 1 conditions.
 - Modified Re-order and Un-reorder functions to cater to default n=1 reorder conditions.
 - Added bf16 beta and store support in F32 GEMV N AVX2 and 256_512 kernels.
 - Added bf16 beta support for F32 GEMV M kernels, and modified bf16 store conditions for
   GEMV M kernels.
 -  Modified the n=1 re-order guards for reference bf16 re-order API.
 - Added an additional path in the un-reorder case for handling n=1 vector conversion

AMD-Internal: [ SWLCSG - 3602 ]
2025-07-09 19:45:40 +05:30
Smyth, Edward
969ceb7413 Finer control of code path options (#67)
Add macros to allow specific code options to be enabled or disabled,
controlled by options to configure and cmake. This expands on the
existing GEMM and/or TRSM functionality to enable/disable SUP handling
and replaces the hard coded #define in include files to enable small matrix
paths.

All options are enabled by default for all BLIS sub-configs but many of them
are currently only implemented in AMD specific framework code variants.

AMD-Internal: [CPUPL-6906]
---------

Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
2025-07-08 10:59:23 +01:00
Bhaskar, Nallani
9b02201b5b Updated Poly 16 in Gelu Erf to double precision
Updated poly Gelu Erf precision to double to keep the error with in 1e-5 limit when compared to reference gelu_erf, which is also increased the compute to 2x compared to float.

AMD-Internal: SWLCSG-3551
2025-07-07 14:05:40 +05:30
Balasubramanian, Vignesh
c0d33879ec Bugfix : Integer typecast inside CGEMM AVX512 24xk packing kernel (#68)
- When building the library with LP64 configuration, it is expected
  that we typecast integers to 64-bit internally, before loading them
  onto 64-bit GPRs. This ensures that the upper 32-bit lane is zeroed
  out, to avoid any possible junk values. The current change enforces
  this typecast inside the 24xk packing kernel for CGEMM(AVX512), which
  was missing before.

AMD-Internal: [CPUPL-6907]

Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
2025-07-01 17:05:54 +05:30
Dave, Harsh
7c6c04a457 More optimizations in 6x8m DGEMM SUP Kernel using prefetching (#34)
* Enhance Prefetching in 6x8m DGEMM SUP Kernel for Improved Performance

This update optimizes the DGEMM kernel by implementing well suited prefetching techniques.

Key changes include:

- **Prefetching Strategy**:
  - Introduced prefetching instructions to load matrix data into cache ahead of computation.
  - Prefetching for matrix A is based on the k-loop, starting from columns close to the ones being loaded and computed.
  - Prefetching for matrix B follows a similar approach, focusing on rows close to the ones being loaded and computed.

- **Unrolling Optimization**:
  - Increased the unroll factor of the k-loop from 4 to 8, allowing for more efficient prefetching of matrices A and B.
  - This adjustment enhances data locality and reduces the overhead associated with loop control.

- **Performance Improvements**:
  - Reduced memory access latency by ensuring data is preloaded into cache.
  - Enhanced computational throughput by minimizing stalls due to memory access delays.
  - Improved overall efficiency of matrix multiplication operations.

These enhancements lead to faster DGEMM computations, leveraging improved cache utilization and loop unrolling to boost overall performance.

AMD-Internal: [CPUPL-6435]

* added unroll K by 4 along with unroll K by 8

* Added descriptive comments explaining prefetch strategy

* Added descriptive comments explaining prefetch strategy

* More optimizations in 6x8m DGEMM SUP Kernel using prefetching

- Restructured main loop with 8× and 4× unrolling (k_iter_8, k_iter_4, k_left) for deeper pipeline utilization.
- Introduced forward prefetching for A and future B rows to better align with unrolled access patterns.
- Interleaved alpha scaling with FMA for computation of alpha*AB + C more efficiently.

These enhancements lead to faster DGEMM computations, leveraging improved cache utilization and loop unrolling
to boost overall performance.

AMD-Internal: [CPUPL-6435]

* Enhance Prefetching in 6x8m DGEMM SUP Kernel for Improved Performance

This update optimizes the DGEMM kernel by implementing well suited prefetching techniques.

Key changes include:

- **Prefetching Strategy**:
  - Introduced prefetching instructions to load matrix data into cache ahead of computation.
  - Prefetching for matrix A is based on the k-loop, starting from columns close to the ones being loaded and computed.
  - Prefetching for matrix B follows a similar approach, focusing on rows close to the ones being loaded and computed.

- **Unrolling Optimization**:
  - Increased the unroll factor of the k-loop from 4 to 8, allowing for more efficient prefetching of matrices A and B.
  - This adjustment enhances data locality and reduces the overhead associated with loop control.

- **Performance Improvements**:
  - Reduced memory access latency by ensuring data is preloaded into cache.
  - Enhanced computational throughput by minimizing stalls due to memory access delays.
  - Improved overall efficiency of matrix multiplication operations.

These enhancements lead to faster DGEMM computations, leveraging improved cache utilization and loop unrolling to boost overall performance.

AMD-Internal: [CPUPL-6435]

* added unroll K by 4 along with unroll K by 8

* Added descriptive comments explaining prefetch strategy

* More optimizations in 6x8m DGEMM SUP Kernel using prefetching

- Restructured main loop with 8× and 4× unrolling (k_iter_8, k_iter_4, k_left) for deeper pipeline utilization.
- Introduced forward prefetching for A and future B rows to better align with unrolled access patterns.
- Interleaved alpha scaling with FMA for computation of alpha*AB + C more efficiently.

These enhancements lead to faster DGEMM computations, leveraging improved cache utilization and loop unrolling
to boost overall performance.

AMD-Internal: [CPUPL-6435]

---------

Co-authored-by: Harsh Dave <harsdave@amd.com>
Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
2025-07-01 15:02:50 +05:30
Smyth, Edward
8a8d3f43d5 Improve consistency of optimized BLAS3 code (#64)
* Improve consistency of optimized BLAS3 code

Tidy AMD optimized GEMM and TRSM framework code to reduce
differences between different data type variants:
- Improve consistency of code indentation and white space
- Added some missing AOCL_DTL calls
- Removed some dead code
- Consistent naming of variables for function return status
- GEMM: More consistent early return when k=1
- Correct data type of literal values used for single precision data

In kernels/zen/3/bli_gemm_small.c and bli_family_*.h files:
- Set default values for thresholds if not set in the relevant
  bli_family_*.h file
- Remove unused definitions and commented out code

AMD-Internal: [CPUPL-6579]
2025-07-01 09:29:52 +01:00
Balasubramanian, Vignesh
98bc1d80e7 Support for Tiny-GEMM interface(CGEMM)
- Added the support for Tiny-CGEMM as part of the existing
  macro based Tiny-GEMM interface. This involved definining
  the appropriate AVX2/AVX512 lookup tables and functions for
  the target architectures(as per the design), for compile-time
  instantiation and runtime usage.

- Also extended the current Tiny-GEMM design to incorporate packing
  kernels as part of its lookup tables. These kernels will be queried
  through lookup functions and used in case of wanting to support
  non-trivial storage schemes(such as dot-product computation).

- This allows for a plug-and-play fashion of experimenting with
  pack and outer product method against native inner product implementations.

- Further updated the existing AVX512 pack routine that packs the A matrix
  (in blocks of 24xk). This utilizes masked loads/stores instructions to
  handle fringe cases of the input(i.e, when m < 24).

- Also added the AVX512 outer product kernels for CGEMM as part of the
  ZEN4 and ZEN5 contexts, to handle RRC and CRC storage schemes. This is
  facilitated through optional packing of A matrix in the SUP framework.

AMD-Internal: [CPUPL-6498]

Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
AOCL-Jul2025-b1
2025-06-30 12:14:44 +05:30
V, Varsha
1f9d1a85d3 Updated aocl_batch_gemm_ APIs aligning to CBLAS batch API. (#58)
* Updated aocl_batch_gemm_ APIs aligning to CBLAS batch API.

 - Modified Batch-Gemm API to align with cblas_?gemm_batch_ API,
 and added a parameter group_size to the existing APIs.
 - Updated bench batch_gemm code to align to the new API definition.
 - Modified the hardcoded number in lpgemm_postop file.
 - Added necessary early return condition to account for group_count/group_size < 0.

AMD-Internal: [ SWLCSG - 3592 ]
2025-06-30 11:16:04 +05:30
Sharma, Arnav
5193433141 Disable GCC 11.4 tree loop optimization for AVX512 F32 Sigmoid Post-Op (#63)
- Disabled tree loop optimizations for all AVX512 F32 fringe kernels
  when compiled with GCC 11.4 to address numerical inaccuracies in
  Sigmoid post-op cause by aggressive loop optimizations.

- The fix uses function-level GCC attribute
  __attribute__((optimize("no-tree-loop-optimize"))) to selectively
  disable tree loop optimizations only for the affected kernels based on
  GCC version check.

AMD-Internal: [SWLCSG-3559]
2025-06-26 16:31:55 +05:30
V, Varsha
e05d24315e Bug Fixes for Accuracy issues in Int8 API (#62)
- In U8 GEMV n=1 kernels, the default zp condition was S8 ZP type,
 which leads to accuracy issues which u8s8s32u8 API is used.
 - Few modifications in bench code to take the correct path for
 accuracy check.
2025-06-25 17:01:22 +05:30