Commit Graph

3728 Commits

Author SHA1 Message Date
Smyth, Edward
8a8d3f43d5 Improve consistency of optimized BLAS3 code (#64)
* Improve consistency of optimized BLAS3 code

Tidy AMD optimized GEMM and TRSM framework code to reduce
differences between different data type variants:
- Improve consistency of code indentation and white space
- Added some missing AOCL_DTL calls
- Removed some dead code
- Consistent naming of variables for function return status
- GEMM: More consistent early return when k=1
- Correct data type of literal values used for single precision data

In kernels/zen/3/bli_gemm_small.c and bli_family_*.h files:
- Set default values for thresholds if not set in the relevant
  bli_family_*.h file
- Remove unused definitions and commented out code

AMD-Internal: [CPUPL-6579]
2025-07-01 09:29:52 +01:00
Balasubramanian, Vignesh
98bc1d80e7 Support for Tiny-GEMM interface(CGEMM)
- Added support for Tiny-CGEMM as part of the existing
  macro-based Tiny-GEMM interface. This involved defining
  the appropriate AVX2/AVX512 lookup tables and functions for
  the target architectures (as per the design), for compile-time
  instantiation and runtime usage.

- Also extended the current Tiny-GEMM design to incorporate packing
  kernels as part of its lookup tables. These kernels are queried
  through lookup functions and used when non-trivial storage
  schemes (such as dot-product computation) need to be supported.

- This allows for a plug-and-play fashion of experimenting with
  pack and outer product method against native inner product implementations.

- Further updated the existing AVX512 pack routine that packs the A matrix
  (in blocks of 24xk). This utilizes masked load/store instructions to
  handle fringe cases of the input (i.e., when m < 24).

- Also added the AVX512 outer product kernels for CGEMM as part of the
  ZEN4 and ZEN5 contexts, to handle RRC and CRC storage schemes. This is
  facilitated through optional packing of A matrix in the SUP framework.
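
The lookup-table design described above can be sketched roughly as follows; all type, function and enum names here are hypothetical stand-ins, not the actual AOCL-BLAS symbols:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of a Tiny-GEMM lookup table that also carries
   packing kernels, keyed by target architecture. Illustrative only. */

typedef enum { ARCH_AVX2 = 0, ARCH_AVX512 = 1, NUM_ARCH = 2 } arch_t;

typedef int (*cgemm_tiny_ker_ft)(int m, int n, int k);
typedef int (*cgemm_pack_ker_ft)(int m, int k);

/* Stand-in kernels: return an arch tag so dispatch is observable. */
static int cgemm_tiny_avx2(int m, int n, int k)   { (void)m; (void)n; (void)k; return 2; }
static int cgemm_tiny_avx512(int m, int n, int k) { (void)m; (void)n; (void)k; return 512; }
static int cpack_avx2(int m, int k)   { (void)m; (void)k; return 2; }
static int cpack_avx512(int m, int k) { (void)m; (void)k; return 512; }

/* Compute and pack kernels live side by side, so a caller can query
   a matching pack routine for non-trivial storage schemes. */
static const cgemm_tiny_ker_ft tiny_lut[NUM_ARCH] = { cgemm_tiny_avx2, cgemm_tiny_avx512 };
static const cgemm_pack_ker_ft pack_lut[NUM_ARCH] = { cpack_avx2, cpack_avx512 };

cgemm_tiny_ker_ft lookup_tiny(arch_t a) { return tiny_lut[a]; }
cgemm_pack_ker_ft lookup_pack(arch_t a) { return pack_lut[a]; }
```

Keeping pack kernels in the same tables as the compute kernels is what enables the plug-and-play experimentation mentioned above.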

AMD-Internal: [CPUPL-6498]

Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
AOCL-Jul2025-b1
2025-06-30 12:14:44 +05:30
V, Varsha
1f9d1a85d3 Updated aocl_batch_gemm_ APIs aligning to CBLAS batch API. (#58)
* Updated aocl_batch_gemm_ APIs aligning to CBLAS batch API.

 - Modified Batch-Gemm API to align with cblas_?gemm_batch_ API,
 and added a parameter group_size to the existing APIs.
 - Updated bench batch_gemm code to align to the new API definition.
 - Modified the hardcoded number in lpgemm_postop file.
 - Added necessary early return condition to account for group_count/group_size < 0.

AMD-Internal: [SWLCSG-3592]
2025-06-30 11:16:04 +05:30
Sharma, Arnav
5193433141 Disable GCC 11.4 tree loop optimization for AVX512 F32 Sigmoid Post-Op (#63)
- Disabled tree loop optimizations for all AVX512 F32 fringe kernels
  when compiled with GCC 11.4 to address numerical inaccuracies in
  the Sigmoid post-op caused by aggressive loop optimizations.

- The fix uses function-level GCC attribute
  __attribute__((optimize("no-tree-loop-optimize"))) to selectively
  disable tree loop optimizations only for the affected kernels based on
  GCC version check.
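
A minimal sketch of the version-gated attribute described above; the macro name and the stand-in post-op loop (softsign instead of sigmoid, to keep the sketch libm-free) are illustrative, not the actual kernel code:

```c
#include <assert.h>

/* Apply the function-level optimize attribute only on GCC 11.4,
   where the miscompare was observed. Macro name is hypothetical. */
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ == 11 && __GNUC_MINOR__ == 4
#define LPGEMM_NO_TREE_LOOP_OPT __attribute__((optimize("no-tree-loop-optimize")))
#else
#define LPGEMM_NO_TREE_LOOP_OPT
#endif

LPGEMM_NO_TREE_LOOP_OPT
void postop_f32(float *x, int n)
{
    /* Stand-in for the sigmoid post-op loop (softsign used here so
       the sketch needs no math library). */
    for (int i = 0; i < n; i++)
        x[i] = x[i] / (1.0f + (x[i] < 0.0f ? -x[i] : x[i]));
}
```

The attribute is a no-op on other compilers/versions, so only the affected builds lose the optimization.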

AMD-Internal: [SWLCSG-3559]
2025-06-26 16:31:55 +05:30
V, Varsha
e05d24315e Bug Fixes for Accuracy issues in Int8 API (#62)
- In U8 GEMV n=1 kernels, the default zp condition assumed the S8 ZP type,
 which leads to accuracy issues when the u8s8s32u8 API is used.
 - A few modifications in bench code to take the correct path for
 the accuracy check.
2025-06-25 17:01:22 +05:30
Smyth, Edward
30c42202d7 GCC 15 SUP kernel workaround (#35)
GCC 15 fails to compile some SUP kernels. The problem seems to be
related to one of the optimization phases enabled at -O2 or above.
The workaround is to disable this specific optimization by adding the
flag -fno-tree-slp-vectorize to CKOPTFLAGS.

AMD-Internal: [CPUPL-6579]

Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
2025-06-25 11:01:34 +01:00
Balasubramanian, Vignesh
15c44a6f8c Adding dynamic thread-setting logic for CGEMM(AOCL_DYNAMIC) (#48)
- Added a set of thresholds (based on input dimensions) that
  determine and set the ideal number of threads to be used
  for CGEMM (on ZEN4 and ZEN5 architectures).

- The thread-setting logic is as follows:
    - The underlying kernels (single-threaded) work on blocks
      of MRxk of A, kxNR of B and MRxNR of C. Thus, it is
      initially assumed that the optimal number of threads is
      ceil(m/MR)*ceil(n/NR). This is an upper bound on the
      ideal thread count.

    - The actual ideal thread count could be lower than this
      upper bound, based on the work that every thread receives.
      This is mainly determined by the value of 'k'.

    - If 'k' is small, the arithmetic intensity (AI) is low and
      memory bandwidth becomes the limiting factor, thus favoring
      smaller thread counts. In contrast, if 'k' is high, the AI
      is high and the workload scales well with higher thread counts.

    - So, we limit the number of threads when 'k' is small to avoid
      bandwidth contention. Using fewer threads ensures each thread
      gets more bandwidth, improving efficiency. In contrast, we allow
      more threads when 'k' is large, as the computation becomes more
      compute-bound and less limited by memory bandwidth, thereby
      benefiting from a higher thread count.

- The new logic will now set the upper bound for the optimal number of threads
  (based on the number of tiles), and then further reduce it based on the values
  of 'm', 'n' and 'k'. This comes under the 'AOCL_DYNAMIC' feature for CGEMM,
  specifically for ZEN4 and ZEN5 architectures.
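
The heuristic described above can be sketched as follows; the MR/NR values and the k thresholds are made-up placeholders, not the tuned production values:

```c
#include <assert.h>

/* Illustrative AOCL_DYNAMIC-style heuristic for CGEMM thread counts.
   MR/NR and the k cutoffs below are hypothetical, not the tuned ones. */
enum { MR = 3, NR = 8 };

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

int cgemm_ideal_nt(int m, int n, int k, int nt_max)
{
    /* Upper bound: one thread per MRxNR tile of C. */
    int nt = ceil_div(m, MR) * ceil_div(n, NR);
    if (nt > nt_max) nt = nt_max;

    /* Small k => low arithmetic intensity: cap threads to avoid
       memory-bandwidth contention. Thresholds are placeholders. */
    if (k < 64 && nt > 4)        nt = 4;
    else if (k < 256 && nt > 16) nt = 16;
    return nt;
}
```

The tile-count bound dominates for small m/n, while the k-based caps kick in only when there are many tiles but little work per tile.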

AMD-Internal: [CPUPL-6498]

Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
2025-06-25 10:05:40 +05:30
Vankadari, Meghana
c81408c805 Modified reorder and pack code in sym quant API (#59)
Details:
- In s8 APIs with symmetric quantization, existing kernels are
  reused to avoid duplication of reorder code.
- Since the existing kernels are designed assuming that the entire
  KCxNC block is packed at once, to handle grouping in symmetric
  quantization we have to add the JR and group loops outside the
  call to the existing packB function.
- Though this was being done before, the cases where n_rem < 64
  were not handled properly.
- Modified the reorder and pack code to first divide the n_fringe part
  into a multiples-of-16 part and an n_lt_16 part, and then call the
  pack kernel twice to handle both parts separately.
- All the strides used to access the reordered/packed buffer are updated
  accordingly.
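
The two-part fringe split can be sketched as below; pack_b_16xk is a hypothetical stand-in that only records what it was asked to pack:

```c
#include <assert.h>

/* Counters let us observe what the stand-in pack kernel was given. */
static int packed_cols = 0;
static int pack_calls  = 0;

/* Hypothetical stand-in for the real 16-wide packB kernel. */
static void pack_b_16xk(int n) { packed_cols += n; pack_calls++; }

/* Split the n fringe into a multiple-of-16 part and an n_lt_16 tail,
   invoking the pack kernel once for each non-empty part. */
void pack_b_fringe(int n_fringe)
{
    int n_mult16 = (n_fringe / 16) * 16;  /* multiples-of-16 part */
    int n_lt_16  = n_fringe - n_mult16;   /* remainder, < 16 */

    if (n_mult16 > 0) pack_b_16xk(n_mult16);
    if (n_lt_16  > 0) pack_b_16xk(n_lt_16);
}
```
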
2025-06-24 11:36:35 +05:30
S, Hari Govind
8d41565822 Fix build failure when AOCL_DYNAMIC is disabled (#57)
- The build was failing when AOCL_DYNAMIC was disabled because
  `fast_path_thresh` was only declared when both AOCL_DYNAMIC and
  OpenMP were enabled. This variable was used in an `if` condition for
  single-thread execution without an AOCL_DYNAMIC guard.

- To resolve this, the test expression for single-thread execution has
  been replaced with a macro. This macro is set to 0 when AOCL_DYNAMIC
  is disabled, ensuring the condition is handled correctly.
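
A minimal sketch of the macro-based fix, with illustrative names (the real macro and threshold variable are named differently):

```c
#include <assert.h>

/* Fold the single-thread test into a macro that compiles to 0 when
   AOCL_DYNAMIC is off, so fast_path_thresh is only referenced when it
   is actually declared. All names here are illustrative. */
#define AOCL_DYNAMIC  /* comment out to emulate the disabled build */

#ifdef AOCL_DYNAMIC
static int fast_path_thresh = 1000;
#define USE_SINGLE_THREAD_FAST_PATH(mnk) ((mnk) < fast_path_thresh)
#else
#define USE_SINGLE_THREAD_FAST_PATH(mnk) 0  /* variable never referenced */
#endif

int choose_nt(int mnk, int nt_requested)
{
    if (USE_SINGLE_THREAD_FAST_PATH(mnk))
        return 1;  /* small problem: stay single-threaded */
    return nt_requested;
}
```

With AOCL_DYNAMIC undefined, the macro expands to a constant 0 and the build no longer needs the undeclared variable.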

AMD-Internal: [CPUPL-6854]
2025-06-23 15:56:15 +05:30
V, Varsha
8fd7060b2f Matrix Add and Matrix Mul Post-op addition in F32 AVX512_256 kernels (#50)
Added Matrix-mul and Matrix-add postops in FP32 AVX512_256 GEMV kernels

 - Matrix-add and Matrix-mul post-ops in FP32 AVX512_256 GEMV m = 1 and
 n = 1 kernels have been added.

Co-authored-by: VarshaV <varshav2@amd.com>
AOCL-Weekly-200625
2025-06-17 16:17:13 +05:30
Smyth, Edward
b5c66a9d8c Implement bli_thread_reset (#32)
BLIS-specific setting of threading takes precedence over OpenMP
thread count ICV values, and if the BLIS-specific threading APIs
are used, there was no way for the program to revert to OpenMP
settings. This patch implements a function bli_thread_reset() to
do this. It is similar to the implementation in upstream BLIS
commit 6dcf7666ef.

More specifically, it reverts the internal threading data to that
which existed when the program was launched, subject where appropriate
to any changes in the OpenMP ICVs. In other words:
- It will undo changes to threading set by previous calls to
  bli_thread_set_num_threads or bli_thread_set_ways.
- If the environment variable BLIS_NUM_THREADS was used, this will
  NOT be cleared, as the initial state of the program is restored.
- Changes to OpenMP ICVs from previous calls to omp_set_num_threads()
  will still be in effect, but can be overridden by further calls to
  omp_set_num_threads().

Note: the internal BLIS data structure updated by the threading APIs,
including bli_thread_reset(), is thread-local to each user
(e.g. application) thread.

Example usage:
omp_set_num_threads(4);
bli_thread_set_num_threads(7);
dgemm(...); // 7 threads will be used
bli_thread_reset();
dgemm(...); // 4 threads will be used
2025-06-17 10:40:10 +01:00
S, Hari Govind
e097346658 Implemented Multithreading Support and Optimization of DGEMV API (#10)
- Implemented a multithreading framework for the DGEMV API on Zen architectures. Architecture-specific AOCL-dynamic logic determines the optimal number of threads for improved performance.

- The condition check on the value of beta is optimized by utilizing masked operations. The mask value is set based on the value of beta, and the masked operations are applied when the vector y is loaded or scaled with beta.
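
A scalar sketch of the beta handling; the real kernel builds an AVX mask from beta and uses masked load/store instructions, which this illustrative C version only approximates:

```c
#include <assert.h>

/* A flag derived from beta decides whether y is loaded and scaled at
   all; beta == 0 takes a store-only path (no read of y). The real
   kernel expresses this with vector mask registers. */
void dgemv_beta_scale(int n, double beta, double *y)
{
    int load_y = (beta != 0.0);  /* mask stand-in */
    for (int i = 0; i < n; i++)
        y[i] = load_y ? beta * y[i] : 0.0;
}
```

Skipping the load when beta is zero also avoids touching possibly uninitialized y data.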

AMD-Internal: [CPUPL-6746]
2025-06-17 12:39:48 +05:30
Vankadari, Meghana
26e5c63781 Disabled default packing of matrices in batch_gemm of FP32 (#55)
AMD-Internal: SWLCSG-3527
2025-06-17 10:53:05 +05:30
Vankadari, Meghana
8649cdc14b Removed unnecessary pack checks in FP32 GEMV (#54)
Details:
- In FP32 GEMM, when threading is disabled, rntm_pack_a and rntm_pack_b
  were set to true by default. This leads to a perf regression for smaller
  sizes. Modified the FP32 interface API to not overwrite the packA and
  packB variables in the rntm structure.
- In FP32 GEMV, removed the decision-making code based on mtag_A/B
  and should_pack_A/B for packing. Matrices will be packed only
  if the storage format of the matrices doesn't match the storage
  format required by the kernel.
- Changed the control flow to check whether the matrix is "reordered",
  "to-be-packed" or "unpacked", checking for "reordered" first,
  followed by "pack". This ensures that packing doesn't happen when
  the matrix is already reordered, even if the user forces packing
  by setting "BLIS_PACK_A/B".
- Modified the Python script to generate testcases based on block sizes.

AMD-Internal: SWLCSG-3527
2025-06-16 12:34:11 +05:30
Balasubramanian, Vignesh
1847a1e8c6 Bugfix : Segmentation fault at the topology detection layer (#51)
- The current implementation of the topology detector establishes
  a contingency wherein it is expected that the parallel region
  uses all the threads queried through omp_get_max_threads(). In
  case the actual parallelism in the function is limited (lower than
  this expectation), the code may access an unallocated memory section
  (through uninitialized pointers).

- This was because every thread (having its own pointer) sets its
  pointer's initial value to NULL inside the parallel section, thereby
  leaving some pointers uninitialized if the associated thread is not
  spawned.

- Also, the current implementation would use negative indexing (with -1)
  if any associated thread was not spawned (when using the core-group ID).

- Fix: Set every thread-specific pointer to NULL outside the parallel
  region, using calloc(). As long as we have NULL checks for pointers
  before accessing through them, no issues will be observed. Also, avoid
  incurring the topology detection cost if all the required threads
  are not spawned (thereby avoiding potential negative indexing).
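
The fix can be illustrated in miniature: calloc() zeroes every per-thread slot up front, and the parallel region (emulated here by initializing only some of the slots) can then be walked safely with NULL checks. Names are illustrative:

```c
#include <assert.h>
#include <stdlib.h>

/* Zero-initialize all per-thread slots before the "parallel region",
   so slots of threads that never spawn stay NULL and the later walk
   is safe. 'spawned' of 'capacity' slots get initialized, emulating
   a parallel region that ran with fewer threads than expected. */
int count_initialized(int capacity, int spawned)
{
    int **slots = calloc((size_t)capacity, sizeof *slots); /* all NULL */
    if (!slots) return -1;

    static int dummy;
    for (int t = 0; t < spawned; t++)  /* stands in for the omp region */
        slots[t] = &dummy;

    int found = 0;
    for (int t = 0; t < capacity; t++)
        if (slots[t] != NULL)          /* NULL check guards every access */
            found++;

    free(slots);
    return found;
}
```

Initializing the pointers inside the parallel section, as before, would leave the uninitialized tail indistinguishable from valid entries.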

AMD-Internal: [SWLCSG-3573]

Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
Co-authored-by: Bhaskar, Nallani <Nallani.Bhaskar@amd.com>
2025-06-14 21:55:02 +05:30
Chandrashekara K R
ae698be825 Updated version string from 5.0.1 to 5.1.1 2025-06-13 11:24:50 +05:30
Vankadari, Meghana
8968973c2d Performance fix for FP32 GEMV (#47)
Details:
- In the FP32 GEMM interface, mtag_b was being set to PACK by default.
  This leads to packing of the B matrix even when packing is not
  absolutely required, causing a perf regression.
- Now mtag_b is set to PACK only if it is absolutely necessary to pack the
  B matrix, and the check conditions before packing were modified
  appropriately.

AMD-Internal - [SWLCSG-3575]
AOCL-Jun2025-b2
2025-06-10 14:54:01 +05:30
Smyth, Edward
49ae7db89a Avoid including .c files (#40)
Including a C file directly in another C file is not recommended, and some
build systems (e.g. Bazel and Buck) do not allow .c files to include other
.c files. This commit changes the tapi and oapi framework files that are
included from the _ex and _ba file variants from .c filenames to .h
filenames.

AMD-Internal: [CPUPL-6784]

Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
2025-06-10 11:33:33 +05:30
V, Varsha
875375a362 Bug Fixes in FP32 Kernels: (#41)
* Bug Fixes in FP32 Kernels:

 - The current implementation routes m=1 tiny cases into the LPGEMV_TINY loop,
 but that loop doesn't call the GEMV_M_ONE kernels. Added the m=1 path in the
 LPGEMV_TINY loop by handling the pack A/pack B/reorder B conditions.

- Added BF16 support for BIAS, Matrix-Add and Matrix-Mul for AVX512 F32
 main and GEMV kernels.

- Added BF16 Downscale, BIAS, Matrix-Add and Matrix-Mul support in AVX2 GEMV_N
 and AVX512_256 GEMV kernels.

- Added BF16 Matrix-Add and Matrix-Mul support for AVX512_256 F32 kernels.

- Modified the condition check for the FP32 zero point in AVX512 kernels, and
 fixed a few bugs in the col-major zero point evaluation and instruction usage.

AMD Internal: [CPUPL-6748]

---------

Co-authored-by: VarshaV <varshav2@amd.com>
2025-06-06 17:48:50 +05:30
Vankadari, Meghana
9e9441db47 Fix for n_fringe in AVX512 FP32 6x64 kernel (#42)
Details:
- Fixed the problem decomposition for n-fringe case of
  6x64 AVX512 FP32 kernel by updating the pointers
  correctly after each fringe kernel call.

AMD-Internal: SWLCSG-3556
2025-06-06 11:33:25 +05:30
Vankadari, Meghana
37efbd284e Added 6x16 and 6xlt16 main kernels for f32 using AVX512 instructions (#38)
* Implemented 6xlt8 AVX2 kernel for n<8 inputs

* Implemented fringe kernels for 6x16 and 6xlt16 AVX512 kernels for FP32

* Implemented m-fringe kernels for 6xlt8 kernel for AVX2

* Added the deleted kernels and fixed bias bug

AMD-Internal: SWLCSG-3556
2025-06-05 15:17:02 +05:30
Smyth, Edward
14e46ad83b Improvements to x86 make_defs files (#29)
Various changes to simplify and improve x86 related make_defs files:
- Make better use of common definitions in config/zen/amd_config.mk
  from config/zen*/make_defs.mk files
- Similarly for config/zen/amd_config.make from the
  config/zen*/make_defs.cmake files
- Pass cc_major, cc_minor and cc_revision definitions from configure
  to generated config.mk file, and use these instead of defining
  GCC_VERSION in config/zen*/make_defs.mk files
- Add znver3 support for LLVM 13 in config/zen3/make_defs.{mk,cmake}
- Add znver5 support for LLVM 19 in config/zen5/make_defs.{mk,cmake}
- Improve readability of haswell, intel64, skx and x86_64 files
- Correct and tidy some comments

AMD-Internal: [CPUPL-6579]
2025-06-03 16:20:43 +01:00
Dave, Harsh
3c8b7895f7 Fixed functionality failure of DGEMM pack kernel. (#31)
* Fixed functionality failure of DGEMM pack kernel.

- Corrected the mask preparation needed for load/store
in the edge kernel where m = 18.

- Corrected the usage of the right vector registers while
storing data back to the buffer in edge kernels.

AMD-Internal: [CPUPL-6773]

* Update bli_packm_zen4_asm_d24xk.c

---------

Co-authored-by: Harsh Dave <harsdave@amd.com>
2025-06-03 17:33:16 +05:30
Smyth, Edward
dcf72968cf Blacklist KNL with GCC 15+ (#844) (#28)
Details:
- GCC 15 drops support for Xeon Phi architectures such as KNL.
- This PR blacklists the `knl` configuration for GCC 15+.

Co-authored-by: Dave Love <dave.love@manchester.ac.uk>
2025-06-02 10:31:30 +01:00
V, Varsha
532eab12d3 Bug Fixes in LPGEMM for AVX512(SkyLake) machine (#24)
* Bug Fixes in LPGEMM for AVX512(SkyLake) machine

 - B-matrix in the bf16bf16f32obf16/f32 API is re-ordered. For machines that
  don't support BF16 instructions, the BF16 input is un-reordered and
  converted to FP32 so that the FP32 kernels can be used.

 - For n = 1 and k = 1 sized matrices, re-ordering in BF16 simply copies the
  matrix to the re-ordered buffer array. But un-reordering to FP32
  requires the matrix size to be a multiple of 16 along n and a multiple
  of 2 along the k dimension.

 - The entry condition to the above has been modified for the AVX512 configuration.

 - In the bf16 API, the tiny path entry check has been modified to prevent a
  seg fault when AOCL_ENABLE_INSTRUCTIONS=AVX2 is set on BF16-supporting
  machines.

 - Modified existing store instructions in FP32 AVX512 kernels to support
  execution on machines that have AVX512 support but not BF16/VNNI (SkyLake).

 - Added BF16 beta and store types in FP32 avx512_256 kernels.

AMD Internal: [SWLCSG-3552]

* Bug Fixes in LPGEMM for AVX512(SkyLake) machine

 - Support added in FP32 512_256 kernels for: Beta, BIAS, Zero-point and
   BF16 store types for bf16bf16f32obf16 API execution in AVX2 mode.

 - B-matrix in the bf16bf16f32obf16/f32 API is re-ordered. For machines that
  don't support BF16 instructions, the BF16 input is un-reordered and
  converted to FP32 type so that the FP32 kernels can be used.

 - For n = 1 and k = 1 sized matrices, re-ordering in BF16 simply copies the
  matrix to the re-ordered buffer array. But un-reordering to FP32
  requires the matrix size to be a multiple of 16 along n and a multiple
  of 2 along the k dimension. The entry condition here has been modified for
  the AVX512 configuration.

 - Fix for seg fault with AOCL_ENABLE_INSTRUCTIONS=AVX2 mode on configurations
   supporting the BF16/VNNI ISA:
   - The BF16 tiny path entry check has been modified to take arch_id into
     account, to prevent improper entry into the tiny kernel.
   - The store in BF16->FP32 col-major for m = 1 was updated to the
     correct storage pattern.
   - The BF16 beta load macro was modified to account for data in unaligned memory.

 - Modified existing store instructions in FP32 AVX512 kernels to support
  execution on machines that have AVX512 support but not BF16/VNNI (SkyLake).

AMD Internal: [SWLCSG-3552]

---------

Co-authored-by: VarshaV <varshav2@amd.com>
2025-05-30 17:22:49 +05:30
Arnav Sharma
62d4fcb398 Bugfix: Group Size Validation for s8s8s32o32_sym_quant
- Fixed the group size validation logic to correctly check if the
  group_size is a multiple of 4.

- Previously the condition was incorrectly performing bitwise AND with
  decimal 11 instead of binary 11 (decimal 3).
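
The bug in miniature: masking with decimal 11 (binary 1011) does not test the low two bits, while masking with 3 (binary 11) does:

```c
#include <assert.h>

/* A multiple-of-4 check must mask with 3 (binary 11); masking with
   decimal 11 (binary 1011) tests the wrong bits. */
int is_group_size_valid_buggy(int gs) { return (gs & 11) == 0; }
int is_group_size_valid_fixed(int gs) { return (gs & 3) == 0; }
```

For example, 8 and 12 are multiples of 4 but have bit 3 set, so the buggy check rejects them.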

AMD-Internal: [CPUPL-6754]
2025-05-30 11:53:23 +05:30
Prabhu, Anantha
9b7e1105dc Update branch-name-check.yml (#27) 2025-05-29 18:13:15 +05:30
Negi, Deepak
ffd7c5c3e0 Postop support for Static Quant and Integer APIs (#20)
- Support for the S32 zero-point type is added for aocl_gemm_s8s8s32os32_sym_quant.
- Support for BF16 scale factor types is added for aocl_gemm_s8s8s32os32_sym_quant.
- U8 buffer type support is added for the matadd, matmul and bias post-ops in all int8 APIs.

AMD-Internal: SWLCSG-3503
2025-05-27 16:29:32 +05:30
Prabhu, Anantha
ced8b3b7d8 Potential fix for code scanning alert no. 1: Workflow does not contain permissions (#16)
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2025-05-21 07:51:09 -07:00
Vlachopoulou, Eleni
4555917040 CMake: Removing INT_SIZE variable from presets (#11)
With this change, the default INT_SIZE of the system will be used.
That is compatible with the Make system and default CMake options.
2025-05-21 09:35:32 +01:00
Vlachopoulou, Eleni
fac4a97118 Adding flags when building bench with CMake (#9)
Since the GNU extensions were removed, executables in the bench directory cannot be built correctly.
The fix is adding "-D_POSIX_C_SOURCE=200112L" on those targets. When -std=gnu99 was used,
bench worked without this flag, but that is no longer the case since we switched to -std=c99.
2025-05-16 11:47:39 +01:00
Kallesh, Vijay-teekinavar
a2a045cb2e SWLDEVOPS-7853 - Action file to mandate branch naming convention (#2)
---------
Co-authored-by: vkallesh <vkallesh@amd.com>
2025-05-14 15:29:35 +05:30
Bhaskar, Nallani
42a0d74ced Fixed configuration issues in AOCL_GEMM addon (#4)
* Fixed configuration issues in AOCL_GEMM addon

Description:

Fixed aocl_gemm addon initialization of kernels and block sizes
for machines which support only AVX512 but not
AVX512_VNNI/VNNI_BF16.

Aligned NC, KC blocking variables between ZEN and ZEN4

AMD-Internal: [SWLCSG-3527]
2025-05-13 17:19:19 +05:30
Negi, Deepak
121d81df16 Implemented GEMV kernel for m=1 case. (#5)
* Implemented GEMV kernel for m=1 case.

Description:

- Added a new GEMV kernel for AVX2 where m=1.
- Added a new GEMV kernel for AVX512 with ymm registers where m=1.
2025-05-13 16:33:04 +05:30
harsdave
cd83fc38b5 Add packing support M edge cases in DGEMM 24xk pack kernel
Previously, the DGEMM implementation used `dscalv` for cases
where the M dimension of matrix A is not a multiple of 24,
resulting in a ~40% performance drop.

This commit introduces specialized edge-case handling in the pack kernel
to optimize performance for these cases.

The new packing support significantly improves performance.

- Removed reliance on `dscalv` for edge cases, addressing the
  performance bottleneck.

AMD-Internal: [CPUPL-6677]

Change-Id: I150d13eb536d84f8eb439d7f4a77a04a0d0e6d60
2025-05-06 09:22:49 +05:30
Meghana Vankadari
8557e2f7b9 Implemented GEMV for n=1 case using 32 YMM registers
Details:
- This implementation is picked from the cntx when GEMM is invoked on
  machines that support AVX512 instructions while forcing the
  AVX2 path using AOCL_ENABLE_INSTRUCTIONS=AVX2 at run-time.
- This implementation uses MR=16 for GEMV.

AMD-Internal: [SWLCSG-3519]
Change-Id: I8598ce6b05c3d5a96c764d96089171570fbb9e1a
2025-05-05 05:31:13 -04:00
Meghana Vankadari
21aa63eca1 Implemented AVX2 based GEMV for n=1 case.
- Added a new GEMV kernel with MR = 8 which will be used
  for cases where n=1.
- Modified the GEMM and GEMV framework to choose the right GEMV kernel
  based on compile-time and run-time architecture parameters. This
  had to be done since GEMV kernels are not stored-in/retrieved-from
  the cntx.
- Added a pack kernel that packs A matrix from col-major to row-major
  using AVX2 instructions.

AMD-Internal: [SWLCSG-3519]
Change-Id: Ibf7a8121d0bde37660eac58a160c5b9c9ebd2b5c
2025-05-05 08:56:22 +00:00
Chandrashekara K R
a285cf4b27 The LICENSE and NOTICES files have been updated with the latest content.
Change-Id: I522b9b07f6b03ea31fa829a038db3016c3fb13ac
(cherry picked from commit a8c76c8cf0)
AOCL-May2025-b1
2025-05-02 06:35:59 -05:00
Chandrashekara K R
b06c6f921b CMake: compiler flags updated for lpgemm kernels under zen folder.
The "-mno-avx512f" compiler flag has been added for zen/lpgemm
source files to address an issue observed with the znver4 compiler
flag when using GCC installed through Spack. The error message
"unsupported instruction `vpcmpeqd'" was encountered, indicating
unsupported AVX-512F instructions. As a workaround, the
"-mno-avx512f" flag was introduced, ensuring that AVX-512F
instructions are disabled during compilation.

AMD-Internal: [CPUPL-6694]

Change-Id: I546475226fbfea4931d568fc1b928cf6c8699b61
2025-04-30 06:09:36 -04:00
Hari Govind S
29f30c7863 Optimisation for DCOPY API
-  Introduced a new assembly kernel that copies data from the source
   to the destination from the front and the back of the vector at the
   same time. This kernel provides better performance for larger
   input sizes.

-  Added a wrapper function responsible for selecting the kernel
   used by DCOPYV API to handle the given input for zen5
   architecture.

-  Updated AOCL-dynamic threshold for DCOPYV API in zen4 and
   zen5 architectures.

-  New unit tests were included in the gtestsuite for the new
   kernel.
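
A scalar sketch of the front-and-back traversal; the real kernel is hand-written assembly with vector loads, so this only illustrates the two-pointer idea:

```c
#include <assert.h>
#include <stddef.h>

/* Copy from both the front and the back of the vector simultaneously.
   Illustrative scalar version of the described traversal; the real
   kernel does this with wide vector loads/stores. */
void dcopy_front_back(size_t n, const double *x, double *y)
{
    size_t i = 0, j = n;
    while (i < j) {
        y[i]     = x[i];      /* advance from the front */
        y[j - 1] = x[j - 1];  /* retreat from the back  */
        i++; j--;
    }
}
```

Touching both ends of the buffer per iteration can engage more memory streams on large inputs, which is the motivation given above.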

AMD-Internal: [CPUPL-6650]
Change-Id: Ie2af88b8e97196b6aa02c089e59247742002f568
2025-04-28 05:58:21 -04:00
Meghana Vankadari
4745cf876e Implemented a new set of kernels for f32 using 32 YMM regs
Details:
- These kernels are picked from the cntx when GEMM is invoked
  on machines that support AVX512 instructions while forcing the
  AVX2 path using AOCL_ENABLE_INSTRUCTIONS=AVX2 at run-time.
- This path uses the same blocksizes and pack kernels as AVX512
  path.
- GEMV is disabled currently as AVX2 kernels for GEMV are not
  implemented.

AMD-Internal: [SWLCSG-3519]
Change-Id: I75401fac48478fe99edb8e71fa44d36dd7513ae5
AOCL-Apr2025-b2
2025-04-23 12:02:01 +00:00
Deepak Negi
48c7452b08 Beta and Downscale support for F32 AVX-512 kernels
Description
- To enable AVX512 VNNI support without native BF16 in BF16 kernels, the
  BF16 C_type is converted to F32 for computation and then cast back to
  BF16 before storing the result.
- Added support for handling BF16 zero-point values of BF16 type.
- Added a condition to disable the tiny path for the BF16 code path
  where native BF16 is not supported.

AMD Internal : [CPUPL-6627]

Change-Id: I1e0cfefd24c5ffbcc95db73e7f5784a957c79ab9
2025-04-23 06:12:14 -05:00
Arnav Sharma
8b0593f88d Optimizations and Improved Support for FP32 RD Kernels
- Updated the decision logic for taking the RD path for FP32.

- Since the 5-loop was designed specifically for RV kernels, added a
  boolean flag to specify when RD path is to be taken, and set ps_b_use
  to cs_b_use in case B matrix is unpacked.

AMD-Internal: [SWLCSG-3497]
Change-Id: I94ed28304a71b759796edcdd4edf65b9bad22bea
2025-04-23 12:26:51 +05:30
Vignesh Balasubramanian
b4b0887ca4 Additional optimizations to ZGEMM SUP and Tiny codepaths(ZEN4 and ZEN5)
- Added a set of AVX512 fringe kernels (using masked loads and
  stores) in order to avoid rerouting to the GEMV typed API
  interface (when m = 1). This ensures uniformity in performance
  across the main and fringe cases when the calls are multithreaded.

- Further tuned the thresholds that decide between the ZGEMM Tiny, Small
  SUP and Native paths for ZEN4 and ZEN5 architectures (in case
  of parallel execution). This accounts for additional
  combinations of the input dimensions.

- Moved the call to Tiny-ZGEMM before the BLIS object creation,
  since this code-path operates on raw buffers.

- Added the necessary test-cases for functional and memory testing
  of the newly added kernels.

AMD-Internal: [CPUPL-6378][CPUPL-6661]
Change-Id: I9af73d1b6ef82b26503d4fc373111132aee3afd6
2025-04-23 00:56:58 -04:00
Arnav Sharma
87c9230cac Bugfix: Disable A Packing for FP32 RD kernels and Post-Ops Fix
- For single-threaded configuration of BLIS, packing of A and B matrices
  are enabled by default. But, packing of A is only supported for RV
  kernels where elements from matrix A are being broadcasted. Since
  elements are being loaded in RD kernels, packing of A results in
  failures. Hence, disabled packing of matrix A for RD kernels.

- Fixed the issue where the c_i index pointer was incorrectly being reset
  when exceeding the MC block, resulting in failures for certain
  Post-Ops.

- Fixed the FP32 reorder case where, for the n == 1 and rs_b == 1 condition,
  it was incorrectly using sizeof(BLIS_FLOAT) instead of sizeof(float).

AMD-Internal: [SWLCSG-3497]
Change-Id: I6d18afa996c253d79f666ea9789270bb59b629dd
2025-04-18 14:31:03 +05:30
Meghana Vankadari
1ff96343f1 Fixed Early return checks in reorder function for f32 & int8 APIs.
Details:
- In the reorder functions, the validity of strides was being checked assuming
  that the matrix to be reordered is always row-major. Modified the code
  to take stor_order into consideration while checking the validity of
  strides.
- This does not directly impact the functionality of GEMM, as we don't
  support GEMM on col-major matrices where the A and/or B matrices are
  reordered before the GEMM computation. But this change makes sense when
  reordering is viewed as an independent functionality, irrespective of
  what the reordered buffers will be used for.
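
A minimal sketch of stride validation that takes the storage order into account; the function name and the specific checks are illustrative, not the actual implementation:

```c
#include <assert.h>

/* Validate (rs, cs) against the declared storage order instead of
   assuming row-major. Checks shown are a minimal illustration. */
int reorder_strides_valid(char stor_order, int m, int n, int rs, int cs)
{
    if (stor_order == 'r' || stor_order == 'R')
        return (cs == 1) && (rs >= n);  /* row-major: rows are contiguous */
    else
        return (rs == 1) && (cs >= m);  /* col-major: cols are contiguous */
}
```

A row-major-only check would wrongly reject valid col-major inputs (and vice versa), which is the bug class described above.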

Change-Id: If2cc4a353bca2f998ad557d6f128881bc9963330
2025-04-15 09:45:48 +00:00
Deepak Negi
f76f37cc11 Bug Fix in F32 eltwise Api with post ops(clip, swish, relu_scale).
Description
1. In the cases of clip, swish, and relu_scale, constants are currently
   loaded as float. However, they are of the C type, so the handling has
   been adjusted: for integer C types these constants are first loaded as
   integers and then converted to float.
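
The adjustment can be sketched as follows; the function name is hypothetical, and it assumes the constant is stored as a 32-bit integer C type:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a post-op constant (e.g. a clip bound) stored in an integer
   C type: load it as the stored type first, then convert to float,
   instead of reinterpreting the raw bytes as a float. */
float load_clip_bound_s32(const void *buf)
{
    int32_t v;
    memcpy(&v, buf, sizeof v); /* load as the stored C type */
    return (float)v;           /* then convert for the float pipeline */
}
```

Reinterpreting the integer bytes directly as float, as before, yields garbage values rather than a converted constant.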

Change-Id: I176b805b69679df42be5745b6306f75e23de274d
2025-04-11 08:14:34 -04:00
varshav2
b9998a1d7f Added multiple ZP type checks in INT8 APIs
- Currently the int8/uint8 APIs do not support multiple ZP types,
   but work only with the int8 type or the uint8 type.
 - Support is added to enable multiple ZP types in these kernels,
   and additional macros were added to support the operations.
 - Modified the bench downscale reference code to support the updated
   types.

AMD-Internal: [SWLCSG-3304]

Change-Id: Ia5e40ee3705a38d09262086d20731e8f0a126987
2025-04-11 07:25:10 -04:00
Arnav Sharma
267aae80ea Added Post-Ops Support for F32 RD Kernels
- Support for Post-Ops has been added for all F32 RD AVX512 and AVX2
  kernels.

AMD-Internal: [SWLCSG-3497]
Change-Id: Ia2967417303d8278c547957878d93c42c887109e
2025-04-11 05:25:30 -04:00
Edward Smyth
a5f11a1540 Add blis_impl wrappers for matrix copy etc APIs (2)
Previous commit on this (e0b86c69af)
was incorrect and incomplete. Add additional changes to enable
blis_impl layer for extension APIs for copying and transposing
matrices.

Change-Id: Ic707e3585acc1c0c554d7e00435464620a8c85dc
2025-04-07 08:54:54 -04:00