Commit Graph

608 Commits

Author SHA1 Message Date
Vignesh Balasubramanian
327142395b Cleanup for readability and uniformity of Tiny-ZGEMM
- Guarded the inclusion of thresholds(configuration
  headers) using macros, to maintain uniformity in
  the design principles.

- Updated the threshold macro names for every
  micro-architecture.

AMD-Internal: [CPUPL-5895]
Change-Id: I9fd193371c41469d9ef38c37f9c055c21457b56c
2025-01-27 15:48:31 +05:30
Vignesh Balasubramanian
fb6dcc4edb Support for Tiny-GEMM interface(ZGEMM)
- As part of AOCL-BLAS, there exists a set of vectorized
  SUP kernels for GEMM, that are performant when invoked
  in a bare-metal fashion.

- Designed a macro-based interface for handling tiny
  sizes in GEMM, that would utilize there kernels. This
  is currently instantiated for 'Z' datatype(double-precision
  complex).

- Design breakdown :
  - Tiny path requires the usage of AVX2 and/or AVX512
    SUP kernels, based on the micro-architecture. The
    decision logic for invoking tiny-path is specific
    to the micro-architecture. These thresholds are defined
    in their respective configuration directories(header files).

  - List of AVX2/AVX512 SUP kernels(lookup table), and their
    lookup functions are defined in the base-architecture from
    which the support starts. Since we need to support backward
    compatibility when defining the lookup table/functions, they
    are present in the kernels folder(base-architecture).

- Defined a new type to be used to create the lookup table and its
  entries. This type holds the kernel pointer, blocking dimensions
  and the storage preference.

- This design would only require the appropriate thresholds and
  the associated lookup table to be defined for the other datatypes
  and micro-architecture support. Thus, is it extensible.

- NOTE : The SUP kernels that are listed for Tiny GEMM are m-var
         kernels. Thus, the blocking in framework is done accordingly.
         In case of adding the support for n-var, the variant
         information could be encoded in the object definition.

- Added test-cases to validate the interface for functionality(API
  level tests). Also added exception value tests, which have been
  disabled due to the SUP kernel optimizations.

AMD-Internal: [CPUPL-6040][CPUPL-6018][CPUPL-5319][CPUPL-3799]
Change-Id: I84f734f8e683c90efa63f2fa79d2c03484e07956
2025-01-24 12:59:26 -05:00
Vignesh Balasubramanian
cdaa2ac7fd Bugfix and optimizations for AVX512 AMAXV micro-kernels
- Bug : The current {S/D}AMAXV AVX512 kernels produced an
  incorrect functionality with multiple absolute maximums.
  They returned the last index when having multiple occurences,
  instead of the first one.

- Implemented a bug-fix to handle this issue on these AVX512
  kernels. Also ensured that the kernels are compliant with
  the standard when handling exception values.

- Further optimized the code by decoupling the logic to find
  the maximum element and its search space for index. This way,
  we use lesser latency instructions to compute the maximum
  first.

- Updated the unit-tests, exception value tests and early return
  tests for the API to ensure code-coverage.

AMD-Internal: [CPUPL-4745]
Change-Id: I2f44d33dbaf89fe19e255af1f934877816940c6f
2025-01-07 22:56:20 +05:30
Vignesh Balasubramanian
609af9bfe2 Threshold tuning for ZGEMM small path
- Updated the threshold check for ZGEMM small path to include
  runtime checks for redirection, specific to the micro-architecture.

- The current ZGEMM small path has only its AVX2 variant available.
  Post implementing an AVX512(same/different algorithm), the thresholds
  will further be fine-tuned.

- Included the dot-product based AVX512 ZGEMM kernels in the ZEN5
  context. It will be used as part of handling RRC and CRC storage
  schemes of C, A and B matrices in both single-thread and multi-thread
  runs.

AMD-Internal: [CPUPL-5949]
Change-Id: Ic8b7cf0e00b7c477f748669f160c4b01df995c75
2024-12-13 12:51:22 -05:00
harsdave
54b46ec1ed Enhance 24x8 DGEMM SUP/Tiny Kernel Performance with Optimized Loops and Edge Kernels
This patch introduces comprehensive optimizations to the DGEMM kernel, focusing on loop
efficiency and edge kernel performance. The following technical improvements have been implemented:

1. **IR Loop Optimization:**
   - The IR loop has been re-implemented in hand-written assembly to eliminate the overhead associated
     with `begin_asm` and `end_asm` calls, resulting in more efficient execution.

2. **JR Loop Integration:**
   - The JR loop is now incorporated into the micro kernel. This integration avoids the repetitive overhead
     of stack frame management for each JR iteration, thereby enhancing loop performance.

3. **Kernel Decomposition Strategy:**
   - The m dimension is decomposed into specific sizes: 20, 18, 17, 16, 12, 11, 10, 9, 8, 4, 2, and 1.
   - For remaining cases, masked variants of edge kernels are utilized to handle the decomposition efficiently.

1. **Interleaved Scaling by Alpha:**
   - Scaling by the alpha factor is interleaved with load instructions to optimize the instruction pipeline
     and reduce latency.

2. **Efficient Mask Preparation:**
   - Masks are prepared within inline assembly code only at points where masked load-store operations are necessary,
     minimizing unnecessary overhead.

3. **Broadcast Instruction Optimization:**
   - In edge kernels where each FMA (Fused Multiply-Add) operation requires a broadcast without subsequent reuse,
     the broadcast instruction is replaced with `mem_1to8`.
   - This allows the compiler to optimize by assigning separate vector registers for broadcasting, thus avoiding
     dependency chains and improving execution efficiency.

4. **C Matrix Update Optimization:**
   - During the update of the C matrix in edge kernels, columns are pre-loaded into multiple vector registers.
     This approach breaks dependency chains during FMA operations following the scaling by alpha, thereby mitigating
     performance bottlenecks and enhancing throughput.

These optimizations collectively improve the performance of the DGEMM kernel, particularly in handling edge cases and
reducing overhead in critical loops. The changes are expected to yield significant performance gains in matrix multiplication
operations.

This patch also involves changes for tiny gemm interface. A light
interface for calling kernels and removing calls to avx2 dgemm kernels
as we use avx512 dgemm kernels for all the sizes for zen4 and zen5.

For zen4 and zen5 when A matrix transposed(CRC, RRC), tiny kernel does not have
the support to handle such inputs and thus such inputs are routed to
gemm_small path.

AMD-Internal: [CPUPL-6054]
Change-Id: I57b430f9969ca39aa111b54fa169e4225b900c4a
2024-12-13 00:03:00 -05:00
Shubham Sharma.
be6fbadd95 BlockSize Tuning for ZEN4 and ZEN5
- Enabled dynamic blocksizes for DGEMM in ZEN4 and ZEN5 systems.
- MC, KC and NC are dynamically selected at runtime for DGEMM native.
- A local copy of cntx is created and blocksizes are updated in the local cntx.
- Updated threshold for picking DGEMM SUP kernel for ZEN4.

AMD-Internal: [CPUPL-5912]
Change-Id: Ic12a1a48bfa59af26cc17ccfa47a2a33fadde1f6
2024-11-29 03:19:16 -05:00
Shubham Sharma
f2320a1fef Enabled DGEMM row major kernel for ZEN4
- Merged ZEN4 and ZEN5 DGEMM 8x24 kernel.
- Replaced 32x6 kernel with 8x24. Now same
  kernel is used for ZEN4 and ZEN5.
- Blocksizes have been tuned for genoa only.
- DGEMM kernel for DTRSM native code path
  is replaced with 8x24 kernel.
- Enabled alpha scaling during packing for ZEN4.
- ZEN4 8x24 kernel has been removed.

AMD-Internal: [CPUPL-5912]
Change-Id: I89a16a7e3355af037d21d453aabf53c5ecccb754
2024-11-29 08:18:48 +00:00
Vignesh Balasubramanian
06d776b025 AVX512 ZGEMM SUP Inner product kernels
- Implemented a set of column preferential dot-product based
  ZGEMM kernels(main and fringe) in AVX512(for SUP code-path).
  These kernels perform matrix multiplication as a sequence
  of inner products(i.e, dot-products).

- These standalone kernels are expected to strictly handle
  the CRC storage scheme for C, A and B matrices. RRC is also
  supported through operation transpose, at the framework
  level.

- Added unit-tests to test all the kernels(main and fringe),
  as well as the redirection between these kernels.

AMD-Internal: [CPUPL-5949]
Change-Id: I858257ac2658ed9ce4980635874baa1474b79c38
2024-11-06 04:18:57 -05:00
Mangala V
705755bb5c Revert "Using znver2 flags for building zen/zen2/zen3 kernels on amdzen builds."
This reverts commit 7d379c7879.

Reason for revert: < Perf regression is observed for GEMM(gemm_small_At)
                    as fma uses memory operand >

Change-Id: I0ec3a22acaacfaade860c67858be6a2ba6296bce
2024-09-02 09:07:46 -04:00
Edward Smyth
82bdf7c8c7 Code cleanup: Copyright notices
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
  zen, skx and a couple of other kernels to cover all
  contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
  statements.

AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
2024-08-05 15:35:08 -04:00
Eleni Vlachopoulou
7d379c7879 Using znver2 flags for building zen/zen2/zen3 kernels on amdzen builds.
config => config/build/arch folder

Issue:

  1. Performance drop is observed as part of the fat binary(amdzen config)
     built to support all the platforms using dynamic dispatch feature.
  2. Observed only in intrinsic code and not in assembly code.
  3. Observed in many of level1 kernels on Milan and Genoa

Previous Design:

Znver flags are picked based on config or function name

In case of ref_kernels:
   Compiler picks up znver flag based on the function name. All
   ref_kernels are named based on BLIS_CNAME which is a
   config name (zen, zen2, zen3, zen4, zen5)
In case of Zen kernels:
   Compiler picks up znver flag based on the config name where the
   source file exists. All avx2 kernels are placed in zen and all avx512
   kernels are placed in zen4/zen5 folder.

   Kernels placed in zen (AVX2 kernels) are being compiled with znver1
   flag rather than using znver2/znver3 flags on  zen2/zen3 arch
   respectively

New Design: For amdzen builds

  1. For ref_kernels and kernels/(zen/zen2/zen3), znver2 flag is used instead of
     znver1 in make and cmake build system.
  2. To use znver2 flags, make_defs.mk of zen2 is included in zen config
  3. No changes are made for auto or any individual config
  4. Significant perfomance improvement is observed

AMD-Internal : [CPUPL-5407] [CPUPL-5406] [CPUPL-4873]  [CPUPL-4872] [CPUPL-4871]  [CPUPL-4801] [CPUPL-4800] [CPUPL-4799]

Change-Id: Ie817c13b8b69a2dc4328aad7ae09a3af06f83df5
2024-08-05 14:27:01 +05:30
Varaganti, Kiran
145e706992 Fixed auxiliary cache block sizes for Native and SUP DGEMM kernels for ZEN4 and ZEN5 configs.
Auxiliary blocksize values for cache blocksizes are interpreted as the maximum cache blocksizes. The maximum cache blocksizes are a convenient and portable way of smoothing performance of the level-3 operations when computing with a matrix operand that is just slightly larger than a multiple of the preferred cache blocksize in that dimension. In these "edge cases," iterations run with highly sub-optimal blocking. We can address this problem by merging the "edge case" iteration with the second-to-last iteration, such that the cache blocksizes are slightly larger--rather than significantly smaller--than optimal. The maximum cache blocksizes allow the developer to specify the maximum size of this merged iteration; if the edge case causes the merged iteration to exceed this maximum, then the edge case is not merged and instead it is computed upon in separate (final) iteration. (https://github.com/flame/blis/blob/master/docs/ConfigurationHowTo.md).
      In bli_cntx_init_zen4 and zen5 - auxiliary blocksize for KC was less than primary blocksize. These are fixed.
      Code-cleanup of the files bli_family_zen4, zen5.h" Removed unused constants.
Thanks to Igor Kozachenko <igork@berkeley.edu> for pointing out these two bugs.

Change-Id: I44fc564d5d91cb978d062c413e70751aeaa07f2c
2024-08-05 10:29:43 +05:30
Mangala V
0a4f9d5ac1 Removed -fno-tree-loop-vectorize from kernel flags
- This change in made in MAKE build system.
- Removed -fno-tree-loop-vectorize from global kernel flags,
  instead added it to lpgemm specific kernels only.
- If this flag is not used , then gcc tries to auto
  vectorize the code which results in usages of
  vector registers, if the auto vectorized function
  is using intrinsic then the total numbers of vector
  registers used by intrinsic and auto vectorized
  code becomes more than the registers
  available in machine which causes read and writes
  to stack, which is causing regression in lpgemm.
- If this flag is enabled globally, then the files which
  do not use any intrinsic code do not get auto
  vectorized.
- To get optimal performance for both blis and lpgemm,
  this flag is enabled for lpgemm kernels only.

Previous commit (75df1ef218) contains
similar changes on cmake build system

AMD-Internal: [CPUPL-5544]

Change-Id: I796e89f3fb2116d64c3a78af2069de20ce92d506
2024-08-02 09:40:06 -04:00
Shubham Sharma
0d95fcf20c Revert "DGEMM Native AVX512 updates"
This reverts commit f378fc57b5.

Reason for revert: Causing Failure

AMD-Internal: [CPUPL-5262]
Change-Id: I15860eabf2461fae3d0f7cedd436d4db2df5b82f
2024-08-02 07:32:28 -04:00
Ruchika Ashtankar
92fbd04238 DGEMM SUP Optimizations for Turin
- Introduced a new 24x8 column preferred DGEMM sup kernel for zen5.
- A prefetch logic is modified compared to zen4 24x8 sup kernels.
- Earlier, next panel of A is prefetched into L2 cache,
  which is now modified to prefetching the second next column
  of the current panel of A into L1 cache.
- B and C prefetches are enabled and unchanged.
- Tuned MC, KC and NC block sizes for new kernel.

AMD-Internal: [CPUPL-5262]
Change-Id: If933537e50f43f5560e0fe18a716aa1e36ced64d
2024-08-02 04:00:51 -04:00
Ruchika Ashtankar
5760e06100 Threshold tuning for DGEMM SUP for zen5
- New Decision threshold constants are added to decide between
double precision sup vs native dgemm code-path for zen5 processors.
- The decision is based on the values of m, n and k.

AMD-Internal: [CPUPL-5262]
Change-Id: I87b8ff9eb603d6fda0875e000f7ab83b22d22040
2024-08-02 11:34:32 +05:30
Shubham Sharma.
f378fc57b5 DGEMM Native AVX512 updates
- In the initial patch - for m, n non-multiple of MR and NR
  respectively we are calling bli_dgemm_ker_var2. Now we have
  implemented macro-kernel for these fringe cases as well.
- Replaced RBP register with R11 in the macro-kernel.
- Retuned MC, KC and NC with these new changes.
  This will result in better performance for matrix sizes
  like m=4000 or greater when running on single thread.


AMD-Internal: [CPUPL-5262]
Change-Id: I66c111ceb7feee776703339680d57e8d6d5c809a
2024-07-31 12:23:34 -04:00
Shubham Sharma
16c56e0101 Added 24x8 triangular kernels for DGEMMT SUP
- In order to reuse 24x8 AVX512 DGEMM SUP kernels,
   24x8 triangular AVX512 DGEMMT SUP kernels are added.
 - Since the LCM of MR(24) and NR(8) is 24, therefore the diagonal
   pattern repeats every 24x24 block of C. To cover this 24x24 block,
   3 kernels are needed for one variant of DGEMMT. A total of 6
   kernels are needed to cover both upper and lower variants.
 - In order to maximize code reuse, the 24x8 kernels are broken
   into two parts, 8x8 diagonal GEMM and 16x8 full GEMM. The 8x8
   diagonal GEMM is computed by 8x8 diagonal kernel, and 16x8
   full GEMM part is computed by 24x8 DGEMM SUP kernel.
 - Changes are made in framework to enable the use of these kernels.

AMD-Internal: [CPUPL-5338]
Change-Id: I8e7007031e906f786b0c4fe12377ee439075207a
2024-07-22 12:02:30 -04:00
Vignesh Balasubramanian
b48e864e82 AVX512 optimizations for DAXPBYV API
- Implemented AVX512 computational kernel for DAXPBYV
  with optimal unrolling. Further implemented the other
  missing kernels that would be required to decompose
  the computation in special cases, namely the AVX512
  DADDV and DSCAL2V kernels.

- Updated the zen4 and zen5 contexts to ensure any query
  to acquire the kernel pointer for DAXPBYV returns the
  address of the new kernel.

- Added micro-kernel units tests to GTestsuite to check
  for functionality and out-of-bounds reads and writes.

AMD-Internal: [CPUPL-5406][CPUPL-5421]
Change-Id: I127ab21174ddd9e6de2c30a320e62a8b042cbde6
2024-07-22 11:32:19 +05:30
Shubham Sharma
75df1ef218 Removed -fno-tree-loop-vectorize from kernel flags
- This change in made in CMAKE build system only.
- Removed -fno-tree-loop-vectorize from global kernel flags,
  instead added it to lpgemm specific kernels only.
- If this flag is not used , then gcc tries to auto
  vectorize the code which results in usages of
  vector registers, if the auto vectorized function
  is using intrinsics then the total numbers of vector
  registers used by intrinsic and auto vectorized
  code becomes more than the registers
  available in machine which causes read and writes
  to stack, which is causing regression in lpgemm.
- If this flag is enabled globally, then the files which
  do not use any intrinsic code do not get auto
  vectorized.
- To get optimal performance for both blis and lpgemm,
  this flag is enabled for lpgemm kernels only.

Change-Id: I14e5c18cd53b058bfc9d764a8eaf825b4d0a81c4
2024-07-19 00:49:52 -04:00
Arnav Sharma
4aa66f108e Added CSCALV AVX512 Kernel
- Added CSCALV kernel utilizing the AVX512 ISA.

- Added function pointers for the same to zen4 and zen5 contexts.

- Updated the BLAS interface to invoke respective CSCALV kernels based
  on the architecture.

- Added UKR tests for bli_cscalv_zen_int_avx512( ... ).

AMD-Internal: [CPUPL-5299]
Change-Id: I189d87a1ec1a6e30c16e05582dcb57a8510a27f3
2024-07-15 07:17:43 -04:00
Shubham Sharma.
a7744361e4 DGEMM optimizations for Turin Classic
- Introduced new 8x24 macro kernels.
   - 4 new kernels are added for beta 0, beta 1, beta -1
      and beta N.
   - IR and JR loop moved to ASM region.
   - Kernels support row major storage scheme.
   - Prefetch of current micro panel of C is enabled.
   - Kernel supports negative offsets for A and B matrices.
 - Moved alpha scaling from DGEMM kernel to B pack kernel.
 - Tuned blocksizes for new kernel.
 - Added support for alpha scaling in 24xk pack kernel.
 - Reverted back to old b_next computation
   in gemm_ker_var2.
 - BugFix in 8x24 DGEMM kernel for beta 1,
   comparsion for jmp conditions was done using integer
   instructions, which caused beta 1 path to never be taken.
   Fixed this by changing the comparsion to double.

AMD-Internal: [CPUPL-5262]
Change-Id: Ieec207eea2a164603c8a8ea88e0b1d3095c29a3f
2024-07-09 07:53:27 -04:00
Hari Govind S
627bf0b1ba Implemented Multithreading and Enabled AVX512 Kernel for ZAXPY API
-  Replaced 'bli_zaxpyv_zen_int5' kernel with optimised
   'bli_zaxpyv_zen_int_avx512' kernel for zen4 and
   zen5  config.

-  Implemented multithreading support and AOCL-dynamic
   for ZAXPY API.

-  Utilized 'bli_thread_range_sub' function to achieve
   better work distribution and avoid false sharing.

AMD-Internal: [CPUPL-5250]
Change-Id: I46ad8f01f9d639e0baa78f4475d6e86458d8069b
2024-07-09 01:29:53 -04:00
Edward Smyth
8de8dc2961 Merge commit '81e10346' into amd-main
* commit '81e10346':
  Alloc at least 1 elem in pool_t block_ptrs. (#560)
  Fix insufficient pool-growing logic in bli_pool.c. (#559)
  Arm SVE C/ZGEMM Fix FMOV 0 Mistake
  SH Kernel Unused Eigher
  Arm SVE C/ZGEMM Support *beta==0
  Arm SVE Config armsve Use ZGEMM/CGEMM
  Arm SVE: Update Perf. Graph
  Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0
  Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0
  A64FX Config Use ZGEMM/CGEMM
  Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg
  Arm SVE Add SGEMM 2Vx10 Unindexed
  Arm SVE ZGEMM Support Gather Load / Scatt. St.
  Arm SVE Add ZGEMM 2Vx10 Unindexed
  Arm SVE Add ZGEMM 2Vx7 Unindexed
  Arm SVE Add ZGEMM 2Vx8 Unindexed
  Update Travis CI badge
  Armv8 Trash New Bulk Kernels
  Enable testing 1m in `make check`.
  Config ArmSVE Unregister 12xk. Move 12xk to Old
  Revert __has_include(). Distinguish w/ BLIS_FAMILY_**
  Register firestorm into arm64 Metaconfig
  Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo
  Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo
  Add test for Apple M1 (firestorm)
  Firestorm CPUID Dispatcher
  Armv8 GEMMSUP Edge Cases Require Signed Ints
  Make error checking level a thread-local variable.
  Fix data race in testsuite.
  Update .appveyor.yml
  Firestorm Block Size Fixes
  Armv8 Handle *beta == 0 for GEMMSUP ??r Case.
  Move unused ARM SVE kernels to "old" directory.
  Add an option to control whether or not to use @rpath.
  Fix $ORIGIN usage on linux.
  Arm micro-architecture dispatch (#344)
  Use @path-based install name on MacOS and use relocatable RPATH entries for testsuite inaries.
  Armv8 Handle *beta == 0 for GEMMSUP ?rc Case.
  Armv8 Fix 6x8 Row-Maj Ukr
  Apply patch from @xrq-phys.
  Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs.
  bli_error: more cleanup on the error strings array
  Arm SVE Exclude SVE-Intrinsic Kernels for GCC 8-9
  Arm SVE: Correct PACKM Ker Name: Intrinsic Kers
  Fix config_name in bli_arch.c
  Arm Whole GEMMSUP Call Route is Asm/Int Optimized
  Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref
  Header Typo
  Arm: DGEMMSUP ??r(rv) Invoke Edge Size
  Arm: DGEMMSUP ?rc(rd) Invoke Edge Size
  Arm: Implement GEMMSUP Fallback Method
  Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin
  Added Apple Firestorm (A14/M1) Subconfig
  Arm64 8x4 Kernel Use Less Regs
  Armv8-A Supplimentary GEMMSUP Sizes for RD
  Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm
  Armv8-A Adjust Types for PACKM Kernels
  Armv8-A GEMMSUP-RD 6x8m
  Armv8-A GEMMSUP-RD 6x8n
  Armv8-A s/d Packing Kernels Fix Typo
  Armv8-A Introduced s/d Packing Kernels
  Armv8-A DGEMMSUP 6x8m Kernel
  Armv8-A DGEMMSUP Adjustments
  Armv8-A Add More DGEMMSUP
  Armv8-A Add GEMMSUP 4x8n Kernel
  Armv8-A Add Part of GEMMSUP 8x4m Kernel
  Armv8A DGEMM 4x4 Kernel WIP. Slow
  Armv8-A Add 8x4 Kernel WIP

AMD-Internal: [CPUPL-2698]
Change-Id: I194ff69356740bb36ca189fd1bf9fef02eec3803
2024-06-25 05:48:46 -04:00
mkadavil
a5c4a8c7e0 Int4 B matrix reordering support in LPGEMM.
Support for reordering B matrix of datatype int4 as per the pack schema
requirements of u8s8s32 kernel. Vectorized int4_t -> int8_t conversion
implemented via leveraging the vpmultishiftqb instruction. The reordered
B matrix will then be used in the u8s8s32o<s32|s8> api.

AMD-Internal: [SWLCSG-2390]
Change-Id: I3a8f8aba30cac0c4828a31f1d27fa1b45ea07bba
2024-06-24 07:55:34 -04:00
Vignesh Balasubramanian
6165001658 Bugfix and optimizations for ?AXPBYV API
- Updated the existing code-path for ?AXPBYV to
  reroute the inputs to the appropriate L1 kernel,
  based on the alpha and beta value. This is done
  in order to utilize sensible optimizations with
  regards to the compute and memory operations.

- Updated the typed API interface for ?AXPBYV to include
  an early exit condition(when n is 0, or when alpha is
  0 and beta is 1). Further updated this layer to query
  the right kernel from context, based on the input values
  of alpha and beta.

- Added the necessary L1 vector kernels(i.e, ?SETV, ?ADDV,
  ?SCALV, ?SCAL2V and ?COPYV) to be used as part of special
  case handling in ?AXPBYV.

- Moved the early return with negative increments from ?SCAL2V
  kernels to its typed API interface.

- Updated the zen, zen2 and zen3 context to include function
  pointers for all these vector kernels.

- Updated the existing ?AXPBYV vector kernels to handle only
  the required computation. Additional cleanup was done to
  these kernels.

- Added accuracy and memory tests for AVX2 kernels of ?SETV
  ?COPYV, ?ADDV, ?SCALV, ?SCAL2V, ?AXPYV and ?AXPBYV APIs

- Updated the existing thresholds in ?AXPBYV tests for complex
  types. This is due to the fact that every complex multiplication
  involves two mul ops and one add op. Further added test-cases
  for API level accuracy check, that includes special cases of
  alpha and beta.

- Decomposed the reference call to ?AXPBYV with several other
  L1 BLAS APIs(in case of the reference not supporting its own
  ?AXPBYV API). The decomposition is done to match the exact
  operations that is done in BLIS based on alpha and/or beta
  values. This ensures that we test for our own compliance.

AMD-Internal: [CPUPL-4861]
Change-Id: Ia6d48f12f059f52b31c0bef6c75f47fd364952c6
2024-06-20 16:22:07 +05:30
Shubham Sharma.
580282e655 DGEMM optimizations for Turin Classic
- Introduced new 8x24 row preferred kernel for zen5.
  - Kernel supports row/col/gen
    storage schemes.
  - Prefetch of current panel of A and C
    are enabled.
  - Prefetch of next panel of B is enabled.
  - Kernel supports negative offsets for A and B
    matrices.
- Cache block tuning is done for zen5 core.

AMD-Internal: [CPUPL-5262]
Change-Id: I058ea7e1b751c20c516d7b27a1f27cef96ef730f
2024-06-17 05:18:49 -04:00
Mangala V
64d9c96d45 ZGEMMT SUP: AVX512 GEMMT code for Upper variant
1. Enabled AVX512 path for
   -  Upper variant
   -  Different storage schemes for upper and lower variant

2. Modified mask value to handle all fringe cases correctly

AMD_Internal: [CPUPL-5091]

Change-Id: I4bf8aca24c1b87fff606deb05918b8e6216b729e
2024-05-15 13:08:32 +05:30
Hari Govind S
61d0f3b873 Additional optimisations on COPYV API
-  Reduced number of jump operations in AVX512
   assembly kernel for SCOPYV, DCOPYV and ZCOPYV.

-  Fixed memory test failure for bli_zcopyv_zen_int_avx512
   kernel.

-  Replaced existing AVX2 COPYV intrinsic kernels in
   bli_cntx_init_zen5.c with AVX512 assembly kernels.

Change-Id: Idc11601b526d6d82cfbdf63af2fd331918b31159
2024-05-10 07:22:04 -04:00
Arnav Sharma
cb27fad49c ZSCALV AVX512 Kernel
- Implemented ZSCALV kernel utilizing AVX512 intrinsics.

- Gtestsuite: Added ukr tests for the new kernel.

AMD-Internal: [CPUPL-5012]
Change-Id: I75c7f4448ddd60b0f9afa53936eed37f5f99eeb2
2024-05-08 11:55:13 -04:00
Arnav Sharma
1dbeee4d19 ZDOTV AVX512 Kernel with MT Support
- Added AVX512 kernel for ZDOTV.

- Multithreaded both ZDOTC and ZDOTU with AOCL_DYNAMIC support.

AMD-Internal: [CPUPL-5011]
Change-Id: I56df9c07ab3b8df06267a99835b088dcada81bd8
2024-05-08 04:54:05 -04:00
Mangala V
e6cc2a3e22 ZGEMMT SUP Optimizations for AVX512
Existing Design:
 - GEMM AVX2 kernel performs computation and updates temporary C buffer
 - Portion of temporary C buffer is copied to output C buffer
   based on UPLO parameter
 - For diagonal blocks, using GEMM kernels is not efficient

New Design: Implemented in current patch when UPLO='L'
 - GEMMT kernel used for computation, temporary buffer is not required.
 - Only required elements are computed using mask load store for all
   fringe cases
 - Exception: AVX2 code path is used when storage format is RRC, CRR, CRC

- AOCL-Dynamic is added based on dimension
- Check for AVX platform is added in SUP interface, It returns to
  native implementation if hardware doesnot support AVX platform
- SUP ref_var2m is expanded for dcomplex datatype to avoid condition
  check which exists for double datatype

AMD_Internal: [CPUPL-5006]

Change-Id: I3e21404b732b8f2df9cbdba394303752fdf36286
2024-05-07 23:00:29 +05:30
Shubham Sharma
b70347d0d4 DGEMMT SUP Optimizations for AVX512
- In DGEMMT SUP AVX2 code path, traingular kernels
  are added in order to avoid temporary C buffer.
- Since these kernels did not exist for AVX512,
  AVX2 kernels were being used in GEMMT.
- AVX512 triangular GEMM kernel has been added
  to make sure that AVX512 kernels can be used without
  creating a temporary buffer.
- This kernel is added only for Lower variant of GEMMT,
   for upper variant of DGEMMT, temporary C buffer is
   created, full GEMM kernel is called on temporary C and
   traingular region from temporary C is copied to C
   buffer.

AMD-Internal: [CPUPL-4881]
Change-Id: Id70645f79ae078ab9a7006e83d328505f1fae8a9
2024-05-03 05:11:11 -04:00
Vignesh Balasubramanian
53cb83d0cc AVX512 optimizations for ZGEMV API with no-transpose case
- Implemented AVX512 kernels for handling the calls to ZGEMV
  with no-transpose to A matrix.

- This includes the ZAXPYF, ZAXPYV and ZSETV kernels.
  The set of ZAXPYF kernels include those with fuse-factor 8
  (main kernel), 4 and 2(fringe kernels).

- Updated the bli_zgemv_unf_var2( ... ) function to set
  the function pointers to these kernels, based on the
  configuration. Further added the call to ZSETV at this
  layer in case beta is 0.

AMD-Internal: [CPUPL-4974]
Change-Id: Iee4b724719e49023138bb16479765be44d677cd9
2024-05-03 07:04:47 +00:00
Hari Govind
9c26de1a18 Optimisiation COPYV APIs
- Implemented AVX512 kernels for scopyv_, dcopyv_ and  zcopyv_
  using respective AVX512 intrinsics including masked
  load and store operations.

- Implemented AVX512 kernels for scopy_, dcopy_ and
  zcopy_ using assembly language to prevent loss of
  performance during the translation of intrinsics.

- Updated the dcopy_blis_impl( ... ) and
  zcopy_blis_impl( ... ) function to support
  multithreaded calls to the respective computational
  kernels, if and when the OpenMP support is enabled.

- Implemented OpenMP parallelization for dcopyv_ and
  zcopyv_ APIs, while scopyv_ and ccopyv_ only support
  single thread.

AMD-Internal: [CPUPL-4854]
Change-Id: I5fbd0bcca4e59001fbe2b1168b624d0c33242b3e
2024-05-01 00:23:01 +05:30
Edward Smyth
2450a1813b BLIS: Implement zen5 sub-configuration
Implement full support for zen5 as a separate BLIS sub-configuration
and code path within amdzen configuration family.

AMD-Internal: [CPUPL-3518]
Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09
2024-04-12 07:26:31 -04:00
Eleni Vlachopoulou
020b9ff7f0 CMake: Enable builds for both static and shared builds for Linux.
- Added BUILD_STATIC_LIBS option which is on by default, only on Linux.
- Added TEST_WITH_SHARED option which is off by default, only on Linux.
- If only shared or static lib is being built, that's the one that will be used for testing.
- If both are being built, TEST_WITH_SHARED determins which library wil be used for testing.
- Set linux workflows so that they build both static and shared libs, and use linux-static and linux-shared to denote which one should be used for testing.
- Set -fPIC for both static and shared builds to fix issues faced when building blis using AOCC 4.0.0 and gtestsuite using gcc 9.4.0.

AMD-Internal: [CPUPL-2748]
Change-Id: I4227bab97ff31ecddfe218e18499f33b4e4ee63e
2024-03-14 10:32:51 -04:00
jagar
e2de45b454 CMake:Added support for ADDON(aocl_gemm) on Windows
CMakelists.txt is updated to support aocl_gemm on windows.
On windows, BLIS library(blis+aocl_gemm) is built successfully
only with AOCC Compiler. (Clang has an issue with optimizing
VNNI instructions).

$cmake .. -DENABLE_ADDON="aocl_gemm" ....

AMD-Internal: [CPUPL-2748]
Change-Id: I9620878ab6934233fadc9ddc5d5e82ad85be9209
2024-03-14 07:57:02 -04:00
jagar
bbfa4a88ec CMake: Updated compiler ID in cmake files
Updated compiler id in cmake related files from
CMAKE_CXX_COMPILER_ID to CMAKE_C_COMPILER_ID

AMD-Internal: [CPUPL-2748]
Change-Id: Ib0e2a2e3ec8fafeb423fe56b9842a93db0115371
2024-03-14 07:24:04 -04:00
Kiran Varaganti
0784679d4d Fix gcc 7.5 compilation error for zen4 and above configs
For gcc greater than or equal to 7.0 version added AVX512 compiler flags
    in makde_defs.mk and make_defs.cmake. AVX512VNNI compiler flag is only
    supported from gcc version 8 or greater. So added another else condition
    for gcc version greater than or equal to 7 - enabling avx512 flags.
    This enables compilation of AVX512 assembly code paths with gcc 7.5 version.

Change-Id: I2cda00e578010db5e5a515b506c0b99f685307e0
2024-02-26 05:20:35 -05:00
mkadavil
01b7f8c945 Matrix Add post-operation support for integer(s16|s32) LPGEMM APIs.
-This post-operation computes C = (beta*C + alpha*A*B) + D, where D is
a matrix with dimensions and data type the same as that of C matrix.
-For clang compilers (including aocc), -march=znver1 is not enabled for
zen kernels. Have updated CKVECFLAGS to capture the same.

AMD-Internal: [SWLCSG-2424]
Change-Id: Ie369f7ea5c80ab69eea3f3e03a8d9546e14f5c09
2024-02-12 23:51:36 +05:30
Edward Smyth
f93ccb0cea BLIS: zen5 cpuid and arch changes
Implement initial support for Zen5 systems:
- Detect new Zen5 AVXVNNI, AVX512VP2INTERSECT, MOVDIRI and MOVDIR64B
  instructions.
- Assume for now that Zen5 will use Zen4 code path. BLIS_ARCH_TYPE=zen5
  will therefore function as an alias for BLIS_ARCH_TYPE=zen4, but
  different hardware model will still be detected.

AMD-Internal: [CPUPL-3518]
Change-Id: I00fb413d743f152a5412ace3e740df1fd39a1600
2024-01-17 11:41:15 -05:00
Edward Smyth
ed5010d65b Code cleanup: AMD copyright notice
Standardize format of AMD copyright notice.

AMD-Internal: [CPUPL-3519]
Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0
2023-11-23 08:54:31 -05:00
Edward Smyth
f471615c66 Code cleanup: No newline at end of file
Some text files were missing a newline at the end of the file.
One has been added.

AMD-Internal: [CPUPL-3519]
Change-Id: I4b00876b1230b036723d6b56755c6ca844a7ffce
2023-11-22 17:11:10 -05:00
Eleni Vlachopoulou
75a4d2f72f CMake: Adding new portable CMake system.
- A completely new system, made to be closer to Make system.

AMD-Internal: [CPUPL-2748]
Change-Id: I83232786406cdc4f0a0950fb6ac8f551e5968529
2023-11-09 15:49:45 +05:30
Shubham Sharma
ffa8f584be Added ZTRSM AVX512 native path kernels
- Added 4x12 ZGEMM row-preferred kernel.
- Added 4x12 ZTRSM row-preferred lower
  and upper kernels using AVX512 ISA.
- These kernels are used for ZTRSM only, zgemm
  still uses 12x4 kernel.
- Kernels support row/col/gen storage.
- Kernels support A prefetch, B prefetch,
  A_next prefetch, B_next prefetch and c prefetch.
- B prefetch, B_next prefetch and C prefetch
  are enabled by default.
- Updated CMakeLists.txt with ZGEMM kernels for
  windows build.
AMD-Internal: [CPUPL-3781]

Change-Id: I0fb4b2ec2f4bd66db6499c25f12bcc4bdb09804a
2023-11-03 09:42:24 -04:00
Edward Smyth
f5505be9f3 Merge commit 'e366665c' into amd-main
* commit 'e366665c':
  Fixed stale API calls to membrk API in gemmlike.
  Fixed bli_init.c compile-time error on OSX clang.
  Fixed configure breakage on OSX clang.
  Fixed one-time use property of bli_init() (#525).
  CREDITS file update.
  Added Graviton2 Neoverse N1 performance results.
  Remove unnecesary windows/zen2 directory.
  Add vzeroupper to Haswell microkernels. (#524)
  Fix Win64 AVX512 bug.
  Add comment about make checkblas on Windows
  CREDITS file update.
  Test installation in Travis CI
  Add symlink to blis.pc.in for out-of-tree builds
  Revert "Always run `make check`."
  Always run `make check`.
  Fixed configure script bug. Details: - Fixed kernel list string substitution error by adding function substitute_words in configure script.   if the string contains zen and zen2, and zen need to be replaced with another string, then zen2   also be incorrectly replaced.
  Update POWER10.md
  Rework POWER10 sandbox
  Skip clearing temp microtile in gemmlike sandbox.
  Fix asm warning
  Sandbox header edits trigger full library rebuild.
  Add vhsubpd/vhsubpd.
  Fixed bugs in cpackm kernels, gemmlike code.
  Armv8A Rename Regs for Safe Darwin Compile
  Armv8A Rename Regs for Clang Compile: FP32 Part
  Armv8A Rename Regs for Clang Compile: FP64 Part
  Asm Flag Mingling for Darwin_Aarch64
  Added a new 'gemmlike' sandbox.
  Updated Fugaku (a64fx) performance results.
  Add explicit compiler check for Windows.
  Remove `rm-dupls` function in common.mk.
  Travis CI Revert Unnecessary Extras from 91d3636
  Adjust TravisCI
  Travis Support Arm SVE
  Added 512b SVE-based a64fx subconfig + SVE kernels.
  Replace bli_dlamch with something less archaic (#498)
  Allow clang for ThunderX2 config

AMD-Internal: [CPUPL-2698]
Change-Id: I561ca3959b7049a00cc128dee3617be51ae11bc4
2023-10-18 09:09:54 -04:00
Shubham Sharma
9a2a4151ac Added improved ZTRSM AVX2 kernels
- Added 2x6 ZGEMM row-preferred kernel.
  - Kernel supports prefetch_a, prefetch_b,
    prefetch_a_next and prefetch_b_next.
  - Multiple Ways to prefetch c are supported.
  - prefetch_a and prefetch_c are enabled by
    default.
  - K loop is divided into multiple subloops for
    better c prefetch.
- Added 2x6 ZTRSM row-preferred lower
  and upper kernels using AVX2 ISA.
- These kernels are used for ZTRSM only, zgemm
  still uses 3x4 kernel.
- Kernels support row/col/gen storage.
- Updated the zen3 and zen4 config to enable
  use of these kernels for TRSM in zen3 and
  zen4 path.
- Updated CMakeLists.txt with ZGEMM kernels for
  windows build.

AMD-Internal: [CPUPL-3781]

Change-Id: I236205f63a7f6b60bf1a5127a677d27425511e73
2023-10-13 07:43:33 -04:00
Meghana Vankadari
3a71550bc3 Enabling SUP blocksizes & kernels for generic config
Details:
- pack and compute extension APIs derive blocksizes(MR, NR...) from
  SUP cntx.
- SUP blocksizes are not set for generic/skx configs. As a result pack
  and compute APIs cause floating point exceptions.
- To fix these issues, we have enabled non-zero SUP blocksizes for
  generic config and zen4 SUP blocksizes for skx config.
- However, these changes will not enable SUP path for skx/generic config
  as thresholds are set to zero.
- To enable SUP path for skx config, more work is needed like non-zero
  thresholds and modifications to build system.

Change-Id: I54483ab0c196845ca175b8cb8deeb9e9ac2a42b9
2023-10-12 05:27:10 -04:00
Edward Smyth
85f2bf6c4a Fix for x86_64 builds
Configuration x86_64 includes all Intel and AMD sub-configurations.
Fixes to enable this to work correctly again are:
- In config_registry use amdzen rather than amd64 in x86_64 family.
- Copy settings from config/amdzen/bli_family_amdzen.h to
  config/x86_64/bli_family_x86_64.h
- Modify configure to set enable_aocl_zen=yes for x86_64, but not
  for amd64_legacy.
- Add "if defined(BLIS_FAMILY_X86_64)" to frame/3/bli_l3_sup.c and
  frame/3/bli_l3_sup_int_amd.c so zen-specific code paths are
  enabled.

Note: sub-configurations knl and bulldozer use instructions that are
not supported on most x86_64 processors.

AMD-Internal: [CPUPL-3838]
Change-Id: I0bd8fd89ccd846f80e5491ef44ade7d409970b04
2023-10-09 07:24:21 -04:00