Commit Graph

25 Commits

Author SHA1 Message Date
Shubham Sharma
f8c83fedb6 Added new ZTRSM small code path for ZEN5
- Added new ZTRSM kernels for right and left variants.
- Kernel dimensions are 12x4.
- 12x4 ZGEMM SUP kernels are used internally
  for solving GEMM subproblem.
- These kernels do not support conjugate transpose.
- Only column major inputs are supported.
- Tuned thresholds to pick efficent code path for ZEN5.

AMD-Internal: [CPUPL-6356]
Change-Id: I33ba3d337b0fcd972ca9cfe4668cb23d2b279b6e
2025-02-06 18:01:10 +05:30
Shubham Sharma
7695561f4e Tuned DGEMM blocksizes for ZEN5
- In the existing code, blocksizes for sizes where M >> K, N >> K and K < 500
  were not tuned properly for cases when application would use more than
  one instance of blis in parallel.
- Imporved DGEMM performane for sizes where M, N >> k by retuning blocksizes.
  Such sizes are used by applications like HPL.

AMD-Internal: [SWLCSG-3338]
Change-Id: Iec17ecc53a6fabf50eedacaf208e4e74a4e21418
2025-02-03 05:40:07 -05:00
Vignesh Balasubramanian
fb6dcc4edb Support for Tiny-GEMM interface(ZGEMM)
- As part of AOCL-BLAS, there exists a set of vectorized
  SUP kernels for GEMM, that are performant when invoked
  in a bare-metal fashion.

- Designed a macro-based interface for handling tiny
  sizes in GEMM, that would utilize there kernels. This
  is currently instantiated for 'Z' datatype(double-precision
  complex).

- Design breakdown :
  - Tiny path requires the usage of AVX2 and/or AVX512
    SUP kernels, based on the micro-architecture. The
    decision logic for invoking tiny-path is specific
    to the micro-architecture. These thresholds are defined
    in their respective configuration directories(header files).

  - List of AVX2/AVX512 SUP kernels(lookup table), and their
    lookup functions are defined in the base-architecture from
    which the support starts. Since we need to support backward
    compatibility when defining the lookup table/functions, they
    are present in the kernels folder(base-architecture).

- Defined a new type to be used to create the lookup table and its
  entries. This type holds the kernel pointer, blocking dimensions
  and the storage preference.

- This design would only require the appropriate thresholds and
  the associated lookup table to be defined for the other datatypes
  and micro-architecture support. Thus, is it extensible.

- NOTE : The SUP kernels that are listed for Tiny GEMM are m-var
         kernels. Thus, the blocking in framework is done accordingly.
         In case of adding the support for n-var, the variant
         information could be encoded in the object definition.

- Added test-cases to validate the interface for functionality(API
  level tests). Also added exception value tests, which have been
  disabled due to the SUP kernel optimizations.

AMD-Internal: [CPUPL-6040][CPUPL-6018][CPUPL-5319][CPUPL-3799]
Change-Id: I84f734f8e683c90efa63f2fa79d2c03484e07956
2025-01-24 12:59:26 -05:00
harsh dave
7510e27007 DGEMM Optimizations
Refined thresholds to decide between native and sup DGEMM code-paths for both zen4 and zen5 processors.

AMD-Internal: [CPUPL-6300]
Change-Id: Ib32a256dba99a0a92b7ecaa7684443a66c459566
2025-01-13 01:09:39 -05:00
Edward Smyth
97ede96ed4 Correct duplicate object file names
Some kernel file names were the same for different sub-configurations,
which could result in duplicate copies of the same object being archived
depending upon the order of (re-)compiling the source files. Rename the
files to be specific to each sub-configuration to avoid this problem.

AMD-Internal: [CPUPL-5895]
Change-Id: I182ac706e04a364f1df20fd0fb5b633eb10eeafb
2025-01-10 06:03:36 -05:00
Shubham Sharma
8f99d8a5bb Fixed warnings and compilation issues with GCC in TRSM
- Current implementation uses macros to expand the code at
  compile time, but this is causing some false warning in GCC12 and 14.
- Added switch case in trsm right variants for n_remainder.
- This ensures that n_rem is compile time constant, therefore
   warnings related to array subscript out of bounds are fixed.
- mtune=znver3 flag is causing compilation issue in GCC 9.1,
  therefore this flag is removed.
- Remaned the file bli_trsm_small to bli_trsm_small_zen5 in order
  to avoid possibily of missing symbols.

AMD-Internal: [CPUPL-6199]
Change-Id: Ib8e90196ce0a41d38c2b29226df5ab6c2d8ba996
2024-12-18 06:22:05 -05:00
Shubham Sharma
050e5a382f Fixed warning for GCC 12+
- Warnings in DTRSM  kernel caused by uninitialized registers
   and extra loop unroll is fixed.
- Warning in DGEMM kernel caused by extra space is fixed.

Change-Id: I1d9cfaa0b2847f5fdbe8b343a462d67a3aca0819
2024-12-17 01:44:41 -05:00
Shubham Sharma
beaea1b88f Added new DTRSM small code path for ZEN5
- Added new DTRSM kernels for right  and left variants.
- Kernel dimensions are 24x8.
- 24x8 DGEMM SUP kernels are used internally
  for solving GEMM subproblem.
- Tuned thresholds to pick efficent code path for ZEN5.

AMD-Internal: [CPUPL-6016]
Change-Id: I743d6dc47717952c2913085c0db3454ae9d046db
2024-12-16 10:38:45 +05:30
Shubham Sharma.
be6fbadd95 BlockSize Tuning for ZEN4 and ZEN5
- Enabled dynamic blocksizes for DGEMM in ZEN4 and ZEN5 systems.
- MC, KC and NC are dynamically selected at runtime for DGEMM native.
- A local copy of cntx is created and blocksizes are updated in the local cntx.
- Updated threshold for picking DGEMM SUP kernel for ZEN4.

AMD-Internal: [CPUPL-5912]
Change-Id: Ic12a1a48bfa59af26cc17ccfa47a2a33fadde1f6
2024-11-29 03:19:16 -05:00
Shubham Sharma
f2320a1fef Enabled DGEMM row major kernel for ZEN4
- Merged ZEN4 and ZEN5 DGEMM 8x24 kernel.
- Replaced 32x6 kernel with 8x24. Now same
  kernel is used for ZEN4 and ZEN5.
- Blocksizes have been tuned for genoa only.
- DGEMM kernel for DTRSM native code path
  is replaced with 8x24 kernel.
- Enabled alpha scaling during packing for ZEN4.
- ZEN4 8x24 kernel has been removed.

AMD-Internal: [CPUPL-5912]
Change-Id: I89a16a7e3355af037d21d453aabf53c5ecccb754
2024-11-29 08:18:48 +00:00
Shubham Sharma
082081658f BugFix: Fixed extreme value handling in AVX512 DGEMM kernel
- Extreme values are not handled correctly when beta == 0 and C is
  column major stored.
- For checking if beta is zero, VCOMISD(XMM(1), XMM(2)) is used,
  beta(XMM1) is compared with zero(XMM2),
  for column major C, setting of xmm2 to zero was missed.
- XMM2 is set to zero after the jump to column major stored C code
  is made, this skips the setting of XMM2 to zero for column major
  C.
- This is fixed by setting XMM2 to zero before the column major jump.

AMD-Internal: [CPUPL-5851]
Change-Id: Ic511071fbc82a082fa48a1543c0c7325eaf75cb8
2024-11-29 08:13:57 +00:00
Shubham Sharma.
bc3238e21e BugFixes in ZEN5 DGEMM kernel
- Changed fringe cases to use ZEN5 DGEMM kernel instead
  of ZEN4 kernel.
- ASAN reporting error when RBP is used even when
  -fno-stack-pointer flag is used, therefore replaced RBP
   register with R11 register.
- Added missing RDX register in clobber list which is causing
  failures with AOCC compiler.

Thanks to harsh.dave@amd.com for debugging some of the issues.

AMD-Internal: [CPUPL-5851]
Change-Id: I0ee412c97c9dbfb3e7a736a10bfd93d775779b5b
2024-11-29 00:22:41 -05:00
Shubham Sharma
266bd32dea Enable fringe case handling in DGEMM ZEN5 macro kernel
- Generic kernel is used if N is not multiple of NR
  or M is not multiple of MR.
- This limit the maximum values of NR that can be used.
- Support for fringe case handling is added in DGEMM
  macro kernel so that macro kernel can be used for
  all problem sizes.

AMD-Internal: [CPUPL-5912]
Change-Id: I85c17e91d7511bb35ffed0f346d6ff0376baf62f
2024-11-29 00:22:33 -05:00
Shubham Sharma
1833ee70cd Fixed bug in bli_dgemm_avx512 8x24 native kernel.
-  Data-type of m, n, k,ldc is dim_t which will be int32_t for LP64 case.
-  When loading 64-bit registers using "mov" instructions, mov(rax, var(m)),
   the "m" should be 64-bit otherwise incorrect values gets loaded.

Fix: We typecast these variables to int64_t before loading into registers.

AMD-Internal: [CPUPL-5819]
Change-Id: I16043ac168a79ff9358c0c1768989a81e3c6b0e0
2024-09-23 04:54:04 -04:00
Edward Smyth
89f52a6df5 Code cleanup: spelling corrections
Corrections for spelling and other mistakes in code comments
and doc files.

AMD-Internal: [CPUPL-4500]
Change-Id: I33e28932b0e26bbed850c55602dee12fd002da7f
2024-08-05 16:18:51 -04:00
Edward Smyth
82bdf7c8c7 Code cleanup: Copyright notices
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
  zen, skx and a couple of other kernels to cover all
  contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
  statements.

AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
2024-08-05 15:35:08 -04:00
Vignesh Balasubramanian
9843bd0317 Tuning the decision logic to choose SUP vs Native for ZGEMM
- Added an additional decision logic to choose between SUP and
  Native paths for zen4 and zen5 micro-architectures, based on
  the input dimensions. This logic has been added to the
  architecture-specific thresholds functions, that are registered
  in the context.

- The decision logic will overrule the discrete thresholds present
  in the zen4 and zen5 contexts.

AMD-Internal: [CPUPL-5547]
Change-Id: I475f19b110064b3b9eef2e03bbdc21f4dd826c03
2024-08-03 19:08:07 +05:30
Shubham Sharma
0d95fcf20c Revert "DGEMM Native AVX512 updates"
This reverts commit f378fc57b5.

Reason for revert: Causing Failure

AMD-Internal: [CPUPL-5262]
Change-Id: I15860eabf2461fae3d0f7cedd436d4db2df5b82f
2024-08-02 07:32:28 -04:00
Ruchika Ashtankar
92fbd04238 DGEMM SUP Optimizations for Turin
- Introduced a new 24x8 column preferred DGEMM sup kernel for zen5.
- A prefetch logic is modified compared to zen4 24x8 sup kernels.
- Earlier, next panel of A is prefetched into L2 cache,
  which is now modified to prefetching the second next column
  of the current panel of A into L1 cache.
- B and C prefetches are enabled and unchanged.
- Tuned MC, KC and NC block sizes for new kernel.

AMD-Internal: [CPUPL-5262]
Change-Id: If933537e50f43f5560e0fe18a716aa1e36ced64d
2024-08-02 04:00:51 -04:00
Ruchika Ashtankar
5760e06100 Threshold tuning for DGEMM SUP for zen5
- New Decision threshold constants are added to decide between
double precision sup vs native dgemm code-path for zen5 processors.
- The decision is based on the values of m, n and k.

AMD-Internal: [CPUPL-5262]
Change-Id: I87b8ff9eb603d6fda0875e000f7ab83b22d22040
2024-08-02 11:34:32 +05:30
Shubham Sharma.
f378fc57b5 DGEMM Native AVX512 updates
- In the initial patch - for m, n non-multiple of MR and NR
  respectively we are calling bli_dgemm_ker_var2. Now we have
  implemented macro-kernel for these fringe cases as well.
- Replaced RBP register with R11 in the macro-kernel.
- Retuned MC, KC and NC with these new changes.
  This will result in better performance for matrix sizes
  like m=4000 or greater when running on single thread.


AMD-Internal: [CPUPL-5262]
Change-Id: I66c111ceb7feee776703339680d57e8d6d5c809a
2024-07-31 12:23:34 -04:00
Shubham Sharma.
a7744361e4 DGEMM optimizations for Turin Classic
- Introduced new 8x24 macro kernels.
   - 4 new kernels are added for beta 0, beta 1, beta -1
      and beta N.
   - IR and JR loop moved to ASM region.
   - Kernels support row major storage scheme.
   - Prefetch of current micro panel of C is enabled.
   - Kernel supports negative offsets for A and B matrices.
 - Moved alpha scaling from DGEMM kernel to B pack kernel.
 - Tuned blocksizes for new kernel.
 - Added support for alpha scaling in 24xk pack kernel.
 - Reverted back to old b_next computation
   in gemm_ker_var2.
 - BugFix in 8x24 DGEMM kernel for beta 1,
   comparsion for jmp conditions was done using integer
   instructions, which caused beta 1 path to never be taken.
   Fixed this by changing the comparsion to double.

AMD-Internal: [CPUPL-5262]
Change-Id: Ieec207eea2a164603c8a8ea88e0b1d3095c29a3f
2024-07-09 07:53:27 -04:00
Shubham Sharma
1d6dd726cd Fixed Prefetch in Turin DGEMM kernel
- Fixed the prefetch of next micro panel
  of B matrix in 8x24 DGEMM kernel.

Change-Id: Id84bb2841abb86bda780062d67266377fda12038
2024-06-20 10:31:08 +05:30
Shubham Sharma.
580282e655 DGEMM optimizations for Turin Classic
- Introduced new 8x24 row preferred kernel for zen5.
  - Kernel supports row/col/gen
    storage schemes.
  - Prefetch of current panel of A and C
    are enabled.
  - Prefetch of next panel of B is enabled.
  - Kernel supports negative offsets for A and B
    matrices.
- Cache block tuning is done for zen5 core.

AMD-Internal: [CPUPL-5262]
Change-Id: I058ea7e1b751c20c516d7b27a1f27cef96ef730f
2024-06-17 05:18:49 -04:00
Edward Smyth
2450a1813b BLIS: Implement zen5 sub-configuration
Implement full support for zen5 as a separate BLIS sub-configuration
and code path within amdzen configuration family.

AMD-Internal: [CPUPL-3518]
Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09
2024-04-12 07:26:31 -04:00