Commit Graph

17 Commits

Author SHA1 Message Date
Shubham Sharma
f8c83fedb6 Added new ZTRSM small code path for ZEN5
- Added new ZTRSM kernels for right and left variants.
- Kernel dimensions are 12x4.
- 12x4 ZGEMM SUP kernels are used internally
  for solving GEMM subproblem.
- These kernels do not support conjugate transpose.
- Only column major inputs are supported.
- Tuned thresholds to pick efficent code path for ZEN5.

AMD-Internal: [CPUPL-6356]
Change-Id: I33ba3d337b0fcd972ca9cfe4668cb23d2b279b6e
2025-02-06 18:01:10 +05:30
Shubham Sharma
8f99d8a5bb Fixed warnings and compilation issues with GCC in TRSM
- Current implementation uses macros to expand the code at
  compile time, but this is causing some false warning in GCC12 and 14.
- Added switch case in trsm right variants for n_remainder.
- This ensures that n_rem is compile time constant, therefore
   warnings related to array subscript out of bounds are fixed.
- mtune=znver3 flag is causing compilation issue in GCC 9.1,
  therefore this flag is removed.
- Remaned the file bli_trsm_small to bli_trsm_small_zen5 in order
  to avoid possibily of missing symbols.

AMD-Internal: [CPUPL-6199]
Change-Id: Ib8e90196ce0a41d38c2b29226df5ab6c2d8ba996
2024-12-18 06:22:05 -05:00
Shubham Sharma
050e5a382f Fixed warning for GCC 12+
- Warnings in DTRSM  kernel caused by uninitialized registers
   and extra loop unroll is fixed.
- Warning in DGEMM kernel caused by extra space is fixed.

Change-Id: I1d9cfaa0b2847f5fdbe8b343a462d67a3aca0819
2024-12-17 01:44:41 -05:00
Shubham Sharma
beaea1b88f Added new DTRSM small code path for ZEN5
- Added new DTRSM kernels for right  and left variants.
- Kernel dimensions are 24x8.
- 24x8 DGEMM SUP kernels are used internally
  for solving GEMM subproblem.
- Tuned thresholds to pick efficent code path for ZEN5.

AMD-Internal: [CPUPL-6016]
Change-Id: I743d6dc47717952c2913085c0db3454ae9d046db
2024-12-16 10:38:45 +05:30
Shubham Sharma
f2320a1fef Enabled DGEMM row major kernel for ZEN4
- Merged ZEN4 and ZEN5 DGEMM 8x24 kernel.
- Replaced 32x6 kernel with 8x24. Now same
  kernel is used for ZEN4 and ZEN5.
- Blocksizes have been tuned for genoa only.
- DGEMM kernel for DTRSM native code path
  is replaced with 8x24 kernel.
- Enabled alpha scaling during packing for ZEN4.
- ZEN4 8x24 kernel has been removed.

AMD-Internal: [CPUPL-5912]
Change-Id: I89a16a7e3355af037d21d453aabf53c5ecccb754
2024-11-29 08:18:48 +00:00
Shubham Sharma
082081658f BugFix: Fixed extreme value handling in AVX512 DGEMM kernel
- Extreme values are not handled correctly when beta == 0 and C is
  column major stored.
- For checking if beta is zero, VCOMISD(XMM(1), XMM(2)) is used,
  beta(XMM1) is compared with zero(XMM2),
  for column major C, setting of xmm2 to zero was missed.
- XMM2 is set to zero after the jump to column major stored C code
  is made, this skips the setting of XMM2 to zero for column major
  C.
- This is fixed by setting XMM2 to zero before the column major jump.

AMD-Internal: [CPUPL-5851]
Change-Id: Ic511071fbc82a082fa48a1543c0c7325eaf75cb8
2024-11-29 08:13:57 +00:00
Shubham Sharma.
bc3238e21e BugFixes in ZEN5 DGEMM kernel
- Changed fringe cases to use ZEN5 DGEMM kernel instead
  of ZEN4 kernel.
- ASAN reporting error when RBP is used even when
  -fno-stack-pointer flag is used, therefore replaced RBP
   register with R11 register.
- Added missing RDX register in clobber list which is causing
  failures with AOCC compiler.

Thanks to harsh.dave@amd.com for debugging some of the issues.

AMD-Internal: [CPUPL-5851]
Change-Id: I0ee412c97c9dbfb3e7a736a10bfd93d775779b5b
2024-11-29 00:22:41 -05:00
Shubham Sharma
266bd32dea Enable fringe case handling in DGEMM ZEN5 macro kernel
- Generic kernel is used if N is not multiple of NR
  or M is not multiple of MR.
- This limit the maximum values of NR that can be used.
- Support for fringe case handling is added in DGEMM
  macro kernel so that macro kernel can be used for
  all problem sizes.

AMD-Internal: [CPUPL-5912]
Change-Id: I85c17e91d7511bb35ffed0f346d6ff0376baf62f
2024-11-29 00:22:33 -05:00
Shubham Sharma
1833ee70cd Fixed bug in bli_dgemm_avx512 8x24 native kernel.
-  Data-type of m, n, k,ldc is dim_t which will be int32_t for LP64 case.
-  When loading 64-bit registers using "mov" instructions, mov(rax, var(m)),
   the "m" should be 64-bit otherwise incorrect values gets loaded.

Fix: We typecast these variables to int64_t before loading into registers.

AMD-Internal: [CPUPL-5819]
Change-Id: I16043ac168a79ff9358c0c1768989a81e3c6b0e0
2024-09-23 04:54:04 -04:00
Edward Smyth
89f52a6df5 Code cleanup: spelling corrections
Corrections for spelling and other mistakes in code comments
and doc files.

AMD-Internal: [CPUPL-4500]
Change-Id: I33e28932b0e26bbed850c55602dee12fd002da7f
2024-08-05 16:18:51 -04:00
Edward Smyth
82bdf7c8c7 Code cleanup: Copyright notices
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
  zen, skx and a couple of other kernels to cover all
  contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
  statements.

AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
2024-08-05 15:35:08 -04:00
Shubham Sharma
0d95fcf20c Revert "DGEMM Native AVX512 updates"
This reverts commit f378fc57b5.

Reason for revert: Causing Failure

AMD-Internal: [CPUPL-5262]
Change-Id: I15860eabf2461fae3d0f7cedd436d4db2df5b82f
2024-08-02 07:32:28 -04:00
Ruchika Ashtankar
92fbd04238 DGEMM SUP Optimizations for Turin
- Introduced a new 24x8 column preferred DGEMM sup kernel for zen5.
- A prefetch logic is modified compared to zen4 24x8 sup kernels.
- Earlier, next panel of A is prefetched into L2 cache,
  which is now modified to prefetching the second next column
  of the current panel of A into L1 cache.
- B and C prefetches are enabled and unchanged.
- Tuned MC, KC and NC block sizes for new kernel.

AMD-Internal: [CPUPL-5262]
Change-Id: If933537e50f43f5560e0fe18a716aa1e36ced64d
2024-08-02 04:00:51 -04:00
Shubham Sharma.
f378fc57b5 DGEMM Native AVX512 updates
- In the initial patch - for m, n non-multiple of MR and NR
  respectively we are calling bli_dgemm_ker_var2. Now we have
  implemented macro-kernel for these fringe cases as well.
- Replaced RBP register with R11 in the macro-kernel.
- Retuned MC, KC and NC with these new changes.
  This will result in better performance for matrix sizes
  like m=4000 or greater when running on single thread.


AMD-Internal: [CPUPL-5262]
Change-Id: I66c111ceb7feee776703339680d57e8d6d5c809a
2024-07-31 12:23:34 -04:00
Shubham Sharma.
a7744361e4 DGEMM optimizations for Turin Classic
- Introduced new 8x24 macro kernels.
   - 4 new kernels are added for beta 0, beta 1, beta -1
      and beta N.
   - IR and JR loop moved to ASM region.
   - Kernels support row major storage scheme.
   - Prefetch of current micro panel of C is enabled.
   - Kernel supports negative offsets for A and B matrices.
 - Moved alpha scaling from DGEMM kernel to B pack kernel.
 - Tuned blocksizes for new kernel.
 - Added support for alpha scaling in 24xk pack kernel.
 - Reverted back to old b_next computation
   in gemm_ker_var2.
 - BugFix in 8x24 DGEMM kernel for beta 1,
   comparsion for jmp conditions was done using integer
   instructions, which caused beta 1 path to never be taken.
   Fixed this by changing the comparsion to double.

AMD-Internal: [CPUPL-5262]
Change-Id: Ieec207eea2a164603c8a8ea88e0b1d3095c29a3f
2024-07-09 07:53:27 -04:00
Shubham Sharma
1d6dd726cd Fixed Prefetch in Turin DGEMM kernel
- Fixed the prefetch of next micro panel
  of B matrix in 8x24 DGEMM kernel.

Change-Id: Id84bb2841abb86bda780062d67266377fda12038
2024-06-20 10:31:08 +05:30
Shubham Sharma.
580282e655 DGEMM optimizations for Turin Classic
- Introduced new 8x24 row preferred kernel for zen5.
  - Kernel supports row/col/gen
    storage schemes.
  - Prefetch of current panel of A and C
    are enabled.
  - Prefetch of next panel of B is enabled.
  - Kernel supports negative offsets for A and B
    matrices.
- Cache block tuning is done for zen5 core.

AMD-Internal: [CPUPL-5262]
Change-Id: I058ea7e1b751c20c516d7b27a1f27cef96ef730f
2024-06-17 05:18:49 -04:00