- Added new ZTRSM kernels for right and left variants.
- Kernel dimensions are 12x4.
- 12x4 ZGEMM SUP kernels are used internally
for solving GEMM subproblem.
- These kernels do not support conjugate transpose.
- Only column major inputs are supported.
- Tuned thresholds to pick efficent code path for ZEN5.
AMD-Internal: [CPUPL-6356]
Change-Id: I33ba3d337b0fcd972ca9cfe4668cb23d2b279b6e
- In the existing code, blocksizes for sizes where M >> K, N >> K and K < 500
were not tuned properly for cases when application would use more than
one instance of blis in parallel.
- Imporved DGEMM performane for sizes where M, N >> k by retuning blocksizes.
Such sizes are used by applications like HPL.
AMD-Internal: [SWLCSG-3338]
Change-Id: Iec17ecc53a6fabf50eedacaf208e4e74a4e21418
- As part of AOCL-BLAS, there exists a set of vectorized
SUP kernels for GEMM, that are performant when invoked
in a bare-metal fashion.
- Designed a macro-based interface for handling tiny
sizes in GEMM, that would utilize there kernels. This
is currently instantiated for 'Z' datatype(double-precision
complex).
- Design breakdown :
- Tiny path requires the usage of AVX2 and/or AVX512
SUP kernels, based on the micro-architecture. The
decision logic for invoking tiny-path is specific
to the micro-architecture. These thresholds are defined
in their respective configuration directories(header files).
- List of AVX2/AVX512 SUP kernels(lookup table), and their
lookup functions are defined in the base-architecture from
which the support starts. Since we need to support backward
compatibility when defining the lookup table/functions, they
are present in the kernels folder(base-architecture).
- Defined a new type to be used to create the lookup table and its
entries. This type holds the kernel pointer, blocking dimensions
and the storage preference.
- This design would only require the appropriate thresholds and
the associated lookup table to be defined for the other datatypes
and micro-architecture support. Thus, is it extensible.
- NOTE : The SUP kernels that are listed for Tiny GEMM are m-var
kernels. Thus, the blocking in framework is done accordingly.
In case of adding the support for n-var, the variant
information could be encoded in the object definition.
- Added test-cases to validate the interface for functionality(API
level tests). Also added exception value tests, which have been
disabled due to the SUP kernel optimizations.
AMD-Internal: [CPUPL-6040][CPUPL-6018][CPUPL-5319][CPUPL-3799]
Change-Id: I84f734f8e683c90efa63f2fa79d2c03484e07956
Refined thresholds to decide between native and sup DGEMM code-paths for both zen4 and zen5 processors.
AMD-Internal: [CPUPL-6300]
Change-Id: Ib32a256dba99a0a92b7ecaa7684443a66c459566
Some kernel file names were the same for different sub-configurations,
which could result in duplicate copies of the same object being archived
depending upon the order of (re-)compiling the source files. Rename the
files to be specific to each sub-configuration to avoid this problem.
AMD-Internal: [CPUPL-5895]
Change-Id: I182ac706e04a364f1df20fd0fb5b633eb10eeafb
- Current implementation uses macros to expand the code at
compile time, but this is causing some false warning in GCC12 and 14.
- Added switch case in trsm right variants for n_remainder.
- This ensures that n_rem is compile time constant, therefore
warnings related to array subscript out of bounds are fixed.
- mtune=znver3 flag is causing compilation issue in GCC 9.1,
therefore this flag is removed.
- Remaned the file bli_trsm_small to bli_trsm_small_zen5 in order
to avoid possibily of missing symbols.
AMD-Internal: [CPUPL-6199]
Change-Id: Ib8e90196ce0a41d38c2b29226df5ab6c2d8ba996
- Warnings in DTRSM kernel caused by uninitialized registers
and extra loop unroll is fixed.
- Warning in DGEMM kernel caused by extra space is fixed.
Change-Id: I1d9cfaa0b2847f5fdbe8b343a462d67a3aca0819
- Added new DTRSM kernels for right and left variants.
- Kernel dimensions are 24x8.
- 24x8 DGEMM SUP kernels are used internally
for solving GEMM subproblem.
- Tuned thresholds to pick efficent code path for ZEN5.
AMD-Internal: [CPUPL-6016]
Change-Id: I743d6dc47717952c2913085c0db3454ae9d046db
- Enabled dynamic blocksizes for DGEMM in ZEN4 and ZEN5 systems.
- MC, KC and NC are dynamically selected at runtime for DGEMM native.
- A local copy of cntx is created and blocksizes are updated in the local cntx.
- Updated threshold for picking DGEMM SUP kernel for ZEN4.
AMD-Internal: [CPUPL-5912]
Change-Id: Ic12a1a48bfa59af26cc17ccfa47a2a33fadde1f6
- Merged ZEN4 and ZEN5 DGEMM 8x24 kernel.
- Replaced 32x6 kernel with 8x24. Now same
kernel is used for ZEN4 and ZEN5.
- Blocksizes have been tuned for genoa only.
- DGEMM kernel for DTRSM native code path
is replaced with 8x24 kernel.
- Enabled alpha scaling during packing for ZEN4.
- ZEN4 8x24 kernel has been removed.
AMD-Internal: [CPUPL-5912]
Change-Id: I89a16a7e3355af037d21d453aabf53c5ecccb754
- Extreme values are not handled correctly when beta == 0 and C is
column major stored.
- For checking if beta is zero, VCOMISD(XMM(1), XMM(2)) is used,
beta(XMM1) is compared with zero(XMM2),
for column major C, setting of xmm2 to zero was missed.
- XMM2 is set to zero after the jump to column major stored C code
is made, this skips the setting of XMM2 to zero for column major
C.
- This is fixed by setting XMM2 to zero before the column major jump.
AMD-Internal: [CPUPL-5851]
Change-Id: Ic511071fbc82a082fa48a1543c0c7325eaf75cb8
- Changed fringe cases to use ZEN5 DGEMM kernel instead
of ZEN4 kernel.
- ASAN reporting error when RBP is used even when
-fno-stack-pointer flag is used, therefore replaced RBP
register with R11 register.
- Added missing RDX register in clobber list which is causing
failures with AOCC compiler.
Thanks to harsh.dave@amd.com for debugging some of the issues.
AMD-Internal: [CPUPL-5851]
Change-Id: I0ee412c97c9dbfb3e7a736a10bfd93d775779b5b
- Generic kernel is used if N is not multiple of NR
or M is not multiple of MR.
- This limit the maximum values of NR that can be used.
- Support for fringe case handling is added in DGEMM
macro kernel so that macro kernel can be used for
all problem sizes.
AMD-Internal: [CPUPL-5912]
Change-Id: I85c17e91d7511bb35ffed0f346d6ff0376baf62f
- Data-type of m, n, k,ldc is dim_t which will be int32_t for LP64 case.
- When loading 64-bit registers using "mov" instructions, mov(rax, var(m)),
the "m" should be 64-bit otherwise incorrect values gets loaded.
Fix: We typecast these variables to int64_t before loading into registers.
AMD-Internal: [CPUPL-5819]
Change-Id: I16043ac168a79ff9358c0c1768989a81e3c6b0e0
Corrections for spelling and other mistakes in code comments
and doc files.
AMD-Internal: [CPUPL-4500]
Change-Id: I33e28932b0e26bbed850c55602dee12fd002da7f
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
zen, skx and a couple of other kernels to cover all
contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
statements.
AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
- Added an additional decision logic to choose between SUP and
Native paths for zen4 and zen5 micro-architectures, based on
the input dimensions. This logic has been added to the
architecture-specific thresholds functions, that are registered
in the context.
- The decision logic will overrule the discrete thresholds present
in the zen4 and zen5 contexts.
AMD-Internal: [CPUPL-5547]
Change-Id: I475f19b110064b3b9eef2e03bbdc21f4dd826c03
- Introduced a new 24x8 column preferred DGEMM sup kernel for zen5.
- A prefetch logic is modified compared to zen4 24x8 sup kernels.
- Earlier, next panel of A is prefetched into L2 cache,
which is now modified to prefetching the second next column
of the current panel of A into L1 cache.
- B and C prefetches are enabled and unchanged.
- Tuned MC, KC and NC block sizes for new kernel.
AMD-Internal: [CPUPL-5262]
Change-Id: If933537e50f43f5560e0fe18a716aa1e36ced64d
- New Decision threshold constants are added to decide between
double precision sup vs native dgemm code-path for zen5 processors.
- The decision is based on the values of m, n and k.
AMD-Internal: [CPUPL-5262]
Change-Id: I87b8ff9eb603d6fda0875e000f7ab83b22d22040
- In the initial patch - for m, n non-multiple of MR and NR
respectively we are calling bli_dgemm_ker_var2. Now we have
implemented macro-kernel for these fringe cases as well.
- Replaced RBP register with R11 in the macro-kernel.
- Retuned MC, KC and NC with these new changes.
This will result in better performance for matrix sizes
like m=4000 or greater when running on single thread.
AMD-Internal: [CPUPL-5262]
Change-Id: I66c111ceb7feee776703339680d57e8d6d5c809a
- Introduced new 8x24 macro kernels.
- 4 new kernels are added for beta 0, beta 1, beta -1
and beta N.
- IR and JR loop moved to ASM region.
- Kernels support row major storage scheme.
- Prefetch of current micro panel of C is enabled.
- Kernel supports negative offsets for A and B matrices.
- Moved alpha scaling from DGEMM kernel to B pack kernel.
- Tuned blocksizes for new kernel.
- Added support for alpha scaling in 24xk pack kernel.
- Reverted back to old b_next computation
in gemm_ker_var2.
- BugFix in 8x24 DGEMM kernel for beta 1,
comparsion for jmp conditions was done using integer
instructions, which caused beta 1 path to never be taken.
Fixed this by changing the comparsion to double.
AMD-Internal: [CPUPL-5262]
Change-Id: Ieec207eea2a164603c8a8ea88e0b1d3095c29a3f
- Introduced new 8x24 row preferred kernel for zen5.
- Kernel supports row/col/gen
storage schemes.
- Prefetch of current panel of A and C
are enabled.
- Prefetch of next panel of B is enabled.
- Kernel supports negative offsets for A and B
matrices.
- Cache block tuning is done for zen5 core.
AMD-Internal: [CPUPL-5262]
Change-Id: I058ea7e1b751c20c516d7b27a1f27cef96ef730f
Implement full support for zen5 as a separate BLIS sub-configuration
and code path within amdzen configuration family.
AMD-Internal: [CPUPL-3518]
Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09