* Adding a model to determine which matrices enter the SGEMM tiny path
* This extends the sizes of matrices that enter the tiny path, which was constrained to the L1 cache size previously
* Now matrices that fit in L2 are also allowed into the tiny path, provided they are determined to be faster than the SUP path
* Adding thresholds based on the SUP path sizes
* Added for Zen4 and Zen5
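The gating rule described above might be sketched as follows; the cache sizes, names and the "faster than SUP" predicate are illustrative placeholders, not the actual tuned model:

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative cache sizes; the real model uses per-architecture values. */
#define L1_BYTES (32u * 1024u)
#define L2_BYTES (1024u * 1024u)

/* Total bytes touched by A (m x k), B (k x n) and C (m x n) in float. */
static size_t sgemm_footprint(size_t m, size_t n, size_t k)
{
    return sizeof(float) * (m * k + k * n + m * n);
}

/* tiny_beats_sup stands in for the new performance model's verdict. */
bool enter_tiny_path(size_t m, size_t n, size_t k, bool tiny_beats_sup)
{
    size_t bytes = sgemm_footprint(m, n, k);
    if (bytes <= L1_BYTES) return true;           /* previous rule: fits in L1 */
    if (bytes <= L2_BYTES) return tiny_beats_sup; /* new rule: fits in L2 + model */
    return false;
}
```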
---------
AMD-Internal: CPUPL-7555
Co-authored-by: Rohan Rayan <rohrayan@amd.com>
Adding SGEMM tiny path for Zen architectures.
Needed to cover some performance gaps seen w.r.t. MKL.
Only matrices where A, B and C all fit into the L1 cache are allowed into the tiny path.
Only tuned for single-threaded operation at the moment.
Todo: Tune cases where AVX2 performs better than AVX512 on Zen4.
Todo: The current ranges are very conservative; there may be scope to increase the matrix sizes that go into the tiny path.
AMD-Internal: CPUPL-7555
Co-authored-by: Rohan Rayan <rohrayan@amd.com>
* commit 'db3134ed6d239a7962a2b7470d8c46611b9d17ef':
Disabled no post-ops path in lpgemm f32 kernels for few gcc versions
DTL Log update
Add external PR integration process and flowchart to CONTRIBUTING.md
Enabled disable-sba-pools feature in AOCL-BLAS (#101)
Fix for F32 to BF16 Conversion and AVX512 ISA Support Checks
Fixed Integer Overflow Issue in TPSV
Add AI Code Review workflow (#211)
Add AI Code Review Self-enablement file (#209)
Re-tuned GEMV thresholds (#210)
Adding bli_print_msg before bli_abort() for bli_thrinfo_sup_create_for_cntl
Add missing license text
Modified AXPYV kernel to ensure consistency of numerical results (#188)
Fix memory leak in DGEMV kernel (#187)
Tuned DGEMV no-transpose thresholds (#193)
Set Security flags default enable (#194)
Standardize Zen kernel names (2)
Compiler warnings fixes (2)
coverity issue fix for ztrsm (#176)
Fixes Coverity static analysis issue in the DTRSM (#181)
Add files via upload (#197)
Initialize mem_t structures safely and handle NULL communicator in threading
Tidying code
Compiler warnings fixes
Fixing the coverity issues with CID: 23269 and CID: 137049 (#180)
Fixed high priority coverity issues in LPGEMM. (#178)
GCC 15 SUP kernel workaround (2)
Disable small_gemm for zen4/5 and added single thread check for tiny path (#167)
Optimal rerouting of GEMV inputs to avoid packing
Updated Guards in s8s8s32of32_sym_quant Framework
Fixed out-of-bound access in F32 matrix add/mul ops (#168)
Previous commit (30c42202d7) for this problem turned off
-ftree-slp-vectorize optimizations for all kernels. Instead, copy
the approach of upstream BLIS commit 36effd70b6a323856d98 and disable
these optimizations only for the affected files by using GCC pragmas
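The per-file mechanism is the standard GCC optimize-pragma pair; a minimal sketch, where the function is a hypothetical stand-in for an affected kernel:

```c
/* Disable SLP vectorization for just this region of the file. */
#pragma GCC push_options
#pragma GCC optimize ("no-tree-slp-vectorize")

float dot4(const float* a, const float* b)
{
    /* a typical SLP-vectorization candidate, kept scalar by the pragma */
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
}

#pragma GCC pop_options
```

Compared to stripping the flag from CKOPTFLAGS globally, this keeps the optimization enabled everywhere except the files that miscompile.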
AMD-Internal: [CPUPL-6579]
- Change begin_asm and end_asm comments and unused code in files
kernels/haswell/3/sup/s6x16/bli_gemmsup_rv_haswell_asm_sMx6.c
kernels/zen4/3/sup/bli_gemmsup_cd_zen4_asm_z12x4m.c
to avoid problems in clobber checking script.
- Add missing clobbers in files
kernels/zen4/1m/bli_packm_zen4_asm_d24xk.c
kernels/zen4/1m/bli_packm_zen4_asm_z12xk.c
kernels/zen4/3/sup/bli_gemmsup_cv_zen4_asm_z12x4m.c
- Add missing newline at end of files.
- Update some copyright years for recent changes.
- Standardize license text formatting.
AMD-Internal: [CPUPL-6579]
- Currently TRSM reference kernels are derived from GEMM blocksizes and GEMM_UKR.
- This does not allow the flexibility to use different GEMM_UKR for GEMM and TRSM if optimized TRSM_UKR are not available.
- Made changes so that ref TRSM kernels are derived from TRSM blocksizes.
- Changed ZEN4 and ZEN5 cntx to use AVX2 kernels for CTRSM.
AMD-Internal: [SWLCSG-3702]
* Optimized avx512 ZGEMM kernel and edge-case handling
Edge kernel implementation:
- Refactored all of the zgemm kernels to process micro-tiles efficiently
- Specialized sub-kernels are added to handle the leftover m dimension: 12MASK,
8, 8MASK, 4, 4MASK, 2.
- 12MASK edge kernel handles 11, 10, 9 m_left using 2 full zmm
load/store and 1 masked load/store.
- Similarly 8MASK handles 7, 6, 5 m_left using 1 full zmm load/store and
1 masked load/store.
- 4MASK handles 3, 1 m_left using 1 masked load/store.
- ZGEMM kernel now internally decomposes the m dimension as follows.
The main kernel is 12x4, with the following edge kernels to
handle the left-over m dimension:
edge kernels:
12MASKx4 (handles 11x4, 10x4, 9x4)
8x4 (handles 8x4)
8MASKx4 (handles 7x4, 6x4, 5x4)
4x4 (handles 4x4)
4MASKx4 (handles 3x4, 1x4)
2x4 (handles 2x4)
- Similarly, it decomposes for the (12x3, 12x2 and 12x1) n_left kernels, under
which the following edge kernels handle the leftover m dimension:
12MASKxN_LEFT(3, 2, 1), 8xN_LEFT(3, 2, 1), 8MASKxN_LEFT(3, 2, 1),
4xN_LEFT(3, 2, 1), 4MASKxN_LEFT(3, 2, 1), 2xN_LEFT(3, 2, 1).
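The m_left dispatch described above can be sketched in plain C; the labels stand in for the actual kernel function pointers, and the ranges are taken from the list in this message:

```c
#include <string.h>

/* Returns a label for the sub-kernel that would handle m_left
 * (0 < m_left < 12), per the decomposition described above. */
const char* zgemm_m_left_kernel(int m_left)
{
    if (m_left >= 9 && m_left <= 11) return "12MASK"; /* 2 full zmm + 1 masked */
    if (m_left == 8)                 return "8";      /* full 8-row kernel     */
    if (m_left >= 5 && m_left <= 7)  return "8MASK";  /* 1 full zmm + 1 masked */
    if (m_left == 4)                 return "4";
    if (m_left == 3 || m_left == 1)  return "4MASK";  /* 1 masked load/store   */
    if (m_left == 2)                 return "2";
    return "none";
}
```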
Threshold tuning:
- Enforced odd m dimensions to use the avx512 kernels in the tiny path, as the avx2
kernels invoke gemv calls for m_left=1 (odd m dimension of the matrix).
The gemv function call adds overhead for very small sizes and results
in suboptimal performance.
- A condition check "m%2 == 0" is added along with the threshold checks to
force inputs with an odd m dimension to use the avx512 zgemm kernel.
- Changed the thresholds to route all of the inputs to the tiny path, eliminating the
dependency on the avx2 zgemm_small path when the A, B matrix storage is 'N' (no
transpose) or 'T' (transpose).
- However, the tiny path reuses the zgemm sup kernels, which do not support
conjugate-transpose storage of the matrices. For such storage of the
A, B matrices we still rely on the avx2 zgemm_small kernel.
gtest changes:
- Removed the zgemm edge kernel functions (8x4, 4x4, 2x4 and fx4) and their
respective testing instances from gtest.
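The routing rule from the threshold-tuning notes above could be sketched as follows; the function name is illustrative and the flag stands in for the existing size-threshold checks:

```c
#include <stdbool.h>

/* within_tiny_thresholds stands in for the existing size checks. */
bool use_avx512_tiny_zgemm(long m, bool within_tiny_thresholds)
{
    /* odd m would hit the avx2 small path's gemv fallback for m_left = 1,
     * so it is forced to the avx512 tiny kernel regardless of thresholds */
    if (m % 2 != 0) return true;
    return within_tiny_thresholds;
}
```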
AMD-Internal: [CPUPL-7203]
---------
Co-authored-by: harsdave <harsdave@amd.com>
Naming of Zen kernels and associated files was inconsistent with BLIS
conventions for other sub-configurations and between different Zen
generations. Other anomalies existed, e.g. dgemmsup 24x column-preferred
kernels named with _rv_ instead of _cv_. This patch renames
kernels and file names to address these issues.
AMD-Internal: [CPUPL-6579]
- Updated the thresholds to enter the AVX512 Tiny and SUP codepaths
  for ZGEMM (on ZEN4). This caters to inputs that perform well under
  single-threaded execution (in the Tiny path), and inputs that
  scale well under multithreaded execution (in the SUP path).
- Also updated the thresholds to decide ideal threads, based on
'm', 'n' and 'k' values. The thread-setting logic involves
determining the number of tiles for computation, and using them
to further tune for the optimal number of threads.
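The tile-based thread heuristic described above might look like the following minimal sketch; MR/NR and the cap are placeholder values, not the tuned thresholds:

```c
/* Illustrative micro-tile dimensions; not the tuned ZEN4 values. */
#define MR 12
#define NR 4

/* Cap the thread count by the number of micro-tiles available,
 * since extra threads beyond that would sit idle. */
long ideal_threads(long m, long n, long max_threads)
{
    long tiles = ((m + MR - 1) / MR) * ((n + NR - 1) / NR);
    if (tiles < 1) tiles = 1;
    return tiles < max_threads ? tiles : max_threads;
}
```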
AMD-Internal: [CPUPL-6378][CPUPL-6661]
Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
Add macros to allow specific code options to be enabled or disabled,
controlled by options to configure and cmake. This expands on the
existing GEMM and/or TRSM functionality to enable/disable SUP handling
and replaces the hard coded #define in include files to enable small matrix
paths.
All options are enabled by default for all BLIS sub-configs but many of them
are currently only implemented in AMD specific framework code variants.
AMD-Internal: [CPUPL-6906]
---------
Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
* Improve consistency of optimized BLAS3 code
Tidy AMD optimized GEMM and TRSM framework code to reduce
differences between different data type variants:
- Improve consistency of code indentation and white space
- Added some missing AOCL_DTL calls
- Removed some dead code
- Consistent naming of variables for function return status
- GEMM: More consistent early return when k=1
- Correct data type of literal values used for single precision data
In kernels/zen/3/bli_gemm_small.c and bli_family_*.h files:
- Set default values for thresholds if not set in the relevant
bli_family_*.h file
- Remove unused definitions and commented out code
AMD-Internal: [CPUPL-6579]
- Added support for Tiny-CGEMM as part of the existing
  macro-based Tiny-GEMM interface. This involved defining
  the appropriate AVX2/AVX512 lookup tables and functions for
  the target architectures (as per the design), for compile-time
  instantiation and runtime usage.
- Also extended the current Tiny-GEMM design to incorporate packing
  kernels as part of its lookup tables. These kernels are queried
  through lookup functions and used when support for non-trivial
  storage schemes (such as dot-product computation) is wanted.
- This allows experimenting, in a plug-and-play fashion, with the
  pack and outer-product method against native inner-product implementations.
- Further updated the existing AVX512 pack routine that packs the A matrix
  (in blocks of 24xk). This utilizes masked load/store instructions to
  handle fringe cases of the input (i.e., when m < 24).
- Also added the AVX512 outer product kernels for CGEMM as part of the
ZEN4 and ZEN5 contexts, to handle RRC and CRC storage schemes. This is
facilitated through optional packing of A matrix in the SUP framework.
AMD-Internal: [CPUPL-6498]
Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
GCC 15 fails to compile some SUP kernels. The problem seems to be
related to one of the optimization phases enabled at -O2 or above.
Workaround is to disable this specific optimization by adding the
flag -fno-tree-slp-vectorize to CKOPTFLAGS.
AMD-Internal: [CPUPL-6579]
Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
Various changes to simplify and improve x86 related make_defs files:
- Make better use of common definitions in config/zen/amd_config.mk
from config/zen*/make_defs.mk files
- Similarly for config/zen/amd_config.make from the
config/zen*/make_defs.cmake files
- Pass cc_major, cc_minor and cc_revision definitions from configure
to generated config.mk file, and use these instead of defining
GCC_VERSION in config/zen*/make_defs.mk files
- Add znver3 support for LLVM 13 in config/zen3/make_defs.{mk,cmake}
- Add znver5 support for LLVM 19 in config/zen5/make_defs.{mk,cmake}
- Improve readability of haswell, intel64, skx and x86_64 files
- Correct and tidy some comments
AMD-Internal: [CPUPL-6579]
- Introduced a new assembly kernel that copies data from source
to destination from the front and back of the vector at the
same time. This kernel provides better performance for larger
input sizes.
- Added a wrapper function responsible for selecting the kernel
used by DCOPYV API to handle the given input for zen5
architecture.
- Updated AOCL-dynamic threshold for DCOPYV API in zen4 and
zen5 architectures.
- New unit-tests were included in the gtestsuite for the new
  kernel.
AMD-Internal: [CPUPL-6650]
Change-Id: Ie2af88b8e97196b6aa02c089e59247742002f568
- Added a set of AVX512 fringe kernels(using masked loads and
stores) in order to avoid rerouting to the GEMV typed API
interface(when m = 1). This ensures uniformity in performance
across the main and fringe cases, when the calls are multithreaded.
- Further tuned the thresholds to decide between ZGEMM Tiny, Small
SUP and Native paths for ZEN4 and ZEN5 architectures(in case
of parallel execution). This would account for additional
combinations of the input dimensions.
- Moved the call to Tiny-ZGEMM before the BLIS object creation,
since this code-path operates on raw buffers.
- Added the necessary test-cases for functional and memory testing
of the newly added kernels.
AMD-Internal: [CPUPL-6378][CPUPL-6661]
Change-Id: I9af73d1b6ef82b26503d4fc373111132aee3afd6
- Implemented an AVX512 rank-1 kernel that is
expected to handle column-major storage schemes
of A, B and C(without transposition) when k = 1.
- This kernel is single-threaded, and acts as a direct
call from the BLAS layer for its compatible inputs.
- Defined custom BLAS and BLIS_IMPLI layers for CGEMM
(instead of using the macro definition), in order to
integrate the call to this kernel at runtime(based on
the corresponding architecture and input constraints).
- Added unit-tests for functional and memory testing of the
kernel.
- Updated the ZEN5 context to include the AVX512 CGEMM
SUP kernels, with its cache-blocking parameters.
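For reference, the k = 1 rank-1 update this kernel computes can be written as a scalar loop; this is a reference sketch, not the AVX512 implementation:

```c
#include <complex.h>

/* C := beta*C + alpha * a * b^T, with a (m x 1), b (1 x n),
 * C (m x n) column-major with leading dimension ldc, no transposition. */
void cgemm_k1_ref(int m, int n, float complex alpha,
                  const float complex* a, const float complex* b,
                  float complex beta, float complex* c, int ldc)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++)
            c[i + j*ldc] = beta * c[i + j*ldc] + alpha * a[i] * b[j];
}
```

Because k = 1, each output element needs a single multiply-add, which is why a direct call from the BLAS layer (skipping packing and blocking) pays off here.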
AMD-Internal: [CPUPL-6498]
Change-Id: I42a66c424325bd117ceb38970726a05e2896a46b
- Implemented the following AVX512 SUP
column-preferential kernels(m-variant) for CGEMM :
Main kernel : 24x4m
Fringe kernels : 24x3m, 24x2m, 24x1m,
16x4, 16x3, 16x2, 16x1,
8x4, 8x3, 8x2, 8x1,
fx4, fx3, fx2, fx1(where 0<f<8).
- Utilized the packing kernel to pack A when
handling inputs with CRC storage scheme. This
would in turn handle RRC with operation transpose
in the framework layer.
- Further adding C prefetching to the main kernel,
and updated the cache-blocking parameters for
ZEN4 and ZEN5 contexts.
- Added a set of decision logics to choose between
SUP and Native AVX512 code-paths for ZEN4 and ZEN5
architectures.
- Updated the testing interface for complex GEMMSUP
to accept the kernel dimension(MR) as a parameter, in
order to set the appropriate panel stride for functional
and memory testing. Also updated the existing instantiators
to send their kernel dimensions as a parameter.
- Added unit tests for functional and memory testing of these
newly added kernels.
AMD-Internal: [CPUPL-6498]
Change-Id: Ie79d3d0dc7eed7edf30d8d4f74b888135f31d6b4
- Implemented the following AVX512 native
computational kernels for CGEMM :
Row-preferential : 4x24
Column-preferential : 24x4
- The implementations use a common set of macros,
defined in a separate header. This is due to the
fact that the implementations differ solely on
the matrix chosen for load/broadcast operations.
- Added the associated AVX512 based packing kernels,
packing 24xk and 4xk panels of input.
- Registered the column-preferential kernel(24x4) in
ZEN4 and ZEN5 contexts. Further updated the cache-blocking
parameters.
- Removed redundant BLIS object creation and its contingencies
in the native micro-kernel testing interface(for complex types).
Added the required unit-tests for memory and functionality
checks of the new kernels.
AMD-Internal: [CPUPL-6498]
Change-Id: I520ff17dba4c2f9bc277bf33ba9ab4384408ffe1
- Guarded the inclusion of thresholds(configuration
headers) using macros, to maintain uniformity in
the design principles.
- Updated the threshold macro names for every
micro-architecture.
AMD-Internal: [CPUPL-5895]
Change-Id: I9fd193371c41469d9ef38c37f9c055c21457b56c
- As part of AOCL-BLAS, there exists a set of vectorized
SUP kernels for GEMM, that are performant when invoked
in a bare-metal fashion.
- Designed a macro-based interface for handling tiny
  sizes in GEMM that utilizes these kernels. This
  is currently instantiated for the 'Z' datatype (double-precision
  complex).
- Design breakdown :
- Tiny path requires the usage of AVX2 and/or AVX512
SUP kernels, based on the micro-architecture. The
decision logic for invoking tiny-path is specific
to the micro-architecture. These thresholds are defined
in their respective configuration directories(header files).
- List of AVX2/AVX512 SUP kernels(lookup table), and their
lookup functions are defined in the base-architecture from
which the support starts. Since we need to support backward
compatibility when defining the lookup table/functions, they
are present in the kernels folder(base-architecture).
- Defined a new type to be used to create the lookup table and its
entries. This type holds the kernel pointer, blocking dimensions
and the storage preference.
  - This design only requires the appropriate thresholds and
    the associated lookup table to be defined in order to support
    other datatypes and micro-architectures. Thus, it is extensible.
- NOTE : The SUP kernels that are listed for Tiny GEMM are m-var
kernels. Thus, the blocking in framework is done accordingly.
In case of adding the support for n-var, the variant
information could be encoded in the object definition.
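A hypothetical shape for a lookup-table entry and its lookup function, following the description above; the names and fields are illustrative, not the actual BLIS typedefs:

```c
#include <stddef.h>

/* Hypothetical kernel signature; the real type carries more arguments. */
typedef void (*tiny_ker_ft)(size_t m, size_t n, size_t k,
                            const void* a, const void* b, void* c);

/* One lookup-table entry: kernel pointer, blocking dims, storage pref. */
typedef struct
{
    tiny_ker_ft ker;   /* m-var SUP kernel pointer       */
    size_t      mr;    /* micro-tile rows                */
    size_t      nr;    /* micro-tile columns             */
    char        pref;  /* storage preference: 'r' or 'c' */
} tiny_lookup_ent_t;

/* Lookup sketch: first entry whose nr covers the requested n. */
const tiny_lookup_ent_t*
tiny_lookup(const tiny_lookup_ent_t* tab, size_t len, size_t n)
{
    for (size_t i = 0; i < len; i++)
        if (n <= tab[i].nr) return &tab[i];
    return NULL;
}
```

Adding a new datatype or micro-architecture then only requires populating such a table and its thresholds, which is the extensibility point the design aims for.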
- Added test-cases to validate the interface for functionality(API
level tests). Also added exception value tests, which have been
disabled due to the SUP kernel optimizations.
AMD-Internal: [CPUPL-6040][CPUPL-6018][CPUPL-5319][CPUPL-3799]
Change-Id: I84f734f8e683c90efa63f2fa79d2c03484e07956
- Bug: The current {S/D}AMAXV AVX512 kernels behaved
  incorrectly with multiple absolute maximums:
  they returned the last index when there were multiple occurrences,
  instead of the first one.
- Implemented a bug-fix to handle this issue on these AVX512
kernels. Also ensured that the kernels are compliant with
the standard when handling exception values.
- Further optimized the code by decoupling the logic that finds
  the maximum element from the search for its index. This way,
  we use lower-latency instructions to compute the maximum
  first.
- Updated the unit-tests, exception value tests and early return
tests for the API to ensure code-coverage.
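The semantics the fix restores can be captured by a scalar reference; this is a sketch, with exception-value (NaN) handling omitted:

```c
#include <math.h>
#include <stddef.h>

/* Reference semantics: the FIRST index of the largest absolute value.
 * The strict '>' comparison keeps the earliest occurrence. n >= 1. */
size_t samaxv_ref(const float* x, size_t n)
{
    size_t idx = 0;
    float maxabs = fabsf(x[0]);
    for (size_t i = 1; i < n; i++)
    {
        float a = fabsf(x[i]);
        if (a > maxabs) { maxabs = a; idx = i; }
    }
    return idx;
}
```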
AMD-Internal: [CPUPL-4745]
Change-Id: I2f44d33dbaf89fe19e255af1f934877816940c6f
- Updated the threshold check for ZGEMM small path to include
runtime checks for redirection, specific to the micro-architecture.
- The current ZGEMM small path has only its AVX2 variant available.
  After implementing an AVX512 variant (same or different algorithm), the thresholds
  will be further fine-tuned.
- Included the dot-product based AVX512 ZGEMM kernels in the ZEN5
context. It will be used as part of handling RRC and CRC storage
schemes of C, A and B matrices in both single-thread and multi-thread
runs.
AMD-Internal: [CPUPL-5949]
Change-Id: Ic8b7cf0e00b7c477f748669f160c4b01df995c75
This patch introduces comprehensive optimizations to the DGEMM kernel, focusing on loop
efficiency and edge kernel performance. The following technical improvements have been implemented:
1. **IR Loop Optimization:**
- The IR loop has been re-implemented in hand-written assembly to eliminate the overhead associated
with `begin_asm` and `end_asm` calls, resulting in more efficient execution.
2. **JR Loop Integration:**
- The JR loop is now incorporated into the micro kernel. This integration avoids the repetitive overhead
of stack frame management for each JR iteration, thereby enhancing loop performance.
3. **Kernel Decomposition Strategy:**
- The m dimension is decomposed into specific sizes: 20, 18, 17, 16, 12, 11, 10, 9, 8, 4, 2, and 1.
- For remaining cases, masked variants of edge kernels are utilized to handle the decomposition efficiently.
4. **Interleaved Scaling by Alpha:**
- Scaling by the alpha factor is interleaved with load instructions to optimize the instruction pipeline
and reduce latency.
5. **Efficient Mask Preparation:**
- Masks are prepared within inline assembly code only at points where masked load-store operations are necessary,
minimizing unnecessary overhead.
6. **Broadcast Instruction Optimization:**
- In edge kernels where each FMA (Fused Multiply-Add) operation requires a broadcast without subsequent reuse,
the broadcast instruction is replaced with `mem_1to8`.
- This allows the compiler to optimize by assigning separate vector registers for broadcasting, thus avoiding
dependency chains and improving execution efficiency.
7. **C Matrix Update Optimization:**
- During the update of the C matrix in edge kernels, columns are pre-loaded into multiple vector registers.
This approach breaks dependency chains during FMA operations following the scaling by alpha, thereby mitigating
performance bottlenecks and enhancing throughput.
These optimizations collectively improve the performance of the DGEMM kernel, particularly in handling edge cases and
reducing overhead in critical loops. The changes are expected to yield significant performance gains in matrix multiplication
operations.
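The m-decomposition strategy above can be illustrated with a greedy sketch; the real dispatch order and the use of masked edge-kernel variants may differ:

```c
/* Kernel sizes from the decomposition strategy above, largest first. */
static const int dgemm_m_sizes[] = { 20, 18, 17, 16, 12, 11, 10, 9, 8, 4, 2, 1 };
static const int n_sizes = sizeof(dgemm_m_sizes) / sizeof(dgemm_m_sizes[0]);

/* Greedily split m into kernel sizes; returns the number of pieces.
 * (Remaining cases would use masked variants of the edge kernels.) */
int decompose_m(int m, int* out, int cap)
{
    int cnt = 0;
    while (m > 0 && cnt < cap)
        for (int s = 0; s < n_sizes; s++)
            if (dgemm_m_sizes[s] <= m)
            {
                out[cnt++] = dgemm_m_sizes[s];
                m -= dgemm_m_sizes[s];
                break;
            }
    return cnt;
}
```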
This patch also involves changes to the tiny gemm interface: a light
interface for calling kernels, and removal of the calls to the avx2 dgemm kernels,
as we use the avx512 dgemm kernels for all sizes on zen4 and zen5.
On zen4 and zen5, when the A matrix is transposed (CRC, RRC), the tiny kernel
does not support such inputs, so they are routed to the
gemm_small path.
AMD-Internal: [CPUPL-6054]
Change-Id: I57b430f9969ca39aa111b54fa169e4225b900c4a
- Enabled dynamic blocksizes for DGEMM in ZEN4 and ZEN5 systems.
- MC, KC and NC are dynamically selected at runtime for DGEMM native.
- A local copy of cntx is created and blocksizes are updated in the local cntx.
- Updated threshold for picking DGEMM SUP kernel for ZEN4.
AMD-Internal: [CPUPL-5912]
Change-Id: Ic12a1a48bfa59af26cc17ccfa47a2a33fadde1f6
- Merged ZEN4 and ZEN5 DGEMM 8x24 kernel.
- Replaced 32x6 kernel with 8x24. Now same
kernel is used for ZEN4 and ZEN5.
- Blocksizes have been tuned for genoa only.
- DGEMM kernel for DTRSM native code path
is replaced with 8x24 kernel.
- Enabled alpha scaling during packing for ZEN4.
- ZEN4 8x24 kernel has been removed.
AMD-Internal: [CPUPL-5912]
Change-Id: I89a16a7e3355af037d21d453aabf53c5ecccb754
- Implemented a set of column-preferential dot-product based
  ZGEMM kernels (main and fringe) in AVX512 (for the SUP code-path).
  These kernels perform matrix multiplication as a sequence
  of inner products (i.e., dot-products).
- These standalone kernels are expected to strictly handle
the CRC storage scheme for C, A and B matrices. RRC is also
supported through operation transpose, at the framework
level.
- Added unit-tests to test all the kernels(main and fringe),
as well as the redirection between these kernels.
AMD-Internal: [CPUPL-5949]
Change-Id: I858257ac2658ed9ce4980635874baa1474b79c38
This reverts commit 7d379c7879.
Reason for revert: < Perf regression is observed for GEMM(gemm_small_At)
as fma uses memory operand >
Change-Id: I0ec3a22acaacfaade860c67858be6a2ba6296bce
(cherry picked from commit 705755bb5c)
This reverts commit 7d379c7879.
Reason for revert: < Perf regression is observed for GEMM(gemm_small_At)
as fma uses memory operand >
Change-Id: I0ec3a22acaacfaade860c67858be6a2ba6296bce
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
zen, skx and a couple of other kernels to cover all
contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
statements.
AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
config => config/build/arch folder
Issue:
1. Performance drop is observed as part of the fat binary(amdzen config)
built to support all the platforms using dynamic dispatch feature.
2. Observed only in intrinsic code and not in assembly code.
3. Observed in many level-1 kernels on Milan and Genoa
Previous Design:
Znver flags are picked based on config or function name
In case of ref_kernels:
Compiler picks up znver flag based on the function name. All
ref_kernels are named based on BLIS_CNAME which is a
config name (zen, zen2, zen3, zen4, zen5)
In case of Zen kernels:
Compiler picks up znver flag based on the config name where the
source file exists. All avx2 kernels are placed in zen and all avx512
kernels are placed in zen4/zen5 folder.
Kernels placed in zen (AVX2 kernels) were compiled with the znver1
flag rather than with the znver2/znver3 flags on the zen2/zen3 architectures
respectively
New Design: For amdzen builds
1. For ref_kernels and kernels/(zen/zen2/zen3), znver2 flag is used instead of
znver1 in make and cmake build system.
2. To use znver2 flags, make_defs.mk of zen2 is included in zen config
3. No changes are made for auto or any individual config
4. Significant performance improvement is observed
AMD-Internal : [CPUPL-5407] [CPUPL-5406] [CPUPL-4873] [CPUPL-4872] [CPUPL-4871] [CPUPL-4801] [CPUPL-4800] [CPUPL-4799]
Change-Id: Ie817c13b8b69a2dc4328aad7ae09a3af06f83df5
Auxiliary blocksize values for cache blocksizes are interpreted as the maximum cache blocksizes. The maximum cache blocksizes are a convenient and portable way of smoothing performance of the level-3 operations when computing with a matrix operand that is just slightly larger than a multiple of the preferred cache blocksize in that dimension. In these "edge cases," iterations run with highly sub-optimal blocking. We can address this problem by merging the "edge case" iteration with the second-to-last iteration, such that the cache blocksizes are slightly larger, rather than significantly smaller, than optimal. The maximum cache blocksizes allow the developer to specify the maximum size of this merged iteration; if the edge case causes the merged iteration to exceed this maximum, then the edge case is not merged and is instead computed upon in a separate (final) iteration. (https://github.com/flame/blis/blob/master/docs/ConfigurationHowTo.md)
In bli_cntx_init_zen4 and bli_cntx_init_zen5, the auxiliary blocksize for KC was less than the primary blocksize. These are fixed.
Code cleanup of the files bli_family_zen4.h and bli_family_zen5.h: removed unused constants.
Thanks to Igor Kozachenko <igork@berkeley.edu> for pointing out these two bugs.
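The maximum-blocksize rule quoted above can be sketched in a few lines; this is a simplification, with illustrative names:

```c
/* For offset i into a dimension of size n: use the preferred blocksize
 * b_alg, but merge the trailing edge into this iteration whenever the
 * merged extent stays within the maximum blocksize b_max. */
long block_extent(long i, long n, long b_alg, long b_max)
{
    long left = n - i;
    return (left <= b_max) ? left : b_alg;
}
```

This makes the invariant the bug violated easy to see: b_max must be at least b_alg, otherwise full iterations would be truncated rather than edge cases merged.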
Change-Id: I44fc564d5d91cb978d062c413e70751aeaa07f2c
- This change is made in the MAKE build system.
- Removed -fno-tree-loop-vectorize from the global kernel flags and
instead added it to the lpgemm-specific kernels only.
- If this flag is not used, gcc tries to auto-vectorize the code,
which uses vector registers; if the auto-vectorized function also
uses intrinsics, then the total number of vector registers used by
the intrinsics and the auto-vectorized code exceeds the registers
available in the machine, causing reads and writes to the stack,
which was causing the regression in lpgemm.
- If this flag is enabled globally, then files which do not use any
intrinsic code do not get auto-vectorized.
- To get optimal performance for both blis and lpgemm, this flag is
enabled for the lpgemm kernels only.
Previous commit (75df1ef218) contains
similar changes for the cmake build system
AMD-Internal: [CPUPL-5544]
Change-Id: I796e89f3fb2116d64c3a78af2069de20ce92d506
- Introduced a new 24x8 column preferred DGEMM sup kernel for zen5.
- The prefetch logic is modified compared to the zen4 24x8 sup kernels.
- Earlier, the next panel of A was prefetched into the L2 cache;
  this is now changed to prefetching the second-next column
  of the current panel of A into the L1 cache.
- B and C prefetches are enabled and unchanged.
- Tuned MC, KC and NC block sizes for new kernel.
AMD-Internal: [CPUPL-5262]
Change-Id: If933537e50f43f5560e0fe18a716aa1e36ced64d
- New decision-threshold constants are added to decide between the
  double-precision sup and native dgemm code-paths for zen5 processors.
- The decision is based on the values of m, n and k.
AMD-Internal: [CPUPL-5262]
Change-Id: I87b8ff9eb603d6fda0875e000f7ab83b22d22040
- In the initial patch, for m, n not a multiple of MR and NR
  respectively, we were calling bli_dgemm_ker_var2. Now we have
  implemented the macro-kernel for these fringe cases as well.
- Replaced RBP register with R11 in the macro-kernel.
- Retuned MC, KC and NC with these new changes.
This will result in better performance for matrix sizes
like m=4000 or greater when running on single thread.
AMD-Internal: [CPUPL-5262]
Change-Id: I66c111ceb7feee776703339680d57e8d6d5c809a
- In order to reuse 24x8 AVX512 DGEMM SUP kernels,
24x8 triangular AVX512 DGEMMT SUP kernels are added.
- Since the LCM of MR(24) and NR(8) is 24, therefore the diagonal
pattern repeats every 24x24 block of C. To cover this 24x24 block,
3 kernels are needed for one variant of DGEMMT. A total of 6
kernels are needed to cover both upper and lower variants.
- In order to maximize code reuse, the 24x8 kernels are broken
into two parts, 8x8 diagonal GEMM and 16x8 full GEMM. The 8x8
diagonal GEMM is computed by 8x8 diagonal kernel, and 16x8
full GEMM part is computed by 24x8 DGEMM SUP kernel.
- Changes are made in framework to enable the use of these kernels.
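The LCM observation above can be checked directly; the panel-variant selector is an illustrative simplification of the framework's kernel choice:

```c
long gcd(long a, long b) { while (b) { long t = a % b; a = b; b = t; } return a; }
long lcm(long a, long b) { return a / gcd(a, b) * b; }

/* With MR = 24 and NR = 8, lcm(24, 8) = 24, so the diagonal pattern in C
 * repeats every 24x24 block, i.e. every 24/8 = 3 column panels; hence 3
 * kernels cover one (upper or lower) variant of DGEMMT. */
int dgemmt_panel_variant(long j)  /* j = column-panel index within a 24x24 cycle */
{
    return (int)(j % 3);          /* selects one of the 3 kernels */
}
```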
AMD-Internal: [CPUPL-5338]
Change-Id: I8e7007031e906f786b0c4fe12377ee439075207a
- Implemented AVX512 computational kernel for DAXPBYV
with optimal unrolling. Further implemented the other
missing kernels that would be required to decompose
the computation in special cases, namely the AVX512
DADDV and DSCAL2V kernels.
- Updated the zen4 and zen5 contexts to ensure any query
to acquire the kernel pointer for DAXPBYV returns the
address of the new kernel.
- Added micro-kernel units tests to GTestsuite to check
for functionality and out-of-bounds reads and writes.
AMD-Internal: [CPUPL-5406][CPUPL-5421]
Change-Id: I127ab21174ddd9e6de2c30a320e62a8b042cbde6
- This change is made in the CMAKE build system only.
- Removed -fno-tree-loop-vectorize from the global kernel flags and
instead added it to the lpgemm-specific kernels only.
- If this flag is not used, gcc tries to auto-vectorize the code,
which uses vector registers; if the auto-vectorized function also
uses intrinsics, then the total number of vector registers used by
the intrinsics and the auto-vectorized code exceeds the registers
available in the machine, causing reads and writes to the stack,
which was causing the regression in lpgemm.
- If this flag is enabled globally, then files which do not use any
intrinsic code do not get auto-vectorized.
- To get optimal performance for both blis and lpgemm, this flag is
enabled for the lpgemm kernels only.
Change-Id: I14e5c18cd53b058bfc9d764a8eaf825b4d0a81c4
- Added CSCALV kernel utilizing the AVX512 ISA.
- Added function pointers for the same to zen4 and zen5 contexts.
- Updated the BLAS interface to invoke respective CSCALV kernels based
on the architecture.
- Added UKR tests for bli_cscalv_zen_int_avx512( ... ).
AMD-Internal: [CPUPL-5299]
Change-Id: I189d87a1ec1a6e30c16e05582dcb57a8510a27f3
- Introduced new 8x24 macro kernels.
- 4 new kernels are added for beta 0, beta 1, beta -1
and beta N.
- IR and JR loop moved to ASM region.
- Kernels support row major storage scheme.
- Prefetch of current micro panel of C is enabled.
- Kernel supports negative offsets for A and B matrices.
- Moved alpha scaling from DGEMM kernel to B pack kernel.
- Tuned blocksizes for new kernel.
- Added support for alpha scaling in 24xk pack kernel.
- Reverted back to old b_next computation
in gemm_ker_var2.
- BugFix in the 8x24 DGEMM kernel for beta 1:
  the comparison for jmp conditions was done using integer
  instructions, which caused the beta 1 path to never be taken.
  Fixed this by changing the comparison to double.
AMD-Internal: [CPUPL-5262]
Change-Id: Ieec207eea2a164603c8a8ea88e0b1d3095c29a3f
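The class of bug fixed above can be illustrated in plain C (an illustrative model, not the kernel's actual asm): testing beta == 1.0 by comparing raw bits against integer 1 never matches, because the IEEE-754 encoding of 1.0 is 0x3FF0000000000000, not 0x1. The fix is a floating-point compare (vucomisd in the asm kernel).

```c
#include <stdint.h>
#include <string.h>

/* Buggy check: integer compare of the raw bit pattern of beta.
   Always false for beta == 1.0, since its bits are 0x3FF0000000000000. */
static int beta_is_one_buggy(double beta)
{
    uint64_t bits;
    memcpy(&bits, &beta, sizeof(bits));
    return bits == 1u;
}

/* Fixed check: a floating-point compare, as vucomisd performs. */
static int beta_is_one_fixed(double beta)
{
    return beta == 1.0;
}
```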
- Replaced the 'bli_zaxpyv_zen_int5' kernel with the optimised
'bli_zaxpyv_zen_int_avx512' kernel for the zen4 and
zen5 configs.
- Implemented multithreading support and AOCL-dynamic
for ZAXPY API.
- Utilized 'bli_thread_range_sub' function to achieve
better work distribution and avoid false sharing.
AMD-Internal: [CPUPL-5250]
Change-Id: I46ad8f01f9d639e0baa78f4475d6e86458d8069b
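The work-distribution idea behind 'bli_thread_range_sub' can be sketched as contiguous range splitting (hypothetical helper, not the BLIS signature): thread t of nt gets a contiguous [start, end) slice of the n elements, with the remainder spread over the first threads so slices differ by at most one element. Contiguous slices keep each thread's writes in distinct cache lines, which helps avoid false sharing compared with interleaved distribution.

```c
#include <stddef.h>

/* Split n elements into nt near-equal contiguous ranges; thread t
   receives [*start, *end). The first (n % nt) threads get one extra
   element so the ranges differ in length by at most one. */
static void thread_range(size_t n, size_t nt, size_t t,
                         size_t *start, size_t *end)
{
    size_t base = n / nt, rem = n % nt;
    *start = t * base + (t < rem ? t : rem);
    *end   = *start + base + (t < rem ? 1 : 0);
}
```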
* commit '81e10346':
Alloc at least 1 elem in pool_t block_ptrs. (#560)
Fix insufficient pool-growing logic in bli_pool.c. (#559)
Arm SVE C/ZGEMM Fix FMOV 0 Mistake
SH Kernel Unused Eigher
Arm SVE C/ZGEMM Support *beta==0
Arm SVE Config armsve Use ZGEMM/CGEMM
Arm SVE: Update Perf. Graph
Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0
Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0
A64FX Config Use ZGEMM/CGEMM
Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg
Arm SVE Add SGEMM 2Vx10 Unindexed
Arm SVE ZGEMM Support Gather Load / Scatt. St.
Arm SVE Add ZGEMM 2Vx10 Unindexed
Arm SVE Add ZGEMM 2Vx7 Unindexed
Arm SVE Add ZGEMM 2Vx8 Unindexed
Update Travis CI badge
Armv8 Trash New Bulk Kernels
Enable testing 1m in `make check`.
Config ArmSVE Unregister 12xk. Move 12xk to Old
Revert __has_include(). Distinguish w/ BLIS_FAMILY_**
Register firestorm into arm64 Metaconfig
Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo
Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo
Add test for Apple M1 (firestorm)
Firestorm CPUID Dispatcher
Armv8 GEMMSUP Edge Cases Require Signed Ints
Make error checking level a thread-local variable.
Fix data race in testsuite.
Update .appveyor.yml
Firestorm Block Size Fixes
Armv8 Handle *beta == 0 for GEMMSUP ??r Case.
Move unused ARM SVE kernels to "old" directory.
Add an option to control whether or not to use @rpath.
Fix $ORIGIN usage on linux.
Arm micro-architecture dispatch (#344)
Use @path-based install name on MacOS and use relocatable RPATH entries for testsuite binaries.
Armv8 Handle *beta == 0 for GEMMSUP ?rc Case.
Armv8 Fix 6x8 Row-Maj Ukr
Apply patch from @xrq-phys.
Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs.
bli_error: more cleanup on the error strings array
Arm SVE Exclude SVE-Intrinsic Kernels for GCC 8-9
Arm SVE: Correct PACKM Ker Name: Intrinsic Kers
Fix config_name in bli_arch.c
Arm Whole GEMMSUP Call Route is Asm/Int Optimized
Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref
Header Typo
Arm: DGEMMSUP ??r(rv) Invoke Edge Size
Arm: DGEMMSUP ?rc(rd) Invoke Edge Size
Arm: Implement GEMMSUP Fallback Method
Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin
Added Apple Firestorm (A14/M1) Subconfig
Arm64 8x4 Kernel Use Less Regs
Armv8-A Supplimentary GEMMSUP Sizes for RD
Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm
Armv8-A Adjust Types for PACKM Kernels
Armv8-A GEMMSUP-RD 6x8m
Armv8-A GEMMSUP-RD 6x8n
Armv8-A s/d Packing Kernels Fix Typo
Armv8-A Introduced s/d Packing Kernels
Armv8-A DGEMMSUP 6x8m Kernel
Armv8-A DGEMMSUP Adjustments
Armv8-A Add More DGEMMSUP
Armv8-A Add GEMMSUP 4x8n Kernel
Armv8-A Add Part of GEMMSUP 8x4m Kernel
Armv8A DGEMM 4x4 Kernel WIP. Slow
Armv8-A Add 8x4 Kernel WIP
AMD-Internal: [CPUPL-2698]
Change-Id: I194ff69356740bb36ca189fd1bf9fef02eec3803
Support for reordering the B matrix of datatype int4 as per the pack-schema
requirements of the u8s8s32 kernel. Vectorized int4_t -> int8_t conversion
is implemented by leveraging the vpmultishiftqb instruction. The reordered
B matrix is then used in the u8s8s32o<s32|s8> API.
AMD-Internal: [SWLCSG-2390]
Change-Id: I3a8f8aba30cac0c4828a31f1d27fa1b45ea07bba
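A scalar model of the int4_t -> int8_t widening that the vectorized reorder performs with vpmultishiftqb (hypothetical helper; the nibble layout, low nibble first, is an assumption for illustration):

```c
#include <stdint.h>
#include <stddef.h>

/* Each input byte packs two signed 4-bit values; sign-extend each
   into an int8_t. Shift-left-then-arithmetic-shift-right performs
   the sign extension from 4 bits to 8 bits. */
static void unpack_int4_to_int8(const uint8_t *packed, int8_t *out,
                                size_t n_pairs)
{
    for (size_t i = 0; i < n_pairs; ++i) {
        uint8_t b = packed[i];
        out[2 * i]     = (int8_t)(uint8_t)(b << 4) >> 4;  /* low nibble  */
        out[2 * i + 1] = (int8_t)b >> 4;                  /* high nibble */
    }
}
```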
- Updated the existing code-path for ?AXPBYV to
reroute the inputs to the appropriate L1 kernel,
based on the alpha and beta values. This is done
in order to apply sensible optimizations to the
compute and memory operations.
- Updated the typed API interface for ?AXPBYV to include
an early-exit condition (when n is 0, or when alpha is
0 and beta is 1). Further updated this layer to query
the right kernel from the context, based on the input
values of alpha and beta.
- Added the necessary L1 vector kernels (i.e., ?SETV, ?ADDV,
?SCALV, ?SCAL2V and ?COPYV) to be used as part of special-
case handling in ?AXPBYV.
- Moved the early return with negative increments from ?SCAL2V
kernels to its typed API interface.
- Updated the zen, zen2 and zen3 context to include function
pointers for all these vector kernels.
- Updated the existing ?AXPBYV vector kernels to handle only
the required computation. Additional cleanup was done to
these kernels.
- Added accuracy and memory tests for the AVX2 kernels of the
?SETV, ?COPYV, ?ADDV, ?SCALV, ?SCAL2V, ?AXPYV and ?AXPBYV APIs.
- Updated the existing thresholds in the ?AXPBYV tests for
complex types, since every complex multiplication involves
two mul ops and one add op. Further added test-cases for
API-level accuracy checks, including special cases of
alpha and beta.
- Decomposed the reference call to ?AXPBYV into several other
L1 BLAS APIs (in case the reference does not support its own
?AXPBYV API). The decomposition is done to match the exact
operations that are done in BLIS based on the alpha and/or
beta values. This ensures that we test for our own compliance.
AMD-Internal: [CPUPL-4861]
Change-Id: Ia6d48f12f059f52b31c0bef6c75f47fd364952c6
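The alpha/beta routing described above can be sketched as a small dispatcher for y := beta*y + alpha*x (real case; the enum and function names here are illustrative, not the BLIS kernel names):

```c
/* Which L1 kernel handles y := beta*y + alpha*x for given alpha, beta. */
typedef enum {
    DO_NOTHING,  /* alpha == 0, beta == 1: y unchanged (early exit) */
    DO_SETV,     /* alpha == 0, beta == 0: y := 0                   */
    DO_SCALV,    /* alpha == 0, other beta: y := beta*y             */
    DO_COPYV,    /* alpha == 1, beta == 0: y := x                   */
    DO_SCAL2V,   /* other alpha, beta == 0: y := alpha*x            */
    DO_ADDV,     /* alpha == 1, beta == 1: y := y + x               */
    DO_AXPYV,    /* other alpha, beta == 1: y := y + alpha*x        */
    DO_AXPBYV    /* general case                                    */
} axpby_path;

static axpby_path choose_axpbyv_path(double alpha, double beta)
{
    if (alpha == 0.0) {
        if (beta == 1.0) return DO_NOTHING;
        if (beta == 0.0) return DO_SETV;
        return DO_SCALV;
    }
    if (beta == 0.0)
        return (alpha == 1.0) ? DO_COPYV : DO_SCAL2V;
    if (beta == 1.0)
        return (alpha == 1.0) ? DO_ADDV : DO_AXPYV;
    return DO_AXPBYV;
}
```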