- Guarded the inclusion of thresholds(configuration
headers) using macros, to maintain uniformity in
the design principles.
- Updated the threshold macro names for every
micro-architecture.
AMD-Internal: [CPUPL-5895]
Change-Id: I9fd193371c41469d9ef38c37f9c055c21457b56c
- As part of AOCL-BLAS, there exists a set of vectorized
SUP kernels for GEMM, that are performant when invoked
in a bare-metal fashion.
- Designed a macro-based interface for handling tiny
sizes in GEMM, that would utilize there kernels. This
is currently instantiated for 'Z' datatype(double-precision
complex).
- Design breakdown :
- Tiny path requires the usage of AVX2 and/or AVX512
SUP kernels, based on the micro-architecture. The
decision logic for invoking tiny-path is specific
to the micro-architecture. These thresholds are defined
in their respective configuration directories(header files).
- List of AVX2/AVX512 SUP kernels(lookup table), and their
lookup functions are defined in the base-architecture from
which the support starts. Since we need to support backward
compatibility when defining the lookup table/functions, they
are present in the kernels folder(base-architecture).
- Defined a new type to be used to create the lookup table and its
entries. This type holds the kernel pointer, blocking dimensions
and the storage preference.
- This design would only require the appropriate thresholds and
the associated lookup table to be defined for the other datatypes
and micro-architecture support. Thus, is it extensible.
- NOTE : The SUP kernels that are listed for Tiny GEMM are m-var
kernels. Thus, the blocking in framework is done accordingly.
In case of adding the support for n-var, the variant
information could be encoded in the object definition.
- Added test-cases to validate the interface for functionality(API
level tests). Also added exception value tests, which have been
disabled due to the SUP kernel optimizations.
AMD-Internal: [CPUPL-6040][CPUPL-6018][CPUPL-5319][CPUPL-3799]
Change-Id: I84f734f8e683c90efa63f2fa79d2c03484e07956
- Bug : The current {S/D}AMAXV AVX512 kernels produced an
incorrect functionality with multiple absolute maximums.
They returned the last index when having multiple occurences,
instead of the first one.
- Implemented a bug-fix to handle this issue on these AVX512
kernels. Also ensured that the kernels are compliant with
the standard when handling exception values.
- Further optimized the code by decoupling the logic to find
the maximum element and its search space for index. This way,
we use lesser latency instructions to compute the maximum
first.
- Updated the unit-tests, exception value tests and early return
tests for the API to ensure code-coverage.
AMD-Internal: [CPUPL-4745]
Change-Id: I2f44d33dbaf89fe19e255af1f934877816940c6f
This patch introduces comprehensive optimizations to the DGEMM kernel, focusing on loop
efficiency and edge kernel performance. The following technical improvements have been implemented:
1. **IR Loop Optimization:**
- The IR loop has been re-implemented in hand-written assembly to eliminate the overhead associated
with `begin_asm` and `end_asm` calls, resulting in more efficient execution.
2. **JR Loop Integration:**
- The JR loop is now incorporated into the micro kernel. This integration avoids the repetitive overhead
of stack frame management for each JR iteration, thereby enhancing loop performance.
3. **Kernel Decomposition Strategy:**
- The m dimension is decomposed into specific sizes: 20, 18, 17, 16, 12, 11, 10, 9, 8, 4, 2, and 1.
- For remaining cases, masked variants of edge kernels are utilized to handle the decomposition efficiently.
1. **Interleaved Scaling by Alpha:**
- Scaling by the alpha factor is interleaved with load instructions to optimize the instruction pipeline
and reduce latency.
2. **Efficient Mask Preparation:**
- Masks are prepared within inline assembly code only at points where masked load-store operations are necessary,
minimizing unnecessary overhead.
3. **Broadcast Instruction Optimization:**
- In edge kernels where each FMA (Fused Multiply-Add) operation requires a broadcast without subsequent reuse,
the broadcast instruction is replaced with `mem_1to8`.
- This allows the compiler to optimize by assigning separate vector registers for broadcasting, thus avoiding
dependency chains and improving execution efficiency.
4. **C Matrix Update Optimization:**
- During the update of the C matrix in edge kernels, columns are pre-loaded into multiple vector registers.
This approach breaks dependency chains during FMA operations following the scaling by alpha, thereby mitigating
performance bottlenecks and enhancing throughput.
These optimizations collectively improve the performance of the DGEMM kernel, particularly in handling edge cases and
reducing overhead in critical loops. The changes are expected to yield significant performance gains in matrix multiplication
operations.
This patch also involves changes for tiny gemm interface. A light
interface for calling kernels and removing calls to avx2 dgemm kernels
as we use avx512 dgemm kernels for all the sizes for zen4 and zen5.
For zen4 and zen5 when A matrix transposed(CRC, RRC), tiny kernel does not have
the support to handle such inputs and thus such inputs are routed to
gemm_small path.
AMD-Internal: [CPUPL-6054]
Change-Id: I57b430f9969ca39aa111b54fa169e4225b900c4a
- Enabled dynamic blocksizes for DGEMM in ZEN4 and ZEN5 systems.
- MC, KC and NC are dynamically selected at runtime for DGEMM native.
- A local copy of cntx is created and blocksizes are updated in the local cntx.
- Updated threshold for picking DGEMM SUP kernel for ZEN4.
AMD-Internal: [CPUPL-5912]
Change-Id: Ic12a1a48bfa59af26cc17ccfa47a2a33fadde1f6
- Merged ZEN4 and ZEN5 DGEMM 8x24 kernel.
- Replaced 32x6 kernel with 8x24. Now same
kernel is used for ZEN4 and ZEN5.
- Blocksizes have been tuned for genoa only.
- DGEMM kernel for DTRSM native code path
is replaced with 8x24 kernel.
- Enabled alpha scaling during packing for ZEN4.
- ZEN4 8x24 kernel has been removed.
AMD-Internal: [CPUPL-5912]
Change-Id: I89a16a7e3355af037d21d453aabf53c5ecccb754
- Implemented a set of column preferential dot-product based
ZGEMM kernels(main and fringe) in AVX512(for SUP code-path).
These kernels perform matrix multiplication as a sequence
of inner products(i.e, dot-products).
- These standalone kernels are expected to strictly handle
the CRC storage scheme for C, A and B matrices. RRC is also
supported through operation transpose, at the framework
level.
- Added unit-tests to test all the kernels(main and fringe),
as well as the redirection between these kernels.
AMD-Internal: [CPUPL-5949]
Change-Id: I858257ac2658ed9ce4980635874baa1474b79c38
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
zen, skx and a couple of other kernels to cover all
contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
statements.
AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
Auxiliary blocksize values for cache blocksizes are interpreted as the maximum cache blocksizes. The maximum cache blocksizes are a convenient and portable way of smoothing performance of the level-3 operations when computing with a matrix operand that is just slightly larger than a multiple of the preferred cache blocksize in that dimension. In these "edge cases," iterations run with highly sub-optimal blocking. We can address this problem by merging the "edge case" iteration with the second-to-last iteration, such that the cache blocksizes are slightly larger--rather than significantly smaller--than optimal. The maximum cache blocksizes allow the developer to specify the maximum size of this merged iteration; if the edge case causes the merged iteration to exceed this maximum, then the edge case is not merged and instead it is computed upon in separate (final) iteration. (https://github.com/flame/blis/blob/master/docs/ConfigurationHowTo.md).
In bli_cntx_init_zen4 and zen5 - auxiliary blocksize for KC was less than primary blocksize. These are fixed.
Code-cleanup of the files bli_family_zen4, zen5.h" Removed unused constants.
Thanks to Igor Kozachenko <igork@berkeley.edu> for pointing out these two bugs.
Change-Id: I44fc564d5d91cb978d062c413e70751aeaa07f2c
- This change in made in MAKE build system.
- Removed -fno-tree-loop-vectorize from global kernel flags,
instead added it to lpgemm specific kernels only.
- If this flag is not used , then gcc tries to auto
vectorize the code which results in usages of
vector registers, if the auto vectorized function
is using intrinsic then the total numbers of vector
registers used by intrinsic and auto vectorized
code becomes more than the registers
available in machine which causes read and writes
to stack, which is causing regression in lpgemm.
- If this flag is enabled globally, then the files which
do not use any intrinsic code do not get auto
vectorized.
- To get optimal performance for both blis and lpgemm,
this flag is enabled for lpgemm kernels only.
Previous commit (75df1ef218) contains
similar changes on cmake build system
AMD-Internal: [CPUPL-5544]
Change-Id: I796e89f3fb2116d64c3a78af2069de20ce92d506
- In order to reuse 24x8 AVX512 DGEMM SUP kernels,
24x8 triangular AVX512 DGEMMT SUP kernels are added.
- Since the LCM of MR(24) and NR(8) is 24, therefore the diagonal
pattern repeats every 24x24 block of C. To cover this 24x24 block,
3 kernels are needed for one variant of DGEMMT. A total of 6
kernels are needed to cover both upper and lower variants.
- In order to maximize code reuse, the 24x8 kernels are broken
into two parts, 8x8 diagonal GEMM and 16x8 full GEMM. The 8x8
diagonal GEMM is computed by 8x8 diagonal kernel, and 16x8
full GEMM part is computed by 24x8 DGEMM SUP kernel.
- Changes are made in framework to enable the use of these kernels.
AMD-Internal: [CPUPL-5338]
Change-Id: I8e7007031e906f786b0c4fe12377ee439075207a
- Implemented AVX512 computational kernel for DAXPBYV
with optimal unrolling. Further implemented the other
missing kernels that would be required to decompose
the computation in special cases, namely the AVX512
DADDV and DSCAL2V kernels.
- Updated the zen4 and zen5 contexts to ensure any query
to acquire the kernel pointer for DAXPBYV returns the
address of the new kernel.
- Added micro-kernel units tests to GTestsuite to check
for functionality and out-of-bounds reads and writes.
AMD-Internal: [CPUPL-5406][CPUPL-5421]
Change-Id: I127ab21174ddd9e6de2c30a320e62a8b042cbde6
- This change in made in CMAKE build system only.
- Removed -fno-tree-loop-vectorize from global kernel flags,
instead added it to lpgemm specific kernels only.
- If this flag is not used , then gcc tries to auto
vectorize the code which results in usages of
vector registers, if the auto vectorized function
is using intrinsics then the total numbers of vector
registers used by intrinsic and auto vectorized
code becomes more than the registers
available in machine which causes read and writes
to stack, which is causing regression in lpgemm.
- If this flag is enabled globally, then the files which
do not use any intrinsic code do not get auto
vectorized.
- To get optimal performance for both blis and lpgemm,
this flag is enabled for lpgemm kernels only.
Change-Id: I14e5c18cd53b058bfc9d764a8eaf825b4d0a81c4
- Added CSCALV kernel utilizing the AVX512 ISA.
- Added function pointers for the same to zen4 and zen5 contexts.
- Updated the BLAS interface to invoke respective CSCALV kernels based
on the architecture.
- Added UKR tests for bli_cscalv_zen_int_avx512( ... ).
AMD-Internal: [CPUPL-5299]
Change-Id: I189d87a1ec1a6e30c16e05582dcb57a8510a27f3
- Replaced 'bli_zaxpyv_zen_int5' kernel with optimised
'bli_zaxpyv_zen_int_avx512' kernel for zen4 and
zen5 config.
- Implemented multithreading support and AOCL-dynamic
for ZAXPY API.
- Utilized 'bli_thread_range_sub' function to achieve
better work distribution and avoid false sharing.
AMD-Internal: [CPUPL-5250]
Change-Id: I46ad8f01f9d639e0baa78f4475d6e86458d8069b
Support for reordering B matrix of datatype int4 as per the pack schema
requirements of u8s8s32 kernel. Vectorized int4_t -> int8_t conversion
implemented via leveraging the vpmultishiftqb instruction. The reordered
B matrix will then be used in the u8s8s32o<s32|s8> api.
AMD-Internal: [SWLCSG-2390]
Change-Id: I3a8f8aba30cac0c4828a31f1d27fa1b45ea07bba
1. Enabled AVX512 path for
- Upper variant
- Different storage schemes for upper and lower variant
2. Modified mask value to handle all fringe cases correctly
AMD_Internal: [CPUPL-5091]
Change-Id: I4bf8aca24c1b87fff606deb05918b8e6216b729e
- Added AVX512 kernel for ZDOTV.
- Multithreaded both ZDOTC and ZDOTU with AOCL_DYNAMIC support.
AMD-Internal: [CPUPL-5011]
Change-Id: I56df9c07ab3b8df06267a99835b088dcada81bd8
Existing Design:
- GEMM AVX2 kernel performs computation and updates temporary C buffer
- Portion of temporary C buffer is copied to output C buffer
based on UPLO parameter
- For diagonal blocks, using GEMM kernels is not efficient
New Design: Implemented in current patch when UPLO='L'
- GEMMT kernel used for computation, temporary buffer is not required.
- Only required elements are computed using mask load store for all
fringe cases
- Exception: AVX2 code path is used when storage format is RRC, CRR, CRC
- AOCL-Dynamic is added based on dimension
- Check for AVX platform is added in SUP interface, It returns to
native implementation if hardware doesnot support AVX platform
- SUP ref_var2m is expanded for dcomplex datatype to avoid condition
check which exists for double datatype
AMD_Internal: [CPUPL-5006]
Change-Id: I3e21404b732b8f2df9cbdba394303752fdf36286
- In DGEMMT SUP AVX2 code path, traingular kernels
are added in order to avoid temporary C buffer.
- Since these kernels did not exist for AVX512,
AVX2 kernels were being used in GEMMT.
- AVX512 triangular GEMM kernel has been added
to make sure that AVX512 kernels can be used without
creating a temporary buffer.
- This kernel is added only for Lower variant of GEMMT,
for upper variant of DGEMMT, temporary C buffer is
created, full GEMM kernel is called on temporary C and
traingular region from temporary C is copied to C
buffer.
AMD-Internal: [CPUPL-4881]
Change-Id: Id70645f79ae078ab9a7006e83d328505f1fae8a9
- Implemented AVX512 kernels for handling the calls to ZGEMV
with no-transpose to A matrix.
- This includes the ZAXPYF, ZAXPYV and ZSETV kernels.
The set of ZAXPYF kernels include those with fuse-factor 8
(main kernel), 4 and 2(fringe kernels).
- Updated the bli_zgemv_unf_var2( ... ) function to set
the function pointers to these kernels, based on the
configuration. Further added the call to ZSETV at this
layer in case beta is 0.
AMD-Internal: [CPUPL-4974]
Change-Id: Iee4b724719e49023138bb16479765be44d677cd9
- Implemented AVX512 kernels for scopyv_, dcopyv_ and zcopyv_
using respective AVX512 intrinsics including masked
load and store operations.
- Implemented AVX512 kernels for scopy_, dcopy_ and
zcopy_ using assembly language to prevent loss of
performance during the translation of intrinsics.
- Updated the dcopy_blis_impl( ... ) and
zcopy_blis_impl( ... ) function to support
multithreaded calls to the respective computational
kernels, if and when the OpenMP support is enabled.
- Implemented OpenMP parallelization for dcopyv_ and
zcopyv_ APIs, while scopyv_ and ccopyv_ only support
single thread.
AMD-Internal: [CPUPL-4854]
Change-Id: I5fbd0bcca4e59001fbe2b1168b624d0c33242b3e
Implement full support for zen5 as a separate BLIS sub-configuration
and code path within amdzen configuration family.
AMD-Internal: [CPUPL-3518]
Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09
CMakelists.txt is updated to support aocl_gemm on windows.
On windows, BLIS library(blis+aocl_gemm) is built successfully
only with AOCC Compiler. (Clang has an issue with optimizing
VNNI instructions).
$cmake .. -DENABLE_ADDON="aocl_gemm" ....
AMD-Internal: [CPUPL-2748]
Change-Id: I9620878ab6934233fadc9ddc5d5e82ad85be9209
Updated compiler id in cmake related files from
CMAKE_CXX_COMPILER_ID to CMAKE_C_COMPILER_ID
AMD-Internal: [CPUPL-2748]
Change-Id: Ib0e2a2e3ec8fafeb423fe56b9842a93db0115371
For gcc greater than or equal to 7.0 version added AVX512 compiler flags
in makde_defs.mk and make_defs.cmake. AVX512VNNI compiler flag is only
supported from gcc version 8 or greater. So added another else condition
for gcc version greater than or equal to 7 - enabling avx512 flags.
This enables compilation of AVX512 assembly code paths with gcc 7.5 version.
Change-Id: I2cda00e578010db5e5a515b506c0b99f685307e0
Implement initial support for Zen5 systems:
- Detect new Zen5 AVXVNNI, AVX512VP2INTERSECT, MOVDIRI and MOVDIR64B
instructions.
- Assume for now that Zen5 will use Zen4 code path. BLIS_ARCH_TYPE=zen5
will therefore function as an alias for BLIS_ARCH_TYPE=zen4, but
different hardware model will still be detected.
AMD-Internal: [CPUPL-3518]
Change-Id: I00fb413d743f152a5412ace3e740df1fd39a1600
Some text files were missing a newline at the end of the file.
One has been added.
AMD-Internal: [CPUPL-3519]
Change-Id: I4b00876b1230b036723d6b56755c6ca844a7ffce
- Added 4x12 ZGEMM row-preferred kernel.
- Added 4x12 ZTRSM row-preferred lower
and upper kernels using AVX512 ISA.
- These kernels are used for ZTRSM only, zgemm
still uses 12x4 kernel.
- Kernels support row/col/gen storage.
- Kernels support A prefetch, B prefetch,
A_next prefetch, B_next prefetch and c prefetch.
- B prefetch, B_next prefetch and C prefetch
are enabled by default.
- Updated CMakeLists.txt with ZGEMM kernels for
windows build.
AMD-Internal: [CPUPL-3781]
Change-Id: I0fb4b2ec2f4bd66db6499c25f12bcc4bdb09804a
- Added 2x6 ZGEMM row-preferred kernel.
- Kernel supports prefetch_a, prefetch_b,
prefetch_a_next and prefetch_b_next.
- Multiple Ways to prefetch c are supported.
- prefetch_a and prefetch_c are enabled by
default.
- K loop is divided into multiple subloops for
better c prefetch.
- Added 2x6 ZTRSM row-preferred lower
and upper kernels using AVX2 ISA.
- These kernels are used for ZTRSM only, zgemm
still uses 3x4 kernel.
- Kernels support row/col/gen storage.
- Updated the zen3 and zen4 config to enable
use of these kernels for TRSM in zen3 and
zen4 path.
- Updated CMakeLists.txt with ZGEMM kernels for
windows build.
AMD-Internal: [CPUPL-3781]
Change-Id: I236205f63a7f6b60bf1a5127a677d27425511e73
Tidy formatting of config/*zen*/bli_cntx_init_zen*.c and
config/*zen*/bli_family_*.c files to make them more
consistent with each other and improve readability.
AMD-Internal: [CPUPL-3519]
Change-Id: I32c2bf6dc8365264a748a401cf3c83be4976f73b
1. Two CGEMM function pointers are added for different storage schemes
1. bli_cgemmsup_rv_zen_asm_3x8m
2. bli_cgemmsup_rv_zen_asm_3x8n
2. In previous commit:
(Level-3 triangular routines now use different block sizes and kernels
Commit Id: 79e174ff0a)
1. bli_cntx_set_l3_sup_tri_kers cntx function was created
2. Function holds optimised function pointers for GEMMT/SYRK API's
3. It avoids over riding default block sizes which improves the
performance
4. This function did not include optimised CGEMM function pointers
leading to regression as reference kernels were invoked
3. With this commit, 2 optimized CGEMM function pointers are added in
bli_cntx_set_l3_sup_tri_kers
1. This fixes the regression as optimized CGEMM functions are invoked
AMD-Internal: [CPUPL-3831] [CPUPL-3830]
Change-Id: Ie8b41a5e62439de2a65e7df0b07d63ee2383e51e
- TRSM and GEMM has different blocksizes in zen4, in order
to accommodate this, a local copy of cntx was created in TRSM.
- Local copy of cntx has been removed and TRSM blocksizes are
stored in cntx->trsmblkszs.
- Functions to override and restore default blocksizes for TRSM
are removed. Instead of overriding the default blocksizes,
TRSM blocksizes are stored separately in cntx.
- Pack buffers for TRSM have to be packed with TRSM blocksizes
and GEMM pack buffers have to be packed with default blocksizes.
To check if we are packing for TRSM, "family" argument is added
in bli_packm_init_pack function.
- BLIS_GEMM_FOR_TRSM_UKR has to be used for TRSM if it is set, if
it is not set then BLIS_GEMM_UKR has to be used. This functionality
has been added to all TRSM macro kernels.
- Methods to retrieve TRSM blocksizes from cntx are added
to bli_cntx.h.
- Tests for micro kernels are modified to accommodate the change in
signature of bli_packm_init_pack.
AMD-Internal: [CPUPL-3781]
Change-Id: Ia567215d6d1aa0f14eae5d3177f4a3dd63b4b20a
Details:
- Eliminated the need for override function in SUP for GEMMT/SYRK.
- New set of block sizes, kernels and kernel preferences
are added to cntx data structure for level-3 triangular routines.
- Added supporting functions to set and get the above parameters from cntx.
- Modified GEMMT/SYRK SUP code to use these new block sizes/kernels.
In case they are not set, use the default block sizes/kernels of
Level-3 SUP.
AMD-Internal: [CPUPL-3649]
Change-Id: Iee11bd4c4f1d8fbbb749c296258d1b8121c009a0
Improvements to zen make_defs.mk files:
* Add -znver4 flag for GCC 13 and later.
* Add AVX512 flags or -znver4 as appropriate for upstream LLVM
in config/zen4/make_defs.mk to enable BLIS to be build with
LLVM rather than AOCC.
* zen make_defs.mk files were inheriting settings from the previous
one (zen->zen2->zen3->zen4), when they should be independent
of each other. Correct by including config/zen/amd_config.mk
in all zen make_defs.mk files to reinitialize the compiler
flags.
* Update zen2 and zen3 make_defs.mk for recent AOCC compiler
releases, rather than rely on LLVM settings.
* Remove -mfpmath=sse flag in config/zen4/make_defs.mk as
this is already specified in amd_config.mk (and should
be the default setting anyway).
* Tidy files to simplify nested if structures and be more
consistent with one another.
AMD-Internal: [CPUPL-3399]
Change-Id: Ice64ccedd90c2660fdee8b485348a6b405cfc5ac
- In Zen 4 context, there was a mismatch between the fuse factor
initialized in the block size parameter and fuse factor of the
corresponding kernel initialized.
AMD-Internal: [SWLCSG-2051]
Change-Id: I65f71532692a1459605abb860b91a2a360bcca5d
- Added Smart Threading logic for AVX-512 based SGEMM SUP.
- Calculating ic and jc for optimal work distribution to the allocated
threads based on logic similar to Zen3.
- Zen4 Architecture specific Native-to-SUP check has been added to
redirect few Native inputs to the SUP path based on the fact that in a
multi-threaded environment some Native cases perfom better as SUP.
- For the same, the SUP thresholds, namely, BLIS_MT and BLIS_NT have
been increased from 512 and 200 to 682 and 512, respectively.
- Further optimizations to the work distribution logic will be added
subsequently.
AMD-Internal: [CPUPL-3248]
Change-Id: Ibccbbefef251010ec94bd37ffc86c35b7866a5ca
Incorporate a means of detecting submodels of a microarchitecture,
so that different optimizations e.g. block sizes or kernel choices
can be used. The details are as follows:
- Different models are currently only enabled for zen3 and zen4
architectures (for server parts).
- There is a single enumeration (model_t) for all models for all
architectures, but function bli_check_valid_model_id() should
check the provided model_id against the suitable range within
the enumeration for the provided arch_id.
- To enable the model_id to be used within the cntx setup functions,
checking of a user specified value of BLIS_ARCH_TYPE against
the enabled configurations is delayed to a separate function,
bli_arch_check_id().
- Default selection based on hardware can be overridden using the
BLIS_MODEL_TYPE environment variable. Valid values are:
Genoa, Bergamo, Genoa-X, Milan, Milan-X
Values are case-insensitive and -X can also be specified as _X or X
- Specifying an incorrect value for BLIS_MODEL_TYPE is not an error,
but will result in the default option for that architecture being
selected. This is different to specifying an incorrect value of
BLIS_ARCH_TYPE, which is an error.
- The environment variable BLIS_MODEL_TYPE can be renamed using
the --rename-blis-model-type argument to configure (or cmake
equivalent), in a similar way to renaming BLIS_ARCH_TYPE with
--rename-blis-arch-type.
- Configure option --disable-blis-arch-type will disable both
BLIS_ARCH_TYPE and BLIS_MODEL_TYPE environment variables.
- Added code in bli_cpuid.c to detect L1, L2 and L3 cache sizes,
currently only for AMD cpus. Functions are provided to query
these from other parts of the code, namely:
uint32_t bli_cpuid_query_{l1d,l1i,l2,l3}_cache_size()
AMD-Internal: [CPUPL-3033]
Change-Id: I37a3741abfd59a95e0e905d926c6ede9a0143702
Details:
- Overriding of blocksizes with avx-2 specific ones(6x8) is done
for gemmt/syrk because near-to-square shaped kernel performs
better than skewed/rectangular shaped kernel.
- Overriding is done for S,D and Z datatypes.
AMD-Internal: [CPUPL-3060]
Change-Id: I304ff4264ff735b7c31f7b803b046e1c49c9ad53
Details:
- Added a new function for choosing between SUP and
native implementation for a given size.
- This function pointer is stored in cntx for zen4 config.
- Divided total combinations of sizes into 3 categories:
- one dimension is small
- Two dimensions are small
- All dimensions are small
- Added different threshold conditions for each of the
categories.
AMD-Internal: [CPUPL-2755]
Change-Id: Iae4bf96bb7c9bf9f68fd909fb757d7fe13bc6caf
- Added AVX512 based double and float AXPYV which will be used in
Zen4 context.
- Added n <= 0 check and alpha == 0 check to the BLAS layer of
SAXPY.
- Modified BLAS framework of float AXPYV to remove flag check and
pick kernels based on architecture ID.
- AVX512 kernel is disabled for other Zen configurations using
BLIS_KERNELS_ZEN4 macro.
AMD-Internal: [CPUPL-2793]
Change-Id: Ie6a0976c2cfcf81ae5125f5f9aad14477d4ebbd1
- Added AVX512 based double and float DOTV which will be used in
Zen4 context.
- Added n <= 0 check to the BLAS layer of SDOTV.
- Modified BLAS framework of float DOTV to remove flag check and
pick kernels based on architecture ID.
- AVX512 kernel is disabled for other Zen configurations using
BLIS_KERNELS_ZEN4 macro.
AMD-Internal: [CPUPL-2800]
Change-Id: I550fbcbb17d6d887b9ecbea23237dc806b208702
- Added AVX512 based double and float SCALV which will be used in
Zen4 context.
- Added incx <= 0 check and alpha == 1 check to the BLAS layer of
SSCAL.
- Modified BLAS framework of float SCAL to remove flag check and
pick kernels based on architecture ID.
- AVX512 kernel is disabled for other Zen configurations using
BLIS_KERNELS_ZEN4 macro.
AMD-Internal: [CPUPL-2766],[CPUPL-2765]
Change-Id: I4cdd93c9adbfbf8f7632730b8606ddcf70edd1dc
- Reverted the SUP blocksizes and kernels to use AVX2 SUP kernels for
SGEMM. This can be updated once GEMMT specific optimization are added
for AVX-512.
- Updated 'bli_zen4_override_gemm_blkszs()' in zen4 context to override
blocksize and kernels for SGEMM SUP to enable AVX-512 kernels for
SGEMM operation.
AMD-Internal: [CPUPL-3060]
Change-Id: Ic9b3037363b6e5b59e5035c81651c97ce95d6d9a
- Implemented 12x4m column preferential SUP kernels(main and fringe
cases). The main kernel dimension is 12x4, and the associated fringe
kernel dimensions are : 12x3m, 12x2m, 12x1m
8x4, 8x3, 8x2, 8x1
4x4, 4x3, 4x2, 4x1
2x4, 2x3, 2x2, 2x1.
- Included in-register transposition support for C, thus extending
the storage scheme supports to CCC, CCR, RCC and RCR inside the
milli-kernel.
- Integrated conditional packing of A onto the SUP front end for
dcomplex datatype. This redirects RRC and CRC storage schemes
onto the preceding set of SUP kernels through storage scheme
transformation(RCC and CCC respectively).
- Updated the zen4 context file with the new set of SUP kernels, to
get enabled appropriately. Furthermore, the context file was updated
with the AVX-2 dotxv signatures for dcomplex datatype. This redirects
the fringe cases of type 1x? to the pre-existing AVX-2 GEMV routines.
- Added C prefetching onto L2-cache, and an unroll factor of 4 for the
k loop in all the kernels.
- Work in progress to include conjugate support and input spectrum
extension for the AVX-512 SUP kernels. The current thresholds in zen4
context is the same as that of the zen3 thresholds for ZGEMM SUP.
AMD-Internal: [CPUPL-3122]
Change-Id: If40bc4409c6eb188765329508cf1f24c0eb12d1e
-The n fringe micro kernels uses only a few zmm registers for computing
the output (eg: 6x16 uses 6 zmm registers for output as opposed to 24
used in 6x64). This results in lot of wasted registers that if utilized
can help increase the MR dimension and thus improve the reuse of
registers loaded with B. Based on this concept, the existing n fringe
kernels are modified (6x16 -> 12x16, 6x32 -> 9x32). It is to be noted
that the maximum number of registers are not used, since it results in
cache inefficient code due to the increase in MR and thus more
broadcasts required from unpacked A matrix.
-Compiler flag updates for AOCC build to generate loops with 64 byte
alignment. This has been observed to improve performance slightly when
k dimension is small.
AMD-Internal: [CPUPL-3173]
Change-Id: I199ce75ef71d994ffe0067dac1ed804dce1742ca
- Kernel block size is 12x4
- Updated the zen4 config to enable these kernels in zen4 path.
- Tuned MC,KC,NC for better performance for m/n/k size > 500
- Updated CMakeLists.txt with ZGEMM kernels for windows build.
Kernel supports:
1. Preload and prebroadcast of A and B
2. Prefecth of C Matrix
3. K loop is sub divided in to multiple loops to maintain distance between c prefetchs.
4. Special case when alpha/beta imag component is zero
5. Row/Col/General stride of Matrix C
AMD-Internal: [CPUPL-2998]
Change-Id: I62e3c352d475b1add3f43270805fbcee00e2e440