More changes to standardize copyright formatting and correct years
for some files modified in recent commits.
AMD-Internal: [CPUPL-5895]
Change-Id: Ie95d599710c1e0605f14bbf71467ca5f5352af12
- Mixed precision datatypes use a modified cntx.
- For some variants of mixed precision, the complex and real blocksizes
must be the same. This is achieved by creating a local copy of
cntx and copying the complex blocksizes onto the real blocksizes.
- When dynamic blocksizes are used, the blocksize changes made
for mixed precision are overwritten by the dynamically
selected blocksizes.
- This mismatch between complex and real blocksizes causes an issue
where the pack buffer is allocated based on complex blocksizes but
the amount of data packed is based on real blocksizes.
- This makes the pack buffers smaller than required.
- To fix this, dynamic blocksizes are disabled for mixed precision.
AMD-Internal: [CPUPL-6384]
Change-Id: Ib9792f90b4ea113e54059a0da8fb4241622b5f83
- This patch reverts the earlier change that removed the check forcing
dgemm inputs under a certain threshold to be processed single-threaded
by kernels selected based on architecture ID.
- With the check restored, such small inputs are again computed in the
tiny path. While the check was absent, these inputs were routed to the
SUP path, which caused a performance regression due to framework
overhead.
AMD-Internal: [CPUPL-5927]
Change-Id: I4a4b21fdcf7c3ffaa09efa46ba12798eca0f10bb
- As part of AOCL-BLAS, there exists a set of vectorized
SUP kernels for GEMM, that are performant when invoked
in a bare-metal fashion.
- Designed a macro-based interface for handling tiny
sizes in GEMM, that would utilize these kernels. This
is currently instantiated for the 'Z' datatype (double-precision
complex).
- Design breakdown :
- Tiny path requires the usage of AVX2 and/or AVX512
SUP kernels, based on the micro-architecture. The
decision logic for invoking tiny-path is specific
to the micro-architecture. These thresholds are defined
in their respective configuration directories(header files).
- List of AVX2/AVX512 SUP kernels(lookup table), and their
lookup functions are defined in the base-architecture from
which the support starts. Since we need to support backward
compatibility when defining the lookup table/functions, they
are present in the kernels folder(base-architecture).
- Defined a new type to be used to create the lookup table and its
entries. This type holds the kernel pointer, blocking dimensions
and the storage preference.
- This design would only require the appropriate thresholds and
the associated lookup table to be defined for the other datatypes
and micro-architecture support. Thus, it is extensible.
- NOTE : The SUP kernels that are listed for Tiny GEMM are m-var
kernels. Thus, the blocking in framework is done accordingly.
In case of adding the support for n-var, the variant
information could be encoded in the object definition.
- Added test-cases to validate the interface for functionality(API
level tests). Also added exception value tests, which have been
disabled due to the SUP kernel optimizations.
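The lookup-table type described above can be pictured with a minimal sketch. Every name here (`tiny_entry_t`, `tiny_lookup`, the storage ids, the placeholder kernel and the MR/NR values) is hypothetical and not an AOCL-BLAS identifier; the sketch only assumes that an entry bundles a kernel pointer, its blocking dimensions and a storage preference, and that the table is indexed by a storage-scheme id.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical kernel signature; the real SUP kernels take more arguments. */
typedef void (*tiny_ker_ft)(int m, int n, int k);

/* Hypothetical entry type: kernel pointer, blocking dimensions, storage preference. */
typedef struct
{
    tiny_ker_ft ker;      /* kernel to invoke */
    int         mr, nr;   /* blocking dimensions of that kernel */
    bool        row_pref; /* true if the kernel prefers row storage */
} tiny_entry_t;

/* Placeholder kernel so the table has something to point at. */
static void placeholder_ker(int m, int n, int k) { (void)m; (void)n; (void)k; }

/* Storage-scheme ids used as table indices (an assumption of this sketch). */
enum { STOR_RRR = 0, STOR_CCC = 1, STOR_COUNT = 2 };

/* Placeholder table; the MR/NR values are illustrative only. */
static const tiny_entry_t tiny_table[STOR_COUNT] = {
    [STOR_RRR] = { placeholder_ker, 4, 12, true  },
    [STOR_CCC] = { placeholder_ker, 12, 4, false },
};

/* Lookup function: returns NULL when no kernel is registered for the id. */
static const tiny_entry_t *tiny_lookup(int stor_id)
{
    if (stor_id < 0 || stor_id >= STOR_COUNT) return NULL;
    return &tiny_table[stor_id];
}
```

Under these assumptions, supporting another datatype or micro-architecture would amount to supplying another table and its thresholds, while the calling code keeps going through the lookup function.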
AMD-Internal: [CPUPL-6040][CPUPL-6018][CPUPL-5319][CPUPL-3799]
Change-Id: I84f734f8e683c90efa63f2fa79d2c03484e07956
When compiling with config generic (or any non-zen build),
the bli_dgemm_tiny_6x8 kernel is not defined. Since bli_dgemm_tiny()
is only used within an AMD-specific file, bli_tiny_gemm.c has been renamed
to bli_tiny_gemm_amd.c to reflect its specific usage.
Thanks to Smyth, Edward <edward.smyth@amd.com> for identifying and helping to fix the issue.
Change-Id: If5d134aeba6d30d0a51e6d7d6fa9b3c4450a3307
- Updated the threshold check for ZGEMM small path to include
runtime checks for redirection, specific to the micro-architecture.
- The current ZGEMM small path has only its AVX2 variant available.
Once an AVX512 variant (same or different algorithm) is implemented,
the thresholds will be fine-tuned further.
- Included the dot-product based AVX512 ZGEMM kernels in the ZEN5
context. It will be used as part of handling RRC and CRC storage
schemes of C, A and B matrices in both single-thread and multi-thread
runs.
AMD-Internal: [CPUPL-5949]
Change-Id: Ic8b7cf0e00b7c477f748669f160c4b01df995c75
This patch introduces comprehensive optimizations to the DGEMM kernel, focusing on loop
efficiency and edge kernel performance. The following technical improvements have been implemented:
1. **IR Loop Optimization:**
- The IR loop has been re-implemented in hand-written assembly to eliminate the overhead associated
with `begin_asm` and `end_asm` calls, resulting in more efficient execution.
2. **JR Loop Integration:**
- The JR loop is now incorporated into the micro kernel. This integration avoids the repetitive overhead
of stack frame management for each JR iteration, thereby enhancing loop performance.
3. **Kernel Decomposition Strategy:**
- The m dimension is decomposed into specific sizes: 20, 18, 17, 16, 12, 11, 10, 9, 8, 4, 2, and 1.
- For remaining cases, masked variants of edge kernels are utilized to handle the decomposition efficiently.
4. **Interleaved Scaling by Alpha:**
- Scaling by the alpha factor is interleaved with load instructions to optimize the instruction pipeline
and reduce latency.
5. **Efficient Mask Preparation:**
- Masks are prepared within inline assembly code only at points where masked load-store operations are necessary,
minimizing unnecessary overhead.
6. **Broadcast Instruction Optimization:**
- In edge kernels where each FMA (Fused Multiply-Add) operation requires a broadcast without subsequent reuse,
the broadcast instruction is replaced with `mem_1to8`.
- This allows the compiler to optimize by assigning separate vector registers for broadcasting, thus avoiding
dependency chains and improving execution efficiency.
7. **C Matrix Update Optimization:**
- During the update of the C matrix in edge kernels, columns are pre-loaded into multiple vector registers.
This approach breaks dependency chains during FMA operations following the scaling by alpha, thereby mitigating
performance bottlenecks and enhancing throughput.
These optimizations collectively improve the performance of the DGEMM kernel, particularly in handling edge cases and
reducing overhead in critical loops. The changes are expected to yield significant performance gains in matrix multiplication
operations.
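As an illustration of the mask preparation mentioned above, the following minimal sketch computes the 8-bit lane mask that selects the low `m_left` lanes of a 512-bit vector of doubles. The helper name `edge_mask_f64` is hypothetical; the kernels in this patch build their masks in inline assembly, and this C helper only shows the arithmetic involved.

```c
#include <assert.h>
#include <stdint.h>

/* Returns a mask with the low m_left bits set, one bit per double lane
   of a 512-bit vector (8 lanes). m_left must be in [0, 8]. */
static uint8_t edge_mask_f64(int m_left)
{
    assert(m_left >= 0 && m_left <= 8);
    return (uint8_t)((1u << m_left) - 1u);
}
```

For example, a remainder of 3 rows gives the mask `0x07`, so a masked load or store touches only the first three doubles.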
This patch also changes the tiny gemm interface: it adds a light
interface for calling kernels and removes the calls to avx2 dgemm kernels,
since avx512 dgemm kernels are used for all sizes on zen4 and zen5.
On zen4 and zen5, the tiny kernel does not support inputs where the
A matrix is transposed (CRC, RRC), so such inputs are routed to the
gemm_small path.
AMD-Internal: [CPUPL-6054]
Change-Id: I57b430f9969ca39aa111b54fa169e4225b900c4a
- Enabled dynamic blocksizes for DGEMM in ZEN4 and ZEN5 systems.
- MC, KC and NC are dynamically selected at runtime for DGEMM native.
- A local copy of cntx is created and blocksizes are updated in the local cntx.
- Updated threshold for picking DGEMM SUP kernel for ZEN4.
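A minimal sketch of the "local copy of cntx" idea follows. The types, the default blocksizes and the selection rule are all placeholders invented for illustration (the real `cntx_t` and the tuned thresholds are not shown in this message); the point is only that the shared context is copied by value and the override is applied to the copy.

```c
#include <assert.h>

/* Hypothetical stand-ins for the blocksize set and the context. */
typedef struct { int mc, kc, nc; } blksz_sketch_t;
typedef struct { blksz_sketch_t dgemm; } cntx_sketch_t;

/* Illustrative defaults only; not the tuned ZEN4/ZEN5 values. */
static const cntx_sketch_t global_cntx = { { 144, 512, 4080 } };

/* Copies the shared context and overrides blocksizes in the copy.
   The rule below (halve MC for small m) is a placeholder. */
static cntx_sketch_t local_cntx_for(const cntx_sketch_t *global, int m, int n, int k)
{
    cntx_sketch_t local = *global; /* by-value copy; the shared cntx is never written */
    assert(m >= 0 && n >= 0 && k >= 0);
    if (m < 1000) local.dgemm.mc = global->dgemm.mc / 2;
    return local;
}
```

Because the override lives in a per-call copy, other operations that read the shared context keep seeing the default blocksizes.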
AMD-Internal: [CPUPL-5912]
Change-Id: Ic12a1a48bfa59af26cc17ccfa47a2a33fadde1f6
- Merged ZEN4 and ZEN5 DGEMM 8x24 kernel.
- Replaced 32x6 kernel with 8x24. Now same
kernel is used for ZEN4 and ZEN5.
- Blocksizes have been tuned for genoa only.
- DGEMM kernel for DTRSM native code path
is replaced with 8x24 kernel.
- Enabled alpha scaling during packing for ZEN4.
- ZEN4 8x24 kernel has been removed.
AMD-Internal: [CPUPL-5912]
Change-Id: I89a16a7e3355af037d21d453aabf53c5ecccb754
- The generic kernel is used if N is not a multiple of NR
or M is not a multiple of MR.
- This limits the maximum values of NR that can be used.
- Support for fringe case handling is added in the DGEMM
macro kernel so that the macro kernel can be used for
all problem sizes.
AMD-Internal: [CPUPL-5912]
Change-Id: I85c17e91d7511bb35ffed0f346d6ff0376baf62f
- Optimized DGEMM macro kernel does not
support mixed precision.
- This kernel was being used for solving
some of the mixed precision problems.
- Currently only ( bli_obj_elem(A) == 8 ) is used for checking
if the problem being solved is mixed precision.
- bli_obj_elem(A) will be equal to 8 for both double precision
data type and mixed precision case single-complex.
- Added extra checks (bli_obj_is_real( a )) to make sure that
A and B are real and DGEMM macro kernel is being used only
for DDDGEMM.
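The ambiguity described above can be reproduced with a minimal sketch. `obj_sketch_t` and the helper functions are hypothetical stand-ins, not the BLIS object API; the sketch assumes a single-precision complex element is stored as two floats.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Single-precision complex stored as two floats: 8 bytes, same as a double. */
typedef struct { float real, imag; } scomplex_sketch_t;

/* Hypothetical stand-in for the object descriptor. */
typedef struct { size_t elem_size; bool is_real; } obj_sketch_t;

static obj_sketch_t make_double_obj(void)
{
    obj_sketch_t o = { sizeof(double), true };
    return o;
}

static obj_sketch_t make_scomplex_obj(void)
{
    obj_sketch_t o = { sizeof(scomplex_sketch_t), false };
    return o;
}

/* Old check: element size alone cannot tell double from single complex. */
static bool size_only_check(obj_sketch_t a)
{
    return a.elem_size == 8;
}

/* New check: element size plus "both operands are real". */
static bool size_and_domain_check(obj_sketch_t a, obj_sketch_t b)
{
    return a.elem_size == 8 && a.is_real && b.is_real;
}
```

With the size-only check, a single-complex operand passes and would reach the DGEMM macro kernel; the combined check rejects it.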
AMD-Internal: [CPUPL-5804]
Change-Id: Iaa1accf8d851d11533f8ba31dc0235fbc14f89a9
- The optimized macro kernel (bli_dgemm_avx512_asm_8x24_macro_kernel)
for zen5 does not support alpha scaling. Alpha scaling is
supported by the zen5 micro kernel (bli_dgemm_avx512_asm_8x24).
- The optimized macro kernel expects alpha scaling to be done during
packing. The packing kernel used for mixed precision does not support
alpha scaling. Therefore, the optimized Zen5 macro kernel is not
compatible with the existing packing logic.
- Changes have been made to use the generic macro kernel, which in turn
uses the zen5 micro kernel (which supports alpha scaling) for mixed precision.
AMD-Internal: [CPUPL-5058]
Change-Id: I1bfeb32ae07eedafadad7dd2c62d63913a46e446
- Delete unused cmake files.
- Add guards around call to bli_cpuid_is_avx2fma3_supported
in frame/3/bli_l3_sup.c, currently assumes that non-x86
platforms will not use bli_gemmtsup.
- Correct variable in frame/base/bli_arch.c on non-x86
builds.
- Add guards around omp pragma to avoid possible gcc
compiler warning in kernels/zen/2/bli_gemv_zen_int_4.c.
- Add missing registers in clobber list in
kernels/zen4/1/bli_dotv_zen_int_avx512.c.
- Add gtestsuite ERS_IIT tests for TRMV, copied from TRSV.
- Correct calls to cblas_{c,z}swap in gtestsuite.
- Correct test name in ddotxf gtestsuite program.
AMD-Internal: [CPUPL-4415]
Change-Id: I69ad56390017676cc609b4d3aba3244a2df6a6b5
Corrections for spelling and other mistakes in code comments
and doc files.
AMD-Internal: [CPUPL-4500]
Change-Id: I33e28932b0e26bbed850c55602dee12fd002da7f
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
zen, skx and a couple of other kernels to cover all
contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
statements.
AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
- C <- alpha * op(A) * op(B) + beta * C.
C(n x n) - A(n x k) * B(k x n)
For ZEN4 and ZEN5
DGEMM is col-preferred kernel
DGEMMT = DGEMM + DGEMMT
DGEMM is col-preferred and DGEMMT is row-preferred.
DGEMM is evaluated as C = A*B (all col-storage)
whereas DGEMMT is evaluated as C = B * A (row-storage).
When A is packed, it is packed as row-panels with col-stored elements.
Since DGEMM is evaluated as C = A*B (A is col-stored), this aligns
with the col-stored preference.
For DGEMMT: C = B * A, here A becomes col-stored
because of packing, and as a result it breaks the DGEMMT
kernel assumption that A is row-stored.
- Fixed this by disabling this optimization for ZEN4
and ZEN5.
AMD-Internal: [CPUPL-5542]
Change-Id: I9645624be009d1050ecb908d65c04aadcfa04379
- In the initial patch, bli_dgemm_ker_var2 was called when m and n
were not multiples of MR and NR respectively. The macro-kernel
now handles these fringe cases as well.
- Replaced RBP register with R11 in the macro-kernel.
- Retuned MC, KC and NC with these new changes.
This will result in better performance for matrix sizes
like m=4000 or greater when running on single thread.
AMD-Internal: [CPUPL-5262]
Change-Id: I66c111ceb7feee776703339680d57e8d6d5c809a
- Logic to calculate the kernel index in AVX512
DGEMMT SUP framework is incorrect.
- The granularity for workload distribution along N
dimension is NR(8), whereas current logic to pick
diagonal kernel assumes the granularity to be MR (24).
- To fix this, the logic to determine the kernel index is
changed: instead of relying solely on n_offset, the kernel
index is derived from the distance from the diagonal.
- If the distance from the diagonal is greater than
LCM of (MR and NR) - NR, that means the current micro
panel is not a diagonal micro panel.
- If the micro panel is a diagonal micro panel, then the
distance from diagonal is equal to the M dimension for
initial full GEMM region or empty region of diagonal
kernel. This info can be used to determine the kernel index.
AMD-Internal: [CPUPL-5440]
Change-Id: I640d3a1b43e63b24bc9f0ed4a67cced45f6fa3b3
- In order to reuse 24x8 AVX512 DGEMM SUP kernels,
24x8 triangular AVX512 DGEMMT SUP kernels are added.
- Since the LCM of MR(24) and NR(8) is 24, therefore the diagonal
pattern repeats every 24x24 block of C. To cover this 24x24 block,
3 kernels are needed for one variant of DGEMMT. A total of 6
kernels are needed to cover both upper and lower variants.
- In order to maximize code reuse, the 24x8 kernels are broken
into two parts, 8x8 diagonal GEMM and 16x8 full GEMM. The 8x8
diagonal GEMM is computed by 8x8 diagonal kernel, and 16x8
full GEMM part is computed by 24x8 DGEMM SUP kernel.
- Changes are made in framework to enable the use of these kernels.
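The kernel count above follows from simple arithmetic, sketched below. The helper names are hypothetical, and the general formula (one diagonal kernel per NR-wide panel inside an LCM x LCM block) is an assumption of this sketch that happens to reproduce the numbers quoted for MR = 24 and NR = 8.

```c
#include <assert.h>

/* Greatest common divisor (Euclid). */
static int gcd_int(int a, int b)
{
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* Least common multiple of two positive integers. */
static int lcm_int(int a, int b)
{
    assert(a > 0 && b > 0);
    return a / gcd_int(a, b) * b;
}

/* Assumed rule: the diagonal pattern repeats every LCM(MR, NR), and each
   NR-wide panel inside that block needs its own diagonal kernel. */
static int diag_kernels_per_variant(int mr, int nr)
{
    return lcm_int(mr, nr) / nr;
}
```

For MR = 24 and NR = 8 this gives a 24x24 repeating block and 3 kernels per variant, hence 6 kernels for the upper and lower variants together.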
AMD-Internal: [CPUPL-5338]
Change-Id: I8e7007031e906f786b0c4fe12377ee439075207a
- Introduced new 8x24 macro kernels.
- 4 new kernels are added for beta 0, beta 1, beta -1
and beta N.
- IR and JR loop moved to ASM region.
- Kernels support row major storage scheme.
- Prefetch of current micro panel of C is enabled.
- Kernel supports negative offsets for A and B matrices.
- Moved alpha scaling from DGEMM kernel to B pack kernel.
- Tuned blocksizes for new kernel.
- Added support for alpha scaling in 24xk pack kernel.
- Reverted back to old b_next computation
in gemm_ker_var2.
- BugFix in 8x24 DGEMM kernel for beta 1,
comparison for jmp conditions was done using integer
instructions, which caused beta 1 path to never be taken.
Fixed this by changing the comparison to double.
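One way a bug of this kind can arise is sketched below in C rather than assembly. The function names are hypothetical and the sketch assumes IEEE 754 doubles; it is not the kernel's actual comparison code. Comparing the raw bit pattern of beta against the integer 1 never matches, because the double 1.0 is encoded as `0x3FF0000000000000`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Returns the raw 64-bit pattern of a double. */
static int64_t bits_of(double x)
{
    int64_t b;
    memcpy(&b, &x, sizeof b);
    return b;
}

/* Integer-style comparison: checks the bit pattern against integer 1. */
static bool is_beta_one_integer_cmp(double beta)
{
    return bits_of(beta) == 1;
}

/* Floating-point comparison: checks the value against 1.0. */
static bool is_beta_one_double_cmp(double beta)
{
    return beta == 1.0;
}
```

With the integer-style comparison, the beta 1 branch is unreachable for beta = 1.0, which matches the symptom described above.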
AMD-Internal: [CPUPL-5262]
Change-Id: Ieec207eea2a164603c8a8ea88e0b1d3095c29a3f
* commit 'cfa3db3f':
Fixed bug in mixed-dt gemm introduced in e9da642.
Removed support for 3m, 4m induced methods.
Updated do_sde.sh to get SDE from GitHub.
Disable SDE testing of old AMD microarchitectures.
Fixed substitution bug in configure.
Allow use of 1m with mixing of row/col-pref ukrs.
AMD-Internal: [CPUPL-2698]
Change-Id: I961f0066243cf26aeb2e174e388b470133cc4a5f
- Introduced new 8x24 row preferred kernel for zen5.
- Kernel supports row/col/gen
storage schemes.
- Prefetch of current panel of A and C
are enabled.
- Prefetch of next panel of B is enabled.
- Kernel supports negative offsets for A and B
matrices.
- Cache block tuning is done for zen5 core.
AMD-Internal: [CPUPL-5262]
Change-Id: I058ea7e1b751c20c516d7b27a1f27cef96ef730f
1. Enabled AVX512 path for
- Upper variant
- Different storage schemes for upper and lower variant
2. Modified mask value to handle all fringe cases correctly
AMD-Internal: [CPUPL-5091]
Change-Id: I4bf8aca24c1b87fff606deb05918b8e6216b729e
- Enabled DGEMMT SUP upper kernels in AVX512 code path.
- Enabled use of optimized kernels for all the storages
supported by optimized kernels.
AMD-Internal: [CPUPL-4881]
Change-Id: Id4486610dacaabc405fbc35b2588607c6508705e
AOCL libFLAME optimizations directly call some internal
BLIS symbols. Export them to enable this to work with
the BLIS shared library.
AMD-Internal: [CPUPL-5044]
Change-Id: Icb62dcb51e12d72dde8434593ab17de3c227c93d
Existing Design:
- GEMM AVX2 kernel performs computation and updates temporary C buffer
- Portion of temporary C buffer is copied to output C buffer
based on UPLO parameter
- For diagonal blocks, using GEMM kernels is not efficient
New Design: Implemented in current patch when UPLO='L'
- GEMMT kernel used for computation, temporary buffer is not required.
- Only required elements are computed using mask load store for all
fringe cases
- Exception: AVX2 code path is used when storage format is RRC, CRR, CRC
- AOCL-Dynamic is added based on dimension
- Check for AVX platform is added in SUP interface. It returns to
native implementation if the hardware does not support the AVX platform
- SUP ref_var2m is expanded for dcomplex datatype to avoid condition
check which exists for double datatype
AMD-Internal: [CPUPL-5006]
Change-Id: I3e21404b732b8f2df9cbdba394303752fdf36286
- In DGEMMT SUP AVX2 code path, triangular kernels
are added in order to avoid temporary C buffer.
- Since these kernels did not exist for AVX512,
AVX2 kernels were being used in GEMMT.
- AVX512 triangular GEMM kernel has been added
to make sure that AVX512 kernels can be used without
creating a temporary buffer.
- This kernel is added only for Lower variant of GEMMT,
for upper variant of DGEMMT, temporary C buffer is
created, full GEMM kernel is called on temporary C and
triangular region from temporary C is copied to C
buffer.
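To make the "no temporary buffer" idea concrete, here is a scalar reference sketch of the lower variant: only elements on or below the diagonal of C are computed and written. It is an illustrative reference loop with hypothetical names, assuming row-major storage; it is not the AVX512 triangular kernel itself.

```c
#include <assert.h>

/* Lower-triangular update: C(i,j) := alpha * sum_p A(i,p)*B(p,j) + beta * C(i,j)
   for j <= i only. Row-major; A is n x k, B is k x n, C is n x n. */
static void ref_gemmt_lower(int n, int k, double alpha, const double *a,
                            const double *b, double beta, double *c)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j <= i; ++j)
        {
            double acc = 0.0;
            for (int p = 0; p < k; ++p)
                acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = alpha * acc + beta * c[i * n + j];
        }
}

/* Small demo: A = [[1,2],[3,4]], B = identity, C pre-filled with 9.
   Returns C(i,j) after the update. */
static double demo_lower_entry(int i, int j)
{
    const double a[4] = { 1.0, 2.0, 3.0, 4.0 };
    const double b[4] = { 1.0, 0.0, 0.0, 1.0 };
    double       c[4] = { 9.0, 9.0, 9.0, 9.0 };
    assert(i >= 0 && i < 2 && j >= 0 && j < 2);
    ref_gemmt_lower(2, 2, 1.0, a, b, 0.0, c);
    return c[i * 2 + j];
}
```

In the demo the strictly upper element C(0,1) keeps its original value, since it is never read or written.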
AMD-Internal: [CPUPL-4881]
Change-Id: Id70645f79ae078ab9a7006e83d328505f1fae8a9
- Kernel dimensions are 4x4.
- Two kernels are implemented, Right Upper and
Right lower.
- In case of Left variants of TRSM, transpose is
induced so that Right variant kernels can be used.
- No packing is performed in these kernels.
- Changes are made in the threshold to pick ZTRSM small
code path.
- BLIS_INLINE is removed from signature of
"TRSMSMALL_KER_PROT".
- These kernels do not support "ENABLE_TRSM_PREINVERSION".
- Newly added kernels do not support conjugate
transpose.
- Added multithreading to ZTRSM small code path.
AMD-Internal: [CPUPL-4324]
Change-Id: I683b1d5239593e54f433e7f27497d72dfbd9141c
Implement full support for zen5 as a separate BLIS sub-configuration
and code path within amdzen configuration family.
AMD-Internal: [CPUPL-3518]
Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09
- Warning is raised for the implicit declaration of bli_gemm_md_is_ccr()
when BLIS is configured with --disable-mixed-dt flag.
- Encapsulated the usage of bli_gemm_md_is_ccr( ... ) inside the
BLIS_ENABLE_GEMM_MD macro.
AMD-Internal: [CPUPL-4630]
Change-Id: Icc59b1bcd3a21492daaaf6bcec80a5bf67012ace
Some text files were missing a newline at the end of the file.
One has been added.
AMD-Internal: [CPUPL-3519]
Change-Id: I4b00876b1230b036723d6b56755c6ca844a7ffce
1. Added input parameter checking for the extension APIs
1. gemm_pack_get_size API
2. gemm_pack API
2. Additionally added early returns for these APIs when
m or n dimensions are 0.
3. Routines for input parameter check for all the 3
BLAS extension APIs - gemm_pack_get_size, gemm_pack and
gemm_compute are defined in:
frame/compat/check/bla_gemm_pack_compute_check.h
4. Added AOCL DTL TRACE for all the functions of
1. gemm_pack_get_size
2. gemm_pack
3. gemm_compute
AMD-Internal: [CPUPL-3560]
Change-Id: I4351b8494d888eae7e7431a7e1e23e442ffc8631
1. OpenMP based multi-threading parallelism is added for BLAS
extension APIs of Pack and Compute
2. Both pack and compute APIs are parallelized.
3. Multi-threading of pack and compute APIs done with different
number of threads can lead to inconsistent results due to
output difference of the full packed matrix buffer when packed
with different number of threads.
4. In multi-threaded execution, we ensure output of packed buffer
is exactly the same as in single threaded execution.
5. Similarly for compute API, read of packed buffer in multi-
threaded execution is exactly the same as in single-threaded
execution.
6. Routines are added to compute the offsets for thread workload
distribution for MT execution.
1. The offsets are calculated in such a way that it resembles
the reorder buffer traversal in single threaded reordering.
2. The panel boundaries (KCxNC) remain as they are when accessed by a
single thread, and as a consequence a thread with jc_start
inside the panel cannot consider the full NC range for reorder.
3. It has to work with NC' < NC, and the offset is calculated
using prev NC panels spanning k dim + cur NC panel spanning
pc loop cur iteration + (NC - NC') spanning current
kc0 (<= KC).
7. Routines to ensure the same are added for MT execution
1. frame/base/bli_pack_compute_utils.c
2. frame/base/bli_pack_compute_utils.h
AMD-Internal: [CPUPL-3560]
Change-Id: I0dad33e0062519de807c32f6071e61fba976d9ac
- Added support for 2 new APIs:
1. sgemm_compute()
2. dgemm_compute()
These are dependent on the ?gemm_pack_get_size() and ?gemm_pack()
APIs.
- ?gemm_compute() takes the packed matrix buffer (represented by the
packed matrix identifier) and performs the GEMM operation:
C := A * B + beta * C.
- Whenever the kernel storage preference and the matrix storage
scheme do not match, and the matrix being loaded is not
packed either, on-the-go packing is enabled to
pack that matrix.
- Note: If both the matrices are packed using the ?gemm_pack() API,
it is the responsibility of the user to pack only one matrix with
alpha scalar and the other with a unit scalar.
- Note: Support is presently limited to Single Thread only. Both, pack
and compute APIs are forced to take n_threads=1.
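The note about alpha can be checked with scalar arithmetic. The two functions below are hypothetical stand-ins that treat A, B and C as 1x1 matrices; they are not the ?gemm_pack() or ?gemm_compute() APIs and only show why alpha must be applied to exactly one operand.

```c
#include <assert.h>

/* Stand-in for packing: scales the (1x1) matrix by the given scalar. */
static double pack_scaled(double x, double alpha)
{
    return alpha * x;
}

/* Stand-in for compute on packed operands: C := A * B + beta * C. */
static double compute_packed(double a_packed, double b_packed, double beta, double c)
{
    return a_packed * b_packed + beta * c;
}
```

With alpha = 2, A = 3 and B = 5, packing A with alpha and B with a unit scalar yields 30 = alpha * A * B, while packing both with alpha yields 60, which is scaled by alpha twice.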
AMD-Internal: [CPUPL-3560]
Change-Id: I825d98a0a5038d31668d2a4b84b3ccc204e6c158
Configuration x86_64 includes all Intel and AMD sub-configurations.
Fixes to enable this to work correctly again are:
- In config_registry use amdzen rather than amd64 in x86_64 family.
- Copy settings from config/amdzen/bli_family_amdzen.h to
config/x86_64/bli_family_x86_64.h
- Modify configure to set enable_aocl_zen=yes for x86_64, but not
for amd64_legacy.
- Add "if defined(BLIS_FAMILY_X86_64)" to frame/3/bli_l3_sup.c and
frame/3/bli_l3_sup_int_amd.c so zen-specific code paths are
enabled.
Note: sub-configurations knl and bulldozer use instructions that are
not supported on most x86_64 processors.
AMD-Internal: [CPUPL-3838]
Change-Id: I0bd8fd89ccd846f80e5491ef44ade7d409970b04
bli_gemmt_sup_var1n2m.c contained x86 specific code. Move to
frame/3/gemmt/bli_gemmt_sup_var1n2m_amd.c and restore
bli_gemmt_sup_var1n2m.c as of commit 10ca8710f0 as variant
for non-AMD codepath builds.
AMD-Internal: [CPUPL-3838]
Change-Id: I88db20b93b2dbcbbf5092a4cb78f14dd1179975f
* commit 'b683d01b':
Use extra #undef when including ba/ex API headers.
Minor preprocessor/header cleanup.
Fixed typo in cpp guard in bli_util_ft.h.
Defined eqsc, eqv, eqm to test object equality.
Defined setijv, getijv to set/get vector elements.
Minor API breakage in bli_pack API.
Add err_t* "return" parameter to malloc functions.
Always stay initialized after BLAS compat calls.
Renamed membrk files/vars/functions to pba.
Switch allocator mutexes to static initialization.
AMD-Internal: [CPUPL-2698]
Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df
- TRSM and GEMM have different blocksizes in zen4. To
accommodate this, a local copy of cntx was created in TRSM.
- Local copy of cntx has been removed and TRSM blocksizes are
stored in cntx->trsmblkszs.
- Functions to override and restore default blocksizes for TRSM
are removed. Instead of overriding the default blocksizes,
TRSM blocksizes are stored separately in cntx.
- Pack buffers for TRSM have to be packed with TRSM blocksizes
and GEMM pack buffers have to be packed with default blocksizes.
To check if we are packing for TRSM, "family" argument is added
in bli_packm_init_pack function.
- BLIS_GEMM_FOR_TRSM_UKR has to be used for TRSM if it is set, if
it is not set then BLIS_GEMM_UKR has to be used. This functionality
has been added to all TRSM macro kernels.
- Methods to retrieve TRSM blocksizes from cntx are added
to bli_cntx.h.
- Tests for micro kernels are modified to accommodate the change in
signature of bli_packm_init_pack.
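The selection rules above can be sketched as follows. The struct, its fields, the placeholder kernels and the blocksize values are hypothetical; the sketch assumes an unset BLIS_GEMM_FOR_TRSM_UKR is represented by a NULL pointer and that the "family" argument reduces to a TRSM / non-TRSM flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*ukr_ft)(void);

/* Hypothetical context holding default and TRSM-specific entries side by side. */
typedef struct
{
    ukr_ft gemm_ukr;          /* stands in for BLIS_GEMM_UKR */
    ukr_ft gemm_for_trsm_ukr; /* stands in for BLIS_GEMM_FOR_TRSM_UKR; NULL if unset */
    int    gemm_mc;           /* default blocksize (illustrative value) */
    int    trsm_mc;           /* TRSM blocksize (illustrative value) */
} trsm_cntx_sketch_t;

static void gemm_ukr_placeholder(void) {}
static void trsm_gemm_ukr_placeholder(void) {}

static const trsm_cntx_sketch_t cntx_with_trsm_ukr =
    { gemm_ukr_placeholder, trsm_gemm_ukr_placeholder, 144, 120 };
static const trsm_cntx_sketch_t cntx_without_trsm_ukr =
    { gemm_ukr_placeholder, NULL, 144, 120 };

/* Use the TRSM-specific GEMM micro-kernel if set, else the default one. */
static ukr_ft gemm_ukr_for_trsm(const trsm_cntx_sketch_t *cntx)
{
    return cntx->gemm_for_trsm_ukr != NULL ? cntx->gemm_for_trsm_ukr : cntx->gemm_ukr;
}

/* Pick the blocksize used for packing according to the operation family. */
static int packm_mc(const trsm_cntx_sketch_t *cntx, bool is_trsm_family)
{
    return is_trsm_family ? cntx->trsm_mc : cntx->gemm_mc;
}
```

Keeping both sets in the context means neither operation has to override and later restore the other's blocksizes.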
AMD-Internal: [CPUPL-3781]
Change-Id: Ia567215d6d1aa0f14eae5d3177f4a3dd63b4b20a
Details:
- Eliminated the need for override function in SUP for GEMMT/SYRK.
- New set of block sizes, kernels and kernel preferences
are added to cntx data structure for level-3 triangular routines.
- Added supporting functions to set and get the above parameters from cntx.
- Modified GEMMT/SYRK SUP code to use these new block sizes/kernels.
In case they are not set, use the default block sizes/kernels of
Level-3 SUP.
AMD-Internal: [CPUPL-3649]
Change-Id: Iee11bd4c4f1d8fbbb749c296258d1b8121c009a0
- Added Smart Threading logic for AVX-512 based SGEMM SUP.
- Calculating ic and jc for optimal work distribution to the allocated
threads based on logic similar to Zen3.
- Zen4 architecture-specific Native-to-SUP check has been added to
redirect a few Native inputs to the SUP path, since in a
multi-threaded environment some Native cases perform better as SUP.
- For the same, the SUP thresholds, namely, BLIS_MT and BLIS_NT have
been increased from 512 and 200 to 682 and 512, respectively.
- Further optimizations to the work distribution logic will be added
subsequently.
AMD-Internal: [CPUPL-3248]
Change-Id: Ibccbbefef251010ec94bd37ffc86c35b7866a5ca
Details:
- Overriding of blocksizes with avx-2 specific ones(6x8) is done
for gemmt/syrk because near-to-square shaped kernel performs
better than skewed/rectangular shaped kernel.
- Overriding is done for S,D and Z datatypes.
AMD-Internal: [CPUPL-3060]
Change-Id: I304ff4264ff735b7c31f7b803b046e1c49c9ad53
Thanks to Moore, Branden <Branden.Moore@amd.com> for identifying the
race condition and suggesting the changes to fix the same
Existing Design:
- AOCL progress callback pointer is a global pointer which is shared
across all threads
Existing Design challenges:
- The callback function cannot safely disable the progress mechanism,
as another thread may have already checked that the function
pointer is set and then re-reads the pointer upon invocation of
the callback. If one thread sets the callback to NULL in this window,
the other thread will attempt to call the null pointer as a
function pointer, leading to a segfault.
New Design :
- Each thread maintains a local copy of progress pointer
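The check-then-reread hazard and one way to avoid it can be sketched as below. This is not the AOCL progress API: the names are hypothetical, and the sketch uses a C11 atomic shared pointer that each call snapshots into a local variable, which is one possible reading of "local copy" and an assumption of this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef int (*progress_cb_ft)(const char *api, long done);

/* Shared callback pointer; zero-initialized, so it starts out as NULL. */
static _Atomic(progress_cb_ft) shared_progress_cb;

/* Installs a callback (or NULL to disable) and returns the previous one. */
static progress_cb_ft set_progress_cb(progress_cb_ft cb)
{
    return atomic_exchange(&shared_progress_cb, cb);
}

/* Reads the shared pointer exactly once; the check and the call both use
   the local copy, so a concurrent reset to NULL cannot be called through. */
static int report_progress(const char *api, long done)
{
    progress_cb_ft local_cb = atomic_load(&shared_progress_cb);
    if (local_cb == NULL) return 0;
    return local_cb(api, done);
}

/* Demo callback: returns the progress count it was given. */
static int demo_cb(const char *api, long done)
{
    (void)api;
    return (int)done;
}
```

The unsafe pattern would test the shared pointer and then read it again for the call; the snapshot removes the second read.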
AMD-Internal: [SWLCSG-1971]
Change-Id: I282989805a4a2a8a759a7373b645f3569bf42ed4
Corrections for spelling and other mistakes in code comments
and doc files.
AMD-Internal: [CPUPL-2870]
Change-Id: Ifbb5df7df2d6312fe73e06ee6d41c00b16c593ce
- In gemmt and normf, #ifdef BLIS_KERNELS_* is added
to make sure only compiled kernels are used.
- In bla_copy and bla_swap, missing '\' is added.
AMD-Internal: [CPUPL-2870]
Change-Id: I83452dff761f60db6957f557321ce210ab72c037
Details:
- Added a new function for choosing between SUP and
native implementation for a given size.
- This function pointer is stored in cntx for zen4 config.
- Divided total combinations of sizes into 3 categories:
- one dimension is small
- Two dimensions are small
- All dimensions are small
- Added different threshold conditions for each of the
categories.
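The three-category decision can be sketched as follows. All thresholds and names are placeholders chosen for illustration, not the tuned zen4 values, and the per-category conditions are assumed to be simple limits on the largest dimension.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder thresholds; the real values are tuned for zen4. */
enum { SMALL_DIM = 128, LIMIT_TWO_SMALL = 4096, LIMIT_ONE_SMALL = 1024 };

/* Counts how many of m, n, k fall below the "small" threshold. */
static int count_small_dims(int m, int n, int k)
{
    return (m < SMALL_DIM) + (n < SMALL_DIM) + (k < SMALL_DIM);
}

static int max3(int a, int b, int c)
{
    int ab = a > b ? a : b;
    return ab > c ? ab : c;
}

/* Returns true for the SUP path, false for the native path. */
static bool choose_sup(int m, int n, int k)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    switch (count_small_dims(m, n, k))
    {
        case 3:  return true;                               /* all dimensions small */
        case 2:  return max3(m, n, k) <= LIMIT_TWO_SMALL;   /* two dimensions small */
        case 1:  return max3(m, n, k) <= LIMIT_ONE_SMALL;   /* one dimension small */
        default: return false;                              /* no small dimension */
    }
}
```

A function with this shape could be stored as a function pointer in the context, so each configuration can supply its own rule.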
AMD-Internal: [CPUPL-2755]
Change-Id: Iae4bf96bb7c9bf9f68fd909fb757d7fe13bc6caf
- Reverted the SUP blocksizes and kernels to use AVX2 SUP kernels for
SGEMM. This can be updated once GEMMT-specific optimizations are added
for AVX-512.
- Updated 'bli_zen4_override_gemm_blkszs()' in zen4 context to override
blocksize and kernels for SGEMM SUP to enable AVX-512 kernels for
SGEMM operation.
AMD-Internal: [CPUPL-3060]
Change-Id: Ic9b3037363b6e5b59e5035c81651c97ce95d6d9a