Reordering B matrix of datatype int4 is done as per the pack schema
requirements of u8s8s32 kernel. However for fringe cases, the
matrix pointer increments need to be halved to account for the half
byte size of int4 elements.
AMD-Internal: [SWLCSG-2390]
Change-Id: I22a04c4c8133db6ae6ca0a4d3e86c11aba1e2cdb
* commit '81e10346':
Alloc at least 1 elem in pool_t block_ptrs. (#560)
Fix insufficient pool-growing logic in bli_pool.c. (#559)
Arm SVE C/ZGEMM Fix FMOV 0 Mistake
SH Kernel Unused Eigher
Arm SVE C/ZGEMM Support *beta==0
Arm SVE Config armsve Use ZGEMM/CGEMM
Arm SVE: Update Perf. Graph
Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0
Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0
A64FX Config Use ZGEMM/CGEMM
Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg
Arm SVE Add SGEMM 2Vx10 Unindexed
Arm SVE ZGEMM Support Gather Load / Scatt. St.
Arm SVE Add ZGEMM 2Vx10 Unindexed
Arm SVE Add ZGEMM 2Vx7 Unindexed
Arm SVE Add ZGEMM 2Vx8 Unindexed
Update Travis CI badge
Armv8 Trash New Bulk Kernels
Enable testing 1m in `make check`.
Config ArmSVE Unregister 12xk. Move 12xk to Old
Revert __has_include(). Distinguish w/ BLIS_FAMILY_**
Register firestorm into arm64 Metaconfig
Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo
Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo
Add test for Apple M1 (firestorm)
Firestorm CPUID Dispatcher
Armv8 GEMMSUP Edge Cases Require Signed Ints
Make error checking level a thread-local variable.
Fix data race in testsuite.
Update .appveyor.yml
Firestorm Block Size Fixes
Armv8 Handle *beta == 0 for GEMMSUP ??r Case.
Move unused ARM SVE kernels to "old" directory.
Add an option to control whether or not to use @rpath.
Fix $ORIGIN usage on linux.
Arm micro-architecture dispatch (#344)
Use @path-based install name on MacOS and use relocatable RPATH entries for testsuite inaries.
Armv8 Handle *beta == 0 for GEMMSUP ?rc Case.
Armv8 Fix 6x8 Row-Maj Ukr
Apply patch from @xrq-phys.
Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs.
bli_error: more cleanup on the error strings array
Arm SVE Exclude SVE-Intrinsic Kernels for GCC 8-9
Arm SVE: Correct PACKM Ker Name: Intrinsic Kers
Fix config_name in bli_arch.c
Arm Whole GEMMSUP Call Route is Asm/Int Optimized
Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref
Header Typo
Arm: DGEMMSUP ??r(rv) Invoke Edge Size
Arm: DGEMMSUP ?rc(rd) Invoke Edge Size
Arm: Implement GEMMSUP Fallback Method
Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin
Added Apple Firestorm (A14/M1) Subconfig
Arm64 8x4 Kernel Use Less Regs
Armv8-A Supplimentary GEMMSUP Sizes for RD
Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm
Armv8-A Adjust Types for PACKM Kernels
Armv8-A GEMMSUP-RD 6x8m
Armv8-A GEMMSUP-RD 6x8n
Armv8-A s/d Packing Kernels Fix Typo
Armv8-A Introduced s/d Packing Kernels
Armv8-A DGEMMSUP 6x8m Kernel
Armv8-A DGEMMSUP Adjustments
Armv8-A Add More DGEMMSUP
Armv8-A Add GEMMSUP 4x8n Kernel
Armv8-A Add Part of GEMMSUP 8x4m Kernel
Armv8A DGEMM 4x4 Kernel WIP. Slow
Armv8-A Add 8x4 Kernel WIP
AMD-Internal: [CPUPL-2698]
Change-Id: I194ff69356740bb36ca189fd1bf9fef02eec3803
Use of AOCL_ENABLE_INSTRUCTIONS in dgemm tiny code path is
unnecessary and incorrectly caused AVX512 code to be run
on zen4 and later processors when AOCL_ENABLE_INSTRUCTIONS=avx2
or equivalent options was selected.
Replace with code to select kernel in a similar way to other
dgemm code paths and other APIs. Note that at present AVX2 code
is used the smallest matrix sizes on all zen platforms.
AMD-Internal: [CPUPL-5078]
Change-Id: Ie6b4895461cbbb915d2b48b92fc063f5cd6adb85
- Implemented bli_dnorm2fv_unb_var1_avx512( ... ) AVX512
computational kernel for DNRM2 API.
- Updated the header to include this kernel signature, as well
as the framework layer to use this function in case of ZEN4
and ZEN5 configurations.
- Updated the tipping points for ideal thread setting in DNRM2
for ZEN5 micro-architecture. These thresholds are specific
to the library's linkage to LLVM's OpenMP or GNU's OpenMp.
- Further abstracted the AOCL-DYNAMIC logic to separate functions
for ?NRM2 APIs that currently support it(namely, DNRM2 and ZNRM2).
- Further updated the ?NRM2 framework to accommodate the necessary
changes to invoke the newer AOCL-DYNAMIC functions and the AVX512
kernel, when needed.
- Added micro-kernel and memory tests for this kernel in GTestsuite,
to validate accuracy and out-of-bounds read and write.
AMD-Internal: [CPUPL-5265]
Change-Id: I4fc0d0f1e6906bf27d46562ca387c338cc4d2049
Support for reordering B matrix of datatype int4 as per the pack schema
requirements of u8s8s32 kernel. Vectorized int4_t -> int8_t conversion
implemented via leveraging the vpmultishiftqb instruction. The reordered
B matrix will then be used in the u8s8s32o<s32|s8> api.
AMD-Internal: [SWLCSG-2390]
Change-Id: I3a8f8aba30cac0c4828a31f1d27fa1b45ea07bba
Details:
- For a variable x, Using address of x in an instruction throws
exception if the difference between &x and access position is
larger than 2 GiB. To solve this issue all variables are stored
within the JIT code section and are accessed using relative addressing.
- Fixed a bug in B matrix pack function for s8s8s32os32 API.
- Fixed a bug in JIT code to apply bias on col-major matrices.
AMD-Internal: [SWLCSG-2820]
Change-Id: I82f117a0422c794cb9b1a4d65a89d60de4adfd96
- Updated the existing code-path for ?AXPBYV to
reroute the inputs to the appropriate L1 kernel,
based on the alpha and beta value. This is done
in order to utilize sensible optimizations with
regards to the compute and memory operations.
- Updated the typed API interface for ?AXPBYV to include
an early exit condition(when n is 0, or when alpha is
0 and beta is 1). Further updated this layer to query
the right kernel from context, based on the input values
of alpha and beta.
- Added the necessary L1 vector kernels(i.e, ?SETV, ?ADDV,
?SCALV, ?SCAL2V and ?COPYV) to be used as part of special
case handling in ?AXPBYV.
- Moved the early return with negative increments from ?SCAL2V
kernels to its typed API interface.
- Updated the zen, zen2 and zen3 context to include function
pointers for all these vector kernels.
- Updated the existing ?AXPBYV vector kernels to handle only
the required computation. Additional cleanup was done to
these kernels.
- Added accuracy and memory tests for AVX2 kernels of ?SETV
?COPYV, ?ADDV, ?SCALV, ?SCAL2V, ?AXPYV and ?AXPBYV APIs
- Updated the existing thresholds in ?AXPBYV tests for complex
types. This is due to the fact that every complex multiplication
involves two mul ops and one add op. Further added test-cases
for API level accuracy check, that includes special cases of
alpha and beta.
- Decomposed the reference call to ?AXPBYV with several other
L1 BLAS APIs(in case of the reference not supporting its own
?AXPBYV API). The decomposition is done to match the exact
operations that is done in BLIS based on alpha and/or beta
values. This ensures that we test for our own compliance.
AMD-Internal: [CPUPL-4861]
Change-Id: Ia6d48f12f059f52b31c0bef6c75f47fd364952c6
- Updated the fused kernels (DOTXF and AXPYF) to properly handle cases
when b_n > fuse_factor.
- The fused kernels are expected to invoke respective Level-1 kernels
iteratively when b_n > fuse_factor.
AMD-Internal: [CPUPL-5246]
Change-Id: Ie7a0f4e61ede088663e3491269b3f1398d028095
BUG:
When alpha real and imaginary is zero
Output is computed as C= Beta * C + A * B instead of C = Beta * C
FIX:
Updated kernel to scale A * B product with alpha in case of alpha=0
Existing framework design:
- When alpha real and imaginary value is zero, framework handles to skip
kernel call to avoid alpha * A * B operation
- SCALM is invoked to perform Beta * C
- Accuracy issue was not observed as alpha=0 was handled in framework
- If we call kernel directly with alpha=0, results would be wrong
- Issue was figured out during microkernel testing using gtestsuite
AMD-Internal: [CPUPL-4454]
Change-Id: Ib6113f5226cd7c26a63781cdd20d35660f453803
- When n=1, reorder of B matrix is avoided to efficiently
process data. A dot-product based kernel is implemented to
perform gemv when n=1.
AMD-Internal: [SWLCSG-2354]
Change-Id: If5f74651ab11232d0b87d34bd05f65aacaea94f1
- Introduced new 8x24 row preferred kernel for zen5.
- Kernel supports row/col/gen
storage schemes.
- Prefetch of current panel of A and C
are enabled.
- Prefetch of next panel of B is enabled.
- Kernel supports negative offsets for A and B
matrices.
- Cache block tuning is done for zen5 core.
AMD-Internal: [CPUPL-5262]
Change-Id: I058ea7e1b751c20c516d7b27a1f27cef96ef730f
Description:
- _mm512_storeu_epi8 and _mm512_storeu_epi16
intrensic instructions are not available in gcc-10
- Replaced above intrensics _mm512_storeu_si512
Change-Id: I2878780b7acd040ccf45e571d486ff8c2388088c
-As it stands the bf16bf16f32ob16 API expects bias array to be of type
float. However actual use case requires the usage of bias array of bf16
type. The bf16 micro-kernels are updated to work with bf16 bias array by
upscaling it to float type and then using it in the post-ops workflow.
-Corrected register usage in bf16 JIT generator for bf16bf16f32ob16 API
when k > KC.
AMD-Internal: [SWLCSG-2604]
Change-Id: I404e566ff59d1f3730b569eb8bef865cb7a3b4a1
Description:
--Added support for tranB in u8s8s32o<s32|s8> and
s8s8s32o<s32|s8> API's
--Updated the bench_lpgemm by adding options to
support transpose of B matrix
--Updated data_gen_script.py in lpgemm bench
according to latest input format.
AMD-Internal: [SWLCSG-2582]
Change-Id: I4a05cc390ae11440d6ff86da281dbafbeb907048
Export more symbols for BLIS kernels so that AOCL libFLAME
optimizations can call them directly.
AMD-Internal: [CPUPL-5044]
Change-Id: I45392b8a2a14ac2816141521b90b7ddb1216c733
1. Enabled AVX512 path for
- Upper variant
- Different storage schemes for upper and lower variant
2. Modified mask value to handle all fringe cases correctly
AMD_Internal: [CPUPL-5091]
Change-Id: I4bf8aca24c1b87fff606deb05918b8e6216b729e
- Fixed bug in DAXPYF MT kernel when incx != inca.
- Added AOCL Dynamic function for 1f kernels.
- Moved all DOTXF and AXPYF kernels into one file.
AMD-Internal: [CPUPL-4880]
Change-Id: I7d9f44625bc42fad4a9e5b218ecc382efdf22cbe
- Enabled DGEMMT SUP upper kernels in AVX512 code path.
- Enabled use of optimized kernels for all the storages
supported by optimized kernels.
AMD-Internal: [CPUPL-4881]
Change-Id: Id4486610dacaabc405fbc35b2588607c6508705e
- Implemented optimized lpgemv for both m == 1 and n == 1 cases.
- Fixed few bugs in LPGEMV for bf16 and f32 datatypes.
- Fixed few bugs in JIT-based implementation of LPGEMM for BF16
datatype.
AMD-Internal: [SWLCSG-2354]
Change-Id: I245fd97c8f160b148656f782d241f86097a0cf38
- Reduced number of jump operations in AVX512
assembly kernel for SCOPYV, DCOPYV and ZCOPYV.
- Fixed memory test failure for bli_zcopyv_zen_int_avx512
kernel.
- Replaced existing AVX2 COPYV intrinsic kernels in
bli_cntx_init_zen5.c with AVX512 assembly kernels.
Change-Id: Idc11601b526d6d82cfbdf63af2fd331918b31159
- Utilized the memory testing feature in GTestsuite
to update the testing interfaces for micro-kernel
testing of SCOPY, DCOPY and ZCOPY APIs.
Change-Id: I3d6905f33b000b8d5e60727aa896bd869f4f441f
- Existing vectorizes code was disabled because
of the failures observed in matlab tests.
- The issue is caused by underflow during division when diagonal
elements of A matrix are very small.
- When diagonal is very small (4E-324 in case of matlab), sqauring the
diagonal during divison causes the square to be rounded off to zero.
- Fix is to normalise (ar) and (ai) by dividing (ar) and (ai) by
max(ar, ai), this will make either (ar) or (ai) 1, and hence
reduce the likelihood of underflow.
AMD-Internal: [CPUPL-5052]
Change-Id: Iff7893fdcb92907a12e6af8e102a92637a13ce4f
AOCL libFLAME optimizations directly call some internal
BLIS symbols. Export them to enable this to work with
the BLIS shared library.
AMD-Internal: [CPUPL-5044]
Change-Id: Icb62dcb51e12d72dde8434593ab17de3c227c93d
- Added AVX512 kernel for ZDOTV.
- Multithreaded both ZDOTC and ZDOTU with AOCL_DYNAMIC support.
AMD-Internal: [CPUPL-5011]
Change-Id: I56df9c07ab3b8df06267a99835b088dcada81bd8
Existing Design:
- GEMM AVX2 kernel performs computation and updates temporary C buffer
- Portion of temporary C buffer is copied to output C buffer
based on UPLO parameter
- For diagonal blocks, using GEMM kernels is not efficient
New Design: Implemented in current patch when UPLO='L'
- GEMMT kernel used for computation, temporary buffer is not required.
- Only required elements are computed using mask load store for all
fringe cases
- Exception: AVX2 code path is used when storage format is RRC, CRR, CRC
- AOCL-Dynamic is added based on dimension
- Check for AVX platform is added in SUP interface, It returns to
native implementation if hardware doesnot support AVX platform
- SUP ref_var2m is expanded for dcomplex datatype to avoid condition
check which exists for double datatype
AMD_Internal: [CPUPL-5006]
Change-Id: I3e21404b732b8f2df9cbdba394303752fdf36286
1. The 5 LOOP LPGEMM path is in-efficient when A or B is a vector
(i.e, m == 1 or n == 1).
2. An efficient implementation is developed considering the b matrix
reorder in case of m=1 and post-ops fusion.
3. When m = 1 the algorithm divide the GEMM workload in n dimension
intelligently at a granularity of NR. Each thread work on A:1xk
B:kx(>=NR) and produce C=1x(>NR). K is unrolled by 4 along with
remainder loop.
4. When n = 1 the algorithm divide the GEMM workload in m dimension
intelligently at a granularity of MR. Each thread work on A:(>=MR)xk
B:kx1 and produce C = (>=MR)x1. When n=1 reordering of B is avoided
to efficiently process in n one kernel.
AMD-Internal: [SWLCSG-2355]
Change-Id: I7497dad4c293587cbc171a5998b9f2817a4db880
- In AVX512 ZTRSM kernel, vertorizes division code
is causing failures in matlab.
- The logic is identical in reference C code and intrinsics code,
but intrinsics code is causing failure
- Replaced optimized intrinsics code with C code.
AMD-Internal: [CPUPL-5052]
Change-Id: Iea184330b22c46d979867b870486066ef980eb84
- Updated the AVX512 DOTXF kernels to use MASKZ loads
instead of MASK loads when loading X vector in fringe
case. This avoids compiler warnings of uninitialized
vector as input to the intrinsic.
- The functionality will not change when using either MASK
or MASKZ loads on X, since A matrix is loaded using MASKZ
loads.
AMD-Internal: [CPUPL-4974]
Change-Id: I1ef98a1292352d0e905cc09cd5667acd883df827
- In DGEMMT SUP AVX2 code path, traingular kernels
are added in order to avoid temporary C buffer.
- Since these kernels did not exist for AVX512,
AVX2 kernels were being used in GEMMT.
- AVX512 triangular GEMM kernel has been added
to make sure that AVX512 kernels can be used without
creating a temporary buffer.
- This kernel is added only for Lower variant of GEMMT,
for upper variant of DGEMMT, temporary C buffer is
created, full GEMM kernel is called on temporary C and
traingular region from temporary C is copied to C
buffer.
AMD-Internal: [CPUPL-4881]
Change-Id: Id70645f79ae078ab9a7006e83d328505f1fae8a9
- Kernel dimensions are 4x4.
- Two kernels are implemented, Right Upper and
Right lower.
- In case of Left variants of TRSM, transpose is
induced so that Right variant kernels can be used.
- No packing is performed in these kernels.
- Changes are made in the threshold to pick ZTRSM small
code path.
- BLIS_INLINE is removed from signature of
"TRSMSMALL_KER_PROT".
- These kernels do not support "ENABLE_TRSM_PREINVERSION".
- Newly added kernels do not support conjugate
transpose.
- Added multithreading to ZTRSM small code path.
AMD-Internal: [CPUPL-4324]
Change-Id: I683b1d5239593e54f433e7f27497d72dfbd9141c
- Added DAXPYF and DDOTXF AVX512 kernels.
- Fuse factor for ddotxf kernel is 8.
- 2 DAXPYF kernels are added, with fuse
factor 8 and 32.
- Multithreading is also added to the DAXPYf
kernel with fuse factor 32.
- These kernels are internally used by TRSM.
- Added changes in TRSV to call these kernels
in ZEN4
AMD-Internal: [CPUPL-4880]
Change-Id: I12850de974b437bbca07677b68bc3d6a35858770
- Implemented AVX512 kernels for handling the calls to ZGEMV
with transpose to A matrix.
- This includes the set of ZDOTXF and ZDOTXV kernels. ZDOTXF
kernels include those with fuse-factor 8 (main kernel), 4
and 2(fringe kernels).
- Updated the bli_zgemv_unf_var1( ... ) function to update
the function pointers to these kernels, based on the
configuration.
AMD-Internal: [CPUPL-4974]
Change-Id: I313ae0abe9dc119de849da42f9825b71f11b1fda
- Implemented AVX512 kernels for handling the calls to ZGEMV
with no-transpose to A matrix.
- This includes the ZAXPYF, ZAXPYV and ZSETV kernels.
The set of ZAXPYF kernels include those with fuse-factor 8
(main kernel), 4 and 2(fringe kernels).
- Updated the bli_zgemv_unf_var2( ... ) function to set
the function pointers to these kernels, based on the
configuration. Further added the call to ZSETV at this
layer in case beta is 0.
AMD-Internal: [CPUPL-4974]
Change-Id: Iee4b724719e49023138bb16479765be44d677cd9
- Implemented AVX512 kernels for scopyv_, dcopyv_ and zcopyv_
using respective AVX512 intrinsics including masked
load and store operations.
- Implemented AVX512 kernels for scopy_, dcopy_ and
zcopy_ using assembly language to prevent loss of
performance during the translation of intrinsics.
- Updated the dcopy_blis_impl( ... ) and
zcopy_blis_impl( ... ) function to support
multithreaded calls to the respective computational
kernels, if and when the OpenMP support is enabled.
- Implemented OpenMP parallelization for dcopyv_ and
zcopyv_ APIs, while scopyv_ and ccopyv_ only support
single thread.
AMD-Internal: [CPUPL-4854]
Change-Id: I5fbd0bcca4e59001fbe2b1168b624d0c33242b3e
Details:
- variable m0 is being loaded into a register without typecasting
it to uint64_t. This resulted in seg-fault when int size is set
to be 32 bits during configure time.
- Any variable that is loaded using mov in assembly needs to be
typecasted to uint64_t before begin_asm, so that change in size
of integer doesn't affect the functionality.
- Modified all instances using variable m0 to use variable 'm' where
m = (uint64_t)m0;
AMD-Internal: [CPUPL-4971]
Change-Id: I49b66d2cacf19ace40ab44c9f85904644e8921f4
- In 2x1 fringe case in [RUN/RLT] kernel, 3 scomplex
precision numbers are being read instead of 1 scomplex.
- Fixed the code to read only one scomplex.
AMD-Internal: [CPUPL-4403]
Change-Id: If3ac03ed864618382d3a382a8cdff7ff8a94eb7d
Implement full support for zen5 as a separate BLIS sub-configuration
and code path within amdzen configuration family.
AMD-Internal: [CPUPL-3518]
Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09
Description:
1. Replaced aligned load intrinsics _mm512_load_ps
with unaligned load intrinsics _mm512_loadu_ps.
2. There is no guarantee that the memory address
can be aligned everywhere. The changes are under
beta multiplication. Copy paste error.
Change-Id: I978231b556e17ad7e66c5028ed1cd904c653e0a8
- Added BUILD_STATIC_LIBS option which is on by default, only on Linux.
- Added TEST_WITH_SHARED option which is off by default, only on Linux.
- If only shared or static lib is being built, that's the one that will be used for testing.
- If both are being built, TEST_WITH_SHARED determins which library wil be used for testing.
- Set linux workflows so that they build both static and shared libs, and use linux-static and linux-shared to denote which one should be used for testing.
- Set -fPIC for both static and shared builds to fix issues faced when building blis using AOCC 4.0.0 and gtestsuite using gcc 9.4.0.
AMD-Internal: [CPUPL-2748]
Change-Id: I4227bab97ff31ecddfe218e18499f33b4e4ee63e
Details:
- Added new folder named JIT/ under addon/aocl_gemm/. This folder
will contain all the JIT related code.
- Modified lpgemm_cntx_init code to generate main and fringe kernels
for 6x64 bf16 microkernel and store function pointers to all the
generated kernels in a global function pointer array. This happens
only when gcc version is < 11.2
- When gcc version < 11.2, microkernel uses JIT-generated kernels.
otherwise, microkernel uses the intrinsics based implementation.
AMD-Internal: [SWLCSG-2622]
Change-Id: I16256c797b2546a8cd2049680001947346260461
Description
1. when mr0=1 case the accumulator register and operand
registers for an fma instruction got swapped. Corrected
the copy paste error.
2. Removed fill array for c_ref in bench_lpgemm.c and used
memcpy from c buf, because fill array now using rand()
function to initialize data which can be different
when c_ref and c called separately, this was working
because data was fixed (i=0 ... i%5).
Change-Id: Ia513331ba49d28adc7bcdc0ec78d443abe66780b
1. The 5 LOOP LPGEMM path is in-efficient when A or B is a vector
(i.e, m == 1 or n == 1).
2. An efficient implementation of lpgemv_rowvar_f32 is developed
considering the b matrix reorder in case of m=1 and post-ops fusion.
3. When m = 1 the algorithm divide the GEMM workload in n dimension
intelligently at a granularity of NR. Each thread work on A:1xk
B:kx(>=NR) and produce C=1x(>NR). K is unrolled by 4 along with
remainder loop.
4. When n = 1 the algorithm divide the GEMM workload in m dimension
intelligently at a granularity of MR. Each thread work on A:(>=MR)xk
B:kx1 and produce C = (>=MR)x1. When n=1 reordering of B is avoided
to efficiently process in n one kernel.
5. Fixed few warnings while loading 2 f32 bias elements using
_mm_load_sd using float pointer. Typecasted to (const double *)
AMD-Internal: [SWLCSG-2391, SWLCSG-2353]
Change-Id: If1d0b8d59e0278f5f16b499de1d629e63da5b599
-This post-operation computes C = (beta*C + alpha*A*B) + D, where D is
a matrix with dimensions and data type the same as that of C matrix.
AMD-Internal: [SWLCSG-2424]
Change-Id: I9464d1f514e3b04275fe93441489b4503a08937a
-This post-operation computes C = (beta*C + alpha*A*B) + D, where D is
a matrix with dimensions and data type the same as that of C matrix.
-For clang compilers (including aocc), -march=znver1 is not enabled for
zen kernels. Have updated CKVECFLAGS to capture the same.
AMD-Internal: [SWLCSG-2424]
Change-Id: Ie369f7ea5c80ab69eea3f3e03a8d9546e14f5c09
- Column stride is not taken into consideration in
current implementation when writing to C buffer
if beta is zero and C is column major stored.
- Fixed C storage in case of column major stored C
when beta is zero in 8x24 DGEMM kernel.
AMD-Internal: [CPUPL-4404]
Change-Id: I5b8dfce962995e3238cf902b5a09dd1bf90002a8
- In 3x1 fringe case in [RLN/RUT] kernel, 4 double
precision floats are being read instead of 3 doubles.
- Fixed the code to read only 3 double.
AMD-Internal: [CPUPL-4403]
Change-Id: If0afb155efefabe13487cf322d479981f1838aa2
1. Added Trans A feature to handle column major inputs
for A matrix.
2. Trans A is enabled by on-the-go pack of A matrix.
3. The on-the-go pack of A converts a column storage
MCxKC block of A into row storage MCxKC block as
LPGEMM kernels are row major kernels.
4. New pack routines are added for conversion of A matrix
from column major storage to row major storage.
5. LPGEMM Cntx is updated with pack kernel function
pointers.
6. Packing of A matrix:
- Converts column major input A to row major
in blocks of MCxKC with newly added pack A
functions when cs_a > 1.
7. Pack routines are added for AVX512 and AVX2
INT8 LPGEMM APIs.
8. Trans A feature is now supported in:
1. u8s8s32os32/os8
2. u8s8s16os16/os8/ou8
3. s8s8s32os32/os8
4. s8s8s16os16/os8
AMD-Internal: SWLCSG-2582
Change-Id: I7ce331545525a9a09f3853280615b55fcf2edabf