- Fixed bug in ddotxf generic tests where the parameters lda_inc and
inca were being read incorrectly.
- Fixed bug in dotxf test wherein the y vector was being generated with
length m instead of b.
- Corrected function signatures to use type gtint_t instead of gint_t.
- Updated the tests to use conjugate values of type char and convert to
conj_t type only while invoking BLIS tests for both DOTXF and AXPYF.
AMD-Internal: [CPUPL-5117]
Change-Id: I0ef7af429057583a1cbf34827802e72401181caf
Improve consistency in test names across different APIs:
- Improve consistency of TEST_P part of test names.
- Rename *_evt_testing.cpp and nrm2_extreme.cpp files to
*_evt.cpp to match other APIs.
- Standardize naming of IIT_ERS files.
Also:
- Restore trsv IIT_ERS file which was misnamed in commit
a2beef3255
- Tidy ukr gemm tests to be more consistent with each other
and move threshold setting to individual TEST_P functions
to allow different adjustments to be made.
- Similarly make trsm tests more consistent.
- Tidy naming of is_memory_test variable.
AMD-Internal: [CPUPL-4500]
Change-Id: I0af1fc9973b02187b19a7c2488eed1b829cfdc2f
Description:
- _mm512_storeu_epi8 and _mm512_storeu_epi16
intrensic instructions are not available in gcc-10
- Replaced above intrensics _mm512_storeu_si512
Change-Id: I2878780b7acd040ccf45e571d486ff8c2388088c
-As it stands the bf16bf16f32ob16 API expects bias array to be of type
float. However actual use case requires the usage of bias array of bf16
type. The bf16 micro-kernels are updated to work with bf16 bias array by
upscaling it to float type and then using it in the post-ops workflow.
-Corrected register usage in bf16 JIT generator for bf16bf16f32ob16 API
when k > KC.
AMD-Internal: [SWLCSG-2604]
Change-Id: I404e566ff59d1f3730b569eb8bef865cb7a3b4a1
Description:
--Added support for tranB in u8s8s32o<s32|s8> and
s8s8s32o<s32|s8> API's
--Updated the bench_lpgemm by adding options to
support transpose of B matrix
--Updated data_gen_script.py in lpgemm bench
according to latest input format.
AMD-Internal: [SWLCSG-2582]
Change-Id: I4a05cc390ae11440d6ff86da281dbafbeb907048
BLIS includes the BLAS and CBLAS interfaces for zdscal
but not the BLIS typed interface bli_zdscalv. Thus, when
TEST_INTERFACE=BLIS_TYPED is defined, disable tests
for zdscal.
AMD-Internal: [CPUPL-4671]
Change-Id: I397454c83e272f9e775e37e00533002576041a93
- Updated the parameter check for leading dimensions in the
functions handling transpose case of matrix A.
- Updated the logic to perform ?IMATCOPY operation. The new
logic uses an auxiliary buffer to copy and scale in place,
if and when needed. This is done in order to avoid overwriting
any subsequent reads that might follow(specifically in case
of having different leading dimensions for reading and writing).
- Updated xerbla_() to throw memory allocation failure based on
INFO parameter being -10. This value is specific to its use-case
in ?IMATCOPY, where it is set to -10.
- Updated the Extreme Value Tests(EVT) logger for ?IMATCOPY
for uniformity.
- Cleaned up the files to follow coding conventions.
AMD-Internal: [CPUPL-4862][SWLCSG-2706]
Change-Id: I34dfa2bcb66b821315e11f7ab2139c41a79ef780
- Correct value of alpha in ger ERS test.
- rename ERS_IIT.cpp files to match naming convention
used for other APIs.
- Change all cases of gint_t to gtint_t except for
dotxf, which is fixed in another commit.
- Add TEST_UPPERCASE_ARGS to imatcopy and omatcopy{2}
headers.
- Corrected typo.
AMD-Internal: [CPUPL-4500]
Change-Id: I8844bb8c5941785e64daa9df5569092c19f91838
-Adding multiplier for complex APIs.
-Updating for trmv and trsv to reflect multiplication with alpha.
AMD-Internal: [CPUPL-4500]
Change-Id: I17361da5afa5d1e219b4c8a14542e2b216a7ea58
Improve consistency in test names across different APIs.
In this commit, standardize leading dimensions (lda, ldb,
ldc) in test names. Also some misc tidying changes.
AMD-Internal: [CPUPL-4500]
Change-Id: Icbc82d0b9a3420ddfdb4f418396f9e56ab1765ab
Correction to commit 8657e661fc
to allocate matrix or vector correctly when special read-only
case occurs.
Also define a set_matrix generator for symmetric matrices
to only set upper or lower triangle to the supplied value,
while setting the unused elements to a large value to help
catch incorrect access to those elements.
AMD-Internal: [CPUPL-4548]
Change-Id: I22b3a20e2ce8be70eb27179247cd47fdb2d87b9d
Export more symbols for BLIS kernels so that AOCL libFLAME
optimizations can call them directly.
AMD-Internal: [CPUPL-5044]
Change-Id: I45392b8a2a14ac2816141521b90b7ddb1216c733
1. Enabled AVX512 path for
- Upper variant
- Different storage schemes for upper and lower variant
2. Modified mask value to handle all fringe cases correctly
AMD_Internal: [CPUPL-5091]
Change-Id: I4bf8aca24c1b87fff606deb05918b8e6216b729e
- Fixed bug in DAXPYF MT kernel when incx != inca.
- Added AOCL Dynamic function for 1f kernels.
- Moved all DOTXF and AXPYF kernels into one file.
AMD-Internal: [CPUPL-4880]
Change-Id: I7d9f44625bc42fad4a9e5b218ecc382efdf22cbe
- Enabled DGEMMT SUP upper kernels in AVX512 code path.
- Enabled use of optimized kernels for all the storages
supported by optimized kernels.
AMD-Internal: [CPUPL-4881]
Change-Id: Id4486610dacaabc405fbc35b2588607c6508705e
- Implemented optimized lpgemv for both m == 1 and n == 1 cases.
- Fixed few bugs in LPGEMV for bf16 and f32 datatypes.
- Fixed few bugs in JIT-based implementation of LPGEMM for BF16
datatype.
AMD-Internal: [SWLCSG-2354]
Change-Id: I245fd97c8f160b148656f782d241f86097a0cf38
Improve consistency in test names across different APIs.
Various changes in this patch:
- Explicitly cast char variables to std::string when
adding to test name. Adding the char directly was
causing errors in name generation.
- Use template version of print function in zdscalv
and remove print function zdscalvGenericTestPrint.
- Remove unused print function ztrsvPrint.
- Eliminate some differences in gemm ukr print
functions.
- Remove extraneous API name labels in ukr axpyf and
setv.
- Make ukr/trsm/test_trsm_ukr.h more consistent with
other files.
AMD-Internal: [CPUPL-4500]
Change-Id: Ib8092de216712586fe4ec0ae91698d0c1aaffd54
SWISH post-op computes swish(x) = x / (1 + exp(-1 * alpha * x)).
SiLU = SWISH with alpha = 1. Adding the support for swish in JIT
based BF16 kernels.
AMD-Internal: [SWLCSG-2387]
Change-Id: I9eea0c801f5f067a5cfbd2941bc991708b86e45e
Improve consistency in test names across different APIs.
In this commit, standardize storage, side, uplo, trans
diag and conj in test names.
AMD-Internal: [CPUPL-4500]
Change-Id: Ifcdb6e9f684b134841d86087218d7aefd9cabe63
- Reduced number of jump operations in AVX512
assembly kernel for SCOPYV, DCOPYV and ZCOPYV.
- Fixed memory test failure for bli_zcopyv_zen_int_avx512
kernel.
- Replaced existing AVX2 COPYV intrinsic kernels in
bli_cntx_init_zen5.c with AVX512 assembly kernels.
Change-Id: Idc11601b526d6d82cfbdf63af2fd331918b31159
Some BLAS routines do not require matrices or vectors to be
initialized in certain use cases. For example, in GEMM when
beta=zero, C is set rather than updated, thus input values of
C should not be used. In these cases set the inital values of
such matrices or vectors to an extreme value, to help detect
if these are incorrectly being read.
The extreme value can be NaN or Inf. The default is Inf,
change it by running
cmake ... -DEXT_VALUE=NaN
AMD-Internal: [CPUPL-4548]
Change-Id: I4a665363779d2496b8247f6357e970b7f23cd1eb
- Utilized the memory testing feature in GTestsuite
to update the testing interfaces for micro-kernel
testing of SCOPY, DCOPY and ZCOPY APIs.
Change-Id: I3d6905f33b000b8d5e60727aa896bd869f4f441f
- Existing vectorizes code was disabled because
of the failures observed in matlab tests.
- The issue is caused by underflow during division when diagonal
elements of A matrix are very small.
- When diagonal is very small (4E-324 in case of matlab), sqauring the
diagonal during divison causes the square to be rounded off to zero.
- Fix is to normalise (ar) and (ai) by dividing (ar) and (ai) by
max(ar, ai), this will make either (ar) or (ai) 1, and hence
reduce the likelihood of underflow.
AMD-Internal: [CPUPL-5052]
Change-Id: Iff7893fdcb92907a12e6af8e102a92637a13ce4f
- Added accuracy and memory tests for AVX2 and AVX512 ?SETV kernels,
AVX512 ZAXPYV kernel and AVX512 ZAXPYF kernels, with fuse-factors
2, 4 and 8.
- Cleanup of the code-section that declares and defines the reference
compute for AXPYF operation. Corrected the type mismatch with the
arguments that reference AXPYV would expect(this is used to decompose
AXPYF as part of reference). Ensured usage of GTestSuite's internal
alias for integer types.
- Updated the API level testsuite and testing interface for AXPYF,
based on the cleaup done to the reference code.
AMD-Internal: [CPUPL-4974]
Change-Id: I71de6c09d3877cd3dd1eaa20ab4f90e7c33eb1e1
Test programs for key APIs like GEMM take a long time to run,
and even to generate the list of test cases. Break into
separate test programs for different data types to enable
these to run in parallel (at gtest level). In this patch
we break up GEMM, TRSM, GEMV and TRSV.
AMD-Internal: [CPUPL-4500]
Change-Id: I21363b050d30e0402d5a1e8cbeaed2ebcc87aaeb
AOCL libFLAME optimizations directly call some internal
BLIS symbols. Export them to enable this to work with
the BLIS shared library.
AMD-Internal: [CPUPL-5044]
Change-Id: Icb62dcb51e12d72dde8434593ab17de3c227c93d
- Updated DOTV Gtestsuite interface to invoke C/ZDOTC when conjx='c'
and testing interface is either BLAS or CBLAS.
- Added ukr tests for bli_zdotv_zen4_asm_avx512( ... ) and
bli_zdotv_zen_int_avx512( ... ) kernels.
AMD-Internal: [CPUPL-5011]
Change-Id: I32fb69027a35d9ea92f997a095d412c8242a4b68
* Functional tests are covered for saxpyv and zaxpyv.
* As part of functional large size of m, stride greater than m, scalar
combinations(including special cases), Zero increment tests are
added for saxpyv and zaxpyv.
Signed-off-by: eseswari <sangadala.eswari@amd.com>
AMD-Internal: CPUPL-4413
Change-Id: I61473357680cb0f394e6e653796ec31110895fa4
- Added AVX512 kernel for ZDOTV.
- Multithreaded both ZDOTC and ZDOTU with AOCL_DYNAMIC support.
AMD-Internal: [CPUPL-5011]
Change-Id: I56df9c07ab3b8df06267a99835b088dcada81bd8
* As part of functional test cases, large size of m, stride greater than
m,scalar combinations, Zero increment tests are added for ?copyv.
Signed-off-by: eseswari <sangadala.eswari@amd.com>
AMD-Internal: CPUPL-4412
Change-Id: I9fa74c147975bbe21263aaf48190170c6ed0a8fd
- Before the system was assuming 3 levels in the directory structure and
was creating corresponding targets.
- Now the system looks into the subdirectories of testsuite and creates
a target for each subdirectory that has at least one cpp file.
- Also deleted a directory that seems duplicate and was breaking builds.
AMD-Internal: [CPUPL-4500]
Change-Id: I03ca362b09783f1c7c5f37ab420d8ca2c2b45e2e
Existing Design:
- GEMM AVX2 kernel performs computation and updates temporary C buffer
- Portion of temporary C buffer is copied to output C buffer
based on UPLO parameter
- For diagonal blocks, using GEMM kernels is not efficient
New Design: Implemented in current patch when UPLO='L'
- GEMMT kernel used for computation, temporary buffer is not required.
- Only required elements are computed using mask load store for all
fringe cases
- Exception: AVX2 code path is used when storage format is RRC, CRR, CRC
- AOCL-Dynamic is added based on dimension
- Check for AVX platform is added in SUP interface, It returns to
native implementation if hardware doesnot support AVX platform
- SUP ref_var2m is expanded for dcomplex datatype to avoid condition
check which exists for double datatype
AMD_Internal: [CPUPL-5006]
Change-Id: I3e21404b732b8f2df9cbdba394303752fdf36286
1. The 5 LOOP LPGEMM path is in-efficient when A or B is a vector
(i.e, m == 1 or n == 1).
2. An efficient implementation is developed considering the b matrix
reorder in case of m=1 and post-ops fusion.
3. When m = 1 the algorithm divide the GEMM workload in n dimension
intelligently at a granularity of NR. Each thread work on A:1xk
B:kx(>=NR) and produce C=1x(>NR). K is unrolled by 4 along with
remainder loop.
4. When n = 1 the algorithm divide the GEMM workload in m dimension
intelligently at a granularity of MR. Each thread work on A:(>=MR)xk
B:kx1 and produce C = (>=MR)x1. When n=1 reordering of B is avoided
to efficiently process in n one kernel.
AMD-Internal: [SWLCSG-2355]
Change-Id: I7497dad4c293587cbc171a5998b9f2817a4db880
Thanks to Zhenyu Zhu ajz34 for pointing out this bug.
When trans="t" or "conjugate transpose" in the case of complex data-types
the ldb should be greater than equal to cols.
In the bug it was checked against "rows". Fixed this bug.
Some minor code format is done.
[CPUPL-4810][SWLCSG-2706]
Change-Id: Ie796d25a361b2ba72eda80e8c5867d6352af901f
- In AVX512 ZTRSM kernel, vertorizes division code
is causing failures in matlab.
- The logic is identical in reference C code and intrinsics code,
but intrinsics code is causing failure
- Replaced optimized intrinsics code with C code.
AMD-Internal: [CPUPL-5052]
Change-Id: Iea184330b22c46d979867b870486066ef980eb84
- Modified all structs that are passed to JIT-generated code to use
integer of type uint64_t rather than dim_t so that functionality
is not affected when size of BLIS-internal integer is modified
during configure time.
Change-Id: Ib81c088072badf13da4ca73be2d4af4551b713d8
- Added a {} around zen4 switch case to avoid AOCC error.
- Error is caused because in C declarations are not a statement, therefore
they cannot be labled hence compiler is not able to create a lable
for jump.
AMD-Internal: [CPUPL-4880]
Change-Id: Icfeedafd80bf9a955e430ca967b6a93dcbbf075e
- Updated the AVX512 DOTXF kernels to use MASKZ loads
instead of MASK loads when loading X vector in fringe
case. This avoids compiler warnings of uninitialized
vector as input to the intrinsic.
- The functionality will not change when using either MASK
or MASKZ loads on X, since A matrix is loaded using MASKZ
loads.
AMD-Internal: [CPUPL-4974]
Change-Id: I1ef98a1292352d0e905cc09cd5667acd883df827
Check internal value of INFO for BLAS2 and BLAS3 routines
using the bli_info_get_info_value() function added in AOCL 4.2.
If testing a BLIS library that does not have this, use
cmake ... -DCAN_TEST_INFO_VALUE=OFF
AMD-Internal: [CPUPL-4993]
Change-Id: Ida5d252b0f6727793ebfb74bb160e8cb96b61b74
- In DGEMMT SUP AVX2 code path, traingular kernels
are added in order to avoid temporary C buffer.
- Since these kernels did not exist for AVX512,
AVX2 kernels were being used in GEMMT.
- AVX512 triangular GEMM kernel has been added
to make sure that AVX512 kernels can be used without
creating a temporary buffer.
- This kernel is added only for Lower variant of GEMMT,
for upper variant of DGEMMT, temporary C buffer is
created, full GEMM kernel is called on temporary C and
traingular region from temporary C is copied to C
buffer.
AMD-Internal: [CPUPL-4881]
Change-Id: Id70645f79ae078ab9a7006e83d328505f1fae8a9
- Kernel dimensions are 4x4.
- Two kernels are implemented, Right Upper and
Right lower.
- In case of Left variants of TRSM, transpose is
induced so that Right variant kernels can be used.
- No packing is performed in these kernels.
- Changes are made in the threshold to pick ZTRSM small
code path.
- BLIS_INLINE is removed from signature of
"TRSMSMALL_KER_PROT".
- These kernels do not support "ENABLE_TRSM_PREINVERSION".
- Newly added kernels do not support conjugate
transpose.
- Added multithreading to ZTRSM small code path.
AMD-Internal: [CPUPL-4324]
Change-Id: I683b1d5239593e54f433e7f27497d72dfbd9141c
- Added DAXPYF and DDOTXF AVX512 kernels.
- Fuse factor for ddotxf kernel is 8.
- 2 DAXPYF kernels are added, with fuse
factor 8 and 32.
- Multithreading is also added to the DAXPYf
kernel with fuse factor 32.
- These kernels are internally used by TRSM.
- Added changes in TRSV to call these kernels
in ZEN4
AMD-Internal: [CPUPL-4880]
Change-Id: I12850de974b437bbca07677b68bc3d6a35858770
- Implemented AVX512 kernels for handling the calls to ZGEMV
with transpose to A matrix.
- This includes the set of ZDOTXF and ZDOTXV kernels. ZDOTXF
kernels include those with fuse-factor 8 (main kernel), 4
and 2(fringe kernels).
- Updated the bli_zgemv_unf_var1( ... ) function to update
the function pointers to these kernels, based on the
configuration.
AMD-Internal: [CPUPL-4974]
Change-Id: I313ae0abe9dc119de849da42f9825b71f11b1fda
- Implemented AVX512 kernels for handling the calls to ZGEMV
with no-transpose to A matrix.
- This includes the ZAXPYF, ZAXPYV and ZSETV kernels.
The set of ZAXPYF kernels include those with fuse-factor 8
(main kernel), 4 and 2(fringe kernels).
- Updated the bli_zgemv_unf_var2( ... ) function to set
the function pointers to these kernels, based on the
configuration. Further added the call to ZSETV at this
layer in case beta is 0.
AMD-Internal: [CPUPL-4974]
Change-Id: Iee4b724719e49023138bb16479765be44d677cd9
- Using a template class for the printing operator that depends
on the type.
- USe a macro to denote which interface is being tested.
AMD-Internal: [CPUPL-4500]
Change-Id: I453c4ef4842c354064f49ff32ec4bf42920cc17c
Following a recent change to the data generators to allow a stride
to be specified (60cc23f3d3), seg
faults can occur if m<=0 for column storage or n<=0 for row storage.
Prevent this by having separarate code paths to handle these
scenarios.
AMD-Internal: [CPUPL-4500]
Change-Id: I23ed8b2dccaaca140e2ddfda45bcdb4c888d5708