- Added AVX512 kernel for ZDOTV.
- Multithreaded both ZDOTC and ZDOTU with AOCL_DYNAMIC support.
AMD-Internal: [CPUPL-5011]
Change-Id: I56df9c07ab3b8df06267a99835b088dcada81bd8
* Added functional test cases for ?copyv covering large values of m,
strides greater than m, scalar combinations, and zero increments.
Signed-off-by: eseswari <sangadala.eswari@amd.com>
AMD-Internal: CPUPL-4412
Change-Id: I9fa74c147975bbe21263aaf48190170c6ed0a8fd
- Previously, the system assumed 3 levels in the directory structure and
created the corresponding targets.
- Now the system looks into the subdirectories of testsuite and creates
a target for each subdirectory that has at least one cpp file.
- Also deleted a directory that appeared to be a duplicate and was
breaking builds.
AMD-Internal: [CPUPL-4500]
Change-Id: I03ca362b09783f1c7c5f37ab420d8ca2c2b45e2e
Existing Design:
- GEMM AVX2 kernel performs computation and updates temporary C buffer
- Portion of temporary C buffer is copied to output C buffer
based on UPLO parameter
- For diagonal blocks, using GEMM kernels is not efficient
New Design: Implemented in current patch when UPLO='L'
- GEMMT kernel used for computation, temporary buffer is not required.
- Only required elements are computed using mask load store for all
fringe cases
- Exception: AVX2 code path is used when storage format is RRC, CRR, CRC
- AOCL-Dynamic is added based on dimension
- Check for AVX platform is added in the SUP interface. It falls back to
the native implementation if the hardware does not support AVX.
- SUP ref_var2m is expanded for dcomplex datatype to avoid condition
check which exists for double datatype
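
The idea of computing only the required elements for UPLO='L' can be sketched with a plain reference loop. This is a minimal illustration, not the optimized kernel: the function name gemmt_lower_ref and the column-major, no-transpose layout are assumptions made for the example.

```c
#include <stddef.h>

/* Reference sketch: C := alpha*A*B + beta*C, touching only the lower
 * triangle (i >= j) of the n x n matrix C. Column-major storage.
 * A is n x k with leading dimension lda, B is k x n with ldb.
 * gemmt_lower_ref is a hypothetical name, not a BLIS function. */
void gemmt_lower_ref(size_t n, size_t k, double alpha,
                     const double *a, size_t lda,
                     const double *b, size_t ldb,
                     double beta, double *c, size_t ldc)
{
    for (size_t j = 0; j < n; ++j) {
        /* Start at the diagonal: the strictly upper part is never
         * computed, so no temporary C buffer is needed. */
        for (size_t i = j; i < n; ++i) {
            double acc = 0.0;
            for (size_t p = 0; p < k; ++p)
                acc += a[i + p * lda] * b[p + j * ldb];
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
        }
    }
}
```

Elements above the diagonal keep whatever value the caller stored there.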
AMD-Internal: [CPUPL-5006]
Change-Id: I3e21404b732b8f2df9cbdba394303752fdf36286
1. The 5-loop LPGEMM path is inefficient when A or B is a vector
(i.e., m == 1 or n == 1).
2. An efficient implementation is developed that takes into account the
B matrix reorder in the m = 1 case and post-ops fusion.
3. When m = 1, the algorithm divides the GEMM workload along the n
dimension at a granularity of NR. Each thread works on A: 1xk and
B: kx(>=NR) and produces C: 1x(>=NR). K is unrolled by 4, with a
remainder loop.
4. When n = 1, the algorithm divides the GEMM workload along the m
dimension at a granularity of MR. Each thread works on A: (>=MR)xk and
B: kx1 and produces C: (>=MR)x1. When n = 1, reordering of B is avoided
so that it can be processed efficiently in the n = 1 kernel.
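
One possible way to split the n dimension among threads at NR granularity is sketched below. The helper partition_n and the struct range_hyp are hypothetical names, and the balancing rule (extra panels go to the lowest thread ids) is an assumption; the actual LPGEMM partitioning may differ.

```c
#include <stddef.h>

/* Half-open column range [start, end) owned by one thread. */
typedef struct { size_t start; size_t end; } range_hyp;

/* Split [0, n) among nt threads so that every boundary between threads
 * is a multiple of nr. Only the last non-empty range can be shorter
 * than a multiple of nr (the fringe). Hypothetical helper. */
range_hyp partition_n(size_t n, size_t nr, size_t nt, size_t tid)
{
    size_t panels = (n + nr - 1) / nr;   /* number of NR-wide panels   */
    size_t per    = panels / nt;         /* panels every thread gets   */
    size_t extra  = panels % nt;         /* first `extra` get one more */
    size_t first  = tid * per + (tid < extra ? tid : extra);
    size_t count  = per + (tid < extra ? 1 : 0);

    range_hyp r;
    r.start = first * nr;
    r.end   = (first + count) * nr;
    if (r.start > n) r.start = n;        /* clamp the fringe panel */
    if (r.end   > n) r.end   = n;
    return r;
}
```

With n = 100, NR = 16 and 3 threads, the ranges are [0, 48), [48, 80) and [80, 100). The same helper applies to the n = 1 case by passing m and MR instead.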
AMD-Internal: [SWLCSG-2355]
Change-Id: I7497dad4c293587cbc171a5998b9f2817a4db880
Thanks to Zhenyu Zhu (ajz34) for pointing out this bug.
When trans="t", or conjugate transpose in the case of complex data types,
ldb should be greater than or equal to cols.
Previously it was checked against "rows". Fixed this bug.
Also made minor code formatting changes.
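
The corrected check can be sketched as follows. The function name omatcopy_ldb_ok_hyp is hypothetical. The 'c' character for conjugate transpose, the column-major layout and the lower bound of 1 on ldb are assumptions made for the illustration.

```c
#include <ctype.h>
#include <stdbool.h>

/* A is rows x cols. For a transposed copy the output B is cols x rows,
 * so its leading dimension must cover cols; otherwise it must cover
 * rows. Hypothetical helper, not the library's own validation code. */
bool omatcopy_ldb_ok_hyp(char trans, long rows, long cols, long ldb)
{
    char t = (char)tolower((unsigned char)trans);
    long min_ldb = (t == 't' || t == 'c') ? cols : rows;
    if (min_ldb < 1)
        min_ldb = 1;          /* assumed convention: ldb >= max(1, dim) */
    return ldb >= min_ldb;
}
```

For a 4 x 6 input with trans="t", ldb = 6 passes and ldb = 4 is rejected. A check against rows would have accepted ldb = 4.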
[CPUPL-4810][SWLCSG-2706]
Change-Id: Ie796d25a361b2ba72eda80e8c5867d6352af901f
- In the AVX512 ZTRSM kernel, the vectorized division code
was causing failures in MATLAB.
- The logic is identical in the reference C code and the intrinsics code,
but the intrinsics code was causing the failures.
- Replaced the optimized intrinsics code with C code.
AMD-Internal: [CPUPL-5052]
Change-Id: Iea184330b22c46d979867b870486066ef980eb84
- Modified all structs that are passed to JIT-generated code to use
integer of type uint64_t rather than dim_t so that functionality
is not affected when size of BLIS-internal integer is modified
during configure time.
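
A minimal sketch of the idea, assuming a hypothetical parameter struct jit_args_hyp and a stand-in integer type dim_hyp_t. The real structs and their fields are not shown in this commit.

```c
#include <stdint.h>

/* Stand-in for a library integer whose width is chosen at configure
 * time (32 or 64 bits). Hypothetical; not the BLIS dim_t definition. */
typedef int32_t dim_hyp_t;

/* Fields read by generated code use a fixed 64-bit type, so their
 * offsets do not move when dim_hyp_t changes width. */
typedef struct {
    uint64_t m;
    uint64_t n;
    uint64_t k;
} jit_args_hyp;

_Static_assert(sizeof(jit_args_hyp) == 3 * sizeof(uint64_t),
               "layout must not depend on the configurable integer");

/* Widen at the boundary between library code and generated code. */
jit_args_hyp make_jit_args(dim_hyp_t m, dim_hyp_t n, dim_hyp_t k)
{
    jit_args_hyp a = { (uint64_t)m, (uint64_t)n, (uint64_t)k };
    return a;
}
```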
Change-Id: Ib81c088072badf13da4ca73be2d4af4551b713d8
- Added a {} around zen4 switch case to avoid AOCC error.
- The error occurs because in C a declaration is not a statement and
therefore cannot be labeled, so the compiler is not able to create a
label for the jump.
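
The pattern can be illustrated with a small example. The enum, the function name and the returned values are hypothetical. In C standards before C23 a case label must be followed by a statement, so the declaration is wrapped in a compound statement.

```c
/* Hypothetical architecture ids for the illustration. */
enum arch_hyp { ARCH_GENERIC_HYP, ARCH_ZEN4_HYP };

int pick_value_hyp(enum arch_hyp id)
{
    int result;
    switch (id) {
    case ARCH_ZEN4_HYP:
    {   /* The braces turn the case body into a compound statement,
         * which can carry the label. Writing "int wide = 8;" directly
         * after the label is rejected by pre-C23 compilers. */
        int wide = 8;
        result = wide;
        break;
    }
    default:
        result = 4;
        break;
    }
    return result;
}
```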
AMD-Internal: [CPUPL-4880]
Change-Id: Icfeedafd80bf9a955e430ca967b6a93dcbbf075e
- Updated the AVX512 DOTXF kernels to use MASKZ loads
instead of MASK loads when loading X vector in fringe
case. This avoids compiler warnings of uninitialized
vector as input to the intrinsic.
- The functionality will not change when using either MASK
or MASKZ loads on X, since A matrix is loaded using MASKZ
loads.
AMD-Internal: [CPUPL-4974]
Change-Id: I1ef98a1292352d0e905cc09cd5667acd883df827
Check internal value of INFO for BLAS2 and BLAS3 routines
using the bli_info_get_info_value() function added in AOCL 4.2.
If testing a BLIS library that does not have this, use
cmake ... -DCAN_TEST_INFO_VALUE=OFF
AMD-Internal: [CPUPL-4993]
Change-Id: Ida5d252b0f6727793ebfb74bb160e8cb96b61b74
- In DGEMMT SUP AVX2 code path, triangular kernels
are added in order to avoid temporary C buffer.
- Since these kernels did not exist for AVX512,
AVX2 kernels were being used in GEMMT.
- AVX512 triangular GEMM kernel has been added
to make sure that AVX512 kernels can be used without
creating a temporary buffer.
- This kernel is added only for the Lower variant of GEMMT.
For the Upper variant of DGEMMT, a temporary C buffer is
created, the full GEMM kernel is called on the temporary C, and
the triangular region is copied from the temporary C to the C
buffer.
AMD-Internal: [CPUPL-4881]
Change-Id: Id70645f79ae078ab9a7006e83d328505f1fae8a9
- Kernel dimensions are 4x4.
- Two kernels are implemented, Right Upper and
Right lower.
- In case of Left variants of TRSM, transpose is
induced so that Right variant kernels can be used.
- No packing is performed in these kernels.
- Changes are made in the threshold to pick ZTRSM small
code path.
- BLIS_INLINE is removed from signature of
"TRSMSMALL_KER_PROT".
- These kernels do not support "ENABLE_TRSM_PREINVERSION".
- Newly added kernels do not support conjugate
transpose.
- Added multithreading to ZTRSM small code path.
AMD-Internal: [CPUPL-4324]
Change-Id: I683b1d5239593e54f433e7f27497d72dfbd9141c
- Added DAXPYF and DDOTXF AVX512 kernels.
- Fuse factor for ddotxf kernel is 8.
- 2 DAXPYF kernels are added, with fuse
factor 8 and 32.
- Multithreading is also added to the DAXPYF
kernel with fuse factor 32.
- These kernels are internally used by TRSM.
- Added changes in TRSV to call these kernels
in ZEN4
AMD-Internal: [CPUPL-4880]
Change-Id: I12850de974b437bbca07677b68bc3d6a35858770
- Implemented AVX512 kernels for handling the calls to ZGEMV
with transpose to A matrix.
- This includes the set of ZDOTXF and ZDOTXV kernels. ZDOTXF
kernels include those with fuse-factor 8 (main kernel), 4
and 2 (fringe kernels).
- Updated the bli_zgemv_unf_var1( ... ) function to update
the function pointers to these kernels, based on the
configuration.
AMD-Internal: [CPUPL-4974]
Change-Id: I313ae0abe9dc119de849da42f9825b71f11b1fda
- Implemented AVX512 kernels for handling the calls to ZGEMV
with no-transpose to A matrix.
- This includes the ZAXPYF, ZAXPYV and ZSETV kernels.
The set of ZAXPYF kernels include those with fuse-factor 8
(main kernel), 4 and 2 (fringe kernels).
- Updated the bli_zgemv_unf_var2( ... ) function to set
the function pointers to these kernels, based on the
configuration. Further added the call to ZSETV at this
layer in case beta is 0.
AMD-Internal: [CPUPL-4974]
Change-Id: Iee4b724719e49023138bb16479765be44d677cd9
- Using a template class for the printing operator that depends
on the type.
- Use a macro to denote which interface is being tested.
AMD-Internal: [CPUPL-4500]
Change-Id: I453c4ef4842c354064f49ff32ec4bf42920cc17c
Following a recent change to the data generators to allow a stride
to be specified (60cc23f3d3), seg faults can occur if m<=0 for
column storage or n<=0 for row storage.
Prevent this by having separate code paths to handle these
scenarios.
AMD-Internal: [CPUPL-4500]
Change-Id: I23ed8b2dccaaca140e2ddfda45bcdb4c888d5708
Improve consistency in test names across different APIs.
In this commit, standardize m, n, k and b in test names.
AMD-Internal: [CPUPL-4500]
Change-Id: I53e7dd83cbf426ab1ebe8aa4af1da01594f4af23
- Implemented AVX512 kernels for scopyv_, dcopyv_ and zcopyv_
using respective AVX512 intrinsics including masked
load and store operations.
- Implemented AVX512 kernels for scopy_, dcopy_ and
zcopy_ using assembly language to prevent loss of
performance during the translation of intrinsics.
- Updated the dcopy_blis_impl( ... ) and
zcopy_blis_impl( ... ) function to support
multithreaded calls to the respective computational
kernels, if and when the OpenMP support is enabled.
- Implemented OpenMP parallelization for dcopyv_ and
zcopyv_ APIs, while scopyv_ and ccopyv_ only support
single thread.
AMD-Internal: [CPUPL-4854]
Change-Id: I5fbd0bcca4e59001fbe2b1168b624d0c33242b3e
- Updated the IIT_ERS tests for SUBV to avoid using undefined
variables. These tests are enabled only when GTestSuite is
configured for BLIS_TYPED interface testing.
- Updated an instantiator in DAXPBY accuracy tests to avoid a
parsing error (extra comma). These tests are enabled only when
GTestSuite is configured for BLIS_TYPED interface.
AMD-Internal: [CPUPL-4500]
Change-Id: If6894daadbbc353dd66968649642ff07fa663782
The return type of the xerbla and xerbla_array APIs is defined as int
in BLIS, but according to netlib it should be void. Updated the
definition and declaration accordingly.
Signed-off-by: Sridhar Govindaswamy <Sridhar.Govindaswamy@amd.com>
Change-Id: I3072ba76111189de5c5cf08df83ea154163dd34d
First in a series of commits to improve consistency in test names
across different APIs. This will help with gtest filtering.
In this commit, standardize alpha, beta, incx and incy.
AMD-Internal: [CPUPL-4500]
Change-Id: I0cde85f9a4cf969c0b12ac589b232786ad011f09
Details:
- Variable m0 was being loaded into a register without casting
it to uint64_t. This resulted in a seg fault when the integer size is
set to 32 bits at configure time.
- Any variable that is loaded using mov in assembly needs to be
cast to uint64_t before begin_asm, so that a change in the size
of the integer doesn't affect the functionality.
- Modified all instances using variable m0 to use variable 'm' where
m = (uint64_t)m0;
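
A minimal sketch of the pattern, using GCC/Clang extended inline assembly on x86-64 rather than the BLIS begin_asm macros. The function name load_dim_hyp is hypothetical, and the 32-bit int32_t stands in for a library integer configured as 32 bits.

```c
#include <stdint.h>

uint64_t load_dim_hyp(int32_t m0)
{
    /* Widen before the assembly block: a 64-bit mov from a 4-byte
     * variable would also read 4 bytes that do not belong to it. */
    uint64_t m = (uint64_t)m0;
    uint64_t out;
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    /* 64-bit load from the 8-byte variable m into a register. */
    __asm__ volatile ("movq %1, %0" : "=r"(out) : "m"(m));
#else
    out = m;   /* portable fallback for other targets */
#endif
    return out;
}
```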
AMD-Internal: [CPUPL-4971]
Change-Id: I49b66d2cacf19ace40ab44c9f85904644e8921f4
- Updated test_gemv.h to pass the right boolean
to computediff( ... ), based on whether we run
it for exception value tests or not.
AMD-Internal: [CPUPL-4500]
Change-Id: I1ad2cde4f9b4bb1dadc32d1f7d02a90a457e218f
* Covered large sizes, scalar combinations and strides greater than the
size for cger, dger, sger and zger.
Signed-off-by: Sangadala Eswari <Sangadala.Eswari@amd.com>
AMD-Internal: CPUPL-4414
Change-Id: I6fba26a35903d1f6dbd713f19eac6bb537b3d8d2
- Changed the macro guard for accuracy tests of SIMATCOPY,
to ensure that tests are enabled/disabled based on the reference.
- Updated test_gemv.h to make sure the contents of y vector is copied
to y_ref post inducing exception values.
AMD-Internal: [CPUPL-4500]
Change-Id: I7249e643677e7e493eba5d072567615bc913a532
Add name of variable being tested in error output from
computediff functions. First step to adding (optional)
tests on input arguments.
AMD-Internal: [CPUPL-4379]
Change-Id: I9553b660bcf5ecf1dd675cb837655078933455ac
Modify thresholds to reflect number of operations that
accumulate results into each output element. Different
limits are set for early return and special cases.
Constants are still subject to experimentation and change.
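
As a rough illustration of a threshold that scales with the number of operations accumulated into each output element, consider a GEMM-like case. The function name, the operation count of 2k + 2 and the handling of the special cases are all assumptions for the sketch. They are not the constants used in the test suite.

```c
#include <float.h>
#include <stddef.h>

/* Hypothetical threshold for C := alpha*A*B + beta*C with inner
 * dimension k, expressed as a multiple of machine epsilon. */
double gemm_thresh_hyp(size_t m, size_t n, size_t k,
                       double alpha, double beta)
{
    /* Early return: C is not touched, so an exact match is expected. */
    if (m == 0 || n == 0)
        return 0.0;

    /* Special case: only C := beta*C is performed. */
    if (k == 0 || alpha == 0.0)
        return (beta == 0.0 || beta == 1.0) ? 0.0 : DBL_EPSILON;

    /* General case: roughly k multiplies and k adds per element,
     * plus the alpha and beta scalings (assumed count: 2k + 2). */
    return (2.0 * (double)k + 2.0) * DBL_EPSILON;
}
```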
AMD-Internal: [CPUPL-4378]
Change-Id: I81f63a36c161ff1866f2d404b9e3cbb9a2948d3a
Modify thresholds to reflect number of operations that
accumulate results into each output element. Different
limits are set for early return and special cases.
Constants are still subject to experimentation and change.
AMD-Internal: [CPUPL-4378]
Change-Id: I03cd8901e574f2e44e85ce8b0bc234e36edb4819
Correct the order of tests in bli_cpuid.c to test all known zen
AVX512 platforms before considering fallback tests on AVX512
support. This avoids builds with "configure auto" or
"cmake -DBLIS_CONFIG_FAMILY=auto" incorrectly selecting zen5
sub-configuration on zen4 systems.
AMD-Internal: [CPUPL-4966]
Change-Id: I8706382e2df7c9ae4bb456e3a7f465053e15beea
1. Fixed issue related to linking reference library.
2. Cleaned up how reference library variables are set.
3. Fixed compilation error related to std::max() and std::min().
AMD-Internal: [CPUPL-4879]
Change-Id: I427a4a4c0ea56a340a8bbd1a6649252e9680b937
- Added Memory Access Test support for GEMV.
- Added Extreme Value Tests for various combinations of NaN, Inf and
-Inf for ?GEMV.
- Also fixed some invalid IIT_ERS tests.
AMD-Internal: [CPUPL-4825]
Change-Id: Iee77b305f6c6b9427153fbbc5191176dae9fbfea
- In 2x1 fringe case in [RUN/RLT] kernel, 3 scomplex
precision numbers are being read instead of 1 scomplex.
- Fixed the code to read only one scomplex.
AMD-Internal: [CPUPL-4403]
Change-Id: If3ac03ed864618382d3a382a8cdff7ff8a94eb7d
- Scaling vector X is skipped when alpha is 1 in ZTRSV.
- Scaling matrix A is skipped when alpha is 1 in ZTRSM.
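
The early exit can be sketched on a vector scale. The struct zc_hyp and the function name are hypothetical, and unit stride is assumed. The check compares the real and imaginary fields directly.

```c
#include <stddef.h>

/* Hypothetical double-complex type with separate real/imag fields. */
typedef struct { double real; double imag; } zc_hyp;

/* x := alpha * x. Returns 1 if the vector was scaled, 0 if skipped. */
int zscal_skip_one_hyp(size_t n, zc_hyp alpha, zc_hyp *x)
{
    /* alpha == 1 + 0i leaves x unchanged, so skip the whole pass. */
    if (alpha.real == 1.0 && alpha.imag == 0.0)
        return 0;

    for (size_t i = 0; i < n; ++i) {
        double re = alpha.real * x[i].real - alpha.imag * x[i].imag;
        double im = alpha.real * x[i].imag + alpha.imag * x[i].real;
        x[i].real = re;
        x[i].imag = im;
    }
    return 1;
}
```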
AMD-Internal: [CPUPL-4324]
Change-Id: I03c5a454ed1f5be36dac0f121408749bfc9cfc81
- Comparison using bli_eqsc is slower than direct comparison.
- Changed comparison logic for 1x1 matrix
from bli_eqsc to direct comparison.
AMD-Internal: [CPUPL-4324]
Change-Id: Ifb2d0ad7a97c8bf33b66d624a7ecc53e38c1c803
Correction to commit 2450a1813b
to add -DBLIS_CONFIG_FAMILY=zen5 support in cmake.
AMD-Internal: [CPUPL-3518]
Change-Id: Iecff2b64d5d95960cecbbf98d5269133747b122e
Implement full support for zen5 as a separate BLIS sub-configuration
and code path within amdzen configuration family.
AMD-Internal: [CPUPL-3518]
Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09
1. Different matrix sizes
2. Different Stride values and Scalar values
3. Added Early Return tests in new file
Signed-off-by: Harish Kumar <harish.kumar@amd.com>
AMD-Internal: [CPUPL-4417]
Change-Id: I5e645612808336e11da0c5ed8da9fe17a5543fbd
- Implemented the feature to benchmark ?AXPYV APIs
for the supported datatypes. The feature allows
benchmarking BLAS, CBLAS or the native BLIS API, based
on the macro definition.
- Added a sample input file to provide examples to benchmark
AXPYV for all of its supported datatypes.
- Updated the sample input file for SCALV to provide examples
to benchmark all of its supported datatypes.
AMD-Internal: [CPUPL-4805]
Change-Id: I550920e3a57fcc2e4900e9e698330d8b8595bdee
Previous commit introduced an infinite recursion problem in
generators for symmetric matrices. This was reported as a
compiler warning by gcc 12.2 but not by gcc 11.4.
AMD-Internal: [CPUPL-4862]
Change-Id: I8642b81a62f0643b5a9ebedb4fcc83b25542de1b
- Added test-cases to verify the functional behaviour
of the BLAS-extension API ?imatcopy_() and ?omatcopy2_().
The test-cases cover the following categories for the
supported datatypes :
- Functional and memory testing.
- Negative parameter testing with invalid inputs.
- Early return scenarios.
- Exception value testing.
- Updated functions in testinghelpers to include strides
in addition to leading-dimension, when initializing
a matrix. The default value for stride is set as 1.
- Implemented functions to load the reference symbol, based
on the choice of the reference library. The function definition
is overloaded due to different API standards being exposed by
different libraries.
- Code cleanup of files for ?OMATCOPY API.
AMD-Internal: [CPUPL-4862]
Change-Id: If63b348f517e2cde1fe48f3a195808b33a91c312
- Added support for ?DOTC in bench.
- Updated DTL to accept conjx as a parameter:
- 'N', i.e., no conjugate for DOTU
- 'C', i.e., conjugate for DOTC
- Updated DTL calls in the interface with respective values of
conjx.
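
The difference between the two conjx values can be shown with a reference dot product. The struct zc_hyp and the function zdot_ref_hyp are hypothetical names, and unit strides are assumed.

```c
#include <stddef.h>

/* Hypothetical double-complex type with separate real/imag fields. */
typedef struct { double real; double imag; } zc_hyp;

/* conjx = 'N': sum of x[i] * y[i]        (DOTU)
 * conjx = 'C': sum of conj(x[i]) * y[i]  (DOTC) */
zc_hyp zdot_ref_hyp(char conjx, size_t n, const zc_hyp *x, const zc_hyp *y)
{
    double sign = (conjx == 'C') ? -1.0 : 1.0;
    zc_hyp acc = { 0.0, 0.0 };
    for (size_t i = 0; i < n; ++i) {
        double xr = x[i].real;
        double xi = sign * x[i].imag;   /* conjugation flips the sign */
        acc.real += xr * y[i].real - xi * y[i].imag;
        acc.imag += xr * y[i].imag + xi * y[i].real;
    }
    return acc;
}
```

For x = 1+2i and y = 3+4i, conjx = 'N' gives -5+10i and conjx = 'C' gives 11-2i.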
AMD-Internal: [CPUPL-4804]
Change-Id: I447b19a6273566c6021c1721ce173bac4a59142c
- Added overflow and underflow tests for dgemm.
These tests cause floating-point overflow and underflow by feeding
values close to DBL_MAX and DBL_MIN to the matrices.
DBL_MAX = 1.7976931348623158e+308
DBL_MIN = 2.2250738585072014e-308
When computations result in values beyond the range [DBL_MIN, DBL_MAX],
it leads to an overflow or underflow condition.
Two new arguments are added to the test_gemm routine: over_under and input_range.
over_under = 0 indicates overflow
over_under = 1 indicates underflow
input_range = -1 indicates values within overflow or underflow limits
input_range = 0 indicates values very close to DBL_MIN or DBL_MAX
input_range = 1 indicates values beyond DBL_MIN or DBL_MAX
- New file: dgemm_ovr_undr.cpp
The overflow and underflow tests, dgemm_overflow and dgemm_underflow,
are called from this file. It uses the cfloat header for the
DBL_MIN and DBL_MAX values.
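
The two conditions can be reproduced in a few lines. This is a sketch with hypothetical helper names, assuming IEEE 754 doubles with gradual underflow enabled. It is written in C with float.h, the C counterpart of the cfloat header.

```c
#include <float.h>

/* volatile forces the product to be stored as a double before use. */
double product_hyp(double a, double b)
{
    volatile double r = a * b;
    return r;
}

/* True when the value is +Inf or -Inf, i.e. magnitude above DBL_MAX. */
int overflowed_hyp(double v)
{
    return v > DBL_MAX || v < -DBL_MAX;
}

/* True when the magnitude is below the smallest normal double
 * (a subnormal value or zero). */
int below_normal_hyp(double v)
{
    double a = (v < 0.0) ? -v : v;
    return a < DBL_MIN;
}
```

Under these assumptions, DBL_MAX * 2.0 overflows to infinity, DBL_MIN * 0.5 gives a subnormal value, and DBL_MIN * DBL_MIN underflows to zero.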
Signed-off-by: Nimmy Krishnan <nimmy.krishnan@amd.com>
AMD-Internal: [CPUPL-4492]
Change-Id: I4bbd519abacc56f322c73d6c0187ed6e1abbbf2b
- Added test-cases to verify the functional behaviour
of the BLAS-extension API ?omatcopy_(). The test-cases
cover the following categories for the supported datatypes :
- Functional and memory testing.
- Negative parameter testing with invalid inputs.
- Early return scenarios.
- Exception value testing.
- Implemented a function to load the reference symbol, based
on the choice of the reference library. The function definition
is overloaded due to different API standards being exposed by
different libraries.
AMD-Internal: [CPUPL-4810][SWLCSG-2706]
Change-Id: I8dcaeeaa36d392b752eb0685e32583a12ddc4220
- Updated the generate_NAN_INF() in test_trsm.h to properly induce NaNs
and Infs for complex types.
AMD-Internal: [CPUPL-4639]
Change-Id: I4226e5c5b5f7de85eb89271551f897f87755f4f5
- Handle -0.0 separately in get_value_string()
- Avoid unused variable warning when not TEST_BLIS_TYPED in
subv_evt_testing.cpp
- Remove unused variables in dgemm_ukernel.cpp
- Remove unnecessary local copies of greenzone1 in test
programs now that greenzone_1 and greenzone_2 will
not overlap.
- Protect tests of haswell kernels by ifdef on
BLIS_KERNELS_HASWELL rather than BLIS_KERNELS_ZEN.
- Added GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST
statements in TRSM kernel tests.
- Correct descriptions of trsm and trmm operations.
- Correct typos.
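
On the -0.0 item above: a negative zero compares equal to 0.0, so a formatter that branches on v == 0.0 needs the sign bit to tell the two apart. The sketch below is a hypothetical C helper with made-up output strings. It is not the test suite's get_value_string().

```c
#include <math.h>
#include <stdio.h>
#include <string.h>

/* Writes a short name for v into buf. Hypothetical helper. */
void value_string_hyp(double v, char *buf, size_t len)
{
    if (v == 0.0)
        /* -0.0 == 0.0 is true, so check the sign bit explicitly. */
        snprintf(buf, len, "%s", signbit(v) ? "minus_0" : "0");
    else
        snprintf(buf, len, "%g", v);
}
```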
AMD-Internal: [CPUPL-4500]
Change-Id: If8520347e417785e6aa953a0c8a65d4f5f3c1591