Optionally enable parallelism inside gtestsuite so that we can
check BLIS functions perform correctly when nested parallelism
is in operation. Enable with:
cmake ... -DOPENMP_NESTED={0,1,2,1diff}
where in gtestsuite
- 0 is the default choice with no parallelism.
- 1 and 2 are simple nested parallelism.
- 1diff has one level of parallelism setting different numbers
of threads to be used by BLIS and reference library calls
from each gtestsuite thread.
Note: OMP_NUM_THREADS must be set appropriately to enable or
disable parallelism at each level in the test programs
as desired.
OMP_NUM_THREADS will also define the parallelism used
within the BLIS library (if it is multithreaded), unless
BLIS-specific ways of specifying parallelism have been
used. If a BLIS-specific parallelism option has been set,
the same mechanism will be used in the 1diff option to
vary the number of threads in BLIS per application thread.
AMD-Internal: [CPUPL-3902]
Change-Id: I89f9edb4125c64ef03e025a9f6ccb84960ba8771
More changes to standardize copyright formatting and correct years
for some files modified in recent commits.
AMD-Internal: [CPUPL-5895]
Change-Id: Ie95d599710c1e0605f14bbf71467ca5f5352af12
- Added 32x3n n-biased kernels to directly handle the cases where n=3
which were earlier being handled by the primary n-biased, 32x8n,
kernel.
- Modified the n-biased fringe kernels to further handle the smaller
m-fringe cases. Thus, now the kernels handle the following range of m
for any value of n:
- 16x8n : m = [16, 31)
- 8x8n : m = [8, 15)
- m_leftx8n : m = [1, 7]
- Updated the function pointer map for n-biased kernels with added
granularity to invoke the smaller fringe cases directly on the basis
of m-dimension.
- Added micro-kernel unit tests for all the dgemv_n kernels.
AMD-Internal: [CPUPL-6231]
Change-Id: Ibe88848c2c1bbb65b3e79fbc90a2800dc15f5119
- Added new ZTRSM kernels for right and left variants.
- Kernel dimensions are 12x4.
- 12x4 ZGEMM SUP kernels are used internally
for solving GEMM subproblem.
- These kernels do not support conjugate transpose.
- Only column major inputs are supported.
- Tuned thresholds to pick efficent code path for ZEN5.
AMD-Internal: [CPUPL-6356]
Change-Id: I33ba3d337b0fcd972ca9cfe4668cb23d2b279b6e
- Reduced the blocking size of 'bli_ddotv_zen_int10'
kernel from 40 elements to 20 elements for better
utilization of vector registers
- Replaced redundant 'for' loops in 'bli_ddotv_zen_int10'
kernel with 'if' conditions to handle reminder
iterations. As only a single iteration is used when
reminder is less than the primary unroll factor.
- Added a conditional check to invoke the vectorized
DDOTV kernels directly(fast-path), without incurring
any additional framework overhead.
- The fast-path is taken when the input size is ideal
for single-threaded execution. Thus, we avoid the
call to bli_nthreads_l1() function to set the ideal
number of threads.
- Updated getestsuite ukr tests for 'bli_ddotv_zen_int10'
kernel.
AMD-Internal: [CPUPL-4877]
Change-Id: If43f0fcff1c5b1563ad233005717398b5b6fb8f2
In case the executable to obtain the BLIS library version fails,
catch and report common errors to help with debugging.
Also correct the test for bli_info_get_info() support to mark
that it is not available in any AOCL version <= 4.1
AMD-Internal: [CPUPL-4500]
Change-Id: Ie8f728b49faa60e0469562dbf77d67f86b415cd8
- As part of AOCL-BLAS, there exists a set of vectorized
SUP kernels for GEMM, that are performant when invoked
in a bare-metal fashion.
- Designed a macro-based interface for handling tiny
sizes in GEMM, that would utilize there kernels. This
is currently instantiated for 'Z' datatype(double-precision
complex).
- Design breakdown :
- Tiny path requires the usage of AVX2 and/or AVX512
SUP kernels, based on the micro-architecture. The
decision logic for invoking tiny-path is specific
to the micro-architecture. These thresholds are defined
in their respective configuration directories(header files).
- List of AVX2/AVX512 SUP kernels(lookup table), and their
lookup functions are defined in the base-architecture from
which the support starts. Since we need to support backward
compatibility when defining the lookup table/functions, they
are present in the kernels folder(base-architecture).
- Defined a new type to be used to create the lookup table and its
entries. This type holds the kernel pointer, blocking dimensions
and the storage preference.
- This design would only require the appropriate thresholds and
the associated lookup table to be defined for the other datatypes
and micro-architecture support. Thus, is it extensible.
- NOTE : The SUP kernels that are listed for Tiny GEMM are m-var
kernels. Thus, the blocking in framework is done accordingly.
In case of adding the support for n-var, the variant
information could be encoded in the object definition.
- Added test-cases to validate the interface for functionality(API
level tests). Also added exception value tests, which have been
disabled due to the SUP kernel optimizations.
AMD-Internal: [CPUPL-6040][CPUPL-6018][CPUPL-5319][CPUPL-3799]
Change-Id: I84f734f8e683c90efa63f2fa79d2c03484e07956
- Added support for gemv kernels unit test in gtestsuite.
- Added micro-kernel tests and memory tests for DGEMV
transpose case kernels.
AMD-Internal: [CPUPL-5835]
Change-Id: I7d2d3cdbfea436f6c9b2cce9f2e85bfc5c51f201
- Updated the existing AVX2 {S/D}AMAXV kernels to comply
to the standard when having exception values. This makes
it exhibit the same behaviour as it AVX512 variants.
Provided additional optimizations with loop unrolling.
- Removed redundant early return checks inside the kernels,
since they have been abstracted to a higher layer.
- Updated the unit-tests(micro-kernel) and exception value
tests for appropriate code-coverage. Also re-enabled the
exception value tests.
AMD-Internal: [CPUPL-4745]
Change-Id: I36c793220bd4977a00281af9737c51cd1e5c60d9
- Restore test for old bli_dgemm_zen4_asm_8x24 kernel, so that
we can test this if linking with older AOCL versions.
- Move K_bli_dgemm_avx512_asm_8x24 definition from AOCL_42 list
to AOCL_50 list.
AMD-Internal: [CPUPL-4500]
Change-Id: Id522f4bc5b89e86f77c4e1d26c75e261736ab450
- Bug : The current {S/D}AMAXV AVX512 kernels produced an
incorrect functionality with multiple absolute maximums.
They returned the last index when having multiple occurences,
instead of the first one.
- Implemented a bug-fix to handle this issue on these AVX512
kernels. Also ensured that the kernels are compliant with
the standard when handling exception values.
- Further optimized the code by decoupling the logic to find
the maximum element and its search space for index. This way,
we use lesser latency instructions to compute the maximum
first.
- Updated the unit-tests, exception value tests and early return
tests for the API to ensure code-coverage.
AMD-Internal: [CPUPL-4745]
Change-Id: I2f44d33dbaf89fe19e255af1f934877816940c6f
This patch introduces comprehensive optimizations to the DGEMM kernel, focusing on loop
efficiency and edge kernel performance. The following technical improvements have been implemented:
1. **IR Loop Optimization:**
- The IR loop has been re-implemented in hand-written assembly to eliminate the overhead associated
with `begin_asm` and `end_asm` calls, resulting in more efficient execution.
2. **JR Loop Integration:**
- The JR loop is now incorporated into the micro kernel. This integration avoids the repetitive overhead
of stack frame management for each JR iteration, thereby enhancing loop performance.
3. **Kernel Decomposition Strategy:**
- The m dimension is decomposed into specific sizes: 20, 18, 17, 16, 12, 11, 10, 9, 8, 4, 2, and 1.
- For remaining cases, masked variants of edge kernels are utilized to handle the decomposition efficiently.
1. **Interleaved Scaling by Alpha:**
- Scaling by the alpha factor is interleaved with load instructions to optimize the instruction pipeline
and reduce latency.
2. **Efficient Mask Preparation:**
- Masks are prepared within inline assembly code only at points where masked load-store operations are necessary,
minimizing unnecessary overhead.
3. **Broadcast Instruction Optimization:**
- In edge kernels where each FMA (Fused Multiply-Add) operation requires a broadcast without subsequent reuse,
the broadcast instruction is replaced with `mem_1to8`.
- This allows the compiler to optimize by assigning separate vector registers for broadcasting, thus avoiding
dependency chains and improving execution efficiency.
4. **C Matrix Update Optimization:**
- During the update of the C matrix in edge kernels, columns are pre-loaded into multiple vector registers.
This approach breaks dependency chains during FMA operations following the scaling by alpha, thereby mitigating
performance bottlenecks and enhancing throughput.
These optimizations collectively improve the performance of the DGEMM kernel, particularly in handling edge cases and
reducing overhead in critical loops. The changes are expected to yield significant performance gains in matrix multiplication
operations.
This patch also involves changes for tiny gemm interface. A light
interface for calling kernels and removing calls to avx2 dgemm kernels
as we use avx512 dgemm kernels for all the sizes for zen4 and zen5.
For zen4 and zen5 when A matrix transposed(CRC, RRC), tiny kernel does not have
the support to handle such inputs and thus such inputs are routed to
gemm_small path.
AMD-Internal: [CPUPL-6054]
Change-Id: I57b430f9969ca39aa111b54fa169e4225b900c4a
- AVX512 specific DGEMV native kernels are added for Zen4/5
architectures to handle the NO_TRANSPOSE cases and are independent of
the AXPYF fused kernels.
- The following set of kernels biased towards the n-dimension perform
beta scaling of y vector within the kernel itself and handle cases
where n is less than 5:
- bli_dgemv_n_zen_int_32x8n_avx512( ... )
- bli_dgemv_n_zen_int_32x4n_avx512( ... )
- bli_dgemv_n_zen_int_32x2n_avx512( ... )
- bli_dgemv_n_zen_int_32x1n_avx512( ... )
- The bli_dgemv_n_zen_int_16mx8_avx512( ... ) is biased towards the
m-dimension and for this kernel beta scaling is handled beforehand
within the framework.
- Added unit-tests for the new kernels.
- AVX2 path for Zen/2/3 architectures still follows the old approach of
using fused kernel, namely AXPYF, to perform the GEMV operation.
AMD-Internal: [CPUPL-5560]
Change-Id: I22bc2a865cd28b9cdcb383e17d1ff38bdd28de79
- Merged ZEN4 and ZEN5 DGEMM 8x24 kernel.
- Replaced 32x6 kernel with 8x24. Now same
kernel is used for ZEN4 and ZEN5.
- Blocksizes have been tuned for genoa only.
- DGEMM kernel for DTRSM native code path
is replaced with 8x24 kernel.
- Enabled alpha scaling during packing for ZEN4.
- ZEN4 8x24 kernel has been removed.
AMD-Internal: [CPUPL-5912]
Change-Id: I89a16a7e3355af037d21d453aabf53c5ecccb754
Add definitions in gtestsuite header to list available kernel
by AOCL BLIS version. Check these definitions in ukr test
programs to avoid missing symbol errors when testing with an
older version of BLIS.
Currently AOCL_41, AOCL_42, AOCL_50 and AOCL_DEV are supported,
with AOCL_DEV inferred from the version being later than the
value of AOCL_BLAS_LATEST_VERSION set in CMakeLists.txt. Thanks
to Eleni Vlachopoulou for the cmake functionality to automatically
detect the version from the library.
AMD-Internal: [CPUPL-4500]
Change-Id: I40ffd3d3789324fbb1dabfbf5e1dd4e0c94d54d9
gemm_compute currently has differences in the interface to
the blis_impl layer compared to the top-level API. Modify
gtestsuite wrapper to account for this.
AMD-Internal: [CPUPL-4500]
Change-Id: Ie96c9ac3b23128ae8e03af34ad11e65910dec594
- Implemented a set of column preferential dot-product based
ZGEMM kernels(main and fringe) in AVX512(for SUP code-path).
These kernels perform matrix multiplication as a sequence
of inner products(i.e, dot-products).
- These standalone kernels are expected to strictly handle
the CRC storage scheme for C, A and B matrices. RRC is also
supported through operation transpose, at the framework
level.
- Added unit-tests to test all the kernels(main and fringe),
as well as the redirection between these kernels.
AMD-Internal: [CPUPL-5949]
Change-Id: I858257ac2658ed9ce4980635874baa1474b79c38
Correction to CMakeLists.txt to fix problem building executables
when testing against MKL.
AMD-Internal: [CPUPL-5928]
Change-Id: Ie427fff0afb48be6ce6d940b1db2c9d1c7a40e5b
Since there is thorough kernel testing, we reduce the number of "Black Box" test cases so that CI is faster.
AMD-Internal: [CPUPL-4500]
Change-Id: Ie57eeccff8103c0051eb1904162d6447da0ef102
- Correct matsize and NumericalComparison functions for
tests with first matrix dimension <= 0.
- BLAS1:
- Fix for BLAS vs CBLAS differences in amaxv IIT_ERS tests.
- Threshold adjustments in ddotxf and zaxpy.
- Break axpyv and scalv into separate executables for
each data type.
- BLAS2:
- Threshold adjustments in symv and hemv.
- Break ger into separate executables for each data type.
- UKR:
- Break gemm and trsm ukr test into separate executables
for each data type.
- Threshold adjustments in daxpyf
- Disable {z,c}trsm ukr tests when BLIS_INT_ELEMENT_TYPE
is used, as matrix generator is not currently suitable
for this.
AMD-Internal: [CPUPL-4500]
Change-Id: I1d9e7acc11025f1478b8b511c14def5517ef0ae6
Those can be run in --gtest_also_run_disabled_tests is used.
Bugs will be addressed and resolved in the future.
AMD-Internal: [CPUPL-4500]
Change-Id: I7a5443606ea8ef20f18ff8beec14bece5f6ee661
Various changes to BLAS2 test cases:
- GEMV: Reduce number of tests to make runtime more reasonable.
- TRSV:
- Standardize tests across different data types, including
adding memory testing for all variants.
- Improve scaling when making matrix A diagonally dominant and
avoid singular matrix when BLIS_INT_ELEMENT_TYPE is used.
- TRMV: Copy TRSV generic tests.
- Expand set of tests for HEMV, HER, HER2, SYMV, SYR, SYR2 and
make lda contribution to test names consistent with others
routines.
- Various adjustments to thresholds added.
Update gtestsuite documentation to describe using GTEST_FILTER
environment variable to select tests to run or exclude. This
works particularly well when using ctest, as we do not enumerate
all the tests at this level and so need to pass the selection
down to the individual executables.
AMD-Internal: [CPUPL-4500]
Change-Id: Ifcb6410455b7f91e58b555f94b9fd7920d7ad9d9
Check if alpha and beta are null before testing values. This
avoids possible seg faults if alpha or beta have not been
defined in IIT tests.
AMD-Internal: [CPUPL-4500]
Change-Id: Ibbf2d6a8fb38d9a95033f3fec3d06c3441e98689
Add BLAS_TEST_IMPL option for TEST_INTERFACE to test the
wrapper layer underneath BLAS and CBLAS interfaces. This is
particularly useful if building a BLIS library with these
interfaces disabled, e.g.
./configure --disable-blas amdzen
or
cmake . -DENABLE_BLAS=OFF -DBLIS_CONFIG_FAMILY=amdzen
The ?_blis_impl wrappers should have the same arguments as the
BLAS interfaces, thus we define TEST_BLAS_LIKE as an additional
definition for convenience when selecting tests and options in
the C++ files.
AMD-Internal: [CPUPL-5650]
Change-Id: I0275a387563f3efc2b40029950c8569956f2df7b
SCALV is used directly by BLAS, CBLAS and BLIS scal{v} APIs but
also within many other APIs to handle special cases. In general
it is preferred to use SETV when alpha=0, but BLAS and CBLAS
continue to multiple all vector element by alpha. This has
different behaviour for propagating NaNs or Infs.
Changes in this commit:
- Standardize early returns from SCALV reference and optimized
kernels.
- User supplied N<0 is handled at the top level API layer. Use
negative values of N in kernel calls to signify that SETV
should _not_ be used when alpha=0. This should only be
required in SCALV.
- Include serial threshold in zdscal (as in dscal) to reduce
overhead for small problem sizes.
- Code tidying to make different variants more consistent.
- More standardization of tests in SCALV gtestsuite programs.
- Remove scalv_extreme_cases.cpp as it is now redundant.
AMD-Internal: [CPUPL-4415]
Change-Id: I42e98875ceaea224cc98d0cdfe0133c9abc3edae
Add guards around bli_trsm_small kernel tests to only call them
if BLIS_ENABLE_SMALL_MATRIX_TRSM is defined. This fixes missing
symbol errors in tests of non-zen builds, e.g. generic or skx.
AMD-Internal: [CPUPL-4500]
Change-Id: I7a822a41b5f686b5e38b0c63dd1871963e990407
- Bug: Among the list of AVX512 SGEMMSUP RD kernels, the ones handling
m_fringe = 3 had incorrect usage of ZMM on a vector-load instruction
that strictly needed YMMs.
- Further updated the existing micro-kernel test cases to simulate
these issues and validate the fix.
AMD-Internal: [CPUPL-5353]
Change-Id: Id86e60ce36bb9f8433a1a203cfe0b8c6347df2c1
- The IIT_ERS test for GEMM_COMPUTE where alpha = 0 and beta = 0 was
failing since neither of the matrices was being packed and thus,
missing the scaling by alpha resulting in a non-zero output for C
matrix (C := A * B).
- Enabled packing of A matrix for the ZeroAlpha_ZeroBeta IIT_ERS test
which handles the alpha scaling.
AMD-Internal: [CPUPL-5598]
Change-Id: Id9179ec6150d1bc5a0274edce727ce6cc4172213
- Delete unused cmake files.
- Add guards around call to bli_cpuid_is_avx2fma3_supported
in frame/3/bli_l3_sup.c, currently assumes that non-x86
platforms will not use bli_gemmtsup.
- Correct variable in frame/base/bli_arch.c on non-x86
builds.
- Add guards around omp pragma to avoid possible gcc
compiler warning in kernels/zen/2/bli_gemv_zen_int_4.c.
- Add missing registers in clobber list in
kernels/zen4/1/bli_dotv_zen_int_avx512.c.
- Add gtestsuite ERS_IIT tests for TRMV, copied from TRSV.
- Correct calls to cblas_{c,z}swap in gtestsuite.
- Correct test name in ddotxf gtestsuite program.
AMD-Internal: [CPUPL-4415]
Change-Id: I69ad56390017676cc609b4d3aba3244a2df6a6b5
Corrections for spelling and other mistakes in code comments
and doc files.
AMD-Internal: [CPUPL-4500]
Change-Id: I33e28932b0e26bbed850c55602dee12fd002da7f
Add tests to check input arguments have not been modified by BLIS
routine. These tests add a large runtime overhead, so they are
disabled by default. To enable them, configure gtestsuite with:
cmake -DTEST_INPUT_ARGS=ON ...
and run desired tests as normal.
Also:
- Correct testinghelpers::chktrans to handle upper case values of
argument trns.
- Change testinghelpers::matsize to return size 0 if m, n or
leading dimension are 0, or if leading dimension is too small.
AMD-Internal: [CPUPL-4379]
Change-Id: I9494af800f9383195272ce99f622104a38fd0ed8
- Set threshold to epsilon for early return cases where we are just
scaling a matrix.
- Add this threshold to IIT_ERS files for appropriate tests.
- In IIT_ERS for gemm_compute, remove tests on null A and B when
we are expecting to set or scale C. More thought is required
in gemm_compute tests to handle these cases and look at cases
where A or B has been packed.
AMD-Internal: [CPUPL-4500]
Change-Id: Ia649cc340ca1df6511388f9c43a31e53296cb2bf
- Added reference kernel for dgemv that handles computation for tiny
sizes (m < 8 && n < 8).
- The reference kernel, bli_dgemv_zen_ref( ... ), supports both
row/column storage schemes as well as transpose and no transpose
cases.
- Added additional unit-tests for functional verification.
AMD-Internal: [CPUPL-5098]
Change-Id: I66fdf0a40e90bdb3fed40152c45ab28a17a87ada
- Added dgemmGenericSUP test for the new 24x8 DGEMM SUP kernel for zen5.
AMD-Internal: [CPUPL-4404]
Change-Id: I150ca310655a495bdcf5ea9d5a16746483a17b68
Various improvements:
- Where appropriate, test both:
- with nullptr for suitable arguments that should never
be touched.
- with all arguments correct except the one we want to
test, to check we are not returning early because
another argument is a nullptr.
- Test incorrect values for order argument in CBLAS calls.
- Test early exits with limited data changes, e.g. set
C to 0 or scale C in GEMM when alpha = 0.
- Bugfix in gemmt test when alpha is 0 and beta is 1.
- Use reference library gemmt for comparison when library
is not netlib BLAS.
AMD-Internal: [CPUPL-4500]
Change-Id: Ibde7eaba5a484a87674044ca44855c6f6ee4ff4b
Unnecessary whitespace (spaces, tabs) at the end of lines
has been removed.
AMD-Internal: [CPUPL-4500]
Change-Id: Ice5f5504232cb22460c14ac47e6a3a43309cba22
Source and other files in some directories were a mixture of
Unix and DOS file formats. Convert all relevant files to Unix
format for consistency.
AMD-Internal: [CPUPL-4500]
Change-Id: Ia3e479643b0bed4ae8a9107bde6e2cddf32d5bd8
- Enabled AVX512 DAXPYF kernels for DGEMV var2 for NO_TRANSPOSE cases.
- Added DAXPYF kernels with fuse factors of 2, 4, 6 and 16.
- Added a wrapper for DAXPYF kernels for redirection to kernels with a
smaller fuse factor than 32.
- Also added UKR tests for the new fused kernels.
AMD-Internal: [CPUPL-5098]
Change-Id: I0b102b67c6c068873393bac0494284f379c253f2
- Implemented AVX512 computational kernel for DAXPBYV
with optimal unrolling. Further implemented the other
missing kernels that would be required to decompose
the computation in special cases, namely the AVX512
DADDV and DSCAL2V kernels.
- Updated the zen4 and zen5 contexts to ensure any query
to acquire the kernel pointer for DAXPBYV returns the
address of the new kernel.
- Added micro-kernel units tests to GTestsuite to check
for functionality and out-of-bounds reads and writes.
AMD-Internal: [CPUPL-5406][CPUPL-5421]
Change-Id: I127ab21174ddd9e6de2c30a320e62a8b042cbde6
- Added CSCALV kernel utilizing the AVX512 ISA.
- Added function pointers for the same to zen4 and zen5 contexts.
- Updated the BLAS interface to invoke respective CSCALV kernels based
on the architecture.
- Added UKR tests for bli_cscalv_zen_int_avx512( ... ).
AMD-Internal: [CPUPL-5299]
Change-Id: I189d87a1ec1a6e30c16e05582dcb57a8510a27f3
- Implemented bli_zgemm_16x4_avx512_k1_nn( ... ) AVX512 kernel to
be used as part of BLAS/CBLAS calls to ZGEMM. The kernel is built
for handling the GEMM computation with inputs having k = 1,
with the transpose values being N(for column-major) and T(for
row-major).
- Updated the zgemm_blis_impl( ... ) layer to query the architecture
ID and invoke the AVX2 or AVX512 kernel accordingly.
- Added API level tests for accuracy and code-coverage, as well as
micro-kernel tests for verifying functionality and out-of-bounds
memory accesses.
AMD-Internal: [CPUPL-5249]
Change-Id: Id1f8bebff3e0da83c7febe86299564fd658b2e84
- Implemented bli_dnorm2fv_unb_var1_avx512( ... ) AVX512
computational kernel for DNRM2 API.
- Updated the header to include this kernel signature, as well
as the framework layer to use this function in case of ZEN4
and ZEN5 configurations.
- Updated the tipping points for ideal thread setting in DNRM2
for ZEN5 micro-architecture. These thresholds are specific
to the library's linkage to LLVM's OpenMP or GNU's OpenMp.
- Further abstracted the AOCL-DYNAMIC logic to separate functions
for ?NRM2 APIs that currently support it(namely, DNRM2 and ZNRM2).
- Further updated the ?NRM2 framework to accommodate the necessary
changes to invoke the newer AOCL-DYNAMIC functions and the AVX512
kernel, when needed.
- Added micro-kernel and memory tests for this kernel in GTestsuite,
to validate accuracy and out-of-bounds read and write.
AMD-Internal: [CPUPL-5265]
Change-Id: I4fc0d0f1e6906bf27d46562ca387c338cc4d2049
- Updated the existing code-path for ?AXPBYV to
reroute the inputs to the appropriate L1 kernel,
based on the alpha and beta value. This is done
in order to utilize sensible optimizations with
regards to the compute and memory operations.
- Updated the typed API interface for ?AXPBYV to include
an early exit condition(when n is 0, or when alpha is
0 and beta is 1). Further updated this layer to query
the right kernel from context, based on the input values
of alpha and beta.
- Added the necessary L1 vector kernels(i.e, ?SETV, ?ADDV,
?SCALV, ?SCAL2V and ?COPYV) to be used as part of special
case handling in ?AXPBYV.
- Moved the early return with negative increments from ?SCAL2V
kernels to its typed API interface.
- Updated the zen, zen2 and zen3 context to include function
pointers for all these vector kernels.
- Updated the existing ?AXPBYV vector kernels to handle only
the required computation. Additional cleanup was done to
these kernels.
- Added accuracy and memory tests for AVX2 kernels of ?SETV
?COPYV, ?ADDV, ?SCALV, ?SCAL2V, ?AXPYV and ?AXPBYV APIs
- Updated the existing thresholds in ?AXPBYV tests for complex
types. This is due to the fact that every complex multiplication
involves two mul ops and one add op. Further added test-cases
for API level accuracy check, that includes special cases of
alpha and beta.
- Decomposed the reference call to ?AXPBYV with several other
L1 BLAS APIs(in case of the reference not supporting its own
?AXPBYV API). The decomposition is done to match the exact
operations that is done in BLIS based on alpha and/or beta
values. This ensures that we test for our own compliance.
AMD-Internal: [CPUPL-4861]
Change-Id: Ia6d48f12f059f52b31c0bef6c75f47fd364952c6