- Added new decision logic to choose between
native TRSM vs unpacked small TRSM for
double precision.
- The changes are made for zen5 processor.
AMD-Internal: [CPUPL-5534]
Change-Id: I5204f6df111edec27d006daeb1c2b535a67b3e46
- In the initial patch - for m, n non-multiple of MR and NR
respectively we are calling bli_dgemm_ker_var2. Now we have
implemented macro-kernel for these fringe cases as well.
- Replaced RBP register with R11 in the macro-kernel.
- Retuned MC, KC and NC with these new changes.
This will result in better performance for matrix sizes
like m=4000 or greater when running on single thread.
AMD-Internal: [CPUPL-5262]
Change-Id: I66c111ceb7feee776703339680d57e8d6d5c809a
- Removed some of the unrolling factors that affected the
performance of AVX2 DAXPYV kernel. In addition to improving
the current performance on sizes compatible to single-threaded
runs, this will now perform better for tiny sizes as well
since the overhead to reach the computation is less.
- Updated the vector partitioning logic, by using
bli_thread_range_sub( ... ), which ensures that there is no
false sharing among multiple threads.
- Updated the AOCL-DYNAMIC logic for the API, to include thresholds
or zen4 and zen5 micro-architectures.
AMD-Internal: [CPUPL-5514]
Change-Id: Iee9edddac685334213cd6694421ab3df3547e930
Currently the CBLAS xerbla always prints and always stops
on error. This commits adds similar functionality to the
regular BLAS xerbla to match the changes in 6d0444497f,
namely:
- Option to stop in xerbla on error. This is controlled by
setting the environment variable BLIS_STOP_ON_ERROR=1
- Option to disable printing of error message from BLIS. This
is controlled by setting the environment variable
BLIS_PRINT_ON_ERROR=0
- Added a function to return the value of INFO passed to xerbla,
assuming xerbla was not set to stop on error. Example call is
info = bli_info_get_info_value();
The default behaviour remains to print but has been changed to
not stop on error, i.e. the equivalent to
export BLIS_PRINT_ON_ERROR=1 BLIS_STOP_ON_ERROR=0
AMD-Internal: [CPUPL-5361]
Change-Id: Icd6125fd60da139e3ec0969e52337a1ed515f0a2
- Adjusted the macro-guards for variables specific to
multithreading, when BLIS is configured with OpenMP.
- This included calling the single-threaded kernel directly
if increment is 0 as well, since this would remove an
unnecessary dependency on one of the variables used only
when we enable OpenMP.
- Further updated the condition to pack the vector, to
avoid it when increment is 0. In this case, we directly
call the kernel.
AMD-Internal: [CPUPL-5480]
Change-Id: I31a9c6e3ffc3c4f9d5b03ed8745919ad65c99c79
- Replaced "vmovupd" with "vmovups" for "bli_scopyv_zen4_asm_avx512"
kernel.
- Optimization of loop unrolling for "bli_dcopyv_zen4_asm_avx512"
and "bli_scopyv_zen4_asm_avx512" kernels.
- Replaced existing load balancing algorithm for dcopy API with
"bli_thread_range_sub" algorithm.
- Included AOCL-dynamic values for optimial number of threads
for zen5 architecture.
AMD-Internal: [CPUPL-5238]
Change-Id: Ic82bdfad9478c8f75dc5a3dcfed0df85fbcae957
- Enabled AVX512 DAXPYF kernels for DGEMV var2 for NO_TRANSPOSE cases.
- Added DAXPYF kernels with fuse factors of 2, 4, 6 and 16.
- Added a wrapper for DAXPYF kernels for redirection to kernels with a
smaller fuse factor than 32.
- Also added UKR tests for the new fused kernels.
AMD-Internal: [CPUPL-5098]
Change-Id: I0b102b67c6c068873393bac0494284f379c253f2
- Logic to calculate the kernel index in AVX512
DGEMMT SUP framework is incorrect.
- The granularity for workload distribution along N
dimension is NR(8), whereas current logic to pick
diagonal kernel assumes the granularity to be MR (24).
- To Fix this, the logic to determine the kernel index is
changed, instead of relying solely on n_offset, the kernel
index is derived depending on distance from the diagonal.
- If distance from diagonal is greater than
LCM of (MR and NR) - NR, that that means the current micro
panel is not a diagonal micro panel.
- If the micro panel is a diagonal micro panel, then the
distance from diagonal is equal to the M dimension for
initial full GEMM region or empty region of diagonal
kernel. This info can be used to determine the kernel index.
AMD-Internal: [CPUPL-5440]
Change-Id: I640d3a1b43e63b24bc9f0ed4a67cced45f6fa3b3
- In order to reuse 24x8 AVX512 DGEMM SUP kernels,
24x8 triangular AVX512 DGEMMT SUP kernels are added.
- Since the LCM of MR(24) and NR(8) is 24, therefore the diagonal
pattern repeats every 24x24 block of C. To cover this 24x24 block,
3 kernels are needed for one variant of DGEMMT. A total of 6
kernels are needed to cover both upper and lower variants.
- In order to maximize code reuse, the 24x8 kernels are broken
into two parts, 8x8 diagonal GEMM and 16x8 full GEMM. The 8x8
diagonal GEMM is computed by 8x8 diagonal kernel, and 16x8
full GEMM part is computed by 24x8 DGEMM SUP kernel.
- Changes are made in framework to enable the use of these kernels.
AMD-Internal: [CPUPL-5338]
Change-Id: I8e7007031e906f786b0c4fe12377ee439075207a
- Implemented AVX512 computational kernel for DAXPBYV
with optimal unrolling. Further implemented the other
missing kernels that would be required to decompose
the computation in special cases, namely the AVX512
DADDV and DSCAL2V kernels.
- Updated the zen4 and zen5 contexts to ensure any query
to acquire the kernel pointer for DAXPBYV returns the
address of the new kernel.
- Added micro-kernel units tests to GTestsuite to check
for functionality and out-of-bounds reads and writes.
AMD-Internal: [CPUPL-5406][CPUPL-5421]
Change-Id: I127ab21174ddd9e6de2c30a320e62a8b042cbde6
- Implemented a new front-end for the BLAS/CBLAS calls
to ?AXPBYV(BLAS-extension API), that is intended to
be compiled only on Zen micro-architectures(as per the
existing build system).
- This new front-end makes the framework lightweight for
BLAS/CBLAS calls to ?AXPBYV, by directly querying the
architecture ID and deploying the associated computational
kernel.
- Further updated the rerouting to other L1 kernels based
on alpha and beta value. This was initially present in
the Typed-API interface. It has been moved inside the
respective kernels, and only necessary rerouting is done
to specific L1 kernels to avoid redundant checks.
AMD-Internal: [CPUPL-5406]
Change-Id: I4af943d477a25dcdab4ee6009ad3dfa6a5c2b37e
- Implemented two new axpyf kernels for fused factors 8 and 12
by manually unrolling the loops. Used to achieve better performance
in var2 case.
AMD-Internal: [CPUPL-5184]
Change-Id: I40d2930d003c6ce90323b5c8a52564563d1f23f5
- Added CSCALV kernel utilizing the AVX512 ISA.
- Added function pointers for the same to zen4 and zen5 contexts.
- Updated the BLAS interface to invoke respective CSCALV kernels based
on the architecture.
- Added UKR tests for bli_cscalv_zen_int_avx512( ... ).
AMD-Internal: [CPUPL-5299]
Change-Id: I189d87a1ec1a6e30c16e05582dcb57a8510a27f3
- Introduced new 8x24 macro kernels.
- 4 new kernels are added for beta 0, beta 1, beta -1
and beta N.
- IR and JR loop moved to ASM region.
- Kernels support row major storage scheme.
- Prefetch of current micro panel of C is enabled.
- Kernel supports negative offsets for A and B matrices.
- Moved alpha scaling from DGEMM kernel to B pack kernel.
- Tuned blocksizes for new kernel.
- Added support for alpha scaling in 24xk pack kernel.
- Reverted back to old b_next computation
in gemm_ker_var2.
- BugFix in 8x24 DGEMM kernel for beta 1,
comparsion for jmp conditions was done using integer
instructions, which caused beta 1 path to never be taken.
Fixed this by changing the comparsion to double.
AMD-Internal: [CPUPL-5262]
Change-Id: Ieec207eea2a164603c8a8ea88e0b1d3095c29a3f
- Implemented bli_zgemm_16x4_avx512_k1_nn( ... ) AVX512 kernel to
be used as part of BLAS/CBLAS calls to ZGEMM. The kernel is built
for handling the GEMM computation with inputs having k = 1,
with the transpose values being N(for column-major) and T(for
row-major).
- Updated the zgemm_blis_impl( ... ) layer to query the architecture
ID and invoke the AVX2 or AVX512 kernel accordingly.
- Added API level tests for accuracy and code-coverage, as well as
micro-kernel tests for verifying functionality and out-of-bounds
memory accesses.
AMD-Internal: [CPUPL-5249]
Change-Id: Id1f8bebff3e0da83c7febe86299564fd658b2e84
- Replaced 'bli_zaxpyv_zen_int5' kernel with optimised
'bli_zaxpyv_zen_int_avx512' kernel for zen4 and
zen5 config.
- Implemented multithreading support and AOCL-dynamic
for ZAXPY API.
- Utilized 'bli_thread_range_sub' function to achieve
better work distribution and avoid false sharing.
AMD-Internal: [CPUPL-5250]
Change-Id: I46ad8f01f9d639e0baa78f4475d6e86458d8069b
* commit 'cfa3db3f':
Fixed bug in mixed-dt gemm introduced in e9da642.
Removed support for 3m, 4m induced methods.
Updated do_sde.sh to get SDE from GitHub.
Disable SDE testing of old AMD microarchitectures.
Fixed substitution bug in configure.
Allow use of 1m with mixing of row/col-pref ukrs.
AMD-Internal: [CPUPL-2698]
Change-Id: I961f0066243cf26aeb2e174e388b470133cc4a5f
* commit '81e10346':
Alloc at least 1 elem in pool_t block_ptrs. (#560)
Fix insufficient pool-growing logic in bli_pool.c. (#559)
Arm SVE C/ZGEMM Fix FMOV 0 Mistake
SH Kernel Unused Eigher
Arm SVE C/ZGEMM Support *beta==0
Arm SVE Config armsve Use ZGEMM/CGEMM
Arm SVE: Update Perf. Graph
Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0
Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0
A64FX Config Use ZGEMM/CGEMM
Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg
Arm SVE Add SGEMM 2Vx10 Unindexed
Arm SVE ZGEMM Support Gather Load / Scatt. St.
Arm SVE Add ZGEMM 2Vx10 Unindexed
Arm SVE Add ZGEMM 2Vx7 Unindexed
Arm SVE Add ZGEMM 2Vx8 Unindexed
Update Travis CI badge
Armv8 Trash New Bulk Kernels
Enable testing 1m in `make check`.
Config ArmSVE Unregister 12xk. Move 12xk to Old
Revert __has_include(). Distinguish w/ BLIS_FAMILY_**
Register firestorm into arm64 Metaconfig
Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo
Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo
Add test for Apple M1 (firestorm)
Firestorm CPUID Dispatcher
Armv8 GEMMSUP Edge Cases Require Signed Ints
Make error checking level a thread-local variable.
Fix data race in testsuite.
Update .appveyor.yml
Firestorm Block Size Fixes
Armv8 Handle *beta == 0 for GEMMSUP ??r Case.
Move unused ARM SVE kernels to "old" directory.
Add an option to control whether or not to use @rpath.
Fix $ORIGIN usage on linux.
Arm micro-architecture dispatch (#344)
Use @path-based install name on MacOS and use relocatable RPATH entries for testsuite inaries.
Armv8 Handle *beta == 0 for GEMMSUP ?rc Case.
Armv8 Fix 6x8 Row-Maj Ukr
Apply patch from @xrq-phys.
Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs.
bli_error: more cleanup on the error strings array
Arm SVE Exclude SVE-Intrinsic Kernels for GCC 8-9
Arm SVE: Correct PACKM Ker Name: Intrinsic Kers
Fix config_name in bli_arch.c
Arm Whole GEMMSUP Call Route is Asm/Int Optimized
Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref
Header Typo
Arm: DGEMMSUP ??r(rv) Invoke Edge Size
Arm: DGEMMSUP ?rc(rd) Invoke Edge Size
Arm: Implement GEMMSUP Fallback Method
Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin
Added Apple Firestorm (A14/M1) Subconfig
Arm64 8x4 Kernel Use Less Regs
Armv8-A Supplimentary GEMMSUP Sizes for RD
Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm
Armv8-A Adjust Types for PACKM Kernels
Armv8-A GEMMSUP-RD 6x8m
Armv8-A GEMMSUP-RD 6x8n
Armv8-A s/d Packing Kernels Fix Typo
Armv8-A Introduced s/d Packing Kernels
Armv8-A DGEMMSUP 6x8m Kernel
Armv8-A DGEMMSUP Adjustments
Armv8-A Add More DGEMMSUP
Armv8-A Add GEMMSUP 4x8n Kernel
Armv8-A Add Part of GEMMSUP 8x4m Kernel
Armv8A DGEMM 4x4 Kernel WIP. Slow
Armv8-A Add 8x4 Kernel WIP
AMD-Internal: [CPUPL-2698]
Change-Id: I194ff69356740bb36ca189fd1bf9fef02eec3803
- Implemented bli_dnorm2fv_unb_var1_avx512( ... ) AVX512
computational kernel for DNRM2 API.
- Updated the header to include this kernel signature, as well
as the framework layer to use this function in case of ZEN4
and ZEN5 configurations.
- Updated the tipping points for ideal thread setting in DNRM2
for ZEN5 micro-architecture. These thresholds are specific
to the library's linkage to LLVM's OpenMP or GNU's OpenMp.
- Further abstracted the AOCL-DYNAMIC logic to separate functions
for ?NRM2 APIs that currently support it(namely, DNRM2 and ZNRM2).
- Further updated the ?NRM2 framework to accommodate the necessary
changes to invoke the newer AOCL-DYNAMIC functions and the AVX512
kernel, when needed.
- Added micro-kernel and memory tests for this kernel in GTestsuite,
to validate accuracy and out-of-bounds read and write.
AMD-Internal: [CPUPL-5265]
Change-Id: I4fc0d0f1e6906bf27d46562ca387c338cc4d2049
- Updated the existing code-path for ?AXPBYV to
reroute the inputs to the appropriate L1 kernel,
based on the alpha and beta value. This is done
in order to utilize sensible optimizations with
regards to the compute and memory operations.
- Updated the typed API interface for ?AXPBYV to include
an early exit condition(when n is 0, or when alpha is
0 and beta is 1). Further updated this layer to query
the right kernel from context, based on the input values
of alpha and beta.
- Added the necessary L1 vector kernels(i.e, ?SETV, ?ADDV,
?SCALV, ?SCAL2V and ?COPYV) to be used as part of special
case handling in ?AXPBYV.
- Moved the early return with negative increments from ?SCAL2V
kernels to its typed API interface.
- Updated the zen, zen2 and zen3 context to include function
pointers for all these vector kernels.
- Updated the existing ?AXPBYV vector kernels to handle only
the required computation. Additional cleanup was done to
these kernels.
- Added accuracy and memory tests for AVX2 kernels of ?SETV
?COPYV, ?ADDV, ?SCALV, ?SCAL2V, ?AXPYV and ?AXPBYV APIs
- Updated the existing thresholds in ?AXPBYV tests for complex
types. This is due to the fact that every complex multiplication
involves two mul ops and one add op. Further added test-cases
for API level accuracy check, that includes special cases of
alpha and beta.
- Decomposed the reference call to ?AXPBYV with several other
L1 BLAS APIs(in case of the reference not supporting its own
?AXPBYV API). The decomposition is done to match the exact
operations that is done in BLIS based on alpha and/or beta
values. This ensures that we test for our own compliance.
AMD-Internal: [CPUPL-4861]
Change-Id: Ia6d48f12f059f52b31c0bef6c75f47fd364952c6
- Modifying threading framework for L1 APIs to update only number of threads from runtime env and avoid overhead of reading other ICVs.
- Removing bli_arch_set_id_once() from bli_arch_set_id_once() flow as bli_arch_check_id_once() calls it.
AMD-Internal: [CPUPL-4877]
Change-Id: I87b346825a96d74e746a41530b6d22ae162f19ba
Changes to how AOCL_ENABLE_INSTRUCTIONS handles requests
for different ISAs (i.e. BLIS sub-configurations):
- Add missing SSE and AVX options. These will all chose the
generic option in amdzen builds.
- For unsupported ISAs (e.g. AVX512 on Milan), select the
hardware's default sub-configuration instead of trying
to step down through alternative choices.
- For invalid options, or options not implemented in the BLIS
build (e.g. skx in amdzen build), select the hardware's
default sub-configuration instead of aborting.
Currently BLIS_ARCH_TYPE behaviour is not affected by these
changes.
AMD-Internal: [CPUPL-5078]
Change-Id: Idbd00d2806b1679889a9249878c51981c8d23b3f
- Introduced new 8x24 row preferred kernel for zen5.
- Kernel supports row/col/gen
storage schemes.
- Prefetch of current panel of A and C
are enabled.
- Prefetch of next panel of B is enabled.
- Kernel supports negative offsets for A and B
matrices.
- Cache block tuning is done for zen5 core.
AMD-Internal: [CPUPL-5262]
Change-Id: I058ea7e1b751c20c516d7b27a1f27cef96ef730f
- Updated nt_ideal to 12 for n_elem <= 2500000 and nt_ideal to 16 for
n_elem <= 4000000
AMD-Internal: [CPUPL-4408]
Change-Id: I97c143ab0d9b97e797358af93181c71d948757cc
- Updated the parameter check for leading dimensions in the
functions handling transpose case of matrix A.
- Updated the logic to perform ?IMATCOPY operation. The new
logic uses an auxiliary buffer to copy and scale in place,
if and when needed. This is done in order to avoid overwriting
any subsequent reads that might follow(specifically in case
of having different leading dimensions for reading and writing).
- Updated xerbla_() to throw memory allocation failure based on
INFO parameter being -10. This value is specific to its use-case
in ?IMATCOPY, where it is set to -10.
- Updated the Extreme Value Tests(EVT) logger for ?IMATCOPY
for uniformity.
- Cleaned up the files to follow coding conventions.
AMD-Internal: [CPUPL-4862][SWLCSG-2706]
Change-Id: I34dfa2bcb66b821315e11f7ab2139c41a79ef780
1. Enabled AVX512 path for
- Upper variant
- Different storage schemes for upper and lower variant
2. Modified mask value to handle all fringe cases correctly
AMD_Internal: [CPUPL-5091]
Change-Id: I4bf8aca24c1b87fff606deb05918b8e6216b729e
- Fixed bug in DAXPYF MT kernel when incx != inca.
- Added AOCL Dynamic function for 1f kernels.
- Moved all DOTXF and AXPYF kernels into one file.
AMD-Internal: [CPUPL-4880]
Change-Id: I7d9f44625bc42fad4a9e5b218ecc382efdf22cbe
- Enabled DGEMMT SUP upper kernels in AVX512 code path.
- Enabled use of optimized kernels for all the storages
supported by optimized kernels.
AMD-Internal: [CPUPL-4881]
Change-Id: Id4486610dacaabc405fbc35b2588607c6508705e
AOCL libFLAME optimizations directly call some internal
BLIS symbols. Export them to enable this to work with
the BLIS shared library.
AMD-Internal: [CPUPL-5044]
Change-Id: Icb62dcb51e12d72dde8434593ab17de3c227c93d
- Added AVX512 kernel for ZDOTV.
- Multithreaded both ZDOTC and ZDOTU with AOCL_DYNAMIC support.
AMD-Internal: [CPUPL-5011]
Change-Id: I56df9c07ab3b8df06267a99835b088dcada81bd8
Existing Design:
- GEMM AVX2 kernel performs computation and updates temporary C buffer
- Portion of temporary C buffer is copied to output C buffer
based on UPLO parameter
- For diagonal blocks, using GEMM kernels is not efficient
New Design: Implemented in current patch when UPLO='L'
- GEMMT kernel used for computation, temporary buffer is not required.
- Only required elements are computed using mask load store for all
fringe cases
- Exception: AVX2 code path is used when storage format is RRC, CRR, CRC
- AOCL-Dynamic is added based on dimension
- Check for AVX platform is added in SUP interface, It returns to
native implementation if hardware doesnot support AVX platform
- SUP ref_var2m is expanded for dcomplex datatype to avoid condition
check which exists for double datatype
AMD_Internal: [CPUPL-5006]
Change-Id: I3e21404b732b8f2df9cbdba394303752fdf36286
Thanks to Zhenyu Zhu ajz34 for pointing out this bug.
When trans="t" or "conjugate transpose" in the case of complex data-types
the ldb should be greater than equal to cols.
In the bug it was checked against "rows". Fixed this bug.
Some minor code format is done.
[CPUPL-4810][SWLCSG-2706]
Change-Id: Ie796d25a361b2ba72eda80e8c5867d6352af901f
- In AVX512 ZTRSM kernel, vertorizes division code
is causing failures in matlab.
- The logic is identical in reference C code and intrinsics code,
but intrinsics code is causing failure
- Replaced optimized intrinsics code with C code.
AMD-Internal: [CPUPL-5052]
Change-Id: Iea184330b22c46d979867b870486066ef980eb84
- Added a {} around zen4 switch case to avoid AOCC error.
- Error is caused because in C declarations are not a statement, therefore
they cannot be labled hence compiler is not able to create a lable
for jump.
AMD-Internal: [CPUPL-4880]
Change-Id: Icfeedafd80bf9a955e430ca967b6a93dcbbf075e
- In DGEMMT SUP AVX2 code path, traingular kernels
are added in order to avoid temporary C buffer.
- Since these kernels did not exist for AVX512,
AVX2 kernels were being used in GEMMT.
- AVX512 triangular GEMM kernel has been added
to make sure that AVX512 kernels can be used without
creating a temporary buffer.
- This kernel is added only for Lower variant of GEMMT,
for upper variant of DGEMMT, temporary C buffer is
created, full GEMM kernel is called on temporary C and
traingular region from temporary C is copied to C
buffer.
AMD-Internal: [CPUPL-4881]
Change-Id: Id70645f79ae078ab9a7006e83d328505f1fae8a9
- Kernel dimensions are 4x4.
- Two kernels are implemented, Right Upper and
Right lower.
- In case of Left variants of TRSM, transpose is
induced so that Right variant kernels can be used.
- No packing is performed in these kernels.
- Changes are made in the threshold to pick ZTRSM small
code path.
- BLIS_INLINE is removed from signature of
"TRSMSMALL_KER_PROT".
- These kernels do not support "ENABLE_TRSM_PREINVERSION".
- Newly added kernels do not support conjugate
transpose.
- Added multithreading to ZTRSM small code path.
AMD-Internal: [CPUPL-4324]
Change-Id: I683b1d5239593e54f433e7f27497d72dfbd9141c
- Added DAXPYF and DDOTXF AVX512 kernels.
- Fuse factor for ddotxf kernel is 8.
- 2 DAXPYF kernels are added, with fuse
factor 8 and 32.
- Multithreading is also added to the DAXPYf
kernel with fuse factor 32.
- These kernels are internally used by TRSM.
- Added changes in TRSV to call these kernels
in ZEN4
AMD-Internal: [CPUPL-4880]
Change-Id: I12850de974b437bbca07677b68bc3d6a35858770
- Implemented AVX512 kernels for handling the calls to ZGEMV
with transpose to A matrix.
- This includes the set of ZDOTXF and ZDOTXV kernels. ZDOTXF
kernels include those with fuse-factor 8 (main kernel), 4
and 2(fringe kernels).
- Updated the bli_zgemv_unf_var1( ... ) function to update
the function pointers to these kernels, based on the
configuration.
AMD-Internal: [CPUPL-4974]
Change-Id: I313ae0abe9dc119de849da42f9825b71f11b1fda
- Implemented AVX512 kernels for handling the calls to ZGEMV
with no-transpose to A matrix.
- This includes the ZAXPYF, ZAXPYV and ZSETV kernels.
The set of ZAXPYF kernels include those with fuse-factor 8
(main kernel), 4 and 2(fringe kernels).
- Updated the bli_zgemv_unf_var2( ... ) function to set
the function pointers to these kernels, based on the
configuration. Further added the call to ZSETV at this
layer in case beta is 0.
AMD-Internal: [CPUPL-4974]
Change-Id: Iee4b724719e49023138bb16479765be44d677cd9
- Implemented AVX512 kernels for scopyv_, dcopyv_ and zcopyv_
using respective AVX512 intrinsics including masked
load and store operations.
- Implemented AVX512 kernels for scopy_, dcopy_ and
zcopy_ using assembly language to prevent loss of
performance during the translation of intrinsics.
- Updated the dcopy_blis_impl( ... ) and
zcopy_blis_impl( ... ) function to support
multithreaded calls to the respective computational
kernels, if and when the OpenMP support is enabled.
- Implemented OpenMP parallelization for dcopyv_ and
zcopyv_ APIs, while scopyv_ and ccopyv_ only support
single thread.
AMD-Internal: [CPUPL-4854]
Change-Id: I5fbd0bcca4e59001fbe2b1168b624d0c33242b3e
Return type of xerbla and xerbla_array APIs are defined as int in BLIS, but according to netlib it should be void. Updated the defination and declaration accordingly.
Signed-off-by: Sridhar Govindaswamy <Sridhar.Govindaswamy@amd.com>
Change-Id: I3072ba76111189de5c5cf08df83ea154163dd34d
Correct the order of tests in bli_cpuid.c to test all known zen
AVX512 platforms before considering fallback tests on AVX512
support. This avoids builds with "configure auto" or
"cmake -DBLIS_CONFIG_FAMILY=auto" incorrectly selecting zen5
sub-configuration on zen4 systems.
AMD-Internal: [CPUPL-4966]
Change-Id: I8706382e2df7c9ae4bb456e3a7f465053e15beea
- Scaling vector X is skipped when alpha is 1 in ZTRSV.
- Scaling matrix A is skipped when alpha is 1 in ZTRSM.
AMD-Internal: [CPUPL-4324]
Change-Id: I03c5a454ed1f5be36dac0f121408749bfc9cfc81
- Comparision using bli_eqsc is slower than direct comparison.
- Changed comparision logic for 1x1 matrix
from bli_sqsc to direct comparision.
AMD-Internal: [CPUPL-4324]
Change-Id: Ifb2d0ad7a97c8bf33b66d624a7ecc53e38c1c803
Implement full support for zen5 as a separate BLIS sub-configuration
and code path within amdzen configuration family.
AMD-Internal: [CPUPL-3518]
Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09
- Added support for ?DOTC in bench.
- Updated DTL to accept conjx as a parameter:
- 'N', i.e., no conjugate for DOTU
- 'C', i.e., conjugate for DOTC
- Updated DTL calls in the interface with respective values of
conjx.
AMD-Internal: [CPUPL-4804]
Change-Id: I447b19a6273566c6021c1721ce173bac4a59142c
- Added BUILD_STATIC_LIBS option which is on by default, only on Linux.
- Added TEST_WITH_SHARED option which is off by default, only on Linux.
- If only shared or static lib is being built, that's the one that will be used for testing.
- If both are being built, TEST_WITH_SHARED determins which library wil be used for testing.
- Set linux workflows so that they build both static and shared libs, and use linux-static and linux-shared to denote which one should be used for testing.
- Set -fPIC for both static and shared builds to fix issues faced when building blis using AOCC 4.0.0 and gtestsuite using gcc 9.4.0.
AMD-Internal: [CPUPL-2748]
Change-Id: I4227bab97ff31ecddfe218e18499f33b4e4ee63e