amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-24 18:34:40 +00:00

Author	SHA1	Message	Date
Shubham Sharma.	45d82a1ebf	Threshold tuning for DTRSM on zen5 - Added new decision logic to choose between native TRSM vs unpacked small TRSM for double precision. - The changes are made for zen5 processor. AMD-Internal: [CPUPL-5534] Change-Id: I5204f6df111edec27d006daeb1c2b535a67b3e46	2024-08-01 11:27:28 -04:00
Shubham Sharma.	f378fc57b5	DGEMM Native AVX512 updates - In the initial patch - for m, n non-multiple of MR and NR respectively we are calling bli_dgemm_ker_var2. Now we have implemented macro-kernel for these fringe cases as well. - Replaced RBP register with R11 in the macro-kernel. - Retuned MC, KC and NC with these new changes. This will result in better performance for matrix sizes like m=4000 or greater when running on single thread. AMD-Internal: [CPUPL-5262] Change-Id: I66c111ceb7feee776703339680d57e8d6d5c809a	2024-07-31 12:23:34 -04:00
Vignesh Balasubramanian	f23b8e636b	AVX2 and AVX512 optimizations for DAXPYV - Removed some of the unrolling factors that affected the performance of AVX2 DAXPYV kernel. In addition to improving the current performance on sizes compatible to single-threaded runs, this will now perform better for tiny sizes as well since the overhead to reach the computation is less. - Updated the vector partitioning logic, by using bli_thread_range_sub( ... ), which ensures that there is no false sharing among multiple threads. - Updated the AOCL-DYNAMIC logic for the API, to include thresholds for zen4 and zen5 micro-architectures. AMD-Internal: [CPUPL-5514] Change-Id: Iee9edddac685334213cd6694421ab3df3547e930	2024-07-31 09:24:36 -04:00
Edward Smyth	8848ecb103	Improvements to CBLAS xerbla functionality Currently the CBLAS xerbla always prints and always stops on error. This commit adds similar functionality to the regular BLAS xerbla to match the changes in `6d0444497f`, namely: - Option to stop in xerbla on error. This is controlled by setting the environment variable BLIS_STOP_ON_ERROR=1 - Option to disable printing of error message from BLIS. This is controlled by setting the environment variable BLIS_PRINT_ON_ERROR=0 - Added a function to return the value of INFO passed to xerbla, assuming xerbla was not set to stop on error. Example call is info = bli_info_get_info_value(); The default behaviour remains to print but has been changed to not stop on error, i.e. the equivalent to export BLIS_PRINT_ON_ERROR=1 BLIS_STOP_ON_ERROR=0 AMD-Internal: [CPUPL-5361] Change-Id: Icd6125fd60da139e3ec0969e52337a1ed515f0a2	2024-07-26 10:36:37 -04:00
Vignesh Balasubramanian	68c54297bd	Fixing compiler warnings when configuring BLIS without OpenMP - Adjusted the macro-guards for variables specific to multithreading, when BLIS is configured with OpenMP. - This included calling the single-threaded kernel directly if increment is 0 as well, since this would remove an unnecessary dependency on one of the variables used only when we enable OpenMP. - Further updated the condition to pack the vector, to avoid it when increment is 0. In this case, we directly call the kernel. AMD-Internal: [CPUPL-5480] Change-Id: I31a9c6e3ffc3c4f9d5b03ed8745919ad65c99c79	2024-07-25 10:29:33 -04:00
Hari Govind S	eacad443e3	Optimization for DCOPY and SCOPY API - Replaced "vmovupd" with "vmovups" for "bli_scopyv_zen4_asm_avx512" kernel. - Optimization of loop unrolling for "bli_dcopyv_zen4_asm_avx512" and "bli_scopyv_zen4_asm_avx512" kernels. - Replaced existing load balancing algorithm for dcopy API with "bli_thread_range_sub" algorithm. - Included AOCL-dynamic values for optimal number of threads for zen5 architecture. AMD-Internal: [CPUPL-5238] Change-Id: Ic82bdfad9478c8f75dc5a3dcfed0df85fbcae957	2024-07-24 08:23:07 -04:00
Arnav Sharma	9583ee2e23	DGEMV Optimizations for NO_TRANSPOSE cases - Enabled AVX512 DAXPYF kernels for DGEMV var2 for NO_TRANSPOSE cases. - Added DAXPYF kernels with fuse factors of 2, 4, 6 and 16. - Added a wrapper for DAXPYF kernels for redirection to kernels with a smaller fuse factor than 32. - Also added UKR tests for the new fused kernels. AMD-Internal: [CPUPL-5098] Change-Id: I0b102b67c6c068873393bac0494284f379c253f2	2024-07-24 15:59:36 +05:30
Shubham Sharma	15ef6532e9	BugFix in DGEMMT SUP AVX512 code path - Logic to calculate the kernel index in AVX512 DGEMMT SUP framework is incorrect. - The granularity for workload distribution along N dimension is NR(8), whereas current logic to pick diagonal kernel assumes the granularity to be MR (24). - To fix this, the logic to determine the kernel index is changed: instead of relying solely on n_offset, the kernel index is derived depending on distance from the diagonal. - If distance from diagonal is greater than LCM of (MR and NR) - NR, that means the current micro panel is not a diagonal micro panel. - If the micro panel is a diagonal micro panel, then the distance from diagonal is equal to the M dimension for initial full GEMM region or empty region of diagonal kernel. This info can be used to determine the kernel index. AMD-Internal: [CPUPL-5440] Change-Id: I640d3a1b43e63b24bc9f0ed4a67cced45f6fa3b3	2024-07-24 06:36:34 +00:00
Shubham Sharma	16c56e0101	Added 24x8 triangular kernels for DGEMMT SUP - In order to reuse 24x8 AVX512 DGEMM SUP kernels, 24x8 triangular AVX512 DGEMMT SUP kernels are added. - Since the LCM of MR(24) and NR(8) is 24, therefore the diagonal pattern repeats every 24x24 block of C. To cover this 24x24 block, 3 kernels are needed for one variant of DGEMMT. A total of 6 kernels are needed to cover both upper and lower variants. - In order to maximize code reuse, the 24x8 kernels are broken into two parts, 8x8 diagonal GEMM and 16x8 full GEMM. The 8x8 diagonal GEMM is computed by 8x8 diagonal kernel, and 16x8 full GEMM part is computed by 24x8 DGEMM SUP kernel. - Changes are made in framework to enable the use of these kernels. AMD-Internal: [CPUPL-5338] Change-Id: I8e7007031e906f786b0c4fe12377ee439075207a	2024-07-22 12:02:30 -04:00
Vignesh Balasubramanian	b48e864e82	AVX512 optimizations for DAXPBYV API - Implemented AVX512 computational kernel for DAXPBYV with optimal unrolling. Further implemented the other missing kernels that would be required to decompose the computation in special cases, namely the AVX512 DADDV and DSCAL2V kernels. - Updated the zen4 and zen5 contexts to ensure any query to acquire the kernel pointer for DAXPBYV returns the address of the new kernel. - Added micro-kernel unit tests to GTestsuite to check for functionality and out-of-bounds reads and writes. AMD-Internal: [CPUPL-5406][CPUPL-5421] Change-Id: I127ab21174ddd9e6de2c30a320e62a8b042cbde6	2024-07-22 11:32:19 +05:30
Vignesh Balasubramanian	cec9fdcc6e	Framework enhancements for ?AXPBYV APIs - Implemented a new front-end for the BLAS/CBLAS calls to ?AXPBYV(BLAS-extension API), that is intended to be compiled only on Zen micro-architectures(as per the existing build system). - This new front-end makes the framework lightweight for BLAS/CBLAS calls to ?AXPBYV, by directly querying the architecture ID and deploying the associated computational kernel. - Further updated the rerouting to other L1 kernels based on alpha and beta value. This was initially present in the Typed-API interface. It has been moved inside the respective kernels, and only necessary rerouting is done to specific L1 kernels to avoid redundant checks. AMD-Internal: [CPUPL-5406] Change-Id: I4af943d477a25dcdab4ee6009ad3dfa6a5c2b37e	2024-07-18 10:06:31 -04:00
Arnav Sharma	d5e29e3c7b	CSCALV Framework Bugfix - Fixed bug for non-zen architecture where CSCALV framework incorrectly fetches the dcomplex (ZSCALV) kernel pointer. AMD-Internal: [CPUPL-5299] Change-Id: I1d16588aa9dffd8b9dca69860026e377fa74d547	2024-07-17 00:27:47 +05:30
Hari Govind S	38824244d5	Implementation of AXPYF Kernels for DTRSV - Implemented two new axpyf kernels for fused factors 8 and 12 by manually unrolling the loops. Used to achieve better performance in var2 case. AMD-Internal: [CPUPL-5184] Change-Id: I40d2930d003c6ce90323b5c8a52564563d1f23f5	2024-07-16 06:23:01 -04:00
Arnav Sharma	4aa66f108e	Added CSCALV AVX512 Kernel - Added CSCALV kernel utilizing the AVX512 ISA. - Added function pointers for the same to zen4 and zen5 contexts. - Updated the BLAS interface to invoke respective CSCALV kernels based on the architecture. - Added UKR tests for bli_cscalv_zen_int_avx512( ... ). AMD-Internal: [CPUPL-5299] Change-Id: I189d87a1ec1a6e30c16e05582dcb57a8510a27f3	2024-07-15 07:17:43 -04:00
Shubham Sharma.	a7744361e4	DGEMM optimizations for Turin Classic - Introduced new 8x24 macro kernels. - 4 new kernels are added for beta 0, beta 1, beta -1 and beta N. - IR and JR loop moved to ASM region. - Kernels support row major storage scheme. - Prefetch of current micro panel of C is enabled. - Kernel supports negative offsets for A and B matrices. - Moved alpha scaling from DGEMM kernel to B pack kernel. - Tuned blocksizes for new kernel. - Added support for alpha scaling in 24xk pack kernel. - Reverted back to old b_next computation in gemm_ker_var2. - BugFix in 8x24 DGEMM kernel for beta 1: comparison for jmp conditions was done using integer instructions, which caused beta 1 path to never be taken. Fixed this by changing the comparison to double. AMD-Internal: [CPUPL-5262] Change-Id: Ieec207eea2a164603c8a8ea88e0b1d3095c29a3f	2024-07-09 07:53:27 -04:00
vignbala	236d092656	AVX512 optimizations for ZGEMM to handle k = 1 cases - Implemented bli_zgemm_16x4_avx512_k1_nn( ... ) AVX512 kernel to be used as part of BLAS/CBLAS calls to ZGEMM. The kernel is built for handling the GEMM computation with inputs having k = 1, with the transpose values being N(for column-major) and T(for row-major). - Updated the zgemm_blis_impl( ... ) layer to query the architecture ID and invoke the AVX2 or AVX512 kernel accordingly. - Added API level tests for accuracy and code-coverage, as well as micro-kernel tests for verifying functionality and out-of-bounds memory accesses. AMD-Internal: [CPUPL-5249] Change-Id: Id1f8bebff3e0da83c7febe86299564fd658b2e84	2024-07-09 07:07:24 -04:00
Hari Govind S	627bf0b1ba	Implemented Multithreading and Enabled AVX512 Kernel for ZAXPY API - Replaced 'bli_zaxpyv_zen_int5' kernel with optimised 'bli_zaxpyv_zen_int_avx512' kernel for zen4 and zen5 config. - Implemented multithreading support and AOCL-dynamic for ZAXPY API. - Utilized 'bli_thread_range_sub' function to achieve better work distribution and avoid false sharing. AMD-Internal: [CPUPL-5250] Change-Id: I46ad8f01f9d639e0baa78f4475d6e86458d8069b	2024-07-09 01:29:53 -04:00
Edward Smyth	2ee46a3a3a	Merge commit 'cfa3db3f' into amd-main * commit 'cfa3db3f': Fixed bug in mixed-dt gemm introduced in `e9da642`. Removed support for 3m, 4m induced methods. Updated do_sde.sh to get SDE from GitHub. Disable SDE testing of old AMD microarchitectures. Fixed substitution bug in configure. Allow use of 1m with mixing of row/col-pref ukrs. AMD-Internal: [CPUPL-2698] Change-Id: I961f0066243cf26aeb2e174e388b470133cc4a5f	2024-07-08 06:09:11 -04:00
Edward Smyth	8de8dc2961	Merge commit '81e10346' into amd-main * commit '81e10346': Alloc at least 1 elem in pool_t block_ptrs. (#560) Fix insufficient pool-growing logic in bli_pool.c. (#559) Arm SVE C/ZGEMM Fix FMOV 0 Mistake SH Kernel Unused Eigher Arm SVE C/ZGEMM Support beta==0 Arm SVE Config armsve Use ZGEMM/CGEMM Arm SVE: Update Perf. Graph Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0 Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0 A64FX Config Use ZGEMM/CGEMM Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg Arm SVE Add SGEMM 2Vx10 Unindexed Arm SVE ZGEMM Support Gather Load / Scatt. St. Arm SVE Add ZGEMM 2Vx10 Unindexed Arm SVE Add ZGEMM 2Vx7 Unindexed Arm SVE Add ZGEMM 2Vx8 Unindexed Update Travis CI badge Armv8 Trash New Bulk Kernels Enable testing 1m in `make check`. Config ArmSVE Unregister 12xk. Move 12xk to Old Revert __has_include(). Distinguish w/ BLIS_FAMILY_* Register firestorm into arm64 Metaconfig Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo Add test for Apple M1 (firestorm) Firestorm CPUID Dispatcher Armv8 GEMMSUP Edge Cases Require Signed Ints Make error checking level a thread-local variable. Fix data race in testsuite. Update .appveyor.yml Firestorm Block Size Fixes Armv8 Handle beta == 0 for GEMMSUP ??r Case. Move unused ARM SVE kernels to "old" directory. Add an option to control whether or not to use @rpath. Fix $ORIGIN usage on linux. Arm micro-architecture dispatch (#344) Use @path-based install name on MacOS and use relocatable RPATH entries for testsuite inaries. Armv8 Handle beta == 0 for GEMMSUP ?rc Case. Armv8 Fix 6x8 Row-Maj Ukr Apply patch from @xrq-phys. Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs. 
bli_error: more cleanup on the error strings array Arm SVE Exclude SVE-Intrinsic Kernels for GCC 8-9 Arm SVE: Correct PACKM Ker Name: Intrinsic Kers Fix config_name in bli_arch.c Arm Whole GEMMSUP Call Route is Asm/Int Optimized Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref Header Typo Arm: DGEMMSUP ??r(rv) Invoke Edge Size Arm: DGEMMSUP ?rc(rd) Invoke Edge Size Arm: Implement GEMMSUP Fallback Method Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin Added Apple Firestorm (A14/M1) Subconfig Arm64 8x4 Kernel Use Less Regs Armv8-A Supplimentary GEMMSUP Sizes for RD Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm Armv8-A Adjust Types for PACKM Kernels Armv8-A GEMMSUP-RD 6x8m Armv8-A GEMMSUP-RD 6x8n Armv8-A s/d Packing Kernels Fix Typo Armv8-A Introduced s/d Packing Kernels Armv8-A DGEMMSUP 6x8m Kernel Armv8-A DGEMMSUP Adjustments Armv8-A Add More DGEMMSUP Armv8-A Add GEMMSUP 4x8n Kernel Armv8-A Add Part of GEMMSUP 8x4m Kernel Armv8A DGEMM 4x4 Kernel WIP. Slow Armv8-A Add 8x4 Kernel WIP AMD-Internal: [CPUPL-2698] Change-Id: I194ff69356740bb36ca189fd1bf9fef02eec3803	2024-06-25 05:48:46 -04:00
Vignesh Balasubramanian	02da190560	AVX512 optimizations for DNRM2 - Implemented bli_dnorm2fv_unb_var1_avx512( ... ) AVX512 computational kernel for DNRM2 API. - Updated the header to include this kernel signature, as well as the framework layer to use this function in case of ZEN4 and ZEN5 configurations. - Updated the tipping points for ideal thread setting in DNRM2 for ZEN5 micro-architecture. These thresholds are specific to the library's linkage to LLVM's OpenMP or GNU's OpenMP. - Further abstracted the AOCL-DYNAMIC logic to separate functions for ?NRM2 APIs that currently support it(namely, DNRM2 and ZNRM2). - Further updated the ?NRM2 framework to accommodate the necessary changes to invoke the newer AOCL-DYNAMIC functions and the AVX512 kernel, when needed. - Added micro-kernel and memory tests for this kernel in GTestsuite, to validate accuracy and out-of-bounds read and write. AMD-Internal: [CPUPL-5265] Change-Id: I4fc0d0f1e6906bf27d46562ca387c338cc4d2049	2024-06-24 08:50:36 -04:00
Vignesh Balasubramanian	6165001658	Bugfix and optimizations for ?AXPBYV API - Updated the existing code-path for ?AXPBYV to reroute the inputs to the appropriate L1 kernel, based on the alpha and beta value. This is done in order to utilize sensible optimizations with regards to the compute and memory operations. - Updated the typed API interface for ?AXPBYV to include an early exit condition(when n is 0, or when alpha is 0 and beta is 1). Further updated this layer to query the right kernel from context, based on the input values of alpha and beta. - Added the necessary L1 vector kernels(i.e, ?SETV, ?ADDV, ?SCALV, ?SCAL2V and ?COPYV) to be used as part of special case handling in ?AXPBYV. - Moved the early return with negative increments from ?SCAL2V kernels to its typed API interface. - Updated the zen, zen2 and zen3 context to include function pointers for all these vector kernels. - Updated the existing ?AXPBYV vector kernels to handle only the required computation. Additional cleanup was done to these kernels. - Added accuracy and memory tests for AVX2 kernels of ?SETV ?COPYV, ?ADDV, ?SCALV, ?SCAL2V, ?AXPYV and ?AXPBYV APIs - Updated the existing thresholds in ?AXPBYV tests for complex types. This is due to the fact that every complex multiplication involves two mul ops and one add op. Further added test-cases for API level accuracy check, that includes special cases of alpha and beta. - Decomposed the reference call to ?AXPBYV with several other L1 BLAS APIs(in case of the reference not supporting its own ?AXPBYV API). The decomposition is done to match the exact operations that is done in BLIS based on alpha and/or beta values. This ensures that we test for our own compliance. AMD-Internal: [CPUPL-4861] Change-Id: Ia6d48f12f059f52b31c0bef6c75f47fd364952c6	2024-06-20 16:22:07 +05:30
Moripalli Chitra	cb915c241d	Tuning ddotv API - Modifying threading framework for L1 APIs to update only number of threads from runtime env and avoid overhead of reading other ICVs. - Removing bli_arch_set_id_once() from bli_arch_set_id_once() flow as bli_arch_check_id_once() calls it. AMD-Internal: [CPUPL-4877] Change-Id: I87b346825a96d74e746a41530b6d22ae162f19ba	2024-06-18 19:31:17 +05:30
Edward Smyth	ca7ba707e7	AOCL_ENABLE_INSTRUCTIONS improvements Changes to how AOCL_ENABLE_INSTRUCTIONS handles requests for different ISAs (i.e. BLIS sub-configurations): - Add missing SSE and AVX options. These will all choose the generic option in amdzen builds. - For unsupported ISAs (e.g. AVX512 on Milan), select the hardware's default sub-configuration instead of trying to step down through alternative choices. - For invalid options, or options not implemented in the BLIS build (e.g. skx in amdzen build), select the hardware's default sub-configuration instead of aborting. Currently BLIS_ARCH_TYPE behaviour is not affected by these changes. AMD-Internal: [CPUPL-5078] Change-Id: Idbd00d2806b1679889a9249878c51981c8d23b3f	2024-06-17 09:58:30 -04:00
Shubham Sharma.	580282e655	DGEMM optimizations for Turin Classic - Introduced new 8x24 row preferred kernel for zen5. - Kernel supports row/col/gen storage schemes. - Prefetch of current panel of A and C are enabled. - Prefetch of next panel of B is enabled. - Kernel supports negative offsets for A and B matrices. - Cache block tuning is done for zen5 core. AMD-Internal: [CPUPL-5262] Change-Id: I058ea7e1b751c20c516d7b27a1f27cef96ef730f	2024-06-17 05:18:49 -04:00
Arnav Sharma	38f65c28fc	Correction for DSCALV AOCL_DYNAMIC Thresholds - Updated nt_ideal to 12 for n_elem <= 2500000 and nt_ideal to 16 for n_elem <= 4000000 AMD-Internal: [CPUPL-4408] Change-Id: I97c143ab0d9b97e797358af93181c71d948757cc	2024-06-14 11:51:47 +05:30
Vignesh Balasubramanian	947811a429	Bugfix for ?OMATCOPY2 and ?IMATCOPY APIs - Updated the parameter check for leading dimensions in the functions handling transpose case of matrix A. - Updated the logic to perform ?IMATCOPY operation. The new logic uses an auxiliary buffer to copy and scale in place, if and when needed. This is done in order to avoid overwriting any subsequent reads that might follow(specifically in case of having different leading dimensions for reading and writing). - Updated xerbla_() to throw memory allocation failure based on INFO parameter being -10. This value is specific to its use-case in ?IMATCOPY, where it is set to -10. - Updated the Extreme Value Tests(EVT) logger for ?IMATCOPY for uniformity. - Cleaned up the files to follow coding conventions. AMD-Internal: [CPUPL-4862][SWLCSG-2706] Change-Id: I34dfa2bcb66b821315e11f7ab2139c41a79ef780	2024-05-21 11:13:28 +05:30
Mangala V	64d9c96d45	ZGEMMT SUP: AVX512 GEMMT code for Upper variant 1. Enabled AVX512 path for - Upper variant - Different storage schemes for upper and lower variant 2. Modified mask value to handle all fringe cases correctly AMD_Internal: [CPUPL-5091] Change-Id: I4bf8aca24c1b87fff606deb05918b8e6216b729e	2024-05-15 13:08:32 +05:30
Shubham Sharma	b4bc71f3ac	Bug fix IN DAXPYF MT and Code Cleanup - Fixed bug in DAXPYF MT kernel when incx != inca. - Added AOCL Dynamic function for 1f kernels. - Moved all DOTXF and AXPYF kernels into one file. AMD-Internal: [CPUPL-4880] Change-Id: I7d9f44625bc42fad4a9e5b218ecc382efdf22cbe	2024-05-14 06:44:10 -04:00
Shubham Sharma	f4b06547fd	Enabled DGEMMT SUP optimized code for upper variant - Enabled DGEMMT SUP upper kernels in AVX512 code path. - Enabled use of optimized kernels for all the storages supported by optimized kernels. AMD-Internal: [CPUPL-4881] Change-Id: Id4486610dacaabc405fbc35b2588607c6508705e	2024-05-14 05:23:51 -04:00
Edward Smyth	62c886feee	Export some BLIS internal symbols AOCL libFLAME optimizations directly call some internal BLIS symbols. Export them to enable this to work with the BLIS shared library. AMD-Internal: [CPUPL-5044] Change-Id: Icb62dcb51e12d72dde8434593ab17de3c227c93d	2024-05-08 12:51:32 -04:00
Arnav Sharma	cb27fad49c	ZSCALV AVX512 Kernel - Implemented ZSCALV kernel utilizing AVX512 intrinsics. - Gtestsuite: Added ukr tests for the new kernel. AMD-Internal: [CPUPL-5012] Change-Id: I75c7f4448ddd60b0f9afa53936eed37f5f99eeb2	2024-05-08 11:55:13 -04:00
Arnav Sharma	1dbeee4d19	ZDOTV AVX512 Kernel with MT Support - Added AVX512 kernel for ZDOTV. - Multithreaded both ZDOTC and ZDOTU with AOCL_DYNAMIC support. AMD-Internal: [CPUPL-5011] Change-Id: I56df9c07ab3b8df06267a99835b088dcada81bd8	2024-05-08 04:54:05 -04:00
Arnav Sharma	b1d69180f9	Updated DOTV DTL in bla_dot.c - Updated DOTV DTL entry to include conjugate parameter. AMD-Internal: [CPUPL-5059] Change-Id: Id66be02fc06ff2faa18325dffe76559af2c6a5cf	2024-05-08 01:46:17 -04:00
Mangala V	e6cc2a3e22	ZGEMMT SUP Optimizations for AVX512 Existing Design: - GEMM AVX2 kernel performs computation and updates temporary C buffer - Portion of temporary C buffer is copied to output C buffer based on UPLO parameter - For diagonal blocks, using GEMM kernels is not efficient New Design: Implemented in current patch when UPLO='L' - GEMMT kernel used for computation, temporary buffer is not required. - Only required elements are computed using mask load store for all fringe cases - Exception: AVX2 code path is used when storage format is RRC, CRR, CRC - AOCL-Dynamic is added based on dimension - Check for AVX platform is added in SUP interface; it returns to native implementation if hardware does not support AVX platform - SUP ref_var2m is expanded for dcomplex datatype to avoid condition check which exists for double datatype AMD_Internal: [CPUPL-5006] Change-Id: I3e21404b732b8f2df9cbdba394303752fdf36286	2024-05-07 23:00:29 +05:30
Kiran Varaganti	fd61c69778	Fixed bug in omatcopy for when trans="t" Thanks to Zhenyu Zhu ajz34 for pointing out this bug. When trans="t" or "conjugate transpose" in the case of complex data-types the ldb should be greater than or equal to cols. In the bug it was checked against "rows". Fixed this bug. Some minor code formatting is done. [CPUPL-4810][SWLCSG-2706] Change-Id: Ie796d25a361b2ba72eda80e8c5867d6352af901f	2024-05-06 12:57:38 -04:00
Shubham Sharma	be34169001	Fixed Matlab Failure in ZTRSM - In AVX512 ZTRSM kernel, vectorized division code is causing failures in Matlab. - The logic is identical in reference C code and intrinsics code, but intrinsics code is causing failure. - Replaced optimized intrinsics code with C code. AMD-Internal: [CPUPL-5052] Change-Id: Iea184330b22c46d979867b870486066ef980eb84	2024-05-06 06:56:45 -04:00
Shubham Sharma	7553abad8e	Fixed compilation error with AOCC in TRSV - Added a {} around zen4 switch case to avoid AOCC error. - Error is caused because in C a declaration is not a statement and therefore cannot be labeled, hence the compiler is not able to create a label for the jump. AMD-Internal: [CPUPL-4880] Change-Id: Icfeedafd80bf9a955e430ca967b6a93dcbbf075e	2024-05-03 21:08:38 +05:30
Shubham Sharma	b70347d0d4	DGEMMT SUP Optimizations for AVX512 - In DGEMMT SUP AVX2 code path, triangular kernels are added in order to avoid temporary C buffer. - Since these kernels did not exist for AVX512, AVX2 kernels were being used in GEMMT. - AVX512 triangular GEMM kernel has been added to make sure that AVX512 kernels can be used without creating a temporary buffer. - This kernel is added only for Lower variant of GEMMT; for upper variant of DGEMMT, temporary C buffer is created, full GEMM kernel is called on temporary C and triangular region from temporary C is copied to C buffer. AMD-Internal: [CPUPL-4881] Change-Id: Id70645f79ae078ab9a7006e83d328505f1fae8a9	2024-05-03 05:11:11 -04:00
Shubham Sharma	b9e21e8701	Added ZTRSM AVX512 small code path - Kernel dimensions are 4x4. - Two kernels are implemented, Right Upper and Right lower. - In case of Left variants of TRSM, transpose is induced so that Right variant kernels can be used. - No packing is performed in these kernels. - Changes are made in the threshold to pick ZTRSM small code path. - BLIS_INLINE is removed from signature of "TRSMSMALL_KER_PROT". - These kernels do not support "ENABLE_TRSM_PREINVERSION". - Newly added kernels do not support conjugate transpose. - Added multithreading to ZTRSM small code path. AMD-Internal: [CPUPL-4324] Change-Id: I683b1d5239593e54f433e7f27497d72dfbd9141c	2024-05-03 05:10:41 -04:00
Shubham Sharma	1d983e6124	Added AVX512 kernels for DAXPYF and DDOTXF - Added DAXPYF and DDOTXF AVX512 kernels. - Fuse factor for ddotxf kernel is 8. - 2 DAXPYF kernels are added, with fuse factor 8 and 32. - Multithreading is also added to the DAXPYf kernel with fuse factor 32. - These kernels are internally used by TRSM. - Added changes in TRSV to call these kernels in ZEN4 AMD-Internal: [CPUPL-4880] Change-Id: I12850de974b437bbca07677b68bc3d6a35858770	2024-05-03 05:10:22 -04:00
Vignesh Balasubramanian	4e2966f9b0	AVX512 optimizations for ZGEMV API with transpose case - Implemented AVX512 kernels for handling the calls to ZGEMV with transpose to A matrix. - This includes the set of ZDOTXF and ZDOTXV kernels. ZDOTXF kernels include those with fuse-factor 8 (main kernel), 4 and 2(fringe kernels). - Updated the bli_zgemv_unf_var1( ... ) function to update the function pointers to these kernels, based on the configuration. AMD-Internal: [CPUPL-4974] Change-Id: I313ae0abe9dc119de849da42f9825b71f11b1fda	2024-05-03 04:38:52 -04:00
Vignesh Balasubramanian	53cb83d0cc	AVX512 optimizations for ZGEMV API with no-transpose case - Implemented AVX512 kernels for handling the calls to ZGEMV with no-transpose to A matrix. - This includes the ZAXPYF, ZAXPYV and ZSETV kernels. The set of ZAXPYF kernels include those with fuse-factor 8 (main kernel), 4 and 2(fringe kernels). - Updated the bli_zgemv_unf_var2( ... ) function to set the function pointers to these kernels, based on the configuration. Further added the call to ZSETV at this layer in case beta is 0. AMD-Internal: [CPUPL-4974] Change-Id: Iee4b724719e49023138bb16479765be44d677cd9	2024-05-03 07:04:47 +00:00
Hari Govind	9c26de1a18	Optimisation of COPYV APIs - Implemented AVX512 kernels for scopyv_, dcopyv_ and zcopyv_ using respective AVX512 intrinsics including masked load and store operations. - Implemented AVX512 kernels for scopy_, dcopy_ and zcopy_ using assembly language to prevent loss of performance during the translation of intrinsics. - Updated the dcopy_blis_impl( ... ) and zcopy_blis_impl( ... ) function to support multithreaded calls to the respective computational kernels, if and when the OpenMP support is enabled. - Implemented OpenMP parallelization for dcopyv_ and zcopyv_ APIs, while scopyv_ and ccopyv_ only support single thread. AMD-Internal: [CPUPL-4854] Change-Id: I5fbd0bcca4e59001fbe2b1168b624d0c33242b3e	2024-05-01 00:23:01 +05:30
srigovin	2c838dadfb	Updated return type of xerbla and xerbla_array APIs to void Return type of xerbla and xerbla_array APIs are defined as int in BLIS, but according to netlib it should be void. Updated the definition and declaration accordingly. Signed-off-by: Sridhar Govindaswamy <Sridhar.Govindaswamy@amd.com> Change-Id: I3072ba76111189de5c5cf08df83ea154163dd34d	2024-04-29 00:51:10 -04:00
Edward Smyth	ccf3910209	BLIS: bli_cpuid.c incorrectly selecting zen5 on zen4 hardware Correct the order of tests in bli_cpuid.c to test all known zen AVX512 platforms before considering fallback tests on AVX512 support. This avoids builds with "configure auto" or "cmake -DBLIS_CONFIG_FAMILY=auto" incorrectly selecting zen5 sub-configuration on zen4 systems. AMD-Internal: [CPUPL-4966] Change-Id: I8706382e2df7c9ae4bb456e3a7f465053e15beea	2024-04-22 03:26:06 -04:00
Shubham Sharma	632c32767b	Avoid alpha scaling in ZTRSV/ZTRSM when alpha = 1 - Scaling vector X is skipped when alpha is 1 in ZTRSV. - Scaling matrix A is skipped when alpha is 1 in ZTRSM. AMD-Internal: [CPUPL-4324] Change-Id: I03c5a454ed1f5be36dac0f121408749bfc9cfc81	2024-04-16 02:24:02 -04:00
Shubham Sharma	ea010c5dc2	Improve perf of bli_obj_equals for 1x1 matrices - Comparison using bli_eqsc is slower than direct comparison. - Changed comparison logic for 1x1 matrix from bli_eqsc to direct comparison. AMD-Internal: [CPUPL-4324] Change-Id: Ifb2d0ad7a97c8bf33b66d624a7ecc53e38c1c803	2024-04-16 00:43:28 -04:00
Edward Smyth	2450a1813b	BLIS: Implement zen5 sub-configuration Implement full support for zen5 as a separate BLIS sub-configuration and code path within amdzen configuration family. AMD-Internal: [CPUPL-3518] Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09	2024-04-12 07:26:31 -04:00
Arnav Sharma	f71495a135	Support for DOTC in DOTV Bench and DTL updates - Added support for ?DOTC in bench. - Updated DTL to accept conjx as a parameter: - 'N', i.e., no conjugate for DOTU - 'C', i.e., conjugate for DOTC - Updated DTL calls in the interface with respective values of conjx. AMD-Internal: [CPUPL-4804] Change-Id: I447b19a6273566c6021c1721ce173bac4a59142c	2024-04-04 12:27:53 +05:30
Eleni Vlachopoulou	020b9ff7f0	CMake: Enable builds for both static and shared builds for Linux. - Added BUILD_STATIC_LIBS option which is on by default, only on Linux. - Added TEST_WITH_SHARED option which is off by default, only on Linux. - If only shared or static lib is being built, that's the one that will be used for testing. - If both are being built, TEST_WITH_SHARED determines which library will be used for testing. - Set linux workflows so that they build both static and shared libs, and use linux-static and linux-shared to denote which one should be used for testing. - Set -fPIC for both static and shared builds to fix issues faced when building blis using AOCC 4.0.0 and gtestsuite using gcc 9.4.0. AMD-Internal: [CPUPL-2748] Change-Id: I4227bab97ff31ecddfe218e18499f33b4e4ee63e	2024-03-14 10:32:51 -04:00
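
Commits 6165001658 and cec9fdcc6e in the table above describe rerouting ?AXPBYV (y := alpha*x + beta*y) to simpler L1 kernels depending on alpha and beta, with an early exit when n is 0 or when alpha is 0 and beta is 1. The following is a minimal sketch of what such a dispatch might look like for unit-stride double vectors. The function name `axpby_sketch` is hypothetical and not a BLIS symbol, the plain loops stand in for the real kernels, and the exact mapping of cases to kernels is an assumption based on the kernel names listed in the commit message, not the actual BLIS code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a BLIS symbol): computes y := alpha*x + beta*y
 * for unit-stride vectors, choosing a cheaper path for special alpha/beta.
 * Each branch is labelled with the L1 kernel it is assumed to stand in for. */
void axpby_sketch(size_t n, double alpha, const double *x,
                  double beta, double *y)
{
    size_t i;

    assert(n == 0 || (x != NULL && y != NULL));

    /* Early exit: empty vector, or alpha == 0 and beta == 1 (y unchanged). */
    if (n == 0 || (alpha == 0.0 && beta == 1.0))
        return;

    if (alpha == 0.0) {
        if (beta == 0.0) {
            for (i = 0; i < n; ++i) y[i] = 0.0;            /* SETV   */
        } else {
            for (i = 0; i < n; ++i) y[i] *= beta;          /* SCALV  */
        }
    } else if (beta == 0.0) {
        if (alpha == 1.0) {
            for (i = 0; i < n; ++i) y[i] = x[i];           /* COPYV  */
        } else {
            for (i = 0; i < n; ++i) y[i] = alpha * x[i];   /* SCAL2V */
        }
    } else if (beta == 1.0) {
        if (alpha == 1.0) {
            for (i = 0; i < n; ++i) y[i] += x[i];          /* ADDV   */
        } else {
            for (i = 0; i < n; ++i) y[i] += alpha * x[i];  /* AXPYV  */
        }
    } else {
        /* General case: both scalars take part in the update. */
        for (i = 0; i < n; ++i) y[i] = alpha * x[i] + beta * y[i];
    }
}
```

In this sketch the beta == 0 branches overwrite y without reading it, so they save memory reads of y and skip a multiplication per element compared with the general loop.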

1403 Commits