amd/blis - blis - Public git mirror

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-04-19 23:28:52 +00:00

Author	SHA1	Message	Date
mkadavil	a26c85333a	Int4 B matrix reordering support fixes in LPGEMM. Reordering B matrix of datatype int4 is done as per the pack schema requirements of u8s8s32 kernel. However for fringe cases, the matrix pointer increments need to be halved to account for the half byte size of int4 elements. AMD-Internal: [SWLCSG-2390] Change-Id: I22a04c4c8133db6ae6ca0a4d3e86c11aba1e2cdb	2024-06-26 05:39:45 +05:30
Edward Smyth	8de8dc2961	Merge commit '81e10346' into amd-main * commit '81e10346': Alloc at least 1 elem in pool_t block_ptrs. (#560) Fix insufficient pool-growing logic in bli_pool.c. (#559) Arm SVE C/ZGEMM Fix FMOV 0 Mistake SH Kernel Unused Eigher Arm SVE C/ZGEMM Support beta==0 Arm SVE Config armsve Use ZGEMM/CGEMM Arm SVE: Update Perf. Graph Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0 Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0 A64FX Config Use ZGEMM/CGEMM Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg Arm SVE Add SGEMM 2Vx10 Unindexed Arm SVE ZGEMM Support Gather Load / Scatt. St. Arm SVE Add ZGEMM 2Vx10 Unindexed Arm SVE Add ZGEMM 2Vx7 Unindexed Arm SVE Add ZGEMM 2Vx8 Unindexed Update Travis CI badge Armv8 Trash New Bulk Kernels Enable testing 1m in `make check`. Config ArmSVE Unregister 12xk. Move 12xk to Old Revert __has_include(). Distinguish w/ BLIS_FAMILY_* Register firestorm into arm64 Metaconfig Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo Add test for Apple M1 (firestorm) Firestorm CPUID Dispatcher Armv8 GEMMSUP Edge Cases Require Signed Ints Make error checking level a thread-local variable. Fix data race in testsuite. Update .appveyor.yml Firestorm Block Size Fixes Armv8 Handle beta == 0 for GEMMSUP ??r Case. Move unused ARM SVE kernels to "old" directory. Add an option to control whether or not to use @rpath. Fix $ORIGIN usage on linux. Arm micro-architecture dispatch (#344) Use @path-based install name on MacOS and use relocatable RPATH entries for testsuite inaries. Armv8 Handle beta == 0 for GEMMSUP ?rc Case. Armv8 Fix 6x8 Row-Maj Ukr Apply patch from @xrq-phys. Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs. bli_error: more cleanup on the error strings array Arm SVE Exclude SVE-Intrinsic Kernels for GCC 8-9 Arm SVE: Correct PACKM Ker Name: Intrinsic Kers Fix config_name in bli_arch.c Arm Whole GEMMSUP Call Route is Asm/Int Optimized Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref Header Typo Arm: DGEMMSUP ??r(rv) Invoke Edge Size Arm: DGEMMSUP ?rc(rd) Invoke Edge Size Arm: Implement GEMMSUP Fallback Method Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin Added Apple Firestorm (A14/M1) Subconfig Arm64 8x4 Kernel Use Less Regs Armv8-A Supplimentary GEMMSUP Sizes for RD Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm Armv8-A Adjust Types for PACKM Kernels Armv8-A GEMMSUP-RD 6x8m Armv8-A GEMMSUP-RD 6x8n Armv8-A s/d Packing Kernels Fix Typo Armv8-A Introduced s/d Packing Kernels Armv8-A DGEMMSUP 6x8m Kernel Armv8-A DGEMMSUP Adjustments Armv8-A Add More DGEMMSUP Armv8-A Add GEMMSUP 4x8n Kernel Armv8-A Add Part of GEMMSUP 8x4m Kernel Armv8A DGEMM 4x4 Kernel WIP. Slow Armv8-A Add 8x4 Kernel WIP AMD-Internal: [CPUPL-2698] Change-Id: I194ff69356740bb36ca189fd1bf9fef02eec3803	2024-06-25 05:48:46 -04:00
Edward Smyth	43d36b9f66	AOCL_ENABLE_INSTRUCTIONS improvements 2 Use of AOCL_ENABLE_INSTRUCTIONS in dgemm tiny code path is unnecessary and incorrectly caused AVX512 code to be run on zen4 and later processors when AOCL_ENABLE_INSTRUCTIONS=avx2 or equivalent options was selected. Replace with code to select kernel in a similar way to other dgemm code paths and other APIs. Note that at present AVX2 code is used the smallest matrix sizes on all zen platforms. AMD-Internal: [CPUPL-5078] Change-Id: Ie6b4895461cbbb915d2b48b92fc063f5cd6adb85	2024-06-25 04:57:38 -04:00
Vignesh Balasubramanian	02da190560	AVX512 optimizations for DNRM2 - Implemented bli_dnorm2fv_unb_var1_avx512( ... ) AVX512 computational kernel for DNRM2 API. - Updated the header to include this kernel signature, as well as the framework layer to use this function in case of ZEN4 and ZEN5 configurations. - Updated the tipping points for ideal thread setting in DNRM2 for ZEN5 micro-architecture. These thresholds are specific to the library's linkage to LLVM's OpenMP or GNU's OpenMp. - Further abstracted the AOCL-DYNAMIC logic to separate functions for ?NRM2 APIs that currently support it(namely, DNRM2 and ZNRM2). - Further updated the ?NRM2 framework to accommodate the necessary changes to invoke the newer AOCL-DYNAMIC functions and the AVX512 kernel, when needed. - Added micro-kernel and memory tests for this kernel in GTestsuite, to validate accuracy and out-of-bounds read and write. AMD-Internal: [CPUPL-5265] Change-Id: I4fc0d0f1e6906bf27d46562ca387c338cc4d2049	2024-06-24 08:50:36 -04:00
mkadavil	a5c4a8c7e0	Int4 B matrix reordering support in LPGEMM. Support for reordering B matrix of datatype int4 as per the pack schema requirements of u8s8s32 kernel. Vectorized int4_t -> int8_t conversion implemented via leveraging the vpmultishiftqb instruction. The reordered B matrix will then be used in the u8s8s32o<s32\|s8> api. AMD-Internal: [SWLCSG-2390] Change-Id: I3a8f8aba30cac0c4828a31f1d27fa1b45ea07bba	2024-06-24 07:55:34 -04:00
Meghana Vankadari	c1e063e65c	Fix for offset issue while reading constants from JIT code Details: - For a variable x, Using address of x in an instruction throws exception if the difference between &x and access position is larger than 2 GiB. To solve this issue all variables are stored within the JIT code section and are accessed using relative addressing. - Fixed a bug in B matrix pack function for s8s8s32os32 API. - Fixed a bug in JIT code to apply bias on col-major matrices. AMD-Internal: [SWLCSG-2820] Change-Id: I82f117a0422c794cb9b1a4d65a89d60de4adfd96	2024-06-24 07:14:15 -04:00
Vignesh Balasubramanian	6165001658	Bugfix and optimizations for ?AXPBYV API - Updated the existing code-path for ?AXPBYV to reroute the inputs to the appropriate L1 kernel, based on the alpha and beta value. This is done in order to utilize sensible optimizations with regards to the compute and memory operations. - Updated the typed API interface for ?AXPBYV to include an early exit condition(when n is 0, or when alpha is 0 and beta is 1). Further updated this layer to query the right kernel from context, based on the input values of alpha and beta. - Added the necessary L1 vector kernels(i.e, ?SETV, ?ADDV, ?SCALV, ?SCAL2V and ?COPYV) to be used as part of special case handling in ?AXPBYV. - Moved the early return with negative increments from ?SCAL2V kernels to its typed API interface. - Updated the zen, zen2 and zen3 context to include function pointers for all these vector kernels. - Updated the existing ?AXPBYV vector kernels to handle only the required computation. Additional cleanup was done to these kernels. - Added accuracy and memory tests for AVX2 kernels of ?SETV ?COPYV, ?ADDV, ?SCALV, ?SCAL2V, ?AXPYV and ?AXPBYV APIs - Updated the existing thresholds in ?AXPBYV tests for complex types. This is due to the fact that every complex multiplication involves two mul ops and one add op. Further added test-cases for API level accuracy check, that includes special cases of alpha and beta. - Decomposed the reference call to ?AXPBYV with several other L1 BLAS APIs(in case of the reference not supporting its own ?AXPBYV API). The decomposition is done to match the exact operations that is done in BLIS based on alpha and/or beta values. This ensures that we test for our own compliance. AMD-Internal: [CPUPL-4861] Change-Id: Ia6d48f12f059f52b31c0bef6c75f47fd364952c6	2024-06-20 16:22:07 +05:30
Arnav Sharma	aa3adb8d69	Updated DOTXF and AXPYF Kernels - Updated the fused kernels (DOTXF and AXPYF) to properly handle cases when b_n > fuse_factor. - The fused kernels are expected to invoke respective Level-1 kernels iteratively when b_n > fuse_factor. AMD-Internal: [CPUPL-5246] Change-Id: Ie7a0f4e61ede088663e3491269b3f1398d028095	2024-06-20 04:41:41 -04:00
Mangala V	e9124ffca7	BUGFIX: Updated ZGEMM microkernel to handle alpha = 0 case BUG: When alpha real and imaginary is zero Output is computed as C= Beta * C + A * B instead of C = Beta * C FIX: Updated kernel to scale A * B product with alpha in case of alpha=0 Existing framework design: - When alpha real and imaginary value is zero, framework handles to skip kernel call to avoid alpha * A * B operation - SCALM is invoked to perform Beta * C - Accuracy issue was not observed as alpha=0 was handled in framework - If we call kernel directly with alpha=0, results would be wrong - Issue was figured out during microkernel testing using gtestsuite AMD-Internal: [CPUPL-4454] Change-Id: Ib6113f5226cd7c26a63781cdd20d35660f453803	2024-06-20 02:58:43 -04:00
Shubham Sharma	1d6dd726cd	Fixed Prefetch in Turin DGEMM kernel - Fixed the prefetch of next micro panel of B matrix in 8x24 DGEMM kernel. Change-Id: Id84bb2841abb86bda780062d67266377fda12038	2024-06-20 10:31:08 +05:30
Meghana Vankadari	c9254bd9e9	Implemented LPGEMV(n=1) for AVX2-INT8 variants - When n=1, reorder of B matrix is avoided to efficiently process data. A dot-product based kernel is implemented to perform gemv when n=1. AMD-Internal: [SWLCSG-2354] Change-Id: If5f74651ab11232d0b87d34bd05f65aacaea94f1	2024-06-18 12:09:18 +05:30
Shubham Sharma.	580282e655	DGEMM optimizations for Turin Classic - Introduced new 8x24 row preferred kernel for zen5. - Kernel supports row/col/gen storage schemes. - Prefetch of current panel of A and C are enabled. - Prefetch of next panel of B is enabled. - Kernel supports negative offsets for A and B matrices. - Cache block tuning is done for zen5 core. AMD-Internal: [CPUPL-5262] Change-Id: I058ea7e1b751c20c516d7b27a1f27cef96ef730f	2024-06-17 05:18:49 -04:00
Nallani Bhaskar	1b79f35e6d	Updated store to avoid warning in gcc-10 Description: - _mm512_storeu_epi8 and _mm512_storeu_epi16 intrensic instructions are not available in gcc-10 - Replaced above intrensics _mm512_storeu_si512 Change-Id: I2878780b7acd040ccf45e571d486ff8c2388088c	2024-05-30 22:22:50 +05:30
mkadavil	cd032225ca	BF16 bias support for bf16bf16f32ob16. -As it stands the bf16bf16f32ob16 API expects bias array to be of type float. However actual use case requires the usage of bias array of bf16 type. The bf16 micro-kernels are updated to work with bf16 bias array by upscaling it to float type and then using it in the post-ops workflow. -Corrected register usage in bf16 JIT generator for bf16bf16f32ob16 API when k > KC. AMD-Internal: [SWLCSG-2604] Change-Id: I404e566ff59d1f3730b569eb8bef865cb7a3b4a1	2024-05-23 04:48:20 +05:30
Nallani Bhaskar	29db6eb42b	Added transB in all AVX512 based int8 API's Description: --Added support for tranB in u8s8s32o<s32\|s8> and s8s8s32o<s32\|s8> API's --Updated the bench_lpgemm by adding options to support transpose of B matrix --Updated data_gen_script.py in lpgemm bench according to latest input format. AMD-Internal: [SWLCSG-2582] Change-Id: I4a05cc390ae11440d6ff86da281dbafbeb907048	2024-05-23 03:46:13 +05:30
Edward Smyth	1f60b7c366	Export some BLIS internal symbols 2 Export more symbols for BLIS kernels so that AOCL libFLAME optimizations can call them directly. AMD-Internal: [CPUPL-5044] Change-Id: I45392b8a2a14ac2816141521b90b7ddb1216c733	2024-05-15 06:59:56 -04:00
Mangala V	64d9c96d45	ZGEMMT SUP: AVX512 GEMMT code for Upper variant 1. Enabled AVX512 path for - Upper variant - Different storage schemes for upper and lower variant 2. Modified mask value to handle all fringe cases correctly AMD_Internal: [CPUPL-5091] Change-Id: I4bf8aca24c1b87fff606deb05918b8e6216b729e	2024-05-15 13:08:32 +05:30
Shubham Sharma	b4bc71f3ac	Bug fix IN DAXPYF MT and Code Cleanup - Fixed bug in DAXPYF MT kernel when incx != inca. - Added AOCL Dynamic function for 1f kernels. - Moved all DOTXF and AXPYF kernels into one file. AMD-Internal: [CPUPL-4880] Change-Id: I7d9f44625bc42fad4a9e5b218ecc382efdf22cbe	2024-05-14 06:44:10 -04:00
Shubham Sharma	f4b06547fd	Enabled DGEMMT SUP optimized code for upper variant - Enabled DGEMMT SUP upper kernels in AVX512 code path. - Enabled use of optimized kernels for all the storages supported by optimized kernels. AMD-Internal: [CPUPL-4881] Change-Id: Id4486610dacaabc405fbc35b2588607c6508705e	2024-05-14 05:23:51 -04:00
Meghana Vankadari	3a8b9270e7	Implemented lpgemv for AVX512-INT8 variants - Implemented optimized lpgemv for both m == 1 and n == 1 cases. - Fixed few bugs in LPGEMV for bf16 and f32 datatypes. - Fixed few bugs in JIT-based implementation of LPGEMM for BF16 datatype. AMD-Internal: [SWLCSG-2354] Change-Id: I245fd97c8f160b148656f782d241f86097a0cf38	2024-05-14 01:55:49 +05:30
Hari Govind S	61d0f3b873	Additional optimisations on COPYV API - Reduced number of jump operations in AVX512 assembly kernel for SCOPYV, DCOPYV and ZCOPYV. - Fixed memory test failure for bli_zcopyv_zen_int_avx512 kernel. - Replaced existing AVX2 COPYV intrinsic kernels in bli_cntx_init_zen5.c with AVX512 assembly kernels. Change-Id: Idc11601b526d6d82cfbdf63af2fd331918b31159	2024-05-10 07:22:04 -04:00
Hari Govind S	92847ae912	Gtestsuite: Memory testing for SCOPYV, DCOPYV and ZCOPYV APIs - Utilized the memory testing feature in GTestsuite to update the testing interfaces for micro-kernel testing of SCOPY, DCOPY and ZCOPY APIs. Change-Id: I3d6905f33b000b8d5e60727aa896bd869f4f441f	2024-05-09 12:10:17 -04:00
Shubham Sharma	f36468a9e9	Enabled vectorized division code in ZTRSM - Existing vectorizes code was disabled because of the failures observed in matlab tests. - The issue is caused by underflow during division when diagonal elements of A matrix are very small. - When diagonal is very small (4E-324 in case of matlab), sqauring the diagonal during divison causes the square to be rounded off to zero. - Fix is to normalise (ar) and (ai) by dividing (ar) and (ai) by max(ar, ai), this will make either (ar) or (ai) 1, and hence reduce the likelihood of underflow. AMD-Internal: [CPUPL-5052] Change-Id: Iff7893fdcb92907a12e6af8e102a92637a13ce4f	2024-05-09 01:35:39 -04:00
Edward Smyth	62c886feee	Export some BLIS internal symbols AOCL libFLAME optimizations directly call some internal BLIS symbols. Export them to enable this to work with the BLIS shared library. AMD-Internal: [CPUPL-5044] Change-Id: Icb62dcb51e12d72dde8434593ab17de3c227c93d	2024-05-08 12:51:32 -04:00
Arnav Sharma	cb27fad49c	ZSCALV AVX512 Kernel - Implemented ZSCALV kernel utilizing AVX512 intrinsics. - Gtestsuite: Added ukr tests for the new kernel. AMD-Internal: [CPUPL-5012] Change-Id: I75c7f4448ddd60b0f9afa53936eed37f5f99eeb2	2024-05-08 11:55:13 -04:00
Arnav Sharma	1dbeee4d19	ZDOTV AVX512 Kernel with MT Support - Added AVX512 kernel for ZDOTV. - Multithreaded both ZDOTC and ZDOTU with AOCL_DYNAMIC support. AMD-Internal: [CPUPL-5011] Change-Id: I56df9c07ab3b8df06267a99835b088dcada81bd8	2024-05-08 04:54:05 -04:00
Mangala V	e6cc2a3e22	ZGEMMT SUP Optimizations for AVX512 Existing Design: - GEMM AVX2 kernel performs computation and updates temporary C buffer - Portion of temporary C buffer is copied to output C buffer based on UPLO parameter - For diagonal blocks, using GEMM kernels is not efficient New Design: Implemented in current patch when UPLO='L' - GEMMT kernel used for computation, temporary buffer is not required. - Only required elements are computed using mask load store for all fringe cases - Exception: AVX2 code path is used when storage format is RRC, CRR, CRC - AOCL-Dynamic is added based on dimension - Check for AVX platform is added in SUP interface, It returns to native implementation if hardware doesnot support AVX platform - SUP ref_var2m is expanded for dcomplex datatype to avoid condition check which exists for double datatype AMD_Internal: [CPUPL-5006] Change-Id: I3e21404b732b8f2df9cbdba394303752fdf36286	2024-05-07 23:00:29 +05:30
Meghana Vankadari	1072770c63	Implemented LPGEMV for bf16 datatype 1. The 5 LOOP LPGEMM path is in-efficient when A or B is a vector (i.e, m == 1 or n == 1). 2. An efficient implementation is developed considering the b matrix reorder in case of m=1 and post-ops fusion. 3. When m = 1 the algorithm divide the GEMM workload in n dimension intelligently at a granularity of NR. Each thread work on A:1xk B:kx(>=NR) and produce C=1x(>NR). K is unrolled by 4 along with remainder loop. 4. When n = 1 the algorithm divide the GEMM workload in m dimension intelligently at a granularity of MR. Each thread work on A:(>=MR)xk B:kx1 and produce C = (>=MR)x1. When n=1 reordering of B is avoided to efficiently process in n one kernel. AMD-Internal: [SWLCSG-2355] Change-Id: I7497dad4c293587cbc171a5998b9f2817a4db880	2024-05-06 23:55:15 +05:30
Shubham Sharma	be34169001	Fixed Matlab Failure in ZTRSM - In AVX512 ZTRSM kernel, vertorizes division code is causing failures in matlab. - The logic is identical in reference C code and intrinsics code, but intrinsics code is causing failure - Replaced optimized intrinsics code with C code. AMD-Internal: [CPUPL-5052] Change-Id: Iea184330b22c46d979867b870486066ef980eb84	2024-05-06 06:56:45 -04:00
mkadavil	118e955a22	SWISH post-op support for all LPGEMM APIs. SWISH post-op computes swish(x) = x / (1 + exp(-1 * alpha * x)). SiLU = SWISH with alpha = 1. AMD-Internal: [SWLCSG-2387] Change-Id: I55f50c74a8583a515f7ea58fa0878ccbcdd6cc26	2024-05-06 06:05:11 -04:00
vignbala	f8218bb9f2	Compiler warnings when using masked loads - Updated the AVX512 DOTXF kernels to use MASKZ loads instead of MASK loads when loading X vector in fringe case. This avoids compiler warnings of uninitialized vector as input to the intrinsic. - The functionality will not change when using either MASK or MASKZ loads on X, since A matrix is loaded using MASKZ loads. AMD-Internal: [CPUPL-4974] Change-Id: I1ef98a1292352d0e905cc09cd5667acd883df827	2024-05-03 09:53:36 -04:00
Shubham Sharma	b70347d0d4	DGEMMT SUP Optimizations for AVX512 - In DGEMMT SUP AVX2 code path, traingular kernels are added in order to avoid temporary C buffer. - Since these kernels did not exist for AVX512, AVX2 kernels were being used in GEMMT. - AVX512 triangular GEMM kernel has been added to make sure that AVX512 kernels can be used without creating a temporary buffer. - This kernel is added only for Lower variant of GEMMT, for upper variant of DGEMMT, temporary C buffer is created, full GEMM kernel is called on temporary C and traingular region from temporary C is copied to C buffer. AMD-Internal: [CPUPL-4881] Change-Id: Id70645f79ae078ab9a7006e83d328505f1fae8a9	2024-05-03 05:11:11 -04:00
Shubham Sharma	b9e21e8701	Added ZTRSM AVX512 small code path - Kernel dimensions are 4x4. - Two kernels are implemented, Right Upper and Right lower. - In case of Left variants of TRSM, transpose is induced so that Right variant kernels can be used. - No packing is performed in these kernels. - Changes are made in the threshold to pick ZTRSM small code path. - BLIS_INLINE is removed from signature of "TRSMSMALL_KER_PROT". - These kernels do not support "ENABLE_TRSM_PREINVERSION". - Newly added kernels do not support conjugate transpose. - Added multithreading to ZTRSM small code path. AMD-Internal: [CPUPL-4324] Change-Id: I683b1d5239593e54f433e7f27497d72dfbd9141c	2024-05-03 05:10:41 -04:00
Shubham Sharma	1d983e6124	Added AVX512 kernels for DAXPYF and DDOTXF - Added DAXPYF and DDOTXF AVX512 kernels. - Fuse factor for ddotxf kernel is 8. - 2 DAXPYF kernels are added, with fuse factor 8 and 32. - Multithreading is also added to the DAXPYf kernel with fuse factor 32. - These kernels are internally used by TRSM. - Added changes in TRSV to call these kernels in ZEN4 AMD-Internal: [CPUPL-4880] Change-Id: I12850de974b437bbca07677b68bc3d6a35858770	2024-05-03 05:10:22 -04:00
Vignesh Balasubramanian	4e2966f9b0	AVX512 optimizations for ZGEMV API with transpose case - Implemented AVX512 kernels for handling the calls to ZGEMV with transpose to A matrix. - This includes the set of ZDOTXF and ZDOTXV kernels. ZDOTXF kernels include those with fuse-factor 8 (main kernel), 4 and 2(fringe kernels). - Updated the bli_zgemv_unf_var1( ... ) function to update the function pointers to these kernels, based on the configuration. AMD-Internal: [CPUPL-4974] Change-Id: I313ae0abe9dc119de849da42f9825b71f11b1fda	2024-05-03 04:38:52 -04:00
Vignesh Balasubramanian	53cb83d0cc	AVX512 optimizations for ZGEMV API with no-transpose case - Implemented AVX512 kernels for handling the calls to ZGEMV with no-transpose to A matrix. - This includes the ZAXPYF, ZAXPYV and ZSETV kernels. The set of ZAXPYF kernels include those with fuse-factor 8 (main kernel), 4 and 2(fringe kernels). - Updated the bli_zgemv_unf_var2( ... ) function to set the function pointers to these kernels, based on the configuration. Further added the call to ZSETV at this layer in case beta is 0. AMD-Internal: [CPUPL-4974] Change-Id: Iee4b724719e49023138bb16479765be44d677cd9	2024-05-03 07:04:47 +00:00
Hari Govind	9c26de1a18	Optimisiation COPYV APIs - Implemented AVX512 kernels for scopyv_, dcopyv_ and zcopyv_ using respective AVX512 intrinsics including masked load and store operations. - Implemented AVX512 kernels for scopy_, dcopy_ and zcopy_ using assembly language to prevent loss of performance during the translation of intrinsics. - Updated the dcopy_blis_impl( ... ) and zcopy_blis_impl( ... ) function to support multithreaded calls to the respective computational kernels, if and when the OpenMP support is enabled. - Implemented OpenMP parallelization for dcopyv_ and zcopyv_ APIs, while scopyv_ and ccopyv_ only support single thread. AMD-Internal: [CPUPL-4854] Change-Id: I5fbd0bcca4e59001fbe2b1168b624d0c33242b3e	2024-05-01 00:23:01 +05:30
Meghana Vankadari	ceee4b7818	Fix in DGEMMSUP for cases where C matrix is row-major. Details: - variable m0 is being loaded into a register without typecasting it to uint64_t. This resulted in seg-fault when int size is set to be 32 bits during configure time. - Any variable that is loaded using mov in assembly needs to be typecasted to uint64_t before begin_asm, so that change in size of integer doesn't affect the functionality. - Modified all instances using variable m0 to use variable 'm' where m = (uint64_t)m0; AMD-Internal: [CPUPL-4971] Change-Id: I49b66d2cacf19ace40ab44c9f85904644e8921f4	2024-04-25 13:07:23 -04:00
Shubham Sharma	14bab0eb17	Fixed out of bounds read in CTRSM small kernel - In 2x1 fringe case in [RUN/RLT] kernel, 3 scomplex precision numbers are being read instead of 1 scomplex. - Fixed the code to read only one scomplex. AMD-Internal: [CPUPL-4403] Change-Id: If3ac03ed864618382d3a382a8cdff7ff8a94eb7d	2024-04-16 02:42:34 -04:00
Edward Smyth	2450a1813b	BLIS: Implement zen5 sub-configuration Implement full support for zen5 as a separate BLIS sub-configuration and code path within amdzen configuration family. AMD-Internal: [CPUPL-3518] Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09	2024-04-12 07:26:31 -04:00
Nallani Bhaskar	5070343318	Fixed load intrinsic in aocl-gemm addon f32 api Description: 1. Replaced aligned load intrinsics _mm512_load_ps with unaligned load intrinsics _mm512_loadu_ps. 2. There is no guarantee that the memory address can be aligned everywhere. The changes are under beta multiplication. Copy paste error. Change-Id: I978231b556e17ad7e66c5028ed1cd904c653e0a8	2024-03-20 06:24:32 -04:00
Eleni Vlachopoulou	020b9ff7f0	CMake: Enable builds for both static and shared builds for Linux. - Added BUILD_STATIC_LIBS option which is on by default, only on Linux. - Added TEST_WITH_SHARED option which is off by default, only on Linux. - If only shared or static lib is being built, that's the one that will be used for testing. - If both are being built, TEST_WITH_SHARED determins which library wil be used for testing. - Set linux workflows so that they build both static and shared libs, and use linux-static and linux-shared to denote which one should be used for testing. - Set -fPIC for both static and shared builds to fix issues faced when building blis using AOCC 4.0.0 and gtestsuite using gcc 9.4.0. AMD-Internal: [CPUPL-2748] Change-Id: I4227bab97ff31ecddfe218e18499f33b4e4ee63e	2024-03-14 10:32:51 -04:00
Meghana Vankadari	da8fd8c301	Implemented JIT-based microkernel for bf16 datatype Details: - Added new folder named JIT/ under addon/aocl_gemm/. This folder will contain all the JIT related code. - Modified lpgemm_cntx_init code to generate main and fringe kernels for 6x64 bf16 microkernel and store function pointers to all the generated kernels in a global function pointer array. This happens only when gcc version is < 11.2 - When gcc version < 11.2, microkernel uses JIT-generated kernels. otherwise, microkernel uses the intrinsics based implementation. AMD-Internal: [SWLCSG-2622] Change-Id: I16256c797b2546a8cd2049680001947346260461	2024-03-13 05:55:18 +05:30
Nallani Bhaskar	799a456abc	Fixed corner case issue in aocl_gemm addon Description 1. when mr0=1 case the accumulator register and operand registers for an fma instruction got swapped. Corrected the copy paste error. 2. Removed fill array for c_ref in bench_lpgemm.c and used memcpy from c buf, because fill array now using rand() function to initialize data which can be different when c_ref and c called separately, this was working because data was fixed (i=0 ... i%5). Change-Id: Ia513331ba49d28adc7bcdc0ec78d443abe66780b	2024-03-08 04:10:19 -05:00
Bhaskar Nallani	2ce47e6f5e	Implemented optimal AVX512-variant of f32 LPGEMV 1. The 5 LOOP LPGEMM path is in-efficient when A or B is a vector (i.e, m == 1 or n == 1). 2. An efficient implementation of lpgemv_rowvar_f32 is developed considering the b matrix reorder in case of m=1 and post-ops fusion. 3. When m = 1 the algorithm divide the GEMM workload in n dimension intelligently at a granularity of NR. Each thread work on A:1xk B:kx(>=NR) and produce C=1x(>NR). K is unrolled by 4 along with remainder loop. 4. When n = 1 the algorithm divide the GEMM workload in m dimension intelligently at a granularity of MR. Each thread work on A:(>=MR)xk B:kx1 and produce C = (>=MR)x1. When n=1 reordering of B is avoided to efficiently process in n one kernel. 5. Fixed few warnings while loading 2 f32 bias elements using _mm_load_sd using float pointer. Typecasted to (const double *) AMD-Internal: [SWLCSG-2391, SWLCSG-2353] Change-Id: If1d0b8d59e0278f5f16b499de1d629e63da5b599	2024-03-04 23:53:23 +05:30
mkadavil	d00e84ced3	Matrix Add post-operation support for float(bf16\|f32) LPGEMM APIs. -This post-operation computes C = (betaC + alphaA*B) + D, where D is a matrix with dimensions and data type the same as that of C matrix. AMD-Internal: [SWLCSG-2424] Change-Id: I9464d1f514e3b04275fe93441489b4503a08937a	2024-02-23 02:02:33 -05:00
mkadavil	01b7f8c945	Matrix Add post-operation support for integer(s16\|s32) LPGEMM APIs. -This post-operation computes C = (betaC + alphaA*B) + D, where D is a matrix with dimensions and data type the same as that of C matrix. -For clang compilers (including aocc), -march=znver1 is not enabled for zen kernels. Have updated CKVECFLAGS to capture the same. AMD-Internal: [SWLCSG-2424] Change-Id: Ie369f7ea5c80ab69eea3f3e03a8d9546e14f5c09	2024-02-12 23:51:36 +05:30
Shubham Sharma	d5cd5836b1	Fixed DGEMM 8x24 kernel for beta zero - Column stride is not taken into consideration in current implementation when writing to C buffer if beta is zero and C is column major stored. - Fixed C storage in case of column major stored C when beta is zero in 8x24 DGEMM kernel. AMD-Internal: [CPUPL-4404] Change-Id: I5b8dfce962995e3238cf902b5a09dd1bf90002a8	2024-02-05 06:57:06 -05:00
Shubham Sharma	fc91932b4a	Fixed out of bounds read in DTRSM small kernels - In 3x1 fringe case in [RLN/RUT] kernel, 4 double precision floats are being read instead of 3 doubles. - Fixed the code to read only 3 double. AMD-Internal: [CPUPL-4403] Change-Id: If0afb155efefabe13487cf322d479981f1838aa2	2024-02-02 10:31:12 +05:30
eashdash	ef134dc49f	Added Trans A feature for all INT8 LPGEMM APIs 1. Added Trans A feature to handle column major inputs for A matrix. 2. Trans A is enabled by on-the-go pack of A matrix. 3. The on-the-go pack of A converts a column storage MCxKC block of A into row storage MCxKC block as LPGEMM kernels are row major kernels. 4. New pack routines are added for conversion of A matrix from column major storage to row major storage. 5. LPGEMM Cntx is updated with pack kernel function pointers. 6. Packing of A matrix: - Converts column major input A to row major in blocks of MCxKC with newly added pack A functions when cs_a > 1. 7. Pack routines are added for AVX512 and AVX2 INT8 LPGEMM APIs. 8. Trans A feature is now supported in: 1. u8s8s32os32/os8 2. u8s8s16os16/os8/ou8 3. s8s8s32os32/os8 4. s8s8s16os16/os8 AMD-Internal: SWLCSG-2582 Change-Id: I7ce331545525a9a09f3853280615b55fcf2edabf	2024-01-30 03:40:56 -05:00

... 2 3 4 5 6 ...

892 Commits