amd/blis - blis - Public git mirror

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-05-25 10:54:33 +00:00

Author	SHA1	Message	Date
Hari Govind S	61d0f3b873	Additional optimisations on COPYV API - Reduced number of jump operations in AVX512 assembly kernel for SCOPYV, DCOPYV and ZCOPYV. - Fixed memory test failure for bli_zcopyv_zen_int_avx512 kernel. - Replaced existing AVX2 COPYV intrinsic kernels in bli_cntx_init_zen5.c with AVX512 assembly kernels. Change-Id: Idc11601b526d6d82cfbdf63af2fd331918b31159	2024-05-10 07:22:04 -04:00
Hari Govind S	92847ae912	Gtestsuite: Memory testing for SCOPYV, DCOPYV and ZCOPYV APIs - Utilized the memory testing feature in GTestsuite to update the testing interfaces for micro-kernel testing of SCOPY, DCOPY and ZCOPY APIs. Change-Id: I3d6905f33b000b8d5e60727aa896bd869f4f441f	2024-05-09 12:10:17 -04:00
Shubham Sharma	f36468a9e9	Enabled vectorized division code in ZTRSM - Existing vectorizes code was disabled because of the failures observed in matlab tests. - The issue is caused by underflow during division when diagonal elements of A matrix are very small. - When diagonal is very small (4E-324 in case of matlab), sqauring the diagonal during divison causes the square to be rounded off to zero. - Fix is to normalise (ar) and (ai) by dividing (ar) and (ai) by max(ar, ai), this will make either (ar) or (ai) 1, and hence reduce the likelihood of underflow. AMD-Internal: [CPUPL-5052] Change-Id: Iff7893fdcb92907a12e6af8e102a92637a13ce4f	2024-05-09 01:35:39 -04:00
Edward Smyth	62c886feee	Export some BLIS internal symbols AOCL libFLAME optimizations directly call some internal BLIS symbols. Export them to enable this to work with the BLIS shared library. AMD-Internal: [CPUPL-5044] Change-Id: Icb62dcb51e12d72dde8434593ab17de3c227c93d	2024-05-08 12:51:32 -04:00
Arnav Sharma	cb27fad49c	ZSCALV AVX512 Kernel - Implemented ZSCALV kernel utilizing AVX512 intrinsics. - Gtestsuite: Added ukr tests for the new kernel. AMD-Internal: [CPUPL-5012] Change-Id: I75c7f4448ddd60b0f9afa53936eed37f5f99eeb2	2024-05-08 11:55:13 -04:00
Arnav Sharma	1dbeee4d19	ZDOTV AVX512 Kernel with MT Support - Added AVX512 kernel for ZDOTV. - Multithreaded both ZDOTC and ZDOTU with AOCL_DYNAMIC support. AMD-Internal: [CPUPL-5011] Change-Id: I56df9c07ab3b8df06267a99835b088dcada81bd8	2024-05-08 04:54:05 -04:00
Mangala V	e6cc2a3e22	ZGEMMT SUP Optimizations for AVX512 Existing Design: - GEMM AVX2 kernel performs computation and updates temporary C buffer - Portion of temporary C buffer is copied to output C buffer based on UPLO parameter - For diagonal blocks, using GEMM kernels is not efficient New Design: Implemented in current patch when UPLO='L' - GEMMT kernel used for computation, temporary buffer is not required. - Only required elements are computed using mask load store for all fringe cases - Exception: AVX2 code path is used when storage format is RRC, CRR, CRC - AOCL-Dynamic is added based on dimension - Check for AVX platform is added in SUP interface, It returns to native implementation if hardware doesnot support AVX platform - SUP ref_var2m is expanded for dcomplex datatype to avoid condition check which exists for double datatype AMD_Internal: [CPUPL-5006] Change-Id: I3e21404b732b8f2df9cbdba394303752fdf36286	2024-05-07 23:00:29 +05:30
Meghana Vankadari	1072770c63	Implemented LPGEMV for bf16 datatype 1. The 5 LOOP LPGEMM path is in-efficient when A or B is a vector (i.e, m == 1 or n == 1). 2. An efficient implementation is developed considering the b matrix reorder in case of m=1 and post-ops fusion. 3. When m = 1 the algorithm divide the GEMM workload in n dimension intelligently at a granularity of NR. Each thread work on A:1xk B:kx(>=NR) and produce C=1x(>NR). K is unrolled by 4 along with remainder loop. 4. When n = 1 the algorithm divide the GEMM workload in m dimension intelligently at a granularity of MR. Each thread work on A:(>=MR)xk B:kx1 and produce C = (>=MR)x1. When n=1 reordering of B is avoided to efficiently process in n one kernel. AMD-Internal: [SWLCSG-2355] Change-Id: I7497dad4c293587cbc171a5998b9f2817a4db880	2024-05-06 23:55:15 +05:30
Shubham Sharma	be34169001	Fixed Matlab Failure in ZTRSM - In AVX512 ZTRSM kernel, vertorizes division code is causing failures in matlab. - The logic is identical in reference C code and intrinsics code, but intrinsics code is causing failure - Replaced optimized intrinsics code with C code. AMD-Internal: [CPUPL-5052] Change-Id: Iea184330b22c46d979867b870486066ef980eb84	2024-05-06 06:56:45 -04:00
mkadavil	118e955a22	SWISH post-op support for all LPGEMM APIs. SWISH post-op computes swish(x) = x / (1 + exp(-1 * alpha * x)). SiLU = SWISH with alpha = 1. AMD-Internal: [SWLCSG-2387] Change-Id: I55f50c74a8583a515f7ea58fa0878ccbcdd6cc26	2024-05-06 06:05:11 -04:00
vignbala	f8218bb9f2	Compiler warnings when using masked loads - Updated the AVX512 DOTXF kernels to use MASKZ loads instead of MASK loads when loading X vector in fringe case. This avoids compiler warnings of uninitialized vector as input to the intrinsic. - The functionality will not change when using either MASK or MASKZ loads on X, since A matrix is loaded using MASKZ loads. AMD-Internal: [CPUPL-4974] Change-Id: I1ef98a1292352d0e905cc09cd5667acd883df827	2024-05-03 09:53:36 -04:00
Shubham Sharma	b70347d0d4	DGEMMT SUP Optimizations for AVX512 - In DGEMMT SUP AVX2 code path, traingular kernels are added in order to avoid temporary C buffer. - Since these kernels did not exist for AVX512, AVX2 kernels were being used in GEMMT. - AVX512 triangular GEMM kernel has been added to make sure that AVX512 kernels can be used without creating a temporary buffer. - This kernel is added only for Lower variant of GEMMT, for upper variant of DGEMMT, temporary C buffer is created, full GEMM kernel is called on temporary C and traingular region from temporary C is copied to C buffer. AMD-Internal: [CPUPL-4881] Change-Id: Id70645f79ae078ab9a7006e83d328505f1fae8a9	2024-05-03 05:11:11 -04:00
Shubham Sharma	b9e21e8701	Added ZTRSM AVX512 small code path - Kernel dimensions are 4x4. - Two kernels are implemented, Right Upper and Right lower. - In case of Left variants of TRSM, transpose is induced so that Right variant kernels can be used. - No packing is performed in these kernels. - Changes are made in the threshold to pick ZTRSM small code path. - BLIS_INLINE is removed from signature of "TRSMSMALL_KER_PROT". - These kernels do not support "ENABLE_TRSM_PREINVERSION". - Newly added kernels do not support conjugate transpose. - Added multithreading to ZTRSM small code path. AMD-Internal: [CPUPL-4324] Change-Id: I683b1d5239593e54f433e7f27497d72dfbd9141c	2024-05-03 05:10:41 -04:00
Shubham Sharma	1d983e6124	Added AVX512 kernels for DAXPYF and DDOTXF - Added DAXPYF and DDOTXF AVX512 kernels. - Fuse factor for ddotxf kernel is 8. - 2 DAXPYF kernels are added, with fuse factor 8 and 32. - Multithreading is also added to the DAXPYf kernel with fuse factor 32. - These kernels are internally used by TRSM. - Added changes in TRSV to call these kernels in ZEN4 AMD-Internal: [CPUPL-4880] Change-Id: I12850de974b437bbca07677b68bc3d6a35858770	2024-05-03 05:10:22 -04:00
Vignesh Balasubramanian	4e2966f9b0	AVX512 optimizations for ZGEMV API with transpose case - Implemented AVX512 kernels for handling the calls to ZGEMV with transpose to A matrix. - This includes the set of ZDOTXF and ZDOTXV kernels. ZDOTXF kernels include those with fuse-factor 8 (main kernel), 4 and 2(fringe kernels). - Updated the bli_zgemv_unf_var1( ... ) function to update the function pointers to these kernels, based on the configuration. AMD-Internal: [CPUPL-4974] Change-Id: I313ae0abe9dc119de849da42f9825b71f11b1fda	2024-05-03 04:38:52 -04:00
Vignesh Balasubramanian	53cb83d0cc	AVX512 optimizations for ZGEMV API with no-transpose case - Implemented AVX512 kernels for handling the calls to ZGEMV with no-transpose to A matrix. - This includes the ZAXPYF, ZAXPYV and ZSETV kernels. The set of ZAXPYF kernels include those with fuse-factor 8 (main kernel), 4 and 2(fringe kernels). - Updated the bli_zgemv_unf_var2( ... ) function to set the function pointers to these kernels, based on the configuration. Further added the call to ZSETV at this layer in case beta is 0. AMD-Internal: [CPUPL-4974] Change-Id: Iee4b724719e49023138bb16479765be44d677cd9	2024-05-03 07:04:47 +00:00
Hari Govind	9c26de1a18	Optimisiation COPYV APIs - Implemented AVX512 kernels for scopyv_, dcopyv_ and zcopyv_ using respective AVX512 intrinsics including masked load and store operations. - Implemented AVX512 kernels for scopy_, dcopy_ and zcopy_ using assembly language to prevent loss of performance during the translation of intrinsics. - Updated the dcopy_blis_impl( ... ) and zcopy_blis_impl( ... ) function to support multithreaded calls to the respective computational kernels, if and when the OpenMP support is enabled. - Implemented OpenMP parallelization for dcopyv_ and zcopyv_ APIs, while scopyv_ and ccopyv_ only support single thread. AMD-Internal: [CPUPL-4854] Change-Id: I5fbd0bcca4e59001fbe2b1168b624d0c33242b3e	2024-05-01 00:23:01 +05:30
Meghana Vankadari	ceee4b7818	Fix in DGEMMSUP for cases where C matrix is row-major. Details: - variable m0 is being loaded into a register without typecasting it to uint64_t. This resulted in seg-fault when int size is set to be 32 bits during configure time. - Any variable that is loaded using mov in assembly needs to be typecasted to uint64_t before begin_asm, so that change in size of integer doesn't affect the functionality. - Modified all instances using variable m0 to use variable 'm' where m = (uint64_t)m0; AMD-Internal: [CPUPL-4971] Change-Id: I49b66d2cacf19ace40ab44c9f85904644e8921f4	2024-04-25 13:07:23 -04:00
Shubham Sharma	14bab0eb17	Fixed out of bounds read in CTRSM small kernel - In 2x1 fringe case in [RUN/RLT] kernel, 3 scomplex precision numbers are being read instead of 1 scomplex. - Fixed the code to read only one scomplex. AMD-Internal: [CPUPL-4403] Change-Id: If3ac03ed864618382d3a382a8cdff7ff8a94eb7d	2024-04-16 02:42:34 -04:00
Edward Smyth	2450a1813b	BLIS: Implement zen5 sub-configuration Implement full support for zen5 as a separate BLIS sub-configuration and code path within amdzen configuration family. AMD-Internal: [CPUPL-3518] Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09	2024-04-12 07:26:31 -04:00
Nallani Bhaskar	5070343318	Fixed load intrinsic in aocl-gemm addon f32 api Description: 1. Replaced aligned load intrinsics _mm512_load_ps with unaligned load intrinsics _mm512_loadu_ps. 2. There is no guarantee that the memory address can be aligned everywhere. The changes are under beta multiplication. Copy paste error. Change-Id: I978231b556e17ad7e66c5028ed1cd904c653e0a8	2024-03-20 06:24:32 -04:00
Eleni Vlachopoulou	020b9ff7f0	CMake: Enable builds for both static and shared builds for Linux. - Added BUILD_STATIC_LIBS option which is on by default, only on Linux. - Added TEST_WITH_SHARED option which is off by default, only on Linux. - If only shared or static lib is being built, that's the one that will be used for testing. - If both are being built, TEST_WITH_SHARED determins which library wil be used for testing. - Set linux workflows so that they build both static and shared libs, and use linux-static and linux-shared to denote which one should be used for testing. - Set -fPIC for both static and shared builds to fix issues faced when building blis using AOCC 4.0.0 and gtestsuite using gcc 9.4.0. AMD-Internal: [CPUPL-2748] Change-Id: I4227bab97ff31ecddfe218e18499f33b4e4ee63e	2024-03-14 10:32:51 -04:00
Meghana Vankadari	da8fd8c301	Implemented JIT-based microkernel for bf16 datatype Details: - Added new folder named JIT/ under addon/aocl_gemm/. This folder will contain all the JIT related code. - Modified lpgemm_cntx_init code to generate main and fringe kernels for 6x64 bf16 microkernel and store function pointers to all the generated kernels in a global function pointer array. This happens only when gcc version is < 11.2 - When gcc version < 11.2, microkernel uses JIT-generated kernels. otherwise, microkernel uses the intrinsics based implementation. AMD-Internal: [SWLCSG-2622] Change-Id: I16256c797b2546a8cd2049680001947346260461	2024-03-13 05:55:18 +05:30
Nallani Bhaskar	799a456abc	Fixed corner case issue in aocl_gemm addon Description 1. when mr0=1 case the accumulator register and operand registers for an fma instruction got swapped. Corrected the copy paste error. 2. Removed fill array for c_ref in bench_lpgemm.c and used memcpy from c buf, because fill array now using rand() function to initialize data which can be different when c_ref and c called separately, this was working because data was fixed (i=0 ... i%5). Change-Id: Ia513331ba49d28adc7bcdc0ec78d443abe66780b	2024-03-08 04:10:19 -05:00
Bhaskar Nallani	2ce47e6f5e	Implemented optimal AVX512-variant of f32 LPGEMV 1. The 5 LOOP LPGEMM path is in-efficient when A or B is a vector (i.e, m == 1 or n == 1). 2. An efficient implementation of lpgemv_rowvar_f32 is developed considering the b matrix reorder in case of m=1 and post-ops fusion. 3. When m = 1 the algorithm divide the GEMM workload in n dimension intelligently at a granularity of NR. Each thread work on A:1xk B:kx(>=NR) and produce C=1x(>NR). K is unrolled by 4 along with remainder loop. 4. When n = 1 the algorithm divide the GEMM workload in m dimension intelligently at a granularity of MR. Each thread work on A:(>=MR)xk B:kx1 and produce C = (>=MR)x1. When n=1 reordering of B is avoided to efficiently process in n one kernel. 5. Fixed few warnings while loading 2 f32 bias elements using _mm_load_sd using float pointer. Typecasted to (const double *) AMD-Internal: [SWLCSG-2391, SWLCSG-2353] Change-Id: If1d0b8d59e0278f5f16b499de1d629e63da5b599	2024-03-04 23:53:23 +05:30
mkadavil	d00e84ced3	Matrix Add post-operation support for float(bf16\|f32) LPGEMM APIs. -This post-operation computes C = (betaC + alphaA*B) + D, where D is a matrix with dimensions and data type the same as that of C matrix. AMD-Internal: [SWLCSG-2424] Change-Id: I9464d1f514e3b04275fe93441489b4503a08937a	2024-02-23 02:02:33 -05:00
mkadavil	01b7f8c945	Matrix Add post-operation support for integer(s16\|s32) LPGEMM APIs. -This post-operation computes C = (betaC + alphaA*B) + D, where D is a matrix with dimensions and data type the same as that of C matrix. -For clang compilers (including aocc), -march=znver1 is not enabled for zen kernels. Have updated CKVECFLAGS to capture the same. AMD-Internal: [SWLCSG-2424] Change-Id: Ie369f7ea5c80ab69eea3f3e03a8d9546e14f5c09	2024-02-12 23:51:36 +05:30
Shubham Sharma	d5cd5836b1	Fixed DGEMM 8x24 kernel for beta zero - Column stride is not taken into consideration in current implementation when writing to C buffer if beta is zero and C is column major stored. - Fixed C storage in case of column major stored C when beta is zero in 8x24 DGEMM kernel. AMD-Internal: [CPUPL-4404] Change-Id: I5b8dfce962995e3238cf902b5a09dd1bf90002a8	2024-02-05 06:57:06 -05:00
Shubham Sharma	fc91932b4a	Fixed out of bounds read in DTRSM small kernels - In 3x1 fringe case in [RLN/RUT] kernel, 4 double precision floats are being read instead of 3 doubles. - Fixed the code to read only 3 double. AMD-Internal: [CPUPL-4403] Change-Id: If0afb155efefabe13487cf322d479981f1838aa2	2024-02-02 10:31:12 +05:30
eashdash	ef134dc49f	Added Trans A feature for all INT8 LPGEMM APIs 1. Added Trans A feature to handle column major inputs for A matrix. 2. Trans A is enabled by on-the-go pack of A matrix. 3. The on-the-go pack of A converts a column storage MCxKC block of A into row storage MCxKC block as LPGEMM kernels are row major kernels. 4. New pack routines are added for conversion of A matrix from column major storage to row major storage. 5. LPGEMM Cntx is updated with pack kernel function pointers. 6. Packing of A matrix: - Converts column major input A to row major in blocks of MCxKC with newly added pack A functions when cs_a > 1. 7. Pack routines are added for AVX512 and AVX2 INT8 LPGEMM APIs. 8. Trans A feature is now supported in: 1. u8s8s32os32/os8 2. u8s8s16os16/os8/ou8 3. s8s8s32os32/os8 4. s8s8s16os16/os8 AMD-Internal: SWLCSG-2582 Change-Id: I7ce331545525a9a09f3853280615b55fcf2edabf	2024-01-30 03:40:56 -05:00
mkadavil	864170f5cb	Scalar value support for zero-point and scale-factor. -As it stands, in LPGEMM, users are expected to pass an array of values with length the same as N dimension as inputs for zero point or scale factor. However at times, a single scalar value is used as zero point or scale factor for the entire downscaling operation. The mandate to pass an array requires the user to allocate extra memory and fill it with the scalar value so as to be used in downscaling. This limitation is lifted as part of this commit, and now scalar values can be passed as zero point or scale factor. -LPGEMM bench enhancements along with new input format to improve readability as well as flexibility. AMD-Internal: [SWLCSG-2581] Change-Id: Ibd0d89f03e1acadd099382dffcabfec324ceb50f	2024-01-12 04:37:35 +05:30
Meghana Vankadari	6567df7b12	bf16bf16f32o<bf16\|f32> Fix for scaling issue when transA is enabled. Details: - LPGEMM uses bli_pba_acquire_m with BLIS_BUFFER_FOR_A_BLOCK to checkout memory when A matrix needs to be packed. This multi-threaded lock overhead becomes prominent when m/n dimensions are relatively small, even when k is large. In order to address this, bli_pba_acquire_m is used with BLIS_BUFFER_FOR_GEN_USE for LPGEMM. For *GEN_USE, the memory is allocated using aligned malloc instead of checking out from memory pool. Experiments have shown malloc costs to be far lower than memory pool guarded by locks, especially for higher thread count. - Deleted few unnecessary instructions from packing kernels. - Replaced bench_input.txt with lesser number of inputs. AMD-Internal: [CPUPL-4329] Change-Id: I5982a0a4df9dc72fab0cffab795c23822d5c8774	2023-12-21 04:53:32 +05:30
mkadavil	2676ac8249	LPGEMM s32 micro-kernel updates to fix gcc10.2 compilation issue. Some AVX512 intrinsics(eg: _mm_loadu_epi8) were introduced in later versions of gcc (11+) in addition to already existing masked intrinsic (eg: _mm_mask_loadu_epi8). In order to support compilation using gcc 10.2, either the masked intrinsic or other gcc 10.2 compatible intrinsic needs to be used (eg: _mm_loadu_si128) in LPGEMM <u\|s>8s8os32 kernels. AMD-Internal: [SWLCSG-2542] Change-Id: I6cfedfdcb28711b19df63d162ab267f5eea8d2ef	2023-11-24 01:58:58 -05:00
Edward Smyth	ed5010d65b	Code cleanup: AMD copyright notice Standardize format of AMD copyright notice. AMD-Internal: [CPUPL-3519] Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0	2023-11-23 08:54:31 -05:00
Edward Smyth	50608f28df	BLIS: Missing clobbers (batch 7) Add missing clobbers in: - bli_gemmsup_rv_haswell kernels - spare copies of kernels in old, other and broken subdirectories - misc kernels for legacy platforms AMD-Internal: [CPUPL-3521] Change-Id: I7cdb7fd1cb29630d8b7fa914b1002a270dfe9ef5	2023-11-22 17:51:46 -05:00
Edward Smyth	f471615c66	Code cleanup: No newline at end of file Some text files were missing a newline at the end of the file. One has been added. AMD-Internal: [CPUPL-3519] Change-Id: I4b00876b1230b036723d6b56755c6ca844a7ffce	2023-11-22 17:11:10 -05:00
mangala v	e0df20806a	Updated prefetching in SGEMM SUP (mask load/store) kernels 1. Prefetch only MR rows or rows required for fringe cases 2. Specify prefetching offset - the least column address supported by masked functions 3. Removed unnecessary prefetches in fringe case for mx4 kernels Updated gtestuite for sgemm calls AMD_Internal: [CPUPL-4221] Change-Id: I1e2e7d3ebce37dc54a2f0a5c1c70ce0a6d4c8d6c	2023-11-21 06:31:47 -05:00
Harsh Dave	e91d23ff05	Re-implements ddotv edge kernel using masked instructions - This commit uses avx2 and avx512 masked load instructions for handling edge case where vector size is not exact multiple of avx2/avx512 vector register size. - Thanks to Shubham, Sharma <shubham.sharma3@amd.com> for avx512 ddotv kernel changes Change-Id: I998651eeb1083caf3308f1b45bd7d55b7974bcb4	2023-11-21 02:25:00 -05:00
mangala v	3256a7b074	BugFix: Re-Designed SGEMM SUP kernel to use mask load/store instruction Segfault was reported through nightly jenkins job. Issue was observed when running in MT mode. Issue was due to extra broadcast being used. Extra broadcast would access out of bound memory on input buffer Cleaned up cobbler list by removing unused registers. AMD_Internal: [CPUPL-4180] Change-Id: I1c8715b2850ef855328f2ef12f215987299bdb2b	2023-11-17 18:14:34 +05:30
Edward Smyth	c6f3340125	Merge commit '5013a6cb' into amd-main * commit '5013a6cb': More edits and fixes to docs/FAQ.md. Fixed newly broken link to CREDITS in FAQ.md. More minor fixes to FAQ.md and Sandboxes.md. Updates to FAQ.md, Sandboxes.md, and README.md. Safelist 'master', 'dev', 'amd' branches. Re-enable and fix `fb93d24`. Reverted `fb93d24`. Re-enable and fix `8e0c425` (BLIS_ENABLE_SYSTEM). Removed last vestige of #define BLIS_NUM_ARCHS. Added new packm var3 to 'gemmlike'. Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell. Fix more copy-paste errors in the haswell gemmsup code. Do a fast test on OSX. [ci skip] Fix AArch64 tests and consolidate some other tests. Use C++ cross-compiler for ARM tests. Attempt to fix cxx-test for OOT builds. Updated travis-ci.org link in README.md to .com. Disabled (at least temporarily) commit `8e0c425`. Define BLIS_OS_NONE when using --disable-system. Updated stale calls to malloc_intl() in gemmlike. Blacklist clang10/gcc9 and older for 'armsve'. Add test to Travis using C++ compiler to make sure blis.h is C++-compatible. Moved lang defs from _macro_def.h to _lang_defs.h. Minor tweaks to gemmlike sandbox. Added local _check() code to gemmlike sandbox. README.md citation updates (e.g. BLIS7 bibtex). Tweaks to gemmlike to facilitate 3rd party mods. Whitespace tweaks. Add row- and column-strides for A/B in obj_ukr_fn_t. Clean up some warnings that show up on clang/OSX. Remove schema field on obj_t (redundant) and add new API functions. Add dependency on the "flat" blis.h file for the BLIS and BLAS testsuite objects. Disabled sanity check in bli_pool_finalize(). Implement proposed new function pointer fields for obj_t. AMD-Internal: [CPUPL-2698] Change-Id: I6fc33351fa824580cf4f25b63f0370383cd9422d	2023-11-10 13:05:12 -05:00
Mangala V	f6046784ce	Re-Designed SGEMM SUP kernel to use mask load/store instruction Added all fringe kernels with mask load store support Fringe kernels cover m direction from 5 to 1 and n direction from 15 to 1 for row storage format - New edge kernels that uses masked load-store instructions for handling corner cases. - Mask load-store instruction macros are added. vmaskmovps, VMASKMOVPS for masked load-store. - It improves performance by reducing branching overhead and by being more cache friendly. - Mask load-store is added only for row storage format AMD-Internal: [CPUPL-4041] Change-Id: I563c036c79bf8e476a8ebde37f8f6db751fb3456	2023-11-10 01:23:48 -05:00
Harsh Dave	77161c1e5d	Design change of DGEMM 6x8 native kernel. - Following optimizations are included for dgemm 6x8 native kernel. 1) Reorganized the C update and store to reduce register dependencies. 2) moved the C prefetch to part-way through the kernel for efficiently prefetching C matrix at appropriate distance. 3) Offsetting A matrix, so that kernel can use a smaller instruction encoding saving, saving i-cache space. 4) Aligned the K iteration loop. - Thanks to Moore, Branden <Branden.Moore@amd.com> for these design changes of DGEMM 6x8 native kernels. - Additional change, reorganization of C update and store for beta zero case to facilitate out of order execution of storing of C matrix. Change-Id: I9d1ec8d39f1154b0f38b136bd6a04b05d7d1e6ba	2023-11-09 23:07:43 -05:00
Eleni Vlachopoulou	75a4d2f72f	CMake: Adding new portable CMake system. - A completely new system, made to be closer to Make system. AMD-Internal: [CPUPL-2748] Change-Id: I83232786406cdc4f0a0950fb6ac8f551e5968529	2023-11-09 15:49:45 +05:30
Edward Smyth	9500cbee63	Code cleanup: spelling corrections Corrections for some spelling mistakes in comments. AMD-Internal: [CPUPL-3519] Change-Id: I9a82518cde6476bc77fc3861a4b9f8729c6380ba	2023-11-09 00:16:30 -05:00
Harsh Dave	75356d45e5	DGEMM improvement for very tiny sizes less than 24. - This commit helps improving performance for very small input by reducing framework check and routing all such inputs to bli_dgemm_tiny_6x8_kernel. It forces single threaded computation for such sizes. - It invokes bli_dgemm_tiny_6x8_kernel for ZEN, ZEN2, ZEN3 and ZEN4 code path. Except for the case AOCL_ENABLE_INSTRUCTIONS environment variable is set to avx512. In that case, such a small inputs are routed to bli_dgemm_tiny_24x8_kernel avx512 kernel. AMD-Internal: [CPUPL-1701] Change-Id: Idf59f4a8ee76ee8f2514a33be2b618e3ce02383e	2023-11-08 23:45:57 -05:00
Vignesh Balasubramanian	5f9c8c6929	Bugfix : Fallback mechanism in SNRM2 and SCNRM2 kernels if packing fails - Abstracted packing from the vectorized kernels for SNRM2 and SCNRM2 to a layer higher. - Added a scalar loop to handle compute in case of non-unit strides. This loop ensures functionality in case packing fails at the framework level. AMD-Internal: [CPUPL-3633] Change-Id: I555aea519d7434d43c541bb0f661f81105135b98	2023-11-08 15:16:10 +05:30
Vignesh Balasubramanian	06f23c4fd4	Bugfix : Functional correctness of DNRM2_ and DZNRM2_ APIs - Updated the final reduction of partial sums( AVX-2 code section ) to use scalar accumulation entirely, instead of using the _mm256_hadd_pd( ... ) intrinsic. This will in turn change the associativity in the reduction step. - Reverted to using scalar code on the fringe cases in AVX-2 kernel for DNRM2 and DZNRM2, for improving functional correctness. AMD-Internal: [CPUPL-4049] Change-Id: I9d320b39d23a0cbcc77fb24d951fced778ea5ea5	2023-11-07 10:21:41 -05:00
Harsh Dave	0de10cc86c	Added k=1 avx512 dgemm kernel. - This commit implements avx512 dgemm kernel for k=1 cases. which gets called for zen4 codepath. - Added architecture check for k=1 kernel in dgemm code path to pick correct kernel based on cpu arhcitecture since now blis is having avx2 and avx512 dgemm kernels for k=1 case. - Previously in dgemm path bli_dgemm_8x6_avx2_k1_nn kernel was being called irrespective of architecture type. - Added architecture check before calling the kernel for case where k=1, so only for respective architectures this kernel is invoked. AMD-Internal: [CPUPL-4017] Change-Id: I418bbc933b41db41d323b331c6d89893868a6971	2023-11-07 01:10:09 -05:00
Shubham Sharma	ffa8f584be	Added ZTRSM AVX512 native path kernels - Added 4x12 ZGEMM row-preferred kernel. - Added 4x12 ZTRSM row-preferred lower and upper kernels using AVX512 ISA. - These kernels are used for ZTRSM only, zgemm still uses 12x4 kernel. - Kernels support row/col/gen storage. - Kernels support A prefetch, B prefetch, A_next prefetch, B_next prefetch and c prefetch. - B prefetch, B_next prefetch and C prefetch are enabled by default. - Updated CMakeLists.txt with ZGEMM kernels for windows build. AMD-Internal: [CPUPL-3781] Change-Id: I0fb4b2ec2f4bd66db6499c25f12bcc4bdb09804a	2023-11-03 09:42:24 -04:00
Vignesh Balasubramanian	ef545b928e	Bugfix : Changing fuse factor for the call to vectorized SAXPYF kernel - The call to the bli_saxpyf_zen_int_6( ... ) is explicitly present in the bli_gemv_unf_var2_amd.c file, as part of the bli_sgemv_unf_var2( ... ) function. This was changed to bli_saxpyf_zen_int_5( ... )( thereby changing the fuse factor from 6 to 5 ), in accordance to the function pointer present in the zen3 and zen4 context files. - Changed the accumulator type to double from float, inside the fringe loop for unit-strides(vectorized path) and non-unit strides (scalar code). AMD-Internal: [CPUPL-4028] Change-Id: Iab1a0318f461cba9a7041093c6865ae8396d231e	2023-11-03 01:37:43 -04:00

1 2 3 4 5 ...

672 Commits