amd/blis - blis - Public git mirror


mirror of https://github.com/amd/blis.git synced 2026-05-11 09:39:59 +00:00

Author	SHA1	Message	Date
Harsh Dave	f7b9bc734e	Added AVX512 24xk packing kernel - Added AVX-512 24xk packing kernel. - Prefetching is yet to be added. AMD-Internal: [CPUPL-2966] Change-Id: Iaf973788b51172b48ebb73e8f24b8f4020d94de2	2023-03-17 02:00:00 -05:00
mkadavil	3d74b62e60	Lpgemm threading and micro-kernel optimizations. -Certain sections of the f32 avx512 micro-kernel were observed to slow down when more post-ops are added. Analysis of the binary pointed to false dependencies in instructions being introduced in the presence of the extra post-ops. Addition of vzeroupper at the beginning of ir loop in f32 micro-kernel fixes this issue. -F32 gemm (lpgemm) thread factorization tuning for zen4/zen3 added. -Alpha scaling (multiply instruction) by default was resulting in performance regression when k dimension is small and alpha=1 in s32 micro-kernels. Alpha scaling is now only done when alpha != 1. -s16 micro-kernel performance was observed to be regressing when compiled with gcc for zen3 and older architecture supporting avx2. This issue is not observed when compiling using gcc with avx512 support enabled. The root cause was identified to be the -fgcse optimization flag in O2 when applied with avx2 support. This flag is now disabled for zen3 and older zen configs. AMD-Internal: [CPUPL-3067] Change-Id: I5aef9013432c037eb2edf28fdc89470a2eddad1c	2023-03-16 11:44:51 +05:30
eashdash	e36f699939	Implemented ERF Based GeLU Activation for LPGEMM and SGEMM 1. Implemented efficient AVX-512, AVX-2 and SSE-2 versions of the error function - ERF 2. Added error function based GeLU activation post-ops for the S32, S16 and BF16 (LPGEMM) and SGEMM APIs. 3. Changes for this include frame and micro-kernel level changes in addition to adding the macro based function definitions of the ERF function in the math-utils and gelu header files. AMD-Internal: [CPUPL-3036] Change-Id: Ie50f6dcabf8896b7a6d30bbc16aa44392cc512be	2023-03-13 06:10:31 -04:00
Edward Smyth	1617589d24	Add consistent NaN/Inf handling in sumsqv. (#668) Details: - Changed sumsqv implementation as follows: - If there is a NaN (either real or imaginary), then return a sum of NaN and unit scale. - Else, if there is an Inf (either real or imaginary), then return a sum of +Inf and unit scale. - Otherwise behave as normal. (cherry picked from commit `b861c71b50`) AMD-Internal: [SWLCSG-1900] Change-Id: Ic7ba9cad1fbaf11823b9ba96e72a4ddd973db5b6	2023-03-09 06:36:44 -05:00
Vignesh Balasubramanian	c53b0c96ec	Extended the support for beta scaling in 3x4 RD kernels of ZGEMM SUP - Extended the existing support for handling beta scaling in the fringe cases of 3x4 RD kernels in ZGEMM SUP. The added support ensures that NaN values initialized in C do not propagate to the result when beta is 0. - The support has been added to fringe cases common to, as well as specific to the m and n variants of the RD kernels. AMD-Internal: [CPUPL-3053] [SWLCSG-1900] Change-Id: I8e617ac505144c3ea3a70556413d264f11dfc9a9	2023-03-08 06:16:58 -05:00
Eleni Vlachopoulou	1416ab6b5e	Bugfix on the memory access for nrm2 AVX2 implementations. - Since the definition of negative increments is different between BLAS and BLIS, there was a bug in how the memory was accessed when we were copying the elements of a vector with negative increments. Updated the code under the assumption that when negative increments are set, the vector is being accessed starting from the end. For the BLAS interface, there is an intermediate conversion before calling into the blis layer. Change-Id: I08343472b418733fad6f7add9e90aa96cdf68285 AMD-Internal: [SWLCSG-1900]	2023-03-08 04:21:36 -05:00
mkadavil	7dca25d056	Fixed decision logic in choosing between sup vs native GEMM code paths -Currently the storage preference of native kernel is used to determine appropriate "m" and "n" which are used in native/SUP threshold logic. This assumption works only when "sup" & native kernel preferences are same. However, this need not have to be the case. This fix calculates the correct "m" and "n" irrespective of the native/sup kernel preferences thereby improving accuracy of this decision logic. AMD-Internal: [CPUPL-3039] Change-Id: I88c5a6838aeed867effdbbdfc41fbf2e88d4e52a	2023-03-02 02:12:42 -05:00
Harihara Sudhan S	d4901f53ce	Improved performance of double complex AXPYV kernel - Increased unroll by reusing X registers that were previously used for performing shuffle. - Added loops with smaller increment steps for better problem decomposition. - Added X vector and Y vector prefetch to the kernel. - Removed redundant code that handles fringe in incx = 1 and incy = 1. This remainder will be performed by the loop that handles non-unit stride cases. - Vectorized loops that handle non-unit stride cases using SSE instructions. AMD-Internal: [CPUPL-2773] Change-Id: Ifb5dc128e17b4e21315789bfaa147e3a7ec976f0	2023-03-02 00:35:42 -05:00
Harsh Dave	222e00e840	Obliterated usage of rbp register in SUP gemm kernel - Mx4 edge kernels were overwriting rbp registers for prefetches. - Since rbp along with rsp defines stack frame, it resulted in stack overflow issue. - Replaced rbp with rdx register for prefetches. AMD-Internal: [CPUPL-2987] Change-Id: I4e52cf691b70be5ab63f562d7630d640b29e1cfd	2023-03-01 11:09:57 -05:00
Harihara Sudhan S	299bed3fa8	Vectorized SCAL2V for double complex - Added SCAL2V kernel that uses AVX2 and SSE instructions for vectorization. - The routine returns early when the vector dimension is zero or incx <= 0 or incy <= 0. - The kernel takes one among the two available paths based on conjugation requirement of X vector. - VZEROUPPER is added before transitioning from AVX2 to SSE. - Added function pointer to ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts. - Added the new SCAL2V file from the CMAKE list. AMD-Internal: [CPUPL-2773] Change-Id: I2debbfab31d41347786c3a1bae5723d092c202e9	2023-02-28 06:59:23 -05:00
Shubham	b84157fed6	Added AVX512 DTRSM small RLNN/RUTN variant kernels - 8x8 kernels are used for DTRSM SMALL - Implemented fringe cases with below block sizes 8x8, 8x4, 8x3, 8x2, 8x1 4x8, 4x4, 4x3, 4x2, 4x1 3x8, 3x4, 3x3, 3x2, 3x1 2x8, 2x4, 2x3, 2x2, 2x1 1x8, 1x4, 1x3, 1x2, 1x1 AMD-Internal: [CPUPL-2745] Change-Id: Ifb8cfba6958e1c89ddbfa18893127ab6d44cc367	2023-02-28 01:40:03 +05:30
mkadavil	1f2447f800	Post-ops support for f32 gemm(aocl_gemm). - Bias add, relu, parametric relu and gelu post-ops support added in all f32 gemm micro-kernels. These post-ops are implemented for both AVX512 and AVX2 ISA based on the micro-kernel flavor. The support is added for both row and column major cases. - Lpgemm bench updates to support f32 post-ops. AMD-Internal: [CPUPL-3032] Change-Id: Ie6840b9d4e52d2086c1b5ff2e1de80dc0cad5476	2023-02-23 18:58:59 +05:30
Arnav Sharma	5baf38b76d	K-Loop Unrolling for 6x64 SGEMM SUP RV kernels - Added k-loop unrolling by a factor of 4 to the following SGEMM SUP RV kernels: - 5x48, 5x32, 5x16 - 4x64, 4x48, 3x32, 4x16 - 3x48, 3x32, 3x16 - 2x64, 2x48, 2x32, 2x16 - 1x64, 1x48, 1x32, 1x16 - 6x64n, 5x64n, 3x64n, 2x64n, 1x64n - Removed unused variables which were resulting in warnings during compilation. - Added a newline at the end of header files to resolve warnings shown during compilation. AMD-Internal: [CPUPL-3002] Change-Id: Iab6cf329f6d7fbd7544b5c8837e493069e8c9921	2023-02-23 04:08:38 -05:00
Arnav Sharma	3d4611ab8b	Updated CMakeLists.txt with SGEMM kernels - Added SGEMM kernel files to CMakeLists.txt for Windows build. AMD-Internal: [CPUPL-3023] Change-Id: I2ab6d23abd2b96574a6d3540fa7b62bc20c92daf	2023-02-22 11:22:53 +00:00
mkadavil	8dff49837d	Lpgemm source restructuring to support amdzen config. -Currently lpgemm can only be built using either zen3 or zen4 config. The lpgemm kernel code is re-structured to support amdzen, and thus multi machine deployment. -The micro-kernel calls (gemm and pack) are currently hardcoded in the lpgemm framework. This is removed and a new lpgemm_cntx based dispatch mechanism is designed to support runtime configurability for micro-kernels. AMD-Internal: [CPUPL-2965] Change-Id: I4bbcb4e5db767def1663caf5481f0b4c988149ef	2023-02-21 08:35:38 -05:00
Harihara Sudhan S	dfacb47125	Code cleanup to remove multiple function declaration - Removed repetitive function declaration of GEMM small kernels from the C files - Function declaration of these kernels exist in the header files where the kernels are supposed to be declared. AMD-Internal: [CPUPL-3003] Change-Id: Ic10e66691c0742ce519bcc3fe4a12ec5c5052b63	2023-02-21 00:35:31 -05:00
Shubham	1faee9f89e	Fixed 32xk AVX512 double precision pack kernel - Currently the pointer received as function argument is used for packing which causes only a partial copy of input buffer to output buffer due to strange optimizations by compiler. - To fix this, instead of using a normal pointer for output buffer, we define a "restrict" local pointer variable. - "restrict" keyword tells the compiler that the pointer is the only way to access the object pointed by the pointer. - By defining "restrict" local pointer pointing to output buffer, the mysterious problem of incomplete copy has been solved. Change-Id: Ie2355beb1d43ff4b60b940dd88c4e2bf6f361646	2023-02-20 21:18:54 -05:00
Shubham	3ae84c98fd	Fixed seg fault in Testsuite for DTRSM micro kernel - In zen4 arch TRSM and GEMM have different blocksizes. TRSM call will update blocksize in global cntx object which is incorrect for GEMM, when GEMM and TRSM are called in parallel. - Hence using a local copy of cntx which holds blocksizes would help. AMD-Internal: [CPUPL-3019] Change-Id: I5f0f5675b3917d2a11d582ac626ca5d8f4752c53	2023-02-20 05:34:42 +05:30
bhaskarn	91a9968a5e	Developed intrinsic based f32 kernels in lpgemm Description: 1. Developed row variant intrinsic kernels for float32/sgemm which are called from lpgemm api aocl_gemm_f32f32f32of32() 2. 6x64m, 6x48m, 6x32m kernels and respective fringe kernels are developed using avx512. 3. 6x16m main kernel and respective n fringe and mn fringe are developed based on avx2 and avx 4. Modularizing, K loop unroll, perf tuning, post-ops and dynamic dispatch are planned next 5. When leading dims are greater than dims bench_lpgemm need to be updated to test it and this is planned next. Change-Id: I54c78fef639ea109d6ef2c2b05c07ce396c81370	2023-02-20 01:11:22 -05:00
Arnav Sharma	46965dfc57	Developed 6x64 SGEMM row-preferred kernels - Added kernels for all rv and rd variants. - Main kernel is of size 6x64, and the associated fringe kernels added are - 4x64, 2x64, 1x64 - 6x32, 4x32, 2x32, 1x32 - 6x16, 4x16, 2x16, 1x16 - Updated the zen4 config to enable these kernels in zen4 path. - Added C-prefetching to 6x? row-stored main kernels. - C-prefetching for column storage yet to be added. - K-loop unrolling for fringe kernels yet to be added. AMD-Internal: [CPUPL-3002] Change-Id: Ide18412cc6178b43a12a3bc7a608ce9d298fb2e4	2023-02-19 23:56:58 -05:00
Harihara Sudhan S	535b49f150	Vectorized COPYV for double complex - Added ZCOPYV kernel that uses AVX2 and SSE instructions for vectorization. - The routine returns early when the vector dimension is zero. - The kernel takes one among the two available paths based on conjugation requirement of X vector. - VZEROUPPER is added before transitioning from AVX2 to SSE. - Added function pointer to ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts. AMD-Internal: [CPUPL-2773] Change-Id: Ibd8a2de42060716395ef698d753c8462654cc0f0	2023-02-16 09:32:10 -05:00
Shubham	18569b42ee	Added DTRSM small RUNN/RLTN variant AVX512 kernels - 8x8 kernels are used for DTRSM SMALL - Matrix A(a10) is packed for GEMM operations. - Packed matrix A will be re-used in all the col-block along N-dimension. - Diagonal elements of A matrix are packed(a11) for TRSM operations. - Implemented fringe cases with following block sizes 8x8, 8x4, 8x3, 8x2, 8x1 4x8, 4x4, 4x3, 4x2, 4x1 3x8, 3x4, 3x3, 3x2, 3x1 2x8, 2x4, 2x3, 2x2, 2x1 1x8, 1x4, 1x3, 1x2, 1x1 AMD-Internal: [CPUPL-2745] Change-Id: I6a174e7f88a4c2c5778052525879552a1e82f6ad	2023-02-16 06:49:46 -05:00
mkadavil	63ee4c5e4c	Remove memcpy usage in u8s8s32 lpgemm micro kernels. -As of now, memcpy is used in u8s8s32 micro-kernel for copying in k fringe loop (( k % 4 )!= 0) and NR' < 16 fringe kernels. However for small k/n dimensions, memcpy invocation has high overhead. -This issue is fixed by replacing memcpy with a MACRO based implementation of copy routine, specifically optimized for the sizes that will be encountered in fringe cases (k < 4, NR' < 16). AMD-Internal: [CPUPL-3008] Change-Id: I376bab0aac325832e42e370b291614e5fd5272dc	2023-02-16 05:52:19 -05:00
chandrkr	399831d7cb	AOCL-Windows: Script Update to mirror reference kernel files with Unique names Details: 1. Reference kernel File names are to be mirrored with unique names for each architecture configuration while generating fat binary. 2. Python script "blis_ref_kernel_mirror.py" is updated to append the filenames with each configuration (zen, zen2, zen3, zen4 and generic) that is built for windows. AMD-Internal: [CPUPL-3009] Change-Id: Ib02206382199cf2aebe14ff9c869b6089228e1c2	2023-02-14 23:37:13 -05:00
Harihara Sudhan S	f6960a3b6a	Fixed bug in scal2v reference - When alpha is equal 1 in scal2v the computation can be reduced to copyv. The condition check in else if condition has to be a check for alpha equal to one and not zero. - BLIS_NO_CONJ has been passed while invoking copy. In case of single and double complex, the appropriate conjx needs to be passed with copyv so that the vector can be conjugated before copying to the destination. AMD-Internal: [CPUPL-2985] Change-Id: Ieb1ad111dff68098a08d870da7215bbd4ff5c9d0	2023-02-05 23:11:51 -05:00
Harihara Sudhan S	8d7593a940	AVX2 vectorized SCALV for double complex - Added ZSCALV that uses AVX2 and SSE instructions for vectorization. - Return early when the vector dimension is zero. When alpha is 1 there is no need to perform computation hence return early. - When alpha is zero expert interface of ZSETV is invoked. In this case, all the elements of the input vector are set to 0. - Invocation of expert interface means that NULL pointer can be passed to the function in place of context. Expert interface of ZSETV will query the context and get the appropriate function pointer. - Added BLAS interface for ZSCALV. The architecture ID is used to decide the function that is to be invoked. - Created a new macro INSERT_GENTFUNCSCAL_BLAS_C to instantiate SCALV BLAS macro interface only for single complex type and single complex, float mixed type AMD-Internal: [CPUPL-2773] Change-Id: I0d6995bce883c0ebdc5da0046608fc59d03f6050	2023-02-05 09:36:12 -05:00
Harihara Sudhan S	18ae57305e	ZAXPYF4 optimization - Vectorized alpha scaling of X vector using SSE instructions. This can be done irrespective of incx. - Added code to prefetch A matrix and Y vector to L1 cache - Vectorized fringe case computation and non-unit stride computation with SSE instructions. - Increased unroll in unit stride cases for better register utilization. AMD-Internal: [CPUPL-2773] Change-Id: I217e6ce9e3f5753ebe271c684abd9a2274fd2715	2023-02-04 12:34:50 -05:00
mkadavil	d2713d3dc0	GCC compiler optimization flag (kernel) update for zen3 and zen4 config. -Inefficient assembly is generated for s16 gemm micro-kernel(intrinsics code) when compiled using gcc. The presence of -fschedule-insns + -fschedule-insns2 + -ftree-pre in O2 compiler optimization flags results in the code being optimized to reduce data stalls, and results in the usage of stack to store intermediate C register output. Disabling -ftree-pre in gcc fixes the issue, even in the presence of the other two flags. AMD-Internal: [CPUPL-2971] Change-Id: Ibf0dcde20b5a18708a05faad34e684eb0a9a5463	2023-02-02 23:58:14 -05:00
eashdash	672544bc04	GeLU Activation Function Post-Op for LPGEMM S16, S32 and BF16 1. Added Tanh approximation based GeLU Post-Op for S16, S32 and BF16 2. Changes are done at frame and micro-kernel level to implement this post-op. 3. Efficient AVX-512 and AVX-2 vector versions of TANHF and EXPF functions are implemented for the GeLU post-operation. 4. TANH and EXPF math functions are efficiently implemented in macro-based fashion to exploit register level fusion of GeLU with GEMM operations for improved performance 5. LPGEMM bench is changed to pass GeLU post-op as input and support accuracy check to verify functional correctness AMD-Internal: [CPUPL-2978] Change-Id: I472ac35c00a4ea1ab983cc5f6ff6a123c8035f28	2023-02-02 08:25:04 -05:00
sireesha.sanga	fe999a9653	Removal of Prohibited Source Links as per Legal Scan AMD-Internal: [SWCSD-5866] Change-Id: I03011b114d37fcee8a399af8d631aee610b1f924	2023-01-27 07:56:01 -05:00
sireesha.sanga	71a25a5152	DAXPY Parallelization Details: - DAXPYV is parallelized based on the number of threads set by Application. - Using the empirical data collected on genoa, optimal number of threads is set. - If intended number of threads to be spawned is not equal to actual number of threads, AXPY routine will execute sequentially. Change-Id: I48015a712ada6428056af6320aa5570596a02b1f AMD-Internal: [CPUPL-2747]	2023-01-24 07:32:16 -05:00
Kiran Varaganti	201db7883c	Integrated 32x6 DGEMM kernel for zen4 and its related changes are added. Details: - Now AOCL BLIS uses AVX512 - 32x6 DGEMM kernel for native code path. Thanks to Moore, Branden <Branden.Moore@amd.com> for suggesting and implementing these optimizations. - In the initial version of 32x6 DGEMM kernel, to broadcast elements of B packed we perform load into xmm (2 elements), broadcast into zmm from xmm and then to get the next element, we do vpermilpd(xmm). This logic is replaced with direct broadcast from memory, since the elements of Bpack are stored contiguously, the first broadcast fetches the cacheline and then subsequent broadcasts happen faster. We use two registers for broadcast and interleave broadcast operation with FMAs to hide any memory latencies. - Native dTRSM uses 16x14 dgemm - therefore we need to override the default blkszs (MR,NR,..) when executing trsm. we call bli_zen4_override_trsm_blkszs(cntx_local) on a local cntx_t object for double data-type as well in the function bli_trsm_front(), bli_trsm_xx_ker_var2, xx = {ll,lu,rl,ru}. Renamed "BLIS_GEMM_AVX2_UKR" to "BLIS_GEMM_FOR_TRSM_UKR" and in the bli_cntx_init_zen4() we replaced dgemm kernel for TRSM with 16x14 dgemm kernel. - New packm kernels - 16xk, 24xk and 32xk are added. - New 32xk packm reference kernel is added in bli_packm_cxk_ref.c and it is enabled for zen4 config (bli_dpackm_32xk_zen4_ref() ) - Copyright year updated for modified files. - cleaned up code for "zen" config - removed unused packm kernels declaration in kernels/zen/bli_kernels.h - [SWLCSG-1374], [CPUPL-2918] Change-Id: I576282382504b72072a6db068eabd164c8943627	2023-01-19 23:11:36 +05:30
Harsh Dave	0a699c45f0	Corrected VEXTRACTF64X2 macro in bli_x86_asm_macro file. - Previously VEXTRACTF64X2 macro was defined as vextractf64x4. Change-Id: I79727a85b7d6da3b4d524064e297fc8c71d4f466	2023-01-16 23:04:48 -05:00
mkadavil	4b5e24d0d9	Column major input support for f32 gemm (sgemm for lpgemm). -The f32 gemm framework is modified to swap input column major matrices and compute gemm for the transposed matrices (now row major) using the existing row-major kernels. The output is written to C matrix assuming it is transposed. -Framework changes to support leading dimensions that are greater than matrix widths. AMD-Internal: [CPUPL-2919] Change-Id: I805f1cb9ff934bb3106e01eb74e528915ffb90a3	2023-01-16 04:04:21 -05:00
Sireesha Sanga	540509f374	Enabling AVX2 path for SCNRM2	2023-01-13 10:27:54 +05:30
Eleni Vlachopoulou	758d68467f	Disabling AVX2 path for SNRM2 and SCNRM2. AMD-Internal: [CPUPL-2865] Change-Id: I09c67115801a6b9446c7930c54fc937bd17908a3	2023-01-12 09:58:28 -05:00
Meghana Vankadari	f39dba9fd8	Added dzgemm, DZGEMM, DZGEMM_ prototypes to wrapper file. AMD-Internal: [CPUPL-2199] Change-Id: Ied814cb7be60d30b8217ec42ac436b4e628ea6d2	2023-01-12 01:35:29 -05:00
mkadavil	3870792e62	Low precision gemm s32 downscale optimization. -The post operations attributes are moved to a new struct lpgemm_post_op_attr, and an object of this struct is passed to the low precision gemm kernels in place of the multiple parameters. -The u8s8s32s8 api (downscale api) performance is low when the k value is less (k < KC). Two scenarios are observed here: a. beta = 0: Currently, for downscale api, a temporary buffer is used to accumulate intermediate s32 output, so that it can be used in later iterations of pc loop (k dim). The usage of this buffer (store) can be avoided if k < KC. Here intermediate accumulation is not required, since after the first iteration of the pc loop, the output can be downscaled and stored. b. beta != 0: In this case the existing values of the original s8 C output matrix need to be converted to s32 and beta scaled. Currently the s8 values are converted to s32 and stored in temporary buffer in pc loop (5 loop algorithm) in blocks of mxNC. This temporary buffer is passed to the micro kernel and beta scaling is applied on this. However the mxNC block copy is costly and can be avoided if a new condition is introduced for beta scaling in the micro kernel, whereby the original s8 data is loaded instead of from the temporary buffer to a register, converted to s32 and beta scaling applied on it. AMD-Internal: [CPUPL-2884] Change-Id: Id9b4650d500e1b553e48c4f1e4c902b3f553211c	2023-01-10 13:15:22 +05:30
Kiran Varaganti	d8d4499e54	AVX2 dgemm kernel optimization for AOCC Details: k0 is always positive in bli_dgemm_haswell_asm_6x8(), the operation involved with k0 is typecasted to uint64_t to enable AOCC to generate optimized code. Thanks to Jini Susan (jinisusan.george@amd.com) from compiler team for suggesting this change. Similar change was applied to sgemm, cgemm and zgemm kernels. Change-Id: I423c949e0c1835652142a6931dadf4a7d190aeb9	2023-01-09 07:49:41 -05:00
Shubham	9e8595356f	Fixed bug related to block size override for TRSM - In zen4 TRSM and GEMM have different blocksizes, when trsm is called, blocksizes are changed in global cntx object. If GEMM and TRSM are called in parallel, blocksizes in global cntx will not be correct for GEMM which will cause a seg fault. - To fix this, a local copy of cntx is created and blocksizes are changed only in the local copy. AMD-Internal:[CPUPL-2896] Change-Id: I0e724520a92fc3b2ed0becf385ec41ab5d1b4490	2023-01-09 05:30:36 -05:00
Edward Smyth	82c2eb4e8e	Code cleanup and warnings fixes Corrections for some occurrences of: - Compiler warnings about initialization of float from double - Spelling mistakes in comments - Incorrect indentation of code and comments AMD-Internal: [CPUPL-2870] Change-Id: Icb68c789687bd0684844331d43071bfffecac9fc	2023-01-09 04:34:52 -05:00
Meghana Vankadari	fbd9dde869	Fix bug in DZGEMM Details: - In case of dzgemm, if the microkernel prefers column output, We will induce a transposition and perform C += A*B where A (formerly B) becomes complex and B(formerly A) becomes real. Hence attach complex alpha object to A instead of B. - This commit reverts all the changes made by `d62f12a18a` and `0b81f53074` as they are causing failures in make checkblis-md. AMD-Internal: [CPUPL-2893] Change-Id: I56b94ac136fb96003302c568ae2587142c836620	2023-01-09 03:30:57 -05:00
Vignesh Balasubramanian	dbd0c069d4	Implemented 3x4 RD kernels for ZGEMM in SUP path -Implemented (r)ow preferential (d)ot product milli-kernels (m and n variants) for dcomplex datatype along SUP path. -These computational kernels extend the support for handling RRC and CRC storage schemes along the SUP path. In case of BLAS api call, it corresponds to the input cases with transa equal to T and transb equal to N. -In case of the B matrix being packed(conditionally), the inputs are redirected to the existing (r)ow preferential (v)ector load optimized kernels due to better performance. -Added macro for vhsubpd assembly instruction, to support the arithmetic for complex datatype in its interleaved storage. AMD-Internal: [CPUPL-2593] Change-Id: If90834e55e9e31aa87d3d5b711efad9ef2458da8	2022-12-22 18:31:46 +05:30
Nithya V S	6e9defe1b5	Developed AVX512 based "sup" kernels for SGEMM - Only for RRR case, var2m kernels are added - Main kernel is of 12x32 (AVX512), associated fringe kernels of - 8x32, 4x32, 2x32, 1x32 (AVX512) - 12x16, 8x16, 4x16, 2x16, 1x16 (AVX512) - 12x8, 8x8 (AVX2) - 12x4, 8x4 (SSE4) - 12x2, 8x2 (SSE4) - existing AVX2/SSE4 kernels are used for other fringe cases - Currently, these kernels are not invoked in zen4 path - Once all AVX512 kernels (n and rd) are done, invoke all of them together in zen4 config AMD-Internal: [CPUPL-2801] Change-Id: I7a206fee9151e92319d83dcc5f3eed61d3bf1196	2022-12-16 10:18:09 +05:30
Eleni Vlachopoulou	13aa3c8cd0	Adding AVX2 support for SNRM2 and SCNRM2 - For the cases where AVX2 is available, an optimized function is called, based on Blue's algorithm. The fallback method based on sumsqv is used otherwise. - Scaling is used to avoid overflow and underflow. - Works correctly for negative increments. AMD-Internal: [SWLCSG-1080] Change-Id: I6bf2f42652ba6b8a8631a0a9e6f6297d5b3ea5d9	2022-12-14 04:25:45 -05:00
Edward Smyth	2592774fe8	BLIS: Nested Parallelism issues (3) Bugfix for parallel BLAS1 and BLAS2 routines. Threading information was not being set correctly when initializing local rntm from global. Also ensure th_rntm is initialized along with global_rntm by updating it in bli_thread_init(), called by bli_init_once() AMD-Internal: [CPUPL-2433] Change-Id: Iba658f87ae13fe16a57ca1fc279e149b7fa294cf	2022-12-13 12:38:40 -05:00
Eleni Vlachopoulou	88e549e7bd	Using CMake as the build system for both Linux and Windows: - GoogleTest headers removed. GoogleTest gets fetched at configuration time. - BLIS headers removed. A BLIS installation path is required at configuration time. - Windows has been temporarily disabled. AMD-Internal: [CPUPL-2732] Change-Id: I9e55c8e43b2733f96cd8b6e5449d79623decad5c	2022-12-13 19:09:23 +05:30
jagar	cff29bde76	Added gtestsuite folder into blis repo Moved blis gtestsuite from lib-confscript to blis repo (branch: amd-main) Change-Id: If7ad391eef66bac6d26cf5223e6043d52b746072	2022-12-07 23:57:13 -05:00
Edward Smyth	345aacf806	BLIS: Nested parallelism issues (2) Improvements to recent parallelism changes: 1. BLIS specific threading options: In bli_thread_update_rntm_from_env() set threading variables in tl_rntm to serial values when OpenMP level for parallelism within BLIS will not be active. User supplied BLIS threading values remain unchanged in global_rntm. 2. Simplify code structure in bli_thread_update_rntm_from_env(). 3. Change variable declarations in bli_thread_init_rntm_from_env() and bli_thread_update_rntm_from_env() to avoid unused variable warnings in non-OpenMP builds. AMD-Internal: [CPUPL-2433] Change-Id: I5505657e3d2722e69bc4a1c1bb9fd8df55407fdd	2022-12-07 04:34:07 -05:00
Edward Smyth	7f86561d26	BLIS-Nov2022: HPL memory issues with GCC. HPL script was using BLIS manual way to set threading, i.e. setting BLIS_IC_NT explicitly. This causes bli_rntm_num_threads() to return -1, which wasn't trapped in parallelised BLAS1 and BLAS2 routines. Fix: if this occurs, set local number of threads based on product of BLIS_JC_NT * BLIS_PC_NT * BLIS_IC_NT * BLIS_JR_NT * BLIS_IR_NT values. Note: BLIS_PC_NT should always be 1, but this environment variable is currently being read (contrary to documentation), so include it for now. Other changes: * implement _Pragma convention in all code used on AMD * frame/2/gemv/bli_gemv_unf_var1_amd.c: Remove is_omp_mt_enabled flag AMD-Internal: [CPUPL-2803] Change-Id: I37e8b038e5640d6693a87be0609888186322b465	2022-12-06 05:10:34 -05:00
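
The NaN/Inf rule described in commit 1617589d24 (sumsqv) can be illustrated with a minimal sketch. This is not the BLIS implementation: the function name `sumsqv_sketch`, its defaults, and the LAPACK-style scaled update used for the "otherwise behave as normal" branch are assumptions made for illustration. Only the three-way rule (NaN first, then Inf, then the normal path) comes from the commit message.

```python
import math

def sumsqv_sketch(x, scale=0.0, sumsq=1.0):
    """Hypothetical sketch: return (scale, sumsq) such that
    scale**2 * sumsq equals the sum of squares of the real and
    imaginary parts of x (plus the incoming scale**2 * sumsq)."""
    # Flatten every element into its real and imaginary parts.
    parts = []
    for v in x:
        z = complex(v)
        parts.extend((z.real, z.imag))

    # Rule 1: any NaN (real or imaginary) gives a NaN sum with unit scale.
    if any(math.isnan(p) for p in parts):
        return 1.0, math.nan
    # Rule 2: otherwise, any Inf gives a +Inf sum with unit scale.
    if any(math.isinf(p) for p in parts):
        return 1.0, math.inf

    # Rule 3: normal path. Assumed here to be the classic scaled
    # accumulation that avoids overflow/underflow of the squares.
    for p in parts:
        a = abs(p)
        if a == 0.0:
            continue
        if scale < a:
            sumsq = 1.0 + sumsq * (scale / a) ** 2
            scale = a
        else:
            sumsq += (a / scale) ** 2
    return scale, sumsq
```

Checking NaN before Inf means a vector containing both yields NaN, which matches the ordering stated in the commit message.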


2830 Commits