amd/blis - blis - Public git mirror

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-05-11 17:50:00 +00:00

Author	SHA1	Message	Date
Harihara Sudhan S	be7fb342c1	Added AVX512 based double and float SCALV - Added AVX512 based double and float SCALV which will be used in Zen4 context. - Added incx <= 0 check and alpha == 1 check to the BLAS layer of SSCAL. - Modified BLAS framework of float SCAL to remove flag check and pick kernels based on architecture ID. - AVX512 kernel is disabled for other Zen configurations using BLIS_KERNELS_ZEN4 macro. AMD-Internal: [CPUPL-2766],[CPUPL-2765] Change-Id: I4cdd93c9adbfbf8f7632730b8606ddcf70edd1dc	2023-04-11 14:41:56 +05:30
Shubham	cc25cff864	Added AVX512 flag for d24xk pack kernel for windows - on windows 24xk kernel is compiled without avx512 flag which causes out of bounds writes for DTRSM. - to fix this avx512 flag has been added to the CMakeLists.txt file for 24xk kernel. AMD-Internal: [CPUPL-3186] Change-Id: I0314dea88302fc4964a303853a4b9b719ecd8064	2023-04-09 22:38:33 +05:30
Aayush Kumar	8c537b0cd5	Added DTRSM Small Path AVX512 based LLNN/LUTN Variant Kernels - 8x8 kernels are used for DTRSM SMALL - Implemented fringe cases with below block sizes 8x8, 8x4, 8x3, 8x2, 8x1 4x8, 4x4, 4x3, 4x2, 4x1 3x8, 3x4, 3x3, 3x2, 3x1 2x8, 2x4, 2x3, 2x2, 2x1 1x8, 1x4, 1x3, 1x2, 1x1 AMD-Internal: [CPUPL-2745] Change-Id: I58d28912bddbaadb404052c0f3449ebbe3c97b68	2023-04-07 08:50:28 +00:00
Edward Smyth	1885540c5a	Code cleanup: compiler warning fixes Modify code to correct some warning messages from GCC 12.2 or AOCC 4.0: - Increase size of nbuf in blastest/f2c/endfile.c - Remove unused variables in kernels/zen/1/bli_scal2v_zen_int.c and kernels/zen/1/bli_axpyv_zen_int10.c - Remove extraneous parentheses in frame/compat/bla_trsm_amd.c and kernels/zen4/3/bli_zgemm_zen4_asm_12x4.c - Add __attribute__ ((unused)) to several variables in frame/1m/packm/bli_packm_struc_cxk.c and frame/1m/packm/bli_packm_struc_cxk_md.c AMD-Internal: [CPUPL-2870] Change-Id: I595e46f0a3d737beb393c3ab531717565220b10d	2023-04-06 06:56:09 -04:00
vignbala	775ce1f13c	Implemented AVX-512 based 12x4 m-variant SUP kernels for ZGEMM - Implemented 12x4m column preferential SUP kernels(main and fringe cases). The main kernel dimension is 12x4, and the associated fringe kernel dimensions are : 12x3m, 12x2m, 12x1m 8x4, 8x3, 8x2, 8x1 4x4, 4x3, 4x2, 4x1 2x4, 2x3, 2x2, 2x1. - Included in-register transposition support for C, thus extending the storage scheme supports to CCC, CCR, RCC and RCR inside the milli-kernel. - Integrated conditional packing of A onto the SUP front end for dcomplex datatype. This redirects RRC and CRC storage schemes onto the preceding set of SUP kernels through storage scheme transformation(RCC and CCC respectively). - Updated the zen4 context file with the new set of SUP kernels, to get enabled appropriately. Furthermore, the context file was updated with the AVX-2 dotxv signatures for dcomplex datatype. This redirects the fringe cases of type 1x? to the pre-existing AVX-2 GEMV routines. - Added C prefetching onto L2-cache, and an unroll factor of 4 for the k loop in all the kernels. - Work in progress to include conjugate support and input spectrum extension for the AVX-512 SUP kernels. The current thresholds in zen4 context is the same as that of the zen3 thresholds for ZGEMM SUP. AMD-Internal: [CPUPL-3122] Change-Id: If40bc4409c6eb188765329508cf1f24c0eb12d1e	2023-04-06 04:49:15 -04:00
Edward Smyth	1ac03e64b5	BLIS cpuid tidy and bugfix. Improvements to BLIS cpuid functionality: - Tidy names of avx support test functions, especially rename bli_cpuid_is_avx_supported() to bli_cpuid_is_avx2fma3_supported() to more accurately describe what it tests. - Fix bug in frame/base/bli_check.c related to changes in commit `6861fcae91` AMD-Internal: [CPUPL-3031] Change-Id: Iacd8fb0ffbd45288e536fc6314660709055ea2d5	2023-04-03 08:46:37 -04:00
mkadavil	27a9e2a0ff	u8s8s32 fringe kernel optimizations. -The n fringe micro kernels uses only a few zmm registers for computing the output (eg: 6x16 uses 6 zmm registers for output as opposed to 24 used in 6x64). This results in lot of wasted registers that if utilized can help increase the MR dimension and thus improve the reuse of registers loaded with B. Based on this concept, the existing n fringe kernels are modified (6x16 -> 12x16, 6x32 -> 9x32). It is to be noted that the maximum number of registers are not used, since it results in cache inefficient code due to the increase in MR and thus more broadcasts required from unpacked A matrix. -Compiler flag updates for AOCC build to generate loops with 64 byte alignment. This has been observed to improve performance slightly when k dimension is small. AMD-Internal: [CPUPL-3173] Change-Id: I199ce75ef71d994ffe0067dac1ed804dce1742ca	2023-04-03 05:35:18 -05:00
eashdash	bd8cd763ff	Added NEW LPGEMM TYPE- S8S8S32/S8 1. New LPGEMM type - S8S8S32/S8 is added. 2. New interface, frame and kernel files are added. 3. Frame and kernel files added/modified for S8S8S32/S8 have 2 operations - Pack B and Mat Mul 4. Pack B kernel routines to pack B matrix for VNNI and compute the sum of every column of B matrix to implement the S8S8S32 operation using the VNNI instructions. 5. Mat Mul Kernel files to compute the GEMM output using the VNNI. Here the A matrix elements are converted from int8 to uint8 (VNNI works with A matrix type uint8 only). 6. Post GEMM computation, additional operations are performed on the accumulated outputs to get the correct results. 7. With this change, two new LPGEMM APIs are introduced in LPGEMM - s8s8s32os32 and s8s8s32os8. 8. All previously added post-ops are supported on S8S8S32/S8 also. AMD-Internal: [CPUPL-3154] Change-Id: Ib18f82bde557ea4a815a63adc7870c4234bfb9d3	2023-03-31 05:44:54 -04:00
Mangala V	245fdf072c	AVX-512 based col-preferred kernels for ZGEMM in native path - Kernel block size is 12x4 - Updated the zen4 config to enable these kernels in zen4 path. - Tuned MC,KC,NC for better performance for m/n/k size > 500 - Updated CMakeLists.txt with ZGEMM kernels for windows build. Kernel supports: 1. Preload and prebroadcast of A and B 2. Prefetch of C Matrix 3. K loop is subdivided into multiple loops to maintain distance between C prefetches. 4. Special case when alpha/beta imag component is zero 5. Row/Col/General stride of Matrix C AMD-Internal: [CPUPL-2998] Change-Id: I62e3c352d475b1add3f43270805fbcee00e2e440	2023-03-28 23:05:06 -04:00
Harsh Dave	f5dc3db648	Added AVX512 8xk packing kernel AVX512 optimised kernel for Double datatype supports row and column major matrix Packing kernel is column major implementation If matrix is row major, we need to transpose block before storing it. If matrix is column major, we directly store AMD-Internal: [CPUPL-2966] Change-Id: I8e43f1e2b562c382f44278cd47b3d1e84a4d24c9	2023-03-27 23:18:32 -05:00
Mangala V	62d63eb1ba	AVX-512 based 4xk and 12xk packing kernel for dcomplex AVX512 packing kernel supports: 1. Dcomplex datatype 2. Row and column major matrix AVX512 packing kernel does not support: 1. General stride matrix 2. Fringe cases (only multiples of 4 or 12 are supported) 3. Conjugate is not supported scal2m will be used for above unsupported functionality AVX512 packing kernel is column preferred kernel If matrix is row major, we need to transpose block before storing it. If matrix is column major, we directly store it AMD-Internal: [CPUPL-3088] Change-Id: I3fcd94248a3a6527c807cccc1b3408db9fe2a737	2023-03-28 00:19:08 +05:30
Arnav Sharma	0bfb0393fd	Bug fix for C/ZAXPBY Kernels - Complex AXPBY kernels gave incorrect output when both alpha and beta had non-zero imaginary parts. - Previously, the scalar code (used to calculate remainder result or non-unit increment cases) was directly accessing and updating the y-vector pointer, thus resulting in an incorrect output. Updated it to operate on a local copy of the current y element and store the final result to the y-pointer. - Also, added operation to store temporary calculation of alpha*x in an intermediate vector and then later added to the y vector. AMD-Internal: [CPUPL-3037] Change-Id: Iddbd3000dcb1505b444b0ad41ab881b055842e1c	2023-03-24 07:32:43 -04:00
Harsh Dave	c1766e312a	Added in row storage support for C matrix. - Added in-register transpose support for c matrix to support row stored C matrix for dgemm sup. - Support is added for all edge case kernels. - FMA are made independent of each other, for faster computation while storing data back to C matrix. AMD-Internal: [CPUPL-2966] Change-Id: I1d13af99a17ee66adbf5f537a4664ade489a7cad	2023-03-24 02:46:21 -04:00
Meghana Vankadari	31a4203c32	Added AVX-512 based col-preferred kernels for DGEMM with optional pack framework - Main kernel is of size 24x8 and the associated fringe kernels added are - 24x7m, 24x6m, 24x5m, 24x4m, 24x3m, 24x2m, 24x1m - 24x8, 24x7, 24x6, 24x5, 24x4, 24x3, 24x2, 24x1 - 16x8, 16x7, 16x6, 16x5, 16x4, 16x3, 16x2, 16x1 - 8x8, 8x7, 8x6, 8x5, 8x4, 8x3, 8x2, 8x1 - For fringe kernels, 24x? kernel handles 16 < m_remainder < 24 16x? kernel handles 8 < m_remainder <= 16 8x? kernel handles 0 < m_remainder <= 8 - Added a function 'bli_zen4_override_gemm_blkszs' to override blocksizes and kernels to be used for SUP for supported storage schemes. - Updated the zen4 config to enable these kernels in zen4 path. - Thresholds are yet to be derived. - Updated CMakeLists.txt with DGEMM SUP kernels for windows build. Kernel-specific details: - K-loop is unrolled by 8 times to facilitate prefetch of B. - For every load of one column of A, the corresponding column in next panel of A is prefetched with T1 hint. - One column of C is prefetched with T0 hint per iteration of LOOP2. - TAIL_NITER is derived to be 3. - For every unroll of k-loop, one row of B is prefetched with T0 hint. - C-prefetching for row-storage is yet to be added. - B-prefetching for col-storage is yet to be added. - Support for C transpose is yet to be added. AMD-Internal: [CPUPL-2755], [CPUPL-2409] Change-Id: Ie240c893469032dc2271cbfe00cceccfe6c4ea48	2023-03-24 06:40:36 +00:00
Eleni Vlachopoulou	ad7a812db2	Remove quick return for zero increments. Details: - To be BLAS compliant, if increment is zero then iterate through the first element n times. - For n<=0, the correct result (0) is returned so we remove this extra check. This is checked on BLIS-typed interface level. AMD-Internal: [SWLCSG-1900] Change-Id: I098bb9560a790050018bc8d8c63b06bfbcc1aebd	2023-03-23 23:35:03 -04:00
Harsh Dave	238d9fda9e	Fixed ASAN memory issue due to modifying RBP register - RBP is base pointer which points to base of current stack frame. The ASAN tool relies on rbp and rsp for stack related validations. So overwriting or modifying the RBP register results in application termination with the error code of stack overflow. - Removed all the code snippets which were using rbp register for prefetching matrices and sometimes loading elements from memory in all of the gemm sup kernels for double datatype. - Removed reference to rbp from register clobber list as well to completely avoid the usage of rbp register. AMD-Internal: [CPUPL-2613, CPUPL-2587] Change-Id: Idd402d3c644c4dd66e8d4988aede539ad8c77b28	2023-03-23 21:23:44 -04:00
Shubham	dfc95d29fc	Enable DTRSM small multithreading path for BLAS interface - Enabled DTRSM small mt for sizes where performance is better than small or native. - Threshold Tuning for small path is updated. - Function signature for bli_trsm_small_mt has been made similar to bli_trsm_small so that one function pointer can be used for all functions. - Early return condition in DTRSM small for sizes > 1000 has been removed so that the sizes for which small path to take can be decided on bla layer instead of inside kernel. AMD-Internal: [CPUPL-2735] Change-Id: Ieea31343dc660517acc18c92713381a8b84d3a2f	2023-03-23 12:07:22 -04:00
Shubham	323e31649f	Added AVX512 8x24 DTRSM native path kernels - Added DGEMM and DTRSM row preferred micro kernels. - DTRSM left lower and left upper micro kernels are added. - DGEMM kernel is optimized for both row stored C and col stored C. AMD-Internal: [CPUPL-2745] Change-Id: Iecd2c1b0b0972e17e7b31e4b117e49c90def5180	2023-03-23 00:43:15 -04:00
Harihara Sudhan S	4b36529a8b	Added vector packing logic to ZGEMV variant 2 - In cases when incy != 1, a buffer is created for y vector. The contents of vector y are scaled by beta and stored in this buffer. - After performing the compute using ZAXPYF kernel, the results in y buffer memory are copied back to the original buffer using ZCOPYV. - In cases when alpha is zero, we only scale the y vector by beta without using the buffer and return. - The kernels are picked based on the architecture ID. For any zen based architecture, AVX2 kernels are invoked. For others, the kernels are invoked based on the context. - In ZSCAL2V, query for the context if NULL pointer is passed. AMD-Internal: [CPUPL-2773] Change-Id: If409ca5c438fc2eebe73480c011577088d52c65f	2023-03-22 03:19:18 -04:00
Harihara Sudhan S	a67205d8bd	Modified Double precision AMAXV kernel - The new AMAXV adheres to the BLAS definition of ISAMAX by not handling NaN separately. In the previous kernel, NaN is considered the smallest element of all the elements in the array. - The new logic uses two helper functions - bli_vec_absmax_double and bli_vec_search_double. - bli_vec_absmax_double finds the absolute largest element and the index range in which the first occurrence of this element can be found. - bli_vec_search_double returns the index of the first occurrence of the absolute value of an element. - AMAXV uses these two helper functions to find the absolute largest element and then searches using bli_vec_search_double in the reduced range provided by bli_vec_absmax_double. - Added condition check for n == 1 in BLAS layer. It is an optimization mentioned in the BLAS standard API definition. - Removed redundant n == 0 condition check from the kernel. This is a BLAS exception and is already done in the BLAS layer. - Removed AVX2 flag check from the BLAS layer. Kernels will be picked based on the architecture ID in the new design. AMD-Internal: [CPUPL-2773] Change-Id: Ida2dae84a60742e632dc810ab1b7b80fc354e178	2023-03-21 05:18:05 -04:00
Harsh Dave	f7b9bc734e	Added AVX512 24xk packing kernel - Added AVX-512 24xk packing kernel. - Prefetching is yet to be added. AMD-Internal: [CPUPL-2966] Change-Id: Iaf973788b51172b48ebb73e8f24b8f4020d94de2	2023-03-17 02:00:00 -05:00
mkadavil	3d74b62e60	Lpgemm threading and micro-kernel optimizations. -Certain sections of the f32 avx512 micro-kernel were observed to slow down when more post-ops are added. Analysis of the binary pointed to false dependencies in instructions being introduced in the presence of the extra post-ops. Addition of vzeroupper at the beginning of ir loop in f32 micro-kernel fixes this issue. -F32 gemm (lpgemm) thread factorization tuning for zen4/zen3 added. -Alpha scaling (multiply instruction) by default was resulting in performance regression when k dimension is small and alpha=1 in s32 micro-kernels. Alpha scaling is now only done when alpha != 1. -s16 micro-kernel performance was observed to be regressing when compiled with gcc for zen3 and older architecture supporting avx2. This issue is not observed when compiling using gcc with avx512 support enabled. The root cause was identified to be the -fgcse optimization flag in O2 when applied with avx2 support. This flag is now disabled for zen3 and older zen configs. AMD-Internal: [CPUPL-3067] Change-Id: I5aef9013432c037eb2edf28fdc89470a2eddad1c	2023-03-16 11:44:51 +05:30
eashdash	e36f699939	Implemented ERF Based GeLU Activation for LPGEMM and SGEMM 1. Implemented efficient AVX-512, AVX-2 and SSE-2 version of the error function - ERF 2. Added error function based GeLU activation post-ops for the S32, S16 and BF16 (LPGEMM) and SGEMM APIs. 3. Changes for this include frame and micro-kernel level changes in addition to adding the macro based function definitions of the ERF function in the math-utils and gelu header files. AMD-Internal: [CPUPL-3036] Change-Id: Ie50f6dcabf8896b7a6d30bbc16aa44392cc512be	2023-03-13 06:10:31 -04:00
Vignesh Balasubramanian	c53b0c96ec	Extended the support for beta scaling in 3x4 RD kernels of ZGEMM SUP - Extended the existing support for handling beta scaling in the fringe cases of 3x4 RD kernels in ZGEMM SUP. The added support ensures that NaN values initialized in C do not propagate to the result when beta is 0. - The support has been added to fringe cases common to, as well as specific to the m and n variants of the RD kernels. AMD-Internal: [CPUPL-3053] [SWLCSG-1900] Change-Id: I8e617ac505144c3ea3a70556413d264f11dfc9a9	2023-03-08 06:16:58 -05:00
Eleni Vlachopoulou	1416ab6b5e	Bugfix on the memory access for nrm2 AVX2 implementations. - Since the definition of negative increments is different between BLAS and BLIS, there was bug on how the memory was accessed when we were copying the elements of a vector with negative increments. Updated the code under the assumption that when negative increments are set, the vector is being accessed starting from the end. For the BLAS interface, there is an intermediate conversion before calling into the blis layer. Change-Id: I08343472b418733fad6f7add9e90aa96cdf68285 AMD-Internal: [SWLCSG-1900]	2023-03-08 04:21:36 -05:00
Harihara Sudhan S	d4901f53ce	Improved performance of double complex AXPYV kernel - Increased unroll by reusing X registers that was previously used for performing shuffle. - Added loops with smaller increment steps for better problem decomposition. - Added X vector and Y vector prefetch to the kernel. - Removed redundant code that handles fringe in incx = 1 and incy = 1. This remainder will be performed by the loop that handles non-unit stride cases. - Vectorized loops that handle non-unit stride cases using SSE instructions. AMD-Internal: [CPUPL-2773] Change-Id: Ifb5dc128e17b4e21315789bfaa147e3a7ec976f0	2023-03-02 00:35:42 -05:00
Harsh Dave	222e00e840	Obliterated usage of rbp register in SUP gemm kernel - Mx4 edge kernels were overwriting rbp registers for prefetches. - Since rbp along with rsp defines stack frame, it resulted in stack overflow issue. - Replaced rbp with rdx register for prefetches. AMD-Internal: [CPUPL-2987] Change-Id: I4e52cf691b70be5ab63f562d7630d640b29e1cfd	2023-03-01 11:09:57 -05:00
Harihara Sudhan S	299bed3fa8	Vectorized SCAL2V for double complex - Added SCAL2V kernel that uses AVX2 and SSE instructions for vectorization. - The routine returns early when the vector dimension is zero or incx <= 0 or incy <= 0. - The kernel takes one among the two available paths based on conjugation requirement of X vector. - VZEROUPPER is added before transitioning from AVX2 to SSE. - Added function pointer to ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts. - Added the new SCAL2V file to the CMAKE list. AMD-Internal: [CPUPL-2773] Change-Id: I2debbfab31d41347786c3a1bae5723d092c202e9	2023-02-28 06:59:23 -05:00
Shubham	b84157fed6	Added AVX512 DTRSM small RLNN/RUTN variant kernels - 8x8 kernels are used for DTRSM SMALL - Implemented fringe cases with below block sizes 8x8, 8x4, 8x3, 8x2, 8x1 4x8, 4x4, 4x3, 4x2, 4x1 3x8, 3x4, 3x3, 3x2, 3x1 2x8, 2x4, 2x3, 2x2, 2x1 1x8, 1x4, 1x3, 1x2, 1x1 AMD-Internal: [CPUPL-2745] Change-Id: Ifb8cfba6958e1c89ddbfa18893127ab6d44cc367	2023-02-28 01:40:03 +05:30
mkadavil	1f2447f800	Post-ops support for f32 gemm(aocl_gemm). - Bias add, relu, parametric relu and gelu post-ops support added in all f32 gemm micro-kernels. These post-ops are implemented for both AVX512 and AVX2 ISA based on the micro-kernel flavor. The support is added for both row and column major cases. - Lpgemm bench updates to support f32 post-ops. AMD-Internal: [CPUPL-3032] Change-Id: Ie6840b9d4e52d2086c1b5ff2e1de80dc0cad5476	2023-02-23 18:58:59 +05:30
Arnav Sharma	5baf38b76d	K-Loop Unrolling for 6x64 SGEMM SUP RV kernels - Added k-loop unrolling by a factor of 4 to the following SGEMM SUP RV kernels: - 5x48, 5x32, 5x16 - 4x64, 4x48, 3x32, 4x16 - 3x48, 3x32, 3x16 - 2x64, 2x48, 2x32, 2x16 - 1x64, 1x48, 1x32, 1x16 - 6x64n, 5x64n, 3x64n, 2x64n, 1x64n - Removed unused variables which were resulting in warnings during compilation. - Added a newline at the end of header files to resolve warnings shown during compilation. AMD-Internal: [CPUPL-3002] Change-Id: Iab6cf329f6d7fbd7544b5c8837e493069e8c9921	2023-02-23 04:08:38 -05:00
mkadavil	8dff49837d	Lpgemm source restructuring to support amdzen config. -Currently lpgemm can only be built using either zen3 or zen4 config. The lpgemm kernel code is re-structured to support amdzen, and thus multi machine deployment. -The micro-kernel calls (gemm and pack) are currently hardcoded in the lpgemm framework. This is removed and a new lpgemm_cntx based dispatch mechanism is designed to support runtime configurability for micro-kernels. AMD-Internal: [CPUPL-2965] Change-Id: I4bbcb4e5db767def1663caf5481f0b4c988149ef	2023-02-21 08:35:38 -05:00
Harihara Sudhan S	dfacb47125	Code cleanup to remove multiple function declaration - Removed repetitive function declaration of GEMM small kernels from the C files - Function declaration of these kernels exist in the header files where the kernels are supposed to be declared. AMD-Internal: [CPUPL-3003] Change-Id: Ic10e66691c0742ce519bcc3fe4a12ec5c5052b63	2023-02-21 00:35:31 -05:00
Shubham	1faee9f89e	Fixed 32xk AVX512 double precision pack kernel - Currently the pointer received as function argument is used for packing which causes only a partial copy of input buffer to output buffer due to strange optimizations by compiler. - To fix this, instead of using a normal pointer for output buffer, we define a "restrict" local pointer variable. - "restrict" keyword tells the compiler that the pointer is the only way to access the object pointed by the pointer. - By defining "restrict" local pointer pointing to output buffer, the mysterious problem of incomplete copy has been solved. Change-Id: Ie2355beb1d43ff4b60b940dd88c4e2bf6f361646	2023-02-20 21:18:54 -05:00
Arnav Sharma	46965dfc57	Developed 6x64 SGEMM row-preferred kernels - Added kernels for all rv and rd variants. - Main kernel is of size 6x64, and the associated fringe kernels added are - 4x64, 2x64, 1x64 - 6x32, 4x32, 2x32, 1x32 - 6x16, 4x16, 2x16, 1x16 - Updated the zen4 config to enable these kernels in zen4 path. - Added C-prefetching to 6x? row-stored main kernels. - C-prefetching for column storage yet to be added. - K-loop unrolling for fringe kernels yet to be added. AMD-Internal: [CPUPL-3002] Change-Id: Ide18412cc6178b43a12a3bc7a608ce9d298fb2e4	2023-02-19 23:56:58 -05:00
Harihara Sudhan S	535b49f150	Vectorized COPYV for double complex - Added ZCOPYV kernel that uses AVX2 and SSE instructions for vectorization. - The routine returns early when the vector dimension is zero. - The kernel takes one among the two available paths based on conjugation requirement of X vector. - VZEROUPPER is added before transitioning from AVX2 to SSE. - Added function pointer to ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts. AMD-Internal: [CPUPL-2773] Change-Id: Ibd8a2de42060716395ef698d753c8462654cc0f0	2023-02-16 09:32:10 -05:00
Shubham	18569b42ee	Added DTRSM small RUNN/RLTN variant AVX512 kernels - 8x8 kernels are used for DTRSM SMALL - Matrix A(a10) is packed for GEMM operations. - Packed matrix A will be re-used in all the col-block along N-dimension. - Diagonal elements of A matrix are packed(a11) for TRSM operations. - Implemented fringe cases with following block sizes 8x8, 8x4, 8x3, 8x2, 8x1 4x8, 4x4, 4x3, 4x2, 4x1 3x8, 3x4, 3x3, 3x2, 3x1 2x8, 2x4, 2x3, 2x2, 2x1 1x8, 1x4, 1x3, 1x2, 1x1 AMD-Internal: [CPUPL-2745] Change-Id: I6a174e7f88a4c2c5778052525879552a1e82f6ad	2023-02-16 06:49:46 -05:00
Harihara Sudhan S	8d7593a940	AVX2 vectorized SCALV for double complex - Added ZSCALV that uses AVX2 and SSE instructions for vectorization. - Return early when the vector dimension is zero. When alpha is 1 there is no need to perform computation hence return early. - When alpha is zero expert interface of ZSETV is invoked. In this case, all the elements of the input vector are set to 0. - Invocation of expert interface means that NULL pointer can be passed to the function in place of context. Expert interface of ZSETV will query the context and get the appropriate function pointer. - Added BLAS interface for ZSCALV. The architecture ID is used to decide the function that is to be invoked. - Created a new macro INSERT_GENTFUNCSCAL_BLAS_C to instantiate SCALV BLAS macro interface only for single complex type and single complex, float mixed type AMD-Internal: [CPUPL-2773] Change-Id: I0d6995bce883c0ebdc5da0046608fc59d03f6050	2023-02-05 09:36:12 -05:00
Harihara Sudhan S	18ae57305e	ZAXPYF4 optimization - Vectorized alpha scaling of X vector using SSE instructions. This can be done irrespective of incx. - Added code to prefetch A matrix and Y vector to L1 cache - Vectorized fringe case computation and non-unit stride computation with SSE instructions. - Increased unroll in unit stride cases for better register utilization. AMD-Internal: [CPUPL-2773] Change-Id: I217e6ce9e3f5753ebe271c684abd9a2274fd2715	2023-02-04 12:34:50 -05:00
Kiran Varaganti	201db7883c	Integrated 32x6 DGEMM kernel for zen4 and its related changes are added. Details: - Now AOCL BLIS uses AVX512 - 32x6 DGEMM kernel for native code path. Thanks to Moore, Branden <Branden.Moore@amd.com> for suggesting and implementing these optimizations. - In the initial version of 32x6 DGEMM kernel, to broadcast elements of B packed we perform load into xmm (2 elements), broadcast into zmm from xmm and then to get the next element, we do vpermilpd(xmm). This logic is replaced with direct broadcast from memory, since the elements of Bpack are stored contiguously, the first broadcast fetches the cacheline and then subsequent broadcasts happen faster. We use two registers for broadcast and interleave broadcast operation with FMAs to hide any memory latencies. - Native dTRSM uses 16x14 dgemm - therefore we need to override the default blkszs (MR,NR,..) when executing trsm. we call bli_zen4_override_trsm_blkszs(cntx_local) on a local cntx_t object for double data-type as well in the function bli_trsm_front(), bli_trsm_xx_ker_var2, xx = {ll,lu,rl,ru}. Renamed "BLIS_GEMM_AVX2_UKR" to "BLIS_GEMM_FOR_TRSM_UKR" and in the bli_cntx_init_zen4() we replaced dgemm kernel for TRSM with 16x14 dgemm kernel. - New packm kernels - 16xk, 24xk and 32xk are added. - New 32xk packm reference kernel is added in bli_packm_cxk_ref.c and it is enabled for zen4 config (bli_dpackm_32xk_zen4_ref() ) - Copyright year updated for modified files. - cleaned up code for "zen" config - removed unused packm kernels declaration in kernels/zen/bli_kernels.h - [SWLCSG-1374], [CPUPL-2918] Change-Id: I576282382504b72072a6db068eabd164c8943627	2023-01-19 23:11:36 +05:30
Kiran Varaganti	d8d4499e54	AVX2 dgemm kernel optimization for AOCC Details: k0 is always positive in bli_dgemm_haswell_asm_6x8(), the operation involved with k0 is typecasted to uint64_t to enable AOCC to generate optimized code. Thanks to Jini Susan (jinisusan.george@amd.com) from compiler team for suggesting this change. Similar change was applied to sgemm, cgemm and zgemm kernels. Change-Id: I423c949e0c1835652142a6931dadf4a7d190aeb9	2023-01-09 07:49:41 -05:00
Edward Smyth	82c2eb4e8e	Code cleanup and warnings fixes Corrections for some occurrences of: - Compiler warnings about initialization of float from double - Spelling mistakes in comments - Incorrect indentation of code and comments AMD-Internal: [CPUPL-2870] Change-Id: Icb68c789687bd0684844331d43071bfffecac9fc	2023-01-09 04:34:52 -05:00
Vignesh Balasubramanian	dbd0c069d4	Implemented 3x4 RD kernels for ZGEMM in SUP path -Implemented (r)ow preferential (d)ot product milli-kernels (m and n variants) for dcomplex datatype along SUP path. -These computational kernels extend the support for handling RRC and CRC storage schemes along the SUP path. In case of BLAS api call, it corresponds to the input cases with transa equal to T and transb equal to N. -In case of the B matrix being packed(conditionally), the inputs are redirected to the existing (r)ow preferential (v)ector load optimized kernels due to better performance. -Added macro for vhsubpd assembly instruction, to support the arithmetic for complex datatype in its interleaved storage. AMD-Internal: [CPUPL-2593] Change-Id: If90834e55e9e31aa87d3d5b711efad9ef2458da8	2022-12-22 18:31:46 +05:30
Nithya V S	6e9defe1b5	Developed AVX512 based "sup" kernels for SGEMM - Only for RRR case, var2m kernels are added - Main kernel is of 12x32 (AVX512), associated fringe kernels of - 8x32, 4x32, 2x32, 1x32 (AVX512) - 12x16, 8x16, 4x16, 2x16, 1x16 (AVX512) - 12x8, 8x8 (AVX2) - 12x4, 8x4 (SSE4) - 12x2, 8x2 (SSE4) - existing AVX2/SSE4 kernels are used for other fringe cases - Currently, these kernels are not invoked in zen4 path - Once all AVX512 kernels (n and rd) are done, invoke all of them together in zen4 config AMD-Internal: [CPUPL-2801] Change-Id: I7a206fee9151e92319d83dcc5f3eed61d3bf1196	2022-12-16 10:18:09 +05:30
Eleni Vlachopoulou	13aa3c8cd0	Adding AVX2 support for SNRM2 and SCNRM2 - For the cases where AVX2 is available, an optimized function is called, based on Blue's algorithm. The fallback method based on sumsqv is used otherwise. - Scaling is used to avoid overflow and underflow. - Works correctly for negative increments. AMD-Internal: [SWLCSG-1080] Change-Id: I6bf2f42652ba6b8a8631a0a9e6f6297d5b3ea5d9	2022-12-14 04:25:45 -05:00
Edward Smyth	7f86561d26	BLIS-Nov2022: HPL memory issues with GCC. HPL script was using BLIS manual way to set threading, i.e. setting BLIS_IC_NT explicitly. This causes bli_rntm_num_threads() to return -1, which wasn't trapped in parallelised BLAS1 and BLAS2 routines. Fix: if this occurs, set local number of threads based on product of BLIS_JC_NT * BLIS_PC_NT * BLIS_IC_NT * BLIS_JR_NT * BLIS_IR_NT values. Note: BLIS_PC_NT should always be 1, but this environment variable is currently being read (contrary to documentation), so include it for now. Other changes: * implement _Pragma convention in all code used on AMD * frame/2/gemv/bli_gemv_unf_var1_amd.c: Remove is_omp_mt_enabled flag AMD-Internal: [CPUPL-2803] Change-Id: I37e8b038e5640d6693a87be0609888186322b465	2022-12-06 05:10:34 -05:00
Edward Smyth	34730a1e4c	BLIS: Nested parallelism issues 1. Check OpenMP active level against max active levels when setting number of threads for starting a new parallel region in ./frame/thread/bli_thread.c to ensure the correct number of threads is used when BLIS is called within nested OpenMP parallelism. 2. In subsequent BLIS calls, threading choices could be incorrectly set based on values used and stored in global_rntm by a previous call. This could apply when the OpenMP number of threads differ from call to call, different nested parallelism is used in different parts of a user's code, or different threads at the user level request different numbers of OpenMP threads for BLIS calls. Keep threading information in both global_rntm and a new Thread Local Storage copy tl_rntm. Update tl_rntm from OpenMP runtime environment (as appropriate) during bli_init_auto() calls in each BLIS routine. The details are: * global_rntm is initialized on first BLIS call based on OpenMP and BLIS threading environment variables. * global_rntm is updated by any BLIS threading function calls. * In bli_thread_update_tl(), called by bli_init_auto(), sync with any BLIS values set or updated in global_rntm. Then, if BLIS threading control is not used, check OpenMP ICVs and set thread count and auto_factor appropriately. * Setting BLIS threading locally (using expert interfaces to pass a user defined rntm data structure) should work as before. 3. bli_thread_get_is_parallel can now only be called outside of parallelism within BLIS routines. Change calls in trsm to reflect this. 4. Ensure blis_mt is set to TRUE in bli_thread_init_rntm_from_env() if any BLIS_*_NT environment variables are set. 5. Set auto_factor = FALSE when the number of threads is 1. 6. bli_rntm_set_num_threads() and bli_rntm_set_ways() set blis_mt=TRUE. 7. Set blis_mt=FALSE in BLIS_RNTM_INITIALIZER and bli_rntm_init(). 8. For debugging, internal information on the rntm threading data can be printed by defining "PRINT_THREADING" at the top of bli_rntm.h 9. bli_rntm_print() now also prints the value of blis_mt. 10. Function prototypes in bli_rntm.h moved to top of file, so that bli_rntm_print() can be used within inline functions defined in this header file. 11. Comment out bli_init_auto() and bli_finalize_auto() calls in Fortran interfaces in frame/compat/blis/thread/b77_thread.c 12. In frame/3/bli_l3_sup_int_amd.c move two calls to set_pack_a and set_pack_b functions outside of the auto_factor if statements. 13. Misc code tidying. AMD-Internal: [CPUPL-2433] Change-Id: I8342c37fb4e280118e5e55164fbd6ea636f858ee	2022-10-21 07:38:39 -04:00
Harihara Sudhan S	42d631bced	Copyright modification - Added copyright information to modified/newly created files missing them Change-Id: If4e73b680246d0363de09587d6dc54bee00ecd71	2022-10-14 12:43:35 +05:30
Shubham Sharma	2d72d4c862	DTRSM Native path AVX 512 kernels are developed - Added DTRSM AVX512 kernels for lower and upper variants in the native path. - Changes in framework are made to accommodate these kernels. AMD-Internal: [CPUPL-2588] Change-Id: I1f74273ef2389018343c0645870290373ce25efe	2022-10-01 15:58:24 +00:00
Nallani Bhaskar	f1caa8e630	Fixed initialization issues in ddot multithread implementation Description: 1. n value per thread and offset are defined outside omp loop which can vary for each thread. Moved them to inside the omp loop so that the output 2. Fixed a few compilation warnings in dnrm2 avx2 implementation AMD Internal:[CPUPL-2606] Change-Id: Ifba9b3707c3c1a66f31b5e1906ecb68eabef4f81	2022-10-01 14:43:39 +05:30
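
Commit 7f86561d26 above describes a fallback for when bli_rntm_num_threads() returns -1: the local thread count is taken as the product of the BLIS_JC_NT, BLIS_PC_NT, BLIS_IC_NT, BLIS_JR_NT and BLIS_IR_NT values. A minimal sketch of that arithmetic follows. It is not the BLIS source: the helper names product_of_ways and effective_num_threads are hypothetical, and treating values below 1 as 1 is an assumption made only for this illustration.

```c
#include <stddef.h>

/* Hypothetical sketch, not BLIS code: names and defaults are assumptions. */

/* Multiply the five per-loop ways of parallelism, given in the order
   JC, PC, IC, JR, IR. Values below 1 are treated as 1 in this sketch. */
static long product_of_ways(const long ways[5])
{
    long total = 1;
    for (size_t i = 0; i < 5; ++i)
        total *= (ways[i] < 1) ? 1 : ways[i];
    return total;
}

/* If the reported thread count is not positive (for example -1), fall back
   to the product of the ways; otherwise keep the reported count. */
static long effective_num_threads(long reported, const long ways[5])
{
    return (reported > 0) ? reported : product_of_ways(ways);
}
```

For example, with ways {2, 1, 4, 1, 1} and a reported count of -1, the sketch yields 8 threads; a positive reported count is returned unchanged.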

546 Commits