amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-22 01:18:18 +00:00

Author	SHA1	Message	Date
Eleni Vlachopoulou	75a4d2f72f	CMake: Adding new portable CMake system. - A completely new system, made to be closer to Make system. AMD-Internal: [CPUPL-2748] Change-Id: I83232786406cdc4f0a0950fb6ac8f551e5968529	2023-11-09 15:49:45 +05:30
Edward Smyth	9500cbee63	Code cleanup: spelling corrections Corrections for some spelling mistakes in comments. AMD-Internal: [CPUPL-3519] Change-Id: I9a82518cde6476bc77fc3861a4b9f8729c6380ba	2023-11-09 00:16:30 -05:00
Harsh Dave	0de10cc86c	Added k=1 avx512 dgemm kernel. - This commit implements an avx512 dgemm kernel for k=1 cases, which gets called for the zen4 codepath. - Added architecture check for k=1 kernel in dgemm code path to pick the correct kernel based on cpu architecture, since blis now has avx2 and avx512 dgemm kernels for the k=1 case. - Previously in dgemm path bli_dgemm_8x6_avx2_k1_nn kernel was being called irrespective of architecture type. - Added architecture check before calling the kernel for case where k=1, so this kernel is invoked only for the respective architectures. AMD-Internal: [CPUPL-4017] Change-Id: I418bbc933b41db41d323b331c6d89893868a6971	2023-11-07 01:10:09 -05:00
Shubham Sharma	ffa8f584be	Added ZTRSM AVX512 native path kernels - Added 4x12 ZGEMM row-preferred kernel. - Added 4x12 ZTRSM row-preferred lower and upper kernels using AVX512 ISA. - These kernels are used for ZTRSM only, zgemm still uses 12x4 kernel. - Kernels support row/col/gen storage. - Kernels support A prefetch, B prefetch, A_next prefetch, B_next prefetch and c prefetch. - B prefetch, B_next prefetch and C prefetch are enabled by default. - Updated CMakeLists.txt with ZGEMM kernels for windows build. AMD-Internal: [CPUPL-3781] Change-Id: I0fb4b2ec2f4bd66db6499c25f12bcc4bdb09804a	2023-11-03 09:42:24 -04:00
Nallani Bhaskar	b3391ef5da	Updated ERF threshold and packa changes in bf16 Description: 1. Updated ERF function threshold from 3.91920590400 to 3.553 to match with the reference erf float implementation, which reduced errors at the borders and also clipped the output to 1.0 2. Updated packa function call with pack function ptr in bf16 api to avoid compilation issues for non avx512bf16 archs 3. Updated lpgemm bench [AMD-Internal: SWLCSG-2423 ] Change-Id: Id432c0669521285e6e6a151739d9a72a7340381d	2023-10-29 23:55:46 +05:30
Shubham Sharma	d45d1d68c6	Reset ZMM Registers before exiting, in L3 APIs - Register ZMM16 to ZMM31 are zeroed after L3 api calls. - This change is done only for ZEN4 code path. - bli_zero_zmm function is added which resets these registers. AMD-Internal: [CPUPL-3882] Change-Id: I7f16fde567c72ae6e9d5d6c6d5d167dd7d54a3b8 (cherry picked from commit d245ef5fb264cd1fcfa03c842ea97a436a26e7a2)	2023-10-27 00:51:04 -04:00
Harsh Dave	edbbfd9a86	Optimized AVX512 DGEMM SUP edge kernels - For edge kernels which handle the corner cases, and especially for cases where there is a really small amount of computation to be done, executing FMA efficiently becomes very crucial. - In the previous implementation, edge kernels were using the same, limited number of vector registers to hold FMA results, which indirectly creates a dependency on the previous FMA to complete before the CPU can issue a new FMA. - This commit addresses this issue by using different vector registers that are available at disposal to hold FMA results. - That way we hold FMA results in two sets of vector registers, so that a subsequent FMA won't have to wait for the previous FMA to complete. - At the end of the un-rolled K loop these two sets of vector registers are added together to store the correct result in the intended vector registers. - Following kernels are modified: bli_dgemmsup_rv_zen4_asm_24x4m, bli_dgemmsup_rv_zen4_asm_24x3m, bli_dgemmsup_rv_zen4_asm_24x2m, bli_dgemmsup_rv_zen4_asm_24x1m, bli_dgemmsup_rv_zen4_asm_24x1, bli_dgemmsup_rv_zen4_asm_16x1, bli_dgemmsup_rv_zen4_asm_8x1, bli_dgemmsup_rv_zen4_asm_24x2, bli_dgemmsup_rv_zen4_asm_16x2, bli_dgemmsup_rv_zen4_asm_8x2, bli_dgemmsup_rv_zen4_asm_24x3, bli_dgemmsup_rv_zen4_asm_16x3, bli_dgemmsup_rv_zen4_asm_8x3, bli_dgemmsup_rv_zen4_asm_16x4, bli_dgemmsup_rv_zen4_asm_8x4, bli_dgemmsup_rv_zen4_asm_16x5, bli_dgemmsup_rv_zen4_asm_8x5, bli_dgemmsup_rv_zen4_asm_16x6, bli_dgemmsup_rv_zen4_asm_8x6, bli_dgemmsup_rv_zen4_asm_8x7, bli_dgemmsup_rv_zen4_asm_8x8 AMD-Internal: [CPUPL-3574] Change-Id: I318ff8e2f075820bcc0505aa1c13d0679f73af44	2023-10-16 04:03:56 -04:00
Meghana Vankadari	eb5ab3f762	LPGEMM: Added transB support for bf16bf16f32o<bf16\|f32> APIs Details: - Modified aocl_get_reorder_buf_size_ and aocl_reorder_ APIs to allow reordering from column major input matrix. - Added new pack kernels that packs/reorders B matrix from column-major input format. - Updated Early-return check conditions to account for trans parameters. - Updated bench file to test/benchmark transpose support. AMD-Internal: [CPUPL-2268] Change-Id: Ida66d7e3033c52cca0229c6b78d16976fbbecc4c	2023-10-12 23:36:18 +05:30
bhaskarn	5fd24c27a7	Updated expf max min precision to fix nan issue in Tanh Description: The expf_max and expf_min have more precision than the computation, which leads to crossing the clipping threshold at the edge case and causes NaNs in the tanh output. Updated the thresholds to less precision to clip the edge cases and avoid NaNs in the tanh output. AMD-Internal: [SWLCSG-2423 ] Change-Id: I25a665475692f47443f30ca5dd09e8e06a0bfe29	2023-10-12 01:04:59 -04:00
Meghana Vankadari	4874895a68	LPGEMM: Added transA support for bf16bf16f32o<bf16\|f32> APIs Details: - Added new params(order, trans) to aocl_get_reorder_buf_size_ and aocl_reorder_ APIs. - Added new pack kernels that packs A matrix from either row-major or column major input matrix to pack buffer with row-major format. - Updated cntx with pack kernel function pointers for packing A matrix. - Transpose of A matrix is handled by packing A matrix to row-major format during run-time. - Updated Early-return check conditions to account for trans parameters. - Updated bench file to test/benchmark transpose support. AMD-Internal: [SWLCSG-2268, SWLCSG-2442] Change-Id: I43a113dc4bc11e6bb7cc4d768c239a16cb6bbea4	2023-10-11 07:16:08 -04:00
mkadavil	c3b97559c1	Zero Point support for <u\|s>8s8s<32\|16>os8 LPGEMM APIs -Downscaled / quantized value is calculated using the formula x' = (x / scale_factor) + zero_point. As it stands, the micro-kernels for these APIs only support scaling. Zero point addition is implemented as part of this commit, with it being fused as part of the downscale post-op in the micro-kernel. The zero point input is a vector of int8 values, and currently only vector based zero point addition is supported. -Bench enhancements to test/benchmark zero point addition. AMD-Internal: [SWLCSG-2332] Change-Id: I96b4b1e5a384a4683b50ca310dcfb63debb1ebea	2023-10-10 12:05:47 +05:30
Arnav Sharma	f0416cff08	SGEMM SUP Panel Stride Bug Fix - The AVX512 SGEMM SUP rv m and n kernels did not accommodate the use of panel strides in case of packed matrices, thus resulting in incorrect matrix strides when packing was explicitly enabled using BLIS_PACK_A=1, BLIS_PACK_B=1 or both. - The kernels are updated to use panel strides for traversing both A and B matrix buffers accurately. [AMD-Internal]: CPUPL-3673 Change-Id: I4341ed7e1e1419cc3e2063b06f278edcb9145adb	2023-09-27 03:02:24 -04:00
Edward Smyth	bb4c158e63	Merge commit 'b683d01b' into amd-main * commit 'b683d01b': Use extra #undef when including ba/ex API headers. Minor preprocessor/header cleanup. Fixed typo in cpp guard in bli_util_ft.h. Defined eqsc, eqv, eqm to test object equality. Defined setijv, getijv to set/get vector elements. Minor API breakage in bli_pack API. Add err_t* "return" parameter to malloc functions. Always stay initialized after BLAS compat calls. Renamed membrk files/vars/functions to pba. Switch allocator mutexes to static initialization. AMD-Internal: [CPUPL-2698] Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df	2023-08-21 07:01:38 -04:00
Harihara Sudhan S	278ca71706	Fixes for GEMV Functionality Issues - Added call to dsetv in dscalv. When DSCALV is invoked by DGEMV the SCAL function is expected to SET the vector to zero when alpha is 0. This change is done to ensure BLAS compatibility of DGEMV. - Fixed bug in DGEMV var 1. Reverted changes in DGEMV var 1 to remove packing and dispatch logic. - CMAKE now builds with _amd files for unf_var2 of GEMV. AMD-Internal: [CPUPL-3772] Change-Id: I0d60c9e1025a3a56419d6ae47ded509d50e5eade	2023-08-14 13:54:48 +05:30
Edward Smyth	c445f192d5	BLIS: Missing clobbers (batch 6) More missing clobbers in skx and zen4 kernels, missed in previous commits. AMD-Internal: [CPUPL-3521] Change-Id: I838240f0539af4bf977a10d20302a40c34710858	2023-08-07 10:52:23 -04:00
Harihara Sudhan S	c97471dce0	Added AVX512 ZDSCALV kernel - Added AVX512-based kernel for ZDSCAL. This will be dispatched from the BLAS layer for machines that have AVX512 flags. - In AVX2 kernel for ZDSCALV, vectorized fringe compute using SSE instructions. - Removed the negative incx handling checks from the blis_impli layer of ZDSCAL as BLAS expects early return for incx <= 0. AMD-Internal: [CPUPL-3648] Change-Id: I820808e3158036502b78b703f5f7faa799e5f7d9	2023-08-06 01:51:47 -04:00
Eleni Vlachopoulou	9c613c4c03	Windows CMake bugfix in object libraries for shared library option Defining BLIS_IS_BUILDING_LIBRARY if BUILD_SHARED_LIBS=ON for the object libraries created in kernels/ directory. The macro definition was not propagated from high level CMake, so we need to define explicitly for the object libraries. AMD-Internal: [CPUPL-3241] Change-Id: Ifc5243861eb94670e7581367ef4bc7467c664d52	2023-05-24 17:30:16 +05:30
Edward Smyth	dea5fe4d12	BLIS: Missing clobbers (batch 5) Add missing clobbers for AVX512 mask registers k0-k7 in zen4 kernels. AMD-Internal: [CPUPL-3456] Change-Id: I5f28c725d7af1466df4db4cdfa2d456bbc6ab36d	2023-05-23 15:40:29 -04:00
Edward Smyth	a3adfb68cf	BLIS: Missing clobbers (batch 4) Add missing clobbers haswell (sup) kernels. AMD-Internal: [CPUPL-3456] Change-Id: I19fa97b85f75c8b8fe15d31b13768f937cc5e4cc	2023-05-23 14:57:08 -04:00
Edward Smyth	e960141fe2	BLIS: Missing clobbers (batch 2) Add missing clobbers in other zen4 kernels. AMD-Internal: [CPUPL-3456] Change-Id: I5cceb44fe100e03269cfe21d8c4c0d2171b921c3	2023-05-23 13:12:20 -04:00
Edward Smyth	ea2eea5097	BLIS: Missing clobbers (batch 1) Add missing clobbers in first batch of assembly kernels: - zen3 bli_gemmsup* - bli_zgemm_zen4_asm_12x4 - bli_gemmsup_rv_haswell_asm_sMx6 AMD-Internal: [CPUPL-3456] Change-Id: I33c321043a197b2b885cfd6cd589532fc633a6a1	2023-05-23 11:51:18 -04:00
eashdash	2c4f032e0f	Fix for lack of BF16 instruction when compiled with GCC-11 GCC-11 and below support AVX512-BF16. However, they don't support all the bf16 instructions required. For bf16 downscale APIs, when beta scaling is done, C output elements must be upscaled from BF16 type to Float type for the beta scaling operation. For this upscaling operation of bf16 to float, _mm512_cvtpbh_ps is used. This however is not supported by GCC-11 and below (but is supported from GCC 12 onwards). Lack of this instruction support in GCC-11 and below leads to compilation issues, with this instruction (_mm512_cvtpbh_ps) not being recognized. To fix this, we use a set of instructions: 1. register containing bf16 type __m256bh a1 2. Convert bf16 to float with shift left ops __m512 float_a1 = (__m512) (_mm512_sllv_epi32 (_mm512_cvtepi16_epi32 ((__m256i) a1), _mm512_set1_epi32 (16))); AMD-Internal: [CPUPL-3454] Change-Id: Ie4a9f04881c59ced088608633774b27f22b4ab8e	2023-05-19 10:15:08 +00:00
eashdash	061a68ff0d	BF16 Downscale and Performance fix for bf16 API This change contains the following: 1. Downscale optimization fix a. Similar to downscale optimizations made for s32 and s16 gemm, the following optimizations are done to improve the downscale performance for BF16 gemm b. The store to temporary float buffer can be avoided when k < KC since intermediate accumulation will not be required for the pc loop (only 1 iteration). The downscaled values (bf16) are written directly to the output C matrix. c. Within the micro-kernel when beta != 0, the bf16 data from the original C output matrix is loaded to a register, converted to float and beta scaling is applied on it at register level. This eliminates the requirement of previous design of copying the bf16 value to the temporary float buffer inside jc loop. 2. Alpha scaling a. Alpha scaling (multiply instruction) by default was resulting in performance regression when k dimension is small and alpha=1 in bf16 micro-kernels. b. Alpha scaling is now only done when alpha != 1. 3. K Fringe optimization a. Previously memcpy was used for K fringe case to load elements from A matrix in the microkernels b. Now, masked stores are used to store the downscaled and non-downscaled outputs without the need to use memcpy functions 4. N LT-16 fringe optimization a. Previously memcpy was used for N LT 16 fringe case in the microkernelsfor storing the downscaled and non-downscaled output. b. Now, masked stores are used to store the downscaled and non-downscaled outputs of BF16 without the need to use memcpy functions 5. Framework updates to avoid unnecessary pack buffer allocation a. The default allocation of the temporary pack buffer is removed and the pack buffer is now only allocated if k > KC. AMD-Internal: [CPUPL-3437] Change-Id: I71ff862e7d250559409a12a3533678c7a7951044	2023-05-18 10:02:56 -04:00
Eleni Vlachopoulou	1a7f60ff5b	Update CMake system to use object libraries for haswell, skx and zen4. - AVX2 and AVX512 flags are set up locally for each object library that requires them. - Default ENABLE_SIMD_FLAGS value is set to none and for AVX2 option the corresponding compiler flag is set globally. - To be able to build zen4 codepath when ENABLE_SIMD_FLAGS=AVX2, the compiler option is removed by removing the definition before building the corresponding object library. AMD-Internal: [CPUPL-3241] Change-Id: Ia570e60f06c4c72b7c58f4c9ca73bac4c060ae73	2023-05-12 10:04:16 -04:00
Harsh Dave	30b931ae60	Fixed compilation error due to inconsistent compiler behavior towards AVX512 zero masking instruction syntax - The code used the whitespace variant of the AVX512 mask instruction, but some compilers accept the whitespace variant and some don't; to be safe, we removed the whitespace. - Whitespace variant of masked instruction "vmovupd (%rax,%r8,1),%zmm8{%k2} {z}" is replaced with this instruction "vmovupd (%rax,%r8,1),%zmm8{%k2}{z}" to resolve the compilation failure issue. - Thanks to Shubham Sharma<shubham.sharma3@amd.com> for identifying issue. AMD-Internal: [CPUPL-1963] Change-Id: I290589132e8cce25cab0d1e4c195a7dd0a014937	2023-05-12 06:16:15 -04:00
mkadavil	b167e47091	LPGEMM frame and micro-kernel updates to fix gcc9.4 compilation issue. -Micro-kernel: Some AVX512 intrinsics(eg: _mm512_loadu_epi32) were introduced in later versions of gcc (>10) in addition to already existing masked intrinsic(eg: _mm512_mask_loadu_epi32). In order to support compilation using gcc 9.4, either the masked intrinsic or other gcc 9.4 compatible intrinsic needs to be used (eg: _mm512_loadu_si512) in LPGEMM Zen4 micro-kernels. -Frame: BF16 LPGEMM api's (aocl_gemm_bf16bf16f32obf16/bf16bf16f32of32) needs to be disabled if aocl_gemm (LPGEMM) addon is compiled using gcc 9.4. BF16 intrinsics are not supported in gcc 9.4, and the micro-kernels for BF16 LPGEMM is excluded from compilation based on GNUC macro. AMD-Internal: [CPUPL-3396] Change-Id: I096b05cdceea77e3e7fec18a5e41feccdf47f0e7	2023-05-11 18:00:18 +05:30
Mangala V	7739a3fbfe	Bug fix for 4xk AVX512 packing kernel A few tests failed on Windows OS as some registers were not added as part of the clobber list. Added the below registers to the clobber list: In function bli_zpackm_zen4_asm_12xk : ZMM12-ZMM15 In function bli_zpackm_zen4_asm_4xk : ZMM4-ZMM7 AMD-Internal: [CPUPL-3253] Change-Id: I3e42130bf1a3b48717c4b437179ae3f116e5cf1d	2023-05-05 04:15:25 +05:30
Edward Smyth	7e50ba669b	Code cleanup: No newline at end of file Some text files were missing a newline at the end of the file. One has been added. Also correct file format of windows/tests/inputs.yaml, which was missed in commit `0f0277e104` AMD-Internal: [CPUPL-2870] Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549	2023-04-21 10:02:48 -04:00
Edward Smyth	0f0277e104	Code cleanup: dos2unix file conversion Source and other files in some directories were a mixture of Unix and DOS file formats. Convert all relevant files to Unix format for consistency. Some Windows-specific files remain in DOS format. AMD-Internal: [CPUPL-2870] Change-Id: Ic9a0fddb2dba6dc8bcf0ad9b3cc93774a46caeeb	2023-04-21 08:41:16 -04:00
mkadavil	3572baa9d3	aocl_softmax_f32 api's for softmax computation as part of lpgemm. -Softmax is often used as the last activation function in a neural network - softmax(xi) = exp(xi)/(exp(x0) + exp(x1) + ... + exp(xn))). This step happens after the final low precision gemm computation, and it helps to have the softmax functionality that can be invoked as part of the lpgemm workflow. In order to support this, a new api, aocl_softmax_f32 is introduced as part of aocl_gemm. This api computes element-wise softmax of a matrix/vector of floats. This api invokes ISA specific vectorized micro-kernels (vectorized only when incx=1), and a cntx based mechanism (similar to lpgemm_cntx) is used to dispatch to the appropriate kernel. AMD-Internal: [CPUPL-3247] Change-Id: If15880360947435985fa87b6436e475571e4684a	2023-04-21 05:26:08 -04:00
Edward Smyth	6835205ba8	Code cleanup: spelling corrections Corrections for spelling and other mistakes in code comments and doc files. AMD-Internal: [CPUPL-2870] Change-Id: Ifbb5df7df2d6312fe73e06ee6d41c00b16c593ce	2023-04-19 12:44:56 -04:00
eashdash	462f9e0012	Added Custom Clip post-op support for u8s8s32os32/os8 and s8s8s32os32/os8 1. Custom Clip is a post-op which is used to clip the accumulated GEMM output within a certain range. 2. This post-op is implemented for u8s8s32os32/os8 and s8s8s32os32/os8 LPGEMM types. 3. Changes are done at the microkernel level for these 2 APIs to support Clip Post-Op AMD-Internal: [CPUPL-3207] Change-Id: I8b4da5807de6a93711b0ae9343970c55192f75d4	2023-04-18 15:21:27 -04:00
Meghana Vankadari	42d05a5aa0	DGEMM: Added decision logic to choose between sup vs native for zen4 architecture Details: - Added a new function for choosing between SUP and native implementation for a given size. - This function pointer is stored in cntx for zen4 config. - Divided total combinations of sizes into 3 categories: - one dimension is small - Two dimensions are small - All dimensions are small - Added different threshold conditions for each of the categories. AMD-Internal: [CPUPL-2755] Change-Id: Iae4bf96bb7c9bf9f68fd909fb757d7fe13bc6caf	2023-04-17 13:08:34 -04:00
mkadavil	e23765010d	aocl_gelu_<tanh\|erf>_f32 api's for gelu computation as part of lpgemm. -Currently in aocl_gemm, gelu (both tanh and erf based) computation is only supported as a post-op as part of low precision gemm api call (done at micro-kernel level). However gelu computation alone without gemm is required in certain cases for users of aocl_gemm. -In order to support this, two new api's - aocl_gelu_tanh_f32 and aocl_gelu_erf_f32 are introduced as part of aocl_gemm. These api's compute element-wise gelu_tanh and gelu_erf respectively of a matrix/vector of floats. Both the api's invoke ISA specific vectorized micro-kernels (vectorized only when incx=1), and a cntx based mechanism (similar to lpgemm_cntx) is used to dispatch to the appropriate kernel. AMD-Internal: [CPUPL-3218] Change-Id: Ifebbaf5566d7462288a9a67f479104268b0cc704	2023-04-17 05:15:56 -04:00
eashdash	12c97021a1	Added New Post-Op - Custom Clipping for LPGEMM and SGEMM 1. Custom Clip is an element-wise post-op which is used to clip the accumulated GEMM output within a certain range. 2. The Clip Post-Op is used in downscaled and non-downscaled LPGEMM APIs and SGEMM. 3. Changes are done at frame and microkernel level to implement this post-op. 4. Different versions are implemented - AVX-512, AVX-2, SSE-2 to enable custom clipping for various LPGEMM types and SGEMM AMD-Internal: [CPUPL-3207] Change-Id: I71c60be69e5a0dc47ca9336d58181c097b9aa0c6	2023-04-17 04:38:20 -04:00
Aayush Kumar	71272ab574	Fixed Compiler warnings for GCC 12 and AOCC 4.0 - Set the variables to zero to avoid the compiler warning (-Wmaybe-uninitialized) in bli_dgemm_ref_k1.c, bli_gemm_small.c, bli_trsm_small.c, bli_zgemm_ref_k1.c and bli_trsm_small_AVX512.c - Changed the datatype from dim_t to siz_t for i,k,j in bli_hemv_unf_var1_amd.c and bli_hemv_unf_var3_amd.c to avoid the compiler warning (-Waggressive-loop-optimizations) AMD-Internal: [CPUPL-2870] Change-Id: Ib2bc050fa47cb8a280d719283ab4539c70e19d03	2023-04-14 13:29:17 +00:00
Harihara Sudhan S	15bd0f9646	Added AVX512 based double and float AXPYV - Added AVX512 based double and float AXPYV which will be used in Zen4 context. - Added n <= 0 check and alpha == 0 check to the BLAS layer of SAXPY. - Modified BLAS framework of float AXPYV to remove flag check and pick kernels based on architecture ID. - AVX512 kernel is disabled for other Zen configurations using BLIS_KERNELS_ZEN4 macro. AMD-Internal: [CPUPL-2793] Change-Id: Ie6a0976c2cfcf81ae5125f5f9aad14477d4ebbd1	2023-04-14 01:06:57 -04:00
mkadavil	5e510727a9	Masked load/store to replace copy macros in u8s8s32 micro-kernels. -As part of an earlier optimization, the memcpy function call in k fringe ((k % 4) != 0 case, to utilize vpdpbusd instruction) and n fringe (n < 16 - beta scale and C store) were replaced with copy macros specifically optimized for less than 4 and 16 elements each. However upon further analysis it was observed that masked load/broadcast and masked store performed better on average than the copy macros. The copy macros contained more if conditions, which resulted in more branching and thus resulting in perf variations. It was also noted that code generation varied a lot based on the compilers when using the copy macros due to the extra conditional code. -As part of this change, the copy macros are completely replaced with masked load/broadcast/store. Performance was observed to be better and less prone to variations for the k fringe and n fringe (< 16) cases. AMD-Internal: [CPUPL-3173] Change-Id: I73e6e65302ecf02e1397541b4a32b2a536f19503	2023-04-13 09:17:26 -04:00
Aayush Kumar	6ad387c2aa	Added DTRSM Small Path AVX512 based LUNN/LLTN Variant Kernels - 8x8 kernels are used for DTRSM SMALL - Matrix A(a10) is packed for GEMM operations. - Packed matrix A will be re-used in all the col-block along N-dimension. - Diagonal elements of A matrix are packed(a11) for TRSM operations. - Implemented fringe cases with below block sizes 8x8, 8x4, 8x3, 8x2, 8x1 4x8, 4x4, 4x3, 4x2, 4x1 3x8, 3x4, 3x3, 3x2, 3x1 2x8, 2x4, 2x3, 2x2, 2x1 1x8, 1x4, 1x3, 1x2, 1x1 AMD-Internal: [CPUPL-2745] Change-Id: I5bb57501f6d3783eb654e375d63901467dd14734	2023-04-13 01:44:31 -04:00
Harihara Sudhan S	6b8f4744a4	Added AVX512 based double and float DOTV - Added AVX512 based double and float DOTV which will be used in Zen4 context. - Added n <= 0 check to the BLAS layer of SDOTV. - Modified BLAS framework of float DOTV to remove flag check and pick kernels based on architecture ID. - AVX512 kernel is disabled for other Zen configurations using BLIS_KERNELS_ZEN4 macro. AMD-Internal: [CPUPL-2800] Change-Id: I550fbcbb17d6d887b9ecbea23237dc806b208702	2023-04-12 12:36:52 +05:30
Harihara Sudhan S	be7fb342c1	Added AVX512 based double and float SCALV - Added AVX512 based double and float SCALV which will be used in Zen4 context. - Added incx <= 0 check and alpha == 1 check to the BLAS layer of SSCAL. - Modified BLAS framework of float SCAL to remove flag check and pick kernels based on architecture ID. - AVX512 kernel is disabled for other Zen configurations using BLIS_KERNELS_ZEN4 macro. AMD-Internal: [CPUPL-2766],[CPUPL-2765] Change-Id: I4cdd93c9adbfbf8f7632730b8606ddcf70edd1dc	2023-04-11 14:41:56 +05:30
Shubham	cc25cff864	Added AVX512 flag for d24xk pack kernel for windows - on windows 24xk kernel is compiled without avx512 flag which causes out of bounds writes for DTRSM. - to fix this avx512 flag has been added to the CMakeLists.txt file for 24xk kernel. AMD-Internal: [CPUPL-3186] Change-Id: I0314dea88302fc4964a303853a4b9b719ecd8064	2023-04-09 22:38:33 +05:30
Aayush Kumar	8c537b0cd5	Added DTRSM Small Path AVX512 based LLNN/LUTN Variant Kernels - 8x8 kernels are used for DTRSM SMALL - Implemented fringe cases with below block sizes 8x8, 8x4, 8x3, 8x2, 8x1 4x8, 4x4, 4x3, 4x2, 4x1 3x8, 3x4, 3x3, 3x2, 3x1 2x8, 2x4, 2x3, 2x2, 2x1 1x8, 1x4, 1x3, 1x2, 1x1 AMD-Internal: [CPUPL-2745] Change-Id: I58d28912bddbaadb404052c0f3449ebbe3c97b68	2023-04-07 08:50:28 +00:00
Edward Smyth	1885540c5a	Code cleanup: compiler warning fixes Modify code to correct some warning messages from GCC 12.2 or AOCC 4.0: - Increase size of nbuf in blastest/f2c/endfile.c - Remove unused variables in kernels/zen/1/bli_scal2v_zen_int.c and kernels/zen/1/bli_axpyv_zen_int10.c - Remove extraneous parentheses in frame/compat/bla_trsm_amd.c and kernels/zen4/3/bli_zgemm_zen4_asm_12x4.c - Add __attribute__ ((unused)) to several variables in frame/1m/packm/bli_packm_struc_cxk.c and frame/1m/packm/bli_packm_struc_cxk_md.c AMD-Internal: [CPUPL-2870] Change-Id: I595e46f0a3d737beb393c3ab531717565220b10d	2023-04-06 06:56:09 -04:00
vignbala	775ce1f13c	Implemented AVX-512 based 12x4 m-variant SUP kernels for ZGEMM - Implemented 12x4m column preferential SUP kernels(main and fringe cases). The main kernel dimension is 12x4, and the associated fringe kernel dimensions are : 12x3m, 12x2m, 12x1m 8x4, 8x3, 8x2, 8x1 4x4, 4x3, 4x2, 4x1 2x4, 2x3, 2x2, 2x1. - Included in-register transposition support for C, thus extending the storage scheme supports to CCC, CCR, RCC and RCR inside the milli-kernel. - Integrated conditional packing of A onto the SUP front end for dcomplex datatype. This redirects RRC and CRC storage schemes onto the preceding set of SUP kernels through storage scheme transformation(RCC and CCC respectively). - Updated the zen4 context file with the new set of SUP kernels, to get enabled appropriately. Furthermore, the context file was updated with the AVX-2 dotxv signatures for dcomplex datatype. This redirects the fringe cases of type 1x? to the pre-existing AVX-2 GEMV routines. - Added C prefetching onto L2-cache, and an unroll factor of 4 for the k loop in all the kernels. - Work in progress to include conjugate support and input spectrum extension for the AVX-512 SUP kernels. The current thresholds in zen4 context is the same as that of the zen3 thresholds for ZGEMM SUP. AMD-Internal: [CPUPL-3122] Change-Id: If40bc4409c6eb188765329508cf1f24c0eb12d1e	2023-04-06 04:49:15 -04:00
mkadavil	27a9e2a0ff	u8s8s32 fringe kernel optimizations. -The n fringe micro kernels uses only a few zmm registers for computing the output (eg: 6x16 uses 6 zmm registers for output as opposed to 24 used in 6x64). This results in lot of wasted registers that if utilized can help increase the MR dimension and thus improve the reuse of registers loaded with B. Based on this concept, the existing n fringe kernels are modified (6x16 -> 12x16, 6x32 -> 9x32). It is to be noted that the maximum number of registers are not used, since it results in cache inefficient code due to the increase in MR and thus more broadcasts required from unpacked A matrix. -Compiler flag updates for AOCC build to generate loops with 64 byte alignment. This has been observed to improve performance slightly when k dimension is small. AMD-Internal: [CPUPL-3173] Change-Id: I199ce75ef71d994ffe0067dac1ed804dce1742ca	2023-04-03 05:35:18 -05:00
eashdash	bd8cd763ff	Added NEW LPGEMM TYPE- S8S8S32/S8 1. New LPGEMM type - S8S8S32/S8 is added. 2. New interface, frame and kernel files are added. 3. Frame and kernel files added/modified for S8S8S32/S8 have 2 operations - Pack B and Mat Mul 4. Pack B kernel routines to pack B matrix for VNNI and compute the sum of every column of B matrix to implement the S8S8S32 operation using the VNNI instructions. 5. Mat Mul Kernel files to compute the GEMM output using the VNNI. Here the A matrix elements are converted from int8 to uint8 (VNNI works with A matrix type uint8 only). 6. Post GEMM computation, additional operations are performed on the accumulated outputs to get the correct results. 7. With this change, two new LPGEMM APIs are introduced in LPGEMM - s8s8s32os32 and s8s8s32os8. 8. All previously added post-ops are supported on S8S8S32/S8 also. AMD-Internal: [CPUPL-3154] Change-Id: Ib18f82bde557ea4a815a63adc7870c4234bfb9d3	2023-03-31 05:44:54 -04:00
Mangala V	245fdf072c	AVX-512 based col-preferred kernels for ZGEMM in native path - Kernel block size is 12x4 - Updated the zen4 config to enable these kernels in zen4 path. - Tuned MC,KC,NC for better performance for m/n/k size > 500 - Updated CMakeLists.txt with ZGEMM kernels for windows build. Kernel supports: 1. Preload and prebroadcast of A and B 2. Prefetch of C Matrix 3. K loop is subdivided into multiple loops to maintain distance between C prefetches. 4. Special case when alpha/beta imag component is zero 5. Row/Col/General stride of Matrix C AMD-Internal: [CPUPL-2998] Change-Id: I62e3c352d475b1add3f43270805fbcee00e2e440	2023-03-28 23:05:06 -04:00
Harsh Dave	f5dc3db648	Added AVX512 8xk packing kernel AVX512 optimised kernel for Double datatype supports row and column major matrix Packing kernel is column major implementation If matrix is row major, we need to transpose block before storing it. If matrix is column major, we directly store AMD-Internal: [CPUPL-2966] Change-Id: I8e43f1e2b562c382f44278cd47b3d1e84a4d24c9	2023-03-27 23:18:32 -05:00
Mangala V	62d63eb1ba	AVX-512 based 4xk and 12xk packing kernel for dcomplex AVX512 packing kernel supports: 1. Dcomplex datatype 2. Row and column major matrix AVX512 packing kernel does not support: 1. General stride matrix 2. Fringe cases (only multiples of 4 or 12 are supported) 3. Conjugate is not supported scal2m will be used for above unsupported functionality AVX512 packing kernel is column preferred kernel If matrix is row major, we need to transpose block before storing it. If matrix is column major, we directly store it AMD-Internal: [CPUPL-3088] Change-Id: I3fcd94248a3a6527c807cccc1b3408db9fe2a737	2023-03-28 00:19:08 +05:30
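
Commit 2c4f032e0f in the log above replaces `_mm512_cvtpbh_ps` with a widen-and-shift-left sequence. The reason this works is that a bf16 value is the upper 16 bits of an IEEE-754 single-precision float. A minimal scalar sketch of the same bit manipulation follows; the function names `bf16_bits_to_float` and `bf16_array_to_float` are hypothetical and are not BLIS APIs, and the sketch assumes the bf16 input is available as a raw 16-bit pattern.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of BLIS): widen one bf16 value, given as its
   raw 16-bit pattern, to float. Shifting the pattern left by 16 places it in
   the upper half of a 32-bit word, which is the float bit pattern. */
static float bf16_bits_to_float(uint16_t bits)
{
    uint32_t widened = (uint32_t)bits << 16;
    float out;
    memcpy(&out, &widened, sizeof out); /* reinterpret bits without aliasing issues */
    return out;
}

/* Hypothetical helper: apply the conversion element by element, which is what
   the vector sequence in the commit message does for 16 lanes at once. */
static void bf16_array_to_float(const uint16_t *src, float *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = bf16_bits_to_float(src[i]);
}
```

The scalar form is only meant to show why the shift is a correct substitute for the missing intrinsic; the commit itself uses the AVX512 intrinsics quoted in its message.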

70 Commits
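
Commit c3b97559c1 in the log above states the downscale formula x' = (x / scale_factor) + zero_point, with the zero point supplied as a vector of int8 values. A scalar sketch of that formula is given below under several assumptions that the commit message does not confirm: the function names are hypothetical, a single scalar scale factor is used, the sum is rounded half away from zero, and the result is saturated to the int8 range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a BLIS API): downscale one int32 accumulator to
   int8 using x' = (x / scale_factor) + zero_point. Rounding mode and
   saturation are assumptions made for this sketch. */
static int8_t downscale_s32_to_s8(int32_t acc, float scale_factor, int8_t zero_point)
{
    assert(scale_factor != 0.0f);
    float v = (float)acc / scale_factor + (float)zero_point;
    if (v >= 127.0f)  return 127;   /* saturate high */
    if (v <= -128.0f) return -128;  /* saturate low */
    /* round the sum half away from zero, then narrow */
    return (int8_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

/* Hypothetical helper: apply the downscale to one row of accumulators with a
   per-column zero point vector, mirroring the vector-based zero point
   described in the commit message. */
static void downscale_row(const int32_t *acc, float scale_factor,
                          const int8_t *zero_point, int8_t *out, size_t n)
{
    assert(acc != NULL && zero_point != NULL && out != NULL);
    for (size_t j = 0; j < n; ++j)
        out[j] = downscale_s32_to_s8(acc[j], scale_factor, zero_point[j]);
}
```

In the actual micro-kernels this step is fused into the downscale post-op; the loop here only illustrates the arithmetic.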