amd/blis

mirror of https://github.com/amd/blis.git synced 2026-06-29 18:57:23 +00:00

Author	SHA1	Message	Date
Edward Smyth	1f0fb05277	Code cleanup: Copyright notices (2) More changes to standardize copyright formatting and correct years for some files modified in recent commits. AMD-Internal: [CPUPL-5895] Change-Id: Ie95d599710c1e0605f14bbf71467ca5f5352af12	2025-02-07 05:41:44 -05:00
Vignesh Balasubramanian	fb6dcc4edb	Support for Tiny-GEMM interface(ZGEMM) - As part of AOCL-BLAS, there exists a set of vectorized SUP kernels for GEMM, that are performant when invoked in a bare-metal fashion. - Designed a macro-based interface for handling tiny sizes in GEMM, that would utilize these kernels. This is currently instantiated for 'Z' datatype(double-precision complex). - Design breakdown : - Tiny path requires the usage of AVX2 and/or AVX512 SUP kernels, based on the micro-architecture. The decision logic for invoking tiny-path is specific to the micro-architecture. These thresholds are defined in their respective configuration directories(header files). - List of AVX2/AVX512 SUP kernels(lookup table), and their lookup functions are defined in the base-architecture from which the support starts. Since we need to support backward compatibility when defining the lookup table/functions, they are present in the kernels folder(base-architecture). - Defined a new type to be used to create the lookup table and its entries. This type holds the kernel pointer, blocking dimensions and the storage preference. - This design would only require the appropriate thresholds and the associated lookup table to be defined for the other datatypes and micro-architecture support. Thus, it is extensible. - NOTE : The SUP kernels that are listed for Tiny GEMM are m-var kernels. Thus, the blocking in framework is done accordingly. In case of adding the support for n-var, the variant information could be encoded in the object definition. - Added test-cases to validate the interface for functionality(API level tests). Also added exception value tests, which have been disabled due to the SUP kernel optimizations. AMD-Internal: [CPUPL-6040][CPUPL-6018][CPUPL-5319][CPUPL-3799] Change-Id: I84f734f8e683c90efa63f2fa79d2c03484e07956	2025-01-24 12:59:26 -05:00
Edward Smyth	97ede96ed4	Correct duplicate object file names Some kernel file names were the same for different sub-configurations, which could result in duplicate copies of the same object being archived depending upon the order of (re-)compiling the source files. Rename the files to be specific to each sub-configuration to avoid this problem. AMD-Internal: [CPUPL-5895] Change-Id: I182ac706e04a364f1df20fd0fb5b633eb10eeafb	2025-01-10 06:03:36 -05:00
harsdave	54b46ec1ed	Enhance 24x8 DGEMM SUP/Tiny Kernel Performance with Optimized Loops and Edge Kernels This patch introduces comprehensive optimizations to the DGEMM kernel, focusing on loop efficiency and edge kernel performance. The following technical improvements have been implemented: 1. IR Loop Optimization: - The IR loop has been re-implemented in hand-written assembly to eliminate the overhead associated with `begin_asm` and `end_asm` calls, resulting in more efficient execution. 2. JR Loop Integration: - The JR loop is now incorporated into the micro kernel. This integration avoids the repetitive overhead of stack frame management for each JR iteration, thereby enhancing loop performance. 3. Kernel Decomposition Strategy: - The m dimension is decomposed into specific sizes: 20, 18, 17, 16, 12, 11, 10, 9, 8, 4, 2, and 1. - For remaining cases, masked variants of edge kernels are utilized to handle the decomposition efficiently. 1. Interleaved Scaling by Alpha: - Scaling by the alpha factor is interleaved with load instructions to optimize the instruction pipeline and reduce latency. 2. Efficient Mask Preparation: - Masks are prepared within inline assembly code only at points where masked load-store operations are necessary, minimizing unnecessary overhead. 3. Broadcast Instruction Optimization: - In edge kernels where each FMA (Fused Multiply-Add) operation requires a broadcast without subsequent reuse, the broadcast instruction is replaced with `mem_1to8`. - This allows the compiler to optimize by assigning separate vector registers for broadcasting, thus avoiding dependency chains and improving execution efficiency. 4. C Matrix Update Optimization: - During the update of the C matrix in edge kernels, columns are pre-loaded into multiple vector registers. This approach breaks dependency chains during FMA operations following the scaling by alpha, thereby mitigating performance bottlenecks and enhancing throughput. 
These optimizations collectively improve the performance of the DGEMM kernel, particularly in handling edge cases and reducing overhead in critical loops. The changes are expected to yield significant performance gains in matrix multiplication operations. This patch also involves changes for tiny gemm interface. A light interface for calling kernels and removing calls to avx2 dgemm kernels as we use avx512 dgemm kernels for all the sizes for zen4 and zen5. For zen4 and zen5 when A matrix transposed(CRC, RRC), tiny kernel does not have the support to handle such inputs and thus such inputs are routed to gemm_small path. AMD-Internal: [CPUPL-6054] Change-Id: I57b430f9969ca39aa111b54fa169e4225b900c4a	2024-12-13 00:03:00 -05:00
Edward Smyth	82bdf7c8c7	Code cleanup: Copyright notices - Standardize formatting (spacing etc). - Add full copyright to cmake files (excluding .json) - Correct copyright and disclaimer text for frame and zen, skx and a couple of other kernels to cover all contributors, as is commonly used in other files. - Fixed some typos and missing lines in copyright statements. AMD-Internal: [CPUPL-4415] Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371	2024-08-05 15:35:08 -04:00
Moripalli Chitra	8b486e8d14	Added new decision logic to choose between 6x8 dgemm kernel vs 24x8 kernel. The decision is based on the values of "m, n and k". Change-Id: I307ff002797ccef5bd61106b808cecb069b91fd6	2024-08-02 14:18:58 +05:30
vignbala	236d092656	AVX512 optimizations for ZGEMM to handle k = 1 cases - Implemented bli_zgemm_16x4_avx512_k1_nn( ... ) AVX512 kernel to be used as part of BLAS/CBLAS calls to ZGEMM. The kernel is built for handling the GEMM computation with inputs having k = 1, with the transpose values being N(for column-major) and T(for row-major). - Updated the zgemm_blis_impl( ... ) layer to query the architecture ID and invoke the AVX2 or AVX512 kernel accordingly. - Added API level tests for accuracy and code-coverage, as well as micro-kernel tests for verifying functionality and out-of-bounds memory accesses. AMD-Internal: [CPUPL-5249] Change-Id: Id1f8bebff3e0da83c7febe86299564fd658b2e84	2024-07-09 07:07:24 -04:00
Edward Smyth	43d36b9f66	AOCL_ENABLE_INSTRUCTIONS improvements 2 Use of AOCL_ENABLE_INSTRUCTIONS in dgemm tiny code path is unnecessary and incorrectly caused AVX512 code to be run on zen4 and later processors when AOCL_ENABLE_INSTRUCTIONS=avx2 or equivalent options was selected. Replace with code to select kernel in a similar way to other dgemm code paths and other APIs. Note that at present AVX2 code is used for the smallest matrix sizes on all zen platforms. AMD-Internal: [CPUPL-5078] Change-Id: Ie6b4895461cbbb915d2b48b92fc063f5cd6adb85	2024-06-25 04:57:38 -04:00
Mangala V	e9124ffca7	BUGFIX: Updated ZGEMM microkernel to handle alpha = 0 case BUG: When alpha real and imaginary is zero Output is computed as C= Beta * C + A * B instead of C = Beta * C FIX: Updated kernel to scale A * B product with alpha in case of alpha=0 Existing framework design: - When alpha real and imaginary value is zero, framework handles to skip kernel call to avoid alpha * A * B operation - SCALM is invoked to perform Beta * C - Accuracy issue was not observed as alpha=0 was handled in framework - If we call kernel directly with alpha=0, results would be wrong - Issue was figured out during microkernel testing using gtestsuite AMD-Internal: [CPUPL-4454] Change-Id: Ib6113f5226cd7c26a63781cdd20d35660f453803	2024-06-20 02:58:43 -04:00
Shubham Sharma	b9e21e8701	Added ZTRSM AVX512 small code path - Kernel dimensions are 4x4. - Two kernels are implemented, Right Upper and Right lower. - In case of Left variants of TRSM, transpose is induced so that Right variant kernels can be used. - No packing is performed in these kernels. - Changes are made in the threshold to pick ZTRSM small code path. - BLIS_INLINE is removed from signature of "TRSMSMALL_KER_PROT". - These kernels do not support "ENABLE_TRSM_PREINVERSION". - Newly added kernels do not support conjugate transpose. - Added multithreading to ZTRSM small code path. AMD-Internal: [CPUPL-4324] Change-Id: I683b1d5239593e54f433e7f27497d72dfbd9141c	2024-05-03 05:10:41 -04:00
Shubham Sharma	14bab0eb17	Fixed out of bounds read in CTRSM small kernel - In 2x1 fringe case in [RUN/RLT] kernel, 3 scomplex precision numbers are being read instead of 1 scomplex. - Fixed the code to read only one scomplex. AMD-Internal: [CPUPL-4403] Change-Id: If3ac03ed864618382d3a382a8cdff7ff8a94eb7d	2024-04-16 02:42:34 -04:00
Edward Smyth	2450a1813b	BLIS: Implement zen5 sub-configuration Implement full support for zen5 as a separate BLIS sub-configuration and code path within amdzen configuration family. AMD-Internal: [CPUPL-3518] Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09	2024-04-12 07:26:31 -04:00
Shubham Sharma	fc91932b4a	Fixed out of bounds read in DTRSM small kernels - In 3x1 fringe case in [RLN/RUT] kernel, 4 double precision floats are being read instead of 3 doubles. - Fixed the code to read only 3 double. AMD-Internal: [CPUPL-4403] Change-Id: If0afb155efefabe13487cf322d479981f1838aa2	2024-02-02 10:31:12 +05:30
Edward Smyth	ed5010d65b	Code cleanup: AMD copyright notice Standardize format of AMD copyright notice. AMD-Internal: [CPUPL-3519] Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0	2023-11-23 08:54:31 -05:00
Edward Smyth	50608f28df	BLIS: Missing clobbers (batch 7) Add missing clobbers in: - bli_gemmsup_rv_haswell kernels - spare copies of kernels in old, other and broken subdirectories - misc kernels for legacy platforms AMD-Internal: [CPUPL-3521] Change-Id: I7cdb7fd1cb29630d8b7fa914b1002a270dfe9ef5	2023-11-22 17:51:46 -05:00
Edward Smyth	f471615c66	Code cleanup: No newline at end of file Some text files were missing a newline at the end of the file. One has been added. AMD-Internal: [CPUPL-3519] Change-Id: I4b00876b1230b036723d6b56755c6ca844a7ffce	2023-11-22 17:11:10 -05:00
mangala v	e0df20806a	Updated prefetching in SGEMM SUP (mask load/store) kernels 1. Prefetch only MR rows or rows required for fringe cases 2. Specify prefetching offset - the least column address supported by masked functions 3. Removed unnecessary prefetches in fringe case for mx4 kernels Updated gtestsuite for sgemm calls AMD_Internal: [CPUPL-4221] Change-Id: I1e2e7d3ebce37dc54a2f0a5c1c70ce0a6d4c8d6c	2023-11-21 06:31:47 -05:00
mangala v	3256a7b074	BugFix: Re-Designed SGEMM SUP kernel to use mask load/store instruction Segfault was reported through nightly jenkins job. Issue was observed when running in MT mode. Issue was due to extra broadcast being used. Extra broadcast would access out of bound memory on input buffer. Cleaned up clobber list by removing unused registers. AMD_Internal: [CPUPL-4180] Change-Id: I1c8715b2850ef855328f2ef12f215987299bdb2b	2023-11-17 18:14:34 +05:30
Mangala V	f6046784ce	Re-Designed SGEMM SUP kernel to use mask load/store instruction Added all fringe kernels with mask load store support Fringe kernels cover m direction from 5 to 1 and n direction from 15 to 1 for row storage format - New edge kernels that uses masked load-store instructions for handling corner cases. - Mask load-store instruction macros are added. vmaskmovps, VMASKMOVPS for masked load-store. - It improves performance by reducing branching overhead and by being more cache friendly. - Mask load-store is added only for row storage format AMD-Internal: [CPUPL-4041] Change-Id: I563c036c79bf8e476a8ebde37f8f6db751fb3456	2023-11-10 01:23:48 -05:00
Eleni Vlachopoulou	75a4d2f72f	CMake: Adding new portable CMake system. - A completely new system, made to be closer to Make system. AMD-Internal: [CPUPL-2748] Change-Id: I83232786406cdc4f0a0950fb6ac8f551e5968529	2023-11-09 15:49:45 +05:30
Edward Smyth	9500cbee63	Code cleanup: spelling corrections Corrections for some spelling mistakes in comments. AMD-Internal: [CPUPL-3519] Change-Id: I9a82518cde6476bc77fc3861a4b9f8729c6380ba	2023-11-09 00:16:30 -05:00
Harsh Dave	75356d45e5	DGEMM improvement for very tiny sizes less than 24. - This commit helps improve performance for very small inputs by reducing framework checks and routing all such inputs to bli_dgemm_tiny_6x8_kernel. It forces single threaded computation for such sizes. - It invokes bli_dgemm_tiny_6x8_kernel for the ZEN, ZEN2, ZEN3 and ZEN4 code paths, except when the AOCL_ENABLE_INSTRUCTIONS environment variable is set to avx512. In that case, such small inputs are routed to the bli_dgemm_tiny_24x8_kernel avx512 kernel. AMD-Internal: [CPUPL-1701] Change-Id: Idf59f4a8ee76ee8f2514a33be2b618e3ce02383e	2023-11-08 23:45:57 -05:00
Harsh Dave	0de10cc86c	Added k=1 avx512 dgemm kernel. - This commit implements avx512 dgemm kernel for k=1 cases, which gets called for zen4 codepath. - Added architecture check for k=1 kernel in dgemm code path to pick correct kernel based on cpu architecture since now blis is having avx2 and avx512 dgemm kernels for k=1 case. - Previously in dgemm path bli_dgemm_8x6_avx2_k1_nn kernel was being called irrespective of architecture type. - Added architecture check before calling the kernel for case where k=1, so only for respective architectures this kernel is invoked. AMD-Internal: [CPUPL-4017] Change-Id: I418bbc933b41db41d323b331c6d89893868a6971	2023-11-07 01:10:09 -05:00
Harsh Dave	7bcb701b79	Fixed functionality failure for dgemm tiny kernel. - For k > KC, C matrix is getting scaled by beta on each iteration. It should be scaled only once. Fixed the scaling of C matrix by beta in K loop. - Corrected A and B matrix buffer offsets, for cases where k > KC. AMD-Internal: [CPUPL-4078] AMD-Internal: [CPUPL-4079] AMD-Internal: [CPUPL-4081] AMD-Internal: [CPUPL-4080] AMD-Internal: [CPUPL-4087] Change-Id: I27f426caf48e094fd75f1f719acb4ac37d9daeaa	2023-10-26 15:11:59 +05:30
Harsh Dave	7a4f84fbac	Optimized dgemm for tiny input sizes. - This commit focuses on enhancing the performance of dgemm for matrices with very small dimensions. - blis_dgemm_tiny function re-uses dgemm sup kernels, bypassing the conventional SUP framework code path. The SUP framework code path requires creating and initializing blis objects, accessing all the needed meta-information from objects, and querying contexts, which adds a performance penalty when computing for matrices with very small dimensions. - To avoid such a performance penalty, blis_dgemm_tiny function implements lightweight support code so that it can re-use dgemm SUP kernels in such a way that it directly operates on input buffers. It avoids the framework overhead of creating and initializing blis objects, context initialization, and accessing other large framework data structures. - blis_dgemm_tiny function checks for threshold condition to match before picking the kernel. For zen, zen2, zen3 architecture tiny kernel is invoked for any shape as long as m < 8 and k <= 1500 or m < 1000 and n <= 24 and k <=1500. While for zen4 as long as dimensions are less than 1500 for m,n,k tiny kernel is invoked. - blis_dgemm_tiny function supports single threaded computation as of now. AMD-Internal: [CPUPL-3574] Change-Id: Ife66d35b51add4fccbeebd29911e0c957e59a05f	2023-10-16 05:52:49 -04:00
Shubham Sharma	9a2a4151ac	Added improved ZTRSM AVX2 kernels - Added 2x6 ZGEMM row-preferred kernel. - Kernel supports prefetch_a, prefetch_b, prefetch_a_next and prefetch_b_next. - Multiple Ways to prefetch c are supported. - prefetch_a and prefetch_c are enabled by default. - K loop is divided into multiple subloops for better c prefetch. - Added 2x6 ZTRSM row-preferred lower and upper kernels using AVX2 ISA. - These kernels are used for ZTRSM only, zgemm still uses 3x4 kernel. - Kernels support row/col/gen storage. - Updated the zen3 and zen4 config to enable use of these kernels for TRSM in zen3 and zen4 path. - Updated CMakeLists.txt with ZGEMM kernels for windows build. AMD-Internal: [CPUPL-3781] Change-Id: I236205f63a7f6b60bf1a5127a677d27425511e73	2023-10-13 07:43:33 -04:00
Edward Smyth	bb4c158e63	Merge commit 'b683d01b' into amd-main * commit 'b683d01b': Use extra #undef when including ba/ex API headers. Minor preprocessor/header cleanup. Fixed typo in cpp guard in bli_util_ft.h. Defined eqsc, eqv, eqm to test object equality. Defined setijv, getijv to set/get vector elements. Minor API breakage in bli_pack API. Add err_t* "return" parameter to malloc functions. Always stay initialized after BLAS compat calls. Renamed membrk files/vars/functions to pba. Switch allocator mutexes to static initialization. AMD-Internal: [CPUPL-2698] Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df	2023-08-21 07:01:38 -04:00
Harsh Dave	5bdf5e2aaa	Optimized AVX2 DGEMM SUP and small edge kernels. - Re-designed the new edge kernels that uses masked load-store instructions for handling corner cases. - Mask load-store instruction macros are added. vmovdqu, VMOVDQU for setting up the mask. vmaskmovpd, VMASKMOVPD for masked load-store - Following edge kernels are added for 6x8m dgemm sup. n-left edge kernels - bli_dgemmsup_rv_haswell_asm_6x7m - bli_dgemmsup_rv_haswell_asm_6x5m - bli_dgemmsup_rv_haswell_asm_6x3m m-left edge kernels - bli_dgemmsup_rv_haswell_asm_5x7 - bli_dgemmsup_rv_haswell_asm_4x7 - bli_dgemmsup_rv_haswell_asm_3x7 - bli_dgemmsup_rv_haswell_asm_2x7 - bli_dgemmsup_rv_haswell_asm_1x7 - bli_dgemmsup_rv_haswell_asm_5x5 - bli_dgemmsup_rv_haswell_asm_4x5 - bli_dgemmsup_rv_haswell_asm_3x5 - bli_dgemmsup_rv_haswell_asm_2x5 - bli_dgemmsup_rv_haswell_asm_1x5 - bli_dgemmsup_rv_haswell_asm_5x3 - bli_dgemmsup_rv_haswell_asm_4x3 - bli_dgemmsup_rv_haswell_asm_3x3 - bli_dgemmsup_rv_haswell_asm_2x3 - bli_dgemmsup_rv_haswell_asm_1x3 - For 16x3 dgemm_small, m_left computation is handled with masked load-store instructions avoid overhead of conditional checks for edge cases. - It improves performance by reducing branching overhead and by being more cache friendly. AMD-Internal: [CPUPL-3574] Change-Id: I976d6a9209d2a1a02b2830d03d21d200a5aad173	2023-08-07 07:30:50 -04:00
Vignesh Balasubramanian	758ec3b5ca	ZGEMM optimizations for cases with k = 1 - Implemented bli_zgemm_4x4_avx2_k1_nn( ... ) kernel to replace bli_zgemm_4x6_avx2_k1_nn( ... ) kernel in the BLAS layer of ZGEMM. The kernel is built for handling the GEMM computation with inputs having k = 1, and the transpose values for A and B as N. - The kernel dimension has been changed from 4x6 to 4x4, due to the following reasons : - The 1xNR block of B in the n-loop can be reused over multiple MRx1 blocks of A in the m-loop during computation. Similar analogy exists for the fringe cases. - Every 1xNR block of B was scaled with alpha and stored in registers before traversing in the m-dimension. Similar change was done for fringe cases in n-dimension. - These registers should not be modified during compute, hence the kernel dimension was changed from 4x6 to 4x4. - The check for early exit(with regards to BLAS mandate) has been removed, since it is already present in the BLAS layer. - The check for parallel ZGEMM has been moved post the redirection to this kernel, since the kernel is single-threaded. - The bli_kernels_zen.h file was updated with the new kernel signature. AMD-Internal: [CPUPL-3622] Change-Id: Iaf03b00d5075dd74cc412290d77a401986ba0bea	2023-08-07 15:10:08 +05:30
Eleni Vlachopoulou	9c613c4c03	Windows CMake bugfix in object libraries for shared library option Defining BLIS_IS_BUILDING_LIBRARY if BUILD_SHARED_LIBS=ON for the object libraries created in kernels/ directory. The macro definition was not propagated from high level CMake, so we need to define explicitly for the object libraries. AMD-Internal: [CPUPL-3241] Change-Id: Ifc5243861eb94670e7581367ef4bc7467c664d52	2023-05-24 17:30:16 +05:30
Edward Smyth	ea2eea5097	BLIS: Missing clobbers (batch 1) Add missing clobbers in first batch of assembly kernels: - zen3 bli_gemmsup* - bli_zgemm_zen4_asm_12x4 - bli_gemmsup_rv_haswell_asm_sMx6 AMD-Internal: [CPUPL-3456] Change-Id: I33c321043a197b2b885cfd6cd589532fc633a6a1	2023-05-23 11:51:18 -04:00
Mangala V	5f5bc24989	Bug fix: AVX2 code being invoked on non-avx2 machine for ZGEMM API Prevented calling avx2 based bli_zgemm_ref_k1_nn code on non-supported systems. Changed the name of the function bli_zgemm_ref_k1_nn to bli_zgemm_4x6_avx2_k1_nn(). Changed the name of the function bli_dgemm_ref_k1_nn to bli_dgemm_8x6_avx2_k1_nn(). Thanks to Kiran Varaganti <Kiran.Varaganti@amd.com> for identifying and helping to fix the issue. AMD-Internal: [CPUPL-3352] Change-Id: I02530ab197ed84c96cbad4f7dd56eedca0109c35	2023-05-21 23:13:46 +05:30
Shubham Sharma	26e120ea25	Fixed diagonal packing for C/Z TRSM small - In C/Z TRSM small, packing in case of unit diagonal is not handled properly. - Diagonal elements are still being read even in case of unit diagonal. - This causes "Conditional jump or move depends on uninitialised value" error during valgrind tests. - To fix this, diagonal elements should not be read in case of unit diagonal. AMD-Internal: [CPUPL-3406] Change-Id: If3d6965299998a83d87f3a032f654fc7f8c43d4e	2023-05-18 07:57:21 -04:00
Eleni Vlachopoulou	bf26b8ffbc	Removing /arch:AVX2 flag from high-level CMake - Previously, this flag was set as a default at the high-level CMakeLists.txt, which means that this flag is used to build everything, all files and all subdirectories, including ref_kernels and testsuite. Also, all files are target sources for this project and compiled with the same flags. - Now, we create object files using the source in kernels/ directory and add to the object files the AVX2 flag explicitly. So, now only those files will have this flag and it should not be used to compile ref_kernels, etc. - This is a quick solution to enable runs on non-AVX2 machines. AMD-Internal: [CPUPL-3241] Change-Id: Id569b26ffeea40eaa36ab4465b0c52b6446d7650	2023-04-28 09:22:13 -04:00
Edward Smyth	7e50ba669b	Code cleanup: No newline at end of file Some text files were missing a newline at the end of the file. One has been added. Also correct file format of windows/tests/inputs.yaml, which was missed in commit `0f0277e104` AMD-Internal: [CPUPL-2870] Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549	2023-04-21 10:02:48 -04:00
Edward Smyth	0f0277e104	Code cleanup: dos2unix file conversion Source and other files in some directories were a mixture of Unix and DOS file formats. Convert all relevant files to Unix format for consistency. Some Windows-specific files remain in DOS format. AMD-Internal: [CPUPL-2870] Change-Id: Ic9a0fddb2dba6dc8bcf0ad9b3cc93774a46caeeb	2023-04-21 08:41:16 -04:00
Edward Smyth	6835205ba8	Code cleanup: spelling corrections Corrections for spelling and other mistakes in code comments and doc files. AMD-Internal: [CPUPL-2870] Change-Id: Ifbb5df7df2d6312fe73e06ee6d41c00b16c593ce	2023-04-19 12:44:56 -04:00
Aayush Kumar	71272ab574	Fixed Compiler warnings for GCC 12 and AOCC 4.0 - Set the variables to zero to avoid the compiler warning (-Wmaybe-uninitialized) in bli_dgemm_ref_k1.c, bli_gemm_small.c, bli_trsm_small.c, bli_zgemm_ref_k1.c and bli_trsm_small_AVX512.c - Changed the datatype from dim_t to siz_t for i,k,j in bli_hemv_unf_var1_amd.c and bli_hemv_unf_var3_amd.c to avoid the compiler warning (-Waggressive-loop-optimizations) AMD-Internal: [CPUPL-2870] Change-Id: Ib2bc050fa47cb8a280d719283ab4539c70e19d03	2023-04-14 13:29:17 +00:00
Aayush Kumar	8c537b0cd5	Added DTRSM Small Path AVX512 based LLNN/LUTN Variant Kernels - 8x8 kernels are used for DTRSM SMALL - Implemented fringe cases with below block sizes 8x8, 8x4, 8x3, 8x2, 8x1 4x8, 4x4, 4x3, 4x2, 4x1 3x8, 3x4, 3x3, 3x2, 3x1 2x8, 2x4, 2x3, 2x2, 2x1 1x8, 1x4, 1x3, 1x2, 1x1 AMD-Internal: [CPUPL-2745] Change-Id: I58d28912bddbaadb404052c0f3449ebbe3c97b68	2023-04-07 08:50:28 +00:00
Edward Smyth	1ac03e64b5	BLIS cpuid tidy and bugfix. Improvements to BLIS cpuid functionality: - Tidy names of avx support test functions, especially rename bli_cpuid_is_avx_supported() to bli_cpuid_is_avx2fma3_supported() to more accurately describe what it tests. - Fix bug in frame/base/bli_check.c related to changes in commit `6861fcae91` AMD-Internal: [CPUPL-3031] Change-Id: Iacd8fb0ffbd45288e536fc6314660709055ea2d5	2023-04-03 08:46:37 -04:00
Shubham	dfc95d29fc	Enable DTRSM small multithreading path for BLAS interface - Enabled DTRSM small mt for sizes where performance is better than small or native. - Threshold Tuning for small path is updated. - Function signature for bli_trsm_small_mt has been made similar to bli_trsm_small so that one function pointer can be used for all functions. - Early return condition in DTRSM small for sizes > 1000 has been removed so that the sizes for which small path to take can be decided on bla layer instead of inside kernel. AMD-Internal: [CPUPL-2735] Change-Id: Ieea31343dc660517acc18c92713381a8b84d3a2f	2023-03-23 12:07:22 -04:00
Vignesh Balasubramanian	c53b0c96ec	Extended the support for beta scaling in 3x4 RD kernels of ZGEMM SUP - Extended the existing support for handling beta scaling in the fringe cases of 3x4 RD kernels in ZGEMM SUP. The added support ensures that NaN values initialized in C do not propagate to the result when beta is 0. - The support has been added to fringe cases common to, as well as specific to the m and n variants of the RD kernels. AMD-Internal: [CPUPL-3053] [SWLCSG-1900] Change-Id: I8e617ac505144c3ea3a70556413d264f11dfc9a9	2023-03-08 06:16:58 -05:00
Harihara Sudhan S	dfacb47125	Code cleanup to remove multiple function declaration - Removed repetitive function declaration of GEMM small kernels from the C files - Function declaration of these kernels exist in the header files where the kernels are supposed to be declared. AMD-Internal: [CPUPL-3003] Change-Id: Ic10e66691c0742ce519bcc3fe4a12ec5c5052b63	2023-02-21 00:35:31 -05:00
Kiran Varaganti	201db7883c	Integrated 32x6 DGEMM kernel for zen4 and its related changes are added. Details: - Now AOCL BLIS uses AVX512 - 32x6 DGEMM kernel for native code path. Thanks to Moore, Branden <Branden.Moore@amd.com> for suggesting and implementing these optimizations. - In the initial version of 32x6 DGEMM kernel, to broadcast elements of B packed we perform load into xmm (2 elements), broadcast into zmm from xmm and then to get the next element, we do vpermilpd(xmm). This logic is replaced with direct broadcast from memory, since the elements of Bpack are stored contiguously, the first broadcast fetches the cacheline and then subsequent broadcasts happen faster. We use two registers for broadcast and interleave broadcast operation with FMAs to hide any memory latencies. - Native dTRSM uses 16x14 dgemm - therefore we need to override the default blkszs (MR,NR,..) when executing trsm. we call bli_zen4_override_trsm_blkszs(cntx_local) on a local cntx_t object for double data-type as well in the function bli_trsm_front(), bli_trsm_xx_ker_var2, xx = {ll,lu,rl,ru}. Renamed "BLIS_GEMM_AVX2_UKR" to "BLIS_GEMM_FOR_TRSM_UKR" and in the bli_cntx_init_zen4() we replaced dgemm kernel for TRSM with 16x14 dgemm kernel. - New packm kernels - 16xk, 24xk and 32xk are added. - New 32xk packm reference kernel is added in bli_packm_cxk_ref.c and it is enabled for zen4 config (bli_dpackm_32xk_zen4_ref() ) - Copyright year updated for modified files. - cleaned up code for "zen" config - removed unused packm kernels declaration in kernels/zen/bli_kernels.h - [SWLCSG-1374], [CPUPL-2918] Change-Id: I576282382504b72072a6db068eabd164c8943627	2023-01-19 23:11:36 +05:30
Edward Smyth	82c2eb4e8e	Code cleanup and warnings fixes Corrections for some occurrences of: - Compiler warnings about initialization of float from double - Spelling mistakes in comments - Incorrect indentation of code and comments AMD-Internal: [CPUPL-2870] Change-Id: Icb68c789687bd0684844331d43071bfffecac9fc	2023-01-09 04:34:52 -05:00
Vignesh Balasubramanian	dbd0c069d4	Implemented 3x4 RD kernels for ZGEMM in SUP path -Implemented (r)ow preferential (d)ot product milli-kernels (m and n variants) for dcomplex datatype along SUP path. -These computational kernels extend the support for handling RRC and CRC storage schemes along the SUP path. In case of BLAS api call, it corresponds to the input cases with transa equal to T and transb equal to N. -In case of the B matrix being packed(conditionally), the inputs are redirected to the existing (r)ow preferential (v)ector load optimized kernels due to better performance. -Added macro for vhsubpd assembly instruction, to support the arithmetic for complex datatype in its interleaved storage. AMD-Internal: [CPUPL-2593] Change-Id: If90834e55e9e31aa87d3d5b711efad9ef2458da8	2022-12-22 18:31:46 +05:30
Edward Smyth	34730a1e4c	BLIS: Nested parallelism issues 1. Check OpenMP active level against max active levels when setting number of threads for starting a new parallel region in ./frame/thread/bli_thread.c to ensure the correct number of threads is used when BLIS is called within nested OpenMP parallelism. 2. In subsequent BLIS calls, threading choices could be incorrectly set based on values used and stored in global_rntm by a previous call. This could apply when the OpenMP number of threads differ from call to call, different nested parallelism is used in different parts of a user's code, or different threads at the user level request different numbers of OpenMP threads for BLIS calls. Keep threading information in both global_rntm and a new Thread Local Storage copy tl_rntm. Update tl_rntm from OpenMP runtime environment (as appropriate) during bli_init_auto() calls in each BLIS routine. The details are: * global_rntm is initialized on first BLIS call based on OpenMP and BLIS threading environment variables. * global_rntm is updated by any BLIS threading function calls. * In bli_thread_update_tl(), called by bli_init_auto(), sync with any BLIS values set or updated in global_rntm. Then, if BLIS threading control is not used, check OpenMP ICVs and set thread count and auto_factor appropriately. * Setting BLIS threading locally (using expert interfaces to pass a user defined rntm data structure) should work as before. 3. bli_thread_get_is_parallel can now only be called outside of parallelism within BLIS routines. Change calls in trsm to reflect this. 4. Ensure blis_mt is set to TRUE in bli_thread_init_rntm_from_env() if any BLIS_*_NT environment variables are set. 5. Set auto_factor = FALSE when the number of threads is 1. 6. bli_rntm_set_num_threads() and bli_rntm_set_ways() set blis_mt=TRUE. 7. Set blis_mt=FALSE in BLIS_RNTM_INITIALIZER and bli_rntm_init(). 8. 
For debugging, internal information on the rntm threading data can be printed by defining "PRINT_THREADING" at the top of bli_rntm.h 9. bli_rntm_print() now also prints the value of blis_mt. 10. Function prototypes in bli_rntm.h moved to top of file, so that bli_rntm_print() can be used within inline functions defined in this header file. 11. Comment out bli_init_auto() and bli_finalize_auto() calls in Fortran interfaces in frame/compat/blis/thread/b77_thread.c 12. In frame/3/bli_l3_sup_int_amd.c move two calls to set_pack_a and set_pack_b functions outside of the auto_factor if statements. 13. Misc code tidying. AMD-Internal: [CPUPL-2433] Change-Id: I8342c37fb4e280118e5e55164fbd6ea636f858ee	2022-10-21 07:38:39 -04:00
satish kumar nuggu	9c292b79e2	Fixed ASAN reported issues in [s/c]trsm small kernels Details: 1. Fixed the partial memory access overflows for the variables AlphaVal,ones reported by ASAN. 2. Using 128 bit packed broadcast with the 64 bit data types after type casting would cause the garbage data to be filled in the destination register. 3. Fixed this issue by using set_ps instruction instead of broadcast. 4. In cases of n remainder being 1, extra elements were accessed that could cause out of memory access. Removed the extra element access. AMD-Internal: [CPUPL-2578][CPUPL-2587] Change-Id: Iaa918060c66287f2f46bcb9f69e9323f6707cf75	2022-09-30 03:02:07 -04:00
satish kumar nuggu	754627eae9	Addressed uninitialized variables in trsm small algo 1. Addressed uninitialized variables reported in coverity for all datatypes of trsm small algo. AMD-Internal: [CPUPL-2542] Change-Id: Ifae57ef6435493942732526720e6a9d6bec70e71	2022-09-29 01:49:22 -04:00
Mangala V	42434fbc31	Bug fix in TRSM small for S, D, C & Z datatype in case of MT 1. Corrected B buffer accessing to access by its offset instead of starting address, which is required in case of MT. 2. When num_threads > 1, B buffer is divided into blocks in m or n dimension based on side right or left. Hence need to access by its offset to access starting of the block. 3. Currently B Matrix is divided into blocks for each thread and complete matrix A is used by all threads. In case of design change in future, modified A buffer accessing by its offset to support partition of matrix A for MT AMD-Internal:[CPUPL-2520] Change-Id: Ic09e9e945417b86e2bc2e2d4548f65db308cd2ea	2022-09-23 03:30:25 -04:00
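
Note on commit 7bcb701b79 above: its message says that for k > KC the C matrix was being scaled by beta on every iteration of the K loop, when it should be scaled only once, and that the A and B buffer offsets had to follow the K loop. The sketch below illustrates that idea in plain Python. It is not the BLIS code; the function name gemm_k_blocked and the parameter kc are hypothetical, matrices are lists of rows, and k >= 1 is assumed.

```python
def gemm_k_blocked(alpha, a, b, beta, c, kc):
    """Return alpha*A*B + beta*C, walking the k dimension in blocks of at most kc.

    Illustrative only: a is m x k, b is k x n, c is m x n (lists of rows).
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [row[:] for row in c]  # work on a copy of C
    for p0 in range(0, k, kc):
        p1 = min(p0 + kc, k)
        # Scale C by beta on the first k block only; later blocks accumulate.
        beta_blk = beta if p0 == 0 else 1.0
        for i in range(m):
            for j in range(n):
                acc = 0.0
                # Reads of A and B are offset by the current block start p0.
                for p in range(p0, p1):
                    acc += a[i][p] * b[p][j]
                out[i][j] = beta_blk * out[i][j] + alpha * acc
    return out


a = [[1.0, 2.0, 3.0]]
b = [[1.0], [1.0], [1.0]]
c = [[10.0]]
# The result should not depend on how k is split into blocks.
print(gemm_k_blocked(2.0, a, b, 0.5, c, kc=2))
print(gemm_k_blocked(2.0, a, b, 0.5, c, kc=3))
```

Applying beta in every block would scale the partial sums from earlier blocks as well, which is the kind of functional failure the commit describes.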

175 Commits