627 Commits

Author SHA1 Message Date
Rayan, Rohan
a22e0022c2 SGEMM tiny path tuning for zen4 and zen5 (#267)
* Adding a model to determine which matrices enter the SGEMM tiny path
* This extends the sizes of matrices that enter the tiny path, which was constrained to the L1 cache size previously
* Now matrices that fit in L2 are also allowed into the tiny path, provided they are determined to be faster than the SUP path
* Adding thresholds based on the SUP path sizes
* Added for Zen4 and Zen5
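A minimal sketch of the gate described above (cache sizes, function names and the cost model are illustrative assumptions, not the actual AOCL-BLAS code):

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative cache sizes; real values come from the detected config. */
enum { L1_BYTES = 32 * 1024, L2_BYTES = 1024 * 1024 };

/* Stand-in for the tuned model comparing predicted tiny vs. SUP cost. */
static bool model_prefers_tiny(size_t m, size_t n, size_t k)
{
    return m * n * k <= (size_t)64 * 64 * 64;   /* illustrative threshold only */
}

static bool enter_sgemm_tiny(size_t m, size_t n, size_t k)
{
    size_t footprint = (m * k + k * n + m * n) * sizeof(float);
    if (footprint <= L1_BYTES)
        return true;                      /* old rule: all operands L1-resident */
    if (footprint <= L2_BYTES)
        return model_prefers_tiny(m, n, k); /* new rule: L2-resident, if predicted faster than SUP */
    return false;
}
```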

---------
AMD-Internal: CPUPL-7555
Co-authored-by: Rohan Rayan <rohrayan@amd.com>
2025-12-10 15:58:54 +05:30
Rayan, Rohan
e85be22da0 Adding tiny path for SGEMM (#237)
Adding SGEMM tiny path for Zen architectures.
Needed to cover some performance gaps observed relative to MKL.
Only allowing matrices that fit entirely into the L1 cache into the tiny path.
Only tuned for single threaded operation at the moment
Todo: Tune cases where AVX2 performs better than AVX512 on Zen4
Todo: The current ranges are very conservative, there may be scope to increase the matrix sizes that go into the tiny path

AMD-Internal: CPUPL-7555
Co-authored-by: Rohan Rayan <rohrayan@amd.com>
2025-10-24 13:14:33 +05:30
Smyth, Edward
a4db661b44 GCC 15 SUP kernel workaround (2)
Previous commit (30c42202d7) for this problem turned off
-ftree-slp-vectorize optimizations for all kernels. Instead, copy
the approach of upstream BLIS commit 36effd70b6a323856d98 and disable
these optimizations only for the affected files by using GCC pragmas.
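The per-file pragma approach looks roughly like the following (the kernel body is a stand-in; only the pragma pattern is the point):

```c
/* Disable the problematic optimization only for this translation unit's
   kernels, instead of stripping -ftree-slp-vectorize from CKOPTFLAGS
   globally. The function below is illustrative, not an actual SUP kernel. */
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC push_options
#pragma GCC optimize("no-tree-slp-vectorize")
#endif

static void saxpy_like(float *c, const float *a, float alpha, int n)
{
    for (int i = 0; i < n; i++)
        c[i] += alpha * a[i];
}

#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC pop_options
#endif
```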

AMD-Internal: [CPUPL-6579]
2025-09-04 17:14:06 +01:00
Smyth, Edward
fb2a682725 Miscellaneous changes
- Change begin_asm and end_asm comments and unused code in files
     kernels/haswell/3/sup/s6x16/bli_gemmsup_rv_haswell_asm_sMx6.c
     kernels/zen4/3/sup/bli_gemmsup_cd_zen4_asm_z12x4m.c
  to avoid problems in the clobber-checking script.
- Add missing clobbers in files
     kernels/zen4/1m/bli_packm_zen4_asm_d24xk.c
     kernels/zen4/1m/bli_packm_zen4_asm_z12xk.c
     kernels/zen4/3/sup/bli_gemmsup_cv_zen4_asm_z12x4m.c
- Add missing newline at end of files.
- Update some copyright years for recent changes.
- Standardize license text formatting.

AMD-Internal: [CPUPL-6579]
2025-08-26 16:37:43 +01:00
Sharma, Shubham
b5c8124d3d Derive TRSM ref kernels from TRSM blkszs instead of GEMM blkszs (#148)
- Currently TRSM reference kernels are derived from GEMM blocksizes and GEMM_UKR.
- This does not allow the flexibility to use different GEMM_UKR for GEMM and TRSM if optimized TRSM_UKR are not available.
- Made changes so that ref TRSM kernels are derived from TRSM blocksizes.
- Changed ZEN4 and ZEN5 cntx to use AVX2 kernels for CTRSM.

AMD-Internal: [SWLCSG-3702]
2025-08-21 11:25:45 +05:30
Dave, Harsh
e39cf64708 Optimized avx512 ZGEMM kernel and edge-case handling (#147)
* Optimized avx512 ZGEMM kernel and edge-case handling
  Edge kernel implementation:
   - Refactored all of the zgemm kernels to process micro-tiles efficiently
   - Specialized sub-kernels are added to handle the leftover m dimension: 12MASK,
     8, 8MASK, 4, 4MASK, 2.
   - 12MASK edge kernel handles 11, 10, 9 m_left using 2 full zmm
     load/store and 1 masked load/store.
   - Similarly 8MASK handles 7, 6, 5 m_left using 1 full zmm load/store and
     1 masked load/store.
   - 4MASK handles 3, 1 m_left using 1 masked load/store.

   - The ZGEMM kernel now internally decomposes the m dimension as follows.
     The main kernel is 12x4, with the following edge kernels to
     handle the leftover m dimension:
     edge kernels:
     12MASKx4 (handles 11x4, 10x4, 9x4)
     8x4      (handles 8x4)
     8MASKx4  (handles 7x4, 6x4, 5x4)
     4x4      (handles 4x4)
     4MASKx4  (handles 3x4, 1x4)
     2x4      (handles 2x4)

   - Similarly, it decomposes for the (12x3, 12x2 and 12x1) n_left kernels, under
     which the edge kernels 12MASKxN_LEFT(3, 2, 1), 8xN_LEFT(3, 2, 1),
     8MASKxN_LEFT(3, 2, 1), 4xN_LEFT(3, 2, 1), 4MASKxN_LEFT(3, 2, 1) and
     2xN_LEFT(3, 2, 1) handle the leftover m dimension.
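The decomposition above can be sketched as a dispatch over the remaining m (kernel names here are labels, not the real function symbols):

```c
#include <string.h>   /* for strcmp in the checks below */

/* Illustrative dispatch for the ZGEMM m decomposition: given m_left,
   report which (hypothetical) kernel runs next and how many rows it covers. */
static const char *zgemm_m_step(int m_left, int *consumed)
{
    if (m_left >= 12) { *consumed = 12;     return "12x4";     }
    if (m_left >= 9)  { *consumed = m_left; return "12MASKx4"; } /* 11, 10, 9 */
    if (m_left == 8)  { *consumed = 8;      return "8x4";      }
    if (m_left >= 5)  { *consumed = m_left; return "8MASKx4";  } /* 7, 6, 5 */
    if (m_left == 4)  { *consumed = 4;      return "4x4";      }
    if (m_left == 3 || m_left == 1)
                      { *consumed = m_left; return "4MASKx4";  } /* 3, 1 */
    if (m_left == 2)  { *consumed = 2;      return "2x4";      }
    *consumed = 0;
    return 0;
}
```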

  Threshold tuning:
   - Forced inputs with an odd m dimension to the avx512 kernels in the tiny
     path, as the avx2 kernels invoke gemv calls for m_left=1 (odd m dimension
     of the matrix). The gemv function call adds overhead for very small sizes
     and results in suboptimal performance.

   - A condition check "m % 2 == 0" is added alongside the threshold checks to
     force inputs with an odd m dimension to use the avx512 zgemm kernel.

   - Changed thresholds to route all inputs to the tiny path, eliminating the
     dependency on the avx2 zgemm_small path when the A and B matrix storage
     is 'N' (no transpose) or 'T' (transpose).

   - However, the tiny path re-uses the zgemm sup kernels, which do not support
     conjugate-transpose storage of matrices. For such storage of the A and B
     matrices we still rely on the avx2 zgemm_small kernel.
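A condensed sketch of this routing logic (names and the threshold flag are illustrative assumptions):

```c
#include <stdbool.h>

/* Hypothetical routing sketch: conja/conjb flag conjugate-transpose storage
   of A/B; within_thresholds stands for the tuned size checks. */
static bool route_zgemm_to_tiny(int m, int n, int k,
                                bool conja, bool conjb,
                                bool within_thresholds)
{
    (void)n; (void)k;
    /* tiny re-uses SUP kernels, which lack conjugate-transpose support */
    if (conja || conjb)
        return false;            /* fall back to the avx2 zgemm_small path */
    if (m % 2 != 0)
        return true;             /* odd m: avoid the avx2 gemv overhead */
    return within_thresholds;    /* otherwise the tuned thresholds decide */
}
```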

  gtest changes:
   - Removed the zgemm edge kernel functions (8x4, 4x4, 2x4 and fx4) and their
     respective testing instances from gtest.

AMD-Internal: [CPUPL-7203]

---------

Co-authored-by: harsdave <harsdave@amd.com>
2025-08-21 09:46:10 +05:30
Smyth, Edward
509aa07785 Standardize Zen kernel names
Naming of Zen kernels and associated files was inconsistent with BLIS
conventions for other sub-configurations and between different Zen
generations. Other anomalies existed, e.g. dgemmsup 24x column-preferred
kernel names used _rv_ instead of _cv_. This patch renames kernels and
file names to address these issues.

AMD-Internal: [CPUPL-6579]
2025-08-19 18:19:51 +01:00
Vlachopoulou, Eleni
1f8a7d2218 Renaming CMAKE_SOURCE_DIR to PROJECT_SOURCE_DIR so that BLIS can be built properly via FetchContent() (#65) 2025-08-07 15:51:59 +01:00
Balasubramanian, Vignesh
ab4bb2f1e8 Threshold tuning for code-paths and optimal thread selection for ZGEMM(ZEN4)
- Updated the thresholds to enter the AVX512 Tiny and SUP codepaths
  for ZGEMM(on ZEN4). This caters to inputs that perform well on
  a single-threaded execution(in the Tiny-path), and inputs that
  scale well with multithreaded-execution(in the SUP path).

- Also updated the thresholds to decide ideal threads, based on
  'm', 'n' and 'k' values. The thread-setting logic involves
  determining the number of tiles for computation, and using them
  to further tune for the optimal number of threads.
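The tile-based thread selection can be sketched as follows (MR/NR values and the cap rule are assumptions, not the tuned AOCL-BLAS logic):

```c
/* Illustrative thread pick: count the micro-tiles of the m x n output and
   never spawn more threads than there are tiles to compute. */
static int zgemm_pick_threads(int m, int n, int mr, int nr, int nt_requested)
{
    long m_tiles = (m + mr - 1) / mr;   /* ceil(m / MR) */
    long n_tiles = (n + nr - 1) / nr;   /* ceil(n / NR) */
    long tiles   = m_tiles * n_tiles;
    return (int)(tiles < nt_requested ? tiles : nt_requested);
}
```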

AMD-Internal: [CPUPL-6378][CPUPL-6661]

Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
2025-07-10 15:35:22 +05:30
Smyth, Edward
969ceb7413 Finer control of code path options (#67)
Add macros to allow specific code options to be enabled or disabled,
controlled by options to configure and cmake. This expands on the
existing GEMM and/or TRSM functionality to enable/disable SUP handling
and replaces the hard coded #define in include files to enable small matrix
paths.

All options are enabled by default for all BLIS sub-configs but many of them
are currently only implemented in AMD specific framework code variants.

AMD-Internal: [CPUPL-6906]
---------

Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
2025-07-08 10:59:23 +01:00
Smyth, Edward
8a8d3f43d5 Improve consistency of optimized BLAS3 code (#64)
* Improve consistency of optimized BLAS3 code

Tidy AMD optimized GEMM and TRSM framework code to reduce
differences between different data type variants:
- Improve consistency of code indentation and white space
- Added some missing AOCL_DTL calls
- Removed some dead code
- Consistent naming of variables for function return status
- GEMM: More consistent early return when k=1
- Correct data type of literal values used for single precision data

In kernels/zen/3/bli_gemm_small.c and bli_family_*.h files:
- Set default values for thresholds if not set in the relevant
  bli_family_*.h file
- Remove unused definitions and commented out code
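The defaulting presumably follows the usual guard idiom (macro name and value below are illustrative stand-ins for those in the bli_family_*.h headers):

```c
/* If the family header did not define a small-matrix threshold,
   fall back to a default (name and value are hypothetical). */
#ifndef SMALL_MATRIX_M_THRES
#define SMALL_MATRIX_M_THRES 20
#endif

static int small_matrix_m_thres(void)
{
    return SMALL_MATRIX_M_THRES;
}
```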

AMD-Internal: [CPUPL-6579]
2025-07-01 09:29:52 +01:00
Balasubramanian, Vignesh
98bc1d80e7 Support for Tiny-GEMM interface(CGEMM)
- Added support for Tiny-CGEMM as part of the existing
  macro-based Tiny-GEMM interface. This involved defining
  the appropriate AVX2/AVX512 lookup tables and functions for
  the target architectures (as per the design), for compile-time
  instantiation and runtime usage.

- Also extended the current Tiny-GEMM design to incorporate packing
  kernels as part of its lookup tables. These kernels will be queried
  through lookup functions and used when supporting non-trivial
  storage schemes (such as dot-product computation).

- This allows plug-and-play experimentation with the pack-and-outer-product
  method against native inner-product implementations.

- Further updated the existing AVX512 pack routine that packs the A matrix
  (in blocks of 24xk). This utilizes masked load/store instructions to
  handle fringe cases of the input (i.e., when m < 24).

- Also added the AVX512 outer product kernels for CGEMM as part of the
  ZEN4 and ZEN5 contexts, to handle RRC and CRC storage schemes. This is
  facilitated through optional packing of A matrix in the SUP framework.

AMD-Internal: [CPUPL-6498]

Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
2025-06-30 12:14:44 +05:30
Smyth, Edward
30c42202d7 GCC 15 SUP kernel workaround (#35)
GCC 15 fails to compile some SUP kernels. The problem seems to be
related to one of the optimization phases enabled at -O2 or above.
Workaround is to disable this specific optimization by adding the
flag -fno-tree-slp-vectorize to CKOPTFLAGS.

AMD-Internal: [CPUPL-6579]

Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
2025-06-25 11:01:34 +01:00
Smyth, Edward
14e46ad83b Improvements to x86 make_defs files (#29)
Various changes to simplify and improve x86 related make_defs files:
- Make better use of common definitions in config/zen/amd_config.mk
  from config/zen*/make_defs.mk files
- Similarly for config/zen/amd_config.cmake from the
  config/zen*/make_defs.cmake files
- Pass cc_major, cc_minor and cc_revision definitions from configure
  to generated config.mk file, and use these instead of defining
  GCC_VERSION in config/zen*/make_defs.mk files
- Add znver3 support for LLVM 13 in config/zen3/make_defs.{mk,cmake}
- Add znver5 support for LLVM 19 in config/zen5/make_defs.{mk,cmake}
- Improve readability of haswell, intel64, skx and x86_64 files
- Correct and tidy some comments

AMD-Internal: [CPUPL-6579]
2025-06-03 16:20:43 +01:00
Hari Govind S
29f30c7863 Optimisation for DCOPY API
-  Introduced a new assembly kernel that copies data from the source
   to the destination from the front and back of the vector at the
   same time. This kernel provides better performance for larger
   input sizes.

-  Added a wrapper function responsible for selecting the kernel
   used by DCOPYV API to handle the given input for zen5
   architecture.

-  Updated AOCL-dynamic threshold for DCOPYV API in zen4 and
   zen5 architectures.

-  New unit-tests were included in the gtestsuite for the new
   kernel.
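A scalar model of the two-ended copy idea (the real kernel is hand-written assembly with vector loads/stores; this only shows the traversal pattern):

```c
#include <stddef.h>

/* Walk the vector from the front and the back simultaneously,
   copying one element from each end per step. */
static void dcopy_two_ended(const double *restrict src,
                            double *restrict dst, size_t n)
{
    size_t i = 0, j = n;
    while (i < j) {
        j--;
        dst[i] = src[i];           /* front element */
        if (i != j)
            dst[j] = src[j];       /* back element */
        i++;
    }
}
```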

AMD-Internal: [CPUPL-6650]
Change-Id: Ie2af88b8e97196b6aa02c089e59247742002f568
2025-04-28 05:58:21 -04:00
Vignesh Balasubramanian
b4b0887ca4 Additional optimizations to ZGEMM SUP and Tiny codepaths(ZEN4 and ZEN5)
- Added a set of AVX512 fringe kernels(using masked loads and
  stores) in order to avoid rerouting to the GEMV typed API
  interface(when m = 1). This ensures uniformity in performance
  across the main and fringe cases, when the calls are multithreaded.

- Further tuned the thresholds to decide between ZGEMM Tiny, Small
  SUP and Native paths for ZEN4 and ZEN5 architectures(in case
  of parallel execution). This would account for additional
  combinations of the input dimensions.

- Moved the call to Tiny-ZGEMM before the BLIS object creation,
  since this code-path operates on raw buffers.

- Added the necessary test-cases for functional and memory testing
  of the newly added kernels.
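The masked fringe handling relies on lane masks such as those passed to AVX-512 intrinsics like _mm512_mask_loadu_pd; the mask construction itself can be shown in plain C (illustrative helper, runs anywhere):

```c
/* Build a lane mask with the low `lanes` bits set (0 < lanes <= 8),
   e.g. for an 8-lane zmm register of doubles. */
static unsigned char fringe_mask(int lanes)
{
    return (unsigned char)((1u << lanes) - 1u);
}
```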

AMD-Internal: [CPUPL-6378][CPUPL-6661]
Change-Id: I9af73d1b6ef82b26503d4fc373111132aee3afd6
2025-04-23 00:56:58 -04:00
Vignesh Balasubramanian
c4b84601da AVX512 optimizations for CGEMM(rank-1 kernel)
- Implemented an AVX512 rank-1 kernel that is
  expected to handle column-major storage schemes
  of A, B and C(without transposition) when k = 1.

- This kernel is single-threaded, and acts as a direct
  call from the BLAS layer for its compatible inputs.

- Defined custom BLAS and BLIS_IMPLI layers for CGEMM
  (instead of using the macro definition), in order to
  integrate the call to this kernel at runtime(based on
  the corresponding architecture and input constraints).

- Added unit-tests for functional and memory testing of the
  kernel.

- Updated the ZEN5 context to include the AVX512 CGEMM
  SUP kernels, with its cache-blocking parameters.
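A scalar reference for the rank-1 (k = 1) case the kernel targets (function name and interface are illustrative, not the AOCL-BLAS symbols): C := alpha * a * b^T + beta * C with column-major storage and no transposition.

```c
#include <complex.h>

/* a is an m-vector (the single column of A), b an n-vector (the single
   row of B); c is m x n column-major with leading dimension ldc. */
static void cgemm_rank1_ref(int m, int n,
                            float complex alpha, const float complex *a,
                            const float complex *b,
                            float complex beta, float complex *c, int ldc)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++)
            c[i + j * ldc] = alpha * a[i] * b[j] + beta * c[i + j * ldc];
}
```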

AMD-Internal: [CPUPL-6498]
Change-Id: I42a66c424325bd117ceb38970726a05e2896a46b
2025-03-06 20:14:05 +05:30
Vignesh Balasubramanian
07df9f471e AVX512 optimizations for CGEMM(SUP)
- Implemented the following AVX512 SUP
  column-preferential kernels(m-variant) for CGEMM :
  Main kernel    : 24x4m
  Fringe kernels : 24x3m, 24x2m, 24x1m,
                   16x4, 16x3, 16x2, 16x1,
                   8x4, 8x3, 8x2, 8x1,
                   fx4, fx3, fx2, fx1(where 0<f<8).

- Utilized the packing kernel to pack A when
  handling inputs with the CRC storage scheme. This
  in turn handles RRC with operation transpose
  in the framework layer.

- Further added C prefetching to the main kernel,
  and updated the cache-blocking parameters for
  the ZEN4 and ZEN5 contexts.

- Added decision logic to choose between the
  SUP and Native AVX512 code-paths for the ZEN4 and ZEN5
  architectures.

- Updated the testing interface for complex GEMMSUP
  to accept the kernel dimension(MR) as a parameter, in
  order to set the appropriate panel stride for functional
  and memory testing. Also updated the existing instantiators
  to send their kernel dimensions as a parameter.

- Added unit tests for functional and memory testing of these
  newly added kernels.

AMD-Internal: [CPUPL-6498]

Change-Id: Ie79d3d0dc7eed7edf30d8d4f74b888135f31d6b4
2025-03-06 06:03:39 -05:00
Vignesh Balasubramanian
99770558bb AVX512 optimizations for CGEMM(Native)
- Implemented the following AVX512 native
  computational kernels for CGEMM :
  Row-preferential    : 4x24
  Column-preferential : 24x4

- The implementations use a common set of macros,
  defined in a separate header. This is due to the
  fact that the implementations differ solely on
  the matrix chosen for load/broadcast operations.

- Added the associated AVX512 based packing kernels,
  packing 24xk and 4xk panels of input.

- Registered the column-preferential kernel(24x4) in
  ZEN4 and ZEN5 contexts. Further updated the cache-blocking
  parameters.

- Removed redundant BLIS object creation and its contingencies
  in the native micro-kernel testing interface(for complex types).
  Added the required unit-tests for memory and functionality
  checks of the new kernels.

AMD-Internal: [CPUPL-6498]
Change-Id: I520ff17dba4c2f9bc277bf33ba9ab4384408ffe1
2025-02-28 03:18:24 -05:00
Vignesh Balasubramanian
327142395b Cleanup for readability and uniformity of Tiny-ZGEMM
- Guarded the inclusion of thresholds(configuration
  headers) using macros, to maintain uniformity in
  the design principles.

- Updated the threshold macro names for every
  micro-architecture.

AMD-Internal: [CPUPL-5895]
Change-Id: I9fd193371c41469d9ef38c37f9c055c21457b56c
2025-01-27 15:48:31 +05:30
Vignesh Balasubramanian
fb6dcc4edb Support for Tiny-GEMM interface(ZGEMM)
- As part of AOCL-BLAS, there exists a set of vectorized
  SUP kernels for GEMM, that are performant when invoked
  in a bare-metal fashion.

- Designed a macro-based interface for handling tiny
  sizes in GEMM that utilizes these kernels. This
  is currently instantiated for the 'Z' datatype (double-precision
  complex).

- Design breakdown :
  - Tiny path requires the usage of AVX2 and/or AVX512
    SUP kernels, based on the micro-architecture. The
    decision logic for invoking tiny-path is specific
    to the micro-architecture. These thresholds are defined
    in their respective configuration directories(header files).

  - List of AVX2/AVX512 SUP kernels(lookup table), and their
    lookup functions are defined in the base-architecture from
    which the support starts. Since we need to support backward
    compatibility when defining the lookup table/functions, they
    are present in the kernels folder(base-architecture).

- Defined a new type to be used to create the lookup table and its
  entries. This type holds the kernel pointer, blocking dimensions
  and the storage preference.

- This design only requires the appropriate thresholds and
  the associated lookup table to be defined to support other datatypes
  and micro-architectures. Thus, it is extensible.

- NOTE : The SUP kernels that are listed for Tiny GEMM are m-var
         kernels. Thus, the blocking in framework is done accordingly.
         In case of adding the support for n-var, the variant
         information could be encoded in the object definition.
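The lookup-table entry type and a lookup function could be shaped roughly as below (field names and the selection rule are assumptions, not the actual AOCL-BLAS definitions):

```c
#include <stddef.h>

typedef void (*tiny_ukr_ft)(void);  /* stand-in for the SUP kernel signature */

/* Entry holding the kernel pointer, blocking dimensions and
   storage preference, as described above. */
typedef struct
{
    tiny_ukr_ft ukr;        /* kernel function pointer */
    int         mr, nr;     /* register blocking dimensions */
    char        stor_pref;  /* preferred storage: 'r' or 'c' */
} tiny_gemm_lut_entry_t;

/* Illustrative lookup: pick the first (widest) kernel whose MR fits m. */
static const tiny_gemm_lut_entry_t *
tiny_lookup(const tiny_gemm_lut_entry_t *lut, size_t n_entries, int m)
{
    for (size_t i = 0; i < n_entries; i++)
        if (lut[i].mr <= m)
            return &lut[i];
    return NULL;
}
```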

- Added test-cases to validate the interface for functionality(API
  level tests). Also added exception value tests, which have been
  disabled due to the SUP kernel optimizations.

AMD-Internal: [CPUPL-6040][CPUPL-6018][CPUPL-5319][CPUPL-3799]
Change-Id: I84f734f8e683c90efa63f2fa79d2c03484e07956
2025-01-24 12:59:26 -05:00
Vignesh Balasubramanian
cdaa2ac7fd Bugfix and optimizations for AVX512 AMAXV micro-kernels
- Bug: The current {S/D}AMAXV AVX512 kernels produced incorrect
  results when the absolute maximum occurs more than once:
  they returned the last index of such occurrences
  instead of the first one.

- Implemented a bug-fix to handle this issue on these AVX512
  kernels. Also ensured that the kernels are compliant with
  the standard when handling exception values.

- Further optimized the code by decoupling the logic to find
  the maximum element from the search for its index. This way,
  lower-latency instructions are used to compute the maximum
  first.
- Updated the unit-tests, exception value tests and early return
  tests for the API to ensure code-coverage.
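A scalar model of the fixed semantics and the decoupled two-pass structure (illustrative, not the vectorized kernel):

```c
#include <math.h>
#include <stddef.h>

/* Pass 1 computes the maximum absolute value with a cheap reduction;
   pass 2 returns the FIRST index attaining it, matching the fixed
   semantics described above. */
static size_t damaxv_first_index(const double *x, size_t n)
{
    if (n == 0)
        return 0;
    double maxabs = fabs(x[0]);
    for (size_t i = 1; i < n; i++) {
        double a = fabs(x[i]);
        if (a > maxabs)
            maxabs = a;
    }
    for (size_t i = 0; i < n; i++)
        if (fabs(x[i]) == maxabs)
            return i;
    return 0;
}
```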

AMD-Internal: [CPUPL-4745]
Change-Id: I2f44d33dbaf89fe19e255af1f934877816940c6f
2025-01-07 22:56:20 +05:30
Vignesh Balasubramanian
609af9bfe2 Threshold tuning for ZGEMM small path
- Updated the threshold check for ZGEMM small path to include
  runtime checks for redirection, specific to the micro-architecture.

- The current ZGEMM small path has only its AVX2 variant available.
  Post implementing an AVX512(same/different algorithm), the thresholds
  will further be fine-tuned.

- Included the dot-product based AVX512 ZGEMM kernels in the ZEN5
  context. It will be used as part of handling RRC and CRC storage
  schemes of C, A and B matrices in both single-thread and multi-thread
  runs.

AMD-Internal: [CPUPL-5949]
Change-Id: Ic8b7cf0e00b7c477f748669f160c4b01df995c75
2024-12-13 12:51:22 -05:00
harsdave
54b46ec1ed Enhance 24x8 DGEMM SUP/Tiny Kernel Performance with Optimized Loops and Edge Kernels
This patch introduces comprehensive optimizations to the DGEMM kernel, focusing on loop
efficiency and edge kernel performance. The following technical improvements have been implemented:

1. **IR Loop Optimization:**
   - The IR loop has been re-implemented in hand-written assembly to eliminate the overhead associated
     with `begin_asm` and `end_asm` calls, resulting in more efficient execution.

2. **JR Loop Integration:**
   - The JR loop is now incorporated into the micro kernel. This integration avoids the repetitive overhead
     of stack frame management for each JR iteration, thereby enhancing loop performance.

3. **Kernel Decomposition Strategy:**
   - The m dimension is decomposed into specific sizes: 20, 18, 17, 16, 12, 11, 10, 9, 8, 4, 2, and 1.
   - For remaining cases, masked variants of edge kernels are utilized to handle the decomposition efficiently.

4. **Interleaved Scaling by Alpha:**
   - Scaling by the alpha factor is interleaved with load instructions to optimize the instruction pipeline
     and reduce latency.

5. **Efficient Mask Preparation:**
   - Masks are prepared within inline assembly code only at points where masked load-store operations are necessary,
     minimizing unnecessary overhead.

6. **Broadcast Instruction Optimization:**
   - In edge kernels where each FMA (Fused Multiply-Add) operation requires a broadcast without subsequent reuse,
     the broadcast instruction is replaced with `mem_1to8`.
   - This allows the compiler to optimize by assigning separate vector registers for broadcasting, thus avoiding
     dependency chains and improving execution efficiency.

7. **C Matrix Update Optimization:**
   - During the update of the C matrix in edge kernels, columns are pre-loaded into multiple vector registers.
     This approach breaks dependency chains during FMA operations following the scaling by alpha, thereby mitigating
     performance bottlenecks and enhancing throughput.

These optimizations collectively improve the performance of the DGEMM kernel, particularly in handling edge cases and
reducing overhead in critical loops. The changes are expected to yield significant performance gains in matrix multiplication
operations.

This patch also involves changes to the tiny gemm interface: a light
interface for calling kernels, and removal of calls to avx2 dgemm kernels,
as we use avx512 dgemm kernels for all sizes on zen4 and zen5.

On zen4 and zen5, when the A matrix is transposed (CRC, RRC), the tiny
kernel does not support such inputs, so they are routed to the
gemm_small path.

AMD-Internal: [CPUPL-6054]
Change-Id: I57b430f9969ca39aa111b54fa169e4225b900c4a
2024-12-13 00:03:00 -05:00
Shubham Sharma.
be6fbadd95 BlockSize Tuning for ZEN4 and ZEN5
- Enabled dynamic blocksizes for DGEMM in ZEN4 and ZEN5 systems.
- MC, KC and NC are dynamically selected at runtime for DGEMM native.
- A local copy of cntx is created and blocksizes are updated in the local cntx.
- Updated threshold for picking DGEMM SUP kernel for ZEN4.

AMD-Internal: [CPUPL-5912]
Change-Id: Ic12a1a48bfa59af26cc17ccfa47a2a33fadde1f6
2024-11-29 03:19:16 -05:00
Shubham Sharma
f2320a1fef Enabled DGEMM row major kernel for ZEN4
- Merged ZEN4 and ZEN5 DGEMM 8x24 kernel.
- Replaced 32x6 kernel with 8x24. Now same
  kernel is used for ZEN4 and ZEN5.
- Blocksizes have been tuned for genoa only.
- DGEMM kernel for DTRSM native code path
  is replaced with 8x24 kernel.
- Enabled alpha scaling during packing for ZEN4.
- ZEN4 8x24 kernel has been removed.

AMD-Internal: [CPUPL-5912]
Change-Id: I89a16a7e3355af037d21d453aabf53c5ecccb754
2024-11-29 08:18:48 +00:00
Vignesh Balasubramanian
06d776b025 AVX512 ZGEMM SUP Inner product kernels
- Implemented a set of column preferential dot-product based
  ZGEMM kernels(main and fringe) in AVX512(for SUP code-path).
  These kernels perform matrix multiplication as a sequence
  of inner products(i.e, dot-products).

- These standalone kernels are expected to strictly handle
  the CRC storage scheme for C, A and B matrices. RRC is also
  supported through operation transpose, at the framework
  level.

- Added unit-tests to test all the kernels(main and fringe),
  as well as the redirection between these kernels.

AMD-Internal: [CPUPL-5949]
Change-Id: I858257ac2658ed9ce4980635874baa1474b79c38
2024-11-06 04:18:57 -05:00
Mangala V
705755bb5c Revert "Using znver2 flags for building zen/zen2/zen3 kernels on amdzen builds."
This reverts commit 7d379c7879.

Reason for revert: < Perf regression is observed for GEMM(gemm_small_At)
                    as fma uses memory operand >

Change-Id: I0ec3a22acaacfaade860c67858be6a2ba6296bce
2024-09-02 09:07:46 -04:00
Edward Smyth
82bdf7c8c7 Code cleanup: Copyright notices
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
  zen, skx and a couple of other kernels to cover all
  contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
  statements.

AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
2024-08-05 15:35:08 -04:00
Eleni Vlachopoulou
7d379c7879 Using znver2 flags for building zen/zen2/zen3 kernels on amdzen builds.
config => config/build/arch folder

Issue:

  1. Performance drop is observed as part of the fat binary(amdzen config)
     built to support all the platforms using dynamic dispatch feature.
  2. Observed only in intrinsic code and not in assembly code.
  3. Observed in many of level1 kernels on Milan and Genoa

Previous Design:

Znver flags are picked based on config or function name

In case of ref_kernels:
   Compiler picks up znver flag based on the function name. All
   ref_kernels are named based on BLIS_CNAME which is a
   config name (zen, zen2, zen3, zen4, zen5)
In case of Zen kernels:
   Compiler picks up znver flag based on the config name where the
   source file exists. All avx2 kernels are placed in zen and all avx512
   kernels are placed in zen4/zen5 folder.

   Kernels placed in zen (AVX2 kernels) are being compiled with the znver1
   flag rather than the znver2/znver3 flags on zen2/zen3 architectures
   respectively.

New Design: For amdzen builds

  1. For ref_kernels and kernels/(zen/zen2/zen3), znver2 flag is used instead of
     znver1 in make and cmake build system.
  2. To use znver2 flags, make_defs.mk of zen2 is included in zen config
  3. No changes are made for auto or any individual config
  4. Significant performance improvement is observed

AMD-Internal : [CPUPL-5407] [CPUPL-5406] [CPUPL-4873]  [CPUPL-4872] [CPUPL-4871]  [CPUPL-4801] [CPUPL-4800] [CPUPL-4799]

Change-Id: Ie817c13b8b69a2dc4328aad7ae09a3af06f83df5
2024-08-05 14:27:01 +05:30
Varaganti, Kiran
145e706992 Fixed auxiliary cache block sizes for Native and SUP DGEMM kernels for ZEN4 and ZEN5 configs.
Auxiliary blocksize values for cache blocksizes are interpreted as the maximum
cache blocksizes. The maximum cache blocksizes are a convenient and portable
way of smoothing performance of the level-3 operations when computing with a
matrix operand that is just slightly larger than a multiple of the preferred
cache blocksize in that dimension. In these "edge cases," iterations run with
highly sub-optimal blocking. We can address this problem by merging the "edge
case" iteration with the second-to-last iteration, such that the cache
blocksizes are slightly larger--rather than significantly smaller--than
optimal. The maximum cache blocksizes allow the developer to specify the
maximum size of this merged iteration; if the edge case causes the merged
iteration to exceed this maximum, then the edge case is not merged and is
instead computed in a separate (final) iteration.
(https://github.com/flame/blis/blob/master/docs/ConfigurationHowTo.md)
      In bli_cntx_init_zen4 and zen5, the auxiliary blocksize for KC was less
      than the primary blocksize. These are fixed.
      Code cleanup of the files bli_family_zen4.h and bli_family_zen5.h:
      removed unused constants.
Thanks to Igor Kozachenko <igork@berkeley.edu> for pointing out these two bugs.
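The maximum-blocksize rule described above can be condensed to the following (a simplified sketch of the logic in BLIS's bli_determine_blocksize; names are illustrative):

```c
/* Return the blocksize to use at offset i along a dimension of extent dim,
   given the algorithmic blocksize b_alg and the maximum b_max. The trailing
   edge is merged into the final iteration whenever what remains still fits
   within b_max, so the last block is slightly larger instead of tiny. */
static int blocksize_at(int i, int dim, int b_alg, int b_max)
{
    int left = dim - i;
    if (left <= b_max)
        return left;    /* merged (final) iteration */
    return b_alg;       /* interior iteration */
}
```

For example, with dim = 1040, b_alg = 256 and b_max = 300, the iterations are 256, 256, 256, 272 rather than 256, 256, 256, 256, 16.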

Change-Id: I44fc564d5d91cb978d062c413e70751aeaa07f2c
2024-08-05 10:29:43 +05:30
Mangala V
0a4f9d5ac1 Removed -fno-tree-loop-vectorize from kernel flags
- This change is made in the Make build system.
- Removed -fno-tree-loop-vectorize from global kernel flags,
  instead added it to lpgemm specific kernels only.
- If this flag is not used, gcc tries to auto-vectorize
  the code, which results in usage of vector registers.
  If the auto-vectorized function also uses intrinsics,
  the total number of vector registers used by the intrinsic
  and auto-vectorized code becomes more than the registers
  available in the machine, which causes reads and writes
  to the stack and a regression in lpgemm.
- If this flag is enabled globally, then the files which
  do not use any intrinsic code do not get auto-vectorized.
- To get optimal performance for both blis and lpgemm,
  this flag is enabled for lpgemm kernels only.

Previous commit (75df1ef218) contains
similar changes in the CMake build system.

AMD-Internal: [CPUPL-5544]

Change-Id: I796e89f3fb2116d64c3a78af2069de20ce92d506
2024-08-02 09:40:06 -04:00
Shubham Sharma
0d95fcf20c Revert "DGEMM Native AVX512 updates"
This reverts commit f378fc57b5.

Reason for revert: Causing Failure

AMD-Internal: [CPUPL-5262]
Change-Id: I15860eabf2461fae3d0f7cedd436d4db2df5b82f
2024-08-02 07:32:28 -04:00
Ruchika Ashtankar
92fbd04238 DGEMM SUP Optimizations for Turin
- Introduced a new 24x8 column preferred DGEMM sup kernel for zen5.
- The prefetch logic is modified compared to the zen4 24x8 sup kernels.
- Earlier, the next panel of A was prefetched into the L2 cache;
  this is now changed to prefetching the second-next column
  of the current panel of A into the L1 cache.
- B and C prefetches are enabled and unchanged.
- Tuned MC, KC and NC block sizes for new kernel.

AMD-Internal: [CPUPL-5262]
Change-Id: If933537e50f43f5560e0fe18a716aa1e36ced64d
2024-08-02 04:00:51 -04:00
Ruchika Ashtankar
5760e06100 Threshold tuning for DGEMM SUP for zen5
- New decision threshold constants are added to decide between the
double-precision sup and native dgemm code-paths for zen5 processors.
- The decision is based on the values of m, n and k.

AMD-Internal: [CPUPL-5262]
Change-Id: I87b8ff9eb603d6fda0875e000f7ab83b22d22040
2024-08-02 11:34:32 +05:30
Shubham Sharma.
f378fc57b5 DGEMM Native AVX512 updates
- In the initial patch, for m and n not a multiple of MR and NR
  respectively, we called bli_dgemm_ker_var2. Now the
  macro-kernel handles these fringe cases as well.
- Replaced RBP register with R11 in the macro-kernel.
- Retuned MC, KC and NC with these new changes.
  This will result in better performance for matrix sizes
  like m=4000 or greater when running on single thread.


AMD-Internal: [CPUPL-5262]
Change-Id: I66c111ceb7feee776703339680d57e8d6d5c809a
2024-07-31 12:23:34 -04:00
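The fringe cases mentioned above arise from tiling C into MR x NR micro-tiles. A minimal sketch of that decomposition, assuming MR=8 and NR=24 as in the commit (the function name and return shape are illustrative, not BLIS API):

```python
def tile_counts(m, n, MR=8, NR=24):
    """Decompose an m x n matrix C into full MR x NR micro-tiles plus
    fringe (partial) tiles along the right and bottom edges (sketch)."""
    full_i, m_left = divmod(m, MR)   # full tile rows, leftover rows
    full_j, n_left = divmod(n, NR)   # full tile cols, leftover cols
    return {
        "full": full_i * full_j,
        "right_fringe": full_i if n_left else 0,
        "bottom_fringe": full_j if m_left else 0,
        "corner_fringe": 1 if (m_left and n_left) else 0,
    }
```

When m and n are exact multiples of MR and NR, every tile is full and no fringe kernel is needed; otherwise the edge tiles are what the macro-kernel now handles directly.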
Shubham Sharma
16c56e0101 Added 24x8 triangular kernels for DGEMMT SUP
- In order to reuse 24x8 AVX512 DGEMM SUP kernels,
   24x8 triangular AVX512 DGEMMT SUP kernels are added.
 - Since the LCM of MR (24) and NR (8) is 24, the diagonal
   pattern repeats every 24x24 block of C. To cover this 24x24 block,
   3 kernels are needed for one variant of DGEMMT; a total of 6
   kernels are needed to cover both upper and lower variants.
 - In order to maximize code reuse, the 24x8 kernels are broken
   into two parts, 8x8 diagonal GEMM and 16x8 full GEMM. The 8x8
   diagonal GEMM is computed by 8x8 diagonal kernel, and 16x8
   full GEMM part is computed by 24x8 DGEMM SUP kernel.
 - Changes are made in framework to enable the use of these kernels.

AMD-Internal: [CPUPL-5338]
Change-Id: I8e7007031e906f786b0c4fe12377ee439075207a
2024-07-22 12:02:30 -04:00
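The LCM argument in the commit can be checked directly: with MR=24 and NR=8, the diagonal's position relative to a micro-tile repeats with period lcm(24, 8) = 24, so one 24x24 block of C exhibits all distinct micro-tile shapes, and that block spans 24/8 = 3 column panels (hence 3 kernels per variant). A small sketch of the arithmetic:

```python
from math import gcd

def lcm(a, b):
    """Least common multiple."""
    return a * b // gcd(a, b)

MR, NR = 24, 8
period = lcm(MR, NR)   # diagonal pattern of C repeats every 24 rows/cols
panels = period // NR  # 3 NR-wide panels inside one 24x24 block
```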
Vignesh Balasubramanian
b48e864e82 AVX512 optimizations for DAXPBYV API
- Implemented AVX512 computational kernel for DAXPBYV
  with optimal unrolling. Further implemented the other
  missing kernels that would be required to decompose
  the computation in special cases, namely the AVX512
  DADDV and DSCAL2V kernels.

- Updated the zen4 and zen5 contexts to ensure any query
  to acquire the kernel pointer for DAXPBYV returns the
  address of the new kernel.

- Added micro-kernel units tests to GTestsuite to check
  for functionality and out-of-bounds reads and writes.

AMD-Internal: [CPUPL-5406][CPUPL-5421]
Change-Id: I127ab21174ddd9e6de2c30a320e62a8b042cbde6
2024-07-22 11:32:19 +05:30
Shubham Sharma
75df1ef218 Removed -fno-tree-loop-vectorize from kernel flags
- This change is made in the CMake build system only.
- Removed -fno-tree-loop-vectorize from global kernel flags;
  instead added it to lpgemm-specific kernels only.
- If this flag is not used, then GCC tries to auto-vectorize
  the code, which uses vector registers. If the auto-vectorized
  function also uses intrinsics, the total number of vector
  registers used by intrinsic and auto-vectorized code exceeds
  the registers available on the machine, causing reads and
  writes to stack, which causes a regression in lpgemm.
- If this flag is enabled globally, then the files which
  do not use any intrinsic code do not get auto
  vectorized.
- To get optimal performance for both blis and lpgemm,
  this flag is enabled for lpgemm kernels only.

Change-Id: I14e5c18cd53b058bfc9d764a8eaf825b4d0a81c4
2024-07-19 00:49:52 -04:00
Arnav Sharma
4aa66f108e Added CSCALV AVX512 Kernel
- Added CSCALV kernel utilizing the AVX512 ISA.

- Added function pointers for the same to zen4 and zen5 contexts.

- Updated the BLAS interface to invoke respective CSCALV kernels based
  on the architecture.

- Added UKR tests for bli_cscalv_zen_int_avx512( ... ).

AMD-Internal: [CPUPL-5299]
Change-Id: I189d87a1ec1a6e30c16e05582dcb57a8510a27f3
2024-07-15 07:17:43 -04:00
Shubham Sharma.
a7744361e4 DGEMM optimizations for Turin Classic
- Introduced new 8x24 macro kernels.
   - 4 new kernels are added for beta 0, beta 1, beta -1
      and beta N.
   - IR and JR loop moved to ASM region.
   - Kernels support row major storage scheme.
   - Prefetch of current micro panel of C is enabled.
   - Kernel supports negative offsets for A and B matrices.
 - Moved alpha scaling from DGEMM kernel to B pack kernel.
 - Tuned blocksizes for new kernel.
 - Added support for alpha scaling in 24xk pack kernel.
 - Reverted back to old b_next computation
   in gemm_ker_var2.
 - BugFix in 8x24 DGEMM kernel for beta 1:
   the comparison for jump conditions was done using integer
   instructions, which caused the beta 1 path to never be taken.
   Fixed this by changing the comparison to double.

AMD-Internal: [CPUPL-5262]
Change-Id: Ieec207eea2a164603c8a8ea88e0b1d3095c29a3f
2024-07-09 07:53:27 -04:00
Hari Govind S
627bf0b1ba Implemented Multithreading and Enabled AVX512 Kernel for ZAXPY API
-  Replaced the 'bli_zaxpyv_zen_int5' kernel with the optimised
   'bli_zaxpyv_zen_int_avx512' kernel for the zen4 and
   zen5 configs.

-  Implemented multithreading support and AOCL-dynamic
   for ZAXPY API.

-  Utilized 'bli_thread_range_sub' function to achieve
   better work distribution and avoid false sharing.

AMD-Internal: [CPUPL-5250]
Change-Id: I46ad8f01f9d639e0baa78f4475d6e86458d8069b
2024-07-09 01:29:53 -04:00
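The even work distribution the commit attributes to 'bli_thread_range_sub' can be sketched as follows. This is an illustrative partitioning scheme (n units split among t threads, with the remainder spread over the leading threads), not the actual BLIS implementation:

```python
def thread_range(n, n_threads, tid):
    """Return the [start, end) sub-range of n work units assigned to
    thread tid, distributing the remainder one unit at a time to the
    first (n mod n_threads) threads (sketch of even partitioning)."""
    base, extra = divmod(n, n_threads)
    start = tid * base + min(tid, extra)
    end = start + base + (1 if tid < extra else 0)
    return start, end
```

Adjacent threads get contiguous, nearly equal ranges, which keeps cache lines from being shared across threads and avoids false sharing.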
Edward Smyth
8de8dc2961 Merge commit '81e10346' into amd-main
* commit '81e10346':
  Alloc at least 1 elem in pool_t block_ptrs. (#560)
  Fix insufficient pool-growing logic in bli_pool.c. (#559)
  Arm SVE C/ZGEMM Fix FMOV 0 Mistake
  SH Kernel Unused Eigher
  Arm SVE C/ZGEMM Support *beta==0
  Arm SVE Config armsve Use ZGEMM/CGEMM
  Arm SVE: Update Perf. Graph
  Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0
  Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0
  A64FX Config Use ZGEMM/CGEMM
  Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg
  Arm SVE Add SGEMM 2Vx10 Unindexed
  Arm SVE ZGEMM Support Gather Load / Scatt. St.
  Arm SVE Add ZGEMM 2Vx10 Unindexed
  Arm SVE Add ZGEMM 2Vx7 Unindexed
  Arm SVE Add ZGEMM 2Vx8 Unindexed
  Update Travis CI badge
  Armv8 Trash New Bulk Kernels
  Enable testing 1m in `make check`.
  Config ArmSVE Unregister 12xk. Move 12xk to Old
  Revert __has_include(). Distinguish w/ BLIS_FAMILY_**
  Register firestorm into arm64 Metaconfig
  Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo
  Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo
  Add test for Apple M1 (firestorm)
  Firestorm CPUID Dispatcher
  Armv8 GEMMSUP Edge Cases Require Signed Ints
  Make error checking level a thread-local variable.
  Fix data race in testsuite.
  Update .appveyor.yml
  Firestorm Block Size Fixes
  Armv8 Handle *beta == 0 for GEMMSUP ??r Case.
  Move unused ARM SVE kernels to "old" directory.
  Add an option to control whether or not to use @rpath.
  Fix $ORIGIN usage on linux.
  Arm micro-architecture dispatch (#344)
  Use @path-based install name on MacOS and use relocatable RPATH entries for testsuite binaries.
  Armv8 Handle *beta == 0 for GEMMSUP ?rc Case.
  Armv8 Fix 6x8 Row-Maj Ukr
  Apply patch from @xrq-phys.
  Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs.
  bli_error: more cleanup on the error strings array
  Arm SVE Exclude SVE-Intrinsic Kernels for GCC 8-9
  Arm SVE: Correct PACKM Ker Name: Intrinsic Kers
  Fix config_name in bli_arch.c
  Arm Whole GEMMSUP Call Route is Asm/Int Optimized
  Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref
  Header Typo
  Arm: DGEMMSUP ??r(rv) Invoke Edge Size
  Arm: DGEMMSUP ?rc(rd) Invoke Edge Size
  Arm: Implement GEMMSUP Fallback Method
  Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin
  Added Apple Firestorm (A14/M1) Subconfig
  Arm64 8x4 Kernel Use Less Regs
  Armv8-A Supplimentary GEMMSUP Sizes for RD
  Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm
  Armv8-A Adjust Types for PACKM Kernels
  Armv8-A GEMMSUP-RD 6x8m
  Armv8-A GEMMSUP-RD 6x8n
  Armv8-A s/d Packing Kernels Fix Typo
  Armv8-A Introduced s/d Packing Kernels
  Armv8-A DGEMMSUP 6x8m Kernel
  Armv8-A DGEMMSUP Adjustments
  Armv8-A Add More DGEMMSUP
  Armv8-A Add GEMMSUP 4x8n Kernel
  Armv8-A Add Part of GEMMSUP 8x4m Kernel
  Armv8A DGEMM 4x4 Kernel WIP. Slow
  Armv8-A Add 8x4 Kernel WIP

AMD-Internal: [CPUPL-2698]
Change-Id: I194ff69356740bb36ca189fd1bf9fef02eec3803
2024-06-25 05:48:46 -04:00
mkadavil
a5c4a8c7e0 Int4 B matrix reordering support in LPGEMM.
Support for reordering the B matrix of datatype int4 as per the pack schema
requirements of the u8s8s32 kernel. Vectorized int4_t -> int8_t conversion is
implemented by leveraging the vpmultishiftqb instruction. The reordered
B matrix is then used in the u8s8s32o<s32|s8> API.

AMD-Internal: [SWLCSG-2390]
Change-Id: I3a8f8aba30cac0c4828a31f1d27fa1b45ea07bba
2024-06-24 07:55:34 -04:00
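The int4_t -> int8_t conversion that the commit vectorizes with vpmultishiftqb amounts to splitting each byte into two nibbles and sign-extending each to 8 bits. A scalar reference sketch (nibble order low-first is an assumption here; the actual pack schema defines the layout):

```python
def unpack_int4(packed):
    """Unpack bytes holding two signed 4-bit values each (low nibble
    first, assumed for illustration) into a list of int8 values."""
    out = []
    for byte in packed:
        for nib in (byte & 0x0F, (byte >> 4) & 0x0F):
            # Sign-extend 4 bits to 8: values 8..15 map to -8..-1.
            out.append(nib - 16 if nib >= 8 else nib)
    return out
```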
Vignesh Balasubramanian
6165001658 Bugfix and optimizations for ?AXPBYV API
- Updated the existing code-path for ?AXPBYV to
  reroute the inputs to the appropriate L1 kernel,
  based on the alpha and beta value. This is done
  in order to utilize sensible optimizations with
  regards to the compute and memory operations.

- Updated the typed API interface for ?AXPBYV to include
  an early exit condition (when n is 0, or when alpha is
  0 and beta is 1). Further updated this layer to query
  the right kernel from the context, based on the input
  values of alpha and beta.

- Added the necessary L1 vector kernels (i.e., ?SETV, ?ADDV,
  ?SCALV, ?SCAL2V and ?COPYV) to be used as part of special
  case handling in ?AXPBYV.

- Moved the early return with negative increments from ?SCAL2V
  kernels to its typed API interface.

- Updated the zen, zen2 and zen3 context to include function
  pointers for all these vector kernels.

- Updated the existing ?AXPBYV vector kernels to handle only
  the required computation. Additional cleanup was done to
  these kernels.

- Added accuracy and memory tests for AVX2 kernels of the ?SETV,
  ?COPYV, ?ADDV, ?SCALV, ?SCAL2V, ?AXPYV and ?AXPBYV APIs.

- Updated the existing thresholds in ?AXPBYV tests for complex
  types. This is due to the fact that every complex multiplication
  involves two mul ops and one add op. Further added test-cases
  for API level accuracy check, that includes special cases of
  alpha and beta.

- Decomposed the reference call to ?AXPBYV into several other
  L1 BLAS APIs (in case the reference does not support its own
  ?AXPBYV API). The decomposition is done to match the exact
  operations that are done in BLIS based on alpha and/or beta
  values. This ensures that we test for our own compliance.

AMD-Internal: [CPUPL-4861]
Change-Id: Ia6d48f12f059f52b31c0bef6c75f47fd364952c6
2024-06-20 16:22:07 +05:30
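The alpha/beta routing described above (y := alpha*x + beta*y dispatched to simpler L1 kernels for special values) can be sketched as follows. This is an illustrative dispatch in plain Python, not the BLIS context-based kernel lookup:

```python
def axpbyv(alpha, beta, x, y):
    """Compute y := alpha*x + beta*y, routed to simpler vector
    operations for special alpha/beta values (sketch)."""
    n = len(x)
    if n == 0 or (alpha == 0 and beta == 1):
        return list(y)                                     # early exit
    if alpha == 0 and beta == 0:
        return [0.0] * n                                   # setv
    if alpha == 0:
        return [beta * yi for yi in y]                     # scalv
    if beta == 0:
        return [alpha * xi for xi in x]                    # scal2v
    if beta == 1:
        return [alpha * xi + yi for xi, yi in zip(x, y)]   # axpyv
    return [alpha * xi + beta * yi for xi, yi in zip(x, y)]
```

Each special case avoids either a multiply or a load of one operand, which is why routing to the dedicated kernel is cheaper than always running the full fused update.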
Shubham Sharma.
580282e655 DGEMM optimizations for Turin Classic
- Introduced new 8x24 row preferred kernel for zen5.
  - Kernel supports row/col/gen
    storage schemes.
  - Prefetch of current panel of A and C
    are enabled.
  - Prefetch of next panel of B is enabled.
  - Kernel supports negative offsets for A and B
    matrices.
- Cache block tuning is done for zen5 core.

AMD-Internal: [CPUPL-5262]
Change-Id: I058ea7e1b751c20c516d7b27a1f27cef96ef730f
2024-06-17 05:18:49 -04:00
Mangala V
64d9c96d45 ZGEMMT SUP: AVX512 GEMMT code for Upper variant
1. Enabled AVX512 path for
   -  Upper variant
   -  Different storage schemes for upper and lower variant

2. Modified mask value to handle all fringe cases correctly

AMD_Internal: [CPUPL-5091]

Change-Id: I4bf8aca24c1b87fff606deb05918b8e6216b729e
2024-05-15 13:08:32 +05:30
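The fringe-case mask mentioned in the commit is the standard AVX512 k-mask pattern: select only the first n_left lanes of a vector when a panel edge leaves a partial column. A minimal sketch of the mask arithmetic (vec_len here is illustrative; the real lane count depends on the datatype per zmm register):

```python
def fringe_mask(n_left, vec_len=8):
    """Compute an AVX512-style k-mask with the low n_left bits set,
    selecting the first n_left of vec_len lanes (sketch)."""
    assert 0 <= n_left <= vec_len
    return (1 << n_left) - 1
```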
Hari Govind S
61d0f3b873 Additional optimisations on COPYV API
-  Reduced number of jump operations in AVX512
   assembly kernel for SCOPYV, DCOPYV and ZCOPYV.

-  Fixed memory test failure for bli_zcopyv_zen_int_avx512
   kernel.

-  Replaced existing AVX2 COPYV intrinsic kernels in
   bli_cntx_init_zen5.c with AVX512 assembly kernels.

Change-Id: Idc11601b526d6d82cfbdf63af2fd331918b31159
2024-05-10 07:22:04 -04:00
Arnav Sharma
cb27fad49c ZSCALV AVX512 Kernel
- Implemented ZSCALV kernel utilizing AVX512 intrinsics.

- Gtestsuite: Added ukr tests for the new kernel.

AMD-Internal: [CPUPL-5012]
Change-Id: I75c7f4448ddd60b0f9afa53936eed37f5f99eeb2
2024-05-08 11:55:13 -04:00
Arnav Sharma
1dbeee4d19 ZDOTV AVX512 Kernel with MT Support
- Added AVX512 kernel for ZDOTV.

- Multithreaded both ZDOTC and ZDOTU with AOCL_DYNAMIC support.

AMD-Internal: [CPUPL-5011]
Change-Id: I56df9c07ab3b8df06267a99835b088dcada81bd8
2024-05-08 04:54:05 -04:00
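The two ZDOT variants multithreaded above differ only in whether the first operand is conjugated. A reference sketch of the semantics (plain Python, not the BLIS kernels):

```python
def zdotu(x, y):
    """Unconjugated complex dot product: sum of x[i] * y[i]."""
    return sum(xi * yi for xi, yi in zip(x, y))

def zdotc(x, y):
    """Conjugated complex dot product: sum of conj(x[i]) * y[i]."""
    return sum(xi.conjugate() * yi for xi, yi in zip(x, y))
```

Because each thread's partial sum is independent, the reduction parallelizes naturally; AOCL_DYNAMIC then decides how many threads are worth spawning for a given n.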