amd/blis

mirror of https://github.com/amd/blis.git synced 2026-06-30 03:07:23 +00:00

Author	SHA1	Message	Date
Edward Smyth	1f0fb05277	Code cleanup: Copyright notices (2) More changes to standardize copyright formatting and correct years for some files modified in recent commits. AMD-Internal: [CPUPL-5895] Change-Id: Ie95d599710c1e0605f14bbf71467ca5f5352af12	2025-02-07 05:41:44 -05:00
Shubham Sharma	bac0fed3cf	Fixed Bug in Dynamic Blocksizes - Mixed precision datatypes use a modified cntx. - For some variants of mixed precision, complex and real blocksizes need to be the same. This is achieved by creating a local copy of cntx and copying complex blocksizes onto real blocksizes. - When dynamic blocksizes are used, the changes made to the blocksizes for mixed precision are overwritten by changes made by dynamic blocksizes. - This mismatch between complex and real blocksizes causes an issue where the pack buffer is allocated based on complex blocksizes but the amount of data packed is based on real blocksizes. - This makes the pack buffer sizes smaller than the required sizes. - To fix this, dynamic blocksizes are disabled for mixed precision. AMD-Internal: [CPUPL-6384] Change-Id: Ib9792f90b4ea113e54059a0da8fb4241622b5f83	2025-02-05 01:09:24 -05:00
harsh dave	c5e842e8d3	Revert changes to force dgemm inputs under threshold to arch-specific kernels in single-threaded mode - This patch reverts the previous changes that removed the enforcement of dgemm inputs under a certain threshold to be processed by kernels selected based on architecture ID and handled in single-threaded mode. - This change is now forcing such small inputs to be computed in tiny path. Previously when this check was not there, it was routing these inputs to SUP path and causing performance regression due to framework overhead. AMD-Internal: [CPUPL-5927] Change-Id: I4a4b21fdcf7c3ffaa09efa46ba12798eca0f10bb	2025-01-29 04:22:16 -05:00
Vignesh Balasubramanian	fb6dcc4edb	Support for Tiny-GEMM interface(ZGEMM) - As part of AOCL-BLAS, there exists a set of vectorized SUP kernels for GEMM, that are performant when invoked in a bare-metal fashion. - Designed a macro-based interface for handling tiny sizes in GEMM, that would utilize these kernels. This is currently instantiated for 'Z' datatype(double-precision complex). - Design breakdown : - Tiny path requires the usage of AVX2 and/or AVX512 SUP kernels, based on the micro-architecture. The decision logic for invoking tiny-path is specific to the micro-architecture. These thresholds are defined in their respective configuration directories(header files). - List of AVX2/AVX512 SUP kernels(lookup table), and their lookup functions are defined in the base-architecture from which the support starts. Since we need to support backward compatibility when defining the lookup table/functions, they are present in the kernels folder(base-architecture). - Defined a new type to be used to create the lookup table and its entries. This type holds the kernel pointer, blocking dimensions and the storage preference. - This design would only require the appropriate thresholds and the associated lookup table to be defined for the other datatypes and micro-architecture support. Thus, it is extensible. - NOTE : The SUP kernels that are listed for Tiny GEMM are m-var kernels. Thus, the blocking in framework is done accordingly. In case of adding the support for n-var, the variant information could be encoded in the object definition. - Added test-cases to validate the interface for functionality(API level tests). Also added exception value tests, which have been disabled due to the SUP kernel optimizations. AMD-Internal: [CPUPL-6040][CPUPL-6018][CPUPL-5319][CPUPL-3799] Change-Id: I84f734f8e683c90efa63f2fa79d2c03484e07956	2025-01-24 12:59:26 -05:00
harsh dave	aeda581539	Fix: Rename bli_tiny_gemm.c to bli_tiny_gemm_amd.c When compiling with config generic (or any non-zen build), the bli_dgemm_tiny_6x8 kernel is not defined. Since bli_dgemm_tiny() is only used within amd specific file, bli_tiny_gemm.c has been renamed to bli_tiny_gemm_amd.c to reflect its specific usage. Thanks to Smyth, Edward<edward.smyth@amd.com> for identifying and helping to fix the issue. Change-Id: If5d134aeba6d30d0a51e6d7d6fa9b3c4450a3307	2025-01-07 23:29:32 -05:00
Vignesh Balasubramanian	609af9bfe2	Threshold tuning for ZGEMM small path - Updated the threshold check for ZGEMM small path to include runtime checks for redirection, specific to the micro-architecture. - The current ZGEMM small path has only its AVX2 variant available. Post implementing an AVX512(same/different algorithm), the thresholds will further be fine-tuned. - Included the dot-product based AVX512 ZGEMM kernels in the ZEN5 context. It will be used as part of handling RRC and CRC storage schemes of C, A and B matrices in both single-thread and multi-thread runs. AMD-Internal: [CPUPL-5949] Change-Id: Ic8b7cf0e00b7c477f748669f160c4b01df995c75	2024-12-13 12:51:22 -05:00
harsdave	54b46ec1ed	Enhance 24x8 DGEMM SUP/Tiny Kernel Performance with Optimized Loops and Edge Kernels This patch introduces comprehensive optimizations to the DGEMM kernel, focusing on loop efficiency and edge kernel performance. The following technical improvements have been implemented: 1. IR Loop Optimization: - The IR loop has been re-implemented in hand-written assembly to eliminate the overhead associated with `begin_asm` and `end_asm` calls, resulting in more efficient execution. 2. JR Loop Integration: - The JR loop is now incorporated into the micro kernel. This integration avoids the repetitive overhead of stack frame management for each JR iteration, thereby enhancing loop performance. 3. Kernel Decomposition Strategy: - The m dimension is decomposed into specific sizes: 20, 18, 17, 16, 12, 11, 10, 9, 8, 4, 2, and 1. - For remaining cases, masked variants of edge kernels are utilized to handle the decomposition efficiently. 4. Interleaved Scaling by Alpha: - Scaling by the alpha factor is interleaved with load instructions to optimize the instruction pipeline and reduce latency. 5. Efficient Mask Preparation: - Masks are prepared within inline assembly code only at points where masked load-store operations are necessary, minimizing unnecessary overhead. 6. Broadcast Instruction Optimization: - In edge kernels where each FMA (Fused Multiply-Add) operation requires a broadcast without subsequent reuse, the broadcast instruction is replaced with `mem_1to8`. - This allows the compiler to optimize by assigning separate vector registers for broadcasting, thus avoiding dependency chains and improving execution efficiency. 7. C Matrix Update Optimization: - During the update of the C matrix in edge kernels, columns are pre-loaded into multiple vector registers. This approach breaks dependency chains during FMA operations following the scaling by alpha, thereby mitigating performance bottlenecks and enhancing throughput. These optimizations collectively improve the performance of the DGEMM kernel, particularly in handling edge cases and reducing overhead in critical loops. The changes are expected to yield significant performance gains in matrix multiplication operations. This patch also involves changes for the tiny gemm interface. It adds a light interface for calling kernels and removes calls to avx2 dgemm kernels, as we use avx512 dgemm kernels for all the sizes for zen4 and zen5. For zen4 and zen5, when the A matrix is transposed (CRC, RRC), the tiny kernel does not have the support to handle such inputs and thus such inputs are routed to the gemm_small path. AMD-Internal: [CPUPL-6054] Change-Id: I57b430f9969ca39aa111b54fa169e4225b900c4a	2024-12-13 00:03:00 -05:00
Shubham Sharma.	be6fbadd95	BlockSize Tuning for ZEN4 and ZEN5 - Enabled dynamic blocksizes for DGEMM in ZEN4 and ZEN5 systems. - MC, KC and NC are dynamically selected at runtime for DGEMM native. - A local copy of cntx is created and blocksizes are updated in the local cntx. - Updated threshold for picking DGEMM SUP kernel for ZEN4. AMD-Internal: [CPUPL-5912] Change-Id: Ic12a1a48bfa59af26cc17ccfa47a2a33fadde1f6	2024-11-29 03:19:16 -05:00
Shubham Sharma	f2320a1fef	Enabled DGEMM row major kernel for ZEN4 - Merged ZEN4 and ZEN5 DGEMM 8x24 kernel. - Replaced 32x6 kernel with 8x24. Now same kernel is used for ZEN4 and ZEN5. - Blocksizes have been tuned for genoa only. - DGEMM kernel for DTRSM native code path is replaced with 8x24 kernel. - Enabled alpha scaling during packing for ZEN4. - ZEN4 8x24 kernel has been removed. AMD-Internal: [CPUPL-5912] Change-Id: I89a16a7e3355af037d21d453aabf53c5ecccb754	2024-11-29 08:18:48 +00:00
Shubham Sharma	266bd32dea	Enable fringe case handling in DGEMM ZEN5 macro kernel - Generic kernel is used if N is not multiple of NR or M is not multiple of MR. - This limits the maximum values of NR that can be used. - Support for fringe case handling is added in DGEMM macro kernel so that macro kernel can be used for all problem sizes. AMD-Internal: [CPUPL-5912] Change-Id: I85c17e91d7511bb35ffed0f346d6ff0376baf62f	2024-11-29 00:22:33 -05:00
Shubham Sharma	b3b56ae3bb	BugFix: Fixed GEMM mixed precision failure in ZEN5 - Optimized DGEMM macro kernel does not support mixed precision. - This kernel was being used for solving some of the mixed precision problems. - Currently only ( bli_obj_elem(A) == 8 ) is used for checking if the problem being solved is mixed precision. - bli_obj_elem(A) will be equal to 8 for both double precision data type and mixed precision case single-complex. - Added extra checks (bli_obj_is_real( a )) to make sure that A and B are real and DGEMM macro kernel is being used only for DDDGEMM. AMD-Internal: [CPUPL-5804] Change-Id: Iaa1accf8d851d11533f8ba31dc0235fbc14f89a9	2024-09-19 04:54:53 -04:00
Shubham Sharma.	1a17e95332	Fixed failures for mixed precision on ZEN5 in GEMM - Optimized macro kernel (bli_dgemm_avx512_asm_8x24_macro_kernel) for zen5 does not support alpha scaling. Alpha scaling is supported by zen5 micro kernel (bli_dgemm_avx512_asm_8x24). - Optimized macro kernel expects alpha scaling to be done during packing. The packing kernel used for mixed precision does not support alpha scaling. Therefore, the optimized Zen5 macro kernel is not compatible with existing packing logic. - Changes have been made to use the generic macro kernel, which in turn uses the zen5 micro kernel, for mixed precision, since the micro kernel supports alpha scaling. AMD-Internal: [CPUPL-5058] Change-Id: I1bfeb32ae07eedafadad7dd2c62d63913a46e446	2024-08-20 00:40:12 -04:00
Edward Smyth	7fff7b4026	Code cleanup: Miscellaneous fixes - Delete unused cmake files. - Add guards around call to bli_cpuid_is_avx2fma3_supported in frame/3/bli_l3_sup.c, currently assumes that non-x86 platforms will not use bli_gemmtsup. - Correct variable in frame/base/bli_arch.c on non-x86 builds. - Add guards around omp pragma to avoid possible gcc compiler warning in kernels/zen/2/bli_gemv_zen_int_4.c. - Add missing registers in clobber list in kernels/zen4/1/bli_dotv_zen_int_avx512.c. - Add gtestsuite ERS_IIT tests for TRMV, copied from TRSV. - Correct calls to cblas_{c,z}swap in gtestsuite. - Correct test name in ddotxf gtestsuite program. AMD-Internal: [CPUPL-4415] Change-Id: I69ad56390017676cc609b4d3aba3244a2df6a6b5	2024-08-06 06:56:01 -04:00
Edward Smyth	89f52a6df5	Code cleanup: spelling corrections Corrections for spelling and other mistakes in code comments and doc files. AMD-Internal: [CPUPL-4500] Change-Id: I33e28932b0e26bbed850c55602dee12fd002da7f	2024-08-05 16:18:51 -04:00
Edward Smyth	82bdf7c8c7	Code cleanup: Copyright notices - Standardize formatting (spacing etc). - Add full copyright to cmake files (excluding .json) - Correct copyright and disclaimer text for frame and zen, skx and a couple of other kernels to cover all contributors, as is commonly used in other files. - Fixed some typos and missing lines in copyright statements. AMD-Internal: [CPUPL-4415] Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371	2024-08-05 15:35:08 -04:00
Shubham Sharma.	dac0524ac8	BugFix in AVX512 DGEMMT SUP ST RRC variant - C<- alpha * op(A) op(B) + beta C. C(nxn) - A(n x k) * B(k x n) For ZEN4 and ZEN5 DGEMM is col-preferred kernel DGEMMT = DGEMM + DGEMMT DGEMM is col-preferred and DGEMMT is row-preferred. DGEMM is evaluated as C = AB (all col-storage) whereas DGEMMT is evaluated as C = B A (row-storage). When A is packed it is packed as row-panels with col-stored elements. So DGEMM is evaluated as C = AB (A is col-stored) it aligns with col-stored preference. For DGEMMT: C = B A, here A will become col-stored because of packing and as a result it will break the DGEMMT kernel assumption that A is row-storage. - Fixed this by disabling this optimization for ZEN4 and ZEN5. AMD-Internal: [CPUPL-5542] Change-Id: I9645624be009d1050ecb908d65c04aadcfa04379	2024-08-05 05:21:24 -04:00
Shubham Sharma	0d95fcf20c	Revert "DGEMM Native AVX512 updates" This reverts commit `f378fc57b5`. Reason for revert: Causing Failure AMD-Internal: [CPUPL-5262] Change-Id: I15860eabf2461fae3d0f7cedd436d4db2df5b82f	2024-08-02 07:32:28 -04:00
Shubham Sharma.	f378fc57b5	DGEMM Native AVX512 updates - In the initial patch - for m, n non-multiple of MR and NR respectively we are calling bli_dgemm_ker_var2. Now we have implemented macro-kernel for these fringe cases as well. - Replaced RBP register with R11 in the macro-kernel. - Retuned MC, KC and NC with these new changes. This will result in better performance for matrix sizes like m=4000 or greater when running on single thread. AMD-Internal: [CPUPL-5262] Change-Id: I66c111ceb7feee776703339680d57e8d6d5c809a	2024-07-31 12:23:34 -04:00
Shubham Sharma	15ef6532e9	BugFix in DGEMMT SUP AVX512 code path - Logic to calculate the kernel index in AVX512 DGEMMT SUP framework is incorrect. - The granularity for workload distribution along N dimension is NR(8), whereas current logic to pick diagonal kernel assumes the granularity to be MR (24). - To fix this, the logic to determine the kernel index is changed: instead of relying solely on n_offset, the kernel index is derived depending on distance from the diagonal. - If distance from diagonal is greater than LCM of (MR and NR) - NR, then that means the current micro panel is not a diagonal micro panel. - If the micro panel is a diagonal micro panel, then the distance from diagonal is equal to the M dimension for initial full GEMM region or empty region of diagonal kernel. This info can be used to determine the kernel index. AMD-Internal: [CPUPL-5440] Change-Id: I640d3a1b43e63b24bc9f0ed4a67cced45f6fa3b3	2024-07-24 06:36:34 +00:00
Shubham Sharma	16c56e0101	Added 24x8 triangular kernels for DGEMMT SUP - In order to reuse 24x8 AVX512 DGEMM SUP kernels, 24x8 triangular AVX512 DGEMMT SUP kernels are added. - Since the LCM of MR(24) and NR(8) is 24, therefore the diagonal pattern repeats every 24x24 block of C. To cover this 24x24 block, 3 kernels are needed for one variant of DGEMMT. A total of 6 kernels are needed to cover both upper and lower variants. - In order to maximize code reuse, the 24x8 kernels are broken into two parts, 8x8 diagonal GEMM and 16x8 full GEMM. The 8x8 diagonal GEMM is computed by 8x8 diagonal kernel, and 16x8 full GEMM part is computed by 24x8 DGEMM SUP kernel. - Changes are made in framework to enable the use of these kernels. AMD-Internal: [CPUPL-5338] Change-Id: I8e7007031e906f786b0c4fe12377ee439075207a	2024-07-22 12:02:30 -04:00
Shubham Sharma.	a7744361e4	DGEMM optimizations for Turin Classic - Introduced new 8x24 macro kernels. - 4 new kernels are added for beta 0, beta 1, beta -1 and beta N. - IR and JR loop moved to ASM region. - Kernels support row major storage scheme. - Prefetch of current micro panel of C is enabled. - Kernel supports negative offsets for A and B matrices. - Moved alpha scaling from DGEMM kernel to B pack kernel. - Tuned blocksizes for new kernel. - Added support for alpha scaling in 24xk pack kernel. - Reverted back to old b_next computation in gemm_ker_var2. - BugFix in 8x24 DGEMM kernel for beta 1, comparison for jmp conditions was done using integer instructions, which caused beta 1 path to never be taken. Fixed this by changing the comparison to double. AMD-Internal: [CPUPL-5262] Change-Id: Ieec207eea2a164603c8a8ea88e0b1d3095c29a3f	2024-07-09 07:53:27 -04:00
Edward Smyth	2ee46a3a3a	Merge commit 'cfa3db3f' into amd-main * commit 'cfa3db3f': Fixed bug in mixed-dt gemm introduced in `e9da642`. Removed support for 3m, 4m induced methods. Updated do_sde.sh to get SDE from GitHub. Disable SDE testing of old AMD microarchitectures. Fixed substitution bug in configure. Allow use of 1m with mixing of row/col-pref ukrs. AMD-Internal: [CPUPL-2698] Change-Id: I961f0066243cf26aeb2e174e388b470133cc4a5f	2024-07-08 06:09:11 -04:00
Shubham Sharma.	580282e655	DGEMM optimizations for Turin Classic - Introduced new 8x24 row preferred kernel for zen5. - Kernel supports row/col/gen storage schemes. - Prefetch of current panel of A and C are enabled. - Prefetch of next panel of B is enabled. - Kernel supports negative offsets for A and B matrices. - Cache block tuning is done for zen5 core. AMD-Internal: [CPUPL-5262] Change-Id: I058ea7e1b751c20c516d7b27a1f27cef96ef730f	2024-06-17 05:18:49 -04:00
Mangala V	64d9c96d45	ZGEMMT SUP: AVX512 GEMMT code for Upper variant 1. Enabled AVX512 path for - Upper variant - Different storage schemes for upper and lower variant 2. Modified mask value to handle all fringe cases correctly AMD_Internal: [CPUPL-5091] Change-Id: I4bf8aca24c1b87fff606deb05918b8e6216b729e	2024-05-15 13:08:32 +05:30
Shubham Sharma	f4b06547fd	Enabled DGEMMT SUP optimized code for upper variant - Enabled DGEMMT SUP upper kernels in AVX512 code path. - Enabled use of optimized kernels for all the storages supported by optimized kernels. AMD-Internal: [CPUPL-4881] Change-Id: Id4486610dacaabc405fbc35b2588607c6508705e	2024-05-14 05:23:51 -04:00
Edward Smyth	62c886feee	Export some BLIS internal symbols AOCL libFLAME optimizations directly call some internal BLIS symbols. Export them to enable this to work with the BLIS shared library. AMD-Internal: [CPUPL-5044] Change-Id: Icb62dcb51e12d72dde8434593ab17de3c227c93d	2024-05-08 12:51:32 -04:00
Mangala V	e6cc2a3e22	ZGEMMT SUP Optimizations for AVX512 Existing Design: - GEMM AVX2 kernel performs computation and updates temporary C buffer - Portion of temporary C buffer is copied to output C buffer based on UPLO parameter - For diagonal blocks, using GEMM kernels is not efficient New Design: Implemented in current patch when UPLO='L' - GEMMT kernel used for computation, temporary buffer is not required. - Only required elements are computed using mask load store for all fringe cases - Exception: AVX2 code path is used when storage format is RRC, CRR, CRC - AOCL-Dynamic is added based on dimension - Check for AVX platform is added in SUP interface. It returns to native implementation if hardware does not support AVX platform - SUP ref_var2m is expanded for dcomplex datatype to avoid condition check which exists for double datatype AMD_Internal: [CPUPL-5006] Change-Id: I3e21404b732b8f2df9cbdba394303752fdf36286	2024-05-07 23:00:29 +05:30
Shubham Sharma	b70347d0d4	DGEMMT SUP Optimizations for AVX512 - In DGEMMT SUP AVX2 code path, triangular kernels are added in order to avoid temporary C buffer. - Since these kernels did not exist for AVX512, AVX2 kernels were being used in GEMMT. - AVX512 triangular GEMM kernel has been added to make sure that AVX512 kernels can be used without creating a temporary buffer. - This kernel is added only for Lower variant of GEMMT, for upper variant of DGEMMT, temporary C buffer is created, full GEMM kernel is called on temporary C and triangular region from temporary C is copied to C buffer. AMD-Internal: [CPUPL-4881] Change-Id: Id70645f79ae078ab9a7006e83d328505f1fae8a9	2024-05-03 05:11:11 -04:00
Shubham Sharma	b9e21e8701	Added ZTRSM AVX512 small code path - Kernel dimensions are 4x4. - Two kernels are implemented, Right Upper and Right lower. - In case of Left variants of TRSM, transpose is induced so that Right variant kernels can be used. - No packing is performed in these kernels. - Changes are made in the threshold to pick ZTRSM small code path. - BLIS_INLINE is removed from signature of "TRSMSMALL_KER_PROT". - These kernels do not support "ENABLE_TRSM_PREINVERSION". - Newly added kernels do not support conjugate transpose. - Added multithreading to ZTRSM small code path. AMD-Internal: [CPUPL-4324] Change-Id: I683b1d5239593e54f433e7f27497d72dfbd9141c	2024-05-03 05:10:41 -04:00
Edward Smyth	2450a1813b	BLIS: Implement zen5 sub-configuration Implement full support for zen5 as a separate BLIS sub-configuration and code path within amdzen configuration family. AMD-Internal: [CPUPL-3518] Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09	2024-04-12 07:26:31 -04:00
Arnav Sharma	970a655ee4	Fix for build issue when Mixed Datatypes are disabled - Warning is raised for the implicit declaration of bli_gemm_md_is_ccr() when BLIS is configured with --disable-mixed-dt flag. - Encapsulated the usage of bli_gemm_md_is_ccr( ... ) inside the BLIS_ENABLE_GEMM_MD macro. AMD-Internal: [CPUPL-4630] Change-Id: Icc59b1bcd3a21492daaaf6bcec80a5bf67012ace	2024-02-23 04:02:49 -05:00
Edward Smyth	ed5010d65b	Code cleanup: AMD copyright notice Standardize format of AMD copyright notice. AMD-Internal: [CPUPL-3519] Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0	2023-11-23 08:54:31 -05:00
Edward Smyth	f471615c66	Code cleanup: No newline at end of file Some text files were missing a newline at the end of the file. One has been added. AMD-Internal: [CPUPL-3519] Change-Id: I4b00876b1230b036723d6b56755c6ca844a7ffce	2023-11-22 17:11:10 -05:00
Eashan Dash	e4e4fe55fb	Added Parameter Checks and DTL Trace for Extension APIs 1. Added input parameter checking for the extension APIs 1. gemm_pack_get_size API 2. gemm_pack API 2. Additionally added early returns for these APIs when m or n dimensions are 0. 3. Routines for input parameter check for all the 3 BLAS extension APIs - gemm_pack_get_size, gemm_pack and gemm_compute are defined in: frame/compat/check/bla_gemm_pack_compute_check.h 4. Added AOCL DTL TRACE for all the functions of 1. gemm_pack_get_size 2. gemm_pack 3. gemm_compute AMD-Internal: [CPUPL-3560] Change-Id: I4351b8494d888eae7e7431a7e1e23e442ffc8631	2023-11-09 18:53:59 +05:30
Eleni Vlachopoulou	75a4d2f72f	CMake: Adding new portable CMake system. - A completely new system, made to be closer to Make system. AMD-Internal: [CPUPL-2748] Change-Id: I83232786406cdc4f0a0950fb6ac8f551e5968529	2023-11-09 15:49:45 +05:30
Edward Smyth	9500cbee63	Code cleanup: spelling corrections Corrections for some spelling mistakes in comments. AMD-Internal: [CPUPL-3519] Change-Id: I9a82518cde6476bc77fc3861a4b9f8729c6380ba	2023-11-09 00:16:30 -05:00
Eashan Dash	c3d1a3878c	Parallelized Pack and Compute Extension APIs 1. OpenMP based multi-threading parallelism is added for BLAS extension APIs of Pack and Compute 2. Both pack and compute APIs are parallelized. 3. Multi-threading of pack and compute APIs done with different number of threads can lead to inconsistent results due to output difference of the full packed matrix buffer when packed with different number of threads. 4. In multi-threaded execution, we ensure output of packed buffer is exactly the same as in single threaded execution. 5. Similarly for compute API, read of packed buffer in multi-threaded execution is exactly the same as in single-threaded execution. 6. Routines are added to compute the offsets for thread workload distribution for MT execution. 1. The offsets are calculated in such a way that it resembles the reorder buffer traversal in single threaded reordering. 2. The panel boundaries (KCxNC) remain as it is accessed in single thread, and as a consequence a thread with jc_start inside the panel cannot consider NC range for reorder. 3. It has to work with NC' < NC, and the offset is calculated using prev NC panels spanning k dim + cur NC panel spanning pc loop cur iteration + (NC - NC') spanning current kc0 (<= KC). 7. Routines to ensure the same are added for MT execution 1. frame/base/bli_pack_compute_utils.c 2. frame/base/bli_pack_compute_utils.h AMD-Internal: [CPUPL-3560] Change-Id: I0dad33e0062519de807c32f6071e61fba976d9ac	2023-11-03 08:47:17 -04:00
Arnav Sharma	c8f14edcf5	BLAS Extension API - ?gemm_compute() - Added support for 2 new APIs: 1. sgemm_compute() 2. dgemm_compute() These are dependent on the ?gemm_pack_get_size() and ?gemm_pack() APIs. - ?gemm_compute() takes the packed matrix buffer (represented by the packed matrix identifier) and performs the GEMM operation: C := A * B + beta * C. - Whenever the kernel storage preference and the matrix storage scheme isn't matching, and the respective matrix being loaded isn't packed either, on-the-go packing has been enabled for such cases to pack that matrix. - Note: If both the matrices are packed using the ?gemm_pack() API, it is the responsibility of the user to pack only one matrix with alpha scalar and the other with a unit scalar. - Note: Support is presently limited to Single Thread only. Both, pack and compute APIs are forced to take n_threads=1. AMD-Internal: [CPUPL-3560] Change-Id: I825d98a0a5038d31668d2a4b84b3ccc204e6c158	2023-10-16 08:18:52 -04:00
Edward Smyth	85f2bf6c4a	Fix for x86_64 builds Configuration x86_64 includes all Intel and AMD sub-configurations. Fixes to enable this to work correctly again are: - In config_registry use amdzen rather than amd64 in x86_64 family. - Copy settings from config/amdzen/bli_family_amdzen.h to config/x86_64/bli_family_x86_64.h - Modify configure to set enable_aocl_zen=yes for x86_64, but not for amd64_legacy. - Add "if defined(BLIS_FAMILY_X86_64)" to frame/3/bli_l3_sup.c and frame/3/bli_l3_sup_int_amd.c so zen-specific code paths are enabled. Note: sub-configurations knl and bulldozer use instructions that are not supported on most x86_64 processors. AMD-Internal: [CPUPL-3838] Change-Id: I0bd8fd89ccd846f80e5491ef44ade7d409970b04	2023-10-09 07:24:21 -04:00
Edward Smyth	15f9a747af	Fix for non-x86 builds: bli_gemmt_sup_var1n2m.c bli_gemmt_sup_var1n2m.c contained x86 specific code. Move to frame/3/gemmt/bli_gemmt_sup_var1n2m_amd.c and restore bli_gemmt_sup_var1n2m.c as of commit `10ca8710f0` as variant for non-AMD codepath builds. AMD-Internal: [CPUPL-3838] Change-Id: I88db20b93b2dbcbbf5092a4cb78f14dd1179975f	2023-09-13 16:03:53 +05:30
Edward Smyth	bb4c158e63	Merge commit 'b683d01b' into amd-main * commit 'b683d01b': Use extra #undef when including ba/ex API headers. Minor preprocessor/header cleanup. Fixed typo in cpp guard in bli_util_ft.h. Defined eqsc, eqv, eqm to test object equality. Defined setijv, getijv to set/get vector elements. Minor API breakage in bli_pack API. Add err_t* "return" parameter to malloc functions. Always stay initialized after BLAS compat calls. Renamed membrk files/vars/functions to pba. Switch allocator mutexes to static initialization. AMD-Internal: [CPUPL-2698] Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df	2023-08-21 07:01:38 -04:00
Shubham Sharma	0000cc88de	Removed local copy of cntx in TRSM - TRSM and GEMM has different blocksizes in zen4, in order to accommodate this, a local copy of cntx was created in TRSM. - Local copy of cntx has been removed and TRSM blocksizes are stored in cntx->trsmblkszs. - Functions to override and restore default blocksizes for TRSM are removed. Instead of overriding the default blocksizes, TRSM blocksizes are stored separately in cntx. - Pack buffers for TRSM have to be packed with TRSM blocksizes and GEMM pack buffers have to be packed with default blocksizes. To check if we are packing for TRSM, "family" argument is added in bli_packm_init_pack function. - BLIS_GEMM_FOR_TRSM_UKR has to be used for TRSM if it is set, if it is not set then BLIS_GEMM_UKR has to be used. This functionality has been added to all TRSM macro kernels. - Methods to retrieve TRSM blocksizes from cntx are added to bli_cntx.h. - Tests for micro kernels are modified to accommodate the change in signature of bli_packm_init_pack. AMD-Internal: [CPUPL-3781] Change-Id: Ia567215d6d1aa0f14eae5d3177f4a3dd63b4b20a	2023-08-16 08:09:01 -04:00
Meghana Vankadari	79e174ff0a	Level-3 triangular routines now use different block sizes and kernels. Details: - Eliminated the need for override function in SUP for GEMMT/SYRK. - New set of block sizes, kernels and kernel preferences are added to cntx data structure for level-3 triangular routines. - Added supporting functions to set and get the above parameters from cntx. - Modified GEMMT/SYRK SUP code to use these new block sizes/kernels. In case they are not set, use the default block sizes/kernels of Level-3 SUP. AMD-Internal: [CPUPL-3649] Change-Id: Iee11bd4c4f1d8fbbb749c296258d1b8121c009a0	2023-07-26 01:26:11 -04:00
Arnav Sharma	4aace5f524	Smart Threading for SGEMM SUP for Zen4 Architecture - Added Smart Threading logic for AVX-512 based SGEMM SUP. - Calculating ic and jc for optimal work distribution to the allocated threads based on logic similar to Zen3. - Zen4 Architecture specific Native-to-SUP check has been added to redirect few Native inputs to the SUP path based on the fact that in a multi-threaded environment some Native cases perform better as SUP. - For the same, the SUP thresholds, namely, BLIS_MT and BLIS_NT have been increased from 512 and 200 to 682 and 512, respectively. - Further optimizations to the work distribution logic will be added subsequently. AMD-Internal: [CPUPL-3248] Change-Id: Ibccbbefef251010ec94bd37ffc86c35b7866a5ca	2023-04-21 12:54:03 +05:30
Meghana Vankadari	f788618f27	Setting AVX-512 specific blocksizes as default for L3 SUP for zen4 config Details: - Overriding of blocksizes with avx-2 specific ones(6x8) is done for gemmt/syrk because near-to-square shaped kernel performs better than skewed/rectangular shaped kernel. - Overriding is done for S,D and Z datatypes. AMD-Internal: [CPUPL-3060] Change-Id: I304ff4264ff735b7c31f7b803b046e1c49c9ad53	2023-04-20 08:52:34 -04:00
Mangala V	5dc8e3fbca	AOCL progress callback pointer update per thread Thanks to Moore, Branden <Branden.Moore@amd.com> for identifying the race condition and suggesting the changes to fix the same Existing Design: - AOCL progress callback pointer is a global pointer which is shared across all threads Existing Design challenges: - The callback function cannot safely disable the progress mechanism, as another thread may have already checked to see if the function pointer is set, and then re-reads the pointer upon invocation of the callback. If one thread sets the callback to NULL in this time, then the resulting thread will attempt to call the null pointer as a function pointer, leading to a segfault. New Design : - Each thread maintains a local copy of progress pointer AMD-Internal: [SWLCSG-1971] Change-Id: I282989805a4a2a8a759a7373b645f3569bf42ed4	2023-04-20 05:33:12 -04:00
Edward Smyth	6835205ba8	Code cleanup: spelling corrections Corrections for spelling and other mistakes in code comments and doc files. AMD-Internal: [CPUPL-2870] Change-Id: Ifbb5df7df2d6312fe73e06ee6d41c00b16c593ce	2023-04-19 12:44:56 -04:00
Shubham Sharma	036da2e651	Fixed compilation errors for generic configuration - In gemmt and normf, #ifdef BLIS_KERNELS_* is added to make sure only compiled kernels are used. - In bla_copy and bla_swap, missing '\' is added. AMD-Internal: [CPUPL-2870] Change-Id: I83452dff761f60db6957f557321ce210ab72c037	2023-04-18 00:27:05 -04:00
Meghana Vankadari	42d05a5aa0	DGEMM: Added decision logic to choose between sup vs native for zen4 architecture Details: - Added a new function for choosing between SUP and native implementation for a given size. - This function pointer is stored in cntx for zen4 config. - Divided total combinations of sizes into 3 categories: - one dimension is small - Two dimensions are small - All dimensions are small - Added different threshold conditions for each of the categories. AMD-Internal: [CPUPL-2755] Change-Id: Iae4bf96bb7c9bf9f68fd909fb757d7fe13bc6caf	2023-04-17 13:08:34 -04:00
Arnav Sharma	c14ce55bcb	Added SGEMM SUP blocksize override to zen4 context - Reverted the SUP blocksizes and kernels to use AVX2 SUP kernels for SGEMM. This can be updated once GEMMT specific optimization are added for AVX-512. - Updated 'bli_zen4_override_gemm_blkszs()' in zen4 context to override blocksize and kernels for SGEMM SUP to enable AVX-512 kernels for SGEMM operation. AMD-Internal: [CPUPL-3060] Change-Id: Ic9b3037363b6e5b59e5035c81651c97ce95d6d9a	2023-04-10 08:16:45 -04:00
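
Note on commit b3b56ae3bb: the fix rests on the observation that a double-precision real element and a single-precision complex element are both 8 bytes wide, so an element-size test alone cannot tell DGEMM apart from a mixed-precision problem. The following is a minimal C sketch of that reasoning only. `toy_scomplex`, `toy_obj_t`, and every function in it are hypothetical stand-ins written for illustration; they are not BLIS types or APIs, and the sketch assumes a 4-byte `float` and an 8-byte `double`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical single-precision complex layout: two packed floats. */
typedef struct { float real; float imag; } toy_scomplex;

/* Assumption: 4-byte float and 8-byte double, as on common x86-64 ABIs. */
static_assert(sizeof(double) == 8, "expected 8-byte double");
static_assert(sizeof(toy_scomplex) == 8, "expected 8-byte single complex");

/* Hypothetical stand-in for a matrix object; not the BLIS obj_t. */
typedef struct {
    size_t elem_size; /* bytes per element */
    bool   is_real;   /* true for the real domain */
} toy_obj_t;

/* Build a toy object describing a double-precision real matrix. */
static toy_obj_t toy_make_double(void)
{
    toy_obj_t o = { sizeof(double), true };
    return o;
}

/* Build a toy object describing a single-precision complex matrix. */
static toy_obj_t toy_make_scomplex(void)
{
    toy_obj_t o = { sizeof(toy_scomplex), false };
    return o;
}

/* Size-only test: also accepts single-complex operands. */
static bool size_only_check(const toy_obj_t *a)
{
    return a->elem_size == 8;
}

/* Size plus domain test: accepts only real, 8-byte A and B. */
static bool size_and_domain_check(const toy_obj_t *a, const toy_obj_t *b)
{
    return a->elem_size == 8 && b->elem_size == 8
        && a->is_real && b->is_real;
}
```

In this sketch the size-only test accepts both the double and the single-complex object, while the size plus domain test accepts only a pair of double objects, which mirrors the extra real-domain checks the commit message describes.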

494 Commits