amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-07-02 21:27:52 +00:00

Author	SHA1	Message	Date
Shubham Sharma	f8c83fedb6	Added new ZTRSM small code path for ZEN5 - Added new ZTRSM kernels for right and left variants. - Kernel dimensions are 12x4. - 12x4 ZGEMM SUP kernels are used internally for solving GEMM subproblem. - These kernels do not support conjugate transpose. - Only column major inputs are supported. - Tuned thresholds to pick efficient code path for ZEN5. AMD-Internal: [CPUPL-6356] Change-Id: I33ba3d337b0fcd972ca9cfe4668cb23d2b279b6e	2025-02-06 18:01:10 +05:30
Shubham Sharma	8f99d8a5bb	Fixed warnings and compilation issues with GCC in TRSM - Current implementation uses macros to expand the code at compile time, but this is causing some false warnings in GCC12 and 14. - Added switch case in trsm right variants for n_remainder. - This ensures that n_rem is compile time constant, therefore warnings related to array subscript out of bounds are fixed. - mtune=znver3 flag is causing compilation issue in GCC 9.1, therefore this flag is removed. - Renamed the file bli_trsm_small to bli_trsm_small_zen5 in order to avoid possibility of missing symbols. AMD-Internal: [CPUPL-6199] Change-Id: Ib8e90196ce0a41d38c2b29226df5ab6c2d8ba996	2024-12-18 06:22:05 -05:00
Shubham Sharma	050e5a382f	Fixed warning for GCC 12+ - Warnings in DTRSM kernel caused by uninitialized registers and extra loop unroll are fixed. - Warning in DGEMM kernel caused by extra space is fixed. Change-Id: I1d9cfaa0b2847f5fdbe8b343a462d67a3aca0819	2024-12-17 01:44:41 -05:00
Shubham Sharma	beaea1b88f	Added new DTRSM small code path for ZEN5 - Added new DTRSM kernels for right and left variants. - Kernel dimensions are 24x8. - 24x8 DGEMM SUP kernels are used internally for solving GEMM subproblem. - Tuned thresholds to pick efficient code path for ZEN5. AMD-Internal: [CPUPL-6016] Change-Id: I743d6dc47717952c2913085c0db3454ae9d046db	2024-12-16 10:38:45 +05:30
Shubham Sharma	f2320a1fef	Enabled DGEMM row major kernel for ZEN4 - Merged ZEN4 and ZEN5 DGEMM 8x24 kernel. - Replaced 32x6 kernel with 8x24. Now same kernel is used for ZEN4 and ZEN5. - Blocksizes have been tuned for genoa only. - DGEMM kernel for DTRSM native code path is replaced with 8x24 kernel. - Enabled alpha scaling during packing for ZEN4. - ZEN4 8x24 kernel has been removed. AMD-Internal: [CPUPL-5912] Change-Id: I89a16a7e3355af037d21d453aabf53c5ecccb754	2024-11-29 08:18:48 +00:00
Shubham Sharma	082081658f	BugFix: Fixed extreme value handling in AVX512 DGEMM kernel - Extreme values are not handled correctly when beta == 0 and C is column major stored. - For checking if beta is zero, VCOMISD(XMM(1), XMM(2)) is used, beta(XMM1) is compared with zero(XMM2), for column major C, setting of xmm2 to zero was missed. - XMM2 is set to zero after the jump to column major stored C code is made, this skips the setting of XMM2 to zero for column major C. - This is fixed by setting XMM2 to zero before the column major jump. AMD-Internal: [CPUPL-5851] Change-Id: Ic511071fbc82a082fa48a1543c0c7325eaf75cb8	2024-11-29 08:13:57 +00:00
Shubham Sharma.	bc3238e21e	BugFixes in ZEN5 DGEMM kernel - Changed fringe cases to use ZEN5 DGEMM kernel instead of ZEN4 kernel. - ASAN reporting error when RBP is used even when -fno-stack-pointer flag is used, therefore replaced RBP register with R11 register. - Added missing RDX register in clobber list which is causing failures with AOCC compiler. Thanks to harsh.dave@amd.com for debugging some of the issues. AMD-Internal: [CPUPL-5851] Change-Id: I0ee412c97c9dbfb3e7a736a10bfd93d775779b5b	2024-11-29 00:22:41 -05:00
Shubham Sharma	266bd32dea	Enable fringe case handling in DGEMM ZEN5 macro kernel - Generic kernel is used if N is not multiple of NR or M is not multiple of MR. - This limits the maximum values of NR that can be used. - Support for fringe case handling is added in DGEMM macro kernel so that macro kernel can be used for all problem sizes. AMD-Internal: [CPUPL-5912] Change-Id: I85c17e91d7511bb35ffed0f346d6ff0376baf62f	2024-11-29 00:22:33 -05:00
Shubham Sharma	1833ee70cd	Fixed bug in bli_dgemm_avx512 8x24 native kernel. - Data-type of m, n, k, ldc is dim_t which will be int32_t for LP64 case. - When loading 64-bit registers using "mov" instructions, mov(rax, var(m)), the "m" should be 64-bit otherwise incorrect values get loaded. Fix: We typecast these variables to int64_t before loading into registers. AMD-Internal: [CPUPL-5819] Change-Id: I16043ac168a79ff9358c0c1768989a81e3c6b0e0	2024-09-23 04:54:04 -04:00
Edward Smyth	89f52a6df5	Code cleanup: spelling corrections Corrections for spelling and other mistakes in code comments and doc files. AMD-Internal: [CPUPL-4500] Change-Id: I33e28932b0e26bbed850c55602dee12fd002da7f	2024-08-05 16:18:51 -04:00
Edward Smyth	82bdf7c8c7	Code cleanup: Copyright notices - Standardize formatting (spacing etc). - Add full copyright to cmake files (excluding .json) - Correct copyright and disclaimer text for frame and zen, skx and a couple of other kernels to cover all contributors, as is commonly used in other files. - Fixed some typos and missing lines in copyright statements. AMD-Internal: [CPUPL-4415] Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371	2024-08-05 15:35:08 -04:00
Shubham Sharma	0d95fcf20c	Revert "DGEMM Native AVX512 updates" This reverts commit `f378fc57b5`. Reason for revert: Causing Failure AMD-Internal: [CPUPL-5262] Change-Id: I15860eabf2461fae3d0f7cedd436d4db2df5b82f	2024-08-02 07:32:28 -04:00
Ruchika Ashtankar	92fbd04238	DGEMM SUP Optimizations for Turin - Introduced a new 24x8 column preferred DGEMM sup kernel for zen5. - A prefetch logic is modified compared to zen4 24x8 sup kernels. - Earlier, next panel of A is prefetched into L2 cache, which is now modified to prefetching the second next column of the current panel of A into L1 cache. - B and C prefetches are enabled and unchanged. - Tuned MC, KC and NC block sizes for new kernel. AMD-Internal: [CPUPL-5262] Change-Id: If933537e50f43f5560e0fe18a716aa1e36ced64d	2024-08-02 04:00:51 -04:00
Shubham Sharma.	f378fc57b5	DGEMM Native AVX512 updates - In the initial patch - for m, n non-multiple of MR and NR respectively we are calling bli_dgemm_ker_var2. Now we have implemented macro-kernel for these fringe cases as well. - Replaced RBP register with R11 in the macro-kernel. - Retuned MC, KC and NC with these new changes. This will result in better performance for matrix sizes like m=4000 or greater when running on single thread. AMD-Internal: [CPUPL-5262] Change-Id: I66c111ceb7feee776703339680d57e8d6d5c809a	2024-07-31 12:23:34 -04:00
Shubham Sharma.	a7744361e4	DGEMM optimizations for Turin Classic - Introduced new 8x24 macro kernels. - 4 new kernels are added for beta 0, beta 1, beta -1 and beta N. - IR and JR loop moved to ASM region. - Kernels support row major storage scheme. - Prefetch of current micro panel of C is enabled. - Kernel supports negative offsets for A and B matrices. - Moved alpha scaling from DGEMM kernel to B pack kernel. - Tuned blocksizes for new kernel. - Added support for alpha scaling in 24xk pack kernel. - Reverted back to old b_next computation in gemm_ker_var2. - BugFix in 8x24 DGEMM kernel for beta 1, comparison for jmp conditions was done using integer instructions, which caused beta 1 path to never be taken. Fixed this by changing the comparison to double. AMD-Internal: [CPUPL-5262] Change-Id: Ieec207eea2a164603c8a8ea88e0b1d3095c29a3f	2024-07-09 07:53:27 -04:00
Shubham Sharma	1d6dd726cd	Fixed Prefetch in Turin DGEMM kernel - Fixed the prefetch of next micro panel of B matrix in 8x24 DGEMM kernel. Change-Id: Id84bb2841abb86bda780062d67266377fda12038	2024-06-20 10:31:08 +05:30
Shubham Sharma.	580282e655	DGEMM optimizations for Turin Classic - Introduced new 8x24 row preferred kernel for zen5. - Kernel supports row/col/gen storage schemes. - Prefetch of current panel of A and C are enabled. - Prefetch of next panel of B is enabled. - Kernel supports negative offsets for A and B matrices. - Cache block tuning is done for zen5 core. AMD-Internal: [CPUPL-5262] Change-Id: I058ea7e1b751c20c516d7b27a1f27cef96ef730f	2024-06-17 05:18:49 -04:00
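
Commit 082081658f concerns the beta == 0 check in the AVX512 DGEMM kernel. The reason that branch matters for extreme values can be sketched in scalar C: under the usual BLAS convention, C need not be set on input when beta is zero, so the kernel has to overwrite C rather than multiply it, because 0.0 * NaN and 0.0 * Inf both evaluate to NaN. The helper below is hypothetical and is not part of BLIS; it only illustrates the update rule, with `ab` standing in for the already computed alpha * A * B product.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper, not a BLIS function.
 * Computes c[i] = ab[i] + beta * c[i] for n elements. */
static void scale_and_accumulate(double *c, const double *ab,
                                 size_t n, double beta)
{
    if (beta == 0.0) {
        /* beta == 0: overwrite C without reading it, so that NaN or
         * Inf already stored in C cannot leak into the result. */
        for (size_t i = 0; i < n; ++i)
            c[i] = ab[i];
    } else {
        /* General case: scale the existing contents of C. */
        for (size_t i = 0; i < n; ++i)
            c[i] = ab[i] + beta * c[i];
    }
}
```

If the zero check misfires, as the commit describes for column major C, the general branch runs with beta == 0 and any NaN or Inf in C propagates into the output.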

17 Commits
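
Commit 1833ee70cd describes a 64-bit `mov` reading a variable that is only 32 bits wide. A minimal C sketch of that failure mode follows, assuming a 32-bit dimension type as stated in the commit message. The names `dim32_t`, `struct frame`, `load64_unwidened` and `load64_widened` are hypothetical, and the adjacent struct member stands in for whatever happens to sit next to the variable in memory; `memcpy` plays the role of the 64-bit load.

```c
#include <stdint.h>
#include <string.h>

typedef int32_t dim32_t;   /* hypothetical stand-in for a 32-bit dim_t */

/* Two adjacent 32-bit slots: the dimension and an unrelated neighbour. */
struct frame {
    dim32_t m;
    int32_t neighbour;
};

/* Mimics a 64-bit load from the address of the 32-bit variable:
 * the neighbouring 4 bytes are pulled into the result as well. */
static int64_t load64_unwidened(const struct frame *f)
{
    int64_t r;
    memcpy(&r, f, sizeof r);
    return r;
}

/* The fix described in the commit: widen to int64_t first, then
 * perform the 64-bit load from the widened copy. */
static int64_t load64_widened(const struct frame *f)
{
    int64_t m64 = (int64_t)f->m;
    int64_t r;
    memcpy(&r, &m64, sizeof r);
    return r;
}
```

With `m = 5` and a non-zero neighbour, `load64_widened` returns 5, while `load64_unwidened` returns a value that mixes in the neighbour's bytes.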