amd/blis

mirror of https://github.com/amd/blis.git synced 2026-05-11 09:39:59 +00:00

Author	SHA1	Message	Date
V, Varsha	3df4aac2d2	Bugfix for A matrix packing in int8(S8/U8) APIs for Batch-Matmul - A matrix by default isn't expected to be packed for a normal row-stored case. Hence the packing implementation is incomplete. - But if the user explicitly enables packing, the interface wasn't handling the condition appropriately, leading to data overwriting inside the incomplete pack kernels and thereby to accuracy failures. - As a fix, updated the interface to set the explicit PACK A to UNPACKED and proceed with GEMM in cases where transpose of A is not necessary. - Updated the batch gemm input file with additional test cases covering all the APIs. Bug Fixes: - Fixed implementation logic for column major inputs with post-ops to be disabled in S8 batch mat-mul. With the existing implementation, column-major inputs wouldn't be executed in case of of32/os32 inputs. - Fixed the Scale/ZP calculation in bench for the u8s8s32ou8 condition, which was leading to accuracy failures. [AMD-Internal: CPUPL-7283 ]	2025-08-26 16:46:37 +05:30
S, Hari Govind	deafc527fc	Tune DCOPY aocl_dynamic logic for zen4/zen5 architectures (#160 ) - Adjust the DCOPY aocl_dynamic threshold on Zen4 for optimal fast-path selection. - Extending the Zen5 dynamic-scheduling logic beyond the previous 8-thread limit. - Update corresponding fast_path_thresh values in the frame such that it matches the new dynamic logic.	2025-08-26 16:15:56 +05:30
Dave, Harsh	9269a7d65d	Implement dgemm_thread_decision with K-threshold logic (#152 ) - Added the initial implementation of the dgemm_thread_decision() function to decide between single-threaded and multi-threaded execution for DGEMM inputs. - The function models per-thread tile work, core GFLOP/s, and thread overhead (T_over=15 µs), and computes a K-threshold that determines when multi-threading becomes beneficial. - Returns true for ST and false for MT. AMD-Internal: [SWLCSG-3418]	2025-08-26 15:02:34 +05:30
Sharma, Shubham	9c6777fc6b	Fix DTRSM small threshold for extremely skinny sizes for ZEN5 (#151 ) - Logic to determine if small code path should be taken or not does not take into account if matrix A is too large. - Added a condition to use native code path if matrix A is very large. AMD-Internal: [CPUPL-7201]	2025-08-22 22:15:40 +05:30
V, Varsha	6cdab2720c	Bugfix for A matrix packing in int8(S8/U8) APIs - A matrix packing by default isn't necessary for row-major matrix data. Also, it seems packing of A was giving regressions and hence wasn't expected to be used. - However, packA is necessary in column-major cases, where transpose has to be done. This path has been verified. - Hence, when the user sets pack A explicitly, it gets into the incomplete packA function and overwrites the elements in the buffer after subsequent iterations, leading to accuracy issues. As a fix, the patch updates the PACK condition to UNPACKED at the interface when the user explicitly sets one, ensuring seamless execution. [ AMD-Internal : CPUPL - 7193 ]	2025-08-22 18:46:19 +05:30
Vankadari, Meghana	5044b69d3d	Bug fix in LPGEMV m=1 AVX2 kernel for post-ops Details: - Fixed loading of matadd and matmul pointers in GEMV lt16 kernel for AVX2 M=1 case. - Hard-set row-stride of B to 1(inside GEMV), when it has already been reordered. AMD-Internal:CPUPL-7197, CPUPL-7221 Co-authored-by:Balasubramanian, Vignesh <Vignesh.Balasubramanian@amd.com>	2025-08-22 18:15:05 +05:30
S, Hari Govind	d29f3f0b5e	Fix GCC 12+ instruction scheduling issue in complex scalv kernel (#149 ) Replace fused multiply-add (FMA) intrinsics with explicit multiply and add/subtract operations in bli_cscalv_zen_int to resolve incorrect results with GCC 12 and later compilers. The original code used register reuse pattern with _mm256_fmaddsub_ps() that causes GCC 12+ instruction scheduler to generate assembly with corrupted intermediate values due to register allocation conflicts. GCC 11 and earlier handled the same pattern correctly. Changes: - Replace _mm256_fmaddsub_ps() with _mm256_mul_ps() + _mm256_addsub_ps() - Eliminate temp register reuse to fix instruction scheduling conflicts AMD-Internal: [CPUPL-6445]	2025-08-22 14:23:43 +05:30
KR, Chandrashekara	36c37585de	CMake: Set working directory for try_run to avoid DLL-related runtime errors on Windows. - Added WORKING_DIRECTORY to try_run() calls to ensure execution occurs in ${BLIS_PATH}/lib. - Prevents Windows error 0xc0000135 caused by missing DLLs during runtime of get_version.cpp. - Ensures compatibility with both static and shared BLIS builds by aligning runtime context with expected DLL locations. AMD-Internal: [CPUPL-7187]	2025-08-21 17:09:07 +05:30
Sharma, Shubham	b5c8124d3d	Derive TRSM ref kernels from TRSM blkszs instead of GEMM blkszs (#148 ) - Currently TRSM reference kernels are derived from GEMM blocksizes and GEMM_UKR. - This does not allow the flexibility to use different GEMM_UKR for GEMM and TRSM if optimized TRSM_UKR are not available. - Made changes so that ref TRSM kernels are derived from TRSM blocksizes. - Changed ZEN4 and ZEN5 cntx to use AVX2 kernels for CTRSM. AMD-Internal: [SWLCSG-3702]	2025-08-21 11:25:45 +05:30
Dave, Harsh	e39cf64708	Optimized avx512 ZGEMM kernel and edge-case handling (#147 ) * Optimized avx512 ZGEMM kernel and edge-case handling Edge kernel implementation: - Refactored all of the zgemm kernels to process micro-tiles efficiently - Specialized sub-kernels are added to handle leftover m dimension: 12MASK, 8, 8MASK, 4, 4MASK, 2. - 12MASK edge kernel handles 11, 10, 9 m_left using 2 full zmm load/store and 1 masked load/store. - Similarly 8MASK handles 7, 6, 5 m_left using 1 full zmm load/store and 1 masked load/store. - 4MASK handles 3, 1 m_left using 1 masked load/store. - ZGEMM kernel now internally decomposes the m dimension into the following. The main kernel is 12x4, which has the following edge kernels to handle the left-over m dimension: edge kernels: 12MASKx4 (handles 11x4, 10x4, 9x4) 8x4 (handles 8x4) 8MASKx4 (handles 7x4, 6x4, 5x4) 4x4 (handles 4x4) 4MASKx4 (handles 3x4, 1x4) 2x4 (handles 2x4) - Similarly, it decomposes for the n_left kernels (12x3, 12x2 and 12x1), under which the edge kernels 12MASKxN_LEFT(3, 2, 1), 8XN_LEFT(3, 2, 1), 8MASKxN_LEFT(3, 2, 1), 4xN_LEFT(3, 2, 1), 4MASKxN_LEFT(3, 2, 1), 2xN_LEFT(3, 2, 1) handle the leftover m dimension. Threshold tuning: - Enforced odd m dimension to avx512 kernels in tiny path, as avx2 kernels invoke gemv calls for m_left=1 (odd m dimension of matrix). The gemv function call adds overhead for very small sizes and results in suboptimal performance. - Condition check "m%2 == 0" is added along with threshold checks to force input with odd m dimension to use avx512 zgemm kernel. - Threshold change to route all of the inputs to tiny path, eliminating dependency on avx2 zgemm_small path if A, B matrix storage is 'N' (not transpose) or 'T' (transpose). - However, tiny re-uses zgemm sup kernels which do not support conjugate transpose storage of matrices. For such storage of A, B matrix we still rely on avx2 zgemm_small kernel. 
gtest changes: - Removed zgemm edge kernel functions (8x4, 4x4, 2x4 and fx4) and their respective testing instances from gtest. AMD-Internal: [CPUPL-7203] --------- Co-authored-by: harsdave <harsdave@amd.com>	2025-08-21 09:46:10 +05:30
Smyth, Edward	7969d43f2c	Improve BLAS3 IIT_ERS tests New tests: - Add IIT_ERS tests for TRMM, SYMM, SYRK, SYR2K, HEMM, HERK, HER2K Corrections and improvements: - GEMM: Use local definitions of input size, trans, etc arguments to allow finer control of choices, especially for testing invalid leading dimensions. - GEMM, GEMMT, GEMM_COMPUTE, TRMM, TRSM: In alpha=beta=zero test, initialize C to extreme value to test that C is set rather than scaled - GEMM: Use correct M x N dimensions for C in calls to computediff - GEMM: Declare info variable in disabled tests in gemm_IIT_ERS.cpp AMD-Internal: [CPUPL-6725]	2025-08-20 22:38:45 +01:00
Sharma, Shubham	805f36965d	Added ability to handle non unit incx in GEMV transpose kernel. (#145 ) - GEMV transpose kernels lack ability to compute directly on non-unit stride inputs. - This limitation is stopping libflame to use blis kernel directly instead of going through framework. - Added ability to handle non-unit incx in the kernel by packing x into a temporary buffer. AMD-Internal: [CPUPL-6903]	2025-08-20 23:33:53 +05:30
Smyth, Edward	67616752c3	Improve BLAS2 IIT_ERS tests New tests: - Add IIT_ERS tests for HEMV, HER, HER2, SYMV, SYR, SYR2 Corrections and improvements: - GEMV: Correct matrix sizes for transpose cases when checking input matrix A has not been modified. - GEMV: Initialize y to extreme value for alpha=beta=zero case to check y is set rather than scaled. AMD-Internal: [CPUPL-6725]	2025-08-20 18:06:20 +01:00
Sharma, Shubham	0d98c8f9ec	Remove extra security flag print from the make build system (#146 ) - Status of security flags are already printed in the output from configure and CMake. - Prints in make command are redundant. - Removed prints from make to keep the build log clean. AMD-Internal: [CPUPL-7222]	2025-08-20 16:58:44 +05:30
Smyth, Edward	0b9e846fee	DTL logging fixes and improvements (5) More improvements to DTL coverage and coding: - Expand logging and tracing coverage to IxAMIN and GEMM_BATCH APIs - Expand logging and performance states to GEMM3M APIs - Expand logging coverage to matrix copy, transpose and add APIs - Misc tidying of code AMD-Internal: [CPUPL-7010]	2025-08-20 11:37:03 +01:00
Smyth, Edward	509aa07785	Standardize Zen kernel names Naming of Zen kernels and associated files was inconsistent with BLIS conventions for other sub-configurations and between different Zen generations. Other anomalies existed, e.g. dgemmsup 24x column preferred kernels names with _rv_ instead of _cv_. This patch renames kernels and file names to address these issues. AMD-Internal: [CPUPL-6579]	2025-08-19 18:19:51 +01:00
Sharma, Shubham	aa95a8ce4a	Added Compiler flags to improve Security (#136 ) Following Flags have been added. 1. D_FORTIFY_SOURCE=2 What it does • At compile time the header files replace certain libc calls (strcpy, sprintf, …) with inline wrappers that perform a compile-time length check whenever the size of the destination buffer is known. • At run time an extra check is executed only if the compiler could not prove the copy is safe. Cost • Only functions that call those specific libc routines pay anything. 2. fstack-protector-strong What it does • Functions that contain local arrays, address‐taken locals, or alloca get a canary word inserted into the stack frame. • The function prologue writes the canary; the epilogue verifies it before the ret. Cost • 8 bytes of additional stack per protected function frame. • Two or three extra instructions per entry/exit. 4. Wl,-z,relro What it does • Marks the relocation tables read-only after relocation is finished. • No effect once the library is fully loaded. Cost • None at run time. 5. Wl,-z,now What it does • Forces the dynamic loader to resolve all external symbols in the library up-front instead of lazily on first call. Cost • Startup: one extra relocation pass. • Steady-state execution: zero or slightly faster, because PLT stubs are bypassed. Usage: cmake -DENABLE_SECURITY_FLAGS=off cmake -DENABLE_SECURITY_FLAGS=on configure --enable-security-flags configure --disable-security-flags AMD-Internal: [CPUPL-6886]	2025-08-18 16:11:02 +05:30
Dave, Harsh	b88bea6e72	Optimize ZGEMM Packing Kernel for M-Dimension Edge Cases (cdim0 1–11) (#135 ) * Optimize ZGEMM Packing Kernel for M-Dimension Edge Cases (cdim0 1–11) - Introduced specialized AVX-512 assembly paths for cdim0 edge cases (1–11), replacing inefficient zscalv fallback. - Refactored cdim0 == mnr condition into a switch statement to support multiple optimized cases. - Added three new macros for column-stored packing with distinct masking patterns. - Implemented 11 dedicated handlers for row and column stored A matrix packing with efficient masked loads/stores for partial data. AMD-Internal: [CPUPL-6677] Co-authored-by: harsh dave <harsdave@amd.com> * Update bli_packm_zen4_asm_z12xk.c --------- Co-authored-by: harsh dave <harsdave@amd.com> Co-authored-by: Sharma, Shubham <Shubham.Sharma3@amd.com>	2025-08-18 12:38:45 +05:30
Sharma, Shubham	33ea09d967	Fix ZTRSM accuracy for conjugate transpose (#133 ) - For Conjugate inputs, ZTRSM small code path is less accurate than native codepath. - Redirected the conjugate inputs to native code path on ZEN4 if TRSM preinversion is disabled. - Tuned AOCL_DYNAMIC to handle the new inputs redirected to ZTRSM native.	2025-08-14 19:49:28 +05:30
Dave, Harsh	1b1b19486b	Add packing support M edge cases in ZGEMM 12xk pack kernel (#89 ) Previously, the ZGEMM implementation used `zscalv` for cases where the M dimension of matrix A is not a multiple of 24, resulting in a ~40% performance drop. This commit introduces specialized edge cases in the pack kernel to optimize performance for these inputs. The new packing support significantly improves the performance. - Removed reliance on `zscalv` for edge cases, addressing the performance bottleneck. AMD-Internal: [CPUPL-6677] Co-authored-by: harsh dave <harsdave@amd.com>	2025-08-14 14:29:03 +05:30
Sharma, Arnav	76c4872718	GEMV support for S8S8S32O32 Symmetric Quantization Introduced support for GEMV operations with group-level symmetric quantization for the S8S8S32O32 API. Framework Changes: - Added macro definitions and function prototypes for GEMV with symmetric quantization in lpgemm_5loop_interface_apis.h and lpgemm_kernels.h. - LPGEMV_M_EQ1_KERN2 for the lpgemv_m_one_s8s8s32os32_sym_quant kernel, and - LPGEMV_N_EQ1_KERN2 for the lpgemv_n_one_s8s8s32os32_sym_quant kernel. - Implemented the main GEMV framework for symmetric quantization in lpgemm_s8s8s32_sym_quant.c. Kernel Changes: - lpgemv_m_one_s8s8s32os32_sym_quant for handling the case where M = 1 and implemented in lpgemv_m_kernel_s8_grp_amd512vnni.c. - lpgemv_n_one_s8s8s32os32_sym_quant for handling the case where N = 1 and implemented in lpgemv_n_kernel_s8_grp_amd512vnni.c. - Updated the buffer reordering logic for group quantization for N=1 cases in aocl_gemm_s8s8s32os32_utils.c. Notes - Ensure that group_size is a factor of both K (and KC when K > KC). - The B matrix must be provided in reordered format (mtag_b == REORDERED). AMD-Internal: [SWLCSG-3604]	2025-08-14 13:41:25 +05:30
Sharma, Shubham	3a14417ce1	DGEMV BugFixes and code cleanup (#134 ) - Modified gemv (matrix-vector multiply) reference for better handling of transpose flags. - Modified Zen4 kernel implementations for better handling of transpose flags and vector stride (incy). - The changes refine kernel selection logic and move variable definition in macro guards.	2025-08-14 12:54:06 +05:30
S, Hari Govind	9a7bacb30c	Improve numerical precision in ZGEMV API (#130 ) - Replaced separate real and imaginary accumulators (real_acc, imag_acc) with a column-wise accumulator array (row_acc[2]), making accumulation and updates to the target Y vector more direct, concise, and unified. - Leveraged AVX-512 fused multiply-add/subtract operations (_mm512_fmaddsub_pd, _mm512_fmsubadd_pd) and efficient permutations (_mm512_permute_pd) to enable accurate and efficient computation of real and imaginary components in a single instruction, while reducing code complexity for both code paths. - Removed redundant instructions (such as unnecessary permutations and zero-register operations) and simplified the control flow. AMD-Internal: [CPUPL-7015]	2025-08-14 11:19:51 +05:30
Dave, Harsh	fa69528a3b	Bugfix: Tuned zgemm threshold for zen4 (#129 ) * Bugfix: Tuned zgemm threshold for zen4 Threshold tuning that determines whether SUP or native path should be used for given input matrix size. This tuning forces skinny matrices to take SUP path to ensure better performance. * Bugfix: Tuned zgemm threshold for zen4 and zen5 Threshold tuning that determines whether SUP or native path should be used for given input matrix size. This tuning forces skinny matrices to take SUP path to ensure better performance. --------- Co-authored-by: harsdave <harsdave@amd.com>	2025-08-13 19:02:39 +05:30
Smyth, Edward	da875888d7	DTL logging fixes and improvements (4) More improvements to DTL coverage and coding: - Removed some DTL overheads from performance stats timing for all APIs where it is currently implemented (i.e. gemm, gemmt, trsm, nrm2) - Expand logging coverage to gemm pack and compute APIs, including performance stats for gemm_compute - Expand logging coverage to rot, rotg, rotm and rotmg APIs - Tidied order of function prototypes in aocl_dtl/aocldtl_blis.h AMD-Internal: [CPUPL-7010]	2025-08-13 11:12:16 +01:00
Sharma, Shubham	b7b9d3ec53	Exported AVX512 DGEMV kernels (#131 ) - Exported DGEMV AVX512 kernels so that they can be directly called by libflame to avoid blis and omp overhead.	2025-08-12 17:17:41 +05:30
Smyth, Edward	021f6bc960	GEMMTR full set of APIs Commit `eaa76dfe28` added LAPACK 3.12 GEMMTR interfaces as aliases to existing BLIS GEMMT. Here we add full set of Fortran upper case and no underscore API aliases and _blis_impl variants. AMD-Internal: [CPUPL-6581]	2025-08-12 10:24:24 +01:00
Smyth, Edward	dc06cdb621	DTL logging fixes and improvements (3) More improvements to DTL coverage and coding: - Expand logging coverage to banded matrix APIs in frame/compat/f2c - Expand logging coverage to packed matrix APIs in frame/compat/f2c - Commit `b8aa5c2894` was wrong to remove calls to AOCL_DTL_INITIALIZE for APIs where bli_init_auto() is not called. AOCL_DTL_INITIALIZE is essential when logging is enabled but tracing is not, otherwise the ICV gbIsLoggingEnabled will not be initialized based on logging status and remain as the default FALSE value. AMD-Internal: [CPUPL-7010]	2025-08-12 09:42:40 +01:00
Sharma, Shubham	b0a4914417	Added DGEMV no transpose multithreaded Implementations (#12 ) * Added DGEMV no transpose multithreaded Implementations - Added new avx512 M and N kernels for DGEMV. - Added multiple MT implementations for same kernels. - Added AOCL_dynamic logic for L2 apis. - Tuned AOCL_dynamic and code path selection for DGEMV on ZEN5. - Added same kernels for SGEMV, but these kernels are not enabled yet. - Added SGEMV reference kernel. AMD-Internal: [SWLCSG-3408] Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>	2025-08-12 10:39:12 +05:30
Vlachopoulou, Eleni	1f8a7d2218	Renaming CMAKE_SOURCE_DIR to PROJECT_SOURCE_DIR so that BLIS can be built properly via FetchContent() (#65 )	2025-08-07 15:51:59 +01:00
Smyth, Edward	563b161933	Standardize Python files to use Python 3 Python 2 is no longer maintained, and using python3 avoids accidental invocation of outdated interpreters. AMD-Internal: [CPUPL-6579]	2025-08-06 12:04:26 +01:00
Bhaskar, Nallani	9d571bb5d3	Fixed few Coverity warnings in aocl gemm addon Fixed few Coverity warnings in aocl gemm addon AMD-Internal: CPUPL-6913	2025-08-06 15:37:40 +05:30
Smyth, Edward	efdb5a06df	Changes to default choices of sub-configuration amdzen and x86_64 configuration family: On Intel processors supporting AVX-512, the zen4 sub-configuration was dispatched by default: even though it is not optimized specifically for Intel processors, it includes a range of optimizations beyond those present in the older skx sub-configuration. However, the zen4 data path is 256 bit, thus this sub-configuration uses a mixture of AVX2 and AVX-512 kernels. Now that the zen5 sub-configuration is available, with more extensive use of AVX-512 kernels, switch to use this by default on relevant Intel processors. intel64 configuration family: On AMD processors supporting AVX-512 or AVX2, the generic sub-configuration was dispatched by default. Change to dispatch the skx or haswell sub-configuration, based on the available ISA support. AMD-Internal: [CPUPL-6743]	2025-08-06 07:56:53 +01:00
Sharma, Shubham	6db8639284	Fix coverity issue in ZTRSM kernels (#112 ) Fixed static analysis issues in ZTRSM (triangular solve with matrix) kernels for Zen5 architecture by initializing variables to prevent potential use of uninitialized values. - Initialize loop variables i, j, and k_iter to 0 to prevent potential uninitialized access. - Initialize mask variables and remainder variables to 0 across multiple kernel functions.	2025-08-05 15:00:59 +05:30
Smyth, Edward	09592bf4f3	Windows TLS with Microsoft compiler We currently use clang compiler on Windows, so no problem with current builds. However, if we wanted to use Microsoft compiler, add different definition of BLIS_THREAD_LOCAL as __declspec(thread) AMD-Internal: [CPUPL-6958]	2025-08-04 15:09:25 +01:00
Smyth, Edward	b8aa5c2894	DTL logging fixes and improvements (2) More improvements to DTL coverage and coding: - Tidy functions and prototypes in aocl_dtl/aocldtl_blis.{c,h} into alphabetical groups within different BLAS categories. - Expand tracing coverage to APIs in frame/compat/f2c - Remove calls to AOCL_DTL_INITIALIZE (added in `c56dcb6ffb`) as DTL_Trace calls bli_init_auto which will call AOCL_DTL_INITIALIZE AMD-Internal: [CPUPL-7010]	2025-08-01 15:27:35 +01:00
V, Varsha	68d47281df	Fixing some copying bugs in Batch-Matmul code - Removed duplicate calls to BATCH_GEMM_CHECK(). - Refactored freeing of post-op pointer in bench code and verified the functionality. - Modified indexing of the array to take the correct values.	2025-08-01 18:42:10 +05:30
Balasubramanian, Vignesh	c96e7eb197	Threshold tuning for code-paths and optimal thread selection for ZGEMM(ZEN5) - Updated the thresholds to enter the AVX512 SUP codepath in ZGEMM(on ZEN5). This caters to inputs that scale well with multithreaded-execution(in the SUP path). - Also updated the thresholds to decide ideal threads, based on 'm', 'n' and 'k' values. The thread-setting logic involves determining the number of tiles for computation, and using them to further tune for the optimal number of threads. - This logic builds over the assumption that the current thread factorization logic is optimal. Thus, an additional data analysis was performed(on the existing ZEN4 and the new ZEN5 thresholds), to also cover the corner cases, where this assumption doesn't hold true. - As part of the future work, we could reimplement the thread factorization for GEMM, which would additionally require a new set of threshold tuning for every datatype. AMD-Internal: [CPUPL-7028] Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com> AOCL-Aug2025-b1	2025-08-01 16:02:12 +05:30
Rayan, Rohan	1bb1160061	Fixing bench overflow in GFLOPS computation Fixing a bug in some bench applications where GFLOPS computation ran into integer overflows because explicit type casting to double was not done in the computation. Removing all multiplies by 1.0 during GFLOP computation. AMD-Internal: CPUPL-7016 --------- Co-authored-by: Rayan <rohrayan@amd.com>	2025-08-01 11:08:53 +05:30
Rayan, Rohan	a26f0a93f9	Removing some unnecessary branches in CGEMM's aocl-dynamic logic. Code cleanup: Removing some redundant if-else code in the CGEMM aocl-dynamic logic. This should ensure that multiple branching is avoided, while preserving existing heuristics. AMD Internal: [CPUPL - 6579] Co-authored-by: Rayan <rohrayan@amd.com>	2025-08-01 10:31:45 +05:30
Chandrashekara K R	2ff3e02309	Add platform-specific alignment macro for rntm_s struct Introduced BLIS_ATTRIB_ALIGN to standardize 64-byte alignment across platforms. On Windows, alignment is enabled only for Clang and disabled for other compilers. Replaced direct usage of __attribute__((aligned(64))) in rntm_s with the macro.	2025-07-30 21:15:44 +05:30
Smyth, Edward	e3dcc15a80	Add configure and CMake options to enable DTL logging and tracing (#86 ) Instead of editing a header file, add options to build systems to allow DTL tracing and/or logging output to be generated. For most users logging is recommended, producing a line of output per application thread of every BLAS call made. Tracing provides more detailed info of internal BLIS calls, and is aimed more at expert users and BLIS developers. Different tracing levels from 1 to 10 provide control of the granularity of information produced. The default level is 5. Note that tracing, especially at higher tracing levels, will impose a significant runtime cost overhead. Example usage: Using configure: ./configure ... --enable-aocl-dtl=log amdzen ./configure ... --enable-aocl-dtl=trace --aocl-dtl-trace-level=6 amdzen ./configure ... --enable-aocl-dtl=all amdzen Using CMake: cmake ... -DENABLE_AOCL_DTL=LOG cmake ... -DENABLE_AOCL_DTL=TRACE -DAOCL_DTL_TRACE_LEVEL=6 cmake ... -DENABLE_AOCL_DTL=ALL Also, modify function AOCL_get_requested_threads_count to correct reported thread count in cases where internal value is recorded as -1 AMD-Internal: [CPUPL-7010]	2025-07-28 15:24:10 +01:00
Bhaskar, Nallani	46aac600ec	Added f32 kernels without post-ops to avoid overhead Description: 1. Created f32 intrinsic kernels without post-ops to support f32 gemm without post-ops optimally. 2. Initiated the no post-ops kernels from main kernel when the post-ops handler has no post-ops to do. 3. The kernels are redundant but added to get the best perf for pure GEMM call. AMD-Internal : SWLCSG-3692	2025-07-25 23:14:23 +05:30
Smyth, Edward	c56dcb6ffb	DTL logging fixes and improvements The environment variable AOCL_VERBOSE was inconsistent in its behaviour, sometimes producing a single line of output per file from multiple BLAS calls, when it should be all or nothing. Note that: - AOCL_VERBOSE is only active when DTL logging has been enabled at compile time. Otherwise, this environment variable is not read. - When logging is enabled at compile time, logging output is produced by default. Thus AOCL_VERBOSE is more of use to turn output off, rather than on. - For production runs without logging, it is recommended to recompile with DTL disabled, as this minimizes overheads within the BLIS code. - AOCL_VERBOSE should be set to 0 or 1, and not values such as FALSE or TRUE. Changes to improve consistency when AOCL_VERBOSE is set: - Change DTL variables from Bool (unsigned char) datatype to bool, as used elsewhere in BLIS. - Ensure bli_init_auto() is called before AOCL_DTL_TRACE_ENTRY() and AOCL_DTL_LOG_*_INPUTS(), as bli_init_auto calls AOCL_DTL_INITIALIZE() - In APIs which avoid calling bli_init_auto(), add explicit calls to AOCL_DTL_INITIALIZE(). Also, make a proper comment about not calling bli_init_auto(), rather than just commenting out the call, which looks like dead code. Other DTL logging control changes: - Make gbIsLoggingEnabled ICV thread local as this can be updated by calls to AOCL_DTL_Enable_Logs and AOCL_DTL_Disable_Logs APIs - After recent changes to hide some internal BLIS definitions behind ifdef BLIS_IS_BUILDING_LIBRARY guard, change BLIS_THREAD_LOCAL definition to be exported again. Logging output changes: - Standardize printing of datatype to be lower case. - Don't force printing of GEMM transa and transb to upper case, instead print in the case provided by the application code. - Add logging output to all variants (in terms of AMD/non-AMD optimized and datatype) of SWAP and SCAL. AMD-Internal: [CPUPL-7010]	2025-07-25 11:27:00 +01:00
S, Hari Govind	273a05f0bd	Fix for performance regression caused by non-unit stride y in DGEMV API (#91 ) - Temporary fix for regression in DGEMV for non-unit stride y inputs. The code section responsible for handling non-unit stride y has been removed from the frame. - The kernel code is extended with an if condition to handle both unit and non-unit stride y. AMD-Internal: [CPUPL-6869] AOCL-Weekly-250725	2025-07-25 10:57:57 +05:30
Balasubramanian, Vignesh	93414f56c8	Bugfix : Guarded AOCL_ENABLE_INSTRUCTIONS support based on AVX512-ISA support - As part of rerouting to AVX2 code-paths on ZEN4/ZEN5(or similar) architectures, the code-base established a contingency when deploying fat binary on ZEN/ZEN2/ZEN3 systems. Due to this, it was required that we always set AOCL_ENABLE_INSTRUCTIONS to 'ZEN3'(or similar values) to make sure we don't run AVX512 code on such architectures. This issue existed on FP32 and BF16 APIs. - Added checks to detect the AVX512-ISA support to enable rerouting based on AOCL_ENABLE_INSTRUCTIONS. This removes the incorrect constraint that was put forth. AMD-Internal: [CPUPL-7020] Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>	2025-07-24 12:20:05 +05:30
V, Varsha	8a86620753	Bug Fix in INT8 reference un-reorder API - For int8/uint8 reorder function, the k dimension is made multiple of 4 to meet the alignment requirements. - Modified the logic to update the k_updated to use multiples of 4. [AMD - Internal : SWLCSG - 3686 ]	2025-07-24 11:26:49 +05:30
Smyth, Edward	4bc5287f72	Support applications using Intel icc and icx compilers on Windows (#82 ) The blis.h header file includes a lot of BLIS internal definitions. Some of these caused problems when using a BLIS library compiled with clang on Windows from applications compiled with the Intel icc and icx compilers. Workaround is to use "#ifdef BLIS_IS_BUILDING_LIBRARY" to guard these definitions from being exposed to applications including blis.h. (The BLIS configure and cmake build systems automatically define BLIS_IS_BUILDING_LIBRARY only for compiling the BLIS library.) This patch implements the minimum changes to resolve the issue. Longer term, similar changes may need to be added around all BLIS internal definitions in blis.h. AMD-Internal: [CPUPL-6953]	2025-07-22 10:22:45 +01:00
V, Varsha	9e8c9e2764	Fixed compiler warnings in LPGEMM - Modified the correct variables to be passed for the batch_gemm_thread_decorator() for u8s8s32os32 API. - Removed commented lines in f32 GEMV_M kernels. - Modified some instructions in F32 GEMV M and N Kernels to re-use the existing macros. - Re-aligned the BIAS macro in the macro definition file. [ AMD - Internal : CPUPL - 7013 ]	2025-07-18 16:15:52 +05:30
V, Varsha	2f54bc1e14	Added F32 reference Unreorder function - Implemented unpackb_f32f32f32of32_reference function. - Modified const pointer declaration in aocl_reorder_reference() to avoid compiler warnings. [AMD-Internal: SWLCSG-3618 ]	2025-07-18 14:52:03 +05:30

3787 Commits