amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-11 17:50:00 +00:00

Author	SHA1	Message	Date
Smyth, Edward	c56dcb6ffb	DTL logging fixes and improvements The environment variable AOCL_VERBOSE was inconsistent in its behaviour, sometimes producing a single line of output per file from multiple BLAS calls, when it should be all or nothing. Note that: - AOCL_VERBOSE is only active when DTL logging has been enabled at compile time. Otherwise, this environment variable is not read. - When logging is enabled at compile time, logging output is produced by default. Thus AOCL_VERBOSE is more of use to turn output off, rather than on. - For production runs without logging, it is recommended to recompile with DTL disabled, as this minimizes overheads within the BLIS code. - AOCL_VERBOSE should be set to 0 or 1, and not values such as FALSE or TRUE. Changes to improve consistency when AOCL_VERBOSE is set: - Change DTL variables from Bool (unsigned char) datatype to bool, as used elsewhere in BLIS. - Ensure bli_init_auto() is called before AOCL_DTL_TRACE_ENTRY() and AOCL_DTL_LOG_*_INPUTS(), as bli_init_auto calls AOCL_DTL_INITIALIZE() - In APIs which avoid calling bli_init_auto(), add explicit calls to AOCL_DTL_INITIALIZE(). Also, make a proper comment about not calling bli_init_auto(), rather than just commenting out call, which looks like dead code. Other DTL logging control changes: - Make gbIsLoggingEnabled ICV thread local as this can be updated by calls to AOCL_DTL_Enable_Logs and AOCL_DTL_Disable_Logs APIs - After recent changes to hide some internal BLIS definitions behind ifdef BLIS_IS_BUILDING_LIBRARY guard, change BLIS_THREAD_LOCAL definition to be exported again. Logging output changes: - Standardize printing of datatype to be lower case. - Don't force printing of GEMM transa and transb to upper case, instead print in the case provided by the application code. - Add logging output to all variants (in terms of AMD/non-AMD optimized and datatype) of SWAP and SCAL. AMD-Internal: [CPUPL-7010]	2025-07-25 11:27:00 +01:00
S, Hari Govind	273a05f0bd	Fix for performance regression caused by non-unit stride y in DGEMV API (#91) - Temporary fix for regression in DGEMV for non-unit stride y inputs. The code section responsible for handling non-unit stride y has been removed from the frame. - The kernel code is extended with an if condition to handle both unit and non-unit stride y. AMD-Internal: [CPUPL-6869] AOCL-Weekly-250725	2025-07-25 10:57:57 +05:30
Balasubramanian, Vignesh	93414f56c8	Bugfix : Guarded AOCL_ENABLE_INSTRUCTIONS support based on AVX512-ISA support - As part of rerouting to AVX2 code-paths on ZEN4/ZEN5(or similar) architectures, the code-base established a contingency when deploying fat binary on ZEN/ZEN2/ZEN3 systems. Due to this, it was required that we always set AOCL_ENABLE_INSTRUCTIONS to 'ZEN3'(or similar values) to make sure we don't run AVX512 code on such architectures. This issue existed on FP32 and BF16 APIs. - Added checks to detect the AVX512-ISA support to enable rerouting based on AOCL_ENABLE_INSTRUCTIONS. This removes the incorrect constraint that was put forth. AMD-Internal: [CPUPL-7020] Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>	2025-07-24 12:20:05 +05:30
V, Varsha	8a86620753	Bug Fix in INT8 reference un-reorder API - For int8/uint8 reorder function, the k dimension is made multiple of 4 to meet the alignment requirements. - Modified the logic to update the k_updated to use multiples of 4. [AMD - Internal : SWLCSG - 3686 ]	2025-07-24 11:26:49 +05:30
Smyth, Edward	4bc5287f72	Support applications using Intel icc and icx compilers on Windows (#82 ) The blis.h header file includes a lot of BLIS internal definitions. Some of these caused problems when using a BLIS library compiled with clang on Windows from an applications compiled with the Intel icc and icx compilers. Workaround is to use "#ifdef BLIS_IS_BUILDING_LIBRARY" to guard these definitions from being exposed to applications including blis.h. (The BLIS configure and cmake builds systems automatically define BLIS_IS_BUILDING_LIBRARY only for compiling the BLIS library.) This patch implements the minimum changes to resolve the issue. Longer term, similar changes may need to be added around all BLIS internal definitions in blis.h. AMD-Internal: [CPUPL-6953]	2025-07-22 10:22:45 +01:00
V, Varsha	9e8c9e2764	Fixed compiler warnings in LPGEMM - Modified the correct variables to be passed for the batch_gemm_thread_decorator() for u8s8s32os32 API. - Removed commented lines in f32 GEMV_M kernels. - Modified some instructions in F32 GEMV M and N Kernels to re-use the existing macros. - Re-aligned the BIAS macro in the macro definition file. [ AMD - Internal : CPUPL - 7013 ]	2025-07-18 16:15:52 +05:30
V, Varsha	2f54bc1e14	Added F32 reference Unreorder function - Implemented unpackb_f32f32f32of32_reference function. - Modified const pointer declaration in aocl_reorder_reference() to avoid compiler warnings. [AMD-Internal: SWLCSG-3618 ]	2025-07-18 14:52:03 +05:30
Sharma, Shubham	355018e739	Fixed Extra reads in DTRSM small kernels. In DTRSM small code path lower triangular kernels, extra data from upper triangular region is being read. To fix this, new macros have been added to make sure only relevant data is read. AMD-Internal: [SWLCSG-3611]	2025-07-17 10:17:13 +05:30
Bhaskar, Nallani	76c08fe81d	Implemented f32 reference reorder function Implemented aocl_reorder_f32f32f32of32_reference( ) function and tested. Implemented framework changes required and placeholders for kernels for aocl_unreorder_f32f32f32of32_reference( ) function. It is not tested completely and will be taken care of in subsequent commits. [AMD-Internal: SWLCSG-3618 ]	2025-07-15 12:26:05 +05:30
V, Varsha	837d3974d4	Bug Fixes for GEMV AVX2 BF16 to F32 path - Added the correct strides to be used while unreorder/convert B matrix in m=1 cases. - Modified Zero point vector loads to proper instructions. - Modified bf16 store in AVX2 GEMV M kernel. AMD Internal - [SWLCSG - 3602 ] AOCL-Weekly-100725	2025-07-10 16:23:46 +05:30
Balasubramanian, Vignesh	ab4bb2f1e8	Threshold tuning for code-paths and optimal thread selection for ZGEMM(ZEN4) - Updated the thresholds to enter the AVX512 Tiny and SUP codepaths for ZGEMM(on ZEN4). This caters to inputs that perform well on a single-threaded execution(in the Tiny-path), and inputs that scale well with multithreaded-execution(in the SUP path). - Also updated the thresholds to decide ideal threads, based on 'm', 'n' and 'k' values. The thread-setting logic involves determining the number of tiles for computation, and using them to further tune for the optimal number of threads. AMD-Internal: [CPUPL-6378][CPUPL-6661] Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>	2025-07-10 15:35:22 +05:30
V, Varsha	98901847f1	Enabled GEMV path for BF16 GEMV operations on non-BF16 supporting machines - Added new GEMV_AVX2 5-Loop for handling BF16 inputs, for n = 1 and m = 1 conditions. - Modified Re-order and Un-reorder functions to cater to default n=1 reorder conditions. - Added bf16 beta and store support in F32 GEMV N AVX2 and 256_512 kernels. - Added bf16 beta support for F32 GEMV M kernels, and modified bf16 store conditions for GEMV M kernels. - Modified the n=1 re-order guards for reference bf16 re-order API. - Added an additional path in the un-reorder case for handling n=1 vector conversion AMD-Internal: [ SWLCSG - 3602 ]	2025-07-09 19:45:40 +05:30
Smyth, Edward	969ceb7413	Finer control of code path options (#67 ) Add macros to allow specific code options to be enabled or disabled, controlled by options to configure and cmake. This expands on the existing GEMM and/or TRSM functionality to enable/disable SUP handling and replaces the hard coded #define in include files to enable small matrix paths. All options are enabled by default for all BLIS sub-configs but many of them are currently only implemented in AMD specific framework code variants. AMD-Internal: [CPUPL-6906] --------- Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>	2025-07-08 10:59:23 +01:00
Bhaskar, Nallani	9b02201b5b	Updated Poly 16 in Gelu Erf to double precision Updated poly Gelu Erf precision to double to keep the error within the 1e-5 limit when compared to reference gelu_erf, which also increases the compute to 2x compared to float. AMD-Internal: SWLCSG-3551	2025-07-07 14:05:40 +05:30
Balasubramanian, Vignesh	c0d33879ec	Bugfix : Integer typecast inside CGEMM AVX512 24xk packing kernel (#68 ) - When building the library with LP64 configuration, it is expected that we typecast integers to 64-bit internally, before loading them onto 64-bit GPRs. This ensures that the upper 32-bit lane is zeroed out, to avoid any possible junk values. The current change enforces this typecast inside the 24xk packing kernel for CGEMM(AVX512), which was missing before. AMD-Internal: [CPUPL-6907] Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com> Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>	2025-07-01 17:05:54 +05:30
Dave, Harsh	7c6c04a457	More optimizations in 6x8m DGEMM SUP Kernel using prefetching (#34) * Enhance Prefetching in 6x8m DGEMM SUP Kernel for Improved Performance This update optimizes the DGEMM kernel by implementing well-suited prefetching techniques. Key changes include: - Prefetching Strategy: - Introduced prefetching instructions to load matrix data into cache ahead of computation. - Prefetching for matrix A is based on the k-loop, starting from columns close to the ones being loaded and computed. - Prefetching for matrix B follows a similar approach, focusing on rows close to the ones being loaded and computed. - Unrolling Optimization: - Increased the unroll factor of the k-loop from 4 to 8, allowing for more efficient prefetching of matrices A and B. - This adjustment enhances data locality and reduces the overhead associated with loop control. - Performance Improvements: - Reduced memory access latency by ensuring data is preloaded into cache. - Enhanced computational throughput by minimizing stalls due to memory access delays. - Improved overall efficiency of matrix multiplication operations. These enhancements lead to faster DGEMM computations, leveraging improved cache utilization and loop unrolling to boost overall performance. AMD-Internal: [CPUPL-6435] * added unroll K by 4 along with unroll K by 8 * Added descriptive comments explaining prefetch strategy * More optimizations in 6x8m DGEMM SUP Kernel using prefetching - Restructured main loop with 8× and 4× unrolling (k_iter_8, k_iter_4, k_left) for deeper pipeline utilization. - Introduced forward prefetching for A and future B rows to better align with unrolled access patterns. - Interleaved alpha scaling with FMA for computation of alpha*AB + C more efficiently. These enhancements lead to faster DGEMM computations, leveraging improved cache utilization and loop unrolling to boost overall performance. 
AMD-Internal: [CPUPL-6435] --------- Co-authored-by: Harsh Dave <harsdave@amd.com> Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>	2025-07-01 15:02:50 +05:30
Smyth, Edward	8a8d3f43d5	Improve consistency of optimized BLAS3 code (#64 ) * Improve consistency of optimized BLAS3 code Tidy AMD optimized GEMM and TRSM framework code to reduce differences between different data type variants: - Improve consistency of code indentation and white space - Added some missing AOCL_DTL calls - Removed some dead code - Consistent naming of variables for function return status - GEMM: More consistent early return when k=1 - Correct data type of literal values used for single precision data In kernels/zen/3/bli_gemm_small.c and bli_family_.h files: - Set default values for thresholds if not set in the relevant bli_family_.h file - Remove unused definitions and commented out code AMD-Internal: [CPUPL-6579]	2025-07-01 09:29:52 +01:00
Balasubramanian, Vignesh	98bc1d80e7	Support for Tiny-GEMM interface(CGEMM) - Added the support for Tiny-CGEMM as part of the existing macro based Tiny-GEMM interface. This involved defining the appropriate AVX2/AVX512 lookup tables and functions for the target architectures(as per the design), for compile-time instantiation and runtime usage. - Also extended the current Tiny-GEMM design to incorporate packing kernels as part of its lookup tables. These kernels will be queried through lookup functions and used in case of wanting to support non-trivial storage schemes(such as dot-product computation). - This allows for a plug-and-play fashion of experimenting with pack and outer product method against native inner product implementations. - Further updated the existing AVX512 pack routine that packs the A matrix (in blocks of 24xk). This utilizes masked loads/stores instructions to handle fringe cases of the input(i.e., when m < 24). - Also added the AVX512 outer product kernels for CGEMM as part of the ZEN4 and ZEN5 contexts, to handle RRC and CRC storage schemes. This is facilitated through optional packing of A matrix in the SUP framework. AMD-Internal: [CPUPL-6498] Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com> Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com> AOCL-Jul2025-b1	2025-06-30 12:14:44 +05:30
V, Varsha	1f9d1a85d3	Updated aocl_batch_gemm_ APIs aligning to CBLAS batch API. (#58 ) * Updated aocl_batch_gemm_ APIs aligning to CBLAS batch API. - Modified Batch-Gemm API to align with cblas_?gemm_batch_ API, and added a parameter group_size to the existing APIs. - Updated bench batch_gemm code to align to the new API definition. - Modified the hardcoded number in lpgemm_postop file. - Added necessary early return condition to account for group_count/group_size < 0. AMD-Internal: [ SWLCSG - 3592 ]	2025-06-30 11:16:04 +05:30
Sharma, Arnav	5193433141	Disable GCC 11.4 tree loop optimization for AVX512 F32 Sigmoid Post-Op (#63) - Disabled tree loop optimizations for all AVX512 F32 fringe kernels when compiled with GCC 11.4 to address numerical inaccuracies in Sigmoid post-op caused by aggressive loop optimizations. - The fix uses function-level GCC attribute __attribute__((optimize("no-tree-loop-optimize"))) to selectively disable tree loop optimizations only for the affected kernels based on GCC version check. AMD-Internal: [SWLCSG-3559]	2025-06-26 16:31:55 +05:30
V, Varsha	e05d24315e	Bug Fixes for Accuracy issues in Int8 API (#62) - In U8 GEMV n=1 kernels, the default zp condition was S8 ZP type, which leads to accuracy issues when the u8s8s32u8 API is used. - A few modifications in bench code to take the correct path for accuracy check.	2025-06-25 17:01:22 +05:30
Smyth, Edward	30c42202d7	GCC 15 SUP kernel workaround (#35 ) GCC 15 fails to compile some SUP kernels. The problem seems to be related to one of the optimization phases enabled at -O2 or above. Workaround is to disable this specific optimization by adding the flag -fno-tree-slp-vectorize to CKOPTFLAGS. AMD-Internal: [CPUPL-6579] Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>	2025-06-25 11:01:34 +01:00
Balasubramanian, Vignesh	15c44a6f8c	Adding dynamic thread-setting logic for CGEMM(AOCL_DYNAMIC) (#48 ) - Added a set of thresholds(based on input dimensions) that determine and set the ideal number of threads to be used for CGEMM (on ZEN4 and ZEN5 architectures). - The thread-setting logic is as follows : - The underlying kernels(single-threaded) work on blocks of MRxk of A, kxNR of B and MRxNR of C. Thus, it is initially assumed that the optimal number of threads is ceil(m/MR)*ceil(n/NR). This is the upper bound on the actual number of threads that is ideal. - The actual ideal thread count could be lesser than the upper bound, based on the work that every thread receives. This is mainly determined by the value of 'k'. - If 'k' is small, the arithmetic intensity(AI) is low and memory bandwidth becomes the limiting factor, thus favoring smaller thread counts. In contrast, if 'k' is high, the AI is high and the workload scales well with higher thread counts. - So, we limit the number of threads when 'k' is small to avoid bandwidth contention. Using fewer threads ensures each thread gets more bandwidth, improving efficiency. In contrast, we allow more threads when 'k' is large, as the computation becomes more compute-bound and less limited by memory bandwidth, thereby benefitting with a higher-thread count. - The new logic will now set the upper bound for the optimal number of threads (based on the number of tiles), and then further reduce it based on the values of 'm', 'n' and 'k'. This comes under the 'AOCL_DYNAMIC' feature for CGEMM, specifically for ZEN4 and ZEN5 architectures. AMD-Internal: [CPUPL-6498] Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com> Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>	2025-06-25 10:05:40 +05:30
Vankadari, Meghana	c81408c805	Modified reorder and pack code in sym quant API (#59) Details: - In s8 APIs with symmetric quantization, existing kernels are reused to avoid duplication of reorder code. - Since the existing kernels are designed assuming that the entire KCxNC block is packed at once, to handle grouping in symmetric quantization, we have to add JR and group loops outside the function call to the existing packB function. - Though this was being done before, the cases where n_rem < 64 were not handled properly. - Modified reorder and pack code to first divide the n_fringe part into a multiples-of-16 part and an n_lt_16 part, and then call the pack kernel twice to handle both parts separately. - All the strides to access the reordered/pack buffer are updated accordingly.	2025-06-24 11:36:35 +05:30
S, Hari Govind	8d41565822	Fix build failure when AOCL_DYNAMIC is disabled (#57 ) - The build was failing when AOCL_DYNAMIC was disabled because `fast_path_thresh` was only declared when both AOCL_DYNAMIC and OpenMP were enabled. This variable was used in an `if` condition for single-thread execution without an AOCL_DYNAMIC guard. - To resolve this, the test expression for single-thread execution has been replaced with a macro. This macro is set to 0 when AOCL_DYNAMIC is disabled, ensuring the condition is handled correctly. AMD-Internal: [CPUPL-6854]	2025-06-23 15:56:15 +05:30
V, Varsha	8fd7060b2f	Matrix Add and Matrix Mul Post-op addition in F32 AVX512_256 kernels (#50 ) Added Matrix-mul and Matrix-add postops in FP32 AVX512_256 GEMV kernels - Matrix-add and Matrix-mul post ops in FP32 AVX512_256 GEMV m = 1 and n = 1 kernels has been added. Co-authored-by: VarshaV <varshav2@amd.com> AOCL-Weekly-200625	2025-06-17 16:17:13 +05:30
Smyth, Edward	b5c66a9d8c	Implement bli_thread_reset (#32 ) BLIS-specific setting of threading takes precedence over OpenMP thread count ICV values, and if the BLIS-specific threading APIs are used, there was no way for the program to revert to OpenMP settings. This patch implements a function bli_thread_reset() to do this. This is similar to that implemented in upstream BLIS in commit `6dcf7666ef` More specifically, it reverts the internal threading data to that which existed when the program was launched, subject where appropriate to any changes in the OpenMP ICVs. In other words: - It will undo changes to threading set by previous calls to bli_thread_set_num_threads or bli_thread_set_ways. - If the environment variable BLIS_NUM_THREADS was used, this will NOT be cleared, as the initial state of the program is restored. - Changes to OpenMP ICVs from previous calls to omp_set_num_threads() will still be in effect, but can be overridden by further calls to omp_set_num_threads(). Note: the internal BLIS data structure updated by the threading APIs, including bli_thread_reset(), is thread-local to each user (e.g. application) thread. Example usage: omp_set_num_threads(4); bli_thread_set_num_threads(7); dgemm(...); // 7 threads will be used bli_thread_reset(); dgemm(...); // 4 threads will be used	2025-06-17 10:40:10 +01:00
S, Hari Govind	e097346658	Implemented Multithreading Support and Optimization of DGEMV API (#10 ) - Implemented multithreading framework for the DGEMV API on Zen architectures. Architecture specific AOCL-dynamic logic determines the optimal number of threads for improved performance. - The condition check for the value of beta is optimized by utilizing masked operations. The mask value is set based on value of beta, and the masked operations are applied when the vector y is loaded or scaled with beta. AMD-Internal: [CPUPL-6746]	2025-06-17 12:39:48 +05:30
Vankadari, Meghana	26e5c63781	Disabled default packing of matrices in batch_gemm of FP32 (#55 ) AMD-Internal: SWLCSG-3527	2025-06-17 10:53:05 +05:30
Vankadari, Meghana	8649cdc14b	Removed unnecessary pack checks in FP32 GEMV (#54 ) Details: - In FP32 GEMM, when threading is disabled, rntm_pack_a and rntm_pack_b were set to true by default. This leads to perf regression for smaller sizes. Modified FP32 interface API to not overwrite the packA and packB variables in rntm structure. - In FP32 GEMV, Removed the decision making code based on mtag_A/B and should_pack_A/B for packing. Matrices will be packed only if the storage format of the matrices doesn't match the storage format required by the kernel. - Changed the control flow of checking the value of mtag to whether matrix is "reordered" or "to-be-packed" or "unpacked". checking for "reorder" first, followed by "pack". This will ensure that packing doesn't happen when the matrix is already reordered even though user forces packing by setting "BLIS_PACK_A/B" -Modified python script to generate testcases based on block sizes AMD-Internal: SWLCSG-3527	2025-06-16 12:34:11 +05:30
Balasubramanian, Vignesh	1847a1e8c6	Bugfix : Segmentation fault at the topology detection layer (#51) - The current implementation of the topology detector establishes a contingency, wherein it is expected that the parallel region uses all the threads queried through omp_get_max_threads(). In case the actual parallelism in the function is limited(lower than this expectation), the code may access unallocated memory section (using uninitialized pointers). - This was because every thread(having its own pointer), sets its initial value to NULL inside the parallel section, thereby leaving some pointers uninitialized if the associated thread is not spawned. - Also, the current implementation would use negative indexing(with -1) if any associated thread was not spawned. - Fix : Set every thread-specific pointer to NULL outside the parallel region, using calloc(). As long as we have NULL checks for pointers before accessing through them, no issues will be observed. Avoid incurring the topology detection cost if all the required threads are not spawned(thereby avoiding potential negative indexing). (when using core-group ID). AMD-Internal: [SWLCSG-3573] Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com> Co-authored-by: Bhaskar, Nallani <Nallani.Bhaskar@amd.com>	2025-06-14 21:55:02 +05:30
Chandrashekara K R	ae698be825	Updated version string from 5.0.1 to 5.1.1	2025-06-13 11:24:50 +05:30
Vankadari, Meghana	8968973c2d	Performance fix for FP32 GEMV (#47) Details: - In FP32 GEMM interface, mtag_b is being set to PACK by default. This leads to packing of the B matrix even though packing is not absolutely required, causing a perf regression. - Set mtag_b to PACK only if it is absolutely necessary to pack the B matrix, and modified the check conditions before packing accordingly. AMD-Internal - [SWLCSG-3575] AOCL-Jun2025-b2	2025-06-10 14:54:01 +05:30
Smyth, Edward	49ae7db89a	Avoid including .c files (#40 ) Including a C file directly in another C file is not recommended, and some build systems (e.g. Bazel and Buck) do not allow .c files to include other .c files. This commit changes the tapi and oapi framework files that are included from the _ex and _ba file variants from .c filenames to .h filenames. AMD-Internal: [CPUPL-6784] Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>	2025-06-10 11:33:33 +05:30
V, Varsha	875375a362	Bug Fixes in FP32 Kernels: (#41) * Bug Fixes in FP32 Kernels: - The current implementation lets m=1 tiny cases inside LPGEMV_TINY loop, but doesn't have the call to GEMV_M_ONE kernels. Added the m=1 path in LPGEMV_TINY loop by handling the pack A/Pack B/reorder B conditions. - Added BF16 support for BIAS, Matrix-Add and Matrix-Mul for AVX512 F32 main and GEMV kernels. - Added BF16 Downscale, BIAS, Matrix-Add and Matrix-Mul support in AVX2 GEMV_N and AVX512_256 GEMV kernels. - Added BF16 Matrix-Add and Matrix-Mul support for AVX512_256 F32 kernels. - Modified the condition check in FP32 Zero point in AVX512 kernels, and fixed few bugs in Col-major Zero point evaluation and instruction usage. AMD Internal: [ CPUPL - 6748 ] --------- Co-authored-by: VarshaV <varshav2@amd.com>	2025-06-06 17:48:50 +05:30
Vankadari, Meghana	9e9441db47	Fix for n_fringe in AVX512 FP32 6x64 kernel (#42 ) Details: - Fixed the problem decomposition for n-fringe case of 6x64 AVX512 FP32 kernel by updating the pointers correctly after each fringe kernel call. - AMD-Internal: SWLCSG-3556	2025-06-06 11:33:25 +05:30
Vankadari, Meghana	37efbd284e	Added 6x16 and 6xlt16 main kernels for f32 using AVX512 instructions (#38) * Implemented 6xlt8 AVX2 kernel for n<8 inputs * Implemented fringe kernels for 6x16 and 6xlt16 AVX512 kernels for FP32 * Implemented m-fringe kernels for 6xlt8 kernel for AVX2 * Added the deleted kernels and fixed bias bug AMD-Internal: SWLCSG-3556	2025-06-05 15:17:02 +05:30
Smyth, Edward	14e46ad83b	Improvements to x86 make_defs files (#29 ) Various changes to simplify and improve x86 related make_defs files: - Make better use of common definitions in config/zen/amd_config.mk from config/zen/make_defs.mk files - Similarly for config/zen/amd_config.make from the config/zen/make_defs.cmake files - Pass cc_major, cc_minor and cc_revision definitions from configure to generated config.mk file, and use these instead of defining GCC_VERSION in config/zen*/make_defs.mk files - Add znver3 support for LLVM 13 in config/zen3/make_defs.{mk,cmake} - Add znver5 support for LLVM 19 in config/zen5/make_defs.{mk,cmake} - Improve readability of haswell, intel64, skx and x86_64 files - Correct and tidy some comments AMD-Internal: [CPUPL-6579]	2025-06-03 16:20:43 +01:00
Dave, Harsh	3c8b7895f7	Fixed functionality failure of DGEMM pack kernel. (#31) * Fixed functionality failure of DGEMM pack kernel. - Corrected the mask preparation needed for load/store in edge kernel where m = 18. - Corrected the usage of right vector registers while storing data back to buffer in edge kernels. AMD-Internal: [CPUPL-6773] * Update bli_packm_zen4_asm_d24xk.c --------- Co-authored-by: Harsh Dave <harsdave@amd.com>	2025-06-03 17:33:16 +05:30
Smyth, Edward	dcf72968cf	Blacklist KNL with GCC 15+ (#844 ) (#28 ) Details: - GCC 15 drops support for Xeon Phi architectures such as KNL. - This PR blacklists the `knl` configuration for GCC 15+. Co-authored-by: Dave Love <dave.love@manchester.ac.uk>	2025-06-02 10:31:30 +01:00
V, Varsha	532eab12d3	Bug Fixes in LPGEMM for AVX512(SkyLake) machine (#24) * Bug Fixes in LPGEMM for AVX512(SkyLake) machine - Support added in FP32 512_256 kernels for: Beta, BIAS, Zero-point and BF16 store types for bf16bf16f32obf16 API execution in AVX2 mode. - B-matrix in bf16bf16f32obf16/f32 API is re-ordered. For machines that don't support BF16 instructions, the BF16 input is unre-ordered and converted to FP32 type to use FP32 kernels. - For n = 1 and k = 1 sized matrices, re-ordering in BF16 is copying the matrix to the re-ordered buffer array. But the un-reordering to FP32 requires the matrix to have size multiple of 16 along n and multiple of 2 along k dimension. The entry condition here has been modified for AVX512 configuration. - Fix for seg fault with AOCL_ENABLE_INSTRUCTIONS=AVX2 mode in BF16/VNNI ISA supporting configurations: - BF16 tiny path entry check has been modified to take into account arch_id to prevent improper entry into the tiny kernel. - The store in BF16->FP32 col-major for m = 1 conditions was updated to the correct storage pattern. - BF16 beta load macro was modified to account for data in unaligned memory. - Modified existing store instructions in FP32 AVX512 kernels to support execution in machines that have AVX512 support but not BF16/VNNI(SkyLake) AMD Internal: [SWLCSG-3552] --------- Co-authored-by: VarshaV <varshav2@amd.com>	2025-05-30 17:22:49 +05:30
Arnav Sharma	62d4fcb398	Bugfix: Group Size Validation for s8s8s32o32_sym_quant - Fixed the group size validation logic to correctly check if the group_size is a multiple of 4. - Previously the condition was incorrectly performing bitwise AND with decimal 11 instead of binary 11 (decimal 3). AMD-Internal: [CPUPL-6754]	2025-05-30 11:53:23 +05:30
Prabhu, Anantha	9b7e1105dc	Update branch-name-check.yml (#27)	2025-05-29 18:13:15 +05:30
Negi, Deepak	ffd7c5c3e0	Postop support for Static Quant and Integer APIs (#20) Support for S32 Zero point type is added for aocl_gemm_s8s8s32os32_sym_quant. Support for BF16 scale factors type is added for aocl_gemm_s8s8s32os32_sym_quant. U8 buffer type support is added for matadd, matmul, bias post-ops in all int8 APIs. AMD-Internal: SWLCSG-3503	2025-05-27 16:29:32 +05:30
Prabhu, Anantha	ced8b3b7d8	Potential fix for code scanning alert no. 1: Workflow does not contain permissions (#16) Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>	2025-05-21 07:51:09 -07:00
Vlachopoulou, Eleni	4555917040	CMake: Removing INT_SIZE variable from presets (#11) With this change, the default INT_SIZE of the system will be used. That is compatible with the Make system and default CMake options.	2025-05-21 09:35:32 +01:00
Vlachopoulou, Eleni	fac4a97118	Adding flags when building bench with CMake (#9) Since the GNU extensions were removed, executables in the bench directory cannot be built correctly. The fix is adding "-D_POSIX_C_SOURCE=200112L" on those targets. When -std=gnu99 was used, bench worked without this flag, but that has not been the case since we switched to -std=c99.	2025-05-16 11:47:39 +01:00
Kallesh, Vijay-teekinavar	a2a045cb2e	SWLDEVOPS-7853 - Action file to mandate branch naming convention (#2) --------- Co-authored-by: vkallesh <vkallesh@amd.com>	2025-05-14 15:29:35 +05:30
Bhaskar, Nallani	42a0d74ced	Fixed configuration issues in AOCL_GEMM addon (#4) * Fixed configuration issues in AOCL_GEMM addon Description: Fixed aocl_gemm addon initialization of kernels and block sizes for machines which support only AVX512 but not AVX512_VNNI/VNNI_BF16. Aligned NC, KC blocking variables between ZEN and ZEN4. AMD-Internal: [SWLCSG-3527]	2025-05-13 17:19:19 +05:30
Negi, Deepak	121d81df16	Implemented GEMV kernel for m=1 case. (#5) * Implemented GEMV kernel for m=1 case. Description: - Added a new GEMV kernel for AVX2 where m=1. - Added a new GEMV kernel for AVX512 with ymm registers where m=1.	2025-05-13 16:33:04 +05:30

3744 Commits