204 Commits

Author SHA1 Message Date
Balasubramanian, Vignesh
73911d5990 Updates to the build systems (CMake and Make) for LPGEMM compilation (#303)
- The current build systems behave as follows with regard to
  building the "aocl_gemm" addon codebase (LPGEMM) when "amdzen"
  is given as the target architecture (fat binary):
  - Make:  Attempts to compile LPGEMM kernels using the same
           compiler flags that the makefile fragments set for BLIS
           kernels, based on the compiler version.
  - CMake: With presets, it always enables the addon compilation
           unless explicitly specified with the ENABLE_ADDON variable.

- This causes a bug with older compilers, since they do not support
  compiling BF16 or INT8 intrinsics.

- This patch adds checks for the GCC and Clang compiler versions,
  and disables LPGEMM compilation if GCC < 11.2 or Clang < 12.0.

- Make:  Updated the configure script to check the compiler version
         if the addon is specified.
  CMake: Updated the main CMakeLists.txt to check the compiler version
         if the addon is specified, and to force-update the associated
         cache variable. Also updated kernels/CMakeLists.txt to
         check whether "aocl_gemm" remains in the ENABLE_ADDONS list
         after all the checks in the previous layers.
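The version gate above can be sketched as a small standalone predicate. This is an illustrative reimplementation, not the actual configure/CMake code, and the function name is made up:

```c
#include <assert.h>

/* Illustrative predicate (hypothetical name): returns 1 when LPGEMM
 * ("aocl_gemm") compilation should stay enabled, mirroring the rule
 * GCC >= 11.2 or Clang >= 12.0 described above. */
static int lpgemm_compiler_ok(int is_clang, int major, int minor)
{
    if (is_clang)
        return major >= 12;                            /* Clang >= 12.0 */
    return (major > 11) || (major == 11 && minor >= 2); /* GCC >= 11.2 */
}
```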

AMD-Internal: [CPUPL-7850]

Signed-off-by: Vignesh Balasubramanian <Vignesh.Balasubramanian@amd.com>
2026-01-16 19:39:55 +05:30
Balasubramanian, Vignesh
f992942f6b Disabling GEMV (m=1) rerouting in BF16 APIs (AVX512)
- Disabled rerouting to GEMV in BF16 APIs for inputs with
  m == 1. The GEMM path is now used for handling such inputs.

AMD-Internal: [CPUPL-7536]
Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
2025-11-27 14:43:31 +05:30
V, Varsha
fecb1aa7a5 Bug Fix in BF16 AVX2 conversion path (#236)
- In the current implementation of bf16 to f32 conversion for packed data,
 both the GEMM and GEMV conditions are handled in the same function,
 separated by conditionals.
 - But when n = (NC+1), the function would execute the GEMV conversion logic
 and write back the data incorrectly, leading to accuracy issues.
 - Hence, modified the convert and reorder functions to have separate
 conversion logic, making them cleaner and avoiding confusion.
 - Also, updated the API calls to adhere to the changes appropriately.

[AMD-Internal: CPUPL-7540]
2025-10-17 15:38:02 +05:30
Bhaskar, Nallani
db3134ed6d Disabled no post-ops path in lpgemm f32 kernels for few gcc versions
Guarded the np (no post-ops) path in the f32 API with a macro
 as a workaround, since gcc 11.4 and 11.2 give accuracy issues
 with the np path.
2025-09-22 15:52:21 +05:30
Sharma, Arnav
ee3d250b7a Fix for F32 to BF16 Conversion and AVX512 ISA Support Checks
- Fixed register assignment bug in lpgemv_m_kernel_f32_avx512 where zmm3
  was incorrectly used instead of zmm4 in BF16_F32_BETA_OP_NLT16F_MASK macro.

- Replaced hardware-specific BF16 conversion intrinsics with manual
  rounding and bit manipulation using F32 instructions, for compatibility
  with hardware lacking native BF16 support.
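A software bf16 conversion of this kind is commonly done with round-to-nearest-even via integer arithmetic. A minimal sketch follows; the helper name is hypothetical and this is not the kernel's actual code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: convert one f32 to bf16 (top 16 bits of the IEEE-754
 * representation) using round-to-nearest-even, with no BF16 hardware needed. */
static uint16_t f32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));          /* type-pun safely */
    /* NaN: plain truncation could yield infinity, so force a quiet NaN */
    if ((bits & 0x7F800000u) == 0x7F800000u && (bits & 0x007FFFFFu))
        return (uint16_t)((bits >> 16) | 0x0040u);
    /* add 0x7FFF plus the LSB of the kept part: rounds half to even */
    uint32_t rounding_bias = 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)((bits + rounding_bias) >> 16);
}
```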

- Added AVX512_BF16 ISA support checks for s8s8s32obf16 and u8s8s32obf16
  GEMM operations to ensure processor compatibility before execution.

AMD-Internal: [CPUPL-7410]
2025-09-19 18:49:33 +05:30
Smyth, Edward
e59eabaf58 Compiler warnings fixes (2)
Fix compiler warning messages in LPGEMM code:
- Removed extraneous parentheses in aocl_batch_gemm_s8s8s32os32.c
- Removed unused variables in lpgemv_{m,n}_kernel_s8_grp_amd512vnni.c
- Changed ERR_UBOUND in math_utils_avx2.h and math_utils_avx512.h
  to match how it is specified in AOCL libm erff.c

AMD-Internal: [CPUPL-6579]
2025-09-17 18:28:34 +01:00
Smyth, Edward
ae6c7d86df Tidying code
- AMD-specific BLAS1 and BLAS2 framework: changes to make variants
  more consistent with each other
- Initialize kernel pointers to NULL where not immediately set
- Fix code indentation and other whitespace changes in DTL
  code and addon/aocl_gemm/frame/s8s8s32/lpgemm_s8s8s32_sym_quant.c
- Fix typos in DTL comments
- Add missing newline at end of test/CMakeLists.txt
- Standardize on using arch_id variable name

AMD-Internal: [CPUPL-6579]
2025-09-16 14:52:54 +01:00
KadavilMadanaMohanan, MithunMohan (Mithun Mohan)
5de25ce9a7 Fixed high priority coverity issues in LPGEMM. (#178)
* Fixed high priority coverity issues in LPGEMM.

- Out-of-bounds issue and uninitialized variables fixed in the aocl_gemm addon.
2025-09-11 18:27:19 +05:30
Balasubramanian, Vignesh
37f255821a Optimal rerouting of GEMV inputs to avoid packing
- Added conditional swapping of input matrices and their
  strides for GEMV, based on whether transpose is toggled
  specifically for the matrix, namely the B matrix when m=1
  and the A matrix when n=1.

- This swapping ensures that the inputs are rerouted to the
  alternative variant (code path), avoiding the packing cost
  for the matrix through logical transposition.

- Currently, this optimization is enabled only when no post-ops
  are involved. With post-ops, the incoming data (from the user)
  needs to be updated in some scenarios; this will be
  dealt with later.
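The swap described above relies on the identity C = A·B ⇔ Cᵀ = Bᵀ·Aᵀ. A toy sketch of the operand/stride exchange; the struct and names are illustrative, not the addon's actual data structures:

```c
#include <assert.h>

/* Toy operand descriptor, for illustration only. */
typedef struct { const void *buf; int rs, cs; } operand_t;

/* Swap A and B and exchange each operand's row/column strides: a logical
 * transpose of both, so e.g. an m=1 problem can reuse the n=1 code path
 * (C = A*B  <=>  C^T = B^T * A^T) without paying the packing cost. */
static void logical_transpose_swap(operand_t *a, operand_t *b)
{
    operand_t tmp = *a;
    *a = *b;
    *b = tmp;
    int t;
    t = a->rs; a->rs = a->cs; a->cs = t;
    t = b->rs; b->rs = b->cs; b->cs = t;
}
```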

AMD-Internal: [CPUPL-7323]

Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
2025-09-03 09:15:59 +05:30
Sharma, Arnav
98eeeb0ddb Updated Guards in s8s8s32of32_sym_quant Framework
- Moved the `#ifdef BLIS_KERNELS_ZEN4` directive in `lpgemm_s8s8s32_sym_quant.c` to encompass the relevant code block more effectively and to remove unused-variable warnings.

AMD-Internal: [CPUPL-7320]
2025-09-01 19:56:32 +05:30
Smyth, Edward
fb2a682725 Miscellaneous changes
- Change begin_asm and end_asm comments and unused code in files
     kernels/haswell/3/sup/s6x16/bli_gemmsup_rv_haswell_asm_sMx6.c
     kernels/zen4/3/sup/bli_gemmsup_cd_zen4_asm_z12x4m.c
  to avoid problems in the clobber-checking script.
- Add missing clobbers in files
     kernels/zen4/1m/bli_packm_zen4_asm_d24xk.c
     kernels/zen4/1m/bli_packm_zen4_asm_z12xk.c
     kernels/zen4/3/sup/bli_gemmsup_cv_zen4_asm_z12x4m.c
- Add missing newline at end of files.
- Update some copyright years for recent changes.
- Standardize license text formatting.

AMD-Internal: [CPUPL-6579]
2025-08-26 16:37:43 +01:00
Bhaskar, Nallani
b052775644 Disabled topology detection in LPGEMM
- Disabled topology detection as libgomp is not honoring
  the standard function omp_get_place_proc_ids

- Added B prefetch in bf16 B packing kernels

AMD-Internal: SWLCSG-3761
2025-08-26 14:50:01 +01:00
V, Varsha
3df4aac2d2 Bugfix for A matrix packing in int8(S8/U8) APIs for Batch-Matmul
- The A matrix is not expected to be packed in the normal row-stored
 case, so the packing implementation is incomplete.
 - But if the user explicitly enables packing, the interface wasn't handling
 this condition appropriately, causing data to be overwritten inside the
 incomplete pack kernels and leading to accuracy failures.
 - As a fix, updated the interface to reset the explicit PACK A to UNPACKED and
 proceed with GEMM in cases where transposing A is not necessary.
 - Updated the batch gemm input file with additional test cases covering all the
 APIs.
Bug Fixes:
 - Fixed the implementation logic so that post-ops are disabled for column-major
 inputs in S8 batch mat-mul. With the existing implementation, column-major inputs
 wouldn't be executed for of32/os32 inputs.
 - Fixed the scale/ZP calculation in bench for the u8s8s32ou8 condition, which was
 leading to accuracy failures.

[AMD-Internal: CPUPL-7283 ]
2025-08-26 16:46:37 +05:30
V, Varsha
6cdab2720c Bugfix for A matrix packing in int8(S8/U8) APIs
- Packing the A matrix by default isn't necessary for row-major matrix data. Also, packing A was
 giving regressions and hence wasn't expected to be used.
 - However, packing A is necessary in column-major cases, where a transpose has to be done. This path has been verified.
 - Hence, when the user sets pack A explicitly, execution enters the incomplete packA function and overwrites the elements
 in the buffer in subsequent iterations, leading to accuracy issues. As a fix, the patch resets the PACK
 condition to UNPACKED at the interface when the user explicitly sets it, ensuring seamless execution.

[ AMD-Internal : CPUPL - 7193 ]
2025-08-22 18:46:19 +05:30
Vankadari, Meghana
5044b69d3d Bug fix in LPGEMV m=1 AVX2 kernel for post-ops
Details:
- Fixed loading of matadd and matmul pointers in the GEMV
 lt16 kernel for the AVX2 M=1 case.
- Hard-set the row stride of B to 1 (inside GEMV) when it has
  already been reordered.

AMD-Internal: CPUPL-7197, CPUPL-7221
Co-authored-by: Balasubramanian, Vignesh <Vignesh.Balasubramanian@amd.com>
2025-08-22 18:15:05 +05:30
Sharma, Arnav
76c4872718 GEMV support for S8S8S32O32 Symmetric Quantization
Introduced support for GEMV operations with group-level symmetric quantization for the s8s8s32os32 API.

Framework Changes:
- Added macro definitions and function prototypes for GEMV with symmetric quantization in lpgemm_5loop_interface_apis.h and lpgemm_kernels.h.
  - LPGEMV_M_EQ1_KERN2 for the lpgemv_m_one_s8s8s32os32_sym_quant kernel, and
  - LPGEMV_N_EQ1_KERN2 for the lpgemv_n_one_s8s8s32os32_sym_quant kernel.
- Implemented the main GEMV framework for symmetric quantization in lpgemm_s8s8s32_sym_quant.c.

Kernel Changes:
- lpgemv_m_one_s8s8s32os32_sym_quant for handling the case where M = 1 and implemented in lpgemv_m_kernel_s8_grp_amd512vnni.c.
- lpgemv_n_one_s8s8s32os32_sym_quant for handling the case where N = 1 and implemented in lpgemv_n_kernel_s8_grp_amd512vnni.c.
- Updated the buffer reordering logic for group quantization for N=1 cases in aocl_gemm_s8s8s32os32_utils.c.

Notes
- Ensure that group_size is a factor of K (and of KC when K > KC).
- The B matrix must be provided in reordered format (mtag_b == REORDERED).

AMD-Internal: [SWLCSG-3604]
2025-08-14 13:41:25 +05:30
Vlachopoulou, Eleni
1f8a7d2218 Renaming CMAKE_SOURCE_DIR to PROJECT_SOURCE_DIR so that BLIS can be built properly via FetchContent() (#65) 2025-08-07 15:51:59 +01:00
Bhaskar, Nallani
9d571bb5d3 Fixed few Coverity warnings in aocl gemm addon
Fixed a few Coverity warnings in the aocl_gemm addon.

AMD-Internal: CPUPL-6913
2025-08-06 15:37:40 +05:30
V, Varsha
68d47281df Fixing some copying bugs in Batch-Matmul code
- Removed duplicate calls to BATCH_GEMM_CHECK().
 - Refactored freeing of post-op pointer in bench code and verified the
    functionality.
 - Modified indexing of the array to take the correct values.
2025-08-01 18:42:10 +05:30
Bhaskar, Nallani
46aac600ec Added f32 kernels without post-ops to avoid overhead
Description:

1. Created f32 intrinsic kernels without post-ops support, so that f32 GEMM
   without post-ops runs optimally.
2. Invoke the no post-ops kernels from the main kernel when the post-ops
   handler has no post-ops to apply.
3. The kernels are redundant but were added to get the best performance
   for a pure GEMM call.

AMD-Internal : SWLCSG-3692
2025-07-25 23:14:23 +05:30
Balasubramanian, Vignesh
93414f56c8 Bugfix : Guarded AOCL_ENABLE_INSTRUCTIONS support based on AVX512 ISA support
- As part of rerouting to AVX2 code paths on ZEN4/ZEN5 (or similar)
  architectures, the code base established a contingency for
  deploying a fat binary on ZEN/ZEN2/ZEN3 systems. Due to this,
  it was required to always set AOCL_ENABLE_INSTRUCTIONS to
  'ZEN3' (or similar values) to make sure AVX512 code does not run
  on such architectures. This issue existed in the FP32 and BF16
  APIs.

- Added checks to detect AVX512 ISA support to enable rerouting
  based on AOCL_ENABLE_INSTRUCTIONS. This removes the incorrect
  constraint that was put forth.

AMD-Internal: [CPUPL-7020]

Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
2025-07-24 12:20:05 +05:30
V, Varsha
8a86620753 Bug Fix in INT8 reference un-reorder API
- For the int8/uint8 reorder function, the k dimension is made a multiple
 of 4 to meet the alignment requirements.
 - Modified the logic so that k_updated is rounded up to a multiple of 4.
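Rounding k up to the next multiple of 4 is a one-line bit trick; a sketch under an assumed helper name (not the library's actual function):

```c
#include <assert.h>

/* Hypothetical helper: round k up to the next multiple of 4, matching the
 * int8 reorder alignment requirement described above. */
static int round_up_to_multiple_of_4(int k)
{
    return (k + 3) & ~3;   /* bias by 3, then clear the two low bits */
}
```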

[AMD-Internal: SWLCSG-3686]
2025-07-24 11:26:49 +05:30
V, Varsha
9e8c9e2764 Fixed compiler warnings in LPGEMM
- Passed the correct variables to batch_gemm_thread_decorator() for the
 u8s8s32os32 API.
 - Removed commented lines in f32 GEMV_M kernels.
 - Modified some instructions in the F32 GEMV M and N kernels to reuse the existing macros.
 - Re-aligned the BIAS macro in the macro definition file.

[ AMD - Internal : CPUPL - 7013 ]
2025-07-18 16:15:52 +05:30
V, Varsha
2f54bc1e14 Added F32 reference Unreorder function
- Implemented the unpackb_f32f32f32of32_reference function.
 - Modified const pointer declaration in aocl_reorder_reference() to avoid compiler warnings.

[AMD-Internal: SWLCSG-3618 ]
2025-07-18 14:52:03 +05:30
Bhaskar, Nallani
76c08fe81d Implemented f32 reference reorder function
Implemented and tested the aocl_reorder_f32f32f32of32_reference() function.

Implemented the framework changes required, and a placeholder for the kernels of the aocl_unreorder_f32f32f32of32_reference() function. It is not tested completely and will be taken care of in subsequent commits.

[AMD-Internal: SWLCSG-3618 ]
2025-07-15 12:26:05 +05:30
V, Varsha
837d3974d4 Bug Fixes for GEMV AVX2 BF16 to F32 path
- Added the correct strides to be used while unreordering/converting the B matrix in m=1 cases.
 - Modified zero-point vector loads to use the proper instructions.
 - Modified the bf16 store in the AVX2 GEMV M kernel.

AMD Internal - [SWLCSG - 3602 ]
2025-07-10 16:23:46 +05:30
V, Varsha
98901847f1 Enabled GEMV path for BF16 GEMV operations on non-BF16 supporting machines
- Added a new GEMV_AVX2 5-loop for handling BF16 inputs for the n = 1 and m = 1 conditions.
 - Modified the reorder and unreorder functions to cater to default n=1 reorder conditions.
 - Added bf16 beta and store support in the F32 GEMV N AVX2 and 256_512 kernels.
 - Added bf16 beta support for the F32 GEMV M kernels, and modified bf16 store conditions for
   the GEMV M kernels.
 - Modified the n=1 reorder guards for the reference bf16 reorder API.
 - Added an additional path in the unreorder case for handling n=1 vector conversion.

AMD-Internal: [ SWLCSG - 3602 ]
2025-07-09 19:45:40 +05:30
V, Varsha
1f9d1a85d3 Updated aocl_batch_gemm_ APIs aligning to CBLAS batch API. (#58)
* Updated aocl_batch_gemm_ APIs aligning to CBLAS batch API.

 - Modified Batch-Gemm API to align with cblas_?gemm_batch_ API,
 and added a parameter group_size to the existing APIs.
 - Updated bench batch_gemm code to align to the new API definition.
 - Modified the hardcoded number in lpgemm_postop file.
 - Added necessary early return condition to account for group_count/group_size < 0.

AMD-Internal: [ SWLCSG - 3592 ]
2025-06-30 11:16:04 +05:30
Vankadari, Meghana
c81408c805 Modified reorder and pack code in sym quant API (#59)
Details:
- In s8 APIs with symmetric quantization, existing kernels are
  reused to avoid duplication of reorder code.
- Since the existing kernels are designed assuming that the entire
  KCxNC block is packed at once, to handle grouping in symmetric
  quantization we have to add JR and group loops outside the
  call to the existing packB function.
- Though this was being done before, the cases where n_rem < 64
  were not handled properly.
- Modified the reorder and pack code to first divide the n_fringe part
  into a multiples-of-16 part and an n_lt_16 part, and then call the
  pack kernel twice to handle both parts separately.
- All the strides used to access the reordered/packed buffer are updated
  accordingly.
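The split of the n-fringe into a multiples-of-16 part and an n_lt_16 part can be sketched as below; the function name is illustrative, not the actual pack code:

```c
#include <assert.h>

/* Illustrative sketch: split the fringe width n_rem (< 64) into a
 * multiples-of-16 part, handled by the regular pack kernel, and a < 16
 * remainder, handled by a second (partial) pack-kernel call. */
static void split_n_fringe(int n_rem, int *n_mult16, int *n_lt16)
{
    *n_mult16 = n_rem & ~15;  /* largest multiple of 16 <= n_rem */
    *n_lt16   = n_rem & 15;   /* leftover columns, 0..15 */
}
```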
2025-06-24 11:36:35 +05:30
Vankadari, Meghana
26e5c63781 Disabled default packing of matrices in batch_gemm of FP32 (#55)
AMD-Internal: SWLCSG-3527
2025-06-17 10:53:05 +05:30
Vankadari, Meghana
8649cdc14b Removed unnecessary pack checks in FP32 GEMV (#54)
Details:
- In FP32 GEMM, when threading is disabled, rntm_pack_a and rntm_pack_b
  were set to true by default. This leads to a perf regression for smaller
  sizes. Modified the FP32 interface API to not overwrite the packA and
  packB variables in the rntm structure.
- In FP32 GEMV, removed the decision-making code based on mtag_A/B
  and should_pack_A/B for packing. Matrices will be packed only
  if their storage format doesn't match the storage
  format required by the kernel.
- Changed the control flow to check whether the mtag value is
  "reordered", "to-be-packed", or "unpacked", checking
  for "reordered" first, followed by "pack". This ensures that
  packing doesn't happen when the matrix is already reordered, even
  if the user forces packing by setting "BLIS_PACK_A/B".
- Modified the Python script to generate test cases based on block sizes.
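The check order in the bullets above can be sketched with a hypothetical enum and function; the real code operates on BLIS rntm/mtag state rather than this toy:

```c
#include <assert.h>

typedef enum { MTAG_UNPACKED, MTAG_PACK, MTAG_REORDERED } mtag_t;

/* Check "reordered" before "pack": a user-forced BLIS_PACK_A/B must not
 * cause packing of a matrix that is already in the reordered layout. */
static mtag_t resolve_mtag(mtag_t mtag, int user_forces_pack)
{
    if (mtag == MTAG_REORDERED)
        return MTAG_REORDERED;      /* already in kernel-friendly layout */
    if (mtag == MTAG_PACK || user_forces_pack)
        return MTAG_PACK;
    return MTAG_UNPACKED;
}
```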

AMD-Internal: SWLCSG-3527
2025-06-16 12:34:11 +05:30
Balasubramanian, Vignesh
1847a1e8c6 Bugfix : Segmentation fault at the topology detection layer (#51)
- The current implementation of the topology detector establishes
  a contingency wherein it is expected that the parallel region
  uses all the threads queried through omp_get_max_threads(). In
  case the actual parallelism in the function is limited (lower than
  this expectation), the code may access unallocated memory
  (through uninitialized pointers).

- This was because every thread (having its own pointer) sets its
  pointer's initial value to NULL inside the parallel section, leaving
  some pointers uninitialized if the associated thread is not spawned.

- Also, the current implementation would use negative indexing (with -1)
  if any associated thread was not spawned (when using the core-group ID).

- Fix: Set every thread-specific pointer to NULL outside the parallel
  region, using calloc(). As long as we have NULL checks for pointers
  before accessing through them, no issues will be observed. Also, avoid
  incurring the topology detection cost if all the required threads
  are not spawned (thereby avoiding the potential negative indexing).
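The fix pattern, allocating the per-thread pointer array zero-initialized outside the parallel region, looks roughly like this; names are illustrative:

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative sketch: calloc() zero-initializes, so the slot of any thread
 * that is never spawned stays NULL and can be skipped via a NULL check,
 * instead of relying on each thread to NULL its own slot inside the
 * parallel region. */
static void **alloc_thread_slots(int max_threads)
{
    return calloc((size_t)max_threads, sizeof(void *));
}
```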

AMD-Internal: [SWLCSG-3573]

Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
Co-authored-by: Bhaskar, Nallani <Nallani.Bhaskar@amd.com>
2025-06-14 21:55:02 +05:30
Vankadari, Meghana
8968973c2d Performance fix for FP32 GEMV (#47)
Details:
- In the FP32 GEMM interface, mtag_b was being set to PACK by default.
  This leads to packing of the B matrix even when packing is not
  absolutely required, causing a perf regression.
- Set mtag_b to PACK only if it is absolutely necessary to pack the B matrix,
  and modified the check conditions before packing appropriately.

AMD-Internal - [SWLCSG-3575]
2025-06-10 14:54:01 +05:30
V, Varsha
875375a362 Bug Fixes in FP32 Kernels: (#41)
* Bug Fixes in FP32 Kernels:

 - The current implementation lets m=1 tiny cases inside LPGEMV_TINY loop,
 but doesn't have the call to GEMV_M_ONE kernels. Added the m=1 path in
 LPGEMV_TINY loop by handling the pack A/Pack B/reorder B conditions.

- Added BF16 support for BIAS, Matrix-Add and Matrix-Mul for AVX512 F32
 main and GEMV kernels.

- Added BF16 Downscale, BIAS, Matrix-Add and Matrix-Mul support in AVX2 GEMV_N
 and AVX512_256 GEMV kernels.

- Added BF16 Matrix-Add and Matrix-Mul support for AVX512_256 F32 kernels.

- Modified the condition check in FP32 Zero point in AVX512 kernels, and
 fixed few bugs in Col-major Zero point evaluation and instruction usage.

AMD Internal: [ CPUPL - 6748 ]

---------

Co-authored-by: VarshaV <varshav2@amd.com>
2025-06-06 17:48:50 +05:30
Vankadari, Meghana
37efbd284e Added 6x16 and 6xlt16 main kernels for f32 using AVX512 instructions (#38)
* Implemented 6xlt8 AVX2 kernel for n<8 inputs

* Implemented fringe kernels for 6x16 and 6xlt16 AVX512 kernels for FP32

* Implemented m-fringe kernels for 6xlt8 kernel for AVX2

* Added the deleted kernels and fixed bias bug

AMD-Internal: SWLCSG-3556
2025-06-05 15:17:02 +05:30
V, Varsha
532eab12d3 Bug Fixes in LPGEMM for AVX512(SkyLake) machine (#24)
* Bug Fixes in LPGEMM for AVX512 (SkyLake) machines

 - Support added in FP32 512_256 kernels for: beta, BIAS, zero-point and
   BF16 store types for bf16bf16f32obf16 API execution in AVX2 mode.

 - The B matrix in the bf16bf16f32obf16/f32 API is reordered. For machines that
  don't support BF16 instructions, the BF16 input is un-reordered and
  converted to FP32 type to use the FP32 kernels.

 - For n = 1 and k = 1 sized matrices, reordering in BF16 copies the
  matrix to the reordered buffer array. But un-reordering to FP32
  requires the matrix size to be a multiple of 16 along the n dimension and
  a multiple of 2 along the k dimension. The entry condition here has been
  modified for the AVX512 configuration.

 - Fix for a seg fault in AOCL_ENABLE_INSTRUCTIONS=AVX2 mode on BF16/VNNI
   ISA supporting configurations:
   - The BF16 tiny path entry check has been modified to take arch_id into
     account, to prevent improper entry into the tiny kernel.
   - The store in BF16->FP32 col-major for m = 1 conditions was updated to
     the correct storage pattern.
   - The BF16 beta load macro was modified to account for data in unaligned memory.

 - Modified existing store instructions in FP32 AVX512 kernels to support
  execution on machines that have AVX512 support but not BF16/VNNI (SkyLake).

AMD Internal: [SWLCSG-3552]

---------

Co-authored-by: VarshaV <varshav2@amd.com>
2025-05-30 17:22:49 +05:30
Arnav Sharma
62d4fcb398 Bugfix: Group Size Validation for s8s8s32o32_sym_quant
- Fixed the group size validation logic to correctly check if the
  group_size is a multiple of 4.

- Previously the condition was incorrectly performing bitwise AND with
  decimal 11 instead of binary 11 (decimal 3).
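The before/after of this check can be shown in two standalone predicates; both are illustrations, not the addon's actual code:

```c
#include <assert.h>

/* Buggy check: `& 11` is decimal 11 (binary 1011), which also tests bit 3,
 * so some genuine multiples of 4 (e.g. 12) were rejected. */
static int group_size_valid_buggy(int group_size)
{
    return (group_size & 11) == 0;
}

/* Fixed check: a multiple of 4 has its two low bits clear
 * (binary 11 == decimal 3). */
static int group_size_valid(int group_size)
{
    return (group_size & 3) == 0;
}
```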

AMD-Internal: [CPUPL-6754]
2025-05-30 11:53:23 +05:30
Bhaskar, Nallani
42a0d74ced Fixed configuration issues in AOCL_GEMM addon (#4)
* Fixed configuration issues in AOCL_GEMM addon

Description:

Fixed aocl_gemm addon initialization of kernels and block sizes
for machines which support only AVX512 but not
AVX512_VNNI/VNNI_BF16.

Aligned NC, KC blocking variables between ZEN and ZEN4

AMD-Internal: [SWLCSG-3527]
2025-05-13 17:19:19 +05:30
Negi, Deepak
121d81df16 Implemented GEMV kernel for m=1 case. (#5)
* Implemented GEMV kernel for m=1 case.

Description:

- Added a new GEMV kernel for AVX2 where m=1.
- Added a new GEMV kernel for AVX512 with ymm registers where m=1.
2025-05-13 16:33:04 +05:30
Meghana Vankadari
8557e2f7b9 Implemented GEMV for n=1 case using 32 YMM registers
Details:
- This implementation is picked from cntx when GEMM is invoked on
  machines that support AVX512 instructions by forcing the
  AVX2 path using AOCL_ENABLE_INSTRUCTIONS=AVX2 during run-time.
- This implementation uses MR=16 for GEMV.

AMD-Internal: [SWLCSG-3519]
Change-Id: I8598ce6b05c3d5a96c764d96089171570fbb9e1a
2025-05-05 05:31:13 -04:00
Meghana Vankadari
21aa63eca1 Implemented AVX2 based GEMV for n=1 case.
- Added a new GEMV kernel with MR = 8 which will be used
  for cases where n=1.
- Modified the GEMM and GEMV framework to choose the right GEMV kernel
  based on compile-time and run-time architecture parameters. This
  had to be done since GEMV kernels are not stored-in/retrieved-from
  the cntx.
- Added a pack kernel that packs A matrix from col-major to row-major
  using AVX2 instructions.

AMD-Internal: [SWLCSG-3519]
Change-Id: Ibf7a8121d0bde37660eac58a160c5b9c9ebd2b5c
2025-05-05 08:56:22 +00:00
Meghana Vankadari
4745cf876e Implemented a new set of kernels for f32 using 32 YMM regs
Details:
- These kernels are picked from cntx when GEMM is invoked
  on machines that support AVX512 instructions by forcing the
  AVX2 path using AOCL_ENABLE_INSTRUCTIONS=AVX2 during run-time.
- This path uses the same blocksizes and pack kernels as AVX512
  path.
- GEMV is disabled currently as AVX2 kernels for GEMV are not
  implemented.

AMD-Internal: [SWLCSG-3519]
Change-Id: I75401fac48478fe99edb8e71fa44d36dd7513ae5
2025-04-23 12:02:01 +00:00
Deepak Negi
48c7452b08 Beta and Downscale support for F32 AVX-512 kernels
Description
- To enable AVX512 VNNI support without native BF16 in BF16 kernels, the
  BF16 C_type is converted to F32 for computation and then cast back to
  BF16 before storing the result.
- Added support for handling BF16 zero-point values of BF16 type.
- Added a condition to disable the tiny path for the BF16 code path
  where native BF16 is not supported.

AMD Internal : [CPUPL-6627]

Change-Id: I1e0cfefd24c5ffbcc95db73e7f5784a957c79ab9
2025-04-23 06:12:14 -05:00
Arnav Sharma
8b0593f88d Optimizations and Improved Support for FP32 RD Kernels
- Updated the decision logic for taking the RD path for FP32.

- Since the 5-loop was designed specifically for RV kernels, added a
  boolean flag to specify when RD path is to be taken, and set ps_b_use
  to cs_b_use in case B matrix is unpacked.

AMD-Internal: [SWLCSG-3497]
Change-Id: I94ed28304a71b759796edcdd4edf65b9bad22bea
2025-04-23 12:26:51 +05:30
Arnav Sharma
87c9230cac Bugfix: Disable A Packing for FP32 RD kernels and Post-Ops Fix
- For a single-threaded configuration of BLIS, packing of the A and B matrices
  is enabled by default. But packing of A is only supported for RV
  kernels, where elements from matrix A are broadcast. Since
  elements are loaded (not broadcast) in RD kernels, packing of A results in
  failures. Hence, disabled packing of matrix A for RD kernels.

- Fixed the issue where the c_i index pointer was incorrectly being reset
  when exceeding the MC block, resulting in failures for certain
  post-ops.

- Fixed the FP32 reorder case where, for the n == 1 and rs_b == 1 condition, it
  was incorrectly using sizeof(BLIS_FLOAT) instead of sizeof(float).

AMD-Internal: [SWLCSG-3497]
Change-Id: I6d18afa996c253d79f666ea9789270bb59b629dd
2025-04-18 14:31:03 +05:30
Meghana Vankadari
1ff96343f1 Fixed Early return checks in reorder function for f32 & int8 APIs.
Details:
- In the reorder functions, the validity of the strides is checked assuming
  that the matrix to be reordered is always row-major. Modified the code
  to take stor_order into consideration while checking the validity of the
  strides.
- This does not directly impact the functionality of GEMM as we don't
  support GEMM on col-major matrices where A and/or B matrices are
  reordered before GEMM computation. But this change makes sense when
  reordering is viewed as an independent functionality irrespective of
  what the reordered buffers will be used for.

Change-Id: If2cc4a353bca2f998ad557d6f128881bc9963330
2025-04-15 09:45:48 +00:00
Arnav Sharma
267aae80ea Added Post-Ops Support for F32 RD Kernels
- Support for Post-Ops has been added for all F32 RD AVX512 and AVX2
  kernels.

AMD-Internal: [SWLCSG-3497]
Change-Id: Ia2967417303d8278c547957878d93c42c887109e
2025-04-11 05:25:30 -04:00
Arnav Sharma
c68c258fad Added AVX512 and AVX2 FP32 RD Kernels
- Added FP32 RD (dot-product) kernels for both the AVX512 and AVX2 ISAs.
- The FP32 AVX512 primary RD kernel has a blocking of dimensions 6x64
  (MRxNR), whereas it is 6x16 (MRxNR) for the AVX2 primary RD kernel.
- Updated the f32 framework to accommodate RD kernels in case of B transpose,
  with thresholds.
- Updated the data-gen Python script.
TODO:
    - Post-Ops not yet supported.

Change-Id: Ibf282741f58a1446321273d5b8044db993f23714
2025-04-05 20:16:51 -05:00
varshav
81d219e3f8 Added destination scale type check in INT8 APIs
- Updated the S8 main, GEMV, m_, n_ and mn_ fringe kernels to support
   multiple scale types for vector and scalar scales

 - Updated the U8 main, GEMV, m_, n_, extMR_ and mn_ fringe kernels to
   support multiple scale types for vector and scalar scales

 - Updated the bench to accommodate multiple scale type input, and
   modified the downscale_accuracy_check_ to verify with multiple scale
   type inputs.

AMD Internal: [ SWLCSG-3304 ]

Change-Id: I7b9f3ec8ea830d3265f72d18a0aa36086e14a86e
2025-03-28 00:51:17 -05:00
Arnav Sharma
6d1afeae95 Column-Major Support Added for F32 Tiny Path
- Updated the F32 tiny path to support column-major inputs.
- Tuned the tiny-path thresholds to redirect additional inputs to the
  tiny path based on the m*n*k value.

AMD-Internal: [SWLCSG-3380]
Change-Id: If3476b17cc5eaf4f4e1cf820af0a32ede3e1742e
2025-03-13 05:54:50 -04:00