amd/blis - blis - Public git mirror

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-06-29 02:37:05 +00:00

Author	SHA1	Message	Date
V, Varsha	8fd7060b2f	Matrix Add and Matrix Mul Post-op addition in F32 AVX512_256 kernels (#50 ) Added Matrix-mul and Matrix-add postops in FP32 AVX512_256 GEMV kernels - Matrix-add and Matrix-mul post ops in FP32 AVX512_256 GEMV m = 1 and n = 1 kernels has been added. Co-authored-by: VarshaV <varshav2@amd.com>	2025-06-17 16:17:13 +05:30
S, Hari Govind	e097346658	Implemented Multithreading Support and Optimization of DGEMV API (#10 ) - Implemented multithreading framework for the DGEMV API on Zen architectures. Architecture specific AOCL-dynamic logic determines the optimal number of threads for improved performance. - The condition check for the value of beta is optimized by utilizing masked operations. The mask value is set based on value of beta, and the masked operations are applied when the vector y is loaded or scaled with beta. AMD-Internal: [CPUPL-6746]	2025-06-17 12:39:48 +05:30
V, Varsha	875375a362	Bug Fixes in FP32 Kernels: (#41 ) * Bug Fixes in FP32 Kernels: - The current implementation lets m=1 tiny cases inside LPGEMV_TINY loop, but the m=1 GEMV kernel call doesn't have the call to GEMV_M_ONE kernels. Added the m=1 path in LPGEMV_TINY loop by handling the pack A/Pack B/reorder B conditions. - Added BF16 support for BIAS, Matrix-Add and Matrix-Mul for AVX512 F32 main and GEMV kernels - Added BF16 Matrix-Add and Matrix-Mul support for AVX512_256 F32 kernels. - Modified the condition check in FP32 Zero point in AVX512 kernels, and fixed few bugs in Col-major Zero point evaluation. AMD Internal: [ CPUPL - 6748 ] * Bug Fixes in FP32 Kernels: - The current implementation lets m=1 tiny cases inside LPGEMV_TINY loop, but doesn't have the call to GEMV_M_ONE kernels. Added the m=1 path in LPGEMV_TINY loop by handling the pack A/Pack B/reorder B conditions. - Added BF16 support for BIAS, Matrix-Add and Matrix-Mul for AVX512 F32 main and GEMV kernels. - Added BF16 Downscale, BIAS, Matrix-Add and Matrix-Mul support in AVX2 GEMV_N and AVX512_256 GEMV kernels. - Added BF16 Matrix-Add and Matrix-Mul support for AVX512_256 F32 kernels. - Modified the condition check in FP32 Zero point in AVX512 kernels, and fixed few bugs in Col-major Zero point evaluation and instruction usage. AMD Internal: [ CPUPL - 6748 ] * Bug Fixes in FP32 Kernels: - The current implementation lets m=1 tiny cases inside LPGEMV_TINY loop, but doesn't have the call to GEMV_M_ONE kernels. Added the m=1 path in LPGEMV_TINY loop by handling the pack A/Pack B/reorder B conditions. - Added BF16 support for BIAS, Matrix-Add and Matrix-Mul for AVX512 F32 main and GEMV kernels. - Added BF16 Downscale, BIAS, Matrix-Add and Matrix-Mul support in AVX2 GEMV_N and AVX512_256 GEMV kernels. - Added BF16 Matrix-Add and Matrix-Mul support for AVX512_256 F32 kernels. - Modified the condition check in FP32 Zero point in AVX512 kernels, and fixed few bugs in Col-major Zero point evaluation and instruction usage. AMD Internal: [ CPUPL - 6748 ] * Bug Fixes in FP32 Kernels: - The current implementation lets m=1 tiny cases inside LPGEMV_TINY loop, but doesn't have the call to GEMV_M_ONE kernels. Added the m=1 path in LPGEMV_TINY loop by handling the pack A/Pack B/reorder B conditions. - Added BF16 support for BIAS, Matrix-Add and Matrix-Mul for AVX512 F32 main and GEMV kernels. - Added BF16 Downscale, BIAS, Matrix-Add and Matrix-Mul support in AVX2 GEMV_N and AVX512_256 GEMV kernels. - Added BF16 Matrix-Add and Matrix-Mul support for AVX512_256 F32 kernels. - Modified the condition check in FP32 Zero point in AVX512 kernels, and fixed few bugs in Col-major Zero point evaluation and instruction usage. AMD Internal: [ CPUPL - 6748 ] --------- Co-authored-by: VarshaV <varshav2@amd.com>	2025-06-06 17:48:50 +05:30
Vankadari, Meghana	9e9441db47	Fix for n_fringe in AVX512 FP32 6x64 kernel (#42 ) Details: - Fixed the problem decomposition for n-fringe case of 6x64 AVX512 FP32 kernel by updating the pointers correctly after each fringe kernel call. - AMD-Internal: SWLCSG-3556	2025-06-06 11:33:25 +05:30
Vankadari, Meghana	37efbd284e	Added 6x16 and 6xlt16 main kernels for f32 using AVX512 instructions (#38 ) * Implemented 6xlt8 AVX2 kernel for n<8 inputs * Implemented fringe kernels for 6x16 and 6xlt16 AVX512 kernels for FP32 * Implemented m-fringe kernels for 6xlt8 kernel for AVX2 * Implemented m-fringe kernels for 6xlt8 kernel for AVX2 * Added the deleted kernels and fixed bias bug AMD-Internal: SWLCSG-3556	2025-06-05 15:17:02 +05:30
Dave, Harsh	3c8b7895f7	Fixed functionality failure of DGEMM pack kernel. (#31 ) * Fixed functionality failure of DGEMM pack kernel. - Corrected the mask preparation needed for load/store in edge kernel where m = 18. - Corrected the usage of right vector registers while storing data back to buffer in edge kernels. AMD-Internal: [CPUPL-6773] * Fixed functionality failure of DGEMM pack kernel. - Corrected the mask preparation needed for load/store in edge kernel where m = 18. - Corrected the usage of right vector registers while storing data back to buffer in edge kernels. AMD-Internal: [CPUPL-6773] * Update bli_packm_zen4_asm_d24xk.c --------- Co-authored-by: Harsh Dave <harsdave@amd.com>	2025-06-03 17:33:16 +05:30
V, Varsha	532eab12d3	Bug Fixes in LPGEMM for AVX512(SkyLake) machine (#24 ) * Bug Fixes in LPGEMM for AVX512(SkyLake) machine - B-matrix in bf16bf16f32obf16/f32 API is re-ordered. For machines that doesn't support BF16 instructions, the BF16 input is unre-ordered and converted to FP32 to use FP32 kernels. - For n = 1 and k = 1 sized matrices, re-ordering in BF16 is copying the matrix to the re-ordered buffer array. But the un-reordering to FP32 requires the matrix to have size multiple of 16 along n and multiple of 2 along k dimension. - The entry condition to the above has been modified for AVX512 configuration. - In bf16 API, the tiny path entry check has been modified to prevent seg fault while AOCL_ENABLE_INSTRUCTIONS=AVX2 is set in BF16 supporting machines. - Modified existing store instructions in FP32 AVX512 kernels to support execution in machines that has AVX512 support but not BF16/VNNI(SkyLake). - Added Bf16 beta and store types in FP32 avx512_256 kernels AMD Internal: [SWLCSG-3552] * Bug Fixes in LPGEMM for AVX512(SkyLake) machine - B-matrix in bf16bf16f32obf16/f32 API is re-ordered. For machines that doesn't support BF16 instructions, the BF16 input is unre-ordered and converted to FP32 to use FP32 kernels. - For n = 1 and k = 1 sized matrices, re-ordering in BF16 is copying the matrix to the re-ordered buffer array. But the un-reordering to FP32 requires the matrix to have size multiple of 16 along n and multiple of 2 along k dimension. - The entry condition to the above has been modified for AVX512 configuration. - In bf16 API, the tiny path entry check has been modified to prevent seg fault while AOCL_ENABLE_INSTRUCTIONS=AVX2 is set in BF16 supporting machines. - Modified existing store instructions in FP32 AVX512 kernels to support execution in machines that has AVX512 support but not BF16/VNNI(SkyLake). - Added Bf16 beta and store types, along with BIAS and ZP in FP32 avx512_256 kernels AMD Internal: [SWLCSG-3552] * Bug Fixes in LPGEMM for AVX512(SkyLake) machine - Support added in FP32 512_256 kerenls for : Beta, BIAS, Zero-point and BF16 store types for bf16bf16f32obf16 API execution in AVX2 mode. - B-matrix in bf16bf16f32obf16/f32 API is re-ordered. For machines that doesn't support BF16 instructions, the BF16 input is unre-ordered and converted to FP32 type to use FP32 kernels. - For n = 1 and k = 1 sized matrices, re-ordering in BF16 is copying the matrix to the re-ordered buffer array. But the un-reordering to FP32 requires the matrix to have size multiple of 16 along n and multiple of 2 along k dimension. The entry condition here has been modified for AVX512 configuration. - Fix for seg fault with AOCL_ENABLE_INSTRUCTIONS=AVX2 mode in BF16/VNNI ISA supporting configruations: - BF16 tiny path entry check has been modified to take into account arch_id to ensure improper entry into the tiny kernel. - The store in BF16->FP32 col-major for m = 1 conditions were updated to correct storage pattern, - BF16 beta load macro was modified to account for data in unaligned memory. - Modified existing store instructions in FP32 AVX512 kernels to support execution in machines that has AVX512 support but not BF16/VNNI(SkyLake) AMD Internal: [SWLCSG-3552] --------- Co-authored-by: VarshaV <varshav2@amd.com>	2025-05-30 17:22:49 +05:30
Negi, Deepak	ffd7c5c3e0	Postop support for Static Quant and Integer APIs (#20 ) Support for S32 Zero point type is added for aocl_gemm_s8s8s32os32_sym_quant Support for BF16 scale factors type is added for aocl_gemm_s8s8s32os32_sym_quant U8 buffer type support is added for matadd, matmul, bias post-ops in all int8 APIs. AMD-Internal: SWLCSG-3503	2025-05-27 16:29:32 +05:30
Negi, Deepak	121d81df16	Implemented GEMV kernel for m=1 case. (#5 ) * Implemented GEMV kernel for m=1 case. Description: - Added a new GEMV kernel for AVX2 where m=1. - Added a new GEMV kernel for AVX512 with ymm registers where m=1.	2025-05-13 16:33:04 +05:30
harsdave	cd83fc38b5	Add packing support M edge cases in DGEMM 24xk pack kernel Previously, the DGEMM implementation used `dscalv` for cases where the M dimension of matrix A is not in multiple of 24, resulting in a ~40% performance drop. This commit introduces a specialized edge cases in pack kernel to optimize performance for these cases. The new packing support significantly improves the performance. - Removed reliance on `dscalv` for edge cases, addressing the performance bottleneck. AMD-Internal: [CPUPL-6677] Change-Id: I150d13eb536d84f8eb439d7f4a77a04a0d0e6d60	2025-05-06 09:22:49 +05:30
Meghana Vankadari	8557e2f7b9	Implemented GEMV for n=1 case using 32 YMM registers Details: - This implementation is picked form cntx when GEMM is invoked on machines that support AVX512 instructions by forcing the AVX2 path using AOCL_ENABLE_INSTRUCTIONS=AVX2 during run-time. - This implementation uses MR=16 for GEMV. AMD-Internal: [SWLCSG-3519] Change-Id: I8598ce6b05c3d5a96c764d96089171570fbb9e1a	2025-05-05 05:31:13 -04:00
Meghana Vankadari	21aa63eca1	Implemented AVX2 based GEMV for n=1 case. - Added a new GEMV kernel with MR = 8 which will be used for cases where n=1. - Modified GEMM and GEMV framework to choose right GEMV kernel based on compile-time and run-time architecture parameters. This had to be done since GEMV kernels are not stored-in/retrieved-from the cntx. - Added a pack kernel that packs A matrix from col-major to row-major using AVX2 instructions. AMD-Internal: [SWLCSG-3519] Change-Id: Ibf7a8121d0bde37660eac58a160c5b9c9ebd2b5c	2025-05-05 08:56:22 +00:00
Chandrashekara K R	b06c6f921b	CMake: compiler flags updated for lpgemm kernels under zen folder. The "-mno-avx512f" compiler flag has been added for zen/lpgemm source files to address an issue observed with the znver4 compiler flag when using GCC installed through Spack. The error message "unsupported instruction `vpcmpeqd'" was encountered, indicating unsupported AVX-512F instructions. As a workaround, the "-mno-avx512f" flag was introduced, ensuring that AVX-512F instructions are disabled during compilation. AMD-Internal: [CPUPL-6694] Change-Id: I546475226fbfea4931d568fc1b928cf6c8699b61	2025-04-30 06:09:36 -04:00
Hari Govind S	29f30c7863	Optimisation for DCOPY API - Introducted new assembly kernel that copies data from source to destination from the front and back of the vector at the same time. This kernel provides better performance for larger input sizes. - Added a wrapper function responsible for selecting the kernel used by DCOPYV API to handle the given input for zen5 architecture. - Updated AOCL-dynamic threshold for DCOPYV API in zen4 and zen5 architectures. - New unit-tests were included in the grestsuite for the new kernel. AMD-Internal: [CPUPL-6650] Change-Id: Ie2af88b8e97196b6aa02c089e59247742002f568	2025-04-28 05:58:21 -04:00
Meghana Vankadari	4745cf876e	Implemented a new set of kernels for f32 using 32 YMM regs Details: - These kernels are picked from cntx when GEMM is invoked on machines that support AVX512 instructions by forcing the AVX2 path using AOCL_ENABLE_INSTRUCTIONS=AVX2 during run-time. - This path uses the same blocksizes and pack kernels as AVX512 path. - GEMV is disabled currently as AVX2 kernels for GEMV are not implemented. AMD-Internal: [SWLCSG-3519] Change-Id: I75401fac48478fe99edb8e71fa44d36dd7513ae5	2025-04-23 12:02:01 +00:00
Deepak Negi	48c7452b08	Beta and Downscale support for F32 AVX-512 kernels Description - To enable AVX512 VNNI support without native BF16 in BF16 kernels, the BF16 C_type is converted to F32 for computation and then cast back to BF16 before storing the result. - Added support for handling BF16 zero-point values of BF16 type. - Added a condition to disable the tiny path for the BF16 code path where native BF16 is not supported. AMD Internal : [CPUPL-6627] Change-Id: I1e0cfefd24c5ffbcc95db73e7f5784a957c79ab9	2025-04-23 06:12:14 -05:00
Vignesh Balasubramanian	b4b0887ca4	Additional optimizations to ZGEMM SUP and Tiny codepaths(ZEN4 and ZEN5) - Added a set of AVX512 fringe kernels(using masked loads and stores) in order to avoid rerouting to the GEMV typed API interface(when m = 1). This ensures uniformity in performance across the main and fringe cases, when the calls are multithreaded. - Further tuned the thresholds to decide between ZGEMM Tiny, Small SUP and Native paths for ZEN4 and ZEN5 architectures(in case of parallel execution). This would account for additional combinations of the input dimensions. - Moved the call to Tiny-ZGEMM before the BLIS object creation, since this code-path operates on raw buffers. - Added the necessary test-cases for functional and memory testing of the newly added kernels. AMD-Internal: [CPUPL-6378][CPUPL-6661] Change-Id: I9af73d1b6ef82b26503d4fc373111132aee3afd6	2025-04-23 00:56:58 -04:00
Arnav Sharma	87c9230cac	Bugfix: Disable A Packing for FP32 RD kernels and Post-Ops Fix - For single-threaded configuration of BLIS, packing of A and B matrices are enabled by default. But, packing of A is only supported for RV kernels where elements from matrix A are being broadcasted. Since elements are being loaded in RD kernels, packing of A results in failures. Hence, disabled packing of matrix A for RD kernels. - Fixed the issue where c_i index pointer was incorrectly being reset when exceeding MC block thus, resulting in failures for certain Post-Ops. - Fixed the FP32 reoder case were for n == 1 and rs_b == 1 condition, it was incorrectly using sizeof(BLIS_FLOAT) instead of sizeof(float). AMD-Internal: [SWLCSG-3497] Change-Id: I6d18afa996c253d79f666ea9789270bb59b629dd	2025-04-18 14:31:03 +05:30
Deepak Negi	f76f37cc11	Bug Fix in F32 eltwise Api with post ops(clip, swish, relu_scale). Description 1. In the cases of clip, swish, and relu_scale, constants are currently loaded as float. However, they are of C type, so handling has been adjusted, for integer these constants are first loaded as integer and then converted to float. Change-Id: I176b805b69679df42be5745b6306f75e23de274d	2025-04-11 08:14:34 -04:00
varshav2	b9998a1d7f	Added mutiple ZP type checks in INT8 APIs - Currently the int8/uint8 APIs do not support multiple ZP types, but works only with int8 type or uint8 type. - The support is added to enable multiple zp types in these kernels and added additional macros to support the operations. - Modified the bench downscale reference code to support the updated types. AMD-Internal : [ SWLCSG-3304 ] Change-Id: Ia5e40ee3705a38d09262086d20731e8f0a126987	2025-04-11 07:25:10 -04:00
Arnav Sharma	267aae80ea	Added Post-Ops Support for F32 RD Kernels - Support for Post-Ops has been added for all F32 RD AVX512 and AVX2 kernels. AMD-Internal: [SWLCSG-3497] Change-Id: Ia2967417303d8278c547957878d93c42c887109e	2025-04-11 05:25:30 -04:00
Arnav Sharma	c68c258fad	Added AVX512 and AVX2 FP32 RD Kernels - Added FP32 RD (dot-product) kernels for both, AVX512 and AVX2 ISAs. - The FP32 AVX512 primary RD kernel has blocking of dimensions 6x64 (MRxNR) whereas it is 6x16 (MRxNR) for the AVX2 primary RD kernel. - Updatd f32 framework to accomodate rd kernels in case of B trans with thresholds - Updated data gen python script TODO: - Post-Ops not yet supported. Change-Id: Ibf282741f58a1446321273d5b8044db993f23714	2025-04-05 20:16:51 -05:00
varshav	81d219e3f8	Added destination scale type check in INT8 API's - Updated the S8 main, GEMV, m_, n_ and mn_ fringe kernels to support multiple scale types for vector and scalar scales - Updated the U8 main, GEMV, m_, n_, extMR_ and mn_ fringe kernels to support multiple scale types for vector and scalar scales - Updated the bench to accommodate multiple scale type input, and modified the downscale_accuracy_check_ to verify with multiple scale type inputs. AMD Internal: [ SWLCSG-3304 ] Change-Id: I7b9f3ec8ea830d3265f72d18a0aa36086e14a86e	2025-03-28 00:51:17 -05:00
Deepak Negi	fb4617d7c3	Bug fix in gemv_n kernel of f32 api. Description: 1. For column major case when m=1 there was an accuracy mismatch with post ops(bias, matrix_add, matrix_add). 2. Added check for column major case and replace _mm512_loadu_ps with _mm512_maskz_loadu_ps. AMD-Internal: [CPUPL-6585] Change-Id: I8d98e2cb0b9dd445c9868f4c8af3abbc6c2dfc95	2025-03-12 06:22:47 -05:00
harsh dave	a359a25765	Fix typo in 24x8m DGEMM sup kernel causing incorrect result. - Corrected a typo in dgemm kernel implementation, beta=0 and n_left=6 edge kernel. Thanks to Shubham Sharma<shubham.sharma3@amd.com> for helping with debugging. AMD-Internal: [CPUPL-6443] Change-Id: Ifa1e16ec544b7e85c21651bc23c4c27e86d6730b	2025-03-07 04:43:17 -05:00
Vignesh Balasubramanian	c4b84601da	AVX512 optimizations for CGEMM(rank-1 kernel) - Implemented an AVX512 rank-1 kernel that is expected to handle column-major storage schemes of A, B and C(without transposition) when k = 1. - This kernel is single-threaded, and acts as a direct call from the BLAS layer for its compatible inputs. - Defined custom BLAS and BLIS_IMPLI layers for CGEMM (instead of using the macro definition), in order to integrate the call to this kernel at runtime(based on the corresponding architecture and input constraints). - Added unit-tests for functional and memory testing of the kernel. - Updated the ZEN5 context to include the AVX512 CGEMM SUP kernels, with its cache-blocking parameters. AMD-Internal: [CPUPL-6498] Change-Id: I42a66c424325bd117ceb38970726a05e2896a46b	2025-03-06 20:14:05 +05:30
Vignesh Balasubramanian	07df9f471e	AVX512 optimizations for CGEMM(SUP) - Implemented the following AVX512 SUP column-preferential kernels(m-variant) for CGEMM : Main kernel : 24x4m Fringe kernels : 24x3m, 24x2m, 24x1m, 16x4, 16x3, 16x2, 16x1, 8x4, 8x3, 8x2, 8x1, fx4, fx3, fx2, fx1(where 0<f<8). - Utlized the packing kernel to pack A when handling inputs with CRC storage scheme. This would in turn handle RRC with operation transpose in the framework layer. - Further adding C prefetching to the main kernel, and updated the cache-blocking parameters for ZEN4 and ZEN5 contexts. - Added a set of decision logics to choose between SUP and Native AVX512 code-paths for ZEN4 and ZEN5 architectures. - Updated the testing interface for complex GEMMSUP to accept the kernel dimension(MR) as a parameter, in order to set the appropriate panel stride for functional and memory testing. Also updated the existing instantiators to send their kernel dimensions as a parameter. - Added unit tests for functional and memory testing of these newly added kernels. AMD-Internal: [CPUPL-6498] Change-Id: Ie79d3d0dc7eed7edf30d8d4f74b888135f31d6b4	2025-03-06 06:03:39 -05:00
Hari Govind S	8998839c71	Optimisation of DGEMV Transpose Case for unit stride - Included a new code section to handle input having non-unit strided y vector for dgemv transpose case. Removed the same from the respective kernels to avoid repeated branching caused by condition checks within the 'for' loop. - The condition check for beta is equal to zero in the primary kernels are moved outside the for loop to avoid repeated branching. - The '_mm512_reduce_pd' operations in the primary kernel is replaced by a series of operations to reduce the number of instructions required to reduce the 8 registers. - Changing naming convention for DGEMV transpose kernels. - Modified unit kernel test to avoid y increment for dgemv tranpose kernels during the test. AMD-Internal: [CPUPL-6565] Change-Id: I1ac516d6b8f156ac53ac9f6eb18badd50e152e05	2025-03-06 05:15:58 -05:00
Meghana Vankadari	7243a5d521	Implemented group level static quantization for s8s8s32of32\|bf16 APIs Details: - Group quantization is technique to improve accuracy where scale factors to quantize inputs and weights varies at group level instead of per channel and per tensor level. - Added new bench files to test GEMM with symmetric static quantization. - Added new get_size and reorder functions to account for storing sum of col-values separately per group. - Added new framework, kernels to support the same. - The scalefactors could be of type float or bf16. AMD-Internal:[SWLCSG-3274] Change-Id: I3e69ecd56faa2679a4f084031d35ffb76556230f	2025-02-28 04:44:44 -05:00
Vignesh Balasubramanian	99770558bb	AVX512 optimizations for CGEMM(Native) - Implemented the following AVX512 native computational kernels for CGEMM : Row-preferential : 4x24 Column-preferential : 24x4 - The implementations use a common set of macros, defined in a separate header. This is due to the fact that the implementations differ solely on the matrix chosen for load/broadcast operations. - Added the associated AVX512 based packing kernels, packing 24xk and 4xk panels of input. - Registered the column-preferential kernel(24x4) in ZEN4 and ZEN5 contexts. Further updated the cache-blocking parameters. - Removed redundant BLIS object creation and its contingencies in the native micro-kernel testing interface(for complex types). Added the required unit-tests for memory and functionality checks of the new kernels. AMD-Interal: [CPUPL-6498] Change-Id: I520ff17dba4c2f9bc277bf33ba9ab4384408ffe1	2025-02-28 03:18:24 -05:00
Meghana Vankadari	6c29236166	Bug fixes in bench and pack code for s8 and bf16 datatypes Details: - Fixed the logic to identify an API that has int4 weights in bench files for gemm and batch_gemm. - Eliminated the memcpy instructions used in pack functions of zen4 kernels and replaced them with masked load instruction. This ensures that the load register will be populated with zeroes at locations where mask is set to zero. Change-Id: I8dd1ea7779c8295b7b4adec82069e80c6493155e AMD-Internal:[SWLCSG-3274]	2025-02-28 01:18:11 -05:00
Arnav Sharma	b4c1026ec2	Added Support for General Stride in DGEMV - Updated the bli_dgemv_zen_ref( ... ) kernel to support general stride. - Since the latest dgemv kernels don't support general stride, added checks to invoke bli_dgemv_zen_ref( ... ) when A matrix has a general stride. - Thanks to Vignesh Balasubramanian <vignesh.balasubramanian@amd.com> for finding this issue. AMD-Internal: [CPUPL-6492] Change-Id: Ia987ce7674cb26cb32eea4a6e9bd6623f2027328	2025-02-27 12:47:21 -05:00
Shubham Sharma	e6ca01c1ba	Fixed C prefetch in 8x24 DGEMM kernel - In 8x24 DGEMM kernel, prefetch is always done assuming row major C. - For TRSM, the DGEMM kernel can be called with column major C also. - Current prefetch logic results in suboptimal performance. - Changed C prefetch logic so that correct C is prefetched for both row and column major C. AMD-Internal: [CPUPL-6493] Change-Id: I7c732ceac54d1056159b3749544c5380340aacd2	2025-02-27 12:17:29 -05:00
Mithun Mohan	9906fd7b91	F32 eltwise kernel updates to use masks in scale factor load. -Currently the scale factor is loaded without using mask in downscale, and matrix add/mul ops in the F32 eltwise kernels. This results in out of memory reads when n is not a multiple of NR (64). -The loads are updated to masked loads to fix the same. AMD-Internal: [SWLCSG-3390] Change-Id: Ib2fc555555861800c591344dc28ac0e3f63fd7cb	2025-02-27 08:17:58 -05:00
Deepak Negi	cc321fb95d	Added support for different types of zero-point in f32 eltwise APIs. Description - Zero point support for <s32/s8/bf16/u8> datatype in element-wise postop only f32o<f32/s8/u8/s32/bf16> APIs. AMD-Internal: [SWLCSG-3390] Change-Id: I2fdb308b05c1393013294df7d8a03cdcd7978379	2025-02-26 04:04:13 -05:00
Mithun Mohan	7394aafd1e	New A packing kernels for F32 API in LPGEMM. -New packing kernels for A matrix, both based on AVX512 and AVX2 ISA, for both row and column major storage are added as part of this change. Dependency on haswell A packing kernels are removed by this. -Tiny GEMM thresholds are further tuned for BF16 and F32 APIs. AMD-Internal: [SWLCSG-3380, SWLCSG-3415] Change-Id: I7330defacbacc9d07037ce1baf4a441f941e59be	2025-02-26 05:23:35 +00:00
varshav	8a69141294	Bug fix in BF16-F32 supported AVX2 Kernels - Bug fix in Matrix Mul post op. - Updated the config in AVX512_VNNI_BF16 context to work in AVX2 kernels Change-Id: I25980508facc38606596402dba4cfce88f4eb173	2025-02-25 14:42:45 +00:00
varshav	a0005c60ce	Add col-major pack kernels and BF16 output support in F32 AVX-2 kernels. - Added column major pack kernels, which will transpose and store the BF16 matrix input to F32 input matrix - Added BF16 Zero point Downscale support to F32 main and fringe kernels. - Updated Matrix Add and Matrix Mul post-ops in f32-AVX2 main and fringe kernels to support BF16 input. - Modified the f32 tiny kernels loop to update the buf_downscale parameter. - Modified bf16bf16f32obf16 framework to work with AVX-2 system. - Added wrapper in bf16 5-Loop to call the corresponding AVX-2/AVX-512 5 Loop functions. - Bug fixes in the f32-AVX2 kernels BIAS post-ops. - Bug fixes in the Convert function, and the bf16 5-loop for multi-threaded inputs. AMD-Internal:[SWLCSG-3281 , CPUPL-6447] Change-Id: I4191fbe6f79119410c2328cd61d9b4d87b7a2bcd	2025-02-24 09:51:12 +05:30
Nallani Bhaskar	5a3c58b315	Fixed column major case of bf16 un-reorder reference function Description: 1. Fixed bf16 un-reorder column major kernel 2. Fixed a bug in nrlt16 case of f32obf16 reorder function 3. Unit testing done . AMD-internal: [SWLCSG-3279] Change-Id: I65024342935ae65186b95885eb010baf3269aa7d	2025-02-20 06:26:31 -05:00
Meghana Vankadari	17634d7ae8	Fixed compiler errors and warning for gcc < 11.2 Description: 1. When compiler gcc version less than 11.2 few BF16 instructions are not supported by the compiler even though the processors arch's zen4 and zen5 supports. 2. These instructions are guarded now with a macro. Change-Id: Ib07d41ff73d8fe14937af411843286c0e80c4131	2025-02-13 10:18:13 -05:00
Nallani Bhaskar	0acb5eb9a4	Implemented reference unreorder bf16 function Description: Implemented a c reference for aocl_gemm_unreorder_bf16bf16f32of32 function The implementation working for row major and column major yet to be enabled. AMD-Internal: [ SWLCSG-3279 ] Change-Id: Ibcce4180bb897a40252140012d8d6886c38cb77a	2025-02-11 02:04:42 +00:00
varshav2	ef04388a44	Added AVX2 support for BF16 kernels: Row major - Currently the BF16 kernels uses the AVX512 VNNI instructions. In order to support AVX2 kernels, the BF16 input has to be converted to F32 and then the F32 kernels has to be executed. - Added un-pack function for the B-Matrix, which does the unpacking of the Re-ordered BF16 B-Matrix and converts it to Float. - Added a kernel, to convert the matrix data from Bf16 to F32 for the give input. - Added a new path to the BF16 5LOOP to work with the BF16 data, where the packed/unpacked A matrix is converted from BF16 to F32. The packed B matrix is converted from BF16 to F32 and the re-ordered B matrix is unre-ordered and converted to F32 before feeding to the F32 micro kernels. - Removed AVX512 condition checks in BF16 code path. - Added the Re-order reference code path to support BF16 AVX2. - Currently the F32 AVX-2 kernels supports only F32 BIAS support. Added BF16 support for BIAS post-op in F32 AVX2 kernels. - Bug fix in the test input generation script. AMD Internal : [SWLCSG - 3281] Change-Id: I1f9d59bfae4d874bf9fdab9bcfec5da91eadb0fb	2025-02-10 08:18:52 -05:00
Deepak Negi	3a7523b51b	Element wise post-op APIs are upgraded with new post-ops Description: 1. Added new output types for f32 element wise API's to support s8, u8, s32 , bf16 outputs. 2. Updated the base f32 API to support all the post-ops supported in gemm API's AMD Internal: [SWLCSG-3384] Change-Id: I1a7caac76876ddc5a121840b4e585ded37ca81e8	2025-02-10 01:06:39 -05:00
Edward Smyth	0bae96d7ac	BLIS: Missing clobbers (batch 8) - Add missing xmm, ymm and k registers to clobber lists in bli_dgemmsup_rv_zen4_asm_24x8m.c - Add missing ymm1 in bli_dgemmsup_rv_zen4_asm_24x8m.c bli_gemmsup_rv_haswell_asm_d6x8m.c and bli_gemmsup_rd_zen_s6x64.c - Also change formatting in bli_copyv_zen4_asm_avx512.c bli_dgemm_avx512_asm_8x24.c and bli_zero_zmm.c to make automatic processing of clobber lists easier. AMD-Internal: [CPUPL-5895] Change-Id: If05a3f00e6c0f9033eeced5de165ba4c3128b3e5	2025-02-07 10:39:24 -05:00
Mithun Mohan	bffa92ec93	Deprecate S16 LPGEMM APIs. -The following S16 APIs are removed: 1. aocl_gemm_u8s8s16os16 2. aocl_gemm_u8s8s16os8 3. aocl_gemm_u8s8s16ou8 4. aocl_gemm_s8s8s16os16 5. aocl_gemm_s8s8s16os8 along with the associated reorder APIs and corresponding framework elements. AMD-Internal: [CPUPL-6412] Change-Id: I251f8b02a4cba5110615ddeb977d86f5c949363b	2025-02-07 11:43:28 +00:00
Edward Smyth	1f0fb05277	Code cleanup: Copyright notices (2) More changes to standardize copyright formatting and correct years for some files modified in recent commits. AMD-Internal: [CPUPL-5895] Change-Id: Ie95d599710c1e0605f14bbf71467ca5f5352af12	2025-02-07 05:41:44 -05:00
Arnav Sharma	5a4739d288	DGEMV NO_TRANSPOSE Optimizations and Unit Tests - Added 32x3n n-biased kernels to directly handle the cases where n=3 which were earlier being handled by the primary n-biased, 32x8n, kernel. - Modified the n-biased fringe kernels to further handle the smaller m-fringe cases. Thus, now the kernels handle the following range of m for any value of n: - 16x8n : m = [16, 31) - 8x8n : m = [8, 15) - m_leftx8n : m = [1, 7] - Updated the function pointer map for n-biased kernels with added granularity to invoke the smaller fringe cases directly on the basis of m-dimension. - Added micro-kernel unit tests for all the dgemv_n kernels. AMD-Internal: [CPUPL-6231] Change-Id: Ibe88848c2c1bbb65b3e79fbc90a2800dc15f5119	2025-02-06 18:52:32 +05:30
Shubham Sharma	f8c83fedb6	Added new ZTRSM small code path for ZEN5 - Added new ZTRSM kernels for right and left variants. - Kernel dimensions are 12x4. - 12x4 ZGEMM SUP kernels are used internally for solving GEMM subproblem. - These kernels do not support conjugate transpose. - Only column major inputs are supported. - Tuned thresholds to pick efficent code path for ZEN5. AMD-Internal: [CPUPL-6356] Change-Id: I33ba3d337b0fcd972ca9cfe4668cb23d2b279b6e	2025-02-06 18:01:10 +05:30
Deepak Negi	2e687d8847	Updated all post-ops in s8s8s32 API to operate in float precision Description: 1. Changed all post-ops in s8s8s32o<s32\|s8\|u8\|f32\|bf16> to operate on float data. All the post-ops are updated to operate on f32 by converting s32 accumulator registers to float at the end of k loop. Changed all post-ops to operate on float data. 2. Added s8s8s32ou8 API which uses s8s8s32os32 kernels but store the output in u8 AMD-Internal - SWLCSG-3366 Change-Id: Iadfd9bfb98fc3bf21e675acb95553fe967b806a6	2025-02-06 07:31:28 -05:00
Meghana Vankadari	13e7ada3f2	Modified bench to test different types of post-ops - Modified bench to support testing of different types of buffers for bias, mat_add and mat_mul postops. - Added support for testing integer APIs with float accumulation type. Change-Id: I72364e9ad25e6148042b93ec6d152ff82ea03e96	2025-02-06 02:38:08 +05:30

1 2 3 4 5 ...

892 Commits