892 Commits

Author SHA1 Message Date
Hari Govind S
3d2653f1ab DDOTV Optimization for ZEN3 Architecture
- Reduced the blocking size of the 'bli_ddotv_zen_int10'
  kernel from 40 elements to 20 elements for better
  utilization of vector registers.

- Replaced redundant 'for' loops in the 'bli_ddotv_zen_int10'
  kernel with 'if' conditions to handle remainder
  iterations, since only a single iteration is executed when
  the remainder is less than the primary unroll factor.

- Added a conditional check to invoke the vectorized
  DDOTV kernels directly (fast-path), without incurring
  any additional framework overhead.

- The fast-path is taken when the input size is ideal
  for single-threaded execution. Thus, we avoid the
  call to the bli_nthreads_l1() function to set the ideal
  number of threads.

- Updated gtestsuite ukr tests for the 'bli_ddotv_zen_int10'
  kernel.
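
The remainder-handling change above can be sketched as follows. Block sizes here are illustrative, not the kernel's actual unroll factors: once the main loop leaves fewer elements than the primary unroll factor, each smaller block can execute at most once, so an 'if' chain replaces the 'for' loops.

```c
#include <stddef.h>

/* Illustrative sketch of 'if'-based remainder handling (hypothetical
   block sizes). Returns the number of elements covered, to show that
   every element is still processed exactly once. */
static size_t cover(size_t n)
{
    size_t i = 0;
    for (; i + 20 <= n; i += 20) ;  /* main vectorized loop, 20 at a time */
    if (i + 16 <= n) i += 16;       /* runs at most once -> 'if', not 'for' */
    if (i + 8 <= n)  i += 8;        /* runs at most once */
    if (i + 4 <= n)  i += 4;        /* runs at most once */
    for (; i < n; ++i) ;            /* scalar cleanup */
    return i;
}
```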

AMD-Internal: [CPUPL-4877]
Change-Id: If43f0fcff1c5b1563ad233005717398b5b6fb8f2
2025-02-04 06:01:04 -05:00
Edward Smyth
bec9406996 Export some BLIS internal symbols (3)
libFLAME calls the DAMAX kernel directly. Now that the AVX512 version
has been enabled in the BLIS cntx, export this symbol.

AMD-Internal: [CPUPL-5895]
Change-Id: I4c74150578f49eb643b0f68c6cc32ee2bb23bec2
2025-02-03 06:30:14 -05:00
Shubham Sharma
7695561f4e Tuned DGEMM blocksizes for ZEN5
- In the existing code, blocksizes for sizes where M >> K, N >> K and K < 500
  were not tuned properly for cases where an application uses more than
  one instance of BLIS in parallel.
- Improved DGEMM performance for sizes where M, N >> K by retuning blocksizes.
  Such sizes are used by applications like HPL.

AMD-Internal: [SWLCSG-3338]
Change-Id: Iec17ecc53a6fabf50eedacaf208e4e74a4e21418
2025-02-03 05:40:07 -05:00
Shubham Sharma
50306c4854 Tuned ZEN4 DGEMM blocksizes
- Blocksizes for sizes where M >> K, N >> K and K < 500 were tuned by running
   the BLIS bench on only one MPI rank. Blocksizes tuned this way do not
   perform well for all configurations.
- Retuned the blocksizes so that performance is good for such skinny sizes.

AMD-Internal: [CPUPL-6362]
Change-Id: I89c61889df2443ef6bf0e87bf89263768b5c00c1
2025-02-03 03:05:50 -05:00
Deepak Negi
4ade159800 Buffer scale support for matrix add and matrix mul post-ops in f32 API.
Description:
1. Support has been added to scale buffer values using both scalar and
   vector scale factors before matrix add or matrix mul post-ops.

AMD-Internal: CPUPL-6340

Change-Id: Ie023d5963689897509ef3d5784c3592791e57125
2025-01-30 07:09:18 -05:00
Nallani Bhaskar
805bd10353 Updated all post-ops in u8s8s32 API to operate in float precision
Description:

1. Changed all post-ops in u8s8s32o<s32|s8|u8|f32|bf16> to operate
   on float data. The post-ops now operate on f32 values obtained by
   converting the s32 accumulator registers to float at the end of the
   k loop.

2. Added a u8s8s32ou8 API which uses the u8s8s32os32 kernels but stores
   the output in u8.
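
A scalar sketch of the accumulator conversion in (1) (the kernels do this on vector registers, e.g. via _mm512_cvtepi32_ps; the scale-and-bias post-op shown here is illustrative, not a specific AOCL post-op):

```c
#include <stdint.h>

/* Scalar sketch of the change described above: the s32 accumulator is
   converted to float at the end of the k loop, and post-ops (here a
   hypothetical scale-and-bias) are applied in f32 precision. */
static float apply_postops_f32(int32_t acc, float scale, float bias)
{
    float v = (float)acc;       /* s32 accumulator -> f32 */
    return v * scale + bias;    /* post-ops now operate on float data */
}
```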

AMD-Internal - SWLCSG-3366

Change-Id: Iab1db696d3c457fb06045cbd15ea496fd4b732a5
2025-01-29 04:21:17 -05:00
Deepak Negi
db407fd202 Added F32 bias type support, F32, BF16 output type support in int8 APIs
Description:

1. Added u8s8s32of32, u8s8s32obf16, s8s8s32of32 and s8s8s32obf16 APIs,
   where the inputs are uint8/int8 and the processing is done using
   VNNI, but the output is stored in f32 and bf16 formats. All the int8
   kernels are reused and updated with the new output data types.

2. Added F32 data type support in bias.

3. Updated the bench and bench input file to support validation.

AMD-Internal: SWLCSG-3335

Change-Id: Ibe2474b4b8188763a3bdb005a0084787c42a93dd
2025-01-26 11:38:30 -05:00
Vignesh Balasubramanian
fb6dcc4edb Support for Tiny-GEMM interface(ZGEMM)
- As part of AOCL-BLAS, there exists a set of vectorized
  SUP kernels for GEMM that are performant when invoked
  in a bare-metal fashion.

- Designed a macro-based interface for handling tiny
  sizes in GEMM that utilizes these kernels. This
  is currently instantiated for the 'Z' datatype (double-precision
  complex).

- Design breakdown:
  - The tiny path requires the usage of AVX2 and/or AVX512
    SUP kernels, based on the micro-architecture. The
    decision logic for invoking the tiny path is specific
    to the micro-architecture. These thresholds are defined
    in their respective configuration directories (header files).

  - The list of AVX2/AVX512 SUP kernels (lookup table) and their
    lookup functions are defined in the base architecture from
    which the support starts. Since we need to maintain backward
    compatibility when defining the lookup table/functions, they
    are present in the kernels folder (base architecture).

- Defined a new type to be used to create the lookup table and its
  entries. This type holds the kernel pointer, blocking dimensions
  and the storage preference.
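
A minimal sketch of such a lookup-table entry type. The field names and the kernel signature are hypothetical, not the actual BLIS type:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shape of a tiny-GEMM lookup-table entry as described
   above: kernel pointer, blocking dimensions, storage preference. */
typedef void (*tiny_gemm_ukr_ft)(void);  /* stand-in kernel signature */

typedef struct
{
    tiny_gemm_ukr_ft ukr;      /* kernel function pointer */
    size_t           mr, nr;   /* blocking (register tile) dimensions */
    bool             row_pref; /* storage preference of the kernel */
} tiny_gemm_lookup_entry_t;
```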

- This design only requires the appropriate thresholds and
  the associated lookup table to be defined for the other datatypes
  and micro-architectures. Thus, it is extensible.

- NOTE: The SUP kernels that are listed for Tiny GEMM are m-var
        kernels. Thus, the blocking in the framework is done accordingly.
        In case support for n-var is added, the variant
        information could be encoded in the object definition.

- Added test-cases to validate the interface for functionality (API-
  level tests). Also added exception value tests, which have been
  disabled due to the SUP kernel optimizations.

AMD-Internal: [CPUPL-6040][CPUPL-6018][CPUPL-5319][CPUPL-3799]
Change-Id: I84f734f8e683c90efa63f2fa79d2c03484e07956
2025-01-24 12:59:26 -05:00
Hari Govind S
c149b5a98b Optimisation of DCOPY kernel in zen3
-  Using an 'if' condition instead of a 'for' loop to handle fringe
   cases. The 'for' loop is redundant for handling remainder iterations,
   as only a single iteration is executed when the remainder is less than
   the primary unroll factor.

AMD-Internal: [CPUPL-5594]
Change-Id: I8cebc037742ee47961869e22e2471e550fcd99e9
2025-01-24 05:55:59 -05:00
Hari Govind S
349fc47ec5 DGEMV Optimizations for TRANSPOSE Cases
- Developed new AVX512 DGEMV kernels for Zen4/5 architectures and
  AVX2 kernels for Zen1/2/3 architectures. These kernels are written
  from the ground up and are independent of fused kernels.

- The DGEMV primary kernel processes the calculation in chunks of
  8 columns. Fringe columns (sizes 1 to 7) are handled by fringe
  kernels, which are invoked by the primary kernel as needed.

- Implemented the kernels by computing the dot product of matrix A
  columns with vector x in chunks of 32 elements, storing the results
  in accumulator registers. Fringe elements are handled in chunks
  of 16, 8, etc. The data in the accumulator registers is then reduced
  and added to vector y.
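
In scalar form, the computation these kernels implement is the following (a reference sketch only; the real kernels process 8 columns at a time in 32-element AVX512 chunks with register accumulators):

```c
#include <stddef.h>

/* Scalar reference for the transpose-DGEMV computation described above:
   y[j] += alpha * dot(A(:, j), x) for a column-stored A. */
static void dgemv_t_ref(size_t m, size_t n, double alpha,
                        const double *a, size_t lda,
                        const double *x, double *y)
{
    for (size_t j = 0; j < n; ++j)
    {
        double acc = 0.0;                  /* accumulator per column */
        for (size_t i = 0; i < m; ++i)
            acc += a[i + j * lda] * x[i];  /* dot(A(:,j), x) */
        y[j] += alpha * acc;               /* reduce into y */
    }
}
```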

AMD-Internal: [CPUPL-5835]
Change-Id: I5cb9eb1330db095931586a7028fd7676fbbecc61
2025-01-24 00:38:34 -05:00
Meghana Vankadari
69ca5dbcd6 Fixed compilation errors for gcc versions < 11.2
Details:
- Disabled the intrinsics code of the f32obf16 pack function
  for gcc < 11.2, as the instructions used in the kernels
  are not supported by those compiler versions.
- Added an early-return check for WOQ APIs when compiling with
  gcc < 11.2.
- Fixed the code that checks whether JIT kernels are generated inside
  the batch_gemm API for the bf16 datatype.

AMD Internal: [CPUPL-6327]

Change-Id: I0a017c67eb9d9d22a14e095e435dc397e265fb0a
2025-01-21 07:13:31 -05:00
Deepak Negi
994098dd35 Fix for a bias data type in u8s8s32/s8s8s32 gemv APIs.
Description:
Added _mm512_cvtps_epi32 for bf16 to s32 conversion in gemv APIs.

AMD-Internal: SWLCSG-3302

Change-Id: I7e3e6da8f50d1f7177629cb68ac21e3bbce40bee
2025-01-16 01:43:38 -05:00
Deepak Negi
182a6373b5 Added support to specify bias data type in u8s8s32/s8s8s32 API's
Description:
1. The bias type was supported only based on the output data type.
2. The option is added in the pre-ops structure to select the bias data
   type (s8/s32/bf16) irrespective of the storage data type in the
   u8s8s32/s8s8s32 APIs.

AMD-Internal: SWLCSG-3302

Change-Id: I3c465fe428672d2d58c1c60115c46d2d5b11f0f4
2025-01-15 05:56:26 -05:00
Vignesh Balasubramanian
a80436ab21 Standardizing the EVT compliance of {S/D}AMAXV API
- Updated the existing AVX2 {S/D}AMAXV kernels to comply
  with the standard when handling exception values. This makes
  them exhibit the same behaviour as their AVX512 variants.
  Provided additional optimizations with loop unrolling.

- Removed redundant early return checks inside the kernels,
  since they have been abstracted to a higher layer.

- Updated the unit-tests(micro-kernel) and exception value
  tests for appropriate code-coverage. Also re-enabled the
  exception value tests.

AMD-Internal: [CPUPL-4745]
Change-Id: I36c793220bd4977a00281af9737c51cd1e5c60d9
2025-01-13 06:56:31 -05:00
harsh dave
7510e27007 DGEMM Optimizations
Refined thresholds to decide between native and sup DGEMM code-paths for both zen4 and zen5 processors.

AMD-Internal: [CPUPL-6300]
Change-Id: Ib32a256dba99a0a92b7ecaa7684443a66c459566
2025-01-13 01:09:39 -05:00
Edward Smyth
97ede96ed4 Correct duplicate object file names
Some kernel file names were the same for different sub-configurations,
which could result in duplicate copies of the same object being archived
depending upon the order of (re-)compiling the source files. Rename the
files to be specific to each sub-configuration to avoid this problem.

AMD-Internal: [CPUPL-5895]
Change-Id: I182ac706e04a364f1df20fd0fb5b633eb10eeafb
2025-01-10 06:03:36 -05:00
Meghana Vankadari
852cdc6a9a Implemented batch_matmul for f32 & int8 datatypes
Details:
- The batch matmul performs a series of matmuls, processing
  more than one GEMM problem at once.
- Introduced a new parameter called batch_size for the user
  to indicate number of GEMM problems in a batch/group.
- This operation supports processing GEMM problems with
  different parameters, including dims, post-ops, storage schemes, etc.
- This operation is optimized for problems where all the
  GEMMs in a batch are of the same size and shape.
- For now, threads are distributed equally among the GEMM
  problems irrespective of their dimensions, which
  leads to better performance for batches with identical GEMMs
  but performs sub-optimally for batches with non-identical GEMMs.
- Optimizations for batches with non-identical GEMMs are in progress.
- Added bench and input files for batch_matmul.
- Added logger functionality for batch_matmul APIs.

AMD-Internal: [SWLCSG-2944]
Change-Id: I83e26c1f30a5dd5a31139f6706ac74be0aa6bd9a
2025-01-10 04:10:53 -05:00
Mithun Mohan
ef4286a97e Multi-data type buffer and scale support for matrix add|mul post-ops in s32 API.
-As it stands, the buffer type in matrix add|mul post-ops is expected to
be the same as that of the output C matrix type. This limitation is now
removed and the user can specify the buffer type by setting the stor_type
attribute in the add|mul post-op struct. As of now int8, int32, bfloat16 and
float types are supported for the buffer in s32 micro-kernels. The same
support is also added for bf16 micro-kernels, with bfloat16 and float
supported for now.
-Additionally the values (from buffer) are added/multiplied as is to the
output registers while performing the matrix add|mul post-ops. Support
is added for scaling these values before using them in the post-ops.
Both scalar and vector scale_factors are supported.
-The bias_stor_type attribute is renamed to stor_type in bias post-ops.

AMD-Internal: [SWLCSG-3319]
Change-Id: I4046ab84481b02c55a71ebb7038e38aec840c0fa
2025-01-10 02:11:12 -05:00
Meghana Vankadari
051c9ac7a2 Bug fixes in F32 and INT8 APIs
Details:
- Fixed a few bugs in the downscale post-op for the f32 datatype.
- Fixed a bug in setting strides of packB buffer in
  int8 APIs.

Change-Id: Idb3019cc4593eace3bd5475dd1463dea32dbe75c
2025-01-09 04:07:26 -05:00
varshav
7b9d29f9b3 Adding post-ops for JIT kernels
- Added Downscale, tanh and sigmoid post-op support to the JIT kernels.
- Masked the bf16s4 kernel call while JIT kernels are enabled, to avoid a compile-time error.
- Added optional support for B-prefetch in the JIT kernels.
- Resolved the visibility issues in the global variable jit_krnels_generated.
- Modified the array generation for scale and zp values in the bench.

Change-Id: I09b8afc843f51ac23645e02f210a2c13d3af804d
2025-01-08 12:55:27 +00:00
Mithun Mohan
4a95f44d39 Buffer scale support for matrix add and matrix mul post-ops in bf16 API.
-Currently the values (from buffer) are added/multiplied as is to the
output registers while performing the matrix add/mul post-ops. Support
is added for scaling these values before using them in the post-ops.
Both scalar and vector scale_factors are supported.

AMD-Internal: [SWLCSG-3181]
Change-Id: Ifdb7160a1ea4f5ecccfa3ef31ecfed432898c14d
2025-01-08 10:35:50 +00:00
Vignesh Balasubramanian
cdaa2ac7fd Bugfix and optimizations for AVX512 AMAXV micro-kernels
- Bug: The current {S/D}AMAXV AVX512 kernels produced
  incorrect results when there were multiple absolute maximums.
  They returned the last index when there were multiple occurrences,
  instead of the first one.

- Implemented a bug-fix to handle this issue in these AVX512
  kernels. Also ensured that the kernels are compliant with
  the standard when handling exception values.

- Further optimized the code by decoupling the logic that finds
  the maximum element from the search for its index. This way,
  lower-latency instructions are used to compute the maximum
  first.

- Updated the unit-tests, exception value tests and early return
  tests for the API to ensure code-coverage.

AMD-Internal: [CPUPL-4745]
Change-Id: I2f44d33dbaf89fe19e255af1f934877816940c6f
2025-01-07 22:56:20 +05:30
harsh dave
ea4212c550 Increase buffer size to prevent segmentation fault
The threshold for the tiny path was large, but the buffer was not
big enough to store the complete packed matrix, which was leading to
segmentation faults.

This commit fixes the buffer size as per the threshold of the tiny
GEMM path. With the corrected buffer size, the matrix is packed correctly.

AMD-Internal: [CPUPL-6201]
Change-Id: I0292a07f6146e7f1ccd8c1010b4c41c218fd9b47
2025-01-03 06:39:50 -05:00
Shubham Sharma
8f99d8a5bb Fixed warnings and compilation issues with GCC in TRSM
- The current implementation uses macros to expand the code at
  compile time, but this is causing some false warnings in GCC 12 and 14.
- Added a switch case in the TRSM right variants for n_remainder.
- This ensures that n_rem is a compile-time constant, so the
  warnings about out-of-bounds array subscripts are fixed.
- The mtune=znver3 flag is causing a compilation issue in GCC 9.1,
  therefore this flag is removed.
- Renamed the file bli_trsm_small to bli_trsm_small_zen5 in order
  to avoid the possibility of missing symbols.

AMD-Internal: [CPUPL-6199]
Change-Id: Ib8e90196ce0a41d38c2b29226df5ab6c2d8ba996
2024-12-18 06:22:05 -05:00
Shubham Sharma
050e5a382f Fixed warning for GCC 12+
- Fixed warnings in the DTRSM kernel caused by uninitialized registers
  and an extra loop unroll.
- Fixed a warning in the DGEMM kernel caused by an extra space.

Change-Id: I1d9cfaa0b2847f5fdbe8b343a462d67a3aca0819
2024-12-17 01:44:41 -05:00
Shubham Sharma
beaea1b88f Added new DTRSM small code path for ZEN5
- Added new DTRSM kernels for right and left variants.
- Kernel dimensions are 24x8.
- 24x8 DGEMM SUP kernels are used internally
  for solving the GEMM subproblem.
- Tuned thresholds to pick the most efficient code path for ZEN5.

AMD-Internal: [CPUPL-6016]
Change-Id: I743d6dc47717952c2913085c0db3454ae9d046db
2024-12-16 10:38:45 +05:30
harsdave
7813938f70 Support DGEMM Computation for Transposed A Matrix with CRC and RRC Storage Scheme
- This patch introduces changes to support DGEMM computation when the input matrix A is transposed.

- The changes accommodate CRC (Column-Row-Column) and RRC (Row-Row-Column) storage schemes for matrices
  C, A, and B. The primary goal is to pack the A matrix in a column-stored scheme, enabling the re-use
  of the DGEMM SUP kernel for efficient computation.

- Performance is better when the BLIS_PACK_BUFFER macro is set to 0.
  By default, it is set to 1 (enabled).

AMD-Internal: [CPUPL-6054]
Change-Id: I543a84b05c9e6380bc03017ab6da685e7006a64e
2024-12-13 05:19:40 -05:00
harsdave
54b46ec1ed Enhance 24x8 DGEMM SUP/Tiny Kernel Performance with Optimized Loops and Edge Kernels
This patch introduces comprehensive optimizations to the DGEMM kernel, focusing on loop
efficiency and edge kernel performance. The following technical improvements have been implemented:

1. **IR Loop Optimization:**
   - The IR loop has been re-implemented in hand-written assembly to eliminate the overhead associated
     with `begin_asm` and `end_asm` calls, resulting in more efficient execution.

2. **JR Loop Integration:**
   - The JR loop is now incorporated into the micro kernel. This integration avoids the repetitive overhead
     of stack frame management for each JR iteration, thereby enhancing loop performance.

3. **Kernel Decomposition Strategy:**
   - The m dimension is decomposed into specific sizes: 20, 18, 17, 16, 12, 11, 10, 9, 8, 4, 2, and 1.
   - For remaining cases, masked variants of edge kernels are utilized to handle the decomposition efficiently.

4. **Interleaved Scaling by Alpha:**
   - Scaling by the alpha factor is interleaved with load instructions to optimize the instruction pipeline
     and reduce latency.

5. **Efficient Mask Preparation:**
   - Masks are prepared within inline assembly code only at points where masked load-store operations are necessary,
     minimizing unnecessary overhead.

6. **Broadcast Instruction Optimization:**
   - In edge kernels where each FMA (Fused Multiply-Add) operation requires a broadcast without subsequent reuse,
     the broadcast instruction is replaced with `mem_1to8`.
   - This allows the compiler to optimize by assigning separate vector registers for broadcasting, thus avoiding
     dependency chains and improving execution efficiency.

7. **C Matrix Update Optimization:**
   - During the update of the C matrix in edge kernels, columns are pre-loaded into multiple vector registers.
     This approach breaks dependency chains during FMA operations following the scaling by alpha, thereby mitigating
     performance bottlenecks and enhancing throughput.

These optimizations collectively improve the performance of the DGEMM kernel, particularly in handling edge cases and
reducing overhead in critical loops. The changes are expected to yield significant performance gains in matrix multiplication
operations.

This patch also involves changes to the tiny GEMM interface: a light
interface for calling kernels, and removal of calls to the AVX2 DGEMM
kernels, as the AVX512 DGEMM kernels are used for all sizes on zen4 and zen5.

For zen4 and zen5, when the A matrix is transposed (CRC, RRC), the tiny
kernel does not support such inputs, so they are routed to the
gemm_small path.

AMD-Internal: [CPUPL-6054]
Change-Id: I57b430f9969ca39aa111b54fa169e4225b900c4a
2024-12-13 00:03:00 -05:00
Arnav Sharma
25e59fcbb9 DGEMV Optimizations for NO_TRANSPOSE Cases
- AVX512 specific DGEMV native kernels are added for Zen4/5
  architectures to handle the NO_TRANSPOSE cases and are independent of
  the AXPYF fused kernels.
- The following set of kernels biased towards the n-dimension perform
  beta scaling of y vector within the kernel itself and handle cases
  where n is less than 5:
    - bli_dgemv_n_zen_int_32x8n_avx512( ... )
    - bli_dgemv_n_zen_int_32x4n_avx512( ... )
    - bli_dgemv_n_zen_int_32x2n_avx512( ... )
    - bli_dgemv_n_zen_int_32x1n_avx512( ... )
- The bli_dgemv_n_zen_int_16mx8_avx512( ... ) kernel is biased towards the
  m-dimension, and for this kernel beta scaling is handled beforehand
  within the framework.
- Added unit-tests for the new kernels.
- The AVX2 path for Zen1/2/3 architectures still follows the old approach of
  using a fused kernel, namely AXPYF, to perform the GEMV operation.

AMD-Internal: [CPUPL-5560]
Change-Id: I22bc2a865cd28b9cdcb383e17d1ff38bdd28de79
2024-12-12 10:26:50 -05:00
Deepak Negi
baeebe75c9 Support for standard AutoAWQ storage format.
Description:
1. AutoAWQ uses an int32 buffer to store 8 elements of 4 bits each, in the
   format [0, 2, 4, 6, 1, 3, 5, 7].
2. Support is added to convert the above format back to the original
   sequential order [0, 1, 2, 3, 4, 5, 6, 7] before reordering
   in the AWQ API.
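
The format conversion described above can be sketched for a single packed word as follows. The function name is illustrative, not the actual AOCL kernel; nibble position p of the packed word holds logical element [0, 2, 4, 6, 1, 3, 5, 7][p]:

```c
#include <stdint.h>

/* Hypothetical helper: unpack one AutoAWQ-packed int32 word, whose eight
   4-bit fields hold elements in the order [0, 2, 4, 6, 1, 3, 5, 7], back
   into sequential order [0, 1, 2, 3, 4, 5, 6, 7]. */
static uint32_t awq_to_sequential(uint32_t packed)
{
    /* src_pos[e] = nibble position of logical element e in the packed
       word; this is the inverse of the AutoAWQ order above. */
    static const int src_pos[8] = { 0, 4, 1, 5, 2, 6, 3, 7 };
    uint32_t out = 0;
    for (int e = 0; e < 8; ++e)
    {
        uint32_t nib = (packed >> (4 * src_pos[e])) & 0xFu;
        out |= nib << (4 * e);
    }
    return out;
}
```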

AMD-Internal: SWLCSG-3169

Change-Id: I5395766060c200ab81d0b8be94356678a169ac13
2024-12-02 04:02:27 -05:00
Meghana Vankadari
fbb72d047f Added group quantization and zero-point support for WOQ kernels
Description:

1. Added group quantization and zero-point (zp) support in the
   aocl_gemm_bf16s4f32o<bf16|f32> API.

2. Group quantization is a technique to improve accuracy
   where the scale factors used to dequantize weights vary at the
   group level instead of at the per-channel or per-tensor level.

3. Added zp and scaling in the WOQ packb kernels so that, for
   large M values, zp and scaling are performed at the pack-B
   stage and the bf16 kernels are called.

4. Adding zp support and scaling to the default path in the WOQ kernels
   created some performance overhead when the M value is very small.

5. Added a string group_size to the lpgemm bench to read the
   group size from bench_input.txt, and tested
   various combinations of matrix dimensions.

6. The scale factors can be of type float or bf16,
   and the zero-point values are expected to be
   in int8 format.
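
Group-wise dequantization as described above can be sketched in scalar form (names and the int8 weight type are illustrative, not the actual AOCL kernels; each group of group_size consecutive weights shares one scale factor and one int8 zero-point):

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar sketch of group quantization: dequantize weights with
   per-group scale factors and zero-points. */
static void dequant_grouped(const int8_t *q, const float *scale,
                            const int8_t *zp, size_t n, size_t group_size,
                            float *out)
{
    for (size_t i = 0; i < n; ++i)
    {
        size_t g = i / group_size;                  /* group index */
        out[i] = scale[g] * (float)(q[i] - zp[g]);  /* dequantize */
    }
}
```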

AMD-Internal: [SWLCSG-3168, SWLCSG-3172]

Change-Id: Iff07b54d76edc7408eb2ea0b29ce8b4a04a38f57
2024-12-02 06:46:13 +00:00
Shubham Sharma.
be6fbadd95 BlockSize Tuning for ZEN4 and ZEN5
- Enabled dynamic blocksizes for DGEMM in ZEN4 and ZEN5 systems.
- MC, KC and NC are dynamically selected at runtime for DGEMM native.
- A local copy of cntx is created and blocksizes are updated in the local cntx.
- Updated threshold for picking DGEMM SUP kernel for ZEN4.

AMD-Internal: [CPUPL-5912]
Change-Id: Ic12a1a48bfa59af26cc17ccfa47a2a33fadde1f6
2024-11-29 03:19:16 -05:00
Shubham Sharma
f2320a1fef Enabled DGEMM row major kernel for ZEN4
- Merged ZEN4 and ZEN5 DGEMM 8x24 kernel.
- Replaced 32x6 kernel with 8x24. Now same
  kernel is used for ZEN4 and ZEN5.
- Blocksizes have been tuned for genoa only.
- DGEMM kernel for DTRSM native code path
  is replaced with 8x24 kernel.
- Enabled alpha scaling during packing for ZEN4.
- ZEN4 8x24 kernel has been removed.

AMD-Internal: [CPUPL-5912]
Change-Id: I89a16a7e3355af037d21d453aabf53c5ecccb754
2024-11-29 08:18:48 +00:00
Shubham Sharma
082081658f BugFix: Fixed extreme value handling in AVX512 DGEMM kernel
- Extreme values were not handled correctly when beta == 0 and C is
  stored in column major.
- To check whether beta is zero, VCOMISD(XMM(1), XMM(2)) is used:
  beta (XMM1) is compared with zero (XMM2).
  For column-major C, setting XMM2 to zero was missed.
- XMM2 was set to zero after the jump to the column-major C code,
  so the jump skipped the zeroing of XMM2 for column-major
  C.
- This is fixed by setting XMM2 to zero before the column-major jump.

AMD-Internal: [CPUPL-5851]
Change-Id: Ic511071fbc82a082fa48a1543c0c7325eaf75cb8
2024-11-29 08:13:57 +00:00
Shubham Sharma.
bc3238e21e BugFixes in ZEN5 DGEMM kernel
- Changed fringe cases to use the ZEN5 DGEMM kernel instead
  of the ZEN4 kernel.
- ASAN reports an error when RBP is used even when the
  -fno-stack-pointer flag is used, therefore the RBP
  register is replaced with the R11 register.
- Added the missing RDX register to the clobber list; its absence was
  causing failures with the AOCC compiler.

Thanks to harsh.dave@amd.com for debugging some of the issues.

AMD-Internal: [CPUPL-5851]
Change-Id: I0ee412c97c9dbfb3e7a736a10bfd93d775779b5b
2024-11-29 00:22:41 -05:00
Shubham Sharma
266bd32dea Enable fringe case handling in DGEMM ZEN5 macro kernel
- A generic kernel is used if N is not a multiple of NR
  or M is not a multiple of MR.
- This limits the maximum value of NR that can be used.
- Support for fringe case handling is added in the DGEMM
  macro kernel so that the macro kernel can be used for
  all problem sizes.

AMD-Internal: [CPUPL-5912]
Change-Id: I85c17e91d7511bb35ffed0f346d6ff0376baf62f
2024-11-29 00:22:33 -05:00
Deepak Negi
04ae01aeab Added support to specify bias data type in bf16 API's
Description:
1. The bias type was supported only based on the output data type.
2. The option is added in the pre-ops structure to select the bias data
   type irrespective of the storage data type in the bf16 and WoQ APIs.

AMD-Internal: SWLCSG-3171

Change-Id: Iac10b946c2d4a5c405b2dc857362be0058615abf
2024-11-19 05:30:02 -05:00
Deepak Negi
60a8c71a1a Sigmoid and Tanh post-operation support for int8 API's.
Description:

Implemented sigmoid and tanh as fused post-ops in the
aocl_gemm_<s8|u8>s8<s32|s16>o<s8|u8|s32> APIs.

Sigmoid(x) = 1/(1+e^(-x))
Tanh(x) = (1-e^(-2x))/(1+e^(-2x))

Updated bench_lpgemm to recognize sigmoid and tanh
as options for post-ops from bench_input and verified.

AMD-Internal: [SWLCSG-3178]

Change-Id: I9df3aab02222f728ff9d1f292c7bc549f30176f0
2024-11-15 05:36:31 -05:00
Deepak Negi
146f3b2eb2 Sigmoid and Tanh post-operation support for f32 API.
Description:

Implemented sigmoid and tanh as fused post-ops in the
aocl_gemm_f32f32f32of32 API.

Sigmoid(x) = 1/(1+e^(-x))
Tanh(x) = (1-e^(-2x))/(1+e^(-2x))

Updated bench_lpgemm to recognize sigmoid and tanh
as options for post-ops from bench_input and verified.

AMD-Internal: [SWLCSG-3178]

Change-Id: Iac0a907f6dea1d9cb82d9fd8716bfdbf1c33921d
2024-11-15 04:20:20 -04:00
Deepak Negi
b5c1b6055a Sigmoid and Tanh post-operation support for bf16 API.
Description:

Implemented sigmoid and tanh as fused post-ops in the
aocl_gemm_bf16bf16f32o<f32|bf16> APIs.

Sigmoid(x) = 1/(1+e^(-x))
Tanh(x) = (1-e^(-2x))/(1+e^(-2x))

Updated bench_lpgemm to recognize sigmoid and tanh
as options for post-ops from bench_input and verified.

AMD-Internal: [SWLCSG-3178]

Change-Id: I78a3ba4a67ab63f9d671fbe315f977b016a0d969
2024-11-15 01:13:31 -04:00
Vignesh Balasubramanian
06d776b025 AVX512 ZGEMM SUP Inner product kernels
- Implemented a set of column-preferential dot-product based
  ZGEMM kernels (main and fringe) in AVX512 (for the SUP code-path).
  These kernels perform matrix multiplication as a sequence
  of inner products (i.e., dot-products).

- These standalone kernels are expected to strictly handle
  the CRC storage scheme for C, A and B matrices. RRC is also
  supported through operation transpose, at the framework
  level.

- Added unit-tests to test all the kernels(main and fringe),
  as well as the redirection between these kernels.

AMD-Internal: [CPUPL-5949]
Change-Id: I858257ac2658ed9ce4980635874baa1474b79c38
2024-11-06 04:18:57 -05:00
Nallani Bhaskar
3e6f98f046 Bug fixes in f32 to bf16 reorder API
Description:

The _mm512_cvtne2ps_pbh(a, b) intrinsic takes elements
from b for j < 16, but the code was developed
assuming the reverse order.

Fixed some indentation issues.

Renamed the file to make the naming uniform.

Change-Id: I7b45b4c35931d8febde7b7b5d9604ea953046f97
2024-11-05 12:58:15 +00:00
Nallani Bhaskar
9735391e1d Implemented f32tobf16 reorder function
Description:
The aocl_reorder_f32obf16 function is implemented to
reorder an input weight matrix of data type float to
bfloat16.

The reordering is done to match the input requirements
of the aocl_gemm_bf16bf16f32o<f32|bf16> API.

The objective of the API is to convert a model/matrix
of type f32 to bf16 and process it when the machine supports
the bf16 FMA instruction _mm512_dpbf16_ps but the model
is still in float.

Change-Id: Ib7c743d52d01a1ac09e84ac120577ec9e02f90f5
2024-11-04 04:32:01 +00:00
Mithun Mohan
097cda9f9e Adding support for AOCL_ENABLE_INSTRUCTIONS for f32 LPGEMM API.
-Currently lpgemm sets the context (block sizes and micro-kernels) based
on the ISA of the machine it is being executed on. However, this approach
does not give the flexibility to select a different context at runtime.
In order to enable runtime selection of the context, the context
initialization is modified to read the AOCL_ENABLE_INSTRUCTIONS env
variable and set the context based on it. As part of this commit,
only f32 context selection is enabled.
-Bug fixes in scale ops in the f32 micro-kernels and in GEMV path selection.
-Added vectorized f32 packing kernels for NR=16 (AVX2) and NR=64 (AVX512).
This is only for the B matrix and helps remove the dependency of the f32
lpgemm API on the BLIS packing framework.

AMD Internal: [CPUPL-5959]

Change-Id: I4b459aaf33c54423952f89905ba43cf119ce20f6
2024-10-30 08:52:22 +00:00
Meghana Vankadari
b04b8f22c9 Introduced un-reorder API for bf16bf16f32of32
Details:
- Added a new API called unreorder that converts a matrix from
  reordered format to its original format (row-major or col-major).
- Currently this API only supports bf16 datatype.
- Added corresponding bench and input file to test accuracy of the
  API.
- The new API is only supported for 'B' matrix.
- Modified input validation checks in the reorder API to account for
  row vs col storage of the matrix and transposes for the bf16 datatype.

Change-Id: Ifb9c53b7e6da6f607939c164eb016e82514581b7
2024-10-23 07:49:24 -04:00
Deepak Negi
6ddde390aa Added support for column major B matrix in BF16S4F32F32 reorder API.
-Added new pack kernels that packs/reorders B matrix (odd strides) from
 column-major input format. This also supports the transB scenario if
 input B matrix is row major.

Change-Id: Ia0fe7e5f19ae9eba5c418f4089c7e6df11091853
2024-10-22 03:01:39 -04:00
varshav2
dabfdf484a Add Scale post-op for F32 API
- Implemented the Scale post-op for the F32 API for all kernels
 - f32_scale = (f32 * scale_factor) + offset
 - Added the bench inputs

Change-Id: Ib0f25f870eafe695d8b2a2c434c8cb3ec4f7db4c
2024-10-21 06:08:31 -04:00
Shubham Sharma.
2b02024404 Fixed bug in bli_zdotv_zen4_asm_avx512 kernel.
-  The data-type of n and conj is dim_t, which will be int32_t in the LP64 case.
-  When loading 64-bit registers using "mov" instructions, mov(rax, var(n)),
   "n" should be 64-bit, otherwise incorrect values get loaded.

Fix: We typecast these variables to int64_t before loading them into registers.
Thanks to mangala.v@amd.com for finding this bug.

Change-Id: I8542dc1ea434ca9030f3c56d9a681135055f8ba5
2024-09-24 02:33:44 -04:00
Shubham Sharma
1833ee70cd Fixed bug in bli_dgemm_avx512 8x24 native kernel.
-  The data-type of m, n, k and ldc is dim_t, which will be int32_t in the LP64 case.
-  When loading 64-bit registers using "mov" instructions, mov(rax, var(m)),
   "m" should be 64-bit, otherwise incorrect values get loaded.

Fix: We typecast these variables to int64_t before loading them into registers.

AMD-Internal: [CPUPL-5819]
Change-Id: I16043ac168a79ff9358c0c1768989a81e3c6b0e0
2024-09-23 04:54:04 -04:00
Deepak Negi
16653ed208 Added support for column major B matrix in BF16S4F32F32 reorder API.
-Added new pack kernels that packs/reorders B matrix from column-major
input format. This also supports the transB scenario if input B matrix
is row major.

Change-Id: I4c75b6e81016331fd7e7f95ad4212e6d38dc586f
2024-09-20 01:11:21 +05:30