Commit Graph

21 Commits

Author SHA1 Message Date
Meghana Vankadari
bfc512d3e1 Implemented batch_gemm for bf16bf16f32of32|bf16
Details:
- The batch matmul performs a series of matmuls, processing
  more than one GEMM problem at once.
- Introduced a new parameter called batch_size for the user
  to indicate the number of GEMM problems in a batch/group.
- This operation supports processing GEMM problems with
  different parameters, including dims, post-ops, storage schemes, etc.
- This operation is optimized for problems where all the
  GEMMs in a batch are of the same size and shape.
- For now, threads are distributed equally among the GEMM
  problems irrespective of their dimensions, which gives
  good performance for batches with identical GEMMs but is
  sub-optimal for batches with non-identical GEMMs.
- Optimizations for batches with non-identical GEMMs are in progress.
- Added bench and input files for batch_matmul.
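The batching scheme above can be sketched as follows. This is an illustrative Python sketch, not the aocl_gemm API: the function names and the (A, B) problem tuples are assumptions, and the real call takes a batch_size parameter plus per-problem dims/post-ops/storage schemes.

```python
# Illustrative sketch of a batch GEMM: process a list of independent
# GEMM problems, each with its own dimensions, in one call.
def matmul(A, B):
    """Naive C = A*B for row-major lists of lists."""
    m, k, n = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def batch_gemm(batch):
    """batch: list of (A, B) problems; returns one C per problem.
    Hypothetical names, for illustration only."""
    return [matmul(A, B) for A, B in batch]
```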

AMD-Internal: [SWLCSG-2944]
Change-Id: Idc59db5b8c5794bf19f6f86bcb8455cd2599c155
2025-01-03 03:28:32 -05:00
Deepak Negi
615789e196 Fixed compilation issue with clang 18 on windows
Description
-In enum AOCL_PARAMS_STORAGE_TYPES the member FLOAT was declared, and
 clang 18 under MSVC reported a multiple-definition issue. FLOAT and
 BFLOAT16 are renamed to AOCL_GEMM_<F32/BF16>.

AMD-Internal: CPUPL-6174

Change-Id: Ic061af068854d51629b82b495efd0eb54543f329
2024-12-12 06:37:06 -05:00
Deepak Negi
baeebe75c9 Support for standard AutoAWQ storage format.
Description:
1. AutoAWQ uses an int32 buffer to store eight 4-bit elements in this
   format: [0, 2, 4, 6, 1, 3, 5, 7].
2. Support is added to convert the above format back to the original
   sequential order [0, 1, 2, 3, 4, 5, 6, 7] before reordering
   in the AWQ API.
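The conversion described above can be sketched in Python. The nibble-significance convention (element slot i at bits 4*i..4*i+3) is an assumption for illustration:

```python
AWQ_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]  # logical index stored at each nibble slot

def unpack_awq_int32(word):
    """Extract eight 4-bit values from one int32 and return them in
    sequential order [0..7]. Sketch of the conversion, assuming slot i
    occupies bits 4*i..4*i+3 of the word."""
    nibbles = [(word >> (4 * i)) & 0xF for i in range(8)]
    out = [0] * 8
    for slot, logical in enumerate(AWQ_ORDER):
        out[logical] = nibbles[slot]
    return out
```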

AMD-Internal: SWLCSG-3169

Change-Id: I5395766060c200ab81d0b8be94356678a169ac13
2024-12-02 04:02:27 -05:00
Meghana Vankadari
fbb72d047f Added group quantization and zero-point support for WOQ kernels
Description:

1. Added group quantization and zero-point (zp) in
   aocl_gemm_bf16s4f32o<bf16|f32> API.

2. Group quantization is a technique to improve accuracy
   in which the scale factors used to dequantize weights vary
   at group level instead of at per-channel or per-tensor level.

3. Added zp and scaling in WOQ packb kernels so that for
   large M values zp and scaling are performed at the pack-B
   stage and the bf16 kernels are called.

4. Adding zp support and scaling to the default path in WoQ kernels
   introduced some performance overhead when the M value is very small.

5. Added string group_size to lpgemm bench to read
   group size from bench_input.txt and tested for
   various combinations of matrix dimensions.

6. The scale factors can be of type float or bf16,
   and the zero-point values are expected to be
   in int8 format.
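Group-level dequantization as described in point 2 can be sketched as follows (illustrative names and layout, not the library API):

```python
def dequantize_grouped(q, scales, zps, group_size):
    """Dequantize a 1-D column of quantized weights with per-group scale
    and zero-point: w[i] = (q[i] - zp[g]) * scale[g], where g = i // group_size.
    Sketch only; the real kernels fuse this into pack-B."""
    return [(q[i] - zps[i // group_size]) * scales[i // group_size]
            for i in range(len(q))]
```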

AMD-Internal: [SWLCSG-3168, SWLCSG-3172]

Change-Id: Iff07b54d76edc7408eb2ea0b29ce8b4a04a38f57
2024-12-02 06:46:13 +00:00
Deepak Negi
04ae01aeab Added support to specify bias data type in bf16 APIs
Description:
1. Previously, the bias data type was determined solely by the output
   data type.
2. An option is added in the pre-ops structure to select the bias data
   type irrespective of the storage data type in the bf16 and WoQ APIs.


AMD-Internal: SWLCSG-3171


Change-Id: Iac10b946c2d4a5c405b2dc857362be0058615abf
2024-11-19 05:30:02 -05:00
Deepak Negi
b5c1b6055a Sigmoid and Tanh post-operation support for bf16 API.
Description:

Implemented sigmoid and tanh as fused post-ops in the
aocl_gemm_bf16bf16f32o<f32|bf16> APIs.

Sigmoid(x) = 1/(1+e^(-x))
Tanh(x) = (1-e^(-2x))/(1+e^(-2x))

Updated bench_lpgemm to recognize sigmoid and tanh
as post-op options from bench_input and verified them.
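A scalar reference for the two post-ops (the kernels operate on vectorized bf16 data; this sketch only shows the math):

```python
import math

def sigmoid(x):
    # Sigmoid(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

def tanh_post_op(x):
    # Tanh(x) = (1 - e^(-2x)) / (1 + e^(-2x))
    e = math.exp(-2.0 * x)
    return (1.0 - e) / (1.0 + e)
```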

AMD-Internal: [SWLCSG-3178]

Change-Id: I78a3ba4a67ab63f9d671fbe315f977b016a0d969
2024-11-15 01:13:31 -04:00
Deepak Negi
80bf6249f0 Matrix MUL post-operation support for float(bf16|f32) LPGEMM APIs.
This post-operation computes C = (beta*C + alpha*A*B) * D, where D
is a matrix with dimensions and data type the same as that of C matrix.
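A naive reference for this post-operation (illustrative Python, not the library kernels):

```python
def gemm_matrix_mul_post_op(A, B, C, D, alpha=1.0, beta=0.0):
    """C = (beta*C + alpha*A*B) * D, where * with D is elementwise and
    D has the same dimensions as C. Row-major lists of lists."""
    m, k, n = len(A), len(B), len(B[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = sum(A[i][p] * B[p][j] for p in range(k))
            out[i][j] = (beta * C[i][j] + alpha * acc) * D[i][j]
    return out
```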

AMD-Internal: [SWLCSG-2953]

Change-Id: Id4df2ca76a8f696cb16edbd02c25f621f9a828fd
2024-08-05 08:25:32 -04:00
mkadavil
ec8c39541e Test/benchmark framework updates to test WOQ workflow.
-To enable Weight-only-Quantization (WOQ) workflow, new LPGEMM APIs
have been developed where data types are A:bf16, B:int4 and C:f32/bf16.
The testing and benchmarking framework for the same is added.

AMD-Internal: [SWLCSG-2943]
Change-Id: Icdc1d60819a23dd9f41382499d1a3c055c5edc17
2024-07-25 06:44:37 +05:30
Nallani Bhaskar
c6dd7c1b4b Added new API in aocl_gemm to support A bf16 data type and B s4 data type
Description:

1. Added a new API, aocl_gemm_bf16s4f32of32, to support
   WoQ (Weight-only-Quantization) in LLMs.

2. The API supports only a reordered B matrix of signed
   4-bit (S4) data.

3. Subtracting the zero point and multiplying by the scale
   on the B matrix is performed during B packing.

4. Zero-point and scale data should be passed by the user
   through the pre-ops data structure.

5. The API is still in experimental state and NOT tested.
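The per-element B transform in point 3 can be sketched as follows (hypothetical helper names; the library does this inside pack-B):

```python
def s4_to_int(nibble):
    """Sign-extend a 4-bit value (0..15) to an int in [-8, 7]."""
    return nibble - 16 if nibble >= 8 else nibble

def dequant_s4(nibble, scale, zero_point):
    # B element = (s4_value - zero_point) * scale, done during B packing.
    return (s4_to_int(nibble) - zero_point) * scale
```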

   AMD-Internal: SWLCSG-2943

Change-Id: I10b159b64c2e2aaf39da5462685618ba8cc800ee
2024-07-25 11:59:03 +00:00
mkadavil
d37c91dffa Quantization (scale + zero point) support for BF16 LPGEMM API.
-Quantization of f32 to bf16 (bf16 = (f32 * scale_factor) + zero_point)
instead of just type conversion in aocl_gemm_bf16bf16f32obf16.
-Support for multiple scale/sum/matrix_add/bias post-ops in a single
LPGEMM api call.
-Post-ops mask-related fixes in lpgemv kernels.
-Additional scale post-ops sanity checks.
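The quantization step can be sketched in Python, assuming the bf16 value is formed by keeping the top 16 bits of the scaled f32 encoding (real kernels may round rather than truncate):

```python
import struct

def f32_to_bf16_bits(x, scale=1.0, zero_point=0.0):
    """Quantize f32 to bf16: bf16 = (x * scale) + zero_point, then keep
    the top 16 bits of the f32 bit pattern (truncation; a sketch only)."""
    y = x * scale + zero_point
    (bits,) = struct.unpack("<I", struct.pack("<f", y))
    return bits >> 16

def bf16_bits_to_f32(b):
    """Widen a bf16 bit pattern back to f32 (zero-filled low bits)."""
    (y,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return y
```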

AMD-Internal: [SWLCSG-2945]
Change-Id: I3b35cc413c176bb50bfdbd6acd4839a5ba7e94bb
2024-07-18 05:32:51 -04:00
mkadavil
118e955a22 SWISH post-op support for all LPGEMM APIs.
SWISH post-op computes swish(x) = x / (1 + exp(-1 * alpha * x)).
SiLU = SWISH with alpha = 1.
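A scalar reference for the formula above (the kernels are vectorized; this only shows the math):

```python
import math

def swish(x, alpha=1.0):
    # swish(x) = x / (1 + exp(-alpha * x)); SiLU is swish with alpha = 1.
    return x / (1.0 + math.exp(-alpha * x))
```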

AMD-Internal: [SWLCSG-2387]
Change-Id: I55f50c74a8583a515f7ea58fa0878ccbcdd6cc26
2024-05-06 06:05:11 -04:00
mkadavil
01b7f8c945 Matrix Add post-operation support for integer(s16|s32) LPGEMM APIs.
-This post-operation computes C = (beta*C + alpha*A*B) + D, where D is
a matrix with dimensions and data type the same as that of C matrix.
-For clang compilers (including aocc), -march=znver1 is not enabled for
zen kernels. CKVECFLAGS has been updated to capture the same.
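A naive integer reference for the Matrix Add post-operation described above (illustrative, not the s16/s32 kernels):

```python
def gemm_matrix_add_post_op(A, B, C, D, alpha=1, beta=0):
    """C = (beta*C + alpha*A*B) + D, where D has the same dimensions
    and data type as C. Integer row-major sketch."""
    m, k, n = len(A), len(B), len(B[0])
    return [[beta * C[i][j]
             + alpha * sum(A[i][p] * B[p][j] for p in range(k))
             + D[i][j]
             for j in range(n)] for i in range(m)]
```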

AMD-Internal: [SWLCSG-2424]
Change-Id: Ie369f7ea5c80ab69eea3f3e03a8d9546e14f5c09
2024-02-12 23:51:36 +05:30
mkadavil
864170f5cb Scalar value support for zero-point and scale-factor.
-As it stands, in LPGEMM, users are expected to pass an array of values
with length the same as N dimension as inputs for zero point or scale
factor. However at times, a single scalar value is used as zero point
or scale factor for the entire downscaling operation. The mandate to
pass an array requires the user to allocate extra memory and fill it
with the scalar value so as to be used in downscaling. This limitation
is lifted as part of this commit, and now scalar values can be passed
as zero point or scale factor.
-LPGEMM bench enhancements along with new input format to improve
readability as well as flexibility.
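The scalar-broadcast behavior described above can be sketched as follows (hypothetical helper name; the real API distinguishes scalar and vector inputs via its post-op metadata):

```python
def downscale_row(acc_row, scale, zero_point):
    """Apply scale and zero-point to one output row of length N.
    A scalar scale/zero_point is broadcast across all N columns;
    a list must have length N. Illustration only."""
    n = len(acc_row)
    s = scale if isinstance(scale, list) else [scale] * n
    z = zero_point if isinstance(zero_point, list) else [zero_point] * n
    return [acc_row[j] * s[j] + z[j] for j in range(n)]
```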

AMD-Internal: [SWLCSG-2581]
Change-Id: Ibd0d89f03e1acadd099382dffcabfec324ceb50f
2024-01-12 04:37:35 +05:30
Edward Smyth
ed5010d65b Code cleanup: AMD copyright notice
Standardize format of AMD copyright notice.

AMD-Internal: [CPUPL-3519]
Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0
2023-11-23 08:54:31 -05:00
mkadavil
ffa72f09cc Support for multiple eltwise post-ops in low precision gemm.
-Currently only one eltwise post-op (one of relu/prelu/gelu_tanh/
gelu_erf) is supported in the post-op struct along with bias or
downscale. This setup was sufficient when only activation functions
were supported as eltwise post-ops. But with the introduction of the clip
post-op (a type of non-activation eltwise operation), it has become
necessary to extend the post-ops framework to support multiple eltwise
operations, with the multiple eltwise often used in the form activation
eltwise op + non-activation eltwise ops. The aocl post-op struct is
modified and the post-op parser is updated to support this use case.
-The lpgemm_bench is updated to support testing/benchmarking of the
multiple eltwise operations use case. The function for accuracy checking
is modified to support correctness testing irrespective of the order and
count of post-ops. Additionally the help message is updated so as to
better describe the capabilities of lpgemm_bench.
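The "activation eltwise op + non-activation eltwise op" chaining described above can be sketched as follows (illustrative op names and dispatch, not the aocl post-op struct):

```python
import math

def eltwise_chain(x, ops):
    """Apply a sequence of eltwise post-ops in order, e.g. an activation
    (gelu_erf) followed by a non-activation op (clip). Sketch only."""
    for op, *args in ops:
        if op == "gelu_erf":
            x = 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
        elif op == "clip":
            lo, hi = args
            x = min(max(x, lo), hi)
    return x
```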

AMD-Internal: [CPUPL-3244]
Change-Id: If4ce8d7261d32073da8fa4757ed4f2ea0e94249f
2023-04-20 07:24:32 -04:00
eashdash
12c97021a1 Added New Post-Op - Custom Clipping for LPGEMM and SGEMM
1. Custom Clip is an element-wise post-op which is used to
   clip the accumulated GEMM output within a certain range.
2. The Clip Post-Op is used in downscaled and non-downscaled
   LPGEMM APIs and SGEMM.
3. Changes are done at frame and microkernel level to implement
   this post-op.
4. Different versions are implemented (AVX-512, AVX-2, SSE-2)
   to enable custom clipping for various LPGEMM types and SGEMM.
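The scalar form of the clip post-op described in point 1 (the kernels are vectorized; this only shows the operation):

```python
def clip(x, lo, hi):
    # Clamp the accumulated GEMM output into the range [lo, hi].
    return lo if x < lo else hi if x > hi else x
```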

AMD-Internal: [CPUPL-3207]
Change-Id: I71c60be69e5a0dc47ca9336d58181c097b9aa0c6
2023-04-17 04:38:20 -04:00
eashdash
e36f699939 Implemented ERF Based GeLU Activation for LPGEMM and SGEMM
1. Implemented efficient AVX-512, AVX-2 and SSE-2 versions of the
   error function (ERF).
2. Added error function based GeLU activation post-ops for the
   S32, S16 and BF16 (LPGEMM) and SGEMM APIs.
3. Changes for this include frame and micro-kernel level changes in
   addition to adding the macro-based function definitions of the
   ERF function in the math-utils and gelu header files.
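A scalar reference for the ERF-based GeLU (the standard formula; the kernels use vectorized approximations of erf):

```python
import math

def gelu_erf(x):
    # GeLU(x) = 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
```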

AMD-Internal: [CPUPL-3036]
Change-Id: Ie50f6dcabf8896b7a6d30bbc16aa44392cc512be
2023-03-13 06:10:31 -04:00
eashdash
672544bc04 GeLU Activation Function Post-Op for LPGEMM S16, S32 and BF16
1. Added Tanh approximation based GeLU Post-Op for S16, S32 and BF16
2. Changes are done at frame and micro-kernel level to
   implement this post-op.
3. Efficient AVX-512 and AVX-2 vector versions of TANHF and EXPF
   functions are implemented for the GeLU post-operation.
4. TANH and EXPF math functions are efficiently implemented in
   macro-based fashion to exploit register-level fusion of GeLU
   with GEMM operations for improved performance.
5. LPGEMM bench is changed to pass GeLU post-op as input and
   support accuracy checks to verify functional correctness.
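A scalar reference for the tanh-approximation GeLU named above (the standard formula; the kernels use the vectorized TANHF/EXPF implementations):

```python
import math

def gelu_tanh(x):
    # Tanh approximation:
    # GeLU(x) ~= 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))
```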

AMD-Internal: [CPUPL-2978]
Change-Id: I472ac35c00a4ea1ab983cc5f6ff6a123c8035f28
2023-02-02 08:25:04 -05:00
mkadavil
958c9238ac Output downscaling support for low precision GEMM.
- Downscaling is used when GEMM output is accumulated at a higher
precision and needs to be converted to a lower precision afterwards.
This is required in AI workloads where quantization/dequantization
routines are used.
- New GEMM APIs are introduced specifically to support this use case.
Currently downscaling support is added for s32, s16 and bfloat16 GEMM.
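The downscale step can be sketched for the s32-accumulate, s8-output case (illustrative; rounding mode and saturation match common practice, not necessarily the exact kernel behavior):

```python
def downscale_s32_to_s8(acc, scale, zero_point=0):
    """Convert an s32 accumulator to s8: round(acc * scale) + zero_point,
    saturated to the s8 range [-128, 127]. Sketch only."""
    v = int(round(acc * scale)) + zero_point
    return max(-128, min(127, v))
```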

AMD-Internal: [CPUPL-2475]
Change-Id: I81c3ee1ba5414f62427a7a0abb6ecef0c5ff71bf
2022-08-30 10:27:19 -04:00
mkadavil
584069bf74 Parametric ReLU post-ops support for u8s8s32 and u8s8s16 GEMM.
-Parametric ReLU is the generalization of leaky ReLU in which the
leakage coefficient is tunable. The support for the same is added
following the register-level fusion technique.
-Low precision bench enhancement to check accuracy/performance of
low precision gemm with PReLU.
-Bug fixes in low precision gemm kernels.
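The scalar form of Parametric ReLU described above (the kernels apply this with register-level fusion; this only shows the math):

```python
def prelu(x, alpha):
    # Parametric ReLU: x for x > 0, alpha * x otherwise.
    # The leakage coefficient alpha is tunable; alpha fixed small
    # recovers leaky ReLU.
    return x if x > 0 else alpha * x
```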

AMD-Internal: [CPUPL-2442]
Change-Id: I81336405b185a994297d122b2d868b758ae6dad5
2022-08-25 13:33:02 +05:30
mkadavil
828d3cd3d3 Post operations support for low precision gemm.
- Low precision gemm is often used in ML/DNN workloads and is used
in conjunction with pre and post operations. Performing gemm and ops
together at the micro kernel level results in better overall performance
due to cache/register reuse of output matrix. The provision for defining
the post-operations and invoking the micro-kernel with it from the
framework is added as part of this change. This includes adding new data
structures/functions to define the post-ops to be applied and an
extensible template using which new post-ops can easily be integrated.
As for the post-operations, RELU and Bias Add for u8s8s32 are implemented
in this first cut.
- aocl_gemm bench modifications to test/benchmark RELU and Bias Add.
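The post-op chain described above can be sketched per output element as follows (illustrative op names and dispatch, not the aocl post-op data structures):

```python
def apply_post_ops(acc, bias, post_ops=("bias", "relu")):
    """Apply a chain of post-ops (here Bias Add then RELU) to one GEMM
    output value, mimicking the extensible post-op template described
    above. Names are illustrative only."""
    for op in post_ops:
        if op == "bias":
            acc += bias
        elif op == "relu":
            acc = max(0, acc)
    return acc
```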

AMD-Internal: [CPUPL-2316]
Change-Id: Iad5fe9e54965bb52d5381ae459a69800946c7d18
2022-08-05 11:53:05 +05:30