amd/blis

mirror of https://github.com/amd/blis.git synced 2026-05-04 06:21:12 +00:00

Author	SHA1	Message	Date
Deepak Negi	baeebe75c9	Support for standard AutoAWQ storage format. Description: 1. AutoAWQ uses an int32 buffer to store 8 elements of 4 bits each, in the order [0, 2, 4, 6, 1, 3, 5, 7]. 2. Support is added to convert this format back to the original sequential order [0, 1, 2, 3, 4, 5, 6, 7] before reordering in the AWQ API. AMD-Internal: SWLCSG-3169 Change-Id: I5395766060c200ab81d0b8be94356678a169ac13	2024-12-02 04:02:27 -05:00
Meghana Vankadari	fbb72d047f	Added group quantization and zero-point support for WOQ kernels Description: 1. Added group quantization and zero-point (zp) in aocl_gemm_bf16s4f32o<bf16|f32> API. 2. Group quantization is a technique to improve accuracy where the scale factors used to dequantize weights vary at group level instead of at per-channel or per-tensor level. 3. Added zp and scaling in woq packb kernels so that for large M values zp and scaling are performed at the pack-b stage and bf16 kernels are called. 4. Adding zp support and scaling to the default path in WoQ kernels created some performance overhead when the M value is very small. 5. Added string group_size to lpgemm bench to read group size from bench_input.txt and tested for various combinations of matrix dimensions. 6. The scale factors could be of type float or bf16 and the zero-point values are expected to be in int8 format. AMD-Internal: [SWLCSG-3168, SWLCSG-3172] Change-Id: Iff07b54d76edc7408eb2ea0b29ce8b4a04a38f57	2024-12-02 06:46:13 +00:00
Nallani Bhaskar	9735391e1d	Implemented f32tobf16 reorder function Description: aocl_reorder_f32obf16 function is implemented to reorder an input weight matrix of data type float to bfloat16. The reordering is done to match the input requirements of API aocl_gemm_bf16bf16f32o<f32|bf16>. The objective of the API is to convert a model/matrix of type f32 to bf16 so that it can be processed when the machine supports the bf16 FMA instruction _mm512_dpbf16_ps but the model is still in float. Change-Id: Ib7c743d52d01a1ac09e84ac120577ec9e02f90f5	2024-11-04 04:32:01 +00:00
Meghana Vankadari	b04b8f22c9	Introduced un-reorder API for bf16bf16f32of32 Details: - Added a new API called unreorder that converts a matrix from reordered format to its original format (row-major or col-major). - Currently this API only supports bf16 datatype. - Added corresponding bench and input file to test accuracy of the API. - The new API is only supported for 'B' matrix. - Modified input validation checks in reorder API to account for row vs. col storage of matrix and transposes for bf16 datatype. Change-Id: Ifb9c53b7e6da6f607939c164eb016e82514581b7	2024-10-23 07:49:24 -04:00
Meghana Vankadari	d5b4d3aa5e	Fixing control flow in aocl_gemm_bf16s4f32of32|bf16 - Fixed framework of bf16s4f32of32 API to correct pointer updates. - Modified pre_op structure to exclude pre-op-offset. Now offset is passed as a separate parameter to the scale-pack functions. - Fixed work-distribution among threads in MT scenario. - Added Blocksizes and kernel-pointers and verified functionality for the new API. AMD-Internal: [SWLCSG-2943] Change-Id: I58fece240d62c798c880a2b2b7fa64e560cc753d	2024-07-29 05:12:09 -04:00
Nallani Bhaskar	c6dd7c1b4b	Added new API in aocl_gemm to support A bf16 data type and B s4 data type Description: 1. Added a new API aocl_gemm_bf16s4f32of32 to support WoQ (Weight-only-Quantization) in LLMs. 2. The API supports only reordered B matrix of data size signed 4 bits (S4). 3. Subtracting zero point and multiplying with scale on B matrix is performed in packing B. 4. Zero point and scale data should be passed by the user through the pre-ops data structure. 5. The API is still in experimental state and NOT tested. AMD-Internal: SWLCSG-2943 Change-Id: I10b159b64c2e2aaf39da5462685618ba8cc800ee	2024-07-25 11:59:03 +00:00
mkadavil	7114376519	New kernels for int4 B matrix reordering following BF16 kernel schema. -To enable Weight-only-Quantization (WOQ) workflow, new LPGEMM APIs are required where data types are A:bf16, B:int4 and C:f32/bf16. It is expected that the BF16 kernels will be reused within this API and subsequently the B matrix needs to be reordered following the BF16 kernel schema, but with the reordered matrix type still being int4. To address this, new BF16 reorder kernels enabling the same are added. AMD-Internal: [SWLCSG-2943] Change-Id: Ib770ecbf90a3d906deafece94b1a96e0b9412738	2024-07-25 01:10:13 -04:00
Edward Smyth	ed5010d65b	Code cleanup: AMD copyright notice Standardize format of AMD copyright notice. AMD-Internal: [CPUPL-3519] Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0	2023-11-23 08:54:31 -05:00
Nallani Bhaskar	b3391ef5da	Updated ERF threshold and packa changes in bf16 Description: 1. Updated ERF function threshold from 3.91920590400 to 3.553 to match the reference erf float implementation, which reduced errors at the borders, and also clipped the output to 1.0. 2. Updated packa function call with pack function ptr in bf16 api to avoid compilation issues for non avx512bf16 archs. 3. Updated lpgemm bench. [AMD-Internal: SWLCSG-2423 ] Change-Id: Id432c0669521285e6e6a151739d9a72a7340381d	2023-10-29 23:55:46 +05:30
Meghana Vankadari	eb5ab3f762	LPGEMM: Added transB support for bf16bf16f32o<bf16|f32> APIs Details: - Modified aocl_get_reorder_buf_size_ and aocl_reorder_ APIs to allow reordering from column major input matrix. - Added new pack kernels that packs/reorders B matrix from column-major input format. - Updated Early-return check conditions to account for trans parameters. - Updated bench file to test/benchmark transpose support. AMD-Internal: [CPUPL-2268] Change-Id: Ida66d7e3033c52cca0229c6b78d16976fbbecc4c	2023-10-12 23:36:18 +05:30
Meghana Vankadari	4874895a68	LPGEMM: Added transA support for bf16bf16f32o<bf16|f32> APIs Details: - Added new params(order, trans) to aocl_get_reorder_buf_size_ and aocl_reorder_ APIs. - Added new pack kernels that packs A matrix from either row-major or column major input matrix to pack buffer with row-major format. - Updated cntx with pack kernel function pointers for packing A matrix. - Transpose of A matrix is handled by packing A matrix to row-major format during run-time. - Updated Early-return check conditions to account for trans parameters. - Updated bench file to test/benchmark transpose support. AMD-Internal: [SWLCSG-2268, SWLCSG-2442] Change-Id: I43a113dc4bc11e6bb7cc4d768c239a16cb6bbea4	2023-10-11 07:16:08 -04:00
Edward Smyth	0f0277e104	Code cleanup: dos2unix file conversion Source and other files in some directories were a mixture of Unix and DOS file formats. Convert all relevant files to Unix format for consistency. Some Windows-specific files remain in DOS format. AMD-Internal: [CPUPL-2870] Change-Id: Ic9a0fddb2dba6dc8bcf0ad9b3cc93774a46caeeb	2023-04-21 08:41:16 -04:00
mkadavil	8dff49837d	Lpgemm source restructuring to support amdzen config. -Currently lpgemm can only be built using either zen3 or zen4 config. The lpgemm kernel code is re-structured to support amdzen, and thus multi machine deployment. -The micro-kernel calls (gemm and pack) are currently hardcoded in the lpgemm framework. This is removed and a new lpgemm_cntx based dispatch mechanism is designed to support runtime configurability for micro-kernels. AMD-Internal: [CPUPL-2965] Change-Id: I4bbcb4e5db767def1663caf5481f0b4c988149ef	2023-02-21 08:35:38 -05:00
eashdash	672544bc04	GeLU Activation Function Post-Op for LPGEMM S16, S32 and BF16 1. Added Tanh approximation based GeLU Post-Op for S16, S32 and BF16 2. Changes are done at frame and micro-kernel level to implement this post-op. 3. Efficient AVX-512 and AVX-2 vector versions of TANHF and EXPF functions are implemented for the GeLU post-operation. 4. TANH and EXPF math functions are efficiently implemented in macro-based fashion to exploit register level fusion of GeLU with GEMM operations for improved performance 5. LPGEMM bench is changed to pass GeLU post-op as input and support accuracy check to verify functional correctness AMD-Internal: [CPUPL-2978] Change-Id: I472ac35c00a4ea1ab983cc5f6ff6a123c8035f28	2023-02-02 08:25:04 -05:00
mkadavil	3870792e62	Low precision gemm s32 downscale optimization. -The post operations attributes are moved to a new struct lpgemm_post_op_attr, and an object of this struct is passed to the low precision gemm kernels in place of the multiple parameters. -The u8s8s32s8 api (downscale api) performance is low when the k value is small (k < KC). Two scenarios are observed here: a. beta = 0: Currently, for downscale api, a temporary buffer is used to accumulate intermediate s32 output, so that it can be used in later iterations of pc loop (k dim). The usage of this buffer (store) can be avoided if k < KC. Here intermediate accumulation is not required, since after the first iteration of the pc loop, the output can be downscaled and stored. b. beta != 0: In this case the existing values of the original s8 C output matrix need to be converted to s32 and beta scaled. Currently the s8 values are converted to s32 and stored in a temporary buffer in pc loop (5 loop algorithm) in blocks of mxNC. This temporary buffer is passed to the micro kernel and beta scaling is applied on this. However the mxNC block copy is costly and can be avoided if a new condition is introduced for beta scaling in the micro kernel, whereby the original s8 data, instead of the temporary buffer, is loaded to a register, converted to s32, and beta scaled. AMD-Internal: [CPUPL-2884] Change-Id: Id9b4650d500e1b553e48c4f1e4c902b3f553211c	2023-01-10 13:15:22 +05:30
Harihara Sudhan S	11c42ce1d3	C matrix prefetch for BF16 GEMM - Broke down the KR loop inside the compute kernel into two pieces - Added C matrix prefetch between the two decomposed pieces of KR loop AMD-Internal: [CPUPL-2693] Change-Id: Ib73bc2145de4c75bc8153d7d7d20fb057270c94e	2022-11-21 04:57:19 -05:00
mkadavil	f4702debb9	Zen4 compilation flag updates to support low precision gemm. - BFloat16 flags added to zen4 make_defs in order to enable compilation of low precision gemm by using zen4 config. - Avoid -ftree-partial-pre optimization flag with gcc due to non optimal code generation for intrinsics based kernels in low precision gemm. - Enable only Zen3 specific low precision gemm kernels (s16) compilation when aocl_gemm addon is compiled on Zen3 machines. AMD-Internal: [CPUPL-1545] Change-Id: Id3be3410bfbf141bb6fc4b4e3391115a4e0bb79f	2022-09-29 08:19:40 -04:00
mkadavil	bf4d1da1b9	Column major input support for BFloat16 gemm. -The bf16 gemm framework is modified to swap input column major matrices and compute gemm for the transposed matrices (now row major) using the existing row-major kernels. The output is written to C matrix assuming it is transposed. -Framework changes to support leading dimensions that are greater than matrix widths. -Bench changes to test low precision gemm for column major inputs. AMD-Internal: [CPUPL-2570] Change-Id: I22c76f52619fd76d0c0e41531828b437a1935495	2022-09-22 02:50:46 -04:00
eashdash	e1349c0c71	LPGEMM BF16 MT panel based balancing Introduced multi-thread panel based balancing for BF16 to improve the overall MT performance. AMD-Internal: [CPUPL-2502] Change-Id: Iddce9548fa96e5f57bd3d3eb3e8268855ca47f25	2022-09-07 03:20:50 -04:00
eashdash	32a9e735f1	BF16 Output downscaling functionality - BF16 instructions output is accumulated at a higher precision of FP32 which needs to be converted to a lower precision of bf16 post the GEMM operations. This is required in AI workloads where both input and output are in BF16 format. - BF16 downscaling is implemented as post-ops inside the GEMM microkernels. Change-Id: Id1606746e3db4f3ed88cba385a7709c8604002a8	2022-08-30 13:46:09 -04:00
mkadavil	958c9238ac	Output downscaling support for low precision GEMM. - Downscaling is used when GEMM output is accumulated at a higher precision and needs to be converted to a lower precision afterwards. This is required in AI workloads where quantization/dequantization routines are used. - New GEMM APIs are introduced specifically to support this use case. Currently downscaling support is added for s32, s16 and bfloat16 GEMM. AMD-Internal: [CPUPL-2475] Change-Id: I81c3ee1ba5414f62427a7a0abb6ecef0c5ff71bf	2022-08-30 10:27:19 -04:00
eashdash	e674fae758	Post-Ops for bf16bf16f32 Functionality - Post-ops is a set of operations performed element-wise on the output matrix post GEMM operation. The support for the same is added by fusing post-ops with GEMM operations. - Post-ops Bias, Relu and Parametric Relu are added to all the compute kernels of bf16bf16f32of32 - Modified bf16 interface files to add check for bf16 ISA support Change-Id: I2f7069a405037a59ea188a41bd8d10c4aae72fb3	2022-08-30 08:14:14 +00:00
mkadavil	a7d1cc7369	Multi-Threading support for BFloat16 gemm. -OpenMP based multi-threading support added for BFloat16 gemm. Both gemm and reorder api's are parallelized. -Multi-threading support for u8s8s16 reorder api. -Typecast issues fixed for bfloat16 gemm kernels. AMD-Internal: [CPUPL-2459] Change-Id: I6502d71ab32aa73bb159245976ea3d3a8e0ed109	2022-08-30 02:54:19 -04:00
eashdash	4e3e00fb7e	Added low precision GEMM - bf16bf16f32of32 Feature Addition: Added a new variant of low precision GEMM to addon - BFloat16. The kernel takes bf16 type inputs and performs BF16 GEMM operations. The intermediate accumulation and output are in float. 1. Compute kernels will perform computations only if B matrix is reordered in accordance with the usage of AVX-512 BF16 instruction - dpbf16_ps. 2. Kernel for packing B matrix is provided. Change-Id: If5d08213068869eff060c9998596d2d2703a6793	2022-08-24 03:27:00 -04:00

24 Commits
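
The nibble reordering described in commit baeebe75c9 can be illustrated with a minimal sketch. It is not the code from the commit. It assumes that the 4-bit field at bit offset 4*i of each int32 word holds logical element order[i], with order = [0, 2, 4, 6, 1, 3, 5, 7], and it treats every element as a raw 4-bit pattern with no sign handling. The names awq_order, awq_pack and awq_to_sequential are hypothetical and do not appear in the BLIS sources.

```c
#include <stdint.h>

/* Assumed layout: nibble i of the packed word holds logical element
   awq_order[i]. This mapping is an assumption made for illustration. */
static const int awq_order[8] = {0, 2, 4, 6, 1, 3, 5, 7};

/* Pack 8 four-bit values, given in logical order, into the interleaved
   layout. Only the low 4 bits of each input byte are used. */
uint32_t awq_pack(const uint8_t vals[8])
{
    uint32_t word = 0;
    for (int i = 0; i < 8; ++i)
        word |= (uint32_t)(vals[awq_order[i]] & 0xFu) << (4 * i);
    return word;
}

/* Rewrite an interleaved word so that nibble j holds logical element j,
   i.e. the sequential order [0, 1, 2, 3, 4, 5, 6, 7]. */
uint32_t awq_to_sequential(uint32_t packed)
{
    uint32_t seq = 0;
    for (int i = 0; i < 8; ++i) {
        uint32_t nibble = (packed >> (4 * i)) & 0xFu;
        seq |= nibble << (4 * awq_order[i]);
    }
    return seq;
}
```

Under these assumptions, packing the values 0 through 7 gives the word 0x75316420, and awq_to_sequential turns it back into 0x76543210, after which a sequential-order reorder routine could consume it.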