amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-04 14:31:12 +00:00

Author	SHA1	Message	Date
Nallani Bhaskar	9735391e1d	Implemented f32tobf16 reorder function Description: aocl_reorder_f32obf16 function is implemented to reorder input weight matrix of data type float to bfloat16. The reordering is done to match the input requirements of API aocl_gemm_bf16bf16f32o<f32\|bf16>. The objective of the API is to convert a model/matrix of type f32 to bf16 and process when machine supports bf16 FMA instruction _mm512_dpbf16_ps but the model is still in float Change-Id: Ib7c743d52d01a1ac09e84ac120577ec9e02f90f5	2024-11-04 04:32:01 +00:00
Mithun Mohan	097cda9f9e	Adding support for AOCL_ENABLE_INSTRUCTIONS for f32 LPGEMM API. -Currently lpgemm sets the context (block sizes and micro-kernels) based on the ISA of the machine it is being executed on. However this approach does not give the flexibility to select a different context at runtime. In order to enable runtime selection of context, the context initialization is modified to read the AOCL_ENABLE_INSTRUCTIONS env variable and set the context based on the same. As part of this commit, only f32 context selection is enabled. -Bug fixes in scale ops in f32 micro-kernels and GEMV path selection. -Added vectorized f32 packing kernels for NR=16(AVX2) and NR=64(AVX512). This is only for B matrix and helps remove dependency of f32 lpgemm api on the BLIS packing framework. AMD Internal: [CPUPL-5959] Change-Id: I4b459aaf33c54423952f89905ba43cf119ce20f6	2024-10-30 08:52:22 +00:00
Meghana Vankadari	2e1cc2f14a	Added bf16s4f32 kernels to handle m=4 cases Details: - In WOQ, if m = 4, special case kernels are added where s4->bf16 conversion happens inside the compute kernel and packing is avoided. For all other cases, B matrix is dequantized and packed at KC loop level and native bf16 kernels are re-used at compute level. - Fixes in bench to avoid accuracy failures when datatype of output is bf16. Change-Id: Ie8db42da536891693d5e82a5336b66514a50ccb2	2024-09-04 07:36:57 -04:00
mkadavil	f040ba617f	Element wise operations API for bfloat16 input matrix in LPGEMM. -This API supports applying element wise operations (eg: post-ops) on a bfloat16 input matrix to get an output matrix of the same(bfloat16) or upscaled data type (float). -Benchmarking/testing framework for the same is added. AMD Internal: SWLCSG-2947 Change-Id: I43f1c269be1a1997d4912d8a3a97be5e5f3442d2	2024-08-05 07:17:08 -04:00
Meghana Vankadari	d5b4d3aa5e	Fixing control flow in aocl_gemm_bf16s4f32of32\|bf16 - Fixed framework of bf16s4f32of32 API to correct pointer updations. - Modified pre_op structure to exclude pre-op-offset. Now offset is passed as a separate parameter to the scale-pack functions. - Fixed work-distribution among threads in MT scenario. - Added Blocksizes and kernel-pointers and verified functionality for the new API. AMD-Internal: [SWLCSG-2943] Change-Id: I58fece240d62c798c880a2b2b7fa64e560cc753d	2024-07-29 05:12:09 -04:00
mkadavil	a5c4a8c7e0	Int4 B matrix reordering support in LPGEMM. Support for reordering B matrix of datatype int4 as per the pack schema requirements of u8s8s32 kernel. Vectorized int4_t -> int8_t conversion implemented via leveraging the vpmultishiftqb instruction. The reordered B matrix will then be used in the u8s8s32o<s32\|s8> api. AMD-Internal: [SWLCSG-2390] Change-Id: I3a8f8aba30cac0c4828a31f1d27fa1b45ea07bba	2024-06-24 07:55:34 -04:00
Meghana Vankadari	c9254bd9e9	Implemented LPGEMV(n=1) for AVX2-INT8 variants - When n=1, reorder of B matrix is avoided to efficiently process data. A dot-product based kernel is implemented to perform gemv when n=1. AMD-Internal: [SWLCSG-2354] Change-Id: If5f74651ab11232d0b87d34bd05f65aacaea94f1	2024-06-18 12:09:18 +05:30
mkadavil	cd032225ca	BF16 bias support for bf16bf16f32ob16. -As it stands the bf16bf16f32ob16 API expects bias array to be of type float. However actual use case requires the usage of bias array of bf16 type. The bf16 micro-kernels are updated to work with bf16 bias array by upscaling it to float type and then using it in the post-ops workflow. -Corrected register usage in bf16 JIT generator for bf16bf16f32ob16 API when k > KC. AMD-Internal: [SWLCSG-2604] Change-Id: I404e566ff59d1f3730b569eb8bef865cb7a3b4a1	2024-05-23 04:48:20 +05:30
eashdash	a72fff2be9	Added NEW LPGEMM TYPE- s8s8s16os16 and s8s8s16os8 1. New LPGEMM type - s8s8s16os16 and s8s8s16os8 are added. 2. New interface, frame and kernel files are added. 3. Frame and kernel level files added and modified for s8s8s16 4. s8s8s16 type involves design changes of 2 operations - Pack B and Mat Mul 5. Pack B kernel routines to pack B matrix for s16 FMA and compute the sum of every column of B matrix to implement the s8s8s16 operation using the s16 FMA instructions. 5. Mat Mul Kernel files to compute the GEMM output using s16 FMA. Here the A matrix elements are converted from int8 to uint8 (s16 FMA works with A matrix type uint8 only) by adding extra 128 to every A matrix element 6. Post GEMM computation, additional operations are performed on the accumulated outputs to get the correct results. Final C = C - ( (sum of column of B matrix) * 128 ) This is done to compensate for the addition of extra 128 to every A matrix elements 7. With this change, two new LPGEMM APIs are introduced in LPGEMM - s8s8s16os16 and s8s8s16os8. 8. All previously added post-ops are supported on s8s8os16/os8 also. AMD-Internal: [CPUPL-3234] Change-Id: I3cc23e3dcf27f215151dda7c8db29b3a7505f05c	2023-04-21 05:30:38 -04:00
mkadavil	e23765010d	aocl_gelu_<tanh\|erf>_f32 api's for gelu computation as part of lpgemm. -Currently in aocl_gemm, gelu (both tanh and erf based) computation is only supported as a post-op as part of low precision gemm api call (done at micro-kernel level). However gelu computation alone without gemm is required in certain cases for users of aocl_gemm. -In order to support this, two new api's - aocl_gelu_tanh_f32 and aocl_gelu_erf_f32 are introduced as part of aocl_gemm. These api's computes element-wise gelu_tanh and gelu_erf respectively of a matrix/ vector of floats. Both the api's invokes ISA specific vectorized micro- kernels (vectorized only when incx=1), and a cntx based mechanism (similar to lpgemm_cntx) is used to dispatch to the appropriate kernel. AMD-Internal: [CPUPL-3218] Change-Id: Ifebbaf5566d7462288a9a67f479104268b0cc704	2023-04-17 05:15:56 -04:00
eashdash	bd8cd763ff	Added NEW LPGEMM TYPE- S8S8S32/S8 1. New LPGEMM type - S8S8S32/S8 is added. 2. New interface, frame and kernel files are added. 3. Frame and kernel files added/modified for S8S8S32/S8 have 2 operations - Pack B and Mat Mul 4. Pack B kernel routines to pack B matrix for VNNI and compute the sum of every column of B matrix to implement the S8S8S32 operation using the VNNI instructions. 5. Mat Mul Kernel files to compute the GEMM output using the VNNI. Here the A matrix elements are converted from int8 to uint8 (VNNI works with A matrix type uint8 only). 6. Post GEMM computation, additional operations are performed on the accumulated outputs to get the correct results. 7. With this change, two new LPGEMM APIs are introduced in LPGEMM - s8s8s32os32 and s8s8s32os8. 8. All previously added post-ops are supported on S8S8S32/S8 also. AMD-Internal: [CPUPL-3154] Change-Id: Ib18f82bde557ea4a815a63adc7870c4234bfb9d3	2023-03-31 05:44:54 -04:00
mkadavil	8dff49837d	Lpgemm source restructuring to support amdzen config. -Currently lpgemm can only be built using either zen3 or zen4 config. The lpgemm kernel code is re-structured to support amdzen, and thus multi machine deployment. -The micro-kernel calls (gemm and pack) are currently hardcoded in the lpgemm framework. This is removed and a new lpgemm_cntx based dispatch mechanism is designed to support runtime configurability for micro-kernels. AMD-Internal: [CPUPL-2965] Change-Id: I4bbcb4e5db767def1663caf5481f0b4c988149ef	2023-02-21 08:35:38 -05:00

12 Commits
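
The s8s8s16 and s8s8s32 commits above describe converting A from int8 to uint8 by adding 128, then correcting the result with Final C = C - ( (sum of column of B matrix) * 128 ). The following is a minimal scalar sketch of that arithmetic only, not the BLIS kernels. The function names, the fixed 2x4x3 sizes, and the int32 accumulators are assumptions made for illustration; the actual micro-kernels are vectorized and, in the s16 case, accumulate in a narrower type.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sizes only; these are not BLIS block sizes. */
enum { M = 2, K = 4, N = 3 };

/* Reference: plain signed int8 x int8 product accumulated in int32. */
static void gemm_s8s8_ref(const int8_t a[M][K], const int8_t b[K][N],
                          int32_t c[M][N])
{
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            int32_t acc = 0;
            for (int k = 0; k < K; ++k)
                acc += (int32_t)a[i][k] * (int32_t)b[k][j];
            c[i][j] = acc;
        }
}

/* Offset path: shift A into the uint8 range by adding 128, multiply,
 * then subtract 128 * (column sum of B) to undo the shift. */
static void gemm_s8s8_via_u8(const int8_t a[M][K], const int8_t b[K][N],
                             int32_t c[M][N])
{
    int32_t col_sum[N] = {0};

    /* In the commits this sum is computed while packing B. */
    for (int k = 0; k < K; ++k)
        for (int j = 0; j < N; ++j)
            col_sum[j] += b[k][j];

    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            int32_t acc = 0;
            for (int k = 0; k < K; ++k) {
                uint8_t a_u8 = (uint8_t)(a[i][k] + 128); /* 0..255 */
                acc += (int32_t)a_u8 * (int32_t)b[k][j];
            }
            c[i][j] = acc - 128 * col_sum[j];
        }
}

/* Returns 1 when both paths agree on a small hand-picked input. */
int offset_trick_matches_reference(void)
{
    static const int8_t a[M][K] = { { -128, 127, 0, -1 }, { 5, -7, 64, -64 } };
    static const int8_t b[K][N] = { { 1, -2, 3 }, { -128, 127, 0 },
                                    { 9, -9, 1 }, { -1, 1, -1 } };
    int32_t ref[M][N], got[M][N];

    gemm_s8s8_ref(a, b, ref);
    gemm_s8s8_via_u8(a, b, got);

    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            if (ref[i][j] != got[i][j])
                return 0;
    return 1;
}
```

The correction works because sum_k (a + 128) * b = sum_k a * b + 128 * sum_k b, so subtracting 128 times the column sum of B recovers the signed product exactly as long as the accumulator does not overflow.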