amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-04 06:21:12 +00:00

Author	SHA1	Message	Date
Deepak Negi	baeebe75c9	Support for standard AutoAWQ storage format. Description: 1. AutoAWQ uses an int32 buffer to store 8 elements of 4 bits each in the order [0, 2, 4, 6, 1, 3, 5, 7]. 2. Support is added to convert the above format back to the original sequential order [0, 1, 2, 3, 4, 5, 6, 7] before reordering in the AWQ API. AMD-Internal: SWLCSG-3169 Change-Id: I5395766060c200ab81d0b8be94356678a169ac13	2024-12-02 04:02:27 -05:00
Meghana Vankadari	fbb72d047f	Added group quantization and zero-point support for WOQ kernels Description: 1. Added group quantization and zero-point (zp) in aocl_gemm_bf16s4f32o<bf16\|f32> API. 2. Group quantization is a technique to improve accuracy where the scale factors used to dequantize weights vary at group level instead of per-channel or per-tensor level. 3. Added zp and scaling in woq packb kernels so that for large M values zp and scaling are performed at pack-b stage and bf16 kernels are called. 4. Adding zp support and scaling to default path in WoQ kernels created some performance overhead when M value is very small. 5. Added string group_size to lpgemm bench to read group size from bench_input.txt and tested for various combinations of matrix dimensions. 6. The scale factors could be of type float or bf16 and the zero-point values are expected to be in int8 format. AMD-Internal: [SWLCSG-3168, SWLCSG-3172] Change-Id: Iff07b54d76edc7408eb2ea0b29ce8b4a04a38f57	2024-12-02 06:46:13 +00:00
Nallani Bhaskar	9735391e1d	Implemented f32tobf16 reorder function Description: aocl_reorder_f32obf16 function is implemented to reorder input weight matrix of data type float to bfloat16. The reordering is done to match the input requirements of API aocl_gemm_bf16bf16f32o<f32\|bf16>. The objective of the API is to convert a model/matrix of type f32 to bf16 and process when machine supports bf16 FMA instruction _mm512_dpbf16_ps but the model is still in float Change-Id: Ib7c743d52d01a1ac09e84ac120577ec9e02f90f5	2024-11-04 04:32:01 +00:00
Mithun Mohan	097cda9f9e	Adding support for AOCL_ENABLE_INSTRUCTIONS for f32 LPGEMM API. -Currently lpgemm sets the context (block sizes and micro-kernels) based on the ISA of the machine it is being executed on. However this approach does not give the flexibility to select a different context at runtime. In order to enable runtime selection of context, the context initialization is modified to read the AOCL_ENABLE_INSTRUCTIONS env variable and set the context based on the same. As part of this commit, only f32 context selection is enabled. -Bug fixes in scale ops in f32 micro-kernels and GEMV path selection. -Added vectorized f32 packing kernels for NR=16(AVX2) and NR=64(AVX512). This is only for B matrix and helps remove dependency of f32 lpgemm api on the BLIS packing framework. AMD Internal: [CPUPL-5959] Change-Id: I4b459aaf33c54423952f89905ba43cf119ce20f6	2024-10-30 08:52:22 +00:00
Meghana Vankadari	b04b8f22c9	Introduced un-reorder API for bf16bf16f32of32 Details: - Added a new API called unreorder that converts a matrix from reordered format to its original format (row-major or col-major). - Currently this API only supports bf16 datatype. - Added corresponding bench and input file to test accuracy of the API. - The new API is only supported for 'B' matrix. - Modified input validation checks in reorder API to account for row vs. col storage of matrix and transposes for bf16 datatype. Change-Id: Ifb9c53b7e6da6f607939c164eb016e82514581b7	2024-10-23 07:49:24 -04:00
varshav2	605517964b	Add Transpose Kernel for A matrix in F32F32f32Of32 - Implemented the AVX512 packA kernel for col major inputs in F32 API - Removed the work arounds for n = 1, mtag_a = PACK case, where the execution was being directed to GEMM instead of GEMV. Change-Id: I6fb700d96069213a762e8a83a209c5388a91050f	2024-09-19 06:37:11 -04:00
Meghana Vankadari	5120f98e12	Developed all WoQ kernels for bf16s4f32o<f32\|bf16> Description: 1. Written 6x64 main and other fringe kernels for WoQ where scaling s4 weights into bf16 performed in the kernel itself to reduce bandwidth. 2. These kernels are performing better compared to bf16 weights when m is small and n is large. 3. Established a threshold to do quantization support at packing of B (KCXNC) level or WoQ kernel level. Change-Id: I4f8265b8b58c276ff2590cc948d1f920aa0bb289	2024-09-10 12:00:10 +00:00
Meghana Vankadari	2e1cc2f14a	Added bf16s4f32 kernels to handle m=4 cases Details: - In WOQ, if m = 4, special case kernels are added where s4->bf16 conversion happens inside the compute kernel and packing is avoided. For all other cases, B matrix is dequantized and packed at KC loop level and native bf16 kernels are re-used at compute level. - Fixes in bench to avoid accuracy failures when datatype of output is bf16. Change-Id: Ie8db42da536891693d5e82a5336b66514a50ccb2	2024-09-04 07:36:57 -04:00
Deepak Negi	6dcf500703	Element wise operations API for float(f32) input matrix in LPGEMM. This API supports applying element wise operations (eg: post-ops) on a float(f32) input matrix to get an output matrix of the same type (float(f32)). AMD Internal: [SWLCSG-2947] Change-Id: I387a544f0d33d2231f5f6a92e212f17b1103dd24	2024-08-27 03:28:52 -04:00
Meghana Vankadari	5514c7a75f	Added LPGEMV(n=1) kernels for s8s8s32os32\|s8 and s8s8s16os16\|s8 APIs - When n=1, reorder of B matrix is avoided to efficiently process data. A dot-product based kernel is implemented to perform gemv when n==1. AMD-Internal: [SWLCSG-2354] Change-Id: I6b73dfddd9a15e7b914d031646a1d913a7ab4761	2024-08-09 06:17:52 -04:00
mkadavil	f040ba617f	Element wise operations API for bfloat16 input matrix in LPGEMM. -This API supports applying element wise operations (eg: post-ops) on a bfloat16 input matrix to get an output matrix of the same(bfloat16) or upscaled data type (float). -Benchmarking/testing framework for the same is added. AMD Internal: SWLCSG-2947 Change-Id: I43f1c269be1a1997d4912d8a3a97be5e5f3442d2	2024-08-05 07:17:08 -04:00
Meghana Vankadari	d5b4d3aa5e	Fixing control flow in aocl_gemm_bf16s4f32of32\|bf16 - Fixed framework of bf16s4f32of32 API to correct pointer updates. - Modified pre_op structure to exclude pre-op-offset. Now offset is passed as a separate parameter to the scale-pack functions. - Fixed work-distribution among threads in MT scenario. - Added Blocksizes and kernel-pointers and verified functionality for the new API. AMD-Internal: [SWLCSG-2943] Change-Id: I58fece240d62c798c880a2b2b7fa64e560cc753d	2024-07-29 05:12:09 -04:00
Nallani Bhaskar	c6dd7c1b4b	Added new API in aocl_gemm to support A bf16 data type and B s4 data type Description: 1. Added a new API aocl_gemm_bf16s4f32of32 to support WoQ (Weight-only-Quantization) in LLMs. 2. The API supports only reordered B matrix of data size signed 4 bits (S4). 3. Subtracting the zero point and multiplying by the scale on the B matrix is performed while packing B. 4. Zero point and scale data should be passed by the user through the pre-ops data structure. 5. The API is still in experimental state and NOT tested. AMD-Internal: SWLCSG-2943 Change-Id: I10b159b64c2e2aaf39da5462685618ba8cc800ee	2024-07-25 11:59:03 +00:00
mkadavil	7114376519	New kernels for int4 B matrix reordering following BF16 kernel schema. -To enable Weight-only-Quantization (WOQ) workflow, new LPGEMM APIs are required where data types are A:bf16, B:int4 and C:f32/bf16. It is expected that the BF16 kernels will be reused within this API and subsequently the B matrix needs to be reordered following the BF16 kernel schema, but with the reordered matrix type still being int4. To address this, new BF16 reorder kernels enabling the same are added. AMD-Internal: [SWLCSG-2943] Change-Id: Ib770ecbf90a3d906deafece94b1a96e0b9412738	2024-07-25 01:10:13 -04:00
mkadavil	a5c4a8c7e0	Int4 B matrix reordering support in LPGEMM. Support for reordering B matrix of datatype int4 as per the pack schema requirements of u8s8s32 kernel. Vectorized int4_t -> int8_t conversion implemented via leveraging the vpmultishiftqb instruction. The reordered B matrix will then be used in the u8s8s32o<s32\|s8> api. AMD-Internal: [SWLCSG-2390] Change-Id: I3a8f8aba30cac0c4828a31f1d27fa1b45ea07bba	2024-06-24 07:55:34 -04:00
Meghana Vankadari	c9254bd9e9	Implemented LPGEMV(n=1) for AVX2-INT8 variants - When n=1, reorder of B matrix is avoided to efficiently process data. A dot-product based kernel is implemented to perform gemv when n=1. AMD-Internal: [SWLCSG-2354] Change-Id: If5f74651ab11232d0b87d34bd05f65aacaea94f1	2024-06-18 12:09:18 +05:30
Nallani Bhaskar	29db6eb42b	Added transB in all AVX512 based int8 APIs Description: -- Added support for transB in u8s8s32o<s32\|s8> and s8s8s32o<s32\|s8> APIs -- Updated the bench_lpgemm by adding options to support transpose of B matrix -- Updated data_gen_script.py in lpgemm bench according to latest input format. AMD-Internal: [SWLCSG-2582] Change-Id: I4a05cc390ae11440d6ff86da281dbafbeb907048	2024-05-23 03:46:13 +05:30
Meghana Vankadari	3a8b9270e7	Implemented lpgemv for AVX512-INT8 variants - Implemented optimized lpgemv for both m == 1 and n == 1 cases. - Fixed few bugs in LPGEMV for bf16 and f32 datatypes. - Fixed few bugs in JIT-based implementation of LPGEMM for BF16 datatype. AMD-Internal: [SWLCSG-2354] Change-Id: I245fd97c8f160b148656f782d241f86097a0cf38	2024-05-14 01:55:49 +05:30
Meghana Vankadari	1072770c63	Implemented LPGEMV for bf16 datatype 1. The 5 LOOP LPGEMM path is inefficient when A or B is a vector (i.e., m == 1 or n == 1). 2. An efficient implementation is developed considering the B matrix reorder in case of m = 1 and post-ops fusion. 3. When m = 1 the algorithm divides the GEMM workload in the n dimension at a granularity of NR. Each thread works on A:1xk B:kx(>=NR) and produces C=1x(>=NR). K is unrolled by 4 along with a remainder loop. 4. When n = 1 the algorithm divides the GEMM workload in the m dimension at a granularity of MR. Each thread works on A:(>=MR)xk B:kx1 and produces C = (>=MR)x1. When n = 1, reordering of B is avoided so the data can be processed efficiently in the n = 1 kernel. AMD-Internal: [SWLCSG-2355] Change-Id: I7497dad4c293587cbc171a5998b9f2817a4db880	2024-05-06 23:55:15 +05:30
Meghana Vankadari	da8fd8c301	Implemented JIT-based microkernel for bf16 datatype Details: - Added new folder named JIT/ under addon/aocl_gemm/. This folder will contain all the JIT related code. - Modified lpgemm_cntx_init code to generate main and fringe kernels for 6x64 bf16 microkernel and store function pointers to all the generated kernels in a global function pointer array. This happens only when gcc version is < 11.2 - When gcc version < 11.2, microkernel uses JIT-generated kernels. otherwise, microkernel uses the intrinsics based implementation. AMD-Internal: [SWLCSG-2622] Change-Id: I16256c797b2546a8cd2049680001947346260461	2024-03-13 05:55:18 +05:30
Bhaskar Nallani	2ce47e6f5e	Implemented optimal AVX512-variant of f32 LPGEMV 1. The 5 LOOP LPGEMM path is inefficient when A or B is a vector (i.e., m == 1 or n == 1). 2. An efficient implementation of lpgemv_rowvar_f32 is developed considering the B matrix reorder in case of m = 1 and post-ops fusion. 3. When m = 1 the algorithm divides the GEMM workload in the n dimension at a granularity of NR. Each thread works on A:1xk B:kx(>=NR) and produces C=1x(>=NR). K is unrolled by 4 along with a remainder loop. 4. When n = 1 the algorithm divides the GEMM workload in the m dimension at a granularity of MR. Each thread works on A:(>=MR)xk B:kx1 and produces C = (>=MR)x1. When n = 1, reordering of B is avoided so the data can be processed efficiently in the n = 1 kernel. 5. Fixed a few warnings raised when loading 2 f32 bias elements with _mm_load_sd through a float pointer; the pointer is now typecast to (const double *). AMD-Internal: [SWLCSG-2391, SWLCSG-2353] Change-Id: If1d0b8d59e0278f5f16b499de1d629e63da5b599	2024-03-04 23:53:23 +05:30
eashdash	ef134dc49f	Added Trans A feature for all INT8 LPGEMM APIs 1. Added Trans A feature to handle column major inputs for A matrix. 2. Trans A is enabled by on-the-go pack of A matrix. 3. The on-the-go pack of A converts a column storage MCxKC block of A into row storage MCxKC block as LPGEMM kernels are row major kernels. 4. New pack routines are added for conversion of A matrix from column major storage to row major storage. 5. LPGEMM Cntx is updated with pack kernel function pointers. 6. Packing of A matrix: - Converts column major input A to row major in blocks of MCxKC with newly added pack A functions when cs_a > 1. 7. Pack routines are added for AVX512 and AVX2 INT8 LPGEMM APIs. 8. Trans A feature is now supported in: 1. u8s8s32os32/os8 2. u8s8s16os16/os8/ou8 3. s8s8s32os32/os8 4. s8s8s16os16/os8 AMD-Internal: SWLCSG-2582 Change-Id: I7ce331545525a9a09f3853280615b55fcf2edabf	2024-01-30 03:40:56 -05:00
Edward Smyth	ed5010d65b	Code cleanup: AMD copyright notice Standardize format of AMD copyright notice. AMD-Internal: [CPUPL-3519] Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0	2023-11-23 08:54:31 -05:00
Nallani Bhaskar	b3391ef5da	Updated ERF threshold and packa changes in bf16 Description: 1. Updated ERF function threshold from 3.91920590400 to 3.553 to match the reference erf float implementation, which reduced errors at the borders, and also clipped the output to 1.0 2. Updated packa function call with pack function ptr in bf16 api to avoid compilation issues for non avx512bf16 archs 3. Updated lpgemm bench [AMD-Internal: SWLCSG-2423 ] Change-Id: Id432c0669521285e6e6a151739d9a72a7340381d	2023-10-29 23:55:46 +05:30
Meghana Vankadari	eb5ab3f762	LPGEMM: Added transB support for bf16bf16f32o<bf16\|f32> APIs Details: - Modified aocl_get_reorder_buf_size_ and aocl_reorder_ APIs to allow reordering from column major input matrix. - Added new pack kernels that packs/reorders B matrix from column-major input format. - Updated Early-return check conditions to account for trans parameters. - Updated bench file to test/benchmark transpose support. AMD-Internal: [CPUPL-2268] Change-Id: Ida66d7e3033c52cca0229c6b78d16976fbbecc4c	2023-10-12 23:36:18 +05:30
Meghana Vankadari	4874895a68	LPGEMM: Added transA support for bf16bf16f32o<bf16\|f32> APIs Details: - Added new params(order, trans) to aocl_get_reorder_buf_size_ and aocl_reorder_ APIs. - Added new pack kernels that packs A matrix from either row-major or column major input matrix to pack buffer with row-major format. - Updated cntx with pack kernel function pointers for packing A matrix. - Transpose of A matrix is handled by packing A matrix to row-major format during run-time. - Updated Early-return check conditions to account for trans parameters. - Updated bench file to test/benchmark transpose support. AMD-Internal: [SWLCSG-2268, SWLCSG-2442] Change-Id: I43a113dc4bc11e6bb7cc4d768c239a16cb6bbea4	2023-10-11 07:16:08 -04:00
Edward Smyth	0f0277e104	Code cleanup: dos2unix file conversion Source and other files in some directories were a mixture of Unix and DOS file formats. Convert all relevant files to Unix format for consistency. Some Windows-specific files remain in DOS format. AMD-Internal: [CPUPL-2870] Change-Id: Ic9a0fddb2dba6dc8bcf0ad9b3cc93774a46caeeb	2023-04-21 08:41:16 -04:00
eashdash	a72fff2be9	Added NEW LPGEMM TYPE- s8s8s16os16 and s8s8s16os8 1. New LPGEMM type - s8s8s16os16 and s8s8s16os8 are added. 2. New interface, frame and kernel files are added. 3. Frame and kernel level files added and modified for s8s8s16 4. s8s8s16 type involves design changes of 2 operations - Pack B and Mat Mul 5. Pack B kernel routines to pack B matrix for s16 FMA and compute the sum of every column of B matrix to implement the s8s8s16 operation using the s16 FMA instructions. 6. Mat Mul Kernel files to compute the GEMM output using s16 FMA. Here the A matrix elements are converted from int8 to uint8 (s16 FMA works with A matrix type uint8 only) by adding an extra 128 to every A matrix element 7. Post GEMM computation, additional operations are performed on the accumulated outputs to get the correct results. Final C = C - ( (sum of column of B matrix) * 128 ) This is done to compensate for the addition of the extra 128 to every A matrix element 8. With this change, two new LPGEMM APIs are introduced in LPGEMM - s8s8s16os16 and s8s8s16os8. 9. All previously added post-ops are supported on s8s8s16os16/os8 also. AMD-Internal: [CPUPL-3234] Change-Id: I3cc23e3dcf27f215151dda7c8db29b3a7505f05c	2023-04-21 05:30:38 -04:00
mkadavil	3572baa9d3	aocl_softmax_f32 api's for softmax computation as part of lpgemm. -Softmax is often used as the last activation function in a neural network - softmax(xi) = exp(xi)/(exp(x0) + exp(x1) + ... + exp(xn)). This step happens after the final low precision gemm computation, and it helps to have the softmax functionality that can be invoked as part of the lpgemm workflow. In order to support this, a new api, aocl_softmax_f32 is introduced as part of aocl_gemm. This api computes element-wise softmax of a matrix/vector of floats. This api invokes ISA specific vectorized micro-kernels (vectorized only when incx=1), and a cntx based mechanism (similar to lpgemm_cntx) is used to dispatch to the appropriate kernel. AMD-Internal: [CPUPL-3247] Change-Id: If15880360947435985fa87b6436e475571e4684a	2023-04-21 05:26:08 -04:00
mkadavil	e23765010d	aocl_gelu_<tanh\|erf>_f32 api's for gelu computation as part of lpgemm. -Currently in aocl_gemm, gelu (both tanh and erf based) computation is only supported as a post-op as part of low precision gemm api call (done at micro-kernel level). However gelu computation alone without gemm is required in certain cases for users of aocl_gemm. -In order to support this, two new api's - aocl_gelu_tanh_f32 and aocl_gelu_erf_f32 are introduced as part of aocl_gemm. These api's computes element-wise gelu_tanh and gelu_erf respectively of a matrix/ vector of floats. Both the api's invokes ISA specific vectorized micro- kernels (vectorized only when incx=1), and a cntx based mechanism (similar to lpgemm_cntx) is used to dispatch to the appropriate kernel. AMD-Internal: [CPUPL-3218] Change-Id: Ifebbaf5566d7462288a9a67f479104268b0cc704	2023-04-17 05:15:56 -04:00
mkadavil	27a9e2a0ff	u8s8s32 fringe kernel optimizations. -The n fringe micro kernels uses only a few zmm registers for computing the output (eg: 6x16 uses 6 zmm registers for output as opposed to 24 used in 6x64). This results in lot of wasted registers that if utilized can help increase the MR dimension and thus improve the reuse of registers loaded with B. Based on this concept, the existing n fringe kernels are modified (6x16 -> 12x16, 6x32 -> 9x32). It is to be noted that the maximum number of registers are not used, since it results in cache inefficient code due to the increase in MR and thus more broadcasts required from unpacked A matrix. -Compiler flag updates for AOCC build to generate loops with 64 byte alignment. This has been observed to improve performance slightly when k dimension is small. AMD-Internal: [CPUPL-3173] Change-Id: I199ce75ef71d994ffe0067dac1ed804dce1742ca	2023-04-03 05:35:18 -05:00
eashdash	bd8cd763ff	Added NEW LPGEMM TYPE- S8S8S32/S8 1. New LPGEMM type - S8S8S32/S8 is added. 2. New interface, frame and kernel files are added. 3. Frame and kernel files added/modified for S8S8S32/S8 have 2 operations - Pack B and Mat Mul 4. Pack B kernel routines to pack B matrix for VNNI and compute the sum of every column of B matrix to implement the S8S8S32 operation using the VNNI instructions. 5. Mat Mul Kernel files to compute the GEMM output using the VNNI. Here the A matrix elements are converted from int8 to uint8 (VNNI works with A matrix type uint8 only). 6. Post GEMM computation, additional operations are performed on the accumulated outputs to get the correct results. 7. With this change, two new LPGEMM APIs are introduced in LPGEMM - s8s8s32os32 and s8s8s32os8. 8. All previously added post-ops are supported on S8S8S32/S8 also. AMD-Internal: [CPUPL-3154] Change-Id: Ib18f82bde557ea4a815a63adc7870c4234bfb9d3	2023-03-31 05:44:54 -04:00
mkadavil	8dff49837d	Lpgemm source restructuring to support amdzen config. -Currently lpgemm can only be built using either zen3 or zen4 config. The lpgemm kernel code is re-structured to support amdzen, and thus multi machine deployment. -The micro-kernel calls (gemm and pack) are currently hardcoded in the lpgemm framework. This is removed and a new lpgemm_cntx based dispatch mechanism is designed to support runtime configurability for micro-kernels. AMD-Internal: [CPUPL-2965] Change-Id: I4bbcb4e5db767def1663caf5481f0b4c988149ef	2023-02-21 08:35:38 -05:00
bhaskarn	91a9968a5e	Developed intrinsic based f32 kernels in lpgemm Description: 1. Developed row variant intrinsic Kernels for float32/sgemm which are called from lpgemm api aocl_gemm_f32f32f32of32() 2. 6x64m, 6x48m, 6x32m kernels and respective fringe kernels are developed using avx512. 3. 6x16m main kernel and respective n fringe and mn fringe are developed based on avx2 and avx 4. Modularizing, K loop unroll, perf tuning, post-ops and dynamic dispatch are planned next 5. bench_lpgemm needs to be updated to test the case where leading dims are greater than dims; this is planned next. Change-Id: I54c78fef639ea109d6ef2c2b05c07ce396c81370	2023-02-20 01:11:22 -05:00
mkadavil	63ee4c5e4c	Remove memcpy usage in u8s8s32 lpgemm micro kernels. -As of now, memcpy is used in u8s8s32 micro-kernel for copying in k fringe loop (( k % 4 )!= 0) and NR' < 16 fringe kernels. However for small k/n dimensions, memcpy invocation has high overhead. -This issue is fixed by replacing memcpy with a MACRO based implementation of copy routine, specifically optimized for the sizes that will be encountered in fringe cases (k < 4, NR' < 16). AMD-Internal: [CPUPL-3008] Change-Id: I376bab0aac325832e42e370b291614e5fd5272dc	2023-02-16 05:52:19 -05:00
eashdash	672544bc04	GeLU Activation Function Post-Op for LPGEMM S16, S32 and BF16 1. Added Tanh approximation based GeLU Post-Op for S16, S32 and BF16 2. Changes are done at frame and micro-kernel level to implement this post-op. 3. Efficient AVX-512 and AVX-2 vector versions of TANHF and EXPF functions are implemented for the GeLU post-operation. 4. TANH and EXPF math functions are efficiently implemented in macro-based fashion to exploit register level fusion of GeLU with GEMM operations for improved performance 5. LPGEMM bench is changed to pass GeLU post-op as input and support accuracy check to verify functional correctness AMD-Internal: [CPUPL-2978] Change-Id: I472ac35c00a4ea1ab983cc5f6ff6a123c8035f28	2023-02-02 08:25:04 -05:00
mkadavil	3870792e62	Low precision gemm s32 downscale optimization. -The post operations attributes are moved to a new struct lpgemm_post_op_attr, and an object of this struct is passed to the low precision gemm kernels in place of the multiple parameters. -The u8s8s32s8 api (downscale api) performance is low when the k value is small (k < KC). Two scenarios are observed here: a. beta = 0: Currently, for downscale api, a temporary buffer is used to accumulate intermediate s32 output, so that it can be used in later iterations of pc loop (k dim). The usage of this buffer (store) can be avoided if k < KC. Here intermediate accumulation is not required, since after the first iteration of the pc loop, the output can be downscaled and stored. b. beta != 0: In this case the existing values of the original s8 C output matrix need to be converted to s32 and beta scaled. Currently the s8 values are converted to s32 and stored in a temporary buffer in pc loop (5 loop algorithm) in blocks of mxNC. This temporary buffer is passed to the micro kernel and beta scaling is applied on this. However the mxNC block copy is costly and can be avoided if a new condition is introduced for beta scaling in the micro kernel, whereby the original s8 data is loaded to a register instead of the temporary buffer, converted to s32 and beta scaling applied on it. AMD-Internal: [CPUPL-2884] Change-Id: Id9b4650d500e1b553e48c4f1e4c902b3f553211c	2023-01-10 13:15:22 +05:30
Harihara Sudhan S	11c42ce1d3	C matrix prefetch for BF16 GEMM - Broke down the KR loop inside the compute kernel into two pieces - Added C matrix prefetch between the two decomposed pieces of KR loop AMD-Internal: [CPUPL-2693] Change-Id: Ib73bc2145de4c75bc8153d7d7d20fb057270c94e	2022-11-21 04:57:19 -05:00
eashdash	63864d7dfb	Added clipping while downscaling for u8s8s32os8 and u8s8s16os8. Clipping is done during the downscaling of the accumulated result from s32 to s8 for u8s8s32os8 and from s16 to s8 for u8s8s16os8, to saturate the final output values between [-128,127] AMD-Internal: [LWPZENDNN-493] Change-Id: Ica9bba5044e87b815e2b4e35809bf440bb9dd41f	2022-10-11 07:28:06 -04:00
mkadavil	f4702debb9	Zen4 compilation flag updates to support low precision gemm. - BFloat16 flags added to zen4 make_defs in order to enable compilation of low precision gemm by using zen4 config. - Avoid -ftree-partial-pre optimization flag with gcc due to non optimal code generation for intrinsics based kernels in low precision gemm. - Enable only Zen3 specific low precision gemm kernels (s16) compilation when aocl_gemm addon is compiled on Zen3 machines. AMD-Internal: [CPUPL-1545] Change-Id: Id3be3410bfbf141bb6fc4b4e3391115a4e0bb79f	2022-09-29 08:19:40 -04:00
Harihara Sudhan S	a45827b3f9	u8s8s16os16 bug fix for downscale operation - Removed some read code from the macros for downscale - Store permute correction - Simplified macros for edge cases and corrected intermediate operation AMD-Internal:[CPUPL-2171] Change-Id: Ifd2ff6b3d1c3874ac5cb8a545ff6daa7fb40ee68	2022-09-22 05:02:17 -04:00
mkadavil	bf4d1da1b9	Column major input support for BFloat16 gemm. -The bf16 gemm framework is modified to swap input column major matrices and compute gemm for the transposed matrices (now row major) using the existing row-major kernels. The output is written to C matrix assuming it is transposed. -Framework changes to support leading dimensions that are greater than matrix widths. -Bench changes to test low precision gemm for column major inputs. AMD-Internal: [CPUPL-2570] Change-Id: I22c76f52619fd76d0c0e41531828b437a1935495	2022-09-22 02:50:46 -04:00
Harihara Sudhan S	5b6cc5d39d	Bug fix in s16 downscale operation - Store operations was done to c matrix and not to c buffer AMD-Internal:[CPUPL-2171] Change-Id: Ic0897a20850fdae96db52f0ccc6fa087c84239fa	2022-09-13 06:01:48 -04:00
eashdash	e1349c0c71	LPGEMM BF16 MT panel based balancing Introduced multi-thread panel based balancing for BF16 to improve the overall MT performance. AMD-Internal: [CPUPL-2502] Change-Id: Iddce9548fa96e5f57bd3d3eb3e8268855ca47f25	2022-09-07 03:20:50 -04:00
eashdash	32a9e735f1	BF16 Output downscaling functionality - BF16 instructions output is accumulated at a higher precision of FP32 which needs to be converted to a lower precision of bf16 post the GEMM operations. This is required in AI workloads where both input and output are in BF16 format. - BF16 downscaling is implemented as post-ops inside the GEMM microkernels. Change-Id: Id1606746e3db4f3ed88cba385a7709c8604002a8	2022-08-30 13:46:09 -04:00
Harihara Sudhan S	5faab43e66	Downscaling as part of u8s8s16os16 - int16 c matrix intermediate values are converted to int32, then the int32 values are converted to fp32. On these fp32 values scaling is done - The resultant value is down scaled to int8 and stored in a separate buffer AMD-Internal: [2171] Change-Id: I76ff04098def04d55d1bd88ac8c8d3f267964cab	2022-08-30 13:41:36 -04:00
mkadavil	958c9238ac	Output downscaling support for low precision GEMM. - Downscaling is used when GEMM output is accumulated at a higher precision and needs to be converted to a lower precision afterwards. This is required in AI workloads where quantization/dequantization routines are used. - New GEMM APIs are introduced specifically to support this use case. Currently downscaling support is added for s32, s16 and bfloat16 GEMM. AMD-Internal: [CPUPL-2475] Change-Id: I81c3ee1ba5414f62427a7a0abb6ecef0c5ff71bf	2022-08-30 10:27:19 -04:00
eashdash	e674fae758	Post-Ops for bf16bf16f32 Functionality - Post-ops is a set of operations performed element wise on the output matrix post GEMM operation. The support for the same is added by fusing post-ops with GEMM operations. - Post-ops Bias, Relu and Parametric Relu are added to all the compute kernels of bf16bf16f32of32 - Modified bf16 interface files to add check for bf16 ISA support Change-Id: I2f7069a405037a59ea188a41bd8d10c4aae72fb3	2022-08-30 08:14:14 +00:00
mkadavil	a7d1cc7369	Multi-Threading support for BFloat16 gemm. -OpenMP based multi-threading support added for BFloat16 gemm. Both gemm and reorder api's are parallelized. -Multi-threading support for u8s8s16 reorder api. -Typecast issues fixed for bfloat16 gemm kernels. AMD-Internal: [CPUPL-2459] Change-Id: I6502d71ab32aa73bb159245976ea3d3a8e0ed109	2022-08-30 02:54:19 -04:00
Harihara Sudhan S	326d8a557f	Performance regression in u8s8s16os16 - Performance of u8s8s16os16 came down by 40% after the introduction of post-ops - Analysis revealed that the target compiler assumed false dependency and was generating sub-optimal code due to the post-ops structure - Inserted vzeroupper to hint the compiler that no ISA change will occur AMD-Internal: [CPUPL-2447] Change-Id: I0b383b9742ad237d0e053394602428872691ef0c	2022-08-29 03:20:02 -04:00
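
Commits a72fff2be9 and bd8cd763ff describe computing a signed-by-signed int8 GEMM with instructions that need an unsigned A operand: add 128 to every element of A, multiply, then subtract 128 times the column sums of B. The following is a minimal pure-Python sketch of that arithmetic identity only. The function names are hypothetical and not part of aocl_gemm, and the sketch uses unbounded Python integers, so it ignores any accumulator saturation or overflow that the real s16/s32 kernels have to handle.

```python
def gemm_reference(a, b):
    """Plain integer GEMM: a is m x k, b is k x n (lists of lists)."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def gemm_s8s8_via_u8(a, b):
    """Signed int8 GEMM computed through an unsigned A operand.

    Hypothetical helper illustrating the compensation step from the
    commit message: C = C' - 128 * (column sums of B), where C' is the
    product of (A + 128) and B.
    """
    k, n = len(b), len(b[0])

    # Column sums of B; the commit message computes these while packing B.
    col_sum_b = [sum(b[p][j] for p in range(k)) for j in range(n)]

    # Shift A from the signed range [-128, 127] into the unsigned range [0, 255].
    a_u8 = [[x + 128 for x in row] for row in a]
    assert all(0 <= x <= 255 for row in a_u8 for x in row)

    # Product with the shifted A, then undo the shift per output column.
    c_shifted = gemm_reference(a_u8, b)
    return [[c_shifted[i][j] - 128 * col_sum_b[j] for j in range(n)]
            for i in range(len(a))]
```

The correction works because sum_p (a[i][p] + 128) * b[p][j] equals sum_p a[i][p] * b[p][j] plus 128 * sum_p b[p][j], so the extra term depends only on the column of B and can be precomputed once.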

58 Commits
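
Commit baeebe75c9 mentions that AutoAWQ stores eight 4-bit elements in one int32 in the order [0, 2, 4, 6, 1, 3, 5, 7] and that the library converts this back to sequential order. The sketch below shows one way to read that description. It assumes nibble position i (counting from the least significant nibble) holds logical element AWQ_ORDER[i]; that interpretation, the constant name, and both function names are assumptions for illustration and are not taken from the aocl_gemm sources.

```python
# Assumed layout: nibble position i of the 32-bit word holds logical
# element AWQ_ORDER[i], with position 0 being the least significant nibble.
AWQ_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]


def pack_awq(values):
    """Pack eight 4-bit values into one 32-bit word using the assumed order."""
    word = 0
    for pos, src in enumerate(AWQ_ORDER):
        word |= (values[src] & 0xF) << (4 * pos)
    return word


def unpack_awq(word):
    """Recover the eight 4-bit values in sequential order [0..7]."""
    out = [0] * 8
    for pos, dst in enumerate(AWQ_ORDER):
        out[dst] = (word >> (4 * pos)) & 0xF
    return out
```

Under this assumption, packing the values 0 through 7 gives the word 0x75316420, and unpack_awq returns them in sequential order again. The values are treated as raw unsigned nibbles; any sign extension or zero-point handling is outside the scope of this sketch.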