amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-04 14:31:12 +00:00

Author	SHA1	Message	Date
eashdash	ef134dc49f	Added Trans A feature for all INT8 LPGEMM APIs 1. Added Trans A feature to handle column major inputs for A matrix. 2. Trans A is enabled by on-the-go pack of A matrix. 3. The on-the-go pack of A converts a column storage MCxKC block of A into row storage MCxKC block as LPGEMM kernels are row major kernels. 4. New pack routines are added for conversion of A matrix from column major storage to row major storage. 5. LPGEMM Cntx is updated with pack kernel function pointers. 6. Packing of A matrix: - Converts column major input A to row major in blocks of MCxKC with newly added pack A functions when cs_a > 1. 7. Pack routines are added for AVX512 and AVX2 INT8 LPGEMM APIs. 8. Trans A feature is now supported in: 1. u8s8s32os32/os8 2. u8s8s16os16/os8/ou8 3. s8s8s32os32/os8 4. s8s8s16os16/os8 AMD-Internal: SWLCSG-2582 Change-Id: I7ce331545525a9a09f3853280615b55fcf2edabf	2024-01-30 03:40:56 -05:00
Edward Smyth	ed5010d65b	Code cleanup: AMD copyright notice Standardize format of AMD copyright notice. AMD-Internal: [CPUPL-3519] Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0	2023-11-23 08:54:31 -05:00
mkadavil	8dff49837d	Lpgemm source restructuring to support amdzen config. -Currently lpgemm can only be built using either zen3 or zen4 config. The lpgemm kernel code is re-structured to support amdzen, and thus multi machine deployment. -The micro-kernel calls (gemm and pack) are currently hardcoded in the lpgemm framework. This is removed and a new lpgemm_cntx based dispatch mechanism is designed to support runtime configurability for micro-kernels. AMD-Internal: [CPUPL-2965] Change-Id: I4bbcb4e5db767def1663caf5481f0b4c988149ef	2023-02-21 08:35:38 -05:00
eashdash	672544bc04	GeLU Activation Function Post-Op for LPGEMM S16, S32 and BF16 1. Added Tanh approximation based GeLU Post-Op for S16, S32 and BF16 2. Changes are done at frame and micro-kernel level to implement this post-op. 3. Efficient AVX-512 and AVX-2 vector versions of TANHF and EXPF functions are implemented for the GeLU post-operation. 4. TANH and EXPF math functions are efficiently implemented in macro-based fashion to exploit register level fusion of GeLU with GEMM operations for improved performance 5. LPGEMM bench is changed to pass GeLU post-op as input and support accuracy check to verify functional correctness AMD-Internal: [CPUPL-2978] Change-Id: I472ac35c00a4ea1ab983cc5f6ff6a123c8035f28	2023-02-02 08:25:04 -05:00
mkadavil	3870792e62	Low precision gemm s32 downscale optimization. -The post operations attributes are moved to a new struct lpgemm_post_op_attr, and an object of this struct is passed to the low precision gemm kernels in place of the multiple parameters. -The u8s8s32s8 api (downscale api) performance is low when the k value is small (k < KC). Two scenarios are observed here: a. beta = 0: Currently, for downscale api, a temporary buffer is used to accumulate intermediate s32 output, so that it can be used in later iterations of pc loop (k dim). The usage of this buffer (store) can be avoided if k < KC. Here intermediate accumulation is not required, since after the first iteration of the pc loop, the output can be downscaled and stored. b. beta != 0: In this case the existing values of the original s8 C output matrix need to be converted to s32 and beta scaled. Currently the s8 values are converted to s32 and stored in temporary buffer in pc loop (5 loop algorithm) in blocks of mxNC. This temporary buffer is passed to the micro kernel and beta scaling is applied on this. However the mxNC block copy is costly and can be avoided if a new condition is introduced for beta scaling in the micro kernel, whereby the original s8 data is loaded directly into a register instead of from the temporary buffer, converted to s32, and beta scaled. AMD-Internal: [CPUPL-2884] Change-Id: Id9b4650d500e1b553e48c4f1e4c902b3f553211c	2023-01-10 13:15:22 +05:30
eashdash	63864d7dfb	Added clipping while downscaling for u8s8s32os8 and u8s8s16os8. Clipping is done during the downscaling of the accumulated result from s32 to s8 for u8s8s32os8 and from s16 to s8 for u8s8s16os8, to saturate the final output values between [-128,127] AMD-Internal: [LWPZENDNN-493] Change-Id: Ica9bba5044e87b815e2b4e35809bf440bb9dd41f	2022-10-11 07:28:06 -04:00
Harihara Sudhan S	a45827b3f9	u8s8s16os16 bug fix for downscale operation - Removed some read code from the macros for downscale - Store permute correction - Simplified macros for edge cases and corrected intermediate operation AMD-Internal:[CPUPL-2171] Change-Id: Ifd2ff6b3d1c3874ac5cb8a545ff6daa7fb40ee68	2022-09-22 05:02:17 -04:00
Harihara Sudhan S	5b6cc5d39d	Bug fix in s16 downscale operation - Store operations were done to c matrix and not to c buffer AMD-Internal:[CPUPL-2171] Change-Id: Ic0897a20850fdae96db52f0ccc6fa087c84239fa	2022-09-13 06:01:48 -04:00
Harihara Sudhan S	5faab43e66	Downscaling as part of u8s8s16os16 - int16 c matrix intermediate values are converted to int32, then the int32 values are converted to fp32. On these fp32 values scaling is done - The resultant value is down scaled to int8 and stored in a separate buffer AMD-Internal: [2171] Change-Id: I76ff04098def04d55d1bd88ac8c8d3f267964cab	2022-08-30 13:41:36 -04:00
mkadavil	958c9238ac	Output downscaling support for low precision GEMM. - Downscaling is used when GEMM output is accumulated at a higher precision and needs to be converted to a lower precision afterwards. This is required in AI workloads where quantization/dequantization routines are used. - New GEMM APIs are introduced specifically to support this use case. Currently downscaling support is added for s32, s16 and bfloat16 GEMM. AMD-Internal: [CPUPL-2475] Change-Id: I81c3ee1ba5414f62427a7a0abb6ecef0c5ff71bf	2022-08-30 10:27:19 -04:00
Harihara Sudhan S	326d8a557f	Performance regression in u8s8s16os16 - Performance of u8s8s16os16 came down by 40% after the introduction of post-ops - Analysis revealed that the target compiler assumed false dependency and was generating sub-optimal code due to the post-ops structure - Inserted vzeroupper to hint the compiler that no ISA change will occur AMD-Internal: [CPUPL-2447] Change-Id: I0b383b9742ad237d0e053394602428872691ef0c	2022-08-29 03:20:02 -04:00
mkadavil	584069bf74	Parametric ReLU post-ops support for u8s8s32 and u8s8s16 GEMM. -Parametric ReLU is the generalization of leaky ReLU in which the leakage coefficient is tunable. The support for the same is added following the register-level fusion technique. -Low precision bench enhancement to check accuracy/performance of low precision gemm with PReLU. -Bug fixes in low precision gemm kernels. AMD-Internal: [CPUPL-2442] Change-Id: I81336405b185a994297d122b2d868b758ae6dad5	2022-08-25 13:33:02 +05:30
mkadavil	6fbdfc3cf2	Low precision gemm refactoring and bug fixes. -The micro-kernel function signatures follow a common pattern. These functions can be represented as an instantiation of a MACRO as is done in BLIS, and thus the number of micro-kernel header files can be brought down. A new single header file containing all the MACRO definitions with the instantiation is added, and the existing unnecessary header files are removed. -The bias addition in micro-kernel for n remaining < 16 reads the bias array assuming it contains 16 elements. This can result in seg-faults, since out of bound memory is accessed. It is fixed by copying required elements to an intermediate buffer and using that buffer for loading. -Input matrix storage type parameter is added to lpgemm APIs. It can be either row or column major, denoted by r and c respectively. Currently only row major input matrices are supported. -Bug fix in s16 fringe micro-kernel to use correct offset while storing output. AMD-Internal: [CPUPL-2386] Change-Id: Idfa23e69d54ad7e06a67b1e36a5b5558fbff03a3	2022-08-14 17:39:00 +05:30
Harihara Sudhan S	d1eaf65a26	Post-Ops for u8s8s16os16 Functionality - Post-ops is an operation performed on every element of the output matrix after GEMM operation is completed. - Post-ops relu and bias added to all the compute kernels of u8s8s16os16 - Post-ops are done on the value loaded into the register to avoid reloading of C matrix elements - Minor bug fixes in openmp thread decorator of lpgemm - Added test cases to lpgemm bench input file AMD-Internal: [CPUPL-2171] Change-Id: If49f763fdfac19749f6665c172348691165d8631	2022-08-09 14:52:41 +05:30
Harihara Sudhan S	60de0a1856	Multithreading and support for unpacked B matrix in u8s8s16os16 Functionality - When the B matrix is not reordered before the u8s8s16os16 compute kernel call, packing of B matrix is done as part of the five loop algorithm. The state of B matrix (packed or unpacked) is given as a user input. - Packing of B matrix is done as part of the five loop compute. - Temporary buffer for pack B is allocated in the five loop algorithm - Multithreading for computation kernel - Configuration constants for u8s8s16os16 are part of the lpgemm config AMD-Internal: [CPUPL-2171] Change-Id: I22b4f0ec7fc29a2add4be0cff7d75f92dd3e60b8	2022-08-05 19:28:37 +05:30
Harihara Sudhan S	e5d4fc2a70	Added low precision GEMM (u8s8s16os16) Feature Addition : Added low precision GEMM to addon. The kernel takes unsigned int8 and signed int8 as inputs and performs GEMM operation. The intermediate accumulation and output are in signed int16. - The compute kernel will perform computation only if B matrix reordered to suit the usage of AVX2 instruction vpmaddubsw. - Kernel for packing the B matrix is provided. - LPGEMM bench code was modified to test the performance and accuracy of the new variant. AMD-Internal: [CPUPL-2171] Change-Id: Id9a6d90b79f4bf82fb2e2f3093974dbf37275f9b	2022-08-02 02:20:00 -04:00
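
Commits 958c9238ac, 5faab43e66 and 63864d7dfb describe downscaling as: convert the higher-precision accumulator to fp32, scale it, then store it as int8 saturated to [-128, 127]. The scalar sketch below illustrates that arithmetic only. It is not the BLIS kernel code: the function name `downscale_s32_to_s8`, the single `scale` factor, and the round-half-to-even rounding mode are assumptions made for illustration.

```python
def downscale_s32_to_s8(acc, scale):
    """Scale integer accumulators in floating point, round, and clip to s8.

    acc   -- list of accumulated integer values (s32 or s16 range)
    scale -- hypothetical scale factor (assumed to be one value for all elements)
    """
    S8_MIN, S8_MAX = -128, 127
    out = []
    for value in acc:
        scaled = float(value) * scale      # integer -> float, then scale
        rounded = round(scaled)            # round-half-to-even (assumed mode)
        # Saturate instead of wrapping, as described in commit 63864d7dfb.
        out.append(max(S8_MIN, min(S8_MAX, rounded)))
    return out


if __name__ == "__main__":
    print(downscale_s32_to_s8([1000, -1000, 10, -3], 0.5))
```

Without the clipping step, a value such as 500 would wrap around when narrowed to 8 bits rather than saturating at 127.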

16 Commits
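
Commit ef134dc49f describes the Trans A feature as an on-the-go pack that converts a column-major MCxKC block of A into row-major storage when cs_a > 1. A minimal scalar sketch of that conversion follows. The function name `pack_a_col_to_row` and the flat-list buffer are hypothetical; the actual pack routines are vectorized AVX2/AVX512 kernels and may use a different packed layout.

```python
def pack_a_col_to_row(a, rs_a, cs_a, mc, kc):
    """Copy an mc x kc block of A into a contiguous row-major buffer.

    a          -- flat buffer holding the block, starting at element (0, 0)
    rs_a, cs_a -- row and column strides of A (column-major: rs_a == 1, cs_a > 1)
    mc, kc     -- block dimensions
    """
    packed = [0] * (mc * kc)
    for i in range(mc):
        for p in range(kc):
            # Element (i, p) is read through the strides and written row-major.
            packed[i * kc + p] = a[i * rs_a + p * cs_a]
    return packed


if __name__ == "__main__":
    # 2 x 3 matrix [[1, 2, 3], [4, 5, 6]] stored column-major (rs_a=1, cs_a=2).
    a_col_major = [1, 4, 2, 5, 3, 6]
    print(pack_a_col_to_row(a_col_major, rs_a=1, cs_a=2, mc=2, kc=3))
```

Because the copy goes through strides, the same loop also handles a row-major input (rs_a equal to the leading dimension, cs_a == 1), in which case it reduces to a plain block copy.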