amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-04 14:31:12 +00:00

Author	SHA1	Message	Date
mkadavil	a5c4a8c7e0	Int4 B matrix reordering support in LPGEMM. Support for reordering B matrix of datatype int4 as per the pack schema requirements of u8s8s32 kernel. Vectorized int4_t -> int8_t conversion implemented via leveraging the vpmultishiftqb instruction. The reordered B matrix will then be used in the u8s8s32o<s32|s8> api. AMD-Internal: [SWLCSG-2390] Change-Id: I3a8f8aba30cac0c4828a31f1d27fa1b45ea07bba	2024-06-24 07:55:34 -04:00
Nallani Bhaskar	29db6eb42b	Added transB in all AVX512 based int8 APIs Description: --Added support for transB in u8s8s32o<s32|s8> and s8s8s32o<s32|s8> APIs --Updated the bench_lpgemm by adding options to support transpose of B matrix --Updated data_gen_script.py in lpgemm bench according to latest input format. AMD-Internal: [SWLCSG-2582] Change-Id: I4a05cc390ae11440d6ff86da281dbafbeb907048	2024-05-23 03:46:13 +05:30
eashdash	ef134dc49f	Added Trans A feature for all INT8 LPGEMM APIs 1. Added Trans A feature to handle column major inputs for A matrix. 2. Trans A is enabled by on-the-go pack of A matrix. 3. The on-the-go pack of A converts a column storage MCxKC block of A into row storage MCxKC block as LPGEMM kernels are row major kernels. 4. New pack routines are added for conversion of A matrix from column major storage to row major storage. 5. LPGEMM Cntx is updated with pack kernel function pointers. 6. Packing of A matrix: - Converts column major input A to row major in blocks of MCxKC with newly added pack A functions when cs_a > 1. 7. Pack routines are added for AVX512 and AVX2 INT8 LPGEMM APIs. 8. Trans A feature is now supported in: 1. u8s8s32os32/os8 2. u8s8s16os16/os8/ou8 3. s8s8s32os32/os8 4. s8s8s16os16/os8 AMD-Internal: SWLCSG-2582 Change-Id: I7ce331545525a9a09f3853280615b55fcf2edabf	2024-01-30 03:40:56 -05:00
Edward Smyth	ed5010d65b	Code cleanup: AMD copyright notice Standardize format of AMD copyright notice. AMD-Internal: [CPUPL-3519] Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0	2023-11-23 08:54:31 -05:00
mkadavil	8dff49837d	Lpgemm source restructuring to support amdzen config. -Currently lpgemm can only be built using either zen3 or zen4 config. The lpgemm kernel code is re-structured to support amdzen, and thus multi machine deployment. -The micro-kernel calls (gemm and pack) are currently hardcoded in the lpgemm framework. This is removed and a new lpgemm_cntx based dispatch mechanism is designed to support runtime configurability for micro-kernels. AMD-Internal: [CPUPL-2965] Change-Id: I4bbcb4e5db767def1663caf5481f0b4c988149ef	2023-02-21 08:35:38 -05:00
mkadavil	63ee4c5e4c	Remove memcpy usage in u8s8s32 lpgemm micro kernels. -As of now, memcpy is used in u8s8s32 micro-kernel for copying in k fringe loop (( k % 4 )!= 0) and NR' < 16 fringe kernels. However for small k/n dimensions, memcpy invocation has high overhead. -This issue is fixed by replacing memcpy with a MACRO based implementation of copy routine, specifically optimized for the sizes that will be encountered in fringe cases (k < 4, NR' < 16). AMD-Internal: [CPUPL-3008] Change-Id: I376bab0aac325832e42e370b291614e5fd5272dc	2023-02-16 05:52:19 -05:00
eashdash	672544bc04	GeLU Activation Function Post-Op for LPGEMM S16, S32 and BF16 1. Added Tanh approximation based GeLU Post-Op for S16, S32 and BF16 2. Changes are done at frame and micro-kernel level to implement this post-op. 3. Efficient AVX-512 and AVX-2 vector versions of TANHF and EXPF functions are implemented for the GeLU post-operation. 4. TANH and EXPF math functions are efficiently implemented in macro-based fashion to exploit register level fusion of GeLU with GEMM operations for improved performance 5. LPGEMM bench is changed to pass GeLU post-op as input and support accuracy check to verify functional correctness AMD-Internal: [CPUPL-2978] Change-Id: I472ac35c00a4ea1ab983cc5f6ff6a123c8035f28	2023-02-02 08:25:04 -05:00
mkadavil	3870792e62	Low precision gemm s32 downscale optimization. -The post operations attributes are moved to a new struct lpgemm_post_op_attr, and an object of this struct is passed to the low precision gemm kernels in place of the multiple parameters. -The u8s8s32s8 api (downscale api) performance is low when k is small (k < KC). Two scenarios are observed here: a. beta = 0: Currently, for downscale api, a temporary buffer is used to accumulate intermediate s32 output, so that it can be used in later iterations of pc loop (k dim). The usage of this buffer (store) can be avoided if k < KC. Here intermediate accumulation is not required, since after the first iteration of the pc loop, the output can be downscaled and stored. b. beta != 0: In this case the existing values of the original s8 C output matrix need to be converted to s32 and beta scaled. Currently the s8 values are converted to s32 and stored in temporary buffer in pc loop (5 loop algorithm) in blocks of mxNC. This temporary buffer is passed to the micro kernel and beta scaling is applied on this. However the mxNC block copy is costly and can be avoided if a new condition is introduced for beta scaling in the micro kernel, whereby the original s8 data, rather than the temporary buffer, is loaded into a register, converted to s32 and beta scaled. AMD-Internal: [CPUPL-2884] Change-Id: Id9b4650d500e1b553e48c4f1e4c902b3f553211c	2023-01-10 13:15:22 +05:30
eashdash	63864d7dfb	Added clipping while downscaling for u8s8s32os8 and u8s8s16os8. Clipping is done during the downscaling of the accumulated result from s32 to s8 for u8s8s32os8 and from s16 to s8 for u8s8s16os8, to saturate the final output values between [-128,127] AMD-Internal: [LWPZENDNN-493] Change-Id: Ica9bba5044e87b815e2b4e35809bf440bb9dd41f	2022-10-11 07:28:06 -04:00
mkadavil	f4702debb9	Zen4 compilation flag updates to support low precision gemm. - BFloat16 flags added to zen4 make_defs in order to enable compilation of low precision gemm by using zen4 config. - Avoid -ftree-partial-pre optimization flag with gcc due to non optimal code generation for intrinsics based kernels in low precision gemm. - Enable only Zen3 specific low precision gemm kernels (s16) compilation when aocl_gemm addon is compiled on Zen3 machines. AMD-Internal: [CPUPL-1545] Change-Id: Id3be3410bfbf141bb6fc4b4e3391115a4e0bb79f	2022-09-29 08:19:40 -04:00
mkadavil	958c9238ac	Output downscaling support for low precision GEMM. - Downscaling is used when GEMM output is accumulated at a higher precision and needs to be converted to a lower precision afterwards. This is required in AI workloads where quantization/dequantization routines are used. - New GEMM APIs are introduced specifically to support this use case. Currently downscaling support is added for s32, s16 and bfloat16 GEMM. AMD-Internal: [CPUPL-2475] Change-Id: I81c3ee1ba5414f62427a7a0abb6ecef0c5ff71bf	2022-08-30 10:27:19 -04:00
mkadavil	584069bf74	Parametric ReLU post-ops support for u8s8s32 and u8s8s16 GEMM. -Parametric ReLU is the generalization of leaky ReLU in which the leakage coefficient is tunable. The support for the same is added following the register-level fusion technique. -Low precision bench enhancement to check accuracy/performance of low precision gemm with PReLU. -Bug fixes in low precision gemm kernels. AMD-Internal: [CPUPL-2442] Change-Id: I81336405b185a994297d122b2d868b758ae6dad5	2022-08-25 13:33:02 +05:30
mkadavil	6fbdfc3cf2	Low precision gemm refactoring and bug fixes. -The micro-kernel function signatures follow a common pattern. These functions can be represented as an instantiation of a MACRO as is done in BLIS, and thus the number of micro-kernel header files can be brought down. A new single header file containing all the MACRO definitions with the instantiation is added, and the existing unnecessary header files are removed. -The bias addition in micro-kernel for n remaining < 16 reads the bias array assuming it contains 16 elements. This can result in seg-faults, since out of bound memory is accessed. It is fixed by copying required elements to an intermediate buffer and using that buffer for loading. -Input matrix storage type parameter is added to lpgemm APIs. It can be either row or column major, denoted by r and c respectively. Currently only row major input matrices are supported. -Bug fix in s16 fringe micro-kernel to use correct offset while storing output. AMD-Internal: [CPUPL-2386] Change-Id: Idfa23e69d54ad7e06a67b1e36a5b5558fbff03a3	2022-08-14 17:39:00 +05:30
mkadavil	828d3cd3d3	Post operations support for low precision gemm. - Low precision gemm is often used in ML/DNN workloads and is used in conjunction with pre and post operations. Performing gemm and ops together at the micro kernel level results in better overall performance due to cache/register reuse of output matrix. The provision for defining the post-operations and invoking the micro-kernel with it from the framework is added as part of this change. This includes adding new data structures/functions to define the post-ops to be applied and an extensible template using which new post-ops can easily be integrated. As for the post-operations, RELU and Bias Add for u8s8s32 is implemented in this first cut. - aocl_gemm bench modifications to test/benchmark RELU and Bias Add. AMD-Internal: [CPUPL-2316] Change-Id: Iad5fe9e54965bb52d5381ae459a69800946c7d18	2022-08-05 11:53:05 +05:30
mkadavil	6c112632a7	Low precision gemm integrated as aocl_gemm addon. - Multi-Threaded int8 GEMM (Input - uint8_t, int8_t, Output - int32_t). AVX512_vnni based micro-kernel for int8 gemm. Parallelization supported along m and n dimensions. - Multi-Threaded B matrix reorder support for sgemm. Reordering B matrix is packing entire B matrix upfront before sgemm. It allows sgemm to take advantage of packed B matrix without incurring packing costs during runtime. - Makefile updates to addon make rules to compile avx512 code for selected files in addon folder. - CPU features query enhancements to check for AVX512_VNNI flag. - Bench for int8 gemm and sgemm with B matrix reorder. Supports performance mode for benchmarking and accuracy mode for testing code correctness. AMD-Internal: [CPUPL-2102] Change-Id: I8fb25f5c2fbd97d756f95b623332cb29e3b8d182	2022-06-09 10:28:38 -04:00
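
Several commits above (828d3cd3d3, 584069bf74, 958c9238ac, 63864d7dfb) describe post-operations fused into the GEMM micro-kernel: bias add, ReLU, parametric ReLU, and downscaling from s32 to s8 with clipping to [-128, 127]. The scalar sketch below only illustrates what those operations compute on one row of accumulators. All function names are hypothetical and are not aocl_gemm symbols, the real kernels are vectorized with AVX2/AVX-512 intrinsics, and the rounding rule (round half away from zero) is an assumption, since the commit messages do not state it.

```c
#include <stdint.h>

/* Hypothetical reference helpers; none of these names exist in aocl_gemm. */

/* ReLU: negative values become zero. */
static float ref_relu(float x)
{
    return x > 0.0f ? x : 0.0f;
}

/* Parametric ReLU: negative values are scaled by a tunable coefficient. */
static float ref_prelu(float x, float alpha)
{
    return x > 0.0f ? x : alpha * x;
}

/* Downscale one s32 accumulator to s8: scale, clip to [-128, 127], round.
 * Rounding half away from zero is an assumption made for this sketch. */
static int8_t ref_downscale_s32_to_s8(int32_t acc, float scale)
{
    float scaled = (float)acc * scale;
    if (scaled > 127.0f) scaled = 127.0f;    /* saturate high */
    if (scaled < -128.0f) scaled = -128.0f;  /* saturate low */
    return (int8_t)(int32_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* Bias add, then ReLU, then downscale with clipping, for n accumulators. */
static void ref_bias_relu_downscale_row(const int32_t *acc, const int32_t *bias,
                                        float scale, int8_t *out, int n)
{
    for (int j = 0; j < n; ++j) {
        int32_t v = acc[j] + bias[j];
        if (v < 0) v = 0;  /* ReLU on the integer accumulator */
        out[j] = ref_downscale_s32_to_s8(v, scale);
    }
}
```

For example, with accumulators {100, -50, 300}, a bias of 10 per column, and a scale of 0.5, the row helper yields {55, 0, 127}: the last value saturates instead of wrapping around, which is the behaviour the clipping commit describes.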

15 Commits
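
Two of the packing changes in the log, the int4 to int8 conversion of B (a5c4a8c7e0) and the on-the-go pack of a column-major A block into row-major storage (ef134dc49f), can be illustrated with a plain scalar sketch. The function names are hypothetical, the nibble order (low nibble holds the even-indexed element) is an assumption not stated in the commit message, and the actual implementation is vectorized (the commit mentions vpmultishiftqb) and follows the kernel's pack schema rather than the simple contiguous layout used here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference helpers; none of these names exist in aocl_gemm. */

/* Sign-extend a 4-bit two's-complement value (0..15) to int8_t (-8..7). */
static int8_t ref_int4_to_int8(uint8_t nibble)
{
    return (int8_t)((int)((nibble & 0x0F) ^ 0x08) - 8);
}

/* Unpack count int4 values stored two per byte into int8_t.
 * Assumption: the low nibble holds the even-indexed element. */
static void ref_unpack_int4(const uint8_t *packed, int8_t *out, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        uint8_t byte = packed[i / 2];
        uint8_t nib = (i % 2 == 0) ? (uint8_t)(byte & 0x0F) : (uint8_t)(byte >> 4);
        out[i] = ref_int4_to_int8(nib);
    }
}

/* Copy an mc x kc block of A with arbitrary strides (rs_a, cs_a) into a
 * contiguous row-major buffer. For column-major input, rs_a == 1 and
 * cs_a is the leading dimension, which is the cs_a > 1 case in the log. */
static void ref_pack_a_to_row_major(const uint8_t *a, size_t rs_a, size_t cs_a,
                                    uint8_t *packed, size_t mc, size_t kc)
{
    for (size_t i = 0; i < mc; ++i) {
        for (size_t p = 0; p < kc; ++p) {
            packed[i * kc + p] = a[i * rs_a + p * cs_a];
        }
    }
}
```

Under the stated nibble-order assumption, the bytes 0xF1 and 0x87 unpack to 1, -1, 7, -8. A 2x3 block stored column-major as {1, 2, 3, 4, 5, 6} packs to the rows {1, 3, 5} and {2, 4, 6}.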