Commit Graph

15 Commits

Author SHA1 Message Date
mkadavil
a5c4a8c7e0 Int4 B matrix reordering support in LPGEMM.
Support for reordering B matrix of datatype int4 as per the pack schema
requirements of u8s8s32 kernel. Vectorized int4_t -> int8_t conversion
implemented via leveraging the vpmultishiftqb instruction. The reordered
B matrix will then be used in the u8s8s32o<s32|s8> api.

AMD-Internal: [SWLCSG-2390]
Change-Id: I3a8f8aba30cac0c4828a31f1d27fa1b45ea07bba
2024-06-24 07:55:34 -04:00
Nallani Bhaskar
29db6eb42b Added transB in all AVX512 based int8 API's
Description:
--Added support for tranB in u8s8s32o<s32|s8> and
  s8s8s32o<s32|s8> API's
--Updated the bench_lpgemm by adding options to
  support transpose of B matrix
--Updated data_gen_script.py in lpgemm bench
  according to latest input format.

AMD-Internal: [SWLCSG-2582]
Change-Id: I4a05cc390ae11440d6ff86da281dbafbeb907048
2024-05-23 03:46:13 +05:30
eashdash
ef134dc49f Added Trans A feature for all INT8 LPGEMM APIs
1. Added Trans A feature to handle column major inputs
   for A matrix.
2. Trans A is enabled by on-the-go pack of A matrix.
3. The on-the-go pack of A converts a column storage
   MCxKC block of A into row storage MCxKC block as
   LPGEMM kernels are row major kernels.
4. New pack routines are added for conversion of A matrix
   from column major storage to row major storage.
5. LPGEMM Cntx is updated with pack kernel function
   pointers.
6. Packing of A matrix:
   -  Converts column major input A to row major
      in blocks of MCxKC with newly added pack A
      functions when cs_a > 1.
7. Pack routines are added for AVX512 and AVX2
   INT8 LPGEMM APIs.
8. Trans A feature is now supported in:
   1. u8s8s32os32/os8
   2. u8s8s16os16/os8/ou8
   3. s8s8s32os32/os8
   4. s8s8s16os16/os8

AMD-Internal: SWLCSG-2582
Change-Id: I7ce331545525a9a09f3853280615b55fcf2edabf
2024-01-30 03:40:56 -05:00
Edward Smyth
ed5010d65b Code cleanup: AMD copyright notice
Standardize format of AMD copyright notice.

AMD-Internal: [CPUPL-3519]
Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0
2023-11-23 08:54:31 -05:00
mkadavil
8dff49837d Lpgemm source restructuring to support amdzen config.
-Currently lpgemm can only be built using either zen3 or zen4 config.
The lpgemm kernel code is re-structured to support amdzen, and thus
multi machine deployment.
-The micro-kernel calls (gemm and pack) are currently hardcoded in the
lpgemm framework. This is removed and a new lpgemm_cntx based dispatch
mechanism is designed to support runtime configurability for
micro-kernels.

AMD-Internal: [CPUPL-2965]
Change-Id: I4bbcb4e5db767def1663caf5481f0b4c988149ef
2023-02-21 08:35:38 -05:00
mkadavil
63ee4c5e4c Remove memcpy usage in u8s8s32 lpgemm micro kernels.
-As of now, memcpy is used in u8s8s32 micro-kernel for copying in k
fringe loop (( k % 4 )!= 0) and NR' < 16 fringe kernels. However for
small k/n dimensions, memcpy invocation has high overhead.
-This issue is fixed by replacing memcpy with a MACRO based
implementation of copy routine, specifically optimized for the sizes
that will be encountered in fringe cases (k < 4, NR' < 16).

AMD-Internal: [CPUPL-3008]
Change-Id: I376bab0aac325832e42e370b291614e5fd5272dc
2023-02-16 05:52:19 -05:00
eashdash
672544bc04 GeLU Activation Function Post-Op for LPGEMM S16, S32 and BF16
1. Added Tanh approximation based GeLU Post-Op for S16, S32 and BF16
2. Changes are done at frame and micro-kernel level to
   implement this post-op.
3. Efficient AVX-512 and AVX-2 vector versions of TANHF and EXPF
   functions are implemented for the GeLU post-operation.
4. TANH and EXPF math functions are efficiently implemented in
   macro-based fashion to exploit register level fusion of GeLU
   with GEMM operations for improved performance
5. LPGEMM bench is changed to pass GeLU post-op as input and
   support accuracy check to verify functional correctness

AMD-Internal: [CPUPL-2978]
Change-Id: I472ac35c00a4ea1ab983cc5f6ff6a123c8035f28
2023-02-02 08:25:04 -05:00
mkadavil
3870792e62 Low precision gemm s32 downscale optimization.
-The post operations attributes are moved to a new struct
lpgemm_post_op_attr, and an object of this struct is passed to the
low precision gemm kernels in place of the multiple parameters.
-The u8s8s32s8 api (downscale api) performance is low when the k
value is less (k < KC). Two scenarios are observed here:
a. beta = 0: Currently, for downscale api, a temporary buffer is
used to accumulate intermediate s32 output, so that it can be used
in later iterations of pc loop (k dim). The usage of this buffer
(store) can be avoided if k < KC. Here intermediate accumulation
is not required, since the after the first iteration of the pc loop,
the output can be downscaled and stored.
b. beta != 0: In this case the existing values of the original s8 C
output matrix needs to be converted to s32 and beta scaled. Currently
the s8 values are converted to s32 and stored in temporary buffer in
pc loop (5 loop algorithm) in blocks of mxNC. This temporary buffer
is passed to the micro kernel and beta scaling is applied on this.
However the mxNC block copy is costly and can be avoided if a new
condition is introduced for beta scaling in the micro kernel, whereby
the original s8 data is loaded instead of from the temporary buffer
to a register, converted to s32 and beta scaling applied on it.

AMD-Internal: [CPUPL-2884]
Change-Id: Id9b4650d500e1b553e48c4f1e4c902b3f553211c
2023-01-10 13:15:22 +05:30
eashdash
63864d7dfb Added clipping while downscaling for u8s8s32os8 and u8s8s16os8.
Clipping is done during the downscaling of the accumulated result
from s32 to s8 for u8s8s32os8 and from s16 to s8 for u8s8s16os8,
to saturate the final output values between [-128,127]

AMD-Internal: [LWPZENDNN-493]

Change-Id: Ica9bba5044e87b815e2b4e35809bf440bb9dd41f
2022-10-11 07:28:06 -04:00
mkadavil
f4702debb9 Zen4 compilation flag updates to support low precision gemm.
- BFloat16 flags added to zen4 make_defs in order to enable
compilation of low precision gemm by using zen4 config.
- Avoid -ftree-partial-pre optimization flag with gcc due to
non optimal code generation for intrinsics based kernels in
low precision gemm.
- Enable only Zen3 specific low precision gemm kernels (s16)
compilation when aocl_gemm addon is compiled on Zen3 machines.

AMD-Internal: [CPUPL-1545]
Change-Id: Id3be3410bfbf141bb6fc4b4e3391115a4e0bb79f
2022-09-29 08:19:40 -04:00
mkadavil
958c9238ac Output downscaling support for low precision GEMM.
- Downscaling is used when GEMM output is accumulated at a higher
precision and needs to be converted to a lower precision afterwards.
This is required in AI workloads where quantization/dequantization
routines are used.
- New GEMM APIs are introduced specifically to support this use case.
Currently downscaling support is added for s32, s16 and bfloat16 GEMM.

AMD-Internal: [CPUPL-2475]
Change-Id: I81c3ee1ba5414f62427a7a0abb6ecef0c5ff71bf
2022-08-30 10:27:19 -04:00
mkadavil
584069bf74 Parametric ReLU post-ops support for u8s8s32 and u8s8s16 GEMM.
-Parametric ReLU is the generalization of leaky ReLU in which the
leakage coefficient is tunable. The support for the same is added
following the register-level fusion technique.
-Low precision bench enhancement to check accuracy/performance of
low precision gemm with PReLU.
-Bug fixes in low precision gemm kernels.

AMD-Internal: [CPUPL-2442]
Change-Id: I81336405b185a994297d122b2d868b758ae6dad5
2022-08-25 13:33:02 +05:30
mkadavil
6fbdfc3cf2 Low precision gemm refactoring and bug fixes.
-The micro-kernel function signatures follow a common pattern. These
functions can be represented as an instantiation of a MACRO as is done
in BLIS, and thus the number of micro-kernel header files can be brought
down. A new single header file containing all the MACRO definitions with
the instantiation is added, and the existing unnecessary header files
are removed.
-The bias addition in micro-kernel for n remaining < 16 reads the bias
array assuming it contains 16 elements. This can result in seg-faults,
since out of bound memory is accessed. It is fixed by copying required
elements to an intermediate buffer and using that buffer for loading.
-Input matrix storage type parameter is added to lpgemm APIs. It can be
either row or column major, denoted by r and c respectively. Currently
only row major input matrices are supported.
-Bug fix in s16 fringe micro-kernel to use correct offset while storing
output.

AMD-Internal: [CPUPL-2386]
Change-Id: Idfa23e69d54ad7e06a67b1e36a5b5558fbff03a3
2022-08-14 17:39:00 +05:30
mkadavil
828d3cd3d3 Post operations support for low precision gemm.
- Low precision gemm is often used in ML/DNN workloads and is used
in conjunction with pre and post operations. Performing gemm and ops
together at the micro kernel level results in better overall performance
due to cache/register reuse of output matrix. The provision for defining
the post-operations and invoking the micro-kernel with it from the
framework is added as part of this change. This includes adding new data
structures/functions to define the post-ops to be applied and an
extensible template using which new post-ops can easily be integrated.
As for the post-operations, RELU and Bias Add for u8s8s32 is implemented
in this first cut.
- aocl_gemm bench modifications to test/benchmark RELU and Bias Add.

AMD-Internal: [CPUPL-2316]
Change-Id: Iad5fe9e54965bb52d5381ae459a69800946c7d18
2022-08-05 11:53:05 +05:30
mkadavil
6c112632a7 Low precision gemm integrated as aocl_gemm addon.
- Multi-Threaded int8 GEMM (Input - uint8_t, int8_t, Output - int32_t).
AVX512_vnni based micro-kernel for int8 gemm. Paralellization supported
along m and n dimensions.
- Multi-Threaded B matrix reorder support for sgemm. Reordering B matrix
is packing entire B matrix upfront before sgemm. It allows sgemm to
take advantage of packed B matrix without incurring packing costs during
runtime.
- Makefile updates to addon make rules to compile avx512 code for
selected files in addon folder.
- CPU features query enhancements to check for AVX512_VNNI flag.
- Bench for int8 gemm and sgemm with B matrix reorder. Supports
performance mode for benchmarking and accuracy mode for testing code
correctness.

AMD-Internal: [CPUPL-2102]

Change-Id: I8fb25f5c2fbd97d756f95b623332cb29e3b8d182
2022-06-09 10:28:38 -04:00