6 Commits

Author SHA1 Message Date
eashdash
bd8cd763ff Added NEW LPGEMM TYPE- S8S8S32/S8
1. New LPGEMM type - S8S8S32/S8 is added.
2. New interface, frame and kernel files are added.
3. Frame and kernel files added/modified for S8S8S32/S8 have
   2 operations - Pack B and Mat Mul
4. Pack B kernel routines to pack B matrix for VNNI and compute the sum
   of every column of B matrix to implement the S8S8S32 operation using
   the VNNI instructions.
5. Mat Mul Kernel files to compute the GEMM output using the VNNI.
   Here the A matrix elements are converted from int8 to uint8 (VNNI
   works with A matrix type uint8 only).
6. After the GEMM computation, the accumulated outputs are adjusted to
   compensate for the int8 to uint8 conversion of A and give the
   correct results.
7. With this change, two new LPGEMM APIs are introduced:
   s8s8s32os32 and s8s8s32os8.
8. All previously added post-ops are supported on S8S8S32/S8 also.
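
The interplay of items 4 to 6 can be illustrated with a minimal sketch. It assumes the int8 to uint8 conversion adds a fixed offset of 128 to every element of A and that the correction subtracts 128 times the B column sum; the commit message does not state these details, and the function names below are hypothetical, not the library's.

```python
def column_sums(b):
    """Sum of every column of B (k x n list of int8 values), as in the Pack B step."""
    k, n = len(b), len(b[0])
    return [sum(b[p][j] for p in range(k)) for j in range(n)]


def s8s8s32_via_u8(a, b, offset=128):
    """Multiply int8 A (m x k) by int8 B (k x n) using only u8 x s8 products.

    Assumption: A is shifted into the uint8 range by adding `offset`, and the
    extra term offset * colsum(B) is removed after accumulation.
    """
    sums = column_sums(b)
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        a_u8 = [x + offset for x in a[i]]  # [-128, 127] -> [0, 255]
        for j in range(n):
            acc = sum(a_u8[p] * b[p][j] for p in range(k))  # u8 x s8 accumulate
            c[i][j] = acc - offset * sums[j]                # post-GEMM correction
    return c


def reference_matmul(a, b):
    """Plain signed integer matrix product used as the ground truth."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]
```

The correction works because sum((a + 128) * b) = sum(a * b) + 128 * sum(b), so subtracting the scaled column sum recovers the signed product.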

AMD-Internal: [CPUPL-3154]
Change-Id: Ib18f82bde557ea4a815a63adc7870c4234bfb9d3
2023-03-31 05:44:54 -04:00
eashdash
d21cd51fde Accumulation type for alpha, beta values and BF16 bench integration
1. Correcting the type of alpha and beta values from C_type
   (output type) to accumulation type.
   For the downscaled LPGEMM APIs, C_type will be the downscaled
   type but the required type for alpha and beta values should
   be the accumulation type.
2. Integrated BF16 into the LPGEMM bench for both BF16 APIs
   (bf16bf16f32of32 and bf16bf16f32obf16).
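
Why the scalar type matters for item 1 can be shown with a small sketch. It assumes an int8 output with int32 accumulation and a saturating downscale; the helper names are hypothetical and the library's actual rounding and saturation behaviour is not described in this commit.

```python
def wrap_int8(x):
    """Reinterpret an integer as a two's-complement int8 (wraps on overflow)."""
    return ((x + 128) % 256) - 128


def saturate_int8(x):
    """Clamp an integer to the int8 range."""
    return max(-128, min(127, x))


def scale_then_downscale(acc, c_old, alpha, beta):
    """Evaluate alpha * acc + beta * c_old in the wide accumulation type,
    then convert the result to the (assumed int8) output type."""
    return saturate_int8(alpha * acc + beta * c_old)


alpha = 200                         # representable in int32
alpha_as_c_type = wrap_int8(alpha)  # what remains if alpha were typed as int8
```

With `alpha` held in the accumulation type the scaling uses 200; typed as the downscaled output type it would silently become -56.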

AMD-Internal: [CPUPL-2561]
Change-Id: I3a99336c743f3880be1b96605ceeeae7c3bd4797
2022-09-23 05:00:49 -04:00
mkadavil
bf4d1da1b9 Column major input support for BFloat16 gemm.
-The bf16 gemm framework is modified to swap input column major matrices
and compute gemm for the transposed matrices (now row major) using the
existing row-major kernels. The output is written to C matrix assuming
it is transposed.
-Framework changes to support leading dimensions that are greater than
matrix widths.
-Bench changes to test low precision gemm for column major inputs.
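
The swap described above relies on the identity C^T = B^T * A^T: a column-major buffer read as row-major is the transpose of the matrix. A minimal sketch, with hypothetical function names and flat Python lists standing in for the matrix buffers:

```python
def gemm_row_major(m, n, k, a, lda, b, ldb, c, ldc):
    """C[m x n] = A[m x k] * B[k x n] for row-major flat buffers.

    Leading dimensions may be larger than the matrix widths.
    """
    for i in range(m):
        for j in range(n):
            acc = 0
            for p in range(k):
                acc += a[i * lda + p] * b[p * ldb + j]
            c[i * ldc + j] = acc


def gemm_col_major(m, n, k, a, lda, b, ldb, c, ldc):
    """Column-major GEMM computed with the row-major routine.

    A column-major m x k buffer is a row-major k x m buffer holding A^T, so
    swapping the operands and the m/n dimensions computes C^T = B^T * A^T,
    which lands in memory exactly as column-major C.
    """
    gemm_row_major(n, m, k, b, ldb, a, lda, c, ldc)
```

For example, with A = [[1, 2, 3], [4, 5, 6]] stored column-major with lda = 3 and B = [[7, 8], [9, 10], [11, 12]] stored column-major with ldb = 4, the padding elements are never read and C is written column-major with its own leading dimension.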

AMD-Internal: [CPUPL-2570]
Change-Id: I22c76f52619fd76d0c0e41531828b437a1935495
2022-09-22 02:50:46 -04:00
mkadavil
958c9238ac Output downscaling support for low precision GEMM.
- Downscaling is used when GEMM output is accumulated at a higher
precision and needs to be converted to a lower precision afterwards.
This is required in AI workloads where quantization/dequantization
routines are used.
- New GEMM APIs are introduced specifically to support this use case.
Currently downscaling support is added for s32, s16 and bfloat16 GEMM.
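
A minimal sketch of the downscaling idea for the s32 case follows. It assumes one scale factor per output column, round-to-nearest, and saturation to int8; none of these specifics are stated in the commit message, and the function name is hypothetical.

```python
def downscale_s32_to_s8(acc_row, scales):
    """Convert a row of int32 accumulators to int8.

    Assumptions: per-column float scale factors, rounding to nearest,
    and clamping to the int8 range [-128, 127].
    """
    out = []
    for acc, scale in zip(acc_row, scales):
        q = round(acc * scale)             # requantize in higher precision
        out.append(max(-128, min(127, q)))  # saturate to the output type
    return out
```

Accumulating in int32 first and converting once at the end avoids losing precision in every partial sum, which is the usual motivation in quantized inference.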

AMD-Internal: [CPUPL-2475]
Change-Id: I81c3ee1ba5414f62427a7a0abb6ecef0c5ff71bf
2022-08-30 10:27:19 -04:00
eashdash
4e3e00fb7e Added low precision GEMM - bf16bf16f32of32
Feature Addition: Added a new variant of low precision GEMM to the addon: BFloat16. The kernel takes bf16 inputs and performs BF16 GEMM operations. The intermediate accumulation and the output are in float.

1. Compute kernels perform computations only if the B matrix is reordered as required by the AVX-512 BF16 instruction dpbf16_ps.
2. A kernel for packing the B matrix is provided.
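
The reason B needs reordering is that the BF16 dot-product instruction multiplies pairs of adjacent bf16 elements and adds both products into one float lane. A rough sketch of a pairwise packing along k, using plain Python floats in place of bf16 values; the layout and function names are illustrative assumptions, not the addon's actual pack format.

```python
def pack_b_pairs(b):
    """Interleave rows 2p and 2p+1 of a k x n matrix column by column.

    Each packed row holds (b[2p][j], b[2p+1][j]) for j = 0..n-1, so the two
    k-elements feeding one output lane sit next to each other. An odd k is
    padded with zeros (assumption).
    """
    k, n = len(b), len(b[0])
    packed = []
    for p in range(0, k, 2):
        lo = b[p]
        hi = b[p + 1] if p + 1 < k else [0.0] * n
        row = []
        for j in range(n):
            row.extend((lo[j], hi[j]))
        packed.append(row)
    return packed


def dot_pairs_row(a_row, packed, n):
    """Emulate the pairwise multiply-accumulate for one row of A."""
    acc = [0.0] * n
    for p2, row in enumerate(packed):
        a0 = a_row[2 * p2]
        a1 = a_row[2 * p2 + 1] if 2 * p2 + 1 < len(a_row) else 0.0
        for j in range(n):
            acc[j] += a0 * row[2 * j] + a1 * row[2 * j + 1]
    return acc
```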

Change-Id: If5d08213068869eff060c9998596d2d2703a6793
2022-08-24 03:27:00 -04:00
mkadavil
6fbdfc3cf2 Low precision gemm refactoring and bug fixes.
-The micro-kernel function signatures follow a common pattern. These
functions can be represented as an instantiation of a MACRO as is done
in BLIS, and thus the number of micro-kernel header files can be brought
down. A new single header file containing all the MACRO definitions with
the instantiation is added, and the existing unnecessary header files
are removed.
-The bias addition in the micro-kernel for n remaining < 16 reads the bias
array assuming it contains 16 elements. This can result in segfaults,
since out-of-bounds memory is accessed. It is fixed by copying the required
elements to an intermediate buffer and using that buffer for loading.
-Input matrix storage type parameter is added to lpgemm APIs. It can be
either row or column major, denoted by r and c respectively. Currently
only row major input matrices are supported.
-Bug fix in the s16 fringe micro-kernel to use the correct offset while
storing the output.
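
The bias fix described above can be sketched as follows. The full-width load is modelled by a hypothetical helper that raises when it would read past the end of the buffer, standing in for the out-of-bounds access of a real 16-lane vector load; the names are illustrative only.

```python
LANES = 16  # elements consumed by one full-width vector load


def load_full_vector(buf, offset):
    """Stand-in for an unmasked 16-element load; fails on an out-of-bounds read."""
    if offset + LANES > len(buf):
        raise IndexError("vector load reads past the end of the buffer")
    return buf[offset:offset + LANES]


def load_bias_fringe(bias, offset, n_rem):
    """Load n_rem (< 16) bias values safely.

    The valid elements are first copied into a zero-filled 16-element
    staging buffer, and the full-width load is issued on that buffer.
    """
    staging = [0] * LANES
    staging[:n_rem] = bias[offset:offset + n_rem]
    return load_full_vector(staging, 0)
```

With a 20-element bias array and an offset of 16, a direct full-width load would run 12 elements past the end, while the staged load returns the four valid values followed by zeros.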

AMD-Internal: [CPUPL-2386]
Change-Id: Idfa23e69d54ad7e06a67b1e36a5b5558fbff03a3
2022-08-14 17:39:00 +05:30