Description
-In enum AOCL_PARAMS_STORAGE_TYPES the member FLOAT was declared and the
clang 18 compiler in msvc throwing issue with multiple definition. We
replace FLOAT and BFLOAT16 to AOCL_GEMM_<F32/BF16>.
AMD-Internal: CPUPL-6174
Change-Id: Ic061af068854d51629b82b495efd0eb54543f329
- Implemented the AVX512 packA kernel for col major inputs in F32 API
- Removed the work arounds for n = 1, mtag_a = PACK case, where the execution was
being directed to GEMM instead of GEMV.
Change-Id: I6fb700d96069213a762e8a83a209c5388a91050f
-This API supports applying element wise operations (eg: post-ops) on a
bfloat16 input matrix to get an output matrix of the same(bfloat16) or
upscaled data type (float).
-Benchmarking/testing framework for the same is added.
AMD Internal: SWLCSG-2947
Change-Id: I43f1c269be1a1997d4912d8a3a97be5e5f3442d2
Details:
- Added new folder named JIT/ under addon/aocl_gemm/. This folder
will contain all the JIT related code.
- Modified lpgemm_cntx_init code to generate main and fringe kernels
for 6x64 bf16 microkernel and store function pointers to all the
generated kernels in a global function pointer array. This happens
only when gcc version is < 11.2
- When gcc version < 11.2, microkernel uses JIT-generated kernels.
otherwise, microkernel uses the intrinsics based implementation.
AMD-Internal: [SWLCSG-2622]
Change-Id: I16256c797b2546a8cd2049680001947346260461
1. Added Trans A feature to handle column major inputs
for A matrix.
2. Trans A is enabled by on-the-go pack of A matrix.
3. The on-the-go pack of A converts a column storage
MCxKC block of A into row storage MCxKC block as
LPGEMM kernels are row major kernels.
4. New pack routines are added for conversion of A matrix
from column major storage to row major storage.
5. LPGEMM Cntx is updated with pack kernel function
pointers.
6. Packing of A matrix:
- Converts column major input A to row major
in blocks of MCxKC with newly added pack A
functions when cs_a > 1.
7. Pack routines are added for AVX512 and AVX2
INT8 LPGEMM APIs.
8. Trans A feature is now supported in:
1. u8s8s32os32/os8
2. u8s8s16os16/os8/ou8
3. s8s8s32os32/os8
4. s8s8s16os16/os8
AMD-Internal: SWLCSG-2582
Change-Id: I7ce331545525a9a09f3853280615b55fcf2edabf
Details:
- Added new params(order, trans) to aocl_get_reorder_buf_size_ and
aocl_reorder_ APIs.
- Added new pack kernels that packs A matrix from either row-major or
column major input matrix to pack buffer with row-major format.
- Updated cntx with pack kernel function pointers for packing A matrix.
- Transpose of A matrix is handled by packing A matrix to row-major
format during run-time.
- Updated Early-return check conditions to account for trans parameters.
- Updated bench file to test/benchmark transpose support.
AMD-Internal: [SWLCSG-2268, SWLCSG-2442]
Change-Id: I43a113dc4bc11e6bb7cc4d768c239a16cb6bbea4
1. New LPGEMM type - s8s8s16os16 and s8s8s16os8 are added.
2. New interface, frame and kernel files are added.
3. Frame and kernel level files added and modified for s8s8s16
4. s8s8s16 type involves design changes of 2 operations -
Pack B and Mat Mul
5. Pack B kernel routines to pack B matrix for s16 FMA and compute the
sum of every column of B matrix to implement the s8s8s16 operation
using the s16 FMA instructions.
5. Mat Mul Kernel files to compute the GEMM output using s16 FMA.
Here the A matrix elements are converted from int8 to uint8 (s16 FMA
works with A matrix type uint8 only) by adding extra 128 to
every A matrix element
6. Post GEMM computation, additional operations are performed on the
accumulated outputs to get the correct results.
Final C = C - ( (sum of column of B matrix) * 128 )
This is done to compensate for the addition of extra 128 to every
A matrix elements
7. With this change, two new LPGEMM APIs are introduced in LPGEMM -
s8s8s16os16 and s8s8s16os8.
8. All previously added post-ops are supported on s8s8os16/os8 also.
AMD-Internal: [CPUPL-3234]
Change-Id: I3cc23e3dcf27f215151dda7c8db29b3a7505f05c
-Currently in aocl_gemm, gelu (both tanh and erf based) computation is
only supported as a post-op as part of low precision gemm api call (done
at micro-kernel level). However gelu computation alone without gemm is
required in certain cases for users of aocl_gemm.
-In order to support this, two new api's - aocl_gelu_tanh_f32 and
aocl_gelu_erf_f32 are introduced as part of aocl_gemm. These api's
computes element-wise gelu_tanh and gelu_erf respectively of a matrix/
vector of floats. Both the api's invokes ISA specific vectorized micro-
kernels (vectorized only when incx=1), and a cntx based mechanism
(similar to lpgemm_cntx) is used to dispatch to the appropriate kernel.
AMD-Internal: [CPUPL-3218]
Change-Id: Ifebbaf5566d7462288a9a67f479104268b0cc704
1. New LPGEMM type - S8S8S32/S8 is added.
2. New interface, frame and kernel files are added.
3. Frame and kernel files added/modified for S8S8S32/S8 have
2 operations - Pack B and Mat Mul
4. Pack B kernel routines to pack B matrix for VNNI and compute the sum
of every column of B matrix to implement the S8S8S32 operation using
the VNNI instructions.
5. Mat Mul Kernel files to compute the GEMM output using the VNNI.
Here the A matrix elements are converted from int8 to uint8 (VNNI
works with A matrix type uint8 only).
6. Post GEMM computation, additional operations are performed on the
accumulated outputs to get the correct results.
7. With this change, two new LPGEMM APIs are introduced in LPGEMM -
s8s8s32os32 and s8s8s32os8.
8. All previously added post-ops are supported on S8S8S32/S8 also.
AMD-Internal: [CPUPL-3154]
Change-Id: Ib18f82bde557ea4a815a63adc7870c4234bfb9d3
-Currently lpgemm can only be built using either zen3 or zen4 config.
The lpgemm kernel code is re-structured to support amdzen, and thus
multi machine deployment.
-The micro-kernel calls (gemm and pack) are currently hardcoded in the
lpgemm framework. This is removed and a new lpgemm_cntx based dispatch
mechanism is designed to support runtime configurability for
micro-kernels.
AMD-Internal: [CPUPL-2965]
Change-Id: I4bbcb4e5db767def1663caf5481f0b4c988149ef
-The micro-kernel function signatures follow a common pattern. These
functions can be represented as an instantiation of a MACRO as is done
in BLIS, and thus the number of micro-kernel header files can be brought
down. A new single header file containing all the MACRO definitions with
the instantiation is added, and the existing unnecessary header files
are removed.
-The bias addition in micro-kernel for n remaining < 16 reads the bias
array assuming it contains 16 elements. This can result in seg-faults,
since out of bound memory is accessed. It is fixed by copying required
elements to an intermediate buffer and using that buffer for loading.
-Input matrix storage type parameter is added to lpgemm APIs. It can be
either row or column major, denoted by r and c respectively. Currently
only row major input matrices are supported.
-Bug fix in s16 fringe micro-kernel to use correct offset while storing
output.
AMD-Internal: [CPUPL-2386]
Change-Id: Idfa23e69d54ad7e06a67b1e36a5b5558fbff03a3
- Low precision gemm is often used in ML/DNN workloads and is used
in conjunction with pre and post operations. Performing gemm and ops
together at the micro kernel level results in better overall performance
due to cache/register reuse of output matrix. The provision for defining
the post-operations and invoking the micro-kernel with it from the
framework is added as part of this change. This includes adding new data
structures/functions to define the post-ops to be applied and an
extensible template using which new post-ops can easily be integrated.
As for the post-operations, RELU and Bias Add for u8s8s32 is implemented
in this first cut.
- aocl_gemm bench modifications to test/benchmark RELU and Bias Add.
AMD-Internal: [CPUPL-2316]
Change-Id: Iad5fe9e54965bb52d5381ae459a69800946c7d18
Feature Addition : Added low precision GEMM to addon. The
kernel takes unsigned int8 and signed int8 as inputs and
performs GEMM operation. The intermediate accumulation and
output are in signed int16.
- The compute kernel will perform computation only
if B matrix reordered to suit the usage of AVX2
instruction vpmaddubsw.
- Kernel for packing the B matrix is provided.
- LPGEMM bench code was modified to test the
performance and accuracy of the new variant.
AMD-Internal: [CPUPL-2171]
Change-Id: Id9a6d90b79f4bf82fb2e2f3093974dbf37275f9b
- Multi-Threaded int8 GEMM (Input - uint8_t, int8_t, Output - int32_t).
AVX512_vnni based micro-kernel for int8 gemm. Paralellization supported
along m and n dimensions.
- Multi-Threaded B matrix reorder support for sgemm. Reordering B matrix
is packing entire B matrix upfront before sgemm. It allows sgemm to
take advantage of packed B matrix without incurring packing costs during
runtime.
- Makefile updates to addon make rules to compile avx512 code for
selected files in addon folder.
- CPU features query enhancements to check for AVX512_VNNI flag.
- Bench for int8 gemm and sgemm with B matrix reorder. Supports
performance mode for benchmarking and accuracy mode for testing code
correctness.
AMD-Internal: [CPUPL-2102]
Change-Id: I8fb25f5c2fbd97d756f95b623332cb29e3b8d182