- Added early return checks for the A/B transpose and column-major
  cases, as these are not currently supported.
- Enabled the JIT kernels for the Zen4 architecture.
AMD-Internal: [SWLCSG-3281]
Change-Id: Ie671676c51c739dd18709892414fd34d26a540df
- Currently the BF16 kernels use the AVX512 VNNI instructions. In order
  to support AVX2 kernels, the BF16 input has to be converted to F32
  and then the F32 kernels have to be executed.
- Added an unpack function for the B matrix, which unpacks the
  reordered BF16 B matrix and converts it to float.
- Added a kernel to convert the matrix data from BF16 to F32 for the
  given input (a scalar sketch of the conversion follows this list).
- Added a new path in the BF16 5-loop to work with BF16 data, where the
  packed/unpacked A matrix is converted from BF16 to F32. The packed B
  matrix is converted from BF16 to F32, and the reordered B matrix is
  un-reordered and converted to F32 before being fed to the F32
  micro-kernels.
- Removed AVX512 condition checks in the BF16 code path.
- Added the reorder reference code path to support BF16 AVX2.
- Currently the F32 AVX2 kernels support only an F32 bias. Added BF16
  support for the bias post-op in the F32 AVX2 kernels.
- Bug fix in the test input generation script.
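A minimal scalar sketch of the BF16 -> F32 widening used by this path
(illustrative only; the actual kernels would vectorize this):

    #include <stdint.h>
    #include <string.h>

    /* Widen one bf16 value (raw 16-bit pattern) to f32 by placing its
       bits in the upper half of the 32-bit word; the lower mantissa
       bits become zero. */
    static inline float bf16_to_f32(uint16_t bf16_bits)
    {
        uint32_t f32_bits = ((uint32_t)bf16_bits) << 16;
        float f32;
        memcpy(&f32, &f32_bits, sizeof(f32));
        return f32;
    }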
AMD-Internal: [SWLCSG-3281]
Change-Id: I1f9d59bfae4d874bf9fdab9bcfec5da91eadb0fb
-The following S16 APIs are removed:
1. aocl_gemm_u8s8s16os16
2. aocl_gemm_u8s8s16os8
3. aocl_gemm_u8s8s16ou8
4. aocl_gemm_s8s8s16os16
5. aocl_gemm_s8s8s16os8
along with the associated reorder APIs and corresponding
framework elements.
AMD-Internal: [CPUPL-6412]
Change-Id: I251f8b02a4cba5110615ddeb977d86f5c949363b
- Added Downscale, tanh and sigmoid post-op support to the JIT kernels
  (scalar references for the activations follow this list).
- Masked the bf16s4 kernel call while the JIT kernels are enabled to
  avoid a compile-time error.
- Added optional support for B prefetch in the JIT kernels.
- Resolved the visibility issues with the global variable
  jit_krnels_generated.
- Modified the array generation for the scale and zp values in the bench.
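Scalar references for the activation post-ops added above (illustrative
only; the JIT kernels use vectorized implementations):

    #include <math.h>

    /* Element-wise activations applied on the GEMM output as post-ops. */
    static inline float tanh_postop(float x)    { return tanhf(x); }
    static inline float sigmoid_postop(float x) { return 1.0f / (1.0f + expf(-x)); }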
Change-Id: I09b8afc843f51ac23645e02f210a2c13d3af804d
Description:
1. Added group quantization and zero-point (zp) support in the
   aocl_gemm_bf16s4f32o<bf16|f32> API.
2. Group quantization is a technique to improve accuracy where the
   scale factors used to dequantize the weights vary at group level
   instead of at per-channel or per-tensor level (see the sketch after
   this list).
3. Added zp and scaling in the WoQ packb kernels so that, for large M
   values, zp and scaling are performed at the pack-B stage and the
   bf16 kernels are called.
4. Adding zp support and scaling to the default path in the WoQ kernels
   created some performance overhead when the M value is very small.
5. Added the string group_size to the lpgemm bench to read the group
   size from bench_input.txt, and tested various combinations of matrix
   dimensions.
6. The scale factors can be of type float or bf16, and the zero-point
   values are expected to be in int8 format.
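An illustrative sketch of group-level dequantization as described in
item 2 (function and parameter names are assumptions, not the library
API); each group of group_size consecutive k elements shares one scale
and one zero point:

    #include <stdint.h>

    /* Dequantize one column of already-unpacked signed 4-bit weights
       using per-group scales and int8 zero points. */
    static void dequant_s4_column(const int8_t *w_s4,    /* k signed 4-bit values, one per byte */
                                  const float  *scales,  /* one scale per group */
                                  const int8_t *zp,      /* one zero point per group */
                                  float        *out,
                                  int k, int group_size)
    {
        for (int i = 0; i < k; ++i)
        {
            int g  = i / group_size;   /* group index along k */
            out[i] = scales[g] * (float)(w_s4[i] - zp[g]);
        }
    }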
AMD-Internal: [SWLCSG-3168, SWLCSG-3172]
Change-Id: Iff07b54d76edc7408eb2ea0b29ce8b4a04a38f57
Description:
1. Wrote 6x64 main and fringe kernels for WoQ where the scaling of s4
   weights into bf16 is performed in the kernel itself to reduce
   bandwidth.
2. These kernels perform better than using bf16 weights when m is small
   and n is large.
3. Established a threshold to decide whether the quantization handling
   is done at the pack-B (KCxNC) level or at the WoQ kernel level (a
   dispatch sketch follows this list).
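An illustrative dispatch sketch for the threshold in item 3 (the names
and the threshold variable are assumptions, not the library code):

    #include <stddef.h>

    enum woq_path
    {
        WOQ_IN_KERNEL_DEQUANT,   /* scale s4 -> bf16 inside the compute kernel */
        WOQ_PACKB_DEQUANT        /* dequantize while packing B (KCxNC), reuse bf16 kernels */
    };

    static enum woq_path choose_woq_path(size_t m, size_t small_m_threshold)
    {
        /* small m: B reuse is low, so the in-kernel conversion wins;
           larger m: pay the pack-B dequantization once and reuse it */
        return (m <= small_m_threshold) ? WOQ_IN_KERNEL_DEQUANT
                                        : WOQ_PACKB_DEQUANT;
    }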
Change-Id: I4f8265b8b58c276ff2590cc948d1f920aa0bb289
Details:
- In WoQ, if m = 4, special-case kernels are added where the s4->bf16
  conversion happens inside the compute kernel and packing is avoided.
  For all other cases, the B matrix is dequantized and packed at the KC
  loop level and the native bf16 kernels are reused at the compute
  level.
- Fixes in the bench to avoid accuracy failures when the output
  datatype is bf16.
Change-Id: Ie8db42da536891693d5e82a5336b66514a50ccb2
- When n = 1, the reorder of the B matrix is avoided so that the data
  is processed efficiently. A dot-product based kernel is implemented
  to perform the GEMV when n == 1.
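A minimal scalar reference of the n = 1 GEMV (illustrative only, not
the optimized dot-product kernel): each C element is a dot product of
one row of A with the single column of B, so no reorder of B is needed.

    /* C (m x 1) = alpha * A (m x k, row stride lda) * B (k x 1) + beta * C */
    static void gemv_n1_ref(const float *a, int lda, const float *b,
                            float *c, int m, int k, float alpha, float beta)
    {
        for (int i = 0; i < m; ++i)
        {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += a[i * lda + p] * b[p];
            c[i] = alpha * acc + beta * c[i];
        }
    }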
AMD-Internal: [SWLCSG-2354]
Change-Id: I6b73dfddd9a15e7b914d031646a1d913a7ab4761
- When n = 1, the reorder of the B matrix is avoided so that the data
  is processed efficiently. A dot-product based kernel is implemented
  to perform the GEMV when n = 1.
AMD-Internal: [SWLCSG-2354]
Change-Id: If5f74651ab11232d0b87d34bd05f65aacaea94f1
- Implemented optimized lpgemv for both m == 1 and n == 1 cases.
- Fixed a few bugs in LPGEMV for the bf16 and f32 datatypes.
- Fixed a few bugs in the JIT-based implementation of LPGEMM for the
  BF16 datatype.
AMD-Internal: [SWLCSG-2354]
Change-Id: I245fd97c8f160b148656f782d241f86097a0cf38
1. The 5-loop LPGEMM path is inefficient when A or B is a vector
   (i.e., m == 1 or n == 1).
2. An efficient implementation is developed that accounts for the B
   matrix reorder in the m = 1 case and for post-ops fusion.
3. When m = 1, the algorithm divides the GEMM workload along the n
   dimension at a granularity of NR (a thread-split sketch follows this
   list). Each thread works on A:1xk and B:kx(>=NR) and produces
   C:1x(>=NR). K is unrolled by 4, along with a remainder loop.
4. When n = 1, the algorithm divides the GEMM workload along the m
   dimension at a granularity of MR. Each thread works on A:(>=MR)xk
   and B:kx1 and produces C:(>=MR)x1. When n = 1, reordering of B is
   avoided so that the data is processed efficiently in the n = 1
   kernel.
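An illustrative sketch of the m = 1 thread split at NR granularity
mentioned in item 3 (the NR value and function name are assumptions,
not the library code):

    #define NR 64   /* illustrative panel width */

    /* Compute the [n_start, n_start + n_len) slice of the n dimension
       owned by thread 'tid' when the work is split in whole NR panels. */
    static void lpgemv_split_n(int n, int n_threads, int tid,
                               int *n_start, int *n_len)
    {
        int n_panels   = (n + NR - 1) / NR;
        int per_thread = n_panels / n_threads;
        int rem        = n_panels % n_threads;   /* first 'rem' threads get one extra panel */
        int first      = tid * per_thread + (tid < rem ? tid : rem);
        int count      = per_thread + (tid < rem ? 1 : 0);

        *n_start = first * NR;
        *n_len   = count * NR;
        if (*n_start >= n)              { *n_start = n; *n_len = 0; }   /* no work for this thread */
        else if (*n_start + *n_len > n) { *n_len = n - *n_start; }      /* clamp the tail panel */
    }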
AMD-Internal: [SWLCSG-2355]
Change-Id: I7497dad4c293587cbc171a5998b9f2817a4db880
Details:
- Added a new folder named JIT/ under addon/aocl_gemm/. This folder
  will contain all the JIT-related code.
- Modified the lpgemm_cntx_init code to generate main and fringe
  kernels for the 6x64 bf16 microkernel and store function pointers to
  all the generated kernels in a global function pointer array. This
  happens only when the gcc version is < 11.2.
- When the gcc version is < 11.2, the microkernel path uses the
  JIT-generated kernels; otherwise, it uses the intrinsics-based
  implementation (a selection sketch follows this list).
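A hypothetical sketch of the run-time kernel selection described above
(the table layout, names and signature are assumptions, not the addon
code):

    #include <stddef.h>

    typedef void (*lpgemm_bf16_ukr_ft)(void);   /* real kernels take the usual GEMM arguments */

    /* Filled by the context init with pointers to the JIT-generated
       main and fringe kernels; entries stay NULL when the JIT kernels
       are not generated. */
    static lpgemm_bf16_ukr_ft jit_bf16_ukr_table[8];

    static lpgemm_bf16_ukr_ft select_bf16_ukr(int idx,
                                              lpgemm_bf16_ukr_ft intrinsics_ukr)
    {
        /* use the JIT kernel when it was generated, otherwise fall
           back to the intrinsics-based implementation */
        return (jit_bf16_ukr_table[idx] != NULL) ? jit_bf16_ukr_table[idx]
                                                 : intrinsics_ukr;
    }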
AMD-Internal: [SWLCSG-2622]
Change-Id: I16256c797b2546a8cd2049680001947346260461
1. The 5-loop LPGEMM path is inefficient when A or B is a vector
   (i.e., m == 1 or n == 1).
2. An efficient implementation of lpgemv_rowvar_f32 is developed that
   accounts for the B matrix reorder in the m = 1 case and for post-ops
   fusion.
3. When m = 1, the algorithm divides the GEMM workload along the n
   dimension at a granularity of NR. Each thread works on A:1xk and
   B:kx(>=NR) and produces C:1x(>=NR). K is unrolled by 4, along with a
   remainder loop.
4. When n = 1, the algorithm divides the GEMM workload along the m
   dimension at a granularity of MR. Each thread works on A:(>=MR)xk
   and B:kx1 and produces C:(>=MR)x1. When n = 1, reordering of B is
   avoided so that the data is processed efficiently in the n = 1
   kernel.
5. Fixed a few warnings when loading 2 f32 bias elements with
   _mm_load_sd from a float pointer, by typecasting to
   (const double *).
AMD-Internal: [SWLCSG-2391, SWLCSG-2353]
Change-Id: If1d0b8d59e0278f5f16b499de1d629e63da5b599
1. New LPGEMM types - s8s8s16os16 and s8s8s16os8 - are added.
2. New interface, frame and kernel files are added.
3. Frame- and kernel-level files are added and modified for s8s8s16.
4. The s8s8s16 type involves design changes in 2 operations -
   Pack B and Mat Mul.
5. Pack B kernel routines pack the B matrix for the s16 FMA and compute
   the sum of every column of the B matrix to implement the s8s8s16
   operation using the s16 FMA instructions.
6. Mat Mul kernel files compute the GEMM output using the s16 FMA.
   Here the A matrix elements are converted from int8 to uint8 (the s16
   FMA works with A matrix type uint8 only) by adding 128 to every
   A matrix element.
7. Post GEMM computation, additional operations are performed on the
   accumulated outputs to get the correct results:
   Final C = C - ( (sum of column of B matrix) * 128 )
   This compensates for the 128 added to every A matrix element (a
   scalar sketch follows this list).
8. With this change, two new APIs are introduced in LPGEMM -
   s8s8s16os16 and s8s8s16os8.
9. All previously added post-ops are supported on s8s8s16os16/os8 also.
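A scalar sketch of the shift-and-compensate scheme in items 6 and 7
(illustrative only, not the s16 FMA kernel):

    #include <stdint.h>

    /* One s8 x s8 dot product computed via unsigned A, as the s16 FMA
       requires: A is shifted to u8 by adding 128, and 128 * (sum of the
       B column) is subtracted afterwards to undo the shift. */
    static int32_t s8s8_dot_via_u8(const int8_t *a, const int8_t *b, int k)
    {
        int32_t acc = 0, b_col_sum = 0;
        for (int p = 0; p < k; ++p)
        {
            uint8_t a_u8 = (uint8_t)(a[p] + 128);      /* s8 -> u8 shift */
            acc       += (int32_t)a_u8 * (int32_t)b[p];
            b_col_sum += (int32_t)b[p];
        }
        return acc - 128 * b_col_sum;   /* Final C = C - 128 * colsum(B) */
    }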
AMD-Internal: [CPUPL-3234]
Change-Id: I3cc23e3dcf27f215151dda7c8db29b3a7505f05c
- The n fringe micro-kernels use only a few zmm registers for computing
  the output (e.g., 6x16 uses 6 zmm registers for the output as opposed
  to the 24 used in 6x64). This results in a lot of wasted registers
  that, if utilized, can help increase the MR dimension and thus
  improve the reuse of the registers loaded with B. Based on this
  concept, the existing n fringe kernels are modified (6x16 -> 12x16,
  6x32 -> 9x32). Note that the maximum number of registers is not used,
  since that results in cache-inefficient code due to the increase in
  MR and thus more broadcasts required from the unpacked A matrix.
- Compiler flag updates for the AOCC build to generate loops with
  64-byte alignment. This has been observed to improve performance
  slightly when the k dimension is small.
AMD-Internal: [CPUPL-3173]
Change-Id: I199ce75ef71d994ffe0067dac1ed804dce1742ca
1. New LPGEMM type - S8S8S32/S8 is added.
2. New interface, frame and kernel files are added.
3. The frame and kernel files added/modified for S8S8S32/S8 cover
   2 operations - Pack B and Mat Mul.
4. Pack B kernel routines pack the B matrix for VNNI and compute the
   sum of every column of the B matrix to implement the S8S8S32
   operation using the VNNI instructions.
5. Mat Mul kernel files compute the GEMM output using VNNI. Here the
   A matrix elements are converted from int8 to uint8 (VNNI works with
   A matrix type uint8 only).
6. Post GEMM computation, additional operations are performed on the
accumulated outputs to get the correct results.
7. With this change, two new LPGEMM APIs are introduced in LPGEMM -
s8s8s32os32 and s8s8s32os8.
8. All previously added post-ops are supported on S8S8S32/S8 also.
AMD-Internal: [CPUPL-3154]
Change-Id: Ib18f82bde557ea4a815a63adc7870c4234bfb9d3
Description:
1. Developed row-variant intrinsic kernels for float32/sgemm, which
   are called from the lpgemm API aocl_gemm_f32f32f32of32().
2. 6x64m, 6x48m and 6x32m kernels and the respective fringe kernels
   are developed using AVX512.
3. The 6x16m main kernel and the respective n fringe and mn fringe
   kernels are developed based on AVX2 and AVX.
4. Modularization, K-loop unrolling, perf tuning, post-ops and dynamic
   dispatch are planned next.
5. bench_lpgemm needs to be updated to test cases where the leading
   dimensions are greater than the matrix dimensions; this is planned
   next.
Change-Id: I54c78fef639ea109d6ef2c2b05c07ce396c81370
- The post-operation attributes are moved to a new struct,
  lpgemm_post_op_attr, and an object of this struct is passed to the
  low precision gemm kernels in place of the multiple parameters.
- The u8s8s32s8 API (downscale API) performance is low when the k
  value is small (k < KC). Two scenarios are observed here:
  a. beta = 0: Currently, for the downscale API, a temporary buffer is
  used to accumulate the intermediate s32 output so that it can be
  used in later iterations of the pc loop (k dim). The use of this
  buffer (store) can be avoided if k < KC: intermediate accumulation is
  not required, since after the first iteration of the pc loop the
  output can be downscaled and stored.
  b. beta != 0: In this case the existing values of the original s8 C
  output matrix need to be converted to s32 and beta scaled. Currently
  the s8 values are converted to s32 and stored in a temporary buffer
  in the pc loop (5-loop algorithm) in blocks of mxNC. This temporary
  buffer is passed to the micro-kernel and beta scaling is applied to
  it. However, the mxNC block copy is costly and can be avoided by
  introducing a new beta-scaling condition in the micro-kernel, whereby
  the original s8 data is loaded into a register directly (instead of
  from the temporary buffer), converted to s32 and beta scaled (a
  scalar sketch follows).
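A scalar sketch of the proposed beta path for the downscale case
(illustrative only; the scaling and rounding details are assumptions):

    #include <stdint.h>
    #include <math.h>

    /* Apply beta on the original s8 C element directly, then downscale
       the result back to s8, avoiding the m x NC temporary s32 copy. */
    static int8_t beta_downscale_s8(int32_t acc,       /* s32 accumulation of the current block */
                                    int8_t  c_orig,    /* existing s8 C element */
                                    int32_t beta,
                                    float   out_scale) /* downscale factor */
    {
        int32_t with_beta = acc + beta * (int32_t)c_orig;   /* s8 widened to s32, beta scaled */
        int32_t q = (int32_t)nearbyintf((float)with_beta * out_scale);
        if (q >  127) q =  127;                             /* saturate to the s8 range */
        if (q < -128) q = -128;
        return (int8_t)q;
    }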
AMD-Internal: [CPUPL-2884]
Change-Id: Id9b4650d500e1b553e48c4f1e4c902b3f553211c
- Downscaling is used when GEMM output is accumulated at a higher
precision and needs to be converted to a lower precision afterwards.
This is required in AI workloads where quantization/dequantization
routines are used.
- New GEMM APIs are introduced specifically to support this use case.
Currently downscaling support is added for s32, s16 and bfloat16 GEMM.
AMD-Internal: [CPUPL-2475]
Change-Id: I81c3ee1ba5414f62427a7a0abb6ecef0c5ff71bf
Feature Addition: Added a new variant of low precision GEMM to the
addon - BFloat16. The kernels take bf16 inputs and perform the BF16
GEMM operation. The intermediate accumulation and output are in float.
1. The compute kernels perform computations only if the B matrix is
   reordered in accordance with the usage of the AVX-512 BF16
   instruction dpbf16_ps (an illustrative use follows this list).
2. A kernel for packing the B matrix is provided.
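An illustrative use of the dpbf16_ps instruction via its intrinsic
(requires AVX512-BF16 support, e.g. -mavx512bf16); this is not the
addon's kernel code:

    #include <immintrin.h>

    /* Each 32-bit lane of 'acc' accumulates the product of a pair of
       bf16 values from 'a' and 'b', which is why B must be packed with
       consecutive k elements interleaved in pairs. */
    static inline __m512 bf16_pair_dot(__m512 acc, __m512bh a, __m512bh b)
    {
        return _mm512_dpbf16_ps(acc, a, b);
    }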
Change-Id: If5d08213068869eff060c9998596d2d2703a6793
- The micro-kernel function signatures follow a common pattern. These
  functions can be represented as instantiations of a MACRO, as is done
  in BLIS, and thus the number of micro-kernel header files can be
  brought down. A new single header file containing all the MACRO
  definitions with the instantiations is added, and the existing
  unnecessary header files are removed (a sketch of the pattern follows
  this list).
- The bias addition in the micro-kernel for n remainder < 16 reads the
  bias array assuming it contains 16 elements. This can result in
  seg-faults, since out-of-bounds memory is accessed. It is fixed by
  copying the required elements to an intermediate buffer and loading
  from that buffer.
- An input matrix storage type parameter is added to the lpgemm APIs.
  It can be either row or column major, denoted by r and c
  respectively. Currently only row-major input matrices are supported.
- Bug fix in the s16 fringe micro-kernel to use the correct offset
  while storing the output.
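A sketch of the MACRO-based declaration pattern mentioned in the first
item (the macro name and signature are assumptions, not the actual
header):

    /* One macro emits the prototype for every micro-kernel that shares
       the common signature; each kernel is then a single instantiation. */
    #define LPGEMM_UKR_PROTO(name)                              \
            void name( const void* a, const void* b, void* c,   \
                       long m, long n, long k,                  \
                       const void* post_op_list );

    LPGEMM_UKR_PROTO(lpgemm_ukr_6x64)   /* hypothetical instantiations */
    LPGEMM_UKR_PROTO(lpgemm_ukr_6x32)
    LPGEMM_UKR_PROTO(lpgemm_ukr_6x16)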
AMD-Internal: [CPUPL-2386]
Change-Id: Idfa23e69d54ad7e06a67b1e36a5b5558fbff03a3