Description:
aocl_reorder_f32obf16 function is implemented to
reorder input weight matrix of data type float to
bfloat16.
The reordering is done to match the input requirements
of API aocl_gemm_bf16bf16f32o<f32|bf16>.
The objective of the API is to convert a model/matrix
of type f32 to bf16 and process when machine supports
bf16 FMA instruction _mm512_dpbf16_ps but the model
is still in float
Change-Id: Ib7c743d52d01a1ac09e84ac120577ec9e02f90f5
-Currently lpgemm sets the context (block sizes and micro-kernels) based
on the ISA of the machine it is being executed on. However this approach
does not give the flexibility to select a different context at runtime.
In order to enable runtime selection of context, the context
initialization is modified to read the AOCL_ENABLE_INSTRUCTIONS env
variable and set the context based on the same. As part of this commit,
only f32 context selection is enabled.
-Bug fixes in scale ops in f32 micro-kernels and GEMV path selection.
-Added vectorized f32 packing kernels for NR=16(AVX2) and NR=64(AVX512).
This is only for B matrix and helps remove dependency of f32 lpgemm api
on the BLIS packing framework.
AMD Internal: [CPUPL-5959]
Change-Id: I4b459aaf33c54423952f89905ba43cf119ce20f6
Details:
- In WOQ, if m = 4, special case kernels are added where
s4->bf16 conversion happens inside the compute kernel and
packing is avoided. For all other cases, B matrix is
dequantized and packed at KC loop level and native bf16
kernels are re-used at compute level.
- Fixes in bench to avoid accuracy failures when datatype of
output is bf16.
Change-Id: Ie8db42da536891693d5e82a5336b66514a50ccb2
-This API supports applying element wise operations (eg: post-ops) on a
bfloat16 input matrix to get an output matrix of the same(bfloat16) or
upscaled data type (float).
-Benchmarking/testing framework for the same is added.
AMD Internal: SWLCSG-2947
Change-Id: I43f1c269be1a1997d4912d8a3a97be5e5f3442d2
- Fixed framework of bf16s4f32of32 API to correct
pointer updations.
- Modified pre_op structure to exclude pre-op-offset.
Now offset is passed as a separate parameter to the
scale-pack functions.
- Fixed work-distribution among threads in MT scenario.
- Added Blocksizes and kernel-pointers and verified
functionality for the new API.
AMD-Internal: [SWLCSG-2943]
Change-Id: I58fece240d62c798c880a2b2b7fa64e560cc753d
Support for reordering B matrix of datatype int4 as per the pack schema
requirements of u8s8s32 kernel. Vectorized int4_t -> int8_t conversion
implemented via leveraging the vpmultishiftqb instruction. The reordered
B matrix will then be used in the u8s8s32o<s32|s8> api.
AMD-Internal: [SWLCSG-2390]
Change-Id: I3a8f8aba30cac0c4828a31f1d27fa1b45ea07bba
- When n=1, reorder of B matrix is avoided to efficiently
process data. A dot-product based kernel is implemented to
perform gemv when n=1.
AMD-Internal: [SWLCSG-2354]
Change-Id: If5f74651ab11232d0b87d34bd05f65aacaea94f1
-As it stands the bf16bf16f32ob16 API expects bias array to be of type
float. However actual use case requires the usage of bias array of bf16
type. The bf16 micro-kernels are updated to work with bf16 bias array by
upscaling it to float type and then using it in the post-ops workflow.
-Corrected register usage in bf16 JIT generator for bf16bf16f32ob16 API
when k > KC.
AMD-Internal: [SWLCSG-2604]
Change-Id: I404e566ff59d1f3730b569eb8bef865cb7a3b4a1
1. New LPGEMM type - s8s8s16os16 and s8s8s16os8 are added.
2. New interface, frame and kernel files are added.
3. Frame and kernel level files added and modified for s8s8s16
4. s8s8s16 type involves design changes of 2 operations -
Pack B and Mat Mul
5. Pack B kernel routines to pack B matrix for s16 FMA and compute the
sum of every column of B matrix to implement the s8s8s16 operation
using the s16 FMA instructions.
5. Mat Mul Kernel files to compute the GEMM output using s16 FMA.
Here the A matrix elements are converted from int8 to uint8 (s16 FMA
works with A matrix type uint8 only) by adding extra 128 to
every A matrix element
6. Post GEMM computation, additional operations are performed on the
accumulated outputs to get the correct results.
Final C = C - ( (sum of column of B matrix) * 128 )
This is done to compensate for the addition of extra 128 to every
A matrix elements
7. With this change, two new LPGEMM APIs are introduced in LPGEMM -
s8s8s16os16 and s8s8s16os8.
8. All previously added post-ops are supported on s8s8os16/os8 also.
AMD-Internal: [CPUPL-3234]
Change-Id: I3cc23e3dcf27f215151dda7c8db29b3a7505f05c
-Currently in aocl_gemm, gelu (both tanh and erf based) computation is
only supported as a post-op as part of low precision gemm api call (done
at micro-kernel level). However gelu computation alone without gemm is
required in certain cases for users of aocl_gemm.
-In order to support this, two new api's - aocl_gelu_tanh_f32 and
aocl_gelu_erf_f32 are introduced as part of aocl_gemm. These api's
computes element-wise gelu_tanh and gelu_erf respectively of a matrix/
vector of floats. Both the api's invokes ISA specific vectorized micro-
kernels (vectorized only when incx=1), and a cntx based mechanism
(similar to lpgemm_cntx) is used to dispatch to the appropriate kernel.
AMD-Internal: [CPUPL-3218]
Change-Id: Ifebbaf5566d7462288a9a67f479104268b0cc704
1. New LPGEMM type - S8S8S32/S8 is added.
2. New interface, frame and kernel files are added.
3. Frame and kernel files added/modified for S8S8S32/S8 have
2 operations - Pack B and Mat Mul
4. Pack B kernel routines to pack B matrix for VNNI and compute the sum
of every column of B matrix to implement the S8S8S32 operation using
the VNNI instructions.
5. Mat Mul Kernel files to compute the GEMM output using the VNNI.
Here the A matrix elements are converted from int8 to uint8 (VNNI
works with A matrix type uint8 only).
6. Post GEMM computation, additional operations are performed on the
accumulated outputs to get the correct results.
7. With this change, two new LPGEMM APIs are introduced in LPGEMM -
s8s8s32os32 and s8s8s32os8.
8. All previously added post-ops are supported on S8S8S32/S8 also.
AMD-Internal: [CPUPL-3154]
Change-Id: Ib18f82bde557ea4a815a63adc7870c4234bfb9d3
-Currently lpgemm can only be built using either zen3 or zen4 config.
The lpgemm kernel code is re-structured to support amdzen, and thus
multi machine deployment.
-The micro-kernel calls (gemm and pack) are currently hardcoded in the
lpgemm framework. This is removed and a new lpgemm_cntx based dispatch
mechanism is designed to support runtime configurability for
micro-kernels.
AMD-Internal: [CPUPL-2965]
Change-Id: I4bbcb4e5db767def1663caf5481f0b4c988149ef