-Currently lpgemm can only be built using either the zen3 or zen4
config. The lpgemm kernel code is restructured to support the amdzen
config, and thus multi-machine deployment.
-The micro-kernel calls (gemm and pack) are currently hardcoded in the
lpgemm framework. These hardcoded calls are removed, and a new
lpgemm_cntx-based dispatch mechanism is designed to support runtime
configurability of micro-kernels.
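For reference, a minimal sketch of such a cntx-style dispatch table
(names and fields are illustrative; the actual lpgemm_cntx layout
differs):

    #include <stdint.h>

    typedef void (*demo_gemm_ft)( void* params );  /* gemm micro-kernel */
    typedef void (*demo_pack_ft)( void* params );  /* pack micro-kernel */

    typedef struct
    {
        demo_gemm_ft gemm_kernel;
        demo_pack_ft pack_kernel;
        int64_t      MR;            /* register block sizes */
        int64_t      NR;
    } demo_cntx_t;

    static void demo_gemm_avx512( void* p ) { (void)p; /* zen4 path */ }
    static void demo_gemm_avx2  ( void* p ) { (void)p; /* zen3 path */ }
    static void demo_pack_common( void* p ) { (void)p; }

    /* Kernels are selected once at runtime from the detected ISA,
     * instead of being hardcoded at the call sites. */
    static void demo_cntx_init( demo_cntx_t* cntx, int has_avx512 )
    {
        cntx->gemm_kernel = has_avx512 ? demo_gemm_avx512 : demo_gemm_avx2;
        cntx->pack_kernel = demo_pack_common;
        cntx->MR          = 6;
        cntx->NR          = has_avx512 ? 64 : 16;
    }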
AMD-Internal: [CPUPL-2965]
Change-Id: I4bbcb4e5db767def1663caf5481f0b4c988149ef
Description:
1. Developed row-variant intrinsic kernels for float32/sgemm,
which are called from the lpgemm API aocl_gemm_f32f32f32of32()
2. 6x64m, 6x48m, 6x32m kernels and their respective fringe kernels
are developed using AVX-512.
3. 6x16m main kernel and the respective n fringe and mn fringe
kernels are developed based on AVX2 and AVX.
4. Modularizing, K loop unroll, perf tuning, post-ops and dynamic
dispatch are planned next
5. bench_lpgemm needs to be updated to test cases where leading
dimensions are greater than the matrix dimensions; this is planned
next.
Change-Id: I54c78fef639ea109d6ef2c2b05c07ce396c81370
-As of now, memcpy is used in the u8s8s32 micro-kernel for copying in
the k fringe loop (( k % 4 ) != 0) and in the NR' < 16 fringe kernels.
However, for small k/n dimensions, memcpy invocation has high overhead.
-This issue is fixed by replacing memcpy with a MACRO based
implementation of the copy routine, specifically optimized for the
sizes encountered in fringe cases (k < 4, NR' < 16).
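For illustration, a minimal sketch of such a copy macro (the in-tree
macro is specialized further): for tiny, bounded counts an inline loop
avoids the memcpy call overhead and is fully unrolled by the compiler.

    /* Illustrative only: inline copy for small fringe counts
     * (k < 4, NR' < 16); no function-call overhead. */
    #define DEMO_COPY_SMALL( dst, src, cnt )     \
        do {                                     \
            for ( int i_ = 0; i_ < (cnt); ++i_ ) \
                (dst)[ i_ ] = (src)[ i_ ];       \
        } while (0)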
AMD-Internal: [CPUPL-3008]
Change-Id: I376bab0aac325832e42e370b291614e5fd5272dc
1. Added Tanh approximation based GeLU Post-Op for S16, S32 and BF16
2. Changes are done at frame and micro-kernel level to
implement this post-op.
3. Efficient AVX-512 and AVX2 vector versions of the TANHF and EXPF
   functions are implemented for the GeLU post-operation.
4. The TANHF and EXPF math functions are implemented in a
   macro-based fashion to exploit register-level fusion of GeLU
   with GEMM operations for improved performance.
5. LPGEMM bench is changed to pass the GeLU post-op as input and to
   support an accuracy check that verifies functional correctness
   (a scalar reference is sketched after this list).
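For reference, the scalar form of the tanh-based GeLU implemented here
(standard approximation constants; the vector kernels compute the same
expression via the macro-based TANHF/EXPF):

    #include <math.h>

    /* Scalar reference of the tanh-approximation GeLU:
     * gelu(x) = 0.5 * x * ( 1 + tanh( sqrt(2/pi) *
     *                                 ( x + 0.044715 * x^3 ) ) ) */
    static float demo_gelu_tanh_ref( float x )
    {
        const float k0 = 0.7978845608f;   /* sqrt(2/pi) */
        const float k1 = 0.044715f;
        return 0.5f * x * ( 1.0f + tanhf( k0 * ( x + k1 * x * x * x ) ) );
    }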
AMD-Internal: [CPUPL-2978]
Change-Id: I472ac35c00a4ea1ab983cc5f6ff6a123c8035f28
-The f32 gemm framework is modified to swap input column major matrices
and compute gemm for the transposed matrices (now row major) using the
existing row-major kernels. The output is written to the C matrix
assuming it is transposed.
-Framework changes to support leading dimensions that are greater than
matrix widths.
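For illustration, the identity behind the swap: a column-major X of
size p x q with leading dimension ldx has the same memory layout as a
row-major X^T of size q x p, and C = A*B is equivalent to
C^T = B^T * A^T. A hypothetical sketch of the operand swap:

    #include <stdint.h>

    /* Illustrative only: remap a column-major problem onto the
     * row-major kernels by swapping operands and dimensions. */
    static void demo_cm_to_rm( const float** a, const float** b,
                               int64_t* m, int64_t* n )
    {
        const float* t = *a; *a = *b; *b = t;   /* A <-> B */
        int64_t      d = *m; *m = *n; *n = d;   /* m <-> n */
        /* k is unchanged; lda/ldb/ldc now act as row strides of
         * the transposed (row-major) matrices. */
    }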
AMD-Internal: [CPUPL-2919]
Change-Id: I805f1cb9ff934bb3106e01eb74e528915ffb90a3
-The post operations attributes are moved to a new struct
lpgemm_post_op_attr, and an object of this struct is passed to the
low precision gemm kernels in place of the multiple parameters.
-The u8s8s32os8 API (downscale API) performance is low when the k
value is small (k < KC). Two scenarios are observed here:
a. beta = 0: Currently, for the downscale API, a temporary buffer is
used to accumulate the intermediate s32 output, so that it can be used
in later iterations of the pc loop (k dim). The usage of this buffer
(store) can be avoided if k < KC: intermediate accumulation is not
required, since after the first iteration of the pc loop the output
can be downscaled and stored.
b. beta != 0: In this case the existing values of the original s8 C
output matrix need to be converted to s32 and beta scaled. Currently
the s8 values are converted to s32 and stored in a temporary buffer in
the pc loop (5 loop algorithm) in blocks of m x NC. This temporary
buffer is passed to the micro-kernel and beta scaling is applied on it.
However, the m x NC block copy is costly and can be avoided by
introducing a new condition for beta scaling in the micro-kernel,
whereby the original s8 data is loaded directly into a register
(instead of from the temporary buffer), converted to s32, and beta
scaled (see the sketch below).
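For illustration, a sketch of the new beta-scaling path (AVX-512;
names hypothetical): the original s8 C data is loaded straight into a
register, widened to s32, and beta scaled, with no m x NC staging copy.

    #include <immintrin.h>
    #include <stdint.h>

    static inline __m512i demo_beta_scale_s8
          ( const int8_t* c, int32_t beta, __m512i acc )
    {
        __m128i c8  = _mm_loadu_si128( ( const __m128i* )c ); /* 16 s8 */
        __m512i c32 = _mm512_cvtepi8_epi32( c8 );             /* -> s32 */
        __m512i b   = _mm512_set1_epi32( beta );
        return _mm512_add_epi32( acc, _mm512_mullo_epi32( c32, b ) );
    }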
AMD-Internal: [CPUPL-2884]
Change-Id: Id9b4650d500e1b553e48c4f1e4c902b3f553211c
- Broke down the KR loop inside the compute kernel into
two pieces
- Added C matrix prefetch between the two decomposed
  pieces of the KR loop (see the sketch below)
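For illustration, the shape of the decomposition (hypothetical names;
the real kernel body is intrinsics-based):

    #include <immintrin.h>
    #include <stdint.h>

    /* Split the KR loop and prefetch the C panel between the two
     * halves so C lines are cache-resident by the store phase. */
    static void demo_kr_split( int64_t k_iter, const float* c,
                               int64_t rs_c )
    {
        int64_t k_half = k_iter / 2;

        for ( int64_t kr = 0; kr < k_half; ++kr )
        { /* ... first half of FMA body ... */ }

        for ( int64_t i = 0; i < 6; ++i )   /* MR rows of C */
            _mm_prefetch( ( const char* )( c + i * rs_c ), _MM_HINT_T0 );

        for ( int64_t kr = k_half; kr < k_iter; ++kr )
        { /* ... second half of FMA body ... */ }
    }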
AMD-Internal: [CPUPL-2693]
Change-Id: Ib73bc2145de4c75bc8153d7d7d20fb057270c94e
- NC value is adjusted based on the n dimension of the B
matrix.
- KC value is adjusted based on the k dimension of the
  input matrices (a sketch follows this list).
- In order to ensure a handshake between the reorder
function and compute function, the same dynamic block
tuning logic is called in both places.
- Reorder followed by compute scenarios mean that tuning the
  KC value based on the MC value, and vice versa, is not feasible.
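For illustration, one way such dynamic KC tuning can look (thresholds
and logic here are hypothetical, not the in-tree values); calling the
same routine from reorder and compute keeps the packed layout in sync:

    #include <stdint.h>

    /* Pick KC so the k dimension splits into near-equal slices
     * instead of one full slice plus a small fringe. */
    static int64_t demo_tune_kc( int64_t k, int64_t kc_max )
    {
        if ( k <= kc_max ) return k;          /* single pc iteration */
        int64_t n_iter = ( k + kc_max - 1 ) / kc_max;
        return ( k + n_iter - 1 ) / n_iter;   /* balanced slices */
    }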
AMD-Internal: [CPUPL-2638]
Change-Id: Ie515da7de83e3dfb59946e873c92d67f11559429
Clipping is done during the downscaling of the accumulated result
from s32 to s8 for u8s8s32os8 and from s16 to s8 for u8s8s16os8,
to saturate the final output values to the range [-128, 127].
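For reference, the scalar form of the clip (shown for s32 -> s8; the
kernels use a vectorized equivalent):

    #include <stdint.h>

    static inline int8_t demo_sat_s8( int32_t v )
    {
        if ( v < -128 ) return -128;   /* clamp low  */
        if ( v >  127 ) return  127;   /* clamp high */
        return ( int8_t )v;
    }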
AMD-Internal: [LWPZENDNN-493]
Change-Id: Ica9bba5044e87b815e2b4e35809bf440bb9dd41f
- BFloat16 flags added to zen4 make_defs in order to enable
compilation of low precision gemm by using zen4 config.
- Avoid the -ftree-partial-pre optimization flag with gcc due to
  suboptimal code generation for intrinsics-based kernels in
  low precision gemm.
- Enable compilation of only the Zen3-specific low precision gemm
  kernels (s16) when the aocl_gemm addon is compiled on Zen3 machines.
AMD-Internal: [CPUPL-1545]
Change-Id: Id3be3410bfbf141bb6fc4b4e3391115a4e0bb79f
1. Corrected the type of the alpha and beta values from C_type
(output type) to the accumulation type.
For the downscaled LPGEMM APIs, C_type will be the downscaled
type, but alpha and beta must use the accumulation type
(illustrated in the sketch after this list).
2. BF16 bench integration with the LPGEMM bench for both the BF16
(bf16bf16f32of32 and bf16bf16f32obf16) APIs
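For illustration, a hypothetical signature fragment showing the fix
for a downscaled API such as u8s8s32os8:

    #include <stdint.h>

    /* C is the downscaled s8 type, while alpha/beta take the s32
     * accumulation type (previously they mirrored the C type). */
    typedef void (*demo_u8s8s32os8_ft)
          (
            int64_t m, int64_t n, int64_t k,
            int32_t alpha,                  /* accumulation type */
            const uint8_t* a, int64_t lda,
            const int8_t*  b, int64_t ldb,
            int32_t beta,                   /* accumulation type */
            int8_t* c, int64_t ldc          /* downscaled output */
          );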
AMD-Internal: [CPUPL-2561]
Change-Id: I3a99336c743f3880be1b96605ceeeae7c3bd4797
- Removed unnecessary read code from the downscale macros
- Corrected the store permute
- Simplified the macros for edge cases and corrected an intermediate
  operation
AMD-Internal:[CPUPL-2171]
Change-Id: Ifd2ff6b3d1c3874ac5cb8a545ff6daa7fb40ee68
-The bf16 gemm framework is modified to swap input column major matrices
and compute gemm for the transposed matrices (now row major) using the
existing row-major kernels. The output is written to the C matrix
assuming it is transposed.
-Framework changes to support leading dimensions that are greater than
matrix widths.
-Bench changes to test low precision gemm for column major inputs.
AMD-Internal: [CPUPL-2570]
Change-Id: I22c76f52619fd76d0c0e41531828b437a1935495
- The temporary buffer allocated for C matrix when downscaling is
enabled is not filled properly. This results in wrong gemm accumulation
when beta != 0, and thus wrong output after downscaling. The C panel
iterators used for filling the temporary buffer are updated to fix it.
- Low precision gemm bench updated for testing/benchmarking downscaling.
AMD-Internal: [CPUPL-2514]
Change-Id: Ib1ba25ba9df2d2997edaaf0763ff0113fb35d6eb
Introduced multi-threaded panel-based balancing for BF16 to improve
the overall MT performance (see the sketch below).
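For illustration, a minimal sketch of panel-based balancing
(hypothetical names): whole panels are handed out so each thread
receives at most one extra panel.

    #include <stdint.h>

    static void demo_panel_split
          ( int64_t n_panels, int64_t nt, int64_t tid,
            int64_t* start, int64_t* end )
    {
        int64_t base = n_panels / nt;   /* panels per thread */
        int64_t rem  = n_panels % nt;   /* leftover panels   */
        *start = tid * base + ( tid < rem ? tid : rem );
        *end   = *start + base + ( tid < rem ? 1 : 0 );
    }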
AMD-Internal: [CPUPL-2502]
Change-Id: Iddce9548fa96e5f57bd3d3eb3e8268855ca47f25
- BF16 instructions' output is accumulated at the higher precision of
FP32, which needs to be converted to the lower precision of bf16 after
the GEMM operations. This is required in AI workloads where both
input and output are in BF16 format.
- BF16 downscaling is implemented as post-ops inside the GEMM
microkernels.
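For illustration, the core of such a downscale store (requires
AVX512-BF16; names hypothetical):

    #include <immintrin.h>
    #include <string.h>

    /* Round 16 fp32 accumulator lanes to bf16 and store to C. */
    static inline void demo_store_bf16( void* c, __m512 acc )
    {
        __m256bh v = _mm512_cvtneps_pbh( acc );  /* fp32 -> bf16  */
        memcpy( c, &v, sizeof( v ) );            /* 32-byte store */
    }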
Change-Id: Id1606746e3db4f3ed88cba385a7709c8604002a8
- int16 C matrix intermediate values are converted to int32,
  then the int32 values are converted to fp32, and scaling is
  applied to these fp32 values
- The resultant values are downscaled to int8 and stored in a
  separate buffer (see the sketch below)
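For reference, the scalar form of this chain (the rounding mode shown
is illustrative):

    #include <math.h>
    #include <stdint.h>

    /* s16 -> s32 -> fp32, scale, then saturate to s8. */
    static inline int8_t demo_downscale_s16( int16_t v, float scale )
    {
        float   f = ( float )( int32_t )v * scale;
        int32_t r = ( int32_t )nearbyintf( f );
        if ( r < -128 ) r = -128;
        if ( r >  127 ) r =  127;
        return ( int8_t )r;
    }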
AMD-Internal: [2171]
Change-Id: I76ff04098def04d55d1bd88ac8c8d3f267964cab
- Downscaling is used when GEMM output is accumulated at a higher
precision and needs to be converted to a lower precision afterwards.
This is required in AI workloads where quantization/dequantization
routines are used.
- New GEMM APIs are introduced specifically to support this use case.
Currently downscaling support is added for s32, s16 and bfloat16 GEMM.
AMD-Internal: [CPUPL-2475]
Change-Id: I81c3ee1ba5414f62427a7a0abb6ecef0c5ff71bf
Functionality - Post-ops are a set of operations performed
element-wise on the output matrix after the GEMM operation. Support
for them is added by fusing post-ops with the GEMM operation.
- Post-ops Bias, Relu and Parametric Relu are added to all the
compute kernels of bf16bf16f32of32
- Modified bf16 interface files to add a check for bf16 ISA support
Change-Id: I2f7069a405037a59ea188a41bd8d10c4aae72fb3
-OpenMP based multi-threading support added for BFloat16 gemm.
Both gemm and reorder APIs are parallelized.
-Multi-threading support for the u8s8s16 reorder API.
-Typecast issues fixed for bfloat16 gemm kernels.
AMD-Internal: [CPUPL-2459]
Change-Id: I6502d71ab32aa73bb159245976ea3d3a8e0ed109
- Performance of u8s8s16os16 came down by 40% after the
introduction of post-ops
- Analysis revealed that the target compiler assumed a false
  dependency and was generating sub-optimal code due to the
  post-ops structure
- Inserted vzeroupper to hint the compiler that no ISA change
will occur
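For illustration, the shape of the hint (the actual insertion point in
the kernels differs):

    #include <immintrin.h>

    /* Emits vzeroupper, signalling that no AVX/SSE transition
     * penalty handling is needed across the post-ops boundary. */
    static inline void demo_postops_boundary( void )
    {
        _mm256_zeroupper();
    }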
AMD-Internal: [CPUPL-2447]
Change-Id: I0b383b9742ad237d0e053394602428872691ef0c
-Parametric ReLU is a generalization of leaky ReLU in which the
leakage coefficient is tunable. Support for it is added following
the register-level fusion technique (see the sketch below).
-Low precision bench enhancement to check accuracy/performance of
low precision gemm with PReLU.
-Bug fixes in low precision gemm kernels.
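A minimal AVX2 sketch of the PReLU post-op (hypothetical names):

    #include <immintrin.h>

    /* y = x if x >= 0, else alpha * x, with tunable alpha. */
    static inline __m256 demo_prelu_ps( __m256 x, __m256 alpha )
    {
        __m256 neg  = _mm256_mul_ps( x, alpha );
        __m256 mask = _mm256_cmp_ps( x, _mm256_setzero_ps(),
                                     _CMP_LT_OQ );
        return _mm256_blendv_ps( x, neg, mask );   /* pick per lane */
    }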
AMD-Internal: [CPUPL-2442]
Change-Id: I81336405b185a994297d122b2d868b758ae6dad5
Feature Addition: Added a new variant of low precision GEMM to the
addon - BFloat16. The kernel takes bf16 inputs and performs the BF16
GEMM operation. The intermediate accumulation and output are in float.
1. Compute kernels perform computations only if the B matrix is
   reordered in accordance with the usage of the AVX-512 BF16
   instruction dpbf16_ps (see the sketch after this list)
2. Kernel for packing the B matrix is provided
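For illustration, the core accumulation step built on dpbf16_ps
(requires AVX512-BF16; names hypothetical):

    #include <immintrin.h>

    /* Each dpbf16_ps multiplies bf16 pairs from A and B and
     * accumulates into fp32 lanes; the pair-wise consumption is
     * why B must be reordered before compute. */
    static inline __m512 demo_bf16_dot( __m512 acc, __m512bh a,
                                        __m512bh b )
    {
        return _mm512_dpbf16_ps( acc, a, b );
    }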
Change-Id: If5d08213068869eff060c9998596d2d2703a6793
-The micro-kernel function signatures follow a common pattern. These
functions can be represented as an instantiation of a MACRO as is done
in BLIS, and thus the number of micro-kernel header files can be brought
down. A new single header file containing all the MACRO definitions with
the instantiation is added, and the existing unnecessary header files
are removed (the pattern is sketched at the end of this list).
-The bias addition in the micro-kernel for n remaining < 16 reads the
bias array assuming it contains 16 elements. This can result in
seg-faults, since out-of-bounds memory is accessed. It is fixed by
copying the required elements to an intermediate buffer and using
that buffer for loading.
-Input matrix storage type parameter is added to lpgemm APIs. It can be
either row or column major, denoted by r and c respectively. Currently
only row major input matrices are supported.
-Bug fix in s16 fringe micro-kernel to use correct offset while storing
output.
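For illustration, the kind of macro-instantiated prototype pattern
referred to above (hypothetical macro and kernel names):

    /* One macro generates every micro-kernel prototype, replacing
     * the per-kernel header files. */
    #define DEMO_GEMM_KERN_PROT( A_t, B_t, C_t, name ) \
        void name                                      \
            (                                          \
              const A_t* a, const B_t* b, C_t* c,      \
              long m, long n, long k                   \
            );

    DEMO_GEMM_KERN_PROT( unsigned char, signed char, int,
                         demo_u8s8s32_6x64 )
    DEMO_GEMM_KERN_PROT( float, float, float, demo_f32f32f32_6x16 )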
AMD-Internal: [CPUPL-2386]
Change-Id: Idfa23e69d54ad7e06a67b1e36a5b5558fbff03a3
Functionality - A post-op is an operation performed on every element
of the output matrix after the GEMM operation is completed.
- Post-ops relu and bias added to all the compute kernels
of u8s8s16os16
- Post-ops are done on the values already loaded into registers
  to avoid reloading of C matrix elements
- Minor bug fixes in the OpenMP thread decorator of lpgemm
- Added test cases to lpgemm bench input file
AMD-Internal: [CPUPL-2171]
Change-Id: If49f763fdfac19749f6665c172348691165d8631
Functionality - When the B matrix is not reordered before the
u8s8s16os16 compute kernel call, packing of the B matrix is done as
part of the five loop algorithm (outlined after this list). The state
of the B matrix (packed or unpacked) is given as a user input.
- Packing of B matrix is done as part of the five loop
  compute.
- Temporary buffer for pack B is allocated in the five
  loop algorithm
- Multithreading for computation kernel
- Configuration constants for u8s8s16os16 are part of the
lpgemm config
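For illustration, the shape of the five loop algorithm with on-the-fly
B packing (hypothetical names and block sizes):

    #include <stdint.h>

    enum { NC = 2048, KC = 1024, MC = 144, NR = 16, MR = 6 };

    static void demo_pack_b( int64_t pc, int64_t jc )
    { (void)pc; (void)jc; }
    static void demo_ukern ( int64_t ir, int64_t jr )
    { (void)ir; (void)jr; }

    static void demo_five_loop( int64_t m, int64_t n, int64_t k,
                                int b_is_reordered )
    {
        for ( int64_t jc = 0; jc < n; jc += NC )        /* loop 5 */
        for ( int64_t pc = 0; pc < k; pc += KC )        /* loop 4 */
        {
            if ( !b_is_reordered )
                demo_pack_b( pc, jc ); /* pack KC x NC panel of B */
            for ( int64_t ic = 0; ic < m; ic += MC )    /* loop 3 */
            for ( int64_t jr = 0; jr < NC; jr += NR )   /* loop 2 */
            for ( int64_t ir = 0; ir < MC; ir += MR )   /* loop 1 */
                demo_ukern( ic + ir, jc + jr );   /* micro-kernel */
        }
    }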
AMD-Internal: [CPUPL-2171]
Change-Id: I22b4f0ec7fc29a2add4be0cff7d75f92dd3e60b8
- Low precision gemm is often used in ML/DNN workloads and is used
in conjunction with pre and post operations. Performing gemm and ops
together at the micro-kernel level results in better overall performance
due to cache/register reuse of the output matrix. The provision for
defining the post-operations and invoking the micro-kernel with them
from the framework is added as part of this change. This includes adding
new data structures/functions to define the post-ops to be applied
(sketched after this list) and an extensible template with which new
post-ops can easily be integrated.
As for the post-operations, RELU and Bias Add for u8s8s32 is implemented
in this first cut.
- aocl_gemm bench modifications to test/benchmark RELU and Bias Add.
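For illustration, a hypothetical shape of such an extensible post-op
chain (the actual lpgemm structures differ):

    typedef enum
    {
        DEMO_POSTOP_RELU,
        DEMO_POSTOP_BIAS_ADD
    } demo_op_kind_t;

    /* Each node names an op and its operand; the micro-kernel walks
     * the list, so new post-ops slot in without changing its
     * signature. */
    typedef struct demo_post_op
    {
        demo_op_kind_t       kind;
        const void*          operand;  /* e.g. bias vector */
        struct demo_post_op* next;     /* applied in order */
    } demo_post_op_t;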
AMD-Internal: [CPUPL-2316]
Change-Id: Iad5fe9e54965bb52d5381ae459a69800946c7d18
Feature Addition : Added low precision GEMM to addon. The
kernel takes unsigned int8 and signed int8 as inputs and
performs GEMM operation. The intermediate accumulation and
output are in signed int16.
- The compute kernel will perform computation only
  if the B matrix is reordered to suit the usage of the AVX2
  instruction vpmaddubsw (see the sketch after this list).
- Kernel for packing the B matrix is provided.
- LPGEMM bench code was modified to test the
performance and accuracy of the new variant.
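For illustration, the core u8 x s8 accumulation step built on
vpmaddubsw (hypothetical names):

    #include <immintrin.h>

    /* vpmaddubsw multiplies unsigned bytes of A with signed bytes
     * of B and adds adjacent pairs into s16 lanes; the pair-wise
     * consumption is why B must be reordered before compute. */
    static inline __m256i demo_u8s8s16_step
          ( __m256i a_u8, __m256i b_s8, __m256i acc_s16 )
    {
        __m256i p = _mm256_maddubs_epi16( a_u8, b_s8 );
        return _mm256_adds_epi16( acc_s16, p );   /* saturating add */
    }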
AMD-Internal: [CPUPL-2171]
Change-Id: Id9a6d90b79f4bf82fb2e2f3093974dbf37275f9b
- Low precision gemm sets thread meta data (lpgemm_thrinfo_t) to NULL
when compiled without OpenMP threading support. Subsequently the code
is executed as if it were single-threaded. However, when the B matrix
needs to be packed, communicators are required (irrespective of single
or multi-threaded execution), and the code accesses lpgemm_thrinfo_t
for this without a NULL check. This results in a seg fault.
For the fix, a non-OpenMP thread decorator layer is added, which
creates a placeholder lpgemm_thrinfo_t object with a communicator
before invoking the 5 loop algorithm. This object is then used for
packing.
- Makefile for compilation of aocl_gemm bench.
AMD-Internal: [CPUPL-2304]
Change-Id: Id505235c8421792240b84f93942ca62dac78f3dc
- Multi-Threaded int8 GEMM (Input - uint8_t, int8_t, Output - int32_t).
AVX512_vnni based micro-kernel for int8 gemm. Parallelization supported
along the m and n dimensions.
- Multi-Threaded B matrix reorder support for sgemm. Reordering the B
matrix packs the entire B matrix upfront, before sgemm. It allows sgemm
to take advantage of the packed B matrix without incurring packing
costs at runtime.
- Makefile updates to addon make rules to compile avx512 code for
selected files in addon folder.
- CPU features query enhancements to check for AVX512_VNNI flag.
- Bench for int8 gemm and sgemm with B matrix reorder. Supports
performance mode for benchmarking and accuracy mode for testing code
correctness.
AMD-Internal: [CPUPL-2102]
Change-Id: I8fb25f5c2fbd97d756f95b623332cb29e3b8d182
Details:
- Implemented a new feature called addons, which are similar to
sandboxes except that there is no requirement to define gemm or any
other particular operation.
- Updated configure to accept --enable-addon=<name> or -a <name> syntax
for requesting an addon be included within a BLIS build. configure now
outputs the list of enabled addons into config.mk. It also outputs the
corresponding #include directives for the addons' headers to a new
companion to the bli_config.h header file named bli_addon.h. Because
addons may wish to make use of existing BLIS types within their own
definitions, the addons' headers must be included sometime after that
of bli_config.h (which currently is #included before bli_type_defs.h).
This is why the #include directives needed to go into a new top-level
header file rather than the existing bli_config.h file.
- Added a markdown document, docs/Addons.md, to explain addons, how to
build with them, and what assumptions their authors should keep in
mind as they create them.
- Added a gemmlike-style implementation of sandwiched gemm called
'gemmd' as an addon in addon/gemmd. The code uses a 'bao_' prefix for
local
functions, including the user-level object and typed APIs.
- Updated .gitignore so that git ignores bli_addon.h files.
Change-Id: Ie7efdea366481ce25075cb2459bdbcfd52309717