- Fixed framework of bf16s4f32of32 API to correct
pointer updations.
- Modified pre_op structure to exclude pre-op-offset.
Now offset is passed as a separate parameter to the
scale-pack functions.
- Fixed work-distribution among threads in MT scenario.
- Added Blocksizes and kernel-pointers and verified
functionality for the new API.
AMD-Internal: [SWLCSG-2943]
Change-Id: I58fece240d62c798c880a2b2b7fa64e560cc753d
-To enable Weight-only-Quantization (WOQ) workflow, new LPGEMM APIs
have been developed where data types are A:bf16, B:int4 and C:f32/bf16.
The testing and benchmarking framework for the same are added.
AMD-Internal: [SWLCSG-2943]
Change-Id: Icdc1d60819a23dd9f41382499d1a3c055c5edc17
Description:
1. Added a new API aocl_gemm_bf16s4f32of32 to support
for WoQ (Weight-only-Quantization) in LLM's
2. The API supports only reordered B matrix of data
size signed 4 bits (S4).
3. Substracting zero point and multiplying with scale
on B matrix is performed in packing B.
4. zero point and scale data should be passed by user
through pre-ops data structure.
5. The API is still in experimental state and NOT tested.
AMD-Internal: SWLCSG-2943
Change-Id: I10b159b64c2e2aaf39da5462685618ba8cc800ee
Details:
- To enable Weight-only-Quantization(WOQ) workflow,
new LPGEMM APIs are added where datatypes are A: bf16,
B: int4, C: f32/bf16. To support this, B matrix will
be reordered with type still being int4. New pack
kernels that packs the reordered B matrix after
converting the data from int4 to bf16 and applying
zero-point and scale are added.
AMD-Internal: [SWLCSG-2943]
Change-Id: Iabe23dab607913c0114b97cb2b91248babeaac03
-To enable Weight-only-Quantization (WOQ) workflow, new LPGEMM APIs
are required where data types are A:bf16, B:int4 and C:f32/bf16. It
is expected that the BF16 kernels will be reused within this API and
subsequently the B matrix needs to be reordered following the BF16
kernel schema, but with the reordered matrix type still being int4. To
address this, new BF16 reorder kernels enabling the same are added.
AMD-Internal: [SWLCSG-2943]
Change-Id: Ib770ecbf90a3d906deafece94b1a96e0b9412738
-Quantization of f32 to bf16 (bf16 = (f32 * scale_factor) + zero_point)
instead of just type conversion in aocl_gemm_bf16bf16f32obf16.
-Support for multiple scale/sum/matrix_add/bias post-ops in a single
LPGEMM api call.
-Post-ops mask related fixes in lpgemv kernels .
-Additional scale post-ops sanity checks.
AMD-Internal: [SWLCSG-2945]
Change-Id: I3b35cc413c176bb50bfdbd6acd4839a5ba7e94bb
Support for reordering B matrix of datatype int4 as per the pack schema
requirements of u8s8s32 kernel. Vectorized int4_t -> int8_t conversion
implemented via leveraging the vpmultishiftqb instruction. The reordered
B matrix will then be used in the u8s8s32o<s32|s8> api.
AMD-Internal: [SWLCSG-2390]
Change-Id: I3a8f8aba30cac0c4828a31f1d27fa1b45ea07bba
Details:
- For a variable x, Using address of x in an instruction throws
exception if the difference between &x and access position is
larger than 2 GiB. To solve this issue all variables are stored
within the JIT code section and are accessed using relative addressing.
- Fixed a bug in B matrix pack function for s8s8s32os32 API.
- Fixed a bug in JIT code to apply bias on col-major matrices.
AMD-Internal: [SWLCSG-2820]
Change-Id: I82f117a0422c794cb9b1a4d65a89d60de4adfd96
- When n=1, reorder of B matrix is avoided to efficiently
process data. A dot-product based kernel is implemented to
perform gemv when n=1.
AMD-Internal: [SWLCSG-2354]
Change-Id: If5f74651ab11232d0b87d34bd05f65aacaea94f1
-As it stands the bf16bf16f32ob16 API expects bias array to be of type
float. However actual use case requires the usage of bias array of bf16
type. The bf16 micro-kernels are updated to work with bf16 bias array by
upscaling it to float type and then using it in the post-ops workflow.
-Corrected register usage in bf16 JIT generator for bf16bf16f32ob16 API
when k > KC.
AMD-Internal: [SWLCSG-2604]
Change-Id: I404e566ff59d1f3730b569eb8bef865cb7a3b4a1
Description:
--Added support for tranB in u8s8s32o<s32|s8> and
s8s8s32o<s32|s8> API's
--Updated the bench_lpgemm by adding options to
support transpose of B matrix
--Updated data_gen_script.py in lpgemm bench
according to latest input format.
AMD-Internal: [SWLCSG-2582]
Change-Id: I4a05cc390ae11440d6ff86da281dbafbeb907048
- Implemented optimized lpgemv for both m == 1 and n == 1 cases.
- Fixed few bugs in LPGEMV for bf16 and f32 datatypes.
- Fixed few bugs in JIT-based implementation of LPGEMM for BF16
datatype.
AMD-Internal: [SWLCSG-2354]
Change-Id: I245fd97c8f160b148656f782d241f86097a0cf38
SWISH post-op computes swish(x) = x / (1 + exp(-1 * alpha * x)).
SiLU = SWISH with alpha = 1. Adding the support for swish in JIT
based BF16 kernels.
AMD-Internal: [SWLCSG-2387]
Change-Id: I9eea0c801f5f067a5cfbd2941bc991708b86e45e
1. The 5 LOOP LPGEMM path is in-efficient when A or B is a vector
(i.e, m == 1 or n == 1).
2. An efficient implementation is developed considering the b matrix
reorder in case of m=1 and post-ops fusion.
3. When m = 1 the algorithm divide the GEMM workload in n dimension
intelligently at a granularity of NR. Each thread work on A:1xk
B:kx(>=NR) and produce C=1x(>NR). K is unrolled by 4 along with
remainder loop.
4. When n = 1 the algorithm divide the GEMM workload in m dimension
intelligently at a granularity of MR. Each thread work on A:(>=MR)xk
B:kx1 and produce C = (>=MR)x1. When n=1 reordering of B is avoided
to efficiently process in n one kernel.
AMD-Internal: [SWLCSG-2355]
Change-Id: I7497dad4c293587cbc171a5998b9f2817a4db880
- Modified all structs that are passed to JIT-generated code to use
integer of type uint64_t rather than dim_t so that functionality
is not affected when size of BLIS-internal integer is modified
during configure time.
Change-Id: Ib81c088072badf13da4ca73be2d4af4551b713d8
- Added BUILD_STATIC_LIBS option which is on by default, only on Linux.
- Added TEST_WITH_SHARED option which is off by default, only on Linux.
- If only shared or static lib is being built, that's the one that will be used for testing.
- If both are being built, TEST_WITH_SHARED determins which library wil be used for testing.
- Set linux workflows so that they build both static and shared libs, and use linux-static and linux-shared to denote which one should be used for testing.
- Set -fPIC for both static and shared builds to fix issues faced when building blis using AOCC 4.0.0 and gtestsuite using gcc 9.4.0.
AMD-Internal: [CPUPL-2748]
Change-Id: I4227bab97ff31ecddfe218e18499f33b4e4ee63e
CMakelists.txt is updated to support aocl_gemm on windows.
On windows, BLIS library(blis+aocl_gemm) is built successfully
only with AOCC Compiler. (Clang has an issue with optimizing
VNNI instructions).
$cmake .. -DENABLE_ADDON="aocl_gemm" ....
AMD-Internal: [CPUPL-2748]
Change-Id: I9620878ab6934233fadc9ddc5d5e82ad85be9209
Details:
- Added new folder named JIT/ under addon/aocl_gemm/. This folder
will contain all the JIT related code.
- Modified lpgemm_cntx_init code to generate main and fringe kernels
for 6x64 bf16 microkernel and store function pointers to all the
generated kernels in a global function pointer array. This happens
only when gcc version is < 11.2
- When gcc version < 11.2, microkernel uses JIT-generated kernels.
otherwise, microkernel uses the intrinsics based implementation.
AMD-Internal: [SWLCSG-2622]
Change-Id: I16256c797b2546a8cd2049680001947346260461
1. The 5 LOOP LPGEMM path is in-efficient when A or B is a vector
(i.e, m == 1 or n == 1).
2. An efficient implementation of lpgemv_rowvar_f32 is developed
considering the b matrix reorder in case of m=1 and post-ops fusion.
3. When m = 1 the algorithm divide the GEMM workload in n dimension
intelligently at a granularity of NR. Each thread work on A:1xk
B:kx(>=NR) and produce C=1x(>NR). K is unrolled by 4 along with
remainder loop.
4. When n = 1 the algorithm divide the GEMM workload in m dimension
intelligently at a granularity of MR. Each thread work on A:(>=MR)xk
B:kx1 and produce C = (>=MR)x1. When n=1 reordering of B is avoided
to efficiently process in n one kernel.
5. Fixed few warnings while loading 2 f32 bias elements using
_mm_load_sd using float pointer. Typecasted to (const double *)
AMD-Internal: [SWLCSG-2391, SWLCSG-2353]
Change-Id: If1d0b8d59e0278f5f16b499de1d629e63da5b599
-This post-operation computes C = (beta*C + alpha*A*B) + D, where D is
a matrix with dimensions and data type the same as that of C matrix.
-For clang compilers (including aocc), -march=znver1 is not enabled for
zen kernels. Have updated CKVECFLAGS to capture the same.
AMD-Internal: [SWLCSG-2424]
Change-Id: Ie369f7ea5c80ab69eea3f3e03a8d9546e14f5c09
1. Added Trans A feature to handle column major inputs
for A matrix.
2. Trans A is enabled by on-the-go pack of A matrix.
3. The on-the-go pack of A converts a column storage
MCxKC block of A into row storage MCxKC block as
LPGEMM kernels are row major kernels.
4. New pack routines are added for conversion of A matrix
from column major storage to row major storage.
5. LPGEMM Cntx is updated with pack kernel function
pointers.
6. Packing of A matrix:
- Converts column major input A to row major
in blocks of MCxKC with newly added pack A
functions when cs_a > 1.
7. Pack routines are added for AVX512 and AVX2
INT8 LPGEMM APIs.
8. Trans A feature is now supported in:
1. u8s8s32os32/os8
2. u8s8s16os16/os8/ou8
3. s8s8s32os32/os8
4. s8s8s16os16/os8
AMD-Internal: SWLCSG-2582
Change-Id: I7ce331545525a9a09f3853280615b55fcf2edabf
-As it stands, in LPGEMM, users are expected to pass an array of values
with length the same as N dimension as inputs for zero point or scale
factor. However at times, a single scalar value is used as zero point
or scale factor for the entire downscaling operation. The mandate to
pass an array requires the user to allocate extra memory and fill it
with the scalar value so as to be used in downscaling. This limitation
is lifted as part of this commit, and now scalar values can be passed
as zero point or scale factor.
-LPGEMM bench enhancements along with new input format to improve
readability as well as flexibility.
AMD-Internal: [SWLCSG-2581]
Change-Id: Ibd0d89f03e1acadd099382dffcabfec324ceb50f
Details:
- LPGEMM uses bli_pba_acquire_m with BLIS_BUFFER_FOR_A_BLOCK to checkout
memory when A matrix needs to be packed. This multi-threaded lock
overhead becomes prominent when m/n dimensions are relatively small,
even when k is large. In order to address this, bli_pba_acquire_m
is used with BLIS_BUFFER_FOR_GEN_USE for LPGEMM. For *GEN_USE,
the memory is allocated using aligned malloc instead of checking
out from memory pool. Experiments have shown malloc costs to be
far lower than memory pool guarded by locks, especially for higher
thread count.
- Deleted few unnecessary instructions from packing kernels.
- Replaced bench_input.txt with lesser number of inputs.
AMD-Internal: [CPUPL-4329]
Change-Id: I5982a0a4df9dc72fab0cffab795c23822d5c8774
Description:
1. Updated the thread partition logic for aocl_gemm_f32f32f32of32
for m<MR, n<NR cases and also balanced thread in m, n directions
such that each thread gets equal amount of work and not to span
thread without any work.
2. Disabled dynamic enabling of packing of a and b matrixes for
smaller sizes for genoa architecture.
AMD-Internal: [SWLCSG-2353 , SWLCSG-2391]
Change-Id: I03b2c50e592c2e9d336ea84c0e0394af63a34cec
Some text files were missing a newline at the end of the file.
One has been added.
AMD-Internal: [CPUPL-3519]
Change-Id: I4b00876b1230b036723d6b56755c6ca844a7ffce
Details:
- Moved the downscale & postop options from commmandline to
input file.
- Now the format of the input file is as follows:
dt_in dt_out stor transa transb op_a op_b m n k lda ldb ldc postops
- In case of no-postops, 'none' has to be passed in the place of
postops.
- Removed duplication of mat_mul_bench_main function for bf16 APIs.
- Added a function called print_matrix for each datatype which can
help in printing matrices while debugging.
- Added printing of ref, computed and diff values while reporting
failure.
- Added new functions for memory allocation and freeing. Different
types of memory allocation is chosen based on mode bench is
running(performance or accuracy mode).
Change-Id: Ia7d740c53035bc76e578a03869590c9f04396b72
The LPGEMM micro-kernel operates on blocks of dimension MRxKC and
KCxNR. Current LPGEMM design involves using all the available threads
for computing the output. If the number of threads assigned along ic
or jc direction is more than M/MR or N/NR blocks respectively, it
could results in threads sleeping due to the lack of MR or NR blocks.
This scenario is now handled by reducing the number of threads if
there are threads without any work (MR or NR blocks).
AMD-Internal: [SWLCSG-2354, SWLCSG-2389, SWLCSG-2267]
Change-Id: I74819337c7a0d3ab05ea0e18bb42780f977ea8f6
Details:
- Modified bench to support testing for sizes where matrix
strides are larger than the corresponding dimensions.
- Modified early-return checks in all interface APIs to
check validity of strides in relation to the corresponding
dimension rather than checking if strides are equal to dimensions.
Change-Id: I382529b636a4acc75f6d93d997af22a168a7bfc4
Description:
1. Updated ERF function threshold from 3.91920590400 to 3.553
to match with the reference erf float implementation which
reduced errors a the borders and also clipped the output
to 1.0
2. Updated packa function call with pack function ptr in bf16
api to avoid compilation issues for non avx512bf16 archs
3. Updated lpgemm bench
[AMD-Internal: SWLCSG-2423 ]
Change-Id: Id432c0669521285e6e6a151739d9a72a7340381d
Details:
- Updated pack function call in ic loop to accept correct params.
- Modified documentation in bench file to reflect updated usage of
bench for downscaled APIs.
- Modified memory allocation for C panel in BF16 APIs to use
BLIS_BUFFER_FOR_GEN_USE while requesting for memory from pool.
Change-Id: Id624ed92ae7c8dafd7f6a32fc1554d2357de4df5
-When bli_pba_acquire_m api is used for packbuf type BLIS_BUFFER_FOR_
<A_BLOCK|B_PANEL|C_PANEL>, the memory is allocated by checking out a
block from an internal memory pool. In order to ensure thread safety,
the memory pool checkout is protected using mutex (bli_pba_lock/
bli_pba_unlock). When the number of threads trying to checkout memory
(in parallel) are high, these locks tend to become a scaling bottleneck,
especially when the memory is to be used for non-packing purposes
(packing could hide some of this cost). LPGEMM uses bli_pba_acquire_m
with BLIS_BUFFER_FOR_C_PANEL to checkout memory when downscale is
enabled for temporary C accumulation. This multi-threaded lock overhead
becomes prominent when m/n dimensions are relatively small, even when k
is large. In order to address this, bli_pba_acquire_m is used with
BLIS_BUFFER_FOR_GEN_USE for LPGEMM. For *GEN_USE, the memory is
allocated using aligned malloc instead of checking out from memory pool.
Experiments have shown malloc costs to be far lower than memory pool
guarded by locks, especially for higher thread count.
-LPGEMM bench fixes for crash observed when benchmarking with post-ops
enabled and no downscale.
AMD-Internal: [SWLCSG-2354]
Change-Id: I4e92feadd2cf638bb26dd03b773556800a1a3d50
Details:
- Modified aocl_get_reorder_buf_size_ and aocl_reorder_ APIs
to allow reordering from column major input matrix.
- Added new pack kernels that packs/reorders B matrix from
column-major input format.
- Updated Early-return check conditions to account for trans
parameters.
- Updated bench file to test/benchmark transpose support.
AMD-Internal: [CPUPL-2268]
Change-Id: Ida66d7e3033c52cca0229c6b78d16976fbbecc4c
Downscaling is used when GEMM output is accumulated at a higher
precision and needs to be converted to a lower precision afterwards.
Currently the u8s8s16 flavor of api only supports downscaling to s8
(int8_t) via aocl_gemm_u8s8s16os8 after results are accumulated at
int16_t.
LPGEMM is modified to support downscaling to different data types,
like u8, s16, apart from s8. The framework (5 loop) passes the
downscale data type to the micro-kernels. Within the micro-kernel,
based on the downscale type, appropriate beta scaling and output
buffer store logic is executed. This support is only enabled for
u8s8s16 flavor of api's.
The LPGEMM bench is also modified to support passing downscale data
type for performance and accuracy testing.
AMD-Internal: [SWLCSG-2313]
Change-Id: I723d0802baf8649e5e41236b239880a6043bfd30
Details:
- Added new params(order, trans) to aocl_get_reorder_buf_size_ and
aocl_reorder_ APIs.
- Added new pack kernels that packs A matrix from either row-major or
column major input matrix to pack buffer with row-major format.
- Updated cntx with pack kernel function pointers for packing A matrix.
- Transpose of A matrix is handled by packing A matrix to row-major
format during run-time.
- Updated Early-return check conditions to account for trans parameters.
- Updated bench file to test/benchmark transpose support.
AMD-Internal: [SWLCSG-2268, SWLCSG-2442]
Change-Id: I43a113dc4bc11e6bb7cc4d768c239a16cb6bbea4
Certain functions were updated recently and now takes extra arguments
for error handling. Usage of the same are now updated in aocl_gemm.
Change-Id: I7daca4fd1f284d57034d564f0a08cc6410ccfd5c
* commit 'b683d01b':
Use extra #undef when including ba/ex API headers.
Minor preprocessor/header cleanup.
Fixed typo in cpp guard in bli_util_ft.h.
Defined eqsc, eqv, eqm to test object equality.
Defined setijv, getijv to set/get vector elements.
Minor API breakage in bli_pack API.
Add err_t* "return" parameter to malloc functions.
Always stay initialized after BLAS compat calls.
Renamed membrk files/vars/functions to pba.
Switch allocator mutexes to static initialization.
AMD-Internal: [CPUPL-2698]
Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df
This change contains the following:
1. Downscale optimization fix
a. Similar to downscale optimizations made for s32 and s16 gemm,
the following optimizations are done to improve the downscale
performance for BF16 gemm
b. The store to temporary float buffer can be avoided when k < KC
since intermediate accumulation will not be required for the
pc loop (only 1 iteration). The downscaled values (bf16) are
written directly to the output C matrix.
c. Within the micro-kernel when beta != 0, the bf16 data from the
original C output matrix is loaded to a register, converted to
float and beta scaling is applied on it at register level.
This eliminates the requirement of previous design of copying the
bf16 value to the temporary float buffer inside jc loop.
2. Alpha scaling
a. Alpha scaling (multiply instruction) by default was resulting in
performance regression when k dimension is small and alpha=1 in
bf16 micro-kernels.
b. Alpha scaling is now only done when alpha != 1.
3. K Fringe optimization
a. Previously memcpy was used for K fringe case to load elements
from A matrix in the microkernels
b. Now, masked stores are used to store the downscaled and
non-downscaled outputs without the need to use
memcpy functions
4. N LT-16 fringe optimization
a. Previously memcpy was used for N LT 16 fringe case in the
microkernelsfor storing the downscaled and non-downscaled output.
b. Now, masked stores are used to store the downscaled and
non-downscaled outputs of BF16 without the need to use
memcpy functions
5. Framework updates to avoid unnecessary pack buffer allocation
a. The default allocation of the temporary pack buffer is removed
and the pack buffer is now only allocated if k > KC.
AMD-Internal: [CPUPL-3437]
Change-Id: I71ff862e7d250559409a12a3533678c7a7951044
-Currently when any of the downscale API is called, a temporary pack
buffer is allocated (with bli_membrk_acquire_m) by each thread. It is
used to persist intermediate higher precision output accumulated by the
micro-kernel across pc loop when the number of pc iterations is more
than 1 (k > KC). The bli_membrk_acquire_m is a thread safe operation and
uses locks (pthread_mutex) to ensure thread safe checkout of memory/
block from the memory pool.
-However when k < KC, this temporary buffer is not required. But since
this pack buffer is allocated by default in downscale API, the overhead
from locks affects performance when k < KC, m or n is sufficiently small
and the number of threads involved is high. This default allocation is
removed and the pack buffer is now only allocated if k > KC.
AMD-Internal: [CPUPL-3430]
Change-Id: I492586ff4c47bc7480d364efb7af3674e31bd2c1
-Micro-kernel: Some AVX512 intrinsics(eg: _mm512_loadu_epi32) were
introduced in later versions of gcc (>10) in addition to already
existing masked intrinsic(eg: _mm512_mask_loadu_epi32). In order to
support compilation using gcc 9.4, either the masked intrinsic or other
gcc 9.4 compatible intrinsic needs to be used (eg: _mm512_loadu_si512)
in LPGEMM Zen4 micro-kernels.
-Frame: BF16 LPGEMM api's (aocl_gemm_bf16bf16f32obf16/bf16bf16f32of32)
needs to be disabled if aocl_gemm (LPGEMM) addon is compiled using gcc
9.4. BF16 intrinsics are not supported in gcc 9.4, and the micro-kernels
for BF16 LPGEMM is excluded from compilation based on GNUC macro.
AMD-Internal: [CPUPL-3396]
Change-Id: I096b05cdceea77e3e7fec18a5e41feccdf47f0e7
Source and other files in some directories were a mixture of
Unix and DOS file formats. Convert all relevant files to Unix
format for consistency. Some Windows-specific files remain in
DOS format.
AMD-Internal: [CPUPL-2870]
Change-Id: Ic9a0fddb2dba6dc8bcf0ad9b3cc93774a46caeeb
1. New LPGEMM type - s8s8s16os16 and s8s8s16os8 are added.
2. New interface, frame and kernel files are added.
3. Frame and kernel level files added and modified for s8s8s16
4. s8s8s16 type involves design changes of 2 operations -
Pack B and Mat Mul
5. Pack B kernel routines to pack B matrix for s16 FMA and compute the
sum of every column of B matrix to implement the s8s8s16 operation
using the s16 FMA instructions.
5. Mat Mul Kernel files to compute the GEMM output using s16 FMA.
Here the A matrix elements are converted from int8 to uint8 (s16 FMA
works with A matrix type uint8 only) by adding extra 128 to
every A matrix element
6. Post GEMM computation, additional operations are performed on the
accumulated outputs to get the correct results.
Final C = C - ( (sum of column of B matrix) * 128 )
This is done to compensate for the addition of extra 128 to every
A matrix elements
7. With this change, two new LPGEMM APIs are introduced in LPGEMM -
s8s8s16os16 and s8s8s16os8.
8. All previously added post-ops are supported on s8s8os16/os8 also.
AMD-Internal: [CPUPL-3234]
Change-Id: I3cc23e3dcf27f215151dda7c8db29b3a7505f05c
-Softmax is often used as the last activation function in a neural
network - softmax(xi) = exp(xi)/(exp(x0) + exp(x1) + ... + exp(xn))).
This step happens after the final low precision gemm computation,
and it helps to have the softmax functionality that can be invoked
as part of the lpgemm workflow. In order to support this, a new api,
aocl_softmax_f32 is introduced as part of aocl_gemm. This api
computes element-wise softmax of a matrix/vector of floats. This api
invokes ISA specific vectorized micro-kernels (vectorized only when
incx=1), and a cntx based mechanism (similar to lpgemm_cntx) is used
to dispatch to the appropriate kernel.
AMD-Internal: [CPUPL-3247]
Change-Id: If15880360947435985fa87b6436e475571e4684a
-Currently only one eltwise post-op (one of relu/prelu/gelu_tanh/
gelu_erf) is supported in the post-op struct along with bias or
downscale. This setup was sufficient when only activation functions
were supported as eltwise post-ops. But with the introduction of clip
post-op(a type of non-activation eltwise operation), it has become
necessary to extend the post-ops framework to support multiple eltwise
operations, with the multiple eltwise often used in the form activation
eltwise op + non-activation eltwise ops. The aocl post-op struct is
modified and the post-op parser is updated to support this use case.
-The lpgemm_bench is updated to support testing/benchmarking of the
multiple eltwise operations use case. The function for accuracy checking
is modified to support correctness testing irrespective of the order and
count of post-ops. Additionally the help message is updated so as to
better describe the capabilities of lpgemm_bench.
AMD-Internal: [CPUPL-3244]
Change-Id: If4ce8d7261d32073da8fa4757ed4f2ea0e94249f
Corrections for spelling and other mistakes in code comments
and doc files.
AMD-Internal: [CPUPL-2870]
Change-Id: Ifbb5df7df2d6312fe73e06ee6d41c00b16c593ce
-Similar to downscale optimizations made for u8s8s32 gemm, the following
optimizations are made to improve the downscale performance for u8s8s16
gemm:
a. The store to temporary s16 buffer can be avoided when k < KC since
intermediate accumulation will not required for the pc loop (only 1
iteration). The downscaled values (s8) are written directly to the
output C matrix.
b. Within the micro-kernel when beta != 0, the s8 data from the original
C output matrix is loaded to a register, converted to s16 and beta
scaling applied on it. The previous design of copying the s8 value to
the s16 temporary buffer inside jc loop and using the same in beta
scaling is removed.
-Alpha scaling (multiply instruction) by default was resulting in
performance regression when k dimension is small and alpha=1 in s16
micro-kernels. Alpha scaling is now only done when alpha != 1.
AMD-Internal: [CPUPL-3237]
Change-Id: If25f9d1de8b9b8ffbe1bd7bce3b7b0b5094e51ef