-It has been observed that reduction of threads as part of smart
threading for smaller input dimensions hampers the performance of the
other inputs with larger dimensions due to lower operating frequency of
the newly launched threads (apart from the existing ones). Disabling
smart threading for these bandwidth bound input patterns (small m and n)
fixes this issue.
-Bug fixes related to work split in LPGEMV for n < NR and m < MR cases.
AMD Internal: [SWLCSG-2948]
Change-Id: I0117dc0ea6820a9fac8e14f93374b54a7d80c121
- When n=1, reorder of B matrix is avoided to efficiently
process data. A dot-product based kernel is implemented to
perform gemv when n==1.
AMD-Internal: [SWLCSG-2354]
Change-Id: I6b73dfddd9a15e7b914d031646a1d913a7ab4761
Description:
--Added support for tranB in u8s8s32o<s32|s8> and
s8s8s32o<s32|s8> API's
--Updated the bench_lpgemm by adding options to
support transpose of B matrix
--Updated data_gen_script.py in lpgemm bench
according to latest input format.
AMD-Internal: [SWLCSG-2582]
Change-Id: I4a05cc390ae11440d6ff86da281dbafbeb907048
- Implemented optimized lpgemv for both m == 1 and n == 1 cases.
- Fixed few bugs in LPGEMV for bf16 and f32 datatypes.
- Fixed few bugs in JIT-based implementation of LPGEMM for BF16
datatype.
AMD-Internal: [SWLCSG-2354]
Change-Id: I245fd97c8f160b148656f782d241f86097a0cf38
1. Added Trans A feature to handle column major inputs
for A matrix.
2. Trans A is enabled by on-the-go pack of A matrix.
3. The on-the-go pack of A converts a column storage
MCxKC block of A into row storage MCxKC block as
LPGEMM kernels are row major kernels.
4. New pack routines are added for conversion of A matrix
from column major storage to row major storage.
5. LPGEMM Cntx is updated with pack kernel function
pointers.
6. Packing of A matrix:
- Converts column major input A to row major
in blocks of MCxKC with newly added pack A
functions when cs_a > 1.
7. Pack routines are added for AVX512 and AVX2
INT8 LPGEMM APIs.
8. Trans A feature is now supported in:
1. u8s8s32os32/os8
2. u8s8s16os16/os8/ou8
3. s8s8s32os32/os8
4. s8s8s16os16/os8
AMD-Internal: SWLCSG-2582
Change-Id: I7ce331545525a9a09f3853280615b55fcf2edabf
-When bli_pba_acquire_m api is used for packbuf type BLIS_BUFFER_FOR_
<A_BLOCK|B_PANEL|C_PANEL>, the memory is allocated by checking out a
block from an internal memory pool. In order to ensure thread safety,
the memory pool checkout is protected using mutex (bli_pba_lock/
bli_pba_unlock). When the number of threads trying to checkout memory
(in parallel) are high, these locks tend to become a scaling bottleneck,
especially when the memory is to be used for non-packing purposes
(packing could hide some of this cost). LPGEMM uses bli_pba_acquire_m
with BLIS_BUFFER_FOR_C_PANEL to checkout memory when downscale is
enabled for temporary C accumulation. This multi-threaded lock overhead
becomes prominent when m/n dimensions are relatively small, even when k
is large. In order to address this, bli_pba_acquire_m is used with
BLIS_BUFFER_FOR_GEN_USE for LPGEMM. For *GEN_USE, the memory is
allocated using aligned malloc instead of checking out from memory pool.
Experiments have shown malloc costs to be far lower than memory pool
guarded by locks, especially for higher thread count.
-LPGEMM bench fixes for crash observed when benchmarking with post-ops
enabled and no downscale.
AMD-Internal: [SWLCSG-2354]
Change-Id: I4e92feadd2cf638bb26dd03b773556800a1a3d50
Downscaling is used when GEMM output is accumulated at a higher
precision and needs to be converted to a lower precision afterwards.
Currently the u8s8s16 flavor of api only supports downscaling to s8
(int8_t) via aocl_gemm_u8s8s16os8 after results are accumulated at
int16_t.
LPGEMM is modified to support downscaling to different data types,
like u8, s16, apart from s8. The framework (5 loop) passes the
downscale data type to the micro-kernels. Within the micro-kernel,
based on the downscale type, appropriate beta scaling and output
buffer store logic is executed. This support is only enabled for
u8s8s16 flavor of api's.
The LPGEMM bench is also modified to support passing downscale data
type for performance and accuracy testing.
AMD-Internal: [SWLCSG-2313]
Change-Id: I723d0802baf8649e5e41236b239880a6043bfd30
* commit 'b683d01b':
Use extra #undef when including ba/ex API headers.
Minor preprocessor/header cleanup.
Fixed typo in cpp guard in bli_util_ft.h.
Defined eqsc, eqv, eqm to test object equality.
Defined setijv, getijv to set/get vector elements.
Minor API breakage in bli_pack API.
Add err_t* "return" parameter to malloc functions.
Always stay initialized after BLAS compat calls.
Renamed membrk files/vars/functions to pba.
Switch allocator mutexes to static initialization.
AMD-Internal: [CPUPL-2698]
Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df
-Currently when any of the downscale API is called, a temporary pack
buffer is allocated (with bli_membrk_acquire_m) by each thread. It is
used to persist intermediate higher precision output accumulated by the
micro-kernel across pc loop when the number of pc iterations is more
than 1 (k > KC). The bli_membrk_acquire_m is a thread safe operation and
uses locks (pthread_mutex) to ensure thread safe checkout of memory/
block from the memory pool.
-However when k < KC, this temporary buffer is not required. But since
this pack buffer is allocated by default in downscale API, the overhead
from locks affects performance when k < KC, m or n is sufficiently small
and the number of threads involved is high. This default allocation is
removed and the pack buffer is now only allocated if k > KC.
AMD-Internal: [CPUPL-3430]
Change-Id: I492586ff4c47bc7480d364efb7af3674e31bd2c1
-The n fringe micro kernels uses only a few zmm registers for computing
the output (eg: 6x16 uses 6 zmm registers for output as opposed to 24
used in 6x64). This results in lot of wasted registers that if utilized
can help increase the MR dimension and thus improve the reuse of
registers loaded with B. Based on this concept, the existing n fringe
kernels are modified (6x16 -> 12x16, 6x32 -> 9x32). It is to be noted
that the maximum number of registers are not used, since it results in
cache inefficient code due to the increase in MR and thus more
broadcasts required from unpacked A matrix.
-Compiler flag updates for AOCC build to generate loops with 64 byte
alignment. This has been observed to improve performance slightly when
k dimension is small.
AMD-Internal: [CPUPL-3173]
Change-Id: I199ce75ef71d994ffe0067dac1ed804dce1742ca
1. New LPGEMM type - S8S8S32/S8 is added.
2. New interface, frame and kernel files are added.
3. Frame and kernel files added/modified for S8S8S32/S8 have
2 operations - Pack B and Mat Mul
4. Pack B kernel routines to pack B matrix for VNNI and compute the sum
of every column of B matrix to implement the S8S8S32 operation using
the VNNI instructions.
5. Mat Mul Kernel files to compute the GEMM output using the VNNI.
Here the A matrix elements are converted from int8 to uint8 (VNNI
works with A matrix type uint8 only).
6. Post GEMM computation, additional operations are performed on the
accumulated outputs to get the correct results.
7. With this change, two new LPGEMM APIs are introduced in LPGEMM -
s8s8s32os32 and s8s8s32os8.
8. All previously added post-ops are supported on S8S8S32/S8 also.
AMD-Internal: [CPUPL-3154]
Change-Id: Ib18f82bde557ea4a815a63adc7870c4234bfb9d3