- Fixed framework of bf16s4f32of32 API to correct
pointer updations.
- Modified pre_op structure to exclude pre-op-offset.
Now offset is passed as a separate parameter to the
scale-pack functions.
- Fixed work-distribution among threads in MT scenario.
- Added Blocksizes and kernel-pointers and verified
functionality for the new API.
AMD-Internal: [SWLCSG-2943]
Change-Id: I58fece240d62c798c880a2b2b7fa64e560cc753d
-To enable Weight-only-Quantization (WOQ) workflow, new LPGEMM APIs
have been developed where data types are A:bf16, B:int4 and C:f32/bf16.
The testing and benchmarking framework for the same are added.
AMD-Internal: [SWLCSG-2943]
Change-Id: Icdc1d60819a23dd9f41382499d1a3c055c5edc17
-Quantization of f32 to bf16 (bf16 = (f32 * scale_factor) + zero_point)
instead of just type conversion in aocl_gemm_bf16bf16f32obf16.
-Support for multiple scale/sum/matrix_add/bias post-ops in a single
LPGEMM api call.
-Post-ops mask related fixes in lpgemv kernels .
-Additional scale post-ops sanity checks.
AMD-Internal: [SWLCSG-2945]
Change-Id: I3b35cc413c176bb50bfdbd6acd4839a5ba7e94bb
Description:
In recent changes bli_print_msg is used in lpgemm
test application file bench_lpgemm.c for printing error
message. bli_print_msg is a blis library function which
is not exported for the usage of applications, because
of which linking failed when blis shared library is used
to build.
Updated bli_print_msg with printf in the bench_lpgemm.c
AMD Internal: CPUPL-5326
Change-Id: I021849baa6881bd997013e42013db1c5c711627f
Initailized c_save instead of 'c" and then removed copying c to c_save.
Because at the start every n_repeats iteration we are copying back c_save to c.
Therefore if we initialize c_save, we can avoid extra copy of "c" to c_save before calling
GEMM. For very large sizes matrix initialization takes considerable amount of time. This can
be reduced now.
Change-Id: I2c6ffe169e991607314897cb0c1fbfc0d74ef179
Details:
- Corrected the usage of vpdpbusd instruction in
GEMV implementation for INT8 APIs.
- Modified bench to fill matrices with values
ranging between -5 and +5 whenever the datatype is
a signed integer.
Change-Id: I457462b888b667d8a34c53de762e9b4aee784ecc
Support for reordering B matrix of datatype int4 as per the pack schema
requirements of u8s8s32 kernel. Vectorized int4_t -> int8_t conversion
implemented via leveraging the vpmultishiftqb instruction. The reordered
B matrix will then be used in the u8s8s32o<s32|s8> api.
AMD-Internal: [SWLCSG-2390]
Change-Id: I3a8f8aba30cac0c4828a31f1d27fa1b45ea07bba
Enabled command line option to link libiomp5.so or libomp.so or libgomp.so libraries using cmake.
Eg:- -DOpenMP_libomp_LIBRARY=<path to openmp library including library name>.
If we not set above variable, by default openmp library will be libomp.so for clang and libgomp.so for gcc compiler.
Change-Id: I5bffa10ff8351f5d10f0d543cbdf55aa16c84c90
-As it stands the bf16bf16f32ob16 API expects bias array to be of type
float. However actual use case requires the usage of bias array of bf16
type. The bf16 micro-kernels are updated to work with bf16 bias array by
upscaling it to float type and then using it in the post-ops workflow.
-Corrected register usage in bf16 JIT generator for bf16bf16f32ob16 API
when k > KC.
AMD-Internal: [SWLCSG-2604]
Change-Id: I404e566ff59d1f3730b569eb8bef865cb7a3b4a1
Description:
--Added support for tranB in u8s8s32o<s32|s8> and
s8s8s32o<s32|s8> API's
--Updated the bench_lpgemm by adding options to
support transpose of B matrix
--Updated data_gen_script.py in lpgemm bench
according to latest input format.
AMD-Internal: [SWLCSG-2582]
Change-Id: I4a05cc390ae11440d6ff86da281dbafbeb907048
- Implemented the feature to benchmark ?AXPYV APIs
for the supported datatypes. The feature allows to
benchmark BLAS, CBLAS or the native BLIS API, based
on the macro definition.
- Added a sample input file to provide examples to benchmark
AXPYV for all its datatype supports.
- Updated the sample input file for SCALV to provide examples
to benchmark all of its datatype supports.
AMD-Internal: [CPUPL-4805]
Change-Id: I550920e3a57fcc2e4900e9e698330d8b8595bdee
- Added support for ?DOTC in bench.
- Updated DTL to accept conjx as a parameter:
- 'N', i.e., no conjugate for DOTU
- 'C', i.e., conjugate for DOTC
- Updated DTL calls in the interface with respective values of
conjx.
AMD-Internal: [CPUPL-4804]
Change-Id: I447b19a6273566c6021c1721ce173bac4a59142c
- Added BUILD_STATIC_LIBS option which is on by default, only on Linux.
- Added TEST_WITH_SHARED option which is off by default, only on Linux.
- If only shared or static lib is being built, that's the one that will be used for testing.
- If both are being built, TEST_WITH_SHARED determins which library wil be used for testing.
- Set linux workflows so that they build both static and shared libs, and use linux-static and linux-shared to denote which one should be used for testing.
- Set -fPIC for both static and shared builds to fix issues faced when building blis using AOCC 4.0.0 and gtestsuite using gcc 9.4.0.
AMD-Internal: [CPUPL-2748]
Change-Id: I4227bab97ff31ecddfe218e18499f33b4e4ee63e
CMakelists.txt is updated to support aocl_gemm on windows.
On windows, BLIS library(blis+aocl_gemm) is built successfully
only with AOCC Compiler. (Clang has an issue with optimizing
VNNI instructions).
$cmake .. -DENABLE_ADDON="aocl_gemm" ....
AMD-Internal: [CPUPL-2748]
Change-Id: I9620878ab6934233fadc9ddc5d5e82ad85be9209
- Updated the existing benchmarking file for SCALV API, to include
support to call the BLAS and CBLAS mixed-precision SCALV, namely
cblas_csscalv(), csscalv_(), cblas_zdscalv(), zdscalv_().
- The input is expected to be given with the datatype 'ZD' and 'CS'
in order to benchmark the associated mixed-precision APIs.
AMD-Internal: [CPUPL-4722]
Change-Id: I4ab0fb19fe1949468cf707d0a857e8a1681addeb
Description
1. when mr0=1 case the accumulator register and operand
registers for an fma instruction got swapped. Corrected
the copy paste error.
2. Removed fill array for c_ref in bench_lpgemm.c and used
memcpy from c buf, because fill array now using rand()
function to initialize data which can be different
when c_ref and c called separately, this was working
because data was fixed (i=0 ... i%5).
Change-Id: Ia513331ba49d28adc7bcdc0ec78d443abe66780b
1. The 5 LOOP LPGEMM path is in-efficient when A or B is a vector
(i.e, m == 1 or n == 1).
2. An efficient implementation of lpgemv_rowvar_f32 is developed
considering the b matrix reorder in case of m=1 and post-ops fusion.
3. When m = 1 the algorithm divide the GEMM workload in n dimension
intelligently at a granularity of NR. Each thread work on A:1xk
B:kx(>=NR) and produce C=1x(>NR). K is unrolled by 4 along with
remainder loop.
4. When n = 1 the algorithm divide the GEMM workload in m dimension
intelligently at a granularity of MR. Each thread work on A:(>=MR)xk
B:kx1 and produce C = (>=MR)x1. When n=1 reordering of B is avoided
to efficiently process in n one kernel.
5. Fixed few warnings while loading 2 f32 bias elements using
_mm_load_sd using float pointer. Typecasted to (const double *)
AMD-Internal: [SWLCSG-2391, SWLCSG-2353]
Change-Id: If1d0b8d59e0278f5f16b499de1d629e63da5b599
-This post-operation computes C = (beta*C + alpha*A*B) + D, where D is
a matrix with dimensions and data type the same as that of C matrix.
AMD-Internal: [SWLCSG-2424]
Change-Id: I9464d1f514e3b04275fe93441489b4503a08937a
-This post-operation computes C = (beta*C + alpha*A*B) + D, where D is
a matrix with dimensions and data type the same as that of C matrix.
-For clang compilers (including aocc), -march=znver1 is not enabled for
zen kernels. Have updated CKVECFLAGS to capture the same.
AMD-Internal: [SWLCSG-2424]
Change-Id: Ie369f7ea5c80ab69eea3f3e03a8d9546e14f5c09
CMakelists.txt is added in bench.
Steps are provided to build for different targets.
AMD-Internal: [CPUPL-2748]
Change-Id: I58027f4e42d1323cafb151224c45868bc8337ff4
-As it stands, in LPGEMM, users are expected to pass an array of values
with length the same as N dimension as inputs for zero point or scale
factor. However at times, a single scalar value is used as zero point
or scale factor for the entire downscaling operation. The mandate to
pass an array requires the user to allocate extra memory and fill it
with the scalar value so as to be used in downscaling. This limitation
is lifted as part of this commit, and now scalar values can be passed
as zero point or scale factor.
-LPGEMM bench enhancements along with new input format to improve
readability as well as flexibility.
AMD-Internal: [SWLCSG-2581]
Change-Id: Ibd0d89f03e1acadd099382dffcabfec324ceb50f
Details:
- LPGEMM uses bli_pba_acquire_m with BLIS_BUFFER_FOR_A_BLOCK to checkout
memory when A matrix needs to be packed. This multi-threaded lock
overhead becomes prominent when m/n dimensions are relatively small,
even when k is large. In order to address this, bli_pba_acquire_m
is used with BLIS_BUFFER_FOR_GEN_USE for LPGEMM. For *GEN_USE,
the memory is allocated using aligned malloc instead of checking
out from memory pool. Experiments have shown malloc costs to be
far lower than memory pool guarded by locks, especially for higher
thread count.
- Deleted few unnecessary instructions from packing kernels.
- Replaced bench_input.txt with lesser number of inputs.
AMD-Internal: [CPUPL-4329]
Change-Id: I5982a0a4df9dc72fab0cffab795c23822d5c8774
Some text files were missing a newline at the end of the file.
One has been added.
AMD-Internal: [CPUPL-3519]
Change-Id: I4b00876b1230b036723d6b56755c6ca844a7ffce
Details:
- Moved the downscale & postop options from commmandline to
input file.
- Now the format of the input file is as follows:
dt_in dt_out stor transa transb op_a op_b m n k lda ldb ldc postops
- In case of no-postops, 'none' has to be passed in the place of
postops.
- Removed duplication of mat_mul_bench_main function for bf16 APIs.
- Added a function called print_matrix for each datatype which can
help in printing matrices while debugging.
- Added printing of ref, computed and diff values while reporting
failure.
- Added new functions for memory allocation and freeing. Different
types of memory allocation is chosen based on mode bench is
running(performance or accuracy mode).
Change-Id: Ia7d740c53035bc76e578a03869590c9f04396b72
1. OpenMP based multi-threading parallelism is added for BLAS
extension APIs of Pack and Compute
2. Both pack and compute APIs are parallelized.
3. Multi-threading of pack and compute APIs done with different
number of threads can lead to inconsistent results due to
output difference of the full packed matrix buffer when packed
with different number of threads.
4. In multi-threaded execution, we ensure output of packed buffer
is exactly the same as in single threaded execution.
5. Similarly for compute API, read of packed buffer in multi-
threaded execution is exactly the same as in single-threaded
execution.
6. Routines are added to compute the offsets for thread workload
distribution for MT execution.
1. The offsets are calculated in such a way that it resembles
the reorder buffer traversal in single threaded reordering.
2. The panel boundaries (KCxNC) remain as it is accessed in
single thread, and as a consequence a thread with jc_start
inside the panel cannot consider NC range for reorder.
3. It has to work with NC' < NC, and the offset is calulated
using prev NC panels spanning k dim + cur NC panel spaning
pc loop cur iteration + (NC - NC') spanning current
kc0 (<= KC).
7. Routines to ensure the same are added for MT execution
1. frame/base/bli_pack_compute_utils.c
2. frame/base/bli_pack_compute_utils.h
AMD-Internal: [CPUPL-3560]
Change-Id: I0dad33e0062519de807c32f6071e61fba976d9ac
Details:
- Modified bench to support testing for sizes where matrix
strides are larger than the corresponding dimensions.
- Modified early-return checks in all interface APIs to
check validity of strides in relation to the corresponding
dimension rather than checking if strides are equal to dimensions.
Change-Id: I382529b636a4acc75f6d93d997af22a168a7bfc4
Description:
1. Updated ERF function threshold from 3.91920590400 to 3.553
to match with the reference erf float implementation which
reduced errors a the borders and also clipped the output
to 1.0
2. Updated packa function call with pack function ptr in bf16
api to avoid compilation issues for non avx512bf16 archs
3. Updated lpgemm bench
[AMD-Internal: SWLCSG-2423 ]
Change-Id: Id432c0669521285e6e6a151739d9a72a7340381d
Details:
- Updated pack function call in ic loop to accept correct params.
- Modified documentation in bench file to reflect updated usage of
bench for downscaled APIs.
- Modified memory allocation for C panel in BF16 APIs to use
BLIS_BUFFER_FOR_GEN_USE while requesting for memory from pool.
Change-Id: Id624ed92ae7c8dafd7f6a32fc1554d2357de4df5
-When bli_pba_acquire_m api is used for packbuf type BLIS_BUFFER_FOR_
<A_BLOCK|B_PANEL|C_PANEL>, the memory is allocated by checking out a
block from an internal memory pool. In order to ensure thread safety,
the memory pool checkout is protected using mutex (bli_pba_lock/
bli_pba_unlock). When the number of threads trying to checkout memory
(in parallel) are high, these locks tend to become a scaling bottleneck,
especially when the memory is to be used for non-packing purposes
(packing could hide some of this cost). LPGEMM uses bli_pba_acquire_m
with BLIS_BUFFER_FOR_C_PANEL to checkout memory when downscale is
enabled for temporary C accumulation. This multi-threaded lock overhead
becomes prominent when m/n dimensions are relatively small, even when k
is large. In order to address this, bli_pba_acquire_m is used with
BLIS_BUFFER_FOR_GEN_USE for LPGEMM. For *GEN_USE, the memory is
allocated using aligned malloc instead of checking out from memory pool.
Experiments have shown malloc costs to be far lower than memory pool
guarded by locks, especially for higher thread count.
-LPGEMM bench fixes for crash observed when benchmarking with post-ops
enabled and no downscale.
AMD-Internal: [SWLCSG-2354]
Change-Id: I4e92feadd2cf638bb26dd03b773556800a1a3d50
- Added support for 2 new APIs:
1. sgemm_compute()
2. dgemm_compute()
These are dependent on the ?gemm_pack_get_size() and ?gemm_pack()
APIs.
- ?gemm_compute() takes the packed matrix buffer (represented by the
packed matrix identifier) and performs the GEMM operation:
C := A * B + beta * C.
- Whenever the kernel storage preference and the matrix storage
scheme isn't matching, and the respective matrix being loaded isn't
packed either, on-the-go packing has been enabled for such cases to
pack that matrix.
- Note: If both the matrices are packed using the ?gemm_pack() API,
it is the responsibility of the user to pack only one matrix with
alpha scalar and the other with a unit scalar.
- Note: Support is presently limited to Single Thread only. Both, pack
and compute APIs are forced to take n_threads=1.
AMD-Internal: [CPUPL-3560]
Change-Id: I825d98a0a5038d31668d2a4b84b3ccc204e6c158
Details:
- Modified aocl_get_reorder_buf_size_ and aocl_reorder_ APIs
to allow reordering from column major input matrix.
- Added new pack kernels that packs/reorders B matrix from
column-major input format.
- Updated Early-return check conditions to account for trans
parameters.
- Updated bench file to test/benchmark transpose support.
AMD-Internal: [CPUPL-2268]
Change-Id: Ida66d7e3033c52cca0229c6b78d16976fbbecc4c
Downscaling is used when GEMM output is accumulated at a higher
precision and needs to be converted to a lower precision afterwards.
Currently the u8s8s16 flavor of api only supports downscaling to s8
(int8_t) via aocl_gemm_u8s8s16os8 after results are accumulated at
int16_t.
LPGEMM is modified to support downscaling to different data types,
like u8, s16, apart from s8. The framework (5 loop) passes the
downscale data type to the micro-kernels. Within the micro-kernel,
based on the downscale type, appropriate beta scaling and output
buffer store logic is executed. This support is only enabled for
u8s8s16 flavor of api's.
The LPGEMM bench is also modified to support passing downscale data
type for performance and accuracy testing.
AMD-Internal: [SWLCSG-2313]
Change-Id: I723d0802baf8649e5e41236b239880a6043bfd30
Details:
- Added new params(order, trans) to aocl_get_reorder_buf_size_ and
aocl_reorder_ APIs.
- Added new pack kernels that packs A matrix from either row-major or
column major input matrix to pack buffer with row-major format.
- Updated cntx with pack kernel function pointers for packing A matrix.
- Transpose of A matrix is handled by packing A matrix to row-major
format during run-time.
- Updated Early-return check conditions to account for trans parameters.
- Updated bench file to test/benchmark transpose support.
AMD-Internal: [SWLCSG-2268, SWLCSG-2442]
Change-Id: I43a113dc4bc11e6bb7cc4d768c239a16cb6bbea4
-Downscaled / quantized value is calculated using the formula
x' = (x / scale_factor) + zero_point. As it stands, the micro-kernels
for these APIs only support scaling.
Zero point addition is implemented as part of this commit, with it
being fused as part of the downscale post-op in the micro-kernel. The
zero point input is a vector of int8 values, and currently only vector
based zero point addition is supported.
-Bench enhancements to test/benchmark zero point addition.
AMD-Internal: [SWLCSG-2332]
Change-Id: I96b4b1e5a384a4683b50ca310dcfb63debb1ebea
Functions which are declared as "inline" may trigger compiler error "undefined function"
This linker error is eliminated by use "static" before "inline".
Therefore added "static" before all inline functions.
Change-Id: I5952fb71112fc4792011c3e29be930ccfbce4562
Certain functions were updated recently and now takes extra arguments
for error handling. Usage of the same are now updated in aocl_gemm.
Change-Id: I7daca4fd1f284d57034d564f0a08cc6410ccfd5c
Modifying blis/bench/CMakeLists.txt to include nrm2 target and produce the corresponding executable.
AMD-Internal: [CPUPL-3625]
Change-Id: I7945416142e07ac99510ed9500a2c620053c7e13
-Micro-kernel: Some AVX512 intrinsics(eg: _mm512_loadu_epi32) were
introduced in later versions of gcc (>10) in addition to already
existing masked intrinsic(eg: _mm512_mask_loadu_epi32). In order to
support compilation using gcc 9.4, either the masked intrinsic or other
gcc 9.4 compatible intrinsic needs to be used (eg: _mm512_loadu_si512)
in LPGEMM Zen4 micro-kernels.
-Frame: BF16 LPGEMM api's (aocl_gemm_bf16bf16f32obf16/bf16bf16f32of32)
needs to be disabled if aocl_gemm (LPGEMM) addon is compiled using gcc
9.4. BF16 intrinsics are not supported in gcc 9.4, and the micro-kernels
for BF16 LPGEMM is excluded from compilation based on GNUC macro.
AMD-Internal: [CPUPL-3396]
Change-Id: I096b05cdceea77e3e7fec18a5e41feccdf47f0e7
Some text files were missing a newline at the end of the file.
One has been added.
Also correct file format of windows/tests/inputs.yaml, which
was missed in commit 0f0277e104
AMD-Internal: [CPUPL-2870]
Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549
1. New LPGEMM type - s8s8s16os16 and s8s8s16os8 are added.
2. New interface, frame and kernel files are added.
3. Frame and kernel level files added and modified for s8s8s16
4. s8s8s16 type involves design changes of 2 operations -
Pack B and Mat Mul
5. Pack B kernel routines to pack B matrix for s16 FMA and compute the
sum of every column of B matrix to implement the s8s8s16 operation
using the s16 FMA instructions.
5. Mat Mul Kernel files to compute the GEMM output using s16 FMA.
Here the A matrix elements are converted from int8 to uint8 (s16 FMA
works with A matrix type uint8 only) by adding extra 128 to
every A matrix element
6. Post GEMM computation, additional operations are performed on the
accumulated outputs to get the correct results.
Final C = C - ( (sum of column of B matrix) * 128 )
This is done to compensate for the addition of extra 128 to every
A matrix elements
7. With this change, two new LPGEMM APIs are introduced in LPGEMM -
s8s8s16os16 and s8s8s16os8.
8. All previously added post-ops are supported on s8s8os16/os8 also.
AMD-Internal: [CPUPL-3234]
Change-Id: I3cc23e3dcf27f215151dda7c8db29b3a7505f05c
-Softmax is often used as the last activation function in a neural
network - softmax(xi) = exp(xi)/(exp(x0) + exp(x1) + ... + exp(xn))).
This step happens after the final low precision gemm computation,
and it helps to have the softmax functionality that can be invoked
as part of the lpgemm workflow. In order to support this, a new api,
aocl_softmax_f32 is introduced as part of aocl_gemm. This api
computes element-wise softmax of a matrix/vector of floats. This api
invokes ISA specific vectorized micro-kernels (vectorized only when
incx=1), and a cntx based mechanism (similar to lpgemm_cntx) is used
to dispatch to the appropriate kernel.
AMD-Internal: [CPUPL-3247]
Change-Id: If15880360947435985fa87b6436e475571e4684a
-Currently only one eltwise post-op (one of relu/prelu/gelu_tanh/
gelu_erf) is supported in the post-op struct along with bias or
downscale. This setup was sufficient when only activation functions
were supported as eltwise post-ops. But with the introduction of clip
post-op(a type of non-activation eltwise operation), it has become
necessary to extend the post-ops framework to support multiple eltwise
operations, with the multiple eltwise often used in the form activation
eltwise op + non-activation eltwise ops. The aocl post-op struct is
modified and the post-op parser is updated to support this use case.
-The lpgemm_bench is updated to support testing/benchmarking of the
multiple eltwise operations use case. The function for accuracy checking
is modified to support correctness testing irrespective of the order and
count of post-ops. Additionally the help message is updated so as to
better describe the capabilities of lpgemm_bench.
AMD-Internal: [CPUPL-3244]
Change-Id: If4ce8d7261d32073da8fa4757ed4f2ea0e94249f
-Similar to downscale optimizations made for u8s8s32 gemm, the following
optimizations are made to improve the downscale performance for u8s8s16
gemm:
a. The store to temporary s16 buffer can be avoided when k < KC since
intermediate accumulation will not required for the pc loop (only 1
iteration). The downscaled values (s8) are written directly to the
output C matrix.
b. Within the micro-kernel when beta != 0, the s8 data from the original
C output matrix is loaded to a register, converted to s16 and beta
scaling applied on it. The previous design of copying the s8 value to
the s16 temporary buffer inside jc loop and using the same in beta
scaling is removed.
-Alpha scaling (multiply instruction) by default was resulting in
performance regression when k dimension is small and alpha=1 in s16
micro-kernels. Alpha scaling is now only done when alpha != 1.
AMD-Internal: [CPUPL-3237]
Change-Id: If25f9d1de8b9b8ffbe1bd7bce3b7b0b5094e51ef
-Currently in aocl_gemm, gelu (both tanh and erf based) computation is
only supported as a post-op as part of low precision gemm api call (done
at micro-kernel level). However gelu computation alone without gemm is
required in certain cases for users of aocl_gemm.
-In order to support this, two new api's - aocl_gelu_tanh_f32 and
aocl_gelu_erf_f32 are introduced as part of aocl_gemm. These api's
computes element-wise gelu_tanh and gelu_erf respectively of a matrix/
vector of floats. Both the api's invokes ISA specific vectorized micro-
kernels (vectorized only when incx=1), and a cntx based mechanism
(similar to lpgemm_cntx) is used to dispatch to the appropriate kernel.
AMD-Internal: [CPUPL-3218]
Change-Id: Ifebbaf5566d7462288a9a67f479104268b0cc704
1. Custom Clip is an element-wise post-op which is used to
clip the accumulated GEMM output within a certain range.
2. The Clip Post-Op is used in downscaled and non-downscaled
LPGEMM APIs and SGEMM.
3. Changes are done at frame and microkernel level to implement
this post-op.
4. Different versions are implemented - AVX-512, AVX-2, SSE-2
to enable custom clipping for various LPGEMM types and SGEMM
AMD-Internal: [CPUPL-3207]
Change-Id: I71c60be69e5a0dc47ca9336d58181c097b9aa0c6