Previously, the DGEMM implementation used `dscalv` for cases
where the M dimension of matrix A is not in multiple of 24,
resulting in a ~40% performance drop.
This commit introduces a specialized edge cases in pack kernel
to optimize performance for these cases.
The new packing support significantly improves the performance.
- Removed reliance on `dscalv` for edge cases, addressing the
performance bottleneck.
AMD-Internal: [CPUPL-6677]
Change-Id: I150d13eb536d84f8eb439d7f4a77a04a0d0e6d60
Details:
- This implementation is picked form cntx when GEMM is invoked on
machines that support AVX512 instructions by forcing the
AVX2 path using AOCL_ENABLE_INSTRUCTIONS=AVX2 during run-time.
- This implementation uses MR=16 for GEMV.
AMD-Internal: [SWLCSG-3519]
Change-Id: I8598ce6b05c3d5a96c764d96089171570fbb9e1a
- Added a new GEMV kernel with MR = 8 which will be used
for cases where n=1.
- Modified GEMM and GEMV framework to choose right GEMV kernel
based on compile-time and run-time architecture parameters. This
had to be done since GEMV kernels are not stored-in/retrieved-from
the cntx.
- Added a pack kernel that packs A matrix from col-major to row-major
using AVX2 instructions.
AMD-Internal: [SWLCSG-3519]
Change-Id: Ibf7a8121d0bde37660eac58a160c5b9c9ebd2b5c
The "-mno-avx512f" compiler flag has been added for zen/lpgemm
source files to address an issue observed with the znver4 compiler
flag when using GCC installed through Spack. The error message
"unsupported instruction `vpcmpeqd'" was encountered, indicating
unsupported AVX-512F instructions. As a workaround, the
"-mno-avx512f" flag was introduced, ensuring that AVX-512F
instructions are disabled during compilation.
AMD-Internal: [CPUPL-6694]
Change-Id: I546475226fbfea4931d568fc1b928cf6c8699b61
- Introducted new assembly kernel that copies data from source
to destination from the front and back of the vector at the
same time. This kernel provides better performance for larger
input sizes.
- Added a wrapper function responsible for selecting the kernel
used by DCOPYV API to handle the given input for zen5
architecture.
- Updated AOCL-dynamic threshold for DCOPYV API in zen4 and
zen5 architectures.
- New unit-tests were included in the grestsuite for the new
kernel.
AMD-Internal: [CPUPL-6650]
Change-Id: Ie2af88b8e97196b6aa02c089e59247742002f568
Details:
- These kernels are picked from cntx when GEMM is invoked
on machines that support AVX512 instructions by forcing the
AVX2 path using AOCL_ENABLE_INSTRUCTIONS=AVX2 during run-time.
- This path uses the same blocksizes and pack kernels as AVX512
path.
- GEMV is disabled currently as AVX2 kernels for GEMV are not
implemented.
AMD-Internal: [SWLCSG-3519]
Change-Id: I75401fac48478fe99edb8e71fa44d36dd7513ae5
Description
- To enable AVX512 VNNI support without native BF16 in BF16 kernels, the
BF16 C_type is converted to F32 for computation and then cast back to
BF16 before storing the result.
- Added support for handling BF16 zero-point values of BF16 type.
- Added a condition to disable the tiny path for the BF16 code path
where native BF16 is not supported.
AMD Internal : [CPUPL-6627]
Change-Id: I1e0cfefd24c5ffbcc95db73e7f5784a957c79ab9
- Updated the decision logic for taking the RD path for FP32.
- Since the 5-loop was designed specifically for RV kernels, added a
boolean flag to specify when RD path is to be taken, and set ps_b_use
to cs_b_use in case B matrix is unpacked.
AMD-Internal: [SWLCSG-3497]
Change-Id: I94ed28304a71b759796edcdd4edf65b9bad22bea
- Added a set of AVX512 fringe kernels(using masked loads and
stores) in order to avoid rerouting to the GEMV typed API
interface(when m = 1). This ensures uniformity in performance
across the main and fringe cases, when the calls are multithreaded.
- Further tuned the thresholds to decide between ZGEMM Tiny, Small
SUP and Native paths for ZEN4 and ZEN5 architectures(in case
of parallel execution). This would account for additional
combinations of the input dimensions.
- Moved the call to Tiny-ZGEMM before the BLIS object creation,
since this code-path operates on raw buffers.
- Added the necessary test-cases for functional and memory testing
of the newly added kernels.
AMD-Internal: [CPUPL-6378][CPUPL-6661]
Change-Id: I9af73d1b6ef82b26503d4fc373111132aee3afd6
- For single-threaded configuration of BLIS, packing of A and B matrices
are enabled by default. But, packing of A is only supported for RV
kernels where elements from matrix A are being broadcasted. Since
elements are being loaded in RD kernels, packing of A results in
failures. Hence, disabled packing of matrix A for RD kernels.
- Fixed the issue where c_i index pointer was incorrectly being reset
when exceeding MC block thus, resulting in failures for certain
Post-Ops.
- Fixed the FP32 reoder case were for n == 1 and rs_b == 1 condition, it
was incorrectly using sizeof(BLIS_FLOAT) instead of sizeof(float).
AMD-Internal: [SWLCSG-3497]
Change-Id: I6d18afa996c253d79f666ea9789270bb59b629dd
Details:
- In reorder functions, validity of strides are being checked assuming
that the matrix to be reordered is always row-major. Modified the code
to take stor_order into consideration while checking for validity of
strides.
- This does not directly impact the functionality of GEMM as we don't
support GEMM on col-major matrices where A and/or B matrices are
reordered before GEMM computation. But this change makes sense when
reordering is viewed as an independent functionality irrespective of
what the reordered buffers will be used for.
Change-Id: If2cc4a353bca2f998ad557d6f128881bc9963330
Description
1. In the cases of clip, swish, and relu_scale, constants are currently
loaded as float. However, they are of C type, so handling has been
adjusted, for integer these constants are first loaded as integer
and then converted to float.
Change-Id: I176b805b69679df42be5745b6306f75e23de274d
- Currently the int8/uint8 APIs do not support multiple ZP types,
but works only with int8 type or uint8 type.
- The support is added to enable multiple zp types in these kernels
and added additional macros to support the operations.
- Modified the bench downscale reference code to support the updated
types.
AMD-Internal : [ SWLCSG-3304 ]
Change-Id: Ia5e40ee3705a38d09262086d20731e8f0a126987
- Support for Post-Ops has been added for all F32 RD AVX512 and AVX2
kernels.
AMD-Internal: [SWLCSG-3497]
Change-Id: Ia2967417303d8278c547957878d93c42c887109e
Previous commit on this (e0b86c69af)
was incorrect and incomplete. Add additional changes to enable
blis_impl layer for extension APIs for copying and transposing
matrices.
Change-Id: Ic707e3585acc1c0c554d7e00435464620a8c85dc
- Added FP32 RD (dot-product) kernels for both, AVX512 and AVX2 ISAs.
- The FP32 AVX512 primary RD kernel has blocking of dimensions 6x64
(MRxNR) whereas it is 6x16 (MRxNR) for the AVX2 primary RD kernel.
- Updatd f32 framework to accomodate rd kernels in case of B trans
with thresholds
- Updated data gen python script
TODO:
- Post-Ops not yet supported.
Change-Id: Ibf282741f58a1446321273d5b8044db993f23714
BLAS and BLIS extension APIs for copying and transposing matrices
currently only have one interface option. This patch adds a
blis_impl layer and makes the top level interface enabled only if
BLIS_ENABLE_BLAS is enabled, as with standard BLAS interfaces.
Change-Id: I1b6c668e8492305b16e8735b9ed83bea3c0d3b6c
- Updated the S8 main, GEMV, m_, n_ and mn_ fringe kernels to support
multiple scale types for vector and scalar scales
- Updated the U8 main, GEMV, m_, n_, extMR_ and mn_ fringe kernels to
support multiple scale types for vector and scalar scales
- Updated the bench to accommodate multiple scale type input, and
modified the downscale_accuracy_check_ to verify with multiple scale
type inputs.
AMD Internal: [ SWLCSG-3304 ]
Change-Id: I7b9f3ec8ea830d3265f72d18a0aa36086e14a86e
- Instead of native, we are wrongly selecting TRSM small, now its fixed.
AMD-Internal: [SWLCSG-3338]
Change-Id: I7a06a483fd874c71562a924b50118e0fc9e3b213
Copy changes from upstream BLIS to add gemmtr interfaces to
match new BLAS functionality in recent LAPACK releases.
This addresses https://github.com/amd/blis/issues/31, thanks
to Greg Jones for reporting this issue.
AMD-Internal: [CPUPL-6581]
Change-Id: I2b1a724d80902541b1d2b073fa3d1ea71442f445
- Updated the F32 tiny path to support column-major inputs.
- Tuned the tiny-path thresholds to redirect additional inputs to the
tiny path based on the m*n*k value.
AMD-Internal: [SWLCSG-3380]
Change-Id: If3476b17cc5eaf4f4e1cf820af0a32ede3e1742e
- Added column major path for BF16 tiny path
- Tuned tiny-path thresholds to support few more inputs to the
tiny path.
AMD-Internal: [SWLCSG-3380]
Change-Id: I9a5578c9f0d689881fc5a67ab778e6a917c4fce1
Description:
1. For column major case when m=1 there was an accuracy mismatch with
post ops(bias, matrix_add, matrix_add).
2. Added check for column major case and replace _mm512_loadu_ps with
_mm512_maskz_loadu_ps.
AMD-Internal: [CPUPL-6585]
Change-Id: I8d98e2cb0b9dd445c9868f4c8af3abbc6c2dfc95
Rename generated aocl-blas.pc and aocl-blas-mt.pc to blis.pc and blis-mt.pc.
AMD-Internal: [SWLCSG-3446]
Change-Id: Ica784c7a0fd1e52b4d419795659947316e932ef6
- Corrected a typo in dgemm kernel implementation, beta=0 and
n_left=6 edge kernel.
Thanks to Shubham Sharma<shubham.sharma3@amd.com> for helping with debugging.
AMD-Internal: [CPUPL-6443]
Change-Id: Ifa1e16ec544b7e85c21651bc23c4c27e86d6730b
- Implemented an AVX512 rank-1 kernel that is
expected to handle column-major storage schemes
of A, B and C(without transposition) when k = 1.
- This kernel is single-threaded, and acts as a direct
call from the BLAS layer for its compatible inputs.
- Defined custom BLAS and BLIS_IMPLI layers for CGEMM
(instead of using the macro definition), in order to
integrate the call to this kernel at runtime(based on
the corresponding architecture and input constraints).
- Added unit-tests for functional and memory testing of the
kernel.
- Updated the ZEN5 context to include the AVX512 CGEMM
SUP kernels, with its cache-blocking parameters.
AMD-Internal: [CPUPL-6498]
Change-Id: I42a66c424325bd117ceb38970726a05e2896a46b
- Implemented the following AVX512 SUP
column-preferential kernels(m-variant) for CGEMM :
Main kernel : 24x4m
Fringe kernels : 24x3m, 24x2m, 24x1m,
16x4, 16x3, 16x2, 16x1,
8x4, 8x3, 8x2, 8x1,
fx4, fx3, fx2, fx1(where 0<f<8).
- Utlized the packing kernel to pack A when
handling inputs with CRC storage scheme. This
would in turn handle RRC with operation transpose
in the framework layer.
- Further adding C prefetching to the main kernel,
and updated the cache-blocking parameters for
ZEN4 and ZEN5 contexts.
- Added a set of decision logics to choose between
SUP and Native AVX512 code-paths for ZEN4 and ZEN5
architectures.
- Updated the testing interface for complex GEMMSUP
to accept the kernel dimension(MR) as a parameter, in
order to set the appropriate panel stride for functional
and memory testing. Also updated the existing instantiators
to send their kernel dimensions as a parameter.
- Added unit tests for functional and memory testing of these
newly added kernels.
AMD-Internal: [CPUPL-6498]
Change-Id: Ie79d3d0dc7eed7edf30d8d4f74b888135f31d6b4
- Included a new code section to handle input having non-unit strided y
vector for dgemv transpose case. Removed the same from the respective
kernels to avoid repeated branching caused by condition checks within
the 'for' loop.
- The condition check for beta is equal to zero in the primary kernels
are moved outside the for loop to avoid repeated branching.
- The '_mm512_reduce_pd' operations in the primary kernel is replaced by
a series of operations to reduce the number of instructions required
to reduce the 8 registers.
- Changing naming convention for DGEMV transpose kernels.
- Modified unit kernel test to avoid y increment for dgemv tranpose
kernels during the test.
AMD-Internal: [CPUPL-6565]
Change-Id: I1ac516d6b8f156ac53ac9f6eb18badd50e152e05
Updated CMakeLists.txt to remove GNU extensions for both C and C++.
Now during building -std=c99 is used instead of -std=gnu99.
Signed-off-by: Jagadish R <jagadish1.r@amd.com>
AMD-Internal: [CPUPL-6553]
Change-Id: I98150707990112c5736660d287f1ddbe71a4e8e6
- Currently, the bf16 reorder function does not add padding for
n=1 cases. But, the bf16 AVX2 Unreorder path considers the input
re-ordered B matrix to be padded along the n and k dimension.
- Hence, modified the conditions to make sure the path doesn't break
while the AVX2 kernels are executed in AVX512 machines when
B matrix reordered.
Change-Id: I7dd3d37a24758a8e93e80945b533abfcf15f65a1
-Currently the Tid spread does not happen for n=4096 even if there
are threads available to facilitate the same. Update the threshold
to account for the same.
AMD-Internal: [SWLCSG-3185]
Change-Id: I281b1639c32ba2145bd84062324f1f11b1167eeb
Details:
- Setting post_op_grp to NULL at the start of post-op
- creator to ensure that there is no junk value(non-null)
- which might lead to destroyer trying to free
- non-allocated buffers.
AMD-Internal: [SWLCSG-3274]
Change-Id: I45a54d01f0d128d072d5d9c7e66fc08412c7c79c
Details:
- Group quantization is technique to improve accuracy
where scale factors to quantize inputs and weights
varies at group level instead of per channel
and per tensor level.
- Added new bench files to test GEMM with symmetric static
quantization.
- Added new get_size and reorder functions to account for
storing sum of col-values separately per group.
- Added new framework, kernels to support the same.
- The scalefactors could be of type float or bf16.
AMD-Internal:[SWLCSG-3274]
Change-Id: I3e69ecd56faa2679a4f084031d35ffb76556230f
- Implemented the following AVX512 native
computational kernels for CGEMM :
Row-preferential : 4x24
Column-preferential : 24x4
- The implementations use a common set of macros,
defined in a separate header. This is due to the
fact that the implementations differ solely on
the matrix chosen for load/broadcast operations.
- Added the associated AVX512 based packing kernels,
packing 24xk and 4xk panels of input.
- Registered the column-preferential kernel(24x4) in
ZEN4 and ZEN5 contexts. Further updated the cache-blocking
parameters.
- Removed redundant BLIS object creation and its contingencies
in the native micro-kernel testing interface(for complex types).
Added the required unit-tests for memory and functionality
checks of the new kernels.
AMD-Interal: [CPUPL-6498]
Change-Id: I520ff17dba4c2f9bc277bf33ba9ab4384408ffe1
Details:
- Fixed the logic to identify an API that has int4 weights in
bench files for gemm and batch_gemm.
- Eliminated the memcpy instructions used in pack functions of
zen4 kernels and replaced them with masked load instruction.
This ensures that the load register will be populated with
zeroes at locations where mask is set to zero.
Change-Id: I8dd1ea7779c8295b7b4adec82069e80c6493155e
AMD-Internal:[SWLCSG-3274]
- Updated the bli_dgemv_zen_ref( ... ) kernel to support general stride.
- Since the latest dgemv kernels don't support general stride, added
checks to invoke bli_dgemv_zen_ref( ... ) when A matrix has a general
stride.
- Thanks to Vignesh Balasubramanian <vignesh.balasubramanian@amd.com>
for finding this issue.
AMD-Internal: [CPUPL-6492]
Change-Id: Ia987ce7674cb26cb32eea4a6e9bd6623f2027328
- In 8x24 DGEMM kernel, prefetch is always done assuming
row major C.
- For TRSM, the DGEMM kernel can be called with column major C also.
- Current prefetch logic results in suboptimal performance.
- Changed C prefetch logic so that correct C is prefetched for both row
and column major C.
AMD-Internal: [CPUPL-6493]
Change-Id: I7c732ceac54d1056159b3749544c5380340aacd2
-Currently the scale factor is loaded without using mask in downscale,
and matrix add/mul ops in the F32 eltwise kernels. This results in
out of memory reads when n is not a multiple of NR (64).
-The loads are updated to masked loads to fix the same.
AMD-Internal: [SWLCSG-3390]
Change-Id: Ib2fc555555861800c591344dc28ac0e3f63fd7cb
Description:
1. Implement s8 unreorder API function which performs
unreordering of int8 matrix which is reordered
2. Removed bf16vnni check for bf16 unreorder reference API
because it can work on any architecture as it is reference
code
3. Tested the reference code for all main and fringe paths.
AMD-Interneal: [SWLCSG-3426]
Change-Id: I920f807be870e1db5f9d0784cdcec7b366e1eff5
Description
- Zero point support for <s32/s8/bf16/u8> datatype in element-wise
postop only f32o<f32/s8/u8/s32/bf16> APIs.
AMD-Internal: [SWLCSG-3390]
Change-Id: I2fdb308b05c1393013294df7d8a03cdcd7978379
-New packing kernels for A matrix, both based on AVX512 and AVX2 ISA,
for both row and column major storage are added as part of this change.
Dependency on haswell A packing kernels are removed by this.
-Tiny GEMM thresholds are further tuned for BF16 and F32 APIs.
AMD-Internal: [SWLCSG-3380, SWLCSG-3415]
Change-Id: I7330defacbacc9d07037ce1baf4a441f941e59be
- Bug fix in Matrix Mul post op.
- Updated the config in AVX512_VNNI_BF16 context
to work in AVX2 kernels
Change-Id: I25980508facc38606596402dba4cfce88f4eb173
Description
Due to different datatype for zero point during post-op creation
and accuracy check we see an accuracy issue for u8/s8s8s32 apis
with output type f32/bf16.
AMD-Internal: [CPUPL-6456]
Change-Id: If8925988841af87cb5687c84aade607967c744fe
- Added column major pack kernels, which will transpose and store the
BF16 matrix input to F32 input matrix
- Added BF16 Zero point Downscale support to F32 main and fringe
kernels.
- Updated Matrix Add and Matrix Mul post-ops in f32-AVX2 main and
fringe kernels to support BF16 input.
- Modified the f32 tiny kernels loop to update the buf_downscale
parameter.
- Modified bf16bf16f32obf16 framework to work with AVX-2 system.
- Added wrapper in bf16 5-Loop to call the corresponding AVX-2/AVX-512
5 Loop functions.
- Bug fixes in the f32-AVX2 kernels BIAS post-ops.
- Bug fixes in the Convert function, and the bf16 5-loop
for multi-threaded inputs.
AMD-Internal:[SWLCSG-3281 , CPUPL-6447]
Change-Id: I4191fbe6f79119410c2328cd61d9b4d87b7a2bcd
Description:
1. Fixed bf16 un-reorder column major kernel
2. Fixed a bug in nrlt16 case of f32obf16 reorder function
3. Unit testing done .
AMD-internal: [SWLCSG-3279]
Change-Id: I65024342935ae65186b95885eb010baf3269aa7d
-When bli_pba_acquire_m is invoked to get a buffer for packing, if
buffer type is BLIS_BUFFER_FOR_B_PANEL, then the memory is returned
from a memory pool. In order to ensure thread safety, this memory
pool is protected using locks. Instead if buffer type was
BLIS_BUFFER_FOR_GEN_USE, then memory is allocated using malloc.
-However it was observed that for relatively small input dimensions,
if on the go packing is required, and if jc_ways is sufficiently
large, then there was contention at the lock on the memory pool for
B_PANEL buffer type. This turned out to be an overhead and is now
avoided by checking out GEN_USE buffer type for packing.
AMD-Internal: [SWLCSG-3398]
Change-Id: I781ad5da2a2f75997b58d6c3db70f6277250bd99
Description:
1. When compiler gcc version less than 11.2 few BF16 instructions
are not supported by the compiler even though the processors arch's
zen4 and zen5 supports.
2. These instructions are guarded now with a macro.
Change-Id: Ib07d41ff73d8fe14937af411843286c0e80c4131
-BF16 tiny GEMM path is only enabled for Zen4 or Zen5 arch id as
returned by the bli_arch_query_id function. Additionally it is
disabled if JIT kernels are used.
-Fixed nrlt16 case in bf16_unreorder_ref function
AMD-Internal: [SWLCSG-3380, SWLCSG-3258]
Change-Id: I8af638a85e949f12181bc56c63e5e983c24ca3af
-The block sizes and micro kernel dimensions for the F32OF32 group
of APIs are updated in the element wise operations cntx map.
AMD-Internal: [SWLCSG-3390]
Change-Id: Ic5690b7eb4f7b2559d893f374dd811b00e31e329