- Implemented a set of column preferential dot-product based
ZGEMM kernels(main and fringe) in AVX512(for SUP code-path).
These kernels perform matrix multiplication as a sequence
of inner products(i.e, dot-products).
- These standalone kernels are expected to strictly handle
the CRC storage scheme for C, A and B matrices. RRC is also
supported through operation transpose, at the framework
level.
- Added unit-tests to test all the kernels(main and fringe),
as well as the redirection between these kernels.
AMD-Internal: [CPUPL-5949]
Change-Id: I858257ac2658ed9ce4980635874baa1474b79c38
Description:
_mm512_cvtne2ps_pbh(a, b) instruction takes
b when j<16 but the code was developed in
with assuming reverse order.
Fixed some indentation issues
Changed the file name and made it uniform
Change-Id: I7b45b4c35931d8febde7b7b5d9604ea953046f97
Description:
aocl_reorder_f32obf16 function is implemented to
reorder input weight matrix of data type float to
bfloat16.
The reordering is done to match the input requirements
of API aocl_gemm_bf16bf16f32o<f32|bf16>.
The objective of the API is to convert a model/matrix
of type f32 to bf16 and process when machine supports
bf16 FMA instruction _mm512_dpbf16_ps but the model
is still in float
Change-Id: Ib7c743d52d01a1ac09e84ac120577ec9e02f90f5
-Currently lpgemm sets the context (block sizes and micro-kernels) based
on the ISA of the machine it is being executed on. However this approach
does not give the flexibility to select a different context at runtime.
In order to enable runtime selection of context, the context
initialization is modified to read the AOCL_ENABLE_INSTRUCTIONS env
variable and set the context based on the same. As part of this commit,
only f32 context selection is enabled.
-Bug fixes in scale ops in f32 micro-kernels and GEMV path selection.
-Added vectorized f32 packing kernels for NR=16(AVX2) and NR=64(AVX512).
This is only for B matrix and helps remove dependency of f32 lpgemm api
on the BLIS packing framework.
AMD Internal: [CPUPL-5959]
Change-Id: I4b459aaf33c54423952f89905ba43cf119ce20f6
Details:
- Added a new API called unreorder that converts a matrix from
reordered format to it's original format( row-major or col-major ).
- Currently this API only supports bf16 datatype.
- Added corresponding bench and input file to test accuracy of the
API.
- The new API is only supported for 'B' matrix.
- Modified input validation checks in reorder API to account for
row Vs col storage of matrix and transposes for bf16 datatype.
Change-Id: Ifb9c53b7e6da6f607939c164eb016e82514581b7
-Added new pack kernels that packs/reorders B matrix (odd strides) from
column-major input format. This also supports the transB scenario if
input B matrix is row major.
Change-Id: Ia0fe7e5f19ae9eba5c418f4089c7e6df11091853
- Implemented the Scale post-op for the F32 API for all kernels
- f32_scale = (f32 * scale_factor) + offset
- Added the bench inputs
Change-Id: Ib0f25f870eafe695d8b2a2c434c8cb3ec4f7db4c
- Data-type of n, and conj is dim_t which will be int32_t for LP64 case.
- When loading 64-bit registers using "mov" instructions, mov(rax, var(n)),
the "n" should be 64-bit otherwise incorrect values gets loaded.
Fix: We typecast these variables to int64_t before loading into registers.
Thanks to mangala.v@amd.com for finding this bug.
Change-Id: I8542dc1ea434ca9030f3c56d9a681135055f8ba5
- Data-type of m, n, k,ldc is dim_t which will be int32_t for LP64 case.
- When loading 64-bit registers using "mov" instructions, mov(rax, var(m)),
the "m" should be 64-bit otherwise incorrect values gets loaded.
Fix: We typecast these variables to int64_t before loading into registers.
AMD-Internal: [CPUPL-5819]
Change-Id: I16043ac168a79ff9358c0c1768989a81e3c6b0e0
-Added new pack kernels that packs/reorders B matrix from column-major
input format. This also supports the transB scenario if input B matrix
is row major.
Change-Id: I4c75b6e81016331fd7e7f95ad4212e6d38dc586f
- Implemented the AVX512 packA kernel for col major inputs in F32 API
- Removed the work arounds for n = 1, mtag_a = PACK case, where the execution was
being directed to GEMM instead of GEMV.
Change-Id: I6fb700d96069213a762e8a83a209c5388a91050f
SCALV is used directly by BLAS, CBLAS and BLIS scal{v} APIs but
also within many other APIs to handle special cases. In general
it is preferred to use SETV when alpha=0, but BLAS and CBLAS
continue to multiple all vector element by alpha. This has
different behaviour for propagating NaNs or Infs.
Changes in this commit:
- Standardize early returns from SCALV reference and optimized
kernels.
- User supplied N<0 is handled at the top level API layer. Use
negative values of N in kernel calls to signify that SETV
should _not_ be used when alpha=0. This should only be
required in SCALV.
- Include serial threshold in zdscal (as in dscal) to reduce
overhead for small problem sizes.
- Code tidying to make different variants more consistent.
- More standardization of tests in SCALV gtestsuite programs.
- Remove scalv_extreme_cases.cpp as it is now redundant.
AMD-Internal: [CPUPL-4415]
Change-Id: I42e98875ceaea224cc98d0cdfe0133c9abc3edae
- Added explicit typecast to the pointers that are passed
to the _mm_prefetch( ... ) intrinsic, to avoid compiler
warnings.
AMD-Internal: [CPUPL-4415]
Change-Id: I1c1398b7b5abe81848d33cb6df107f7f077588ea
Description:
1. Written 6x64 main and other fringe kernels for WoQ where scaling s4
weights into bf16 performed in the kernel itself to reduce bandwidth.
2. These kernels are performing better compared to bf16 weights when m
is small and n is large.
3. Established a threshold to do quantization support at packing of
B (KCXNC) level or WoQ kernel level.
Change-Id: I4f8265b8b58c276ff2590cc948d1f920aa0bb289
- Added support for TransA and transB in f32f32of32 APIs
- Modified the GEMV case(m == 1) to support PACKB feature
- Redirecting the operations to GEMM instead of GEMV in case of n == 1
conditions, with storage scheme r/transA and c/transB to avoid the
packing errors which would lead to failures in computation.
Change-Id: I0eb8c31485af4e33c53fd36b5e5788d75d3a67a9
Details:
- In WOQ, if m = 4, special case kernels are added where
s4->bf16 conversion happens inside the compute kernel and
packing is avoided. For all other cases, B matrix is
dequantized and packed at KC loop level and native bf16
kernels are re-used at compute level.
- Fixes in bench to avoid accuracy failures when datatype of
output is bf16.
Change-Id: Ie8db42da536891693d5e82a5336b66514a50ccb2
This API supports applying element wise operations (eg: post-ops) on a
float(f32) input matrix to get an output matrix of the same (float(f32)).
Change-Id: I387a544f0d33d2231f5f6a92e212f17b1103dd24
AMD Internal: [SWLCSG-2947]
Change-Id: I387a544f0d33d2231f5f6a92e212f17b1103dd24
1. Updated datatype from __int64_t to int64_t. Since
__int64_t was not defined for Windows
2. Updated CMake build system to build lpgemm on windows
Change-Id: I5fc5ed93ecc54e4a9931b7b40b790d37c7ead4b8
- Bug: Among the list of AVX512 SGEMMSUP RD kernels, the ones handling
m_fringe = 3 had incorrect usage of ZMM on a vector-load instruction
that strictly needed YMMs.
- Further updated the existing micro-kernel test cases to simulate
these issues and validate the fix.
AMD-Internal: [CPUPL-5353]
Change-Id: Id86e60ce36bb9f8433a1a203cfe0b8c6347df2c1
- Added the attribute to export symbols, in the header file that
contains the L1 kernel declarations. This attribute was previously
added as part of the kernel definitions.
AMD-Internal: [CPUPL-4415]
Change-Id: I375246f47d53c220f885644f9b75c7d7991ae710
- When n=1, reorder of B matrix is avoided to efficiently
process data. A dot-product based kernel is implemented to
perform gemv when n==1.
AMD-Internal: [SWLCSG-2354]
Change-Id: I6b73dfddd9a15e7b914d031646a1d913a7ab4761
- Delete unused cmake files.
- Add guards around call to bli_cpuid_is_avx2fma3_supported
in frame/3/bli_l3_sup.c, currently assumes that non-x86
platforms will not use bli_gemmtsup.
- Correct variable in frame/base/bli_arch.c on non-x86
builds.
- Add guards around omp pragma to avoid possible gcc
compiler warning in kernels/zen/2/bli_gemv_zen_int_4.c.
- Add missing registers in clobber list in
kernels/zen4/1/bli_dotv_zen_int_avx512.c.
- Add gtestsuite ERS_IIT tests for TRMV, copied from TRSV.
- Correct calls to cblas_{c,z}swap in gtestsuite.
- Correct test name in ddotxf gtestsuite program.
AMD-Internal: [CPUPL-4415]
Change-Id: I69ad56390017676cc609b4d3aba3244a2df6a6b5
Corrections for spelling and other mistakes in code comments
and doc files.
AMD-Internal: [CPUPL-4500]
Change-Id: I33e28932b0e26bbed850c55602dee12fd002da7f
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
zen, skx and a couple of other kernels to cover all
contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
statements.
AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
Description:
1. GCC avoiding loading b into registers in m fringe
kenrels of int8 kernels. Instead gcc generating
fma with memory as an operand for B input.
2. This is causing performance regression for larger n
where each fma needs to load the input from memory
again and again.
3. This is observed with gcc but not with clang.
4. Inserted dummy shuffle instructions for b data to
further explicitly tell compiler that b needs to be in
registers.
AMD-Internal: SWLCSG-2948
Change-Id: Ibbf186fe6569e6265e2c2bb4ec3141ef323ea3e6
- Remove execute file permission from source and make files.
- dos2unix conversion.
- Add missing eol at end of files.
Also update .gitignore to not exclude build directory but to
exclude any build_* created by cmake builds.
AMD-Internal: [CPUPL-4415]
Change-Id: I5403290d49fe212659a8015d5e94281fe41eb124
-Matrix MUL op support added in main as well as fringe bfloat16 element
wise operations kernels.
-Benchmarking/testing framework for the same is added.
-Fixed issues in setting up post-ops node index.
AMD Internal: [SWLCSG-2947, SWLCSG-2953]
Change-Id: Iba7561a6a60df41211efbf06fab1b4900207bcf8
This post-operation computes C = (beta*C + alpha*A*B) * D, where D
is a matrix with dimensions and data type the same as that of C matrix.
AMD-Internal: [SWLCSG-2953]
Change-Id: Id4df2ca76a8f696cb16edbd02c25f621f9a828fd
Description:
1. GCC avoiding loading b into registers in m fringe
kenrels of int8 kernels. Instead gcc generating
fma with memory as an operand for B input.
2. This is causing performance regression for larger n
where each fma needs to load the input from memory
again and again.
3. This is observed with gcc but not with clang.
4. Inserted dummy shuffle instructions for b data to
further explicitly tell compiler that b needs to be in
registers.
5. Moved packb_s4_to_bf16 under JIT macro to resovle
compilation issue with gcc version < 11.2
AMD-Internal: SWLCSG-2948
Change-Id: I5bd1bad7ad129e0dde91ed78d49a4ede3bff456a
-This API supports applying element wise operations (eg: post-ops) on a
bfloat16 input matrix to get an output matrix of the same(bfloat16) or
upscaled data type (float).
-Benchmarking/testing framework for the same is added.
AMD Internal: SWLCSG-2947
Change-Id: I43f1c269be1a1997d4912d8a3a97be5e5f3442d2
- Added reference kernel for dgemv that handles computation for tiny
sizes (m < 8 && n < 8).
- The reference kernel, bli_dgemv_zen_ref( ... ), supports both
row/column storage schemes as well as transpose and no transpose
cases.
- Added additional unit-tests for functional verification.
AMD-Internal: [CPUPL-5098]
Change-Id: I66fdf0a40e90bdb3fed40152c45ab28a17a87ada
- Added an additional decision logic to choose between SUP and
Native paths for zen4 and zen5 micro-architectures, based on
the input dimensions. This logic has been added to the
architecture-specific thresholds functions, that are registered
in the context.
- The decision logic will overrule the discrete thresholds present
in the zen4 and zen5 contexts.
AMD-Internal: [CPUPL-5547]
Change-Id: I475f19b110064b3b9eef2e03bbdc21f4dd826c03
- Introduced a new 24x8 column preferred DGEMM sup kernel for zen5.
- A prefetch logic is modified compared to zen4 24x8 sup kernels.
- Earlier, next panel of A is prefetched into L2 cache,
which is now modified to prefetching the second next column
of the current panel of A into L1 cache.
- B and C prefetches are enabled and unchanged.
- Tuned MC, KC and NC block sizes for new kernel.
AMD-Internal: [CPUPL-5262]
Change-Id: If933537e50f43f5560e0fe18a716aa1e36ced64d
- New Decision threshold constants are added to decide between
double precision sup vs native dgemm code-path for zen5 processors.
- The decision is based on the values of m, n and k.
AMD-Internal: [CPUPL-5262]
Change-Id: I87b8ff9eb603d6fda0875e000f7ab83b22d22040
- Updated the final reduction of partial sums to use scalar accumulation
entirely, instead of using the _mm512_reduce_add_pd( ... ) intrinsic.
This will in turn change the associativity and the rounding-off
pattern in the reduction step.
- Defined a union data-type to do the same, by having a 512-bit
register and a double-precision array as its members.
- Updated the declaration and usage of the register variable according
to the union definition, for uniformity.
AMD-Internal: [CPUPL-5472]
Change-Id: I997464a6ec47e4054dca48a000fbd4ac0cfcc679
- In the initial patch - for m, n non-multiple of MR and NR
respectively we are calling bli_dgemm_ker_var2. Now we have
implemented macro-kernel for these fringe cases as well.
- Replaced RBP register with R11 in the macro-kernel.
- Retuned MC, KC and NC with these new changes.
This will result in better performance for matrix sizes
like m=4000 or greater when running on single thread.
AMD-Internal: [CPUPL-5262]
Change-Id: I66c111ceb7feee776703339680d57e8d6d5c809a
- Removed some of the unrolling factors that affected the
performance of AVX2 DAXPYV kernel. In addition to improving
the current performance on sizes compatible to single-threaded
runs, this will now perform better for tiny sizes as well
since the overhead to reach the computation is less.
- Updated the vector partitioning logic, by using
bli_thread_range_sub( ... ), which ensures that there is no
false sharing among multiple threads.
- Updated the AOCL-DYNAMIC logic for the API, to include thresholds
or zen4 and zen5 micro-architectures.
AMD-Internal: [CPUPL-5514]
Change-Id: Iee9edddac685334213cd6694421ab3df3547e930
- Added the missing registers in end_asm for scopy,
dcopy and zcopy APIs.
- Removed unnecessary registers from end_asm for scopy
and dcopy APIs.
- Corrected mistakes in the comments.
Change-Id: I5ebe2ff9cb2c72ca7c71a67419281f73462f9498
- Fixed framework of bf16s4f32of32 API to correct
pointer updations.
- Modified pre_op structure to exclude pre-op-offset.
Now offset is passed as a separate parameter to the
scale-pack functions.
- Fixed work-distribution among threads in MT scenario.
- Added Blocksizes and kernel-pointers and verified
functionality for the new API.
AMD-Internal: [SWLCSG-2943]
Change-Id: I58fece240d62c798c880a2b2b7fa64e560cc753d
Description:
1. Added a new API aocl_gemm_bf16s4f32of32 to support
for WoQ (Weight-only-Quantization) in LLM's
2. The API supports only reordered B matrix of data
size signed 4 bits (S4).
3. Substracting zero point and multiplying with scale
on B matrix is performed in packing B.
4. zero point and scale data should be passed by user
through pre-ops data structure.
5. The API is still in experimental state and NOT tested.
AMD-Internal: SWLCSG-2943
Change-Id: I10b159b64c2e2aaf39da5462685618ba8cc800ee
Details:
- To enable Weight-only-Quantization(WOQ) workflow,
new LPGEMM APIs are added where datatypes are A: bf16,
B: int4, C: f32/bf16. To support this, B matrix will
be reordered with type still being int4. New pack
kernels that packs the reordered B matrix after
converting the data from int4 to bf16 and applying
zero-point and scale are added.
AMD-Internal: [SWLCSG-2943]
Change-Id: Iabe23dab607913c0114b97cb2b91248babeaac03
-To enable Weight-only-Quantization (WOQ) workflow, new LPGEMM APIs
are required where data types are A:bf16, B:int4 and C:f32/bf16. It
is expected that the BF16 kernels will be reused within this API and
subsequently the B matrix needs to be reordered following the BF16
kernel schema, but with the reordered matrix type still being int4. To
address this, new BF16 reorder kernels enabling the same are added.
AMD-Internal: [SWLCSG-2943]
Change-Id: Ib770ecbf90a3d906deafece94b1a96e0b9412738
- Replaced "vmovupd" with "vmovups" for "bli_scopyv_zen4_asm_avx512"
kernel.
- Optimization of loop unrolling for "bli_dcopyv_zen4_asm_avx512"
and "bli_scopyv_zen4_asm_avx512" kernels.
- Replaced existing load balancing algorithm for dcopy API with
"bli_thread_range_sub" algorithm.
- Included AOCL-dynamic values for optimial number of threads
for zen5 architecture.
AMD-Internal: [CPUPL-5238]
Change-Id: Ic82bdfad9478c8f75dc5a3dcfed0df85fbcae957
- Enabled AVX512 DAXPYF kernels for DGEMV var2 for NO_TRANSPOSE cases.
- Added DAXPYF kernels with fuse factors of 2, 4, 6 and 16.
- Added a wrapper for DAXPYF kernels for redirection to kernels with a
smaller fuse factor than 32.
- Also added UKR tests for the new fused kernels.
AMD-Internal: [CPUPL-5098]
Change-Id: I0b102b67c6c068873393bac0494284f379c253f2
-_mm512_cvtpbh_ps intrinsic is not supported in older versions of gcc
(<gcc 12.2) and subsequently throws a compilation error. This is fixed
by replacing this intrinsic with a macro that achieves the bf16 to f32
conversion via shift operations.
-Bug fixes in the vector scale factor load in fringe kernels.
AMD-Internal: [SWLCSG-2945]
Change-Id: I8eac4c4b34b043e7a8116dc465723d8f85b28018
- In order to reuse 24x8 AVX512 DGEMM SUP kernels,
24x8 triangular AVX512 DGEMMT SUP kernels are added.
- Since the LCM of MR(24) and NR(8) is 24, therefore the diagonal
pattern repeats every 24x24 block of C. To cover this 24x24 block,
3 kernels are needed for one variant of DGEMMT. A total of 6
kernels are needed to cover both upper and lower variants.
- In order to maximize code reuse, the 24x8 kernels are broken
into two parts, 8x8 diagonal GEMM and 16x8 full GEMM. The 8x8
diagonal GEMM is computed by 8x8 diagonal kernel, and 16x8
full GEMM part is computed by 24x8 DGEMM SUP kernel.
- Changes are made in framework to enable the use of these kernels.
AMD-Internal: [CPUPL-5338]
Change-Id: I8e7007031e906f786b0c4fe12377ee439075207a
- Implemented AVX512 computational kernel for DAXPBYV
with optimal unrolling. Further implemented the other
missing kernels that would be required to decompose
the computation in special cases, namely the AVX512
DADDV and DSCAL2V kernels.
- Updated the zen4 and zen5 contexts to ensure any query
to acquire the kernel pointer for DAXPBYV returns the
address of the new kernel.
- Added micro-kernel units tests to GTestsuite to check
for functionality and out-of-bounds reads and writes.
AMD-Internal: [CPUPL-5406][CPUPL-5421]
Change-Id: I127ab21174ddd9e6de2c30a320e62a8b042cbde6