- Data-type of n, and conj is dim_t which will be int32_t for LP64 case.
- When loading 64-bit registers using "mov" instructions, mov(rax, var(n)),
the "n" should be 64-bit otherwise incorrect values gets loaded.
Fix: We typecast these variables to int64_t before loading into registers.
Thanks to mangala.v@amd.com for finding this bug.
Change-Id: I8542dc1ea434ca9030f3c56d9a681135055f8ba5
- Data-type of m, n, k,ldc is dim_t which will be int32_t for LP64 case.
- When loading 64-bit registers using "mov" instructions, mov(rax, var(m)),
the "m" should be 64-bit otherwise incorrect values gets loaded.
Fix: We typecast these variables to int64_t before loading into registers.
AMD-Internal: [CPUPL-5819]
Change-Id: I16043ac168a79ff9358c0c1768989a81e3c6b0e0
-Added new pack kernels that packs/reorders B matrix from column-major
input format. This also supports the transB scenario if input B matrix
is row major.
Change-Id: I4c75b6e81016331fd7e7f95ad4212e6d38dc586f
- Implemented the AVX512 packA kernel for col major inputs in F32 API
- Removed the work arounds for n = 1, mtag_a = PACK case, where the execution was
being directed to GEMM instead of GEMV.
Change-Id: I6fb700d96069213a762e8a83a209c5388a91050f
SCALV is used directly by BLAS, CBLAS and BLIS scal{v} APIs but
also within many other APIs to handle special cases. In general
it is preferred to use SETV when alpha=0, but BLAS and CBLAS
continue to multiple all vector element by alpha. This has
different behaviour for propagating NaNs or Infs.
Changes in this commit:
- Standardize early returns from SCALV reference and optimized
kernels.
- User supplied N<0 is handled at the top level API layer. Use
negative values of N in kernel calls to signify that SETV
should _not_ be used when alpha=0. This should only be
required in SCALV.
- Include serial threshold in zdscal (as in dscal) to reduce
overhead for small problem sizes.
- Code tidying to make different variants more consistent.
- More standardization of tests in SCALV gtestsuite programs.
- Remove scalv_extreme_cases.cpp as it is now redundant.
AMD-Internal: [CPUPL-4415]
Change-Id: I42e98875ceaea224cc98d0cdfe0133c9abc3edae
- Added explicit typecast to the pointers that are passed
to the _mm_prefetch( ... ) intrinsic, to avoid compiler
warnings.
AMD-Internal: [CPUPL-4415]
Change-Id: I1c1398b7b5abe81848d33cb6df107f7f077588ea
Description:
1. Written 6x64 main and other fringe kernels for WoQ where scaling s4
weights into bf16 performed in the kernel itself to reduce bandwidth.
2. These kernels are performing better compared to bf16 weights when m
is small and n is large.
3. Established a threshold to do quantization support at packing of
B (KCXNC) level or WoQ kernel level.
Change-Id: I4f8265b8b58c276ff2590cc948d1f920aa0bb289
- Added support for TransA and transB in f32f32of32 APIs
- Modified the GEMV case(m == 1) to support PACKB feature
- Redirecting the operations to GEMM instead of GEMV in case of n == 1
conditions, with storage scheme r/transA and c/transB to avoid the
packing errors which would lead to failures in computation.
Change-Id: I0eb8c31485af4e33c53fd36b5e5788d75d3a67a9
Details:
- In WOQ, if m = 4, special case kernels are added where
s4->bf16 conversion happens inside the compute kernel and
packing is avoided. For all other cases, B matrix is
dequantized and packed at KC loop level and native bf16
kernels are re-used at compute level.
- Fixes in bench to avoid accuracy failures when datatype of
output is bf16.
Change-Id: Ie8db42da536891693d5e82a5336b66514a50ccb2
This API supports applying element wise operations (eg: post-ops) on a
float(f32) input matrix to get an output matrix of the same (float(f32)).
Change-Id: I387a544f0d33d2231f5f6a92e212f17b1103dd24
AMD Internal: [SWLCSG-2947]
Change-Id: I387a544f0d33d2231f5f6a92e212f17b1103dd24
1. Updated datatype from __int64_t to int64_t. Since
__int64_t was not defined for Windows
2. Updated CMake build system to build lpgemm on windows
Change-Id: I5fc5ed93ecc54e4a9931b7b40b790d37c7ead4b8
- Bug: Among the list of AVX512 SGEMMSUP RD kernels, the ones handling
m_fringe = 3 had incorrect usage of ZMM on a vector-load instruction
that strictly needed YMMs.
- Further updated the existing micro-kernel test cases to simulate
these issues and validate the fix.
AMD-Internal: [CPUPL-5353]
Change-Id: Id86e60ce36bb9f8433a1a203cfe0b8c6347df2c1
- Added the attribute to export symbols, in the header file that
contains the L1 kernel declarations. This attribute was previously
added as part of the kernel definitions.
AMD-Internal: [CPUPL-4415]
Change-Id: I375246f47d53c220f885644f9b75c7d7991ae710
- When n=1, reorder of B matrix is avoided to efficiently
process data. A dot-product based kernel is implemented to
perform gemv when n==1.
AMD-Internal: [SWLCSG-2354]
Change-Id: I6b73dfddd9a15e7b914d031646a1d913a7ab4761
- Delete unused cmake files.
- Add guards around call to bli_cpuid_is_avx2fma3_supported
in frame/3/bli_l3_sup.c, currently assumes that non-x86
platforms will not use bli_gemmtsup.
- Correct variable in frame/base/bli_arch.c on non-x86
builds.
- Add guards around omp pragma to avoid possible gcc
compiler warning in kernels/zen/2/bli_gemv_zen_int_4.c.
- Add missing registers in clobber list in
kernels/zen4/1/bli_dotv_zen_int_avx512.c.
- Add gtestsuite ERS_IIT tests for TRMV, copied from TRSV.
- Correct calls to cblas_{c,z}swap in gtestsuite.
- Correct test name in ddotxf gtestsuite program.
AMD-Internal: [CPUPL-4415]
Change-Id: I69ad56390017676cc609b4d3aba3244a2df6a6b5
Corrections for spelling and other mistakes in code comments
and doc files.
AMD-Internal: [CPUPL-4500]
Change-Id: I33e28932b0e26bbed850c55602dee12fd002da7f
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
zen, skx and a couple of other kernels to cover all
contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
statements.
AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
Description:
1. GCC avoiding loading b into registers in m fringe
kenrels of int8 kernels. Instead gcc generating
fma with memory as an operand for B input.
2. This is causing performance regression for larger n
where each fma needs to load the input from memory
again and again.
3. This is observed with gcc but not with clang.
4. Inserted dummy shuffle instructions for b data to
further explicitly tell compiler that b needs to be in
registers.
AMD-Internal: SWLCSG-2948
Change-Id: Ibbf186fe6569e6265e2c2bb4ec3141ef323ea3e6
- Remove execute file permission from source and make files.
- dos2unix conversion.
- Add missing eol at end of files.
Also update .gitignore to not exclude build directory but to
exclude any build_* created by cmake builds.
AMD-Internal: [CPUPL-4415]
Change-Id: I5403290d49fe212659a8015d5e94281fe41eb124
-Matrix MUL op support added in main as well as fringe bfloat16 element
wise operations kernels.
-Benchmarking/testing framework for the same is added.
-Fixed issues in setting up post-ops node index.
AMD Internal: [SWLCSG-2947, SWLCSG-2953]
Change-Id: Iba7561a6a60df41211efbf06fab1b4900207bcf8
This post-operation computes C = (beta*C + alpha*A*B) * D, where D
is a matrix with dimensions and data type the same as that of C matrix.
AMD-Internal: [SWLCSG-2953]
Change-Id: Id4df2ca76a8f696cb16edbd02c25f621f9a828fd
Description:
1. GCC avoiding loading b into registers in m fringe
kenrels of int8 kernels. Instead gcc generating
fma with memory as an operand for B input.
2. This is causing performance regression for larger n
where each fma needs to load the input from memory
again and again.
3. This is observed with gcc but not with clang.
4. Inserted dummy shuffle instructions for b data to
further explicitly tell compiler that b needs to be in
registers.
5. Moved packb_s4_to_bf16 under JIT macro to resovle
compilation issue with gcc version < 11.2
AMD-Internal: SWLCSG-2948
Change-Id: I5bd1bad7ad129e0dde91ed78d49a4ede3bff456a
-This API supports applying element wise operations (eg: post-ops) on a
bfloat16 input matrix to get an output matrix of the same(bfloat16) or
upscaled data type (float).
-Benchmarking/testing framework for the same is added.
AMD Internal: SWLCSG-2947
Change-Id: I43f1c269be1a1997d4912d8a3a97be5e5f3442d2
- Added reference kernel for dgemv that handles computation for tiny
sizes (m < 8 && n < 8).
- The reference kernel, bli_dgemv_zen_ref( ... ), supports both
row/column storage schemes as well as transpose and no transpose
cases.
- Added additional unit-tests for functional verification.
AMD-Internal: [CPUPL-5098]
Change-Id: I66fdf0a40e90bdb3fed40152c45ab28a17a87ada
- Added an additional decision logic to choose between SUP and
Native paths for zen4 and zen5 micro-architectures, based on
the input dimensions. This logic has been added to the
architecture-specific thresholds functions, that are registered
in the context.
- The decision logic will overrule the discrete thresholds present
in the zen4 and zen5 contexts.
AMD-Internal: [CPUPL-5547]
Change-Id: I475f19b110064b3b9eef2e03bbdc21f4dd826c03
- Introduced a new 24x8 column preferred DGEMM sup kernel for zen5.
- A prefetch logic is modified compared to zen4 24x8 sup kernels.
- Earlier, next panel of A is prefetched into L2 cache,
which is now modified to prefetching the second next column
of the current panel of A into L1 cache.
- B and C prefetches are enabled and unchanged.
- Tuned MC, KC and NC block sizes for new kernel.
AMD-Internal: [CPUPL-5262]
Change-Id: If933537e50f43f5560e0fe18a716aa1e36ced64d
- New Decision threshold constants are added to decide between
double precision sup vs native dgemm code-path for zen5 processors.
- The decision is based on the values of m, n and k.
AMD-Internal: [CPUPL-5262]
Change-Id: I87b8ff9eb603d6fda0875e000f7ab83b22d22040
- Updated the final reduction of partial sums to use scalar accumulation
entirely, instead of using the _mm512_reduce_add_pd( ... ) intrinsic.
This will in turn change the associativity and the rounding-off
pattern in the reduction step.
- Defined a union data-type to do the same, by having a 512-bit
register and a double-precision array as its members.
- Updated the declaration and usage of the register variable according
to the union definition, for uniformity.
AMD-Internal: [CPUPL-5472]
Change-Id: I997464a6ec47e4054dca48a000fbd4ac0cfcc679
- In the initial patch - for m, n non-multiple of MR and NR
respectively we are calling bli_dgemm_ker_var2. Now we have
implemented macro-kernel for these fringe cases as well.
- Replaced RBP register with R11 in the macro-kernel.
- Retuned MC, KC and NC with these new changes.
This will result in better performance for matrix sizes
like m=4000 or greater when running on single thread.
AMD-Internal: [CPUPL-5262]
Change-Id: I66c111ceb7feee776703339680d57e8d6d5c809a
- Removed some of the unrolling factors that affected the
performance of AVX2 DAXPYV kernel. In addition to improving
the current performance on sizes compatible to single-threaded
runs, this will now perform better for tiny sizes as well
since the overhead to reach the computation is less.
- Updated the vector partitioning logic, by using
bli_thread_range_sub( ... ), which ensures that there is no
false sharing among multiple threads.
- Updated the AOCL-DYNAMIC logic for the API, to include thresholds
or zen4 and zen5 micro-architectures.
AMD-Internal: [CPUPL-5514]
Change-Id: Iee9edddac685334213cd6694421ab3df3547e930
- Added the missing registers in end_asm for scopy,
dcopy and zcopy APIs.
- Removed unnecessary registers from end_asm for scopy
and dcopy APIs.
- Corrected mistakes in the comments.
Change-Id: I5ebe2ff9cb2c72ca7c71a67419281f73462f9498
- Fixed framework of bf16s4f32of32 API to correct
pointer updations.
- Modified pre_op structure to exclude pre-op-offset.
Now offset is passed as a separate parameter to the
scale-pack functions.
- Fixed work-distribution among threads in MT scenario.
- Added Blocksizes and kernel-pointers and verified
functionality for the new API.
AMD-Internal: [SWLCSG-2943]
Change-Id: I58fece240d62c798c880a2b2b7fa64e560cc753d
Description:
1. Added a new API aocl_gemm_bf16s4f32of32 to support
for WoQ (Weight-only-Quantization) in LLM's
2. The API supports only reordered B matrix of data
size signed 4 bits (S4).
3. Substracting zero point and multiplying with scale
on B matrix is performed in packing B.
4. zero point and scale data should be passed by user
through pre-ops data structure.
5. The API is still in experimental state and NOT tested.
AMD-Internal: SWLCSG-2943
Change-Id: I10b159b64c2e2aaf39da5462685618ba8cc800ee
Details:
- To enable Weight-only-Quantization(WOQ) workflow,
new LPGEMM APIs are added where datatypes are A: bf16,
B: int4, C: f32/bf16. To support this, B matrix will
be reordered with type still being int4. New pack
kernels that packs the reordered B matrix after
converting the data from int4 to bf16 and applying
zero-point and scale are added.
AMD-Internal: [SWLCSG-2943]
Change-Id: Iabe23dab607913c0114b97cb2b91248babeaac03
-To enable Weight-only-Quantization (WOQ) workflow, new LPGEMM APIs
are required where data types are A:bf16, B:int4 and C:f32/bf16. It
is expected that the BF16 kernels will be reused within this API and
subsequently the B matrix needs to be reordered following the BF16
kernel schema, but with the reordered matrix type still being int4. To
address this, new BF16 reorder kernels enabling the same are added.
AMD-Internal: [SWLCSG-2943]
Change-Id: Ib770ecbf90a3d906deafece94b1a96e0b9412738
- Replaced "vmovupd" with "vmovups" for "bli_scopyv_zen4_asm_avx512"
kernel.
- Optimization of loop unrolling for "bli_dcopyv_zen4_asm_avx512"
and "bli_scopyv_zen4_asm_avx512" kernels.
- Replaced existing load balancing algorithm for dcopy API with
"bli_thread_range_sub" algorithm.
- Included AOCL-dynamic values for optimial number of threads
for zen5 architecture.
AMD-Internal: [CPUPL-5238]
Change-Id: Ic82bdfad9478c8f75dc5a3dcfed0df85fbcae957
- Enabled AVX512 DAXPYF kernels for DGEMV var2 for NO_TRANSPOSE cases.
- Added DAXPYF kernels with fuse factors of 2, 4, 6 and 16.
- Added a wrapper for DAXPYF kernels for redirection to kernels with a
smaller fuse factor than 32.
- Also added UKR tests for the new fused kernels.
AMD-Internal: [CPUPL-5098]
Change-Id: I0b102b67c6c068873393bac0494284f379c253f2
-_mm512_cvtpbh_ps intrinsic is not supported in older versions of gcc
(<gcc 12.2) and subsequently throws a compilation error. This is fixed
by replacing this intrinsic with a macro that achieves the bf16 to f32
conversion via shift operations.
-Bug fixes in the vector scale factor load in fringe kernels.
AMD-Internal: [SWLCSG-2945]
Change-Id: I8eac4c4b34b043e7a8116dc465723d8f85b28018
- In order to reuse 24x8 AVX512 DGEMM SUP kernels,
24x8 triangular AVX512 DGEMMT SUP kernels are added.
- Since the LCM of MR(24) and NR(8) is 24, therefore the diagonal
pattern repeats every 24x24 block of C. To cover this 24x24 block,
3 kernels are needed for one variant of DGEMMT. A total of 6
kernels are needed to cover both upper and lower variants.
- In order to maximize code reuse, the 24x8 kernels are broken
into two parts, 8x8 diagonal GEMM and 16x8 full GEMM. The 8x8
diagonal GEMM is computed by 8x8 diagonal kernel, and 16x8
full GEMM part is computed by 24x8 DGEMM SUP kernel.
- Changes are made in framework to enable the use of these kernels.
AMD-Internal: [CPUPL-5338]
Change-Id: I8e7007031e906f786b0c4fe12377ee439075207a
- Implemented AVX512 computational kernel for DAXPBYV
with optimal unrolling. Further implemented the other
missing kernels that would be required to decompose
the computation in special cases, namely the AVX512
DADDV and DSCAL2V kernels.
- Updated the zen4 and zen5 contexts to ensure any query
to acquire the kernel pointer for DAXPBYV returns the
address of the new kernel.
- Added micro-kernel units tests to GTestsuite to check
for functionality and out-of-bounds reads and writes.
AMD-Internal: [CPUPL-5406][CPUPL-5421]
Change-Id: I127ab21174ddd9e6de2c30a320e62a8b042cbde6
- This change in made in CMAKE build system only.
- Removed -fno-tree-loop-vectorize from global kernel flags,
instead added it to lpgemm specific kernels only.
- If this flag is not used , then gcc tries to auto
vectorize the code which results in usages of
vector registers, if the auto vectorized function
is using intrinsics then the total numbers of vector
registers used by intrinsic and auto vectorized
code becomes more than the registers
available in machine which causes read and writes
to stack, which is causing regression in lpgemm.
- If this flag is enabled globally, then the files which
do not use any intrinsic code do not get auto
vectorized.
- To get optimal performance for both blis and lpgemm,
this flag is enabled for lpgemm kernels only.
Change-Id: I14e5c18cd53b058bfc9d764a8eaf825b4d0a81c4
- Implemented a new front-end for the BLAS/CBLAS calls
to ?AXPBYV(BLAS-extension API), that is intended to
be compiled only on Zen micro-architectures(as per the
existing build system).
- This new front-end makes the framework lightweight for
BLAS/CBLAS calls to ?AXPBYV, by directly querying the
architecture ID and deploying the associated computational
kernel.
- Further updated the rerouting to other L1 kernels based
on alpha and beta value. This was initially present in
the Typed-API interface. It has been moved inside the
respective kernels, and only necessary rerouting is done
to specific L1 kernels to avoid redundant checks.
AMD-Internal: [CPUPL-5406]
Change-Id: I4af943d477a25dcdab4ee6009ad3dfa6a5c2b37e
-Quantization of f32 to bf16 (bf16 = (f32 * scale_factor) + zero_point)
instead of just type conversion in aocl_gemm_bf16bf16f32obf16.
-Support for multiple scale/sum/matrix_add/bias post-ops in a single
LPGEMM api call.
-Post-ops mask related fixes in lpgemv kernels .
-Additional scale post-ops sanity checks.
AMD-Internal: [SWLCSG-2945]
Change-Id: I3b35cc413c176bb50bfdbd6acd4839a5ba7e94bb
- Implemented two new axpyf kernels for fused factors 8 and 12
by manually unrolling the loops. Used to achieve better performance
in var2 case.
AMD-Internal: [CPUPL-5184]
Change-Id: I40d2930d003c6ce90323b5c8a52564563d1f23f5
- Added CSCALV kernel utilizing the AVX512 ISA.
- Added function pointers for the same to zen4 and zen5 contexts.
- Updated the BLAS interface to invoke respective CSCALV kernels based
on the architecture.
- Added UKR tests for bli_cscalv_zen_int_avx512( ... ).
AMD-Internal: [CPUPL-5299]
Change-Id: I189d87a1ec1a6e30c16e05582dcb57a8510a27f3
- Introduced new 8x24 macro kernels.
- 4 new kernels are added for beta 0, beta 1, beta -1
and beta N.
- IR and JR loop moved to ASM region.
- Kernels support row major storage scheme.
- Prefetch of current micro panel of C is enabled.
- Kernel supports negative offsets for A and B matrices.
- Moved alpha scaling from DGEMM kernel to B pack kernel.
- Tuned blocksizes for new kernel.
- Added support for alpha scaling in 24xk pack kernel.
- Reverted back to old b_next computation
in gemm_ker_var2.
- BugFix in 8x24 DGEMM kernel for beta 1,
comparsion for jmp conditions was done using integer
instructions, which caused beta 1 path to never be taken.
Fixed this by changing the comparsion to double.
AMD-Internal: [CPUPL-5262]
Change-Id: Ieec207eea2a164603c8a8ea88e0b1d3095c29a3f
- Implemented bli_zgemm_16x4_avx512_k1_nn( ... ) AVX512 kernel to
be used as part of BLAS/CBLAS calls to ZGEMM. The kernel is built
for handling the GEMM computation with inputs having k = 1,
with the transpose values being N(for column-major) and T(for
row-major).
- Updated the zgemm_blis_impl( ... ) layer to query the architecture
ID and invoke the AVX2 or AVX512 kernel accordingly.
- Added API level tests for accuracy and code-coverage, as well as
micro-kernel tests for verifying functionality and out-of-bounds
memory accesses.
AMD-Internal: [CPUPL-5249]
Change-Id: Id1f8bebff3e0da83c7febe86299564fd658b2e84