SCALV is used directly by BLAS, CBLAS and BLIS scal{v} APIs but
also within many other APIs to handle special cases. In general
it is preferred to use SETV when alpha=0, but BLAS and CBLAS
continue to multiple all vector element by alpha. This has
different behaviour for propagating NaNs or Infs.
Changes in this commit:
- Standardize early returns from SCALV reference and optimized
kernels.
- User supplied N<0 is handled at the top level API layer. Use
negative values of N in kernel calls to signify that SETV
should _not_ be used when alpha=0. This should only be
required in SCALV.
- Include serial threshold in zdscal (as in dscal) to reduce
overhead for small problem sizes.
- Code tidying to make different variants more consistent.
- More standardization of tests in SCALV gtestsuite programs.
- Remove scalv_extreme_cases.cpp as it is now redundant.
AMD-Internal: [CPUPL-4415]
Change-Id: I42e98875ceaea224cc98d0cdfe0133c9abc3edae
(cherry picked from commit a07e041b1f)
- Added the appropriate CBLAS wrappers for CROTG, CSROT,
ZROTG and ZDROT APIs. These would internally call their
?_blis_impl() layer.
AMD-Internal: [CPUPL-5813]
Change-Id: I6037f20092f99cc5a5e2794d03bbe76d6a55eb97
- The existing row-preferred reference kernels for GEMM SUP path were
not taking into consideration the packing state of matrices A or B.
Thus, whenever either or both A and B matrices were packed the
kernel was unable to iterate appropriately through the matrices
thereby calculating incorrect values resulting in failures.
- Though, for generic configuration, the SUP path is disabled by default
the set of Pack and Compute Extension APIs use these kernels thus,
this issue resulted in their failures as well.
- With this patch, the loops being used in these kernels have been fixed
to iterate over steps of MR and NR while also accounting for the
fringe cases. Within the updated loops, temporary pointers used to
point to the correct block/panel of the matrices are incremented with
panel strides of respective matrices.
AMD-Internal: [CPUPL-5674]
Change-Id: Ic3939877c79ebb9ccf9e53b1d1672cea4b8c5959
Updated logic to use "%ld" and "%lld" format specifiers to read
64-bit integer from input files using fscanf function on Linux and
Windows respectively when the user set INT_SIZE='auto' on 64-bit
machine or INT_SIZE='64'. Otherwise "%d" on both windows and Linux
for benchmarking blis and LPGEMM.
Change-Id: I4762c4c1b3fcd09cf66d0cc9572d38766be6be60
(cherry picked from commit e4eed817aa)
- Added explicit typecast to the pointers that are passed
to the _mm_prefetch( ... ) intrinsic, to avoid compiler
warnings.
AMD-Internal: [CPUPL-4415]
Change-Id: I1c1398b7b5abe81848d33cb6df107f7f077588ea
Updated format specifier to read signed double("%lld") and unsigned
double("%llu") from file using fscanf from both windows and Linux.
AMD-Internal: [CPUPL-5787]
Change-Id: Ibef50b0df708f474e22f703240e264eff1de3994
(cherry picked from commit 91d4337b8b)
- added the missing stride updates in B reorder case in GEMV
- added the missing stride updates for the cast of transA with B
reordered case.
Change-Id: Ic89781dfa7c0d9380ea523796958f795828a1ade
For the bf16bf16of32bf16 lpgemm api, inside the micro-kernels in order
to convert the accumulated float values to bfloat16 before storing,
the _mm512_cvtneps_pbh intrinsic (vcvtneps2bf16) is used. This
intrinsic rounds the value based on a rounding bias logic. Replicating
the same rounding logic inside the bf16 bench accuracy check function
to get proper one to one comparison of output values.
AMD Internal: [SWLCSG-2948]
Change-Id: I135ac39ac8484769b6c0fe5b3e351dd22d7ca1d8
Description:
1. Written 6x64 main and other fringe kernels for WoQ where scaling s4
weights into bf16 performed in the kernel itself to reduce bandwidth.
2. These kernels are performing better compared to bf16 weights when m
is small and n is large.
3. Established a threshold to do quantization support at packing of
B (KCXNC) level or WoQ kernel level.
Change-Id: I4f8265b8b58c276ff2590cc948d1f920aa0bb289
(cherry picked from commit 5120f98e12)
- Added support for TransA and transB in f32f32of32 APIs
- Modified the GEMV case(m == 1) to support PACKB feature
- Redirecting the operations to GEMM instead of GEMV in case of n == 1
conditions, with storage scheme r/transA and c/transB to avoid the
packing errors which would lead to failures in computation.
Change-Id: I0eb8c31485af4e33c53fd36b5e5788d75d3a67a9
Description:
Due to the latest VNNI instructions are supported only from Clang
Version 18 and above, updated clang version check from 17 to 18.
AMD-Internal: [CPUPL-5744]
Change-Id: I4a3ecec65bd88d9dccfe1018fb25cb7be29946f0
(cherry picked from commit 5ada963b4c)
-It has been observed that reduction of threads as part of smart
threading for smaller input dimensions hampers the performance of the
other inputs with larger dimensions due to lower operating frequency of
the newly launched threads (apart from the existing ones). Disabling
smart threading for these bandwidth bound input patterns (small m and n)
fixes this issue.
-Bug fixes related to work split in LPGEMV for n < NR and m < MR cases.
AMD Internal: [SWLCSG-2948]
Change-Id: I0117dc0ea6820a9fac8e14f93374b54a7d80c121
Details:
- In WOQ, if m = 4, special case kernels are added where
s4->bf16 conversion happens inside the compute kernel and
packing is avoided. For all other cases, B matrix is
dequantized and packed at KC loop level and native bf16
kernels are re-used at compute level.
- Fixes in bench to avoid accuracy failures when datatype of
output is bf16.
Change-Id: Ie8db42da536891693d5e82a5336b66514a50ccb2
(cherry picked from commit 2e1cc2f14a)
The _blis_impl layer provide a BLAS-like API for use in builds
where BLAS and CBLAS interfaces are not desirable. This patch
generates interfaces in uppercase and with and without trailing
underscores, to match what is generated for the regular BLAS
interface.
AMD-Internal: [CPUPL-5650]
Change-Id: I3ba9d0992291b0977479ab479acb71e42277c7c2
(cherry picked from commit 711dce14d0)
This reverts commit 7d379c7879.
Reason for revert: < Perf regression is observed for GEMM(gemm_small_At)
as fma uses memory operand >
Change-Id: I0ec3a22acaacfaade860c67858be6a2ba6296bce
(cherry picked from commit 705755bb5c)
Different Zen processors may have a 512-bit, 256-bit or 128-bit
FP/SIMD execution datapath width (FP512, FP256, FP128). Zen5 allows
a selection of FP512 or FP256 width in BIOS settings. Add cpuid
code to detect the width and store an indication of it in the
global variable bli_fp_datapath. This should be accessed internally
via the function bli_cpuid_query_fp_datapath(). This functionality
is currently only enabled on x86_64 platforms and only currently
reports a value for AMD CPUs.
Also add Zen3 as a fallback path for any unknown AMD processors if
AVX512 is not supported or has been disabled.
AMD-Internal: [CPUPL-4415]
Change-Id: Idf3fb5a697b43bc035ce110e86f60706dcc67f2a
(cherry picked from commit 1f18eeb267)
For some applications, one of the input dimension is mostly m < MR or
n < NR with the other dimension being small for the most part, with
intermittent large ones. Currently in these cases (m < MR or n < NR),
the number of threads used is reduced (as part of smart threading) if
the other dimension (n or m) is also small. For larger dimensions all
the threads are used.
However its been observed that this reduction of threads hampers the
performance of the larger inputs due to lower operating frequency of
the newly launched threads (apart from the existing ones). Disabling
smart threading for these bandwidth bound input patterns (m < MR or
n < NR) fixes this issue.
AMD Internal: [SWLCSG-2948]
Change-Id: I5334860cf4411ea4504d2e6bc598b9904780bbbf
- Revert of patch 1110983 - Duplicate check removal and early return for
s8s8s32/u8s8s32
- Add fix - Added check to see if post-ops is enabled with col-major
storage and return early in that case.
Change-Id: Id3b8c97b6d1425dfb06f3b196e5acd60caee8fca
- removed the duplicate check for col-major inputs in s8s8s32/u8s8s32
APIs
- Fixed the print in bench_lpgemm
Change-Id: If40837b89927dd82d8aa6f620d1a7f2c24aed53c
- Reverted the change done for tuning ddotv API. When number of threads
is mentioned using BLIS_IC_NT or BLIS_JC_NT, ... number of threads
are not calculated and as a result number of threads value is -1.
OpenMP threads are launched with -1 value. This results in crash.
This bug is fixed by correctly calculating number of threads.
AMD-Internal: [SWLCSG-3028][CPUPL-5689]
Change-Id: Ib9284dca02bdb115752926109beb28dc342e300a
This API supports applying element wise operations (eg: post-ops) on a
float(f32) input matrix to get an output matrix of the same (float(f32)).
Change-Id: I387a544f0d33d2231f5f6a92e212f17b1103dd24
AMD Internal: [SWLCSG-2947]
Change-Id: I387a544f0d33d2231f5f6a92e212f17b1103dd24
1. Updated datatype from __int64_t to int64_t. Since
__int64_t was not defined for Windows
2. Updated CMake build system to build lpgemm on windows
Change-Id: I5fc5ed93ecc54e4a9931b7b40b790d37c7ead4b8
(cherry picked from commit 2ff0125f11)
- Bug : For non-zen architectures, {D/C/Z}AXPBY had
incorrect datatypes passed when querying the computational
kernel from context. The right datatype is now passed to
each variant.
- Bug : For ZAXPY, a NULL context was passed to the kernel
when using the single-threaded path. In case of further
using the context inside the kernel, this would be an issue.
We now pass the context instead of a null pointer.
AMD-Internal: [CPUPL-5643]
Change-Id: I01bb78bda6be61c43543b16fda0ac02a988a07bf
The disable sba pools functionality currently gives incorrect results
at runtime when multiple threads are used. Fixes and improvements are
present in the upstream version of BLIS, so until these are downstreamed
only allow builds where sba pools are enabled.
AMD-Internal: [CPUPL-5512]
Change-Id: I9ccd654477fb714a2fb5f38a138b7e9b5e55e33d
Add guards around bli_trsm_small kernel tests to only call them
if BLIS_ENABLE_SMALL_MATRIX_TRSM is defined. This fixes missing
symbol errors in tests of non-zen builds, e.g. generic or skx.
AMD-Internal: [CPUPL-4500]
Change-Id: I7a822a41b5f686b5e38b0c63dd1871963e990407
- Use AVX2 kernels for tiny sizes on genoa.
- Removed the runtime init overhead for small sizes.
AMD-Internal: [CPUPL-5407]
Change-Id: I0db7d93abc659012916ef706f22528c7fabb4e30
- Optimized macro kernel (bli_dgemm_avx512_asm_8x24_macro_kernel)
for zen5 do not support alpha scaling. Alpha scaling is
supported by zen5 micro kernel (bli_dgemm_avx512_asm_8x24).
- Optimized macro kernel expects alpha scaling to be done during
packing. The packing kernel used for mixed precision do not support
alpha scaling. Therefore, the optimized Zen5 macro kernel is not
compatible with existing packing logic.
- Changes have been made to use the generic macro kernel which in turn
used zen5 micro kernel for mixed precision which supports alpha scaling.
AMD-Internal: [CPUPL-5058]
Change-Id: I1bfeb32ae07eedafadad7dd2c62d63913a46e446
- Bug: Among the list of AVX512 SGEMMSUP RD kernels, the ones handling
m_fringe = 3 had incorrect usage of ZMM on a vector-load instruction
that strictly needed YMMs.
- Further updated the existing micro-kernel test cases to simulate
these issues and validate the fix.
AMD-Internal: [CPUPL-5353]
Change-Id: Id86e60ce36bb9f8433a1a203cfe0b8c6347df2c1
- The IIT_ERS test for GEMM_COMPUTE where alpha = 0 and beta = 0 was
failing since neither of the matrices was being packed and thus,
missing the scaling by alpha resulting in a non-zero output for C
matrix (C := A * B).
- Enabled packing of A matrix for the ZeroAlpha_ZeroBeta IIT_ERS test
which handles the alpha scaling.
AMD-Internal: [CPUPL-5598]
Change-Id: Id9179ec6150d1bc5a0274edce727ce6cc4172213
- Added the attribute to export symbols, in the header file that
contains the L1 kernel declarations. This attribute was previously
added as part of the kernel definitions.
AMD-Internal: [CPUPL-4415]
Change-Id: I375246f47d53c220f885644f9b75c7d7991ae710
- When n=1, reorder of B matrix is avoided to efficiently
process data. A dot-product based kernel is implemented to
perform gemv when n==1.
AMD-Internal: [SWLCSG-2354]
Change-Id: I6b73dfddd9a15e7b914d031646a1d913a7ab4761
- Delete unused cmake files.
- Add guards around call to bli_cpuid_is_avx2fma3_supported
in frame/3/bli_l3_sup.c, currently assumes that non-x86
platforms will not use bli_gemmtsup.
- Correct variable in frame/base/bli_arch.c on non-x86
builds.
- Add guards around omp pragma to avoid possible gcc
compiler warning in kernels/zen/2/bli_gemv_zen_int_4.c.
- Add missing registers in clobber list in
kernels/zen4/1/bli_dotv_zen_int_avx512.c.
- Add gtestsuite ERS_IIT tests for TRMV, copied from TRSV.
- Correct calls to cblas_{c,z}swap in gtestsuite.
- Correct test name in ddotxf gtestsuite program.
AMD-Internal: [CPUPL-4415]
Change-Id: I69ad56390017676cc609b4d3aba3244a2df6a6b5
Corrections for spelling and other mistakes in code comments
and doc files.
AMD-Internal: [CPUPL-4500]
Change-Id: I33e28932b0e26bbed850c55602dee12fd002da7f
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
zen, skx and a couple of other kernels to cover all
contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
statements.
AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
Description:
1. GCC avoiding loading b into registers in m fringe
kenrels of int8 kernels. Instead gcc generating
fma with memory as an operand for B input.
2. This is causing performance regression for larger n
where each fma needs to load the input from memory
again and again.
3. This is observed with gcc but not with clang.
4. Inserted dummy shuffle instructions for b data to
further explicitly tell compiler that b needs to be in
registers.
AMD-Internal: SWLCSG-2948
Change-Id: Ibbf186fe6569e6265e2c2bb4ec3141ef323ea3e6
- Remove execute file permission from source and make files.
- dos2unix conversion.
- Add missing eol at end of files.
Also update .gitignore to not exclude build directory but to
exclude any build_* created by cmake builds.
AMD-Internal: [CPUPL-4415]
Change-Id: I5403290d49fe212659a8015d5e94281fe41eb124
-Matrix MUL op support added in main as well as fringe bfloat16 element
wise operations kernels.
-Benchmarking/testing framework for the same is added.
-Fixed issues in setting up post-ops node index.
AMD Internal: [SWLCSG-2947, SWLCSG-2953]
Change-Id: Iba7561a6a60df41211efbf06fab1b4900207bcf8
Add tests to check input arguments have not been modified by BLIS
routine. These tests add a large runtime overhead, so they are
disabled by default. To enable them, configure gtestsuite with:
cmake -DTEST_INPUT_ARGS=ON ...
and run desired tests as normal.
Also:
- Correct testinghelpers::chktrans to handle upper case values of
argument trns.
- Change testinghelpers::matsize to return size 0 if m, n or
leading dimension are 0, or if leading dimension is too small.
AMD-Internal: [CPUPL-4379]
Change-Id: I9494af800f9383195272ce99f622104a38fd0ed8
- Set threshold to epsilon for early return cases where we are just
scaling a matrix.
- Add this threshold to IIT_ERS files for appropriate tests.
- In IIT_ERS for gemm_compute, remove tests on null A and B when
we are expecting to set or scale C. More thought is required
in gemm_compute tests to handle these cases and look at cases
where A or B has been packed.
AMD-Internal: [CPUPL-4500]
Change-Id: Ia649cc340ca1df6511388f9c43a31e53296cb2bf
This post-operation computes C = (beta*C + alpha*A*B) * D, where D
is a matrix with dimensions and data type the same as that of C matrix.
AMD-Internal: [SWLCSG-2953]
Change-Id: Id4df2ca76a8f696cb16edbd02c25f621f9a828fd
Description:
1. GCC avoiding loading b into registers in m fringe
kenrels of int8 kernels. Instead gcc generating
fma with memory as an operand for B input.
2. This is causing performance regression for larger n
where each fma needs to load the input from memory
again and again.
3. This is observed with gcc but not with clang.
4. Inserted dummy shuffle instructions for b data to
further explicitly tell compiler that b needs to be in
registers.
5. Moved packb_s4_to_bf16 under JIT macro to resovle
compilation issue with gcc version < 11.2
AMD-Internal: SWLCSG-2948
Change-Id: I5bd1bad7ad129e0dde91ed78d49a4ede3bff456a
-This API supports applying element wise operations (eg: post-ops) on a
bfloat16 input matrix to get an output matrix of the same(bfloat16) or
upscaled data type (float).
-Benchmarking/testing framework for the same is added.
AMD Internal: SWLCSG-2947
Change-Id: I43f1c269be1a1997d4912d8a3a97be5e5f3442d2
- C<- alpha * op(A) *op(B) + beta *C.
C(nxn) - A(n x k) * B(k x n)
For ZEN4 and ZEN5
DGEMM is col-preferred kernel
DGEMMT = DGEMM + DGEMMT
DGEMM is col-preferred and DGEMMT is row-preferred.
DGEMM is evaluated as C = A*B (all col-storage)
whereas DGEMMT is evaluated as C = B * A (row-storage).
When A is packed it is packed as row-panels with col-stored elements.
So DGEMM is evaluated as C = A*B (A is col-stored) it aligns
with col-stored preference.
For DGEMMT: C = B * A, here A will become col-stored
because of packingand as result it will break the DGEMMT
kernel assumption that A is row-storage.
- Fixed this by disabling this optimization for ZEN4
and ZEN5.
AMD-Internal: [CPUPL-5542}
Change-Id: I9645624be009d1050ecb908d65c04aadcfa04379