- Delete unused cmake files.
- Add guards around call to bli_cpuid_is_avx2fma3_supported
in frame/3/bli_l3_sup.c, currently assumes that non-x86
platforms will not use bli_gemmtsup.
- Correct variable in frame/base/bli_arch.c on non-x86
builds.
- Add guards around omp pragma to avoid possible gcc
compiler warning in kernels/zen/2/bli_gemv_zen_int_4.c.
- Add missing registers in clobber list in
kernels/zen4/1/bli_dotv_zen_int_avx512.c.
- Add gtestsuite ERS_IIT tests for TRMV, copied from TRSV.
- Correct calls to cblas_{c,z}swap in gtestsuite.
- Correct test name in ddotxf gtestsuite program.
AMD-Internal: [CPUPL-4415]
Change-Id: I69ad56390017676cc609b4d3aba3244a2df6a6b5
Corrections for spelling and other mistakes in code comments
and doc files.
AMD-Internal: [CPUPL-4500]
Change-Id: I33e28932b0e26bbed850c55602dee12fd002da7f
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
zen, skx and a couple of other kernels to cover all
contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
statements.
AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
Description:
1. GCC avoiding loading b into registers in m fringe
kenrels of int8 kernels. Instead gcc generating
fma with memory as an operand for B input.
2. This is causing performance regression for larger n
where each fma needs to load the input from memory
again and again.
3. This is observed with gcc but not with clang.
4. Inserted dummy shuffle instructions for b data to
further explicitly tell compiler that b needs to be in
registers.
AMD-Internal: SWLCSG-2948
Change-Id: Ibbf186fe6569e6265e2c2bb4ec3141ef323ea3e6
- Remove execute file permission from source and make files.
- dos2unix conversion.
- Add missing eol at end of files.
Also update .gitignore to not exclude build directory but to
exclude any build_* created by cmake builds.
AMD-Internal: [CPUPL-4415]
Change-Id: I5403290d49fe212659a8015d5e94281fe41eb124
-Matrix MUL op support added in main as well as fringe bfloat16 element
wise operations kernels.
-Benchmarking/testing framework for the same is added.
-Fixed issues in setting up post-ops node index.
AMD Internal: [SWLCSG-2947, SWLCSG-2953]
Change-Id: Iba7561a6a60df41211efbf06fab1b4900207bcf8
Add tests to check input arguments have not been modified by BLIS
routine. These tests add a large runtime overhead, so they are
disabled by default. To enable them, configure gtestsuite with:
cmake -DTEST_INPUT_ARGS=ON ...
and run desired tests as normal.
Also:
- Correct testinghelpers::chktrans to handle upper case values of
argument trns.
- Change testinghelpers::matsize to return size 0 if m, n or
leading dimension are 0, or if leading dimension is too small.
AMD-Internal: [CPUPL-4379]
Change-Id: I9494af800f9383195272ce99f622104a38fd0ed8
- Set threshold to epsilon for early return cases where we are just
scaling a matrix.
- Add this threshold to IIT_ERS files for appropriate tests.
- In IIT_ERS for gemm_compute, remove tests on null A and B when
we are expecting to set or scale C. More thought is required
in gemm_compute tests to handle these cases and look at cases
where A or B has been packed.
AMD-Internal: [CPUPL-4500]
Change-Id: Ia649cc340ca1df6511388f9c43a31e53296cb2bf
This post-operation computes C = (beta*C + alpha*A*B) * D, where D
is a matrix with dimensions and data type the same as that of C matrix.
AMD-Internal: [SWLCSG-2953]
Change-Id: Id4df2ca76a8f696cb16edbd02c25f621f9a828fd
Description:
1. GCC avoiding loading b into registers in m fringe
kenrels of int8 kernels. Instead gcc generating
fma with memory as an operand for B input.
2. This is causing performance regression for larger n
where each fma needs to load the input from memory
again and again.
3. This is observed with gcc but not with clang.
4. Inserted dummy shuffle instructions for b data to
further explicitly tell compiler that b needs to be in
registers.
5. Moved packb_s4_to_bf16 under JIT macro to resovle
compilation issue with gcc version < 11.2
AMD-Internal: SWLCSG-2948
Change-Id: I5bd1bad7ad129e0dde91ed78d49a4ede3bff456a
-This API supports applying element wise operations (eg: post-ops) on a
bfloat16 input matrix to get an output matrix of the same(bfloat16) or
upscaled data type (float).
-Benchmarking/testing framework for the same is added.
AMD Internal: SWLCSG-2947
Change-Id: I43f1c269be1a1997d4912d8a3a97be5e5f3442d2
- C<- alpha * op(A) *op(B) + beta *C.
C(nxn) - A(n x k) * B(k x n)
For ZEN4 and ZEN5
DGEMM is col-preferred kernel
DGEMMT = DGEMM + DGEMMT
DGEMM is col-preferred and DGEMMT is row-preferred.
DGEMM is evaluated as C = A*B (all col-storage)
whereas DGEMMT is evaluated as C = B * A (row-storage).
When A is packed it is packed as row-panels with col-stored elements.
So DGEMM is evaluated as C = A*B (A is col-stored) it aligns
with col-stored preference.
For DGEMMT: C = B * A, here A will become col-stored
because of packingand as result it will break the DGEMMT
kernel assumption that A is row-storage.
- Fixed this by disabling this optimization for ZEN4
and ZEN5.
AMD-Internal: [CPUPL-5542}
Change-Id: I9645624be009d1050ecb908d65c04aadcfa04379
config => config/build/arch folder
Issue:
1. Performance drop is observed as part of the fat binary(amdzen config)
built to support all the platforms using dynamic dispatch feature.
2. Observed only in intrinsic code and not in assembly code.
3. Observed in many of level1 kernels on Milan and Genoa
Previous Design:
Znver flags are picked based on config or function name
In case of ref_kernels:
Compiler picks up znver flag based on the function name. All
ref_kernels are named based on BLIS_CNAME which is a
config name (zen, zen2, zen3, zen4, zen5)
In case of Zen kernels:
Compiler picks up znver flag based on the config name where the
source file exists. All avx2 kernels are placed in zen and all avx512
kernels are placed in zen4/zen5 folder.
Kernels placed in zen (AVX2 kernels) are being compiled with znver1
flag rather than using znver2/znver3 flags on zen2/zen3 arch
respectively
New Design: For amdzen builds
1. For ref_kernels and kernels/(zen/zen2/zen3), znver2 flag is used instead of
znver1 in make and cmake build system.
2. To use znver2 flags, make_defs.mk of zen2 is included in zen config
3. No changes are made for auto or any individual config
4. Significant perfomance improvement is observed
AMD-Internal : [CPUPL-5407] [CPUPL-5406] [CPUPL-4873] [CPUPL-4872] [CPUPL-4871] [CPUPL-4801] [CPUPL-4800] [CPUPL-4799]
Change-Id: Ie817c13b8b69a2dc4328aad7ae09a3af06f83df5
- Added reference kernel for dgemv that handles computation for tiny
sizes (m < 8 && n < 8).
- The reference kernel, bli_dgemv_zen_ref( ... ), supports both
row/column storage schemes as well as transpose and no transpose
cases.
- Added additional unit-tests for functional verification.
AMD-Internal: [CPUPL-5098]
Change-Id: I66fdf0a40e90bdb3fed40152c45ab28a17a87ada
- Avoid performance degradation of dscalv for ST when OpenMP is enabled
by using fast-path to skip the overhead caused by 'bli_nthreads_l1'
function if the input size is less than a particular threshold.
- Replaced 'bli_thread_vector_partition' work distribution function
with 'bli_thread_range_sub'.
AMD-Internal: [CPUPL-5522]
Change-Id: I4ad0041d6e448c4a26fcd47ce44e0321a41b8b9f
Auxiliary blocksize values for cache blocksizes are interpreted as the maximum cache blocksizes. The maximum cache blocksizes are a convenient and portable way of smoothing performance of the level-3 operations when computing with a matrix operand that is just slightly larger than a multiple of the preferred cache blocksize in that dimension. In these "edge cases," iterations run with highly sub-optimal blocking. We can address this problem by merging the "edge case" iteration with the second-to-last iteration, such that the cache blocksizes are slightly larger--rather than significantly smaller--than optimal. The maximum cache blocksizes allow the developer to specify the maximum size of this merged iteration; if the edge case causes the merged iteration to exceed this maximum, then the edge case is not merged and instead it is computed upon in separate (final) iteration. (https://github.com/flame/blis/blob/master/docs/ConfigurationHowTo.md).
In bli_cntx_init_zen4 and zen5 - auxiliary blocksize for KC was less than primary blocksize. These are fixed.
Code-cleanup of the files bli_family_zen4, zen5.h" Removed unused constants.
Thanks to Igor Kozachenko <igork@berkeley.edu> for pointing out these two bugs.
Change-Id: I44fc564d5d91cb978d062c413e70751aeaa07f2c
- Create seperate AOCL Dynamic values for
multithreading dcopy API for zen1, zen2 and zen3
AMD-Internal: [CPUPL-5238]
Change-Id: I42f56393716edeeace8bfe71d7adab0ba7325b47
- Added an additional decision logic to choose between SUP and
Native paths for zen4 and zen5 micro-architectures, based on
the input dimensions. This logic has been added to the
architecture-specific thresholds functions, that are registered
in the context.
- The decision logic will overrule the discrete thresholds present
in the zen4 and zen5 contexts.
AMD-Internal: [CPUPL-5547]
Change-Id: I475f19b110064b3b9eef2e03bbdc21f4dd826c03
- Added dgemmGenericSUP test for the new 24x8 DGEMM SUP kernel for zen5.
AMD-Internal: [CPUPL-4404]
Change-Id: I150ca310655a495bdcf5ea9d5a16746483a17b68
- This change in made in MAKE build system.
- Removed -fno-tree-loop-vectorize from global kernel flags,
instead added it to lpgemm specific kernels only.
- If this flag is not used , then gcc tries to auto
vectorize the code which results in usages of
vector registers, if the auto vectorized function
is using intrinsic then the total numbers of vector
registers used by intrinsic and auto vectorized
code becomes more than the registers
available in machine which causes read and writes
to stack, which is causing regression in lpgemm.
- If this flag is enabled globally, then the files which
do not use any intrinsic code do not get auto
vectorized.
- To get optimal performance for both blis and lpgemm,
this flag is enabled for lpgemm kernels only.
Previous commit (75df1ef218) contains
similar changes on cmake build system
AMD-Internal: [CPUPL-5544]
Change-Id: I796e89f3fb2116d64c3a78af2069de20ce92d506
Out-of-bound access fix in malloc failure case for following APIs: ddot_, zdotc_, zdotu_
AMD-Internal: [CPUPL-4686]
Change-Id: I676697223604fbb2a8d03421d98ed0d8d706f8c7
- Introduced a new 24x8 column preferred DGEMM sup kernel for zen5.
- A prefetch logic is modified compared to zen4 24x8 sup kernels.
- Earlier, next panel of A is prefetched into L2 cache,
which is now modified to prefetching the second next column
of the current panel of A into L1 cache.
- B and C prefetches are enabled and unchanged.
- Tuned MC, KC and NC block sizes for new kernel.
AMD-Internal: [CPUPL-5262]
Change-Id: If933537e50f43f5560e0fe18a716aa1e36ced64d
- New Decision threshold constants are added to decide between
double precision sup vs native dgemm code-path for zen5 processors.
- The decision is based on the values of m, n and k.
AMD-Internal: [CPUPL-5262]
Change-Id: I87b8ff9eb603d6fda0875e000f7ab83b22d22040
- Added new decision logic to choose between
native TRSM vs unpacked small TRSM for
double precision.
- The changes are made for zen5 processor.
AMD-Internal: [CPUPL-5534]
Change-Id: I5204f6df111edec27d006daeb1c2b535a67b3e46
- Updated the final reduction of partial sums to use scalar accumulation
entirely, instead of using the _mm512_reduce_add_pd( ... ) intrinsic.
This will in turn change the associativity and the rounding-off
pattern in the reduction step.
- Defined a union data-type to do the same, by having a 512-bit
register and a double-precision array as its members.
- Updated the declaration and usage of the register variable according
to the union definition, for uniformity.
AMD-Internal: [CPUPL-5472]
Change-Id: I997464a6ec47e4054dca48a000fbd4ac0cfcc679
Various improvements:
- Where appropriate, test both:
- with nullptr for suitable arguments that should never
be touched.
- with all arguments correct except the one we want to
test, to check we are not returning early because
another argument is a nullptr.
- Test incorrect values for order argument in CBLAS calls.
- Test early exits with limited data changes, e.g. set
C to 0 or scale C in GEMM when alpha = 0.
- Bugfix in gemmt test when alpha is 0 and beta is 1.
- Use reference library gemmt for comparison when library
is not netlib BLAS.
AMD-Internal: [CPUPL-4500]
Change-Id: Ibde7eaba5a484a87674044ca44855c6f6ee4ff4b
- In the initial patch - for m, n non-multiple of MR and NR
respectively we are calling bli_dgemm_ker_var2. Now we have
implemented macro-kernel for these fringe cases as well.
- Replaced RBP register with R11 in the macro-kernel.
- Retuned MC, KC and NC with these new changes.
This will result in better performance for matrix sizes
like m=4000 or greater when running on single thread.
AMD-Internal: [CPUPL-5262]
Change-Id: I66c111ceb7feee776703339680d57e8d6d5c809a
- Removed some of the unrolling factors that affected the
performance of AVX2 DAXPYV kernel. In addition to improving
the current performance on sizes compatible to single-threaded
runs, this will now perform better for tiny sizes as well
since the overhead to reach the computation is less.
- Updated the vector partitioning logic, by using
bli_thread_range_sub( ... ), which ensures that there is no
false sharing among multiple threads.
- Updated the AOCL-DYNAMIC logic for the API, to include thresholds
or zen4 and zen5 micro-architectures.
AMD-Internal: [CPUPL-5514]
Change-Id: Iee9edddac685334213cd6694421ab3df3547e930
- Added the missing registers in end_asm for scopy,
dcopy and zcopy APIs.
- Removed unnecessary registers from end_asm for scopy
and dcopy APIs.
- Corrected mistakes in the comments.
Change-Id: I5ebe2ff9cb2c72ca7c71a67419281f73462f9498
- Fixed framework of bf16s4f32of32 API to correct
pointer updations.
- Modified pre_op structure to exclude pre-op-offset.
Now offset is passed as a separate parameter to the
scale-pack functions.
- Fixed work-distribution among threads in MT scenario.
- Added Blocksizes and kernel-pointers and verified
functionality for the new API.
AMD-Internal: [SWLCSG-2943]
Change-Id: I58fece240d62c798c880a2b2b7fa64e560cc753d
Unnecessary whitespace (spaces, tabs) at the end of lines
has been removed.
AMD-Internal: [CPUPL-4500]
Change-Id: Ice5f5504232cb22460c14ac47e6a3a43309cba22
Source and other files in some directories were a mixture of
Unix and DOS file formats. Convert all relevant files to Unix
format for consistency.
AMD-Internal: [CPUPL-4500]
Change-Id: Ia3e479643b0bed4ae8a9107bde6e2cddf32d5bd8
Currently the CBLAS xerbla always prints and always stops
on error. This commits adds similar functionality to the
regular BLAS xerbla to match the changes in 6d0444497f,
namely:
- Option to stop in xerbla on error. This is controlled by
setting the environment variable BLIS_STOP_ON_ERROR=1
- Option to disable printing of error message from BLIS. This
is controlled by setting the environment variable
BLIS_PRINT_ON_ERROR=0
- Added a function to return the value of INFO passed to xerbla,
assuming xerbla was not set to stop on error. Example call is
info = bli_info_get_info_value();
The default behaviour remains to print but has been changed to
not stop on error, i.e. the equivalent to
export BLIS_PRINT_ON_ERROR=1 BLIS_STOP_ON_ERROR=0
AMD-Internal: [CPUPL-5361]
Change-Id: Icd6125fd60da139e3ec0969e52337a1ed515f0a2
- Adjusted the macro-guards for variables specific to
multithreading, when BLIS is configured with OpenMP.
- This included calling the single-threaded kernel directly
if increment is 0 as well, since this would remove an
unnecessary dependency on one of the variables used only
when we enable OpenMP.
- Further updated the condition to pack the vector, to
avoid it when increment is 0. In this case, we directly
call the kernel.
AMD-Internal: [CPUPL-5480]
Change-Id: I31a9c6e3ffc3c4f9d5b03ed8745919ad65c99c79
-To enable Weight-only-Quantization (WOQ) workflow, new LPGEMM APIs
have been developed where data types are A:bf16, B:int4 and C:f32/bf16.
The testing and benchmarking framework for the same are added.
AMD-Internal: [SWLCSG-2943]
Change-Id: Icdc1d60819a23dd9f41382499d1a3c055c5edc17
Description:
1. Added a new API aocl_gemm_bf16s4f32of32 to support
for WoQ (Weight-only-Quantization) in LLM's
2. The API supports only reordered B matrix of data
size signed 4 bits (S4).
3. Substracting zero point and multiplying with scale
on B matrix is performed in packing B.
4. zero point and scale data should be passed by user
through pre-ops data structure.
5. The API is still in experimental state and NOT tested.
AMD-Internal: SWLCSG-2943
Change-Id: I10b159b64c2e2aaf39da5462685618ba8cc800ee
Details:
- To enable Weight-only-Quantization(WOQ) workflow,
new LPGEMM APIs are added where datatypes are A: bf16,
B: int4, C: f32/bf16. To support this, B matrix will
be reordered with type still being int4. New pack
kernels that packs the reordered B matrix after
converting the data from int4 to bf16 and applying
zero-point and scale are added.
AMD-Internal: [SWLCSG-2943]
Change-Id: Iabe23dab607913c0114b97cb2b91248babeaac03
-To enable Weight-only-Quantization (WOQ) workflow, new LPGEMM APIs
are required where data types are A:bf16, B:int4 and C:f32/bf16. It
is expected that the BF16 kernels will be reused within this API and
subsequently the B matrix needs to be reordered following the BF16
kernel schema, but with the reordered matrix type still being int4. To
address this, new BF16 reorder kernels enabling the same are added.
AMD-Internal: [SWLCSG-2943]
Change-Id: Ib770ecbf90a3d906deafece94b1a96e0b9412738
- Replaced "vmovupd" with "vmovups" for "bli_scopyv_zen4_asm_avx512"
kernel.
- Optimization of loop unrolling for "bli_dcopyv_zen4_asm_avx512"
and "bli_scopyv_zen4_asm_avx512" kernels.
- Replaced existing load balancing algorithm for dcopy API with
"bli_thread_range_sub" algorithm.
- Included AOCL-dynamic values for optimial number of threads
for zen5 architecture.
AMD-Internal: [CPUPL-5238]
Change-Id: Ic82bdfad9478c8f75dc5a3dcfed0df85fbcae957
- Enabled AVX512 DAXPYF kernels for DGEMV var2 for NO_TRANSPOSE cases.
- Added DAXPYF kernels with fuse factors of 2, 4, 6 and 16.
- Added a wrapper for DAXPYF kernels for redirection to kernels with a
smaller fuse factor than 32.
- Also added UKR tests for the new fused kernels.
AMD-Internal: [CPUPL-5098]
Change-Id: I0b102b67c6c068873393bac0494284f379c253f2
- Logic to calculate the kernel index in AVX512
DGEMMT SUP framework is incorrect.
- The granularity for workload distribution along N
dimension is NR(8), whereas current logic to pick
diagonal kernel assumes the granularity to be MR (24).
- To Fix this, the logic to determine the kernel index is
changed, instead of relying solely on n_offset, the kernel
index is derived depending on distance from the diagonal.
- If distance from diagonal is greater than
LCM of (MR and NR) - NR, that that means the current micro
panel is not a diagonal micro panel.
- If the micro panel is a diagonal micro panel, then the
distance from diagonal is equal to the M dimension for
initial full GEMM region or empty region of diagonal
kernel. This info can be used to determine the kernel index.
AMD-Internal: [CPUPL-5440]
Change-Id: I640d3a1b43e63b24bc9f0ed4a67cced45f6fa3b3
-_mm512_cvtpbh_ps intrinsic is not supported in older versions of gcc
(<gcc 12.2) and subsequently throws a compilation error. This is fixed
by replacing this intrinsic with a macro that achieves the bf16 to f32
conversion via shift operations.
-Bug fixes in the vector scale factor load in fringe kernels.
AMD-Internal: [SWLCSG-2945]
Change-Id: I8eac4c4b34b043e7a8116dc465723d8f85b28018