Implement zen6 cpuid and arch changes, and add zen6 as a
separate BLIS sub-configuration and code path within amdzen
configuration family. Currently all optimization choices are
copies of zen5 sub-configuration.
AMD-Internal: [CPUPL-7162]
Copy of similar change in upstream BLIS (843a5e8) to fix issues
https://github.com/flame/blis/issues/873 and
https://github.com/amd/blis/issues/50
Details:
- Previously, `<omp.h>` was included in `bli_thrcomm_openmp.h` so that the
framework could access the necessary OpenMP functions.
- As @melven reported (#873), this causes issues when `blis.h` is included
in C++ code since the `<omp.h>` include happens with `extern "C"`.
- Move the include from the header to the necessary .c files so that it
does not "pollute" `blis.h`.
Thanks to @DaAwesomeP and @bartoldeman for reporting this issue in
AOCL BLIS
AMD-Internal: [CPUPL-7303]
bli_arch_query_id() is used to select kernels in optimized BLAS APIs. Previous
implementation incurred the overhead of multiple function calls. This has
been reduced by:
- Changing the function to be defined in a header file so it can be inlined.
- Avoiding call to bli_arch_check_id_once that was a wrapper for a call to
bli_pthread_once. Instead bli_pthread_once is called directly.
- For builds with a single BLIS sub-configuration, correct arch_id is taken
directly from a header file in the corresponding config subdirectory,
avoiding the bli_pthread_once call and making the value explicit at
compile time, which may enable additional optimizations.
To enable these changes, the variables arch_id and model_id defined in
frame/base/bli_arch.c are no longer static, as they must be accessed in multiple
files (i.e. they are now global variables). Rename to g_arch_id and g_model_id
to distinguish from any locally defined arch_id or model_id variables.
- When alpha == 0, we are expected to only scale y vector with beta and not read A or X at all.
- This scenario is not handled properly in all code paths which causes NAN and INF from A and X being wrongly propagated. For example, for non-zen architecture (default block in switch case) no such check is present, similarly some of the avx512 kernels are also missing these checks.
- When beta == 0, we are not expected to read Y at all, this also is not handled correctly in one of the avx512 kernel.
- To fix these, early return condition for alpha == 0 is added to bla layer itself so that each kernel does not have to implement the logic.
- DGEMV AVX512 transpose kernel has been fixed to load vector Y only when beta != 0.
AMD-Internal: [CPUPL-7585]
- Remove redundant AOCL_DTL_LOG_NUM_THREADS calls from early return paths
- Update thread count logging to use AOCL_get_requested_threads_count() for early exits
- Clean up duplicate DTL logging in gemv_unf_var1 and gemv_unf_var2 implementations
- Remove thread count logging from bli_dgemv_n_zen4_int kernel variants
- Simplify aocldtl_blis.c AOCL_DTL_log_gemv_sizes by removing redundant conditional
- Standardize DTL trace exit patterns across axpy, scal, and gemv operations
- Remove commented-out DTL logging code in zen4 gemv kernel
This patch ensures thread count is logged only once per operation and uses
the correct API (AOCL_get_requested_threads_count) for early exit scenarios
where the actual execution thread count may differ from requested threads.
* DTL Log update
Updates logs with nt and AOCL Dynamic selected nt for axpy, scal and dgemv
Modified bench_gemv.c to able to process modified dtl logs.
* Updated DTL log for copy routine with actual nt and dynamic nt
* Refactor OpenMP pragmas and clean up code
Removed unnecessary nested OpenMP pragma and cleaned up function end comment.
* Fixed DTL log for sequential build
* Added thread logging in bla_gemv_check for invalid inputs
---------
Co-authored-by: Smyth, Edward <Edward.Smyth@amd.com>
- AMD specific BLAS1 and BLAS2 franework: changes to make variants
more consistent with each other
- Initialize kernel pointers to NULL where not immediately set
- Fix code indentation and other other whitespace changes in DTL
code and addon/aocl_gemm/frame/s8s8s32/lpgemm_s8s8s32_sym_quant.c
- Fix typos in DTL comments
- Add missing newline at end of test/CMakeLists.txt
- Standardize on using arch_id variable name
AMD-Internal: [CPUPL-6579]
- Change begin_asm and end_asm comments and unused code in files
kernels/haswell/3/sup/s6x16/bli_gemmsup_rv_haswell_asm_sMx6.c
kernels/zen4/3/sup/bli_gemmsup_cd_zen4_asm_z12x4m.c
to avoid problems in clobber checking script.
- Add missing clobbers in files
kernels/zen4/1m/bli_packm_zen4_asm_d24xk.c
kernels/zen4/1m/bli_packm_zen4_asm_z12xk.c
kernels/zen4/3/sup/bli_gemmsup_cv_zen4_asm_z12x4m.c
- Add missing newline at end of files.
- Update some copyright years for recent changes.
- Standardize license text formatting.
AMD-Internal: [CPUPL-6579]
Naming of Zen kernels and associated files was inconsistent with BLIS
conventions for other sub-configurations and between different Zen
generations. Other anomalies existed, e.g. dgemmsup 24x column
preferred kernels names with _rv_ instead of _cv_. This patch renames
kernels and file names to address these issues.
AMD-Internal: [CPUPL-6579]
* Added DGEMV no transpose multithreaded Implementations
- Added new avx512 M and N kernels for DGEMV.
- Added multiple MT implementations for same kernels.
- Added AOCL_dynamic logic for L2 apis.
- Tuned AOCL_dynamic and code path selection for DGEMV on ZEN5.
- Added same kernels for SGEMV, but these kernels are not enabled yet.
- Added SGEMV reference kernel.
AMD-Internal: [SWLCSG-3408]
Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
- Temperory fix for regression in DGEMV for non-unit stride y inputs. The code
section responsible for handling non-unit stride y has been removed from the
frame.
- The kernel code is extended with if condition to handle both unit and non-unit
stride y.
AMD-Internal: [CPUPL-6869]
- The build was failing when AOCL_DYNAMIC was disabled because
`fast_path_thresh` was only declared when both AOCL_DYNAMIC and
OpenMP were enabled. This variable was used in an `if` condition for
single-thread execution without an AOCL_DYNAMIC guard.
- To resolve this, the test expression for single-thread execution has
been replaced with a macro. This macro is set to 0 when AOCL_DYNAMIC
is disabled, ensuring the condition is handled correctly.
AMD-Internal: [CPUPL-6854]
- Implemented multithreading framework for the DGEMV API on Zen architectures. Architecture specific AOCL-dynamic logic determines the optimal number of threads for improved performance.
- The condition check for the value of beta is optimized by utilizing masked operations. The mask value is set based on value of beta, and the masked operations are applied when the vector y is loaded or scaled with beta.
AMD-Internal: [CPUPL-6746]
Including a C file directly in another C file is not recommended, and some
build systems (e.g. Bazel and Buck) do not allow .c files to include other
.c files. This commit changes the tapi and oapi framework files that are
included from the _ex and _ba file variants from .c filenames to .h
filenames.
AMD-Internal: [CPUPL-6784]
Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
- Introducted new assembly kernel that copies data from source
to destination from the front and back of the vector at the
same time. This kernel provides better performance for larger
input sizes.
- Added a wrapper function responsible for selecting the kernel
used by DCOPYV API to handle the given input for zen5
architecture.
- Updated AOCL-dynamic threshold for DCOPYV API in zen4 and
zen5 architectures.
- New unit-tests were included in the grestsuite for the new
kernel.
AMD-Internal: [CPUPL-6650]
Change-Id: Ie2af88b8e97196b6aa02c089e59247742002f568
- Included a new code section to handle input having non-unit strided y
vector for dgemv transpose case. Removed the same from the respective
kernels to avoid repeated branching caused by condition checks within
the 'for' loop.
- The condition check for beta is equal to zero in the primary kernels
are moved outside the for loop to avoid repeated branching.
- The '_mm512_reduce_pd' operations in the primary kernel is replaced by
a series of operations to reduce the number of instructions required
to reduce the 8 registers.
- Changing naming convention for DGEMV transpose kernels.
- Modified unit kernel test to avoid y increment for dgemv tranpose
kernels during the test.
AMD-Internal: [CPUPL-6565]
Change-Id: I1ac516d6b8f156ac53ac9f6eb18badd50e152e05
- Updated the bli_dgemv_zen_ref( ... ) kernel to support general stride.
- Since the latest dgemv kernels don't support general stride, added
checks to invoke bli_dgemv_zen_ref( ... ) when A matrix has a general
stride.
- Thanks to Vignesh Balasubramanian <vignesh.balasubramanian@amd.com>
for finding this issue.
AMD-Internal: [CPUPL-6492]
Change-Id: Ia987ce7674cb26cb32eea4a6e9bd6623f2027328
- Replaced switch case with if else, lookup table for switch case
is palced at the end of .text section which causes a huge jump.
- Reduced number of branches for tiny sizes.
- Cpuid query is slow, therefore added a new if statement which avoids cpuid
query for tiny sizes(<200).
- Redirected tiny sizes to AVX2 kernel.
AMD-Internal: [CPUPL-5407]
Change-Id: I8e73777b2f00c9dcff9775ddfcb7ca3f74fa901c
- Developed new AVX512 DGEMV kernels for Zen4/5 architectures and
AVX2 kernels for Zen1/2/3 architectures. These kernels are written
from the ground up and are independent of fused kernels.
- The DGEMV primary kernel processes the calculation in chunks of
8 columns. Fringe columns (sizes 1 to 7) are handled by fringe
kernels, which are invoked by the primary kernel as needed.
- Implemented the kernels by computing the dot product of matrix A
columns with vector x in chunks of 32 elements, storing the results
in accumulator registers. Fringe elements are handled in chunks
of 16, 8, etc. The data in the accumulator registers is then reduced
and added to vector y.
AMD-Internal: [CPUPL-5835]
Change-Id: I5cb9eb1330db095931586a7028fd7676fbbecc61
- AVX512 specific DGEMV native kernels are added for Zen4/5
architectures to handle the NO_TRANSPOSE cases and are independent of
the AXPYF fused kernels.
- The following set of kernels biased towards the n-dimension perform
beta scaling of y vector within the kernel itself and handle cases
where n is less than 5:
- bli_dgemv_n_zen_int_32x8n_avx512( ... )
- bli_dgemv_n_zen_int_32x4n_avx512( ... )
- bli_dgemv_n_zen_int_32x2n_avx512( ... )
- bli_dgemv_n_zen_int_32x1n_avx512( ... )
- The bli_dgemv_n_zen_int_16mx8_avx512( ... ) is biased towards the
m-dimension and for this kernel beta scaling is handled beforehand
within the framework.
- Added unit-tests for the new kernels.
- AVX2 path for Zen/2/3 architectures still follows the old approach of
using fused kernel, namely AXPYF, to perform the GEMV operation.
AMD-Internal: [CPUPL-5560]
Change-Id: I22bc2a865cd28b9cdcb383e17d1ff38bdd28de79
- Use AVX2 kernels for tiny sizes on genoa.
- Removed the runtime init overhead for small sizes.
AMD-Internal: [CPUPL-5407]
Change-Id: I0db7d93abc659012916ef706f22528c7fabb4e30
- Enabled AVX512 DAXPYF kernels for DGEMV var2 for NO_TRANSPOSE cases.
- Added DAXPYF kernels with fuse factors of 2, 4, 6 and 16.
- Added a wrapper for DAXPYF kernels for redirection to kernels with a
smaller fuse factor than 32.
- Also added UKR tests for the new fused kernels.
AMD-Internal: [CPUPL-5098]
Change-Id: I0b102b67c6c068873393bac0494284f379c253f2
- Implemented two new axpyf kernels for fused factors 8 and 12
by manually unrolling the loops. Used to achieve better performance
in var2 case.
AMD-Internal: [CPUPL-5184]
Change-Id: I40d2930d003c6ce90323b5c8a52564563d1f23f5
- Added a {} around zen4 switch case to avoid AOCC error.
- Error is caused because in C declarations are not a statement, therefore
they cannot be labled hence compiler is not able to create a lable
for jump.
AMD-Internal: [CPUPL-4880]
Change-Id: Icfeedafd80bf9a955e430ca967b6a93dcbbf075e
- Added DAXPYF and DDOTXF AVX512 kernels.
- Fuse factor for ddotxf kernel is 8.
- 2 DAXPYF kernels are added, with fuse
factor 8 and 32.
- Multithreading is also added to the DAXPYf
kernel with fuse factor 32.
- These kernels are internally used by TRSM.
- Added changes in TRSV to call these kernels
in ZEN4
AMD-Internal: [CPUPL-4880]
Change-Id: I12850de974b437bbca07677b68bc3d6a35858770
- Implemented AVX512 kernels for handling the calls to ZGEMV
with transpose to A matrix.
- This includes the set of ZDOTXF and ZDOTXV kernels. ZDOTXF
kernels include those with fuse-factor 8 (main kernel), 4
and 2(fringe kernels).
- Updated the bli_zgemv_unf_var1( ... ) function to update
the function pointers to these kernels, based on the
configuration.
AMD-Internal: [CPUPL-4974]
Change-Id: I313ae0abe9dc119de849da42f9825b71f11b1fda
- Implemented AVX512 kernels for handling the calls to ZGEMV
with no-transpose to A matrix.
- This includes the ZAXPYF, ZAXPYV and ZSETV kernels.
The set of ZAXPYF kernels include those with fuse-factor 8
(main kernel), 4 and 2(fringe kernels).
- Updated the bli_zgemv_unf_var2( ... ) function to set
the function pointers to these kernels, based on the
configuration. Further added the call to ZSETV at this
layer in case beta is 0.
AMD-Internal: [CPUPL-4974]
Change-Id: Iee4b724719e49023138bb16479765be44d677cd9
- Scaling vector X is skipped when alpha is 1 in ZTRSV.
- Scaling matrix A is skipped when alpha is 1 in ZTRSM.
AMD-Internal: [CPUPL-4324]
Change-Id: I03c5a454ed1f5be36dac0f121408749bfc9cfc81
Implement full support for zen5 as a separate BLIS sub-configuration
and code path within amdzen configuration family.
AMD-Internal: [CPUPL-3518]
Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09
- When configured for haswell config "Warning unused variable 'zero'"
was throwed during compilation.
- Removed zero variable which is not being used
AMD-Internal: [CPUPL-3973]
Change-Id: I45a1f16b4c50307b07148bba63ca5332c48648b8
- The call to the bli_saxpyf_zen_int_6( ... ) is explicitly
present in the bli_gemv_unf_var2_amd.c file, as part of the
bli_sgemv_unf_var2( ... ) function. This was changed to
bli_saxpyf_zen_int_5( ... )( thereby changing the fuse factor
from 6 to 5 ), in accordance to the function pointer present
in the zen3 and zen4 context files.
- Changed the accumulator type to double from float, inside the
fringe loop for unit-strides(vectorized path) and non-unit strides
(scalar code).
AMD-Internal: [CPUPL-4028]
Change-Id: Iab1a0318f461cba9a7041093c6865ae8396d231e
- Added an explicit function definition for ZGEMV var 1. This
removes the need to query the context for Zen architectures.
- Added a new INSERT_GENTFUNC to generate the definition only
for scomplex type.
- Rewrote ZDOTXF kernel and added the function name for ZDOTV
instead of querying it.
- With this change fringe loop is vectorized using SSE
instructions.
AMD-Internal:[CPUPL-3997]
Change-Id: I790214d528f9e39f63387bc95bf611f84d3faca3
* commit 'b683d01b':
Use extra #undef when including ba/ex API headers.
Minor preprocessor/header cleanup.
Fixed typo in cpp guard in bli_util_ft.h.
Defined eqsc, eqv, eqm to test object equality.
Defined setijv, getijv to set/get vector elements.
Minor API breakage in bli_pack API.
Add err_t* "return" parameter to malloc functions.
Always stay initialized after BLAS compat calls.
Renamed membrk files/vars/functions to pba.
Switch allocator mutexes to static initialization.
AMD-Internal: [CPUPL-2698]
Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df
- Added call to dsetv in dscalv. When DSCALV is invoked by
DGEMV the SCAL function is expected to SET the vector to
zero when alpha is 0. This change is done to ensure BLAS
compatibility of DGEMV.
- Fixed bug in DGEMV var 1. Reverted changes in DGEMV var
1 to remove packing and dispatch logic.
- CMAKE now builds with _amd files for unf_var2 of GEMV.
AMD-Internal: [CPUPL-3772]
Change-Id: I0d60c9e1025a3a56419d6ae47ded509d50e5eade
- In GEMV variant 1, the input matrix A is in row major. X vector
has to be of unit stride if the operation is to be vectorized.
- In cases when X vector is non-unit stride, vectorization of the GEMV
operation inside the kernel has been ensured by packing the input X
vector to a temporary buffer with unit stride. Currently, the
packing is done using the SCAL2V.
- In case of DGEMV, X vector is scaled by alpha as part of packing.
In CGEMV and ZGEMV, alpha is passed as 1 while packing.
- The temporary buffer created is released once the GEMV operation
is complete.
- In DGEMV variant 1, moved problem decomposition for Zen architecture
to the DOTXF kernel.
- Removed flag check based kernel dispatch logic from DGEMV. Now,
kernels will be picked from the context for non-avx machines. For
avx machines, the kernel(s) to be dispatched is(are) assigned to
the function pointer in the unf_var layer.
AMD-Internal: [CPUPL-3475]
Change-Id: Icd9fd91eccd831f1fcb9fbf0037fcbbc2e34268e
- In variant 2 of GEMV, A matrix is in column major. Y vector has
to be of unit stride if the operation is to be vectorized.
- In cases when Y vector is non-unit stride, vectorization of the
GEMV operation inside the kernel has been ensured by packing the
input Y vector to a temporary buffer with unit stride. As part of
the packing Y is scaled by beta to reduce the number of times Y
vector is to be loaded.
- After performing the GEMV operation, the results in the temporary
buffer are copied to the original buffer and the temporary one is
released.
- In DGEMV var 2, moved problem decomposition for Zen architecture
to the AXPYF kernel.
- Removed flag check based kernel dispatch logic from DGEMV. Now,
kernels will be picked from the context for non-avx machines. For
avx machines, the kernel(s) to be dispatched is(are) assigned to
the function pointer in the unf_var layer.
AMD-Internal: [CPUPL-3485]
Change-Id: I7b2efb00a9fa9abca65abca07ee80f38229bf654
Some text files were missing a newline at the end of the file.
One has been added.
Also correct file format of windows/tests/inputs.yaml, which
was missed in commit 0f0277e104
AMD-Internal: [CPUPL-2870]
Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549
Corrections for spelling and other mistakes in code comments
and doc files.
AMD-Internal: [CPUPL-2870]
Change-Id: Ifbb5df7df2d6312fe73e06ee6d41c00b16c593ce
- Set the variables to zero to avoid the compiler warning
(-Wmaybe-uninitialized) in bli_dgemm_ref_k1.c,
bli_gemm_small.c, bli_trsm_small.c, bli_zgemm_ref_k1.c and
bli_trsm_small_AVX512.c
- Changed the datatype from dim_t to siz_t for i,k,j
in bli_hemv_unf_var1_amd.c and bli_hemv_unf_var3_amd.c to
avoid the compiler warning (-Waggressive-loop-optimizations)
AMD-Internal: [CPUPL-2870]
Change-Id: Ib2bc050fa47cb8a280d719283ab4539c70e19d03
- Fixed segmentation fault that was seen on non zen and non avx2
machines.
- cntx object was not passed to the invoked kernel causing a seg
fault.
AMD-Internal: [CPUPL-3167]
Change-Id: I2640d3f905e78398935cf6ed667b04a6418baa5d
Improvements to BLIS cpuid functionality:
- Tidy names of avx support test functions, especially rename
bli_cpuid_is_avx_supported() to bli_cpuid_is_avx2fma3_supported()
to more accurately describe what it tests.
- Fix bug in frame/base/bli_check.c related to changes in commit
6861fcae91
AMD-Internal: [CPUPL-3031]
Change-Id: Iacd8fb0ffbd45288e536fc6314660709055ea2d5
- In cases when incy != 1, a buffer is created for y vector. The
contents of vector y is scaled by beta and stored in this buffer.
- After performing the compute using ZAXPYF kernel, the results in
y buffer memory is copied back to the orginal buffer using ZCOPYV.
- In cases when alpha is zero, we only scale the y vector by beta
without using the buffer and return.
- The kernels are picked based on the architecture ID. For any zen
based architecture, AVX2 kernels are invoked. For other, the
kernels are invoked based on the context.
- In ZSCAL2V, query for the context if NULL pointer is passed.
AMD-Internal: [CPUPL-2773]
Change-Id: If409ca5c438fc2eebe73480c011577088d52c65f
HPL script was using BLIS manual way to set threading, i.e. setting
BLIS_IC_NT explicitly. This causes bli_rntm_num_threads() to return
-1, which wasn't trapped in parallelised BLAS1 and BLAS2 routines.
Fix: if this occurs, set local number of threads based on product of
BLIS_JC_NT * BLIS_PC_NT * BLIS_IC_NT * BLIS_JR_NT * BLIS_IR_NT values.
Note: BLIS_PC_NT should always be 1, but this environment variable
is currently being read (contrary to documentation), so include it
for now.
Other changes:
* implement _Pragma convention in all code used on AMD
* frame/2/gemv/bli_gemv_unf_var1_amd.c: Remove is_omp_mt_enabled flag
AMD-Internal: [CPUPL-2803]
Change-Id: I37e8b038e5640d6693a87be0609888186322b465
- Removed some additional compiler warnings reported by GCC 12.1
- Fixed a couple of typos in comments
- frame/3/bli_l3_sup.c: routines were returning before final call
to AOCL_DTL_TRACE_EXIT
- frame/2/gemv/bli_gemv_unf_var1_amd.c: bli_multi_sgemv_4x2 is
only defined in header file if BLIS_ENABLE_OPENMP is defined
AMD-Internal: [CPUPL-2460]
Change-Id: I2eacd5687f2548d8f40c24bd1b930859eefbbcde
- While calculating the diagonal and corner elements, the combined
operation of calculating the product of x and x hermitian and
simultaneously scaling it with alpha and adding the result to the matrix
was the cause of increased underflow and overflow errors in netlib
tests.
- So the above calculation is now being done in three steps: scaling x
vector with alpha, then calculating its product with x hermitian and
later adding the final result to the matrix.
AMD-Internal: [CPUPL-2213]
Change-Id: I32df572b013bc3189340662dbf17eddcaec9f0f8