Added Matrix-mul and Matrix-add postops in FP32 AVX512_256 GEMV kernels
- Matrix-add and Matrix-mul post-ops have been added to the FP32
AVX512_256 GEMV m = 1 and n = 1 kernels.
Co-authored-by: VarshaV <varshav2@amd.com>
BLIS-specific setting of threading takes precedence over OpenMP
thread count ICV values, and if the BLIS-specific threading APIs
are used, there was no way for the program to revert to OpenMP
settings. This patch implements a function bli_thread_reset() to
do this. This is similar to that implemented in upstream BLIS in
commit 6dcf7666ef
More specifically, it reverts the internal threading data to that
which existed when the program was launched, subject where appropriate
to any changes in the OpenMP ICVs. In other words:
- It will undo changes to threading set by previous calls to
bli_thread_set_num_threads or bli_thread_set_ways.
- If the environment variable BLIS_NUM_THREADS was used, this will
NOT be cleared, as the initial state of the program is restored.
- Changes to OpenMP ICVs from previous calls to omp_set_num_threads()
will still be in effect, but can be overridden by further calls to
omp_set_num_threads().
Note: the internal BLIS data structure updated by the threading APIs,
including bli_thread_reset(), is thread-local to each user
(e.g. application) thread.
Example usage:
omp_set_num_threads(4);
bli_thread_set_num_threads(7);
dgemm(...); // 7 threads will be used
bli_thread_reset();
dgemm(...); // 4 threads will be used
- Implemented multithreading framework for the DGEMV API on Zen architectures. Architecture specific AOCL-dynamic logic determines the optimal number of threads for improved performance.
- The condition check for the value of beta is optimized by utilizing masked operations. The mask value is set based on the value of beta, and the masked operations are applied when the vector y is loaded or scaled with beta.
AMD-Internal: [CPUPL-6746]
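A scalar sketch of the masked-beta idea above (names and structure are illustrative, not the actual kernel code): a mask derived from beta stands in for the AVX512 k-mask, so the beta == 0 case never reads y.

```c
#include <assert.h>

/* Illustrative scalar model of the masked-beta handling: the mask is set
 * from the value of beta, and a masked "load" of y returns 0 for
 * masked-off elements, so y is never read when beta == 0. */
static void gemv_beta_scale(float *y, float beta, int n)
{
    unsigned mask = (beta != 0.0f) ? 1u : 0u;  /* mask set from beta */
    for (int i = 0; i < n; i++) {
        float yi = mask ? y[i] : 0.0f;  /* masked load of y */
        y[i] = beta * yi;               /* scale (or zero) y */
    }
}
```

In the actual kernels the same effect would come from a k-mask with masked loads such as `_mm512_maskz_loadu_ps`, avoiding a separate beta == 0 branch.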
Details:
- In FP32 GEMM, when threading is disabled, rntm_pack_a and rntm_pack_b
were set to true by default. This leads to perf regression for smaller
sizes. Modified FP32 interface API to not overwrite the packA and
packB variables in rntm structure.
- In FP32 GEMV, removed the decision-making code based on mtag_A/B
  and should_pack_A/B for packing. Matrices will be packed only
  if the storage format of the matrices doesn't match the storage
  format required by the kernel.
- Changed the control flow to check whether the mtag value marks the
  matrix as "reordered", "to-be-packed" or "unpacked", checking
  for "reordered" first, followed by "pack". This ensures that
  packing doesn't happen when the matrix is already reordered, even
  if the user forces packing by setting "BLIS_PACK_A/B".
- Modified the Python script to generate test cases based on block sizes.
AMD-Internal: SWLCSG-3527
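The packing decision described above can be sketched as follows (type and function names are hypothetical, not the actual framework identifiers):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of the revised packing decision: the "reordered"
 * state is checked before "pack", and packing happens only when the
 * matrix's storage format differs from what the kernel requires. */
typedef enum { STOR_ROW_MAJOR, STOR_COL_MAJOR } storage_t;
typedef enum { MTAG_UNPACKED, MTAG_PACK, MTAG_REORDERED } mtag_t;

static bool should_pack(mtag_t mtag, storage_t have, storage_t need)
{
    if (mtag == MTAG_REORDERED)
        return false;        /* already reordered: never pack, even if
                                the user set BLIS_PACK_A/B */
    return have != need;     /* pack only on a storage-format mismatch */
}
```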
- The current implementation of the topology detector assumes
  that the parallel region uses all the threads queried through
  omp_get_max_threads(). If the actual parallelism in the function
  is lower than this expectation, the code may access unallocated
  memory (through uninitialized pointers).
- This was because every thread (each having its own pointer) sets its
  initial value to NULL inside the parallel section, thereby leaving
  some pointers uninitialized if the associated thread is not spawned.
- Also, the current implementation would use negative indexing (with -1)
  if any associated thread was not spawned.
- Fix: set every thread-specific pointer to NULL outside the parallel
  region, using calloc(). As long as pointers are NULL-checked before
  being dereferenced, no issues will be observed. Also avoid incurring
  the topology-detection cost if all the required threads are not
  spawned, thereby avoiding potential negative indexing (when using
  the core-group ID).
AMD-Internal: [SWLCSG-3573]
Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
Co-authored-by: Bhaskar, Nallani <Nallani.Bhaskar@amd.com>
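A minimal sketch of the fix above (illustrative, not the actual source): the per-thread pointer slots are zero-initialized outside the parallel region.

```c
#include <assert.h>
#include <stdlib.h>

/* calloc() zero-initializes every slot before any parallel region runs,
 * so slots whose threads are never spawned stay NULL instead of holding
 * garbage; a NULL check before dereferencing then makes reads safe. */
static int **alloc_thread_slots(int max_threads)
{
    return calloc((size_t)max_threads, sizeof(int *));
}
```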
Details:
- In the FP32 GEMM interface, mtag_b was being set to PACK by default.
  This led to packing of the B matrix even when packing is not
  absolutely required, causing a perf regression.
- mtag_b is now set to PACK only if it is absolutely necessary to pack
  the B matrix; the check conditions before packing were modified
  accordingly.
AMD-Internal - [SWLCSG-3575]
Including a C file directly in another C file is not recommended, and some
build systems (e.g. Bazel and Buck) do not allow .c files to include other
.c files. This commit changes the tapi and oapi framework files that are
included from the _ex and _ba file variants from .c filenames to .h
filenames.
AMD-Internal: [CPUPL-6784]
Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
* Bug Fixes in FP32 Kernels:
- The current implementation lets m=1 tiny cases into the LPGEMV_TINY
  loop, but doesn't call the GEMV_M_ONE kernels. Added the m=1 path in
  the LPGEMV_TINY loop by handling the pack A / pack B / reorder B conditions.
- Added BF16 support for BIAS, Matrix-Add and Matrix-Mul for AVX512 F32
main and GEMV kernels.
- Added BF16 Downscale, BIAS, Matrix-Add and Matrix-Mul support in AVX2 GEMV_N
and AVX512_256 GEMV kernels.
- Added BF16 Matrix-Add and Matrix-Mul support for AVX512_256 F32 kernels.
- Modified the condition check in FP32 Zero point in AVX512 kernels, and
fixed few bugs in Col-major Zero point evaluation and instruction usage.
AMD Internal: [CPUPL-6748]
---------
Co-authored-by: VarshaV <varshav2@amd.com>
Details:
- Fixed the problem decomposition for n-fringe case of
6x64 AVX512 FP32 kernel by updating the pointers
correctly after each fringe kernel call.
- AMD-Internal: SWLCSG-3556
* Implemented 6xlt8 AVX2 kernel for n<8 inputs
* Implemented fringe kernels for 6x16 and 6xlt16 AVX512 kernels for FP32
* Implemented m-fringe kernels for 6xlt8 kernel for AVX2
* Added the deleted kernels and fixed bias bug
AMD-Internal: SWLCSG-3556
Various changes to simplify and improve x86 related make_defs files:
- Make better use of common definitions in config/zen/amd_config.mk
from config/zen*/make_defs.mk files
- Similarly for config/zen/amd_config.make from the
config/zen*/make_defs.cmake files
- Pass cc_major, cc_minor and cc_revision definitions from configure
to generated config.mk file, and use these instead of defining
GCC_VERSION in config/zen*/make_defs.mk files
- Add znver3 support for LLVM 13 in config/zen3/make_defs.{mk,cmake}
- Add znver5 support for LLVM 19 in config/zen5/make_defs.{mk,cmake}
- Improve readability of haswell, intel64, skx and x86_64 files
- Correct and tidy some comments
AMD-Internal: [CPUPL-6579]
* Fixed functionality failure of DGEMM pack kernel.
- Corrected the mask preparation needed for load/store
in edge kernel where m = 18.
- Corrected the usage of the right vector registers while
  storing data back to the buffer in edge kernels.
AMD-Internal: [CPUPL-6773]
* Update bli_packm_zen4_asm_d24xk.c
---------
Co-authored-by: Harsh Dave <harsdave@amd.com>
Details:
- GCC 15 drops support for Xeon Phi architectures such as KNL.
- This PR blacklists the `knl` configuration for GCC 15+.
Co-authored-by: Dave Love <dave.love@manchester.ac.uk>
* Bug Fixes in LPGEMM for AVX512 (SkyLake) machines
- Support added in FP32 512_256 kernels for: beta, BIAS, zero-point and
  BF16 store types for bf16bf16f32obf16 API execution in AVX2 mode.
- The B matrix in the bf16bf16f32obf16/f32 APIs is reordered. For
  machines that don't support BF16 instructions, the BF16 input is
  un-reordered and converted to FP32 so the FP32 kernels can be used.
- For n = 1 and k = 1 sized matrices, reordering in BF16 simply copies
  the matrix to the reordered buffer. But the un-reordering to FP32
  requires the matrix size to be a multiple of 16 along n and a
  multiple of 2 along k. The entry condition here has been modified
  for the AVX512 configuration.
- Fix for a seg fault with AOCL_ENABLE_INSTRUCTIONS=AVX2 mode on
  configurations supporting the BF16/VNNI ISA:
  - The BF16 tiny-path entry check has been modified to take arch_id
    into account, preventing improper entry into the tiny kernel.
  - The store in BF16->FP32 col-major for the m = 1 conditions was
    updated to use the correct storage pattern.
  - The BF16 beta load macro was modified to account for data in
    unaligned memory.
- Modified existing store instructions in FP32 AVX512 kernels to support
  execution on machines that have AVX512 support but not BF16/VNNI
  (SkyLake).
AMD Internal: [SWLCSG-3552]
---------
Co-authored-by: VarshaV <varshav2@amd.com>
- Fixed the group size validation logic to correctly check if the
group_size is a multiple of 4.
- Previously the condition was incorrectly performing bitwise AND with
decimal 11 instead of binary 11 (decimal 3).
AMD-Internal: [CPUPL-6754]
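The bug in miniature (a minimal sketch, not the actual source):

```c
#include <assert.h>
#include <stdbool.h>

/* `x & 11` tests against decimal 11 (binary 1011), not the low two
 * bits; a multiple-of-4 check needs `x & 3` (binary 11). */
static bool is_multiple_of_4_buggy(int group_size) { return (group_size & 11) == 0; }
static bool is_multiple_of_4_fixed(int group_size) { return (group_size & 3)  == 0; }
```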
Support for S32 Zero point type is added for aocl_gemm_s8s8s32os32_sym_quant
Support for BF16 scale factors type is added for aocl_gemm_s8s8s32os32_sym_quant
U8 buffer type support is added for matadd, matmul, bias post-ops in all int8 APIs.
AMD-Internal: SWLCSG-3503
Since the GNU extensions were removed, executables in the bench directory cannot be built correctly.
The fix is adding "-D_POSIX_C_SOURCE=200112L" on those targets. When -std=gnu99 was used,
bench worked without this flag, but that is no longer the case since we switched to -std=c99.
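A minimal sketch of the kind of change involved (the exact variable and targets in the real build files may differ):

```make
# bench targets rely on POSIX APIs that -std=c99 hides by default,
# so the feature-test macro must be passed explicitly.
CFLAGS += -std=c99 -D_POSIX_C_SOURCE=200112L
```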
* Fixed configuration issues in AOCL_GEMM addon
Description:
Fixed aocl_gemm addon initialization of kernels and block sizes
for machines which support only AVX512 but not
AVX512_VNNI/VNNI_BF16.
Aligned NC, KC blocking variables between ZEN and ZEN4
AMD-Internal: [SWLCSG-3527]
* Implemented GEMV kernel for m=1 case.
Description:
- Added a new GEMV kernel for AVX2 where m=1.
- Added a new GEMV kernel for AVX512 with ymm registers where m=1.
Previously, the DGEMM implementation used `dscalv` for cases
where the M dimension of matrix A is not a multiple of 24,
resulting in a ~40% performance drop.
This commit introduces specialized edge-case handling in the pack
kernel to optimize performance for these cases.
The new packing support significantly improves performance.
- Removed reliance on `dscalv` for edge cases, addressing the
performance bottleneck.
AMD-Internal: [CPUPL-6677]
Change-Id: I150d13eb536d84f8eb439d7f4a77a04a0d0e6d60
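The edge-case handling can be illustrated with a small mask helper (hypothetical; shown with 8-lane chunks for brevity, while the real kernels operate on wider vectors):

```c
#include <assert.h>
#include <stdint.h>

/* For an m-remainder handled in 8-lane chunks, build a bitmask with the
 * low min(remaining, 8) bits set, so partial rows are loaded/stored via
 * masked instructions instead of falling back to dscalv. */
static uint8_t edge_mask8(int remaining)
{
    int lanes = remaining < 8 ? remaining : 8;
    return (uint8_t)((1u << lanes) - 1u);
}
```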
Details:
- This implementation is picked from the cntx when GEMM is invoked on
machines that support AVX512 instructions by forcing the
AVX2 path using AOCL_ENABLE_INSTRUCTIONS=AVX2 during run-time.
- This implementation uses MR=16 for GEMV.
AMD-Internal: [SWLCSG-3519]
Change-Id: I8598ce6b05c3d5a96c764d96089171570fbb9e1a
- Added a new GEMV kernel with MR = 8 which will be used
for cases where n=1.
- Modified GEMM and GEMV framework to choose right GEMV kernel
based on compile-time and run-time architecture parameters. This
had to be done since GEMV kernels are not stored-in/retrieved-from
the cntx.
- Added a pack kernel that packs A matrix from col-major to row-major
using AVX2 instructions.
AMD-Internal: [SWLCSG-3519]
Change-Id: Ibf7a8121d0bde37660eac58a160c5b9c9ebd2b5c
The "-mno-avx512f" compiler flag has been added for zen/lpgemm
source files to address an issue observed with the znver4 compiler
flag when using GCC installed through Spack. The error message
"unsupported instruction `vpcmpeqd'" was encountered, indicating
unsupported AVX-512F instructions. As a workaround, the
"-mno-avx512f" flag was introduced, ensuring that AVX-512F
instructions are disabled during compilation.
AMD-Internal: [CPUPL-6694]
Change-Id: I546475226fbfea4931d568fc1b928cf6c8699b61
- Introduced a new assembly kernel that copies data from source
to destination from the front and back of the vector at the
same time. This kernel provides better performance for larger
input sizes.
- Added a wrapper function responsible for selecting the kernel
used by DCOPYV API to handle the given input for zen5
architecture.
- Updated AOCL-dynamic threshold for DCOPYV API in zen4 and
zen5 architectures.
- New unit tests were included in the gtestsuite for the new
  kernel.
AMD-Internal: [CPUPL-6650]
Change-Id: Ie2af88b8e97196b6aa02c089e59247742002f568
Details:
- These kernels are picked from cntx when GEMM is invoked
on machines that support AVX512 instructions by forcing the
AVX2 path using AOCL_ENABLE_INSTRUCTIONS=AVX2 during run-time.
- This path uses the same blocksizes and pack kernels as AVX512
path.
- GEMV is disabled currently as AVX2 kernels for GEMV are not
implemented.
AMD-Internal: [SWLCSG-3519]
Change-Id: I75401fac48478fe99edb8e71fa44d36dd7513ae5
Description
- To enable AVX512 VNNI support without native BF16 in BF16 kernels, the
BF16 C_type is converted to F32 for computation and then cast back to
BF16 before storing the result.
- Added support for handling BF16 zero-point values of BF16 type.
- Added a condition to disable the tiny path for the BF16 code path
where native BF16 is not supported.
AMD Internal : [CPUPL-6627]
Change-Id: I1e0cfefd24c5ffbcc95db73e7f5784a957c79ab9
- Updated the decision logic for taking the RD path for FP32.
- Since the 5-loop was designed specifically for RV kernels, added a
boolean flag to specify when RD path is to be taken, and set ps_b_use
to cs_b_use in case B matrix is unpacked.
AMD-Internal: [SWLCSG-3497]
Change-Id: I94ed28304a71b759796edcdd4edf65b9bad22bea
- Added a set of AVX512 fringe kernels (using masked loads and
  stores) in order to avoid rerouting to the GEMV typed API
  interface (when m = 1). This ensures uniformity in performance
  across the main and fringe cases, when the calls are multithreaded.
- Further tuned the thresholds to decide between ZGEMM Tiny, Small
  SUP and Native paths for ZEN4 and ZEN5 architectures (in case
  of parallel execution). This would account for additional
  combinations of the input dimensions.
- Moved the call to Tiny-ZGEMM before the BLIS object creation,
since this code-path operates on raw buffers.
- Added the necessary test-cases for functional and memory testing
of the newly added kernels.
AMD-Internal: [CPUPL-6378][CPUPL-6661]
Change-Id: I9af73d1b6ef82b26503d4fc373111132aee3afd6
- For single-threaded configurations of BLIS, packing of the A and B
  matrices is enabled by default. But packing of A is only supported
  for RV kernels, where elements from matrix A are broadcast. Since
  elements are loaded (not broadcast) in RD kernels, packing of A
  results in failures. Hence, packing of matrix A is disabled for RD
  kernels.
- Fixed the issue where the c_i index pointer was incorrectly being
  reset when exceeding the MC block, thus resulting in failures for
  certain post-ops.
- Fixed the FP32 reorder case where, for the n == 1 and rs_b == 1
  condition, sizeof(BLIS_FLOAT) was incorrectly used instead of
  sizeof(float).
AMD-Internal: [SWLCSG-3497]
Change-Id: I6d18afa996c253d79f666ea9789270bb59b629dd
Details:
- In the reorder functions, the validity of strides was checked
  assuming that the matrix to be reordered is always row-major.
  Modified the code to take stor_order into consideration when
  checking stride validity.
- This does not directly impact the functionality of GEMM as we don't
support GEMM on col-major matrices where A and/or B matrices are
reordered before GEMM computation. But this change makes sense when
reordering is viewed as an independent functionality irrespective of
what the reordered buffers will be used for.
Change-Id: If2cc4a353bca2f998ad557d6f128881bc9963330
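An illustrative version of the corrected check (names and the exact stride conditions are assumptions, not the actual API): which stride must be unit, and which must span a full dimension, depends on stor_order instead of assuming row-major.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { ORDER_ROW_MAJOR, ORDER_COL_MAJOR } order_t;

/* Validate (rs, cs) strides of an m x n matrix against its stor_order. */
static bool strides_valid(order_t order, int m, int n, int rs, int cs)
{
    if (order == ORDER_ROW_MAJOR)
        return cs == 1 && rs >= n;  /* rows are contiguous */
    else
        return rs == 1 && cs >= m;  /* columns are contiguous */
}
```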
Description
1. In the cases of clip, swish, and relu_scale, constants were
   previously loaded as float. However, they are of the C type, so the
   handling has been adjusted: for integer types these constants are
   first loaded as integers and then converted to float.
Change-Id: I176b805b69679df42be5745b6306f75e23de274d
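A sketch of the adjusted constant handling (function names are illustrative, not the actual kernel code):

```c
#include <assert.h>

/* For an integer C type the clip bounds are taken as integers and
 * converted to float, rather than being loaded as float bit patterns. */
static float clip_const_from_int(int c)
{
    return (float)c;  /* load as integer, then convert */
}

/* Apply a clip post-op with already-converted float bounds. */
static float clip_apply(float x, float lo, float hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}
```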
- Currently the int8/uint8 APIs do not support multiple ZP types;
  they work only with the int8 or uint8 type.
- Support has been added to enable multiple ZP types in these kernels,
  along with additional macros to support the operations.
- Modified the bench downscale reference code to support the updated
types.
AMD-Internal : [ SWLCSG-3304 ]
Change-Id: Ia5e40ee3705a38d09262086d20731e8f0a126987
- Support for Post-Ops has been added for all F32 RD AVX512 and AVX2
kernels.
AMD-Internal: [SWLCSG-3497]
Change-Id: Ia2967417303d8278c547957878d93c42c887109e
Previous commit on this (e0b86c69af)
was incorrect and incomplete. Add additional changes to enable
blis_impl layer for extension APIs for copying and transposing
matrices.
Change-Id: Ic707e3585acc1c0c554d7e00435464620a8c85dc
- Added FP32 RD (dot-product) kernels for both, AVX512 and AVX2 ISAs.
- The FP32 AVX512 primary RD kernel has blocking of dimensions 6x64
(MRxNR) whereas it is 6x16 (MRxNR) for the AVX2 primary RD kernel.
- Updated the f32 framework to accommodate RD kernels in the case of
  B-transpose, with thresholds
- Updated the data-gen Python script
TODO:
- Post-Ops not yet supported.
Change-Id: Ibf282741f58a1446321273d5b8044db993f23714
BLAS and BLIS extension APIs for copying and transposing matrices
currently only have one interface option. This patch adds a
blis_impl layer and makes the top level interface enabled only if
BLIS_ENABLE_BLAS is enabled, as with standard BLAS interfaces.
Change-Id: I1b6c668e8492305b16e8735b9ed83bea3c0d3b6c
- Updated the S8 main, GEMV, m_, n_ and mn_ fringe kernels to support
multiple scale types for vector and scalar scales
- Updated the U8 main, GEMV, m_, n_, extMR_ and mn_ fringe kernels to
support multiple scale types for vector and scalar scales
- Updated the bench to accommodate multiple scale type input, and
modified the downscale_accuracy_check_ to verify with multiple scale
type inputs.
AMD Internal: [ SWLCSG-3304 ]
Change-Id: I7b9f3ec8ea830d3265f72d18a0aa36086e14a86e
- Instead of native, TRSM small was wrongly being selected; this is now fixed.
AMD-Internal: [SWLCSG-3338]
Change-Id: I7a06a483fd874c71562a924b50118e0fc9e3b213
Copy changes from upstream BLIS to add gemmtr interfaces to
match new BLAS functionality in recent LAPACK releases.
This addresses https://github.com/amd/blis/issues/31, thanks
to Greg Jones for reporting this issue.
AMD-Internal: [CPUPL-6581]
Change-Id: I2b1a724d80902541b1d2b073fa3d1ea71442f445
- Updated the F32 tiny path to support column-major inputs.
- Tuned the tiny-path thresholds to redirect additional inputs to the
tiny path based on the m*n*k value.
AMD-Internal: [SWLCSG-3380]
Change-Id: If3476b17cc5eaf4f4e1cf820af0a32ede3e1742e
- Added column major path for BF16 tiny path
- Tuned the tiny-path thresholds to route a few more inputs to the
  tiny path.
AMD-Internal: [SWLCSG-3380]
Change-Id: I9a5578c9f0d689881fc5a67ab778e6a917c4fce1
Description:
1. For the column-major case when m=1 there was an accuracy mismatch
   with post-ops (bias, matrix_add, matrix_mul).
2. Added a check for the column-major case and replaced _mm512_loadu_ps
   with _mm512_maskz_loadu_ps.
AMD-Internal: [CPUPL-6585]
Change-Id: I8d98e2cb0b9dd445c9868f4c8af3abbc6c2dfc95
Rename generated aocl-blas.pc and aocl-blas-mt.pc to blis.pc and blis-mt.pc.
AMD-Internal: [SWLCSG-3446]
Change-Id: Ica784c7a0fd1e52b4d419795659947316e932ef6