Fix inefficiencies in the zen (AVX2) SUP RD kernel for SGEMM.
After the 8-wide loop in the k-direction, the kernel fell straight through to the 1-wide loop.
This caused many unnecessary iterations whenever the remainder of k was less than 8.
The remainder is now handled with masked operations when it is less than 8.
When the remainder of k == 1, the original non-masked code is used (selected with a branch), because the masking operation costs more than it saves in that case.
Some unnecessary instructions in the zen4 kernels have also been removed.
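A minimal sketch of the remainder handling described above, written as portable scalar C rather than AVX2 assembly. The names dot_k_masked and masked_load8 are hypothetical, and the masked load is only modeled with a loop; the real kernel operates on ymm registers.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 8 };  /* one AVX2 register holds 8 floats */

/* Scalar model of an 8-lane masked load: inactive lanes read nothing
   and are filled with zero. */
static void masked_load8(float dst[LANES], const float *src, size_t active)
{
    assert(active <= LANES);
    for (size_t lane = 0; lane < LANES; ++lane)
        dst[lane] = (lane < active) ? src[lane] : 0.0f;
}

/* Dot product over k elements: full 8-wide steps, then at most one
   extra step for the remainder instead of up to 7 scalar iterations. */
float dot_k_masked(const float *a, const float *b, size_t k)
{
    float acc[LANES] = { 0.0f };
    size_t i = 0;

    for (; i + LANES <= k; i += LANES)
        for (size_t lane = 0; lane < LANES; ++lane)
            acc[lane] += a[i + lane] * b[i + lane];

    size_t rem = k - i;
    if (rem == 1) {
        /* Remainder of exactly 1: plain scalar step, no mask setup. */
        acc[0] += a[i] * b[i];
    } else if (rem > 1) {
        /* Remainder of 2..7: a single masked step. */
        float va[LANES], vb[LANES];
        masked_load8(va, a + i, rem);
        masked_load8(vb, b + i, rem);
        for (size_t lane = 0; lane < LANES; ++lane)
            acc[lane] += va[lane] * vb[lane];
    }

    float sum = 0.0f;
    for (size_t lane = 0; lane < LANES; ++lane)
        sum += acc[lane];
    return sum;
}
```

With k = 11, this takes one full step plus one masked step, where the old structure would have taken one full step plus three scalar iterations.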
AMD-Internal: https://amd.atlassian.net/browse/CPUPL-7775
Co-authored-by: Rohan Rayan <rohrayan@amd.com>
GCC over-optimizes intrinsics code by reordering and interleaving
instructions, making it difficult to verify correctness and causing
potential accuracy issues in certain cases. This change replaces
intrinsics-based implementations with inline assembly to ensure
one-to-one mapping between source and generated assembly.
Changes:
- bli_saxpyv_zen4_int: Converted AVX-512 intrinsics to inline assembly
  * Processes blocks of 128, 64, 32, 16, and 8 elements
  * Handles fringe cases with masked operations
  * Preserves scalar path for non-unit strides
- bli_saxpyf_zen_int_5: Converted AVX2 intrinsics to inline assembly
  * Processes blocks of 16 and 8 elements with 5-way fusion
  * Handles fringe cases with masked operations
  * Preserves scalar path for non-unit strides
Benefits:
- Predictable code generation with no compiler reordering
- Better numerical accuracy by preventing unexpected transformations
- Easier verification of generated assembly against specifications
- Explicit control over instruction sequence and register allocation
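The control flow listed for bli_saxpyv_zen4_int can be sketched in portable C as follows. This only illustrates the block cascade and the stride check; it contains no inline assembly, saxpyv_sketch is a hypothetical name, and positive strides are an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* y := y + alpha * x. Strides are assumed positive in this sketch. */
void saxpyv_sketch(size_t n, float alpha,
                   const float *x, size_t incx,
                   float *y, size_t incy)
{
    assert(incx > 0 && incy > 0);

    if (incx != 1 || incy != 1) {
        /* Non-unit strides keep the scalar path. */
        for (size_t i = 0; i < n; ++i)
            y[i * incy] += alpha * x[i * incx];
        return;
    }

    static const size_t block_sizes[] = { 128, 64, 32, 16, 8 };
    size_t i = 0;

    /* Largest blocks first, falling through to smaller ones. */
    for (size_t b = 0; b < sizeof block_sizes / sizeof block_sizes[0]; ++b) {
        size_t bs = block_sizes[b];
        while (n - i >= bs) {
            for (size_t j = 0; j < bs; ++j)
                y[i + j] += alpha * x[i + j];
            i += bs;
        }
    }

    /* Fringe of fewer than 8 elements; the kernel covers this with
       masked operations, modeled here as a short scalar loop. */
    for (; i < n; ++i)
        y[i] += alpha * x[i];
}
```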
Various changes:
- Correct signature of get_random_matrix call in some level2 APIs.
- Move RNG seed to header to allow it to be used elsewhere in the code
- Remove unused variable in ref_gbmv.cpp
- Fix seed for all calls to rand()
- Correct arguments in calls to matrix setup and computediff calls,
especially for CBLAS row-major calls
- Add missing if statements in tests of input arguments
- Remove unused alpha argument from tbmv and tbsv
- Enable nan_inf check when testing input args like alpha and beta
- Also some corrections to testing input matrices and vectors
AMD-Internal: [CPUPL-7386]
Error handling code for an invalid diag argument in the column-major path was
incorrect in cblas_ctrmm compared to other invalid argument checks
and other data type variants.
AMD-Internal: [CPUPL-7303]
- The current build systems behave as follows when building the
  "aocl_gemm" addon codebase (LPGEMM) with "amdzen" as the target
  architecture (fat binary):
  - Make: Attempts to compile LPGEMM kernels using the same
    compiler flags that the makefile fragments set for BLIS
    kernels, based on the compiler version.
  - CMake: With presets, it always enables the addon compilation
    unless explicitly specified with the ENABLE_ADDON variable.
- This breaks the build with older compilers, which do not support
  compiling BF16 or INT8 intrinsics.
- This patch adds checks for the GCC and Clang compiler versions,
  and disables LPGEMM compilation if GCC < 11.2 or Clang < 12.0.
  - Make: Updated the configure script to check the compiler version
    if the addon is specified.
  - CMake: Updated the main CMakeLists.txt to check the compiler version
    if the addon is specified, and to force-update the associated
    cache variable. Also updated kernels/CMakeLists.txt to
    check whether "aocl_gemm" remains in the ENABLE_ADDONS list after
    all the checks in the previous layers.
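The version gate itself lives in the configure script and CMakeLists.txt; the rule it applies can be sketched in C as below. The type and function names are hypothetical, and leaving other compilers ungated is an assumption, since the message only mentions GCC and Clang.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { COMPILER_GCC, COMPILER_CLANG, COMPILER_OTHER } compiler_kind_t;

/* True if major.minor is at least req_major.req_minor. */
static bool version_at_least(int major, int minor, int req_major, int req_minor)
{
    return major > req_major || (major == req_major && minor >= req_minor);
}

/* LPGEMM is built only with GCC >= 11.2 or Clang >= 12.0. */
bool lpgemm_addon_allowed(compiler_kind_t kind, int major, int minor)
{
    assert(major >= 0 && minor >= 0);
    switch (kind) {
    case COMPILER_GCC:   return version_at_least(major, minor, 11, 2);
    case COMPILER_CLANG: return version_at_least(major, minor, 12, 0);
    default:             return true; /* assumption: other compilers are not gated */
    }
}
```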
AMD-Internal: [CPUPL-7850]
Signed-off-by: Vignesh Balasubramanian <Vignesh.Balasubramanian@amd.com>
Create gtestsuite programs for banded matrix APIs. Priority is to
create framework with a basic set of tests - refinement of problem
sizes can be investigated later.
AMD-Internal: [CPUPL-7386]
Create gtestsuite programs for subset of packed matrix APIs. Priority is to
create framework with a basic set of tests - refinement of problem sizes
can be investigated later.
AMD-Internal: [CPUPL-7386]
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Fix inefficiencies in the zen4 SUP RD kernel for SGEMM.
The 8- and 1-iteration loops of the K-loop were performing loads on ymm/xmm registers and computation on zmm registers.
This caused multiple unnecessary iterations in the kernel for matrices with certain k-values.
Fixed by introducing masked loads and computations for these cases.
AMD-Internal: https://amd.atlassian.net/browse/CPUPL-7762
Co-authored-by: Rohan Rayan <rohrayan@amd.com>
* Adding targets and aliases so that blis works with fetch content
* Using PUBLIC instead of INTERFACE
* Using BLIS instead of blis and adding BLAS in the targets
* Fixing installation paths to be the same as before
* Adding documentation for FetchContent()
* Adding a model to determine which matrices enter the SGEMM tiny path
* This extends the sizes of matrices that enter the tiny path, which was constrained to the L1 cache size previously
* Now matrices that fit in L2 are also allowed into the tiny path, provided they are determined to be faster than the SUP path
* Adding thresholds based on the SUP path sizes
* Added for Zen4 and Zen5
---------
AMD-Internal: CPUPL-7555
Co-authored-by: Rohan Rayan <rohrayan@amd.com>
* thread: free global communicator after parallel region completes in pthreads decorator
Avoid a potential data race by deferring the free until all threads have joined. Previously, the chief thread could free the communicator inside the parallel region while non-chief threads still held pointers to it. Now the free happens after the parallel region, following the barrier and joins.
Files:
- frame/thread/bli_l3_sup_decor_pthreads.c
- frame/thread/bli_l3_decor_pthreads.c
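A minimal pthreads sketch of the ordering described above: the shared object is released only after every worker has been joined. The types shared_comm_t and worker_arg_t and the function run_parallel_region are hypothetical stand-ins, not the decorator's actual code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the shared global communicator. */
typedef struct { int n_threads; long *slots; } shared_comm_t;
typedef struct { shared_comm_t *comm; int id; } worker_arg_t;

static void *worker(void *p)
{
    worker_arg_t *arg = p;
    /* Every thread dereferences the shared object, so it must stay
       alive until all threads are done. */
    arg->comm->slots[arg->id] = arg->id + 1;
    return NULL;
}

/* Returns the sum of the per-thread slots, or -1 on allocation failure. */
long run_parallel_region(int n_threads)
{
    assert(n_threads > 0);

    shared_comm_t *comm = malloc(sizeof *comm);
    pthread_t *tids = malloc((size_t)n_threads * sizeof *tids);
    worker_arg_t *args = malloc((size_t)n_threads * sizeof *args);
    long *slots = calloc((size_t)n_threads, sizeof *slots);
    long total = -1;

    if (comm && tids && args && slots) {
        comm->n_threads = n_threads;
        comm->slots = slots;

        int created = 0;
        for (; created < n_threads; ++created) {
            args[created].comm = comm;
            args[created].id = created;
            if (pthread_create(&tids[created], NULL, worker, &args[created]) != 0)
                break;
        }

        /* Join first: after this point no thread holds a pointer to comm. */
        for (int i = 0; i < created; ++i)
            pthread_join(tids[i], NULL);

        total = 0;
        for (int i = 0; i < n_threads; ++i)
            total += slots[i];
    }

    /* Free only after the parallel region has fully completed. */
    free(slots);
    free(comm);
    free(args);
    free(tids);
    return total;
}
```

Freeing inside worker() from a designated "chief" thread would reintroduce the race, because another thread could still be writing to slots at that moment.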
* AMD-Internal: [CPUPL-7694]
- Move asumv and nrm2 testinghelpers files from util to level1 (missed in
commit 0923d8ff56)
- Correct spelling mistakes and references to incorrect arguments in
comments in various files
- Correct comments listing invalid input tests in syr_IIT_ERS.cpp and
her_IIT_ERS.cpp
- Fix incorrect use of M in symv_IIT_ERS.cpp and hemv_IIT_ERS.cpp
AMD-Internal: [CPUPL-7386]
IIT_ERS tests were only implemented for {c,z}dotu and {c,z}geru. Add
the same tests for conjugate variants {c,z}dotc and {c,z}gerc.
AMD-Internal: [CPUPL-7386]
* Global communicator is now freed outside the parallel region
Description:
Root threads don't "own" the global communicator:
    thrinfo_t* root = bli_thrinfo_create_root(comm, id, pool, pba);
Setting free_comm=FALSE makes it clear: "This thread doesn't own this resource".
The thread that creates the communicator should be responsible for freeing it:
    // Framework creates global communicator
    thrcomm_t* gl_comm = bli_thrcomm_create(n_threads);
    // Framework should clean it up, not individual threads
    // (Even though only chief would actually do the cleanup)
Global communicators: Created by framework → free_comm=FALSE
Local communicators: Created by threads → free_comm=TRUE
Setting free_comm=FALSE provides an extra safety layer: if the chief thread logic ever changes, root threads won't accidentally try to free global communicators.
The current implementation has the framework handle global communicator cleanup:
    // We shouldn't free the global communicator since it was already freed
    // by the global communicator's chief thread in bli_l3_thrinfo_free()
Note: Technically, root threads could have free_comm=TRUE and still be safe due to the chief thread protection, but the current change uses FALSE for better semantic clarity and architectural consistency.
Made all changes to align with this design.
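The ownership rule can be illustrated with a small sketch. The types comm_sketch_t and thrinfo_sketch_t and the function thrinfo_sketch_release are hypothetical simplifications; they are not the BLIS thrinfo_t API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct { int n_threads; } comm_sketch_t;

typedef struct {
    comm_sketch_t *comm;
    bool free_comm; /* true only if this node owns comm */
} thrinfo_sketch_t;

/* Releases the communicator only when the node owns it.
   Returns true if a free actually happened. */
bool thrinfo_sketch_release(thrinfo_sketch_t *t)
{
    assert(t != NULL);
    bool freed = false;
    if (t->free_comm && t->comm != NULL) {
        free(t->comm);
        freed = true;
    }
    /* Non-owners just drop their reference. */
    t->comm = NULL;
    return freed;
}
```

A root node built with free_comm set to false leaves the communicator intact, so whoever created it (the framework, in the description above) remains the single place where it is freed.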
[CPUPL-7577]
* Removed old comments
* Applied similar changes to sequential code path
* For a single thread, we use the global BLIS_SINGLE_COMM variable instead of allocating memory from the sba pool
* Fixed comments
* Cleanup comments
Display CC, CXX compiler information and version details to help users
identify which compiler was used to build BLIS.
Changes:
- Makefile: Add CC, CXX, and CFLAGS to 'make showconfig' output
- bench_getlibraryInfo.c: Add runtime compiler detection with version
* Properly detect AOCC (AMD Optimizing C/C++ Compiler)
* Detect Clang, GCC, Intel ICC/oneAPI, and MSVC
* Show compiler version information
* Detection order ensures AOCC is identified correctly (not as GCC)
The showconfig output now includes:
CC (C compiler): clang
CXX (C++ compiler): clang++
CFLAGS: <flags>
The bench_getlibraryInfo program now shows:
== Build Information ==
C Compiler (CC): AOCC x.x.x (Clang x.x.x based)
C++ Compiler (CXX): N/A (compiled as C)
This provides both build-time and runtime compiler information for
better build diagnostics and issue reporting.
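A sketch of how such detection can be ordered using predefined compiler macros. The function names are hypothetical, and the AOCC test assumes that AOCC embeds the string "AOCC" in __clang_version__; bench_getlibraryInfo.c may use a different check.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Order matters: Clang-based compilers also define __GNUC__, so they
   must be tested before the plain GCC case. */
const char *compiler_family(void)
{
#if defined(__INTEL_LLVM_COMPILER)
    return "Intel oneAPI";
#elif defined(__INTEL_COMPILER)
    return "Intel ICC";
#elif defined(__clang__)
    /* Assumption: AOCC embeds "AOCC" in its __clang_version__ string. */
    return strstr(__clang_version__, "AOCC") ? "AOCC" : "Clang";
#elif defined(__GNUC__)
    return "GCC";
#elif defined(_MSC_VER)
    return "MSVC";
#else
    return "unknown";
#endif
}

/* Writes one "C Compiler (CC): ..." line into buf. */
void format_build_info(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
#if defined(__VERSION__)
    snprintf(buf, len, "C Compiler (CC): %s (%s)", compiler_family(), __VERSION__);
#else
    snprintf(buf, len, "C Compiler (CC): %s", compiler_family());
#endif
}
```

Because the macros are evaluated when the translation unit is compiled, the reported compiler is the one that built the program, which is what makes the information useful in issue reports.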
Issue: CPUPL-7678
- Updated the BF16-to-F32 conversion function (used when
  receiving column-stored inputs) to use the correct
  strides while storing.
- Conversion of B is potentially multithreaded using
the threads meant for IC compute. With the wrong
strides in the kernel, this gives rise to incorrect
writes onto the miscellaneous buffer.
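A sketch of a strided BF16-to-F32 conversion, to show why the store strides matter. The function names and the separate source/destination stride parameters are assumptions for illustration; the actual LPGEMM kernel and buffer layout are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A BF16 value is the upper 16 bits of an IEEE-754 binary32 value. */
static float bf16_to_f32(uint16_t v)
{
    uint32_t bits = (uint32_t)v << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Converts an m x n matrix. Source and destination each have their own
   row/column strides, so a column-stored input (rs_src == 1) can be
   written into a row-stored buffer (cs_dst == 1). */
void convert_bf16_to_f32(size_t m, size_t n,
                         const uint16_t *src, size_t rs_src, size_t cs_src,
                         float *dst, size_t rs_dst, size_t cs_dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            dst[i * rs_dst + j * cs_dst] =
                bf16_to_f32(src[i * rs_src + j * cs_src]);
}
```

If the destination were indexed with the source strides instead, elements would land at the wrong offsets, and with several threads converting disjoint parts of B those stray writes could overlap.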
AMD-Internal: [CPUPL-7675]
Co-authored-by: Vishal-A <Vishal.Akula@amd.com>
Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
This PR updates the ai_coverity.json configuration file by restricting the 'projects' list to only include 'AOCL_RELEASE' and clearing out the 'email_recipients' field. The changes align with ticket CPUPL-7574, ensuring that only the necessary project is monitored, and no emails are automatically sent.
* Adding data pool for random number generator
* Adopting pool in axpy, gemv, and ukr.gemm
* Using templates in data pool getter function to generate different pools for different ranges
* Addressing review comments
* Using data pool in more APIs
* Using data pool in more APIs
* Updating testing infra for trsv and trsm
* Addressing review comments
* Fixing numerical issues in trsm testing
* More review comments
* Small change to get trsm tests to pass
* Scaling tiny data pool
* Addressing review comments
- Disabled rerouting to GEMV in BF16 APIs for inputs
  with m == 1. Such inputs now use the GEMM path.
AMD-Internal: [CPUPL-7536]
Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
In earlier commits (e.g. a2beef3255) long running tests were broken
into separate executables for each data type. Standardize this approach
across all APIs, including separate executables for IIT_ERS tests, and
for each data type have separate executables for EVT tests.
AMD-Internal: [CPUPL-7386]
Co-authored-by: Vlachopoulou, Eleni <Eleni.Vlachopoulou@amd.com>
* GTestSuite: computediff improvements
- Using lazy evaluation to only create error strings when the comparison failed
- If the increments are zero, only check the first element
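Both points can be sketched in C as follows. The GTestSuite helper itself is C++; vectors_match and its parameters are hypothetical and only illustrate the two ideas.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Compares x against ref with absolute tolerance tol. */
bool vectors_match(size_t n, const double *x, const double *ref, size_t inc,
                   double tol, char *err, size_t err_len)
{
    assert(x != NULL && ref != NULL);

    /* With a zero increment every index aliases element 0. */
    size_t n_checked = (inc == 0 && n > 0) ? 1 : n;

    for (size_t i = 0; i < n_checked; ++i) {
        double d = x[i * inc] - ref[i * inc];
        if (d < 0.0)
            d = -d;
        if (!(d <= tol)) { /* also true when d is NaN */
            /* The message is formatted only on this failure path. */
            if (err != NULL && err_len > 0)
                snprintf(err, err_len,
                         "mismatch at index %zu: got %g, expected %g",
                         i, x[i * inc], ref[i * inc]);
            return false;
        }
    }
    return true;
}
```

In the passing case no string formatting happens at all, which is where the saving comes from when most comparisons succeed.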
Many of the bench applications did not take n_repeats as an argument on the command line (while others did).
Also added the ability to skip the rest of the line (at the end) while reading bench inputs for all APIs.
---------
Co-authored-by: Rayan <rohrayan@amd.com>
- If BLIS is compiled as a multithreaded library, BLIS_NUM_THREADS is set to 1, and the sizes are large enough for the multithreaded path to be optimal, we take the multithreaded path even though only one thread can be spawned. This adds OpenMP overhead.
- A check has been added inside the multithreaded kernels to use the single-threaded code path if only 1 thread can be spawned.
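The added check amounts to a guard like the one below. This is a sketch only: the function and parameter names are hypothetical, and in the library the check sits inside the multithreaded kernels rather than in a separate dispatcher.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical dispatcher: size_threshold and n_threads_available are
   illustrative parameters, not names from the library. */
bool use_multithreaded_path(size_t problem_size, size_t size_threshold,
                            int n_threads_available)
{
    assert(n_threads_available >= 1);
    if (n_threads_available == 1)
        return false; /* one thread: skip the OpenMP region entirely */
    return problem_size >= size_threshold;
}
```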
AMD-Internal : [SWLCSG-3408]
(cherry picked from commit b60542c45c)
- When alpha == 0, we are expected to only scale y vector with beta and not read A or X at all.
- This scenario is not handled properly in all code paths, which causes NaN and Inf values from A and X to be wrongly propagated. For example, for non-zen architectures (the default block in the switch case) no such check is present; some of the avx512 kernels are also missing these checks.
- When beta == 0, we are not expected to read Y at all; this is also not handled correctly in one of the avx512 kernels.
- To fix these, an early-return condition for alpha == 0 has been added to the bla layer itself so that each kernel does not have to implement the logic.
- DGEMV AVX512 transpose kernel has been fixed to load vector Y only when beta != 0.
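The expected semantics can be sketched with a reference loop. This is a simplified non-transposed, column-major, unit-stride version under the hypothetical name dgemv_sketch; it is not the bla layer or the AVX512 kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* y := alpha * A * x + beta * y, A is m x n, column-major, unit strides. */
void dgemv_sketch(size_t m, size_t n, double alpha,
                  const double *a, size_t lda,
                  const double *x, double beta, double *y)
{
    assert(lda >= m);

    if (alpha == 0.0) {
        /* A and x are never read, so NaN/Inf in them cannot reach y. */
        for (size_t i = 0; i < m; ++i)
            y[i] = (beta == 0.0) ? 0.0 : beta * y[i];
        return;
    }

    for (size_t i = 0; i < m; ++i) {
        double acc = 0.0;
        for (size_t j = 0; j < n; ++j)
            acc += a[i + j * lda] * x[j];
        /* y is read only when beta != 0. */
        y[i] = (beta == 0.0) ? alpha * acc : alpha * acc + beta * y[i];
    }
}
```

Multiplying by zero is not enough on its own, because 0 * NaN is NaN in IEEE-754 arithmetic; the input has to be skipped rather than scaled.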
AMD-Internal: [CPUPL-7585]