4333 Commits

Author SHA1 Message Date
Kiran Varaganti
ad6a372abf Updated LICENSE and NOTICES files for 5.2.2 release 5.2.2 2026-03-20 06:23:20 +00:00
Kiran Varaganti
a7da8ee174 AOCL 5.2.2 GA Release 2026-03-12 16:08:56 +00:00
Rayan, Rohan
ebf8721a5c Optimizing sgemm rd kernels on zen3 (#293)
Fixing some inefficiencies on the zen (AVX2) SUP RD kernel for SGEMM.
After the loop that processes k in steps of 8, the kernel fell through to the loop that processes k one element at a time.
This caused many unnecessary iterations when the remainder of k was less than 8.
This has been fixed by introducing masked operations for k < 8.
When the remainder of k == 1, we keep the original non-masked code (behind a branch), because the masking operation costs more than it saves in that case.
There were also some unnecessary instructions in the zen4 kernels which have been removed.
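
The remainder handling described above can be sketched in portable C. This is only an illustration under stated assumptions, not the kernel code: the 8-lane array stands in for one AVX2 register, and the names `dot_k_masked` and `fma_step_masked` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define VEC 8 /* lanes per step, standing in for one 8-float AVX2 register */

/* One 8-wide multiply-accumulate step. Lanes at or beyond `active` are
   skipped, emulating a masked load that contributes nothing for them. */
static void fma_step_masked(const float *a, const float *b,
                            float acc[VEC], size_t active)
{
    assert(active <= VEC);
    for (size_t lane = 0; lane < VEC; ++lane) {
        if (lane < active)
            acc[lane] += a[lane] * b[lane];
    }
}

/* Dot product over k elements: full steps of 8, then at most one
   remainder step instead of up to seven single-element iterations. */
float dot_k_masked(const float *a, const float *b, size_t k)
{
    float acc[VEC] = {0};
    size_t i = 0;

    for (; i + VEC <= k; i += VEC)
        fma_step_masked(a + i, b + i, acc, VEC);

    size_t rem = k - i;
    float tail = 0.0f;
    if (rem == 1) {
        /* Single leftover element: plain scalar path, no mask setup. */
        tail = a[i] * b[i];
    } else if (rem > 1) {
        /* 2..7 leftover elements: one masked step. */
        fma_step_masked(a + i, b + i, acc, rem);
    }

    /* Horizontal reduction of the accumulator lanes. */
    float sum = tail;
    for (size_t lane = 0; lane < VEC; ++lane)
        sum += acc[lane];
    return sum;
}
```

With k = 11, for example, the sketch runs one full step and one masked step of three lanes, where a one-element loop would need three extra iterations.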

AMD-Internal: https://amd.atlassian.net/browse/CPUPL-7775
Co-authored-by: rohrayan@amd.com
2026-02-04 09:08:11 +05:30
Chandrashekara K R
50ae5a05ef Updated version string from 5.1.1 to 5.2.2 2026-02-02 18:13:01 +05:30
S, Hari Govind
ec6f4e96cd Replace intrinsics with inline assembly for bli_saxpyv_zen4_int and bli_saxpyf_zen_int_5
GCC over-optimizes intrinsics code by reordering and interleaving
instructions, making it difficult to verify correctness and causing
potential accuracy issues in certain cases. This change replaces
intrinsics-based implementations with inline assembly to ensure
one-to-one mapping between source and generated assembly.

Changes:
- bli_saxpyv_zen4_int: Converted AVX-512 intrinsics to inline assembly
  * Processes blocks of 128, 64, 32, 16, and 8 elements
  * Handles fringe cases with masked operations
  * Preserves scalar path for non-unit strides

- bli_saxpyf_zen_int_5: Converted AVX2 intrinsics to inline assembly
  * Processes blocks of 16 and 8 elements with 5-way fusion
  * Handles fringe cases with masked operations
  * Preserves scalar path for non-unit strides

Benefits:
- Predictable code generation with no compiler reordering
- Better numerical accuracy by preventing unexpected transformations
- Easier verification of generated assembly against specifications
- Explicit control over instruction sequence and register allocation
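
For readers unfamiliar with the block structure listed above, here is a scalar C sketch of the control flow only. It is not the inline-assembly kernel: the name `saxpyv_sketch` is hypothetical, the inner loop stands in for the vector instructions, and strides are assumed to be positive.

```c
#include <assert.h>
#include <stddef.h>

/* y[0..len) += alpha * x[0..len) for one contiguous block. */
static void axpy_block(float alpha, const float *x, float *y, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        y[i] += alpha * x[i];
}

/* y := y + alpha * x. Assumption: incx and incy are positive. */
void saxpyv_sketch(size_t n, float alpha,
                   const float *x, size_t incx,
                   float *y, size_t incy)
{
    assert(n == 0 || (x != NULL && y != NULL));

    /* Non-unit strides take the scalar path. */
    if (incx != 1 || incy != 1) {
        for (size_t i = 0; i < n; ++i)
            y[i * incy] += alpha * x[i * incx];
        return;
    }

    /* Unit strides: work through progressively smaller blocks. */
    static const size_t blocks[] = {128, 64, 32, 16, 8};
    size_t i = 0;
    for (size_t b = 0; b < sizeof blocks / sizeof blocks[0]; ++b) {
        while (n - i >= blocks[b]) {
            axpy_block(alpha, x + i, y + i, blocks[b]);
            i += blocks[b];
        }
    }

    /* Fringe: fewer than 8 elements remain. A vector kernel would use
       a masked load/store here. */
    axpy_block(alpha, x + i, y + i, n - i);
}
```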
2026-01-29 11:48:47 +05:30
Smyth, Edward
3b8aca0874 GTestSuite: Misc fixes (3)
Various changes:
- Correct signature of get_random_matrix call in some level2 APIs.
- Move RNG seed to header to allow it to be used elsewhere in the code
- Remove unused variable in ref_gbmv.cpp
- Fix seed for all calls to rand()
- Correct arguments in calls to matrix setup and computediff calls,
  especially for CBLAS row-major calls
- Add missing if statements in tests of input arguments
- Removed unused alpha argument from tbmv and tbsv
- Enable nan_inf check when testing input args like alpha and beta
- Also some corrections to testing input matrices and vectors

AMD-Internal: [CPUPL-7386]
2026-01-23 17:23:52 +00:00
Smyth, Edward
dd66dfff50 cblas_ctrmm invalid diag fix
Error handling code for an invalid diag argument in the column-major path
was incorrect in cblas_ctrmm compared to other invalid argument checks
and other data type variants.

AMD-Internal: [CPUPL-7303]
2026-01-23 16:51:35 +00:00
Dave, Harsh
b510d06cc8 Tuned input threshold for tiny dgemm interface (#309)
* Tuned input threshold for tiny dgemm interface

- Added upper limit check for M dimension to avoid cache thrashing.
- Added required buffer size check needed while packing A matrix.

AMD-Internal : [CPUPL-7915]

---------

Co-authored-by: harsdave <harsdave@amd.com>
2026-01-22 20:11:06 +05:30
Balasubramanian, Vignesh
73911d5990 Updates to the build systems(CMake and Make) for LPGEMM compilation (#303)
- The current build systems have the following behaviour
  with regards to building "aocl_gemm" addon codebase(LPGEMM)
  when giving "amdzen" as the target architecture(fat-binary)
  - Make:  Attempts to compile LPGEMM kernels using the same
                compiler flags that the makefile fragments set for BLIS
                kernels, based on the compiler version.
  - CMake: With presets, it always enables the addon compilation
                 unless explicitly specified with the ENABLE_ADDON variable.

- This poses a bug with older compilers, owing to them not supporting
  BF16 or INT8 intrinsic compilation.

- This patch adds the functionality to check for GCC and Clang compiler versions,
  and disables LPGEMM compilation if GCC < 11.2 or Clang < 12.0.

- Make:  Updated the configure script to check for the compiler version
              if the addon is specified.
  CMake: Updated the main CMakeLists.txt to check for the compiler version
               if the addon is specified, and to force-update the associated
               cache variable. Also updated kernels/CMakeLists.txt to
               check if "aocl_gemm" remains in the ENABLE_ADDONS list after
               all the checks in the previous layers.
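
The version rule stated above (disable LPGEMM if GCC < 11.2 or Clang < 12.0) can be written as a small predicate. The actual checks live in the configure script and CMakeLists.txt. This C function only illustrates the comparison, and the names `compiler_kind` and `lpgemm_addon_allowed` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { COMPILER_GCC, COMPILER_CLANG, COMPILER_OTHER } compiler_kind;

/* Returns true if the aocl_gemm addon may be built with this compiler. */
bool lpgemm_addon_allowed(compiler_kind kind, int major, int minor)
{
    assert(major >= 0 && minor >= 0);
    switch (kind) {
    case COMPILER_GCC:
        /* Needs GCC 11.2 or newer. */
        return major > 11 || (major == 11 && minor >= 2);
    case COMPILER_CLANG:
        /* Needs Clang 12.0 or newer. */
        return major >= 12;
    default:
        /* Assumption: compilers other than GCC and Clang are not gated. */
        return true;
    }
}
```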

AMD-Internal: [CPUPL-7850]

Signed-off by : Vignesh Balasubramanian <Vignesh.Balasubramanian@amd.com>
AOCL-Jan2026-b2
2026-01-16 19:39:55 +05:30
Smyth, Edward
9f9bfbed7f GTestSuite: Banded APIs (gbmv, hbmv, sbmv, tbmv, tbsv)
Create gtestsuite programs for banded matrix APIs. Priority is to
create framework with a basic set of tests - refinement of problem
sizes can be investigated later.

AMD-Internal: [CPUPL-7386]
2026-01-16 12:37:47 +00:00
Smyth, Edward
c32247678c GTestSuite: Packed APIs (hpmv, spmv, tpmv, tpsv)
Create gtestsuite programs for subset of packed matrix APIs. Priority is to
create framework with a basic set of tests - refinement of problem sizes
can be investigated later.

AMD-Internal: [CPUPL-7386]
2026-01-16 12:08:36 +00:00
Smyth, Edward
72e0c001f2 GTestSuite: Packed APIs (hpr, hpr2, spr, spr2)
Create gtestsuite programs for subset of packed matrix APIs. Priority is to
create framework with a basic set of tests - refinement of problem sizes
can be investigated later.

AMD-Internal: [CPUPL-7386]
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-16 10:25:27 +00:00
Smyth, Edward
bd99d6cd92 GTestSuite: Misc fixes (2)
Various changes:
- Fixed undeclared variables in her, her2, syr and syr2 IIT_ERS tests
- Correct typos in comments

AMD-Internal: [CPUPL-7386]
2026-01-16 00:33:16 +00:00
Sharma, Shubham
824e289899 Tuned decision logic for DGEMV multithreading for skinny sizes. (#301)
AMD-Internal: [CPUPL-7769]
2026-01-14 12:08:46 +05:30
Rayan, Rohan
9cbb1c45d8 Improving sgemm rd kernel on zen4/zen5 (#292)
Fixing some inefficiencies on the zen4 SUP RD kernel for SGEMM
The loops for the 8 and 1 iteration of the K-loop were performing loads on ymm/xmm registers and computation on zmm registers
This caused multiple unnecessary iterations in the kernel for matrices with certain k-values.
Fixed by introducing masked loads and computations for these cases

AMD-Internal: https://amd.atlassian.net/browse/CPUPL-7762
Co-authored-by: Rohan Rayan <rohrayan@amd.com>
2025-12-17 18:48:50 +05:30
Vlachopoulou, Eleni
504ac9d8a2 CMake: Adding targets and aliases so that blis works with fetch content (#179)
* Adding targets and aliases so that blis works with fetch content

* Using PUBLIC instead of INTERFACE

* Using BLIS instead of blis and adding BLAS in the targets

* Fixing installation paths to be the same as before

* Adding documentation for FetchContent()
AOCL-Weekly-121225
2025-12-10 13:02:09 +00:00
Vlachopoulou, Eleni
1d80d5fee4 Fixing doc about building bench (#290) 2025-12-10 12:07:50 +00:00
Rayan, Rohan
a22e0022c2 SGEMM tiny path tuning for zen4 and zen5 (#267)
* Adding a model to determine which matrices enter the SGEMM tiny path
* This extends the sizes of matrices that enter the tiny path, which was constrained to the L1 cache size previously
* Now matrices that fit in L2 are also allowed into the tiny path, provided they are determined to be faster than the SUP path
* Adding thresholds based on the SUP path sizes
* Added for Zen4 and Zen5

---------
AMD-Internal: CPUPL-7555
Co-authored-by: Rohan Rayan <rohrayan@amd.com>
2025-12-10 15:58:54 +05:30
KR, Chandrashekara
b06b55e864 Update LICENSE and NOTICES files for AOCL-5.2 release (#285) 2025-12-10 11:25:02 +05:30
Varaganti, Kiran
bbb7edcb22 thread: free global communicator after parallel region completes in p…
* thread: free global communicator after parallel region completes in pthreads decorator

Avoid potential data race by deferring  free until all threads have joined. Previously, chief thread could free  inside  while non-chief threads still held pointers. Now,  frees  after the parallel region, following barrier and joins.

Files:
- frame/thread/bli_l3_sup_decor_pthreads.c
- frame/thread/bli_l3_decor_pthreads.c

* AMD-Internal: [CPUPL-7694]
2025-12-09 19:15:52 +05:30
Kiran Varaganti
9734fc18cc AOCL-5.2 GA Release 5.2 2025-12-07 05:22:11 +00:00
Smyth, Edward
eff1b561c5 GTestSuite: Misc fixes
- Move asumv and nrm2 testinghelpers files from util to level1 (missed in
  commit 0923d8ff56)
- Correct spelling mistakes and references to incorrect arguments in
  comments in various files
- Correct comments listing invalid input tests in syr_IIT_ERS.cpp and
  her_IIT_ERS.cpp
- Fix incorrect use of M in symv_IIT_ERS.cpp and hemv_IIT_ERS.cpp

AMD-Internal: [CPUPL-7386]
AOCL-Weekly-051225
2025-12-05 17:23:47 +00:00
Smyth, Edward
b04818bf48 GTestSuite: Conjugate dot and ger IIT_ERS tests
IIT_ERS tests were only implemented for {c,z}dotu and {c,z}geru. Add
the same tests for conjugate variants {c,z}dotc and {c,z}gerc.

AMD-Internal: [CPUPL-7386]
2025-12-05 16:01:49 +00:00
Varaganti, Kiran
8a84b2fb2c Global Communicator is now freed outside the parallel region
* Global Communicator is now freed outside the parallel region
Description:

// Root threads don't "own" the global communicator
thrinfo_t* root = bli_thrinfo_create_root(comm, id, pool, pba);
// Setting free_comm=FALSE makes it clear: "This thread doesn't own this resource"
The thread that creates the communicator should be responsible for freeing it:

// Framework creates global communicator
thrcomm_t* gl_comm = bli_thrcomm_create(n_threads);

// Framework should clean it up, not individual threads
// (Even though only chief would actually do the cleanup)
Global communicators: Created by framework → free_comm=FALSE

Local communicators: Created by threads → free_comm=TRUE

Setting free_comm=FALSE provides an extra safety layer - if the chief thread logic ever changes, root threads won't accidentally try to free global communicators.

The current implementation has the framework handle global communicator cleanup:

// We shouldn't free the global communicator since it was already freed
// by the global communicator's chief thread in bli_l3_thrinfo_free()
** Technically root threads could have free_comm=TRUE and still be safe due to the chief thread protection, but the current change uses FALSE for better semantic clarity and architectural consistency.**

Made all changes to align with this design.
[CPUPL-7577]
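
The ownership rule above can be sketched as follows. This is a simplified, sequential illustration and not the BLIS code: `comm_t` and `info_t` are hypothetical stand-ins for `thrcomm_t` and `thrinfo_t`, and the loop stands in for the parallel region.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct { int n_threads; } comm_t; /* stand-in for thrcomm_t */

typedef struct {
    comm_t *comm;
    bool    free_comm; /* true only if this node owns comm */
} info_t;              /* stand-in for thrinfo_t */

/* Releases a node, freeing its communicator only if the node owns it. */
static void info_release(info_t *info, int *n_frees)
{
    if (info->free_comm && info->comm != NULL) {
        free(info->comm);
        ++*n_frees;
    }
    info->comm = NULL;
}

/* Returns how many times the global communicator was freed,
   or -1 if allocation failed. */
int run_region_sketch(int n_threads)
{
    assert(n_threads > 0);
    int n_frees = 0;

    /* The framework creates the global communicator... */
    comm_t *gl_comm = malloc(sizeof *gl_comm);
    if (gl_comm == NULL)
        return -1;
    gl_comm->n_threads = n_threads;

    /* ...each root node borrows it (free_comm = false)... */
    for (int id = 0; id < n_threads; ++id) {
        info_t root = { gl_comm, false };
        /* ... this thread's share of the work would run here ... */
        info_release(&root, &n_frees);
    }

    /* ...and the framework frees it once, after the region completes. */
    free(gl_comm);
    ++n_frees;
    return n_frees;
}
```

Whatever the thread count, the communicator is freed exactly once, by the code that allocated it.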

* Removed old comments

* Applied similar changes to sequential code path

* For single thread we use global BLIS_SINGLE_COMM variable instead of allocating memory from sba pool

* Fixed comments

* Cleanup comments
2025-12-05 15:52:08 +05:30
Chandrashekara K R
ad472369c1 Update LICENSE and NOTICES files for AOCL-5.2 release 2025-12-05 15:09:14 +05:30
Varaganti, Kiran
17702aedef Add compiler information to "make showconfig" and "bench_getlibraryInfo"
Display CC, CXX compiler information and version details to help users
identify which compiler was used to build BLIS.

Changes:
- Makefile: Add CC, CXX, and CFLAGS to 'make showconfig' output
- bench_getlibraryInfo.c: Add runtime compiler detection with version
  * Properly detect AOCC (AMD Optimizing C/C++ Compiler)
  * Detect Clang, GCC, Intel ICC/oneAPI, and MSVC
  * Show compiler version information
  * Detection order ensures AOCC is identified correctly (not as GCC)

The showconfig output now includes:
  CC (C compiler):             clang
  CXX (C++ compiler):          clang++
  CFLAGS:                      <flags>

The bench_getlibraryInfo program now shows:
  == Build Information ==
  C Compiler (CC):             AOCC x.x.x (Clang x.x.x based)
  C++ Compiler (CXX):          N/A (compiled as C)

This provides both build-time and runtime compiler information for
better build diagnostics and issue reporting.
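
Why detection order matters can be shown with the predefined macros. This is a sketch, not the code in bench_getlibraryInfo.c. The function name `describe_compiler` is hypothetical, and the AOCC-specific check is left out because its macro is not given in this log, so an AOCC build would report as Clang here.

```c
#include <assert.h>
#include <stdio.h>

/* Writes the name and version of the compiler that built this file.
   Order matters: Intel and Clang-based compilers also define __GNUC__,
   so the more specific macros must be tested first. */
int describe_compiler(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
#if defined(__INTEL_LLVM_COMPILER) || defined(__INTEL_COMPILER)
    return snprintf(buf, len, "Intel ICC/oneAPI");
#elif defined(__clang__)
    return snprintf(buf, len, "Clang %d.%d.%d",
                    __clang_major__, __clang_minor__, __clang_patchlevel__);
#elif defined(__GNUC__)
    return snprintf(buf, len, "GCC %d.%d.%d",
                    __GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__);
#elif defined(_MSC_VER)
    return snprintf(buf, len, "MSVC %d", _MSC_VER);
#else
    return snprintf(buf, len, "unknown");
#endif
}
```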

Issue: CPUPL-7678
2025-12-01 21:14:25 +05:30
Balasubramanian, Vignesh
54ac36c8bc Bugfix: BF16 to F32 conversion in AVX2 F32 codepath
- Updated the conversion function(in case of receiving
  column stored inputs) from BF16 to F32, in order to
  use the correct strides while storing.

- Conversion of B is potentially multithreaded using
  the threads meant for IC compute. With the wrong
  strides in the kernel, this gives rise to incorrect
  writes onto the miscellaneous buffer.
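
A stride-aware BF16 to F32 conversion might look like the sketch below. It is an illustration, not the AVX2 kernel: the function names are hypothetical, and BF16 values are held as raw `uint16_t` bit patterns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A BF16 value is the upper 16 bits of the corresponding F32 value. */
static float bf16_to_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Converts an m x n BF16 matrix to F32. Source and destination each
   have their own row stride (rs) and column stride (cs); writing with
   the source's strides instead of the destination's is the kind of
   mistake that lands values in the wrong place in the buffer. */
void convert_bf16_to_f32(size_t m, size_t n,
                         const uint16_t *src, size_t rs_src, size_t cs_src,
                         float *dst, size_t rs_dst, size_t cs_dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i)
            dst[i * rs_dst + j * cs_dst] =
                bf16_to_f32(src[i * rs_src + j * cs_src]);
}
```

For a column-stored 2 x 3 input, the source strides would be rs = 1 and cs = 2. A row-stored destination would use rs = 3 and cs = 1.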

AMD-Internal: [CPUPL-7675]

Co-authored-by: Vishal-A <Vishal.Akula@amd.com>
Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
2025-12-01 15:06:13 +05:30
Kumar, Harish
1b15f940aa Update ai_coverity.json file with project as AOCL_RELEASE and email recipients with blank (#277)
This PR updates the ai_coverity.json configuration file by restricting the 'projects' list to only include 'AOCL_RELEASE' and clearing out the 'email_recipients' field. The changes align with ticket CPUPL-7574, ensuring that only the necessary project is monitored, and no emails are automatically sent.
2025-11-28 21:19:15 +05:30
Vlachopoulou, Eleni
edf64e2b89 GTestSuite: Adding data pool (#272)
* Adding data pool for random number generator

* Adopting pool in axpy, gemv, and ukr.gemm

* Using templates in data pool getter function to generate different pools for different ranges

* Addressing review comments

* Using data pool in more APIs

* Using data pool in more APIs

* Updating testing infra for trsv and trsm

* Addressing review comments

* Fixing numerical issues in trsm testing

* More review comments

* Small change to get trsm tests to pass

* Scaling tiny data pool

* Addressing review comments
2025-11-27 14:01:29 +00:00
Balasubramanian, Vignesh
f992942f6b Disabling GEMV(M1) rerouting in BF16 APIs(AVX512)
- Disabled rerouting to GEMV in BF16 APIs, in case of inputs
  having m == 1. We would now use the GEMM path for
  handling such inputs.

AMD-Internal: [CPUPL-7536]
Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
2025-11-27 14:43:31 +05:30
Vlachopoulou, Eleni
5c42229e05 GTestSuite: Moving data generator definitions in a cpp file (#270)
* GTestSuite: Moving data generator definitions in a cpp file

- Using the same number in the rng engine for all tests
- Adjusting thresholds
2025-11-18 10:52:49 +00:00
Smyth, Edward
0923d8ff56 GTestSuite: break up tests
In earlier commits (e.g. a2beef3255) long running tests were broken
into separate executables for each data type. Standardize this approach
across all APIs, including separate executables for IIT_ERS tests, and
for each data type have separate executables for EVT tests.

AMD-Internal: [CPUPL-7386]
Co-authored-by: Vlachopoulou, Eleni <Eleni.Vlachopoulou@amd.com>
2025-11-17 09:12:03 +00:00
Vlachopoulou, Eleni
a8daea04ea GTestSuite: computediff improvements (#264)
* GTestSuite: computediff improvements

- Using lazy evaluation to only create error strings when the comparison failed
- Check if increments are zero, then only check first element
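
Both ideas can be shown in a few lines. GTestSuite itself is C++, so this C sketch only illustrates the logic. The name `compare_vectors` is hypothetical, and it compares for exact equality where a real test would probably apply a threshold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Compares n strided elements of x against ref. The error message is
   formatted only after a mismatch is found, so passing comparisons do
   no string work. With inc == 0 every index aliases element 0, so
   only the first element is checked. */
bool compare_vectors(size_t n, const double *x, const double *ref,
                     size_t inc, char *msg, size_t msg_len)
{
    assert(msg != NULL && msg_len > 0);
    msg[0] = '\0';

    size_t count = (inc == 0 && n > 0) ? 1 : n;
    for (size_t i = 0; i < count; ++i) {
        size_t idx = i * inc;
        if (x[idx] != ref[idx]) {
            snprintf(msg, msg_len,
                     "mismatch at index %zu: got %g, expected %g",
                     idx, x[idx], ref[idx]);
            return false;
        }
    }
    return true;
}
```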
2025-11-17 08:30:20 +00:00
Vlachopoulou, Eleni
50f3520c33 GTestSuite: Fix in swap (#266) 2025-11-14 10:11:39 +00:00
Rayan, Rohan
979b547876 Make all bench applications consistent (#189)
Lots of the bench applications were not taking n_repeats as an argument on the command line (while others were).
Also adding the feature to skip the rest of the line (at the end) while reading bench inputs for all APIs
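
A minimal sketch of both behaviours, assuming a GEMM-style input line of three integers. The function names, the argument position, and the input layout are hypothetical and not taken from the bench sources.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Returns n_repeats from argv[pos] if it is present and a positive
   integer, otherwise the fallback value. */
long parse_n_repeats(int argc, char **argv, int pos, long fallback)
{
    if (pos >= argc)
        return fallback;
    char *end;
    long v = strtol(argv[pos], &end, 10);
    return (end != argv[pos] && *end == '\0' && v > 0) ? v : fallback;
}

/* Reads "m n k" and then discards whatever is left on that line, so
   trailing columns or comments do not spill into the next read.
   Returns 1 if three integers were read, 0 otherwise. */
int read_gemm_dims(FILE *fp, long *m, long *n, long *k)
{
    assert(fp != NULL && m != NULL && n != NULL && k != NULL);
    int got = fscanf(fp, "%ld %ld %ld", m, n, k);

    int c;
    while ((c = fgetc(fp)) != '\n' && c != EOF) {
        /* skip the rest of the line */
    }
    return got == 3;
}
```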

---------

Co-authored-by: Rayan <rohrayan@amd.com>
2025-11-13 07:42:53 +05:30
Sharma, Shubham
26588d5814 Added Fast path for single threaded AVX512 DGEMV kernel #260 (#262)
- If BLIS is compiled as a multithreaded library, BLIS_NUM_THREADS is set to 1, and sizes are large enough for the multithreaded path to be optimal, we take the multithreaded path even though we can spawn only one thread. This adds OpenMP overhead.
- A check has been added inside the multithreaded kernels to use the single-threaded code path if only 1 thread can be spawned.
AMD-Internal : [SWLCSG-3408]

(cherry picked from commit b60542c45c)
2025-11-10 15:26:55 +05:30
Sharma, Shubham
b60542c45c Added Fast path for single threaded AVX512 DGEMV kernel #260
- If BLIS is compiled as a multithreaded library, BLIS_NUM_THREADS is set to 1, and sizes are large enough for the multithreaded path to be optimal, we take the multithreaded path even though we can spawn only one thread. This adds OpenMP overhead.
- A check has been added inside the multithreaded kernels to use the single-threaded code path if only 1 thread can be spawned.
AMD-Internal : [SWLCSG-3408]
2025-11-10 10:32:36 +05:30
S, Hari Govind
4ecfbde082 Fix extreme values handling in GEMV
- When alpha == 0, we are expected to only scale the y vector with beta and not read A or X at all.
- This scenario is not handled properly in every code path, which causes NaN and Inf values from A and X to be wrongly propagated. For example, the default block of the switch case (non-zen architectures) has no such check, and some of the avx512 kernels are also missing it.
- When beta == 0, we are not expected to read Y at all. This is also not handled correctly in one of the avx512 kernels.
- To fix these, an early return condition for alpha == 0 is added to the bla layer itself so that each kernel does not have to implement the logic.
- The DGEMV AVX512 transpose kernel has been fixed to load vector Y only when beta != 0.
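
The two rules above can be seen in a reference-style GEMV. This is a sketch under stated assumptions (column-major A, no transpose, unit strides), not the BLIS implementation, and the name `dgemv_n_sketch` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* y := alpha * A * x + beta * y, with A an m x n column-major matrix
   of leading dimension lda, and unit strides for x and y. */
void dgemv_n_sketch(size_t m, size_t n, double alpha,
                    const double *a, size_t lda,
                    const double *x, double beta, double *y)
{
    assert(lda >= m);

    /* beta == 0: overwrite y without reading it, so NaN/Inf already
       in y cannot leak into the result. */
    for (size_t i = 0; i < m; ++i)
        y[i] = (beta == 0.0) ? 0.0 : beta * y[i];

    /* alpha == 0: return before A or x is touched, so NaN/Inf in
       them cannot propagate either. */
    if (alpha == 0.0)
        return;

    for (size_t j = 0; j < n; ++j) {
        double t = alpha * x[j];
        for (size_t i = 0; i < m; ++i)
            y[i] += t * a[i + j * lda];
    }
}
```

Doing the alpha check once, before any kernel is chosen, matches the intent stated above: individual kernels no longer need their own copy of the test.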

AMD-Internal: [CPUPL-7585]
2025-11-08 12:30:03 +05:30
Smyth, Edward
42195faa52 DTL Windows getpid support
Use _getpid in Windows builds so that separate DTL log and trace
files are used for each process, e.g. each MPI rank.
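
A portable way to get a per-process file name is sketched below. The wrapper macro `dtl_getpid`, the function `dtl_log_name`, and the file name pattern are hypothetical and not taken from the DTL sources.

```c
#include <assert.h>
#include <stdio.h>

#ifdef _WIN32
#include <process.h> /* _getpid */
#define dtl_getpid() ((long)_getpid())
#else
#include <unistd.h>  /* getpid */
#define dtl_getpid() ((long)getpid())
#endif

/* Builds a log file name that includes the process id, so that each
   process (for example each MPI rank) writes to its own file. */
int dtl_log_name(char *buf, size_t len, const char *prefix)
{
    assert(buf != NULL && len > 0 && prefix != NULL);
    return snprintf(buf, len, "%s_P%ld.txt", prefix, dtl_getpid());
}
```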

AMD-Internal: [CPUPL-7524]
2025-11-07 19:16:46 +00:00
Prabhu, Anantha
15cca23574 Delete .github/self_enablement_config.yaml 2025-11-07 23:09:59 +05:30
Prabhu, Anantha
7fdc1229db Delete .github/recipients.yaml 2025-11-07 23:09:59 +05:30
Prabhu, Anantha
15eca83c74 Delete .github/psdb-jenkins-trigger.yml 2025-11-07 23:09:59 +05:30
Prabhu, Anantha
66694a36e7 Delete .github/workflows/ai-code-review-trigger.yml 2025-11-07 23:09:59 +05:30
Prabhu, Anantha
90f1e8d183 Update ai_coverity.json 2025-11-07 23:09:59 +05:30
Prabhu, Anantha
77e4d229be Update ai-pr-platform-app.yml 2025-11-07 23:09:59 +05:30
Prabhu, Anantha
25d172b835 AI Coverity Fix - Historical Analysis related changes
Add AI Coverity Fix - Historical Analysis related changes
2025-11-07 23:09:59 +05:30
Tyagi, Shubham
b602b61601 Update ai-pr-platform-app.yml 2025-11-07 23:09:59 +05:30
Tyagi, Shubham
98164e06a6 Update ai-pr-platform-app.yml 2025-11-07 23:09:59 +05:30
Tyagi, Shubham
aebc02e2f3 Update ai-pr-platform-app.yml 2025-11-07 23:09:59 +05:30
Tyagi, Shubham
8c0196fef9 Add ai-pr-platform-app.yml 2025-11-07 23:09:59 +05:30