amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-04-19 15:18:52 +00:00

Author	SHA1	Message	Date
Kiran Varaganti	ad6a372abf	Updated LICENSE and NOTICES files for 5.2.2 release 5.2.2	2026-03-20 06:23:20 +00:00
Kiran Varaganti	a7da8ee174	AOCL 5.2.2 GA Release	2026-03-12 16:08:56 +00:00
Rayan, Rohan	ebf8721a5c	Optimizing sgemm rd kernels on zen3 (#293 ) Fixing some inefficiencies on the zen (AVX2) SUP RD kernel for SGEMM. After performing the iteration for the 8 loop, the next loop that was being performed was the 1 loop for the k-direction. This caused a lot of unnecessary iterations when the remainder of k < 8. This has been fixed by introducing masked operations for k < 8 When remainder of k == 1, we handle this with the original non-masked code (with a branch) as the masked code introduces more penalty because of the masking operation. There were also some unnecessary instructions in the zen4 kernels which have been removed. AMD-Internal: https://amd.atlassian.net/browse/CPUPL-7775 Co-authored-by: rohrayan@amd.com	2026-02-04 09:08:11 +05:30
Chandrashekara K R	50ae5a05ef	Updated version string from 5.1.1 to 5.2.2	2026-02-02 18:13:01 +05:30
S, Hari Govind	ec6f4e96cd	Replace intrinsics with inline assembly for bli_saxpyv_zen4_int and bli_saxpyf_zen_int_5 GCC over-optimizes intrinsics code by reordering and interleaving instructions, making it difficult to verify correctness and causing potential accuracy issues in certain cases. This change replaces intrinsics-based implementations with inline assembly to ensure one-to-one mapping between source and generated assembly. Changes: - bli_saxpyv_zen4_int: Converted AVX-512 intrinsics to inline assembly * Processes blocks of 128, 64, 32, 16, and 8 elements * Handles fringe cases with masked operations * Preserves scalar path for non-unit strides - bli_saxpyf_zen_int_5: Converted AVX2 intrinsics to inline assembly * Processes blocks of 16 and 8 elements with 5-way fusion * Handles fringe cases with masked operations * Preserves scalar path for non-unit strides Benefits: - Predictable code generation with no compiler reordering - Better numerical accuracy by preventing unexpected transformations - Easier verification of generated assembly against specifications - Explicit control over instruction sequence and register allocation	2026-01-29 11:48:47 +05:30
Smyth, Edward	3b8aca0874	GTestSuite: Misc fixes (3) Various changes: - Correct signature of get_random_matrix call in some level2 APIs. - Move RNG seed to header to allow it to be used elsewhere in the code - Remove unused variable in ref_gbmv.cpp - Fix seed for all calls to rand() - Correct arguments in calls to matrix setup and computediff calls, especially for CBLAS row-major calls - Add missing if statements in tests of input arguments - Removed unused alpha argument from tbmv and tbsv - Enable nan_inf check when testing input args like alpha and beta - Also some corrections to testing input matrices and vectors AMD-Internal: [CPUPL-7386]	2026-01-23 17:23:52 +00:00
Smyth, Edward	dd66dfff50	cblas_ctrmm invalid diag fix Error handling code for invalid diag argument in Col major path was incorrect in cblas_ctrmm compared to other invalid argument checks and other data type variants. AMD-Internal: [CPUPL-7303]	2026-01-23 16:51:35 +00:00
Dave, Harsh	b510d06cc8	Tuned input threshold for tiny dgemm interface (#309 ) * Tuned input threshold for tiny dgemm interface - Added upper limit check for M dimension to avoid cache thrashing. - Added required buffer size check needed while packing A matrix. AMD-Internal : [CPUPL-7915] --------- Co-authored-by: harsdave <harsdave@amd.com>	2026-01-22 20:11:06 +05:30
Balasubramanian, Vignesh	73911d5990	Updates to the build systems(CMake and Make) for LPGEMM compilation (#303 ) - The current build systems have the following behaviour with regards to building "aocl_gemm" addon codebase(LPGEMM) when giving "amdzen" as the target architecture(fat-binary) - Make: Attempts to compile LPGEMM kernels using the same compiler flags that the makefile fragments set for BLIS kernels, based on the compiler version. - CMake: With presets, it always enables the addon compilation unless explicitly specified with the ENABLE_ADDON variable. - This poses a bug with older compilers, owing to them not supporting BF16 or INT8 intrinsic compilation. - This patch adds the functionality to check for GCC and Clang compiler versions, and disables LPGEMM compilation if GCC < 11.2 or Clang < 12.0. - Make: Updated the configure script to check for the compiler version if the addon is specified. CMake: Updated the main CMakeLists.txt to check for the compiler version if the addon is specified, and to also force-update the associated cache variable update. Also updated kernels/CMakeLists.txt to check if "aocl_gemm" remains in the ENABLE_ADDONS list after all the checks in the previous layers. AMD-Internal: [CPUPL-7850] Signed-off by : Vignesh Balasubramanian <Vignesh.Balasubramanian@amd.com> AOCL-Jan2026-b2	2026-01-16 19:39:55 +05:30
Smyth, Edward	9f9bfbed7f	GTestSuite: Banded APIs (gbmv, hbmv, sbmv, tbmv, tbsv) Create gtestsuite programs for banded matrix APIs. Priority is to create framework with a basic set of tests - refinement of problem sizes can be investigated later. AMD-Internal: [CPUPL-7386]	2026-01-16 12:37:47 +00:00
Smyth, Edward	c32247678c	GTestSuite: Packed APIs (hpmv, spmv, tpmv, tpsv) Create gtestsuite programs for subset of packed matrix APIs. Priority is to create framework with a basic set of tests - refinement of problem sizes can be investigated later. AMD-Internal: [CPUPL-7386]	2026-01-16 12:08:36 +00:00
Smyth, Edward	72e0c001f2	GTestSuite: Packed APIs (hpr, hpr2, spr, spr2) Create gtestsuite programs for subset of packed matrix APIs. Priority is to create framework with a basic set of tests - refinement of problem sizes can be investigated later. AMD-Internal: [CPUPL-7386] Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-01-16 10:25:27 +00:00
Smyth, Edward	bd99d6cd92	GTestSuite: Misc fixes (2) Various changes: - Fixed undeclared variables in her, her2, syr and syr2 IIT_ERS tests - Correct typos in comments AMD-Internal: [CPUPL-7386]	2026-01-16 00:33:16 +00:00
Sharma, Shubham	824e289899	Tuned decision logic for DGEMV multithreading for skinny sizes. (#301 ) AMD-Internal: [CPUPL-7769]	2026-01-14 12:08:46 +05:30
Rayan, Rohan	9cbb1c45d8	Improving sgemm rd kernel on zen4/zen5 (#292 ) Fixing some inefficiencies on the zen4 SUP RD kernel for SGEMM The loops for the 8 and 1 iteration of the K-loop were performing loads on ymm/xmm registers and computation on zmm registers This caused multiple unnecessary iterations in the kernel for matrices with certain k-values. Fixed by introducing masked loads and computations for these cases AMD-Internal: https://amd.atlassian.net/browse/CPUPL-7762 Co-authored-by: Rohan Rayan <rohrayan@amd.com>	2025-12-17 18:48:50 +05:30
Vlachopoulou, Eleni	504ac9d8a2	CMake: Adding targets and aliases so that blis works with fetch content (#179 ) * Adding targets and aliases so that blis works with fetch content * Using PUBLIC instead of INTERFACE * Using BLIS instead of blis and adding BLAS in the targets * Fixing installation paths to be the same as before * Adding documentation for FetchContent() AOCL-Weekly-121225	2025-12-10 13:02:09 +00:00
Vlachopoulou, Eleni	1d80d5fee4	Fixing doc about building bench (#290 )	2025-12-10 12:07:50 +00:00
Rayan, Rohan	a22e0022c2	SGEMM tiny path tuning for zen4 and zen5 (#267 ) * Adding a model to determine which matrices enter the SGEMM tiny path * This extends the sizes of matrices that enter the tiny path, which was constrained to the L1 cache size previously * Now matrices that fit in L2 are also allowed into the tiny path, provided they are determined to be faster than the SUP path * Adding thresholds based on the SUP path sizes * Added for Zen4 and Zen5 --------- AMD-Internal: CPUPL-7555 Co-authored-by: Rohan Rayan <rohrayan@amd.com>	2025-12-10 15:58:54 +05:30
KR, Chandrashekara	b06b55e864	Update LICENSE and NOTICES files for AOCL-5.2 release (#285 )	2025-12-10 11:25:02 +05:30
Varaganti, Kiran	bbb7edcb22	thread: free global communicator after parallel region completes in p… * thread: free global communicator after parallel region completes in pthreads decorator Avoid potential data race by deferring free until all threads have joined. Previously, chief thread could free inside while non-chief threads still held pointers. Now, frees after the parallel region, following barrier and joins. Files: - frame/thread/bli_l3_sup_decor_pthreads.c - frame/thread/bli_l3_decor_pthreads.c * AMD-Internal: [CPUPL-7694]	2025-12-09 19:15:52 +05:30
Kiran Varaganti	9734fc18cc	AOCL-5.2 GA Release 5.2	2025-12-07 05:22:11 +00:00
Smyth, Edward	eff1b561c5	GTestSuite: Misc fixes - Move asumv and nrm2 testinghelpers files from util to level1 (missed in commit `0923d8ff56`) - Correct spelling mistakes and references to incorrect arguments in comments in various files - Correct comments listing invalid input tests in syr_IIT_ERS.cpp and her_IIT_ERS.cpp - Fix incorrect use of M in symv_IIT_ERS.cpp and hemv_IIT_ERS.cpp AMD-Internal: [CPUPL-7386] AOCL-Weekly-051225	2025-12-05 17:23:47 +00:00
Smyth, Edward	b04818bf48	GTestSuite: Conjugate dot and ger IIT_ERS tests IIT_ERS tests were only implemented for {c,z}dotu and {c,z}geru. Add the same tests for conjugate variants {c,z}dotc and {c,z}gerc. AMD-Internal: [CPUPL-7386]	2025-12-05 16:01:49 +00:00
Varaganti, Kiran	8a84b2fb2c	Global Communicator is now freed outside the parallel region * Global Communicator is now freed outside the parallel region Description: // Root threads don't "own" the global communicator thrinfo_t* root = bli_thrinfo_create_root(comm, id, pool, pba); // Setting free_comm=FALSE makes it clear: "This thread doesn't own this resource" The thread that creates the communicator should be responsible for freeing it: // Framework creates global communicator thrcomm_t* gl_comm = bli_thrcomm_create(n_threads); // Framework should clean it up, not individual threads // (Even though only chief would actually do the cleanup) Global communicators: Created by framework → free_comm=FALSE Local communicators: Created by threads → free_comm=TRUE Setting free_comm=FALSE provides an extra safety layer - if the chief thread logic ever changes, root threads won't accidentally try to free global communicators. The current implementation has the framework handle global communicator cleanup: // We shouldn't free the global communicator since it was already freed // by the global communicator's chief thread in bli_l3_thrinfo_free() Technically root threads could have free_comm=TRUE and still be safe due to the chief thread protection, but the current change uses FALSE for better semantic clarity and architectural consistency. Made all changes to align with this design. [CPUPL-7577] * Removed old comments * Applied similar changes to sequential code path * For single thread we use global BLIS_SINGLE_COMM variable instead of allocating memory from sba pool * Fixed comments * Cleanup comments	2025-12-05 15:52:08 +05:30
Chandrashekara K R	ad472369c1	Update LICENSE and NOTICES files for AOCL-5.2 release	2025-12-05 15:09:14 +05:30
Varaganti, Kiran	17702aedef	Add compiler information to "make showconfig" and "bench_getlibraryInfo" Display CC, CXX compiler information and version details to help users identify which compiler was used to build BLIS. Changes: - Makefile: Add CC, CXX, and CFLAGS to 'make showconfig' output - bench_getlibraryInfo.c: Add runtime compiler detection with version * Properly detect AOCC (AMD Optimizing C/C++ Compiler) * Detect Clang, GCC, Intel ICC/oneAPI, and MSVC * Show compiler version information * Detection order ensures AOCC is identified correctly (not as GCC) The showconfig output now includes: CC (C compiler): clang CXX (C++ compiler): clang++ CFLAGS: <flags> The bench_getlibraryInfo program now shows: == Build Information == C Compiler (CC): AOCC x.x.x (Clang x.x.x based) C++ Compiler (CXX): N/A (compiled as C) This provides both build-time and runtime compiler information for better build diagnostics and issue reporting. Issue: CPUPL-7678	2025-12-01 21:14:25 +05:30
Balasubramanian, Vignesh	54ac36c8bc	Bugfix: BF16 to F32 conversion in AVX2 F32 codepath - Updated the conversion function(in case of receiving column stored inputs) from BF16 to F32, in order to use the correct strides while storing. - Conversion of B is potentially multithreaded using the threads meant for IC compute. With the wrong strides in the kernel, this gives rise to incorrect writes onto the miscellaneous buffer. AMD-Internal: [CPUPL-7675] Co-authored-by: Vishal-A <Vishal.Akula@amd.com> Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>	2025-12-01 15:06:13 +05:30
Kumar, Harish	1b15f940aa	Update ai_coverity.json file with project as AOCL_RELEASE and email recipients with blank (#277 ) This PR updates the ai_coverity.json configuration file by restricting the 'projects' list to only include 'AOCL_RELEASE' and clearing out the 'email_recipients' field. The changes align with ticket CPUPL-7574, ensuring that only the necessary project is monitored, and no emails are automatically sent.	2025-11-28 21:19:15 +05:30
Vlachopoulou, Eleni	edf64e2b89	GTestSuite: Adding data pool (#272 ) * Adding data pool for random number generator * Adopting pool in axpy, gemv, and ukr.gemm * Using templates in data pool getter function to generate different pools for different ranges * Addressing review comments * Using data pool in more APIs * Using data pool in more APIs * Updating testing infra for trsv and trsm * Addressing review comments * Fixing numerical issues in trsm testing * More review comments * Small change to get trsm tests to pass * Scaling tiny data pool * Addressing review comments	2025-11-27 14:01:29 +00:00
Balasubramanian, Vignesh	f992942f6b	Disabling GEMV(M1) rerouting in BF16 APIs(AVX512) - Disabled rerouting to GEMV in BF16 APIs, in case of inputs having m == 1. We would now use the GEMM path for handling such inputs. AMD-Internal: [CPUPL-7536] Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>	2025-11-27 14:43:31 +05:30
Vlachopoulou, Eleni	5c42229e05	GTestSuite: Moving data generator definitions in a cpp file (#270 ) * GTestSuite: Moving data generator definitions in a cpp file - Using the same number in the rng engine for all tests - Adjusting thresholds	2025-11-18 10:52:49 +00:00
Smyth, Edward	0923d8ff56	GTestSuite: break up tests In earlier commits (e.g. `a2beef3255`) long running tests were broken into separate executables for each data type. Standardize this approach across all APIs, including separate executables for IIT_ERS tests, and for each data type have separate executables for EVT tests. AMD-Internal: [CPUPL-7386] Co-authored-by: Vlachopoulou, Eleni <Eleni.Vlachopoulou@amd.com>	2025-11-17 09:12:03 +00:00
Vlachopoulou, Eleni	a8daea04ea	GTestSuite: computediff improvements (#264 ) * GTestSuite: computediff improvements - Using lazy evaluation to only create error strings when the comparison failed - Check if increments are zero, then only check first element	2025-11-17 08:30:20 +00:00
Vlachopoulou, Eleni	50f3520c33	GTestSuite: Fix in swap (#266 )	2025-11-14 10:11:39 +00:00
Rayan, Rohan	979b547876	Make all bench applications consistent (#189 ) Lots of the bench applications were not taking n_repeats as an argument on the command line (while others were). Also adding the feature to skip the rest of the line (at the end) while reading bench inputs for all APIs --------- Co-authored-by: Rayan <rohrayan@amd.com>	2025-11-13 07:42:53 +05:30
Sharma, Shubham	26588d5814	Added Fast path for single threaded AVX512 DGEMV kernel #260 (#262 ) - If blis is compiled as multithreaded library, BLIS_NUM_THREADS is set to 1, and sizes are large enough for multithreaded path to be optimal, we take multithreaded path even though we can spawn only one thread. This adds openmp overhead. - A check has been added inside the multithreaded kernels to use the single threaded code path if only 1 thread can be spawned. AMD-Internal : [SWLCSG-3408] (cherry picked from commit `b60542c45c`)	2025-11-10 15:26:55 +05:30
Sharma, Shubham	b60542c45c	Added Fast path for single threaded AVX512 DGEMV kernel #260 - If blis is compiled as multithreaded library, BLIS_NUM_THREADS is set to 1, and sizes are large enough for multithreaded path to be optimal, we take multithreaded path even though we can spawn only one thread. This adds openmp overhead. - A check has been added inside the multithreaded kernels to use the single threaded code path if only 1 thread can be spawned. AMD-Internal : [SWLCSG-3408]	2025-11-10 10:32:36 +05:30
S, Hari Govind	4ecfbde082	Fix extreme values handling in GEMV - When alpha == 0, we are expected to only scale y vector with beta and not read A or X at all. - This scenario is not handled properly in all code paths which causes NAN and INF from A and X being wrongly propagated. For example, for non-zen architecture (default block in switch case) no such check is present, similarly some of the avx512 kernels are also missing these checks. - When beta == 0, we are not expected to read Y at all, this also is not handled correctly in one of the avx512 kernel. - To fix these, early return condition for alpha == 0 is added to bla layer itself so that each kernel does not have to implement the logic. - DGEMV AVX512 transpose kernel has been fixed to load vector Y only when beta != 0. AMD-Internal: [CPUPL-7585]	2025-11-08 12:30:03 +05:30
Smyth, Edward	42195faa52	DTL Windows getpid support Use _getpid in Windows builds so that separate DTL log and trace files are used for each process, e.g. each MPI rank. AMD-Internal: [CPUPL-7524]	2025-11-07 19:16:46 +00:00
Prabhu, Anantha	15cca23574	Delete .github/self_enablement_config.yaml	2025-11-07 23:09:59 +05:30
Prabhu, Anantha	7fdc1229db	Delete .github/recipients.yaml	2025-11-07 23:09:59 +05:30
Prabhu, Anantha	15eca83c74	Delete .github/psdb-jenkins-trigger.yml	2025-11-07 23:09:59 +05:30
Prabhu, Anantha	66694a36e7	Delete .github/workflows/ai-code-review-trigger.yml	2025-11-07 23:09:59 +05:30
Prabhu, Anantha	90f1e8d183	Update ai_coverity.json	2025-11-07 23:09:59 +05:30
Prabhu, Anantha	77e4d229be	Update ai-pr-platform-app.yml	2025-11-07 23:09:59 +05:30
Prabhu, Anantha	25d172b835	AI Coverity Fix - Historical Analysis related changes Add AI Coverity Fix - Historical Analysis related changes	2025-11-07 23:09:59 +05:30
Tyagi, Shubham	b602b61601	Update ai-pr-platform-app.yml	2025-11-07 23:09:59 +05:30
Tyagi, Shubham	98164e06a6	Update ai-pr-platform-app.yml	2025-11-07 23:09:59 +05:30
Tyagi, Shubham	aebc02e2f3	Update ai-pr-platform-app.yml	2025-11-07 23:09:59 +05:30
Tyagi, Shubham	8c0196fef9	Add ai-pr-platform-app.yml	2025-11-07 23:09:59 +05:30
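
Commit `4ecfbde082` in the log above states the expected GEMV semantics for extreme values: when alpha == 0, only y is scaled by beta and A and x are not read; when beta == 0, y is not read. The following is a minimal reference-style sketch of those two rules in C, not the BLIS implementation. The function name `ref_dgemv_n`, the column-major layout, and the unit strides are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical reference routine (not a BLIS function):
 * y := alpha * A * x + beta * y, with A an m x n column-major matrix
 * with leading dimension lda, and unit-stride x and y. */
void ref_dgemv_n(size_t m, size_t n, double alpha,
                 const double *a, size_t lda,
                 const double *x, double beta, double *y)
{
    /* Scale y first. When beta == 0, y is overwritten without being
     * read, so NaN/Inf already stored in y cannot reach the result. */
    for (size_t i = 0; i < m; ++i)
        y[i] = (beta == 0.0) ? 0.0 : beta * y[i];

    /* When alpha == 0, return before touching A or x, so NaN/Inf in
     * either operand is not propagated into y. */
    if (alpha == 0.0)
        return;

    /* General case: accumulate alpha * A * x column by column. */
    for (size_t j = 0; j < n; ++j) {
        const double ax = alpha * x[j];
        for (size_t i = 0; i < m; ++i)
            y[i] += ax * a[i + j * lda];
    }
}
```

Placing the alpha == 0 check ahead of the compute loop mirrors the idea in the commit of handling the case once at an outer layer, so that individual kernels do not each need their own check.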

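Commit `73911d5990` in the log above disables LPGEMM ("aocl_gemm" addon) compilation when GCC is older than 11.2 or Clang is older than 12.0. The actual check lives in the configure script and CMakeLists.txt; the C sketch below only illustrates the version comparison. The helper names and the compiler strings "gcc" and "clang" are hypothetical, and treating every other compiler as allowed is an assumption, not something the commit message states.

```c
#include <string.h>

/* Returns 1 if major.minor is at least req_major.req_minor, else 0. */
static int version_at_least(int major, int minor,
                            int req_major, int req_minor)
{
    if (major != req_major)
        return major > req_major;
    return minor >= req_minor;
}

/* Hypothetical helper: decides whether the aocl_gemm addon would be
 * built for the given compiler, using the thresholds quoted in the
 * commit message (GCC >= 11.2, Clang >= 12.0). */
int lpgemm_addon_allowed(const char *compiler, int major, int minor)
{
    if (strcmp(compiler, "gcc") == 0)
        return version_at_least(major, minor, 11, 2);
    if (strcmp(compiler, "clang") == 0)
        return version_at_least(major, minor, 12, 0);
    /* Assumption: compilers other than GCC and Clang are not
     * restricted by this particular check. */
    return 1;
}
```

For example, `lpgemm_addon_allowed("gcc", 11, 1)` evaluates to 0 under these thresholds, while `lpgemm_addon_allowed("clang", 12, 0)` evaluates to 1.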
4333 Commits