amd/blis

mirror of https://github.com/amd/blis.git synced 2026-06-28 18:27:08 +00:00

Author	SHA1	Message	Date
S, Hari Govind	958dd9a7a5	Enable Exploit Mitigations for Windows (clang-cl) Builds Enable Control Flow Guard (CFG) and Control-flow Enforcement Technology (CET) hardening for Windows clang-cl builds in the CMake build system. Gated by the ENABLE_SECURITY_FLAGS option (default ON). Flags: Compiler: /guard:cf /guard:ehcont Linker: /GUARD:CF /GUARD:EHCONT /cetcompat Toolchain gating: Requires CMAKE_C/CXX_COMPILER_ID == Clang AND CMAKE_C/CXX_COMPILER_FRONTEND_VARIANT == MSVC. Native cl.exe and other MSVC-frontend toolchains fall through to a warning and skip the hardening flags. Scope of linker flags: /cetcompat is applied to blis_shared only (not propagated via INTERFACE from blis_static) to avoid enabling hardware shadow-stack enforcement on downstream consumer EXEs that include non-CET-clean paths (e.g. the testsuite's setjmp/longjmp xerbla code on zen4/zen5). Configuration: cmake -DENABLE_SECURITY_FLAGS=OFF ... # disable Files changed: blis/CMakeLists.txt AOCL-20260601	2026-05-19 14:26:54 +05:30
Smyth, Edward	4ee6f75292	GCC 16 fixes Changes to fix errors and warnings when using gcc 16.1.0: - Copy changes from 5c2b22da81 in upstream BLIS to extend disabling of tree-vectorization in affected kernels to gcc 16 and later. - Remove unused variables in bli_packm_blk_var1_md.c and bli_util_unb_var1.c to fix warning messages. Background bp (base pointer) is the %rbp/%ebp register on x86/x86-64. Inline assembly kernels in BLIS use asm volatile blocks where they manually manage registers - including saving and restoring bp themselves to use it as a general-purpose register for holding loop counters or matrix pointers. When GCC's tree-vectorizer (specifically the superword-level parallelism (SLP) pass) runs on a translation unit containing inline asm, it can generate code that itself needs bp as a frame pointer or in the vectorized prologue/epilogue. At that point GCC internally marks bp as unavailable and then, when it tries to compile the inline asm block that also references bp, it throws an error. As a workaround, disabling tree vectorization for the entire file removes the conflict - with no vectorizer-generated code, bp stays free for the inline asm. AOCL-20260502	2026-05-18 15:54:29 +01:00
Dave, Harsh	c845e41b38	Pack B matrix for zgemm conjugate input in SUP path (#362) * Pack B matrix for zgemm conjugate input in SUP path - B matrix is packed for zgemm input where B matrix is conjugate transpose. AMD-Internal: CPUPL-8274 --------- Co-authored-by: harsdave <harsdave@amd.com>	2026-04-21 11:12:52 +05:30
Dave, Harsh	07fd52823a	Fix: Resolve label redefinition errors in zgemm sup kernel (#359) - Resolves label collision errors across assembly blocks of zgemm sup kernels. AMD-Internal: [CPUPL-8336] Co-authored-by: harsdave <harsdave@amd.com> AOCL-20260402	2026-04-16 16:51:55 +05:30
S, Hari Govind	cdd181a7d7	Optimize complex scalv kernels with inline assembly and FMA instructions This PR optimizes the complex scalar vector multiplication kernels by replacing intrinsics with inline assembly and leveraging FMA (Fused Multiply-Add) instructions for improved performance. Changes: - Replaced intrinsic-based implementation with inline assembly - Utilizes `vfmaddsub231ps`, `vfmadd231ss`, and `vfmsub231ss` FMA instructions - Improved instruction scheduling and register usage - Handles both unit-stride (vectorized) and non-unit-stride (scalar) cases - Processes up to 16 complex elements per iteration in the main loop	2026-04-13 16:06:02 +05:30
Rayan, Rohan	c8d0259ed4	Fixing memory issue in the cgemm pack kernels on zen4 Fixing a memory issue in the cgemm zen4 packing kernel. In the loop section where the leftover m and k iterations are handled, the load operations (in the k-direction) were missing mask instructions, which have now been added. AMD-Internal: CPUPL-8189 Co-authored-by: Rohan Rayan rohrayan@amd.com	2026-03-30 11:08:34 +05:30
Dave, Harsh	0b68dfc5ed	The zgemm tiny and sup paths currently only support nn, tn, tt, nt, cn, ct, nc, tc. Previously, the error-handling logic only explicitly checked for and rejected cc and hh cases. This patch extends that check to include ch and hc configurations, ensuring they correctly return a failure/fallback instead of proceeding with unsupported kernels. AMD-Internal: [CPUPL-8151] Co-authored-by: harsdave <harsdave@amd.com>	2026-03-25 14:45:54 +05:30
V, Varsha	d995496ac1	BugFix: BF16 AVX2 fallback GEMV m=1 path for reordered B inputs - For BF16 GEMVM1 fallback path when B matrix is reordered, there wasn't a panel adjustment happening after the kernel execution. - When the input size exceeds the panel boundary this would cause wrong panel access leading to incorrect results. Hence, added the same. [ AMD-Internal : CPUPL-8201 ]	2026-03-25 12:34:51 +05:30
Rayan, Rohan	d512e3a736	Converting mul+add to FMA for ddot, daxpy and daxpyf zen kernels - Replace mul+add with FMA for ddot, daxpy and daxpyf - Using masked operations where possible - Non-unit stride code paths still use scalar loops, but use FMAs for accuracy AMD-Internal: CPUPL-8055 Co-authored-by: Rohan Rayan rohrayan@amd.com	2026-03-20 11:07:12 +05:30
Smyth, Edward	6459b66b48	GTestSuite: Fix for ZGEMM tiny tests on Zen3 and earlier Non-AVX512 variants of ZGEMM tiny code path do not support as many problem sizes and transa/transb options. Skip unsuitable tests based on the return value from the bli_zgemm_tiny call. Also remove unused local variables. AMD-Internal: [CPUPL-7303]	2026-03-18 18:11:56 +00:00
Smyth, Edward	f632492b91	GTestSuite: Refine tests for GEMM Update problem sizes and other parameters for general GEMM and GEMMT tests, and dgemmt EVT tests, with the aim of reducing the runtime and making the tests more practical for use in pre-submit CI jobs. AMD-Internal: [CPUPL-7386]	2026-03-17 10:58:27 +00:00
Vlachopoulou, Eleni	d93c8a7b58	GTestSuite: cleanup of disabled tests (#323 ) * Cleanup of disabled tests * Updating ger evt tests to identify the failing cases * Updating sgemv.evt tests to separate known failures and disable those tests * Update threshold in dtpsv * disabling dgemm_kernel tests that fail * disabling sgemm_kernel tests that fail * disabling cgemv_evt tests that fail * disabling cgemv_evt tests that fail * disabling cgemv_evt and zgemv_evt tests that fail * Fixing typo Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Adding comment on dtpsv threshold adjustment * Update gtestsuite/testsuite/level2/ger/cger/evt/cger_evt.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update gtestsuite/testsuite/level2/ger/zger/evt/zger_evt.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update gtestsuite/testsuite/level2/tpsv/dtpsv/dtpsv_generic.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update gtestsuite/testsuite/level2/tpsv/dtpsv/dtpsv_generic.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update gtestsuite/testsuite/level2/ger/zger/evt/zger_evt.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update gtestsuite/testsuite/level2/ger/cger/evt/cger_evt.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-03-16 12:04:39 +00:00
Rayan, Rohan	af4c3ef1a5	Fixing memory issues in sgemm SUP kernels on AVX2 and AVX512 * Resolved memory-access issues in the SGEMM SUP kernels on AVX2 and AVX-512 by correcting instructions that could read invalid addresses in the C matrix. * Removed k=0 kernel gtests for the native SGEMM and DGEMM paths as these tests caused spurious failures for kernels that are not intended to handle this case. * Standardized all instruction macros to lowercase in the Zen4 kernel to improve readability and code consistency. --------- AMD-Internal: CPUPL-8117 Co-authored-by: Rayan <rohrayan@amd.com>	2026-03-16 16:05:34 +05:30
Vettickal Sen, Anuraj	4a9af35bf4	Bitexactness CRC verification and per-test JSON output (#320 ) * Bitexactness CRC verification and per-test JSON output * Remove redundant BLIS_TEST_SEED random seed utilities The random_seed_utils.h and BLIS_TEST_SEED environment variable are unnecessary since the codebase already ensures deterministic random number generation via RANDOM_POOL_SEED and SRAND_SEED constants hardcoded in testing_helpers.h. * Add CRC support for integer/char computediff and cache env var checks Add CRC calculation and binary output to the gtint_t and char specializations of computediff, matching the pattern used by all other overloads. Char values are widened to gtint_t for safe uint32_t-aligned CRC access. Cache BLIS_ENABLE_CRC and BLIS_ENABLE_BINARY_OUTPUT env var lookups via static const bool in is_crc_enabled() and is_binary_output_enabled(). Guard all CRC/binary blocks in computediff with is_any_verification_enabled() so the common disabled path is a single static bool read with zero allocations. * Address PR review comments and refactor computediff CRC blocks Refactor: Extract duplicated CRC/binary-output blocks from all 8 computediff overloads into verify_vector_data and verify_matrix_data helpers in blis_test_utils namespace. Bug fixes from PR review: add missing includes (cstdlib, utility), enforce MAX_OUTPUT_SIZE_BYTES limit with integer overflow guard, add buffer validation in all CRC generation functions, add default case to FLA_GET_DATATYPE_FACTOR macro, replace deprecated test_case_name() with test_suite_name(), add MAKE_DIRECTORY error checking in CMake, and update copyright years to 2026. * Refactored crc_utils based on review comments. * binary_output_utils.h cleanup. * Address PR review comments: remove unused functions and fix copyright years. Remove unused generate_crc_matrix, generate_crc_matrix_no_nb_diag, generate_crc_matrix_no_nb_diag_with_storage, and calculate_and_print_matrix_crc from crc_utils.h. 
Remove unused calculate_and_print_matrix_hash from check_error.h. Fix copyright year to 2026 only in crc_utils.h and binary_output_utils.h. Remove (Performance) label from CRC heading in README.md. Co-authored-by: Cursor <cursoragent@cursor.com> * Fix for review comments. * Address review comments: rename verify to collect, consistent void returns, remove filename prefix - Rename verify_vector_data/verify_matrix_data to collect_vector_data/ collect_matrix_data since these functions only collect CRC and binary output data without performing comparison. - Make return types consistent: change calculate_and_print_crc, calculate_and_print_matrix_crc_with_storage, format_and_record_crc, and write_comparison_outputs to return void since return values were never used. - Remove redundant test_output_ prefix from generate_binary_filename to avoid duplication with the blis_test_outputs/ directory. - Remove unused utility include from binary_output_utils.h. - Update README wording from compiled out to disabled. Made-with: Cursor * Fix strict aliasing, use if constexpr, zero-pad CRC hex, separate feature guards - Replace reinterpret_cast<uint32_t> with memcpy-based read_uint32() helper to avoid strict-aliasing UB on float/double/complex buffers. Produces identical CRC values. - Use if constexpr(CRC_ENABLED) instead of runtime if(!CRC_ENABLED) to prevent CRC template instantiation when ENABLE_CRC is off. - Zero-pad CRC hex output to 8 digits for stable downstream comparison. - Separate ENABLE_CRC and ENABLE_BINARY_OUTPUT preprocessor guards in verification_utils.h so each feature is compiled independently. Made-with: Cursor Handle write_binary_output return values in write_comparison_outputs Capture the bool return values from write_binary_output and, on failure, log a warning to stdout and record the error as a GTest property. This keeps binary output as a non-fatal diagnostic aid while ensuring return values are explicitly used. 
Made-with: Cursor --------- Co-authored-by: Anuraj <avettick@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-03-13 15:12:17 +05:30
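The strict-aliasing fix described in the commit above (a memcpy-based `read_uint32()` helper replacing a `reinterpret_cast`) can be sketched roughly as follows. Only the helper name comes from the commit message; the signature and body are assumptions, not the gtestsuite code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed shape of the helper named in the commit message.
   Copying the bytes with memcpy avoids reading a float/double object
   through a uint32_t lvalue, which would be undefined behaviour under
   the strict-aliasing rules. Compilers typically lower this to a
   single 32-bit load. */
uint32_t read_uint32(const void *src)
{
    uint32_t value;
    assert(src != NULL);
    memcpy(&value, src, sizeof value);
    return value;
}
```

Because the same bytes are read either way, the CRC input is unchanged, which is consistent with the commit's note that the CRC values stay identical.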
Dave, Harsh	48a6db6c69	add support for conjugate transpose in avx512 zgemm sup kernel (#300 ) * ZGEMM SUP: Add conjugate support for AVX-512 kernels on Zen4/Zen5/Zen6 - Add CONJA, CONJB and CONJA_CONJB variants to zgemm SUP micro-tiles - Enable SUP path for conjugate cases when both are same type - Unify RRC/CRC storage to use CV kernel variant - Update SUP dispatch to handle conjugate flags correctly Note: CONJ_NO_TRANSPOSE + CONJ_NO_TRANSPOSE and CONJ_TRANSPOSE + CONJ_TRANSPOSE remain unsupported --------- Co-authored-by: harsdave <harsdave@amd.com>	2026-03-12 19:00:53 +05:30
Dave, Harsh	fbd45e8eab	optimize edge and non-unit stride path with intrinsics in s/c axpyv kernels (#334) * optimize edge and non-unit stride path with intrinsics in s/c axpyv kernels - Updated the edge and non-unit stride path in c/saxpy to use intrinsics. - This ensures that the edge and non-unit stride cases maintain numerical consistency with the optimized Zen assembly/intrinsic path. --------- Co-authored-by: harsdave <harsdave@amd.com>	2026-03-12 17:20:44 +05:30
Request, Osi	c8a5b21b46	configure: follow reproducible-builds spec for SOURCE_DATE_EPOCH - When SOURCE_DATE_EPOCH is set, it contains a unix timestamp to be used in place of the current datetime during builds to allow bit-for-bit reproducible builds to be produced. - Added support for this behavior in the CMake build system as well. See also: https://reproducible-builds.org/docs/source-date-epoch/ https://reproducible-builds.org/specs/source-date-epoch/ Co-authored-by: Luna <git@lunnova.dev> Co-authored-by: KR, Chandrashekara <Chandrashekara.KR@amd.com>	2026-03-10 18:28:46 +05:30
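The SOURCE_DATE_EPOCH behaviour referenced in the commit above can be illustrated with a small C sketch. The actual change lives in the configure script and CMake files; the function names here are hypothetical and only show the semantics of the reproducible-builds convention: use the variable when it holds a valid non-negative integer, otherwise fall back to the current time.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: parse a SOURCE_DATE_EPOCH-style string.
   Returns `fallback` when the text is missing or not a clean
   non-negative decimal integer. */
time_t parse_source_date_epoch(const char *text, time_t fallback)
{
    char *end = NULL;
    long long value;

    if (text == NULL || *text == '\0')
        return fallback;

    errno = 0;
    value = strtoll(text, &end, 10);
    if (errno != 0 || *end != '\0' || value < 0)
        return fallback;

    return (time_t)value;
}

/* Hypothetical helper: the timestamp a build would embed. */
time_t build_timestamp(void)
{
    time_t now = time(NULL);
    assert(now != (time_t)-1);
    return parse_source_date_epoch(getenv("SOURCE_DATE_EPOCH"), now);
}
```

With the variable set, two builds of the same sources embed the same timestamp, which is what makes bit-for-bit comparison possible.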
Smyth, Edward	23b48bb999	Enable support for OpenMP 2.5 and earlier Add compatibility for OpenMP implementations (e.g., MSVC, older GCC) that lack functions introduced in OpenMP 3.0 i.e. omp_get_active_level() and omp_get_max_active_levels(). On these compilers, the tests instead are based on the older omp_get_nested() functionality. Thanks to @tony-davis for highlighting this issue. AMD-Internal: [CPUPL-7303]	2026-03-06 09:34:17 +00:00
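A minimal sketch of the kind of version gate the commit above describes, assuming the check is keyed on the standard `_OPENMP` macro (200805 corresponds to OpenMP 3.0). The function name and the exact condition tested are illustrative, not the BLIS code.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: reports whether nested parallel regions can
   become active, using only calls available in the OpenMP version the
   compiler advertises. */
int nested_parallelism_enabled(void)
{
#if defined(_OPENMP) && _OPENMP >= 200805
    /* OpenMP 3.0 and later: query the active-level limit. */
    int levels = omp_get_max_active_levels();
    assert(levels >= 0);
    return levels > 1;
#elif defined(_OPENMP)
    /* OpenMP 2.5 and earlier: only the nested on/off flag exists. */
    return omp_get_nested() != 0;
#else
    /* Built without OpenMP: no nested parallelism. */
    return 0;
#endif
}
```

The third branch lets the same file compile when OpenMP is disabled entirely.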
Varaganti, Kiran	bb6545a46b	Added new thread control API with global and thread-local variants CPUPL-7578: New thread control API with global and thread-local variants Summary: Add new BLIS thread control APIs that provide fine-grained control over threading with proper global and thread-local (TLS) semantics. Fix several correctness issues where set_num_threads() and set_ways() did not properly override each other's state. New/Modified APIs: bli_thread_set_num_threads() — Sets thread count globally (updates both global_rntm and tl_rntm) bli_thread_set_num_threads_local() — Sets thread count for calling thread only (tl_rntm) bli_thread_get_num_threads() — Returns effective thread count, deriving from ways if set bli_thread_reset() — Resyncs tl_rntm from global_rntm bli_thread_set_ways() — Sets loop factorization (jc, pc, ic, jr, ir) bli_thread_get_is_parallel() — Returns whether parallelism is enabled bli_thread_get_jc_nt/ic_nt/pc_nt/jr_nt/ir_nt() — Returns individual way values b77_thread_set_num_threads_local_() — Fortran-compatible wrapper Bug fixes: bli_thread_set_num_threads() now clears ways (-1) and sets auto_factor=TRUE on both global_rntm and tl_rntm, so it properly overrides prior BLIS_JC_NT/BLIS_IC_NT environment settings bli_thread_set_ways() now propagates to global_rntm (inside mutex) and clears stale num_threads on both global_rntm and tl_rntm, so get_num_threads() returns the product of ways instead of a stale value Fix data race in bli_thread_init_rntm_from_global_rntm() — copy global_rntm under mutex before debug printing Fix data race in set_num_threads_local() debug print Test suite (43 tests, 106 assertions): test_thread_control.c (OpenMP, 23 tests): environment inheritance, global propagation, thread-local isolation, local precedence, per-thread local, reset, nested parallel, edge cases, set_ways, is_parallel, concurrent updates, DGEMM with threads, interleaved settings, persistence, parallel DGEMM, thread pool, reset-to-sync, env ways vs 
set_num_threads, ways→set_nt→reset, ways→local→reset, round-trip, set_nt→set_ways override, set_ways propagation to new threads test_thread_control_pthread.c (pthread, 20 tests): equivalent coverage plus concurrent set/reset race condition test, set_nt→set_ways override, set_ways propagation via pthread_create Files changed (9 files, +2630/-29 lines): bli_thread.c — Core API implementations and fixes bli_thread.h — New function declarations b77_thread.c — Fortran wrapper test_thread_control.c — OpenMP test suite (23 tests) test_thread_control_pthread.c — pthread test suite (20 tests) TEST_THREAD_CONTROL_README.md — Documentation AMD-Internal: CPUPL-7578	2026-03-06 12:16:17 +05:30
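The global versus thread-local semantics listed in the commit above can be modelled with a toy example. Everything here is a hypothetical stand-in: the `model_*` names are not BLIS APIs, a plain `int` replaces the rntm structures, and the mutex that the commit says protects the global state is omitted for brevity.

```c
#include <assert.h>

/* Toy model only; not the BLIS implementation. */
static int global_nt = 1;               /* stands in for global_rntm */
static _Thread_local int local_nt = 1;  /* stands in for tl_rntm */

/* Global setter: updates the global state and the caller's
   thread-local copy. */
void model_set_num_threads(int nt)
{
    assert(nt > 0);
    global_nt = nt;
    local_nt = nt;
}

/* Local setter: affects the calling thread only. */
void model_set_num_threads_local(int nt)
{
    assert(nt > 0);
    local_nt = nt;
}

/* Getter: reports what the calling thread would use. */
int model_get_num_threads(void)
{
    return local_nt;
}

/* Reset: resynchronise the thread-local copy from the global state. */
void model_reset(void)
{
    local_nt = global_nt;
}
```

In this model, a global set of 8 followed by a local set of 2 makes the getter return 2 on that thread, and a reset brings it back to 8.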
Smyth, Edward	cf2de1e7e6	Fix for undefined arch and model id symbols Commit `8310b2d5d3` added new functions and global variables in blis.h intended only for internal use. These were causing missing symbol problems when blis.h is included in C applications as they are not exported from the shared library. Use BLIS_IS_BUILDING_LIBRARY and BLIS_CONFIGURETIME_CPUID preprocessor definitions to only expose these when compiling BLIS and not when using it. AMD-Internal: [CPUPL-8091]	2026-03-05 14:01:00 +00:00
Smyth, Edward	05e837d176	BLIS: Implement zen6 sub-configuration Implement zen6 cpuid and arch changes, and add zen6 as a separate BLIS sub-configuration and code path within amdzen configuration family. Currently all optimization choices are copies of zen5 sub-configuration. AMD-Internal: [CPUPL-7162]	2026-03-05 13:33:56 +00:00
Smyth, Edward	e62d246789	Misc AMD CPUID improvements (#222 ) Changes to simplify AMD CPUID functionality: - Variable "features" is limited in size as each bit represents a specific hardware function. Move detection of FP datapath width to a separate variable. Also mask the FP datapath bits explicitly for a more reliable test. - Add detection of facility to downgrade FP512 datapath to FP256. - bli_cpuid_is_avx512_fallback function does not exist, so remove header definition. AMD-Internal: [CPUPL-7303]	2026-03-05 11:59:56 +00:00
Varaganti, Kiran	713b09b407	Remove unnecessary barrier in sup path decorator to fix ~10% DGEMM regression The bli_thread_barrier(thread) call before bli_l3_sup_thrinfo_free() in bli_l3_sup_thread_decorator() was added by analogy with the conventional path's PR #702 fix, but is not needed in the sup (small/unpacked) path. In the conventional path, pack buffers are cached in the control tree (cntl_t->pack_mem) and freed in the decorator after func() returns. A barrier is required there to prevent a fast chief from releasing a pack buffer back to the PBA pool while slower peers in a different sub-group still read from it. The sup path does not have this problem because: 1. Pack buffers are stack-local variables (mem_t in var2m), freed inside func() by packm_sup_finalize_mem() after internal loop barriers. They are never freed in this decorator. 2. The global communicator (gl_comm) is freed outside the parallel region, protected by the implicit OpenMP barrier at the closing brace of the parallel construct. 3. Sub-group communicators (created when packa/packb is enabled) are freed only by the ochief thread in bli_thrinfo_free(). Non-chief threads never dereference the shared communicator — they only read their own ocomm_id and free_comm fields. When neither matrix is packed, no sub-communicators exist (ocomm=NULL, free_comm=FALSE). The custom spin-wait barrier (bli_thread_barrier) is significantly slower than the OpenMP runtime barrier at high thread counts, causing a ~10% DGEMM performance regression at 96 threads on AMD EPYC Turin (e.g. 11000x300x200 DGEMM). Ref: https://github.com/flame/blis/pull/702 Resolves: [CPUPL-7979] [SWLCSG-3951] [LWPHPCENGG-622]	2026-03-05 11:44:57 +05:30
Sharma, Shubham	4e84bbfb68	Ensure Accumulation Consistency Across DOTXF and DOTXV kernels (#325) APIs like GEMV use DOTXF (for the parts of the problem that are a multiple of fuse_factor) and DOTXV (for the parts that are not a multiple of fuse_factor). DOTXF and DOTXV use different numbers of temporary accumulation registers (rho). This results in different round-offs, which can be significant when sizes are small and the problem is about equally divided between DOTXF and DOTXV. To fix this, the number of temporary accumulation registers (and therefore the round-offs) has been made identical across both kernels. Known related GCC bugs to reference GCC Bug #56812 — incorrect code with vzeroupper and register allocation GCC Bug #95483 — vzeroupper clobbers live values GCC Bug #101617 — wrong code generation with AVX intrinsics and transitions AMD-Internal:CPUPL-8015	2026-03-03 16:38:55 +05:30
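Why the number of accumulators matters can be shown with a scalar sketch. This is not the DOTXF/DOTXV kernel code; `dot_with_accumulators` and `MAX_ACC` are hypothetical names, and the round-robin assignment of elements to accumulators is an assumption chosen to mimic how vector lanes split a sum.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ACC 8

/* Hypothetical helper: dot product that spreads partial sums over
   `nacc` accumulators and combines them at the end. Changing `nacc`
   changes the order of the floating-point additions and therefore
   the rounding. */
double dot_with_accumulators(size_t n, const double *x, const double *y,
                             size_t nacc)
{
    double rho[MAX_ACC] = {0.0};
    double total = 0.0;

    assert(nacc >= 1 && nacc <= MAX_ACC);

    for (size_t i = 0; i < n; ++i)
        rho[i % nacc] += x[i] * y[i];

    for (size_t j = 0; j < nacc; ++j)
        total += rho[j];

    return total;
}
```

For x = {1e16, 1, -1e16, 1} and y = {1, 1, 1, 1} in IEEE-754 double precision, one accumulator gives 1.0 (the first 1 is absorbed by 1e16), while two accumulators give 2.0. Two kernels that split the same sum differently can therefore disagree even though both are correct to within rounding.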
Sharma, Shubham	3bbf12665c	Ensure consistency across AVX2 and AVX512 AMAX kernels Fix NaN handling in AVX2 amax kernel by initializing global max to 0 In the AVX2 kernel, the global maximum (curr_max_val) was previously initialized to -1, while the local maximum (temp) is initialized to 0. The kernel determines the maximum using the condition: max(abs(x[i]), temp) > curr_max_val Because curr_max_val started at -1, this initial check always evaluates to true for the first iteration ( 0 > -1), which is incorrect if the first element is a NaN. If subsequent elements are valid numbers, curr_max_val eventually recovers and updates to the correct value. However, if the entire array consists of NaNs, this logic fails to properly update the trackers, meaning we never correctly return index 0 as required by the BLAS standard. To fix this and align the behavior with the AVX-512 kernel, curr_max_val is now initialized to 0. This ensures that the initial condition evaluates correctly and all-NaN arrays return the proper index (0 if all values are NaN). Window start and end index are also updated to 0 which is the minimum valid value of index. AMD-Internal: CPUPL-8047	2026-03-03 13:01:52 +05:30
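The effect of initializing the running maximum to 0 can be seen in a scalar sketch. This is an illustration under assumptions, not the AVX2 kernel: `amax_index` is a hypothetical name, and how the real kernels treat NaNs mixed with ordinary values is not stated in the commit message, so this sketch simply lets NaN comparisons fail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar reference: index of the element with the largest
   absolute value. The running maximum starts at 0 and the index at 0,
   so an all-NaN (or empty) input returns index 0, because every
   comparison against NaN is false. */
size_t amax_index(size_t n, const double *x)
{
    double curr_max = 0.0;
    size_t index = 0;

    assert(n == 0 || x != NULL);

    for (size_t i = 0; i < n; ++i) {
        /* Manual absolute value; a NaN input stays NaN. */
        double a = (x[i] < 0.0) ? -x[i] : x[i];
        if (a > curr_max) {
            curr_max = a;
            index = i;
        }
    }
    return index;
}
```

Starting both trackers at 0 means there is no special first-iteration case: the index only moves when a strictly larger finite magnitude is seen.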
Rayan, Rohan	6f718982d6	SGEMM RV kernel optimization on Zen4 SGEMM RV kernel optimizations for Zen4: Introduces new masked kernels handling edge cases efficiently. Also introduces better instruction selection for Zen4. New Masked Kernels: 6x16m_mask: Handles n_left = [9-15] with masked ZMM operations 6x8m_mask: Handles n_left = [5-8] with masked YMM operations 6x4m_mask: Handles n_left = [2-4] with masked XMM operations Corresponding edge-M kernels (4x, 3x, 2x, 1x variants) Replaces multiple kernel calls with single masked kernel call Instruction Cleanup: Remove unnecessary branches and unused macros Remove separate load->FMA instructions in favor of memory-operand FMAs Replace cascading if-else chains with O(1) jump table dispatch New Assembly Macros: MOVSLQ - move a 32‑bit signed value into a 64‑bit register with sign‑extension JUMP_TABLE(table_label, ...) - Defines the jump table TABLE_ENTRY(label) - Defines a 4-byte relative offset entry in the table LEA_RIP(label, reg) - RIP-relative address load JMPI(reg) - Indirect jump through register --------- Co-authored-by: Rohan Rayan <rohrayan@amd.com>	2026-02-27 15:50:25 +05:30
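The replacement of cascading if-else chains with table dispatch can be sketched in C with a function-pointer table. The commit implements this in assembly through the JUMP_TABLE, TABLE_ENTRY and JMPI macros; the C names below are hypothetical, the stand-in kernels only return their column width, and the handling of n_left values 0 and 1 is left open because the commit message does not describe it.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*edge_kernel_fn)(void);

/* Stand-ins for the masked kernels named in the commit message. */
static int kernel_6x16m_mask(void) { return 16; }
static int kernel_6x8m_mask(void)  { return 8; }
static int kernel_6x4m_mask(void)  { return 4; }

/* One entry per n_left value, so dispatch is a single indexed load
   instead of a chain of comparisons. */
static const edge_kernel_fn edge_table[16] = {
    NULL, NULL,                                       /* 0-1: not described */
    kernel_6x4m_mask,  kernel_6x4m_mask,  kernel_6x4m_mask,    /* 2-4  */
    kernel_6x8m_mask,  kernel_6x8m_mask,
    kernel_6x8m_mask,  kernel_6x8m_mask,                       /* 5-8  */
    kernel_6x16m_mask, kernel_6x16m_mask, kernel_6x16m_mask,
    kernel_6x16m_mask, kernel_6x16m_mask, kernel_6x16m_mask,
    kernel_6x16m_mask                                          /* 9-15 */
};

/* Hypothetical dispatcher: returns the width of the kernel selected
   for n_left, or 0 when the table has no entry. */
int dispatch_edge_kernel(size_t n_left)
{
    edge_kernel_fn fn;
    assert(n_left < 16);
    fn = edge_table[n_left];
    return fn ? fn() : 0;
}
```

The lookup cost is the same for every n_left, whereas an if-else chain pays one more comparison and branch for each range it has to skip.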
Balasubramanian, Vignesh	237393ec71	Coverity fixes in LPGEMM group post-ops translator - Updated the condition for pointer checks on scale factors for A and B matrices, in order to avoid 'Dereference before' and 'Dereference after' null check issues. - Also updated the symmetric quantization interfaces to have NULL check for post-ops pointer. AMD-Internal: [CPUPL-7995] Signed-off-by: Vignesh Balasubramanian <Vignesh.Balasubramanian@amd.com>	2026-02-25 17:21:12 +05:30
Vlachopoulou, Eleni	199f2347ba	Fix AOCC version detection in CMake and config script (#321 ) * Updating zen5/make_defs.* so that we use an AOCC_VERSION_STRING * Adding some error handling for AOCC versions with different name convention * Adding VERSION_GREATER_EQUAL functionality to all zen config directories * Cleanup and addressing review comments * Update config/zen/amd_config.cmake Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Updates to support x.y.z or x_y_z versioning --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-02-20 13:58:13 +00:00
Rayan, Rohan	c315766c8d	Fixing undefined behavior in bli_arch_log * Fixing a potential access of unallocated memory in bli_arch_log() --------- AMD-Internal: CPUPL-7995 Co-authored-by: Rohan Rayan <rohrayan@amd.com>	2026-02-18 13:36:18 +05:30
Smyth, Edward	011c75dddb	Remove unnecessary OpenMP include (AOCL) Copy of similar change in upstream BLIS (843a5e8) to fix issues https://github.com/flame/blis/issues/873 and https://github.com/amd/blis/issues/50 Details: - Previously, `<omp.h>` was included in `bli_thrcomm_openmp.h` so that the framework could access the necessary OpenMP functions. - As @melven reported (#873), this causes issues when `blis.h` is included in C++ code since the `<omp.h>` include happens with `extern "C"`. - Move the include from the header to the necessary .c files so that it does not "pollute" `blis.h`. Thanks to @DaAwesomeP and @bartoldeman for reporting this issue in AOCL BLIS AMD-Internal: [CPUPL-7303] AOCL-Feb2026-b2	2026-02-06 10:41:38 +00:00
Smyth, Edward	8310b2d5d3	Optimize bli_arch_query_id and related functions bli_arch_query_id() is used to select kernels in optimized BLAS APIs. Previous implementation incurred the overhead of multiple function calls. This has been reduced by: - Changing the function to be defined in a header file so it can be inlined. - Avoiding call to bli_arch_check_id_once that was a wrapper for a call to bli_pthread_once. Instead bli_pthread_once is called directly. - For builds with a single BLIS sub-configuration, correct arch_id is taken directly from a header file in the corresponding config subdirectory, avoiding the bli_pthread_once call and making the value explicit at compile time, which may enable additional optimizations. To enable these changes, the variables arch_id and model_id defined in frame/base/bli_arch.c are no longer static, as they must be accessed in multiple files (i.e. they are now global variables). Rename to g_arch_id and g_model_id to distinguish from any locally defined arch_id or model_id variables.	2026-02-04 13:16:46 +00:00
Rayan, Rohan	ebf8721a5c	Optimizing sgemm rd kernels on zen3 (#293 ) Fixing some inefficiencies on the zen (AVX2) SUP RD kernel for SGEMM. After performing the iteration for the 8 loop, the next loop that was being performed was the 1 loop for the k-direction. This caused a lot of unnecessary iterations when the remainder of k < 8. This has been fixed by introducing masked operations for k < 8 When remainder of k == 1, we handle this with the original non-masked code (with a branch) as the masked code introduces more penalty because of the masking operation. There were also some unnecessary instructions in the zen4 kernels which have been removed. AMD-Internal: https://amd.atlassian.net/browse/CPUPL-7775 Co-authored-by: rohrayan@amd.com	2026-02-04 09:08:11 +05:30
Chandrashekara K R	50ae5a05ef	Updated version string from 5.1.1 to 5.2.2	2026-02-02 18:13:01 +05:30
S, Hari Govind	ec6f4e96cd	Replace intrinsics with inline assembly for bli_saxpyv_zen4_int and bli_saxpyf_zen_int_5 GCC over-optimizes intrinsics code by reordering and interleaving instructions, making it difficult to verify correctness and causing potential accuracy issues in certain cases. This change replaces intrinsics-based implementations with inline assembly to ensure one-to-one mapping between source and generated assembly. Changes: - bli_saxpyv_zen4_int: Converted AVX-512 intrinsics to inline assembly * Processes blocks of 128, 64, 32, 16, and 8 elements * Handles fringe cases with masked operations * Preserves scalar path for non-unit strides - bli_saxpyf_zen_int_5: Converted AVX2 intrinsics to inline assembly * Processes blocks of 16 and 8 elements with 5-way fusion * Handles fringe cases with masked operations * Preserves scalar path for non-unit strides Benefits: - Predictable code generation with no compiler reordering - Better numerical accuracy by preventing unexpected transformations - Easier verification of generated assembly against specifications - Explicit control over instruction sequence and register allocation	2026-01-29 11:48:47 +05:30
Smyth, Edward	3b8aca0874	GTestSuite: Misc fixes (3) Various changes: - Correct signature of get_random_matrix call in some level2 APIs. - Move RNG seed to header to allow it to be used elsewhere in the code - Remove unused variable in ref_gbmv.cpp - Fix seed for all calls to rand() - Correct arguments in calls to matrix setup and computediff calls, especially for CBLAS row-major calls - Add missing if statements in tests of input arguments - Removed unused alpha argument from tbmv and tbsv - Enable nan_inf check when testing input args like alpha and beta - Also some corrections to testing input matrices and vectors AMD-Internal: [CPUPL-7386]	2026-01-23 17:23:52 +00:00
Smyth, Edward	dd66dfff50	cblas_ctrmm invalid diag fix Error handling code for invalid diag argument in Col major path was incorrect in cblas_ctrmm compared to other invalid argument checks and other data type variants. AMD-Internal: [CPUPL-7303]	2026-01-23 16:51:35 +00:00
Dave, Harsh	b510d06cc8	Tuned input threshold for tiny dgemm interface (#309) * Tuned input threshold for tiny dgemm interface - Added upper limit check for M dimension to avoid cache thrashing. - Added required buffer size check needed while packing A matrix. AMD-Internal : [CPUPL-7915] --------- Co-authored-by: harsdave <harsdave@amd.com>	2026-01-22 20:11:06 +05:30
Balasubramanian, Vignesh	73911d5990	Updates to the build systems(CMake and Make) for LPGEMM compilation (#303 ) - The current build systems have the following behaviour with regards to building "aocl_gemm" addon codebase(LPGEMM) when giving "amdzen" as the target architecture(fat-binary) - Make: Attempts to compile LPGEMM kernels using the same compiler flags that the makefile fragments set for BLIS kernels, based on the compiler version. - CMake: With presets, it always enables the addon compilation unless explicitly specified with the ENABLE_ADDON variable. - This poses a bug with older compilers, owing to them not supporting BF16 or INT8 intrinsic compilation. - This patch adds the functionality to check for GCC and Clang compiler versions, and disables LPGEMM compilation if GCC < 11.2 or Clang < 12.0. - Make: Updated the configure script to check for the compiler version if the addon is specified. CMake: Updated the main CMakeLists.txt to check for the compiler version if the addon is specified, and to also force-update the associated cache variable update. Also updated kernels/CMakeLists.txt to check if "aocl_gemm" remains in the ENABLE_ADDONS list after all the checks in the previous layers. AMD-Internal: [CPUPL-7850] Signed-off by : Vignesh Balasubramanian <Vignesh.Balasubramanian@amd.com> AOCL-Jan2026-b2	2026-01-16 19:39:55 +05:30
Smyth, Edward	9f9bfbed7f	GTestSuite: Banded APIs (gbmv, hbmv, sbmv, tbmv, tbsv) Create gtestsuite programs for banded matrix APIs. Priority is to create framework with a basic set of tests - refinement of problem sizes can be investigated later. AMD-Internal: [CPUPL-7386]	2026-01-16 12:37:47 +00:00
Smyth, Edward	c32247678c	GTestSuite: Packed APIs (hpmv, spmv, tpmv, tpsv) Create gtestsuite programs for subset of packed matrix APIs. Priority is to create framework with a basic set of tests - refinement of problem sizes can be investigated later. AMD-Internal: [CPUPL-7386]	2026-01-16 12:08:36 +00:00
Smyth, Edward	72e0c001f2	GTestSuite: Packed APIs (hpr, hpr2, spr, spr2) Create gtestsuite programs for subset of packed matrix APIs. Priority is to create framework with a basic set of tests - refinement of problem sizes can be investigated later. AMD-Internal: [CPUPL-7386] Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-01-16 10:25:27 +00:00
Smyth, Edward	bd99d6cd92	GTestSuite: Misc fixes (2) Various changes: - Fixed undeclared variables in her, her2, syr and syr2 IIT_ERS tests - Correct typos in comments AMD-Internal: [CPUPL-7386]	2026-01-16 00:33:16 +00:00
Sharma, Shubham	824e289899	Tuned decision logic for DGEMV multithreading for skinny sizes. (#301 ) AMD-Internal: [CPUPL-7769]	2026-01-14 12:08:46 +05:30
Rayan, Rohan	9cbb1c45d8	Improving sgemm rd kernel on zen4/zen5 (#292) Fixing some inefficiencies on the zen4 SUP RD kernel for SGEMM. The loops for the 8 and 1 iteration of the K-loop were performing loads on ymm/xmm registers and computation on zmm registers. This caused multiple unnecessary iterations in the kernel for matrices with certain k-values. Fixed by introducing masked loads and computations for these cases. AMD-Internal: https://amd.atlassian.net/browse/CPUPL-7762 Co-authored-by: Rohan Rayan <rohrayan@amd.com>	2025-12-17 18:48:50 +05:30
Vlachopoulou, Eleni	504ac9d8a2	CMake: Adding targets and aliases so that blis works with fetch content (#179) * Adding targets and aliases so that blis works with fetch content * Using PUBLIC instead of INTERFACE * Using BLIS instead of blis and adding BLAS in the targets * Fixing installation paths to be the same as before * Adding documentation for FetchContent() AOCL-Weekly-121225	2025-12-10 13:02:09 +00:00
Vlachopoulou, Eleni	1d80d5fee4	Fixing doc about building bench (#290 )	2025-12-10 12:07:50 +00:00
Rayan, Rohan	a22e0022c2	SGEMM tiny path tuning for zen4 and zen5 (#267 ) * Adding a model to determine which matrices enter the SGEMM tiny path * This extends the sizes of matrices that enter the tiny path, which was constrained to the L1 cache size previously * Now matrices that fit in L2 are also allowed into the tiny path, provided they are determined to be faster than the SUP path * Adding thresholds based on the SUP path sizes * Added for Zen4 and Zen5 --------- AMD-Internal: CPUPL-7555 Co-authored-by: Rohan Rayan <rohrayan@amd.com>	2025-12-10 15:58:54 +05:30
KR, Chandrashekara	b06b55e864	Update LICENSE and NOTICES files for AOCL-5.2 release (#285 )	2025-12-10 11:25:02 +05:30
Varaganti, Kiran	bbb7edcb22	thread: free global communicator after parallel region completes in p… * thread: free global communicator after parallel region completes in pthreads decorator Avoid potential data race by deferring free until all threads have joined. Previously, chief thread could free inside while non-chief threads still held pointers. Now, frees after the parallel region, following barrier and joins. Files: - frame/thread/bli_l3_sup_decor_pthreads.c - frame/thread/bli_l3_decor_pthreads.c * AMD-Internal: [CPUPL-7694]	2025-12-09 19:15:52 +05:30
Smyth, Edward	eff1b561c5	GTestSuite: Misc fixes - Move asumv and nrm2 testinghelpers files from util to level1 (missed in commit `0923d8ff56`) - Correct spelling mistakes and references to incorrect arguments in comments in various files - Correct comments listing invalid input tests in syr_IIT_ERS.cpp and her_IIT_ERS.cpp - Fix incorrect use of M in symv_IIT_ERS.cpp and hemv_IIT_ERS.cpp AMD-Internal: [CPUPL-7386] AOCL-Weekly-051225	2025-12-05 17:23:47 +00:00

3914 Commits