3914 Commits

Author SHA1 Message Date
S, Hari Govind
958dd9a7a5 Enable Exploit Mitigations for Windows (clang-cl) Builds
Enable Control Flow Guard (CFG) and Control-flow Enforcement Technology
(CET) hardening for Windows clang-cl builds in the CMake build system.
Gated by the ENABLE_SECURITY_FLAGS option (default ON).

Flags:
  Compiler: /guard:cf /guard:ehcont
  Linker:   /GUARD:CF /GUARD:EHCONT /cetcompat

Toolchain gating:
  Requires CMAKE_C/CXX_COMPILER_ID == Clang AND
           CMAKE_C/CXX_COMPILER_FRONTEND_VARIANT == MSVC.
  Native cl.exe and other MSVC-frontend toolchains fall through to a
  warning and skip the hardening flags.

Scope of linker flags:
  /cetcompat is applied to blis_shared only (not propagated via
  INTERFACE from blis_static) to avoid enabling hardware shadow-stack
  enforcement on downstream consumer EXEs that include non-CET-clean
  paths (e.g. the testsuite's setjmp/longjmp xerbla code on zen4/zen5).

Configuration:
  cmake -DENABLE_SECURITY_FLAGS=OFF ...   # disable

Files changed:
  blis/CMakeLists.txt
AOCL-20260601
2026-05-19 14:26:54 +05:30
Smyth, Edward
4ee6f75292 GCC 16 fixes
Changes to fix errors and warnings when using gcc 16.1.0:
- Copy changes from 5c2b22da81 in upstream BLIS to extend disabling of
  tree-vectorization in affected kernels to gcc 16 and later.
- Remove unused variables in bli_packm_blk_var1_md.c and bli_util_unb_var1.c
  to fix warning messages.

Background
bp (base pointer) is the %rbp/%ebp register on x86/x86-64. Inline assembly
kernels in BLIS use asm volatile blocks where they manually manage registers
- including saving and restoring bp themselves to use it as a general-purpose
register for holding loop counters or matrix pointers.

When GCC's tree-vectorizer (specifically the superword-level parallelism (SLP)
pass) runs on a translation unit containing inline asm, it can generate code
that itself needs bp as a frame pointer or in the vectorized prologue/epilogue.
At that point GCC internally marks bp as unavailable and then, when it tries to
compile the inline asm block that also references bp, it throws an error.

As a workaround, disabling tree vectorization for the entire file removes the
conflict - with no vectorizer-generated code, bp stays free for the inline asm.
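
A minimal sketch of how such a per-file workaround can be expressed, assuming it is applied with a GCC `optimize` pragma. The exact mechanism and version guard used in 5c2b22da81 are not shown in this message, and `toy_scale` is a hypothetical stand-in for a kernel source file.

```c
/* Assumption: tree vectorization is switched off for the whole file with a
   pragma. The ">= 16" test mirrors the commit title; the guard used in the
   real kernels may differ. Clang is excluded because it also defines __GNUC__. */
#if defined(__GNUC__) && !defined(__clang__) && (__GNUC__ >= 16)
#pragma GCC optimize ("no-tree-vectorize")
#endif

/* Hypothetical stand-in for kernel code: a plain loop that the
   vectorizer would otherwise be free to transform. */
void toy_scale(double *x, int n, double alpha)
{
    for (int i = 0; i < n; ++i)
        x[i] *= alpha;
}

/* Small driver: scales {1, 2, 3, 4} by 2 and returns the sum. */
double toy_scale_demo(void)
{
    double x[4] = { 1.0, 2.0, 3.0, 4.0 };
    toy_scale(x, 4, 2.0);
    return x[0] + x[1] + x[2] + x[3];
}
```

The pragma only changes how the file is optimized, so results are unaffected; the same effect could also be obtained by passing `-fno-tree-vectorize` for the affected files from the build system.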
AOCL-20260502
2026-05-18 15:54:29 +01:00
Dave, Harsh
c845e41b38 Pack B matrix for zgemm conjugate input in SUP path (#362)
* Pack B matrix for zgemm conjugate input in SUP path

- B matrix is packed for zgemm input where B matrix is conjugate transpose.

AMD-Internal: CPUPL-8274

---------

Co-authored-by: harsdave <harsdave@amd.com>
2026-04-21 11:12:52 +05:30
Dave, Harsh
07fd52823a Fix: Resolve label redefinition errors in zgemm sup kernel (#359)
- Resolves label collision errors across assembly blocks of zgemm sup
  kernels.

AMD-Internal: [CPUPL-8336]

Co-authored-by: harsdave <harsdave@amd.com>
AOCL-20260402
2026-04-16 16:51:55 +05:30
S, Hari Govind
cdd181a7d7 Optimize complex scalv kernels with inline assembly and FMA instructions
This PR optimizes the complex scalar vector multiplication kernels by replacing
intrinsics with inline assembly and leveraging FMA (Fused Multiply-Add) instructions
for improved performance.

Changes:
- Replaced intrinsic-based implementation with inline assembly
- Utilizes `vfmaddsub231ps`, `vfmadd231ss`, and `vfmsub231ss` FMA instructions
- Improved instruction scheduling and register usage
- Handles both unit-stride (vectorized) and non-unit-stride (scalar) cases
- Processes up to 16 complex elements per iteration in the main loop
2026-04-13 16:06:02 +05:30
Rayan, Rohan
c8d0259ed4 Fixing memory issue in the cgemm pack kernels on zen4
Fixing a memory issue in the cgemm zen4 packing kernel.
In the loop section that handles the leftover m and k iterations, the load operations (in the k-direction) were missing the mask instructions, which have now been added.

AMD-Internal: CPUPL-8189
Co-authored-by: Rohan Rayan <rohrayan@amd.com>
2026-03-30 11:08:34 +05:30
Dave, Harsh
0b68dfc5ed The zgemm tiny and sup paths currently only support
nn, tn, tt, nt, cn, ct, nc, tc.

Previously, the error-handling logic only explicitly checked
for and rejected cc and hh cases. This patch extends that
check to include ch and hc configurations, ensuring they
correctly return a failure/fallback instead of proceeding
with unsupported kernels.
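
A minimal sketch of the check described above, written as a whitelist of the eight supported combinations rather than the reject list the patch extends. The function name is hypothetical, and the letters are the ones used in this message (first letter for transa, second for transb).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true only for the transa/transb pairs that
   the message lists as supported by the zgemm tiny and sup paths. */
bool toy_zgemm_tiny_sup_supported(char transa, char transb)
{
    static const char *const supported[] = {
        "nn", "tn", "tt", "nt", "cn", "ct", "nc", "tc"
    };
    for (size_t i = 0; i < sizeof supported / sizeof supported[0]; ++i) {
        if (supported[i][0] == transa && supported[i][1] == transb)
            return true;
    }
    /* cc, hh, ch, hc (and anything else) fall back to another path. */
    return false;
}
```

A whitelist has the property that a combination nobody thought of is rejected by default, which is the failure mode this patch addresses for ch and hc.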

AMD-Internal: [CPUPL-8151]

Co-authored-by: harsdave <harsdave@amd.com>
2026-03-25 14:45:54 +05:30
V, Varsha
d995496ac1 BugFix: BF16 AVX2 fallback GEMV m=1 path for reordered B inputs
- For the BF16 GEMV m=1 fallback path with a reordered B matrix,
  no panel adjustment was applied after the kernel execution.
- When the input size exceeds the panel boundary, this caused
  wrong panel accesses and incorrect results. The missing panel
  adjustment has been added.

[ AMD-Internal : CPUPL-8201 ]
2026-03-25 12:34:51 +05:30
Rayan, Rohan
d512e3a736 Converting mul+add to FMA for ddot, daxpy and daxpyf zen kernels
- Replace mul+add with FMA for ddot, daxpy and daxpyf
- Using masked operations where possible
- Non-unit stride code paths still use scalar loops, but use FMAs for accuracy

AMD-Internal: CPUPL-8055
Co-authored-by: Rohan Rayan <rohrayan@amd.com>
2026-03-20 11:07:12 +05:30
Smyth, Edward
6459b66b48 GTestSuite: Fix for ZGEMM tiny tests on Zen3 and earlier
Non-AVX512 variants of ZGEMM tiny code path do not support as many
problem sizes and transa/transb options. Skip unsuitable tests based
on the return value from the bli_zgemm_tiny call.

Also remove unused local variables.

AMD-Internal: [CPUPL-7303]
2026-03-18 18:11:56 +00:00
Smyth, Edward
f632492b91 GTestSuite: Refine tests for GEMM
Update problem sizes and other parameters for general GEMM and
GEMMT tests, and dgemmt EVT tests, with the aim of reducing the
runtime and making tests more practical for use in pre-submit CI jobs.

AMD-Internal: [CPUPL-7386]
2026-03-17 10:58:27 +00:00
Vlachopoulou, Eleni
d93c8a7b58 GTestSuite: cleanup of disabled tests (#323)
* Cleanup of disabled tests

* Updating ger evt tests to identify the failing cases

* Updating sgemv.evt tests to separate known failures and disable those tests

* Update threshold in dtpsv

* disabling dgemm_kernel tests that fail

* disabling sgemm_kernel tests that fail

* disabling cgemv_evt tests that fail

* disabling cgemv_evt and zgemv_evt tests that fail

* Fixing typo

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Adding comment on dtpsv threshold adjustment

* Update gtestsuite/testsuite/level2/ger/cger/evt/cger_evt.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update gtestsuite/testsuite/level2/ger/zger/evt/zger_evt.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update gtestsuite/testsuite/level2/tpsv/dtpsv/dtpsv_generic.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update gtestsuite/testsuite/level2/tpsv/dtpsv/dtpsv_generic.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update gtestsuite/testsuite/level2/ger/zger/evt/zger_evt.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update gtestsuite/testsuite/level2/ger/cger/evt/cger_evt.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-03-16 12:04:39 +00:00
Rayan, Rohan
af4c3ef1a5 Fixing memory issues in sgemm SUP kernels on AVX2 and AVX512
* Resolved memory-access issues in the SGEMM SUP kernels on AVX2 and AVX-512 by correcting instructions that could read invalid addresses in the C matrix.
* Removed k=0 kernel gtests for the native SGEMM and DGEMM paths as these tests caused spurious failures for kernels that are not intended to handle this case.
* Standardized all instruction macros to lowercase in the Zen4 kernel to improve readability and code consistency.

---------

AMD-Internal: CPUPL-8117
Co-authored-by: Rayan <rohrayan@amd.com>
2026-03-16 16:05:34 +05:30
Vettickal Sen, Anuraj
4a9af35bf4 Bitexactness CRC verification and per-test JSON output (#320)
* Bitexactness CRC verification and per-test JSON output

* Remove redundant BLIS_TEST_SEED random seed utilities

The random_seed_utils.h and BLIS_TEST_SEED environment variable are unnecessary since the codebase already ensures deterministic random number generation via RANDOM_POOL_SEED and SRAND_SEED constants hardcoded in testing_helpers.h.

* Add CRC support for integer/char computediff and cache env var checks

Add CRC calculation and binary output to the gtint_t and char specializations of computediff, matching the pattern used by all other overloads. Char values are widened to gtint_t for safe uint32_t-aligned CRC access.

Cache BLIS_ENABLE_CRC and BLIS_ENABLE_BINARY_OUTPUT env var lookups via static const bool in is_crc_enabled() and is_binary_output_enabled(). Guard all CRC/binary blocks in computediff with is_any_verification_enabled() so the common disabled path is a single static bool read with zero allocations.

* Address PR review comments and refactor computediff CRC blocks

Refactor: Extract duplicated CRC/binary-output blocks from all 8 computediff overloads into verify_vector_data and verify_matrix_data helpers in blis_test_utils namespace.

Bug fixes from PR review: add missing includes (cstdlib, utility), enforce MAX_OUTPUT_SIZE_BYTES limit with integer overflow guard, add buffer validation in all CRC generation functions, add default case to FLA_GET_DATATYPE_FACTOR macro, replace deprecated test_case_name() with test_suite_name(), add MAKE_DIRECTORY error checking in CMake, and update copyright years to 2026.

* Refactored crc_utils based on review comments.

* binary_output_utils.h cleanup.

* Address PR review comments: remove unused functions and fix copyright years.

Remove unused generate_crc_matrix, generate_crc_matrix_no_nb_diag,
generate_crc_matrix_no_nb_diag_with_storage, and
calculate_and_print_matrix_crc from crc_utils.h.
Remove unused calculate_and_print_matrix_hash from check_error.h.
Fix copyright year to 2026 only in crc_utils.h and binary_output_utils.h.
Remove (Performance) label from CRC heading in README.md.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix for review comments.

* Address review comments: rename verify to collect, consistent void returns, remove filename prefix

- Rename verify_vector_data/verify_matrix_data to collect_vector_data/
  collect_matrix_data since these functions only collect CRC and binary
  output data without performing comparison.
- Make return types consistent: change calculate_and_print_crc,
  calculate_and_print_matrix_crc_with_storage, format_and_record_crc,
  and write_comparison_outputs to return void since return values were
  never used.
- Remove redundant test_output_ prefix from generate_binary_filename
  to avoid duplication with the blis_test_outputs/ directory.
- Remove unused utility include from binary_output_utils.h.
- Update README wording from compiled out to disabled.

Made-with: Cursor

* Fix strict aliasing, use if constexpr, zero-pad CRC hex, separate feature guards

- Replace reinterpret_cast<uint32_t*> with memcpy-based read_uint32()
  helper to avoid strict-aliasing UB on float/double/complex buffers.
  Produces identical CRC values.
- Use if constexpr(CRC_ENABLED) instead of runtime if(!CRC_ENABLED) to
  prevent CRC template instantiation when ENABLE_CRC is off.
- Zero-pad CRC hex output to 8 digits for stable downstream comparison.
- Separate ENABLE_CRC and ENABLE_BINARY_OUTPUT preprocessor guards in
  verification_utils.h so each feature is compiled independently.
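
A minimal sketch of two of the points above: reading 32-bit words through memcpy instead of a pointer cast, and printing a CRC as 8 zero-padded hex digits. The function names are hypothetical, and lowercase hex is an assumption since the message does not state the case used.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Copy the word_index-th 32-bit word out of an arbitrary buffer.
   memcpy avoids the strict-aliasing violation of casting a float,
   double, or complex pointer to uint32_t*. */
uint32_t toy_read_uint32(const void *buf, size_t word_index)
{
    uint32_t w;
    memcpy(&w, (const unsigned char *)buf + word_index * sizeof w, sizeof w);
    return w;
}

/* Format a CRC as exactly 8 hex digits; out must hold at least 9 bytes. */
const char *toy_format_crc(uint32_t crc, char out[9])
{
    snprintf(out, 9, "%08" PRIx32, crc);
    return out;
}

/* Demo helper: the raw bits of a float (IEEE-754 binary32 assumed). */
uint32_t toy_float_bits(float f)
{
    return toy_read_uint32(&f, 0);
}
```

Compilers typically turn the fixed-size memcpy into a single load, so the helper should cost the same as the cast while keeping the behavior defined.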

Made-with: Cursor

* Handle write_binary_output return values in write_comparison_outputs

Capture the bool return values from write_binary_output and, on failure, log a warning to stdout and record the error as a GTest property. This keeps binary output as a non-fatal diagnostic aid while ensuring return values are explicitly used.

Made-with: Cursor

---------

Co-authored-by: Anuraj <avettick@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-03-13 15:12:17 +05:30
Dave, Harsh
48a6db6c69 add support for conjugate transpose in avx512 zgemm sup kernel (#300)
* ZGEMM SUP: Add conjugate support for AVX-512 kernels on Zen4/Zen5/Zen6

- Add CONJA, CONJB and CONJA_CONJB variants to zgemm SUP micro-tiles
- Enable SUP path for conjugate cases when both are same type
- Unify RRC/CRC storage to use CV kernel variant
- Update SUP dispatch to handle conjugate flags correctly

Note: CONJ_NO_TRANSPOSE + CONJ_NO_TRANSPOSE and
      CONJ_TRANSPOSE + CONJ_TRANSPOSE remain unsupported

---------

Co-authored-by: harsdave <harsdave@amd.com>
2026-03-12 19:00:53 +05:30
Dave, Harsh
fbd45e8eab optimize edge and non-unit stride path with intrinsics in s/c axpyv kernels (#334)
* optimize edge and non-unit stride path with intrinsics in s/c axpyv kernels

- Updated the edge and non-unit stride path in c/saxpy to use intrinsics.
- This ensures that the edge and non-unit stride cases maintain numerical
   consistency with the optimized Zen assembly/intrinsic path.

---------

Co-authored-by: harsdave <harsdave@amd.com>
2026-03-12 17:20:44 +05:30
Request, Osi
c8a5b21b46 configure: follow reproducible-builds spec for SOURCE_DATE_EPOCH
- When SOURCE_DATE_EPOCH is set, it contains a unix timestamp to be used
  in place of the current datetime during builds to allow bit-for-bit
  reproducible builds to be produced.
- Added support for this behavior in the CMake build system as well.
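
A minimal C sketch of the behavior described above, for illustration only: the actual change lives in the configure script and CMake files. The function names are hypothetical. Falling back to the current time on a malformed value is a simplification made here; the specification suggests failing the build instead.

```c
#include <errno.h>
#include <stdlib.h>
#include <time.h>

/* Parse a SOURCE_DATE_EPOCH-style value (decimal seconds since the Unix
   epoch). Returns fallback when the value is unset, empty, or malformed. */
time_t toy_parse_epoch(const char *value, time_t fallback)
{
    if (value == NULL || *value == '\0')
        return fallback;

    char *end = NULL;
    errno = 0;
    long long secs = strtoll(value, &end, 10);
    if (errno != 0 || *end != '\0' || secs < 0)
        return fallback;

    return (time_t)secs;
}

/* Timestamp to embed in the build: the environment value when present,
   otherwise the current time. */
time_t toy_build_timestamp(void)
{
    return toy_parse_epoch(getenv("SOURCE_DATE_EPOCH"), time(NULL));
}
```

With the variable exported (for example to the date of the last commit), two builds of the same source embed the same timestamp.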

See also:
https://reproducible-builds.org/docs/source-date-epoch/
https://reproducible-builds.org/specs/source-date-epoch/

Co-authored-by: Luna <git@lunnova.dev>
Co-authored-by: KR, Chandrashekara <Chandrashekara.KR@amd.com>
2026-03-10 18:28:46 +05:30
Smyth, Edward
23b48bb999 Enable support for OpenMP 2.5 and earlier
Add compatibility for OpenMP implementations (e.g., MSVC, older GCC)
that lack functions introduced in OpenMP 3.0 i.e. omp_get_active_level()
and omp_get_max_active_levels(). On these compilers, the tests instead
are based on the older omp_get_nested() functionality.
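
A minimal sketch of this kind of version switch, assuming the `_OPENMP` date macro is used to detect OpenMP 3.0 (200805). The function name is hypothetical and the check shown is not necessarily the one BLIS performs.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns 1 if the runtime allows nested parallel regions, 0 otherwise. */
int toy_nested_parallelism_enabled(void)
{
#if !defined(_OPENMP)
    /* Built without OpenMP: no parallel regions at all. */
    return 0;
#elif _OPENMP >= 200805
    /* OpenMP 3.0 and later: more than one active level means nesting. */
    return omp_get_max_active_levels() > 1;
#else
    /* OpenMP 2.5 and earlier: only the on/off nested flag exists. */
    return omp_get_nested() != 0;
#endif
}
```

Because the branch is selected by the preprocessor, a compiler that only ships the OpenMP 2.5 API never sees a reference to the 3.0 functions.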

Thanks to @tony-davis for highlighting this issue.

AMD-Internal: [CPUPL-7303]
2026-03-06 09:34:17 +00:00
Varaganti, Kiran
bb6545a46b Added new thread control API with global and thread-local variants
CPUPL-7578: New thread control API with global and thread-local variants

Summary: Add new BLIS thread control APIs that provide fine-grained control over threading with proper global and thread-local (TLS) semantics. Fix several correctness issues where set_num_threads() and set_ways() did not properly override each other's state.

New/Modified APIs:

- bli_thread_set_num_threads(): Sets thread count globally (updates both global_rntm and tl_rntm)
- bli_thread_set_num_threads_local(): Sets thread count for calling thread only (tl_rntm)
- bli_thread_get_num_threads(): Returns effective thread count, deriving from ways if set
- bli_thread_reset(): Resyncs tl_rntm from global_rntm
- bli_thread_set_ways(): Sets loop factorization (jc, pc, ic, jr, ir)
- bli_thread_get_is_parallel(): Returns whether parallelism is enabled
- bli_thread_get_jc_nt/ic_nt/pc_nt/jr_nt/ir_nt(): Returns individual way values
- b77_thread_set_num_threads_local_(): Fortran-compatible wrapper

Bug fixes:

- bli_thread_set_num_threads() now clears ways (-1) and sets auto_factor=TRUE on both global_rntm and tl_rntm, so it properly overrides prior BLIS_JC_NT/BLIS_IC_NT environment settings
- bli_thread_set_ways() now propagates to global_rntm (inside mutex) and clears stale num_threads on both global_rntm and tl_rntm, so get_num_threads() returns the product of ways instead of a stale value
- Fix data race in bli_thread_init_rntm_from_global_rntm(): copy global_rntm under mutex before debug printing
- Fix data race in set_num_threads_local() debug print
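
The override semantics described above can be illustrated with a toy model. This is a sketch, not the BLIS implementation: every `toy_` name is hypothetical, the mutex around the global state is omitted, and it is an assumption that the thread-local setter also clears the ways.

```c
typedef struct {
    int num_threads;   /* -1 means "not set" */
    int ways[5];       /* jc, pc, ic, jr, ir; -1 means "not set" */
} toy_rntm_t;

#define TOY_UNSET { -1, { -1, -1, -1, -1, -1 } }

static toy_rntm_t toy_global_rntm = TOY_UNSET;
static _Thread_local toy_rntm_t toy_tl_rntm = TOY_UNSET;

static void toy_clear_ways(toy_rntm_t *r)
{
    for (int i = 0; i < 5; ++i)
        r->ways[i] = -1;
}

/* Global setter: clears ways and updates both global and thread-local state. */
void toy_set_num_threads(int nt)
{
    toy_global_rntm.num_threads = nt;
    toy_clear_ways(&toy_global_rntm);
    toy_tl_rntm = toy_global_rntm;
}

/* Local setter: touches only the calling thread's state (assumed to clear ways). */
void toy_set_num_threads_local(int nt)
{
    toy_tl_rntm.num_threads = nt;
    toy_clear_ways(&toy_tl_rntm);
}

/* Ways setter: propagates to both states and clears the stale thread count. */
void toy_set_ways(int jc, int pc, int ic, int jr, int ir)
{
    const int w[5] = { jc, pc, ic, jr, ir };
    for (int i = 0; i < 5; ++i)
        toy_global_rntm.ways[i] = toy_tl_rntm.ways[i] = w[i];
    toy_global_rntm.num_threads = toy_tl_rntm.num_threads = -1;
}

/* Effective count: product of ways when they are set, else num_threads. */
int toy_get_num_threads(void)
{
    if (toy_tl_rntm.ways[0] != -1) {
        int product = 1;
        for (int i = 0; i < 5; ++i)
            product *= toy_tl_rntm.ways[i];
        return product;
    }
    return toy_tl_rntm.num_threads;
}

/* Reset: resync the thread-local state from the global state. */
void toy_reset(void)
{
    toy_tl_rntm = toy_global_rntm;
}
```

In this model a local setting shadows the global one until `toy_reset()` is called, and whichever of the thread-count and ways setters ran last determines the value reported by `toy_get_num_threads()`.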

Test suite (43 tests, 106 assertions):

- test_thread_control.c (OpenMP, 23 tests): environment inheritance, global propagation, thread-local isolation, local precedence, per-thread local, reset, nested parallel, edge cases, set_ways, is_parallel, concurrent updates, DGEMM with threads, interleaved settings, persistence, parallel DGEMM, thread pool, reset-to-sync, env ways vs set_num_threads, ways→set_nt→reset, ways→local→reset, round-trip, set_nt→set_ways override, set_ways propagation to new threads
- test_thread_control_pthread.c (pthread, 20 tests): equivalent coverage plus concurrent set/reset race condition test, set_nt→set_ways override, set_ways propagation via pthread_create

Files changed (9 files, +2630/-29 lines):

- bli_thread.c: Core API implementations and fixes
- bli_thread.h: New function declarations
- b77_thread.c: Fortran wrapper
- test_thread_control.c: OpenMP test suite (23 tests)
- test_thread_control_pthread.c: pthread test suite (20 tests)
- TEST_THREAD_CONTROL_README.md: Documentation

AMD-Internal: CPUPL-7578
2026-03-06 12:16:17 +05:30
Smyth, Edward
cf2de1e7e6 Fix for undefined arch and model id symbols
Commit 8310b2d5d3 added new functions and global variables in
blis.h intended only for internal use. These were causing
missing symbol problems when blis.h is included in C
applications as they are not exported from the shared library.
Use BLIS_IS_BUILDING_LIBRARY and BLIS_CONFIGURETIME_CPUID
preprocessor definitions to only expose these when compiling BLIS
and not when using it.

AMD-Internal: [CPUPL-8091]
2026-03-05 14:01:00 +00:00
Smyth, Edward
05e837d176 BLIS: Implement zen6 sub-configuration
Implement zen6 cpuid and arch changes, and add zen6 as a
separate BLIS sub-configuration and code path within amdzen
configuration family. Currently all optimization choices are
copies of zen5 sub-configuration.

AMD-Internal: [CPUPL-7162]
2026-03-05 13:33:56 +00:00
Smyth, Edward
e62d246789 Misc AMD CPUID improvements (#222)
Changes to simplify AMD CPUID functionality:
- Variable "features" is limited in size as each bit represents a
  specific hardware function. Move detection of FP datapath
  width to a separate variable. Also mask the FP datapath bits
  explicitly for a more reliable test.
- Add detection of facility to downgrade FP512 datapath to FP256.
- bli_cpuid_is_avx512_fallback function does not exist, so remove header
  definition.

AMD-Internal: [CPUPL-7303]
2026-03-05 11:59:56 +00:00
Varaganti, Kiran
713b09b407 Remove unnecessary barrier in sup path decorator to fix ~10% DGEMM regression
The bli_thread_barrier(thread) call before bli_l3_sup_thrinfo_free() in
bli_l3_sup_thread_decorator() was added by analogy with the conventional
path's PR #702 fix, but is not needed in the sup (small/unpacked) path.

In the conventional path, pack buffers are cached in the control tree
(cntl_t->pack_mem) and freed in the decorator after func() returns. A
barrier is required there to prevent a fast chief from releasing a pack
buffer back to the PBA pool while slower peers in a different sub-group
still read from it.

The sup path does not have this problem because:

1. Pack buffers are stack-local variables (mem_t in var2m), freed inside
   func() by packm_sup_finalize_mem() after internal loop barriers.
   They are never freed in this decorator.

2. The global communicator (gl_comm) is freed outside the parallel
   region, protected by the implicit OpenMP barrier at the closing
   brace of the parallel construct.

3. Sub-group communicators (created when packa/packb is enabled) are
   freed only by the ochief thread in bli_thrinfo_free(). Non-chief
   threads never dereference the shared communicator — they only read
   their own ocomm_id and free_comm fields. When neither matrix is
   packed, no sub-communicators exist (ocomm=NULL, free_comm=FALSE).

The custom spin-wait barrier (bli_thread_barrier) is significantly
slower than the OpenMP runtime barrier at high thread counts, causing
a ~10% DGEMM performance regression at 96 threads on AMD EPYC Turin
(e.g. 11000x300x200 DGEMM).

Ref: https://github.com/flame/blis/pull/702
Resolves: [CPUPL-7979] [SWLCSG-3951] [LWPHPCENGG-622]
2026-03-05 11:44:57 +05:30
Sharma, Shubham
4e84bbfb68 Ensure Accumulation Consistency Across DOTXF and DOTXV kernels (#325)
APIs like GEMV use DOTXF (for the part of the problem that is a multiple of fuse_factor) and DOTXV (for the remaining part).

DOTXF and DOTXV use different numbers of temporary accumulation registers (rho).

This results in different round-offs, which can be significant when sizes are small and the problem is about equally divided between DOTXF and DOTXV.

To fix this, the number of temporary accumulators (and therefore round-offs) has been made identical across both kernels.
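
A minimal scalar sketch of why the accumulator count matters. `toy_dot` is a hypothetical helper, not one of the kernels: it spreads the products over `n_acc` partial sums and adds them together at the end, which changes the order of the floating-point additions.

```c
#include <stddef.h>

/* Dot product using n_acc partial sums (clamped to 1..8), combined in
   index order at the end. Different n_acc values reorder the additions. */
float toy_dot(const float *x, const float *y, size_t n, size_t n_acc)
{
    float rho[8] = { 0.0f };
    if (n_acc < 1) n_acc = 1;
    if (n_acc > 8) n_acc = 8;

    for (size_t i = 0; i < n; ++i)
        rho[i % n_acc] += x[i] * y[i];

    float sum = 0.0f;
    for (size_t j = 0; j < n_acc; ++j)
        sum += rho[j];
    return sum;
}
```

For exactly representable inputs the result does not depend on `n_acc`. For inputs such as x = {1e8f, 1.0f, -1e8f, 1.0f} with y all ones, one accumulator would be expected to give 1 and two accumulators 2, assuming arithmetic is carried out in single precision.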

Known related GCC bugs to reference

GCC Bug #56812 — incorrect code with vzeroupper and register allocation
GCC Bug #95483 — vzeroupper clobbers live values
GCC Bug #101617 — wrong code generation with AVX intrinsics and transitions

AMD-Internal:CPUPL-8015
2026-03-03 16:38:55 +05:30
Sharma, Shubham
3bbf12665c Ensure consistency across AVX2 and AVX512 AMAX kernels
Fix NaN handling in AVX2 amax kernel by initializing global max to 0

In the AVX2 kernel, the global maximum (curr_max_val) was previously
initialized to -1, while the local maximum (temp) is initialized to 0.

The kernel determines the maximum using the condition:
max(abs(x[i]), temp) > curr_max_val

Because curr_max_val started at -1, this initial check always evaluates
to true for the first iteration (0 > -1), which is incorrect if the
first element is a NaN.

If subsequent elements are valid numbers, curr_max_val eventually recovers
and updates to the correct value. However, if the entire array consists of
NaNs, this logic fails to properly update the trackers, meaning we never
correctly return index 0 as required by the BLAS standard.

To fix this and align the behavior with the AVX-512 kernel, curr_max_val
is now initialized to 0. This ensures that the initial condition evaluates
correctly and all-NaN arrays return the proper index (0 if all values are NaN).

Window start and end index are also updated to 0 which is the minimum valid
value of index.
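
A scalar sketch of the fixed initialization, for illustration only. The names are hypothetical, and the sketch does not reproduce the vector max instruction of the real kernel or how NaNs mixed with finite values are treated.

```c
#include <stddef.h>

/* Produce a NaN at run time without relying on <math.h>. */
double toy_nan(void)
{
    volatile double zero = 0.0;
    return zero / zero;
}

/* Index of the element with the largest absolute value. The running
   maximum starts at 0 (not -1), so comparisons against NaN, which are
   always false, leave the index at 0. */
size_t toy_amax_index(const double *x, size_t n)
{
    double curr_max_val = 0.0;
    size_t index = 0;

    for (size_t i = 0; i < n; ++i) {
        double a = (x[i] < 0.0) ? -x[i] : x[i];
        if (a > curr_max_val) {
            curr_max_val = a;
            index = i;
        }
    }
    return index;
}
```

For an all-NaN input no comparison succeeds, so the function returns index 0, which is the behavior this commit describes.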

AMD-Internal: CPUPL-8047
2026-03-03 13:01:52 +05:30
Rayan, Rohan
6f718982d6 SGEMM RV kernel optimization on Zen4
SGEMM RV kernel optimizations for Zen4:
Introduces new masked kernels handling edge cases efficiently. Also introduces better instruction selection for Zen4.

New Masked Kernels:

6x16m_mask: Handles n_left = [9-15] with masked ZMM operations
6x8m_mask: Handles n_left = [5-8] with masked YMM operations
6x4m_mask: Handles n_left = [2-4] with masked XMM operations
Corresponding edge-M kernels (4x*, 3x*, 2x*, 1x* variants)
Replaces multiple kernel calls with single masked kernel call
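
A minimal sketch of the selection implied by the ranges above. The function names are hypothetical, and values of n_left outside the listed ranges are reported as "not covered" because the message does not say how they are handled.

```c
#include <stdint.h>

/* Vector width, in float lanes, of the masked kernel for a given n_left:
   16 (ZMM) for 9-15, 8 (YMM) for 5-8, 4 (XMM) for 2-4, 0 if not listed. */
int toy_mask_kernel_lanes(int n_left)
{
    if (n_left >= 9 && n_left <= 15) return 16;
    if (n_left >= 5 && n_left <= 8)  return 8;
    if (n_left >= 2 && n_left <= 4)  return 4;
    return 0;
}

/* Lane mask with the low n_left bits set, as used by masked loads and
   stores to touch only the valid columns. */
uint16_t toy_lane_mask(int n_left)
{
    if (n_left <= 0)  return 0;
    if (n_left >= 16) return 0xFFFFu;
    return (uint16_t)((1u << n_left) - 1u);
}
```

For example, n_left = 12 selects the 16-lane kernel with mask 0x0FFF, so a single call covers columns that previously needed several narrower kernel calls.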

Instruction Cleanup:

Remove unnecessary branches and unused macros
Remove separate load->FMA instructions in favor of memory-operand FMAs
Replace cascading if-else chains with O(1) jump table dispatch

New Assembly Macros:

MOVSLQ - move a 32‑bit signed value into a 64‑bit register with sign‑extension
JUMP_TABLE(table_label, ...) - Defines the jump table
TABLE_ENTRY(label) - Defines a 4-byte relative offset entry in the table
LEA_RIP(label, reg) - RIP-relative address load
JMPI(reg) - Indirect jump through register
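
The idea behind the jump table can be sketched in C as a table of function pointers. This is only an analogy: the kernel uses assembly macros and 4-byte relative offsets, and all names below are hypothetical.

```c
/* Hypothetical edge handlers; each returns the number of rows it covers. */
static int toy_rows_1(void) { return 1; }
static int toy_rows_2(void) { return 2; }
static int toy_rows_3(void) { return 3; }
static int toy_rows_4(void) { return 4; }

/* Cascading if-else chain: the number of comparisons grows with m_left. */
int toy_dispatch_chain(int m_left)
{
    if (m_left == 1) return toy_rows_1();
    else if (m_left == 2) return toy_rows_2();
    else if (m_left == 3) return toy_rows_3();
    else if (m_left == 4) return toy_rows_4();
    return 0;
}

/* Table dispatch: one bounds check and one indirect call, whatever m_left is. */
int toy_dispatch_table(int m_left)
{
    static int (*const table[])(void) = {
        toy_rows_1, toy_rows_2, toy_rows_3, toy_rows_4
    };
    if (m_left < 1 || m_left > 4)
        return 0;
    return table[m_left - 1]();
}
```

Both functions select the same handler; the table version replaces a sequence of compare-and-branch steps with a single indexed lookup.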
---------

Co-authored-by: Rohan Rayan <rohrayan@amd.com>
2026-02-27 15:50:25 +05:30
Balasubramanian, Vignesh
237393ec71 Coverity fixes in LPGEMM group post-ops translator
- Updated the condition for pointer checks on scale
  factors for A and B matrices, in order to avoid
  'Dereference before' and 'Dereference after' null
  check issues.

- Also updated the symmetric quantization interfaces
  to have NULL check for post-ops pointer.

AMD-Internal: [CPUPL-7995]
Signed-off-by: Vignesh Balasubramanian <Vignesh.Balasubramanian@amd.com>
2026-02-25 17:21:12 +05:30
Vlachopoulou, Eleni
199f2347ba Fix AOCC version detection in CMake and config script (#321)
* Updating zen5/make_defs.* so that we use an AOCC_VERSION_STRING

* Adding some error handling for AOCC versions with different name convention

* Adding VERSION_GREATER_EQUAL functionality to all zen config directories

* Cleanup and addressing review comments

* Update config/zen/amd_config.cmake

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Updates to support x.y.z or x_y_z versioning

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-02-20 13:58:13 +00:00
Rayan, Rohan
c315766c8d Fixing undefined behavior in bli_arch_log
* Fixing a potential access of unallocated memory in bli_arch_log()

---------
AMD-Internal: CPUPL-7995
Co-authored-by: Rohan Rayan <rohrayan@amd.com>
2026-02-18 13:36:18 +05:30
Smyth, Edward
011c75dddb Remove unnecessary OpenMP include (AOCL)
Copy of similar change in upstream BLIS (843a5e8) to fix issues
https://github.com/flame/blis/issues/873 and
https://github.com/amd/blis/issues/50

Details:
- Previously, `<omp.h>` was included in `bli_thrcomm_openmp.h` so that the
  framework could access the necessary OpenMP functions.
- As @melven reported (#873), this causes issues when `blis.h` is included
  in C++ code since the `<omp.h>` include happens with `extern "C"`.
- Move the include from the header to the necessary .c files so that it
  does not "pollute" `blis.h`.

Thanks to @DaAwesomeP and @bartoldeman for reporting this issue in
AOCL BLIS

AMD-Internal: [CPUPL-7303]
AOCL-Feb2026-b2
2026-02-06 10:41:38 +00:00
Smyth, Edward
8310b2d5d3 Optimize bli_arch_query_id and related functions
bli_arch_query_id() is used to select kernels in optimized BLAS APIs. Previous
implementation incurred the overhead of multiple function calls. This has
been reduced by:
- Changing the function to be defined in a header file so it can be inlined.
- Avoiding call to bli_arch_check_id_once that was a wrapper for a call to
  bli_pthread_once. Instead bli_pthread_once is called directly.
- For builds with a single BLIS sub-configuration, correct arch_id is taken
  directly from a header file in the corresponding config subdirectory,
  avoiding the bli_pthread_once call and making the value explicit at
  compile time, which may enable additional optimizations.

To enable these changes, the variables arch_id and model_id defined in
frame/base/bli_arch.c are no longer static, as they must be accessed in multiple
files (i.e. they are now global variables). Rename to g_arch_id and g_model_id
to distinguish from any locally defined arch_id or model_id variables.
2026-02-04 13:16:46 +00:00
Rayan, Rohan
ebf8721a5c Optimizing sgemm rd kernels on zen3 (#293)
Fixing some inefficiencies in the zen (AVX2) SUP RD kernel for SGEMM.
After the k-loop that processes 8 elements per iteration, the remainder was handled by a loop that processes 1 element per iteration.
This caused many unnecessary iterations when the remainder of k was less than 8.
This has been fixed by introducing masked operations for k < 8.
When the remainder of k == 1, the original non-masked code is used (with a branch), as the masking operation costs more than it saves in that case.
There were also some unnecessary instructions in the zen4 kernels which have been removed.
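
A back-of-the-envelope sketch of the iteration counts involved, assuming k >= 0. The function names are hypothetical and the model counts loop iterations only, not instruction cost.

```c
/* Old scheme: blocks of 8, then one iteration per leftover element. */
int toy_k_iters_old(int k)
{
    return k / 8 + k % 8;
}

/* New scheme: blocks of 8, then a single step for the leftover
   (masked for a remainder of 2..7, the scalar code for a remainder of 1). */
int toy_k_iters_new(int k)
{
    return k / 8 + (k % 8 != 0 ? 1 : 0);
}
```

For k = 15 this gives 8 iterations before the change and 2 after it, while k = 9 and k = 16 are unchanged at 2 iterations each.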

AMD-Internal: https://amd.atlassian.net/browse/CPUPL-7775
Co-authored-by: rohrayan@amd.com
2026-02-04 09:08:11 +05:30
Chandrashekara K R
50ae5a05ef Updated version string from 5.1.1 to 5.2.2 2026-02-02 18:13:01 +05:30
S, Hari Govind
ec6f4e96cd Replace intrinsics with inline assembly for bli_saxpyv_zen4_int and bli_saxpyf_zen_int_5
GCC over-optimizes intrinsics code by reordering and interleaving
instructions, making it difficult to verify correctness and causing
potential accuracy issues in certain cases. This change replaces
intrinsics-based implementations with inline assembly to ensure
one-to-one mapping between source and generated assembly.

Changes:
- bli_saxpyv_zen4_int: Converted AVX-512 intrinsics to inline assembly
  * Processes blocks of 128, 64, 32, 16, and 8 elements
  * Handles fringe cases with masked operations
  * Preserves scalar path for non-unit strides

- bli_saxpyf_zen_int_5: Converted AVX2 intrinsics to inline assembly
  * Processes blocks of 16 and 8 elements with 5-way fusion
  * Handles fringe cases with masked operations
  * Preserves scalar path for non-unit strides

Benefits:
- Predictable code generation with no compiler reordering
- Better numerical accuracy by preventing unexpected transformations
- Easier verification of generated assembly against specifications
- Explicit control over instruction sequence and register allocation
2026-01-29 11:48:47 +05:30
Smyth, Edward
3b8aca0874 GTestSuite: Misc fixes (3)
Various changes:
- Correct signature of get_random_matrix call in some level2 APIs.
- Move RNG seed to header to allow it to be used elsewhere in the code
- Remove unused variable in ref_gbmv.cpp
- Fix seed for all calls to rand()
- Correct arguments in calls to matrix setup and computediff calls,
  especially for CBLAS row-major calls
- Add missing if statements in tests of input arguments
- Removed unused alpha argument from tbmv and tbsv
- Enable nan_inf check when testing input args like alpha and beta
- Also some corrections to testing input matrices and vectors

AMD-Internal: [CPUPL-7386]
2026-01-23 17:23:52 +00:00
Smyth, Edward
dd66dfff50 cblas_ctrmm invalid diag fix
Error handling code for invalid diag argument in Col major path was
incorrect in cblas_ctrmm compared to other invalid argument checks
and other data type variants.

AMD-Internal: [CPUPL-7303]
2026-01-23 16:51:35 +00:00
Dave, Harsh
b510d06cc8 Tuned input threshold for tiny dgemm interface (#309)
* Tuned input threshold for tiny dgemm interface

- Added upper limit check for M dimension to avoid cache thrashing.
- Added required buffer size check needed while packing A matrix.

AMD-Internal : [CPUPL-7915]

* Tuned input threshold for tiny dgemm interface

- Added upper limit check for M dimension to avoid cache thrashing.
- Added required buffer size check needed while packing A matrix.

AMD-Internal : [CPUPL-7915]

---------

Co-authored-by: harsdave <harsdave@amd.com>
2026-01-22 20:11:06 +05:30
Balasubramanian, Vignesh
73911d5990 Updates to the build systems(CMake and Make) for LPGEMM compilation (#303)
- The current build systems have the following behaviour
  with regards to building "aocl_gemm" addon codebase(LPGEMM)
  when giving "amdzen" as the target architecture(fat-binary)
  - Make:  Attempts to compile LPGEMM kernels using the same
                compiler flags that the makefile fragments set for BLIS
                kernels, based on the compiler version.
  - CMake: With presets, it always enables the addon compilation
                 unless explicitly specified with the ENABLE_ADDON variable.

- This poses a bug with older compilers, owing to them not supporting
  BF16 or INT8 intrinsic compilation.

- This patch adds the functionality to check for GCC and Clang compiler versions,
  and disables LPGEMM compilation if GCC < 11.2 or Clang < 12.0.

- Make:  Updated the configure script to check for the compiler version
              if the addon is specified.
  CMake: Updated the main CMakeLists.txt to check for the compiler version
               if the addon is specified, and to also force-update the associated
               cache variable update. Also updated kernels/CMakeLists.txt to
               check if "aocl_gemm" remains in the ENABLE_ADDONS list after
               all the checks in the previous layers.
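
The version rule itself can be sketched as a small predicate. This is a C illustration only: the real checks live in the configure script and CMakeLists.txt. The function name is hypothetical, "GNU" and "Clang" follow CMake's compiler-ID spelling, and leaving other compilers enabled is an assumption since the message does not cover them.

```c
#include <stdbool.h>
#include <string.h>

/* Returns false when the aocl_gemm addon should be disabled because the
   compiler is older than GCC 11.2 or Clang 12.0. */
bool toy_lpgemm_compiler_ok(const char *compiler_id, int major, int minor)
{
    if (strcmp(compiler_id, "GNU") == 0)
        return major > 11 || (major == 11 && minor >= 2);
    if (strcmp(compiler_id, "Clang") == 0)
        return major >= 12;
    /* Assumption: compilers not named in the message are left enabled. */
    return true;
}
```

Comparing major and minor as integers avoids the pitfalls of comparing version strings lexically (for example "9.5" against "11.2").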

AMD-Internal: [CPUPL-7850]

Signed-off-by: Vignesh Balasubramanian <Vignesh.Balasubramanian@amd.com>
AOCL-Jan2026-b2
2026-01-16 19:39:55 +05:30
Smyth, Edward
9f9bfbed7f GTestSuite: Banded APIs (gbmv, hbmv, sbmv, tbmv, tbsv)
Create gtestsuite programs for banded matrix APIs. Priority is to
create framework with a basic set of tests - refinement of problem
sizes can be investigated later.

AMD-Internal: [CPUPL-7386]
2026-01-16 12:37:47 +00:00
Smyth, Edward
c32247678c GTestSuite: Packed APIs (hpmv, spmv, tpmv, tpsv)
Create gtestsuite programs for subset of packed matrix APIs. Priority is to
create framework with a basic set of tests - refinement of problem sizes
can be investigated later.

AMD-Internal: [CPUPL-7386]
2026-01-16 12:08:36 +00:00
Smyth, Edward
72e0c001f2 GTestSuite: Packed APIs (hpr, hpr2, spr, spr2)
Create gtestsuite programs for subset of packed matrix APIs. Priority is to
create framework with a basic set of tests - refinement of problem sizes
can be investigated later.

AMD-Internal: [CPUPL-7386]
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-16 10:25:27 +00:00
Smyth, Edward
bd99d6cd92 GTestSuite: Misc fixes (2)
Various changes:
- Fixed undeclared variables in her, her2, syr and syr2 IIT_ERS tests
- Correct typos in comments

AMD-Internal: [CPUPL-7386]
2026-01-16 00:33:16 +00:00
Sharma, Shubham
824e289899 Tuned decision logic for DGEMV multithreading for skinny sizes. (#301)
AMD-Internal: [CPUPL-7769]
2026-01-14 12:08:46 +05:30
Rayan, Rohan
9cbb1c45d8 Improving sgemm rd kernel on zen4/zen5 (#292)
Fixing some inefficiencies in the zen4 SUP RD kernel for SGEMM.
The loops for the 8 and 1 iterations of the K-loop were performing loads on ymm/xmm registers and computation on zmm registers.
This caused multiple unnecessary iterations in the kernel for matrices with certain k-values.
Fixed by introducing masked loads and computations for these cases.

AMD-Internal: https://amd.atlassian.net/browse/CPUPL-7762
Co-authored-by: Rohan Rayan <rohrayan@amd.com>
2025-12-17 18:48:50 +05:30
Vlachopoulou, Eleni
504ac9d8a2 CMake: Adding targets and aliases so that blis works with fetch content (#179)
* Adding targets and aliases so that blis works with fetch content

* Using PUBLIC instead of INTERFACE

* Using BLIS instead of blis and adding BLAS in the targets

* Fixing installation paths to be the same as before

* Adding documentation for FetchContent()
AOCL-Weekly-121225
2025-12-10 13:02:09 +00:00
Vlachopoulou, Eleni
1d80d5fee4 Fixing doc about building bench (#290) 2025-12-10 12:07:50 +00:00
Rayan, Rohan
a22e0022c2 SGEMM tiny path tuning for zen4 and zen5 (#267)
* Adding a model to determine which matrices enter the SGEMM tiny path
* This extends the sizes of matrices that enter the tiny path, which was constrained to the L1 cache size previously
* Now matrices that fit in L2 are also allowed into the tiny path, provided they are determined to be faster than the SUP path
* Adding thresholds based on the SUP path sizes
* Added for Zen4 and Zen5

---------
AMD-Internal: CPUPL-7555
Co-authored-by: Rohan Rayan <rohrayan@amd.com>
2025-12-10 15:58:54 +05:30
KR, Chandrashekara
b06b55e864 Update LICENSE and NOTICES files for AOCL-5.2 release (#285) 2025-12-10 11:25:02 +05:30
Varaganti, Kiran
bbb7edcb22 thread: free global communicator after parallel region completes in p…
* thread: free global communicator after parallel region completes in pthreads decorator

Avoid a potential data race by deferring the free of the global communicator until all threads have joined. Previously, the chief thread could free it inside the parallel region while non-chief threads still held pointers to it. Now, it is freed after the parallel region, following the barrier and joins.

Files:
- frame/thread/bli_l3_sup_decor_pthreads.c
- frame/thread/bli_l3_decor_pthreads.c

* AMD-Internal: [CPUPL-7694]
2025-12-09 19:15:52 +05:30
Smyth, Edward
eff1b561c5 GTestSuite: Misc fixes
- Move asumv and nrm2 testinghelpers files from util to level1 (missed in
  commit 0923d8ff56)
- Correct spelling mistakes and references to incorrect arguments in
  comments in various files
- Correct comments listing invalid input tests in syr_IIT_ERS.cpp and
  her_IIT_ERS.cpp
- Fix incorrect use of M in symv_IIT_ERS.cpp and hemv_IIT_ERS.cpp

AMD-Internal: [CPUPL-7386]
AOCL-Weekly-051225
2025-12-05 17:23:47 +00:00