Enable Control Flow Guard (CFG) and Control-flow Enforcement Technology
(CET) hardening for Windows clang-cl builds in the CMake build system.
Gated by the ENABLE_SECURITY_FLAGS option (default ON).
Flags:
Compiler: /guard:cf /guard:ehcont
Linker: /GUARD:CF /GUARD:EHCONT /cetcompat
Toolchain gating:
Requires CMAKE_C/CXX_COMPILER_ID == Clang AND
CMAKE_C/CXX_COMPILER_FRONTEND_VARIANT == MSVC.
Native cl.exe and other MSVC-frontend toolchains fall through to a
warning and skip the hardening flags.
Scope of linker flags:
/cetcompat is applied to blis_shared only (not propagated via
INTERFACE from blis_static) to avoid enabling hardware shadow-stack
enforcement on downstream consumer EXEs that include non-CET-clean
paths (e.g. the testsuite's setjmp/longjmp xerbla code on zen4/zen5).
Configuration:
cmake -DENABLE_SECURITY_FLAGS=OFF ... # disable
Files changed:
blis/CMakeLists.txt
Changes to fix errors and warnings when using gcc 16.1.0:
- Copy changes from 5c2b22da81 in upstream BLIS to extend disabling of
tree-vectorization in affected kernels to gcc 16 and later.
- Remove unused variables in bli_packm_blk_var1_md.c and bli_util_unb_var1.c
to fix warning messages.
Background
bp (base pointer) is the %rbp/%ebp register on x86/x86-64. Inline assembly
kernels in BLIS use asm volatile blocks where they manually manage registers
- including saving and restoring bp themselves to use it as a general-purpose
register for holding loop counters or matrix pointers.
When GCC's tree-vectorizer (specifically the superword-level parallelism (SLP)
pass) runs on a translation unit containing inline asm, it can generate code
that itself needs bp as a frame pointer or in the vectorized prologue/epilogue.
At that point GCC internally marks bp as unavailable and then, when it tries to
compile the inline asm block that also references bp, it throws an error.
As a workaround, disabling tree vectorization for the entire file removes the
conflict - with no vectorizer-generated code, bp stays free for the inline asm.
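A minimal sketch of one way to disable the tree vectorizer for a whole translation unit from inside the source file. The guard condition and the function name are illustrative assumptions, not the contents of the affected kernels; passing -fno-tree-vectorize for the file on the command line would be an alternative.

```c
/* Hypothetical kernel file: the guard and the function are illustrative. */
#if defined(__GNUC__) && !defined(__clang__)
/* GCC only: switch the tree vectorizer off for everything that follows. */
#pragma GCC optimize ("no-tree-vectorize")
#endif

/* Stand-in for a kernel whose real body would be an asm volatile block. */
long sum_strided(const long *x, long n, long incx)
{
    long acc = 0;
    for (long i = 0; i < n; ++i)
        acc += x[i * incx];
    return acc;
}
```

The pragma applies to every function defined after it in the file, which matches the "entire file" scope described above.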
* Pack B matrix for zgemm conjugate input in SUP path
- B matrix is packed for zgemm input where B matrix is conjugate transpose.
AMD-Internal: CPUPL-8274
---------
Co-authored-by: harsdave <harsdave@amd.com>
This PR optimizes the complex scalar vector multiplication kernels by replacing
intrinsics with inline assembly and leveraging FMA (Fused Multiply-Add) instructions
for improved performance.
Changes:
- Replaced intrinsic-based implementation with inline assembly
- Utilizes `vfmaddsub231ps`, `vfmadd231ss`, and `vfmsub231ss` FMA instructions
- Improved instruction scheduling and register usage
- Handles both unit-stride (vectorized) and non-unit-stride (scalar) cases
- Processes up to 16 complex elements per iteration in the main loop
Fixing a memory issue in the cgemm zen4 packing kernel
In the loop section that handles the leftover m and k iterations, the load operations (in the k-direction) were missing mask instructions; these have now been added.
AMD-Internal: CPUPL-8189
Co-authored-by: Rohan Rayan <rohrayan@amd.com>
nn, tn, tt, nt, cn, ct, nc, tc.
Previously, the error-handling logic only explicitly checked
for and rejected cc and hh cases. This patch extends that
check to include ch and hc configurations, ensuring they
correctly return a failure/fallback instead of proceeding
with unsupported kernels.
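The extended check can be sketched as a small predicate. The enum and function names below are hypothetical stand-ins, and the sketch only models the four rejected pairs (cc, hh, ch, hc); it makes no claim about which of the remaining pairs have kernels.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the four transpose/conjugate settings;
   the letters follow the two-letter codes used above. */
typedef enum {
    OP_N, /* no transpose            */
    OP_T, /* transpose               */
    OP_C, /* conjugate, no transpose */
    OP_H  /* conjugate transpose     */
} trans_code;

/* True when the setting involves conjugation (c or h). */
bool op_has_conj(trans_code op)
{
    return op == OP_C || op == OP_H;
}

/* Reject cc, hh, ch and hc: every pair where both operands conjugate. */
bool combo_is_rejected(trans_code transa, trans_code transb)
{
    return op_has_conj(transa) && op_has_conj(transb);
}
```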
AMD-Internal: [CPUPL-8151]
Co-authored-by: harsdave <harsdave@amd.com>
- For the BF16 GEMVM1 fallback path when the B matrix is reordered,
no panel adjustment was performed after the kernel execution.
- When the input size exceeds the panel boundary, this caused
wrong panel accesses, leading to incorrect results. The missing panel adjustment has been added.
[ AMD-Internal : CPUPL-8201 ]
- Replace mul+add with FMA for ddot, daxpy and daxpyf
- Using masked operations where possible
- Non-unit stride code paths still use scalar loops, but use FMAs for accuracy
AMD-Internal: CPUPL-8055
Co-authored-by: Rohan Rayan <rohrayan@amd.com>
Non-AVX512 variants of the ZGEMM tiny code path do not support as many
problem sizes and transa/transb options. Skip unsuitable tests based
on the return value from the bli_zgemm_tiny call.
Also remove unused local variables.
AMD-Internal: [CPUPL-7303]
Update problem sizes and other parameters for general GEMM and
GEMMT tests, and dgemmt EVT tests, with the aim of reducing the
runtime and making the tests more practical for use in pre-submit CI jobs.
AMD-Internal: [CPUPL-7386]
* Resolved memory-access issues in the SGEMM SUP kernels on AVX2 and AVX-512 by correcting instructions that could read invalid addresses in the C matrix.
* Removed k=0 kernel gtests for the native SGEMM and DGEMM paths as these tests caused spurious failures for kernels that are not intended to handle this case.
* Standardized all instruction macros to lowercase in the Zen4 kernel to improve readability and code consistency.
---------
AMD-Internal: CPUPL-8117
Co-authored-by: Rayan <rohrayan@amd.com>
* Bitexactness CRC verification and per-test JSON output
* Remove redundant BLIS_TEST_SEED random seed utilities
The random_seed_utils.h and BLIS_TEST_SEED environment variable are unnecessary since the codebase already ensures deterministic random number generation via RANDOM_POOL_SEED and SRAND_SEED constants hardcoded in testing_helpers.h.
* Add CRC support for integer/char computediff and cache env var checks
Add CRC calculation and binary output to the gtint_t and char specializations of computediff, matching the pattern used by all other overloads. Char values are widened to gtint_t for safe uint32_t-aligned CRC access.
Cache BLIS_ENABLE_CRC and BLIS_ENABLE_BINARY_OUTPUT env var lookups via static const bool in is_crc_enabled() and is_binary_output_enabled(). Guard all CRC/binary blocks in computediff with is_any_verification_enabled() so the common disabled path is a single static bool read with zero allocations.
* Address PR review comments and refactor computediff CRC blocks
Refactor: Extract duplicated CRC/binary-output blocks from all 8 computediff overloads into verify_vector_data and verify_matrix_data helpers in blis_test_utils namespace.
Bug fixes from PR review: add missing includes (cstdlib, utility), enforce MAX_OUTPUT_SIZE_BYTES limit with integer overflow guard, add buffer validation in all CRC generation functions, add default case to FLA_GET_DATATYPE_FACTOR macro, replace deprecated test_case_name() with test_suite_name(), add MAKE_DIRECTORY error checking in CMake, and update copyright years to 2026.
* Refactored crc_utils based on review comments.
* binary_output_utils.h cleanup.
* Address PR review comments: remove unused functions and fix copyright years.
Remove unused generate_crc_matrix, generate_crc_matrix_no_nb_diag,
generate_crc_matrix_no_nb_diag_with_storage, and
calculate_and_print_matrix_crc from crc_utils.h.
Remove unused calculate_and_print_matrix_hash from check_error.h.
Fix copyright year to 2026 only in crc_utils.h and binary_output_utils.h.
Remove (Performance) label from CRC heading in README.md.
Co-authored-by: Cursor <cursoragent@cursor.com>
* Fix for review comments.
* Address review comments: rename verify to collect, consistent void returns, remove filename prefix
- Rename verify_vector_data/verify_matrix_data to collect_vector_data/
collect_matrix_data since these functions only collect CRC and binary
output data without performing comparison.
- Make return types consistent: change calculate_and_print_crc,
calculate_and_print_matrix_crc_with_storage, format_and_record_crc,
and write_comparison_outputs to return void since return values were
never used.
- Remove redundant test_output_ prefix from generate_binary_filename
to avoid duplication with the blis_test_outputs/ directory.
- Remove unused utility include from binary_output_utils.h.
- Update README wording from compiled out to disabled.
Made-with: Cursor
* Fix strict aliasing, use if constexpr, zero-pad CRC hex, separate feature guards
- Replace reinterpret_cast<uint32_t*> with memcpy-based read_uint32()
helper to avoid strict-aliasing UB on float/double/complex buffers.
Produces identical CRC values.
- Use if constexpr(CRC_ENABLED) instead of runtime if(!CRC_ENABLED) to
prevent CRC template instantiation when ENABLE_CRC is off.
- Zero-pad CRC hex output to 8 digits for stable downstream comparison.
- Separate ENABLE_CRC and ENABLE_BINARY_OUTPUT preprocessor guards in
verification_utils.h so each feature is compiled independently.
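A minimal C rendering of the memcpy-based read and the zero-padded hex formatting (the test suite itself is C++). The name read_uint32 comes from the description above; format_crc_hex and the choice of lowercase digits are assumptions made for illustration.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Read 32 bits from an arbitrary buffer without violating strict aliasing. */
uint32_t read_uint32(const void *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);   /* byte copy is always a valid access */
    return v;
}

/* Format a CRC as exactly 8 hex digits; out must hold at least 9 bytes. */
void format_crc_hex(uint32_t crc, char out[9])
{
    assert(out != NULL);
    snprintf(out, 9, "%08" PRIx32, crc);
}
```

Compilers typically lower the fixed-size memcpy to a single load, so the aliasing-safe version should cost nothing extra while reading the same bytes as the cast did.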
Made-with: Cursor
* Handle write_binary_output return values in write_comparison_outputs
Capture the bool return values from write_binary_output and, on failure, log a warning to stdout and record the error as a GTest property. This keeps binary output as a non-fatal diagnostic aid while ensuring return values are explicitly used.
Made-with: Cursor
---------
Co-authored-by: Anuraj <avettick@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* ZGEMM SUP: Add conjugate support for AVX-512 kernels on Zen4/Zen5/Zen6
- Add CONJA, CONJB and CONJA_CONJB variants to zgemm SUP micro-tiles
- Enable SUP path for conjugate cases when both are same type
- Unify RRC/CRC storage to use CV kernel variant
- Update SUP dispatch to handle conjugate flags correctly
Note: CONJ_NO_TRANSPOSE + CONJ_NO_TRANSPOSE and
CONJ_TRANSPOSE + CONJ_TRANSPOSE remain unsupported
---------
Co-authored-by: harsdave <harsdave@amd.com>
* optimize edge and non-unit stride path with intrinsics in s/c axpyv kernels
- Updated the edge and non-unit stride path in c/saxpy to use intrinsics.
- This ensures that the edge and non-unit stride cases maintain numerical
consistency with the optimized Zen assembly/intrinsic path.
---------
Co-authored-by: harsdave <harsdave@amd.com>
Add compatibility for OpenMP implementations (e.g., MSVC, older GCC)
that lack functions introduced in OpenMP 3.0 i.e. omp_get_active_level()
and omp_get_max_active_levels(). On these compilers, the tests instead
are based on the older omp_get_nested() functionality.
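A sketch of the kind of version switch this implies, keyed on the _OPENMP macro (200805 is the value for OpenMP 3.0). The function name is hypothetical and the exact condition used by the tests is an assumption.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns 1 if the runtime allows nested parallel regions, else 0. */
int nested_parallelism_allowed(void)
{
#if defined(_OPENMP) && _OPENMP >= 200805
    /* OpenMP 3.0 or later: query the active-level limit. */
    return omp_get_max_active_levels() > 1;
#elif defined(_OPENMP)
    /* Pre-3.0 runtimes: fall back to the older nested flag. */
    return omp_get_nested() != 0;
#else
    /* Built without OpenMP: no nesting possible. */
    return 0;
#endif
}
```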
Thanks to @tony-davis for highlighting this issue.
AMD-Internal: [CPUPL-7303]
CPUPL-7578: New thread control API with global and thread-local variants
Summary: Add new BLIS thread control APIs that provide fine-grained control over threading with proper global and thread-local (TLS) semantics. Fix several correctness issues where set_num_threads() and set_ways() did not properly override each other's state.
New/Modified APIs:
bli_thread_set_num_threads() — Sets thread count globally (updates both global_rntm and tl_rntm)
bli_thread_set_num_threads_local() — Sets thread count for calling thread only (tl_rntm)
bli_thread_get_num_threads() — Returns effective thread count, deriving from ways if set
bli_thread_reset() — Resyncs tl_rntm from global_rntm
bli_thread_set_ways() — Sets loop factorization (jc, pc, ic, jr, ir)
bli_thread_get_is_parallel() — Returns whether parallelism is enabled
bli_thread_get_jc_nt/ic_nt/pc_nt/jr_nt/ir_nt() — Returns individual way values
b77_thread_set_num_threads_local_() — Fortran-compatible wrapper
Bug fixes:
bli_thread_set_num_threads() now clears ways (-1) and sets auto_factor=TRUE on both global_rntm and tl_rntm, so it properly overrides prior BLIS_JC_NT/BLIS_IC_NT environment settings
bli_thread_set_ways() now propagates to global_rntm (inside mutex) and clears stale num_threads on both global_rntm and tl_rntm, so get_num_threads() returns the product of ways instead of a stale value
Fix data race in bli_thread_init_rntm_from_global_rntm() — copy global_rntm under mutex before debug printing
Fix data race in set_num_threads_local() debug print
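The override rules above can be illustrated with a toy model. This is not BLIS code: the struct, the toy_ function names and the field layout are hypothetical, the mutex around the global copy is omitted, and the assumption that the local setter also clears the thread-local ways is labeled in the code.

```c
#include <assert.h>

/* Toy model only: names and layout are hypothetical, not BLIS code. */
typedef struct {
    int num_threads;  /* -1 means "not set"                       */
    int ways[5];      /* jc, pc, ic, jr, ir; -1 means "not set"   */
    int auto_factor;  /* 1 when the factorization is chosen automatically */
} toy_rntm;

static toy_rntm global_rntm = { -1, { -1, -1, -1, -1, -1 }, 1 };
static _Thread_local toy_rntm tl_rntm = { -1, { -1, -1, -1, -1, -1 }, 1 };

/* Global setter: clears ways and enables auto_factor on both copies. */
void toy_set_num_threads(int nt)
{
    assert(nt > 0);
    toy_rntm r = { nt, { -1, -1, -1, -1, -1 }, 1 };
    global_rntm = r;   /* the real code does this under a mutex */
    tl_rntm = r;
}

/* Local setter: touches only the calling thread's copy.
   Assumption: it clears the thread-local ways as well. */
void toy_set_num_threads_local(int nt)
{
    assert(nt > 0);
    toy_rntm r = { nt, { -1, -1, -1, -1, -1 }, 1 };
    tl_rntm = r;
}

/* Ways setter: propagates to both copies and clears stale num_threads.
   Assumption: auto_factor is switched off when ways are explicit. */
void toy_set_ways(int jc, int pc, int ic, int jr, int ir)
{
    toy_rntm r = { -1, { jc, pc, ic, jr, ir }, 0 };
    global_rntm = r;
    tl_rntm = r;
}

/* Effective thread count: product of ways if set, else num_threads. */
int toy_get_num_threads(void)
{
    if (tl_rntm.ways[0] > 0) {
        int nt = 1;
        for (int i = 0; i < 5; ++i)
            nt *= tl_rntm.ways[i];
        return nt;
    }
    return tl_rntm.num_threads;
}

/* Resync the thread-local copy from the global one. */
void toy_reset(void)
{
    tl_rntm = global_rntm;
}
```

In this model, setting 8 threads, then ways (2,1,3,1,1), then a local count of 4, then resetting yields 8, 6, 4 and 6 in turn, mirroring the ways→local→reset scenario listed in the test suite.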
Test suite (43 tests, 106 assertions):
test_thread_control.c (OpenMP, 23 tests): environment inheritance, global propagation, thread-local isolation, local precedence, per-thread local, reset, nested parallel, edge cases, set_ways, is_parallel, concurrent updates, DGEMM with threads, interleaved settings, persistence, parallel DGEMM, thread pool, reset-to-sync, env ways vs set_num_threads, ways→set_nt→reset, ways→local→reset, round-trip, set_nt→set_ways override, set_ways propagation to new threads
test_thread_control_pthread.c (pthread, 20 tests): equivalent coverage plus concurrent set/reset race condition test, set_nt→set_ways override, set_ways propagation via pthread_create
Files changed (9 files, +2630/-29 lines):
bli_thread.c — Core API implementations and fixes
bli_thread.h — New function declarations
b77_thread.c — Fortran wrapper
test_thread_control.c — OpenMP test suite (23 tests)
test_thread_control_pthread.c — pthread test suite (20 tests)
TEST_THREAD_CONTROL_README.md — Documentation
AMD-Internal: CPUPL-7578
Commit 8310b2d5d3 added new functions and global variables in
blis.h intended only for internal use. These were causing
missing symbol problems when blis.h is included in C
applications as they are not exported from the shared library.
Use BLIS_IS_BUILDING_LIBRARY and BLIS_CONFIGURETIME_CPUID
preprocessor definitions to only expose these when compiling BLIS
and not when using it.
AMD-Internal: [CPUPL-8091]
Implement zen6 cpuid and arch changes, and add zen6 as a
separate BLIS sub-configuration and code path within amdzen
configuration family. Currently all optimization choices are
copies of zen5 sub-configuration.
AMD-Internal: [CPUPL-7162]
Changes to simplify AMD CPUID functionality:
- Variable "features" is limited in size as each bit represents a
specific hardware function. Move detection of FP datapath
width to a separate variable. Also mask the FP datapath bits
explicitly for a more reliable test.
- Add detection of facility to downgrade FP512 datapath to FP256.
- bli_cpuid_is_avx512_fallback function does not exist, so remove header
definition.
AMD-Internal: [CPUPL-7303]
The bli_thread_barrier(thread) call before bli_l3_sup_thrinfo_free() in
bli_l3_sup_thread_decorator() was added by analogy with the conventional
path's PR #702 fix, but is not needed in the sup (small/unpacked) path.
In the conventional path, pack buffers are cached in the control tree
(cntl_t->pack_mem) and freed in the decorator after func() returns. A
barrier is required there to prevent a fast chief from releasing a pack
buffer back to the PBA pool while slower peers in a different sub-group
still read from it.
The sup path does not have this problem because:
1. Pack buffers are stack-local variables (mem_t in var2m), freed inside
func() by packm_sup_finalize_mem() after internal loop barriers.
They are never freed in this decorator.
2. The global communicator (gl_comm) is freed outside the parallel
region, protected by the implicit OpenMP barrier at the closing
brace of the parallel construct.
3. Sub-group communicators (created when packa/packb is enabled) are
freed only by the ochief thread in bli_thrinfo_free(). Non-chief
threads never dereference the shared communicator — they only read
their own ocomm_id and free_comm fields. When neither matrix is
packed, no sub-communicators exist (ocomm=NULL, free_comm=FALSE).
The custom spin-wait barrier (bli_thread_barrier) is significantly
slower than the OpenMP runtime barrier at high thread counts, causing
a ~10% DGEMM performance regression at 96 threads on AMD EPYC Turin
(e.g. 11000x300x200 DGEMM).
Ref: https://github.com/flame/blis/pull/702
Resolves: [CPUPL-7979] [SWLCSG-3951] [LWPHPCENGG-622]
APIs like GEMV use DOTXF (for the part of the problem that is a multiple of fuse_factor) and DOTXV (for the remainder).
DOTXF and DOTXV use different numbers of temporary accumulation registers (rho).
This results in different round-off errors, which can be significant when sizes are small and the problem is split about equally between DOTXF and DOTXV.
To fix this, the number of temporary accumulators (and therefore round-offs) has been made identical across both kernels.
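A small scalar illustration of why the accumulator count matters. These two functions are hypothetical stand-ins, not the DOTXF/DOTXV kernels; they compute the same dot product with one and with two partial sums.

```c
#include <assert.h>

/* Dot product with a single running sum. */
float dot_one_acc(const float *x, const float *y, int n)
{
    assert(n >= 0);
    float rho = 0.0f;
    for (int i = 0; i < n; ++i)
        rho += x[i] * y[i];
    return rho;
}

/* Same dot product with two partial sums (even and odd indices). */
float dot_two_acc(const float *x, const float *y, int n)
{
    assert(n >= 0);
    float rho0 = 0.0f, rho1 = 0.0f;
    int i = 0;
    for (; i + 1 < n; i += 2) {
        rho0 += x[i] * y[i];
        rho1 += x[i + 1] * y[i + 1];
    }
    if (i < n)
        rho0 += x[i] * y[i];   /* odd-length tail */
    return rho0 + rho1;
}
```

With x = {1e8f, 1.0f, -1e8f, 1.0f} and y all ones, on a typical IEEE-754 single-precision target without -ffast-math the single accumulator absorbs the first 1.0f into 1e8f and returns 1.0f, while the two-accumulator version returns the exact value 2.0f.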
Known related GCC bugs to reference
GCC Bug #56812 — incorrect code with vzeroupper and register allocation
GCC Bug #95483 — vzeroupper clobbers live values
GCC Bug #101617 — wrong code generation with AVX intrinsics and transitions
AMD-Internal: CPUPL-8015
Fix NaN handling in AVX2 amax kernel by initializing global max to 0
In the AVX2 kernel, the global maximum (curr_max_val) was previously
initialized to -1, while the local maximum (temp) is initialized to 0.
The kernel determines the maximum using the condition:
max(abs(x[i]), temp) > curr_max_val
Because curr_max_val started at -1, this initial check always evaluates
to true for the first iteration (0 > -1), which is incorrect if the
first element is a NaN.
If subsequent elements are valid numbers, curr_max_val eventually recovers
and updates to the correct value. However, if the entire array consists of
NaNs, this logic fails to properly update the trackers, meaning we never
correctly return index 0 as required by the BLAS standard.
To fix this and align the behavior with the AVX-512 kernel, curr_max_val
is now initialized to 0. This ensures that the initial condition evaluates
correctly and all-NaN arrays return the proper index (0 if all values are NaN).
Window start and end index are also updated to 0 which is the minimum valid
value of index.
AMD-Internal: CPUPL-8047
- Updated the condition for pointer checks on scale
factors for A and B matrices, in order to avoid
'Dereference before' and 'Dereference after' null
check issues.
- Also updated the symmetric quantization interfaces
to have NULL check for post-ops pointer.
AMD-Internal: [CPUPL-7995]
Signed-off-by: Vignesh Balasubramanian <Vignesh.Balasubramanian@amd.com>
* Updating zen5/make_defs.* so that we use an AOCC_VERSION_STRING
* Adding some error handling for AOCC versions with different name convention
* Adding VERSION_GREATER_EQUAL functionality to all zen config directories
* Cleanup and addressing review comments
* Update config/zen/amd_config.cmake
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Updates to support x.y.z or x_y_z versioning
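The dual separator rule can be sketched in C as below. This is only an illustration of the parsing rule; the actual change lives in the make_defs and CMake files, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Parse "x.y.z" or "x_y_z" into three integers; returns 1 on success. */
int parse_version(const char *s, int *major, int *minor, int *patch)
{
    assert(s && major && minor && patch);
    /* %*1[._] consumes exactly one '.' or '_' without storing it. */
    return sscanf(s, "%d%*1[._]%d%*1[._]%d", major, minor, patch) == 3;
}
```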
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy of similar change in upstream BLIS (843a5e8) to fix issues
https://github.com/flame/blis/issues/873 and
https://github.com/amd/blis/issues/50
Details:
- Previously, `<omp.h>` was included in `bli_thrcomm_openmp.h` so that the
framework could access the necessary OpenMP functions.
- As @melven reported (#873), this causes issues when `blis.h` is included
in C++ code since the `<omp.h>` include happens with `extern "C"`.
- Move the include from the header to the necessary .c files so that it
does not "pollute" `blis.h`.
Thanks to @DaAwesomeP and @bartoldeman for reporting this issue in
AOCL BLIS
AMD-Internal: [CPUPL-7303]
bli_arch_query_id() is used to select kernels in optimized BLAS APIs. Previous
implementation incurred the overhead of multiple function calls. This has
been reduced by:
- Changing the function to be defined in a header file so it can be inlined.
- Avoiding call to bli_arch_check_id_once that was a wrapper for a call to
bli_pthread_once. Instead bli_pthread_once is called directly.
- For builds with a single BLIS sub-configuration, correct arch_id is taken
directly from a header file in the corresponding config subdirectory,
avoiding the bli_pthread_once call and making the value explicit at
compile time, which may enable additional optimizations.
To enable these changes, the variables arch_id and model_id defined in
frame/base/bli_arch.c are no longer static, as they must be accessed in multiple
files (i.e. they are now global variables). Rename to g_arch_id and g_model_id
to distinguish from any locally defined arch_id or model_id variables.
Fixing some inefficiencies on the zen (AVX2) SUP RD kernel for SGEMM.
After the loop that processes 8 elements per iteration, the next loop in the k-direction processed 1 element per iteration.
This caused many unnecessary iterations when the remainder of k was less than 8.
This has been fixed by introducing masked operations for remainders of k < 8.
When the remainder of k == 1, the original non-masked code is used (via a branch), as the masking operation costs more than it saves in that case.
Some unnecessary instructions in the zen4 kernels have also been removed.
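A scalar emulation of the masked-remainder idea, assuming 8 lanes per step. The function names are hypothetical, the real kernels use AVX2 masked instructions in assembly, and the special case for a remainder of 1 is not modeled here.

```c
#include <assert.h>

#define LANES 8

/* Emulated masked 8-lane load: lanes whose mask bit is clear are
   zero-filled and their memory is never read. */
void masked_load8(float dst[LANES], const float *src, unsigned mask)
{
    for (int l = 0; l < LANES; ++l)
        dst[l] = ((mask >> l) & 1u) ? src[l] : 0.0f;
}

/* Dot product over k: full 8-wide steps, then ONE masked step for the
   k % 8 leftover instead of up to seven single-element iterations. */
float dot_masked_tail(const float *a, const float *b, int k)
{
    assert(k >= 0);
    float acc[LANES] = { 0 };
    float va[LANES], vb[LANES];
    int i = 0;

    for (; i + LANES <= k; i += LANES)
        for (int l = 0; l < LANES; ++l)
            acc[l] += a[i + l] * b[i + l];

    int rem = k - i;                       /* 0..7 leftover elements */
    if (rem > 0) {
        unsigned mask = (1u << rem) - 1u;  /* low rem bits set */
        masked_load8(va, a + i, mask);
        masked_load8(vb, b + i, mask);
        for (int l = 0; l < LANES; ++l)
            acc[l] += va[l] * vb[l];
    }

    float sum = 0.0f;                      /* horizontal reduction */
    for (int l = 0; l < LANES; ++l)
        sum += acc[l];
    return sum;
}
```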
AMD-Internal: https://amd.atlassian.net/browse/CPUPL-7775
Co-authored-by: rohrayan@amd.com
GCC over-optimizes intrinsics code by reordering and interleaving
instructions, making it difficult to verify correctness and causing
potential accuracy issues in certain cases. This change replaces
intrinsics-based implementations with inline assembly to ensure
one-to-one mapping between source and generated assembly.
Changes:
- bli_saxpyv_zen4_int: Converted AVX-512 intrinsics to inline assembly
* Processes blocks of 128, 64, 32, 16, and 8 elements
* Handles fringe cases with masked operations
* Preserves scalar path for non-unit strides
- bli_saxpyf_zen_int_5: Converted AVX2 intrinsics to inline assembly
* Processes blocks of 16 and 8 elements with 5-way fusion
* Handles fringe cases with masked operations
* Preserves scalar path for non-unit strides
Benefits:
- Predictable code generation with no compiler reordering
- Better numerical accuracy by preventing unexpected transformations
- Easier verification of generated assembly against specifications
- Explicit control over instruction sequence and register allocation
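A deliberately tiny stand-in for the approach, assuming GCC-style extended asm on x86-64: one element of y = alpha*x + y as a fixed multiply-then-add sequence. The real kernels use AVX2/AVX-512 instructions over whole vectors; the function name and the SSE scalar instructions here are illustrative choices only.

```c
/* alpha * x + y for one float, as a fixed two-instruction sequence. */
float axpy_one(float alpha, float x, float y)
{
#if defined(__x86_64__) && defined(__GNUC__)
    float r = alpha;
    __asm__ volatile (
        "mulss %1, %0\n\t"   /* r = r * x */
        "addss %2, %0\n\t"   /* r = r + y */
        : "+&x"(r)           /* read-write, early-clobber SSE register */
        : "x"(x), "x"(y)     /* inputs in SSE registers */
    );
    return r;
#else
    /* Portable fallback: here the compiler chooses the instructions. */
    return alpha * x + y;
#endif
}
```

Inside the asm block the compiler emits the two instructions exactly as written and in that order, which is the one-to-one source-to-assembly mapping this change is after.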
Various changes:
- Correct signature of get_random_matrix call in some level2 APIs.
- Move RNG seed to header to allow it to be used elsewhere in the code
- Remove unused variable in ref_gbmv.cpp
- Fix seed for all calls to rand()
- Correct arguments in calls to matrix setup and computediff calls,
especially for CBLAS row-major calls
- Add missing if statements in tests of input arguments
- Removed unused alpha argument from tbmv and tbsv
- Enable nan_inf check when testing input args like alpha and beta
- Also some corrections to testing input matrices and vectors
AMD-Internal: [CPUPL-7386]
Error handling code for an invalid diag argument in the column-major path was
incorrect in cblas_ctrmm compared to other invalid argument checks
and other data type variants.
AMD-Internal: [CPUPL-7303]
- The current build systems have the following behaviour
with regards to building "aocl_gemm" addon codebase(LPGEMM)
when giving "amdzen" as the target architecture(fat-binary)
- Make: Attempts to compile LPGEMM kernels using the same
compiler flags that the makefile fragments set for BLIS
kernels, based on the compiler version.
- CMake: With presets, it always enables the addon compilation
unless explicitly specified with the ENABLE_ADDON variable.
- This poses a bug with older compilers, owing to them not supporting
BF16 or INT8 intrinsic compilation.
- This patch adds the functionality to check for GCC and Clang compiler versions,
and disables LPGEMM compilation if GCC < 11.2 or Clang < 12.0.
- Make: Updated the configure script to check for the compiler version
if the addon is specified.
- CMake: Updated the main CMakeLists.txt to check for the compiler version
if the addon is specified, and to also force-update the associated
cache variable. Also updated kernels/CMakeLists.txt to
check if "aocl_gemm" remains in the ENABLE_ADDONS list after
all the checks in the previous layers.
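The version rule itself (disable LPGEMM if GCC < 11.2 or Clang < 12.0) can be written as a small predicate. This C sketch only illustrates the thresholds; the real checks are in the configure script and CMakeLists.txt, and all names below are hypothetical.

```c
#include <stdbool.h>

typedef enum { COMPILER_GCC, COMPILER_CLANG } compiler_kind;

/* True when major.minor is at least req_major.req_minor. */
bool version_at_least(int major, int minor, int req_major, int req_minor)
{
    return major > req_major || (major == req_major && minor >= req_minor);
}

/* Thresholds from the description above: GCC >= 11.2, Clang >= 12.0. */
bool lpgemm_addon_allowed(compiler_kind kind, int major, int minor)
{
    if (kind == COMPILER_GCC)
        return version_at_least(major, minor, 11, 2);
    return version_at_least(major, minor, 12, 0);
}
```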
AMD-Internal: [CPUPL-7850]
Signed-off-by: Vignesh Balasubramanian <Vignesh.Balasubramanian@amd.com>
Create gtestsuite programs for banded matrix APIs. Priority is to
create framework with a basic set of tests - refinement of problem
sizes can be investigated later.
AMD-Internal: [CPUPL-7386]
Create gtestsuite programs for subset of packed matrix APIs. Priority is to
create framework with a basic set of tests - refinement of problem sizes
can be investigated later.
AMD-Internal: [CPUPL-7386]
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Fixing some inefficiencies on the zen4 SUP RD kernel for SGEMM.
The K-loop remainder loops that process 8 and 1 elements per iteration were performing loads on ymm/xmm registers and computation on zmm registers.
This caused multiple unnecessary iterations in the kernel for matrices with certain k-values.
Fixed by introducing masked loads and computations for these cases.
AMD-Internal: https://amd.atlassian.net/browse/CPUPL-7762
Co-authored-by: Rohan Rayan <rohrayan@amd.com>
* Adding targets and aliases so that blis works with fetch content
* Using PUBLIC instead of INTERFACE
* Using BLIS instead of blis and adding BLAS in the targets
* Fixing installation paths to be the same as before
* Adding documentation for FetchContent()
* Adding a model to determine which matrices enter the SGEMM tiny path
* This extends the sizes of matrices that enter the tiny path, which was constrained to the L1 cache size previously
* Now matrices that fit in L2 are also allowed into the tiny path, provided they are determined to be faster than the SUP path
* Adding thresholds based on the SUP path sizes
* Added for Zen4 and Zen5
---------
AMD-Internal: CPUPL-7555
Co-authored-by: Rohan Rayan <rohrayan@amd.com>
* thread: free global communicator after parallel region completes in pthreads decorator
Avoid a potential data race by deferring the free until all threads have joined. Previously, the chief thread could free the communicator inside the parallel region while non-chief threads still held pointers to it. Now it is freed after the parallel region, once all threads have passed the barrier and been joined.
Files:
- frame/thread/bli_l3_sup_decor_pthreads.c
- frame/thread/bli_l3_decor_pthreads.c
* AMD-Internal: [CPUPL-7694]
- Move asumv and nrm2 testinghelpers files from util to level1 (missed in
commit 0923d8ff56)
- Correct spelling mistakes and references to incorrect arguments in
comments in various files
- Correct comments listing invalid input tests in syr_IIT_ERS.cpp and
her_IIT_ERS.cpp
- Fix incorrect use of M in symv_IIT_ERS.cpp and hemv_IIT_ERS.cpp
AMD-Internal: [CPUPL-7386]