Commit Graph

3787 Commits

Author SHA1 Message Date
V, Varsha
3df4aac2d2 Bugfix for A matrix packing in int8(S8/U8) APIs for Batch-Matmul
- A matrix by default isn't expected to be packed for a normal row-stored
 case. Hence the packing implementation is incomplete.
 - But if the user explicitly enables packing, interface wasn't handling
 the condition appropriately leading to data overwriting inside the incomplete
 pack kernels, thereby leading to accuracy failure.
 - As a fix, updated the interface to set the explicit PACK A to UNPACKED and
 proceed with GEMM in cases where transpose of A is not necessary.
 - Updated the batch gemm input file with additional test cases covering all the
 APIs.
Bug Fixes:
 - Fixed implementation logic for column major inputs with post-ops to be disabled
 in S8 batch mat-mul. With the existing implementation, column-major inputs wouldn't
 be executed in case of of32/os32 inputs.
 - Fixed the Scale/ZP calculation in bench foru8s8s32ou8 condition, which was leading
 to accuracy failures.

[AMD-Internal: CPUPL-7283 ]
2025-08-26 16:46:37 +05:30
S, Hari Govind
deafc527fc Tune DCOPY aocl_dynamic logic for zen4/zen5 architectures (#160)
- Adjust the DCOPY aocl_dynamic threshold on Zen4 for optimal
  fast-path selection.

- Extending the Zen5 dynamic-scheduling logic beyond the previous
  8-thread limit.

- Update corresponding fast_path_thresh values in the frame such
  that it matches the new dynamic logic.
2025-08-26 16:15:56 +05:30
Dave, Harsh
9269a7d65d Implement dgemm_thread_decision with K-threshold logic (#152)
- Added the initial implementation of the dgemm_thread_decision()
  function to decide between single-threaded and multi-threaded
  execution for DGEMM inputs.

- The function models per-thread tile work, core GFLOP/s, and
  thread overhead (T_over=15 µs), and computes a K-threshold
  that determines when multi-threading becomes beneficial.

- Returns true for ST and false for MT.

AMD-Internal: [SWLCSG-3418]
2025-08-26 15:02:34 +05:30
Sharma, Shubham
9c6777fc6b Fix DTRSM small threshold for extremely skinny sizes for ZEN5 (#151)
- Logic to determine if small code path should be taken or not does not take into account if matrix A is too large.
- Added a condition to use native code path if matrix A is very large.
AMD-Internal: [CPUPL-7201]
2025-08-22 22:15:40 +05:30
V, Varsha
6cdab2720c Bugfix for A matrix packing in int8(S8/U8) APIs
- A matrix packing by default in isn't necessary for row-major matrix data. Also, it seems packing of A was
 giving regressions and hence wasn't expected to be used.
 - However, packA is necessary in column-major cases, where transpose has to be done. This path has been verified.
 - Hence, when user sets pack A explicitly, it gets into the incomplete packA function, and overwrites the elements
 in the buffer after subsequent iterations, leading to accuracy issues. As a fix to this the patch updates PACK
 condition to UNPACKED at the interface while user explicitly sets one, ensuring seamless execution.

[ AMD-Internal : CPUPL - 7193 ]
2025-08-22 18:46:19 +05:30
Vankadari, Meghana
5044b69d3d Bug fix in LPGEMV m=1 AVX2 kernel for post-ops
Details:
- Fixed loading of matadd and matmul pointers in GEMV
 lt16 kernel for AVX2 M=1 case.
- Hard-set row-stride of B to 1(inside GEMV), when it has
   already been reordered.

AMD-Internal:CPUPL-7197, CPUPL-7221
Co-authored-by:Balasubramanian, Vignesh <Vignesh.Balasubramanian@amd.com>
2025-08-22 18:15:05 +05:30
S, Hari Govind
d29f3f0b5e Fix GCC 12+ instruction scheduling issue in complex scalv kernel (#149)
Replace fused multiply-add (FMA) intrinsics with explicit multiply and add/subtract operations in bli_cscalv_zen_int to resolve incorrect results with GCC 12 and later compilers.

The original code used register reuse pattern with _mm256_fmaddsub_ps() that causes GCC 12+ instruction scheduler to generate assembly with corrupted intermediate values due to register allocation conflicts. GCC 11 and earlier handled the same pattern correctly.

Changes:
- Replace _mm256_fmaddsub_ps() with _mm256_mul_ps() + _mm256_addsub_ps()
- Eliminate temp register reuse to fix instruction scheduling conflicts

AMD-Internal: [CPUPL-6445]
2025-08-22 14:23:43 +05:30
KR, Chandrashekara
36c37585de CMake: Set working directory for try_run to avoid DLL-related runtime errors on Windows.
- Added WORKING_DIRECTORY to try_run() calls to ensure execution occurs in ${BLIS_PATH}/lib.
- Prevents Windows error 0xc0000135 caused by missing DLLs during runtime of get_version.cpp.
- Ensures compatibility with both static and shared BLIS builds by aligning runtime context
  with expected DLL locations.

AMD-Internal: [CPUPL-7187]
2025-08-21 17:09:07 +05:30
Sharma, Shubham
b5c8124d3d Derive TRSM ref kernels from TRSM blkzsz instead of GEMM blszs (#148)
- Currently TRSM reference kernels are derived from GEMM blocksizes and GEMM_UKR.
- This does not allow the flexibility to use different GEMM_UKR for GEMM and TRSM if optimized TRSM_UKR are not available.
- Made changes so that ref TRSM kernels are derived from TRSM blocksizes.
- Changed ZEN4 and ZEN5 cntx to use AVX2 kernels for CTRSM.

AMD-Internal: [SWLCSG-3702]
2025-08-21 11:25:45 +05:30
Dave, Harsh
e39cf64708 Optimized avx512 ZGEMM kernel and edge-case handling (#147)
* Optimized avx512 ZGEMM kernel and edge-case handling
  Edge kernel implementation:
   - Refactored all of the zgemm kernels to process micro-tiles efficiently
   - Specialized sub-kernels are added to handle leftover m dimention:12MASK,
     8, 8MASK, 8, 4, 4MASK, 2.
   - 12MASK edge kernel handles 11, 10, 9 m_left using 2 full zmm
     load/store and 1 masked load/store.
   - Similarly 8MASK handles 7, 6, 5 m_left using 1 full zmm load/store and
     1 masked load/store.
   - 4MASK handles 3, 1 m_left using 1 masked load/store.

   - ZGEMM kernel now internally decomposes the m dimension into the following.
     The main kernel is 12x4, which is having following edge kernels to
     handle left-over m dimension:
     edge kernels:
     12MASKx4 (handles 11x4, 10x4, 9x4)
     8x4      (handles 8x4)
     8MASKx4  (handles 7x4, 6x4, 5x4)
     4x4      (handles 4x4)
     4MASKx4  (handles 3x4, 1x4)
     2x4      (handles 2x4)

   - similarly it decomposes for (12x3, 12x2 and 12x1) n_left kernels under
     which the following edge kernels 12MASKxN_LEFT(3, 2, 1), 8XN_LEFT(3, 2, 1),
     8MASKxN_LEFT(3, 2, 1), 4xN_LEFT(3, 2, 1), 4MASKxN_LEFT(3, 2, 1),
     2xN_LEFT(3, 2, 1) handles leftover m dimension.

  Threshold tuning:
   - Enforced odd m dimension to avx512 kernels in tiny path, as avx2
     kernels invokes gemv calls for m_left=1(odd m dimension of matrix)
     The gemv function call adds overhead for very small sizes and results
     in suboptimal performance.

   - condition check "m%2 == 0" is added along with threshold checks to
     force input with odd m dimension to use avx512 zgemm kernel.

   - Threshold change to route all of the inputs to tiny path. Eliminating
     dependency of avx2 zgemm_small path if A, B matrix storage is 'N'(not transpose) or
     'T'(transpose).

   - However tiny re-uses zgemm sup kernels which do not support
     conjugate transpose storage of matrices. For such storage of
     A, B matrix we still rely on avx2 zgemm_small kernel.

  gtest changes:
   - Removed zgemm edge kernel function(8x4, 4x4, 2x4 and fx4) and their
     respective testing instaces from gtest.

AMD-Internal: [CPUPL-7203]

* Optimized avx512 ZGEMM kernel and edge-case handling
  Edge kernel implementation:
   - Refactored all of the zgemm kernels to process micro-tiles efficiently
   - Specialized sub-kernels are added to handle leftover m dimention:12MASK,
     8, 8MASK, 8, 4, 4MASK, 2.
   - 12MASK edge kernel handles 11, 10, 9 m_left using 2 full zmm
     load/store and 1 masked load/store.
   - Similarly 8MASK handles 7, 6, 5 m_left using 1 full zmm load/store and
     1 masked load/store.
   - 4MASK handles 3, 1 m_left using 1 masked load/store.

   - ZGEMM kernel now internally decomposes the m dimension into the following.
     The main kernel is 12x4, which is having following edge kernels to
     handle left-over m dimension:
     edge kernels:
     12MASKx4 (handles 11x4, 10x4, 9x4)
     8x4      (handles 8x4)
     8MASKx4  (handles 7x4, 6x4, 5x4)
     4x4      (handles 4x4)
     4MASKx4  (handles 3x4, 1x4)
     2x4      (handles 2x4)

   - similarly it decomposes for (12x3, 12x2 and 12x1) n_left kernels under
     which the following edge kernels 12MASKxN_LEFT(3, 2, 1), 8XN_LEFT(3, 2, 1),
     8MASKxN_LEFT(3, 2, 1), 4xN_LEFT(3, 2, 1), 4MASKxN_LEFT(3, 2, 1),
     2xN_LEFT(3, 2, 1) handles leftover m dimension.

  Threshold tuning:
   - Enforced odd m dimension to avx512 kernels in tiny path, as avx2
     kernels invokes gemv calls for m_left=1(odd m dimension of matrix)
     The gemv function call adds overhead for very small sizes and results
     in suboptimal performance.

   - condition check "m%2 == 0" is added along with threshold checks to
     force input with odd m dimension to use avx512 zgemm kernel.

   - Threshold change to route all of the inputs to tiny path. Eliminating
     dependency of avx2 zgemm_small path if A, B matrix storage is 'N'(not transpose) or
     'T'(transpose).

   - However tiny re-uses zgemm sup kernels which do not support
     conjugate transpose storage of matrices. For such storage of
     A, B matrix we still rely on avx2 zgemm_small kernel.

  gtest changes:
   - Removed zgemm edge kernel function(8x4, 4x4, 2x4 and fx4) and their
     respective testing instaces from gtest.

AMD-Internal: [CPUPL-7203]

---------

Co-authored-by: harsdave <harsdave@amd.com>
2025-08-21 09:46:10 +05:30
Smyth, Edward
7969d43f2c Improve BLAS3 IIT_ERS tests
New tests:
- Add IIT_ERS tests for TRMM, SYMM, SYRK, SYR2K, HEMM, HERK, HER2K

Corrections and improvements:
- GEMM: Use local definitions of input size, trans, etc arguments
  to allow finer control of choices, especially for testing invalid
  leading dimensions.
- GEMM, GEMMT, GEMM_COMPUTE, TRMM, TRSM: In alpha=beta=zero
  test, initialize C to extreme value to test that C is set rather than scaled
- GEMM: Use correct M x N dimensions for C in calls to computediff
- GEMM: Declare info variable in disabled tests in gemm_IIT_ERS.cpp

AMD-Internal: [CPUPL-6725]
2025-08-20 22:38:45 +01:00
Sharma, Shubham
805f36965d Added ability to handle non unit incx in GEMV transpose kernel. (#145)
- GEMV transpose kernels lack ability to compute directly on non-unit stride inputs.
- This limitation is stopping libflame to use blis kernel directly instead of going through framework.
- Added ability to handle non-unit incx in the kernel by packing x into a temporary buffer.

AMD-Internal: [CPUPL-6903]
2025-08-20 23:33:53 +05:30
Smyth, Edward
67616752c3 Improve BLAS2 IIT_ERS tests
New tests:
- Add IIT_ERS tests for HEMV, HER, HER2, SYMV, SYR, SYR2

Corrections and improvements:
- GEMV: Correct matrix sizes for transpose cases when checking
  input matrix A has not been modified.
- GEMV: Initialize y to extreme value for alpha=beta=zero case
  to check y is set rather than scaled.

AMD-Internal: [CPUPL-6725]
2025-08-20 18:06:20 +01:00
Sharma, Shubham
0d98c8f9ec Remove extra security flag print from the make build system (#146)
- Status of security flags are already printed in the output from configure and CMake.
- Prints in make command are redundant.
- Removed prints from make to keep the build log clean.

AMD-Internal: [CPUPL-7222]
2025-08-20 16:58:44 +05:30
Smyth, Edward
0b9e846fee DTL logging fixes and improvements (5)
More improvements to DTL coverage and coding:
- Expand logging and tracing coverage to IxAMIN and GEMM_BATCH APIs
- Expand logging and performance states to GEMM3M APIs
- Expand logging coverage to matrix copy, transpose and add APIs
- Misc tidying of code

AMD-Internal: [CPUPL-7010]
2025-08-20 11:37:03 +01:00
Smyth, Edward
509aa07785 Standardize Zen kernel names
Naming of Zen kernels and associated files was inconsistent with BLIS
conventions for other sub-configurations and between different Zen
generations. Other anomalies existed, e.g. dgemmsup 24x column
preferred kernels names with _rv_ instead of _cv_. This patch renames
kernels and file names to address these issues.

AMD-Internal: [CPUPL-6579]
2025-08-19 18:19:51 +01:00
Sharma, Shubham
aa95a8ce4a Added Compiler flags to improve Security (#136)
Following Flags have been added.

1. D_FORTIFY_SOURCE=2
What it does
• At compile time the header files replace certain libc calls (strcpy, sprintf, …) with inline wrappers that perform a compile-time length check whenever the size of the destination buffer is known.
• At run time an extra check is executed only if the compiler could not prove the copy is safe.

Cost
• Only functions that call those specific libc routines pay anything.

2. fstack-protector-strong
What it does
• Functions that contain local arrays, address‐taken locals, or alloca get a canary word inserted into the stack frame.
• The function prologue writes the canary; the epilogue verifies it before the ret.

Cost
• 8 bytes of additional stack per protected function frame.
• Two or three extra instructions per entry/exit.

4. Wl,-z,relro
What it does
• Marks the relocation tables read-only after relocation is finished.
• No effect once the library is fully loaded.

Cost
• None at run time.

5. Wl,-z,now
What it does
• Forces the dynamic loader to resolve all external symbols in the library up-front instead of lazily on first call.

Cost
• Startup: one extra relocation pass.
• Steady-state execution: zero or slightly faster, because PLT stubs are bypassed.

Usage:
cmake -DENABLE_SECURITY_FLAGS=off
cmake -DENABLE_SECURITY_FLAGS=on
configure --enable-security-flags
configure --disable-security-flags

AMD-Internal: [CPUPL-6886]
2025-08-18 16:11:02 +05:30
Dave, Harsh
b88bea6e72 Optimize ZGEMM Packing Kernel for M-Dimension Edge Cases (cdim0 1–11) (#135)
* Optimize ZGEMM Packing Kernel for M-Dimension Edge Cases (cdim0 1–11)

- Introduced specialized AVX-512 assembly paths for cdim0 edge cases (1–11), replacing inefficient zscalv fallback.
- Refactored cdim0 == mnr condition into a switch statement to support multiple optimized cases.
- Added three new macros for column-stored packing with distinct masking patterns.
- Implemented 11 dedicated handlers for row and column stored A matrix packing
  with efficient masked loads/stores for partial data.

    AMD-Internal: [CPUPL-6677]

Co-authored-by: harsh dave <harsdave@amd.com>

* Update bli_packm_zen4_asm_z12xk.c

---------

Co-authored-by: harsh dave <harsdave@amd.com>
Co-authored-by: Sharma, Shubham <Shubham.Sharma3@amd.com>
2025-08-18 12:38:45 +05:30
Sharma, Shubham
33ea09d967 Fix ZTRSM accuracy for conjugate transpose (#133)
- For Conjugate inputs, ZTRSM small code path is less accurate than native codepath.
- Redirected the conjugate inputs to native code path on ZEN4 if TRSM preinversion is disabled.
- Tuned AOCL_DYNAMIC to handle the new inputs redirected to ZTRSM native.
2025-08-14 19:49:28 +05:30
Dave, Harsh
1b1b19486b Add packing support M edge cases in ZGEMM 12xk pack kernel (#89)
Previously, the ZGEMM implementation used `zscalv` for cases
    where the M dimension of matrix A is not in multiple of 24,
    resulting in a ~40% performance drop.

    This commit introduces a specialized edge cases in pack kernel
    to optimize performance for these cases.

    The new packing support significantly improves the performance.

    - Removed reliance on `zscalv` for edge cases, addressing the
      performance bottleneck.

    AMD-Internal: [CPUPL-6677]

Co-authored-by: harsh dave <harsdave@amd.com>
2025-08-14 14:29:03 +05:30
Sharma, Arnav
76c4872718 GEMV support for S8S8S32O32 Symmetric Quantization
Introduced support for GEMV operations with group-level symmetric quantization for the S8S8S32032 API.

Framework Changes:
- Added macro definitions and function prototypes for GEMV with symmetric quantization in lpgemm_5loop_interface_apis.h and lpgemm_kernels.h.
  - LPGEMV_M_EQ1_KERN2 for the lpgemv_m_one_s8s8s32os32_sym_quant kernel, and
  - LPGEMV_N_EQ1_KERN2 for the lpgemv_n_one_s8s8s32os32_sym_quant kernel.
- Implemented the main GEMV framework for symmetric quantization in lpgemm_s8s8s32_sym_quant.c.

Kernel Changes:
- lpgemv_m_one_s8s8s32os32_sym_quant for handling the case where M = 1 and implemented in lpgemv_m_kernel_s8_grp_amd512vnni.c.
- lpgemv_n_one_s8s8s32os32_sym_quant for handling the case where N = 1 and implemented in lpgemv_n_kernel_s8_grp_amd512vnni.c.
- Updated the buffer reordering logic for group quantization for N=1 cases in aocl_gemm_s8s8s32os32_utils.c.

Notes
- Ensure that group_size is a factor of both K (and KC when K > KC).
- The B matrix must be provided in reordered format (mtag_b == REORDERED).

AMD-Internal: [SWLCSG-3604]
2025-08-14 13:41:25 +05:30
Sharma, Shubham
3a14417ce1 DGEMV BugFixes and code cleanup (#134)
- Modified gemv (matrix-vector multiply) reference for better handling of transpose flags.
- Modified Zen4 kernel implementations for better handling of transpose flags and vector stride (incy).
- The changes refine kernel selection logic and move variable definition in macro guards.
2025-08-14 12:54:06 +05:30
S, Hari Govind
9a7bacb30c Improve numerical precision in ZGEMV API (#130)
- Replaced separate real and imaginary accumulators (real_acc, imag_acc) with a column-wise accumulator array (row_acc[2]), making accumulation and updates to the target Y vector more direct, concise, and unified.

- Leveraged AVX-512 fused multiply-add/subtract operations (_mm512_fmaddsub_pd, _mm512_fmsubadd_pd) and efficient permutations (_mm512_permute_pd) to enable accurate and efficient computation of real and imaginary components in a single instruction, while reducing code complexity for both code paths.

- Removed redundant instructions (such as unnecessary permutations and zero-register operations) and simplified the control flow.

AMD-Internal: [CPUPL-7015]
2025-08-14 11:19:51 +05:30
Dave, Harsh
fa69528a3b Bugfix: Tuned zgemm threshold for zen4 (#129)
* Bugfix: Tuned zgemm threshold for zen4

Threshold tuning that determines whether SUP or native path should
be used for given input matrix size.

This tuning forces skinny matrices to take SUP path to ensure better
performance.

* Bugfix: Tuned zgemm threshold for zen4 and zen5

Threshold tuning that determines whether SUP or native path should
be used for given input matrix size.

This tuning forces skinny matrices to take SUP path to ensure better
performance.

---------

Co-authored-by: harsdave <harsdave@amd.com>
2025-08-13 19:02:39 +05:30
Smyth, Edward
da875888d7 DTL logging fixes and improvements (4)
More improvements to DTL coverage and coding:
- Removed some DTL overheads from performance stats timing for all APIs
  where it is currently implemented (i.e. gemm, gemmt, trsm, nrm2)
- Expand logging coverage to gemm pack and compute APIs, including
  performance stats for gemm_compute
- Expand logging coverage to rot, rotg, rotm and rotmg APIs
- Tidied order of function prototypes in aocl_dtl/aocldtl_blis.h

AMD-Internal: [CPUPL-7010]
2025-08-13 11:12:16 +01:00
Sharma, Shubham
b7b9d3ec53 Exported AVX512 DGEMV kernels (#131)
- Exported DGEMV AVX512 kernels so that they can be directly called by libflame to avoid blis and omp overhead.
2025-08-12 17:17:41 +05:30
Smyth, Edward
021f6bc960 GEMMTR full set of APIs
Commit eaa76dfe28 added LAPACK 3.12 GEMMTR
interfaces as aliases to existing BLIS GEMMT. Here we add full set of
Fortran upper case and no underscore API aliases and _blis_impl variants.

AMD-Internal: [CPUPL-6581]
2025-08-12 10:24:24 +01:00
Smyth, Edward
dc06cdb621 DTL logging fixes and improvements (3)
More improvements to DTL coverage and coding:
- Expand logging coverage to banded matrix APIs in frame/compat/f2c
- Expand logging coverage to packed matrix APIs in frame/compat/f2c
- Commit b8aa5c2894 was wrong to
  remove calls to AOCL_DTL_INITIALIZE for APIs where bli_init_auto()
  is not called. AOCL_DTL_INITIALIZE is essential when logging is
  enabled but tracing is not, otherwise the ICV gbIsLoggingEnabled
  will not be initialized based on logging status and remain as
  the default FALSE value.

AMD-Internal: [CPUPL-7010]
2025-08-12 09:42:40 +01:00
Sharma, Shubham
b0a4914417 Added DGEMV no transpose multithreaded Implementations (#12)
* Added DGEMV no transpose multithreaded Implementations
- Added new avx512 M and N kernels for DGEMV.
- Added multiple MT implementations for same kernels.
- Added AOCL_dynamic logic for L2 apis.
- Tuned AOCL_dynamic and code path selection for DGEMV on ZEN5.
- Added same kernels for SGEMV, but these kernels are not enabled yet.
- Added SGEMV reference kernel.

AMD-Internal: [SWLCSG-3408]

Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
2025-08-12 10:39:12 +05:30
Vlachopoulou, Eleni
1f8a7d2218 Renaming CMAKE_SOURCE_DIR to PROJECT_SOURCE_DIR so that BLIS can be built properly via FetchContent() (#65) 2025-08-07 15:51:59 +01:00
Smyth, Edward
563b161933 Standardize Python files to use Python 3
Python 2 is no longer maintained, and using python3 avoids accidental invocation of outdated interpreters.

AMD-Internal: [CPUPL-6579]
2025-08-06 12:04:26 +01:00
Bhaskar, Nallani
9d571bb5d3 Fixed few Coverity warnings in aocl gemm addon
Fixed few Coverity warnings in aocl gemm addon 


AMD-Internal: CPUPL-6913
2025-08-06 15:37:40 +05:30
Smyth, Edward
efdb5a06df Changes to default choices of sub-configuration
amdzen and x86_64 configuration family: On Intel processors supporting
AVX-512, the zen4 sub-configuration was dispatched by default, as even
though it is not optimized specifically for Intel processors, it includes
a range of additional optimizations than are present in the the older skx
sub-configuration. However, the zen4 data path is 256 bit, thus this
sub-configuration uses a mixture of AVX2 and AVX-512 kernels. Now that
zen5 sub-configuration is available, with more extensive use of AVX-512
kernels, switch to use this by default on relevant Intel processors.

intel64 configuration family: On AMD processors supporting AVX-512 or
AVX2, the generic sub-configuation was dispatched by default. Change to
dispatch skx or haswell sub-configuation, based on the available ISA
support.

AMD-Internal: [CPUPL-6743]
2025-08-06 07:56:53 +01:00
Sharma, Shubham
6db8639284 Fix coverity issue in ZTRSM kernels (#112)
Static analysis issues in ZTRSM (triangular solve with matrix) kernels for Zen5 architecture by initializing variables to prevent potential use of uninitialized values.
Initialize loop variables i, j, and k_iter to 0 to prevent potential uninitialized access
Initialize mask variables and remainder variables to 0 across multiple kernel functions
2025-08-05 15:00:59 +05:30
Smyth, Edward
09592bf4f3 Windows TLS with Microsoft compiler
We currently use clang compiler on Windows, so no problem with current
builds. However, if we wanted to use Microsoft compiler, add different
definition of BLIS_THREAD_LOCAL as __declspec(thread)

AMD-Internal: [CPUPL-6958]
2025-08-04 15:09:25 +01:00
Smyth, Edward
b8aa5c2894 DTL logging fixes and improvements (2)
More improvements to DTL coverage and coding:
- Tidy functions and prototypes in aocl_dtl/aocldtl_blis.{c,h} into
  alphabetical groups within different BLAS categories.
- Expand tracing coverage to APIs in frame/compat/f2c
- Remove calls to AOCL_DTL_INITIALIZE (added in
  c56dcb6ffb) as DTL_Trace calls
  bli_init_auto which will call AOCL_DTL_INITIALIZE

AMD-Internal: [CPUPL-7010]
2025-08-01 15:27:35 +01:00
V, Varsha
68d47281df Fixing some copying bugs in Batch-Matmul code
- Removed duplicate calls to BATCH_GEMM_CHECK().
 - Refactored freeing of post-op pointer in bench code and verified the
    functionality.
 - Modified indexing of the array to take the correct values.
2025-08-01 18:42:10 +05:30
Balasubramanian, Vignesh
c96e7eb197 Threshold tuning for code-paths and optimal thread selection for ZGEMM(ZEN5)
- Updated the thresholds to enter the AVX512 SUP codepath in
  ZGEMM(on ZEN5). This caters to inputs that scale well with
  multithreaded-execution(in the SUP path).

- Also updated the thresholds to decide ideal threads, based on
  'm', 'n' and 'k' values. The thread-setting logic involves
  determining the number of tiles for computation, and using them
  to further tune for the optimal number of threads.

- This logic builds over the assumption that the current thread
  factorization logic is optimal. Thus, an additional data analysis
  was performed(on the existing ZEN4 and the new ZEN5 thresholds),
  to also cover the corner cases, where this assumption doesn't hold
  true.

- As part of the future work, we could reimplement the thread
  factorization for GEMM, which would additionally require a new
  set of threshold tuning for every datatype.

AMD-Internal: [CPUPL-7028]

Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
AOCL-Aug2025-b1
2025-08-01 16:02:12 +05:30
Rayan, Rohan
1bb1160061 Fixing bench overflow in GFLOPS computation
Fixing a bug in some bench applications where GFLOPS computation ran into integer overflows because explicit type casting to double was not done in the computation

removing all multiplies by 1.0 during GFLOP computation

AMD-Internal: CPUPL-7016

---------
Co-authored-by: Rayan <rohrayan@amd.com>
2025-08-01 11:08:53 +05:30
Rayan, Rohan
a26f0a93f9 Removing some unnecessary branches in CGEMM's aocl-dynamic logic.
Code cleanup: Removing some redundant if-else code in the CGEMM
aocl-dynamic logic. This should ensure that multiple branching is
avoided, while preserving existing heuristics.

AMD Internal: [CPUPL - 6579]

Co-authored-by: Rayan <rohrayan@amd.com>
2025-08-01 10:31:45 +05:30
Chandrashekara K R
2ff3e02309 Add platform-specific alignment macro for rntm_s struct
Introduced BLIS_ATTRIB_ALIGN to standardize 64-byte alignment across platforms.
On Windows, alignment is enabled only for Clang and disabled for other compilers.
Replaced direct usage of __attribute__((aligned(64))) in rntm_s with the macro.
2025-07-30 21:15:44 +05:30
Smyth, Edward
e3dcc15a80 Add configure and CMake options to enable DTL logging and tracing (#86)
Instead of editing a header file, add options to build systems to allow
DTL tracing and/or logging output to be generated. For most users
logging is recommended, producing a line of output per application
thread of every BLAS call made. Tracing provides more detailed info
of internal BLIS calls, and is aimed more at expert users and BLIS
developers. Different tracing levels from 1 to 10 provide control of
the granularity of information produced. The default level is 5. Note
that tracing, especially at higher tracing levels, will impose a
significant runtime cost overhead.

Example usage:

Using configure:

  ./configure ... --enable-aocl-dtl=log amdzen

  ./configure ... --enable-aocl-dtl=trace --aocl-dtl-trace-level=6 amdzen

  ./configure ... --enable-aocl-dtl=all amdzen

Using CMake:

  cmake ... -DENABLE_AOCL_DTL=LOG

  cmake ... -DENABLE_AOCL_DTL=TRACE -DAOCL_DTL_TRACE_LEVEL=6

  cmake ... -DENABLE_AOCL_DTL=ALL

Also, modify function AOCL_get_requested_threads_count to correct
reported thread count in cases where internal value is recorded as -1

AMD-Internal: [CPUPL-7010]
2025-07-28 15:24:10 +01:00
Bhaskar, Nallani
46aac600ec Added f32 kernels without post-ops to avoid overhead
Description:

1. Crated f32 intrinsic kernels without post-ops support f32 gemm
   without post-ops optimally.
2. Initiated the no post-ops kernels from main kernel when post-ops
   hander has no post-ops to do.
3. The kernels are redundant but added to get the best perf
   for pure GEMM call.

AMD-Internal : SWLCSG-3692
2025-07-25 23:14:23 +05:30
Smyth, Edward
c56dcb6ffb DTL logging fixes and improvements
The environment variable AOCL_VERBOSE was inconsistent in its
behaviour, sometimes producing a single line of output per file from
multiple BLAS calls, when it should be all or nothing. Note that:
- AOCL_VERBOSE is only active when DTL logging has been enabled at
  compile time. Otherwise, this environment variable is not read.
- When logging is enable at compile time, logging output is produced
  by default. Thus AOCL_VERBOSE is more of use to turn output off,
  rather than on.
- For production runs without logging, it is recommended to recompile
  with DTL disabled, as this minimizes overheads within the BLIS code.
- AOCL_VERBOSE should be set to 0 or 1, and not values such as FALSE
  or TRUE.

Changes to improve consistency when AOCL_VERBOSE is set:
- Change DTL variables from Bool (unsigned char) datatype to bool, as
  used elsewhere in BLIS.
- Ensure bli_init_auto() is called before AOCL_DTL_TRACE_ENTRY() and
  AOCL_DTL_LOG_*_INPUTS(), as bli_init_auto calls AOCL_DTL_INITIALIZE()
- In APIs which avoid calling bli_init_auto(), add explicit calls to
  AOCL_DTL_INITIALIZE(). Also, make a proper comment about not calling
  bli_init_auto(), rather than just commenting out call, which looks like
  dead code.

Other DTL logging control changes:
- Make gbIsLoggingEnabled ICV thread local as this can be updated by
  calls to AOCL_DTL_Enable_Logs and AOCL_DTL_Disable_Logs APIs
- After recent changes to hide some internal BLIS definitions behind
  ifdef BLIS_IS_BUILDING_LIBRARY guard, change BLIS_THREAD_LOCAL
  definition to be exported again.

Logging output changes:
- Standardize printing of datatype to be lower case.
- Don't force printing of GEMM transa and transb to upper case, instead
  print in the case provided by the application code.
- Add logging output to all variants (in terms of AMD/non-AMD optimized
  and datatype) of SWAP and SCAL.

AMD-Internal: [CPUPL-7010]
2025-07-25 11:27:00 +01:00
S, Hari Govind
273a05f0bd Fix for performance regression caused by non-unit stride y in DGEMV API (#91)
- Temperory fix for regression in DGEMV for non-unit stride y inputs. The code
  section responsible for handling non-unit stride y has been removed from the
  frame.

- The kernel code is extended with if condition to handle both unit and non-unit
  stride y.

AMD-Internal: [CPUPL-6869]
AOCL-Weekly-250725
2025-07-25 10:57:57 +05:30
Balasubramanian, Vignesh
93414f56c8 Bugfix : Guarded AOCL_ENABLE_INSTRUCTONS support based on AVX512-ISA support
- As part of rerouting to AVX2 code-paths on ZEN4/ZEN5(or similar)
  architectures, the code-base established a contingency when
  deploying fat binary on ZEN/ZEN2/ZEN3 systems. Due to this,
  it was required that we always set AOCL_ENABLE_INSTRUCTIONS to
  'ZEN3'(or similar values) to make sure we don't run AVX512
  code on such architectures. This issue existed on FP32 and BF16
  APIs.

- Added checks to detect the AVX512-ISA support to enable rerouting
  based on AOCL_ENABLE_INSTRUCTIONS. This removes the incorrect
  constraint that was put forth.

AMD-Internal: [CPUPL-7020]

Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
2025-07-24 12:20:05 +05:30
V, Varsha
8a86620753 Bug Fix in INT8 reference un-reorder API
- For int8/uint8 reorder function, the k dimension is made multiple of 4 to
 meet the alignment requirements.
 - Modified the logic to update the k_updated to use multiples of 4.

[AMD - Internal : SWLCSG - 3686 ]
2025-07-24 11:26:49 +05:30
Smyth, Edward
4bc5287f72 Support applications using Intel icc and icx compilers on Windows (#82)
The blis.h header file includes a lot of BLIS internal definitions. Some of these caused problems
when using a BLIS library compiled with clang on Windows from an applications compiled with
the Intel icc and icx compilers. Workaround is to use "#ifdef BLIS_IS_BUILDING_LIBRARY" to
guard these definitions from being exposed to applications including blis.h. (The BLIS configure
and cmake builds systems automatically define BLIS_IS_BUILDING_LIBRARY only for compiling
the BLIS library.)

This patch implements the minimum changes to resolve the issue. Longer term, similar changes
may need to be added around all BLIS internal definitions in blis.h.

AMD-Internal: [CPUPL-6953]
2025-07-22 10:22:45 +01:00
V, Varsha
9e8c9e2764 Fixed compiler warnings in LPGEMM
- Modified the correct variables to be passed for the batch_gemm_thread_decorator() for
 u8s8s32os32 API.
 - Removed commented lines in f32 GEMV_M kernels.
 - Modified some instructions in F32 GEMV M and N Kernels to re-use the existing macros.
 - Re-aligned the BIAS macro in the macro definition file.

[ AMD - Internal : CPUPL - 7013 ]
2025-07-18 16:15:52 +05:30
V, Varsha
2f54bc1e14 Added F32 reference Unreorder function
- Implemeneted unpackb_f32f32f32of32_reference function.
 - Modified const pointer declaration in aocl_reorder_reference() to avoid compiler warnings.

[AMD-Internal: SWLCSG-3618 ]
2025-07-18 14:52:03 +05:30