956 Commits

Author SHA1 Message Date
Balasubramanian, Vignesh
73911d5990 Updates to the build systems (CMake and Make) for LPGEMM compilation (#303)
- The current build systems have the following behaviour
  with regards to building the "aocl_gemm" addon codebase (LPGEMM)
  when "amdzen" is given as the target architecture (fat binary):
  - Make:  Attempts to compile LPGEMM kernels using the same
                compiler flags that the makefile fragments set for BLIS
                kernels, based on the compiler version.
  - CMake: With presets, it always enables the addon compilation
                 unless explicitly specified with the ENABLE_ADDON variable.

- This poses a bug with older compilers, owing to them not supporting
  BF16 or INT8 intrinsic compilation.

- This patch adds the functionality to check for GCC and Clang compiler versions,
  and disables LPGEMM compilation if GCC < 11.2 or Clang < 12.0.

- Make:  Updated the configure script to check for the compiler version
              if the addon is specified.
  CMake: Updated the main CMakeLists.txt to check for the compiler version
               if the addon is specified, and to also force-update the associated
               cache variable. Also updated kernels/CMakeLists.txt to
               check if "aocl_gemm" remains in the ENABLE_ADDONS list after
               all the checks in the previous layers.

AMD-Internal: [CPUPL-7850]

Signed-off-by: Vignesh Balasubramanian <Vignesh.Balasubramanian@amd.com>
2026-01-16 19:39:55 +05:30
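The version gate described in this commit can be sketched in plain C. This is a hypothetical helper (`lpgemm_compiler_ok` is an illustrative name; the real checks live in the configure script and the CMakeLists.txt files), shown only to make the cutoff explicit: GCC >= 11.2 or Clang >= 12.0.

```c
#include <assert.h>

/* Hypothetical helper mirroring the configure/CMake gate:
 * LPGEMM (aocl_gemm) needs BF16/INT8 intrinsics, so require
 * GCC >= 11.2 or Clang >= 12.0; otherwise the addon is disabled. */
static int lpgemm_compiler_ok(int is_clang, int major, int minor)
{
    if (is_clang)
        return major >= 12;                              /* Clang >= 12.0 */
    return (major > 11) || (major == 11 && minor >= 2);  /* GCC >= 11.2  */
}
```

In practice the build system would feed in the parsed `__GNUC__`/`__GNUC_MINOR__` or `__clang_major__` values of the detected compiler.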
Sharma, Shubham
824e289899 Tuned decision logic for DGEMV multithreading for skinny sizes. (#301)
AMD-Internal: [CPUPL-7769]
2026-01-14 12:08:46 +05:30
Rayan, Rohan
9cbb1c45d8 Improving sgemm rd kernel on zen4/zen5 (#292)
Fixes some inefficiencies in the zen4 SUP RD kernel for SGEMM.
The loops for the 8- and 1-iteration fringe cases of the K-loop were performing loads on ymm/xmm registers while computing on zmm registers.
This caused multiple unnecessary iterations in the kernel for matrices with certain k-values.
Fixed by introducing masked loads and computations for these cases.

AMD-Internal: https://amd.atlassian.net/browse/CPUPL-7762
Co-authored-by: Rohan Rayan <rohrayan@amd.com>
2025-12-17 18:48:50 +05:30
Rayan, Rohan
a22e0022c2 SGEMM tiny path tuning for zen4 and zen5 (#267)
* Adding a model to determine which matrices enter the SGEMM tiny path
* This extends the sizes of matrices that enter the tiny path, which was constrained to the L1 cache size previously
* Now matrices that fit in L2 are also allowed into the tiny path, provided they are determined to be faster than the SUP path
* Adding thresholds based on the SUP path sizes
* Added for Zen4 and Zen5

---------
AMD-Internal: CPUPL-7555
Co-authored-by: Rohan Rayan <rohrayan@amd.com>
2025-12-10 15:58:54 +05:30
Balasubramanian, Vignesh
54ac36c8bc Bugfix: BF16 to F32 conversion in AVX2 F32 codepath
- Updated the BF16-to-F32 conversion function (for the case of
  receiving column-stored inputs) to use the correct strides
  while storing.

- Conversion of B is potentially multithreaded using
  the threads meant for IC compute. With the wrong
  strides in the kernel, this gives rise to incorrect
  writes onto the miscellaneous buffer.

AMD-Internal: [CPUPL-7675]

Co-authored-by: Vishal-A <Vishal.Akula@amd.com>
Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
2025-12-01 15:06:13 +05:30
Sharma, Shubham
b60542c45c Added Fast path for single threaded AVX512 DGEMV kernel #260
- If BLIS is compiled as a multithreaded library, BLIS_NUM_THREADS is set to 1, and the sizes are large enough for the multithreaded path to be optimal, we take the multithreaded path even though only one thread can be spawned. This adds OpenMP overhead.
- A check has been added inside the multithreaded kernels to use the single-threaded code path if only one thread can be spawned.
AMD-Internal : [SWLCSG-3408]
2025-11-10 10:32:36 +05:30
S, Hari Govind
4ecfbde082 Fix extreme values handling in GEMV
- When alpha == 0, we are expected to only scale the y vector by beta and not read A or x at all.
- This scenario is not handled properly in all code paths, which causes NaN and Inf values from A and x to be wrongly propagated. For example, for non-Zen architectures (the default block in the switch case) no such check is present; similarly, some of the avx512 kernels are also missing these checks.
- When beta == 0, we are not expected to read y at all; this also is not handled correctly in one of the avx512 kernels.
- To fix these, an early-return condition for alpha == 0 is added to the BLA layer itself so that each kernel does not have to implement the logic.
- The DGEMV AVX512 transpose kernel has been fixed to load vector y only when beta != 0.

AMD-Internal: [CPUPL-7585]
2025-11-08 12:30:03 +05:30
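The contract this commit enforces can be illustrated with a minimal scalar reference. This is an illustrative sketch (`gemv_ref` is not the BLIS kernel): alpha == 0 must skip all reads of A and x, and beta == 0 must overwrite y without reading it, so NaN/Inf in the untouched operands cannot propagate.

```c
#include <assert.h>
#include <math.h>

/* Scalar sketch of y := alpha*A*x + beta*y with the extreme-value rules:
 * alpha == 0 -> only scale y by beta, never dereference A or x;
 * beta  == 0 -> overwrite y without reading it. */
static void gemv_ref(int m, int n, double alpha, const double *a, int lda,
                     const double *x, double beta, double *y)
{
    if (alpha == 0.0) {
        for (int i = 0; i < m; i++)
            y[i] = (beta == 0.0) ? 0.0 : beta * y[i];
        return;                           /* A and x are never read */
    }
    for (int i = 0; i < m; i++) {
        double acc = 0.0;
        for (int j = 0; j < n; j++)
            acc += a[i + j * lda] * x[j]; /* column-major A */
        y[i] = (beta == 0.0) ? alpha * acc : alpha * acc + beta * y[i];
    }
}
```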
Varaganti, Kiran
49961aa569 Fix DTL dynamic thread logging in BLAS operations (#230)
- Remove redundant AOCL_DTL_LOG_NUM_THREADS calls from early return paths
- Update thread count logging to use AOCL_get_requested_threads_count() for early exits
- Clean up duplicate DTL logging in gemv_unf_var1 and gemv_unf_var2 implementations
- Remove thread count logging from bli_dgemv_n_zen4_int kernel variants
- Simplify aocldtl_blis.c AOCL_DTL_log_gemv_sizes by removing redundant conditional
- Standardize DTL trace exit patterns across axpy, scal, and gemv operations
- Remove commented-out DTL logging code in zen4 gemv kernel

This patch ensures thread count is logged only once per operation and uses
the correct API (AOCL_get_requested_threads_count) for early exit scenarios
where the actual execution thread count may differ from requested threads.
2025-10-24 13:34:00 +01:00
S, Hari Govind
ab25b825aa Fix: Resolve Operator Precedence Warning in Zen5 DCOMPLEX Threshold Logic
- Add explicit parentheses around (n <= 1520) && (k <= 128) to clarify
  operator precedence and resolve compiler warning. The intended logic
  is (m <= 1380) OR (n <= 1520 AND k <= 128).

- This change eliminates the compiler warning about mixing || and &&
  operators without explicit grouping.
2025-10-24 14:23:23 +05:30
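The grouping the commit describes can be written out as a one-line predicate (the helper name is illustrative, not from the source): `&&` already binds tighter than `||`, so the parentheses change nothing semantically, but they make the intended routing unambiguous and silence the compiler warning.

```c
#include <assert.h>

/* Intended Zen5 DCOMPLEX threshold logic, with explicit grouping:
 * (m <= 1380) OR (n <= 1520 AND k <= 128). */
static int zgemm_threshold_hit(int m, int n, int k)
{
    return (m <= 1380) || ((n <= 1520) && (k <= 128));
}
```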
Rayan, Rohan
e85be22da0 Adding tiny path for SGEMM (#237)
Adding SGEMM tiny path for Zen architectures.
Needed to cover some performance gaps seen wrt MKL
Only allowing matrices that all fit into the L1 cache to the tiny path
Only tuned for single threaded operation at the moment
Todo: Tune cases where AVX2 performs better than AVX512 on Zen4
Todo: The current ranges are very conservative, there may be scope to increase the matrix sizes that go into the tiny path

AMD-Internal: CPUPL-7555
Co-authored-by: Rohan Rayan rohrayan@amd.com
2025-10-24 13:14:33 +05:30
V, Varsha
fecb1aa7a5 Bug Fix in BF16 AVX2 conversion path (#236)
- In the current implementation of BF16-to-F32 conversion for packed data,
  both the GEMM and GEMV cases are handled in the same function, separated
  by conditionals.
- But when n = (NC+1), the function would execute the GEMV conversion logic
  and write back the data inaccurately, leading to accuracy issues.
- Hence, modified the convert and reorder functions to have separate
  conversion logic, making the code cleaner and avoiding confusion.
- Also, updated the API calls to adhere to these changes.

[AMD-Internal: CPUPL-7540]
2025-10-17 15:38:02 +05:30
S, Hari Govind
0ce45e3147 Changing ZGEMM SUP threshold logic for zen5 to fix performance regression (#233)
- Revert the logical operator from OR (||) to AND (&&) in the DCOMPLEX
  (ZGEMM) SUP threshold condition for k <= 128. The previous change to
  OR logic was causing performance regressions for certain input sizes
  by incorrectly routing cases to the SUP path when the native path
  would be more optimal.
2025-10-17 11:25:33 +05:30
Sharma, Shubham
1dca574a9d Improved fringe case handling for AVXPV kernel (#228)
- The current kernel uses masked AVX512 instructions to handle fringe cases.
- These instructions are slow on Genoa.
- To handle sizes less than 8, AVX2 and SSE code has been added.
- The existing masked AVX512 code performs better when n > 8, so it is still kept for handling larger sizes where n % 8 != 0.

AMD-Internal: [CPUPL-7467]
2025-10-10 14:25:35 +05:30
Dave, Harsh
c9933886f7 Tuned zgemm threshold for zen5 (#215)
Threshold tuning that determines whether SUP or native path should
be used for given input matrix size.

This tuning forces skinny matrices to take SUP path to ensure better
performance.

AMD-Internal: [CPUPL-7369]

Co-authored-by: harsh dave <harsdave@amd.com>
2025-09-23 08:05:31 +05:30
Bhaskar, Nallani
db3134ed6d Disabled no post-ops path in lpgemm f32 kernels for few gcc versions
Guarded the np (no post-ops) path in the F32 API with a macro
as a workaround, as gcc 11.4 and 11.2 produce accuracy issues
with the np path.
2025-09-22 15:52:21 +05:30
Varaganti, Kiran
807de2a990 DTL Log update
* DTL Log update
Updates logs with nt and the AOCL Dynamic selected nt for axpy, scal and dgemv.
Modified bench_gemv.c to be able to process the modified DTL logs.

* Updated DTL log for copy routine with actual nt and dynamic nt

* Refactor OpenMP pragmas and clean up code

Removed unnecessary nested OpenMP pragma and cleaned up function end comment.

* Fixed DTL log for sequential build

* Added thread logging in bla_gemv_check for invalid inputs

---------

Co-authored-by: Smyth, Edward <Edward.Smyth@amd.com>
2025-09-22 11:32:00 +05:30
Sharma, Arnav
ee3d250b7a Fix for F32 to BF16 Conversion and AVX512 ISA Support Checks
- Fixed register assignment bug in lpgemv_m_kernel_f32_avx512 where zmm3
  was incorrectly used instead of zmm4 in BF16_F32_BETA_OP_NLT16F_MASK macro.

- Replaced hardware-specific BF16 conversion intrinsics with manual
  rounding, bit manipulation and F32 instruction set for compatibility on
  hardware without native BF16 support.

- Added AVX512_BF16 ISA support checks for s8s8s32obf16 and u8s8s32obf16
  GEMM operations to ensure processor compatibility before execution.

AMD-Internal: [CPUPL-7410]
2025-09-19 18:49:33 +05:30
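The manual conversion the commit describes can be sketched portably. This is an illustrative `f32_to_bf16` (not the LPGEMM code, which operates on vectors): BF16 keeps the top 16 bits of the IEEE-754 float encoding, and round-to-nearest-even is done with a bias-add before truncation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Portable F32 -> BF16 conversion with round-to-nearest-even,
 * usable where AVX512_BF16 hardware support is unavailable. */
static uint16_t f32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));
    /* NaN: truncate and force a quiet-NaN payload bit. */
    if ((bits & 0x7fffffffu) > 0x7f800000u)
        return (uint16_t)((bits >> 16) | 0x0040u);
    /* Round to nearest even: add 0x7fff plus the LSB that survives
     * truncation, then keep the top 16 bits. */
    bits += 0x7fffu + ((bits >> 16) & 1u);
    return (uint16_t)(bits >> 16);
}
```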
Sharma, Shubham
773d3a3d45 Re-tuned GEMV thresholds (#210)
Retuned DGEMV AVX512 non-transpose thresholds to avoid regression on ZEN4.

AMD-Internal: [CPUPL-7448]
2025-09-19 12:43:50 +05:30
Sharma, Shubham
10b2e59782 Modified AXPY kernel to ensure consistency of numerical results (#188)
The current DAXPY kernel uses C code for cases where n % 8 != 0.
This results in the compiled code using MUL+ADD instructions via SSE instead of FMA instructions,
which causes inconsistency in numerical results.
To fix this, the AVX2 and C code is replaced with masked AVX512 instructions to compute the fringe cases.

AMD-Internal : [CPUPL-7315]
2025-09-18 11:42:31 +05:30
Sharma, Shubham
740fbdf50d Fix memory leak in DGEMV kernel (#187)
Memory was not freed for GEMV when the MT kernel was called with NT = 1.
Fixed this by adding an extra check to make sure memory is freed.

AMD-Internal: [CPUPL-7352]
2025-09-18 11:18:13 +05:30
Sharma, Shubham
a02020686c Tuned DGEMV no-transpose thresholds #193
In order to avoid regression for some sizes, the thresholds are retuned and the scope of the 32x8 kernels is expanded.

AMD-Internal:[CPUPL-7336]
2025-09-18 10:50:09 +05:30
Smyth, Edward
e3b22f495e Standardize Zen kernel names (2)
Further changes to fix inconsistencies in naming of zen kernels.

AMD-Internal: [CPUPL-6579]
2025-09-17 21:48:34 +01:00
Smyth, Edward
e59eabaf58 Compiler warnings fixes (2)
Fix compiler warning messages in LPGEMM code:
- Removed extraneous parentheses in aocl_batch_gemm_s8s8s32os32.c
- Removed unused variables in lpgemv_{m,n}_kernel_s8_grp_amd512vnni.c
- Changed ERR_UBOUND in math_utils_avx2.h and math_utils_avx512.h
  to match how it is specified in AOCL libm erff.c

AMD-Internal: [CPUPL-6579]
2025-09-17 18:28:34 +01:00
Dave, Harsh
31aba514fe coverity issue fix for ztrsm (#176)
* Fixed coverity issue in ztrsm small code path

* Fixed coverity issue in ztrsm small code path

---------

Co-authored-by: harsh dave <harsdave@amd.com>
2025-09-17 19:39:12 +05:30
Dave, Harsh
a2526a2593 Fixes Coverity static analysis issue in the DTRSM (#181)
* Fixes Coverity static analysis issue in the DTRSM

- Initializes ps_a_use variable and calls bli_auxinfo_set_ps_a() to set
  pack stride in auxinfo structure.

* Fixed unintialized variable issue in the DTRSM

- Initializes ps_a_use variable and calls bli_auxinfo_set_ps_a() to set
  pack stride in auxinfo structure.

---------

Co-authored-by: harsdave <harsdave@amd.com>
2025-09-17 18:02:45 +05:30
Smyth, Edward
ae6c7d86df Tidying code
- AMD-specific BLAS1 and BLAS2 framework: changes to make variants
  more consistent with each other
- Initialize kernel pointers to NULL where not immediately set
- Fix code indentation and other whitespace issues in DTL
  code and addon/aocl_gemm/frame/s8s8s32/lpgemm_s8s8s32_sym_quant.c
- Fix typos in DTL comments
- Add missing newline at end of test/CMakeLists.txt
- Standardize on using arch_id variable name

AMD-Internal: [CPUPL-6579]
2025-09-16 14:52:54 +01:00
Rayan, Rohan
2e7f387d13 Fixing the coverity issues with CID: 23269 and CID: 137049 (#180)
Fixing some coverity issues that detected out of bounds array accesses.

AMD-Internal: CPUPL-6579
Co-authored-by: Rohan Rayan rohrayan@amd.com
2025-09-16 10:01:52 +05:30
KadavilMadanaMohanan, MithunMohan (Mithun Mohan)
5de25ce9a7 Fixed high priority coverity issues in LPGEMM. (#178)
* Fixed high priority coverity issues in LPGEMM.

- Out-of-bounds issue and uninitialized variables fixed in the aocl_gemm addon.
2025-09-11 18:27:19 +05:30
Smyth, Edward
a4db661b44 GCC 15 SUP kernel workaround (2)
Previous commit (30c42202d7) for this problem turned off
-ftree-slp-vectorize optimizations for all kernels. Instead, copy
the approach of upstream BLIS commit 36effd70b6a323856d98 and disable
these optimizations only for the affected files by using GCC pragmas

AMD-Internal: [CPUPL-6579]
2025-09-04 17:14:06 +01:00
V, Varsha
c5bd1feabd Fixed out-of-bound access in F32 matrix add/mul ops (#168)
- Fixed out-of-bound accesses in the scale factors of the matrix-add and
  matrix-mul post-ops of the F32 AVX512_256 kernels.

[ AMD-Internal : CPUPL-7261 ]
2025-09-01 16:40:03 +05:30
Smyth, Edward
fb2a682725 Miscellaneous changes
- Change begin_asm and end_asm comments and unused code in files
     kernels/haswell/3/sup/s6x16/bli_gemmsup_rv_haswell_asm_sMx6.c
     kernels/zen4/3/sup/bli_gemmsup_cd_zen4_asm_z12x4m.c
  to avoid problems in clobber checking script.
- Add missing clobbers in files
     kernels/zen4/1m/bli_packm_zen4_asm_d24xk.c
     kernels/zen4/1m/bli_packm_zen4_asm_z12xk.c
     kernels/zen4/3/sup/bli_gemmsup_cv_zen4_asm_z12x4m.c
- Add missing newline at end of files.
- Update some copyright years for recent changes.
- Standardize license text formatting.

AMD-Internal: [CPUPL-6579]
2025-08-26 16:37:43 +01:00
Vankadari, Meghana
a05279cd97 Bug fix in F32 AVX2 kernels (#164)
- corrected the loading strides used to load matadd and matmul
  pointers in F32 AVX2 kernels.

AMD-Internal: CPUPL-7221
2025-08-26 19:52:50 +05:30
Bhaskar, Nallani
b052775644 Disabled topology detection in LPGEMM
- Disabled topology detection as libgomp is not honoring
  the standard function omp_get_place_proc_ids

- Added B prefetch in bf16 B packing kernels

AMD-Internal: SWLCSG-3761
2025-08-26 14:50:01 +01:00
Vankadari, Meghana
7d42db73e5 Fixed out-of-bound memory accesses in F32 API (#161)
- Corrected the BF16 data handling in post-ops for F32 API.
- Verified and ensured that mask-loads are used wherever necessary.

AMD-Internal: CPUPL-7221
2025-08-26 17:53:13 +05:30
Vankadari, Meghana
5044b69d3d Bug fix in LPGEMV m=1 AVX2 kernel for post-ops
Details:
- Fixed loading of matadd and matmul pointers in the GEMV
  lt16 kernel for the AVX2 M=1 case.
- Hard-set the row stride of B to 1 (inside GEMV) when it has
  already been reordered.

AMD-Internal:CPUPL-7197, CPUPL-7221
Co-authored-by:Balasubramanian, Vignesh <Vignesh.Balasubramanian@amd.com>
2025-08-22 18:15:05 +05:30
S, Hari Govind
d29f3f0b5e Fix GCC 12+ instruction scheduling issue in complex scalv kernel (#149)
Replace fused multiply-add (FMA) intrinsics with explicit multiply and add/subtract operations in bli_cscalv_zen_int to resolve incorrect results with GCC 12 and later compilers.

The original code used register reuse pattern with _mm256_fmaddsub_ps() that causes GCC 12+ instruction scheduler to generate assembly with corrupted intermediate values due to register allocation conflicts. GCC 11 and earlier handled the same pattern correctly.

Changes:
- Replace _mm256_fmaddsub_ps() with _mm256_mul_ps() + _mm256_addsub_ps()
- Eliminate temp register reuse to fix instruction scheduling conflicts

AMD-Internal: [CPUPL-6445]
2025-08-22 14:23:43 +05:30
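The essence of the change can be shown in scalar form (the kernel itself uses AVX2 intrinsics; the helper here is illustrative): instead of a fused multiply-add/subtract that reused a temporary register, the products are computed first and the add/sub step is applied separately, matching the `_mm256_mul_ps` + `_mm256_addsub_ps` formulation for one complex element y = alpha * x.

```c
#include <assert.h>

/* Scalar view of complex scaling via separate multiply and add/sub,
 * mirroring _mm256_mul_ps + _mm256_addsub_ps lane behavior:
 * even lane subtracts, odd lane adds. */
static void cscal1(const float alpha[2], const float x[2], float y[2])
{
    /* products, as _mm256_mul_ps would produce them */
    float ar_x[2]  = { alpha[0] * x[0], alpha[0] * x[1] };
    float ai_xs[2] = { alpha[1] * x[1], alpha[1] * x[0] };
    /* addsub step */
    y[0] = ar_x[0] - ai_xs[0];   /* real: ar*xr - ai*xi */
    y[1] = ar_x[1] + ai_xs[1];   /* imag: ar*xi + ai*xr */
}
```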
Dave, Harsh
e39cf64708 Optimized avx512 ZGEMM kernel and edge-case handling (#147)
* Optimized avx512 ZGEMM kernel and edge-case handling
  Edge kernel implementation:
   - Refactored all of the zgemm kernels to process micro-tiles efficiently.
   - Specialized sub-kernels are added to handle the leftover m dimension:
     12MASK, 8, 8MASK, 4, 4MASK, 2.
   - 12MASK edge kernel handles 11, 10, 9 m_left using 2 full zmm
     load/store and 1 masked load/store.
   - Similarly 8MASK handles 7, 6, 5 m_left using 1 full zmm load/store and
     1 masked load/store.
   - 4MASK handles 3, 1 m_left using 1 masked load/store.

   - The ZGEMM kernel now internally decomposes the m dimension as follows.
     The main kernel is 12x4, with the following edge kernels to
     handle the leftover m dimension:
     edge kernels:
     12MASKx4 (handles 11x4, 10x4, 9x4)
     8x4      (handles 8x4)
     8MASKx4  (handles 7x4, 6x4, 5x4)
     4x4      (handles 4x4)
     4MASKx4  (handles 3x4, 1x4)
     2x4      (handles 2x4)

   - Similarly, it decomposes for the (12x3, 12x2 and 12x1) n_left kernels, under
     which the edge kernels 12MASKxN_LEFT(3, 2, 1), 8xN_LEFT(3, 2, 1),
     8MASKxN_LEFT(3, 2, 1), 4xN_LEFT(3, 2, 1), 4MASKxN_LEFT(3, 2, 1) and
     2xN_LEFT(3, 2, 1) handle the leftover m dimension.

  Threshold tuning:
   - Routed inputs with an odd m dimension to the avx512 kernels in the tiny
     path, as the avx2 kernels invoke gemv calls for m_left=1 (odd m
     dimension of the matrix). The gemv function call adds overhead for very
     small sizes and results in suboptimal performance.

   - A condition check "m%2 == 0" is added along with the threshold checks to
     force inputs with an odd m dimension to use the avx512 zgemm kernel.

   - Threshold change to route all of the inputs to the tiny path, eliminating
     the dependency on the avx2 zgemm_small path when the A, B matrix storage
     is 'N' (no transpose) or 'T' (transpose).

   - However, the tiny path re-uses the zgemm sup kernels, which do not
     support conjugate-transpose storage of matrices. For such storage of the
     A, B matrices we still rely on the avx2 zgemm_small kernel.

  gtest changes:
   - Removed the zgemm edge kernel functions (8x4, 4x4, 2x4 and fx4) and their
     respective testing instances from gtest.

AMD-Internal: [CPUPL-7203]

---------

Co-authored-by: harsdave <harsdave@amd.com>
2025-08-21 09:46:10 +05:30
Sharma, Shubham
805f36965d Added ability to handle non unit incx in GEMV transpose kernel. (#145)
- GEMV transpose kernels lack the ability to compute directly on non-unit-stride inputs.
- This limitation prevents libflame from using the blis kernel directly instead of going through the framework.
- Added the ability to handle non-unit incx in the kernel by packing x into a temporary buffer.

AMD-Internal: [CPUPL-6903]
2025-08-20 23:33:53 +05:30
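The packing idea in miniature (a hypothetical helper, not the BLIS API): gather the strided x into a contiguous temporary buffer so the unit-stride kernel can then run on it unchanged.

```c
#include <assert.h>

/* Copy n elements of x, read with stride incx, into a unit-stride
 * temporary buffer, as the GEMV transpose path does for non-unit incx. */
static void pack_x(int n, const double *x, int incx, double *buf)
{
    for (int i = 0; i < n; i++)
        buf[i] = x[i * incx];
}
```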
Smyth, Edward
509aa07785 Standardize Zen kernel names
Naming of Zen kernels and associated files was inconsistent with BLIS
conventions for other sub-configurations and between different Zen
generations. Other anomalies existed, e.g. dgemmsup 24x column
preferred kernels names with _rv_ instead of _cv_. This patch renames
kernels and file names to address these issues.

AMD-Internal: [CPUPL-6579]
2025-08-19 18:19:51 +01:00
Dave, Harsh
b88bea6e72 Optimize ZGEMM Packing Kernel for M-Dimension Edge Cases (cdim0 1–11) (#135)
* Optimize ZGEMM Packing Kernel for M-Dimension Edge Cases (cdim0 1–11)

- Introduced specialized AVX-512 assembly paths for cdim0 edge cases (1–11), replacing inefficient zscalv fallback.
- Refactored cdim0 == mnr condition into a switch statement to support multiple optimized cases.
- Added three new macros for column-stored packing with distinct masking patterns.
- Implemented 11 dedicated handlers for row and column stored A matrix packing
  with efficient masked loads/stores for partial data.

    AMD-Internal: [CPUPL-6677]

Co-authored-by: harsh dave <harsdave@amd.com>

* Update bli_packm_zen4_asm_z12xk.c

---------

Co-authored-by: harsh dave <harsdave@amd.com>
Co-authored-by: Sharma, Shubham <Shubham.Sharma3@amd.com>
2025-08-18 12:38:45 +05:30
Dave, Harsh
1b1b19486b Add packing support M edge cases in ZGEMM 12xk pack kernel (#89)
Previously, the ZGEMM implementation used `zscalv` for cases
where the M dimension of matrix A is not a multiple of 24,
resulting in a ~40% performance drop.

This commit introduces specialized edge-case handling in the pack kernel
to optimize performance for these cases.

The new packing support significantly improves performance.

- Removed reliance on `zscalv` for edge cases, addressing the
  performance bottleneck.

    AMD-Internal: [CPUPL-6677]

Co-authored-by: harsh dave <harsdave@amd.com>
2025-08-14 14:29:03 +05:30
Sharma, Arnav
76c4872718 GEMV support for S8S8S32O32 Symmetric Quantization
Introduced support for GEMV operations with group-level symmetric quantization for the S8S8S32OS32 API.

Framework Changes:
- Added macro definitions and function prototypes for GEMV with symmetric quantization in lpgemm_5loop_interface_apis.h and lpgemm_kernels.h.
  - LPGEMV_M_EQ1_KERN2 for the lpgemv_m_one_s8s8s32os32_sym_quant kernel, and
  - LPGEMV_N_EQ1_KERN2 for the lpgemv_n_one_s8s8s32os32_sym_quant kernel.
- Implemented the main GEMV framework for symmetric quantization in lpgemm_s8s8s32_sym_quant.c.

Kernel Changes:
- lpgemv_m_one_s8s8s32os32_sym_quant for handling the case where M = 1 and implemented in lpgemv_m_kernel_s8_grp_amd512vnni.c.
- lpgemv_n_one_s8s8s32os32_sym_quant for handling the case where N = 1 and implemented in lpgemv_n_kernel_s8_grp_amd512vnni.c.
- Updated the buffer reordering logic for group quantization for N=1 cases in aocl_gemm_s8s8s32os32_utils.c.

Notes
- Ensure that group_size is a factor of both K (and KC when K > KC).
- The B matrix must be provided in reordered format (mtag_b == REORDERED).

AMD-Internal: [SWLCSG-3604]
2025-08-14 13:41:25 +05:30
Sharma, Shubham
3a14417ce1 DGEMV BugFixes and code cleanup (#134)
- Modified gemv (matrix-vector multiply) reference for better handling of transpose flags.
- Modified Zen4 kernel implementations for better handling of transpose flags and vector stride (incy).
- The changes refine kernel selection logic and move variable definition in macro guards.
2025-08-14 12:54:06 +05:30
S, Hari Govind
9a7bacb30c Improve numerical precision in ZGEMV API (#130)
- Replaced separate real and imaginary accumulators (real_acc, imag_acc) with a column-wise accumulator array (row_acc[2]), making accumulation and updates to the target Y vector more direct, concise, and unified.

- Leveraged AVX-512 fused multiply-add/subtract operations (_mm512_fmaddsub_pd, _mm512_fmsubadd_pd) and efficient permutations (_mm512_permute_pd) to enable accurate and efficient computation of real and imaginary components in a single instruction, while reducing code complexity for both code paths.

- Removed redundant instructions (such as unnecessary permutations and zero-register operations) and simplified the control flow.

AMD-Internal: [CPUPL-7015]
2025-08-14 11:19:51 +05:30
Dave, Harsh
fa69528a3b Bugfix: Tuned zgemm threshold for zen4 (#129)
* Bugfix: Tuned zgemm threshold for zen4

Threshold tuning that determines whether SUP or native path should
be used for given input matrix size.

This tuning forces skinny matrices to take SUP path to ensure better
performance.

* Bugfix: Tuned zgemm threshold for zen4 and zen5

Threshold tuning that determines whether SUP or native path should
be used for given input matrix size.

This tuning forces skinny matrices to take SUP path to ensure better
performance.

---------

Co-authored-by: harsdave <harsdave@amd.com>
2025-08-13 19:02:39 +05:30
Sharma, Shubham
b7b9d3ec53 Exported AVX512 DGEMV kernels (#131)
- Exported DGEMV AVX512 kernels so that they can be directly called by libflame to avoid blis and omp overhead.
2025-08-12 17:17:41 +05:30
Sharma, Shubham
b0a4914417 Added DGEMV no transpose multithreaded Implementations (#12)
* Added DGEMV no transpose multithreaded Implementations
- Added new avx512 M and N kernels for DGEMV.
- Added multiple MT implementations for same kernels.
- Added AOCL_dynamic logic for L2 APIs.
- Tuned AOCL_dynamic and code path selection for DGEMV on ZEN5.
- Added same kernels for SGEMV, but these kernels are not enabled yet.
- Added SGEMV reference kernel.

AMD-Internal: [SWLCSG-3408]

Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
2025-08-12 10:39:12 +05:30
Vlachopoulou, Eleni
1f8a7d2218 Renaming CMAKE_SOURCE_DIR to PROJECT_SOURCE_DIR so that BLIS can be built properly via FetchContent() (#65) 2025-08-07 15:51:59 +01:00
Sharma, Shubham
6db8639284 Fix coverity issue in ZTRSM kernels (#112)
Fixed static analysis issues in ZTRSM (triangular solve with matrix) kernels for the Zen5 architecture by initializing variables to prevent potential use of uninitialized values.
Initialized loop variables i, j, and k_iter to 0 to prevent potential uninitialized access.
Initialized mask and remainder variables to 0 across multiple kernel functions.
2025-08-05 15:00:59 +05:30
Balasubramanian, Vignesh
c96e7eb197 Threshold tuning for code-paths and optimal thread selection for ZGEMM(ZEN5)
- Updated the thresholds to enter the AVX512 SUP codepath in
  ZGEMM (on ZEN5). This caters to inputs that scale well with
  multithreaded execution (in the SUP path).

- Also updated the thresholds to decide ideal threads, based on
  'm', 'n' and 'k' values. The thread-setting logic involves
  determining the number of tiles for computation, and using them
  to further tune for the optimal number of threads.

- This logic builds on the assumption that the current thread
  factorization logic is optimal. Thus, an additional data analysis
  was performed (on the existing ZEN4 and the new ZEN5 thresholds)
  to also cover the corner cases where this assumption doesn't hold
  true.

- As part of the future work, we could reimplement the thread
  factorization for GEMM, which would additionally require a new
  set of threshold tuning for every datatype.

AMD-Internal: [CPUPL-7028]

Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
2025-08-01 16:02:12 +05:30