Commit Graph

186 Commits

Author SHA1 Message Date
Edward Smyth
eee3fe1b54 GTestSuite: nested parallelism tests
Optionally enable parallelism inside gtestsuite so that we can
check BLIS functions perform correctly when nested parallelism
is in operation. Enable with:

  cmake ... -DOPENMP_NESTED={0,1,2,1diff}

where in gtestsuite
- 0 is the default choice with no parallelism.
- 1 and 2 are simple nested parallelism.
- 1diff has one level of parallelism setting different numbers
  of threads to be used by BLIS and reference library calls
  from each gtestsuite thread.

Note: OMP_NUM_THREADS must be set appropriately to enable or
      disable parallelism at each level in the test programs
      as desired.
      OMP_NUM_THREADS will also define the parallelism used
      within the BLIS library (if it is multithreaded), unless
      BLIS-specific ways of specifying parallelism have been
      used. If a BLIS-specific parallelism option has been set,
      the same mechanism will be used in the 1diff option to
      vary the number of threads in BLIS per application thread.

AMD-Internal: [CPUPL-3902]
Change-Id: I89f9edb4125c64ef03e025a9f6ccb84960ba8771
2025-02-07 08:49:25 -05:00
Edward Smyth
1f0fb05277 Code cleanup: Copyright notices (2)
More changes to standardize copyright formatting and correct years
for some files modified in recent commits.

AMD-Internal: [CPUPL-5895]
Change-Id: Ie95d599710c1e0605f14bbf71467ca5f5352af12
2025-02-07 05:41:44 -05:00
Arnav Sharma
5a4739d288 DGEMV NO_TRANSPOSE Optimizations and Unit Tests
- Added 32x3n n-biased kernels to directly handle the cases where n=3
  which were earlier being handled by the primary n-biased, 32x8n,
  kernel.
- Modified the n-biased fringe kernels to further handle the smaller
  m-fringe cases. Thus, now the kernels handle the following range of m
  for any value of n:
  - 16x8n     : m = [16, 31)
  - 8x8n      : m = [8, 15)
  - m_leftx8n : m = [1, 7]
- Updated the function pointer map for n-biased kernels with added
  granularity to invoke the smaller fringe cases directly on the basis
  of m-dimension.
- Added micro-kernel unit tests for all the dgemv_n kernels.

AMD-Internal: [CPUPL-6231]
Change-Id: Ibe88848c2c1bbb65b3e79fbc90a2800dc15f5119
2025-02-06 18:52:32 +05:30
Shubham Sharma
f8c83fedb6 Added new ZTRSM small code path for ZEN5
- Added new ZTRSM kernels for right and left variants.
- Kernel dimensions are 12x4.
- 12x4 ZGEMM SUP kernels are used internally
  for solving GEMM subproblem.
- These kernels do not support conjugate transpose.
- Only column major inputs are supported.
- Tuned thresholds to pick efficient code path for ZEN5.

AMD-Internal: [CPUPL-6356]
Change-Id: I33ba3d337b0fcd972ca9cfe4668cb23d2b279b6e
2025-02-06 18:01:10 +05:30
Hari Govind S
3d2653f1ab DDOTV Optimization for ZEN3 Architecture
- Reduced the blocking size of 'bli_ddotv_zen_int10'
  kernel from 40 elements to 20 elements for better
  utilization of vector registers

- Replaced redundant 'for' loops in 'bli_ddotv_zen_int10'
  kernel with 'if' conditions to handle remainder
  iterations, as only a single iteration is used when the
  remainder is less than the primary unroll factor.

- Added a conditional check to invoke the vectorized
  DDOTV kernels directly(fast-path), without incurring
  any additional framework overhead.

- The fast-path is taken when the input size is ideal
  for single-threaded execution. Thus, we avoid the
  call to bli_nthreads_l1() function to set the ideal
  number of threads.

- Updated gtestsuite ukr tests for 'bli_ddotv_zen_int10'
  kernel.
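
The remainder handling described above can be sketched in scalar C. This is
only an illustration under assumptions: the name `sketch_ddot`, the unroll
factor of 8 and the cleanup widths 4, 2 and 1 are hypothetical and do not
reflect the actual vectorized 'bli_ddotv_zen_int10' kernel. Because the
leftover count after the main loop is always smaller than the unroll factor,
each cleanup stage can run at most once, so an 'if' is sufficient.

```c
#include <stddef.h>

/* Hypothetical scalar sketch (not the BLIS kernel): dot product with an
   unrolled main loop and 'if'-based remainder handling. Unit strides are
   assumed for both vectors. */
double sketch_ddot(ptrdiff_t n, const double *x, const double *y)
{
    double acc0 = 0.0, acc1 = 0.0;
    ptrdiff_t i = 0;

    /* Main loop: 8 elements per iteration, two independent accumulators. */
    for (; i + 8 <= n; i += 8) {
        acc0 += x[i]     * y[i]     + x[i + 1] * y[i + 1]
              + x[i + 2] * y[i + 2] + x[i + 3] * y[i + 3];
        acc1 += x[i + 4] * y[i + 4] + x[i + 5] * y[i + 5]
              + x[i + 6] * y[i + 6] + x[i + 7] * y[i + 7];
    }

    /* Fewer than 8 elements remain here, so each stage runs at most once. */
    if (i + 4 <= n) {
        acc0 += x[i]     * y[i]     + x[i + 1] * y[i + 1]
              + x[i + 2] * y[i + 2] + x[i + 3] * y[i + 3];
        i += 4;
    }
    if (i + 2 <= n) {
        acc0 += x[i] * y[i] + x[i + 1] * y[i + 1];
        i += 2;
    }
    if (i < n) {
        acc0 += x[i] * y[i];
    }

    return acc0 + acc1;
}
```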

AMD-Internal: [CPUPL-4877]
Change-Id: If43f0fcff1c5b1563ad233005717398b5b6fb8f2
2025-02-04 06:01:04 -05:00
Edward Smyth
d9cabce0ba GTestSuite: Catch BLIS version executable errors
In case the executable to obtain the BLIS library version fails,
catch and report common errors to help with debugging.

Also correct the test for bli_info_get_info() support to mark
that it is not available in any AOCL version <= 4.1

AMD-Internal: [CPUPL-4500]
Change-Id: Ie8f728b49faa60e0469562dbf77d67f86b415cd8
2025-01-28 16:54:05 +05:30
Vignesh Balasubramanian
fb6dcc4edb Support for Tiny-GEMM interface(ZGEMM)
- As part of AOCL-BLAS, there exists a set of vectorized
  SUP kernels for GEMM, that are performant when invoked
  in a bare-metal fashion.

- Designed a macro-based interface for handling tiny
  sizes in GEMM, that would utilize these kernels. This
  is currently instantiated for 'Z' datatype(double-precision
  complex).

- Design breakdown :
  - Tiny path requires the usage of AVX2 and/or AVX512
    SUP kernels, based on the micro-architecture. The
    decision logic for invoking tiny-path is specific
    to the micro-architecture. These thresholds are defined
    in their respective configuration directories(header files).

  - List of AVX2/AVX512 SUP kernels(lookup table), and their
    lookup functions are defined in the base-architecture from
    which the support starts. Since we need to support backward
    compatibility when defining the lookup table/functions, they
    are present in the kernels folder(base-architecture).

- Defined a new type to be used to create the lookup table and its
  entries. This type holds the kernel pointer, blocking dimensions
  and the storage preference.

- This design would only require the appropriate thresholds and
  the associated lookup table to be defined for the other datatypes
  and micro-architecture support. Thus, it is extensible.

- NOTE : The SUP kernels that are listed for Tiny GEMM are m-var
         kernels. Thus, the blocking in framework is done accordingly.
         In case of adding the support for n-var, the variant
         information could be encoded in the object definition.

- Added test-cases to validate the interface for functionality(API
  level tests). Also added exception value tests, which have been
  disabled due to the SUP kernel optimizations.
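
A lookup-table entry of the kind described above (kernel pointer, blocking
dimensions, storage preference) might look roughly like the following C
sketch. All names, the placeholder kernel signature and the blocking values
are hypothetical and chosen for illustration; they are not the types or
values used in the library.

```c
#include <stddef.h>

/* Placeholder kernel signature; real SUP kernels take many more arguments. */
typedef void (*tiny_kernel_fn)(void);

typedef enum { TINY_PREF_ROW, TINY_PREF_COL } tiny_stor_pref;

/* Hypothetical table entry: kernel pointer, blocking dimensions and
   storage preference. */
typedef struct {
    tiny_kernel_fn kernel;
    int            mr;
    int            nr;
    tiny_stor_pref pref;
} tiny_gemm_entry;

enum { TINY_ISA_AVX2 = 0, TINY_ISA_AVX512 = 1, TINY_ISA_COUNT = 2 };

/* Stub kernels standing in for real AVX2/AVX512 SUP kernels. */
void stub_zgemm_avx2(void) {}
void stub_zgemm_avx512(void) {}

/* Blocking values and preferences below are placeholders only. */
static const tiny_gemm_entry tiny_zgemm_table[TINY_ISA_COUNT] = {
    [TINY_ISA_AVX2]   = { stub_zgemm_avx2,   4, 4, TINY_PREF_ROW },
    [TINY_ISA_AVX512] = { stub_zgemm_avx512, 8, 4, TINY_PREF_ROW },
};

/* Returns the table entry for the given ISA index, or NULL if the index
   is out of range. */
const tiny_gemm_entry *tiny_zgemm_lookup(int isa)
{
    if (isa < 0 || isa >= TINY_ISA_COUNT) return NULL;
    return &tiny_zgemm_table[isa];
}
```

Adding another datatype under this sketch would mean defining one more table
and its thresholds, which is the kind of extensibility the message describes.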

AMD-Internal: [CPUPL-6040][CPUPL-6018][CPUPL-5319][CPUPL-3799]
Change-Id: I84f734f8e683c90efa63f2fa79d2c03484e07956
2025-01-24 12:59:26 -05:00
Hari Govind S
106a2b1fe1 Gtestsuite: UKR test for GEMV kernels
-  Added support for gemv kernels unit test in gtestsuite.
-  Added micro-kernel tests and memory tests for DGEMV
   transpose case kernels.

AMD-Internal: [CPUPL-5835]
Change-Id: I7d2d3cdbfea436f6c9b2cce9f2e85bfc5c51f201
2025-01-24 05:09:33 -05:00
Vignesh Balasubramanian
a80436ab21 Standardizing the EVT compliance of {S/D}AMAXV API
- Updated the existing AVX2 {S/D}AMAXV kernels to comply
  with the standard when having exception values. This makes
  them exhibit the same behaviour as their AVX512 variants.
  Provided additional optimizations with loop unrolling.

- Removed redundant early return checks inside the kernels,
  since they have been abstracted to a higher layer.

- Updated the unit-tests(micro-kernel) and exception value
  tests for appropriate code-coverage. Also re-enabled the
  exception value tests.

AMD-Internal: [CPUPL-4745]
Change-Id: I36c793220bd4977a00281af9737c51cd1e5c60d9
2025-01-13 06:56:31 -05:00
Edward Smyth
0ae5a0492f GTestSuite: fix to ukr tests for dgemm avx512 8x24 kernels
- Restore test for old bli_dgemm_zen4_asm_8x24 kernel, so that
  we can test this if linking with older AOCL versions.
- Move K_bli_dgemm_avx512_asm_8x24 definition from AOCL_42 list
  to AOCL_50 list.

AMD-Internal: [CPUPL-4500]
Change-Id: Id522f4bc5b89e86f77c4e1d26c75e261736ab450
2025-01-10 12:33:15 -05:00
Vignesh Balasubramanian
cdaa2ac7fd Bugfix and optimizations for AVX512 AMAXV micro-kernels
- Bug : The current {S/D}AMAXV AVX512 kernels produced
  incorrect results for inputs with multiple absolute maximums.
  They returned the last index when having multiple occurrences,
  instead of the first one.

- Implemented a bug-fix to handle this issue on these AVX512
  kernels. Also ensured that the kernels are compliant with
  the standard when handling exception values.

- Further optimized the code by decoupling the logic to find
  the maximum element and its search space for index. This way,
  we use lower-latency instructions to compute the maximum
  first.

- Updated the unit-tests, exception value tests and early return
  tests for the API to ensure code-coverage.
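
The intended semantics (first index on ties, with the maximum search
decoupled from the index search) can be illustrated with a scalar C sketch.
The function name `ref_idamax_first` is hypothetical, this is not the AVX512
kernel, and exception values such as NaN are deliberately not handled here.

```c
#include <stddef.h>

/* Absolute value without relying on libm. */
static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Hypothetical scalar reference (not the BLIS kernel): returns the 0-based
   index of the first element with the largest absolute value, or -1 if
   n <= 0 or incx <= 0. NaN inputs are not handled. */
ptrdiff_t ref_idamax_first(ptrdiff_t n, const double *x, ptrdiff_t incx)
{
    if (n <= 0 || incx <= 0) return -1;

    /* Pass 1: compute only the maximum absolute value. */
    double max_abs = abs_d(x[0]);
    for (ptrdiff_t i = 1; i < n; ++i) {
        double v = abs_d(x[i * incx]);
        if (v > max_abs) max_abs = v;
    }

    /* Pass 2: search for the first index that holds this value. */
    for (ptrdiff_t i = 0; i < n; ++i) {
        if (abs_d(x[i * incx]) == max_abs) return i;
    }
    return -1; /* not reached for NaN-free input */
}
```

For x = {1, -3, 2, 3} the largest absolute value occurs at indices 1 and 3,
and the sketch returns 1.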

AMD-Internal: [CPUPL-4745]
Change-Id: I2f44d33dbaf89fe19e255af1f934877816940c6f
2025-01-07 22:56:20 +05:30
harsdave
54b46ec1ed Enhance 24x8 DGEMM SUP/Tiny Kernel Performance with Optimized Loops and Edge Kernels
This patch introduces comprehensive optimizations to the DGEMM kernel, focusing on loop
efficiency and edge kernel performance. The following technical improvements have been implemented:

1. **IR Loop Optimization:**
   - The IR loop has been re-implemented in hand-written assembly to eliminate the overhead associated
     with `begin_asm` and `end_asm` calls, resulting in more efficient execution.

2. **JR Loop Integration:**
   - The JR loop is now incorporated into the micro kernel. This integration avoids the repetitive overhead
     of stack frame management for each JR iteration, thereby enhancing loop performance.

3. **Kernel Decomposition Strategy:**
   - The m dimension is decomposed into specific sizes: 20, 18, 17, 16, 12, 11, 10, 9, 8, 4, 2, and 1.
   - For remaining cases, masked variants of edge kernels are utilized to handle the decomposition efficiently.

4. **Interleaved Scaling by Alpha:**
   - Scaling by the alpha factor is interleaved with load instructions to optimize the instruction pipeline
     and reduce latency.

5. **Efficient Mask Preparation:**
   - Masks are prepared within inline assembly code only at points where masked load-store operations are necessary,
     minimizing unnecessary overhead.

6. **Broadcast Instruction Optimization:**
   - In edge kernels where each FMA (Fused Multiply-Add) operation requires a broadcast without subsequent reuse,
     the broadcast instruction is replaced with `mem_1to8`.
   - This allows the compiler to optimize by assigning separate vector registers for broadcasting, thus avoiding
     dependency chains and improving execution efficiency.

7. **C Matrix Update Optimization:**
   - During the update of the C matrix in edge kernels, columns are pre-loaded into multiple vector registers.
     This approach breaks dependency chains during FMA operations following the scaling by alpha, thereby mitigating
     performance bottlenecks and enhancing throughput.

These optimizations collectively improve the performance of the DGEMM kernel, particularly in handling edge cases and
reducing overhead in critical loops. The changes are expected to yield significant performance gains in matrix multiplication
operations.

This patch also involves changes for the tiny gemm interface: a light
interface for calling kernels, and the removal of calls to avx2 dgemm kernels,
as we use avx512 dgemm kernels for all the sizes for zen4 and zen5.

For zen4 and zen5, when the A matrix is transposed (CRC, RRC), the tiny kernel
does not support such inputs, and thus they are routed to the
gemm_small path.

AMD-Internal: [CPUPL-6054]
Change-Id: I57b430f9969ca39aa111b54fa169e4225b900c4a
2024-12-13 00:03:00 -05:00
Arnav Sharma
25e59fcbb9 DGEMV Optimizations for NO_TRANSPOSE Cases
- AVX512 specific DGEMV native kernels are added for Zen4/5
  architectures to handle the NO_TRANSPOSE cases and are independent of
  the AXPYF fused kernels.
- The following set of kernels biased towards the n-dimension perform
  beta scaling of y vector within the kernel itself and handle cases
  where n is less than 5:
    - bli_dgemv_n_zen_int_32x8n_avx512( ... )
    - bli_dgemv_n_zen_int_32x4n_avx512( ... )
    - bli_dgemv_n_zen_int_32x2n_avx512( ... )
    - bli_dgemv_n_zen_int_32x1n_avx512( ... )
- The bli_dgemv_n_zen_int_16mx8_avx512( ... ) is biased towards the
  m-dimension and for this kernel beta scaling is handled beforehand
  within the framework.
- Added unit-tests for the new kernels.
- AVX2 path for Zen/2/3 architectures still follows the old approach of
  using fused kernel, namely AXPYF, to perform the GEMV operation.

AMD-Internal: [CPUPL-5560]
Change-Id: I22bc2a865cd28b9cdcb383e17d1ff38bdd28de79
2024-12-12 10:26:50 -05:00
Shubham Sharma
f2320a1fef Enabled DGEMM row major kernel for ZEN4
- Merged ZEN4 and ZEN5 DGEMM 8x24 kernel.
- Replaced 32x6 kernel with 8x24. Now same
  kernel is used for ZEN4 and ZEN5.
- Blocksizes have been tuned for genoa only.
- DGEMM kernel for DTRSM native code path
  is replaced with 8x24 kernel.
- Enabled alpha scaling during packing for ZEN4.
- ZEN4 8x24 kernel has been removed.

AMD-Internal: [CPUPL-5912]
Change-Id: I89a16a7e3355af037d21d453aabf53c5ecccb754
2024-11-29 08:18:48 +00:00
Edward Smyth
971c890fc6 GTestSuite: Select ukr tests by BLIS version
Add definitions in gtestsuite header to list available kernels
by AOCL BLIS version. Check these definitions in ukr test
programs to avoid missing symbol errors when testing with an
older version of BLIS.

Currently AOCL_41, AOCL_42, AOCL_50 and AOCL_DEV are supported,
with AOCL_DEV inferred from the version being later than the
value of AOCL_BLAS_LATEST_VERSION set in CMakeLists.txt. Thanks
to Eleni Vlachopoulou for the cmake functionality to automatically
detect the version from the library.

AMD-Internal: [CPUPL-4500]
Change-Id: I40ffd3d3789324fbb1dabfbf5e1dd4e0c94d54d9
2024-11-15 10:07:29 -05:00
Edward Smyth
0249f57022 GTestSuite: Correct blis_impl calls for gemm_compute
gemm_compute currently has differences in the interface to
the blis_impl layer compared to the top-level API. Modify
gtestsuite wrapper to account for this.

AMD-Internal: [CPUPL-4500]
Change-Id: Ie96c9ac3b23128ae8e03af34ad11e65910dec594
2024-11-12 06:57:59 -05:00
Edward Smyth
ef5cbf7c9a GTestSuite: More threshold changes
Various changes from testing code paths with both gcc and aocc.

AMD-Internal: [CPUPL-4378]
Change-Id: I8964d8ab4e1f5669026af606598c2eb3dfddde16
2024-11-11 12:34:05 -05:00
Vignesh Balasubramanian
06d776b025 AVX512 ZGEMM SUP Inner product kernels
- Implemented a set of column preferential dot-product based
  ZGEMM kernels(main and fringe) in AVX512(for SUP code-path).
  These kernels perform matrix multiplication as a sequence
  of inner products(i.e, dot-products).

- These standalone kernels are expected to strictly handle
  the CRC storage scheme for C, A and B matrices. RRC is also
  supported through operation transpose, at the framework
  level.

- Added unit-tests to test all the kernels(main and fringe),
  as well as the redirection between these kernels.
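
To illustrate why a dot-product formulation suits the CRC scheme, here is a
scalar C sketch: with C and B column-stored and A row-stored, row i of A and
column j of B are both contiguous in k. The names `zcomplex` and
`sketch_zgemm_crc` are hypothetical, alpha and beta are both assumed to be 1,
and none of this reflects the actual AVX512 kernels.

```c
#include <stddef.h>

/* Hypothetical complex type for this sketch. */
typedef struct { double re, im; } zcomplex;

/* Computes C += A * B as a sequence of inner products, assuming:
   C is m x n, column-major with leading dimension ldc;
   A is m x k, row-major    with leading dimension lda;
   B is k x n, column-major with leading dimension ldb. */
void sketch_zgemm_crc(ptrdiff_t m, ptrdiff_t n, ptrdiff_t k,
                      const zcomplex *a, ptrdiff_t lda,
                      const zcomplex *b, ptrdiff_t ldb,
                      zcomplex *c, ptrdiff_t ldc)
{
    for (ptrdiff_t j = 0; j < n; ++j) {
        for (ptrdiff_t i = 0; i < m; ++i) {
            const zcomplex *arow = a + i * lda; /* row i of A, contiguous */
            const zcomplex *bcol = b + j * ldb; /* column j of B, contiguous */
            zcomplex acc = { 0.0, 0.0 };

            /* Inner product over the k dimension. */
            for (ptrdiff_t p = 0; p < k; ++p) {
                acc.re += arow[p].re * bcol[p].re - arow[p].im * bcol[p].im;
                acc.im += arow[p].re * bcol[p].im + arow[p].im * bcol[p].re;
            }

            c[i + j * ldc].re += acc.re;
            c[i + j * ldc].im += acc.im;
        }
    }
}
```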

AMD-Internal: [CPUPL-5949]
Change-Id: I858257ac2658ed9ce4980635874baa1474b79c38
2024-11-06 04:18:57 -05:00
Edward Smyth
9ce2696fc9 GTestSuite: Fix builds testing against MKL
Correction to CMakeLists.txt to fix problem building executables
when testing against MKL.

AMD-Internal: [CPUPL-5928]
Change-Id: Ie427fff0afb48be6ce6d940b1db2c9d1c7a40e5b
2024-10-29 06:32:27 -04:00
Edward Smyth
cffb501e00 GTestSuite: ILP64 build fix
Cast literal 0 to match integer size in std::max tests.

AMD-Internal: [CPUPL-4500]
Change-Id: I330aafd8669884c5e1900b95742b5d1e4ce8ddfa
2024-10-29 06:10:49 -04:00
Eleni Vlachopoulou
d6a411d6b6 GTestSuite: Reorganizing some tests
- Breaking tests to smaller executables.
- Removing some redundant tests.

AMD-Internal: [CPUPL-4500]
Change-Id: I6288c3fcf5194ccb5de3485ca1ad95a20414208c
2024-10-02 11:48:18 -04:00
Eleni Vlachopoulou
72536e56ba GTestSuite: Reducing gemm tests.
Since there is thorough kernel testing, we reduce the number of "Black Box" test cases so that CI is faster.

AMD-Internal: [CPUPL-4500]
Change-Id: Ie57eeccff8103c0051eb1904162d6447da0ef102
2024-09-19 12:17:20 -04:00
Edward Smyth
6330ac6a52 GTestSuite: Misc changes
- Correct matsize and NumericalComparison functions for
  tests with first matrix dimension <= 0.
- BLAS1:
  - Fix for BLAS vs CBLAS differences in amaxv IIT_ERS tests.
  - Threshold adjustments in ddotxf and zaxpy.
  - Break axpyv and scalv into separate executables for
    each data type.
- BLAS2:
  - Threshold adjustments in symv and hemv.
  - Break ger into separate executables for each data type.
- UKR:
  - Break gemm and trsm ukr test into separate executables
    for each data type.
  - Threshold adjustments in daxpyf
  - Disable {z,c}trsm ukr tests when BLIS_INT_ELEMENT_TYPE
    is used, as matrix generator is not currently suitable
    for this.

AMD-Internal: [CPUPL-4500]
Change-Id: I1d9e7acc11025f1478b8b511c14def5517ef0ae6
2024-09-19 10:17:36 -04:00
Eleni Vlachopoulou
c7a5d04d4d GTestSuite: Disabling failing tests.
Those can be run if --gtest_also_run_disabled_tests is used.
Bugs will be addressed and resolved in the future.

AMD-Internal: [CPUPL-4500]
Change-Id: I7a5443606ea8ef20f18ff8beec14bece5f6ee661
2024-09-18 13:12:35 +01:00
Edward Smyth
54f8fb951e GTestSuite: BLAS2 test case selection
Various changes to BLAS2 test cases:
- GEMV: Reduce number of tests to make runtime more reasonable.
- TRSV:
  - Standardize tests across different data types, including
    adding memory testing for all variants.
  - Improve scaling when making matrix A diagonally dominant and
    avoid singular matrix when BLIS_INT_ELEMENT_TYPE is used.
- TRMV: Copy TRSV generic tests.
- Expand set of tests for HEMV, HER, HER2, SYMV, SYR, SYR2 and
  make lda contribution to test names consistent with other
  routines.
- Various adjustments to thresholds added.

Update gtestsuite documentation to describe using GTEST_FILTER
environment variable to select tests to run or exclude. This
works particularly well when using ctest, as we do not enumerate
all the tests at this level and so need to pass the selection
down to the individual executables.

AMD-Internal: [CPUPL-4500]
Change-Id: Ifcb6410455b7f91e58b555f94b9fd7920d7ad9d9
2024-09-17 09:35:29 -04:00
Edward Smyth
61c6f1ad78 GTestSuite: Fix alpha and beta input argument tests
Check if alpha and beta are null before testing values. This
avoids possible seg faults if alpha or beta have not been
defined in IIT tests.

AMD-Internal: [CPUPL-4500]
Change-Id: Ibbf2d6a8fb38d9a95033f3fec3d06c3441e98689
2024-09-17 09:00:09 -04:00
Edward Smyth
8d4881c4fd GTestSuite: add option to test blis_impl layer
Add BLAS_TEST_IMPL option for TEST_INTERFACE to test the
wrapper layer underneath BLAS and CBLAS interfaces. This is
particularly useful if building a BLIS library with these
interfaces disabled, e.g.

   ./configure --disable-blas amdzen
or
   cmake . -DENABLE_BLAS=OFF -DBLIS_CONFIG_FAMILY=amdzen

The ?_blis_impl wrappers should have the same arguments as the
BLAS interfaces, thus we define TEST_BLAS_LIKE as an additional
definition for convenience when selecting tests and options in
the C++ files.

AMD-Internal: [CPUPL-5650]
Change-Id: I0275a387563f3efc2b40029950c8569956f2df7b
2024-09-16 09:53:56 -04:00
Edward Smyth
a07e041b1f SCALV alpha=zero BLAS compliance
SCALV is used directly by BLAS, CBLAS and BLIS scal{v} APIs but
also within many other APIs to handle special cases. In general
it is preferred to use SETV when alpha=0, but BLAS and CBLAS
continue to multiply all vector elements by alpha. This has
different behaviour for propagating NaNs or Infs.

Changes in this commit:
- Standardize early returns from SCALV reference and optimized
  kernels.
- User supplied N<0 is handled at the top level API layer. Use
  negative values of N in kernel calls to signify that SETV
  should _not_ be used when alpha=0. This should only be
  required in SCALV.
- Include serial threshold in zdscal (as in dscal) to reduce
  overhead for small problem sizes.
- Code tidying to make different variants more consistent.
- More standardization of tests in SCALV gtestsuite programs.
- Remove scalv_extreme_cases.cpp as it is now redundant.
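
The difference between the two alpha=0 behaviours, and the negative-N
convention, can be illustrated with a scalar C sketch. The function name
`sketch_dscalv` is hypothetical, unit stride is assumed, and this is not the
library's kernel: a non-negative n allows the SETV-style shortcut, while a
negative n means "scale |n| elements and always multiply".

```c
#include <stddef.h>

/* Hypothetical sketch (not the BLIS kernel) of x := alpha * x.
   n >= 0 : alpha == 0 overwrites x with zeros (SETV-like), so NaN and Inf
            entries are cleared.
   n <  0 : |n| elements are always multiplied by alpha (BLAS-like), so
            0 * NaN and 0 * Inf both yield NaN. */
void sketch_dscalv(ptrdiff_t n, double alpha, double *x)
{
    int       allow_setv = (n >= 0);
    ptrdiff_t len        = (n < 0) ? -n : n;

    if (len == 0) return;

    if (alpha == 0.0 && allow_setv) {
        for (ptrdiff_t i = 0; i < len; ++i) x[i] = 0.0;
        return;
    }

    for (ptrdiff_t i = 0; i < len; ++i) x[i] *= alpha;
}
```

With x = {Inf, 2} and alpha = 0, calling with n = 2 gives {0, 0}, whereas
n = -2 gives {NaN, 0}.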

AMD-Internal: [CPUPL-4415]
Change-Id: I42e98875ceaea224cc98d0cdfe0133c9abc3edae
2024-09-16 07:10:28 -04:00
Edward Smyth
3a6d367f9c GTestSuite: Fix TRSM ukr tests in non-zen builds
Add guards around bli_trsm_small kernel tests to only call them
if BLIS_ENABLE_SMALL_MATRIX_TRSM is defined. This fixes missing
symbol errors in tests of non-zen builds, e.g. generic or skx.

AMD-Internal: [CPUPL-4500]
Change-Id: I7a822a41b5f686b5e38b0c63dd1871963e990407
2024-08-21 07:45:06 -04:00
Chandrashekara K R
545f9ee44e CMake: Updated the minimum supported cmake version to 3.22.0 to maintain uniformity across all AOCL libraries.
AMD Internal : [CPUPL-5616]

Change-Id: Ic53532ff9883b1bba39e859ea2523c20c1ac383b
2024-08-21 12:09:24 +05:30
Vignesh Balasubramanian
93631410a3 Bugfix : Fixed memory accesses in AVX512 SGEMMSUP RD kernels
- Bug: Among the list of AVX512 SGEMMSUP RD kernels, the ones handling
  m_fringe = 3 had incorrect usage of ZMM on a vector-load instruction
  that strictly needed YMMs.

- Further updated the existing micro-kernel test cases to simulate
  these issues and validate the fix.

AMD-Internal: [CPUPL-5353]
Change-Id: Id86e60ce36bb9f8433a1a203cfe0b8c6347df2c1
2024-08-19 17:18:31 +05:30
Arnav Sharma
a67c8f05fb Gtestsuite: Fix for GEMM_COMPUTE IIT_ERS Test
- The IIT_ERS test for GEMM_COMPUTE where alpha = 0 and beta = 0 was
  failing since neither of the matrices was being packed and thus,
  missing the scaling by alpha resulting in a non-zero output for C
  matrix (C := A * B).

- Enabled packing of A matrix for the ZeroAlpha_ZeroBeta IIT_ERS test
  which handles the alpha scaling.

AMD-Internal: [CPUPL-5598]
Change-Id: Id9179ec6150d1bc5a0274edce727ce6cc4172213
2024-08-13 17:24:27 +05:30
Edward Smyth
7fff7b4026 Code cleanup: Miscellaneous fixes
- Delete unused cmake files.
- Add guards around call to bli_cpuid_is_avx2fma3_supported
  in frame/3/bli_l3_sup.c, currently assumes that non-x86
  platforms will not use bli_gemmtsup.
- Correct variable in frame/base/bli_arch.c on non-x86
  builds.
- Add guards around omp pragma to avoid possible gcc
  compiler warning in kernels/zen/2/bli_gemv_zen_int_4.c.
- Add missing registers in clobber list in
  kernels/zen4/1/bli_dotv_zen_int_avx512.c.
- Add gtestsuite ERS_IIT tests for TRMV, copied from TRSV.
- Correct calls to cblas_{c,z}swap in gtestsuite.
- Correct test name in ddotxf gtestsuite program.

AMD-Internal: [CPUPL-4415]
Change-Id: I69ad56390017676cc609b4d3aba3244a2df6a6b5
2024-08-06 06:56:01 -04:00
Edward Smyth
89f52a6df5 Code cleanup: spelling corrections
Corrections for spelling and other mistakes in code comments
and doc files.

AMD-Internal: [CPUPL-4500]
Change-Id: I33e28932b0e26bbed850c55602dee12fd002da7f
2024-08-05 16:18:51 -04:00
Edward Smyth
b964308e50 GTestSuite: option to check input arguments
Add tests to check input arguments have not been modified by BLIS
routine. These tests add a large runtime overhead, so they are
disabled by default. To enable them, configure gtestsuite with:

    cmake -DTEST_INPUT_ARGS=ON ...

and run desired tests as normal.

Also:
- Correct testinghelpers::chktrans to handle upper case values of
  argument trns.
- Change testinghelpers::matsize to return size 0 if m, n or
  leading dimension are 0, or if leading dimension is too small.
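
The matsize rule above can be sketched in C for the column-major case only.
The name and signature `sketch_matsize_colmajor` are hypothetical and are not
the testinghelpers::matsize interface; non-positive values are treated like 0
here as an assumption.

```c
#include <stddef.h>

/* Hypothetical helper: number of elements needed to store an m x n
   column-major matrix with leading dimension ld. Returns 0 if any
   dimension is not positive or if ld is smaller than m. */
size_t sketch_matsize_colmajor(ptrdiff_t m, ptrdiff_t n, ptrdiff_t ld)
{
    if (m <= 0 || n <= 0 || ld <= 0) return 0;
    if (ld < m) return 0; /* leading dimension too small */
    return (size_t)ld * (size_t)n;
}
```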

AMD-Internal: [CPUPL-4379]
Change-Id: I9494af800f9383195272ce99f622104a38fd0ed8
2024-08-05 09:58:17 -04:00
Edward Smyth
6393cb9d7c GTestSuite: misc corrections 3
- Set threshold to epsilon for early return cases where we are just
  scaling a matrix.
- Add this threshold to IIT_ERS files for appropriate tests.
- In IIT_ERS for gemm_compute, remove tests on null A and B when
  we are expecting to set or scale C. More thought is required
  in gemm_compute tests to handle these cases and look at cases
  where A or B has been packed.

AMD-Internal: [CPUPL-4500]
Change-Id: Ia649cc340ca1df6511388f9c43a31e53296cb2bf
2024-08-05 09:31:18 -04:00
Arnav Sharma
0a5c057475 DGEMV Optimizations for Tiny Sizes
- Added reference kernel for dgemv that handles computation for tiny
  sizes (m < 8 && n < 8).

- The reference kernel, bli_dgemv_zen_ref( ... ), supports both
  row/column storage schemes as well as transpose and no transpose
  cases.

- Added additional unit-tests for functional verification.
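
A reference-style GEMV that covers both storage schemes and both transpose
settings can be written with general row and column strides. The following C
sketch is an assumption-laden illustration, not bli_dgemv_zen_ref itself: the
name `sketch_dgemv_ref`, its signature and the unit strides for x and y are
hypothetical.

```c
#include <stddef.h>

/* Hypothetical sketch of y := beta*y + alpha*op(A)*x for an m x n matrix A
   with row stride rs and column stride cs (rs = 1 for column-major,
   cs = 1 for row-major). trans != 0 selects op(A) = A^T. */
void sketch_dgemv_ref(int trans, ptrdiff_t m, ptrdiff_t n, double alpha,
                      const double *a, ptrdiff_t rs, ptrdiff_t cs,
                      const double *x, double beta, double *y)
{
    /* For the transpose case, swap the dimensions and strides so the loop
       below always walks a (rows x cols) view of op(A). */
    ptrdiff_t rows = trans ? n : m;
    ptrdiff_t cols = trans ? m : n;
    ptrdiff_t rstr = trans ? cs : rs;
    ptrdiff_t cstr = trans ? rs : cs;

    for (ptrdiff_t i = 0; i < rows; ++i) {
        double dot = 0.0;
        for (ptrdiff_t j = 0; j < cols; ++j)
            dot += a[i * rstr + j * cstr] * x[j];

        /* beta == 0 overwrites y instead of scaling its old contents. */
        y[i] = (beta == 0.0 ? 0.0 : beta * y[i]) + alpha * dot;
    }
}
```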

AMD-Internal: [CPUPL-5098]
Change-Id: I66fdf0a40e90bdb3fed40152c45ab28a17a87ada
2024-08-05 12:19:42 +05:30
Ruchika Ashtankar
bdb94fb218 GTestSuite: Added tests for DGEMM SUP kernel
- Added dgemmGenericSUP test for the new 24x8 DGEMM SUP kernel for zen5.

AMD-Internal: [CPUPL-4404]
Change-Id: I150ca310655a495bdcf5ea9d5a16746483a17b68
2024-08-02 11:37:29 -04:00
Edward Smyth
75f21182bd GTestSuite: IIT and ERS test improvements
Various improvements:
- Where appropriate, test both:
  - with nullptr for suitable arguments that should never
    be touched.
  - with all arguments correct except the one we want to
    test, to check we are not returning early because
    another argument is a nullptr.
- Test incorrect values for order argument in CBLAS calls.
- Test early exits with limited data changes, e.g. set
  C to 0 or scale C in GEMM when alpha = 0.
- Bugfix in gemmt test when alpha is 0 and beta is 1.
- Use reference library gemmt for comparison when library
  is not netlib BLAS.

AMD-Internal: [CPUPL-4500]
Change-Id: Ibde7eaba5a484a87674044ca44855c6f6ee4ff4b
2024-07-31 15:36:01 -04:00
Edward Smyth
b90e12dfa4 GTestSuite: copyright notice
Standardize format of copyright notice.

AMD-Internal: [CPUPL-4500]
Change-Id: I6bde64c15ff639492dd0de95423c660112a37e2c
2024-07-26 15:34:41 -04:00
Edward Smyth
ea286cf6f6 GTestSuite: whitespace at end of lines
Unnecessary whitespace (spaces, tabs) at the end of lines
has been removed.

AMD-Internal: [CPUPL-4500]
Change-Id: Ice5f5504232cb22460c14ac47e6a3a43309cba22
2024-07-26 12:12:56 -04:00
Edward Smyth
4183efa722 GTestSuite: No newline at end of file
Add missing newline at the end of these files.

AMD-Internal: [CPUPL-4500]
Change-Id: I835cc73de0008b66ae3cf77fbb3daa1c8fcaaa7f
2024-07-26 11:42:57 -04:00
Edward Smyth
46fe3f3dcb GTestSuite: dos2unix file conversion
Source and other files in some directories were a mixture of
Unix and DOS file formats. Convert all relevant files to Unix
format for consistency.

AMD-Internal: [CPUPL-4500]
Change-Id: Ia3e479643b0bed4ae8a9107bde6e2cddf32d5bd8
2024-07-26 11:09:06 -04:00
Arnav Sharma
9583ee2e23 DGEMV Optimizations for NO_TRANSPOSE cases
- Enabled AVX512 DAXPYF kernels for DGEMV var2 for NO_TRANSPOSE cases.

- Added DAXPYF kernels with fuse factors of 2, 4, 6 and 16.

- Added a wrapper for DAXPYF kernels for redirection to kernels with a
  smaller fuse factor than 32.

- Also added UKR tests for the new fused kernels.

AMD-Internal: [CPUPL-5098]
Change-Id: I0b102b67c6c068873393bac0494284f379c253f2
2024-07-24 15:59:36 +05:30
Vignesh Balasubramanian
b48e864e82 AVX512 optimizations for DAXPBYV API
- Implemented AVX512 computational kernel for DAXPBYV
  with optimal unrolling. Further implemented the other
  missing kernels that would be required to decompose
  the computation in special cases, namely the AVX512
  DADDV and DSCAL2V kernels.

- Updated the zen4 and zen5 contexts to ensure any query
  to acquire the kernel pointer for DAXPBYV returns the
  address of the new kernel.

- Added micro-kernel unit tests to GTestsuite to check
  for functionality and out-of-bounds reads and writes.

AMD-Internal: [CPUPL-5406][CPUPL-5421]
Change-Id: I127ab21174ddd9e6de2c30a320e62a8b042cbde6
2024-07-22 11:32:19 +05:30
Arnav Sharma
4aa66f108e Added CSCALV AVX512 Kernel
- Added CSCALV kernel utilizing the AVX512 ISA.

- Added function pointers for the same to zen4 and zen5 contexts.

- Updated the BLAS interface to invoke respective CSCALV kernels based
  on the architecture.

- Added UKR tests for bli_cscalv_zen_int_avx512( ... ).

AMD-Internal: [CPUPL-5299]
Change-Id: I189d87a1ec1a6e30c16e05582dcb57a8510a27f3
2024-07-15 07:17:43 -04:00
vignbala
236d092656 AVX512 optimizations for ZGEMM to handle k = 1 cases
- Implemented bli_zgemm_16x4_avx512_k1_nn( ... ) AVX512 kernel to
  be used as part of BLAS/CBLAS calls to ZGEMM. The kernel is built
  for handling the GEMM computation with inputs having k = 1,
  with the transpose values being N(for column-major) and T(for
  row-major).

- Updated the zgemm_blis_impl( ... ) layer to query the architecture
  ID and invoke the AVX2 or AVX512 kernel accordingly.

- Added API level tests for accuracy and code-coverage, as well as
  micro-kernel tests for verifying functionality and out-of-bounds
  memory accesses.

AMD-Internal: [CPUPL-5249]
Change-Id: Id1f8bebff3e0da83c7febe86299564fd658b2e84
2024-07-09 07:07:24 -04:00
Vignesh Balasubramanian
02da190560 AVX512 optimizations for DNRM2
- Implemented bli_dnorm2fv_unb_var1_avx512( ... ) AVX512
  computational kernel for DNRM2 API.

- Updated the header to include this kernel signature, as well
  as the framework layer to use this function in case of ZEN4
  and ZEN5 configurations.

- Updated the tipping points for ideal thread setting in DNRM2
  for ZEN5 micro-architecture. These thresholds are specific
  to the library's linkage to LLVM's OpenMP or GNU's OpenMP.

- Further abstracted the AOCL-DYNAMIC logic to separate functions
  for ?NRM2 APIs that currently support it(namely, DNRM2 and ZNRM2).

- Further updated the ?NRM2 framework to accommodate the necessary
  changes to invoke the newer AOCL-DYNAMIC functions and the AVX512
  kernel, when needed.

- Added micro-kernel and memory tests for this kernel in GTestsuite,
  to validate accuracy and out-of-bounds read and write.

AMD-Internal: [CPUPL-5265]
Change-Id: I4fc0d0f1e6906bf27d46562ca387c338cc4d2049
2024-06-24 08:50:36 -04:00
Vignesh Balasubramanian
6165001658 Bugfix and optimizations for ?AXPBYV API
- Updated the existing code-path for ?AXPBYV to
  reroute the inputs to the appropriate L1 kernel,
  based on the alpha and beta value. This is done
  in order to utilize sensible optimizations with
  regards to the compute and memory operations.

- Updated the typed API interface for ?AXPBYV to include
  an early exit condition(when n is 0, or when alpha is
  0 and beta is 1). Further updated this layer to query
  the right kernel from context, based on the input values
  of alpha and beta.

- Added the necessary L1 vector kernels(i.e, ?SETV, ?ADDV,
  ?SCALV, ?SCAL2V and ?COPYV) to be used as part of special
  case handling in ?AXPBYV.

- Moved the early return with negative increments from ?SCAL2V
  kernels to its typed API interface.

- Updated the zen, zen2 and zen3 context to include function
  pointers for all these vector kernels.

- Updated the existing ?AXPBYV vector kernels to handle only
  the required computation. Additional cleanup was done to
  these kernels.

- Added accuracy and memory tests for AVX2 kernels of ?SETV
  ?COPYV, ?ADDV, ?SCALV, ?SCAL2V, ?AXPYV and ?AXPBYV APIs

- Updated the existing thresholds in ?AXPBYV tests for complex
  types. This is due to the fact that every complex multiplication
  involves two mul ops and one add op. Further added test-cases
  for API level accuracy check, that includes special cases of
  alpha and beta.

- Decomposed the reference call to ?AXPBYV with several other
  L1 BLAS APIs(in case of the reference not supporting its own
  ?AXPBYV API). The decomposition is done to match the exact
  operations that are done in BLIS based on alpha and/or beta
  values. This ensures that we test for our own compliance.
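
The rerouting by alpha and beta could look roughly like the C sketch below
for y := beta*y + alpha*x. Only the early exit (n is 0, or alpha is 0 and
beta is 1) is stated in this message; the rest of the mapping, the enum and
the function name `pick_axpby_path` are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical set of code paths for y := beta*y + alpha*x. */
typedef enum {
    PATH_NOOP,   /* nothing to do                 */
    PATH_SETV,   /* y := 0                        */
    PATH_SCALV,  /* y := beta * y                 */
    PATH_COPYV,  /* y := x                        */
    PATH_SCAL2V, /* y := alpha * x                */
    PATH_ADDV,   /* y := y + x                    */
    PATH_AXPYV,  /* y := y + alpha * x            */
    PATH_AXPBYV  /* y := beta * y + alpha * x     */
} axpby_path;

/* Assumed mapping from (alpha, beta) to the cheapest equivalent kernel. */
axpby_path pick_axpby_path(ptrdiff_t n, double alpha, double beta)
{
    if (n <= 0) return PATH_NOOP;

    if (alpha == 0.0) {
        if (beta == 1.0) return PATH_NOOP;
        if (beta == 0.0) return PATH_SETV;
        return PATH_SCALV;
    }
    if (beta == 0.0) return (alpha == 1.0) ? PATH_COPYV : PATH_SCAL2V;
    if (beta == 1.0) return (alpha == 1.0) ? PATH_ADDV : PATH_AXPYV;

    return PATH_AXPBYV;
}
```

A reference decomposition in the tests would mirror the same branches, so
that both sides perform the same floating-point operations for each special
case.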

AMD-Internal: [CPUPL-4861]
Change-Id: Ia6d48f12f059f52b31c0bef6c75f47fd364952c6
2024-06-20 16:22:07 +05:30
Mangala V
90fe795c46 Gtestsuite: Enabled memory test for ZGEMM for k=0
AMD_Internal: [CPUPL-4657]

Change-Id: Ic5f4d24184f05e0f57634845b4fb3312b3a416f6
2024-06-20 02:51:47 -04:00