Commit Graph

350 Commits

Author SHA1 Message Date
Shubham Sharma
f8c83fedb6 Added new ZTRSM small code path for ZEN5
- Added new ZTRSM kernels for right and left variants.
- Kernel dimensions are 12x4.
- 12x4 ZGEMM SUP kernels are used internally
  for solving GEMM subproblem.
- These kernels do not support conjugate transpose.
- Only column major inputs are supported.
- Tuned thresholds to pick efficent code path for ZEN5.

AMD-Internal: [CPUPL-6356]
Change-Id: I33ba3d337b0fcd972ca9cfe4668cb23d2b279b6e
2025-02-06 18:01:10 +05:30
Hari Govind S
fe73445813 Introduced fast-path in DCOPYV API and fix compiler warning for AXPYV
- Added a conditional check to invoke the vectorized
  DCOPYV kernels directly(fast-path), without incurring
  any additional framework overhead.

- The fast-path is taken when the input size is ideal for
  single-threaded execution. Thus, we avoid the call to
  bli_nthreads_l1() function to set the ideal number of threads.

- Used macros to protect the declaration of fast_path_thresh in
  DAXPYV API to avoid compiler warnings.

AMD-Internal: [CPUPL-4875][CPUPL-5895]
Change-Id: Id4141cd22e2382ece9e36fc02934bf6c11bd02cb
2025-02-05 04:41:55 -05:00
Hari Govind S
3d2653f1ab DDOTV Optimization for ZEN3 Architecture
- Reduced the blocking size of 'bli_ddotv_zen_int10'
  kernel from 40 elements to 20 elements for better
  utilization of vector registers

- Replaced redundant 'for' loops in 'bli_ddotv_zen_int10'
  kernel with 'if' conditions to handle reminder
  iterations. As only a single iteration is used when
  reminder is less than the primary unroll factor.

- Added a conditional check to invoke the vectorized
  DDOTV kernels directly(fast-path), without incurring
  any additional framework overhead.

- The fast-path is taken when the input size is ideal
  for single-threaded execution. Thus, we avoid the
  call to bli_nthreads_l1() function to set the ideal
  number of threads.

- Updated getestsuite ukr tests for 'bli_ddotv_zen_int10'
  kernel.

AMD-Internal: [CPUPL-4877]
Change-Id: If43f0fcff1c5b1563ad233005717398b5b6fb8f2
2025-02-04 06:01:04 -05:00
Shubham Sharma
9fd2aebd25 Tuned DTRSM thresholds for ZEN5
- Tuned AOCL_dynamic thresholds for DTRSM for ZEN5.
- Tuned thresholds for better selection of code path.

AMD-Internal: [CPUPL-5408]
Change-Id: Ic40b5c8d276c8ce8399fd49ce0d0569f79ec98be
2025-01-28 07:33:10 -05:00
Vignesh Balasubramanian
fb6dcc4edb Support for Tiny-GEMM interface(ZGEMM)
- As part of AOCL-BLAS, there exists a set of vectorized
  SUP kernels for GEMM, that are performant when invoked
  in a bare-metal fashion.

- Designed a macro-based interface for handling tiny
  sizes in GEMM, that would utilize there kernels. This
  is currently instantiated for 'Z' datatype(double-precision
  complex).

- Design breakdown :
  - Tiny path requires the usage of AVX2 and/or AVX512
    SUP kernels, based on the micro-architecture. The
    decision logic for invoking tiny-path is specific
    to the micro-architecture. These thresholds are defined
    in their respective configuration directories(header files).

  - List of AVX2/AVX512 SUP kernels(lookup table), and their
    lookup functions are defined in the base-architecture from
    which the support starts. Since we need to support backward
    compatibility when defining the lookup table/functions, they
    are present in the kernels folder(base-architecture).

- Defined a new type to be used to create the lookup table and its
  entries. This type holds the kernel pointer, blocking dimensions
  and the storage preference.

- This design would only require the appropriate thresholds and
  the associated lookup table to be defined for the other datatypes
  and micro-architecture support. Thus, is it extensible.

- NOTE : The SUP kernels that are listed for Tiny GEMM are m-var
         kernels. Thus, the blocking in framework is done accordingly.
         In case of adding the support for n-var, the variant
         information could be encoded in the object definition.

- Added test-cases to validate the interface for functionality(API
  level tests). Also added exception value tests, which have been
  disabled due to the SUP kernel optimizations.

AMD-Internal: [CPUPL-6040][CPUPL-6018][CPUPL-5319][CPUPL-3799]
Change-Id: I84f734f8e683c90efa63f2fa79d2c03484e07956
2025-01-24 12:59:26 -05:00
Hari Govind S
349fc47ec5 DGEMV Optimizations for TRANSPOSE Cases
- Developed new AVX512 DGEMV kernels for Zen4/5 architectures and
  AVX2 kernels for Zen1/2/3 architectures. These kernels are written
  from the ground up and are independent of fused kernels.

- The DGEMV primary kernel processes the calculation in chunks of
  8 columns. Fringe columns (sizes 1 to 7) are handled by fringe
  kernels, which are invoked by the primary kernel as needed.

- Implemented the kernels by computing the dot product of matrix A
  columns with vector x in chunks of 32 elements, storing the results
  in accumulator registers. Fringe elements are handled in chunks
  of 16, 8, etc. The data in the accumulator registers is then reduced
  and added to vector y.

AMD-Internal: [CPUPL-5835]
Change-Id: I5cb9eb1330db095931586a7028fd7676fbbecc61
2025-01-24 00:38:34 -05:00
Arnav Sharma
66461b8df3 Improved Multi-threaded Performance of DSCALV
- Added AOCL_DYNAMIC thresholds for DSCALV for Zen4 and Zen5
  architectures, since earlier they were using the Zen thresholds.
- Also updated ST_THRESH for Zen4 and Zen5 to avoid the OpenMP overheads
  incurred when the single-threaded path is optimally performant.

AMD-Internal: [CPUPL-5934]
Change-Id: I2d89cf5392516206fab83b672498fb8d98a5b033
2025-01-22 03:55:38 -05:00
Vignesh Balasubramanian
8e660215c3 Introduced fast-path in DAXPYV API
- Added a conditional check to invoke the vectorized
  DAXPYV kernels directly(fast-path), without incurring
  any additional framework overhead.

- The fast-path is taken when the input size is ideal for
  single-threaded execution. Thus, we avoid the call to
  bli_nthreads_l1() function to set the ideal number of threads.

AMD-Internal: [CPUPL-4878]
Change-Id: I001fd1b8bbd2d691ecb3e2423ec7998e130850bb
2025-01-10 09:19:38 -05:00
Vignesh Balasubramanian
345204d69b Additional updates to the thresholds for ZGEMM small path
- Further updated the thresholds for entry to ZGEMM small
  path(AVX2), when the execution is mulithreaded. The newer
  thresholds account for more skinnier inputs, compatible with
  single-threaded small path, as opposed to multithreaded
  SUP path.

AMD-Internal: [CPUPL-6040][CPUPL-5930]
Change-Id: I333f97d8af49733310e4ae48b12baba15ef828d6
2025-01-10 08:29:31 -05:00
Edward Smyth
567039a7fe Fortran interfaces for bli_thread_get APIs
Create and export Fortran interfaces for bli_thread_get_num_threads()
and bli_thread_get_{jc,pc,ic,jr,ir}_nt() APIs.

bli_thread_get_is_parallel() is intended for internal BLIS usage, so
not adding a Fortran interfaces for it at this time.

AMD-Internal: [CPUPL-6168]
Change-Id: Ieba2537e5455cc289536aec3de5d4b5866e607f1
2025-01-10 05:07:33 -05:00
Vignesh Balasubramanian
cdaa2ac7fd Bugfix and optimizations for AVX512 AMAXV micro-kernels
- Bug : The current {S/D}AMAXV AVX512 kernels produced an
  incorrect functionality with multiple absolute maximums.
  They returned the last index when having multiple occurences,
  instead of the first one.

- Implemented a bug-fix to handle this issue on these AVX512
  kernels. Also ensured that the kernels are compliant with
  the standard when handling exception values.

- Further optimized the code by decoupling the logic to find
  the maximum element and its search space for index. This way,
  we use lesser latency instructions to compute the maximum
  first.

- Updated the unit-tests, exception value tests and early return
  tests for the API to ensure code-coverage.

AMD-Internal: [CPUPL-4745]
Change-Id: I2f44d33dbaf89fe19e255af1f934877816940c6f
2025-01-07 22:56:20 +05:30
Vignesh Balasubramanian
f548f42607 Fixing compiler warnings on ZGEMM
- Scoped some of the variables used in zgemm_blis_impl()
  when determining the thresholds to small path. These
  variables will be used only when the architecture is
  ZEN5 or ZEN4.

AMD-Internal: [CPUPL-5895]
Change-Id: I6f90856f34454423ac777e33c74fe5ec6bb94e13
2025-01-07 10:59:43 +05:30
Edward Smyth
4ce708c316 Move some BLAS extension APIs to extra subdirectories
In preparation for merging next group of changes from upstream BLIS,
move some BLAS extension APIs to new extra subdirectories in
frame/compat and frame/compat/cblas/src. Other extension APIs will
be moved in later commits.

Some tidying up to better match upstream BLIS code has also been done.

AMD-Internal: [CPUPL-2698]
Change-Id: I0780a775d37242fba562c3f13666da0ad2b2cdfb
2024-12-17 04:54:39 -05:00
Edward Smyth
0c6d006225 Changes to rntm to reduce mutex operations
Change usage of global_rntm and tl_rntm to elimate need
for mutex operations when accessing global_rntm. Usage of
these data structures is now as follows:
* global_rntm is set once during bli_init_apis and includes
  all getenv calls to check BLIS threading and error printing
  environment variables. global_rntm is then read-only.
* tl_rntm is intialized once from global_rntm on each
  application thread. Any calls to BLIS set threading/ways
  APIs will update tl_rntm for that application thread only
  (Previously they updated global_rntm for all application threads).
* Re-initialize info_value in tl_rntm in every call to bli_init APIs.
* In bli_rntm_init_from_global() we initialize the local (per API
  call) rntm as a copy of tl_rntm and then update threading values
  in bli_thread_update_rntm_from_env() to reflect the current status
  of OpenMP runtime ICVs.

AMD-Internal: [CPUPL-6168][SWLCSG-3143]
Change-Id: Ib9387ee2b51f507ed08cc38267057109acea14a6
2024-12-16 04:45:26 -05:00
Shubham Sharma
beaea1b88f Added new DTRSM small code path for ZEN5
- Added new DTRSM kernels for right  and left variants.
- Kernel dimensions are 24x8.
- 24x8 DGEMM SUP kernels are used internally
  for solving GEMM subproblem.
- Tuned thresholds to pick efficent code path for ZEN5.

AMD-Internal: [CPUPL-6016]
Change-Id: I743d6dc47717952c2913085c0db3454ae9d046db
2024-12-16 10:38:45 +05:30
Vignesh Balasubramanian
609af9bfe2 Threshold tuning for ZGEMM small path
- Updated the threshold check for ZGEMM small path to include
  runtime checks for redirection, specific to the micro-architecture.

- The current ZGEMM small path has only its AVX2 variant available.
  Post implementing an AVX512(same/different algorithm), the thresholds
  will further be fine-tuned.

- Included the dot-product based AVX512 ZGEMM kernels in the ZEN5
  context. It will be used as part of handling RRC and CRC storage
  schemes of C, A and B matrices in both single-thread and multi-thread
  runs.

AMD-Internal: [CPUPL-5949]
Change-Id: Ic8b7cf0e00b7c477f748669f160c4b01df995c75
2024-12-13 12:51:22 -05:00
Arnav Sharma
25e59fcbb9 DGEMV Optimizations for NO_TRANSPOSE Cases
- AVX512 specific DGEMV native kernels are added for Zen4/5
  architectures to handle the NO_TRANSPOSE cases and are independent of
  the AXPYF fused kernels.
- The following set of kernels biased towards the n-dimension perform
  beta scaling of y vector within the kernel itself and handle cases
  where n is less than 5:
    - bli_dgemv_n_zen_int_32x8n_avx512( ... )
    - bli_dgemv_n_zen_int_32x4n_avx512( ... )
    - bli_dgemv_n_zen_int_32x2n_avx512( ... )
    - bli_dgemv_n_zen_int_32x1n_avx512( ... )
- The bli_dgemv_n_zen_int_16mx8_avx512( ... ) is biased towards the
  m-dimension and for this kernel beta scaling is handled beforehand
  within the framework.
- Added unit-tests for the new kernels.
- AVX2 path for Zen/2/3 architectures still follows the old approach of
  using fused kernel, namely AXPYF, to perform the GEMV operation.

AMD-Internal: [CPUPL-5560]
Change-Id: I22bc2a865cd28b9cdcb383e17d1ff38bdd28de79
2024-12-12 10:26:50 -05:00
Vignesh Balasubramanian
4da1ad2cd9 Added CBLAS wrappers for complex precision ?ROT and ?ROTG APIs
- Added the appropriate CBLAS wrappers for CROTG, CSROT,
  ZROTG and ZDROT APIs. These would internally call their
  ?_blis_impl() layer.

AMD-Internal: [CPUPL-5813]
Change-Id: I6037f20092f99cc5a5e2794d03bbe76d6a55eb97
2024-09-19 08:49:46 -04:00
Edward Smyth
a07e041b1f SCALV alpha=zero BLAS compliance
SCALV is used directly by BLAS, CBLAS and BLIS scal{v} APIs but
also within many other APIs to handle special cases. In general
it is preferred to use SETV when alpha=0, but BLAS and CBLAS
continue to multiple all vector element by alpha. This has
different behaviour for propagating NaNs or Infs.

Changes in this commit:
- Standardize early returns from SCALV reference and optimized
  kernels.
- User supplied N<0 is handled at the top level API layer. Use
  negative values of N in kernel calls to signify that SETV
  should _not_ be used when alpha=0. This should only be
  required in SCALV.
- Include serial threshold in zdscal (as in dscal) to reduce
  overhead for small problem sizes.
- Code tidying to make different variants more consistent.
- More standardization of tests in SCALV gtestsuite programs.
- Remove scalv_extreme_cases.cpp as it is now redundant.

AMD-Internal: [CPUPL-4415]
Change-Id: I42e98875ceaea224cc98d0cdfe0133c9abc3edae
2024-09-16 07:10:28 -04:00
Vignesh Balasubramanian
189a0b7224 Bugfix for {D/C/Z}AXPBY and ZAXPY BLAS APIs
- Bug : For non-zen architectures, {D/C/Z}AXPBY had
  incorrect datatypes passed when querying the computational
  kernel from context. The right datatype is now passed to
  each variant.

- Bug : For ZAXPY, a NULL context was passed to the kernel
  when using the single-threaded path. In case of further
  using the context inside the kernel, this would be an issue.
  We now pass the context instead of a null pointer.

AMD-Internal: [CPUPL-5643]
Change-Id: I01bb78bda6be61c43543b16fda0ac02a988a07bf
2024-08-22 14:12:14 +05:30
Hari Govind S
d349f89df6 Fix warning caused by dscalv
-  Setting the value for ST_THRESH for default
   code path in dscalv API to avoid warning
   message.

Change-Id: I8ace2070350267904faa498197b8356de9af58d1
2024-08-06 12:13:23 +05:30
Edward Smyth
89f52a6df5 Code cleanup: spelling corrections
Corrections for spelling and other mistakes in code comments
and doc files.

AMD-Internal: [CPUPL-4500]
Change-Id: I33e28932b0e26bbed850c55602dee12fd002da7f
2024-08-05 16:18:51 -04:00
Edward Smyth
82bdf7c8c7 Code cleanup: Copyright notices
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
  zen, skx and a couple of other kernels to cover all
  contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
  statements.

AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
2024-08-05 15:35:08 -04:00
Edward Smyth
09c45525f4 Missing early returns (2)
Add missing early return in axpyv.

AMD-Internal: [CPUPL-5540]
Change-Id: I522fd6f5551a4dab24e8c164fa38818c900b89f8
2024-08-05 12:18:33 -04:00
Edward Smyth
591a3a7395 Code cleanup: file formats and permissions
- Remove execute file permission from source and make files.
- dos2unix conversion.
- Add missing eol at end of files.

Also update .gitignore to not exclude build directory but to
exclude any build_* created by cmake builds.

AMD-Internal: [CPUPL-4415]
Change-Id: I5403290d49fe212659a8015d5e94281fe41eb124
2024-08-05 11:52:33 -04:00
Arnav Sharma
0a5c057475 DGEMV Optimizations for Tiny Sizes
- Added reference kernel for dgemv that handles computation for tiny
  sizes (m < 8 && n < 8).

- The reference kernel, bli_dgemv_zen_ref( ... ), supports both
  row/column storage schemes as well as transpose and no transpose
  cases.

- Added additional unit-tests for functional verification.

AMD-Internal: [CPUPL-5098]
Change-Id: I66fdf0a40e90bdb3fed40152c45ab28a17a87ada
2024-08-05 12:19:42 +05:30
Hari Govind S
3ae466697b Fixed performance drop of multi-threaded dscalv
-  Avoid performance degradation of dscalv for ST when OpenMP is enabled
   by using fast-path to skip the overhead caused by 'bli_nthreads_l1'
   function if the input size is less than a particular threshold.

-  Replaced 'bli_thread_vector_partition' work distribution function
   with 'bli_thread_range_sub'.

AMD-Internal: [CPUPL-5522]
Change-Id: I4ad0041d6e448c4a26fcd47ce44e0321a41b8b9f
2024-08-05 01:51:30 -04:00
Edward Smyth
0151ea748a Missing early returns
Add missing early returns in amax, asum, gemm_compute and gemv.

AMD-Internal: [CPUPL-5540]
Change-Id: I3ed682cae954331e48da5e8ef5c7f27dd4f11c5e
2024-08-02 10:16:19 -04:00
Moripalli Chitra
448702a1b4 Coverity issue fix
Out-of-bound access fix in malloc failure case for following APIs: ddot_, zdotc_, zdotu_

AMD-Internal: [CPUPL-4686]
Change-Id: I676697223604fbb2a8d03421d98ed0d8d706f8c7
2024-08-02 09:31:38 -04:00
Shubham Sharma.
45d82a1ebf Threshold tuning for DTRSM on zen5
- Added new decision logic to choose between
  native TRSM vs unpacked small TRSM for
  double precision.
- The changes are made for zen5 processor.

AMD-Internal: [CPUPL-5534]
Change-Id: I5204f6df111edec27d006daeb1c2b535a67b3e46
2024-08-01 11:27:28 -04:00
Vignesh Balasubramanian
f23b8e636b AVX2 and AVX512 optimizations for DAXPYV
- Removed some of the unrolling factors that affected the
  performance of AVX2 DAXPYV kernel. In addition to improving
  the current performance on sizes compatible to single-threaded
  runs, this will now perform better for tiny sizes as well
  since the overhead to reach the computation is less.

- Updated the vector partitioning logic, by using
  bli_thread_range_sub( ... ), which ensures that there is no
  false sharing among multiple threads.

- Updated the AOCL-DYNAMIC logic for the API, to include thresholds
  or zen4 and zen5 micro-architectures.

AMD-Internal: [CPUPL-5514]
Change-Id: Iee9edddac685334213cd6694421ab3df3547e930
2024-07-31 09:24:36 -04:00
Edward Smyth
8848ecb103 Improvements to CBLAS xerbla functionality
Currently the CBLAS xerbla always prints and always stops
on error. This commits adds similar functionality to the
regular BLAS xerbla to match the changes in 6d0444497f,
namely:

- Option to stop in xerbla on error. This is controlled by
  setting the environment variable BLIS_STOP_ON_ERROR=1
- Option to disable printing of error message from BLIS. This
  is controlled by setting the environment variable
  BLIS_PRINT_ON_ERROR=0
- Added a function to return the value of INFO passed to xerbla,
  assuming xerbla was not set to stop on error. Example call is

     info = bli_info_get_info_value();

The default behaviour remains to print but has been changed to
not stop on error, i.e. the equivalent to

     export BLIS_PRINT_ON_ERROR=1 BLIS_STOP_ON_ERROR=0

AMD-Internal: [CPUPL-5361]
Change-Id: Icd6125fd60da139e3ec0969e52337a1ed515f0a2
2024-07-26 10:36:37 -04:00
Hari Govind S
eacad443e3 Optimization for DCOPY and SCOPY API
-  Replaced "vmovupd" with "vmovups" for "bli_scopyv_zen4_asm_avx512"
   kernel.

-  Optimization of loop unrolling for "bli_dcopyv_zen4_asm_avx512"
   and "bli_scopyv_zen4_asm_avx512" kernels.

-  Replaced existing load balancing algorithm for dcopy API with
   "bli_thread_range_sub" algorithm.

-  Included AOCL-dynamic values for optimial number of threads
   for zen5 architecture.

AMD-Internal: [CPUPL-5238]
Change-Id: Ic82bdfad9478c8f75dc5a3dcfed0df85fbcae957
2024-07-24 08:23:07 -04:00
Vignesh Balasubramanian
b48e864e82 AVX512 optimizations for DAXPBYV API
- Implemented AVX512 computational kernel for DAXPBYV
  with optimal unrolling. Further implemented the other
  missing kernels that would be required to decompose
  the computation in special cases, namely the AVX512
  DADDV and DSCAL2V kernels.

- Updated the zen4 and zen5 contexts to ensure any query
  to acquire the kernel pointer for DAXPBYV returns the
  address of the new kernel.

- Added micro-kernel units tests to GTestsuite to check
  for functionality and out-of-bounds reads and writes.

AMD-Internal: [CPUPL-5406][CPUPL-5421]
Change-Id: I127ab21174ddd9e6de2c30a320e62a8b042cbde6
2024-07-22 11:32:19 +05:30
Vignesh Balasubramanian
cec9fdcc6e Framework enhancements for ?AXPBYV APIs
- Implemented a new front-end for the BLAS/CBLAS calls
  to ?AXPBYV(BLAS-extension API), that is intended to
  be compiled only on Zen micro-architectures(as per the
  existing build system).

- This new front-end makes the framework lightweight for
  BLAS/CBLAS calls to ?AXPBYV, by directly querying the
  architecture ID and deploying the associated computational
  kernel.

- Further updated the rerouting to other L1 kernels based
  on alpha and beta value. This was initially present in
  the Typed-API interface. It has been moved inside the
  respective kernels, and only necessary rerouting is done
  to specific L1 kernels to avoid redundant checks.

AMD-Internal: [CPUPL-5406]
Change-Id: I4af943d477a25dcdab4ee6009ad3dfa6a5c2b37e
2024-07-18 10:06:31 -04:00
Arnav Sharma
d5e29e3c7b CSCALV Framework Bugfix
- Fixed bug for non-zen architecture where CSCALV framework incorrectly
  fetches the dcomplex (ZSCALV) kernel pointer.

AMD-Internal: [CPUPL-5299]
Change-Id: I1d16588aa9dffd8b9dca69860026e377fa74d547
2024-07-17 00:27:47 +05:30
Arnav Sharma
4aa66f108e Added CSCALV AVX512 Kernel
- Added CSCALV kernel utilizing the AVX512 ISA.

- Added function pointers for the same to zen4 and zen5 contexts.

- Updated the BLAS interface to invoke respective CSCALV kernels based
  on the architecture.

- Added UKR tests for bli_cscalv_zen_int_avx512( ... ).

AMD-Internal: [CPUPL-5299]
Change-Id: I189d87a1ec1a6e30c16e05582dcb57a8510a27f3
2024-07-15 07:17:43 -04:00
vignbala
236d092656 AVX512 optimizations for ZGEMM to handle k = 1 cases
- Implemented bli_zgemm_16x4_avx512_k1_nn( ... ) AVX512 kernel to
  be used as part of BLAS/CBLAS calls to ZGEMM. The kernel is built
  for handling the GEMM computation with inputs having k = 1,
  with the transpose values being N(for column-major) and T(for
  row-major).

- Updated the zgemm_blis_impl( ... ) layer to query the architecture
  ID and invoke the AVX2 or AVX512 kernel accordingly.

- Added API level tests for accuracy and code-coverage, as well as
  micro-kernel tests for verifying functionality and out-of-bounds
  memory accesses.

AMD-Internal: [CPUPL-5249]
Change-Id: Id1f8bebff3e0da83c7febe86299564fd658b2e84
2024-07-09 07:07:24 -04:00
Hari Govind S
627bf0b1ba Implemented Multithreading and Enabled AVX512 Kernel for ZAXPY API
-  Replaced 'bli_zaxpyv_zen_int5' kernel with optimised
   'bli_zaxpyv_zen_int_avx512' kernel for zen4 and
   zen5  config.

-  Implemented multithreading support and AOCL-dynamic
   for ZAXPY API.

-  Utilized 'bli_thread_range_sub' function to achieve
   better work distribution and avoid false sharing.

AMD-Internal: [CPUPL-5250]
Change-Id: I46ad8f01f9d639e0baa78f4475d6e86458d8069b
2024-07-09 01:29:53 -04:00
Edward Smyth
2ee46a3a3a Merge commit 'cfa3db3f' into amd-main
* commit 'cfa3db3f':
  Fixed bug in mixed-dt gemm introduced in e9da642.
  Removed support for 3m, 4m induced methods.
  Updated do_sde.sh to get SDE from GitHub.
  Disable SDE testing of old AMD microarchitectures.
  Fixed substitution bug in configure.
  Allow use of 1m with mixing of row/col-pref ukrs.

AMD-Internal: [CPUPL-2698]
Change-Id: I961f0066243cf26aeb2e174e388b470133cc4a5f
2024-07-08 06:09:11 -04:00
Vignesh Balasubramanian
947811a429 Bugfix for ?OMATCOPY2 and ?IMATCOPY APIs
- Updated the parameter check for leading dimensions in the
  functions handling transpose case of matrix A.

- Updated the logic to perform ?IMATCOPY operation. The new
  logic uses an auxiliary buffer to copy and scale in place,
  if and when needed. This is done in order to avoid overwriting
  any subsequent reads that might follow(specifically in case
  of having different leading dimensions for reading and writing).

- Updated xerbla_() to throw memory allocation failure based on
  INFO parameter being -10. This value is specific to its use-case
  in ?IMATCOPY, where it is set to -10.

- Updated the Extreme Value Tests(EVT) logger for ?IMATCOPY
  for uniformity.

- Cleaned up the files to follow coding conventions.

AMD-Internal: [CPUPL-4862][SWLCSG-2706]
Change-Id: I34dfa2bcb66b821315e11f7ab2139c41a79ef780
2024-05-21 11:13:28 +05:30
Arnav Sharma
cb27fad49c ZSCALV AVX512 Kernel
- Implemented ZSCALV kernel utilizing AVX512 intrinsics.

- Gtestsuite: Added ukr tests for the new kernel.

AMD-Internal: [CPUPL-5012]
Change-Id: I75c7f4448ddd60b0f9afa53936eed37f5f99eeb2
2024-05-08 11:55:13 -04:00
Arnav Sharma
1dbeee4d19 ZDOTV AVX512 Kernel with MT Support
- Added AVX512 kernel for ZDOTV.

- Multithreaded both ZDOTC and ZDOTU with AOCL_DYNAMIC support.

AMD-Internal: [CPUPL-5011]
Change-Id: I56df9c07ab3b8df06267a99835b088dcada81bd8
2024-05-08 04:54:05 -04:00
Arnav Sharma
b1d69180f9 Updated DOTV DTL in bla_dot.c
- Updated DOTV DTL entry to include conjugate parameter.

AMD-Internal: [CPUPL-5059]
Change-Id: Id66be02fc06ff2faa18325dffe76559af2c6a5cf
2024-05-08 01:46:17 -04:00
Kiran Varaganti
fd61c69778 Fixed bug in omatcopy for when trans="t"
Thanks to Zhenyu Zhu ajz34 for pointing out this bug.
When trans="t" or "conjugate transpose" in the case of complex data-types
the ldb should be greater than equal to cols.
In the bug it was checked against "rows". Fixed this bug.
Some minor code format is done.

[CPUPL-4810][SWLCSG-2706]

Change-Id: Ie796d25a361b2ba72eda80e8c5867d6352af901f
2024-05-06 12:57:38 -04:00
Shubham Sharma
be34169001 Fixed Matlab Failure in ZTRSM
- In AVX512 ZTRSM kernel, vertorizes division code
  is causing failures in matlab.
- The logic is identical in reference C code and intrinsics code,
  but intrinsics code is causing failure
- Replaced optimized intrinsics code with C code.

AMD-Internal: [CPUPL-5052]
Change-Id: Iea184330b22c46d979867b870486066ef980eb84
2024-05-06 06:56:45 -04:00
Shubham Sharma
b9e21e8701 Added ZTRSM AVX512 small code path
- Kernel dimensions are 4x4.
  - Two kernels are implemented, Right Upper and
    Right lower.
  - In case of Left variants of TRSM, transpose is
    induced so that Right variant kernels can be used.
  - No packing is performed in these kernels.
  - Changes are made in the threshold to pick ZTRSM small
    code path.
  - BLIS_INLINE is removed from signature of
    "TRSMSMALL_KER_PROT".
  - These kernels do not support "ENABLE_TRSM_PREINVERSION".
  - Newly added kernels do not support conjugate
    transpose.
  - Added multithreading to ZTRSM small code path.

AMD-Internal: [CPUPL-4324]
Change-Id: I683b1d5239593e54f433e7f27497d72dfbd9141c
2024-05-03 05:10:41 -04:00
Hari Govind
9c26de1a18 Optimisiation COPYV APIs
- Implemented AVX512 kernels for scopyv_, dcopyv_ and  zcopyv_
  using respective AVX512 intrinsics including masked
  load and store operations.

- Implemented AVX512 kernels for scopy_, dcopy_ and
  zcopy_ using assembly language to prevent loss of
  performance during the translation of intrinsics.

- Updated the dcopy_blis_impl( ... ) and
  zcopy_blis_impl( ... ) function to support
  multithreaded calls to the respective computational
  kernels, if and when the OpenMP support is enabled.

- Implemented OpenMP parallelization for dcopyv_ and
  zcopyv_ APIs, while scopyv_ and ccopyv_ only support
  single thread.

AMD-Internal: [CPUPL-4854]
Change-Id: I5fbd0bcca4e59001fbe2b1168b624d0c33242b3e
2024-05-01 00:23:01 +05:30
srigovin
2c838dadfb Updated return type of xerbla and xerbla_array APIs to void
Return type of xerbla and xerbla_array APIs are defined as int in BLIS, but according to netlib it should be void. Updated the defination and declaration accordingly.

Signed-off-by: Sridhar Govindaswamy <Sridhar.Govindaswamy@amd.com>
Change-Id: I3072ba76111189de5c5cf08df83ea154163dd34d
2024-04-29 00:51:10 -04:00
Shubham Sharma
632c32767b Avoid alpha scaling in ZTRSV/ZTRSM when alpha = 1
- Scaling vector X is skipped when alpha is 1 in ZTRSV.
- Scaling matrix A is skipped when alpha is 1 in ZTRSM.

AMD-Internal: [CPUPL-4324]
Change-Id: I03c5a454ed1f5be36dac0f121408749bfc9cfc81
2024-04-16 02:24:02 -04:00