Commit Graph

131 Commits

Author SHA1 Message Date
Vignesh Balasubramanian
02da190560 AVX512 optimizations for DNRM2
- Implemented bli_dnorm2fv_unb_var1_avx512( ... ) AVX512
  computational kernel for DNRM2 API.

- Updated the header to include this kernel signature, as well
  as the framework layer to use this function in case of ZEN4
  and ZEN5 configurations.

- Updated the tipping points for ideal thread setting in DNRM2
  for ZEN5 micro-architecture. These thresholds are specific
  to the library's linkage to LLVM's OpenMP or GNU's OpenMP.

- Further abstracted the AOCL-DYNAMIC logic to separate functions
  for ?NRM2 APIs that currently support it (namely, DNRM2 and ZNRM2).

- Further updated the ?NRM2 framework to accommodate the necessary
  changes to invoke the newer AOCL-DYNAMIC functions and the AVX512
  kernel, when needed.

- Added micro-kernel and memory tests for this kernel in GTestsuite,
  to validate accuracy and out-of-bounds read and write.

AMD-Internal: [CPUPL-5265]
Change-Id: I4fc0d0f1e6906bf27d46562ca387c338cc4d2049
2024-06-24 08:50:36 -04:00
srigovin
2c838dadfb Updated return type of xerbla and xerbla_array APIs to void
The return type of the xerbla and xerbla_array APIs was defined as int in BLIS, but according to netlib it should be void. Updated the definition and declaration accordingly.

Signed-off-by: Sridhar Govindaswamy <Sridhar.Govindaswamy@amd.com>
Change-Id: I3072ba76111189de5c5cf08df83ea154163dd34d
2024-04-29 00:51:10 -04:00
Edward Smyth
2450a1813b BLIS: Implement zen5 sub-configuration
Implement full support for zen5 as a separate BLIS sub-configuration
and code path within amdzen configuration family.

AMD-Internal: [CPUPL-3518]
Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09
2024-04-12 07:26:31 -04:00
Vignesh Balasubramanian
8693c996ac Fixing coverity issues on SNRM2_ and SCNRM2_
- The bli_snormfv_unb_var1( ... ) and bli_cnormfv_unb_var1( ... )
  functions posed an uninitialized pointer read coverity issue,
  due to the local rntm_t object being declared as part of the
  function scope, but initialized only when needed (i.e., when
  attempting to pack the x vector if incx != 1).

- The fix was to have the declaration and initialization inside
  the case where incx != 1, thereby making the scope of the rntm_t
  and mem_t objects more stringent.

- This required an additional condition to call the kernel in case
  of unit stride.

AMD-Internal: [CPUPL-4278]
Change-Id: I763b1d4920532557749d8943f12b6df626aa5372
2023-12-06 23:56:09 +05:30
Edward Smyth
ed5010d65b Code cleanup: AMD copyright notice
Standardize format of AMD copyright notice.

AMD-Internal: [CPUPL-3519]
Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0
2023-11-23 08:54:31 -05:00
Vignesh Balasubramanian
bd0b50a077 Introduced fast-path to kernels in DNRM2_ and DZNRM2_ APIs
- Added a conditional check to see if the vectorized kernels
  for DNRM2_ and DZNRM2_ can be called directly, without
  incurring any framework overhead.

- The condition to satisfy this fast-path is for the size to be
  such that the ideal threads required is 1, with the vector having
  unit stride( so that packing at the framework-level can be avoided ).

AMD-Internal: [CPUPL-4045]
Change-Id: Ie37e86f802ada0e226dff88e74f0341e97ebfe28
2023-11-09 21:13:10 +05:30
Eleni Vlachopoulou
75a4d2f72f CMake: Adding new portable CMake system.
- A completely new system, made to be closer to Make system.

AMD-Internal: [CPUPL-2748]
Change-Id: I83232786406cdc4f0a0950fb6ac8f551e5968529
2023-11-09 15:49:45 +05:30
Vignesh Balasubramanian
5f9c8c6929 Bugfix : Fallback mechanism in SNRM2 and SCNRM2 kernels if packing fails
- Abstracted packing from the vectorized kernels for SNRM2 and SCNRM2 to
  a layer higher.

- Added a scalar loop to handle compute in case of non-unit strides.
  This loop ensures functionality in case packing fails at the
  framework level.

AMD-Internal: [CPUPL-3633]
Change-Id: I555aea519d7434d43c541bb0f661f81105135b98
2023-11-08 15:16:10 +05:30
Arnav Sharma
8885510db2 Fix for Missing Symbols for gemm_pack_get_size
- Symbols for gemm_pack_get_size were not being exported properly when
  BLIS was built as a shared library.
- Correctly assigned the BLIS_EXPORT_BLAS macro to ?gemm_pack_get_size_
  function declaration.
- Added missing gemm_pack and gemm_pack_get_size macros to
  bli_macro_defs.h file.
- Removed an unnecessary BLIS_EXPORT_BLAS macro from dgemm_compute
  function definition.
- Updated bli_util_api_wrap with no underscore API wrappers for pack and
  compute set of BLAS Extension APIs:
  1. ?gemm_pack_get_size
  2. ?gemm_pack
  3. ?gemm_compute

AMD-Internal: [CPUPL-4083]
Change-Id: I78cd7642c2fcbfdf02676e654a377ad2aa5295c1
2023-11-03 08:58:59 -04:00
Vignesh Balasubramanian
84faccdd7d Enabling the vectorized path for SNRM2_
- Enabled the vectorized AVX-2 code-path for SNRM2_. The
  framework queries the architecture ID and calls the
  vectorized kernel based on the architecture support.

- In case of not having the architecture support, we use
  the default path based on the sumsqv method.

AMD-Internal: [CPUPL-3277]
Change-Id: Ic60c0782dec0b7eb09fac21818eb625e57b1d14f
2023-11-03 17:45:56 +05:30
Vignesh Balasubramanian
81161066e5 Multithreading the DNRM2 and DZNRM2 API
- Updated the bli_dnormfv_unb_var1( ... ) and
  bli_znormfv_unb_var1( ... ) function to support
  multithreaded calls to the respective computational
  kernels, if and when the OpenMP support is enabled.

- Added the logic to distribute the job among the threads such
  that only one thread has to deal with the fringe case (if required).
  The remaining threads will execute only the AVX-2 code section
  of the computational kernel.
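
The work split described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `partition_for_thread`, the `width` parameter (elements per vector iteration), and the choice of the last thread as the one that takes the fringe are all hypothetical and are not taken from the BLIS sources.

```c
#include <stddef.h>

/* Hypothetical helper: split n elements among nt threads (nt >= 1) so that
   every thread gets a whole number of vector-width blocks, and only the
   last thread additionally receives the leftover (fringe) elements. */
void partition_for_thread(size_t n, size_t nt, size_t width, size_t tid,
                          size_t *start, size_t *len)
{
    size_t blocks = n / width;   /* number of full vector-width blocks */
    size_t base   = blocks / nt; /* blocks every thread receives       */
    size_t extra  = blocks % nt; /* first 'extra' threads get one more */

    size_t my_blocks   = base + (tid < extra ? 1 : 0);
    size_t first_block = tid * base + (tid < extra ? tid : extra);

    *start = first_block * width;
    *len   = my_blocks * width;

    /* Assumption: the fringe (n % width elements) goes to the last thread. */
    if (tid == nt - 1)
        *len += n % width;
}
```

With n = 103, four threads and a width of 4, the threads receive 28, 24, 24 and 27 elements, so only the last one has to run a non-vectorized tail.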

- Added reduction logic post parallel region, to handle overflow
  and/or underflow conditions as per the mandate. The reduction
  for both the APIs involve calling the vectorized kernel of
  dnormfv operation.

- Added changes to the kernel to have the scaling factors and
  thresholds prebroadcasted onto the registers, instead of
  broadcasting every time on a need basis.

- Non-unit stride cases are packed to be redirected to the
  vectorized implementation. In case the packing fails, the
  input is handled by the fringe case loop in the kernel.

- Added the SSE implementation in the bli_dnorm2fv_unb_var1_avx2( ... )
  and bli_dznorm2fv_unb_var1_avx2( ... ) kernels, to handle fringe
  cases of size = 2, and of size = 1 or non-unit strides, respectively.

AMD-Internal: [CPUPL-3916][CPUPL-3633]
Change-Id: Ib9131568d4c048b7e5f2b82526145622a5e8f93d
2023-10-16 07:26:27 -04:00
Vignesh Balasubramanian
9828039030 Bugfix : Inversion of sign bit with early return in SNRM2_
- The bli_snormfv_unb_var1( ... ) function returns early in
  case of n = 1, and uses the blis macro bli_fabs( ... ) to
  set the norm to the absolute value of the element.

- This macro inverts the sign bit even if the element is 0.0.
  A check is added to re-invert the sign bit in this case, so
  that the norm is set to 0.0 instead of -0.0.

- Added the same early exit condition on bli_dnormfv_unb_var1( ... )
  when n = 1.
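
The sign-bit behaviour can be reproduced with a small sketch. The macro `ABS_BY_NEGATION` below is hypothetical; it is only assumed to resemble bli_fabs( ... ), whose actual definition may differ, and `norm_of_single_element` is an illustrative name, not a BLIS function.

```c
#include <math.h>

/* Hypothetical absolute-value macro, assumed (not confirmed) to behave like
   bli_fabs: it negates every value that compares <= 0.0, including +0.0,
   which turns +0.0 into -0.0. */
#define ABS_BY_NEGATION(a) ( (a) <= 0.0 ? -(a) : (a) )

/* Norm of a length-1 vector: |x0|, with the sign of a zero result cleared. */
double norm_of_single_element(double x0)
{
    double norm = ABS_BY_NEGATION(x0);

    /* -0.0 == 0.0 compares true, so this branch also catches negative zero
       and replaces it with positive zero. */
    if (norm == 0.0)
        norm = 0.0;

    return norm;
}
```

Here `signbit(ABS_BY_NEGATION(0.0))` is nonzero, while `signbit(norm_of_single_element(0.0))` is zero.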

AMD-Internal: [CPUPL-3923]
Change-Id: If7f5ae41d2acfe89b505549d28215dde319d8c33
2023-10-10 04:21:09 -04:00
Edward Smyth
bb4c158e63 Merge commit 'b683d01b' into amd-main
* commit 'b683d01b':
  Use extra #undef when including ba/ex API headers.
  Minor preprocessor/header cleanup.
  Fixed typo in cpp guard in bli_util_ft.h.
  Defined eqsc, eqv, eqm to test object equality.
  Defined setijv, getijv to set/get vector elements.
  Minor API breakage in bli_pack API.
  Add err_t* "return" parameter to malloc functions.
  Always stay initialized after BLAS compat calls.
  Renamed membrk files/vars/functions to pba.
  Switch allocator mutexes to static initialization.

AMD-Internal: [CPUPL-2698]
Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df
2023-08-21 07:01:38 -04:00
Edward Smyth
7e50ba669b Code cleanup: No newline at end of file
Some text files were missing a newline at the end of the file.
One has been added.

Also correct file format of windows/tests/inputs.yaml, which
was missed in commit 0f0277e104

AMD-Internal: [CPUPL-2870]
Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549
2023-04-21 10:02:48 -04:00
Mangala V
5dc8e3fbca AOCL progress callback pointer update per thread
Thanks to Moore, Branden <Branden.Moore@amd.com> for identifying the
race condition and suggesting the changes to fix the same

Existing Design:
- AOCL progress callback pointer is a global pointer which is shared
  across all threads

Existing Design challenges:
 - The callback function cannot safely disable the progress mechanism,
   as another thread may have already checked to see if the function
   pointer is set, and then re-reads the pointer upon invocation of
   the callback. If one thread sets the callback to NULL in this time,
   then the resulting thread will attempt to call the null pointer as a
   function pointer, leading to a segfault.

New Design :
- Each thread maintains a local copy of progress pointer
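
A minimal sketch of the per-thread copy idea, assuming C11 atomics and thread-local storage. All names here (`progress_cb_t`, `set_progress_cb`, `begin_operation`, `report_progress`, `stop_after_first`) and the callback signature are hypothetical and do not reflect the actual AOCL progress API.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical callback type; the real AOCL prototype may differ. */
typedef int (*progress_cb_t)(const char *api, long progress);

/* Registration slot shared by all threads. */
static _Atomic(progress_cb_t) registered_cb = NULL;

/* Per-thread copy: each thread checks and calls its own pointer. */
static _Thread_local progress_cb_t thread_cb = NULL;

void set_progress_cb(progress_cb_t cb) { atomic_store(&registered_cb, cb); }

/* Called by a thread when it starts an operation: snapshot the pointer. */
void begin_operation(void) { thread_cb = atomic_load(&registered_cb); }

/* The check and the call use the same local value, so another thread
   clearing the registration cannot turn this into a NULL call. */
int report_progress(const char *api, long progress)
{
    progress_cb_t cb = thread_cb;
    return cb ? cb(api, progress) : 0;
}

/* Example callback that disables further progress reports when invoked. */
static int callback_calls = 0;
int stop_after_first(const char *api, long progress)
{
    (void)api; (void)progress;
    ++callback_calls;
    set_progress_cb(NULL);
    return 0;
}
```

In this sketch, clearing the registration only affects operations that start afterwards; a thread that already took its snapshot keeps a valid pointer for the rest of the operation.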

AMD-Internal: [SWLCSG-1971]

Change-Id: I282989805a4a2a8a759a7373b645f3569bf42ed4
2023-04-20 05:33:12 -04:00
Shubham Sharma
036da2e651 Fixed compilation errors for generic configuration
- In gemmt and normf, #ifdef BLIS_KERNELS_* is added
  to make sure only compiled kernels are used.
- In bla_copy and bla_swap, missing '\' is added.

AMD-Internal: [CPUPL-2870]
Change-Id: I83452dff761f60db6957f557321ce210ab72c037
2023-04-18 00:27:05 -04:00
Edward Smyth
1ac03e64b5 BLIS cpuid tidy and bugfix.
Improvements to BLIS cpuid functionality:
- Tidy names of avx support test functions, especially rename
  bli_cpuid_is_avx_supported() to bli_cpuid_is_avx2fma3_supported()
  to more accurately describe what it tests.
- Fix bug in frame/base/bli_check.c related to changes in commit
  6861fcae91

AMD-Internal: [CPUPL-3031]
Change-Id: Iacd8fb0ffbd45288e536fc6314660709055ea2d5
2023-04-03 08:46:37 -04:00
Eleni Vlachopoulou
ad7a812db2 Remove quick return for zero increments.
Details:
- To be BLAS compliant, if increment is zero then iterate through the first element n times.
- For n<=0, the correct result (0) is returned so we remove this extra check. This is checked on BLIS-typed interface level.

AMD-Internal: [SWLCSG-1900]
Change-Id: I098bb9560a790050018bc8d8c63b06bfbcc1aebd
2023-03-23 23:35:03 -04:00
Aayush Kumar
5bd2a777ba Fixed Compilation Fails when configured with --disable-blas
- Moved *_blis_impl function declaration outside the BLIS_ENABLE_BLAS
  guard.
- Changed Makefile to continue to compile bla_ files to get
  *_blis_impl interfaces.
- Modify CBLAS headers, bli_macro_defs.h and bli_util_api_wrap.{c,h}
  to add BLIS_ENABLE_CBLAS guards.
- Comment out BLIS_ENABLE_BLAS guards in various headers and utility
  functions.
- Define BLIS Fortran-style functions lsame_blis_impl and
  xerbla_blis_impl. New macros PASTE_LSAME and PASTE_XERBLA are
  used in bla_*_check headers and some other places to select
  whether to call lsame and xerbla, or the _blis_impl versions.
- Defined various other missing _blis_impl functions.
- In bli_util_api_wrap.c, only define any functions if
  BLIS_ENABLE_BLAS is defined, and only define the subroutine
  versions of functions like dot, nrm2, etc if BLIS_ENABLE_CBLAS
  is defined.
- BLAS layer is needed if CBLAS layer is enabled. Changed header
  files build/bli_config.h.in and bli_blas.h, and configure
  program to help ensure consistency in generated blis.h header
  and configure output.

Undefining BLIS_ENABLE_BLAS_DEFS appears to be broken in UTA BLIS
too, thus BLIS_ENABLE_BLAS_DEFS is currently permanently defined.

AMD-Internal: [CPUPL-3015]

Change-Id: I7c0fe07db85781db46f2c690e174451860b37635
2023-03-23 06:11:52 -04:00
Edward Smyth
1617589d24 Add consistent NaN/Inf handling in sumsqv. (#668)
Details:
- Changed sumsqv implementation as follows:
  - If there is a NaN (either real or imaginary), then return a sum of
    NaN and unit scale.
  - Else, if there is an Inf (either real or imaginary), then return a
    sum of +Inf and unit scale.
  - Otherwise behave as normal.

(cherry picked from commit b861c71b50)

AMD-Internal: [SWLCSG-1900]
Change-Id: Ic7ba9cad1fbaf11823b9ba96e72a4ddd973db5b6
2023-03-09 06:36:44 -05:00
Sireesha Sanga
540509f374 Enabling AVX2 path for SCNRM2
2023-01-13 10:27:54 +05:30
Eleni Vlachopoulou
758d68467f Disabling AVX2 path for SNRM2 and SCNRM2.
AMD-Internal: [CPUPL-2865]
Change-Id: I09c67115801a6b9446c7930c54fc937bd17908a3
2023-01-12 09:58:28 -05:00
Meghana Vankadari
f39dba9fd8 Added dzgemm, DZGEMM, DZGEMM_ prototypes to wrapper file.
AMD-Internal: [CPUPL-2199]
Change-Id: Ied814cb7be60d30b8217ec42ac436b4e628ea6d2
2023-01-12 01:35:29 -05:00
Edward Smyth
82c2eb4e8e Code cleanup and warnings fixes
Corrections for some occurrences of:
- Compiler warnings about initialization of float from double
- Spelling mistakes in comments
- Incorrect indentation of code and comments

AMD-Internal: [CPUPL-2870]
Change-Id: Icb68c789687bd0684844331d43071bfffecac9fc
2023-01-09 04:34:52 -05:00
Eleni Vlachopoulou
13aa3c8cd0 Adding AVX2 support for SNRM2 and SCNRM2
- For the cases where AVX2 is available, an optimized function is called,
  based on Blue's algorithm. The fallback method based on sumsqv is used
  otherwise.

- Scaling is used to avoid overflow and underflow.

- Works correctly for negative increments.

AMD-Internal: [SWLCSG-1080]

Change-Id: I6bf2f42652ba6b8a8631a0a9e6f6297d5b3ea5d9
2022-12-14 04:25:45 -05:00
Harihara Sudhan S
42d631bced Copyright modification
- Added copyright information to modified/newly created
  files that were missing it

Change-Id: If4e73b680246d0363de09587d6dc54bee00ecd71
2022-10-14 12:43:35 +05:30
Eleni Vlachopoulou
863b73dfaf Adding AVX2 support for DZNRM2
- For the cases where AVX2 is available, an optimized function is called,
based on Blue's algorithm. The fallback method based on sumsqv is used
otherwise.

- Scaling is used to avoid overflow and underflow.

- Works correctly for negative increments.

- Cleaned up some white space in the AVX2 implementation for DNRM2.

AMD-Internal: [CPUPL-2551]
Change-Id: I0875234ea735540307168fe7efc3f10fe6c40ffc
2022-09-30 07:51:04 -04:00
Eleni Vlachopoulou
1c1a0027a8 Bugfix in DNRM2 AVX path
Description:
Enabled DNRM2 AVX path and fixed bug that caused numerical accuracy
errors.

AMD-Internal: [CPUPL-2576]
Change-Id: Ic9fda9d9668bdfe233621f79db6acce518b4d10e
2022-09-29 04:46:04 -04:00
Nallani Bhaskar
1e5d98322d Disabled DNRM2 AVX path temporarily
Description:
Disabled AVX2 optimized path for DNRM2 to avoid accuracy
issues in netlib blas test.

AMD-Internal: [CPUPL-2576]
Change-Id: I0764725d4f6b1e4e0b5f60a255bc681bb698560e
2022-09-22 13:00:04 +05:30
Eleni Vlachopoulou
a5891f7ead Adding AVX2 support for DNRM2
- For the cases where AVX2 is available, an optimized function is called,
based on Blue's algorithm. The fallback method based on sumsqv is used
otherwise.

- Scaling is used to avoid overflow and underflow.

- Works correctly for negative increments.

AMD-Internal: [CPUPL-2551]
Change-Id: I5d8976b29b5af463a8981061b2be907ea647123c
2022-09-20 06:05:01 -04:00
Dipal M Zambare
2cdeea3c66 CBLAS/BLAS interface decoupling for the level 1 APIs
-In BLIS, the CBLAS  interface is implemented as a wrapper around
 the BLAS interface. For example the CBLAS API  ‘cblas_dscal’
 internally invokes the BLAS API ‘dscal_’.

-This coupling between CBLAS and BLAS interface prevents the end
 user from overriding them individually by the application or
 other libraries.

-This change separates the CBLAS and BLAS implementation by adding
 an additional level of abstraction. The implementation of the
 API is moved to the new function which is invoked directly from
 the CBLAS and BLAS wrappers.
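
The layering described above can be sketched as follows. The names are hypothetical (`my_` prefixes are used so the sketch cannot clash with a real BLAS library), and the real BLIS signatures and types may differ.

```c
#include <stddef.h>

/* Shared implementation layer (hypothetical name). Both wrappers below
   call this function directly instead of calling each other. */
void my_dscal_impl(int n, double alpha, double *x, int incx)
{
    if (n <= 0 || incx <= 0) return;
    for (int i = 0; i < n; ++i)
        x[(size_t)i * (size_t)incx] *= alpha;
}

/* Fortran-style BLAS wrapper: arguments are passed by reference. */
void my_dscal_(const int *n, const double *alpha, double *x, const int *incx)
{
    my_dscal_impl(*n, *alpha, x, *incx);
}

/* CBLAS-style wrapper: arguments are passed by value. It no longer goes
   through the BLAS symbol, so overriding my_dscal_ does not affect it. */
void my_cblas_dscal(int n, double alpha, double *x, int incx)
{
    my_dscal_impl(n, alpha, x, incx);
}
```

Because neither wrapper depends on the other's symbol, an application or another library can replace one of them without changing the behaviour of the other.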

AMD-Internal: [SWLCSG-1477]
Change-Id: I0e80071398af29c9313296d2a92e61e3897ac28e
2022-09-19 21:50:29 +05:30
Dipal M Zambare
866e8de7bf CBLAS/BLAS interface decoupling for the level 2 APIs
-In BLIS, the CBLAS  interface is implemented as a wrapper around
 the BLAS interface. For example the CBLAS API  ‘cblas_dgemv’
 internally invokes the BLAS API ‘dgemv_’.

-This coupling between CBLAS and BLAS interface prevents the end
 user from overriding them individually by the application or
 other libraries.

-This change separates the CBLAS and BLAS implementation by adding
 an additional level of abstraction. The implementation of the
 API is moved to the new function which is invoked directly from
 the CBLAS and BLAS wrappers.

AMD-Internal: [SWLCSG-1477]
Change-Id: Ie7cbbac86bbfa1075a5064b31b365e911f67786c
2022-09-15 17:51:05 +05:30
Dipal M Zambare
e18db8a172 CBLAS/BLAS interface decoupling for the level 3 APIs
-In BLIS, the CBLAS  interface is implemented as a wrapper around
 the BLAS interface. For example the CBLAS API  ‘cblas_dgemm’
 internally invokes the BLAS API ‘dgemm_’.
-This coupling between CBLAS and BLAS interface prevents the end
 user from overriding them individually by the application or
 other libraries.
-This change separates the CBLAS and BLAS implementation by adding
 an additional level of abstraction. The implementation of the
 API is moved to the new function which is invoked directly from
 the CBLAS and BLAS wrappers.

AMD-Internal: [SWLCSG-1477]
Change-Id: Id9e307154342d2c17b0ac6db580c36f1a9ee6409
2022-09-15 06:23:46 -04:00
Dipal M Zambare
61232d540c AOCL progress callback hardening
- BLIS uses callback function to report the progress of the
  operation. The callback is implemented in the user application
  and is invoked by BLIS.

- Updated callback function prototype to make all arguments const.
  This will ensure that any attempt to write using callback’s
  argument is prevented at the compile time itself.

AMD-Internal: [CPUPL-2504]
Change-Id: I8ceb671242365d2a9155b485301cd8c75043e667
2022-09-14 15:32:10 +05:30
Dipal M Zambare
5c42afada8 Revert "CBLAS/BLAS interface decoupling for level 3 APIs"
This reverts commit d925ebeb06.

Change-Id: I2e842b29c1fedbe14bf913949cf978f3e7515ff3
2022-08-30 14:50:38 +05:30
Dipal M Zambare
7e42b3d2e0 Revert "CBLAS/BLAS interface decoupling for level 2 APIs"
This reverts commit 192f5313a1.

Change-Id: I876cad90902970ebc61550f109eb0ce32539ea1c
2022-08-30 11:53:46 +05:30
Dipal M Zambare
6cff8b030e Revert "CBLAS/BLAS interface decoupling for level 1 APIs"
This reverts commit 95169ca806.

Change-Id: Ic441aca616be6f27c7f1ba64e4480edcc6b17632
2022-08-30 11:34:34 +05:30
Dipal M Zambare
40c71dd2e1 Revert "CBLAS/BLAS interface decoupling for swap api"
This reverts commit 2beaa6a0e6.

Reverting it as it is planned for the next release.

Change-Id: Ib9271acd0b5b4cfd10c8f8b7bbb6ef93a3d594ea
2022-08-30 10:10:06 +05:30
Edward Smyth
abf848ad12 Code cleanup and warnings fixes
- Removed some additional compiler warnings reported by GCC 12.1
- Fixed a couple of typos in comments
- frame/3/bli_l3_sup.c: routines were returning before final call
  to AOCL_DTL_TRACE_EXIT
- frame/2/gemv/bli_gemv_unf_var1_amd.c: bli_multi_sgemv_4x2 is
  only defined in header file if BLIS_ENABLE_OPENMP is defined

AMD-Internal: [CPUPL-2460]
Change-Id: I2eacd5687f2548d8f40c24bd1b930859eefbbcde
2022-08-29 08:22:30 -04:00
jagar
2beaa6a0e6 CBLAS/BLAS interface decoupling for swap api
-   In BLIS the cblas interface is implemented as a wrapper around
    the blas interface. For example the CBLAS api ‘cblas_dgemm’
    internally invokes BLAS API ‘dgemm_’.
-   If the end user wants to use different libraries for CBLAS
    and BLAS, the current implementation of BLIS doesn’t allow it.
-   This change separates the CBLAS and BLAS implementations by adding
    an additional level of abstraction. The implementation of the
    API is moved to the new function which is invoked directly from
    the CBLAS and BLAS wrappers.

AMD-Internal: [SWLCSG-1477]

Change-Id: I8d81072aaca739f175318b82f6510d386103c24b
2022-08-29 16:26:01 +05:30
jagar
95169ca806 CBLAS/BLAS interface decoupling for level 1 APIs
-   In BLIS the cblas interface is implemented as a wrapper around
    the blas interface. For example the CBLAS api ‘cblas_dgemm’
    internally invokes BLAS API ‘dgemm_’.
-   If the end user wants to use different libraries for CBLAS
    and BLAS, the current implementation of BLIS doesn’t allow it and may
    result in recursion.
-   This change separates the CBLAS and BLAS implementations by adding
    an additional level of abstraction. The implementation of the
    API is moved to the new function which is invoked directly from
    the CBLAS and BLAS wrappers.

AMD-Internal: [SWLCSG-1477]

Change-Id: I0f4521e70a02f6132bdadbd4c07715c9d52fe62a
2022-08-29 14:28:20 +05:30
jagar
192f5313a1 CBLAS/BLAS interface decoupling for level 2 APIs
-   In BLIS the cblas interface is implemented as a wrapper around
    the blas interface. For example the CBLAS api ‘cblas_dgemm’
    internally invokes BLAS API ‘dgemm_’.
-   If the end user wants to use different libraries for CBLAS
    and BLAS, the current implementation of BLIS doesn’t allow it and may
    result in recursion.
-   This change separates the CBLAS and BLAS implementations by adding
    an additional level of abstraction. The implementation of the
    API is moved to the new function which is invoked directly from
    the CBLAS and BLAS wrappers.

AMD-Internal: [SWLCSG-1477]

Change-Id: I8380b6468683028035f2aece48916939e0fede8a
2022-08-29 09:47:19 +05:30
Chandrashekara K R
d925ebeb06 CBLAS/BLAS interface decoupling for level 3 APIs
->In BLIS the cblas interface is implemented as a wrapper around
  the blas interface. For example the CBLAS api ‘cblas_dgemm’
  internally invokes BLAS API ‘dgemm_’.
->If the end user wants to use different libraries for CBLAS
  and BLAS, the current implementation of BLIS doesn’t allow it and
  may result in recursion.
->This change separates the CBLAS and BLAS implementations by adding
  an additional level of abstraction. The implementation of the
  API is moved to the new function which is invoked directly from
  the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]

Change-Id: I6218a3e81060fc8045f4de0ace87f708465dfae5
2022-08-26 05:54:29 -04:00
Dipal M Zambare
e712ffe139 Added AOCL progress support for BLIS
-- AOCL libraries are used for lengthy computations which can go
   on for hours or days. Once the operation is started, the user
   doesn’t get any update on the current state of the computation.
   This (AOCL progress) feature enables the user to receive periodic
   updates from the libraries.
-- The user registers a callback with the library if interested
   in receiving the periodic updates.
-- The library invokes this callback periodically with information
   about the current state of the operation.
-- The update frequency is statically set in the code; it can be
   modified as needed if the library is built from source.
-- This feature is supported for GEMM and TRSM operations.

-- Added examples for GEMM and TRSM.
-- Cleaned up and reformatted test_gemm.c and test_trsm.c to
   remove warnings and make indentation consistent across the
   file.

AMD-Internal: [CPUPL-2082]
Change-Id: I2aacdd8fb76f52e19e3850ee0295df49a8b7a90e
2022-05-17 18:10:39 +05:30
Saitharun
e783ea10db Enable wrapper code by default
details: Changes Made for 4.0 branch to enable wrapper code by default
and also removed ENABLE_API_WRAPPER macro.

Change-Id: I5c9ede7ae959d811bc009073a266e66cbf07ef1a
2022-01-19 11:38:45 +05:30
mkurumel
595f7b7edf dnrm2 optimization with dot method
1.  Added new kernel bli_dnorm2fv_unb_var1 to compute the
    norm with the dot operation.
2.  Added vectorization to compute squares in blocks of 32
    double elements from vector X.
3.  Defined a new macro BLIS_ENABLE_DNRM2_FAST under the config header
    to compute nrm2 using the new kernel.
4.  The dot kernel definitions and implementation may have
    accuracy issues. We can switch to the traditional implementation by
    disabling the macro BLIS_ENABLE_DNRM2_FAST to compute the L2-norm
    of vector X.

AMD-Internal: [CPUPL-1757]

Change-Id: I1adcaf1b3b4e33837758593c998c25705ff0fe11
2021-11-12 08:58:53 +05:30
Meghana Vankadari
10ca8710f0 Optimized SUP code for GEMMT
Details:
- Eliminated the IR loop in ref_var2m functions.
- Handled the rectangular and triangular portions of C matrix
  separately.
- Added a condition to check and eliminate zero regions inside IC loop.
- modified kc selection logic to choose optimal KC in SUP
- Updated thresholds to choose between SUP and native.

Change-Id: I21908eaa6bc3a8f37bdea29f7bfca7e6fcfee724
2021-11-12 08:58:52 +05:30
Saitharun
ffcb338531 Adding ENABLE_WRAPPER CMAKE OPTION
Details: Wrapper code is enabled when the CMake option
ENABLE_WRAPPER is selected. This commit also fixes the ScaLAPACK build
error on Windows.

AMD-Internal: [CPUPL-1848]
Change-Id: I3d687cbc00e7603fdfb45937a00daf86bd07878e
2021-11-12 08:58:49 +05:30
Saitharun
5d6012c50d Added additional symbols for BLIS APIs using wrapper functions
Details:
BLIS currently supports BLAS and CBLAS interfaces with lowercase.
With this commit, we also support uppercase with and without a
trailing underscore, and lowercase without a trailing underscore, as
symbol names.

Change-Id: Ibb06121821ab937b25d492409625916f542b2135
2021-11-12 08:58:48 +05:30
Field G. Van Zee
b683d01b9c Use extra #undef when including ba/ex API headers.
Details:
- Inserted a "#include bli_xapi_undef.h" after each usage of the basic
  and expert API macro setup headers: bli_oapi_ba.h, bli_oapi_ex.h,
  bli_tapi_ba.h, and bli_tapi_ex.h. This is functionally equivalent to
  the previous status quo, in which each header made minimal #undef
  prior to its own definitions and then a single instance of
  "#include bli_xapi_undef.h" cleaned up any remaining macro defs after
  all other headers were used. This commit will guarantee that macro
  defs from the setup of one header (say, bli_oapi_ex.h) don't "infect"
  the definitions made in a subsequent header. As with this previous
  commit, this change does not fix any issue but rather attempts to
  avoid creating orphaned macro definitions that are only needed within
  a very limited scope.
- Removed minimal #undef from bli_?api_[ba|ex].h.
- Removed old commented-out lines from bli_?api_[ba|ex].h.
2021-05-13 15:23:22 -05:00