- Implemented bli_dnorm2fv_unb_var1_avx512( ... ) AVX512
computational kernel for DNRM2 API.
- Updated the header to include this kernel signature, as well
as the framework layer to use this function in case of ZEN4
and ZEN5 configurations.
- Updated the tipping points for ideal thread setting in DNRM2
for ZEN5 micro-architecture. These thresholds are specific
to the library's linkage to LLVM's OpenMP or GNU's OpenMP.
- Further abstracted the AOCL-DYNAMIC logic to separate functions
for ?NRM2 APIs that currently support it (namely, DNRM2 and ZNRM2).
- Further updated the ?NRM2 framework to accommodate the necessary
changes to invoke the newer AOCL-DYNAMIC functions and the AVX512
kernel, when needed.
- Added micro-kernel and memory tests for this kernel in GTestsuite,
to validate accuracy and out-of-bounds read and write.
AMD-Internal: [CPUPL-5265]
Change-Id: I4fc0d0f1e6906bf27d46562ca387c338cc4d2049
The return type of the xerbla and xerbla_array APIs is defined as int in BLIS, but according to netlib it should be void. Updated the definition and declaration accordingly.
Signed-off-by: Sridhar Govindaswamy <Sridhar.Govindaswamy@amd.com>
Change-Id: I3072ba76111189de5c5cf08df83ea154163dd34d
Implement full support for zen5 as a separate BLIS sub-configuration
and code path within amdzen configuration family.
AMD-Internal: [CPUPL-3518]
Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09
- The bli_snormfv_unb_var1( ... ) and bli_cnormfv_unb_var1( ... )
functions posed an uninitialized pointer read coverity issue,
due to the local rntm_t object being declared as part of the
function scope, but initialized only on a need basis (i.e., when
attempting to pack x vector if incx != 1).
- The fix was to have the declaration and initialization inside
the case where incx != 1, thereby making the scope of the rntm_t
and mem_t objects more stringent.
- This required an additional condition to call the kernel in case
of unit stride.
AMD-Internal: [CPUPL-4278]
Change-Id: I763b1d4920532557749d8943f12b6df626aa5372
- Added a conditional check to see if the vectorized kernels
for DNRM2_ and DZNRM2_ can be called directly, without
incurring any framework overhead.
- The condition to satisfy this fast-path is for the size to be
such that the ideal threads required is 1, with the vector having
unit stride( so that packing at the framework-level can be avoided ).
AMD-Internal: [CPUPL-4045]
Change-Id: Ie37e86f802ada0e226dff88e74f0341e97ebfe28
- Abstracted packing from the vectorized kernels for SNRM2 and SCNRM2 to
a layer higher.
- Added a scalar loop to handle compute in case of non-unit strides.
This loop ensures functionality in case packing fails at the
framework level.
AMD-Internal: [CPUPL-3633]
Change-Id: I555aea519d7434d43c541bb0f661f81105135b98
- Symbols for gemm_pack_get_size were not being exported properly when
BLIS was built as a shared library.
- Correctly assigned the BLIS_EXPORT_BLAS macro to ?gemm_pack_get_size_
function declaration.
- Added missing gemm_pack and gemm_pack_get_size macros to
bli_macro_defs.h file.
- Removed an unnecessary BLIS_EXPORT_BLAS macro from dgemm_compute
function definition.
- Updated bli_util_api_wrap with no underscore API wrappers for pack and
compute set of BLAS Extension APIs:
1. ?gemm_pack_get_size
2. ?gemm_pack
3. ?gemm_compute
AMD-Internal: [CPUPL-4083]
Change-Id: I78cd7642c2fcbfdf02676e654a377ad2aa5295c1
- Enabled the vectorized AVX-2 code-path for SNRM2_. The
framework queries the architecture ID and calls the
vectorized kernel based on the architecture support.
- In case of not having the architecture support, we use
the default path based on the sumsqv method.
AMD-Internal: [CPUPL-3277]
Change-Id: Ic60c0782dec0b7eb09fac21818eb625e57b1d14f
- Updated the bli_dnormfv_unb_var1( ... ) and
bli_znormfv_unb_var1( ... ) functions to support
multithreaded calls to the respective computational
kernels, if and when the OpenMP support is enabled.
- Added the logic to distribute the job among the threads such
that only one thread has to deal with the fringe case (if required).
The remaining threads will execute only the AVX-2 code section
of the computational kernel.
- Added reduction logic post parallel region, to handle overflow
and/or underflow conditions as per the mandate. The reduction
for both the APIs involves calling the vectorized kernel of
the dnormfv operation.
- Added changes to the kernel to have the scaling factors and
thresholds prebroadcasted onto the registers, instead of
broadcasting every time on a need basis.
- Non-unit stride cases are packed to be redirected to the
vectorized implementation. In case the packing fails, the
input is handled by the fringe case loop in the kernel.
- Added the SSE implementation in bli_dnorm2fv_unb_var1_avx2( ... )
and bli_dznorm2fv_unb_var1_avx2( ... ) kernels, to handle fringe
cases of size = 2, and of size = 1 or non-unit strides, respectively.
AMD-Internal: [CPUPL-3916][CPUPL-3633]
Change-Id: Ib9131568d4c048b7e5f2b82526145622a5e8f93d
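The work distribution described in this commit can be sketched as follows. This is a minimal illustration, not the BLIS code: the helper name partition_range, the block size passed in by the caller, and the choice of handing the fringe to the last thread are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a BLIS function): split n elements among nt
   threads so that every thread receives a whole number of `block`-sized
   chunks, and only the last thread also receives the leftover (fringe)
   elements. */
void partition_range(size_t n, size_t nt, size_t block, size_t tid,
                     size_t *start, size_t *len)
{
    assert(nt > 0 && block > 0 && tid < nt);

    size_t nblocks = n / block;     /* full vector-width blocks            */
    size_t fringe  = n % block;     /* elements that do not fill a block   */
    size_t base    = nblocks / nt;  /* blocks every thread receives        */
    size_t extra   = nblocks % nt;  /* first `extra` threads get one more  */

    size_t my_blocks   = base + (tid < extra ? 1 : 0);
    size_t prev_blocks = tid * base + (tid < extra ? tid : extra);

    *start = prev_blocks * block;
    *len   = my_blocks * block;
    if (tid == nt - 1)
        *len += fringe;             /* only one thread sees the fringe     */
}
```

With n = 90, nt = 4 and block = 8, threads 0 to 2 each get 24 elements and thread 3 gets 16 + 2 = 18 elements starting at offset 72.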
- The bli_snormfv_unb_var1( ... ) function returns early in
case of n = 1, and uses the blis macro bli_fabs( ... ) to
set the norm to the absolute value of the element.
- This macro inverts the sign bit even if the element is 0.0.
A check is added to re-invert the sign bit in this case, so
that the norm is set to 0.0 instead of -0.0.
- Added the same early exit condition on bli_dnormfv_unb_var1( ... )
when n = 1.
AMD-Internal: [CPUPL-3923]
Change-Id: If7f5ae41d2acfe89b505549d28215dde319d8c33
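The signed-zero issue can be reproduced with a small sketch. The macro demo_fabs below is a hypothetical stand-in that is assumed to behave like the macro described in this commit (it negates every non-positive input); it is not the actual definition of bli_fabs, and norm_n1 is an illustrative name.

```c
#include <math.h>

/* Hypothetical stand-in for an fabs-style macro that negates every
   non-positive input, including +0.0 (which then becomes -0.0). */
#define demo_fabs(a) ((a) <= 0.0f ? -(a) : (a))

/* Norm of a one-element vector: |x0|, with the -0.0 case corrected. */
float norm_n1(float x0)
{
    float norm = demo_fabs(x0);
    if (norm == 0.0f)   /* true for both +0.0 and -0.0 */
        norm = 0.0f;    /* force positive zero         */
    return norm;
}
```

Here demo_fabs(0.0f) has its sign bit set, while norm_n1(0.0f) returns a positive zero.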
* commit 'b683d01b':
Use extra #undef when including ba/ex API headers.
Minor preprocessor/header cleanup.
Fixed typo in cpp guard in bli_util_ft.h.
Defined eqsc, eqv, eqm to test object equality.
Defined setijv, getijv to set/get vector elements.
Minor API breakage in bli_pack API.
Add err_t* "return" parameter to malloc functions.
Always stay initialized after BLAS compat calls.
Renamed membrk files/vars/functions to pba.
Switch allocator mutexes to static initialization.
AMD-Internal: [CPUPL-2698]
Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df
Some text files were missing a newline at the end of the file.
One has been added.
Also correct file format of windows/tests/inputs.yaml, which
was missed in commit 0f0277e104
AMD-Internal: [CPUPL-2870]
Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549
Thanks to Moore, Branden <Branden.Moore@amd.com> for identifying the
race condition and suggesting the changes to fix the same
Existing Design:
- AOCL progress callback pointer is a global pointer which is shared
across all threads
Existing Design challenges:
- The callback function cannot safely disable the progress mechanism,
as another thread may have already checked to see if the function
pointer is set, and then re-reads the pointer upon invocation of
the callback. If one thread sets the callback to NULL in the meantime,
the other thread will attempt to call the null pointer as a
function pointer, leading to a segfault.
New Design:
- Each thread maintains a local copy of the progress pointer.
AMD-Internal: [SWLCSG-1971]
Change-Id: I282989805a4a2a8a759a7373b645f3569bf42ed4
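The idea behind the new design can be sketched as below. All names (progress_cb_t, g_progress_cb, progress_begin, progress_report, stop_after_first) are hypothetical and the callback signature is simplified; this is not the AOCL API. A production version would also need a synchronized (for example, atomic) read of the shared slot, which this sketch omits.

```c
#include <stddef.h>

/* Hypothetical callback type, for illustration only. */
typedef int (*progress_cb_t)(const char *api, size_t progress);

/* Shared registration slot, written by the user-facing setter. */
progress_cb_t g_progress_cb = NULL;

/* Per-thread copy, refreshed when an operation begins on this thread. */
static _Thread_local progress_cb_t tl_progress_cb = NULL;

void set_progress_cb(progress_cb_t cb) { g_progress_cb = cb; }

/* Called once per thread at the start of an operation. */
void progress_begin(void) { tl_progress_cb = g_progress_cb; }

/* Called periodically from the compute loop. Only the thread-local copy
   is tested and invoked, so clearing the shared slot from inside a
   callback cannot turn this call into a NULL function call. */
int progress_report(const char *api, size_t progress)
{
    progress_cb_t cb = tl_progress_cb;
    return cb ? cb(api, progress) : 0;
}

/* Example callback that unregisters itself on first use. */
int demo_calls = 0;
int stop_after_first(const char *api, size_t progress)
{
    (void)api;
    (void)progress;
    ++demo_calls;
    set_progress_cb(NULL);
    return 0;
}
```

In this sketch the running operation keeps reporting through its own copy, and the cleared shared slot only takes effect when the next operation calls progress_begin().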
- In gemmt and normf, #ifdef BLIS_KERNELS_* is added
to make sure only compiled kernels are used.
- In bla_copy and bla_swap, the missing '\' is added.
AMD-Internal: [CPUPL-2870]
Change-Id: I83452dff761f60db6957f557321ce210ab72c037
Improvements to BLIS cpuid functionality:
- Tidy names of avx support test functions, especially rename
bli_cpuid_is_avx_supported() to bli_cpuid_is_avx2fma3_supported()
to more accurately describe what it tests.
- Fix bug in frame/base/bli_check.c related to changes in commit
6861fcae91
AMD-Internal: [CPUPL-3031]
Change-Id: Iacd8fb0ffbd45288e536fc6314660709055ea2d5
Details:
- To be BLAS compliant, if increment is zero then iterate through the first element n times.
- For n<=0, the correct result (0) is returned so we remove this extra check. This is checked on BLIS-typed interface level.
AMD-Internal: [SWLCSG-1900]
Change-Id: I098bb9560a790050018bc8d8c63b06bfbcc1aebd
- Moved *_blis_impl function declaration outside the BLIS_ENABLE_BLAS
guard.
- Changed Makefile to continue to compile bla_ files to get
*_blis_impl interfaces.
- Modify CBLAS headers, bli_macro_defs.h and bli_util_api_wrap.{c,h}
to add BLIS_ENABLE_CBLAS guards.
- Comment out BLIS_ENABLE_BLAS guards in various headers and utility
functions.
- Define BLIS Fortran-style functions lsame_blis_impl and
xerbla_blis_impl. New macros PASTE_LSAME and PASTE_XERBLA are
used in bla_*_check headers and some other places to select
whether to call lsame and xerbla, or the _blis_impl versions.
- Defined various other missing _blis_impl functions.
- In bli_util_api_wrap.c, only define any functions if
BLIS_ENABLE_BLAS is defined, and only define the subroutine
versions of functions like dot, nrm2, etc if BLIS_ENABLE_CBLAS
is defined.
- BLAS layer is needed if CBLAS layer is enabled. Changed header
files build/bli_config.h.in and bli_blas.h, and configure
program to help ensure consistency in generated blis.h header
and configure output.
Undefining BLIS_ENABLE_BLAS_DEFS appears to be broken in UTA BLIS
too, thus BLIS_ENABLE_BLAS_DEFS is currently permanently defined.
AMD-Internal: [CPUPL-3015]
Change-Id: I7c0fe07db85781db46f2c690e174451860b37635
Details:
- Changed sumsqv implementation as follows:
- If there is a NaN (either real or imaginary), then return a sum of
NaN and unit scale.
- Else, if there is an Inf (either real or imaginary), then return a
sum of +Inf and unit scale.
- Otherwise behave as normal.
(cherry picked from commit b861c71b50)
AMD-Internal: [SWLCSG-1900]
Change-Id: Ic7ba9cad1fbaf11823b9ba96e72a4ddd973db5b6
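The special-value handling listed above can be sketched for the real case as follows. The function name sumsq_sketch is hypothetical, the running update is the classic scaled sum-of-squares recurrence rather than the BLIS source, and the complex case (checking real and imaginary parts) is left out.

```c
#include <math.h>

/* Hypothetical sketch: on return, (*scale)^2 * (*sumsq) is the sum of
   squares of x, unless a NaN or Inf was seen, in which case the scale
   is 1 and the sum is NaN or +Inf respectively (NaN takes priority). */
void sumsq_sketch(int n, const double *x, double *scale, double *sumsq)
{
    int has_nan = 0, has_inf = 0;
    double sc = 0.0, ssq = 1.0;

    for (int i = 0; i < n; ++i) {
        double a = fabs(x[i]);
        if (isnan(a)) { has_nan = 1; continue; }
        if (isinf(a)) { has_inf = 1; continue; }
        if (a == 0.0) continue;
        if (sc < a) {                 /* new largest magnitude: rescale */
            ssq = 1.0 + ssq * (sc / a) * (sc / a);
            sc  = a;
        } else {
            ssq += (a / sc) * (a / sc);
        }
    }

    if (has_nan)      { *scale = 1.0; *sumsq = NAN; }
    else if (has_inf) { *scale = 1.0; *sumsq = INFINITY; }
    else              { *scale = sc;  *sumsq = ssq; }
}
```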
Corrections for some occurrences of:
- Compiler warnings about initialization of float from double
- Spelling mistakes in comments
- Incorrect indentation of code and comments
AMD-Internal: [CPUPL-2870]
Change-Id: Icb68c789687bd0684844331d43071bfffecac9fc
- For the cases where AVX2 is available, an optimized function is called,
based on Blue's algorithm. The fallback method based on sumsqv is used
otherwise.
- Scaling is used to avoid overflow and underflow.
- Works correctly for negative increments.
AMD-Internal: [SWLCSG-1080]
Change-Id: I6bf2f42652ba6b8a8631a0a9e6f6297d5b3ea5d9
- For the cases where AVX2 is available, an optimized function is called,
based on Blue's algorithm. The fallback method based on sumsqv is used
otherwise.
- Scaling is used to avoid overflow and underflow.
- Works correctly for negative increments.
- Cleaned up some white space in the AVX2 implementation for DNRM2.
AMD-Internal: [CPUPL-2551]
Change-Id: I0875234ea735540307168fe7efc3f10fe6c40ffc
- For the cases where AVX2 is available, an optimized function is called,
based on Blue's algorithm. The fallback method based on sumsqv is used
otherwise.
- Scaling is used to avoid overflow and underflow.
- Works correctly for negative increments.
AMD-Internal: [CPUPL-2551]
Change-Id: I5d8976b29b5af463a8981061b2be907ea647123c
- In BLIS, the CBLAS interface is implemented as a wrapper around
the BLAS interface. For example, the CBLAS API ‘cblas_dscal’
internally invokes the BLAS API ‘dscal_’.
- This coupling between the CBLAS and BLAS interfaces prevents them
from being overridden individually by the application or
other libraries.
- This change separates the CBLAS and BLAS implementations by adding
an additional level of abstraction. The implementation of the
API is moved to a new function which is invoked directly from
the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]
Change-Id: I0e80071398af29c9313296d2a92e61e3897ac28e
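The layering introduced by this change can be sketched as follows. The demo_* names are hypothetical and chosen to avoid clashing with real library symbols; the sketch assumes a positive increment and is not the BLIS source.

```c
/* Hypothetical names throughout: the demo_* functions are not BLIS
   symbols. */

/* Shared implementation layer: the only place the work is done. */
void demo_dscal_impl(int n, double alpha, double *x, int incx)
{
    for (int i = 0; i < n; ++i)
        x[i * incx] *= alpha;   /* assumes incx > 0 */
}

/* Fortran-style BLAS wrapper: arguments are passed by reference. */
void demo_dscal_(const int *n, const double *alpha, double *x,
                 const int *incx)
{
    demo_dscal_impl(*n, *alpha, x, *incx);
}

/* CBLAS-style wrapper: arguments are passed by value. It calls the
   implementation directly instead of going through the BLAS symbol,
   so overriding demo_dscal_ elsewhere does not affect it. */
void demo_cblas_dscal(int n, double alpha, double *x, int incx)
{
    demo_dscal_impl(n, alpha, x, incx);
}
```

Because neither wrapper calls the other, an application or another library can replace one of the two public symbols without changing the behaviour of the other.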
- In BLIS, the CBLAS interface is implemented as a wrapper around
the BLAS interface. For example, the CBLAS API ‘cblas_dgemv’
internally invokes the BLAS API ‘dgemv_’.
- This coupling between the CBLAS and BLAS interfaces prevents them
from being overridden individually by the application or
other libraries.
- This change separates the CBLAS and BLAS implementations by adding
an additional level of abstraction. The implementation of the
API is moved to a new function which is invoked directly from
the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]
Change-Id: Ie7cbbac86bbfa1075a5064b31b365e911f67786c
- In BLIS, the CBLAS interface is implemented as a wrapper around
the BLAS interface. For example, the CBLAS API ‘cblas_dgemm’
internally invokes the BLAS API ‘dgemm_’.
- This coupling between the CBLAS and BLAS interfaces prevents them
from being overridden individually by the application or
other libraries.
- This change separates the CBLAS and BLAS implementations by adding
an additional level of abstraction. The implementation of the
API is moved to a new function which is invoked directly from
the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]
Change-Id: Id9e307154342d2c17b0ac6db580c36f1a9ee6409
- BLIS uses callback function to report the progress of the
operation. The callback is implemented in the user application
and is invoked by BLIS.
- Updated callback function prototype to make all arguments const.
This will ensure that any attempt to write using callback’s
argument is prevented at the compile time itself.
AMD-Internal: [CPUPL-2504]
Change-Id: I8ceb671242365d2a9155b485301cd8c75043e667
- Removed some additional compiler warnings reported by GCC 12.1
- Fixed a couple of typos in comments
- frame/3/bli_l3_sup.c: routines were returning before final call
to AOCL_DTL_TRACE_EXIT
- frame/2/gemv/bli_gemv_unf_var1_amd.c: bli_multi_sgemv_4x2 is
only defined in header file if BLIS_ENABLE_OPENMP is defined
AMD-Internal: [CPUPL-2460]
Change-Id: I2eacd5687f2548d8f40c24bd1b930859eefbbcde
- In BLIS, the CBLAS interface is implemented as a wrapper around
the BLAS interface. For example, the CBLAS API ‘cblas_dgemm’
internally invokes the BLAS API ‘dgemm_’.
- If the end user wants to use different libraries for CBLAS
and BLAS, the current implementation of BLIS doesn’t allow it.
- This change separates the CBLAS and BLAS implementations by adding
an additional level of abstraction. The implementation of the
API is moved to a new function which is invoked directly from
the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]
Change-Id: I8d81072aaca739f175318b82f6510d386103c24b
- In BLIS, the CBLAS interface is implemented as a wrapper around
the BLAS interface. For example, the CBLAS API ‘cblas_dgemm’
internally invokes the BLAS API ‘dgemm_’.
- If the end user wants to use different libraries for CBLAS
and BLAS, the current implementation of BLIS doesn’t allow it and may
result in recursion.
- This change separates the CBLAS and BLAS implementations by adding
an additional level of abstraction. The implementation of the
API is moved to a new function which is invoked directly from
the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]
Change-Id: I0f4521e70a02f6132bdadbd4c07715c9d52fe62a
- In BLIS, the CBLAS interface is implemented as a wrapper around
the BLAS interface. For example, the CBLAS API ‘cblas_dgemm’
internally invokes the BLAS API ‘dgemm_’.
- If the end user wants to use different libraries for CBLAS
and BLAS, the current implementation of BLIS doesn’t allow it and may
result in recursion.
- This change separates the CBLAS and BLAS implementations by adding
an additional level of abstraction. The implementation of the
API is moved to a new function which is invoked directly from
the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]
Change-Id: I8380b6468683028035f2aece48916939e0fede8a
- In BLIS, the CBLAS interface is implemented as a wrapper around
the BLAS interface. For example, the CBLAS API ‘cblas_dgemm’
internally invokes the BLAS API ‘dgemm_’.
- If the end user wants to use different libraries for CBLAS
and BLAS, the current implementation of BLIS doesn’t allow it and
may result in recursion.
- This change separates the CBLAS and BLAS implementations by adding
an additional level of abstraction. The implementation of the
API is moved to a new function which is invoked directly from
the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]
Change-Id: I6218a3e81060fc8045f4de0ace87f708465dfae5
-- AOCL libraries are used for lengthy computations which can go
on for hours or days. Once the operation is started, the user
doesn’t get any update on the current state of the computation.
This (AOCL progress) feature enables user to receive a periodic
update from the libraries.
-- User registers a callback with the library if it is interested
in receiving the periodic update.
-- The library invokes this callback periodically with information
about current state of the operation.
-- The update frequency is statically set in the code, it can be
modified as needed if the library is built from source.
-- This feature is supported for GEMM and TRSM operations.
-- Added example for GEMM and TRSM.
-- Cleaned up and reformatted test_gemm.c and test_trsm.c to
remove warnings and make indentation consistent across the
files.
AMD-Internal: [CPUPL-2082]
Change-Id: I2aacdd8fb76f52e19e3850ee0295df49a8b7a90e
details: Changes made for the 4.0 branch to enable wrapper code by default
and remove the ENABLE_API_WRAPPER macro.
Change-Id: I5c9ede7ae959d811bc009073a266e66cbf07ef1a
1. Added a new kernel, bli_dnorm2fv_unb_var1, to compute the
norm with a dot operation.
2. Added vectorization to compute the squares of the elements of
vector X in blocks of 32 doubles.
3. Defined a new macro BLIS_ENABLE_DNRM2_FAST in the config header
to compute nrm2 using the new kernel.
4. The dot-based kernel may have accuracy issues. We can switch to
the traditional implementation by disabling the macro
BLIS_ENABLE_DNRM2_FAST when computing the L2-norm
of vector X.
AMD-Internal: [CPUPL-1757]
Change-Id: I1adcaf1b3b4e33837758593c998c25705ff0fe11
Details:
- Eliminated the IR loop in ref_var2m functions.
- Handled the rectangular and triangular portions of C matrix
separately.
- Added a condition to check and eliminate zero regions inside IC loop.
- Modified the KC selection logic to choose the optimal KC in SUP.
- Updated thresholds to choose between SUP and native.
Change-Id: I21908eaa6bc3a8f37bdea29f7bfca7e6fcfee724
details: The wrapper code is enabled when the CMake option
ENABLE_WRAPPER is selected. This commit also fixes the ScaLAPACK build
error on Windows.
AMD-Internal: [CPUPL-1848]
Change-Id: I3d687cbc00e7603fdfb45937a00daf86bd07878e
Details:
BLIS currently supports BLAS and CBLAS interfaces with lowercase names.
With this commit, we also support uppercase symbol names with and
without a trailing underscore, and lowercase symbol names without a
trailing underscore.
Change-Id: Ibb06121821ab937b25d492409625916f542b2135
Details:
- Inserted a "#include bli_xapi_undef.h" after each usage of the basic
and expert API macro setup headers: bli_oapi_ba.h, bli_oapi_ex.h,
bli_tapi_ba.h, and bli_tapi_ex.h. This is functionally equivalent to
the previous status quo, in which each header made minimal #undef
prior to its own definitions and then a single instance of
"#include bli_xapi_undef.h" cleaned up any remaining macro defs after
all other headers were used. This commit will guarantee that macro
defs from the setup of one header (say, bli_oapi_ex.h) don't "infect"
the definitions made in a subsequent header. As with this previous
commit, this change does not fix any issue but rather attempts to
avoid creating orphaned macro definitions that are only needed within
a very limited scope.
- Removed minimal #undef from bli_?api_[ba|ex].h.
- Removed old commented-out lines from bli_?api_[ba|ex].h.