Details:
- Now AOCL BLIS uses AX512 - 32x6 DGEMM kernel for native code path.
Thanks to Moore, Branden <Branden.Moore@amd.com> for suggesting and
implementing these optimizations.
- In the initial version of 32x6 DGEMM kernel, to broadcast elements of B packed
we perform load into xmm (2 elements), broadcast into zmm from xmmm and then to get the
next element, we do vpermilpd(xmm). This logic is replaced with direct broadcast from
memory, since the elements of Bpack are stored contiguously, the first broadcast fetches
the cacheline and then subsequent broadcasts happen faster. We use two registers for broadcast
and interleave broadcast operation with FMAs to hide any memory latencies.
- Native dTRSM uses 16x14 dgemm - therefore we need to override the default blkszs (MR,NR,..)
when executing trsm. we call bli_zen4_override_trsm_blkszs(cntx_local) on a local cntx_t object
for double data-type as well in the function bli_trsm_front(), bli_trsm_xx_ker_var2, xx = {ll,lu,rl,ru}.
Renamed "BLIS_GEMM_AVX2_UKR" to "BLIS_GEMM_FOR_TRSM_UKR" and in the bli_cntx_init_zen4() we replaced
dgemm kernel for TRSM with 16x14 dgemm kernel.
- New packm kernels - 16xk, 24xk and 32xk are added.
- New 32xk packm reference kernel is added in bli_packm_cxk_ref.c and it is
enabled for zen4 config (bli_dpackm_32xk_zen4_ref() )
- Copyright year updated for modified files.
- cleaned up code for "zen" config - removed unused packm kernels declaration in kernels/zen/bli_kernels.h
- [SWLCSG-1374], [CPUPL-2918]
Change-Id: I576282382504b72072a6db068eabd164c8943627
-Implemented (r)ow preferential (d)ot product milli-kernels
(m and n variants) for dcomplex datatype along SUP path.
-These computational kernels extend the support for handling RRC and
CRC storage schemes along the SUP path. In case of BLAS api call,
it corresponds to the input cases with transa equal to T and
transb equal to N.
-In case of the B matrix being packed(conditionally), the inputs are
redirected to the existing (r)ow preferential (v)ector load optimized
kernels due to better performance.
-Added macro for vhsubpd assembly instruction, to support the arithmetic
for complex datatype in its interleaved storage.
AMD-Internal: [CPUPL-2593]
Change-Id: If90834e55e9e31aa87d3d5b711efad9ef2458da8
- Added DTRSM AVX512 kernels for lower and upper variants in the native path.
- Changes in framework are made to accommodate these kernels.
AMD-Internal: [CPUPL-2588]
Change-Id: I1f74273ef2389018343c0645870290373ce25efe
- Implemented optimized intrinsic kernel for zdscalv for the cases where AVX2 is supported.
- Also added multithreaded support for the same.
- The optimal number of threads is being calculated on the basis of input size.
AMD-Internal: [CPUPL-2602]
Change-Id: I4d05c3b1cc365a7770703286a89c6dce3875c067
- Removed prototypes for float(sgemm3m) and double(dgemm3m) types,
as BLIS implements this API only for scomplex(cgemm3m) and
dcomplex(zgemm3m)
AMD-Internal: [SWLCSG-1477]
Change-Id: Ifad86a74b4c939ed240743894b85bb4fa5e6d754
-In BLIS, the CBLAS interface is implemented as a wrapper around
the BLAS interface. For example the CBLAS API ‘cblas_dgemv’
internally invokes the BLAS API ‘dgemv_’.
-This coupling between CBLAS and BLAS interface prevents the end
user from overriding them individually by the application or
other libraries.
-This change separates the CBLAS and BLAS implementation by adding
an additional level of abstraction. The implementation of the
API is moved to the new function which is invoked directly from
the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]
Change-Id: Ie7cbbac86bbfa1075a5064b31b365e911f67786c
-In BLIS, the CBLAS interface is implemented as a wrapper around
the BLAS interface. For example the CBLAS API ‘cblas_dgemm’
internally invokes the BLAS API ‘dgemm_’.
-This coupling between CBLAS and BLAS interface prevents the end
user from overriding them individually by the application or
other libraries.
-This change separates the CBLAS and BLAS implementation by adding
an additional level of abstraction. The implementation of the
API is moved to the new function which is invoked directly from
the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]
Change-Id: Id9e307154342d2c17b0ac6db580c36f1a9ee6409
- In BLIS the cblas interface is implemented as a wrapper around
the blas interface. For example the CBLAS api ‘cblas_dgemm’
internally invokes BLAS API ‘dgemm_’.
- If the end user wants to use the different libraries for CBLAS
and BLAS, current implantation of BLIS doesn’t allow it.
- This change separates the CBLAS and BLAS implantation by adding
an additional level of abstraction. The implementation of the
API is moved to the new function which is invoked directly from
the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]
Change-Id: I8d81072aaca739f175318b82f6510d386103c24b
- In BLIS the cblas interface is implemented as a wrapper around
the blas interface. For example the CBLAS api ‘cblas_dgemm’
internally invokes BLAS API ‘dgemm_’.
- If the end user wants to use the different libraries for CBLAS
and BLAS, current implantation of BLIS doesn’t allow it and may
result in recursion
- This change separates the CBLAS and BLAS implantation by adding
an additional level of abstraction. The implementation of the
API is moved to the new function which is invoked directly from
the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]
Change-Id: I8380b6468683028035f2aece48916939e0fede8a
->In BLIS the cblas interface is implemented as a wrapper around
the blas interface. For example the CBLAS api ‘cblas_dgemm’
internally invokes BLAS API ‘dgemm_’.
->If the end user wants to use the different libraries for CBLAS
and BLAS, current implantation of BLIS doesn’t allow it and
may result in recursion
->This change separate the CBLAS and BLAS implantation by adding
and additional level of abstraction. The implementation of the
API is moved to the new function which is invoked directly from
the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]
Change-Id: I6218a3e81060fc8045f4de0ace87f708465dfae5
Make BLIS_ARCH_TYPE=0 be an error, so that incorrect meaningful names
will get an error rather than "skx" code path. BLIS_ARCH_TYPE=1 is
now "generic", so that it should be constant as new code paths are
added. Thus all other code path enum values have increased by 2.
Also added new options to BLIS configure program to allow:
1. BLIS_ARCH_TYPE functionality to be disabled, e.g.:
./configure --disable-blis-arch-type amdzen
2. Renaming the environment variable tested from "BLIS_ARCH_TYPE" to a
specified value, e.g.:
./configure --rename-blis-arch-type=MY_NAME_FOR_ARCH_TYPE amdzen
On Windows, these can be enabled with e.g.:
cmake ... -DDISABLE_BLIS_ARCH_TYPE=ON
or
cmake ... -DRENAME_BLIS_ARCH_TYPE=MY_NAME_FOR_ARCH_TYPE
This implements changes 2 and 3 in the Jira ticket below.
AMD-Internal: [CPUPL-2235]
Change-Id: Ie42906bd909f9d83f00a90c5bef9c5bf3ef5adb4
Details:
1. Runtime Thread Control Feature is enhanced to create a provision
for the application to allocate a different number of threads to
BLIS from the number of threads application is using for itself.
2. In the previous implementation, if application sets BLIS_NUM_THREADS
with a valid value, BLIS internally calls omp_set_num_threads() API with
same value. Due to this, application could not differentiate between
the number of threads used in BLIS library and the application.
3. With the current solution, if Application wants to allocate
different number of threads for BLIS API and application, Application
can choose either BLIS_NUM_THREADS environment variable or
bli_thread_set_num_threads(nt) API for BLIS,
and OpenMP APIs or environment variables for itself, respectively.
4. If BLIS_NUM_THREADS is set with a valid value, same value
will be used in the subsequent parallel regions unless
bli_thread_set_num_threads() API is used by the Application
to modify the desired number of threads during BLIS API execution.
5. Once BLIS_NUM_THREADS environment variable or
bli_thread_set_num_threads(nt) API is used by the application,
BLIS module would always give precedence to these values. BLIS API would
not consider the values set using OpenMP API omp_set_num_threads(nt) API
or OMP_NUM_THREADS environment variable.
6. If BLIS_NUM_THREADS is not set, then if Application is multithreaded and
issued omp_set_num_threads(nt) with desired number of threads,
omp_get_max_threads() API will fetch the number of threads set earlier.
7. If BLIS_NUM_THREADS is not set, omp_set_num_threads(nt) is not called
by the application, but only OMP_NUM_THREADS is set,
omp_get_max_threads() API will fetch the value of OMP_NUM_THREADS.
8. If both environment variables are not set, or if they are set with
invalid values, and omp_set_num_threads(nt) is not issued by
application, omp_get_max_threads() API will return the number of the
cores in the current context.
9. BLIS will initialize rntm->num_threads with the same value.
However if omp_set_nested is false - BLIS APIs called from parallel
threads will run in sequential. But if nested parallelism is enabled
Then each application will launch MT BLIS.
10. Order of precedence used for number of threads:
0. value set using bli_thread_set_num_threads(nt) by the application
1. valid value set for BLIS_NUM_THREADS environment variable
2. omp_set_num_threads(nt) issued by the application
3. valid value set for OMP_NUM_THREADS environment variable
4. Number of cores
11. If nt is not a valid value for omp_set_num_threads(nt) API,
number of threads would be set to 1.
omp_get_max_threads() API will return 1.
12. OMP_NUM_THREADS env. variable is applicable only when OpenMP is enabled.
AMD-Internal: [CPUPL-2342]
Change-Id: I2041ac1d824f0b57a23a2a69abd6017c800f21b6
Enable meaningful names as options for BLIS_ARCH_TYPE environment
variable. For example,
BLIS_ARCH_TYPE=zen4 or BLIS_ARCH_TYPE='ZEN4' or BLIS_ARCH_TYPE=6
will select the same code path (in this release). The meaningful
names are not case sensitive.
This implements change 1 in the Jira ticket below.
Following review comments:
1. Use names from arch_t enum in function bli_env_get_var_arch_type()
rather than directly using numbers.
2. AMD copyrights updated.
AMD-Internal: [CPUPL-2235]
Change-Id: I8cfd43d34765d5e8c7e35680d18825d9934753ad
- Enabled AVX2 TRSM + GEMM kernel path, when GEMM is called
from TRSM context it will invoke AVX2 GEMM kernels instead
of the default AVX-512 GEMM kernels.
- The default context has the block sizes for AVX512 GEMM
kernels, however, TRSM uses AVX2 GEMM kernels and they
need different block sizes.
- Added new API bli_zen4_override_trsm_blkszs(). It overrides
default block sizes in context with block sizes needed for
AVX2 GEMM kernels.
- Added new API bli_zen4_restore_default_blkszs(). It restores
The block sizes to there default values (as needed by default
AVX512 GEMM kernels).
- Updated bli_trsm_front() to override the block sizes in the
context needed by TRSM + AVX2 GEMM kernels and restore them
to the default values at the end of this function. It is done
in bli_trsm_front() so that we override the context before
creating different threads.
AMD-Internal: [CPUPL-2225]
Change-Id: Ie92d0fc40f94a32dfb865fe3771dc14ed7884c55
- Completed zen4 configuration support on windows
- Enabled AVX512 kernels for AMAXV
- Added zen4 configuration in amdzen for windows
- Moved all zen4 kernels inside kernels/zen4 folder
AMD-Internal: [CPUPL-2108]
Change-Id: I9d2336998bbcdb8e2c4ca474977b5939bfa578ba
-- AOCL libraries are used for lengthy computations which can go
on for hours or days, once the operation is started, the user
doesn’t get any update on current state of the computation.
This (AOCL progress) feature enables user to receive a periodic
update from the libraries.
-- User registers a callback with the library if it is interested
in receiving the periodic update.
-- The library invokes this callback periodically with information
about current state of the operation.
-- The update frequency is statically set in the code, it can be
modified as needed if the library is built from source.
-- These feature is supported for GEMM and TRSM operations.
-- Added example for GEMM and TRSM.
-- Cleaned up and reformatted test_gemm.c and test_trsm.c to
remove warnings and making indentation consistent across the
file.
AMD-Internal: [CPUPL-2082]
Change-Id: I2aacdd8fb76f52e19e3850ee0295df49a8b7a90e
Details:
- Enable ztrsm small implementation
- For small sizes, Right Variants and Left Unit Diag
Variants are using ztrsm_small implementations.
- Optimization of Left Non-Unit Diagonal Variants,
Work In Progress
AMD-Internal: [SWLCSG-1194]
Change-Id: Ib3cce6e2e4ac0817ccd4dff4bb0fa4a23e231ca4
Details:
- Added a new sandbox called 'gemmlike', which implements sequential and
multithreaded gemm in the style of gemmsup but also unconditionally
employs packing. The purpose of this sandbox is to
(1) avoid select abstractions, such as objects and control trees, in
order to allow readers to better understand how a real-world
implementation of high-performance gemm can be constructed;
(2) provide a starting point for expert users who wish to build
something that is gemm-like without "reinventing the wheel."
Thanks to Jeff Diamond, Tze Meng Low, Nicholai Tukanov, and Devangi
Parikh for requesting and inspiring this work.
- The functions defined in this sandbox currently use the "bls_" prefix
instead of "bli_" in order to avoid any symbol collisions in the main
library.
- The sandbox contains two variants, each of which implements gemm via a
block-panel algorithm. The only difference between the two is that
variant 1 calls the microkernel directly while variant 2 calls the
microkernel indirectly, via a function wrapper, which allows the edge
case handling to be abstracted away from the classic five loops.
- This sandbox implementation utilizes the conventional gemm microkernel
(not the skinny/unpacked gemmsup kernels).
- Updated some typos in the comments of a few files in the main
framework.
Change-Id: Ifc3c50e9fd0072aada38eace50c57552c88cc6cf
Details:
- Implemented a new feature called addons, which are similar to
sandboxes except that there is no requirement to define gemm or any
other particular operation.
- Updated configure to accept --enable-addon=<name> or -a <name> syntax
for requesting an addon be included within a BLIS build. configure now
outputs the list of enabled addons into config.mk. It also outputs the
corresponding #include directives for the addons' headers to a new
companion to the bli_config.h header file named bli_addon.h. Because
addons may wish to make use of existing BLIS types within their own
definitions, the addons' headers must be included sometime after that
of bli_config.h (which currently is #included before bli_type_defs.h).
This is why the #include directives needed to go into a new top-level
header file rather than the existing bli_config.h file.
- Added a markdown document, docs/Addons.md, to explain addons, how to
build with them, and what assumptions their authors should keep in
mind as they create them.
- Added a gemmlike-like implementation of sandwich gemm called 'gemmd'
as an addon in addon/gemmd. The code uses a 'bao_' prefix for local
functions, including the user-level object and typed APIs.
- Updated .gitignore so that git ignores bli_addon.h files.
Change-Id: Ie7efdea366481ce25075cb2459bdbcfd52309717
- Added configuration option for zen4 architecture
- Added auto-detection of zen4 architecture
- Added zen4 configuration for all checks related
to AMD specific optimizations
AMD-Internal: [CPUPL-1937]
Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a
-- Reverted changes made to include lp/ilp info in binary name
This reverts commit c5e6f885f0.
-- Included BLAS int size in 'make showconfig'
-- Renamed amdepyc configuration to amdzen
Change-Id: Ie87ec1c03e105f606aef1eac397ba0d8338906a6
details: Wrapper code will be enabled when selecting the cmake option
ENABLE_WRAPPER and also this commit will fixing the ScaLAPACK build
error on windows.
AMD-Internal: [CPUPL-1848]
Change-Id: I3d687cbc00e7603fdfb45937a00daf86bd07878e
Details:
BLIS currently supports BLAS and CBLAS interfaces with lowercase.
With this commit - we also supports uppercase with and without
trailing underscore, lowercase without trailing underscore symbol
names.
Change-Id: Ibb06121821ab937b25d492409625916f542b2135
-- Created new configuration amdepyc to include fat binary which
includes zen, zen2, zen3 and generic architecture for fallback.
-- Updated amdepyc family makefiles to include macros needed
in amdepyc family binary. This file must include all macros,
compiler options to be used for non architecture specific code.
-- Added 'workaround' to exclude ZEN family specific code in some of
the framework files. There are still lot of places were ZEN family
specific code is added in framework files. They will be addressed
with proper design later.
- Moved definition of BLIS_CONFIG_EPYC from header files to
makefile so that it is enabled only for framework and kernels
-- Removed redundant flag AOCL_BLIS_ZEN, used BLIS_CONFIG_EPYC
wherever it was needed.
-- Removed un-used, obsolete macros, some of them may be needed for
debugging which can be added in the individual workspaces.
- BLIS_DEFAULT_MR_THREAD_MAX
- BLIS_DEFAULT_NR_THREAD_MAX
- BLIS_ENABLE_ZEN_BLOCK_SIZES
- BLIS_SMALL_MATRIX_THRES_TRSM
- BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES
- BLIS_ENABLE_SUP_MR_EXT
- BLIS_ENABLE_SUP_NR_EXT
-- Corrected implementation of exiting amd64_legacy configuration.
AMD-Internal: [CPUPL-1626, CPUPL-1628]
Change-Id: I46b0ab3ea3ac7d9ff737fef66c462e85601ee29c
Details:
- The field "trsm_blkszs" in cntx enables us to set and use blocksizes
that are specific to trsm only instead of using commom L3 blocksizes.
- While querying blocksizes for TRSM, Added a check to see if
trsm-specific blocksizes are set, if they are not set, we continue
trsm execution by using common Level-3 blocksizes.
- TRSM-specific blocksizes( if required ) needs to be set in
cntx_init function of respective configuration.
Change-Id: I06afe2b1d29004e428d2fb3bf2be2958cbbd1a97
Details:
- Adding threshold function pointers to cntx gives flexibility to choose
different threshold functions for different configurations.
- In case of fat binary where configuration is decided at run-time,
adding threshold functions under a macro enables these functions for
all the configs under a family. This can be avoided by adding function
pointers to cntx which can be queried from cntx during run-time
based on the config chosen.
Change-Id: Iaf7e69e45ae5bb60e4d0f75c7542a91e1609773f
1. Added support in cmake scripts for linking libomp for blis multithreading build.
2. Added ${CMAKE_CURRENT_SOURCE_DIR}/bli_axpyf_zen_int_6.c statement in blis\kernels\zen\1f cmake file to build newly added file.
3. Added the new macros in blis/frame/include/bli_macro_defs.h for ENABLE_NO_UNDERSCORE_API support for gemm_batch and axpby API's.
4. Modified the file open mode from binary to text mode in blis/testsuite/src/test_libblis.c file to avoid the line ending issue on different OS.
5. Added the definition for the macro BLIS_DISABLE_TRSM_PREINVERSION in main CmakeLists.txt file.
AMD Internal : [CPUPL-1630]
Change-Id: Iba1b7b6d014a4317de7cbaf42f812cad20111e4f
Details:
- Implemented zaxpyf kernel with fuse factor=4 for zgemv.
- Modified BLAS interface call for zgemv to reduce framework overhead.
- Directed gemv to dotv in the case where dimension of y vector is 1.
- when alpha = 0, gemv becomes scalv of Y with beta. Added code to
return early after scaling Y vector with beta.
AMD-Internal: [CPUPL-1402]
Change-Id: I2231285fe3060982d4434466346a040b7ab803fc
Details:
- Implemented axpyf kernel with fuse factor=4 for scomplex datatype.
- Modified BLAS interface call for cgemv to reduce framework overhead.
- Directed gemv to dotv in the case where dimension of y vector is 1.
- when alpha = 0, gemv becomes scalv of Y with beta. Added code to
return early after scaling Y vector with beta.
AMD-Internal: [CPUPL-1402]
Change-Id: Ibaab078008d76953332ba4da3515993578c0e586
Merge conflicts araised has been fixed while downstreaming BLIS code from master to milan-3.1 branch
Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond)
Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations.
Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations.
Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger to conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu)
Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu)
Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs.
Minor code consolidation in all level-3 _front() functions.
Reorganized Windows cpp branch of bli_pthreads.c.
Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS.
Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion.
Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv.
AMD-internal-[CPUPL-1523]
Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd
Details:
- Changed BLIS_NUM_ARCHS from a cpp macro definition to the last enum
value in the arch_t enum. This means that it no longer needs to get
updated manually whenever new subconfigurations are added to BLIS.
Also removed the explicit initial index assigment of 0 from the
first enum value, which was unnecessary due to how the C language
standard mandates indexing of enum values. Thanks to Devin Matthews
for originally submitting this as a PR in #446.
- Updated docs/ConfigurationHowTo.md to reflect the aforementioned
change.
1. CMake script changes for adding new files to the build.
2. Added Upper case support for couple of API's.
3. bool is not support in clang so defined it.
AMD Internal : [CPUPL-1422]
Change-Id: I4cac8fb8ef86cd6bacfd29e3b1a84c5da1310f61
Modified dgemm_ to able to call small_gemm 16x3 kernel.
small_gemm will be called if((m + n -k) < 2000 && (m + k-n) < 2000 && n + k-m < 2000) && n > 2.
small_gemm kernel - if m or n or k = 0 we return and this case will be handled by sup or native kernel.
[CPUPL - 1376]
Change-Id: I61c2b36ad0ae4fb3dd23bc37c2b6c78556b3105b