- Updated zen4 configuration to add -march=znver4 flag in the
compiler options if the gcc version is above or equal to 12
AMD-Internal: [CPUPL-1937]
Change-Id: Ic11470b92f71e49ee193a3a5406cf6045d66bd2f
- Enabled AVX2 TRSM + GEMM kernel path, when GEMM is called
from TRSM context it will invoke AVX2 GEMM kernels instead
of the default AVX-512 GEMM kernels.
- The default context has the block sizes for AVX512 GEMM
kernels, however, TRSM uses AVX2 GEMM kernels and they
need different block sizes.
- Added new API bli_zen4_override_trsm_blkszs(). It overrides
default block sizes in context with block sizes needed for
AVX2 GEMM kernels.
- Added new API bli_zen4_restore_default_blkszs(). It restores
The block sizes to there default values (as needed by default
AVX512 GEMM kernels).
- Updated bli_trsm_front() to override the block sizes in the
context needed by TRSM + AVX2 GEMM kernels and restore them
to the default values at the end of this function. It is done
in bli_trsm_front() so that we override the context before
creating different threads.
AMD-Internal: [CPUPL-2225]
Change-Id: Ie92d0fc40f94a32dfb865fe3771dc14ed7884c55
- Completed zen4 configuration support on windows
- Enabled AVX512 kernels for AMAXV
- Added zen4 configuration in amdzen for windows
- Moved all zen4 kernels inside kernels/zen4 folder
AMD-Internal: [CPUPL-2108]
Change-Id: I9d2336998bbcdb8e2c4ca474977b5939bfa578ba
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
one of the three ways
-- It is updated to work across platforms.
-- Added in architecture/feature specific runtime checks.
-- Duplicated in AMD specific files. Build system is updated to
pick AMD specific files when library is built for any of the
zen architecture
AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
- Added configuration option for zen4 architecture
- Added auto-detection of zen4 architecture
- Added zen4 configuration for all checks related
to AMD specific optimizations
AMD-Internal: [CPUPL-1937]
Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a
-Failure was observed in zen configuration as gcc
flag safe-math-optimization was being used for reference
kernel compilation.
- Optmized kernels were being compiled without this gcc flag
resulted in computation difference resulting in test case
failure.
AMD-Internal: [CPUPL-2121]
Change-Id: I5d86e589cdea633220aecadbcab84d9b88b31f57
Details:
- Intrinsic implementation of zdotxv, cdotxv kernel
- Unrolling in multiple of 8, remaining corner
cases are handled serially for zdotxv kernel
- Unrolling in multiple of 16, remainig corner
cases are handled serially for cdotxv kernel
- Added declaration in zen contexts
AMD-Internal: [CPUPL-2050]
Change-Id: Id58b0dbfdb7a782eb50eecc7142f051b630d9211
Details:
- Optimized implementation of DOTXAXPYF fused kernel for single and double precision complex datatype using AVX2 Intrinsics
- Updated definitions zen context
AMD-Internal: [CPUPL-2059]
Change-Id: Ic657e4b66172ae459173626222af2756a4125565
Details:
- Intrinsic implementation of ZAXPY2V fused kernel for AVX2
- Updated definitions in zen contexts
AMD-Internal: [CPUPL-2023]
Change-Id: I8889ae08c826d26e66ae607c416c4282136937fa
Enabled AVX-512 skylake kernels in zen4 configuration.
AVX-512 kernels are added for float and double types.
AMD-Internal: [CPUPL-2108]
Change-Id: Idfe3f64a037db019cbdf43318954db52ad241a51
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
one of the three ways
-- It is updated to work across platforms.
-- Added in architecture/feature specific runtime checks.
-- Duplicated in AMD specific files. Build system is updated to
pick AMD specific files when library is built for any of the
zen architecture
AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
- Optimized dotxf implementation for double
and single precision complex datatype by
handling dot product computation in tile 2x6
and 4x6 handling 6 columns at a time, and rows
in multiple of 2 and 4.
- Dot product computation is arranged such a way
that multiple rho vector register will hold the
temporary result till the end of loop and finally
does horizontal addition to get final dot product
result.
- Corner cases are handled serially.
- Optimal and reuse of vector registers for
faster computation.
AMD-Internal: [CPUPL-1975]
Change-Id: I7dd305e73adf54100d54661769c7d5aada9b0098
- Optimized axpy2v implementation for double
datatype by handling rows in mulitple of 4
and store the final computed result at the
end of computation, preventing unnecessary
stores for improving the performance.
- Optimal and reuse of vector registers for
faster computation.
AMD-Internal: [CPUPL-1973]
Change-Id: I7b8ef94d0f67c1c666fdce26e9b2b7291365d2e9
Details:
- Intrinsic implementation of axpbyv for AVX2
- Bench written for axpbyv
- Added definitions in zen contexts
AMD-Internal: [CPUPL-1963]
Change-Id: I9bc21a6170f5c944eb6e9e9f0e994b9992f8b539
- Added configuration option for zen4 architecture
- Added auto-detection of zen4 architecture
- Added zen4 configuration for all checks related
to AMD specific optimizations
AMD-Internal: [CPUPL-1937]
Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a
Details :
- Accuracy failures observed when fast math and ILP64 are enabled.
- Disabling the feature with macro BLIS_ENABLE_FAST_MATH .
AMD-Internal: [CPUPL-1907]
Change-Id: I92c661647fb8cc5f1d0af8f6c4eae0fac1df5f16
-- Reverted changes made to include lp/ilp info in binary name
This reverts commit c5e6f885f0.
-- Included BLAS int size in 'make showconfig'
-- Renamed amdepyc configuration to amdzen
Change-Id: Ie87ec1c03e105f606aef1eac397ba0d8338906a6
1. Added new kernel bli_dnorm2fv_unb_var1 kernel to compute
norm with dot operation.
2. Added vectorization to compute square of 32 double element
block size from vector X.
3. Defined a new Macro BLIS_ENABLE_DNRM2_FAST under config header
to compute nrm2 using new kernel.
4. Dot kernel definitions and implementation have a possibility for
accuracy issues .we can switch to traditional implementation by
disabling the MACRO BLIS_ENABLE_DNRM2_FAST to compute L2-norm
for Vector X .
AMD-Internal: [CPUPL-1757]
Change-Id: I1adcaf1b3b4e33837758593c998c25705ff0fe11
-- Added -march=znver3 flag if the library is built for zen3
configuration with gcc compiler version 11 or above.
-- Replaced hardcoded compiler names 'gcc' and 'clang' with
variable $CC so that options are chosen as per the compiler
specified at configure time (instead of compiler in path).
AMD-Internal: [CPUPL-1823]
Change-Id: I2659349c998201ebd4480735c544e48a5ed76bb4
Details:
1. Unrolled by a factor 5. This gave around 1GFLOPS gain
2. Changed CMP to subs and remove nan. CMP uses a lot of
compare, which is higher in latency and more number of
instructions. Replacing with subs and remove nan
reduced it to 3 instructions and lighter ones.
3. Added remove nan function.
4. Added AVX512 definition in skx context.
5. Disabled code in AMAXV kernel depending on AVX512 flag
exists or not
Change-Id: I191725a55bc33edf8d537156292cf997d6a5fe35
Details:
- Developed damaxv for AVX512 extension
- Implemented removeNAN function that converts NAN values
to negative values based on the location
- Usage COMPARE256/COMPARE128 avoided in AVX512
implementation for better performance
- Unrolled the loop by order of 4.
Change-Id: Icf2a3606cf311ecc646aeb3db0628b293b9a3326
-- Created new configuration amdepyc to include fat binary which
includes zen, zen2, zen3 and generic architecture for fallback.
-- Updated amdepyc family makefiles to include macros needed
in amdepyc family binary. This file must include all macros,
compiler options to be used for non architecture specific code.
-- Added 'workaround' to exclude ZEN family specific code in some of
the framework files. There are still lot of places were ZEN family
specific code is added in framework files. They will be addressed
with proper design later.
- Moved definition of BLIS_CONFIG_EPYC from header files to
makefile so that it is enabled only for framework and kernels
-- Removed redundant flag AOCL_BLIS_ZEN, used BLIS_CONFIG_EPYC
wherever it was needed.
-- Removed un-used, obsolete macros, some of them may be needed for
debugging which can be added in the individual workspaces.
- BLIS_DEFAULT_MR_THREAD_MAX
- BLIS_DEFAULT_NR_THREAD_MAX
- BLIS_ENABLE_ZEN_BLOCK_SIZES
- BLIS_SMALL_MATRIX_THRES_TRSM
- BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES
- BLIS_ENABLE_SUP_MR_EXT
- BLIS_ENABLE_SUP_NR_EXT
-- Corrected implementation of exiting amd64_legacy configuration.
AMD-Internal: [CPUPL-1626, CPUPL-1628]
Change-Id: I46b0ab3ea3ac7d9ff737fef66c462e85601ee29c
Details:
- Adding threshold function pointers to cntx gives flexibility to choose
different threshold functions for different configurations.
- In case of fat binary where configuration is decided at run-time,
adding threshold functions under a macro enables these functions for
all the configs under a family. This can be avoided by adding function
pointers to cntx which can be queried from cntx during run-time
based on the config chosen.
Change-Id: Iaf7e69e45ae5bb60e4d0f75c7542a91e1609773f
Details:
1. Added prefetching next micro-panel of A and B in dgemm block,
which are helping in reducing load latency and improved performance.
2. Removed unnecessary unrolls in gemm loops and moved 8x6,6x8 core
dgemm into macros and made it more modular
3. Packing and diagonal packing in main dgemm loops are modularized.
Fringe cases are yet to modularize.
4. Updated dtrsm small thresholds for single and multi thread cases
5. Updated div/scale based on disable/enable of trsm pre-inversion
6. Code clean up
Change-Id: I5de16805ff050a31d2b424bb3f6ae0a4019332df
- AOCL Dynamic feature is added in BLIS which determines optimal
number of threads for the current problem size.
- This feature can be enabled/disabled by modifying the source
code
- This change adds support to enable/disable this feature during
configuration time by adding a new option in configure script
AOCL-Internal : [CPUPL-1565]
Change-Id: I590693f793cabc44d27a7f815adc41631dd01bbe
Details:
- Introduced new feature called AOCL_DYNAMIC.
- When this macro is defined, Optimum number of threads to solve DGEMM
is estimated based on the dimensions (M,N,K).
- Range of optimum number of threads will be [1, num_threads],
where "num_threads" is number of threads set by the application.
- Num_threads is derived from either environment variable "OMP_NUM_THREADS
or BLIS_NUM_THREADS' or bli_set_num_threads() API.
- Only local copy of rntm is modified by AOCL_DYNAMIC feature.
global_rntm data structure remains unchanged in order to keep track of
original number of threads set by application.
- Optimum number of threads calculation is done only for SUP.
- Since 'native' code path handles larger problem sizes, we use max
number of threads recommended by the application.
AMD-Internal: [CPUPL-1376]
Change-Id: I665ce14543d6719857d70325c4a9f959c08e66e3
1. NC and MC values are tuned for both single-instance and multi-instance run.
2. zen2 and zen3 configs updated.
3. SUP path disabled for zgemm, since tuned native path performed better.
To be re-enabled after setting right threshold for SUP selection.
AMD-Internal: [CPUPL-1442]
Change-Id: I0eb86926744d2983530a443e20e3e4e2ee3f3239
Merge conflicts araised has been fixed while downstreaming BLIS code from master to milan-3.1 branch
Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond)
Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations.
Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations.
Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger to conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu)
Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu)
Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs.
Minor code consolidation in all level-3 _front() functions.
Reorganized Windows cpp branch of bli_pthreads.c.
Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS.
Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion.
Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv.
AMD-internal-[CPUPL-1523]
Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd
Details:
- Reduced the KC cache blocksize for double real on the skx subconfig
from 384 to 256. The maximum (extended) KC was also reduced
accordingly from 480 to 320. Thanks to Tze Meng Low for suggesting
this change.
1. CMake script changes for build with Clang compiler.
2. CMake script changes for build test and testsuite based on the lib type ST/MT
3. CMake script changes for testcpp and blastest
4. Added python scripts to support library build and testsuite build.
AMD Internal : [CPUPL-1422]
Change-Id: Ie34c3e60e9f8fbf7ea69b47fd1b50ee90099c898
Details:
- This commit adds a new BLIS sandbox that (1) provides implementations
based on low-precision gemm kernels, and (2) extends the BLIS typed
API for those new implementations. Currently, these new kernels can
only be used for the POWER10 microarchitecture; however, they may
provide a template for developing similar kernels for other
microarchitectures (even those beyond POWER), as changes would likely
be limited to select places in the microkernel and possibly the
packing routines. The new low-precision operations that are now
supported include: shgemm, sbgemm, i16gemm, i8gemm, i4gemm. For more
information, refer to the POWER10.md document that is included in
'sandbox/power10'.
Details:
- Implemented assembly-based packm kernels for single- and double-
precision complex domain (c and z) and housed them in the 'haswell'
kernel set. This means c3xk, c8xk, z3xk, and z4xk are now all
optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
and zen2 subconfigs.
- Minor modifications to the corresponding s and d packm kernels that
were introduced in 426ad67.
- Thanks to AMD, who originally contributed the double-precision real
packm kernels (d6xk and d8xk), upon which these complex kernels are
partially based.
Details:
- Implemented assembly-based packm kernels for single- and double-
precision real domain (s and d) and housed them in the 'haswell'
kernel set. This means s6xk, s16xk, d6xk, and d8xk are now all
optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
and zen2 subconfigs.
- Thanks to AMD, who originally contributed the double-precision real
packm kernels (d6xk and d8xk), which I have now tweaked and used to
create comparable single-precision real kernels (s6xk and s16xk).
Details:
- These kernels are implemented by Field G. Van Zee as part of TRSM SUP
implementation with commit-ID 9e31f5e8553f8ae99cfe8a80052fc63499e0891a.
AMD-Internal: [CPUPL-1376]
Change-Id: Ib39a87fc20571ae9aeff82c9b87516ac583093c2
znver3 flag will be enabled if compiler is AOCC Clang version 3.0
and configuration is zen3
Change-Id: Ie164f4d469bf3f8df31ccf8fed9f80dfc62efb39
AMD-Internal: [CPUPL-1353]
Details:
- Reorganized logic of bli_thread_partition_2x2() so that the primary
guts were factored out into "fast" and "slow" variants. Then added
logic to the "fast" variant that allows for more optimal thread
factorizations in some situations where there is at least one factor
of 2.
- Changed BLIS_THREAD_RATIO_M from 2 to 1 in bli_kernel_macro_defs.h and
added comments to that file describing BLIS_THREAD_RATIO_? and
BLIS_THREAD_MAX_?R.
- In bli_family_zen.h and bli_family_zen2.h, preprocessed out several
macros not used in vanilla BLIS and removed the unused macro
BLIS_ENABLE_ZEN_BLOCK_SIZES from the former file.
- Disabled AMD's small matrix handling entry points in bli_syrk_front.c
and bli_trsm_front.c. (These branches of small matrix handling have
not been reviewed by vanilla BLIS developers.)
- Added commented-out calls printf() to bli_rntm.c.
- Whitespace changes to bli_thread.c.
Merged contributions from AMD's AOCL BLIS (#448).
Details:
- Added support for level-3 operation gemmt, which performs a gemm on
only the lower or upper triangle of a square matrix C. For now, only
the conventional/large code path will be supported (in vanilla BLIS).
This was accomplished by leveraging the existing variant logic for
herk. However, some of the infrastructure to support a gemmtsup is
included in this commit, including
- A bli_gemmtsup() front-end, similar to bli_gemmsup().
- A bli_gemmtsup_ref() reference handler function.
- A bli_gemmtsup_int() variant chooser function (with variant calls
commented out).
- Added support for inducing complex domain gemmt via the 1m method.
- Added gemmt APIs to the BLAS and CBLAS compatiblity layers.
- Added gemmt test module to testsuite.
- Added standalone gemmt test driver to 'test' directory.
- Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md.
- Added a C++ template header (blis.hh) containing a BLAS-inspired
wrapper to a set of polymorphic CBLAS-like function wrappers defined
in another header (cblas.hh). These two headers are installed if
running the 'install' target with INSTALL_HH is set to 'yes'. (Also
added a set of unit tests that exercise blis.hh, although they are
disabled for now because they aren't compatible with out-of-tree
builds.) These files now live in the 'vendor' top-level directory.
- Various updates to 'zen' and 'zen2' subconfigurations, particularly
within the context initialization functions.
- Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and
various minor updates to dotv and scalv kernels. Also added various
sup kernels contributed by AMD to kernels/zen/3. However, these
kernels are (for now) not yet used, in part because they caused
AppVeyor clang failures, and also because I have not found time to
review and vet them.
- Output the python found during configure into the definition of PYTHON
in build/config.mk (via build/config.mk.in).
- Added early-return checks (A, B, or C with zero dimension; alpha = 0)
to bli_gemm_front.c.
- Implemented explicit beta = 0 handling in for the sgemm ukernel in
bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent
bug surfaced because the gemmt module verifies its computation using
gemm with its beta parameter set to zero, which, on a cortexa15 system
caused the gemm kernel code to unconditionally multiply the
uninitialized C data by beta. The C matrix likely contained
non-numeric values such as NaN, which then would have resulted in a
false failure.
- Fixed a bug whereby the implementation for bli_herk_determine_kc(),
in bli_l3_blocksize.c, was inadvertantly being defined in terms of
helper functions meant for trmm. This bug was probably harmless since
the trmm code should have also done the right thing for herk.
- Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in
kernels/zen/3/bli_gemm_small.c since those macros are not used in
vanilla BLIS.
- Added cpp guard to definition of bli_mem_clear() in bli_mem.h to
accommodate C++'s stricter type checking.
- Added cpp guard to test/*.c drivers that facilitate compilation on
Windows systems.
- Various whitespace changes.