We were using the add_compile_options(-Xclang -fopenmp) statement to set
the OpenMP compiler flags for MSVC builds via CMake. A performance
regression was observed due to the compiler version used in MSVC
(clang 10), so the statement is removed from the Windows build system;
the compiler version (clang 13) and compiler options are instead
configured manually in the MSVC GUI to recover performance on the MATLAB bench.
Change-Id: I37d778abdceb7c1fae9b1caaeea8adb114677dd2
All AMD-specific optimizations in BLIS are enclosed in the BLIS_CONFIG_EPYC
preprocessor macro. This macro was not defined in the CMake build, which
resulted in overall lower performance.
Updated version number to 3.1.1
Change-Id: I9848b695a599df07da44e77e71a64414b28c75b9
Enabled the AVX-512 Skylake kernels in the zen4 configuration.
AVX-512 kernels are added for the float and double types.
AMD-Internal: [CPUPL-2108]
Change-Id: Idfe3f64a037db019cbdf43318954db52ad241a51
Details:
- Added a new sandbox called 'gemmlike', which implements sequential and
multithreaded gemm in the style of gemmsup but also unconditionally
employs packing. The purpose of this sandbox is to
(1) avoid select abstractions, such as objects and control trees, in
order to allow readers to better understand how a real-world
implementation of high-performance gemm can be constructed;
(2) provide a starting point for expert users who wish to build
something that is gemm-like without "reinventing the wheel."
Thanks to Jeff Diamond, Tze Meng Low, Nicholai Tukanov, and Devangi
Parikh for requesting and inspiring this work.
- The functions defined in this sandbox currently use the "bls_" prefix
instead of "bli_" in order to avoid any symbol collisions in the main
library.
- The sandbox contains two variants, each of which implements gemm via a
block-panel algorithm. The only difference between the two is that
variant 1 calls the microkernel directly while variant 2 calls the
microkernel indirectly, via a function wrapper, which allows the edge
case handling to be abstracted away from the classic five loops.
- This sandbox implementation utilizes the conventional gemm microkernel
(not the skinny/unpacked gemmsup kernels).
- Fixed some typos in the comments of a few files in the main
  framework.
Change-Id: Ifc3c50e9fd0072aada38eace50c57552c88cc6cf
Details:
- Implemented a new feature called addons, which are similar to
sandboxes except that there is no requirement to define gemm or any
other particular operation.
- Updated configure to accept --enable-addon=<name> or -a <name> syntax
for requesting an addon be included within a BLIS build. configure now
outputs the list of enabled addons into config.mk. It also outputs the
corresponding #include directives for the addons' headers to a new
companion to the bli_config.h header file named bli_addon.h. Because
addons may wish to make use of existing BLIS types within their own
definitions, the addons' headers must be included sometime after that
of bli_config.h (which currently is #included before bli_type_defs.h).
This is why the #include directives needed to go into a new top-level
header file rather than the existing bli_config.h file.
- Added a markdown document, docs/Addons.md, to explain addons, how to
build with them, and what assumptions their authors should keep in
mind as they create them.
- Added a gemmlike-like implementation of sandwich gemm called 'gemmd'
as an addon in addon/gemmd. The code uses a 'bao_' prefix for local
functions, including the user-level object and typed APIs.
- Updated .gitignore so that git ignores bli_addon.h files.
Change-Id: Ie7efdea366481ce25075cb2459bdbcfd52309717
- Altered the framework to use 2 more fused kernels for
better problem decomposition
- Increased unroll factor in AXPYF5 and AXPYF8 kernels
to improve register usage
AMD-Internal: [CPUPL-1970]
Change-Id: I79750235d9554466def5ff93898f832834990343
We added a new API to check whether the CPU architecture supports
AVX instructions. This API invoked the CPUID instruction every time
it was called. However, since this information does not change at
runtime, it is sufficient to determine it once and use the cached
result for subsequent calls. This optimization is needed to improve
performance for small-size matrix and vector operations.
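The caching described above can be sketched as follows. This is an illustrative example, not the actual BLIS code: the function name bls_cpu_has_avx is hypothetical, and the compiler builtin __builtin_cpu_supports (GCC/Clang, x86) stands in for the raw CPUID query.

```c
#include <stdbool.h>

/* Hypothetical sketch: query AVX support once and cache the result.
   bls_cpu_has_avx is an illustrative name, not a real BLIS symbol. */
static bool avx_checked = false;  /* has the query been performed yet? */
static bool avx_cached  = false;  /* cached answer from the first query */

bool bls_cpu_has_avx(void)
{
    if (!avx_checked)
    {
        /* The expensive CPUID-based query runs only on the first call. */
        avx_cached  = __builtin_cpu_supports("avx") != 0;
        avx_checked = true;
    }
    return avx_cached;
}
```

Subsequent calls return the cached flag without re-executing CPUID, which is what makes the check cheap enough for small-size hot paths.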
AMD-Internal: [CPUPL-2009]
Change-Id: If6697e1da6dd6b7f28fbfed45215ea3fdd569c5f
Details: Changes made to the 4.0 branch to enable the wrapper code by
default, and also removed the ENABLE_API_WRAPPER macro.
Change-Id: I5c9ede7ae959d811bc009073a266e66cbf07ef1a
- Removed the BLIS_CONFIG_EPYC macro
- The code dependent on this macro is now handled in
one of three ways:
-- It is updated to work across platforms.
-- Architecture/feature-specific runtime checks are added.
-- It is duplicated in AMD-specific files. The build system is updated to
pick the AMD-specific files when the library is built for any of the
zen architectures.
AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
- Optimized the dotxf implementation for the double-precision and
single-precision complex datatypes by handling the dot product
computation in 2x6 and 4x6 tiles, processing 6 columns at a time
and rows in multiples of 2 and 4.
- The dot product computation is arranged such that multiple rho
vector registers hold the temporary results until the end of the
loop, after which a horizontal addition produces the final dot
product result.
- Corner cases are handled serially.
- Vector registers are used optimally and reused for
faster computation.
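The accumulation scheme described above can be illustrated with a scalar sketch (the real kernel uses vector registers and tiles over columns; here a single dot product with four partial "rho" accumulators shows the idea):

```c
/* Scalar sketch of the accumulation scheme: partial rho accumulators
   hold running sums until the end of the loop, then a final
   "horizontal addition" produces the dot product. */
double dot_multi_acc(const double *x, const double *y, int n)
{
    double rho0 = 0.0, rho1 = 0.0, rho2 = 0.0, rho3 = 0.0;
    int i = 0;
    for (; i + 4 <= n; i += 4)  /* rows in multiples of 4 */
    {
        rho0 += x[i + 0] * y[i + 0];
        rho1 += x[i + 1] * y[i + 1];
        rho2 += x[i + 2] * y[i + 2];
        rho3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; ++i)          /* corner cases handled serially */
        rho0 += x[i] * y[i];
    /* horizontal addition of the partial accumulators */
    return (rho0 + rho1) + (rho2 + rho3);
}
```

Keeping several independent accumulators breaks the loop-carried dependence on a single sum, which is what lets the vectorized kernel keep multiple rho registers busy per iteration.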
AMD-Internal: [CPUPL-1975]
Change-Id: I7dd305e73adf54100d54661769c7d5aada9b0098
-The current gemm SUP path uses bli_thrinfo_sup_grow and bli_thread_range_sub
to generate per-thread data ranges at each loop of the gemm algorithm.
bli_thrinfo_sup_grow involves multiple barriers for cross-thread
synchronization. These barriers are necessary in cases where
either the A or B matrix is packed, for centralized pack buffer
allocation/deallocation (by the bli_thread_am_ochief thread).
-However, for cases where both the A and B matrices are unpacked, these
barriers result in overhead for smaller dimensions. Creation of
unnecessary communicators is now avoided, and consequently the
barriers are eliminated when packing is disabled for both input
matrices in the SUP path.
Change-Id: Ic373dfd2d6b08b8f577dc98399a83bb08f794afa
- Implemented her2 framework calls for the transposed and
non-transposed kernel variants.
- The dher2 kernel operates on 4 columns at a time. It computes
the 4x4 triangular part of the matrix first, and the remainder
is computed in chunks of 4x4 tiles up to m rows.
- Remainder cases (m < 4) are handled serially.
AMD-Internal: [CPUPL-1968]
Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313
- Optimized the axpy2v implementation for the double
datatype by handling rows in multiples of 4
and storing the final computed result only at the
end of the computation, avoiding unnecessary
intermediate stores to improve performance.
- Vector registers are used optimally and reused for
faster computation.
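A scalar sketch of the idea (the actual kernel is vectorized): axpy2v fuses two axpy operations, z := z + alphax*x + alphay*y, with the main loop unrolled by 4 and each z element written exactly once.

```c
/* Scalar sketch of fused axpy2v: z := z + alphax*x + alphay*y.
   Each z element is accumulated in a local and stored once. */
void daxpy2v_sketch(int n, double alphax, double alphay,
                    const double *x, const double *y, double *z)
{
    int i = 0;
    for (; i + 4 <= n; i += 4)  /* rows in multiples of 4 */
    {
        /* accumulate in (register) locals; single store per element
           at the end avoids redundant intermediate stores to z */
        double z0 = z[i+0] + alphax * x[i+0] + alphay * y[i+0];
        double z1 = z[i+1] + alphax * x[i+1] + alphay * y[i+1];
        double z2 = z[i+2] + alphax * x[i+2] + alphay * y[i+2];
        double z3 = z[i+3] + alphax * x[i+3] + alphay * y[i+3];
        z[i+0] = z0; z[i+1] = z1; z[i+2] = z2; z[i+3] = z3;
    }
    for (; i < n; ++i)          /* remainder handled serially */
        z[i] += alphax * x[i] + alphay * y[i];
}
```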
AMD-Internal: [CPUPL-1973]
Change-Id: I7b8ef94d0f67c1c666fdce26e9b2b7291365d2e9
- Implemented an alternate method of performing the
multiplication and addition operations on the
double-precision complex datatype by separating
out the real and imaginary parts of the complex numbers.
- Vector registers are used optimally and reused for
faster computation.
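The split-real/imaginary formulation can be sketched in scalar form (in the vectorized kernel, the real and imaginary parts live in separate vector registers):

```c
/* Complex multiply-accumulate with real and imaginary parts kept
   separate: (ar + ai*i)*(br + bi*i) = (ar*br - ai*bi) + (ar*bi + ai*br)*i */
void zmac_split(double ar, double ai, double br, double bi,
                double *cr, double *ci)
{
    *cr += ar * br - ai * bi;  /* real part */
    *ci += ar * bi + ai * br;  /* imaginary part */
}
```

Keeping the parts separate avoids the shuffle/permute operations needed when real and imaginary components are interleaved in the same register.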
AMD-Internal: [CPUPL-1969]
Change-Id: Ib181f193c05740d5f6b9de3930e1995dea4a50f2
Details:
- Intrinsic implementation of axpbyv for AVX2
- Bench written for axpbyv
- Added definitions in zen contexts
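For reference, the axpbyv operation computes y := alpha*x + beta*y. A minimal scalar sketch of these semantics follows (the commit's actual kernel is an AVX2 intrinsic implementation of the same operation):

```c
/* Scalar reference for axpbyv: y := alpha*x + beta*y. */
void axpbyv_ref(int n, double alpha, const double *x,
                double beta, double *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = alpha * x[i] + beta * y[i];
}
```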
AMD-Internal: [CPUPL-1963]
Change-Id: I9bc21a6170f5c944eb6e9e9f0e994b9992f8b539
- Unrolled the loop by a greater factor. Incorporated a switch
statement to decide the unrolling factor according to the input size.
- Removed unused structs.
AMD-Internal: [CPUPL-1974]
Change-Id: Iee9d7defcc8c582ca0420f84c4fb2c202dabe3e7
1. Removed the hardcoded libomp.lib from the CMake scripts and made it user-configurable. By default, libomp.lib is used as the OpenMP library.
2. Added the STATIC_LIBRARY_OPTIONS property to the set_target_properties CMake command to link the OpenMP library when building the static-mt BLIS library.
3. Updated blis_ref_kernel_mirror.py to support the zen4 architecture.
AMD-Internal: CPUPL-1630
Change-Id: I54b04cde2fa6a1ddc4b4303f1da808c1efe0484a
- Increased the unroll factor of the loop by 15 in SAXPYV
- Increased the unroll factor of the loop by 12 in DAXPYV
- The above changes were made for better register
utilization
Change-Id: I69ad1fec2fcf958dbd1bfd71378641274b43a6aa
- The number of threads is reduced to 1 when the dimensions
are very small.
- Removed an uninitialized xmm compilation warning in trsm small
Change-Id: I23262fb82729af5b98ded5d36f5eed45d5255d5b
- Introduced two new ddotxf functions with a lower fuse
factor.
- Changed the DGEMV framework to use the new kernels to
improve problem decomposition.
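For context, dotxf fuses several dot products: with fuse factor b, it computes y[j] := beta*y[j] + alpha*dot(A(:,j), x) for j = 0..b-1, loading each x element once and reusing it across all b columns. A scalar sketch under that definition (the function name and the b <= 8 bound are illustrative, not the BLIS kernel itself):

```c
/* Scalar sketch of dotxf with fuse factor b (assumed b <= 8):
   y[j] := beta*y[j] + alpha * dot(A(:,j), x), with x reused across
   the b fused columns. A is a column-major m-by-b panel. */
void ddotxf_sketch(int m, int b, double alpha,
                   const double *a, int lda,
                   const double *x, double beta, double *y)
{
    double rho[8] = {0};                 /* one accumulator per column */
    for (int i = 0; i < m; ++i)
    {
        double xi = x[i];                /* x loaded once per row */
        for (int j = 0; j < b; ++j)
            rho[j] += a[i + j * lda] * xi;
    }
    for (int j = 0; j < b; ++j)
        y[j] = beta * y[j] + alpha * rho[j];
}
```

A lower fuse factor gives the DGEMV framework more flexibility in splitting the column dimension, at the cost of reloading x more often; the commit balances these by adding kernels at both fuse factors.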
Change-Id: I523e158fd33260d06224118fbf74f2314e03a617
-Implemented hemv framework calls for the lower and upper
kernel variants.
-The hemv computation is implemented in two parts:
one part operates on the triangular part of the matrix, and
the remaining part is computed by the dotxfaxpyf kernel.
-The first part performs the dotxf and axpyf operations on the
triangular part of the matrix in chunks of 8x8.
Two separate helper functions are implemented
for the lower and upper kernels, respectively.
-The second part is the fused ddotxaxpyf kernel, which performs
the dotxf and axpyf operations together on the non-triangular
part of the matrix in chunks of 4x8.
-The implementation uses cache memory efficiently
for optimal performance.
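The fused dotxf/axpyf idea exploits symmetry: since only one triangle of the matrix is stored, each stored element can serve both an axpyf-style update and a dotxf-style update. A scalar sketch for a real symmetric matrix (the actual kernel is Hermitian, blocked 4x8, and vectorized; this illustrative function is not a BLIS symbol):

```c
/* Scalar sketch of fused dot/axpy for y += A*x with A symmetric and
   only the lower triangle stored (column-major, leading dim lda).
   Each stored off-diagonal element is loaded exactly once and
   contributes to two entries of y. */
void dsymv_lower_sketch(int n, const double *a, int lda,
                        const double *x, double *y)
{
    for (int j = 0; j < n; ++j)
    {
        y[j] += a[j + j * lda] * x[j];   /* diagonal element */
        for (int i = j + 1; i < n; ++i)
        {
            double aij = a[i + j * lda]; /* loaded exactly once */
            y[i] += aij * x[j];          /* axpyf-style contribution */
            y[j] += aij * x[i];          /* dotxf-style contribution */
        }
    }
}
```

This halves the memory traffic relative to computing the two contributions in separate passes, which is the cache-efficiency benefit the commit describes.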
Change-Id: Id603031b4578e87a92c6b77f710c647acc195c8e
- Added configuration option for zen4 architecture
- Added auto-detection of zen4 architecture
- Added zen4 configuration for all checks related
to AMD specific optimizations
AMD-Internal: [CPUPL-1937]
Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a
The small gemm implementation is called from the gemmnat path; when
the library is built as multi-threaded, small gemm is completely
disabled. For single-threaded builds, the crash is fixed by disabling
small gemm on the generic architecture.
AMD-Internal: [CPUPL-1930]
Change-Id: If718870d89909cef908a1c23918b7ef6f7d80f7a
Summary:
1. This commit fixes an issue in the gemv and axpy APIs.
2. The BLIS binary with the dynamic dispatch feature was
crashing on non-zen CPUs (specifically, CPUs without
AVX2 support).
3. The crash was caused by unsupported instructions
in zen-optimized kernels. The issue is fixed by calling
only reference kernels if the architecture detected at
runtime is not zen, zen2, or zen3.
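The dispatch pattern this fix establishes can be sketched as follows. The function names are illustrative, not actual BLIS symbols, and __builtin_cpu_supports (GCC/Clang, x86) is a simplified stand-in for BLIS's runtime architecture-id check:

```c
/* Reference kernel: safe on every architecture. */
static double dot_ref(int n, const double *x, const double *y)
{
    double rho = 0.0;
    for (int i = 0; i < n; ++i) rho += x[i] * y[i];
    return rho;
}

/* Stand-in for an AVX2-optimized zen kernel (same semantics here;
   the real kernel would use AVX2 instructions). */
static double dot_zen(int n, const double *x, const double *y)
{
    return dot_ref(n, x, y);
}

/* Dispatch: call the optimized kernel only when the CPU supports the
   instructions it uses; otherwise fall back to the reference kernel. */
double dot_dispatch(int n, const double *x, const double *y)
{
    if (__builtin_cpu_supports("avx2"))
        return dot_zen(n, x, y);
    return dot_ref(n, x, y);
}
```

Routing every call through such a check is what prevents illegal-instruction crashes on CPUs without AVX2.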
Change-Id: Icc6f7fdc80bc58fac1a97b1502b6f269e5e89aa4
Removed direct calling of zen kernels in the ctrsv and ztrsv interfaces.
The BLIS binary with the dynamic dispatch feature was crashing on non-zen CPUs
(specifically, CPUs without AVX2 support). The crash was caused by unsupported
instructions in zen-optimized kernels.
AMD-Internal: [CPUPL-1930]
Change-Id: I21f25a09cd6ffb013d16c66ea10aa9a42f7cad5b
Direct calls to zen kernels are replaced by architecture-dependent
calls for the dotv and amaxv kernels. For non-zen architectures,
the generic function is called via the BLIS interface. For zen
architectures, direct calls to zen-optimized kernels are made.
Change-Id: I49fc9abc813434d6a49a23f49e47d16e95b7899f
Removed direct calling of zen kernels in the BLIS interface for
trsm, scalv, and swapv.
The BLIS binary with the dynamic dispatch feature was crashing on non-zen CPUs
(specifically, CPUs without AVX2 support). The crash was caused by unsupported
instructions in zen-optimized kernels. The issue is fixed by calling only
reference kernels if the architecture detected at runtime is not zen, zen2, or zen3.
AMD-Internal: [CPUPL-1930]
Change-Id: I7944d131d376e2c4e778fe441a8b030674952b81
Removed direct calling of zen kernels from the cblas source itself.
Similar optimizations are performed by the functions directly invoked
from the CBLAS layer.
The BLIS binary with the dynamic dispatch feature was crashing on non-zen CPUs
(specifically, CPUs without AVX2 support). The crash was caused by unsupported
instructions in zen-optimized kernels. The issue is fixed by calling only
reference kernels if the architecture detected at runtime is not zen, zen2, or zen3.
AMD-Internal: [CPUPL-1930]
Change-Id: I9178b7a98f2563dee2817064f37fcbb84073eeea
This commit fixes an issue in the gemm and copy APIs.
The BLIS binary with the dynamic dispatch feature was crashing on non-zen
CPUs (specifically, CPUs without AVX2 support).
The crash was caused by unsupported instructions in zen-optimized kernels.
The issue is fixed by calling only reference kernels if the architecture
detected at runtime is not zen, zen2, or zen3.
AMD-Internal: [CPUPL-1930]
Change-Id: Ief57cd457b87542aa1a7bad64dc36c01f0d1a366
Details:
- A perf regression is observed for certain m,n,k inputs where (m,n,k > 512)
and (m > 4 * n) in BLIS 3.1. The root cause was traced to commit
11dfc176a3, where BLIS_THREAD_RATIO_M was
updated from 2 to 1. This change was not part of BLIS 3.0.6 and hence
resulted in the new perf drop in 3.1.
- This workaround doubles the m dimension that is passed as an
argument to bli_rntm_set_ways_for_op, which is used to determine the ic,jc
work split across threads. BLIS_THREAD_RATIO_M is not updated (to 2);
rather, its effect is induced via the doubled m dimension.
AMD-Internal: [CPUPL-1909]
Change-Id: I3b6ec4d4a22154289cb56d8f7db4cb60e5f34afe
Details:
- Accuracy failures are observed when fast math and ILP64 are enabled.
- Disabling the feature with the macro BLIS_ENABLE_FAST_MATH.
AMD-Internal: [CPUPL-1907]
Change-Id: I92c661647fb8cc5f1d0af8f6c4eae0fac1df5f16
-- Reverted changes made to include LP/ILP info in the binary name.
This reverts commit c5e6f885f0.
-- Included the BLAS integer size in 'make showconfig'
-- Renamed the amdepyc configuration to amdzen
Change-Id: Ie87ec1c03e105f606aef1eac397ba0d8338906a6
Details:
AMD Internal Id: CPUPL-1702
- For the case where A is of 1x1 dimension, on both the left and
right hand sides, A's only element is conjugate transposed by
negating its imaginary component.
Change-Id: I696ae982d9d60e0e702edaba98acbe9a5b0cd44c
Details:
AMD Internal Id: CPUPL-1702
- While performing the trsm function, A's imaginary
part needs to be conjugated as per the conjugate
transpose.
- So in the case of conjugate transpose, A's imaginary
part is negated before performing trsm.
Change-Id: Ic736733a483eeadf6356952b434128c0af988e36
Details:
- The axpyf-based implementation incurs function-call (axpyf) overhead.
- The new implementation reduces this function-call overhead.
- This implementation uses a kernel of size 8x4.
- This implementation gives better performance for smaller sizes when
compared to the axpyf-based implementation.
AMD-Internal: [CPUPL-1402]
Change-Id: Ic9a5e59363290caf26284548638da9065952fd48
Details:
-- AMD Internal Id: CPUPL-1702
-- Used an 8x3 CGEMM kernel with vector fma, utilizing ymm registers
efficiently to produce 24 scomplex outputs at a time
-- Used packing of matrix A to cache and reuse it effectively
-- Implemented kernels using a macro-based modular approach
-- Added ctrsm_small in the ctrsm_ BLAS path for single-threaded
runs when (m,n) < 1000 and multithreaded runs when (m+n) < 320
-- Took care of the --disable_pre_inversion configuration
-- Achieved a 13% average performance improvement for sizes less than 1000
-- Modularized all 16 combinations of trsm into 4 kernels
Change-Id: I557c5bcd8cb7c034acd99ce0666bc411e9c4fe64
Details:
-- AMD Internal Id: [CPUPL-1702]
-- Used a 16x6 SGEMM kernel with vector fma, utilizing ymm registers
-- Used packing of matrix A to cache and reuse it effectively
-- Implemented kernels using a macro-based modular approach
-- Took care of the --disable_pre_inversion configuration
-- Modularized the 16 combinations of strsm into 4 kernels
Change-Id: I30a1551967c36f6bae33be3b7ae5b7fcc7c905ea