Details:
Fixed memory access bugs in the bli_sgemmsup_rd_zen_asm_s1x16()
kernel. The bugs were caused by loading four
single-precision elements of C, via instructions such as:
vfmadd231ps(mem(rcx, 0*32), ymm3, ymm4)
or
vfmadd231ps(mem(rcx, 0*32), xmm3, xmm4)
in situations where only two elements are guaranteed to exist. (These
bugs may not have manifested in earlier tests due to the leading
dimension alignment that BLIS employs by default.) The issue was fixed
by replacing lines like the one above with:
vmovsd(mem(rcx), xmm0)
vfmadd231ps(xmm0, xmm3, xmm4)
Thus, we use vmovsd to explicitly load only two elements of C into
registers, and then operate on those values using register addressing.
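The fix pattern can be illustrated in portable C. This is only a sketch under assumptions: the helper name update_c_two_elems and the 4-wide temporary that stands in for an xmm register are hypothetical, and the real kernel uses the assembly macros shown above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not part of BLIS: computes c[i] = beta * c[i] + acc[i]
   for the two elements of C that are guaranteed to exist. */
void update_c_two_elems(float *c, float beta, const float acc[4])
{
    float reg[4] = { 0.0f, 0.0f, 0.0f, 0.0f };   /* stands in for xmm0 */
    float out[4];
    int i;

    assert(c != NULL);

    /* Like vmovsd(mem(rcx), xmm0): read exactly 8 bytes (two floats) of C.
       A 16-byte read here could touch memory past the end of C. */
    memcpy(reg, c, 2 * sizeof(float));

    /* Like vfmadd231ps(xmm0, xmm3, xmm4): all four lanes are computed, but
       only from register contents, never directly from memory. */
    for (i = 0; i < 4; ++i)
        out[i] = beta * reg[i] + acc[i];

    /* Store only the two valid results back to C. */
    memcpy(c, out, 2 * sizeof(float));
}
```

The upper two lanes are computed from zeros and discarded, so no memory beyond the two valid elements of C is read or written.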
AMD_CPUPLID: CPUPL-2279
Change-Id: Ic39290d651f5218b2e548351a87ac5e4b5b79c68
Problem statement:
Improve the performance of the zgemm kernel for inputs with k=1 by fine-tuning the previous implementation.
The previous implementation showed that, for k=1, using SIMD parallelism along the m and n dimensions instead of the k dimension gives better zgemm performance. This commit improves that code further along the following lines:
- The case alpha=0 and beta!=0 (i.e. only scaling of C) is handled separately at the beginning, using the bli_zscalm API.
- Register blocking was improved, increasing the kernel size from 4x5 to 4x6.
- Prefetching was added, with the offset added to the pointer chosen empirically. Overall, it gave a mild performance improvement.
- Conditional statements were removed from the kernel loop, using logic that allows their removal without affecting the output.
This single-threaded implementation also competes with the default multithreaded implementation, as long as m and n are under 128. A possible further improvement to this patch would be to relate the number of threads to the input size constraints, so that each thread count gets its own size constraint.
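The alpha=0 special case can be sketched as follows. This is a minimal illustration, not the kernel itself: the function name zgemm_k1_scale_only is hypothetical, C is assumed to be column-major, and a plain loop stands in for the bli_zscalm call mentioned above.

```c
#include <assert.h>
#include <complex.h>

/* Hypothetical helper, not a BLIS routine. C is m x n, column-major, with
   leading dimension ldc. Returns 1 if the whole operation reduced to
   scaling C (alpha == 0), and 0 if the caller must still run the k=1
   kernel. */
int zgemm_k1_scale_only(int m, int n, double complex alpha,
                        double complex beta, double complex *c, int ldc)
{
    int i, j;

    assert(m >= 0 && n >= 0 && ldc >= m);

    if (alpha != 0.0)
        return 0;               /* general case: nothing done here */

    /* alpha == 0: A and B do not contribute, so only C := beta * C remains. */
    for (j = 0; j < n; ++j)
        for (i = 0; i < m; ++i)
            c[i + j * ldc] *= beta;

    return 1;
}
```

Handling this case up front means the main kernel never has to test alpha inside its loop.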
AMD-Internal: [CPUPL-2236]
Change-Id: I3d401c8fd78bec80ce62eef390fa85e6287df847
Replaced the vzeroall instruction with vxorpd and vmovapd in the dgemm
kernels, both AVX2 and AVX512. vzeroall is an expensive instruction, so it
was replaced with a faster way of zeroing all registers. A vzeroupper()
instruction is also added at the end of the AVX2 kernels to avoid AVX2/SSE
transition penalties. Note that only the main kernels are modified.
Change-Id: Ieb9bc629db01f0f94dd0e8e55550940d3d7eb2a4
The current zgemm implementation exploits SIMD parallelism along the k
dimension. This performs well when k is large. But for inputs with k=1,
it is better to exploit SIMD parallelism along the m and n dimensions.
This commit does so through loop reordering, by loading column vectors
from A.
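For k=1 the product reduces to a rank-1 update, which a scalar reference can make explicit. The sketch below assumes column-major storage and contiguous vectors a (length m) and b (length n); the name zgemm_k1_rank1 is hypothetical and alpha/beta scaling is left out for brevity.

```c
#include <assert.h>
#include <complex.h>

/* Hypothetical scalar reference for the k=1 case: C += a * b^T, where a is
   the single column of A and b is the single row of B. */
void zgemm_k1_rank1(int m, int n, const double complex *a,
                    const double complex *b, double complex *c, int ldc)
{
    int i, j;

    assert(m >= 0 && n >= 0 && ldc >= m);

    for (j = 0; j < n; ++j) {
        const double complex bj = b[j];   /* one element of B per column of C */
        /* The inner loop walks down a column: a[] and the j-th column of C
           are both contiguous, so this is the loop a SIMD kernel would
           vectorize along m. */
        for (i = 0; i < m; ++i)
            c[i + j * ldc] += a[i] * bj;
    }
}
```

There is no loop over k left to vectorize, which is why the parallelism has to come from m and n.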
AMD-Internal: [CPUPL-2236]
Change-Id: Ibfa29f271395497b6e2d0127c319ecb4b883d19f
Initialized ymm and xmm registers to zero to address
uninitialized variable errors reported by static analysis.
AMD-Internal: [CPUPL-2078]
Change-Id: Icfcc008a0f244278efd8145d7feef764ed5fcc04
- Added initialization of rntm object before aocl_dynamic.
- Bug fixes in dtrsm right-side kernels: avoided accessing extra
memory when storing results for corner cases.
AMD-Internal: [CPUPL-2193] [CPUPL-2194]
Change-Id: I1c9d10edda93621626957d4de2f53d249ad531ba
- Completed zen4 configuration support on Windows
- Enabled AVX512 kernels for AMAXV
- Added zen4 configuration in amdzen for Windows
- Moved all zen4 kernels inside kernels/zen4 folder
AMD-Internal: [CPUPL-2108]
Change-Id: I9d2336998bbcdb8e2c4ca474977b5939bfa578ba
- Implemented optimized her framework calls for double precision complex numbers.
- The zher kernel operates on 4 columns at a time. It first computes the diagonal elements of the matrix, then the 4x4 triangular part, and finally the remaining part as 4x4 tiles of the matrix, up to m rows.
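As a point of reference, the operation that the blocked kernel has to reproduce is the Hermitian rank-1 update A := A + alpha * x * x^H. The sketch below is an unblocked scalar version for the lower triangle only; the name zher_lower_ref and the column-major layout are assumptions, and it does not model the 4-column blocking described above.

```c
#include <assert.h>
#include <complex.h>

/* Hypothetical scalar reference: lower triangle of A := A + alpha * x * x^H,
   with real alpha, column-major A and leading dimension lda. */
void zher_lower_ref(int m, double alpha, const double complex *x,
                    double complex *a, int lda)
{
    int i, j;

    assert(m >= 0 && lda >= m);

    for (j = 0; j < m; ++j) {
        const double xr = creal(x[j]);
        const double xi = cimag(x[j]);
        const double complex t = alpha * conj(x[j]);

        /* Diagonal: x[j] * conj(x[j]) is real, so the imaginary part of the
           diagonal element is set to exactly zero. */
        a[j + j * lda] = creal(a[j + j * lda]) + alpha * (xr * xr + xi * xi);

        /* Strictly lower part of column j. */
        for (i = j + 1; i < m; ++i)
            a[i + j * lda] += x[i] * t;
    }
}
```

The upper triangle is never touched, which matches the usual BLAS convention of referencing only the stored triangle.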
AMD-Internal: [CPUPL-2151]
Change-Id: I27430ee33ffb901b3ef4bdd97b034e3f748e9cca
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
one of three ways:
-- It is updated to work across platforms.
-- Architecture/feature-specific runtime checks are added.
-- It is duplicated in AMD-specific files. The build system is updated to
pick the AMD-specific files when the library is built for any of the
zen architectures.
AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
- Implemented her2 framework calls for transposed and
non-transposed kernel variants.
- The dher2 kernel operates on 4 columns at a time. It computes
the 4x4 triangular part of the matrix first; the remainder is
computed in chunks of 4x4 tiles, up to m rows.
- Remainder cases (m < 4) are handled serially.
AMD-Internal: [CPUPL-1968]
Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313
- Optimized the axpy2v implementation for the double
datatype by handling rows in multiples of 4
and storing the final computed result once at the
end of the computation, which avoids unnecessary
stores and improves performance.
- Optimal use and reuse of vector registers for
faster computation.
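The idea of fusing both updates and storing each result once can be sketched in scalar C. This is an illustration only: the name daxpy2v_sketch is hypothetical, unit strides are assumed, and local variables stand in for the vector registers of the real kernel.

```c
#include <assert.h>

/* Hypothetical sketch of axpy2v for doubles:
   z := z + alphax * x + alphay * y, unit stride. */
void daxpy2v_sketch(int n, double alphax, double alphay,
                    const double *x, const double *y, double *z)
{
    int i = 0;

    assert(n >= 0);

    /* Main loop: 4 elements per iteration. Both updates are accumulated in
       locals, so each z[i] is loaded once and stored once. */
    for (; i + 4 <= n; i += 4) {
        double z0 = z[i + 0] + alphax * x[i + 0] + alphay * y[i + 0];
        double z1 = z[i + 1] + alphax * x[i + 1] + alphay * y[i + 1];
        double z2 = z[i + 2] + alphax * x[i + 2] + alphay * y[i + 2];
        double z3 = z[i + 3] + alphax * x[i + 3] + alphay * y[i + 3];
        z[i + 0] = z0;
        z[i + 1] = z1;
        z[i + 2] = z2;
        z[i + 3] = z3;
    }

    /* Remainder (fewer than 4 elements) is handled serially. */
    for (; i < n; ++i)
        z[i] += alphax * x[i] + alphay * y[i];
}
```

Two separate axpyv calls would load and store z twice; the fused form halves that traffic.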
AMD-Internal: [CPUPL-1973]
Change-Id: I7b8ef94d0f67c1c666fdce26e9b2b7291365d2e9
Details:
- Developed damaxv for the AVX512 extension.
- Implemented a removeNAN function that converts NaN values
to negative values based on their location.
- Avoided use of COMPARE256/COMPARE128 in the AVX512
implementation for better performance.
- Unrolled the loop by a factor of 4.
Change-Id: Icf2a3606cf311ecc646aeb3db0628b293b9a3326
Details:
- Handled overflow and underflow vulnerabilities in
ztrsm small right implementations.
- Fixed failures observed in ScaLAPACK testing.
AMD-Internal: [CPUPL-2115]
Change-Id: I22c1ba583e0ba14d1a4684a85fa1ca6e152e8439
Description:
1. Decision logic to choose the optimal number of threads for
given dgemm input dimensions under the AOCL dynamic feature
was retuned based on the latest code.
2. Updated code in a few files to avoid compilation warnings.
3. Added a min check for nt in the bli_sgemv_var1_smart_threading
function.
AMD-Internal: [ CPUPL-2100 ]
Change-Id: I2bc70cc87c73505dd5d2bdafb06193f664760e02
- Implemented an OpenMP-based standalone SGEMV kernel for
row-major (var 1) for multithreaded scenarios.
- Smart threading is enabled when AOCL DYNAMIC is defined.
- The number of threads is decided based on the input dimensions
using smart threading.
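The general shape of such a size-based thread selection might look like the sketch below. Everything specific in it is an assumption: the function name pick_num_threads is hypothetical and the thresholds are placeholders chosen only for illustration, not the tuned values used in the library.

```c
#include <assert.h>

/* Hypothetical sketch of size-based thread selection. The thresholds are
   placeholders, NOT the tuned values from the library. */
int pick_num_threads(long m, long n, int max_threads)
{
    const long work = m * n;    /* rough proxy for the amount of work */
    int nt;

    assert(m >= 0 && n >= 0);

    if (work < 4096)
        nt = 1;                 /* too small to benefit from threading */
    else if (work < 65536)
        nt = 4;                 /* medium sizes: a few threads */
    else
        nt = max_threads;       /* large sizes: use everything available */

    /* Never exceed the number of threads the caller allows ... */
    if (nt > max_threads)
        nt = max_threads;
    /* ... and never go below one thread (a min check on nt). */
    if (nt < 1)
        nt = 1;

    return nt;
}
```

The final clamp matters when the caller reports zero or a negative number of available threads.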
AMD-Internal: [CPUPL-1984]
Change-Id: I9b191e965ba7468e95aabcce21b35a533017502e
1. Parallelized dtrsm_small across the m-dimension or n-dimension based on side (Left/Right).
2. Fine-tuned with AOCL_DYNAMIC to achieve better performance.
AMD-Internal: [CPUPL-2103]
Change-Id: I6be6a2b579de7df9a3141e0d68bdf3e8a869a005
Details:
- Intrinsic implementation of zdotxv and cdotxv kernels.
- Unrolling in multiples of 8 for the zdotxv kernel; remaining
corner cases are handled serially.
- Unrolling in multiples of 16 for the cdotxv kernel; remaining
corner cases are handled serially.
- Added declarations in zen contexts.
AMD-Internal: [CPUPL-2050]
Change-Id: Id58b0dbfdb7a782eb50eecc7142f051b630d9211
Details:
- Enabled ctrsm small implementation.
- Handled overflow and underflow vulnerabilities in
ctrsm small implementations.
- Fixed failures observed in libflame testing.
- For small sizes, the ctrsm small implementation is
used for all variants.
Change-Id: I17b862dcb794a5af0ec68f585992131fef57b179
Details:
- Optimized implementation of the DOTXAXPYF fused kernel for single and double precision complex datatypes using AVX2 intrinsics.
- Updated definitions in zen contexts.
AMD-Internal: [CPUPL-2059]
Change-Id: Ic657e4b66172ae459173626222af2756a4125565
Details:
- Optimization of ztrsm for non-unit diagonal variants.
- Handled overflow and underflow vulnerabilities in
ztrsm small implementations.
- Fixed failures observed in libflame testing.
- Fine-tuned ztrsm small implementations for the specific
sizes 64 <= m,n <= 256 by keeping the number of
threads at the optimum value, under the AOCL_DYNAMIC flag.
- For small sizes, the ztrsm small implementation is
used for all variants.
AMD-Internal: [SWLCSG-1194]
Change-Id: I066491bb03e5cda390cb699182af4350ae60be2d
1. Removed the small gemm call from the native path to avoid single-threaded
calls in multithreaded scenarios.
2. Disabled the SUP and INDUCED method paths.
3. Added AOCL Dynamic to select the optimum number of threads for higher
performance.
Change-Id: I3c41641bef4906bdbdb5f05e67c0f61e86025d92
- Fixed memory accesses for edge cases so that
all loads stay within memory bounds.
- Corrected ztrsm utility APIs for dcomplex
multiplication and division.
AMD-Internal: [CPUPL-2093]
Change-Id: Ib2c65e7921f6391b530cd20d6ea6b50f24bd705e
Details:
- Intrinsic implementation of zgemm_small nn kernel.
- Intrinsic implementation of zgemm_small_At kernel.
- Added support for conjugate and Hermitian transpose.
- Main loop operates in multiples of 4x3 tiles.
- Edge cases are handled separately.
AMD-Internal: [CPUPL-2084]
Change-Id: I512da265e4d4ceec904877544f1d15cddc147a66
Details:
- SUP threshold change for native vs SUP
- Improved ST performance for sizes n<800
- Introduced PACKB in SUP to improve ST performance for 320<n<800
- 16T SUP tuning for n<1600.
AMD-Internal: [CPUPL-1981]
Change-Id: Ie59afa4d31570eb0edccf760c088deaa2e10cdda
The framework cleanup was done for Linux as part of
f63f78d7 Removed Arch specific code from BLIS framework.
This commit adds the changes needed for the Windows build.
AMD-Internal: [CPUPL-2052]
Change-Id: Ibd503a0adeea66850de156fb95657b124e1c4b9d
Details:
- Intrinsic implementation of ZAXPY2V fused kernel for AVX2
- Updated definitions in zen contexts
AMD-Internal: [CPUPL-2023]
Change-Id: I8889ae08c826d26e66ae607c416c4282136937fa
Details:
- Intrinsic implementation of axpbyv for AVX2
- Bench written for axpbyv
- Added definitions in zen contexts
AMD-Internal: [CPUPL-1963]
Change-Id: I9bc21a6170f5c944eb6e9e9f0e994b9992f8b539
All AMD-specific optimizations in BLIS are enclosed in the BLIS_CONFIG_EPYC
preprocessor macro. This macro was not defined in CMake, which resulted in
lower overall performance.
Updated version number to 3.1.1
Change-Id: I9848b695a599df07da44e77e71a64414b28c75b9
- Altered the framework to use 2 more fused kernels for
better problem decomposition
- Increased unroll factor in AXPYF5 and AXPYF8 kernels
to improve register usage
AMD-Internal: [CPUPL-1970]
Change-Id: I79750235d9554466def5ff93898f832834990343
- Optimized the dotxf implementation for double
and single precision complex datatypes by
handling dot product computation in 2x6
and 4x6 tiles, processing 6 columns at a time and rows
in multiples of 2 and 4.
- Dot product computation is arranged such
that multiple rho vector registers hold the
temporary results until the end of the loop; a final
horizontal addition produces the dot product
result.
- Corner cases are handled serially.
- Optimal use and reuse of vector registers for
faster computation.
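The accumulator scheme can be shown on a much simpler case. The sketch below is a real-valued, single-column dot product, not the complex multi-column dotxf kernel; the name ddot_multi_acc is hypothetical and four scalar variables stand in for the rho vector registers.

```c
#include <assert.h>

/* Hypothetical sketch: dot product with several independent accumulators
   that are combined only once, after the main loop. */
double ddot_multi_acc(int n, const double *x, const double *y)
{
    double rho0 = 0.0, rho1 = 0.0, rho2 = 0.0, rho3 = 0.0;
    double rho;
    int i = 0;

    assert(n >= 0);

    /* Each accumulator carries its own partial sum, so consecutive
       multiply-adds do not depend on each other. */
    for (; i + 4 <= n; i += 4) {
        rho0 += x[i + 0] * y[i + 0];
        rho1 += x[i + 1] * y[i + 1];
        rho2 += x[i + 2] * y[i + 2];
        rho3 += x[i + 3] * y[i + 3];
    }

    /* Final reduction, the scalar analogue of a horizontal addition. */
    rho = (rho0 + rho1) + (rho2 + rho3);

    /* Corner cases (fewer than 4 elements left) are handled serially. */
    for (; i < n; ++i)
        rho += x[i] * y[i];

    return rho;
}
```

Note that splitting the sum changes the order of floating-point additions, so results can differ in the last bits from a strictly sequential loop.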
AMD-Internal: [CPUPL-1975]
Change-Id: I7dd305e73adf54100d54661769c7d5aada9b0098
- Implemented an alternate method of performing
multiplication and addition operations on the
double precision complex datatype by separating
out the real and imaginary parts of the complex numbers.
- Optimal use and reuse of vector registers for
faster computation.
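One way to read "separating out real and imaginary parts" is shown below for a complex multiply-add y := y + alpha * x on interleaved (re, im) data. This is a sketch under assumptions: the commit does not name the kernel, the function name zaxpy_split_alpha is hypothetical, and scalars stand in for broadcast vector registers.

```c
#include <assert.h>

/* Hypothetical sketch: y := y + alpha * x for n complex numbers stored as
   interleaved doubles (re, im, re, im, ...). alpha is passed as separate
   real and imaginary parts, as if each were broadcast to its own register. */
void zaxpy_split_alpha(int n, double alpha_r, double alpha_i,
                       const double *x, double *y)
{
    int i;

    assert(n >= 0);

    for (i = 0; i < n; ++i) {
        const double xr = x[2 * i];
        const double xi = x[2 * i + 1];

        /* (ar + i*ai) * (xr + i*xi) = (ar*xr - ai*xi) + i*(ar*xi + ai*xr) */
        y[2 * i]     += alpha_r * xr - alpha_i * xi;
        y[2 * i + 1] += alpha_r * xi + alpha_i * xr;
    }
}
```

With the parts separated, the complex product becomes four real multiplies that map directly onto real-valued multiply-add instructions.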
AMD-Internal: [CPUPL-1969]
Change-Id: Ib181f193c05740d5f6b9de3930e1995dea4a50f2
- Unrolled the loop by a greater factor. Incorporated switch
case to decide unrolling factor according to the input size.
- Removed unused structs.
AMD-Internal: [CPUPL-1974]
Change-Id: Iee9d7defcc8c582ca0420f84c4fb2c202dabe3e7
- Increased the unroll factor of the loop by 15 in SAXPYV
- Increased the unroll factor of the loop by 12 in DAXPYV
- The above changes were made for better register
utilization
Change-Id: I69ad1fec2fcf958dbd1bfd71378641274b43a6aa
- The number of threads is reduced to 1 when the dimensions
are very small.
- Removed uninitialized xmm compilation warning in trsm small.
Change-Id: I23262fb82729af5b98ded5d36f5eed45d5255d5b
- Introduced two new ddotxf functions with lower fuse
factor.
- Changed the DGEMV framework to use new kernels to
improve problem decomposition.
Change-Id: I523e158fd33260d06224118fbf74f2314e03a617
- Implemented hemv framework calls for lower and upper
kernel variants.
- The hemv computation is implemented in two parts.
One part operates on the triangular part of the matrix and
the remaining part is computed by the dotxaxpyf kernel.
- The first part performs the dotxf and axpyf operations on the
triangular part of the matrix in chunks of 8x8.
Two separate helper functions are implemented for this,
for the lower and upper kernels respectively.
- The second part is the ddotxaxpyf fused kernel, which performs
the dotxf and axpyf operations together on the non-triangular
part of the matrix in chunks of 4x8.
- The implementation uses cache memory efficiently while computing,
for optimal performance.
Change-Id: Id603031b4578e87a92c6b77f710c647acc195c8e
- Added configuration option for zen4 architecture
- Added auto-detection of zen4 architecture
- Added zen4 configuration for all checks related
to AMD specific optimizations
AMD-Internal: [CPUPL-1937]
Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a
The small gemm implementation is called from the gemmnat path.
When the library is built as multi-threaded, small gemm
is completely disabled.
For single-threaded builds, the crash is fixed by disabling
small gemm on the generic architecture.
AMD-Internal: [CPUPL-1930]
Change-Id: If718870d89909cef908a1c23918b7ef6f7d80f7a
Details:
AMD Internal Id: CPUPL-1702
- When performing trsm with a conjugate-transposed A, the
imaginary part of A must be conjugated, as required by the
conjugate transpose.
- So in the conjugate-transpose case, the imaginary part of A
is negated before doing trsm.
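The conjugate transpose itself is easy to state in scalar form. The sketch below is only an illustration: the name conj_transpose is hypothetical, double complex and column-major storage are assumptions, and the real kernel applies the negation on vector registers rather than building a separate matrix.

```c
#include <assert.h>
#include <complex.h>

/* Hypothetical sketch: B := A^H, where A is m x n and B is n x m, both
   column-major. Each element is moved to its transposed position with the
   sign of its imaginary part flipped. */
void conj_transpose(int m, int n, const double complex *a, int lda,
                    double complex *b, int ldb)
{
    int i, j;

    assert(m >= 0 && n >= 0 && lda >= m && ldb >= n);

    for (j = 0; j < n; ++j)
        for (i = 0; i < m; ++i)
            b[j + i * ldb] = conj(a[i + j * lda]);   /* negate imaginary part */
}
```

A plain transpose without the sign flip would give the wrong result for any A with nonzero imaginary parts.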
Change-Id: Ic736733a483eeadf6356952b434128c0af988e36
Details:
- The axpyf-based implementation incurs function-call overhead for its axpyf calls.
- The new implementation reduces this function-call overhead.
- This implementation uses a kernel of size 8x4.
- This implementation gives better performance for smaller sizes
compared to the axpyf-based implementation.
AMD-Internal: [CPUPL-1402]
Change-Id: Ic9a5e59363290caf26284548638da9065952fd48
Details:
-- AMD Internal Id: CPUPL-1702
-- Used an 8x3 CGEMM kernel with vector fma, utilizing ymm registers
efficiently to produce 24 scomplex outputs at a time
-- Used packing of matrix A for effective caching and reuse
-- Implemented kernels using a macro-based modular approach
-- Added ctrsm_small in the ctrsm_ BLAS path for single thread
when (m,n)<1000 and for multithread when (m+n)<320
-- Handled the --disable_pre_inversion configuration
-- Achieved 13% average performance improvement for sizes less than 1000
-- Modularized all 16 combinations of trsm into 4 kernels
Change-Id: I557c5bcd8cb7c034acd99ce0666bc411e9c4fe64