Details:
1. In kernels for non-transpose variants, changes
are made to optimize the cases of beta zero.
2. Validated the changes with BLIS Testsuite,
GTestSuite(Functionality, Valgrind, Integer Tests)
and Netlib Tests.
3. Fixed warnings during the build process.
AMD-Internal: [CPUPL-2341]
Change-Id: I8bb53ad619eb2413c999fe18eafd67c75fe1f83a
- We prefetch next panel while packing 8xk panel.
- Modified prefetch offsets for dgemm native and
dgemm_small kernel.
AMD-Internal: [CPUPL-2366]
Change-Id: Ife609e789c8b87169c73bb0a30d6f1af20fb30ed
1. Fixed the memory leaks in corner cases which caused due to extra
loads in all datatypes(s,d,c,z).
2. In remainder cases instead of loading required number of elements,
loaded extra elements which lead to memory leaks. Fixed memory leaks by
restricting number of loads to required number of elements.
AMD-Internal: [CPUPL-2280]
Change-Id: Ia49a02565e01d5ed05e98090b7773a444587cd8a
Details:
1. Due to error in C output buffer address computation in
kernel bli_dgemmsup_rv_haswell_asm_6x8m_6x8_L, invalid
memory is being accessed. This is causing seg fault in
libflame netlib testing.
2. Validated the fix with libflame netlib testing.
AMD-Internal: [CPUPL-2341]
Change-Id: I9ca0cf09cf2d177ade73f840054b5028eae3a0ed
Details:
1. For lower and upper, non-transpose variants of gemmt, new kernels
are developed and optimized to compute only the required outputs in
the diagonal blocks.
2. In the previous implementation, all the 48 outputs of the given
6x8 block of C matrix are computed and stored into a temporary
buffer. Later,the required elements are copied into the final C
output buffer.
3. Changes are made to compute only the required outputs of the 6x8
block of C matrix and directly stored in the final C output buffer.
4. With this optimization, we are avoiding copy operation and also
reducing the number of computations.
5. Kernels specific to compute Lower and Upper Variant diagonal
outputs have been added.
6. SUP Framework changes to integrate the new kernels have been added.
7. These kernels are part of the SUP framework.
AMD-Internal: [CPUPL-2341]
Change-Id: I0ec8f24a0fb19d9b1ef7254732b8e09f06e1486a
1. Extract instruction replaced with cast when accessing first 128bit,
as cast inst needs no cycle but extract takes few cycles
2. Added prefetch of A buffer when computing gemm operation
3. Added prefetch of C11 buffer before TRSM operation, with offset of 7 to cs_c
With above changes performance improvements observed in case of Single thread
Change-Id: Id377c490ddac8b06384acfa9a6d89dbe11bbc7be
Details:
- Added vzeroupper instruction to the end of all 'gemm' and 'gemmtrsm'
microkernels so as to avoid a performance penalty when mixing AVX
and SSE instructions. These vzeroupper instructions were once part
of the haswell kernels, but were inadvertently removed during a source
code shuffle some time ago when we were managing duplicate 'haswell'
and 'zen' kernel sets. Thanks to Devin Matthews for tracking this down
and re-inserting the missing instructions.
Change-Id: I418fea9fed27ba3ad7d395cf96d1be507955d8e9
Details:
- Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and
bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the
kappa scalar was incorrectly loaded at an offset of 8 bytes (instead
of 4 bytes) from the real component. This was almost certainly a copy-
paste bug carried over from the corresonding zpackm kernels. Thanks to
Devin Matthews for bringing this to my attention.
- Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and
bls_gemm_bp_var2.c that initializes the elements of the temporary
microtile to zero. (This bug was never observed in output but rather
noticed analytically. It probably would have also manifested as
intermittent failures, this time involving edge cases.)
- Minor commented-out/disabled changes to testsuite/src/test_gemm.c
relating to debugging.
Change-Id: I899e20df203806717fb5270b5f3dd0bf1f685011
Details:
- Fixed another out-of-bounds read access bug in the haswell sup
assembly kernels. This bug is similar to the one fixed in 17b0caa
and affects bli_sgemmsup_rv_haswell_asm_6x2m(). Thanks to Madeesh
Kannan for reporting this bug (and a suitable fix) in #635.
- CREDITS file update.
Change-Id: I10ccf4d4f471d93e8c8cc4df422c686438fb04e9
Details:
- Fixed memory access bugs in the bli_sgemmsup_rv_haswell_asm_Mx2()
kernels, where M = {1,2,3,4,5,6}. The bugs were caused by loading four
single-precision elements of C, via instructions such as:
vfmadd231ps(mem(rcx, 0*32), xmm3, xmm4)
in situations where only two elements are guaranteed to exist. (These
bugs may not have manifested in earlier tests due to the leading
dimension alignment that BLIS employs by default.) The issue was fixed
by replacing lines like the one above with:
vmovsd(mem(rcx), xmm0)
vfmadd231ps(xmm0, xmm3, xmm4)
Thus, we use vmovsd to explicitly load only two elements of C into
registers, and then operate on those values using register addressing.
Thanks to Daniël de Kok for reporting these bugs in #635, and to
Bhaskar Nallani for proposing the fix).
- CREDITS file update.
Change-Id: Ib525c36bcbf20b2bbbe380da3d74d142b338fe9b
Details:
Fixed memory access bugs in the bli_sgemmsup_rd_zen_asm_s1x16()
kernel. The bugs were caused by loading four
single-precision elements of C, via instructions such as:
vfmadd231ps(mem(rcx, 0*32), ymm3, ymm4)
or
vfmadd231ps(mem(rcx, 0*32), xmm3, xmm4)
in situations where only two elements are guaranteed to exist. (These
bugs may not have manifested in earlier tests due to the leading
dimension alignment that BLIS employs by default.) The issue was fixed
by replacing lines like the one above with:
vmovsd(mem(rcx), xmm0)
vfmadd231ps(xmm0, xmm3, xmm4)
Thus, we use vmovsd to explicitly load only two elements of C into
registers, and then operate on those values using register addressing.
AMD_CPUPLID: CPUPL-2279
Change-Id: Ic39290d651f5218b2e548351a87ac5e4b5b79c68
Problem statement :
To improve the performance of the zgemm kernel for dealing with input sizes with k=1 by fine tuning its previous implementation.
In the previous implementation, usage of SIMD parallelism along m and n dimensions instead of the k dimension proved to provide a better performance to the zgemm kernel. This code was subjected to further improvements along the following lines:
- Cases to deal with alpha=0 and beta!=0 (i.e. just scaling of C) were handled at the beginning separately, using the bli_zscalm api.
- Register blocking was further improved, resulting in the kernel size to increase from 4x5 to 4x6.
- Prefetching was added to the code, by empirically finding out a suitable value to be added to the pointer. Overall, it provided a mild improvement to the performance.
- Conditional statements were removed from the kernel loop, and a logic was deduced to allow such removal without affecting the output.
The performance improvement of this single threaded implementation also proved to compete with that of the default implementation for multiple threads, as long as m and n are under 128. An improvement to this patch would be to find out a suitable feature which would establish a relationship between the number of threads and the input size constraints, thereby providing a unique size constraint for different number of threads.
AMD-Internal: [CPUPL-2236]
Change-Id: I3d401c8fd78bec80ce62eef390fa85e6287df847
Fixed syntax in AVX512 dgemm native kernel.
zen4 configuration follows Intel ASM syntax whereas other AMD configs
follow AT&T ASM syntax. Bug was introduced due to following AT&T syntax
in AVX512 dgemm kernel. In this commit we changed the syntax to Intel ASM
format. src and dst operands are interchanged.
Change-Id: Ie61dc7c5e8309b79437d471331318f3104bcd447
Replaced vzeroall instruction with vxorpd and vmovapd for dgemm kernels
-both AVX2 and AVX512. vzeroall is expensive instruction and replaced it
with faster version of zeroing all registers. vzeroupper() instruction is
also added at the end of AVX2 kernels to avoid any AVX2/SSE transition
penalities. Kindly note only the main kernels are modified.
Change-Id: Ieb9bc629db01f0f94dd0e8e55550940d3d7eb2a4
The current implementation for handling zgemm exploits SIMD parallelism
along the k dimension. This would give great performance in cases of k
being large. But for input sizes with k=1, it is better to exploit SIMD
parallelism along the m and n dimensions, thereby giving better
performance. This commit does the same through loop reordering, by
loading column vectors from A.
AMD-Internal: [CPUPL-2236]
Change-Id: Ibfa29f271395497b6e2d0127c319ecb4b883d19f
Initialized ymm and xmm registers to zero to address
un-inilizaed variable errors reported in static analsys.
AMD-Internal: [CPUPL-2078]
Change-Id: Icfcc008a0f244278efd8145d7feef764ed5fcc04
- Added initialization of rntm object before aocl_dynamic.
- Bugfixes in dtrsm right-side kernels,
avoided accessing extra memory while using store for corner cases.
AMD-Internal: [CPUPL-2193] [CPUPL-2194]
Change-Id: I1c9d10edda93621626957d4de2f53d249ad531ba
-DAMAXV AVX512 is giving wrong results when max element is present at
index in [n-u, n), where u < 32. This is a fallout of using wrong
start offset for the non-loop unrolled code.
-Functions for replacing NaN with negative numbers is replaced with
MACRO to avoid function call overhead and to remove static variables
used for stateful replacement numbers for NaN.
AMD-Internal: [CPUPL-2190]
Change-Id: Ie1435c38b264a271f869782793d0b52bbe6e1b2a
- Completed zen4 configuration support on windows
- Enabled AVX512 kernels for AMAXV
- Added zen4 configuration in amdzen for windows
- Moved all zen4 kernels inside kernels/zen4 folder
AMD-Internal: [CPUPL-2108]
Change-Id: I9d2336998bbcdb8e2c4ca474977b5939bfa578ba
- Implemented optimized her framework calls for double precision complex numbers.
- The zher kernel operates over 4 columns at a time. Initially, it computes the diagonal elements of the matrix, then the 4x4 triangular part is computed and finally the remaining part is computed as 4x4 tiles of the matrix upto m rows.
AMD-Internal: [CPUPL-2151]
Change-Id: I27430ee33ffb901b3ef4bdd97b034e3f748e9cca
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
one of the three ways
-- It is updated to work across platforms.
-- Added in architecture/feature specific runtime checks.
-- Duplicated in AMD specific files. Build system is updated to
pick AMD specific files when library is built for any of the
zen architecture
AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
- Impplemented her2 framework calls for transposed and non
transposed kernel variants.
- dher2 kernel operate over 4 columns at a time. It computes
4x4 triangular part of matrix first and remainder part is
computed in chunk of 4x4 tile upto m rows.
- remainder cases(m < 4) are handled serially.
AMD-Internal: [CPUPL-1968]
Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313
- Optimized axpy2v implementation for double
datatype by handling rows in mulitple of 4
and store the final computed result at the
end of computation, preventing unnecessary
stores for improving the performance.
- Optimal and reuse of vector registers for
faster computation.
AMD-Internal: [CPUPL-1973]
Change-Id: I7b8ef94d0f67c1c666fdce26e9b2b7291365d2e9
Details:
- Developed damaxv for AVX512 extension
- Implemented removeNAN function that converts NAN values
to negative values based on the location
- Usage COMPARE256/COMPARE128 avoided in AVX512
implementation for better performance
- Unrolled the loop by order of 4.
Change-Id: Icf2a3606cf311ecc646aeb3db0628b293b9a3326
Details:
- Handled Overflow and Underflow Vulnerabilites in
ztrsm small right implementations.
- Fixed failures observed in Scalapack testing.
AMD-Internal: [CPUPL-2115]
Change-Id: I22c1ba583e0ba14d1a4684a85fa1ca6e152e8439
Description:
1. Decision logic to choose optimal number of threads for
given input dgemm dimensions under aocl dynamic feature
were retuned based on latest code.
2. Updated code in few file to avoid compilation warnings.
3. Added a min check for nt in bli_sgemv_var1_smart_threading
function
AMD-Internal: [ CPUPL-2100 ]
Change-Id: I2bc70cc87c73505dd5d2bdafb06193f664760e02
- Implemented an OpenMP based stand alone SGEMV kernel for
row-major (var 1) for multithread scenarios
- Smart threading is enabled when AOCL DYNAMIC is defined
- Number of threads are decided based on the input dims
using smart threading
AMD-Internal: [CPUPL-1984]
Change-Id: I9b191e965ba7468e95aabcce21b35a533017502e
1. Parallelized dtrsm_small across m-dimension or n-dimension based on side(Left/Right).
2. Fine-tuning with AOCL_DYNAMIC to achieve better performance.
AMD-Internal: [CPUPL-2103]
Change-Id: I6be6a2b579de7df9a3141e0d68bdf3e8a869a005
Details:
- Intrinsic implementation of zdotxv, cdotxv kernel
- Unrolling in multiple of 8, remaining corner
cases are handled serially for zdotxv kernel
- Unrolling in multiple of 16, remainig corner
cases are handled serially for cdotxv kernel
- Added declaration in zen contexts
AMD-Internal: [CPUPL-2050]
Change-Id: Id58b0dbfdb7a782eb50eecc7142f051b630d9211
Details:
- Enable ctrsm small implementation
- Handled Overflow and Underflow Vulnerabilites in
ctrsm small implementations.
- Fixed failures observed in libflame testing.
- For small sizes, ctrsm small implementation is
used for all variants.
Change-Id: I17b862dcb794a5af0ec68f585992131fef57b179
Details:
- Optimized implementation of DOTXAXPYF fused kernel for single and double precision complex datatype using AVX2 Intrinsics
- Updated definitions zen context
AMD-Internal: [CPUPL-2059]
Change-Id: Ic657e4b66172ae459173626222af2756a4125565
Details:
- Optimization of ztrsm for Non-unit Diag Variants.
- Handled Overflow and Underflow Vulnerabilites in
ztrsm small implementations.
- Fixed failures observed in libflame testing.
- Fine-tuned ztrsm small implementations for specific
sizes 64<= m,n <= 256, by keeping the number of
threads to the optimum value, under AOCL_DYNAMIC flag.
- For small sizes, ztrsm small implementation is
used for all variants.
AMD-Internal: [SWLCSG-1194]
Change-Id: I066491bb03e5cda390cb699182af4350ae60be2d
1. Removed small gemm call from native path to avoid Single threaded
calls as a part of MultiThreaded scenarios.
2. SUP and INDUCED Method path disabled.
3. Added AOCL Dynamic for optimum number of threads to achieve higher
performance.
Change-Id: I3c41641bef4906bdbdb5f05e67c0f61e86025d92
- Fixed memory access for edge cases such that
all load are within memory boundary only.
- Corrected ztrsm utility APIs for dcomplex
multiplication and division.
AMD-Internal: [CPUPL-2093]
Change-Id: Ib2c65e7921f6391b530cd20d6ea6b50f24bd705e
Details:
- Intrinsic implementation of zgemm_small nn kernel.
- Intrinsic implementation of zgemm_small_At kernel.
- Added support conjugate and hermitian transpose
- Main loop operates in multiple of 4x3 tile.
- Edge cases are handles separately.
AMD-Internal: [CPUPL-2084]
Change-Id: I512da265e4d4ceec904877544f1d15cddc147a66
Details :
- SUP Threshold change for native vs SUP
- Improved the ST performances for sizes n<800
- Introduce PACKB in SUP to improve ST performance between 320<n<800
- 16T SUP Tuning for n<1600.
AMD-Internal: [CPUPL-1981]
Change-Id: Ie59afa4d31570eb0edccf760c088deaa2e10cdda
The framework cleanup was done for linux as part of
f63f78d7 Removed Arch specific code from BLIS framework.
This commit adds changes needed for windows build.
AMD-Internal: [CPUPL-2052]
Change-Id: Ibd503a0adeea66850de156fb95657b124e1c4b9d
Details:
- Intrinsic implementation of ZAXPY2V fused kernel for AVX2
- Updated definitions in zen contexts
AMD-Internal: [CPUPL-2023]
Change-Id: I8889ae08c826d26e66ae607c416c4282136937fa
Details:
- Intrinsic implementation of axpbyv for AVX2
- Bench written for axpbyv
- Added definitions in zen contexts
AMD-Internal: [CPUPL-1963]
Change-Id: I9bc21a6170f5c944eb6e9e9f0e994b9992f8b539
All AMD specific optimization in BLIS are enclosed in BLIS_CONFIG_EPYC
pre-preprocessor, this was not defined in CMake which are resulting in
overall lower performance.
Updated version number to 3.1.1
Change-Id: I9848b695a599df07da44e77e71a64414b28c75b9
- Altered the framework to use 2 more fused kernels for
better problem decomposition
- Increased unroll factor in AXPYF5 and AXPYF8 kernels
to improve register usage
AMD-Internal: [CPUPL-1970]
Change-Id: I79750235d9554466def5ff93898f832834990343
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
one of the three ways
-- It is updated to work across platforms.
-- Added in architecture/feature specific runtime checks.
-- Duplicated in AMD specific files. Build system is updated to
pick AMD specific files when library is built for any of the
zen architecture
AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
- Optimized dotxf implementation for double
and single precision complex datatype by
handling dot product computation in tile 2x6
and 4x6 handling 6 columns at a time, and rows
in multiple of 2 and 4.
- Dot product computation is arranged such a way
that multiple rho vector register will hold the
temporary result till the end of loop and finally
does horizontal addition to get final dot product
result.
- Corner cases are handled serially.
- Optimal and reuse of vector registers for
faster computation.
AMD-Internal: [CPUPL-1975]
Change-Id: I7dd305e73adf54100d54661769c7d5aada9b0098
- Impplemented her2 framework calls for transposed and non
transposed kernel variants.
- dher2 kernel operate over 4 columns at a time. It computes
4x4 triangular part of matrix first and remainder part is
computed in chunk of 4x4 tile upto m rows.
- remainder cases(m < 4) are handled serially.
AMD-Internal: [CPUPL-1968]
Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313
- Optimized axpy2v implementation for double
datatype by handling rows in mulitple of 4
and store the final computed result at the
end of computation, preventing unnecessary
stores for improving the performance.
- Optimal and reuse of vector registers for
faster computation.
AMD-Internal: [CPUPL-1973]
Change-Id: I7b8ef94d0f67c1c666fdce26e9b2b7291365d2e9