Details:
- Fixed another out-of-bounds read access bug in the haswell sup
assembly kernels. This bug is similar to the one fixed in 17b0caa
and affects bli_sgemmsup_rv_haswell_asm_6x2m(). Thanks to Madeesh
Kannan for reporting this bug (and a suitable fix) in #635.
- CREDITS file update.
Change-Id: I10ccf4d4f471d93e8c8cc4df422c686438fb04e9
Details:
- Fixed memory access bugs in the bli_sgemmsup_rv_haswell_asm_Mx2()
kernels, where M = {1,2,3,4,5,6}. The bugs were caused by loading four
single-precision elements of C, via instructions such as:
vfmadd231ps(mem(rcx, 0*32), xmm3, xmm4)
in situations where only two elements are guaranteed to exist. (These
bugs may not have manifested in earlier tests due to the leading
dimension alignment that BLIS employs by default.) The issue was fixed
by replacing lines like the one above with:
vmovsd(mem(rcx), xmm0)
vfmadd231ps(xmm0, xmm3, xmm4)
Thus, we use vmovsd to explicitly load only two elements of C into
registers, and then operate on those values using register addressing.
Thanks to Daniël de Kok for reporting these bugs in #635, and to
Bhaskar Nallani for proposing the fix.
- CREDITS file update.
Change-Id: Ib525c36bcbf20b2bbbe380da3d74d142b338fe9b
Details:
Fixed memory access bugs in the bli_sgemmsup_rd_zen_asm_s1x16()
kernel. The bugs were caused by loading four or eight
single-precision elements of C, via instructions such as:
vfmadd231ps(mem(rcx, 0*32), ymm3, ymm4)
or
vfmadd231ps(mem(rcx, 0*32), xmm3, xmm4)
in situations where only two elements are guaranteed to exist. (These
bugs may not have manifested in earlier tests due to the leading
dimension alignment that BLIS employs by default.) The issue was fixed
by replacing lines like the one above with:
vmovsd(mem(rcx), xmm0)
vfmadd231ps(xmm0, xmm3, xmm4)
Thus, we use vmovsd to explicitly load only two elements of C into
registers, and then operate on those values using register addressing.
AMD_CPUPLID: CPUPL-2279
Change-Id: Ic39290d651f5218b2e548351a87ac5e4b5b79c68
-- Conditional packing of the B matrix is enabled in the zgemmsup path,
   which performs better when the B matrix is large.
-- Incorporated decision logic to choose between zgemm_small and
   zgemm sup based on the matrix dimensions m, n and k.
-- ZGEMV is called when matrix dimension m or n = 1.
A very good performance improvement is observed.
Change-Id: I7c64020f4f78a6a51617b184cc88076213b5527d
Problem statement :
To improve the performance of the zgemm kernel for input sizes with k=1 by fine-tuning its previous implementation.
In the previous implementation, using SIMD parallelism along the m and n dimensions instead of the k dimension proved to give the zgemm kernel better performance. That code was further improved along the following lines:
- Cases with alpha=0 and beta!=0 (i.e. just scaling of C) are now handled separately at the beginning, using the bli_zscalm API.
- Register blocking was further improved, increasing the kernel size from 4x5 to 4x6.
- Prefetching was added, with a suitable offset for the pointer found empirically. Overall it provided a mild performance improvement.
- Conditional statements were removed from the kernel loop, with logic deduced to allow the removal without affecting the output.
The performance of this single-threaded implementation also proved competitive with the default multi-threaded implementation as long as m and n are under 128. A follow-up improvement would be to establish a relationship between the number of threads and the input size constraints, thereby providing a distinct size constraint for each thread count.
AMD-Internal: [CPUPL-2236]
Change-Id: I3d401c8fd78bec80ce62eef390fa85e6287df847
When the bli_trsm() API is called, we make sure the "side" argument is of type
side_t rather than f77_char, and that it is passed by value rather than by address.
Change-Id: I5a616eb054c034be2d67640b8ab3b9615706a8c9
Fixed syntax in the AVX512 dgemm native kernel.
The zen4 configuration follows Intel ASM syntax, whereas other AMD configs
follow AT&T ASM syntax. The bug was introduced by following AT&T syntax
in the AVX512 dgemm kernel. In this commit we changed the syntax to Intel
ASM format, interchanging the src and dst operands.
Change-Id: Ie61dc7c5e8309b79437d471331318f3104bcd447
- Updated the zen4 configuration to add the -march=znver4 flag to the
  compiler options when the gcc version is 12 or newer
AMD-Internal: [CPUPL-1937]
Change-Id: Ic11470b92f71e49ee193a3a5406cf6045d66bd2f
- Low precision gemm sets the thread metadata (lpgemm_thrinfo_t) to NULL
when compiled without OpenMP threading support, and the code subsequently
executes as if it is single-threaded. However, when the B matrix needs
to be packed, communicators are required (whether single- or
multi-threaded), and the code accesses lpgemm_thrinfo_t for this
without a NULL check. This results in a segfault.
For the fix, a non-OpenMP thread decorator layer is added, which
creates a placeholder lpgemm_thrinfo_t object with a communicator before
invoking the 5-loop algorithm. This object is used for packing.
- Makefile for compilation of aocl_gemm bench.
AMD-Internal: [CPUPL-2304]
Change-Id: Id505235c8421792240b84f93942ca62dac78f3dc
Replaced the vzeroall instruction with vxorpd and vmovapd in the dgemm
kernels, both AVX2 and AVX512. vzeroall is an expensive instruction, so it
was replaced with a faster way of zeroing all registers. A vzeroupper()
instruction is also added at the end of the AVX2 kernels to avoid any
AVX2/SSE transition penalties. Kindly note only the main kernels are modified.
Change-Id: Ieb9bc629db01f0f94dd0e8e55550940d3d7eb2a4
- Fine-tuned the thread allocation logic for parallelizing DGEMMT for the cases where n <= 220. This results in performance improvement in multi-threaded DGEMMT for small values of n.
AMD-Internal: [CPUPL-2215]
Change-Id: I2654bc64d2dc43c2db911e0c9175755be3aa8ba5
The current implementation for handling zgemm exploits SIMD parallelism
along the k dimension. This would give great performance in cases of k
being large. But for input sizes with k=1, it is better to exploit SIMD
parallelism along the m and n dimensions, thereby giving better
performance. This commit does the same through loop reordering, by
loading column vectors from A.
AMD-Internal: [CPUPL-2236]
Change-Id: Ibfa29f271395497b6e2d0127c319ecb4b883d19f
- Enabled AVX2 TRSM + GEMM kernel path, when GEMM is called
from TRSM context it will invoke AVX2 GEMM kernels instead
of the default AVX-512 GEMM kernels.
- The default context has the block sizes for AVX512 GEMM
kernels, however, TRSM uses AVX2 GEMM kernels and they
need different block sizes.
- Added new API bli_zen4_override_trsm_blkszs(). It overrides
default block sizes in context with block sizes needed for
AVX2 GEMM kernels.
- Added new API bli_zen4_restore_default_blkszs(). It restores
  the block sizes to their default values (as needed by the default
  AVX512 GEMM kernels).
- Updated bli_trsm_front() to override the block sizes in the
context needed by TRSM + AVX2 GEMM kernels and restore them
to the default values at the end of this function. It is done
in bli_trsm_front() so that we override the context before
creating different threads.
AMD-Internal: [CPUPL-2225]
Change-Id: Ie92d0fc40f94a32dfb865fe3771dc14ed7884c55
1. Added checks in the .c files of the bench folder to read the input parameters from the given input files on Windows using fscanf.
Change-Id: Ie0497696304d318f345a646ab0ce3ba84debd4e2
Initialized ymm and xmm registers to zero to address
uninitialized-variable errors reported in static analysis.
AMD-Internal: [CPUPL-2078]
Change-Id: Icfcc008a0f244278efd8145d7feef764ed5fcc04
- New thresholds need to be identified for the zgemm SUP path to avoid performance regression.
AMD-Internal: [CPUPL-2148]
Change-Id: I0baa2b415dc5e296780566ba7450249445b93d43
- Added initialization of rntm object before aocl_dynamic.
- Bugfixes in dtrsm right-side kernels:
  avoided accessing extra memory when using stores for corner cases.
AMD-Internal: [CPUPL-2193] [CPUPL-2194]
Change-Id: I1c9d10edda93621626957d4de2f53d249ad531ba
- DAMAXV AVX512 was giving wrong results when the max element is at an
index in [n-u, n), where u < 32. This is a fallout of using the wrong
start offset in the non-loop-unrolled code.
- The function for replacing NaN with negative numbers is replaced with a
MACRO to avoid function call overhead and to remove the static variables
used for stateful NaN replacement numbers.
AMD-Internal: [CPUPL-2190]
Change-Id: Ie1435c38b264a271f869782793d0b52bbe6e1b2a
- Multi-threaded int8 GEMM (input: uint8_t, int8_t; output: int32_t).
  AVX512_VNNI based micro-kernel for int8 gemm. Parallelization supported
  along the m and n dimensions.
- Multi-threaded B matrix reorder support for sgemm. Reordering packs the
  entire B matrix upfront, before sgemm, allowing sgemm to take advantage
  of a packed B matrix without incurring packing costs at runtime.
- Makefile updates to the addon make rules to compile AVX512 code for
  selected files in the addon folder.
- CPU features query enhancements to check for AVX512_VNNI flag.
- Bench for int8 gemm and sgemm with B matrix reorder. Supports
performance mode for benchmarking and accuracy mode for testing code
correctness.
AMD-Internal: [CPUPL-2102]
Change-Id: I8fb25f5c2fbd97d756f95b623332cb29e3b8d182
- Completed zen4 configuration support on windows
- Enabled AVX512 kernels for AMAXV
- Added zen4 configuration in amdzen for windows
- Moved all zen4 kernels inside kernels/zen4 folder
AMD-Internal: [CPUPL-2108]
Change-Id: I9d2336998bbcdb8e2c4ca474977b5939bfa578ba
- Implemented optimized her framework calls for double-precision complex numbers.
- The zher kernel operates over 4 columns at a time. It first computes the diagonal elements of the matrix, then the 4x4 triangular part, and finally the remaining part as 4x4 tiles of the matrix up to m rows.
AMD-Internal: [CPUPL-2151]
Change-Id: I27430ee33ffb901b3ef4bdd97b034e3f748e9cca
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
  one of three ways:
  -- It is updated to work across platforms.
  -- It is placed behind architecture/feature specific runtime checks.
  -- It is duplicated in AMD specific files. The build system is updated to
     pick the AMD specific files when the library is built for any of the
     zen architectures.
AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
- Implemented her2 framework calls for transposed and
  non-transposed kernel variants.
- The dher2 kernel operates over 4 columns at a time. It computes
  the 4x4 triangular part of the matrix first, and the remainder is
  computed in chunks of 4x4 tiles up to m rows.
- Remainder cases (m < 4) are handled serially.
- Optimized the axpy2v implementation for the double
  datatype by handling rows in multiples of 4
  and storing the final computed result at the
  end of the computation, preventing unnecessary
  stores and improving performance.
- Optimal use and reuse of vector registers for
  faster computation.
AMD-Internal: [CPUPL-1973]
Change-Id: I7b8ef94d0f67c1c666fdce26e9b2b7291365d2e9
1. Removed the hardcoded libomp.lib from the cmake scripts and made it user configurable. By default libomp.lib is used as the omp library.
2. Added the STATIC_LIBRARY_OPTIONS property in the set_target_properties cmake command to link the omp library when building the static-mt blis library.
3. Updated blis_ref_kernel_mirror.py to support the zen4 architecture.
AMD-Internal: CPUPL-1630
Change-Id: I54b04cde2fa6a1ddc4b4303f1da808c1efe0484a
- Added configuration option for zen4 architecture
- Added auto-detection of zen4 architecture
- Added zen4 configuration for all checks related
to AMD specific optimizations
AMD-Internal: [CPUPL-1937]
Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a
Details:
- Developed damaxv for the AVX512 extension
- Implemented a removeNAN function that converts NaN values
  to negative values based on their location
- Usage of COMPARE256/COMPARE128 is avoided in the AVX512
  implementation for better performance
- Unrolled the loop by a factor of 4.
Change-Id: Icf2a3606cf311ecc646aeb3db0628b293b9a3326
- The AOCL_VERBOSE implementation is causing breakage in libFLAME.
  Currently the DTL code is duplicated in BLIS and libFLAME,
  which results in duplicate symbol errors when DTL is enabled
  in both libraries.
- This will be addressed by making DTL a separate library.
- Input logs can still be enabled by setting
  AOCL_DTL_LOG_ENABLE = 1 in aocldtlcf.h and recompiling
  the BLIS library.
AMD-Internal: [CPUPL-2101]
Change-Id: I8e69b68d53940e306a1d16ffbb65019def7e655a
- auto_factor is to be disabled if BLIS_IC_NT/BLIS_JC_NT is set,
irrespective of whether num_threads (BLIS_NUM_THREADS) is modified at
runtime. Currently auto_factor is enabled if num_threads > 0 and not
reverted if the ic/jc/pc/jr/ir ways are set in bli_rntm_set_ways_from_rntm.
This results in the gemm/gemmt SUP path applying a 2x2 factorization of
num_threads, thereby modifying the preset factorization. The issue is
not observed in the native path, since factorization there happens without
checking the auto_factor value.
- Set the OMP thread count to n_threads using omp_set_num_threads after the
global_rntm n_threads update in bli_thread_set_num_threads. This ensures
that in bli_rntm_init_from_global, omp_get_max_threads returns the same
value as set previously.
AMD-Internal: [CPUPL-2137]
Change-Id: I6c5de0462c5837cfb64793c3e6d49ec3ac2b6426
- sgemv calls a multi-threading friendly kernel whenever it is compiled
with OpenMP and multi-threading enabled. However, it was observed that
this kernel is not suited to scenarios where sgemv is invoked in a
single-threaded context (e.g. sgemv from ST sgemm fringe kernels with
matrix blocking). Falling back to the default single-threaded sgemv
kernel resulted in better performance for this scenario.
AMD-Internal: [CPUPL-2136]
Change-Id: Ic023db4d20b2503ea45e56a839aa35de0337d5a6
- A failure was observed in the zen configuration because the gcc
  flag safe-math-optimization was being used for reference
  kernel compilation.
- Optimized kernels were compiled without this gcc flag, and the
  resulting computation difference caused the test case
  failure.
AMD-Internal: [CPUPL-2121]
Change-Id: I5d86e589cdea633220aecadbcab84d9b88b31f57
- Ensured that FMA and AVX2 based kernels are called only on platforms
  supporting these instructions; otherwise standard 'C' kernels will
  be called.
- Code cleanup for optimization and consistency
AMD-Internal: [CPUPL-2126]
Change-Id: I203270892b2fad2ccc9301fb55e2bae75508e050
Description:
1. Tuned the number of threads to achieve better performance
   for dtrmm
AMD-Internal: [ CPUPL-2100 ]
Change-Id: Ib2e3df224ba76d86185721bef1837cd7855dd593
Details:
- Handled overflow and underflow vulnerabilities in the
  ztrsm small right-side implementations.
- Fixed failures observed in ScaLAPACK testing.
AMD-Internal: [CPUPL-2115]
Change-Id: I22c1ba583e0ba14d1a4684a85fa1ca6e152e8439
Description:
1. Retuned the decision logic that chooses the optimal number of
   threads for given dgemm input dimensions under the aocl dynamic
   feature, based on the latest code.
2. Updated code in a few files to avoid compilation warnings.
3. Added a min check for nt in the bli_sgemv_var1_smart_threading
   function
AMD-Internal: [ CPUPL-2100 ]
Change-Id: I2bc70cc87c73505dd5d2bdafb06193f664760e02
- Cache aware factorization.
  Experiments show that ic,jc factorization based on m,n gives better
  results than factorization based on mu,nu on a generic data set in the
  SUP path. Slight adjustments to the factorization w.r.t. matrix data
  loads can also help improve performance further.
- Moving native path inputs to the SUP path.
  Experiments show that in multi-threaded scenarios, if the per-thread
  data falls under the SUP thresholds, taking the SUP path instead of the
  native path improves performance. This is the case even if the original
  matrix dimensions fall in the native path. This is not applicable if an
  A matrix transpose is required.
- Enabling B matrix packing in the SUP path.
  A performance improvement is observed when the B matrix is packed in
  cases where gemm takes the SUP path instead of the native path based on
  per-thread matrix dimensions.
AMD-Internal: [CPUPL-659]
Change-Id: I3b8fc238a0ece1ababe5d64aebab63092f7c6914
- Implemented an OpenMP based standalone SGEMV kernel for
  row-major (var 1) multithreaded scenarios
- Smart threading is enabled when AOCL DYNAMIC is defined
- The number of threads is decided based on the input dimensions
  using smart threading
AMD-Internal: [CPUPL-1984]
Change-Id: I9b191e965ba7468e95aabcce21b35a533017502e
- Previous zgemm computation failures were due to the status
  variable not having a pre-defined initial value, which caused
  the zgemm computation to return without being computed by any
  kernel. The same change is reflected in the dgemm_ function as
  well.
- Re-enabled sup zgemm, as the status variable issue with the
  bli_zgemm_small call is fixed.
- Removed the call to the sqp method, as it is disabled
Change-Id: I0f4edfd619bc4877ebfc5cb6532c26c3888f919d