Some text files were missing a newline at the end of the file.
One has been added.
Also correct file format of windows/tests/inputs.yaml, which
was missed in commit 0f0277e104
AMD-Internal: [CPUPL-2870]
Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549
Corrections for spelling and other mistakes in code comments
and doc files.
AMD-Internal: [CPUPL-2870]
Change-Id: Ifbb5df7df2d6312fe73e06ee6d41c00b16c593ce
- Set the variables to zero to avoid the compiler warning
(-Wmaybe-uninitialized) in bli_dgemm_ref_k1.c,
bli_gemm_small.c, bli_trsm_small.c, bli_zgemm_ref_k1.c and
bli_trsm_small_AVX512.c
- Changed the datatype from dim_t to siz_t for i,k,j
in bli_hemv_unf_var1_amd.c and bli_hemv_unf_var3_amd.c to
avoid the compiler warning (-Waggressive-loop-optimizations)
AMD-Internal: [CPUPL-2870]
Change-Id: Ib2bc050fa47cb8a280d719283ab4539c70e19d03
- Fixed segmentation fault that was seen on non zen and non avx2
machines.
- cntx object was not passed to the invoked kernel causing a seg
fault.
AMD-Internal: [CPUPL-3167]
Change-Id: I2640d3f905e78398935cf6ed667b04a6418baa5d
Improvements to BLIS cpuid functionality:
- Tidy names of avx support test functions, especially rename
bli_cpuid_is_avx_supported() to bli_cpuid_is_avx2fma3_supported()
to more accurately describe what it tests.
- Fix bug in frame/base/bli_check.c related to changes in commit
6861fcae91
AMD-Internal: [CPUPL-3031]
Change-Id: Iacd8fb0ffbd45288e536fc6314660709055ea2d5
- In cases when incy != 1, a buffer is created for y vector. The
contents of vector y is scaled by beta and stored in this buffer.
- After performing the compute using ZAXPYF kernel, the results in
y buffer memory is copied back to the orginal buffer using ZCOPYV.
- In cases when alpha is zero, we only scale the y vector by beta
without using the buffer and return.
- The kernels are picked based on the architecture ID. For any zen
based architecture, AVX2 kernels are invoked. For other, the
kernels are invoked based on the context.
- In ZSCAL2V, query for the context if NULL pointer is passed.
AMD-Internal: [CPUPL-2773]
Change-Id: If409ca5c438fc2eebe73480c011577088d52c65f
HPL script was using BLIS manual way to set threading, i.e. setting
BLIS_IC_NT explicitly. This causes bli_rntm_num_threads() to return
-1, which wasn't trapped in parallelised BLAS1 and BLAS2 routines.
Fix: if this occurs, set local number of threads based on product of
BLIS_JC_NT * BLIS_PC_NT * BLIS_IC_NT * BLIS_JR_NT * BLIS_IR_NT values.
Note: BLIS_PC_NT should always be 1, but this environment variable
is currently being read (contrary to documentation), so include it
for now.
Other changes:
* implement _Pragma convention in all code used on AMD
* frame/2/gemv/bli_gemv_unf_var1_amd.c: Remove is_omp_mt_enabled flag
AMD-Internal: [CPUPL-2803]
Change-Id: I37e8b038e5640d6693a87be0609888186322b465
- Removed some additional compiler warnings reported by GCC 12.1
- Fixed a couple of typos in comments
- frame/3/bli_l3_sup.c: routines were returning before final call
to AOCL_DTL_TRACE_EXIT
- frame/2/gemv/bli_gemv_unf_var1_amd.c: bli_multi_sgemv_4x2 is
only defined in header file if BLIS_ENABLE_OPENMP is defined
AMD-Internal: [CPUPL-2460]
Change-Id: I2eacd5687f2548d8f40c24bd1b930859eefbbcde
- While calculating the diagonal and corner elements, the combined
operation of calculating the product of x and x hermitian and
simultaneously scaling it with alpha and adding the result to the matrix
was the cause of increased underflow and overflow errors in netlib
tests.
- So the above calculation is now being done in three steps: scaling x
vector with alpha, then calculating its product with x hermitian and
later adding the final result to the matrix.
AMD-Internal: [CPUPL-2213]
Change-Id: I32df572b013bc3189340662dbf17eddcaec9f0f8
- Completed zen4 configuration support on windows
- Enabled AVX512 kernels for AMAXV
- Added zen4 configuration in amdzen for windows
- Moved all zen4 kernels inside kernels/zen4 folder
AMD-Internal: [CPUPL-2108]
Change-Id: I9d2336998bbcdb8e2c4ca474977b5939bfa578ba
- Implemented optimized her framework calls for double precision complex numbers.
- The zher kernel operates over 4 columns at a time. Initially, it computes the diagonal elements of the matrix, then the 4x4 triangular part is computed and finally the remaining part is computed as 4x4 tiles of the matrix upto m rows.
AMD-Internal: [CPUPL-2151]
Change-Id: I27430ee33ffb901b3ef4bdd97b034e3f748e9cca
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
one of the three ways
-- It is updated to work across platforms.
-- Added in architecture/feature specific runtime checks.
-- Duplicated in AMD specific files. Build system is updated to
pick AMD specific files when library is built for any of the
zen architecture
AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
- Impplemented her2 framework calls for transposed and non
transposed kernel variants.
- dher2 kernel operate over 4 columns at a time. It computes
4x4 triangular part of matrix first and remainder part is
computed in chunk of 4x4 tile upto m rows.
- remainder cases(m < 4) are handled serially.
AMD-Internal: [CPUPL-1968]
Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313
- sgemv calls a multi-threading friendly kernel whenever it is compiled
with open mp and multi-threading enabled. However it was observed that
this kernel is not suited for scenarios where sgemv is invoked in a
single-threaded context (eg: sgemv from ST sgemm fringe kernels and with
matrix blocking). Falling back to the default single-threaded sgemv
kernel resulted in better performance for this scenario.
AMD-Internal: [CPUPL-2136]
Change-Id: Ic023db4d20b2503ea45e56a839aa35de0337d5a6
Description:
1. Decision logic to choose optimal number of threads for
given input dgemm dimensions under aocl dynamic feature
were retuned based on latest code.
2. Updated code in few file to avoid compilation warnings.
3. Added a min check for nt in bli_sgemv_var1_smart_threading
function
AMD-Internal: [ CPUPL-2100 ]
Change-Id: I2bc70cc87c73505dd5d2bdafb06193f664760e02
- Implemented an OpenMP based stand alone SGEMV kernel for
row-major (var 1) for multithread scenarios
- Smart threading is enabled when AOCL DYNAMIC is defined
- Number of threads are decided based on the input dims
using smart threading
AMD-Internal: [CPUPL-1984]
Change-Id: I9b191e965ba7468e95aabcce21b35a533017502e
The framework cleanup was done for linux as part of
f63f78d7 Removed Arch specific code from BLIS framework.
This commit adds changes needed for windows build.
AMD-Internal: [CPUPL-2052]
Change-Id: Ibd503a0adeea66850de156fb95657b124e1c4b9d
- dher2 did not have avx check for platform.
It was calling avx kernel regardless of platform
support. Which resulted in core dump.
- Added avx based platform check in both variant of dher2 for
fixing the issue.
AMD-Internal: [CPUPL-2043]
Change-Id: I1fd1dcc9336980bfb7ffa9376f491f107c889c0b
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
one of the three ways
-- It is updated to work across platforms.
-- Added in architecture/feature specific runtime checks.
-- Duplicated in AMD specific files. Build system is updated to
pick AMD specific files when library is built for any of the
zen architecture
AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
- Impplemented her2 framework calls for transposed and non
transposed kernel variants.
- dher2 kernel operate over 4 columns at a time. It computes
4x4 triangular part of matrix first and remainder part is
computed in chunk of 4x4 tile upto m rows.
- remainder cases(m < 4) are handled serially.
AMD-Internal: [CPUPL-1968]
Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313
- Altered the framework to use 2 more fused kernels for
better problem decomposition
- Increased unroll factor in AXPYF5 and AXPYF8 kernels
to improve register usage
AMD-Internal: [CPUPL-1970]
Change-Id: I79750235d9554466def5ff93898f832834990343
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
one of the three ways
-- It is updated to work across platforms.
-- Added in architecture/feature specific runtime checks.
-- Duplicated in AMD specific files. Build system is updated to
pick AMD specific files when library is built for any of the
zen architecture
AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
- Impplemented her2 framework calls for transposed and non
transposed kernel variants.
- dher2 kernel operate over 4 columns at a time. It computes
4x4 triangular part of matrix first and remainder part is
computed in chunk of 4x4 tile upto m rows.
- remainder cases(m < 4) are handled serially.
AMD-Internal: [CPUPL-1968]
Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313
- Introduced two new ddotxf functions with lower fuse
factor.
- Changed the DGEMV framework to use new kernels to
improve problem decomposition.
Change-Id: I523e158fd33260d06224118fbf74f2314e03a617
-Implemented hemv framework calls for lower and upper
kernel variants.
-hemv computation is implemented in two parts.
One part operate on triangular part of matrix and
the remaining part is computed by dotxfaxpyf kernel.
-First part performs dotxf and axpyf operation on
triangular part of matrix in chunk of 8x8.
Two separate helper function for doing so are implemented
for lower and upper kernels respectively.
-Second part is ddotxaxpyf fused kernel, which performs
dotxf and axpyf operation alltogether on non-triangular
part of matrix in chunk of 4x8.
-Implementation efficiently uses cache memory while computing
for optimal performance.
Change-Id: Id603031b4578e87a92c6b77f710c647acc195c8e
- Added configuration option for zen4 architecture
- Added auto-detection of zen4 architecture
- Added zen4 configuration for all checks related
to AMD specific optimizations
AMD-Internal: [CPUPL-1937]
Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a
Summary:
1. This commit fixed issue for gemv and axpy API’s.
2. The BLIS binary with dynamic dispatch feature was
crashing on non-zen CPUs (specifically CPUs without
AVX2 support).
3. The crash was caused by un-supported instructions
in zen optimized kernels.The issue is fixed by calling
only reference kernels if the architecture detected at
runtime is not zen, zen2 or zen3.
Change-Id: Icc6f7fdc80bc58fac1a97b1502b6f269e5e89aa4
Removed direct calling of zen kernels in ctrsv, ztrsv interface.
The BLIS binary with dynamic dispatch feature was crashing on non-zen CPUs
(specifically CPUs without AVX2 support). The crash was caused by un-supported
instructions in zen optimized kernels.
AMD-Internal: [CPUPL-1930]
Change-Id: I21f25a09cd6ffb013d16c66ea10aa9a42f7cad5b
Removed direct calling of zen kernels in blis interface for
trsm, scalv, swapv.
The BLIS binary with dynamic dispatch feature was crashing on non-zen CPUs
(specifically CPUs without AVX2 support). The crash was caused by un-supported
instructions in zen optimized kernels. The issue is fixed by calling only
reference kernels if the architecture detected at runtime is not zen, zen2 or zen3.
AMD-Internal: [CPUPL-1930]
Change-Id: I7944d131d376e2c4e778fe441a8b030674952b81
Details
- For axpyf implementation there are function(axpyf) calling overhead.
- New implementations reduces function calling overhead.
- This implementation uses kernel of size 8x4.
- This implementation gives better performance for smaller sizes when
compared to axpyf based implementation
AMD-Internal: [CPUPL-1402]
Change-Id: Ic9a5e59363290caf26284548638da9065952fd48
Details
- For axpyf implementation there are function(axpyf) calling overhead.
- New implementations reduces function calling overhead.
- This implementation uses kernel of size 4x4.
- This implementation gives better performance for smaller sizes when
compared to axpyf based implementation
AMD-Internal: [CPUPL-1402]
Change-Id: I5fa421b8c1d2b44c991c2a05e8f5b01b83eb4b37
Details :
- Added packing Of Y for incy >1 cases for dgemv_unf_var2.
- Added packing Of X for incx >1 cases for dgemv_unf_var1.
AMD-Internal: [SWLCSG-735]
Change-Id: Ib395f478ba984a85533e4f79b3521d0b2500c30c
Details:
- Implemented zaxpyf kernel with fuse factor=4 for zgemv.
- Modified BLAS interface call for zgemv to reduce framework overhead.
- Directed gemv to dotv in the case where dimension of y vector is 1.
- when alpha = 0, gemv becomes scalv of Y with beta. Added code to
return early after scaling Y vector with beta.
AMD-Internal: [CPUPL-1402]
Change-Id: I2231285fe3060982d4434466346a040b7ab803fc
Details:
- Implemented axpyf kernel with fuse factor=4 for scomplex datatype.
- Modified BLAS interface call for cgemv to reduce framework overhead.
- Directed gemv to dotv in the case where dimension of y vector is 1.
- when alpha = 0, gemv becomes scalv of Y with beta. Added code to
return early after scaling Y vector with beta.
AMD-Internal: [CPUPL-1402]
Change-Id: Ibaab078008d76953332ba4da3515993578c0e586
Details:
- Placed optimized version of BLAS DGEMM, ZGEMM definitions under
BLIS_CONFIG_EPYC as they use gemm small which are defined only
for zen family configurations.
- Added code to query and set cntx in gemv and trsv framework before
cntx is referred for any function pointers to avoid querying
from NULL pointer.
AMD-Internal: [CPUPL-1562]
Change-Id: I977d028ec4ddb57dcdc70e443e7708f36c01cca9
1. CMake script changes for build with Clang compiler.
2. CMake script changes for build test and testsuite based on the lib type ST/MT
3. CMake script changes for testcpp and blastest
4. Added python scripts to support library build and testsuite build.
AMD Internal : [CPUPL-1422]
Change-Id: Ie34c3e60e9f8fbf7ea69b47fd1b50ee90099c898
Case1: Call TRSV when matrix C & B are vector & A is matrix,
When n = 1 for left side and when m = 1 for right side
Case2: Divide B/A when matrix C & B are vector & A is scalar(Diagonal element),
When m = 1 for left side and when n = 1 for right side
For right side, Transpose complete operation, Change upper to lower and
vice versa when A is being transposed
Change-Id: Ie87e4a263c287ba554832ccc56b629f982e3ac4c
Details:
- Added a new AXPYF kernel with fuse_factor = 4 and iter_unroll = 4.
- Modified blas interface of GEMM to call GEMV whenever m=1 or n=1.
Change-Id: I3f5acd37b009f53cf63f462cec79fd3e73676dbc
Added traces in cblas layer for these API's.
These test drivers didn't have calls for complex data
types, the drivers are updated to support them.
AMD-Internal : [CPUPL-1315]
Change-Id: Ia52ecca68ea17314315d626b57c46a2f5973985b
Details:
- Added SIMD code
- Processing 5 rows at a time in SIMD loop to improve performance
AMD-Internal: [CPUPL-1054]
Change-Id: I2ac93f25895dccfc42e14be0689e6d4e655d6a0a
Details:
- Introduced a new macro 'BLIS_CONFIG_EPYC' to enable blas and cblas
framework optimizations for zen family configurations.
- The macro needs to be defined in family.h files of respective arch
configs.
- Moved zen2-specific optimized kernels to zen folder, in order to be
accessible to all zen family architectures.
Change-Id: I8da2db6b7ab22ef350a01d86c214006e812eb06d