- Updated the format specifiers to have a leading space,
in order to delimit the outputs appropriately in the
output file.
- Further updated every source file to have a leading space
in its format string occurring after the macros.
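
A minimal sketch of the leading-space convention, assuming hypothetical macro
names (DEMO_INT_FS, DEMO_FLOAT_FS) and a hypothetical helper format_result();
the real macro names in the bench header may differ. Each specifier carries its
own leading space, so fields stay delimited however the macros are concatenated:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical macros: every specifier starts with a space. */
#define DEMO_INT_FS   " %" PRId64
#define DEMO_FLOAT_FS " %lf"

/* Writes "<api> <m> <n> <gflops>\n" into buf and returns the length
   reported by snprintf. */
static int format_result(char *buf, size_t len, const char *api,
                         int64_t m, int64_t n, double gflops)
{
    assert(buf != NULL && api != NULL);
    return snprintf(buf, len,
                    "%s" DEMO_INT_FS DEMO_INT_FS DEMO_FLOAT_FS "\n",
                    api, m, n, gflops);
}
```

Without the leading space, adjacent integer fields would run together in the
output file and could not be parsed back reliably.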
AMD-Internal: [CPUPL-5895]
Change-Id: If856f55363bb811de0be6fdd1d7bbc8ec5c76c15
- Bug : When the library is configured with a native
BLIS integer size of 32, the bench application
would crash or read an invalid value when parsing
the input file. This is because of a mismatch
in the format specifier, which was hard-coded in the
Makefile.
- Fix : Defined a header that sets the format specifiers
as macros that match the integer size with which the
library is configured and built. Every source file for
benchmarking is expected to include this header.
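
A sketch of the idea, assuming a hypothetical configuration macro
BENCH_INT_TYPE_SIZE and hypothetical names bench_int_t, BENCH_INT_FS and
parse_dims(); the actual header and macro names are not given in this message.
The specifier is selected at compile time so that it always matches the width
of the integer type being read:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the integer size chosen at configure time. */
#ifndef BENCH_INT_TYPE_SIZE
#define BENCH_INT_TYPE_SIZE 32
#endif

#if BENCH_INT_TYPE_SIZE == 32
typedef int32_t bench_int_t;
#define BENCH_INT_FS " %" SCNd32
#else
typedef int64_t bench_int_t;
#define BENCH_INT_FS " %" SCNd64
#endif

/* Parses "m n k" from one input line; returns 1 on success, 0 otherwise. */
static int parse_dims(const char *line,
                      bench_int_t *m, bench_int_t *n, bench_int_t *k)
{
    assert(line != NULL && m != NULL && n != NULL && k != NULL);
    return sscanf(line, BENCH_INT_FS BENCH_INT_FS BENCH_INT_FS, m, n, k) == 3;
}
```

Reading a 32-bit integer with a 64-bit specifier (or the reverse) is undefined
behavior, which is consistent with the crash or invalid values described above.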
AMD-Internal: [CPUPL-5895]
Change-Id: I9718c36a1a9fe3eba4d5da419823c16097902d89
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
zen, skx and a couple of other kernels to cover all
contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
statements.
AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
- Remove execute file permission from source and make files.
- dos2unix conversion.
- Add missing eol at end of files.
Also update .gitignore to not exclude build directory but to
exclude any build_* created by cmake builds.
AMD-Internal: [CPUPL-4415]
Change-Id: I5403290d49fe212659a8015d5e94281fe41eb124
Initialized c_save instead of "c" and removed the copy of c to c_save.
At the start of every n_repeats iteration we copy c_save back to c.
Therefore, if we initialize c_save, we can avoid the extra copy of "c" to c_save before calling
GEMM. For very large sizes, matrix initialization takes a considerable amount of time. This is
now reduced.
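
The loop structure described above can be sketched as follows. This is an
illustration only: fake_gemm() and run_repeats() are hypothetical stand-ins,
not the bench code itself.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the timed GEMM call: adds 1 to every element. */
static void fake_gemm(double *c, size_t len)
{
    for (size_t i = 0; i < len; ++i) c[i] += 1.0;
}

/* Runs n_repeats iterations and returns c[0] after the last one,
   or -1.0 if allocation fails. */
static double run_repeats(size_t len, int n_repeats)
{
    assert(len > 0 && n_repeats > 0);
    double *c = malloc(len * sizeof *c);
    double *c_save = malloc(len * sizeof *c_save);
    if (c == NULL || c_save == NULL) { free(c); free(c_save); return -1.0; }

    /* Only c_save is initialized; there is no initial copy of c to c_save. */
    for (size_t i = 0; i < len; ++i) c_save[i] = (double)(i % 7);

    for (int r = 0; r < n_repeats; ++r) {
        memcpy(c, c_save, len * sizeof *c); /* restore C before each run */
        fake_gemm(c, len);
    }

    double first = c[0];
    free(c);
    free(c_save);
    return first;
}
```

Because c is overwritten from c_save at the top of every iteration, each run
starts from the same data regardless of how many repeats came before it.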
Change-Id: I2c6ffe169e991607314897cb0c1fbfc0d74ef179
Details:
- AOCL BLIS now uses the AVX512 32x6 DGEMM kernel for the native code path.
Thanks to Moore, Branden <Branden.Moore@amd.com> for suggesting and
implementing these optimizations.
- In the initial version of the 32x6 DGEMM kernel, to broadcast elements of packed B
we performed a load into xmm (2 elements), broadcast into zmm from xmm, and then, to get the
next element, did vpermilpd(xmm). This logic is replaced with a direct broadcast from
memory. Since the elements of Bpack are stored contiguously, the first broadcast fetches
the cacheline and subsequent broadcasts happen faster. We use two registers for broadcast
and interleave the broadcast operations with FMAs to hide any memory latencies.
- Native dTRSM uses 16x14 dgemm, therefore we need to override the default blkszs (MR,NR,..)
when executing trsm. We call bli_zen4_override_trsm_blkszs(cntx_local) on a local cntx_t object
for the double data-type as well in the functions bli_trsm_front() and bli_trsm_xx_ker_var2, xx = {ll,lu,rl,ru}.
Renamed "BLIS_GEMM_AVX2_UKR" to "BLIS_GEMM_FOR_TRSM_UKR", and in bli_cntx_init_zen4() we replaced
the dgemm kernel for TRSM with the 16x14 dgemm kernel.
- New packm kernels - 16xk, 24xk and 32xk are added.
- New 32xk packm reference kernel is added in bli_packm_cxk_ref.c and it is
enabled for zen4 config (bli_dpackm_32xk_zen4_ref() )
- Copyright year updated for modified files.
- Cleaned up code for "zen" config - removed unused packm kernel declarations in kernels/zen/bli_kernels.h
- [SWLCSG-1374], [CPUPL-2918]
Change-Id: I576282382504b72072a6db068eabd164c8943627
1. Added checks in the .c files of the bench folder to read the input parameters from the given input files on Windows using fscanf.
Change-Id: Ie0497696304d318f345a646ab0ce3ba84debd4e2
-- Added number of threads used in DTL logs
-- Added support for timestamps in DTL traces
-- Added time taken by API at BLAS layer in the DTL logs
-- Added GFLOPS achieved in DTL logs
-- Added support to enable/disable execution time and
gflops printing for individual APIs. We may not want
it for all APIs. It will also help us migrate APIs
to execution time and gflops logs in stages.
-- Updated GEMM bench to match new logs
-- Refactored aocldtl_blis.c to remove code duplication.
-- Clean up logs generation and reading to use spaces
consistently to separate various fields.
-- Updated AOCL_gettid() to return correct thread id
when using pthreads.
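
As an illustration of the GFLOPS figure mentioned above, here is a sketch that
assumes the conventional 2*m*n*k operation count for a real GEMM. Whether the
DTL code uses exactly this count is an assumption, and gemm_gflops() is a
hypothetical name:

```c
#include <assert.h>

/* GFLOPS for a real GEMM of size m x n x k that took `seconds` to run,
   assuming the conventional count of 2*m*n*k floating-point operations. */
static double gemm_gflops(double m, double n, double k, double seconds)
{
    assert(seconds > 0.0);
    return (2.0 * m * n * k) / (seconds * 1.0e9);
}
```

For example, a 1000 x 1000 x 1000 GEMM that completes in one second comes out
at 2 GFLOPS under this count.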
AMD-Internal: [CPUPL-1691]
Change-Id: Iddb8a3be2a5cd624a07ccdbf5ae0695799d8ae8e
Details:
- BLIS has reserved rs = cs = 1 case only for 1x1 scalars.
- For vectors, even though rs = cs = 1 is a valid input, BLIS
adjusts the strides to satisfy the error checking.
- For an mxn matrix, if m > 1 and n = 1, BLIS sets cs = m
to indicate that this is a column vector stored in column major
order. Similarly BLIS sets rs = n in case of m = 1 and n > 1.
- So determining the storage scheme based on row-stride could lead to
errors if one of the matrices becomes a vector.
- Modified bench files to determine storage scheme based on
stor_scheme character instead of checking row-strides.
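
A small sketch contrasting the two approaches. The function names are
hypothetical, and the characters 'c' and 'r' for column and row storage are an
assumption about the input file format:

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* Decide storage from the stor_scheme character read from the input
   (assumed to be 'c'/'C' for column-major, 'r'/'R' for row-major). */
static bool is_col_major(char stor_scheme)
{
    char s = (char)tolower((unsigned char)stor_scheme);
    assert(s == 'c' || s == 'r');
    return s == 'c';
}

/* The stride-based guess this change moves away from: it treats a
   matrix as column-major only when its row stride is 1. */
static bool guess_col_major_from_stride(long rs)
{
    return rs == 1;
}
```

For a 1 x 5 operand whose strides were adjusted to rs = 5, the stride-based
guess reports "not column-major" even when the input requested column storage,
while the character-based check is unaffected.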
Change-Id: Id2dc0ea11f0e549ce8e49eb2c393442b33851527
Details:
- To determine whether matrices are col-stored, we were checking
ldc == 1. This is incorrect as a matrix can be col-stored with ldc = 1
if dimension is 1.
- Modified the condition to check row_stride instead of col stride.
If row-stride != 1, we can assume that the matrices are not col-stored
and ignore those inputs by printing an error message.
Change-Id: Id4d5b971104eb11cbcdd6d22c5c620febefd3a87
When op(A) or op(B) = transpose, the leading dimensions of these matrices were altered.
Commented out the statements "if(transa) lda = ..." (and similarly for matrix B) to correct this
mistake in both column and row storage.
Added a provision to call BLIS interfaces when row-major inputs are used.
Change-Id: Id2041af219a64567471c14190f283274d1df2f7f
1. Updated lda, ldb based on trans flags
2. Updated deriving storage type using leading dimension
3. Cleanup and alignment
4. Included transpose and row major cases in inputgemm.txt
Change-Id: I25f5cd522eb64f212445d98f4682132bf5a330b6
Removed the verification of valid values of m, n, k, lda, ldb and ldc,
since the bench app is run on logs collected from AOCL traces.
A correct check would need to consider the transpose parameter and storage order.
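
For reference, a sketch of what such a check could look like, following the
usual BLAS/CBLAS leading-dimension rules. min_ld() is a hypothetical helper,
not part of the bench code, and the 'c'/'r' and 'n'/'t' characters are
assumptions about the input encoding:

```c
#include <assert.h>
#include <stdbool.h>

/* Smallest valid leading dimension for an operand X, where op(X) is
   op_rows x op_cols. `stor` is 'c' (column-major) or 'r' (row-major);
   `trans` is 'n' for no transpose, anything else for (conjugate) transpose. */
static long min_ld(char stor, char trans, long op_rows, long op_cols)
{
    assert(op_rows >= 0 && op_cols >= 0);
    bool col_major = (stor == 'c' || stor == 'C');
    bool no_trans  = (trans == 'n' || trans == 'N');

    /* Dimensions of X as actually stored in memory. */
    long stored_rows = no_trans ? op_rows : op_cols;
    long stored_cols = no_trans ? op_cols : op_rows;

    /* Column-major steps over a stored column; row-major over a stored row. */
    long ld = col_major ? stored_rows : stored_cols;
    return ld > 1 ? ld : 1;
}
```

For matrix A in GEMM, op(A) is m x k, so a check would compare lda against
min_ld(stor, transa, m, k); for B it would use min_ld(stor, transb, k, n).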
Change-Id: If0fbf733c2650c6f328661293eb99d062685d638