More changes to standardize copyright formatting and correct years
for some files modified in recent commits.
AMD-Internal: [CPUPL-5895]
Change-Id: Ie95d599710c1e0605f14bbf71467ca5f5352af12
- Updated the format specifiers to have a leading space,
in order to delimit the outputs appropriately in the
output file.
- Further updated every source file to have a leading space
in its format string occuring after the macros.
AMD-Internal: [CPUPL-5895]
Change-Id: If856f55363bb811de0be6fdd1d7bbc8ec5c76c15
- Bug : When configuring our library with the native
BLIS integer size being 32, the bench application
would crash or read an invalid value when parsing
the input file. This is because of a mismatch
of format specifier, that we hardset in the
Makefile.
- Fix : Defined a header that sets the format specifiers
as macros with the right matching, based on how we
configure and build the library. It is expected to
include this header in every source file for
benchmarking.
AMD-Internal: [CPUPL-5895]
Change-Id: I9718c36a1a9fe3eba4d5da419823c16097902d89
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
zen, skx and a couple of other kernels to cover all
contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
statements.
AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
- Remove execute file permission from source and make files.
- dos2unix conversion.
- Add missing eol at end of files.
Also update .gitignore to not exclude build directory but to
exclude any build_* created by cmake builds.
AMD-Internal: [CPUPL-4415]
Change-Id: I5403290d49fe212659a8015d5e94281fe41eb124
Some text files were missing a newline at the end of the file.
One has been added.
AMD-Internal: [CPUPL-3519]
Change-Id: I4b00876b1230b036723d6b56755c6ca844a7ffce
1. OpenMP based multi-threading parallelism is added for BLAS
extension APIs of Pack and Compute
2. Both pack and compute APIs are parallelized.
3. Multi-threading of pack and compute APIs done with different
number of threads can lead to inconsistent results due to
output difference of the full packed matrix buffer when packed
with different number of threads.
4. In multi-threaded execution, we ensure output of packed buffer
is exactly the same as in single threaded execution.
5. Similarly for compute API, read of packed buffer in multi-
threaded execution is exactly the same as in single-threaded
execution.
6. Routines are added to compute the offsets for thread workload
distribution for MT execution.
1. The offsets are calculated in such a way that it resembles
the reorder buffer traversal in single threaded reordering.
2. The panel boundaries (KCxNC) remain as it is accessed in
single thread, and as a consequence a thread with jc_start
inside the panel cannot consider NC range for reorder.
3. It has to work with NC' < NC, and the offset is calulated
using prev NC panels spanning k dim + cur NC panel spaning
pc loop cur iteration + (NC - NC') spanning current
kc0 (<= KC).
7. Routines to ensure the same are added for MT execution
1. frame/base/bli_pack_compute_utils.c
2. frame/base/bli_pack_compute_utils.h
AMD-Internal: [CPUPL-3560]
Change-Id: I0dad33e0062519de807c32f6071e61fba976d9ac
- Added support for 2 new APIs:
1. sgemm_compute()
2. dgemm_compute()
These are dependent on the ?gemm_pack_get_size() and ?gemm_pack()
APIs.
- ?gemm_compute() takes the packed matrix buffer (represented by the
packed matrix identifier) and performs the GEMM operation:
C := A * B + beta * C.
- Whenever the kernel storage preference and the matrix storage
scheme isn't matching, and the respective matrix being loaded isn't
packed either, on-the-go packing has been enabled for such cases to
pack that matrix.
- Note: If both the matrices are packed using the ?gemm_pack() API,
it is the responsibility of the user to pack only one matrix with
alpha scalar and the other with a unit scalar.
- Note: Support is presently limited to Single Thread only. Both, pack
and compute APIs are forced to take n_threads=1.
AMD-Internal: [CPUPL-3560]
Change-Id: I825d98a0a5038d31668d2a4b84b3ccc204e6c158