- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
zen, skx and a couple of other kernels to cover all
contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
statements.
AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
- When n=1, reorder of B matrix is avoided to efficiently
process data. A dot-product based kernel is implemented to
perform gemv when n=1.
AMD-Internal: [SWLCSG-2354]
Change-Id: If5f74651ab11232d0b87d34bd05f65aacaea94f1
* commit 'b683d01b':
Use extra #undef when including ba/ex API headers.
Minor preprocessor/header cleanup.
Fixed typo in cpp guard in bli_util_ft.h.
Defined eqsc, eqv, eqm to test object equality.
Defined setijv, getijv to set/get vector elements.
Minor API breakage in bli_pack API.
Add err_t* "return" parameter to malloc functions.
Always stay initialized after BLAS compat calls.
Renamed membrk files/vars/functions to pba.
Switch allocator mutexes to static initialization.
AMD-Internal: [CPUPL-2698]
Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df
Corrections for spelling and other mistakes in code comments
and doc files.
AMD-Internal: [CPUPL-2870]
Change-Id: Ifbb5df7df2d6312fe73e06ee6d41c00b16c593ce
Improvements to BLIS cpuid functionality:
- Tidy names of avx support test functions, especially rename
bli_cpuid_is_avx_supported() to bli_cpuid_is_avx2fma3_supported()
to more accurately describe what it tests.
- Fix bug in frame/base/bli_check.c related to changes in commit
6861fcae91
AMD-Internal: [CPUPL-3031]
Change-Id: Iacd8fb0ffbd45288e536fc6314660709055ea2d5
-Certain sections of the f32 avx512 micro-kernel were observed to
slow down when more post-ops are added. Analysis of the binary
pointed to false dependencies in instructions being introduced in
the presence of the extra post-ops. Addition of vzeroupper at the
beginning of ir loop in f32 micro-kernel fixes this issue.
-F32 gemm (lpgemm) thread factorization tuning for zen4/zen3 added.
-Alpha scaling (multiply instruction) by default was resulting in
performance regression when k dimension is small and alpha=1 in s32
micro-kernels. Alpha scaling is now only done when alpha != 1.
-s16 micro-kernel performance was observed to be regressing when
compiled with gcc for zen3 and older architecture supporting avx2.
This issue is not observed when compiling using gcc with avx512
support enabled. The root cause was identified to be the -fgcse
optimization flag in O2 when applied with avx2 support. This flag is
now disabled for zen3 and older zen configs.
AMD-Internal: [CPUPL-3067]
Change-Id: I5aef9013432c037eb2edf28fdc89470a2eddad1c
-Currently lpgemm can only be built using either zen3 or zen4 config.
The lpgemm kernel code is re-structured to support amdzen, and thus
multi machine deployment.
-The micro-kernel calls (gemm and pack) are currently hardcoded in the
lpgemm framework. This is removed and a new lpgemm_cntx based dispatch
mechanism is designed to support runtime configurability for
micro-kernels.
AMD-Internal: [CPUPL-2965]
Change-Id: I4bbcb4e5db767def1663caf5481f0b4c988149ef
-The micro-kernel function signatures follow a common pattern. These
functions can be represented as an instantiation of a MACRO as is done
in BLIS, and thus the number of micro-kernel header files can be brought
down. A new single header file containing all the MACRO definitions with
the instantiation is added, and the existing unnecessary header files
are removed.
-The bias addition in micro-kernel for n remaining < 16 reads the bias
array assuming it contains 16 elements. This can result in seg-faults,
since out of bound memory is accessed. It is fixed by copying required
elements to an intermediate buffer and using that buffer for loading.
-Input matrix storage type parameter is added to lpgemm APIs. It can be
either row or column major, denoted by r and c respectively. Currently
only row major input matrices are supported.
-Bug fix in s16 fringe micro-kernel to use correct offset while storing
output.
AMD-Internal: [CPUPL-2386]
Change-Id: Idfa23e69d54ad7e06a67b1e36a5b5558fbff03a3
Fucntionality - When the B matrix is not reordered before the
u8s8s16os16 compute kernel call packing of B matrix is done as
part of the five loop algorithm. The state of B matrix (packed
or unpacked) is given as an user input.
- Packing of B matrix is done as part of the five loop
compute.
- Temprorary buffer for pack B is allocated in the five
loop algorithm
- Multithreading for computation kernel
- Configuration constants for u8s8s16os16 are part of the
lpgemm config
AMD-Internal: [CPUPL-2171]
Change-Id: I22b4f0ec7fc29a2add4be0cff7d75f92dd3e60b8
Feature Addition : Added low precision GEMM to addon. The
kernel takes unsigned int8 and signed int8 as inputs and
performs GEMM operation. The intermediate accumulation and
output are in signed int16.
- The compute kernel will perform computation only
if B matrix reordered to suit the usage of AVX2
instruction vpmaddubsw.
- Kernel for packing the B matrix is provided.
- LPGEMM bench code was modified to test the
performance and accuracy of the new variant.
AMD-Internal: [CPUPL-2171]
Change-Id: Id9a6d90b79f4bf82fb2e2f3093974dbf37275f9b