-Parametric ReLU is the generalization of leaky ReLU in which the
leakage coefficient is tunable. The support for the same is added
following the register-level fusion technique.
-Low precision bench enhancement to check accuracy/performance of
low precision gemm with PReLU.
-Bug fixes in low precision gemm kernels.
AMD-Internal: [CPUPL-2442]
Change-Id: I81336405b185a994297d122b2d868b758ae6dad5
Feature Addition: Added a new variant of low precision GEMM to addon - BFloat16. The kernel takes bf16 type inputs and perform BF16 GEMM operations. The intermediate accumulation and output are in float.
1. Compute kernels will perform computations only if B matrix is reordered in accordance with the usage of AVX-512 BF16 instruction - dpbf16_ps
2. Kernel for packing B matrix is provided
Change-Id: If5d08213068869eff060c9998596d2d2703a6793
-The micro-kernel function signatures follow a common pattern. These
functions can be represented as an instantiation of a MACRO as is done
in BLIS, and thus the number of micro-kernel header files can be brought
down. A new single header file containing all the MACRO definitions with
the instantiation is added, and the existing unnecessary header files
are removed.
-The bias addition in micro-kernel for n remaining < 16 reads the bias
array assuming it contains 16 elements. This can result in seg-faults,
since out of bound memory is accessed. It is fixed by copying required
elements to an intermediate buffer and using that buffer for loading.
-Input matrix storage type parameter is added to lpgemm APIs. It can be
either row or column major, denoted by r and c respectively. Currently
only row major input matrices are supported.
-Bug fix in s16 fringe micro-kernel to use correct offset while storing
output.
AMD-Internal: [CPUPL-2386]
Change-Id: Idfa23e69d54ad7e06a67b1e36a5b5558fbff03a3
Functionality - Post-ops is an operation performed on every element
of the output matrix after GEMM operation is completed.
- Post-ops relu and bias added to all the compute kernels
of u8s8s16os16
- Post-ops are done on the value loaded into the register
to avoid reloading of C matrix elements
- Minor bug fixes in openmp thread decorator of lpgemm
- Added test cases to lpgemm bench input file
AMD-Internal: [CPUPL-2171]
Change-Id: If49f763fdfac19749f6665c172348691165d8631
- Low precision gemm is often used in ML/DNN workloads and is used
in conjunction with pre and post operations. Performing gemm and ops
together at the micro kernel level results in better overall performance
due to cache/register reuse of output matrix. The provision for defining
the post-operations and invoking the micro-kernel with it from the
framework is added as part of this change. This includes adding new data
structures/functions to define the post-ops to be applied and an
extensible template using which new post-ops can easily be integrated.
As for the post-operations, RELU and Bias Add for u8s8s32 is implemented
in this first cut.
- aocl_gemm bench modifications to test/benchmark RELU and Bias Add.
AMD-Internal: [CPUPL-2316]
Change-Id: Iad5fe9e54965bb52d5381ae459a69800946c7d18
Feature Addition : Added low precision GEMM to addon. The
kernel takes unsigned int8 and signed int8 as inputs and
performs GEMM operation. The intermediate accumulation and
output are in signed int16.
- The compute kernel will perform computation only
if B matrix reordered to suit the usage of AVX2
instruction vpmaddubsw.
- Kernel for packing the B matrix is provided.
- LPGEMM bench code was modified to test the
performance and accuracy of the new variant.
AMD-Internal: [CPUPL-2171]
Change-Id: Id9a6d90b79f4bf82fb2e2f3093974dbf37275f9b
When bli_trsm() API is called, we make sure the "side" argument is "side_t" and
not f77_char and argument is passed by value and not by its address.
Change-Id: I5a616eb054c034be2d67640b8ab3b9615706a8c9
- Low precision gemm sets thread meta data (lpgemm_thrinfo_t) to NULL
when compiled without open mp threading support. Subsequently the code
is executed as if it is single-threaded. However, when B matrix needs
to be packed, communicators are required (irrespective of single or
multi-threaded), and the code accesses lpgemm_thrinfo_t for the same
without NULL check. This results in seg fault.
For the fix, a non-open mp thread decorator layer is added, which
creates a placeholder lpgemm_thrinfo_t object with a communicator before
invoking the 5 loop algorithm. This object will be used for packing.
- Makefile for compilation of aocl_gemm bench.
AMD-Internal: [CPUPL-2304]
Change-Id: Id505235c8421792240b84f93942ca62dac78f3dc
1. Added the checks in .c files of the bench folder to read the input parameters from the given input files on windows using fscanf.
Change-Id: Ie0497696304d318f345a646ab0ce3ba84debd4e2
- Multi-Threaded int8 GEMM (Input - uint8_t, int8_t, Output - int32_t).
AVX512_vnni based micro-kernel for int8 gemm. Paralellization supported
along m and n dimensions.
- Multi-Threaded B matrix reorder support for sgemm. Reordering B matrix
is packing entire B matrix upfront before sgemm. It allows sgemm to
take advantage of packed B matrix without incurring packing costs during
runtime.
- Makefile updates to addon make rules to compile avx512 code for
selected files in addon folder.
- CPU features query enhancements to check for AVX512_VNNI flag.
- Bench for int8 gemm and sgemm with B matrix reorder. Supports
performance mode for benchmarking and accuracy mode for testing code
correctness.
AMD-Internal: [CPUPL-2102]
Change-Id: I8fb25f5c2fbd97d756f95b623332cb29e3b8d182
Description:
1. Decision logic to choose optimal number of threads for
given input dgemm dimensions under aocl dynamic feature
were retuned based on latest code.
2. Updated code in few file to avoid compilation warnings.
3. Added a min check for nt in bli_sgemv_var1_smart_threading
function
AMD-Internal: [ CPUPL-2100 ]
Change-Id: I2bc70cc87c73505dd5d2bdafb06193f664760e02
Details :
- SUP Threshold change for native vs SUP
- Improved the ST performances for sizes n<800
- Introduce PACKB in SUP to improve ST performance between 320<n<800
- 16T SUP Tuning for n<1600.
AMD-Internal: [CPUPL-1981]
Change-Id: Ie59afa4d31570eb0edccf760c088deaa2e10cdda
Details:
- Intrinsic implementation of axpbyv for AVX2
- Bench written for axpbyv
- Added definitions in zen contexts
AMD-Internal: [CPUPL-1963]
Change-Id: I9bc21a6170f5c944eb6e9e9f0e994b9992f8b539
-- Added number of threads used in DTL logs
-- Added support for timestamps in DTL traces
-- Added time taken by API at BLAS layer in the DTL logs
-- Added GFLOPS achieved in DTL logs
-- Added support to enable/disable execution time and
gflops printing for individual API's. We may not want
it for all API's. Also it will help us migrate API's
to execution time and gflops logs in stages.
-- Updated GEMM bench to match new logs
-- Refactored aocldtl_blis.c to remove code duplication.
-- Clean up logs generation and reading to use spaces
consistently to separate various fields.
-- Updated AOCL_gettid() to return correct thread id
when using pthreads.
AMD-Internal: [CPUPL-1691]
Change-Id: Iddb8a3be2a5cd624a07ccdbf5ae0695799d8ae8e
Details:
- BLIS has reserved rs = cs = 1 case only for 1x1 scalars.
- For vectors, even though rs = cs = 1 is a valid input, BLIS
adjusts the strides to satisfy the error checking.
- For an mxn matrix, if m > 1 and n = 1, BLIS sets cs = m
to indicate that this is a column vector stored in column major
order. Similarly BLIS sets rs = n in case of m = 1 and n > 1.
- So determining storage-scheme based on row-stride could lead to
errors if one of the matrices becomes vector.
- Modified bench files to determine storage scheme based on
stor_scheme character instead of checking row-strides.
Change-Id: Id2dc0ea11f0e549ce8e49eb2c393442b33851527
Details
- Passing enum rather than char for uplo, transa, and diaga
- Deleting log file, and other temp files, merged in the codebase from amax
AOCL-Internal: [CPUPL-1591]
Change-Id: Ife85a388b45659aa608a552d18a65fe828b046b2
Details:
- To determine whether matrices are col-stored, we were checking
ldc == 1. This is incorrect as a matrix can be col-stored with ldc = 1
if dimension is 1.
- Modified the condition to check row_stride instead of col stride.
if row-stride != 1, we can assume that matrices are not col-stored
and ignore those inputs by printing an error message.
Change-Id: Id4d5b971104eb11cbcdd6d22c5c620febefd3a87
When op(A) or op(B) = transpose - the leading dimensions of these matrices altered.
Commented out the statements "if(transa) lda = ..." similarly for matrix B and corrected this
mistake in both column and row storages.
Provide a provision to call BLIS interfaces when row-major inputs are used.
Change-Id: Id2041af219a64567471c14190f283274d1df2f7f
- Added bench utility for dotv and scalv API's
- Corrected logging for scalv to handle complex types
- Corrected logging to remove transpose field from dotv logs
AOCL-Internal: [CPUPL-1577]
Change-Id: Ieb29e773309de1520c7fa5b79b97c943d894ba07
- incx and incy was not considered while allocating
memory for x and y vectors.
- Updated test data set
AMD-Internal: [CPUPL-1578]
Change-Id: I374a75aaa66f951f0f8353434d94c135d09b2f05
1. Updated lda, ldb based on trans flags
2. Updated deriving storage type using leading dimension
2. Cleanup and alignment
3. Included transpose and row major cases in inputgemm.txt
Change-Id: I25f5cd522eb64f212445d98f4682132bf5a330b6
Fix reading input parameters
Interchange the reading of n and k, first n appears and then k appears in the logs.
Added comments to explain the format of the input gemmt log.
Change-Id: I44c6081d4449ba210728bc089c4215d5eef18834
Verifying the valid values of m, n, k, lda, ldb and ldc is removed.
Since the bench app is run on logs collected from AOCL traces.
The correct way of checking should consider transpose parameter and storage order.
Change-Id: If0fbf733c2650c6f328661293eb99d062685d638