Commit Graph

46 Commits

Author SHA1 Message Date
mkadavil
3870792e62 Low precision gemm s32 downscale optimization.
-The post operations attributes are moved to a new struct
lpgemm_post_op_attr, and an object of this struct is passed to the
low precision gemm kernels in place of the multiple parameters.
-The u8s8s32s8 api (downscale api) performance is low when the k
value is less (k < KC). Two scenarios are observed here:
a. beta = 0: Currently, for downscale api, a temporary buffer is
used to accumulate intermediate s32 output, so that it can be used
in later iterations of pc loop (k dim). The usage of this buffer
(store) can be avoided if k < KC. Here intermediate accumulation
is not required, since the after the first iteration of the pc loop,
the output can be downscaled and stored.
b. beta != 0: In this case the existing values of the original s8 C
output matrix needs to be converted to s32 and beta scaled. Currently
the s8 values are converted to s32 and stored in temporary buffer in
pc loop (5 loop algorithm) in blocks of mxNC. This temporary buffer
is passed to the micro kernel and beta scaling is applied on this.
However the mxNC block copy is costly and can be avoided if a new
condition is introduced for beta scaling in the micro kernel, whereby
the original s8 data is loaded instead of from the temporary buffer
to a register, converted to s32 and beta scaling applied on it.

AMD-Internal: [CPUPL-2884]
Change-Id: Id9b4650d500e1b553e48c4f1e4c902b3f553211c
2023-01-10 13:15:22 +05:30
Harihara Sudhan S
42d631bced Copyright modification
- Added copyright information to modified/newly created
          files missing them

Change-Id: If4e73b680246d0363de09587d6dc54bee00ecd71
2022-10-14 12:43:35 +05:30
eashdash
63864d7dfb Added clipping while downscaling for u8s8s32os8 and u8s8s16os8.
Clipping is done during the downscaling of the accumulated result
from s32 to s8 for u8s8s32os8 and from s16 to s8 for u8s8s16os8,
to saturate the final output values between [-128,127]

AMD-Internal: [LWPZENDNN-493]

Change-Id: Ica9bba5044e87b815e2b4e35809bf440bb9dd41f
2022-10-11 07:28:06 -04:00
Harihara Sudhan S
492555785a Fixed bench accuracy issue in LPGEMM
Description:
- When the value of the result in s8 for u8s8s32 and u8s8s16 are
  close to 0. Values are getting ceiled to 1.
- Used nearbyintf to round the downscaled values in bench reference.
  This fixed the result mismatch issue between the vectorized kernel
  implementation and reference implementation in bench accuracy test.

AMD-Internal: [CPUPL-2617]
Change-Id: Ie42d612b1933bf622e6bd80eaf3db4bcb7a3ce82
2022-10-07 09:48:21 +00:00
eashdash
d21cd51fde Accumulation type for alpha, beta values and BF16 bench integration
1. Correcting the type of alpha, and beta values from C_type
   (output type) to accumulation type.
   For the downscaled LPGEMM APIs, C_type will be the downscaled
   type but the required type for alpha and beta values should
   be the accumulation type.
2. BF16 bench integration with the LPGEMM bench for both the BF16
   (bf16bf16f32of32 and bf16bf16f32obf16) APIs

AMD-Internal: [CPUPL-2561]
Change-Id: I3a99336c743f3880be1b96605ceeeae7c3bd4797
2022-09-23 05:00:49 -04:00
mkadavil
bf4d1da1b9 Column major input support for BFloat16 gemm.
-The bf16 gemm framework is modified to swap input column major matrices
and compute gemm for the transposed matrices (now row major) using the
existing row-major kernels. The output is written to C matrix assuming
it is transposed.
-Framework changes to support leading dimensions that are greater than
matrix widths.
-Bench changes to test low precision gemm for column major inputs.

AMD-Internal: [CPUPL-2570]
Change-Id: I22c76f52619fd76d0c0e41531828b437a1935495
2022-09-22 02:50:46 -04:00
Eleni Vlachopoulou
a5891f7ead Adding AVX2 support for DNRM2
- For the cases where AVX2 is available, an optimized function is called,
based on Blue's algorithm. The fallback method based on sumsqv is used
otherwise.

- Scaling is used to avoid overflow and underflow.

- Works correctly for negative increments.

AMD-Internal: [CPUPL-2551]
Change-Id: I5d8976b29b5af463a8981061b2be907ea647123c
2022-09-20 06:05:01 -04:00
mkadavil
9bc59cc500 Low Precision GEMM framework fixes for downscaling.
- The temporary buffer allocated for C matrix when downscaling is
enabled is not filled properly. This results in wrong gemm accumulation
when beta != 0, and thus wrong output after downscaling. The C panel
iterators used for filling the temporary buffer are updated to fix it.
- Low precision gemm bench updated for testing/benchmarking downscaling.

AMD-Internal: [CPUPL-2514]
Change-Id: Ib1ba25ba9df2d2997edaaf0763ff0113fb35d6eb
2022-09-13 07:42:29 -04:00
mkadavil
584069bf74 Parametric ReLU post-ops support for u8s8s32 and u8s8s16 GEMM.
-Parametric ReLU is the generalization of leaky ReLU in which the
leakage coefficient is tunable. The support for the same is added
following the register-level fusion technique.
-Low precision bench enhancement to check accuracy/performance of
low precision gemm with PReLU.
-Bug fixes in low precision gemm kernels.

AMD-Internal: [CPUPL-2442]
Change-Id: I81336405b185a994297d122b2d868b758ae6dad5
2022-08-25 13:33:02 +05:30
eashdash
4e3e00fb7e Added low precision GEMM - bf16bf16f32of32
Feature Addition: Added a new variant of low precision GEMM to addon - BFloat16. The kernel takes bf16 type inputs and perform BF16 GEMM operations. The intermediate accumulation and output are in float.

1. Compute kernels will perform computations only if B matrix is reordered in accordance with the usage of AVX-512 BF16 instruction - dpbf16_ps
2. Kernel for packing B matrix is provided

Change-Id: If5d08213068869eff060c9998596d2d2703a6793
2022-08-24 03:27:00 -04:00
mkadavil
6fbdfc3cf2 Low precision gemm refactoring and bug fixes.
-The micro-kernel function signatures follow a common pattern. These
functions can be represented as an instantiation of a MACRO as is done
in BLIS, and thus the number of micro-kernel header files can be brought
down. A new single header file containing all the MACRO definitions with
the instantiation is added, and the existing unnecessary header files
are removed.
-The bias addition in micro-kernel for n remaining < 16 reads the bias
array assuming it contains 16 elements. This can result in seg-faults,
since out of bound memory is accessed. It is fixed by copying required
elements to an intermediate buffer and using that buffer for loading.
-Input matrix storage type parameter is added to lpgemm APIs. It can be
either row or column major, denoted by r and c respectively. Currently
only row major input matrices are supported.
-Bug fix in s16 fringe micro-kernel to use correct offset while storing
output.

AMD-Internal: [CPUPL-2386]
Change-Id: Idfa23e69d54ad7e06a67b1e36a5b5558fbff03a3
2022-08-14 17:39:00 +05:30
Harihara Sudhan S
d1eaf65a26 Post-Ops for u8s8s16os16
Functionality - Post-ops is an operation performed on every element
of the output matrix after GEMM operation is completed.

	- Post-ops relu and bias added to all the compute kernels
	  of u8s8s16os16
	- Post-ops are done on the value loaded into the register
	  to avoid reloading of C matrix elements
	- Minor bug fixes in openmp thread decorator of lpgemm
	- Added test cases to lpgemm bench input file

AMD-Internal: [CPUPL-2171]

Change-Id: If49f763fdfac19749f6665c172348691165d8631
2022-08-09 14:52:41 +05:30
mkadavil
828d3cd3d3 Post operations support for low precision gemm.
- Low precision gemm is often used in ML/DNN workloads and is used
in conjunction with pre and post operations. Performing gemm and ops
together at the micro kernel level results in better overall performance
due to cache/register reuse of output matrix. The provision for defining
the post-operations and invoking the micro-kernel with it from the
framework is added as part of this change. This includes adding new data
structures/functions to define the post-ops to be applied and an
extensible template using which new post-ops can easily be integrated.
As for the post-operations, RELU and Bias Add for u8s8s32 is implemented
in this first cut.
- aocl_gemm bench modifications to test/benchmark RELU and Bias Add.

AMD-Internal: [CPUPL-2316]
Change-Id: Iad5fe9e54965bb52d5381ae459a69800946c7d18
2022-08-05 11:53:05 +05:30
Harihara Sudhan S
e5d4fc2a70 Added low precision GEMM (u8s8s16os16)
Feature Addition : Added low precision GEMM to addon. The
kernel takes unsigned int8 and signed int8 as inputs and
performs GEMM operation. The intermediate accumulation and
output are in signed int16.

	- The compute kernel will perform computation only
	  if B matrix reordered to suit the usage of AVX2
	  instruction vpmaddubsw.
	- Kernel for packing the B matrix is provided.
	- LPGEMM bench code was modified to test the
	  performance and accuracy of the new variant.

AMD-Internal: [CPUPL-2171]

Change-Id: Id9a6d90b79f4bf82fb2e2f3093974dbf37275f9b
2022-08-02 02:20:00 -04:00
Kiran Varaganti
6054b888fb Fixed Bug in bench_trsm.c
When bli_trsm() API is called, we make sure the "side" argument is "side_t" and
not f77_char and argument is passed by value and not by its address.

Change-Id: I5a616eb054c034be2d67640b8ab3b9615706a8c9
2022-07-25 15:38:30 +00:00
mkadavil
f63e699c08 Fix for segmentation fault in low precision gemm.
- Low precision gemm sets thread meta data (lpgemm_thrinfo_t) to NULL
when compiled without open mp threading support. Subsequently the code
is executed as if it is single-threaded. However, when B matrix needs
to be packed, communicators are required (irrespective of single or
multi-threaded), and the code accesses lpgemm_thrinfo_t for the same
without NULL check. This results in seg fault.
For the fix, a non-open mp thread decorator layer is added, which
creates a placeholder lpgemm_thrinfo_t object with a communicator before
invoking the 5 loop algorithm. This object will be used for packing.

- Makefile for compilation of aocl_gemm bench.

AMD-Internal: [CPUPL-2304]
Change-Id: Id505235c8421792240b84f93942ca62dac78f3dc
2022-07-21 11:51:40 +05:30
Chandrashekara K R
ff2ee0ae3f AOCL-WINDOWS: Added the windows build system to build bench folder on windows.
1. Added the checks in .c files of the bench folder to read the input parameters from the given input files on windows using fscanf.

Change-Id: Ie0497696304d318f345a646ab0ce3ba84debd4e2
2022-06-27 22:32:39 -04:00
mkadavil
6c112632a7 Low precision gemm integrated as aocl_gemm addon.
- Multi-Threaded int8 GEMM (Input - uint8_t, int8_t, Output - int32_t).
AVX512_vnni based micro-kernel for int8 gemm. Paralellization supported
along m and n dimensions.
- Multi-Threaded B matrix reorder support for sgemm. Reordering B matrix
is packing entire B matrix upfront before sgemm. It allows sgemm to
take advantage of packed B matrix without incurring packing costs during
runtime.
- Makefile updates to addon make rules to compile avx512 code for
selected files in addon folder.
- CPU features query enhancements to check for AVX512_VNNI flag.
- Bench for int8 gemm and sgemm with B matrix reorder. Supports
performance mode for benchmarking and accuracy mode for testing code
correctness.

AMD-Internal: [CPUPL-2102]

Change-Id: I8fb25f5c2fbd97d756f95b623332cb29e3b8d182
2022-06-09 10:28:38 -04:00
Nallani Bhaskar
2acb3f6ed0 Tuned aocl dynamic for specific range in dgemm
Description:

1. Decision logic to choose optimal number of threads for
   given input dgemm dimensions under aocl dynamic feature
   were retuned based on latest code.

2. Updated code in few file to avoid compilation warnings.

3. Added a min check for nt in bli_sgemv_var1_smart_threading
   function

AMD-Internal: [ CPUPL-2100 ]
Change-Id: I2bc70cc87c73505dd5d2bdafb06193f664760e02
2022-05-17 18:10:39 +05:30
mkurumel
ab06f17689 DGEMMT : Tuning SUP threshold to improve ST and MT performance.
Details :
          - SUP Threshold change for native vs SUP
          - Improved the ST performances for sizes n<800
	  - Introduce PACKB in SUP to improve ST performance between 320<n<800
	  - 16T SUP Tuning for n<1600.

        AMD-Internal: [CPUPL-1981]

Change-Id: Ie59afa4d31570eb0edccf760c088deaa2e10cdda
2022-05-17 18:09:22 +05:30
Arnav Sharma
3190e547b0 Optimized AXPBYV Kernel using AVX2 Intrinsics
Details:
- Intrinsic implementation of axpbyv for AVX2
- Bench written for axpbyv
- Added definitions in zen contexts

AMD-Internal: [CPUPL-1963]

Change-Id: I9bc21a6170f5c944eb6e9e9f0e994b9992f8b539
2022-01-05 04:19:11 -05:00
Dipal M Zambare
8f310c3384 AOCL DTL - Added thread and execution time details in logs
-- Added number of threads used in DTL logs
    -- Added support for timestamps in DTL traces
    -- Added time taken by API at BLAS layer in the DTL logs
    -- Added GFLOPS achieved in DTL logs
    -- Added support to enable/disable execution time and
       gflops printing for individual API's. We may not want
       it for all API's. Also it will help us migrate API's
       to execution time and gflops logs in stages.
    -- Updated GEMM bench to match new logs
    -- Refactored aocldtl_blis.c to remove code duplication.
    -- Clean up logs generation and reading to use spaces
       consistently to separate various fields.
    -- Updated AOCL_gettid() to return correct thread id
       when using pthreads.

AMD-Internal: [CPUPL-1691]
Change-Id: Iddb8a3be2a5cd624a07ccdbf5ae0695799d8ae8e
2021-11-12 08:58:54 +05:30
Nageshwar Singh
faeb79f2b9 Trsm bench utility missmatch DTL logs and bench
AOCL-Internal: [CPUPL-1585]
Change-Id: I2896d695e6bb40ec39a4f840240499927de16962
2021-11-12 08:58:52 +05:30
Meghana Vankadari
1944de1cfa Fixed a bug in Level-3 bench files
Details:
- BLIS has reserved rs = cs = 1 case only for 1x1 scalars.
- For vectors, even though rs = cs = 1 is a valid input, BLIS
  adjusts the strides to satisfy the error checking.
- For an mxn matrix, if m > 1 and n = 1, BLIS sets cs = m
  to indicate that this is a column vector stored in column major
  order. Similarly BLIS sets rs = n in case of m = 1 and n > 1.
- So determining storage-scheme based on row-stride could lead to
  errors if one of the matrices becomes vector.
- Modified bench files to determine storage scheme based on
  stor_scheme character instead of checking row-strides.

Change-Id: Id2dc0ea11f0e549ce8e49eb2c393442b33851527
2021-06-22 10:38:11 +05:30
Nageshwar Singh
3002239f83 Added bench utility for swapv API
AMD-Internal: [CPUPL-1591]
Change-Id: I5619d402db49d1f325e4293f3be7a8bc0dde6f15
2021-06-09 17:05:00 +05:30
Nageshwar Singh
6ca50e1b72 Added bench utility for copyv API
AOCL-Internal: [CPUPL-1591]
Change-Id: I00ddad565cb87cd9371d7b1df2b57394fef437e0
2021-06-09 12:29:49 +05:30
Nageshwar Singh
6842c2a30e Bench trsv logging error
Details
  - Passing enum rather than char for uplo, transa, and diaga
  - Deleting log file, and other temp files, merged in the codebase from amax

AOCL-Internal: [CPUPL-1591]
Change-Id: Ife85a388b45659aa608a552d18a65fe828b046b2
2021-06-08 11:54:55 +05:30
Nageshwar Singh
61b7584580 Bench addition for amaxv API
AOCL-Internal: [CPUPL-1591]
Change-Id: Ia9754dfed1a7302d5c267858f9005c8f64e28b46
2021-06-04 17:45:04 +05:30
Nageshwar Singh
ecfbdd16a8 Added bench utility for trsv API
AOCL-Internal: [CPUPL-1591]
Change-Id: I5953e13e9c75f620987ea92d92d1b1d7b5bfd043
2021-06-04 08:05:37 -04:00
Meghana Vankadari
3804e301c9 Fixed a bug in Level-3 bench files where ldc = 1
Details:
- To determine whether matrices are col-stored, we were checking
  ldc == 1. This is incorrect as a matrix can be col-stored with ldc = 1
  if dimension is 1.
- Modified the condition to check row_stride instead of col stride.
  if row-stride != 1, we can assume that matrices are not col-stored
  and ignore those inputs by printing an error message.

Change-Id: Id4d5b971104eb11cbcdd6d22c5c620febefd3a87
2021-06-01 10:57:18 +05:30
Kiran Varaganti
492f54fb5e Fix a bug in bench_gemm.c
When op(A) or op(B) = transpose - the leading dimensions of these matrices altered.
Commented out the statements "if(transa) lda = ..." similarly for matrix B and corrected this
mistake in both column and row storages.
Provide a provision to call BLIS interfaces when row-major inputs are used.

Change-Id: Id2041af219a64567471c14190f283274d1df2f7f
2021-05-24 12:59:28 +05:30
Dipal M Zambare
5f53d14971 Added bench utility for dotv and scalv APIs.
- Added bench utility for dotv and scalv API's
   - Corrected logging for scalv to handle complex types
   - Corrected logging to remove transpose field from dotv logs

AOCL-Internal: [CPUPL-1577]
Change-Id: Ieb29e773309de1520c7fa5b79b97c943d894ba07
2021-05-21 10:00:32 +05:30
Dipal Madhukar Zambare
dac15bdb3f Merge "Added bench utility for ger API." into amd-staging-milan-3.1 2021-05-19 08:17:09 -04:00
Dipal M Zambare
413814fe70 Fixed crash issue in bench utility for gemv API
- incx and incy was not considered while allocating
    memory for x and y vectors.
  - Updated test data set

AMD-Internal: [CPUPL-1578]

Change-Id: I374a75aaa66f951f0f8353434d94c135d09b2f05
2021-05-19 14:21:09 +05:30
Dipal M Zambare
0e82783f1c Added bench utility for ger API.
AOCL-Internal: [CPUPL-1577]
Change-Id: Icc7a4590f605d7273077a7d2a42d4ecbafed2248
2021-05-19 14:05:01 +05:30
Nallani Bhaskar
a59796ef16 Updated leading dimensions for transpose case in gemm bench
1. Updated lda, ldb based on trans flags
2. Updated deriving storage type using leading dimension
2. Cleanup and alignment
3. Included transpose and row major cases in inputgemm.txt

Change-Id: I25f5cd522eb64f212445d98f4682132bf5a330b6
2021-05-14 15:26:20 +05:30
Meghana Vankadari
a3600d395d Added bench app for syrk - input is a log file generated from AOCL_DTL
Change-Id: I25dd695dea267a89a5c666d66abc4b91a57956c8
2021-05-11 14:57:51 +05:30
Dipal Madhukar Zambare
2b80e8824a Merge "Added bench utility for gemv API." into amd-staging-milan-3.1 2021-05-11 01:09:22 -04:00
Dipal M Zambare
08424e8896 Added bench utility for gemv API.
AMD-Internal: [CPUPL-1558]
Change-Id: Iaba1aa164fa589fa7f5047f314b26a24c4c2c3a7
2021-05-10 15:01:47 +05:30
Nageshwar Singh
a88cb82cec Revert "Adding trans h support in bench_gemm.c"
This reverts commit 791903b31c.

Change-Id: I24403cced67ea9e851adb58a8bf01a3e17bb4e85
2021-05-07 04:11:30 -04:00
Kiran Varaganti
433f17b6cd bench_gemmt Bug Fix
Fix reading input parameters

Interchange the reading of n and k, first n appears and then k appears in the logs.
Added comments to explain the format of the input gemmt log.

Change-Id: I44c6081d4449ba210728bc089c4215d5eef18834
2021-05-06 14:54:15 +05:30
Meghana Vankadari
713ca659b5 Added bench app for gemmt - input is a log file generated from AOCL DTL
Change-Id: Ia3390b529244f529d9741c86a6f8dc35a589f714
2021-04-19 09:40:24 +05:30
Nageshwar Singh
791903b31c Adding trans h support in bench_gemm.c
Change-Id: If340d515c38a593df26d5075e29685ef044601a5
2021-03-02 02:33:06 +05:30
Kiran Varaganti
80a516382e Fixed wrong dimensions check in bench/bench_gemm.c application
Verifying the valid values of m, n, k, lda, ldb and ldc is removed.
Since the bench app is run on logs collected from AOCL traces.
The correct way of checking should consider transpose parameter and storage order.

Change-Id: If0fbf733c2650c6f328661293eb99d062685d638
2020-11-20 20:39:20 +05:30
bhaskarn
008fe49df6 Added bench application for trsm
Description:
     Added bench_trsm.c to read inputs from AOCL DTL logs to benchmark
     Added sample input file

Change-Id: I6806e42244bf775cbed457553ca07fb0222ef597
2020-11-09 13:06:39 -05:00
Kiran Varaganti
60642d98a3 Benchmark using AOCL Logs as input
Added benchmark application for gemm - input is a log file generated from AOCL
DTL from BLIS.

Change-Id: I2ac7a3c48d5a37c5b24ec0f0cff7e7886dad0b99
2020-11-06 14:31:53 +05:30