Commit Graph

108 Commits

Author SHA1 Message Date
Nallani Bhaskar
9735391e1d Implemented f32tobf16 reorder function
Description:
aocl_reorder_f32obf16 function is implemented to
reorder input weight matrix of data type float to
bfloat16.

The reordering is done to match the input requirements
of API aocl_gemm_bf16bf16f32o<f32|bf16>.

The objective of the API is to convert a model/matrix
of type f32 to bf16 and process when machine supports
bf16 FMA instruction _mm512_dpbf16_ps but the model
is still in float

Change-Id: Ib7c743d52d01a1ac09e84ac120577ec9e02f90f5
2024-11-04 04:32:01 +00:00
Mithun Mohan
097cda9f9e Adding support for AOCL_ENABLE_INSTRUCTIONS for f32 LPGEMM API.
-Currently lpgemm sets the context (block sizes and micro-kernels) based
on the ISA of the machine it is being executed on. However this approach
does not give the flexibility to select a different context at runtime.
In order to enable runtime selection of context, the context
initialization is modified to read the AOCL_ENABLE_INSTRUCTIONS env
variable and set the context based on the same. As part of this commit,
only f32 context selection is enabled.
-Bug fixes in scale ops in f32 micro-kernels and GEMV path selection.
-Added vectorized f32 packing kernels for NR=16(AVX2) and NR=64(AVX512).
This is only for B matrix and helps remove dependency of f32 lpgemm api
on the BLIS packing framework.

AMD Internal: [CPUPL-5959]

Change-Id: I4b459aaf33c54423952f89905ba43cf119ce20f6
2024-10-30 08:52:22 +00:00
Meghana Vankadari
b04b8f22c9 Introduced un-reorder API for bf16bf16f32of32
Details:
- Added a new API called unreorder that converts a matrix from
  reordered format to it's original format( row-major or col-major ).
- Currently this API only supports bf16 datatype.
- Added corresponding bench and input file to test accuracy of the
  API.
- The new API is only supported for 'B' matrix.
- Modified input validation checks in reorder API to account for
  row Vs col storage of matrix and transposes for bf16 datatype.

Change-Id: Ifb9c53b7e6da6f607939c164eb016e82514581b7
2024-10-23 07:49:24 -04:00
varshav2
605517964b Add Transpose Kernel for A matrix in F32F32f32Of32
- Implemented the AVX512 packA kernel for col major inputs in F32 API
 - Removed the work arounds for n = 1,  mtag_a = PACK case, where the execution was
   being directed to GEMM instead of GEMV.

Change-Id: I6fb700d96069213a762e8a83a209c5388a91050f
2024-09-19 06:37:11 -04:00
varshav2
7c78b9991f Bug Fixes in the F32F32 m == 1 transpose scenario
- added the missing stride updates in B reorder case in GEMV
 - added the missing stride updates for the cast of transA with B
   reordered case.

Change-Id: Ic89781dfa7c0d9380ea523796958f795828a1ade
2024-09-11 02:08:50 -04:00
Meghana Vankadari
5120f98e12 Developed all WoQ kernels for bf16s4f32o<f32|bf16>
Description:

1. Written 6x64 main and other fringe kernels for WoQ where scaling s4
   weights into bf16 performed in the kernel itself to reduce bandwidth.

2. These kernels are performing better compared to bf16 weights when m
   is small and n is large.

3. Established a threshold to do quantization support at packing of
   B (KCXNC) level or WoQ kernel level.

Change-Id: I4f8265b8b58c276ff2590cc948d1f920aa0bb289
2024-09-10 12:00:10 +00:00
varshav2
298a165718 Add TransA and TransB support for F32F32F32oF32
- Added support for TransA and transB in f32f32of32 APIs
 - Modified the GEMV case(m == 1) to support PACKB feature
 - Redirecting the operations to GEMM instead of GEMV in case of n == 1
   conditions, with storage scheme r/transA and c/transB to avoid the
   packing errors which would lead to failures in computation.

Change-Id: I0eb8c31485af4e33c53fd36b5e5788d75d3a67a9
2024-09-09 05:19:49 +05:30
Mithun Mohan
cf123aa926 Disabling smart threading for small input dimensions.
-It has been observed that reduction of threads as part of smart
threading for smaller input dimensions hampers the performance of the
other inputs with larger dimensions due to lower operating frequency of
the newly launched threads (apart from the existing ones). Disabling
smart threading for these bandwidth bound input patterns (small m and n)
fixes this issue.
-Bug fixes related to work split in LPGEMV for n < NR and m < MR cases.

AMD Internal: [SWLCSG-2948]

Change-Id: I0117dc0ea6820a9fac8e14f93374b54a7d80c121
2024-09-06 09:20:42 -04:00
Meghana Vankadari
2e1cc2f14a Added bf16s4f32 kernels to handle m=4 cases
Details:
- In WOQ, if m = 4, special case kernels are added where
  s4->bf16 conversion happens inside the compute kernel and
  packing is avoided. For all other cases, B matrix is
  dequantized and packed at KC loop level and native bf16
  kernels are re-used at compute level.
- Fixes in bench to avoid accuracy failures when datatype of
  output is bf16.

Change-Id: Ie8db42da536891693d5e82a5336b66514a50ccb2
2024-09-04 07:36:57 -04:00
mkadavil
1257eaf72d Disabling smart threading for bandwidth bound input patterns.
For some applications, one of the input dimension is mostly m < MR or
n < NR with the other dimension being small for the most part, with
intermittent large ones. Currently in these cases (m < MR or n < NR),
the number of threads used is reduced (as part of smart threading) if
the other dimension (n or m) is also small. For larger dimensions all
the threads are used.
However its been observed that this reduction of threads hampers the
performance of the larger inputs due to lower operating frequency of
the newly launched threads (apart from the existing ones). Disabling
smart threading for these bandwidth bound input patterns (m < MR or
n < NR) fixes this issue.

AMD Internal: [SWLCSG-2948]

Change-Id: I5334860cf4411ea4504d2e6bc598b9904780bbbf
2024-09-02 02:18:45 +05:30
varshav2
d4e0fa9b4c Revert duplicate check and fix bug in the check for post-ops
- Revert of patch 1110983 - Duplicate check removal and early return for
	s8s8s32/u8s8s32
- Add fix - Added check to see if post-ops is enabled with col-major
  storage and return early in that case.

Change-Id: Id3b8c97b6d1425dfb06f3b196e5acd60caee8fca
2024-08-29 06:52:14 -04:00
Deepak Negi
6dcf500703 Element wise operations API for float(f32) input matrix in LPGEMM.
This API supports applying element wise operations (eg: post-ops) on a
float(f32) input matrix to get an output matrix of the same (float(f32)).

Change-Id: I387a544f0d33d2231f5f6a92e212f17b1103dd24

AMD Internal: [SWLCSG-2947]

Change-Id: I387a544f0d33d2231f5f6a92e212f17b1103dd24
2024-08-27 03:28:52 -04:00
varshav2
e3c434080a Fix duplicate check and early return in s8s8s32/u8s8s32
- removed the duplicate check for col-major inputs in s8s8s32/u8s8s32
  APIs
- Fixed the print in bench_lpgemm

Change-Id: If40837b89927dd82d8aa6f620d1a7f2c24aed53c
2024-08-23 02:32:20 +05:30
Meghana Vankadari
5514c7a75f Added LPGEMV(n=1) kernels for s8s8s32os32|s8 and s8s8s16os16|s8 APIs
- When n=1, reorder of B matrix is avoided to efficiently
  process data. A dot-product based kernel is implemented to
  perform gemv when n==1.

AMD-Internal: [SWLCSG-2354]
Change-Id: I6b73dfddd9a15e7b914d031646a1d913a7ab4761
2024-08-09 06:17:52 -04:00
Edward Smyth
82bdf7c8c7 Code cleanup: Copyright notices
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
  zen, skx and a couple of other kernels to cover all
  contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
  statements.

AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
2024-08-05 15:35:08 -04:00
Edward Smyth
591a3a7395 Code cleanup: file formats and permissions
- Remove execute file permission from source and make files.
- dos2unix conversion.
- Add missing eol at end of files.

Also update .gitignore to not exclude build directory but to
exclude any build_* created by cmake builds.

AMD-Internal: [CPUPL-4415]
Change-Id: I5403290d49fe212659a8015d5e94281fe41eb124
2024-08-05 11:52:33 -04:00
mkadavil
9f5fec7713 Matrix MUL op support in element wise operations API for bfloat16.
-Matrix MUL op support added in main as well as fringe bfloat16 element
wise operations kernels.
-Benchmarking/testing framework for the same is added.
-Fixed issues in setting up post-ops node index.

AMD Internal: [SWLCSG-2947, SWLCSG-2953]

Change-Id: Iba7561a6a60df41211efbf06fab1b4900207bcf8
2024-08-05 08:29:42 +05:30
Deepak Negi
80bf6249f0 Matrix MUL post-operation support for float(bf16|f32) LPGEMM APIs.
This post-operation computes C = (beta*C + alpha*A*B) * D, where D
is a matrix with dimensions and data type the same as that of C matrix.

AMD-Internal: [SWLCSG-2953]

Change-Id: Id4df2ca76a8f696cb16edbd02c25f621f9a828fd
2024-08-05 08:25:32 -04:00
mkadavil
f040ba617f Element wise operations API for bfloat16 input matrix in LPGEMM.
-This API supports applying element wise operations (eg: post-ops) on a
bfloat16 input matrix to get an output matrix of the same(bfloat16) or
upscaled data type (float).
-Benchmarking/testing framework for the same is added.

AMD Internal: SWLCSG-2947

Change-Id: I43f1c269be1a1997d4912d8a3a97be5e5f3442d2
2024-08-05 07:17:08 -04:00
Meghana Vankadari
d5b4d3aa5e Fixing control flow in aocl_gemm_bf16s4f32of32|bf16
- Fixed framework of bf16s4f32of32 API to correct
  pointer updations.
- Modified pre_op structure to exclude pre-op-offset.
  Now offset is passed as a separate parameter to the
  scale-pack functions.
- Fixed work-distribution among threads in MT scenario.
- Added Blocksizes and kernel-pointers and verified
  functionality for the new API.

AMD-Internal: [SWLCSG-2943]
Change-Id: I58fece240d62c798c880a2b2b7fa64e560cc753d
2024-07-29 05:12:09 -04:00
mkadavil
ec8c39541e Test/benchmark framework updates to test WOQ workflow.
-To enable Weight-only-Quantization (WOQ) workflow, new LPGEMM APIs
have been developed where data types are A:bf16, B:int4 and C:f32/bf16.
The testing and benchmarking framework for the same are added.

AMD-Internal: [SWLCSG-2943]
Change-Id: Icdc1d60819a23dd9f41382499d1a3c055c5edc17
2024-07-25 06:44:37 +05:30
Nallani Bhaskar
c6dd7c1b4b Added new API in aocl_gemm to support A bf16 data type and B s4 data type
Description:

1. Added a new API aocl_gemm_bf16s4f32of32 to support
   for WoQ (Weight-only-Quantization) in LLM's

2. The API supports only reordered B matrix of data
   size signed 4 bits (S4).

3. Substracting zero point and multiplying with scale
   on B matrix is performed in packing B.

4. zero point and scale data should be passed by user
   through pre-ops data structure.

5. The API is still in experimental state and NOT tested.

   AMD-Internal: SWLCSG-2943

Change-Id: I10b159b64c2e2aaf39da5462685618ba8cc800ee
2024-07-25 11:59:03 +00:00
Meghana Vankadari
49949f488f Implemented on-the-go pack kernel for s4->bf16
Details:
- To enable Weight-only-Quantization(WOQ) workflow,
  new LPGEMM APIs are added where datatypes are A: bf16,
  B: int4, C: f32/bf16. To support this, B matrix will
  be reordered with type still being int4. New pack
  kernels that packs the reordered B matrix after
  converting the data from int4 to bf16 and applying
  zero-point and scale are added.

AMD-Internal: [SWLCSG-2943]
Change-Id: Iabe23dab607913c0114b97cb2b91248babeaac03
2024-07-25 04:13:05 +05:30
mkadavil
7114376519 New kernels for int4 B matrix reordering following BF16 kernel schema.
-To enable Weight-only-Quantization (WOQ) workflow, new LPGEMM APIs
are required where data types are A:bf16, B:int4 and C:f32/bf16. It
is expected that the BF16 kernels will be reused within this API and
subsequently the B matrix needs to be reordered following the BF16
kernel schema, but with the reordered matrix type still being int4. To
address this, new BF16 reorder kernels enabling the same are added.

AMD-Internal: [SWLCSG-2943]
Change-Id: Ib770ecbf90a3d906deafece94b1a96e0b9412738
2024-07-25 01:10:13 -04:00
mkadavil
d37c91dffa Quantization (scale + zero point) support for BF16 LPGEMM api.
-Quantization of f32 to bf16 (bf16 = (f32 * scale_factor) + zero_point)
instead of just type conversion in aocl_gemm_bf16bf16f32obf16.
-Support for multiple scale/sum/matrix_add/bias post-ops in a single
LPGEMM api call.
-Post-ops mask related fixes in lpgemv kernels .
-Additional scale post-ops sanity checks.

AMD-Internal: [SWLCSG-2945]
Change-Id: I3b35cc413c176bb50bfdbd6acd4839a5ba7e94bb
2024-07-18 05:32:51 -04:00
mkadavil
a5c4a8c7e0 Int4 B matrix reordering support in LPGEMM.
Support for reordering B matrix of datatype int4 as per the pack schema
requirements of u8s8s32 kernel. Vectorized int4_t -> int8_t conversion
implemented via leveraging the vpmultishiftqb instruction. The reordered
B matrix will then be used in the u8s8s32o<s32|s8> api.

AMD-Internal: [SWLCSG-2390]
Change-Id: I3a8f8aba30cac0c4828a31f1d27fa1b45ea07bba
2024-06-24 07:55:34 -04:00
Meghana Vankadari
c1e063e65c Fix for offset issue while reading constants from JIT code
Details:
- For a variable x, Using address of x in an instruction throws
  exception if the difference between &x and access position is
  larger than 2 GiB. To solve this issue all variables are stored
  within the JIT code section and are accessed using relative addressing.

- Fixed a bug in B matrix pack function for s8s8s32os32 API.
- Fixed a bug in JIT code to apply bias on col-major matrices.

AMD-Internal: [SWLCSG-2820]
Change-Id: I82f117a0422c794cb9b1a4d65a89d60de4adfd96
2024-06-24 07:14:15 -04:00
Meghana Vankadari
c9254bd9e9 Implemented LPGEMV(n=1) for AVX2-INT8 variants
- When n=1, reorder of B matrix is avoided to efficiently
  process data. A dot-product based kernel is implemented to
  perform gemv when n=1.

AMD-Internal: [SWLCSG-2354]
Change-Id: If5f74651ab11232d0b87d34bd05f65aacaea94f1
2024-06-18 12:09:18 +05:30
mkadavil
cd032225ca BF16 bias support for bf16bf16f32ob16.
-As it stands the bf16bf16f32ob16 API expects bias array to be of type
float. However actual use case requires the usage of bias array of bf16
type. The bf16 micro-kernels are updated to work with bf16 bias array by
upscaling it to float type and then using it in the post-ops workflow.
-Corrected register usage in bf16 JIT generator for bf16bf16f32ob16 API
when k > KC.

AMD-Internal: [SWLCSG-2604]
Change-Id: I404e566ff59d1f3730b569eb8bef865cb7a3b4a1
2024-05-23 04:48:20 +05:30
Nallani Bhaskar
29db6eb42b Added transB in all AVX512 based int8 API's
Description:
--Added support for tranB in u8s8s32o<s32|s8> and
  s8s8s32o<s32|s8> API's
--Updated the bench_lpgemm by adding options to
  support transpose of B matrix
--Updated data_gen_script.py in lpgemm bench
  according to latest input format.

AMD-Internal: [SWLCSG-2582]
Change-Id: I4a05cc390ae11440d6ff86da281dbafbeb907048
2024-05-23 03:46:13 +05:30
Meghana Vankadari
3a8b9270e7 Implemented lpgemv for AVX512-INT8 variants
- Implemented optimized lpgemv for both m == 1 and n == 1 cases.
- Fixed few bugs in LPGEMV for bf16 and f32 datatypes.
- Fixed few bugs in JIT-based implementation of LPGEMM for BF16
  datatype.

AMD-Internal: [SWLCSG-2354]
Change-Id: I245fd97c8f160b148656f782d241f86097a0cf38
2024-05-14 01:55:49 +05:30
mkadavil
ec67289601 SWISH post-op support for BF16 JIT based kernels.
SWISH post-op computes swish(x) = x / (1 + exp(-1 * alpha * x)).
SiLU = SWISH with alpha = 1. Adding the support for swish in JIT
based BF16 kernels.

AMD-Internal: [SWLCSG-2387]
Change-Id: I9eea0c801f5f067a5cfbd2941bc991708b86e45e
2024-05-13 01:50:32 -04:00
Meghana Vankadari
1072770c63 Implemented LPGEMV for bf16 datatype
1. The 5 LOOP LPGEMM path is in-efficient when A or B is a vector
   (i.e, m == 1 or n == 1).

2. An efficient implementation is developed considering the b matrix
   reorder in case of m=1 and post-ops fusion.

3. When m = 1 the algorithm divide the GEMM workload in n dimension
   intelligently at a granularity of NR. Each thread work on A:1xk
   B:kx(>=NR) and produce C=1x(>NR).  K is unrolled by 4 along with
   remainder loop.

4. When n = 1 the algorithm divide the GEMM workload in m dimension
   intelligently at a granularity of MR. Each thread work on A:(>=MR)xk
   B:kx1 and produce C = (>=MR)x1. When n=1 reordering of B is avoided
   to efficiently process in n one kernel.

AMD-Internal: [SWLCSG-2355]
Change-Id: I7497dad4c293587cbc171a5998b9f2817a4db880
2024-05-06 23:55:15 +05:30
mkadavil
118e955a22 SWISH post-op support for all LPGEMM APIs.
SWISH post-op computes swish(x) = x / (1 + exp(-1 * alpha * x)).
SiLU = SWISH with alpha = 1.

AMD-Internal: [SWLCSG-2387]
Change-Id: I55f50c74a8583a515f7ea58fa0878ccbcdd6cc26
2024-05-06 06:05:11 -04:00
Meghana Vankadari
75b9d46a40 Fix in LPGEMM for variable BLIS-int size
- Modified all structs that are passed to JIT-generated code to use
  integer of type uint64_t rather than dim_t so that functionality
  is not affected when size of BLIS-internal integer is modified
  during configure time.

Change-Id: Ib81c088072badf13da4ca73be2d4af4551b713d8
2024-05-06 02:56:47 -04:00
Eleni Vlachopoulou
020b9ff7f0 CMake: Enable builds for both static and shared builds for Linux.
- Added BUILD_STATIC_LIBS option which is on by default, only on Linux.
- Added TEST_WITH_SHARED option which is off by default, only on Linux.
- If only shared or static lib is being built, that's the one that will be used for testing.
- If both are being built, TEST_WITH_SHARED determins which library wil be used for testing.
- Set linux workflows so that they build both static and shared libs, and use linux-static and linux-shared to denote which one should be used for testing.
- Set -fPIC for both static and shared builds to fix issues faced when building blis using AOCC 4.0.0 and gtestsuite using gcc 9.4.0.

AMD-Internal: [CPUPL-2748]
Change-Id: I4227bab97ff31ecddfe218e18499f33b4e4ee63e
2024-03-14 10:32:51 -04:00
jagar
e2de45b454 CMake:Added support for ADDON(aocl_gemm) on Windows
CMakelists.txt is updated to support aocl_gemm on windows.
On windows, BLIS library(blis+aocl_gemm) is built successfully
only with AOCC Compiler. (Clang has an issue with optimizing
VNNI instructions).

$cmake .. -DENABLE_ADDON="aocl_gemm" ....

AMD-Internal: [CPUPL-2748]
Change-Id: I9620878ab6934233fadc9ddc5d5e82ad85be9209
2024-03-14 07:57:02 -04:00
Meghana Vankadari
da8fd8c301 Implemented JIT-based microkernel for bf16 datatype
Details:
- Added new folder named JIT/ under addon/aocl_gemm/. This folder
  will contain all the JIT related code.
- Modified lpgemm_cntx_init code to generate main and fringe kernels
  for 6x64 bf16 microkernel and store function pointers to all the
  generated kernels in a global function pointer array. This happens
  only when gcc version is < 11.2
- When gcc version < 11.2, microkernel uses JIT-generated kernels.
  otherwise, microkernel uses the intrinsics based implementation.

AMD-Internal: [SWLCSG-2622]
Change-Id: I16256c797b2546a8cd2049680001947346260461
2024-03-13 05:55:18 +05:30
Bhaskar Nallani
2ce47e6f5e Implemented optimal AVX512-variant of f32 LPGEMV
1. The 5 LOOP LPGEMM path is in-efficient when A or B is a vector
   (i.e, m == 1 or n == 1).

2. An efficient implementation of lpgemv_rowvar_f32 is developed
   considering the b matrix reorder in case of m=1 and post-ops fusion.

3. When m = 1 the algorithm divide the GEMM workload in n dimension
   intelligently at a granularity of NR. Each thread work on A:1xk
   B:kx(>=NR) and produce C=1x(>NR).  K is unrolled by 4 along with
   remainder loop.

4. When n = 1 the algorithm divide the GEMM workload in m dimension
   intelligently at a granularity of MR. Each thread work on A:(>=MR)xk
   B:kx1 and produce C = (>=MR)x1. When n=1 reordering of B is avoided
   to efficiently process in n one kernel.

5. Fixed few warnings while loading 2 f32 bias elements using
   _mm_load_sd using float pointer. Typecasted to (const double *)

AMD-Internal: [SWLCSG-2391, SWLCSG-2353]
Change-Id: If1d0b8d59e0278f5f16b499de1d629e63da5b599
2024-03-04 23:53:23 +05:30
mkadavil
01b7f8c945 Matrix Add post-operation support for integer(s16|s32) LPGEMM APIs.
-This post-operation computes C = (beta*C + alpha*A*B) + D, where D is
a matrix with dimensions and data type the same as that of C matrix.
-For clang compilers (including aocc), -march=znver1 is not enabled for
zen kernels. Have updated CKVECFLAGS to capture the same.

AMD-Internal: [SWLCSG-2424]
Change-Id: Ie369f7ea5c80ab69eea3f3e03a8d9546e14f5c09
2024-02-12 23:51:36 +05:30
eashdash
ef134dc49f Added Trans A feature for all INT8 LPGEMM APIs
1. Added Trans A feature to handle column major inputs
   for A matrix.
2. Trans A is enabled by on-the-go pack of A matrix.
3. The on-the-go pack of A converts a column storage
   MCxKC block of A into row storage MCxKC block as
   LPGEMM kernels are row major kernels.
4. New pack routines are added for conversion of A matrix
   from column major storage to row major storage.
5. LPGEMM Cntx is updated with pack kernel function
   pointers.
6. Packing of A matrix:
   -  Converts column major input A to row major
      in blocks of MCxKC with newly added pack A
      functions when cs_a > 1.
7. Pack routines are added for AVX512 and AVX2
   INT8 LPGEMM APIs.
8. Trans A feature is now supported in:
   1. u8s8s32os32/os8
   2. u8s8s16os16/os8/ou8
   3. s8s8s32os32/os8
   4. s8s8s16os16/os8

AMD-Internal: SWLCSG-2582
Change-Id: I7ce331545525a9a09f3853280615b55fcf2edabf
2024-01-30 03:40:56 -05:00
mkadavil
864170f5cb Scalar value support for zero-point and scale-factor.
-As it stands, in LPGEMM, users are expected to pass an array of values
with length the same as N dimension as inputs for zero point or scale
factor. However at times, a single scalar value is used as zero point
or scale factor for the entire downscaling operation. The mandate to
pass an array requires the user to allocate extra memory and fill it
with the scalar value so as to be used in downscaling. This limitation
is lifted as part of this commit, and now scalar values can be passed
as zero point or scale factor.
-LPGEMM bench enhancements along with new input format to improve
readability as well as flexibility.

AMD-Internal: [SWLCSG-2581]
Change-Id: Ibd0d89f03e1acadd099382dffcabfec324ceb50f
2024-01-12 04:37:35 +05:30
Meghana Vankadari
6567df7b12 bf16bf16f32o<bf16|f32> Fix for scaling issue when transA is enabled.
Details:
- LPGEMM uses bli_pba_acquire_m with BLIS_BUFFER_FOR_A_BLOCK to checkout
  memory when A matrix needs to be packed. This multi-threaded lock
  overhead becomes prominent when m/n dimensions are relatively small,
  even when k is large. In order to address this, bli_pba_acquire_m
  is used with BLIS_BUFFER_FOR_GEN_USE for LPGEMM. For *GEN_USE,
  the memory is allocated using aligned malloc instead of checking
  out from memory pool. Experiments have shown malloc costs to be
  far lower than memory pool guarded by locks, especially for higher
  thread count.
- Deleted few unnecessary instructions from packing kernels.
- Replaced bench_input.txt with lesser number of inputs.

AMD-Internal: [CPUPL-4329]
Change-Id: I5982a0a4df9dc72fab0cffab795c23822d5c8774
2023-12-21 04:53:32 +05:30
Bhaskar Nallani
21d6ab6a21 Improved thread balancing for aocl_gemm f32 API
Description:
1. Updated the thread partition logic for aocl_gemm_f32f32f32of32
   for m<MR, n<NR cases and also balanced thread in m, n directions
   such that each thread gets equal amount of work and not to span
   thread without any work.
2. Disabled dynamic enabling of packing of a and b matrixes for
   smaller sizes for genoa architecture.

AMD-Internal: [SWLCSG-2353 , SWLCSG-2391]
Change-Id: I03b2c50e592c2e9d336ea84c0e0394af63a34cec
2023-11-24 03:45:44 -05:00
Edward Smyth
ed5010d65b Code cleanup: AMD copyright notice
Standardize format of AMD copyright notice.

AMD-Internal: [CPUPL-3519]
Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0
2023-11-23 08:54:31 -05:00
Edward Smyth
f471615c66 Code cleanup: No newline at end of file
Some text files were missing a newline at the end of the file.
One has been added.

AMD-Internal: [CPUPL-3519]
Change-Id: I4b00876b1230b036723d6b56755c6ca844a7ffce
2023-11-22 17:11:10 -05:00
Meghana Vankadari
77bd9a7f17 Added parameter checking for LPGEMM APIs
Change-Id: I6ea89fd0d2516539e5a4e9cd8537570b23194d89
2023-11-09 21:50:55 -05:00
Eleni Vlachopoulou
75a4d2f72f CMake: Adding new portable CMake system.
- A completely new system, made to be closer to Make system.

AMD-Internal: [CPUPL-2748]
Change-Id: I83232786406cdc4f0a0950fb6ac8f551e5968529
2023-11-09 15:49:45 +05:30
Meghana Vankadari
0c12b72651 LPGEMM bench enhancements
Details:
- Moved the downscale & postop options from commmandline to
  input file.
- Now the format of the input file is as follows:
  dt_in dt_out stor transa transb op_a op_b m n k lda ldb ldc postops
- In case of no-postops, 'none' has to be passed in the place of
  postops.
- Removed duplication of mat_mul_bench_main function for bf16 APIs.
- Added a function called print_matrix for each datatype which can
  help in printing matrices while debugging.
- Added printing of ref, computed and diff values while reporting
  failure.
- Added new functions for memory allocation and freeing. Different
  types of memory allocation is chosen based on mode bench is
  running(performance or accuracy mode).

Change-Id: Ia7d740c53035bc76e578a03869590c9f04396b72
2023-11-09 03:55:10 -05:00
mkadavil
ed052c6c44 Smart Threading for LPGEMM <u|s>8s8s<16|32>os<8|16|32> API.
The LPGEMM micro-kernel operates on blocks of dimension MRxKC and
KCxNR. Current LPGEMM design involves using all the available threads
for computing the output. If the number of threads assigned along ic
or jc direction is more than M/MR or N/NR blocks respectively, it
could results in threads sleeping due to the lack of MR or NR blocks.
This scenario is now handled by reducing the number of threads if
there are threads without any work (MR or NR blocks).

AMD-Internal: [SWLCSG-2354, SWLCSG-2389, SWLCSG-2267]
Change-Id: I74819337c7a0d3ab05ea0e18bb42780f977ea8f6
2023-11-09 00:50:30 -05:00