154 Commits

Author SHA1 Message Date
varshav2
ef04388a44 Added AVX2 support for BF16 kernels: Row major
- Currently the BF16 kernels use the AVX512 VNNI instructions.
   In order to support AVX2 kernels, the BF16 input has to be converted
   to F32 and then the F32 kernels have to be executed.
 - Added un-pack function for the B-Matrix, which does the unpacking of
   the Re-ordered BF16 B-Matrix and converts it to Float.
 - Added a kernel to convert the matrix data from BF16 to F32 for the
   given input.
 - Added a new path to the BF16 5LOOP to work with the BF16 data, where
   the packed/unpacked A matrix is converted from BF16 to F32. The
   packed B matrix is converted from BF16 to F32 and the re-ordered B
   matrix is unre-ordered and converted to F32 before feeding to the
   F32 micro kernels.
 - Removed AVX512 condition checks in BF16 code path.
 - Added the Re-order reference code path to support BF16 AVX2.
 - Currently the F32 AVX2 kernels support only F32 BIAS.
   Added BF16 support for BIAS post-op in F32 AVX2 kernels.
 - Bug fix in the test input generation script.
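
The BF16 to F32 conversion described above can be illustrated in a few lines. This is a minimal sketch, not the kernel from this commit: it relies only on the fact that a bfloat16 value is the upper 16 bits of an IEEE-754 float32, and the function names are hypothetical.

```python
import struct


def bf16_bits_to_f32(bits: int) -> float:
    """Widen one bfloat16 bit pattern (given as an int) to a float value."""
    # bfloat16 is the upper half of a float32, so widening is a 16-bit
    # left shift with the low mantissa bits filled with zeros.
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def convert_matrix_bf16_to_f32(rows):
    """Convert a row-major matrix of bf16 bit patterns to float values."""
    return [[bf16_bits_to_f32(b) for b in row] for row in rows]


# Hypothetical 2x2 input holding 1.0, 2.0, -3.0 and 0.0 in bf16.
a_bf16 = [[0x3F80, 0x4000], [0xC040, 0x0000]]
a_f32 = convert_matrix_bf16_to_f32(a_bf16)
```

Because the widening is exact, the F32 kernels see the same values the BF16 inputs encoded; only the accumulation precision differs from the VNNI path.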

AMD Internal : [SWLCSG - 3281]

Change-Id: I1f9d59bfae4d874bf9fdab9bcfec5da91eadb0fb
2025-02-10 08:18:52 -05:00
Meghana Vankadari
da3d0c6034 Added new Int8 batch_gemm APIs
Details:
- Added u8s8s32of32|bf16|u8 batch_gemm APIs.
- Fixed some bugs in bench file for bf16 API.

Change-Id: I55380238869350a848f2deec0641d7b9b416b192
2025-02-10 11:19:02 +00:00
Deepak Negi
3a7523b51b Element wise post-op APIs are upgraded with new post-ops
Description:

1. Added new output types for f32 element wise API's to support
   s8, u8, s32 , bf16 outputs.

2. Updated the base f32 API to support all the post-ops supported in
   gemm API's

AMD Internal: [SWLCSG-3384]

Change-Id: I1a7caac76876ddc5a121840b4e585ded37ca81e8
2025-02-10 01:06:39 -05:00
Edward Smyth
1f0fb05277 Code cleanup: Copyright notices (2)
More changes to standardize copyright formatting and correct years
for some files modified in recent commits.

AMD-Internal: [CPUPL-5895]
Change-Id: Ie95d599710c1e0605f14bbf71467ca5f5352af12
2025-02-07 05:41:44 -05:00
Mithun Mohan
b9f6286731 Tiny GEMM path for BF16 LPGEMM API.
-Currently the BF16 API uses the 5 loop algorithm inside the OMP loop
to compute the results, irrespective of the input sizes. However it
was observed that for very tiny sizes (n <= 128, m <= 36), this OMP
loop and NC,MC,KC loops were turning out to be overheads.
-In order to address this, a new path without OMP loop and just the
NR loop over the micro-kernel is introduced for tiny inputs. This is
only applied when the num threads set for GEMM is 1.
-Only row major inputs are allowed to proceed with tiny GEMM.
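
The selection rule described above can be sketched as a simple predicate. The thresholds come from this commit message; the function and parameter names are hypothetical, and the real dispatch code may check further conditions.

```python
TINY_N_MAX = 128  # n <= 128, per the commit message
TINY_M_MAX = 36   # m <= 36, per the commit message


def use_tiny_gemm_path(m: int, n: int, num_threads: int, row_major: bool) -> bool:
    """Return True when the input qualifies for the tiny GEMM path."""
    # The tiny path skips the OMP loop and the NC/MC/KC loops, so it is
    # only taken for single-threaded, row-major, very small problems.
    return (
        row_major
        and num_threads == 1
        and m <= TINY_M_MAX
        and n <= TINY_N_MAX
    )
```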

AMD-Internal: [SWLCSG-3380, SWLCSG-3258]

Change-Id: I9dfa6b130f3c597ca7fcf5f1bc1231faf39de031
2025-02-07 04:37:11 -05:00
Meghana Vankadari
c47f0f499f Fixed bug in testing matrix_mul post_op
Details:
- Added a new python script that can test all microkernels
  along with post-ops.
- Modified post_op freeing function to avoid memory leaks.

Change-Id: Iedba84e8233a88ca9261596c4c7e0a65c196b7e7
2025-02-07 02:27:14 +05:30
Deepak Negi
86e52783e4 Tiny GEMM path for F32 LPGEMM API.
-Currently the F32 API uses the 5 loop algorithm inside the OMP loop
to compute the results, irrespective of the input sizes. However it
was observed that for very tiny sizes (n <= 128, m <= 36), this OMP
loop and NC,MC,KC loops were turning out to be overheads.
-In order to address this, a new path without OMP loop and just the
NR loop over the micro-kernel is introduced for tiny inputs. This is
only applied when the num threads set for GEMM is 1.

AMD-Internal: [SWLCSG-3380]

Change-Id: Ia712a0df19206b57efe4c97e9764d4b37ad7e275
2025-02-06 23:36:44 -05:00
Vignesh Balasubramanian
8abb37a0ad Update to AOCL-BLAS bench application for logging outputs
- Updated the format specifiers to have a leading space,
  in order to delimit the outputs appropriately in the
  output file.

- Further updated every source file to have a leading space
  in its format string occurring after the macros.

AMD-Internal: [CPUPL-5895]
Change-Id: If856f55363bb811de0be6fdd1d7bbc8ec5c76c15
2025-02-06 22:59:59 +05:30
Deepak Negi
2e687d8847 Updated all post-ops in s8s8s32 API to operate in float precision
Description:

1. Changed all post-ops in s8s8s32o<s32|s8|u8|f32|bf16> to operate
   on float data. The s32 accumulator registers are converted to
   float at the end of the k loop, and every post-op then operates
   on f32.

2. Added s8s8s32ou8 API which uses s8s8s32os32 kernels but stores
   the output in u8
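
The flow described above (convert the s32 accumulator to float once, run the post-ops in f32, then narrow to the output type) can be sketched as follows. The particular post-ops, their order, and the round-and-saturate behaviour of the u8 store are assumptions for illustration, not details taken from the kernels.

```python
def apply_post_ops_f32(acc_s32: int, bias: float, scale: float) -> float:
    """Convert an int32 accumulator to float, then apply two sample post-ops."""
    x = float(acc_s32)  # single conversion, after the k loop has finished
    x = x + bias        # sample post-op: bias add (order is an assumption)
    x = x * scale       # sample post-op: scaling
    return x


def store_as_u8(x: float) -> int:
    """Round to nearest (ties to even) and saturate to [0, 255] (assumed)."""
    return max(0, min(255, int(round(x))))


in_range = store_as_u8(apply_post_ops_f32(1000, 0.5, 0.1))
below = store_as_u8(apply_post_ops_f32(-50, 0.5, 0.1))
above = store_as_u8(apply_post_ops_f32(5000, 0.5, 0.1))
```

Keeping every post-op in f32 means the same post-op code can serve all output types; only the final store differs.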

AMD-Internal - SWLCSG-3366

Change-Id: Iadfd9bfb98fc3bf21e675acb95553fe967b806a6
2025-02-06 07:31:28 -05:00
Meghana Vankadari
13e7ada3f2 Modified bench to test different types of post-ops
- Modified bench to support testing of different types of buffers
  for bias, mat_add and mat_mul postops.
- Added support for testing integer APIs with float accumulation
  type.

Change-Id: I72364e9ad25e6148042b93ec6d152ff82ea03e96
2025-02-06 02:38:08 +05:30
Mithun Mohan
0701a4388a Thread factorization improvements (ic ways) for BF16 LPGEMM API.
-Currently when m is small compared to n, even if MR blks (m / MR) > 1,
and total work blocks (MR blks * NR blks) < available threads, the
number of threads assigned for m dimension (ic ways) is 1. This results
in sub-par performance in bandwidth-bound cases. To address this, the
thread factorization is updated to increase ic ways for these cases.

AMD-Internal: [SWLCSG-3333]

Change-Id: Ife3eafc282a2b62eb212af615edb7afa40d09ae9
2025-02-06 00:51:10 -05:00
Hari Govind S
67322416d3 Added support to benchmark ASUMV APIs
- Implemented the feature to benchmark ?ASUMV APIs
  for the supported datatypes. The feature allows benchmarking
  the BLAS, CBLAS or native BLIS API, based
  on the macro definition.

- Added a sample input file to provide examples to benchmark
  ASUMV for all supported datatypes.

AMD-Internal: [CPUPL-5984]
Change-Id: Iff512166545687d12504babda1bd52d71a3a5755
2025-01-31 06:04:16 -05:00
Vignesh Balasubramanian
0e71d28c01 Additional bug-fix for AOCL-BLAS bench
- Corrected the format specifier setting(as macro) to not
  include additional spaces, since this would cause incorrect
  parsing of input files(in case they have exactly the expected
  number of parameters and not more).

- Updated the inputgemm.txt file to contain some inputs that
  have the exact parameters, to validate this fix.

AMD-Internal: [CPUPL-6365]
Change-Id: Ie9a83d4ed7e750ff1380d00c9c182b0c9ed42c49
2025-01-30 08:28:14 -05:00
Nallani Bhaskar
805bd10353 Updated all post-ops in u8s8s32 API to operate in float precision
Description:

1. Changed all post-ops in u8s8s32o<s32|s8|u8|f32|bf16> to operate
   on float data. The s32 accumulator registers are converted to
   float at the end of the k loop, and every post-op then operates
   on f32.

2. Added u8s8s32ou8 API which uses u8s8s32os32 kernels but stores
   the output in u8

AMD-Internal - SWLCSG-3366

Change-Id: Iab1db696d3c457fb06045cbd15ea496fd4b732a5
2025-01-29 04:21:17 -05:00
Vignesh Balasubramanian
445327f255 Bugfix for AOCL-BLAS bench application
- Bug : When configuring our library with the native
        BLIS integer size being 32, the bench application
        would crash or read an invalid value when parsing
        the input file. This is because of a mismatch
        in the format specifier, which was hard-coded in the
        Makefile.

- Fix : Defined a header that sets the format specifiers
        as macros with the right matching, based on how we
        configure and build the library. It is expected to
        include this header in every source file for
        benchmarking.

AMD-Internal: [CPUPL-5895]
Change-Id: I9718c36a1a9fe3eba4d5da419823c16097902d89
2025-01-29 03:25:57 -05:00
Deepak Negi
db407fd202 Added F32 bias type support, F32, BF16 output type support in int8 APIs
Description:

1. Added u8s8s32of32,u8s8s32obf16, s8s8s32of32 and s8s8s32obf16 APIs.
   Where the inputs are uint8/int8 and the processing is done using
   VNNI but the output is stored in f32 and bf16 formats. All the int8
   kernels are reused and updated with the new output data types.

2. Added F32 data type support in bias.

3. Updated the bench and bench input file to support validation.

AMD-Internal: SWLCSG-3335

Change-Id: Ibe2474b4b8188763a3bdb005a0084787c42a93dd
2025-01-26 11:38:30 -05:00
Mithun Mohan
39289858b7 Packed A matrix stride update to account for fringe cases.
-When A matrix is packed, it is packed in blocks of MRxKC, to form a
whole packed MCxKC block. If the m value is not a multiple of MR, then
the m % MR block is packed in a different manner as opposed to the MR
blocks. Subsequently the strides of the packed MR block and m % MR
blocks are different and the same needs to be updated when calling the
GEMV kernels with packed A matrix.
-Fixes to address compiler warnings.

AMD-Internal: [SWLCSG-3359]
Change-Id: I7f47afbc9cd92536cb375431d74d9b8bca7bab44
2025-01-22 05:42:30 -05:00
Meghana Vankadari
69ca5dbcd6 Fixed compilation errors for gcc versions < 11.2
Details:
- Disabled intrinsics code of f32obf16 pack function
  for gcc < 11.2 as the instructions used in kernels
  are not supported by the compiler versions.
- Added early-return check for WOQ APIs when compiling with
  gcc < 11.2
- Fixed code to check whether JIT kernels are generated inside
  batch_gemm API for bf16 datatype.

AMD Internal: [CPUPL-6327]

Change-Id: I0a017c67eb9d9d22a14e095e435dc397e265fb0a
2025-01-21 07:13:31 -05:00
Deepak Negi
182a6373b5 Added support to specify bias data type in u8s8s32/s8s8s32 API's
Description:
1. The bias type was supported only based on output data type.
2. The option is added in the pre-ops structure to select the bias data
   type(s8/s32/bf16) irrespective of the storage data type in
   u8s8s32/s8s8s32 API's.

AMD-Internal: SWLCSG-3302

Change-Id: I3c465fe428672d2d58c1c60115c46d2d5b11f0f4
2025-01-15 05:56:26 -05:00
Meghana Vankadari
852cdc6a9a Implemented batch_matmul for f32 & int8 datatypes
Details:
- The batch matmul performs a series of matmuls, processing
  more than one GEMM problem at once.
- Introduced a new parameter called batch_size for the user
  to indicate number of GEMM problems in a batch/group.
- This operation supports processing GEMM problems with
  different parameters including dims, post-ops, stor-schemes etc.
- This operation is optimized for problems where all the
  GEMMs in a batch are of same size and shape.
- For now, the threads are distributed among different GEMM
  problems equally irrespective of their dimensions which
  leads to better performance for batches with identical GEMMs
  but performs sub-optimally for batches with non-identical GEMMs.
- Optimizations for batches with non-identical GEMMs are in progress.
- Added bench and input files for batch_matmul.
- Added logger functionality for batch_matmul APIs.

AMD-Internal: [SWLCSG-2944]
Change-Id: I83e26c1f30a5dd5a31139f6706ac74be0aa6bd9a
2025-01-10 04:10:53 -05:00
Mithun Mohan
ef4286a97e Multi-data type buffer and scale support for matrix add|mul post-ops in s32 API.
-As it stands the buffer type in matrix add|mul post-ops is expected to
be the same as that of the output C matrix type. This limitation is now
removed and user can specify the buffer type by setting the stor_type
attribute in add|mul post-op struct. As of now int8, int32, bfloat16 and
float types are supported for the buffer in s32 micro-kernels. The same
support is also added for bf16 micro-kernels, with bfloat16 and float
supported for now.
-Additionally the values (from buffer) are added/multiplied as is to the
output registers while performing the matrix add|mul post-ops. Support
is added for scaling these values before using them in the post-ops.
Both scalar and vector scale_factors are supported.
-The bias_stor_type attribute is renamed to stor_type in bias post-ops.
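
A minimal sketch of the scaled matrix add|mul post-ops described above. The buffer may hold a different type from C (integers here), each value is converted to float, scaled, and then added to or multiplied into the output. Treating a vector scale factor as one value per column is an assumption, and the function names are hypothetical.

```python
def _per_column_scales(scale, n):
    """Expand a scalar scale factor to n columns; pass a vector through."""
    return [scale] * n if isinstance(scale, (int, float)) else list(scale)


def matrix_add_post_op(c, buf, scale):
    """Return c + scale * buf, with buf values converted to float first."""
    scales = _per_column_scales(scale, len(c[0]))
    return [
        [c_ij + s * float(b_ij) for c_ij, b_ij, s in zip(c_row, b_row, scales)]
        for c_row, b_row in zip(c, buf)
    ]


def matrix_mul_post_op(c, buf, scale):
    """Return c * (scale * buf), with buf values converted to float first."""
    scales = _per_column_scales(scale, len(c[0]))
    return [
        [c_ij * (s * float(b_ij)) for c_ij, b_ij, s in zip(c_row, b_row, scales)]
        for c_row, b_row in zip(c, buf)
    ]


c = [[1.0, 2.0], [3.0, 4.0]]
buf_int8 = [[10, 20], [30, 40]]  # buffer type differs from the type of C
added_scalar = matrix_add_post_op(c, buf_int8, 0.5)
added_vector = matrix_add_post_op(c, buf_int8, [1.0, 0.0])
multiplied = matrix_mul_post_op(c, buf_int8, 0.5)
```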

AMD-Internal: [SWLCSG-3319]
Change-Id: I4046ab84481b02c55a71ebb7038e38aec840c0fa
2025-01-10 02:11:12 -05:00
varshav
7b9d29f9b3 Adding post-ops for JIT kernels
- Added Downscale, tanh and sigmoid post-op support to the JIT kernels
 - Mask bf16s4 kernel call while JIT kernels are enabled to avoid compile-time error.
 - Added the optional support for B-prefetch in the JIT kernels
 - Resolved the visibility issues in global variable jit_krnels_generated
 - Modified the array generation for scale and zp values in the bench

Change-Id: I09b8afc843f51ac23645e02f210a2c13d3af804d
2025-01-08 12:55:27 +00:00
Mithun Mohan
4a95f44d39 Buffer scale support for matrix add and matrix mul post-ops in bf16 API.
-Currently the values (from buffer) are added/multiplied as is to the
output registers while performing the matrix add/mul post-ops. Support
is added for scaling these values before using them in the post-ops.
Both scalar and vector scale_factors are supported.

AMD-Internal: [SWLCSG-3181]
Change-Id: Ifdb7160a1ea4f5ecccfa3ef31ecfed432898c14d
2025-01-08 10:35:50 +00:00
Mithun Mohan
8d8a8e2f19 Light-weight logging framework for LPGEMM.
-A light-weight mechanism/framework to log input details and a
stringified version of the post-ops structure is added to LPGEMM.
Additionally the runtime of the API is also logged.
The logging framework logs to a file with filename following the format
aocl_gemm_log_<PID>_<TID>.txt.
-To enable this feature, the AOCL_LPGEMM_LOGGER_SUPPORT=1 macro needs to
be defined when compiling BLIS (with aocl_gemm addon enabled) by passing
CFLAGS="-DAOCL_LPGEMM_LOGGER_SUPPORT=1" to ./configure. Additionally
AOCL_ENABLE_LPGEMM_LOGGER=1 has to be exported in the environment during
LPGEMM runtime.

AMD-Internal: [SWLCSG-3280]
Change-Id: I30bfb35b2dc412df70044601b335938fc9f49cfb
2025-01-03 11:28:57 +00:00
Meghana Vankadari
bfc512d3e1 Implemented batch_gemm for bf16bf16f32of32|bf16
Details:
- The batch matmul performs a series of matmuls, processing
  more than one GEMM problem at once.
- Introduced a new parameter called batch_size for the user
  to indicate number of GEMM problems in a batch/group.
- This operation supports processing GEMM problems with
  different parameters including dims, post-ops, stor-schemes etc.
- This operation is optimized for problems where all the
  GEMMs in a batch are of same size and shape.
- For now, the threads are distributed among different GEMM
  problems equally irrespective of their dimensions which
  leads to better performance for batches with identical GEMMs
  but performs sub-optimally for batches with non-identical GEMMs.
- Optimizations for batches with non-identical GEMMs are in progress.
- Added bench and input files for batch_matmul.

AMD-Internal: [SWLCSG-2944]
Change-Id: Idc59db5b8c5794bf19f6f86bcb8455cd2599c155
2025-01-03 03:28:32 -05:00
Deepak Negi
615789e196 Fixed compilation issue with clang 18 on windows
Description
-In enum AOCL_PARAMS_STORAGE_TYPES the member FLOAT was declared, and the
 clang 18 compiler in msvc was throwing a multiple definition error.
 Replaced FLOAT and BFLOAT16 with AOCL_GEMM_<F32/BF16>.

AMD-Internal: CPUPL-6174

Change-Id: Ic061af068854d51629b82b495efd0eb54543f329
2024-12-12 06:37:06 -05:00
Deepak Negi
baeebe75c9 Support for standard AutoAWQ storage format.
Description:
1. AutoAWQ uses an int32 buffer to store 8 elements each of 4 bits in this
   format [0, 2, 4, 6, 1, 3, 5, 7].
2. Support is added to convert above format back to the original
   sequential order [0, 1, 2, 3, 4, 5, 6, 7] before reordering
   in the AWQ API.
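
The reordering described above can be sketched on a single 32-bit word. The sketch assumes that nibble i (counted from the least significant bits) holds logical element ORDER[i]; that bit-level layout is an assumption about the storage format and is not stated in this commit. Function names are hypothetical.

```python
AWQ_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]  # interleaved order from the commit message


def pack_awq(vals):
    """Pack 8 four-bit values into one int32 word in the interleaved order."""
    word = 0
    for i, src in enumerate(AWQ_ORDER):
        word |= (vals[src] & 0xF) << (4 * i)  # nibble i holds vals[AWQ_ORDER[i]]
    return word


def unpack_awq_sequential(word):
    """Recover the 8 values in sequential order [0, 1, ..., 7]."""
    out = [0] * 8
    for i, dst in enumerate(AWQ_ORDER):
        out[dst] = (word >> (4 * i)) & 0xF
    return out


vals = [1, 2, 3, 4, 5, 6, 7, 8]
word = pack_awq(vals)
restored = unpack_awq_sequential(word)
```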

AMD-Internal: SWLCSG-3169

Change-Id: I5395766060c200ab81d0b8be94356678a169ac13
2024-12-02 04:02:27 -05:00
Meghana Vankadari
fbb72d047f Added group quantization and zero-point support for WOQ kernels
Description:

1. Added group quantization and zero-point (zp) in
   aocl_gemm_bf16s4f32o<bf16|f32> API.

2. Group quantization is a technique to improve accuracy
   where the scale factors used to dequantize weights vary at group
   level instead of at per-channel or per-tensor level.

3. Added zp and scaling in woq packb kernels so that for
   large M values zp and scaling are performed at pack-b
   stage and bf16 kernels are called

4. Adding zp support and scaling to default path in WoQ kernels
   created some performance overhead when M value is very small.

5. Added string group_size to lpgemm bench to read
   group size from bench_input.txt and tested for
   various combinations of matrix dimensions.

6. The scalefactors could be of type float or bf16
   and the  zeropoint values are expected to be
   in int8 format.
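
Group-level dequantization can be sketched for one column of weights along the K dimension. The formula (q - zero_point) * scale is the common convention and is assumed here; the kernels may order the operations differently. The function name and sample values are hypothetical.

```python
def dequantize_grouped(q_col, scales, zero_points, group_size):
    """Dequantize one column of int4 weights with per-group scale and zp.

    q_col       : quantized weights along K (signed 4-bit values as ints)
    scales      : one scale factor per group of `group_size` rows
    zero_points : one int8 zero-point per group
    """
    return [
        (q - zero_points[k // group_size]) * scales[k // group_size]
        for k, q in enumerate(q_col)
    ]


# K = 4 with group_size = 2 gives two groups, each with its own scale and zp.
weights = dequantize_grouped([1, -2, 3, 7], [0.5, 2.0], [1, -1], 2)
```

With a group size equal to K this reduces to the per-channel case, which is why the group size can be treated as a tunable read from the bench input.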

AMD-Internal: [SWLCSG-3168, SWLCSG-3172]

Change-Id: Iff07b54d76edc7408eb2ea0b29ce8b4a04a38f57
2024-12-02 06:46:13 +00:00
Deepak Negi
04ae01aeab Added support to specify bias data type in bf16 API's
Description:
1. The bias type was supported only based on output data type.
2. The option is added in the pre-ops structure to select the bias data
   type irrespective of the storage data type in bf16 and WoQ API's


AMD-Internal: SWLCSG-3171


Change-Id: Iac10b946c2d4a5c405b2dc857362be0058615abf
2024-11-19 05:30:02 -05:00
Deepak Negi
60a8c71a1a Sigmoid and Tanh post-operation support for int8 API's.
Description:

Implemented sigmoid, tanh as fused post-ops in
aocl_gemm_<s8|u8>s8<s32|s16>o<s8|u8|s32> API's

Sigmoid(x) = 1/(1+e^(-x))
Tanh(x) = (1-e^(-2x))/(1+e^(-2x))
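
As a quick numerical check, written with explicit parentheses the standard definitions are sigmoid(x) = 1/(1+e^(-x)) and tanh(x) = (1-e^(-2x))/(1+e^(-2x)). The scalar sketch below evaluates both and compares against the library tanh; it is a reference for verification, not the vectorized kernel code.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_exp(x: float) -> float:
    """tanh expressed through a single exponential, e^(-2x)."""
    e = math.exp(-2.0 * x)
    return (1.0 - e) / (1.0 + e)


max_tanh_error = max(
    abs(tanh_via_exp(x) - math.tanh(x)) for x in (-3.0, -0.7, 0.0, 0.7, 3.0)
)
```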

Updated bench_lpgemm to recognize sigmoid, tanh
as options for post-ops from bench_input and verified.

AMD-Internal: [SWLCSG-3178]

Change-Id: I9df3aab02222f728ff9d1f292c7bc549f30176f0
2024-11-15 05:36:31 -05:00
Deepak Negi
146f3b2eb2 Sigmoid and Tanh post-operation support for f32 API.
Description:

Implemented sigmoid, tanh as fused post-ops in
aocl_gemm_f32f32f32of32 API's

Sigmoid(x) = 1/(1+e^(-x))
Tanh(x) = (1-e^(-2x))/(1+e^(-2x))

Updated bench_lpgemm to recognize sigmoid, tanh
as options for post-ops from bench_input and verified.

AMD-Internal: [SWLCSG-3178]

Change-Id: Iac0a907f6dea1d9cb82d9fd8716bfdbf1c33921d
2024-11-15 04:20:20 -04:00
Deepak Negi
b5c1b6055a Sigmoid and Tanh post-operation support for bf16 API.
Description:

Implemented sigmoid, tanh as fused post-ops in
aocl_gemm_bf16bf16f32o<f32|bf16> API's

Sigmoid(x) = 1/(1+e^(-x))
Tanh(x) = (1-e^(-2x))/(1+e^(-2x))

Updated bench_lpgemm to recognize sigmoid, tanh
as options for post-ops from bench_input and verified.

AMD-Internal: [SWLCSG-3178]

Change-Id: I78a3ba4a67ab63f9d671fbe315f977b016a0d969
2024-11-15 01:13:31 -04:00
Mithun Mohan
097cda9f9e Adding support for AOCL_ENABLE_INSTRUCTIONS for f32 LPGEMM API.
-Currently lpgemm sets the context (block sizes and micro-kernels) based
on the ISA of the machine it is being executed on. However this approach
does not give the flexibility to select a different context at runtime.
In order to enable runtime selection of context, the context
initialization is modified to read the AOCL_ENABLE_INSTRUCTIONS env
variable and set the context based on the same. As part of this commit,
only f32 context selection is enabled.
-Bug fixes in scale ops in f32 micro-kernels and GEMV path selection.
-Added vectorized f32 packing kernels for NR=16(AVX2) and NR=64(AVX512).
This is only for B matrix and helps remove dependency of f32 lpgemm api
on the BLIS packing framework.

AMD Internal: [CPUPL-5959]

Change-Id: I4b459aaf33c54423952f89905ba43cf119ce20f6
2024-10-30 08:52:22 +00:00
Meghana Vankadari
b04b8f22c9 Introduced un-reorder API for bf16bf16f32of32
Details:
- Added a new API called unreorder that converts a matrix from
  reordered format to its original format (row-major or col-major).
- Currently this API only supports bf16 datatype.
- Added corresponding bench and input file to test accuracy of the
  API.
- The new API is only supported for 'B' matrix.
- Modified input validation checks in reorder API to account for
  row Vs col storage of matrix and transposes for bf16 datatype.

Change-Id: Ifb9c53b7e6da6f607939c164eb016e82514581b7
2024-10-23 07:49:24 -04:00
varshav2
dabfdf484a Add Scale post-op for F32 API
- Implemented the Scale post-op for the F32 API for all kernels
 - f32_scale = (f32 * scale_factor) + offset
 - Added the bench inputs

Change-Id: Ib0f25f870eafe695d8b2a2c434c8cb3ec4f7db4c
2024-10-21 06:08:31 -04:00
Deepak Negi
16653ed208 Added support for column major B matrix in BF16S4F32F32 reorder API.
-Added new pack kernels that packs/reorders B matrix from column-major
input format. This also supports the transB scenario if input B matrix
is row major.

Change-Id: I4c75b6e81016331fd7e7f95ad4212e6d38dc586f
2024-09-20 01:11:21 +05:30
Chandrashekara K R
e4eed817aa Added logic to use right format specifier to read integer value.
Updated logic to use "%ld" and "%lld" format specifiers to read
64-bit integer from input files using fscanf function on Linux and
Windows respectively when the user set INT_SIZE='auto' on 64-bit
machine or INT_SIZE='64'. Otherwise "%d" on both windows and Linux
for benchmarking blis and LPGEMM.
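
The selection rule above can be written as a small decision function. This is a sketch of the logic as described in the message; the function and parameter names are hypothetical, and the real build chooses the specifier at configure or compile time rather than at runtime.

```python
def int_format_specifier(int_size: str, is_64bit_machine: bool, is_windows: bool) -> str:
    """Pick the fscanf format specifier for the library's integer type."""
    use_64bit_int = int_size == "64" or (int_size == "auto" and is_64bit_machine)
    if not use_64bit_int:
        return "%d"  # 32-bit integers on both Windows and Linux
    # For 64-bit integers the message uses "%lld" on Windows and "%ld" on Linux.
    return "%lld" if is_windows else "%ld"
```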

Change-Id: I4762c4c1b3fcd09cf66d0cc9572d38766be6be60
2024-09-17 04:48:59 -04:00
Chandrashekara K R
91d4337b8b Updated format specifier for fscanf to read double values.
Updated format specifier to read signed double("%lld") and unsigned
double("%llu") from file using fscanf from both windows and Linux.

AMD-Internal: [CPUPL-5787]
Change-Id: Ibef50b0df708f474e22f703240e264eff1de3994
2024-09-13 14:57:28 +05:30
Mithun Mohan
453c9f0084 Fixes for bfloat16 accumulation rounding errors in bench.
For the bf16bf16f32obf16 lpgemm api, inside the micro-kernels in order
to convert the accumulated float values to bfloat16 before storing,
the _mm512_cvtneps_pbh intrinsic (vcvtneps2bf16) is used. This
intrinsic rounds the value based on a rounding bias logic. Replicating
the same rounding logic inside the bf16 bench accuracy check function
to get proper one to one comparison of output values.
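
A common software emulation of this float to bfloat16 rounding adds a bias of 0x7FFF plus the lowest kept bit before truncating, which gives round-to-nearest-even. The sketch below shows that technique; it is assumed to resemble what the bench check replicates and is not copied from it. NaN and denormal handling are left out.

```python
import struct


def f32_to_bf16_bits_rne(x: float) -> int:
    """Convert a float32 value to bfloat16 bits with round-to-nearest-even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # The bias is 0x7FFF, plus 1 when the lowest kept bit is set, so that
    # exact ties round towards the even bfloat16 value.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


exact = f32_to_bf16_bits_rne(1.0)
tie_rounds_down = f32_to_bf16_bits_rne(1.00390625)  # 1 + 2**-8, halfway case
tie_rounds_up = f32_to_bf16_bits_rne(1.01171875)    # 1 + 3 * 2**-8, halfway case
```

Plain truncation would map both halfway cases downwards, which is the kind of one-bit mismatch that shows up as a false accuracy failure when comparing against kernel output.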

AMD Internal: [SWLCSG-2948]

Change-Id: I135ac39ac8484769b6c0fe5b3e351dd22d7ca1d8
2024-09-11 01:39:11 -04:00
Meghana Vankadari
687abe4c96 Bug fix in WOQ kernel for m=4 case.
- Updated pre_op_off computation for nr0 < NR cases.
- Fixed warnings in bench file.

Change-Id: Iae30fa84b6b47ebd94ab05d2139056aee24546d7
2024-09-05 05:00:30 +00:00
Meghana Vankadari
2e1cc2f14a Added bf16s4f32 kernels to handle m=4 cases
Details:
- In WOQ, if m = 4, special case kernels are added where
  s4->bf16 conversion happens inside the compute kernel and
  packing is avoided. For all other cases, B matrix is
  dequantized and packed at KC loop level and native bf16
  kernels are re-used at compute level.
- Fixes in bench to avoid accuracy failures when datatype of
  output is bf16.

Change-Id: Ie8db42da536891693d5e82a5336b66514a50ccb2
2024-09-04 07:36:57 -04:00
Deepak Negi
e429e57b53 Replaced int32_t with dim_t in lpgemm bench
Replaced int32_t with dim_t (int64_t) to avoid overflow.

Change-Id: I4132b72fcbffd9dbd2242b3638922931bcdb1b80
2024-09-02 09:03:02 -04:00
Deepak Negi
6dcf500703 Element wise operations API for float(f32) input matrix in LPGEMM.
This API supports applying element wise operations (eg: post-ops) on a
float(f32) input matrix to get an output matrix of the same (float(f32)).


AMD Internal: [SWLCSG-2947]

Change-Id: I387a544f0d33d2231f5f6a92e212f17b1103dd24
2024-08-27 03:28:52 -04:00
varshav2
e3c434080a Fix duplicate check and early return in s8s8s32/u8s8s32
- removed the duplicate check for col-major inputs in s8s8s32/u8s8s32
  APIs
- Fixed the print in bench_lpgemm

Change-Id: If40837b89927dd82d8aa6f620d1a7f2c24aed53c
2024-08-23 02:32:20 +05:30
Edward Smyth
82bdf7c8c7 Code cleanup: Copyright notices
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
  zen, skx and a couple of other kernels to cover all
  contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
  statements.

AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
2024-08-05 15:35:08 -04:00
Edward Smyth
591a3a7395 Code cleanup: file formats and permissions
- Remove execute file permission from source and make files.
- dos2unix conversion.
- Add missing eol at end of files.

Also update .gitignore to not exclude build directory but to
exclude any build_* created by cmake builds.

AMD-Internal: [CPUPL-4415]
Change-Id: I5403290d49fe212659a8015d5e94281fe41eb124
2024-08-05 11:52:33 -04:00
mkadavil
9f5fec7713 Matrix MUL op support in element wise operations API for bfloat16.
-Matrix MUL op support added in main as well as fringe bfloat16 element
wise operations kernels.
-Benchmarking/testing framework for the same is added.
-Fixed issues in setting up post-ops node index.

AMD Internal: [SWLCSG-2947, SWLCSG-2953]

Change-Id: Iba7561a6a60df41211efbf06fab1b4900207bcf8
2024-08-05 08:29:42 +05:30
Deepak Negi
80bf6249f0 Matrix MUL post-operation support for float(bf16|f32) LPGEMM APIs.
This post-operation computes C = (beta*C + alpha*A*B) * D, where D
is a matrix with dimensions and data type the same as that of C matrix.
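
A small reference implementation of the formula above, C = (beta*C + alpha*A*B) * D with an element-wise multiply by D, can be useful when checking results. This is a plain-Python sketch on nested lists with hypothetical names; it says nothing about how the kernels fuse the operation.

```python
def gemm_with_matrix_mul(alpha, a, b, beta, c, d):
    """Return (beta*c + alpha*(a @ b)) * d, where * d is element-wise."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = sum(a[i][p] * b[p][j] for p in range(k))  # (A*B)[i][j]
            out[i][j] = (beta * c[i][j] + alpha * acc) * d[i][j]
    return out


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
d = [[1.0, 0.5], [2.0, 0.0]]  # same dimensions as C
result = gemm_with_matrix_mul(1.0, a, b, 2.0, c, d)
```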

AMD-Internal: [SWLCSG-2953]

Change-Id: Id4df2ca76a8f696cb16edbd02c25f621f9a828fd
2024-08-05 08:25:32 -04:00
mkadavil
f040ba617f Element wise operations API for bfloat16 input matrix in LPGEMM.
-This API supports applying element wise operations (eg: post-ops) on a
bfloat16 input matrix to get an output matrix of the same(bfloat16) or
upscaled data type (float).
-Benchmarking/testing framework for the same is added.

AMD Internal: SWLCSG-2947

Change-Id: I43f1c269be1a1997d4912d8a3a97be5e5f3442d2
2024-08-05 07:17:08 -04:00
Meghana Vankadari
d5b4d3aa5e Fixing control flow in aocl_gemm_bf16s4f32of32|bf16
- Fixed framework of bf16s4f32of32 API to correct
  pointer updates.
- Modified pre_op structure to exclude pre-op-offset.
  Now offset is passed as a separate parameter to the
  scale-pack functions.
- Fixed work-distribution among threads in MT scenario.
- Added Blocksizes and kernel-pointers and verified
  functionality for the new API.

AMD-Internal: [SWLCSG-2943]
Change-Id: I58fece240d62c798c880a2b2b7fa64e560cc753d
2024-07-29 05:12:09 -04:00