Commit Graph

21 Commits

Author SHA1 Message Date
Meghana Vankadari
bfc512d3e1 Implemented batch_gemm for bf16bf16f32of32|bf16
Details:
- The batch matmul performs a series of matmuls, processing
  more than one GEMM problem at once.
- Introduced a new parameter called batch_size for the user
  to indicate the number of GEMM problems in a batch/group.
- This operation supports processing GEMM problems with
  different parameters, including dims, post-ops, storage schemes, etc.
- This operation is optimized for problems where all the
  GEMMs in a batch are of the same size and shape.
- For now, threads are distributed equally among the GEMM
  problems irrespective of their dimensions, which gives
  good performance for batches with identical GEMMs but is
  sub-optimal for batches with non-identical GEMMs.
- Optimizations for batches with non-identical GEMMs are in progress.
- Added bench and input files for batch_matmul.
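The batching scheme above can be sketched as follows. This is an illustrative Python sketch, not the aocl_gemm API: the function names and the (A, B) problem tuples are assumptions, and the real call takes a batch_size parameter plus per-problem dims/post-ops/storage schemes.

```python
# Illustrative sketch of a batch GEMM: process a list of independent
# GEMM problems, each with its own dimensions, in one call.
def matmul(A, B):
    """Naive C = A*B for row-major lists of lists."""
    m, k, n = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def batch_gemm(batch):
    """batch: list of (A, B) problems; returns one C per problem.
    Hypothetical names, for illustration only."""
    return [matmul(A, B) for A, B in batch]
```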

AMD-Internal: [SWLCSG-2944]
Change-Id: Idc59db5b8c5794bf19f6f86bcb8455cd2599c155
2025-01-03 03:28:32 -05:00
Deepak Negi
615789e196 Fixed compilation issue with clang 18 on windows
Description
-In enum AOCL_PARAMS_STORAGE_TYPES the member FLOAT was declared, and
 clang 18 under MSVC reported a multiple-definition issue. FLOAT and
 BFLOAT16 are renamed to AOCL_GEMM_<F32/BF16>.

AMD-Internal: CPUPL-6174

Change-Id: Ic061af068854d51629b82b495efd0eb54543f329
2024-12-12 06:37:06 -05:00
Deepak Negi
baeebe75c9 Support for standard AutoAWQ storage format.
Description:
1. AutoAWQ uses an int32 buffer to store eight 4-bit elements in this
   format: [0, 2, 4, 6, 1, 3, 5, 7].
2. Support is added to convert the above format back to the original
   sequential order [0, 1, 2, 3, 4, 5, 6, 7] before reordering
   in the AWQ API.
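The conversion described above can be sketched in Python. The nibble-significance convention (element slot i at bits 4*i..4*i+3) is an assumption for illustration:

```python
AWQ_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]  # logical index stored at each nibble slot

def unpack_awq_int32(word):
    """Extract eight 4-bit values from one int32 and return them in
    sequential order [0..7]. Sketch of the conversion, assuming slot i
    occupies bits 4*i..4*i+3 of the word."""
    nibbles = [(word >> (4 * i)) & 0xF for i in range(8)]
    out = [0] * 8
    for slot, logical in enumerate(AWQ_ORDER):
        out[logical] = nibbles[slot]
    return out
```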

AMD-Internal: SWLCSG-3169

Change-Id: I5395766060c200ab81d0b8be94356678a169ac13
2024-12-02 04:02:27 -05:00
Meghana Vankadari
fbb72d047f Added group quantization and zero-point support for WOQ kernels
Description:

1. Added group quantization and zero-point (zp) in
   aocl_gemm_bf16s4f32o<bf16|f32> API.

2. Group quantization is a technique to improve accuracy
   in which the scale factors used to dequantize weights vary
   at group level instead of at per-channel or per-tensor level.

3. Added zp and scaling in WOQ packb kernels so that for
   large M values zp and scaling are performed at the pack-B
   stage and the bf16 kernels are called.

4. Adding zp support and scaling to the default path in WoQ kernels
   introduced some performance overhead when the M value is very small.

5. Added string group_size to lpgemm bench to read
   group size from bench_input.txt and tested for
   various combinations of matrix dimensions.

6. The scale factors can be of type float or bf16,
   and the zero-point values are expected to be
   in int8 format.
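Group-level dequantization as described in point 2 can be sketched as follows (illustrative names and layout, not the library API):

```python
def dequantize_grouped(q, scales, zps, group_size):
    """Dequantize a 1-D column of quantized weights with per-group scale
    and zero-point: w[i] = (q[i] - zp[g]) * scale[g], where g = i // group_size.
    Sketch only; the real kernels fuse this into pack-B."""
    return [(q[i] - zps[i // group_size]) * scales[i // group_size]
            for i in range(len(q))]
```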

AMD-Internal: [SWLCSG-3168, SWLCSG-3172]

Change-Id: Iff07b54d76edc7408eb2ea0b29ce8b4a04a38f57
2024-12-02 06:46:13 +00:00
Deepak Negi
04ae01aeab Added support to specify bias data type in bf16 APIs
Description:
1. Previously, the bias data type was determined solely by the output
   data type.
2. An option is added in the pre-ops structure to select the bias data
   type irrespective of the storage data type in the bf16 and WoQ APIs.


AMD-Internal: SWLCSG-3171


Change-Id: Iac10b946c2d4a5c405b2dc857362be0058615abf
2024-11-19 05:30:02 -05:00
Deepak Negi
b5c1b6055a Sigmoid and Tanh post-operation support for bf16 API.
Description:

Implemented sigmoid and tanh as fused post-ops in the
aocl_gemm_bf16bf16f32o<f32|bf16> APIs.

Sigmoid(x) = 1/(1+e^(-x))
Tanh(x) = (1-e^(-2x))/(1+e^(-2x))

Updated bench_lpgemm to recognize sigmoid and tanh
as post-op options from bench_input and verified them.
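A scalar reference for the two post-ops (the kernels operate on vectorized bf16 data; this sketch only shows the math):

```python
import math

def sigmoid(x):
    # Sigmoid(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

def tanh_post_op(x):
    # Tanh(x) = (1 - e^(-2x)) / (1 + e^(-2x))
    e = math.exp(-2.0 * x)
    return (1.0 - e) / (1.0 + e)
```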

AMD-Internal: [SWLCSG-3178]

Change-Id: I78a3ba4a67ab63f9d671fbe315f977b016a0d969
2024-11-15 01:13:31 -04:00
Deepak Negi
80bf6249f0 Matrix MUL post-operation support for float(bf16|f32) LPGEMM APIs.
This post-operation computes C = (beta*C + alpha*A*B) * D, where D
is a matrix with dimensions and data type the same as that of C matrix.
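A naive reference for this post-operation (illustrative Python, not the library kernels):

```python
def gemm_matrix_mul_post_op(A, B, C, D, alpha=1.0, beta=0.0):
    """C = (beta*C + alpha*A*B) * D, where * with D is elementwise and
    D has the same dimensions as C. Row-major lists of lists."""
    m, k, n = len(A), len(B), len(B[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = sum(A[i][p] * B[p][j] for p in range(k))
            out[i][j] = (beta * C[i][j] + alpha * acc) * D[i][j]
    return out
```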

AMD-Internal: [SWLCSG-2953]

Change-Id: Id4df2ca76a8f696cb16edbd02c25f621f9a828fd
2024-08-05 08:25:32 -04:00
mkadavil
ec8c39541e Test/benchmark framework updates to test WOQ workflow.
-To enable Weight-only-Quantization (WOQ) workflow, new LPGEMM APIs
have been developed where data types are A:bf16, B:int4 and C:f32/bf16.
The testing and benchmarking framework for the same is added.

AMD-Internal: [SWLCSG-2943]
Change-Id: Icdc1d60819a23dd9f41382499d1a3c055c5edc17
2024-07-25 06:44:37 +05:30
Nallani Bhaskar
c6dd7c1b4b Added new API in aocl_gemm to support A bf16 data type and B s4 data type
Description:

1. Added a new API, aocl_gemm_bf16s4f32of32, to support
   WoQ (Weight-only-Quantization) in LLMs.

2. The API supports only a reordered B matrix of signed
   4-bit (S4) data.

3. Subtracting the zero point and multiplying by the scale
   on the B matrix is performed during B packing.

4. Zero-point and scale data should be passed by the user
   through the pre-ops data structure.

5. The API is still in experimental state and NOT tested.
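The per-element B transform in point 3 can be sketched as follows (hypothetical helper names; the library does this inside pack-B):

```python
def s4_to_int(nibble):
    """Sign-extend a 4-bit value (0..15) to an int in [-8, 7]."""
    return nibble - 16 if nibble >= 8 else nibble

def dequant_s4(nibble, scale, zero_point):
    # B element = (s4_value - zero_point) * scale, done during B packing.
    return (s4_to_int(nibble) - zero_point) * scale
```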

   AMD-Internal: SWLCSG-2943

Change-Id: I10b159b64c2e2aaf39da5462685618ba8cc800ee
2024-07-25 11:59:03 +00:00
mkadavil
d37c91dffa Quantization (scale + zero point) support for BF16 LPGEMM API.
-Quantization of f32 to bf16 (bf16 = (f32 * scale_factor) + zero_point)
instead of just type conversion in aocl_gemm_bf16bf16f32obf16.
-Support for multiple scale/sum/matrix_add/bias post-ops in a single
LPGEMM api call.
-Post-ops mask-related fixes in lpgemv kernels.
-Additional scale post-ops sanity checks.
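The quantization step can be sketched in Python, assuming the bf16 value is formed by keeping the top 16 bits of the scaled f32 encoding (real kernels may round rather than truncate):

```python
import struct

def f32_to_bf16_bits(x, scale=1.0, zero_point=0.0):
    """Quantize f32 to bf16: bf16 = (x * scale) + zero_point, then keep
    the top 16 bits of the f32 bit pattern (truncation; a sketch only)."""
    y = x * scale + zero_point
    (bits,) = struct.unpack("<I", struct.pack("<f", y))
    return bits >> 16

def bf16_bits_to_f32(b):
    """Widen a bf16 bit pattern back to f32 (zero-filled low bits)."""
    (y,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return y
```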

AMD-Internal: [SWLCSG-2945]
Change-Id: I3b35cc413c176bb50bfdbd6acd4839a5ba7e94bb
2024-07-18 05:32:51 -04:00
mkadavil
118e955a22 SWISH post-op support for all LPGEMM APIs.
SWISH post-op computes swish(x) = x / (1 + exp(-1 * alpha * x)).
SiLU = SWISH with alpha = 1.
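A scalar reference for the formula above (the kernels are vectorized; this only shows the math):

```python
import math

def swish(x, alpha=1.0):
    # swish(x) = x / (1 + exp(-alpha * x)); SiLU is swish with alpha = 1.
    return x / (1.0 + math.exp(-alpha * x))
```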

AMD-Internal: [SWLCSG-2387]
Change-Id: I55f50c74a8583a515f7ea58fa0878ccbcdd6cc26
2024-05-06 06:05:11 -04:00
mkadavil
01b7f8c945 Matrix Add post-operation support for integer(s16|s32) LPGEMM APIs.
-This post-operation computes C = (beta*C + alpha*A*B) + D, where D is
a matrix with dimensions and data type the same as that of C matrix.
-For clang compilers (including aocc), -march=znver1 is not enabled for
zen kernels. CKVECFLAGS has been updated to capture the same.
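A naive integer reference for the Matrix Add post-operation described above (illustrative, not the s16/s32 kernels):

```python
def gemm_matrix_add_post_op(A, B, C, D, alpha=1, beta=0):
    """C = (beta*C + alpha*A*B) + D, where D has the same dimensions
    and data type as C. Integer row-major sketch."""
    m, k, n = len(A), len(B), len(B[0])
    return [[beta * C[i][j]
             + alpha * sum(A[i][p] * B[p][j] for p in range(k))
             + D[i][j]
             for j in range(n)] for i in range(m)]
```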

AMD-Internal: [SWLCSG-2424]
Change-Id: Ie369f7ea5c80ab69eea3f3e03a8d9546e14f5c09
2024-02-12 23:51:36 +05:30
mkadavil
864170f5cb Scalar value support for zero-point and scale-factor.
-As it stands, in LPGEMM, users are expected to pass an array of values
with length the same as N dimension as inputs for zero point or scale
factor. However at times, a single scalar value is used as zero point
or scale factor for the entire downscaling operation. The mandate to
pass an array requires the user to allocate extra memory and fill it
with the scalar value so as to be used in downscaling. This limitation
is lifted as part of this commit, and now scalar values can be passed
as zero point or scale factor.
-LPGEMM bench enhancements along with new input format to improve
readability as well as flexibility.
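The scalar-broadcast behavior described above can be sketched as follows (hypothetical helper name; the real API distinguishes scalar and vector inputs via its post-op metadata):

```python
def downscale_row(acc_row, scale, zero_point):
    """Apply scale and zero-point to one output row of length N.
    A scalar scale/zero_point is broadcast across all N columns;
    a list must have length N. Illustration only."""
    n = len(acc_row)
    s = scale if isinstance(scale, list) else [scale] * n
    z = zero_point if isinstance(zero_point, list) else [zero_point] * n
    return [acc_row[j] * s[j] + z[j] for j in range(n)]
```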

AMD-Internal: [SWLCSG-2581]
Change-Id: Ibd0d89f03e1acadd099382dffcabfec324ceb50f
2024-01-12 04:37:35 +05:30
Edward Smyth
ed5010d65b Code cleanup: AMD copyright notice
Standardize format of AMD copyright notice.

AMD-Internal: [CPUPL-3519]
Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0
2023-11-23 08:54:31 -05:00
mkadavil
ffa72f09cc Support for multiple eltwise post-ops in low precision gemm.
-Currently only one eltwise post-op (one of relu/prelu/gelu_tanh/
gelu_erf) is supported in the post-op struct along with bias or
downscale. This setup was sufficient when only activation functions
were supported as eltwise post-ops. But with the introduction of the clip
post-op (a type of non-activation eltwise operation), it has become
necessary to extend the post-ops framework to support multiple eltwise
operations, with the multiple eltwise often used in the form activation
eltwise op + non-activation eltwise ops. The aocl post-op struct is
modified and the post-op parser is updated to support this use case.
-The lpgemm_bench is updated to support testing/benchmarking of the
multiple eltwise operations use case. The function for accuracy checking
is modified to support correctness testing irrespective of the order and
count of post-ops. Additionally the help message is updated so as to
better describe the capabilities of lpgemm_bench.
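The "activation eltwise op + non-activation eltwise op" chaining described above can be sketched as follows (illustrative op names and dispatch, not the aocl post-op struct):

```python
import math

def eltwise_chain(x, ops):
    """Apply a sequence of eltwise post-ops in order, e.g. an activation
    (gelu_erf) followed by a non-activation op (clip). Sketch only."""
    for op, *args in ops:
        if op == "gelu_erf":
            x = 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
        elif op == "clip":
            lo, hi = args
            x = min(max(x, lo), hi)
    return x
```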

AMD-Internal: [CPUPL-3244]
Change-Id: If4ce8d7261d32073da8fa4757ed4f2ea0e94249f
2023-04-20 07:24:32 -04:00
eashdash
12c97021a1 Added New Post-Op - Custom Clipping for LPGEMM and SGEMM
1. Custom Clip is an element-wise post-op which is used to
   clip the accumulated GEMM output within a certain range.
2. The Clip Post-Op is used in downscaled and non-downscaled
   LPGEMM APIs and SGEMM.
3. Changes are done at frame and microkernel level to implement
   this post-op.
4. Different versions are implemented (AVX-512, AVX-2, SSE-2)
   to enable custom clipping for various LPGEMM types and SGEMM.
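The scalar form of the clip post-op described in point 1 (the kernels are vectorized; this only shows the operation):

```python
def clip(x, lo, hi):
    # Clamp the accumulated GEMM output into the range [lo, hi].
    return lo if x < lo else hi if x > hi else x
```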

AMD-Internal: [CPUPL-3207]
Change-Id: I71c60be69e5a0dc47ca9336d58181c097b9aa0c6
2023-04-17 04:38:20 -04:00
eashdash
e36f699939 Implemented ERF Based GeLU Activation for LPGEMM and SGEMM
1. Implemented efficient AVX-512, AVX-2 and SSE-2 versions of the
   error function (ERF).
2. Added error function based GeLU activation post-ops for the
   S32, S16 and BF16 (LPGEMM) and SGEMM APIs.
3. Changes for this include frame and micro-kernel level changes in
   addition to adding the macro-based function definitions of the
   ERF function in the math-utils and gelu header files.
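A scalar reference for the ERF-based GeLU (the standard formula; the kernels use vectorized approximations of erf):

```python
import math

def gelu_erf(x):
    # GeLU(x) = 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
```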

AMD-Internal: [CPUPL-3036]
Change-Id: Ie50f6dcabf8896b7a6d30bbc16aa44392cc512be
2023-03-13 06:10:31 -04:00
eashdash
672544bc04 GeLU Activation Function Post-Op for LPGEMM S16, S32 and BF16
1. Added Tanh approximation based GeLU Post-Op for S16, S32 and BF16
2. Changes are done at frame and micro-kernel level to
   implement this post-op.
3. Efficient AVX-512 and AVX-2 vector versions of TANHF and EXPF
   functions are implemented for the GeLU post-operation.
4. TANH and EXPF math functions are efficiently implemented in
   macro-based fashion to exploit register-level fusion of GeLU
   with GEMM operations for improved performance.
5. LPGEMM bench is changed to pass GeLU post-op as input and
   support accuracy checks to verify functional correctness.
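A scalar reference for the tanh-approximation GeLU named above (the standard formula; the kernels use the vectorized TANHF/EXPF implementations):

```python
import math

def gelu_tanh(x):
    # Tanh approximation:
    # GeLU(x) ~= 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))
```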

AMD-Internal: [CPUPL-2978]
Change-Id: I472ac35c00a4ea1ab983cc5f6ff6a123c8035f28
2023-02-02 08:25:04 -05:00
mkadavil
958c9238ac Output downscaling support for low precision GEMM.
- Downscaling is used when GEMM output is accumulated at a higher
precision and needs to be converted to a lower precision afterwards.
This is required in AI workloads where quantization/dequantization
routines are used.
- New GEMM APIs are introduced specifically to support this use case.
Currently downscaling support is added for s32, s16 and bfloat16 GEMM.
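The downscale step can be sketched for the s32-accumulate, s8-output case (illustrative; rounding mode and saturation match common practice, not necessarily the exact kernel behavior):

```python
def downscale_s32_to_s8(acc, scale, zero_point=0):
    """Convert an s32 accumulator to s8: round(acc * scale) + zero_point,
    saturated to the s8 range [-128, 127]. Sketch only."""
    v = int(round(acc * scale)) + zero_point
    return max(-128, min(127, v))
```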

AMD-Internal: [CPUPL-2475]
Change-Id: I81c3ee1ba5414f62427a7a0abb6ecef0c5ff71bf
2022-08-30 10:27:19 -04:00
mkadavil
584069bf74 Parametric ReLU post-ops support for u8s8s32 and u8s8s16 GEMM.
-Parametric ReLU is the generalization of leaky ReLU in which the
leakage coefficient is tunable. The support for the same is added
following the register-level fusion technique.
-Low precision bench enhancement to check accuracy/performance of
low precision gemm with PReLU.
-Bug fixes in low precision gemm kernels.
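The scalar form of Parametric ReLU described above (the kernels apply this with register-level fusion; this only shows the math):

```python
def prelu(x, alpha):
    # Parametric ReLU: x for x > 0, alpha * x otherwise.
    # The leakage coefficient alpha is tunable; alpha fixed small
    # recovers leaky ReLU.
    return x if x > 0 else alpha * x
```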

AMD-Internal: [CPUPL-2442]
Change-Id: I81336405b185a994297d122b2d868b758ae6dad5
2022-08-25 13:33:02 +05:30
mkadavil
828d3cd3d3 Post operations support for low precision gemm.
- Low precision gemm is often used in ML/DNN workloads and is used
in conjunction with pre and post operations. Performing gemm and ops
together at the micro kernel level results in better overall performance
due to cache/register reuse of output matrix. The provision for defining
the post-operations and invoking the micro-kernel with it from the
framework is added as part of this change. This includes adding new data
structures/functions to define the post-ops to be applied and an
extensible template using which new post-ops can easily be integrated.
As for the post-operations, RELU and Bias Add for u8s8s32 are implemented
in this first cut.
- aocl_gemm bench modifications to test/benchmark RELU and Bias Add.
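The post-op chain described above can be sketched per output element as follows (illustrative op names and dispatch, not the aocl post-op data structures):

```python
def apply_post_ops(acc, bias, post_ops=("bias", "relu")):
    """Apply a chain of post-ops (here Bias Add then RELU) to one GEMM
    output value, mimicking the extensible post-op template described
    above. Names are illustrative only."""
    for op in post_ops:
        if op == "bias":
            acc += bias
        elif op == "relu":
            acc = max(0, acc)
    return acc
```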

AMD-Internal: [CPUPL-2316]
Change-Id: Iad5fe9e54965bb52d5381ae459a69800946c7d18
2022-08-05 11:53:05 +05:30