Commit Graph

29 Commits

Author SHA1 Message Date
Mithun Mohan
8d8a8e2f19 Light-weight logging framewok for LPGEMM.
-A light-weight mechanism/framework to log input details and a
stringified version of the post-ops structure is added to LPGEMM.
Additionally the runtime of the API is also logged.
The logging framework logs to a file with filename following the format
aocl_gemm_log_<PID>_<TID>.txt.
-To enable this feature, the AOCL_LPGEMM_LOGGER_SUPPORT=1 macro needs to
be defined when compiling BLIS (with aocl_gemm addon enabled) by passing
CFLAGS="-DAOCL_LPGEMM_LOGGER_SUPPORT=1" to ./configure. Additionally
AOCL_ENABLE_LPGEMM_LOGGER=1 has to be exported in the environment during
LPGEMM runtime.

AMD-Internal: [SWLCSG-3280]
Change-Id: I30bfb35b2dc412df70044601b335938fc9f49cfb
2025-01-03 11:28:57 +00:00
Meghana Vankadari
bfc512d3e1 Implemented batch_gemm for bf16bf16f32of32|bf16
Details:
- The batch matmul performs a series of matmuls, processing
  more than one GEMM problem at once.
- Introduced a new parameter called batch_size for the user
  to indicate number of GEMM problems in a batch/group.
- This operation supports processing GEMM problems with
  different parameters including dims,post-ops,stor-schemes etc.,
- This operation is optimized for problems where all the
  GEMMs in a batch are of same size and shape.
- For now, the threads are distributed among different GEMM
  problems equally irrespective of their dimensions which
  leads to better performance for batches with identical GEMMs
  but performs sub-optimally for batches with non-identical GEMMs.
- Optimizations for batches with non-identical GEMMs is in progress.
- Added bench and input files for batch_matmul.

AMD-Internal: [SWLCSG-2944]
Change-Id: Idc59db5b8c5794bf19f6f86bcb8455cd2599c155
2025-01-03 03:28:32 -05:00
Nallani Bhaskar
9735391e1d Implemented f32tobf16 reorder function
Description:
aocl_reorder_f32obf16 function is implemented to
reorder input weight matrix of data type float to
bfloat16.

The reordering is done to match the input requirements
of API aocl_gemm_bf16bf16f32o<f32|bf16>.

The objective of the API is to convert a model/matrix
of type f32 to bf16 and process when machine supports
bf16 FMA instruction _mm512_dpbf16_ps but the model
is still in float

Change-Id: Ib7c743d52d01a1ac09e84ac120577ec9e02f90f5
2024-11-04 04:32:01 +00:00
Mithun Mohan
097cda9f9e Adding support for AOCL_ENABLE_INSTRUCTIONS for f32 LPGEMM API.
-Currently lpgemm sets the context (block sizes and micro-kernels) based
on the ISA of the machine it is being executed on. However this approach
does not give the flexibility to select a different context at runtime.
In order to enable runtime selection of context, the context
initialization is modified to read the AOCL_ENABLE_INSTRUCTIONS env
variable and set the context based on the same. As part of this commit,
only f32 context selection is enabled.
-Bug fixes in scale ops in f32 micro-kernels and GEMV path selection.
-Added vectorized f32 packing kernels for NR=16(AVX2) and NR=64(AVX512).
This is only for B matrix and helps remove dependency of f32 lpgemm api
on the BLIS packing framework.

AMD Internal: [CPUPL-5959]

Change-Id: I4b459aaf33c54423952f89905ba43cf119ce20f6
2024-10-30 08:52:22 +00:00
Meghana Vankadari
b04b8f22c9 Introduced un-reorder API for bf16bf16f32of32
Details:
- Added a new API called unreorder that converts a matrix from
  reordered format to it's original format( row-major or col-major ).
- Currently this API only supports bf16 datatype.
- Added corresponding bench and input file to test accuracy of the
  API.
- The new API is only supported for 'B' matrix.
- Modified input validation checks in reorder API to account for
  row Vs col storage of matrix and transposes for bf16 datatype.

Change-Id: Ifb9c53b7e6da6f607939c164eb016e82514581b7
2024-10-23 07:49:24 -04:00
varshav2
605517964b Add Transpose Kernel for A matrix in F32F32f32Of32
- Implemented the AVX512 packA kernel for col major inputs in F32 API
 - Removed the work arounds for n = 1,  mtag_a = PACK case, where the execution was
   being directed to GEMM instead of GEMV.

Change-Id: I6fb700d96069213a762e8a83a209c5388a91050f
2024-09-19 06:37:11 -04:00
Meghana Vankadari
2e1cc2f14a Added bf16s4f32 kernels to handle m=4 cases
Details:
- In WOQ, if m = 4, special case kernels are added where
  s4->bf16 conversion happens inside the compute kernel and
  packing is avoided. For all other cases, B matrix is
  dequantized and packed at KC loop level and native bf16
  kernels are re-used at compute level.
- Fixes in bench to avoid accuracy failures when datatype of
  output is bf16.

Change-Id: Ie8db42da536891693d5e82a5336b66514a50ccb2
2024-09-04 07:36:57 -04:00
Deepak Negi
6dcf500703 Element wise operations API for float(f32) input matrix in LPGEMM.
This API supports applying element wise operations (eg: post-ops) on a
float(f32) input matrix to get an output matrix of the same (float(f32)).

Change-Id: I387a544f0d33d2231f5f6a92e212f17b1103dd24

AMD Internal: [SWLCSG-2947]

Change-Id: I387a544f0d33d2231f5f6a92e212f17b1103dd24
2024-08-27 03:28:52 -04:00
Edward Smyth
82bdf7c8c7 Code cleanup: Copyright notices
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
  zen, skx and a couple of other kernels to cover all
  contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
  statements.

AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
2024-08-05 15:35:08 -04:00
mkadavil
f040ba617f Element wise operations API for bfloat16 input matrix in LPGEMM.
-This API supports applying element wise operations (eg: post-ops) on a
bfloat16 input matrix to get an output matrix of the same(bfloat16) or
upscaled data type (float).
-Benchmarking/testing framework for the same is added.

AMD Internal: SWLCSG-2947

Change-Id: I43f1c269be1a1997d4912d8a3a97be5e5f3442d2
2024-08-05 07:17:08 -04:00
Meghana Vankadari
d5b4d3aa5e Fixing control flow in aocl_gemm_bf16s4f32of32|bf16
- Fixed framework of bf16s4f32of32 API to correct
  pointer updations.
- Modified pre_op structure to exclude pre-op-offset.
  Now offset is passed as a separate parameter to the
  scale-pack functions.
- Fixed work-distribution among threads in MT scenario.
- Added Blocksizes and kernel-pointers and verified
  functionality for the new API.

AMD-Internal: [SWLCSG-2943]
Change-Id: I58fece240d62c798c880a2b2b7fa64e560cc753d
2024-07-29 05:12:09 -04:00
Nallani Bhaskar
c6dd7c1b4b Added new API in aocl_gemm to support A bf16 data type and B s4 data type
Description:

1. Added a new API aocl_gemm_bf16s4f32of32 to support
   for WoQ (Weight-only-Quantization) in LLM's

2. The API supports only reordered B matrix of data
   size signed 4 bits (S4).

3. Substracting zero point and multiplying with scale
   on B matrix is performed in packing B.

4. zero point and scale data should be passed by user
   through pre-ops data structure.

5. The API is still in experimental state and NOT tested.

   AMD-Internal: SWLCSG-2943

Change-Id: I10b159b64c2e2aaf39da5462685618ba8cc800ee
2024-07-25 11:59:03 +00:00
mkadavil
a5c4a8c7e0 Int4 B matrix reordering support in LPGEMM.
Support for reordering B matrix of datatype int4 as per the pack schema
requirements of u8s8s32 kernel. Vectorized int4_t -> int8_t conversion
implemented via leveraging the vpmultishiftqb instruction. The reordered
B matrix will then be used in the u8s8s32o<s32|s8> api.

AMD-Internal: [SWLCSG-2390]
Change-Id: I3a8f8aba30cac0c4828a31f1d27fa1b45ea07bba
2024-06-24 07:55:34 -04:00
Meghana Vankadari
c9254bd9e9 Implemented LPGEMV(n=1) for AVX2-INT8 variants
- When n=1, reorder of B matrix is avoided to efficiently
  process data. A dot-product based kernel is implemented to
  perform gemv when n=1.

AMD-Internal: [SWLCSG-2354]
Change-Id: If5f74651ab11232d0b87d34bd05f65aacaea94f1
2024-06-18 12:09:18 +05:30
mkadavil
cd032225ca BF16 bias support for bf16bf16f32ob16.
-As it stands the bf16bf16f32ob16 API expects bias array to be of type
float. However actual use case requires the usage of bias array of bf16
type. The bf16 micro-kernels are updated to work with bf16 bias array by
upscaling it to float type and then using it in the post-ops workflow.
-Corrected register usage in bf16 JIT generator for bf16bf16f32ob16 API
when k > KC.

AMD-Internal: [SWLCSG-2604]
Change-Id: I404e566ff59d1f3730b569eb8bef865cb7a3b4a1
2024-05-23 04:48:20 +05:30
mkadavil
ec67289601 SWISH post-op support for BF16 JIT based kernels.
SWISH post-op computes swish(x) = x / (1 + exp(-1 * alpha * x)).
SiLU = SWISH with alpha = 1. Adding the support for swish in JIT
based BF16 kernels.

AMD-Internal: [SWLCSG-2387]
Change-Id: I9eea0c801f5f067a5cfbd2941bc991708b86e45e
2024-05-13 01:50:32 -04:00
Meghana Vankadari
da8fd8c301 Implemented JIT-based microkernel for bf16 datatype
Details:
- Added new folder named JIT/ under addon/aocl_gemm/. This folder
  will contain all the JIT related code.
- Modified lpgemm_cntx_init code to generate main and fringe kernels
  for 6x64 bf16 microkernel and store function pointers to all the
  generated kernels in a global function pointer array. This happens
  only when gcc version is < 11.2
- When gcc version < 11.2, microkernel uses JIT-generated kernels.
  otherwise, microkernel uses the intrinsics based implementation.

AMD-Internal: [SWLCSG-2622]
Change-Id: I16256c797b2546a8cd2049680001947346260461
2024-03-13 05:55:18 +05:30
eashdash
ef134dc49f Added Trans A feature for all INT8 LPGEMM APIs
1. Added Trans A feature to handle column major inputs
   for A matrix.
2. Trans A is enabled by on-the-go pack of A matrix.
3. The on-the-go pack of A converts a column storage
   MCxKC block of A into row storage MCxKC block as
   LPGEMM kernels are row major kernels.
4. New pack routines are added for conversion of A matrix
   from column major storage to row major storage.
5. LPGEMM Cntx is updated with pack kernel function
   pointers.
6. Packing of A matrix:
   -  Converts column major input A to row major
      in blocks of MCxKC with newly added pack A
      functions when cs_a > 1.
7. Pack routines are added for AVX512 and AVX2
   INT8 LPGEMM APIs.
8. Trans A feature is now supported in:
   1. u8s8s32os32/os8
   2. u8s8s16os16/os8/ou8
   3. s8s8s32os32/os8
   4. s8s8s16os16/os8

AMD-Internal: SWLCSG-2582
Change-Id: I7ce331545525a9a09f3853280615b55fcf2edabf
2024-01-30 03:40:56 -05:00
Edward Smyth
ed5010d65b Code cleanup: AMD copyright notice
Standardize format of AMD copyright notice.

AMD-Internal: [CPUPL-3519]
Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0
2023-11-23 08:54:31 -05:00
Meghana Vankadari
f8f4343b55 Updated cntx with packA function pointer for AVX512_VNNI support
Details:
- Modified bench to support testing for sizes where matrix
  strides are larger than the corresponding dimensions.
- Modified early-return checks in all interface APIs to
  check validity of strides in relation to the corresponding
  dimension rather than checking if strides are equal to dimensions.

Change-Id: I382529b636a4acc75f6d93d997af22a168a7bfc4
2023-11-03 04:50:00 -04:00
Nallani Bhaskar
b3391ef5da Updated ERF threshold and packa changes in bf16
Description:
    1. Updated ERF function threshold from 3.91920590400 to 3.553
       to match with the reference erf float implementation which
       reduced errors a the borders and also clipped the output
       to 1.0
    2. Updated packa function call with pack function ptr in bf16
       api to avoid compilation issues for non avx512bf16 archs

    3. Updated lpgemm bench

    [AMD-Internal: SWLCSG-2423 ]

Change-Id: Id432c0669521285e6e6a151739d9a72a7340381d
2023-10-29 23:55:46 +05:30
Meghana Vankadari
4874895a68 LPGEMM: Added transA support for bf16bf16f32o<bf16|f32> APIs
Details:
- Added new params(order, trans) to aocl_get_reorder_buf_size_ and
  aocl_reorder_ APIs.
- Added new pack kernels that packs A matrix from either row-major or
  column major input matrix to pack buffer with row-major format.
- Updated cntx with pack kernel function pointers for packing A matrix.
- Transpose of A matrix is handled by packing A matrix to row-major
  format during run-time.
- Updated Early-return check conditions to account for trans parameters.
- Updated bench file to test/benchmark transpose support.

AMD-Internal: [SWLCSG-2268, SWLCSG-2442]
Change-Id: I43a113dc4bc11e6bb7cc4d768c239a16cb6bbea4
2023-10-11 07:16:08 -04:00
eashdash
a72fff2be9 Added NEW LPGEMM TYPE- s8s8s16os16 and s8s8s16os8
1. New LPGEMM type - s8s8s16os16 and s8s8s16os8 are added.
2. New interface, frame and kernel files are added.
3. Frame and kernel level files added and modified for s8s8s16
4. s8s8s16 type involves design changes of 2 operations -
   Pack B and Mat Mul
5. Pack B kernel routines to pack B matrix for s16 FMA and compute the
   sum of every column of B matrix to implement the s8s8s16 operation
   using the s16 FMA instructions.
5. Mat Mul Kernel files to compute the GEMM output using s16 FMA.
   Here the A matrix elements are converted from int8 to uint8 (s16 FMA
   works with A matrix type uint8 only) by adding extra 128 to
   every A matrix element
6. Post GEMM computation, additional operations are performed on the
   accumulated outputs to get the correct results.
   Final C = C - ( (sum of column of B matrix) * 128 )
   This is done to compensate for the addition of extra 128 to every
   A matrix elements
7. With this change, two new LPGEMM APIs are introduced in LPGEMM -
   s8s8s16os16 and s8s8s16os8.
8. All previously added post-ops are supported on s8s8os16/os8 also.

AMD-Internal: [CPUPL-3234]
Change-Id: I3cc23e3dcf27f215151dda7c8db29b3a7505f05c
2023-04-21 05:30:38 -04:00
mkadavil
3572baa9d3 aocl_softmax_f32 api's for softmax computation as part of lpgemm.
-Softmax is often used as the last activation function in a neural
network - softmax(xi) = exp(xi)/(exp(x0) + exp(x1) + ... + exp(xn))).
This step happens after the final low precision gemm computation,
and it helps to have the softmax functionality that can be invoked
as part of the lpgemm workflow. In order to support this, a new api,
aocl_softmax_f32 is introduced as part of aocl_gemm. This api
computes element-wise softmax of a matrix/vector of floats. This api
invokes ISA specific vectorized micro-kernels (vectorized only when
incx=1), and a cntx based mechanism (similar to lpgemm_cntx) is used
to dispatch to the appropriate kernel.

AMD-Internal: [CPUPL-3247]
Change-Id: If15880360947435985fa87b6436e475571e4684a
2023-04-21 05:26:08 -04:00
mkadavil
e23765010d aocl_gelu_<tanh|erf>_f32 api's for gelu computation as part of lpgemm.
-Currently in aocl_gemm, gelu (both tanh and erf based) computation is
only supported as a post-op as part of low precision gemm api call (done
at micro-kernel level). However gelu computation alone without gemm is
required in certain cases for users of aocl_gemm.
-In order to support this, two new api's - aocl_gelu_tanh_f32 and
aocl_gelu_erf_f32 are introduced as part of aocl_gemm. These api's
computes element-wise gelu_tanh and gelu_erf respectively of a matrix/
vector of floats. Both the api's invokes ISA specific vectorized micro-
kernels (vectorized only when incx=1), and a cntx based mechanism
(similar to lpgemm_cntx) is used to dispatch to the appropriate kernel.

AMD-Internal: [CPUPL-3218]
Change-Id: Ifebbaf5566d7462288a9a67f479104268b0cc704
2023-04-17 05:15:56 -04:00
Edward Smyth
1ac03e64b5 BLIS cpuid tidy and bugfix.
Improvements to BLIS cpuid functionality:
- Tidy names of avx support test functions, especially rename
  bli_cpuid_is_avx_supported() to bli_cpuid_is_avx2fma3_supported()
  to more accurately describe what it tests.
- Fix bug in frame/base/bli_check.c related to changes in commit
  6861fcae91

AMD-Internal: [CPUPL-3031]
Change-Id: Iacd8fb0ffbd45288e536fc6314660709055ea2d5
2023-04-03 08:46:37 -04:00
eashdash
bd8cd763ff Added NEW LPGEMM TYPE- S8S8S32/S8
1. New LPGEMM type - S8S8S32/S8 is added.
2. New interface, frame and kernel files are added.
3. Frame and kernel files added/modified for S8S8S32/S8 have
   2 operations - Pack B and Mat Mul
4. Pack B kernel routines to pack B matrix for VNNI and compute the sum
   of every column of B matrix to implement the S8S8S32 operation using
   the VNNI instructions.
5. Mat Mul Kernel files to compute the GEMM output using the VNNI.
   Here the A matrix elements are converted from int8 to uint8 (VNNI
   works with A matrix type uint8 only).
6. Post GEMM computation, additional operations are performed on the
   accumulated outputs to get the correct results.
7. With this change, two new LPGEMM APIs are introduced in LPGEMM -
   s8s8s32os32 and s8s8s32os8.
8. All previously added post-ops are supported on S8S8S32/S8 also.

AMD-Internal: [CPUPL-3154]
Change-Id: Ib18f82bde557ea4a815a63adc7870c4234bfb9d3
2023-03-31 05:44:54 -04:00
mkadavil
3d74b62e60 Lpgemm threading and micro-kernel optimizations.
-Certain sections of the f32 avx512 micro-kernel were observed to
slow down when more post-ops are added. Analysis of the binary
pointed to false dependencies in instructions being introduced in
the presence of the extra post-ops. Addition of vzeroupper at the
beginning of ir loop in f32 micro-kernel fixes this issue.
-F32 gemm (lpgemm) thread factorization tuning for zen4/zen3 added.
-Alpha scaling (multiply instruction) by default was resulting in
performance regression when k dimension is small and alpha=1 in s32
micro-kernels. Alpha scaling is now only done when alpha != 1.
-s16 micro-kernel performance was observed to be regressing when
compiled with gcc for zen3 and older architecture supporting avx2.
This issue is not observed when compiling using gcc with avx512
support enabled. The root cause was identified to be the -fgcse
optimization flag in O2 when applied with avx2 support. This flag is
now disabled for zen3 and older zen configs.

AMD-Internal: [CPUPL-3067]
Change-Id: I5aef9013432c037eb2edf28fdc89470a2eddad1c
2023-03-16 11:44:51 +05:30
mkadavil
8dff49837d Lpgemm source restructuring to support amdzen config.
-Currently lpgemm can only be built using either zen3 or zen4 config.
The lpgemm kernel code is re-structured to support amdzen, and thus
multi machine deployment.
-The micro-kernel calls (gemm and pack) are currently hardcoded in the
lpgemm framework. This is removed and a new lpgemm_cntx based dispatch
mechanism is designed to support runtime configurability for
micro-kernels.

AMD-Internal: [CPUPL-2965]
Change-Id: I4bbcb4e5db767def1663caf5481f0b4c988149ef
2023-02-21 08:35:38 -05:00