Commit Graph

20 Commits

Author SHA1 Message Date
Meghana Vankadari
bfc512d3e1 Implemented batch_gemm for bf16bf16f32of32|bf16
Details:
- The batch matmul performs a series of matmuls, processing
  more than one GEMM problem at once.
- Introduced a new parameter called batch_size for the user
  to indicate number of GEMM problems in a batch/group.
- This operation supports processing GEMM problems with
  different parameters including dims,post-ops,stor-schemes etc.,
- This operation is optimized for problems where all the
  GEMMs in a batch are of same size and shape.
- For now, the threads are distributed among different GEMM
  problems equally irrespective of their dimensions which
  leads to better performance for batches with identical GEMMs
  but performs sub-optimally for batches with non-identical GEMMs.
- Optimizations for batches with non-identical GEMMs is in progress.
- Added bench and input files for batch_matmul.

AMD-Internal: [SWLCSG-2944]
Change-Id: Idc59db5b8c5794bf19f6f86bcb8455cd2599c155
2025-01-03 03:28:32 -05:00
varshav2
7c78b9991f Bug Fixes in the F32F32 m == 1 transpose scenario
- added the missing stride updates in B reorder case in GEMV
 - added the missing stride updates for the cast of transA with B
   reordered case.

Change-Id: Ic89781dfa7c0d9380ea523796958f795828a1ade
2024-09-11 02:08:50 -04:00
Meghana Vankadari
5120f98e12 Developed all WoQ kernels for bf16s4f32o<f32|bf16>
Description:

1. Written 6x64 main and other fringe kernels for WoQ where scaling s4
   weights into bf16 performed in the kernel itself to reduce bandwidth.

2. These kernels are performing better compared to bf16 weights when m
   is small and n is large.

3. Established a threshold to do quantization support at packing of
   B (KCXNC) level or WoQ kernel level.

Change-Id: I4f8265b8b58c276ff2590cc948d1f920aa0bb289
2024-09-10 12:00:10 +00:00
varshav2
298a165718 Add TransA and TransB support for F32F32F32oF32
- Added support for TransA and transB in f32f32of32 APIs
 - Modified the GEMV case(m == 1) to support PACKB feature
 - Redirecting the operations to GEMM instead of GEMV in case of n == 1
   conditions, with storage scheme r/transA and c/transB to avoid the
   packing errors which would lead to failures in computation.

Change-Id: I0eb8c31485af4e33c53fd36b5e5788d75d3a67a9
2024-09-09 05:19:49 +05:30
Meghana Vankadari
2e1cc2f14a Added bf16s4f32 kernels to handle m=4 cases
Details:
- In WOQ, if m = 4, special case kernels are added where
  s4->bf16 conversion happens inside the compute kernel and
  packing is avoided. For all other cases, B matrix is
  dequantized and packed at KC loop level and native bf16
  kernels are re-used at compute level.
- Fixes in bench to avoid accuracy failures when datatype of
  output is bf16.

Change-Id: Ie8db42da536891693d5e82a5336b66514a50ccb2
2024-09-04 07:36:57 -04:00
Edward Smyth
591a3a7395 Code cleanup: file formats and permissions
- Remove execute file permission from source and make files.
- dos2unix conversion.
- Add missing eol at end of files.

Also update .gitignore to not exclude build directory but to
exclude any build_* created by cmake builds.

AMD-Internal: [CPUPL-4415]
Change-Id: I5403290d49fe212659a8015d5e94281fe41eb124
2024-08-05 11:52:33 -04:00
Nallani Bhaskar
c6dd7c1b4b Added new API in aocl_gemm to support A bf16 data type and B s4 data type
Description:

1. Added a new API aocl_gemm_bf16s4f32of32 to support
   for WoQ (Weight-only-Quantization) in LLM's

2. The API supports only reordered B matrix of data
   size signed 4 bits (S4).

3. Substracting zero point and multiplying with scale
   on B matrix is performed in packing B.

4. zero point and scale data should be passed by user
   through pre-ops data structure.

5. The API is still in experimental state and NOT tested.

   AMD-Internal: SWLCSG-2943

Change-Id: I10b159b64c2e2aaf39da5462685618ba8cc800ee
2024-07-25 11:59:03 +00:00
Meghana Vankadari
3a8b9270e7 Implemented lpgemv for AVX512-INT8 variants
- Implemented optimized lpgemv for both m == 1 and n == 1 cases.
- Fixed few bugs in LPGEMV for bf16 and f32 datatypes.
- Fixed few bugs in JIT-based implementation of LPGEMM for BF16
  datatype.

AMD-Internal: [SWLCSG-2354]
Change-Id: I245fd97c8f160b148656f782d241f86097a0cf38
2024-05-14 01:55:49 +05:30
Meghana Vankadari
1072770c63 Implemented LPGEMV for bf16 datatype
1. The 5 LOOP LPGEMM path is in-efficient when A or B is a vector
   (i.e, m == 1 or n == 1).

2. An efficient implementation is developed considering the b matrix
   reorder in case of m=1 and post-ops fusion.

3. When m = 1 the algorithm divide the GEMM workload in n dimension
   intelligently at a granularity of NR. Each thread work on A:1xk
   B:kx(>=NR) and produce C=1x(>NR).  K is unrolled by 4 along with
   remainder loop.

4. When n = 1 the algorithm divide the GEMM workload in m dimension
   intelligently at a granularity of MR. Each thread work on A:(>=MR)xk
   B:kx1 and produce C = (>=MR)x1. When n=1 reordering of B is avoided
   to efficiently process in n one kernel.

AMD-Internal: [SWLCSG-2355]
Change-Id: I7497dad4c293587cbc171a5998b9f2817a4db880
2024-05-06 23:55:15 +05:30
Bhaskar Nallani
2ce47e6f5e Implemented optimal AVX512-variant of f32 LPGEMV
1. The 5 LOOP LPGEMM path is in-efficient when A or B is a vector
   (i.e, m == 1 or n == 1).

2. An efficient implementation of lpgemv_rowvar_f32 is developed
   considering the b matrix reorder in case of m=1 and post-ops fusion.

3. When m = 1 the algorithm divide the GEMM workload in n dimension
   intelligently at a granularity of NR. Each thread work on A:1xk
   B:kx(>=NR) and produce C=1x(>NR).  K is unrolled by 4 along with
   remainder loop.

4. When n = 1 the algorithm divide the GEMM workload in m dimension
   intelligently at a granularity of MR. Each thread work on A:(>=MR)xk
   B:kx1 and produce C = (>=MR)x1. When n=1 reordering of B is avoided
   to efficiently process in n one kernel.

5. Fixed few warnings while loading 2 f32 bias elements using
   _mm_load_sd using float pointer. Typecasted to (const double *)

AMD-Internal: [SWLCSG-2391, SWLCSG-2353]
Change-Id: If1d0b8d59e0278f5f16b499de1d629e63da5b599
2024-03-04 23:53:23 +05:30
Edward Smyth
ed5010d65b Code cleanup: AMD copyright notice
Standardize format of AMD copyright notice.

AMD-Internal: [CPUPL-3519]
Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0
2023-11-23 08:54:31 -05:00
mkadavil
ea0324ab95 Multi data type downscaling support for u8s8s16 - u8s8s16<u8|s8>
Downscaling is used when GEMM output is accumulated at a higher
precision and needs to be converted to a lower precision afterwards.
Currently the u8s8s16 flavor of api only supports downscaling to s8
(int8_t) via aocl_gemm_u8s8s16os8 after results are accumulated at
int16_t.
LPGEMM is modified to support downscaling to different data types,
like u8, s16, apart from s8. The framework (5 loop) passes the
downscale data type to the micro-kernels. Within the micro-kernel,
based on the downscale type, appropriate beta scaling and output
buffer store logic is executed. This support is only enabled for
u8s8s16 flavor of api's.
The LPGEMM bench is also modified to support passing downscale data
type for performance and accuracy testing.

AMD-Internal: [SWLCSG-2313]
Change-Id: I723d0802baf8649e5e41236b239880a6043bfd30
2023-10-12 09:19:56 -04:00
eashdash
a72fff2be9 Added NEW LPGEMM TYPE- s8s8s16os16 and s8s8s16os8
1. New LPGEMM type - s8s8s16os16 and s8s8s16os8 are added.
2. New interface, frame and kernel files are added.
3. Frame and kernel level files added and modified for s8s8s16
4. s8s8s16 type involves design changes of 2 operations -
   Pack B and Mat Mul
5. Pack B kernel routines to pack B matrix for s16 FMA and compute the
   sum of every column of B matrix to implement the s8s8s16 operation
   using the s16 FMA instructions.
5. Mat Mul Kernel files to compute the GEMM output using s16 FMA.
   Here the A matrix elements are converted from int8 to uint8 (s16 FMA
   works with A matrix type uint8 only) by adding extra 128 to
   every A matrix element
6. Post GEMM computation, additional operations are performed on the
   accumulated outputs to get the correct results.
   Final C = C - ( (sum of column of B matrix) * 128 )
   This is done to compensate for the addition of extra 128 to every
   A matrix elements
7. With this change, two new LPGEMM APIs are introduced in LPGEMM -
   s8s8s16os16 and s8s8s16os8.
8. All previously added post-ops are supported on s8s8os16/os8 also.

AMD-Internal: [CPUPL-3234]
Change-Id: I3cc23e3dcf27f215151dda7c8db29b3a7505f05c
2023-04-21 05:30:38 -04:00
eashdash
bd8cd763ff Added NEW LPGEMM TYPE- S8S8S32/S8
1. New LPGEMM type - S8S8S32/S8 is added.
2. New interface, frame and kernel files are added.
3. Frame and kernel files added/modified for S8S8S32/S8 have
   2 operations - Pack B and Mat Mul
4. Pack B kernel routines to pack B matrix for VNNI and compute the sum
   of every column of B matrix to implement the S8S8S32 operation using
   the VNNI instructions.
5. Mat Mul Kernel files to compute the GEMM output using the VNNI.
   Here the A matrix elements are converted from int8 to uint8 (VNNI
   works with A matrix type uint8 only).
6. Post GEMM computation, additional operations are performed on the
   accumulated outputs to get the correct results.
7. With this change, two new LPGEMM APIs are introduced in LPGEMM -
   s8s8s32os32 and s8s8s32os8.
8. All previously added post-ops are supported on S8S8S32/S8 also.

AMD-Internal: [CPUPL-3154]
Change-Id: Ib18f82bde557ea4a815a63adc7870c4234bfb9d3
2023-03-31 05:44:54 -04:00
mkadavil
8dff49837d Lpgemm source restructuring to support amdzen config.
-Currently lpgemm can only be built using either zen3 or zen4 config.
The lpgemm kernel code is re-structured to support amdzen, and thus
multi machine deployment.
-The micro-kernel calls (gemm and pack) are currently hardcoded in the
lpgemm framework. This is removed and a new lpgemm_cntx based dispatch
mechanism is designed to support runtime configurability for
micro-kernels.

AMD-Internal: [CPUPL-2965]
Change-Id: I4bbcb4e5db767def1663caf5481f0b4c988149ef
2023-02-21 08:35:38 -05:00
bhaskarn
91a9968a5e Developed intrinsic based f32 kernels in lpgemm
Description:

1. Developed row variant intrinsic Kernels for float32/sgemm
   which are called from lpgemm api aocl_gemm_f32f32f32of32()

2. 6x64m, 6x48m, 6x32m kernels and respective fringe kernels are
   developed using avx512.

3. 6x16m main kernel and respective n fringe and mn fringe are
   are developed based on avx2 and avx

4. Modularizing, K loop unroll, perf tuning, post-ops and dynamic
   dispatch are planned next

5. When leading dims are greater than dims bench_lpgemm need
   to be updated to test it and this is planned next.

Change-Id: I54c78fef639ea109d6ef2c2b05c07ce396c81370
2023-02-20 01:11:22 -05:00
mkadavil
bf4d1da1b9 Column major input support for BFloat16 gemm.
-The bf16 gemm framework is modified to swap input column major matrices
and compute gemm for the transposed matrices (now row major) using the
existing row-major kernels. The output is written to C matrix assuming
it is transposed.
-Framework changes to support leading dimensions that are greater than
matrix widths.
-Bench changes to test low precision gemm for column major inputs.

AMD-Internal: [CPUPL-2570]
Change-Id: I22c76f52619fd76d0c0e41531828b437a1935495
2022-09-22 02:50:46 -04:00
mkadavil
958c9238ac Output downscaling support for low precision GEMM.
- Downscaling is used when GEMM output is accumulated at a higher
precision and needs to be converted to a lower precision afterwards.
This is required in AI workloads where quantization/dequantization
routines are used.
- New GEMM APIs are introduced specifically to support this use case.
Currently downscaling support is added for s32, s16 and bfloat16 GEMM.

AMD-Internal: [CPUPL-2475]
Change-Id: I81c3ee1ba5414f62427a7a0abb6ecef0c5ff71bf
2022-08-30 10:27:19 -04:00
eashdash
4e3e00fb7e Added low precision GEMM - bf16bf16f32of32
Feature Addition: Added a new variant of low precision GEMM to addon - BFloat16. The kernel takes bf16 type inputs and perform BF16 GEMM operations. The intermediate accumulation and output are in float.

1. Compute kernels will perform computations only if B matrix is reordered in accordance with the usage of AVX-512 BF16 instruction - dpbf16_ps
2. Kernel for packing B matrix is provided

Change-Id: If5d08213068869eff060c9998596d2d2703a6793
2022-08-24 03:27:00 -04:00
mkadavil
6fbdfc3cf2 Low precision gemm refactoring and bug fixes.
-The micro-kernel function signatures follow a common pattern. These
functions can be represented as an instantiation of a MACRO as is done
in BLIS, and thus the number of micro-kernel header files can be brought
down. A new single header file containing all the MACRO definitions with
the instantiation is added, and the existing unnecessary header files
are removed.
-The bias addition in micro-kernel for n remaining < 16 reads the bias
array assuming it contains 16 elements. This can result in seg-faults,
since out of bound memory is accessed. It is fixed by copying required
elements to an intermediate buffer and using that buffer for loading.
-Input matrix storage type parameter is added to lpgemm APIs. It can be
either row or column major, denoted by r and c respectively. Currently
only row major input matrices are supported.
-Bug fix in s16 fringe micro-kernel to use correct offset while storing
output.

AMD-Internal: [CPUPL-2386]
Change-Id: Idfa23e69d54ad7e06a67b1e36a5b5558fbff03a3
2022-08-14 17:39:00 +05:30