Commit Graph

19 Commits

Author SHA1 Message Date
Nallani Bhaskar
40719e0438 Implemented reference function for bf16 reorder function
Description:

Implemented a reference version for
aocl_gemm_reorder_bf16bf16f32of32 function
to make the code cpu architecture independent.

AMD-Internal: [ SWLCSG-3279 ]

Change-Id: I0c715864c0ab3e5afea2ee6ee9546b75c3dbf9ec
2024-12-17 05:46:39 +00:00
Deepak Negi
baeebe75c9 Support for standard AutoAWQ storage format.
Description:
1. AutoAWQ use a int32 buffer to store 8 elements each of 4 bits in this
   format [0, 2, 4, 6, 1, 3, 5, 7].
2. Support is added to convert above format back to the original
   sequential order [0, 1, 2, 3, 4, 5, 6, 7] before reordering
   in the AWQ API.

AMD-Internal: SWLCSG-3169

Change-Id: I5395766060c200ab81d0b8be94356678a169ac13
2024-12-02 04:02:27 -05:00
Nallani Bhaskar
9735391e1d Implemented f32tobf16 reorder function
Description:
aocl_reorder_f32obf16 function is implemented to
reorder input weight matrix of data type float to
bfloat16.

The reordering is done to match the input requirements
of API aocl_gemm_bf16bf16f32o<f32|bf16>.

The objective of the API is to convert a model/matrix
of type f32 to bf16 and process when machine supports
bf16 FMA instruction _mm512_dpbf16_ps but the model
is still in float

Change-Id: Ib7c743d52d01a1ac09e84ac120577ec9e02f90f5
2024-11-04 04:32:01 +00:00
Meghana Vankadari
b04b8f22c9 Introduced un-reorder API for bf16bf16f32of32
Details:
- Added a new API called unreorder that converts a matrix from
  reordered format to it's original format( row-major or col-major ).
- Currently this API only supports bf16 datatype.
- Added corresponding bench and input file to test accuracy of the
  API.
- The new API is only supported for 'B' matrix.
- Modified input validation checks in reorder API to account for
  row Vs col storage of matrix and transposes for bf16 datatype.

Change-Id: Ifb9c53b7e6da6f607939c164eb016e82514581b7
2024-10-23 07:49:24 -04:00
Meghana Vankadari
d5b4d3aa5e Fixing control flow in aocl_gemm_bf16s4f32of32|bf16
- Fixed framework of bf16s4f32of32 API to correct
  pointer updations.
- Modified pre_op structure to exclude pre-op-offset.
  Now offset is passed as a separate parameter to the
  scale-pack functions.
- Fixed work-distribution among threads in MT scenario.
- Added Blocksizes and kernel-pointers and verified
  functionality for the new API.

AMD-Internal: [SWLCSG-2943]
Change-Id: I58fece240d62c798c880a2b2b7fa64e560cc753d
2024-07-29 05:12:09 -04:00
Nallani Bhaskar
c6dd7c1b4b Added new API in aocl_gemm to support A bf16 data type and B s4 data type
Description:

1. Added a new API aocl_gemm_bf16s4f32of32 to support
   for WoQ (Weight-only-Quantization) in LLM's

2. The API supports only reordered B matrix of data
   size signed 4 bits (S4).

3. Substracting zero point and multiplying with scale
   on B matrix is performed in packing B.

4. zero point and scale data should be passed by user
   through pre-ops data structure.

5. The API is still in experimental state and NOT tested.

   AMD-Internal: SWLCSG-2943

Change-Id: I10b159b64c2e2aaf39da5462685618ba8cc800ee
2024-07-25 11:59:03 +00:00
Meghana Vankadari
3a8b9270e7 Implemented lpgemv for AVX512-INT8 variants
- Implemented optimized lpgemv for both m == 1 and n == 1 cases.
- Fixed few bugs in LPGEMV for bf16 and f32 datatypes.
- Fixed few bugs in JIT-based implementation of LPGEMM for BF16
  datatype.

AMD-Internal: [SWLCSG-2354]
Change-Id: I245fd97c8f160b148656f782d241f86097a0cf38
2024-05-14 01:55:49 +05:30
Meghana Vankadari
1072770c63 Implemented LPGEMV for bf16 datatype
1. The 5 LOOP LPGEMM path is in-efficient when A or B is a vector
   (i.e, m == 1 or n == 1).

2. An efficient implementation is developed considering the b matrix
   reorder in case of m=1 and post-ops fusion.

3. When m = 1 the algorithm divide the GEMM workload in n dimension
   intelligently at a granularity of NR. Each thread work on A:1xk
   B:kx(>=NR) and produce C=1x(>NR).  K is unrolled by 4 along with
   remainder loop.

4. When n = 1 the algorithm divide the GEMM workload in m dimension
   intelligently at a granularity of MR. Each thread work on A:(>=MR)xk
   B:kx1 and produce C = (>=MR)x1. When n=1 reordering of B is avoided
   to efficiently process in n one kernel.

AMD-Internal: [SWLCSG-2355]
Change-Id: I7497dad4c293587cbc171a5998b9f2817a4db880
2024-05-06 23:55:15 +05:30
Edward Smyth
ed5010d65b Code cleanup: AMD copyright notice
Standardize format of AMD copyright notice.

AMD-Internal: [CPUPL-3519]
Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0
2023-11-23 08:54:31 -05:00
Meghana Vankadari
eb5ab3f762 LPGEMM: Added transB support for bf16bf16f32o<bf16|f32> APIs
Details:
- Modified aocl_get_reorder_buf_size_ and aocl_reorder_ APIs
  to allow reordering from column major input matrix.
- Added new pack kernels that packs/reorders B matrix from
  column-major input format.
- Updated Early-return check conditions to account for trans
  parameters.
- Updated bench file to test/benchmark transpose support.

AMD-Internal: [CPUPL-2268]
Change-Id: Ida66d7e3033c52cca0229c6b78d16976fbbecc4c
2023-10-12 23:36:18 +05:30
Meghana Vankadari
4874895a68 LPGEMM: Added transA support for bf16bf16f32o<bf16|f32> APIs
Details:
- Added new params(order, trans) to aocl_get_reorder_buf_size_ and
  aocl_reorder_ APIs.
- Added new pack kernels that packs A matrix from either row-major or
  column major input matrix to pack buffer with row-major format.
- Updated cntx with pack kernel function pointers for packing A matrix.
- Transpose of A matrix is handled by packing A matrix to row-major
  format during run-time.
- Updated Early-return check conditions to account for trans parameters.
- Updated bench file to test/benchmark transpose support.

AMD-Internal: [SWLCSG-2268, SWLCSG-2442]
Change-Id: I43a113dc4bc11e6bb7cc4d768c239a16cb6bbea4
2023-10-11 07:16:08 -04:00
Edward Smyth
bb4c158e63 Merge commit 'b683d01b' into amd-main
* commit 'b683d01b':
  Use extra #undef when including ba/ex API headers.
  Minor preprocessor/header cleanup.
  Fixed typo in cpp guard in bli_util_ft.h.
  Defined eqsc, eqv, eqm to test object equality.
  Defined setijv, getijv to set/get vector elements.
  Minor API breakage in bli_pack API.
  Add err_t* "return" parameter to malloc functions.
  Always stay initialized after BLAS compat calls.
  Renamed membrk files/vars/functions to pba.
  Switch allocator mutexes to static initialization.

AMD-Internal: [CPUPL-2698]
Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df
2023-08-21 07:01:38 -04:00
Edward Smyth
0f0277e104 Code cleanup: dos2unix file conversion
Source and other files in some directories were a mixture of
Unix and DOS file formats. Convert all relevant files to Unix
format for consistency. Some Windows-specific files remain in
DOS format.

AMD-Internal: [CPUPL-2870]
Change-Id: Ic9a0fddb2dba6dc8bcf0ad9b3cc93774a46caeeb
2023-04-21 08:41:16 -04:00
Edward Smyth
6835205ba8 Code cleanup: spelling corrections
Corrections for spelling and other mistakes in code comments
and doc files.

AMD-Internal: [CPUPL-2870]
Change-Id: Ifbb5df7df2d6312fe73e06ee6d41c00b16c593ce
2023-04-19 12:44:56 -04:00
Edward Smyth
1ac03e64b5 BLIS cpuid tidy and bugfix.
Improvements to BLIS cpuid functionality:
- Tidy names of avx support test functions, especially rename
  bli_cpuid_is_avx_supported() to bli_cpuid_is_avx2fma3_supported()
  to more accurately describe what it tests.
- Fix bug in frame/base/bli_check.c related to changes in commit
  6861fcae91

AMD-Internal: [CPUPL-3031]
Change-Id: Iacd8fb0ffbd45288e536fc6314660709055ea2d5
2023-04-03 08:46:37 -04:00
mkadavil
3d74b62e60 Lpgemm threading and micro-kernel optimizations.
-Certain sections of the f32 avx512 micro-kernel were observed to
slow down when more post-ops are added. Analysis of the binary
pointed to false dependencies in instructions being introduced in
the presence of the extra post-ops. Addition of vzeroupper at the
beginning of ir loop in f32 micro-kernel fixes this issue.
-F32 gemm (lpgemm) thread factorization tuning for zen4/zen3 added.
-Alpha scaling (multiply instruction) by default was resulting in
performance regression when k dimension is small and alpha=1 in s32
micro-kernels. Alpha scaling is now only done when alpha != 1.
-s16 micro-kernel performance was observed to be regressing when
compiled with gcc for zen3 and older architecture supporting avx2.
This issue is not observed when compiling using gcc with avx512
support enabled. The root cause was identified to be the -fgcse
optimization flag in O2 when applied with avx2 support. This flag is
now disabled for zen3 and older zen configs.

AMD-Internal: [CPUPL-3067]
Change-Id: I5aef9013432c037eb2edf28fdc89470a2eddad1c
2023-03-16 11:44:51 +05:30
mkadavil
8dff49837d Lpgemm source restructuring to support amdzen config.
-Currently lpgemm can only be built using either zen3 or zen4 config.
The lpgemm kernel code is re-structured to support amdzen, and thus
multi machine deployment.
-The micro-kernel calls (gemm and pack) are currently hardcoded in the
lpgemm framework. This is removed and a new lpgemm_cntx based dispatch
mechanism is designed to support runtime configurability for
micro-kernels.

AMD-Internal: [CPUPL-2965]
Change-Id: I4bbcb4e5db767def1663caf5481f0b4c988149ef
2023-02-21 08:35:38 -05:00
eashdash
e674fae758 Post-Ops for bf16bf16f32
Functionality - Post-ops is a set of operations performed elemnent
wise on the output matrix post GEMM operation. The support for
the same is added by fusing post-ops with GEMM operations.

- Post-ops Bias, Relu and Parametric Relu are added to all the
compute kernels of bf16bf16f32of32
- Modified bf16 interface files to add check for bf16 ISA support

Change-Id: I2f7069a405037a59ea188a41bd8d10c4aae72fb3
2022-08-30 08:14:14 +00:00
eashdash
4e3e00fb7e Added low precision GEMM - bf16bf16f32of32
Feature Addition: Added a new variant of low precision GEMM to addon - BFloat16. The kernel takes bf16 type inputs and perform BF16 GEMM operations. The intermediate accumulation and output are in float.

1. Compute kernels will perform computations only if B matrix is reordered in accordance with the usage of AVX-512 BF16 instruction - dpbf16_ps
2. Kernel for packing B matrix is provided

Change-Id: If5d08213068869eff060c9998596d2d2703a6793
2022-08-24 03:27:00 -04:00