Commit Graph

11 Commits

Author SHA1 Message Date
Deepak Negi
baeebe75c9 Support for standard AutoAWQ storage format.
Description:
1. AutoAWQ use a int32 buffer to store 8 elements each of 4 bits in this
   format [0, 2, 4, 6, 1, 3, 5, 7].
2. Support is added to convert above format back to the original
   sequential order [0, 1, 2, 3, 4, 5, 6, 7] before reordering
   in the AWQ API.

AMD-Internal: SWLCSG-3169

Change-Id: I5395766060c200ab81d0b8be94356678a169ac13
2024-12-02 04:02:27 -05:00
Meghana Vankadari
fbb72d047f Added group quantization and zero-point support for WOQ kernels
Description:

1. Added group quantization and zero-point (zp) in
   aocl_gemm_bf16s4f32o<bf16|f32> API.

2. Group quantization is technique to improve accuracy
   where scale factors to dequantize weights varies at group
   level instead of per channel and per tensor level.

3. Added zp and scaling in woq packb kernels so that for
   large M values zp and scaling are performed at pack-b
   stage and bf16 kernels are called

4. Adding zp support and scaling to default path in WoQ kernels
   created some performance overhead when M value is very small.

5. Added string group_size to lpgemm bench to read
   group size from bench_input.txt and tested for
   various combinations of matrix dimensions.

6. The scalefactors could be of type float or bf16
   and the  zeropoint values are expected to be
   in int8 format.

AMD-Internal: [SWLCSG-3168, SWLCSG-3172]

Change-Id: Iff07b54d76edc7408eb2ea0b29ce8b4a04a38f57
2024-12-02 06:46:13 +00:00
Nallani Bhaskar
9735391e1d Implemented f32tobf16 reorder function
Description:
aocl_reorder_f32obf16 function is implemented to
reorder input weight matrix of data type float to
bfloat16.

The reordering is done to match the input requirements
of API aocl_gemm_bf16bf16f32o<f32|bf16>.

The objective of the API is to convert a model/matrix
of type f32 to bf16 and process when machine supports
bf16 FMA instruction _mm512_dpbf16_ps but the model
is still in float

Change-Id: Ib7c743d52d01a1ac09e84ac120577ec9e02f90f5
2024-11-04 04:32:01 +00:00
Meghana Vankadari
b04b8f22c9 Introduced un-reorder API for bf16bf16f32of32
Details:
- Added a new API called unreorder that converts a matrix from
  reordered format to it's original format( row-major or col-major ).
- Currently this API only supports bf16 datatype.
- Added corresponding bench and input file to test accuracy of the
  API.
- The new API is only supported for 'B' matrix.
- Modified input validation checks in reorder API to account for
  row Vs col storage of matrix and transposes for bf16 datatype.

Change-Id: Ifb9c53b7e6da6f607939c164eb016e82514581b7
2024-10-23 07:49:24 -04:00
Meghana Vankadari
d5b4d3aa5e Fixing control flow in aocl_gemm_bf16s4f32of32|bf16
- Fixed framework of bf16s4f32of32 API to correct
  pointer updations.
- Modified pre_op structure to exclude pre-op-offset.
  Now offset is passed as a separate parameter to the
  scale-pack functions.
- Fixed work-distribution among threads in MT scenario.
- Added Blocksizes and kernel-pointers and verified
  functionality for the new API.

AMD-Internal: [SWLCSG-2943]
Change-Id: I58fece240d62c798c880a2b2b7fa64e560cc753d
2024-07-29 05:12:09 -04:00
Nallani Bhaskar
c6dd7c1b4b Added new API in aocl_gemm to support A bf16 data type and B s4 data type
Description:

1. Added a new API aocl_gemm_bf16s4f32of32 to support
   for WoQ (Weight-only-Quantization) in LLM's

2. The API supports only reordered B matrix of data
   size signed 4 bits (S4).

3. Substracting zero point and multiplying with scale
   on B matrix is performed in packing B.

4. zero point and scale data should be passed by user
   through pre-ops data structure.

5. The API is still in experimental state and NOT tested.

   AMD-Internal: SWLCSG-2943

Change-Id: I10b159b64c2e2aaf39da5462685618ba8cc800ee
2024-07-25 11:59:03 +00:00
mkadavil
7114376519 New kernels for int4 B matrix reordering following BF16 kernel schema.
-To enable Weight-only-Quantization (WOQ) workflow, new LPGEMM APIs
are required where data types are A:bf16, B:int4 and C:f32/bf16. It
is expected that the BF16 kernels will be reused within this API and
subsequently the B matrix needs to be reordered following the BF16
kernel schema, but with the reordered matrix type still being int4. To
address this, new BF16 reorder kernels enabling the same are added.

AMD-Internal: [SWLCSG-2943]
Change-Id: Ib770ecbf90a3d906deafece94b1a96e0b9412738
2024-07-25 01:10:13 -04:00
Edward Smyth
ed5010d65b Code cleanup: AMD copyright notice
Standardize format of AMD copyright notice.

AMD-Internal: [CPUPL-3519]
Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0
2023-11-23 08:54:31 -05:00
Nallani Bhaskar
b3391ef5da Updated ERF threshold and packa changes in bf16
Description:
    1. Updated ERF function threshold from 3.91920590400 to 3.553
       to match with the reference erf float implementation which
       reduced errors a the borders and also clipped the output
       to 1.0
    2. Updated packa function call with pack function ptr in bf16
       api to avoid compilation issues for non avx512bf16 archs

    3. Updated lpgemm bench

    [AMD-Internal: SWLCSG-2423 ]

Change-Id: Id432c0669521285e6e6a151739d9a72a7340381d
2023-10-29 23:55:46 +05:30
Meghana Vankadari
eb5ab3f762 LPGEMM: Added transB support for bf16bf16f32o<bf16|f32> APIs
Details:
- Modified aocl_get_reorder_buf_size_ and aocl_reorder_ APIs
  to allow reordering from column major input matrix.
- Added new pack kernels that packs/reorders B matrix from
  column-major input format.
- Updated Early-return check conditions to account for trans
  parameters.
- Updated bench file to test/benchmark transpose support.

AMD-Internal: [CPUPL-2268]
Change-Id: Ida66d7e3033c52cca0229c6b78d16976fbbecc4c
2023-10-12 23:36:18 +05:30
Meghana Vankadari
4874895a68 LPGEMM: Added transA support for bf16bf16f32o<bf16|f32> APIs
Details:
- Added new params(order, trans) to aocl_get_reorder_buf_size_ and
  aocl_reorder_ APIs.
- Added new pack kernels that packs A matrix from either row-major or
  column major input matrix to pack buffer with row-major format.
- Updated cntx with pack kernel function pointers for packing A matrix.
- Transpose of A matrix is handled by packing A matrix to row-major
  format during run-time.
- Updated Early-return check conditions to account for trans parameters.
- Updated bench file to test/benchmark transpose support.

AMD-Internal: [SWLCSG-2268, SWLCSG-2442]
Change-Id: I43a113dc4bc11e6bb7cc4d768c239a16cb6bbea4
2023-10-11 07:16:08 -04:00