Description:
1. AutoAWQ uses an int32 buffer to store eight 4-bit elements in the
   interleaved order [0, 2, 4, 6, 1, 3, 5, 7].
2. Support is added to convert this format back to the original
   sequential order [0, 1, 2, 3, 4, 5, 6, 7] before reordering
   in the AWQ API.
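As an illustration, the de-interleave step can be sketched as below. The helper name and scalar loop are hypothetical; the actual library kernel operates on whole buffers and is vectorized.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative helper (not the AWQ API itself): unpack one int32 word
 * holding eight 4-bit values stored in AutoAWQ interleaved order
 * [0, 2, 4, 6, 1, 3, 5, 7] back into sequential order [0..7]. */
static void awq_unpack_int32(uint32_t packed, uint8_t out[8])
{
    /* Nibble position i inside the word holds logical element awq_order[i]. */
    static const int awq_order[8] = { 0, 2, 4, 6, 1, 3, 5, 7 };
    for (int i = 0; i < 8; ++i)
    {
        uint8_t nibble = (uint8_t)((packed >> (4 * i)) & 0xFu);
        out[awq_order[i]] = nibble;
    }
}
```

For example, the word 0x75316420 (logical values 0..7 stored in AWQ order) unpacks to the sequential sequence 0, 1, 2, 3, 4, 5, 6, 7.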
AMD-Internal: SWLCSG-3169
Change-Id: I5395766060c200ab81d0b8be94356678a169ac13
Description:
1. Added group quantization and zero-point (zp) support in the
   aocl_gemm_bf16s4f32o<bf16|f32> API.
2. Group quantization is a technique to improve accuracy in which
   the scale factors used to dequantize the weights vary at group
   level instead of at per-channel or per-tensor level.
3. Added zp and scaling in the WoQ packb kernels so that, for
   large M values, zp and scaling are performed at the pack-B
   stage and the bf16 kernels are called.
4. Adding zp and scaling support to the default path in the WoQ
   kernels introduced some performance overhead when the M value
   is very small.
5. Added a group_size string to the lpgemm bench to read the
   group size from bench_input.txt; tested various combinations
   of matrix dimensions.
6. The scale factors can be of type float or bf16, and the
   zero-point values are expected in int8 format.
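A scalar sketch of group-level dequantization, assuming float scale factors and int8 zero points with one of each per group along K (the function name and layout are illustrative only, not the library's internal kernels):

```c
#include <assert.h>
#include <stdint.h>

/* Dequantize one K-length strip of quantized weights using group-level
 * scale factors and zero points: one (scale, zp) pair per group of
 * group_size consecutive elements, instead of one pair for the whole
 * column (per-channel) or the whole matrix (per-tensor). */
static void dequant_grouped(const int8_t *q, float *w, int K,
                            int group_size, const float *scales,
                            const int8_t *zp)
{
    for (int k = 0; k < K; ++k)
    {
        int g = k / group_size;  /* group index for this element */
        w[k] = ((float)q[k] - (float)zp[g]) * scales[g];
    }
}
```

With K = 4 and group_size = 2, elements 0-1 use scales[0]/zp[0] and elements 2-3 use scales[1]/zp[1], so finer-grained scale factors track the local weight distribution.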
AMD-Internal: [SWLCSG-3168, SWLCSG-3172]
Change-Id: Iff07b54d76edc7408eb2ea0b29ce8b4a04a38f57
Description:
The aocl_reorder_f32obf16 function is implemented to reorder an
input weight matrix of data type float to bfloat16.
The reordering is done to match the input requirements of the
aocl_gemm_bf16bf16f32o<f32|bf16> API.
The objective of the API is to convert a model/matrix of type f32
to bf16 so it can be processed when the machine supports the bf16
FMA instruction _mm512_dpbf16_ps but the model is still in float.
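The per-element conversion can be sketched as below: keep the top 16 bits of the IEEE-754 float with round-to-nearest-even. The helper name is made up and the real kernel is vectorized; NaN payloads are not treated specially here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar f32 -> bf16 conversion: bf16 is the upper 16 bits of an
 * IEEE-754 binary32 value, rounded to nearest-even on the discarded
 * lower 16 bits. */
static uint16_t f32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));           /* type-pun safely     */
    uint32_t round = 0x7FFFu + ((bits >> 16) & 1u); /* nearest-even bias */
    return (uint16_t)((bits + round) >> 16);
}
```

For instance, 1.0f (0x3F800000) converts to 0x3F80 and -2.0f (0xC0000000) to 0xC000.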
Change-Id: Ib7c743d52d01a1ac09e84ac120577ec9e02f90f5
Details:
- Added a new API, unreorder, that converts a matrix from the
  reordered format back to its original format (row-major or
  col-major).
- Currently this API supports only the bf16 datatype.
- Added a corresponding bench and input file to test the accuracy
  of the API.
- The new API is supported only for the 'B' matrix.
- Modified the input validation checks in the reorder API to account
  for row vs. col storage of the matrix and transposes for the bf16
  datatype.
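A toy sketch of the reorder/unreorder round trip, using a made-up panel layout (NOT the library's bf16 packing schema) purely to illustrate that unreorder inverts reorder:

```c
#include <assert.h>

#define NR 4  /* toy panel width; assumes n is a multiple of NR */

/* Reorder a row-major k x n matrix B into column panels of width NR. */
static void toy_reorder(const float *b, float *rb, int k, int n)
{
    int idx = 0;
    for (int jp = 0; jp < n; jp += NR)          /* panel loop      */
        for (int i = 0; i < k; ++i)             /* rows            */
            for (int j = jp; j < jp + NR; ++j)  /* cols in panel   */
                rb[idx++] = b[i * n + j];
}

/* Inverse: walk the panels in the same order and scatter elements
 * back to their original row-major positions. */
static void toy_unreorder(const float *rb, float *b, int k, int n)
{
    int idx = 0;
    for (int jp = 0; jp < n; jp += NR)
        for (int i = 0; i < k; ++i)
            for (int j = jp; j < jp + NR; ++j)
                b[i * n + j] = rb[idx++];
}
```

Applying toy_reorder then toy_unreorder to any 2 x 8 matrix recovers the original contents exactly, which is the property the new bench verifies for the real API.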
Change-Id: Ifb9c53b7e6da6f607939c164eb016e82514581b7
- Fixed the framework of the bf16s4f32of32 API to correct
  pointer updates.
- Modified the pre_op structure to exclude the pre-op offset.
  The offset is now passed as a separate parameter to the
  scale-pack functions.
- Fixed the work distribution among threads in the MT scenario.
- Added block sizes and kernel pointers and verified the
  functionality of the new API.
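The kind of thread work distribution involved can be illustrated with a minimal balanced row split (the name and signature below are hypothetical, not the library's internals):

```c
#include <assert.h>

/* Balanced split of m rows among nt threads: each thread gets
 * floor(m/nt) rows, and the first m % nt threads take one extra row,
 * so the ranges are contiguous and cover all rows exactly once. */
static void thread_range(int m, int nt, int tid, int *start, int *len)
{
    int base = m / nt, rem = m % nt;
    *len   = base + (tid < rem ? 1 : 0);
    *start = tid * base + (tid < rem ? tid : rem);
}
```

For m = 10 and nt = 4 the ranges are [0,3), [3,6), [6,8), [8,10): contiguous, non-overlapping, and covering all 10 rows.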
AMD-Internal: [SWLCSG-2943]
Change-Id: I58fece240d62c798c880a2b2b7fa64e560cc753d
Description:
1. Added a new API, aocl_gemm_bf16s4f32of32, to support
   WoQ (Weight-only Quantization) in LLMs.
2. The API supports only a reordered B matrix with elements of
   signed 4-bit (S4) size.
3. Subtracting the zero point from and applying the scale to the
   B matrix is performed while packing B.
4. The zero-point and scale data should be passed by the user
   through the pre-ops data structure.
5. The API is still in an experimental state and NOT tested.
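The per-element math applied while packing B can be sketched as below; the helper names are illustrative and do not reflect the pre-ops data structure:

```c
#include <assert.h>
#include <stdint.h>

/* Extract one signed 4-bit (S4) value from a packed byte: two S4
 * elements share each byte, selected by the high/low nibble. */
static int8_t s4_extract(uint8_t byte, int high)
{
    uint8_t nib = high ? (uint8_t)(byte >> 4) : (uint8_t)(byte & 0xFu);
    return (int8_t)(int8_t)(nib << 4) >> 4;  /* sign-extend 4 -> 8 bits */
}

/* Dequantize: subtract the zero point, then multiply by the scale. */
static float s4_dequant(uint8_t byte, int high, int8_t zp, float scale)
{
    return ((float)s4_extract(byte, high) - (float)zp) * scale;
}
```

For example, the byte 0xF2 holds 2 in the low nibble and -1 (sign-extended 0xF) in the high nibble; with zp = 1 and scale = 2.0 the high element dequantizes to (-1 - 1) * 2.0 = -4.0.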
AMD-Internal: SWLCSG-2943
Change-Id: I10b159b64c2e2aaf39da5462685618ba8cc800ee
- To enable the Weight-only Quantization (WOQ) workflow, new LPGEMM
  APIs are required where the data types are A: bf16, B: int4, and
  C: f32/bf16. It is expected that the BF16 kernels will be reused
  within these APIs, so the B matrix needs to be reordered following
  the BF16 kernel schema, but with the reordered matrix type still
  being int4. To address this, new BF16 reorder kernels enabling
  this are added.
AMD-Internal: [SWLCSG-2943]
Change-Id: Ib770ecbf90a3d906deafece94b1a96e0b9412738
Description:
1. Updated the ERF function threshold from 3.91920590400 to 3.553
   to match the reference erf float implementation, which reduced
   errors at the borders, and also clipped the output to 1.0.
2. Updated the packa function call to use the pack function pointer
   in the bf16 API to avoid compilation issues on non-AVX512BF16
   archs.
3. Updated the lpgemm bench.
AMD-Internal: [SWLCSG-2423]
Change-Id: Id432c0669521285e6e6a151739d9a72a7340381d
Details:
- Modified the aocl_get_reorder_buf_size_ and aocl_reorder_ APIs
  to allow reordering from a column-major input matrix.
- Added new pack kernels that pack/reorder the B matrix from the
  column-major input format.
- Updated the early-return check conditions to account for the
  trans parameters.
- Updated the bench file to test/benchmark the transpose support.
AMD-Internal: [CPUPL-2268]
Change-Id: Ida66d7e3033c52cca0229c6b78d16976fbbecc4c
Details:
- Added new params (order, trans) to the aocl_get_reorder_buf_size_
  and aocl_reorder_ APIs.
- Added new pack kernels that pack the A matrix from either a
  row-major or column-major input matrix into the pack buffer in
  row-major format.
- Updated the cntx with pack kernel function pointers for packing
  the A matrix.
- Transposition of the A matrix is handled by packing it to
  row-major format at run-time.
- Updated the early-return check conditions to account for the
  trans parameters.
- Updated the bench file to test/benchmark the transpose support.
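The trans handling can be illustrated with a minimal scalar pack of A into a row-major buffer (a sketch only; the real kernels are blocked and vectorized, and the signature is hypothetical):

```c
#include <assert.h>

/* Pack an m x k A matrix into a row-major buffer regardless of whether
 * the input is stored row-major ('r') or column-major ('c'). For a
 * column-major input this amounts to a transpose performed at pack
 * time, which is how the transpose case is handled at run-time. */
static void pack_a_rowmajor(const float *a, float *pa, int m, int k,
                            char order)
{
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < k; ++j)
            pa[i * k + j] = (order == 'r') ? a[i * k + j]   /* row-major */
                                           : a[j * m + i];  /* col-major */
}
```

Packing the same logical 2 x 3 matrix from its row-major and column-major storage produces identical row-major pack buffers, so the downstream compute kernels never need to know the original storage order.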
AMD-Internal: [SWLCSG-2268, SWLCSG-2442]
Change-Id: I43a113dc4bc11e6bb7cc4d768c239a16cb6bbea4