204 Commits

Author SHA1 Message Date
varshav
acee9c7d4e Added column-major support for BF16 tiny path
- Added a column-major path for the BF16 tiny path.
 - Tuned tiny-path thresholds so that a few more inputs qualify for
   the tiny path.

AMD-Internal: [SWLCSG-3380]
Change-Id: I9a5578c9f0d689881fc5a67ab778e6a917c4fce1
2025-03-13 05:45:33 -04:00
varshav
4d22451fbb Bug Fix in BF16 Re-order/unreorder with AOCL_ENABLE_INSTRUCTIONS
- Currently, the bf16 reorder function does not add padding for
   n=1 cases, but the bf16 AVX2 unreorder path expects the reordered
   input B matrix to be padded along the n and k dimensions.
 - Hence, modified the conditions so that the path does not break when
   the AVX2 kernels are executed on AVX512 machines with a reordered
   B matrix.

Change-Id: I7dd3d37a24758a8e93e80945b533abfcf15f65a1
2025-03-05 06:31:19 +00:00
Mithun Mohan
37d590e53f Tid spread threshold update in LPGEMM thread decorator.
-Currently the TID spread is not applied for n=4096 even when there
are threads available to facilitate it. Updated the threshold to
account for this case.

AMD-Internal: [SWLCSG-3185]
Change-Id: I281b1639c32ba2145bd84062324f1f11b1167eeb
2025-03-04 10:53:51 +00:00
Meghana Vankadari
7243a5d521 Implemented group level static quantization for s8s8s32of32|bf16 APIs
Details:
- Group quantization is a technique to improve accuracy
  where the scale factors used to quantize inputs and weights
  vary at group level instead of per-channel
  or per-tensor level.
- Added new bench files to test GEMM with symmetric static
  quantization.
- Added new get_size and reorder functions to account for
  storing sum of col-values separately per group.
- Added new framework, kernels to support the same.
- The scalefactors could be of type float or bf16.
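The group-level indexing described above can be sketched as follows; the helper name, signature and scale layout are illustrative assumptions for this commit log, not the library's actual internals:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of group-level scale factors: every group_size
 * consecutive elements along K share one scale factor, instead of one
 * scale per channel or per tensor. */
static inline float dequant_s8_group(signed char q, const float *scales,
                                     size_t k, size_t group_size)
{
    return (float)q * scales[k / group_size];
}
```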

AMD-Internal:[SWLCSG-3274]

Change-Id: I3e69ecd56faa2679a4f084031d35ffb76556230f
2025-02-28 04:44:44 -05:00
Nallani Bhaskar
0e6b562711 Implemented s8 unreorder reference API
Description:
1. Implemented the s8 unreorder API, which restores a reordered
   int8 matrix to its original layout.
2. Removed the bf16vnni check from the bf16 unreorder reference API,
   since reference code can run on any architecture.
3. Tested the reference code for all main and fringe paths.

AMD-Internal: [SWLCSG-3426]

Change-Id: I920f807be870e1db5f9d0784cdcec7b366e1eff5
2025-02-27 13:06:40 +00:00
Deepak Negi
cc321fb95d Added support for different types of zero-point in f32 eltwise APIs.
Description:
 - Added zero-point support for <s32/s8/bf16/u8> data types in the
   element-wise post-op-only f32o<f32/s8/u8/s32/bf16> APIs.

 AMD-Internal: [SWLCSG-3390]

Change-Id: I2fdb308b05c1393013294df7d8a03cdcd7978379
2025-02-26 04:04:13 -05:00
Mithun Mohan
7394aafd1e New A packing kernels for F32 API in LPGEMM.
-New packing kernels for the A matrix, based on both AVX512 and AVX2
ISAs and covering both row- and column-major storage, are added as
part of this change. The dependency on the haswell A packing kernels
is removed by this.
-Tiny GEMM thresholds are further tuned for BF16 and F32 APIs.

AMD-Internal: [SWLCSG-3380, SWLCSG-3415]

Change-Id: I7330defacbacc9d07037ce1baf4a441f941e59be
2025-02-26 05:23:35 +00:00
varshav
8a69141294 Bug fix in BF16-F32 supported AVX2 Kernels
- Bug fix in the Matrix Mul post-op.
 - Updated the config in the AVX512_VNNI_BF16 context
   so that it works with the AVX2 kernels.

Change-Id: I25980508facc38606596402dba4cfce88f4eb173
2025-02-25 14:42:45 +00:00
varshav
a0005c60ce Add col-major pack kernels and BF16 output support in F32 AVX-2 kernels.
- Added column-major pack kernels, which transpose the BF16 input
   matrix and store it as an F32 input matrix.
 - Added BF16 zero-point downscale support to the F32 main and fringe
   kernels.
 - Updated the Matrix Add and Matrix Mul post-ops in the f32-AVX2 main
   and fringe kernels to support BF16 input.
 - Modified the f32 tiny kernels loop to update the buf_downscale
   parameter.
 - Modified the bf16bf16f32obf16 framework to work on AVX2 systems.
 - Added a wrapper in the bf16 5-loop to call the corresponding
   AVX2/AVX512 5-loop functions.
 - Bug fixes in the f32-AVX2 kernels' BIAS post-ops.
 - Bug fixes in the convert function and in the bf16 5-loop
   for multi-threaded inputs.

AMD-Internal:[SWLCSG-3281 , CPUPL-6447]

Change-Id: I4191fbe6f79119410c2328cd61d9b4d87b7a2bcd
2025-02-24 09:51:12 +05:30
Nallani Bhaskar
5a3c58b315 Fixed column major case of bf16 un-reorder reference function
Description:

1. Fixed the bf16 un-reorder column-major kernel.
2. Fixed a bug in the nrlt16 case of the f32obf16 reorder function.
3. Unit testing done.

AMD-internal: [SWLCSG-3279]

Change-Id: I65024342935ae65186b95885eb010baf3269aa7d
2025-02-20 06:26:31 -05:00
Mithun Mohan
ae182c3fcc Using GEN_BUF buffer instead of <A|B>_PANEL for pack buffer in F32/BF16.
-When bli_pba_acquire_m is invoked to get a buffer for packing, if the
buffer type is BLIS_BUFFER_FOR_B_PANEL, the memory is returned from a
memory pool. To ensure thread safety, this memory pool is protected
with locks. If instead the buffer type is BLIS_BUFFER_FOR_GEN_USE,
the memory is allocated using malloc.
-However, it was observed that for relatively small input dimensions,
if on-the-fly packing is required and jc_ways is sufficiently large,
there was contention on the lock for the memory pool of the B_PANEL
buffer type. This turned out to be an overhead and is now avoided by
checking out the GEN_USE buffer type for packing.

AMD-Internal: [SWLCSG-3398]

Change-Id: I781ad5da2a2f75997b58d6c3db70f6277250bd99
2025-02-14 06:12:51 -05:00
Meghana Vankadari
17634d7ae8 Fixed compiler errors and warnings for gcc < 11.2
Description:

1. With gcc versions older than 11.2, a few BF16 instructions are not
   supported by the compiler even though the Zen4 and Zen5 processor
   architectures support them.

2. These instructions are now guarded with a macro.


Change-Id: Ib07d41ff73d8fe14937af411843286c0e80c4131
2025-02-13 10:18:13 -05:00
Mithun Mohan
d61c54dc26 Enable BF16 tiny GEMM path only for Zen4/5 arch id.
-BF16 tiny GEMM path is only enabled for Zen4 or Zen5 arch id as
returned by the bli_arch_query_id function. Additionally it is
disabled if JIT kernels are used.

-Fixed nrlt16 case in bf16_unreorder_ref function

AMD-Internal: [SWLCSG-3380, SWLCSG-3258]

Change-Id: I8af638a85e949f12181bc56c63e5e983c24ca3af
2025-02-12 06:39:53 -05:00
Mithun Mohan
4cfbb47b87 Initialize block sizes for F32 element wise post-op APIs.
-The block sizes and micro kernel dimensions for the F32OF32 group
of APIs are updated in the element wise operations cntx map.

AMD-Internal: [SWLCSG-3390]

Change-Id: Ic5690b7eb4f7b2559d893f374dd811b00e31e329
2025-02-11 06:47:24 -05:00
varshav
f4e3a4b1c3 AVX2 Support for BF16 Kernels - Bug fixes
- Added early-return checks for A/B transpose cases and column-major
  inputs, as these are not currently supported.
- Enabled the JIT kernels for the Zen4 architecture.

AMD-Internal: [SWLCSG-3281]

Change-Id: Ie671676c51c739dd18709892414fd34d26a540df
2025-02-11 12:40:43 +05:30
Nallani Bhaskar
0acb5eb9a4 Implemented reference unreorder bf16 function
Description:

Implemented a C reference for the
aocl_gemm_unreorder_bf16bf16f32of32 function.

The implementation works for row-major inputs;
column-major support is yet to be enabled.

AMD-Internal: [ SWLCSG-3279 ]

Change-Id: Ibcce4180bb897a40252140012d8d6886c38cb77a
2025-02-11 02:04:42 +00:00
varshav2
ef04388a44 Added AVX2 support for BF16 kernels: Row major
- Currently the BF16 kernels use the AVX512 VNNI instructions.
   To support AVX2 kernels, the BF16 input has to be converted
   to F32 and then the F32 kernels have to be executed.
 - Added an un-pack function for the B matrix, which unpacks the
   re-ordered BF16 B matrix and converts it to float.
 - Added a kernel to convert the matrix data from BF16 to F32 for the
   given input.
 - Added a new path to the BF16 5-loop to work with BF16 data, where
   the packed/unpacked A matrix is converted from BF16 to F32. The
   packed B matrix is converted from BF16 to F32, and the re-ordered B
   matrix is un-reordered and converted to F32 before being fed to the
   F32 micro-kernels.
 - Removed AVX512 condition checks in the BF16 code path.
 - Added the re-order reference code path to support BF16 AVX2.
 - Currently the F32 AVX2 kernels support only F32 BIAS.
   Added BF16 support for the BIAS post-op in the F32 AVX2 kernels.
 - Bug fix in the test input generation script.

AMD-Internal: [SWLCSG-3281]

Change-Id: I1f9d59bfae4d874bf9fdab9bcfec5da91eadb0fb
2025-02-10 08:18:52 -05:00
Meghana Vankadari
da3d0c6034 Added new Int8 batch_gemm APIs
Details:
- Added u8s8s32of32|bf16|u8 batch_gemm APIs.
- Fixed some bugs in bench file for bf16 API.

Change-Id: I55380238869350a848f2deec0641d7b9b416b192
2025-02-10 11:19:02 +00:00
Deepak Negi
3a7523b51b Element wise post-op APIs are upgraded with new post-ops
Description:

1. Added new output types for the f32 element-wise APIs to support
   s8, u8, s32 and bf16 outputs.

2. Updated the base f32 API to support all the post-ops supported in
   the gemm APIs.

AMD Internal: [SWLCSG-3384]

Change-Id: I1a7caac76876ddc5a121840b4e585ded37ca81e8
2025-02-10 01:06:39 -05:00
Mithun Mohan
bffa92ec93 Deprecate S16 LPGEMM APIs.
-The following S16 APIs are removed:
1. aocl_gemm_u8s8s16os16
2. aocl_gemm_u8s8s16os8
3. aocl_gemm_u8s8s16ou8
4. aocl_gemm_s8s8s16os16
5. aocl_gemm_s8s8s16os8
along with the associated reorder APIs and corresponding
framework elements.

AMD-Internal: [CPUPL-6412]

Change-Id: I251f8b02a4cba5110615ddeb977d86f5c949363b
2025-02-07 11:43:28 +00:00
Edward Smyth
1f0fb05277 Code cleanup: Copyright notices (2)
More changes to standardize copyright formatting and correct years
for some files modified in recent commits.

AMD-Internal: [CPUPL-5895]
Change-Id: Ie95d599710c1e0605f14bbf71467ca5f5352af12
2025-02-07 05:41:44 -05:00
Edward Smyth
c74faac80f Fix compiler warning messages
Various occurrences of the following compiler warnings have been
fixed:
* Type mismatch
* Misleading code indentation
* Array bounds violation warning in blastest when using gcc 11
  without -fPIC flag

AMD-Internal: [CPUPL-5895]
Change-Id: Ia5d5310b76a66e87ad3953a72e8472ed5b01e588
2025-02-07 05:03:49 -05:00
Mithun Mohan
b9f6286731 Tiny GEMM path for BF16 LPGEMM API.
-Currently the BF16 API uses the 5-loop algorithm inside the OMP loop
to compute the results, irrespective of the input sizes. However, it
was observed that for very tiny sizes (n <= 128, m <= 36), this OMP
loop and the NC, MC, KC loops were turning out to be overheads.
-To address this, a new path without the OMP loop, with just the
NR loop over the micro-kernel, is introduced for tiny inputs. This is
only applied when the number of threads set for GEMM is 1.
-Only row major inputs are allowed to proceed with tiny GEMM.
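The gating condition above can be sketched as a single predicate; the thresholds and restrictions come from this commit message, while the function name and parameters are illustrative assumptions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hedged sketch of the tiny-path gate: bypass the OMP and NC/MC/KC
 * loops only for very small, single-threaded, row-major problems. */
static bool use_tiny_gemm_path(size_t m, size_t n, int num_threads,
                               bool row_major)
{
    return row_major && num_threads == 1 && n <= 128 && m <= 36;
}
```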

AMD-Internal: [SWLCSG-3380, SWLCSG-3258]

Change-Id: I9dfa6b130f3c597ca7fcf5f1bc1231faf39de031
2025-02-07 04:37:11 -05:00
Deepak Negi
86e52783e4 Tiny GEMM path for F32 LPGEMM API.
-Currently the F32 API uses the 5-loop algorithm inside the OMP loop
to compute the results, irrespective of the input sizes. However, it
was observed that for very tiny sizes (n <= 128, m <= 36), this OMP
loop and the NC, MC, KC loops were turning out to be overheads.
-To address this, a new path without the OMP loop, with just the
NR loop over the micro-kernel, is introduced for tiny inputs. This is
only applied when the number of threads set for GEMM is 1.

AMD-Internal: [SWLCSG-3380]

Change-Id: Ia712a0df19206b57efe4c97e9764d4b37ad7e275
2025-02-06 23:36:44 -05:00
Deepak Negi
2e687d8847 Updated all post-ops in s8s8s32 API to operate in float precision
Description:

1. Changed all post-ops in s8s8s32o<s32|s8|u8|f32|bf16> to operate
   on float data: the s32 accumulator registers are converted to
   float at the end of the k loop, and every post-op then operates
   on f32 values.

2. Added the s8s8s32ou8 API, which uses the s8s8s32os32 kernels but
   stores the output in u8.
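A scalar sketch of this flow (not the vectorized kernel) might look as follows; the function name, the choice of scale+bias as the post-op, and the truncating store are illustrative assumptions:

```c
#include <assert.h>

/* Sketch: the s32 accumulator is converted to f32 at the end of the
 * k loop, post-ops (here a simple scale + bias) run in float, and for
 * the *ou8 APIs the result is saturated to the u8 range on store. */
static unsigned char s32_accum_postop_u8(int acc, float scale, float bias)
{
    float v = (float)acc * scale + bias;  /* post-ops operate on f32 */
    if (v < 0.0f)   v = 0.0f;             /* saturate to u8 range */
    if (v > 255.0f) v = 255.0f;
    return (unsigned char)v;
}
```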

AMD-Internal - SWLCSG-3366

Change-Id: Iadfd9bfb98fc3bf21e675acb95553fe967b806a6
2025-02-06 07:31:28 -05:00
Mithun Mohan
0701a4388a Thread factorization improvements (ic ways) for BF16 LPGEMM API.
-Currently, when m is small compared to n, even if the number of MR
blocks (m / MR) > 1 and the total work blocks (MR blks * NR blks) <
available threads, the number of threads assigned to the m dimension
(ic ways) is 1. This results in sub-par performance in
bandwidth-bound cases. To address this, the thread factorization is
updated to increase ic ways for these cases.
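The idea can be illustrated with a toy factorization; the real LPGEMM logic is more involved, and the function name and splitting rule here are assumptions for the sketch:

```c
#include <assert.h>
#include <stddef.h>

/* Toy illustration: when m spans more than one MR block and the total
 * micro-tile count is below the thread count, assign more than one
 * way to the m (ic) dimension instead of leaving ic at 1. */
static void choose_ic_jc(size_t m, size_t n, size_t mr, size_t nr,
                         int nt, int *ic_ways, int *jc_ways)
{
    size_t mr_blks = (m + mr - 1) / mr;
    size_t nr_blks = (n + nr - 1) / nr;
    int ic = 1;
    if (mr_blks > 1 && mr_blks * nr_blks < (size_t)nt)
        ic = (int)mr_blks;        /* spread threads across m as well */
    *ic_ways = ic;
    *jc_ways = nt / ic;           /* remaining ways go to n */
}
```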

AMD-Internal: [SWLCSG-3333]

Change-Id: Ife3eafc282a2b62eb212af615edb7afa40d09ae9
2025-02-06 00:51:10 -05:00
Nallani Bhaskar
805bd10353 Updated all post-ops in u8s8s32 API to operate in float precision
Description:

1. Changed all post-ops in u8s8s32o<s32|s8|u8|f32|bf16> to operate
   on float data: the s32 accumulator registers are converted to
   float at the end of the k loop, and every post-op then operates
   on f32 values.

2. Added the u8s8s32ou8 API, which uses the u8s8s32os32 kernels but
   stores the output in u8.

AMD-Internal - SWLCSG-3366

Change-Id: Iab1db696d3c457fb06045cbd15ea496fd4b732a5
2025-01-29 04:21:17 -05:00
Deepak Negi
db407fd202 Added F32 bias type support, F32, BF16 output type support in int8 APIs
Description:

1. Added u8s8s32of32, u8s8s32obf16, s8s8s32of32 and s8s8s32obf16
   APIs, where the inputs are uint8/int8 and the processing is done
   using VNNI, but the output is stored in f32 and bf16 formats. All
   the int8 kernels are reused and updated with the new output data
   types.

2. Added F32 data type support in bias.

3. Updated the bench and bench input file to support validation.

AMD-Internal: SWLCSG-3335

Change-Id: Ibe2474b4b8188763a3bdb005a0084787c42a93dd
2025-01-26 11:38:30 -05:00
Mithun Mohan
39289858b7 Packed A matrix stride update to account for fringe cases.
-When the A matrix is packed, it is packed in blocks of MRxKC to form
a whole packed MCxKC block. If the m value is not a multiple of MR,
the m % MR block is packed in a different manner from the full MR
blocks. Consequently, the strides of the packed MR blocks and the
m % MR block differ, and the correct stride needs to be used when
calling the GEMV kernels with a packed A matrix.
-Fixes to address compiler warnings.
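The stride distinction described above can be sketched as a small helper; the helper name and the assumption that a panel of r rows occupies r*kc elements are illustrative, not the library's actual packing code:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: a full packed A panel of MR rows occupies MR * kc elements,
 * while the trailing m % MR fringe panel has its own smaller stride. */
static size_t packed_a_panel_stride(size_t rows_remaining, size_t mr,
                                    size_t kc)
{
    size_t rows = rows_remaining < mr ? rows_remaining : mr;
    return rows * kc;
}
```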

AMD-Internal: [SWLCSG-3359]
Change-Id: I7f47afbc9cd92536cb375431d74d9b8bca7bab44
2025-01-22 05:42:30 -05:00
Meghana Vankadari
69ca5dbcd6 Fixed compilation errors for gcc versions < 11.2
Details:
- Disabled the intrinsics code of the f32obf16 pack function
  for gcc < 11.2, as the instructions used in the kernels
  are not supported by those compiler versions.
- Added an early-return check for WOQ APIs when compiling with
  gcc < 11.2.
- Fixed code to check whether JIT kernels are generated inside the
  batch_gemm API for the bf16 datatype.

AMD Internal: [CPUPL-6327]

Change-Id: I0a017c67eb9d9d22a14e095e435dc397e265fb0a
2025-01-21 07:13:31 -05:00
Mithun Mohan
7a25505f5c Simulation of spread like pattern in worker thread to core binding in LPGEMM.
-In multi-threaded cases, if a packed/close-pattern thread-to-core
binding is used (e.g., OMP_PROC_BIND=close and OMP_PLACES=cores|threads),
LPGEMM (OMP framework) launches threads such that threads with adjacent
ids are bound to nearby (even adjacent) cores. Depending on the
processor architecture, multiple threads with adjacent ids can be bound
to cores sharing the same last-level cache. However, it was observed
that when these threads (with adjacent ids) access the B reorder
buffer, the last-level cache access was suboptimal. This can be
attributed to the per-thread reorder buffer block accesses and how
they map to the last-level cache.
-In these cases, m is small (<= 4 * MR) and the n value is such that
the number of NR blocks (n/NR) is less than the available threads nt
(like < 0.5 * nt). In such cases, the ids of the threads can be
modified so that the number of threads with adjacent ids bound to the
same last-level cache is reduced. This resembles the spread pattern
used in thread-to-core binding. It reduces the load on the last-level
cache due to reorder buffer accesses and improves performance in
these cases. A heuristic method is used to detect whether the
thread-to-core binding follows the close pattern
before applying the thread id modifications.
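The trigger condition for the remapping can be sketched as follows; the thresholds come from the commit message, while the function name is illustrative and the close-binding detection heuristic is omitted:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hedged sketch: apply the spread-like thread-id remapping only when
 * m is small (<= 4 * MR) and the NR-block count is well below the
 * thread count (< 0.5 * nt). */
static bool remap_tids_for_spread(size_t m, size_t n, size_t mr,
                                  size_t nr, int nt)
{
    return m <= 4 * mr && 2 * (n / nr) < (size_t)nt;
}
```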

AMD-Internal: [SWLCSG-3185]
Change-Id: Ie3c87d56e0f7b59161a381f382cf4e2d5d02a591
2025-01-10 06:02:06 -05:00
Meghana Vankadari
852cdc6a9a Implemented batch_matmul for f32 & int8 datatypes
Details:
- The batch matmul performs a series of matmuls, processing
  more than one GEMM problem at once.
- Introduced a new parameter called batch_size for the user
  to indicate the number of GEMM problems in a batch/group.
- This operation supports processing GEMM problems with
  different parameters, including dims, post-ops, storage schemes, etc.
- This operation is optimized for problems where all the
  GEMMs in a batch are of the same size and shape.
- For now, the threads are distributed equally among the different
  GEMM problems irrespective of their dimensions, which
  leads to better performance for batches with identical GEMMs
  but performs sub-optimally for batches with non-identical GEMMs.
- Optimizations for batches with non-identical GEMMs are in progress.
- Added bench and input files for batch_matmul.
- Added logger functionality for batch_matmul APIs.

AMD-Internal: [SWLCSG-2944]
Change-Id: I83e26c1f30a5dd5a31139f6706ac74be0aa6bd9a
2025-01-10 04:10:53 -05:00
Mithun Mohan
ef4286a97e Multi-data type buffer and scale support for matrix add|mul post-ops in s32 API.
-As it stands, the buffer type in the matrix add|mul post-ops is
expected to be the same as the output C matrix type. This limitation
is now removed, and the user can specify the buffer type by setting
the stor_type attribute in the add|mul post-op struct. As of now,
int8, int32, bfloat16 and float types are supported for the buffer in
the s32 micro-kernels. The same support is also added for the bf16
micro-kernels, with bfloat16 and float supported for now.
-Additionally, the values (from the buffer) were added/multiplied
as-is to the output registers while performing the matrix add|mul
post-ops. Support is added for scaling these values before using them
in the post-ops. Both scalar and vector scale_factors are supported.
-The bias_stor_type attribute is renamed to stor_type in the bias
post-ops.

AMD-Internal: [SWLCSG-3319]
Change-Id: I4046ab84481b02c55a71ebb7038e38aec840c0fa
2025-01-10 02:11:12 -05:00
Meghana Vankadari
051c9ac7a2 Bug fixes in F32 and INT8 APIs
Details:
- Fixed a few bugs in the downscale post-op for the f32 datatype.
- Fixed a bug in setting strides of packB buffer in
  int8 APIs.

Change-Id: Idb3019cc4593eace3bd5475dd1463dea32dbe75c
2025-01-09 04:07:26 -05:00
varshav
7b9d29f9b3 Adding post-ops for JIT kernels
- Added downscale, tanh and sigmoid post-op support to the JIT kernels.
 - Masked the bf16s4 kernel call while JIT kernels are enabled, to
   avoid a compile-time error.
 - Added optional support for B prefetch in the JIT kernels.
 - Resolved the visibility issues with the global variable
   jit_krnels_generated.
 - Modified the array generation for scale and zp values in the bench.

Change-Id: I09b8afc843f51ac23645e02f210a2c13d3af804d
2025-01-08 12:55:27 +00:00
Mithun Mohan
4a95f44d39 Buffer scale support for matrix add and matrix mul post-ops in bf16 API.
-Currently the values (from buffer) are added/multiplied as is to the
output registers while performing the matrix add/mul post-ops. Support
is added for scaling these values before using them in the post-ops.
Both scalar and vector scale_factors are supported.

AMD-Internal: [SWLCSG-3181]
Change-Id: Ifdb7160a1ea4f5ecccfa3ef31ecfed432898c14d
2025-01-08 10:35:50 +00:00
Meghana Vankadari
c9f0240679 Returning early for col-major inputs in u8s8s32os32|s8 APIs
Details:
- For u8s8s32os32|s8 APIs, A & B matrices are of different
  datatypes. Hence col-major inputs cannot be supported by
  swapping the matrices internally. Added a check to return
  early in such cases.

Change-Id: I99fbebe811c3d05310f30f7fc978f5084b5a51ba
2025-01-05 23:46:06 +05:30
Mithun Mohan
8d8a8e2f19 Light-weight logging framework for LPGEMM.
-A light-weight mechanism/framework to log input details and a
stringified version of the post-ops structure is added to LPGEMM.
Additionally the runtime of the API is also logged.
The logging framework logs to a file with filename following the format
aocl_gemm_log_<PID>_<TID>.txt.
-To enable this feature, the AOCL_LPGEMM_LOGGER_SUPPORT=1 macro needs to
be defined when compiling BLIS (with aocl_gemm addon enabled) by passing
CFLAGS="-DAOCL_LPGEMM_LOGGER_SUPPORT=1" to ./configure. Additionally
AOCL_ENABLE_LPGEMM_LOGGER=1 has to be exported in the environment during
LPGEMM runtime.

AMD-Internal: [SWLCSG-3280]
Change-Id: I30bfb35b2dc412df70044601b335938fc9f49cfb
2025-01-03 11:28:57 +00:00
Nallani Bhaskar
6cb1acf3c3 Fixed out-of-memory read access in bf16 reorder reference
Description:

The loop count was taken as 16 instead of n0_partial_rem in the
packb_nrlt16_bf16bf16f32of32_col_major_ref function.

Updated comments on reference reorder functionality.

AMD Internal: SWLCSG-3279

Change-Id: Idfc3b92906bc2b24651c7923e395fe10db56166b
2025-01-03 04:09:08 -05:00
Meghana Vankadari
bfc512d3e1 Implemented batch_gemm for bf16bf16f32of32|bf16
Details:
- The batch matmul performs a series of matmuls, processing
  more than one GEMM problem at once.
- Introduced a new parameter called batch_size for the user
  to indicate the number of GEMM problems in a batch/group.
- This operation supports processing GEMM problems with
  different parameters, including dims, post-ops, storage schemes, etc.
- This operation is optimized for problems where all the
  GEMMs in a batch are of the same size and shape.
- For now, the threads are distributed equally among the different
  GEMM problems irrespective of their dimensions, which
  leads to better performance for batches with identical GEMMs
  but performs sub-optimally for batches with non-identical GEMMs.
- Optimizations for batches with non-identical GEMMs are in progress.
- Added bench and input files for batch_matmul.

AMD-Internal: [SWLCSG-2944]
Change-Id: Idc59db5b8c5794bf19f6f86bcb8455cd2599c155
2025-01-03 03:28:32 -05:00
Nallani Bhaskar
40719e0438 Implemented reference function for bf16 reorder function
Description:

Implemented a reference version of the
aocl_gemm_reorder_bf16bf16f32of32 function
to make the code CPU-architecture independent.

AMD-Internal: [ SWLCSG-3279 ]

Change-Id: I0c715864c0ab3e5afea2ee6ee9546b75c3dbf9ec
2024-12-17 05:46:39 +00:00
Deepak Negi
615789e196 Fixed compilation issue with clang 18 on windows
Description
-In the enum AOCL_PARAMS_STORAGE_TYPES, the member FLOAT was declared,
 and the clang 18 compiler under MSVC raised a multiple-definition
 error. FLOAT and BFLOAT16 are replaced with AOCL_GEMM_<F32/BF16>.

AMD-Internal: CPUPL-6174

Change-Id: Ic061af068854d51629b82b495efd0eb54543f329
2024-12-12 06:37:06 -05:00
Deepak Negi
baeebe75c9 Support for standard AutoAWQ storage format.
Description:
1. AutoAWQ uses an int32 buffer to store eight 4-bit elements each in
   the order [0, 2, 4, 6, 1, 3, 5, 7].
2. Support is added to convert this format back to the original
   sequential order [0, 1, 2, 3, 4, 5, 6, 7] before reordering
   in the AWQ API.
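The de-interleave described above can be sketched as follows; the nibble ordering is from the commit message, while the helper name is illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: each int32 packs eight 4-bit values in the order
 * [0, 2, 4, 6, 1, 3, 5, 7]; expand them back to sequential
 * order [0..7]. Nibble i of the word holds logical element
 * awq_order[i]. */
static void awq_unpack_int4(uint32_t packed, uint8_t out[8])
{
    static const int awq_order[8] = { 0, 2, 4, 6, 1, 3, 5, 7 };
    for (int i = 0; i < 8; i++)
        out[awq_order[i]] = (packed >> (4 * i)) & 0xF;
}
```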

AMD-Internal: SWLCSG-3169

Change-Id: I5395766060c200ab81d0b8be94356678a169ac13
2024-12-02 04:02:27 -05:00
Meghana Vankadari
fbb72d047f Added group quantization and zero-point support for WOQ kernels
Description:

1. Added group quantization and zero-point (zp) support in the
   aocl_gemm_bf16s4f32o<bf16|f32> API.

2. Group quantization is a technique to improve accuracy
   where the scale factors used to dequantize weights vary at group
   level instead of per-channel or per-tensor level.

3. Added zp and scaling in the WOQ packb kernels so that for
   large M values, zp and scaling are performed at the pack-B
   stage and the bf16 kernels are called.

4. Adding zp support and scaling to the default path in the WOQ
   kernels created some performance overhead when the M value is
   very small.

5. Added a group_size string to the lpgemm bench to read the
   group size from bench_input.txt, and tested
   various combinations of matrix dimensions.

6. The scale factors can be of type float or bf16,
   and the zero-point values are expected to be
   in int8 format.

AMD-Internal: [SWLCSG-3168, SWLCSG-3172]

Change-Id: Iff07b54d76edc7408eb2ea0b29ce8b4a04a38f57
2024-12-02 06:46:13 +00:00
Deepak Negi
04ae01aeab Added support to specify bias data type in bf16 API's
Description:
1. The bias type was previously determined only by the output data
   type.
2. An option is added in the pre-ops structure to select the bias data
   type irrespective of the storage data type in the bf16 and WOQ APIs.


AMD-Internal: SWLCSG-3171


Change-Id: Iac10b946c2d4a5c405b2dc857362be0058615abf
2024-11-19 05:30:02 -05:00
Deepak Negi
b5c1b6055a Sigmoid and Tanh post-operation support for bf16 API.
Description:

Implemented sigmoid and tanh as fused post-ops in the
aocl_gemm_bf16bf16f32o<f32|bf16> APIs.

Sigmoid(x) = 1/(1+e^(-x))
Tanh(x) = (1-e^(-2x))/(1+e^(-2x))

Updated bench_lpgemm to recognize sigmoid and tanh
as post-op options from bench_input, and verified.

AMD-Internal: [SWLCSG-3178]

Change-Id: I78a3ba4a67ab63f9d671fbe315f977b016a0d969
2024-11-15 01:13:31 -04:00
Nallani Bhaskar
9735391e1d Implemented f32tobf16 reorder function
Description:
The aocl_reorder_f32obf16 function is implemented to
reorder an input weight matrix of data type float to
bfloat16.

The reordering is done to match the input requirements
of API aocl_gemm_bf16bf16f32o<f32|bf16>.

The objective of the API is to convert a model/matrix
of type f32 to bf16 and process it when the machine supports
the bf16 FMA instruction _mm512_dpbf16_ps but the model
is still in float.

Change-Id: Ib7c743d52d01a1ac09e84ac120577ec9e02f90f5
2024-11-04 04:32:01 +00:00
Mithun Mohan
097cda9f9e Adding support for AOCL_ENABLE_INSTRUCTIONS for f32 LPGEMM API.
-Currently, lpgemm sets the context (block sizes and micro-kernels)
based on the ISA of the machine it is executed on. However, this
approach does not provide the flexibility to select a different
context at runtime. To enable runtime selection of the context, the
context initialization is modified to read the
AOCL_ENABLE_INSTRUCTIONS environment variable and set the context
accordingly. As part of this commit, only f32 context selection is
enabled.
-Bug fixes in the scale ops in the f32 micro-kernels and in GEMV path
selection.
-Added vectorized f32 packing kernels for NR=16 (AVX2) and NR=64
(AVX512). This is only for the B matrix and helps remove the
dependency of the f32 lpgemm API on the BLIS packing framework.

AMD Internal: [CPUPL-5959]

Change-Id: I4b459aaf33c54423952f89905ba43cf119ce20f6
2024-10-30 08:52:22 +00:00
Meghana Vankadari
b04b8f22c9 Introduced un-reorder API for bf16bf16f32of32
Details:
- Added a new API called unreorder that converts a matrix from
  reordered format back to its original format (row-major or
  col-major).
- Currently this API only supports the bf16 datatype.
- Added a corresponding bench and input file to test the accuracy of
  the API.
- The new API is only supported for the 'B' matrix.
- Modified input validation checks in the reorder API to account for
  row vs. col storage of the matrix and transposes for the bf16
  datatype.

Change-Id: Ifb9c53b7e6da6f607939c164eb016e82514581b7
2024-10-23 07:49:24 -04:00
varshav2
605517964b Add Transpose Kernel for A matrix in F32F32f32Of32
- Implemented the AVX512 packA kernel for col-major inputs in the F32
   API.
 - Removed the work-arounds for the n = 1, mtag_a = PACK case, where
   the execution was being directed to GEMM instead of GEMV.

Change-Id: I6fb700d96069213a762e8a83a209c5388a91050f
2024-09-19 06:37:11 -04:00