182 Commits

Author SHA1 Message Date
Varaganti, Kiran
17702aedef Add compiler information to "make showconfig" and "bench_getlibraryInfo"
Display CC, CXX compiler information and version details to help users
identify which compiler was used to build BLIS.

Changes:
- Makefile: Add CC, CXX, and CFLAGS to 'make showconfig' output
- bench_getlibraryInfo.c: Add runtime compiler detection with version
  * Properly detect AOCC (AMD Optimizing C/C++ Compiler)
  * Detect Clang, GCC, Intel ICC/oneAPI, and MSVC
  * Show compiler version information
  * Detection order ensures AOCC is identified correctly (not as GCC)

The showconfig output now includes:
  CC (C compiler):             clang
  CXX (C++ compiler):          clang++
  CFLAGS:                      <flags>

The bench_getlibraryInfo program now shows:
  == Build Information ==
  C Compiler (CC):             AOCC x.x.x (Clang x.x.x based)
  C++ Compiler (CXX):          N/A (compiled as C)

This provides both build-time and runtime compiler information for
better build diagnostics and issue reporting.

Issue: CPUPL-7678
2025-12-01 21:14:25 +05:30
Rayan, Rohan
979b547876 Make all bench applications consistent (#189)
Many of the bench applications did not take n_repeats as a command-line argument (while others did).
Also adds the ability to skip the rest of the line (at the end) while reading bench inputs, for all APIs.

---------

Co-authored-by: Rayan <rohrayan@amd.com>
2025-11-13 07:42:53 +05:30
Varaganti, Kiran
807de2a990 DTL Log update
* DTL Log update
Updates logs with nt and the AOCL Dynamic selected nt for axpy, scal and dgemv.
Modified bench_gemv.c to be able to process the modified DTL logs.

* Updated DTL log for copy routine with actual nt and dynamic nt

* Refactor OpenMP pragmas and clean up code

Removed unnecessary nested OpenMP pragma and cleaned up function end comment.

* Fixed DTL log for sequential build

* Added thread logging in bla_gemv_check for invalid inputs

---------

Co-authored-by: Smyth, Edward <Edward.Smyth@amd.com>
2025-09-22 11:32:00 +05:30
Smyth, Edward
823ec2cd40 Add missing license text
Some files have copyright statements but not details of the license.
Add this to DTL source code and some build and benchmark related
scripts.

AMD-Internal: [CPUPL-6579]
2025-09-18 12:04:59 +01:00
Smyth, Edward
fb2a682725 Miscellaneous changes
- Change begin_asm and end_asm comments and unused code in files
     kernels/haswell/3/sup/s6x16/bli_gemmsup_rv_haswell_asm_sMx6.c
     kernels/zen4/3/sup/bli_gemmsup_cd_zen4_asm_z12x4m.c
  to avoid problems in clobber checking script.
- Add missing clobbers in files
     kernels/zen4/1m/bli_packm_zen4_asm_d24xk.c
     kernels/zen4/1m/bli_packm_zen4_asm_z12xk.c
     kernels/zen4/3/sup/bli_gemmsup_cv_zen4_asm_z12x4m.c
- Add missing newline at end of files.
- Update some copyright years for recent changes.
- Standardize license text formatting.

AMD-Internal: [CPUPL-6579]
2025-08-26 16:37:43 +01:00
Varaganti, Kiran
15d2e5c628 Bug Fix in tpsv - integer overflow (#153)
Bug Fix in tpsv and tpmv - integer overflow
When the BLAS integer size is 32 bits, as is the case for LP64 binaries,
the computation of "n * (n+1) / 2" overflows because the intermediate
product "n * (n + 1)" exceeds the 32-bit range. This overflow is now fixed.
Fix: Replace bla_integer with dim_t for the variables used to index the
packed matrix (AP) in the TPSV and TPMV functions. This prevents
overflow when computing kk = n*(n+1)/2 for large matrices.

In addition, a new file, bench_getlibraryInfo.c, was added under bench; it
prints all the information related to the BLIS binary.

- Fixes ctpsv_, dtpsv_, stpsv_, ztpsv_
- Fixes ctpmv_, dtpmv_, stpmv_, ztpmv_
- Maintains BLAS compatibility
2025-08-26 18:38:14 +05:30
V, Varsha
3df4aac2d2 Bugfix for A matrix packing in int8(S8/U8) APIs for Batch-Matmul
- The A matrix is not expected to be packed by default for the normal
 row-stored case, so the packing implementation is incomplete.
 - But if the user explicitly enables packing, the interface was not handling
 the condition appropriately, leading to data being overwritten inside the
 incomplete pack kernels and thereby to accuracy failures.
 - As a fix, updated the interface to set the explicit PACK A to UNPACKED and
 proceed with GEMM in cases where a transpose of A is not necessary.
 - Updated the batch gemm input file with additional test cases covering all
 the APIs.
Bug Fixes:
 - Fixed the implementation logic for column-major inputs with post-ops
 disabled in S8 batch mat-mul. With the existing implementation, column-major
 inputs would not be executed in the case of of32/os32 inputs.
 - Fixed the Scale/ZP calculation in bench for the u8s8s32ou8 condition, which
 was leading to accuracy failures.

[AMD-Internal: CPUPL-7283 ]
2025-08-26 16:46:37 +05:30
Vlachopoulou, Eleni
1f8a7d2218 Renaming CMAKE_SOURCE_DIR to PROJECT_SOURCE_DIR so that BLIS can be built properly via FetchContent() (#65) 2025-08-07 15:51:59 +01:00
Smyth, Edward
563b161933 Standardize Python files to use Python 3
Python 2 is no longer maintained, and using python3 avoids accidental invocation of outdated interpreters.

AMD-Internal: [CPUPL-6579]
2025-08-06 12:04:26 +01:00
V, Varsha
68d47281df Fixing some copying bugs in Batch-Matmul code
- Removed duplicate calls to BATCH_GEMM_CHECK().
 - Refactored freeing of post-op pointer in bench code and verified the
    functionality.
 - Modified indexing of the array to take the correct values.
2025-08-01 18:42:10 +05:30
Rayan, Rohan
1bb1160061 Fixing bench overflow in GFLOPS computation
Fixed a bug in some bench applications where the GFLOPS computation ran into integer overflow because explicit casting to double was not done in the computation.

Removed all multiplications by 1.0 in the GFLOPS computation.

AMD-Internal: CPUPL-7016

---------
Co-authored-by: Rayan <rohrayan@amd.com>
2025-08-01 11:08:53 +05:30
Bhaskar, Nallani
9b02201b5b Updated Poly 16 in Gelu Erf to double precision
Updated the poly Gelu Erf precision to double to keep the error within the 1e-5 limit when compared to the reference gelu_erf; this also increased the compute cost to 2x compared to float.

AMD-Internal: SWLCSG-3551
2025-07-07 14:05:40 +05:30
V, Varsha
1f9d1a85d3 Updated aocl_batch_gemm_ APIs aligning to CBLAS batch API. (#58)
* Updated aocl_batch_gemm_ APIs aligning to CBLAS batch API.

 - Modified Batch-Gemm API to align with cblas_?gemm_batch_ API,
 and added a parameter group_size to the existing APIs.
 - Updated bench batch_gemm code to align to the new API definition.
 - Modified the hardcoded number in lpgemm_postop file.
 - Added necessary early return condition to account for group_count/group_size < 0.

AMD-Internal: [ SWLCSG - 3592 ]
2025-06-30 11:16:04 +05:30
V, Varsha
e05d24315e Bug Fixes for Accuracy issues in Int8 API (#62)
- In the U8 GEMV n=1 kernels, the default zp condition was the S8 ZP type,
 which leads to accuracy issues when the u8s8s32ou8 API is used.
 - A few modifications in the bench code to take the correct path for
 the accuracy check.
2025-06-25 17:01:22 +05:30
Vankadari, Meghana
8649cdc14b Removed unnecessary pack checks in FP32 GEMV (#54)
Details:
- In FP32 GEMM, when threading is disabled, rntm_pack_a and rntm_pack_b
  were set to true by default. This leads to perf regression for smaller
  sizes. Modified FP32 interface API to not overwrite the packA and
  packB variables in rntm structure.
- In FP32 GEMV, removed the decision-making code based on mtag_A/B
  and should_pack_A/B for packing. Matrices will be packed only
  if their storage format doesn't match the storage
  format required by the kernel.
- Changed the control flow to check whether the mtag value is
  "reordered" first, followed by "to-be-packed" and "unpacked". This
  ensures that packing doesn't happen when the matrix is already
  reordered, even if the user forces packing by setting "BLIS_PACK_A/B".
- Modified the python script to generate testcases based on block sizes

AMD-Internal: SWLCSG-3527
2025-06-16 12:34:11 +05:30
Negi, Deepak
ffd7c5c3e0 Postop support for Static Quant and Integer APIs (#20)
Support for S32 Zero point type is added for aocl_gemm_s8s8s32os32_sym_quant
Support for BF16 scale factors type is added for aocl_gemm_s8s8s32os32_sym_quant
U8 buffer type support is added for matadd, matmul, bias post-ops in all int8 APIs.

AMD-Internal: SWLCSG-3503
2025-05-27 16:29:32 +05:30
Vlachopoulou, Eleni
fac4a97118 Adding flags when building bench with CMake (#9)
Since the gnu extensions were removed, executables in the bench directory cannot be built correctly.
The fix is adding "-D_POSIX_C_SOURCE=200112L" on those targets. When -std=gnu99 was used,
bench worked without this flag, but that has not been the case since we switched to -std=c99.
2025-05-16 11:47:39 +01:00
Deepak Negi
f76f37cc11 Bug Fix in F32 eltwise Api with post ops(clip, swish, relu_scale).
Description
1. In the cases of clip, swish, and relu_scale, the constants are currently
   loaded as float. However, they are of the C type, so the handling has been
   adjusted: for integer types these constants are first loaded as integers
   and then converted to float.

Change-Id: I176b805b69679df42be5745b6306f75e23de274d
2025-04-11 08:14:34 -04:00
varshav2
b9998a1d7f Added multiple ZP type checks in INT8 APIs
- Currently the int8/uint8 APIs do not support multiple ZP types,
   but work only with the int8 type or the uint8 type.
 - Support is added to enable multiple ZP types in these kernels,
   and additional macros were added to support the operations.
 - Modified the bench downscale reference code to support the updated
   types.

AMD-Internal : [ SWLCSG-3304 ]

Change-Id: Ia5e40ee3705a38d09262086d20731e8f0a126987
2025-04-11 07:25:10 -04:00
Arnav Sharma
c68c258fad Added AVX512 and AVX2 FP32 RD Kernels
- Added FP32 RD (dot-product) kernels for both the AVX512 and AVX2 ISAs.
- The FP32 AVX512 primary RD kernel has a blocking of 6x64 (MRxNR),
  whereas it is 6x16 (MRxNR) for the AVX2 primary RD kernel.
- Updated the f32 framework to accommodate RD kernels in the case of
  B transpose, with thresholds.
- Updated the data gen python script
TODO:
    - Post-Ops not yet supported.

Change-Id: Ibf282741f58a1446321273d5b8044db993f23714
2025-04-05 20:16:51 -05:00
varshav
81d219e3f8 Added destination scale type check in INT8 API's
- Updated the S8 main, GEMV, m_, n_ and mn_ fringe kernels to support
   multiple scale types for vector and scalar scales

 - Updated the U8 main, GEMV, m_, n_, extMR_ and mn_ fringe kernels to
   support multiple scale types for vector and scalar scales

 - Updated the bench to accommodate multiple scale type input, and
   modified the downscale_accuracy_check_ to verify with multiple scale
   type inputs.

AMD Internal: [ SWLCSG-3304 ]

Change-Id: I7b9f3ec8ea830d3265f72d18a0aa36086e14a86e
2025-03-28 00:51:17 -05:00
Meghana Vankadari
1da554a9e5 Bug fix in post-op creator of bench application
Details:
- Setting post_op_grp to NULL at the start of the post-op
  creator to ensure that there is no junk (non-NULL) value,
  which might lead to the destroyer trying to free
  non-allocated buffers.

AMD-Internal: [SWLCSG-3274]
Change-Id: I45a54d01f0d128d072d5d9c7e66fc08412c7c79c
2025-03-03 07:30:10 +00:00
Meghana Vankadari
7243a5d521 Implemented group level static quantization for s8s8s32of32|bf16 APIs
Details:
- Group quantization is a technique to improve accuracy
  where the scale factors used to quantize inputs and weights
  vary at the group level instead of at the per-channel
  and per-tensor level.
- Added new bench files to test GEMM with symmetric static
  quantization.
- Added new get_size and reorder functions to account for
  storing sum of col-values separately per group.
- Added new framework, kernels to support the same.
- The scalefactors could be of type float or bf16.

AMD-Internal:[SWLCSG-3274]

Change-Id: I3e69ecd56faa2679a4f084031d35ffb76556230f
2025-02-28 04:44:44 -05:00
Meghana Vankadari
6c29236166 Bug fixes in bench and pack code for s8 and bf16 datatypes
Details:
- Fixed the logic to identify an API that has int4 weights in
  the bench files for gemm and batch_gemm.
- Eliminated the memcpy instructions used in the pack functions of
  the zen4 kernels and replaced them with a masked load instruction.
  This ensures that the load register is populated with
  zeroes at the locations where the mask is set to zero.

Change-Id: I8dd1ea7779c8295b7b4adec82069e80c6493155e
AMD-Internal:[SWLCSG-3274]
2025-02-28 01:18:11 -05:00
Mithun Mohan
9906fd7b91 F32 eltwise kernel updates to use masks in scale factor load.
-Currently the scale factor is loaded without using a mask in the downscale
and matrix add/mul ops in the F32 eltwise kernels. This results in
out-of-bounds memory reads when n is not a multiple of NR (64).
-The loads are updated to masked loads to fix this.

AMD-Internal: [SWLCSG-3390]

Change-Id: Ib2fc555555861800c591344dc28ac0e3f63fd7cb
2025-02-27 08:17:58 -05:00
Deepak Negi
cc321fb95d Added support for different types of zero-point in f32 eltwise APIs.
Description
 - Zero point support for the <s32/s8/bf16/u8> datatypes in the
   element-wise-postop-only f32o<f32/s8/u8/s32/bf16> APIs.

 AMD-Internal: [SWLCSG-3390]

Change-Id: I2fdb308b05c1393013294df7d8a03cdcd7978379
2025-02-26 04:04:13 -05:00
Deepak Negi
c813bfa609 Fix zero point datatype issue.
Description
 Due to differing datatypes for the zero point during post-op creation
 and the accuracy check, we see an accuracy issue for the u8/s8s8s32 APIs
 with output type f32/bf16.

 AMD-Internal: [CPUPL-6456]

Change-Id: If8925988841af87cb5687c84aade607967c744fe
2025-02-24 04:40:13 -05:00
Meghana Vankadari
17634d7ae8 Fixed compiler errors and warning for gcc < 11.2
Description:

1. When the gcc compiler version is less than 11.2, a few BF16 instructions
   are not supported by the compiler, even though the zen4 and zen5
   processor architectures support them.

2. These instructions are now guarded with a macro.


Change-Id: Ib07d41ff73d8fe14937af411843286c0e80c4131
2025-02-13 10:18:13 -05:00
varshav2
ef04388a44 Added AVX2 support for BF16 kernels: Row major
- Currently the BF16 kernels use the AVX512 VNNI instructions.
   To support AVX2 kernels, the BF16 input has to be converted
   to F32 and then the F32 kernels have to be executed.
 - Added an unpack function for the B matrix, which unpacks
   the reordered BF16 B matrix and converts it to float.
 - Added a kernel to convert matrix data from BF16 to F32 for the
   given input.
 - Added a new path to the BF16 5LOOP to work with BF16 data, where
   the packed/unpacked A matrix is converted from BF16 to F32, the
   packed B matrix is converted from BF16 to F32, and the reordered B
   matrix is un-reordered and converted to F32 before being fed to the
   F32 micro kernels.
 - Removed AVX512 condition checks in the BF16 code path.
 - Added the reorder reference code path to support BF16 on AVX2.
 - Currently the F32 AVX2 kernels support only F32 BIAS.
   Added BF16 support for the BIAS post-op in the F32 AVX2 kernels.
 - Bug fix in the test input generation script.

AMD Internal : [SWLCSG - 3281]

Change-Id: I1f9d59bfae4d874bf9fdab9bcfec5da91eadb0fb
2025-02-10 08:18:52 -05:00
Meghana Vankadari
da3d0c6034 Added new Int8 batch_gemm APIs
Details:
- Added u8s8s32of32|bf16|u8 batch_gemm APIs.
- Fixed some bugs in bench file for bf16 API.

Change-Id: I55380238869350a848f2deec0641d7b9b416b192
2025-02-10 11:19:02 +00:00
Deepak Negi
3a7523b51b Element wise post-op APIs are upgraded with new post-ops
Description:

1. Added new output types for the f32 element-wise APIs to support
   s8, u8, s32, bf16 outputs.

2. Updated the base f32 API to support all the post-ops supported in
   the gemm APIs.

AMD Internal: [SWLCSG-3384]

Change-Id: I1a7caac76876ddc5a121840b4e585ded37ca81e8
2025-02-10 01:06:39 -05:00
Edward Smyth
1f0fb05277 Code cleanup: Copyright notices (2)
More changes to standardize copyright formatting and correct years
for some files modified in recent commits.

AMD-Internal: [CPUPL-5895]
Change-Id: Ie95d599710c1e0605f14bbf71467ca5f5352af12
2025-02-07 05:41:44 -05:00
Mithun Mohan
b9f6286731 Tiny GEMM path for BF16 LPGEMM API.
-Currently the BF16 API uses the 5-loop algorithm inside the OMP loop
to compute the results, irrespective of the input sizes. However, it
was observed that for very tiny sizes (n <= 128, m <= 36), this OMP
loop and the NC, MC, KC loops were turning out to be overheads.
-To address this, a new path without the OMP loop, with just the
NR loop over the micro-kernel, is introduced for tiny inputs. This is
only applied when the number of threads set for GEMM is 1.
-Only row-major inputs are allowed to proceed with tiny GEMM.

AMD-Internal: [SWLCSG-3380, SWLCSG-3258]

Change-Id: I9dfa6b130f3c597ca7fcf5f1bc1231faf39de031
2025-02-07 04:37:11 -05:00
Meghana Vankadari
c47f0f499f Fixed bug in testing matrix_mul post_op
Details:
- Added a new python script that can test all microkernels
  along with post-ops.
- Modified post_op freeing function to avoid memory leaks.

Change-Id: Iedba84e8233a88ca9261596c4c7e0a65c196b7e7
2025-02-07 02:27:14 +05:30
Deepak Negi
86e52783e4 Tiny GEMM path for F32 LPGEMM API.
-Currently the F32 API uses the 5-loop algorithm inside the OMP loop
to compute the results, irrespective of the input sizes. However, it
was observed that for very tiny sizes (n <= 128, m <= 36), this OMP
loop and the NC, MC, KC loops were turning out to be overheads.
-To address this, a new path without the OMP loop, with just the
NR loop over the micro-kernel, is introduced for tiny inputs. This is
only applied when the number of threads set for GEMM is 1.

AMD-Internal: [SWLCSG-3380]

Change-Id: Ia712a0df19206b57efe4c97e9764d4b37ad7e275
2025-02-06 23:36:44 -05:00
Vignesh Balasubramanian
8abb37a0ad Update to AOCL-BLAS bench application for logging outputs
- Updated the format specifiers to have a leading space,
  in order to delimit the outputs appropriately in the
  output file.

- Further updated every source file to have a leading space
  in its format string occurring after the macros.

AMD-Internal: [CPUPL-5895]
Change-Id: If856f55363bb811de0be6fdd1d7bbc8ec5c76c15
2025-02-06 22:59:59 +05:30
Deepak Negi
2e687d8847 Updated all post-ops in s8s8s32 API to operate in float precision
Description:

1. Changed all post-ops in s8s8s32o<s32|s8|u8|f32|bf16> to operate
   on float data: the s32 accumulator registers are converted to float
   at the end of the k loop, and all post-ops then operate on f32.

2. Added s8s8s32ou8 API which uses s8s8s32os32 kernels but store
   the output in u8

AMD-Internal - SWLCSG-3366

Change-Id: Iadfd9bfb98fc3bf21e675acb95553fe967b806a6
2025-02-06 07:31:28 -05:00
Meghana Vankadari
13e7ada3f2 Modified bench to test different types of post-ops
- Modified bench to support testing of different types of buffers
  for bias, mat_add and mat_mul postops.
- Added support for testing integer APIs with float accumulation
  type.

Change-Id: I72364e9ad25e6148042b93ec6d152ff82ea03e96
2025-02-06 02:38:08 +05:30
Mithun Mohan
0701a4388a Thread factorization improvements (ic ways) for BF16 LPGEMM API.
-Currently, when m is small compared to n, even if the MR blocks (m / MR) > 1
and the total work blocks (MR blocks * NR blocks) < available threads, the
number of threads assigned to the m dimension (ic ways) is 1. This results
in sub-par performance in bandwidth-bound cases. To address this, the
thread factorization is updated to increase the ic ways for these cases.

AMD-Internal: [SWLCSG-3333]

Change-Id: Ife3eafc282a2b62eb212af615edb7afa40d09ae9
2025-02-06 00:51:10 -05:00
Hari Govind S
67322416d3 Added support to benchmark ASUMV APIs
- Implemented the feature to benchmark the ?ASUMV APIs
  for the supported datatypes. The feature allows
  benchmarking the BLAS, CBLAS or the native BLIS API, based
  on the macro definition.

- Added a sample input file to provide examples to benchmark
  ASUMV for all its datatype supports.

AMD-Internal: [CPUPL-5984]
Change-Id: Iff512166545687d12504babda1bd52d71a3a5755
2025-01-31 06:04:16 -05:00
Vignesh Balasubramanian
0e71d28c01 Additional bug-fix for AOCL-BLAS bench
- Corrected the format specifier setting (as a macro) to not
  include additional spaces, since this would cause incorrect
  parsing of input files (in case they have exactly the expected
  number of parameters and not more).

- Updated the inputgemm.txt file to contain some inputs that
  have the exact parameters, to validate this fix.

AMD-Internal: [CPUPL-6365]
Change-Id: Ie9a83d4ed7e750ff1380d00c9c182b0c9ed42c49
2025-01-30 08:28:14 -05:00
Nallani Bhaskar
805bd10353 Updated all post-ops in u8s8s32 API to operate in float precision
Description:

1. Changed all post-ops in u8s8s32o<s32|s8|u8|f32|bf16> to operate
   on float data: the s32 accumulator registers are converted to float
   at the end of the k loop, and all post-ops then operate on f32.

2. Added u8s8s32ou8 API which uses u8s8s32os32 kernels but store
   the output in u8

AMD-Internal - SWLCSG-3366

Change-Id: Iab1db696d3c457fb06045cbd15ea496fd4b732a5
2025-01-29 04:21:17 -05:00
Vignesh Balasubramanian
445327f255 Bugfix for AOCL-BLAS bench application
- Bug : When configuring our library with the native
        BLIS integer size being 32, the bench application
        would crash or read an invalid value when parsing
        the input file. This is because of a mismatch
        of the format specifier, which we hard-set in the
        Makefile.

- Fix : Defined a header that sets the format specifiers
        as macros with the right matching, based on how we
        configure and build the library. This header is
        expected to be included in every source file used
        for benchmarking.

AMD-Internal: [CPUPL-5895]
Change-Id: I9718c36a1a9fe3eba4d5da419823c16097902d89
2025-01-29 03:25:57 -05:00
Deepak Negi
db407fd202 Added F32 bias type support, F32, BF16 output type support in int8 APIs
Description:

1. Added u8s8s32of32, u8s8s32obf16, s8s8s32of32 and s8s8s32obf16 APIs,
   where the inputs are uint8/int8 and the processing is done using
   VNNI, but the output is stored in f32 and bf16 formats. All the int8
   kernels are reused and updated with the new output data types.

2. Added F32 data type support in bias.

3. Updated the bench and bench input file to support validation.

AMD-Internal: SWLCSG-3335

Change-Id: Ibe2474b4b8188763a3bdb005a0084787c42a93dd
2025-01-26 11:38:30 -05:00
Mithun Mohan
39289858b7 Packed A matrix stride update to account for fringe cases.
-When the A matrix is packed, it is packed in blocks of MRxKC, to form a
whole packed MCxKC block. If the m value is not a multiple of MR, then
the m % MR block is packed in a different manner from the MR
blocks. Consequently, the strides of the packed MR block and the m % MR
block differ, and this needs to be accounted for when calling the
GEMV kernels with a packed A matrix.
-Fixes to address compiler warnings.

AMD-Internal: [SWLCSG-3359]
Change-Id: I7f47afbc9cd92536cb375431d74d9b8bca7bab44
2025-01-22 05:42:30 -05:00
Meghana Vankadari
69ca5dbcd6 Fixed compilation errors for gcc versions < 11.2
Details:
- Disabled the intrinsics code of the f32obf16 pack function
  for gcc < 11.2, as the instructions used in the kernels
  are not supported by those compiler versions.
- Added an early-return check for WOQ APIs when compiling with
  gcc < 11.2.
- Fixed the code that checks whether JIT kernels are generated inside
  the batch_gemm API for the bf16 datatype.

AMD Internal: [CPUPL-6327]

Change-Id: I0a017c67eb9d9d22a14e095e435dc397e265fb0a
2025-01-21 07:13:31 -05:00
Deepak Negi
182a6373b5 Added support to specify bias data type in u8s8s32/s8s8s32 API's
Description:
1. The bias type was previously selected based only on the output data type.
2. An option is added in the pre-ops structure to select the bias data
   type (s8/s32/bf16) irrespective of the storage data type in the
   u8s8s32/s8s8s32 APIs.

AMD-Internal: SWLCSG-3302

Change-Id: I3c465fe428672d2d58c1c60115c46d2d5b11f0f4
2025-01-15 05:56:26 -05:00
Meghana Vankadari
852cdc6a9a Implemented batch_matmul for f32 & int8 datatypes
Details:
- The batch matmul performs a series of matmuls, processing
  more than one GEMM problem at once.
- Introduced a new parameter called batch_size for the user
  to indicate the number of GEMM problems in a batch/group.
- This operation supports processing GEMM problems with
  different parameters, including dims, post-ops, storage schemes, etc.
- This operation is optimized for problems where all the
  GEMMs in a batch are of the same size and shape.
- For now, the threads are distributed equally among the GEMM
  problems irrespective of their dimensions, which
  leads to better performance for batches with identical GEMMs
  but performs sub-optimally for batches with non-identical GEMMs.
- Optimizations for batches with non-identical GEMMs are in progress.
- Added bench and input files for batch_matmul.
- Added logger functionality for the batch_matmul APIs.

AMD-Internal: [SWLCSG-2944]
Change-Id: I83e26c1f30a5dd5a31139f6706ac74be0aa6bd9a
2025-01-10 04:10:53 -05:00
Mithun Mohan
ef4286a97e Multi-data type buffer and scale support for matrix add|mul post-ops in s32 API.
-As it stands, the buffer type in the matrix add|mul post-ops is expected to
be the same as the output C matrix type. This limitation is now
removed, and the user can specify the buffer type by setting the stor_type
attribute in the add|mul post-op struct. As of now, the int8, int32, bfloat16
and float types are supported for the buffer in the s32 micro-kernels. The same
support is also added for the bf16 micro-kernels, with bfloat16 and float
supported for now.
-Additionally, the values (from the buffer) were added/multiplied as-is to the
output registers while performing the matrix add|mul post-ops. Support
is added for scaling these values before using them in the post-ops.
Both scalar and vector scale_factors are supported.
-The bias_stor_type attribute is renamed to stor_type in the bias post-ops.

AMD-Internal: [SWLCSG-3319]
Change-Id: I4046ab84481b02c55a71ebb7038e38aec840c0fa
2025-01-10 02:11:12 -05:00
varshav
7b9d29f9b3 Adding post-ops for JIT kernels
- Added downscale, tanh and sigmoid post-op support to the JIT kernels.
 - Masked the bf16s4 kernel call while JIT kernels are enabled, to avoid a compile-time error.
 - Added optional support for B-prefetch in the JIT kernels.
 - Resolved the visibility issues in the global variable jit_krnels_generated.
 - Modified the array generation for scale and zp values in the bench.

Change-Id: I09b8afc843f51ac23645e02f210a2c13d3af804d
2025-01-08 12:55:27 +00:00