3719 Commits

Author SHA1 Message Date
harsh dave
a359a25765 Fix typo in 24x8m DGEMM sup kernel causing incorrect result.
- Corrected a typo in dgemm kernel implementation, beta=0 and
  n_left=6 edge kernel.

Thanks to Shubham Sharma <shubham.sharma3@amd.com> for helping with debugging.

AMD-Internal: [CPUPL-6443]
Change-Id: Ifa1e16ec544b7e85c21651bc23c4c27e86d6730b
2025-03-07 04:43:17 -05:00
Vignesh Balasubramanian
c4b84601da AVX512 optimizations for CGEMM(rank-1 kernel)
- Implemented an AVX512 rank-1 kernel that is
  expected to handle column-major storage schemes
  of A, B and C(without transposition) when k = 1.

- This kernel is single-threaded, and acts as a direct
  call from the BLAS layer for its compatible inputs.

- Defined custom BLAS and BLIS_IMPLI layers for CGEMM
  (instead of using the macro definition), in order to
  integrate the call to this kernel at runtime(based on
  the corresponding architecture and input constraints).

- Added unit-tests for functional and memory testing of the
  kernel.

- Updated the ZEN5 context to include the AVX512 CGEMM
  SUP kernels, with its cache-blocking parameters.

AMD-Internal: [CPUPL-6498]
Change-Id: I42a66c424325bd117ceb38970726a05e2896a46b
2025-03-06 20:14:05 +05:30
Vignesh Balasubramanian
07df9f471e AVX512 optimizations for CGEMM(SUP)
- Implemented the following AVX512 SUP
  column-preferential kernels(m-variant) for CGEMM :
  Main kernel    : 24x4m
  Fringe kernels : 24x3m, 24x2m, 24x1m,
                   16x4, 16x3, 16x2, 16x1,
                   8x4, 8x3, 8x2, 8x1,
                   fx4, fx3, fx2, fx1(where 0<f<8).

- Utilized the packing kernel to pack A when
  handling inputs with CRC storage scheme. This
  would in turn handle RRC with operation transpose
  in the framework layer.

- Further added C prefetching to the main kernel,
  and updated the cache-blocking parameters for the
  ZEN4 and ZEN5 contexts.

- Added a set of decision logics to choose between
  SUP and Native AVX512 code-paths for ZEN4 and ZEN5
  architectures.

- Updated the testing interface for complex GEMMSUP
  to accept the kernel dimension(MR) as a parameter, in
  order to set the appropriate panel stride for functional
  and memory testing. Also updated the existing instantiators
  to send their kernel dimensions as a parameter.

- Added unit tests for functional and memory testing of these
  newly added kernels.

AMD-Internal: [CPUPL-6498]

Change-Id: Ie79d3d0dc7eed7edf30d8d4f74b888135f31d6b4
2025-03-06 06:03:39 -05:00
Hari Govind S
8998839c71 Optimisation of DGEMV Transpose Case for unit stride
- Included a new code section to handle inputs having a non-unit strided y
  vector for the dgemv transpose case. Removed the same from the respective
  kernels to avoid repeated branching caused by condition checks within
  the 'for' loop.

- The condition check for beta equal to zero in the primary kernels
  is moved outside the 'for' loop to avoid repeated branching.

- The '_mm512_reduce_pd' operations in the primary kernel are replaced by
  a series of operations that reduce the number of instructions required
  to reduce the 8 registers.

- Changed the naming convention for DGEMV transpose kernels.

- Modified the unit kernel test to avoid the y increment for dgemv transpose
  kernels during the test.

AMD-Internal: [CPUPL-6565]
Change-Id: I1ac516d6b8f156ac53ac9f6eb18badd50e152e05
2025-03-06 05:15:58 -05:00
jagar
0d7ff830ea CMake: removing gnu extensions for both C and C++
Updated CMakeLists.txt to remove GNU extensions for both C and C++.
Builds now use -std=c99 instead of -std=gnu99.
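For reference, the equivalent CMake settings look like this (illustrative fragment; the actual CMakeLists.txt changes may differ):

```cmake
# Request the C99 standard without GNU extensions, so gcc is driven
# with -std=c99 rather than -std=gnu99; same for C++.
set(CMAKE_C_STANDARD 99)
set(CMAKE_C_EXTENSIONS OFF)
set(CMAKE_CXX_EXTENSIONS OFF)
```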

Signed-off-by: Jagadish R <jagadish1.r@amd.com>
AMD-Internal: [CPUPL-6553]
Change-Id: I98150707990112c5736660d287f1ddbe71a4e8e6
2025-03-05 06:33:31 -05:00
varshav
4d22451fbb Bug Fix in BF16 Re-order/unreorder with AOCL_ENABLE_INSTRUCTIONS
- Currently, the bf16 reorder function does not add padding for
   n=1 cases, but the bf16 AVX2 unreorder path expects the input
   re-ordered B matrix to be padded along the n and k dimensions.
 - Hence, modified the conditions to make sure the path doesn't break
   when the AVX2 kernels are executed on AVX512 machines with a
   reordered B matrix.

Change-Id: I7dd3d37a24758a8e93e80945b533abfcf15f65a1
2025-03-05 06:31:19 +00:00
Mithun Mohan
37d590e53f Tid spread threshold update in LPGEMM thread decorator.
-Currently the Tid spread does not happen for n=4096 even if there
are threads available to facilitate it. Updated the threshold
to account for this.

AMD-Internal: [SWLCSG-3185]
Change-Id: I281b1639c32ba2145bd84062324f1f11b1167eeb
2025-03-04 10:53:51 +00:00
Meghana Vankadari
1da554a9e5 Bug fix in post-op creator of bench application
Details:
- Setting post_op_grp to NULL at the start of the post-op
  creator to ensure that there is no junk (non-NULL) value,
  which might lead to the destroyer trying to free
  non-allocated buffers.

AMD-Internal: [SWLCSG-3274]
Change-Id: I45a54d01f0d128d072d5d9c7e66fc08412c7c79c
2025-03-03 07:30:10 +00:00
Meghana Vankadari
7243a5d521 Implemented group level static quantization for s8s8s32of32|bf16 APIs
Details:
- Group quantization is a technique to improve accuracy
  where the scale factors used to quantize inputs and weights
  vary at group level instead of at per-channel
  and per-tensor level.
- Added new bench files to test GEMM with symmetric static
  quantization.
- Added new get_size and reorder functions to account for
  storing sum of col-values separately per group.
- Added new framework, kernels to support the same.
- The scale factors could be of type float or bf16.

AMD-Internal:[SWLCSG-3274]

Change-Id: I3e69ecd56faa2679a4f084031d35ffb76556230f
AOCL-Mar2025-b1
2025-02-28 04:44:44 -05:00
Vignesh Balasubramanian
99770558bb AVX512 optimizations for CGEMM(Native)
- Implemented the following AVX512 native
  computational kernels for CGEMM :
  Row-preferential    : 4x24
  Column-preferential : 24x4

- The implementations use a common set of macros,
  defined in a separate header. This is due to the
  fact that the implementations differ solely on
  the matrix chosen for load/broadcast operations.

- Added the associated AVX512 based packing kernels,
  packing 24xk and 4xk panels of input.

- Registered the column-preferential kernel(24x4) in
  ZEN4 and ZEN5 contexts. Further updated the cache-blocking
  parameters.

- Removed redundant BLIS object creation and its contingencies
  in the native micro-kernel testing interface(for complex types).
  Added the required unit-tests for memory and functionality
  checks of the new kernels.

AMD-Internal: [CPUPL-6498]
Change-Id: I520ff17dba4c2f9bc277bf33ba9ab4384408ffe1
2025-02-28 03:18:24 -05:00
Meghana Vankadari
6c29236166 Bug fixes in bench and pack code for s8 and bf16 datatypes
Details:
- Fixed the logic to identify an API that has int4 weights in
  bench files for gemm and batch_gemm.
- Eliminated the memcpy instructions used in pack functions of
  zen4 kernels and replaced them with masked load instruction.
  This ensures that the load register will be populated with
  zeroes at locations where mask is set to zero.

Change-Id: I8dd1ea7779c8295b7b4adec82069e80c6493155e
AMD-Internal:[SWLCSG-3274]
2025-02-28 01:18:11 -05:00
Arnav Sharma
b4c1026ec2 Added Support for General Stride in DGEMV
- Updated the bli_dgemv_zen_ref( ... ) kernel to support general stride.
- Since the latest dgemv kernels don't support general stride, added
  checks to invoke bli_dgemv_zen_ref( ... ) when A matrix has a general
  stride.
- Thanks to Vignesh Balasubramanian <vignesh.balasubramanian@amd.com>
  for finding this issue.

AMD-Internal: [CPUPL-6492]
Change-Id: Ia987ce7674cb26cb32eea4a6e9bd6623f2027328
2025-02-27 12:47:21 -05:00
Shubham Sharma
e6ca01c1ba Fixed C prefetch in 8x24 DGEMM kernel
- In 8x24 DGEMM kernel, prefetch is always done assuming
  row major C.
- For TRSM, the DGEMM kernel can be called with column major C also.
- Current prefetch logic results in suboptimal performance.
- Changed C prefetch logic so that correct C is prefetched for both row
  and column major C.

 AMD-Internal: [CPUPL-6493]

Change-Id: I7c732ceac54d1056159b3749544c5380340aacd2
2025-02-27 12:17:29 -05:00
Mithun Mohan
9906fd7b91 F32 eltwise kernel updates to use masks in scale factor load.
-Currently the scale factor is loaded without a mask in the downscale
and matrix add/mul ops in the F32 eltwise kernels. This results in
out-of-bounds reads when n is not a multiple of NR (64).
-The loads are updated to masked loads to fix this.

AMD-Internal: [SWLCSG-3390]

Change-Id: Ib2fc555555861800c591344dc28ac0e3f63fd7cb
2025-02-27 08:17:58 -05:00
Nallani Bhaskar
0e6b562711 Implemented s8 unreorder reference API
Description:
1. Implemented the s8 unreorder API function, which un-reorders
   a reordered int8 matrix.
2. Removed the bf16vnni check for the bf16 unreorder reference API
   because, being reference code, it can work on any architecture.
3. Tested the reference code for all main and fringe paths.

AMD-Internal: [SWLCSG-3426]

Change-Id: I920f807be870e1db5f9d0784cdcec7b366e1eff5
2025-02-27 13:06:40 +00:00
Deepak Negi
cc321fb95d Added support for different types of zero-point in f32 eltwise APIs.
Description
 - Zero point support for <s32/s8/bf16/u8> datatype in element-wise
   postop only f32o<f32/s8/u8/s32/bf16> APIs.

 AMD-Internal: [SWLCSG-3390]

Change-Id: I2fdb308b05c1393013294df7d8a03cdcd7978379
2025-02-26 04:04:13 -05:00
Mithun Mohan
7394aafd1e New A packing kernels for F32 API in LPGEMM.
-New packing kernels for the A matrix, based on both AVX512 and AVX2 ISA,
for both row and column major storage, are added as part of this change.
The dependency on haswell A packing kernels is removed by this.
-Tiny GEMM thresholds are further tuned for BF16 and F32 APIs.

AMD-Internal: [SWLCSG-3380, SWLCSG-3415]

Change-Id: I7330defacbacc9d07037ce1baf4a441f941e59be
2025-02-26 05:23:35 +00:00
varshav
8a69141294 Bug fix in BF16-F32 supported AVX2 Kernels
- Bug fix in Matrix Mul post op.
- Updated the config in the AVX512_VNNI_BF16 context
  to work with AVX2 kernels.

Change-Id: I25980508facc38606596402dba4cfce88f4eb173
2025-02-25 14:42:45 +00:00
Deepak Negi
c813bfa609 Fix zero point datatype issue.
Description
 Due to different datatype for zero point during post-op creation
 and accuracy check we see an accuracy issue for u8/s8s8s32 apis
 with output type f32/bf16.

 AMD-Internal: [CPUPL-6456]

Change-Id: If8925988841af87cb5687c84aade607967c744fe
2025-02-24 04:40:13 -05:00
varshav
a0005c60ce Add col-major pack kernels and BF16 output support in F32 AVX-2 kernels.
- Added column major pack kernels, which transpose the BF16 input
   matrix and store it as an F32 input matrix.
 - Added BF16 Zero point Downscale support to F32 main and fringe
   kernels.
 - Updated Matrix Add and Matrix Mul post-ops in f32-AVX2 main and
   fringe kernels to support BF16 input.
 - Modified the f32 tiny kernels loop to update the buf_downscale
   parameter.
 - Modified bf16bf16f32obf16 framework to work with AVX-2 system.
 - Added wrapper in bf16 5-Loop to call the corresponding AVX-2/AVX-512
   5 Loop functions.
 - Bug fixes in the f32-AVX2 kernels BIAS post-ops.
 - Bug fixes in the Convert function, and the bf16 5-loop
   for multi-threaded inputs.

AMD-Internal:[SWLCSG-3281 , CPUPL-6447]

Change-Id: I4191fbe6f79119410c2328cd61d9b4d87b7a2bcd
2025-02-24 09:51:12 +05:30
Nallani Bhaskar
5a3c58b315 Fixed column major case of bf16 un-reorder reference function
Description:

1. Fixed bf16 un-reorder column major kernel
2. Fixed a bug in nrlt16 case of f32obf16 reorder function
3. Unit testing done.

AMD-internal: [SWLCSG-3279]

Change-Id: I65024342935ae65186b95885eb010baf3269aa7d
2025-02-20 06:26:31 -05:00
Mithun Mohan
ae182c3fcc Using GEN_BUF buffer instead of <A|B>_PANEL for pack buffer in F32/BF16.
-When bli_pba_acquire_m is invoked to get a buffer for packing, if
buffer type is BLIS_BUFFER_FOR_B_PANEL, then the memory is returned
from a memory pool. In order to ensure thread safety, this memory
pool is protected using locks. Instead if buffer type was
BLIS_BUFFER_FOR_GEN_USE, then memory is allocated using malloc.
-However it was observed that for relatively small input dimensions,
if on the go packing is required, and if jc_ways is sufficiently
large, then there was contention at the lock on the memory pool for
B_PANEL buffer type. This turned out to be an overhead and is now
avoided by checking out GEN_USE buffer type for packing.

AMD-Internal: [SWLCSG-3398]

Change-Id: I781ad5da2a2f75997b58d6c3db70f6277250bd99
2025-02-14 06:12:51 -05:00
Meghana Vankadari
17634d7ae8 Fixed compiler errors and warning for gcc < 11.2
Description:

1. With gcc versions older than 11.2, a few BF16 instructions
   are not supported by the compiler, even though the Zen4 and
   Zen5 processor architectures support them.

2. These instructions are now guarded with a macro.


Change-Id: Ib07d41ff73d8fe14937af411843286c0e80c4131
2025-02-13 10:18:13 -05:00
Mithun Mohan
d61c54dc26 Enable BF16 tiny GEMM path only for Zen4/5 arch id.
-The BF16 tiny GEMM path is only enabled for the Zen4 or Zen5 arch id as
returned by the bli_arch_query_id function. Additionally, it is
disabled if JIT kernels are used.

-Fixed nrlt16 case in bf16_unreorder_ref function

AMD-Internal: [SWLCSG-3380, SWLCSG-3258]

Change-Id: I8af638a85e949f12181bc56c63e5e983c24ca3af
AOCL-Feb2025-b2
2025-02-12 06:39:53 -05:00
Mithun Mohan
4cfbb47b87 Initialize block sizes for F32 element wise post-op APIs.
-The block sizes and micro kernel dimensions for the F32OF32 group
of APIs are updated in the element wise operations cntx map.

AMD-Internal: [SWLCSG-3390]

Change-Id: Ic5690b7eb4f7b2559d893f374dd811b00e31e329
2025-02-11 06:47:24 -05:00
varshav
f4e3a4b1c3 AVX2 Support for BF16 Kernels - Bug fixes
- Added early return checks for A/B transpose cases and column major
  support, as these are not currently supported.
- Enabled the JIT kernels for the Zen4 architecture.

AMD Internal: [SWLCSG - 3281]

Change-Id: Ie671676c51c739dd18709892414fd34d26a540df
2025-02-11 12:40:43 +05:30
Nallani Bhaskar
0acb5eb9a4 Implemented reference unreorder bf16 function
Description:

Implemented a C reference for the
aocl_gemm_unreorder_bf16bf16f32of32 function.

The implementation works for row major;
column major is yet to be enabled.

AMD-Internal: [ SWLCSG-3279 ]

Change-Id: Ibcce4180bb897a40252140012d8d6886c38cb77a
2025-02-11 02:04:42 +00:00
varshav2
ef04388a44 Added AVX2 support for BF16 kernels: Row major
- Currently the BF16 kernels use the AVX512 VNNI instructions.
   In order to support AVX2 kernels, the BF16 input has to be converted
   to F32 and then the F32 kernels have to be executed.
 - Added an un-pack function for the B matrix, which unpacks the
   re-ordered BF16 B matrix and converts it to float.
 - Added a kernel to convert the matrix data from BF16 to F32 for the
   given input.
 - Added a new path to the BF16 5LOOP to work with the BF16 data, where
   the packed/unpacked A matrix is converted from BF16 to F32, the
   packed B matrix is converted from BF16 to F32, and the re-ordered B
   matrix is un-reordered and converted to F32 before feeding the
   F32 micro kernels.
 - Removed AVX512 condition checks in the BF16 code path.
 - Added the re-order reference code path to support BF16 AVX2.
 - Currently the F32 AVX-2 kernels support only F32 BIAS.
   Added BF16 support for the BIAS post-op in F32 AVX2 kernels.
 - Bug fix in the test input generation script.

AMD Internal : [SWLCSG - 3281]

Change-Id: I1f9d59bfae4d874bf9fdab9bcfec5da91eadb0fb
2025-02-10 08:18:52 -05:00
Meghana Vankadari
da3d0c6034 Added new Int8 batch_gemm APIs
Details:
- Added u8s8s32of32|bf16|u8 batch_gemm APIs.
- Fixed some bugs in bench file for bf16 API.

Change-Id: I55380238869350a848f2deec0641d7b9b416b192
2025-02-10 11:19:02 +00:00
Deepak Negi
3a7523b51b Element wise post-op APIs are upgraded with new post-ops
Description:

1. Added new output types for f32 element wise API's to support
   s8, u8, s32 , bf16 outputs.

2. Updated the base f32 API to support all the post-ops supported in
   gemm API's

AMD Internal: [SWLCSG-3384]

Change-Id: I1a7caac76876ddc5a121840b4e585ded37ca81e8
2025-02-10 01:06:39 -05:00
Edward Smyth
0bae96d7ac BLIS: Missing clobbers (batch 8)
- Add missing xmm, ymm and k registers to clobber lists
  in bli_dgemmsup_rv_zen4_asm_24x8m.c
- Add missing ymm1 in bli_dgemmsup_rv_zen4_asm_24x8m.c
  bli_gemmsup_rv_haswell_asm_d6x8m.c and bli_gemmsup_rd_zen_s6x64.c
- Also change formatting in bli_copyv_zen4_asm_avx512.c
  bli_dgemm_avx512_asm_8x24.c and bli_zero_zmm.c to make
  automatic processing of clobber lists easier.

AMD-Internal: [CPUPL-5895]
Change-Id: If05a3f00e6c0f9033eeced5de165ba4c3128b3e5
2025-02-07 10:39:24 -05:00
Edward Smyth
eee3fe1b54 GTestSuite: nested parallelism tests
Optionally enable parallelism inside gtestsuite so that we can
check that BLIS functions perform correctly when nested parallelism
is in operation. Enable with:

  cmake ... -DOPENMP_NESTED={0,1,2,1diff}

where in gtestsuite
- 0 is the default choice with no parallelism.
- 1 and 2 are simple nested parallelism.
- 1diff has one level of parallelism setting different numbers
  of threads to be used by BLIS and reference library calls
  from each gtestsuite thread.

Note: OMP_NUM_THREADS must be set appropriately to enable or
      disable parallelism at each level in the test programs
      as desired.
      OMP_NUM_THREADS will also define the parallelism used
      within the BLIS library (if it is multithreaded), unless
      BLIS-specific ways of specifying parallelism have been
      used. If a BLIS-specific parallelism option has been set,
      the same mechanism will be used in the 1diff option to
      vary the number of threads in BLIS per application thread.

AMD-Internal: [CPUPL-3902]
Change-Id: I89f9edb4125c64ef03e025a9f6ccb84960ba8771
2025-02-07 08:49:25 -05:00
Mithun Mohan
bffa92ec93 Deprecate S16 LPGEMM APIs.
-The following S16 APIs are removed:
1. aocl_gemm_u8s8s16os16
2. aocl_gemm_u8s8s16os8
3. aocl_gemm_u8s8s16ou8
4. aocl_gemm_s8s8s16os16
5. aocl_gemm_s8s8s16os8
along with the associated reorder APIs and corresponding
framework elements.

AMD-Internal: [CPUPL-6412]

Change-Id: I251f8b02a4cba5110615ddeb977d86f5c949363b
2025-02-07 11:43:28 +00:00
Edward Smyth
1f0fb05277 Code cleanup: Copyright notices (2)
More changes to standardize copyright formatting and correct years
for some files modified in recent commits.

AMD-Internal: [CPUPL-5895]
Change-Id: Ie95d599710c1e0605f14bbf71467ca5f5352af12
2025-02-07 05:41:44 -05:00
Edward Smyth
c74faac80f Fix compiler warning messages
Various occurrences of the following compiler warnings have been
fixed:
* Type mismatch
* Misleading code indentation
* Array bounds violation warning in blastest when using gcc 11
  without -fPIC flag

AMD-Internal: [CPUPL-5895]
Change-Id: Ia5d5310b76a66e87ad3953a72e8472ed5b01e588
2025-02-07 05:03:49 -05:00
Mithun Mohan
b9f6286731 Tiny GEMM path for BF16 LPGEMM API.
-Currently the BF16 API uses the 5-loop algorithm inside the OMP loop
to compute the results, irrespective of the input sizes. However, it
was observed that for very tiny sizes (n <= 128, m <= 36), this OMP
loop and the NC, MC, KC loops were turning out to be overheads.
-In order to address this, a new path without OMP loop and just the
NR loop over the micro-kernel is introduced for tiny inputs. This is
only applied when the num threads set for GEMM is 1.
-Only row major inputs are allowed to proceed with tiny GEMM.

AMD-Internal: [SWLCSG-3380, SWLCSG-3258]

Change-Id: I9dfa6b130f3c597ca7fcf5f1bc1231faf39de031
2025-02-07 04:37:11 -05:00
Meghana Vankadari
c47f0f499f Fixed bug in testing matrix_mul post_op
Details:
- Added a new python script that can test all microkernels
  along with post-ops.
- Modified post_op freeing function to avoid memory leaks.

Change-Id: Iedba84e8233a88ca9261596c4c7e0a65c196b7e7
2025-02-07 02:27:14 +05:30
Deepak Negi
86e52783e4 Tiny GEMM path for F32 LPGEMM API.
-Currently the F32 API uses the 5-loop algorithm inside the OMP loop
to compute the results, irrespective of the input sizes. However, it
was observed that for very tiny sizes (n <= 128, m <= 36), this OMP
loop and the NC, MC, KC loops were turning out to be overheads.
-In order to address this, a new path without OMP loop and just the
NR loop over the micro-kernel is introduced for tiny inputs. This is
only applied when the num threads set for GEMM is 1.

AMD-Internal: [SWLCSG-3380]

Change-Id: Ia712a0df19206b57efe4c97e9764d4b37ad7e275
2025-02-06 23:36:44 -05:00
Vignesh Balasubramanian
8abb37a0ad Update to AOCL-BLAS bench application for logging outputs
- Updated the format specifiers to have a leading space,
  in order to delimit the outputs appropriately in the
  output file.

- Further updated every source file to have a leading space
  in its format string occurring after the macros.

AMD-Internal: [CPUPL-5895]
Change-Id: If856f55363bb811de0be6fdd1d7bbc8ec5c76c15
2025-02-06 22:59:59 +05:30
Arnav Sharma
5a4739d288 DGEMV NO_TRANSPOSE Optimizations and Unit Tests
- Added 32x3n n-biased kernels to directly handle the cases where n=3
  which were earlier being handled by the primary n-biased, 32x8n,
  kernel.
- Modified the n-biased fringe kernels to further handle the smaller
  m-fringe cases. Thus, now the kernels handle the following range of m
  for any value of n:
  - 16x8n     : m = [16, 31)
  - 8x8n      : m = [8, 15)
  - m_leftx8n : m = [1, 7]
- Updated the function pointer map for n-biased kernels with added
  granularity to invoke the smaller fringe cases directly on the basis
  of m-dimension.
- Added micro-kernel unit tests for all the dgemv_n kernels.

AMD-Internal: [CPUPL-6231]
Change-Id: Ibe88848c2c1bbb65b3e79fbc90a2800dc15f5119
2025-02-06 18:52:32 +05:30
Shubham Sharma
f8c83fedb6 Added new ZTRSM small code path for ZEN5
- Added new ZTRSM kernels for right and left variants.
- Kernel dimensions are 12x4.
- 12x4 ZGEMM SUP kernels are used internally
  for solving GEMM subproblem.
- These kernels do not support conjugate transpose.
- Only column major inputs are supported.
- Tuned thresholds to pick the most efficient code path for ZEN5.

AMD-Internal: [CPUPL-6356]
Change-Id: I33ba3d337b0fcd972ca9cfe4668cb23d2b279b6e
2025-02-06 18:01:10 +05:30
Deepak Negi
2e687d8847 Updated all post-ops in s8s8s32 API to operate in float precision
Description:

1. Changed all post-ops in s8s8s32o<s32|s8|u8|f32|bf16> to operate
   on float data, by converting the s32 accumulator registers to
   float at the end of the k loop.

2. Added the s8s8s32ou8 API, which uses the s8s8s32os32 kernels but
   stores the output in u8.

AMD-Internal: [SWLCSG-3366]

Change-Id: Iadfd9bfb98fc3bf21e675acb95553fe967b806a6
2025-02-06 07:31:28 -05:00
jagar
2ece628a4d Build System: Remove absolute path of files appearing in the library
Updated the Makefile and main CMakeLists.txt to replace the absolute paths within library object files with whatever path is specified.
In files where the __FILE__ macro is used, the absolute path was appearing in the library; this is now replaced with relative paths.

AMD-Internal: [CPUPL-5910]
Change-Id: Iac63645348a7f8214123fdcd3675670eedd887e3
2025-02-06 06:15:56 -05:00
Meghana Vankadari
13e7ada3f2 Modified bench to test different types of post-ops
- Modified bench to support testing of different types of buffers
  for bias, mat_add and mat_mul postops.
- Added support for testing integer APIs with float accumulation
  type.

Change-Id: I72364e9ad25e6148042b93ec6d152ff82ea03e96
2025-02-06 02:38:08 +05:30
Mithun Mohan
0701a4388a Thread factorization improvements (ic ways) for BF16 LPGEMM API.
-Currently, when m is small compared to n, even if MR blocks (m / MR) > 1
and total work blocks (MR blocks * NR blocks) < available threads, the
number of threads assigned to the m dimension (ic ways) is 1. This results
in subpar performance in bandwidth-bound cases. To address this, the
thread factorization is updated to increase ic ways for these cases.

AMD-Internal: [SWLCSG-3333]

Change-Id: Ife3eafc282a2b62eb212af615edb7afa40d09ae9
2025-02-06 00:51:10 -05:00
AngryLoki
ea93d2e2c9 Fix SyntaxWarning messages from python 3.12 (#809)
Details:
- When using regexes in Python, certain characters need backslash escaping, e.g.:
  ```python
  regex = re.compile( '^[\s]*#include (["<])([\w\.\-/]*)([">])' )
  ```
  However, technically escape sequences like `\s` are not valid and should actually be double-escaped: `\\s`.
  Python 3.12 now warns about such escape sequences, and in a later version these warnings will be promoted
  to errors. See also: https://docs.python.org/dev/whatsnew/3.12.html#other-language-changes. The fix here
  is to use Python's "raw strings" to avoid double-escaping. This issue can be checked for all files in the current
  directory with the command `python -m compileall -d . -f -q .`
- Thanks to @AngryLoki for the fix.

AMD-Internal: [CPUPL-5895]
Change-Id: I7ab564beef1d1b81e62d985c5cb30ab6b9a937f2
(cherry picked from commit 729c57c15a)
2025-02-05 07:13:42 -05:00
Hari Govind S
b6e9cde317 Revert "Optimisation of AOCL-dynamic for dotv API"
This reverts commit a028108cbb.

Reason for revert: With libgomp, scalability issues were observed with a
                   higher number of threads, leading to the use of fewer
                   threads. However, with different OpenMP libraries
                   like libomp, this scalability issue was not observed,
                   and using fewer threads resulted in performance loss.
                   The AOCL dynamic logic has been updated to select a
                   higher number of threads, considering the iomp OpenMP
                   library.

Change-Id: I2432b715eff01fc99b2c0f8b60bdecfaf5a6568f
2025-02-05 06:33:06 -05:00
jagar
8d0bf148ee Added separate PC for mt blis library
Added separate package configuration file for
st and mt library in blis Makefile and CMakeLists.txt

Change-Id: I8d851fac10d63983358e1f4c67fd9451246056bf
2025-02-05 05:10:11 -05:00
Hari Govind S
fe73445813 Introduced fast-path in DCOPYV API and fix compiler warning for AXPYV
- Added a conditional check to invoke the vectorized
  DCOPYV kernels directly(fast-path), without incurring
  any additional framework overhead.

- The fast-path is taken when the input size is ideal for
  single-threaded execution. Thus, we avoid the call to
  bli_nthreads_l1() function to set the ideal number of threads.

- Used macros to protect the declaration of fast_path_thresh in
  DAXPYV API to avoid compiler warnings.

AMD-Internal: [CPUPL-4875][CPUPL-5895]
Change-Id: Id4141cd22e2382ece9e36fc02934bf6c11bd02cb
2025-02-05 04:41:55 -05:00
Shubham Sharma
bac0fed3cf Fixed Bug in Dynamic Blocksizes
- Mixed precision datatypes use a modified cntx.
- For some variants of mixed precision, complex and real blocksizes
  are needed to be same. This is achieved by creating a local copy of
  cntx and copying complex blocksizes onto real blocksizes.
- By using the dynamic blocksizes, the changes made to the
  blocksizes for mixed precision are overwritten by changes made
  by dynamic blocksizes.
- This mismatch between complex and real blocksizes was causing an issue
  where the pack buffer is allocated based on complex blocksizes but the
  amount of data packed is based on real blocksizes.
- This makes the pack buffer sizes smaller than the required sizes.
- To fix this, dynamic blocksizes are disabled for mixed precision.

AMD-Internal: [CPUPL-6384]
Change-Id: Ib9792f90b4ea113e54059a0da8fb4241622b5f83
2025-02-05 01:09:24 -05:00