Commit Graph

132 Commits

Author SHA1 Message Date
Hari Govind S
e2e95a09b0 Fixing missing registers in end_asm for copyv APIs
-  Added the missing registers in end_asm for scopy,
   dcopy and zcopy APIs.

-  Removed unnecessary registers from end_asm for scopy
   and dcopy APIs.

-  Corrected mistakes in the comments.

Change-Id: I5ebe2ff9cb2c72ca7c71a67419281f73462f9498
2024-07-30 15:09:52 +05:30
Meghana Vankadari
d5b4d3aa5e Fixing control flow in aocl_gemm_bf16s4f32of32|bf16
- Fixed framework of bf16s4f32of32 API to correct
  pointer updations.
- Modified pre_op structure to exclude pre-op-offset.
  Now offset is passed as a separate parameter to the
  scale-pack functions.
- Fixed work-distribution among threads in MT scenario.
- Added Blocksizes and kernel-pointers and verified
  functionality for the new API.

AMD-Internal: [SWLCSG-2943]
Change-Id: I58fece240d62c798c880a2b2b7fa64e560cc753d
2024-07-29 05:12:09 -04:00
Nallani Bhaskar
c6dd7c1b4b Added new API in aocl_gemm to support A bf16 data type and B s4 data type
Description:

1. Added a new API aocl_gemm_bf16s4f32of32 to support
   for WoQ (Weight-only-Quantization) in LLM's

2. The API supports only reordered B matrix of data
   size signed 4 bits (S4).

3. Substracting zero point and multiplying with scale
   on B matrix is performed in packing B.

4. zero point and scale data should be passed by user
   through pre-ops data structure.

5. The API is still in experimental state and NOT tested.

   AMD-Internal: SWLCSG-2943

Change-Id: I10b159b64c2e2aaf39da5462685618ba8cc800ee
2024-07-25 11:59:03 +00:00
Meghana Vankadari
49949f488f Implemented on-the-go pack kernel for s4->bf16
Details:
- To enable Weight-only-Quantization(WOQ) workflow,
  new LPGEMM APIs are added where datatypes are A: bf16,
  B: int4, C: f32/bf16. To support this, B matrix will
  be reordered with type still being int4. New pack
  kernels that packs the reordered B matrix after
  converting the data from int4 to bf16 and applying
  zero-point and scale are added.

AMD-Internal: [SWLCSG-2943]
Change-Id: Iabe23dab607913c0114b97cb2b91248babeaac03
2024-07-25 04:13:05 +05:30
mkadavil
7114376519 New kernels for int4 B matrix reordering following BF16 kernel schema.
-To enable Weight-only-Quantization (WOQ) workflow, new LPGEMM APIs
are required where data types are A:bf16, B:int4 and C:f32/bf16. It
is expected that the BF16 kernels will be reused within this API and
subsequently the B matrix needs to be reordered following the BF16
kernel schema, but with the reordered matrix type still being int4. To
address this, new BF16 reorder kernels enabling the same are added.

AMD-Internal: [SWLCSG-2943]
Change-Id: Ib770ecbf90a3d906deafece94b1a96e0b9412738
2024-07-25 01:10:13 -04:00
Hari Govind S
eacad443e3 Optimization for DCOPY and SCOPY API
-  Replaced "vmovupd" with "vmovups" for "bli_scopyv_zen4_asm_avx512"
   kernel.

-  Optimization of loop unrolling for "bli_dcopyv_zen4_asm_avx512"
   and "bli_scopyv_zen4_asm_avx512" kernels.

-  Replaced existing load balancing algorithm for dcopy API with
   "bli_thread_range_sub" algorithm.

-  Included AOCL-dynamic values for optimial number of threads
   for zen5 architecture.

AMD-Internal: [CPUPL-5238]
Change-Id: Ic82bdfad9478c8f75dc5a3dcfed0df85fbcae957
2024-07-24 08:23:07 -04:00
Arnav Sharma
9583ee2e23 DGEMV Optimizations for NO_TRANSPOSE cases
- Enabled AVX512 DAXPYF kernels for DGEMV var2 for NO_TRANSPOSE cases.

- Added DAXPYF kernels with fuse factors of 2, 4, 6 and 16.

- Added a wrapper for DAXPYF kernels for redirection to kernels with a
  smaller fuse factor than 32.

- Also added UKR tests for the new fused kernels.

AMD-Internal: [CPUPL-5098]
Change-Id: I0b102b67c6c068873393bac0494284f379c253f2
2024-07-24 15:59:36 +05:30
mkadavil
42e539b878 Quantization (scale + zero point) updates/fixes for BF16 LPGEMM api.
-_mm512_cvtpbh_ps intrinsic is not supported in older versions of gcc
(<gcc 12.2) and subsequently throws a compilation error. This is fixed
by replacing this intrinsic with a macro that achieves the bf16 to f32
conversion via shift operations.
-Bug fixes in the vector scale factor load in fringe kernels.

AMD-Internal: [SWLCSG-2945]
Change-Id: I8eac4c4b34b043e7a8116dc465723d8f85b28018
2024-07-23 04:39:14 +05:30
Shubham Sharma
16c56e0101 Added 24x8 triangular kernels for DGEMMT SUP
- In order to reuse 24x8 AVX512 DGEMM SUP kernels,
   24x8 triangular AVX512 DGEMMT SUP kernels are added.
 - Since the LCM of MR(24) and NR(8) is 24, therefore the diagonal
   pattern repeats every 24x24 block of C. To cover this 24x24 block,
   3 kernels are needed for one variant of DGEMMT. A total of 6
   kernels are needed to cover both upper and lower variants.
 - In order to maximize code reuse, the 24x8 kernels are broken
   into two parts, 8x8 diagonal GEMM and 16x8 full GEMM. The 8x8
   diagonal GEMM is computed by 8x8 diagonal kernel, and 16x8
   full GEMM part is computed by 24x8 DGEMM SUP kernel.
 - Changes are made in framework to enable the use of these kernels.

AMD-Internal: [CPUPL-5338]
Change-Id: I8e7007031e906f786b0c4fe12377ee439075207a
2024-07-22 12:02:30 -04:00
Vignesh Balasubramanian
b48e864e82 AVX512 optimizations for DAXPBYV API
- Implemented AVX512 computational kernel for DAXPBYV
  with optimal unrolling. Further implemented the other
  missing kernels that would be required to decompose
  the computation in special cases, namely the AVX512
  DADDV and DSCAL2V kernels.

- Updated the zen4 and zen5 contexts to ensure any query
  to acquire the kernel pointer for DAXPBYV returns the
  address of the new kernel.

- Added micro-kernel units tests to GTestsuite to check
  for functionality and out-of-bounds reads and writes.

AMD-Internal: [CPUPL-5406][CPUPL-5421]
Change-Id: I127ab21174ddd9e6de2c30a320e62a8b042cbde6
2024-07-22 11:32:19 +05:30
mkadavil
d37c91dffa Quantization (scale + zero point) support for BF16 LPGEMM api.
-Quantization of f32 to bf16 (bf16 = (f32 * scale_factor) + zero_point)
instead of just type conversion in aocl_gemm_bf16bf16f32obf16.
-Support for multiple scale/sum/matrix_add/bias post-ops in a single
LPGEMM api call.
-Post-ops mask related fixes in lpgemv kernels .
-Additional scale post-ops sanity checks.

AMD-Internal: [SWLCSG-2945]
Change-Id: I3b35cc413c176bb50bfdbd6acd4839a5ba7e94bb
2024-07-18 05:32:51 -04:00
Hari Govind S
38824244d5 Implementation of AXPYF Kernels for DTRSV
-  Implemented two new axpyf kernels for fused factors 8 and 12
   by manually unrolling the loops. Used to achieve better performance
   in var2 case.

AMD-Internal: [CPUPL-5184]
Change-Id: I40d2930d003c6ce90323b5c8a52564563d1f23f5
2024-07-16 06:23:01 -04:00
Arnav Sharma
4aa66f108e Added CSCALV AVX512 Kernel
- Added CSCALV kernel utilizing the AVX512 ISA.

- Added function pointers for the same to zen4 and zen5 contexts.

- Updated the BLAS interface to invoke respective CSCALV kernels based
  on the architecture.

- Added UKR tests for bli_cscalv_zen_int_avx512( ... ).

AMD-Internal: [CPUPL-5299]
Change-Id: I189d87a1ec1a6e30c16e05582dcb57a8510a27f3
2024-07-15 07:17:43 -04:00
Shubham Sharma.
a7744361e4 DGEMM optimizations for Turin Classic
- Introduced new 8x24 macro kernels.
   - 4 new kernels are added for beta 0, beta 1, beta -1
      and beta N.
   - IR and JR loop moved to ASM region.
   - Kernels support row major storage scheme.
   - Prefetch of current micro panel of C is enabled.
   - Kernel supports negative offsets for A and B matrices.
 - Moved alpha scaling from DGEMM kernel to B pack kernel.
 - Tuned blocksizes for new kernel.
 - Added support for alpha scaling in 24xk pack kernel.
 - Reverted back to old b_next computation
   in gemm_ker_var2.
 - BugFix in 8x24 DGEMM kernel for beta 1,
   comparsion for jmp conditions was done using integer
   instructions, which caused beta 1 path to never be taken.
   Fixed this by changing the comparsion to double.

AMD-Internal: [CPUPL-5262]
Change-Id: Ieec207eea2a164603c8a8ea88e0b1d3095c29a3f
2024-07-09 07:53:27 -04:00
vignbala
236d092656 AVX512 optimizations for ZGEMM to handle k = 1 cases
- Implemented bli_zgemm_16x4_avx512_k1_nn( ... ) AVX512 kernel to
  be used as part of BLAS/CBLAS calls to ZGEMM. The kernel is built
  for handling the GEMM computation with inputs having k = 1,
  with the transpose values being N(for column-major) and T(for
  row-major).

- Updated the zgemm_blis_impl( ... ) layer to query the architecture
  ID and invoke the AVX2 or AVX512 kernel accordingly.

- Added API level tests for accuracy and code-coverage, as well as
  micro-kernel tests for verifying functionality and out-of-bounds
  memory accesses.

AMD-Internal: [CPUPL-5249]
Change-Id: Id1f8bebff3e0da83c7febe86299564fd658b2e84
2024-07-09 07:07:24 -04:00
Hari Govind S
627bf0b1ba Implemented Multithreading and Enabled AVX512 Kernel for ZAXPY API
-  Replaced 'bli_zaxpyv_zen_int5' kernel with optimised
   'bli_zaxpyv_zen_int_avx512' kernel for zen4 and
   zen5  config.

-  Implemented multithreading support and AOCL-dynamic
   for ZAXPY API.

-  Utilized 'bli_thread_range_sub' function to achieve
   better work distribution and avoid false sharing.

AMD-Internal: [CPUPL-5250]
Change-Id: I46ad8f01f9d639e0baa78f4475d6e86458d8069b
2024-07-09 01:29:53 -04:00
Meghana Vankadari
4e6fa17c08 Bug fix in LPGEMV for INT8 APIs
Details:
- Corrected the usage of vpdpbusd instruction in
  GEMV implementation for INT8 APIs.
- Modified bench to fill matrices with values
  ranging between -5 and +5 whenever the datatype is
  a signed integer.

Change-Id: I457462b888b667d8a34c53de762e9b4aee784ecc
2024-06-27 04:22:04 +05:30
mkadavil
a26c85333a Int4 B matrix reordering support fixes in LPGEMM.
Reordering B matrix of datatype int4 is done as per the pack schema
requirements of u8s8s32 kernel. However for fringe cases, the
matrix pointer increments need to be halved to account for the half
byte size of int4 elements.

AMD-Internal: [SWLCSG-2390]
Change-Id: I22a04c4c8133db6ae6ca0a4d3e86c11aba1e2cdb
2024-06-26 05:39:45 +05:30
Vignesh Balasubramanian
02da190560 AVX512 optimizations for DNRM2
- Implemented bli_dnorm2fv_unb_var1_avx512( ... ) AVX512
  computational kernel for DNRM2 API.

- Updated the header to include this kernel signature, as well
  as the framework layer to use this function in case of ZEN4
  and ZEN5 configurations.

- Updated the tipping points for ideal thread setting in DNRM2
  for ZEN5 micro-architecture. These thresholds are specific
  to the library's linkage to LLVM's OpenMP or GNU's OpenMp.

- Further abstracted the AOCL-DYNAMIC logic to separate functions
  for ?NRM2 APIs that currently support it(namely, DNRM2 and ZNRM2).

- Further updated the ?NRM2 framework to accommodate the necessary
  changes to invoke the newer AOCL-DYNAMIC functions and the AVX512
  kernel, when needed.

- Added micro-kernel and memory tests for this kernel in GTestsuite,
  to validate accuracy and out-of-bounds read and write.

AMD-Internal: [CPUPL-5265]
Change-Id: I4fc0d0f1e6906bf27d46562ca387c338cc4d2049
2024-06-24 08:50:36 -04:00
mkadavil
a5c4a8c7e0 Int4 B matrix reordering support in LPGEMM.
Support for reordering B matrix of datatype int4 as per the pack schema
requirements of u8s8s32 kernel. Vectorized int4_t -> int8_t conversion
implemented via leveraging the vpmultishiftqb instruction. The reordered
B matrix will then be used in the u8s8s32o<s32|s8> api.

AMD-Internal: [SWLCSG-2390]
Change-Id: I3a8f8aba30cac0c4828a31f1d27fa1b45ea07bba
2024-06-24 07:55:34 -04:00
Meghana Vankadari
c1e063e65c Fix for offset issue while reading constants from JIT code
Details:
- For a variable x, Using address of x in an instruction throws
  exception if the difference between &x and access position is
  larger than 2 GiB. To solve this issue all variables are stored
  within the JIT code section and are accessed using relative addressing.

- Fixed a bug in B matrix pack function for s8s8s32os32 API.
- Fixed a bug in JIT code to apply bias on col-major matrices.

AMD-Internal: [SWLCSG-2820]
Change-Id: I82f117a0422c794cb9b1a4d65a89d60de4adfd96
2024-06-24 07:14:15 -04:00
Arnav Sharma
aa3adb8d69 Updated DOTXF and AXPYF Kernels
- Updated the fused kernels (DOTXF and AXPYF) to properly handle cases
  when b_n > fuse_factor.

- The fused kernels are expected to invoke respective Level-1 kernels
  iteratively when b_n > fuse_factor.

AMD-Internal: [CPUPL-5246]
Change-Id: Ie7a0f4e61ede088663e3491269b3f1398d028095
2024-06-20 04:41:41 -04:00
Nallani Bhaskar
1b79f35e6d Updated store to avoid warning in gcc-10
Description:

- _mm512_storeu_epi8 and _mm512_storeu_epi16
   intrensic instructions are not available in gcc-10
- Replaced above intrensics _mm512_storeu_si512

Change-Id: I2878780b7acd040ccf45e571d486ff8c2388088c
2024-05-30 22:22:50 +05:30
mkadavil
cd032225ca BF16 bias support for bf16bf16f32ob16.
-As it stands the bf16bf16f32ob16 API expects bias array to be of type
float. However actual use case requires the usage of bias array of bf16
type. The bf16 micro-kernels are updated to work with bf16 bias array by
upscaling it to float type and then using it in the post-ops workflow.
-Corrected register usage in bf16 JIT generator for bf16bf16f32ob16 API
when k > KC.

AMD-Internal: [SWLCSG-2604]
Change-Id: I404e566ff59d1f3730b569eb8bef865cb7a3b4a1
2024-05-23 04:48:20 +05:30
Nallani Bhaskar
29db6eb42b Added transB in all AVX512 based int8 API's
Description:
--Added support for tranB in u8s8s32o<s32|s8> and
  s8s8s32o<s32|s8> API's
--Updated the bench_lpgemm by adding options to
  support transpose of B matrix
--Updated data_gen_script.py in lpgemm bench
  according to latest input format.

AMD-Internal: [SWLCSG-2582]
Change-Id: I4a05cc390ae11440d6ff86da281dbafbeb907048
2024-05-23 03:46:13 +05:30
Edward Smyth
1f60b7c366 Export some BLIS internal symbols 2
Export more symbols for BLIS kernels so that AOCL libFLAME
optimizations can call them directly.

AMD-Internal: [CPUPL-5044]
Change-Id: I45392b8a2a14ac2816141521b90b7ddb1216c733
2024-05-15 06:59:56 -04:00
Mangala V
64d9c96d45 ZGEMMT SUP: AVX512 GEMMT code for Upper variant
1. Enabled AVX512 path for
   -  Upper variant
   -  Different storage schemes for upper and lower variant

2. Modified mask value to handle all fringe cases correctly

AMD_Internal: [CPUPL-5091]

Change-Id: I4bf8aca24c1b87fff606deb05918b8e6216b729e
2024-05-15 13:08:32 +05:30
Shubham Sharma
b4bc71f3ac Bug fix IN DAXPYF MT and Code Cleanup
- Fixed bug in DAXPYF MT kernel when incx != inca.
- Added AOCL Dynamic function for 1f kernels.
- Moved all DOTXF and AXPYF kernels into one file.

AMD-Internal: [CPUPL-4880]
Change-Id: I7d9f44625bc42fad4a9e5b218ecc382efdf22cbe
2024-05-14 06:44:10 -04:00
Shubham Sharma
f4b06547fd Enabled DGEMMT SUP optimized code for upper variant
- Enabled DGEMMT SUP upper kernels in AVX512 code path.
- Enabled use of optimized kernels for all the storages
  supported by optimized kernels.

AMD-Internal: [CPUPL-4881]
Change-Id: Id4486610dacaabc405fbc35b2588607c6508705e
2024-05-14 05:23:51 -04:00
Meghana Vankadari
3a8b9270e7 Implemented lpgemv for AVX512-INT8 variants
- Implemented optimized lpgemv for both m == 1 and n == 1 cases.
- Fixed few bugs in LPGEMV for bf16 and f32 datatypes.
- Fixed few bugs in JIT-based implementation of LPGEMM for BF16
  datatype.

AMD-Internal: [SWLCSG-2354]
Change-Id: I245fd97c8f160b148656f782d241f86097a0cf38
2024-05-14 01:55:49 +05:30
Hari Govind S
61d0f3b873 Additional optimisations on COPYV API
-  Reduced number of jump operations in AVX512
   assembly kernel for SCOPYV, DCOPYV and ZCOPYV.

-  Fixed memory test failure for bli_zcopyv_zen_int_avx512
   kernel.

-  Replaced existing AVX2 COPYV intrinsic kernels in
   bli_cntx_init_zen5.c with AVX512 assembly kernels.

Change-Id: Idc11601b526d6d82cfbdf63af2fd331918b31159
2024-05-10 07:22:04 -04:00
Hari Govind S
92847ae912 Gtestsuite: Memory testing for SCOPYV, DCOPYV and ZCOPYV APIs
-  Utilized the memory testing feature in GTestsuite
   to update the testing interfaces for micro-kernel
   testing of SCOPY, DCOPY and ZCOPY APIs.

Change-Id: I3d6905f33b000b8d5e60727aa896bd869f4f441f
2024-05-09 12:10:17 -04:00
Shubham Sharma
f36468a9e9 Enabled vectorized division code in ZTRSM
- Existing vectorizes code was disabled because
  of the failures observed in matlab tests.
- The issue is caused by underflow during division when diagonal
   elements of A matrix are very small.
- When diagonal is very small (4E-324 in case of matlab), sqauring the
   diagonal during divison causes the square to be rounded off to zero.
- Fix is to normalise (ar) and (ai) by dividing (ar) and (ai) by
  max(ar, ai), this will make either (ar) or (ai) 1, and hence
  reduce the likelihood of underflow.

AMD-Internal: [CPUPL-5052]
Change-Id: Iff7893fdcb92907a12e6af8e102a92637a13ce4f
2024-05-09 01:35:39 -04:00
Arnav Sharma
cb27fad49c ZSCALV AVX512 Kernel
- Implemented ZSCALV kernel utilizing AVX512 intrinsics.

- Gtestsuite: Added ukr tests for the new kernel.

AMD-Internal: [CPUPL-5012]
Change-Id: I75c7f4448ddd60b0f9afa53936eed37f5f99eeb2
2024-05-08 11:55:13 -04:00
Arnav Sharma
1dbeee4d19 ZDOTV AVX512 Kernel with MT Support
- Added AVX512 kernel for ZDOTV.

- Multithreaded both ZDOTC and ZDOTU with AOCL_DYNAMIC support.

AMD-Internal: [CPUPL-5011]
Change-Id: I56df9c07ab3b8df06267a99835b088dcada81bd8
2024-05-08 04:54:05 -04:00
Mangala V
e6cc2a3e22 ZGEMMT SUP Optimizations for AVX512
Existing Design:
 - GEMM AVX2 kernel performs computation and updates temporary C buffer
 - Portion of temporary C buffer is copied to output C buffer
   based on UPLO parameter
 - For diagonal blocks, using GEMM kernels is not efficient

New Design: Implemented in current patch when UPLO='L'
 - GEMMT kernel used for computation, temporary buffer is not required.
 - Only required elements are computed using mask load store for all
   fringe cases
 - Exception: AVX2 code path is used when storage format is RRC, CRR, CRC

- AOCL-Dynamic is added based on dimension
- Check for AVX platform is added in SUP interface, It returns to
  native implementation if hardware doesnot support AVX platform
- SUP ref_var2m is expanded for dcomplex datatype to avoid condition
  check which exists for double datatype

AMD_Internal: [CPUPL-5006]

Change-Id: I3e21404b732b8f2df9cbdba394303752fdf36286
2024-05-07 23:00:29 +05:30
Meghana Vankadari
1072770c63 Implemented LPGEMV for bf16 datatype
1. The 5 LOOP LPGEMM path is in-efficient when A or B is a vector
   (i.e, m == 1 or n == 1).

2. An efficient implementation is developed considering the b matrix
   reorder in case of m=1 and post-ops fusion.

3. When m = 1 the algorithm divide the GEMM workload in n dimension
   intelligently at a granularity of NR. Each thread work on A:1xk
   B:kx(>=NR) and produce C=1x(>NR).  K is unrolled by 4 along with
   remainder loop.

4. When n = 1 the algorithm divide the GEMM workload in m dimension
   intelligently at a granularity of MR. Each thread work on A:(>=MR)xk
   B:kx1 and produce C = (>=MR)x1. When n=1 reordering of B is avoided
   to efficiently process in n one kernel.

AMD-Internal: [SWLCSG-2355]
Change-Id: I7497dad4c293587cbc171a5998b9f2817a4db880
2024-05-06 23:55:15 +05:30
Shubham Sharma
be34169001 Fixed Matlab Failure in ZTRSM
- In AVX512 ZTRSM kernel, vertorizes division code
  is causing failures in matlab.
- The logic is identical in reference C code and intrinsics code,
  but intrinsics code is causing failure
- Replaced optimized intrinsics code with C code.

AMD-Internal: [CPUPL-5052]
Change-Id: Iea184330b22c46d979867b870486066ef980eb84
2024-05-06 06:56:45 -04:00
mkadavil
118e955a22 SWISH post-op support for all LPGEMM APIs.
SWISH post-op computes swish(x) = x / (1 + exp(-1 * alpha * x)).
SiLU = SWISH with alpha = 1.

AMD-Internal: [SWLCSG-2387]
Change-Id: I55f50c74a8583a515f7ea58fa0878ccbcdd6cc26
2024-05-06 06:05:11 -04:00
vignbala
f8218bb9f2 Compiler warnings when using masked loads
- Updated the AVX512 DOTXF kernels to use MASKZ loads
  instead of MASK loads when loading X vector in fringe
  case. This avoids compiler warnings of uninitialized
  vector as input to the intrinsic.

- The functionality will not change when using either MASK
  or MASKZ loads on X, since A matrix is loaded using MASKZ
  loads.

AMD-Internal: [CPUPL-4974]
Change-Id: I1ef98a1292352d0e905cc09cd5667acd883df827
2024-05-03 09:53:36 -04:00
Shubham Sharma
b70347d0d4 DGEMMT SUP Optimizations for AVX512
- In DGEMMT SUP AVX2 code path, traingular kernels
  are added in order to avoid temporary C buffer.
- Since these kernels did not exist for AVX512,
  AVX2 kernels were being used in GEMMT.
- AVX512 triangular GEMM kernel has been added
  to make sure that AVX512 kernels can be used without
  creating a temporary buffer.
- This kernel is added only for Lower variant of GEMMT,
   for upper variant of DGEMMT, temporary C buffer is
   created, full GEMM kernel is called on temporary C and
   traingular region from temporary C is copied to C
   buffer.

AMD-Internal: [CPUPL-4881]
Change-Id: Id70645f79ae078ab9a7006e83d328505f1fae8a9
2024-05-03 05:11:11 -04:00
Shubham Sharma
b9e21e8701 Added ZTRSM AVX512 small code path
- Kernel dimensions are 4x4.
  - Two kernels are implemented, Right Upper and
    Right lower.
  - In case of Left variants of TRSM, transpose is
    induced so that Right variant kernels can be used.
  - No packing is performed in these kernels.
  - Changes are made in the threshold to pick ZTRSM small
    code path.
  - BLIS_INLINE is removed from signature of
    "TRSMSMALL_KER_PROT".
  - These kernels do not support "ENABLE_TRSM_PREINVERSION".
  - Newly added kernels do not support conjugate
    transpose.
  - Added multithreading to ZTRSM small code path.

AMD-Internal: [CPUPL-4324]
Change-Id: I683b1d5239593e54f433e7f27497d72dfbd9141c
2024-05-03 05:10:41 -04:00
Shubham Sharma
1d983e6124 Added AVX512 kernels for DAXPYF and DDOTXF
- Added DAXPYF and DDOTXF AVX512 kernels.
- Fuse factor for ddotxf kernel is 8.
- 2 DAXPYF kernels are added, with fuse
  factor 8 and 32.
- Multithreading is also added to the DAXPYf
  kernel with fuse factor 32.
- These kernels are internally used by TRSM.
- Added changes in TRSV to call these kernels
  in ZEN4

AMD-Internal: [CPUPL-4880]
Change-Id: I12850de974b437bbca07677b68bc3d6a35858770
2024-05-03 05:10:22 -04:00
Vignesh Balasubramanian
4e2966f9b0 AVX512 optimizations for ZGEMV API with transpose case
- Implemented AVX512 kernels for handling the calls to ZGEMV
  with transpose to A matrix.

- This includes the set of ZDOTXF and ZDOTXV kernels. ZDOTXF
  kernels include those with fuse-factor 8 (main kernel), 4
  and 2(fringe kernels).

- Updated the bli_zgemv_unf_var1( ... ) function to update
  the function pointers to these kernels, based on the
  configuration.

AMD-Internal: [CPUPL-4974]
Change-Id: I313ae0abe9dc119de849da42f9825b71f11b1fda
2024-05-03 04:38:52 -04:00
Vignesh Balasubramanian
53cb83d0cc AVX512 optimizations for ZGEMV API with no-transpose case
- Implemented AVX512 kernels for handling the calls to ZGEMV
  with no-transpose to A matrix.

- This includes the ZAXPYF, ZAXPYV and ZSETV kernels.
  The set of ZAXPYF kernels include those with fuse-factor 8
  (main kernel), 4 and 2(fringe kernels).

- Updated the bli_zgemv_unf_var2( ... ) function to set
  the function pointers to these kernels, based on the
  configuration. Further added the call to ZSETV at this
  layer in case beta is 0.

AMD-Internal: [CPUPL-4974]
Change-Id: Iee4b724719e49023138bb16479765be44d677cd9
2024-05-03 07:04:47 +00:00
Hari Govind
9c26de1a18 Optimisiation COPYV APIs
- Implemented AVX512 kernels for scopyv_, dcopyv_ and  zcopyv_
  using respective AVX512 intrinsics including masked
  load and store operations.

- Implemented AVX512 kernels for scopy_, dcopy_ and
  zcopy_ using assembly language to prevent loss of
  performance during the translation of intrinsics.

- Updated the dcopy_blis_impl( ... ) and
  zcopy_blis_impl( ... ) function to support
  multithreaded calls to the respective computational
  kernels, if and when the OpenMP support is enabled.

- Implemented OpenMP parallelization for dcopyv_ and
  zcopyv_ APIs, while scopyv_ and ccopyv_ only support
  single thread.

AMD-Internal: [CPUPL-4854]
Change-Id: I5fbd0bcca4e59001fbe2b1168b624d0c33242b3e
2024-05-01 00:23:01 +05:30
Meghana Vankadari
ceee4b7818 Fix in DGEMMSUP for cases where C matrix is row-major.
Details:
- variable m0 is being loaded into a register without typecasting
  it to uint64_t. This resulted in seg-fault when int size is set
  to be 32 bits during configure time.
- Any variable that is loaded using mov in assembly needs to be
  typecasted to uint64_t before begin_asm, so that change in size
  of integer doesn't affect the functionality.
- Modified all instances using variable m0 to use variable 'm' where
  m = (uint64_t)m0;

AMD-Internal: [CPUPL-4971]
Change-Id: I49b66d2cacf19ace40ab44c9f85904644e8921f4
2024-04-25 13:07:23 -04:00
Nallani Bhaskar
5070343318 Fixed load intrinsic in aocl-gemm addon f32 api
Description:

1. Replaced aligned load intrinsics _mm512_load_ps
   with unaligned load intrinsics _mm512_loadu_ps.
2. There is no guarantee that the memory address
   can be aligned everywhere. The changes are under
   beta multiplication. Copy paste error.

Change-Id: I978231b556e17ad7e66c5028ed1cd904c653e0a8
2024-03-20 06:24:32 -04:00
Meghana Vankadari
da8fd8c301 Implemented JIT-based microkernel for bf16 datatype
Details:
- Added new folder named JIT/ under addon/aocl_gemm/. This folder
  will contain all the JIT related code.
- Modified lpgemm_cntx_init code to generate main and fringe kernels
  for 6x64 bf16 microkernel and store function pointers to all the
  generated kernels in a global function pointer array. This happens
  only when gcc version is < 11.2
- When gcc version < 11.2, microkernel uses JIT-generated kernels.
  otherwise, microkernel uses the intrinsics based implementation.

AMD-Internal: [SWLCSG-2622]
Change-Id: I16256c797b2546a8cd2049680001947346260461
2024-03-13 05:55:18 +05:30
Nallani Bhaskar
799a456abc Fixed corner case issue in aocl_gemm addon
Description

1. when mr0=1 case the accumulator register and operand
   registers for an fma instruction got swapped. Corrected
   the copy paste error.

2. Removed fill array for c_ref in bench_lpgemm.c and used
   memcpy from c buf, because fill array now using rand()
   function to initialize data which can be different
   when c_ref and c called separately, this was working
   because data was fixed (i=0 ... i%5).

Change-Id: Ia513331ba49d28adc7bcdc0ec78d443abe66780b
2024-03-08 04:10:19 -05:00