892 Commits

Author SHA1 Message Date
mkadavil
a26c85333a Int4 B matrix reordering support fixes in LPGEMM.
Reordering B matrix of datatype int4 is done as per the pack schema
requirements of u8s8s32 kernel. However for fringe cases, the
matrix pointer increments need to be halved to account for the half
byte size of int4 elements.

AMD-Internal: [SWLCSG-2390]
Change-Id: I22a04c4c8133db6ae6ca0a4d3e86c11aba1e2cdb
2024-06-26 05:39:45 +05:30
Edward Smyth
8de8dc2961 Merge commit '81e10346' into amd-main
* commit '81e10346':
  Alloc at least 1 elem in pool_t block_ptrs. (#560)
  Fix insufficient pool-growing logic in bli_pool.c. (#559)
  Arm SVE C/ZGEMM Fix FMOV 0 Mistake
  SH Kernel Unused Eigher
  Arm SVE C/ZGEMM Support *beta==0
  Arm SVE Config armsve Use ZGEMM/CGEMM
  Arm SVE: Update Perf. Graph
  Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0
  Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0
  A64FX Config Use ZGEMM/CGEMM
  Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg
  Arm SVE Add SGEMM 2Vx10 Unindexed
  Arm SVE ZGEMM Support Gather Load / Scatt. St.
  Arm SVE Add ZGEMM 2Vx10 Unindexed
  Arm SVE Add ZGEMM 2Vx7 Unindexed
  Arm SVE Add ZGEMM 2Vx8 Unindexed
  Update Travis CI badge
  Armv8 Trash New Bulk Kernels
  Enable testing 1m in `make check`.
  Config ArmSVE Unregister 12xk. Move 12xk to Old
  Revert __has_include(). Distinguish w/ BLIS_FAMILY_**
  Register firestorm into arm64 Metaconfig
  Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo
  Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo
  Add test for Apple M1 (firestorm)
  Firestorm CPUID Dispatcher
  Armv8 GEMMSUP Edge Cases Require Signed Ints
  Make error checking level a thread-local variable.
  Fix data race in testsuite.
  Update .appveyor.yml
  Firestorm Block Size Fixes
  Armv8 Handle *beta == 0 for GEMMSUP ??r Case.
  Move unused ARM SVE kernels to "old" directory.
  Add an option to control whether or not to use @rpath.
  Fix $ORIGIN usage on linux.
  Arm micro-architecture dispatch (#344)
  Use @path-based install name on MacOS and use relocatable RPATH entries for testsuite inaries.
  Armv8 Handle *beta == 0 for GEMMSUP ?rc Case.
  Armv8 Fix 6x8 Row-Maj Ukr
  Apply patch from @xrq-phys.
  Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs.
  bli_error: more cleanup on the error strings array
  Arm SVE Exclude SVE-Intrinsic Kernels for GCC 8-9
  Arm SVE: Correct PACKM Ker Name: Intrinsic Kers
  Fix config_name in bli_arch.c
  Arm Whole GEMMSUP Call Route is Asm/Int Optimized
  Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref
  Header Typo
  Arm: DGEMMSUP ??r(rv) Invoke Edge Size
  Arm: DGEMMSUP ?rc(rd) Invoke Edge Size
  Arm: Implement GEMMSUP Fallback Method
  Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin
  Added Apple Firestorm (A14/M1) Subconfig
  Arm64 8x4 Kernel Use Less Regs
  Armv8-A Supplimentary GEMMSUP Sizes for RD
  Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm
  Armv8-A Adjust Types for PACKM Kernels
  Armv8-A GEMMSUP-RD 6x8m
  Armv8-A GEMMSUP-RD 6x8n
  Armv8-A s/d Packing Kernels Fix Typo
  Armv8-A Introduced s/d Packing Kernels
  Armv8-A DGEMMSUP 6x8m Kernel
  Armv8-A DGEMMSUP Adjustments
  Armv8-A Add More DGEMMSUP
  Armv8-A Add GEMMSUP 4x8n Kernel
  Armv8-A Add Part of GEMMSUP 8x4m Kernel
  Armv8A DGEMM 4x4 Kernel WIP. Slow
  Armv8-A Add 8x4 Kernel WIP

AMD-Internal: [CPUPL-2698]
Change-Id: I194ff69356740bb36ca189fd1bf9fef02eec3803
2024-06-25 05:48:46 -04:00
Edward Smyth
43d36b9f66 AOCL_ENABLE_INSTRUCTIONS improvements 2
Use of AOCL_ENABLE_INSTRUCTIONS in dgemm tiny code path is
unnecessary and incorrectly caused AVX512 code to be run
on zen4 and later processors when AOCL_ENABLE_INSTRUCTIONS=avx2
or equivalent options was selected.

Replace with code to select kernel in a similar way to other
dgemm code paths and other APIs. Note that at present AVX2 code
is used the smallest matrix sizes on all zen platforms.

AMD-Internal: [CPUPL-5078]
Change-Id: Ie6b4895461cbbb915d2b48b92fc063f5cd6adb85
2024-06-25 04:57:38 -04:00
Vignesh Balasubramanian
02da190560 AVX512 optimizations for DNRM2
- Implemented bli_dnorm2fv_unb_var1_avx512( ... ) AVX512
  computational kernel for DNRM2 API.

- Updated the header to include this kernel signature, as well
  as the framework layer to use this function in case of ZEN4
  and ZEN5 configurations.

- Updated the tipping points for ideal thread setting in DNRM2
  for ZEN5 micro-architecture. These thresholds are specific
  to the library's linkage to LLVM's OpenMP or GNU's OpenMp.

- Further abstracted the AOCL-DYNAMIC logic to separate functions
  for ?NRM2 APIs that currently support it(namely, DNRM2 and ZNRM2).

- Further updated the ?NRM2 framework to accommodate the necessary
  changes to invoke the newer AOCL-DYNAMIC functions and the AVX512
  kernel, when needed.

- Added micro-kernel and memory tests for this kernel in GTestsuite,
  to validate accuracy and out-of-bounds read and write.

AMD-Internal: [CPUPL-5265]
Change-Id: I4fc0d0f1e6906bf27d46562ca387c338cc4d2049
2024-06-24 08:50:36 -04:00
mkadavil
a5c4a8c7e0 Int4 B matrix reordering support in LPGEMM.
Support for reordering B matrix of datatype int4 as per the pack schema
requirements of u8s8s32 kernel. Vectorized int4_t -> int8_t conversion
implemented via leveraging the vpmultishiftqb instruction. The reordered
B matrix will then be used in the u8s8s32o<s32|s8> api.

AMD-Internal: [SWLCSG-2390]
Change-Id: I3a8f8aba30cac0c4828a31f1d27fa1b45ea07bba
2024-06-24 07:55:34 -04:00
Meghana Vankadari
c1e063e65c Fix for offset issue while reading constants from JIT code
Details:
- For a variable x, Using address of x in an instruction throws
  exception if the difference between &x and access position is
  larger than 2 GiB. To solve this issue all variables are stored
  within the JIT code section and are accessed using relative addressing.

- Fixed a bug in B matrix pack function for s8s8s32os32 API.
- Fixed a bug in JIT code to apply bias on col-major matrices.

AMD-Internal: [SWLCSG-2820]
Change-Id: I82f117a0422c794cb9b1a4d65a89d60de4adfd96
2024-06-24 07:14:15 -04:00
Vignesh Balasubramanian
6165001658 Bugfix and optimizations for ?AXPBYV API
- Updated the existing code-path for ?AXPBYV to
  reroute the inputs to the appropriate L1 kernel,
  based on the alpha and beta value. This is done
  in order to utilize sensible optimizations with
  regards to the compute and memory operations.

- Updated the typed API interface for ?AXPBYV to include
  an early exit condition(when n is 0, or when alpha is
  0 and beta is 1). Further updated this layer to query
  the right kernel from context, based on the input values
  of alpha and beta.

- Added the necessary L1 vector kernels(i.e, ?SETV, ?ADDV,
  ?SCALV, ?SCAL2V and ?COPYV) to be used as part of special
  case handling in ?AXPBYV.

- Moved the early return with negative increments from ?SCAL2V
  kernels to its typed API interface.

- Updated the zen, zen2 and zen3 context to include function
  pointers for all these vector kernels.

- Updated the existing ?AXPBYV vector kernels to handle only
  the required computation. Additional cleanup was done to
  these kernels.

- Added accuracy and memory tests for AVX2 kernels of ?SETV
  ?COPYV, ?ADDV, ?SCALV, ?SCAL2V, ?AXPYV and ?AXPBYV APIs

- Updated the existing thresholds in ?AXPBYV tests for complex
  types. This is due to the fact that every complex multiplication
  involves two mul ops and one add op. Further added test-cases
  for API level accuracy check, that includes special cases of
  alpha and beta.

- Decomposed the reference call to ?AXPBYV with several other
  L1 BLAS APIs(in case of the reference not supporting its own
  ?AXPBYV API). The decomposition is done to match the exact
  operations that is done in BLIS based on alpha and/or beta
  values. This ensures that we test for our own compliance.

AMD-Internal: [CPUPL-4861]
Change-Id: Ia6d48f12f059f52b31c0bef6c75f47fd364952c6
2024-06-20 16:22:07 +05:30
Arnav Sharma
aa3adb8d69 Updated DOTXF and AXPYF Kernels
- Updated the fused kernels (DOTXF and AXPYF) to properly handle cases
  when b_n > fuse_factor.

- The fused kernels are expected to invoke respective Level-1 kernels
  iteratively when b_n > fuse_factor.

AMD-Internal: [CPUPL-5246]
Change-Id: Ie7a0f4e61ede088663e3491269b3f1398d028095
2024-06-20 04:41:41 -04:00
Mangala V
e9124ffca7 BUGFIX: Updated ZGEMM microkernel to handle alpha = 0 case
BUG:
When alpha real and imaginary is zero
Output is computed as C= Beta * C + A * B instead of C = Beta * C

FIX:
Updated kernel to scale A * B product with alpha in case of alpha=0

Existing framework design:
- When alpha real and imaginary value is zero, framework handles to skip
kernel call to avoid alpha * A * B operation
- SCALM is invoked to perform Beta * C

- Accuracy issue was not observed as alpha=0 was handled in framework
- If we call kernel directly with alpha=0, results would be wrong
- Issue was figured out during microkernel testing using gtestsuite

AMD-Internal: [CPUPL-4454]
Change-Id: Ib6113f5226cd7c26a63781cdd20d35660f453803
2024-06-20 02:58:43 -04:00
Shubham Sharma
1d6dd726cd Fixed Prefetch in Turin DGEMM kernel
- Fixed the prefetch of next micro panel
  of B matrix in 8x24 DGEMM kernel.

Change-Id: Id84bb2841abb86bda780062d67266377fda12038
2024-06-20 10:31:08 +05:30
Meghana Vankadari
c9254bd9e9 Implemented LPGEMV(n=1) for AVX2-INT8 variants
- When n=1, reorder of B matrix is avoided to efficiently
  process data. A dot-product based kernel is implemented to
  perform gemv when n=1.

AMD-Internal: [SWLCSG-2354]
Change-Id: If5f74651ab11232d0b87d34bd05f65aacaea94f1
2024-06-18 12:09:18 +05:30
Shubham Sharma.
580282e655 DGEMM optimizations for Turin Classic
- Introduced new 8x24 row preferred kernel for zen5.
  - Kernel supports row/col/gen
    storage schemes.
  - Prefetch of current panel of A and C
    are enabled.
  - Prefetch of next panel of B is enabled.
  - Kernel supports negative offsets for A and B
    matrices.
- Cache block tuning is done for zen5 core.

AMD-Internal: [CPUPL-5262]
Change-Id: I058ea7e1b751c20c516d7b27a1f27cef96ef730f
2024-06-17 05:18:49 -04:00
Nallani Bhaskar
1b79f35e6d Updated store to avoid warning in gcc-10
Description:

- _mm512_storeu_epi8 and _mm512_storeu_epi16
   intrensic instructions are not available in gcc-10
- Replaced above intrensics _mm512_storeu_si512

Change-Id: I2878780b7acd040ccf45e571d486ff8c2388088c
2024-05-30 22:22:50 +05:30
mkadavil
cd032225ca BF16 bias support for bf16bf16f32ob16.
-As it stands the bf16bf16f32ob16 API expects bias array to be of type
float. However actual use case requires the usage of bias array of bf16
type. The bf16 micro-kernels are updated to work with bf16 bias array by
upscaling it to float type and then using it in the post-ops workflow.
-Corrected register usage in bf16 JIT generator for bf16bf16f32ob16 API
when k > KC.

AMD-Internal: [SWLCSG-2604]
Change-Id: I404e566ff59d1f3730b569eb8bef865cb7a3b4a1
2024-05-23 04:48:20 +05:30
Nallani Bhaskar
29db6eb42b Added transB in all AVX512 based int8 API's
Description:
--Added support for tranB in u8s8s32o<s32|s8> and
  s8s8s32o<s32|s8> API's
--Updated the bench_lpgemm by adding options to
  support transpose of B matrix
--Updated data_gen_script.py in lpgemm bench
  according to latest input format.

AMD-Internal: [SWLCSG-2582]
Change-Id: I4a05cc390ae11440d6ff86da281dbafbeb907048
2024-05-23 03:46:13 +05:30
Edward Smyth
1f60b7c366 Export some BLIS internal symbols 2
Export more symbols for BLIS kernels so that AOCL libFLAME
optimizations can call them directly.

AMD-Internal: [CPUPL-5044]
Change-Id: I45392b8a2a14ac2816141521b90b7ddb1216c733
2024-05-15 06:59:56 -04:00
Mangala V
64d9c96d45 ZGEMMT SUP: AVX512 GEMMT code for Upper variant
1. Enabled AVX512 path for
   -  Upper variant
   -  Different storage schemes for upper and lower variant

2. Modified mask value to handle all fringe cases correctly

AMD_Internal: [CPUPL-5091]

Change-Id: I4bf8aca24c1b87fff606deb05918b8e6216b729e
2024-05-15 13:08:32 +05:30
Shubham Sharma
b4bc71f3ac Bug fix IN DAXPYF MT and Code Cleanup
- Fixed bug in DAXPYF MT kernel when incx != inca.
- Added AOCL Dynamic function for 1f kernels.
- Moved all DOTXF and AXPYF kernels into one file.

AMD-Internal: [CPUPL-4880]
Change-Id: I7d9f44625bc42fad4a9e5b218ecc382efdf22cbe
2024-05-14 06:44:10 -04:00
Shubham Sharma
f4b06547fd Enabled DGEMMT SUP optimized code for upper variant
- Enabled DGEMMT SUP upper kernels in AVX512 code path.
- Enabled use of optimized kernels for all the storages
  supported by optimized kernels.

AMD-Internal: [CPUPL-4881]
Change-Id: Id4486610dacaabc405fbc35b2588607c6508705e
2024-05-14 05:23:51 -04:00
Meghana Vankadari
3a8b9270e7 Implemented lpgemv for AVX512-INT8 variants
- Implemented optimized lpgemv for both m == 1 and n == 1 cases.
- Fixed few bugs in LPGEMV for bf16 and f32 datatypes.
- Fixed few bugs in JIT-based implementation of LPGEMM for BF16
  datatype.

AMD-Internal: [SWLCSG-2354]
Change-Id: I245fd97c8f160b148656f782d241f86097a0cf38
2024-05-14 01:55:49 +05:30
Hari Govind S
61d0f3b873 Additional optimisations on COPYV API
-  Reduced number of jump operations in AVX512
   assembly kernel for SCOPYV, DCOPYV and ZCOPYV.

-  Fixed memory test failure for bli_zcopyv_zen_int_avx512
   kernel.

-  Replaced existing AVX2 COPYV intrinsic kernels in
   bli_cntx_init_zen5.c with AVX512 assembly kernels.

Change-Id: Idc11601b526d6d82cfbdf63af2fd331918b31159
2024-05-10 07:22:04 -04:00
Hari Govind S
92847ae912 Gtestsuite: Memory testing for SCOPYV, DCOPYV and ZCOPYV APIs
-  Utilized the memory testing feature in GTestsuite
   to update the testing interfaces for micro-kernel
   testing of SCOPY, DCOPY and ZCOPY APIs.

Change-Id: I3d6905f33b000b8d5e60727aa896bd869f4f441f
2024-05-09 12:10:17 -04:00
Shubham Sharma
f36468a9e9 Enabled vectorized division code in ZTRSM
- Existing vectorizes code was disabled because
  of the failures observed in matlab tests.
- The issue is caused by underflow during division when diagonal
   elements of A matrix are very small.
- When diagonal is very small (4E-324 in case of matlab), sqauring the
   diagonal during divison causes the square to be rounded off to zero.
- Fix is to normalise (ar) and (ai) by dividing (ar) and (ai) by
  max(ar, ai), this will make either (ar) or (ai) 1, and hence
  reduce the likelihood of underflow.

AMD-Internal: [CPUPL-5052]
Change-Id: Iff7893fdcb92907a12e6af8e102a92637a13ce4f
2024-05-09 01:35:39 -04:00
Edward Smyth
62c886feee Export some BLIS internal symbols
AOCL libFLAME optimizations directly call some internal
BLIS symbols. Export them to enable this to work with
the BLIS shared library.

AMD-Internal: [CPUPL-5044]
Change-Id: Icb62dcb51e12d72dde8434593ab17de3c227c93d
2024-05-08 12:51:32 -04:00
Arnav Sharma
cb27fad49c ZSCALV AVX512 Kernel
- Implemented ZSCALV kernel utilizing AVX512 intrinsics.

- Gtestsuite: Added ukr tests for the new kernel.

AMD-Internal: [CPUPL-5012]
Change-Id: I75c7f4448ddd60b0f9afa53936eed37f5f99eeb2
2024-05-08 11:55:13 -04:00
Arnav Sharma
1dbeee4d19 ZDOTV AVX512 Kernel with MT Support
- Added AVX512 kernel for ZDOTV.

- Multithreaded both ZDOTC and ZDOTU with AOCL_DYNAMIC support.

AMD-Internal: [CPUPL-5011]
Change-Id: I56df9c07ab3b8df06267a99835b088dcada81bd8
2024-05-08 04:54:05 -04:00
Mangala V
e6cc2a3e22 ZGEMMT SUP Optimizations for AVX512
Existing Design:
 - GEMM AVX2 kernel performs computation and updates temporary C buffer
 - Portion of temporary C buffer is copied to output C buffer
   based on UPLO parameter
 - For diagonal blocks, using GEMM kernels is not efficient

New Design: Implemented in current patch when UPLO='L'
 - GEMMT kernel used for computation, temporary buffer is not required.
 - Only required elements are computed using mask load store for all
   fringe cases
 - Exception: AVX2 code path is used when storage format is RRC, CRR, CRC

- AOCL-Dynamic is added based on dimension
- Check for AVX platform is added in SUP interface, It returns to
  native implementation if hardware doesnot support AVX platform
- SUP ref_var2m is expanded for dcomplex datatype to avoid condition
  check which exists for double datatype

AMD_Internal: [CPUPL-5006]

Change-Id: I3e21404b732b8f2df9cbdba394303752fdf36286
2024-05-07 23:00:29 +05:30
Meghana Vankadari
1072770c63 Implemented LPGEMV for bf16 datatype
1. The 5 LOOP LPGEMM path is in-efficient when A or B is a vector
   (i.e, m == 1 or n == 1).

2. An efficient implementation is developed considering the b matrix
   reorder in case of m=1 and post-ops fusion.

3. When m = 1 the algorithm divide the GEMM workload in n dimension
   intelligently at a granularity of NR. Each thread work on A:1xk
   B:kx(>=NR) and produce C=1x(>NR).  K is unrolled by 4 along with
   remainder loop.

4. When n = 1 the algorithm divide the GEMM workload in m dimension
   intelligently at a granularity of MR. Each thread work on A:(>=MR)xk
   B:kx1 and produce C = (>=MR)x1. When n=1 reordering of B is avoided
   to efficiently process in n one kernel.

AMD-Internal: [SWLCSG-2355]
Change-Id: I7497dad4c293587cbc171a5998b9f2817a4db880
2024-05-06 23:55:15 +05:30
Shubham Sharma
be34169001 Fixed Matlab Failure in ZTRSM
- In AVX512 ZTRSM kernel, vertorizes division code
  is causing failures in matlab.
- The logic is identical in reference C code and intrinsics code,
  but intrinsics code is causing failure
- Replaced optimized intrinsics code with C code.

AMD-Internal: [CPUPL-5052]
Change-Id: Iea184330b22c46d979867b870486066ef980eb84
2024-05-06 06:56:45 -04:00
mkadavil
118e955a22 SWISH post-op support for all LPGEMM APIs.
SWISH post-op computes swish(x) = x / (1 + exp(-1 * alpha * x)).
SiLU = SWISH with alpha = 1.

AMD-Internal: [SWLCSG-2387]
Change-Id: I55f50c74a8583a515f7ea58fa0878ccbcdd6cc26
2024-05-06 06:05:11 -04:00
vignbala
f8218bb9f2 Compiler warnings when using masked loads
- Updated the AVX512 DOTXF kernels to use MASKZ loads
  instead of MASK loads when loading X vector in fringe
  case. This avoids compiler warnings of uninitialized
  vector as input to the intrinsic.

- The functionality will not change when using either MASK
  or MASKZ loads on X, since A matrix is loaded using MASKZ
  loads.

AMD-Internal: [CPUPL-4974]
Change-Id: I1ef98a1292352d0e905cc09cd5667acd883df827
2024-05-03 09:53:36 -04:00
Shubham Sharma
b70347d0d4 DGEMMT SUP Optimizations for AVX512
- In DGEMMT SUP AVX2 code path, traingular kernels
  are added in order to avoid temporary C buffer.
- Since these kernels did not exist for AVX512,
  AVX2 kernels were being used in GEMMT.
- AVX512 triangular GEMM kernel has been added
  to make sure that AVX512 kernels can be used without
  creating a temporary buffer.
- This kernel is added only for Lower variant of GEMMT,
   for upper variant of DGEMMT, temporary C buffer is
   created, full GEMM kernel is called on temporary C and
   traingular region from temporary C is copied to C
   buffer.

AMD-Internal: [CPUPL-4881]
Change-Id: Id70645f79ae078ab9a7006e83d328505f1fae8a9
2024-05-03 05:11:11 -04:00
Shubham Sharma
b9e21e8701 Added ZTRSM AVX512 small code path
- Kernel dimensions are 4x4.
  - Two kernels are implemented, Right Upper and
    Right lower.
  - In case of Left variants of TRSM, transpose is
    induced so that Right variant kernels can be used.
  - No packing is performed in these kernels.
  - Changes are made in the threshold to pick ZTRSM small
    code path.
  - BLIS_INLINE is removed from signature of
    "TRSMSMALL_KER_PROT".
  - These kernels do not support "ENABLE_TRSM_PREINVERSION".
  - Newly added kernels do not support conjugate
    transpose.
  - Added multithreading to ZTRSM small code path.

AMD-Internal: [CPUPL-4324]
Change-Id: I683b1d5239593e54f433e7f27497d72dfbd9141c
2024-05-03 05:10:41 -04:00
Shubham Sharma
1d983e6124 Added AVX512 kernels for DAXPYF and DDOTXF
- Added DAXPYF and DDOTXF AVX512 kernels.
- Fuse factor for ddotxf kernel is 8.
- 2 DAXPYF kernels are added, with fuse
  factor 8 and 32.
- Multithreading is also added to the DAXPYf
  kernel with fuse factor 32.
- These kernels are internally used by TRSM.
- Added changes in TRSV to call these kernels
  in ZEN4

AMD-Internal: [CPUPL-4880]
Change-Id: I12850de974b437bbca07677b68bc3d6a35858770
2024-05-03 05:10:22 -04:00
Vignesh Balasubramanian
4e2966f9b0 AVX512 optimizations for ZGEMV API with transpose case
- Implemented AVX512 kernels for handling the calls to ZGEMV
  with transpose to A matrix.

- This includes the set of ZDOTXF and ZDOTXV kernels. ZDOTXF
  kernels include those with fuse-factor 8 (main kernel), 4
  and 2(fringe kernels).

- Updated the bli_zgemv_unf_var1( ... ) function to update
  the function pointers to these kernels, based on the
  configuration.

AMD-Internal: [CPUPL-4974]
Change-Id: I313ae0abe9dc119de849da42f9825b71f11b1fda
2024-05-03 04:38:52 -04:00
Vignesh Balasubramanian
53cb83d0cc AVX512 optimizations for ZGEMV API with no-transpose case
- Implemented AVX512 kernels for handling the calls to ZGEMV
  with no-transpose to A matrix.

- This includes the ZAXPYF, ZAXPYV and ZSETV kernels.
  The set of ZAXPYF kernels include those with fuse-factor 8
  (main kernel), 4 and 2(fringe kernels).

- Updated the bli_zgemv_unf_var2( ... ) function to set
  the function pointers to these kernels, based on the
  configuration. Further added the call to ZSETV at this
  layer in case beta is 0.

AMD-Internal: [CPUPL-4974]
Change-Id: Iee4b724719e49023138bb16479765be44d677cd9
2024-05-03 07:04:47 +00:00
Hari Govind
9c26de1a18 Optimisiation COPYV APIs
- Implemented AVX512 kernels for scopyv_, dcopyv_ and  zcopyv_
  using respective AVX512 intrinsics including masked
  load and store operations.

- Implemented AVX512 kernels for scopy_, dcopy_ and
  zcopy_ using assembly language to prevent loss of
  performance during the translation of intrinsics.

- Updated the dcopy_blis_impl( ... ) and
  zcopy_blis_impl( ... ) function to support
  multithreaded calls to the respective computational
  kernels, if and when the OpenMP support is enabled.

- Implemented OpenMP parallelization for dcopyv_ and
  zcopyv_ APIs, while scopyv_ and ccopyv_ only support
  single thread.

AMD-Internal: [CPUPL-4854]
Change-Id: I5fbd0bcca4e59001fbe2b1168b624d0c33242b3e
2024-05-01 00:23:01 +05:30
Meghana Vankadari
ceee4b7818 Fix in DGEMMSUP for cases where C matrix is row-major.
Details:
- variable m0 is being loaded into a register without typecasting
  it to uint64_t. This resulted in seg-fault when int size is set
  to be 32 bits during configure time.
- Any variable that is loaded using mov in assembly needs to be
  typecasted to uint64_t before begin_asm, so that change in size
  of integer doesn't affect the functionality.
- Modified all instances using variable m0 to use variable 'm' where
  m = (uint64_t)m0;

AMD-Internal: [CPUPL-4971]
Change-Id: I49b66d2cacf19ace40ab44c9f85904644e8921f4
2024-04-25 13:07:23 -04:00
Shubham Sharma
14bab0eb17 Fixed out of bounds read in CTRSM small kernel
- In 2x1 fringe case in [RUN/RLT] kernel, 3 scomplex
  precision numbers are being read instead of 1 scomplex.

- Fixed the code to read only one scomplex.

AMD-Internal: [CPUPL-4403]
Change-Id: If3ac03ed864618382d3a382a8cdff7ff8a94eb7d
2024-04-16 02:42:34 -04:00
Edward Smyth
2450a1813b BLIS: Implement zen5 sub-configuration
Implement full support for zen5 as a separate BLIS sub-configuration
and code path within amdzen configuration family.

AMD-Internal: [CPUPL-3518]
Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09
2024-04-12 07:26:31 -04:00
Nallani Bhaskar
5070343318 Fixed load intrinsic in aocl-gemm addon f32 api
Description:

1. Replaced aligned load intrinsics _mm512_load_ps
   with unaligned load intrinsics _mm512_loadu_ps.
2. There is no guarantee that the memory address
   can be aligned everywhere. The changes are under
   beta multiplication. Copy paste error.

Change-Id: I978231b556e17ad7e66c5028ed1cd904c653e0a8
2024-03-20 06:24:32 -04:00
Eleni Vlachopoulou
020b9ff7f0 CMake: Enable builds for both static and shared builds for Linux.
- Added BUILD_STATIC_LIBS option which is on by default, only on Linux.
- Added TEST_WITH_SHARED option which is off by default, only on Linux.
- If only shared or static lib is being built, that's the one that will be used for testing.
- If both are being built, TEST_WITH_SHARED determins which library wil be used for testing.
- Set linux workflows so that they build both static and shared libs, and use linux-static and linux-shared to denote which one should be used for testing.
- Set -fPIC for both static and shared builds to fix issues faced when building blis using AOCC 4.0.0 and gtestsuite using gcc 9.4.0.

AMD-Internal: [CPUPL-2748]
Change-Id: I4227bab97ff31ecddfe218e18499f33b4e4ee63e
2024-03-14 10:32:51 -04:00
Meghana Vankadari
da8fd8c301 Implemented JIT-based microkernel for bf16 datatype
Details:
- Added new folder named JIT/ under addon/aocl_gemm/. This folder
  will contain all the JIT related code.
- Modified lpgemm_cntx_init code to generate main and fringe kernels
  for 6x64 bf16 microkernel and store function pointers to all the
  generated kernels in a global function pointer array. This happens
  only when gcc version is < 11.2
- When gcc version < 11.2, microkernel uses JIT-generated kernels.
  otherwise, microkernel uses the intrinsics based implementation.

AMD-Internal: [SWLCSG-2622]
Change-Id: I16256c797b2546a8cd2049680001947346260461
2024-03-13 05:55:18 +05:30
Nallani Bhaskar
799a456abc Fixed corner case issue in aocl_gemm addon
Description

1. when mr0=1 case the accumulator register and operand
   registers for an fma instruction got swapped. Corrected
   the copy paste error.

2. Removed fill array for c_ref in bench_lpgemm.c and used
   memcpy from c buf, because fill array now using rand()
   function to initialize data which can be different
   when c_ref and c called separately, this was working
   because data was fixed (i=0 ... i%5).

Change-Id: Ia513331ba49d28adc7bcdc0ec78d443abe66780b
2024-03-08 04:10:19 -05:00
Bhaskar Nallani
2ce47e6f5e Implemented optimal AVX512-variant of f32 LPGEMV
1. The 5 LOOP LPGEMM path is in-efficient when A or B is a vector
   (i.e, m == 1 or n == 1).

2. An efficient implementation of lpgemv_rowvar_f32 is developed
   considering the b matrix reorder in case of m=1 and post-ops fusion.

3. When m = 1 the algorithm divide the GEMM workload in n dimension
   intelligently at a granularity of NR. Each thread work on A:1xk
   B:kx(>=NR) and produce C=1x(>NR).  K is unrolled by 4 along with
   remainder loop.

4. When n = 1 the algorithm divide the GEMM workload in m dimension
   intelligently at a granularity of MR. Each thread work on A:(>=MR)xk
   B:kx1 and produce C = (>=MR)x1. When n=1 reordering of B is avoided
   to efficiently process in n one kernel.

5. Fixed few warnings while loading 2 f32 bias elements using
   _mm_load_sd using float pointer. Typecasted to (const double *)

AMD-Internal: [SWLCSG-2391, SWLCSG-2353]
Change-Id: If1d0b8d59e0278f5f16b499de1d629e63da5b599
2024-03-04 23:53:23 +05:30
mkadavil
d00e84ced3 Matrix Add post-operation support for float(bf16|f32) LPGEMM APIs.
-This post-operation computes C = (beta*C + alpha*A*B) + D, where D is
a matrix with dimensions and data type the same as that of C matrix.

AMD-Internal: [SWLCSG-2424]
Change-Id: I9464d1f514e3b04275fe93441489b4503a08937a
2024-02-23 02:02:33 -05:00
mkadavil
01b7f8c945 Matrix Add post-operation support for integer(s16|s32) LPGEMM APIs.
-This post-operation computes C = (beta*C + alpha*A*B) + D, where D is
a matrix with dimensions and data type the same as that of C matrix.
-For clang compilers (including aocc), -march=znver1 is not enabled for
zen kernels. Have updated CKVECFLAGS to capture the same.

AMD-Internal: [SWLCSG-2424]
Change-Id: Ie369f7ea5c80ab69eea3f3e03a8d9546e14f5c09
2024-02-12 23:51:36 +05:30
Shubham Sharma
d5cd5836b1 Fixed DGEMM 8x24 kernel for beta zero
- Column stride is not taken into consideration in
  current implementation when writing to C buffer
  if beta is zero and C is column major stored.

- Fixed C storage in case of column major stored C
  when beta is zero in 8x24 DGEMM kernel.

AMD-Internal: [CPUPL-4404]
Change-Id: I5b8dfce962995e3238cf902b5a09dd1bf90002a8
2024-02-05 06:57:06 -05:00
Shubham Sharma
fc91932b4a Fixed out of bounds read in DTRSM small kernels
- In 3x1 fringe case in [RLN/RUT] kernel, 4 double
  precision floats are being read instead of 3 doubles.

- Fixed the code to read only 3 double.

AMD-Internal: [CPUPL-4403]
Change-Id: If0afb155efefabe13487cf322d479981f1838aa2
2024-02-02 10:31:12 +05:30
eashdash
ef134dc49f Added Trans A feature for all INT8 LPGEMM APIs
1. Added Trans A feature to handle column major inputs
   for A matrix.
2. Trans A is enabled by on-the-go pack of A matrix.
3. The on-the-go pack of A converts a column storage
   MCxKC block of A into row storage MCxKC block as
   LPGEMM kernels are row major kernels.
4. New pack routines are added for conversion of A matrix
   from column major storage to row major storage.
5. LPGEMM Cntx is updated with pack kernel function
   pointers.
6. Packing of A matrix:
   -  Converts column major input A to row major
      in blocks of MCxKC with newly added pack A
      functions when cs_a > 1.
7. Pack routines are added for AVX512 and AVX2
   INT8 LPGEMM APIs.
8. Trans A feature is now supported in:
   1. u8s8s32os32/os8
   2. u8s8s16os16/os8/ou8
   3. s8s8s32os32/os8
   4. s8s8s16os16/os8

AMD-Internal: SWLCSG-2582
Change-Id: I7ce331545525a9a09f3853280615b55fcf2edabf
2024-01-30 03:40:56 -05:00