
3328 Commits

Author SHA1 Message Date
Meghana Vankadari
c1e063e65c Fix for offset issue while reading constants from JIT code
Details:
- For a variable x, using the address of x in an instruction throws
  an exception if the difference between &x and the access position is
  larger than 2 GiB. To solve this issue, all variables are stored
  within the JIT code section and are accessed using relative addressing.

- Fixed a bug in B matrix pack function for s8s8s32os32 API.
- Fixed a bug in JIT code to apply bias on col-major matrices.

AMD-Internal: [SWLCSG-2820]
Change-Id: I82f117a0422c794cb9b1a4d65a89d60de4adfd96
2024-06-24 07:14:15 -04:00
Vignesh Balasubramanian
6165001658 Bugfix and optimizations for ?AXPBYV API
- Updated the existing code-path for ?AXPBYV to
  reroute the inputs to the appropriate L1 kernel,
  based on the alpha and beta value. This is done
  in order to utilize sensible optimizations with
  regards to the compute and memory operations.

- Updated the typed API interface for ?AXPBYV to include
  an early exit condition(when n is 0, or when alpha is
  0 and beta is 1). Further updated this layer to query
  the right kernel from context, based on the input values
  of alpha and beta.

- Added the necessary L1 vector kernels(i.e, ?SETV, ?ADDV,
  ?SCALV, ?SCAL2V and ?COPYV) to be used as part of special
  case handling in ?AXPBYV.

- Moved the early return with negative increments from ?SCAL2V
  kernels to its typed API interface.

- Updated the zen, zen2 and zen3 context to include function
  pointers for all these vector kernels.

- Updated the existing ?AXPBYV vector kernels to handle only
  the required computation. Additional cleanup was done to
  these kernels.

- Added accuracy and memory tests for AVX2 kernels of ?SETV,
  ?COPYV, ?ADDV, ?SCALV, ?SCAL2V, ?AXPYV and ?AXPBYV APIs.

- Updated the existing thresholds in ?AXPBYV tests for complex
  types. This is due to the fact that every complex multiplication
  involves two mul ops and one add op. Further added test-cases
  for API level accuracy check, that includes special cases of
  alpha and beta.

- Decomposed the reference call to ?AXPBYV into several other
  L1 BLAS APIs (in case the reference does not support its own
  ?AXPBYV API). The decomposition is done to match the exact
  operations that are done in BLIS based on alpha and/or beta
  values. This ensures that we test for our own compliance.

AMD-Internal: [CPUPL-4861]
Change-Id: Ia6d48f12f059f52b31c0bef6c75f47fd364952c6
2024-06-20 16:22:07 +05:30
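The special-case routing described in the ?AXPBYV commit above can be illustrated with a minimal Python sketch. It assumes the operation y := beta*y + alpha*x on plain lists with unit stride. The mapping of (alpha, beta) pairs to kernels is inferred from the kernel names listed in the commit message and is not taken from the BLIS sources; the function name `axpbyv_dispatch` and the returned labels are hypothetical.

```python
def axpbyv_dispatch(alpha, x, beta, y):
    """Compute y := beta*y + alpha*x and report which simpler L1
    operation the (alpha, beta) pair reduces to.

    Returns (kernel_label, new_y). The label-to-case mapping is an
    assumption made for illustration only.
    """
    n = len(x)
    # Early exit: nothing to do when n is 0, or when alpha is 0 and beta is 1.
    if n == 0 or (alpha == 0 and beta == 1):
        return "early-exit", list(y)
    if alpha == 0:
        if beta == 0:
            return "setv", [0.0] * n                 # y := 0
        return "scalv", [beta * yi for yi in y]      # y := beta*y
    if beta == 0:
        if alpha == 1:
            return "copyv", list(x)                  # y := x
        return "scal2v", [alpha * xi for xi in x]    # y := alpha*x
    if beta == 1:
        if alpha == 1:
            return "addv", [yi + xi for xi, yi in zip(x, y)]       # y := y + x
        return "axpyv", [yi + alpha * xi for xi, yi in zip(x, y)]  # y := y + alpha*x
    # General case: both scalars are needed.
    return "axpbyv", [beta * yi + alpha * xi for xi, yi in zip(x, y)]
```

Each special case avoids at least one multiplication or one memory read per element compared with the general formula, which is the kind of saving the commit message refers to.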
Arnav Sharma
aa3adb8d69 Updated DOTXF and AXPYF Kernels
- Updated the fused kernels (DOTXF and AXPYF) to properly handle cases
  when b_n > fuse_factor.

- The fused kernels are expected to invoke respective Level-1 kernels
  iteratively when b_n > fuse_factor.

AMD-Internal: [CPUPL-5246]
Change-Id: Ie7a0f4e61ede088663e3491269b3f1398d028095
2024-06-20 04:41:41 -04:00
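As a rough illustration of the fallback described in the DOTXF/AXPYF commit above, the sketch below shows an AXPYF-style operation (y := y + alpha * A * x) that uses a fused loop only when the number of columns b_n equals the fuse factor, and otherwise calls a Level-1 AXPYV once per column. The fuse factor of 4, the column-list layout and all function names are assumptions for illustration, not the BLIS kernel interface.

```python
FUSE_FACTOR = 4  # hypothetical fuse factor


def axpyv(alpha, x, y):
    """Level-1 kernel: y := y + alpha * x, updating y in place."""
    for i in range(len(y)):
        y[i] += alpha * x[i]


def axpyf(alpha, a_cols, x, y, fuse_factor=FUSE_FACTOR):
    """y := y + alpha * A * x, where A is given as a list of b_n columns."""
    b_n = len(a_cols)
    if b_n != fuse_factor:
        # b_n does not match the fuse factor (for example b_n > fuse_factor):
        # invoke the Level-1 kernel iteratively, one column at a time.
        for j in range(b_n):
            axpyv(alpha * x[j], a_cols[j], y)
        return
    # Fused path: one pass over y that consumes all fuse_factor columns.
    for i in range(len(y)):
        acc = 0.0
        for j in range(fuse_factor):
            acc += a_cols[j][i] * x[j]
        y[i] += alpha * acc
```

Both paths compute the same result; the fused path simply reads and writes each element of y once instead of b_n times.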
Mangala V
e9124ffca7 BUGFIX: Updated ZGEMM microkernel to handle alpha = 0 case
BUG:
When both the real and imaginary parts of alpha are zero, the
output is computed as C = Beta * C + A * B instead of C = Beta * C.

FIX:
Updated the kernel to scale the A * B product by alpha when alpha = 0.

Existing framework design:
- When both the real and imaginary parts of alpha are zero, the framework
  skips the kernel call to avoid the alpha * A * B operation.
- SCALM is invoked to perform Beta * C.

- No accuracy issue was observed because alpha = 0 was handled in the framework.
- If the kernel is called directly with alpha = 0, the results would be wrong.
- The issue was found during microkernel testing using gtestsuite.

AMD-Internal: [CPUPL-4454]
Change-Id: Ib6113f5226cd7c26a63781cdd20d35660f453803
2024-06-20 02:58:43 -04:00
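The intended behaviour from the ZGEMM commit above can be written as a small reference computation. This is a sketch on nested Python lists of complex numbers, not the microkernel itself; the name `zgemm_ref` is hypothetical. The point is that the A * B product is always scaled by alpha, so alpha = 0 leaves only Beta * C.

```python
def zgemm_ref(alpha, a, b, beta, c):
    """Return beta*C + alpha*(A*B) for small complex matrices (lists of rows)."""
    m, k, n = len(a), len(b), len(b[0])
    out = []
    for i in range(m):
        row = []
        for j in range(n):
            ab = sum(a[i][p] * b[p][j] for p in range(k))
            # Scaling by alpha here is what makes alpha == 0 give beta*C.
            row.append(beta * c[i][j] + alpha * ab)
        out.append(row)
    return out
```

Dropping the `alpha *` factor reproduces the reported bug: the result becomes Beta * C + A * B even when alpha is zero.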
Mangala V
90fe795c46 Gtestsuite: Enabled memory test for ZGEMM for k=0
AMD_Internal: [CPUPL-4657]

Change-Id: Ic5f4d24184f05e0f57634845b4fb3312b3a416f6
2024-06-20 02:51:47 -04:00
Shubham Sharma
1d6dd726cd Fixed Prefetch in Turin DGEMM kernel
- Fixed the prefetch of next micro panel
  of B matrix in 8x24 DGEMM kernel.

Change-Id: Id84bb2841abb86bda780062d67266377fda12038
2024-06-20 10:31:08 +05:30
Moripalli Chitra
cb915c241d Tuning ddotv API
- Modifying threading framework for L1 APIs to update only number of threads from runtime env and avoid overhead of reading other ICVs.
- Removing bli_arch_set_id_once() from bli_arch_set_id_once() flow as bli_arch_check_id_once() calls it.

AMD-Internal: [CPUPL-4877]
Change-Id: I87b346825a96d74e746a41530b6d22ae162f19ba
2024-06-18 19:31:17 +05:30
Meghana Vankadari
c9254bd9e9 Implemented LPGEMV(n=1) for AVX2-INT8 variants
- When n=1, reorder of B matrix is avoided to efficiently
  process data. A dot-product based kernel is implemented to
  perform gemv when n=1.

AMD-Internal: [SWLCSG-2354]
Change-Id: If5f74651ab11232d0b87d34bd05f65aacaea94f1
2024-06-18 12:09:18 +05:30
Edward Smyth
ca7ba707e7 AOCL_ENABLE_INSTRUCTIONS improvements
Changes to how AOCL_ENABLE_INSTRUCTIONS handles requests
for different ISAs (i.e. BLIS sub-configurations):
- Add missing SSE and AVX options. These will all choose the
  generic option in amdzen builds.
- For unsupported ISAs (e.g. AVX512 on Milan), select the
  hardware's default sub-configuration instead of trying
  to step down through alternative choices.
- For invalid options, or options not implemented in the BLIS
  build (e.g. skx in amdzen build), select the hardware's
  default sub-configuration instead of aborting.

Currently BLIS_ARCH_TYPE behaviour is not affected by these
changes.

AMD-Internal: [CPUPL-5078]
Change-Id: Idbd00d2806b1679889a9249878c51981c8d23b3f
2024-06-17 09:58:30 -04:00
Shubham Sharma.
580282e655 DGEMM optimizations for Turin Classic
- Introduced new 8x24 row preferred kernel for zen5.
  - Kernel supports row/col/gen
    storage schemes.
  - Prefetch of current panel of A and C
    are enabled.
  - Prefetch of next panel of B is enabled.
  - Kernel supports negative offsets for A and B
    matrices.
- Cache block tuning is done for zen5 core.

AMD-Internal: [CPUPL-5262]
Change-Id: I058ea7e1b751c20c516d7b27a1f27cef96ef730f
2024-06-17 05:18:49 -04:00
Arnav Sharma
38f65c28fc Correction for DSCALV AOCL_DYNAMIC Thresholds
- Updated nt_ideal to 12 for n_elem <= 2500000 and nt_ideal to 16 for
  n_elem <= 4000000

AMD-Internal: [CPUPL-4408]
Change-Id: I97c143ab0d9b97e797358af93181c71d948757cc
2024-06-14 11:51:47 +05:30
Chandrashekara K R
fa75ce725e CMake: Added logic to link openmp library given through OpenMP_libomp_LIBRARY cmake variable on linux.
Enabled a command line option to link the libiomp5.so, libomp.so or libgomp.so libraries using cmake.
E.g. -DOpenMP_libomp_LIBRARY=<path to openmp library including library name>.
If the above variable is not set, the default openmp library is libomp.so for clang and libgomp.so for gcc.

Change-Id: I5bffa10ff8351f5d10f0d543cbdf55aa16c84c90
2024-06-10 04:41:23 -04:00
Arnav Sharma
91bdf9a3eb Gtestsuite: Bugfix for DOTXF, Changes to AXPYF
- Fixed bug in ddotxf generic tests where the parameters lda_inc and
  inca were being read incorrectly.

- Fixed bug in dotxf test wherein the y vector was being generated with
  length m instead of b.

- Corrected function signatures to use type gtint_t instead of gint_t.

- Updated the tests to use conjugate values of type char and convert to
  conj_t type only while invoking BLIS tests for both DOTXF and AXPYF.

AMD-Internal: [CPUPL-5117]
Change-Id: I0ef7af429057583a1cbf34827802e72401181caf
2024-06-07 15:05:10 +05:30
Edward Smyth
7829a7cf85 GTestSuite: test name consistency changes 6
Improve consistency in test names across different APIs:
- Improve consistency of TEST_P part of test names.
- Rename *_evt_testing.cpp and nrm2_extreme.cpp files to
  *_evt.cpp to match other APIs.
- Standardize naming of IIT_ERS files.

Also:
- Restore trsv IIT_ERS file which was misnamed in commit
  a2beef3255
- Tidy ukr gemm tests to be more consistent with each other
  and move threshold setting to individual TEST_P functions
  to allow different adjustments to be made.
- Similarly make trsm tests more consistent.
- Tidy naming of is_memory_test variable.

AMD-Internal: [CPUPL-4500]
Change-Id: I0af1fc9973b02187b19a7c2488eed1b829cfdc2f
2024-06-05 11:26:16 -04:00
Nallani Bhaskar
1b79f35e6d Updated store to avoid warning in gcc-10
Description:

- _mm512_storeu_epi8 and _mm512_storeu_epi16
   intrinsics are not available in gcc-10
- Replaced the above intrinsics with _mm512_storeu_si512

Change-Id: I2878780b7acd040ccf45e571d486ff8c2388088c
2024-05-30 22:22:50 +05:30
mkadavil
cd032225ca BF16 bias support for bf16bf16f32ob16.
- As it stands, the bf16bf16f32ob16 API expects the bias array to be of
type float. However, the actual use case requires a bias array of bf16
type. The bf16 micro-kernels are updated to work with a bf16 bias array by
upscaling it to float type and then using it in the post-ops workflow.
- Corrected register usage in bf16 JIT generator for bf16bf16f32ob16 API
when k > KC.

AMD-Internal: [SWLCSG-2604]
Change-Id: I404e566ff59d1f3730b569eb8bef865cb7a3b4a1
2024-05-23 04:48:20 +05:30
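The "upscaling" of a bf16 bias to float mentioned in the commit above relies on bf16 being the upper 16 bits of an IEEE-754 float32. A minimal Python sketch of that widening follows; it works on raw 16-bit patterns, and the helper names are hypothetical rather than part of the LPGEMM API.

```python
import struct


def bf16_bits_to_float(bits):
    """Widen a bf16 value, given as its 16-bit pattern, to a float32 value."""
    # Shifting left by 16 places the bf16 bits in the top half of a float32
    # and zero-fills the low mantissa bits.
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def add_bf16_bias(acc_row, bias_bits):
    """Add a bf16 bias vector to a row of float accumulators."""
    return [acc + bf16_bits_to_float(b) for acc, b in zip(acc_row, bias_bits)]
```

For example, the bit pattern 0x3F80 widens to 1.0 and 0xC000 widens to -2.0, so the widening is exact and needs no rounding.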
Nallani Bhaskar
29db6eb42b Added transB in all AVX512 based int8 API's
Description:
--Added support for transB in u8s8s32o<s32|s8> and
  s8s8s32o<s32|s8> API's
--Updated the bench_lpgemm by adding options to
  support transpose of B matrix
--Updated data_gen_script.py in lpgemm bench
  according to latest input format.

AMD-Internal: [SWLCSG-2582]
Change-Id: I4a05cc390ae11440d6ff86da281dbafbeb907048
2024-05-23 03:46:13 +05:30
Edward Smyth
d9c269786a GTestSuite: bli_zdscalv isn't created by BLIS
BLIS includes the BLAS and CBLAS interfaces for zdscal
but not the BLIS typed interface bli_zdscalv. Thus, when
TEST_INTERFACE=BLIS_TYPED is defined, disable tests
for zdscal.

AMD-Internal: [CPUPL-4671]
Change-Id: I397454c83e272f9e775e37e00533002576041a93
2024-05-21 15:24:02 -04:00
Vignesh Balasubramanian
947811a429 Bugfix for ?OMATCOPY2 and ?IMATCOPY APIs
- Updated the parameter check for leading dimensions in the
  functions handling transpose case of matrix A.

- Updated the logic to perform ?IMATCOPY operation. The new
  logic uses an auxiliary buffer to copy and scale in place,
  if and when needed. This is done in order to avoid overwriting
  any subsequent reads that might follow(specifically in case
  of having different leading dimensions for reading and writing).

- Updated xerbla_() to throw memory allocation failure based on
  INFO parameter being -10. This value is specific to its use-case
  in ?IMATCOPY, where it is set to -10.

- Updated the Extreme Value Tests(EVT) logger for ?IMATCOPY
  for uniformity.

- Cleaned up the files to follow coding conventions.

AMD-Internal: [CPUPL-4862][SWLCSG-2706]
Change-Id: I34dfa2bcb66b821315e11f7ab2139c41a79ef780
2024-05-21 11:13:28 +05:30
Eleni Vlachopoulou
25bfd0a982 GTestSuite: Fix so that std::max works properly on Windows.
AMD-Internal: [CPUPL-4500]
Change-Id: I73d55dd3040daf6f8aec94799cf7f3f0cc2bddc0
2024-05-20 15:59:16 +01:00
Edward Smyth
bc7d2df832 GTestSuite: misc corrections 2
- Correct value of alpha in ger ERS test.
- rename ERS_IIT.cpp files to match naming convention
  used for other APIs.
- Change all cases of gint_t to gtint_t except for
  dotxf, which is fixed in another commit.
- Add TEST_UPPERCASE_ARGS to imatcopy and omatcopy{2}
  headers.
- Corrected typo.

AMD-Internal: [CPUPL-4500]
Change-Id: I8844bb8c5941785e64daa9df5569092c19f91838
2024-05-20 03:51:21 -04:00
Eleni Vlachopoulou
e98d58b657 GTestSuite: Adjusting thresholds.
-Adding multiplier for complex APIs.
-Updating for trmv and trsv to reflect multiplication with alpha.

AMD-Internal: [CPUPL-4500]
Change-Id: I17361da5afa5d1e219b4c8a14542e2b216a7ea58
2024-05-17 09:11:59 -04:00
Edward Smyth
a69dc3669e GTestSuite: test name consistency changes 5
Improve consistency in test names across different APIs.

In this commit, standardize leading dimensions (lda, ldb,
ldc) in test names. Also some misc tidying changes.

AMD-Internal: [CPUPL-4500]
Change-Id: Icbc82d0b9a3420ddfdb4f418396f9e56ab1765ab
2024-05-16 08:51:01 -04:00
Edward Smyth
782e009b66 GTestSuite: check data that should just be set is not read 2
Correction to commit 8657e661fc
to allocate matrix or vector correctly when special read-only
case occurs.

Also define a set_matrix generator for symmetric matrices
to only set upper or lower triangle to the supplied value,
while setting the unused elements to a large value to help
catch incorrect access to those elements.

AMD-Internal: [CPUPL-4548]
Change-Id: I22b3a20e2ce8be70eb27179247cd47fdb2d87b9d
2024-05-15 11:56:16 -04:00
Edward Smyth
1f60b7c366 Export some BLIS internal symbols 2
Export more symbols for BLIS kernels so that AOCL libFLAME
optimizations can call them directly.

AMD-Internal: [CPUPL-5044]
Change-Id: I45392b8a2a14ac2816141521b90b7ddb1216c733
2024-05-15 06:59:56 -04:00
Mangala V
64d9c96d45 ZGEMMT SUP: AVX512 GEMMT code for Upper variant
1. Enabled AVX512 path for
   -  Upper variant
   -  Different storage schemes for upper and lower variant

2. Modified mask value to handle all fringe cases correctly

AMD_Internal: [CPUPL-5091]

Change-Id: I4bf8aca24c1b87fff606deb05918b8e6216b729e
2024-05-15 13:08:32 +05:30
Shubham Sharma
b4bc71f3ac Bug fix IN DAXPYF MT and Code Cleanup
- Fixed bug in DAXPYF MT kernel when incx != inca.
- Added AOCL Dynamic function for 1f kernels.
- Moved all DOTXF and AXPYF kernels into one file.

AMD-Internal: [CPUPL-4880]
Change-Id: I7d9f44625bc42fad4a9e5b218ecc382efdf22cbe
2024-05-14 06:44:10 -04:00
Shubham Sharma
f4b06547fd Enabled DGEMMT SUP optimized code for upper variant
- Enabled DGEMMT SUP upper kernels in AVX512 code path.
- Enabled use of optimized kernels for all the storages
  supported by optimized kernels.

AMD-Internal: [CPUPL-4881]
Change-Id: Id4486610dacaabc405fbc35b2588607c6508705e
2024-05-14 05:23:51 -04:00
Meghana Vankadari
3a8b9270e7 Implemented lpgemv for AVX512-INT8 variants
- Implemented optimized lpgemv for both m == 1 and n == 1 cases.
- Fixed few bugs in LPGEMV for bf16 and f32 datatypes.
- Fixed few bugs in JIT-based implementation of LPGEMM for BF16
  datatype.

AMD-Internal: [SWLCSG-2354]
Change-Id: I245fd97c8f160b148656f782d241f86097a0cf38
2024-05-14 01:55:49 +05:30
Edward Smyth
b2ed1000b3 GTestSuite: test name consistency changes 4
Improve consistency in test names across different APIs.

Various changes in this patch:
- Explicitly cast char variables to std::string when
  adding to test name. Adding the char directly was
  causing errors in name generation.
- Use template version of print function in zdscalv
  and remove print function zdscalvGenericTestPrint.
- Remove unused print function ztrsvPrint.
- Eliminate some differences in gemm ukr print
  functions.
- Remove extraneous API name labels in ukr axpyf and
  setv.
- Make ukr/trsm/test_trsm_ukr.h more consistent with
  other files.

AMD-Internal: [CPUPL-4500]
Change-Id: Ib8092de216712586fe4ec0ae91698d0c1aaffd54
2024-05-13 11:11:02 -04:00
mkadavil
ec67289601 SWISH post-op support for BF16 JIT based kernels.
SWISH post-op computes swish(x) = x / (1 + exp(-1 * alpha * x)).
SiLU = SWISH with alpha = 1. Adding the support for swish in JIT
based BF16 kernels.

AMD-Internal: [SWLCSG-2387]
Change-Id: I9eea0c801f5f067a5cfbd2941bc991708b86e45e
2024-05-13 01:50:32 -04:00
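The formula in the SWISH commit above translates directly into a scalar reference. The sketch below is plain Python for illustration, not the JIT kernel, and it does not guard against overflow of `math.exp` for very large negative values of alpha * x.

```python
import math


def swish(x, alpha=1.0):
    """swish(x) = x / (1 + exp(-1 * alpha * x))."""
    return x / (1.0 + math.exp(-1.0 * alpha * x))


def silu(x):
    """SiLU is SWISH with alpha = 1."""
    return swish(x, 1.0)
```

A scalar reference of this kind can be used to check the output of a vectorized post-op element by element within a tolerance.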
Edward Smyth
a94d2ddf44 GTestSuite: test name consistency changes 3
Improve consistency in test names across different APIs.

In this commit, standardize storage, side, uplo, trans,
diag and conj in test names.

AMD-Internal: [CPUPL-4500]
Change-Id: Ifcdb6e9f684b134841d86087218d7aefd9cabe63
2024-05-10 08:35:19 -04:00
Hari Govind S
61d0f3b873 Additional optimisations on COPYV API
-  Reduced number of jump operations in AVX512
   assembly kernel for SCOPYV, DCOPYV and ZCOPYV.

-  Fixed memory test failure for bli_zcopyv_zen_int_avx512
   kernel.

-  Replaced existing AVX2 COPYV intrinsic kernels in
   bli_cntx_init_zen5.c with AVX512 assembly kernels.

Change-Id: Idc11601b526d6d82cfbdf63af2fd331918b31159
2024-05-10 07:22:04 -04:00
Edward Smyth
8657e661fc GTestSuite: check data that should just be set is not read
Some BLAS routines do not require matrices or vectors to be
initialized in certain use cases. For example, in GEMM when
beta=zero, C is set rather than updated, thus input values of
C should not be used. In these cases set the initial values of
such matrices or vectors to an extreme value, to help detect
if these are incorrectly being read.

The extreme value can be NaN or Inf. The default is Inf,
change it by running

   cmake ... -DEXT_VALUE=NaN

AMD-Internal: [CPUPL-4548]
Change-Id: I4a665363779d2496b8247f6357e970b7f23cd1eb
2024-05-10 06:29:03 -04:00
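Why an extreme initial value exposes an incorrect read can be seen in a small sketch. It assumes a toy GEMM on nested lists; `EXT_VALUE` stands in for the value chosen at configure time, and both function names are hypothetical. With beta = 0, an implementation that still multiplies C by beta turns an Inf input into NaN, whereas one that only sets C is unaffected.

```python
import math

EXT_VALUE = math.inf  # stand-in for the value chosen with -DEXT_VALUE


def _dot(a, b, i, j):
    """Inner product of row i of A with column j of B."""
    return sum(a[i][p] * b[p][j] for p in range(len(b)))


def gemm_sets_c(alpha, a, b, beta, c):
    """C := alpha*A*B + beta*C; C is not read at all when beta == 0."""
    m, n = len(a), len(b[0])
    out = [[alpha * _dot(a, b, i, j) for j in range(n)] for i in range(m)]
    if beta != 0:
        for i in range(m):
            for j in range(n):
                out[i][j] += beta * c[i][j]
    return out


def gemm_reads_c(alpha, a, b, beta, c):
    """Same formula, but C is always read, even when beta == 0."""
    m, n = len(a), len(b[0])
    return [[alpha * _dot(a, b, i, j) + beta * c[i][j] for j in range(n)]
            for i in range(m)]
```

With C pre-filled with `EXT_VALUE` and beta = 0, `gemm_sets_c` returns finite values while `gemm_reads_c` returns NaN, because 0 * Inf is NaN in IEEE-754 arithmetic.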
Hari Govind S
92847ae912 Gtestsuite: Memory testing for SCOPYV, DCOPYV and ZCOPYV APIs
-  Utilized the memory testing feature in GTestsuite
   to update the testing interfaces for micro-kernel
   testing of SCOPY, DCOPY and ZCOPY APIs.

Change-Id: I3d6905f33b000b8d5e60727aa896bd869f4f441f
2024-05-09 12:10:17 -04:00
Shubham Sharma
f36468a9e9 Enabled vectorized division code in ZTRSM
- Existing vectorized code was disabled because
  of the failures observed in matlab tests.
- The issue is caused by underflow during division when diagonal
  elements of A matrix are very small.
- When the diagonal is very small (4E-324 in case of matlab), squaring the
  diagonal during division causes the square to be rounded off to zero.
- Fix is to normalise (ar) and (ai) by dividing (ar) and (ai) by
  max(ar, ai); this makes either (ar) or (ai) 1, and hence
  reduces the likelihood of underflow.

AMD-Internal: [CPUPL-5052]
Change-Id: Iff7893fdcb92907a12e6af8e102a92637a13ce4f
2024-05-09 01:35:39 -04:00
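The underflow and the normalisation fix from the ZTRSM commit above can be reproduced with scalar arithmetic. The sketch below is a plain Python illustration of complex division x / a with a = ar + i*ai, not the vectorized kernel. It assumes the maximum is taken over absolute values and that a is nonzero; both function names are hypothetical.

```python
def naive_denominator(ar, ai):
    """Denominator ar^2 + ai^2 of the textbook complex division formula."""
    return ar * ar + ai * ai


def scaled_div(xr, xi, ar, ai):
    """Return (real, imag) of (xr + i*xi) / (ar + i*ai) with normalisation."""
    s = max(abs(ar), abs(ai))      # assumption: max of absolute values, s != 0
    arn, ain = ar / s, ai / s      # one of these now has magnitude 1
    d = arn * arn + ain * ain      # lies in [1, 2], so it cannot underflow
    real = (xr * arn + xi * ain) / d / s
    imag = (xi * arn - xr * ain) / d / s
    return real, imag
```

For a denormal diagonal such as 5e-324 the naive denominator rounds to zero, while the normalised version divides by a value between 1 and 2 before applying the scale factor.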
vignbala
ca6276d52b Accuracy and memory testing of AVX512 ?SETV, ZAXPYV and ZAXPYF kernels
- Added accuracy and memory tests for AVX2 and AVX512 ?SETV kernels,
  AVX512 ZAXPYV kernel and AVX512 ZAXPYF kernels, with fuse-factors
  2, 4 and 8.

- Cleanup of the code-section that declares and defines the reference
  compute for AXPYF operation. Corrected the type mismatch with the
  arguments that reference AXPYV would expect(this is used to decompose
  AXPYF as part of reference). Ensured usage of GTestSuite's internal
  alias for integer types.

- Updated the API level testsuite and testing interface for AXPYF,
  based on the cleanup done to the reference code.

AMD-Internal: [CPUPL-4974]
Change-Id: I71de6c09d3877cd3dd1eaa20ab4f90e7c33eb1e1
2024-05-09 00:24:02 -04:00
Edward Smyth
a2beef3255 GTestSuite: break up long running tests
Test programs for key APIs like GEMM take a long time to run,
and even to generate the list of test cases. Break into
separate test programs for different data types to enable
these to run in parallel (at gtest level). In this patch
we break up GEMM, TRSM, GEMV and TRSV.

AMD-Internal: [CPUPL-4500]
Change-Id: I21363b050d30e0402d5a1e8cbeaed2ebcc87aaeb
2024-05-08 13:36:38 -04:00
Edward Smyth
62c886feee Export some BLIS internal symbols
AOCL libFLAME optimizations directly call some internal
BLIS symbols. Export them to enable this to work with
the BLIS shared library.

AMD-Internal: [CPUPL-5044]
Change-Id: Icb62dcb51e12d72dde8434593ab17de3c227c93d
2024-05-08 12:51:32 -04:00
Arnav Sharma
cb27fad49c ZSCALV AVX512 Kernel
- Implemented ZSCALV kernel utilizing AVX512 intrinsics.

- Gtestsuite: Added ukr tests for the new kernel.

AMD-Internal: [CPUPL-5012]
Change-Id: I75c7f4448ddd60b0f9afa53936eed37f5f99eeb2
2024-05-08 11:55:13 -04:00
Arnav Sharma
89a06cf252 Gtestsuite: Unit Tests for ZDOTV AVX512 Kernel
- Updated DOTV Gtestsuite interface to invoke C/ZDOTC when conjx='c'
  and testing interface is either BLAS or CBLAS.

- Added ukr tests for bli_zdotv_zen4_asm_avx512( ... ) and
  bli_zdotv_zen_int_avx512( ... ) kernels.

AMD-Internal: [CPUPL-5011]
Change-Id: I32fb69027a35d9ea92f997a095d412c8242a4b68
2024-05-08 09:20:31 -04:00
eseswari
e0b172174e Added testcases for axpyv api
* Functional tests are covered for saxpyv and zaxpyv.
* As part of functional tests, large sizes of m, strides greater than m,
  scalar combinations (including special cases) and zero increment tests
  are added for saxpyv and zaxpyv.

Signed-off-by: eseswari <sangadala.eswari@amd.com>
AMD-Internal: CPUPL-4413
Change-Id: I61473357680cb0f394e6e653796ec31110895fa4
2024-05-08 08:44:45 -04:00
Arnav Sharma
1dbeee4d19 ZDOTV AVX512 Kernel with MT Support
- Added AVX512 kernel for ZDOTV.

- Multithreaded both ZDOTC and ZDOTU with AOCL_DYNAMIC support.

AMD-Internal: [CPUPL-5011]
Change-Id: I56df9c07ab3b8df06267a99835b088dcada81bd8
2024-05-08 04:54:05 -04:00
eseswari
dd10c6dc5b Added testcases for copyv API
* As part of functional test cases, large sizes of m, strides greater than
  m, scalar combinations and zero increment tests are added for ?copyv.

Signed-off-by: eseswari <sangadala.eswari@amd.com>
AMD-Internal: CPUPL-4412
Change-Id: I9fa74c147975bbe21263aaf48190170c6ed0a8fd
2024-05-08 04:41:43 -04:00
Eleni Vlachopoulou
7787d5af1a GTestSuite: Updating CMake system to create executables depending on the directory structure.
- Previously, the system assumed 3 levels in the directory structure and
  created corresponding targets.
- Now the system looks into the subdirectories of testsuite and creates
  a target for each subdirectory that has at least one cpp file.
- Also deleted a directory that seems to be a duplicate and was breaking builds.

AMD-Internal: [CPUPL-4500]
Change-Id: I03ca362b09783f1c7c5f37ab420d8ca2c2b45e2e
2024-05-08 03:46:14 -04:00
Arnav Sharma
b1d69180f9 Updated DOTV DTL in bla_dot.c
- Updated DOTV DTL entry to include conjugate parameter.

AMD-Internal: [CPUPL-5059]
Change-Id: Id66be02fc06ff2faa18325dffe76559af2c6a5cf
2024-05-08 01:46:17 -04:00
Mangala V
e6cc2a3e22 ZGEMMT SUP Optimizations for AVX512
Existing Design:
 - GEMM AVX2 kernel performs computation and updates temporary C buffer
 - Portion of temporary C buffer is copied to output C buffer
   based on UPLO parameter
 - For diagonal blocks, using GEMM kernels is not efficient

New Design: Implemented in current patch when UPLO='L'
 - GEMMT kernel used for computation, temporary buffer is not required.
 - Only required elements are computed using mask load store for all
   fringe cases
 - Exception: AVX2 code path is used when storage format is RRC, CRR, CRC

- AOCL-Dynamic is added based on dimension
- Check for AVX platform is added in SUP interface. It returns to
  native implementation if hardware does not support AVX platform
- SUP ref_var2m is expanded for dcomplex datatype to avoid condition
  check which exists for double datatype

AMD_Internal: [CPUPL-5006]

Change-Id: I3e21404b732b8f2df9cbdba394303752fdf36286
2024-05-07 23:00:29 +05:30
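The difference between the two designs in the ZGEMMT commit above comes down to which elements of C are computed. A minimal sketch of the UPLO='L' case on nested Python lists is shown below; it only illustrates that elements above the diagonal are neither computed nor touched, and does not model masking, blocking or storage formats. The function name is hypothetical.

```python
def gemmt_lower(alpha, a, b, beta, c):
    """Update only the lower triangle (i >= j) of the n x n matrix C in place:
    C[i][j] := beta*C[i][j] + alpha*(A*B)[i][j]."""
    n, k = len(c), len(b)
    for i in range(n):
        for j in range(i + 1):          # columns up to and including the diagonal
            ab = sum(a[i][p] * b[p][j] for p in range(k))
            c[i][j] = beta * c[i][j] + alpha * ab
    return c
```

Because the strictly upper elements are skipped, no temporary C buffer is needed to protect them, which matches the motivation given for the new design.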
Meghana Vankadari
1072770c63 Implemented LPGEMV for bf16 datatype
1. The 5 LOOP LPGEMM path is inefficient when A or B is a vector
   (i.e., m == 1 or n == 1).

2. An efficient implementation is developed considering the B matrix
   reorder in case of m = 1 and post-ops fusion.

3. When m = 1, the algorithm divides the GEMM workload in the n dimension
   intelligently at a granularity of NR. Each thread works on A:1xk
   B:kx(>=NR) and produces C=1x(>=NR). K is unrolled by 4 along with
   a remainder loop.

4. When n = 1, the algorithm divides the GEMM workload in the m dimension
   intelligently at a granularity of MR. Each thread works on A:(>=MR)xk
   B:kx1 and produces C = (>=MR)x1. When n = 1, reordering of B is avoided
   to process the data efficiently in the n = 1 kernel.

AMD-Internal: [SWLCSG-2355]
Change-Id: I7497dad4c293587cbc171a5998b9f2817a4db880
2024-05-06 23:55:15 +05:30
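One way to picture the m = 1 work split from the LPGEMV commit above is a partition of the n dimension into per-thread column ranges that are multiples of NR, with the last thread absorbing any remainder. The sketch below is an assumed scheme for illustration only and is not the partitioning code in the library; the function name and the policy for small n are hypothetical, and num_threads is assumed to be at least 1.

```python
def partition_n(n, nr, num_threads):
    """Split columns [0, n) into (start, end) ranges, one per active thread.

    Every range covers at least nr columns when n >= nr; the last range
    also takes the remainder that does not fill a whole nr-wide block.
    """
    if n <= 0:
        return []
    full_blocks = n // nr
    if full_blocks == 0:
        return [(0, n)]                       # too small to split
    nt = min(num_threads, full_blocks)        # never more threads than blocks
    base, extra = divmod(full_blocks, nt)
    ranges, start = [], 0
    for t in range(nt):
        end = start + (base + (1 if t < extra else 0)) * nr
        if t == nt - 1:
            end = n                           # last thread takes the remainder
        ranges.append((start, end))
        start = end
    return ranges
```

The same function applies to the n = 1 case by partitioning the m dimension with MR in place of NR.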
Kiran Varaganti
fd61c69778 Fixed bug in omatcopy for when trans="t"
Thanks to Zhenyu Zhu ajz34 for pointing out this bug.
When trans="t" (or "conjugate transpose" in the case of complex data-types),
ldb should be greater than or equal to cols.
The buggy code checked it against "rows". Fixed this bug.
Some minor code formatting is also done.

[CPUPL-4810][SWLCSG-2706]

Change-Id: Ie796d25a361b2ba72eda80e8c5867d6352af901f
2024-05-06 12:57:38 -04:00
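The corrected leading-dimension rule from the omatcopy commit above can be stated as a small check. The sketch assumes column-major storage with A of size rows x cols, and uses the single-character trans codes 'n', 'r', 't' and 'c'; the function name is hypothetical and the check on lda is included only for completeness.

```python
def omatcopy_ld_ok(trans, rows, cols, lda, ldb):
    """Validate leading dimensions for B := alpha * op(A), A is rows x cols."""
    trans = trans.lower()
    if lda < rows:
        return False
    if trans in ("n", "r"):
        # B has the same shape as A, so ldb is checked against rows.
        return ldb >= rows
    if trans in ("t", "c"):
        # B is cols x rows, so ldb must be checked against cols, not rows.
        return ldb >= cols
    raise ValueError("unknown trans value: %r" % trans)
```

With rows = 2 and cols = 5, a transposed copy with ldb = 2 passes the old check against rows but is rejected here, since B needs 5 rows.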
Shubham Sharma
be34169001 Fixed Matlab Failure in ZTRSM
- In AVX512 ZTRSM kernel, vectorized division code
  is causing failures in matlab.
- The logic is identical in reference C code and intrinsics code,
  but intrinsics code is causing failure
- Replaced optimized intrinsics code with C code.

AMD-Internal: [CPUPL-5052]
Change-Id: Iea184330b22c46d979867b870486066ef980eb84
2024-05-06 06:56:45 -04:00