Commit Graph

501 Commits

Author SHA1 Message Date
Edward Smyth
7f86561d26 BLIS-Nov2022: HPL memory issues with GCC.
HPL script was using BLIS manual way to set threading, i.e. setting
BLIS_IC_NT explicitly. This causes bli_rntm_num_threads() to return
-1, which wasn't trapped in parallelised BLAS1 and BLAS2 routines.

Fix: if this occurs, set local number of threads based on product of
BLIS_JC_NT * BLIS_PC_NT * BLIS_IC_NT * BLIS_JR_NT * BLIS_IR_NT values.

Note: BLIS_PC_NT should always be 1, but this environment variable
is currently being read (contrary to documentation), so include it
for now.

Other changes:
* implement _Pragma convention in all code used on AMD
* frame/2/gemv/bli_gemv_unf_var1_amd.c: Remove is_omp_mt_enabled flag

AMD-Internal: [CPUPL-2803]
Change-Id: I37e8b038e5640d6693a87be0609888186322b465
2022-12-06 05:10:34 -05:00
Edward Smyth
34730a1e4c BLIS: Nested parallelism issues
1. Check OpenMP active level against max active levels when setting
   number of threads for starting a new parallel region in
   ./frame/thread/bli_thread.c to ensure the correct number of threads
   is used when BLIS is called within nested OpenMP parallelism.

2. In subsequent BLIS calls, threading choices could be incorrectly
   set based on values used and stored in global_rntm by a previous
   call. This could apply when the OpenMP number of threads differ from
   call to call, different nested parallelism is used in different
   parts of a user's code, or different threads at the user level
   request different numbers of OpenMP threads for BLIS calls.

   Keep threading information in both global_rntm and a new Thread
   Local Storage copy tl_rntm. Update tl_rntm from OpenMP runtime
   environment (as appropriate) during bli_init_auto() calls in each
   BLIS routine. The details are:
   * global_rntm is initialized on first BLIS call based on OpenMP and
     BLIS threading environment variables.
   * global_rntm is updated by any BLIS threading function calls.
   * In bli_thread_update_tl(), called by bli_init_auto(), sync with
     any BLIS values set or updated in global_rntm. Then, if BLIS
     threading control is not used, check OpenMP ICVs and set thread
     count and auto_factor appropriately.
   * Setting BLIS threading locally (using expert interfaces to pass
     a user defined rntm data structure) should work as before.

3. bli_thread_get_is_parallel can now only be called outside of
   parallelism within BLIS routines. Change calls in trsm to reflect
   this.

4. Ensure blis_mt is set to TRUE in bli_thread_init_rntm_from_env()
   if any BLIS_*_NT environment variables are set.

5. Set auto_factor = FALSE when the number of threads is 1.

6. bli_rntm_set_num_threads() and bli_rntm_set_ways() set blis_mt=TRUE.

7. Set blis_mt=FALSE in BLIS_RNTM_INITIALIZER and bli_rntm_init().

8. For debugging, internal information on the rntm threading data can
   be printed by defining "PRINT_THREADING" at the top of bli_rntm.h

9. bli_rntm_print() now also prints the value of blis_mt.

10. Function prototypes in bli_rntm.h moved to top of file, so that
   bli_rntm_print() can be used within inline functions defined in
   this header file.

11. Comment out bli_init_auto() and bli_finalize_auto() calls in
    Fortran interfaces in frame/compat/blis/thread/b77_thread.c

12. In frame/3/bli_l3_sup_int_amd.c move two calls to set_pack_a and
    set_pack_b functions outside of the auto_factor if statements.

13. Misc code tidying.

AMD-Internal: [CPUPL-2433]
Change-Id: I8342c37fb4e280118e5e55164fbd6ea636f858ee
2022-10-21 07:38:39 -04:00
Harihara Sudhan S
42d631bced Copyright modification
- Added copyright information to modified/newly created
          files missing them

Change-Id: If4e73b680246d0363de09587d6dc54bee00ecd71
2022-10-14 12:43:35 +05:30
Shubham Sharma
2d72d4c862 DTRSM Native path AVX 512 kernels are developed
- Added DTRSM AVX512 kernels for lower and upper variants in the native path.
- Changes in framework are made to accommodate these kernels.

AMD-Internal: [CPUPL-2588]
Change-Id: I1f74273ef2389018343c0645870290373ce25efe
2022-10-01 15:58:24 +00:00
Nallani Bhaskar
f1caa8e630 Fixed initialization issues in ddot multithread implementation
Description:

1. n value per thread and offset are defined outside omp loop
   which can vary for each thread. Moved them to inside the
   omp loop so that the output

2. Fixed few compilation warnings in dnrm2 avx2 implemenation

AMD Internal:[CPUPL-2606]

Change-Id: Ifba9b3707c3c1a66f31b5e1906ecb68eabef4f81
2022-10-01 14:43:39 +05:30
Mangala V
e440cbc91a Fixed ASAN reported issues in bli_dgemmsup_rd_haswell_asm_6x8m
Address sanitizer reports error when rbp regitser is modified.

Register rbp was stored with rs_a which was used during prefetch
of Matrix A. Usage of rbp is avoided by using rcx register as a
temporary storage register.
Hence rcx is updated with Matrix C address before storing the
computed data.

This fix address the issue reported by GEQP3 API of libflame

AMD-Internal: [CPUPL-2587]
Change-Id: Ica790259010d8e71528c3d0ab1cd49069c56fc1d
2022-09-30 11:52:29 -04:00
Eleni Vlachopoulou
863b73dfaf Adding AVX2 support for DZNRM2
- For the cases where AVX2 is available, an optimized function is called,
based on Blue's algorithm. The fallback method based on sumsqv is used
otherwise.

- Scaling is used to avoid overflow and underflow.

- Works correctly for negative increments.

- Cleaned up some white space in the AVX2 implementation for DNRM2.

AMD-Internal: [CPUPL-2551]
Change-Id: I0875234ea735540307168fe7efc3f10fe6c40ffc
2022-09-30 07:51:04 -04:00
Arnav Sharma
90f915d3a9 Vectorized and parallelized zdscal routine
- Implemented optimized intrinsic kernel for zdscalv for the cases where AVX2 is supported.
- Also added multithreaded support for the same.
- The optimal number of threads is being calculated on the basis of input size.

AMD-Internal: [CPUPL-2602]
Change-Id: I4d05c3b1cc365a7770703286a89c6dce3875c067
2022-09-30 06:11:07 -04:00
satish kumar nuggu
9c292b79e2 Fixed ASAN reported issues in [s/c]trsm small kernels
Details:
1. Fixed the memory access paritial overflows for the variables
AlphaVal,ones reported by ASAN.
2. Using 128 bit packed broadcast with the 64 bit data types
after type casting would cause the garbage data to be filled
in the destination register.
3. Fixed this issue by using set_ps instruction instead of broadcast.
4. In cases of n remainder being 1, extra elements were accessed that
could cause out of memory access. Removed the extra element access.

AMD-Internal: [CPUPL-2578][CPUPL-2587]
Change-Id: Iaa918060c66287f2f46bcb9f69e9323f6707cf75
2022-09-30 03:02:07 -04:00
Eleni Vlachopoulou
1c1a0027a8 Bugfix in DNRM2 AVX path
Description:
Enabled DNRM2 AVX path and fixed bug that caused numerical accuracy
errors.

AMD-Internal: [CPUPL-2576]
Change-Id: Ic9fda9d9668bdfe233621f79db6acce518b4d10e
2022-09-29 04:46:04 -04:00
satish kumar nuggu
754627eae9 Addressed uninitialized variables in trsm small algo
1. Addressed uninitialized variables reported in coverity for all
datatypes of trsm small algo.

AMD-Internal: [CPUPL-2542]
Change-Id: Ifae57ef6435493942732526720e6a9d6bec70e71
2022-09-29 01:49:22 -04:00
Mangala V
42434fbc31 Bug fix in TRSM small for S, D, C & Z datatype incase of MT
1. Corrected B buffer accessing to access by its offset instead of
   starting address which is required incase of MT.
3. When num_threads > 1, B buffer is divided in to blocks in m or n
   dimension based on side right or left. Hence need to access by its
   offset to access starting of the block.
4. Currently B Matrix is divided in to blocks for each thread and
   complete matrix A is used by all threads.
   Incase of design change in future, modified A buffer accessing by its
   offset to support partition of matrix A for MT
AMD-Internal:[CPUPL-2520]

Change-Id: Ic09e9e945417b86e2bc2e2d4548f65db308cd2ea
2022-09-23 03:30:25 -04:00
Eleni Vlachopoulou
a5891f7ead Adding AVX2 support for DNRM2
- For the cases where AVX2 is available, an optimized function is called,
based on Blue's algorithm. The fallback method based on sumsqv is used
otherwise.

- Scaling is used to avoid overflow and underflow.

- Works correctly for negative increments.

AMD-Internal: [CPUPL-2551]
Change-Id: I5d8976b29b5af463a8981061b2be907ea647123c
2022-09-20 06:05:01 -04:00
Nallani Bhaskar
ea79efa915 Fixed out of bound memory access in sgemmsup zen rv kernels
Details:

1. In sgemmsup_zen_rv_?x2 kernels "vmovps" instruction
   is used to load B matrix in k loop and k last loop,
   which is loading 128 bit into xmm than 64 bit as expected.

2. Changed vmovps instruction to vmovsd instrucntions
   which load only 64 bit in xmm register

3. Avoided C memory access by vfma instruction when multiplying
   with non-beta at corner cases with required access to 128 bit
   which leads to out of bound. Replaced with vmovq first to
   get 64 bit data then peformed vfma on xmm register in rv_6x8m
   and rv_6x4m

   AMD-Internal: [CPUPL-2472]

Change-Id: Iad397f8f5b5cc607b4278b603b1e0ea3f6b082f2
2022-08-30 01:13:14 -04:00
Arnav Sharma
eb83a0fe9d Enabled ZHER Optimized Path
- While calculating the diagonal and corner elements, the combined
operation of calculating the product of x and x hermitian and
simultaneously scaling it with alpha and adding the result to the matrix
was the cause of increased underflow and overflow errors in netlib
tests.
- So the above calculation is now being done in three steps: scaling x
vector with alpha, then calculating its product with x hermitian and
later adding the final result to the matrix.

AMD-Internal: [CPUPL-2213]
Change-Id: I32df572b013bc3189340662dbf17eddcaec9f0f8
2022-08-29 08:09:42 -04:00
Dipal M Zambare
d3b503bbf2 Code cleanup and warnings fixes
- Removed all compiler warnings as reported by GCC 11 and AOCC 3.2
- Removed unused files
- Removed commented and disabled code (#if 0, #if 1) from some
  files

AMD-Internal: [CPUPL-2460]
Change-Id: Ifc976f6fe585b09e2e387b6793961ad6ef05bb4a
2022-08-29 15:15:40 +05:30
satish kumar nuggu
2114a43df8 Fixes to avoid Out of Bound Memory Access in TRSM small algorithm
Details:
1. Fixed the issues corresponding to Out of bound memory access during
load and store.
2. In Intrinsic code:
   i. AVX2 Registers can hold 4 double elements.
   ii. In case of remainder when number of elements is lessthan
vectorised register. Though the required number of elements are lessthan
4, we are reading and writing in chunks of 4 elements due to
vectorization. This might cause out of bound memory access.
3. Redesigned code to restrict out of bound access by loading and
storing the exact number of elements required.

AMD-Internal: [SWLCSG-1470]
Change-Id: I786f8023cf5a5f3e5343bea413c59bd0e764df9b
2022-08-26 01:06:10 -04:00
satish kumar nuggu
219c41ded9 ZTRSM Improvements
Details:

1. Optimized ztrsm for small sizes upto 500 in multi thread scenarios.
2. Enabled multithreading execution for bli_trsm_small implementation
for double complex data type.
3. Added decision logic to choose between native vs multi-threaded small
path for sizes upto 500 and threads upto 8.

AMD-Internal: [CPUPL-2340]
Change-Id: I4df9d7e6ee152baa9cf33e58d36e1c17f75a00c1
2022-08-20 05:08:47 -04:00
Shubham Sharma
b8b339416a DGEMMT optimizations
Details:

1. For lower and upper, "B" column major storage variants of gemmt,
   new kernels are developed and optimized to compute only the
   required outputs in the diagonal blocks.
2. In the previous implementation, all the 48 outputs of the given
   6x8 block of C matrix are computed and stored into a temporary
   buffer. Later,the required elements are copied into the final C
   output buffer.
3. Changes are made to compute only the required outputs of the 6x8
   block of C matrix and directly stored in the final C output buffer.
4. With this optimization, we are avoiding copy operation and also
   reducing the number of computations.
5. Customized bli_dgemmsup_rd_haswell_asm_6x8m Kernels specific to
   compute Lower and Upper Variant diagonal outputs have been added.
6. SUP Framework changes to integrate the new kernels have been added.
7. These kernels are part of the SUP framework.

AMD-Internal: [CPUPL-2341]
Change-Id: I9748b2b52557718e7497ecf046530d3031636a63
2022-08-19 12:31:35 -04:00
Shubham Sharma
32c9239c7f Optimization of DGEMMT SUP kernels
Details:
1. Optimized the kernels by replacing the macros with
   the actual computation of required output elements.

AMD-Internal: [CPUPL-2341]
Change-Id: Ieefb80ac9b2dc2955b683710e259cf45d581e1b5
2022-08-18 08:30:19 -04:00
Shubham Sharma
8adef27aca Optimization of DGEMMT SUP kernel for beta zero cases.
Details:
1. In kernels for non-transpose variants, changes
   are made to optimize the cases of beta zero.
2. Validated the changes with BLIS Testsuite,
   GTestSuite(Functionality, Valgrind, Integer Tests)
   and Netlib Tests.
3. Fixed warnings during the build process.

AMD-Internal: [CPUPL-2341]
Change-Id: I8bb53ad619eb2413c999fe18eafd67c75fe1f83a
2022-08-18 08:05:58 -04:00
Harsh Dave
e2e1dadee1 DGEMM Improvements
- We prefetch next panel while packing 8xk panel.
- Modified prefetch offsets for dgemm native and 
  dgemm_small kernel.

AMD-Internal: [CPUPL-2366]

Change-Id: Ife609e789c8b87169c73bb0a30d6f1af20fb30ed
2022-08-16 08:07:30 -04:00
satish kumar nuggu
88e44c64e3 Fixed Memory Leaks in TRSM
1. Fixed the memory leaks in corner cases which caused due to extra
loads in all datatypes(s,d,c,z).
2. In remainder cases instead of loading required number of elements,
loaded extra elements which lead to memory leaks. Fixed memory leaks by
restricting number of loads to required number of elements.

AMD-Internal: [CPUPL-2280]

Change-Id: Ia49a02565e01d5ed05e98090b7773a444587cd8a
2022-08-16 05:08:40 -04:00
Shubham Sharma
f5ef30a44a Fix in DGEMMT SUP kernel
Details:
1. Due to error in C output buffer address computation in
   kernel bli_dgemmsup_rv_haswell_asm_6x8m_6x8_L, invalid
   memory is being accessed. This is causing seg fault in
   libflame netlib testing.
2. Validated the fix with libflame netlib testing.

AMD-Internal: [CPUPL-2341]
Change-Id: I9ca0cf09cf2d177ade73f840054b5028eae3a0ed
2022-08-12 21:03:36 +05:30
Shubham Sharma
4bca7f6f4a DGEMMT optimizations
Details:

1. For lower and upper, non-transpose variants of gemmt, new kernels
   are developed and optimized to compute only the required outputs in
   the diagonal blocks.
2. In the previous implementation, all the 48 outputs of the given
   6x8 block of C matrix are computed and stored into a temporary
   buffer. Later,the required elements are copied into the final C
   output buffer.
3. Changes are made to compute only the required outputs of the 6x8
   block of C matrix and directly stored in the final C output buffer.
4. With this optimization, we are avoiding copy operation and also
   reducing the number of computations.
5. Kernels specific to compute Lower and Upper Variant diagonal
   outputs have been added.
6. SUP Framework changes to integrate the new kernels have been added.
7. These kernels are part of the SUP framework.

AMD-Internal: [CPUPL-2341]
Change-Id: I0ec8f24a0fb19d9b1ef7254732b8e09f06e1486a
2022-08-11 06:16:33 -04:00
Mangala.V
8504ef013d Optimisation of DTRSM and ZTRSM
1. Extract instruction replaced with cast when accessing first 128bit,
   as cast inst needs no cycle but extract takes few cycles
2. Added prefetch of A buffer when computing gemm operation
3. Added prefetch of C11 buffer before TRSM operation, with offset of 7 to cs_c

With above changes performance improvements observed in case of Single thread

Change-Id: Id377c490ddac8b06384acfa9a6d89dbe11bbc7be
2022-08-11 01:39:40 -04:00
Devin Matthews
76fbf1233d Add vzeroupper to Haswell microkernels. (#524)
Details:
- Added vzeroupper instruction to the end of all 'gemm' and 'gemmtrsm'
  microkernels so as to avoid a performance penalty when mixing AVX
  and SSE instructions. These vzeroupper instructions were once part
  of the haswell kernels, but were inadvertently removed during a source
  code shuffle some time ago when we were managing duplicate 'haswell'
  and 'zen' kernel sets. Thanks to Devin Matthews for tracking this down
  and re-inserting the missing instructions.

Change-Id: I418fea9fed27ba3ad7d395cf96d1be507955d8e9
2022-08-01 09:29:04 +05:30
Field G. Van Zee
2a81437bd8 Fixed bugs in cpackm kernels, gemmlike code.
Details:
- Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and
  bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the
  kappa scalar was incorrectly loaded at an offset of 8 bytes (instead
  of 4 bytes) from the real component. This was almost certainly a copy-
  paste bug carried over from the corresonding zpackm kernels. Thanks to
  Devin Matthews for bringing this to my attention.
- Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and
  bls_gemm_bp_var2.c that initializes the elements of the temporary
  microtile to zero. (This bug was never observed in output but rather
  noticed analytically. It probably would have also manifested as
  intermittent failures, this time involving edge cases.)
- Minor commented-out/disabled changes to testsuite/src/test_gemm.c
  relating to debugging.

Change-Id: I899e20df203806717fb5270b5f3dd0bf1f685011
2022-08-01 09:11:58 +05:30
Devin Matthews
9495401b73 Fix more copy-paste errors in the haswell gemmsup code.
Fixes #486.

Change-Id: I568386b5d67a698ea9c0b6b17f133df86c2894bd
2022-07-31 21:36:21 +05:30
Devin Matthews
ea163fc23b Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell.
The fix is to use the same (valid) source register twice in the horizontal addition.

Change-Id: I96ed39e289aaeeb44be9117074b32bd8d4c19de6
2022-07-31 21:15:28 +05:30
Field G. Van Zee
faff30b46a Fixed out-of-bounds bug in sup s6x16m haswell kernel.
Details:
- Fixed another out-of-bounds read access bug in the haswell sup
  assembly kernels. This bug is similar to the one fixed in 17b0caa
  and affects bli_sgemmsup_rv_haswell_asm_6x2m(). Thanks to Madeesh
  Kannan for reporting this bug (and a suitable fix) in #635.
- CREDITS file update.

Change-Id: I10ccf4d4f471d93e8c8cc4df422c686438fb04e9
2022-07-31 21:10:58 +05:30
Field G. Van Zee
4b1663213c Fixed out-of-bounds read in haswell gemmsup kernels.
Details:
- Fixed memory access bugs in the bli_sgemmsup_rv_haswell_asm_Mx2()
  kernels, where M = {1,2,3,4,5,6}. The bugs were caused by loading four
  single-precision elements of C, via instructions such as:

	vfmadd231ps(mem(rcx, 0*32), xmm3, xmm4)

  in situations where only two elements are guaranteed to exist. (These
  bugs may not have manifested in earlier tests due to the leading
  dimension alignment that BLIS employs by default.) The issue was fixed
  by replacing lines like the one above with:

	vmovsd(mem(rcx), xmm0)
	vfmadd231ps(xmm0, xmm3, xmm4)

  Thus, we use vmovsd to explicitly load only two elements of C into
  registers, and then operate on those values using register addressing.
  Thanks to Daniël de Kok for reporting these bugs in #635, and to
  Bhaskar Nallani for proposing the fix).
- CREDITS file update.

Change-Id: Ib525c36bcbf20b2bbbe380da3d74d142b338fe9b
2022-07-31 21:06:08 +05:30
Nallani Bhaskar
1d31386c02 Fixed few out of bound memory reads in sgemmsup kernels
Details:
 Fixed memory access bugs in the bli_sgemmsup_rd_zen_asm_s1x16()
  kernel. The bugs were caused by loading four
  single-precision elements of C, via instructions such as:

	vfmadd231ps(mem(rcx, 0*32), ymm3, ymm4)

        or

        vfmadd231ps(mem(rcx, 0*32), xmm3, xmm4)

  in situations where only two elements are guaranteed to exist. (These
  bugs may not have manifested in earlier tests due to the leading
  dimension alignment that BLIS employs by default.) The issue was fixed
  by replacing lines like the one above with:

	vmovsd(mem(rcx), xmm0)
	vfmadd231ps(xmm0, xmm3, xmm4)

  Thus, we use vmovsd to explicitly load only two elements of C into
  registers, and then operate on those values using register addressing.

  AMD_CPUPLID: CPUPL-2279

Change-Id: Ic39290d651f5218b2e548351a87ac5e4b5b79c68
2022-07-29 09:09:12 -04:00
Vignesh Balasubramanian
808d79a610 Implemented efficient ZGEMM algorithm when k=1
Problem statement :
To improve the performance of the zgemm kernel for dealing with input sizes with k=1 by fine tuning its previous implementation.
In the previous implementation, usage of SIMD parallelism along m and n dimensions instead of the k dimension proved to provide a better performance to the zgemm kernel. This code was subjected to further improvements along the following lines:

- Cases to deal with alpha=0 and beta!=0 (i.e. just scaling of C) were handled at the beginning separately, using the bli_zscalm api.
- Register blocking was further improved, resulting in the kernel size to increase from 4x5 to 4x6.
- Prefetching was added to the code, by empirically finding out a suitable value to be added to the pointer. Overall, it provided a mild improvement to the performance.
- Conditional statements were removed from the kernel loop, and a logic was deduced to allow such removal without affecting the output.

The performance improvement of this single threaded implementation also proved to compete with that of the default implementation for multiple threads, as long as m and n are under 128. An improvement to this patch would be to find out a suitable feature which would establish a relationship between the number of threads and the input size constraints, thereby providing a unique size constraint for different number of threads.

AMD-Internal: [CPUPL-2236]
Change-Id: I3d401c8fd78bec80ce62eef390fa85e6287df847
2022-07-28 02:09:45 -04:00
Kiran Varaganti
eff436c653 Bug Fix to replace vzeroall
Fixed syntax in AVX512 dgemm native kernel.
zen4 configuration follows Intel ASM syntax whereas other AMD configs
follow AT&T ASM syntax. Bug was introduced due to following AT&T syntax
in AVX512 dgemm kernel. In this commit we changed the syntax to Intel ASM
format. src and dst operands are interchanged.

Change-Id: Ie61dc7c5e8309b79437d471331318f3104bcd447
2022-07-22 03:42:17 -04:00
Kiran Varaganti
86134c7278 Replaced vzeroall
Replaced vzeroall instruction with vxorpd and vmovapd for dgemm kernels
-both AVX2 and AVX512. vzeroall is expensive instruction and replaced it
with faster version of zeroing all registers. vzeroupper() instruction is
also added at the end of AVX2 kernels to avoid any AVX2/SSE transition
penalities. Kindly note only the main kernels are modified.

Change-Id: Ieb9bc629db01f0f94dd0e8e55550940d3d7eb2a4
2022-07-20 01:16:59 -04:00
Vignesh Balasubramanian
2ad25a7180 ZGEMM kernel performance improvement for k=1 sizes:
The current implementation for handling zgemm exploits SIMD parallelism
along the k dimension. This would give great performance in cases of k
being large. But for input sizes with k=1, it is better to exploit SIMD
parallelism along the m and n dimensions, thereby giving better
performance. This commit does the same through loop reordering, by
loading column vectors from A.

AMD-Internal: [CPUPL-2236]
Change-Id: Ibfa29f271395497b6e2d0127c319ecb4b883d19f
2022-06-30 07:19:52 -04:00
Dipal M Zambare
00c5048f65 Fixed high impact static analysis issues
Initialized ymm and xmm registers to zero to address
un-inilizaed variable errors reported in static analsys.

AMD-Internal: [CPUPL-2078]
Change-Id: Icfcc008a0f244278efd8145d7feef764ed5fcc04
2022-06-27 08:19:26 +00:00
satish kumar nuggu
13c71ca976 BugFix of AOCL_DYNAMIC in TRSM multithreaded small.
- Added initialization of rntm object before aocl_dynamic.
- Bugfixes in dtrsm right-side kernels,
  avoided accessing extra memory while using store for corner cases.

AMD-Internal: [CPUPL-2193] [CPUPL-2194]
Change-Id: I1c9d10edda93621626957d4de2f53d249ad531ba
2022-06-14 15:46:17 +05:30
mkadavil
e073e8b669 DAMAXV AXX512 micro kernel bug fix.
-DAMAXV AVX512 is giving wrong results when max element is present at
index in [n-u, n), where u < 32. This is a fallout of using wrong
start offset for the non-loop unrolled code.
-Functions for replacing NaN with negative numbers is replaced with
MACRO to avoid function call overhead and to remove static variables
used for stateful replacement numbers for NaN.

AMD-Internal: [CPUPL-2190]
Change-Id: Ie1435c38b264a271f869782793d0b52bbe6e1b2a
2022-06-13 10:52:53 +05:30
Dipal M Zambare
c87b9aab75 Added support for AVX512 for Windows and AMAVX
- Completed zen4 configuration support on windows
 - Enabled AVX512 kernels for AMAXV
 - Added zen4 configuration in amdzen for windows
 - Moved all zen4 kernels inside kernels/zen4 folder

AMD-Internal: [CPUPL-2108]
Change-Id: I9d2336998bbcdb8e2c4ca474977b5939bfa578ba
2022-06-08 11:09:48 +05:30
Dipal M. Zambare
e61ec820f9 Fixed windows build issue for BLIS 4.0
- Removed extra AVX512 typedef from bli_amaxv_zen_int.c.

AMD-Internal: [CPUPL-2154]
Change-Id: Ieaa827c7d81b8d101f3a827827b99433570219f2
2022-06-02 17:52:43 +05:30
Arnav Sharma
66b2231b65 Fixed CMake files for HER
- Removed subdirectory addition

Change-Id: I419085db0b9034777409207a7d79b7ffa91eb8f1
2022-06-01 12:25:43 +05:30
Arnav Sharma
e5d5a43eab Optimized ZHER Implementation
- Implemented optimized her framework calls for double precision complex numbers.
- The zher kernel operates over 4 columns at a time. Initially, it computes the diagonal elements of the matrix, then the 4x4 triangular part is computed and finally the remaining part is computed as 4x4 tiles of the matrix upto m rows.

AMD-Internal: [CPUPL-2151]

Change-Id: I27430ee33ffb901b3ef4bdd97b034e3f748e9cca
2022-05-25 14:03:01 +05:30
Dipal M Zambare
6e2f536590 Removed Arch specific code from BLIS framework.
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
  one of the three ways

  -- It is updated to work across platforms.
  -- Added in architecture/feature specific runtime checks.
  -- Duplicated in AMD specific files. Build system is updated to
      pick AMD specific files when library is built for any of the
     zen architecture

AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
2022-05-17 20:35:40 +05:30
Harsh Dave
f48ced0811 Optimized dher2 implementation
- Impplemented her2 framework calls for transposed and non
  transposed kernel variants.

- dher2 kernel operate over 4 columns at a time. It computes
  4x4 triangular part of matrix first and remainder part is
  computed in chunk of 4x4 tile upto m rows.

- remainder cases(m < 4) are handled serially.

AMD-Internal: [CPUPL-1968]

Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313
2022-05-17 18:13:07 +05:30
Harsh Dave
718c6bc024 Optimized daxpy2v implementation
- Optimized axpy2v implementation for double
  datatype by handling rows in mulitple of 4
  and store the final computed result at the
  end of computation, preventing unnecessary
  stores for improving the performance.

- Optimal and reuse of vector registers for
  faster computation.

AMD-Internal: [CPUPL-1973]
Change-Id: I7b8ef94d0f67c1c666fdce26e9b2b7291365d2e9
2022-05-17 18:13:05 +05:30
Harihara Sudhan S
0fddd9eb0d Level 1 Kernel: damaxv AVX512
Details:

- Developed damaxv for AVX512 extension
- Implemented removeNAN function that converts NAN values
  to negative values based on the location
- Usage COMPARE256/COMPARE128 avoided in AVX512
  implementation for better performance
- Unrolled the loop by order of 4.

Change-Id: Icf2a3606cf311ecc646aeb3db0628b293b9a3326
2022-05-17 18:12:16 +05:30
satish kumar nuggu
09b70de635 Performance Improvement for ztrsm small sizes
Details:
    - Handled Overflow and Underflow Vulnerabilites in
      ztrsm small right implementations.
    - Fixed failures observed in Scalapack testing.

    AMD-Internal: [CPUPL-2115]

Change-Id: I22c1ba583e0ba14d1a4684a85fa1ca6e152e8439
2022-05-17 18:10:39 +05:30
Nallani Bhaskar
2acb3f6ed0 Tuned aocl dynamic for specific range in dgemm
Description:

1. Decision logic to choose optimal number of threads for
   given input dgemm dimensions under aocl dynamic feature
   were retuned based on latest code.

2. Updated code in few file to avoid compilation warnings.

3. Added a min check for nt in bli_sgemv_var1_smart_threading
   function

AMD-Internal: [ CPUPL-2100 ]
Change-Id: I2bc70cc87c73505dd5d2bdafb06193f664760e02
2022-05-17 18:10:39 +05:30