Commit Graph

526 Commits

Author SHA1 Message Date
Harsh Dave
f7b9bc734e Added AVX512 24xk packing kernel
- Added AVX-512 24xk packing kernel.
- Prefetching is yet to be added.

AMD-Internal: [CPUPL-2966]

Change-Id: Iaf973788b51172b48ebb73e8f24b8f4020d94de2
2023-03-17 02:00:00 -05:00
mkadavil
3d74b62e60 Lpgemm threading and micro-kernel optimizations.
-Certain sections of the f32 avx512 micro-kernel were observed to
slow down when more post-ops were added. Analysis of the binary
pointed to false dependencies between instructions introduced in
the presence of the extra post-ops. Adding vzeroupper at the
beginning of the ir loop in the f32 micro-kernel fixes this issue.
-F32 gemm (lpgemm) thread factorization tuning for zen4/zen3 added.
-Alpha scaling (multiply instruction) by default was resulting in
performance regression when k dimension is small and alpha=1 in s32
micro-kernels. Alpha scaling is now only done when alpha != 1.
-s16 micro-kernel performance was observed to regress when compiled
with gcc for zen3 and older architectures supporting AVX2. The issue
does not occur when compiling with gcc with AVX512 support enabled.
The root cause was identified as the -fgcse optimization flag, which
is enabled at -O2, when applied with AVX2 support. This flag is now
disabled for zen3 and older zen configs.

AMD-Internal: [CPUPL-3067]
Change-Id: I5aef9013432c037eb2edf28fdc89470a2eddad1c
2023-03-16 11:44:51 +05:30
eashdash
e36f699939 Implemented ERF Based GeLU Activation for LPGEMM and SGEMM
1. Implemented efficient AVX-512, AVX2 and SSE2 versions of the
   error function - ERF
2. Added error function based GeLU activation post-ops for the
   S32, S16 and BF16 (LPGEMM) and SGEMM APIs.
3. Changes for this include frame- and micro-kernel-level changes, in
   addition to adding the macro-based function definitions of the
   ERF function in the math-utils and gelu header files.

AMD-Internal: [CPUPL-3036]
Change-Id: Ie50f6dcabf8896b7a6d30bbc16aa44392cc512be
2023-03-13 06:10:31 -04:00
Vignesh Balasubramanian
c53b0c96ec Extended the support for beta scaling in 3x4 RD kernels of ZGEMM SUP
- Extended the existing support for handling beta scaling
  in the fringe cases of 3x4 RD kernels in ZGEMM SUP. The
  added support ensures that NaN values initialized in C do
  not propagate to the result when beta is 0.
- The support has been added to fringe cases common to, as well
  as specific to the m and n variants of the RD kernels.

AMD-Internal: [CPUPL-3053] [SWLCSG-1900]
Change-Id: I8e617ac505144c3ea3a70556413d264f11dfc9a9
2023-03-08 06:16:58 -05:00
Eleni Vlachopoulou
1416ab6b5e Bugfix on the memory access for nrm2 AVX2 implementations.
- Since the definition of negative increments differs between BLAS and BLIS,
  there was a bug in how memory was accessed when copying the elements
  of a vector with negative increments. Updated the code under the assumption that
  when negative increments are set, the vector is accessed starting from the end.
  For the BLAS interface, there is an intermediate conversion before calling into the blis layer.

Change-Id: I08343472b418733fad6f7add9e90aa96cdf68285
AMD-Internal: [SWLCSG-1900]
2023-03-08 04:21:36 -05:00
Harihara Sudhan S
d4901f53ce Improved performance of double complex AXPYV kernel
- Increased unrolling by reusing the X registers that were previously
  used for performing shuffles.
- Added loops with smaller increment steps for better problem
  decomposition.
- Added X vector and Y vector prefetch to the kernel.
- Removed redundant code that handled the fringe when incx = 1 and
  incy = 1. This remainder is now handled by the loop for
  non-unit stride cases.
- Vectorized loops that handle non-unit stride cases using SSE
  instructions.

AMD-Internal: [CPUPL-2773]
Change-Id: Ifb5dc128e17b4e21315789bfaa147e3a7ec976f0
2023-03-02 00:35:42 -05:00
Harsh Dave
222e00e840 Obliterated usage of rbp register in SUP gemm kernel
- Mx4 edge kernels were overwriting rbp
registers for prefetches.
- Since rbp, along with rsp, defines the stack frame,
this resulted in a stack overflow issue.
- Replaced rbp with rdx register for prefetches.

AMD-Internal: [CPUPL-2987]
Change-Id: I4e52cf691b70be5ab63f562d7630d640b29e1cfd
2023-03-01 11:09:57 -05:00
Harihara Sudhan S
299bed3fa8 Vectorized SCAL2V for double complex
- Added SCAL2V kernel that uses AVX2 and SSE instructions for
  vectorization.
- The routine returns early when the vector dimension is zero
  or incx <= 0 or incy <= 0.
- The kernel takes one among the two available paths based on
  conjugation requirement of X vector.
- VZEROUPPER is added before transitioning from AVX2 to SSE.
- Added function pointer to ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts.
- Added the new SCAL2V file to the CMake list.

AMD-Internal: [CPUPL-2773]
Change-Id: I2debbfab31d41347786c3a1bae5723d092c202e9
2023-02-28 06:59:23 -05:00
Shubham
b84157fed6 Added AVX512 DTRSM small RLNN/RUTN variant kernels
- 8x8 kernels are used for DTRSM SMALL
- Implemented fringe cases with the following block sizes
   8x8, 8x4, 8x3, 8x2, 8x1
   4x8, 4x4, 4x3, 4x2, 4x1
   3x8, 3x4, 3x3, 3x2, 3x1
   2x8, 2x4, 2x3, 2x2, 2x1
   1x8, 1x4, 1x3, 1x2, 1x1

AMD-Internal: [CPUPL-2745]

Change-Id: Ifb8cfba6958e1c89ddbfa18893127ab6d44cc367
2023-02-28 01:40:03 +05:30
mkadavil
1f2447f800 Post-ops support for f32 gemm(aocl_gemm).
- Bias add, relu, parametric relu and gelu post-ops support added in all
f32 gemm micro-kernels. These post-ops are implemented for both AVX512
and AVX2 ISA based on the micro-kernel flavor. The support is added for
both row and column major cases.
- Lpgemm bench updates to support f32 post-ops.

AMD-Internal: [CPUPL-3032]
Change-Id: Ie6840b9d4e52d2086c1b5ff2e1de80dc0cad5476
2023-02-23 18:58:59 +05:30
Arnav Sharma
5baf38b76d K-Loop Unrolling for 6x64 SGEMM SUP RV kernels
- Added k-loop unrolling by a factor of 4 to the following SGEMM
  SUP RV kernels:
	- 5x48, 5x32, 5x16
	- 4x64, 4x48, 4x32, 4x16
	- 3x48, 3x32, 3x16
	- 2x64, 2x48, 2x32, 2x16
	- 1x64, 1x48, 1x32, 1x16
	- 6x64n, 5x64n, 3x64n, 2x64n, 1x64n
- Removed unused variables which were resulting in warnings during
  compilation.
- Added a newline at the end of header files to resolve warnings
  shown during compilation.

AMD-Internal: [CPUPL-3002]
Change-Id: Iab6cf329f6d7fbd7544b5c8837e493069e8c9921
2023-02-23 04:08:38 -05:00
mkadavil
8dff49837d Lpgemm source restructuring to support amdzen config.
-Currently lpgemm can only be built using either zen3 or zen4 config.
The lpgemm kernel code is re-structured to support amdzen, and thus
multi-machine deployment.
-The micro-kernel calls (gemm and pack) are currently hardcoded in the
lpgemm framework. This is removed and a new lpgemm_cntx based dispatch
mechanism is designed to support runtime configurability for
micro-kernels.

AMD-Internal: [CPUPL-2965]
Change-Id: I4bbcb4e5db767def1663caf5481f0b4c988149ef
2023-02-21 08:35:38 -05:00
Harihara Sudhan S
dfacb47125 Code cleanup to remove multiple function declaration
- Removed repeated function declarations of the GEMM small kernels
  from the C files.
- The declarations of these kernels already exist in the header files,
  where the kernels are supposed to be declared.

AMD-Internal: [CPUPL-3003]
Change-Id: Ic10e66691c0742ce519bcc3fe4a12ec5c5052b63
2023-02-21 00:35:31 -05:00
Shubham
1faee9f89e Fixed 32xk AVX512 double precision pack kernel
- Previously, the pointer received as a function argument was
  used for packing, which caused only a partial copy of the
  input buffer to the output buffer due to unexpected compiler
  optimizations.
- To fix this, instead of using the plain pointer for the output
  buffer, we define a "restrict"-qualified local pointer variable.
- The "restrict" keyword tells the compiler that the pointer is
  the only way to access the object it points to.
- With the "restrict" local pointer to the output buffer, the
  problem of the incomplete copy is resolved.

Change-Id: Ie2355beb1d43ff4b60b940dd88c4e2bf6f361646
2023-02-20 21:18:54 -05:00
Arnav Sharma
46965dfc57 Developed 6x64 SGEMM row-preferred kernels
- Added kernels for all rv and rd variants.
- Main kernel is of size 6x64, and the associated fringe kernels
  added are
	- 4x64, 2x64, 1x64
	- 6x32, 4x32, 2x32, 1x32
	- 6x16, 4x16, 2x16, 1x16
- Updated the zen4 config to enable these kernels in zen4 path.
- Added C-prefetching to 6x? row-stored main kernels.
- C-prefetching for column storage yet to be added.
- K-loop unrolling for fringe kernels yet to be added.

AMD-Internal: [CPUPL-3002]
Change-Id: Ide18412cc6178b43a12a3bc7a608ce9d298fb2e4
2023-02-19 23:56:58 -05:00
Harihara Sudhan S
535b49f150 Vectorized COPYV for double complex
- Added ZCOPYV kernel that uses AVX2 and SSE instructions for
  vectorization.
- The routine returns early when the vector dimension is zero.
- The kernel takes one among the two available paths based on
  conjugation requirement of X vector.
- VZEROUPPER is added before transitioning from AVX2 to SSE.
- Added function pointer to ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts.

AMD-Internal: [CPUPL-2773]
Change-Id: Ibd8a2de42060716395ef698d753c8462654cc0f0
2023-02-16 09:32:10 -05:00
Shubham
18569b42ee Added DTRSM small RUNN/RLTN variant AVX512 kernels
- 8x8 kernels are used for DTRSM SMALL
- Matrix A (a10) is packed for GEMM operations.
- The packed matrix A is re-used in all the column blocks
  along the N dimension.
- Diagonal elements of the A matrix are packed (a11) for
  TRSM operations.
- Implemented fringe cases with the following block sizes
   8x8, 8x4, 8x3, 8x2, 8x1
   4x8, 4x4, 4x3, 4x2, 4x1
   3x8, 3x4, 3x3, 3x2, 3x1
   2x8, 2x4, 2x3, 2x2, 2x1
   1x8, 1x4, 1x3, 1x2, 1x1

AMD-Internal: [CPUPL-2745]

Change-Id: I6a174e7f88a4c2c5778052525879552a1e82f6ad
2023-02-16 06:49:46 -05:00
Harihara Sudhan S
8d7593a940 AVX2 vectorized SCALV for double complex
- Added ZSCALV that uses AVX2 and SSE instructions for vectorization.
- Return early when the vector dimension is zero. When alpha is 1 there
  is no need to perform any computation, so we also return early.
- When alpha is zero, the expert interface of ZSETV is invoked. In this
  case, all the elements of the input vector are set to 0.
- Invoking the expert interface means that a NULL pointer can be passed
  to the function in place of the context. The expert interface of ZSETV
  will query the context and get the appropriate function pointer.
- Added BLAS interface for ZSCALV. The architecture ID is used to decide
  the function that is to be invoked.
- Created a new macro INSERT_GENTFUNCSCAL_BLAS_C to instantiate SCALV
  BLAS macro interface only for single complex type and single complex,
  float mixed type

AMD-Internal: [CPUPL-2773]
Change-Id: I0d6995bce883c0ebdc5da0046608fc59d03f6050
2023-02-05 09:36:12 -05:00
Harihara Sudhan S
18ae57305e ZAXPYF4 optimization
- Vectorized alpha scaling of X vector using SSE instructions. This
  can be done irrespective of incx.
- Added code to prefetch A matrix and Y vector to L1 cache
- Vectorized fringe case computation and non-unit stride computation
  with SSE instructions.
- Increased unroll in unit stride cases for better register
  utilization.

AMD-Internal: [CPUPL-2773]
Change-Id: I217e6ce9e3f5753ebe271c684abd9a2274fd2715
2023-02-04 12:34:50 -05:00
Kiran Varaganti
201db7883c Integrated 32x6 DGEMM kernel for zen4 and its related changes are added.
Details:
- AOCL BLIS now uses the AVX512 32x6 DGEMM kernel for the native code path.
  Thanks to Moore, Branden <Branden.Moore@amd.com> for suggesting and
  implementing these optimizations.
- In the initial version of the 32x6 DGEMM kernel, to broadcast elements of packed B
  we performed a load into xmm (2 elements), a broadcast into zmm from xmm, and then,
  to get the next element, a vpermilpd(xmm). This logic is replaced with a direct
  broadcast from memory: since the elements of Bpack are stored contiguously, the first
  broadcast fetches the cacheline and subsequent broadcasts happen faster. We use two
  registers for broadcasts and interleave the broadcast operations with FMAs to hide
  memory latencies.
- Native dTRSM uses 16x14 dgemm; therefore we need to override the default blkszs (MR, NR, ...)
  when executing trsm. We call bli_zen4_override_trsm_blkszs(cntx_local) on a local cntx_t object
  for the double data type as well, in the functions bli_trsm_front() and bli_trsm_xx_ker_var2, xx = {ll,lu,rl,ru}.
  Renamed "BLIS_GEMM_AVX2_UKR" to "BLIS_GEMM_FOR_TRSM_UKR", and in bli_cntx_init_zen4() we replaced
  the dgemm kernel for TRSM with the 16x14 dgemm kernel.
- New packm kernels - 16xk, 24xk and 32xk are added.
- New 32xk packm reference kernel is added in bli_packm_cxk_ref.c and it is
  enabled for zen4 config (bli_dpackm_32xk_zen4_ref() )
- Copyright year updated for modified files.
- Cleaned up code for the "zen" config: removed unused packm kernel declarations in kernels/zen/bli_kernels.h
- [SWLCSG-1374], [CPUPL-2918]

Change-Id: I576282382504b72072a6db068eabd164c8943627
2023-01-19 23:11:36 +05:30
Kiran Varaganti
d8d4499e54 AVX2 dgemm kernel optimization for AOCC
Details: k0 is always positive in bli_dgemm_haswell_asm_6x8(), so the operations involving
     k0 are type-cast to uint64_t to enable AOCC to generate optimized code.
     Thanks to Jini Susan (jinisusan.george@amd.com) from the compiler team for suggesting
     this change. A similar change was applied to the sgemm, cgemm and zgemm kernels.
Change-Id: I423c949e0c1835652142a6931dadf4a7d190aeb9
2023-01-09 07:49:41 -05:00
Edward Smyth
82c2eb4e8e Code cleanup and warnings fixes
Corrections for some occurrences of:
- Compiler warnings about initialization of float from double
- Spelling mistakes in comments
- Incorrect indentation of code and comments

AMD-Internal: [CPUPL-2870]
Change-Id: Icb68c789687bd0684844331d43071bfffecac9fc
2023-01-09 04:34:52 -05:00
Vignesh Balasubramanian
dbd0c069d4 Implemented 3x4 RD kernels for ZGEMM in SUP path
-Implemented (r)ow preferential (d)ot product milli-kernels
 (m and n variants) for dcomplex datatype along SUP path.

-These computational kernels extend the support for handling RRC and
 CRC storage schemes along the SUP path. In case of BLAS api call,
 it corresponds to the input cases with transa equal to T and
 transb equal to N.

-In case of the B matrix being packed (conditionally), the inputs are
 redirected to the existing (r)ow preferential (v)ector load optimized
 kernels due to better performance.

-Added macro for vhsubpd assembly instruction, to support the arithmetic
 for complex datatype in its interleaved storage.

AMD-Internal: [CPUPL-2593]
Change-Id: If90834e55e9e31aa87d3d5b711efad9ef2458da8
2022-12-22 18:31:46 +05:30
Nithya V S
6e9defe1b5 Developed AVX512 based "sup" kernels for SGEMM
- Only for RRR case, var2m kernels are added
- Main kernel is of 12x32 (AVX512), associated fringe kernels of
	- 8x32, 4x32, 2x32, 1x32 (AVX512)
	- 12x16, 8x16, 4x16, 2x16, 1x16 (AVX512)
	- 12x8, 8x8 (AVX2)
	- 12x4, 8x4 (SSE4)
	- 12x2, 8x2 (SSE4)
	- existing AVX2/SSE4 kernels are used for other fringe
	  cases
- Currently, these kernels are not invoked in zen4 path
- Once all AVX512 kernels (n and rd) are done, invoke all of them
  together in zen4 config

AMD-Internal: [CPUPL-2801]

Change-Id: I7a206fee9151e92319d83dcc5f3eed61d3bf1196
2022-12-16 10:18:09 +05:30
Eleni Vlachopoulou
13aa3c8cd0 Adding AVX2 support for SNRM2 and SCNRM2
- For the cases where AVX2 is available, an optimized function is called,
  based on Blue's algorithm. The fallback method based on sumsqv is used
  otherwise.

- Scaling is used to avoid overflow and underflow.

- Works correctly for negative increments.

AMD-Internal: [SWLCSG-1080]

Change-Id: I6bf2f42652ba6b8a8631a0a9e6f6297d5b3ea5d9
2022-12-14 04:25:45 -05:00
Edward Smyth
7f86561d26 BLIS-Nov2022: HPL memory issues with GCC.
HPL script was using BLIS manual way to set threading, i.e. setting
BLIS_IC_NT explicitly. This causes bli_rntm_num_threads() to return
-1, which wasn't trapped in parallelised BLAS1 and BLAS2 routines.

Fix: if this occurs, set local number of threads based on product of
BLIS_JC_NT * BLIS_PC_NT * BLIS_IC_NT * BLIS_JR_NT * BLIS_IR_NT values.

Note: BLIS_PC_NT should always be 1, but this environment variable
is currently being read (contrary to documentation), so include it
for now.

Other changes:
* implement _Pragma convention in all code used on AMD
* frame/2/gemv/bli_gemv_unf_var1_amd.c: Remove is_omp_mt_enabled flag

AMD-Internal: [CPUPL-2803]
Change-Id: I37e8b038e5640d6693a87be0609888186322b465
2022-12-06 05:10:34 -05:00
Edward Smyth
34730a1e4c BLIS: Nested parallelism issues
1. Check OpenMP active level against max active levels when setting
   number of threads for starting a new parallel region in
   ./frame/thread/bli_thread.c to ensure the correct number of threads
   is used when BLIS is called within nested OpenMP parallelism.

2. In subsequent BLIS calls, threading choices could be incorrectly
   set based on values used and stored in global_rntm by a previous
   call. This could apply when the OpenMP number of threads differ from
   call to call, different nested parallelism is used in different
   parts of a user's code, or different threads at the user level
   request different numbers of OpenMP threads for BLIS calls.

   Keep threading information in both global_rntm and a new Thread
   Local Storage copy tl_rntm. Update tl_rntm from OpenMP runtime
   environment (as appropriate) during bli_init_auto() calls in each
   BLIS routine. The details are:
   * global_rntm is initialized on first BLIS call based on OpenMP and
     BLIS threading environment variables.
   * global_rntm is updated by any BLIS threading function calls.
   * In bli_thread_update_tl(), called by bli_init_auto(), sync with
     any BLIS values set or updated in global_rntm. Then, if BLIS
     threading control is not used, check OpenMP ICVs and set thread
     count and auto_factor appropriately.
   * Setting BLIS threading locally (using expert interfaces to pass
     a user defined rntm data structure) should work as before.

3. bli_thread_get_is_parallel can now only be called outside of
   parallelism within BLIS routines. Change calls in trsm to reflect
   this.

4. Ensure blis_mt is set to TRUE in bli_thread_init_rntm_from_env()
   if any BLIS_*_NT environment variables are set.

5. Set auto_factor = FALSE when the number of threads is 1.

6. bli_rntm_set_num_threads() and bli_rntm_set_ways() set blis_mt=TRUE.

7. Set blis_mt=FALSE in BLIS_RNTM_INITIALIZER and bli_rntm_init().

8. For debugging, internal information on the rntm threading data can
   be printed by defining "PRINT_THREADING" at the top of bli_rntm.h

9. bli_rntm_print() now also prints the value of blis_mt.

10. Function prototypes in bli_rntm.h moved to top of file, so that
   bli_rntm_print() can be used within inline functions defined in
   this header file.

11. Comment out bli_init_auto() and bli_finalize_auto() calls in
    Fortran interfaces in frame/compat/blis/thread/b77_thread.c

12. In frame/3/bli_l3_sup_int_amd.c move two calls to set_pack_a and
    set_pack_b functions outside of the auto_factor if statements.

13. Misc code tidying.

AMD-Internal: [CPUPL-2433]
Change-Id: I8342c37fb4e280118e5e55164fbd6ea636f858ee
2022-10-21 07:38:39 -04:00
Harihara Sudhan S
42d631bced Copyright modification
- Added copyright information to modified/newly created
  files that were missing it

Change-Id: If4e73b680246d0363de09587d6dc54bee00ecd71
2022-10-14 12:43:35 +05:30
Shubham Sharma
2d72d4c862 DTRSM Native path AVX 512 kernels are developed
- Added DTRSM AVX512 kernels for lower and upper variants in the native path.
- Changes in framework are made to accommodate these kernels.

AMD-Internal: [CPUPL-2588]
Change-Id: I1f74273ef2389018343c0645870290373ce25efe
2022-10-01 15:58:24 +00:00
Nallani Bhaskar
f1caa8e630 Fixed initialization issues in ddot multithread implementation
Description:

1. The n value per thread and the offset were defined outside the omp
   loop, although they can vary for each thread. Moved them inside the
   omp loop so that each thread computes the correct output.

2. Fixed a few compilation warnings in the dnrm2 avx2 implementation

AMD-Internal: [CPUPL-2606]

Change-Id: Ifba9b3707c3c1a66f31b5e1906ecb68eabef4f81
2022-10-01 14:43:39 +05:30
Mangala V
e440cbc91a Fixed ASAN reported issues in bli_dgemmsup_rd_haswell_asm_6x8m
Address sanitizer reports an error when the rbp register is modified.

Register rbp was stored with rs_a which was used during prefetch
of Matrix A. Usage of rbp is avoided by using rcx register as a
temporary storage register.
Hence rcx is updated with Matrix C address before storing the
computed data.

This fix addresses the issue reported by the GEQP3 API of libflame.

AMD-Internal: [CPUPL-2587]
Change-Id: Ica790259010d8e71528c3d0ab1cd49069c56fc1d
2022-09-30 11:52:29 -04:00
Eleni Vlachopoulou
863b73dfaf Adding AVX2 support for DZNRM2
- For the cases where AVX2 is available, an optimized function is called,
based on Blue's algorithm. The fallback method based on sumsqv is used
otherwise.

- Scaling is used to avoid overflow and underflow.

- Works correctly for negative increments.

- Cleaned up some white space in the AVX2 implementation for DNRM2.

AMD-Internal: [CPUPL-2551]
Change-Id: I0875234ea735540307168fe7efc3f10fe6c40ffc
2022-09-30 07:51:04 -04:00
Arnav Sharma
90f915d3a9 Vectorized and parallelized zdscal routine
- Implemented optimized intrinsic kernel for zdscalv for the cases where AVX2 is supported.
- Also added multithreaded support for the same.
- The optimal number of threads is being calculated on the basis of input size.

AMD-Internal: [CPUPL-2602]
Change-Id: I4d05c3b1cc365a7770703286a89c6dce3875c067
2022-09-30 06:11:07 -04:00
satish kumar nuggu
9c292b79e2 Fixed ASAN reported issues in [s/c]trsm small kernels
Details:
1. Fixed the partial memory-access overflows reported by ASAN for the
variables AlphaVal and ones.
2. Using a 128-bit packed broadcast with 64-bit data types
after type casting would fill garbage data
into the destination register.
3. Fixed this issue by using the set_ps instruction instead of a broadcast.
4. In cases where the n remainder is 1, extra elements were accessed,
which could cause out-of-bounds memory access. Removed the extra element access.

AMD-Internal: [CPUPL-2578][CPUPL-2587]
Change-Id: Iaa918060c66287f2f46bcb9f69e9323f6707cf75
2022-09-30 03:02:07 -04:00
Eleni Vlachopoulou
1c1a0027a8 Bugfix in DNRM2 AVX path
Description:
Enabled DNRM2 AVX path and fixed bug that caused numerical accuracy
errors.

AMD-Internal: [CPUPL-2576]
Change-Id: Ic9fda9d9668bdfe233621f79db6acce518b4d10e
2022-09-29 04:46:04 -04:00
satish kumar nuggu
754627eae9 Addressed uninitialized variables in trsm small algo
1. Addressed the uninitialized variables reported by Coverity for all
datatypes of the trsm small algo.

AMD-Internal: [CPUPL-2542]
Change-Id: Ifae57ef6435493942732526720e6a9d6bec70e71
2022-09-29 01:49:22 -04:00
Mangala V
42434fbc31 Bug fix in TRSM small for S, D, C & Z datatypes in case of MT
1. Corrected B buffer access to use its offset instead of the
   starting address, which is required in case of MT.
2. When num_threads > 1, the B buffer is divided into blocks in the m or n
   dimension based on the side (right or left). Hence it needs to be accessed
   by its offset to reach the start of the block.
3. Currently the B matrix is divided into blocks for each thread and the
   complete matrix A is used by all threads.
   In case of a design change in the future, modified A buffer access by its
   offset to support partitioning of matrix A for MT.
AMD-Internal: [CPUPL-2520]

Change-Id: Ic09e9e945417b86e2bc2e2d4548f65db308cd2ea
2022-09-23 03:30:25 -04:00
Eleni Vlachopoulou
a5891f7ead Adding AVX2 support for DNRM2
- For the cases where AVX2 is available, an optimized function is called,
based on Blue's algorithm. The fallback method based on sumsqv is used
otherwise.

- Scaling is used to avoid overflow and underflow.

- Works correctly for negative increments.

AMD-Internal: [CPUPL-2551]
Change-Id: I5d8976b29b5af463a8981061b2be907ea647123c
2022-09-20 06:05:01 -04:00
Nallani Bhaskar
ea79efa915 Fixed out of bound memory access in sgemmsup zen rv kernels
Details:

1. In the sgemmsup_zen_rv_?x2 kernels the "vmovps" instruction
   is used to load the B matrix in the k loop and k last loop,
   loading 128 bits into xmm rather than the expected 64 bits.

2. Changed the vmovps instruction to the vmovsd instruction,
   which loads only 64 bits into the xmm register.

3. Avoided C memory access by the vfma instruction when multiplying
   with non-beta in corner cases, which required 128-bit access and
   led to out-of-bounds reads. Replaced with vmovq first to
   get the 64-bit data, then performed vfma on the xmm register in rv_6x8m
   and rv_6x4m.

   AMD-Internal: [CPUPL-2472]

Change-Id: Iad397f8f5b5cc607b4278b603b1e0ea3f6b082f2
2022-08-30 01:13:14 -04:00
Arnav Sharma
eb83a0fe9d Enabled ZHER Optimized Path
- While calculating the diagonal and corner elements, the combined
operation of calculating the product of x and x hermitian and
simultaneously scaling it with alpha and adding the result to the matrix
was the cause of increased underflow and overflow errors in netlib
tests.
- So the above calculation is now being done in three steps: scaling x
vector with alpha, then calculating its product with x hermitian and
later adding the final result to the matrix.

AMD-Internal: [CPUPL-2213]
Change-Id: I32df572b013bc3189340662dbf17eddcaec9f0f8
2022-08-29 08:09:42 -04:00
Dipal M Zambare
d3b503bbf2 Code cleanup and warnings fixes
- Removed all compiler warnings as reported by GCC 11 and AOCC 3.2
- Removed unused files
- Removed commented and disabled code (#if 0, #if 1) from some
  files

AMD-Internal: [CPUPL-2460]
Change-Id: Ifc976f6fe585b09e2e387b6793961ad6ef05bb4a
2022-08-29 15:15:40 +05:30
satish kumar nuggu
2114a43df8 Fixes to avoid Out of Bound Memory Access in TRSM small algorithm
Details:
1. Fixed the issues corresponding to Out of bound memory access during
load and store.
2. In Intrinsic code:
   i. AVX2 Registers can hold 4 double elements.
   ii. In the remainder case, the number of elements is less than the
vector register width. Though the required number of elements is less than
4, we were reading and writing in chunks of 4 elements due to
vectorization. This might cause out-of-bounds memory access.
3. Redesigned the code to prevent out-of-bounds access by loading and
storing the exact number of elements required.

AMD-Internal: [SWLCSG-1470]
Change-Id: I786f8023cf5a5f3e5343bea413c59bd0e764df9b
2022-08-26 01:06:10 -04:00
satish kumar nuggu
219c41ded9 ZTRSM Improvements
Details:

1. Optimized ztrsm for small sizes up to 500 in multi-threaded scenarios.
2. Enabled multithreaded execution of the bli_trsm_small implementation
for the double complex data type.
3. Added decision logic to choose between the native and multi-threaded small
paths for sizes up to 500 and threads up to 8.

AMD-Internal: [CPUPL-2340]
Change-Id: I4df9d7e6ee152baa9cf33e58d36e1c17f75a00c1
2022-08-20 05:08:47 -04:00
Shubham Sharma
b8b339416a DGEMMT optimizations
Details:

1. For lower and upper, "B" column major storage variants of gemmt,
   new kernels are developed and optimized to compute only the
   required outputs in the diagonal blocks.
2. In the previous implementation, all the 48 outputs of the given
   6x8 block of C matrix are computed and stored into a temporary
   buffer. Later, the required elements are copied into the final C
   output buffer.
3. Changes are made to compute only the required outputs of the 6x8
   block of C matrix and directly stored in the final C output buffer.
4. With this optimization, we are avoiding copy operation and also
   reducing the number of computations.
5. Customized bli_dgemmsup_rd_haswell_asm_6x8m Kernels specific to
   compute Lower and Upper Variant diagonal outputs have been added.
6. SUP Framework changes to integrate the new kernels have been added.
7. These kernels are part of the SUP framework.

AMD-Internal: [CPUPL-2341]
Change-Id: I9748b2b52557718e7497ecf046530d3031636a63
2022-08-19 12:31:35 -04:00
Shubham Sharma
32c9239c7f Optimization of DGEMMT SUP kernels
Details:
1. Optimized the kernels by replacing the macros with
   the actual computation of required output elements.

AMD-Internal: [CPUPL-2341]
Change-Id: Ieefb80ac9b2dc2955b683710e259cf45d581e1b5
2022-08-18 08:30:19 -04:00
Shubham Sharma
8adef27aca Optimization of DGEMMT SUP kernel for beta zero cases.
Details:
1. In kernels for non-transpose variants, changes
   are made to optimize the cases of beta zero.
2. Validated the changes with BLIS Testsuite,
   GTestSuite(Functionality, Valgrind, Integer Tests)
   and Netlib Tests.
3. Fixed warnings during the build process.

AMD-Internal: [CPUPL-2341]
Change-Id: I8bb53ad619eb2413c999fe18eafd67c75fe1f83a
2022-08-18 08:05:58 -04:00
Harsh Dave
e2e1dadee1 DGEMM Improvements
- We prefetch next panel while packing 8xk panel.
- Modified prefetch offsets for dgemm native and 
  dgemm_small kernel.

AMD-Internal: [CPUPL-2366]

Change-Id: Ife609e789c8b87169c73bb0a30d6f1af20fb30ed
2022-08-16 08:07:30 -04:00
satish kumar nuggu
88e44c64e3 Fixed Memory Leaks in TRSM
1. Fixed the memory leaks in corner cases, which were caused by extra
loads, in all datatypes (s,d,c,z).
2. In remainder cases, instead of loading the required number of elements,
extra elements were loaded, which led to memory leaks. Fixed the leaks by
restricting the number of loads to the required number of elements.

AMD-Internal: [CPUPL-2280]

Change-Id: Ia49a02565e01d5ed05e98090b7773a444587cd8a
2022-08-16 05:08:40 -04:00
Shubham Sharma
f5ef30a44a Fix in DGEMMT SUP kernel
Details:
1. Due to an error in the C output buffer address computation in
   the kernel bli_dgemmsup_rv_haswell_asm_6x8m_6x8_L, invalid
   memory was being accessed. This caused a seg fault in
   libflame netlib testing.
2. Validated the fix with libflame netlib testing.

AMD-Internal: [CPUPL-2341]
Change-Id: I9ca0cf09cf2d177ade73f840054b5028eae3a0ed
2022-08-12 21:03:36 +05:30
Shubham Sharma
4bca7f6f4a DGEMMT optimizations
Details:

1. For lower and upper, non-transpose variants of gemmt, new kernels
   are developed and optimized to compute only the required outputs in
   the diagonal blocks.
2. In the previous implementation, all the 48 outputs of the given
   6x8 block of C matrix are computed and stored into a temporary
   buffer. Later, the required elements are copied into the final C
   output buffer.
3. Changes are made to compute only the required outputs of the 6x8
   block of C matrix and directly stored in the final C output buffer.
4. With this optimization, we are avoiding copy operation and also
   reducing the number of computations.
5. Kernels specific to compute Lower and Upper Variant diagonal
   outputs have been added.
6. SUP Framework changes to integrate the new kernels have been added.
7. These kernels are part of the SUP framework.

AMD-Internal: [CPUPL-2341]
Change-Id: I0ec8f24a0fb19d9b1ef7254732b8e09f06e1486a
2022-08-11 06:16:33 -04:00