2859 Commits

Author SHA1 Message Date
vignbala
775ce1f13c Implemented AVX-512 based 12x4 m-variant SUP kernels for ZGEMM
- Implemented 12x4m column-preferential SUP kernels (main and fringe
  cases). The main kernel dimension is 12x4, and the associated fringe
  kernel dimensions are : 12x3m, 12x2m, 12x1m
                          8x4, 8x3, 8x2, 8x1
                          4x4, 4x3, 4x2, 4x1
                          2x4, 2x3, 2x2, 2x1.

- Included in-register transposition support for C, thus extending
  the storage scheme support to CCC, CCR, RCC and RCR inside the
  milli-kernel.

- Integrated conditional packing of A onto the SUP front end for
  dcomplex datatype. This redirects RRC and CRC storage schemes
  onto the preceding set of SUP kernels through storage scheme
  transformation (RCC and CCC respectively).

- Updated the zen4 context file with the new set of SUP kernels, to
  get enabled appropriately. Furthermore, the context file was updated
  with the AVX-2 dotxv signatures for dcomplex datatype. This redirects
  the fringe cases of type 1x? to the pre-existing AVX-2 GEMV routines.

- Added C prefetching onto L2-cache, and an unroll factor of 4 for the
  k loop in all the kernels.

- Work in progress to include conjugate support and input spectrum
  extension for the AVX-512 SUP kernels. The current thresholds in the zen4
  context are the same as the zen3 thresholds for ZGEMM SUP.

AMD-Internal: [CPUPL-3122]

Change-Id: If40bc4409c6eb188765329508cf1f24c0eb12d1e
2023-04-06 04:49:15 -04:00
Edward Smyth
75c5fd1b66 Bug fixes in ZSCALV
Fixes for a couple of issues:
- Modify alpha argument in kernel call in zscal_blis_impl to
  avoid compiler warning message about discarding 'const'
  qualifier from pointer target type.
- When BLIS_ARCH_TYPE=generic, call to bli_cntx_get_l1v_ker_dt
  in ref_kernels/1/bli_scalv_ref.c for ZSCAL was causing a
  segmentation fault. This was because the cntx was NULL on
  entry to this function. Correct by getting zscal_blis_impl
  to pass the cntx it has initialized for non-Zen codepaths.
  This change has been copied to idamax_blis_impl for
  consistency.

AMD-Internal: [CPUPL-2773]
Change-Id: Ib02e4c1a2a7bf30c208732241d4959f7a2696179
2023-04-05 13:00:58 -04:00
Eleni Vlachopoulou
bf3f5cafa8 BLIS GTestSuite Updates:
- Fix in README.md.
- Updating abs overload for scomplex and dcomplex to avoid overflow by using std::abs.
- Updating comparators to take into account NaNs and Infs when measuring error.

AMD-Internal: [CPUPL-2732]
Change-Id: I8c12bacd9d63b2e914d0a79f337f7525dc16b733
2023-04-05 06:11:34 -05:00
jagar
f9adfa8ee4 Updated CMakeLists.txt to remove cmake generated files
CMake-generated files and executables are cleaned within the
build directory by the "make distclean" command.

Change-Id: I4fd5193e92958122ff10ecc634b42096f3b3716e
2023-04-05 06:11:16 -04:00
Harihara Sudhan S
2e6724262e ZGEMV var 2 bug fix
- Fixed segmentation fault that was seen on non zen and non avx2
  machines.
- cntx object was not passed to the invoked kernel causing a seg
  fault.

AMD-Internal: [CPUPL-3167]
Change-Id: I2640d3f905e78398935cf6ed667b04a6418baa5d
2023-04-05 01:31:24 -04:00
Edward Smyth
9b9142644f BLIS cpuid: Bugfix for BLAS1/BLAS2 when BLIS_ARCH_TYPE is set
BLAS1 and BLAS2 routines may not immediately call bli_init_auto, as
the full cntx and other global data structures may not be required
for all code paths. This may cause a problem if the user sets
BLIS_ARCH_TYPE, as it needs to check the requested value against the
available options configured in cntx. Solution: add a separate
pthread_once call to run bli_gks_init().

AMD-Internal: [CPUPL-3031]
Change-Id: Icd73a8dd161b34b23cc336623d675248f28ed23f
2023-04-04 07:54:31 -04:00
Edward Smyth
1ac03e64b5 BLIS cpuid tidy and bugfix.
Improvements to BLIS cpuid functionality:
- Tidy names of avx support test functions, especially rename
  bli_cpuid_is_avx_supported() to bli_cpuid_is_avx2fma3_supported()
  to more accurately describe what it tests.
- Fix bug in frame/base/bli_check.c related to changes in commit
  6861fcae91

AMD-Internal: [CPUPL-3031]
Change-Id: Iacd8fb0ffbd45288e536fc6314660709055ea2d5
2023-04-03 08:46:37 -04:00
mkadavil
27a9e2a0ff u8s8s32 fringe kernel optimizations.
-The n fringe micro kernels use only a few zmm registers for computing
the output (eg: 6x16 uses 6 zmm registers for output as opposed to 24
used in 6x64). This results in a lot of wasted registers that, if utilized,
can help increase the MR dimension and thus improve the reuse of
registers loaded with B. Based on this concept, the existing n fringe
kernels are modified (6x16 -> 12x16, 6x32 -> 9x32). It is to be noted
that the maximum number of registers is not used, since that results in
cache inefficient code due to the increase in MR and thus more
broadcasts required from unpacked A matrix.
-Compiler flag updates for AOCC build to generate loops with 64 byte
alignment. This has been observed to improve performance slightly when
k dimension is small.

AMD-Internal: [CPUPL-3173]
Change-Id: I199ce75ef71d994ffe0067dac1ed804dce1742ca
2023-04-03 05:35:18 -05:00
eashdash
bd8cd763ff Added NEW LPGEMM TYPE- S8S8S32/S8
1. New LPGEMM type - S8S8S32/S8 is added.
2. New interface, frame and kernel files are added.
3. Frame and kernel files added/modified for S8S8S32/S8 have
   2 operations - Pack B and Mat Mul
4. Pack B kernel routines to pack B matrix for VNNI and compute the sum
   of every column of B matrix to implement the S8S8S32 operation using
   the VNNI instructions.
5. Mat Mul Kernel files to compute the GEMM output using the VNNI.
   Here the A matrix elements are converted from int8 to uint8 (VNNI
   works with A matrix type uint8 only).
6. Post GEMM computation, additional operations are performed on the
   accumulated outputs to get the correct results.
7. With this change, two new LPGEMM APIs are introduced in LPGEMM -
   s8s8s32os32 and s8s8s32os8.
8. All previously added post-ops are supported on S8S8S32/S8 also.
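
The column sums of B mentioned in item 4 are what allow a signed A to be fed to an unsigned-by-signed multiply. A minimal scalar sketch of the idea follows. It assumes the conversion is done by adding an offset of 128 to each A element, which the commit message does not state, and the function name is hypothetical:

```c
#include <stdint.h>

/* Scalar model of an s8 x s8 dot product computed through a u8 x s8
   multiply. The offset of 128 is an assumption made for illustration. */
int32_t dot_s8s8_via_u8(const int8_t *a, const int8_t *b, int k)
{
    int32_t acc = 0;   /* accumulates (a + 128) * b */
    int32_t b_sum = 0; /* sum of the B column, gathered while packing */

    for (int p = 0; p < k; ++p) {
        uint8_t a_u8 = (uint8_t)(a[p] + 128); /* shift int8 range to uint8 */
        acc += (int32_t)a_u8 * (int32_t)b[p];
        b_sum += b[p];
    }
    /* sum((a + 128) * b) = sum(a * b) + 128 * sum(b), so remove the
       extra term after accumulation. */
    return acc - 128 * b_sum;
}
```

Because the correction depends only on B, it can be computed once per column during packing and applied after the accumulation loop.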

AMD-Internal: [CPUPL-3154]
Change-Id: Ib18f82bde557ea4a815a63adc7870c4234bfb9d3
2023-03-31 05:44:54 -04:00
Eleni Vlachopoulou
58f85bb8f1 Adding copyright notice in gtestsuite files.
Change-Id: I5097831eb7a46c56a4a2a32da4d3ee69c8b36cb5
2023-03-29 09:01:48 -04:00
Meghana Vankadari
fb6d4b5b8b BLAS compliance for Level-3 routines
Details:
- Added checks in top-level BLAS layer for alpha = zero. Scale C with
  beta or set C with 0 appropriately if true and return. This ensures
  that input matrices are not referenced, and thus that any Inf or NaN
  values in them are not propagated to output matrix.
- Scalm internally handles the case where both alpha and beta are zero
  by calling setm to populate the output matrix with zeroes.
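
A minimal sketch of the alpha = zero check described above, for a real double-precision, column-major C. The function name and signature are hypothetical and are not the BLIS interface:

```c
#include <stddef.h>

/* Hypothetical helper: handles the alpha == 0 case for a column-major
   m x n matrix C with leading dimension ldc. Returns 1 if the call was
   fully handled here (caller should return), 0 otherwise. */
int gemm_alpha_zero_early_return(size_t m, size_t n, double alpha,
                                 double beta, double *c, size_t ldc)
{
    if (alpha != 0.0)
        return 0; /* A and B are needed; fall through to the real GEMM. */

    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            double *cij = &c[i + j * ldc];
            /* beta == 0 overwrites instead of multiplying, so a NaN or
               Inf already stored in C is not carried into the result. */
            *cij = (beta == 0.0) ? 0.0 : beta * (*cij);
        }
    }
    return 1;
}
```

Since A and B are never read on this path, any Inf or NaN they contain cannot reach C.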

AMD-Internal: [CPUPL-3134]
Change-Id: I14f14419049e7fcc55bf7fd30ddbd3d5a12c4427
2023-03-29 04:36:00 -05:00
Meghana Vankadari
34a17e78ec Added a missing override function declaration for amdzen config
AMD-Internal: [CPUPL-3096]
Change-Id: I5ce1a0f8489d09a7934af89efb2aede564c24c7e
2023-03-29 02:32:49 -04:00
Mangala V
245fdf072c AVX-512 based col-preferred kernels for ZGEMM in native path
- Kernel block size is 12x4
- Updated the zen4 config to enable these kernels in zen4 path.
- Tuned MC,KC,NC for better performance for m/n/k size > 500
- Updated CMakeLists.txt with ZGEMM kernels for windows build.

Kernel supports:
1. Preload and prebroadcast of A and B
2. Prefetch of C matrix
3. K loop is subdivided into multiple loops to maintain distance between C prefetches.
4. Special case when alpha/beta imag component is zero
5. Row/Col/General stride of Matrix C

AMD-Internal: [CPUPL-2998]
Change-Id: I62e3c352d475b1add3f43270805fbcee00e2e440
2023-03-28 23:05:06 -04:00
Eleni Vlachopoulou
04e091fdca BLIS GTestSuite: Link OpenMP if we test serial BLIS, but MKL is used as a reference.
Change-Id: Iacafa5ecf74622fa5e1180a81305cf7a23d79055
2023-03-28 04:43:58 -04:00
Harsh Dave
f5dc3db648 Added AVX512 8xk packing kernel
AVX512 optimised kernel for Double datatype supports
row and column major matrix

Packing kernel is column major implementation
If matrix is row major, we need to transpose block before storing it.
If matrix is column major, we directly store it.

AMD-Internal: [CPUPL-2966]

Change-Id: I8e43f1e2b562c382f44278cd47b3d1e84a4d24c9
2023-03-27 23:18:32 -05:00
Mangala V
62d63eb1ba AVX-512 based 4xk and 12xk packing kernel for dcomplex
AVX512 packing kernel supports:
1. Dcomplex datatype
2. Row and column major matrix

AVX512 packing kernel does not support:
1. General stride matrix
2. Fringe cases (only multiples of 4 or 12 are supported)
3. Conjugate
scal2m will be used for the above unsupported functionality

AVX512 packing kernel is column preferred kernel
If matrix is row major, we need to transpose block before storing it.
If matrix is column major, we directly store it

AMD-Internal: [CPUPL-3088]
Change-Id: I3fcd94248a3a6527c807cccc1b3408db9fe2a737
2023-03-28 00:19:08 +05:30
Arnav Sharma
0bfb0393fd Bug fix for C/ZAXPBY Kernels
- Complex AXPBY kernels gave incorrect output when both alpha and
  beta had non-zero imaginary parts.
- Previously, the scalar code (used to calculate remainder result
  or non-unit increment cases) was directly accessing and updating
  the y-vector pointer, thus resulting in an incorrect output.
  Updated it to operate on a local copy of the current y element
  and store the final result to the y-pointer.
- Also, added operation to store temporary calculation of alpha*x
  in an intermediate vector and then later added to the y vector.
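
The scalar fix can be sketched as below, assuming the operation is y := alpha*x + beta*y without conjugation. The type and function names are hypothetical. Reading y into a local copy first means the imaginary part is computed from the original real part of y, not from one that has already been overwritten:

```c
/* Hypothetical complex type for illustration. */
typedef struct { double real; double imag; } dcomplex_t;

/* y := alpha * x + beta * y, scalar path for remainder elements and
   non-unit increments (positive increments assumed). */
void zaxpby_scalar(int n, dcomplex_t alpha, const dcomplex_t *x, int incx,
                   dcomplex_t beta, dcomplex_t *y, int incy)
{
    for (int i = 0; i < n; ++i) {
        const dcomplex_t xv = x[i * incx];
        const dcomplex_t yv = y[i * incy]; /* local copy of the y element */

        /* Temporary alpha * x, kept separate from y. */
        const dcomplex_t ax = {
            alpha.real * xv.real - alpha.imag * xv.imag,
            alpha.real * xv.imag + alpha.imag * xv.real
        };
        /* Both parts of beta * y use the untouched local copy. */
        const dcomplex_t result = {
            ax.real + beta.real * yv.real - beta.imag * yv.imag,
            ax.imag + beta.real * yv.imag + beta.imag * yv.real
        };
        y[i * incy] = result; /* single store of the final value */
    }
}
```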

AMD-Internal: [CPUPL-3037]
Change-Id: Iddbd3000dcb1505b444b0ad41ab881b055842e1c
2023-03-24 07:32:43 -04:00
Harsh Dave
c1766e312a Added in row storage support for C matrix.
- Added in-register transpose support for c matrix to
support row stored C matrix for dgemm sup.
- Support is added for all edge case kernels.
- FMA are made independent of each other, for faster
computation while storing data back to C matrix.

AMD-Internal: [CPUPL-2966]
Change-Id: I1d13af99a17ee66adbf5f537a4664ade489a7cad
2023-03-24 02:46:21 -04:00
Meghana Vankadari
31a4203c32 Added AVX-512 based col-preferred kernels for DGEMM with optional pack framework
- Main kernel is of size 24x8 and the associated fringe kernels
  added are
  - 24x7m, 24x6m, 24x5m, 24x4m, 24x3m, 24x2m, 24x1m
  - 24x8,  24x7,  24x6,  24x5,  24x4,  24x3,  24x2,  24x1
  - 16x8,  16x7,  16x6,  16x5,  16x4,  16x3,  16x2,  16x1
  -  8x8,   8x7,   8x6,   8x5,   8x4,   8x3,   8x2,   8x1

- For fringe kernels, 24x? kernel handles 16 < m_remainder <  24
                      16x? kernel handles 8  < m_remainder <= 16
                       8x? kernel handles 0  < m_remainder <= 8
- Added a function 'bli_zen4_override_gemm_blkszs' to override
  blocksizes and kernels to be used for SUP for supported storage
  schemes.
- Updated the zen4 config to enable these kernels in zen4 path.
- Thresholds are yet to be derived.
- Updated CMakeLists.txt with DGEMM SUP kernels for windows build.
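
The m_remainder ranges listed above map onto a fringe kernel height as in this small sketch. The function name is hypothetical, and the real dispatch code may be structured differently:

```c
/* Returns the row count of the fringe kernel family (24x?, 16x? or 8x?)
   that covers a given m_remainder, or 0 when nothing is left over. */
int fringe_kernel_rows(int m_remainder)
{
    if (m_remainder > 16) return 24; /* 16 < m_remainder <  24 */
    if (m_remainder > 8)  return 16; /*  8 < m_remainder <= 16 */
    if (m_remainder > 0)  return 8;  /*  0 < m_remainder <= 8  */
    return 0;
}
```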

Kernel-specific details:
- K-loop is unrolled 8 times to facilitate prefetch of B.
- For every load of one column of A, the corresponding column in
  next panel of A is prefetched with T1 hint.
- One column of C is prefetched with T0 hint per iteration of LOOP2.
- TAIL_NITER is derived to be 3.
- For every unroll of k-loop, one row of B is prefetched with T0 hint.
- C-prefetching for row-storage is yet to be added.
- B-prefetching for col-storage is yet to be added.
- Support for C transpose is yet to be added.

AMD-Internal: [CPUPL-2755], [CPUPL-2409]
Change-Id: Ie240c893469032dc2271cbfe00cceccfe6c4ea48
2023-03-24 06:40:36 +00:00
Meghana Vankadari
253ceffb0f Redirecting scalm to setm when alpha is zero
Added a check in scalm framework for alpha=0.
Set the output matrix to zero when alpha=0.
This ensures that any Inf or NaN values in the matrix are not
propagated to the output matrix.

AMD-Internal: [CPUPL-3053]
Change-Id: I62b9b5405be220eb4df97aadda14701abcccb475
2023-03-24 00:59:43 -04:00
Eleni Vlachopoulou
ad7a812db2 Remove quick return for zero increments.
Details:
- To be BLAS compliant, if increment is zero then iterate through the first element n times.
- For n<=0, the correct result (0) is returned so we remove this extra check. This is checked on BLIS-typed interface level.
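
The commit does not name the affected routines, so the sketch below uses a sum of absolute values purely as an illustration. With a zero increment the loop reads the first element n times, and with n <= 0 the loop body never runs, so no separate early return is needed:

```c
#include <stddef.h>

/* Illustrative strided reduction for non-negative incx (hypothetical name). */
double asum_strided(int n, const double *x, int incx)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        /* incx == 0 makes every iteration read x[0]. */
        const double v = x[(ptrdiff_t)i * incx];
        sum += (v < 0.0) ? -v : v;
    }
    return sum; /* 0 when n <= 0 */
}
```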

AMD-Internal: [SWLCSG-1900]
Change-Id: I098bb9560a790050018bc8d8c63b06bfbcc1aebd
2023-03-23 23:35:03 -04:00
Harsh Dave
238d9fda9e Fixed ASAN memory issue due to modifying RBP register
- RBP is the base pointer, which points to the base of the current stack frame.
  The ASAN tool relies on rbp and rsp for stack-related validations, so overwriting
  or modifying the RBP register results in application termination with a
  stack overflow error.
- Removed all the code snippets which were using rbp register for prefetching matrices
  and sometimes loading elements from memory in all of the gemm sup kernels for double
  datatype.
- Removed reference to rbp from register clobber list as well to completely avoid the
  usage of rbp register.

AMD-Internal: [CPUPL-2613, CPUPL-2587]

Change-Id: Idd402d3c644c4dd66e8d4988aede539ad8c77b28
2023-03-23 21:23:44 -04:00
Shubham
dfc95d29fc Enable DTRSM small multithreading path for BLAS interface
- Enabled DTRSM small mt for sizes where performance is better
  than small or native.
- Threshold Tuning for small path is updated.
- Function signature for bli_trsm_small_mt has been made similar
  to bli_trsm_small so that one function pointer can be used for
  all functions.
- Early return condition in DTRSM small for sizes > 1000 has been
  removed so that the sizes for which the small path is taken can be
  decided in the BLAS layer instead of inside the kernel.

AMD-Internal: [CPUPL-2735]
Change-Id: Ieea31343dc660517acc18c92713381a8b84d3a2f
2023-03-23 12:07:22 -04:00
Edward Smyth
873b4f93fd GEMM: Early return when alpha = zero
Add test in top-level GEMM BLAS layer for alpha = zero. Scale C
appropriately if true and return. This ensures that A and B are not
referenced, and thus that any Inf or NaN values in A or B are not
propagated to C. This was tested in some lower level code paths,
but not consistently.

Solution throughout GEMM codebase assumes scalm handles scale value
(beta from GEMM) equal to 0 without propagating inf/NaNs in C matrix.

Also call AOCL_DTL exits and bli_finalize_auto() in a more consistent
way.

AMD-Internal: [CPUPL-3053]
Change-Id: I4009f311951eb1ce9416cf846e9fa93b7c9219cc
2023-03-23 09:24:41 -04:00
Aayush Kumar
5bd2a777ba Fixed compilation failure when configured with --disable-blas
- Moved *_blis_impl function declaration outside the BLIS_ENABLE_BLAS
  guard.
- Changed Makefile to continue to compile bla_ files to get
  *_blis_impl interfaces.
- Modify CBLAS headers, bli_macro_defs.h and bli_util_api_wrap.{c,h}
  to add BLIS_ENABLE_CBLAS guards.
- Comment out BLIS_ENABLE_BLAS guards in various headers and utility
  functions.
- Define BLIS Fortran-style functions lsame_blis_impl and
  xerbla_blis_impl. New macros PASTE_LSAME and PASTE_XERBLA are
  used in bla_*_check headers and some other places to select
  whether to call lsame and xerbla, or the _blis_impl versions.
- Defined various other missing _blis_impl functions.
- In bli_util_api_wrap.c, only define any functions if
  BLIS_ENABLE_BLAS is defined, and only define the subroutine
  versions of functions like dot, nrm2, etc if BLIS_ENABLE_CBLAS
  is defined.
- BLAS layer is needed if CBLAS layer is enabled. Changed header
  files build/bli_config.h.in and bli_blas.h, and configure
  program to help ensure consistency in generated blis.h header
  and configure output.

Undefining BLIS_ENABLE_BLAS_DEFS appears to be broken in UTA BLIS
too, thus BLIS_ENABLE_BLAS_DEFS is currently permanently defined.

AMD-Internal: [CPUPL-3015]

Change-Id: I7c0fe07db85781db46f2c690e174451860b37635
2023-03-23 06:11:52 -04:00
Shubham
323e31649f Added AVX512 8x24 DTRSM native path kernels
- Added DGEMM and DTRSM row preferred micro kernels.
- DTRSM left lower and left upper micro kernels are added.
- DGEMM kernel is optimized for both row stored C and col
  stored C.

AMD-Internal: [CPUPL-2745]

Change-Id: Iecd2c1b0b0972e17e7b31e4b117e49c90def5180
2023-03-23 00:43:15 -04:00
Harihara Sudhan S
4b36529a8b Added vector packing logic to ZGEMV variant 2
- In cases when incy != 1, a buffer is created for the y vector. The
  contents of vector y are scaled by beta and stored in this buffer.
- After performing the compute using the ZAXPYF kernel, the results in
  the y buffer are copied back to the original buffer using ZCOPYV.
- In cases when alpha is zero, we only scale the y vector by beta
  without using the buffer and return.
- The kernels are picked based on the architecture ID. For any
  zen-based architecture, AVX2 kernels are invoked. For others, the
  kernels are invoked based on the context.
- In ZSCAL2V, query for the context if NULL pointer is passed.
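
The packing flow can be sketched as follows. This is a simplified model in real double precision with a plain column loop standing in for the ZAXPYF kernel. The function name, the signature and the restriction to positive incy are assumptions made for the example:

```c
#include <stdlib.h>

/* Sketch: y := beta*y + alpha*A*x for a column-major m x n matrix A
   (leading dimension lda), with y stored at positive stride incy.
   Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int gemv_strided_y(int m, int n, double alpha, const double *a, int lda,
                   const double *x, double beta, double *y, int incy)
{
    if (m <= 0)
        return 0;

    if (alpha == 0.0) {
        /* Only scale y in place; no buffer is needed. */
        for (int i = 0; i < m; ++i) y[i * incy] *= beta;
        return 0;
    }

    double *buf = y;
    if (incy != 1) {
        /* Pack beta*y into a contiguous buffer. */
        buf = malloc((size_t)m * sizeof *buf);
        if (buf == NULL) return -1;
        for (int i = 0; i < m; ++i) buf[i] = beta * y[i * incy];
    } else {
        for (int i = 0; i < m; ++i) y[i] *= beta;
    }

    /* Unit-stride update, one column of A at a time. */
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            buf[i] += alpha * x[j] * a[i + j * lda];

    if (incy != 1) {
        /* Copy the packed result back to the strided y. */
        for (int i = 0; i < m; ++i) y[i * incy] = buf[i];
        free(buf);
    }
    return 0;
}
```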

AMD-Internal: [CPUPL-2773]
Change-Id: If409ca5c438fc2eebe73480c011577088d52c65f
2023-03-22 03:19:18 -04:00
Harihara Sudhan S
a67205d8bd Modified Double precision AMAXV kernel
- The new AMAXV adheres to the BLAS definition of ISAMAX by not handling
  NaN separately. In the previous kernel, NaN was considered the smallest
  element of all the elements in the array.
- The new logic uses two helper functions - bli_vec_absmax_double and
  bli_vec_search_double.
- bli_vec_absmax_double finds the absolute largest element and the index
  range in which the first occurrence of this element can be found.
- bli_vec_search_double returns the index of the first occurrence of the
  absolute value of an element.
- AMAXV uses these two helper functions to find the absolute largest
  element and then searches using bli_vec_search_double in the reduced
  range provided by bli_vec_absmax_double.
- Added condition check for n == 1 in BLAS layer. It is an optimization
  mentioned in the BLAS standard API definition.
- Removed redundant n == 0 condition check from the kernel. This is a
  BLAS exception and is already done in the BLAS layer.
- Removed AVX2 flag check from the BLAS layer. Kernels will be picked
  based on the architecture ID in the new design.
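
A scalar sketch of the two-step search, for unit stride only. The helper names are hypothetical stand-ins for bli_vec_absmax_double and bli_vec_search_double, whose real signatures are not given in the commit. Unlike the real helper, this absmax does not return an index range, so the search scans the whole vector. The index returned is 0-based:

```c
/* Step 1: largest absolute value in x (n >= 1). Comparisons against
   NaN are false, so NaN gets no special treatment. */
double vec_absmax(int n, const double *x)
{
    double best = (x[0] < 0.0) ? -x[0] : x[0];
    for (int i = 1; i < n; ++i) {
        const double a = (x[i] < 0.0) ? -x[i] : x[i];
        if (a > best) best = a;
    }
    return best;
}

/* Step 2: first index whose absolute value equals target, or -1. */
int vec_search_abs(int n, const double *x, double target)
{
    for (int i = 0; i < n; ++i) {
        const double a = (x[i] < 0.0) ? -x[i] : x[i];
        if (a == target) return i;
    }
    return -1;
}

/* Returns the 0-based index of the first element with the largest
   absolute value, or -1 when n < 1. */
int amaxv_index(int n, const double *x)
{
    if (n < 1) return -1;
    if (n == 1) return 0; /* shortcut for single-element vectors */
    const int idx = vec_search_abs(n, x, vec_absmax(n, x));
    return (idx < 0) ? 0 : idx;
}
```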

AMD-Internal: [CPUPL-2773]
Change-Id: Ida2dae84a60742e632dc810ab1b7b80fc354e178
2023-03-21 05:18:05 -04:00
Eleni Vlachopoulou
155a64e734 Introducing upgraded BLIS GTestSuite.
Key features:
- able to test both static and dynamic libraries
- able to test BLAS, CBLAS and BLIS-typed interface
- can use any CBLAS library for reference results
- can build and/or run tests depending on the BLAS level or a specific API

AMD-Internal: [CPUPL-2732]
Change-Id: Ibe0d7938e06081526bbc54d3182ac7d17affdaf6
2023-03-21 03:17:51 +05:30
Harsh Dave
f7b9bc734e Added AVX512 24xk packing kernel
- Added AVX-512 24xk packing kernel.
- Prefetching is yet to be added.

AMD-Internal: [CPUPL-2966]

Change-Id: Iaf973788b51172b48ebb73e8f24b8f4020d94de2
2023-03-17 02:00:00 -05:00
mkadavil
3d74b62e60 Lpgemm threading and micro-kernel optimizations.
-Certain sections of the f32 avx512 micro-kernel were observed to
slow down when more post-ops are added. Analysis of the binary
pointed to false dependencies in instructions being introduced in
the presence of the extra post-ops. Addition of vzeroupper at the
beginning of ir loop in f32 micro-kernel fixes this issue.
-F32 gemm (lpgemm) thread factorization tuning for zen4/zen3 added.
-Alpha scaling (multiply instruction) by default was resulting in
performance regression when k dimension is small and alpha=1 in s32
micro-kernels. Alpha scaling is now only done when alpha != 1.
-s16 micro-kernel performance was observed to be regressing when
compiled with gcc for zen3 and older architecture supporting avx2.
This issue is not observed when compiling using gcc with avx512
support enabled. The root cause was identified to be the -fgcse
optimization flag in O2 when applied with avx2 support. This flag is
now disabled for zen3 and older zen configs.

AMD-Internal: [CPUPL-3067]
Change-Id: I5aef9013432c037eb2edf28fdc89470a2eddad1c
2023-03-16 11:44:51 +05:30
eashdash
e36f699939 Implemented ERF Based GeLU Activation for LPGEMM and SGEMM
1. Implemented efficient AVX-512, AVX-2 and SSE-2 version of the
   error function - ERF
2. Added error function based GeLU activation post-ops for the
   S32, S16 and BF16 (LPGEMM) and SGEMM APIs.
3. Changes for this include frame and micro-kernel level changes in
   addition to adding the macro based function definitions of the
   ERF function in the math-utils and gelu header files.
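
For reference, the standard definition of the erf-based GeLU, with \Phi the standard normal CDF, is shown below. It is assumed here that the post-op evaluates this exact form and not the tanh approximation:

```latex
\operatorname{GeLU}(x) = x\,\Phi(x)
  = \frac{x}{2}\left(1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right)
```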

AMD-Internal: [CPUPL-3036]
Change-Id: Ie50f6dcabf8896b7a6d30bbc16aa44392cc512be
2023-03-13 06:10:31 -04:00
Edward Smyth
1617589d24 Add consistent NaN/Inf handling in sumsqv. (#668)
Details:
- Changed sumsqv implementation as follows:
  - If there is a NaN (either real or imaginary), then return a sum of
    NaN and unit scale.
  - Else, if there is an Inf (either real or imaginary), then return a
    sum of +Inf and unit scale.
  - Otherwise behave as normal.
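
The precedence rules can be sketched for a real vector as follows. The function name is hypothetical, and the ordinary path is simplified to an unscaled sum of squares, whereas the real routine maintains a running scale:

```c
#include <math.h>

/* On return, (*scale)^2 * (*sumsq) represents the sum of squares of x. */
void sumsqv_sketch(int n, const double *x, double *scale, double *sumsq)
{
    int saw_inf = 0;
    double acc = 0.0;

    for (int i = 0; i < n; ++i) {
        if (isnan(x[i])) {
            /* NaN wins over everything else, including Inf. */
            *scale = 1.0;
            *sumsq = NAN;
            return;
        }
        if (isinf(x[i]))
            saw_inf = 1; /* keep scanning in case a NaN follows */
        else
            acc += x[i] * x[i];
    }
    *scale = 1.0;
    *sumsq = saw_inf ? INFINITY : acc;
}
```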

(cherry picked from commit b861c71b50)

AMD-Internal: [SWLCSG-1900]
Change-Id: Ic7ba9cad1fbaf11823b9ba96e72a4ddd973db5b6
2023-03-09 06:36:44 -05:00
Vignesh Balasubramanian
c53b0c96ec Extended the support for beta scaling in 3x4 RD kernels of ZGEMM SUP
- Extended the existing support for handling beta scaling
  in the fringe cases of 3x4 RD kernels in ZGEMM SUP. The
  added support ensures that NaN values initialized in C do
  not propagate to the result when beta is 0.
- The support has been added to fringe cases common to, as well
  as specific to the m and n variants of the RD kernels.

AMD-Internal: [CPUPL-3053] [SWLCSG-1900]
Change-Id: I8e617ac505144c3ea3a70556413d264f11dfc9a9
2023-03-08 06:16:58 -05:00
Eleni Vlachopoulou
1416ab6b5e Bugfix on the memory access for nrm2 AVX2 implementations.
- Since the definition of negative increments is different between BLAS and BLIS,
  there was a bug in how the memory was accessed when we were copying the elements
  of a vector with negative increments. Updated the code under the assumption that
  when negative increments are set, the vector is being accessed starting from the end.
  For the BLAS interface, there is an intermediate conversion before calling into the blis layer.
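
A sketch of the intermediate conversion, based on the usual conventions: in BLAS the pointer always refers to the lowest address of the vector, so with a negative increment the first logical element is stored last, while in the BLIS convention the pointer refers to the first logical element and the negative stride walks backwards from it. The function name is hypothetical:

```c
#include <stddef.h>

/* Converts a BLAS-style (pointer, negative increment) pair into a pointer
   to the first logical element, which can then be traversed as
   x0[i * incx] for i = 0 .. n-1. */
const double *blas_to_first_logical(int n, const double *x, int incx)
{
    if (incx < 0 && n > 0)
        return x + (ptrdiff_t)(n - 1) * (-(ptrdiff_t)incx);
    return x;
}
```

For n = 3 and incx = -1 the returned pointer is x + 2, so the elements are visited in the order x[2], x[1], x[0].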

Change-Id: I08343472b418733fad6f7add9e90aa96cdf68285
AMD-Internal: [SWLCSG-1900]
2023-03-08 04:21:36 -05:00
mkadavil
7dca25d056 Fixed decision logic in choosing between sup vs native GEMM code paths
-Currently the storage preference of native kernel is used to determine
appropriate "m" and "n" which are used in native/SUP threshold logic.
This assumption works only when "sup" & native kernel preferences are
the same. However, this need not be the case. This fix calculates
the correct "m" and "n" irrespective of the native/sup kernel
preferences thereby improving accuracy of this decision logic.

AMD-Internal: [CPUPL-3039]
Change-Id: I88c5a6838aeed867effdbbdfc41fbf2e88d4e52a
2023-03-02 02:12:42 -05:00
Harihara Sudhan S
d4901f53ce Improved performance of double complex AXPYV kernel
- Increased unroll by reusing X registers that were previously used
  for performing shuffle.
- Added loops with smaller increment steps for better problem
  decomposition.
- Added X vector and Y vector prefetch to the kernel.
- Removed redundant code that handles fringe in incx = 1 and
  incy = 1. This remainder will be performed by the loop that handles
  non-unit stride cases.
- Vectorized loops that handle non-unit stride cases using SSE
  instructions.

AMD-Internal: [CPUPL-2773]
Change-Id: Ifb5dc128e17b4e21315789bfaa147e3a7ec976f0
2023-03-02 00:35:42 -05:00
Harsh Dave
222e00e840 Obliterated usage of rbp register in SUP gemm kernel
- Mx4 edge kernels were overwriting rbp
registers for prefetches.
- Since rbp along with rsp defines stack frame,
it resulted in stack overflow issue.
- Replaced rbp with rdx register for prefetches.

AMD-Internal: [CPUPL-2987]
Change-Id: I4e52cf691b70be5ab63f562d7630d640b29e1cfd
2023-03-01 11:09:57 -05:00
Harihara Sudhan S
299bed3fa8 Vectorized SCAL2V for double complex
- Added SCAL2V kernel that uses AVX2 and SSE instructions for
  vectorization.
- The routine returns early when the vector dimension is zero
  or incx <= 0 or incy <= 0.
- The kernel takes one among the two available paths based on
  conjugation requirement of X vector.
- VZEROUPPER is added before transitioning from AVX2 to SSE.
- Added function pointer to ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts.
- Added the new SCAL2V file to the CMake list.

AMD-Internal: [CPUPL-2773]
Change-Id: I2debbfab31d41347786c3a1bae5723d092c202e9
2023-02-28 06:59:23 -05:00
Shubham
b84157fed6 Added AVX512 DTRSM small RLNN/RUTN variant kernels
- 8x8 kernels are used for DTRSM SMALL
- Implemented fringe cases with below block sizes
   8x8, 8x4, 8x3, 8x2, 8x1
   4x8, 4x4, 4x3, 4x2, 4x1
   3x8, 3x4, 3x3, 3x2, 3x1
   2x8, 2x4, 2x3, 2x2, 2x1
   1x8, 1x4, 1x3, 1x2, 1x1

AMD-Internal: [CPUPL-2745]

Change-Id: Ifb8cfba6958e1c89ddbfa18893127ab6d44cc367
2023-02-28 01:40:03 +05:30
mkadavil
1f2447f800 Post-ops support for f32 gemm(aocl_gemm).
- Bias add, relu, parametric relu and gelu post-ops support added in all
f32 gemm micro-kernels. These post-ops are implemented for both AVX512
and AVX2 ISA based on the micro-kernel flavor. The support is added for
both row and column major cases.
- Lpgemm bench updates to support f32 post-ops.

AMD-Internal: [CPUPL-3032]
Change-Id: Ie6840b9d4e52d2086c1b5ff2e1de80dc0cad5476
2023-02-23 18:58:59 +05:30
Arnav Sharma
5baf38b76d K-Loop Unrolling for 6x64 SGEMM SUP RV kernels
- Added k-loop unrolling by a factor of 4 to the following SGEMM
  SUP RV kernels:
	- 5x48, 5x32, 5x16
	- 4x64, 4x48, 3x32, 4x16
	- 3x48, 3x32, 3x16
	- 2x64, 2x48, 2x32, 2x16
	- 1x64, 1x48, 1x32, 1x16
	- 6x64n, 5x64n, 3x64n, 2x64n, 1x64n
- Removed unused variables which were resulting in warnings during
  compilation.
- Added a newline at the end of header files to resolve warnings
  shown during compilation.

AMD-Internal: [CPUPL-3002]
Change-Id: Iab6cf329f6d7fbd7544b5c8837e493069e8c9921
2023-02-23 04:08:38 -05:00
Arnav Sharma
3d4611ab8b Updated CMakeLists.txt with SGEMM kernels
- Added SGEMM kernel files to CMakeLists.txt for Windows build.

AMD-Internal: [CPUPL-3023]
Change-Id: I2ab6d23abd2b96574a6d3540fa7b62bc20c92daf
2023-02-22 11:22:53 +00:00
mkadavil
8dff49837d Lpgemm source restructuring to support amdzen config.
-Currently lpgemm can only be built using either zen3 or zen4 config.
The lpgemm kernel code is re-structured to support amdzen, and thus
multi machine deployment.
-The micro-kernel calls (gemm and pack) are currently hardcoded in the
lpgemm framework. This is removed and a new lpgemm_cntx based dispatch
mechanism is designed to support runtime configurability for
micro-kernels.

AMD-Internal: [CPUPL-2965]
Change-Id: I4bbcb4e5db767def1663caf5481f0b4c988149ef
2023-02-21 08:35:38 -05:00
Harihara Sudhan S
dfacb47125 Code cleanup to remove multiple function declaration
- Removed repetitive function declaration of GEMM small kernels
  from the C files
- Function declaration of these kernels exist in the header files
  where the kernels are supposed to be declared.

AMD-Internal: [CPUPL-3003]
Change-Id: Ic10e66691c0742ce519bcc3fe4a12ec5c5052b63
2023-02-21 00:35:31 -05:00
Shubham
1faee9f89e Fixed 32xk AVX512 double precision pack kernel
- Currently the pointer received as function argument is
  used for packing which causes only a partial copy of
  input buffer to output buffer due to strange optimizations
  by compiler.
- To fix this, instead of using a normal pointer for output
  buffer, we define a "restrict" local pointer variable.
- "restrict" keyword tells the compiler that the pointer is
  the only way to access the object pointed by the pointer.
- By defining "restrict" local pointer pointing to output
  buffer, the mysterious problem of incomplete copy has
  been solved.
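
The pattern described above looks roughly like this. The function is a hypothetical stand-in for the packing kernel and only shows where the "restrict" local pointer is introduced. It does not reproduce the original miscompilation:

```c
#include <stddef.h>

/* Copies len doubles from src to dst through a restrict-qualified local
   pointer (C99). The caller must guarantee that src and dst do not overlap. */
void pack_panel_copy(const double *src, double *dst, size_t len)
{
    /* Local alias for the output buffer: the compiler may assume that the
       memory written through "out" is not accessed through any other pointer
       inside this function. */
    double *restrict out = dst;

    for (size_t i = 0; i < len; ++i)
        out[i] = src[i];
}
```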

Change-Id: Ie2355beb1d43ff4b60b940dd88c4e2bf6f361646
2023-02-20 21:18:54 -05:00
Shubham
3ae84c98fd Fixed seg fault in Testsuite for DTRSM micro kernel
- In zen4 arch TRSM and GEMM have different blocksizes.
  TRSM call will update blocksize in global cntx object
  which is incorrect for GEMM, when GEMM and TRSM are
  called in parallel.
- Hence using a local copy of cntx which holds blocksizes
  would help.
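
The idea of the fix, sketched with a hypothetical two-field context. The real cntx_t carries far more state, and the names below are made up for the example:

```c
/* Hypothetical, heavily reduced stand-in for a BLIS context. */
typedef struct { int mr; int nr; } blksz_cntx_t;

/* Returns a by-value copy of the global context with TRSM-specific
   blocksizes applied. The global object is only read, never written,
   so a GEMM running in parallel still sees its own blocksizes. */
blksz_cntx_t make_trsm_local_cntx(const blksz_cntx_t *global_cntx,
                                  int trsm_mr, int trsm_nr)
{
    blksz_cntx_t local = *global_cntx; /* private copy for this call */
    local.mr = trsm_mr;
    local.nr = trsm_nr;
    return local;
}
```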

AMD-Internal: [CPUPL-3019]
Change-Id: I5f0f5675b3917d2a11d582ac626ca5d8f4752c53
2023-02-20 05:34:42 +05:30
bhaskarn
91a9968a5e Developed intrinsic based f32 kernels in lpgemm
Description:

1. Developed row variant intrinsic Kernels for float32/sgemm
   which are called from lpgemm api aocl_gemm_f32f32f32of32()

2. 6x64m, 6x48m, 6x32m kernels and respective fringe kernels are
   developed using avx512.

3. 6x16m main kernel and respective n fringe and mn fringe are
   developed based on avx2 and avx

4. Modularizing, K loop unroll, perf tuning, post-ops and dynamic
   dispatch are planned next

5. When leading dims are greater than dims, bench_lpgemm needs
   to be updated to test it; this is planned next.

Change-Id: I54c78fef639ea109d6ef2c2b05c07ce396c81370
2023-02-20 01:11:22 -05:00
Arnav Sharma
46965dfc57 Developed 6x64 SGEMM row-preferred kernels
- Added kernels for all rv and rd variants.
- Main kernel is of size 6x64, and the associated fringe kernels
  added are
	- 4x64, 2x64, 1x64
	- 6x32, 4x32, 2x32, 1x32
	- 6x16, 4x16, 2x16, 1x16
- Updated the zen4 config to enable these kernels in zen4 path.
- Added C-prefetching to 6x? row-stored main kernels.
- C-prefetching for column storage yet to be added.
- K-loop unrolling for fringe kernels yet to be added.

AMD-Internal: [CPUPL-3002]
Change-Id: Ide18412cc6178b43a12a3bc7a608ce9d298fb2e4
2023-02-19 23:56:58 -05:00
Harihara Sudhan S
535b49f150 Vectorized COPYV for double complex
- Added ZCOPYV kernel that uses AVX2 and SSE instructions for
  vectorization.
- The routine returns early when the vector dimension is zero.
- The kernel takes one among the two available paths based on
  conjugation requirement of X vector.
- VZEROUPPER is added before transitioning from AVX2 to SSE.
- Added function pointer to ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts.

AMD-Internal: [CPUPL-2773]
Change-Id: Ibd8a2de42060716395ef698d753c8462654cc0f0
2023-02-16 09:32:10 -05:00