265 Commits

Author SHA1 Message Date
Harsh Dave
5bdf5e2aaa Optimized AVX2 DGEMM SUP and small edge kernels.
- Re-designed the new edge kernels to use masked load-store
  instructions for handling corner cases.

- Mask load-store instruction macros are added.
  vmovdqu, VMOVDQU for setting up the mask.
  vmaskmovpd, VMASKMOVPD for masked load-store

- Following edge kernels are added for 6x8m dgemm sup.
  n-left edge kernels
  - bli_dgemmsup_rv_haswell_asm_6x7m
  - bli_dgemmsup_rv_haswell_asm_6x5m
  - bli_dgemmsup_rv_haswell_asm_6x3m

  m-left edge kernels
  - bli_dgemmsup_rv_haswell_asm_5x7
  - bli_dgemmsup_rv_haswell_asm_4x7
  - bli_dgemmsup_rv_haswell_asm_3x7
  - bli_dgemmsup_rv_haswell_asm_2x7
  - bli_dgemmsup_rv_haswell_asm_1x7

  - bli_dgemmsup_rv_haswell_asm_5x5
  - bli_dgemmsup_rv_haswell_asm_4x5
  - bli_dgemmsup_rv_haswell_asm_3x5
  - bli_dgemmsup_rv_haswell_asm_2x5
  - bli_dgemmsup_rv_haswell_asm_1x5

  - bli_dgemmsup_rv_haswell_asm_5x3
  - bli_dgemmsup_rv_haswell_asm_4x3
  - bli_dgemmsup_rv_haswell_asm_3x3
  - bli_dgemmsup_rv_haswell_asm_2x3
  - bli_dgemmsup_rv_haswell_asm_1x3

- For 16x3 dgemm_small, m_left computation is handled
  with masked load-store instructions to avoid the overhead
  of conditional checks for edge cases.

- It improves performance by reducing branching overhead
  and by being more cache friendly.
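
As a rough illustration of the masked load-store idea (not the assembly kernels themselves), the sketch below emulates a 4-lane masked load and store in Python. The helper names and the 4-lane width are assumptions made for this example; the point is that the tail of the data goes through the same full-width code path, with inactive lanes neither read nor written.

```python
LANES = 4  # assumed: number of doubles in one 256-bit AVX2 register


def make_mask(remaining):
    """Hypothetical helper: lane i is active when i < remaining."""
    return [i < remaining for i in range(LANES)]


def masked_load(buf, offset, mask):
    """Inactive lanes yield 0.0 and memory behind them is never touched."""
    return [buf[offset + i] if on else 0.0 for i, on in enumerate(mask)]


def masked_store(buf, offset, mask, values):
    """Only active lanes are written back."""
    for i, on in enumerate(mask):
        if on:
            buf[offset + i] = values[i]


def scale_in_place(buf, alpha):
    """Scale every element; the tail reuses the full-width loop body."""
    n = len(buf)
    for offset in range(0, n, LANES):
        mask = make_mask(n - offset)
        v = masked_load(buf, offset, mask)
        masked_store(buf, offset, mask, [alpha * x for x in v])
    return buf
```

With 7 elements, the second iteration runs with only three active lanes instead of branching into a separate scalar tail.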

AMD-Internal: [CPUPL-3574]

Change-Id: I976d6a9209d2a1a02b2830d03d21d200a5aad173
2023-08-07 07:30:50 -04:00
Vignesh Balasubramanian
758ec3b5ca ZGEMM optimizations for cases with k = 1
- Implemented bli_zgemm_4x4_avx2_k1_nn( ... ) kernel to replace
  bli_zgemm_4x6_avx2_k1_nn( ... ) kernel in the BLAS layer of
  ZGEMM. The kernel is built for handling the GEMM computation
  with inputs having k = 1, and the transpose values for A and
  B as N.

- The kernel dimension has been changed from 4x6 to 4x4,
  due to the following reasons :

  - The 1xNR block of B in the n-loop can be reused over multiple
    MRx1 blocks of A in the m-loop during computation. Similar
    analogy exists for the fringe cases.

  - Every 1xNR block of B was scaled with alpha and stored in
    registers before traversing in the m-dimension. Similar change
    was done for fringe cases in n-dimension.

  - These registers should not be modified during compute, hence
    the kernel dimension was changed from 4x6 to 4x4.

- The check for early exit (as mandated by BLAS) has been
  removed, since it is already present in the BLAS layer.

- The check for parallel ZGEMM has been moved post the redirection to
  this kernel, since the kernel is single-threaded.

- The bli_kernels_zen.h file was updated with the new kernel signature.

AMD-Internal: [CPUPL-3622]
Change-Id: Iaf03b00d5075dd74cc412290d77a401986ba0bea
2023-08-07 15:10:08 +05:30
Harihara Sudhan S
c97471dce0 Added AVX512 ZDSCALV kernel
- Added AVX512-based kernel for ZDSCAL. This will be dispatched from
  the BLAS layer for machines that have AVX512 flags.
- In AVX2 kernel for ZDSCALV, vectorized fringe compute using SSE
  instructions.
- Removed the negative incx handling checks from the blis_impli layer
  of ZDSCAL as BLAS expects early return for incx <= 0.

AMD-Internal: [CPUPL-3648]
Change-Id: I820808e3158036502b78b703f5f7faa799e5f7d9
2023-08-06 01:51:47 -04:00
Harihara Sudhan S
b126c9943b ZSCALV kernel optimization
- ZSCALV kernel now uses fmaddsub intrinsics instead of mul
  followed by addsub intrinsics.
- Removed the negative incx handling checks from the BLAS impli
  layer as BLAS expects early return for incx <= 0.
- Moved all exceptions in the kernel to the BLAS impli layer.
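
To show why a single fmaddsub can finish a complex scaling step, here is a small Python emulation on interleaved [re, im, ...] data. The function names are hypothetical and the lane layout is an assumption about how the kernel arranges its registers; only the arithmetic identity is being illustrated.

```python
def fmaddsub(a, b, c):
    """Emulate packed fmaddsub: even lanes a*b - c, odd lanes a*b + c."""
    return [a[i] * b[i] - c[i] if i % 2 == 0 else a[i] * b[i] + c[i]
            for i in range(len(a))]


def zscal_interleaved(x, alpha):
    """Scale interleaved complex data [re, im, re, im, ...] by alpha."""
    n = len(x)
    alpha_re = [alpha.real] * n                 # broadcast real part
    alpha_im = [alpha.imag] * n                 # broadcast imaginary part
    swapped = [x[i ^ 1] for i in range(n)]      # swap re/im inside each pair
    t = [swapped[i] * alpha_im[i] for i in range(n)]
    # even lane: re*ar - im*ai, odd lane: im*ar + re*ai
    return fmaddsub(x, alpha_re, t)
```

The older sequence would compute `x * alpha_re` with a separate multiply and then combine it with `t` using addsub; the fused form folds that multiply into the final instruction.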

AMD-Internal: [SWLCSG-2224]
Change-Id: I03b968d21ca5128cb78ddcef5acfd5e579b22674
2023-08-04 06:57:18 -04:00
Eleni Vlachopoulou
9c613c4c03 Windows CMake bugfix in object libraries for shared library option
Defining BLIS_IS_BUILDING_LIBRARY if BUILD_SHARED_LIBS=ON for the object libraries created in kernels/ directory.
The macro definition was not propagated from high level CMake, so we need to define explicitly for the object libraries.

AMD-Internal: [CPUPL-3241]
Change-Id: Ifc5243861eb94670e7581367ef4bc7467c664d52
2023-05-24 17:30:16 +05:30
Edward Smyth
ea2eea5097 BLIS: Missing clobbers (batch 1)
Add missing clobbers in first batch of assembly kernels:
- zen3 bli_gemmsup*
- bli_zgemm_zen4_asm_12x4
- bli_gemmsup_rv_haswell_asm_sMx6

AMD-Internal: [CPUPL-3456]
Change-Id: I33c321043a197b2b885cfd6cd589532fc633a6a1
2023-05-23 11:51:18 -04:00
Mangala V
5f5bc24989 Bug fix: AVX2 code being invoked on non-avx2 machine for ZGEMM API
Prevented calling avx2 based bli_zgemm_ref_k1_nn code on
non-supported systems.
Changed the name of the function bli_zgemm_ref_k1_nn to bli_zgemm_4x6_avx2_k1_nn().
Changed the name of the function bli_dgemm_ref_k1_nn to bli_dgemm_8x6_avx2_k1_nn().

Thanks to Kiran Varaganti <Kiran.Varaganti@amd.com>
for identifying and helping to fix the issue.

AMD-Internal: [CPUPL-3352]
Change-Id: I02530ab197ed84c96cbad4f7dd56eedca0109c35
2023-05-21 23:13:46 +05:30
Shubham Sharma
26e120ea25 Fixed diagonal packing for C/Z TRSM small
- In C/Z TRSM small, packing in case of unit diagonal
  is not handled properly.
- Diagonal elements are still being read even in case of
  unit diagonal.
- This causes "Conditional jump or move depends on
  uninitialised value" error during valgrind tests.
- To fix this, diagonal elements should not be read
  in case of unit diagonal.

AMD-Internal: [CPUPL-3406]
Change-Id: If3d6965299998a83d87f3a032f654fc7f8c43d4e
2023-05-18 07:57:21 -04:00
Eleni Vlachopoulou
bf26b8ffbc Removing /arch:AVX2 flag from high-level CMake
- Previously, this flag was set as a default in the high-level CMakeLists.txt, which means that it was used to build everything (all files and all subdirectories), including ref_kernels and testsuite. All files were also added as target sources for this project and compiled with the same flags.
- Now, we create object files from the sources in the kernels/ directory and add the AVX2 flag to those object files explicitly. Only those files get this flag, and it is no longer used to compile ref_kernels, etc.
- This is a quick solution to enable runs on non-AVX2 machines.

AMD-Internal: [CPUPL-3241]
Change-Id: Id569b26ffeea40eaa36ab4465b0c52b6446d7650
2023-04-28 09:22:13 -04:00
Edward Smyth
7e50ba669b Code cleanup: No newline at end of file
Some text files were missing a newline at the end of the file.
One has been added.

Also correct file format of windows/tests/inputs.yaml, which
was missed in commit 0f0277e104

AMD-Internal: [CPUPL-2870]
Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549
2023-04-21 10:02:48 -04:00
Edward Smyth
0f0277e104 Code cleanup: dos2unix file conversion
Source and other files in some directories were a mixture of
Unix and DOS file formats. Convert all relevant files to Unix
format for consistency. Some Windows-specific files remain in
DOS format.

AMD-Internal: [CPUPL-2870]
Change-Id: Ic9a0fddb2dba6dc8bcf0ad9b3cc93774a46caeeb
2023-04-21 08:41:16 -04:00
eashdash
a72fff2be9 Added NEW LPGEMM TYPE- s8s8s16os16 and s8s8s16os8
1. New LPGEMM type - s8s8s16os16 and s8s8s16os8 are added.
2. New interface, frame and kernel files are added.
3. Frame and kernel level files added and modified for s8s8s16
4. s8s8s16 type involves design changes of 2 operations -
   Pack B and Mat Mul
5. Pack B kernel routines to pack B matrix for s16 FMA and compute the
   sum of every column of B matrix to implement the s8s8s16 operation
   using the s16 FMA instructions.
6. Mat Mul Kernel files to compute the GEMM output using s16 FMA.
   Here the A matrix elements are converted from int8 to uint8 (s16 FMA
   works with A matrix type uint8 only) by adding extra 128 to
   every A matrix element
7. Post GEMM computation, additional operations are performed on the
   accumulated outputs to get the correct results.
   Final C = C - ( (sum of column of B matrix) * 128 )
   This is done to compensate for the addition of extra 128 to every
   A matrix element
8. With this change, two new LPGEMM APIs are introduced in LPGEMM -
   s8s8s16os16 and s8s8s16os8.
9. All previously added post-ops are supported on s8s8s16os16/os8 also.
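
The compensation step can be checked with a small sketch. It uses plain Python integers, so the s16 accumulator width and any saturation in the real kernels are deliberately ignored; the function names are hypothetical.

```python
def matmul(a, b):
    """Plain integer matrix product of row-major nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(inner))
             for j in range(cols)] for i in range(rows)]


def s8s8_via_u8s8(a_s8, b_s8):
    """Compute A*B for signed A using only an unsigned-A product."""
    # Shift A from int8 [-128, 127] to uint8 [0, 255].
    a_u8 = [[v + 128 for v in row] for row in a_s8]
    # Column sums of B, computed once (the log does this while packing B).
    col_sum = [sum(b_s8[p][j] for p in range(len(b_s8)))
               for j in range(len(b_s8[0]))]
    c = matmul(a_u8, b_s8)
    # Final C = C - (column sum of B) * 128 undoes the shift.
    return [[c[i][j] - 128 * col_sum[j] for j in range(len(col_sum))]
            for i in range(len(c))]
```

The identity holds because (A + 128) * B = A * B + 128 * (column sums of B), applied to every row of the result.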

AMD-Internal: [CPUPL-3234]
Change-Id: I3cc23e3dcf27f215151dda7c8db29b3a7505f05c
2023-04-21 05:30:38 -04:00
mkadavil
3572baa9d3 aocl_softmax_f32 api's for softmax computation as part of lpgemm.
-Softmax is often used as the last activation function in a neural
network - softmax(xi) = exp(xi)/(exp(x0) + exp(x1) + ... + exp(xn)).
This step happens after the final low precision gemm computation,
and it helps to have the softmax functionality that can be invoked
as part of the lpgemm workflow. In order to support this, a new api,
aocl_softmax_f32 is introduced as part of aocl_gemm. This api
computes element-wise softmax of a matrix/vector of floats. This api
invokes ISA specific vectorized micro-kernels (vectorized only when
incx=1), and a cntx based mechanism (similar to lpgemm_cntx) is used
to dispatch to the appropriate kernel.
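
For reference, the softmax formula above can be written as the following scalar sketch. Subtracting the maximum before exponentiating is a common stabilisation trick added here as an assumption; the log does not say whether aocl_softmax_f32 does this.

```python
import math


def softmax(x):
    """softmax(xi) = exp(xi) / sum_j exp(xj), over a list of floats."""
    m = max(x)                           # assumed stabilisation step
    e = [math.exp(v - m) for v in x]
    total = sum(e)
    return [v / total for v in e]
```

The shift by `m` cancels in the ratio, so the result matches the formula while avoiding overflow in `exp` for large inputs.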

AMD-Internal: [CPUPL-3247]
Change-Id: If15880360947435985fa87b6436e475571e4684a
2023-04-21 05:26:08 -04:00
Edward Smyth
6835205ba8 Code cleanup: spelling corrections
Corrections for spelling and other mistakes in code comments
and doc files.

AMD-Internal: [CPUPL-2870]
Change-Id: Ifbb5df7df2d6312fe73e06ee6d41c00b16c593ce
2023-04-19 12:44:56 -04:00
mkadavil
99d10c3f88 Low precision gemm u8s8s16 downscale optimization.
-Similar to downscale optimizations made for u8s8s32 gemm, the following
optimizations are made to improve the downscale performance for u8s8s16
gemm:
a. The store to temporary s16 buffer can be avoided when k < KC since
intermediate accumulation is not required for the pc loop (only 1
iteration). The downscaled values (s8) are written directly to the
output C matrix.
b. Within the micro-kernel when beta != 0, the s8 data from the original
C output matrix is loaded to a register, converted to s16 and beta
scaling applied on it. The previous design of copying the s8 value to
the s16 temporary buffer inside jc loop and using the same in beta
scaling is removed.
-Alpha scaling (multiply instruction) by default was resulting in
performance regression when k dimension is small and alpha=1 in s16
micro-kernels. Alpha scaling is now only done when alpha != 1.

AMD-Internal: [CPUPL-3237]
Change-Id: If25f9d1de8b9b8ffbe1bd7bce3b7b0b5094e51ef
2023-04-19 06:40:06 -04:00
mkadavil
e23765010d aocl_gelu_<tanh|erf>_f32 api's for gelu computation as part of lpgemm.
-Currently in aocl_gemm, gelu (both tanh and erf based) computation is
only supported as a post-op as part of low precision gemm api call (done
at micro-kernel level). However gelu computation alone without gemm is
required in certain cases for users of aocl_gemm.
-In order to support this, two new APIs - aocl_gelu_tanh_f32 and
aocl_gelu_erf_f32 are introduced as part of aocl_gemm. These APIs
compute element-wise gelu_tanh and gelu_erf respectively of a
matrix/vector of floats. Both APIs invoke ISA specific vectorized
micro-kernels (vectorized only when incx=1), and a cntx based mechanism
(similar to lpgemm_cntx) is used to dispatch to the appropriate kernel.
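
The two GELU variants differ only in how the Gaussian CDF is evaluated. The scalar sketch below uses the standard textbook formulas; it is not taken from the aocl_gemm sources, and the function names are illustrative.

```python
import math


def gelu_erf(x):
    """Exact form: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation with the usual 0.044715 cubic coefficient."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))
```

The tanh form trades a small approximation error for avoiding an erf evaluation.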

AMD-Internal: [CPUPL-3218]
Change-Id: Ifebbaf5566d7462288a9a67f479104268b0cc704
2023-04-17 05:15:56 -04:00
eashdash
12c97021a1 Added New Post-Op - Custom Clipping for LPGEMM and SGEMM
1. Custom Clip is an element-wise post-op which is used to
   clip the accumulated GEMM output within a certain range.
2. The Clip Post-Op is used in downscaled and non-downscaled
   LPGEMM APIs and SGEMM.
3. Changes are done at frame and microkernel level to implement
   this post-op.
4. Different versions are implemented - AVX-512, AVX-2, SSE-2
   to enable custom clipping for various LPGEMM types and SGEMM

AMD-Internal: [CPUPL-3207]
Change-Id: I71c60be69e5a0dc47ca9336d58181c097b9aa0c6
2023-04-17 04:38:20 -04:00
Aayush Kumar
71272ab574 Fixed Compiler warnings for GCC 12 and AOCC 4.0
- Set the variables to zero to avoid the compiler warning
  (-Wmaybe-uninitialized) in bli_dgemm_ref_k1.c,
  bli_gemm_small.c, bli_trsm_small.c, bli_zgemm_ref_k1.c and
  bli_trsm_small_AVX512.c

- Changed the datatype from dim_t to siz_t for i,k,j
  in bli_hemv_unf_var1_amd.c and bli_hemv_unf_var3_amd.c to
  avoid the compiler warning (-Waggressive-loop-optimizations)

AMD-Internal: [CPUPL-2870]

Change-Id: Ib2bc050fa47cb8a280d719283ab4539c70e19d03
2023-04-14 13:29:17 +00:00
Harihara Sudhan S
32bbd96652 Moving AOCL Dynamic logic from BLIS impli layer
Threading related changes
--------------------------

- Created function bli_nthreads_l1 that dispatches the AOCL dynamic
  logic for a L1 function based on the kernel ID and input datatypes.
- bli_nthreads_l1 gets the number of threads to be launched from the
  rntm variable.
- Added aocl_'ker?'_dynamic function for DAXPYV, DSCALV, ZDSCALV and
  DDOTV. This function contains the AOCL dynamic logic for the
  respective kernels.
- Added handling for cases when number of elements (n) is less than
  number of threads spawned (nt) in AOCL dynamic.
- Added function bli_thread_vector_partition that calculates the
  amount of work the calling thread is supposed to perform on a
  vector.
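
One plausible way to split a vector across threads, including the case n < nt, is sketched below. This is an assumption for illustration only; the actual signature and policy of bli_thread_vector_partition are not shown in this log.

```python
def vector_partition(n, nt, tid):
    """Return (start, length) of the slice owned by thread tid of nt."""
    base, extra = divmod(n, nt)
    # The first `extra` threads take one additional element.
    length = base + (1 if tid < extra else 0)
    start = tid * base + min(tid, extra)
    return start, length
```

When n is smaller than nt, `base` is 0 and the surplus threads receive an empty slice, so no thread reads past the end of the vector.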

Interface changes
-----------------

- In BLIS impli layer of DSCALV, ZDSCALV and AXPYV, added logic to pick
  kernel based on architecture ID and removed AVX2 flag check.
- Modified function signature of ZDSCALV. Alpha is passed as dcomplex
  and only the real part of the alpha passed is used inside the kernel.
  The change was done to facilitate kernel dispatch based on arch ID.
- Added n <= 0, BLAS exception in BLAS layer of DAXPYV and DDOTV.
  Without this multithreaded code might crash because of minimum work
  calculation.

Misc
-----

- Removed unused variables from ZSCAL2V and AXPYV kernels.

AMD-Internal: [CPUPL-3095]
Change-Id: I4fc7ef53d21f2d86846e86d88ed853deb8fe59e9
2023-04-14 02:05:38 -04:00
Aayush Kumar
8c537b0cd5 Added DTRSM Small Path AVX512 based LLNN/LUTN Variant Kernels
- 8x8 kernels are used for DTRSM SMALL
- Implemented fringe cases with below block sizes
   8x8, 8x4, 8x3, 8x2, 8x1
   4x8, 4x4, 4x3, 4x2, 4x1
   3x8, 3x4, 3x3, 3x2, 3x1
   2x8, 2x4, 2x3, 2x2, 2x1
   1x8, 1x4, 1x3, 1x2, 1x1

AMD-Internal: [CPUPL-2745]

Change-Id: I58d28912bddbaadb404052c0f3449ebbe3c97b68
2023-04-07 08:50:28 +00:00
Edward Smyth
1885540c5a Code cleanup: compiler warning fixes
Modify code to correct some warning messages from GCC 12.2 or
AOCC 4.0:
- Increase size of nbuf in blastest/f2c/endfile.c
- Remove unused variables in kernels/zen/1/bli_scal2v_zen_int.c
  and kernels/zen/1/bli_axpyv_zen_int10.c
- Remove extraneous parentheses in frame/compat/bla_trsm_amd.c
  and kernels/zen4/3/bli_zgemm_zen4_asm_12x4.c
- Add __attribute__ ((unused)) to several variables in
  frame/1m/packm/bli_packm_struc_cxk.c and
  frame/1m/packm/bli_packm_struc_cxk_md.c

AMD-Internal: [CPUPL-2870]
Change-Id: I595e46f0a3d737beb393c3ab531717565220b10d
2023-04-06 06:56:09 -04:00
Edward Smyth
1ac03e64b5 BLIS cpuid tidy and bugfix.
Improvements to BLIS cpuid functionality:
- Tidy names of avx support test functions, especially rename
  bli_cpuid_is_avx_supported() to bli_cpuid_is_avx2fma3_supported()
  to more accurately describe what it tests.
- Fix bug in frame/base/bli_check.c related to changes in commit
  6861fcae91

AMD-Internal: [CPUPL-3031]
Change-Id: Iacd8fb0ffbd45288e536fc6314660709055ea2d5
2023-04-03 08:46:37 -04:00
Arnav Sharma
0bfb0393fd Bug fix for C/ZAXPBY Kernels
- Complex AXPBY kernels gave incorrect output when both alpha and
  beta had non-zero imaginary parts.
- Previously, the scalar code (used to calculate remainder result
  or non-unit increment cases) was directly accessing and updating
  the y-vector pointer thus, resulting in an incorrect output.
  Updated it to operate on a local copy of the current y element
  and store the final result to the y-pointer.
- Also, added operation to store temporary calculation of alpha*x
  in an intermediate vector and then later added to the y vector.
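
The sketch below shows one way an in-place update can go wrong for y := alpha*x + beta*y on complex data stored as [re, im]. It is a guess at the class of bug described, not a copy of the original scalar code; both function names are hypothetical.

```python
def axpby_in_place(alpha, x, beta, y):
    """Buggy pattern: y[0] is overwritten before y[1] is computed."""
    y = list(y)
    y[0] = alpha[0] * x[0] - alpha[1] * x[1] + beta[0] * y[0] - beta[1] * y[1]
    # y[0] below is already the NEW real part, so the result is wrong
    # whenever beta has a non-zero imaginary part.
    y[1] = alpha[0] * x[1] + alpha[1] * x[0] + beta[0] * y[1] + beta[1] * y[0]
    return y


def axpby_local_copy(alpha, x, beta, y):
    """Fixed pattern: read y once, keep alpha*x in a temporary."""
    y_re, y_im = y                                   # local copy of y
    ax_re = alpha[0] * x[0] - alpha[1] * x[1]        # temporary alpha*x
    ax_im = alpha[0] * x[1] + alpha[1] * x[0]
    return [ax_re + beta[0] * y_re - beta[1] * y_im,
            ax_im + beta[0] * y_im + beta[1] * y_re]
```

With alpha = 1+2i, x = 3+4i, beta = 5+6i and y = 7+8i, the local-copy version gives -18+92i, while the in-place version produces a different imaginary part.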

AMD-Internal: [CPUPL-3037]
Change-Id: Iddbd3000dcb1505b444b0ad41ab881b055842e1c
2023-03-24 07:32:43 -04:00
Eleni Vlachopoulou
ad7a812db2 Remove quick return for zero increments.
Details:
- To be BLAS compliant, if increment is zero then iterate through the first element n times.
- For n<=0, the correct result (0) is returned so we remove this extra check. This is checked on BLIS-typed interface level.

AMD-Internal: [SWLCSG-1900]
Change-Id: I098bb9560a790050018bc8d8c63b06bfbcc1aebd
2023-03-23 23:35:03 -04:00
Shubham
dfc95d29fc Enable DTRSM small multithreading path for BLAS interface
- Enabled DTRSM small mt for sizes where performance is better
  than small or native.
- Threshold Tuning for small path is updated.
- Function signature for bli_trsm_small_mt has been made similar
  to bli_trsm_small so that one function pointer can be used for
  all functions.
- Early return condition in DTRSM small for sizes > 1000 has been
  removed so that the sizes for which the small path is taken can be
  decided in the bla layer instead of inside the kernel.

AMD-Internal: [CPUPL-2735]
Change-Id: Ieea31343dc660517acc18c92713381a8b84d3a2f
2023-03-23 12:07:22 -04:00
Harihara Sudhan S
4b36529a8b Added vector packing logic to ZGEMV variant 2
- In cases when incy != 1, a buffer is created for the y vector. The
  contents of vector y are scaled by beta and stored in this buffer.
- After performing the compute using the ZAXPYF kernel, the results in
  the y buffer are copied back to the original buffer using ZCOPYV.
- In cases when alpha is zero, we only scale the y vector by beta
  without using the buffer and return.
- The kernels are picked based on the architecture ID. For any zen
  based architecture, AVX2 kernels are invoked. For others, the
  kernels are invoked based on the context.
- In ZSCAL2V, query for the context if NULL pointer is passed.

AMD-Internal: [CPUPL-2773]
Change-Id: If409ca5c438fc2eebe73480c011577088d52c65f
2023-03-22 03:19:18 -04:00
Harihara Sudhan S
a67205d8bd Modified Double precision AMAXV kernel
- The new AMAXV adheres to the BLAS definition of ISAMAX by not handling
  NaN separately. In the previous kernel, NaN was considered the smallest
  element of all the elements in the array.
- The new logic uses two helper functions - bli_vec_absmax_double and
  bli_vec_search_double.
- bli_vec_absmax_double finds the absolute largest element and the index
  range in which the first occurrence of this element can be found.
- bli_vec_search_double returns the index of the first occurrence of the
  absolute value of an element.
- AMAXV uses these two helper functions to find the absolute largest
  element and then searches using bli_vec_search_double in the reduced
  range provided by bli_vec_absmax_double.
- Added condition check for n == 1 in BLAS layer. It is an optimization
  mentioned in the BLAS standard API definition.
- Removed redundant n == 0 condition check from the kernel. This is a
  BLAS exception and is already done in the BLAS layer.
- Removed AVX2 flag check from the BLAS layer. Kernels will be picked
  based on the architecture ID in the new design.
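
The two-phase search can be sketched as follows. The helper names loosely mirror bli_vec_absmax_double and bli_vec_search_double, but their signatures, the block size of 4, and the omission of NaN inputs are all assumptions of this example.

```python
def vec_absmax(x, block=4):
    """Return (max |x|, lo, hi): the value and the block where it first occurs."""
    best, best_start = -1.0, 0
    for start in range(0, len(x), block):
        m = max(abs(v) for v in x[start:start + block])
        if m > best:                 # strict '>' keeps the first occurrence
            best, best_start = m, start
    return best, best_start, min(best_start + block, len(x))


def vec_search(x, target, lo, hi):
    """Index of the first element in [lo, hi) whose absolute value is target."""
    for i in range(lo, hi):
        if abs(x[i]) == target:
            return i
    return -1


def amaxv(x):
    """Index of the first element with the largest absolute value."""
    best, lo, hi = vec_absmax(x)
    return vec_search(x, best, lo, hi)
```

The first pass only tracks a maximum per block, which vectorizes well; the exact index is resolved afterwards inside one small range.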

AMD-Internal: [CPUPL-2773]
Change-Id: Ida2dae84a60742e632dc810ab1b7b80fc354e178
2023-03-21 05:18:05 -04:00
eashdash
e36f699939 Implemented ERF Based GeLU Activation for LPGEMM and SGEMM
1. Implemented efficient AVX-512, AVX-2 and SSE-2 version of the
   error function - ERF
2. Added error function based GeLU activation post-ops for the
   S32, S16 and BF16 (LPGEMM) and SGEMM APIs.
3. Changes for this include frame and micro-kernel level changes in
   addition to adding the macro-based function definitions of the
   ERF function in the math-utils and gelu header files.

AMD-Internal: [CPUPL-3036]
Change-Id: Ie50f6dcabf8896b7a6d30bbc16aa44392cc512be
2023-03-13 06:10:31 -04:00
Vignesh Balasubramanian
c53b0c96ec Extended the support for beta scaling in 3x4 RD kernels of ZGEMM SUP
- Extended the existing support for handling beta scaling
  in the fringe cases of 3x4 RD kernels in ZGEMM SUP. The
  added support ensures that NaN values initialized in C do
  not propagate to the result when beta is 0.
- The support has been added to fringe cases common to, as well
  as specific to the m and n variants of the RD kernels.
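
The reason beta = 0 needs its own path is that 0 * NaN is NaN in IEEE arithmetic, so scaling C by zero does not clear it. A real-valued sketch (the kernels in question are complex, and the function name is hypothetical):

```python
import math


def update_c(c, ab, beta):
    """C := beta*C + AB, treating beta == 0 as 'overwrite C'."""
    if beta == 0.0:
        # Do not read C at all: it may hold NaN or uninitialised data.
        return list(ab)
    return [beta * ci + abi for ci, abi in zip(c, ab)]
```

Computing `0.0 * float("nan") + 2.0` directly yields NaN, which is exactly what the explicit branch avoids.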

AMD-Internal: [CPUPL-3053] [SWLCSG-1900]
Change-Id: I8e617ac505144c3ea3a70556413d264f11dfc9a9
2023-03-08 06:16:58 -05:00
Eleni Vlachopoulou
1416ab6b5e Bugfix on the memory access for nrm2 AVX2 implementations.
- Since the definition of negative increments is different between BLAS and BLIS,
  there was a bug in how the memory was accessed when we were copying the elements
  of a vector with negative increments. Updated the code under the assumption that
  when negative increments are set, the vector is being accessed starting from the end.
  For the BLAS interface, there is an intermediate conversion before calling into the blis layer.
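
The difference between the two conventions can be shown with array indices. In the BLAS convention the caller passes the start of the storage even for a negative increment; the sketch assumes the BLIS-style convention is to start at the first logical element and step backwards. Function names are hypothetical.

```python
def blas_to_blis_start(n, incx):
    """Index of the first logical element within the BLAS-style array."""
    return (n - 1) * (-incx) if incx < 0 else 0


def gather(buf, n, incx):
    """Read the n logical elements of a strided vector in order."""
    start = blas_to_blis_start(n, incx)
    return [buf[start + i * incx] for i in range(n)]
```

With `buf = [10, 0, 20, 0, 30]`, n = 3 and incx = -2, the walk starts at index 4 and moves towards index 0.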

Change-Id: I08343472b418733fad6f7add9e90aa96cdf68285
AMD-Internal: [SWLCSG-1900]
2023-03-08 04:21:36 -05:00
Harihara Sudhan S
d4901f53ce Improved performance of double complex AXPYV kernel
- Increased unroll by reusing X registers that were previously used
  for performing shuffle.
- Added loops with smaller increment steps for better problem
  decomposition.
- Added X vector and Y vector prefetch to the kernel.
- Removed redundant code that handles fringe in incx = 1 and
  incy = 1. This remainder will be performed by the loop that handles
  non-unit stride cases.
- Vectorized loops that handle non-unit stride cases using SSE
  instructions.

AMD-Internal: [CPUPL-2773]
Change-Id: Ifb5dc128e17b4e21315789bfaa147e3a7ec976f0
2023-03-02 00:35:42 -05:00
Harihara Sudhan S
299bed3fa8 Vectorized SCAL2V for double complex
- Added SCAL2V kernel that uses AVX2 and SSE instructions for
  vectorization.
- The routine returns early when the vector dimension is zero
  or incx <= 0 or incy <= 0.
- The kernel takes one among the two available paths based on
  conjugation requirement of X vector.
- VZEROUPPER is added before transitioning from AVX2 to SSE.
- Added function pointer to ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts.
- Added the new SCAL2V file to the CMake list.

AMD-Internal: [CPUPL-2773]
Change-Id: I2debbfab31d41347786c3a1bae5723d092c202e9
2023-02-28 06:59:23 -05:00
mkadavil
1f2447f800 Post-ops support for f32 gemm(aocl_gemm).
- Bias add, relu, parametric relu and gelu post-ops support added in all
f32 gemm micro-kernels. These post-ops are implemented for both AVX512
and AVX2 ISA based on the micro-kernel flavor. The support is added for
both row and column major cases.
- Lpgemm bench updates to support f32 post-ops.

AMD-Internal: [CPUPL-3032]
Change-Id: Ie6840b9d4e52d2086c1b5ff2e1de80dc0cad5476
2023-02-23 18:58:59 +05:30
mkadavil
8dff49837d Lpgemm source restructuring to support amdzen config.
-Currently lpgemm can only be built using either zen3 or zen4 config.
The lpgemm kernel code is re-structured to support amdzen, and thus
multi machine deployment.
-The micro-kernel calls (gemm and pack) are currently hardcoded in the
lpgemm framework. This is removed and a new lpgemm_cntx based dispatch
mechanism is designed to support runtime configurability for
micro-kernels.

AMD-Internal: [CPUPL-2965]
Change-Id: I4bbcb4e5db767def1663caf5481f0b4c988149ef
2023-02-21 08:35:38 -05:00
Harihara Sudhan S
dfacb47125 Code cleanup to remove multiple function declaration
- Removed repetitive function declaration of GEMM small kernels
  from the C files
- Function declaration of these kernels exist in the header files
  where the kernels are supposed to be declared.

AMD-Internal: [CPUPL-3003]
Change-Id: Ic10e66691c0742ce519bcc3fe4a12ec5c5052b63
2023-02-21 00:35:31 -05:00
Harihara Sudhan S
535b49f150 Vectorized COPYV for double complex
- Added ZCOPYV kernel that uses AVX2 and SSE instructions for
  vectorization.
- The routine returns early when the vector dimension is zero.
- The kernel takes one among the two available paths based on
  conjugation requirement of X vector.
- VZEROUPPER is added before transitioning from AVX2 to SSE.
- Added function pointer to ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts.

AMD-Internal: [CPUPL-2773]
Change-Id: Ibd8a2de42060716395ef698d753c8462654cc0f0
2023-02-16 09:32:10 -05:00
Harihara Sudhan S
8d7593a940 AVX2 vectorized SCALV for double complex
- Added ZSCALV that uses AVX2 and SSE instructions for vectorization.
- Return early when the vector dimension is zero. When alpha is 1 there
  is no need to perform computation hence return early.
- When alpha is zero, the expert interface of ZSETV is invoked. In this
  case, all the elements of the input vector are set to 0.
- Invocation of the expert interface means that a NULL pointer can be
  passed to the function in place of the context. The expert interface of
  ZSETV will query the context and get the appropriate function pointer.
- Added BLAS interface for ZSCALV. The architecture ID is used to decide
  the function that is to be invoked.
- Created a new macro INSERT_GENTFUNCSCAL_BLAS_C to instantiate SCALV
  BLAS macro interface only for single complex type and single complex,
  float mixed type

AMD-Internal: [CPUPL-2773]
Change-Id: I0d6995bce883c0ebdc5da0046608fc59d03f6050
2023-02-05 09:36:12 -05:00
Harihara Sudhan S
18ae57305e ZAXPYF4 optimization
- Vectorized alpha scaling of X vector using SSE instructions. This
  can be done irrespective of incx.
- Added code to prefetch A matrix and Y vector to L1 cache
- Vectorized fringe case computation and non-unit stride computation
  with SSE instructions.
- Increased unroll in unit stride cases for better register
  utilization.

AMD-Internal: [CPUPL-2773]
Change-Id: I217e6ce9e3f5753ebe271c684abd9a2274fd2715
2023-02-04 12:34:50 -05:00
Kiran Varaganti
201db7883c Integrated 32x6 DGEMM kernel for zen4 and its related changes are added.
Details:
- Now AOCL BLIS uses AVX512 - 32x6 DGEMM kernel for native code path.
  Thanks to Moore, Branden <Branden.Moore@amd.com> for suggesting and
  implementing these optimizations.
- In the initial version of 32x6 DGEMM kernel, to broadcast elements of B packed
  we perform load into xmm (2 elements), broadcast into zmm from xmm and then to get the
  next element, we do vpermilpd(xmm). This logic is replaced with direct broadcast from
  memory, since the elements of Bpack are stored contiguously, the first broadcast fetches
  the cacheline and then subsequent broadcasts happen faster. We use two registers for broadcast
  and interleave broadcast operation with FMAs to hide any memory latencies.
- Native dTRSM uses 16x14 dgemm - therefore we need to override the default blkszs (MR,NR,..)
  when executing trsm. We call bli_zen4_override_trsm_blkszs(cntx_local) on a local cntx_t object
  for double data-type as well in the function bli_trsm_front(), bli_trsm_xx_ker_var2, xx = {ll,lu,rl,ru}.
  Renamed "BLIS_GEMM_AVX2_UKR" to "BLIS_GEMM_FOR_TRSM_UKR" and in the bli_cntx_init_zen4() we replaced
  dgemm kernel for TRSM with 16x14 dgemm kernel.
- New packm kernels - 16xk, 24xk and 32xk are added.
- New 32xk packm reference kernel is added in bli_packm_cxk_ref.c and it is
  enabled for zen4 config (bli_dpackm_32xk_zen4_ref() )
- Copyright year updated for modified files.
- cleaned up code for "zen" config - removed unused packm kernels declaration in kernels/zen/bli_kernels.h
- [SWLCSG-1374], [CPUPL-2918]

Change-Id: I576282382504b72072a6db068eabd164c8943627
2023-01-19 23:11:36 +05:30
Edward Smyth
82c2eb4e8e Code cleanup and warnings fixes
Corrections for some occurrences of:
- Compiler warnings about initialization of float from double
- Spelling mistakes in comments
- Incorrect indentation of code and comments

AMD-Internal: [CPUPL-2870]
Change-Id: Icb68c789687bd0684844331d43071bfffecac9fc
2023-01-09 04:34:52 -05:00
Vignesh Balasubramanian
dbd0c069d4 Implemented 3x4 RD kernels for ZGEMM in SUP path
-Implemented (r)ow preferential (d)ot product milli-kernels
 (m and n variants) for dcomplex datatype along SUP path.

-These computational kernels extend the support for handling RRC and
 CRC storage schemes along the SUP path. In case of BLAS api call,
 it corresponds to the input cases with transa equal to T and
 transb equal to N.

-In case of the B matrix being packed(conditionally), the inputs are
 redirected to the existing (r)ow preferential (v)ector load optimized
 kernels due to better performance.

-Added macro for vhsubpd assembly instruction, to support the arithmetic
 for complex datatype in its interleaved storage.

AMD-Internal: [CPUPL-2593]
Change-Id: If90834e55e9e31aa87d3d5b711efad9ef2458da8
2022-12-22 18:31:46 +05:30
Eleni Vlachopoulou
13aa3c8cd0 Adding AVX2 support for SNRM2 and SCNRM2
- For the cases where AVX2 is available, an optimized function is called,
    based on Blue's algorithm. The fallback method based on sumsqv is used
    otherwise.

    - Scaling is used to avoid overflow and underflow.

    - Works correctly for negative increments.

    AMD-Internal: [SWLCSG-1080]

Change-Id: I6bf2f42652ba6b8a8631a0a9e6f6297d5b3ea5d9
2022-12-14 04:25:45 -05:00
Edward Smyth
7f86561d26 BLIS-Nov2022: HPL memory issues with GCC.
HPL script was using BLIS manual way to set threading, i.e. setting
BLIS_IC_NT explicitly. This causes bli_rntm_num_threads() to return
-1, which wasn't trapped in parallelised BLAS1 and BLAS2 routines.

Fix: if this occurs, set local number of threads based on product of
BLIS_JC_NT * BLIS_PC_NT * BLIS_IC_NT * BLIS_JR_NT * BLIS_IR_NT values.

Note: BLIS_PC_NT should always be 1, but this environment variable
is currently being read (contrary to documentation), so include it
for now.
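
The fallback described above can be sketched as follows. The environment is passed in as a dictionary, and treating an unset variable as 1 is an assumption of this example rather than something stated in the log; the function name is hypothetical.

```python
import math

WAYS = ["BLIS_JC_NT", "BLIS_PC_NT", "BLIS_IC_NT", "BLIS_JR_NT", "BLIS_IR_NT"]


def local_num_threads(num_threads, env):
    """Use num_threads if valid, else the product of the per-loop ways."""
    if num_threads > 0:
        return num_threads
    # num_threads == -1 signals that threading was set manually via *_NT.
    return math.prod(int(env.get(name, 1)) for name in WAYS)
```

For an environment that sets only BLIS_IC_NT=8, the fallback yields 8 threads for the parallelised BLAS1 and BLAS2 routines.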

Other changes:
* implement _Pragma convention in all code used on AMD
* frame/2/gemv/bli_gemv_unf_var1_amd.c: Remove is_omp_mt_enabled flag

AMD-Internal: [CPUPL-2803]
Change-Id: I37e8b038e5640d6693a87be0609888186322b465
2022-12-06 05:10:34 -05:00
Edward Smyth
34730a1e4c BLIS: Nested parallelism issues
1. Check OpenMP active level against max active levels when setting
   number of threads for starting a new parallel region in
   ./frame/thread/bli_thread.c to ensure the correct number of threads
   is used when BLIS is called within nested OpenMP parallelism.

2. In subsequent BLIS calls, threading choices could be incorrectly
   set based on values used and stored in global_rntm by a previous
   call. This could apply when the OpenMP number of threads differ from
   call to call, different nested parallelism is used in different
   parts of a user's code, or different threads at the user level
   request different numbers of OpenMP threads for BLIS calls.

   Keep threading information in both global_rntm and a new Thread
   Local Storage copy tl_rntm. Update tl_rntm from OpenMP runtime
   environment (as appropriate) during bli_init_auto() calls in each
   BLIS routine. The details are:
   * global_rntm is initialized on first BLIS call based on OpenMP and
     BLIS threading environment variables.
   * global_rntm is updated by any BLIS threading function calls.
   * In bli_thread_update_tl(), called by bli_init_auto(), sync with
     any BLIS values set or updated in global_rntm. Then, if BLIS
     threading control is not used, check OpenMP ICVs and set thread
     count and auto_factor appropriately.
   * Setting BLIS threading locally (using expert interfaces to pass
     a user defined rntm data structure) should work as before.

3. bli_thread_get_is_parallel can now only be called outside of
   parallelism within BLIS routines. Change calls in trsm to reflect
   this.

4. Ensure blis_mt is set to TRUE in bli_thread_init_rntm_from_env()
   if any BLIS_*_NT environment variables are set.

5. Set auto_factor = FALSE when the number of threads is 1.

6. bli_rntm_set_num_threads() and bli_rntm_set_ways() set blis_mt=TRUE.

7. Set blis_mt=FALSE in BLIS_RNTM_INITIALIZER and bli_rntm_init().

8. For debugging, internal information on the rntm threading data can
   be printed by defining "PRINT_THREADING" at the top of bli_rntm.h

9. bli_rntm_print() now also prints the value of blis_mt.

10. Function prototypes in bli_rntm.h moved to top of file, so that
   bli_rntm_print() can be used within inline functions defined in
   this header file.

11. Comment out bli_init_auto() and bli_finalize_auto() calls in
    Fortran interfaces in frame/compat/blis/thread/b77_thread.c

12. In frame/3/bli_l3_sup_int_amd.c move two calls to set_pack_a and
    set_pack_b functions outside of the auto_factor if statements.

13. Misc code tidying.

AMD-Internal: [CPUPL-2433]
Change-Id: I8342c37fb4e280118e5e55164fbd6ea636f858ee
2022-10-21 07:38:39 -04:00
Harihara Sudhan S
42d631bced Copyright modification
- Added copyright information to modified/newly created
  files missing them

Change-Id: If4e73b680246d0363de09587d6dc54bee00ecd71
2022-10-14 12:43:35 +05:30
Nallani Bhaskar
f1caa8e630 Fixed initialization issues in ddot multithread implementation
Description:

1. n value per thread and offset were defined outside the omp loop,
   although they can vary for each thread. Moved them inside the
   omp loop so that each thread computes its own values.

2. Fixed a few compilation warnings in dnrm2 avx2 implementation

AMD Internal:[CPUPL-2606]

Change-Id: Ifba9b3707c3c1a66f31b5e1906ecb68eabef4f81
2022-10-01 14:43:39 +05:30
Eleni Vlachopoulou
863b73dfaf Adding AVX2 support for DZNRM2
- For the cases where AVX2 is available, an optimized function is called,
based on Blue's algorithm. The fallback method based on sumsqv is used
otherwise.

- Scaling is used to avoid overflow and underflow.

- Works correctly for negative increments.

- Cleaned up some white space in the AVX2 implementation for DNRM2.
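
To illustrate why scaling matters for a 2-norm, the sketch below divides by the largest magnitude before squaring. This is a simpler technique than Blue's algorithm, which the commit says the AVX2 path is based on; it is shown only to make the overflow and underflow problem concrete, and the function name is hypothetical.

```python
import math


def scaled_nrm2(x):
    """2-norm of a real vector, scaled by max |x| to stay in range."""
    scale = max((abs(v) for v in x), default=0.0)
    if scale == 0.0:
        return 0.0
    # Every ratio v / scale lies in [-1, 1], so squaring cannot overflow.
    return scale * math.sqrt(sum((v / scale) ** 2 for v in x))
```

A naive sum of squares fails for inputs near 1e200 because each square already exceeds the double-precision range, while the scaled form returns a finite norm.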

AMD-Internal: [CPUPL-2551]
Change-Id: I0875234ea735540307168fe7efc3f10fe6c40ffc
2022-09-30 07:51:04 -04:00
Arnav Sharma
90f915d3a9 Vectorized and parallelized zdscal routine
- Implemented optimized intrinsic kernel for zdscalv for the cases where AVX2 is supported.
- Also added multithreaded support for the same.
- The optimal number of threads is being calculated on the basis of input size.

AMD-Internal: [CPUPL-2602]
Change-Id: I4d05c3b1cc365a7770703286a89c6dce3875c067
2022-09-30 06:11:07 -04:00
satish kumar nuggu
9c292b79e2 Fixed ASAN reported issues in [s/c]trsm small kernels
Details:
1. Fixed the partial memory access overflows for the variables
AlphaVal and ones reported by ASAN.
2. Using 128 bit packed broadcast with the 64 bit data types
after type casting would cause the garbage data to be filled
in the destination register.
3. Fixed this issue by using set_ps instruction instead of broadcast.
4. In cases of n remainder being 1, extra elements were accessed that
could cause out of memory access. Removed the extra element access.

AMD-Internal: [CPUPL-2578][CPUPL-2587]
Change-Id: Iaa918060c66287f2f46bcb9f69e9323f6707cf75
2022-09-30 03:02:07 -04:00
Eleni Vlachopoulou
1c1a0027a8 Bugfix in DNRM2 AVX path
Description:
Enabled DNRM2 AVX path and fixed bug that caused numerical accuracy
errors.

AMD-Internal: [CPUPL-2576]
Change-Id: Ic9fda9d9668bdfe233621f79db6acce518b4d10e
2022-09-29 04:46:04 -04:00