Commit Graph

2630 Commits

satish kumar nuggu
09b70de635 Performance Improvement for ztrsm small sizes
Details:
    - Handled overflow and underflow vulnerabilities in
      ztrsm small right implementations.
    - Fixed failures observed in Scalapack testing.

    AMD-Internal: [CPUPL-2115]

Change-Id: I22c1ba583e0ba14d1a4684a85fa1ca6e152e8439
2022-05-17 18:10:39 +05:30
Nallani Bhaskar
2acb3f6ed0 Tuned aocl dynamic for specific range in dgemm
Description:

1. Retuned the decision logic that chooses the optimal number of
   threads for given input dgemm dimensions under the AOCL dynamic
   feature, based on the latest code.

2. Updated code in a few files to avoid compilation warnings.

3. Added a min check for nt in the bli_sgemv_var1_smart_threading
   function.

AMD-Internal: [ CPUPL-2100 ]
Change-Id: I2bc70cc87c73505dd5d2bdafb06193f664760e02
2022-05-17 18:10:39 +05:30
mkadavil
a3836a560d Smart Threading for GEMM (sgemm) v1.
- Cache aware factorization.
Experiments show that ic,jc factorization based on m,n gives better
results than mu,nu on a generic data set in the SUP path. Slight
adjustments to the factorizations w.r.t. matrix data loads can help
improve performance further.

- Moving native path inputs to SUP path.
Experiments show that in multi-threaded scenarios, if the per-thread
data falls under the SUP thresholds, taking the SUP path instead of the
native path improves performance. This holds even if the original
matrix dimensions fall in the native path. It is not applicable if an A
matrix transpose is required.

- Enabling B matrix packing in SUP path.
Performance improvement is observed when B matrix is packed in cases
where gemm takes SUP path instead of native path based on per thread
matrix dimensions.

AMD-Internal: [CPUPL-659]

Change-Id: I3b8fc238a0ece1ababe5d64aebab63092f7c6914
2022-05-17 18:10:39 +05:30
S, HariharaSudhan
a8bc55c373 Multithreaded SGEMV var 1 with smart threading
- Implemented an OpenMP based stand-alone SGEMV kernel for
  row-major (var 1) multithreaded scenarios
- Smart threading is enabled when AOCL DYNAMIC is defined
- The number of threads is decided based on the input dims
  using smart threading

AMD-Internal: [CPUPL-1984]
Change-Id: I9b191e965ba7468e95aabcce21b35a533017502e
2022-05-17 18:10:39 +05:30
Dipal M Zambare
16de63c818 Updated version and copyright notice.
Changed AMD-BLIS version to 3.1.2

AMD-Internal: [CPUPL-2111]
Change-Id: Id8fc3fbc112f08bd5e5def646c472047352e65b5
2022-05-17 18:10:39 +05:30
Dave
963a6aa099 Enabled zgemm_sup path and removed sqp path
- Previously, zgemm computation failures occurred because the
  status variable had no pre-defined initial value, which caused
  zgemm to return without being computed by any kernel. The same
  change is reflected in the dgemm_ function as well.

- Enabled sup zgemm, as the status variable issue with the
  bli_zgemm_small call is fixed.

- Removed the call to the sqp method, as it is disabled.

Change-Id: I0f4edfd619bc4877ebfc5cb6532c26c3888f919d
2022-05-17 18:10:39 +05:30
Dipal M Zambare
f23233eb4c Added runtime control for DTL logging Feature
The logs can be enabled with the following two methods:
  -- Environment variable based control: the feature can be enabled
     by setting the environment variable AOCL_VERBOSE=1.
  -- API based control: two APIs are added to enable/disable
     logging at runtime:
     1. AOCL_DTL_Enable_Logs()
     2. AOCL_DTL_Disable_Logs()
  -- The APIs take precedence over the environment settings.
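Hypothetical shell usage of the environment-variable control (AOCL_VERBOSE is the variable named above; the program name is a placeholder):

```shell
# Enable DTL logging for the whole session; any subsequently run
# BLIS-linked program will emit logs.
export AOCL_VERBOSE=1
echo "AOCL_VERBOSE=$AOCL_VERBOSE"
# e.g.  ./test_gemm.x   (placeholder for a BLIS-linked program)
```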

AMD-Internal: [CPUPL-2101]
Change-Id: Ie71c1095496fae89226049c9b9f80b00400350d5
2022-05-17 18:10:39 +05:30
satish kumar nuggu
1a3428ddfc Parallelization of dtrsm_small routine
1. Parallelized dtrsm_small across the m-dimension or n-dimension based on the side (Left/Right).
2. Fine-tuned with AOCL_DYNAMIC to achieve better performance.

    AMD-Internal: [CPUPL-2103]

Change-Id: I6be6a2b579de7df9a3141e0d68bdf3e8a869a005
2022-05-17 18:10:39 +05:30
Chandrashekara K R
8e6da6b844 Added checks to avoid defining the bool type for C++ code on Windows, preventing a redefinition error at build time.
AMD-Internal: [CPUPL-2037]
Change-Id: I065da9206ab06f60876324f258ee12fb9fe83f88
2022-05-17 18:10:39 +05:30
Harsh Dave
f17d043e1c Implemented optimal dotxv kernel
Details:
- Intrinsic implementation of zdotxv, cdotxv kernels
- Unrolling in multiples of 8; remaining corner
  cases are handled serially for the zdotxv kernel
- Unrolling in multiples of 16; remaining corner
  cases are handled serially for the cdotxv kernel
- Added declaration in zen contexts

AMD-Internal: [CPUPL-2050]
Change-Id: Id58b0dbfdb7a782eb50eecc7142f051b630d9211
2022-05-17 18:10:39 +05:30
Dipal M Zambare
e712ffe139 Added AOCL progress support for BLIS
-- AOCL libraries are used for lengthy computations which can go
     on for hours or days. Once an operation is started, the user
     doesn't get any update on the current state of the computation.
     This (AOCL progress) feature enables the user to receive a
     periodic update from the libraries.
  -- The user registers a callback with the library if they are
     interested in receiving the periodic update.
  -- The library invokes this callback periodically with information
     about the current state of the operation.
  -- The update frequency is statically set in the code; it can be
     modified as needed if the library is built from source.
  -- This feature is supported for GEMM and TRSM operations.

  -- Added examples for GEMM and TRSM.
  -- Cleaned up and reformatted test_gemm.c and test_trsm.c to
     remove warnings and make indentation consistent across the
     files.

AMD-Internal: [CPUPL-2082]
Change-Id: I2aacdd8fb76f52e19e3850ee0295df49a8b7a90e
2022-05-17 18:10:39 +05:30
Harsh Dave
52e4fd0f11 Performance Improvement for ctrsm small sizes
Details:
- Enabled the ctrsm small implementation
- Handled overflow and underflow vulnerabilities in
  ctrsm small implementations.
- Fixed failures observed in libflame testing.
- For small sizes, the ctrsm small implementation is
  used for all variants.

Change-Id: I17b862dcb794a5af0ec68f585992131fef57b179
2022-05-17 18:10:39 +05:30
Arnav Sharma
caa5b37005 Optimized S/DCOMPLEX DOTXAXPYF using AVX2 Intrinsics
Details:
- Optimized implementation of the DOTXAXPYF fused kernel for single and double precision complex datatypes using AVX2 intrinsics
- Updated definitions in the zen context

AMD-Internal: [CPUPL-2059]
Change-Id: Ic657e4b66172ae459173626222af2756a4125565
2022-05-17 18:10:39 +05:30
Sireesha Sanga
cc3069fb5e Performance Improvement for ztrsm small sizes
Details:
- Optimization of ztrsm for non-unit diag variants.
- Handled overflow and underflow vulnerabilities in
  ztrsm small implementations.
- Fixed failures observed in libflame testing.
- Fine-tuned ztrsm small implementations for specific
  sizes 64 <= m,n <= 256, by keeping the number of
  threads at the optimum value, under the AOCL_DYNAMIC flag.
- For small sizes, the ztrsm small implementation is
  used for all variants.

AMD-Internal: [SWLCSG-1194]
Change-Id: I066491bb03e5cda390cb699182af4350ae60be2d
2022-05-17 18:10:39 +05:30
satish kumar nuggu
fe7f0a9085 Changes to enable zgemm small from BLAS Layer
1. Removed the small gemm call from the native path to avoid
single-threaded calls as part of multithreaded scenarios.
2. Disabled the SUP and induced-method paths.
3. Added AOCL dynamic to pick the optimum number of threads for higher
performance.

Change-Id: I3c41641bef4906bdbdb5f05e67c0f61e86025d92
2022-05-17 18:10:38 +05:30
Sireesha Sanga
9621ef3067 Performance Improvement for ztrsm small sizes
Details:
- Enabled the ztrsm small implementation
- For small sizes, right variants and left unit-diag
  variants use ztrsm_small implementations.
- Optimization of left non-unit diagonal variants is
  work in progress.

AMD-Internal: [SWLCSG-1194]
Change-Id: Ib3cce6e2e4ac0817ccd4dff4bb0fa4a23e231ca4
2022-05-17 18:09:22 +05:30
Harsh Dave
015bcb88d4 Fixed ztrsm computational failure
- Fixed memory access for edge cases so that
  all loads stay within memory boundaries.

- Corrected ztrsm utility APIs for dcomplex
  multiplication and division.

AMD-Internal: [CPUPL-2093]
Change-Id: Ib2c65e7921f6391b530cd20d6ea6b50f24bd705e
2022-05-17 18:09:22 +05:30
Harsh Dave
0976ed9ce5 Implement zgemm_small kernel
Details:
- Intrinsic implementation of the zgemm_small nn kernel.
- Intrinsic implementation of the zgemm_small_At kernel.
- Added support for conjugate and Hermitian transpose.
- The main loop operates in multiples of a 4x3 tile.
- Edge cases are handled separately.

AMD-Internal: [CPUPL-2084]
Change-Id: I512da265e4d4ceec904877544f1d15cddc147a66
2022-05-17 18:09:22 +05:30
Nallani Bhaskar
eb0ff01871 Fine-tuning dynamic threading logic of DGEMM for small dimensions
Description:
    1. For small dimensions, single-threaded dgemm_small performs
       better than the dgemmsup and native paths.
    2. Irrespective of the given number of threads, we redirect
       to single-threaded dgemm_small.

       AMD-Internal:[CPUPL-2053]

Change-Id: If591152d18282c2544249f70bd2f0a8cd816b94e
2022-05-17 18:09:22 +05:30
Chandrashekara K R
34fee0fdbc AOCL-Windows: Added logic in the windows build system to generate cblas.h at configure time.
AMD-Internal: [CPUPL-2037]

Change-Id: Ie4ffd1d655079c895878f96dbb6f811547ad953d
2022-05-17 18:09:22 +05:30
Sireesha Sanga
6a2c4acc66 Runtime Thread Control using OpenMP API
Details:
-  During runtime, the application can set the desired number of threads
   using the standard OpenMP API omp_set_num_threads(nt).
-  The BLIS library internally uses the standard OpenMP API
   omp_get_max_threads() to fetch the latest value set by the application.
-  This value is used to decide the number of threads in subsequent
   BLAS calls.
-  At BLIS initialization, the BLIS_NUM_THREADS environment variable
   is given precedence over the OpenMP standard API omp_set_num_threads(nt)
   and the OMP_NUM_THREADS environment variable.
-  The order of precedence followed during BLIS initialization is:
	1. Valid value of BLIS_NUM_THREADS
	2. omp_set_num_threads(nt)
	3. Valid value of OMP_NUM_THREADS
	4. Number of cores
-  After BLIS initialization, if the application issues omp_set_num_threads(nt)
   during runtime, the number of threads set during BLIS initialization
   is overridden by the latest value set by the application.
-  The existing precedence of the BLIS_*_NT environment variables, and the
   decision of the optimal number of threads over the number of threads
   derived from the above process, remains as is.

AMD-Internal: [CPUPL-2076]

Change-Id: I935ba0246b1c256d0fee7d386eac0f5940fabff8
2022-05-17 18:09:22 +05:30
mkurumel
ab06f17689 DGEMMT : Tuning SUP threshold to improve ST and MT performance.
Details :
          - SUP threshold change for native vs SUP.
          - Improved ST performance for sizes n<800.
          - Introduced PACKB in SUP to improve ST performance for 320<n<800.
          - 16T SUP tuning for n<1600.

        AMD-Internal: [CPUPL-1981]

Change-Id: Ie59afa4d31570eb0edccf760c088deaa2e10cdda
2022-05-17 18:09:22 +05:30
Dipal M Zambare
06e386f054 Updated Windows build system to pick AMD specific sources.
The framework cleanup was done for Linux as part of
f63f78d7 "Removed Arch specific code from BLIS framework".

This commit adds changes needed for windows build.

AMD-Internal: [CPUPL-2052]

Change-Id: Ibd503a0adeea66850de156fb95657b124e1c4b9d
2022-05-17 18:09:20 +05:30
Harsh Dave
d50d607995 dher2 API in blis make check fails on non avx2 platform
- dher2 did not have an AVX platform check.
  It was calling the AVX kernel regardless of platform
  support, which resulted in a core dump.

- Added an AVX-based platform check in both variants of dher2 to
  fix the issue.

AMD-Internal: [CPUPL-2043]
Change-Id: I1fd1dcc9336980bfb7ffa9376f491f107c889c0b
2022-05-17 18:08:57 +05:30
Chandrashekara K R
e12f45033d AOCL_Windows: Updated windows build system.
Removed the "target_link_libraries("${PROJECT_NAME}" PRIVATE OpenMP::OpenMP_CXX)" statement for the static ST library build.
This statement is not needed for the static ST library build; it was mistakenly added.

Change-Id: I577a28c75644043fd077d938bf7f51cdea8ee13d
2022-05-17 18:08:57 +05:30
Arnav Sharma
393effbb0c Optimized ZAXPY2V using AVX2 Intrinsics
Details:
- Intrinsic implementation of ZAXPY2V fused kernel for AVX2
- Updated definitions in zen contexts

AMD-Internal: [CPUPL-2023]
Change-Id: I8889ae08c826d26e66ae607c416c4282136937fa
2022-05-17 18:08:57 +05:30
Chandrashekara K R
97fbff4b65 AOCL_Windows: Updated windows build system.
Updated the windows build system to link a user-specified OpenMP
library via the -DOpenMP_libomp_LIBRARY=<Desired lib name> option,
passed on the command line or through the cmake-gui application, when
building the blis library and its test applications. If the user does
not specify an OpenMP library, it defaults to
C:/Program Files/LLVM/lib/libomp.lib.

Change-Id: I07542c79454496f88e65e26327ad76a7f49c7a8c
2022-05-17 18:08:57 +05:30
Dipal M. Zambare
d3b22f590f Updated version number to 3.2
Change-Id: Iea5712d8cb854d4eaffea510e0fe2d9657e4d21f
2022-05-17 18:08:57 +05:30
Dipal M Zambare
f69f59c32c Removed Arch specific code from BLIS framework.
- Removed the BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
  one of three ways:

  -- It is updated to work across platforms.
  -- Architecture/feature-specific runtime checks are added.
  -- It is duplicated in AMD-specific files. The build system is
     updated to pick the AMD-specific files when the library is
     built for any of the zen architectures.

AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
2022-05-17 18:08:56 +05:30
Harsh Dave
d116780616 Optimized dher2 implementation
- Implemented her2 framework calls for transposed and
  non-transposed kernel variants.

- The dher2 kernel operates over 4 columns at a time. It computes
  the 4x4 triangular part of the matrix first; the remainder is
  computed in chunks of 4x4 tiles up to m rows.

- Remainder cases (m < 4) are handled serially.

AMD-Internal: [CPUPL-1968]

Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313
2022-05-17 18:05:08 +05:30
Arnav Sharma
86690f9fd3 Optimized AXPBYV Kernel using AVX2 Intrinsics
Details:
- Intrinsic implementation of axpbyv for AVX2
- Bench written for axpbyv
- Added definitions in zen contexts

AMD-Internal: [CPUPL-1963]

Change-Id: I9bc21a6170f5c944eb6e9e9f0e994b9992f8b539
2022-05-17 18:03:42 +05:30
HariharaSudhan S
d687bd36ea Merge "Improved AXPYV Kernel performance" into amd-staging-genoa-4.0 2022-05-17 18:03:42 +05:30
Chandrashekara K R
ec6e4162bc Updated windows build system.
We were using the add_compile_options(-Xclang -fopenmp) statement to
set OpenMP compiler flags on MSVC using cmake. A performance regression
was observed because of the compiler version used by MSVC (clang 10),
so it is removed from the windows build system; the compiler version
(clang 13) and compiler options are configured manually in the MSVC GUI
to regain performance on the matlab bench.

Change-Id: I37d778abdceb7c1fae9b1caaeea8adb114677dd2
2022-05-17 18:03:10 +05:30
Dipal M Zambare
31921b9974 Updated windows build system to define BLIS_CONFIG_EPYC flag.
All AMD-specific optimizations in BLIS are enclosed in the
BLIS_CONFIG_EPYC preprocessor macro; this was not defined in CMake,
resulting in overall lower performance.

Updated version number to 3.1.1

Change-Id: I9848b695a599df07da44e77e71a64414b28c75b9
2022-05-17 18:03:09 +05:30
Meghana Vankadari
c11fd5a8f6 Added functionality support for dzgemm
AMD-Internal: [SWLCSG-1012]
Change-Id: I2eac3131d2dcd534f84491289cbd3fe7fb7de3da
2022-05-17 18:01:55 +05:30
Dipal M. Zambare
b90420627a Revert "Enabled AVX-512 kernels for Zen4 config"
This reverts commit 62c96a4190.
Was committed without review.
2022-04-21 06:46:00 +00:00
Dipal M. Zambare
0adb525f5b Revert "Enabled AVX-512 kernels for Zen4 config"
This reverts commit f816cf059f.
Was committed without review.
2022-04-21 06:45:38 +00:00
Dipal M. Zambare
f816cf059f Enabled AVX-512 kernels for Zen4 config
Enabled AVX-512 skylake kernels in the zen4 configuration.
AVX-512 kernels are added for float and double types.

AMD-Internal: [CPUPL-2108]
Change-Id: Idfe3f64a037db019cbdf43318954db52ad241a51
2022-04-21 06:38:24 +00:00
Dipal M. Zambare
62c96a4190 Enabled AVX-512 kernels for Zen4 config
Enabled AVX-512 skylake kernels in the zen4 configuration.
AVX-512 kernels are added for float and double types.

AMD-Internal: [CPUPL-2108]
2022-04-21 06:28:29 +00:00
Field G. Van Zee
a4abb10831 Added a new 'gemmlike' sandbox.
Details:
- Added a new sandbox called 'gemmlike', which implements sequential and
  multithreaded gemm in the style of gemmsup but also unconditionally
  employs packing. The purpose of this sandbox is to
  (1) avoid select abstractions, such as objects and control trees, in
      order to allow readers to better understand how a real-world
      implementation of high-performance gemm can be constructed;
  (2) provide a starting point for expert users who wish to build
      something that is gemm-like without "reinventing the wheel."
  Thanks to Jeff Diamond, Tze Meng Low, Nicholai Tukanov, and Devangi
  Parikh for requesting and inspiring this work.
- The functions defined in this sandbox currently use the "bls_" prefix
  instead of "bli_" in order to avoid any symbol collisions in the main
  library.
- The sandbox contains two variants, each of which implements gemm via a
  block-panel algorithm. The only difference between the two is that
  variant 1 calls the microkernel directly while variant 2 calls the
  microkernel indirectly, via a function wrapper, which allows the edge
  case handling to be abstracted away from the classic five loops.
- This sandbox implementation utilizes the conventional gemm microkernel
  (not the skinny/unpacked gemmsup kernels).
- Updated some typos in the comments of a few files in the main
  framework.

Change-Id: Ifc3c50e9fd0072aada38eace50c57552c88cc6cf
2022-04-01 13:55:30 +05:30
Field G. Van Zee
7a0ba4194f Added support for addons.
Details:
- Implemented a new feature called addons, which are similar to
  sandboxes except that there is no requirement to define gemm or any
  other particular operation.
- Updated configure to accept --enable-addon=<name> or -a <name> syntax
  for requesting an addon be included within a BLIS build. configure now
  outputs the list of enabled addons into config.mk. It also outputs the
  corresponding #include directives for the addons' headers to a new
  companion to the bli_config.h header file named bli_addon.h. Because
  addons may wish to make use of existing BLIS types within their own
  definitions, the addons' headers must be included sometime after that
  of bli_config.h (which currently is #included before bli_type_defs.h).
  This is why the #include directives needed to go into a new top-level
  header file rather than the existing bli_config.h file.
- Added a markdown document, docs/Addons.md, to explain addons, how to
  build with them, and what assumptions their authors should keep in
  mind as they create them.
- Added a gemmlike-like implementation of sandwich gemm called 'gemmd'
  as an addon in addon/gemmd. The code uses a 'bao_' prefix for local
  functions, including the user-level object and typed APIs.
- Updated .gitignore so that git ignores bli_addon.h files.
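Building with an addon uses the configure syntax described above. A sketch, assuming a build from the top of the BLIS source tree (gemmd is the addon added by this commit; 'auto' lets configure autodetect the target configuration):

```shell
# Enable the gemmd addon and autodetect the configuration, then build.
./configure --enable-addon=gemmd auto
make -j
```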

Change-Id: Ie7efdea366481ce25075cb2459bdbcfd52309717
2022-03-31 12:03:27 +05:30
Meghana Vankadari
0792eb8608 Fixed a bug in deriving dimensions from objects in gemm_front files
Change-Id: I1f796c3a7ce6efacb6ef64651a7818b7ee38c6bb
2022-02-16 23:24:14 -05:00
Harihara Sudhan S
6696f91f41 Improved DGEMV performance for column-major cases
- Altered the framework to use 2 more fused kernels for
  better problem decomposition
- Increased the unroll factor in the AXPYF5 and AXPYF8 kernels
  to improve register usage

AMD-Internal: [CPUPL-1970]

Change-Id: I79750235d9554466def5ff93898f832834990343
2022-02-02 23:13:10 -05:00
Dipal M Zambare
6d1edca727 Optimized CPU feature determination.
We added a new API to check whether the CPU architecture supports
AVX instructions. This API was calling the CPUID instruction
every time it was invoked. However, since this information does
not change at runtime, it is sufficient to determine it once
and use the cached result for subsequent calls. This optimization
is needed to improve performance for small-size matrix and vector
operations.

AMD-Internal: [CPUPL-2009]
Change-Id: If6697e1da6dd6b7f28fbfed45215ea3fdd569c5f
2022-02-01 11:15:55 +05:30
Harihara Sudhan
14fb31c0d5 Improved performance of DOTXV kernel for float and double
- Vectorized sections of code that were not vectorized

AMD Internal: [CPUPL-1980]

Change-Id: I08528d054442a5e728f631142f244f1624170136
2022-01-24 23:08:38 -05:00
Saitharun
e783ea10db Enable wrapper code by default
Details: Enabled the wrapper code by default for the 4.0 branch
and removed the ENABLE_API_WRAPPER macro.

Change-Id: I5c9ede7ae959d811bc009073a266e66cbf07ef1a
2022-01-19 11:38:45 +05:30
Dipal M Zambare
f63f78d783 Removed Arch specific code from BLIS framework.
- Removed the BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
  one of three ways:

  -- It is updated to work across platforms.
  -- Architecture/feature-specific runtime checks are added.
  -- It is duplicated in AMD-specific files. The build system is
     updated to pick the AMD-specific files when the library is
     built for any of the zen architectures.

AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
2022-01-18 11:51:08 +05:30
Harsh Dave
79c6aa5643 Implemented optimal S/DCOMPLEX dotxf kernel
- Optimized the dotxf implementation for double
  and single precision complex datatypes by
  handling the dot product computation in 2x6
  and 4x6 tiles, covering 6 columns at a time and
  rows in multiples of 2 and 4.

- The dot product computation is arranged in such a way
  that multiple rho vector registers hold the
  temporary results till the end of the loop, and
  finally a horizontal addition produces the final
  dot product result.

- Corner cases are handled serially.

- Optimal use and reuse of vector registers for
  faster computation.

AMD-Internal: [CPUPL-1975]
Change-Id: I7dd305e73adf54100d54661769c7d5aada9b0098
2022-01-06 02:22:52 -05:00
mkadavil
457c33a601 Eliminating barriers in SUP path when matrices are not packed.
- The current gemm SUP path uses bli_thrinfo_sup_grow and
bli_thread_range_sub to generate per-thread data ranges at each loop of
the gemm algorithm. bli_thrinfo_sup_grow involves multiple barriers for
cross-thread synchronization. These barriers are necessary in cases
where either the A or B matrix is packed, for centralized pack buffer
allocation/deallocation (by the bli_thread_am_ochief thread).

- However, for cases where both the A and B matrices are unpacked, these
barriers result in overhead for smaller dimensions. Here the creation of
unnecessary communicators is avoided, and consequently the barriers are
eliminated when packing is disabled for both input matrices in the SUP
path.

Change-Id: Ic373dfd2d6b08b8f577dc98399a83bb08f794afa
2022-01-06 01:56:43 -05:00
Harsh Dave
351269219f Optimized dher2 implementation
- Implemented her2 framework calls for transposed and
  non-transposed kernel variants.

- The dher2 kernel operates over 4 columns at a time. It computes
  the 4x4 triangular part of the matrix first; the remainder is
  computed in chunks of 4x4 tiles up to m rows.

- Remainder cases (m < 4) are handled serially.

AMD-Internal: [CPUPL-1968]

Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313
2022-01-05 05:51:15 -06:00