Commit Graph

148 Commits

Author SHA1 Message Date
Shubham Sharma
26bd265cfd Optimized DTRSV for tiny sizes
- Replaced switch case with if else, lookup table for switch case
  is palced at the end of .text section which causes a huge jump.
- Reduced number of branches for tiny sizes.
- Cpuid query is slow, therefore added a new if statement which avoids cpuid
  query for tiny sizes(<200).
- Redirected tiny sizes to AVX2 kernel.

AMD-Internal: [CPUPL-5407]
Change-Id: I8e73777b2f00c9dcff9775ddfcb7ca3f74fa901c
2025-01-30 01:23:09 -05:00
Hari Govind S
349fc47ec5 DGEMV Optimizations for TRANSPOSE Cases
- Developed new AVX512 DGEMV kernels for Zen4/5 architectures and
  AVX2 kernels for Zen1/2/3 architectures. These kernels are written
  from the ground up and are independent of fused kernels.

- The DGEMV primary kernel processes the calculation in chunks of
  8 columns. Fringe columns (sizes 1 to 7) are handled by fringe
  kernels, which are invoked by the primary kernel as needed.

- Implemented the kernels by computing the dot product of matrix A
  columns with vector x in chunks of 32 elements, storing the results
  in accumulator registers. Fringe elements are handled in chunks
  of 16, 8, etc. The data in the accumulator registers is then reduced
  and added to vector y.

AMD-Internal: [CPUPL-5835]
Change-Id: I5cb9eb1330db095931586a7028fd7676fbbecc61
2025-01-24 00:38:34 -05:00
Arnav Sharma
25e59fcbb9 DGEMV Optimizations for NO_TRANSPOSE Cases
- AVX512 specific DGEMV native kernels are added for Zen4/5
  architectures to handle the NO_TRANSPOSE cases and are independent of
  the AXPYF fused kernels.
- The following set of kernels biased towards the n-dimension perform
  beta scaling of y vector within the kernel itself and handle cases
  where n is less than 5:
    - bli_dgemv_n_zen_int_32x8n_avx512( ... )
    - bli_dgemv_n_zen_int_32x4n_avx512( ... )
    - bli_dgemv_n_zen_int_32x2n_avx512( ... )
    - bli_dgemv_n_zen_int_32x1n_avx512( ... )
- The bli_dgemv_n_zen_int_16mx8_avx512( ... ) is biased towards the
  m-dimension and for this kernel beta scaling is handled beforehand
  within the framework.
- Added unit-tests for the new kernels.
- AVX2 path for Zen/2/3 architectures still follows the old approach of
  using fused kernel, namely AXPYF, to perform the GEMV operation.

AMD-Internal: [CPUPL-5560]
Change-Id: I22bc2a865cd28b9cdcb383e17d1ff38bdd28de79
2024-12-12 10:26:50 -05:00
Shubham Sharma
d322cc11f8 Tiny size optimization for DTRSV var2
- Use AVX2 kernels for tiny sizes on genoa.
- Removed the runtime init overhead for small sizes.

AMD-Internal: [CPUPL-5407]
Change-Id: I0db7d93abc659012916ef706f22528c7fabb4e30
2024-08-20 00:40:25 -04:00
Arnav Sharma
9583ee2e23 DGEMV Optimizations for NO_TRANSPOSE cases
- Enabled AVX512 DAXPYF kernels for DGEMV var2 for NO_TRANSPOSE cases.

- Added DAXPYF kernels with fuse factors of 2, 4, 6 and 16.

- Added a wrapper for DAXPYF kernels for redirection to kernels with a
  smaller fuse factor than 32.

- Also added UKR tests for the new fused kernels.

AMD-Internal: [CPUPL-5098]
Change-Id: I0b102b67c6c068873393bac0494284f379c253f2
2024-07-24 15:59:36 +05:30
Hari Govind S
38824244d5 Implementation of AXPYF Kernels for DTRSV
-  Implemented two new axpyf kernels for fused factors 8 and 12
   by manually unrolling the loops. Used to achieve better performance
   in var2 case.

AMD-Internal: [CPUPL-5184]
Change-Id: I40d2930d003c6ce90323b5c8a52564563d1f23f5
2024-07-16 06:23:01 -04:00
Shubham Sharma
7553abad8e Fixed compilation error with AOCC in TRSV
- Added a {} around zen4 switch case to avoid AOCC error.
- Error is caused because in C declarations are not a statement, therefore
  they cannot be labled hence compiler is not able to create a lable
  for jump.

AMD-Internal: [CPUPL-4880]
Change-Id: Icfeedafd80bf9a955e430ca967b6a93dcbbf075e
2024-05-03 21:08:38 +05:30
Shubham Sharma
1d983e6124 Added AVX512 kernels for DAXPYF and DDOTXF
- Added DAXPYF and DDOTXF AVX512 kernels.
- Fuse factor for ddotxf kernel is 8.
- 2 DAXPYF kernels are added, with fuse
  factor 8 and 32.
- Multithreading is also added to the DAXPYf
  kernel with fuse factor 32.
- These kernels are internally used by TRSM.
- Added changes in TRSV to call these kernels
  in ZEN4

AMD-Internal: [CPUPL-4880]
Change-Id: I12850de974b437bbca07677b68bc3d6a35858770
2024-05-03 05:10:22 -04:00
Vignesh Balasubramanian
4e2966f9b0 AVX512 optimizations for ZGEMV API with transpose case
- Implemented AVX512 kernels for handling the calls to ZGEMV
  with transpose to A matrix.

- This includes the set of ZDOTXF and ZDOTXV kernels. ZDOTXF
  kernels include those with fuse-factor 8 (main kernel), 4
  and 2(fringe kernels).

- Updated the bli_zgemv_unf_var1( ... ) function to update
  the function pointers to these kernels, based on the
  configuration.

AMD-Internal: [CPUPL-4974]
Change-Id: I313ae0abe9dc119de849da42f9825b71f11b1fda
2024-05-03 04:38:52 -04:00
Vignesh Balasubramanian
53cb83d0cc AVX512 optimizations for ZGEMV API with no-transpose case
- Implemented AVX512 kernels for handling the calls to ZGEMV
  with no-transpose to A matrix.

- This includes the ZAXPYF, ZAXPYV and ZSETV kernels.
  The set of ZAXPYF kernels include those with fuse-factor 8
  (main kernel), 4 and 2(fringe kernels).

- Updated the bli_zgemv_unf_var2( ... ) function to set
  the function pointers to these kernels, based on the
  configuration. Further added the call to ZSETV at this
  layer in case beta is 0.

AMD-Internal: [CPUPL-4974]
Change-Id: Iee4b724719e49023138bb16479765be44d677cd9
2024-05-03 07:04:47 +00:00
Shubham Sharma
632c32767b Avoid alpha scaling in ZTRSV/ZTRSM when alpha = 1
- Scaling vector X is skipped when alpha is 1 in ZTRSV.
- Scaling matrix A is skipped when alpha is 1 in ZTRSM.

AMD-Internal: [CPUPL-4324]
Change-Id: I03c5a454ed1f5be36dac0f121408749bfc9cfc81
2024-04-16 02:24:02 -04:00
Edward Smyth
2450a1813b BLIS: Implement zen5 sub-configuration
Implement full support for zen5 as a separate BLIS sub-configuration
and code path within amdzen configuration family.

AMD-Internal: [CPUPL-3518]
Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09
2024-04-12 07:26:31 -04:00
Edward Smyth
ed5010d65b Code cleanup: AMD copyright notice
Standardize format of AMD copyright notice.

AMD-Internal: [CPUPL-3519]
Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0
2023-11-23 08:54:31 -05:00
Eleni Vlachopoulou
75a4d2f72f CMake: Adding new portable CMake system.
- A completely new system, made to be closer to Make system.

AMD-Internal: [CPUPL-2748]
Change-Id: I83232786406cdc4f0a0950fb6ac8f551e5968529
2023-11-09 15:49:45 +05:30
mangala v
fa355c0049 Removed warning during compilation of gemv api for non-zen config
- When configured for haswell config "Warning unused variable 'zero'"
  was throwed during compilation.
- Removed zero variable which is not being used

AMD-Internal: [CPUPL-3973]
Change-Id: I45a1f16b4c50307b07148bba63ca5332c48648b8
2023-11-08 01:43:33 -05:00
Vignesh Balasubramanian
ef545b928e Bugfix : Changing fuse factor for the call to vectorized SAXPYF kernel
- The call to the bli_saxpyf_zen_int_6( ... ) is explicitly
  present in the bli_gemv_unf_var2_amd.c file, as part of the
  bli_sgemv_unf_var2( ... ) function. This was changed to
  bli_saxpyf_zen_int_5( ... )( thereby changing the fuse factor
  from 6 to 5 ), in accordance to the function pointer present
  in the zen3 and zen4 context files.

- Changed the accumulator type to double from float, inside the
  fringe loop for unit-strides(vectorized path) and non-unit strides
  (scalar code).

AMD-Internal: [CPUPL-4028]
Change-Id: Iab1a0318f461cba9a7041093c6865ae8396d231e
2023-11-03 01:37:43 -04:00
Harihara Sudhan S
106342f402 ZGEMV optimization for special cases in beta
- Avoiding scaling of y vector by beta when beta is 1.

AMD-Internal: [CPUPL-3829]
Change-Id: I9cf46f44c5f1c2da3653937ff035594b4046b4a1
2023-11-02 08:21:46 -04:00
Harihara Sudhan S
105de694cf Optimized ZGEMV variant 1
- Added an explicit function definition for ZGEMV var 1. This
  removes the need to query the context for Zen architectures.
- Added a new INSERT_GENTFUNC to generate the definition only
  for scomplex type.
- Rewrote ZDOTXF kernel and added the function name for ZDOTV
  instead of querying it.
- With this change fringe loop is vectorized using SSE
  instructions.

AMD-Internal:[CPUPL-3997]

Change-Id: I790214d528f9e39f63387bc95bf611f84d3faca3
2023-10-13 05:03:53 -04:00
Edward Smyth
bb4c158e63 Merge commit 'b683d01b' into amd-main
* commit 'b683d01b':
  Use extra #undef when including ba/ex API headers.
  Minor preprocessor/header cleanup.
  Fixed typo in cpp guard in bli_util_ft.h.
  Defined eqsc, eqv, eqm to test object equality.
  Defined setijv, getijv to set/get vector elements.
  Minor API breakage in bli_pack API.
  Add err_t* "return" parameter to malloc functions.
  Always stay initialized after BLAS compat calls.
  Renamed membrk files/vars/functions to pba.
  Switch allocator mutexes to static initialization.

AMD-Internal: [CPUPL-2698]
Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df
2023-08-21 07:01:38 -04:00
Harihara Sudhan S
278ca71706 Fixes for GEMV Functionality Issues
- Added call to dsetv in dscalv. When DSCALV is invoked by
  DGEMV the SCAL function is expected to SET the vector to
  zero when alpha is 0. This change is done to ensure BLAS
  compatibility of DGEMV.
- Fixed bug in DGEMV var 1. Reverted changes in DGEMV var
  1 to remove packing and dispatch logic.
- CMAKE now builds with _amd files for unf_var2 of GEMV.

AMD-Internal: [CPUPL-3772]
Change-Id: I0d60c9e1025a3a56419d6ae47ded509d50e5eade
2023-08-14 13:54:48 +05:30
Harihara Sudhan S
03fa660792 Optimized xGEMV for non-unit stride X vector
- In GEMV variant 1, the input matrix A is in row major. X vector
  has to be of unit stride if the operation is to be vectorized.
- In cases when X vector is non-unit stride, vectorization of the GEMV
  operation inside the kernel has been ensured by packing the input X
  vector to a temporary buffer with unit stride. Currently, the
  packing is done using the SCAL2V.
- In case of DGEMV, X vector is scaled by alpha as part of packing.
  In CGEMV and ZGEMV, alpha is passed as 1 while packing.
- The temporary buffer created is released once the GEMV operation
  is complete.
- In DGEMV variant 1, moved problem decomposition for Zen architecture
  to the DOTXF kernel.
- Removed flag check based kernel dispatch logic from DGEMV. Now,
  kernels will be picked from the context for non-avx machines. For
  avx machines, the kernel(s) to be dispatched is(are) assigned to
  the function pointer in the unf_var layer.

AMD-Internal: [CPUPL-3475]
Change-Id: Icd9fd91eccd831f1fcb9fbf0037fcbbc2e34268e
2023-08-08 01:01:22 -04:00
Harihara Sudhan S
3be43d264f Optimized xGEMV for non-unit stride Y vector
- In variant 2 of GEMV, A matrix is in column major. Y vector has
  to be of unit stride if the operation is to be vectorized.
- In cases when Y vector is non-unit stride, vectorization of the
  GEMV operation inside the kernel has been ensured by packing the
  input Y vector to a temporary buffer with unit stride. As part of
  the packing Y is scaled by beta to reduce the number of times Y
  vector is to be loaded.
- After performing the GEMV operation, the results in the temporary
  buffer are copied to the original buffer and the temporary one is
  released.
- In DGEMV var 2, moved problem decomposition for Zen architecture
  to the AXPYF kernel.
- Removed flag check based kernel dispatch logic from DGEMV. Now,
  kernels will be picked from the context for non-avx machines. For
  avx machines, the kernel(s) to be dispatched is(are) assigned to
  the function pointer in the unf_var layer.

AMD-Internal: [CPUPL-3485]
Change-Id: I7b2efb00a9fa9abca65abca07ee80f38229bf654
2023-08-07 08:12:44 -04:00
Edward Smyth
7e50ba669b Code cleanup: No newline at end of file
Some text files were missing a newline at the end of the file.
One has been added.

Also correct file format of windows/tests/inputs.yaml, which
was missed in commit 0f0277e104

AMD-Internal: [CPUPL-2870]
Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549
2023-04-21 10:02:48 -04:00
Edward Smyth
6835205ba8 Code cleanup: spelling corrections
Corrections for spelling and other mistakes in code comments
and doc files.

AMD-Internal: [CPUPL-2870]
Change-Id: Ifbb5df7df2d6312fe73e06ee6d41c00b16c593ce
2023-04-19 12:44:56 -04:00
Aayush Kumar
71272ab574 .Fixed Compiler warnings for GCC 12 and AOCC 4.0
- Set the variables to zero to avoid the compiler warning
  (-Wmaybe-uninitialized) in bli_dgemm_ref_k1.c,
  bli_gemm_small.c, bli_trsm_small.c, bli_zgemm_ref_k1.c and
  bli_trsm_small_AVX512.c

- Changed the datatype from dim_t to siz_t for i,k,j
  in bli_hemv_unf_var1_amd.c and bli_hemv_unf_var3_amd.c to
  avoid the compiler warning (-Waggressive-loop-optimizations)

AMD-Internal: [CPUPL-2870]

Change-Id: Ib2bc050fa47cb8a280d719283ab4539c70e19d03
2023-04-14 13:29:17 +00:00
Harihara Sudhan S
2e6724262e ZGEMV var 2 bug fix
- Fixed segmentation fault that was seen on non zen and non avx2
  machines.
- cntx object was not passed to the invoked kernel causing a seg
  fault.

AMD-Internal: [CPUPL-3167]
Change-Id: I2640d3f905e78398935cf6ed667b04a6418baa5d
2023-04-05 01:31:24 -04:00
Edward Smyth
1ac03e64b5 BLIS cpuid tidy and bugfix.
Improvements to BLIS cpuid functionality:
- Tidy names of avx support test functions, especially rename
  bli_cpuid_is_avx_supported() to bli_cpuid_is_avx2fma3_supported()
  to more accurately describe what it tests.
- Fix bug in frame/base/bli_check.c related to changes in commit
  6861fcae91

AMD-Internal: [CPUPL-3031]
Change-Id: Iacd8fb0ffbd45288e536fc6314660709055ea2d5
2023-04-03 08:46:37 -04:00
Harihara Sudhan S
4b36529a8b Added vector packing logic to ZGEMV variant 2
- In cases when incy != 1, a buffer is created for y vector. The
  contents of vector y is scaled by beta and stored in this buffer.
- After performing the compute using ZAXPYF kernel, the results in
  y buffer memory is copied back to the orginal buffer using ZCOPYV.
- In cases when alpha is zero, we only scale the y vector by beta
  without using the buffer and return.
- The kernels are picked based on the architecture ID. For any zen
  based architecture, AVX2 kernels are invoked. For other, the
  kernels are invoked based on the context.
- In ZSCAL2V, query for the context if NULL pointer is passed.

AMD-Internal: [CPUPL-2773]
Change-Id: If409ca5c438fc2eebe73480c011577088d52c65f
2023-03-22 03:19:18 -04:00
Edward Smyth
7f86561d26 BLIS-Nov2022: HPL memory issues with GCC.
HPL script was using BLIS manual way to set threading, i.e. setting
BLIS_IC_NT explicitly. This causes bli_rntm_num_threads() to return
-1, which wasn't trapped in parallelised BLAS1 and BLAS2 routines.

Fix: if this occurs, set local number of threads based on product of
BLIS_JC_NT * BLIS_PC_NT * BLIS_IC_NT * BLIS_JR_NT * BLIS_IR_NT values.

Note: BLIS_PC_NT should always be 1, but this environment variable
is currently being read (contrary to documentation), so include it
for now.

Other changes:
* implement _Pragma convention in all code used on AMD
* frame/2/gemv/bli_gemv_unf_var1_amd.c: Remove is_omp_mt_enabled flag

AMD-Internal: [CPUPL-2803]
Change-Id: I37e8b038e5640d6693a87be0609888186322b465
2022-12-06 05:10:34 -05:00
Harihara Sudhan S
42d631bced Copyright modification
- Added copyright information to modified/newly created
          files missing them

Change-Id: If4e73b680246d0363de09587d6dc54bee00ecd71
2022-10-14 12:43:35 +05:30
Edward Smyth
abf848ad12 Code cleanup and warnings fixes
- Removed some additional compiler warnings reported by GCC 12.1
- Fixed a couple of typos in comments
- frame/3/bli_l3_sup.c: routines were returning before final call
  to AOCL_DTL_TRACE_EXIT
- frame/2/gemv/bli_gemv_unf_var1_amd.c: bli_multi_sgemv_4x2 is
  only defined in header file if BLIS_ENABLE_OPENMP is defined

AMD-Internal: [CPUPL-2460]
Change-Id: I2eacd5687f2548d8f40c24bd1b930859eefbbcde
2022-08-29 08:22:30 -04:00
Arnav Sharma
eb83a0fe9d Enabled ZHER Optimized Path
- While calculating the diagonal and corner elements, the combined
operation of calculating the product of x and x hermitian and
simultaneously scaling it with alpha and adding the result to the matrix
was the cause of increased underflow and overflow errors in netlib
tests.
- So the above calculation is now being done in three steps: scaling x
vector with alpha, then calculating its product with x hermitian and
later adding the final result to the matrix.

AMD-Internal: [CPUPL-2213]
Change-Id: I32df572b013bc3189340662dbf17eddcaec9f0f8
2022-08-29 08:09:42 -04:00
Arnav Sharma
035ed98b51 Temporarily disabling optimized ZHER
- Disabling optimized ZHER pending verification with netlib BLAS test.

AMD-Internal: [CPUPL-2416]
Change-Id: I74c4d16e1c99ddeb1df91130a8e14feafd0952d0
2022-08-19 12:16:46 -04:00
Harihara Sudhan S
d4bb906094 Exception handling in GEMV smart-threading
- Added condition to check if n or m is 0 in smart threading
	  logic

AMD-Internal: [CPUPL2219]
Change-Id: Idd58cd13a11aa5bdb4117b4c9262f38ef3c1afc4
2022-06-29 01:52:18 -04:00
Dipal M Zambare
c87b9aab75 Added support for AVX512 for Windows and AMAVX
- Completed zen4 configuration support on windows
 - Enabled AVX512 kernels for AMAXV
 - Added zen4 configuration in amdzen for windows
 - Moved all zen4 kernels inside kernels/zen4 folder

AMD-Internal: [CPUPL-2108]
Change-Id: I9d2336998bbcdb8e2c4ca474977b5939bfa578ba
2022-06-08 11:09:48 +05:30
Arnav Sharma
66b2231b65 Fixed CMake files for HER
- Removed subdirectory addition

Change-Id: I419085db0b9034777409207a7d79b7ffa91eb8f1
2022-06-01 12:25:43 +05:30
Arnav Sharma
e5d5a43eab Optimized ZHER Implementation
- Implemented optimized her framework calls for double precision complex numbers.
- The zher kernel operates over 4 columns at a time. Initially, it computes the diagonal elements of the matrix, then the 4x4 triangular part is computed and finally the remaining part is computed as 4x4 tiles of the matrix upto m rows.

AMD-Internal: [CPUPL-2151]

Change-Id: I27430ee33ffb901b3ef4bdd97b034e3f748e9cca
2022-05-25 14:03:01 +05:30
Dipal M Zambare
6e2f536590 Removed Arch specific code from BLIS framework.
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
  one of the three ways

  -- It is updated to work across platforms.
  -- Added in architecture/feature specific runtime checks.
  -- Duplicated in AMD specific files. Build system is updated to
      pick AMD specific files when library is built for any of the
     zen architecture

AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
2022-05-17 20:35:40 +05:30
Harsh Dave
f48ced0811 Optimized dher2 implementation
- Impplemented her2 framework calls for transposed and non
  transposed kernel variants.

- dher2 kernel operate over 4 columns at a time. It computes
  4x4 triangular part of matrix first and remainder part is
  computed in chunk of 4x4 tile upto m rows.

- remainder cases(m < 4) are handled serially.

AMD-Internal: [CPUPL-1968]

Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313
2022-05-17 18:13:07 +05:30
mkadavil
8670992c3d Default sgemv kernel to be used in single-threaded scenarios.
- sgemv calls a multi-threading friendly kernel whenever it is compiled
with open mp and multi-threading enabled. However it was observed that
this kernel is not suited for scenarios where sgemv is invoked in a
single-threaded context (eg: sgemv from ST sgemm fringe kernels and with
matrix blocking). Falling back to the default single-threaded sgemv
kernel resulted in better performance for this scenario.

AMD-Internal: [CPUPL-2136]
Change-Id: Ic023db4d20b2503ea45e56a839aa35de0337d5a6
2022-05-17 18:10:40 +05:30
Nallani Bhaskar
2acb3f6ed0 Tuned aocl dynamic for specific range in dgemm
Description:

1. Decision logic to choose optimal number of threads for
   given input dgemm dimensions under aocl dynamic feature
   were retuned based on latest code.

2. Updated code in few file to avoid compilation warnings.

3. Added a min check for nt in bli_sgemv_var1_smart_threading
   function

AMD-Internal: [ CPUPL-2100 ]
Change-Id: I2bc70cc87c73505dd5d2bdafb06193f664760e02
2022-05-17 18:10:39 +05:30
S, HariharaSudhan
a8bc55c373 Multithreaded SGEMV var 1 with smart threading
- Implemented an OpenMP based stand alone SGEMV kernel for
	  row-major (var 1) for multithread scenarios
	- Smart threading is enabled when AOCL DYNAMIC is defined
	- Number of threads are decided based on the input dims
	  using smart threading

AMD-Internal: [CPUPL-1984]
Change-Id: I9b191e965ba7468e95aabcce21b35a533017502e
2022-05-17 18:10:39 +05:30
Dipal M Zambare
06e386f054 Updated Windows build system to pick AMD specific sources.
The framework cleanup was done for linux as part of
f63f78d7 Removed Arch specific code from BLIS framework.

This commit adds changes needed for windows build.

AMD-Internal: [CPUPL-2052]

Change-Id: Ibd503a0adeea66850de156fb95657b124e1c4b9d
2022-05-17 18:09:20 +05:30
Harsh Dave
d50d607995 dher2 API in blis make check fails on non avx2 platform
- dher2 did not have avx check for platform.
  It was calling avx kernel regardless of platform
  support. Which resulted in core dump.

- Added avx based platform check in both variant of dher2 for
  fixing the issue.

AMD-Internal: [CPUPL-2043]
Change-Id: I1fd1dcc9336980bfb7ffa9376f491f107c889c0b
2022-05-17 18:08:57 +05:30
Dipal M Zambare
f69f59c32c Removed Arch specific code from BLIS framework.
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
  one of the three ways

  -- It is updated to work across platforms.
  -- Added in architecture/feature specific runtime checks.
  -- Duplicated in AMD specific files. Build system is updated to
      pick AMD specific files when library is built for any of the
     zen architecture

AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
2022-05-17 18:08:56 +05:30
Harsh Dave
d116780616 Optimized dher2 implementation
- Impplemented her2 framework calls for transposed and non
  transposed kernel variants.

- dher2 kernel operate over 4 columns at a time. It computes
  4x4 triangular part of matrix first and remainder part is
  computed in chunk of 4x4 tile upto m rows.

- remainder cases(m < 4) are handled serially.

AMD-Internal: [CPUPL-1968]

Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313
2022-05-17 18:05:08 +05:30
Harihara Sudhan S
6696f91f41 Improved DGEMV performance for column-major cases
- Altered the framework to use 2 more fused kernels for
	  better problem decomposition
	- Increased unroll factor in AXPYF5 and AXPYF8 kernels
	  to improve register usage

AMD-Internal: [CPUPL-1970]

Change-Id: I79750235d9554466def5ff93898f832834990343
2022-02-02 23:13:10 -05:00
Dipal M Zambare
f63f78d783 Removed Arch specific code from BLIS framework.
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
  one of the three ways

  -- It is updated to work across platforms.
  -- Added in architecture/feature specific runtime checks.
  -- Duplicated in AMD specific files. Build system is updated to
      pick AMD specific files when library is built for any of the
     zen architecture

AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
2022-01-18 11:51:08 +05:30
Harsh Dave
351269219f Optimized dher2 implementation
- Impplemented her2 framework calls for transposed and non
  transposed kernel variants.

- dher2 kernel operate over 4 columns at a time. It computes
  4x4 triangular part of matrix first and remainder part is
  computed in chunk of 4x4 tile upto m rows.

- remainder cases(m < 4) are handled serially.

AMD-Internal: [CPUPL-1968]

Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313
2022-01-05 05:51:15 -06:00
Harihara Sudhan S
8201bcfdaf Improved DGEMV performance for smaller sizes
- Introduced two new ddotxf functions with lower fuse
  factor.
- Changed the DGEMV framework to use new kernels to
  improve problem decomposition.

Change-Id: I523e158fd33260d06224118fbf74f2314e03a617
2021-12-15 13:17:14 +05:30