Commit Graph

119 Commits

Author SHA1 Message Date
Edward Smyth
7e50ba669b Code cleanup: No newline at end of file
Some text files were missing a newline at the end of the file.
One has been added.

Also correct file format of windows/tests/inputs.yaml, which
was missed in commit 0f0277e104

AMD-Internal: [CPUPL-2870]
Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549
2023-04-21 10:02:48 -04:00
Edward Smyth
6835205ba8 Code cleanup: spelling corrections
Corrections for spelling and other mistakes in code comments
and doc files.

AMD-Internal: [CPUPL-2870]
Change-Id: Ifbb5df7df2d6312fe73e06ee6d41c00b16c593ce
2023-04-19 12:44:56 -04:00
Aayush Kumar
71272ab574 Fixed compiler warnings for GCC 12 and AOCC 4.0
- Set the variables to zero to avoid the compiler warning
  (-Wmaybe-uninitialized) in bli_dgemm_ref_k1.c,
  bli_gemm_small.c, bli_trsm_small.c, bli_zgemm_ref_k1.c and
  bli_trsm_small_AVX512.c

- Changed the datatype from dim_t to siz_t for i,k,j
  in bli_hemv_unf_var1_amd.c and bli_hemv_unf_var3_amd.c to
  avoid the compiler warning (-Waggressive-loop-optimizations)

AMD-Internal: [CPUPL-2870]

Change-Id: Ib2bc050fa47cb8a280d719283ab4539c70e19d03
2023-04-14 13:29:17 +00:00
Harihara Sudhan S
2e6724262e ZGEMV var 2 bug fix
- Fixed a segmentation fault that was seen on non-zen, non-AVX2
  machines.
- The cntx object was not passed to the invoked kernel, which caused
  the segmentation fault.

AMD-Internal: [CPUPL-3167]
Change-Id: I2640d3f905e78398935cf6ed667b04a6418baa5d
2023-04-05 01:31:24 -04:00
Edward Smyth
1ac03e64b5 BLIS cpuid tidy and bugfix.
Improvements to BLIS cpuid functionality:
- Tidy names of avx support test functions, especially rename
  bli_cpuid_is_avx_supported() to bli_cpuid_is_avx2fma3_supported()
  to more accurately describe what it tests.
- Fix bug in frame/base/bli_check.c related to changes in commit
  6861fcae91

AMD-Internal: [CPUPL-3031]
Change-Id: Iacd8fb0ffbd45288e536fc6314660709055ea2d5
2023-04-03 08:46:37 -04:00
Harihara Sudhan S
4b36529a8b Added vector packing logic to ZGEMV variant 2
- When incy != 1, a buffer is created for the y vector. The
  contents of y are scaled by beta and stored in this buffer.
- After the compute is performed using the ZAXPYF kernel, the results in
  the buffer are copied back to the original vector using ZCOPYV.
- When alpha is zero, we only scale the y vector by beta,
  without using the buffer, and return.
- The kernels are picked based on the architecture ID. For any zen
  based architecture, AVX2 kernels are invoked. For others, the
  kernels are invoked based on the context.
- In ZSCAL2V, query for the context if a NULL pointer is passed.

AMD-Internal: [CPUPL-2773]
Change-Id: If409ca5c438fc2eebe73480c011577088d52c65f
2023-03-22 03:19:18 -04:00
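The packing flow described in commit 4b36529a8b can be sketched roughly as follows. This is only an illustration under stated assumptions, not the BLIS code: the function name `zgemv_packed_y`, its signature, the column-major layout, and the unit-stride x are all hypothetical, and plain loops stand in for the ZAXPYF and ZCOPYV kernels.

```c
#include <complex.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical reference helper (not a BLIS API):
   y := beta*y + alpha*A*x, with A column-major m x n (leading dimension lda),
   x unit-stride, and y stored with stride incy. Returns 0 on success. */
int zgemv_packed_y(size_t m, size_t n,
                   double complex alpha, const double complex *a, size_t lda,
                   const double complex *x,
                   double complex beta, double complex *y, size_t incy)
{
    if (m == 0) return 0;

    /* alpha == 0: only scale y in place; no buffer is needed. */
    if (alpha == 0.0) {
        for (size_t i = 0; i < m; ++i) y[i * incy] *= beta;
        return 0;
    }

    /* Non-unit stride: pack beta*y into a contiguous buffer. */
    double complex *buf = y;
    if (incy != 1) {
        buf = malloc(m * sizeof *buf);
        if (buf == NULL) return -1;
        for (size_t i = 0; i < m; ++i) buf[i] = beta * y[i * incy];
    } else {
        for (size_t i = 0; i < m; ++i) buf[i] *= beta;
    }

    /* Column-wise accumulation on contiguous memory
       (stand-in for the fused axpyf kernel). */
    for (size_t j = 0; j < n; ++j) {
        double complex ax = alpha * x[j];
        for (size_t i = 0; i < m; ++i) buf[i] += ax * a[i + j * lda];
    }

    /* Copy the result back to the strided vector (stand-in for copyv). */
    if (incy != 1) {
        for (size_t i = 0; i < m; ++i) y[i * incy] = buf[i];
        free(buf);
    }
    return 0;
}
```

Packing trades one extra pass over y for unit-stride access inside the inner loop, which is where vectorized kernels benefit.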
Edward Smyth
7f86561d26 BLIS-Nov2022: HPL memory issues with GCC.
The HPL script was setting threading the manual BLIS way, i.e. setting
BLIS_IC_NT explicitly. This causes bli_rntm_num_threads() to return
-1, which wasn't trapped in the parallelised BLAS1 and BLAS2 routines.

Fix: if this occurs, set local number of threads based on product of
BLIS_JC_NT * BLIS_PC_NT * BLIS_IC_NT * BLIS_JR_NT * BLIS_IR_NT values.

Note: BLIS_PC_NT should always be 1, but this environment variable
is currently being read (contrary to documentation), so include it
for now.

Other changes:
* implement _Pragma convention in all code used on AMD
* frame/2/gemv/bli_gemv_unf_var1_amd.c: Remove is_omp_mt_enabled flag

AMD-Internal: [CPUPL-2803]
Change-Id: I37e8b038e5640d6693a87be0609888186322b465
2022-12-06 05:10:34 -05:00
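The fallback described in commit 7f86561d26 amounts to the following arithmetic. This is a minimal sketch, assuming the per-loop values have already been read into a struct; `loop_ways_t` and `resolve_num_threads` are hypothetical names, and treating an unset per-loop value as 1 is an assumption of this sketch rather than something the commit states.

```c
/* Hypothetical container for the values of BLIS_JC_NT, BLIS_PC_NT,
   BLIS_IC_NT, BLIS_JR_NT and BLIS_IR_NT. */
typedef struct {
    long jc, pc, ic, jr, ir;
} loop_ways_t;

/* nt is the value a query such as bli_rntm_num_threads() returned;
   -1 means no global thread count was set. */
long resolve_num_threads(long nt, loop_ways_t w)
{
    if (nt > 0) return nt; /* a global thread count is available */

    /* Fall back to the product of the per-loop values.
       Assumption: values that are unset (<= 0) count as 1. */
    long ways[5] = { w.jc, w.pc, w.ic, w.jr, w.ir };
    long total = 1;
    for (int i = 0; i < 5; ++i)
        total *= (ways[i] > 0) ? ways[i] : 1;
    return total;
}
```

For example, with only BLIS_IC_NT set to 4 and the global query returning -1, the sketch yields 4 threads.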
Harihara Sudhan S
42d631bced Copyright modification
- Added copyright information to modified/newly created
  files that were missing it

Change-Id: If4e73b680246d0363de09587d6dc54bee00ecd71
2022-10-14 12:43:35 +05:30
Edward Smyth
abf848ad12 Code cleanup and warnings fixes
- Removed some additional compiler warnings reported by GCC 12.1
- Fixed a couple of typos in comments
- frame/3/bli_l3_sup.c: routines were returning before final call
  to AOCL_DTL_TRACE_EXIT
- frame/2/gemv/bli_gemv_unf_var1_amd.c: bli_multi_sgemv_4x2 is
  only defined in header file if BLIS_ENABLE_OPENMP is defined

AMD-Internal: [CPUPL-2460]
Change-Id: I2eacd5687f2548d8f40c24bd1b930859eefbbcde
2022-08-29 08:22:30 -04:00
Arnav Sharma
eb83a0fe9d Enabled ZHER Optimized Path
- While calculating the diagonal and corner elements, the combined
operation of calculating the product of x and x hermitian and
simultaneously scaling it with alpha and adding the result to the matrix
was the cause of increased underflow and overflow errors in netlib
tests.
- So the above calculation is now being done in three steps: scaling x
vector with alpha, then calculating its product with x hermitian and
later adding the final result to the matrix.

AMD-Internal: [CPUPL-2213]
Change-Id: I32df572b013bc3189340662dbf17eddcaec9f0f8
2022-08-29 08:09:42 -04:00
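The three-step ordering described in commit eb83a0fe9d can be illustrated with a small reference routine. This is a sketch under assumptions, not the optimized kernel: `zher_lower_three_step`, its signature, and the lower-triangle, column-major layout are hypothetical choices made for the example.

```c
#include <complex.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical reference routine (not a BLIS API):
   A := A + alpha * x * x^H on the lower triangle of a column-major
   m x m matrix, with real alpha. Returns 0 on success. */
int zher_lower_three_step(size_t m, double alpha, const double complex *x,
                          double complex *a, size_t lda)
{
    if (m == 0 || alpha == 0.0) return 0;

    double complex *ax = malloc(m * sizeof *ax);
    if (ax == NULL) return -1;

    /* Step 1: scale the x vector with alpha. */
    for (size_t i = 0; i < m; ++i) ax[i] = alpha * x[i];

    for (size_t j = 0; j < m; ++j) {
        for (size_t i = j; i < m; ++i) {
            /* Step 2: product with x hermitian (conjugate of x[j]). */
            double complex p = ax[i] * conj(x[j]);
            /* Step 3: add the final result to the matrix. */
            a[i + j * lda] += p;
        }
        /* The diagonal of a Hermitian matrix is real; drop any
           rounding residue in the imaginary part. */
        a[j + j * lda] = creal(a[j + j * lda]);
    }

    free(ax);
    return 0;
}
```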
Arnav Sharma
035ed98b51 Temporarily disabling optimized ZHER
- Disabling optimized ZHER pending verification with netlib BLAS test.

AMD-Internal: [CPUPL-2416]
Change-Id: I74c4d16e1c99ddeb1df91130a8e14feafd0952d0
2022-08-19 12:16:46 -04:00
Harihara Sudhan S
d4bb906094 Exception handling in GEMV smart-threading
- Added a condition to check whether n or m is 0 in the smart threading
  logic

AMD-Internal: [CPUPL-2219]
Change-Id: Idd58cd13a11aa5bdb4117b4c9262f38ef3c1afc4
2022-06-29 01:52:18 -04:00
Dipal M Zambare
c87b9aab75 Added support for AVX512 for Windows and AMAXV
- Completed zen4 configuration support on Windows
- Enabled AVX512 kernels for AMAXV
- Added zen4 configuration in amdzen for Windows
- Moved all zen4 kernels inside kernels/zen4 folder

AMD-Internal: [CPUPL-2108]
Change-Id: I9d2336998bbcdb8e2c4ca474977b5939bfa578ba
2022-06-08 11:09:48 +05:30
Arnav Sharma
66b2231b65 Fixed CMake files for HER
- Removed subdirectory addition

Change-Id: I419085db0b9034777409207a7d79b7ffa91eb8f1
2022-06-01 12:25:43 +05:30
Arnav Sharma
e5d5a43eab Optimized ZHER Implementation
- Implemented optimized her framework calls for double-precision complex numbers.
- The zher kernel operates over 4 columns at a time. It first computes the diagonal elements of the matrix, then the 4x4 triangular part, and finally the remaining part as 4x4 tiles of the matrix up to m rows.

AMD-Internal: [CPUPL-2151]

Change-Id: I27430ee33ffb901b3ef4bdd97b034e3f748e9cca
2022-05-25 14:03:01 +05:30
Dipal M Zambare
6e2f536590 Removed Arch specific code from BLIS framework.
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
  one of three ways:

  -- It is updated to work across platforms.
  -- Architecture/feature-specific runtime checks are added.
  -- It is duplicated in AMD-specific files. The build system is updated to
     pick AMD-specific files when the library is built for any of the
     zen architectures.

AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
2022-05-17 20:35:40 +05:30
Harsh Dave
f48ced0811 Optimized dher2 implementation
- Implemented her2 framework calls for transposed and
  non-transposed kernel variants.

- The dher2 kernel operates over 4 columns at a time. It computes
  the 4x4 triangular part of the matrix first; the remainder is
  computed in chunks of 4x4 tiles up to m rows.

- Remainder cases (m < 4) are handled serially.

AMD-Internal: [CPUPL-1968]

Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313
2022-05-17 18:13:07 +05:30
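The loop structure described in commit f48ced0811 (4 columns at a time, a 4x4 triangular block first, then 4x4 tiles down to m rows, with leftovers handled serially) might look like the sketch below. It is written in scalar C under assumptions: `dher2_lower_blocked` and `her2_update` are hypothetical names, the layout is column-major with the lower triangle stored, and the real kernel would use vector instructions inside the tiles.

```c
#include <stddef.h>

/* One element of the rank-2 update: a(i,c) += alpha*(x_i*y_c + y_i*x_c). */
static void her2_update(double *a, size_t lda, size_t i, size_t c,
                        double alpha, const double *x, const double *y)
{
    a[i + c * lda] += alpha * (x[i] * y[c] + y[i] * x[c]);
}

/* Hypothetical blocked reference routine (not the BLIS kernel):
   A := A + alpha*x*y^T + alpha*y*x^T on the lower triangle. */
void dher2_lower_blocked(size_t m, double alpha,
                         const double *x, const double *y,
                         double *a, size_t lda)
{
    size_t j = 0;

    /* Full panels of 4 columns. */
    for (; j + 4 <= m; j += 4) {
        /* 4x4 triangular block on the diagonal. */
        for (size_t c = j; c < j + 4; ++c)
            for (size_t i = c; i < j + 4; ++i)
                her2_update(a, lda, i, c, alpha, x, y);

        /* Rows below the block, in tiles of 4 rows x 4 columns;
           the last tile may have fewer than 4 rows. */
        for (size_t r = j + 4; r < m; r += 4) {
            size_t rend = (r + 4 < m) ? r + 4 : m;
            for (size_t c = j; c < j + 4; ++c)
                for (size_t i = r; i < rend; ++i)
                    her2_update(a, lda, i, c, alpha, x, y);
        }
    }

    /* Fewer than 4 columns left: handle them serially. */
    for (; j < m; ++j)
        for (size_t i = j; i < m; ++i)
            her2_update(a, lda, i, j, alpha, x, y);
}
```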
mkadavil
8670992c3d Default sgemv kernel to be used in single-threaded scenarios.
- sgemv calls a multi-threading-friendly kernel whenever it is compiled
with OpenMP and multi-threading enabled. However, it was observed that
this kernel is not suited for scenarios where sgemv is invoked in a
single-threaded context (e.g. sgemv from ST sgemm fringe kernels and with
matrix blocking). Falling back to the default single-threaded sgemv
kernel resulted in better performance for this scenario.

AMD-Internal: [CPUPL-2136]
Change-Id: Ic023db4d20b2503ea45e56a839aa35de0337d5a6
2022-05-17 18:10:40 +05:30
Nallani Bhaskar
2acb3f6ed0 Tuned aocl dynamic for specific range in dgemm
Description:

1. The decision logic that chooses the optimal number of threads for
   given input dgemm dimensions under the aocl dynamic feature
   was retuned based on the latest code.

2. Updated code in a few files to avoid compilation warnings.

3. Added a min check for nt in the bli_sgemv_var1_smart_threading
   function.

AMD-Internal: [CPUPL-2100]
Change-Id: I2bc70cc87c73505dd5d2bdafb06193f664760e02
2022-05-17 18:10:39 +05:30
S, HariharaSudhan
a8bc55c373 Multithreaded SGEMV var 1 with smart threading
- Implemented an OpenMP-based standalone SGEMV kernel for
  row-major (var 1) for multithreaded scenarios
- Smart threading is enabled when AOCL DYNAMIC is defined
- The number of threads is decided based on the input dims
  using smart threading

AMD-Internal: [CPUPL-1984]
Change-Id: I9b191e965ba7468e95aabcce21b35a533017502e
2022-05-17 18:10:39 +05:30
Dipal M Zambare
06e386f054 Updated Windows build system to pick AMD specific sources.
The framework cleanup was done for linux as part of
f63f78d7 Removed Arch specific code from BLIS framework.

This commit adds changes needed for windows build.

AMD-Internal: [CPUPL-2052]

Change-Id: Ibd503a0adeea66850de156fb95657b124e1c4b9d
2022-05-17 18:09:20 +05:30
Harsh Dave
d50d607995 dher2 API in blis make check fails on non avx2 platform
- dher2 did not have an AVX check for the platform.
  It was calling the AVX kernel regardless of platform
  support, which resulted in a core dump.

- Added an AVX-based platform check in both variants of dher2 to
  fix the issue.

AMD-Internal: [CPUPL-2043]
Change-Id: I1fd1dcc9336980bfb7ffa9376f491f107c889c0b
2022-05-17 18:08:57 +05:30
Dipal M Zambare
f69f59c32c Removed Arch specific code from BLIS framework.
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
  one of three ways:

  -- It is updated to work across platforms.
  -- Architecture/feature-specific runtime checks are added.
  -- It is duplicated in AMD-specific files. The build system is updated to
     pick AMD-specific files when the library is built for any of the
     zen architectures.

AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
2022-05-17 18:08:56 +05:30
Harsh Dave
d116780616 Optimized dher2 implementation
- Implemented her2 framework calls for transposed and
  non-transposed kernel variants.

- The dher2 kernel operates over 4 columns at a time. It computes
  the 4x4 triangular part of the matrix first; the remainder is
  computed in chunks of 4x4 tiles up to m rows.

- Remainder cases (m < 4) are handled serially.

AMD-Internal: [CPUPL-1968]

Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313
2022-05-17 18:05:08 +05:30
Harihara Sudhan S
6696f91f41 Improved DGEMV performance for column-major cases
- Altered the framework to use 2 more fused kernels for
  better problem decomposition
- Increased unroll factor in AXPYF5 and AXPYF8 kernels
  to improve register usage

AMD-Internal: [CPUPL-1970]

Change-Id: I79750235d9554466def5ff93898f832834990343
2022-02-02 23:13:10 -05:00
Dipal M Zambare
f63f78d783 Removed Arch specific code from BLIS framework.
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
  one of three ways:

  -- It is updated to work across platforms.
  -- Architecture/feature-specific runtime checks are added.
  -- It is duplicated in AMD-specific files. The build system is updated to
     pick AMD-specific files when the library is built for any of the
     zen architectures.

AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
2022-01-18 11:51:08 +05:30
Harsh Dave
351269219f Optimized dher2 implementation
- Implemented her2 framework calls for transposed and
  non-transposed kernel variants.

- The dher2 kernel operates over 4 columns at a time. It computes
  the 4x4 triangular part of the matrix first; the remainder is
  computed in chunks of 4x4 tiles up to m rows.

- Remainder cases (m < 4) are handled serially.

AMD-Internal: [CPUPL-1968]

Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313
2022-01-05 05:51:15 -06:00
Harihara Sudhan S
8201bcfdaf Improved DGEMV performance for smaller sizes
- Introduced two new ddotxf functions with lower fuse
  factor.
- Changed the DGEMV framework to use new kernels to
  improve problem decomposition.

Change-Id: I523e158fd33260d06224118fbf74f2314e03a617
2021-12-15 13:17:14 +05:30
Harsh Dave
0f43db8347 Optimized dsymv implementation
- Implemented hemv framework calls for lower and upper
  kernel variants.

- The hemv computation is implemented in two parts.
  One part operates on the triangular part of the matrix and
  the remaining part is computed by the dotxfaxpyf kernel.

- The first part performs dotxf and axpyf operations on the
  triangular part of the matrix in chunks of 8x8.
  Two separate helper functions for doing so are implemented,
  for the lower and upper kernels respectively.

- The second part is the ddotxaxpyf fused kernel, which performs
  dotxf and axpyf operations together on the non-triangular
  part of the matrix in chunks of 4x8.

- The implementation uses cache memory efficiently while computing,
  for optimal performance.

Change-Id: Id603031b4578e87a92c6b77f710c647acc195c8e
2021-12-14 23:34:20 -06:00
Dipal M Zambare
fd8a3aace9 Added support for zen4 architecture
- Added configuration option for zen4 architecture
- Added auto-detection of zen4 architecture
- Added zen4 configuration for all checks related
  to AMD specific optimizations

AMD-Internal: [CPUPL-1937]
Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a
2021-11-23 10:29:15 +05:30
mkurumel
235071690a Fixed dynamic dispatch crash issue on non-zen architecture for gemv and axpy routines.
Summary:
    1. This commit fixes the issue for the gemv and axpy APIs.
    2. The BLIS binary with the dynamic dispatch feature was
    crashing on non-zen CPUs (specifically CPUs without
    AVX2 support).
    3. The crash was caused by unsupported instructions
    in zen optimized kernels. The issue is fixed by calling
    only reference kernels if the architecture detected at
    runtime is not zen, zen2 or zen3.

Change-Id: Icc6f7fdc80bc58fac1a97b1502b6f269e5e89aa4
2021-11-12 08:58:59 +05:30
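The fix described in commit 235071690a is, in essence, a guarded kernel selection. A minimal sketch follows, assuming a hypothetical `arch_t` enum and hypothetical kernel names; the "zen" kernel here is plain C so the example runs on any machine, whereas the real one would contain AVX2 instructions.

```c
#include <stddef.h>

/* Hypothetical architecture IDs as detected at runtime. */
typedef enum { ARCH_GENERIC, ARCH_HASWELL, ARCH_ZEN, ARCH_ZEN2, ARCH_ZEN3 } arch_t;

typedef void (*daxpyv_fn)(size_t n, double alpha, const double *x, double *y);

/* Portable reference kernel: y := y + alpha*x. */
void daxpyv_ref(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) y[i] += alpha * x[i];
}

/* Stand-in for a zen-optimized kernel. */
void daxpyv_zen(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) y[i] += alpha * x[i];
}

/* Only hand out the optimized kernel when the detected architecture
   is zen, zen2 or zen3; everything else gets the reference kernel. */
daxpyv_fn select_daxpyv(arch_t id)
{
    switch (id) {
    case ARCH_ZEN:
    case ARCH_ZEN2:
    case ARCH_ZEN3:
        return daxpyv_zen;
    default:
        return daxpyv_ref;
    }
}
```

Selecting through a function pointer keeps the unsupported instructions out of the execution path on other CPUs, even though both kernels are linked into the same binary.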
Harsh Dave
1b0a7e1c89 Fixed dynamic dispatch crash issue on non-zen architecture.
Removed direct calling of zen kernels in ctrsv, ztrsv interface.

The BLIS binary with the dynamic dispatch feature was crashing on non-zen CPUs
(specifically CPUs without AVX2 support). The crash was caused by unsupported
instructions in zen optimized kernels.

AMD-Internal: [CPUPL-1930]
Change-Id: I21f25a09cd6ffb013d16c66ea10aa9a42f7cad5b
2021-11-12 08:58:58 +05:30
Harsh Dave
8f297f6267 Fixed dynamic dispatch crash issue on non-zen architecture.
Removed direct calling of zen kernels in blis interface for
trsm, scalv, swapv.

The BLIS binary with the dynamic dispatch feature was crashing on non-zen CPUs
(specifically CPUs without AVX2 support). The crash was caused by unsupported
instructions in zen optimized kernels. The issue is fixed by calling only
reference kernels if the architecture detected at runtime is not zen, zen2 or zen3.

AMD-Internal: [CPUPL-1930]
Change-Id: I7944d131d376e2c4e778fe441a8b030674952b81
2021-11-12 08:58:58 +05:30
lcpu
30038af896 Reverted: To fix accuracy issues for complex datatypes
Details:
-- reverted cscalv,zscalv,ctrsm,ztrsm changes to address accuracy issues
observed by libflame and scalapack application testing.
-- AMD-Internal: [CPUPL-1906], [CPUPL-1914]

Change-Id: Ic364eacbdf49493dd3a166a66880c12ee84c2204
2021-11-12 08:58:57 +05:30
Nageshwar Singh
cbd9ea76af Complex single standalone gemv implementation independent of axpyf.
Details:
- The axpyf-based implementation incurs function (axpyf) call overhead.
- The new implementation reduces this function call overhead.
- This implementation uses a kernel of size 8x4.
- This implementation gives better performance for smaller sizes
  compared to the axpyf-based implementation.

AMD-Internal: [CPUPL-1402]
Change-Id: Ic9a5e59363290caf26284548638da9065952fd48
2021-11-12 08:58:55 +05:30
Nageshwar Singh
a3d04a21a0 Complex double standalone gemv implementation independent of axpyf.
Details:
- The axpyf-based implementation incurs function (axpyf) call overhead.
- The new implementation reduces this function call overhead.
- This implementation uses a kernel of size 4x4.
- This implementation gives better performance for smaller sizes
  compared to the axpyf-based implementation.

AMD-Internal: [CPUPL-1402]
Change-Id: I5fa421b8c1d2b44c991c2a05e8f5b01b83eb4b37
2021-11-12 08:58:54 +05:30
mkurumel
ce75a86e9b BLIS: DGEMV performance improvement for incy/incx greater than 1
Details:
- Added packing of Y for incy > 1 cases for dgemv_unf_var2.
- Added packing of X for incx > 1 cases for dgemv_unf_var1.

AMD-Internal: [SWLCSG-735]

Change-Id: Ib395f478ba984a85533e4f79b3521d0b2500c30c
2021-07-21 17:55:05 +05:30
mkurumel
9afbb11b4f DTL Logging bug in GEMV
Details:
  - Fixed incorrect macro used in dgemv and cgemv trace logging exit.

AMD-Internal: [CPUPL-1403]
Change-Id: Icac502d8d4adad112754d9c764a30d3db56a743f
2021-06-02 21:21:00 +05:30
mkurumel
99e3bce065 SGEMV: single-precision axpyf kernel optimization for SGEMV
Details:
  - Implemented saxpyf kernel with fuse factor=6 for sgemv.

AMD-Internal: [CPUPL-1403]
Change-Id: I72fd30c08a789603267cf58910138549d45d231a
2021-06-02 07:55:48 -04:00
Nageshwar Singh
2e1a5bc1dd Optimized double complex axpyf kernel for zgemv
Details:
  - Implemented zaxpyf kernel with fuse factor=4 for zgemv.
  - Modified BLAS interface call for zgemv to reduce framework overhead.
  - Directed gemv to dotv in the case where dimension of y vector is 1.
  - When alpha = 0, gemv becomes scalv of Y with beta. Added code to
    return early after scaling Y vector with beta.

AMD-Internal: [CPUPL-1402]
Change-Id: I2231285fe3060982d4434466346a040b7ab803fc
2021-06-01 18:03:29 +05:30
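The two shortcuts listed in commit 2e1a5bc1dd (dotv when y has a single element, and an early return after scaling when alpha = 0) can be sketched as a small router in front of the general path. This is an assumption-laden illustration: `zgemv_route`, `gemv_path_t`, the no-transpose column-major layout, and unit strides are all hypothetical, and plain loops stand in for the scalv, dotv and axpyf kernels.

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical tags naming the path that was taken. */
typedef enum { PATH_SCALV, PATH_DOTV, PATH_AXPYF } gemv_path_t;

/* y := beta*y + alpha*A*x, A column-major m x n, unit strides. */
gemv_path_t zgemv_route(size_t m, size_t n,
                        double complex alpha, const double complex *a, size_t lda,
                        const double complex *x,
                        double complex beta, double complex *y)
{
    /* alpha == 0: gemv reduces to scaling y by beta; return early. */
    if (alpha == 0.0) {
        for (size_t i = 0; i < m; ++i) y[i] *= beta;
        return PATH_SCALV;
    }

    /* y has one element: a single dot product of A's only row with x. */
    if (m == 1) {
        double complex dot = 0.0;
        for (size_t j = 0; j < n; ++j) dot += a[j * lda] * x[j];
        y[0] = beta * y[0] + alpha * dot;
        return PATH_DOTV;
    }

    /* General case: scale y, then accumulate column by column. */
    for (size_t i = 0; i < m; ++i) y[i] *= beta;
    for (size_t j = 0; j < n; ++j) {
        double complex ax = alpha * x[j];
        for (size_t i = 0; i < m; ++i) y[i] += ax * a[i + j * lda];
    }
    return PATH_AXPYF;
}
```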
Meghana Vankadari
8c9a7c21b4 Optimized axpyf kernel for scomplex datatype
Details:
- Implemented axpyf kernel with fuse factor=4 for scomplex datatype.
- Modified BLAS interface call for cgemv to reduce framework overhead.
- Directed gemv to dotv in the case where dimension of y vector is 1.
- When alpha = 0, gemv becomes scalv of Y with beta. Added code to
  return early after scaling Y vector with beta.

AMD-Internal: [CPUPL-1402]
Change-Id: Ibaab078008d76953332ba4da3515993578c0e586
2021-05-24 14:40:17 +05:30
Meghana Vankadari
33ddf2e448 Fixed blastest failure for haswell configuration
Details:
- Placed optimized versions of the BLAS DGEMM and ZGEMM definitions under
  BLIS_CONFIG_EPYC, as they use gemm small, which is defined only
  for zen family configurations.
- Added code to query and set cntx in the gemv and trsv framework before
  cntx is referred to for any function pointers, to avoid querying
  from a NULL pointer.

AMD-Internal: [CPUPL-1562]
Change-Id: I977d028ec4ddb57dcdc70e443e7708f36c01cca9
2021-05-07 01:49:54 -04:00
nphaniku
b3628cdfd3 AOCL Windows: 3.1 BLIS changes
1. CMake script changes for build with Clang compiler.
2. CMake script changes for build test and testsuite based on the lib type ST/MT
3. CMake script changes for testcpp and blastest
4. Added python scripts to support library build and testsuite build.

AMD-Internal: [CPUPL-1422]

Change-Id: Ie34c3e60e9f8fbf7ea69b47fd1b50ee90099c898
2021-03-08 19:04:17 +05:30
managalv
1ff4981203 Modified blas interface of TRSM to call TRSV whenever m=1 or n=1.
Case 1: Call TRSV when matrices C and B are vectors and A is a matrix,
        i.e. when n = 1 for left side and when m = 1 for right side.
Case 2: Divide B by A when matrices C and B are vectors and A is a scalar (diagonal element),
        i.e. when m = 1 for left side and when n = 1 for right side.
For right side, transpose the complete operation, and change upper to lower and
vice versa when A is being transposed.

Change-Id: Ie87e4a263c287ba554832ccc56b629f982e3ac4c
2021-02-08 19:02:25 +05:30
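For the left-side case, the routing described in commit 1ff4981203 can be sketched as below. The sketch makes several assumptions that are not in the commit: alpha = 1, A lower triangular with a non-unit diagonal, no transpose, column-major storage, and hypothetical names (`dtrsv_lower`, `dtrsm_left_lower`, `trsm_route_t`). The general path simply solves column by column as a stand-in for a real TRSM.

```c
#include <stddef.h>

/* Forward substitution: solve L*x = b in place, L lower triangular m x m. */
void dtrsv_lower(size_t m, const double *a, size_t lda, double *b)
{
    for (size_t j = 0; j < m; ++j) {
        b[j] /= a[j + j * lda];
        for (size_t i = j + 1; i < m; ++i)
            b[i] -= b[j] * a[i + j * lda];
    }
}

/* Hypothetical tags naming the path that was taken. */
typedef enum { ROUTE_TRSV, ROUTE_SCALAR_DIVIDE, ROUTE_GENERAL } trsm_route_t;

/* Solve A*X = B in place for an m x n matrix B (left side). */
trsm_route_t dtrsm_left_lower(size_t m, size_t n,
                              const double *a, size_t lda,
                              double *b, size_t ldb)
{
    /* Case 1: B is a single column, so TRSM reduces to TRSV. */
    if (n == 1) {
        dtrsv_lower(m, a, lda, b);
        return ROUTE_TRSV;
    }
    /* Case 2: A is 1x1, so every element of B is divided by it. */
    if (m == 1) {
        for (size_t j = 0; j < n; ++j) b[j * ldb] /= a[0];
        return ROUTE_SCALAR_DIVIDE;
    }
    /* General case (stand-in): each column of B is solved independently. */
    for (size_t j = 0; j < n; ++j)
        dtrsv_lower(m, a, lda, b + j * ldb);
    return ROUTE_GENERAL;
}
```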
Meghana Vankadari
2e7cf8d82f Added 16x4 AXPYF kernel for zen2 config
Details:
- Added a new AXPYF kernel with fuse_factor = 4 and iter_unroll = 4.
- Modified blas interface of GEMM to call GEMV whenever m=1 or n=1.

Change-Id: I3f5acd37b009f53cf63f462cec79fd3e73676dbc
2021-02-02 21:22:44 +05:30
Dipal M Zambare
0a3d94c9a2 Updated test drivers for dotv, scalv and swapv.
Added traces in the cblas layer for these APIs.
These test drivers didn't have calls for complex data
types; the drivers are updated to support them.

AMD-Internal: [CPUPL-1315]

Change-Id: Ia52ecca68ea17314315d626b57c46a2f5973985b
2020-11-24 10:26:32 +05:30
managalv
fdc0e70cd8 Added debug log and trace for gemv and dotv for blis and cblas interface
AMD-Internal: [CPUPL-1314]

Change-Id: I2708fd9c73419c968c8e02ff11545645dc639052
2020-11-23 19:55:21 +05:30
managalv
aae48c2221 Optimised AXPYF routine for complex float and complex double
Details:
    - Added SIMD code
    - Processing 5 rows at a time in SIMD loop to improve performance

AMD-Internal: [CPUPL-1054]

Change-Id: I2ac93f25895dccfc42e14be0689e6d4e655d6a0a
2020-11-06 18:42:13 +05:30
Meghana Vankadari
0775f09b41 Added debug trace and log support for copy and ger routines
Change-Id: Id7fb64c0a626b2f8f53e89ee7df4391693eb4f4c
2020-11-02 22:56:58 -05:00
Meghana Vankadari
47744663d9 Enabling framework optimizations for zen family architectures.
Details:
- Introduced a new macro 'BLIS_CONFIG_EPYC' to enable blas and cblas
  framework optimizations for zen family configurations.
- The macro needs to be defined in family.h files of respective arch
  configs.
- Moved zen2-specific optimized kernels to zen folder, in order to be
  accessible to all zen family architectures.

Change-Id: I8da2db6b7ab22ef350a01d86c214006e812eb06d
2020-10-07 13:10:50 +05:30