Commit Graph

450 Commits

Author SHA1 Message Date
satish kumar nuggu
1a3428ddfc Parallelization of dtrsm_small routine
1. Parallelized dtrsm_small across the m-dimension or n-dimension based on side (Left/Right).
2. Fine-tuning with AOCL_DYNAMIC to achieve better performance.

    AMD-Internal: [CPUPL-2103]

Change-Id: I6be6a2b579de7df9a3141e0d68bdf3e8a869a005
2022-05-17 18:10:39 +05:30
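The side-based split described in the commit above can be sketched in plain C (an illustrative scalar sketch, not the BLIS kernel; `trsm_left_lower` and `trsv_lower_col` are hypothetical names): for side = Left, each column of B is an independent triangular solve, so threads split the n dimension; for side = Right, the independent units are rows, so threads would split m.

```c
#include <stddef.h>

/* Forward substitution for one column b (lower triangular, non-unit diag,
   column-major): solves L*x = b in place. */
static void trsv_lower_col(size_t m, const double *L, size_t ldl, double *b)
{
    for (size_t i = 0; i < m; ++i) {
        double s = b[i];
        for (size_t k = 0; k < i; ++k)
            s -= L[i + k * ldl] * b[k];
        b[i] = s / L[i + i * ldl];
    }
}

/* Left-side TRSM (A*X = B): the columns of B are independent solves, so
   the j loop is the natural place for a "#pragma omp parallel for". */
void trsm_left_lower(size_t m, size_t n, const double *A, size_t lda,
                     double *B, size_t ldb)
{
    /* #pragma omp parallel for  -- split across n for side = Left */
    for (size_t j = 0; j < n; ++j)
        trsv_lower_col(m, A, lda, B + j * ldb);
}
```

For side = Right the pragma would sit on a loop over m instead, which is the m-vs-n choice the commit describes.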
Harsh Dave
f17d043e1c Implemented optimal dotxv kernel
Details:
- Intrinsic implementation of zdotxv, cdotxv kernel
- Unrolling in multiples of 8; remaining corner
  cases are handled serially for the zdotxv kernel
- Unrolling in multiples of 16; remaining corner
  cases are handled serially for the cdotxv kernel
- Added declaration in zen contexts

AMD-Internal: [CPUPL-2050]
Change-Id: Id58b0dbfdb7a782eb50eecc7142f051b630d9211
2022-05-17 18:10:39 +05:30
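As a scalar illustration of the unrolling scheme above (a sketch only; `dot_unroll8` is a hypothetical name, and the real kernels use AVX2 intrinsics on complex data), the main loop consumes 8 elements per iteration into independent partial sums, and the leftover elements are handled serially, mirroring the corner-case handling in the commit:

```c
#include <stddef.h>

/* Dot product with an 8-wide main loop feeding four independent
   accumulators (as a vector kernel would use separate ymm registers),
   plus a serial remainder loop for n % 8 elements. */
double dot_unroll8(size_t n, const double *x, const double *y)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        s0 += x[i+0]*y[i+0] + x[i+4]*y[i+4];
        s1 += x[i+1]*y[i+1] + x[i+5]*y[i+5];
        s2 += x[i+2]*y[i+2] + x[i+6]*y[i+6];
        s3 += x[i+3]*y[i+3] + x[i+7]*y[i+7];
    }
    double s = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)      /* serial corner case */
        s += x[i] * y[i];
    return s;
}
```

The independent accumulators break the dependency chain between iterations, which is the main source of the speedup.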
Harsh Dave
52e4fd0f11 Performance Improvement for ctrsm small sizes
Details:
- Enable ctrsm small implementation
- Handled overflow and underflow vulnerabilities in
ctrsm small implementations.
- Fixed failures observed in libflame testing.
- For small sizes, ctrsm small implementation is
used for all variants.

Change-Id: I17b862dcb794a5af0ec68f585992131fef57b179
2022-05-17 18:10:39 +05:30
Arnav Sharma
caa5b37005 Optimized S/DCOMPLEX DOTXAXPYF using AVX2 Intrinsics
Details:
- Optimized implementation of the DOTXAXPYF fused kernel for single and double precision complex datatypes using AVX2 intrinsics
- Updated definitions in the zen context

AMD-Internal: [CPUPL-2059]
Change-Id: Ic657e4b66172ae459173626222af2756a4125565
2022-05-17 18:10:39 +05:30
Sireesha Sanga
cc3069fb5e Performance Improvement for ztrsm small sizes
Details:
- Optimization of ztrsm for Non-unit Diag Variants.
- Handled overflow and underflow vulnerabilities in
  ztrsm small implementations.
- Fixed failures observed in libflame testing.
- Fine-tuned ztrsm small implementations for specific
  sizes 64<= m,n <= 256, by keeping the number of
  threads to the optimum value, under AOCL_DYNAMIC flag.
- For small sizes, ztrsm small implementation is
  used for all variants.

AMD-Internal: [SWLCSG-1194]
Change-Id: I066491bb03e5cda390cb699182af4350ae60be2d
2022-05-17 18:10:39 +05:30
satish kumar nuggu
fe7f0a9085 Changes to enable zgemm small from BLAS Layer
1. Removed the small gemm call from the native path to avoid single-threaded
calls as part of multithreaded scenarios.
2. SUP and INDUCED Method path disabled.
3. Added AOCL Dynamic for optimum number of threads to achieve higher
performance.

Change-Id: I3c41641bef4906bdbdb5f05e67c0f61e86025d92
2022-05-17 18:10:38 +05:30
Harsh Dave
015bcb88d4 Fixed ztrsm computational failure
- Fixed memory access for edge cases such that
  all loads are within the memory boundary.

- Corrected ztrsm utility APIs for dcomplex
  multiplication and division.

AMD-Internal: [CPUPL-2093]
Change-Id: Ib2c65e7921f6391b530cd20d6ea6b50f24bd705e
2022-05-17 18:09:22 +05:30
Harsh Dave
0976ed9ce5 Implement zgemm_small kernel
Details:
- Intrinsic implementation of zgemm_small nn kernel.
- Intrinsic implementation of zgemm_small_At kernel.
- Added support for conjugate and hermitian transpose
- Main loop operates in multiples of a 4x3 tile.
- Edge cases are handled separately.

AMD-Internal: [CPUPL-2084]
Change-Id: I512da265e4d4ceec904877544f1d15cddc147a66
2022-05-17 18:09:22 +05:30
mkurumel
ab06f17689 DGEMMT : Tuning SUP threshold to improve ST and MT performance.
Details:
- SUP threshold change for native vs SUP
- Improved ST performance for sizes n < 800
- Introduced PACKB in SUP to improve ST performance for 320 < n < 800
- 16T SUP tuning for n < 1600

        AMD-Internal: [CPUPL-1981]

Change-Id: Ie59afa4d31570eb0edccf760c088deaa2e10cdda
2022-05-17 18:09:22 +05:30
Dipal M Zambare
06e386f054 Updated Windows build system to pick AMD specific sources.
The framework cleanup was done for Linux as part of
f63f78d7 "Removed Arch specific code from BLIS framework".

This commit adds changes needed for windows build.

AMD-Internal: [CPUPL-2052]

Change-Id: Ibd503a0adeea66850de156fb95657b124e1c4b9d
2022-05-17 18:09:20 +05:30
Arnav Sharma
393effbb0c Optimized ZAXPY2V using AVX2 Intrinsics
Details:
- Intrinsic implementation of ZAXPY2V fused kernel for AVX2
- Updated definitions in zen contexts

AMD-Internal: [CPUPL-2023]
Change-Id: I8889ae08c826d26e66ae607c416c4282136937fa
2022-05-17 18:08:57 +05:30
Arnav Sharma
86690f9fd3 Optimized AXPBYV Kernel using AVX2 Intrinsics
Details:
- Intrinsic implementation of axpbyv for AVX2
- Bench written for axpbyv
- Added definitions in zen contexts

AMD-Internal: [CPUPL-1963]

Change-Id: I9bc21a6170f5c944eb6e9e9f0e994b9992f8b539
2022-05-17 18:03:42 +05:30
Dipal M Zambare
31921b9974 Updated windows build system to define BLIS_CONFIG_EPYC flag.
All AMD-specific optimizations in BLIS are enclosed in the BLIS_CONFIG_EPYC
preprocessor flag; this was not defined in CMake, resulting in
overall lower performance.

Updated version number to 3.1.1

Change-Id: I9848b695a599df07da44e77e71a64414b28c75b9
2022-05-17 18:03:09 +05:30
Harihara Sudhan S
6696f91f41 Improved DGEMV performance for column-major cases
- Altered the framework to use 2 more fused kernels for
  better problem decomposition
- Increased unroll factor in AXPYF5 and AXPYF8 kernels
  to improve register usage

AMD-Internal: [CPUPL-1970]

Change-Id: I79750235d9554466def5ff93898f832834990343
2022-02-02 23:13:10 -05:00
Harihara Sudhan
14fb31c0d5 Improved performance of DOTXV kernel for float and double
- Vectorized sections of code that were not vectorized

AMD Internal: [CPUPL-1980]

Change-Id: I08528d054442a5e728f631142f244f1624170136
2022-01-24 23:08:38 -05:00
Dipal M Zambare
f63f78d783 Removed Arch specific code from BLIS framework.
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
  one of three ways:

  -- It is updated to work across platforms.
  -- Architecture/feature-specific runtime checks are added.
  -- It is duplicated in AMD-specific files. The build system is updated
     to pick the AMD-specific files when the library is built for any of
     the zen architectures.

AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
2022-01-18 11:51:08 +05:30
Harsh Dave
79c6aa5643 Implemented optimal S/DCOMPLEX dotxf kernel
- Optimized dotxf implementation for double
  and single precision complex datatypes by
  handling the dot product computation in 2x6
  and 4x6 tiles, processing 6 columns at a time
  and rows in multiples of 2 and 4.

- The dot product computation is arranged in such
  a way that multiple rho vector registers hold the
  temporary results till the end of the loop, with a
  final horizontal addition producing the dot product
  result.

- Corner cases are handled serially.

- Optimal use and reuse of vector registers for
  faster computation.

AMD-Internal: [CPUPL-1975]
Change-Id: I7dd305e73adf54100d54661769c7d5aada9b0098
2022-01-06 02:22:52 -05:00
Harsh Dave
351269219f Optimized dher2 implementation
- Implemented her2 framework calls for transposed and
  non-transposed kernel variants.

- The dher2 kernel operates over 4 columns at a time. It computes
  the 4x4 triangular part of the matrix first, and the remainder
  is computed in chunks of 4x4 tiles up to m rows.

- Remainder cases (m < 4) are handled serially.

AMD-Internal: [CPUPL-1968]

Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313
2022-01-05 05:51:15 -06:00
Harsh Dave
8b5b2707c1 Optimized daxpy2v implementation
- Optimized axpy2v implementation for the double
  datatype by handling rows in multiples of 4
  and storing the final computed result only at
  the end of the computation, avoiding unnecessary
  stores and improving performance.

- Optimal use and reuse of vector registers for
  faster computation.

AMD-Internal: [CPUPL-1973]
Change-Id: I7b8ef94d0f67c1c666fdce26e9b2b7291365d2e9
2022-01-05 06:37:22 -05:00
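The single-store idea in the daxpy2v commit above can be illustrated with a scalar sketch (a hypothetical `daxpy2v` signature, not the BLIS API): both scaled vectors are accumulated into one value per element and written out once.

```c
#include <stddef.h>

/* z := z + alphax*x + alphay*y in a single pass: the main loop handles
   rows in multiples of 4, and each element is accumulated in a local
   (register) value and stored exactly once, avoiding redundant stores. */
void daxpy2v(size_t n, double alphax, double alphay,
             const double *x, const double *y, double *z)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {        /* rows in multiples of 4 */
        for (size_t k = 0; k < 4; ++k) {
            double acc = z[i+k];
            acc += alphax * x[i+k];
            acc += alphay * y[i+k];
            z[i+k] = acc;               /* single store per element */
        }
    }
    for (; i < n; ++i)                  /* serial remainder */
        z[i] += alphax * x[i] + alphay * y[i];
}
```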
Harsh Dave
0e7073a600 Optimized ztrsv implementation
- Implemented alternate method of performing
  multiplication and addition operations on
  double precision complex datatype by separating
  out real and imaginary parts of complex number.

- Optimal use and reuse of vector registers for
  faster computation.

AMD-Internal: [CPUPL-1969]

Change-Id: Ib181f193c05740d5f6b9de3930e1995dea4a50f2
2022-01-05 05:22:51 -05:00
Arnav Sharma
3190e547b0 Optimized AXPBYV Kernel using AVX2 Intrinsics
Details:
- Intrinsic implementation of axpbyv for AVX2
- Bench written for axpbyv
- Added definitions in zen contexts

AMD-Internal: [CPUPL-1963]

Change-Id: I9bc21a6170f5c944eb6e9e9f0e994b9992f8b539
2022-01-05 04:19:11 -05:00
Harihara Sudhan S
b095f1f3a2 Improved SCALV kernel performance.
- Unrolled the loop by a greater factor. Incorporated a switch
  case to decide the unrolling factor according to the input size.

- Removed unused structs.

AMD-Internal: [CPUPL-1974]

Change-Id: Iee9d7defcc8c582ca0420f84c4fb2c202dabe3e7
2021-12-31 01:28:46 -05:00
Harihara Sudhan S
75d5f538d2 Improved AXPYV Kernel performance
- Increased the unroll factor of the loop by 15 in SAXPYV
- Increased the unroll factor of the loop by 12 in DAXPYV
- The above changes were made for better register
  utilization

Change-Id: I69ad1fec2fcf958dbd1bfd71378641274b43a6aa
2021-12-20 13:21:20 +05:30
Harihara Sudhan S
f72758d80a Fixed DDOTXF Bug
- Corrected xv and avec indexing in the vector loop of
  bli_ddotxf_zen_int_2

Change-Id: I4c511236aad09541fe6b1295103a1a8b54ceec39
2021-12-17 15:27:13 +05:30
Nallani Bhaskar
c2df5eac1c Reduced number of threads in dgemm for small dimensions
- The number of threads is reduced to 1 when the dimensions
  are very small.
- Removed an uninitialized xmm compilation warning in trsm small

Change-Id: I23262fb82729af5b98ded5d36f5eed45d5255d5b
2021-12-15 15:11:08 +05:30
Harihara Sudhan S
8201bcfdaf Improved DGEMV performance for smaller sizes
- Introduced two new ddotxf functions with lower fuse
  factor.
- Changed the DGEMV framework to use new kernels to
  improve problem decomposition.

Change-Id: I523e158fd33260d06224118fbf74f2314e03a617
2021-12-15 13:17:14 +05:30
Harsh Dave
0f43db8347 Optimized dsymv implementation
- Implemented hemv framework calls for lower and upper
  kernel variants.

- The hemv computation is implemented in two parts:
  one part operates on the triangular part of the matrix,
  and the remaining part is computed by the dotxfaxpyf kernel.

- The first part performs the dotxf and axpyf operations on the
  triangular part of the matrix in chunks of 8x8.
  Two separate helper functions are implemented for the
  lower and upper kernels respectively.

- The second part is the ddotxaxpyf fused kernel, which performs
  the dotxf and axpyf operations altogether on the non-triangular
  part of the matrix in chunks of 4x8.

- The implementation uses cache memory efficiently
  for optimal performance.

Change-Id: Id603031b4578e87a92c6b77f710c647acc195c8e
2021-12-14 23:34:20 -06:00
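The dotxf-plus-axpyf pairing used in the dsymv commit above can be sketched for a real symmetric matrix (a plain scalar sketch; `dsymv_lower` is a hypothetical name, and the real kernels tile and fuse these loops rather than running them per column):

```c
#include <stddef.h>

/* y := A*x for symmetric A stored in the lower triangle (column-major).
   For each column j, the off-diagonal segment A[j+1..m-1][j] contributes
   twice: a dot along the column into y[j] (the "dotxf" half) and an axpy
   of x[j] into y[j+1..] (the "axpyf" half); fusing the two halves in one
   pass over the column is the idea the commit exploits. */
void dsymv_lower(size_t m, const double *A, size_t lda,
                 const double *x, double *y)
{
    for (size_t i = 0; i < m; ++i) y[i] = 0.0;
    for (size_t j = 0; j < m; ++j) {
        y[j] += A[j + j * lda] * x[j];      /* diagonal element */
        double dot = 0.0;
        for (size_t i = j + 1; i < m; ++i) {
            dot  += A[i + j * lda] * x[i];  /* dotxf half  */
            y[i] += A[i + j * lda] * x[j];  /* axpyf half  */
        }
        y[j] += dot;
    }
}
```

Each column of A is read once but serves both halves, which is why the fused form is cache-friendly.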
Dipal M Zambare
fd8a3aace9 Added support for zen4 architecture
- Added configuration option for zen4 architecture
- Added auto-detection of zen4 architecture
- Added zen4 configuration for all checks related
  to AMD specific optimizations

AMD-Internal: [CPUPL-1937]
Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a
2021-11-23 10:29:15 +05:30
Dipal M Zambare
7a15aa9c87 Fixed xGEMM dynamic dispatch crash on ST library.
Small gemm implementation is called from the gemmnat path;
when the library is built as multi-threaded, small gemm
is completely disabled.

For single-threaded builds, the crash is fixed by disabling
small gemm on the generic architecture.

AMD-Internal: [CPUPL-1930]
Change-Id: If718870d89909cef908a1c23918b7ef6f7d80f7a
2021-11-12 08:58:59 +05:30
lcpu
30038af896 Reverted: To fix accuracy issues for complex datatypes
Details:
-- reverted cscalv, zscalv, ctrsm, ztrsm changes to address accuracy issues
   observed in libflame and scalapack application testing.
-- AMD-Internal: [CPUPL-1906], [CPUPL-1914]

Change-Id: Ic364eacbdf49493dd3a166a66880c12ee84c2204
2021-11-12 08:58:57 +05:30
Harsh Dave
2b6faf21a1 Fixed conjugate transpose issue for zscalv and cscalv
Details:
AMD Internal Id: CPUPL-1702
- While performing the trsm function, A's imaginary
  part needs to be complemented as per the conjugate
  transpose.
- So in the case of conjugate transpose, A's imaginary
  part is negated before doing trsm.

Change-Id: Ic736733a483eeadf6356952b434128c0af988e36
2021-11-12 08:58:55 +05:30
Nageshwar Singh
cbd9ea76af Complex single standalone gemv implementation independent of axpyf.
Details
- The axpyf-based implementation incurs function (axpyf) call overhead.
- The new implementation reduces this function call overhead.
- This implementation uses a kernel of size 8x4.
- This implementation gives better performance for smaller sizes when
  compared to the axpyf-based implementation

AMD-Internal: [CPUPL-1402]
Change-Id: Ic9a5e59363290caf26284548638da9065952fd48
2021-11-12 08:58:55 +05:30
Harsh Dave
590c763e22 Implemented ctrsm small kernels
Details:
-- AMD Internal Id: CPUPL-1702
-- Used 8x3 CGEMM kernel with vector fma by utilizing ymm registers
   efficiently to produce 24 scomplex outputs at a time
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Added ctrsm_small in the ctrsm_ BLAS path for single thread
   when (m,n) < 1000 and for multithread when (m+n) < 320
-- Taken care of --disable_pre_inversion configuration
-- Achieved 13% average performance improvement for sizes less than 1000
-- modularized all 16 combinations of trsm into 4 kernels

Change-Id: I557c5bcd8cb7c034acd99ce0666bc411e9c4fe64
2021-11-12 08:58:55 +05:30
satish kumar nuggu
23278627f4 STRSM small kernel implementation
Details:
-- AMD Internal Id: [CPUPL-1702]
-- Used 16x6 SGEMM kernel with vector fma by utilizing ymm registers
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Taken care of --disable_pre_inversion configuration
-- modularized strsm 16 combinations of trsm into 4 kernels

Change-Id: I30a1551967c36f6bae33be3b7ae5b7fcc7c905ea
2021-11-12 08:58:55 +05:30
Nageshwar Singh
a3d04a21a0 Complex double standalone gemv implementation independent of axpyf.
Details
- The axpyf-based implementation incurs function (axpyf) call overhead.
- The new implementation reduces this function call overhead.
- This implementation uses a kernel of size 4x4.
- This implementation gives better performance for smaller sizes when
  compared to the axpyf-based implementation

AMD-Internal: [CPUPL-1402]
Change-Id: I5fa421b8c1d2b44c991c2a05e8f5b01b83eb4b37
2021-11-12 08:58:54 +05:30
Harsh Dave
a7f600b3a4 Implemented ztrsm small kernels
Details:
-- AMD Internal Id: CPUPL-1702
-- Used 4x3 ZGEMM kernel with vector fma by utilizing ymm registers
   efficiently to produce 12 dcomplex outputs at a time
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Added ztrsm_small in the ztrsm_ BLAS path for single thread
   when (m,n) < 500 and for multithread when (m+n) < 128
-- Taken care of --disable_pre_inversion configuration
-- Achieved 10% average performance improvement for sizes less than 500
-- modularized all 16 combinations of trsm into 4 kernels

Change-Id: I3cb42a1385f6b3b82d6c470912242675789cce75
2021-11-12 08:58:54 +05:30
Nageshwar Singh
a263146a4c Optimized scalv for complex data-types c and z (cscalv and zscalv)
AMD-Internal: [CPUPL-1551]
Change-Id: Ie6855409d89f1edfd2a27f9e5f9efa6cd94bc0c9
2021-11-12 08:58:53 +05:30
satish kumar nuggu
3bafdf3923 DGEMM kernel implementation for case k = 1.
Details :
      - DGEMM kernel implementation for case k = 1, vectorized with 8x6 block implementation (Rank-1 update in DGEMM Optimization).

Change-Id: I7d06378adeb8bcc5b965e2a94314d731629d0b4c
2021-11-12 08:58:53 +05:30
mkurumel
595f7b7edf dnrm2 optimization with dot method
1. Added a new kernel, bli_dnorm2fv_unb_var1, to compute the
   norm with a dot operation.
2. Added vectorization to compute squares in blocks of 32 double
   elements from vector X.
3. Defined a new macro, BLIS_ENABLE_DNRM2_FAST, under the config header
   to compute nrm2 using the new kernel.
4. The dot-based kernel has a possibility of accuracy issues; we can
   switch to the traditional implementation for computing the L2-norm
   of vector X by disabling the macro BLIS_ENABLE_DNRM2_FAST.

    AMD-Internal: [CPUPL-1757]

Change-Id: I1adcaf1b3b4e33837758593c998c25705ff0fe11
2021-11-12 08:58:53 +05:30
mkurumel
9f1ce594a5 BLIS : Compiler warning fixes
Details :
  - Fixed warnings with AOCC and GCC compilers.

AMD-Internal: [CPUPL-1662]

Change-Id: Ia0e298a169d4dd4664b11e03a4e3cd340e9fdfce
2021-11-12 08:58:52 +05:30
Meghana Vankadari
10ca8710f0 Optimized SUP code for GEMMT
Details:
- Eliminated the IR loop in ref_var2m functions.
- Handled the rectangular and triangular portions of C matrix
  separately.
- Added a condition to check and eliminate zero regions inside IC loop.
- Modified the kc selection logic to choose an optimal KC in SUP
- Updated thresholds to choose between SUP and native.

Change-Id: I21908eaa6bc3a8f37bdea29f7bfca7e6fcfee724
2021-11-12 08:58:52 +05:30
Kiran Varaganti
7196b86f05 Removed packm kernels of zen
The intrinsic-optimized packm kernels written for zen are no longer used,
so they are removed. Currently, packm kernels from the haswell configuration
are used for the zen2 and zen3 configs.
2021-11-12 08:58:51 +05:30
Madan mohan Manokar
3dda4ebf22 Induced method turned off, fix for beta=0 & C = NAN
1. Induced method turned off until the path is fully tested for different alpha, beta conditions.
2. Fix for beta = 0 and C = NaN done.

Change-Id: I5a7bd1393ac245c2ebb72f9a634728af4c0d4000
2021-11-12 08:58:50 +05:30
Nallani Bhaskar
15b7fff159 Fixed reading C when beta=0 in few sgemm asm kernels
Description:

1. When beta is zero, we should not be doing any arithmetic operation
   on the C data and should not assume anything about the values of the C matrix

2. This is already taken care of in all sgemmsup kernels except
   the bli_sgemmsup_rv_zen_asm_2x8 and bli_sgemmsup_rv_zen_asm_3x8
   kernels, for the case where beta is zero and C is column-stored. Fixed
   this issue by removing the reads of the C matrix in these kernels.

3. When C has NaN or Inf and we multiply the NaN or Inf by zero
   (beta), the result remains NaN.

Change-Id: I3fb8c0cd37cf1d52a7909f6b402aa9c40c7c3846
2021-11-12 08:58:50 +05:30
Meghana Vankadari
5c770cafee Removed syrk_small code
- The current implementation of syrk_small computes the entire C matrix
  rather than only the triangular part. This implementation is not
  efficient.

AMD-Internal: [CPUPL-1571]
Change-Id: I9a153207471a55e52634429062d18ba1a225fed9
2021-11-12 08:58:49 +05:30
Abhiram S
7787bc79b1 Level1 samaxv: AVX512 implementation
Details:
 1. Unrolled by a factor of 5. This gave around a 1 GFLOPS gain.
 2. Changed CMP to subs and remove-nan. CMP uses many compare
    instructions, which are higher in latency and greater in
    number; replacing them with subs and remove-nan reduced
    this to 3 lighter instructions.
 3. Added a remove-nan function.
 4. Added the AVX512 definition in the skx context.
 5. Disabled code in the AMAXV kernel depending on whether
    the AVX512 flag exists or not

Change-Id: I191725a55bc33edf8d537156292cf997d6a5fe35
2021-09-27 16:10:08 +05:30
Harihara Sudhan S
bdb5e32176 Level 1 Kernel: damaxv AVX512
Details:

- Developed damaxv for AVX512 extension
- Implemented removeNAN function that converts NAN values
  to negative values based on the location
- Usage COMPARE256/COMPARE128 avoided in AVX512
  implementation for better performance
- Unrolled the loop by order of 4.

Change-Id: Icf2a3606cf311ecc646aeb3db0628b293b9a3326
2021-09-02 09:47:08 +05:30
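A scalar sketch of the NaN-safe index-of-max-absolute-value in the commit above (hypothetical `damaxv_nan_safe`; the AVX512 kernel instead maps NaNs to negative values so they lose every comparison, avoiding explicit per-element tests):

```c
#include <stddef.h>
#include <math.h>

/* Returns the index of the element with the largest |x[i]|, ignoring
   NaN elements. Since |x| >= 0, starting the running max at -1.0 lets
   any real element win, mirroring the "convert NaN to a negative value"
   trick described in the commit. */
size_t damaxv_nan_safe(size_t n, const double *x)
{
    size_t best = 0;
    double bestv = -1.0;
    for (size_t i = 0; i < n; ++i) {
        double a = fabs(x[i]);
        if (!isnan(a) && a > bestv) {
            bestv = a;
            best = i;
        }
    }
    return best;
}
```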
satish kumar nuggu
5adb7bf1a4 Combined variants to reduce redundancy in dtrsm small
1. Left Lower non-trans, Left Upper trans
2. Left Upper non-trans, Left Lower trans
3. Right Lower non-trans, Right Upper trans
4. Right Upper non-trans, Right Lower trans

Change-Id: I0b0155d7c3a55ec74d53c8f1f49f1bceb63b15f5
2021-08-26 15:04:14 +05:30
Meghana Vankadari
6bad157754 Added a new field in cntx to store l3 threshold function pointers
Details:
- Adding threshold function pointers to cntx gives flexibility to choose
  different threshold functions for different configurations.
- In case of fat binary where configuration is decided at run-time,
  adding threshold functions under a macro enables these functions for
  all the configs under a family. This can be avoided by adding function
  pointers to cntx which can be queried from cntx during run-time
  based on the config chosen.

Change-Id: Iaf7e69e45ae5bb60e4d0f75c7542a91e1609773f
2021-08-16 00:10:01 -04:00
Kiran Varaganti
adfd569591 DGEMM Optimizations
Improved DGEMM performance for smaller sizes. AOCL DYNAMIC is incorporated
at the BLAS interface to enable calling bli_dgemm_small when the implied
optimum number of threads is 1 (for n and k < 10).
Improved the smart threading logic for dgemm.
Added additional conditions at the BLAS interface to invoke bli_dgemm_small.
Removed the N > 3 condition from bli_dgemm_small.

Change-Id: Id751528dfe9de37800b02ffaf765b6c82487093e
2021-08-10 12:34:43 -04:00