2602 Commits

Author SHA1 Message Date
Dipal M Zambare
f69f59c32c Removed Arch specific code from BLIS framework.
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
  one of the three ways

  -- It is updated to work across platforms.
  -- Added in architecture/feature specific runtime checks.
  -- Duplicated in AMD specific files. Build system is updated to
     pick AMD specific files when library is built for any of the
     zen architectures

AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
2022-05-17 18:08:56 +05:30
Harsh Dave
d116780616 Optimized dher2 implementation
- Implemented her2 framework calls for transposed and
  non-transposed kernel variants.

- The dher2 kernel operates on 4 columns at a time. It computes the
  4x4 triangular part of the matrix first; the remainder is
  computed in 4x4 tiles up to m rows.

- Remainder cases (m < 4) are handled serially.

AMD-Internal: [CPUPL-1968]

Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313
2022-05-17 18:05:08 +05:30
Arnav Sharma
86690f9fd3 Optimized AXPBYV Kernel using AVX2 Intrinsics
Details:
- Intrinsic implementation of axpbyv for AVX2
- Bench written for axpbyv
- Added definitions in zen contexts

AMD-Internal: [CPUPL-1963]

Change-Id: I9bc21a6170f5c944eb6e9e9f0e994b9992f8b539
2022-05-17 18:03:42 +05:30
HariharaSudhan S
d687bd36ea Merge "Improved AXPYV Kernel performance" into amd-staging-genoa-4.0
2022-05-17 18:03:42 +05:30
Chandrashekara K R
ec6e4162bc Updated windows build system.
We were using the add_compile_options(-Xclang -fopenmp) statement to set
omp library compiler flags for MSVC through CMake. A performance
regression was observed with the compiler version used by
MSVC (clang 10), so the statement is removed from the Windows build
system. The compiler version (clang 13) and compiler options are now
configured manually in the MSVC GUI to improve performance on the MATLAB bench.

Change-Id: I37d778abdceb7c1fae9b1caaeea8adb114677dd2
2022-05-17 18:03:10 +05:30
Dipal M Zambare
31921b9974 Updated windows build system to define BLIS_CONFIG_EPYC flag.
All AMD-specific optimizations in BLIS are enclosed in the BLIS_CONFIG_EPYC
preprocessor macro. It was not defined in the CMake build, which resulted in
lower overall performance.

Updated version number to 3.1.1

Change-Id: I9848b695a599df07da44e77e71a64414b28c75b9
2022-05-17 18:03:09 +05:30
Meghana Vankadari
c11fd5a8f6 Added functionality support for dzgemm
AMD-Internal: [SWLCSG-1012]
Change-Id: I2eac3131d2dcd534f84491289cbd3fe7fb7de3da
2022-05-17 18:01:55 +05:30
Dipal M. Zambare
b90420627a Revert "Enabled AVX-512 kernels for Zen4 config"
This reverts commit 62c96a4190.
Was committed without review.
2022-04-21 06:46:00 +00:00
Dipal M. Zambare
0adb525f5b Revert "Enabled AVX-512 kernels for Zen4 config"
This reverts commit f816cf059f.
Was committed without review.
2022-04-21 06:45:38 +00:00
Dipal M. Zambare
f816cf059f Enabled AVX-512 kernels for Zen4 config
Enabled AVX-512 skylake kernels in zen4 configuration.
  AVX-512 kernels are added for float and double types.

AMD-Internal: [CPUPL-2108]
Change-Id: Idfe3f64a037db019cbdf43318954db52ad241a51
2022-04-21 06:38:24 +00:00
Dipal M. Zambare
62c96a4190 Enabled AVX-512 kernels for Zen4 config
Enabled AVX-512 skylake kernels in zen4 configuration.
  AVX-512 kernels are added for float and double types.

AMD-Internal: [CPUPL-2108]
2022-04-21 06:28:29 +00:00
Field G. Van Zee
a4abb10831 Added a new 'gemmlike' sandbox.
Details:
- Added a new sandbox called 'gemmlike', which implements sequential and
  multithreaded gemm in the style of gemmsup but also unconditionally
  employs packing. The purpose of this sandbox is to
  (1) avoid select abstractions, such as objects and control trees, in
      order to allow readers to better understand how a real-world
      implementation of high-performance gemm can be constructed;
  (2) provide a starting point for expert users who wish to build
      something that is gemm-like without "reinventing the wheel."
  Thanks to Jeff Diamond, Tze Meng Low, Nicholai Tukanov, and Devangi
  Parikh for requesting and inspiring this work.
- The functions defined in this sandbox currently use the "bls_" prefix
  instead of "bli_" in order to avoid any symbol collisions in the main
  library.
- The sandbox contains two variants, each of which implements gemm via a
  block-panel algorithm. The only difference between the two is that
  variant 1 calls the microkernel directly while variant 2 calls the
  microkernel indirectly, via a function wrapper, which allows the edge
  case handling to be abstracted away from the classic five loops.
- This sandbox implementation utilizes the conventional gemm microkernel
  (not the skinny/unpacked gemmsup kernels).
- Updated some typos in the comments of a few files in the main
  framework.

Change-Id: Ifc3c50e9fd0072aada38eace50c57552c88cc6cf
2022-04-01 13:55:30 +05:30
Field G. Van Zee
7a0ba4194f Added support for addons.
Details:
- Implemented a new feature called addons, which are similar to
  sandboxes except that there is no requirement to define gemm or any
  other particular operation.
- Updated configure to accept --enable-addon=<name> or -a <name> syntax
  for requesting an addon be included within a BLIS build. configure now
  outputs the list of enabled addons into config.mk. It also outputs the
  corresponding #include directives for the addons' headers to a new
  companion to the bli_config.h header file named bli_addon.h. Because
  addons may wish to make use of existing BLIS types within their own
  definitions, the addons' headers must be included sometime after that
  of bli_config.h (which currently is #included before bli_type_defs.h).
  This is why the #include directives needed to go into a new top-level
  header file rather than the existing bli_config.h file.
- Added a markdown document, docs/Addons.md, to explain addons, how to
  build with them, and what assumptions their authors should keep in
  mind as they create them.
- Added a gemmlike-like implementation of sandwich gemm called 'gemmd'
  as an addon in addon/gemmd. The code uses a 'bao_' prefix for local
  functions, including the user-level object and typed APIs.
- Updated .gitignore so that git ignores bli_addon.h files.

Change-Id: Ie7efdea366481ce25075cb2459bdbcfd52309717
2022-03-31 12:03:27 +05:30
Meghana Vankadari
0792eb8608 Fixed a bug in deriving dimensions from objects in gemm_front files
Change-Id: I1f796c3a7ce6efacb6ef64651a7818b7ee38c6bb
2022-02-16 23:24:14 -05:00
Harihara Sudhan S
6696f91f41 Improved DGEMV performance for column-major cases
- Altered the framework to use 2 more fused kernels for
  better problem decomposition
- Increased unroll factor in AXPYF5 and AXPYF8 kernels
  to improve register usage

AMD-Internal: [CPUPL-1970]

Change-Id: I79750235d9554466def5ff93898f832834990343
2022-02-02 23:13:10 -05:00
Dipal M Zambare
6d1edca727 Optimized CPU feature determination.
We added a new API to check whether the CPU architecture supports
AVX instructions. This API executed the CPUID instruction
every time it was invoked. However, since this information does
not change at runtime, it is sufficient to determine it once
and use the cached result for subsequent calls. This optimization
is needed to improve performance for small matrix-vector
operations.

AMD-Internal: [CPUPL-2009]
Change-Id: If6697e1da6dd6b7f28fbfed45215ea3fdd569c5f
2022-02-01 11:15:55 +05:30
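The caching idea described in this commit can be sketched as follows. This is a minimal illustration, not the BLIS code: `probe_avx_support` and `cpu_has_avx_cached` are hypothetical names, and the probe is a stand-in that counts its calls instead of executing CPUID. The sketch is not thread-safe; a real library would need one-time initialization that is safe under concurrency.

```c
#include <assert.h>
#include <stdbool.h>

/* Counts how often the (expensive) probe runs. */
static int probe_calls = 0;

/* Hypothetical stand-in for the CPUID query. A real probe would execute
   CPUID and inspect the feature bits; here it only returns a placeholder. */
static bool probe_avx_support(void)
{
    probe_calls++;
    return true;
}

/* Returns the cached answer and probes the hardware only on the first call. */
bool cpu_has_avx_cached(void)
{
    static int cached = -1; /* -1 = not yet determined, 0 = no, 1 = yes */

    if (cached < 0)
        cached = probe_avx_support() ? 1 : 0;

    assert(cached == 0 || cached == 1);
    return cached == 1;
}
```

Calling `cpu_has_avx_cached()` repeatedly leaves `probe_calls` at 1, which is the saving the commit message describes for small matrix-vector operations.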
Harihara Sudhan
14fb31c0d5 Improved performance of DOTXV kernel for float and double
- Vectorized sections of code that were not vectorized

AMD-Internal: [CPUPL-1980]

Change-Id: I08528d054442a5e728f631142f244f1624170136
2022-01-24 23:08:38 -05:00
Saitharun
e783ea10db Enable wrapper code by default
Details: Changes made for the 4.0 branch to enable wrapper code by default;
the ENABLE_API_WRAPPER macro is also removed.

Change-Id: I5c9ede7ae959d811bc009073a266e66cbf07ef1a
2022-01-19 11:38:45 +05:30
Dipal M Zambare
f63f78d783 Removed Arch specific code from BLIS framework.
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
  one of the three ways

  -- It is updated to work across platforms.
  -- Added in architecture/feature specific runtime checks.
  -- Duplicated in AMD specific files. Build system is updated to
     pick AMD specific files when library is built for any of the
     zen architectures

AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
2022-01-18 11:51:08 +05:30
Harsh Dave
79c6aa5643 Implemented optimal S/DCOMPLEX dotxf kernel
- Optimized the dotxf implementation for double-
  and single-precision complex datatypes by
  handling the dot product computation in 2x6
  and 4x6 tiles, processing 6 columns at a time and rows
  in multiples of 2 and 4.

- The dot product computation is arranged in such a way
  that multiple rho vector registers hold the
  temporary results until the end of the loop; a final
  horizontal addition then produces the dot product
  result.

- Corner cases are handled serially.

- Optimal use and reuse of vector registers for
  faster computation.

AMD-Internal: [CPUPL-1975]
Change-Id: I7dd305e73adf54100d54661769c7d5aada9b0098
2022-01-06 02:22:52 -05:00
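The accumulate-then-reduce pattern from this commit can be sketched in scalar C. This is a simplified illustration under stated assumptions: it uses real doubles rather than complex types, a two-element array per column stands in for a vector register, and the name `dotxf6_sketch` is hypothetical. It computes only the six column dot products and omits the alpha/beta scaling of a full dotxf.

```c
#include <assert.h>
#include <stddef.h>

enum { NCOLS = 6, LANES = 2 };

/* Computes y[j] = sum_i A[i,j] * x[i] for six columns of a column-major
   matrix A (leading dimension lda). */
void dotxf6_sketch(size_t m, const double *a, size_t lda,
                   const double *x, double *y)
{
    /* One "rho register" per column, LANES partial sums each. */
    double rho[NCOLS][LANES] = {{0.0}};
    size_t i = 0;

    assert(lda >= m);

    /* Main loop: LANES rows at a time across all six columns. The partial
       sums stay in rho until the loop ends. */
    for (; i + LANES <= m; i += LANES)
        for (size_t j = 0; j < NCOLS; j++)
            for (size_t l = 0; l < LANES; l++)
                rho[j][l] += a[(i + l) + j * lda] * x[i + l];

    /* Horizontal addition: collapse each rho register to a single value. */
    for (size_t j = 0; j < NCOLS; j++) {
        double sum = 0.0;
        for (size_t l = 0; l < LANES; l++)
            sum += rho[j][l];
        y[j] = sum;
    }

    /* Corner case: leftover rows are handled serially. */
    for (; i < m; i++)
        for (size_t j = 0; j < NCOLS; j++)
            y[j] += a[i + j * lda] * x[i];
}
```

Keeping the partial sums in separate accumulators avoids a reduction on every iteration; the single horizontal add at the end is the only cross-lane step.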
mkadavil
457c33a601 Eliminating barriers in SUP path when matrices are not packed.
- The current gemm SUP path uses bli_thrinfo_sup_grow and bli_thread_range_sub
to generate per-thread data ranges at each loop of the gemm algorithm.
bli_thrinfo_sup_grow uses multiple barriers for cross-thread
synchronization. These barriers are necessary when
either the A or B matrix is packed, for centralized pack buffer
allocation/deallocation (bli_thread_am_ochief thread).

- However, when both A and B matrices are unpacked, these
barriers add overhead for smaller dimensions. The creation
of unnecessary communicators is now avoided, and consequently the
requirement for barriers is eliminated, when packing is disabled for
both input matrices in the SUP path.

Change-Id: Ic373dfd2d6b08b8f577dc98399a83bb08f794afa
2022-01-06 01:56:43 -05:00
Harsh Dave
351269219f Optimized dher2 implementation
- Implemented her2 framework calls for transposed and
  non-transposed kernel variants.

- The dher2 kernel operates on 4 columns at a time. It computes the
  4x4 triangular part of the matrix first; the remainder is
  computed in 4x4 tiles up to m rows.

- Remainder cases (m < 4) are handled serially.

AMD-Internal: [CPUPL-1968]

Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313
2022-01-05 05:51:15 -06:00
Harsh Dave
8b5b2707c1 Optimized daxpy2v implementation
- Optimized the axpy2v implementation for the double
  datatype by handling rows in multiples of 4
  and storing the final computed result at the
  end of the computation, avoiding unnecessary
  stores to improve performance.

- Optimal use and reuse of vector registers for
  faster computation.

AMD-Internal: [CPUPL-1973]
Change-Id: I7b8ef94d0f67c1c666fdce26e9b2b7291365d2e9
2022-01-05 06:37:22 -05:00
Harsh Dave
0e7073a600 Optimized ztrsv implementation
- Implemented an alternate method of performing
  multiplication and addition operations on the
  double-precision complex datatype by separating
  out the real and imaginary parts of complex numbers.

- Optimal use and reuse of vector registers for
  faster computation.

AMD-Internal: [CPUPL-1969]

Change-Id: Ib181f193c05740d5f6b9de3930e1995dea4a50f2
2022-01-05 05:22:51 -05:00
Arnav Sharma
3190e547b0 Optimized AXPBYV Kernel using AVX2 Intrinsics
Details:
- Intrinsic implementation of axpbyv for AVX2
- Bench written for axpbyv
- Added definitions in zen contexts

AMD-Internal: [CPUPL-1963]

Change-Id: I9bc21a6170f5c944eb6e9e9f0e994b9992f8b539
2022-01-05 04:19:11 -05:00
Harihara Sudhan S
b095f1f3a2 Improved SCALV kernel performance.
- Unrolled the loop by a greater factor. Incorporated switch
  case to decide unrolling factor according to the input size.

- Removed unused structs.

AMD-Internal: [CPUPL-1974]

Change-Id: Iee9d7defcc8c582ca0420f84c4fb2c202dabe3e7
2021-12-31 01:28:46 -05:00
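The size-dependent unrolling mentioned in this commit might look like the sketch below. The thresholds (8 and 4), the use of plain scalar code, and the name `scalv_unrolled_sketch` are assumptions for illustration; the commit does not state the actual unroll factors or how the kernel selects them.

```c
#include <assert.h>
#include <stddef.h>

/* Scales x[0..n-1] by alpha in place. The input size picks the widest
   unrolled loop to start with; narrower loops finish the remainder. */
void scalv_unrolled_sketch(size_t n, double alpha, double *x)
{
    size_t i = 0;
    int unroll = (n >= 8) ? 8 : (n >= 4) ? 4 : 1;

    assert(x != NULL || n == 0);

    switch (unroll) {
    case 8:
        for (; i + 8 <= n; i += 8) {
            x[i]     *= alpha; x[i + 1] *= alpha;
            x[i + 2] *= alpha; x[i + 3] *= alpha;
            x[i + 4] *= alpha; x[i + 5] *= alpha;
            x[i + 6] *= alpha; x[i + 7] *= alpha;
        }
        /* fall through: finish with the narrower loops */
    case 4:
        for (; i + 4 <= n; i += 4) {
            x[i]     *= alpha; x[i + 1] *= alpha;
            x[i + 2] *= alpha; x[i + 3] *= alpha;
        }
        /* fall through */
    default:
        for (; i < n; i++)
            x[i] *= alpha;
        break;
    }
}
```

For small inputs the switch skips the wide loops entirely, so short vectors do not pay for loop setup they cannot use.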
HariharaSudhan S
4af7870ab7 Merge "Improved AXPYV Kernel performance" into amd-staging-genoa-4.0
2021-12-24 00:05:13 -05:00
Chandrashekara K R
b3553c08fa AOCL-Windows: Updating the blis windows build system.
1. Removed the libomp.lib hardcoded from cmake scripts and made it user configurable. By default libomp.lib is used as an omp library.
2. Added the STATIC_LIBRARY_OPTIONS property in set_target_properties cmake command to link omp library to build static-mt blis library.
3. Updated the blis_ref_kernel_mirror.py to give support for zen4 architecture.

AMD-Internal: CPUPL-1630
Change-Id: I54b04cde2fa6a1ddc4b4303f1da808c1efe0484a
2021-12-22 14:47:15 +05:30
Harihara Sudhan S
75d5f538d2 Improved AXPYV Kernel performance
- Increased the unroll factor of the loop by 15 in SAXPYV
- Increased the unroll factor of the loop by 12 in DAXPYV
- The above changes were made for better register
  utilization

Change-Id: I69ad1fec2fcf958dbd1bfd71378641274b43a6aa
2021-12-20 13:21:20 +05:30
Harihara Sudhan S
f72758d80a Fixed DDOTXF Bug
- Corrected xv and avec indexing in vector loop of
  bli_ddotxf_zen_int_2

Change-Id: I4c511236aad09541fe6b1295103a1a8b54ceec39
2021-12-17 15:27:13 +05:30
Nallani Bhaskar
c2df5eac1c Reduced number of threads in dgemm for small dimensions
- The number of threads is reduced to 1 when the dimensions
  are very small.
- Removed an uninitialized xmm compilation warning in trsm small

Change-Id: I23262fb82729af5b98ded5d36f5eed45d5255d5b
2021-12-15 15:11:08 +05:30
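A thread-count cap of the kind described here could be expressed as below. The cutoff value and the names `SMALL_DIM_CUTOFF` and `choose_gemm_threads` are hypothetical; the commit does not state which dimensions count as small or how the check is wired into dgemm.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cutoff; the commit does not give the actual dimensions. */
#define SMALL_DIM_CUTOFF 32

/* Returns the number of threads to use for an m x n x k gemm, falling back
   to a single thread when every dimension is below the cutoff. */
int choose_gemm_threads(size_t m, size_t n, size_t k, int requested)
{
    assert(requested >= 1);

    if (m < SMALL_DIM_CUTOFF && n < SMALL_DIM_CUTOFF && k < SMALL_DIM_CUTOFF)
        return 1;

    return requested;
}
```

For tiny problems the cost of waking and synchronizing threads can exceed the work itself, which is the usual motivation for a cap like this.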
Harihara Sudhan S
8201bcfdaf Improved DGEMV performance for smaller sizes
- Introduced two new ddotxf functions with lower fuse
  factor.
- Changed the DGEMV framework to use new kernels to
  improve problem decomposition.

Change-Id: I523e158fd33260d06224118fbf74f2314e03a617
2021-12-15 13:17:14 +05:30
Harsh Dave
0f43db8347 Optimized dsymv implementation
- Implemented hemv framework calls for lower and upper
  kernel variants.

- The hemv computation is implemented in two parts.
  One part operates on the triangular part of the matrix, and
  the remaining part is computed by the dotxaxpyf kernel.

- The first part performs dotxf and axpyf operations on the
  triangular part of the matrix in chunks of 8x8.
  Two separate helper functions are implemented for this,
  for the lower and upper kernels respectively.

- The second part is the ddotxaxpyf fused kernel, which performs
  dotxf and axpyf operations together on the non-triangular
  part of the matrix in chunks of 4x8.

- The implementation uses cache memory efficiently during computation
  for optimal performance.

Change-Id: Id603031b4578e87a92c6b77f710c647acc195c8e
2021-12-14 23:34:20 -06:00
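The reason a fused dot-plus-axpy pass suits symv can be seen in a scalar sketch. This is not the blocked 8x8/4x8 kernel from the commit: it is an unblocked illustration that reads only the lower triangle of a column-major matrix, and the name `symv_lower_sketch` is hypothetical. Each stored off-diagonal element is loaded once and used twice, once as an axpy contribution and once as a dot contribution.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y += A * x for a symmetric n x n matrix A, reading only the
   lower triangle (column-major, leading dimension lda). */
void symv_lower_sketch(size_t n, const double *a, size_t lda,
                       const double *x, double *y)
{
    assert(lda >= n);

    for (size_t j = 0; j < n; j++) {
        double rho = 0.0;

        /* Diagonal element contributes once. */
        y[j] += a[j + j * lda] * x[j];

        for (size_t i = j + 1; i < n; i++) {
            double aij = a[i + j * lda];
            y[i] += aij * x[j]; /* axpy: column j below the diagonal */
            rho  += aij * x[i]; /* dot: same values act as row j above it */
        }

        y[j] += rho;
    }
}
```

Because the upper triangle is never read, entries stored there can hold anything; only the lower triangle has to be valid.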
Dipal M Zambare
fd8a3aace9 Added support for zen4 architecture
- Added configuration option for zen4 architecture
- Added auto-detection of zen4 architecture
- Added zen4 configuration for all checks related
  to AMD specific optimizations

AMD-Internal: [CPUPL-1937]
Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a
2021-11-23 10:29:15 +05:30
Dipal M Zambare
7a15aa9c87 Fixed xGEMM dynamic dispatch crash on ST library.
The small gemm implementation is called from the gemmnat path.
When the library is built as multi-threaded, small gemm
is completely disabled.

For single-threaded builds, the crash is fixed by disabling
small gemm on the generic architecture.

AMD-Internal: [CPUPL-1930]
Change-Id: If718870d89909cef908a1c23918b7ef6f7d80f7a
2021-11-12 08:58:59 +05:30
mkurumel
235071690a Fixed dynamic dispatch crash issue on non-zen architecture for gemv and axpy routines.
Summary:
    1. This commit fixes the issue for the gemv and axpy APIs.
    2. The BLIS binary with the dynamic dispatch feature was
    crashing on non-zen CPUs (specifically CPUs without
    AVX2 support).
    3. The crash was caused by unsupported instructions
    in zen-optimized kernels. The issue is fixed by calling
    only reference kernels if the architecture detected at
    runtime is not zen, zen2 or zen3.

Change-Id: Icc6f7fdc80bc58fac1a97b1502b6f269e5e89aa4
2021-11-12 08:58:59 +05:30
Harsh Dave
1b0a7e1c89 Fixed dynamic dispatch crash issue on non-zen architecture.
Removed direct calling of zen kernels in ctrsv, ztrsv interface.

The BLIS binary with dynamic dispatch feature was crashing on non-zen CPUs
(specifically CPUs without AVX2 support). The crash was caused by un-supported
instructions in zen optimized kernels.

AMD-Internal: [CPUPL-1930]
Change-Id: I21f25a09cd6ffb013d16c66ea10aa9a42f7cad5b
2021-11-12 08:58:58 +05:30
Harihara Sudhan S
4de6f2ca6d Fixed dynamic dispatch crash issue on non-zen architecture.
Direct calls to zen kernels replaced by architecture
dependent calls for dotv and amaxv kernels. For non-zen
architecture, generic function is called using the BLIS
interface. For zen architecture, direct calls to zen
optimized kernels are made.

Change-Id: I49fc9abc813434d6a49a23f49e47d16e95b7899f
2021-11-12 08:58:58 +05:30
Harsh Dave
8f297f6267 Fixed dynamic dispatch crash issue on non-zen architecture.
Removed direct calling of zen kernels in blis interface for
trsm, scalv, swapv.

The BLIS binary with dynamic dispatch feature was crashing on non-zen CPUs
(specifically CPUs without AVX2 support). The crash was caused by un-supported
instructions in zen optimized kernels. The issue is fixed by calling only
reference kernels if the architecture detected at runtime is not zen, zen2 or zen3.

AMD-Internal: [CPUPL-1930]
Change-Id: I7944d131d376e2c4e778fe441a8b030674952b81
2021-11-12 08:58:58 +05:30
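The fix described in these dynamic dispatch commits amounts to selecting a kernel from the architecture detected at runtime instead of calling the zen kernel directly. The sketch below illustrates that pattern for scalv; all names (`arch_id`, `select_scalv_kernel`, `scalv_dispatch`, and the two kernels) are hypothetical, and the "zen" kernel is a plain C stand-in for code that would use AVX2 instructions in practice.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ARCH_GENERIC, ARCH_ZEN, ARCH_ZEN2, ARCH_ZEN3 } arch_id;
typedef void (*scalv_fn)(size_t n, double alpha, double *x);

/* Portable reference kernel: safe on any CPU. */
static void scalv_reference(size_t n, double alpha, double *x)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= alpha;
}

/* Stand-in for a zen-optimized kernel. A real one would use AVX2
   instructions and must never run on a CPU that lacks them. */
static void scalv_zen_standin(size_t n, double alpha, double *x)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= alpha;
}

/* Picks the optimized kernel only for zen, zen2 and zen3. */
scalv_fn select_scalv_kernel(arch_id arch)
{
    switch (arch) {
    case ARCH_ZEN:
    case ARCH_ZEN2:
    case ARCH_ZEN3:
        return scalv_zen_standin;
    default:
        return scalv_reference;
    }
}

/* Interface-level entry point: no direct call to the zen kernel. */
void scalv_dispatch(arch_id arch, size_t n, double alpha, double *x)
{
    scalv_fn kernel = select_scalv_kernel(arch);
    assert(kernel != NULL);
    kernel(n, alpha, x);
}
```

Routing every call through the selector keeps unsupported instructions out of the execution path on CPUs without AVX2, at the cost of one indirect call.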
Dipal M Zambare
61b5b9c4d0 Fixed dynamic dispatch crash issue on non-zen architecture.
Removed direct calling of zen kernels in cblas source itself.
Similar optimizations are done by the function directly invoked from
Cblas layer.

The BLIS binary with dynamic dispatch feature was crashing on non-zen CPUs
(specifically CPUs without AVX2 support). The crash was caused by un-supported
instructions in zen optimized kernels. The issue is fixed by calling only
reference kernels if the architecture detected at runtime is not zen, zen2 or zen3.

AMD-Internal: [CPUPL-1930]
Change-Id: I9178b7a98f2563dee2817064f37fcbb84073eeea
2021-11-12 08:58:58 +05:30
Dipal M Zambare
ddbdfd0ba4 Fixed dynamic dispatch crash issue on non-zen architecture.
This commit fixes the issue for the gemm and copy APIs.

The BLIS binary with dynamic dispatch feature was crashing on non-zen
CPUs (specifically CPUs without AVX2 support).
The crash was caused by un-supported instructions in zen optimized kernels.
The issue is fixed by calling only reference kernels if the architecture detected at
runtime is not zen, zen2 or zen3.

AMD-Internal: [CPUPL-1930]

Change-Id: Ief57cd457b87542aa1a7bad64dc36c01f0d1a366
2021-11-12 08:58:57 +05:30
mkadavil
d683c224e8 Workaround for perf regression observed for sgemm
Details:
- Perf regression is observed for certain m,n,k inputs where (m,n,k > 512)
  and (m > 4 * n) in BLIS 3.1. The root cause was traced to commit
  11dfc176a3 where BLIS_THREAD_RATIO_M was
  updated from 2 to 1. This change was not part of BLIS 3.0.6 and hence
  resulted in the new perf drop in 3.1.
- This workaround updates the m dimension (doubles it) that is passed as
  argument to bli_rntm_set_ways_for_op which is used to determine the ic,jc
  work split in the threads. The BLIS_THREAD_RATIO_M is not updated (to 2)
  and rather the effect is induced using the doubled m dimension.

AMD-Internal: [CPUPL-1909]
Change-Id: I3b6ec4d4a22154289cb56d8f7db4cb60e5f34afe
2021-11-12 08:58:57 +05:30
lcpu
30038af896 Reverted: To fix accuracy issues for complex datatypes
Details:
-- reverted cscalv,zscalv,ctrsm,ztrsm changes to address accuracy issues
observed by libflame and scalapack application testing.
-- AMD-Internal: [CPUPL-1906], [CPUPL-1914]

Change-Id: Ic364eacbdf49493dd3a166a66880c12ee84c2204
2021-11-12 08:58:57 +05:30
nphaniku
4af525a313 AOCL Windows BLIS : Windows build for dynamic dispatch library
Change-Id: Ie05eafbeacbd5589b514d9353517330515104939
2021-11-12 08:58:57 +05:30
mkurumel
366ab66134 DNRM2 : Disable dnrm2 Fast math implementation.
Details:
- Accuracy failures were observed when fast math and ILP64 are enabled.
- Disabling the feature with the macro BLIS_ENABLE_FAST_MATH.

AMD-Internal: [CPUPL-1907]

Change-Id: I92c661647fb8cc5f1d0af8f6c4eae0fac1df5f16
2021-11-12 08:58:56 +05:30
Dipal M Zambare
3364c0e4eb Binary and dynamic dispatch configuration name change
-- Reverted changes made to include lp/ilp info in binary name
   This reverts commit c5e6f885f0.

-- Included BLAS int size in 'make showconfig'

-- Renamed amdepyc configuration to amdzen

Change-Id: Ie87ec1c03e105f606aef1eac397ba0d8338906a6
2021-11-12 08:58:56 +05:30
Dipal M Zambare
ae844f475c Fixed build issue in DTL when only traces are enabled.
AMD-Internal: [CPUPL-1691]
Change-Id: Idc273666054529db5a2fb96a7d7ebbf7a3f5b008
2021-11-12 08:58:56 +05:30
Harsh Dave
53e1d0539f Fixed conjugate transpose kernel issue
Details:
AMD Internal Id: CPUPL-1702
- When A is of dimension 1x1, for both the left- and
right-hand-side cases, A's only element is conjugate
transposed by negating its imaginary component.

Change-Id: I696ae982d9d60e0e702edaba98acbe9a5b0cd44c
2021-11-12 08:58:56 +05:30
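For a 1x1 matrix, the conjugate transpose reduces to flipping the sign of the imaginary part of the single element. The sketch below shows that step, assuming the operation is a triangular solve as in the neighbouring trsm commit. The type `dcomplex_sketch` and the function `trsm_1x1_sketch` are hypothetical, and the division is the naive textbook formula rather than a numerically hardened one.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { double real; double imag; } dcomplex_sketch;

/* Solves op(a) * x = b for a 1x1 system, where op(a) is a itself or,
   when conj_trans is set, its conjugate transpose. */
dcomplex_sketch trsm_1x1_sketch(dcomplex_sketch a, dcomplex_sketch b,
                                bool conj_trans)
{
    dcomplex_sketch x;
    double denom;

    /* Conjugate transpose of a 1x1 matrix: negate the imaginary part. */
    if (conj_trans)
        a.imag = -a.imag;

    denom = a.real * a.real + a.imag * a.imag;
    assert(denom != 0.0);

    /* x = b / a, written out in real arithmetic. */
    x.real = (b.real * a.real + b.imag * a.imag) / denom;
    x.imag = (b.imag * a.real - b.real * a.imag) / denom;
    return x;
}
```

With a = i and b = 1, the plain solve gives x = -i, while the conjugate-transposed solve gives x = i, so omitting the sign flip changes the result.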
Harsh Dave
2b6faf21a1 Fixed conjugate transpose issue for zscalv and cscalv
Details:
AMD Internal Id: CPUPL-1702
- While performing the trsm function, A's imaginary
part needs to be complemented as per conjugate
transpose.
- So in the case of conjugate transpose, A's imaginary
part is negated before doing trsm.

Change-Id: Ic736733a483eeadf6356952b434128c0af988e36
2021-11-12 08:58:55 +05:30
Nageshwar Singh
cbd9ea76af Complex single standalone gemv implementation independent of axpyf.
Details:
- The axpyf-based implementation incurs function (axpyf) call overhead.
- The new implementation reduces function call overhead.
- This implementation uses a kernel of size 8x4.
- This implementation gives better performance for smaller sizes when
  compared to the axpyf-based implementation.

AMD-Internal: [CPUPL-1402]
Change-Id: Ic9a5e59363290caf26284548638da9065952fd48
2021-11-12 08:58:55 +05:30