102 Commits

Author SHA1 Message Date
Smyth, Edward
509aa07785 Standardize Zen kernel names
Naming of Zen kernels and associated files was inconsistent with BLIS
conventions for other sub-configurations and between different Zen
generations. Other anomalies existed, e.g. dgemmsup 24x column
preferred kernels names with _rv_ instead of _cv_. This patch renames
kernels and file names to address these issues.

AMD-Internal: [CPUPL-6579]
2025-08-19 18:19:51 +01:00
Vlachopoulou, Eleni
1f8a7d2218 Renaming CMAKE_SOURCE_DIR to PROJECT_SOURCE_DIR so that BLIS can be built properly via FetchContent() (#65) 2025-08-07 15:51:59 +01:00
Smyth, Edward
969ceb7413 Finer control of code path options (#67)
Add macros to allow specific code options to be enabled or disabled,
controlled by options to configure and cmake. This expands on the
existing GEMM and/or TRSM functionality to enable/disable SUP handling
and replaces the hard coded #define in include files to enable small matrix
paths.

All options are enabled by default for all BLIS sub-configs but many of them
are currently only implemented in AMD specific framework code variants.

AMD-Internal: [CPUPL-6906]
---------

Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
2025-07-08 10:59:23 +01:00
Smyth, Edward
8a8d3f43d5 Improve consistency of optimized BLAS3 code (#64)
* Improve consistency of optimized BLAS3 code

Tidy AMD optimized GEMM and TRSM framework code to reduce
differences between different data type variants:
- Improve consistency of code indentation and white space
- Added some missing AOCL_DTL calls
- Removed some dead code
- Consistent naming of variables for function return status
- GEMM: More consistent early return when k=1
- Correct data type of literal values used for single precision data

In kernels/zen/3/bli_gemm_small.c and bli_family_*.h files:
- Set default values for thresholds if not set in the relevant
  bli_family_*.h file
- Remove unused definitions and commented out code

AMD-Internal: [CPUPL-6579]
2025-07-01 09:29:52 +01:00
Smyth, Edward
14e46ad83b Improvements to x86 make_defs files (#29)
Various changes to simplify and improve x86 related make_defs files:
- Make better use of common definitions in config/zen/amd_config.mk
  from config/zen*/make_defs.mk files
- Similarly for config/zen/amd_config.make from the
  config/zen*/make_defs.cmake files
- Pass cc_major, cc_minor and cc_revision definitions from configure
  to generated config.mk file, and use these instead of defining
  GCC_VERSION in config/zen*/make_defs.mk files
- Add znver3 support for LLVM 13 in config/zen3/make_defs.{mk,cmake}
- Add znver5 support for LLVM 19 in config/zen5/make_defs.{mk,cmake}
- Improve readability of haswell, intel64, skx and x86_64 files
- Correct and tidy some comments

AMD-Internal: [CPUPL-6579]
2025-06-03 16:20:43 +01:00
Edward Smyth
82bdf7c8c7 Code cleanup: Copyright notices
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
  zen, skx and a couple of other kernels to cover all
  contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
  statements.

AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
2024-08-05 15:35:08 -04:00
Mangala V
0a4f9d5ac1 Removed -fno-tree-loop-vectorize from kernel flags
- This change in made in MAKE build system.
- Removed -fno-tree-loop-vectorize from global kernel flags,
  instead added it to lpgemm specific kernels only.
- If this flag is not used , then gcc tries to auto
  vectorize the code which results in usages of
  vector registers, if the auto vectorized function
  is using intrinsic then the total numbers of vector
  registers used by intrinsic and auto vectorized
  code becomes more than the registers
  available in machine which causes read and writes
  to stack, which is causing regression in lpgemm.
- If this flag is enabled globally, then the files which
  do not use any intrinsic code do not get auto
  vectorized.
- To get optimal performance for both blis and lpgemm,
  this flag is enabled for lpgemm kernels only.

Previous commit (75df1ef218) contains
similar changes on cmake build system

AMD-Internal: [CPUPL-5544]

Change-Id: I796e89f3fb2116d64c3a78af2069de20ce92d506
2024-08-02 09:40:06 -04:00
Shubham Sharma
75df1ef218 Removed -fno-tree-loop-vectorize from kernel flags
- This change in made in CMAKE build system only.
- Removed -fno-tree-loop-vectorize from global kernel flags,
  instead added it to lpgemm specific kernels only.
- If this flag is not used , then gcc tries to auto
  vectorize the code which results in usages of
  vector registers, if the auto vectorized function
  is using intrinsics then the total numbers of vector
  registers used by intrinsic and auto vectorized
  code becomes more than the registers
  available in machine which causes read and writes
  to stack, which is causing regression in lpgemm.
- If this flag is enabled globally, then the files which
  do not use any intrinsic code do not get auto
  vectorized.
- To get optimal performance for both blis and lpgemm,
  this flag is enabled for lpgemm kernels only.

Change-Id: I14e5c18cd53b058bfc9d764a8eaf825b4d0a81c4
2024-07-19 00:49:52 -04:00
Vignesh Balasubramanian
6165001658 Bugfix and optimizations for ?AXPBYV API
- Updated the existing code-path for ?AXPBYV to
  reroute the inputs to the appropriate L1 kernel,
  based on the alpha and beta value. This is done
  in order to utilize sensible optimizations with
  regards to the compute and memory operations.

- Updated the typed API interface for ?AXPBYV to include
  an early exit condition(when n is 0, or when alpha is
  0 and beta is 1). Further updated this layer to query
  the right kernel from context, based on the input values
  of alpha and beta.

- Added the necessary L1 vector kernels(i.e, ?SETV, ?ADDV,
  ?SCALV, ?SCAL2V and ?COPYV) to be used as part of special
  case handling in ?AXPBYV.

- Moved the early return with negative increments from ?SCAL2V
  kernels to its typed API interface.

- Updated the zen, zen2 and zen3 context to include function
  pointers for all these vector kernels.

- Updated the existing ?AXPBYV vector kernels to handle only
  the required computation. Additional cleanup was done to
  these kernels.

- Added accuracy and memory tests for AVX2 kernels of ?SETV
  ?COPYV, ?ADDV, ?SCALV, ?SCAL2V, ?AXPYV and ?AXPBYV APIs

- Updated the existing thresholds in ?AXPBYV tests for complex
  types. This is due to the fact that every complex multiplication
  involves two mul ops and one add op. Further added test-cases
  for API level accuracy check, that includes special cases of
  alpha and beta.

- Decomposed the reference call to ?AXPBYV with several other
  L1 BLAS APIs(in case of the reference not supporting its own
  ?AXPBYV API). The decomposition is done to match the exact
  operations that is done in BLIS based on alpha and/or beta
  values. This ensures that we test for our own compliance.

AMD-Internal: [CPUPL-4861]
Change-Id: Ia6d48f12f059f52b31c0bef6c75f47fd364952c6
2024-06-20 16:22:07 +05:30
Vignesh Balasubramanian
53cb83d0cc AVX512 optimizations for ZGEMV API with no-transpose case
- Implemented AVX512 kernels for handling the calls to ZGEMV
  with no-transpose to A matrix.

- This includes the ZAXPYF, ZAXPYV and ZSETV kernels.
  The set of ZAXPYF kernels include those with fuse-factor 8
  (main kernel), 4 and 2(fringe kernels).

- Updated the bli_zgemv_unf_var2( ... ) function to set
  the function pointers to these kernels, based on the
  configuration. Further added the call to ZSETV at this
  layer in case beta is 0.

AMD-Internal: [CPUPL-4974]
Change-Id: Iee4b724719e49023138bb16479765be44d677cd9
2024-05-03 07:04:47 +00:00
jagar
bbfa4a88ec CMake: Updated compiler ID in cmake files
Updated compiler id in cmake related files from
CMAKE_CXX_COMPILER_ID to CMAKE_C_COMPILER_ID

AMD-Internal: [CPUPL-2748]
Change-Id: Ib0e2a2e3ec8fafeb423fe56b9842a93db0115371
2024-03-14 07:24:04 -04:00
Edward Smyth
ed5010d65b Code cleanup: AMD copyright notice
Standardize format of AMD copyright notice.

AMD-Internal: [CPUPL-3519]
Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0
2023-11-23 08:54:31 -05:00
Edward Smyth
f471615c66 Code cleanup: No newline at end of file
Some text files were missing a newline at the end of the file.
One has been added.

AMD-Internal: [CPUPL-3519]
Change-Id: I4b00876b1230b036723d6b56755c6ca844a7ffce
2023-11-22 17:11:10 -05:00
Eleni Vlachopoulou
75a4d2f72f CMake: Adding new portable CMake system.
- A completely new system, made to be closer to Make system.

AMD-Internal: [CPUPL-2748]
Change-Id: I83232786406cdc4f0a0950fb6ac8f551e5968529
2023-11-09 15:49:45 +05:30
Edward Smyth
24e4d58f92 Tidy zen bli_cntx_init and bli_family files
Tidy formatting of config/*zen*/bli_cntx_init_zen*.c and
config/*zen*/bli_family_*.c files to make them more
consistent with each other and improve readability.

AMD-Internal: [CPUPL-3519]
Change-Id: I32c2bf6dc8365264a748a401cf3c83be4976f73b
2023-10-04 05:14:39 -04:00
Edward Smyth
6911d2dd21 zen config make_defs.mk improvements
Improvements to zen make_defs.mk files:
* Add -znver4 flag for GCC 13 and later.
* Add AVX512 flags or -znver4 as appropriate for upstream LLVM
  in config/zen4/make_defs.mk to enable BLIS to be build with
  LLVM rather than AOCC.
* zen make_defs.mk files were inheriting settings from the previous
  one (zen->zen2->zen3->zen4), when they should be independent
  of each other. Correct by including config/zen/amd_config.mk
  in all zen make_defs.mk files to reinitialize the compiler
  flags.
* Update zen2 and zen3 make_defs.mk for recent AOCC compiler
  releases, rather than rely on LLVM settings.
* Remove -mfpmath=sse flag in config/zen4/make_defs.mk as
  this is already specified in amd_config.mk (and should
  be the default setting anyway).
* Tidy files to simplify nested if structures and be more
  consistent with one another.

AMD-Internal: [CPUPL-3399]
Change-Id: Ice64ccedd90c2660fdee8b485348a6b405cfc5ac
2023-05-22 07:51:41 -04:00
mkadavil
3d74b62e60 Lpgemm threading and micro-kernel optimizations.
-Certain sections of the f32 avx512 micro-kernel were observed to
slow down when more post-ops are added. Analysis of the binary
pointed to false dependencies in instructions being introduced in
the presence of the extra post-ops. Addition of vzeroupper at the
beginning of ir loop in f32 micro-kernel fixes this issue.
-F32 gemm (lpgemm) thread factorization tuning for zen4/zen3 added.
-Alpha scaling (multiply instruction) by default was resulting in
performance regression when k dimension is small and alpha=1 in s32
micro-kernels. Alpha scaling is now only done when alpha != 1.
-s16 micro-kernel performance was observed to be regressing when
compiled with gcc for zen3 and older architecture supporting avx2.
This issue is not observed when compiling using gcc with avx512
support enabled. The root cause was identified to be the -fgcse
optimization flag in O2 when applied with avx2 support. This flag is
now disabled for zen3 and older zen configs.

AMD-Internal: [CPUPL-3067]
Change-Id: I5aef9013432c037eb2edf28fdc89470a2eddad1c
2023-03-16 11:44:51 +05:30
Harihara Sudhan S
299bed3fa8 Vectorized SCAL2V for double complex
- Added SCAL2V kernel that uses AVX2 and SSE instructions for
  vectorization.
- The routine returns early when the vector dimension is zero
  or incx <= 0 or incy <= 0.
- The kernel takes one among the two available paths based on
  conjugation requirement of X vector.
- VZEROUPPER is added before transitioning from AVX2 to SSE.
- Added function pointer to ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts.
- Added the new SCAL2V file from the CMAKE list.

AMD-Internal: [CPUPL-2773]
Change-Id: I2debbfab31d41347786c3a1bae5723d092c202e9
2023-02-28 06:59:23 -05:00
Harihara Sudhan S
535b49f150 Vectorized COPYV for double complex
- Added ZCOPYV kernel that uses AVX2 and SSE instructions for
  vectorization.
- The routine returns early when the vector dimension is zero.
- The kernel takes one among the two available paths based on
  conjugation requirement of X vector.
- VZEROUPPER is added before transitioning from AVX2 to SEE.
- Added function pointer to ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts.

AMD-Internal: [CPUPL-2773]
Change-Id: Ibd8a2de42060716395ef698d753c8462654cc0f0
2023-02-16 09:32:10 -05:00
Harihara Sudhan S
8d7593a940 AVX2 vectorized SCALV for double complex
- Added ZSCALV that uses AVX2 and SSE instructions for vectorization.
- Return early when the vector dimension is zero. When alpha is 1 there
  is no need to perform computation hence return early.
- When alpha is zero expert interface of ZSETV is invoked. In this case,
  all the elements of the input vector are set 0.
- Invocation of expert interface means that NULL pointer can be passed
  to the function in place of context. Expert interface of ZSETV will
  query the context and get the approriate function pointer.
- Added BLAS interface for ZSCALV. The architecture ID is used to decide
  the function that is to be invoked.
- Created a new macro INSERT_GENTFUNCSCAL_BLAS_C to instantiate SCALV
  BLAS macro interface only for single complex type and single complex,
  float mixed type

AMD-Internal: [CPUPL-2773]
Change-Id: I0d6995bce883c0ebdc5da0046608fc59d03f6050
2023-02-05 09:36:12 -05:00
Vignesh Balasubramanian
dbd0c069d4 Implemented 3x4 RD kernels for ZGEMM in SUP path
-Implemented (r)ow preferential (d)ot product milli-kernels
 (m and n variants) for dcomplex datatype along SUP path.

-These computational kernels extend the support for handling RRC and
 CRC storage schemes along the SUP path. In case of BLAS api call,
 it corresponds to the input cases with transa equal to T and
 transb equal to N.

-In case of the B matrix being packed(conditionally), the inputs are
 redirected to the existing (r)ow preferential (v)ector load optimized
 kernels due to better performance.

-Added macro for vhsubpd assembly instruction, to support the arithmetic
 for complex datatype in its interleaved storage.

AMD-Internal: [CPUPL-2593]
Change-Id: If90834e55e9e31aa87d3d5b711efad9ef2458da8
2022-12-22 18:31:46 +05:30
Eleni Vlachopoulou
a5891f7ead Adding AVX2 support for DNRM2
- For the cases where AVX2 is available, an optimized function is called,
based on Blue's algorithm. The fallback method based on sumsqv is used
otherwise.

- Scaling is used to avoid overflow and underflow.

- Works correctly for negative increments.

AMD-Internal: [CPUPL-2551]
Change-Id: I5d8976b29b5af463a8981061b2be907ea647123c
2022-09-20 06:05:01 -04:00
Dipal M Zambare
d3b503bbf2 Code cleanup and warnings fixes
- Removed all compiler warnings as reported by GCC 11 and AOCC 3.2
- Removed unused files
- Removed commented and disabled code (#if 0, #if 1) from some
  files

AMD-Internal: [CPUPL-2460]
Change-Id: Ifc976f6fe585b09e2e387b6793961ad6ef05bb4a
2022-08-29 15:15:40 +05:30
Harsh Dave
f17d043e1c Implemented optimal dotxv kernel
Details:
- Intrinsic implementation of zdotxv, cdotxv kernel
- Unrolling in multiple of 8, remaining corner
  cases are handled serially for zdotxv kernel
- Unrolling in multiple of 16, remainig corner
  cases are handled serially for cdotxv kernel
- Added declaration in zen contexts

AMD-Internal: [CPUPL-2050]
Change-Id: Id58b0dbfdb7a782eb50eecc7142f051b630d9211
2022-05-17 18:10:39 +05:30
Arnav Sharma
caa5b37005 Optimized S/DCOMPLEX DOTXAXPYF using AVX2 Intrinsics
Details:
- Optimized implementation of DOTXAXPYF fused kernel for single and double precision complex datatype using AVX2 Intrinsics
- Updated definitions zen context

AMD-Internal: [CPUPL-2059]
Change-Id: Ic657e4b66172ae459173626222af2756a4125565
2022-05-17 18:10:39 +05:30
Arnav Sharma
393effbb0c Optimized ZAXPY2V using AVX2 Intrinsics
Details:
- Intrinsic implementation of ZAXPY2V fused kernel for AVX2
- Updated definitions in zen contexts

AMD-Internal: [CPUPL-2023]
Change-Id: I8889ae08c826d26e66ae607c416c4282136937fa
2022-05-17 18:08:57 +05:30
Dipal M Zambare
f63f78d783 Removed Arch specific code from BLIS framework.
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
  one of the three ways

  -- It is updated to work across platforms.
  -- Added in architecture/feature specific runtime checks.
  -- Duplicated in AMD specific files. Build system is updated to
      pick AMD specific files when library is built for any of the
     zen architecture

AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
2022-01-18 11:51:08 +05:30
Harsh Dave
79c6aa5643 Implemented optimal S/DCOMPLEX dotxf kernel
- Optimized dotxf implementation for double
  and single precision complex datatype by
  handling dot product computation in tile 2x6
  and 4x6 handling 6 columns at a time, and rows
  in multiple of 2 and 4.

- Dot product computation is arranged such a way
  that multiple rho vector register will hold the
  temporary result till the end of loop and finally
  does horizontal addition to get final dot product
  result.

- Corner cases are handled serially.

- Optimal and reuse of vector registers for
  faster computation.

AMD-Internal: [CPUPL-1975]
Change-Id: I7dd305e73adf54100d54661769c7d5aada9b0098
2022-01-06 02:22:52 -05:00
Harsh Dave
8b5b2707c1 Optimized daxpy2v implementation
- Optimized axpy2v implementation for double
  datatype by handling rows in mulitple of 4
  and store the final computed result at the
  end of computation, preventing unnecessary
  stores for improving the performance.

- Optimal and reuse of vector registers for
  faster computation.

AMD-Internal: [CPUPL-1973]
Change-Id: I7b8ef94d0f67c1c666fdce26e9b2b7291365d2e9
2022-01-05 06:37:22 -05:00
Arnav Sharma
3190e547b0 Optimized AXPBYV Kernel using AVX2 Intrinsics
Details:
- Intrinsic implementation of axpbyv for AVX2
- Bench written for axpbyv
- Added definitions in zen contexts

AMD-Internal: [CPUPL-1963]

Change-Id: I9bc21a6170f5c944eb6e9e9f0e994b9992f8b539
2022-01-05 04:19:11 -05:00
lcpu
30038af896 Reverted: To fix accuracy issues for complex datatypes
Details:
-- reverted cscalv,zscalv,ctrsm,ztrsm changes to address accuracy issues
observed by libflame and scalapack application testing.
-- AMD-Internal: [CPUPL-1906], [CPUPL-1914]

Change-Id: Ic364eacbdf49493dd3a166a66880c12ee84c2204
2021-11-12 08:58:57 +05:30
mkurumel
366ab66134 DNRM2 : Disable dnrm2 Fast math implementation.
Details :
          - Accuracy failures observed when  fast math and ILP64 are enabled.
          - Disabling the feature with macro BLIS_ENABLE_FAST_MATH .

        AMD-Internal: [CPUPL-1907]

Change-Id: I92c661647fb8cc5f1d0af8f6c4eae0fac1df5f16
2021-11-12 08:58:56 +05:30
Nageshwar Singh
a263146a4c Optimized scalv for complex data-types c and z (cscalv and zscalv)
AMD-Internal: [CPUPL-1551]
Change-Id: Ie6855409d89f1edfd2a27f9e5f9efa6cd94bc0c9
2021-11-12 08:58:53 +05:30
mkurumel
595f7b7edf dnrm2 optimization with dot method
1.  Added new kernel bli_dnorm2fv_unb_var1 kernel to compute
	norm with dot operation.
    2.  Added vectorization to compute square of 32 double element
	block size from vector X.
    3.  Defined a new Macro BLIS_ENABLE_DNRM2_FAST under config header
	to compute nrm2 using new kernel.
    4.  Dot kernel definitions and implementation have a possibility for
	accuracy issues .we can switch to traditional implementation by
	disabling the MACRO BLIS_ENABLE_DNRM2_FAST to compute L2-norm
	for Vector X .

    AMD-Internal: [CPUPL-1757]

Change-Id: I1adcaf1b3b4e33837758593c998c25705ff0fe11
2021-11-12 08:58:53 +05:30
Dipal M Zambare
41a86c2463 Enabled znver3 flag for gcc version above 11
-- Added -march=znver3 flag if the library is built for zen3
   configuration with gcc compiler version 11 or above.

-- Replaced hardcoded compiler names 'gcc' and 'clang' with
   variable $CC so that options are chosen as per the compiler
   specified at configure time (instead of compiler in path).

AMD-Internal: [CPUPL-1823]
Change-Id: I2659349c998201ebd4480735c544e48a5ed76bb4
2021-11-12 08:58:48 +05:30
Dipal M Zambare
bcd9591b3f Added support for amdepyc fat binary
-- Created new configuration amdepyc to include fat binary which
     includes zen, zen2, zen3 and generic architecture for fallback.

  -- Updated amdepyc family makefiles to include macros needed
     in amdepyc family binary. This file must include all macros,
     compiler options to be used for non architecture specific code.

  -- Added 'workaround' to exclude ZEN family specific code in some of
     the framework files. There are still lot of places were ZEN family
     specific code is added in framework files. They will be addressed
     with proper design later.
       - Moved definition of BLIS_CONFIG_EPYC from header files to
         makefile so that it is enabled only for framework and kernels

  -- Removed redundant flag AOCL_BLIS_ZEN, used BLIS_CONFIG_EPYC
     wherever it was needed.

  -- Removed un-used, obsolete macros, some of them may be needed for
     debugging which can be added in the individual workspaces.
       - BLIS_DEFAULT_MR_THREAD_MAX
       - BLIS_DEFAULT_NR_THREAD_MAX
       - BLIS_ENABLE_ZEN_BLOCK_SIZES
       - BLIS_SMALL_MATRIX_THRES_TRSM
       - BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES
       - BLIS_ENABLE_SUP_MR_EXT
       - BLIS_ENABLE_SUP_NR_EXT

  -- Corrected implementation of exiting amd64_legacy configuration.

AMD-Internal: [CPUPL-1626, CPUPL-1628]
Change-Id: I46b0ab3ea3ac7d9ff737fef66c462e85601ee29c
2021-08-26 15:13:49 +05:30
Meghana Vankadari
6bad157754 Added a new field in cntx to store l3 threshold function pointers
Details:
- Adding threshold function pointers to cntx gives flexibility to choose
  different threshold functions for different configurations.
- In case of fat binary where configuration is decided at run-time,
  adding threshold functions under a macro enables these functions for
  all the configs under a family. This can be avoided by adding function
  pointers to cntx which can be queried from cntx during run-time
  based on the config chosen.

Change-Id: Iaf7e69e45ae5bb60e4d0f75c7542a91e1609773f
2021-08-16 00:10:01 -04:00
Nallani Bhaskar
e328bdc549 Added prefetch in left cases of dtrsm small
Details:

1. Added prefetching next micro-panel of A and B in dgemm block,
   which are helping in reducing load latency and improved performance.

2. Removed unnecessary unrolls in gemm loops and moved 8x6,6x8 core
   dgemm into macros and made it more modular

3. Packing and diagonal packing in main dgemm loops are modularized.
   Fringe cases are yet to modularize.

4. Updated dtrsm small thresholds for single and multi thread cases

5. Updated div/scale based on disable/enable of trsm pre-inversion

6. Code clean up

Change-Id: I5de16805ff050a31d2b424bb3f6ae0a4019332df
2021-06-15 23:15:22 +05:30
Dipal M Zambare
21130ebece Added configure option for AOCL Dynamic feature.
- AOCL Dynamic feature is added in BLIS which determines optimal
    number of threads for the current problem size.
  - This feature can be enabled/disabled by modifying the source
    code
  - This change adds support to enable/disable this feature during
    configuration time by adding a new option in configure script

AOCL-Internal : [CPUPL-1565]

Change-Id: I590693f793cabc44d27a7f815adc41631dd01bbe
2021-05-12 00:41:13 -04:00
Meghana Vankadari
eea347b02e Added dynamic threading support for GEMM SUP code path
Details:
- Introduced new feature called AOCL_DYNAMIC.
- When this macro is defined, Optimum number of threads to solve DGEMM
  is estimated based on the dimensions (M,N,K).
- Range of optimum number of threads will be [1, num_threads],
  where "num_threads" is number of threads set by the application.
- Num_threads is derived from either environment variable "OMP_NUM_THREADS
  or BLIS_NUM_THREADS' or bli_set_num_threads() API.
- Only local copy of rntm is modified by AOCL_DYNAMIC feature.
  global_rntm data structure remains unchanged in order to keep track of
  original number of threads set by application.
- Optimum number of threads calculation is done only for SUP.
- Since 'native' code path handles larger problem sizes, we use max
  number of threads recommended by the application.

AMD-Internal: [CPUPL-1376]
Change-Id: I665ce14543d6719857d70325c4a9f959c08e66e3
2021-05-07 09:52:51 +05:30
Madan mohan Manokar
c1fa9abe32 zgemm native path tuning
1. NC and MC values are tuned for both single-instance and multi-instance run.
2. zen2 and zen3 configs updated.
3. SUP path disabled for zgemm, since tuned native path performed better.
To be re-enabled after setting right threshold for SUP selection.

AMD-Internal: [CPUPL-1442]
Change-Id: I0eb86926744d2983530a443e20e3e4e2ee3f3239
2021-05-06 01:15:35 -04:00
lcpu
7401effc03 BLIS:merge:
Merge conflicts araised has been fixed while downstreaming BLIS code from master to milan-3.1 branch

Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond)

Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations.

Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations.

Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger to conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu)

Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu)

Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs.

Minor code consolidation in all level-3 _front() functions.

Reorganized Windows cpp branch of bli_pthreads.c.

Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS.

Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion.

Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv.

AMD-internal-[CPUPL-1523]

Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd
2021-04-27 11:09:48 +05:30
nphaniku
b3628cdfd3 AOCL Windows: 3.1 BLIS changes
1. CMake script changes for build with Clang compiler.
 2. CMake script changes for build test and testsuite based on the lib type ST/MT
 3. CMake script changes for testcpp and blastest
 4. Added python scripts to support library build and testsuite build.

AMD Internal : [CPUPL-1422]

Change-Id: Ie34c3e60e9f8fbf7ea69b47fd1b50ee90099c898
2021-03-08 19:04:17 +05:30
Field G. Van Zee
f5871c7e06 Added complex asm packm kernels for 'haswell' set.
Details:
- Implemented assembly-based packm kernels for single- and double-
  precision complex domain (c and z) and housed them in the 'haswell'
  kernel set. This means c3xk, c8xk, z3xk, and z4xk are now all
  optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
  and zen2 subconfigs.
- Minor modifications to the corresponding s and d packm kernels that
  were introduced in 426ad67.
- Thanks to AMD, who originally contributed the double-precision real
  packm kernels (d6xk and d8xk), upon which these complex kernels are
  partially based.
2021-02-28 17:03:57 -06:00
Field G. Van Zee
426ad679f5 Added assembly packm kernels for 'haswell' set.
Details:
- Implemented assembly-based packm kernels for single- and double-
  precision real domain (s and d) and housed them in the 'haswell'
  kernel set. This means s6xk, s16xk, d6xk, and d8xk are now all
  optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
  and zen2 subconfigs.
- Thanks to AMD, who originally contributed the double-precision real
  packm kernels (d6xk and d8xk), which I have now tweaked and used to
  create comparable single-precision real kernels (s6xk and s16xk).
2021-02-27 18:39:56 -06:00
Meghana Vankadari
943b1362c7 Enabled vectorized pack kernels for zen2 configuration.
Details:
- These kernels are implemented by Field G. Van Zee as part of TRSM SUP
  implementation with commit-ID 9e31f5e8553f8ae99cfe8a80052fc63499e0891a.

AMD-Internal: [CPUPL-1376]
Change-Id: Ib39a87fc20571ae9aeff82c9b87516ac583093c2
2021-02-12 19:16:57 +05:30
Dipal M Zambare
38a8008cd8 Enabled znver3 flag for zen3 architecture
znver3 flag will be enabled if compiler is AOCC Clang version 3.0
and configuration is zen3

Change-Id: Ie164f4d469bf3f8df31ccf8fed9f80dfc62efb39
AMD-Internal: [CPUPL-1353]
2020-12-04 12:28:22 +05:30
Field G. Van Zee
11dfc176a3 Reorganized thread auto-factorization logic.
Details:
- Reorganized logic of bli_thread_partition_2x2() so that the primary
  guts were factored out into "fast" and "slow" variants. Then added
  logic to the "fast" variant that allows for more optimal thread
  factorizations in some situations where there is at least one factor
  of 2.
- Changed BLIS_THREAD_RATIO_M from 2 to 1 in bli_kernel_macro_defs.h and
  added comments to that file describing BLIS_THREAD_RATIO_? and
  BLIS_THREAD_MAX_?R.
- In bli_family_zen.h and bli_family_zen2.h, preprocessed out several
  macros not used in vanilla BLIS and removed the unused macro
  BLIS_ENABLE_ZEN_BLOCK_SIZES from the former file.
- Disabled AMD's small matrix handling entry points in bli_syrk_front.c
  and bli_trsm_front.c. (These branches of small matrix handling have
  not been reviewed by vanilla BLIS developers.)
- Added commented-out calls printf() to bli_rntm.c.
- Whitespace changes to bli_thread.c.
2020-12-01 19:51:27 +00:00
Field G. Van Zee
88ad841434 Squash-merge 'pr' into 'squash'. (#457)
Merged contributions from AMD's AOCL BLIS (#448).
  
Details:
- Added support for level-3 operation gemmt, which performs a gemm on
  only the lower or upper triangle of a square matrix C. For now, only
  the conventional/large code path will be supported (in vanilla BLIS).
  This was accomplished by leveraging the existing variant logic for
  herk. However, some of the infrastructure to support a gemmtsup is
  included in this commit, including
  - A bli_gemmtsup() front-end, similar to bli_gemmsup().
  - A bli_gemmtsup_ref() reference handler function.
  - A bli_gemmtsup_int() variant chooser function (with variant calls
    commented out).
- Added support for inducing complex domain gemmt via the 1m method.
- Added gemmt APIs to the BLAS and CBLAS compatiblity layers.
- Added gemmt test module to testsuite.
- Added standalone gemmt test driver to 'test' directory.
- Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md.
- Added a C++ template header (blis.hh) containing a BLAS-inspired
  wrapper to a set of polymorphic CBLAS-like function wrappers defined
  in another header (cblas.hh). These two headers are installed if
  running the 'install' target with INSTALL_HH is set to 'yes'. (Also
  added a set of unit tests that exercise blis.hh, although they are
  disabled for now because they aren't compatible with out-of-tree
  builds.) These files now live in the 'vendor' top-level directory.
- Various updates to 'zen' and 'zen2' subconfigurations, particularly
  within the context initialization functions.
- Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and
  various minor updates to dotv and scalv kernels. Also added various
  sup kernels contributed by AMD to kernels/zen/3. However, these
  kernels are (for now) not yet used, in part because they caused
  AppVeyor clang failures, and also because I have not found time to
  review and vet them.
- Output the python found during configure into the definition of PYTHON
  in build/config.mk (via build/config.mk.in).
- Added early-return checks (A, B, or C with zero dimension; alpha = 0)
  to bli_gemm_front.c.
- Implemented explicit beta = 0 handling in for the sgemm ukernel in
  bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent
  bug surfaced because the gemmt module verifies its computation using
  gemm with its beta parameter set to zero, which, on a cortexa15 system
  caused the gemm kernel code to unconditionally multiply the
  uninitialized C data by beta. The C matrix likely contained
  non-numeric values such as NaN, which then would have resulted in a
  false failure.
- Fixed a bug whereby the implementation for bli_herk_determine_kc(),
  in bli_l3_blocksize.c, was inadvertantly being defined in terms of
  helper functions meant for trmm. This bug was probably harmless since
  the trmm code should have also done the right thing for herk.
- Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in
  kernels/zen/3/bli_gemm_small.c since those macros are not used in
  vanilla BLIS.
- Added cpp guard to definition of bli_mem_clear() in bli_mem.h to
  accommodate C++'s stricter type checking.
- Added cpp guard to test/*.c drivers that facilitate compilation on
  Windows systems.
- Various whitespace changes.
2020-11-14 09:39:48 -06:00
managalv
aae48c2221 Optimised AXPYF routine for complex float and complex double
Details:
    - Added SIMD code
    - Processing 5 rows at a time in SIMD loop to improve performance

AMD-Internal: [CPUPL-1054]

Change-Id: I2ac93f25895dccfc42e14be0689e6d4e655d6a0a
2020-11-06 18:42:13 +05:30