Commit Graph

50 Commits

Author SHA1 Message Date
Edward Smyth
82bdf7c8c7 Code cleanup: Copyright notices
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
  zen, skx and a couple of other kernels to cover all
  contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
  statements.

AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
2024-08-05 15:35:08 -04:00
Mangala V
0a4f9d5ac1 Removed -fno-tree-loop-vectorize from kernel flags
- This change in made in MAKE build system.
- Removed -fno-tree-loop-vectorize from global kernel flags,
  instead added it to lpgemm specific kernels only.
- If this flag is not used , then gcc tries to auto
  vectorize the code which results in usages of
  vector registers, if the auto vectorized function
  is using intrinsic then the total numbers of vector
  registers used by intrinsic and auto vectorized
  code becomes more than the registers
  available in machine which causes read and writes
  to stack, which is causing regression in lpgemm.
- If this flag is enabled globally, then the files which
  do not use any intrinsic code do not get auto
  vectorized.
- To get optimal performance for both blis and lpgemm,
  this flag is enabled for lpgemm kernels only.

Previous commit (75df1ef218) contains
similar changes on cmake build system

AMD-Internal: [CPUPL-5544]

Change-Id: I796e89f3fb2116d64c3a78af2069de20ce92d506
2024-08-02 09:40:06 -04:00
Shubham Sharma
75df1ef218 Removed -fno-tree-loop-vectorize from kernel flags
- This change in made in CMAKE build system only.
- Removed -fno-tree-loop-vectorize from global kernel flags,
  instead added it to lpgemm specific kernels only.
- If this flag is not used , then gcc tries to auto
  vectorize the code which results in usages of
  vector registers, if the auto vectorized function
  is using intrinsics then the total numbers of vector
  registers used by intrinsic and auto vectorized
  code becomes more than the registers
  available in machine which causes read and writes
  to stack, which is causing regression in lpgemm.
- If this flag is enabled globally, then the files which
  do not use any intrinsic code do not get auto
  vectorized.
- To get optimal performance for both blis and lpgemm,
  this flag is enabled for lpgemm kernels only.

Change-Id: I14e5c18cd53b058bfc9d764a8eaf825b4d0a81c4
2024-07-19 00:49:52 -04:00
Vignesh Balasubramanian
6165001658 Bugfix and optimizations for ?AXPBYV API
- Updated the existing code-path for ?AXPBYV to
  reroute the inputs to the appropriate L1 kernel,
  based on the alpha and beta value. This is done
  in order to utilize sensible optimizations with
  regards to the compute and memory operations.

- Updated the typed API interface for ?AXPBYV to include
  an early exit condition(when n is 0, or when alpha is
  0 and beta is 1). Further updated this layer to query
  the right kernel from context, based on the input values
  of alpha and beta.

- Added the necessary L1 vector kernels(i.e, ?SETV, ?ADDV,
  ?SCALV, ?SCAL2V and ?COPYV) to be used as part of special
  case handling in ?AXPBYV.

- Moved the early return with negative increments from ?SCAL2V
  kernels to its typed API interface.

- Updated the zen, zen2 and zen3 context to include function
  pointers for all these vector kernels.

- Updated the existing ?AXPBYV vector kernels to handle only
  the required computation. Additional cleanup was done to
  these kernels.

- Added accuracy and memory tests for AVX2 kernels of ?SETV
  ?COPYV, ?ADDV, ?SCALV, ?SCAL2V, ?AXPYV and ?AXPBYV APIs

- Updated the existing thresholds in ?AXPBYV tests for complex
  types. This is due to the fact that every complex multiplication
  involves two mul ops and one add op. Further added test-cases
  for API level accuracy check, that includes special cases of
  alpha and beta.

- Decomposed the reference call to ?AXPBYV with several other
  L1 BLAS APIs(in case of the reference not supporting its own
  ?AXPBYV API). The decomposition is done to match the exact
  operations that is done in BLIS based on alpha and/or beta
  values. This ensures that we test for our own compliance.

AMD-Internal: [CPUPL-4861]
Change-Id: Ia6d48f12f059f52b31c0bef6c75f47fd364952c6
2024-06-20 16:22:07 +05:30
Vignesh Balasubramanian
53cb83d0cc AVX512 optimizations for ZGEMV API with no-transpose case
- Implemented AVX512 kernels for handling the calls to ZGEMV
  with no-transpose to A matrix.

- This includes the ZAXPYF, ZAXPYV and ZSETV kernels.
  The set of ZAXPYF kernels include those with fuse-factor 8
  (main kernel), 4 and 2(fringe kernels).

- Updated the bli_zgemv_unf_var2( ... ) function to set
  the function pointers to these kernels, based on the
  configuration. Further added the call to ZSETV at this
  layer in case beta is 0.

AMD-Internal: [CPUPL-4974]
Change-Id: Iee4b724719e49023138bb16479765be44d677cd9
2024-05-03 07:04:47 +00:00
jagar
bbfa4a88ec CMake: Updated compiler ID in cmake files
Updated compiler id in cmake related files from
CMAKE_CXX_COMPILER_ID to CMAKE_C_COMPILER_ID

AMD-Internal: [CPUPL-2748]
Change-Id: Ib0e2a2e3ec8fafeb423fe56b9842a93db0115371
2024-03-14 07:24:04 -04:00
Edward Smyth
ed5010d65b Code cleanup: AMD copyright notice
Standardize format of AMD copyright notice.

AMD-Internal: [CPUPL-3519]
Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0
2023-11-23 08:54:31 -05:00
Edward Smyth
f471615c66 Code cleanup: No newline at end of file
Some text files were missing a newline at the end of the file.
One has been added.

AMD-Internal: [CPUPL-3519]
Change-Id: I4b00876b1230b036723d6b56755c6ca844a7ffce
2023-11-22 17:11:10 -05:00
Eleni Vlachopoulou
75a4d2f72f CMake: Adding new portable CMake system.
- A completely new system, made to be closer to Make system.

AMD-Internal: [CPUPL-2748]
Change-Id: I83232786406cdc4f0a0950fb6ac8f551e5968529
2023-11-09 15:49:45 +05:30
Shubham Sharma
9a2a4151ac Added improved ZTRSM AVX2 kernels
- Added 2x6 ZGEMM row-preferred kernel.
  - Kernel supports prefetch_a, prefetch_b,
    prefetch_a_next and prefetch_b_next.
  - Multiple Ways to prefetch c are supported.
  - prefetch_a and prefetch_c are enabled by
    default.
  - K loop is divided into multiple subloops for
    better c prefetch.
- Added 2x6 ZTRSM row-preferred lower
  and upper kernels using AVX2 ISA.
- These kernels are used for ZTRSM only, zgemm
  still uses 3x4 kernel.
- Kernels support row/col/gen storage.
- Updated the zen3 and zen4 config to enable
  use of these kernels for TRSM in zen3 and
  zen4 path.
- Updated CMakeLists.txt with ZGEMM kernels for
  windows build.

AMD-Internal: [CPUPL-3781]

Change-Id: I236205f63a7f6b60bf1a5127a677d27425511e73
2023-10-13 07:43:33 -04:00
Edward Smyth
24e4d58f92 Tidy zen bli_cntx_init and bli_family files
Tidy formatting of config/*zen*/bli_cntx_init_zen*.c and
config/*zen*/bli_family_*.c files to make them more
consistent with each other and improve readability.

AMD-Internal: [CPUPL-3519]
Change-Id: I32c2bf6dc8365264a748a401cf3c83be4976f73b
2023-10-04 05:14:39 -04:00
Edward Smyth
6911d2dd21 zen config make_defs.mk improvements
Improvements to zen make_defs.mk files:
* Add -znver4 flag for GCC 13 and later.
* Add AVX512 flags or -znver4 as appropriate for upstream LLVM
  in config/zen4/make_defs.mk to enable BLIS to be build with
  LLVM rather than AOCC.
* zen make_defs.mk files were inheriting settings from the previous
  one (zen->zen2->zen3->zen4), when they should be independent
  of each other. Correct by including config/zen/amd_config.mk
  in all zen make_defs.mk files to reinitialize the compiler
  flags.
* Update zen2 and zen3 make_defs.mk for recent AOCC compiler
  releases, rather than rely on LLVM settings.
* Remove -mfpmath=sse flag in config/zen4/make_defs.mk as
  this is already specified in amd_config.mk (and should
  be the default setting anyway).
* Tidy files to simplify nested if structures and be more
  consistent with one another.

AMD-Internal: [CPUPL-3399]
Change-Id: Ice64ccedd90c2660fdee8b485348a6b405cfc5ac
2023-05-22 07:51:41 -04:00
mkadavil
3d74b62e60 Lpgemm threading and micro-kernel optimizations.
-Certain sections of the f32 avx512 micro-kernel were observed to
slow down when more post-ops are added. Analysis of the binary
pointed to false dependencies in instructions being introduced in
the presence of the extra post-ops. Addition of vzeroupper at the
beginning of ir loop in f32 micro-kernel fixes this issue.
-F32 gemm (lpgemm) thread factorization tuning for zen4/zen3 added.
-Alpha scaling (multiply instruction) by default was resulting in
performance regression when k dimension is small and alpha=1 in s32
micro-kernels. Alpha scaling is now only done when alpha != 1.
-s16 micro-kernel performance was observed to be regressing when
compiled with gcc for zen3 and older architecture supporting avx2.
This issue is not observed when compiling using gcc with avx512
support enabled. The root cause was identified to be the -fgcse
optimization flag in O2 when applied with avx2 support. This flag is
now disabled for zen3 and older zen configs.

AMD-Internal: [CPUPL-3067]
Change-Id: I5aef9013432c037eb2edf28fdc89470a2eddad1c
2023-03-16 11:44:51 +05:30
Harihara Sudhan S
299bed3fa8 Vectorized SCAL2V for double complex
- Added SCAL2V kernel that uses AVX2 and SSE instructions for
  vectorization.
- The routine returns early when the vector dimension is zero
  or incx <= 0 or incy <= 0.
- The kernel takes one among the two available paths based on
  conjugation requirement of X vector.
- VZEROUPPER is added before transitioning from AVX2 to SSE.
- Added function pointer to ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts.
- Added the new SCAL2V file from the CMAKE list.

AMD-Internal: [CPUPL-2773]
Change-Id: I2debbfab31d41347786c3a1bae5723d092c202e9
2023-02-28 06:59:23 -05:00
Harihara Sudhan S
535b49f150 Vectorized COPYV for double complex
- Added ZCOPYV kernel that uses AVX2 and SSE instructions for
  vectorization.
- The routine returns early when the vector dimension is zero.
- The kernel takes one among the two available paths based on
  conjugation requirement of X vector.
- VZEROUPPER is added before transitioning from AVX2 to SEE.
- Added function pointer to ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts.

AMD-Internal: [CPUPL-2773]
Change-Id: Ibd8a2de42060716395ef698d753c8462654cc0f0
2023-02-16 09:32:10 -05:00
Harihara Sudhan S
8d7593a940 AVX2 vectorized SCALV for double complex
- Added ZSCALV that uses AVX2 and SSE instructions for vectorization.
- Return early when the vector dimension is zero. When alpha is 1 there
  is no need to perform computation hence return early.
- When alpha is zero expert interface of ZSETV is invoked. In this case,
  all the elements of the input vector are set 0.
- Invocation of expert interface means that NULL pointer can be passed
  to the function in place of context. Expert interface of ZSETV will
  query the context and get the approriate function pointer.
- Added BLAS interface for ZSCALV. The architecture ID is used to decide
  the function that is to be invoked.
- Created a new macro INSERT_GENTFUNCSCAL_BLAS_C to instantiate SCALV
  BLAS macro interface only for single complex type and single complex,
  float mixed type

AMD-Internal: [CPUPL-2773]
Change-Id: I0d6995bce883c0ebdc5da0046608fc59d03f6050
2023-02-05 09:36:12 -05:00
mkadavil
d2713d3dc0 GCC compiler optimization flag (kernel) update for zen3 and zen4 config.
-Inefficient assembly is generated for s16 gemm micro-kernel(intrinsics
code) when compiled using gcc. The presence of -fschedule-insns +
-fschedule-insns2 + -ftree-pre in O2 compiler optimization flags
results in the code being optimized to reduce data stalls, and results
in the usage of stack to store intermediate C register output. Disabling
-ftree-pre in gcc fixes the issue, even in the presence of the other
two flags.

AMD-Internal: [CPUPL-2971]
Change-Id: Ibf0dcde20b5a18708a05faad34e684eb0a9a5463
2023-02-02 23:58:14 -05:00
Edward Smyth
82c2eb4e8e Code cleanup and warnings fixes
Corrections for some occurances of:
- Compiler warnings about initialization of float from double
- Spelling mistakes in comments
- Incorrect indentation of code and comments

AMD-Internal: [CPUPL-2870]
Change-Id: Icb68c789687bd0684844331d43071bfffecac9fc
2023-01-09 04:34:52 -05:00
Vignesh Balasubramanian
dbd0c069d4 Implemented 3x4 RD kernels for ZGEMM in SUP path
-Implemented (r)ow preferential (d)ot product milli-kernels
 (m and n variants) for dcomplex datatype along SUP path.

-These computational kernels extend the support for handling RRC and
 CRC storage schemes along the SUP path. In case of BLAS api call,
 it corresponds to the input cases with transa equal to T and
 transb equal to N.

-In case of the B matrix being packed(conditionally), the inputs are
 redirected to the existing (r)ow preferential (v)ector load optimized
 kernels due to better performance.

-Added macro for vhsubpd assembly instruction, to support the arithmetic
 for complex datatype in its interleaved storage.

AMD-Internal: [CPUPL-2593]
Change-Id: If90834e55e9e31aa87d3d5b711efad9ef2458da8
2022-12-22 18:31:46 +05:30
Eleni Vlachopoulou
a5891f7ead Adding AVX2 support for DNRM2
- For the cases where AVX2 is available, an optimized function is called,
based on Blue's algorithm. The fallback method based on sumsqv is used
otherwise.

- Scaling is used to avoid overflow and underflow.

- Works correctly for negative increments.

AMD-Internal: [CPUPL-2551]
Change-Id: I5d8976b29b5af463a8981061b2be907ea647123c
2022-09-20 06:05:01 -04:00
Dipal M Zambare
d3b503bbf2 Code cleanup and warnings fixes
- Removed all compiler warnings as reported by GCC 11 and AOCC 3.2
- Removed unused files
- Removed commented and disabled code (#if 0, #if 1) from some
  files

AMD-Internal: [CPUPL-2460]
Change-Id: Ifc976f6fe585b09e2e387b6793961ad6ef05bb4a
2022-08-29 15:15:40 +05:30
Harsh Dave
f17d043e1c Implemented optimal dotxv kernel
Details:
- Intrinsic implementation of zdotxv, cdotxv kernel
- Unrolling in multiple of 8, remaining corner
  cases are handled serially for zdotxv kernel
- Unrolling in multiple of 16, remainig corner
  cases are handled serially for cdotxv kernel
- Added declaration in zen contexts

AMD-Internal: [CPUPL-2050]
Change-Id: Id58b0dbfdb7a782eb50eecc7142f051b630d9211
2022-05-17 18:10:39 +05:30
Arnav Sharma
caa5b37005 Optimized S/DCOMPLEX DOTXAXPYF using AVX2 Intrinsics
Details:
- Optimized implementation of DOTXAXPYF fused kernel for single and double precision complex datatype using AVX2 Intrinsics
- Updated definitions zen context

AMD-Internal: [CPUPL-2059]
Change-Id: Ic657e4b66172ae459173626222af2756a4125565
2022-05-17 18:10:39 +05:30
Arnav Sharma
393effbb0c Optimized ZAXPY2V using AVX2 Intrinsics
Details:
- Intrinsic implementation of ZAXPY2V fused kernel for AVX2
- Updated definitions in zen contexts

AMD-Internal: [CPUPL-2023]
Change-Id: I8889ae08c826d26e66ae607c416c4282136937fa
2022-05-17 18:08:57 +05:30
Dipal M Zambare
f63f78d783 Removed Arch specific code from BLIS framework.
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
  one of the three ways

  -- It is updated to work across platforms.
  -- Added in architecture/feature specific runtime checks.
  -- Duplicated in AMD specific files. Build system is updated to
      pick AMD specific files when library is built for any of the
     zen architecture

AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
2022-01-18 11:51:08 +05:30
Harsh Dave
79c6aa5643 Implemented optimal S/DCOMPLEX dotxf kernel
- Optimized dotxf implementation for double
  and single precision complex datatype by
  handling dot product computation in tile 2x6
  and 4x6 handling 6 columns at a time, and rows
  in multiple of 2 and 4.

- Dot product computation is arranged such a way
  that multiple rho vector register will hold the
  temporary result till the end of loop and finally
  does horizontal addition to get final dot product
  result.

- Corner cases are handled serially.

- Optimal and reuse of vector registers for
  faster computation.

AMD-Internal: [CPUPL-1975]
Change-Id: I7dd305e73adf54100d54661769c7d5aada9b0098
2022-01-06 02:22:52 -05:00
Harsh Dave
8b5b2707c1 Optimized daxpy2v implementation
- Optimized axpy2v implementation for double
  datatype by handling rows in mulitple of 4
  and store the final computed result at the
  end of computation, preventing unnecessary
  stores for improving the performance.

- Optimal and reuse of vector registers for
  faster computation.

AMD-Internal: [CPUPL-1973]
Change-Id: I7b8ef94d0f67c1c666fdce26e9b2b7291365d2e9
2022-01-05 06:37:22 -05:00
Arnav Sharma
3190e547b0 Optimized AXPBYV Kernel using AVX2 Intrinsics
Details:
- Intrinsic implementation of axpbyv for AVX2
- Bench written for axpbyv
- Added definitions in zen contexts

AMD-Internal: [CPUPL-1963]

Change-Id: I9bc21a6170f5c944eb6e9e9f0e994b9992f8b539
2022-01-05 04:19:11 -05:00
lcpu
30038af896 Reverted: To fix accuracy issues for complex datatypes
Details:
-- reverted cscalv,zscalv,ctrsm,ztrsm changes to address accuracy issues
observed by libflame and scalapack application testing.
-- AMD-Internal: [CPUPL-1906], [CPUPL-1914]

Change-Id: Ic364eacbdf49493dd3a166a66880c12ee84c2204
2021-11-12 08:58:57 +05:30
mkurumel
366ab66134 DNRM2 : Disable dnrm2 Fast math implementation.
Details :
          - Accuracy failures observed when  fast math and ILP64 are enabled.
          - Disabling the feature with macro BLIS_ENABLE_FAST_MATH .

        AMD-Internal: [CPUPL-1907]

Change-Id: I92c661647fb8cc5f1d0af8f6c4eae0fac1df5f16
2021-11-12 08:58:56 +05:30
Nageshwar Singh
a263146a4c Optimized scalv for complex data-types c and z (cscalv and zscalv)
AMD-Internal: [CPUPL-1551]
Change-Id: Ie6855409d89f1edfd2a27f9e5f9efa6cd94bc0c9
2021-11-12 08:58:53 +05:30
mkurumel
595f7b7edf dnrm2 optimization with dot method
1.  Added new kernel bli_dnorm2fv_unb_var1 kernel to compute
	norm with dot operation.
    2.  Added vectorization to compute square of 32 double element
	block size from vector X.
    3.  Defined a new Macro BLIS_ENABLE_DNRM2_FAST under config header
	to compute nrm2 using new kernel.
    4.  Dot kernel definitions and implementation have a possibility for
	accuracy issues .we can switch to traditional implementation by
	disabling the MACRO BLIS_ENABLE_DNRM2_FAST to compute L2-norm
	for Vector X .

    AMD-Internal: [CPUPL-1757]

Change-Id: I1adcaf1b3b4e33837758593c998c25705ff0fe11
2021-11-12 08:58:53 +05:30
Dipal M Zambare
41a86c2463 Enabled znver3 flag for gcc version above 11
-- Added -march=znver3 flag if the library is built for zen3
   configuration with gcc compiler version 11 or above.

-- Replaced hardcoded compiler names 'gcc' and 'clang' with
   variable $CC so that options are chosen as per the compiler
   specified at configure time (instead of compiler in path).

AMD-Internal: [CPUPL-1823]
Change-Id: I2659349c998201ebd4480735c544e48a5ed76bb4
2021-11-12 08:58:48 +05:30
Dipal M Zambare
bcd9591b3f Added support for amdepyc fat binary
-- Created new configuration amdepyc to include fat binary which
     includes zen, zen2, zen3 and generic architecture for fallback.

  -- Updated amdepyc family makefiles to include macros needed
     in amdepyc family binary. This file must include all macros,
     compiler options to be used for non architecture specific code.

  -- Added 'workaround' to exclude ZEN family specific code in some of
     the framework files. There are still lot of places were ZEN family
     specific code is added in framework files. They will be addressed
     with proper design later.
       - Moved definition of BLIS_CONFIG_EPYC from header files to
         makefile so that it is enabled only for framework and kernels

  -- Removed redundant flag AOCL_BLIS_ZEN, used BLIS_CONFIG_EPYC
     wherever it was needed.

  -- Removed un-used, obsolete macros, some of them may be needed for
     debugging which can be added in the individual workspaces.
       - BLIS_DEFAULT_MR_THREAD_MAX
       - BLIS_DEFAULT_NR_THREAD_MAX
       - BLIS_ENABLE_ZEN_BLOCK_SIZES
       - BLIS_SMALL_MATRIX_THRES_TRSM
       - BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES
       - BLIS_ENABLE_SUP_MR_EXT
       - BLIS_ENABLE_SUP_NR_EXT

  -- Corrected implementation of exiting amd64_legacy configuration.

AMD-Internal: [CPUPL-1626, CPUPL-1628]
Change-Id: I46b0ab3ea3ac7d9ff737fef66c462e85601ee29c
2021-08-26 15:13:49 +05:30
Meghana Vankadari
6bad157754 Added a new field in cntx to store l3 threshold function pointers
Details:
- Adding threshold function pointers to cntx gives flexibility to choose
  different threshold functions for different configurations.
- In case of fat binary where configuration is decided at run-time,
  adding threshold functions under a macro enables these functions for
  all the configs under a family. This can be avoided by adding function
  pointers to cntx which can be queried from cntx during run-time
  based on the config chosen.

Change-Id: Iaf7e69e45ae5bb60e4d0f75c7542a91e1609773f
2021-08-16 00:10:01 -04:00
Nallani Bhaskar
e328bdc549 Added prefetch in left cases of dtrsm small
Details:

1. Added prefetching next micro-panel of A and B in dgemm block,
   which are helping in reducing load latency and improved performance.

2. Removed unnecessary unrolls in gemm loops and moved 8x6,6x8 core
   dgemm into macros and made it more modular

3. Packing and diagonal packing in main dgemm loops are modularized.
   Fringe cases are yet to modularize.

4. Updated dtrsm small thresholds for single and multi thread cases

5. Updated div/scale based on disable/enable of trsm pre-inversion

6. Code clean up

Change-Id: I5de16805ff050a31d2b424bb3f6ae0a4019332df
2021-06-15 23:15:22 +05:30
Dipal M Zambare
21130ebece Added configure option for AOCL Dynamic feature.
- AOCL Dynamic feature is added in BLIS which determines optimal
    number of threads for the current problem size.
  - This feature can be enabled/disabled by modifying the source
    code
  - This change adds support to enable/disable this feature during
    configuration time by adding a new option in configure script

AOCL-Internal : [CPUPL-1565]

Change-Id: I590693f793cabc44d27a7f815adc41631dd01bbe
2021-05-12 00:41:13 -04:00
Meghana Vankadari
eea347b02e Added dynamic threading support for GEMM SUP code path
Details:
- Introduced new feature called AOCL_DYNAMIC.
- When this macro is defined, Optimum number of threads to solve DGEMM
  is estimated based on the dimensions (M,N,K).
- Range of optimum number of threads will be [1, num_threads],
  where "num_threads" is number of threads set by the application.
- Num_threads is derived from either environment variable "OMP_NUM_THREADS
  or BLIS_NUM_THREADS' or bli_set_num_threads() API.
- Only local copy of rntm is modified by AOCL_DYNAMIC feature.
  global_rntm data structure remains unchanged in order to keep track of
  original number of threads set by application.
- Optimum number of threads calculation is done only for SUP.
- Since 'native' code path handles larger problem sizes, we use max
  number of threads recommended by the application.

AMD-Internal: [CPUPL-1376]
Change-Id: I665ce14543d6719857d70325c4a9f959c08e66e3
2021-05-07 09:52:51 +05:30
managalv
c420bd63e2 Enabled optimised packed routines on zen3
Change-Id: I5eb57f8ab2cccd20d0f778ada539fd1474cf6338
2021-05-06 01:25:08 -04:00
Madan mohan Manokar
c1fa9abe32 zgemm native path tuning
1. NC and MC values are tuned for both single-instance and multi-instance run.
2. zen2 and zen3 configs updated.
3. SUP path disabled for zgemm, since tuned native path performed better.
To be re-enabled after setting right threshold for SUP selection.

AMD-Internal: [CPUPL-1442]
Change-Id: I0eb86926744d2983530a443e20e3e4e2ee3f3239
2021-05-06 01:15:35 -04:00
Dipal M Zambare
38a8008cd8 Enabled znver3 flag for zen3 architecture
znver3 flag will be enabled if compiler is AOCC Clang version 3.0
and configuration is zen3

Change-Id: Ie164f4d469bf3f8df31ccf8fed9f80dfc62efb39
AMD-Internal: [CPUPL-1353]
2020-12-04 12:28:22 +05:30
Kumar, Phani
477fc41fff Cmake script changes and blis.h changes for amd-staging-milan-3.0
AMD Internal : [CPUPL-1083]

Change-Id: Ia29a1f328ee32e2aec59a7fc70c04400d6ee6580
2020-11-24 06:12:25 -05:00
Dipal M Zambare
ce99b1ecef Added dynamic block size selection logic for DGEMM.
Block sizes (MC, KC, NC) for DGEMM are determined at runtime
based on following parameters

    - Single or multithreaded build
    - Processor Architecture (currently support only zen3)
    - Number of threads requested while running the library

Change-Id: Ia793484b77adb87486e630d0d3b4c7856ae52094
AMD-Internal: [CPUPL-660, CPUPL-661]
2020-11-12 22:40:38 +05:30
managalv
aae48c2221 Optimised AXPYF routine for complex float and complex double
Details:
    - Added SIMD code
    - Processing 5 rows at a time in SIMD loop to improve performance

AMD-Internal: [CPUPL-1054]

Change-Id: I2ac93f25895dccfc42e14be0689e6d4e655d6a0a
2020-11-06 18:42:13 +05:30
Nageshwar Singh
dbd7b28373 Development of AVX2 axpyv kernels for c and z datatypes.
Details
    - Added Framework optimizations for BLAS and CBLAS interfaces for caxpyv_(cblas_caxpyv) and zaxpyv_ (cblas_zaxpyv).
    - Added new axpyv AVX2 kernels for c and z data types for AMD EPYC family.

AMD-Internal: [CPUPL-1231]

Change-Id: I9bc0c21fef9da84533adcef76427977430b27ea7
2020-10-23 09:33:35 +05:30
managalv
90f30e4c37 Optimised dotv kernel by SIMD approach and by removing framework overhead
Details:
    - Kernel is called directly from API call to avoid framework overhead in case of complex float and complex double precisions.
    - Added SIMD code for complex float and complex double and unrolled for loop 5 times to improve performance

AMD-Internal: [CPUPL-1057]

Change-Id: I3b9d202398cacc0168882c9d6da2b450c27466a0
2020-10-13 18:59:31 +05:30
Meghana Vankadari
47744663d9 Enabling framework optimizations for zen family architectures.
Details:
- Introduced a new macro 'BLIS_CONFIG_EPYC' to enable blas and cblas
  framework optimizations for zen family configurations.
- The macro needs to be defined in family.h files of respective arch
  configs.
- Moved zen2-specific optimized kernels to zen folder, in order to be
  accessible to all zen family architectures.

Change-Id: I8da2db6b7ab22ef350a01d86c214006e812eb06d
2020-10-07 13:10:50 +05:30
dzambare
95bbdc12ab Added -fomit-frame-pointer in kernel options
Some of the SUP kernels now use rbp register.
This register was also used by compiler to support
automatic function call tracing, which was creating
the conflict. Automatic call tracing feature is
removed for now. If needed it can be enabled for
non kernel code.

Change-Id: Ib7ad00875f501ee2ad552cbb2ecdc245002d63b7
AMD-Internal: [CPUPL-1135]
2020-08-31 09:53:16 +05:30
Dipal M Zambare
25d23cdda2 Zen3 support, disabled IR, JR loop parallelization
AMD-Internal: [CPUPL-1013]

Change-Id: I859152d63d1a56519c508dfa19587f25123e08b4
2020-07-24 20:55:47 +05:30
dzambare
9c7814da1c Added support for zen3 configuration
- User can now specify zen3 configuration,
      currently it reuses block sizes and kernels from zen2.
    - Auto configuration can detect and enable if zen3 config is needed
    - Added support for amd64 bundle which contains all zen platforms
    - Moved exiting amd bundle to amd64 legacy.

AMD-Internal: [CPUPL-500, CPUPL-1013]
Change-Id: I60b0b8abc6d2821c27ff0f5f6e032e889194b957
2020-07-22 18:24:26 +05:30