Commit Graph

499 Commits

Author SHA1 Message Date
Dipal M Zambare
5d617429f4 Enabled znver4 support for GCC version >= 12
- Updated zen4 configuration to add the -march=znver4 flag to the
  compiler options if the gcc version is 12 or above

AMD-Internal: [CPUPL-1937]
Change-Id: Ic11470b92f71e49ee193a3a5406cf6045d66bd2f
2022-07-22 12:46:15 +05:30
Dipal M Zambare
2ba2fb2b63 Add AVX2 path for TRSM+GEMM combination.
- Enabled AVX2 TRSM + GEMM kernel path, when GEMM is called
  from TRSM context it will invoke AVX2 GEMM kernels instead
  of the default AVX-512 GEMM kernels.

- The default context has the block sizes for AVX512 GEMM
  kernels, however, TRSM uses AVX2 GEMM kernels and they
  need different block sizes.

- Added new API bli_zen4_override_trsm_blkszs(). It overrides
  default block sizes in context with block sizes needed for
  AVX2 GEMM kernels.

- Added new API bli_zen4_restore_default_blkszs(). It restores
  the block sizes to their default values (as needed by default
  AVX512 GEMM kernels).

- Updated bli_trsm_front() to override the block sizes in the
  context needed by TRSM + AVX2 GEMM kernels and restore them
  to the default values at the end of this function. It is done
  in bli_trsm_front() so that we override the context before
  creating different threads.

AMD-Internal: [CPUPL-2225]
Change-Id: Ie92d0fc40f94a32dfb865fe3771dc14ed7884c55
2022-06-29 10:16:24 +00:00
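The override/restore pattern described in commit 2ba2fb2b63 can be sketched roughly as follows. This is a minimal illustration, not the BLIS implementation: `blksz_cntx_t`, `trsm_front_sketch()`, the helper names, and every block-size number are hypothetical stand-ins, and the real bli_zen4_override_trsm_blkszs() and bli_zen4_restore_default_blkszs() operate on the full BLIS context.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the BLIS context: only the cache block
   sizes for one datatype are modelled. */
typedef struct {
    size_t mc, kc, nc;
} blksz_cntx_t;

/* Placeholder numbers, not the values used by the zen4 configuration. */
static const blksz_cntx_t default_avx512_blkszs = { 144, 480, 4080 };
static const blksz_cntx_t trsm_avx2_blkszs      = {  72, 256, 4080 };

static void override_trsm_blkszs(blksz_cntx_t *cntx)
{
    *cntx = trsm_avx2_blkszs;
}

static void restore_default_blkszs(blksz_cntx_t *cntx)
{
    *cntx = default_avx512_blkszs;
}

/* Stand-in for the threaded TRSM work; it only reports the KC it saw. */
static size_t trsm_body(const blksz_cntx_t *cntx)
{
    return cntx->kc;
}

size_t trsm_front_sketch(blksz_cntx_t *cntx)
{
    assert(cntx != NULL);
    override_trsm_blkszs(cntx);        /* done once, before threads are created */
    size_t kc_seen = trsm_body(cntx);  /* all workers would read the AVX2 sizes */
    restore_default_blkszs(cntx);      /* later GEMM calls see the defaults again */
    return kc_seen;
}
```

Doing the override in the front-end function means the worker threads never observe a half-updated context, which appears to be the reason the commit places it in bli_trsm_front().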
Harish
2e4ed37e97 Added missing endif for the AOCC 4.0 version check
Change-Id: I1c77ae795c398aec685152b491b838a75e7ce318
2022-06-28 13:01:53 +05:30
Harish
77e8492cbd Added znver4 flag for config builds with AOCC 4.0 compiler version
Change-Id: I45f1031ed4c5ea2e3f594713f3821d6bbbecd4df
2022-06-28 02:47:48 -04:00
Kiran Varaganti
47c344bc2e DGEMM Benchmark Optimizations
Updated with optimal cache-blocking sizes for MC, KC and NC for AVX512 dgemm kernel

Change-Id: I56b3df238b6d85a6f6861448c0c6f907c972146a
2022-06-27 09:32:39 -04:00
Dipal M Zambare
c87b9aab75 Added support for AVX512 for Windows and AMAXV
- Completed zen4 configuration support on windows
- Enabled AVX512 kernels for AMAXV
- Added zen4 configuration in amdzen for windows
- Moved all zen4 kernels inside kernels/zen4 folder

AMD-Internal: [CPUPL-2108]
Change-Id: I9d2336998bbcdb8e2c4ca474977b5939bfa578ba
2022-06-08 11:09:48 +05:30
Dipal M Zambare
8cc15107ed Enabled AVX-512 kernels for Zen4 config
- Enabled AVX-512 skylake kernels in zen4 configuration.
  AVX-512 kernels are added for GEMM float and double types.

- Enabled reference kernel for TRSM native path

AMD-Internal: [CPUPL-2108]
Change-Id: I66f3468346085c17183cbcbf4f2c8cfe07579b6f
2022-06-03 06:34:35 +00:00
Dipal M Zambare
6e2f536590 Removed Arch specific code from BLIS framework.
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
  one of the three ways

  -- It is updated to work across platforms.
  -- Added in architecture/feature specific runtime checks.
  -- Duplicated in AMD specific files. Build system is updated to
     pick AMD specific files when library is built for any of the
     zen architectures

AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
2022-05-17 20:35:40 +05:30
Dipal M Zambare
9c6c76613c Added support for zen4 architecture
- Added configuration option for zen4 architecture
- Added auto-detection of zen4 architecture
- Added zen4 configuration for all checks related
  to AMD specific optimizations

AMD-Internal: [CPUPL-1937]
Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a
2022-05-17 18:12:49 +05:30
Harsh Dave
349dcc459a Fixed scalapack xcsep failure due to cdotxv kernel.
- Failure was observed in zen configuration as gcc
  flag safe-math-optimization was being used for reference
  kernel compilation.
- Optimized kernels were being compiled without this gcc flag,
  which resulted in a computation difference and a test case
  failure.

AMD-Internal: [CPUPL-2121]
Change-Id: I5d86e589cdea633220aecadbcab84d9b88b31f57
2022-05-17 18:10:40 +05:30
Harsh Dave
f17d043e1c Implemented optimal dotxv kernel
Details:
- Intrinsic implementation of zdotxv, cdotxv kernel
- Unrolling in multiples of 8, remaining corner
  cases are handled serially for zdotxv kernel
- Unrolling in multiples of 16, remaining corner
  cases are handled serially for cdotxv kernel
- Added declaration in zen contexts

AMD-Internal: [CPUPL-2050]
Change-Id: Id58b0dbfdb7a782eb50eecc7142f051b630d9211
2022-05-17 18:10:39 +05:30
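The "unroll by 8, handle the remainder serially" structure from commit f17d043e1c can be illustrated with a scalar sketch. The assumptions are significant: this version works on real doubles rather than the complex types the commit targets, uses plain C instead of intrinsics, and `ddotxv_sketch()` is a hypothetical name. It computes rho := beta * rho + alpha * x^T y.

```c
#include <assert.h>
#include <stddef.h>

/* rho := beta * rho + alpha * x^T y, real double precision only. */
double ddotxv_sketch(size_t n, double alpha, const double *x,
                     const double *y, double beta, double rho)
{
    assert(n == 0 || (x != NULL && y != NULL));

    double acc[8] = { 0.0 };   /* one partial sum per unrolled lane */
    size_t i = 0;

    /* Main loop: 8 elements per iteration. */
    for (; i + 8 <= n; i += 8)
        for (size_t lane = 0; lane < 8; ++lane)
            acc[lane] += x[i + lane] * y[i + lane];

    /* Combine the partial sums once, after the loop. */
    double dot = 0.0;
    for (size_t lane = 0; lane < 8; ++lane)
        dot += acc[lane];

    /* Corner case: the remaining n % 8 elements, handled serially. */
    for (; i < n; ++i)
        dot += x[i] * y[i];

    return beta * rho + alpha * dot;
}
```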
Arnav Sharma
caa5b37005 Optimized S/DCOMPLEX DOTXAXPYF using AVX2 Intrinsics
Details:
- Optimized implementation of DOTXAXPYF fused kernel for single and double precision complex datatype using AVX2 Intrinsics
- Updated definitions in zen contexts

AMD-Internal: [CPUPL-2059]
Change-Id: Ic657e4b66172ae459173626222af2756a4125565
2022-05-17 18:10:39 +05:30
Arnav Sharma
393effbb0c Optimized ZAXPY2V using AVX2 Intrinsics
Details:
- Intrinsic implementation of ZAXPY2V fused kernel for AVX2
- Updated definitions in zen contexts

AMD-Internal: [CPUPL-2023]
Change-Id: I8889ae08c826d26e66ae607c416c4282136937fa
2022-05-17 18:08:57 +05:30
Dipal M. Zambare
b90420627a Revert "Enabled AVX-512 kernels for Zen4 config"
This reverts commit 62c96a4190.
Was committed without review.
2022-04-21 06:46:00 +00:00
Dipal M. Zambare
0adb525f5b Revert "Enabled AVX-512 kernels for Zen4 config"
This reverts commit f816cf059f.
Was committed without review.
2022-04-21 06:45:38 +00:00
Dipal M. Zambare
f816cf059f Enabled AVX-512 kernels for Zen4 config
Enabled AVX-512 skylake kernels in zen4 configuration.
  AVX-512 kernels are added for float and double types.

AMD-Internal: [CPUPL-2108]
Change-Id: Idfe3f64a037db019cbdf43318954db52ad241a51
2022-04-21 06:38:24 +00:00
Dipal M. Zambare
62c96a4190 Enabled AVX-512 kernels for Zen4 config
Enabled AVX-512 skylake kernels in zen4 configuration.
  AVX-512 kernels are added for float and double types.

AMD-Internal: [CPUPL-2108]
2022-04-21 06:28:29 +00:00
Dipal M Zambare
f63f78d783 Removed Arch specific code from BLIS framework.
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
  one of the three ways

  -- It is updated to work across platforms.
  -- Added in architecture/feature specific runtime checks.
  -- Duplicated in AMD specific files. Build system is updated to
     pick AMD specific files when library is built for any of the
     zen architectures

AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
2022-01-18 11:51:08 +05:30
Harsh Dave
79c6aa5643 Implemented optimal S/DCOMPLEX dotxf kernel
- Optimized dotxf implementation for double
  and single precision complex datatype by
  handling dot product computation in tile 2x6
  and 4x6 handling 6 columns at a time, and rows
  in multiple of 2 and 4.

- Dot product computation is arranged in such a way
  that multiple rho vector registers hold the
  temporary results till the end of the loop, and a final
  horizontal addition produces the dot product
  result.

- Corner cases are handled serially.

- Optimal reuse of vector registers for
  faster computation.

AMD-Internal: [CPUPL-1975]
Change-Id: I7dd305e73adf54100d54661769c7d5aada9b0098
2022-01-06 02:22:52 -05:00
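The column tiling in commit 79c6aa5643 (six columns at a time, accumulators kept live until the end of the row loop) might look like this in scalar form. This is a sketch under stated assumptions: real doubles instead of complex types, column-major storage with leading dimension `lda`, no row tiling or vector registers, and a hypothetical function name. It computes y := beta * y + alpha * A^T x for an m-by-b matrix A.

```c
#include <assert.h>
#include <stddef.h>

void ddotxf_sketch(size_t m, size_t b, double alpha,
                   const double *a, size_t lda,
                   const double *x, double beta, double *y)
{
    assert(b == 0 || y != NULL);
    assert(m == 0 || b == 0 || (a != NULL && x != NULL));

    size_t j = 0;

    /* Six columns per pass; each rho[c] plays the role of one
       accumulator register that is only reduced after the row loop. */
    for (; j + 6 <= b; j += 6) {
        double rho[6] = { 0.0 };
        for (size_t i = 0; i < m; ++i)
            for (size_t c = 0; c < 6; ++c)
                rho[c] += a[i + (j + c) * lda] * x[i];
        for (size_t c = 0; c < 6; ++c)
            y[j + c] = beta * y[j + c] + alpha * rho[c];
    }

    /* Corner case: leftover columns are handled one at a time. */
    for (; j < b; ++j) {
        double rho = 0.0;
        for (size_t i = 0; i < m; ++i)
            rho += a[i + j * lda] * x[i];
        y[j] = beta * y[j] + alpha * rho;
    }
}
```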
Harsh Dave
8b5b2707c1 Optimized daxpy2v implementation
- Optimized axpy2v implementation for double
  datatype by handling rows in multiples of 4
  and storing the final computed result at the
  end of computation, preventing unnecessary
  stores and improving performance.

- Optimal reuse of vector registers for
  faster computation.

AMD-Internal: [CPUPL-1973]
Change-Id: I7b8ef94d0f67c1c666fdce26e9b2b7291365d2e9
2022-01-05 06:37:22 -05:00
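A scalar sketch of the idea in commit 8b5b2707c1: both scaled vectors are added in registers and z is written once per element, instead of being stored after each of two separate axpy passes. The function name is hypothetical and the code uses plain C rather than the AVX2 intrinsics of the real kernel. It computes z := z + alphax * x + alphay * y.

```c
#include <assert.h>
#include <stddef.h>

void daxpy2v_sketch(size_t n, double alphax, double alphay,
                    const double *x, const double *y, double *z)
{
    assert(n == 0 || (x != NULL && y != NULL && z != NULL));

    size_t i = 0;

    /* Four elements per iteration; results stay in temporaries until
       both contributions are added, then z is stored exactly once. */
    for (; i + 4 <= n; i += 4) {
        double t[4];
        for (size_t lane = 0; lane < 4; ++lane)
            t[lane] = z[i + lane]
                    + alphax * x[i + lane]
                    + alphay * y[i + lane];
        for (size_t lane = 0; lane < 4; ++lane)
            z[i + lane] = t[lane];
    }

    /* Remaining n % 4 elements. */
    for (; i < n; ++i)
        z[i] += alphax * x[i] + alphay * y[i];
}
```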
Arnav Sharma
3190e547b0 Optimized AXPBYV Kernel using AVX2 Intrinsics
Details:
- Intrinsic implementation of axpbyv for AVX2
- Bench written for axpbyv
- Added definitions in zen contexts

AMD-Internal: [CPUPL-1963]

Change-Id: I9bc21a6170f5c944eb6e9e9f0e994b9992f8b539
2022-01-05 04:19:11 -05:00
Dipal M Zambare
fd8a3aace9 Added support for zen4 architecture
- Added configuration option for zen4 architecture
- Added auto-detection of zen4 architecture
- Added zen4 configuration for all checks related
  to AMD specific optimizations

AMD-Internal: [CPUPL-1937]
Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a
2021-11-23 10:29:15 +05:30
lcpu
30038af896 Reverted: To fix accuracy issues for complex datatypes
Details:
-- reverted cscalv,zscalv,ctrsm,ztrsm changes to address accuracy issues
observed by libflame and scalapack application testing.
-- AMD-Internal: [CPUPL-1906], [CPUPL-1914]

Change-Id: Ic364eacbdf49493dd3a166a66880c12ee84c2204
2021-11-12 08:58:57 +05:30
nphaniku
4af525a313 AOCL Windows BLIS : Windows build for dynamic dispatch library
Change-Id: Ie05eafbeacbd5589b514d9353517330515104939
2021-11-12 08:58:57 +05:30
mkurumel
366ab66134 DNRM2 : Disable dnrm2 Fast math implementation.
Details:
- Accuracy failures observed when fast math and ILP64 are enabled.
- Disabling the feature with macro BLIS_ENABLE_FAST_MATH.

AMD-Internal: [CPUPL-1907]

Change-Id: I92c661647fb8cc5f1d0af8f6c4eae0fac1df5f16
2021-11-12 08:58:56 +05:30
Dipal M Zambare
3364c0e4eb Binary and dynamic dispatch configuration name change
-- Reverted changes made to include lp/ilp info in binary name
   This reverts commit c5e6f885f0.

-- Included BLAS int size in 'make showconfig'

-- Renamed amdepyc configuration to amdzen

Change-Id: Ie87ec1c03e105f606aef1eac397ba0d8338906a6
2021-11-12 08:58:56 +05:30
Nageshwar Singh
a263146a4c Optimized scalv for complex data-types c and z (cscalv and zscalv)
AMD-Internal: [CPUPL-1551]
Change-Id: Ie6855409d89f1edfd2a27f9e5f9efa6cd94bc0c9
2021-11-12 08:58:53 +05:30
mkurumel
595f7b7edf dnrm2 optimization with dot method
1. Added new kernel bli_dnorm2fv_unb_var1 to compute
   norm with dot operation.
2. Added vectorization to compute square of 32 double element
   block size from vector X.
3. Defined a new macro BLIS_ENABLE_DNRM2_FAST under config header
   to compute nrm2 using new kernel.
4. Dot kernel definitions and implementation have a possibility of
   accuracy issues. We can switch to the traditional implementation by
   disabling the macro BLIS_ENABLE_DNRM2_FAST to compute the L2-norm
   of vector X.

AMD-Internal: [CPUPL-1757]

Change-Id: I1adcaf1b3b4e33837758593c998c25705ff0fe11
2021-11-12 08:58:53 +05:30
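One way the dot-based nrm2 from commit 595f7b7edf can lose accuracy is overflow of the raw sum of squares. The sketch below contrasts that with a scaled accumulation in the style of classic reference nrm2 routines. All names are hypothetical, it is assumed (not confirmed by the commit) that this is the failure mode the message refers to, and the final square root is left out to keep the example free of libm.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>
#include <stddef.h>

static double abs_double(double v)
{
    return v < 0.0 ? -v : v;
}

/* True when v is neither NaN nor +/-infinity. */
bool is_finite_double(double v)
{
    return v == v && abs_double(v) <= DBL_MAX;
}

/* "Dot method": plain sum of squares; nrm2 would be sqrt(result).
   Fast, but x[i] * x[i] can overflow long before the norm does. */
double sumsq_dot(size_t n, const double *x)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * x[i];
    return sum;
}

/* Scaled accumulation: maintains sum(x_i^2) == scale^2 * ssq, so the
   norm would be scale * sqrt(ssq) and no intermediate overflows. */
void sumsq_scaled(size_t n, const double *x, double *scale, double *ssq)
{
    assert(scale != NULL && ssq != NULL);
    *scale = 0.0;
    *ssq = 1.0;
    for (size_t i = 0; i < n; ++i) {
        double ax = abs_double(x[i]);
        if (ax == 0.0)
            continue;
        if (*scale < ax) {
            double r = *scale / ax;
            *ssq = 1.0 + *ssq * r * r;
            *scale = ax;
        } else {
            double r = ax / *scale;
            *ssq += r * r;
        }
    }
}
```

For x = {3e200, 4e200} the true norm is about 5e200 and representable, yet the direct sum of squares is infinite, while the scaled form stays finite.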
Dipal M Zambare
41a86c2463 Enabled znver3 flag for gcc version above 11
-- Added -march=znver3 flag if the library is built for zen3
   configuration with gcc compiler version 11 or above.

-- Replaced hardcoded compiler names 'gcc' and 'clang' with
   variable $CC so that options are chosen as per the compiler
   specified at configure time (instead of compiler in path).

AMD-Internal: [CPUPL-1823]
Change-Id: I2659349c998201ebd4480735c544e48a5ed76bb4
2021-11-12 08:58:48 +05:30
Abhiram S
7787bc79b1 Level1 samaxv: AVX512 implementation
Details:
 1. Unrolled by a factor of 5. This gave around 1 GFLOPS gain.
 2. Changed CMP to subs and remove nan. CMP uses many
    compare instructions, which have higher latency and a higher
    instruction count. Replacing it with subs and remove nan
    reduced it to 3 lighter instructions.
 3. Added remove nan function.
 4. Added AVX512 definition in skx context.
 5. Disabled code in AMAXV kernel depending on whether the AVX512
    flag exists or not

Change-Id: I191725a55bc33edf8d537156292cf997d6a5fe35
2021-09-27 16:10:08 +05:30
Harihara Sudhan S
bdb5e32176 Level 1 Kernel: damaxv AVX512
Details:

- Developed damaxv for AVX512 extension
- Implemented removeNAN function that converts NAN values
  to negative values based on the location
- Avoided use of COMPARE256/COMPARE128 in the AVX512
  implementation for better performance
- Unrolled the loop by a factor of 4.

Change-Id: Icf2a3606cf311ecc646aeb3db0628b293b9a3326
2021-09-02 09:47:08 +05:30
Dipal M Zambare
bcd9591b3f Added support for amdepyc fat binary
-- Created new configuration amdepyc to include fat binary which
   includes zen, zen2, zen3 and generic architecture for fallback.

-- Updated amdepyc family makefiles to include macros needed
   in amdepyc family binary. This file must include all macros,
   compiler options to be used for non architecture specific code.

-- Added 'workaround' to exclude ZEN family specific code in some of
   the framework files. There are still a lot of places where ZEN family
   specific code is added in framework files. They will be addressed
   with proper design later.
     - Moved definition of BLIS_CONFIG_EPYC from header files to
       makefile so that it is enabled only for framework and kernels

-- Removed redundant flag AOCL_BLIS_ZEN, used BLIS_CONFIG_EPYC
   wherever it was needed.

-- Removed unused, obsolete macros, some of them may be needed for
   debugging which can be added in the individual workspaces.
     - BLIS_DEFAULT_MR_THREAD_MAX
     - BLIS_DEFAULT_NR_THREAD_MAX
     - BLIS_ENABLE_ZEN_BLOCK_SIZES
     - BLIS_SMALL_MATRIX_THRES_TRSM
     - BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES
     - BLIS_ENABLE_SUP_MR_EXT
     - BLIS_ENABLE_SUP_NR_EXT

-- Corrected implementation of existing amd64_legacy configuration.

AMD-Internal: [CPUPL-1626, CPUPL-1628]
Change-Id: I46b0ab3ea3ac7d9ff737fef66c462e85601ee29c
2021-08-26 15:13:49 +05:30
Meghana Vankadari
6bad157754 Added a new field in cntx to store l3 threshold function pointers
Details:
- Adding threshold function pointers to cntx gives flexibility to choose
  different threshold functions for different configurations.
- In case of fat binary where configuration is decided at run-time,
  adding threshold functions under a macro enables these functions for
  all the configs under a family. This can be avoided by adding function
  pointers to cntx which can be queried from cntx during run-time
  based on the config chosen.

Change-Id: Iaf7e69e45ae5bb60e4d0f75c7542a91e1609773f
2021-08-16 00:10:01 -04:00
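The design in commit 6bad157754 (threshold functions stored as pointers in the context and queried at run time, rather than selected by a compile-time macro) can be sketched like this. Every name, the two-entry configuration table, and the threshold values are hypothetical; the real cntx layout is not shown in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Signature of a level-3 threshold function: should this problem
   take the small/unpacked path? */
typedef bool (*l3_thresh_fn)(size_t m, size_t n, size_t k);

/* Hypothetical slice of a context holding one threshold pointer. */
typedef struct {
    l3_thresh_fn gemm_small_thresh;
} cntx_sketch_t;

/* Two made-up configurations with different cut-offs. */
static bool thresh_cfg_a(size_t m, size_t n, size_t k)
{
    return m <= 64 && n <= 64 && k <= 64;
}

static bool thresh_cfg_b(size_t m, size_t n, size_t k)
{
    return m <= 128 && n <= 128 && k <= 128;
}

enum { CFG_A, CFG_B, CFG_COUNT };

static const cntx_sketch_t cntx_table[CFG_COUNT] = {
    { thresh_cfg_a },
    { thresh_cfg_b },
};

/* In a fat binary the configuration is only known at run time. */
const cntx_sketch_t *cntx_for_config(int cfg)
{
    assert(cfg >= 0 && cfg < CFG_COUNT);
    return &cntx_table[cfg];
}

bool use_small_path(const cntx_sketch_t *cntx, size_t m, size_t n, size_t k)
{
    assert(cntx != NULL && cntx->gemm_small_thresh != NULL);
    return cntx->gemm_small_thresh(m, n, k);
}
```

Because each configuration registers its own function, one binary can carry different thresholds for each member of a family without enabling a single macro for all of them.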
Nallani Bhaskar
e328bdc549 Added prefetch in left cases of dtrsm small
Details:

1. Added prefetching next micro-panel of A and B in dgemm block,
   which are helping in reducing load latency and improved performance.

2. Removed unnecessary unrolls in gemm loops and moved 8x6,6x8 core
   dgemm into macros and made it more modular

3. Packing and diagonal packing in main dgemm loops are modularized.
   Fringe cases are yet to modularize.

4. Updated dtrsm small thresholds for single and multi thread cases

5. Updated div/scale based on disable/enable of trsm pre-inversion

6. Code clean up

Change-Id: I5de16805ff050a31d2b424bb3f6ae0a4019332df
2021-06-15 23:15:22 +05:30
Dipal M Zambare
21130ebece Added configure option for AOCL Dynamic feature.
- AOCL Dynamic feature is added in BLIS which determines optimal
  number of threads for the current problem size.
- This feature can be enabled/disabled by modifying the source
  code
- This change adds support to enable/disable this feature during
  configuration time by adding a new option in configure script

AOCL-Internal : [CPUPL-1565]

Change-Id: I590693f793cabc44d27a7f815adc41631dd01bbe
2021-05-12 00:41:13 -04:00
Meghana Vankadari
eea347b02e Added dynamic threading support for GEMM SUP code path
Details:
- Introduced new feature called AOCL_DYNAMIC.
- When this macro is defined, Optimum number of threads to solve DGEMM
  is estimated based on the dimensions (M,N,K).
- Range of optimum number of threads will be [1, num_threads],
  where "num_threads" is number of threads set by the application.
- Num_threads is derived from either environment variable "OMP_NUM_THREADS"
  or "BLIS_NUM_THREADS", or the bli_set_num_threads() API.
- Only local copy of rntm is modified by AOCL_DYNAMIC feature.
  global_rntm data structure remains unchanged in order to keep track of
  original number of threads set by application.
- Optimum number of threads calculation is done only for SUP.
- Since 'native' code path handles larger problem sizes, we use max
  number of threads recommended by the application.

AMD-Internal: [CPUPL-1376]
Change-Id: I665ce14543d6719857d70325c4a9f959c08e66e3
2021-05-07 09:52:51 +05:30
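The behaviour described in commit eea347b02e can be sketched as below: a thread count in [1, num_threads] is estimated from (M, N, K), and only a local copy of the runtime object is changed. The heuristic (one thread per 32x32x32 block of work) is invented purely for illustration, and `rntm_sketch_t` and the function names are hypothetical; the commit does not state the real formula.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the BLIS runtime object. */
typedef struct {
    size_t num_threads;
} rntm_sketch_t;

/* Made-up heuristic: one thread per 32*32*32 multiply-adds,
   clamped to the range [1, max_threads]. */
static size_t estimate_threads(size_t m, size_t n, size_t k,
                               size_t max_threads)
{
    size_t wanted = (m * n * k) / (32u * 32u * 32u);
    if (wanted < 1)
        wanted = 1;
    if (wanted > max_threads)
        wanted = max_threads;
    return wanted;
}

size_t gemm_sup_num_threads(const rntm_sketch_t *global_rntm,
                            size_t m, size_t n, size_t k)
{
    assert(global_rntm != NULL && global_rntm->num_threads >= 1);

    /* Work on a copy so the application's setting is preserved. */
    rntm_sketch_t local = *global_rntm;
    local.num_threads = estimate_threads(m, n, k, local.num_threads);
    return local.num_threads;
}
```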
managalv
c420bd63e2 Enabled optimised packed routines on zen3
Change-Id: I5eb57f8ab2cccd20d0f778ada539fd1474cf6338
2021-05-06 01:25:08 -04:00
Madan mohan Manokar
c1fa9abe32 zgemm native path tuning
1. NC and MC values are tuned for both single-instance and multi-instance run.
2. zen2 and zen3 configs updated.
3. SUP path disabled for zgemm, since tuned native path performed better.
To be re-enabled after setting right threshold for SUP selection.

AMD-Internal: [CPUPL-1442]
Change-Id: I0eb86926744d2983530a443e20e3e4e2ee3f3239
2021-05-06 01:15:35 -04:00
lcpu
7401effc03 BLIS:merge:
Merge conflicts that arose while downstreaming BLIS code from master to the milan-3.1 branch have been fixed.

Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond)

Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations.

Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations.

Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger the conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu)

Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu)

Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs.

Minor code consolidation in all level-3 _front() functions.

Reorganized Windows cpp branch of bli_pthreads.c.

Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS.

Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion.

Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv.

AMD-internal-[CPUPL-1523]

Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd
2021-04-27 11:09:48 +05:30
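The prime-thread-count reduction mentioned in merge commit 7401effc03 could be sketched as follows. The threshold of 11 comes from the message; the size of the reduction is not stated there, so decrementing by one is an assumption, and the function and macro names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the documented default of BLIS_NT_MAX_PRIME. */
#define NT_MAX_PRIME_SKETCH 11

static bool is_prime(size_t n)
{
    if (n < 2)
        return false;
    for (size_t d = 2; d * d <= n; ++d)
        if (n % d == 0)
            return false;
    return true;
}

/* Assumption: a prime count above the threshold is reduced by one,
   which always yields an even (hence factorable) thread count. */
size_t auto_adjust_num_threads(size_t nt)
{
    assert(nt >= 1);
    if (nt > NT_MAX_PRIME_SKETCH && is_prime(nt))
        return nt - 1;
    return nt;
}
```

A prime count can only be factored as 1 x nt, so dropping one thread gives the automatic factorization something to work with.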
Field G. Van Zee
bf1b578ea3 Reduced KC on skx from 384 to 256.
Details:
- Reduced the KC cache blocksize for double real on the skx subconfig
  from 384 to 256. The maximum (extended) KC was also reduced
  accordingly from 480 to 320. Thanks to Tze Meng Low for suggesting
  this change.
2021-03-19 13:03:17 -05:00
nphaniku
b3628cdfd3 AOCL Windows: 3.1 BLIS changes
1. CMake script changes for build with Clang compiler.
2. CMake script changes for build test and testsuite based on the lib type ST/MT
3. CMake script changes for testcpp and blastest
4. Added python scripts to support library build and testsuite build.

AMD Internal : [CPUPL-1422]

Change-Id: Ie34c3e60e9f8fbf7ea69b47fd1b50ee90099c898
2021-03-08 19:04:17 +05:30
Nicholai Tukanov
670bc7b60f Add low-precision POWER10 gemm kernels (#467)
Details:
- This commit adds a new BLIS sandbox that (1) provides implementations 
  based on low-precision gemm kernels, and (2) extends the BLIS typed 
  API for those new implementations. Currently, these new kernels can 
  only be used for the POWER10 microarchitecture; however, they may 
  provide a template for developing similar kernels for other 
  microarchitectures (even those beyond POWER), as changes would likely 
  be limited to select places in the microkernel and possibly the 
  packing routines. The new low-precision operations that are now 
  supported include: shgemm, sbgemm, i16gemm, i8gemm, i4gemm. For more 
  information, refer to the POWER10.md document that is included in 
  'sandbox/power10'.
2021-03-05 13:53:43 -06:00
Field G. Van Zee
f5871c7e06 Added complex asm packm kernels for 'haswell' set.
Details:
- Implemented assembly-based packm kernels for single- and double-
  precision complex domain (c and z) and housed them in the 'haswell'
  kernel set. This means c3xk, c8xk, z3xk, and z4xk are now all
  optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
  and zen2 subconfigs.
- Minor modifications to the corresponding s and d packm kernels that
  were introduced in 426ad67.
- Thanks to AMD, who originally contributed the double-precision real
  packm kernels (d6xk and d8xk), upon which these complex kernels are
  partially based.
2021-02-28 17:03:57 -06:00
Field G. Van Zee
426ad679f5 Added assembly packm kernels for 'haswell' set.
Details:
- Implemented assembly-based packm kernels for single- and double-
  precision real domain (s and d) and housed them in the 'haswell'
  kernel set. This means s6xk, s16xk, d6xk, and d8xk are now all
  optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
  and zen2 subconfigs.
- Thanks to AMD, who originally contributed the double-precision real
  packm kernels (d6xk and d8xk), which I have now tweaked and used to
  create comparable single-precision real kernels (s6xk and s16xk).
2021-02-27 18:39:56 -06:00
Meghana Vankadari
943b1362c7 Enabled vectorized pack kernels for zen2 configuration.
Details:
- These kernels are implemented by Field G. Van Zee as part of TRSM SUP
  implementation with commit-ID 9e31f5e8553f8ae99cfe8a80052fc63499e0891a.

AMD-Internal: [CPUPL-1376]
Change-Id: Ib39a87fc20571ae9aeff82c9b87516ac583093c2
2021-02-12 19:16:57 +05:30
Dipal M Zambare
38a8008cd8 Enabled znver3 flag for zen3 architecture
znver3 flag will be enabled if compiler is AOCC Clang version 3.0
and configuration is zen3

Change-Id: Ie164f4d469bf3f8df31ccf8fed9f80dfc62efb39
AMD-Internal: [CPUPL-1353]
2020-12-04 12:28:22 +05:30
Field G. Van Zee
11dfc176a3 Reorganized thread auto-factorization logic.
Details:
- Reorganized logic of bli_thread_partition_2x2() so that the primary
  guts were factored out into "fast" and "slow" variants. Then added
  logic to the "fast" variant that allows for more optimal thread
  factorizations in some situations where there is at least one factor
  of 2.
- Changed BLIS_THREAD_RATIO_M from 2 to 1 in bli_kernel_macro_defs.h and
  added comments to that file describing BLIS_THREAD_RATIO_? and
  BLIS_THREAD_MAX_?R.
- In bli_family_zen.h and bli_family_zen2.h, preprocessed out several
  macros not used in vanilla BLIS and removed the unused macro
  BLIS_ENABLE_ZEN_BLOCK_SIZES from the former file.
- Disabled AMD's small matrix handling entry points in bli_syrk_front.c
  and bli_trsm_front.c. (These branches of small matrix handling have
  not been reviewed by vanilla BLIS developers.)
- Added commented-out calls to printf() in bli_rntm.c.
- Whitespace changes to bli_thread.c.
2020-12-01 19:51:27 +00:00
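The goal of the thread factorization touched by commit 11dfc176a3 is to split a thread count into two factors whose ratio roughly matches the ratio of work in two dimensions. The brute-force search below illustrates that goal only; it is not the "fast" or "slow" variant of bli_thread_partition_2x2(), and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Choose nt1 * nt2 == n_thread so that nt1 / nt2 is as close as
   possible to work1 / work2, by trying every factor pair. */
void thread_partition_2x2_sketch(size_t n_thread,
                                 size_t work1, size_t work2,
                                 size_t *nt1, size_t *nt2)
{
    assert(n_thread >= 1 && nt1 != NULL && nt2 != NULL);

    size_t best_cost = (size_t)-1;
    *nt1 = 1;
    *nt2 = n_thread;

    for (size_t f1 = 1; f1 <= n_thread; ++f1) {
        if (n_thread % f1 != 0)
            continue;
        size_t f2 = n_thread / f1;

        /* f1 / f2 == work1 / work2 exactly when f1 * work2 == f2 * work1,
           so the gap between the two products measures the mismatch. */
        size_t lhs = f1 * work2;
        size_t rhs = f2 * work1;
        size_t cost = lhs > rhs ? lhs - rhs : rhs - lhs;

        if (cost < best_cost) {
            best_cost = cost;
            *nt1 = f1;
            *nt2 = f2;
        }
    }
}
```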
Dipal M Zambare
c2f63fcc54 Update amd64 bundle configuration
The configuration is updated to

   - Enable EPYC architecture optimizations
   - Macros to override block sizes.

AMD-Internal : [CPUPL-1350]

Change-Id: Id712f9abe6e81c9ece2baaab9d965b405e72977a
2020-12-01 14:37:13 +05:30
Kumar, Phani
477fc41fff Cmake script changes and blis.h changes for amd-staging-milan-3.0
AMD Internal : [CPUPL-1083]

Change-Id: Ia29a1f328ee32e2aec59a7fc70c04400d6ee6580
2020-11-24 06:12:25 -05:00
Field G. Van Zee
88ad841434 Squash-merge 'pr' into 'squash'. (#457)
Merged contributions from AMD's AOCL BLIS (#448).
  
Details:
- Added support for level-3 operation gemmt, which performs a gemm on
  only the lower or upper triangle of a square matrix C. For now, only
  the conventional/large code path will be supported (in vanilla BLIS).
  This was accomplished by leveraging the existing variant logic for
  herk. However, some of the infrastructure to support a gemmtsup is
  included in this commit, including
  - A bli_gemmtsup() front-end, similar to bli_gemmsup().
  - A bli_gemmtsup_ref() reference handler function.
  - A bli_gemmtsup_int() variant chooser function (with variant calls
    commented out).
- Added support for inducing complex domain gemmt via the 1m method.
- Added gemmt APIs to the BLAS and CBLAS compatibility layers.
- Added gemmt test module to testsuite.
- Added standalone gemmt test driver to 'test' directory.
- Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md.
- Added a C++ template header (blis.hh) containing a BLAS-inspired
  wrapper to a set of polymorphic CBLAS-like function wrappers defined
  in another header (cblas.hh). These two headers are installed if
  running the 'install' target with INSTALL_HH set to 'yes'. (Also
  added a set of unit tests that exercise blis.hh, although they are
  disabled for now because they aren't compatible with out-of-tree
  builds.) These files now live in the 'vendor' top-level directory.
- Various updates to 'zen' and 'zen2' subconfigurations, particularly
  within the context initialization functions.
- Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and
  various minor updates to dotv and scalv kernels. Also added various
  sup kernels contributed by AMD to kernels/zen/3. However, these
  kernels are (for now) not yet used, in part because they caused
  AppVeyor clang failures, and also because I have not found time to
  review and vet them.
- Output the python found during configure into the definition of PYTHON
  in build/config.mk (via build/config.mk.in).
- Added early-return checks (A, B, or C with zero dimension; alpha = 0)
  to bli_gemm_front.c.
- Implemented explicit beta = 0 handling for the sgemm ukernel in
  bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent
  bug surfaced because the gemmt module verifies its computation using
  gemm with its beta parameter set to zero, which, on a cortexa15 system
  caused the gemm kernel code to unconditionally multiply the
  uninitialized C data by beta. The C matrix likely contained
  non-numeric values such as NaN, which then would have resulted in a
  false failure.
- Fixed a bug whereby the implementation for bli_herk_determine_kc(),
  in bli_l3_blocksize.c, was inadvertently being defined in terms of
  helper functions meant for trmm. This bug was probably harmless since
  the trmm code should have also done the right thing for herk.
- Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in
  kernels/zen/3/bli_gemm_small.c since those macros are not used in
  vanilla BLIS.
- Added cpp guard to definition of bli_mem_clear() in bli_mem.h to
  accommodate C++'s stricter type checking.
- Added cpp guard to test/*.c drivers that facilitate compilation on
  Windows systems.
- Various whitespace changes.
2020-11-14 09:39:48 -06:00
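Two points from commit 88ad841434 lend themselves to a small reference sketch: gemmt updates only one triangle of a square C, and beta = 0 must overwrite C rather than multiply it, so that uninitialized or NaN entries cannot leak into the result. The code below is an unoptimized illustration with hypothetical names, assuming column-major storage and real doubles; it is not the BLIS implementation.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TRI_LOWER, TRI_UPPER } uplo_sketch_t;

/* C := beta * C + alpha * A * B, restricted to one triangle of the
   n-by-n matrix C. A is n-by-k, B is k-by-n, all column-major. */
void dgemmt_sketch(uplo_sketch_t uplo, size_t n, size_t k,
                   double alpha, const double *a, size_t lda,
                   const double *b, size_t ldb,
                   double beta, double *c, size_t ldc)
{
    if (n == 0)
        return;                       /* early return: nothing to update */
    assert(c != NULL);
    assert(k == 0 || alpha == 0.0 || (a != NULL && b != NULL));

    for (size_t j = 0; j < n; ++j) {
        size_t i_begin = (uplo == TRI_LOWER) ? j : 0;
        size_t i_end   = (uplo == TRI_LOWER) ? n : j + 1;

        for (size_t i = i_begin; i < i_end; ++i) {
            double ab = 0.0;
            if (alpha != 0.0)
                for (size_t p = 0; p < k; ++p)
                    ab += a[i + p * lda] * b[p + j * ldb];

            double *cij = &c[i + j * ldc];
            if (beta == 0.0)
                *cij = alpha * ab;    /* overwrite: C is never read */
            else
                *cij = beta * (*cij) + alpha * ab;
        }
    }
}
```

With the explicit beta == 0.0 branch, a NaN already sitting in the updated triangle is replaced by the product; without it, 0 * NaN would keep the NaN, which is the kind of false failure the commit describes.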