Merge branch 'amd' into rt

Details:
- Merged contributions made by AMD via 'amd' branch (see summary below).
  Special thanks to AMD for their contributions to date, especially with
  regard to intrinsic- and assembly-based kernels.
- Added column storage output cases to microkernels in
  bli_gemm_zen_asm_d6x8.c and bli_gemmtrsm_l_zen_asm_d6x8.c. Even with
  the extra cost of transposing the microtile in registers, this is
  much faster than using the general storage case when the underlying
  matrix is column-stored.
- Added s and d assembly-based zen gemmtrsm_u microkernels (including the
  column storage optimization mentioned above).
- Updated zen sub-configuration to reflect presence of new native
  kernels.
- Temporarily reverted zen sub-configuration's level-3 cache blocksizes
  to smaller haswell values.
- Temporarily disabled small matrix handling for zen configuration
  family in config/zen/bli_family_zen.h.
- Updated zen CFLAGS according to changes in 1e4365b.
- Updated haswell microkernels such that:
  - only one vzeroupper instruction is executed prior to returning
  - movapd/movupd are used in lieu of movaps/movups for double-real
    microkernels. (Note that single-real microkernels still use
    movaps/movups.)
- Added kernel prototypes to kernels/zen/bli_kernels_zen.h, which is
  now included via frame/include/bli_arch_config.h.
- Minor updates to bli_amaxv_ref.c (and to inlined "test" implementation
  in testsuite/src/test_amaxv.c).
- Added early return for alpha == 0 in bli_dotxv_ref.c.
- Integrated changes from f07b176, including a fix for undefined
  behavior when executing the 1m method under certain conditions.
- Updated config_registry; no longer need haswell kernels for zen
  sub-configuration.
- Tweaked marginal and pass thresholds for dotxf.
- Reformatted level-1v, -1f, and -3 amd kernels and inserted additional
  comments.
- Updated LICENSE file to explicitly mention that parts are copyright
  UT-Austin and AMD.
- Added AMD copyright to header templates in build/templates.
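
Illustration of the column storage output case described above. This is a
minimal scalar sketch, not the committed assembly: the names store_microtile,
store_path_t, and STORE_* are hypothetical, and the sketch assumes the
microtile is accumulated row by row and simply overwrites C (no beta scaling).
In the real microkernels the transposition would happen in vector registers;
here a small temporary array stands in for them.

```c
#include <assert.h>
#include <stddef.h>

enum { MR = 6, NR = 8 };  /* microtile dimensions, as in the d6x8 kernels */

/* Hypothetical tags naming which output path was taken. */
typedef enum { STORE_ROW, STORE_COL, STORE_GEN } store_path_t;

/* Write the MR x NR microtile ab (row-major, ab[i*NR + j]) to C, where
   element (i,j) of C lives at c[i*rs_c + j*cs_c]. */
static store_path_t store_microtile(const double *ab, double *c,
                                    ptrdiff_t rs_c, ptrdiff_t cs_c)
{
    assert(ab != NULL && c != NULL);

    if (cs_c == 1) {
        /* Row-stored C: each row of the tile is contiguous in memory. */
        for (int i = 0; i < MR; ++i) {
            double *row = c + i * rs_c;
            for (int j = 0; j < NR; ++j) row[j] = ab[i * NR + j];
        }
        return STORE_ROW;
    }

    if (rs_c == 1) {
        /* Column-stored C: transpose the tile first (the extra cost),
           then write each column with contiguous stores. */
        double t[NR][MR];
        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j) t[j][i] = ab[i * NR + j];
        for (int j = 0; j < NR; ++j) {
            double *col = c + j * cs_c;
            for (int i = 0; i < MR; ++i) col[i] = t[j][i];
        }
        return STORE_COL;
    }

    /* General storage: one strided scalar store per element. */
    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j) c[i * rs_c + j * cs_c] = ab[i * NR + j];
    return STORE_GEN;
}
```

Calling store_microtile(ab, c, 1, MR) on a column-major 6x8 block takes the
STORE_COL path; without that case the same call would fall through to the
element-by-element general path.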

Summary of previous changes from 'amd' branch:
- Added s and d assembly-based zen gemm microkernels (d6x8 and d8x6) and
  s and d assembly-based zen gemmtrsm_l microkernels (d6x8).
- Added s and d intrinsics-based zen kernels for amaxv, axpyv, dotv, dotxv,
  and scalv, with extra-unrolling variants for axpyv and scalv.
- Added a small matrix handler to bli_gemm_front(), with the handler
  implemented in kernels/zen/3/bli_gemm_small_matrix.c.
- Added additional logic to sumsqv that first attempts to compute the
  sum of the squares via dotv(). If there is a floating-point exception
  (FE_OVERFLOW), then the previous (numerically conservative) code is
  used; otherwise, the result of dotv() is square-rooted and stored as
  the result. This new implementation is only enabled when FE_OVERFLOW
  is #defined. If the macro is not #defined, then the previous
  implementation is used.
- Added axpyv and dotv standalone test drivers to test directory.
- Added zen support to old cpuid_x86.c driver in build/auto-detect/old.
- Added thread-local and __attribute__-related macros to bli_macro_defs.h.
Field G. Van Zee
2018-02-21 17:43:32 -06:00
46 changed files with 17161 additions and 255 deletions

@@ -214,9 +214,9 @@ CNTX_INIT_PROTS( generic )
// -- AMD64 architectures --
//#ifdef BLIS_KERNELS_ZEN
//#include "bli_kernels_zen.h"
//#endif
#ifdef BLIS_KERNELS_ZEN
#include "bli_kernels_zen.h"
#endif
//#ifdef BLIS_KERNELS_EXCAVATOR
//#include "bli_kernels_excavator.h"
//#endif

@@ -5,6 +5,7 @@
libraries.
Copyright (C) 2014, The University of Texas at Austin
Copyright (C) 2017, Advanced Micro Devices, Inc.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
@@ -64,6 +65,49 @@
#endif
// -- BLIS Thread Local Storage Keyword --
// __thread for TLS is supported by GCC, CLANG, ICC, and IBMC.
// There is a small risk here as __GNUC__ can also be defined by some other
// compiler (other than ICC and CLANG which we know define it) that
// doesn't support __thread, as __GNUC__ is not quite unique to GCC.
// But the possibility of someone using such non-main-stream compiler
// for building BLIS is low.
#if defined(__GNUC__) || defined(__clang__) || defined(__ICC) || defined(__IBMC__)
#define BLIS_THREAD_LOCAL __thread
#else
#define BLIS_THREAD_LOCAL
#endif
// -- BLIS constructor/destructor function attribute --
// __attribute__((constructor/destructor)) is supported by GCC only.
// There is a small risk here as __GNUC__ can also be defined by some other
// compiler (other than ICC and CLANG which we know define it) that
// doesn't support this, as __GNUC__ is not quite unique to GCC.
// But the possibility of someone using such non-main-stream compiler
// for building BLIS is low.
#if defined(__ICC) || defined(__INTEL_COMPILER)
// ICC defines __GNUC__ but doesn't support this
#define BLIS_ATTRIB_CTOR
#define BLIS_ATTRIB_DTOR
#elif defined(__clang__)
// CLANG supports __attribute__, but its documentation doesn't
// mention support for constructor/destructor. Compiling with
// clang and testing shows that it does support.
#define BLIS_ATTRIB_CTOR __attribute__((constructor))
#define BLIS_ATTRIB_DTOR __attribute__((destructor))
#elif defined(__GNUC__)
#define BLIS_ATTRIB_CTOR __attribute__((constructor))
#define BLIS_ATTRIB_DTOR __attribute__((destructor))
#else
#define BLIS_ATTRIB_CTOR
#define BLIS_ATTRIB_DTOR
#endif
// -- Concatenation macros --
#define BLIS_FUNC_PREFIX_STR "bli"
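
A small usage sketch for the thread-local and constructor/destructor macros
added in the hunk above. The macro definitions are reproduced from the hunk
(with the identical clang and GCC branches merged) so the example is
self-contained; everything prefixed demo_ is hypothetical and not part of
BLIS. It assumes a compiler on the GCC/clang branch of the conditionals. On
other compilers the macros expand to nothing and demo_init() is never run
automatically.

```c
#include <assert.h>

/* Reproduced from the hunk above. */
#if defined(__GNUC__) || defined(__clang__) || defined(__ICC) || defined(__IBMC__)
#define BLIS_THREAD_LOCAL __thread
#else
#define BLIS_THREAD_LOCAL
#endif

#if defined(__ICC) || defined(__INTEL_COMPILER)
#define BLIS_ATTRIB_CTOR
#define BLIS_ATTRIB_DTOR
#elif defined(__clang__) || defined(__GNUC__)
#define BLIS_ATTRIB_CTOR __attribute__((constructor))
#define BLIS_ATTRIB_DTOR __attribute__((destructor))
#else
#define BLIS_ATTRIB_CTOR
#define BLIS_ATTRIB_DTOR
#endif

/* Hypothetical library state: one process-wide flag, one per-thread counter. */
static int demo_initialized = 0;
static BLIS_THREAD_LOCAL int demo_call_count = 0;

/* Runs before main() when the constructor attribute is available. */
BLIS_ATTRIB_CTOR static void demo_init(void)
{
    demo_initialized = 1;
}

/* Runs after main() returns (or at exit()) when the attribute is available. */
BLIS_ATTRIB_DTOR static void demo_finalize(void)
{
    demo_initialized = 0;
}

/* Each thread increments its own copy of the counter. */
static int demo_count_call(void)
{
    assert(demo_call_count >= 0);
    return ++demo_call_count;
}
```

Under GCC or clang, demo_initialized is already 1 when main() starts, and
demo_count_call() returns 1, 2, ... independently in each thread.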