Commit Graph

181 Commits

Author SHA1 Message Date
Field G. Van Zee
7a0ba4194f Added support for addons.
Details:
- Implemented a new feature called addons, which are similar to
  sandboxes except that there is no requirement to define gemm or any
  other particular operation.
- Updated configure to accept --enable-addon=<name> or -a <name> syntax
  for requesting an addon be included within a BLIS build. configure now
  outputs the list of enabled addons into config.mk. It also outputs the
  corresponding #include directives for the addons' headers to a new
  companion to the bli_config.h header file named bli_addon.h. Because
  addons may wish to make use of existing BLIS types within their own
  definitions, the addons' headers must be included sometime after that
  of bli_config.h (which currently is #included before bli_type_defs.h).
  This is why the #include directives needed to go into a new top-level
  header file rather than the existing bli_config.h file.
- Added a markdown document, docs/Addons.md, to explain addons, how to
  build with them, and what assumptions their authors should keep in
  mind as they create them.
- Added a gemmlike-like implementation of sandwich gemm called 'gemmd'
  as an addon in addon/gemmd. The code uses a 'bao_' prefix for local
  functions, including the user-level object and typed APIs.
- Updated .gitignore so that git ignores bli_addon.h files.

Change-Id: Ie7efdea366481ce25075cb2459bdbcfd52309717
2022-03-31 12:03:27 +05:30
Dipal M Zambare
f63f78d783 Removed Arch specific code from BLIS framework.
- Removed BLIS_CONFIG_EPYC macro.
- The code dependent on this macro is handled in
  one of three ways:

  -- It is updated to work across platforms.
  -- Architecture/feature-specific runtime checks are added.
  -- It is duplicated in AMD-specific files. The build system is updated
     to pick AMD-specific files when the library is built for any of the
     zen architectures.

AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
2022-01-18 11:51:08 +05:30
Dipal M Zambare
5d287fdba0 Include LP64/ILP64 in BLIS binary name
Binary name will be chosen based on multi-threading and BLAS
  integer size configuration as given below.

  libblis-[mt]-lp64 - when configured to use 32 bit integers
  libblis-[mt]-ilp64 - when configured to use 64 bit integers

AMD-Internal: [CPUPL-1879]
Change-Id: I865023c63235a0a72bdfce7057b2cfb8158b1d87
2021-11-12 08:58:51 +05:30
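The naming rule described in commit 5d287fdba0 can be sketched as below. This is only an illustration: it assumes that "[mt]" denotes an optional "-mt" component present in multithreaded builds, and the helper name `blis_library_name` is hypothetical, not part of the BLIS build system.

```python
def blis_library_name(multithreaded: bool, blas_int_bits: int) -> str:
    """Return the binary base name for a given configuration.

    Hypothetical helper: reads "[mt]" in the commit message as an
    optional component that appears only for multithreaded builds.
    """
    if blas_int_bits == 32:
        int_model = "lp64"    # 32-bit BLAS integers
    elif blas_int_bits == 64:
        int_model = "ilp64"   # 64-bit BLAS integers
    else:
        raise ValueError("BLAS integer size must be 32 or 64 bits")

    parts = ["libblis"]
    if multithreaded:
        parts.append("mt")
    parts.append(int_model)
    return "-".join(parts)


if __name__ == "__main__":
    for mt in (False, True):
        for bits in (32, 64):
            print(mt, bits, blis_library_name(mt, bits))
```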
Dipal M Zambare
d2313bb4e6 Update show config to include missing info.
-- Ignore aocl dynamic configuration if multithreading is disabled.
     AOCL Dynamic will also be disabled in this case.
  -- Added following configuration settings in showconfig output
     1. Complex return scheme
     2. TRSM preinversion status
     3. AOCL dynamic active status

AOCL-Internal: [CPUPL-1565]
Change-Id: Id5a31b233fc08dcd871de4a693aab0b2a5d9f1c4
2021-06-29 12:03:47 +05:30
Dipal M Zambare
fe3384b3c6 Enable AOCL Dynamic feature by default.
It can be disabled by configuration option --disable-aocl-dynamic.

AOCL-Internal: [CPUPL-1565]
Change-Id: I15ea5964dcd479f16dc9edc72957af3bcf4bc0e2
2021-06-22 14:17:52 +05:30
Dipal M Zambare
21130ebece Added configure option for AOCL Dynamic feature.
- AOCL Dynamic feature is added in BLIS which determines optimal
    number of threads for the current problem size.
  - This feature can be enabled/disabled by modifying the source
    code
  - This change adds support to enable/disable this feature during
    configuration time by adding a new option in configure script

AOCL-Internal : [CPUPL-1565]

Change-Id: I590693f793cabc44d27a7f815adc41631dd01bbe
2021-05-12 00:41:13 -04:00
lcpu
7401effc03 BLIS:merge:
Fixed the merge conflicts that arose while downstreaming BLIS code from master to the milan-3.1 branch.

Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond)

Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations.

Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations.

Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to fall back to the conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu)

Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu)

Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs.

Minor code consolidation in all level-3 _front() functions.

Reorganized Windows cpp branch of bli_pthreads.c.

Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS.

Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion.

Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv.

AMD-internal-[CPUPL-1523]

Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd
2021-04-27 11:09:48 +05:30
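The automatic prime-thread reduction mentioned in commit 7401effc03 can be sketched roughly as follows. The commit message does not say by how much the thread count is reduced; this sketch assumes a reduction by one thread. The function names and the `allow_primes` flag (standing in for the BLIS_ENABLE_AUTO_PRIME_NUM_THREADS macro) are hypothetical. A plausible motivation is that a prime count p can only be factored as 1 x p, which limits how threads can be spread across loops.

```python
BLIS_NT_MAX_PRIME = 11  # default threshold stated in the commit message


def is_prime(n: int) -> bool:
    """Trial-division primality test, adequate for thread counts."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def adjust_num_threads(nt: int, allow_primes: bool = False) -> int:
    """Reduce a prime thread count that exceeds the threshold.

    Assumption: the reduction is by exactly one thread.
    """
    if not allow_primes and nt > BLIS_NT_MAX_PRIME and is_prime(nt):
        return nt - 1
    return nt


if __name__ == "__main__":
    for requested in (7, 11, 13, 16, 17):
        print(requested, "->", adjust_num_threads(requested))
```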
Field G. Van Zee
8f39aea11f Merge branch 'dev'
2021-01-30 17:59:56 -06:00
Devin Matthews
874c3f04ec Update configure
Choose last sub-config in the kernel-to-config map if the config list doesn't contain the name of the kernel set. E.g. for "zen: skx knl haswell" pick "haswell" instead of "skx" which was chosen previously. Fixes #470.
2021-01-08 13:56:30 -06:00
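The selection rule from commit 874c3f04ec might look like the sketch below. The function name and the list representation of a map entry are assumptions made for illustration; the real logic lives in the configure shell script.

```python
def choose_config_for_kernel(kernel_set: str, configs: list) -> str:
    """Pick the sub-config associated with a kernel set.

    If the config list contains the kernel set's own name, use it.
    Otherwise fall back to the LAST entry (the earlier behavior was to
    pick the first).
    """
    if not configs:
        raise ValueError("kernel-to-config map entry has no configs")
    if kernel_set in configs:
        return kernel_set
    return configs[-1]


if __name__ == "__main__":
    # Map entry from the commit message: "zen: skx knl haswell"
    print(choose_config_for_kernel("zen", ["skx", "knl", "haswell"]))
```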
dzambare
48f2366b6f Updated BLIS version string to "AOCL BLIS X.x" format
AMD-Internal : [CPUPL-1394]

Change-Id: Ifebcb14d9eb064d231b831f5a1e151853ad5a009
2021-01-07 12:38:32 +05:30
Field G. Van Zee
ed50c94738 Merge branch 'master' into dev
2021-01-04 14:31:44 -06:00
nprasadm
10ac4e2aba Blis: DOTC Additional argument for Complex types when using FLANG
Merged the changes done in UT Austin BLIS repo for DOTC Additional
argument.
Other modifications related to test application included.

Verified the above code changes through scalapack test applications 'xztrd' , 'xctrd'

Change-Id: I7e16f3953db71890f9e8fbb0f7b363eaad899f62
Signed-off-by: Nagendra <Nagendra.PrasadM@amd.com>
AMD-Internal: [CPUPL-1323]
2020-12-16 14:03:10 +05:30
Isuru Fernando
21aa67e11c fix cc_vendor for crosstool-ng toolchains
2020-12-05 21:59:13 -06:00
Field G. Van Zee
7038bbaa05 Optionally disable trsm diagonal pre-inversion.
Details:
- Implemented a configure-time option, --disable-trsm-preinversion, that
  optionally disables the pre-inversion of diagonal elements of the
  triangular matrix in the trsm operation and instead uses division
  instructions within the gemmtrsm microkernels. Pre-inversion is
  enabled by default. When it is disabled, performance may suffer
  slightly, but numerical robustness should improve for certain
  pathological cases involving denormal (subnormal) numbers that would
  otherwise result in overflow in the pre-inverted value. Thanks to
  Bhaskar Nallani for reporting this issue via #461.
- Added preprocessor macro guards to bli_trsm_cntl.c as well as the
  gemmtrsm microkernels for 'haswell' and 'penryn' kernel sets pursuant
  to the aforementioned feature.
- Added macros to frame/include/bli_x86_asm_macros.h related to division
  instructions.
2020-12-04 16:08:15 -06:00
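The numerical issue that motivates --disable-trsm-preinversion in commit 7038bbaa05 can be illustrated with scalar double-precision arithmetic. This is a toy sketch, not the gemmtrsm microkernel: the function names are hypothetical and the values are chosen only to trigger the overflow.

```python
import math


def solve_with_preinversion(rhs: float, diag: float) -> float:
    """Pre-invert the diagonal element, then multiply."""
    inv_diag = 1.0 / diag  # overflows to inf when diag is tiny enough
    return rhs * inv_diag


def solve_with_division(rhs: float, diag: float) -> float:
    """Divide by the diagonal element directly."""
    return rhs / diag


if __name__ == "__main__":
    diag = 5e-324  # smallest positive subnormal double
    rhs = 1e-320   # also subnormal, so the true quotient is modest

    x_inv = solve_with_preinversion(rhs, diag)
    x_div = solve_with_division(rhs, diag)

    print("pre-inversion overflowed:", math.isinf(x_inv))
    print("division stayed finite:  ", math.isfinite(x_div))
```

The pre-inverted value overflows even though the quotient itself is representable, which is the pathological case the commit message describes.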
Field G. Van Zee
9bb23e6c2a Added support for systemless build (no pthreads).
Details:
- Added a configure option, --[enable|disable]-system, which determines
  whether the modest operating system dependencies in BLIS are included.
  The most notable example of this on Linux and BSD/OSX is the use of
  POSIX threads to ensure thread safety for when application-level
  threads call BLIS. When --disable-system is given, the bli_pthreads
  implementation is dummied out entirely, allowing the calling code
  within BLIS to remain unchanged. Why would anyone want to build BLIS
  like this? The motivating example was submitted via #454 in which a
  user wanted to build BLIS for a simulator such as gem5 where thread
  safety may not be a concern (and where the operating system is largely
  absent anyway). Thanks to Stepan Nassyr for suggesting this feature.
- Another, more minor side effect of the --disable-system option is that
  the implementation of bli_clock() unconditionally returns 0.0 instead
  of the time elapsed since some fixed point in the past. The reasoning
  for this is that if the operating system is truly minimal, the system
  function call upon which bli_clock() would normally be implemented
  (e.g. clock_gettime()) may not be available.
- Refactored preprocess-guarded code in bli_pthread.c and bli_pthread.h
  to remove redundancies.
- Removed old comments and commented #include of "bli_pthread_wrap.h"
  from bli_system.h.
- Documented bli_clock() and bli_clock_min_diff() in BLISObjectAPI.md
  and BLISTypedAPI.md, with a note that both are non-functional when
  BLIS is configured with --disable-system.
2020-11-16 15:55:45 -06:00
Field G. Van Zee
88ad841434 Squash-merge 'pr' into 'squash'. (#457)
Merged contributions from AMD's AOCL BLIS (#448).
  
Details:
- Added support for level-3 operation gemmt, which performs a gemm on
  only the lower or upper triangle of a square matrix C. For now, only
  the conventional/large code path will be supported (in vanilla BLIS).
  This was accomplished by leveraging the existing variant logic for
  herk. However, some of the infrastructure to support a gemmtsup is
  included in this commit, including
  - A bli_gemmtsup() front-end, similar to bli_gemmsup().
  - A bli_gemmtsup_ref() reference handler function.
  - A bli_gemmtsup_int() variant chooser function (with variant calls
    commented out).
- Added support for inducing complex domain gemmt via the 1m method.
- Added gemmt APIs to the BLAS and CBLAS compatibility layers.
- Added gemmt test module to testsuite.
- Added standalone gemmt test driver to 'test' directory.
- Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md.
- Added a C++ template header (blis.hh) containing a BLAS-inspired
  wrapper to a set of polymorphic CBLAS-like function wrappers defined
  in another header (cblas.hh). These two headers are installed if
  running the 'install' target with INSTALL_HH is set to 'yes'. (Also
  added a set of unit tests that exercise blis.hh, although they are
  disabled for now because they aren't compatible with out-of-tree
  builds.) These files now live in the 'vendor' top-level directory.
- Various updates to 'zen' and 'zen2' subconfigurations, particularly
  within the context initialization functions.
- Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and
  various minor updates to dotv and scalv kernels. Also added various
  sup kernels contributed by AMD to kernels/zen/3. However, these
  kernels are (for now) not yet used, in part because they caused
  AppVeyor clang failures, and also because I have not found time to
  review and vet them.
- Output the python found during configure into the definition of PYTHON
  in build/config.mk (via build/config.mk.in).
- Added early-return checks (A, B, or C with zero dimension; alpha = 0)
  to bli_gemm_front.c.
- Implemented explicit beta = 0 handling for the sgemm ukernel in
  bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent
  bug surfaced because the gemmt module verifies its computation using
  gemm with its beta parameter set to zero, which, on a cortexa15 system
  caused the gemm kernel code to unconditionally multiply the
  uninitialized C data by beta. The C matrix likely contained
  non-numeric values such as NaN, which then would have resulted in a
  false failure.
- Fixed a bug whereby the implementation for bli_herk_determine_kc(),
  in bli_l3_blocksize.c, was inadvertently being defined in terms of
  helper functions meant for trmm. This bug was probably harmless since
  the trmm code should have also done the right thing for herk.
- Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in
  kernels/zen/3/bli_gemm_small.c since those macros are not used in
  vanilla BLIS.
- Added cpp guard to definition of bli_mem_clear() in bli_mem.h to
  accommodate C++'s stricter type checking.
- Added cpp guard to test/*.c drivers that facilitate compilation on
  Windows systems.
- Various whitespace changes.
2020-11-14 09:39:48 -06:00
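The beta = 0 bug fixed in commit 88ad841434 comes down to the fact that multiplying NaN by zero still yields NaN. A scalar sketch of the two update strategies follows; the function names are hypothetical and the real fix is in the armv7a microkernel, not in code like this.

```python
import math


def update_c_naive(c: float, ab: float, alpha: float, beta: float) -> float:
    """Always scale C by beta, even when beta is zero."""
    return beta * c + alpha * ab


def update_c_beta_aware(c: float, ab: float, alpha: float, beta: float) -> float:
    """Treat beta == 0 as 'overwrite C' and never read the old value."""
    if beta == 0.0:
        return alpha * ab
    return beta * c + alpha * ab


if __name__ == "__main__":
    uninitialized = float("nan")  # stands in for garbage in an unset C
    print(update_c_naive(uninitialized, 3.0, 1.0, 0.0))       # NaN leaks through
    print(update_c_beta_aware(uninitialized, 3.0, 1.0, 0.0))  # clean result
```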
Dipal M Zambare
4347d2d823 Re-enable support for Intel 19+ compiler.
Note that there is a known issue with Intel 19+ as explained
in https://github.com/flame/blis/issues/371.

AMD version needs this support as some user applications
need ICC support.

AMD-Internal: [CPUPL-1223]

Change-Id: I86ddee068ae18bd940a5952d60960228d8100e97
2020-11-06 11:11:46 +05:30
Field G. Van Zee
2a0682f8e5 Implemented runtime subconfig selection (#451).
Details:
- Implemented support for the user manually overriding the automatic
  subconfiguration selection that happens at runtime. This override
  can be requested by setting the BLIS_ARCH_TYPE environment variable.
  The variable must be set to the arch_t id (as enumerated in
  bli_type_defs.h) corresponding to the desired subconfiguration. If a
  value outside this enumerated range is given, BLIS will abort with an
  error message. If the value is in the valid range but corresponds to a
  subconfiguration that was not activated at configure-time/compile-time,
  BLIS will abort with a (different) error message. Thanks to decandia50
  for suggesting this feature via issue #451.
- Defined a new function bli_gks_lookup_id to return the address of an
  internal data structure within the gks. If this address is NULL, then
  it indicates that the subconfig corresponding to the arch_t id passed
  into the function was not compiled into BLIS. This function is used
  in the second of the two abort scenarios described above.
- Defined the enumerated error code BLIS_UNINITIALIZED_GKS_CNTX, which
  is returned for the latter of the two abort scenarios mentioned above,
  along with a corresponding error message and a function to perform
  the error check.
- Added cpp macro branching to bli_env.c to support compilation of the
  auto-detect.x executable during configure-time. This cpp branch is
  similar to the cpp code already found in bli_arch.c and bli_cpuid.c.
- Cleaned up the auto_detect() function to facilitate easier maintenance
  going forward. Also added a convenient debug switch that outputs the
  compilation command for the auto-detect.x executable and exits.
2020-10-18 18:04:03 -05:00
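The override logic and its two abort scenarios from commit 2a0682f8e5 can be sketched as below. The id-to-name table is hypothetical (the real arch_t values are enumerated in bli_type_defs.h), and Python exceptions stand in for BLIS aborting with an error message.

```python
# Hypothetical arch_t ordering, for illustration only.
ARCH_NAMES = ("skx", "knl", "haswell", "zen", "generic")


def select_subconfig(env: dict, compiled_in: set, auto_detected: str) -> str:
    """Return the subconfiguration to use at runtime.

    env           -- mapping of environment variables
    compiled_in   -- subconfigs enabled at configure time
    auto_detected -- what hardware detection would choose
    """
    value = env.get("BLIS_ARCH_TYPE")
    if value is None:
        return auto_detected  # no override requested

    arch_id = int(value)
    # First abort scenario: id outside the enumerated range.
    if not 0 <= arch_id < len(ARCH_NAMES):
        raise ValueError("BLIS_ARCH_TYPE is outside the valid arch_t range")

    name = ARCH_NAMES[arch_id]
    # Second abort scenario: valid id, but subconfig was not built in.
    if name not in compiled_in:
        raise LookupError("subconfig '%s' was not compiled into BLIS" % name)
    return name


if __name__ == "__main__":
    built = {"haswell", "zen"}
    print(select_subconfig({}, built, "zen"))
    print(select_subconfig({"BLIS_ARCH_TYPE": "2"}, built, "zen"))
```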
Field G. Van Zee
97e87f2c9f Whitespace/comment updates to #434 PR.
2020-09-07 15:56:42 -05:00
Devin Matthews
7fdc0fc893 Add an option to change the complex return type.
ifort apparently does not return complex numbers in registers as in C/C++ (or gfortran), but instead creates a "hidden" first parameter for the return value. The option --complex-return=gnu|intel has been added, as well as a guess based on a provided FC if not specified (otherwise default to gnu). This option affects the signatures of cdotc, cdotu, zdotc, and zdotu, and a single library cannot be used with both GNU and Intel Fortran compilers. Fixes #433.
2020-08-06 14:09:23 -05:00
dzambare
9c7814da1c Added support for zen3 configuration
- User can now specify zen3 configuration;
  currently it reuses block sizes and kernels from zen2.
- Auto configuration can detect and enable the zen3 config when needed.
- Added support for amd64 bundle, which contains all zen platforms.
- Moved existing amd bundle to amd64 legacy.

AMD-Internal: [CPUPL-500, CPUPL-1013]
Change-Id: I60b0b8abc6d2821c27ff0f5f6e032e889194b957
2020-07-22 18:24:26 +05:30
Field G. Van Zee
f973f00d94 Defined netlib equivalent of xerbla_array().
Details:
- Added a function definition for xerbla_array_(), which largely mirrors
  its netlib implementation. Thanks to Isuru Fernando for suggesting the
  addition of this function.

Change-Id: Ie9c619f5604e60a32edfda2db2b66f0c762581d3
2020-05-21 11:57:54 +05:30
Field G. Van Zee
b325f1ea62 Warn user when auto-detection returns 'generic'.
Details:
- Added logic to configure that causes the script to output a warning
  to the user if/when "./configure auto" is run and the underlying
  hardware feature detection code is unable to identify the hardware.
  In these cases, the auto-detect code will return 'generic', which
  is likely not what the user expected, and a flag will be set so that
  a message is printed at the end of the configure output. (Thankfully,
  we don't expect this scenario to play out very often.) Thanks to
  Devin Matthews for suggesting this fix via #384.
2020-05-21 11:54:53 +05:30
Field G. Van Zee
99da76fd64 Fixed 'configure' breakage introduced in 6433831.
Details:
- Added a missing 'fi' (endif) keyword to a conditional block added in
  the configure script in commit 6433831.
2020-05-21 11:40:00 +05:30
Jeff Hammond
570d51483b blacklist ICC 18 for knl/skx due to test failures
Signed-off-by: Jeff Hammond <jeff.r.hammond@intel.com>
2020-05-21 11:40:00 +05:30
Jeff Hammond
afc57adc1b blacklist Intel 19+
Signed-off-by: Jeff Hammond <jeff.r.hammond@intel.com>
2020-05-21 11:40:00 +05:30
Field G. Van Zee
d51245e58b Add support for Intel oneAPI in configure.
Details:
- Properly select cc_vendor based on the output of invoking CC with the
  --version option, including cases where CC is the variant of clang
  that is included with Intel oneAPI. (However, we continue to treat
  the compiler as clang for other purposes, not icc.) Thanks to Ajay
  Panyala and Devin Matthews for reporting on this issue via #402.
2020-05-08 18:00:54 -05:00
dzambare
d40edf7dac Execution and Debug trace support.
Added support for debug logging, execution trace, and decode.

Change-Id: I024bf6165daa9e23a62423f2401c0f1c5de459ba
AMD-Internal: [CPUPL-806]
2020-04-07 08:48:59 +05:30
Field G. Van Zee
c40a33190b Warn user when auto-detection returns 'generic'.
Details:
- Added logic to configure that causes the script to output a warning
  to the user if/when "./configure auto" is run and the underlying
  hardware feature detection code is unable to identify the hardware.
  In these cases, the auto-detect code will return 'generic', which
  is likely not what the user expected, and a flag will be set so that
  a message is printed at the end of the configure output. (Thankfully,
  we don't expect this scenario to play out very often.) Thanks to
  Devin Matthews for suggesting this fix via #384.
2020-03-26 16:55:00 -05:00
Meghana Vankadari
cc98047fd6 Made framework changes to initialize specific cache block sizes for TRSM.
Details:
- This commit addresses the performance optimization (single-thread and
  multi-thread) for DTRSM on zen2.
- This new optimization employs different MC, KC & NC values for TRSM than
  those used in other Level-3 routines like DGEMM.
- Changed TRSM framework code to choose these blocksizes for TRSM
  on zen family configurations.
- Added a new field called "trsm_blkszs" to cntx structure in order to
  store TRSM-specific block sizes.
- Implemented routines to initialize, set and query the TRSM-specific
  block sizes.
- Defined a new macro "AOCL_BLIS_ZEN" in configure script.
  This macro is automatically defined for zen family architectures.
  It enables us to choose different cache block sizes for TRSM instead of
  the common level-3 block sizes.

Change-Id: Id8557b1c962a316b1edecca9cd582675eaf35fe6
Signed-off-by: Meghana Vankadari <meghana.vankadari@amd.com>
AMD-Internal: [CPUPL-656]
2020-03-09 10:33:42 +05:30
Field G. Van Zee
5ca1a3cfc1 Fixed 'configure' breakage introduced in 6433831.
Details:
- Added a missing 'fi' (endif) keyword to a conditional block added in
  the configure script in commit 6433831.
2020-01-06 12:29:12 -06:00
Jeff Hammond
6433831cc3 blacklist ICC 18 for knl/skx due to test failures
Signed-off-by: Jeff Hammond <jeff.r.hammond@intel.com>
2020-01-03 17:51:05 -08:00
Jeff Hammond
af3589f1f9 blacklist Intel 19+
Signed-off-by: Jeff Hammond <jeff.r.hammond@intel.com>
2020-01-03 17:51:05 -08:00
Devrajegowda, Kiran
c4047e491a Merge branch 'amd-blis-nov-mergetest' into amd-staging-rome2.1
Change-Id: I1e04592dd9494faa34555008dd1edbca8a092a44
2019-11-29 23:01:51 +05:30
Dipal M Zambare
37badee648 Updated build infra to use python detected by auto config.
Even though the configure script checks the availability of the correct
version of python, this information is not passed to the makefiles. This
results in python scripts getting invoked without an interpreter. This
normally works fine, as the script uses the path from its shebang; however,
it doesn't work if the command specified by the shebang is an alias.

This also causes confusion: even though configure has found python, we
end up with a "python not found" error during the build.

This fix passes the detected python interpreter to the makefiles, which
solves both issues mentioned above.

Change-Id: Ic04da77601ff8ad2a461e9f2f936470109cda22c
2019-11-26 14:57:47 +05:30
Meghana Vankadari
764d6f4643 changed configure script to support AOCC
Change-Id: I86d2f36f42bc6cc7e6b950f4e85087753ce5bc40
2019-11-25 15:17:04 +05:30
Field G. Van Zee
c84391314d Reverted minor temp/wspace changes from b426f9e.
Details:
- Added missing license header to bli_pwr9_asm_macros_12x6.h.
- Reverted temporary changes to various files in 'test' and 'testsuite'
  directories.
- Moved testsuite/jobscripts into testsuite/old.
- Minor whitespace/comment changes across various files.
2019-11-04 13:57:12 -06:00
Jeff Hammond
4870260f6b blacklist GCC 5 and older for POWER9 (#360)
2019-11-04 13:55:47 -06:00
Field G. Van Zee
58102aeaa2 Merge branch 'amd'
2019-10-28 17:58:31 -05:00
Field G. Van Zee
f0959a81db When manual config is blacklisted, output error.
Details:
- Fixed and adjusted the logic in configure so that a more informative
  error message is output when a user runs './configure ... <conf>' and
  <conf> is present in the configuration blacklist. Previously, this
  particular set of conditions would result in the message:

    'user-specified configuration '' is NOT registered!

  That is, the error message mis-identified the targeted configuration
  as the empty string, and (more importantly) mis-identifies the
  problem. Thanks to Tze Meng Low for reporting this issue.
- Fixed a nearby error message somewhat unrelated to the issue above.
  Specifically, the wrong string was being printed when the error
  message was identifying an auto-detected configuration that did not
  appear to be registered.
2019-10-14 15:46:28 -05:00
Field G. Van Zee
29b0e1ef4e Code review + tweaks to AMD's AOCL 2.0 PR (#349).
Details:
- NOTE: This is a merge commit of 'master' of git://github.com/amd/blis
  into 'amd-master' of flame/blis.
- Fixed a bug in the downstream value of BLIS_NUM_ARCHS, which was
  inadvertently not incremented when the Zen2 subconfiguration was
  added.
- In bli_gemm_front(), added a missing conditional constraint around the
  call to bli_gemm_small() that ensures that the computation precision
  of C matches the storage precision of C.
- In bli_syrk_front(), reorganized and relocated the notrans/trans logic
  that existed around the call to bli_syrk_small() into bli_syrk_small()
  to minimize the calling code footprint and also to bring that code
  into stylistic harmony with similar code in bli_gemm_front() and
  bli_trsm_front(). Also, replaced direct accessing of obj_t fields with
  proper accessor static functions (e.g. 'a->dim[0]' becomes
  'bli_obj_length( a )').
- Added #ifdef BLIS_ENABLE_SMALL_MATRIX guard around prototypes for
  bli_gemm_small(), bli_syrk_small(), and bli_trsm_small(). This is
  strictly speaking unnecessary, but it serves as a useful visual cue to
  those who may be reading the files.
- Removed cpp macro-protected small matrix debugging code from
  bli_trsm_front.c.
- Added a GCC_OT_9_1_0 variable to build/config.mk.in to facilitate gcc
  version check for availability of -march=znver2, and added appropriate
  support to configure script.
- Cleanups to compiler flags common to recent AMD microarchitectures in
  config/zen/amd_config.mk, including: removal of -march=znver1 et al.
  from CKVECFLAGS (since the -march flag is added within make_defs.mk);
  setting CRVECFLAGS similarly to CKVECFLAGS.
- Cleanups to config/zen/bli_cntx_init_zen.c.
- Cleanups, added comments to config/zen/make_defs.mk.
- Cleanups to config/zen2/make_defs.mk, including making use of newly-
  added GCC_OT_9_1_0 and existing GCC_OT_6_1_0 to choose the correct
  set of compiler flags based on the version of gcc being used.
- Reverted downstream changes to test/test_gemm.c.
- Various whitespace/comment changes.
2019-10-11 10:24:24 -05:00
Field G. Van Zee
c90a1a8dca Inadvertently hidden xerbla_() in blastest (#313).
Details:
- Attempted a fix to issue #313, which reports that when building only
  a shared library (ie: static library build is disabled), running the
  BLAS test drivers can fail because those drivers provide their own
  local version of xerbla_() as a clever (albeit still rather hackish)
  way of checking the error codes that result from the individual tests.
  This local xerbla_() function is never found at link-time because the
  BLAS test drivers' Makefile imports BLIS compilation flags via the
  get-user-cflags-for() function, which currently conveys the
  -fvisibility=hidden flag, which hides symbols unless they are
  explicitly annotated for export. The -fvisibility=hidden flag was
  only ever intended for use when building BLIS (not for applications),
  and so the attempted solution here is to omit the symbol export
  flag(s) from get-user-cflags-for() by storing the symbol export
  flag(s) to a new BUILD_SYMFLAGS variable instead of appending it
  to the subconfigurations' CMISCFLAGS variable (which is returned by
  every get-*-cflags-for() function). Thanks to M. Zhou for reporting
  this issue and also to Isuru Fernando for suggesting the fix.
- Renamed BUILD_FLAGS to BUILD_CPPFLAGS to harmonize with the newly
  created BUILD_SYMFLAGS.
- Fixed typo in entry for --export-shared flag in 'configure --help'
  text.
2019-08-23 14:18:08 +05:30
Field G. Van Zee
fb305d0837 Minor build system housekeeping.
Details:
- Commented out redundant setting of LIBBLIS_LINK within all driver-
  level Makefiles. This variable is already set within common.mk, and
  so the only time it should be overridden is if the user wants to link
  to a different copy of libblis.
- Very minor changes to build/gen-make-frags/gen-make-frag.sh.
- Whitespace and inconsequential quoting change to configure.
- Moved top-level 'windows' directory into a new 'attic' directory.
2019-08-23 14:18:08 +05:30
Jeff Hammond
cd8e74a69f add info about CXX in configure (#311)
2019-08-23 14:18:08 +05:30
Field G. Van Zee
4f08619855 Implemented gemm on skinny/unpacked matrices.
Details:
- Implemented a new sub-framework within BLIS to support the management
  of code and kernels that specifically target matrix problems for which
  at least one dimension is deemed to be small, which can result in long
  and skinny matrix operands that are ill-suited for the conventional
  level-3 implementations in BLIS. The new framework tackles the problem
  in two ways. First the stripped-down algorithmic loops forgo the
  packing that is famously performed in the classic code path. That is,
  the computation is performed by a new family of kernels tailored
  specifically for operating on the source matrices as-is (unpacked).
  Second, these new kernels will typically (and in the case of haswell
  and zen, do in fact) include separate assembly sub-kernels for
  handling of edge cases, which helps smooth performance when performing
  problems whose m and n dimensions are not naturally multiples of the
  register blocksizes. In a reference to the sub-framework's purpose of
  supporting skinny/unpacked level-3 operations, the "sup" operation
  suffix (e.g. gemmsup) is typically used to denote a separate namespace
  for related code and kernels. NOTE: Since the sup framework does not
  perform any packing, it targets row- and column-stored matrices A, B,
  and C. For now, if any matrix has non-unit strides in both dimensions,
  the problem is computed by the conventional implementation.
- Implemented the default sup handler as a front-end to two variants.
  bli_gemmsup_ref_var2() provides a block-panel variant (in which the
  2nd loop around the microkernel iterates over n and the 1st loop
  iterates over m), while bli_gemmsup_ref_var1() provides a panel-block
  variant (2nd loop over m and 1st loop over n). However, these variants
  are not used by default and provided for reference only. Instead, the
  default sup handler calls _var2m() and _var1n(), which are similar
  to _var2() and _var1(), respectively, except that they defer to the
  sup kernel itself to iterate over the m and n dimensions, respectively.
  In other words, these variants rely not on microkernels, but on
  so-called "millikernels" that iterate along m and k, or n and k.
  The benefit of using millikernels is a reduction of function call
  and related (local integer typecast) overhead as well as the ability
  for the kernel to know which micropanel (A or B) will change during
  the next iteration of the 1st loop, which allows it to focus its
  prefetching on that micropanel. (In _var2m()'s millikernel, the upanel
  of A changes while the same upanel of B is reused. In _var1n()'s, the
  upanel of B changes while the upanel of A is reused.)
- Added a new configure option, --[en|dis]able-sup-handling, which is
  enabled by default. However, the default thresholds at which the
  default sup handler is activated are set to zero for each of the m, n,
  and k dimensions, which effectively disables the implementation. (The
  default sup handler only accepts the problem if at least one dimension
  is smaller than or equal to its corresponding threshold. If all
  dimensions are larger than their thresholds, the problem is rejected
  by the sup front-end and control is passed back to the conventional
  implementation, which proceeds normally.)
- Added support to the cntx_t structure to track new fields related to
  the sup framework, most notably:
  - sup thresholds: the thresholds at which the sup handler is called.
  - sup handlers: the address of the function to call to implement
    the level-3 skinny/unpacked matrix implementation.
  - sup blocksizes: the register and cache blocksizes used by the sup
    implementation (which may be the same or different from those used
    by the conventional packm-based approach).
  - sup kernels: the kernels that the handler will use in implementing
    the sup functionality.
  - sup kernel prefs: the IO preference of the sup kernels, which may
    differ from the conventional gemm microkernels' IO preferences.
- Added a bool_t to the rntm_t structure that indicates whether sup
  handling should be enabled/disabled. This allows per-call control
  of whether the sup implementation is used, which is useful for test
  drivers that wish to switch between the conventional and sup codes
  without having to link to different copies of BLIS. The corresponding
  accessor functions for this new bool_t are defined in bli_rntm.h.
- Implemented several row-preferential gemmsup kernels in a new
  directory, kernels/haswell/3/sup. These kernels include two general
  implementation types--'rd' and 'rv'--for the 6x8 base shape, with
  two specialized millikernels that embed the 1st loop within the kernel
  itself.
- Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference
  gemmsup microkernels. NOTE: These microkernels, unlike the current
  crop of conventional (pack-based) microkernels, do not use constant
  loop bounds. Additionally, their inner loop iterates over the k
  dimension.
- Defined new typedef enums:
  - stor3_t: captures the effective storage combination of the level-3
    problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A
    special value of BLIS_XXX is used to denote an arbitrary combination
    which, in practice, means that at least one of the operands is
    stored according to general stride.
  - threshid_t: captures each of the three dimension thresholds.
- Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create()
  can be passed "-1, -1" as a lazy request for row storage. (Note that
  "0, 0" is still accepted as a lazy request for column storage.)
- Added support for various instructions to bli_x86_asm_macros.h,
  including imul, vhaddps/pd, and other instructions related to integer
  vectors.
- Disabled the older small matrix handling code inserted by AMD in
  bli_gemm_front.c, since the sup framework introduced in this commit
  is intended to provide a more generalized solution.
- Added test/sup directory, which contains standalone performance test
  drivers, a Makefile, a runme.sh script, and an 'octave' directory
  containing scripts compatible with GNU Octave. (They also may work
  with matlab, but if not, they are probably close to working.)
- Reinterpreted the storage combination string (sc_str) in the various
  level-3 testsuite modules (e.g. src/test_gemm.c) so that the order
  of each matrix storage char is "cab" rather than "abc".
- Comment updates in level-3 BLAS API wrappers in frame/compat.
2019-08-23 14:18:08 +05:30
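The acceptance rule described in the commit above (the default sup handler takes the problem only if at least one dimension is smaller than or equal to its threshold) can be sketched as follows. This is a minimal illustration, not the BLIS implementation: the `sup_thresh_t` struct and `sup_accepts()` function are hypothetical names, and the non-zero threshold values in the usage note are made up for demonstration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container for the three dimension thresholds.
   This is NOT a BLIS type; it exists only for this sketch. */
typedef struct
{
    size_t mt; /* threshold for the m dimension */
    size_t nt; /* threshold for the n dimension */
    size_t kt; /* threshold for the k dimension */
} sup_thresh_t;

/* Returns true if the problem would be accepted under the rule stated in
   the commit message: at least one dimension is smaller than or equal to
   its corresponding threshold. If all dimensions exceed their thresholds,
   the problem is rejected and the conventional path would be used. */
bool sup_accepts( const sup_thresh_t* t, size_t m, size_t n, size_t k )
{
    return m <= t->mt || n <= t->nt || k <= t->kt;
}
```

With all three thresholds set to zero, any problem whose dimensions are all at least 1 is rejected, which is why zero thresholds effectively disable the sup path. With illustrative thresholds of 8 for each dimension, a 4x1000x1000 problem would be accepted because m is at or below its threshold.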
Field G. Van Zee
184ba1b3d5 GNU-like handling of installation prefix et al.
Details:
- Changed the default installation prefix from $HOME/lib to /usr/local.
- Modified the way configure internally handles the prefix, libdir,
  includedir, and sharedir (and also added an --exec-prefix option).
  The defaults for these variables are set as follows:
    prefix:      /usr/local
    exec_prefix: ${prefix}
    libdir:      ${exec_prefix}/lib
    includedir:  ${prefix}/include
    sharedir:    ${prefix}/share
  The key change, aside from the addition of exec_prefix and its use to
  define the default for libdir, is that the variables are substituted
  into config.mk with quoting that delays evaluation, meaning the
  substituted values may contain unevaluated references to other
  variables (namely, ${prefix} and ${exec_prefix}). This more closely
  follows GNU conventions, including those used by GNU autoconf, and
  also allows make to override any one of the variables *after*
  configure has already been run (e.g. during 'make install').
- Updates to build/config.mk.in pursuant to above changes.
- Updates to output of 'configure --help' pursuant to above changes.
- Updated docs/BuildSystem.md to reflect the new default installation
  prefix, as well as mention EXECPREFIX and SHAREDIR.
- Changed the definitions of the UNINSTALL_OLD_* variables in the
  top-level Makefile to use $(wildcard ...) instead of 'find'. This
  was motivated by the new way of handling prefix and friends, which
  leads to the 'find' command being run on /usr/local (by default),
  which can take a while and almost never yields any benefit (since the
  user will very rarely use the uninstall-old targets).
- Removed periods from the end of descriptive output statements (i.e.,
  non-verbose output) since those statements often end with file or
  directory paths, which get confusing to read when punctuated by a
  period.
- Trivial change to 'make showconfig' output.
- Removed my name from 'configure --help'. (Many have contributed to it
  over the years.)
- In configure script, changed the default state of threading_model
  variable from 'no' to 'off' to match that of debug_type, where there
  are similarly more than two valid states. ('no' is still accepted
  if given via the --enable-debug= option, though it will be
  standardized to 'off' prior to config.mk being written out.)
- Minor variable name change in flatten-headers.py that was intended for
  32812ff.
- CREDITS file update.
2019-08-23 14:18:08 +05:30
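The effect of delayed evaluation described in the commit above can be illustrated with a small simulation. This is only a sketch of the idea, not how configure or make actually implement it: the `var_t` table, `lookup()`, and `expand()` are hypothetical helpers. Values are stored with unevaluated `${...}` references and expanded only when used, so overriding `prefix` later also changes `libdir`.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical name/value pair; values may contain ${name} references. */
typedef struct
{
    const char* name;
    const char* value;
} var_t;

/* Find the value of the variable whose name is the first len chars of name.
   Unknown variables expand to the empty string. */
static const char* lookup( const var_t* vars, size_t nvars,
                           const char* name, size_t len )
{
    for ( size_t i = 0; i < nvars; ++i )
        if ( strlen( vars[i].name ) == len &&
             strncmp( vars[i].name, name, len ) == 0 )
            return vars[i].value;
    return "";
}

/* Append the expansion of in to the NUL-terminated string in out.
   References are expanded recursively at the time of use.
   NOTE: no cycle detection; a self-referencing variable would recurse
   without bound. */
static void expand_into( const var_t* vars, size_t nvars,
                         const char* in, char* out, size_t outsz )
{
    size_t used = strlen( out );

    while ( *in != '\0' && used + 1 < outsz )
    {
        const char* close;

        if ( in[0] == '$' && in[1] == '{' &&
             ( close = strchr( in + 2, '}' ) ) != NULL )
        {
            size_t      len = ( size_t )( close - ( in + 2 ) );
            const char* val = lookup( vars, nvars, in + 2, len );

            expand_into( vars, nvars, val, out, outsz );
            used = strlen( out );
            in   = close + 1;
        }
        else
        {
            out[used++] = *in++;
            out[used]   = '\0';
        }
    }
}

/* Expand in into out (capacity outsz), truncating if necessary. */
void expand( const var_t* vars, size_t nvars,
             const char* in, char* out, size_t outsz )
{
    if ( outsz == 0 ) return;
    out[0] = '\0';
    expand_into( vars, nvars, in, out, outsz );
}
```

Given a table holding the defaults listed in the commit (`prefix` as `/usr/local`, `exec_prefix` as `${prefix}`, `libdir` as `${exec_prefix}/lib`), expanding `${libdir}` yields `/usr/local/lib`. Changing only the `prefix` entry to a different path and expanding again yields a `libdir` under the new prefix, which mirrors overriding a variable after configure has already been run.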
Isuru Fernando
231a4b7c86 Use pthreads on MinGW and Cygwin (#307) 2019-08-23 14:18:08 +05:30
Isuru Fernando
686aa860f2 Fix clang version detection (#305)
clang -dumpversion gives 4.2.1 for all clang versions as clang was
originally compatible with gcc 4.2.1

Apple clang version and clang version are two different things
and the real clang version cannot be deduced from apple clang version
programmatically. Rely on Wikipedia to map apple clang to clang version

Also fixes assembly detection with clang

clang 3.8 can't build knl as it doesn't recognize zmm0
2019-08-23 14:18:08 +05:30
Field G. Van Zee
2d1cd32c0f Renamed --enable-export-all to --export-shared=[].
Details:
- Replaced the existing --enable-export-all / --disable-export-all
  configure option with --export-shared=[public|all], with the 'public'
  instance of the latter corresponding to --disable-export-all and the
  'all' instance corresponding to --enable-export-all. Nothing else
  semantically about the option, or its default, has changed.
2019-08-23 14:18:07 +05:30
Field G. Van Zee
d8f6d17bce Support shared lib export of only public symbols.
Details:
- Introduced a new configure option, --enable-export-all, which will
  cause all shared library symbols to be exported by default, or,
  alternatively, --disable-export-all, which will cause all symbols to
  be hidden by default, with only those symbols that are annotated for
  visibility, via BLIS_EXPORT_BLIS (and BLIS_EXPORT_BLAS for BLAS
  symbols), to be exported. The default for this configure option is
  --disable-export-all. Thanks to Isuru Fernando for consulting on
  this commit.
- Removed BLIS_EXPORT_BLIS annotations from frame/1m/bli_l1m_unb_var1.h,
  which was intended for 5a5f494.
- Relocated BLIS_EXPORT-related cpp logic from bli_config.h.in to
  frame/include/bli_config_macro_defs.h.
- Provided appropriate logic within common.mk to implement variable
  symbol visibility for gcc, clang, and icc (to the extent that each of
  these compilers allows).
- Relocated the 'configure --help' text associated with the debug
  option (-d) slightly further down in the list.
2019-08-23 14:18:07 +05:30
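The annotation-based export scheme described in the commit above can be sketched as follows. This is an assumption-laden illustration rather than BLIS's actual macro logic: `DEMO_EXPORT`, `demo_public_add()`, and `demo_internal_scale()` are hypothetical names standing in for `BLIS_EXPORT_BLIS` and real library functions, and the sketch assumes a gcc- or clang-style compiler invoked with `-fvisibility=hidden` when building the shared library.

```c
/* Hypothetical export macro for this sketch. With gcc or clang, the
   visibility("default") attribute marks a symbol as exported even when
   the translation unit is compiled with -fvisibility=hidden. On other
   compilers the macro expands to nothing. */
#if defined( __GNUC__ ) || defined( __clang__ )
#  define DEMO_EXPORT __attribute__(( visibility( "default" ) ))
#else
#  define DEMO_EXPORT
#endif

/* Annotated: remains visible to users of the shared library. */
DEMO_EXPORT int demo_public_add( int a, int b );

/* Not annotated: hidden by default under -fvisibility=hidden, but still
   callable from other code inside the same library. */
int demo_internal_scale( int x );

int demo_internal_scale( int x )
{
    return 2 * x;
}

DEMO_EXPORT int demo_public_add( int a, int b )
{
    /* Public entry points may freely call hidden helpers. */
    return demo_internal_scale( a ) + b;
}
```

Compiled without `-fvisibility=hidden`, both functions are exported, which corresponds to the "export all" behavior. Compiled with it, only the annotated function would appear among the shared library's exported symbols.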