1800 Commits

Author SHA1 Message Date
Field G. Van Zee
8f399c8940 Tweaked/added notes to docs/Multithreading.md.
Details:
- Added language to docs/Multithreading.md cautioning the reader about
  the nuances of setting multithreading parameters via the manual and
  automatic ways simultaneously, and also about how these parameters
  behave when multithreading is disabled at configure-time. These
  changes are an attempt to address the issues that arose in issue #362.
  Thanks to Jérémie du Boisberranger for his feedback on this topic.
- CREDITS file update.
2019-11-12 15:32:57 -06:00
Field G. Van Zee
bdc7ee3394 Various fixes to support packing duplication in B.
Details:
- Added cpp macros to trmm and trmm3 front-ends to optionally force
  those operations to be cast so the structured matrix is on the left.
  symm and hemm already had such macros, but these too were renamed so
  that the macros were individual to the operation. We now have four
  such macros:
    #define BLIS_DISABLE_HEMM_RIGHT
    #define BLIS_DISABLE_SYMM_RIGHT
    #define BLIS_DISABLE_TRMM_RIGHT
    #define BLIS_DISABLE_TRMM3_RIGHT
  Also, updated the comments in the symm and hemm front-ends related to
  the first two macro guards, and added corresponding comments to the
  trmm and trmm3 front-ends for the latter two guards. (They all
  functionally do the same thing, just for their specific operations.)
  Thanks to Jeff Hammond for reporting the bugs that led me to this
  change (via #359).
- Updated config/old/haswellbb subconfiguration (used to debug issues
  related to duplicating B during packing) to register: a packing
  kernel for single-precision real; gemmbb ukernels for s, c, and z;
  trsmbb ukernels for s, c, and z; gemmtrsmbb virtual ukernels for s, c,
  and z; and to use non-default cache and register blocksizes for s, c,
  and z datatypes. Also declared prototypes for all of the gemmbb,
  trsmbb, and gemmtrsmbb ukernel functions within the
  bli_cntx_init_haswellbb() function. This should, once applied to the
  power9 configuration, fix the remaining issues in #359.
- Defined bli_spackm_6xk_bb4_ref(), which packs single reals with a
  duplication factor of 4. This function is defined in the same file as
  bli_dpackm_6xk_bb2_ref() (bli_packm_cxk_bb_ref.c).
2019-11-11 15:47:17 -06:00
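Note on bdc7ee3394: the idea of packing B "with a duplication factor" can be sketched as below. This is only an illustration under stated assumptions, not the actual bli_spackm_6xk_bb4_ref() kernel: the function name pack_b_dup() and its parameters are hypothetical, the source block is assumed to be row-major, and scaling/conjugation are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: pack a k x nr row-major block of B (leading
   dimension ldb) into the buffer p, writing every element dup times
   in a row. The caller must provide room for k * nr * dup floats. */
void pack_b_dup(const float *b, size_t ldb, size_t k, size_t nr,
                size_t dup, float *p)
{
    assert(dup >= 1 && ldb >= nr);

    for (size_t i = 0; i < k; ++i)            /* rows of the micropanel  */
        for (size_t j = 0; j < nr; ++j)       /* logical columns         */
            for (size_t d = 0; d < dup; ++d)  /* duplicate each element  */
                *p++ = b[i * ldb + j];
}
```

With dup = 4, a single-precision element occupies four consecutive slots in the packed buffer, so a microkernel could load it with an ordinary vector load rather than a broadcast-style load.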
Field G. Van Zee
0eb79ca850 Avoid unused variable warning in lread.c (#356).
Details:
- Replaced the line

    f = f;

  with

    ( void )f;

  for the unused variable 'f' in blastest/f2c/lread.c. (Hopefully)
  addresses issue #356, but since we don't use xlc who knows. Thanks
  to Jeff Hammond for reporting this.
2019-11-08 14:48:48 -06:00
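Note on 0eb79ca850: a minimal sketch of the cast-to-void idiom the commit describes. The function count_digits() and its parameter are hypothetical and unrelated to lread.c; they only give the idiom somewhere to live.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: 'f' stays in the signature for interface
   compatibility, but the body does not need it. */
int count_digits(const char *s, int f)
{
    assert(s != NULL);

    /* Casting to void marks 'f' as intentionally unused without
       performing a self-assignment such as 'f = f;'. */
    (void)f;

    int n = 0;
    for (; *s != '\0'; ++s)
        if (*s >= '0' && *s <= '9')
            ++n;
    return n;
}
```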
Jérôme Duval
f377bb4485 Add Haiku to the known OS list (#361) 2019-11-07 16:39:29 -06:00
Field G. Van Zee
e29b1f9706 Fixed failing testsuite gemmtrsm_ukr for power9.
Details:
- Added code that fixes false failures in the gemmtrsm_ukr module of the
  testsuite. The tests were failing because the computation (bli_gemv())
  that performs the numerical check was not able to properly traverse
  the matrix operands bx1 and b11 that are views into the micropanel of
  B, which has duplicated/broadcast elements under the power9 subconfig.
  (For example, a micropanel of B with duplication factor of 2 needs to
  use a column stride of 2; previously, the column stride was being
  interpreted as 1.)
- Defined separate bli_obj_set_row_stride() and bli_obj_set_col_stride()
  static functions in bli_obj_macro_defs.h. (Previously, only the
  function bli_obj_set_strides() was defined. Amazing to think that we
  got this far without these former functions.)
- Updated/expounded upon comments.
2019-11-05 17:15:19 -06:00
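Note on e29b1f9706: the stride issue can be illustrated with a small sketch. It assumes a packed k x nr micropanel of B stored row by row, with each element written dup times consecutively; bpanel_get() is a hypothetical helper, not a BLIS function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: read logical element (i, j) of a packed
   micropanel of B in which every element is stored dup times. */
double bpanel_get(const double *bp, size_t nr, size_t dup,
                  size_t i, size_t j)
{
    assert(dup >= 1 && j < nr);

    size_t rs = nr * dup; /* row stride: one full duplicated row    */
    size_t cs = dup;      /* column stride: skip over the duplicates */

    return bp[i * rs + j * cs];
}
```

With dup = 2, treating the column stride as 1 would return a duplicate of the neighboring element instead of the next logical column, which is the kind of mis-traversal the commit fixes in the test code.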
Field G. Van Zee
49177a6b9a Fixed latent testsuite ukr module bugs for power9.
Details:
- Fixed a latent bug in the testsuite ukernel modules (gemm, trsm, and
  gemmtrsm) that only manifested once we began running with parameters
  that mimic those of power9. The problem was rooted in the way those
  modules were creating objects (and thus allocating memory) for the
  micropanel operands to the microkernel being tested. Since power9
  duplicates/broadcasts elements of B in memory, we needed an easy way
  of asking for more than one storage element per logical element in
  the matrix. I incorrectly expressed this as:

    bli_obj_create( datatype, k, n, ldbp, 1, &bp );

  The problem here is that bli_obj_create() is exceedingly efficient
  at calculating the size it passes to malloc() and doesn't allocate a
  full leading dimension's worth of elements for the last column (or
  row, in this example). This would normally not bother anyone since
  you're not supposed to access that memory anyway. But here, my
  attempted "hack" for getting extra elements was insufficient, and
  needed to be changed to:

    bli_obj_create( datatype, k, ldbp, ldbp, 1, &bp );

  That is, the extra elements needed to be baked into the dimensions of
  the matrix object in order to have the intended effect on the number
  of elements actually allocated. Thanks to Jeff Hammond for reporting
  this bug.
- Fixed a typically harmless memory leak in the aforementioned test
  modules (the objects for the packed micropanels were not being freed).
- Updated/expanded a common comment across all three ukr test modules.
2019-11-04 18:09:37 -06:00
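Note on 49177a6b9a: the allocation shortfall can be modeled with a little arithmetic. The sketch below is an assumption about what a "tight" size calculation looks like (the index of the last addressable element plus one); it is not the actual computation inside bli_obj_create(), and min_elems() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: the smallest number of elements that covers an
   m x n matrix with row stride rs and column stride cs. */
size_t min_elems(size_t m, size_t n, size_t rs, size_t cs)
{
    assert(rs >= 1 && cs >= 1);

    if (m == 0 || n == 0)
        return 0;

    /* Offset of the last element, plus one. */
    return (m - 1) * rs + (n - 1) * cs + 1;
}
```

For example, with k = 4, n = 6, and ldbp = 12 (a duplication factor of 2), min_elems(4, 6, 12, 1) gives 42 elements, while min_elems(4, 12, 12, 1) gives the full 4 * 12 = 48. Under this model, only the second form reserves room for the duplicated elements of the last row.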
Field G. Van Zee
c84391314d Reverted minor temp/wspace changes from b426f9e.
Details:
- Added missing license header to bli_pwr9_asm_macros_12x6.h.
- Reverted temporary changes to various files in 'test' and 'testsuite'
  directories.
- Moved testsuite/jobscripts into testsuite/old.
- Minor whitespace/comment changes across various files.
2019-11-04 13:57:12 -06:00
Jeff Hammond
4870260f6b blacklist GCC 5 and older for POWER9 (#360) 2019-11-04 13:55:47 -06:00
Nicholai Tukanov
b426f9e04e POWER9 DGEMM (#355)
Implemented and registered power9 dgemm ukernel.

Details:
- Implemented 12x6 dgemm microkernel for power9. This microkernel 
  assumes that elements of B have been duplicated/broadcast during the
  packing step. The microkernel uses a column orientation for its 
  microtile vector registers and thus implements column storage and 
  general stride IO cases. (A row storage IO case via in-register
  transposition may be added at a future date.) It should be noted that 
  we recommend using this microkernel with gcc and *not* xlc, as issues 
  with the latter cropped up during development, including but not 
  limited to slightly incompatible vector register mnemonics in the GNU 
  extended inline assembly clobber list.
2019-11-01 17:57:03 -05:00
Field G. Van Zee
58102aeaa2 Merge branch 'amd' 2019-10-28 17:58:31 -05:00
Field G. Van Zee
52059506b2 Added "How to Download BLIS" section to README.md.
Details:
- Added a new section to the README.md, just prior to the "Getting
  Started" section, titled "How to Download BLIS". This section details
  the user's options for obtaining BLIS and lays out four common ways
  of downloading the library. Thanks to Jeff Diamond for his feedback
  on this topic.
2019-10-23 15:26:42 -05:00
Field G. Van Zee
e6f0a96cc5 Updated README.md to ack Facebook as funder. 2019-10-14 17:05:39 -05:00
Field G. Van Zee
b9bc222bfc Call bli_syrk_small() before error checking.
Details:
- In bli_syrk_front(), moved the conditional call to bli_syrk_check()
  (if error checking is enabled) and the conditional scaling of C by
  beta (if alpha is zero) so that they occur after, instead of before,
  the call to bli_syrk_small(). This sequencing now matches that of
  bli_gemm_small() in bli_gemm_front() and bli_trsm_small() in
  bli_trsm_front().
2019-10-14 16:38:15 -05:00
Field G. Van Zee
f0959a81db When manual config is blacklisted, output error.
Details:
- Fixed and adjusted the logic in configure so that a more informative
  error message is output when a user runs './configure ... <conf>' and
  <conf> is present in the configuration blacklist. Previously, this
  particular set of conditions would result in the message:

    'user-specified configuration '' is NOT registered!

  That is, the error message mis-identified the targeted configuration
  as the empty string, and (more importantly) mis-identified the
  problem. Thanks to Tze Meng Low for reporting this issue.
- Fixed a nearby error message somewhat unrelated to the issue above.
  Specifically, the wrong string was being printed when the error
  message was identifying an auto-detected configuration that did not
  appear to be registered.
2019-10-14 15:46:28 -05:00
Field G. Van Zee
6218ac95a5 Merge branch 'master' into amd 2019-10-11 11:53:51 -05:00
Field G. Van Zee
0016d541e6 Changed -march=znver2 to =znver1 for clang on zen2.
Details:
- In config/zen2/make_defs.mk, changed the -march= flag so that
  -march=znver1 is used instead of -march=znver2 when CC_VENDOR is
  clang. (The gcc branch attempts to differentiate between various
  versions, but the equivalent version cutoffs for clang are not
  yet known by us, so we have to use a single flag for all versions
  of clang. Hopefully -march=znver1 is new enough. If not, we'll
  fall back to -march=bdver4 -mno-fma4 -mno-tbm -mno-xop -mno-lwp.)
  This issue was discovered thanks to AppVeyor.
2019-10-11 11:09:44 -05:00
Field G. Van Zee
e94a0530e5 Corrected zen NC that was non-multiple of NR.
Details:
- Updated an incorrectly set cache blocksize NC for single real within
  config/zen/bli_cntx_init_zen.c that was not a multiple of the
  corresponding value of NR. This issue, which was caught by Travis CI,
  was introduced in 29b0e1e.
2019-10-11 10:48:27 -05:00
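Note on e94a0530e5: the constraint that was violated (NC must be a multiple of NR) is easy to check mechanically. The helpers below are hypothetical and the numbers in the usage note are made up for illustration; they are not the zen blocksizes.

```c
#include <assert.h>

/* Hypothetical check: is the cache blocksize nc a multiple of the
   register blocksize nr? */
int blksz_is_multiple(long nc, long nr)
{
    assert(nr > 0 && nc >= 0);
    return nc % nr == 0;
}

/* Hypothetical fix-up: round nc down to the nearest multiple of nr. */
long blksz_round_down(long nc, long nr)
{
    assert(nr > 0 && nc >= 0);
    return nc - (nc % nr);
}
```

For instance, with nr = 16, a value of 4086 fails the check and rounds down to 4080.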
Field G. Van Zee
a2ffac7520 Merge branch 'amd-master' into amd 2019-10-11 10:31:18 -05:00
Field G. Van Zee
29b0e1ef4e Code review + tweaks to AMD's AOCL 2.0 PR (#349).
Details:
- NOTE: This is a merge commit of 'master' of git://github.com/amd/blis
  into 'amd-master' of flame/blis.
- Fixed a bug in the downstream value of BLIS_NUM_ARCHS, which was
  inadvertently not incremented when the Zen2 subconfiguration was
  added.
- In bli_gemm_front(), added a missing conditional constraint around the
  call to bli_gemm_small() that ensures that the computation precision
  of C matches the storage precision of C.
- In bli_syrk_front(), reorganized and relocated the notrans/trans logic
  that existed around the call to bli_syrk_small() into bli_syrk_small()
  to minimize the calling code footprint and also to bring that code
  into stylistic harmony with similar code in bli_gemm_front() and
  bli_trsm_front(). Also, replaced direct accessing of obj_t fields with
  proper accessor static functions (e.g. 'a->dim[0]' becomes
  'bli_obj_length( a )').
- Added #ifdef BLIS_ENABLE_SMALL_MATRIX guard around prototypes for
  bli_gemm_small(), bli_syrk_small(), and bli_trsm_small(). This is
  strictly speaking unnecessary, but it serves as a useful visual cue to
  those who may be reading the files.
- Removed cpp macro-protected small matrix debugging code from
  bli_trsm_front.c.
- Added a GCC_OT_9_1_0 variable to build/config.mk.in to facilitate gcc
  version check for availability of -march=znver2, and added appropriate
  support to configure script.
- Cleanups to compiler flags common to recent AMD microarchitectures in
  config/zen/amd_config.mk, including: removal of -march=znver1 et al.
  from CKVECFLAGS (since the -march flag is added within make_defs.mk);
  setting CRVECFLAGS similarly to CKVECFLAGS.
- Cleanups to config/zen/bli_cntx_init_zen.c.
- Cleanups, added comments to config/zen/make_defs.mk.
- Cleanups to config/zen2/make_defs.mk, including making use of newly-
  added GCC_OT_9_1_0 and existing GCC_OT_6_1_0 to choose the correct
  set of compiler flags based on the version of gcc being used.
- Reverted downstream changes to test/test_gemm.c.
- Various whitespace/comment changes.
2019-10-11 10:24:24 -05:00
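Note on 29b0e1ef4e: the move from direct field access to accessor functions can be sketched with a toy type. Everything below (toy_obj_t and the toy_obj_*() functions) is a hypothetical stand-in; the real obj_t has more fields and its accessors live in the BLIS headers.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an object descriptor; NOT the real BLIS obj_t. */
typedef struct {
    long dim[2]; /* dim[0] = rows (length), dim[1] = columns (width) */
    long rs;     /* row stride    */
    long cs;     /* column stride */
} toy_obj_t;

/* Read accessors: callers write toy_obj_length(a) instead of a->dim[0]. */
static inline long toy_obj_length(const toy_obj_t *a) { return a->dim[0]; }
static inline long toy_obj_width(const toy_obj_t *a)  { return a->dim[1]; }

/* Separate setters for each stride, so one can change without the other. */
static inline void toy_obj_set_row_stride(toy_obj_t *a, long rs)
{
    assert(a != NULL);
    a->rs = rs;
}

static inline void toy_obj_set_col_stride(toy_obj_t *a, long cs)
{
    assert(a != NULL);
    a->cs = cs;
}
```

Keeping the field layout behind accessors means calling code does not need to change if the layout of the descriptor does.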
Field G. Van Zee
a617301f93 Updates to docs/CodingConventions.md. 2019-10-08 17:14:05 -05:00
Field G. Van Zee
171f100691 Merge remote-tracking branch 'loveshack/emacs' 2019-10-04 11:18:23 -05:00
Field G. Van Zee
702486b125 Removed stray FAQ section introduced in 1907000. 2019-10-02 16:35:41 -05:00
Field G. Van Zee
1907000ad6 Updated to FAQ (AMD-related questions).
Details:
- Added a couple of potential frequently-asked questions/answers related
  to AMD's fork of BLIS.
- Updated existing answers to other questions.
2019-10-02 16:31:54 -05:00
Field G. Van Zee
834f30a0da Mention mixeddt paper in docs/MixedDatatypes.md. 2019-10-02 12:45:56 -05:00
Dave Love
05d58edfe0 Note .dir-locals.el in docs 2019-10-02 10:45:50 +01:00
Dave Love
531110c339 Modify Emacs config
Confine it to cc-mode and add comment-start/end.
2019-10-02 10:16:22 +01:00
Dave Love
4bab365cab Add .dir-locals.el for Emacs (#348)
A minimal version that could probably do with extending, but at least
gets the indentation roughly right.
2019-10-01 14:22:47 -05:00
Dave Love
4ec8dad66b Add .dir-locals.el for Emacs
A minimal version that could probably do with extending, but at least
gets the indentation roughly right.
2019-09-26 16:27:53 +01:00
Field G. Van Zee
bc16ec7d1e Set execute bits of shared library at install-time.
Details:
- Modified the 0644 octal code used during installation of shared
  libraries to 0755 (for Linux/OSX only). Thanks to Adam J. Stewart
  for reporting this issue via #343.
- CREDITS file update.
2019-09-23 15:37:33 -05:00
Field G. Van Zee
c60db26aee Fixed bad loop counter in bli_[cz]scal2bbs_mxn().
Details:
- Fixed a typo in the loop counter for the 'd' (duplication) dimension
  in the complex macros of frame/include/level0/bb/bli_scal2bbs_mxn.h.
  They shouldn't be used by anyone yet, but thankfully clang via
  AppVeyor spit out warnings that alerted me to the issue.
2019-09-17 18:04:17 -05:00
Field G. Van Zee
c766c81d62 Added missing schema arg to knl packm kernels.
Details:
- Added the pack_t schema argument to the knl packm kernel functions.
  This change was intended for inclusion in 31c8657. (Thank you SDE +
  Travis CI.)
2019-09-17 18:00:29 -05:00
Field G. Van Zee
31c8657f1d Added support for pre-broadcast when packing B.
Details:
- Added support for being able to duplicate (broadcast) elements in
  memory when packing matrix B (i.e., the right-hand operand) in level-3
  operations. This turns out to be advantageous for some architectures
  that can afford the cost of the extra bandwidth and somehow benefit
  from the pre-broadcast elements (and thus being able to avoid using
  broadcast-style load instructions on micro-rows of B in the gemm
  microkernel).
- Support optionally disabling right-side hemm and symm. If this occurs,
  hemm_r is implemented in terms of hemm_l (and symm_r in terms of
  symm_l). This is needed when broadcasting during packing because the
  alternative--supporting the broadcast of B while also allowing matrix
  B to be Hermitian/symmetric--would be an absolute mess.
- Support alignment factors for packed blocks of A, B, and C separately
  (as well as for general-purpose buffers). In addition, we support
  byte offsets from those alignment values (which is different from
  aligning by align+offset bytes to begin with). The default alignment
  values are BLIS_PAGE_SIZE in all four cases, with the offset values
  defaulting to zero.
- Pass pack_t schema into bli_?packm_cxk() so that it can be then passed
  into the packm kernel, where it will be needed by packm kernels that
  perform broadcasts of B, since the idea is that we *only* want to
  broadcast when packing micropanels of B and not A.
- Added definition for variadic bli_cntx_set_l3_vir_ukrs(), which can be
  used to set custom virtual level-3 microkernels in the cntx_t, which
  would typically be done in the bli_cntx_init_*() function defined in
  the subconfiguration of interest.
- Added a "broadcast B" kernel function for use with NP/NR = 12/6,
  defined in ref_kernels/1m/bli_packm_cxk_bb_ref.c.
- Added gemm, gemmtrsm, and trsm "broadcast B" reference kernels
  defined in ref_kernels/3/bb. (These kernels have been tested with
  double real with NP/NR = 12/6.)
- Added #ifndef ... #endif guards around several macro constants defined
  in frame/include/bli_kernel_macro_defs.h.
- Defined a few "broadcast B" static functions in
  frame/include/level0/bb for use by "broadcast B"-style packm reference
  kernels. For now, only the real domain kernels are tested and fully
  defined.
- Output the alignment and offset values for packed blocks of A and B
  in the testsuite's "BLIS configuration info" section.
- Comment updates to various files.
- Bumped so_version to 3.0.0.
2019-09-17 17:42:10 -05:00
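Note on 31c8657f1d: the distinction between "aligned to align bytes, then offset by offset bytes" and "aligned to align+offset bytes" can be made concrete. The sketch below is an assumption about how such an address could be computed; align_with_offset() is a hypothetical helper, and it requires a power-of-two alignment and an offset smaller than the alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: return the first address >= addr that has the
   form (multiple of align) + offset. */
uintptr_t align_with_offset(uintptr_t addr, size_t align, size_t offset)
{
    assert(align != 0 && (align & (align - 1)) == 0); /* power of two */
    assert(offset < align);

    /* Round (addr - offset) up to a multiple of align, then re-apply
       the offset. Unsigned wraparound keeps this correct even when
       addr < offset. */
    uintptr_t base = (addr - offset + align - 1) & ~(uintptr_t)(align - 1);
    return base + offset;
}
```

With align = 64 and offset = 16, the address 100 maps to 144 (that is, 128 + 16). Aligning to 64 + 16 = 80 bytes instead would yield 160, which sits at a different position relative to the 64-byte boundaries.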
Field G. Van Zee
fd9bf497cd CREDITS file update. 2019-09-17 15:45:24 -05:00
ShmuelLevine
6c8f2d1486 Fix description for function bli_*pxby2v (#340)
Fix typo in BLISTypedAPI.md for bli_?axpy2v() description.
2019-09-17 15:43:46 -05:00
Field G. Van Zee
b5679c1520 Inserted Multithreading links into BuildSystem.md.
Details:
- Inserted brief disclaimers about default disabled multithreading
  and default single-threadedness to BuildSystem.md along with links to
  the Multithreading.md document. Thanks to Jeff Diamond for suggesting
  these additions.
- Trivial reword of sentence regarding automatically-detected
  architectures.
2019-09-17 14:00:37 -05:00
Isuru Fernando
f4f5170f84 Update README.md (#338) 2019-09-11 07:34:48 -05:00
Field G. Van Zee
1cfe8e2562 Reimplemented bli_cpuid_query() for ARM.
Details:
- Rewrote bli_cpuid_query() for ARM architectures to use stdio-based
  functions such as fopen() and fgets() instead of popen(). The new code
  does more or less the same thing as before--searches /proc/cpuinfo for
  various strings, which are then parsed in order to determine the
  model, part number, and features. Thanks to Dave Love for suggesting
  this change in issue #335.
2019-09-05 16:08:30 -05:00
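Note on 1cfe8e2562: the stdio-based approach (fopen() and fgets() rather than popen()) can be sketched as follows. This is not the bli_cpuid_query() implementation; cpuinfo_find() is a hypothetical helper that assumes "key : value" lines and takes the file path as a parameter so that it can be tried on any text file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: scan the file at path line by line for the first
   line that begins with key, and copy the text after the ':' into out.
   Returns 1 if a matching line was found, 0 otherwise. */
int cpuinfo_find(const char *path, const char *key,
                 char *out, size_t out_len)
{
    assert(path != NULL && key != NULL && out != NULL && out_len > 0);

    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return 0;

    char   line[512];
    size_t key_len = strlen(key);
    int    found   = 0;

    while (!found && fgets(line, sizeof line, fp) != NULL) {
        if (strncmp(line, key, key_len) != 0)
            continue;

        char *val = strchr(line, ':');
        if (val == NULL)
            continue;

        ++val;                               /* skip the ':'           */
        while (*val == ' ' || *val == '\t')  /* skip leading blanks    */
            ++val;
        val[strcspn(val, "\r\n")] = '\0';    /* drop trailing newline  */

        snprintf(out, out_len, "%s", val);
        found = 1;
    }

    fclose(fp);
    return found;
}
```

A caller on an ARM Linux system might pass "/proc/cpuinfo" and a key such as "CPU part", then compare the returned string against known part numbers.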
Devin Matthews
7c78191457 Always use sqsumv to compute normfv. (#334)
* Always use sqsumv to compute normfv on MacOS.

* Unconditionally disable the "dot trick" in normfv.

* Added explanatory comment to normfv definition.

Details:
- Added a comment above the unconditional disabling of the dotv-based
  implementation to normfv. Thanks to Roman Yurchak, Devin Matthews,
  and Isuru Fernando for helping with this improvement.
- CREDITS file update.
2019-08-30 16:52:09 -05:00
Field G. Van Zee
80e6c10b72 Added reproduction section to Performance docs.
Details:
- Added section titled "Reproduction" to both Performance.md and
  PerformanceSmall.md that briefly nudges the motivated reader in the
  right direction if he/she wishes to run the same performance
  benchmarks used to produce the graphs shown in those documents.
  Thanks to Dave Love for making this suggestion.
2019-08-29 12:12:08 -05:00
Field G. Van Zee
14cb426414 Updated OpenBLAS, Eigen sup results.
Details:
- Updated the results shown in docs/PerformanceSmall.md for OpenBLAS and
  Eigen.
2019-08-28 17:04:33 -05:00
Field G. Van Zee
b02e0aae8c Updated test drivers to iterate backwards.
Details:
- Updated test driver source in test, test/3, test/1m4m, and
  test/mixeddt to iterate through the problem space backwards. This
  can help avoid certain situations where the CPU frequency does not
  immediately throttle up to its maximum. Thanks to Robert van de
  Geijn for recommending this fix (originally made to test/sup drivers
  in 57e422a).
- Applied off-by-one matlab output bugfix from b6017e5 to test drivers
  in test, test/3, test/1m4m, and test/mixeddt directories.
2019-08-27 14:37:46 -05:00
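Note on b02e0aae8c: iterating through the problem space backwards amounts to starting the loop at the largest size and stepping down. The sketch below only records the order in which sizes would be visited; sizes_descending() and its parameter names are hypothetical and not taken from the test drivers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: fill sizes with p_max, p_max - p_inc, ... down to
   (but not below) p_min, writing at most cap entries. Returns the number
   of entries written. */
size_t sizes_descending(long p_min, long p_max, long p_inc,
                        long *sizes, size_t cap)
{
    assert(p_inc > 0 && sizes != NULL);

    size_t n = 0;
    for (long p = p_max; p >= p_min && n < cap; p -= p_inc)
        sizes[n++] = p;

    return n;
}
```

Running the largest problems first gives the CPU the longest-running work at the start, which is the situation the commit describes as helping when the clock frequency does not ramp up immediately.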
Field G. Van Zee
b6017e53f4 Bugfix of output text + tweaks to test/sup driver.
Details:
- Fixed an off-by-one bug in the output of matlab row indices in
  test/sup/test_gemm.c that only manifested when the problem size
  increment was equal to 1.
- Disabled the building of rrc, rcr, rcc, crr, crc, and ccr storage
  combinations for blissup drivers in test/sup. This helps make the
  building of drivers complete sooner.
- Trivial changes to test/sup/runme.sh.
2019-08-27 14:18:14 -05:00
Devin Matthews
138d403b6b Use -funsafe-math-optimizations and -ffp-contract=fast for all reference kernels when using gcc or clang. (#331) 2019-08-26 18:11:27 -05:00
Field G. Van Zee
d5a05a15a7 Cropped whitespace from new sup graphs.
Details:
- Previously forgot to crop whitespace from the new .png graphs
  added/updated in docs/graphs/sup.
2019-08-26 16:54:31 -05:00
Field G. Van Zee
a6c80171a3 Fixed contents links in docs/PerformanceSmall.md.
Details:
- Corrected links in contents section of docs/PerformanceSmall.md,
  which were erroneously directing readers to the corresponding
  sections of docs/Performance.md.
2019-08-26 16:51:31 -05:00
Field G. Van Zee
40781774df Updated sup performance graphs with libxsmm.
Details:
- Added libxsmm to column-stored sup graphs presented in
  docs/PerformanceSmall.md.
- Updated sup results for BLASFEO.
- Added sup results for Lonestar5 (Haswell).
- Addresses issue #326.
2019-08-26 16:47:37 -05:00
figual
bfddf67132 Fixed context registration for Cortex A53 (#329). 2019-08-26 12:01:33 +02:00
Field G. Van Zee
4a0a6e89c5 Changed test/sup alpha to 1; test libxsmm+netlib.
Details:
- Changed the value of alpha to 1.0 in test/sup/test_gemm.c. This is
  needed because libxsmm currently only optimizes gemm operations where
  alpha is unit (and beta is unit or zero).
- Adjusted the test/sup/Makefile to test libxsmm with netlib BLAS as its
  fallback library. This is the library that will be called when the
  problem dimensions are deemed too large, or any other criteria for
  optimization are not met. (This was done not because it is realistic,
  but rather so that it would be very clear when libxsmm ceased handling
  gemm calls internally when the data are graphed.)
2019-08-24 15:25:16 -05:00
Field G. Van Zee
7aa52b5783 Use libxsmm API in test/sup; add missing -ldl.
Details:
- Switch the driver source in test/sup so that libxsmm_?gemm() is called
  instead of ?gemm_() when compiling for / linking against libxsmm.
  libxsmm's documentation isn't clear on whether it is even *trying* to
  provide BLAS API compatibility, and I got tired of trying to figure it
  out.
- Added missing -ldl in LDFLAGS when linking against libxsmm.
2019-08-23 16:12:50 -05:00
Field G. Van Zee
57e422aa16 Added libxsmm support to test/sup drivers.
Details:
- Modified test/sup/Makefile to build drivers that test the performance
  of skinny/small problems via libxsmm.
- Modified test/sup/runme.sh to run aforementioned drivers.
- Modified test/sup/test_gemm.c so that problem sizes are tested in
  reverse order (from largest to smallest). This can help avoid certain
  situations where the CPU frequency does not immediately throttle up
  to its maximum. Thanks to Robert van de Geijn for recommending this
  fix.
2019-08-23 14:17:52 -05:00