2013 Commits

Author SHA1 Message Date
Devin Matthews
551c6b4ee8 Merge pull request #519 from awild82/oot_build_bugfix
Fix installation from out-of-tree builds
2021-07-06 19:32:53 -05:00
Andrew Wildman
f648df4e55 Add symlink to blis.pc.in for out-of-tree builds 2021-07-06 16:35:12 -07:00
Devin Matthews
78eac6a0ab Revert "Always run make check."
This reverts commit a201a53440.
2021-07-06 11:05:43 -05:00
Devin Matthews
a201a53440 Always run make check.
I'm concerned that problems may lurk for `x86_64` builds on Windows which may be uncovered by a fuller `make check`.
2021-07-05 21:39:18 -05:00
Devin Matthews
5ef7f684dc Merge pull request #515 from chengguosun/bug-fix
Fixed configure script bug.
2021-07-05 21:35:07 -05:00
sunchengguo
ad6231cca3 Fixed configure script bug.
Details:
- Fixed a kernel list string substitution error by adding the function substitute_words to the configure script.
  Previously, if the string contained both zen and zen2 and zen needed to be replaced with another string,
  zen2 was also incorrectly replaced.
2021-07-06 07:30:00 -04:00
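The failure mode described in ad6231cca3 is what a plain substring replacement does: `zen` also matches the first three characters of `zen2`. The configure script is a shell script, so the following C code is only a minimal sketch of the whole-word idea, not the `substitute_words` function itself. The name `replace_word` and its signature are assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not the configure script's substitute_words):
   copy the space-separated list `in` to `out`, replacing every token that
   is exactly `word` with `repl`. Tokens that merely start with `word`
   (such as "zen2" for "zen") are copied unchanged.
   Returns 0 on success, -1 if `out` is too small. */
int replace_word(const char *in, const char *word, const char *repl,
                 char *out, size_t out_size)
{
    size_t used = 0;
    size_t word_len = strlen(word);

    while (*in != '\0') {
        /* Length of the next token; 0 means we are sitting on a space. */
        size_t tok_len = strcspn(in, " ");
        const char *src = in;
        size_t src_len = tok_len;

        if (tok_len == 0) {
            src_len = 1;                     /* copy one separator as-is */
        } else if (tok_len == word_len && strncmp(in, word, word_len) == 0) {
            src = repl;                      /* exact match: substitute */
            src_len = strlen(repl);
        }

        /* Always keep room for the terminating NUL. */
        if (used + src_len + 1 > out_size) return -1;
        memcpy(out + used, src, src_len);
        used += src_len;
        in += (tok_len == 0) ? 1 : tok_len;
    }

    if (out_size == 0) return -1;
    out[used] = '\0';
    return 0;
}
```

Comparing whole tokens, rather than searching for the word anywhere in the string, is what keeps `zen2` intact when `zen` is replaced.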
Field G. Van Zee
aaa10c87e1 Skip clearing temp microtile in gemmlike sandbox.
Details:
- Removed code from gemmlike sandbox files bls_gemm_bp_var1.c and
  bls_gemm_bp_var2.c that initializes the elements of the temporary
  microtile to zero. This code, introduced recently in 7f7d726, did
  not actually fix any bug (despite that commit's log entry). The
  microtile does not need to be initialized because it is completely
  overwritten by a "beta = 0" invocation of gemm prior to it being
  read. Any NaNs or Infs present at the outset would have no impact
  on the output matrix C. Thanks to Devin Matthews for reminding me
  of this.
2021-06-21 17:53:52 -05:00
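The reasoning in aaa10c87e1 rests on the convention that a "beta = 0" gemm overwrites its output instead of scaling it, so the prior contents of the output are never read. The sketch below illustrates this with a naive reference routine. `tiny_gemm` is a hypothetical name and is not part of the gemmlike sandbox.

```c
#include <stddef.h>

/* Hypothetical reference routine: C := beta*C + A*B, all row-major,
   with A m x k, B k x n, and C m x n. When beta == 0, C is overwritten
   and its previous contents are never read. */
void tiny_gemm(size_t m, size_t n, size_t k,
               const double *a, const double *b, double beta, double *c)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double ab = 0.0;
            for (size_t p = 0; p < k; ++p)
                ab += a[i * k + p] * b[p * n + j];

            if (beta == 0.0)
                c[i * n + j] = ab;                        /* overwrite */
            else
                c[i * n + j] = beta * c[i * n + j] + ab;  /* accumulate */
        }
    }
}
```

If the routine instead computed `0.0 * c[...]` unconditionally, a NaN or Inf already present in C would propagate, since multiplying either by zero yields NaN. The explicit branch is what makes initialization of the temporary microtile unnecessary.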
Devin Matthews
bc10a3f2ff Merge pull request #492 from flame/thunderx2-clang
Allow clang for ThunderX2 config
2021-06-18 19:01:08 -05:00
Devin Matthews
bf72763663 Merge pull request #506 from xrq-phys/arm64-mac
BLIS on Darwin_Aarch64
2021-06-18 18:59:43 -05:00
Devin Matthews
e28f2a2dfc Merge pull request #513 from nicholaiTukanov/asm_warning_p9_fix
Fix assembler warning in POWER9 DGEMM
2021-06-15 19:35:07 -05:00
nicholai
56ffca6a9b Fix asm warning 2021-06-15 18:17:39 -05:00
Field G. Van Zee
d10e05bbd1 Sandbox header edits trigger full library rebuild.
Details:
- Adjusted the top-level Makefile so that any change to a sandbox header
  file will result in blis.h being regenerated along with a full
  recompilation of the library. Previously, sandbox files were omitted
  from the list of header files that, when touched, could trigger a full
  rebuild. Why was it like that previously? Because originally we only
  envisioned using sandboxes to *replace* gemm, not augment the library
  with new functionality. When replacing gemm, blis.h does not need to
  contain any local sandbox definitions in order for the user to be able
  to (indirectly) use that sandbox. But if you are adding functions to
  the library, those functions need to be prototyped so the compiler
  can perform type checking against the user's invocation of those new
  functions. Thanks to Jeff Diamond for helping us discover this
  deficiency in the build system.
2021-06-13 19:36:16 -05:00
Devin Matthews
7c3eb44efa Add vhsubpd/vhsubpd.
Horizontal subtraction instructions added to bli_x86_asm_macros.h, currently unused [ci skip].
2021-06-02 11:28:22 -05:00
Field G. Van Zee
7f7d72610c Fixed bugs in cpackm kernels, gemmlike code.
Details:
- Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and
  bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the
  kappa scalar was incorrectly loaded at an offset of 8 bytes (instead
  of 4 bytes) from the real component. This was almost certainly a
  copy-paste bug carried over from the corresponding zpackm kernels. Thanks to
  Devin Matthews for bringing this to my attention.
- Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and
  bls_gemm_bp_var2.c that initializes the elements of the temporary
  microtile to zero. (This bug was never observed in output but rather
  noticed analytically. It probably would have also manifested as
  intermittent failures, this time involving edge cases.)
- Minor commented-out/disabled changes to testsuite/src/test_gemm.c
  relating to debugging.
2021-05-31 16:50:18 -05:00
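The offsets in the first bullet of 7f7d72610c follow from the element sizes: a single-precision complex scalar is a pair of 4-byte floats, so its imaginary part sits 4 bytes after the real part, while the double-precision equivalent sits at 8 bytes. A minimal sketch, assuming a platform with 4-byte `float` and 8-byte `double`; the types `cfloat_t` and `cdouble_t` and the function `load_imag_at` are stand-ins for illustration, not BLIS definitions.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in types for illustration; each is a (real, imag) pair. */
typedef struct { float  real; float  imag; } cfloat_t;
typedef struct { double real; double imag; } cdouble_t;

/* Byte offset of the imaginary component from the start of the scalar. */
enum {
    CFLOAT_IMAG_OFFSET  = offsetof(cfloat_t,  imag),
    CDOUBLE_IMAG_OFFSET = offsetof(cdouble_t, imag)
};

/* Read a float located `byte_offset` bytes past the start of `kappa`,
   the way hand-written assembly addresses a field by displacement. */
float load_imag_at(const cfloat_t *kappa, size_t byte_offset)
{
    float value;
    memcpy(&value, (const unsigned char *)kappa + byte_offset, sizeof value);
    return value;
}
```

Passing the double-precision displacement to a single-precision load would read past the end of the scalar, which is consistent with the intermittent failures the commit describes.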
RuQing Xu
5fc93e2806 Armv8A Rename Regs for Safe Darwin Compile
Avoid x18 use in FP32 kernel:
- C address lines x[18-26] renamed to x[19-27] (reg index +1)
- Original role of x27 fulfilled by x5 which is free after k-loop part.

FP64 does not require changing since x18 is not used there.
2021-05-29 18:44:47 +09:00
RuQing Xu
9f4a4a3cfb Armv8A Rename Regs for Clang Compile: FP32 Part
Roughly the same as 916e1fa , additionally with x15 clobbering removed.
- x15: Not used at all.

Compilation with Clang shows a warning about x18 reservation, but
compilation itself succeeds and all tests passed.
2021-05-29 17:21:28 +09:00
RuQing Xu
916e1fa8be Armv8A Rename Regs for Clang Compile: FP64 Part
- x7, x8: Used to store address for Alpha and Beta.
  As Alpha & Beta are not used in k-loops, use x0, x1 to load
  Alpha & Beta's addresses after k-loops are completed, since A & B's
  addresses are no longer needed there.
  This "ldr [addr]; -> ldr val, [addr]" would not cause much performance
  drawback since it is done outside k-loops and there are plenty of
  instructions between Alpha & Beta's loading and usage.
- x9: Used to store cs_c. x9 is multiplied by 8 into x10 and not used
  any longer. Loading cs_c directly into x10 and scaling it by 8 frees
  x9 straightforwardly.
- x11, x12: Not used at all. Simply remove from clobber list.
- x13: Like x9, loaded and scaled by 8 into x14, except that x13 is
  also used in a conditional branch so that "cmp x13, #1" needs to be
  modified into "cmp x14, #8" to completely free x13.
- x3, x4: Used to store next_a & next_b. Untouched in k-loops. Load
  these addresses into x0 and x1 after Alpha & Beta are both loaded,
  since then neither the address of A/B nor the address of Alpha/Beta is needed.
2021-05-29 16:46:52 +09:00
RuQing Xu
7fabd896af Asm Flag Mingling for Darwin_Aarch64
Apple+Arm64 requires additional "tagging" of local symbols.
2021-05-29 16:28:03 +09:00
Field G. Van Zee
213dce32d2 Added a new 'gemmlike' sandbox.
Details:
- Added a new sandbox called 'gemmlike', which implements sequential and
  multithreaded gemm in the style of gemmsup but also unconditionally
  employs packing. The purpose of this sandbox is to
  (1) avoid select abstractions, such as objects and control trees, in
      order to allow readers to better understand how a real-world
      implementation of high-performance gemm can be constructed;
  (2) provide a starting point for expert users who wish to build
      something that is gemm-like without "reinventing the wheel."
  Thanks to Jeff Diamond, Tze Meng Low, Nicholai Tukanov, and Devangi
  Parikh for requesting and inspiring this work.
- The functions defined in this sandbox currently use the "bls_" prefix
  instead of "bli_" in order to avoid any symbol collisions in the main
  library.
- The sandbox contains two variants, each of which implements gemm via a
  block-panel algorithm. The only difference between the two is that
  variant 1 calls the microkernel directly while variant 2 calls the
  microkernel indirectly, via a function wrapper, which allows the edge
  case handling to be abstracted away from the classic five loops.
- This sandbox implementation utilizes the conventional gemm microkernel
  (not the skinny/unpacked gemmsup kernels).
- Updated some typos in the comments of a few files in the main
  framework.
2021-05-28 14:49:57 -05:00
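The wrapper idea mentioned for variant 2 in 213dce32d2 can be sketched as follows. This is one plausible way to hide edge-case handling behind a function, not the sandbox's actual code: all names (`ukernel_full`, `ukernel_edge`, `MR`, `NR`) and the 4 x 4 tile size are assumptions, and the sketch further assumes that the packed panels of A and B are padded to a full `MR` rows and `NR` columns.

```c
#include <stddef.h>

enum { MR = 4, NR = 4 };  /* illustrative register block sizes */

/* Stand-in microkernel: always updates a full MR x NR tile of C.
   a holds k packed columns of MR elements, b holds k packed rows of NR
   elements, and c is row-major with row stride rs_c. */
static void ukernel_full(size_t k, const double *a, const double *b,
                         double beta, double *c, size_t rs_c)
{
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j) {
            double ab = 0.0;
            for (size_t p = 0; p < k; ++p)
                ab += a[p * MR + i] * b[p * NR + j];
            double *cij = &c[i * rs_c + j];
            *cij = (beta == 0.0) ? ab : beta * *cij + ab;
        }
}

/* Wrapper: updates an m x n tile of C, where m <= MR and n <= NR.
   Callers never need to distinguish interior tiles from edge tiles. */
void ukernel_edge(size_t m, size_t n, size_t k,
                  const double *a, const double *b,
                  double beta, double *c, size_t rs_c)
{
    if (m == MR && n == NR) {
        ukernel_full(k, a, b, beta, c, rs_c);   /* interior tile */
        return;
    }

    /* Edge tile: compute a full tile into scratch space, then copy back
       only the m x n part that lies inside C. */
    double tmp[MR * NR];
    ukernel_full(k, a, b, 0.0, tmp, NR);        /* beta = 0 overwrites tmp */

    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) {
            double *cij = &c[i * rs_c + j];
            double ab = tmp[i * NR + j];
            *cij = (beta == 0.0) ? ab : beta * *cij + ab;
        }
}
```

With a wrapper of this shape, the loops around the microkernel can call `ukernel_edge` for every tile and stay free of boundary logic.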
Field G. Van Zee
82af05f54c Updated Fugaku (a64fx) performance results.
Details:
- Updated the performance graphs (pdfs and pngs) for the Fugaku/a64fx
  entry within Performance.md, and also updated the experiment details
  accordingly. Thanks to RuQing Xu for re-running the BLIS and SSL2
  experiments reflected in this commit.
- In Performance.md, added an English translation of the project name
  under which the Fugaku results were gathered, courtesy of RuQing Xu.
2021-05-25 15:25:08 -05:00
Devin Matthews
e5c85da376 Merge pull request #503 from flame/windows-compiler-check
Add explicit compiler check for Windows.
2021-05-24 16:56:22 -05:00
Devin Matthews
cbd8d39325 Merge pull request #500 from xrq-phys/armsve+travis
Upgrade Travis CI for Arm SVE
2021-05-24 16:32:42 -05:00
Devin Matthews
5feb04e233 Add explicit compiler check for Windows.
Check the C compiler for a predefined macro `_WIN32` to indicate (cross-)compilation for Windows. Fixes #463.
2021-05-23 18:46:56 -05:00
Devin Matthews
6d4ab0223d Merge pull request #502 from flame/rm-rm-dupls
Remove `rm-dupls` function in common.mk.
2021-05-23 18:39:53 -05:00
Devin Matthews
859fb77a32 Remove rm-dupls function in common.mk.
AMD requested removal due to unclear licensing terms; the original code was from Stack Overflow. The function is unused but could easily be replaced by a new implementation.
2021-05-23 18:15:23 -05:00
RuQing Xu
932dfe6abb Travis CI Revert Unnecessary Extras from 91d3636
- Removed `V=1` in make line
- Removed `CFLAGS` in configure line
- Restored `pwd` surrounding OOT line
2021-05-20 02:07:31 +09:00
RuQing Xu
bd156a210d Adjust TravisCI
- ArmSVE: don't test gemmt (seems to be a Qemu-only problem);
- Clang: use the TravisCI-provided version instead of pinning to clang-8,
  because clang-8 seems to conflict with TravisCI's clang-7.
2021-05-20 00:52:04 +09:00
RuQing Xu
91d3636031 Travis Support Arm SVE
- Updated distro to 20.04 focal aarch64-gcc-10.
  This is the minimum version required by aarch64-gcc-10.
  SVE intrinsics would not compile without GCC >=10.
- x86 toolchains use official repo instead of ubuntu-toolchain-r/test.
  20.04 focal is not supported by that PPA at the moment.
- Add extra configuration-time options to .travis.yml.
- Add Arm SVE entry to .travis.yml.
2021-05-20 00:52:01 +09:00
RuQing Xu
61584deddf Added 512b SVE-based a64fx subconfig + SVE kernels.
Details:
- Added 512-bit specific 'a64fx' subconfiguration that uses empirically 
  tuned block size by Stepan Nassyr. This subconfig also sets the sector 
  cache size and enables memory-tagging code in SVE gemm kernels. This 
  subconfig utilizes (16, k) and (10, k) DPACKM kernels.
- Added a vector-length agnostic 'armsve' subconfiguration that computes
  blocksizes according to the analytical model. This part is ported from 
  Stepan Nassyr's repository.
- Implemented vector-length-agnostic [d/s/sh] gemm kernels for Arm SVE 
  at size (2*VL, 10). These kernels use unindexed FMLA instructions 
  because indexed FMLA takes 2 FMA units in many implementations.
  PS: There are indexed-FMLA kernels in Stepan Nassyr's repository.
- Implemented 512-bit SVE dpackm kernels with in-register transpose
  support for sizes (16, k) and (10, k).
- Extended 256-bit SVE dpackm kernels by Linaro Ltd. to 512-bit for 
  size (12, k). This dpackm kernel is not currently used by any 
  subconfiguration.
- Implemented several experimental dgemmsup kernels which would
  improve performance in a few cases. However, those dgemmsup kernels
  generally underperform, so they are not currently used in any
  subconfig.
- Note: This commit squashes several commits submitted by RuQing Xu via
  PR #424.
2021-05-19 09:52:29 -05:00
Devin Matthews
5d46dbee4a Replace bli_dlamch with something less archaic (#498)
Details:
- Added new implementations of bli_slamch() and bli_dlamch() that use
  constants from the standard C library in lieu of dynamically-computed
  values (via code inherited from netlib). The previous implementation
  is still available when the cpp macro BLIS_ENABLE_LEGACY_LAMCH is 
  defined by the subconfiguration at compile-time. Thanks to Devin
  Matthews for providing this patch, and to Stefano Zampini for
  reporting the issue (#497) that prompted Devin to propose the patch.
2021-05-12 18:42:09 -05:00
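The approach described in 5d46dbee4a, returning constants from the standard C library instead of computing machine parameters at run time, can be sketched with `<float.h>`. This is not the BLIS implementation: `machine_param_d` is a hypothetical function, and its one-letter selectors are only loosely modeled on those of LAPACK's dlamch.

```c
#include <ctype.h>
#include <float.h>

/* Hypothetical query function: map a one-letter selector to a
   double-precision machine parameter taken from <float.h>.
   Unknown selectors return 0.0. */
double machine_param_d(char which)
{
    switch (toupper((unsigned char)which)) {
    case 'B': return (double)FLT_RADIX;     /* base of the representation */
    case 'N': return (double)DBL_MANT_DIG;  /* base-B digits in the mantissa */
    case 'M': return (double)DBL_MIN_EXP;   /* minimum exponent */
    case 'L': return (double)DBL_MAX_EXP;   /* maximum exponent */
    case 'U': return DBL_MIN;               /* smallest normalized value */
    case 'O': return DBL_MAX;               /* largest finite value */
    case 'E': return DBL_EPSILON;           /* gap between 1.0 and the next
                                               double; some conventions
                                               halve this value */
    default:  return 0.0;
    }
}
```

Because every value is a compile-time constant, the lookup has no dependence on how the compiler optimizes floating-point loops, which is a common weakness of dynamically computed machine parameters.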
Field G. Van Zee
6a4aa986ff Fixed typo in Table of Contents. 2021-04-23 13:10:01 -05:00
Field G. Van Zee
f6424b5b82 Added dedicated Performance section to README.md.
Details:
- Spun off the Performance.md and PerformanceSmall.md links in the
  Documentation section into a new Performance section dedicated to
  those two links. (The previous entries remain redundantly listed
  within Documentation section.) Thanks to Robert van de Geijn for
  suggesting this change.
2021-04-23 13:08:06 -05:00
Devin Matthews
40ce5fd241 Merge pull request #493 from cassiersg/patch-1
Fix typo in FAQ.md
2021-04-21 09:54:25 -05:00
Gaëtan Cassiers
1f3461a5a5 Fix typo in FAQ.md 2021-04-21 16:49:05 +02:00
Devin Matthews
6548cebaf5 Allow clang for ThunderX2 config
Needed for compiling on e.g. Mac M1. AFAIK clang supports the same -mcpu flag for ThunderX2 as gcc.
2021-04-14 13:00:42 -05:00
Field G. Van Zee
6280757be3 Minor updates to a64fx section of Performance.md. 2021-04-07 13:03:56 -05:00
RuQing Xu
1e6ed823c6 Additional A64fx Comments (#490)
* Performance.md Update A64fx Comments

- Reason for ARMPL's missing data;
- Additional envs / flags for kernel selection;
- Update BLIS SRC commit.

* Include Another Fix in armsve-cfg-vendor

A prototype was missing, so a void* pointer was not fully returned.
2021-04-07 12:59:26 -05:00
Field G. Van Zee
2688f21a5b Added Fujitsu A64fx (512-bit SVE) perf results.
Details:
- Added single-threaded and multithreaded performance results to
  docs/Performance.md. These results were gathered on the "Fugaku"
  Fujitsu A64fx supercomputer at the RIKEN Center for Computational
  Science in Kobe, Japan. Special thanks to RuQing Xu and Stepan
  Nassyr for their work in developing and optimizing A64fx support in
  BLIS and RuQing for gathering the performance data that is reflected
  in these new graphs.
2021-04-06 19:02:37 -05:00
Field G. Van Zee
ba3ba8da83 Minor updates and fixes to test/3/octave scripts.
Details:
- Fixed an issue where the wrong string was being passed in for the
  vendor legend string.
- Changed the graph in which the legends appear.
- Updates to runthese.m.
2021-04-06 18:39:58 -05:00
Devin Matthews
90508192f2 Update do_sde.sh (#489)
Update to a newer version of SDE, and do a direct download, as it seems you don't have to click through the license anymore.
2021-03-30 21:16:44 -05:00
Nicholai Tukanov
22c6b5dc4c Fixed bug in power10 microkernel I/O. (#488)
Details:
- Fixed a bug in the POWER10 DGEMM kernel whereby the microkernel did
  not store the microtile result correctly due to incorrect index
  calculations. (The error was introduced when I reorganized the 
  'kernels/power10/3' directory.)
2021-03-30 19:07:42 -05:00
Field G. Van Zee
159ca6f01a Made test/3/octave scripts robust to missing data.
Details:
- Modified the octave scripts in test/3 so that the script does not
  choke when one or more of the expected OpenBLAS, Eigen, or vendor data
  files is missing. (The BLIS data set, however, must be complete.) When
  a file is missing, that data series is simply not included on that
  particular graph. Also factored out a lot of the redundant logic from
  plot_panel_4x5.m into a separate function in read_data.m.
2021-03-24 15:57:32 -05:00
Field G. Van Zee
545e6c2f6d CHANGELOG update (0.8.1) 2021-03-22 17:42:33 -05:00
Field G. Van Zee
8535b3e11d Version file update (0.8.1) 2021-03-22 17:42:33 -05:00
Field G. Van Zee
e56d9f2d94 ReleaseNotes.md update in advance of next version. 2021-03-22 17:40:50 -05:00
Field G. Van Zee
ca83f955d4 CREDITS file update. 2021-03-22 17:21:21 -05:00
Field G. Van Zee
57ef61f6cd Merge branch 'master' of github.com:flame/blis 2021-03-19 13:05:43 -05:00
Field G. Van Zee
bf1b578ea3 Reduced KC on skx from 384 to 256.
Details:
- Reduced the KC cache blocksize for double real on the skx subconfig
  from 384 to 256. The maximum (extended) KC was also reduced
  accordingly from 480 to 320. Thanks to Tze Meng Low for suggesting
  this change.
2021-03-19 13:03:17 -05:00
Nicholai Tukanov
e7a4a8edc9 Fix calculation of new pb size (#487)
Details:
- Added missing parentheses to the i8 and i4 instantiations of the
  GENERIC_GEMM macro in sandbox/power10/generic_gemm.c.
2021-03-17 19:43:31 -05:00
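The expression fixed in e7a4a8edc9 is not shown in this log, so the following is only a generic illustration of how missing parentheses in a macro change a size calculation. Both macros and both functions are hypothetical and do not come from sandbox/power10/generic_gemm.c.

```c
/* Hypothetical macros: number of 4-element groups needed to hold k items. */
#define GROUPS_OF_4_UNSAFE(k) (k + 3 / 4)      /* parses as k + (3 / 4) == k */
#define GROUPS_OF_4(k)        (((k) + 3) / 4)  /* rounds k / 4 upward */

/* Thin wrappers so the two expansions can be compared side by side. */
int groups_unsafe(int k) { return GROUPS_OF_4_UNSAFE(k); }
int groups_fixed(int k)  { return GROUPS_OF_4(k); }
```

For `k = 10`, the unparenthesized form evaluates to 10 because integer division binds tighter than addition, while the parenthesized form yields the intended 3.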
Field G. Van Zee
4493cf516e Redefined BLIS_NUM_ARCHS to update automatically.
Details:
- Changed BLIS_NUM_ARCHS from a cpp macro definition to the last enum
  value in the arch_t enum. This means that it no longer needs to get
  updated manually whenever new subconfigurations are added to BLIS.
  Also removed the explicit initial index assignment of 0 from the
  first enum value, which was unnecessary due to how the C language
  standard mandates indexing of enum values. Thanks to Devin Matthews
  for originally submitting this as a PR in #446.
- Updated docs/ConfigurationHowTo.md to reflect the aforementioned
  change.
2021-03-15 13:12:49 -05:00
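The technique described in 4493cf516e, using the last enumerator as a count that updates itself, can be sketched as follows. The enumerator names and `demo_arch_t` are invented for illustration; the real `arch_t` lists BLIS subconfigurations.

```c
/* Hypothetical identifiers standing in for the entries of arch_t. */
typedef enum {
    ARCH_ALPHA,   /* first enumerator is 0 by the C standard */
    ARCH_BETA,
    ARCH_GAMMA,
    NUM_ARCHS     /* always last: equals the number of entries above */
} demo_arch_t;

/* Tables indexed by the enum can be sized with the count directly. */
static const char *const arch_names[NUM_ARCHS] = { "alpha", "beta", "gamma" };

/* Look up a name, guarding against out-of-range ids. */
const char *arch_name(demo_arch_t id)
{
    return ((int)id >= 0 && (int)id < NUM_ARCHS) ? arch_names[id] : "unknown";
}
```

Adding a new entry before `NUM_ARCHS` increments the count automatically, which is why no separate macro needs to be kept in sync.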