Commit Graph

339 Commits

Author SHA1 Message Date
Dipal M Zambare
26e4b6b293 Added support for AMD's Zen3 microarchitecture.
Details:
- Added a new 'zen3' subconfiguration targeting support for the AMD Zen3
  microarchitecture (#561). Thanks to AMD for this contribution.
- Restructured clang and AOCC support for zen, zen2, and zen3
  make_defs.mk files. The clang and AOCC version detection now happens
  in configure, not in the subconfigurations' makefile fragments. That
  is, we've added logic to configure that detects the version of
  clang/AOCC, outputs an appropriate variable to config.mk
  (i.e., CLANG_OT_*, AOCC_OT_*), and then checks for it within the
  makefile fragment (as is currently done for the GCC_OT_* variables).
- Added configure support for a GCC_OT_10_1_0 variable (and associated
  substitution anchor) to communicate whether the gcc version is older
  than 10.1.0, and use this variable to check for recent enough versions
  of gcc to use -march=znver3 in the zen3 subconfig.
- Inlined the contents of config/zen/amd_config.mk into the zen and zen2
  make_defs.mk so that the files are self-contained, harmonizing the
  format of all three Zen-based subconfigurations' make_defs.mk files.
- Added indenting (with spaces) of GNU make conditionals for easier
  reading in zen, zen2, and zen3 make_defs.mk files.
- Adjusted the range of models checked by bli_cpuid_is_zen() (which was
  previously 0x00 ~ 0xff and is now 0x00 ~ 0x2f) so that it is
  completely disjoint from the models checked by bli_cpuid_is_zen2()
  (0x30 ~ 0xff). This is normally necessary because Zen and Zen2
  microarchitectures share the same family (23, or 0x17), and so the
  model code is the only way to differentiate the two. But in our case,
  fixing the model range for zen *wasn't* actually necessary since we
  checked for zen2 first, and therefore the wide zen range acted like
  the 'else' of an 'if-else' statement. That said, the change helps
  improve clarity for the reader by encoding useful knowledge, which
  was obtained from https://en.wikichip.org/wiki/amd/cpuid .
- Added zen2.def and zen3.def files to the collection in travis/cpuid.
  Note that support for zen, zen2, and zen3 is now present, and while
  all three microarchitectures have identical instruction sets from
  the perspective of BLIS microkernels, they each correspond to
  different subconfigurations and therefore merit separate testing.
  Thanks to Devin Matthews for his guidance in hacking these files as
  slight modifications of zen.def.
- Enabled testing of zen2 and zen3 via the SDE in travis/do_sde.sh.
  Now, zen, zen2, and zen3 are tested through the SDE via Travis CI
  builds.
- Updated travis/do_sde.sh to grab the SDE tarball from a new ci-utils
  repository on GitHub rather than on Intel's website. This change was
  made in an attempt to circumvent recent troubles with Travis CI not
  being able to download the SDE directly from Intel's website via curl.
  Thanks to Devin Matthews for suggesting the idea.
- Updated travis/do_sde.sh to grab the latest version (8.69.1) of the
  Intel SDE from the flame/ci-utils repository.
- Updated .travis.yml to use gcc 9. The file was previously using gcc 8,
  which did not support -march=znver2.
- Created amd64_legacy umbrella family in config_registry for targeting
  older (bulldozer, piledriver, steamroller, and excavator)
  microarchitectures and moved those same subconfigs out of the amd64
  umbrella family. However, x86_64 retains amd64_legacy as a constituent
  member.
- Fixed a bug in configure related to the building of the so-called
  config list. When processing the contents of config_registry,
  configure creates a series of structures and lists that allow for
  various mappings related to configuration families, subconfigs, and
  kernel sets. Two of those lists are built via substitution of
  umbrella families with their subconfig members, and one of those
  lists was improperly performing the substitution in a way that would
  erroneously match on partial umbrella family names. That code was
  changed to match the code that was already doing the substitution
  properly, via substitute_words(). Also added comments noting the
  importance of using substitute_words() in both instances.
- Comment updates.
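
The disjoint model ranges described above for bli_cpuid_is_zen() and bli_cpuid_is_zen2() can be sketched as follows. This is a minimal illustration, not the BLIS implementation: the function names and signatures below are hypothetical, and the real functions are not shown in this log and may check more than family and model. Only the family (0x17) and the two model ranges come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Family shared by Zen and Zen2 (23, or 0x17), per the commit message. */
#define SKETCH_AMD_FAMILY_17H 0x17u

/* Hypothetical helper: inclusive range test on the CPUID model code. */
bool sketch_model_in_range(uint32_t model, uint32_t lo, uint32_t hi)
{
    return lo <= model && model <= hi;
}

/* Zen: family 0x17, models 0x00 ~ 0x2f (narrowed from 0x00 ~ 0xff). */
bool sketch_is_zen(uint32_t family, uint32_t model)
{
    return family == SKETCH_AMD_FAMILY_17H &&
           sketch_model_in_range(model, 0x00u, 0x2fu);
}

/* Zen2: family 0x17, models 0x30 ~ 0xff. */
bool sketch_is_zen2(uint32_t family, uint32_t model)
{
    return family == SKETCH_AMD_FAMILY_17H &&
           sketch_model_in_range(model, 0x30u, 0xffu);
}
```

With disjoint ranges, at most one of the two predicates holds for any model, so the result no longer depends on the order in which they are evaluated.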
2021-11-17 13:02:00 -06:00
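
The config list bug fixed in the commit above (substitution matching on partial umbrella family names) is a whole-word versus substring problem. In BLIS the substitution lives in the configure shell script, via substitute_words(); the C function below is only a sketch of the same idea, with a hypothetical name and signature, and the member list used in the usage note is illustrative rather than copied from config_registry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Replace every whole, space-delimited token equal to `word` in `list`
   with `repl`, writing the result to `out` (capacity `outsz` bytes).
   Returns 0 on success, -1 if `out` is too small.
   Hypothetical helper, not part of BLIS. */
int sketch_substitute_words(const char *list, const char *word,
                            const char *repl, char *out, size_t outsz)
{
    size_t used = 0;
    size_t wlen = strlen(word);
    const char *p = list;

    if (outsz == 0) return -1;
    out[0] = '\0';

    while (*p != '\0') {
        /* Skip separators between tokens. */
        while (*p == ' ') p++;
        if (*p == '\0') break;

        /* Length of the current token. */
        size_t tlen = strcspn(p, " ");

        /* Substitute only on an exact, full-token match. */
        const char *piece = p;
        size_t plen = tlen;
        if (tlen == wlen && strncmp(p, word, wlen) == 0) {
            piece = repl;
            plen = strlen(repl);
        }

        /* Append a separator (if needed) followed by the piece. */
        size_t need = plen + (used > 0 ? 1 : 0);
        if (used + need + 1 > outsz) return -1;
        if (used > 0) out[used++] = ' ';
        memcpy(out + used, piece, plen);
        used += plen;
        out[used] = '\0';

        p += tlen;
    }
    return 0;
}
```

For example, substituting "amd64" in the list "amd64 amd64_legacy generic" leaves "amd64_legacy" untouched, whereas a plain substring replacement would also rewrite the "amd64" prefix inside "amd64_legacy".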
Devin Matthews
28b0982ea7 Refactored her[2]k/syr[2]k in terms of gemmt. (#531)
Details:
- Renamed herk macrokernels and supporting files and functions to gemmt,
  which is possible since at the macrokernel level they are identical.
  Then recast herk/her2k/syrk/syr2k in terms of gemmt within the expert
  level-3 oapi (bli_l3_oapi_ex.c) while also redefining them as literal
  functions rather than cpp macros that instantiate multiple functions.
  Thanks to Devin Matthews for his efforts on this issue (#531).
- Check that the maximum stack buffer size is sufficiently large
  relative to the register blocksizes for each datatype, and do so when
  the context is initialized rather than when an operation is called.
  Note that with this change, users who pass in their own contexts into
  the expert interfaces currently will *not* have any checks performed.
  Thanks to Devin Matthews for suggesting this change.
2021-11-10 12:34:50 -06:00
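
The stack buffer check mentioned in the commit above can be sketched as below, assuming the check amounts to verifying that an MR x NR microtile of each datatype fits in a fixed-size buffer. The limit, the struct, and the function name are all hypothetical and chosen for illustration; the actual BLIS check and its constants are not shown in this log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical buffer limit in bytes; not the BLIS value. */
#define SKETCH_STACK_BUF_MAX_SIZE ((size_t)4096)

/* Hypothetical per-datatype record of register blocksizes. */
typedef struct {
    size_t elem_size; /* bytes per element of the datatype */
    size_t mr;        /* register blocksize in the m dimension */
    size_t nr;        /* register blocksize in the n dimension */
} sketch_blksz_t;

/* Returns true if an mr x nr microtile of every datatype fits in the
   stack buffer. Intended to run once, when the context is initialized,
   rather than on every operation call. */
bool sketch_stack_buf_ok(const sketch_blksz_t *dts, size_t n_dts)
{
    for (size_t i = 0; i < n_dts; i++) {
        size_t bytes = dts[i].mr * dts[i].nr * dts[i].elem_size;
        if (bytes > SKETCH_STACK_BUF_MAX_SIZE) return false;
    }
    return true;
}
```

Running the check at initialization keeps it off the hot path, at the cost noted in the commit message: contexts constructed by the user and passed to the expert interfaces bypass it.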
Field G. Van Zee
e9da6425e2 Allow use of 1m with mixing of row/col-pref ukrs.
Details:
- Fixed a bug that broke the use of 1m for dcomplex when the single-
  precision real and double-precision real ukernels had opposing I/O
  preferences (row-preferential sgemm ukernel + column-preferential
  dgemm ukernel, or vice versa). The fix involved adjusting the API
  to bli_cntx_set_ind_blkszs() so that the induced method context init
  function (e.g., bli_cntx_init_<subconfig>_ind()) could call that
  function for only one datatype at a time. This allowed the blocksize
  scaling (which varies depending on whether we're doing 1m_r or 1m_c)
  to happen on a per-datatype basis. This fixes issue #557. Thanks to
  Devin Matthews and RuQing Xu for helping discover and report this bug.
- The aforementioned 1m fix required moving the 1m_r/1m_c logic from
  bli_cntx_ref.c into a new function, bli_l3_set_schemas(), which is
  called from each level-3 _front() function. The pack_t schemas in the
  cntx_t were also removed entirely, along with the associated accessor
  functions. This in turn required updating the trsm1m-related virtual
  ukernels to read the pack schema for B from the auxinfo_t struct
  rather than the context. This also required slight tweaks to
  bli_gemm_md.c.
- Repositioned the logic for transposing the operation to accommodate
  the microkernel IO preference. This mostly only affects gemm. Thanks
  to Devin Matthews for his help with this.
- Updated dpackm pack ukernels in the 'armsve' kernel set to avoid
  querying pack_t schemas from the context.
- Removed the num_t dt argument from the ind_cntx_init_ft type defined
  in bli_gks.c. The context initialization functions for induced methods
  were previously passed a dt argument, but I can no longer figure out
  *why* they were passed this value. To reduce confusion, I've removed
  the dt argument (including also from the function definition +
  prototype).
- Commented out setting of cntx_t schemas in bli_cntx_ind_stage.c. This
  breaks high-level implementations of 3m and 4m, but this is okay since
  those implementations will be removed very soon.
- Removed some older blocks of preprocessor-disabled code.
- Comment update to test_libblis.c.
2021-10-13 14:15:38 -05:00
Devin Matthews
32a6d93ef6 Merge pull request #543 from xrq-phys/armsve-packm-fix
ARMSVE Block SVE-Intrinsic Kernels for GCC 8-9
2021-10-09 15:53:54 -05:00
RuQing Xu
ccf16289d2 Arm SVE C/ZGEMM Fix FMOV 0 Mistake
FMOV [hsd]M, #imm does not allow zero immediate.
Use wzr, xzr instead.
2021-10-08 12:34:14 +09:00
RuQing Xu
82b61283b2 SH Kernel Unused Either 2021-10-08 12:17:29 +09:00
RuQing Xu
1749dfa493 Arm SVE C/ZGEMM Support *beta==0 2021-10-08 12:13:08 +09:00
RuQing Xu
66a018e6ad Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0 2021-10-08 12:13:08 +09:00
RuQing Xu
9e1e781cb5 Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0 2021-10-08 12:13:08 +09:00
RuQing Xu
e4cabb977d Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg 2021-10-08 12:13:08 +09:00
RuQing Xu
b677e0d61b Arm SVE Add SGEMM 2Vx10 Unindexed 2021-10-08 12:13:07 +09:00
RuQing Xu
3f68e8309f Arm SVE ZGEMM Support Gather Load / Scatt. St. 2021-10-08 12:13:07 +09:00
RuQing Xu
c19db2ff82 Arm SVE Add ZGEMM 2Vx10 Unindexed 2021-10-08 12:13:07 +09:00
RuQing Xu
e13abde30b Arm SVE Add ZGEMM 2Vx7 Unindexed 2021-10-08 12:13:06 +09:00
RuQing Xu
49b9d7998e Arm SVE Add ZGEMM 2Vx8 Unindexed 2021-10-08 12:12:48 +09:00
RuQing Xu
f44149f787 Armv8 Trash New Bulk Kernels
- They didn't yield much improvement.
- Can't register row-preferential and column-preferential ukrs at the same time.
  Doing so would break 1m.
2021-10-08 02:35:58 +09:00
RuQing Xu
2604f40713 Config ArmSVE Unregister 12xk. Move 12xk to Old 2021-10-07 02:39:00 +09:00
RuQing Xu
1e3200326b Revert __has_include(). Distinguish w/ BLIS_FAMILY_** 2021-10-07 02:37:14 +09:00
RuQing Xu
d7a3372247 Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo 2021-10-07 02:25:14 +09:00
RuQing Xu
2920dde5ac Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo 2021-10-07 02:01:45 +09:00
RuQing Xu
b9da6d55fe Armv8 GEMMSUP Edge Cases Require Signed Ints
Fix a bug in bli_gemmsup_rd_armv8a_asm_d6x8m.c.
For safety upon similar strategies in the future,
 change all [mn]_[iter/left] into signed ints.
2021-10-06 12:25:54 +09:00
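
The commit above does not spell out the failing expression, so the following is only a generic illustration of why edge-case counters such as [mn]_[iter/left] are safer as signed integers. It assumes a countdown loop whose step can overshoot the remaining count; the function names and the iteration cap are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Count the trips of a loop of the form
       while (left > 0) { ...; left -= step; }
   with a signed counter. `cap` bounds the trip count so that both
   variants terminate. */
int sketch_trips_signed(int64_t left, int64_t step, int cap)
{
    int trips = 0;
    while (left > 0 && trips < cap) {
        left -= step; /* may go negative, which ends the loop */
        trips++;
    }
    return trips;
}

/* Same loop with an unsigned counter. If `step` overshoots, `left`
   wraps around to a huge positive value instead of going negative,
   so the `left > 0` test keeps succeeding. */
int sketch_trips_unsigned(uint64_t left, uint64_t step, int cap)
{
    int trips = 0;
    while (left > 0 && trips < cap) {
        left -= step;
        trips++;
    }
    return trips;
}
```

With left = 6 and step = 4, the signed loop stops after two trips (6, 2, -2), while the unsigned loop wraps on the second subtraction and runs until the cap is reached.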
RuQing Xu
40baf83f0e Armv8 Handle *beta == 0 for GEMMSUP ??r Case. 2021-10-06 01:00:52 +09:00
Devin Matthews
079fbd42ce Merge branch 'master' into arm64-hi-bw 2021-10-04 17:21:48 -05:00
Devin Matthews
80c5366e4a Move unused ARM SVE kernels to "old" directory. 2021-10-04 15:40:28 -05:00
RuQing Xu
f5c03e9fe8 Armv8 Handle *beta == 0 for GEMMSUP ?rc Case. 2021-10-03 16:51:51 +09:00
RuQing Xu
abc648352c Armv8 Fix 6x8 Row-Maj Ukr
- Fixed for 6x8 only, 4x4 & 4x8 pending;
- Installed to config firestorm as benchmark seems to show better perf:
   Old:
blis_dgemm_ukr_c                     6     8   320    36.87   2.43e-17   PASS
blis_dgemm_ukr_c                     6     8   352    40.55   1.04e-17   PASS
blis_dgemm_ukr_c                     6     8   384    44.24   5.68e-17   PASS
blis_dgemm_ukr_c                     6     8   416    41.67   3.51e-17   PASS
blis_dgemm_ukr_c                     6     8   448    34.41   2.94e-17   PASS
blis_dgemm_ukr_c                     6     8   480    42.53   2.35e-17   PASS

   New:
blis_dgemm_ukr_r                     6     8   352    50.69   1.59e-17   PASS
blis_dgemm_ukr_r                     6     8   384    49.15   5.55e-17   PASS
blis_dgemm_ukr_r                     6     8   416    50.44   2.86e-17   PASS
blis_dgemm_ukr_r                     6     8   448    46.92   3.12e-17   PASS
blis_dgemm_ukr_r                     6     8   480    48.08   4.08e-17   PASS
2021-10-03 13:14:19 +09:00
Devin Matthews
13dbd5b5d3 Apply patch from @xrq-phys. 2021-10-02 16:08:05 -05:00
Devin Matthews
ae0eeeaf77 Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs. 2021-09-29 16:43:38 -05:00
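
The commit above does not state its motivation. A common reason for special-casing beta == 0 is that 0 * NaN is NaN in IEEE arithmetic, so computing beta * C unconditionally lets garbage in an uninitialized C leak into the result. The scalar sketch below illustrates the pattern only; it is not the armsve or armv7a kernel code, and the function name, row-major layout, and argument list are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* C := beta * C + alpha * AB for an m x n tile, where AB (m x n,
   row-major, contiguous) holds the already-accumulated product.
   C is row-major with leading dimension ldc. Hypothetical helper. */
void sketch_update_c(size_t m, size_t n, double alpha, const double *ab,
                     double beta, double *c, size_t ldc)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            double t = alpha * ab[i * n + j];
            if (beta == 0.0) {
                /* Overwrite: C is never read, so NaN/Inf cannot propagate. */
                c[i * ldc + j] = t;
            } else {
                c[i * ldc + j] = beta * c[i * ldc + j] + t;
            }
        }
    }
}
```

In a real microkernel the branch would typically be taken once per tile rather than once per element; the per-element test here just keeps the sketch short.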
Devin Matthews
e3dc1954ff Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell.
The fix is to use the same (valid) source register twice in the horizontal addition.
2021-09-16 10:59:37 -05:00
Devin Matthews
5191c43fac Fix more copy-paste errors in the haswell gemmsup code.
Fixes #486.
2021-09-16 10:16:17 -05:00
RuQing Xu
30c29b256e Arm SVE Exclude SVE-Intrinsic Kernels for GCC 8-9
Affected configs: a64fx.
2021-09-16 05:01:03 +09:00
RuQing Xu
bffa85be59 Arm SVE: Correct PACKM Ker Name: Intrinsic Kers
SVE-Intrinsic-based kernels ought not to use asm in their names.
2021-09-16 04:31:45 +09:00
RuQing Xu
820f11a469 Arm Whole GEMMSUP Call Route is Asm/Int Optimized
- `ref2` call in `bli_gemmsup_rv_armv8a_asm_d6x8m.c` is commented out.
- `bli_gemmsup_rv_armv8a_asm_d4x8m.c` contains a tail `ref2` call but
  it's not called by any upper routine.
2021-08-27 13:40:26 +09:00
RuQing Xu
7e2951e61f Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref
Ref cannot handle panel strides (packed cases) and thus cannot be called
from the beginning of `gemmsup` (i.e. it cannot be the dispatch target of
gemmsup for other sizes).
2021-08-23 17:06:44 +09:00
RuQing Xu
4fd82b0e93 Header Typo 2021-08-23 05:18:32 +09:00
RuQing Xu
35409ebe67 Arm: DGEMMSUP ??r(rv) Invoke Edge Size
Plus some fixes at edges.

TODO: Should ensure that no ref kernel appears at the beginning of gemmsup
kernels, as ref does not recognise panel stride.
2021-08-23 04:51:47 +09:00
RuQing Xu
a361492c24 Arm: DGEMMSUP ?rc(rd) Invoke Edge Size 2021-08-23 01:13:39 +09:00
RuQing Xu
e6799b26a6 Arm: Implement GEMMSUP Fallback Method
bli_dgemmsup_rv_armv8a_int_6x4mn
2021-08-21 02:39:38 +09:00
RuQing Xu
7d5903d8d7 Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin
Forgot to support `alpha`/`beta` in gemmsup_armv8a_int.
2021-08-21 01:55:50 +09:00
Devin Matthews
4f70eb7913 Clean up some warnings that show up on clang/OSX. 2021-08-13 11:12:43 -05:00
RuQing Xu
3df0e9b653 Arm64 8x4 Kernel Use Less Regs 2021-08-13 02:40:06 +09:00
RuQing Xu
4e7e225057 Armv8-A Supplementary GEMMSUP Sizes for RD 2021-08-13 02:40:06 +09:00
RuQing Xu
c792d506ba Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm
Suffixed NEON opcodes are not supported by the GNU assembler.
2021-08-13 02:40:06 +09:00
RuQing Xu
ce44735209 Armv8-A Adjust Types for PACKM Kernels
GCC does not have full NEON intrinsics support.
2021-08-13 02:40:06 +09:00
RuQing Xu
8a32d19af8 Armv8-A GEMMSUP-RD 6x8m
Armv8-A now has a complete set of GEMMSUP kernels.
2021-08-13 02:40:06 +09:00
RuQing Xu
afd0fa6ad1 Armv8-A GEMMSUP-RD 6x8n 2021-08-13 02:40:06 +09:00
RuQing Xu
3c5f740514 Armv8-A s/d Packing Kernels Fix Typo
For GCC.
2021-08-13 02:40:06 +09:00
RuQing Xu
49b05df792 Armv8-A Introduced s/d Packing Kernels
Sizes according to the 2014 kernels.
2021-08-13 02:40:06 +09:00
RuQing Xu
c3faf93168 Armv8-A DGEMMSUP 6x8m Kernel
Recommended kernels set:
  ...
  BLIS_RRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
  BLIS_RCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
  BLIS_RCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
  BLIS_CRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
  BLIS_CCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
  BLIS_CCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
  ...
  bli_blksz_init     ( &blkszs[ BLIS_MR ],    -1,     6,    -1,    -1,
                                              -1,     8,    -1,    -1 );
  bli_blksz_init_easy( &blkszs[ BLIS_NR ],    -1,     8,    -1,    -1 );
  ...
2021-08-13 02:40:06 +09:00
RuQing Xu
3efe707b55 Armv8-A DGEMMSUP Adjustments 2021-08-13 02:40:06 +09:00