Commit Graph

22 Commits

Author SHA1 Message Date
Field G. Van Zee
e9da6425e2 Allow use of 1m with mixing of row/col-pref ukrs.
Details:
- Fixed a bug that broke the use of 1m for dcomplex when the single-
  precision real and double-precision real ukernels had opposing I/O
  preferences (row-preferential sgemm ukernel + column-preferential
  dgemm ukernel, or vice versa). The fix involved adjusting the API
  to bli_cntx_set_ind_blkszs() so that the induced method context init
  function (e.g., bli_cntx_init_<subconfig>_ind()) could call that
  function for only one datatype at a time. This allowed the blocksize
  scaling (which varies depending on whether we're doing 1m_r or 1m_c)
  to happen on a per-datatype basis. This fixes issue #557. Thanks to
  Devin Matthews and RuQing Xu for helping discover and report this bug.
- The aforementioned 1m fix required moving the 1m_r/1m_c logic from
  bli_cntx_ref.c into a new function, bli_l3_set_schemas(), which is
  called from each level-3 _front() function. The pack_t schemas in the
  cntx_t were also removed entirely, along with the associated accessor
  functions. This in turn required updating the trsm1m-related virtual
  ukernels to read the pack schema for B from the auxinfo_t struct
  rather than the context. This also required slight tweaks to
  bli_gemm_md.c.
- Repositioned the logic for transposing the operation to accommodate
  the microkernel IO preference. This mostly only affects gemm. Thanks
  to Devin Matthews for his help with this.
- Updated dpackm pack ukernels in the 'armsve' kernel set to avoid
  querying pack_t schemas from the context.
- Removed the num_t dt argument from the ind_cntx_init_ft type defined
  in bli_gks.c. The context initialization functions for induced methods
  were previously passed a dt argument, but I can no longer figure out
  *why* they were passed this value. To reduce confusion, I've removed
  the dt argument (including also from the function defintion +
  prototype).
- Commented out setting of cntx_t schemas in bli_cntx_ind_stage.c. This
  breaks high-leve implementations of 3m and 4m, but this is okay since
  those implementations will be removed very soon.
- Removed some older blocks of preprocessor-disabled code.
- Comment update to test_libblis.c.
2021-10-13 14:15:38 -05:00
Devin Matthews
32a6d93ef6 Merge pull request #543 from xrq-phys/armsve-packm-fix
ARMSVE Block SVE-Intrinsic Kernels for GCC 8-9
2021-10-09 15:53:54 -05:00
RuQing Xu
ccf16289d2 Arm SVE C/ZGEMM Fix FMOV 0 Mistake
FMOV [hsd]M, #imm does not allow zero immediate.
Use wzr, xzr instead.
2021-10-08 12:34:14 +09:00
RuQing Xu
82b61283b2 SH Kernel Unused Eigher 2021-10-08 12:17:29 +09:00
RuQing Xu
1749dfa493 Arm SVE C/ZGEMM Support *beta==0 2021-10-08 12:13:08 +09:00
RuQing Xu
66a018e6ad Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0 2021-10-08 12:13:08 +09:00
RuQing Xu
9e1e781cb5 Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0 2021-10-08 12:13:08 +09:00
RuQing Xu
e4cabb977d Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg 2021-10-08 12:13:08 +09:00
RuQing Xu
b677e0d61b Arm SVE Add SGEMM 2Vx10 Unindexed 2021-10-08 12:13:07 +09:00
RuQing Xu
3f68e8309f Arm SVE ZGEMM Support Gather Load / Scatt. St. 2021-10-08 12:13:07 +09:00
RuQing Xu
c19db2ff82 Arm SVE Add ZGEMM 2Vx10 Unindexed 2021-10-08 12:13:07 +09:00
RuQing Xu
e13abde30b Arm SVE Add ZGEMM 2Vx7 Unindexed 2021-10-08 12:13:06 +09:00
RuQing Xu
49b9d7998e Arm SVE Add ZGEMM 2Vx8 Unindexed 2021-10-08 12:12:48 +09:00
RuQing Xu
2604f40713 Config ArmSVE Unregister 12xk. Move 12xk to Old 2021-10-07 02:39:00 +09:00
RuQing Xu
1e3200326b Revert __has_include(). Distinguish w/ BLIS_FAMILY_** 2021-10-07 02:37:14 +09:00
Devin Matthews
80c5366e4a Move unused ARM SVE kernels to "old" directory. 2021-10-04 15:40:28 -05:00
Devin Matthews
13dbd5b5d3 Apply patch from @xrq-phys. 2021-10-02 16:08:05 -05:00
Devin Matthews
ae0eeeaf77 Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs. 2021-09-29 16:43:38 -05:00
RuQing Xu
30c29b256e Arm SVE Exclude SVE-Intrinsic Kernels for GCC 8-9
Affected configs: a64fx.
2021-09-16 05:01:03 +09:00
RuQing Xu
bffa85be59 Arm SVE: Correct PACKM Ker Name: Intrinsic Kers
SVE-Intrinsic-based kernels ought not to use asm in their names.
2021-09-16 04:31:45 +09:00
RuQing Xu
61584deddf Added 512b SVE-based a64fx subconfig + SVE kernels.
Details:
- Added 512-bit specific 'a64fx' subconfiguration that uses empirically 
  tuned block size by Stepan Nassyr. This subconfig also sets the sector 
  cache size and enables memory-tagging code in SVE gemm kernels. This 
  subconfig utilizes (16, k) and (10, k) DPACKM kernels.
- Added a vector-length agnostic 'armsve' subconfiguration that computes
  blocksizes according to the analytical model. This part is ported from 
  Stepan Nassyr's repository.
- Implemented vector-length-agnostic [d/s/sh] gemm kernels for Arm SVE 
  at size (2*VL, 10). These kernels use unindexed FMLA instructions 
  because indexed FMLA takes 2 FMA units in many implementations.
  PS: There are indexed-FLMA kernels in Stepan Nassyr's repository.
- Implemented 512-bit SVE dpackm kernels with in-register transpose
  support for sizes (16, k) and (10, k).
- Extended 256-bit SVE dpackm kernels by Linaro Ltd. to 512-bit for 
  size (12, k). This dpackm kernel is not currently used by any 
  subconfiguration.
- Implemented several experimental dgemmsup kernels which would 
  improve performance in a few cases. However, those dgemmsup kernels 
  generally underperform hence they are not currently used in any 
  subconfig.
- Note: This commit squashes several commits submitted by RuQing Xu via
  PR #424.
2021-05-19 09:52:29 -05:00
Guodong Xu
f032d5d4a6 New kernel set for Arm SVE using assembly (#396)
Here adds two kernels for Arm SVE vector extensions.
1. a gemm  kernel for double at sizes 8x8.
2. a packm kernel for double at dimension 8xk.

To achive best performance, variable length agonostic programming
is not used. Vector length (VL) of 256 bits is mandated in both kernels.
Kernels to support other VLs can be added later.

"SVE is a vector extension for AArch64 execution mode for the A64
instruction set of the Armv8 architecture. Unlike other SIMD architectures,
SVE does not define the size of the vector registers, but constrains into
a range of possible values, from a minimum of 128 bits up to a maximum of
2048 in 128-bit wide units. Therefore, any CPU vendor can implement the
extension by choosing the vector register size that better suits the
workloads the CPU is targeting. Instructions are provided specifically
to query an implementation for its register size, to guarantee that
the applications can run on different implementations of the ISA without
the need to recompile the code."  [1]

[1] https://developer.arm.com/solutions/hpc/resources/hpc-white-papers/arm-scalable-vector-extensions-and-application-to-machine-learning

Signed-off-by: Guodong Xu <guodong.xu@linaro.org>
2020-04-29 12:08:46 -05:00