Commit Graph

316 Commits

Author SHA1 Message Date
RuQing Xu
40baf83f0e Armv8 Handle *beta == 0 for GEMMSUP ??r Case. 2021-10-06 01:00:52 +09:00
Devin Matthews
079fbd42ce Merge branch 'master' into arm64-hi-bw 2021-10-04 17:21:48 -05:00
Devin Matthews
80c5366e4a Move unused ARM SVE kernels to "old" directory. 2021-10-04 15:40:28 -05:00
RuQing Xu
f5c03e9fe8 Armv8 Handle *beta == 0 for GEMMSUP ?rc Case. 2021-10-03 16:51:51 +09:00
RuQing Xu
abc648352c Armv8 Fix 6x8 Row-Maj Ukr
- Fixed for 6x8 only, 4x4 & 4x8 pending;
- Installed to config firestorm as benchmark seems to show better perf:
   Old:
blis_dgemm_ukr_c                     6     8   320    36.87   2.43e-17   PASS
blis_dgemm_ukr_c                     6     8   352    40.55   1.04e-17   PASS
blis_dgemm_ukr_c                     6     8   384    44.24   5.68e-17   PASS
blis_dgemm_ukr_c                     6     8   416    41.67   3.51e-17   PASS
blis_dgemm_ukr_c                     6     8   448    34.41   2.94e-17   PASS
blis_dgemm_ukr_c                     6     8   480    42.53   2.35e-17   PASS

   New:
blis_dgemm_ukr_r                     6     8   352    50.69   1.59e-17   PASS
blis_dgemm_ukr_r                     6     8   384    49.15   5.55e-17   PASS
blis_dgemm_ukr_r                     6     8   416    50.44   2.86e-17   PASS
blis_dgemm_ukr_r                     6     8   448    46.92   3.12e-17   PASS
blis_dgemm_ukr_r                     6     8   480    48.08   4.08e-17   PASS
2021-10-03 13:14:19 +09:00
Devin Matthews
13dbd5b5d3 Apply patch from @xrq-phys. 2021-10-02 16:08:05 -05:00
Devin Matthews
ae0eeeaf77 Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs. 2021-09-29 16:43:38 -05:00
Devin Matthews
e3dc1954ff Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell.
The fix is to use the same (valid) source register twice in the horizontal addition.
2021-09-16 10:59:37 -05:00
Devin Matthews
5191c43fac Fix more copy-paste errors in the haswell gemmsup code.
Fixes #486.
2021-09-16 10:16:17 -05:00
RuQing Xu
820f11a469 Arm Whole GEMMSUP Call Route is Asm/Int Optimized
- `ref2` call in `bli_gemmsup_rv_armv8a_asm_d6x8m.c` is commented out.
- `bli_gemmsup_rv_armv8a_asm_d4x8m.c` contains a tail `ref2` call but
  it's not called by any upper routine.
2021-08-27 13:40:26 +09:00
RuQing Xu
7e2951e61f Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref
Ref cannot handle panel strides (packed cases) thus cannot be called
from the beginning of `gemmsup` (i.e. cannot be dispatch target of
gemmsup to other sizes.)
2021-08-23 17:06:44 +09:00
RuQing Xu
4fd82b0e93 Header Typo 2021-08-23 05:18:32 +09:00
RuQing Xu
35409ebe67 Arm: DGEMMSUP ??r(rv) Invoke Edge Size
Plus some fix at edges.

TODO: Should ensure that no ref kernel appear in beginning of gemmsup
kernels. As ref does not recognise panel stride.
2021-08-23 04:51:47 +09:00
RuQing Xu
a361492c24 Arm: DGEMMSUP ?rc(rd) Invoke Edge Size 2021-08-23 01:13:39 +09:00
RuQing Xu
e6799b26a6 Arm: Implement GEMMSUP Fallback Method
bli_dgemmsup_rv_armv8a_int_6x4mn
2021-08-21 02:39:38 +09:00
RuQing Xu
7d5903d8d7 Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin
Forgot to support `alpha`/`beta` in gemmsup_armv8a_int.
2021-08-21 01:55:50 +09:00
Devin Matthews
4f70eb7913 Clean up some warnings that show up on clang/OSX. 2021-08-13 11:12:43 -05:00
RuQing Xu
3df0e9b653 Arm64 8x4 Kernel Use Less Regs 2021-08-13 02:40:06 +09:00
RuQing Xu
4e7e225057 Armv8-A Supplimentary GEMMSUP Sizes for RD 2021-08-13 02:40:06 +09:00
RuQing Xu
c792d506ba Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm
Suffixed NEON opcode is not supported by GNU assembler
2021-08-13 02:40:06 +09:00
RuQing Xu
ce44735209 Armv8-A Adjust Types for PACKM Kernels
GCC does not have full NEON intrinsics support.
2021-08-13 02:40:06 +09:00
RuQing Xu
8a32d19af8 Armv8-A GEMMSUP-RD 6x8m
Armv8-A now has a complete set of GEMMSUP kernels..
2021-08-13 02:40:06 +09:00
RuQing Xu
afd0fa6ad1 Armv8-A GEMMSUP-RD 6x8n 2021-08-13 02:40:06 +09:00
RuQing Xu
3c5f740514 Armv8-A s/d Packing Kernels Fix Typo
For GCC.
2021-08-13 02:40:06 +09:00
RuQing Xu
49b05df792 Armv8-A Introduced s/d Packing Kernels
Sizes according to the 2014 kernels.
2021-08-13 02:40:06 +09:00
RuQing Xu
c3faf93168 Armv8-A DGEMMSUP 6x8m Kernel
Recommended kernels set:
  ...
  BLIS_RRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
  BLIS_RCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
  BLIS_RCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
  BLIS_CRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
  BLIS_CCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
  BLIS_CCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
  ...
  bli_blksz_init     ( &blkszs[ BLIS_MR ],    -1,     6,    -1,    -1,
                                              -1,     8,    -1,    -1 );
  bli_blksz_init_easy( &blkszs[ BLIS_NR ],    -1,     8,    -1,    -1 );
  ...
2021-08-13 02:40:06 +09:00
RuQing Xu
3efe707b55 Armv8-A DGEMMSUP Adjustments 2021-08-13 02:40:06 +09:00
RuQing Xu
8ed8f5e625 Armv8-A Add More DGEMMSUP
- Add 6x8 GEMMSUP.
- Adjust prefetching.
- Workaround for Clang's disability to handle reg clobbering.
- Subproduct 6x8 row-major GEMM <- incomplete.
2021-08-13 02:40:06 +09:00
RuQing Xu
a9ba79ea14 Armv8-A Add GEMMSUP 4x8n Kernel
- Compile w/ both GCC & Clang.
- Edge cases use ref-kernels.
- Can give performance boost in some contexts.
2021-08-13 02:40:06 +09:00
RuQing Xu
df40efe8fb Armv8-A Add Part of GEMMSUP 8x4m Kernel
- Compile w/ both GCC & Clang
- Only block part is implement. Edge cases WIP
- Not Optimal kernel scheme. Should do 4x8 instead
2021-08-13 02:40:06 +09:00
RuQing Xu
6639999288 Armv8A DGEMM 4x4 Kernel WIP. Slow
Quite slow.
2021-08-13 02:40:06 +09:00
RuQing Xu
a29c16394c Armv8-A Add 8x4 Kernel WIP
Test result: a bit lower GFlOps than 6x8.
2021-08-13 02:40:04 +09:00
Field G. Van Zee
21911d6ed3 Merge branch 'dev' 2021-07-09 18:10:46 -05:00
Devin Matthews
17729cf449 Add vzeroupper to Haswell microkernels. (#524)
Details:
- Added vzeroupper instruction to the end of all 'gemm' and 'gemmtrsm' 
  microkernels so as to avoid a performance penalty when mixing AVX
  and SSE instructions. These vzeroupper instructions were once part 
  of the haswell kernels, but were inadvertently removed during a source 
  code shuffle some time ago when we were managing duplicate 'haswell' 
  and 'zen' kernel sets. Thanks to Devin Matthews for tracking this down 
  and re-inserting the missing instructions.
2021-07-09 14:59:48 -05:00
Devin Matthews
bf72763663 Merge pull request #506 from xrq-phys/arm64-mac
BLIS on Darwin_Aarch64
2021-06-18 18:59:43 -05:00
nicholai
56ffca6a9b Fix asm warning 2021-06-15 18:17:39 -05:00
Field G. Van Zee
689fa0f403 Merge branch 'master' into dev 2021-06-13 19:44:14 -05:00
Field G. Van Zee
7f7d72610c Fixed bugs in cpackm kernels, gemmlike code.
Details:
- Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and
  bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the
  kappa scalar was incorrectly loaded at an offset of 8 bytes (instead
  of 4 bytes) from the real component. This was almost certainly a copy-
  paste bug carried over from the corresonding zpackm kernels. Thanks to
  Devin Matthews for bringing this to my attention.
- Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and
  bls_gemm_bp_var2.c that initializes the elements of the temporary
  microtile to zero. (This bug was never observed in output but rather
  noticed analytically. It probably would have also manifested as
  intermittent failures, this time involving edge cases.)
- Minor commented-out/disabled changes to testsuite/src/test_gemm.c
  relating to debugging.
2021-05-31 16:50:18 -05:00
RuQing Xu
5fc93e2806 Armv8A Rename Regs for Safe Darwin Compile
Avoid x18 use in FP32 kernel:
- C address lines x[18-26] renamed to x[19-27] (reg index +1)
- Original role of x27 fulfilled by x5 which is free after k-loop pert.

FP64 does not require changing since x18 is not used there.
2021-05-29 18:44:47 +09:00
RuQing Xu
9f4a4a3cfb Armv8A Rename Regs for Clang Compile: FP32 Part
Roughly the same as 916e1fa , additionally with x15 clobbering removed.
- x15: Not used at all.

Compilation w/ Clang shows warning about x18 reservation, but
compilation itself is OK and all tests got passed.
2021-05-29 17:21:28 +09:00
RuQing Xu
916e1fa8be Armv8A Rename Regs for Clang Compile: FP64 Part
- x7, x8: Used to store address for Alpha and Beta.
  As Alpha & Beta was not used in k-loops, use x0, x1 to load
  Alpha & Beta's addresses after k-loops are completed, since A & B's
  addresses are no longer needed there.
  This "ldr [addr]; -> ldr val, [addr]" would not cause much performance
  drawback since it is done outside k-loops and there are plenty of
  instructions between Alpha & Beta's loading and usage.
- x9: Used to store cs_c. x9 is multiplied by 8 into x10 and not used
  any longer. Directly loading cs_c and into x10 and scale by 8 spares
  x9 straightforwardly.
- x11, x12: Not used at all. Simply remove from clobber list.
- x13: Alike x9, loaded and scaled by 8 into x14, except that x13 is
  also used in a conditional branch so that "cmp x13, #1" needs to be
  modified into "cmp x14, #8" to completely free x13.
- x3, x4: Used to store next_a & next_b. Untouched in k-loops. Load
  these addresses into x0 and x1 after Alpha & Beta are both loaded,
  since then neigher address of A/B nor address of Alpha/Beta is needed.
2021-05-29 16:46:52 +09:00
RuQing Xu
7fabd896af Asm Flag Mingling for Darwin_Aarch64
Apple+Arm64 requires additional "tagging" of local symbols.
2021-05-29 16:28:03 +09:00
RuQing Xu
61584deddf Added 512b SVE-based a64fx subconfig + SVE kernels.
Details:
- Added 512-bit specific 'a64fx' subconfiguration that uses empirically 
  tuned block size by Stepan Nassyr. This subconfig also sets the sector 
  cache size and enables memory-tagging code in SVE gemm kernels. This 
  subconfig utilizes (16, k) and (10, k) DPACKM kernels.
- Added a vector-length agnostic 'armsve' subconfiguration that computes
  blocksizes according to the analytical model. This part is ported from 
  Stepan Nassyr's repository.
- Implemented vector-length-agnostic [d/s/sh] gemm kernels for Arm SVE 
  at size (2*VL, 10). These kernels use unindexed FMLA instructions 
  because indexed FMLA takes 2 FMA units in many implementations.
  PS: There are indexed-FLMA kernels in Stepan Nassyr's repository.
- Implemented 512-bit SVE dpackm kernels with in-register transpose
  support for sizes (16, k) and (10, k).
- Extended 256-bit SVE dpackm kernels by Linaro Ltd. to 512-bit for 
  size (12, k). This dpackm kernel is not currently used by any 
  subconfiguration.
- Implemented several experimental dgemmsup kernels which would 
  improve performance in a few cases. However, those dgemmsup kernels 
  generally underperform hence they are not currently used in any 
  subconfig.
- Note: This commit squashes several commits submitted by RuQing Xu via
  PR #424.
2021-05-19 09:52:29 -05:00
Field G. Van Zee
09bd4f4f12 Add err_t* "return" parameter to malloc functions.
Details:
- Added an err_t* parameter to memory allocation functions including
  bli_malloc_intl(), bli_calloc_intl(), bli_malloc_user(),
  bli_fmalloc_align(), and bli_fmalloc_noalign(). Since these functions
  already use the return value to return the allocated memory address,
  they can't communicate errors to the caller through the return value.
  This commit does not employ any error checking within these functions
  or their callers, but this sets up BLIS for a more comprehensive
  commit that moves in that direction.
- Moved the typedefs for malloc_ft and free_ft from bli_malloc.h to
  bli_type_defs.h. This was done so that what remains of bli_malloc.h
  can be included after the definition of the err_t enum. (This ordering
  was needed because bli_malloc.h now contains function prototypes that
  use err_t.)
- Defined bli_is_success() and bli_is_failure() static functions in
  bli_param_macro_defs.h. These functions provide easy checks for error
  codes and will be used more heavily in future commits.
- Unfortunately, the additional err_t* argument discussed above breaks
  the API for bli_malloc_user(), which is an exported symbol in the
  shared library. However, it's quite possible that the only application
  that calls bli_malloc_user()--indeed, the reason it is was marked for
  symbol exporting to begin with--is the BLIS testsuite. And if that's
  the case, this breakage won't affect anyone. Nonetheless, the "major"
  part of the so_version file has been updated accordingly to 4.0.0.
2021-03-31 17:09:36 -05:00
Field G. Van Zee
f9ad55ce7e Merge branch 'master' into dev 2021-03-31 14:20:19 -05:00
Nicholai Tukanov
22c6b5dc4c Fixed bug in power10 microkernel I/O. (#488)
Details:
- Fixed a bug in the POWER10 DGEMM kernel whereby the microkernel did
  not store the microtile result correctly due to incorrect indices
  calculations. (The error was introduced when I reorganized the 
  'kernels/power10/3' directory.)
2021-03-30 19:07:42 -05:00
Field G. Van Zee
3a6f41afb8 Renamed membrk files/vars/functions to pba.
Details:
- Renamed the files, variables, and functions relating to the packing
  block allocator from its legacy name (membrk) to its current name
  (pba). This more clearly contrasts the packing block allocator with
  the small block allocator (sba).
- Fixed a typo in bli_pack_set_pack_b(), defined in bli_pack.c, that
  caused the function to erroneously change the value of the pack_a
  field of the global rntm_t instead of the pack_b field. (Apparently
  nobody has used this API yet.)
- Comment updates.
2021-03-27 17:22:14 -05:00
Nicholai Tukanov
670bc7b60f Add low-precision POWER10 gemm kernels (#467)
Details:
- This commit adds a new BLIS sandbox that (1) provides implementations 
  based on low-precision gemm kernels, and (2) extends the BLIS typed 
  API for those new implementations. Currently, these new kernels can 
  only be used for the POWER10 microarchitecture; however, they may 
  provide a template for developing similar kernels for other 
  microarchitectures (even those beyond POWER), as changes would likely 
  be limited to select places in the microkernel and possibly the 
  packing routines. The new low-precision operations that are now 
  supported include: shgemm, sbgemm, i16gemm, i8gemm, i4gemm. For more 
  information, refer to the POWER10.md document that is included in 
  'sandbox/power10'.
2021-03-05 13:53:43 -06:00
Field G. Van Zee
f5871c7e06 Added complex asm packm kernels for 'haswell' set.
Details:
- Implemented assembly-based packm kernels for single- and double-
  precision complex domain (c and z) and housed them in the 'haswell'
  kernel set. This means c3xk, c8xk, z3xk, and z4xk are now all
  optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
  and zen2 subconfigs.
- Minor modifications to the corresponding s and d packm kernels that
  were introduced in 426ad67.
- Thanks to AMD, who originally contributed the double-precision real
  packm kernels (d6xk and d8xk), upon which these complex kernels are
  partially based.
2021-02-28 17:03:57 -06:00
Field G. Van Zee
426ad679f5 Added assembly packm kernels for 'haswell' set.
Details:
- Implemented assembly-based packm kernels for single- and double-
  precision real domain (s and d) and housed them in the 'haswell'
  kernel set. This means s6xk, s16xk, d6xk, and d8xk are now all
  optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
  and zen2 subconfigs.
- Thanks to AMD, who originally contributed the double-precision real
  packm kernels (d6xk and d8xk), which I have now tweaked and used to
  create comparable single-precision real kernels (s6xk and s16xk).
2021-02-27 18:39:56 -06:00