RuQing Xu
ccf16289d2
Arm SVE C/ZGEMM Fix FMOV 0 Mistake
...
FMOV [hsd]M, #imm does not allow a zero immediate.
Use wzr, xzr instead.
2021-10-08 12:34:14 +09:00
RuQing Xu
82b61283b2
SH Kernel Unused Eigher
2021-10-08 12:17:29 +09:00
RuQing Xu
1749dfa493
Arm SVE C/ZGEMM Support *beta==0
2021-10-08 12:13:08 +09:00
RuQing Xu
66a018e6ad
Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0
2021-10-08 12:13:08 +09:00
RuQing Xu
9e1e781cb5
Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0
2021-10-08 12:13:08 +09:00
RuQing Xu
e4cabb977d
Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg
2021-10-08 12:13:08 +09:00
RuQing Xu
b677e0d61b
Arm SVE Add SGEMM 2Vx10 Unindexed
2021-10-08 12:13:07 +09:00
RuQing Xu
3f68e8309f
Arm SVE ZGEMM Support Gather Load / Scatt. St.
2021-10-08 12:13:07 +09:00
RuQing Xu
c19db2ff82
Arm SVE Add ZGEMM 2Vx10 Unindexed
2021-10-08 12:13:07 +09:00
RuQing Xu
e13abde30b
Arm SVE Add ZGEMM 2Vx7 Unindexed
2021-10-08 12:13:06 +09:00
RuQing Xu
49b9d7998e
Arm SVE Add ZGEMM 2Vx8 Unindexed
2021-10-08 12:12:48 +09:00
RuQing Xu
f44149f787
Armv8 Trash New Bulk Kernels
...
- They didn't bring much improvement.
- Can't register row-preferring and column-preferring ukrs at the same time.
Doing so would break 1m.
2021-10-08 02:35:58 +09:00
RuQing Xu
d7a3372247
Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo
2021-10-07 02:25:14 +09:00
RuQing Xu
2920dde5ac
Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo
2021-10-07 02:01:45 +09:00
RuQing Xu
b9da6d55fe
Armv8 GEMMSUP Edge Cases Require Signed Ints
...
Fix a bug in bli_gemmsup_rd_armv8a_asm_d6x8m.c.
To stay safe if similar strategies are used in the future,
change all [mn]_[iter/left] variables to signed ints.
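The commit does not spell out the exact failure, so the following is only a general illustration of this class of bug, assuming an edge-case counter is computed by subtraction. The function names and the `mr` parameter are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical illustration (not the actual BLIS code): decide whether
 * a tail loop runs after peeling one block of mr rows from m rows. */

/* Unsigned counter: when m < mr the subtraction wraps around to a huge
 * positive value, so the "> 0" test passes and the tail loop would run. */
static bool tail_runs_unsigned(uint64_t m, uint64_t mr)
{
    uint64_t m_left = m - mr;
    return m_left > 0;
}

/* Signed counter: when m < mr the result is negative, so the tail loop
 * is correctly skipped. */
static bool tail_runs_signed(int64_t m, int64_t mr)
{
    int64_t m_left = m - mr;
    return m_left > 0;
}
```

With m = 4 and mr = 6 the unsigned variant reports that the tail loop runs, while the signed variant does not. Both agree when m >= mr.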
2021-10-06 12:25:54 +09:00
RuQing Xu
40baf83f0e
Armv8 Handle *beta == 0 for GEMMSUP ??r Case.
2021-10-06 01:00:52 +09:00
Devin Matthews
079fbd42ce
Merge branch 'master' into arm64-hi-bw
2021-10-04 17:21:48 -05:00
Devin Matthews
80c5366e4a
Move unused ARM SVE kernels to "old" directory.
2021-10-04 15:40:28 -05:00
RuQing Xu
f5c03e9fe8
Armv8 Handle *beta == 0 for GEMMSUP ?rc Case.
2021-10-03 16:51:51 +09:00
RuQing Xu
abc648352c
Armv8 Fix 6x8 Row-Maj Ukr
...
- Fixed for 6x8 only, 4x4 & 4x8 pending;
- Installed to config firestorm as benchmark seems to show better perf:
Old:
blis_dgemm_ukr_c 6 8 320 36.87 2.43e-17 PASS
blis_dgemm_ukr_c 6 8 352 40.55 1.04e-17 PASS
blis_dgemm_ukr_c 6 8 384 44.24 5.68e-17 PASS
blis_dgemm_ukr_c 6 8 416 41.67 3.51e-17 PASS
blis_dgemm_ukr_c 6 8 448 34.41 2.94e-17 PASS
blis_dgemm_ukr_c 6 8 480 42.53 2.35e-17 PASS
New:
blis_dgemm_ukr_r 6 8 352 50.69 1.59e-17 PASS
blis_dgemm_ukr_r 6 8 384 49.15 5.55e-17 PASS
blis_dgemm_ukr_r 6 8 416 50.44 2.86e-17 PASS
blis_dgemm_ukr_r 6 8 448 46.92 3.12e-17 PASS
blis_dgemm_ukr_r 6 8 480 48.08 4.08e-17 PASS
2021-10-03 13:14:19 +09:00
Devin Matthews
13dbd5b5d3
Apply patch from @xrq-phys.
2021-10-02 16:08:05 -05:00
Devin Matthews
ae0eeeaf77
Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs.
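Why beta == 0 needs its own path: by BLAS convention C need not be initialized when beta is zero, and computing beta * C would still propagate any NaN or Inf already stored there. A minimal scalar sketch of the idea follows. The helper `update_c` and its argument layout are assumptions for illustration, not the microkernel interface.

```c
#include <stddef.h>

/* Hypothetical helper (not the BLIS ukr interface): applies
 * C := beta * C + alpha * AB to a contiguous tile of len elements. */
static void update_c(size_t len, double alpha, const double *ab,
                     double beta, double *c)
{
    size_t i;

    if (beta == 0.0) {
        /* Never read C: it may hold uninitialized data, NaN or Inf. */
        for (i = 0; i < len; ++i)
            c[i] = alpha * ab[i];
    } else {
        for (i = 0; i < len; ++i)
            c[i] = beta * c[i] + alpha * ab[i];
    }
}
```

Without the branch, a NaN in C would survive the update because 0 * NaN is NaN.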
2021-09-29 16:43:38 -05:00
Devin Matthews
e3dc1954ff
Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell.
...
The fix is to use the same (valid) source register twice in the horizontal addition.
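A scalar model of the lane semantics shows why passing the same register twice is safe. This is a sketch in plain C rather than the kernel's inline assembly, and the type `ymm_t` and both function names are hypothetical.

```c
/* Hypothetical scalar model of a 256-bit register holding 4 doubles. */
typedef struct { double v[4]; } ymm_t;

/* Models VHADDPD dst, a, b:
 *   dst = { a0+a1, b0+b1, a2+a3, b2+b3 }
 * Lanes 1 and 3 come from b, so an uninitialized b pollutes them. */
static ymm_t hadd_pd(ymm_t a, ymm_t b)
{
    ymm_t d;
    d.v[0] = a.v[0] + a.v[1];
    d.v[1] = b.v[0] + b.v[1];
    d.v[2] = a.v[2] + a.v[3];
    d.v[3] = b.v[2] + b.v[3];
    return d;
}

/* Mx1 case: only one accumulator is valid, so it is used for both
 * sources. Every lane of the result then depends on valid data only. */
static double reduce_one(ymm_t acc)
{
    ymm_t h = hadd_pd(acc, acc);  /* { a0+a1, a0+a1, a2+a3, a2+a3 } */
    return h.v[0] + h.v[2];
}
```

In this model, reducing an accumulator holding {1, 2, 3, 4} yields 10 without reading any second register.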
2021-09-16 10:59:37 -05:00
Devin Matthews
5191c43fac
Fix more copy-paste errors in the haswell gemmsup code.
...
Fixes #486.
2021-09-16 10:16:17 -05:00
RuQing Xu
820f11a469
Arm Whole GEMMSUP Call Route is Asm/Int Optimized
...
- `ref2` call in `bli_gemmsup_rv_armv8a_asm_d6x8m.c` is commented out.
- `bli_gemmsup_rv_armv8a_asm_d4x8m.c` contains a tail `ref2` call but
it's not called by any upper routine.
2021-08-27 13:40:26 +09:00
RuQing Xu
7e2951e61f
Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref
...
Ref cannot handle panel strides (packed cases), so it cannot be called
from the beginning of `gemmsup` (i.e. it cannot be the dispatch target of
gemmsup for other sizes).
2021-08-23 17:06:44 +09:00
RuQing Xu
4fd82b0e93
Header Typo
2021-08-23 05:18:32 +09:00
RuQing Xu
35409ebe67
Arm: DGEMMSUP ??r(rv) Invoke Edge Size
...
Plus some fixes at edges.
TODO: Ensure that no ref kernel appears at the beginning of gemmsup
kernels, as ref does not recognise panel strides.
2021-08-23 04:51:47 +09:00
RuQing Xu
a361492c24
Arm: DGEMMSUP ?rc(rd) Invoke Edge Size
2021-08-23 01:13:39 +09:00
RuQing Xu
e6799b26a6
Arm: Implement GEMMSUP Fallback Method
...
bli_dgemmsup_rv_armv8a_int_6x4mn
2021-08-21 02:39:38 +09:00
RuQing Xu
7d5903d8d7
Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin
...
Forgot to support `alpha`/`beta` in gemmsup_armv8a_int.
2021-08-21 01:55:50 +09:00
Devin Matthews
4f70eb7913
Clean up some warnings that show up on clang/OSX.
2021-08-13 11:12:43 -05:00
RuQing Xu
3df0e9b653
Arm64 8x4 Kernel Use Less Regs
2021-08-13 02:40:06 +09:00
RuQing Xu
4e7e225057
Armv8-A Supplementary GEMMSUP Sizes for RD
2021-08-13 02:40:06 +09:00
RuQing Xu
c792d506ba
Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm
...
Suffixed NEON opcodes are not supported by the GNU assembler.
2021-08-13 02:40:06 +09:00
RuQing Xu
ce44735209
Armv8-A Adjust Types for PACKM Kernels
...
GCC does not have full NEON intrinsics support.
2021-08-13 02:40:06 +09:00
RuQing Xu
8a32d19af8
Armv8-A GEMMSUP-RD 6x8m
...
Armv8-A now has a complete set of GEMMSUP kernels.
2021-08-13 02:40:06 +09:00
RuQing Xu
afd0fa6ad1
Armv8-A GEMMSUP-RD 6x8n
2021-08-13 02:40:06 +09:00
RuQing Xu
3c5f740514
Armv8-A s/d Packing Kernels Fix Typo
...
For GCC.
2021-08-13 02:40:06 +09:00
RuQing Xu
49b05df792
Armv8-A Introduced s/d Packing Kernels
...
Sizes according to the 2014 kernels.
2021-08-13 02:40:06 +09:00
RuQing Xu
c3faf93168
Armv8-A DGEMMSUP 6x8m Kernel
...
Recommended kernels set:
...
BLIS_RRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
BLIS_RCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
BLIS_RCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
BLIS_CRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
BLIS_CCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
BLIS_CCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
...
bli_blksz_init ( &blkszs[ BLIS_MR ], -1, 6, -1, -1,
-1, 8, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_NR ], -1, 8, -1, -1 );
...
2021-08-13 02:40:06 +09:00
RuQing Xu
3efe707b55
Armv8-A DGEMMSUP Adjustments
2021-08-13 02:40:06 +09:00
RuQing Xu
8ed8f5e625
Armv8-A Add More DGEMMSUP
...
- Add 6x8 GEMMSUP.
- Adjust prefetching.
- Workaround for Clang's inability to handle reg clobbering.
- Subproduct 6x8 row-major GEMM <- incomplete.
2021-08-13 02:40:06 +09:00
RuQing Xu
a9ba79ea14
Armv8-A Add GEMMSUP 4x8n Kernel
...
- Compile w/ both GCC & Clang.
- Edge cases use ref-kernels.
- Can give performance boost in some contexts.
2021-08-13 02:40:06 +09:00
RuQing Xu
df40efe8fb
Armv8-A Add Part of GEMMSUP 8x4m Kernel
...
- Compile w/ both GCC & Clang
- Only the block part is implemented. Edge cases WIP
- Not Optimal kernel scheme. Should do 4x8 instead
2021-08-13 02:40:06 +09:00
RuQing Xu
6639999288
Armv8A DGEMM 4x4 Kernel WIP. Slow
...
Quite slow.
2021-08-13 02:40:06 +09:00
RuQing Xu
a29c16394c
Armv8-A Add 8x4 Kernel WIP
...
Test result: a bit lower GFLOPS than 6x8.
2021-08-13 02:40:04 +09:00
Field G. Van Zee
21911d6ed3
Merge branch 'dev'
2021-07-09 18:10:46 -05:00
Devin Matthews
17729cf449
Add vzeroupper to Haswell microkernels. (#524)
...
Details:
- Added vzeroupper instruction to the end of all 'gemm' and 'gemmtrsm'
microkernels so as to avoid a performance penalty when mixing AVX
and SSE instructions. These vzeroupper instructions were once part
of the haswell kernels, but were inadvertently removed during a source
code shuffle some time ago when we were managing duplicate 'haswell'
and 'zen' kernel sets. Thanks to Devin Matthews for tracking this down
and re-inserting the missing instructions.
2021-07-09 14:59:48 -05:00
Devin Matthews
bf72763663
Merge pull request #506 from xrq-phys/arm64-mac
...
BLIS on Darwin_Aarch64
2021-06-18 18:59:43 -05:00