amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-13 10:35:38 +00:00

Author	SHA1	Message	Date
RuQing Xu	c19db2ff82	Arm SVE Add ZGEMM 2Vx10 Unindexed	2021-10-08 12:13:07 +09:00
RuQing Xu	e13abde30b	Arm SVE Add ZGEMM 2Vx7 Unindexed	2021-10-08 12:13:06 +09:00
RuQing Xu	49b9d7998e	Arm SVE Add ZGEMM 2Vx8 Unindexed	2021-10-08 12:12:48 +09:00
RuQing Xu	f44149f787	Armv8 Trash New Bulk Kernels - They didn't make much improvements. - Can't register row-preferral and column-preferral ukrs at the same time. Will break 1m.	2021-10-08 02:35:58 +09:00
RuQing Xu	d7a3372247	Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo	2021-10-07 02:25:14 +09:00
RuQing Xu	2920dde5ac	Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo	2021-10-07 02:01:45 +09:00
RuQing Xu	b9da6d55fe	Armv8 GEMMSUP Edge Cases Require Signed Ints Fix a bug in bli_gemmsup_rd_armv8a_asm_d6x8m.c. For safety upon similar strategies in the future, change all [mn]_[iter/left] into signed ints.	2021-10-06 12:25:54 +09:00
RuQing Xu	40baf83f0e	Armv8 Handle *beta == 0 for GEMMSUP ??r Case.	2021-10-06 01:00:52 +09:00
Devin Matthews	079fbd42ce	Merge branch 'master' into arm64-hi-bw	2021-10-04 17:21:48 -05:00
Devin Matthews	80c5366e4a	Move unused ARM SVE kernels to "old" directory.	2021-10-04 15:40:28 -05:00
RuQing Xu	f5c03e9fe8	Armv8 Handle *beta == 0 for GEMMSUP ?rc Case.	2021-10-03 16:51:51 +09:00
RuQing Xu	abc648352c	Armv8 Fix 6x8 Row-Maj Ukr - Fixed for 6x8 only, 4x4 & 4x8 pending; - Installed to config firestorm as benchmark seems to show better perf: Old: blis_dgemm_ukr_c 6 8 320 36.87 2.43e-17 PASS blis_dgemm_ukr_c 6 8 352 40.55 1.04e-17 PASS blis_dgemm_ukr_c 6 8 384 44.24 5.68e-17 PASS blis_dgemm_ukr_c 6 8 416 41.67 3.51e-17 PASS blis_dgemm_ukr_c 6 8 448 34.41 2.94e-17 PASS blis_dgemm_ukr_c 6 8 480 42.53 2.35e-17 PASS New: blis_dgemm_ukr_r 6 8 352 50.69 1.59e-17 PASS blis_dgemm_ukr_r 6 8 384 49.15 5.55e-17 PASS blis_dgemm_ukr_r 6 8 416 50.44 2.86e-17 PASS blis_dgemm_ukr_r 6 8 448 46.92 3.12e-17 PASS blis_dgemm_ukr_r 6 8 480 48.08 4.08e-17 PASS	2021-10-03 13:14:19 +09:00
Devin Matthews	13dbd5b5d3	Apply patch from @xrq-phys.	2021-10-02 16:08:05 -05:00
Devin Matthews	ae0eeeaf77	Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs.	2021-09-29 16:43:38 -05:00
Devin Matthews	e3dc1954ff	Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell. The fix is to use the same (valid) source register twice in the horizontal addition.	2021-09-16 10:59:37 -05:00
Devin Matthews	5191c43fac	Fix more copy-paste errors in the haswell gemmsup code. Fixes #486.	2021-09-16 10:16:17 -05:00
RuQing Xu	820f11a469	Arm Whole GEMMSUP Call Route is Asm/Int Optimized - `ref2` call in `bli_gemmsup_rv_armv8a_asm_d6x8m.c` is commented out. - `bli_gemmsup_rv_armv8a_asm_d4x8m.c` contains a tail `ref2` call but it's not called by any upper routine.	2021-08-27 13:40:26 +09:00
RuQing Xu	7e2951e61f	Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref Ref cannot handle panel strides (packed cases) thus cannot be called from the beginning of `gemmsup` (i.e. cannot be dispatch target of gemmsup to other sizes.)	2021-08-23 17:06:44 +09:00
RuQing Xu	4fd82b0e93	Header Typo	2021-08-23 05:18:32 +09:00
RuQing Xu	35409ebe67	Arm: DGEMMSUP ??r(rv) Invoke Edge Size Plus some fix at edges. TODO: Should ensure that no ref kernel appear in beginning of gemmsup kernels. As ref does not recognise panel stride.	2021-08-23 04:51:47 +09:00
RuQing Xu	a361492c24	Arm: DGEMMSUP ?rc(rd) Invoke Edge Size	2021-08-23 01:13:39 +09:00
RuQing Xu	e6799b26a6	Arm: Implement GEMMSUP Fallback Method bli_dgemmsup_rv_armv8a_int_6x4mn	2021-08-21 02:39:38 +09:00
RuQing Xu	7d5903d8d7	Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin Forgot to support `alpha`/`beta` in gemmsup_armv8a_int.	2021-08-21 01:55:50 +09:00
Devin Matthews	4f70eb7913	Clean up some warnings that show up on clang/OSX.	2021-08-13 11:12:43 -05:00
RuQing Xu	3df0e9b653	Arm64 8x4 Kernel Use Less Regs	2021-08-13 02:40:06 +09:00
RuQing Xu	4e7e225057	Armv8-A Supplementary GEMMSUP Sizes for RD	2021-08-13 02:40:06 +09:00
RuQing Xu	c792d506ba	Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm Suffixed NEON opcode is not supported by GNU assembler	2021-08-13 02:40:06 +09:00
RuQing Xu	ce44735209	Armv8-A Adjust Types for PACKM Kernels GCC does not have full NEON intrinsics support.	2021-08-13 02:40:06 +09:00
RuQing Xu	8a32d19af8	Armv8-A GEMMSUP-RD 6x8m Armv8-A now has a complete set of GEMMSUP kernels.	2021-08-13 02:40:06 +09:00
RuQing Xu	afd0fa6ad1	Armv8-A GEMMSUP-RD 6x8n	2021-08-13 02:40:06 +09:00
RuQing Xu	3c5f740514	Armv8-A s/d Packing Kernels Fix Typo For GCC.	2021-08-13 02:40:06 +09:00
RuQing Xu	49b05df792	Armv8-A Introduced s/d Packing Kernels Sizes according to the 2014 kernels.	2021-08-13 02:40:06 +09:00
RuQing Xu	c3faf93168	Armv8-A DGEMMSUP 6x8m Kernel Recommended kernels set: ... BLIS_RRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE, BLIS_RCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE, BLIS_RCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE, BLIS_CRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE, BLIS_CCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE, BLIS_CCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE, ... bli_blksz_init ( &blkszs[ BLIS_MR ], -1, 6, -1, -1, -1, 8, -1, -1 ); bli_blksz_init_easy( &blkszs[ BLIS_NR ], -1, 8, -1, -1 ); ...	2021-08-13 02:40:06 +09:00
RuQing Xu	3efe707b55	Armv8-A DGEMMSUP Adjustments	2021-08-13 02:40:06 +09:00
RuQing Xu	8ed8f5e625	Armv8-A Add More DGEMMSUP - Add 6x8 GEMMSUP. - Adjust prefetching. - Workaround for Clang's inability to handle reg clobbering. - Subproduct 6x8 row-major GEMM <- incomplete.	2021-08-13 02:40:06 +09:00
RuQing Xu	a9ba79ea14	Armv8-A Add GEMMSUP 4x8n Kernel - Compile w/ both GCC & Clang. - Edge cases use ref-kernels. - Can give performance boost in some contexts.	2021-08-13 02:40:06 +09:00
RuQing Xu	df40efe8fb	Armv8-A Add Part of GEMMSUP 8x4m Kernel - Compile w/ both GCC & Clang - Only block part is implemented. Edge cases WIP - Not Optimal kernel scheme. Should do 4x8 instead	2021-08-13 02:40:06 +09:00
RuQing Xu	6639999288	Armv8A DGEMM 4x4 Kernel WIP. Slow Quite slow.	2021-08-13 02:40:06 +09:00
RuQing Xu	a29c16394c	Armv8-A Add 8x4 Kernel WIP Test result: a bit lower GFlOps than 6x8.	2021-08-13 02:40:04 +09:00
Field G. Van Zee	21911d6ed3	Merge branch 'dev'	2021-07-09 18:10:46 -05:00
Devin Matthews	17729cf449	Add vzeroupper to Haswell microkernels. (#524) Details: - Added vzeroupper instruction to the end of all 'gemm' and 'gemmtrsm' microkernels so as to avoid a performance penalty when mixing AVX and SSE instructions. These vzeroupper instructions were once part of the haswell kernels, but were inadvertently removed during a source code shuffle some time ago when we were managing duplicate 'haswell' and 'zen' kernel sets. Thanks to Devin Matthews for tracking this down and re-inserting the missing instructions.	2021-07-09 14:59:48 -05:00
Devin Matthews	bf72763663	Merge pull request #506 from xrq-phys/arm64-mac BLIS on Darwin_Aarch64	2021-06-18 18:59:43 -05:00
nicholai	56ffca6a9b	Fix asm warning	2021-06-15 18:17:39 -05:00
Field G. Van Zee	689fa0f403	Merge branch 'master' into dev	2021-06-13 19:44:14 -05:00
Field G. Van Zee	7f7d72610c	Fixed bugs in cpackm kernels, gemmlike code. Details: - Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the kappa scalar was incorrectly loaded at an offset of 8 bytes (instead of 4 bytes) from the real component. This was almost certainly a copy-paste bug carried over from the corresponding zpackm kernels. Thanks to Devin Matthews for bringing this to my attention. - Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and bls_gemm_bp_var2.c that initializes the elements of the temporary microtile to zero. (This bug was never observed in output but rather noticed analytically. It probably would have also manifested as intermittent failures, this time involving edge cases.) - Minor commented-out/disabled changes to testsuite/src/test_gemm.c relating to debugging.	2021-05-31 16:50:18 -05:00
RuQing Xu	5fc93e2806	Armv8A Rename Regs for Safe Darwin Compile Avoid x18 use in FP32 kernel: - C address lines x[18-26] renamed to x[19-27] (reg index +1) - Original role of x27 fulfilled by x5 which is free after k-loop pert. FP64 does not require changing since x18 is not used there.	2021-05-29 18:44:47 +09:00
RuQing Xu	9f4a4a3cfb	Armv8A Rename Regs for Clang Compile: FP32 Part Roughly the same as `916e1fa` , additionally with x15 clobbering removed. - x15: Not used at all. Compilation w/ Clang shows warning about x18 reservation, but compilation itself is OK and all tests got passed.	2021-05-29 17:21:28 +09:00
RuQing Xu	916e1fa8be	Armv8A Rename Regs for Clang Compile: FP64 Part - x7, x8: Used to store address for Alpha and Beta. As Alpha & Beta was not used in k-loops, use x0, x1 to load Alpha & Beta's addresses after k-loops are completed, since A & B's addresses are no longer needed there. This "ldr [addr]; -> ldr val, [addr]" would not cause much performance drawback since it is done outside k-loops and there are plenty of instructions between Alpha & Beta's loading and usage. - x9: Used to store cs_c. x9 is multiplied by 8 into x10 and not used any longer. Directly loading cs_c into x10 and scaling by 8 spares x9 straightforwardly. - x11, x12: Not used at all. Simply remove from clobber list. - x13: Like x9, loaded and scaled by 8 into x14, except that x13 is also used in a conditional branch so that "cmp x13, #1" needs to be modified into "cmp x14, #8" to completely free x13. - x3, x4: Used to store next_a & next_b. Untouched in k-loops. Load these addresses into x0 and x1 after Alpha & Beta are both loaded, since then neither address of A/B nor address of Alpha/Beta is needed.	2021-05-29 16:46:52 +09:00
RuQing Xu	7fabd896af	Asm Flag Mingling for Darwin_Aarch64 Apple+Arm64 requires additional "tagging" of local symbols.	2021-05-29 16:28:03 +09:00
RuQing Xu	61584deddf	Added 512b SVE-based a64fx subconfig + SVE kernels. Details: - Added 512-bit specific 'a64fx' subconfiguration that uses empirically tuned block size by Stepan Nassyr. This subconfig also sets the sector cache size and enables memory-tagging code in SVE gemm kernels. This subconfig utilizes (16, k) and (10, k) DPACKM kernels. - Added a vector-length agnostic 'armsve' subconfiguration that computes blocksizes according to the analytical model. This part is ported from Stepan Nassyr's repository. - Implemented vector-length-agnostic [d/s/sh] gemm kernels for Arm SVE at size (2*VL, 10). These kernels use unindexed FMLA instructions because indexed FMLA takes 2 FMA units in many implementations. PS: There are indexed-FMLA kernels in Stepan Nassyr's repository. - Implemented 512-bit SVE dpackm kernels with in-register transpose support for sizes (16, k) and (10, k). - Extended 256-bit SVE dpackm kernels by Linaro Ltd. to 512-bit for size (12, k). This dpackm kernel is not currently used by any subconfiguration. - Implemented several experimental dgemmsup kernels which would improve performance in a few cases. However, those dgemmsup kernels generally underperform hence they are not currently used in any subconfig. - Note: This commit squashes several commits submitted by RuQing Xu via PR #424.	2021-05-19 09:52:29 -05:00
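
Several commits in this list (40baf83f0e, f5c03e9fe8, ae0eeeaf77) add explicit handling for beta == 0. The usual reason in BLAS-style kernels is that, when beta is zero, C must be overwritten rather than scaled, because 0 * NaN and 0 * Inf are NaN under IEEE 754 and stale contents of C would leak into the result. The following is a minimal C sketch of that idea only; `update_c` and its parameters are hypothetical names for illustration and are not taken from the BLIS sources.

```c
#include <stddef.h>

/* Hypothetical reference update (not BLIS code):
 *   C := beta * C + alpha * AB
 * for an m x n microtile. `ab` holds the accumulated product in
 * row-major order; `c` is addressed with row stride rs_c and
 * column stride cs_c. */
void update_c(size_t m, size_t n,
              double alpha, const double *ab,
              double beta, double *c, size_t rs_c, size_t cs_c)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double *cij = &c[i * rs_c + j * cs_c];
            double t = alpha * ab[i * n + j];
            if (beta == 0.0) {
                /* Overwrite: the old value of C is never read, so a
                 * NaN or Inf stored there cannot propagate. */
                *cij = t;
            } else {
                *cij = beta * (*cij) + t;
            }
        }
    }
}
```

With beta == 0 the branch stores alpha * AB directly, so an uninitialized or NaN-filled output buffer still yields a finite result; the general branch is taken only when the old contents of C are actually needed.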

323 Commits