amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-13 10:35:38 +00:00

Author	SHA1	Message	Date
RuQing Xu	40baf83f0e	Armv8 Handle *beta == 0 for GEMMSUP ??r Case.	2021-10-06 01:00:52 +09:00
Devin Matthews	079fbd42ce	Merge branch 'master' into arm64-hi-bw	2021-10-04 17:21:48 -05:00
Devin Matthews	80c5366e4a	Move unused ARM SVE kernels to "old" directory.	2021-10-04 15:40:28 -05:00
RuQing Xu	f5c03e9fe8	Armv8 Handle *beta == 0 for GEMMSUP ?rc Case.	2021-10-03 16:51:51 +09:00
RuQing Xu	abc648352c	Armv8 Fix 6x8 Row-Maj Ukr - Fixed for 6x8 only, 4x4 & 4x8 pending; - Installed to config firestorm as benchmark seems to show better perf: Old: blis_dgemm_ukr_c 6 8 320 36.87 2.43e-17 PASS; blis_dgemm_ukr_c 6 8 352 40.55 1.04e-17 PASS; blis_dgemm_ukr_c 6 8 384 44.24 5.68e-17 PASS; blis_dgemm_ukr_c 6 8 416 41.67 3.51e-17 PASS; blis_dgemm_ukr_c 6 8 448 34.41 2.94e-17 PASS; blis_dgemm_ukr_c 6 8 480 42.53 2.35e-17 PASS. New: blis_dgemm_ukr_r 6 8 352 50.69 1.59e-17 PASS; blis_dgemm_ukr_r 6 8 384 49.15 5.55e-17 PASS; blis_dgemm_ukr_r 6 8 416 50.44 2.86e-17 PASS; blis_dgemm_ukr_r 6 8 448 46.92 3.12e-17 PASS; blis_dgemm_ukr_r 6 8 480 48.08 4.08e-17 PASS.	2021-10-03 13:14:19 +09:00
Devin Matthews	13dbd5b5d3	Apply patch from @xrq-phys.	2021-10-02 16:08:05 -05:00
Devin Matthews	ae0eeeaf77	Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs.	2021-09-29 16:43:38 -05:00
Devin Matthews	e3dc1954ff	Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell. The fix is to use the same (valid) source register twice in the horizontal addition.	2021-09-16 10:59:37 -05:00
Devin Matthews	5191c43fac	Fix more copy-paste errors in the haswell gemmsup code. Fixes #486.	2021-09-16 10:16:17 -05:00
RuQing Xu	820f11a469	Arm Whole GEMMSUP Call Route is Asm/Int Optimized - `ref2` call in `bli_gemmsup_rv_armv8a_asm_d6x8m.c` is commented out. - `bli_gemmsup_rv_armv8a_asm_d4x8m.c` contains a tail `ref2` call but it's not called by any upper routine.	2021-08-27 13:40:26 +09:00
RuQing Xu	7e2951e61f	Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref Ref cannot handle panel strides (packed cases) thus cannot be called from the beginning of `gemmsup` (i.e. cannot be dispatch target of gemmsup to other sizes.)	2021-08-23 17:06:44 +09:00
RuQing Xu	4fd82b0e93	Header Typo	2021-08-23 05:18:32 +09:00
RuQing Xu	35409ebe67	Arm: DGEMMSUP ??r(rv) Invoke Edge Size Plus some fix at edges. TODO: Should ensure that no ref kernel appear in beginning of gemmsup kernels. As ref does not recognise panel stride.	2021-08-23 04:51:47 +09:00
RuQing Xu	a361492c24	Arm: DGEMMSUP ?rc(rd) Invoke Edge Size	2021-08-23 01:13:39 +09:00
RuQing Xu	e6799b26a6	Arm: Implement GEMMSUP Fallback Method bli_dgemmsup_rv_armv8a_int_6x4mn	2021-08-21 02:39:38 +09:00
RuQing Xu	7d5903d8d7	Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin Forgot to support `alpha`/`beta` in gemmsup_armv8a_int.	2021-08-21 01:55:50 +09:00
Devin Matthews	4f70eb7913	Clean up some warnings that show up on clang/OSX.	2021-08-13 11:12:43 -05:00
RuQing Xu	3df0e9b653	Arm64 8x4 Kernel Use Less Regs	2021-08-13 02:40:06 +09:00
RuQing Xu	4e7e225057	Armv8-A Supplementary GEMMSUP Sizes for RD	2021-08-13 02:40:06 +09:00
RuQing Xu	c792d506ba	Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm Suffixed NEON opcode is not supported by GNU assembler	2021-08-13 02:40:06 +09:00
RuQing Xu	ce44735209	Armv8-A Adjust Types for PACKM Kernels GCC does not have full NEON intrinsics support.	2021-08-13 02:40:06 +09:00
RuQing Xu	8a32d19af8	Armv8-A GEMMSUP-RD 6x8m Armv8-A now has a complete set of GEMMSUP kernels.	2021-08-13 02:40:06 +09:00
RuQing Xu	afd0fa6ad1	Armv8-A GEMMSUP-RD 6x8n	2021-08-13 02:40:06 +09:00
RuQing Xu	3c5f740514	Armv8-A s/d Packing Kernels Fix Typo For GCC.	2021-08-13 02:40:06 +09:00
RuQing Xu	49b05df792	Armv8-A Introduced s/d Packing Kernels Sizes according to the 2014 kernels.	2021-08-13 02:40:06 +09:00
RuQing Xu	c3faf93168	Armv8-A DGEMMSUP 6x8m Kernel Recommended kernels set: ... BLIS_RRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE, BLIS_RCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE, BLIS_RCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE, BLIS_CRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE, BLIS_CCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE, BLIS_CCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE, ... bli_blksz_init ( &blkszs[ BLIS_MR ], -1, 6, -1, -1, -1, 8, -1, -1 ); bli_blksz_init_easy( &blkszs[ BLIS_NR ], -1, 8, -1, -1 ); ...	2021-08-13 02:40:06 +09:00
RuQing Xu	3efe707b55	Armv8-A DGEMMSUP Adjustments	2021-08-13 02:40:06 +09:00
RuQing Xu	8ed8f5e625	Armv8-A Add More DGEMMSUP - Add 6x8 GEMMSUP. - Adjust prefetching. - Workaround for Clang's disability to handle reg clobbering. - Subproduct 6x8 row-major GEMM <- incomplete.	2021-08-13 02:40:06 +09:00
RuQing Xu	a9ba79ea14	Armv8-A Add GEMMSUP 4x8n Kernel - Compile w/ both GCC & Clang. - Edge cases use ref-kernels. - Can give performance boost in some contexts.	2021-08-13 02:40:06 +09:00
RuQing Xu	df40efe8fb	Armv8-A Add Part of GEMMSUP 8x4m Kernel - Compile w/ both GCC & Clang - Only block part is implemented. Edge cases WIP - Not Optimal kernel scheme. Should do 4x8 instead	2021-08-13 02:40:06 +09:00
RuQing Xu	6639999288	Armv8A DGEMM 4x4 Kernel WIP. Slow Quite slow.	2021-08-13 02:40:06 +09:00
RuQing Xu	a29c16394c	Armv8-A Add 8x4 Kernel WIP Test result: a bit lower GFlOps than 6x8.	2021-08-13 02:40:04 +09:00
Field G. Van Zee	21911d6ed3	Merge branch 'dev'	2021-07-09 18:10:46 -05:00
Devin Matthews	17729cf449	Add vzeroupper to Haswell microkernels. (#524) Details: - Added vzeroupper instruction to the end of all 'gemm' and 'gemmtrsm' microkernels so as to avoid a performance penalty when mixing AVX and SSE instructions. These vzeroupper instructions were once part of the haswell kernels, but were inadvertently removed during a source code shuffle some time ago when we were managing duplicate 'haswell' and 'zen' kernel sets. Thanks to Devin Matthews for tracking this down and re-inserting the missing instructions.	2021-07-09 14:59:48 -05:00
Devin Matthews	bf72763663	Merge pull request #506 from xrq-phys/arm64-mac BLIS on Darwin_Aarch64	2021-06-18 18:59:43 -05:00
nicholai	56ffca6a9b	Fix asm warning	2021-06-15 18:17:39 -05:00
Field G. Van Zee	689fa0f403	Merge branch 'master' into dev	2021-06-13 19:44:14 -05:00
Field G. Van Zee	7f7d72610c	Fixed bugs in cpackm kernels, gemmlike code. Details: - Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the kappa scalar was incorrectly loaded at an offset of 8 bytes (instead of 4 bytes) from the real component. This was almost certainly a copy-paste bug carried over from the corresponding zpackm kernels. Thanks to Devin Matthews for bringing this to my attention. - Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and bls_gemm_bp_var2.c that initializes the elements of the temporary microtile to zero. (This bug was never observed in output but rather noticed analytically. It probably would have also manifested as intermittent failures, this time involving edge cases.) - Minor commented-out/disabled changes to testsuite/src/test_gemm.c relating to debugging.	2021-05-31 16:50:18 -05:00
RuQing Xu	5fc93e2806	Armv8A Rename Regs for Safe Darwin Compile Avoid x18 use in FP32 kernel: - C address lines x[18-26] renamed to x[19-27] (reg index +1) - Original role of x27 fulfilled by x5 which is free after k-loop part. FP64 does not require changing since x18 is not used there.	2021-05-29 18:44:47 +09:00
RuQing Xu	9f4a4a3cfb	Armv8A Rename Regs for Clang Compile: FP32 Part Roughly the same as `916e1fa`, additionally with x15 clobbering removed. - x15: Not used at all. Compilation w/ Clang shows warning about x18 reservation, but compilation itself is OK and all tests got passed.	2021-05-29 17:21:28 +09:00
RuQing Xu	916e1fa8be	Armv8A Rename Regs for Clang Compile: FP64 Part - x7, x8: Used to store address for Alpha and Beta. As Alpha & Beta was not used in k-loops, use x0, x1 to load Alpha & Beta's addresses after k-loops are completed, since A & B's addresses are no longer needed there. This "ldr [addr]; -> ldr val, [addr]" would not cause much performance drawback since it is done outside k-loops and there are plenty of instructions between Alpha & Beta's loading and usage. - x9: Used to store cs_c. x9 is multiplied by 8 into x10 and not used any longer. Directly loading cs_c and into x10 and scale by 8 spares x9 straightforwardly. - x11, x12: Not used at all. Simply remove from clobber list. - x13: Alike x9, loaded and scaled by 8 into x14, except that x13 is also used in a conditional branch so that "cmp x13, #1" needs to be modified into "cmp x14, #8" to completely free x13. - x3, x4: Used to store next_a & next_b. Untouched in k-loops. Load these addresses into x0 and x1 after Alpha & Beta are both loaded, since then neither address of A/B nor address of Alpha/Beta is needed.	2021-05-29 16:46:52 +09:00
RuQing Xu	7fabd896af	Asm Flag Mingling for Darwin_Aarch64 Apple+Arm64 requires additional "tagging" of local symbols.	2021-05-29 16:28:03 +09:00
RuQing Xu	61584deddf	Added 512b SVE-based a64fx subconfig + SVE kernels. Details: - Added 512-bit specific 'a64fx' subconfiguration that uses empirically tuned block size by Stepan Nassyr. This subconfig also sets the sector cache size and enables memory-tagging code in SVE gemm kernels. This subconfig utilizes (16, k) and (10, k) DPACKM kernels. - Added a vector-length agnostic 'armsve' subconfiguration that computes blocksizes according to the analytical model. This part is ported from Stepan Nassyr's repository. - Implemented vector-length-agnostic [d/s/sh] gemm kernels for Arm SVE at size (2*VL, 10). These kernels use unindexed FMLA instructions because indexed FMLA takes 2 FMA units in many implementations. PS: There are indexed-FMLA kernels in Stepan Nassyr's repository. - Implemented 512-bit SVE dpackm kernels with in-register transpose support for sizes (16, k) and (10, k). - Extended 256-bit SVE dpackm kernels by Linaro Ltd. to 512-bit for size (12, k). This dpackm kernel is not currently used by any subconfiguration. - Implemented several experimental dgemmsup kernels which would improve performance in a few cases. However, those dgemmsup kernels generally underperform hence they are not currently used in any subconfig. - Note: This commit squashes several commits submitted by RuQing Xu via PR #424.	2021-05-19 09:52:29 -05:00
Field G. Van Zee	09bd4f4f12	Add err_t* "return" parameter to malloc functions. Details: - Added an err_t* parameter to memory allocation functions including bli_malloc_intl(), bli_calloc_intl(), bli_malloc_user(), bli_fmalloc_align(), and bli_fmalloc_noalign(). Since these functions already use the return value to return the allocated memory address, they can't communicate errors to the caller through the return value. This commit does not employ any error checking within these functions or their callers, but this sets up BLIS for a more comprehensive commit that moves in that direction. - Moved the typedefs for malloc_ft and free_ft from bli_malloc.h to bli_type_defs.h. This was done so that what remains of bli_malloc.h can be included after the definition of the err_t enum. (This ordering was needed because bli_malloc.h now contains function prototypes that use err_t.) - Defined bli_is_success() and bli_is_failure() static functions in bli_param_macro_defs.h. These functions provide easy checks for error codes and will be used more heavily in future commits. - Unfortunately, the additional err_t* argument discussed above breaks the API for bli_malloc_user(), which is an exported symbol in the shared library. However, it's quite possible that the only application that calls bli_malloc_user()--indeed, the reason it was marked for symbol exporting to begin with--is the BLIS testsuite. And if that's the case, this breakage won't affect anyone. Nonetheless, the "major" part of the so_version file has been updated accordingly to 4.0.0.	2021-03-31 17:09:36 -05:00
Field G. Van Zee	f9ad55ce7e	Merge branch 'master' into dev	2021-03-31 14:20:19 -05:00
Nicholai Tukanov	22c6b5dc4c	Fixed bug in power10 microkernel I/O. (#488) Details: - Fixed a bug in the POWER10 DGEMM kernel whereby the microkernel did not store the microtile result correctly due to incorrect indices calculations. (The error was introduced when I reorganized the 'kernels/power10/3' directory.)	2021-03-30 19:07:42 -05:00
Field G. Van Zee	3a6f41afb8	Renamed membrk files/vars/functions to pba. Details: - Renamed the files, variables, and functions relating to the packing block allocator from its legacy name (membrk) to its current name (pba). This more clearly contrasts the packing block allocator with the small block allocator (sba). - Fixed a typo in bli_pack_set_pack_b(), defined in bli_pack.c, that caused the function to erroneously change the value of the pack_a field of the global rntm_t instead of the pack_b field. (Apparently nobody has used this API yet.) - Comment updates.	2021-03-27 17:22:14 -05:00
Nicholai Tukanov	670bc7b60f	Add low-precision POWER10 gemm kernels (#467) Details: - This commit adds a new BLIS sandbox that (1) provides implementations based on low-precision gemm kernels, and (2) extends the BLIS typed API for those new implementations. Currently, these new kernels can only be used for the POWER10 microarchitecture; however, they may provide a template for developing similar kernels for other microarchitectures (even those beyond POWER), as changes would likely be limited to select places in the microkernel and possibly the packing routines. The new low-precision operations that are now supported include: shgemm, sbgemm, i16gemm, i8gemm, i4gemm. For more information, refer to the POWER10.md document that is included in 'sandbox/power10'.	2021-03-05 13:53:43 -06:00
Field G. Van Zee	f5871c7e06	Added complex asm packm kernels for 'haswell' set. Details: - Implemented assembly-based packm kernels for single- and double- precision complex domain (c and z) and housed them in the 'haswell' kernel set. This means c3xk, c8xk, z3xk, and z4xk are now all optimized. - Registered the aforementioned packm kernels in the haswell, zen, and zen2 subconfigs. - Minor modifications to the corresponding s and d packm kernels that were introduced in `426ad67`. - Thanks to AMD, who originally contributed the double-precision real packm kernels (d6xk and d8xk), upon which these complex kernels are partially based.	2021-02-28 17:03:57 -06:00
Field G. Van Zee	426ad679f5	Added assembly packm kernels for 'haswell' set. Details: - Implemented assembly-based packm kernels for single- and double- precision real domain (s and d) and housed them in the 'haswell' kernel set. This means s6xk, s16xk, d6xk, and d8xk are now all optimized. - Registered the aforementioned packm kernels in the haswell, zen, and zen2 subconfigs. - Thanks to AMD, who originally contributed the double-precision real packm kernels (d6xk and d8xk), which I have now tweaked and used to create comparable single-precision real kernels (s6xk and s16xk).	2021-02-27 18:39:56 -06:00

316 Commits