amd/blis - blis - Public git mirror


mirror of https://github.com/amd/blis.git synced 2026-05-13 18:52:14 +00:00

Author	SHA1	Message	Date
RuQing Xu	df40efe8fb	Armv8-A Add Part of GEMMSUP 8x4m Kernel - Compile w/ both GCC & Clang - Only block part is implemented. Edge cases WIP - Not optimal kernel scheme. Should do 4x8 instead	2021-08-13 02:40:06 +09:00
RuQing Xu	6639999288	Armv8A DGEMM 4x4 Kernel WIP. Slow Quite slow.	2021-08-13 02:40:06 +09:00
RuQing Xu	a29c16394c	Armv8-A Add 8x4 Kernel WIP Test result: a bit lower GFLOPS than 6x8.	2021-08-13 02:40:04 +09:00
Field G. Van Zee	a32257eeab	Fixed bli_init.c compile-time error on OSX clang. Details: - Fixed a compile-time error in bli_init.c when compiling with OSX's clang. This error was introduced in `868b901`, which introduced a post-declaration struct assignment where the RHS was a struct initialization expression (i.e. { ... }). This use of struct initializer expressions apparently works with gcc despite it not being strict C99. The fix included in this commit declares a temporary variable for the purposes of being initialized to the desired value, via the struct initializer, and then copies the temporary struct (via '=' struct assignment) to the persistent struct. Thanks to Devin Matthews for his help with this.	2021-08-05 16:23:02 -05:00
Field G. Van Zee	c8728cfbd1	Fixed configure breakage on OSX clang. Details: - Accept either 'clang' or 'LLVM' in vendor string when grepping for the version number (after determining that we're working with clang). Thanks to Devin Matthews for this fix.	2021-08-05 15:17:09 -05:00
Field G. Van Zee	868b90138e	Fixed one-time use property of bli_init() (#525 ). Details: - Fixes a rather obvious bug that resulted in segmentation fault whenever the calling application tried to re-initialize BLIS after its first init/finalize cycle. The bug resulted from the fact that the bli_init.c APIs made no effort to allow bli_init() to be called subsequent times at all due to it, and bli_finalize(), being implemented in terms of pthread_once(). This has been fixed by resetting the pthread_once_t control variable for initialization at the end of bli_finalize_apis(), and by resetting the control variable for finalization at the end of bli_init_apis(). Thanks to @lschork2 for reporting this issue (#525), and to Minh Quan Ho and Devin Matthews for suggesting the chosen solution. - CREDITS file update.	2021-08-04 18:31:01 -05:00
Field G. Van Zee	8dba1e752c	CREDITS file update.	2021-07-27 12:38:24 -05:00
Field G. Van Zee	cc9206df66	Added Graviton2 Neoverse N1 performance results. Details: - Added single-threaded and multithreaded performance results to docs/Performance.md. These results were gathered on a Graviton2 Neoverse N1 server. Special thanks to Nicholai Tukanov for collecting these results via the Arm-HPC/AWS hackathon. - Corrected what was supposed to be a temporary tweak to the legend labels in test/3/octave/plot_l3_perf.m.	2021-07-16 15:48:37 -05:00
Devin Matthews	fab5c86d68	Merge pull request #516 from nicholaiTukanov/p10-sandbox-rework P10 sandbox rework	2021-07-13 16:46:21 -05:00
Devin Matthews	84f9dcd449	Remove unnecessary windows/zen2 directory.	2021-07-13 16:45:44 -05:00
Field G. Van Zee	21911d6ed3	Merge branch 'dev'	2021-07-09 18:10:46 -05:00
Devin Matthews	17729cf449	Add vzeroupper to Haswell microkernels. (#524) Details: - Added vzeroupper instruction to the end of all 'gemm' and 'gemmtrsm' microkernels so as to avoid a performance penalty when mixing AVX and SSE instructions. These vzeroupper instructions were once part of the haswell kernels, but were inadvertently removed during a source code shuffle some time ago when we were managing duplicate 'haswell' and 'zen' kernel sets. Thanks to Devin Matthews for tracking this down and re-inserting the missing instructions.	2021-07-09 14:59:48 -05:00
Devin Matthews	c9a7f59aa8	Merge pull request #522 from flame/windows-avx512 Fix Win64 AVX512 bug.	2021-07-08 14:00:38 -05:00
Devin Matthews	9a8e649c5a	Fix Win64 AVX512 bug. Use `-march=haswell` for kernels. Fixes #514.	2021-07-08 11:40:00 -05:00
Devin Matthews	75f03907c5	Add comment about make checkblas on Windows [ci skip]	2021-07-07 15:44:11 -05:00
Devin Matthews	4651583b12	Merge pull request #520 from flame/travis-ci-install Test installation in Travis CI	2021-07-07 01:11:20 -05:00
Field G. Van Zee	69205ac266	CREDITS file update. Details: - Thanks to Chengguo Sun for submitting #515 (`5ef7f68`). - Thanks to Andrew Wildman for submitting #519 (`551c6b4`). - Whitespace update to configure (spaces to tabs).	2021-07-06 20:39:22 -05:00
Devin Matthews	174f7fc9a1	Test installation in Travis CI	2021-07-06 19:35:55 -05:00
Devin Matthews	551c6b4ee8	Merge pull request #519 from awild82/oot_build_bugfix Fix installation from out-of-tree builds	2021-07-06 19:32:53 -05:00
Andrew Wildman	f648df4e55	Add symlink to blis.pc.in for out-of-tree builds	2021-07-06 16:35:12 -07:00
Devin Matthews	78eac6a0ab	Revert "Always run `make check`." This reverts commit `a201a53440`.	2021-07-06 11:05:43 -05:00
Devin Matthews	a201a53440	Always run `make check`. I'm concerned that problems may lurk for `x86_64` builds on Windows which may be uncovered by a fuller `make check`.	2021-07-05 21:39:18 -05:00
Devin Matthews	5ef7f684dc	Merge pull request #515 from chengguosun/bug-fix Fixed configure script bug.	2021-07-05 21:35:07 -05:00
sunchengguo	ad6231cca3	Fixed configure script bug. Details: - Fixed a kernel list string substitution error by adding the function substitute_words to the configure script. Previously, if the string contained both zen and zen2 and zen needed to be replaced with another string, then zen2 was also incorrectly replaced.	2021-07-06 07:30:00 -04:00
nicholaiTukanov	d073fc9aca	Update POWER10.md	2021-07-02 19:54:33 -05:00
nicholaiTukanov	907226c0af	Rework POWER10 sandbox - Add a testsuite for gathering performance (in GFLOPs) and measuring correctness for the POWER10 GEMM reduced precision/integer kernels. - Reworked GENERIC_GEMM template to hardcode the cache parameters. - Removed the kernel wrapper that allowed only matrices that weren't transposed or conjugated. However, the kernels still assume the matrices are not transposed. This wrapper was removed for performance reasons. - Renamed and restructured files and functions for clarity. - Edited the POWER10 document to reflect new changes.	2021-07-02 19:47:18 -05:00
Field G. Van Zee	aaa10c87e1	Skip clearing temp microtile in gemmlike sandbox. Details: - Removed code from gemmlike sandbox files bls_gemm_bp_var1.c and bls_gemm_bp_var2.c that initializes the elements of the temporary microtile to zero. This code, introduced recently in `7f7d726`, did not actually fix any bug (despite that commit's log entry). The microtile does not need to be initialized because it is completely overwritten by a "beta = 0" invocation of gemm prior to it being read. Any NaNs or Infs present at the outset would have no impact on the output matrix C. Thanks to Devin Matthews for reminding me of this.	2021-06-21 17:53:52 -05:00
Devin Matthews	bc10a3f2ff	Merge pull request #492 from flame/thunderx2-clang Allow clang for ThunderX2 config	2021-06-18 19:01:08 -05:00
Devin Matthews	bf72763663	Merge pull request #506 from xrq-phys/arm64-mac BLIS on Darwin_Aarch64	2021-06-18 18:59:43 -05:00
Devin Matthews	e28f2a2dfc	Merge pull request #513 from nicholaiTukanov/asm_warning_p9_fix Fix assembler warning in POWER9 DGEMM	2021-06-15 19:35:07 -05:00
nicholai	56ffca6a9b	Fix asm warning	2021-06-15 18:17:39 -05:00
Field G. Van Zee	689fa0f403	Merge branch 'master' into dev	2021-06-13 19:44:14 -05:00
Field G. Van Zee	d10e05bbd1	Sandbox header edits trigger full library rebuild. Details: - Adjusted the top-level Makefile so that any change to a sandbox header file will result in blis.h being regenerated along with a full recompilation of the library. Previously, sandbox files were omitted from the list of header files that, when touched, could trigger a full rebuild. Why was it like that previously? Because originally we only envisioned using sandboxes to replace gemm, not augment the library with new functionality. When replacing gemm, blis.h does not need to contain any local sandbox definitions in order for the user to be able to (indirectly) use that sandbox. But if you are adding functions to the library, those functions need to be prototyped so the compiler can perform type checking against the user's invocation of those new functions. Thanks to Jeff Diamond for helping us discover this deficiency in the build system.	2021-06-13 19:36:16 -05:00
Devin Matthews	7c3eb44efa	Add vhsubpd/vhsubpd. Horizontal subtraction instructions added to bli_x86_asm_macros.h, currently unused [ci skip].	2021-06-02 11:28:22 -05:00
Field G. Van Zee	7f7d72610c	Fixed bugs in cpackm kernels, gemmlike code. Details: - Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the kappa scalar was incorrectly loaded at an offset of 8 bytes (instead of 4 bytes) from the real component. This was almost certainly a copy-paste bug carried over from the corresponding zpackm kernels. Thanks to Devin Matthews for bringing this to my attention. - Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and bls_gemm_bp_var2.c that initializes the elements of the temporary microtile to zero. (This bug was never observed in output but rather noticed analytically. It probably would have also manifested as intermittent failures, this time involving edge cases.) - Minor commented-out/disabled changes to testsuite/src/test_gemm.c relating to debugging.	2021-05-31 16:50:18 -05:00
RuQing Xu	5fc93e2806	Armv8A Rename Regs for Safe Darwin Compile Avoid x18 use in FP32 kernel: - C address lines x[18-26] renamed to x[19-27] (reg index +1) - Original role of x27 fulfilled by x5 which is free after k-loop pert. FP64 does not require changing since x18 is not used there.	2021-05-29 18:44:47 +09:00
RuQing Xu	9f4a4a3cfb	Armv8A Rename Regs for Clang Compile: FP32 Part Roughly the same as `916e1fa` , additionally with x15 clobbering removed. - x15: Not used at all. Compilation w/ Clang shows warning about x18 reservation, but compilation itself is OK and all tests got passed.	2021-05-29 17:21:28 +09:00
RuQing Xu	916e1fa8be	Armv8A Rename Regs for Clang Compile: FP64 Part - x7, x8: Used to store address for Alpha and Beta. As Alpha & Beta are not used in k-loops, use x0, x1 to load Alpha & Beta's addresses after k-loops are completed, since A & B's addresses are no longer needed there. This "ldr [addr]; -> ldr val, [addr]" would not cause much performance drawback since it is done outside k-loops and there are plenty of instructions between Alpha & Beta's loading and usage. - x9: Used to store cs_c. x9 is multiplied by 8 into x10 and not used any longer. Directly loading cs_c into x10 and scaling by 8 frees x9 straightforwardly. - x11, x12: Not used at all. Simply remove from clobber list. - x13: Like x9, loaded and scaled by 8 into x14, except that x13 is also used in a conditional branch so that "cmp x13, #1" needs to be modified into "cmp x14, #8" to completely free x13. - x3, x4: Used to store next_a & next_b. Untouched in k-loops. Load these addresses into x0 and x1 after Alpha & Beta are both loaded, since then neither the address of A/B nor the address of Alpha/Beta is needed.	2021-05-29 16:46:52 +09:00
RuQing Xu	7fabd896af	Asm Flag Mingling for Darwin_Aarch64 Apple+Arm64 requires additional "tagging" of local symbols.	2021-05-29 16:28:03 +09:00
Field G. Van Zee	213dce32d2	Added a new 'gemmlike' sandbox. Details: - Added a new sandbox called 'gemmlike', which implements sequential and multithreaded gemm in the style of gemmsup but also unconditionally employs packing. The purpose of this sandbox is to (1) avoid select abstractions, such as objects and control trees, in order to allow readers to better understand how a real-world implementation of high-performance gemm can be constructed; (2) provide a starting point for expert users who wish to build something that is gemm-like without "reinventing the wheel." Thanks to Jeff Diamond, Tze Meng Low, Nicholai Tukanov, and Devangi Parikh for requesting and inspiring this work. - The functions defined in this sandbox currently use the "bls_" prefix instead of "bli_" in order to avoid any symbol collisions in the main library. - The sandbox contains two variants, each of which implements gemm via a block-panel algorithm. The only difference between the two is that variant 1 calls the microkernel directly while variant 2 calls the microkernel indirectly, via a function wrapper, which allows the edge case handling to be abstracted away from the classic five loops. - This sandbox implementation utilizes the conventional gemm microkernel (not the skinny/unpacked gemmsup kernels). - Updated some typos in the comments of a few files in the main framework.	2021-05-28 14:49:57 -05:00
Field G. Van Zee	82af05f54c	Updated Fugaku (a64fx) performance results. Details: - Updated the performance graphs (pdfs and pngs) for the Fugaku/a64fx entry within Performance.md, and also updated the experiment details accordingly. Thanks to RuQing Xu for re-running the BLIS and SSL2 experiments reflected in this commit. - In Performance.md, added an English translation of the project name under which the Fugaku results were gathered, courtesy of RuQing Xu.	2021-05-25 15:25:08 -05:00
Devin Matthews	e5c85da376	Merge pull request #503 from flame/windows-compiler-check Add explicit compiler check for Windows.	2021-05-24 16:56:22 -05:00
Devin Matthews	cbd8d39325	Merge pull request #500 from xrq-phys/armsve+travis Upgrade Travis CI for Arm SVE	2021-05-24 16:32:42 -05:00
Devin Matthews	5feb04e233	Add explicit compiler check for Windows. Check the C compiler for a predefined macro `_WIN32` to indicate (cross-)compilation for Windows. Fixes #463.	2021-05-23 18:46:56 -05:00
Devin Matthews	6d4ab0223d	Merge pull request #502 from flame/rm-rm-dupls Remove `rm-dupls` function in common.mk.	2021-05-23 18:39:53 -05:00
Devin Matthews	859fb77a32	Remove `rm-dupls` function in common.mk. AMD requested removal due to unclear licensing terms; original code was from stackoverflow. The function is unused but could easily be replaced by new implementation.	2021-05-23 18:15:23 -05:00
RuQing Xu	932dfe6abb	Travis CI Revert Unnecessary Extras from `91d3636` - Removed `V=1` in make line - Removed `CFLAGS` in configure line - Restored `pwd` surrounding OOT line	2021-05-20 02:07:31 +09:00
RuQing Xu	bd156a210d	Adjust TravisCI - ArmSVE don't test gemmt (seems Qemu-only problem); - Clang use TravisCI-provided version instead of fixing to clang-8 due to that clang-8 seems conflicting with TravisCI's clang-7.	2021-05-20 00:52:04 +09:00
RuQing Xu	91d3636031	Travis Support Arm SVE - Updated distro to 20.04 focal aarch64-gcc-10. This is minimal version required by aarch64-gcc-10. SVE intrinsics would not compile without GCC >=10. - x86 toolchains use official repo instead of ubuntu-toolchain-r/test. 20.04 focal is not supported by that PPA at the moment. - Add extra configuration-time options to .travis.yml. - Add Arm SVE entry to .travis.yml.	2021-05-20 00:52:01 +09:00
RuQing Xu	61584deddf	Added 512b SVE-based a64fx subconfig + SVE kernels. Details: - Added 512-bit specific 'a64fx' subconfiguration that uses empirically tuned block sizes by Stepan Nassyr. This subconfig also sets the sector cache size and enables memory-tagging code in SVE gemm kernels. This subconfig utilizes (16, k) and (10, k) DPACKM kernels. - Added a vector-length agnostic 'armsve' subconfiguration that computes blocksizes according to the analytical model. This part is ported from Stepan Nassyr's repository. - Implemented vector-length-agnostic [d/s/sh] gemm kernels for Arm SVE at size (2*VL, 10). These kernels use unindexed FMLA instructions because indexed FMLA takes 2 FMA units in many implementations. PS: There are indexed-FMLA kernels in Stepan Nassyr's repository. - Implemented 512-bit SVE dpackm kernels with in-register transpose support for sizes (16, k) and (10, k). - Extended 256-bit SVE dpackm kernels by Linaro Ltd. to 512-bit for size (12, k). This dpackm kernel is not currently used by any subconfiguration. - Implemented several experimental dgemmsup kernels which would improve performance in a few cases. However, those dgemmsup kernels generally underperform, hence they are not currently used in any subconfig. - Note: This commit squashes several commits submitted by RuQing Xu via PR #424.	2021-05-19 09:52:29 -05:00
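
The re-initialization fix described in `868b90138e`, together with the follow-up compile fix in `a32257eeab`, can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the BLIS source: `lib_init`, `lib_finalize`, `demo_cycles`, and the counters are hypothetical names. Each entry point runs its body through `pthread_once()` and then re-arms the other entry point's control variable. The re-arm copies from a temporary because `PTHREAD_ONCE_INIT` may expand to a brace initializer, which is not valid on the right-hand side of a plain assignment. Re-arming a control variable by assignment is not something POSIX promises to support, so the sketch should be read as showing the idea rather than a portable guarantee, and it makes no attempt to be safe against concurrent callers.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins; these are not the BLIS identifiers. */
static pthread_once_t once_init     = PTHREAD_ONCE_INIT;
static pthread_once_t once_finalize = PTHREAD_ONCE_INIT;
static int init_count     = 0;  /* times the init body actually ran */
static int finalize_count = 0;  /* times the finalize body actually ran */

static void do_init(void)     { init_count++; }
static void do_finalize(void) { finalize_count++; }

void lib_init(void)
{
    pthread_once(&once_init, do_init);

    /* Re-arm finalization. Initialize a temporary with the macro, then
       copy it with ordinary struct assignment. */
    const pthread_once_t fresh = PTHREAD_ONCE_INIT;
    once_finalize = fresh;
}

void lib_finalize(void)
{
    pthread_once(&once_finalize, do_finalize);

    /* Re-arm initialization so a later lib_init() runs its body again. */
    const pthread_once_t fresh = PTHREAD_ONCE_INIT;
    once_init = fresh;
}

/* Runs two init/finalize cycles; returns how many times init ran. */
int demo_cycles(void)
{
    lib_init();
    lib_init();      /* no-op: the control variable is already used */
    lib_finalize();
    lib_init();      /* runs again because finalize re-armed it */
    lib_finalize();
    assert(finalize_count == 2);
    return init_count;
}
```

Calling `demo_cycles()` once from a fresh process exercises a repeated init/finalize cycle, which is the scenario reported in #525.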

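The kappa offset bug described in `7f7d72610c` comes down to where the imaginary part sits relative to the real part in single-precision versus double-precision complex values. A small sketch of that layout, assuming a plain real/imag struct for each precision (the type and function names here are illustrative assumptions, not the BLIS definitions):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative complex layouts; the names are assumptions. */
typedef struct { float  real; float  imag; } cfloat_t;
typedef struct { double real; double imag; } cdouble_t;

/* Byte offset of the imaginary part from the start of the value. */
size_t imag_offset_single(void) { return offsetof(cfloat_t,  imag); }
size_t imag_offset_double(void) { return offsetof(cdouble_t, imag); }

/* Checks that each offset equals the size of one real component. */
void check_offsets(void)
{
    assert(imag_offset_single() == sizeof(float));
    assert(imag_offset_double() == sizeof(double));
}
```

With 4-byte `float` and 8-byte `double`, assembly copied from a double-precision kernel into a single-precision one would read the imaginary part 8 bytes in rather than 4, which matches the symptom the commit message describes.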

2045 Commits