Details:
- Fixed a compile-time error in bli_init.c when compiling with OSX's
clang. This error was introduced in 868b901, which introduced a
post-declaration struct assignment where the RHS was a struct
initialization expression (i.e. { ... }). This use of struct
initializer expressions apparently works with gcc despite it not
being strict C99. The fix included in this commit declares a temporary
variable for the purposes of being initialized to the desired value,
via the struct initializer, and then copies the temporary struct (via
'=' struct assignment) to the persistent struct. Thanks to Devin
Matthews for his help with this.
Details:
- Accept either 'clang' or 'LLVM' in vendor string when greping for
the version number (after determining that we're working with clang).
Thanks to Devin Matthews for this fix.
Details:
- Fixes a rather obvious bug that resulted in segmentation fault
whenever the calling application tried to re-initialize BLIS after
its first init/finalize cycle. The bug resulted from the fact that
the bli_init.c APIs made no effort to allow bli_init() to be called
subsequent times at all due to it, and bli_finalize(), being
implemented in terms of pthread_once(). This has been fixed by
resetting the pthread_once_t control variable for initialization
at the end of bli_finalize_apis(), and by resetting the control
variable for finalization at the end of bli_init_apis(). Thanks to
@lschork2 for reporting this issue (#525), and to Minh Quan Ho and
Devin Matthews for suggesting the chosen solution.
- CREDITS file update.
Details:
- Added single-threaded and multithreaded performance results to
docs/Performance.md. These results were gathered on a Graviton2
Neoverse N1 server. Special thanks to Nicholai Tukanov for
collecting these results via the Arm-HPC/AWS hackaton.
- Corrected what was supposed to be a temporary tweak to the legend
labels in test/3/octave/plot_l3_perf.m.
Details:
- Added vzeroupper instruction to the end of all 'gemm' and 'gemmtrsm'
microkernels so as to avoid a performance penalty when mixing AVX
and SSE instructions. These vzeroupper instructions were once part
of the haswell kernels, but were inadvertently removed during a source
code shuffle some time ago when we were managing duplicate 'haswell'
and 'zen' kernel sets. Thanks to Devin Matthews for tracking this down
and re-inserting the missing instructions.
Details:
- Thanks to Chengguo Sun for submitting #515 (5ef7f68).
- Thanks to Andrew Wildman for submitting #519 (551c6b4).
- Whitespace update to configure (spaces to tabs).
Details:
- Fixed kernel list string substitution error by adding function substitute_words in configure script.
if the string contains zen and zen2, and zen need to be replaced with another string, then zen2
also be incorrectly replaced.
- Add a testsuite for gathering performance (in GFLOPs) and measuring correctness for the POWER10 GEMM reduced precision/integer kernels.
- Reworked GENERIC_GEMM template to hardcode the cache parameters.
- Remove kernel wrapper that checked that only allowed matrices that weren't transposed or conjugated. However, the kernels still assume the matrices are not transposed. This wrapper was removed for performance reasons.
- Renamed and restructured files and functions for clarity.
- Editted the POWER10 document to reflect new changes.
Details:
- Removed code from gemmlike sandbox files bls_gemm_bp_var1.c and
bls_gemm_bp_var2.c that initializes the elements of the temporary
microtile to zero. This code, introduced recently in 7f7d726, did
not actually fix any bug (despite that commit's log entry). The
microtile does not need to be initialized because it is completely
overwritten by a "beta = 0" invocation of gemm prior to it being
read. Any NaNs or Infs present at the outset would have no impact
on the output matrix C. Thanks to Devin Matthews for reminding me
of this.
Details:
- Adjusted the top-level Makefile so that any change to a sandbox header
file will result in blis.h being regenerated along with a full
recompilation of the library. Previously, sandbox files were omitted
from the list of header files that, when touched, could trigger a full
rebuild. Why was it like that previously? Because originally we only
envisioned using sandboxes to *replace* gemm, not augment the library
with new functionality. When replacing gemm, blis.h does not need to
contain any local sandbox defintions in order for the user to be able
to (indirectly) use that sandbox. But if you are adding functions to
the library, those functions need to be prototyped so the compiler
can perform type checking against the user's invocation of those new
functions. Thanks to Jeff Diamond for helping us discover this
deficiency in the build system.
Details:
- Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and
bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the
kappa scalar was incorrectly loaded at an offset of 8 bytes (instead
of 4 bytes) from the real component. This was almost certainly a copy-
paste bug carried over from the corresonding zpackm kernels. Thanks to
Devin Matthews for bringing this to my attention.
- Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and
bls_gemm_bp_var2.c that initializes the elements of the temporary
microtile to zero. (This bug was never observed in output but rather
noticed analytically. It probably would have also manifested as
intermittent failures, this time involving edge cases.)
- Minor commented-out/disabled changes to testsuite/src/test_gemm.c
relating to debugging.
Avoid x18 use in FP32 kernel:
- C address lines x[18-26] renamed to x[19-27] (reg index +1)
- Original role of x27 fulfilled by x5 which is free after k-loop pert.
FP64 does not require changing since x18 is not used there.
Roughly the same as 916e1fa , additionally with x15 clobbering removed.
- x15: Not used at all.
Compilation w/ Clang shows warning about x18 reservation, but
compilation itself is OK and all tests got passed.
- x7, x8: Used to store address for Alpha and Beta.
As Alpha & Beta was not used in k-loops, use x0, x1 to load
Alpha & Beta's addresses after k-loops are completed, since A & B's
addresses are no longer needed there.
This "ldr [addr]; -> ldr val, [addr]" would not cause much performance
drawback since it is done outside k-loops and there are plenty of
instructions between Alpha & Beta's loading and usage.
- x9: Used to store cs_c. x9 is multiplied by 8 into x10 and not used
any longer. Directly loading cs_c and into x10 and scale by 8 spares
x9 straightforwardly.
- x11, x12: Not used at all. Simply remove from clobber list.
- x13: Alike x9, loaded and scaled by 8 into x14, except that x13 is
also used in a conditional branch so that "cmp x13, #1" needs to be
modified into "cmp x14, #8" to completely free x13.
- x3, x4: Used to store next_a & next_b. Untouched in k-loops. Load
these addresses into x0 and x1 after Alpha & Beta are both loaded,
since then neigher address of A/B nor address of Alpha/Beta is needed.
Details:
- Added a new sandbox called 'gemmlike', which implements sequential and
multithreaded gemm in the style of gemmsup but also unconditionally
employs packing. The purpose of this sandbox is to
(1) avoid select abstractions, such as objects and control trees, in
order to allow readers to better understand how a real-world
implementation of high-performance gemm can be constructed;
(2) provide a starting point for expert users who wish to build
something that is gemm-like without "reinventing the wheel."
Thanks to Jeff Diamond, Tze Meng Low, Nicholai Tukanov, and Devangi
Parikh for requesting and inspiring this work.
- The functions defined in this sandbox currently use the "bls_" prefix
instead of "bli_" in order to avoid any symbol collisions in the main
library.
- The sandbox contains two variants, each of which implements gemm via a
block-panel algorithm. The only difference between the two is that
variant 1 calls the microkernel directly while variant 2 calls the
microkernel indirectly, via a function wrapper, which allows the edge
case handling to be abstracted away from the classic five loops.
- This sandbox implementation utilizes the conventional gemm microkernel
(not the skinny/unpacked gemmsup kernels).
- Updated some typos in the comments of a few files in the main
framework.
Details:
- Updated the performance graphs (pdfs and pngs) for the Fugaku/a64fx
entry within Performance.md, and also updated the experiment details
accordingly. Thanks to RuQing Xu for re-running the BLIS and SSL2
experiments reflected in this commit.
- In Performance.md, added an English translation of the project name
under which the Fugaku results were gathered, courtesy of RuQing Xu.
AMD requested removal due to unclear licensing terms; original code was from stackoverflow. The function is unused but could easily be replaced by new implementation.
- ArmSVE don't test gemmt (seems Qemu-only problem);
- Clang use TravisCI-provided version instead of fixing to clang-8
due to that clang-8 seems conflicting with TravisCI's clang-7.
- Updated distro to 20.04 focal aarch64-gcc-10.
This is minimal version required by aarch64-gcc-10.
SVE intrinsics would not compile without GCC >=10.
- x86 toolchains use official repo instead of ubuntu-toolchain-r/test.
20.04 focal is not supported by that PPA at the moment.
- Add extra configuration-time options to .travis.yml.
- Add Arm SVE entry to .travis.yml.
Details:
- Added 512-bit specific 'a64fx' subconfiguration that uses empirically
tuned block size by Stepan Nassyr. This subconfig also sets the sector
cache size and enables memory-tagging code in SVE gemm kernels. This
subconfig utilizes (16, k) and (10, k) DPACKM kernels.
- Added a vector-length agnostic 'armsve' subconfiguration that computes
blocksizes according to the analytical model. This part is ported from
Stepan Nassyr's repository.
- Implemented vector-length-agnostic [d/s/sh] gemm kernels for Arm SVE
at size (2*VL, 10). These kernels use unindexed FMLA instructions
because indexed FMLA takes 2 FMA units in many implementations.
PS: There are indexed-FLMA kernels in Stepan Nassyr's repository.
- Implemented 512-bit SVE dpackm kernels with in-register transpose
support for sizes (16, k) and (10, k).
- Extended 256-bit SVE dpackm kernels by Linaro Ltd. to 512-bit for
size (12, k). This dpackm kernel is not currently used by any
subconfiguration.
- Implemented several experimental dgemmsup kernels which would
improve performance in a few cases. However, those dgemmsup kernels
generally underperform hence they are not currently used in any
subconfig.
- Note: This commit squashes several commits submitted by RuQing Xu via
PR #424.