Details:
- Adjusted the top-level Makefile so that any change to a sandbox header
file will result in blis.h being regenerated along with a full
recompilation of the library. Previously, sandbox files were omitted
from the list of header files that, when touched, could trigger a full
rebuild. Why was it like that previously? Because originally we only
envisioned using sandboxes to *replace* gemm, not augment the library
with new functionality. When replacing gemm, blis.h does not need to
contain any local sandbox defintions in order for the user to be able
to (indirectly) use that sandbox. But if you are adding functions to
the library, those functions need to be prototyped so the compiler
can perform type checking against the user's invocation of those new
functions. Thanks to Jeff Diamond for helping us discover this
deficiency in the build system.
Details:
- Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and
bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the
kappa scalar was incorrectly loaded at an offset of 8 bytes (instead
of 4 bytes) from the real component. This was almost certainly a copy-
paste bug carried over from the corresonding zpackm kernels. Thanks to
Devin Matthews for bringing this to my attention.
- Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and
bls_gemm_bp_var2.c that initializes the elements of the temporary
microtile to zero. (This bug was never observed in output but rather
noticed analytically. It probably would have also manifested as
intermittent failures, this time involving edge cases.)
- Minor commented-out/disabled changes to testsuite/src/test_gemm.c
relating to debugging.
Details:
- Added a new sandbox called 'gemmlike', which implements sequential and
multithreaded gemm in the style of gemmsup but also unconditionally
employs packing. The purpose of this sandbox is to
(1) avoid select abstractions, such as objects and control trees, in
order to allow readers to better understand how a real-world
implementation of high-performance gemm can be constructed;
(2) provide a starting point for expert users who wish to build
something that is gemm-like without "reinventing the wheel."
Thanks to Jeff Diamond, Tze Meng Low, Nicholai Tukanov, and Devangi
Parikh for requesting and inspiring this work.
- The functions defined in this sandbox currently use the "bls_" prefix
instead of "bli_" in order to avoid any symbol collisions in the main
library.
- The sandbox contains two variants, each of which implements gemm via a
block-panel algorithm. The only difference between the two is that
variant 1 calls the microkernel directly while variant 2 calls the
microkernel indirectly, via a function wrapper, which allows the edge
case handling to be abstracted away from the classic five loops.
- This sandbox implementation utilizes the conventional gemm microkernel
(not the skinny/unpacked gemmsup kernels).
- Updated some typos in the comments of a few files in the main
framework.
Details:
- Updated the performance graphs (pdfs and pngs) for the Fugaku/a64fx
entry within Performance.md, and also updated the experiment details
accordingly. Thanks to RuQing Xu for re-running the BLIS and SSL2
experiments reflected in this commit.
- In Performance.md, added an English translation of the project name
under which the Fugaku results were gathered, courtesy of RuQing Xu.
AMD requested removal due to unclear licensing terms; original code was from stackoverflow. The function is unused but could easily be replaced by new implementation.
- ArmSVE don't test gemmt (seems Qemu-only problem);
- Clang use TravisCI-provided version instead of fixing to clang-8
due to that clang-8 seems conflicting with TravisCI's clang-7.
- Updated distro to 20.04 focal aarch64-gcc-10.
This is minimal version required by aarch64-gcc-10.
SVE intrinsics would not compile without GCC >=10.
- x86 toolchains use official repo instead of ubuntu-toolchain-r/test.
20.04 focal is not supported by that PPA at the moment.
- Add extra configuration-time options to .travis.yml.
- Add Arm SVE entry to .travis.yml.
Details:
- Added 512-bit specific 'a64fx' subconfiguration that uses empirically
tuned block size by Stepan Nassyr. This subconfig also sets the sector
cache size and enables memory-tagging code in SVE gemm kernels. This
subconfig utilizes (16, k) and (10, k) DPACKM kernels.
- Added a vector-length agnostic 'armsve' subconfiguration that computes
blocksizes according to the analytical model. This part is ported from
Stepan Nassyr's repository.
- Implemented vector-length-agnostic [d/s/sh] gemm kernels for Arm SVE
at size (2*VL, 10). These kernels use unindexed FMLA instructions
because indexed FMLA takes 2 FMA units in many implementations.
PS: There are indexed-FLMA kernels in Stepan Nassyr's repository.
- Implemented 512-bit SVE dpackm kernels with in-register transpose
support for sizes (16, k) and (10, k).
- Extended 256-bit SVE dpackm kernels by Linaro Ltd. to 512-bit for
size (12, k). This dpackm kernel is not currently used by any
subconfiguration.
- Implemented several experimental dgemmsup kernels which would
improve performance in a few cases. However, those dgemmsup kernels
generally underperform hence they are not currently used in any
subconfig.
- Note: This commit squashes several commits submitted by RuQing Xu via
PR #424.
Details:
- Added new implementations of bli_slamch() and bli_dlamch() that use
constants from the standard C library in lieu of dynamically-computed
values (via code inherited from netlib). The previous implementation
is still available when the cpp macro BLIS_ENABLE_LEGACY_LAMCH is
defined by the subconfiguration at compile-time. Thanks to Devin
Matthews for providing this patch, and to Stefano Zampini for
reporting the issue (#497) that prompted Devin to propose the patch.
Details:
- Spun off the Performance.md and PerformanceSmall.md links in the
Documentation section into a new Performance section dedicated to
those two links. (The previous entries remain redundantly listed
within Documentation section.) Thanks to Robert van de Geijn for
suggesting this change.
* Performance.md Update A64fx Comments
- Reason for ARMPL's missing data;
- Additional envs / flags for kernel selection;
- Update BLIS SRC commit.
* Include Another Fix in armsve-cfg-vendor
A prototype was forgotten, causing that void* pointer was not fully returned.
Details:
- Added single-threaded and multithreaded performance results to
docs/Performance.md. These results were gathered on the "Fugaku"
Fujitsu A64fx supercomputer at the RIKEN Center for Computational
Science in Kobe, Japan. Special thanks to RuQing Xu and Stepan
Nassyr for their work in developing and optimizing A64fx support in
BLIS and RuQing for gathering the performance data that is reflected
in these new graphs.
Details:
- Fixed an issue where the wrong string was being passed in for the
vendor legend string.
- Changed the graph in which the legends appear.
- Updates to runthese.m.
Details:
- Fixed a bug in the POWER10 DGEMM kernel whereby the microkernel did
not store the microtile result correctly due to incorrect indices
calculations. (The error was introduced when I reorganized the
'kernels/power10/3' directory.)
Details:
- Modified the octave scripts in test/3 so that the script does not
choke when one or more of the expected OpenBLAS, Eigen, or vendor data
files is missing. (The BLIS data set, however, must be complete.) When
a file is missing, that data series is simply not included on that
particular graph. Also factored out a lot of the redundant logic from
plot_panel_4x5.m into a separate function in read_data.m.
Details:
- Reduced the KC cache blocksize for double real on the skx subconfig
from 384 to 256. The maximum (extended) KC was also reduced
accordingly from 480 to 320. Thanks to Tze Meng Low for suggesting
this change.
Details:
- Changed BLIS_NUM_ARCHS from a cpp macro definition to the last enum
value in the arch_t enum. This means that it no longer needs to get
updated manually whenever new subconfigurations are added to BLIS.
Also removed the explicit initial index assigment of 0 from the
first enum value, which was unnecessary due to how the C language
standard mandates indexing of enum values. Thanks to Devin Matthews
for originally submitting this as a PR in #446.
- Updated docs/ConfigurationHowTo.md to reflect the aforementioned
change.
Details:
- Disabled the _self() and _equal() extensions to the bli_pthread API
introduced in d479654. These functions were disabled after I realized
that they aren't actually needed yet. Thanks to Devin Matthews for
helping me reason through the appropriate consumer code that will
appear in BLIS (eventually) in a future commit. (Also, I could never
get the Windows branch to link properly in clang builds in AppVeyor.
See the comment I left in the code, and #485, for more info.)
Details:
- Expanded the bli_pthread API to include equivalents to pthread_self()
and pthread_equal(). Implemented these two functions for all three cpp
branches present within bli_pthread.c: systemless, Windows, and
Linux/BSD.
Details:
- Reordered the definitions in the cpp branch in bli_pthreads.c that
defines the bli_pthreads API in terms of Windows API calls. Also added
missing comments that mark sections of the API, which brings the code
into harmony with other cpp branches (as well as bli_pthread.h).
Details:
- Relocated the _MSC_VER-guarded cpp macro re-definition of strerror_r
(in terms of strerror_s) from bli_thread.h to bli_env.c. It was
likely left behind in bli_thread.h in a previous commit, when code
that now resides in bli_env.c was moved from bli_thread.c. (I couldn't
find any other instance of strerror_r being used in BLIS, so I moved
the #define directly to bli_env.c rather than place it in bli_env.h.)
The code that uses strerror_r is currently disabled, though, so this
commit should have no affect on BLIS.
Details:
- Moved the logic that checks for general stridedness in any of the
matrix operands in a gemmsup problem. The logic previously resided
near the top of bli_gemmsup_int(), which is the thread entry point
for the parallel region of the current gemmsup implementation. The
problem with this setup was that the code would attempt to reject
problems with any general-strided operands by returning BLIS_FAILURE,
and that return value was then being ignored by the l3_sup thread
decorator, which unconditionally returns BLIS_SUCCESS. To solve this
issue, rather than try to manage n return values, one from each of n
threads, I simply moved the logic into bli_gemmsup_ref(). I didn't
move it any higher (e.g. bli_gemmsup()) because I still want the
logic to be part of the current gemmsup handler implementation. That
is, perhaps someone else will create a different handler, and that
author wants to handle general stride differently. (We don't want to
force them into a particular way of handling general stride.)
- Removed the general stride handling from bli_gemmtsup_int(), even
though this function is inoperative for now.
- This commit addresses issue #484. Thanks to RuQing Xu for reporting
this issue.
Details:
- This commit adds a new BLIS sandbox that (1) provides implementations
based on low-precision gemm kernels, and (2) extends the BLIS typed
API for those new implementations. Currently, these new kernels can
only be used for the POWER10 microarchitecture; however, they may
provide a template for developing similar kernels for other
microarchitectures (even those beyond POWER), as changes would likely
be limited to select places in the microkernel and possibly the
packing routines. The new low-precision operations that are now
supported include: shgemm, sbgemm, i16gemm, i8gemm, i4gemm. For more
information, refer to the POWER10.md document that is included in
'sandbox/power10'.
Details:
- Fixed incorrect definition and prototype of bli_?gemmt() in
frame/3/bli_l3_tapi.c and .h, respectively. gemmt was previously
defined identically to gemm, which was wrong because it did not
take into account the uplo property of C.
- Fixed incorrect API documentation for her2k/syr2k in BLISTypedAPI.md.
Specifically, the document erroneously listed only a single transab
parameter instead of transa and transb.
Details:
- In exampls/tapi/00level1v.c, pointer 'z' was being freed twice and
pointer 'a' was not being freed at all. This commit correctly frees
each pointer exactly once.
Details:
- Implemented assembly-based packm kernels for single- and double-
precision complex domain (c and z) and housed them in the 'haswell'
kernel set. This means c3xk, c8xk, z3xk, and z4xk are now all
optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
and zen2 subconfigs.
- Minor modifications to the corresponding s and d packm kernels that
were introduced in 426ad67.
- Thanks to AMD, who originally contributed the double-precision real
packm kernels (d6xk and d8xk), upon which these complex kernels are
partially based.
Details:
- Implemented assembly-based packm kernels for single- and double-
precision real domain (s and d) and housed them in the 'haswell'
kernel set. This means s6xk, s16xk, d6xk, and d8xk are now all
optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
and zen2 subconfigs.
- Thanks to AMD, who originally contributed the double-precision real
packm kernels (d6xk and d8xk), which I have now tweaked and used to
create comparable single-precision real kernels (s6xk and s16xk).