Details:
- Implemented a new feature called addons, which are similar to
sandboxes except that there is no requirement to define gemm or any
other particular operation.
- Updated configure to accept --enable-addon=<name> or -a <name> syntax
for requesting an addon be included within a BLIS build. configure now
outputs the list of enabled addons into config.mk. It also outputs the
corresponding #include directives for the addons' headers to a new
companion to the bli_config.h header file named bli_addon.h. Because
addons may wish to make use of existing BLIS types within their own
definitions, the addons' headers must be included sometime after that
of bli_config.h (which currently is #included before bli_type_defs.h).
This is why the #include directives needed to go into a new top-level
header file rather than the existing bli_config.h file.
- Added a markdown document, docs/Addons.md, to explain addons, how to
build with them, and what assumptions their authors should keep in
mind as they create them.
- Added a gemmlike-style implementation of sandwich gemm called 'gemmd'
as an addon in addon/gemmd. The code uses a 'bao_' prefix for local
functions, including the user-level object and typed APIs.
- Updated .gitignore so that git ignores bli_addon.h files.
Change-Id: Ie7efdea366481ce25075cb2459bdbcfd52309717
Fixed merge conflicts that arose while downstreaming BLIS code from master to the milan-3.1 branch.
Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (i.e., the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers of threads are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond)
Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations.
Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations.
Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger the conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu)
Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu)
Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs.
Minor code consolidation in all level-3 _front() functions.
Reorganized Windows cpp branch of bli_pthreads.c.
Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS.
Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion.
Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv.
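As an illustration of the trsm diagonal pre-inversion option listed above, the following is a minimal sketch and not BLIS code. It assumes pre-inversion means storing 1/l_ii in place of each diagonal element so that the solve multiplies rather than divides. All names (sketch_trsv_lower, sketch_preinvert_diag, the preinverted flag) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not BLIS code: forward substitution for a lower
   triangular system L * x = b, with L stored row-major in an n-by-n array
   and b passed in (and overwritten) via x.
   If preinverted is nonzero, the diagonal of L is assumed to already hold
   1/l_ii, so the solve multiplies instead of dividing. */
static void sketch_trsv_lower(size_t n, const double *l, double *x,
                              int preinverted)
{
    assert(l != NULL && x != NULL);
    for (size_t i = 0; i < n; ++i) {
        double sum = x[i];
        for (size_t j = 0; j < i; ++j)
            sum -= l[i * n + j] * x[j];
        x[i] = preinverted ? sum * l[i * n + i]   /* multiply by stored 1/l_ii */
                           : sum / l[i * n + i];  /* divide by l_ii */
    }
}

/* Replace each diagonal element with its reciprocal (done once, up front). */
static void sketch_preinvert_diag(size_t n, double *l)
{
    for (size_t i = 0; i < n; ++i)
        l[i * n + i] = 1.0 / l[i * n + i];
}
```

Both paths solve the same system; they can differ in the last bits of the result because a reciprocal followed by a multiply rounds differently from a single divide.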
AMD-internal-[CPUPL-1523]
Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd
* Performance.md Update A64fx Comments
- Reason for ARMPL's missing data;
- Additional envs / flags for kernel selection;
- Update BLIS SRC commit.
* Include Another Fix in armsve-cfg-vendor
A prototype was missing, which caused the void* pointer not to be returned in full.
Details:
- Added single-threaded and multithreaded performance results to
docs/Performance.md. These results were gathered on the "Fugaku"
Fujitsu A64fx supercomputer at the RIKEN Center for Computational
Science in Kobe, Japan. Special thanks to RuQing Xu and Stepan
Nassyr for their work in developing and optimizing A64fx support in
BLIS and RuQing for gathering the performance data that is reflected
in these new graphs.
Details:
- Changed BLIS_NUM_ARCHS from a cpp macro definition to the last enum
value in the arch_t enum. This means that it no longer needs to get
updated manually whenever new subconfigurations are added to BLIS.
Also removed the explicit initial index assignment of 0 from the
first enum value, which was unnecessary due to how the C language
standard mandates indexing of enum values. Thanks to Devin Matthews
for originally submitting this as a PR in #446.
- Updated docs/ConfigurationHowTo.md to reflect the aforementioned
change.
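The trailing-enumerator pattern described above can be sketched as follows. The enum below is a shortened, hypothetical stand-in (the sketch_ names are not BLIS identifiers); it only shows why the count no longer needs manual updates.

```c
#include <assert.h>

/* Hypothetical, shortened stand-in for an arch_t-style enum. The first
   enumerator needs no explicit "= 0": C assigns 0 to it and increments by
   one for each enumerator after it. */
typedef enum
{
    SKETCH_ARCH_A,
    SKETCH_ARCH_B,
    SKETCH_ARCH_GENERIC,

    /* Must stay last so that it always equals the number of entries above. */
    SKETCH_NUM_ARCHS
} sketch_arch_t;

/* A table sized by the trailing enumerator grows automatically whenever a
   new entry is added to the enum (a matching name must be added here too). */
static const char *sketch_arch_names[SKETCH_NUM_ARCHS] =
{
    "a", "b", "generic"
};

/* Look up the name for an id, rejecting the count value itself. */
static const char *sketch_arch_name(sketch_arch_t id)
{
    assert(id < SKETCH_NUM_ARCHS);
    return sketch_arch_names[id];
}
```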
Details:
- Fixed incorrect definition and prototype of bli_?gemmt() in
frame/3/bli_l3_tapi.c and .h, respectively. gemmt was previously
defined identically to gemm, which was wrong because it did not
take into account the uplo property of C.
- Fixed incorrect API documentation for her2k/syr2k in BLISTypedAPI.md.
Specifically, the document erroneously listed only a single transab
parameter instead of transa and transb.
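To make the role of the uplo property concrete, here is a minimal reference sketch of a gemmt-style update. It is not the BLIS implementation or its signature: the function name, the char-based uplo argument, the row-major layout, and the omission of transposition parameters are all simplifying assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference sketch, not the BLIS implementation:
   C := beta * C + alpha * A * B, touching only one triangle of C.
   A is m-by-k, B is k-by-m, and C is m-by-m, all row-major.
   uplo is 'L' (lower triangle) or 'U' (upper triangle); the diagonal is
   included in both cases. */
static void sketch_gemmt(char uplo, size_t m, size_t k, double alpha,
                         const double *a, const double *b,
                         double beta, double *c)
{
    assert(uplo == 'L' || uplo == 'U');
    for (size_t i = 0; i < m; ++i) {
        /* Column range of row i that lies inside the requested triangle. */
        size_t j_begin = (uplo == 'L') ? 0 : i;
        size_t j_end   = (uplo == 'L') ? i + 1 : m;
        for (size_t j = j_begin; j < j_end; ++j) {
            double ab = 0.0;
            for (size_t p = 0; p < k; ++p)
                ab += a[i * k + p] * b[p * m + j];
            c[i * m + j] = beta * c[i * m + j] + alpha * ab;
        }
    }
}
```

Elements of C outside the chosen triangle are never read or written, which is the behavior a plain gemm signature cannot express.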
Details:
- When requesting multithreaded parallelism by specifying the total
number of threads (whether it be via environment variable, globally at
runtime, or locally at runtime), reduce the number of threads actually
used by one if the original value (a) is prime and (b) exceeds a
minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which is set
to 11 by default. If, when specifying the total number of threads (and
not the individual ways of parallelism for each loop), prime numbers
of threads are desired, this feature may be overridden by defining the
BLIS_ENABLE_AUTO_PRIME_NUM_THREADS macro in the bli_family_*.h that
corresponds to the configuration family targeted at configure-time.
(For now, there are no configure options to control this feature.)
Thanks to Jeff Diamond for suggesting this change.
- Defined a new function in bli_thread.c, bli_is_prime(), that returns a
bool indicating whether an integer is prime. This function is
implemented in terms of existing functions in bli_thread.c.
- Updated docs/Multithreading.md to document the above feature, along
with unrelated minor edits.
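The thread-count rule described above can be sketched as follows. This is an illustration only: the sketch_ names and SKETCH_ macros are hypothetical stand-ins for the BLIS macros named in the text, and the trial-division test is a simple substitute for bli_is_prime(), whose actual implementation is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the BLIS_NT_MAX_PRIME threshold described above
   (default 11). */
#define SKETCH_NT_MAX_PRIME 11

/* Trial-division primality test, used here as a simple stand-in. */
static bool sketch_is_prime(long n)
{
    if (n < 2) return false;
    for (long d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

/* Return the thread count to actually use when the caller supplied only a
   total: drop by one if the total is prime and above the threshold.
   Defining SKETCH_ENABLE_AUTO_PRIME_NUM_THREADS (a stand-in for the override
   macro in the text) keeps prime counts unchanged. */
static long sketch_adjust_num_threads(long nt)
{
    assert(nt >= 1);
#ifndef SKETCH_ENABLE_AUTO_PRIME_NUM_THREADS
    if (nt > SKETCH_NT_MAX_PRIME && sketch_is_prime(nt))
        return nt - 1;
#endif
    return nt;
}
```

Under these assumptions a request for 13 threads becomes 12, while 11 and 7 are left alone because they do not exceed the threshold.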
Details:
- Added a configure option, --[enable|disable]-system, which determines
whether the modest operating system dependencies in BLIS are included.
The most notable example of this on Linux and BSD/OSX is the use of
POSIX threads to ensure thread safety for when application-level
threads call BLIS. When --disable-system is given, the bli_pthreads
implementation is dummied out entirely, allowing the calling code
within BLIS to remain unchanged. Why would anyone want to build BLIS
like this? The motivating example was submitted via #454 in which a
user wanted to build BLIS for a simulator such as gem5 where thread
safety may not be a concern (and where the operating system is largely
absent anyway). Thanks to Stepan Nassyr for suggesting this feature.
- Another, more minor side effect of the --disable-system option is that
the implementation of bli_clock() unconditionally returns 0.0 instead
of the time elapsed since some fixed point in the past. The reasoning
for this is that if the operating system is truly minimal, the system
function call upon which bli_clock() would normally be implemented
(e.g. clock_gettime()) may not be available.
- Refactored preprocess-guarded code in bli_pthread.c and bli_pthread.h
to remove redundancies.
- Removed old comments and commented #include of "bli_pthread_wrap.h"
from bli_system.h.
- Documented bli_clock() and bli_clock_min_diff() in BLISObjectAPI.md
and BLISTypedAPI.md, with a note that both are non-functional when
BLIS is configured with --disable-system.
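The effect of --disable-system on bli_clock() can be sketched as below. This is not the BLIS source: sketch_clock() and the SKETCH_DISABLE_SYSTEM macro are hypothetical, and C11 timespec_get() is used only as a portable stand-in for a system call such as clock_gettime().

```c
#include <assert.h>
#include <time.h>

/* Hypothetical sketch, not the BLIS source. Defining SKETCH_DISABLE_SYSTEM
   mimics a build configured with --disable-system. */
static double sketch_clock(void)
{
#ifdef SKETCH_DISABLE_SYSTEM
    /* No operating system services are assumed, so report no elapsed time. */
    return 0.0;
#else
    /* Seconds since a fixed point in the past, as a double. */
    struct timespec ts;
    int base = timespec_get(&ts, TIME_UTC);
    assert(base == TIME_UTC);
    (void)base; /* silence unused-variable warnings when NDEBUG is defined */
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1.0e-9;
#endif
}
```

Because the function keeps the same signature in both configurations, callers compile unchanged; they simply observe 0.0 when the system dependencies are compiled out.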
Merged contributions from AMD's AOCL BLIS (#448).
Details:
- Added support for level-3 operation gemmt, which performs a gemm on
only the lower or upper triangle of a square matrix C. For now, only
the conventional/large code path will be supported (in vanilla BLIS).
This was accomplished by leveraging the existing variant logic for
herk. However, some of the infrastructure to support a gemmtsup is
included in this commit, including
- A bli_gemmtsup() front-end, similar to bli_gemmsup().
- A bli_gemmtsup_ref() reference handler function.
- A bli_gemmtsup_int() variant chooser function (with variant calls
commented out).
- Added support for inducing complex domain gemmt via the 1m method.
- Added gemmt APIs to the BLAS and CBLAS compatibility layers.
- Added gemmt test module to testsuite.
- Added standalone gemmt test driver to 'test' directory.
- Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md.
- Added a C++ template header (blis.hh) containing a BLAS-inspired
wrapper to a set of polymorphic CBLAS-like function wrappers defined
in another header (cblas.hh). These two headers are installed when
running the 'install' target with INSTALL_HH set to 'yes'. (Also
added a set of unit tests that exercise blis.hh, although they are
disabled for now because they aren't compatible with out-of-tree
builds.) These files now live in the 'vendor' top-level directory.
- Various updates to 'zen' and 'zen2' subconfigurations, particularly
within the context initialization functions.
- Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and
various minor updates to dotv and scalv kernels. Also added various
sup kernels contributed by AMD to kernels/zen/3. However, these
kernels are (for now) not yet used, in part because they caused
AppVeyor clang failures, and also because I have not found time to
review and vet them.
- Output the python found during configure into the definition of PYTHON
in build/config.mk (via build/config.mk.in).
- Added early-return checks (A, B, or C with zero dimension; alpha = 0)
to bli_gemm_front.c.
- Implemented explicit beta = 0 handling for the sgemm ukernel in
bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent
bug surfaced because the gemmt module verifies its computation using
gemm with its beta parameter set to zero, which, on a cortexa15 system
caused the gemm kernel code to unconditionally multiply the
uninitialized C data by beta. The C matrix likely contained
non-numeric values such as NaN, which then would have resulted in a
false failure.
- Fixed a bug whereby the implementation for bli_herk_determine_kc(),
in bli_l3_blocksize.c, was inadvertently being defined in terms of
helper functions meant for trmm. This bug was probably harmless since
the trmm code should have also done the right thing for herk.
- Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in
kernels/zen/3/bli_gemm_small.c since those macros are not used in
vanilla BLIS.
- Added cpp guard to definition of bli_mem_clear() in bli_mem.h to
accommodate C++'s stricter type checking.
- Added a cpp guard to test/*.c drivers that facilitates compilation on
Windows systems.
- Various whitespace changes.
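The beta = 0 item above deserves a short illustration, since the failure mode is easy to miss. The sketch below is a simplified, hypothetical version of the final update step of a gemm microkernel (the name sketch_update_c and the flat array layout are assumptions). It shows why beta = 0 must be a separate branch rather than an ordinary multiply.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the final update step of a gemm microkernel:
   c[i] := beta * c[i] + ab[i], where ab holds the already-computed
   alpha * A * B contribution. */
static void sketch_update_c(size_t n, const double *ab, double beta,
                            double *c)
{
    assert(ab != NULL && c != NULL);
    if (beta == 0.0) {
        /* Overwrite C without reading it. Multiplying by zero would not be
           enough: under IEEE 754, 0.0 * NaN is NaN, so uninitialized
           contents of C could leak into the result. */
        for (size_t i = 0; i < n; ++i)
            c[i] = ab[i];
    } else {
        for (size_t i = 0; i < n; ++i)
            c[i] = beta * c[i] + ab[i];
    }
}
```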
Details:
- Moved the "Operation index" section of both the BLISObjectAPI.md and
BLISTypedAPI.md docs to appear immediately after the table of contents
of each document. This allows the reader to quickly jump to the
documentation for any operation without having to scroll through much
of the document (when rendered via a web browser).
- Fixed a mistake in the BLISObjectAPI.md for the setd operation, which
does *not* observe the diag property of its matrix argument. Thanks to
Jeff Diamond for reporting this.
Details:
- Added single-threaded and multithreaded sup performance results to
docs/PerformanceSmall.md for both sgemm and dgemm. These results were
gathered on an Epyc 7742 "Rome" server featuring AMD's Zen2
microarchitecture. Special thanks to Jeff Diamond for facilitating
access to the system via the Oracle Cloud.
- Updates to octave scripts in test/sup/octave for use with Octave 5.2
and for use with subplot_tight().
- Minor updates to octave scripts in test/3/octave.
- Renamed files containing the previous Zen performance results for
consistency with the new results.
- Decreased line thickness slightly in large/conventional Zen2 graphs.
I'm done tweaking those this time. Really.
- Added missing line regarding eigen header installation for each
microarchitecture section.
Details:
- Added a frequently asked question to docs/FAQ.md regarding the
difference between upstream (vanilla) BLIS and AMD BLIS.
- Updated the name of ICES in the README.md to reflect the Oden
rebranding.
Details:
- Added single-threaded and multithreaded performance results to
docs/Performance.md. These results were gathered on an Epyc 7742
"Rome" server with AMD's Zen2 microarchitecture. Special thanks
to Jeff Diamond for facilitating access to the system via the
Oracle Cloud.
- Renamed files containing the previous Zen performance results for
consistency with the new results.
Details:
- Steer the reader towards the example code section of each API
document (object and typed).
- Trivial update to examples/oapi/README, examples/tapi/README.
Details:
- Added documentation for commonly-used object mutator functions in
BLISObjectAPI.md. Previously, only accessor functions were documented.
Thanks to Jeff Diamond for pointing out this omission.
- Explicitly set the 'diag' property of objects in oapi example modules
(08level2.c and 09level3.c).
Details:
- Added a new markdown document, docs/PerformanceSmall.md, which
publishes new performance graphs for Kaby Lake and Epyc showcasing
the new BLIS sup (small/skinny/unpacked) framework logic and kernels.
For now, only single-threaded dgemm performance is shown.
- Reorganized graphs in docs/graphs into docs/graphs/large, with new
graphs being placed in docs/graphs/sup.
- Updates to scripts in test/sup/octave, mostly to allow decent output
in both GNU octave and Matlab.
- Updated README.md to mention and refer to the new PerformanceSmall.md
document.
Details:
- Updated the level-3 performance graphs in docs/graphs with new Eigen
results, this time using a development version cloned from their git
mirror on March 27, 2019 (version 3.3.90). Performance is improved
over 3.3.7, though still noticeably short of BLIS/MKL in most cases.
- Very minor updates to docs/Performance.md and matlab scripts in
test/3/matlab.
Details:
- Updated the Haswell, SkylakeX, and Epyc performance graphs in
docs/graphs to report on Eigen implementations, where applicable.
Specifically, Eigen implements all level-3 operations sequentially,
but of those operations it provides only gemm in multithreaded form.
Thus, mt results for symm/hemm, syrk/herk, trmm, and trsm are
omitted. Thanks to Sameer Agarwal for his help configuring and
using Eigen.
- Updated docs/Performance.md to note the new implementation tested.
- CREDITS file update.
Details:
- Added a new markdown document, docs/Performance.md, which reports
performance of a representative set of level-3 operations across a
variety of hardware architectures, comparing BLIS to OpenBLAS and a
vendor library (MKL on Intel/AMD, ARMPL on ARM). Performance graphs,
in pdf and png formats, reside in docs/graphs.
- Updated README.md to link to new Performance.md document.
- Minor updates to CREDITS, docs/Multithreading.md.
- Minor updates to matlab scripts in test/3/matlab.
Details:
- Made extra explicit the fact that: (a) multithreading in BLIS is
disabled by default; and (b) even with multithreading enabled, the
user must specify multithreading at runtime in order to observe
parallelism. Thanks to M. Zhou for suggesting these clarifications
in #292.
- Also made explicit that only the environment variable and global
runtime API methods are available when using the BLAS API. If the
user wishes to use the local runtime API (specify multithreading on
a per-call basis), one of the native BLIS APIs must be used.
Details:
- Added language to remind the reader to disable sup if the intended
behavior is for the sandbox implementation to handle all problem
sizes, even the smaller ones that would normally be handled by the
sup code path.
Details:
- Textually replaced nearly all non-comment instances of bool_t with the
C99 bool type. A few remaining instances, such as those in the files
bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and
bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being
used not for boolean purposes but to index into an array.
- This commit constitutes the third phase of a transition toward using
C99's bool instead of bool_t, which was raised in issue #420. The first
phase, which cleaned up various typecasts in preparation for using
bool as the basis for bool_t (instead of gint_t), was implemented by
commit a69a4d7. The second phase, which redefined the bool_t typedef
in terms of bool (from gint_t), was implemented by commit 2c554c2.
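One reason a transition like this needs care is that C99 bool does not behave like an integer typedef. The sketch below illustrates the difference. The typedef is a hypothetical stand-in for an integer-based boolean (the real gint_t is not reproduced here), and the index helper only illustrates the kind of use the commit moved to dim_t.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an integer-based boolean typedef. */
typedef long sketch_int_bool_t;

/* Converting to C99 bool normalizes any nonzero value to 1 ... */
static int sketch_as_c99_bool(int v)
{
    bool b = v;
    return b;
}

/* ... whereas an integer typedef keeps the original value. */
static int sketch_as_int_bool(int v)
{
    sketch_int_bool_t b = v;
    return (int)b;
}

/* A selector that picks an array entry reads more clearly as an integer
   index than as a boolean. */
static int sketch_select(const int table[2], long index)
{
    assert(index == 0 || index == 1);
    return table[index];
}
```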
Details:
- The 'ref99' sandbox was broken by multiple refactorings and internal
API changes over the last two years. Rather than try to fix it, I've
replaced it with a much simpler version based on var2 of gemmsup.
Why not fix the previous implementation? It occurred to me that the
old implementation was trying to be a lightly simplified duplication
of what exists in the framework. Duplication aside, this sandbox
would have worked fine if it had been completely independent of the
framework code. The problem was that it was only partially
independent, with many function calls calling a function in BLIS
rather than a duplicated/simplified version within the sandbox. (And
the reason I didn't make it fully independent to begin with was that
it seemed unnecessarily duplicative at the time.) Maintaining two
versions of the same implementation is problematic for obvious
reasons, especially when it wasn't even done properly to begin with.
This explains the reimplementation in this commit. The only catch is
that the newer implementation is single-threaded only and does not
perform any packing on either input matrix (A or B). Basically, it's
only meant to be a simple placeholder that shows how you could plug
in your own implementation. Thanks to Francisco Igual for reporting
this brokenness.
- Updated the three reference gemmsup kernels (defined in
ref_kernels/3/bli_gemmsup_ref.c) so that they properly handle
conjugation of conja and/or conjb. The general storage kernel, which
is currently identical to the column-storage kernel, is used in the
new ref99 sandbox to provide basic support for all datatypes
(including scomplex and dcomplex).
- Minor updates to docs/Sandboxes.md, including adding the threading
and packing limitations to the Caveats section.
- Fixed a comment typo in bli_l3_sup_var1n2m.c (upon which the new
sandbox implementation is based).
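What "handling conjugation of conja and/or conjb" amounts to can be sketched with a reference inner product. This is not the BLIS kernel: the struct, the function name, and the use of plain bool flags for conja and conjb are assumptions made to keep the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical double-precision complex type for this sketch. */
typedef struct { double real; double imag; } sketch_dcomplex;

/* Hypothetical reference sketch: sum over p of op_a(a[p]) * op_b(b[p]),
   where op_a and op_b conjugate their operand when the matching flag is
   set. This is the inner product behind one element of A * B. */
static sketch_dcomplex sketch_dot_conj(bool conja, bool conjb, size_t k,
                                       const sketch_dcomplex *a,
                                       const sketch_dcomplex *b)
{
    assert(a != NULL && b != NULL);
    sketch_dcomplex sum = {0.0, 0.0};
    for (size_t p = 0; p < k; ++p) {
        /* Conjugation negates the imaginary part of the operand. */
        double ai = conja ? -a[p].imag : a[p].imag;
        double bi = conjb ? -b[p].imag : b[p].imag;
        sum.real += a[p].real * b[p].real - ai * bi;
        sum.imag += a[p].real * bi + ai * b[p].real;
    }
    return sum;
}
```

For real datatypes both flags have no effect, so a kernel that ignores them is only wrong for the complex cases.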
Details:
- Previously, the entry for shiftd in the Operation index section of
BLISTypedAPI.md was incorrectly linking to the shiftd operation entry
in BLISObjectAPI.md. This has been fixed. Thanks to Jeff Diamond for
helping find this incorrect link.