Details:
- Added a new markdown document, docs/PerformanceSmall.md, which
publishes new performance graphs for Kaby Lake and Epyc showcasing
the new BLIS sup (small/skinny/unpacked) framework logic and kernels.
For now, only single-threaded dgemm performance is shown.
- Reorganized graphs in docs/graphs into docs/graphs/large, with new
graphs being placed in docs/graphs/sup.
- Updates to scripts in test/sup/octave, mostly to allow decent output
in both GNU octave and Matlab.
- Updated README.md to mention and refer to the new PerformanceSmall.md
document.
Details:
- Fine-tuned the double-precision real MT threshold (which controls
whether the sup implementation kicks for smaller m dimension values)
from 180 to 201 for haswell and 180 to 256 for zen.
- Updated octave scripts in test/sup/octave to include a seventh column
to display performance for m = n = k.
Details:
- Attempted a fix to issue #313, which reports that when building only
a shared library (ie: static library build is disabled), running the
BLAS test drivers can fail because those drivers provide their own
local version of xerbla_() as a clever (albeit still rather hackish)
way of checking the error codes that result from the individual tests.
This local xerbla_() function is never found at link-time because the
BLAS test drivers' Makefile imports BLIS compilation flags via the
get-user-cflags-for() function, which currently conveys the
-fvisibility=hidden flag, which hides symbols unless they are
explicitly annotated for export. The -fvisibility=hidden flag was
only ever intended for use when building BLIS (not for applications),
and so the attempted solution here is to omit the symbol export
flag(s) from get-user-cflags-for() by storing the symbol export
flag(s) to a new BULID_SYMFLAGS variable instead of appending it
to the subconfigurations' CMISCFLAGS variable (which is returned by
every get-*-cflags-for() function). Thanks to M. Zhou for reporting
this issue and also to Isuru Fernando for suggesting the fix.
- Renamed BUILD_FLAGS to BUILD_CPPFLAGS to harmonize with the newly
created BUILD_SYMFLAGS.
- Fixed typo in entry for --export-shared flag in 'configure --help'
text.
Details:
- Documented the BLIS environment variables that were set
(e.g. BLIS_JC_NT, BLIS_IC_NT, BLIS_JR_NT) for each machine and
threading configuration in order to achieve the parallelism reported
on in docs/Performance.md.
Details:
- Increased the double-precision real MT threshold (which controls
whether the sup implementation kicks for smaller m dimension values)
from 80 to 180, and this change was made for both haswell and zen
subconfigurations. This is less about the m dimension in particular
and more about facilitating a smoother performance transition when
m = n = k.
Details:
- Commented out redundant setting of LIBBLIS_LINK within all driver-
level Makefiles. This variable is already set within common.mk, and
so the only time it should be overridden is if the user wants to link
to a different copy of libblis.
- Very minor changes to build/gen-make-frags/gen-make-frag.sh.
- Whitespace and inconsequential quoting change to configure.
- Moved top-level 'windows' directory into a new 'attic' directory.
Details:
- Added
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
to bli_system.h so that an application that uses BLIS (specifically,
an application that #includes blis.h) does not need to remember to
#define the macro itself (either on the command line or in the code
that includes blis.h) in order to activate things like the pthreads.
Thanks to Christos Psarras for reporting this issue and suggesting
this fix.
- Commented out #include <sys/time.h> in bli_system.h, since I don't
think this header is used/needed anymore.
- Comment update to function macro for bli_?normiv_unb_var1() in
frame/util/bli_util_unb_var1.c.
Details:
- Fixed an obscure but in the bli_dgemmsup_rv_haswell_asm_5x8n() kernel
that only affected the beta == 0, column-storage output case. Thanks
to the BLAS test drivers for catching this bug.
- Previously, bli_gemmsup_ref_var1n() and _var2m() were returning if
k = 0, when the correct action would be to scale by beta (and then
return). Thanks to the BLAS test drivers to catching this bug.
- Changed the sup threshold behavior such that the sup implementation
only kicks in if a matrix dimension is strictly less than (rather than
less than or equal to) the threshold in question.
- Initialize all thresholds to zero (instead of 10) by default in
ref_kernels/bli_cntx_ref.c. This, combined with the above change to
threshold testing means that calls to BLIS or BLAS with one or more
matrix dimensions of zero will no longer trigger the sup
implementation.
- Added disabled debugging output to frame/3/bli_l3_sup.c (for future
use, perhaps).
Details:
- Removed already limited use of the BLIS_ENABLE_SUP_MR_EXT and
BLIS_ENABLE_SUP_NR_EXT macros in bli_gemmsup_ref_var1n() and
bli_gemmsup_ref_var2m(). Their purpose was merely to avoid a long
conditional that would determine whether to allow the last iteration
to be merged with the second-to-last iteration. Functionally, the
macros were not needed, and they ended up causing problems when
building configuration families such as intel64 and x86_64.
Details:
- Fixed an incorrectly-named macro guard that is intended to allow
disabling of the sup framework via the configure option
--disable-sup-handling. In this case, the preprocessor macro,
BLIS_DISABLE_SUP_HANDLING, was still named by its name from an older
uncommitted version of the code (BLIS_DISABLE_SM_HANDLING).
Details:
- Implemented a new sub-framework within BLIS to support the management
of code and kernels that specifically target matrix problems for which
at least one dimension is deemed to be small, which can result in long
and skinny matrix operands that are ill-suited for the conventional
level-3 implementations in BLIS. The new framework tackles the problem
in two ways. First the stripped-down algorithmic loops forgo the
packing that is famously performed in the classic code path. That is,
the computation is performed by a new family of kernels tailored
specifically for operating on the source matrices as-is (unpacked).
Second, these new kernels will typically (and in the case of haswell
and zen, do in fact) include separate assembly sub-kernels for
handling of edge cases, which helps smooth performance when performing
problems whose m and n dimension are not naturally multiples of the
register blocksizes. In a reference to the sub-framework's purpose of
supporting skinny/unpacked level-3 operations, the "sup" operation
suffix (e.g. gemmsup) is typically used to denote a separate namespace
for related code and kernels. NOTE: Since the sup framework does not
perform any packing, it targets row- and column-stored matrices A, B,
and C. For now, if any matrix has non-unit strides in both dimensions,
the problem is computed by the conventional implementation.
- Implemented the default sup handler as a front-end to two variants.
bli_gemmsup_ref_var2() provides a block-panel variant (in which the
2nd loop around the microkernel iterates over n and the 1st loop
iterates over m), while bli_gemmsup_ref_var1() provides a panel-block
variant (2nd loop over m and 1st loop over n). However, these variants
are not used by default and provided for reference only. Instead, the
default sup handler calls _var2m() and _var1n(), which are similar
to _var2() and _var1(), respectively, except that they defer to the
sup kernel itself to iterate over the m and n dimension, respectively.
In other words, these variants rely not on microkernels, but on
so-called "millikernels" that iterate along m and k, or n and k.
The benefit of using millikernels is a reduction of function call
and related (local integer typecast) overhead as well as the ability
for the kernel to know which micropanel (A or B) will change during
the next iteration of the 1st loop, which allows it to focus its
prefetching on that micropanel. (In _var2m()'s millikernel, the upanel
of A changes while the same upanel of B is reused. In _var1n()'s, the
upanel of B changes while the upanel of A is reused.)
- Added a new configure option, --[en|dis]able-sup-handling, which is
enabled by default. However, the default thresholds at which the
default sup handler is activated are set to zero for each of the m, n,
and k dimensions, which effectively disables the implementation. (The
default sup handler only accepts the problem if at least one dimension
is smaller than or equal to its corresponding threshold. If all
dimensions are larger than their thresholds, the problem is rejected
by the sup front-end and control is passed back to the conventional
implementation, which proceeds normally.)
- Added support to the cntx_t structure to track new fields related to
the sup framework, most notably:
- sup thresholds: the thresholds at which the sup handler is called.
- sup handlers: the address of the function to call to implement
the level-3 skinny/unpacked matrix implementation.
- sup blocksizes: the register and cache blocksizes used by the sup
implementation (which may be the same or different from those used
by the conventional packm-based approach).
- sup kernels: the kernels that the handler will use in implementing
the sup functionality.
- sup kernel prefs: the IO preference of the sup kernels, which may
differ from the preferences of the conventional gemm microkernels'
IO preferences.
- Added a bool_t to the rntm_t structure that indicates whether sup
handling should be enabled/disabled. This allows per-call control
of whether the sup implementation is used, which is useful for test
drivers that wish to switch between the conventional and sup codes
without having to link to different copies of BLIS. The corresponding
accessor functions for this new bool_t are defined in bli_rntm.h.
- Implemented several row-preferential gemmsup kernels in a new
directory, kernels/haswell/3/sup. These kernels include two general
implementation types--'rd' and 'rv'--for the 6x8 base shape, with
two specialized millikernels that embed the 1st loop within the kernel
itself.
- Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference
gemmsup microkernels. NOTE: These microkernels, unlike the current
crop of conventional (pack-based) microkernels, do not use constant
loop bounds. Additionally, their inner loop iterates over the k
dimension.
- Defined new typedef enums:
- stor3_t: captures the effective storage combination of the level-3
problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A
special value of BLIS_XXX is used to denote an arbitrary combination
which, in practice, means that at least one of the operands is
stored according to general stride.
- threshid_t: captures each of the three dimension thresholds.
- Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create()
can be passed "-1, -1" as a lazy request for row storage. (Note that
"0, 0" is still accepted as a lazy request for column storage.)
- Added support for various instructions to bli_x86_asm_macros.h,
including imul, vhaddps/pd, and other instructions related to integer
vectors.
- Disabled the older small matrix handling code inserted by AMD in
bli_gemm_front.c, since the sup framework introduced in this commit
is intended to provide a more generalized solution.
- Added test/sup directory, which contains standalone performance test
drivers, a Makefile, a runme.sh script, and an 'octave' directory
containing scripts compatible with GNU Octave. (They also may work
with matlab, but if not, they are probably close to working.)
- Reinterpret the storage combination string (sc_str) in the various
level-3 testsuite modules (e.g. src/test_gemm.c) so that the order
of each matrix storage char is "cab" rather than "abc".
- Comment updates in level-3 BLAS API wrappers in frame/compat.
Details:
- Added preprocessor branches to test/3/test_gemm.c to explicitly
support row-stored matrices. Column-stored matrices are also still
supported (and is the default for now). (This is mainly residual work
leftover from initial integration of Eigen into the test drivers, so
if we ever want to test Eigen with row-stored matrices, the code will
be ready to use, even if it is not yet integrated into the Makefile
in test/3.)
Details:
- Somehow the variable name change (root_file_name -> root_inputname)
in flatten-headers.py mentioned in the commit log entry for 89a70cc
didn't make it into the actual commit. This commit applies that
change.
Details:
- Changed the default installation prefix from $HOME/lib to /usr/local.
- Modified the way configure internally handles the prefix, libdir,
includedir, and sharedir (and also added an --exec-prefix option).
The defaults to these variables are set as follows:
prefix: /usr/local
exec_prefix: ${prefix}
libdir: ${exec_prefix}/lib
includedir: ${prefix}/include
sharedir: ${prefix}/share
The key change, aside from the addition of exec_prefix and its use to
define the default to libdir, is that the variables are substituted
into config.mk with quoting that delays evaluation, meaning the
substituted values may contain unevaluated references to other
variables (namely, ${prefix} and ${exec_prefix}). This more closely
follows GNU conventions, including those used by GNU autoconf, and
also allows make to override any one of the variables *after*
configure has already been run (e.g. during 'make install').
- Updates to build/config.mk.in pursuant to above changes.
- Updates to output of 'configure --help' pursuant to above changes.
- Updated docs/BuildSystem.md to reflect the new default installation
prefix, as well as mention EXECPREFIX and SHAREDIR.
- Changed the definitions of the UNINSTALL_OLD_* variables in the
top-level Makefile to use $(wildcard ...) instead of 'find'. This
was motivated by the new way of handling prefix and friends, which
leads to the 'find' command being run on /usr/local (by default),
which can take a while almost never yielding any benefit (since the
user will very rarely use the uninstall-old targets).
- Removed periods from the end of descriptive output statements (i.e.,
non-verbose output) since those statements often end with file or
directory paths, which get confusing to read when puctuated by a
period.
- Trival change to 'make showconfig' output.
- Removed my name from 'configure --help'. (Many have contributed to it
over the years.)
- In configure script, changed the default state of threading_model
variable from 'no' to 'off' to match that of debug_type, where there
are similarly more than two valid states. ('no' is still accepted
if given via the --enable-debug= option, though it will be
standardized to 'off' prior to config.mk being written out.)
- Minor variable name change in flatten-headers.py that was intended for
32812ff.
- CREDITS file update.
Details:
- Fixed a minor bug in flatten-headers.py whereby the script, upon
encountering a #include directive for the root header file, would
erroneously recurse and inline the conents of that root header.
The script has been modified to avoid recursion into any headers
that share the same name as the root-level header that was passed
into the script. (Note: this bug didn't actually manifest in BLIS,
so it's merely a precaution for usage of flatten-headers.py in other
contexts.)
Change void*-typed function pointers to void_fp.
- Updated all instances of void* variables that store function pointers
to variables of a new type, void_fp. Originally, I wanted to define
the type of void_fp as "void (*void_fp)( void )"--that is, a pointer
to a function with no return value and no arguments. However, once
I did this, I realized that gcc complains with incompatible pointer
type (-Wincompatible-pointer-types) warnings every time any such a
pointer is being assigned to its final, type-accurate function
pointer type. That is, gcc will silently typecast a void* to
another defined function pointer type (e.g. dscalv_ker_ft) during
an assignment from the former to the latter, but the same statement
will trigger a warning when typecasting from a void_fp type. I suspect
an explicit typecast is needed in order to avoid the warning, which
I'm not willing to insert at this time.
- Added a typedef to bli_type_defs.h defining void_fp as void*, along
with a commented-out version of the aborted definition described
above. (Note that POSIX requires that void* and function pointers
be interchangeable; it is the C standard that does not provide this
guarantee.)
- Comment updates to various _oapi.c files.
Details:
- Renamed
kernels/armv8a/3/bli_gemm_armv8a_opt_4x4.c
to
kernels/armv8a/3/bli_gemm_armv8a_asm_d6x8.c.
This follows the naming convention used by other kernel sets, most
notably haswell.
Details:
- Updated the level-3 performance graphs in docs/graphs with new Eigen
results, this time using a development version cloned from their git
mirror on March 27, 2019 (version 3.3.90). Performance is improved
over 3.3.7, though still noticeably short of BLIS/MKL in most cases.
- Very minor updates to docs/Performance.md and matlab scripts in
test/3/matlab.
Details:
- Updated the Haswell, SkylakeX, and Epyc performance graphs in
docs/graphs to report on Eigen implementations, where applicable.
Specifically, Eigen implements all level-3 operations sequentially,
however, of those operations it only provides multithreaded gemm.
Thus, mt results for symm/hemm, syrk/herk, trmm, and trsm are
omitted. Thanks to Sameer Agarwal for his help configuring and
using Eigen.
- Updated docs/Performance.md to note the new implementation tested.
- CREDITS file update.
Details:
- Updated matlab scripts in test/3/matlab to optionally plot/display
Eigen performance curves. Whether Eigen is plotted is determined by
a new boolean function parameter, with_eigen.
- Updated runme.m scratchpad to reflect the latest invocations of the
plot_panel_4x5() function (with Eigen plotting enabled).
Details:
- Fixed the Makefile in test/3 so that it no longer incorrectly labels
the matlab output variables from Eigen-linked hemm, herk, trmm, and
trsm driver output as "vendor". (The gemm drivers were already
correctly outputing matlab variables containing the "eigen" label.)
Export macros can't support both shared and static at the same time.
When blis is built with both shared and static, headers assume that
shared is used at link time and dllimports the symbols with __imp_
prefix.
To use the headers with static libraries a user can give
-DBLIS_EXPORT= to import the symbol without the __imp_ prefix
Details:
- Adjusted test/3/Makefile so that the test drivers are linked against
Eigen's BLAS library for hemm, herk, trmm, and trsm. We have to do
this since Eigen's headers don't define implementations to the
standard BLAS APIs.
- Simplified #included headers in hemm, herk, trmm, and trsm source
driver files, since nothing specific to Eigen is needed at
compile-time for those operations.
Details:
- Use compile-time implementations of Eigen in test_gemm.c via new
EIGEN cpp macro, defined on command line. (Linking to Eigen's BLAS
library is not necessary.) However, as of Eigen 3.3.7, Eigen only
parallelizes the gemm operation and not hemm, herk, trmm, trsm, or
any other level-3 operation.
- Fixed a bug in trmm and trsm drivers whereby the wrong function
(bli_does_trans()) was being called to determine whether the object
for matrix A should be created for a left- or right-side case. This
was corrected by changing the function to bli_is_left(), as is done
in the hemm driver.
- Added support for running Eigen test drivers from runme.sh.
clang -dumpversion gives 4.2.1 for all clang versions as clang was
originally compatible with gcc 4.2.1
Apple clang version and clang version are two different things
and the real clang version cannot be deduced from apple clang version
programatically. Rely on wikipedia to map apple clang to clang version
Also fixes assembly detection with clang
clang 3.8 can't build knl as it doesn't recognize zmm0
Details:
- Modified bli_blas.h so that:
- By default, if the BLAS layer is enabled at configure-time, BLAS
prototypes are also enabled within blis.h;
- But if the user #defines BLIS_DISABLE_BLAS_DEFS prior to including
blis.h, BLAS prototypes are skipped over entirely so that, for
example, the application or some other header pulled in by the
application may prototype the BLAS functions without causing any
duplication.
- Updated docs/BuildSystem.md to document the feature above, and
related text.
Details:
- Added targets to test/3/Makefile that link against a BLAS library
build by Eigen. It appears, however, that Eigen's BLAS library does
not support multithreading. (It may be that multithreading is only
available when using the native C++ APIs.)
- Updated runme.sh with a few Eigen-related tweaks.
- Minor tweaks to docs/Performance.md.
Details:
- Added a new markdown document, docs/Performance.md, which reports
performance of a representative set of level-3 operations across a
variety of hardware architectures, comparing BLIS to OpenBLAS and a
vendor library (MKL on Intel/AMD, ARMPL on ARM). Performance graphs,
in pdf and png formats, reside in docs/graphs.
- Updated README.md to link to new Performance.md document.
- Minor updates to CREDITS, docs/Multithreading.md.
- Minor updates to matlab scripts in test/3/matlab.
Details:
- Adjusted the zen sub-configuration's cache blocksizes for float,
scomplex, and dcomplex based on the existing values for double.
(The previous values were taken directly from the haswell subconfig,
which targets Intel Haswell/Broadwell/Skylake systems.)
Details:
- Replaced the existing --enable-export-all / --disable-export-all
configure option with --export-shared=[public|all], with the 'public'
instance of the latter corresponding to --disable-export-all and the
'all' instance corresponding to --enable-export-all. Nothing else
semantically about the option, or its default, has changed.