Details:
- Updated the Haswell, SkylakeX, and Epyc performance graphs in
docs/graphs to report on Eigen implementations, where applicable.
Specifically, Eigen implements all level-3 operations sequentially,
however, of those operations it only provides multithreaded gemm.
Thus, mt results for symm/hemm, syrk/herk, trmm, and trsm are
omitted. Thanks to Sameer Agarwal for his help configuring and
using Eigen.
- Updated docs/Performance.md to note the new implementation tested.
- CREDITS file update.
Details:
- Updated matlab scripts in test/3/matlab to optionally plot/display
Eigen performance curves. Whether Eigen is plotted is determined by
a new boolean function parameter, with_eigen.
- Updated runme.m scratchpad to reflect the latest invocations of the
plot_panel_4x5() function (with Eigen plotting enabled).
Details:
- Fixed the Makefile in test/3 so that it no longer incorrectly labels
the matlab output variables from Eigen-linked hemm, herk, trmm, and
trsm driver output as "vendor". (The gemm drivers were already
correctly outputing matlab variables containing the "eigen" label.)
Export macros can't support both shared and static at the same time.
When blis is built with both shared and static, headers assume that
shared is used at link time and dllimports the symbols with __imp_
prefix.
To use the headers with static libraries a user can give
-DBLIS_EXPORT= to import the symbol without the __imp_ prefix
Details:
- Adjusted test/3/Makefile so that the test drivers are linked against
Eigen's BLAS library for hemm, herk, trmm, and trsm. We have to do
this since Eigen's headers don't define implementations to the
standard BLAS APIs.
- Simplified #included headers in hemm, herk, trmm, and trsm source
driver files, since nothing specific to Eigen is needed at
compile-time for those operations.
Details:
- Use compile-time implementations of Eigen in test_gemm.c via new
EIGEN cpp macro, defined on command line. (Linking to Eigen's BLAS
library is not necessary.) However, as of Eigen 3.3.7, Eigen only
parallelizes the gemm operation and not hemm, herk, trmm, trsm, or
any other level-3 operation.
- Fixed a bug in trmm and trsm drivers whereby the wrong function
(bli_does_trans()) was being called to determine whether the object
for matrix A should be created for a left- or right-side case. This
was corrected by changing the function to bli_is_left(), as is done
in the hemm driver.
- Added support for running Eigen test drivers from runme.sh.
clang -dumpversion gives 4.2.1 for all clang versions as clang was
originally compatible with gcc 4.2.1
Apple clang version and clang version are two different things
and the real clang version cannot be deduced from apple clang version
programatically. Rely on wikipedia to map apple clang to clang version
Also fixes assembly detection with clang
clang 3.8 can't build knl as it doesn't recognize zmm0
Details:
- Modified bli_blas.h so that:
- By default, if the BLAS layer is enabled at configure-time, BLAS
prototypes are also enabled within blis.h;
- But if the user #defines BLIS_DISABLE_BLAS_DEFS prior to including
blis.h, BLAS prototypes are skipped over entirely so that, for
example, the application or some other header pulled in by the
application may prototype the BLAS functions without causing any
duplication.
- Updated docs/BuildSystem.md to document the feature above, and
related text.
Details:
- Added targets to test/3/Makefile that link against a BLAS library
build by Eigen. It appears, however, that Eigen's BLAS library does
not support multithreading. (It may be that multithreading is only
available when using the native C++ APIs.)
- Updated runme.sh with a few Eigen-related tweaks.
- Minor tweaks to docs/Performance.md.
Details:
- Added a new markdown document, docs/Performance.md, which reports
performance of a representative set of level-3 operations across a
variety of hardware architectures, comparing BLIS to OpenBLAS and a
vendor library (MKL on Intel/AMD, ARMPL on ARM). Performance graphs,
in pdf and png formats, reside in docs/graphs.
- Updated README.md to link to new Performance.md document.
- Minor updates to CREDITS, docs/Multithreading.md.
- Minor updates to matlab scripts in test/3/matlab.
Details:
- Adjusted the zen sub-configuration's cache blocksizes for float,
scomplex, and dcomplex based on the existing values for double.
(The previous values were taken directly from the haswell subconfig,
which targets Intel Haswell/Broadwell/Skylake systems.)
Details:
- Replaced the existing --enable-export-all / --disable-export-all
configure option with --export-shared=[public|all], with the 'public'
instance of the latter corresponding to --disable-export-all and the
'all' instance corresponding to --enable-export-all. Nothing else
semantically about the option, or its default, has changed.
Details:
- Made extra explicit the fact that: (a) multithreading in BLIS is
disabled by default; and (b) even with multithreading enabled, the
user must specify multithreading at runtime in order to observe
parallelism. Thanks to M. Zhou for suggesting these clarifications
in #292.
- Also made explicit that only the environment variable and global
runtime API methods are available when using the BLAS API. If the
user wishes to use the local runtime API (specify multithreading on
a per-call basis), one of the native BLIS APIs must be used.
Details:
- Modified common.mk to use the -fvisibility=[hidden|default] option
when compiling with clang on non-Windows platforms (Linux, BSD, OS X,
etc.). Thanks to Isuru Fernando for pointing out this option works
with clang on these OSes.
Details:
- Added export annotations to additional function prototypes in order to
accommodate the testsuite.
- Disabled calling bli_amaxv_check() from within the testsuite's
test_amaxv.c.
Details:
- Introduced a new configure option, --enable-export-all, which will
cause all shared library symbols to be exported by default, or,
alternatively, --disable-export-all, which will cause all symbols to
be hidden by default, with only those symbols that are annotated for
visibility, via BLIS_EXPORT_BLIS (and BLIS_EXPORT_BLAS for BLAS
symbols), to be exported. The default for this configure option is
--disable-export-all. Thanks to Isuru Fernando for consulting on
this commit.
- Removed BLIS_EXPORT_BLIS annotations from frame/1m/bli_l1m_unb_var1.h,
which was intended for 5a5f494.
- Relocated BLIS_EXPORT-related cpp logic from bli_config.h.in to
frame/include/bli_config_macro_defs.h.
- Provided appropriate logic within common.mk to implement variable
symbol visibility for gcc, clang, and icc (to the extend that each of
these compilers allow).
- Relocated --help text associated with debug option (-d) to configure
slightly further down in the list.
Details:
- After merging PR #303, at Isuru's request, I removed the use of
BLIS_EXPORT_BLIS from all function prototypes *except* those that we
potentially wish to be exported in shared/dynamic libraries. In other
words, I removed the use of BLIS_EXPORT_BLIS from all prototypes of
functions that can be considered private or for internal use only.
This is likely the last big modification along the path towards
implementing the functionality spelled out in issue #248. Thanks
again to Isuru Fernando for his initial efforts of sprinkling the
export macros throughout BLIS, which made removing them where
necessary relatively painless. Also, I'd like to thank Tony Kelman,
Nathaniel Smith, Ian Henriksen, Marat Dukhan, and Matthew Brett for
participating in the initial discussion in issue #37 that was later
summarized and restated in issue #248.
- CREDITS file update.
* Revert "restore bli_extern_defs exporting for now"
This reverts commit 09fb07c350b2acee17645e8e9e1b8d829c73dca8.
* Remove symbols not intended to be public
* No need of def file anymore
* Fix whitespace
* No need of configure option
* Remove export macro from definitions
* Remove blas export macro from definitions
Details:
- Renamed '3m4m' directory to '3', which captures the directory nicely
since it builds test drivers to test level-3 operations.
- These test drivers ceased to be used to test the 3m and 4m (or even
1m) induced methods long ago, hence the name change.
Details:
- Further updates to matlab scripts, mostly for compatibility with
GNU Octave.
- More tweaks to runme.sh.
- Updates to runme.m that allow copy-paste into matlab interactive
session to generate graphs.
Details:
- Rewrote much of Makefile to generate executables for single- and dual-
socket multithreading as well as single-threaded. Each of the three
can also use a different problem size range/increment, as is often
appropriate when doubling/halving the number of threads.
- Rewrote runme.sh script to flexibly execute as many threading
parameter scenarios as is given in the input parameter string
(currently set within the script itself). The string also encodes
the maximum problem size for each threading scenario, which is used
to identify the executable to run. Also improved the "progress" output
of the script to reduce redundant info and improve readability in
terminals that are not especially wide.
- Minor updates to test_*.c source files.
- Updated matlab scripts according to changes made to the Makefile,
test drivers, and runme.sh script, and renamed 'plot_all.m' to
'runme.m'.
Details:
- Minor updates to matlab graph-generating scripts.
- Added a plot_all.m script that is more of a scratchpad for copying and
pasting function invocations into matlab to generate plots that are
presently of interest to us.
Details:
- Changed -funsafe-loop-optimizations (re-)introduced in 7690855 for
make_defs.mk files' CRVECFLAGS to -funsafe-math-optimizations (to
account for a miscommunication in issue #300). Thanks to Dave Love
for this suggestion and Jeff Hammond for his feedback on the topic.
Details:
- Restored use of -funsafe-loop-optimizations in the definitions of
CRVECFLAGS (when using gcc), but only for sub-configurations (and
not configuration families such as amd64, intel64, and x86_64).
This more or less reverts 5190d05 and 6cf1550.
Details:
- Added -mno-tbm -mno-xop -mno-lwp to CKVECFLAGS in bulldozer,
piledriver, steamroller, and excavator configurations to explicitly
disable AMD's bulldozer-era TBM, XOP, and LWP instruction sets in an
attempt to fix the invalid instruction error that has plagued Travis
CI builds since 6a014a3. Thanks to Devin Matthews for pointing out
that the offending instruction was part of TBM (issue #300).
- Restored -O3 to piledriver configuration's COPTFLAGS.
Details:
- Removed -funsafe-loop-optimizations from the configuration families
affected by 6a014a3, specifically: intel64, amd64, and x86_64.
This is part of an attempt to debug why the sde, as executed by
Travis CI, is crashing via the following error:
TID 0 SDE-ERROR: Executed instruction not valid for specified chip
(ICELAKE): 0x9172a5: bextr_xop rax, rcx, 0x103
Details:
- Per Dave Love's recommendation in issue #300, this commit defines
COPTFLAGS := -03
and
CRVECFLAGS := $(CKVECFLAGS) -funsafe-loop-optimizations
in the make_defs.mk for all Intel- and AMD-based configurations.
Details:
- trsm parallelization was temporarily simplifed in 075143d to entirely
ignore any parallelism specified via the pc or ir loops. Now, any
parallelism specified to the pc loop will be redirected to the ic
loop, and any parallelism specified to the ir loop will be redirected
to the jr loop. (Note that because of inter-iteration dependencies,
trsm cannot parallelize the ir loop. Parallelism via the pc loop is
at least somewhat feasible in theory, but it would require tracking
dependencies between blocks--something for which BLIS currently lacks
the necessary supporting infrastructure.)
Details:
- Parallelism within the IC loop (3rd loop around the microkernel) is
now supported within the trsm operation. This is done via a new branch
on each of the control and thread trees, which guide execution of a
new trsm-only subproblem from within bli_trsm_blk_var1(). This trsm
subproblem corresponds to the macrokernel computation on only the
block of A that contains the diagonal (labeled as A11 in algorithms
with FLAME-like partitioning), and the corresponding row panel of C.
During the trsm subproblem, all threads within the JC communicator
participate and parallelize along the JR loop, including any
parallelism that was specified for the IC loop. (IR loop parallelism
is not supported for trsm due to inter-iteration dependencies.) After
this trsm subproblem is complete, a barrier synchronizes all
participating threads and then they proceed to apply the prescribed
BLIS_IC_NT (or equivalent) ways of parallelism (and any BLIS_JR_NT
parallelism specified within) to the remaining gemm subproblem (the
rank-k update that is performed using the newly updated row-panel of
B). Thus, trsm now supports JC, IC, and JR loop parallelism.
- Modified bli_trsm_l_cntl_create() to create the new "prenode" branch
of the trsm_l cntl_t tree. The trsm_r tree was left unchanged, for
now, since it is not currently used. (All trsm problems are cast in
terms of left-side trsm.)
- Updated bli_cntl_free_w_thrinfo() to be able to free the newly shaped
trsm cntl_t trees. Fixed a potentially latent bug whereby a cntl_t
subnode is only recursed upon if there existed a corresponding
thrinfo_t node, which may not always exist (for problems too small
to employ full parallelization due to the minimum granularity imposed
by micropanels).
- Updated other functions in frame/base/bli_cntl.c, such as
bli_cntl_copy() and bli_cntl_mark_family(), to recurse on sub-prenodes
if they exist.
- Updated bli_thrinfo_free() to recurse into sub-nodes and prenodes
when they exist, and added support for growing a prenode branch to
bli_thrinfo_grow() via a corresponding set of help functions named
with the _prenode() suffix.
- Added a bszid_t field thrinfo_t nodes. This field comes in handy when
debugging the allocation/release of thrinfo_t nodes, as it helps trace
the "identity" of each nodes as it is created/destroyed.
- Renamed
bli_l3_thrinfo_print_paths() -> bli_l3_thrinfo_print_gemm_paths()
and created a separate bli_l3_thrinfo_print_trsm_paths() function to
print out the newly reconfigured thrinfo_t trees for the trsm
operation.
- Trival changes to bli_gemm_blk_var?.c and bli_trsm_blk_var?.c
regarding variable declarations.
- Removed subpart_t enum values BLIS_SUBPART1T, BLIS_SUBPART1B,
BLIS_SUBPART1L, BLIS_SUBPART1R. Then added support for two new labels
(semantically speaking): BLIS_SUBPART1A and BLIS_SUBPART1B, which
represent the subpartition ahead of and behind, respectively,
BLIS_SUBPART1. Updated check functions in bli_check.c accordingly.
- Shuffled layering/APIs for bli_acquire_mpart_[mn]dim() and
bli_acquire_mpart_t2b/b2t(), _l2r/r2l().
- Deprecated old functions in frame/3/bli_l3_thrinfo.c.
Formally registered power9 sub-configuration.
Details:
- Added and registered power9 sub-configuration into the build system.
Thanks to Nicholai Tukanov and Devangi Parikh for these contributions.
- Note: The sub-configuration does not yet have a corresponding
architecture-specific kernel set registered, and so for now the
sub-config is using the generic kernel set.
Details:
- Replaced direct usage of _Pragma( "omp simd" ) in reference kernels
with PRAGMA_SIMD, which is defined as a function of the compiler being
used in a new bli_pragma_macro_defs.h file. That definition is cleared
when BLIS detects that the -fopenmp-simd command line option is
unsupported. Thanks to Devin Matthews and Jeff Hammond for suggestions
that guided this commit.
- Updated configure and bli_config.h.in so that the appropriate anchor
is substituted in (when the corresponding pragma omp simd support is
present).