Details:
- Removed -funsafe-loop-optimizations from the configuration families
affected by 6a014a3, specifically: intel64, amd64, and x86_64.
This is part of an attempt to debug why the sde, as executed by
Travis CI, is crashing via the following error:
TID 0 SDE-ERROR: Executed instruction not valid for specified chip
(ICELAKE): 0x9172a5: bextr_xop rax, rcx, 0x103
Details:
- Per Dave Love's recommendation in issue #300, this commit defines
COPTFLAGS := -03
and
CRVECFLAGS := $(CKVECFLAGS) -funsafe-loop-optimizations
in the make_defs.mk for all Intel- and AMD-based configurations.
Details:
- trsm parallelization was temporarily simplifed in 075143d to entirely
ignore any parallelism specified via the pc or ir loops. Now, any
parallelism specified to the pc loop will be redirected to the ic
loop, and any parallelism specified to the ir loop will be redirected
to the jr loop. (Note that because of inter-iteration dependencies,
trsm cannot parallelize the ir loop. Parallelism via the pc loop is
at least somewhat feasible in theory, but it would require tracking
dependencies between blocks--something for which BLIS currently lacks
the necessary supporting infrastructure.)
Details:
- Parallelism within the IC loop (3rd loop around the microkernel) is
now supported within the trsm operation. This is done via a new branch
on each of the control and thread trees, which guide execution of a
new trsm-only subproblem from within bli_trsm_blk_var1(). This trsm
subproblem corresponds to the macrokernel computation on only the
block of A that contains the diagonal (labeled as A11 in algorithms
with FLAME-like partitioning), and the corresponding row panel of C.
During the trsm subproblem, all threads within the JC communicator
participate and parallelize along the JR loop, including any
parallelism that was specified for the IC loop. (IR loop parallelism
is not supported for trsm due to inter-iteration dependencies.) After
this trsm subproblem is complete, a barrier synchronizes all
participating threads and then they proceed to apply the prescribed
BLIS_IC_NT (or equivalent) ways of parallelism (and any BLIS_JR_NT
parallelism specified within) to the remaining gemm subproblem (the
rank-k update that is performed using the newly updated row-panel of
B). Thus, trsm now supports JC, IC, and JR loop parallelism.
- Modified bli_trsm_l_cntl_create() to create the new "prenode" branch
of the trsm_l cntl_t tree. The trsm_r tree was left unchanged, for
now, since it is not currently used. (All trsm problems are cast in
terms of left-side trsm.)
- Updated bli_cntl_free_w_thrinfo() to be able to free the newly shaped
trsm cntl_t trees. Fixed a potentially latent bug whereby a cntl_t
subnode is only recursed upon if there existed a corresponding
thrinfo_t node, which may not always exist (for problems too small
to employ full parallelization due to the minimum granularity imposed
by micropanels).
- Updated other functions in frame/base/bli_cntl.c, such as
bli_cntl_copy() and bli_cntl_mark_family(), to recurse on sub-prenodes
if they exist.
- Updated bli_thrinfo_free() to recurse into sub-nodes and prenodes
when they exist, and added support for growing a prenode branch to
bli_thrinfo_grow() via a corresponding set of help functions named
with the _prenode() suffix.
- Added a bszid_t field thrinfo_t nodes. This field comes in handy when
debugging the allocation/release of thrinfo_t nodes, as it helps trace
the "identity" of each nodes as it is created/destroyed.
- Renamed
bli_l3_thrinfo_print_paths() -> bli_l3_thrinfo_print_gemm_paths()
and created a separate bli_l3_thrinfo_print_trsm_paths() function to
print out the newly reconfigured thrinfo_t trees for the trsm
operation.
- Trival changes to bli_gemm_blk_var?.c and bli_trsm_blk_var?.c
regarding variable declarations.
- Removed subpart_t enum values BLIS_SUBPART1T, BLIS_SUBPART1B,
BLIS_SUBPART1L, BLIS_SUBPART1R. Then added support for two new labels
(semantically speaking): BLIS_SUBPART1A and BLIS_SUBPART1B, which
represent the subpartition ahead of and behind, respectively,
BLIS_SUBPART1. Updated check functions in bli_check.c accordingly.
- Shuffled layering/APIs for bli_acquire_mpart_[mn]dim() and
bli_acquire_mpart_t2b/b2t(), _l2r/r2l().
- Deprecated old functions in frame/3/bli_l3_thrinfo.c.
Formally registered power9 sub-configuration.
Details:
- Added and registered power9 sub-configuration into the build system.
Thanks to Nicholai Tukanov and Devangi Parikh for these contributions.
- Note: The sub-configuration does not yet have a corresponding
architecture-specific kernel set registered, and so for now the
sub-config is using the generic kernel set.
Details:
- Replaced direct usage of _Pragma( "omp simd" ) in reference kernels
with PRAGMA_SIMD, which is defined as a function of the compiler being
used in a new bli_pragma_macro_defs.h file. That definition is cleared
when BLIS detects that the -fopenmp-simd command line option is
unsupported. Thanks to Devin Matthews and Jeff Hammond for suggestions
that guided this commit.
- Updated configure and bli_config.h.in so that the appropriate anchor
is substituted in (when the corresponding pragma omp simd support is
present).
Details:
- Changed all occurrances of
micro-kernel -> microkernel
macro-kernel -> macrokernel
micro-panel -> micropanel
in all markdown documents in 'docs' directory. This change is being
made since we've reached the point in adoption and acceptance of
BLIS's insights where words such as "microkernel" are no longer new,
and therefore now merit being unhyphenated.
- Updated "Implementation Notes" sections of KernelsHowTo.md, which
still contained references to nonexistent cpp macros such as
BLIS_DEFAULT_MR_? and BLIS_PACKDIM_MR_?.
- Added 'run-fast' and 'check-fast' targets to testsuite/Makefile.
- Minor updates to Testsuite.md, including suggesting use of
'make check' and 'make check-fast' when running from the local
testsuite directory.
- Added a comment to top-level Makefile explaining the purpose behind
the TESTSUITE_WRAPPER variable, which at first glance appears to serve
no purpose.
Details:
- Fixed code in the skx subconfiguration that became a bug after
committing bdd46f9. Specifically, the bli_cntx_init_skx() function
was overwriting default blocksizes for the scomplex and dcomplex
microkernels despite the fact that only single and double real
microkernels were being registered. This was not a problem prior to
bdd46f9 since all microkernels used dynamically-queried (at runtime)
register blocksizes for loop bounds. However, post-bdd46f9, this
became a bug because the reference ukernels for scomplex and dcomplex
were written with their register blocksizes hard-coded as constant
loop bounds, which conflicted the the erroneous scomplex and dcomplex
values that bli_cntx_init_skx() was setting in the context. The
lesson here is that going forward, all subconfigurations must not set
any blocksizes for datatypes corresponding to default/reference
microkernels. (Note that a blocksize is left unchanged by the
bli_cntx_set_blkszs() function if it was set to -1.)
Details:
- Fixed a bug that mainfested anytime a configuration was used in which
optimized microkernels were registered and the trsm operation (or
kernel) was invoked. The bug resulted from the optimized microkernels'
register blocksizes conflicting with the hard-coded values--expressed
in the form of constant loop bounds--used in the new reference trsm
ukernels that were introduced in bdd46f9. The fix was easy: reverting
back to the implementation that uses variable-bound loops, which
amounted to changing an #if 0 to #if 1 (since I preserved the older
implementation in the file alongside the new code based on constant-
bound loops). It should be noted that this fix must be permanent,
since the trsm kernel code with constant-bound loops can never work
with gemm ukernels that use different register blocksizes.
Details:
- Rewrote level-1v, -1f, and -3 reference kernels in terms of simplified
indexing annotated by the #pragma omp simd directive, which a compiler
can use to vectorize certain constant-bounded loops. (The new kernels
actually use _Pragma("omp simd") since the kernels are defined via
templatizing macros.) Modest speedup was observed in most cases using
gcc 5.4.0, which may improve with newer versions. Thanks to Devin
Matthews for suggesting this via issue #286 and #259.
- Updated default blocksizes defined in ref_kernels/bli_cntx_ref.c to
be 4x16, 4x8, 4x8, and 4x4 for single, double, scomplex and dcomplex,
respectively, with a default row preference for the gemm ukernel. Also
updated axpyf, dotxf, and dotxaxpyf fusing factors to 8, 6, and 4,
respectively, for all datatypes.
- Modified configure to verify that -fopenmp-simd is a valid compiler
option (via a new detect/omp_simd/omp_simd_detect.c file).
- Added a new header in which prefetch macros are defined according to
which compiler is detected (via macros such as __GNUC__). These
prefetch macros are not yet employed anywhere, though.
- Updated the year in copyrights of template license headers in
build/templates and removed AMD as a default copyright holder.
Details:
- Guard typedef of ftnlen in f2c_types.h with a #ifndef HAVE_BLIS_H
directive to prevent the redefinition of that type. Thanks to Jeff
Diamond for reporting this compiler warning (and apologies for the
delay in committing a fix).
Details:
- Add os_name to the list of variables into which the '/' character is
escaped. This is meant to address (or at least make progress toward
addressing) #293. Thanks to Isuru Fernando for spotting this as the
potential fix, and also thanks to M. Zhou for the original report.
Details:
- Added a variant set of matlab scripts geared to producing plots that
reflect performance data gathered with and without extra memory
optimizations enabled. These scripts reside (for now) in
test/mixeddt/matlab/wawoxmem.
Details:
- Removed malloc_ft and free_ft function pointer arguments from the
interface to bli_apool_init() after deciding that there is no need to
specify the malloc()/free() for blocks within the apool. (The apool
blocks are actually just array_t structs.) Instead, we simply call
bli_malloc_intl()/_free_intl() directly. This has the added benefit
of allowing additional output when memory tracing is enabled via
--enable-mem-tracing. Also made corresponding changes elsewhere in
the apool API.
- Changed the inner pools (elements of the array_t within the apool_t)
to use BLIS_MALLOC_POOL and BLIS_FREE_POOL instead of BLIS_MALLOC_INTL
and BLIS_FREE_INTL.
- Disabled definitions of bli_malloc_pool() and bli_free_pool() since
there are no longer any consumers of these functions.
- Very minor comment / printf() updates.
* Initialize error messages at compile time
- Assigning strings directly to the bli_error_string array, instead of
snprintf() at execution-time.
* Retired bli_error_init(), _finalize().
Details:
- Removed functions obviated by changes in 80e8dc6: bli_error_init(),
bli_error_finalize(), and bli_error_init_msgs(), as well as calls to
the former two in bli_init.c.
* Regenerated symbols in build/libblis-symbols.def.
Details:
- Reran ./build/regen-symbols.sh after running
'configure --enable-cblas auto'.
Details:
- Updated/added comments about Fedora, OpenSUSE, and GNU Guix under the
newly-renamed "External GNU/Linux packages" section. Thanks to Dave
Love for providing these revisions.
Details:
- Relax version blacklisting of python3 to allow 3.4 or later instead
of 3.5 or later. Thanks to Dave Love for pointing out that 3.4 was
sufficient for the purpose of BLIS's build system. (It should be
noted that we're not sure which, if any, python3 versions prior to
3.4 are insufficient, and that the only thing stopping us from
determining this is the fact that these earlier versions of python3
are not readily available for us to test with.)
- Updated docs/BuildSystem.md to be explicit about current python2 vs
python3 version requirements.
Details:
- Added "What's New" and "What People Are Saying About BLIS" sections to
README.md.
- Added missing github handles to various individuals' entries in the
CREDITS file.
Details:
- Perform the same check for NULL return values and error message output
in bli_fmalloc_noalign() as is performed by bli_fmalloc_align(). (This
change was intended for f272c289.)
Details:
- Output an error message if and when the malloc()-equivalent called by
bli_fmalloc_align() ever returns NULL. Everything was already in place
for this to happen, including the error return code, the error string
sprintf(), the error checking function bli_check_valid_malloc_buf()
definition, and its prototype. Thanks to Minh Quan Ho for pointing out
the missing error message.
- Increased the default block_ptrs_len for each inner pool stored in the
small block allocator from 10 to 25. Under normal execution, each
thread uses only 21 blocks, so this change will prevent the sba from
needing to resize the block_ptrs array of any given inner pool as
threads initially populate the pool with small blocks upon first
execution of a level-3 operation.
- Nix stray newline echo in configure.
Details:
- Implemented a sophisticated data structure and set of APIs that track
the small blocks of memory (around 80-100 bytes each) used when
creating nodes for control and thread trees (cntl_t and thrinfo_t) as
well as thread communicators (thrcomm_t). The purpose of the small
block allocator, or sba, is to allow the library to transition into a
runtime state in which it does not perform any calls to malloc() or
free() during normal execution of level-3 operations, regardless of
the threading environment (potentially multiple application threads
as well as multiple BLIS threads). The functionality relies on a new
data structure, apool_t, which is (roughly speaking) a pool of
arrays, where each array element is a pool of small blocks. The outer
pool, which is protected by a mutex, provides separate arrays for each
application thread while the arrays each handle multiple BLIS threads
for any given application thread. The design minimizes the potential
for lock contention, as only concurrent application threads would
need to fight for the apool_t lock, and only if they happen to begin
their level-3 operations at precisely the same time. Thanks to Kiran
Varaganti and AMD for requesting this feature.
- Added a configure option to disable the sba pools, which are enabled
by default; renamed the --[dis|en]able-packbuf-pools option to
--[dis|en]able-pba-pools; and rewrote the --help text associated with
this new option and consolidated it with the --help text for the
option associated with the sba (--[dis|en]able-sba-pools).
- Moved the membrk field from the cntx_t to the rntm_t. We now pass in
a rntm_t* to the bli_membrk_acquire() and _release() APIs, just as we
do for bli_sba_acquire() and _release().
- Replaced all calls to bli_malloc_intl() and bli_free_intl() that are
used for small blocks with calls to bli_sba_acquire(), which takes a
rntm (in addition to the bytes requested), and bli_sba_release().
These latter two functions reduce to the former two when the sba pools
are disabled at configure-time.
- Added rntm_t* arguments to various cntl_t and thrinfo_t functions, as
required by the new usage of bli_sba_acquire() and _release().
- Moved the freeing of "old" blocks (those allocated prior to a change
in the block_size) from bli_membrk_acquire_m() to the implementation
of the pool_t checkout function.
- Miscellaneous improvements to the pool_t API.
- Added a block_size field to the pblk_t.
- Harmonized the way that the trsm_ukr testsuite module performs packing
relative to that of gemmtrsm_ukr, in part to avoid the need to create
a packm control tree node, which now requires a rntm_t that has been
initialized with an sba and membrk.
- Re-enable explicit call bli_finalize() in testsuite so that users who
run the testsuite with memory tracing enabled can check for memory
leaks.
- Manually imported the compact/minor changes from 61441b24 that cause
the rntm to be copied locally when it is passed in via one of the
expert APIs.
- Reordered parameters to various bli_thrcomm_*() functions so that the
thrcomm_t* to the comm being modified is last, not first.
- Added more descriptive tracing for allocating/freeing small blocks and
formalized via a new configure option: --[dis|en]able-mem-tracing.
- Moved some unused scalm code and headers into frame/1m/other.
- Whitespace changes to bli_pthread.c.
- Regenerated build/libblis-symbols.def.
Details:
- In the case that the caller passes in a non-NULL rntm_t pointer into
one of the expert APIs for a level-3 operation (e.g. bli_gemm_ex()),
make a local copy of the rntm_t and use the address of that local copy
in all subsequent execution (which may change the contents of the
rntm_t). This prevents a potentially confusing situation whereby a
user-initialized rntm_t is used once (in, say, gemm), and then found
by the user to be in a different state before it is used a second
time.
Details:
- Updated External packages section in anticipation of introducing BLIS
into Debian package universe. Thanks to M. Zhou for sponsoring BLIS in
Debian.
Details:
- Fixed an issue with specifying threading globally at runtime via
bli_thread_set_num_threads() (the automatic way) or via
bli_thread_set_ways() (the manual way), with bli_thread_init_rntm()
also affected. These functions were not calling bli_init_once() prior
to acting, and therefore their effects on the global rntm_t structure
were being wiped out by the eventual call to bli_init_once(), by some
other BLIS function. Thanks to Ali Emre Gülcü for reporting the
behavior associated with this bug.
- Added additional content to docs/Multithreading.md covering topics of
choosing between OpenMP and pthreads, and specifying affinity via
OpenMP.
- CREDITS file update.
Details:
- Added malloc_ft and free_ft fields to pool_t, which are provided when
the pool is initialized, to allow bli_pool_alloc_block() and
bli_pool_free_block() to call bli_fmalloc_align()/bli_ffree_align()
with arbitrary align_size values (according to how the pool_t was
initialized).
- Added a block_ptrs_len argument to bli_pool_init(), which allows the
caller to specify an initial length for the block_ptrs array, which
previously suffered the cost of being reallocated, copied, and freed
each time a new block was added to the pool.
- Consolidated the "buf_sys" and "buf_align" pointer fields in pblk_t
into a single "buf" field. Consolidated the bli_pblk API accordingly
and also updated the bli_mem API implementation. This was done
because I'd previously already implemented opaque alignment via
bli_malloc_align(), which allocates extra space and stores the
original pointer returned by malloc() one element before the element
whose address is aligned.
- Tweaked bli_membrk_acquire_m() and bli_membrk_release() to call
bli_fmalloc_align() and bli_ffree_align(), which required adding an
align_size field to the membrk_t struct.
- Pass the pack schemas directly into bli_l3_cntl_create_if() rather
than transmit them via objects for A and B.
- Simplified bli_l3_cntl_free_if() and renamed to bli_l3_cntl_free().
The function had not been conditionally freeing control trees for
quite some time. Also, removed obj_t* parameters since they aren't
needed anymore (or never were).
- Spun-off OpenMP nesting code in bli_l3_thread_decorator() to a
separate function, bli_l3_thread_decorator_thread_check().
- Renamed:
bli_malloc_align() -> bli_fmalloc_align()
bli_free_align() -> bli_ffree_align()
bli_malloc_noalign() -> bli_fmalloc_noalign()
bli_free_noalign() -> bli_ffree_noalign()
The 'f' is for "function" since they each take a malloc_ft or free_ft
function pointer argument.
- Inserted various printf() calls for the purposes of tracing memory
allocation and freeing, guarded by cpp macro ENABLE_MEM_DEBUG, which,
for now, is intended to be a "hidden" feature rather than one hooked
up to a configure-time option.
- Defined bli_rntm_equals(), which compares two rntm_t for equality.
(There are no use cases for this function yet, but there may be soon.)
- Whitespace changes to function parameter lists in bli_pool.c, .h.
Details:
- Updated the API and semantics of packm kernels such that they must now
handle edge cases, meaning that a c-by-k packm kernel must be able to
pack edge cases that are fewer than c rows/columns and be able to
zero-fill the remaining elements. They must also be able to zero-fill
the equivalent region when copying fewer than k columns/rows (which is
needed by trsm). The new packm kernel API is generally:
void packm_kernel
(
conj_t conja,
dim_t cdim,
dim_t n,
dim_t n_max,
ctype* restrict kappa,
ctype* restrict a, inc_t inca, inc_t lda,
ctype* restrict p, inc_t ldp,
cntx_t* restrict cntx
);
where cdim and n are the dimensions (short and long, respectively) of
the submatrix being copied from the source matrix A, and n_max is the
"full" long dimension (corresponding to the k dimension in gemm) of
the micropanel. The "full" short dimension (corresponding to the
register blocksize MR or NR) is not part of the API because it is
known intrinsically by the packm kernel implementation. Thanks to
Devin Matthews for prompting us to make this change (#282).
- Updated all reference packm kernels in ref_kernels/1m according to
above changes, as well as all optimized packm kernels (which only
consisted of those for knl).
- Bumped the major soname version number in 'so_version' to 2. At first
I was considering leaving it unchanged, but I couldn't escape the
reality that the packm kernel API is much closer to an expert API
than it is some obscure helper function interface within the framework
that nobody would ever notice.
- Removed reference packm kernels for mr/nr = 30. The only sub-config
that would have been using those kernels is knc, which is likely no
longer being used by very many people (if any). (This also mostly
offset the larger object code footprint incurred by moving the edge-
case handling into the individual packm kernels.)
- Fixed an obscure race condition for 3mh and 4mh induced methods in
which those implementations were modifying the contexts stored in the
gks rather than a local copy.
- Fixed a minor bug in the testsuite that prevented non-1m-based induced
method implementations of trsm from executing.
Details:
- Disabled (commented out) the arm32 and arm64 configuration families
in the config_registry file. Having a configuration family registered
only makes sense if BLIS is currently outfitted with runtime hardware
detection logic to choose the appropriate sub-configuration. That
logic is currently missing for ARM architectures, and thus having the
ARM configuration families in the configuration registry only serves
to confuse people. Thanks to Devangi Parikh for suggesting this
change.
Details:
- Parameterized, reorganized, and added comments to matlab scripts in
test/mixeddt/matlab.
- Reordered some lines of code and added comments to plot_l3_perf.m in
test/3m4m/matlab.