Details:
- Changed all occurrances of
micro-kernel -> microkernel
macro-kernel -> macrokernel
micro-panel -> micropanel
in all markdown documents in 'docs' directory. This change is being
made since we've reached the point in adoption and acceptance of
BLIS's insights where words such as "microkernel" are no longer new,
and therefore now merit being unhyphenated.
- Updated "Implementation Notes" sections of KernelsHowTo.md, which
still contained references to nonexistent cpp macros such as
BLIS_DEFAULT_MR_? and BLIS_PACKDIM_MR_?.
- Added 'run-fast' and 'check-fast' targets to testsuite/Makefile.
- Minor updates to Testsuite.md, including suggesting use of
'make check' and 'make check-fast' when running from the local
testsuite directory.
- Added a comment to top-level Makefile explaining the purpose behind
the TESTSUITE_WRAPPER variable, which at first glance appears to serve
no purpose.
Details:
- Fixed code in the skx subconfiguration that became a bug after
committing bdd46f9. Specifically, the bli_cntx_init_skx() function
was overwriting default blocksizes for the scomplex and dcomplex
microkernels despite the fact that only single and double real
microkernels were being registered. This was not a problem prior to
bdd46f9 since all microkernels used dynamically-queried (at runtime)
register blocksizes for loop bounds. However, post-bdd46f9, this
became a bug because the reference ukernels for scomplex and dcomplex
were written with their register blocksizes hard-coded as constant
loop bounds, which conflicted the the erroneous scomplex and dcomplex
values that bli_cntx_init_skx() was setting in the context. The
lesson here is that going forward, all subconfigurations must not set
any blocksizes for datatypes corresponding to default/reference
microkernels. (Note that a blocksize is left unchanged by the
bli_cntx_set_blkszs() function if it was set to -1.)
Details:
- Fixed a bug that mainfested anytime a configuration was used in which
optimized microkernels were registered and the trsm operation (or
kernel) was invoked. The bug resulted from the optimized microkernels'
register blocksizes conflicting with the hard-coded values--expressed
in the form of constant loop bounds--used in the new reference trsm
ukernels that were introduced in bdd46f9. The fix was easy: reverting
back to the implementation that uses variable-bound loops, which
amounted to changing an #if 0 to #if 1 (since I preserved the older
implementation in the file alongside the new code based on constant-
bound loops). It should be noted that this fix must be permanent,
since the trsm kernel code with constant-bound loops can never work
with gemm ukernels that use different register blocksizes.
Details:
- Rewrote level-1v, -1f, and -3 reference kernels in terms of simplified
indexing annotated by the #pragma omp simd directive, which a compiler
can use to vectorize certain constant-bounded loops. (The new kernels
actually use _Pragma("omp simd") since the kernels are defined via
templatizing macros.) Modest speedup was observed in most cases using
gcc 5.4.0, which may improve with newer versions. Thanks to Devin
Matthews for suggesting this via issue #286 and #259.
- Updated default blocksizes defined in ref_kernels/bli_cntx_ref.c to
be 4x16, 4x8, 4x8, and 4x4 for single, double, scomplex and dcomplex,
respectively, with a default row preference for the gemm ukernel. Also
updated axpyf, dotxf, and dotxaxpyf fusing factors to 8, 6, and 4,
respectively, for all datatypes.
- Modified configure to verify that -fopenmp-simd is a valid compiler
option (via a new detect/omp_simd/omp_simd_detect.c file).
- Added a new header in which prefetch macros are defined according to
which compiler is detected (via macros such as __GNUC__). These
prefetch macros are not yet employed anywhere, though.
- Updated the year in copyrights of template license headers in
build/templates and removed AMD as a default copyright holder.
Details:
- Guard typedef of ftnlen in f2c_types.h with a #ifndef HAVE_BLIS_H
directive to prevent the redefinition of that type. Thanks to Jeff
Diamond for reporting this compiler warning (and apologies for the
delay in committing a fix).
Details:
- Add os_name to the list of variables into which the '/' character is
escaped. This is meant to address (or at least make progress toward
addressing) #293. Thanks to Isuru Fernando for spotting this as the
potential fix, and also thanks to M. Zhou for the original report.
Details:
- Added a variant set of matlab scripts geared to producing plots that
reflect performance data gathered with and without extra memory
optimizations enabled. These scripts reside (for now) in
test/mixeddt/matlab/wawoxmem.
Details:
- Removed malloc_ft and free_ft function pointer arguments from the
interface to bli_apool_init() after deciding that there is no need to
specify the malloc()/free() for blocks within the apool. (The apool
blocks are actually just array_t structs.) Instead, we simply call
bli_malloc_intl()/_free_intl() directly. This has the added benefit
of allowing additional output when memory tracing is enabled via
--enable-mem-tracing. Also made corresponding changes elsewhere in
the apool API.
- Changed the inner pools (elements of the array_t within the apool_t)
to use BLIS_MALLOC_POOL and BLIS_FREE_POOL instead of BLIS_MALLOC_INTL
and BLIS_FREE_INTL.
- Disabled definitions of bli_malloc_pool() and bli_free_pool() since
there are no longer any consumers of these functions.
- Very minor comment / printf() updates.
* Initialize error messages at compile time
- Assigning strings directly to the bli_error_string array, instead of
snprintf() at execution-time.
* Retired bli_error_init(), _finalize().
Details:
- Removed functions obviated by changes in 80e8dc6: bli_error_init(),
bli_error_finalize(), and bli_error_init_msgs(), as well as calls to
the former two in bli_init.c.
* Regenerated symbols in build/libblis-symbols.def.
Details:
- Reran ./build/regen-symbols.sh after running
'configure --enable-cblas auto'.
Details:
- Updated/added comments about Fedora, OpenSUSE, and GNU Guix under the
newly-renamed "External GNU/Linux packages" section. Thanks to Dave
Love for providing these revisions.
Details:
- Relax version blacklisting of python3 to allow 3.4 or later instead
of 3.5 or later. Thanks to Dave Love for pointing out that 3.4 was
sufficient for the purpose of BLIS's build system. (It should be
noted that we're not sure which, if any, python3 versions prior to
3.4 are insufficient, and that the only thing stopping us from
determining this is the fact that these earlier versions of python3
are not readily available for us to test with.)
- Updated docs/BuildSystem.md to be explicit about current python2 vs
python3 version requirements.
Details:
- Added "What's New" and "What People Are Saying About BLIS" sections to
README.md.
- Added missing github handles to various individuals' entries in the
CREDITS file.
Details:
- Perform the same check for NULL return values and error message output
in bli_fmalloc_noalign() as is performed by bli_fmalloc_align(). (This
change was intended for f272c289.)
Details:
- Output an error message if and when the malloc()-equivalent called by
bli_fmalloc_align() ever returns NULL. Everything was already in place
for this to happen, including the error return code, the error string
sprintf(), the error checking function bli_check_valid_malloc_buf()
definition, and its prototype. Thanks to Minh Quan Ho for pointing out
the missing error message.
- Increased the default block_ptrs_len for each inner pool stored in the
small block allocator from 10 to 25. Under normal execution, each
thread uses only 21 blocks, so this change will prevent the sba from
needing to resize the block_ptrs array of any given inner pool as
threads initially populate the pool with small blocks upon first
execution of a level-3 operation.
- Nix stray newline echo in configure.
Details:
- Implemented a sophisticated data structure and set of APIs that track
the small blocks of memory (around 80-100 bytes each) used when
creating nodes for control and thread trees (cntl_t and thrinfo_t) as
well as thread communicators (thrcomm_t). The purpose of the small
block allocator, or sba, is to allow the library to transition into a
runtime state in which it does not perform any calls to malloc() or
free() during normal execution of level-3 operations, regardless of
the threading environment (potentially multiple application threads
as well as multiple BLIS threads). The functionality relies on a new
data structure, apool_t, which is (roughly speaking) a pool of
arrays, where each array element is a pool of small blocks. The outer
pool, which is protected by a mutex, provides separate arrays for each
application thread while the arrays each handle multiple BLIS threads
for any given application thread. The design minimizes the potential
for lock contention, as only concurrent application threads would
need to fight for the apool_t lock, and only if they happen to begin
their level-3 operations at precisely the same time. Thanks to Kiran
Varaganti and AMD for requesting this feature.
- Added a configure option to disable the sba pools, which are enabled
by default; renamed the --[dis|en]able-packbuf-pools option to
--[dis|en]able-pba-pools; and rewrote the --help text associated with
this new option and consolidated it with the --help text for the
option associated with the sba (--[dis|en]able-sba-pools).
- Moved the membrk field from the cntx_t to the rntm_t. We now pass in
a rntm_t* to the bli_membrk_acquire() and _release() APIs, just as we
do for bli_sba_acquire() and _release().
- Replaced all calls to bli_malloc_intl() and bli_free_intl() that are
used for small blocks with calls to bli_sba_acquire(), which takes a
rntm (in addition to the bytes requested), and bli_sba_release().
These latter two functions reduce to the former two when the sba pools
are disabled at configure-time.
- Added rntm_t* arguments to various cntl_t and thrinfo_t functions, as
required by the new usage of bli_sba_acquire() and _release().
- Moved the freeing of "old" blocks (those allocated prior to a change
in the block_size) from bli_membrk_acquire_m() to the implementation
of the pool_t checkout function.
- Miscellaneous improvements to the pool_t API.
- Added a block_size field to the pblk_t.
- Harmonized the way that the trsm_ukr testsuite module performs packing
relative to that of gemmtrsm_ukr, in part to avoid the need to create
a packm control tree node, which now requires a rntm_t that has been
initialized with an sba and membrk.
- Re-enable explicit call bli_finalize() in testsuite so that users who
run the testsuite with memory tracing enabled can check for memory
leaks.
- Manually imported the compact/minor changes from 61441b24 that cause
the rntm to be copied locally when it is passed in via one of the
expert APIs.
- Reordered parameters to various bli_thrcomm_*() functions so that the
thrcomm_t* to the comm being modified is last, not first.
- Added more descriptive tracing for allocating/freeing small blocks and
formalized via a new configure option: --[dis|en]able-mem-tracing.
- Moved some unused scalm code and headers into frame/1m/other.
- Whitespace changes to bli_pthread.c.
- Regenerated build/libblis-symbols.def.
Details:
- In the case that the caller passes in a non-NULL rntm_t pointer into
one of the expert APIs for a level-3 operation (e.g. bli_gemm_ex()),
make a local copy of the rntm_t and use the address of that local copy
in all subsequent execution (which may change the contents of the
rntm_t). This prevents a potentially confusing situation whereby a
user-initialized rntm_t is used once (in, say, gemm), and then found
by the user to be in a different state before it is used a second
time.
Details:
- Updated External packages section in anticipation of introducing BLIS
into Debian package universe. Thanks to M. Zhou for sponsoring BLIS in
Debian.
Details:
- Fixed an issue with specifying threading globally at runtime via
bli_thread_set_num_threads() (the automatic way) or via
bli_thread_set_ways() (the manual way), with bli_thread_init_rntm()
also affected. These functions were not calling bli_init_once() prior
to acting, and therefore their effects on the global rntm_t structure
were being wiped out by the eventual call to bli_init_once(), by some
other BLIS function. Thanks to Ali Emre Gülcü for reporting the
behavior associated with this bug.
- Added additional content to docs/Multithreading.md covering topics of
choosing between OpenMP and pthreads, and specifying affinity via
OpenMP.
- CREDITS file update.
Details:
- Added malloc_ft and free_ft fields to pool_t, which are provided when
the pool is initialized, to allow bli_pool_alloc_block() and
bli_pool_free_block() to call bli_fmalloc_align()/bli_ffree_align()
with arbitrary align_size values (according to how the pool_t was
initialized).
- Added a block_ptrs_len argument to bli_pool_init(), which allows the
caller to specify an initial length for the block_ptrs array, which
previously suffered the cost of being reallocated, copied, and freed
each time a new block was added to the pool.
- Consolidated the "buf_sys" and "buf_align" pointer fields in pblk_t
into a single "buf" field. Consolidated the bli_pblk API accordingly
and also updated the bli_mem API implementation. This was done
because I'd previously already implemented opaque alignment via
bli_malloc_align(), which allocates extra space and stores the
original pointer returned by malloc() one element before the element
whose address is aligned.
- Tweaked bli_membrk_acquire_m() and bli_membrk_release() to call
bli_fmalloc_align() and bli_ffree_align(), which required adding an
align_size field to the membrk_t struct.
- Pass the pack schemas directly into bli_l3_cntl_create_if() rather
than transmit them via objects for A and B.
- Simplified bli_l3_cntl_free_if() and renamed to bli_l3_cntl_free().
The function had not been conditionally freeing control trees for
quite some time. Also, removed obj_t* parameters since they aren't
needed anymore (or never were).
- Spun-off OpenMP nesting code in bli_l3_thread_decorator() to a
separate function, bli_l3_thread_decorator_thread_check().
- Renamed:
bli_malloc_align() -> bli_fmalloc_align()
bli_free_align() -> bli_ffree_align()
bli_malloc_noalign() -> bli_fmalloc_noalign()
bli_free_noalign() -> bli_ffree_noalign()
The 'f' is for "function" since they each take a malloc_ft or free_ft
function pointer argument.
- Inserted various printf() calls for the purposes of tracing memory
allocation and freeing, guarded by cpp macro ENABLE_MEM_DEBUG, which,
for now, is intended to be a "hidden" feature rather than one hooked
up to a configure-time option.
- Defined bli_rntm_equals(), which compares two rntm_t for equality.
(There are no use cases for this function yet, but there may be soon.)
- Whitespace changes to function parameter lists in bli_pool.c, .h.
Details:
- Updated the API and semantics of packm kernels such that they must now
handle edge cases, meaning that a c-by-k packm kernel must be able to
pack edge cases that are fewer than c rows/columns and be able to
zero-fill the remaining elements. They must also be able to zero-fill
the equivalent region when copying fewer than k columns/rows (which is
needed by trsm). The new packm kernel API is generally:
void packm_kernel
(
conj_t conja,
dim_t cdim,
dim_t n,
dim_t n_max,
ctype* restrict kappa,
ctype* restrict a, inc_t inca, inc_t lda,
ctype* restrict p, inc_t ldp,
cntx_t* restrict cntx
);
where cdim and n are the dimensions (short and long, respectively) of
the submatrix being copied from the source matrix A, and n_max is the
"full" long dimension (corresponding to the k dimension in gemm) of
the micropanel. The "full" short dimension (corresponding to the
register blocksize MR or NR) is not part of the API because it is
known intrinsically by the packm kernel implementation. Thanks to
Devin Matthews for prompting us to make this change (#282).
- Updated all reference packm kernels in ref_kernels/1m according to
above changes, as well as all optimized packm kernels (which only
consisted of those for knl).
- Bumped the major soname version number in 'so_version' to 2. At first
I was considering leaving it unchanged, but I couldn't escape the
reality that the packm kernel API is much closer to an expert API
than it is some obscure helper function interface within the framework
that nobody would ever notice.
- Removed reference packm kernels for mr/nr = 30. The only sub-config
that would have been using those kernels is knc, which is likely no
longer being used by very many people (if any). (This also mostly
offset the larger object code footprint incurred by moving the edge-
case handling into the individual packm kernels.)
- Fixed an obscure race condition for 3mh and 4mh induced methods in
which those implementations were modifying the contexts stored in the
gks rather than a local copy.
- Fixed a minor bug in the testsuite that prevented non-1m-based induced
method implementations of trsm from executing.
Details:
- Disabled (commented out) the arm32 and arm64 configuration families
in the config_registry file. Having a configuration family registered
only makes sense if BLIS is currently outfitted with runtime hardware
detection logic to choose the appropriate sub-configuration. That
logic is currently missing for ARM architectures, and thus having the
ARM configuration families in the configuration registry only serves
to confuse people. Thanks to Devangi Parikh for suggesting this
change.
Details:
- Parameterized, reorganized, and added comments to matlab scripts in
test/mixeddt/matlab.
- Reordered some lines of code and added comments to plot_l3_perf.m in
test/3m4m/matlab.
Details:
- Updated 3m4m and mixeddt Makefiles and runme.sh scripts, mostly to
port recent changes to the former to the latter.
- Disabled (for now) code in 3m4m/test_*.c files that disables all
induced methods except for the one that is requested from the
Makefile via the IND macro. This is done because usually, we want to
test whatever method is enabled automatically for complex datatypes.
(That is, when native complex microkernels are missing, we usually
want to test performance of 1m.)
Details:
- Removed explicit reference to The University of Texas at Austin in the
third clause of the license comment blocks of all relevant files and
replaced it with a more all-encompassing "copyright holder(s)".
- Removed duplicate words ("derived") from a few kernels' license
comment blocks.
- Homogenized license comment block in kernels/zen/3/bli_gemm_small.c
with format of all other comment blocks.
Details:
- Lifted the constraint that 1m only be used when all operands' storage
datatypes (along with the computation datatype) are equal. Now, 1m may
be used as long as all operands are stored in the complex domain. This
change largely consisted of adding the ability to pack to 1e and 1r
formats from one precision to another. It also required adding logic
for handling complex values of alpha to bli_packm_blk_var1_md()
(similar to the logic in bli_packm_blk_var1()).
- Fixed a bug in several virtual microkernels (bli_gemm_md_c2r_ref.c,
bli_gemm1m_ref.c, and bli_gemmtrsm1m_ref.c) that resulted in the wrong
ukernel output preference field being read. Previously, the preference
for the native complex ukernel was being read instead of the pref for
the native real domain ukernel. This bug would not manifest if the
preference for the native complex ukernel happened to be equal to that
of the native real ukernel.
- Added support for testing mixed-precision 1m execution via the gemm
module of the testsuite.
- Tweaked/simplified bli_gemm_front() and bli_gemm_md.c so that pack
schemas are always read from the context, rather than trying to
sometimes embed them directly to the A and B objects. (They are still
embedded, but now uniformly only after reading the schemas from the
context.)
- Redefined cpp macro bli_l3_ind_recast_1m_params() as a static function
and renamed to bli_gemm_ind_recast_1m_params() (since gemm is the only
consumer).
- Added 1m optimization logic (via bli_gemm_ind_recast_1m_params()) to
bli_gemm_ker_var2_md().
- Added explicit handling for beta == 1 and beta == 0 in the reference
gemm1m virtual microkernel in ref_kernels/ind/bli_gemm1m_ref.c.
- Rewrote various level-0 macro defs, including axpyris, axpbyris,
scal2ris, and xpbyris (and their conjugating counterparts) to
explicitly support three operand types and updated invocations to
xpbyris in bli_gemmtrsm1m_ref.c.
- Query and use the storage datatype of the packed object instead of the
storage datatype of the source object in bli_packm_blk_var1().
- Relocated and renamed frame/ind/misc/bli_l3_ind_opt.h to
frame/3/gemm/ind/bli_gemm_ind_opt.h.
- Various whitespace/comment updates.