Details:
- Rewrote code that selects the compiler for the purposes of compiling
the auto-detection executable. CC (if specified) is tried first. Then
gcc. Then clang. The absolute fallback is cc. The previous code was
sort of broken, and seemed to unintentionally always use gcc.
- Moved various configuration-agnostic flags from config/*/make_defs.mk
files to common.mk. The new mechanism appends the configuration-
agnostic flags to the various compiler flag variables initialized in
make_defs.mk. Flags specific to the sub-configuration are still set
in make_defs.mk.
- Added -Wno-tautological-compare to CMISCFLAGS when clang is in use.
Also added the flag to the compiler instantiation during configure-
time hardware detection (when clang is selected).
- Added some missing (but mostly-optional) quotes to configure script.
Details:
- Added "-std=c99" to compiler arguments when building auto-detection
driver in configure script.
- Added #include <stdint.h> to all three source files needed by auto-
detection program.
Details:
- Reimplemented the hardware detection functionality invoked when running
"./configure auto". Previously, a standalone script in build/auto-detect
that used CPUID was used. However, the script attempted to enumerate all
models for each microarchitecture supported. The new approach recycles
the same code used for runtime hardware detection introduced in 2c51356.
This has two immediate benefits. First, it reduces and consolidates the
code required to detect microarchitectures via the CPUID instruction.
Second, it provides an indirect way of testing at configure-time the
code that is used to detect hardware at runtime. This code is (a) only
activated when targeting a configuration family (such as intel64 or
amd64) at configure-time and (b) somewhat difficult to test in
practice, since it relies on having access to older microarchitectures.
- The above change required placing conditional cpp macro blocks in
bli_arch.c and bli_cpuid.c which either #include "blis.h" or #include
a bare-bones set of headers that does not rely on the presence of a
bli_config.h header. This is needed because bli_config.h has not been
created yet when configure-time auto-detection takes place.
- Defined a new function in bli_arch.c, bli_arch_string(), which takes
an arch_t id and returns a pointer to a string that contains the
lowercase name of the corresponding microarchitecture. This function
is used by the auto-detection script to printf() the name of the
sub-configuration corresponding to the detected hardware.
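
As a rough illustration of the id-to-name lookup described for bli_arch_string(), here is a minimal sketch. The enum values, the table, and the name arch_string() are hypothetical stand-ins, not the actual contents of bli_arch.c, and returning NULL for an out-of-range id is an assumption of this sketch.

```c
#include <stddef.h>

/* Hypothetical subset of architecture ids; the real arch_t differs. */
typedef enum
{
    ARCH_HASWELL = 0,
    ARCH_SANDYBRIDGE,
    ARCH_ZEN,
    ARCH_GENERIC,
    NUM_ARCHS /* sentinel: number of valid ids */
} arch_id_t;

/* Lowercase names, indexed by arch_id_t. */
static const char* arch_names[ NUM_ARCHS ] =
{
    "haswell",
    "sandybridge",
    "zen",
    "generic"
};

/* Return the lowercase name for id, or NULL if id is out of range
   (the NULL return is an assumption of this sketch). */
const char* arch_string( arch_id_t id )
{
    if ( (int)id < 0 || (int)id >= (int)NUM_ARCHS ) return NULL;
    return arch_names[ id ];
}
```

A detection driver could then pass the detected id to arch_string() and printf() the result as the sub-configuration name.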
Details:
- Added a new configure option, --[en|dis]able-packbuf-pools, which will
enable or disable the use of internal memory pools for managing buffers
used for packing. When disabled, the function specified by the cpp
macro BLIS_MALLOC_POOL is called whenever a packing buffer is needed
(and BLIS_FREE_POOL is called when the buffer is ready to be released,
usually at the end of a loop). When enabled, which was the status quo
prior to this commit, a memory pool data structure is created and
managed to provide threads with packing buffers. The memory pool
minimizes calls to bli_malloc_pool() (i.e., the wrapper that calls
BLIS_MALLOC_POOL), but does so through a somewhat more complex
mechanism that may incur additional overhead in some (but not all)
situations. The new option defaults to --enable-packbuf-pools.
- Removed the reinitialization of the memory pools from the level-3
front-ends and replaced it with automatic reinitialization within the
pool API's implementation. This required an extra argument to
bli_pool_checkout_block() in the form of a requested size, but hides
the complexity entirely from BLIS. And since bli_pool_checkout_block()
is only ever called within a critical section, this change fixes a
potential race condition in which threads using contexts with different
cache blocksizes--most likely a heterogeneous environment--can check
out pool blocks that are too small for the submatrices they wish to
pack. Thanks to Nisanth Padinharepatt for reporting this potential
issue.
- Removed several functions in light of the relocation of pool reinit,
including bli_membrk_reinit_pools(), bli_memsys_reinit(),
bli_pool_reinit_if(), and bli_check_requested_block_size_for_pool().
- Updated the testsuite to print whether the memory pools are enabled or
disabled.
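
The idea of folding reinitialization into the checkout call can be sketched as follows. This is a simplified, hypothetical pool (pool_t, pool_init(), pool_checkout_block(), and so on are illustrative names and do not mirror the real pool API). It assumes the caller already holds the lock guarding the pool and that no blocks are checked out when a reinitialization is triggered.

```c
#include <stdlib.h>

/* Hypothetical, simplified pool: a stack of equally sized blocks. */
typedef struct
{
    void** blocks;     /* array of block pointers */
    size_t num_blocks; /* total blocks owned by the pool */
    size_t top;        /* index of the next available block */
    size_t block_size; /* size in bytes of every block */
} pool_t;

/* Free all blocks and reset the pool to an empty state. */
void pool_finalize( pool_t* pool )
{
    for ( size_t i = 0; i < pool->num_blocks; ++i ) free( pool->blocks[ i ] );
    free( pool->blocks );
    pool->blocks     = NULL;
    pool->num_blocks = 0;
    pool->top        = 0;
    pool->block_size = 0;
}

/* Allocate num_blocks (> 0) blocks of block_size bytes. Returns 0 on success. */
int pool_init( pool_t* pool, size_t num_blocks, size_t block_size )
{
    pool->blocks     = calloc( num_blocks, sizeof( void* ) );
    pool->num_blocks = ( pool->blocks != NULL ? num_blocks : 0 );
    pool->top        = 0;
    pool->block_size = block_size;
    if ( pool->blocks == NULL ) return -1;

    for ( size_t i = 0; i < num_blocks; ++i )
    {
        pool->blocks[ i ] = malloc( block_size );
        if ( pool->blocks[ i ] == NULL ) { pool_finalize( pool ); return -1; }
    }
    return 0;
}

/* Check out a block of at least req_size bytes. If the pool's blocks are
   too small, the pool is rebuilt with the larger size before checkout.
   Assumes the caller is inside a critical section. */
void* pool_checkout_block( pool_t* pool, size_t req_size )
{
    if ( req_size > pool->block_size )
    {
        size_t n = pool->num_blocks;
        pool_finalize( pool );
        if ( pool_init( pool, n, req_size ) != 0 ) return NULL;
    }
    if ( pool->top == pool->num_blocks ) return NULL; /* pool exhausted */
    return pool->blocks[ pool->top++ ];
}

/* Return a previously checked-out block to the pool. */
void pool_checkin_block( pool_t* pool, void* block )
{
    if ( pool->top == 0 ) return; /* nothing is checked out */
    pool->blocks[ --pool->top ] = block;
}
```

Because the size check happens inside the checkout call, a caller never has to remember to reinitialize the pool before requesting a larger block.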
Details:
- Modified flatten-headers.py to work with python 3.x. This mostly
amounted to removing print statements (which I replaced with calls
to my_print(), a wrapper to sys.stdout.write()). Thanks to Stefan
Husmann for pointing out the script's incompatibility with python 3.
- Other minor changes/cleanups.
Details:
- Updated flatten-headers.py to pre-compile the main regular expression
used to isolate #include directives and the header filenames they
reference. The compiled regex object is then used over and over on
each header file in the tree of referenced headers. This appears to
have provided a 1.7-2x performance increase in the best case.
- Other minor tweaks, such as renaming the main recursive function from
replace_pass() to flatten_header().
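
The script itself is Python, but the compile-once, match-many idea carries over to other languages. The sketch below shows it in C using POSIX regcomp()/regexec(). The pattern and the helper names (include_regex_init(), match_include()) are assumptions for illustration and are not taken from flatten-headers.py.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Matches lines such as:  #include "foo.h"
   Capture group 1 is the header filename. */
static const char* include_pattern =
    "^[[:space:]]*#[[:space:]]*include[[:space:]]+\"([^\"]+)\"";

/* Compile the pattern once, up front. Returns 0 on success. */
int include_regex_init( regex_t* re )
{
    return regcomp( re, include_pattern, REG_EXTENDED );
}

/* Reuse the compiled regex on each line. If line is an #include "..."
   directive, copy the header name into out (truncating to out_len - 1
   characters) and return 1; otherwise return 0. */
int match_include( const regex_t* re, const char* line,
                   char* out, size_t out_len )
{
    regmatch_t m[ 2 ];

    if ( out_len == 0 ) return 0;
    if ( regexec( re, line, 2, m, 0 ) != 0 ) return 0;

    size_t len = (size_t)( m[ 1 ].rm_eo - m[ 1 ].rm_so );
    if ( len >= out_len ) len = out_len - 1;

    memcpy( out, line + m[ 1 ].rm_so, len );
    out[ len ] = '\0';
    return 1;
}
```

The caller compiles once with include_regex_init(), calls match_include() for every line of every header, and releases the compiled pattern with regfree() at the end.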
Details:
- Added flatten-headers.py, a python implementation of the bash script
flatten-headers.sh. The new script appears to be 25-100x faster,
depending on the operating system, filesystem, etc. The python script
abides by the same command line interface as its predecessor and
targets python 2.7 or later. (Thanks to Devin Matthews for suggesting
that I look into a python replacement for higher performance.)
- Activated use of flatten-headers.py in common.mk via the FLATTEN_H
variable.
- Made minor tweaks to flatten-headers.sh such as spelling corrections
in comments.
Details:
- Added an allowance for OS X builds that run the testsuite to fail.
There seems to be an issue with 1m when running in Travis CI under
OS X and clang, but only in double-precision. Haven't been able to
reproduce the error on my own, and thus, I can't debug it. (Hopefully
it is simply a version-specific compiler bug.)
Details:
- Fixed a makefile error encountered when building the testsuite directly
in its directory (as opposed to indirectly via 'make test'). The fix
involves introducing a new variable, BUILD_PATH, alongside the existing
DIST_PATH variable. By default, BUILD_PATH is set to the current
directory, and is overridden by other Makefiles used by, for example,
the testsuite and standalone test drivers in testsuite or test,
respectively.
- Some files/directories in common.mk were redefined in terms of
BUILD_PATH, such as the locations of the config.mk file and the
intermediate include directory.
Details:
- Found the likely cause of the Travis CI out-of-tree build failures:
config.mk was being read from DIST_PATH, rather than the current
directory.
Details:
- Defined the SHELL variable in common.mk as "/bin/bash" so that the
-n option can be used with echo in the Makefile rule for flattening
blis.h. Thanks to Devin Matthews for suggesting this fix.
Details:
- Fixed a mistake (hopefully) in d0c4dd0 that resulted in many more
osx/clang sub-tests than intended.
- Shortened the variable names in an effort to make them more readable
via the Travis CI web interface.
Details:
- Added 'pwd' commands to the script portion of the .travis.yml file in
an attempt to uncover the problem with the recent out-of-tree build
testing changes made in d0c4dd0.
Details:
- Fixed a race condition in self-initialization whereby the bli_is_init
static variable could be erroneously read as TRUE by thread 1 while
thread 0 is still executing bli_init_apis(), thus allowing thread 1 to
use the library before it is actually ready. Thanks to Minh Quan Ho
and Devin Matthews for pointing out this issue.
- Part of the solution to the aforementioned race condition involved
replacing the runtime initialization of the global scalar constants
(e.g., BLIS_ONE, BLIS_ZERO, etc.) in bli_const.c with a static
initialization of those same constants. This eliminates the need for
bli_const_init() altogether. (The static initialization is made concise
via preprocessor macros.)
- Defined bli_gks_query_cntx_noinit(), which behaves just like
bli_gks_query_cntx(), except that it does not call bli_init_once(). This
function is called in lieu of bli_gks_query_cntx() in bli_ind_init() and
bli_memsys_init() so as to not result in any recursion into
bli_init_once().
- Removed BLIS_ONE_HALF, BLIS_MINUS_ONE_HALF global scalar constants.
They have no use in BLIS or its test products, and we have little reason
to believe they are used by others.
- Removed testsuite/out file, which was accidentally committed as part
of 70640a3.
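
To illustrate why static initialization removes the need for a runtime init function, here is a minimal sketch. The const_obj_t struct, the CONST_OBJ_INIT macro, and the constant names are hypothetical simplifications and do not reflect the actual object layout in bli_const.c.

```c
/* Hypothetical, simplified stand-in for a global scalar constant. */
typedef struct
{
    float  s; /* single-precision value */
    double d; /* double-precision value */
    int    i; /* integer value */
} const_obj_t;

/* Expands to a static initializer that fills every datatype field. */
#define CONST_OBJ_INIT( val ) { (float)(val), (double)(val), (int)(val) }

/* These are initialized before any code runs, so no thread can observe
   them in a partially initialized state. */
const const_obj_t CONST_ONE       = CONST_OBJ_INIT(  1 );
const const_obj_t CONST_ZERO      = CONST_OBJ_INIT(  0 );
const const_obj_t CONST_MINUS_ONE = CONST_OBJ_INIT( -1 );
```

With the values fixed at load time, there is nothing left for an init function to set up, and therefore no window in which another thread could read them too early.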
Details:
- Added "temp_dir" argument to flatten-headers.sh so that the caller can
specify where intermediate files should be created as the script runs.
- Updated flatten-headers.sh to create intermediate files in temp_dir
instead of alongside the corresponding source files. This should now
(once again) allow out-of-tree builds where the BLIS distribution is
read-only, or where the out-of-tree build is running concurrently with
another out-of-tree build. (Thanks to Devin Matthews for pointing out
the possibility of simultaneous out-of-tree builds.)
Details:
- Modified .travis.yml file to include an out-of-tree build test (using
the "auto" configure target). Thanks to Devin Matthews for this
suggestion.
Details:
- Fix applied in 87978f6 was necessary but not sufficient to fix
out-of-tree builds. It turns out that using a source tree that had
already built the target erroneously gave the impression that
out-of-tree builds were working again, when in fact they were still
broken. The additional changes in this commit should complete the
fix that was started in the aforementioned commit. Thanks to Devin
Matthews and Shaden Smith for their help in isolating this issue.
Details:
- Defined two new functions in bli_init.c: bli_init_once() and
bli_finalize_once(). Each is implemented with pthread_once(), which
guarantees that, among the threads that pass in the same pthread_once_t
data structure, exactly one thread will execute a user-defined function.
(Thus, there is now a runtime dependency against libpthread even when
multithreading is not enabled at configure-time.)
- Added calls to bli_init_once() to top-level user APIs for all
computational operations as well as many other functions in BLIS to
all but guarantee that BLIS will self-initialize through the normal
use of its functions.
- Rewrote and simplified bli_init() and bli_finalize() and related
functions.
- Added -lpthread to LDFLAGS in common.mk.
- Modified the bli_init_auto()/_finalize_auto() functions used by the
BLAS compatibility layer to take and return no arguments. (The
previous API that tracked whether BLIS was initialized, and then
only finalized if it was initialized in the same function, was too
cute by half and borderline useless because by default BLIS stays
initialized when auto-initialized via the compatibility layer.)
- Removed static variables that track initialization of the sub-APIs in
bli_const.c, bli_error.c, bli_init.c, bli_memsys.c, bli_thread, and
bli_ind.c. We don't need to track initialization at the sub-API level,
especially now that BLIS can self-initialize.
- Added a critical section around the changing of the error checking
level in bli_error.c.
- Deprecated bli_ind_oper_has_avail() as well as all functions
bli_<opname>_ind_get_avail(), where <opname> is a level-3 operation
name. These functions had no use cases within BLIS and likely none
outside of BLIS.
- Commented out calls to bli_init() and bli_finalize() in testsuite's
main() function, and likewise for standalone test drivers in 'test'
directory, so that self-initialization is exercised by default.
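
The self-initialization pattern built on pthread_once() can be sketched as below. The names lib_init_apis(), lib_init_once(), and lib_compute() are hypothetical stand-ins for the corresponding BLIS functions, and the counter exists only to make the run-once behavior observable.

```c
#include <pthread.h>

static pthread_once_t init_once_control = PTHREAD_ONCE_INIT;
static int            init_count        = 0; /* times the real init has run */

/* The actual initialization work (hypothetical stand-in). */
static void lib_init_apis( void )
{
    ++init_count;
}

/* Safe to call from any thread, any number of times: pthread_once()
   runs lib_init_apis() exactly once for this control object. */
void lib_init_once( void )
{
    pthread_once( &init_once_control, lib_init_apis );
}

/* A user-facing operation that self-initializes before doing work. */
int lib_compute( int x )
{
    lib_init_once();
    return 2 * x;
}

/* Report how many times initialization actually ran. */
int lib_init_count( void )
{
    return init_count;
}
```

Callers that return from pthread_once() are guaranteed that the init routine has completed, which is what lets every top-level operation call lib_init_once() without extra locking.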
Details:
- In bli_cpuid.c, fixed an off-by-one indexing statement in vpu_count()
whereby a string-terminating null character, '\0', is written beyond
the bounds of the model_num string.
- Minor whitespace and formatting edits to bli_cpuid.c.
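
The source does not show the offending statement in vpu_count(), so the following is only a generic sketch of the bounds rule involved: for a buffer of size n, the terminator may be written at index n - 1 at most. The function name copy_model_num() is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Copy at most dst_size - 1 characters of src into dst and always
   terminate the result. Returns the number of characters copied. */
size_t copy_model_num( char* dst, size_t dst_size, const char* src )
{
    if ( dst_size == 0 ) return 0;

    size_t len = strlen( src );
    if ( len > dst_size - 1 ) len = dst_size - 1;

    memcpy( dst, src, len );

    /* len <= dst_size - 1, so this write stays in bounds. Writing to
       dst[ dst_size ] instead would be the off-by-one error. */
    dst[ len ] = '\0';

    return len;
}
```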
Details:
- Added missing $(DIST_PATH)/ prefix to relative path to flatten-headers.sh
script in common.mk so that the script could be found during out-of-tree
builds. Thanks to Devin Matthews for reporting this bug.
Details:
- Fixed implicit typecasting of conj_t to trans_t in bli_[un]packm_cxk.c.
- Properly typecast integer arguments to match format specifier in various
calls to printf() in bli_l3_thrinfo.c, bli_cntx.c, bli_pool.c, and
bli_util_oapi.c.
- Fixed "unsigned less-than-comparison with zero" checks in bli_check.c,
bli_cntx.h.
- Fixed mis-typed enums in bli_cntx.c (e.g., l1mkr_t that should have been
l1fkr_t or l1vkr_t).
- Fixed instances of opid_t value BLIS_GEMM that should have been l3ukr_t
value BLIS_GEMM_UKR in bli_cntx_ref.c.
- NOTE: These issues were identified via compiler warnings when building
BLIS with clang on a rather old installation of OS X:
$ clang --version
Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn)
Target: x86_64-apple-darwin15.2.0
Thread model: posix
Details:
- Removed the vast majority of directories named "old", which contained
deprecated code that I wasn't quite ready to jettison from the source
tree.
Details:
- Changed the interface of bli_getopt() to take a new argument, a getopt_t
struct, that stores the values of optarg, optind, opterr, and optopt,
and updated the implementation accordingly. (Previously, these
variables were assumed to be global.)
- Added a function for initializing a getopt_t struct.
- Changed test_libblis.c--currently the only consumer of bli_getopt()--to
utilize the new getopt_t state object.
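
A minimal sketch of a getopt-style parser that keeps its state in a caller-provided struct rather than in globals is shown below. The struct and function names (getopt_state_t, getopt_state_init(), getopt_r()) are hypothetical, and the parser is deliberately simplified: it handles one option per argv element and does not support clustered flags such as "-ab".

```c
#include <stddef.h>
#include <string.h>

/* Parser state that would otherwise live in global variables. */
typedef struct
{
    const char* optarg; /* argument of the last option, if any */
    int         optind; /* index of the next argv element to examine */
    int         opterr; /* error-reporting flag (unused in this sketch) */
    int         optopt; /* the last option character examined */
} getopt_state_t;

/* Prepare a state object for a fresh pass over argv. */
void getopt_state_init( getopt_state_t* st )
{
    st->optarg = NULL;
    st->optind = 1;
    st->opterr = 0;
    st->optopt = 0;
}

/* Parse the next "-x", "-x value", or "-xvalue" option. Returns the
   option character, '?' for an unknown option or a missing argument,
   and -1 when no options remain. */
int getopt_r( int argc, char** argv, const char* optstring,
              getopt_state_t* st )
{
    st->optarg = NULL;

    if ( st->optind >= argc ) return -1;

    const char* arg = argv[ st->optind ];
    if ( arg[ 0 ] != '-' || arg[ 1 ] == '\0' ) return -1;

    st->optopt  = (unsigned char)arg[ 1 ];
    st->optind += 1;

    if ( arg[ 1 ] == ':' ) return '?';

    const char* spec = strchr( optstring, arg[ 1 ] );
    if ( spec == NULL ) return '?';

    if ( spec[ 1 ] == ':' ) /* this option takes an argument */
    {
        if      ( arg[ 2 ] != '\0' )  st->optarg = arg + 2;
        else if ( st->optind < argc ) st->optarg = argv[ st->optind++ ];
        else                          return '?';
    }

    return st->optopt;
}
```

Since all state lives in the struct, two independent parses (or two threads) can each use their own getopt_state_t without interfering with one another.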
Details:
- Reimplemented several sets of get/set-style preprocessor macros with
static functions, including those in the following frame/base headers:
auxinfo, cntl, mbool, mem, membrk, opid, and pool. A few headers in
frame/thread were touched as well: mutex_*, thrcomm, and thrinfo.
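
The macro-to-function conversion can be pictured with the sketch below. The mem_t layout and accessor names are hypothetical, and the use of "static inline" (rather than plain "static") is an assumption of this sketch.

```c
#include <stddef.h>

/* Hypothetical, simplified memory descriptor. */
typedef struct
{
    void*  buf;
    size_t size;
} mem_t;

/* Macro style (before): arguments are not type-checked.

   #define mem_size( mem_p )             ( (mem_p)->size )
   #define mem_set_size( size0, mem_p )  { (mem_p)->size = (size0); }
*/

/* Static function style (after): the compiler checks argument types,
   and each argument is evaluated exactly once. */
static inline size_t mem_size( const mem_t* mem )
{
    return mem->size;
}

static inline void mem_set_size( size_t size, mem_t* mem )
{
    mem->size = size;
}

static inline int mem_is_alloc( const mem_t* mem )
{
    return mem->buf != NULL;
}
```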
Details:
- Expanded checking of the arch_t id in bli_gks.c--either passed in from
the caller or as returned from bli_arch_query_id()--against the expected
range of id values. Thanks to Devangi Parikh for suggesting these
additional sanity checks.
Details:
- Defined a new 'uninstall-old-headers' target that allows users of BLIS to
uninstall no-longer-needed headers left over from previous installations.
- Fixed the 'uninstall-old' target so that it will uninstall both .a and .so
libraries.
- Renamed 'uninstall-old' to 'uninstall-old-libs'.
- Added 'uninstall-old' target (different from previous 'uninstall-old'
target) that combines 'uninstall-old-libs' and 'uninstall-old-headers'.
Details:
- When CBLAS is enabled at configure-time, BLIS now creates a monolithic
cblas.h using the same flatten-header.sh script that was recently
introduced for creating monolithic blis.h header files. The top-level
Makefile will also install this cblas.h file into the install prefix
alongside blis.h when the 'install' target is invoked. The two header
files are compatible with one another. Regardless of whether the user's
source #includes cblas.h, both blis.h and cblas.h, or just blis.h,
the user will get the CBLAS function prototypes and enums, as expected.
Details:
- Fixed a subtle bug in bli_cntx_get_[un]packm_ker_dt() in which the
function fails to return NULL when passed a kernel id argument that is
equal to or beyond BLIS_NUM_[UN]PACKM_KERS. Instead, the function was
attempting to index into the cntx_t's packm kernel array, which resulted
in undefined behavior. Thanks to Devangi Parikh for finding this bug.
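
The shape of the fix can be sketched as a range check performed before indexing. The types and names here (cntx_sketch_t, NUM_PACKM_KERS, cntx_get_packm_ker(), example_packm_ker()) are hypothetical simplifications of the real context structure and query function.

```c
#include <stddef.h>

/* Hypothetical kernel function pointer type. */
typedef void (*packm_ker_ft)( void );

enum { NUM_PACKM_KERS = 4 }; /* hypothetical kernel count */

/* Hypothetical, simplified context holding a kernel table. */
typedef struct
{
    packm_ker_ft packm_kers[ NUM_PACKM_KERS ];
} cntx_sketch_t;

/* Placeholder kernel used only to populate the table in examples. */
void example_packm_ker( void )
{
}

/* Return the kernel registered for ker_id, or NULL if ker_id is equal
   to or beyond the kernel count. The check must come before the array
   access; indexing first is undefined behavior for such ids. */
packm_ker_ft cntx_get_packm_ker( const cntx_sketch_t* cntx,
                                 unsigned int ker_id )
{
    if ( ker_id >= (unsigned int)NUM_PACKM_KERS ) return NULL;

    return cntx->packm_kers[ ker_id ];
}
```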
Details:
- Rewrote monolithify-header.sh (and renamed to flatten-header.sh) so that
headers are inserted recursively. This improves performance by a factor
of 3-4x.
- Modified configure to create an 'include/<configname>' directory in which
make can create a monolithic header.
- Modified the top-level Makefile so that a monolithic header is generated
unconditionally prior to compilation (stored in include/<configname>) and
so that the single header is installed instead of the 450 or so header
files that reside throughout the framework source tree.
- Added "include/*/*.h" to .gitignore file.
- Removed some pnacl/emscripten leftovers that I intended to include in
a1caeba (mostly in testsuite/Makefile).
- Trivial comment changes to frame/include/bli_f2c.h.
Details:
- Added support for the x86_64 configuration family to bli_arch.c and
bli_arch_config.h. Thanks to Johannes Dieterich for reporting this
issue.
- Bumped the default value for BLIS_SIMD_NUM_REGISTERS from 16 to 32 and
the default value for BLIS_SIMD_SIZE from 32 to 64. This will support
configuration families that include Skylake and newer processors without
any support needed in the bli_family_*.h file. The semantics of these
values have always been "maximum" and not exact values; comments in
bli_kernel_macro_defs.h and the github wiki have been adjusted
accordingly.
Details:
- Erroneously placed the "don't overwrite existing blocksize" logic in
bli_blksz_init*() rather than in bli_cntx_set_blkszs(). It belongs in
the latter because that function copies blocksizes as-is from the
blksz_t function argument to the appropriate field in the cntx_t. If
the blksz_t was previously initialized selectively, based on the sign
of the blocksize value passed into bli_blksz_init*(), that just leaves
some fields possibly uninitialized (with garbage values), which
definitely will not work.
- The aforementioned logic has been moved to bli_cntx_set_blkszs() via
a new function bli_blksz_copy_if_pos(), which selectively copies only
the blocksizes that are greater than zero.
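
The selective copy described above can be sketched as follows. The struct layout (fields v and e, four datatypes) and the name blksz_copy_if_pos() applied to this simplified type are assumptions for illustration, not the actual blksz_t definition.

```c
#define NUM_DATATYPES 4 /* assumed: s, d, c, z */

/* Hypothetical, simplified blocksize object. */
typedef struct
{
    long v[ NUM_DATATYPES ]; /* default blocksize per datatype */
    long e[ NUM_DATATYPES ]; /* maximum blocksize per datatype */
} blksz_sketch_t;

/* Copy only the strictly positive values from src into dst. Zero or
   negative entries in src leave the corresponding dst entry untouched. */
void blksz_copy_if_pos( const blksz_sketch_t* src, blksz_sketch_t* dst )
{
    for ( int dt = 0; dt < NUM_DATATYPES; ++dt )
    {
        if ( src->v[ dt ] > 0 ) dst->v[ dt ] = src->v[ dt ];
        if ( src->e[ dt ] > 0 ) dst->e[ dt ] = src->e[ dt ];
    }
}
```

Under this scheme the source object is always fully initialized (non-positive entries are stored as-is), and the decision to skip them is made only at copy time, which avoids leaving any field with a garbage value.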
Details:
- Employ the new semantics of bli_blksz_init*() in e31f0b3 in various
sub-configurations' bli_cntx_init_*() functions by passing in 0 for
register and cache blocksizes that correspond to gemm microkernel
datatypes that were not registered, allowing the default values
set by the bli_cntx_init_*_ref() function call to remain.
Details:
- Updated the semantics of bli_blksz_init() and bli_blksz_init_ed() so
that non-positive blocksize values are ignored entirely. This provides
an easy way to indicate that certain existing values should not be
touched by the update. Thanks to Devangi Parikh for feedback that led
to these changes.