Details:
- Execute build/flatten-headers.py python script via $(PYTHON) in
common.mk. This allows distributions that define the current/preferred
python interpreter in the PYTHON environment variable to use that
interpreter when executing flatten-headers.py. Thanks to Isuru
Fernando for this suggestion, and for Dave Love for submitting the
initial issue/request.
Details:
- Applied the following C preprocessor macro renames
BLIS_DEFAULT_MR_THREAD_MAX -> BLIS_THREAD_MAX_IR
BLIS_DEFAULT_NR_THREAD_MAX -> BLIS_THREAD_MAX_JR
BLIS_DEFAULT_M_THREAD_RATIO -> BLIS_THREAD_RATIO_M
BLIS_DEFAULT_N_THREAD_RATIO -> BLIS_THREAD_RATIO_N
in src/test_libblis.c. This is apparently the result of a failure by
git to properly merge the 'master' and 'amd' branches in the previous
commit. (The 'master' branch contained a commit, 53a9ab1, in which
these same cpp macros were renamed throughout the source distribution.
Details:
- Defined Fortran-77 compatible APIs for bli_thread_set_num_threads()
and bli_thread_set_ways(). These wrappers are defined in
frame/compat/blis/thread/b77_thread.c. Thanks to Kay Dewhurst for
suggesting these new interfaces.
- Added missing prototype for bli_thread_set_ways() in bli_thread.h and
removed prototypes for non-existent functions bli_thread_set_*_nt().
- CREDITS file update.
Details:
- Renamed the following C preprocessor macros whose fallback/default
values are specified within frame/include/bli_kernel_macro_defs.h:
BLIS_DEFAULT_MR_THREAD_MAX -> BLIS_THREAD_MAX_IR
BLIS_DEFAULT_NR_THREAD_MAX -> BLIS_THREAD_MAX_JR
BLIS_DEFAULT_M_THREAD_RATIO -> BLIS_THREAD_RATIO_M
BLIS_DEFAULT_N_THREAD_RATIO -> BLIS_THREAD_RATIO_N
- Renamed the above cpp macro overrides within the knl, skx, and zen
sub-configurations, as well as invocations of those macros in
bli_rntm.c.
- Moved config/zen/bli_kernel.h to an 'old' directory as it is no longer
used by any code within BLIS.
Details:
- Forgot to apply the column index range fix in 10f179f to situations
when "quiet" mode (-q) is requested. This commit applies the new
column index range modifications to the quiet case.
Details:
- Previously, trsm was consolidating all ways of parallelism into the jr
loop. This was unnecessary and to some degree detrimental on some
types of hardware. Now, any parallelism bound for the jc loop will be
applied to the jc loop, while all other loops' parallelism is funneled
to the jr loop. Thanks to Devangi Parikh for helping investigate this
issue and suggesting the fix.
- NOTE: This change affects only left-side trsm. However, currently
right-side trsm is currently implemented in terms of the left-side
case, and thus the change effectively applies to both left and right
cases.
Details:
- Updated testsuite to output various parameters related to parallelism
in BLIS. These parameters include:
- threading status: disabled, openmp, or pthreads;
- thread partitioning for jr/ir loops: slab or rr (round-robin);
- ways of parallelism from environment variables, and also actual
values used by gemm, herk, trmm_l, trmm_r, trsm_l, and trsm_r for
square problems (assuming all dimensions are set to 1000);
- automatic thread factorization parameters.
- Also output the status of two relatively new configure-time options:
libmemkind and the sandbox.
Details:
- Updated the irun.py script so that it updates the matlab column index
range (if found) to reflect the additional columns of data that are
substituted in. Thanks to Devangi Parikh for recognizing and reporting
this issue.
Details:
- Added inadvertantly-omitted mention of -r option-equivalent to
--thread-part-jrir to the output for 'configure --help'. Also made
minor edits to the same text.
Details:
- Updated existing macrokernel function names and definitions to
explicitly use slab assignment of micropanels to threads, then created
duplicate versions of macrokernels that explicitly use round-robin
assignment instead of slab. NOTE: As in ac18949, trsm_r macrokernels
were not substantially updated in this commit because they are
currently disabled in bli_trsm_front.c.
- Updated existing packing function (in blk_packm_blk_var1.c) to
explicitly use slab partitioning, and then duplicated for round-robin.
- Updated control tree initialization to use the appropriate macrokernel
and packm function pointers depending on which method (slab or rr) was
enabled at configure-time.
- Updated configure script to accept new --thread-part-jrir=[slab|rr]
option (-m [slab|rr] for short), which allows the user to explicitly
request either slab or round-robin assignment (partitioning) of
micropanels to threads.
- Updated sandbox/ref99 according to above changes.
- Minor updates to build/add-copyright.py.
Details:
- Removed a guard from bli_clock_min_diff() that would return 0 if the
time delta was greater than 60 minutes. This was originally intended
to disregard extremely large values under the assumption that the
user probably didn't intend to run a test that long. However, since
it is in bli_clock_min_diff(), it doesn't actually help short-circuit
an implementation that is hanging or looping infinitely, since such
an implementation would first have to finish before the
bli_clock_min_diff() is called. Thanks to Kiran Varaganti for
reporting this issue.
Details:
- gcc 7 introduced new behavior to the -dumpversion option whereby only
the major version component is output. However, as part of this
change, gcc 7 also introduced a new option, -dumpfullversion, which is
guaranteed to always output the major, minor, and revision numbers. If
we are using gcc 7 or later, we re-query the version string with this
new option and then re-parse the result so as to avoid misleading
output from configure (e.g. using gcc 7.3.0 is reported as 7.7.7).
Details:
- Use the modulo operator to limit the size of an integer that is given
to sprintf(). This avoids a warning in some versions of gcc about the
integer potentially overflowing the available space in the string into
which the integer is being printed.
Details:
- Replaced much of "Getting Started" section with a shortened version of
the bullet list of documentation currently shown in the github wiki
page. Thanks to Devangi Parikh for her feedback in this change.
Details:
- Adjusted the method by which micropanels are assigned to threads in
the 2nd (jr) and 1st (ir) loops around the microkernel to (mostly)
employ contiguous "slab" partitioning rather than interleaved (round
robin) partitioning. The new partitioning schemes and related details
for specific families of operations are listed below:
- gemm: slab partitioning.
- herk: slab partitioning for region corresponding to non-triangular
region of C; round robin partitioning for triangular region.
- trmm: slab partitioning for region corresponding to non-triangular
region of B; round robin partitioning for triangular region.
(NOTE: This affects both left- and right-side macrokernels:
trmm_ll, trmm_lu, trmm_rl, trmm_ru.)
- trsm: slab partitioning.
(NOTE: This only affects only left-side macrokernels trsm_ll,
trsm_lu; right-side macrokernels were not touched.)
Also note that the previous macrokernels were preserved inside of
the 'other' directory of each operation family directory (e.g.
frame/3/gemm/other, frame/3/herk/other, etc).
- Updated gemm macrokernel in sandbox/ref99 in light of above changes
and fixed a stale function pointer type in blx_gemm_int.c
(gemm_voft -> gemm_var_oft).
- Added standalone test drivers in test/3m4m for herk, trmm, and trsm
and minor changes to test/3m4m/Makefile.
- Updated the arguments and definitions of bli_*_get_next_[ab]_upanel()
and bli_trmm_?_?r_my_iter() macros defined in bli_l3_thrinfo.h.
- Renamed bli_thread_get_range*() APIs to bli_thread_range*().
Details:
- Updated the configure script to allow slashes in version string. This
is needed so that downstream maintainers (such as those for Debian)
can create local tags such as "upstream/0.4.1". Thanks to M. Zhou for
reporting this issue via PR #256 and providing me the information
needed to debug the problem.
Details:
- Removed the top-level 'debian' directory. This directory is apparently
no longer needed (issue #257). Thanks to M. Zhou and Nico Schlömer for
their contributions.
Details:
- Added irun.py script to 'build' directory. This irun.py script is a
python script for repeatedly invoking a test driver executable, such
as those found in test/3m4m, and replace the performance output column
with four columns that aggregate statistics. Specifically, the script
reports the minimum, average, maximum, and standard deviation for each
problem size. This script is useful especially (though not
exclusively) when trying to determine the impact of relatively minor
changes to the code, or other small optimizations that may be
difficult to distinguish from "noise." One way this "noise" manifests
is that a test executable may run slightly slower or faster for all
problem sizes (and all implementations) tested by the executable over
the life of a single execution. The cause of these minor
across-the-board pertubations in the overall performance signatures is
unknown, though we hypothesize that it may relate to any number of
issues such as operating system scheduling, where in memory the
program is loaded, or how the CPU clock frequency is throttled at the
time of execution. Regardless of the source of these subtle
performance anomalies, the statistical properties reported by the
irun.py script help the user to more precisely characterize the
underlying performance exhibited by any given test driver, which
allows him or her to make better judgments about the true difference
in performance between two implementations, or minor changes within a
single implementation.
Details:
- Corrected feedback echoed to user by configure when libmemkind is
found but not explicitly requested. In these cases, configure would
echo a message that it had received an explicit request to enable
libmemkind, which was not accurate, even if the end result was the
same--that libmemkind is enabled by default when it is found. Thanks
To Devangi Parikh for reporting this issue.
Details:
- Rewrote bli_winsys.c to define bli_setenv() and bli_sleep()
unconditionally, but differently for Windows and non-Windows, but
then disabled the definition of bli_setenv() entirely since BLIS
no longer needs to set environment variables. Updated bli_winsys.h
accordingly, and call bli_sleep() from within testsuite instead of
sleep() directly.
- Use
#if !defined(_POSIX_BARRIERS) || (_POSIX_BARRIERS != 200809L)
instead of
#if !defined(_POSIX_BARRIERS) || (_POSIX_BARRIERS < 0)
when guarding against local definition of pthread barrier in
testsuite. (The description for unistd.h implies that _POSIX_BARRIERS
should always be set to 200809L when barriers are supported, though I
won't be surprised if we encounter a case in the future where it is
set to something else such as 1 while still supported.)
- Removed old _VERS_CONF_INST definitions and installation rules in
top-level Makefile. These are no longer needed because we no longer
output libraries with the version and configuration name as
substrings.
- Comment/whitespace updates in Makefile, config.mk.in, common.mk,
configure, bli_extern_defs.h, and test_libblis.h.
- Added mention of 1m to README.md and other trivial tweaks.
Details:
- Added the new bli_dgemm_skx_asm_16x14.c microkernel from the skx-redux
branch, along with appropriate blocksizes in bli_cntx_init_skx.c and
a prototype in bli_kernels_skx.h. (Devin has not yet written the
sgemm analague, so for now we will continue using the older sgemm
ukernel.)
- Updated frame/include/bli_x86_asm_macros.h with a minor change that
was present within the skx-redux branch.
* Enable shared
* Enable rdp
* Add support for dll
* Use libblis-symbols.def
* Fix building dlls
* Fix libblis-symbols.def
* Fix soname
* Fix Makefile error
* Fix install target
* Fix missing symbols
* Add BLIS_MINUS_TWO
* Add path to dll
* Fix OSX soname
* Add declspec for dll
* Add -DBLIS_BUILD_DLL
* Replace @enable_shared@ in config
* switch to auto for now
* blis_ -> bli_
* Remove BLIS_BUILD_DLL in make check
* change auto->haswell
* enable_shared_01
* Add wno-macro-redefined
* print out.cblat3
* BLIS_BUILD_DLL -> BLIS_IS_BUILDING_LIBRARY
* Use V=1
* Remove fpic for windows
* Remember LIBPTHREAD
* Remove libm for windows
* Remember AR
* Fix remembering libpthread
* Add Wno-maybe-uninitialized in only gcc
* Don't do blastest for shared for now
* Fix install target
And remove unnecessary change
* test auto and x86_64
* Fix install target again
* Use IS_WIN variable
* Remove leading dot from LIBBLIS_SO_MAJ_EXT
* Make is_win yes/no
* Add comments for windows builds
* Change if else blocks location
Details:
- Use get-user-cflags-for() to generate cflags when compiling BLAS test
drivers and BLIS testsuite from top-level Makefile. Meant to include
these changes in previous commit (4b5437e). Thanks to Isuru Fernando
for pointing out this oversight.
Details:
- Tweaked the cflags functions in common.mk so that a new preprocessor
macro, BLIS_IS_BUILDING_LIBRARY, is defined, but only when BLIS
itself is being built. This macro will not be defined when, for
example, the testsuite or example code compiles code local to those
applications. This was done in part by defining a new cflags function
get-user-cflags-for(), which is now the designated function for
application Makefiles if they wish to inherit a basic set of CFLAGS
from BLIS. (The compiler flags returned are identical to that of
get-frame-cflags-for() except that -DBLIS_IS_BUILDING_LIBRARY is
omitted.)
- Updated all test driver-like makefiles to call get-user-cflags-for()
instead of get-frame-cflags-for().
Details:
- Added a new sub-configuration 'cortexa53', which is a mirror image
of cortexa57 except that it will use slightly different compiler
flags. Thanks to Mathieu Poumeyrol for making this suggestion after
discovering that the compiler flags being used by cortexa57 were
not working properly in certain OS X environments (the fix to which
is currently pending in pull request #245).