Details:
- Define a dummy bli_l3_thread_entry() function when multithreading is
disabled altogether, or enabled via OpenMP. This function was
originally necessary only when multithreading was enabled via pthreads.
By defining the function no matter the threading options given, it is
less likely that an AppVeyor Windows build will complain due to a
missing symbol in the DLL. (To be clear: AppVeyor was working fine
before, but a problem might have arisen had it been switched to an
OpenMP build.)
- Removed the prototype for bli_l3_thread_entry() from
bli_thrcomm_pthreads.c and placed it in bli_thrcomm.h.
- Regenerated the symbols list file build/libblis-symbols.def.
Details:
- Modified .travis.yml to automatically test the mixed-datatype support
of the gemm operation, with supporting changes to common.mk, the
top-level Makefile, and travis/do_testsuite.sh.
- Added a new pair of input files to testsuite directory with the
'.mixed' suffix (similar to those with the '.fast' suffix) for testing
mixed-datatype gemm.
- Updated docs/BuildSystem.md to document the new make targets
'testblis-md' and 'checkblis-md'.
Details:
- Fixed a bug in the code that prints out the communicator and work ids
from the various threads' thrinfo_t nodes. This bug manifested when
the dimension being parallelized was not large enough for every
thread to be assigned actual work (since the minimum amount of work is
determined by the register blocksize in the dimension being
parallelized). In those cases, the threads that receive no work in
that dimension do not finish building their thrinfo_t tree, leaving
lower-level nodes non-existent. (The bug itself was usually observed as
a segfault when the printing code attempted to dereference all the way
down the thrinfo_t tree.) The solution involves explicitly checking
each node as it is dereferenced, and if at any time NULL is found, all
subsequent communicator and work ids are set to -1.
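The NULL-checking walk described above can be sketched as follows. This is a minimal Python illustration, not the BLIS code: the Node class, its comm_id, work_id, and sub_node fields, and the collect_ids() helper are hypothetical stand-ins for thrinfo_t and its accessors.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for one thrinfo_t node."""
    comm_id: int
    work_id: int
    sub_node: Optional["Node"] = None


def collect_ids(root: Optional[Node], depth: int) -> List[Tuple[int, int]]:
    """Return (comm_id, work_id) for `depth` levels of the tree.

    Each node is checked before it is dereferenced. Once a missing (None)
    node is found, that level and every level below it report -1 for both
    ids instead of being dereferenced.
    """
    ids = []
    node = root
    for _ in range(depth):
        if node is None:
            ids.append((-1, -1))
        else:
            ids.append((node.comm_id, node.work_id))
            node = node.sub_node
    return ids
```

A thread whose tree was only partially built therefore prints real ids for the levels that exist and -1 for the rest, rather than crashing.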
Details:
- Added python version checking to configure script. (Recall that python
is needed to execute the flatten-headers.py script.) Minimum versions
of python needed are currently as follows:
python2: 2.7 or later
python3: 3.5 or later
The standard search order for python interpreters is:
python python3 python2
The PYTHON environment variable is also supported and will be checked
before the standard search order list.
- Updated BuildSystem.md to include: a minimum make version; a note
that the C compiler must actually be a C99 compiler; and the caveat
that Windows builds do not require pthreads since BLIS can provide
an implementation of pthreads internally.
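The interpreter search order and minimum-version rules above can be sketched in Python as follows. This is only an illustration of the stated policy, not the configure script itself (which is a shell script): the helper names candidate_interpreters() and version_ok() are hypothetical, and rejecting major versions other than 2 and 3 is an assumption.

```python
import os
from typing import List, Mapping, Optional, Tuple

# Minimum versions stated above: 2.7 for python2, 3.5 for python3.
MIN_VERSIONS = {2: (2, 7), 3: (3, 5)}

# Standard search order stated above.
SEARCH_ORDER = ["python", "python3", "python2"]


def candidate_interpreters(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """List interpreter names to try, with $PYTHON (if set) checked first."""
    env = os.environ if env is None else env
    preferred = env.get("PYTHON")
    names = [preferred] if preferred else []
    return names + SEARCH_ORDER


def version_ok(version: Tuple[int, int]) -> bool:
    """Check a (major, minor) pair against the minimum for its major version.

    Assumption: major versions other than 2 and 3 are rejected.
    """
    major, minor = version
    minimum = MIN_VERSIONS.get(major)
    return minimum is not None and (major, minor) >= minimum
```

A caller would try each candidate in order and accept the first one that is found and passes version_ok().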
Details:
- Defined a bli_pthread_*() API so that the testsuite, when being linked
against a Windows DLL, will be able to access pthreads functionality
without those pthreads functions being explicitly exported by the DLL.
Instead, we export the bli_pthread_*() layer, which uses types and
functions that are identical to pthreads, but adds a 'bli_' prefix.
Only a few basic functions are present in the bli_pthread_*() API
for now. Thanks to Devin Matthews and Isuru Fernando for their help
on a related PR (#261) that this commit will hopefully facilitate.
- Updated testsuite so that it calls bli_pthread_*() layer instead of
pthread_*() functions directly.
- Regenerated build/libblis-symbols.def.
- Updated a comment in build/regen-symbols.sh.
Details:
- Defined a new level-1d operation called 'shiftd', including object and
typed APIs. This operation adds a scalar value to every element along
an arbitrary diagonal of a matrix. Currently, shiftd is implemented in
terms of the addv kernel. (The scalar is passed in as the x vector
with an increment of zero.)
- Replaced ad-hoc usage of setd and addd (after creating a temporary
matrix object) with use of shiftd, which is much more concise, in
various test driver files in the testsuite. Similar changes were made
to the standalone test drivers and the example code.
- Added documentation entries in BLISObjectAPI.md and BLISTypedAPI.md
for bli_shiftd() and bli_?shiftd(), respectively.
- Added observed object properties to level-1d documentation in
BLISObjectAPI.md.
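To illustrate how shiftd can be expressed through addv with a zero increment on x, here is a minimal Python sketch. It is not the BLIS implementation: the function signatures, the flat-list matrix storage with row and column strides (rs, cs), and the convention that a positive diagonal offset lies above the main diagonal (j - i == diagoff) are all assumptions made for the example.

```python
from typing import List


def addv(n: int, x: List[float], incx: int,
         y: List[float], y_start: int, incy: int) -> None:
    """y := y + x over n elements, with strides incx and incy."""
    for k in range(n):
        y[y_start + k * incy] += x[k * incx]


def shiftd(diagoff: int, alpha: float, a: List[float],
           m: int, n: int, rs: int, cs: int) -> None:
    """Add alpha to every element on diagonal `diagoff` of the m-by-n matrix a.

    Element (i, j) lives at a[i*rs + j*cs]. Assumed convention: diagonal
    `diagoff` holds the elements with j - i == diagoff.
    """
    if diagoff >= 0:
        i0, j0 = 0, diagoff
        length = min(m, n - diagoff)
    else:
        i0, j0 = -diagoff, 0
        length = min(m + diagoff, n)
    if length <= 0:
        return  # the requested diagonal does not intersect the matrix
    # The scalar is a one-element "vector" read with increment 0, so the
    # same value is added at every step along the diagonal (stride rs + cs).
    addv(length, [alpha], 0, a, i0 * rs + j0 * cs, rs + cs)
```

For a 2-by-3 row-major matrix (rs=3, cs=1), shiftd(0, 10, a, 2, 3, 3, 1) touches only the elements at flat indices 0 and 4.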
Details:
- Moved windows/build/libblis-symbols.def to build/libblis-symbols.def.
Updated link commands in common.mk accordingly.
- Added a new script build/regen-symbols.sh that will regenerate the
libblis-symbols.def file in its new location after building a
haswell-targeted shared library. Thanks to Isuru Fernando for
providing the symbol generation command.
- Ran the new script to refresh the symbols file.
Details:
- Renamed the microkernels in kernels/zen/3 to kernels/haswell/3 and
then updated the file contents to use the 'haswell' infix.
- Updated bli_cntx_init_zen.c and bli_cntx_init_haswell.c according to
above function renames.
- Moved/updated the corresponding prototypes in bli_kernels_zen.h to
bli_kernels_haswell.h.
- Updated config_registry according to above changes.
- NOTE: This rename reflects the fact that haswell microkernels are
specifically written to overcome the floating-point latency for FMA
instructions on Intel Haswell-like architectures, which can issue two
FMA instructions per cycle. These ukernels happen to work fine on AMD
Zen-based architectures. However, Zen only issues one FMA per cycle,
which, while halving its floating-point throughput, gives it extra
flexibility in the design of its microkernels--namely, mr and nr can
be smaller and still overcome the floating-point latency for those
single-issue cores. A smaller value of mr and nr allows for a larger
value of kc, which may be useful in some situations. In the future,
we may write such Zen-specific microkernels to take advantage of this
additional flexibility.
Details:
- Consolidated the *sl.c and *rr.c level-3 macrokernels into a single
file per sl/rr pair, with those files named as they were before
c92762e. The consolidation does not take away the *option* of using
slab or round-robin assignment of micropanels to threads; it merely
*hides* the choice within the definitions of functions such as
bli_thread_range_jrir(), bli_packm_my_iter(), and bli_is_last_iter()
rather than expose that choice explicitly in the code. The choice of
slab or rr is not always hidden, however; there are some cases
involving herk and trmm, for example, that require some part of the
computation to use rr unconditionally. (The --thread-part-jrir option
controls the partitioning in all other cases.)
- Note: Originally, the sl and rr macrokernels were separated out for
clarity. However, aside from the additional binary code bloat, I later
deemed that clarity not worth the price of maintaining the additional
(mostly similar) codes.
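The difference between slab and round-robin assignment, and the idea of hiding the choice behind a single helper, can be sketched as follows. This is a simplified Python illustration, not the logic of bli_thread_range_jrir(): the my_iters() helper and the way leftover iterations are distributed in the slab case are assumptions.

```python
from typing import List


def my_iters(n_iter: int, n_threads: int, tid: int, method: str) -> List[int]:
    """Return the micropanel indices assigned to thread `tid`.

    "slab": each thread gets one contiguous chunk of iterations.
    "rr":   iterations are dealt out cyclically, one at a time.
    Callers loop over the result and never need to know which was used.
    """
    if method == "rr":
        return list(range(tid, n_iter, n_threads))
    if method == "slab":
        # Assumption: when n_iter is not divisible by n_threads, the
        # lowest-numbered threads each take one extra iteration.
        base, extra = divmod(n_iter, n_threads)
        start = tid * base + min(tid, extra)
        end = start + base + (1 if tid < extra else 0)
        return list(range(start, end))
    raise ValueError("method must be 'slab' or 'rr'")
```

With 10 micropanels and 3 threads, thread 0 gets indices 0 through 3 under slab but 0, 3, 6, 9 under round-robin; in both cases every index is assigned to exactly one thread.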
Details:
- Execute build/flatten-headers.py python script via $(PYTHON) in
common.mk. This allows distributions that define the current/preferred
python interpreter in the PYTHON environment variable to use that
interpreter when executing flatten-headers.py. Thanks to Isuru
Fernando for this suggestion, and to Dave Love for submitting the
initial issue/request.
Details:
- Applied the following C preprocessor macro renames
BLIS_DEFAULT_MR_THREAD_MAX -> BLIS_THREAD_MAX_IR
BLIS_DEFAULT_NR_THREAD_MAX -> BLIS_THREAD_MAX_JR
BLIS_DEFAULT_M_THREAD_RATIO -> BLIS_THREAD_RATIO_M
BLIS_DEFAULT_N_THREAD_RATIO -> BLIS_THREAD_RATIO_N
in src/test_libblis.c. This is apparently the result of a failure by
git to properly merge the 'master' and 'amd' branches in the previous
commit. (The 'master' branch contained a commit, 53a9ab1, in which
these same cpp macros were renamed throughout the source distribution.)
Details:
- Implemented support for gemm where A, B, and C may have different
storage datatypes, as well as a computational precision (and implied
computation domain) that may be different from the storage precision
of either A or B. This results in 128 different combinations, all of
which are implemented within this commit. (For now, the mixed-datatype
functionality is only supported via the object API.) If desired, the
mixed-datatype support may be disabled at configure-time.
- Added a memory-intensive optimization to certain mixed-datatype cases
that requires a single m-by-n matrix be allocated (temporarily) per
call to gemm. This optimization aims to avoid the overhead involved in
repeatedly updating C with general stride, or updating C after a
typecast from the computation precision. This memory optimization may
be disabled at configure-time (provided that the mixed-datatype
support is enabled in the first place).
- Added support for testing mixed-datatype combinations to testsuite.
The user may test gemm with mixed domains, precisions, both, or
neither.
- Added a standalone test driver directory for building and running
mixed-datatype performance experiments.
- Defined a new variation of castm, castnzm, which operates like castm
except that imaginary values are not touched when casting a real
operand to a complex operand. (By contrast, in these situations castm
sets the imaginary components of the destination matrix to zero.)
- Defined bli_obj_imag_is_zero() and substituted calls in lieu of all
usages of bli_obj_imag_equals() that tested against BLIS_ZERO, and
also simplified the implementation of bli_obj_imag_equals().
- Fixed bad behavior from bli_obj_is_real() and bli_obj_is_complex()
when given BLIS_CONSTANT objects.
- Disabled dt_on_output field in auxinfo_t structure as well as all
accessor functions. Also commented out all usage of accessor
functions within macrokernels. (Typecasting in the microkernel is
still feasible, though probably unrealistic for now given the
additional complexity required.)
- Use void function pointer type (instead of void*) for storing function
pointers in bli_l0_fpa.c.
- Added documentation for using gemm with mixed datatypes in
docs/MixedDatatypes.md and example code in examples/oapi/11gemm_md.c.
- Defined level-1d operation xpbyd and level-1m operation xpbym.
- Added xpbym test module to testsuite.
- Updated frame/include/bli_x86_asm_macros.h with additional macros
(courtesy of Devin Matthews).
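Two points from the list above lend themselves to a short sketch: the count of 128 mixed-datatype cases, and the difference between castm and castnzm when casting a real operand to a complex one. The Python below is illustrative only. It assumes the four usual datatypes (s, d, c, z) for each of A, B, and C and two computation precisions (single, double), and the function names merely mirror the operation names; the signatures are hypothetical.

```python
from itertools import product
from typing import List


def count_gemm_md_cases() -> int:
    """Count storage-datatype triples (A, B, C) times computation precisions."""
    datatypes = "sdcz"       # assumed: single/double real, single/double complex
    precisions = ("single", "double")
    return len(list(product(datatypes, datatypes, datatypes, precisions)))


def castm(src: List[List[float]], dst: List[List[complex]]) -> None:
    """Real-to-complex cast: imaginary parts of dst are set to zero."""
    for i, row in enumerate(src):
        for j, value in enumerate(row):
            dst[i][j] = complex(value, 0.0)


def castnzm(src: List[List[float]], dst: List[List[complex]]) -> None:
    """Real-to-complex cast: imaginary parts of dst are left untouched."""
    for i, row in enumerate(src):
        for j, value in enumerate(row):
            dst[i][j] = complex(value, dst[i][j].imag)
```

Under these assumptions the count is 4 x 4 x 4 x 2 = 128, matching the figure given above.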
Details:
- Defined Fortran-77 compatible APIs for bli_thread_set_num_threads()
and bli_thread_set_ways(). These wrappers are defined in
frame/compat/blis/thread/b77_thread.c. Thanks to Kay Dewhurst for
suggesting these new interfaces.
- Added missing prototype for bli_thread_set_ways() in bli_thread.h and
removed prototypes for non-existent functions bli_thread_set_*_nt().
- CREDITS file update.
Details:
- Renamed the following C preprocessor macros whose fallback/default
values are specified within frame/include/bli_kernel_macro_defs.h:
BLIS_DEFAULT_MR_THREAD_MAX -> BLIS_THREAD_MAX_IR
BLIS_DEFAULT_NR_THREAD_MAX -> BLIS_THREAD_MAX_JR
BLIS_DEFAULT_M_THREAD_RATIO -> BLIS_THREAD_RATIO_M
BLIS_DEFAULT_N_THREAD_RATIO -> BLIS_THREAD_RATIO_N
- Renamed the above cpp macro overrides within the knl, skx, and zen
sub-configurations, as well as invocations of those macros in
bli_rntm.c.
- Moved config/zen/bli_kernel.h to an 'old' directory as it is no longer
used by any code within BLIS.
Details:
- Forgot to apply the column index range fix in 10f179f to situations
when "quiet" mode (-q) is requested. This commit applies the new
column index range modifications to the quiet case.
Details:
- Previously, trsm was consolidating all ways of parallelism into the jr
loop. This was unnecessary and to some degree detrimental on some
types of hardware. Now, any parallelism bound for the jc loop will be
applied to the jc loop, while all other loops' parallelism is funneled
to the jr loop. Thanks to Devangi Parikh for helping investigate this
issue and suggesting the fix.
- NOTE: This change affects only left-side trsm. However, right-side
trsm is currently implemented in terms of the left-side
case, and thus the change effectively applies to both left and right
cases.
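The before and after behavior can be sketched as a mapping on the requested ways of parallelism. This Python illustration is not the BLIS code: the function names are hypothetical, and it assumes that "funneled" means the ways of the affected loops are multiplied together so that the total thread count is preserved.

```python
from typing import Dict


def trsm_ways_old(jc: int, pc: int, ic: int, jr: int, ir: int) -> Dict[str, int]:
    """Previous behavior: all ways of parallelism consolidated into jr."""
    return {"jc": 1, "pc": 1, "ic": 1, "jr": jc * pc * ic * jr * ir, "ir": 1}


def trsm_ways_new(jc: int, pc: int, ic: int, jr: int, ir: int) -> Dict[str, int]:
    """New behavior: jc keeps its own ways; every other loop funnels into jr."""
    return {"jc": jc, "pc": 1, "ic": 1, "jr": pc * ic * jr * ir, "ir": 1}
```

For a request of jc=2, ic=3, jr=4, the old mapping yields 24 ways in jr, while the new one yields 2 ways in jc and 12 in jr.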
Details:
- Updated testsuite to output various parameters related to parallelism
in BLIS. These parameters include:
- threading status: disabled, openmp, or pthreads;
- thread partitioning for jr/ir loops: slab or rr (round-robin);
- ways of parallelism from environment variables, and also actual
values used by gemm, herk, trmm_l, trmm_r, trsm_l, and trsm_r for
square problems (assuming all dimensions are set to 1000);
- automatic thread factorization parameters.
- Also output the status of two relatively new configure-time options:
libmemkind and the sandbox.
Details:
- Updated the irun.py script so that it updates the matlab column index
range (if found) to reflect the additional columns of data that are
substituted in. Thanks to Devangi Parikh for recognizing and reporting
this issue.
Details:
- Added an inadvertently omitted mention of the -r option (the short
equivalent of --thread-part-jrir) to the output of 'configure --help'.
Also made minor edits to the same text.
Details:
- Updated existing macrokernel function names and definitions to
explicitly use slab assignment of micropanels to threads, then created
duplicate versions of macrokernels that explicitly use round-robin
assignment instead of slab. NOTE: As in ac18949, trsm_r macrokernels
were not substantially updated in this commit because they are
currently disabled in bli_trsm_front.c.
- Updated existing packing function (in bli_packm_blk_var1.c) to
explicitly use slab partitioning, and then duplicated for round-robin.
- Updated control tree initialization to use the appropriate macrokernel
and packm function pointers depending on which method (slab or rr) was
enabled at configure-time.
- Updated configure script to accept new --thread-part-jrir=[slab|rr]
option (-m [slab|rr] for short), which allows the user to explicitly
request either slab or round-robin assignment (partitioning) of
micropanels to threads.
- Updated sandbox/ref99 according to above changes.
- Minor updates to build/add-copyright.py.
Details:
- Removed a guard from bli_clock_min_diff() that would return 0 if the
time delta was greater than 60 minutes. This was originally intended
to disregard extremely large values under the assumption that the
user probably didn't intend to run a test that long. However, since
it is in bli_clock_min_diff(), it doesn't actually help short-circuit
an implementation that is hanging or looping infinitely, since such
an implementation would first have to finish before
bli_clock_min_diff() is called. Thanks to Kiran Varaganti for
reporting this issue.
Details:
- gcc 7 introduced new behavior to the -dumpversion option whereby only
the major version component is output. However, as part of this
change, gcc 7 also introduced a new option, -dumpfullversion, which is
guaranteed to always output the major, minor, and revision numbers. If
we are using gcc 7 or later, we re-query the version string with this
new option and then re-parse the result so as to avoid misleading
output from configure (e.g. gcc 7.3.0 being reported as 7.7.7).
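The re-query logic described above can be sketched as follows. This is a Python illustration rather than the configure script's shell code: the gcc_version() helper and its query callback are hypothetical, and padding missing components with zeros is an assumption.

```python
from typing import Callable, Tuple


def gcc_version(query: Callable[[str], str]) -> Tuple[int, ...]:
    """Return (major, minor, revision) for a gcc-like compiler.

    `query(flag)` must return the compiler's output for the given flag.
    With gcc 7 or later, -dumpversion may print only the major version,
    so the version is re-queried with -dumpfullversion and re-parsed.
    """
    text = query("-dumpversion").strip()
    major = int(text.split(".")[0])
    if major >= 7:
        text = query("-dumpfullversion").strip()
    parts = [int(p) for p in text.split(".")]
    parts += [0] * (3 - len(parts))  # assumption: pad missing components
    return tuple(parts[:3])
```

In real use the callback might wrap the compiler, for example `lambda flag: subprocess.check_output(["gcc", flag], text=True)`; in tests it can be a stub that returns canned strings.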