Details:
- Adjusted the definition for libblis_test_get_string_for_result() in
testsuite/src/test_libblis.c so that the "FAIL" string is returned if
the computed residual contains either NaN or Inf. Previously, a
residual containing NaN would result in the selection of the "PASS"
string. Thanks to Devin Matthews for reporting this issue (#279).
- Expounded on comment for the macro definitions of bli_isnan() and
bli_isinf() in bli_misc_macro_defs.h to make it more obvious why they
must remain macros.
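
The NaN case is easy to miss because every ordered comparison against NaN evaluates to false, so a plain "resid > threshold" test falls through to the passing branch. A minimal sketch of the idea follows; the function name result_string_sketch() and its single threshold parameter are hypothetical and do not reproduce the actual testsuite code.

```c
#include <math.h>

/* Hypothetical stand-in for the testsuite's result-string selection.
   'thresh' is an illustrative failure threshold. */
const char* result_string_sketch( double resid, double thresh )
{
    /* Check for NaN/Inf first. A comparison such as (resid > thresh) is
       always false when resid is NaN, so without this test a NaN residual
       would fall through to "PASS". */
    if ( isnan( resid ) || isinf( resid ) ) return "FAIL";

    if ( resid > thresh ) return "FAIL";

    return "PASS";
}
```

Calling result_string_sketch( NAN, 1.0e-5 ) selects "FAIL" only because of the explicit isnan() test; the threshold comparison alone would not catch it.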
Details:
- Expanded cpp guard in frame/include/bli_x86_asm_macros.h to also check
__MINGW32__ in addition to _WIN32, __clang__, and __MIC__. Thanks to
Isuru Fernando for suggesting this fix, and also to Costas Yamin for
originally reporting the issue (#277).
Details:
- Fixed a bug in frame/3/bli_l3_oapi.c in the conditional that divides
use of induced method (1m) execution from native execution. The former
was intended to only be used in cases where all storage datatypes are
complex and the datatype of C is equal to the computation datatype.
(If mixed datatypes are detected, native execution would be used.)
However, the code in bli_gemm() was erroneously checking the execution
datatype instead of the computation datatype, which at that point is
guaranteed to be equal to the storage datatype even if the computation
datatype contains a different value. Thanks to Devangi Parikh for
helping to isolate this bug.
Details:
- Added debug output to bli_malloc.c in order to debug certain kinds of
memory behavior in BLIS. The printf() statements are disabled and must
be enabled manually.
- Whitespace/comment updates in bli_membrk.c.
Details:
- Print an error message from configure if the user attempts to
explicitly configure BLIS for simultaneous use of 64-bit integers in
the BLAS API with 32-bit integers in the BLIS API.
- Added cpp macro conditional to bli_type_defs.h to mandate that BLIS
integers be 64 bits if the BLAS integers are 64 bits. This and the
above item take care of issue #274. Thanks to Devin Matthews and
Jeff Hammond for suggesting these safeguards.
- Slight reorganization and relabeling (for clarity) of BLAS/CBLAS
sections and BLIS integer size line of the testsuite configuration
output.
- Very minor edits to docs/MixedDatatypes.md.
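
One way to express the integer-size safeguard with the preprocessor is sketched below. The macro and type names (SKETCH_BLAS_INT_SIZE, SKETCH_BLIS_INT_SIZE, sketch_*_int_t) are hypothetical, and the sketch overrides the BLIS size rather than raising an #error; the conditional actually added to bli_type_defs.h may look different.

```c
#include <stdint.h>

/* Hypothetical configure-time settings: 64-bit BLAS integers requested
   together with 32-bit BLIS integers. */
#ifndef SKETCH_BLAS_INT_SIZE
#define SKETCH_BLAS_INT_SIZE 64
#endif
#ifndef SKETCH_BLIS_INT_SIZE
#define SKETCH_BLIS_INT_SIZE 32
#endif

/* If the BLAS integers are 64 bits, mandate 64-bit BLIS integers. */
#if SKETCH_BLAS_INT_SIZE == 64
  #undef  SKETCH_BLIS_INT_SIZE
  #define SKETCH_BLIS_INT_SIZE 64
#endif

/* Derive the integer types from the (possibly overridden) sizes. */
#if SKETCH_BLIS_INT_SIZE == 64
typedef int64_t sketch_blis_int_t;
#else
typedef int32_t sketch_blis_int_t;
#endif

#if SKETCH_BLAS_INT_SIZE == 64
typedef int64_t sketch_blas_int_t;
#else
typedef int32_t sketch_blas_int_t;
#endif
```

With this guard in place, a BLAS integer can always be converted to a BLIS integer without truncation.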
Details:
- Added the missing bli_pthread_mutex_trylock() function and prototype
to the non-Windows sections of bli_pthread.c and .h. This function
isn't needed by BLIS, but I figured why not make the Windows and
non-Windows sections consistent with one another.
Details:
- Added function definitions for bli_pthread_cond_*() as well as related
types and constants to bli_pthread.c, and corresponding prototypes to
bli_pthread.h.
Details:
- Fully define bli_pthreads barrier-related types on OS X. Only typedef
those types in terms of pthreads types on non-Windows, non-Apple OSes
(i.e. Linux).
Details:
- Expanded the bli_pthread_*() -> pthread_*() wrappers in
frame/thread/bli_pthread.c to include cases for Windows taken from
frame/base/bli_pthread_wrap.c. Now, bli_pthread_*() is always defined
and always used by BLIS and the BLIS testsuite (in lieu of calling
pthreads directly, as before). The implementation used in this new
API depends on whether we are building for Windows, and to a lesser
extent, whether we are building on OS X. For the core API, Windows
uses Windows threads, non-Windows (Linux, OS X) uses pthreads.
OS X and Windows get barriers implemented in terms of other
bli_pthread_*() functions, and Linux gets barriers implemented in
terms of pthread_barrier*(). This commit addresses issue #273.
- Fixed a bug in the Linux definition of bli_pthread_mutex_unlock(),
which was erroneously calling pthread_mutex_lock().
- Minor changes to configure so that the auto-detection executable
can be built given the above changes (most notably, turning on
POSIX extensions via -D_GNU_SOURCE).
- Removed temporary play-test code for shiftd that accidentally got
committed into test/3m4m/test_gemm.c.
Details:
- Define a dummy bli_l3_thread_entry() function when multithreading is
disabled altogether, or enabled via OpenMP. This function was
originally necessary only when multithreading was enabled via pthreads.
By defining the function no matter the threading options given, it is
less likely that an AppVeyor Windows build will complain due to a
missing symbol in the DLL. (To be clear: AppVeyor was working fine
before, but a problem might have arisen had it been switched to an
OpenMP build.)
- Removed the prototype for bli_l3_thread_entry() from
bli_thrcomm_pthreads.c and placed it in bli_thrcomm.h.
- Regenerated the symbols list file build/libblis-symbols.def.
Details:
- Fixed a bug in the code that prints out the communicator and work ids
from the various threads' thrinfo_t nodes. This bug manifested when
the dimension being parallelized was not large enough for every
thread to be assigned actual work (since the minimum amount of work is
determined by the register blocksize in the dimension being
parallelized). In those cases, the threads that receive no work in
that dimension do not finish building their thrinfo_t tree, leaving
lower-level nodes non-existent. (The bug itself was usually observed as
a segfault when the printing code attempted to dereference all the way
down the thrinfo_t tree.) The solution involves explicitly checking
each node as it is dereferenced, and if at any time NULL is found, all
subsequent communicator and work ids are set to -1.
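
The NULL-checking walk described above can be sketched as follows. The node type is a hypothetical, simplified stand-in (the real thrinfo_t carries more state), and collect_ids_sketch() is an illustrative name rather than the function used in BLIS.

```c
#include <stddef.h>

/* Hypothetical, simplified tree node. */
typedef struct node_sketch_s
{
    int                   comm_id;
    int                   work_id;
    struct node_sketch_s* sub_node;
} node_sketch_t;

/* Record the communicator and work ids for n_levels levels of the tree.
   Each node is checked before it is dereferenced; once a NULL node is
   found, that level and all lower levels are reported as -1. */
void collect_ids_sketch( const node_sketch_t* node, int n_levels,
                         int* comm_ids, int* work_ids )
{
    for ( int i = 0; i < n_levels; ++i )
    {
        if ( node != NULL )
        {
            comm_ids[ i ] = node->comm_id;
            work_ids[ i ] = node->work_id;
            node = node->sub_node;
        }
        else
        {
            comm_ids[ i ] = -1;
            work_ids[ i ] = -1;
        }
    }
}
```

A thread whose tree stops after two levels therefore reports -1 for every deeper level instead of dereferencing a missing node.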
Details:
- Defined a bli_pthread_*() API so that the testsuite, when being linked
against a Windows DLL, will be able to access pthreads functionality
without those pthreads functions being explicitly exported by the DLL.
Instead, we export the bli_pthread_*() layer, which uses types and
functions that are identical to pthreads, but adds a 'bli_' prefix.
Only a few basic functions are present in the bli_pthread_*() API
for now. Thanks to Devin Matthews and Isuru Fernando for their help
on a related PR (#261) that this commit will hopefully facilitate.
- Updated testsuite so that it calls bli_pthread_*() layer instead of
pthread_*() functions directly.
- Regenerated build/libblis-symbols.def.
- Comment update to build/regen-symbols.sh.
Details:
- Defined a new level-1d operation called 'shiftd', including object and
typed APIs. This operation adds a scalar value to every element along
an arbitrary diagonal of a matrix. Currently, shiftd is implemented in
terms of the addv kernel. (The scalar is passed in as the x vector
with an increment of zero.)
- Replaced ad-hoc usage of setd and addd (after creating a temporary
matrix object) with use of shiftd, which is much more concise, in
various test driver files in the testsuite. Similar changes were made
to the standalone test drivers and the example code.
- Added documentation entries in BLISObjectAPI.md and BLISTypedAPI.md
for bli_shiftd() and bli_?shiftd(), respectively.
- Added observed object properties to level-1d documentation in
BLISObjectAPI.md.
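
To illustrate how a diagonal shift can be phrased as a vector add with a zero increment on x, here is a minimal double-precision sketch. The function names, the argument order, and the convention that a positive diagonal offset selects a superdiagonal are assumptions made for this example; they are not the BLIS signatures.

```c
/* Simplified addv: y[ i*incy ] += x[ i*incx ]. With incx == 0, the same
   scalar is added to every element of y. */
static void addv_sketch( int n, const double* x, int incx,
                         double* y, int incy )
{
    for ( int i = 0; i < n; ++i ) y[ i * incy ] += x[ i * incx ];
}

/* Add alpha to every element along diagonal 'diagoff' of the m-by-n
   matrix a, stored with row stride rs and column stride cs. */
void shiftd_sketch( int diagoff, int m, int n, double alpha,
                    double* a, int rs, int cs )
{
    /* First element of the diagonal: (0, diagoff) or (-diagoff, 0). */
    int i0 = ( diagoff < 0 ) ? -diagoff : 0;
    int j0 = ( diagoff > 0 ) ?  diagoff : 0;

    /* Nothing to do if the diagonal lies outside the matrix. */
    if ( i0 >= m || j0 >= n ) return;

    /* Number of elements on the diagonal. */
    int n_elem = ( m - i0 < n - j0 ) ? ( m - i0 ) : ( n - j0 );

    /* Stepping by rs + cs moves one row down and one column right. */
    addv_sketch( n_elem, &alpha, 0, a + i0 * rs + j0 * cs, rs + cs );
}
```

For a column-stored 3-by-3 matrix (rs = 1, cs = 3), shiftd_sketch( 0, 3, 3, 2.0, a, 1, 3 ) touches only a[0], a[4], and a[8].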
Details:
- Consolidated the *sl.c and *rr.c level-3 macrokernels into a single
file per sl/rr pair, with those files named as they were before
c92762e. The consolidation does not take away the *option* of using
slab or round-robin assignment of micropanels to threads; it merely
*hides* the choice within the definitions of functions such as
bli_thread_range_jrir(), bli_packm_my_iter(), and bli_is_last_iter()
rather than expose that choice explicitly in the code. The choice of
slab or rr is not always hidden, however; there are some cases
involving herk and trmm, for example, that require some part of the
computation to use rr unconditionally. (The --thread-part-jrir option
controls the partitioning in all other cases.)
- Note: Originally, the sl and rr macrokernels were separated out for
clarity. However, aside from the additional binary code bloat, I later
deemed that clarity not worth the price of maintaining the additional
(mostly similar) codes.
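
As a rough illustration of hiding the slab/rr choice inside a range function, the sketch below returns loop bounds that the caller uses identically in both modes. The enum, the function name, and the way leftover iterations are distributed are assumptions for this example and are not taken from bli_thread_range_jrir().

```c
/* Hypothetical partitioning mode (chosen at configure-time in BLIS). */
typedef enum { PART_SLAB_SKETCH, PART_RR_SKETCH } part_sketch_t;

/* Compute the loop bounds for thread 'tid' of 'nt' threads over 'n'
   iterations. The caller always runs
       for ( i = start; i < end; i += inc )
   and never needs to know which mode is in effect. */
void range_jrir_sketch( part_sketch_t mode, int tid, int nt, int n,
                        int* start, int* end, int* inc )
{
    if ( mode == PART_SLAB_SKETCH )
    {
        /* Slab: one contiguous chunk per thread. In this sketch the
           first 'rem' threads take one extra iteration. */
        int chunk = n / nt;
        int rem   = n % nt;

        *start = tid * chunk + ( tid < rem ? tid : rem );
        *end   = *start + chunk + ( tid < rem ? 1 : 0 );
        *inc   = 1;
    }
    else
    {
        /* Round-robin: interleaved iterations tid, tid+nt, tid+2*nt, ... */
        *start = tid;
        *end   = n;
        *inc   = nt;
    }
}
```

With 10 iterations and 4 threads, thread 1 handles iterations 3, 4, 5 under slab and 1, 5, 9 under round-robin.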
Details:
- Implemented support for gemm where A, B, and C may have different
storage datatypes, as well as a computational precision (and implied
computation domain) that may be different from the storage precision
of either A or B. This results in 128 different combinations, all of
which are implemented within this commit. (For now, the mixed-datatype
functionality is only supported via the object API.) If desired, the
mixed-datatype support may be disabled at configure-time.
- Added a memory-intensive optimization to certain mixed-datatype cases
that requires a single m-by-n matrix be allocated (temporarily) per
call to gemm. This optimization aims to avoid the overhead involved in
repeatedly updating C with general stride, or updating C after a
typecast from the computation precision. This memory optimization may
be disabled at configure-time (provided that the mixed-datatype
support is enabled in the first place).
- Added support for testing mixed-datatype combinations to testsuite.
The user may test gemm with mixed domains, precisions, both, or
neither.
- Added a standalone test driver directory for building and running
mixed-datatype performance experiments.
- Defined a new variation of castm, castnzm, which operates like castm
except that imaginary values are not touched when casting a real
operand to a complex operand. (By contrast, in these situations castm
sets the imaginary components of the destination matrix to zero.)
- Defined bli_obj_imag_is_zero() and substituted calls in lieu of all
usages of bli_obj_imag_equals() that tested against BLIS_ZERO, and
also simplified the implementation of bli_obj_imag_equals().
- Fixed bad behavior from bli_obj_is_real() and bli_obj_is_complex()
when given BLIS_CONSTANT objects.
- Disabled dt_on_output field in auxinfo_t structure as well as all
accessor functions. Also commented out all usage of accessor
functions within macrokernels. (Typecasting in the microkernel is
still feasible, though probably unrealistic for now given the
additional complexity required.)
- Use void function pointer type (instead of void*) for storing function
pointers in bli_l0_fpa.c.
- Added documentation for using gemm with mixed datatypes in
docs/MixedDatatypes.md and example code in examples/oapi/11gemm_md.c.
- Defined level-1d operation xpbyd and level-1m operation xpbym.
- Added xpbym test module to testsuite.
- Updated frame/include/bli_x86_asm_macros.h with additional macros
(courtesy of Devin Matthews).
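
The difference between castm and castnzm when casting a real operand to a complex operand can be sketched on vectors as follows. The struct and function names are hypothetical and the sketch ignores strides and matrix structure; it only shows which components of the destination are written.

```c
/* Hypothetical double-precision complex element. */
typedef struct { double real; double imag; } zelem_sketch_t;

/* castm-like behavior: the imaginary components of the destination are
   set to zero. */
void castm_sketch( int n, const double* x, zelem_sketch_t* y )
{
    for ( int i = 0; i < n; ++i )
    {
        y[ i ].real = x[ i ];
        y[ i ].imag = 0.0;
    }
}

/* castnzm-like behavior: only the real components are written; the
   imaginary components of the destination are left untouched. */
void castnzm_sketch( int n, const double* x, zelem_sketch_t* y )
{
    for ( int i = 0; i < n; ++i )
    {
        y[ i ].real = x[ i ];
    }
}
```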
Details:
- Defined Fortran-77 compatible APIs for bli_thread_set_num_threads()
and bli_thread_set_ways(). These wrappers are defined in
frame/compat/blis/thread/b77_thread.c. Thanks to Kay Dewhurst for
suggesting these new interfaces.
- Added missing prototype for bli_thread_set_ways() in bli_thread.h and
removed prototypes for non-existent functions bli_thread_set_*_nt().
- CREDITS file update.
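
For readers unfamiliar with Fortran-77 compatible wrappers, the general pattern is sketched below: Fortran passes arguments by reference, and many Fortran compilers expect a lowercase symbol with a trailing underscore (this is compiler-dependent). All names here are hypothetical stand-ins; the actual wrappers in b77_thread.c may be declared differently.

```c
/* Stand-in for a C-callable setter in the library (hypothetical). */
static int num_threads_sketch = 1;

void set_num_threads_sketch( int n_threads )
{
    num_threads_sketch = n_threads;
}

int get_num_threads_sketch( void )
{
    return num_threads_sketch;
}

/* Fortran-77 callable wrapper: the argument arrives by reference and the
   symbol carries a trailing underscore (an assumed name-mangling scheme). */
void set_num_threads_sketch_( const int* n_threads )
{
    set_num_threads_sketch( *n_threads );
}
```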
Details:
- Renamed the following C preprocessor macros whose fallback/default
values are specified within frame/include/bli_kernel_macro_defs.h:
BLIS_DEFAULT_MR_THREAD_MAX -> BLIS_THREAD_MAX_IR
BLIS_DEFAULT_NR_THREAD_MAX -> BLIS_THREAD_MAX_JR
BLIS_DEFAULT_M_THREAD_RATIO -> BLIS_THREAD_RATIO_M
BLIS_DEFAULT_N_THREAD_RATIO -> BLIS_THREAD_RATIO_N
- Renamed the above cpp macro overrides within the knl, skx, and zen
sub-configurations, as well as invocations of those macros in
bli_rntm.c.
- Moved config/zen/bli_kernel.h to an 'old' directory as it is no longer
used by any code within BLIS.
Details:
- Previously, trsm was consolidating all ways of parallelism into the jr
loop. This was unnecessary and to some degree detrimental on some
types of hardware. Now, any parallelism bound for the jc loop will be
applied to the jc loop, while all other loops' parallelism is funneled
to the jr loop. Thanks to Devangi Parikh for helping investigate this
issue and suggesting the fix.
- NOTE: This change affects only left-side trsm. However,
right-side trsm is currently implemented in terms of the left-side
case, and thus the change effectively applies to both left and right
cases.
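
The redistribution of parallelism described above can be sketched as follows. The struct and function names are hypothetical; the sketch only shows the arithmetic of keeping the jc ways and folding every other loop's ways into jr.

```c
/* Hypothetical container for the ways of parallelism in each loop. */
typedef struct { int jc, pc, ic, jr, ir; } ways_sketch_t;

/* Keep the jc parallelism as requested and funnel the parallelism of all
   other loops into the jr loop. */
ways_sketch_t trsm_ways_sketch( ways_sketch_t w )
{
    ways_sketch_t r;

    r.jc = w.jc;
    r.pc = 1;
    r.ic = 1;
    r.jr = w.pc * w.ic * w.jr * w.ir;
    r.ir = 1;

    return r;
}
```

For example, a request of jc=2, ic=3, jr=4 (with pc and ir equal to 1) becomes jc=2, jr=12, so the total number of threads is unchanged.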
Details:
- Updated testsuite to output various parameters related to parallelism
in BLIS. These parameters include:
- threading status: disabled, openmp, or pthreads;
- thread partitioning for jr/ir loops: slab or rr (round-robin);
- ways of parallelism from environment variables, and also actual
values used by gemm, herk, trmm_l, trmm_r, trsm_l, and trsm_r for
square problems (assuming all dimensions are set to 1000);
- automatic thread factorization parameters.
- Also output the status of two relatively new configure-time options:
libmemkind and the sandbox.
Details:
- Updated existing macrokernel function names and definitions to
explicitly use slab assignment of micropanels to threads, then created
duplicate versions of macrokernels that explicitly use round-robin
assignment instead of slab. NOTE: As in ac18949, trsm_r macrokernels
were not substantially updated in this commit because they are
currently disabled in bli_trsm_front.c.
- Updated existing packing function (in bli_packm_blk_var1.c) to
explicitly use slab partitioning, and then duplicated for round-robin.
- Updated control tree initialization to use the appropriate macrokernel
and packm function pointers depending on which method (slab or rr) was
enabled at configure-time.
- Updated configure script to accept new --thread-part-jrir=[slab|rr]
option (-m [slab|rr] for short), which allows the user to explicitly
request either slab or round-robin assignment (partitioning) of
micropanels to threads.
- Updated sandbox/ref99 according to above changes.
- Minor updates to build/add-copyright.py.
Details:
- Removed a guard from bli_clock_min_diff() that would return 0 if the
time delta was greater than 60 minutes. This was originally intended
to disregard extremely large values under the assumption that the
user probably didn't intend to run a test that long. However, since
it is in bli_clock_min_diff(), it doesn't actually help short-circuit
an implementation that is hanging or looping infinitely, since such
an implementation would first have to finish before the
bli_clock_min_diff() is called. Thanks to Kiran Varaganti for
reporting this issue.
BUG No: CPUPL-197 fixed by Thangaraj Santanu
The bli_clock_min_diff() function in BLIS assumed that if the time taken is greater than 1 hour then the reading must be wrong. However, this is not the case in general, while the other checks, such as those for a time taken that is close to zero or on the order of a nanosecond, are of course valid.
gerrit review: http://git.amd.com:8080/#/c/118694/1/frame/base/bli_clock.c
Change-Id: I9dc313d7c5fdc20684f67a516bf3237de3e0694a
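
A minimal sketch of the resulting behavior: implausibly small deltas are still discarded, but no upper bound is applied. The function name and the choice to pass the measured delta in as a parameter are assumptions for this example, and the one-nanosecond cutoff is illustrative; the actual bli_clock_min_diff() reads the clock itself.

```c
/* Return the smaller of the running minimum and the new time delta,
   ignoring deltas that are too small to be meaningful. */
double clock_min_diff_sketch( double time_min, double time_diff )
{
    /* Treat non-positive or sub-nanosecond deltas as garbled readings
       and keep the previous minimum. */
    if ( time_diff <= 0.0 || time_diff < 1.0e-9 ) return time_min;

    /* No upper bound: a delta of more than an hour is accepted. */
    return ( time_diff < time_min ) ? time_diff : time_min;
}
```

Starting from a large initial minimum such as 1.0e9, a two-hour measurement of 7200.0 seconds is now recorded instead of being replaced by zero.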
Details:
- Adjusted the method by which micropanels are assigned to threads in
the 2nd (jr) and 1st (ir) loops around the microkernel to (mostly)
employ contiguous "slab" partitioning rather than interleaved (round
robin) partitioning. The new partitioning schemes and related details
for specific families of operations are listed below:
- gemm: slab partitioning.
- herk: slab partitioning for region corresponding to non-triangular
region of C; round robin partitioning for triangular region.
- trmm: slab partitioning for region corresponding to non-triangular
region of B; round robin partitioning for triangular region.
(NOTE: This affects both left- and right-side macrokernels:
trmm_ll, trmm_lu, trmm_rl, trmm_ru.)
- trsm: slab partitioning.
(NOTE: This affects only left-side macrokernels trsm_ll,
trsm_lu; right-side macrokernels were not touched.)
Also note that the previous macrokernels were preserved inside of
the 'other' directory of each operation family directory (e.g.
frame/3/gemm/other, frame/3/herk/other, etc).
- Updated gemm macrokernel in sandbox/ref99 in light of above changes
and fixed a stale function pointer type in blx_gemm_int.c
(gemm_voft -> gemm_var_oft).
- Added standalone test drivers in test/3m4m for herk, trmm, and trsm
and minor changes to test/3m4m/Makefile.
- Updated the arguments and definitions of bli_*_get_next_[ab]_upanel()
and bli_trmm_?_?r_my_iter() macros defined in bli_l3_thrinfo.h.
- Renamed bli_thread_get_range*() APIs to bli_thread_range*().
Details:
- Rewrote bli_winsys.c to define bli_setenv() and bli_sleep()
unconditionally, though differently for Windows and non-Windows, and
then disabled the definition of bli_setenv() entirely since BLIS
no longer needs to set environment variables. Updated bli_winsys.h
accordingly, and call bli_sleep() from within testsuite instead of
sleep() directly.
- Use
#if !defined(_POSIX_BARRIERS) || (_POSIX_BARRIERS != 200809L)
instead of
#if !defined(_POSIX_BARRIERS) || (_POSIX_BARRIERS < 0)
when guarding against local definition of pthread barrier in
testsuite. (The description for unistd.h implies that _POSIX_BARRIERS
should always be set to 200809L when barriers are supported, though I
won't be surprised if we encounter a case in the future where it is
set to something else such as 1 while still supported.)
- Removed old _VERS_CONF_INST definitions and installation rules in
top-level Makefile. These are no longer needed because we no longer
output libraries with the version and configuration name as
substrings.
- Comment/whitespace updates in Makefile, config.mk.in, common.mk,
configure, bli_extern_defs.h, and test_libblis.h.
- Added mention of 1m to README.md and other trivial tweaks.
Details:
- Added the new bli_dgemm_skx_asm_16x14.c microkernel from the skx-redux
branch, along with appropriate blocksizes in bli_cntx_init_skx.c and
a prototype in bli_kernels_skx.h. (Devin has not yet written the
sgemm analogue, so for now we will continue using the older sgemm
ukernel.)
- Updated frame/include/bli_x86_asm_macros.h with a minor change that
was present within the skx-redux branch.