Details:
- Previously, rs_ct and cs_ct, the strides of the temporary microtile used
primarily in the macrokernels' edge case handling, were unconditionally
set to 1 and MR, respectively. However, Devin Matthews noted that this
ought to be changed so that the strides of ct were in agreement with the
strides of C. (That is, if C was row-stored, then ct should be accessed
by rows as well.) The implicit assumption is that the strides of C
have already been adjusted, via induced transposition, if the storage
preference of the microkernel is at odds with the storage of C. So, if
the microkernel prefers row storage, the macrokernel's interior cases
would present row-stored (ideal) microkernel subproblems to the
microkernel, but for edge cases, it would still see column-stored
subproblems (not ideal). This commit fixes this issue. Thanks to Devin
for his suggestion.
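
The stride selection described above can be sketched in a few lines of C.
This is a minimal illustration, not the BLIS macrokernel code: the type
tile_strides_t, the function ct_strides_like_c(), and the rule used to
detect row storage (unit column stride) are assumptions made for the
example.

```c
/* Hypothetical helper (not part of BLIS): choose the row and column
   strides of an MR x NR temporary microtile so that its storage scheme
   agrees with that of C. */
typedef struct
{
    long rs; /* row stride    */
    long cs; /* column stride */
} tile_strides_t;

static tile_strides_t ct_strides_like_c( long rs_c, long cs_c,
                                         long mr,   long nr )
{
    tile_strides_t s;

    /* Assumption: C counts as row-stored when its column stride is 1
       and its row stride is not. Anything else is treated as
       column-stored, which matches the old unconditional behavior. */
    if ( cs_c == 1 && rs_c != 1 )
    {
        s.rs = nr; /* consecutive rows are NR elements apart */
        s.cs = 1;
    }
    else
    {
        s.rs = 1;
        s.cs = mr; /* consecutive columns are MR elements apart */
    }

    return s;
}
```

With a 6x8 microtile, a row-stored C (rs_c = 100, cs_c = 1) yields strides
(8, 1), while a column-stored C (rs_c = 1, cs_c = 100) yields (1, 6).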
Details:
- Moved the definition of the cpp macro BLIS_ENABLE_MULTITHREADING
from bli_thread.h to bli_config_macro_defs.h. Also moved the
sanity check that OpenMP and POSIX threads are not both enabled.
- Thanks to Krzysztof Drewniak for reporting this bug.
Details:
- Removed the header file, bli_malloc_prototypes.h, which automatically
generated prototypes for the functions specified by the following
cpp macros:
BLIS_MALLOC_INTL
BLIS_FREE_INTL
BLIS_MALLOC_POOL
BLIS_FREE_POOL
BLIS_MALLOC_USER
BLIS_FREE_USER
These prototypes were originally provided primarily as a convenience
to those developers who specified their own malloc()/free() substitutes
for one or more of these macros. However, we generated these prototypes
regardless, even when the default values (malloc and free) of the
macros above were used. A problem arose under certain circumstances
(e.g., gcc in C++ mode on Linux with glibc) when including blis.h that
stemmed from the "throw" specification which was added to glibc's
malloc() prototype, resulting in a prototype mismatch. Therefore, going
forward, developers who specify their own custom malloc()/free()
substitutes must also prototype those substitutes via bli_kernel.h.
Thanks to Krzysztof Drewniak for reporting this bug, and Devin Matthews
for researching the nature of the problem and potential solutions.
Details:
- Relocated membrk_t definition from bli_membrk.h to bli_type_defs.h.
- Moved #include of bli_malloc.h from blis.h to bli_type_defs.h.
- Removed standalone mtx_t and mutex_t typedefs in bli_type_defs.h.
- Moved #include of bli_mutex.h from bli_thread.h to bli_type_defs.h.
- The redundant typedefs of membrk_t and mtx_t caused a warning on some C
compilers. Thanks to Tyler Smith for reporting this issue.
Details:
- Added cpp-guarded code to bli_thrcomm_openmp.c that allows a curious
developer to print the contents of the thrinfo_t structures of each
thread, for verification purposes or just to study the way thread
information and communicators are used in BLIS.
- Enabled some previously-disabled code in bli_l3_thrinfo.c for freeing
an array of thrinfo_t* values that is used in the new, cpp-guarded code
mentioned above.
- Removed some old commented lines from bli_gemm_front.c.
Details:
- Updated certain occurrences of "omp" in common.mk that were missed in
commit fd04869, which changed the preferred configure option string
for enabling OpenMP from "omp" to "openmp".
Details:
- Removed frame/base/bli_mem.c and frame/include/bli_auxinfo_macro_defs.h,
both of which were renamed/removed in 701b9aa. For some reason, these
files survived when the compose branch was merged back into master.
(Clearly, git's merging algorithm is not perfect.)
- Removed frame/base/bli_mem.c.prev (an artifact of the long-ago changed
memory allocator that I was keeping around for no particular reason).
Details:
- Fixed a bug that would manifest in the form of a segmentation fault
in bli_cntl_free() when calling any level-3 operation on an empty
output matrix (i.e., m = n = 0). Specifically, the code previously
assumed that the entire control tree was built prior to it being
freed. However, if the level-3 operation performs an early exit, the
control tree will be incomplete, and this scenario is now handled.
Thanks to Elmar Peise for reporting this bug.
Details:
- Fixed a bug in bli_free_align() caused by failing to handle NULL pointers
up-front, which led to performing pointer arithmetic on NULL pointers in
order to free the address immediately before the pointer. Thanks to Devin
Matthews for reporting this bug.
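
The failure mode above is easiest to see in a small aligned allocator that
stores the original malloc() address in the slot immediately before the
aligned pointer. The following is a minimal sketch under that assumption;
my_malloc_align() and my_free_align() are hypothetical names, not the BLIS
functions, and the alignment is assumed to be a power of two no smaller
than sizeof(void*).

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical aligned allocator: over-allocates, rounds the address up
   to the requested alignment, and saves the raw malloc() pointer in the
   slot just before the aligned address. */
static void* my_malloc_align( size_t size, size_t align )
{
    if ( size == 0 ) return NULL;

    void* raw = malloc( size + align + sizeof( void* ) );
    if ( raw == NULL ) return NULL;

    /* Leave room for the saved pointer, then round up. */
    uintptr_t addr = ( uintptr_t )raw + sizeof( void* );
    addr = ( addr + align - 1 ) & ~( uintptr_t )( align - 1 );

    ( ( void** )addr )[ -1 ] = raw;

    return ( void* )addr;
}

static void my_free_align( void* p )
{
    /* Handle NULL up-front. Without this check, the read below would
       perform pointer arithmetic on (and dereference near) NULL. */
    if ( p == NULL ) return;

    free( ( ( void** )p )[ -1 ] );
}
```

Calling my_free_align( NULL ) is then a harmless no-op, mirroring the
behavior of free( NULL ).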
Details:
- Moved amaxv from being a utility operation to being a level-1v operation.
This includes the establishment of a new amaxv kernel to live beside all
of the other level-1v kernels.
- Added two new functions to bli_part.c:
bli_acquire_mij()
bli_acquire_vi()
The first acquires a scalar object for the (i,j) element of a matrix,
and the second acquires a scalar object for the ith element of a vector.
- Added integer support to bli_getsc level-0 operation. This involved
adding integer support to the bli_*gets level-0 scalar macros.
- Added a new test module to test amaxv as a level-1v operation. The test
module works by comparing the value identified by bli_amaxv() to the
value found from a reference-like code local to the test module
source file. In other words, it (intentionally) does not guarantee the
same index is found; only the same value. This allows for different
implementations in the case where a vector contains two or more elements
containing exactly the same floating point value (or values, in the case
of the complex domain).
- Removed the directory frame/include/old/.
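
The value-based (rather than index-based) check used by the amaxv test
module can be sketched as follows. This is an illustrative sketch for the
real domain only; ref_amaxv() and amaxv_index_is_acceptable() are
hypothetical names, the tie-breaking rule of the reference search (first
occurrence wins) is an assumption, and NaN handling is not considered.

```c
#include <stddef.h>

/* Absolute value without relying on libm. */
static double abs_d( double v )
{
    return ( v < 0.0 ? -v : v );
}

/* Reference-like search: index of the first element of maximum
   absolute value. Assumes n >= 1. */
static size_t ref_amaxv( size_t n, const double* x )
{
    size_t best = 0;

    for ( size_t i = 1; i < n; ++i )
        if ( abs_d( x[ i ] ) > abs_d( x[ best ] ) ) best = i;

    return best;
}

/* Returns 1 if the index under test refers to an element whose absolute
   value equals the reference maximum, even if the index itself differs
   from the one the reference search found. */
static int amaxv_index_is_acceptable( size_t n, const double* x,
                                      size_t index )
{
    if ( n == 0 || index >= n ) return 0;

    return ( abs_d( x[ index ] ) == abs_d( x[ ref_amaxv( n, x ) ] ) );
}
```

For x = { 1.0, -3.0, 2.0, 3.0 }, both index 1 and index 3 pass the check,
since each refers to an element of absolute value 3.0.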
Details:
- Implemented Ricardo Magana's distributed thread info/communicator
management. Rather than fully constructing the thrinfo_t structures, from
root to leaf, prior to spawning threads, the threads individually
construct their thrinfo_t trees (or, chains), and do so incrementally,
as needed, reusing the same structure nodes during subsequent blocked
variant iterations. This required moving the initial creation of the
thrinfo_t structure (now, the root nodes) from the _front() functions
to the bli_l3_thread_decorator(). The incremental "growing" of the tree
is performed in the internal back-end (i.e., _int()) function, and so is
mostly invisible. Also, the incremental growth of the thrinfo_t tree is
done as a function of the current and parent control tree nodes (as well
as the parent thrinfo_t node), further reinforcing the parallel
relationship between the two data structures.
- Removed the "inner" communicator from thrinfo_t structure definition,
as well as its id. Changed all APIs accordingly. Renamed
bli_thrinfo_needs_free_comms() to bli_thrinfo_needs_free_comm().
- Defined bli_l3_thrinfo_print_paths(), which prints the information
in an array of thrinfo_t* structure pointers. (Used only as a
debugging/verification tool.)
- Deprecated the following thrinfo_t creation functions:
bli_packm_thrinfo_create()
bli_l3_thrinfo_create()
because they are no longer used. bli_thrinfo_create() is now called
directly when creating thrinfo_t nodes.
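
The incremental, per-thread growth of the thread info chain can be
pictured with the following sketch. It is a deliberately simplified
illustration, not the BLIS implementation: info_t, info_create(),
info_grow(), and info_free() are hypothetical, and the n_way/work_id
arguments stand in for whatever would be derived from the control tree
nodes and the parent node.

```c
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a thread info node. */
typedef struct info_s
{
    int            n_way;    /* ways of parallelism at this level */
    int            work_id;  /* this thread's id at this level    */
    struct info_s* sub_node; /* child node, created lazily        */
} info_t;

static info_t* info_create( int n_way, int work_id )
{
    info_t* node = malloc( sizeof( *node ) );
    if ( node == NULL ) return NULL;

    node->n_way    = n_way;
    node->work_id  = work_id;
    node->sub_node = NULL;

    return node;
}

/* Return the child of parent, creating it only on the first call.
   Subsequent calls (e.g., later loop iterations) reuse the same node. */
static info_t* info_grow( info_t* parent, int n_way, int work_id )
{
    if ( parent->sub_node == NULL )
        parent->sub_node = info_create( n_way, work_id );

    return parent->sub_node;
}

/* Free a chain from the given node down to the leaf. */
static void info_free( info_t* node )
{
    while ( node != NULL )
    {
        info_t* next = node->sub_node;
        free( node );
        node = next;
    }
}
```

Each thread would create only its own root node up-front and call
info_grow() as it descends, so no thread pays for levels it never reaches.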
Details:
- Changed the configure script so that the expected string argument to the
-t (or --enable-threading=) option that enables OpenMP multithreading is
'openmp'. The previous expected string, 'omp', is still supported but
should be considered deprecated.
Details:
- Added optional printf() statements to print out thread communicator
info as the thrinfo_t structure is built in bli_l3_thrinfo.c.
- Minor changes to frame/thread/bli_thrinfo.h.
Details:
- Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based
architectures. As with their real domain brethren, these kernels prefer
row storage (though this doesn't affect most users due to high-level
optimizations in most level-3 operations that induce a transpose to
whatever storage preference the kernel may have).
Details:
- Removed thread barriers from the end of the loop bodies of
bli_gemm_blk_var1(), bli_gemm_blk_var2(), bli_trsm_blk_var1(),
and bli_trsm_blk_var2().
- Moved the thread barrier at the end of bli_packm_int() to the
end of bli_l3_packm(), and added missing barriers to that function.
- Removed the no longer necessary (and now incorrect) ochief guard
in bli_gemm3m3_packa() on the bli_obj_scalar_reset() on C.
- Thanks to Tyler Smith for help with these changes.
Details:
- Altered control tree node struct definitions so that all nodes have the
same struct definition, whose primary fields consist of a blocksize id,
a variant function pointer, a pointer to an optional parameter struct,
and a pointer to a (single) sub-node. This unified control tree type is
now named cntl_t.
- Changed the way control tree nodes are connected, and what computation
they represent, such that, for example, packing operations are now
associated with nodes that are "inline" in the tree, rather than off-
shoot branches. The original tree for the classic Goto gemm algorithm was
expressed (roughly) as:
    blk_var2 -> blk_var3 -> blk_var1 -> ker_var2
                |           |
                -> packb    -> packa
and now, the same tree would look like:
    blk_var2 -> blk_var3 -> packb -> blk_var1 -> packa -> ker_var2
Specifically, the packb and packa nodes perform their respective packing
operations and then recurse (without any loop) to a subproblem. This means
there are now two kinds of level-3 control tree nodes: partitioning and
non-partitioning. The blocked variants are members of the former, because
they iteratively partition off submatrices and perform suboperations on
those partitions, while the packing variants belong to the latter group.
(This change has the effect of allowing greatly simplified initialization
of the nodes, which previously involved setting many unused node fields to
NULL.)
- Changed the way thrinfo_t tree nodes are arranged to mirror the new
connective structure of control trees. That is, packm nodes are no longer
off-shoot branches of the main algorithmic nodes, but rather connected
"inline".
- Simplified control tree creation functions. Partitioning nodes are created
concisely with just a few fields needing initialization. By contrast, the
packing nodes require additional parameters, which are stored in a
packm-specific struct that is tracked via the optional parameters pointer
within the control tree struct. (This parameter struct must always begin
with a uint64_t that contains the byte size of the struct. This allows
us to use a generic function to recursively copy control trees.) gemm,
herk, and trmm control tree creation continues to be consolidated into
a single function, with the operation family being used to select
among the parameter-agnostic macro-kernel wrappers. A single routine,
bli_cntl_free(), is provided to free control trees recursively, whereby
the chief thread within a group releases the blocks associated with
mem_t entries back to the memory broker from which they were acquired.
- Updated internal back-ends, e.g. bli_gemm_int(), to query and call the
function pointer stored in the current control tree node (rather than
index into a local function pointer array). Before being invoked, these
function pointers are first cast to a gemm_voft (for gemm, herk, or trmm
families) or trsm_voft (for trsm family) type, which is defined in
frame/3/bli_l3_var_oft.h.
- Retired herk and trmm internal back-ends, since all execution now flows
through gemm or trsm blocked variants.
- Merged forwards- and backwards-moving variants by querying the direction
from routines as a function of the variant's matrix operands. gemm and
herk always move forward, while trmm and trsm move in a direction that
is dependent on which operand (a or b) is triangular.
- Added functions bli_thread_get_range_mdim() and bli_thread_get_range_ndim(),
each of which takes additional arguments and hides complexity in managing
the difference between the way ranges are computed for the four families
of operations.
- Simplified level-3 blocked variants according to the above changes, so that
the only steps taken are:
1. Query partitioning direction (forwards or backwards).
2. Prune unreferenced regions, if they exist.
3. Determine the thread partitioning sub-ranges.
<begin loop>
4. Determine the partitioning blocksize (passing in the partitioning
direction)
5. Acquire the current iteration's partitions for the matrices affected
by the current variant's partitioning dimension (m, k, n).
6. Call the subproblem.
<end loop>
- Instantiate control trees once per thread, per operation invocation.
(This is a change from the previous regime in which control trees were
treated as stateless objects, initialized with the library, and shared
as read-only objects between threads.) This once-per-thread allocation
is done primarily to allow threads to use the control tree as a place
to cache certain data for use in subsequent loop iterations. Presently,
the only application of this caching is a mem_t entry for the packing
blocks checked out from the memory broker (allocator). If a non-NULL
control tree is passed in by the (expert) user, then the tree is copied
by each thread. This is done in bli_l3_thread_decorator(), in
bli_thrcomm_*.c.
- Added a new field to the context, an opid_t which tracks the "family"
of the operation being executed. For example, gemm, hemm, and symm are
all part of the gemm family, while herk, syrk, her2k, and syr2k are
all part of the herk family. Knowing the operation's family is necessary
when conditionally executing the internal (beta) scalar reset on
C in blocked variant 3, which is needed for gemm and herk families,
but must not be performed for the trmm family (because beta has only
been applied to the current row-panel of C after the first rank-kc
iteration).
- Reexpressed 3m3 induced method blocked variant in frame/3/gemm/ind
to conform with the new control tree design, and renamed the macro-
kernel codes corresponding to 3m2 and 4m1b.
- Renamed bli_mem.c (and its APIs) to bli_memsys.c, and renamed/relocated
bli_mem_macro_defs.h from frame/include to frame/base/bli_mem.h.
- Renamed/relocated bli_auxinfo_macro_defs.h from frame/include to
frame/base/bli_auxinfo.h.
- Fixed a minor bug whereby the storage-to-ukr-preference matching
optimization in the various level-3 front-ends was not being applied
properly when the context indicated that execution would be via an
induced method. (Before, we always checked the native micro-kernel
corresponding to the datatype being executed, whereas now we check
the native micro-kernel corresponding to the datatype's real projection,
since that is the micro-kernel that is actually used by induced methods.)
- Added an option to the testsuite to skip the testing of native level-3
complex implementations. Previously, these were always tested, provided that
the c/z datatypes were enabled. However, some configurations use
reference micro-kernels for complex datatypes, and testing these
implementations can slow down the testsuite considerably.
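
The unified node layout and the size-prefixed parameter struct described
in this commit can be sketched as follows. This is a simplified
illustration under stated assumptions, not the actual cntl_t definition:
node_t, pack_params_t, and all function and field names here are
hypothetical. It shows why leading the parameter struct with its own byte
size lets one generic routine copy any chain without knowing the concrete
parameter type.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical unified node: every node has the same shape. */
typedef struct node_s
{
    int            bszid;              /* blocksize id                  */
    void           ( *var_func )( void ); /* variant function pointer   */
    void*          params;             /* optional; starts with a size  */
    struct node_s* sub_node;           /* single sub-node               */
} node_t;

/* Hypothetical packing parameters. The first field must be the byte
   size of the struct so that generic code can copy it. */
typedef struct
{
    uint64_t size;
    int      pack_schema;
} pack_params_t;

/* Create a node; the node takes ownership of params and sub. */
static node_t* node_create( int bszid, void ( *var_func )( void ),
                            void* params, node_t* sub )
{
    node_t* n = malloc( sizeof( *n ) );
    if ( n == NULL ) return NULL;

    n->bszid    = bszid;
    n->var_func = var_func;
    n->params   = params;
    n->sub_node = sub;

    return n;
}

/* Free a chain. Tolerates NULL and chains that were only partly built. */
static void node_free( node_t* n )
{
    while ( n != NULL )
    {
        node_t* next = n->sub_node;
        free( n->params );
        free( n );
        n = next;
    }
}

/* Recursively copy a chain, duplicating each parameter struct using
   only the size stored in its leading uint64_t. */
static node_t* node_copy( const node_t* src )
{
    if ( src == NULL ) return NULL;

    node_t* dst = node_create( src->bszid, src->var_func, NULL, NULL );
    if ( dst == NULL ) return NULL;

    if ( src->params != NULL )
    {
        uint64_t size;
        memcpy( &size, src->params, sizeof( size ) );

        dst->params = malloc( ( size_t )size );
        if ( dst->params == NULL ) { node_free( dst ); return NULL; }
        memcpy( dst->params, src->params, ( size_t )size );
    }

    if ( src->sub_node != NULL )
    {
        dst->sub_node = node_copy( src->sub_node );
        if ( dst->sub_node == NULL ) { node_free( dst ); return NULL; }
    }

    return dst;
}
```

A per-thread copy of a chain such as blocked node -> packing node ->
kernel node is then a single node_copy() call, and each thread can free
its own copy independently with node_free().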
Details:
- Updated the top-level Makefile, build/config.mk.in template, and
configure script so that object files corresponding to source files
belonging to the BLAS compatibility layer are not compiled (or archived)
when the compatibility layer is disabled. (Same for CBLAS.) Thanks
to Devin Matthews for suggesting this optimization.
- Slight change to the way configure handles internal variables. Instead
of converting (overwriting) some, such as enable_blas2blis and
enable_cblas, from a "yes" or "no" to a "1" or "0" value, the latter are
now stored in new variables that live alongside the originals (with the
suffix "_01"). This is convenient since some values need to be
sed-substituted into the config.mk.in template, which requires "yes" or
"no", while some need to be written to the bli_config.h.in template,
which requires "0" or "1".
Details:
- Fixed a couple of bugs that affected OpenMP and POSIX threads
configurations that resulted in compiler errors and warnings due
to type mismatch, and in the case of pthreads, a missing function
argument. The bugs are fairly recent, introduced in a017062.
Details:
- Relaxed the base pointer and leading dimension alignment restrictions
in the sandybridge gemm microkernels, allowing the use of vmovups/vmovupd
instead of vmovaps/vmovapd. These changes mimic those made to the haswell
microkernels in e0d2fa0 and ee2c139.
- Updated testsuite modules as well as standalone test drivers in 'test'
directory to use DBL_MAX as the initial time candidate. Thanks to Devin
Matthews for suggesting this change.
- Inserted #include "float.h" into bli_system.h (to gain access to DBL_MAX).
- Minor update (vis-a-vis contexts) to driver code in test/3m4m.
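
The role of DBL_MAX as the initial time candidate can be illustrated with
a small best-of-N timing loop. This is a minimal sketch, not the testsuite
code: time_best_of() and sample_op() are hypothetical, and clock() is used
only to keep the example within the C standard library.

```c
#include <float.h>
#include <time.h>

/* Hypothetical operation to time. */
static void sample_op( void )
{
    volatile double acc = 0.0;

    for ( int i = 0; i < 1000; ++i ) acc += ( double )i;
}

/* Run op n_repeats times and return the smallest elapsed time in
   seconds. Starting from DBL_MAX guarantees that any finite measurement
   replaces the initial candidate. */
static double time_best_of( void ( *op )( void ), int n_repeats )
{
    double best = DBL_MAX;

    for ( int r = 0; r < n_repeats; ++r )
    {
        clock_t start = clock();
        op();
        double elapsed = ( double )( clock() - start ) / CLOCKS_PER_SEC;

        if ( elapsed < best ) best = elapsed;
    }

    return best;
}
```

If no repetitions are run, the function returns DBL_MAX, which makes an
unmeasured result easy to distinguish from a real timing.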