Details:
- Textually replaced nearly all non-comment instances of bool_t with the
C99 bool type. A few remaining instances, such as those in the files
bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and
bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being
used not for boolean purposes but to index into an array.
- This commit constitutes the third phase of a transition toward using
C99's bool instead of bool_t, which was raised in issue #420. The first
phase, which cleaned up various typecasts in preparation for using
bool as the basis for bool_t (instead of gint_t), was implemented by
commit a69a4d7. The second phase, which redefined the bool_t typedef
in terms of bool (from gint_t), was implemented by commit 2c554c2.
Details:
- Fixed various typecasts in
frame/base/bli_cntx.h
frame/base/bli_mbool.h
frame/base/bli_rntm.h
frame/include/bli_misc_macro_defs.h
frame/include/bli_obj_macro_defs.h
frame/include/bli_param_macro_defs.h
that were missing or being done improperly/incompletely. For example,
many return values were being typecast as
(bool_t)x && y
rather than
(bool_t)(x && y)
Thankfully, none of these deficiencies had manifested as actual bugs
at the time of this commit.
- Changed the return type of bli_env_get_var() from dim_t to gint_t.
This reflects the fact that bli_env_get_var() needs to be able to
return a signed integer, and even though dim_t is currently defined
as a signed integer, it does not intuitively appear to necessarily be
signed by inspection (i.e., an integer named "dim_t" for matrix
"dimension"). Also, updated use of bli_env_get_var() within
bli_pack.c to reflect the changed return type.
- Redefined type of thrcomm_t.barrier_sense field from bool_t to gint_t
and added comments to the bli_thrcomm_*.h files that will explain a
planned replacement of bool_t with C99's bool type.
- Note: These changes are being made to facilitate the substitution of
'bool' for 'bool_t', which will eliminate the namespace conflict with
arm_sve.h as reported in issue #420. This commit implements the first
phase of that transition. Thanks to RuQing Xu for reporting this
issue.
- CREDITS file update.
Details:
- Updated all static function definitions to use the cpp macro
BLIS_INLINE instead of the static keyword. This allows blis.h to
use a different keyword (inline) to define these functions when
compiling with C++, which might otherwise trigger "defined but
not used" warning messages. Thanks to Giorgos Margaritis for
reporting this issue and Devin Matthews for suggesting the fix.
- Updated the following files, which are used by configure's
hardware auto-detection facility, to unconditionally #define
BLIS_INLINE to the static keyword (since we know BLIS will be
compiled with C, not C++):
build/detect/config/config_detect.c
frame/base/bli_arch.c
frame/base/bli_cpuid.c
- CREDITS file update.
Details:
- NOTE: This is a merge commit of 'master' of git://github.com/amd/blis
into 'amd-master' of flame/blis.
- Fixed a bug in the downstream value of BLIS_NUM_ARCHS, which was
inadvertantly not incremented when the Zen2 subconfiguration was
added.
- In bli_gemm_front(), added a missing conditional constraint around the
call to bli_gemm_small() that ensures that the computation precision
of C matches the storage precision of C.
- In bli_syrk_front(), reorganized and relocated the notrans/trans logic
that existed around the call to bli_syrk_small() into bli_syrk_small()
to minimize the calling code footprint and also to bring that code
into stylistic harmony with similar code in bli_gemm_front() and
bli_trsm_front(). Also, replaced direct accessing of obj_t fields with
proper accessor static functions (e.g. 'a->dim[0]' becomes
'bli_obj_length( a )').
- Added #ifdef BLIS_ENABLE_SMALL_MATRIX guard around prototypes for
bli_gemm_small(), bli_syrk_small(), and bli_trsm_small(). This is
strictly speaking unnecessary, but it serves as a useful visual cue to
those who may be reading the files.
- Removed cpp macro-protected small matrix debugging code from
bli_trsm_front.c.
- Added a GCC_OT_9_1_0 variable to build/config.mk.in to facilitate gcc
version check for availability of -march=znver2, and added appropriate
support to configure script.
- Cleanups to compiler flags common to recent AMD microarchitectures in
config/zen/amd_config.mk, including: removal of -march=znver1 et al.
from CKVECFLAGS (since the -march flag is added within make_defs.mk);
setting CRVECFLAGS similarly to CKVECFLAGS.
- Cleanups to config/zen/bli_cntx_init_zen.c.
- Cleanups, added comments to config/zen/make_defs.mk.
- Cleanups to config/zen2/make_defs.mk, including making use of newly-
added GCC_OT_9_1_0 and existing GCC_OT_6_1_0 to choose the correct
set of compiler flags based on the version of gcc being used.
- Reverted downstream changes to test/test_gemm.c.
- Various whitespace/comment changes.
Details:
- Implemented a new sub-framework within BLIS to support the management
of code and kernels that specifically target matrix problems for which
at least one dimension is deemed to be small, which can result in long
and skinny matrix operands that are ill-suited for the conventional
level-3 implementations in BLIS. The new framework tackles the problem
in two ways. First the stripped-down algorithmic loops forgo the
packing that is famously performed in the classic code path. That is,
the computation is performed by a new family of kernels tailored
specifically for operating on the source matrices as-is (unpacked).
Second, these new kernels will typically (and in the case of haswell
and zen, do in fact) include separate assembly sub-kernels for
handling of edge cases, which helps smooth performance when performing
problems whose m and n dimension are not naturally multiples of the
register blocksizes. In a reference to the sub-framework's purpose of
supporting skinny/unpacked level-3 operations, the "sup" operation
suffix (e.g. gemmsup) is typically used to denote a separate namespace
for related code and kernels. NOTE: Since the sup framework does not
perform any packing, it targets row- and column-stored matrices A, B,
and C. For now, if any matrix has non-unit strides in both dimensions,
the problem is computed by the conventional implementation.
- Implemented the default sup handler as a front-end to two variants.
bli_gemmsup_ref_var2() provides a block-panel variant (in which the
2nd loop around the microkernel iterates over n and the 1st loop
iterates over m), while bli_gemmsup_ref_var1() provides a panel-block
variant (2nd loop over m and 1st loop over n). However, these variants
are not used by default and provided for reference only. Instead, the
default sup handler calls _var2m() and _var1n(), which are similar
to _var2() and _var1(), respectively, except that they defer to the
sup kernel itself to iterate over the m and n dimension, respectively.
In other words, these variants rely not on microkernels, but on
so-called "millikernels" that iterate along m and k, or n and k.
The benefit of using millikernels is a reduction of function call
and related (local integer typecast) overhead as well as the ability
for the kernel to know which micropanel (A or B) will change during
the next iteration of the 1st loop, which allows it to focus its
prefetching on that micropanel. (In _var2m()'s millikernel, the upanel
of A changes while the same upanel of B is reused. In _var1n()'s, the
upanel of B changes while the upanel of A is reused.)
- Added a new configure option, --[en|dis]able-sup-handling, which is
enabled by default. However, the default thresholds at which the
default sup handler is activated are set to zero for each of the m, n,
and k dimensions, which effectively disables the implementation. (The
default sup handler only accepts the problem if at least one dimension
is smaller than or equal to its corresponding threshold. If all
dimensions are larger than their thresholds, the problem is rejected
by the sup front-end and control is passed back to the conventional
implementation, which proceeds normally.)
- Added support to the cntx_t structure to track new fields related to
the sup framework, most notably:
- sup thresholds: the thresholds at which the sup handler is called.
- sup handlers: the address of the function to call to implement
the level-3 skinny/unpacked matrix implementation.
- sup blocksizes: the register and cache blocksizes used by the sup
implementation (which may be the same or different from those used
by the conventional packm-based approach).
- sup kernels: the kernels that the handler will use in implementing
the sup functionality.
- sup kernel prefs: the IO preference of the sup kernels, which may
differ from the preferences of the conventional gemm microkernels'
IO preferences.
- Added a bool_t to the rntm_t structure that indicates whether sup
handling should be enabled/disabled. This allows per-call control
of whether the sup implementation is used, which is useful for test
drivers that wish to switch between the conventional and sup codes
without having to link to different copies of BLIS. The corresponding
accessor functions for this new bool_t are defined in bli_rntm.h.
- Implemented several row-preferential gemmsup kernels in a new
directory, kernels/haswell/3/sup. These kernels include two general
implementation types--'rd' and 'rv'--for the 6x8 base shape, with
two specialized millikernels that embed the 1st loop within the kernel
itself.
- Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference
gemmsup microkernels. NOTE: These microkernels, unlike the current
crop of conventional (pack-based) microkernels, do not use constant
loop bounds. Additionally, their inner loop iterates over the k
dimension.
- Defined new typedef enums:
- stor3_t: captures the effective storage combination of the level-3
problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A
special value of BLIS_XXX is used to denote an arbitrary combination
which, in practice, means that at least one of the operands is
stored according to general stride.
- threshid_t: captures each of the three dimension thresholds.
- Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create()
can be passed "-1, -1" as a lazy request for row storage. (Note that
"0, 0" is still accepted as a lazy request for column storage.)
- Added support for various instructions to bli_x86_asm_macros.h,
including imul, vhaddps/pd, and other instructions related to integer
vectors.
- Disabled the older small matrix handling code inserted by AMD in
bli_gemm_front.c, since the sup framework introduced in this commit
is intended to provide a more generalized solution.
- Added test/sup directory, which contains standalone performance test
drivers, a Makefile, a runme.sh script, and an 'octave' directory
containing scripts compatible with GNU Octave. (They also may work
with matlab, but if not, they are probably close to working.)
- Reinterpret the storage combination string (sc_str) in the various
level-3 testsuite modules (e.g. src/test_gemm.c) so that the order
of each matrix storage char is "cab" rather than "abc".
- Comment updates in level-3 BLAS API wrappers in frame/compat.
Change void*-typed function pointers to void_fp.
- Updated all instances of void* variables that store function pointers
to variables of a new type, void_fp. Originally, I wanted to define
the type of void_fp as "void (*void_fp)( void )"--that is, a pointer
to a function with no return value and no arguments. However, once
I did this, I realized that gcc complains with incompatible pointer
type (-Wincompatible-pointer-types) warnings every time any such a
pointer is being assigned to its final, type-accurate function
pointer type. That is, gcc will silently typecast a void* to
another defined function pointer type (e.g. dscalv_ker_ft) during
an assignment from the former to the latter, but the same statement
will trigger a warning when typecasting from a void_fp type. I suspect
an explicit typecast is needed in order to avoid the warning, which
I'm not willing to insert at this time.
- Added a typedef to bli_type_defs.h defining void_fp as void*, along
with a commented-out version of the aborted definition described
above. (Note that POSIX requires that void* and function pointers
be interchangeable; it is the C standard that does not provide this
guarantee.)
- Comment updates to various _oapi.c files.
Details:
- Updated the BLAS compatibility layer for level-3 operations so that
the corresponding BLIS object API is called directly rather than first
calling the typed BLIS API. The previous code based on the typed BLIS
API calls is still available in a deactivated cpp macro branch, which
may be re-activated by #defining BLIS_BLAS3_CALLS_TAPI. (This does not
yet correspond to a configure option. If it seems like people might
want to toggle this behavior more regularly, a configure option can be
added in the future.)
- Updated the BLIS typed API to statically "pre-initialize" objects via
new initializor macros. Initialization is then finished via calls to
static functions bli_obj_init_finish_1x1() and bli_obj_init_finish(),
which are similar to the previously-called functions,
bli_obj_create_1x1_with_attached_buffer() and
bli_obj_create_with_attached_buffer(), respectively. (The BLAS
compatibility layer updates mentioned above employ this new technique
as well.)
- Transformed certain routines in bli_param_map.c--specifically, the
ones that convert netlib-style parameters to BLIS equivalents--into
static functions, now in bli_param_map.h. (The remaining three classes
of conversation routines were left unchanged.)
- Added the aforementioned pre-initializor macros to bli_type_defs.h.
- Relocated bli_obj_init_const() and bli_obj_init_constdata() from
bli_obj_macro_defs.h to bli_type_defs.h.
- Added a few macros to bli_param_macro_defs.h for testing domains for
real/complexness and precisions for single/double-ness.
Details:
- Removed explicit reference to The University of Texas at Austin in the
third clause of the license comment blocks of all relevant files and
replaced it with a more all-encompassing "copyright holder(s)".
- Removed duplicate words ("derived") from a few kernels' license
comment blocks.
- Homogenized license comment block in kernels/zen/3/bli_gemm_small.c
with format of all other comment blocks.
Details:
- Consolidated the *sl.c and *rr.c level-3 macrokernels into a single
file per sl/rr pair, with those files named as they were before
c92762e. The consolidation does not take away the *option* of using
slab or round-robin assignment of micropanels to threads; it merely
*hides* the choice within the definitions of functions such as
bli_thread_range_jrir(), bli_packm_my_iter(), and bli_is_last_iter()
rather than expose that choice explicitly in the code. The choice of
slab or rr is not always hidden, however; there are some cases
involving herk and trmm, for example, that require some part of the
computation to use rr unconditionally. (The --thread-part-jrir option
controls the partitioning in all other cases.)
- Note: Originally, the sl and rr macrokernels were separated out for
clarity. However, aside from the additional binary code bloat, I later
deemed that clarity not worth the price of maintaining the additional
(mostly similar) codes.
Details:
- Updated existing macrokernel function names and definitions to
explicitly use slab assignment of micropanels to threads, then created
duplicate versions of macrokernels that explicitly use round-robin
assignment instead of slab. NOTE: As in ac18949, trsm_r macrokernels
were not substantially updated in this commit because they are
currently disabled in bli_trsm_front.c.
- Updated existing packing function (in blk_packm_blk_var1.c) to
explicitly use slab partitioning, and then duplicated for round-robin.
- Updated control tree initialization to use the appropriate macrokernel
and packm function pointers depending on which method (slab or rr) was
enabled at configure-time.
- Updated configure script to accept new --thread-part-jrir=[slab|rr]
option (-m [slab|rr] for short), which allows the user to explicitly
request either slab or round-robin assignment (partitioning) of
micropanels to threads.
- Updated sandbox/ref99 according to above changes.
- Minor updates to build/add-copyright.py.
Details:
- Adjusted the method by which micropanels are assigned to threads in
the 2nd (jr) and 1st (ir) loops around the microkernel to (mostly)
employ contiguous "slab" partitioning rather than interleaved (round
robin) partitioning. The new partitioning schemes and related details
for specific families of operations are listed below:
- gemm: slab partitioning.
- herk: slab partitioning for region corresponding to non-triangular
region of C; round robin partitioning for triangular region.
- trmm: slab partitioning for region corresponding to non-triangular
region of B; round robin partitioning for triangular region.
(NOTE: This affects both left- and right-side macrokernels:
trmm_ll, trmm_lu, trmm_rl, trmm_ru.)
- trsm: slab partitioning.
(NOTE: This only affects only left-side macrokernels trsm_ll,
trsm_lu; right-side macrokernels were not touched.)
Also note that the previous macrokernels were preserved inside of
the 'other' directory of each operation family directory (e.g.
frame/3/gemm/other, frame/3/herk/other, etc).
- Updated gemm macrokernel in sandbox/ref99 in light of above changes
and fixed a stale function pointer type in blx_gemm_int.c
(gemm_voft -> gemm_var_oft).
- Added standalone test drivers in test/3m4m for herk, trmm, and trsm
and minor changes to test/3m4m/Makefile.
- Updated the arguments and definitions of bli_*_get_next_[ab]_upanel()
and bli_trmm_?_?r_my_iter() macros defined in bli_l3_thrinfo.h.
- Renamed bli_thread_get_range*() APIs to bli_thread_range*().
Details:
- Removed four trailing spaces after "BLIS" that occurs in most files'
commented-out license headers.
- Added UT copyright lines to some files. (These files previously had
only AMD copyright lines but were contributed to by both UT and AMD.)
- In some files' copyright lines, expanded 'The University of Texas' to
'The University of Texas at Austin'.
- Fixed various typos/misspellings in some license headers.
Details:
- Added explicit typecasting to various functions (mostly static
functions), primarily those in bli_param_macro_defs.h,
bli_obj_macro_defs.h, bli_cntx.h, bli_cntl.h, and a few other header
files.
- This change was prompted by feedback from Jacob Gorm Hansen, who
reported that #including "blis.h" from his application caused a
gcc to output error messages (relating to types being returned
mismatching the declared return types) when used via the C++ compiler
front-end. This is the first pass of fixes, and we may need to
iterate with additional follow-up commits (#233).
Details:
- Fixed incorrect bit shifts in the following static functions:
bli_obj_set_target_domain()
bli_obj_set_target_prec()
bli_obj_set_exec_domain()
bli_obj_set_exec_prec()
- Fixed incorrect bitmask in bli_dt_proj_to_single_prec().
- Updated bli_obj_real_part() and bli_obj_imag_part() so that it updates
the target and exec datatypes (in addition to the storage datatypes).
Details:
- Added definitions of static functions bli_dt_domain()/bli_dt_prec(),
which extract a dom_t domain or prec_t precision value, respectively,
from a num_t datatype.
- Changed the return types of bli_obj_domain() and bli_obj_prec() from
objbits_t to dom_t and prec_t. (Not sure why they were ever set to
return objbits_t.)
Details:
- Added static functions for projecting a datatype to single precision
or double precision, both for obj_t's storage datatypes and standalone
datatypes.
Details:
- Implemented castm and castv operations, which behave like copym and
copyv except where the obj_t operands can be of different datatypes.
These new operations, however, unlike copym/copyv, do not build upon
existing level-1v kernels.
- Reorganized projm, projv into a 'proj' subdirectory of frame/base (to
match the newly added frame/base/cast directory).
- Added new macros to bli_gentfunc_macro_defs.h, _gentprot_macro_defs.h
that insert GENTFUNC2/GENTPROT2 macros for all non-homogeneous datatype
combinations. Previously, one had to invoke two additional macros--one
which mixed domains only and another that included all remaining
cases--in order to get full type combination coverage.
- Defined a new static function, bli_set_dims_incs_2m(), to aid in the
setting of various variables in the implementations of bli_??castm().
This static function joins others like it in bli_param_macro_defs.h.
- Comment update to bli_copysc.h.
Details:
- Defined a new operation in frame/base/bli_proj.c, bli_projm(), which
behaves like bli_copym(), except that operands a and b are allowed to
contain data of differing domains (e.g. a is real while b is complex,
or vice versa). The file is named bli_proj.c, rather than bli_projm.c,
with the intention that a 'v' vector version of the function may be
added to the same file (at some point in the future).
- Added supporting bli_check_*() functions in bli_check.c to confirm
consistent precisions between to datatypes/objects, as well as the
appropriate error message in bli_error.c and a new error code in
bli_type_defs.h.
- Wrote a bli_projm_check() function to go along with bli_projm().
- Defined static function bli_obj_real_part() in bli_obj_macro_defs.h,
which will initialize an obj_t alias to the real part of the source
object.
- Fixed a bug in the static function bli_dt_proj_to_complex(), found
in bli_param_macro_defs.h. Thankfully, there were no calls to the
function to produce buggy behavior.
Details:
- Changed the void* arguments of the following static functions:
bli_is_aligned_to()
bli_is_unaligned_to()
bli_offset_past_alignment()
to siz_t, and the return type of bli_offset_past_alignment() from
guint_t to siz_t. This allows for more versatile usage of these
functions (e.g. when aligning both pointers and leading dimension).
- Updated all invocations of these functions, mostly in kernels/penryn
but also in kernels/bgq, to include explicit typecasts to siz_t when
pointer arguments are passed in.
- Thanks to Devin Matthews for pointing out this potential bug (via issue
#211).
- Deleted a few trailing spaces in various penryn kernels.
- Removed duplicate instances of the words "derived" and "THEORY" from
various kernel license headers, likely from a malformed recursive sed
performed long ago.
Details:
- Converted most C preprocessor macros in bli_param_macro_defs.h and
bli_obj_macro_defs.h to static functions.
- Reshuffled some functions/macros to bli_misc_macro_defs.h and also
between bli_param_macro_defs.h and bli_obj_macro_defs.h.
- Changed obj_t-initializing macros in bli_type_defs.h to static
functions.
- Removed some old references to BLIS_TWO and BLIS_MINUS_TWO from
bli_constants.h.
- Whitespace changes in select files (four spaces to single tab).
Details:
- Implemented the 1m method for inducing complex domain matrix
multiplication. 1m support has been added to all level-3 operations,
including trsm, and is now the default induced method when native
complex domain gemm microkernels are omitted from the configuration.
- Updated _cntx_init() operations to take a datatype parameter. This was
needed for the corresponding function for 1m (because 1m requires us
to choose between column-oriented or row-oriented execution, which
requires us to query the context for the storage preference of the
gemm microkernel, which requires knowing the datatype) but I decided
that it made sense for consistency to add the parameter to all other
cntx initialization functions as well, even though those functions
don't use the parameter.
- Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take
a second scalar for each blocksize entry. The semantic meaning of the
two scalars now is that the first will scale the default blocksize
while the second will scale the maximum blocksize. This allows scaling
the two independently, and was needed to support 1m, which requires
scaling for a register blocksize but not the register storage
blocksize (ie: "packdim") analogue.
- Deprecated bli_blksz_reduce_dt_to() and defined two new functions,
bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing
default and maximum blocksizes to some desired blocksize multiple.
These functions are needed in the updated definitions of
bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs().
- Added support for the 1e and 1r packing schemas to packm, including
1e/1r packing kernels.
- Added a minor optimization to bli_gemm_ker_var2() that allows, under
certain circumstances (specifically, real domain beta and row- or
column-stored matrix C), the real domain macrokernel and microkernel
to be called directly, rather than using the virtual microkernel
via the complex domain macrokernel, which carries a slight additional
amount of overhead.
- Added 1m support to the testsuite.
- Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified
some code in test_gemm.c driver.
Details:
- Moved amaxv from being a utility operation to being a level-1v operation.
This includes the establishment of a new amaxv kernel to live beside all
of the other level-1v kernels.
- Added two new functions to bli_part.c:
bli_acquire_mij()
bli_acquire_vi()
The first acquires a scalar object for the (i,j) element of a matrix,
and the second acquires a scalar object for the ith element of a vector.
- Added integer support to bli_getsc level-0 operation. This involved
adding integer support to the bli_*gets level-0 scalar macros.
- Added a new test module to test amaxv as a level-1v operation. The test
module works by comparing the value identified by bli_amaxv() to the
the value found from a reference-like code local to the test module
source file. In other words, it (intentionally) does not guarantee the
same index is found; only the same value. This allows for different
implementations in the case where a vector contains two or more elements
containing exactly the same floating point value (or values, in the case
of the complex domain).
- Removed the directory frame/include/old/.
Details:
- Replaced all instances of bli_malloc() and bli_free() with one of:
- bli_malloc_pool()/bli_free_pool()
- bli_malloc_user()/bli_free_user()
- bli_malloc_intl()/bli_free_intl()
each of which can be configured to call malloc()/free() substitutes,
so long as the substitute functions have the same function type
signatures as malloc() and free() defined by C's stdlib.h. The _pool()
function is called when allocating blocks for the memory pools (used
for packing buffers, primarily), the _user() function is called when
obj_t's are created (via bli_obj_create() and friends), and the _intl()
function is called for internal use by BLIS, such as when creating
control tree nodes or temporary buffers for manipulating internal data
structures. Substitutes for any of the three types of bli_malloc() may
be specified by #defining the following pairs of cpp macros in
bli_kernel.h:
- BLIS_MALLOC_POOL/BLIS_FREE_POOL
- BLIS_MALLOC_USER/BLIS_FREE_USER
- BLIS_MALLOC_INTL/BLIS_FREE_INTL
to be the name of the substitute functions. (Obviously, the object
code that contains these functions must be provided at link-time.)
These macros default to malloc() and free(). Subsitute functions are
also automatically prototyped by BLIS (in bli_malloc_prototypes.h).
- Removed definitions for bli_malloc() and bli_free().
- Note that bli_malloc_pool() and bli_malloc_user() are now defined in
terms of a new function, bli_malloc_align(), which aligns memory to an
arbitrary (power of two) alignment boundary, but does so manually,
whereas before alignment was performed behind the scenes by
posix_memalign(). Currently, bli_malloc_intl() is defined in terms
of bli_malloc_noalign(), which serves as a simple wrapper to the
designated function that is passed in (e.g. BLIS_MALLOC_INTL).
Similarly, there are bli_free_align() and bli_free_noalign(), which
are used in concert with their bli_malloc_*() counterparts.
Details:
- Retrofitted a new data structure, known as a context, into virtually
all internal APIs for computational operations in BLIS. The structure
is now present within the type-aware APIs, as well as many supporting
utility functions that require information stored in the context. User-
level object APIs were unaffected and continue to be "context-free,"
however, these APIs were duplicated/mirrored so that "context-aware"
APIs now also exist, differentiated with an "_ex" suffix (for "expert").
These new context-aware object APIs (along with the lower-level, type-
aware, BLAS-like APIs) contain the the address of a context as a last
parameter, after all other operands. Contexts, or specifically, cntx_t
object pointers, are passed all the way down the function stack into
the kernels and allow the code at any level to query information about
the runtime, such as kernel addresses and blocksizes, in a thread-
friendly manner--that is, one that allows thread-safety, even if the
original source of the information stored in the context changes at
run-time; see next bullet for more on this "original source" of info).
(Special thanks go to Lee Killough for suggesting the use of this kind
of data structure in discussions that transpired during the early
planning stages of BLIS, and also for suggesting such a perfectly
appropriate name.)
- Added a new API, in frame/base/bli_gks.c, to define a "global kernel
structure" (gks). This data structure and API will allow the caller to
initialize a context with the kernel addresses, blocksizes, and other
information associated with the currently active kernel configuration.
The currently active kernel configuration within the gks cannot be
changed (for now), and is initialized with the traditional cpp macros
that define kernel function names, blocksizes, and the like. However,
in the future, the gks API will be expanded to allow runtime management
of kernels and runtime parameters. The most obvious application of this
new infrastructure is the runtime detection of hardware (and the
implied selection of appropriate kernels). With contexts in place,
kernels may even be "hot swapped" at runtime within the gks. Once
execution enters a level-3 _front() function, the memory allocator will
be reinitialized on-the-fly, if necessary, to accommodate the new
kernels' blocksizes. If another application thread is executing with
another (previously loaded) kernel, it will finish in a deterministic
fashion because its kernel information was loaded into its context
before computation began, and also because the blocks it checked out
from the internal memory pools will be unaffected by the newer threads'
reinitialization of the allocator.
- Reorganized and streamlined the 'ind' directory, which contains much of
the code enabling use of induced methods for complex domain matrix
multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as
those APIs' functionality is now mostly subsumed within the global
kernel structure.
- Updated bli_pool.c to define a new function, bli_pool_reinit_if(),
that will reinitialize a memory pool if the necessary pool block size
has increased.
- Updated bli_mem.c to use bli_pool_reinit_if() instead of
bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed
usage of contexts where appropriate to communicate cache and register
blocksizes to bli_mem_compute_pool_block_sizes().
- Simplified control trees now that much of the information resides in
the context and/or the global kernel structure:
- Removed blocksize object pointers (blksz_t*) fields from all control
tree node definitions and replaced them with blocksize id (bszid_t)
values instead, which may be passed into a context query routine in
order to extract the corresponding blocksize from the given context.
- Removed micro-kernel function pointers (func_t*) fields from all
control tree node definitions. Now, any code that needs these function
pointers can query them from the local context, as identified by a
level-3 micro-kernel id (l3ukr_t), level-1f kernel id, (l1fkr_t), or
level-1v kernel id (l1vkr_t).
- Removed blksz_t object creation and initialization, as well as kernel
function object creation and initialization, from all operation-
specific control tree initialization files (bli_*_cntl.c), since this
information will now live in the gks and, secondarily, in the context.
- Removed blocksize multiples from blksz_t objects. Now, we track
blocksize multiples for each blocksize id (bszid_t) in the context
object.
- Removed the bool_t's that were required when a func_t was initialized.
These bools are meant to allow one to track the micro-kernel's storage
preferences (by rows or columns). This preference is now tracked
separately within the gks and contexts.
- Merged and reorganized many separate-but-related functions into single
files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and
util directories, but has the most obvious effect of allowing BLIS
to compile noticeably faster.
- Reorganized execution paths for level-1v, -1d, -1m, and -2 operations
in an attempt to reduce overhead for memory-bound operations. This
includes removal of default use of object-based variants for level-2
operations. Now, by default, level-2 operations will directly call a
low-level (non-object based) loop over a level-1v or -1f kernel.
- Converted many common query functions in blk_blksz.c (renamed from
bli_blocksize.c) and bli_func.c into cpp macros, now defined in their
respective header files.
- Defined bli_mbool.c API to create and query "multi-bools", or
heterogeneous bool_t's (one for each floating-point datatype), in the
same spirit as blksz_t and func_t.
- Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS
and BLIS_SIMD_SIZE. These values are needed in order to compute a third
new parameter, which may be set indirectly via the aforementioned
macros or directly: BLIS_STACK_BUF_MAX_SIZE. This value is used to
statically allocate memory in macro-kernels and the induced methods'
virtual kernels to be used as temporary space to hold a single
micro-tile. These values are now output by the testsuite. The default
value of BLIS_STACK_BUF_MAX_SIZE is computed as
"2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE".
- Cleaned up top-level 'kernels' directory (for example, renaming the
embarrassingly misleading "avx" and "avx2" directories to "sandybridge"
and "haswell," respectively, and gave more consistent and meaningful
names to many kernel files (as well as updating their interfaces to
conform to the new context-aware kernel APIs).
- Updated the testsuite to query blocksizes from a locally-initialized
context for test modules that need those values: axpyf, dotxf,
dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr.
- Reformatted many function signatures into a standard format that will
more easily facilitate future API-wide changes.
- Updated many "mxn" level-0 macros (ie: those used to inline double loops
for level-1m-like operations on small matrices) in frame/include/level0
to use more obscure local variable names in an effort to avoid variable
shaddowing. (Thanks to Devin Matthews for pointing these gcc warnings,
which are only output using -Wshadow.)
- Added a conj argument to setm, so that its interface now mirrors that
of scalm. The semantic meaning of the conj argument is to optionally
allow implicit conjugation of the scalar prior to being populated into
the object.
- Deprecated all type-aware mixed domain and mixed precision APIs. Note
that this does not preclude supporting mixed types via the object APIs,
where it produces absolutely zero API code bloat.
Details:
- Consolidated the two blocked variants for packm into a single
implementation (packm_blk_var1) and removed the other variant.
- Updated all induced method _cntl_init() functions in frame/cntl/ind/
to use the new blocked variant 1.
- Defined two new macros, bli_is_ind_packed() and bli_is_nat_packed(),
to detect pack_t schemas for induced methods and native execution,
respectively.
Details:
- Applied a patch submitted by Devin Matthews that:
- implements subtle changes to handling of somewhat unusual cases of
row and column strides to accommodate certail tensor cases, which
includes adding dimension parameters to _is_col_tilted() and
_is_row_tilted() macros,
- simplifies how buffers are sized when requested BLIS-allocated
objects,
- re-consolidates bli_adjust_strides_*() into one function, and
- defines 'restrict' keyword as a "nothing" macro for C++ and pre-C99
environments.
Details:
- Fixed a family of bugs in the triangular level-3 operations for
certain complex implementations (3m1 and 4m1a) that only manifest if
one of the register blocksizes (PACKMR/PACKNR, actually) is odd:
- Fixed incorrect imaginary stride computation in bli_packm_blk_var2()
for the triangular case.
- Fixed the incorrect computation of imaginary stride, as stored in
the auxinfo_t struct in trmm and trsm macro-kernels.
- Fixed incorrect pointer arithmetic in the trsm macro-kernels in the
cases where the the register blocksize for the triangular matrix is
odd. Introduced a new byte-granular pointer arithmetic macro,
bli_ptr_add(), that computes the correct value.
- Added cpp macro to bli_macro_defs.h for typeof() operator, defined in
terms of __typeof__, which is used by bli_ptr_add() macro.
- Disabled the row- vs. column-storage optimization in bli_trmm_front()
for singleton problems because the inherent ambiguity of whether a
scalar is row-stored or column-stored causes the wrong parameter
combination code to be executed (by dumb luck of our checking for
row storage first).
- Added commented-out debugging lines to 3m1/4m1a and reference
micro-kernels, and trsm_ll macro-kernel.
Details:
- Expanded/updated interface for bli_get_range_weighted() and
bli_get_range() so that the direction of movement is specified in the
function name (e.g. bli_get_range_l2r(), bli_get_range_weighted_t2b())
and also so that the object being partitioned is passed instead of an
uplo parameter. Updated invocations in level-3 blocked variants, as
appropriate.
- (Re)implemented bli_get_range_*() and bli_get_range_weighted_*() to
carefully take into account the location of the diagonal when computing
ranges so that the area of each subpartition (which, in all present
level-3 operations, is proportional to the amount of computation
engendered) is as equal as possible.
- Added calls to a new class of routines to all non-gemm level-3 blocked
variants:
bli_<oper>_prune_unref_mparts_[mnk]()
where <oper> is herk, trmm, or trsm and [mnk] is chosen based on which
dimension is being partitioned. These routines call a more basic
routine, bli_prune_unref_mparts(), to prune unreferenced/unstored
regions from matrices and simultaneously adjust other matrices which
share the same dimension accordingly.
- Simplified herk_blk_var2f, trmm_blk_var1f/b as a result of more the
new pruning routines.
- Fixed incorrect blocking factors passed into bli_get_range_*() in
bli_trsm_blk_var[12][fb].c
- Added a new test driver in test/thread_ranges that can exercise the new
bli_get_range_*() and bli_get_range_weighted_*() under a range of
conditions.
- Reimplemented m and n fields of obj_t as elements in a "dim"
array field so that dimensions could be queried via index constant
(e.g. BLIS_M, BLIS_N). Adjusted/added query and modification
macros accordingly.
- Defined mdim_t type to enumerate BLIS_M and BLIS_N indexing values.
- Added bli_round() macro, which calls C math library function round(),
and bli_round_to_mult(), which rounds a value to the nearest multiple
of some other value.
- Added miscellaneous pruning- and mdim_t-related macros.
- Renamed bli_obj_row_offset(), bli_obj_col_offset() macros to
bli_obj_row_off(), bli_obj_col_off().
Details:
- Fixed a typecasting ambiguity in bli_pool_alloc_block() in which
pointer arithmetic was performed on a void* as if it were a byte
pointer (such as char*). Some compilers may have already been
interpreting this situation as intended, despite the sloppiness.
Thanks to Aleksei Rechinskii for reporting this issue.
- Redefined pointer alignment macros to typecast to uintptr_t instead of
siz_t.
Details:
- Defined a new "3ms" (separated 3m) pack schema and added appropriate
support in packm_init(), packm_blk_var2().
- Generalized packm_struc_cxk_3mi to take the imaginary stride (is_p)
as an argument instead of computing it locally. Exception: for trmm,
is_p must be computed locally, since it changes for triangular
packed matrices. Also exposed is_p in interface to dt-specific
packm_blk_var2 (and _var1, even though it does not use imaginary
stride).
- Renamed many functions/variables from _3mi to _3mis to indicate that
they work for either interleaved or separated 3m pack schemas.
- Generalized gemm and herk macro-kernels to pass in imaginary stride
rather than compute them locally.
- Added support for 3m2 and 3m3 algorithms to frame/ind, including 3m2-
and 3m3-specific virtual micro-kernels.
- Added special gemm macro-kernels to support 3m2 and 3m3.
- Added support for 3m2 and 3m3 to testsuite.
- Corrected the type of the panel dimension (pd_) in various macro-
kernels from inc_t to dim_t.
- Renamed many functions defined in bli_blocksize.c.
- Moved most induced-related macro defs from frame/include to
frame/ind/include.
- Updated the _ukernel.c files so that the micro-kernel function pointers
are obtained from the func_t objects rather than the cpp macros that
define the function names.
- Updated test/3m4m driver, Makefile, and run script.
Details:
- Renamed several variables and macros from 3m/4m to 3mi/4mi. This is
because those packing schemas were always implicitly "interleaved".
This new naming scheme will make way for new schemas that separate
instead of interleve the real and imaginary (and summed) parts.
- Expanded the pack format sub-field of the pack schema field of the
info_t to 4 bits (from 3). This will allow for more schema types
going forward.
- Removed old _cntl.c files for herk3m, herk4m, trmm3m, trmm4m.
Details:
- Redefined bli_is_last_iter() to take thread_id and num_thread
arguments, which allows the macro to correctly compute whether a
given iteration is the last that the thread will compute in that
particular loop. The new definition, however, remains disabled
(commented out) until someone can look at this more closely, as
the new definition seems to actually hurt performance slightly.
- Whitespace and related updates to level-3 macro-kernels.
- Updated test suite so that performance results in the hundreds of
gigaflops does not disrupt the column alignment of the output.
Details:
- Added "4mh" and "3mh" APIs, which implement the 4m and 3m methods at
high levels, respectively. APIs for trmm and trsm were NOT added due
to the fact that these approaches are inherently incompatible with
implementing 4m or 3m at high levels (because the input right-hand
side matrix is overwritten).
- Added 4mh, 3mh virtual micro-kernels, and updated the existing 4m and
3m so that all are stylistically consistent.
- Added new "rih" packing kernels (both low-level and structure-aware)
to support both 4mh and 3mh.
- Defined new pack_t schemas to support real-only, imaginary-only, and
real+imaginary packing formats.
- Added various level0 scalar macros to support the rih packm kernels.
- Minor tweaks to trmm macro-kernels to facilitate 4mh and 3mh.
- Added the ability to enable/disable 4mh, 3m, and 3mh, and adjusted
level-3 front-ends to check enabledness of 3mh, 3m, 4mh, and 4m (in
that order) and execute the first one that is enabled, or the native
implementation if none are enabled.
- Added implementation query functions for each level-3 operation so
that the user can query a string that describes the implementation
that is currently enabled.
- Updated test suite to output implementation types for reach level-3
operation, as well as micro-kernel types for each of the five micro-
kernels.
- Renamed BLIS_ENABLE_?COMPLEX_VIA_4M macros to _ENABLE_VIRTUAL_?COMPLEX.
- Fixed an obscure bug when packing Hermitian matrices (regular packing
type) whereby the diagonal elements of the packed micro-panels could
get tainted if the source matrix's imaginary diagonal part contained
garbage.
Details:
- Combined the 4m/3m bits into an expanded bitfield, which will encode
the packing "format" of the micro-panels. This will allow for more
easily and compactly encoding additional formats.
- Other minor comment/whitespace updates to bli_type_defs.h.
- Updated bli_obj_macro_defs.h and bli_param_macro_defs.h to use the new
format bitfield.
- Comment update to bli_kernel_post_macro_defs.h.
- Whitespace changes to bli_kernel_3m_macro_defs.h, _4m_macro_defs.h.
Details:
- Instead of inferring the storage format of the micro-panels from within
the packm variants, we now pass in a bool_t value that denotes whether
the packed matrix contains row-stored column panels or column-stored
row panels. This value can then be tested more easily inside the main
packm variant loop.
- Renumbered pack_t schema values in bli_type_defs.h so that there are
now five bits, each with different meaning:
- 4: packed or not packed?
- 3: packed for 3m?
- 2: packed for 4m?
- 1: packed to panels?
- 0: stored by rows or columns?
- Added new macros that test for status of above bits in schema bit
subfield, and renamed some existing macros related to 4m/3m.
Details:
- Fixed a breakdown in BLIS's ability to differentiate between row-stored
and column-stored micro-panels when MR or NR is unit. When either
register blocksize (or both) is equal to one, inspecting the strides of
the affected packed micro-panel is no longer sufficient to determine
whether the micro-panel is a row-stored column panel or a column-stored
row panel (because both strides are unit). At that point, dimension
information is necessary when invoking the bli_is_row_stored_f() and
bli_is_col_stored_f() macros (and their "obj" counterparts). Thanks to
Ilya Polkovnichenko for reporting this bug.
- Added panel dimensions (m and n) to obj_t, which are set in
packm_init() and then passed into the blocked variants to support the
aforementioned update.
Details:
- Updated copyright headers to include "at Austin" in the name of the
University of Texas.
- Updated the copyright years of a few headers to 2014 (from 2011 and
2012).
Details:
- Added initialization statements to various macros used in level 1m and
1m-like operations. I wasn't able to reproduce the reported behavior,
so hopefully this takes care of it. Thanks to Jeff Hammond for the
report.
Details:
- Completely reoganized norm operations:
- Renames:
- fnormsc, fnormv, fnormm -> normfsc, normfv, normfm (2-norm)
- absumv -> norm1v (vector 1-norm)
- New operations:
- norm1m (matrix 1-norm)
- normiv, normim (infinity-norm)
- amaxv (BLAS-like absolute maximum value index)
- asumv (BLAS-like absolute sum)
- Deprecated absumm, as it did not correspond to any actual norm.
(However, an inlined version now exists in the testsuite module for
randm.)
Details:
- Fixed various bugs in packm_*_cxk(), the 4m/3m micro-kernels, and
elsewhere in the framework that were not yet set up to work properly
when BLIS_ENABLE_C99_COMPLEX is defined in bli_config.h
- Extensive changes to f2c-derived files in frame/compat/f2c to allow
C99 complex storage. Most of these changes center around accessing
real and imaginary components via bli_?real()/bli_?imag() accessor
macros, and setting of values via bli_?sets() assignment macros.
(Thanks to Vladimir Sukarev for pointing out that _ENABLE_C99_COMPLEX
was broken.)
Details:
- Consolidated the functionality previously supported by packm_blk_var2()
and packm_blk_var3() into a new variant, packm_blk_var1().
- Updates to packm_gen_cxk(), packm_herm_cxk.c(), and packm_tri_cxk()
to accommodate above changes.
- Removed packm_blk_var3() and retired packm_blk_var2() to
frame/1m/packm/old.
- Updated all level-3 _cntl_init() functions so that the new, more
versatile packm_blk_var1 is used for all level-3 matrix packing.
Details:
- Replaced conditional expressions in macro-kernels related to computing
the addresses a2 and b2 (a_next and b_next) with a preprocessor macro
invocation, bli_is_last_iter(), that tests the same condition.
- Updated gemm_ukr module to use auxinfo_t argument.
- Whitespace changes in test suite ukr modules.