Details:
- Standard names for reference kernels (levels-1v, -1f and 3) are now
macro constants. Examples:
BLIS_SAXPYV_KERNEL_REF
BLIS_DDOTXF_KERNEL_REF
BLIS_ZGEMM_UKERNEL_REF
- Developers no longer have to name all datatype instances of a kernel
with a common base name; [sdcz] datatype flavors of each kernel or
micro-kernel (level-1v, -1f, or 3) may now be named independently.
This means you can now, if you wish, encode the datatype-specific
register blocksizes in the name of the micro-kernel functions.
- Any datatype instances of any kernel (1v, 1f, or 3) that is left
undefined in bli_kernel.h will default to the corresponding reference
implementation. For example, if BLIS_DGEMM_UKERNEL is left undefined,
it will be defined to be BLIS_DGEMM_UKERNEL_REF.
- Developers no longer need to name level-1v/-1f kernels with multiple
datatype chars to match the number of types the kernel WOULD take in
a mixed type environment, as in bli_dddaxpyv_opt(). Now, one char is
sufficient, as in bli_daxpyv_opt().
- There is no longer a need to define an obj_t wrapper to go along with
your level-1v/-1f kernels. The framework now prvides a _kernel()
function which serves as the obj_t wrapper for whatever kernels are
specified (or defaulted to) via bli_kernel.h
- Developers no longer need to prototype their kernels, and thus no
longer need to include any prototyping headers from within
bli_kernel.h. The framework now generates kernel prototypes, with the
proper type signature, based on the kernel names defined (or defaulted
to) via bli_kernel.h.
- If the complex datatype x (of [cz]) implementation of the gemm micro-
kernel is left undefined by bli_kernel.h, but its same-precision real
domain equivalent IS defined, BLIS will use a 4m-based implementation
for the datatype x implementations of all level-3 operations, using
only the real gemm micro-kernel.
Details:
- Modified all control tree node definitions to include a new field of
type func_t*, which is similar to a blksz_t except that it contains
one function pointer (each typed simply as void*) for each datatype.
We use the func_t* to embed pointers to the micro-kernels to use for
the leaf-level nodes of each control tree. This change is a natural
extension of control trees and will allow more flexibility in the
future.
- Modified all macro-kernel wrappers to obtain the micro-kernel pointers
from the incomming (previously ignored) control tree node and then pass
the queried pointer into the datatype-specific macro-kernel code, which
then casts the pointer to the appropriate type (new typedefs residing
in bli_kernel_type_defs.h) and then uses the pointer to call the micro-
kernel. Thus, the micro-kernel function is no longer "hard-coded" (that
is, determined when the datatype-specific macro-kernel functions are
instantiated by the C preprocessor).
- Added macros to bli_kernel_macro_defs.h that build datatype-specific
base names if they do not exist already, and then uses those to build
datatype-specific micro-kernel function names. This will allow
developers extra flexibility if they wanted to, for example, name each
of their datatype-specific micro-kernels differently (e.g. double
real might be named bli_dgemm_opt_4x4() while double complex might be
named bli_zgemm_opt_2x2()).
- Inserted appropriate code into _cntl_init() functions that allocates
and initializes a func_t object for the corresponding micro-kernels.
The gemm ukernel func_t object is created once, in bli_gemm_cntl_init(),
and then reused via extern wherever possible.
Details:
- Removed support for duplication from the gemmtrsm/trsm micro-kernels
and all framework code.
- Updated test suite modules according to above changes.
Details:
- Added test modules in test suite for level-1f kernels and level-3
micro-kernels. (Duplication in the micro-kernels, for now, is NOT
supported by these test modules.)
- Added section override switches to test suite's input.operations file.
- Added obj_t APIs for level-1f front-ends and their unblocked variants to
facilitate the level-1f test modules. Also added front-end for dupl
operation.
- Added obj_t-based check routines for level-1f operations, which are
called from the new front-ends mentioned above.
- Added query routines for axpyf, dotxf, and dotxaxpyf that return fusing
factors as a function of datatype, which is needed by their respective
test modules.
- Whitespace changes to bli_kernel.h of all existing configurations.
Details:
- Added a 'template' configuration, which contains stub implementations of the
level 1, 1f, and 3 kernels with one datatype implemented in C for each, with
lots of in-file comments and documentation.
- Modified some variable/parameter names for some 1/1f operations. (e.g.
renaming vector length parameter from m to n.)
- Moved level-1f fusing factors from axpyf, dotxf, and dotxaxpyf header files
to bli_kernel.h.
- Modifed test suite to print out fusing factors for axpyf, dotxf, and
dotxaxpyf, as well as the default fusing factor (which are all equal
in the reference and template implementations).
- Cleaned up some sloppiness in the level-1f unb_var1.c files whereby these
reference variants were implemented in terms of front-end routines rather
that directly in terms of the kernels. (For example, axpy2v was implemented
as two calls to axpyv rather than two calls to AXPYV_KERNEL.)
- Changed the interface to dotxf so that it matches that of axpyf, in that
A is assumed to be m x b_n in both cases, and for dotxf A is actually used
as A^T.
- Minor variable naming and comment changes to reference micro-kernels in
frame/3/gemm/ukernels and frame/3/trsm/ukernels.
Details:
- Expanded on cpp macro definitions from bli_mem.c and relocated them to
a new header file, frame/include/bli_mem_pool_macro_defs.h. The expanded
functionality includes computing the pool size for each datatype (using
that datatype's cache blocksizes) and using the maximum to size the
actual pool array. This addresses the somewhat common pitfall whereby a
developer updates cache blocksizes in bli_kernel.h for only one datatype
(say, single-precision real), while the memory pools are sized using the
double-precision real values. Then, when the developer attempts to link
to and run a level-3 BLIS routine (e.g. dgemm), the library aborts with
a message saying the static memory pool was exhausted. Clearly, this
message is misleading when the pool was not sized properly to begin with.
- Removed previously disabled code in bli_kernel_macro_defs.h that was
meant to check for size consistency among the various cache blocksizes.
(Obviously the memory pool size-based solution mentioned above is better.)
- Added BLIS_SIZEOF_? cpp macros to bli_type_defs.h. This seemed like a
reasonable place to put these constants, rather than further crowd up
bli_config.h.
- Updated testsuite driver to output memory pool sizes for A, B, and C.
- Minor comment updates to bli_config.h.
- Removed 'flame' configuration. It was beginning to get out-of-date, and
I hadn't used it in months. We can always re-create it later.
Details:
- Added missing C preprocessor guards in bli_kernel_macro_defs.h that enforce
constraints on the register blocksizes relative to the cache blocksizes.
Thanks to Tyler for helping me stumble across this issue.
Details:
- Added support for an experimental gemm macro-kernel incrementally
packs one micro-panel of B at a time. This is useful for certain
special cases of gemm where m is small.
- Minor changes to default values of clarksville configuration.
- Defined BLIS_PACKED_BLOCKS as part of pack_t type, even though we
do not yet have any use (or implementation support) for block storage.
- Comment update to bli_packm_init.c.
Details:
- Temporarily disabled checks that ensure that enough memory will be allocated
by the contiguous memory allocator for all types, given that the values for
double precision real are the ones used to allocate the space. These checks
can easily go awry in certain situations, especially if you are developing for
only one datatype. So for now, they are probably more trouble than they are
worth.
Details:
- Made substantial changes throughout the framework to decouple the leading
dimension (row or column stride) used within each packed micro-panel from
the corresponding register blocksize. It appears advantageous on some
systems to use, for example, packed micro-panels of A where the column
stride is greater than MR (whereas previously it was always equal to MR).
- Changes include:
- Added BLIS_EXTEND_[MNK]R_? macros, which specify how much extra padding
to use when packing micro-panels of A and B.
- Adjusted all packing routines and macro-kernels to use PACKMR and PACKNR
where appropriate, instead of MR and NR.
- Added pd field (panel dimension) to obj_t.
- New interface to bli_packm_cntl_obj_create().
- Renamed bli_obj_packed_length()/_width() macros to
bli_obj_padded_length()/_width().
- Removed local #defines for cache/register blocksizes in level-3 *_cntl.c.
- Print out new cache and register blocksize extensions in test suite.
- Also added new BLIS_EXTEND_[MNK]C_? macros for future use in using a larger
blocksize for edge cases, which can improve performance at the margins.
Details:
- Added new macros that alias level-3 cache and register blocksize macros
to names that can be constructed via the PASTEMAC macro. These aliased
macro definitions live inside bli_kernel_macro_defs.h, which is now
#included after bli_kernel.h.
- Modified macro-kernels to use new aliased blocksize macros instead of
operation-specific ones.
- Removed local, operation-specific kernel blocksize macro definitions
(found in macro-kernel header files).