Details:
- Added cpp macros to trmm and trmm3 front-ends to optionally force
those operations to be cast so the structured matrix is on the left.
symm and hemm already had such macros, but these too were renamed so
that the macros were individual to the operation. We now have four
such macros:
#define BLIS_DISABLE_HEMM_RIGHT
#define BLIS_DISABLE_SYMM_RIGHT
#define BLIS_DISABLE_TRMM_RIGHT
#define BLIS_DISABLE_TRMM3_RIGHT
Also, updated the comments in the symm and hemm front-ends related to
the first two macro guards, and added corresponding comments to the
trmm and trmm3 front-ends for the latter two guards. (They all
functionally do the same thing, just for their specific operations.)
Thanks to Jeff Hammond for reporting the bugs that led me to this
change (via #359).
- Updated config/old/haswellbb subconfiguration (used to debug issues
related to duplicating B during packing) to register: a packing
kernel for single-precision real; gemmbb ukernels for s, c, and z;
trsmbb ukernels for s, c, and z; gemmtrsmbb virtual ukrnels for s, c
and z; and to use non-default cache and register blocksizes for s, c,
and z datatypes. Also declared prototypes for all of the gemmbb,
trsmbb, and gemmtrsmbb ukernel functions within the
bli_cntx_init_haswellbb() function. This should, once applied to the
power9 configuration, fix the remaining issues in #359.
- Defined bli_spackm_6xk_bb4_ref(), which packs single reals with a
duplication factor of 4. This function is defined in the same file as
bli_dpackm_6xk_bb2_ref() (bli_packm_cxk_bb_ref.c).
Details:
- Added support for being able to duplicate (broadcast) elements in
memory when packing matrix B (ie: the left-hand operand) in level-3
operations. This turns out advantageous for some architectures that
can afford the cost of the extra bandwidth and somehow benefit from
the pre-broadcast elements (and thus being able to avoid using
broadcast-style load instructions on micro-rows of B in the gemm
microkernel).
- Support optionally disabling right-side hemm and symm. If this occurs,
hemm_r is implemented in terms of hemm_l (and symm_r in terms of
symm_l). This is needed when broadcasting during packing because the
alternative--supporting the broadcast of B while also allowing matrix
B to be Hermitian/symmetric--would be an absolute mess.
- Support alignment factors for packed blocks of A, B, and C separately
(as well as for general-purpose buffers). In addition, we support
byte offsets from those alignment values (which is different from
aligning by align+offset bytes to begin with). The default alignment
values are BLIS_PAGE_SIZE in all four cases, with the offset values
defaulting to zero.
- Pass pack_t schema into bli_?packm_cxk() so that it can be then passed
into the packm kernel, where it will be needed by packm kernels that
perform broadcasts of B, since the idea is that we *only* want to
broadcast when packing micropanels of B and not A.
- Added definition for variadic bli_cntx_set_l3_vir_ukrs(), which can be
used to set custom virtual level-3 microkernels in the cntx_t, which
would typically be done in the bli_cntx_init_*() function defined in
the subconfiguration of interest.
- Added a "broadcast B" kernel function for use with NP/NR = 12/6,
defined in in ref_kernels/1m/bli_packm_cxk_bb_ref.c.
- Added a gemm, gemmtrsm, and trsm "broadcast B" reference kernels
defined in ref_kernels/3/bb. (These kernels have been tested with
double real with NP/NR = 12/6.)
- Added #ifndef ... #endif guards around several macro constants defined
in frame/include/bli_kernel_macro_defs.h.
- Defined a few "broadcast B" static functions in
frame/include/level0/bb for use by "broadcast B"-style packm reference
kernels. For now, only the real domain kernels are tested and fully
defined.
- Output the alignment and offset values for packed blocks of A and B
in the testsuite's "BLIS configuration info" section.
- Comment updates to various files.
- Bumped so_version to 3.0.0.
Details:
- Updated the API and semantics of packm kernels such that they must now
handle edge cases, meaning that a c-by-k packm kernel must be able to
pack edge cases that are fewer than c rows/columns and be able to
zero-fill the remaining elements. They must also be able to zero-fill
the equivalent region when copying fewer than k columns/rows (which is
needed by trsm). The new packm kernel API is generally:
void packm_kernel
(
conj_t conja,
dim_t cdim,
dim_t n,
dim_t n_max,
ctype* restrict kappa,
ctype* restrict a, inc_t inca, inc_t lda,
ctype* restrict p, inc_t ldp,
cntx_t* restrict cntx
);
where cdim and n are the dimensions (short and long, respectively) of
the submatrix being copied from the source matrix A, and n_max is the
"full" long dimension (corresponding to the k dimension in gemm) of
the micropanel. The "full" short dimension (corresponding to the
register blocksize MR or NR) is not part of the API because it is
known intrinsically by the packm kernel implementation. Thanks to
Devin Matthews for prompting us to make this change (#282).
- Updated all reference packm kernels in ref_kernels/1m according to
above changes, as well as all optimized packm kernels (which only
consisted of those for knl).
- Bumped the major soname version number in 'so_version' to 2. At first
I was considering leaving it unchanged, but I couldn't escape the
reality that the packm kernel API is much closer to an expert API
than it is some obscure helper function interface within the framework
that nobody would ever notice.
- Removed reference packm kernels for mr/nr = 30. The only sub-config
that would have been using those kernels is knc, which is likely no
longer being used by very many people (if any). (This also mostly
offset the larger object code footprint incurred by moving the edge-
case handling into the individual packm kernels.)
- Fixed an obscure race condition for 3mh and 4mh induced methods in
which those implementations were modifying the contexts stored in the
gks rather than a local copy.
- Fixed a minor bug in the testsuite that prevented non-1m-based induced
method implementations of trsm from executing.
Details:
- Removed explicit reference to The University of Texas at Austin in the
third clause of the license comment blocks of all relevant files and
replaced it with a more all-encompassing "copyright holder(s)".
- Removed duplicate words ("derived") from a few kernels' license
comment blocks.
- Homogenized license comment block in kernels/zen/3/bli_gemm_small.c
with format of all other comment blocks.
Details:
- Removed four trailing spaces after "BLIS" that occurs in most files'
commented-out license headers.
- Added UT copyright lines to some files. (These files previously had
only AMD copyright lines but were contributed to by both UT and AMD.)
- In some files' copyright lines, expanded 'The University of Texas' to
'The University of Texas at Austin'.
- Fixed various typos/misspellings in some license headers.
Details:
- Renamed the reference packm kernels used by 1m. Previously, they used
a _1e suffix, which was confusing since they packed to both 1e and 1r
schemas. This was likely an artifact of the time when there were
separate kernels for each schema before I decided to combine them into
a single function (per datatype and panel dimension), and the 1e
functions were the ones to inherit the 1r functionality. The kernels
have now been renamed to use a _1er suffix.
Details:
- Removed the vast majority of directories named "old", which contained
deprecated code that I wasn't quite ready to jettison from the source
tree.
Details:
- Reworked the build system around a configuration registry file, named
config_registry', that identifies valid configuration targets, their
constituent sub-configurations, and the kernel sets that are needed by
those sub-configurations. The build system now facilitates the building
of a single library that can contains kernels and cache/register
blocksizes for multiple configurations (microarchitectures). Reference
kernels are also built on a per-configuration basis.
- Updated the Makefile to use new variables set by configure via the
config.mk.in template, such as CONFIG_LIST, KERNEL_LIST, and KCONFIG_MAP,
in determining which sub-configurations (CONFIG_LIST) and kernel sets
(KERNEL_LIST) are included in the library, and which make_defs.mk files'
CFLAGS (KCONFIG_MAP) are used when compiling kernels.
- Reorganized 'kernels' directory into a "flat" structure. Renamed kernel
functions into a standard format that includes the kernel set name
(e.g. 'haswell'). Created a "bli_kernels_<kernelset>.h" file in each
kernels sub-directory. These files exist to provide prototypes for the
kernels present in those directories.
- Reorganized reference kernels into a top-level 'ref_kernels' directory.
This directory includes a new source file, bli_cntx_ref.c (compiled on
a per-configuration basis), that defines the code needed to initialize
a reference context and a context for induced methods for the
microarchitecture in question.
- Rewrote make_defs.mk files in each configuration so that the compiler
variables (e.g. CFLAGS) are "stored" (renamed) on a per-configuration
basis.
- Modified bli_config.h.in template so that bli_config.h is generated with
#defines for the config (family) name, the sub-configurations that are
associated with the family, and the kernel sets needed by those
sub-configurations.
- Deprecated all kernel-related information in bli_kernel.h and transferred
what remains to new header files named "bli_arch_<configname>.h", which
are conditionally #included from a new header bli_arch.h. These files
are still needed to set library-wide parameters such as custom
malloc()/free() functions or SIMD alignment values.
- Added bli_cntx_init_<configname>.c files to each configuration directory.
The files contain a function, named the same as the file, that initializes
a "native" context for a particular configuration (microarchitecture). The
idea is that optimized kernels, if available, will be initialized into
these contexts. Other fields will retain pointers to reference functions,
which will be compiled on a per-configuration basis. These bli_cntx_init_*()
functions will be called during the initialization of the global kernel
structure. They are thought of as initializing for "native" execution, but
they also form the basis for contexts that use induced methods. These
functions are prototyped, along with their _ref() and _ind() brethren, by
prototype-generating macros in bli_arch.h.
- Added a new typedef enum in bli_type_defs.h to define an arch_t, which
identifies the various sub-configurations.
- Redesigned the global kernel structure (gks) around a 2D array of cntx_t
structures (pointers to cntx_t, actually). The first dimension is indexed
over arch_t and the inner dimension is the ind_t (induced method) for
each microarchitecture. When a microarchitecture (configuration) is
"registered" at init-time, the inner array for that configuration in the
2D array is initialized (and allocated, if it hasn't been already). The
cntx_t slot for BLIS_NAT is initialized immediately and those for other
induced method types are initialized and cached on-demand, as needed. At
cntx_t registration, we also store function pointers to cntx_init functions
that will initialize (a) "reference" contexts and (b) contexts for use with
induced methods. We don't cache the full contexts for reference contexts
since they are rarely needed. The functions that initialize these two kinds
of contexts are generated automatically for each targeted sub-configuration
from cpp-templatized code at compile-time. Induced method contexts that
need "stage" adjustments can still obtain them via functions in
bli_cntx_ind_stage.c.
- Added new functions and functionality to bli_cntx.c, such as for setting
the level-1f, level-1v, and packm kernels, and for converting a native
context into one for executing an induced method.
- Moved the checking of register/cache blocksize consistency from being cpp
macros in bli_kernel_macro_defs.h to being runtime checks defined in
bli_check.c and called from bli_gks_register_cntx() at the time that the
global kernel structure's internal context is initialized for a given
microarchitecture/configuration.
- Deprecated all of the old per-operation bli_*_cntx.c files and removed
the previous operation-level cntx_t_init()/_finalize() invocations.
Instead, we now query the gks for a suitable context, usually via
bli_gks_query_cntx().
- Deprecated support for the 3m2 and 3m3 induced methods. (They required
hackery that I was no longer willing to support.)
- Consolidated the 1e and 1r packm kernels for any given register blocksize
into a single kernel that will branch on the schema and support packing
to both formats.
- Added the cntx_t* argument to all packm kernel signatures.
- Deprecated the local function pointer array in all bli_packm_cxk*.c files
and instead obtain the packm kernel from the cntx_t.
- Added bli_calloc_intl(), which serves as the calloc-equivalent to to
bli_malloc_intl(). Useful when we wish to allocate and initialize to
zero/NULL.
- Converted existing cpp macro functions defined in bli_blksz.h, bli_func.h,
bli_cntx.h into static functions.