Details:
- Added initialization statements to various macros used in level 1m and
1m-like operations. I wasn't able to reproduce the reported behavior,
so hopefully this takes care of it. Thanks to Jeff Hammond for the
report.
Details:
- Defined new INSERT_GENTFUNC macros so that the macro always takes
exactly the number of arguments needed for the particular operation or
variant being defined. Many operations were using INSERT_GENTFUNC
macros that expected one auxiliary argument even though none were
needed. Those instances have now been updated. Most of these instances
were in the level-0 and -1v operations, as well as some operations
defined in frame/util.
Details:
- Simplified some code that would have allowed the diagonal of a trmm
or trsm triangular matrix to intersect the short end of a micro-panel.
This is disallowed via higher-level constraints on cache blocksizes, so
this code was never needed and only served to obfuscate.
- Updated some comments in trmm, trsm macro-kernels.
Details:
- Completely reorganized norm operations:
  - Renames:
    - fnormsc, fnormv, fnormm -> normfsc, normfv, normfm (2-norm)
    - absumv -> norm1v (vector 1-norm)
  - New operations:
    - norm1m (matrix 1-norm)
    - normiv, normim (infinity-norm)
    - amaxv (BLAS-like absolute maximum value index)
    - asumv (BLAS-like absolute sum)
- Deprecated absumm, as it did not correspond to any actual norm.
(However, an inlined version now exists in the testsuite module for
randm.)
Details:
- Minor update to bli_sumsqv_unb_var1() to bring it up-to-date with
LAPACK 3.5.0's zlassq.f, which, starting with 3.4.2, returns NaN when
the vector (or matrix) contains a NaN.
- Minor change to bli_abmaxv_unb_var1() to more closely mimic the
behavior of netlib BLAS's izamax(). There, a "less than or equal to"
operator is used in the search instead of "less than", which would
change the element index returned if there were multiple maximum values.
- Added macro function definitions for bli_isinf() and bli_isnan(), which
are currently implemented in terms of isinf() and isnan() from math.h.
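A minimal sketch of why the comparison operator matters when several
elements tie for the maximum absolute value. The function names are
hypothetical and this is not the bli_abmaxv_unb_var1() code; it assumes
real (double) data and n >= 1.

```c
#include <stddef.h>

/* Absolute value without pulling in math.h. */
static double demo_abs(double v) { return v < 0.0 ? -v : v; }

/* Hypothetical helper: update only on a strictly larger value, so the
   FIRST of several equal maxima is returned. */
size_t demo_amax_first(const double *x, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; ++i)
        if (demo_abs(x[i]) > demo_abs(x[best])) best = i;
    return best;
}

/* Same search, but ties also update, so the LAST of several equal
   maxima is returned. */
size_t demo_amax_last(const double *x, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; ++i)
        if (demo_abs(x[i]) >= demo_abs(x[best])) best = i;
    return best;
}
```

For x = { 1, -3, 2, 3 } the two variants return different indices (1 and
3), which is the kind of difference the change above is meant to remove.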
Details:
- Fixed various bugs in packm_*_cxk(), the 4m/3m micro-kernels, and
elsewhere in the framework that were not yet set up to work properly
when BLIS_ENABLE_C99_COMPLEX is defined in bli_config.h.
- Extensive changes to f2c-derived files in frame/compat/f2c to allow
C99 complex storage. Most of these changes center around accessing
real and imaginary components via bli_?real()/bli_?imag() accessor
macros, and setting of values via bli_?sets() assignment macros.
(Thanks to Vladimir Sukarev for pointing out that _ENABLE_C99_COMPLEX
was broken.)
Details:
- Removed BLIS_CONTIG_STRIDE_ALIGN_SIZE from bli_config.h of all
configurations. It was already going unused in packm_init() since the
recent 4m/3m commit. This setting was rarely, if ever, useful, and its
existence only posed a potential risk for 4m/3m-based implementations.
- Removed BLIS_CONTIG_STRIDE_ALIGN_SIZE usage from mem_pool_macro_defs.h.
- Updated comments regarding CONTIG_STRIDE_ALIGN_SIZE in template
micro-kernels.
Details:
- Standard names for reference kernels (levels-1v, -1f and 3) are now
macro constants. Examples:
BLIS_SAXPYV_KERNEL_REF
BLIS_DDOTXF_KERNEL_REF
BLIS_ZGEMM_UKERNEL_REF
- Developers no longer have to name all datatype instances of a kernel
with a common base name; [sdcz] datatype flavors of each kernel or
micro-kernel (level-1v, -1f, or 3) may now be named independently.
This means you can now, if you wish, encode the datatype-specific
register blocksizes in the name of the micro-kernel functions.
- Any datatype instance of any kernel (1v, 1f, or 3) that is left
undefined in bli_kernel.h will default to the corresponding reference
implementation. For example, if BLIS_DGEMM_UKERNEL is left undefined,
it will be defined to be BLIS_DGEMM_UKERNEL_REF.
- Developers no longer need to name level-1v/-1f kernels with multiple
datatype chars to match the number of types the kernel WOULD take in
a mixed type environment, as in bli_dddaxpyv_opt(). Now, one char is
sufficient, as in bli_daxpyv_opt().
- There is no longer a need to define an obj_t wrapper to go along with
your level-1v/-1f kernels. The framework now provides a _kernel()
function which serves as the obj_t wrapper for whatever kernels are
specified (or defaulted to) via bli_kernel.h.
- Developers no longer need to prototype their kernels, and thus no
longer need to include any prototyping headers from within
bli_kernel.h. The framework now generates kernel prototypes, with the
proper type signature, based on the kernel names defined (or defaulted
to) via bli_kernel.h.
- If the complex datatype x (of [cz]) implementation of the gemm micro-
kernel is left undefined by bli_kernel.h, but its same-precision real
domain equivalent IS defined, BLIS will use a 4m-based implementation
for the datatype x implementations of all level-3 operations, using
only the real gemm micro-kernel.
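The defaulting rule for undefined kernels can be sketched with plain
preprocessor logic. BLIS_DGEMM_UKERNEL and BLIS_DGEMM_UKERNEL_REF are the
macro names mentioned above; the function names and the integer tag are
hypothetical stand-ins, and the real framework's mechanism may differ in
detail.

```c
/* Tags used only so the sketch can report which kernel was selected. */
enum { DEMO_KERNEL_REF = 1, DEMO_KERNEL_OPT = 2 };

/* Stand-in for the reference micro-kernel (placeholder body). */
static int demo_dgemm_ukr_ref(void) { return DEMO_KERNEL_REF; }

/* The standard reference name is a macro constant. */
#define BLIS_DGEMM_UKERNEL_REF demo_dgemm_ukr_ref

/* A configuration's bli_kernel.h would normally go here. In this sketch
   it deliberately leaves BLIS_DGEMM_UKERNEL undefined. */

/* Framework side: anything left undefined defaults to the reference. */
#ifndef BLIS_DGEMM_UKERNEL
#define BLIS_DGEMM_UKERNEL BLIS_DGEMM_UKERNEL_REF
#endif

/* Callers only ever use the selected name. */
int demo_selected_dgemm_ukernel(void) { return BLIS_DGEMM_UKERNEL(); }
```

Defining BLIS_DGEMM_UKERNEL to an optimized function before the #ifndef
would route the same call to that function instead.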
Details:
- Added the ability to induce complex domain level-3 operations via new
virtual complex micro-kernels which are implemented via only real
domain micro-kernels. Two new implementations are provided: 4m and 3m.
4m implements complex matrix multiplication in terms of four real
matrix multiplications, whereas 3m uses only three and thus is
capable of even higher (than peak) performance. However, the 3m method
has somewhat weaker numerical properties, making it less desirable
in general.
- Further refined packing routines, which were recently revamped, and
added packing functionality for 4m and 3m.
- Some modifications to trmm and trsm macro-kernels to facilitate indexing
into micro-panels which were packed for 4m/3m virtual kernels.
- Added 4m and 3m interfaces for each level-3 operation.
- Various other minor changes to facilitate 4m/3m methods.
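The difference between 4m and 3m is easiest to see on scalars. The sketch
below uses a hypothetical struct type and one common three-multiplication
formulation; the notes above do not say which exact variant is used, so
treat this as an illustration only. In the matrix case each real product
becomes a real matrix multiplication.

```c
/* Minimal complex type for the sketch (not a BLIS type). */
typedef struct { double re, im; } demo_cplx;

/* 4m: four real multiplications, two additions. */
demo_cplx demo_mul_4m(demo_cplx a, demo_cplx b)
{
    demo_cplx c;
    c.re = a.re * b.re - a.im * b.im;
    c.im = a.re * b.im + a.im * b.re;
    return c;
}

/* 3m: three real multiplications, five additions. */
demo_cplx demo_mul_3m(demo_cplx a, demo_cplx b)
{
    double t1 = a.re * b.re;
    double t2 = a.im * b.im;
    double t3 = (a.re + a.im) * (b.re + b.im);
    demo_cplx c;
    c.re = t1 - t2;
    c.im = t3 - t1 - t2;  /* recovered by subtraction from a larger value */
    return c;
}
```

Both functions give -5 + 10i for (1 + 2i)(3 + 4i). The imaginary part in
the 3m form is obtained by subtracting from a larger intermediate, which
is one plausible source of the weaker numerical properties noted above.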
Details:
- Removed use of bli_min() and bli_max() that were only being used to
try to support situations where the diagonal would intersect the
short end of some micro-panels, which is a situation that is disallowed
at a higher level by various constraints on the register and cache
blocksize. This only affected trsm_ll and trsm_lu.
- Use panel stride as passed into the macro-kernel rather than compute
it via k and PACKMR/PACKNR. This affects all macro-kernels of trsm.
Details:
- Fixed an obscure bug in left-hand trmm that would only manifest when
non-zero register blocksize extensions (PACKMR > MR or PACKNR > NR)
are used.
- Removed use of bli_min() and bli_max() that were only being used to
try to support situations where the diagonal would intersect the
short end of some micro-panels, which is a situation that is disallowed
at a higher level by various constraints on the register and cache
blocksize. This only affected trmm_ll and trmm_lu.
- Use panel stride as passed into the macro-kernel rather than compute
it via k and PACKMR/PACKNR. This affects all macro-kernels of trmm.
Details:
- Fixed a bug in packm_cxk() whereby the packm ukernel was being chosen
from ldp, which is always equal to PACKMR or PACKNR. The problem with
this is that the pack ukernels were implicitly assuming that the
panel dimension of the panel being packed was equal to ldp, which
is not the case when the register blocksize extensions are non-zero
(i.e., when PACKMR > MR or PACKNR > NR, whichever is applicable). This
problem has been fixed by passing ldp into the pack ukernels, which
now walk through the packed micro-panel region by incrementing by this
value, rather than incrementing by the inherent panel dimension value
assumed by each packm ukernel (e.g. 4 in the case of packm_ref_4xk).
- Also fixed a very minor edge case inefficiency whereby pack ukernels
smaller than the default were not being used in edge cases, and instead
those situations were being handled by scal2m. This is related to the
issue above, because the pack ukernel itself was being chosen based on
ldp instead of the panel dimension.
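A rough sketch of the fixed behavior: the pack routine receives ldp
separately and steps through the packed region by ldp, while copying only
panel_dim elements per column. The function name, argument order, and
column-major source layout are assumptions made for illustration; this is
not the packm_ref_4xk code, and it leaves the padding region untouched.

```c
#include <stddef.h>

/* Hypothetical pack routine: copies a panel_dim-by-k panel of a
   (column-major, leading dimension lda) into p, advancing by ldp
   (>= panel_dim) from one packed column to the next. */
void demo_pack_panel(size_t panel_dim, size_t k,
                     const double *a, size_t lda,
                     double *p, size_t ldp)
{
    for (size_t j = 0; j < k; ++j)
        for (size_t i = 0; i < panel_dim; ++i)
            p[j * ldp + i] = a[j * lda + i];
}
```

With panel_dim = 4 and ldp = 6, the second packed column starts at p[6],
not p[4], which is where the old code went wrong when it assumed the two
values were equal.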
Details:
- Consolidated the functionality previously supported by packm_blk_var2()
and packm_blk_var3() into a new variant, packm_blk_var1().
- Updates to packm_gen_cxk(), packm_herm_cxk(), and packm_tri_cxk()
to accommodate above changes.
- Removed packm_blk_var3() and retired packm_blk_var2() to
frame/1m/packm/old.
- Updated all level-3 _cntl_init() functions so that the new, more
versatile packm_blk_var1 is used for all level-3 matrix packing.
Details:
- Revised packm_blk_var2() and _var3() by encapsulating the general,
hermitian/symmetric, and triangular panel-packing subproblems into
separate functions: packm_gen_cxk(), packm_herm_cxk(), and
packm_tri_cxk(), respectively. Also, homogenized the packm code as
well as the new specialized packm_*_cxk() code to further improve
readability.
Details:
- Defined a new macro, INSERT_GENTFUNC_BASIC0, which takes only one
argument--the base name of the function--and employed this macro
in the reference micro-kernel files instead of the _BASIC macro,
which takes one auxiliary argument. That argument was not being
used and probably just acted to unnecessarily obfuscate.
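For readers unfamiliar with this macro style, here is a small sketch of a
"template" macro plus an insertion macro that takes only the base name.
All DEMO_ names are hypothetical, only the real types are instantiated,
and the actual INSERT_GENTFUNC_BASIC0 definition may differ.

```c
/* Function template: C type, type character, base name. */
#define DEMO_GENTFUNC(ctype, ch, opname)              \
ctype demo_##ch##opname(const ctype *x, int n)        \
{                                                     \
    ctype sum = 0;                                    \
    for (int i = 0; i < n; ++i) sum += x[i];          \
    return sum;                                       \
}

/* Insertion macro with no auxiliary argument: one instance per type. */
#define DEMO_INSERT_GENTFUNC_BASIC0(opname)           \
DEMO_GENTFUNC(float,  s, opname)                      \
DEMO_GENTFUNC(double, d, opname)

/* Expands to definitions of demo_ssumv() and demo_dsumv(). */
DEMO_INSERT_GENTFUNC_BASIC0(sumv)
```

An insertion macro that also forwarded an unused auxiliary argument would
generate the same functions, which is why dropping it is a pure cleanup.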
Details:
- Fixed some technical incorrectness with some usage of the 'restrict'
keyword in the reference trsm micro-kernels.
- Tweak to testsuite/Makefile that causes rebuild if libblis was
touched.
Details:
- Updated (and fixed some errors in) the "Assumptions/assertions" comment
section of macro-kernels.
- Changed register blocksizes of reference configuration to MR = 8 and
NR = 4. It's always good for MR != NR in the reference configuration
since it may help uncover bugs related to non-square micro-kernels.
Details:
- Modified the interfaces to the datatype-specific macro-kernels so that:
  - pd_a and pd_b are passed in (which contain the panel dimensions of
    packed panels of a and b).
  - rs_a and cs_b are no longer passed in (they were guaranteed to be 1).
- Modified implementations of datatype-specific macro-kernels so pd_a,
pd_b, cs_a, and rs_b are used instead of cpp macros for MR, NR, PACKMR,
and PACKNR, respectively.
- Declare temporary c matrices (ct) as being maxmr-by-maxnr, which for now
is equivalent to being mr-by-nr. maxmr and maxnr are declared in a new
header file bli_kernel_post_macro_defs.h.
Details:
- Removed macro constant definitions related to incremental blocksizes
from all configurations' bli_kernel.h files. This change is minor and
is mostly a cleanup related to a previous commit.
Details:
- Added some logic to packm_init(), pack_int() and gemm_int() so that
(a) objects marked as BLIS_ZEROS are not packed, and (b) those
objects are not computed with. This functionality is not currently
needed by any existing implementations, but may be used in the
future.
Details:
- Added 'f' to some block variant files/functions to be consistent with
the naming convention of other files/functions. Here, the f indicates
partitioning in the "forward" direction.
Details:
- Removed code that creates duplicate blksz_t objects for herk, trmm,
and trsm. Instead, the gemm blksz_t objects are accessed via extern
and used directly. This reduces the amount of code associated with
each of the three _cntl_init() and _cntl_finalize() functions.
Details:
- Added new _front() functions for each level-3 operation. This is done
so that the choosing of the control tree (and *only* the choosing of
the control tree) happens in what was previously the "front end"
(e.g. bli_gemm()). That control tree is then passed into the _front()
function, which then performs up-front tasks such as parameter
checking.
Details:
- Removed code that generated a control tree specifically for hemm and
symm. Instead, the gemm control tree is now configured so that it
works for gemm, hemm, or symm.
- Retired most her2k code, as it was not being used. (Currently, her2k is
implemented as two invocations of herk.) I couldn't think of many
situations where her2k variants were needed.
- Removed some older her2k code.
Details:
- Modified all control tree node definitions to include a new field of
type func_t*, which is similar to a blksz_t except that it contains
one function pointer (each typed simply as void*) for each datatype.
We use the func_t* to embed pointers to the micro-kernels to use for
the leaf-level nodes of each control tree. This change is a natural
extension of control trees and will allow more flexibility in the
future.
- Modified all macro-kernel wrappers to obtain the micro-kernel pointers
from the incoming (previously ignored) control tree node and then pass
the queried pointer into the datatype-specific macro-kernel code, which
then casts the pointer to the appropriate type (new typedefs residing
in bli_kernel_type_defs.h) and then uses the pointer to call the micro-
kernel. Thus, the micro-kernel function is no longer "hard-coded" (that
is, determined when the datatype-specific macro-kernel functions are
instantiated by the C preprocessor).
- Added macros to bli_kernel_macro_defs.h that build datatype-specific
base names if they do not exist already, and then use those to build
datatype-specific micro-kernel function names. This will allow
developers extra flexibility if they wanted to, for example, name each
of their datatype-specific micro-kernels differently (e.g. double
real might be named bli_dgemm_opt_4x4() while double complex might be
named bli_zgemm_opt_2x2()).
- Inserted appropriate code into _cntl_init() functions that allocates
and initializes a func_t object for the corresponding micro-kernels.
The gemm ukernel func_t object is created once, in bli_gemm_cntl_init(),
and then reused via extern wherever possible.
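The query-cast-call pattern described above might look roughly like the
following. Every demo_ name is hypothetical, and only two datatypes are
shown. The notes say the real func_t stores each pointer as void*; this
sketch uses a generic function pointer type instead so that it stays
within strict ISO C.

```c
/* Datatype indices for the sketch. */
typedef enum { DEMO_FLOAT = 0, DEMO_DOUBLE = 1, DEMO_NUM_TYPES = 2 } demo_dt;

/* One untyped function pointer per datatype. */
typedef void (*demo_generic_fp)(void);
typedef struct { demo_generic_fp ptr[DEMO_NUM_TYPES]; } demo_func_t;

/* Typed signatures the macro-kernel casts back to. */
typedef void (*demo_sukr_t)(int k, const float *a, const float *b, float *c);
typedef void (*demo_dukr_t)(int k, const double *a, const double *b, double *c);

/* Toy "micro-kernels": accumulate a dot product into *c. */
static void demo_sukr(int k, const float *a, const float *b, float *c)
{ for (int i = 0; i < k; ++i) *c += a[i] * b[i]; }
static void demo_dukr(int k, const double *a, const double *b, double *c)
{ for (int i = 0; i < k; ++i) *c += a[i] * b[i]; }

/* Analogous to what a _cntl_init() function would set up once. */
demo_func_t demo_func_create(void)
{
    demo_func_t f;
    f.ptr[DEMO_FLOAT]  = (demo_generic_fp)demo_sukr;
    f.ptr[DEMO_DOUBLE] = (demo_generic_fp)demo_dukr;
    return f;
}

/* Double-precision "macro-kernel": query, cast, then call. */
double demo_dmacro(const demo_func_t *f, int k, const double *a, const double *b)
{
    demo_dukr_t ukr = (demo_dukr_t)f->ptr[DEMO_DOUBLE];
    double c = 0.0;
    ukr(k, a, b, &c);
    return c;
}
```

Because the kernel is looked up at run time, swapping in a different
micro-kernel only requires changing what is stored in the func_t object.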
Details:
- Removed commented-out code from macro-kernels that was supposed to
facilitate implementing mixed domain (complex times real) matrix
multiplication. This functionality is (probably) still possible,
but I'm getting tired of looking at the code every time I edit
a macro-kernel. Plus, there are probably ways of doing it at a
higher level, via control trees.
Details:
- Removed b_aux field from all control tree node definitions. This field
was being used in certain optimizations (incremental blocking) that were
not actually being employed within BLIS, and are probably not employed
by others.
- Updated all _cntl_obj_create() function definitions and invocations
according to above change.
- Retired bli_gemm_blk_var4.c, which was one such function that employed
incremental blocking, but which was never called by BLIS itself.
Details:
- Changed the pack_t enumerations so that BLIS_PACKED_VECTOR no longer has
its own value, and instead simply aliases to BLIS_PACKED_UNSPEC. This
makes room in the three pack_t bits of the info field of obj_t so that
two values are now unused, and may be used for other future purposes.
- Updated sloppy terminology usage in comments in level-2 front-ends.
(Replaced "is contiguous" with more accurate "has unit stride".)
Details:
- Redirect errors to /dev/null when using 'find' to locate libraries that
would be uninstalled upon executing "make uninstall-old". Before, if the
Makefile was read before $(INSTALL_PREFIX)/lib existed, a "No such file
or directory" message was emitted. This message was harmless, but is now
suppressed in this situation.
Details:
- In the test suite driver, inserted an explicit typecast of the return
value of bli_getopt() prior to parsing. The lack of typecast caused a
problem on at least one system whereby a return value of -1 was
interpreted as a garbage character. Thanks to Francisco Igual for finding
and submitting this fix.
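One plausible way a -1 return value can turn into a garbage character is
a narrowing conversion on a platform where plain char is unsigned. The
notes above do not state the exact mechanism, so the sketch below is only
an assumption; demo_next_opt() is a hypothetical stand-in, not
bli_getopt(), and unsigned char is used to model such a platform.

```c
/* Stand-in option parser: returns an option letter, or -1 when no
   options remain. */
int demo_next_opt(int remaining) { return remaining > 0 ? 'g' : -1; }

/* Result stored in an unsigned character type: -1 becomes 255, so the
   end-of-options test never succeeds. */
int demo_end_seen_unsigned_char(void)
{
    unsigned char c = (unsigned char)demo_next_opt(0);
    int widened = c;          /* 255, not -1 */
    return widened == -1;
}

/* Result kept as int: the sentinel survives and is detected. */
int demo_end_seen_int(void)
{
    int c = demo_next_opt(0);
    return c == -1;
}
```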
Details:
- Modified build system (mostly configure and top-level Makefile) so that
a user can build a BLIS library outside of the top-level directory of
the source distribution.
- Added "test" target to Makefile so that the user can run "make test",
which will compile, link, and run the testsuite binary. This works even
if the build directory is externally located, thanks to the test suite
binary's new -g and -o command-line options. Also, when creating the
test suite via the top-level Makefile, the linking is against the
local archive, in lib/<configname>, rather than at <install_prefix>/lib.
- Modified testsuite/Makefile so that it links against the library built
locally, in ../lib/<configname>.
- Added "-lm" to LDFLAGS of most configurations' make_defs.mk.
- Various other cleanups to build system.
Details:
- Added bli_getopt.c and .h files to frame/base. These files implement
a custom version of getopt(), which may be used to parse command line
options passed into a program via argc/argv. I am implementing this
function myself, as opposed to using the version available via unistd.h,
for portability reasons, as the only requirement is string.h (which
is available via the standard C library).
- Modified test suite to allow the user to specify the file name (and/or
path) to the parameters and operations input files: -g may be used to
specify the general input file and -o to specify the operations input
file. If -g or -o or both are not given, default filenames are assumed
(as well as their existence in the current directory).
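The -g/-o behavior with defaults can be sketched without any getopt at
all, using only string.h. This is not the bli_getopt() interface (its
signature is not shown here); the struct, function name, and default file
names are assumptions for illustration.

```c
#include <string.h>

/* Paths to the two input files (hypothetical struct). */
typedef struct {
    const char *general;
    const char *operations;
} demo_inputs;

/* Scan argv for "-g <file>" and "-o <file>"; anything not given keeps
   its default, which is assumed to exist in the current directory. */
demo_inputs demo_parse_args(int argc, const char *const *argv)
{
    demo_inputs in = { "input.general", "input.operations" };
    for (int i = 1; i + 1 < argc; ++i) {
        if (strcmp(argv[i], "-g") == 0)
            in.general = argv[++i];
        else if (strcmp(argv[i], "-o") == 0)
            in.operations = argv[++i];
    }
    return in;
}
```

Passing only "-o ops.txt" overrides the operations file and leaves the
general file at its default.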
Details:
- Updated template micro-kernel implementations (located in
config/template/kernels), to adhere to the new auxinfo_t interface.
Meant to include this change in a0331fb1.
- Changed template configuration to use 64-bit integers (for both BLIS
and the BLAS compatibility layer).