Details:
- Updated bli_kernel_*_macro_defs.h headers to include default
definitions for 30xk packm kernels.
- Extended function pointer arrays in bli_packm_cxk_*() out to 31 and
included 30xk kernels.
- Addex 30xk kernels to frame/1m/packm/ukernels/bli_packm_ref_cxk_*.c.
Details:
- Added "4mh" and "3mh" APIs, which implement the 4m and 3m methods at
high levels, respectively. APIs for trmm and trsm were NOT added due
to the fact that these approaches are inherently incompatible with
implementing 4m or 3m at high levels (because the input right-hand
side matrix is overwritten).
- Added 4mh, 3mh virtual micro-kernels, and updated the existing 4m and
3m so that all are stylistically consistent.
- Added new "rih" packing kernels (both low-level and structure-aware)
to support both 4mh and 3mh.
- Defined new pack_t schemas to support real-only, imaginary-only, and
real+imaginary packing formats.
- Added various level0 scalar macros to support the rih packm kernels.
- Minor tweaks to trmm macro-kernels to facilitate 4mh and 3mh.
- Added the ability to enable/disable 4mh, 3m, and 3mh, and adjusted
level-3 front-ends to check enabledness of 3mh, 3m, 4mh, and 4m (in
that order) and execute the first one that is enabled, or the native
implementation if none are enabled.
- Added implementation query functions for each level-3 operation so
that the user can query a string that describes the implementation
that is currently enabled.
- Updated test suite to output implementation types for reach level-3
operation, as well as micro-kernel types for each of the five micro-
kernels.
- Renamed BLIS_ENABLE_?COMPLEX_VIA_4M macros to _ENABLE_VIRTUAL_?COMPLEX.
- Fixed an obscure bug when packing Hermitian matrices (regular packing
type) whereby the diagonal elements of the packed micro-panels could
get tainted if the source matrix's imaginary diagonal part contained
garbage.
Details:
- Modified macro-kernels to pass the pack_t schema values for matrices
A and B into the datatype-specific functions, where they are now
inserted into a newly-expanded auxinfo_t struct. This gives gives the
micro-kernels access to the pack_t schema values embedded in the
control trees, which determine the precise format into which the
matrix elements are packed.
- Updated a call to bli_packm_init_pack() in src/test_libblis.c to
remove densify argument. Meant to include this in commit c472993b.
Details:
- Removed sections of bli_kernel_[4m|3m]_macro_defs.h that defined
4m/3m-specific blocksizes after realizing that this can be done in
bli_gemm[4m|3m]_cntl.c, since that is (mostly) the only place they
are used.
- The maximum cache values for 4m/3m are stll needed when computing mem
pool dimensions in bli_mem_pool_macro_defs.h. As a workaround, "local"
definitions in terms of the regular cache blocksizes are now in place.
- Similarly, the register blocksizes for 4m/3m are still needed in
bli_kernel_post_macro_defs.h. As a workaround, "local" definitions in
terms of the regular register blocksizes are now in place.
Details:
- Changed semantics of cache and register blocksize extensions so that
the extended values are tracked, rather than just the marginal
extensions.
- BLIS_EXTEND_[MKN]C_? has been renamed BLIS_MAXIMUM_[MKN]C_?.
- BLIS_EXTEND_[MKN]R_? has been renamed BLIS_PACKDIM_[MKN]R_?.
- bli_blksz_ext_*() APIs have been renamed to bli_blksz_max_*(). Note
that these "max" query routines grab the maximum value for cache
blocksizes and the packdim value for register blocksizes.
- bli_info_*() API has been updated accordingly.
- All configurations have been updated accordingly.
Details:
- Changed the interface to the packm_struc_cxk*() kernels to include
the pack_t schema. This allows the implementation to more easily
determine how the micro-panel is stored (row-stored column panel
or column-stored row panel).
- Updated packm blocked variants to pass in the schema.
- Updated packm_ker_t function pointer definition accordingly.
Details:
- Reorganized packm variants and structure-aware kernels so that all
routines for a given pack format (4m, 3m, regular) reside in a single
file.
- Renamed _blk_var4 to _blk_var2 and generalized so that it will work
for
both 4m and 3m, and adjusted 4m/3m _cntl_init() functions accordingly.
- Added a new packm_ker_t function pointer type to
bli_kernel_type_defs.h
to facilitate function pointer typecasting in the datatype-specific
packm_blk_var2() functions.
- Deprecated _blk_var3.
- Fixed a bug in the triangular micro-panel packing facility that
affected trmm and trmm3 with unit diagonals.
Details:
- Reordered the #include statements in bli_scalar_macro_defs.h so that
conventional, ri-, and ri3-based macros are grouped together.
- Renamed bli_eqri.h (and macros within) to end with 'ris' suffix.
Details:
- Combined the 4m/3m bits into an expanded bitfield, which will encode
the packing "format" of the micro-panels. This will allow for more
easily and compactly encoding additional formats.
- Other minor comment/whitespace updates to bli_type_defs.h.
- Updated bli_obj_macro_defs.h and bli_param_macro_defs.h to use the new
format bitfield.
- Comment update to bli_kernel_post_macro_defs.h.
- Whitespace changes to bli_kernel_3m_macro_defs.h, _4m_macro_defs.h.
Details:
- Renamed packm ukernels, _cxk dispatcher, and structure-aware _cxk
helper functions to use _4m and _3m instead of _ri and _ri3 suffixes.
- Updated names of cpp macros that correspond to packm ukernels.
Details:
- Added bli_4m.c (and header), which defines a simple API that can be
used to query, enable, and disable 4m-based complex support in BLIS.
The macros BLIS_ENABLE_?COMPLEX_VIA_4M are now used to initialize
the variable that determines the state (enabled or disabled).
- Changed bli_info*() API so that all cache and register blocksize-
related query routines return the blksz_t objects' values as they
exist at runtime, rather than return the values as determined by the
configuration system (e.g. bli_kernel.h, or defaults for those values
not specified). This sets the foundation for being able to change
those blocksizes at runtime.
Details:
- Added the ability for the kernel developer to indicate the gemm micro-
kernel as having a preference for accessing the micro-tile of C via
contiguous rows (as opposed to contiguous columns). This property may
be encoded in bli_kernel.h as BLIS_?GEMM_UKERNEL_PREFERS_CONTIG_ROWS,
which may be defined or left undefined. Leaving it undefined leads to
the default assumption of column preference.
- Changed conditionals in frame/3/*/*_front.c that induce transposition
of the operation so that the transposition is induced only if there
is disagreement between the storage of C and the preference of the
micro-kernel. Previously, the only conditional that needed to be met
was that C was row-stored, which is to say that we assumed the micro-
kernel preferred column-contiguous access on C.
- Added a "prefers_contig_rows" property to func_t objects, and updated
calls to bli_func_obj_create() in _cntl.c files in order to support
the above changes.
- Removed the row-storage optimization from bli_trsm_front.c because
it is actually ineffective. This is because the right-side case of
trsm flips the A and B micro-panel operands (since BLIS only requires
left-side gemmtrsm/trsm kernels), meaning any transposition done
at the high level is then undone at the low level.
- Tweaked trmm, trmm3 _front.c files to eliminate a possible redundant
invocation of the bli_obj_swap() macro.
Details:
- Removed unused and unneeded s- and d-flavored macro definitions for
packm ukernels related to the complex 4m and 3m methods, as
implemented in BLIS.
Details:
- Rolled back recent changes to bli_obj_is_row_stored() and
bli_obj_is_col_stored() so that those macros now only inspect the
strides (row or column). It turns out that the more sophisticated
definitions introduced in a51e32e are not necessary, because these
"obj" macros are virtually never used on packed matrices, and when
they are, they can use bli_obj_is_[row|col}_packed() macros, which
inspect the info bitfield.
Details:
- Added a new section in bli_config.h files of all configurations for
enabling CBLAS support. (Currently, the default is for the CBLAS layer
to be disabled.)
- Added a directory, frame/compat/cblas, to house CBLAS source code. A
subdirectory 'f77_sub' holds subroutine wrappers corresponding to
subroutines found in CBLAS that allow calling some BLAS routines with
the return value passed as the last argument rather than as an actual
(function) return value. This was probably intended to allow CBLAS to
avoid the whole f2c debacle altogether. However, since BLIS does not
assume the presence of a Fortran compiler, we had to provide similar
routines in C.
- A script, integrate-cblas-tarball.sh, is included to streamline the
integration of future revisions of the CBLAS source code.
- The current tarball, cblas.tgz, that was used with the above script to
generate the present set of CBLAS source code is also included.
- Updated blis.h to include necessary CBLAS-related headers.
Details:
- Redefined many of the macros that define bit fields and bit values in
the obj_t info field using the bitshift operator (<<). This makes it
easier to reorder bit fields, or expand existing bit fields, or add
new fields. The bitshifting should be evaluated by the compiler at
compile-time.
Details:
- Instead of inferring the storage format of the micro-panels from within
the packm variants, we now pass in a bool_t value that denotes whether
the packed matrix contains row-stored column panels or column-stored
row panels. This value can then be tested more easily inside the main
packm variant loop.
- Renumbered pack_t schema values in bli_type_defs.h so that there are
now five bits, each with different meaning:
- 4: packed or not packed?
- 3: packed for 3m?
- 2: packed for 4m?
- 1: packed to panels?
- 0: stored by rows or columns?
- Added new macros that test for status of above bits in schema bit
subfield, and renamed some existing macros related to 4m/3m.
Details:
- Fixed a breakdown in BLIS's ability to differentiate between row-stored
and column-stored micro-panels when MR or NR is unit. When either
register blocksize (or both) is equal to one, inspecting the strides of
the affected packed micro-panel is no longer sufficient to determine
whether the micro-panel is a row-stored column panel or a column-stored
row panel (because both strides are unit). At that point, dimension
information is necessary when invoking the bli_is_row_stored_f() and
bli_is_col_stored_f() macros (and their "obj" counterparts). Thanks to
Ilya Polkovnichenko for reporting this bug.
- Added panel dimensions (m and n) to obj_t, which are set in
packm_init() and then passed into the blocked variants to support the
aforementioned update.
Details:
- Defined a new level-1d operation, setid, which sets the imaginary
elements of an object's diagonal to a single scalar. This can be
useful, for example, when trying to make the diagonal of a Hermitian
matrix real-valued.
Details:
- Updated copyright headers to include "at Austin" in the name of the
University of Texas.
- Updated the copyright years of a few headers to 2014 (from 2011 and
2012).
Details:
- Added a new API family, bli_info_*(), which can be used to query
information about how BLIS was configured. Most of these values are
returned as gint_t, with the exception of the version string which
is char*.
- Changed how the testsuite driver queries information about how BLIS
was configured (from using macro constants directly to using the
new bli_info API).
- Removed bli_version.c and its header file.
- Added STRINGIFY_INT() macro to bli_macro_defs.h
- Renamed info_t type in bli_type_defs.h to objbits_t (not because of
an actual naming conflict, but because the name 'info_t' would now be
somewhat misleading in the presence of the new bli_info API, as the
two are unrelated).
Details:
- Redefined xpbys_mxn and xpbys_mxn_u/_l macros to employ a copy
(instead of scaling by beta) when beta is zero. This will stamp out
any possible infs or NaNs in the output matrix, if it happens to be
uninitialized. Thanks to Tony Kelman for isolating this bug.
Details:
- Relaxed the constraint in bli_obj_attach_buffer_check(), which required
the buffer address being attached to be non-NULL. This is acceptable
because the user was already able to create and use objects with NULL
buffers (via bli_obj_create_without_buffer(), which initializes the
buffer to NULL).
- Inserted calls to newly defined function, bli_check_object_buffer(),
into nearly all operations' _check() or _int_check() functions. This
allows BLIS to abort peacefully if a computational routine is called
with an object containing a NULL buffer. By contrast, under such
conditions, BLAS would typically fail with a segmentation fault.
- Within operation front-ends, moved the calls to _check()/_int_check()
so that zero dimensions are checked first (and if found, execution
returns with trivial or no computation). This resolves issue #7. Thanks
to Jack Poulson for reporting this bug.
Details:
- Added a new field to blksz_t objects that allows one to attach a
sub-object. Doing this allows us to associate a register blocksize with
any given cache blocksize. That way, the register blocksize can be
queried wherever the cache blocksize would normally be accessible
(e.g. a blocked algorithm).
- Modified bli_gemm_cntl.c (and 4m/3m variants) so that the register
blocksizes are attached to the cache blocksizes after they are created.
Details:
- Added initialization statements to various macros used in level 1m and
1m-like operations. I wasn't able to reproduce the reported behavior,
so hopefully this takes care of it. Thanks to Jeff Hammond for the
report.
Details:
- Defined new INSERT_GENTFUNC macros so that the macro always takes
exactly the number of arguments needed for the particular operation or
variant being defined. Many operations were using INSERT_GENTFUNC
macros that expected one auxiliary argument even though none were
needed. Those instances have now been updated. Most of these instances
were in the level-0 and -1v operations, as well as some operations
defined in frame/util.
Details:
- Completely reoganized norm operations:
- Renames:
- fnormsc, fnormv, fnormm -> normfsc, normfv, normfm (2-norm)
- absumv -> norm1v (vector 1-norm)
- New operations:
- norm1m (matrix 1-norm)
- normiv, normim (infinity-norm)
- amaxv (BLAS-like absolute maximum value index)
- asumv (BLAS-like absolute sum)
- Deprecated absumm, as it did not correspond to any actual norm.
(However, an inlined version now exists in the testsuite module for
randm.)
This change makes each operation have its own thread info type,
allowing more fine control of threading in operations that have different types of suboperations
Details:
- Minor update to bli_sumsqv_unb_var1() to bring it up-to-date with
LAPACK 3.5.0's zlassq.f, which, starting with 3.4.2, returns NaN when
the vector (or matrix) contains a NaN.
- Minor change to bli_abmaxv_unb_var1() to more closely mimic the
behavior of netlib BLAS's izamax(). There, a "less than or equal to"
operator is used in the search instead of "less than", which would
change the element index returned if there were multiple maximum values.
- Added macro function definitions for bli_isinf() and bli_isnan(), which
are currently implemented in terms of isinf() and isnan() from math.h.
Details:
- Fixed various bugs in packm_*_cxk(), the 4m/3m micro-kernels, and
elsewhere in the framework that were not yet set up to work properly
when BLIS_ENABLE_C99_COMPLEX is defined in bli_config.h
- Extensive changes to f2c-derived files in frame/compat/f2c to allow
C99 complex storage. Most of these changes center around accessing
real and imaginary components via bli_?real()/bli_?imag() accessor
macros, and setting of values via bli_?sets() assignment macros.
(Thanks to Vladimir Sukarev for pointing out that _ENABLE_C99_COMPLEX
was broken.)
Added a multithreading infrastructure that should be independent of multithreading implementation in the future.
Currently, gemm blocked variants 1f and 2f, and packm variant blocked variant 1 is parallelized.
Details:
- Removed BLIS_CONTIG_STRIDE_ALIGN_SIZE from bli_config.h of all
configurations. It was already going unused in packm_init() since the
recent 4m/3m commit. This setting was rarely, if ever, useful, and its
existence only posed a potential risk for 4m/3m-based implementations.
- Removed BLIS_CONTIG_STRIDE_ALIGN_SIZE usage from mem_pool_macro_defs.h.
- Updated comments regarding CONTIG_STRIDE_ALIGN_SIZE in template
micro-kernels.
Details:
- Standard names for reference kernels (levels-1v, -1f and 3) are now
macro constants. Examples:
BLIS_SAXPYV_KERNEL_REF
BLIS_DDOTXF_KERNEL_REF
BLIS_ZGEMM_UKERNEL_REF
- Developers no longer have to name all datatype instances of a kernel
with a common base name; [sdcz] datatype flavors of each kernel or
micro-kernel (level-1v, -1f, or 3) may now be named independently.
This means you can now, if you wish, encode the datatype-specific
register blocksizes in the name of the micro-kernel functions.
- Any datatype instances of any kernel (1v, 1f, or 3) that is left
undefined in bli_kernel.h will default to the corresponding reference
implementation. For example, if BLIS_DGEMM_UKERNEL is left undefined,
it will be defined to be BLIS_DGEMM_UKERNEL_REF.
- Developers no longer need to name level-1v/-1f kernels with multiple
datatype chars to match the number of types the kernel WOULD take in
a mixed type environment, as in bli_dddaxpyv_opt(). Now, one char is
sufficient, as in bli_daxpyv_opt().
- There is no longer a need to define an obj_t wrapper to go along with
your level-1v/-1f kernels. The framework now prvides a _kernel()
function which serves as the obj_t wrapper for whatever kernels are
specified (or defaulted to) via bli_kernel.h
- Developers no longer need to prototype their kernels, and thus no
longer need to include any prototyping headers from within
bli_kernel.h. The framework now generates kernel prototypes, with the
proper type signature, based on the kernel names defined (or defaulted
to) via bli_kernel.h.
- If the complex datatype x (of [cz]) implementation of the gemm micro-
kernel is left undefined by bli_kernel.h, but its same-precision real
domain equivalent IS defined, BLIS will use a 4m-based implementation
for the datatype x implementations of all level-3 operations, using
only the real gemm micro-kernel.
Details:
- Added the ability to induce complex domain level-3 operations via new
virtual complex micro-kernels which are implemented via only real
domain micro-kernels. Two new implementations are provided: 4m and 3m.
4m implements complex matrix multiplication in terms of four real
matrix multiplications, where as 3m uses only three and thus is
capable of even higher (than peak) performance. However, the 3m method
has somewhat weaker numerical properties, making it less desirable
in general.
- Further refined packing routines, which were recently revamped, and
added packing functionality for 4m and 3m.
- Some modifications to trmm and trsm macro-kernels to facilitate indexing
into micro-panels which were packed for 4m/3m virtual kernels.
- Added 4m and 3m interfaces for each level-3 operation.
- Various other minor changes to facilitate 4m/3m methods.
Details:
- Consolidated the functionality previously supported by packm_blk_var2()
and packm_blk_var3() into a new variant, packm_blk_var1().
- Updates to packm_gen_cxk(), packm_herm_cxk.c(), and packm_tri_cxk()
to accommodate above changes.
- Removed packm_blk_var3() and retired packm_blk_var2() to
frame/1m/packm/old.
- Updated all level-3 _cntl_init() functions so that the new, more
versatile packm_blk_var1 is used for all level-3 matrix packing.
Details:
- Defined a new macro, INSERT_GENTFUNC_BASIC0, which takes only one
argument--the base name of the function--and employed this macro
in the reference micro-kernel files instead of the _BASIC macro,
which takes one auxiliary argument. That argument was not being
used and probably just acted to unnecessarily obfuscate.