Details:
- Fixed an obscure bug in the front-ends for herk, her2k, syrk, and syr2k,
that manifested as the incorrect triangle being updated. It occurred when
the user would pass in a matrix object that was correctly marked as
symmetric/Hermitian and lower-stored, but whose root object was never marked
as lower (or upper). We now alias and re-assign root status for matrix C
within the front-ends. Note that trmm and trsm were already doing this,
albeit for a slightly different reason (to allow the internal back-end to
choose which algorithm to run--lower or upper--based on the uplo of the root
object for both left and right side cases). Thanks to Bryan Marker for
leading me to this bug.
Details:
- Relaxed type checking in getsc so that the input object could be a constant
and not just a proper floating-point type. (If it is a constant, default to
extracting the dcomplex values.) Thanks to Bryan Marker for reporting this
bug.
- Added definition for bli_is_constant() in bli_param_macro_defs.h
- Comment updates to various level-0 scalar routines.
Details:
- This macro is used to determine whether the partitioning routines should
call a corresponding packm_part routine instead. However, it was
unintentionally catching matrices that were marked as "packed" by virtue
of them simply being marked as BLIS_PACKED_UNSPEC in, say, bli_gemv().
The macro has now been renamed to bli_obj_is_panel_packed(), and now only
checks for row or column panel packing. (Note that I first attempted to
fix this bug in a571af816d72.) Thanks to Bryan Marker for reporting the
erroneous behavior that led me to this bug.
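To illustrate the distinction, here is a minimal C sketch of the old and new predicates. The enum values and function names are hypothetical stand-ins, not BLIS's actual pack-schema encoding or macro definitions. The point is only that a PACKED_UNSPEC object satisfies a generic "is packed" test but not a panel-packed test.

```c
#include <stdbool.h>

/* Illustrative pack schema values (hypothetical encoding, not BLIS's
   actual bit layout). */
typedef enum
{
    SKETCH_NOT_PACKED,
    SKETCH_PACKED_UNSPEC,      /* "packed" in the loose sense, e.g. by gemv */
    SKETCH_PACKED_ROWS,
    SKETCH_PACKED_COLUMNS,
    SKETCH_PACKED_ROW_PANELS,
    SKETCH_PACKED_COL_PANELS
} sketch_pack_t;

/* Old behavior: any schema other than "not packed" counts as packed,
   which wrongly sends PACKED_UNSPEC objects down the packm_part path. */
static bool sketch_obj_is_packed( sketch_pack_t schema )
{
    return schema != SKETCH_NOT_PACKED;
}

/* New behavior: only row or column panel packing qualifies. */
static bool sketch_obj_is_panel_packed( sketch_pack_t schema )
{
    return schema == SKETCH_PACKED_ROW_PANELS ||
           schema == SKETCH_PACKED_COL_PANELS;
}
```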
Details:
- Removed locally defined gemm microkernel blocksize macros from _mxn
reference microkernel definition and header. Meant to include this in
a recent/previous commit (0020ef7c82).
Details:
- Added missing statement to set structure field of local objects in
top-level BLIS (BLAS-like) API wrappers. Thanks to Bryan Marker for
reporting this bug.
Details:
- Fixed a bug in reference implementation of dupl (bli_dupl_unb_var1.c),
which resulted in incorrect duplication.
- Updated old test drivers according to recently updated packm control tree
creation interface.
- Added 'restrict' to x86 gemm microkernel interface.
Details:
- Delayed #include of bli_kernel.h in blis.h to prevent a situation in which
bli_kernel.h includes an optimized microkernel header that uses BLIS types
such as dim_t and inc_t before those types are defined in bli_type_defs.h.
- Moved the #include of bli_kernel_macro_defs.h in bli_macro_defs.h to blis.h
(immediately after that of bli_kernel.h).
Details:
- Modified gemmtrsm micro-kernel wrappers to use new aliased blocksize macros
instead of operation-specific ones.
- Removed local, gemmtrsm-specific blocksize macro definitions found in
micro-kernel header files.
(Meant to include above changes in 31b100e7bf4a.)
- Added comments to reference gemmtrsm micro-kernel wrapper implementation.
Details:
- Added new memory alignment constants:
BLIS_HEAP_STRIDE_ALIGN_SIZE (previously assumed to be same as SYSTEM_MEM)
BLIS_CONTIG_ADDR_ALIGN_SIZE (previously assumed to be same as PAGE_SIZE)
BLIS_STACK_BUF_ALIGN_SIZE (previously not enforced)
and renamed existing ones
BLIS_SYSTEM_MEM_ALIGN_SIZE -> BLIS_HEAP_ADDR_ALIGN_SIZE
BLIS_CONTIG_MEM_ALIGN_SIZE -> BLIS_CONTIG_STRIDE_ALIGN_SIZE
to better convey what the alignment factor is used for (and what it is
not used for).
- Removed BLIS_ENABLE_SYSTEM_MEM_ALIGN. Dynamic memory alignment is now
disabled by setting BLIS_HEAP_STRIDE_ALIGN_SIZE to 1.
- Inserted instances of __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE)))
into macro-kernels to specify stack alignment of temporary buffers.
- Modified test suite driver to output new constants.
- Removed bli_align_dim_to_sys() and bli_align_dim_to_cmem(). Instead, we now
use bli_align_dim_to_size(), which takes a third argument (the desired
alignment).
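A minimal sketch of the two ideas above, assuming that bli_align_dim_to_size() rounds a dimension up so that its byte footprint is a multiple of the requested alignment. The sketch_* names and the 16-byte stack alignment are hypothetical, and __attribute__((aligned(...))) is a GCC/Clang extension. In this sketch an alignment of 1 leaves dimensions unchanged, consistent with disabling alignment by setting BLIS_HEAP_STRIDE_ALIGN_SIZE to 1.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for BLIS_STACK_BUF_ALIGN_SIZE. */
#define SKETCH_STACK_BUF_ALIGN_SIZE 16

/* Round dim up so that dim * elem_size is a multiple of align_size.
   Assumes align_size is a multiple of elem_size, as is the case for the
   usual power-of-two element sizes and alignments. */
static size_t sketch_align_dim_to_size( size_t dim, size_t elem_size,
                                        size_t align_size )
{
    size_t bytes   = dim * elem_size;
    size_t aligned = ( ( bytes + align_size - 1 ) / align_size ) * align_size;
    return aligned / elem_size;
}

/* A temporary buffer of the kind a macro-kernel uses, aligned on the
   stack (GCC/Clang extension). Returns 1 if its address is aligned. */
static int sketch_stack_buf_is_aligned( void )
{
    double ct[ 8 * 4 ] __attribute__(( aligned( SKETCH_STACK_BUF_ALIGN_SIZE ) ));
    return ( ( uintptr_t )ct % SKETCH_STACK_BUF_ALIGN_SIZE ) == 0;
}
```

For example, with 8-byte elements and 64-byte alignment, a dimension of 5 (40 bytes) is rounded up to 8 (64 bytes).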
Details:
- Fixed bug whereby axpyv and axpym were incorrectly simplifying to a copy,
rather than an add, when alpha = 1. Thanks to Bryan Marker for identifying
this bug.
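The intended shortcut can be sketched as follows. sketch_daxpyv() is a hypothetical real-domain stand-in, not the BLIS implementation. When alpha = 1, y := y + alpha * x reduces to y := y + x, never to y := x.

```c
#include <stddef.h>

/* y := y + alpha * x, with the alpha == 1 shortcut done correctly. */
static void sketch_daxpyv( size_t n, double alpha, const double *x, double *y )
{
    if ( alpha == 1.0 )
    {
        /* Correct simplification: an add (y += x), NOT a copy (y = x). */
        for ( size_t i = 0; i < n; ++i ) y[ i ] += x[ i ];
        return;
    }

    /* General case. */
    for ( size_t i = 0; i < n; ++i ) y[ i ] += alpha * x[ i ];
}
```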
Details:
- Renamed abs, min, max, dmin, and dmax macros in bli_f2c.h so that they
would not conflict with anything defined by the user (or the language).
Thanks to Devin Matthews for suggesting this fix.
- Updated all instances of the above macros accordingly.
Details:
- Added new macros that alias level-3 cache and register blocksize macros
to names that can be constructed via the PASTEMAC macro. These aliased
macro definitions live inside bli_kernel_macro_defs.h, which is now
#included after bli_kernel.h.
- Modified macro-kernels to use new aliased blocksize macros instead of
operation-specific ones.
- Removed local, operation-specific kernel blocksize macro definitions
(found in macro-kernel header files).
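A minimal sketch of the aliasing idea, assuming a PASTEMAC-style token-pasting helper. All macro names and blocksize values here are hypothetical; the actual BLIS alias names live in bli_kernel_macro_defs.h. Because each alias name ends in the datatype character, type-templated code can construct it from the same character it already uses to name the function.

```c
/* Two-level token pasting so that arguments expand before pasting
   (hypothetical helpers in the style of PASTEMAC). */
#define SKETCH_PASTE_( a, b ) a ## b
#define SKETCH_PASTE( a, b )  SKETCH_PASTE_( a, b )

/* Operation-specific blocksizes as a configuration might define them
   (placeholder values). */
#define SKETCH_DEFAULT_MR_S 8
#define SKETCH_DEFAULT_MR_D 4

/* Aliases whose names end in the lowercase type character. */
#define SKETCH_MR_s SKETCH_DEFAULT_MR_S
#define SKETCH_MR_d SKETCH_DEFAULT_MR_D

/* Construct the alias from a type character. */
#define SKETCH_MR( ch ) SKETCH_PASTE( SKETCH_MR_, ch )

/* A type-templated definition can now refer to MR generically. */
#define SKETCH_GENFUNC( ch ) \
static int SKETCH_PASTE( sketch_mr_, ch )( void ) { return SKETCH_MR( ch ); }

SKETCH_GENFUNC( s )
SKETCH_GENFUNC( d )
```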
Details:
- Replaced scalar constant macro definitions in bli_const_defs.h with a single,
simpler macro in bli_obj_macro_defs.h.
- Updated invocations of old macros accordingly.
- Removed bli_const_defs.h.
Details:
- Fixed a bug in the reference implementations of the gemmtrsm wrappers
(bli_gemmtrsm_l_ref_mxn.c and bli_gemmtrsm_u_ref_mxn.c) whereby the
reference gemm microkernel was hard-coded, and thus always called, even
when GEMM_UKERNEL was defined to point to an optimized microkernel. This
manifested as artificially low trsm performance for all problem sizes, but
especially for small problem sizes as it only affected blocks of A that
intersected the diagonal. Thanks to Mike Kistler of IBM for helping me
find this bug.
Details:
- Changed the definition of bli_is_packed_object() so that it keys off of the
value of the pack schema bits in the info field of obj_t, rather than
comparing the obj_t buffer with that of the mem_t entry. This was the cause
of a very low probability bug whereby uninitialized memory caused the macro
to evaluate to TRUE even though the object in question was not packed.
Thanks to Vernon Austel of IBM for helping discover this bug.
- Changed an abort() in bli_packm_part() to a "not yet implemented" error.
Details:
- Fixed a bug present in bli_herk_l_ker_var2() and bli_herk_u_ker_var2() that
only manifests when BLIS is configured such that MR != NR. The bug involves
incorrectly detecting edge cases, which resulted in some parts of matrix C
potentially being skipped and not updated, depending on the problem size.
- Updated the default values of MR and NR in config/reference/bli_kernel.h to
8 and 4, respectively, so that I can better stress the framework on a
day-to-day basis. (The fact that they were both equal to 4 for so long is
why I did not stumble upon this bug much sooner.)
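The following hypothetical sketch shows how this class of bug can stay hidden while MR == NR. It is not the herk kernel code; it only counts the columns of C visited when the column dimension is stepped in panels of width NR but the panel count and edge case are detected with some other blocksize. With MR = NR = 4 the two agree and every column is visited; with MR = 8 and NR = 4, detecting the edge using MR leaves columns untouched.

```c
#include <stddef.h>

/* Count the columns of C visited by a loop that steps through full panels
   of width nr, but computes the number of panels and the edge (partial)
   panel using edge_b. Correct code uses edge_b == nr. */
static size_t sketch_cols_visited( size_t n, size_t nr, size_t edge_b )
{
    size_t n_iter = n / edge_b;   /* number of full panels */
    size_t n_left = n % edge_b;   /* width of the edge panel, 0 if none */
    size_t cols   = 0;

    if ( n_left != 0 ) ++n_iter;

    for ( size_t j = 0; j < n_iter; ++j )
    {
        /* The last panel is the edge case when n_left is nonzero. */
        int is_edge = ( j == n_iter - 1 && n_left != 0 );
        cols += is_edge ? n_left : nr;
    }
    return cols;
}
```

For example, sketch_cols_visited(10, 4, 8) visits only 6 of 10 columns, while sketch_cols_visited(10, 4, 4) visits all 10.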
Details:
- Added a new set of reference gemm, gemmtrsm, and trsm micro-kernels that
contain explicit loops over MR and NR, thus allowing them to be used
unmodified by developers who want to build a reference library with
custom register blocksizes.
- Changed config/reference/bli_kernel.h to use above ukernels by default.
- Changed interfaces of new and existing gemm, gemmtrsm, and trsm micro-kernels
to use 'restrict' keyword.
- Added -funroll-loops option to config/reference/make_defs.mk.
- Updated comments in bli_kernel.h describing constraints on register and
cache blocksizes.
- Updated _adds_mxn.h, _copys_mxn.h, and _xpbys_mxn.h macros files so that
single-char macros are also defined.
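For reference, a minimal sketch of a gemm micro-kernel with explicit loops over MR and NR. It assumes A is packed column by column as an MR x k micro-panel and B row by row as a k x NR micro-panel, computes C := beta * C + alpha * A * B, and uses hypothetical names and blocksizes; it is not the BLIS reference kernel verbatim.

```c
#include <stddef.h>

#define SKETCH_MR 8  /* illustrative register blocksizes */
#define SKETCH_NR 4

/* C := beta * C + alpha * A * B, where
   a: packed MR x k micro-panel, column by column (a[l*MR + i] = A(i,l))
   b: packed k x NR micro-panel, row by row       (b[l*NR + j] = B(l,j))
   c: MR x NR tile with row stride rs_c and column stride cs_c. */
static void sketch_dgemm_ukr_ref( size_t k,
                                  double alpha,
                                  const double * restrict a,
                                  const double * restrict b,
                                  double beta,
                                  double * restrict c,
                                  ptrdiff_t rs_c, ptrdiff_t cs_c )
{
    double ab[ SKETCH_MR * SKETCH_NR ] = { 0.0 };

    /* Accumulate A*B into a local column-major buffer. */
    for ( size_t l = 0; l < k; ++l )
        for ( size_t j = 0; j < SKETCH_NR; ++j )
            for ( size_t i = 0; i < SKETCH_MR; ++i )
                ab[ j * SKETCH_MR + i ] += a[ l * SKETCH_MR + i ] * b[ l * SKETCH_NR + j ];

    /* Scale and store; beta == 0 overwrites C so stale NaNs/Infs vanish. */
    for ( size_t j = 0; j < SKETCH_NR; ++j )
        for ( size_t i = 0; i < SKETCH_MR; ++i )
        {
            double *cij = c + ( ptrdiff_t )i * rs_c + ( ptrdiff_t )j * cs_c;
            *cij = ( beta == 0.0 ? 0.0 : beta * *cij ) + alpha * ab[ j * SKETCH_MR + i ];
        }
}
```

Because MR and NR appear only as loop bounds, changing the two #define values is all that is needed to build with different register blocksizes.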
Details:
- Implemented amax operation in BLIS.
- Activated BLAS2BLIS routine mapping for new amax BLIS implementation.
- Added integer support to [f]printv, [f]printm.
- Added integer support to level-0 copys macros.
- Updated printing of configuration information in test suite driver.
- Comment changes to _config.h files.
- Added comments to bla_dot.c to remind the reader what sdsdot()/dsdot() are
used for.
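A minimal real-domain sketch of the amax semantics, assuming a 0-based index as is natural in C (the Fortran BLAS i?amax routines return a 1-based index, so a BLAS compatibility wrapper would need to add 1). The function name is hypothetical. For complex vectors, BLAS measures magnitude as |Re| + |Im| rather than the true modulus; that case is omitted here.

```c
#include <stddef.h>

/* Return the 0-based index of the first element of x (stride incx) with the
   largest absolute value; returns 0 when n == 0. */
static size_t sketch_damaxv( size_t n, const double *x, ptrdiff_t incx )
{
    size_t idx     = 0;
    double max_abs = -1.0;

    for ( size_t i = 0; i < n; ++i )
    {
        double xi = x[ ( ptrdiff_t )i * incx ];
        double v  = xi < 0.0 ? -xi : xi;

        /* Strict comparison keeps the first index among ties. */
        if ( v > max_abs ) { max_abs = v; idx = i; }
    }
    return idx;
}
```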
Details:
- Renamed BLIS_MEMORY_ALIGNMENT_SIZE to BLIS_CONTIG_MEM_ALIGN_SIZE.
- Renamed BLIS_ENABLE_MEMORY_ALIGNMENT to BLIS_ENABLE_SYSTEM_MEM_ALIGN.
- Added BLIS_SYSTEM_MEM_ALIGN_SIZE, which controls only the alignment
passed into posix_memalign() or equivalent.
- Defined new function, bli_align_dim_to_cmem(), which applies the
contiguous memory alignment (rather than the system/malloc alignment).
Details:
- Pass panel strides through bli_align_dim_to_sys() to ensure that each
subsequent packed panel of A and B begins at an aligned address. (The
first panel is presumably aligned to system alignment because it is
aligned to a page boundary, which is typically much larger.)
- Rearranged code in packm_init_pack() to prevent additional conditional
blocks as a result of the aforementioned change.
- Adjusted contiguous memory allocator so that the system memory alignment
is used to allocate enough space for each block no matter what kind of
register blocking is used (even if register blocksize is unit and every
row/column needs maximal padding).
- Adjusted default blocksizes in reference configuration so that MC*KC
and KC*NC result in identical footprints for all datatypes.
Details:
- Changed all filename and function prefixes from 'bl2' to 'bli'.
- Changed the "blis2.h" header filename to "blis.h" and changed all
corresponding #include statements accordingly.
- Fixed incorrect association for Fran in CREDITS file.
Details:
- Removed #include of "blis2.h" from various lower-level, operation-specific
header files throughout the framework. Given that these low-level headers
are included within blis2.h in a very specific order, #include'ing blis2.h
within them directly is unnecessary.
Details:
- Added cpp guards around the definitions of dim_t, scomplex, and dcomplex.
This is a temporary hack to allow interoperability with libflame. (Similar
temporary changes are being made to libflame's type definitions file.)
Details:
- Renamed BLIS_MAX_PREFETCH_BYTE_OFFSET to
BLIS_MAX_PRELOAD_BYTE_OFFSET since "prefetch" is kind of a loaded word
(e.g. "prefetch" instructions, which are different from the particular
kind of prefetching/preloading referred to by this constant).
Details:
- Removed the m and n (and elem_size) fields from the mem_t object, and added
m_packed and n_packed fields to obj_t. These new fields track the same
quantities as the old ones. From an abstraction standpoint, it seemed awkward to store
those dimensions inside the mem_t.
- Updated interfaces to bl2_mem_acquire_*() so that only a byte size argument
is passed in, instead of m, n, and elem_size.
- Updated bl2_packm_init_pack() and bl2_packv_init_pack() to inline the
functionality of bl2_mem_alloc_update_m() and bl2_mem_alloc_update_v(),
respectively.
- Updated packm variants to access the packed length and width fields from
their new locations.
Details:
- Completely re-wrote the contiguous memory allocator (bl2_mem.c). The new
allocator instantiates and initializes three separate memory pool objects,
each one associated with a separate array of contiguous memory blocks, each
block of fixed and uniform size. (The three pools are for allocating mc-by-kc
blocks of A, kc-by-nc panels of B, and mc-by-nc panels of C.) The pool
objects use a stack structure internally to track which blocks in the region
have been "checked out" to a thread and which are still available. Critical
regions are now clearly marked and adaptable to parallel environments (e.g.
OpenMP). Memory pools are set up when bl2_init() is called.
- Added a new field to the packm control tree node, which indicates what kind
of packed buffer is being allocated. The enumerated type for this argument
is defined as packbuf_t in bl2_type_defs.h.
- Updated level-3 _cntl.c files to pass in the appropriate value for a new
packbuf_t argument to bl2_packm_cntl_obj_create().
- Moved some macros called by packm_init_pack() from bl2_obj_macro_defs.h to
bl2_mem_macro_defs.h.
- Added BLIS_MAX_NUM_THREADS to bl2_config.h, which we use as the default
number of blocks of A reserved for the memory allocator.
- Deprecated bl2_align_dim(). Replaced usage with that of
bl2_align_dim_to_mult(). Turns out that typically we don't need to align
a dimension to the system alignment, since that value has to do with
starting addresses, whereas the values we are dealing with are unitless
dimensions.
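A minimal sketch of a fixed-size block pool that tracks available blocks with a stack of indices. Names are hypothetical; the sketch uses a single malloc() and ignores alignment, and the comments mark where a threaded build would need critical regions. The allocator described above would create three such pools (for blocks of A, panels of B, and panels of C).

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_POOL_MAX_BLOCKS 8

typedef struct
{
    char  *region;                               /* one contiguous allocation */
    size_t block_size;                           /* uniform size of each block */
    size_t num_blocks;
    size_t free_top;                             /* entries on the stack */
    size_t free_stack[ SKETCH_POOL_MAX_BLOCKS ]; /* indices of available blocks */
} sketch_pool_t;

static int sketch_pool_init( sketch_pool_t *pool, size_t num_blocks, size_t block_size )
{
    if ( num_blocks > SKETCH_POOL_MAX_BLOCKS ) return -1;
    pool->region = malloc( num_blocks * block_size );
    if ( pool->region == NULL ) return -1;
    pool->block_size = block_size;
    pool->num_blocks = num_blocks;
    pool->free_top   = num_blocks;
    for ( size_t i = 0; i < num_blocks; ++i ) pool->free_stack[ i ] = i;
    return 0;
}

/* Pop an available block; NULL if all are checked out.
   In a threaded build this body would be a critical region. */
static void *sketch_pool_checkout( sketch_pool_t *pool )
{
    if ( pool->free_top == 0 ) return NULL;
    size_t idx = pool->free_stack[ --pool->free_top ];
    return pool->region + idx * pool->block_size;
}

/* Push a block back onto the stack (also a critical region). */
static void sketch_pool_checkin( sketch_pool_t *pool, void *block )
{
    size_t idx = ( size_t )( ( char * )block - pool->region ) / pool->block_size;
    pool->free_stack[ pool->free_top++ ] = idx;
}

static void sketch_pool_finalize( sketch_pool_t *pool )
{
    free( pool->region );
    pool->region = NULL;
}
```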
Details:
- Changed variant 1 of her2k so that the two rank-k products are computed
and accumulated in sequence rather than fused into one loop. This is
necessary if BLIS is to be configured to provide only enough contiguous
memory for one panel of B.
Details:
- Added new fields to mem_t struct definition to track the allocated (as
opposed to the currently used) dimensions of the memory region. This
allows packm_init() to be more robust in situations where memory is
already allocated but is more than needed for the current packing job.
- Updated logic in bl2_obj_set_buffer_with_cached_packm_mem() macro, used
in packm_init(), to update the "currently used" dimensions of the mem_t
object if the requested dimensions are smaller than the allocated
dimensions.
Details:
- Changed bl2_obj_induce_trans() so that the transposition bit is no longer
updated as part of the macro. All current uses of the macro have been
coupled with instances of bl2_obj_set_trans() to clear the bit.
- Added Jed to CREDITS file.
Details:
- Retired the blas2blis wrappers that simply called abort with a "not yet
implemented" message. This includes all of the level-2 banded and packed
routines.
- Replaced the aforementioned with the corresponding netlib implementations
having been run through f2c (with some customization).
- Added directories named 'attic' to build/gen-make-frags/ignore_list.
Details:
- Moved the Fortran-77 name-mangling macros from bl2_blas_macro_defs.h to the
configuration directory (bl2_config.h, specifically) given that they can be
expected to be tweaked by some developers.
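For context, a hypothetical example of the kind of name-mangling macro being moved. Appending a single trailing underscore matches the default convention of many Unix Fortran compilers (gfortran, for example), but other compilers differ, which is why such macros belong in a configurable header. The macro and function names are illustrative only.

```c
/* Hypothetical mangling macro: append one trailing underscore. */
#define SKETCH_F77_MANGLE( name ) name ## _

/* Defines a function whose linker symbol is "sketch_ddot_".
   Fortran passes arguments by reference; positive increments assumed. */
double SKETCH_F77_MANGLE( sketch_ddot )( const int *n,
                                         const double *x, const int *incx,
                                         const double *y, const int *incy )
{
    double rho = 0.0;
    for ( int i = 0; i < *n; ++i )
        rho += x[ i * *incx ] * y[ i * *incy ];
    return rho;
}
```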
Details:
- Added the blas2blis compatibility layer, located in frame/compat. This
includes virtually all of the BLAS, including banded and packed level-2
operations.
- Defined bl2_init_safe(), bl2_finalize_safe(). The former allows a conditional
initialization, which stores the "exit status" in an err_t, which is then
read by the latter function to determine whether finalization should actually
take place.
- Added calls to bl2_init_safe(), bl2_finalize_safe() to all level-2 and
level-3 BLAS-like wrappers.
- Added configuration option to instruct BLIS to remain initialized whenever
it automatically initializes itself (via bl2_init_safe()), until/unless the
application code explicitly calls bl2_finalize().
- Added INSERT_GENTFUNC* and INSERT_GENTPROT* macros to facilitate type
templatization of blas2blis wrappers.
- Defined level-0 scalar macro bl2_??swaps().
- Defined level-1v operation bl2_swapv().
- Added some "Fortran" type definitions to bl2_type_defs.h for use with BLAS
wrappers.
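A minimal sketch of the init_safe()/finalize_safe() protocol described above, including the option to stay initialized. All names, including the SKETCH_STAY_AUTO_INITIALIZED switch, are hypothetical; only the control flow is meant to mirror the description.

```c
#include <stdbool.h>

/* Hypothetical status values recorded by init_safe(). */
typedef enum { SKETCH_DID_INIT, SKETCH_WAS_ALREADY_INIT } sketch_err_t;

static bool sketch_is_init    = false;
static int  sketch_init_count = 0;   /* counts real initializations */

static void sketch_init( void )     { sketch_is_init = true; ++sketch_init_count; }
static void sketch_finalize( void ) { sketch_is_init = false; }

/* Initialize only if needed, recording whether this call did the work. */
static void sketch_init_safe( sketch_err_t *status )
{
    if ( sketch_is_init ) { *status = SKETCH_WAS_ALREADY_INIT; return; }
    sketch_init();
    *status = SKETCH_DID_INIT;
}

/* Finalize only if the matching init_safe() call initialized, unless the
   (hypothetical) stay-initialized option is enabled. */
static void sketch_finalize_safe( sketch_err_t status )
{
#ifdef SKETCH_STAY_AUTO_INITIALIZED
    ( void )status;   /* remain initialized until the app calls finalize */
#else
    if ( status == SKETCH_DID_INIT ) sketch_finalize();
#endif
}

/* A BLAS-like wrapper brackets its work with the pair. */
static void sketch_wrapper( void )
{
    sketch_err_t init_result;
    sketch_init_safe( &init_result );
    /* ... perform the operation ... */
    sketch_finalize_safe( init_result );
}
```

If the application has already initialized the library itself, the wrapper leaves it initialized; otherwise the wrapper cleans up after itself.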
Details:
- Some of the scalars of Hermitian operations, such as alpha in her,
alpha and beta in herk, and beta in her2k, need to be real. These
arguments were typed incorrectly as the complex types. This has been
fixed. Note the issue was only present in the BLAS-like APIs for
these operations (not the native object-based interfaces).
Details:
- Changed the interface of packm_init_pack() so that mult_m and mult_n
are passed in as type blksz_t* instead of dim_t.
- Made a similar change for packv_init_pack().
Details:
- Removed diagx parameter from lower-level interfaces of scalm.
- Modified scalm_basic_check() to expect an object with a nonunit diagonal.
- Changed setm_unb_var1() so that having an implicit unit diagonal results
in only the strictly lower or upper triangle of the matrix being modified.
Details:
- Changed name of set0 and set0_mxn macros to set0s and set0s_mxn,
respectively.
- Added code to the following operations that sets the output operand to
zero if the corresponding scalar is zero (rather than performing the
floating-point multiply, or in the case of setv, copying the value).
This will prevent NaNs and Infs from creeping into results from
uninitialized memory.
- axpy
- dotxv
- scalv
- scal2v
- setv
- gemv
- ger
- hemv
- her
- her2
- gemm reference ukernels
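A minimal sketch of the zero-scalar check for scalv; the function name is hypothetical. Under IEEE 754 arithmetic, 0 * NaN and 0 * Inf both yield NaN, so multiplying by a zero alpha would not clean up garbage left in uninitialized memory, whereas overwriting with zero does.

```c
#include <stddef.h>

/* x := alpha * x. When alpha is zero, overwrite with zeros instead of
   multiplying, so NaN or Inf values in x do not survive. */
static void sketch_dscalv( size_t n, double alpha, double *x )
{
    if ( alpha == 0.0 )
    {
        for ( size_t i = 0; i < n; ++i ) x[ i ] = 0.0;
        return;
    }
    for ( size_t i = 0; i < n; ++i ) x[ i ] *= alpha;
}
```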
Details:
- Added bl2_packm_unb_var1() back into the mix once I realized that level-2
operations still need this routine for packing matrices. Now, whether
level-2 operations should be packing matrices to begin with is another
matter. But this fixes the segmentation fault one would have gotten when
running bl2_gemv() on a general stride matrix.
Details:
- Added new fields to obj_t info field:
- invert_diag
- pack_order_if_upper
- pack_order_if_lower
These fields allow packm_init() to embed information that begins
in the control tree into the object so that the packm implementation
does not need to use control trees at all. This is being done to aid
Bryan's DxT code generation.
- Added macros that operate on above fields.
- Changed packm_init(), packm_blk_var2(), and packm_blk_var3() according
to above changes.
- Made similar (but much simpler) changes to packv.
- Deprecated packm_blk_var1(), packm_unb_var1(), and packm_densify().
These were part of prototype implementations and are no longer needed.