Details:
- Removed four trailing spaces after "BLIS" that occurs in most files'
commented-out license headers.
- Added UT copyright lines to some files. (These files previously had
only AMD copyright lines but were contributed to by both UT and AMD.)
- In some files' copyright lines, expanded 'The University of Texas' to
'The University of Texas at Austin'.
- Fixed various typos/misspellings in some license headers.
Details:
- Replaced critical sections that were conditional upon multithreading
being enabled (via pthreads or OpenMP) with unconditional use of
pthreads mutexes. (Why pthreads? Because BLIS already requires it
for its initialization mechanism: pthread_once().) This was done in
bli_error.c, bli_gks.c, bli_l3_ind.c. Also, replaced usage of BLIS's
mtx_t object and bli_mutex_*() API with pthread mutexes in
bli_thread.c. The previous status quo could result in a race condition
if the application called BLIS from more than one thread. The new
pthread-based code should be completely agnostic to the application's
threading configuration. Thanks to AMD for bringing to our attention
the need for a thread-safety review.
- Added an option to the testsuite to simulate application-level
multithreading. Specifically, each thread maintains a counter that is
incremented after each experiment. The thread only executes the
experiment if: counter % n_threads == thread_id. In other words, the
threads simply take turns executing each problem experiment. Also,
POSIX guarantees that fprintf() will not intermingle output, so
output was switched to fprintf() instead of libblis_test_fprintf().
- Changed membrk_t objects to use pthread_mutex_t intead of mtx_t and
replaced use of bli_mutex_init()/_finalize() in bli_membrk.c with
wrappers to pthread_mutex_init()/_destroy().
- Changed the implementation of bli_l3_ind_oper_enable_only() to fix
a race condition; specifically, two threads calling the function with
the same parameters could lead to a non-deterministic outcome.
- Added #include <pthread.h> to bli_cpuid.c and moved the same in
bli_arch.c.
- Added 'const' to declaration of OPT_MARKER in bli_getopt.c.
- Added #include <pthread.h> to bli_system.h.
- Added add-copyright.py script to automate adding new copyright lines
to (and updating existing lines of) source files.
Details:
- Removed the line in each operation entry in input.operations titled
"test sequential front-end" and the corresponding support for the lines
in the testsuite input parsing code. This line was included in the some
of the earliest versions of the testsuite, back when I intended to
eventually have separate multithreaded APIs. Specifically, I envisioned
that multithreaded and sequential testing could be enabled or disabled
on an operation level. However, BLIS evolved in a different direction
and still does not have multithreaded-specific APIs (even if it will
eventually someday). But even if it did have such APIs, I doubt I would
allow the user to enable/disable them on an operation level. Thus, this
was a zombie future parameter that was never used and never made sense
to begin with. The one instance of the front_seq variable, used in the
various libblis_test_<operation>() functions to guard the call to the
operation test driver, that remains was commented out instead of
deleted so that someday it could be easily changed via sed, if desired.
- Various minor cleanups to the testsuite code, including consolidating
use of DISABLE and DISABLE_ALL and reexpressing certain conditional
expressions in the libblis_test_<operation>() functions in terms of
boolean functions.
Details:
- Converted most C preprocessor macros in bli_param_macro_defs.h and
bli_obj_macro_defs.h to static functions.
- Reshuffled some functions/macros to bli_misc_macro_defs.h and also
between bli_param_macro_defs.h and bli_obj_macro_defs.h.
- Changed obj_t-initializing macros in bli_type_defs.h to static
functions.
- Removed some old references to BLIS_TWO and BLIS_MINUS_TWO from
bli_constants.h.
- Whitespace changes in select files (four spaces to single tab).
Details:
- Updated the testsuite driver so that setting one or more individual
operation test switches to "2" in input.operations will enable ONLY
those operations and disable all others, regardless of the values of
the section overrides and other operation switches. This makes it
every easy to quickly test only one or two operations, and equally
easy to revert back to the previous combination of operation tests.
- Added more comments to input.operations describing the use of
individual "enable only" overrides.
Details:
- Relaxed the base pointer and leading dimension alignment restrictions
in the sandybridge gemm microkernels, allowing the use of vmovups/vmovupd
instead of vmovaps/vmovapd. These change mimic those made to the haswell
microkernels in e0d2fa0 and ee2c139.
- Updated testsuite modules as well as standalone test drivers in 'test'
directory to use DBL_MAX as the initial time candidate. Thanks to Devin
Matthews for suggesting this change.
- Inserted #include "float.h" into bli_system.h (to gain access to DBL_MAX).
- Minor update (vis-a-vis contexts) to driver code in test/3m4m.
Details:
- Defined a new randomization operation, randn, on vectors and matrices.
The randnv and randnm operations randomize each element of the target
object with values from a narrow range of values. Presently, those
values are all integer powers of two, but they do not need to be powers
of two in order to achieve the primary goal, which is to initialize
objects that can be operated on with plenty of precision "slack"
available to allow computations that avoid roundoff. Using this method
of randomization makes it much more likely that testsuite residuals of
properly-functioning operations are close to zero, if not exactly zero.
- Updated existing randomization operations randv and randm to skip
special diagonal handling and normalization for matrices with structure.
This is now handled by the testsuite modules by explicitly calling a
testsuite function that loads the diagonal (and scales off-diagonal
elements).
- Added support for randnv and randnm in the testsuite with a new switch
in input.general that universally toggles between use of the classic
randv/randm, which use real values on the interval [-1,1], and
randnv/randnm, which use only values from a narrow range. Currently,
the narrow range is: +/-{2^0, 2^-1, 2^-2, 2^-3, 2^-4, 2^-5, 2^-6}, as
well as 0.0.
- Updated testsuite modules so that a testsutie wrapper function is called
instead of directly calling the randomization operations (such as
bli_randv() and bli_randm()). This wrapper also takes a bool_t that
indicates whether the object's elements should be normalized. (NOTE: As
alluded to above, in the test modules of triangular solve operations such
as trsv and trsm, we perform the extra step of loading the diagonal.)
- Defined a new level-0 operation, invertsc, which inverts a scalar.
- Updated the abval2ris and sqrt2ris level-0 macros to avoid an unlikely
but possible divide-by-zero.
- Updated function signature and prototype formatting in testsuite.
Details:
- Updated copyright headers to include "at Austin" in the name of the
University of Texas.
- Updated the copyright years of a few headers to 2014 (from 2011 and
2012).
Details:
- Completely reoganized norm operations:
- Renames:
- fnormsc, fnormv, fnormm -> normfsc, normfv, normfm (2-norm)
- absumv -> norm1v (vector 1-norm)
- New operations:
- norm1m (matrix 1-norm)
- normiv, normim (infinity-norm)
- amaxv (BLAS-like absolute maximum value index)
- asumv (BLAS-like absolute sum)
- Deprecated absumm, as it did not correspond to any actual norm.
(However, an inlined version now exists in the testsuite module for
randm.)
Details:
- Added the ability to induce complex domain level-3 operations via new
virtual complex micro-kernels which are implemented via only real
domain micro-kernels. Two new implementations are provided: 4m and 3m.
4m implements complex matrix multiplication in terms of four real
matrix multiplications, where as 3m uses only three and thus is
capable of even higher (than peak) performance. However, the 3m method
has somewhat weaker numerical properties, making it less desirable
in general.
- Further refined packing routines, which were recently revamped, and
added packing functionality for 4m and 3m.
- Some modifications to trmm and trsm macro-kernels to facilitate indexing
into micro-panels which were packed for 4m/3m virtual kernels.
- Added 4m and 3m interfaces for each level-3 operation.
- Various other minor changes to facilitate 4m/3m methods.
Details:
- Added infrastructure to support a new scalar representation, whereby
every object contains an internal scalar that defaults to 1.0. This
facilitates passing scalars around without having to house them in
separate objects. These "attached" scalars are stored in the internal
atom_t field of the obj_t struct, and are always stored to be the same
datatype as the object to which they are attached. Level-3 variants no
longer take scalar arguments, however, level-3 internal back-ends stll
do; this is so that the calling function can perform subproblems such
as C := C - alpha * A * B on-the-fly without needing to change either
of the scalars attached to A or B.
- Removed scalar argument from packm_int().
- Observe and apply attached scalars in scalm_int(), and removed scalar
from interface of scalm_unb_var1().
- Renamed the following functions (and corresponding invocations):
bli_obj_init_scalar_copy_of()
-> bli_obj_scalar_init_detached_copy_of()
bli_obj_init_scalar() -> bli_obj_scalar_init_detached()
bli_obj_create_scalar_with_attached_buffer()
-> bli_obj_create_1x1_with_attached_buffer()
bli_obj_scalar_equals() -> bli_obj_equals()
- Defined new functions:
bli_obj_scalar_detach()
bli_obj_scalar_attach()
bli_obj_scalar_apply_scalar()
bli_obj_scalar_reset()
bli_obj_scalar_has_nonzero_imag()
bli_obj_scalar_equals()
- Placed all bli_obj_scalar_* functions in a new file, bli_obj_scalar.c.
- Renamed the following macros:
bli_obj_scalar_buffer() -> bli_obj_buffer_for_1x1()
bli_obj_is_scalar() -> bli_obj_is_1x1()
- Defined new macros to set and copy internal scalars between objects:
bli_obj_set_internal_scalar()
bli_obj_copy_internal_scalar()
- In level-3 internal back-ends, added conditional blocks where alpha and
beta are checked for non-unit-ness. Those values for alpha and beta are
applied to the scalars attached to aliases of A/B/C, as appropriate,
before being passed into the variant specified by the control tree.
- In level-3 blocked variants, pass BLIS_ONE into subproblems instead of
alpha and/or beta.
- In level-3 macro-kernels, changed how scalars are obtained. Now, scalars
attached to A and B are multiplied together to obtain alpha, while beta
is obtained directly from C.
- In level-3 front-ends, removed old function calls meant to provide
future support for mixed domain/precision. These can be added back later
once that functionality is given proper treatment. Also, removed the
creating of copy-casts of alpha and beta since typecasting of scalars
is now implicitly handled in the internal back-ends when alpha and
beta are applied to the attached scalars.
Details:
- Added new lines of input to the test suite's input.operations file, which
allows the user to disable entire sections (levels) of tests. Before this
change, the user had to manually disable each operation tests's "master
switch". (This is why input.operations.0 existed: to allow a more
convenient starting point for someone who only wanted to test one or a
few operations.)
Details:
- Added a new cpp macro in bli_config.h, BLIS_INT_TYPE_SIZE, which can be
assigned values of 32, 64, or some other value. The former two result in
defining gint_t/guint_t in terms of 32- or 64-bit integers, while the latter
causes integers to be defined in terms of a default type (e.g. long int).
- Updated bli_config.h in reference and clarksville configurations according
to above changes.
- Updated test drivers in test and testsuite to avoid type warnings associated
with format specifiers not matching the types of their arguments to printf()
and scanf().
- Inserted missing #include "bli_system.h" into blis.h (which was slated for
inclusion in d141f9eeb6).
- Added explicit typecasting of dim_t and inc_t to macros in
bli_blas_macro_defs.h (which are used in BLAS compatibility layer).
- Slight changes to CREDITS and INSTALL files.
- Slight tweaks to Windows build system, mostly in the form of switching to
Windows-style CRLF newlines for certain files.
Details:
- Fixed bugs in trmv_l and trsv_u due to backwards iteration resulting in
unaligned subpartitions. We were already going out of our way a bit to
handle edge cases in the first iteration for blocked variants, and this
was simply the unblocked-fused extension of that idea.
- Fixed control tree handling in her/her2/syr/syr2 that was not taking
into account how the choice of variant needed to be altered for
upper-stored matrices (given that only lower-stored algorithms are
explicitly implemented).
- Added bli_determine_blocksize_dim_f(), bli_determine_blocksize_dim_b()
macros to provide inlined versions of bli_determine_blocksize_[fb]() for
use by unblocked-fused variants.
- Integrated new blocksize_dim macros into gemv/hemv unf variants for
consistency with that of the bugfix for trmv/trsv (both of which now
use the same macros).
- Modified bli_obj_vector_inc() so that 1 is returned if the object is a
vector of length 1 (ie: 1 x 1). This fixes a bug whereby under certain
conditions (e.g. dotv_opt_var1), an invalid increment was returned, which
was invalid only because the code was expecting 1 (for purposes of
performing contiguous vector loads) but got a value greater than 1 because
the column stride of the object (e.g. rho) was inflated for alignment
purposes (albeit unnecessarily since there is only one element in the
object).
- Replaced some old invocations of set0 with set0s.
- Added alpha parameter to gemmtrsm ukernels for x86_64 and use accordingly.
- Fixed increment bug in cleanup loop of gemm ukernel for x86_64.
- Added safeguard to test modules so that testing a problem with a zero
dimension does not result in a failure.
- Tweaked handling of zero dimensions in level-2 and level-3 operations'
internal back-ends to correctly handle cases where output operand still
needs to be scaled (e.g. by beta, in the case of gemm with k = 0).
Details:
- Changed all filename and function prefixes from 'bl2' to 'bli'.
- Changed the "blis2.h" header filename to "blis.h" and changed all
corresponding #include statements accordingly.
- Fixed incorrect association for Fran in CREDITS file.
Details:
- Removed calls to setv() in test modules whose sole purpose was to
initialize vectors to zero to ensure that nan's and inf's would not
taint the computation. Now that beta == zero semantics have been
updated to clear the output operand (when beta is zero), rather than
multiply against it, these setv() calls are no longer needed.
Details:
- Added a highly configurable, unified test suite.
- Removed DUPB configuration constant from bl2_kernel.h and macro-kernel
header files. Now, instead, DUPB is computed as (NDUP != 1) within each
macro-kernel. This fixes a bug in trmm/trsm whereby bp was indexed into
incorrectly when DUPB was set to FALSE but the NDUP was still non-unit.
By encoding both pieces of information into one constant in _kernel.h,
it seems somewhat less likely others will encounter this bug in the
future.
- Added level-2 cache blocksizes to _kernel.h for reference configuration,
and defined blocksizes in _cntl.c files to these default values.
- Changed semantics of her2k and syr2k such that these operations no longer
expect the B matrix to already be conjugate-transposed (or just transposed
for syr2k). However, these semantics are preserved for the internal
mechanics of the implementations, including the internal back-end and all
blocked variants.
- Inserted checks for real-valued alpha and beta for herk/her2k and herk,
respectively.
- Relaxed general object structure constraints in _basic_check() for gemv, ger.
- Changed her front-end to NOT copy-cast to real projection; instead, this is
replaced by selecting either the real part or both parts within the unblocked
algorithm implementation, depending on the value of conjh.
- Added conjh to all _check routines for her so that the code knows when to
verify that alpha has an imaginary component equal to zero (for her, but
not syr).
- Changed control tree for her to forgo packing.
- Added unit diagonal support to fnormm.
- Redefined real versions of abval2s macros in terms of fabs(), fabsf().
- Redefined complex versions of sqrt2s macros using the actual "complex square
root" formula.
- Created new level-0 object-based routines, suffixed with "sc" (for "scalar").
- Defined new level-1v, -1d, and -1m versions of add and sub operations
(two-operand add and subtract).
- Added new scalar macros:
- getris: acquire real and imaginary components.
- setris: set real and imaginary components.
- addjs: addition with conjugated x.
- subjs: subtraction with conjugated x.
- Defined new utility operations:
- absumv: element-wise sum of absolute values for vector elements.
- absumm: element-wise sum of absolute values for matrix elements.
- mkherm: convert existing matrix to Hermitian.
- mksymm: convert existing matrix to symmetric.
- mktrim: convert existing matrix to triangular.
- Added various error checking routines.
- Added bl2_clock_min_diff(), which is used to more cleanly measure the
wall clock time of a code block.
- Added general stride support to bl2_obj_alloc_buffer().
- Added bl2_obj_init_scalar().
- Updated parameter mapping in bl2_param_map.c.
- Added support for queriable version string.
- Fixed a bug in the her2k macro-kernels (which currently are simply
implemented in terms of two invocations of herk) whereby beta was being
applied to both the first and second rank-k updates, rather than only
the first.
- Fixed a bug in trmm/trsm whereby transpose and right side cases were not
properly implemented due to erroneous assumptions regarding aliasing and
root objects.
- Fixed a bug in the upper triangular trsm macro-kernel in which the wrong
MR x NR block of B was being updated.
- Fixed a bug in the inverts macro in the double real case whereby the
value was typecast to float before inversion. This affected non-unit cases
of dtrsm.
- Fixed a bug in the reference kernels for gemmtrsm whereby the minus one
constant was being applied incorrectly.
- Fixed a bug in the overall treatment of non-unit alpha for trsm. The code
now mimics the rank-k strategy of gemm, whereby alpah is applied during
the first iteration of variant 3, with BLIS_ONE passed in instead for
subsequent iterations. This also required passing alpha into the macro-
kernels as well as the fused gemmtrsm micro-kernels.
- Fixed a bug in trsm_u_blk_var1 whereby the gemm macro-kernel was being
called for blocks strictly above the diagonal. While this sounds good in
theory, this cannot be done because gemm_ker_var2 expects row panels of
A to be packed from top to bottom, while for trsm_u, A is actually packed
from bottom to top due to the reverse (BR->TL) nature of the algorithm.
- Fixed a bug in packm_cxk() whereby panel packings with unit panel
dimensions were mishandled due to incorrect arguments to the copyv kernel.
Also changed the copyv kernel invocation to scal2v so that these edge
cases are properly handled when scaling is requested.
- Fixed a bug in packv_int() whereby an uninitialized object is passed in
instead of the source object.
- Fixed a bug whereby level-2 code could allocate memory dynamically via
bl2_malloc() and then attempt to free it via bl2_mm_release(). Also fixed
a potential future bug whereby a mem_t object that is actually no longer
"allocated" from the static pool is mistaken for being allocated due to
failure to NULLify the buffer when the block was most recently released.
- Fixed a bug in bl2_acquire_mpart_*() whreby the uplo field was mistakenly
toggled when the requested subpartition needed to be "reflected" due to it
residing in an unstored region.