Details:
- Replaced critical sections that were conditional upon multithreading
being enabled (via pthreads or OpenMP) with unconditional use of
pthreads mutexes. (Why pthreads? Because BLIS already requires it
for its initialization mechanism: pthread_once().) This was done in
bli_error.c, bli_gks.c, bli_l3_ind.c. Also, replaced usage of BLIS's
mtx_t object and bli_mutex_*() API with pthread mutexes in
bli_thread.c. The previous status quo could result in a race condition
if the application called BLIS from more than one thread. The new
pthread-based code should be completely agnostic to the application's
threading configuration. Thanks to AMD for bringing to our attention
the need for a thread-safety review.
- Added an option to the testsuite to simulate application-level
multithreading. Specifically, each thread maintains a counter that is
incremented after each experiment. The thread only executes the
experiment if: counter % n_threads == thread_id. In other words, the
threads simply take turns executing each problem experiment. Also,
POSIX guarantees that fprintf() will not intermingle output, so
output was switched to fprintf() instead of libblis_test_fprintf().
- Changed membrk_t objects to use pthread_mutex_t instead of mtx_t and
replaced use of bli_mutex_init()/_finalize() in bli_membrk.c with
wrappers to pthread_mutex_init()/_destroy().
- Changed the implementation of bli_l3_ind_oper_enable_only() to fix
a race condition; specifically, two threads calling the function with
the same parameters could lead to a non-deterministic outcome.
- Added #include <pthread.h> to bli_cpuid.c and moved the same in
bli_arch.c.
- Added 'const' to declaration of OPT_MARKER in bli_getopt.c.
- Added #include <pthread.h> to bli_system.h.
- Added add-copyright.py script to automate adding new copyright lines
to (and updating existing lines of) source files.
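The turn-taking scheme described above for the testsuite can be sketched
roughly as follows. This is only an illustration under stated assumptions,
not the testsuite's code: worker_arg_t, run_all_experiments(),
all_ran_exactly_once(), and the fixed counts are hypothetical names, and the
tally array merely stands in for running an experiment. The mutex is a plain
pthread mutex that is used unconditionally.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define N_THREADS     4   /* hypothetical thread count */
#define N_EXPERIMENTS 10  /* hypothetical experiment count */

/* Shared state guarded by an unconditionally used pthread mutex. */
static pthread_mutex_t tally_mutex = PTHREAD_MUTEX_INITIALIZER;
static int times_run[N_EXPERIMENTS];

typedef struct { int thread_id; int n_threads; } worker_arg_t;

/* Each thread walks the whole experiment list but only executes the
   experiments for which counter % n_threads == thread_id. */
static void *worker(void *p)
{
    const worker_arg_t *arg = p;
    int counter = 0;

    assert(arg->n_threads > 0);
    for (int e = 0; e < N_EXPERIMENTS; ++e) {
        if (counter % arg->n_threads == arg->thread_id) {
            pthread_mutex_lock(&tally_mutex);
            times_run[e] += 1;  /* stand-in for running experiment e */
            pthread_mutex_unlock(&tally_mutex);
        }
        ++counter;  /* incremented after each experiment, run or skipped */
    }
    return NULL;
}

/* Starts the workers and waits for them; returns 0 on success. */
int run_all_experiments(void)
{
    pthread_t    threads[N_THREADS];
    worker_arg_t args[N_THREADS];
    int          started = 0;

    for (; started < N_THREADS; ++started) {
        args[started].thread_id = started;
        args[started].n_threads = N_THREADS;
        if (pthread_create(&threads[started], NULL, worker,
                           &args[started]) != 0)
            break;
    }
    for (int t = 0; t < started; ++t)
        pthread_join(threads[t], NULL);
    return started == N_THREADS ? 0 : -1;
}

/* Returns 1 if every experiment was executed by exactly one thread. */
int all_ran_exactly_once(void)
{
    for (int e = 0; e < N_EXPERIMENTS; ++e)
        if (times_run[e] != 1) return 0;
    return 1;
}
```

Calling run_all_experiments() once and then all_ran_exactly_once() should
confirm that the threads partition the experiments without overlap.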
Details:
- Created a new test suite that exercises only the BLAS compatibility
found in BLIS. The test suite is a straightforward port of code
obtained from netlib LAPACK, run through f2c and linked to a stripped-
down version of libf2c that is compiled along with the test drivers
(to prevent any obvious ABI issues). The new BLAS test suite can be
run from within its new local directory, 'blastest' (through its local
'make ; make run' targets) or from the top-level Makefile (via the
'make testblas' target). Output files are created in whatever directory
the test drivers are run, whether it be the 'blastest' directory, the
top-level source distribution directory, or the out-of-tree directory
in which 'configure' was run. Also, the results of the BLAS test suite
can be checked via 'make checkblas', which summarizes the presence or
absence of test failures in a single line printed to stdout.
- Updated the 'test' target to run both 'testblis' and 'testblas'.
- Added a new 'testblis-fast' target that runs the BLIS testsuite with
smaller problem sizes, allowing it to finish more quickly.
- Added a 'make check' target, which runs 'checkblis-fast' and
'checkblas'.
- Changed .travis.yml so that Travis CI runs 'testblis-fast' instead of
'testblis' (before calling the check-blistest.sh script to check the
result manually).
- Renamed some targets in the top-level Makefile to be consistent between
BLAS and BLIS.
Details:
- Changed "vector storage schemes to test" parameter in testsuite's
input.general file to "cj". This means that both unit stride column
vectors and non-unit stride column vectors will be tested in
operations with vector operands (e.g. level-1v, level-1f, level-2).
- Very minor comment (typo) changes to input.operations.
Details:
- Reworked the build system around a configuration registry file, named
'config_registry', that identifies valid configuration targets, their
constituent sub-configurations, and the kernel sets that are needed by
those sub-configurations. The build system now facilitates the building
of a single library that can contain kernels and cache/register
blocksizes for multiple configurations (microarchitectures). Reference
kernels are also built on a per-configuration basis.
- Updated the Makefile to use new variables set by configure via the
config.mk.in template, such as CONFIG_LIST, KERNEL_LIST, and KCONFIG_MAP,
in determining which sub-configurations (CONFIG_LIST) and kernel sets
(KERNEL_LIST) are included in the library, and which make_defs.mk files'
CFLAGS (KCONFIG_MAP) are used when compiling kernels.
- Reorganized 'kernels' directory into a "flat" structure. Renamed kernel
functions into a standard format that includes the kernel set name
(e.g. 'haswell'). Created a "bli_kernels_<kernelset>.h" file in each
kernels sub-directory. These files exist to provide prototypes for the
kernels present in those directories.
- Reorganized reference kernels into a top-level 'ref_kernels' directory.
This directory includes a new source file, bli_cntx_ref.c (compiled on
a per-configuration basis), that defines the code needed to initialize
a reference context and a context for induced methods for the
microarchitecture in question.
- Rewrote make_defs.mk files in each configuration so that the compiler
variables (e.g. CFLAGS) are "stored" (renamed) on a per-configuration
basis.
- Modified bli_config.h.in template so that bli_config.h is generated with
#defines for the config (family) name, the sub-configurations that are
associated with the family, and the kernel sets needed by those
sub-configurations.
- Deprecated all kernel-related information in bli_kernel.h and transferred
what remains to new header files named "bli_arch_<configname>.h", which
are conditionally #included from a new header bli_arch.h. These files
are still needed to set library-wide parameters such as custom
malloc()/free() functions or SIMD alignment values.
- Added bli_cntx_init_<configname>.c files to each configuration directory.
The files contain a function, named the same as the file, that initializes
a "native" context for a particular configuration (microarchitecture). The
idea is that optimized kernels, if available, will be initialized into
these contexts. Other fields will retain pointers to reference functions,
which will be compiled on a per-configuration basis. These bli_cntx_init_*()
functions will be called during the initialization of the global kernel
structure. They are thought of as initializing for "native" execution, but
they also form the basis for contexts that use induced methods. These
functions are prototyped, along with their _ref() and _ind() brethren, by
prototype-generating macros in bli_arch.h.
- Added a new typedef enum in bli_type_defs.h to define an arch_t, which
identifies the various sub-configurations.
- Redesigned the global kernel structure (gks) around a 2D array of cntx_t
structures (pointers to cntx_t, actually). The first dimension is indexed
over arch_t and the inner dimension is the ind_t (induced method) for
each microarchitecture. When a microarchitecture (configuration) is
"registered" at init-time, the inner array for that configuration in the
2D array is initialized (and allocated, if it hasn't been already). The
cntx_t slot for BLIS_NAT is initialized immediately and those for other
induced method types are initialized and cached on-demand, as needed. At
cntx_t registration, we also store function pointers to cntx_init functions
that will initialize (a) "reference" contexts and (b) contexts for use with
induced methods. We don't cache the full contexts for reference contexts
since they are rarely needed. The functions that initialize these two kinds
of contexts are generated automatically for each targeted sub-configuration
from cpp-templatized code at compile-time. Induced method contexts that
need "stage" adjustments can still obtain them via functions in
bli_cntx_ind_stage.c.
- Added new functions and functionality to bli_cntx.c, such as for setting
the level-1f, level-1v, and packm kernels, and for converting a native
context into one for executing an induced method.
- Moved the checking of register/cache blocksize consistency from being cpp
macros in bli_kernel_macro_defs.h to being runtime checks defined in
bli_check.c and called from bli_gks_register_cntx() at the time that the
global kernel structure's internal context is initialized for a given
microarchitecture/configuration.
- Deprecated all of the old per-operation bli_*_cntx.c files and removed
the previous operation-level cntx_t_init()/_finalize() invocations.
Instead, we now query the gks for a suitable context, usually via
bli_gks_query_cntx().
- Deprecated support for the 3m2 and 3m3 induced methods. (They required
hackery that I was no longer willing to support.)
- Consolidated the 1e and 1r packm kernels for any given register blocksize
into a single kernel that will branch on the schema and support packing
to both formats.
- Added the cntx_t* argument to all packm kernel signatures.
- Deprecated the local function pointer array in all bli_packm_cxk*.c files
and instead obtain the packm kernel from the cntx_t.
- Added bli_calloc_intl(), which serves as the calloc-equivalent to
bli_malloc_intl(). Useful when we wish to allocate and initialize to
zero/NULL.
- Converted existing cpp macro functions defined in bli_blksz.h, bli_func.h,
bli_cntx.h into static functions.
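The 2D layout of the global kernel structure described above (outer index:
sub-configuration; inner index: induced method; native slot filled at
registration, other slots filled on demand) might be pictured with the
following sketch. All names here (arch_id, ind_id, ctx, gks_register(),
gks_query()) are hypothetical stand-ins and do not reflect the actual BLIS
types or signatures; the mutex is an assumption added so the sketch is safe
to call from several threads.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for arch_t, ind_t, and cntx_t. */
typedef enum { ARCH_A, ARCH_B, NUM_ARCHS } arch_id;
typedef enum { IND_1M, IND_NAT, NUM_INDS } ind_id;
typedef struct { arch_id arch; ind_id method; } ctx;

static ctx **gks[NUM_ARCHS];  /* outer index: arch; inner index: method */
static pthread_mutex_t gks_mutex = PTHREAD_MUTEX_INITIALIZER;

static ctx *ctx_create(arch_id a, ind_id m)
{
    ctx *c = malloc(sizeof *c);
    if (c != NULL) { c->arch = a; c->method = m; }
    return c;
}

/* Registers an arch: allocates its inner array if that has not happened
   yet and initializes the native slot immediately. Returns 0 on success. */
int gks_register(arch_id a)
{
    int status = 0;

    assert(a < NUM_ARCHS);
    pthread_mutex_lock(&gks_mutex);
    if (gks[a] == NULL)
        gks[a] = calloc(NUM_INDS, sizeof *gks[a]);
    if (gks[a] == NULL)
        status = -1;
    else if (gks[a][IND_NAT] == NULL &&
             (gks[a][IND_NAT] = ctx_create(a, IND_NAT)) == NULL)
        status = -1;
    pthread_mutex_unlock(&gks_mutex);
    return status;
}

/* Returns the cached context for (a, m), creating and caching non-native
   contexts on first use. Returns NULL if the arch was never registered
   or if allocation fails. */
ctx *gks_query(arch_id a, ind_id m)
{
    ctx *c = NULL;

    assert(a < NUM_ARCHS && m < NUM_INDS);
    pthread_mutex_lock(&gks_mutex);
    if (gks[a] != NULL) {
        if (gks[a][m] == NULL)
            gks[a][m] = ctx_create(a, m);
        c = gks[a][m];
    }
    pthread_mutex_unlock(&gks_mutex);
    return c;
}
```

In this sketch, repeated queries for the same (arch, method) pair return the
same cached pointer, while an unregistered arch yields NULL.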
Details:
- Disabled testsuite tests of all level-3 implementations based on 3m
and 4m. This will improve testing runtime on Travis CI as well as for
anyone manually running the testsuite using default test parameters.
Thanks to Devin Matthews for suggesting this change.
Details:
- Implemented the 1m method for inducing complex domain matrix
multiplication. 1m support has been added to all level-3 operations,
including trsm, and is now the default induced method when native
complex domain gemm microkernels are omitted from the configuration.
- Updated _cntx_init() operations to take a datatype parameter. This was
needed for the corresponding function for 1m (because 1m requires us
to choose between column-oriented or row-oriented execution, which
requires us to query the context for the storage preference of the
gemm microkernel, which requires knowing the datatype) but I decided
that it made sense for consistency to add the parameter to all other
cntx initialization functions as well, even though those functions
don't use the parameter.
- Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take
a second scalar for each blocksize entry. The semantic meaning of the
two scalars now is that the first will scale the default blocksize
while the second will scale the maximum blocksize. This allows scaling
the two independently, and was needed to support 1m, which requires
scaling for a register blocksize but not the register storage
blocksize (i.e., "packdim") analogue.
- Deprecated bli_blksz_reduce_dt_to() and defined two new functions,
bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing
default and maximum blocksizes to some desired blocksize multiple.
These functions are needed in the updated definitions of
bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs().
- Added support for the 1e and 1r packing schemas to packm, including
1e/1r packing kernels.
- Added a minor optimization to bli_gemm_ker_var2() that allows, under
certain circumstances (specifically, real domain beta and row- or
column-stored matrix C), the real domain macrokernel and microkernel
to be called directly, rather than using the virtual microkernel
via the complex domain macrokernel, which carries a slight additional
amount of overhead.
- Added 1m support to the testsuite.
- Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified
some code in test_gemm.c driver.
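As background for the 1e and 1r schemas mentioned above, the following
element-level sketch shows the general idea behind 1m as it is usually
described: one operand is expanded ("1e") into a 2x2 real block, the other is
split ("1r") into stacked real and imaginary parts, and a purely real product
then yields the complex product. The type cplx and all function names are
hypothetical, and the storage order shown is an assumption for illustration;
it is not the micro-panel layout used by the packm kernels.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double re, im; } cplx;  /* hypothetical helper type */

/* "1e"-style expansion: a complex value becomes a 2x2 real block,
   stored row-major here as { re, -im, im, re }. */
void expand_1e(cplx a, double out[4])
{
    assert(out != NULL);
    out[0] = a.re; out[1] = -a.im;
    out[2] = a.im; out[3] =  a.re;
}

/* "1r"-style form: the real part stacked above the imaginary part. */
void split_1r(cplx b, double out[2])
{
    assert(out != NULL);
    out[0] = b.re;
    out[1] = b.im;
}

/* Real 2x2 by 2x1 product. The result, read as (re, im), equals a * b. */
cplx mul_via_1m(cplx a, cplx b)
{
    double A[4], x[2];
    cplx   c;

    expand_1e(a, A);
    split_1r(b, x);
    c.re = A[0] * x[0] + A[1] * x[1];  /* a.re*b.re - a.im*b.im */
    c.im = A[2] * x[0] + A[3] * x[1];  /* a.im*b.re + a.re*b.im */
    return c;
}
```

For example, (1 + 2i) * (3 + 4i) computed this way gives -5 + 10i using only
real multiplications and additions.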
Details:
- Altered control tree node struct definitions so that all nodes have the
same struct definition, whose primary fields consist of a blocksize id,
a variant function pointer, a pointer to an optional parameter struct,
and a pointer to a (single) sub-node. This unified control tree type is
now named cntl_t.
- Changed the way control tree nodes are connected, and what computation
they represent, such that, for example, packing operations are now
associated with nodes that are "inline" in the tree, rather than off-
shoot branches. The original tree for the classic Goto gemm algorithm was
expressed (roughly) as:
    blk_var2 -> blk_var3 -> blk_var1 -> ker_var2
                    |           |
                    -> packb    -> packa
and now, the same tree would look like:
    blk_var2 -> blk_var3 -> packb -> blk_var1 -> packa -> ker_var2
Specifically, the packb and packa nodes perform their respective packing
operations and then recurse (without any loop) to a subproblem. This means
there are now two kinds of level-3 control tree nodes: partitioning and
non-partitioning. The blocked variants are members of the former, because
they iteratively partition off submatrices and perform suboperations on
those partitions, while the packing variants belong to the latter group.
(This change has the effect of allowing greatly simplified initialization
of the nodes, which previously involved setting many unused node fields to
NULL.)
- Changed the way thrinfo_t tree nodes are arranged to mirror the new
connective structure of control trees. That is, packm nodes are no longer
off-shoot branches of the main algorithmic nodes, but rather connected
"inline".
- Simplified control tree creation functions. Partitioning nodes are created
concisely with just a few fields needing initialization. By contrast, the
packing nodes require additional parameters, which are stored in a
packm-specific struct that is tracked via the optional parameters pointer
within the control tree struct. (This parameter struct must always begin
with a uint64_t that contains the byte size of the struct. This allows
us to use a generic function to recursively copy control trees.) gemm,
herk, and trmm control tree creation continues to be consolidated into
a single function, with the operation family being used to select
among the parameter-agnostic macro-kernel wrappers. A single routine,
bli_cntl_free(), is provided to free control trees recursively, whereby
the chief thread within a group releases the blocks associated with
mem_t entries back to the memory broker from which they were acquired.
- Updated internal back-ends, e.g. bli_gemm_int(), to query and call the
function pointer stored in the current control tree node (rather than
index into a local function pointer array). Before being invoked, these
function pointers are first cast to a gemm_voft (for gemm, herk, or trmm
families) or trsm_voft (for trsm family) type, which is defined in
frame/3/bli_l3_var_oft.h.
- Retired herk and trmm internal back-ends, since all execution now flows
through gemm or trsm blocked variants.
- Merged forwards- and backwards-moving variants by querying the direction
from routines as a function of the variant's matrix operands. gemm and
herk always move forward, while trmm and trsm move in a direction that
is dependent on which operand (a or b) is triangular.
- Added functions bli_thread_get_range_mdim(), bli_thread_get_range_ndim(),
each of which takes additional arguments and hides complexity in managing
the difference between the way ranges are computed for the four families
of operations.
- Simplified level-3 blocked variants according to the above changes, so that
the only steps taken are:
1. Query partitioning direction (forwards or backwards).
2. Prune unreferenced regions, if they exist.
3. Determine the thread partitioning sub-ranges.
<begin loop>
4. Determine the partitioning blocksize (passing in the partitioning
direction)
5. Acquire the current iteration's partitions for the matrices affected
by the current variant's partitioning dimension (m, k, n).
6. Call the subproblem.
<end loop>
- Instantiate control trees once per thread, per operation invocation.
(This is a change from the previous regime in which control trees were
treated as stateless objects, initialized with the library, and shared
as read-only objects between threads.) This once-per-thread allocation
is done primarily to allow threads to use the control tree as a place
to cache certain data for use in subsequent loop iterations. Presently,
the only application of this caching is a mem_t entry for the packing
blocks checked out from the memory broker (allocator). If a non-NULL
control tree is passed in by the (expert) user, then the tree is copied
by each thread. This is done in bli_l3_thread_decorator(), in
bli_thrcomm_*.c.
- Added a new field to the context, an opid_t, which tracks the "family"
of the operation being executed. For example, gemm, hemm, and symm are
all part of the gemm family, while herk, syrk, her2k, and syr2k are
all part of the herk family. Knowing the operation's family is necessary
when conditionally executing the internal (beta) scalar reset on
C in blocked variant 3, which is needed for gemm and herk families,
but must not be performed for the trmm family (because beta has only
been applied to the current row-panel of C after the first rank-kc
iteration).
- Reexpressed 3m3 induced method blocked variant in frame/3/gemm/ind
to conform with the new control tree design, and renamed the macro-
kernel codes corresponding to 3m2 and 4m1b.
- Renamed bli_mem.c (and its APIs) to bli_memsys.c, and renamed/relocated
bli_mem_macro_defs.h from frame/include to frame/base/bli_mem.h.
- Renamed/relocated bli_auxinfo_macro_defs.h from frame/include to
frame/base/bli_auxinfo.h.
- Fixed a minor bug whereby the storage-to-ukr-preference matching
optimization in the various level-3 front-ends was not being applied
properly when the context indicated that execution would be via an
induced method. (Before, we always checked the native micro-kernel
corresponding to the datatype being executed, whereas now we check
the native micro-kernel corresponding to the datatype's real projection,
since that is the micro-kernel that is actually used by induced methods.)
- Added an option to the testsuite to skip the testing of native level-3
complex implementations. Previously, they were always tested, provided that
the c/z datatypes were enabled. However, some configurations use
reference micro-kernels for complex datatypes, and testing these
implementations can slow down the testsuite considerably.
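The unified node layout and the "inline" chain described above can be
sketched as follows. Everything here is a hypothetical stand-in (node_t,
packm_params_t, the stub variant functions, and the blocksize id values are
not the BLIS definitions); the sketch only illustrates why a parameter struct
that begins with its own byte size allows one generic routine to copy a whole
tree. Allocation failures are handled with assert() purely for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef void (*var_fp)(void);  /* hypothetical variant function type */

typedef struct node_s {
    int            bszid;     /* blocksize id; -1 for non-partitioning nodes */
    var_fp         var_func;  /* variant to call at this node */
    void          *params;    /* optional; begins with a uint64_t byte size */
    struct node_s *sub_node;  /* the single child, or NULL at the leaf */
} node_t;

typedef struct { uint64_t size; int schema; } packm_params_t;

/* Stub variants; real variants would take operands and recurse. */
static void blk_var1(void) {}
static void blk_var2(void) {}
static void blk_var3(void) {}
static void packm(void)    {}
static void ker_var2(void) {}

static node_t *node_create(int bszid, var_fp f, void *params, node_t *sub)
{
    node_t *n = malloc(sizeof *n);
    assert(n != NULL);  /* sketch only; real code would report the failure */
    n->bszid = bszid; n->var_func = f; n->params = params; n->sub_node = sub;
    return n;
}

static void *packm_params_create(int schema)
{
    packm_params_t *p = malloc(sizeof *p);
    assert(p != NULL);
    p->size   = sizeof *p;
    p->schema = schema;
    return p;
}

/* Builds blk_var2 -> blk_var3 -> packb -> blk_var1 -> packa -> ker_var2,
   working from the leaf upward. */
node_t *build_gemm_tree(void)
{
    node_t *t = node_create(-1, ker_var2, NULL, NULL);
    t = node_create(-1, packm, packm_params_create(0), t);  /* packa */
    t = node_create(0, blk_var1, NULL, t);
    t = node_create(-1, packm, packm_params_create(1), t);  /* packb */
    t = node_create(1, blk_var3, NULL, t);
    return node_create(2, blk_var2, NULL, t);
}

/* Generic recursive copy: the size header lets this routine duplicate the
   parameter struct without knowing its concrete type. */
node_t *node_copy(const node_t *src)
{
    void *params = NULL;

    if (src == NULL) return NULL;
    if (src->params != NULL) {
        uint64_t size;
        memcpy(&size, src->params, sizeof size);
        params = malloc((size_t)size);
        assert(params != NULL);
        memcpy(params, src->params, (size_t)size);
    }
    return node_create(src->bszid, src->var_func, params,
                       node_copy(src->sub_node));
}

/* Frees every node in the chain along with its parameter struct. */
void node_free(node_t *n)
{
    while (n != NULL) {
        node_t *next = n->sub_node;
        free(n->params);
        free(n);
        n = next;
    }
}

/* Counts the nodes from n down to the leaf. */
int node_depth(const node_t *n)
{
    int d = 0;
    for (; n != NULL; n = n->sub_node) ++d;
    return d;
}
```

A per-thread copy made with node_copy() is fully independent of the
original, so a thread could cache data in its own nodes without affecting
other threads.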
Details:
- Defined a new randomization operation, randn, on vectors and matrices.
The randnv and randnm operations randomize each element of the target
object with values from a narrow range of values. Presently, those
values are all integer powers of two, but they do not need to be powers
of two in order to achieve the primary goal, which is to initialize
objects that can be operated on with plenty of precision "slack"
available to allow computations that avoid roundoff. Using this method
of randomization makes it much more likely that testsuite residuals of
properly-functioning operations are close to zero, if not exactly zero.
- Updated existing randomization operations randv and randm to skip
special diagonal handling and normalization for matrices with structure.
This is now handled by the testsuite modules by explicitly calling a
testsuite function that loads the diagonal (and scales off-diagonal
elements).
- Added support for randnv and randnm in the testsuite with a new switch
in input.general that universally toggles between use of the classic
randv/randm, which use real values on the interval [-1,1], and
randnv/randnm, which use only values from a narrow range. Currently,
the narrow range is: +/-{2^0, 2^-1, 2^-2, 2^-3, 2^-4, 2^-5, 2^-6}, as
well as 0.0.
- Updated testsuite modules so that a testsuite wrapper function is called
instead of directly calling the randomization operations (such as
bli_randv() and bli_randm()). This wrapper also takes a bool_t that
indicates whether the object's elements should be normalized. (NOTE: As
alluded to above, in the test modules of triangular solve operations such
as trsv and trsm, we perform the extra step of loading the diagonal.)
- Defined a new level-0 operation, invertsc, which inverts a scalar.
- Updated the abval2ris and sqrt2ris level-0 macros to avoid an unlikely
but possible divide-by-zero.
- Updated function signature and prototype formatting in testsuite.
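To illustrate why values drawn from the narrow range listed above leave
plenty of precision "slack", here is a rough sketch. It is not the randnv or
randnm implementation: randn_narrow(), the use of rand(), and the uniform
choice among the 15 candidate values are all assumptions made for the
example. Because every value is zero or a signed power of two between 2^-6
and 2^0, products and modest-length sums of products are exactly
representable in double precision, so the summation order does not change
the result.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_N 256  /* hypothetical cap on the vector length */

/* Returns 0.0 or +/- 2^-k for k in [0, 6]. rand() is a stand-in generator. */
double randn_narrow(void)
{
    int pick = rand() % 15;  /* 0 -> zero; 1..14 -> (exponent, sign) pairs */
    if (pick == 0) return 0.0;
    int    k = (pick - 1) / 2;            /* exponent magnitude, 0..6 */
    double v = 1.0 / (double)(1 << k);    /* exactly 2^-k */
    return ((pick - 1) % 2 == 0) ? v : -v;
}

/* Returns 1 if v is one of the 15 values in the narrow range. */
int in_narrow_range(double v)
{
    double a = (v < 0.0) ? -v : v;
    if (a == 0.0) return 1;
    for (int k = 0; k <= 6; ++k)
        if (a == 1.0 / (double)(1 << k)) return 1;
    return 0;
}

/* Returns 1 if n fresh samples all fall in the narrow range. */
int all_samples_in_range(int n)
{
    for (int i = 0; i < n; ++i)
        if (!in_narrow_range(randn_narrow())) return 0;
    return 1;
}

/* Computes a dot product of two narrow-range vectors in forward and in
   reverse order and returns 1 if the two results are bit-for-bit equal. */
int dot_orders_agree(int n)
{
    static double x[MAX_N], y[MAX_N];
    double fwd = 0.0, bwd = 0.0;

    assert(n >= 0 && n <= MAX_N);
    for (int i = 0; i < n; ++i) { x[i] = randn_narrow(); y[i] = randn_narrow(); }
    for (int i = 0; i < n; ++i)      fwd += x[i] * y[i];
    for (int i = n - 1; i >= 0; --i) bwd += x[i] * y[i];
    return fwd == bwd;
}
```

With general values on [-1,1] the two summation orders would typically
differ in the last few bits, which is the roundoff that this style of
randomization is meant to avoid.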
Details:
- Added a new input parameter to input.general that globally toggles
whether testsuite tests are performed on objects whose buffers and
leading dimensions have been aligned, and changed the implementation
of libblis_test_mobj_create() to employ alignment (or not) regardless
of whether row, column, or general storage is being tested.
- Updated configure script's "--help" text to indicate default behavior
for internal integer type size and BLAS/CBLAS integer type size
options.
Details:
- Defined a new "3ms" (separated 3m) pack schema and added appropriate
support in packm_init(), packm_blk_var2().
- Generalized packm_struc_cxk_3mi to take the imaginary stride (is_p)
as an argument instead of computing it locally. Exception: for trmm,
is_p must be computed locally, since it changes for triangular
packed matrices. Also exposed is_p in interface to dt-specific
packm_blk_var2 (and _var1, even though it does not use imaginary
stride).
- Renamed many functions/variables from _3mi to _3mis to indicate that
they work for either interleaved or separated 3m pack schemas.
- Generalized gemm and herk macro-kernels to pass in imaginary stride
rather than compute them locally.
- Added support for 3m2 and 3m3 algorithms to frame/ind, including 3m2-
and 3m3-specific virtual micro-kernels.
- Added special gemm macro-kernels to support 3m2 and 3m3.
- Added support for 3m2 and 3m3 to testsuite.
- Corrected the type of the panel dimension (pd_) in various macro-
kernels from inc_t to dim_t.
- Renamed many functions defined in bli_blocksize.c.
- Moved most induced-related macro defs from frame/include to
frame/ind/include.
- Updated the _ukernel.c files so that the micro-kernel function pointers
are obtained from the func_t objects rather than the cpp macros that
define the function names.
- Updated test/3m4m driver, Makefile, and run script.
Details:
- Consolidated most of the code relating to induced complex methods
(e.g. 4mh, 4m1, 3mh, 3m1, etc.) into frame/ind. Induced methods
are now enabled on a per-operation basis. The current "available"
(enabled and implemented) implementation can then be queried on
an operation basis. Micro-kernel func_t objects as well as blksz_t
objects can also be queried in a similar manner.
- Redefined several micro-kernel and operation-related functions in
bli_info_*() API, in accordance with above changes.
- Added mr and nr fields to blksz_t object, which point to the mr
and nr blksz_t objects for each cache blocksize (and are NULL for
register blocksizes). Renamed the sub-blocksize field "sub" to
"mult" since it is really expressing a blocksize multiple.
- Updated bli_*_determine_kc_[fb]() for gemm/hemm/symm, trmm, and
trsm to correctly query mr and nr (for purposes of nudging kc).
- Introduced an enumerated opid_t in bli_type_defs.h that uniquely
identifies an operation. For now, only level-3 id values are defined,
along with a generic, catch-all BLIS_NOID value.
- Reworked testsuite so that all induced methods that are enabled
are tested (one at a time) rather than only testing the first
available method.
- Reformatted summary at the beginning of testsuite output so that
blocksize and micro-kernel info is shown for each induced method
that was requested (as well as native execution).
- Reduced the number of columns needed to display non-matlab
testsuite output (from approx. 90 to 80).
Details:
- Renamed all remaining 3m/4m packing files and symbols to 3mi/4mi
('i' for "interleaved"). Similar changes to 3M/4M macros.
- Renamed all 3m/4m files and functions to 3m1/4m1.
- Whitespace changes.
Details:
- Added "4mh" and "3mh" APIs, which implement the 4m and 3m methods at
high levels, respectively. APIs for trmm and trsm were NOT added due
to the fact that these approaches are inherently incompatible with
implementing 4m or 3m at high levels (because the input right-hand
side matrix is overwritten).
- Added 4mh, 3mh virtual micro-kernels, and updated the existing 4m and
3m so that all are stylistically consistent.
- Added new "rih" packing kernels (both low-level and structure-aware)
to support both 4mh and 3mh.
- Defined new pack_t schemas to support real-only, imaginary-only, and
real+imaginary packing formats.
- Added various level0 scalar macros to support the rih packm kernels.
- Minor tweaks to trmm macro-kernels to facilitate 4mh and 3mh.
- Added the ability to enable/disable 4mh, 3m, and 3mh, and adjusted
level-3 front-ends to check enabledness of 3mh, 3m, 4mh, and 4m (in
that order) and execute the first one that is enabled, or the native
implementation if none are enabled.
- Added implementation query functions for each level-3 operation so
that the user can query a string that describes the implementation
that is currently enabled.
- Updated test suite to output implementation types for each level-3
operation, as well as micro-kernel types for each of the five micro-
kernels.
- Renamed BLIS_ENABLE_?COMPLEX_VIA_4M macros to _ENABLE_VIRTUAL_?COMPLEX.
- Fixed an obscure bug when packing Hermitian matrices (regular packing
type) whereby the diagonal elements of the packed micro-panels could
get tainted if the source matrix's imaginary diagonal part contained
garbage.
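For readers unfamiliar with the terms, the arithmetic that the 4m and 3m
families are named after can be shown on a single pair of complex scalars:
the conventional product uses four real multiplications, while the 3m
rearrangement uses three at the cost of extra additions. This is a
scalar-level sketch only; cplx, mul_4m(), mul_3m(), and
check_3m_matches_4m() are hypothetical names, and the code says nothing
about how the level-3 implementations organize packing or micro-kernels.

```c
#include <assert.h>

typedef struct { double re, im; } cplx;  /* hypothetical helper type */

/* Conventional complex product: four real multiplications. */
cplx mul_4m(cplx a, cplx b)
{
    cplx c;
    c.re = a.re * b.re - a.im * b.im;
    c.im = a.re * b.im + a.im * b.re;
    return c;
}

/* 3m rearrangement: three real multiplications, more additions. */
cplx mul_3m(cplx a, cplx b)
{
    double p1 = a.re * b.re;
    double p2 = a.im * b.im;
    double p3 = (a.re + a.im) * (b.re + b.im);
    cplx   c;

    c.re = p1 - p2;
    c.im = p3 - p1 - p2;  /* equals a.re*b.im + a.im*b.re */
    return c;
}

/* Sanity check intended for inputs whose intermediates are all exactly
   representable (e.g. small integers). For general inputs the two
   formulas may differ by rounding error. */
void check_3m_matches_4m(cplx a, cplx b)
{
    cplx x = mul_3m(a, b);
    cplx y = mul_4m(a, b);
    assert(x.re == y.re && x.im == y.im);
}
```

The extra additions in the 3m form are the reason its numerical behavior can
differ slightly from that of the four-multiplication form.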
Details:
- Reverted some changes that were unintentionally included in the
previous commit (9526ce98). Thanks to Tony Kelman for pointing
this out. (Note: a few select changes were not reverted.)
Details:
- Removed support for duplication from the gemmtrsm/trsm micro-kernels
and all framework code.
- Updated test suite modules according to above changes.
Details:
- Added extensive comments to the top of testsuite/input.operations,
which describe how to edit the file.
- Removed input.operations.0 and input.operations.1.
- Changed input.general to test all datatypes ("sdcz") by default.
Details:
- Applied a patch from Tyler that fixes minor staleness in the piledriver
configuration and gemm micro-kernel.
- Very minor changes to test suite input files.
Details:
- Added test modules in test suite for level-1f kernels and level-3
micro-kernels. (Duplication in the micro-kernels, for now, is NOT
supported by these test modules.)
- Added section override switches to test suite's input.operations file.
- Added obj_t APIs for level-1f front-ends and their unblocked variants to
facilitate the level-1f test modules. Also added front-end for dupl
operation.
- Added obj_t-based check routines for level-1f operations, which are
called from the new front-ends mentioned above.
- Added query routines for axpyf, dotxf, and dotxaxpyf that return fusing
factors as a function of datatype, which is needed by their respective
test modules.
- Whitespace changes to bli_kernel.h of all existing configurations.
Details:
- Added a 'template' configuration, which contains stub implementations of the
level 1, 1f, and 3 kernels with one datatype implemented in C for each, with
lots of in-file comments and documentation.
- Modified some variable/parameter names for some 1/1f operations. (e.g.
renaming vector length parameter from m to n.)
- Moved level-1f fusing factors from axpyf, dotxf, and dotxaxpyf header files
to bli_kernel.h.
- Modified test suite to print out fusing factors for axpyf, dotxf, and
dotxaxpyf, as well as the default fusing factor (which are all equal
in the reference and template implementations).
- Cleaned up some sloppiness in the level-1f unb_var1.c files whereby these
reference variants were implemented in terms of front-end routines rather
than directly in terms of the kernels. (For example, axpy2v was implemented
as two calls to axpyv rather than two calls to AXPYV_KERNEL.)
- Changed the interface to dotxf so that it matches that of axpyf, in that
A is assumed to be m x b_n in both cases, and for dotxf A is actually used
as A^T.
- Minor variable naming and comment changes to reference micro-kernels in
frame/3/gemm/ukernels and frame/3/trsm/ukernels.
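The shared operand convention for axpyf and dotxf noted above (A is m x b_n
in both cases, and dotxf uses it as A^T) can be sketched with two plain
reference loops. This assumes real double-precision data, column-major
storage with leading dimension lda, and unit-stride vectors; the function
names and argument order are hypothetical and are not the kernel signatures.

```c
#include <assert.h>
#include <stddef.h>

/* y := y + alpha * A * x, where A is m x b_n, x has b_n elements, and
   y has m elements. Equivalent to b_n fused axpyv operations. */
void axpyf_ref(size_t m, size_t b_n, double alpha,
               const double *a, size_t lda,
               const double *x, double *y)
{
    assert(lda >= m);
    for (size_t j = 0; j < b_n; ++j)
        for (size_t i = 0; i < m; ++i)
            y[i] += alpha * x[j] * a[i + j * lda];
}

/* y := beta * y + alpha * A^T * x, with the same m x b_n matrix A, where
   x has m elements and y has b_n elements. Equivalent to b_n fused dotxv
   operations, one per column of A. */
void dotxf_ref(size_t m, size_t b_n, double alpha,
               const double *a, size_t lda,
               const double *x, double beta, double *y)
{
    assert(lda >= m);
    for (size_t j = 0; j < b_n; ++j) {
        double dot = 0.0;
        for (size_t i = 0; i < m; ++i)
            dot += a[i + j * lda] * x[i];
        y[j] = beta * y[j] + alpha * dot;
    }
}
```

Keeping A as m x b_n in both routines means the same column panel can be
handed to either one; only the roles of x and y (and their lengths) swap.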
Details:
- Redefined gint_t and guint_t in terms of the standard C types long int
and unsigned long int, respectively.
- Changed testsuite default max problem size to 500.
- Changed testsuite input.operations to use square problems for level-3
operation tests.
Details:
- Added a new option in input.general that allows outputting in
matlab/octave format so that one can output in matlab format
independently from outputting to files.
- Adjusted input.operations according to above.
- Added input.operations.0 and input.operations.1 with all options
disabled and enabled, respectively.
Details:
- Updated level-1/-1f kernels so that non-unit and un-aligned cases are
handled by reference implementation (rather than aborted).
- Added -fomit-frame-pointer to default make_defs.mk for clarksville
configuration.
- Defined bli_offset_from_alignment() macro.
- Minor edits to old test drivers.
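The fallback described above (use the reference implementation for non-unit
stride or unaligned operands) might be decided along the following lines.
This is a sketch under assumptions: offset_from_alignment() only mirrors the
idea suggested by the macro's name, and use_optimized_path() is a
hypothetical helper; neither is taken from the BLIS sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bytes by which p lies past the previous align-byte boundary. */
size_t offset_from_alignment(const void *p, size_t align)
{
    assert(align != 0);
    return (size_t)((uintptr_t)p % align);
}

/* Returns 1 when a vectorized kernel could be used: unit stride and an
   aligned starting address. Otherwise the caller would fall back to the
   reference implementation instead of aborting. */
int use_optimized_path(const double *x, ptrdiff_t incx, size_t align)
{
    return incx == 1 && offset_from_alignment(x, align) == 0;
}
```

A kernel with several vector operands would apply the same test to each
operand and take the optimized path only if all of them pass.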
Details:
- Added a new line to input.general that allows one to specify the error-
checking level to use for each BLIS experiment. The only two levels
supported for now are "no error checking" and "full error checking".
Details:
- Added a highly configurable, unified test suite.
- Removed DUPB configuration constant from bl2_kernel.h and macro-kernel
header files. Now, instead, DUPB is computed as (NDUP != 1) within each
macro-kernel. This fixes a bug in trmm/trsm whereby bp was indexed into
incorrectly when DUPB was set to FALSE but the NDUP was still non-unit.
By encoding both pieces of information into one constant in _kernel.h,
it seems somewhat less likely others will encounter this bug in the
future.
- Added level-2 cache blocksizes to _kernel.h for reference configuration,
and defined blocksizes in _cntl.c files to these default values.
- Changed semantics of her2k and syr2k such that these operations no longer
expect the B matrix to already be conjugate-transposed (or just transposed
for syr2k). However, these semantics are preserved for the internal
mechanics of the implementations, including the internal back-end and all
blocked variants.
- Inserted checks for real-valued alpha and beta for herk/her2k and herk,
respectively.
- Relaxed general object structure constraints in _basic_check() for gemv, ger.
- Changed her front-end to NOT copy-cast to real projection; instead, this is
replaced by selecting either the real part or both parts within the unblocked
algorithm implementation, depending on the value of conjh.
- Added conjh to all _check routines for her so that the code knows when to
verify that alpha has an imaginary component equal to zero (for her, but
not syr).
- Changed control tree for her to forgo packing.
- Added unit diagonal support to fnormm.
- Redefined real versions of abval2s macros in terms of fabs(), fabsf().
- Redefined complex versions of sqrt2s macros using the actual "complex square
root" formula.
- Created new level-0 object-based routines, suffixed with "sc" (for "scalar").
- Defined new level-1v, -1d, and -1m versions of add and sub operations
(two-operand add and subtract).
- Added new scalar macros:
- getris: acquire real and imaginary components.
- setris: set real and imaginary components.
- addjs: addition with conjugated x.
- subjs: subtraction with conjugated x.
- Defined new utility operations:
- absumv: element-wise sum of absolute values for vector elements.
- absumm: element-wise sum of absolute values for matrix elements.
- mkherm: convert existing matrix to Hermitian.
- mksymm: convert existing matrix to symmetric.
- mktrim: convert existing matrix to triangular.
- Added various error checking routines.
- Added bl2_clock_min_diff(), which is used to more cleanly measure the
wall clock time of a code block.
- Added general stride support to bl2_obj_alloc_buffer().
- Added bl2_obj_init_scalar().
- Updated parameter mapping in bl2_param_map.c.
- Added support for queriable version string.
- Fixed a bug in the her2k macro-kernels (which currently are simply
implemented in terms of two invocations of herk) whereby beta was being
applied to both the first and second rank-k updates, rather than only
the first.
- Fixed a bug in trmm/trsm whereby transpose and right side cases were not
properly implemented due to erroneous assumptions regarding aliasing and
root objects.
- Fixed a bug in the upper triangular trsm macro-kernel in which the wrong
MR x NR block of B was being updated.
- Fixed a bug in the inverts macro in the double real case whereby the
value was typecast to float before inversion. This affected non-unit cases
of dtrsm.
- Fixed a bug in the reference kernels for gemmtrsm whereby the minus one
constant was being applied incorrectly.
- Fixed a bug in the overall treatment of non-unit alpha for trsm. The code
now mimics the rank-k strategy of gemm, whereby alpha is applied during
the first iteration of variant 3, with BLIS_ONE passed in instead for
subsequent iterations. This also required passing alpha into the macro-
kernels as well as the fused gemmtrsm micro-kernels.
- Fixed a bug in trsm_u_blk_var1 whereby the gemm macro-kernel was being
called for blocks strictly above the diagonal. While this sounds good in
theory, this cannot be done because gemm_ker_var2 expects row panels of
A to be packed from top to bottom, while for trsm_u, A is actually packed
from bottom to top due to the reverse (BR->TL) nature of the algorithm.
- Fixed a bug in packm_cxk() whereby panel packings with unit panel
dimensions were mishandled due to incorrect arguments to the copyv kernel.
Also changed the copyv kernel invocation to scal2v so that these edge
cases are properly handled when scaling is requested.
- Fixed a bug in packv_int() whereby an uninitialized object is passed in
instead of the source object.
- Fixed a bug whereby level-2 code could allocate memory dynamically via
bl2_malloc() and then attempt to free it via bl2_mm_release(). Also fixed
a potential future bug whereby a mem_t object that is actually no longer
"allocated" from the static pool is mistaken for being allocated due to
failure to NULLify the buffer when the block was most recently released.
- Fixed a bug in bl2_acquire_mpart_*() whereby the uplo field was mistakenly
toggled when the requested subpartition needed to be "reflected" due to it
residing in an unstored region.
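Several of the fixes above (the her2k beta fix and the trsm alpha fix) rest
on the same rule: when an update is split into a sequence of partial
updates, the scalar that scales the existing output may be applied only
during the first partial update, with one used afterward. The following
scalar model sketches that rule for a dot-product-style update split into
panels of width kc. The function name and argument list are hypothetical and
the code is a toy model, not the blocked variant itself.

```c
#include <assert.h>

/* Computes c := beta * c + alpha * sum_p a[p] * b[p] by accumulating
   panels of at most kc terms, one panel per iteration. */
double rank_k_blocked(const double *a, const double *b, int k, int kc,
                      double alpha, double beta, double c)
{
    assert(k >= 0 && kc > 0);
    if (k == 0) return beta * c;  /* no iterations: scale only */

    for (int p0 = 0; p0 < k; p0 += kc) {
        int    p1      = (p0 + kc < k) ? p0 + kc : k;
        double partial = 0.0;

        for (int p = p0; p < p1; ++p)
            partial += a[p] * b[p];

        /* beta scales c only in the first iteration; using beta again in
           later iterations would rescale the terms already accumulated. */
        double beta_use = (p0 == 0) ? beta : 1.0;
        c = beta_use * c + alpha * partial;
    }
    return c;
}
```

With this rule the result is independent of the panel width kc, which is the
property the buggy versions violated.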