Details:
- Dropped 'u' from the list of modifiers passed into the library archiver
ar. Previously, "cru" was used, while now we employ only "cr". This
change was prompted by a warning observed on Ubuntu 16.04:
ar: `u' modifier ignored since `D' is the default (see `U')
This caused me to realize that the default mode causes timestamps to be
zero, and thus the 'u' option, which causes only changed object files to
be inserted, is not applicable.
Details:
- Added an option to configure that allows the user to force an arbitrary
version string at configure-time. The configure help text now also
describes how to use this option.
- Changed the way the version string is communicated to the Makefile.
Previously, it was read into the VERSION variable from the 'version' file
via $(shell cat ...). Now, the VERSION variable is instead set in
config.mk (via a configure-substituted anchor from config.mk.in).
Details:
- Updated the non-tree openmp and pthreads barriers defined in
bli_thrcomm_openmp.c and bli_thrcomm_pthreads.c to instead call a common
implementation in bli_thrcomm.c, bli_thrcomm_barrier_atomic(). This new
implementation goes through the same motions as the previous codes, but
protects its loads and increments with GNU atomic built-ins. These atomic
statements take memory ordering parameters that allow us to specify just
enough constraints for the barrier to work as intended on weakly-ordered
hardware. The prior implementation was only guaranteed to work on systems
with strongly-ordered memory. (Thanks to Devin Matthews for suggesting
this change and for his crash course in atomics and memory ordering.)
- Removed 'volatile' from structs' barrier field declarations in
bli_thrcomm_*.h.
- Updated bli_thrcomm_pthreads.? files to use renamed struct barrier fields
consistent with those of the _openmp.? files.
- Updated other bli_thrcomm_* files to rename "communicator" variables to
simply "comm".
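The atomic barrier described above can be sketched roughly as follows. This
is a minimal illustration, not the code in bli_thrcomm.c: the struct
barrier_comm_t, its field names, and the function names are hypothetical,
and the memory orderings shown are one plausible choice. Only the GNU
__atomic built-ins themselves are real.

```c
// Hypothetical communicator; the real struct and field names may differ.
typedef struct
{
    int n_threads;               // number of threads sharing the barrier
    int barrier_sense;           // flips each time the barrier completes
    int barrier_threads_arrived; // threads that have reached the barrier
} barrier_comm_t;

void barrier_comm_init( barrier_comm_t* comm, int n_threads )
{
    comm->n_threads               = n_threads;
    comm->barrier_sense           = 0;
    comm->barrier_threads_arrived = 0;
}

void barrier_atomic( barrier_comm_t* comm )
{
    // Record the sense in effect when this thread arrives.
    int orig_sense = __atomic_load_n( &comm->barrier_sense,
                                      __ATOMIC_RELAXED );

    // Announce arrival. Acquire-release ordering publishes this thread's
    // earlier writes and lets the last arriver observe everyone else's.
    int arrived = __atomic_add_fetch( &comm->barrier_threads_arrived, 1,
                                      __ATOMIC_ACQ_REL );

    if ( arrived == comm->n_threads )
    {
        // Last thread in: reset the counter, then flip the sense with
        // release ordering so that waiters see the reset counter.
        __atomic_store_n( &comm->barrier_threads_arrived, 0,
                          __ATOMIC_RELAXED );
        __atomic_fetch_xor( &comm->barrier_sense, 1, __ATOMIC_RELEASE );
    }
    else
    {
        // Spin until the last thread flips the sense.
        while ( __atomic_load_n( &comm->barrier_sense,
                                 __ATOMIC_ACQUIRE ) == orig_sense )
            ;
    }
}
```

Each thread would call barrier_atomic() on a shared barrier_comm_t that was
initialized once with the total thread count.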
Details:
- Renamed bli_env_get_nway() -> bli_thread_get_env().
- Added bli_thread_set_env() to allow setting environment variables
pertaining to multithreading, such as BLIS_JC_NT or BLIS_NUM_THREADS.
- Added the following convenience wrapper routines:
bli_thread_get_jc_nt()
bli_thread_get_ic_nt()
bli_thread_get_jr_nt()
bli_thread_get_ir_nt()
bli_thread_get_num_threads()
bli_thread_set_jc_nt()
bli_thread_set_ic_nt()
bli_thread_set_jr_nt()
bli_thread_set_ir_nt()
bli_thread_set_num_threads()
- Added #include "errno.h" to bli_system.h.
- This commit addresses issue #140.
- Thanks to Chris Goodyer for inspiring these updates.
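As a rough sketch of what getting and setting an integer-valued threading
variable can look like, assuming a POSIX environment with setenv(): the
helpers thread_env_get() and thread_env_set() below are hypothetical and
are not the signatures of bli_thread_get_env() or bli_thread_set_env().
The errno check illustrates one reason a routine of this kind might need
errno.h.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: read an integer-valued environment variable.
// Returns 'fallback' if the variable is unset or is not a valid integer.
long thread_env_get( const char* name, long fallback )
{
    const char* str = getenv( name );
    if ( str == NULL ) return fallback;

    char* end = NULL;
    errno = 0;
    long value = strtol( str, &end, 10 );

    // Reject out-of-range values, empty strings, and trailing garbage.
    if ( errno == ERANGE || end == str || *end != '\0' ) return fallback;

    return value;
}

// Hypothetical helper: set an integer-valued environment variable,
// overwriting any existing value. Returns 0 on success.
int thread_env_set( const char* name, long value )
{
    char buf[ 32 ];
    snprintf( buf, sizeof( buf ), "%ld", value );
    return setenv( name, buf, 1 );
}
```

For example, thread_env_set( "BLIS_JC_NT", 4 ) followed by
thread_env_get( "BLIS_JC_NT", -1 ) would yield 4, while querying an unset
variable yields the fallback.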
Details:
- Fixed a bug that manifested as improperly-computed 1-norm for vectors
and matrices. This is one of the few operations in BLIS that does not
have its own test module within the testsuite, hence why it went
undetected for so long. The bad 1-norms were being used to normalize
matrices in the testsuite after initialization, which led to some
matrices containing a combination of "large" and "small" values. This
tended to push the residuals computed after each test away from zero.
In some cases, they were off *just* enough for the testsuite to label
the test a "failure". Many thanks to Jeff Hammond for reporting this bug.
(Wonky details: the bug was due to improperly-defined level-0 scalar
macros for abval2, an operation that computes the absolute value,
or complex magnitude/modulus. Certain complex domain instances of
abval2 were being incorrectly defined in terms of real-only solutions,
leading to bad results. This level-0 operation forms the basis of
norm1v/norm1m. absq2 was also affected, but almost nothing uses
this operation.)
Details:
- Disabled testsuite tests of all level-3 implementations based on 3m
and 4m. This will improve testing runtime on Travis CI as well as for
anyone manually running the testsuite using default test parameters.
Thanks to Devin Matthews for suggesting this change.
We want to be able to run BLIS KNL binaries on non-KNL machines via SDE.
Although it is possible to install an hbwmalloc implementation on such
systems, it is easier not to: the performance of SDE execution is not
representative, so there is no reason to emulate HBW allocation.
Details:
- Fixed a bug introduced in 1c732d3 that affected trsm1m_r. The result
was nondeterministic behavior (usually segmentation faults) for certain
problem sizes beyond the 1m instance of kc (e.g. 128 on haswell). The
cause of the bug was my commenting out lines in bli_gemm1m_ukr_ref.c
which explicitly directed the virtual gemm micro-kernel to use temporary
space if the storage preference of the [real domain] gemm ukernel did
not match the storage of the output matrix C. In the context of gemm,
this handling is not needed because agreement between the storage pref
and the matrix is guaranteed by a high-level optimization in BLIS.
However, this optimization is not applied to trsm because the storage
of C is not necessarily the same as the storage of the micro-panels of
B--both of which are updated by the micro-kernel during a trsm
operation. Thus, the guarantee of storage/preference agreement is not
in place for trsm, which means we must handle that case within the
virtual gemm micro-kernel.
- Comment updates and a minor macro change to bli_trsm*_cntx_init() for
3m1, 4m1a, and 1m.
Details:
- Commented out code in frame/ind/oapi/bli_l3_3m4m1m_oapi.c that was
specifically inserted to facilitate the benchmarking of 1m block-panel
and panel-block algorithms.
- Updates to test/3m4m/Makefile, runme.sh script, and test_gemm.c to
reflect changes used/needed during benchmarking.
Details:
- Fixed issue #115 by adding implementations for scabs1_() and dcabs1_()
to the BLAS compatibility layer. Thanks to heroxbd for pointing out
their absence.
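For reference, the reference BLAS defines SCABS1 and DCABS1 as the sum of
the absolute values of the real and imaginary parts (not the complex
modulus). A minimal sketch follows; the type and function names are
hypothetical stand-ins and are not the declarations used in the
compatibility layer.

```c
// Hypothetical stand-ins for single- and double-precision complex types.
typedef struct { float  real; float  imag; } scomplex_sketch;
typedef struct { double real; double imag; } dcomplex_sketch;

static float  abs_f( float  x ) { return x < 0.0f ? -x : x; }
static double abs_d( double x ) { return x < 0.0  ? -x : x; }

// |Re(z)| + |Im(z)| in single precision. The argument is passed by
// address, following the Fortran calling convention.
float scabs1_sketch( const scomplex_sketch* z )
{
    return abs_f( z->real ) + abs_f( z->imag );
}

// |Re(z)| + |Im(z)| in double precision.
double dcabs1_sketch( const dcomplex_sketch* z )
{
    return abs_d( z->real ) + abs_d( z->imag );
}
```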
Details:
- Fixed a bug in the configure script whereby a non-preferred value for
--enable-threading would cause problems in common.mk vis-a-vis detecting
which threading model was chosen. Thanks to heroxbd for reporting this
issue.
Details:
- Defined bli_gemmbp_cntl_create(), bli_gemmpb_cntl_create(), with the
body of bli_gemm_cntl_create() replaced with a call to the former.
- Defined bli_cntl_free_w_thrinfo(), bli_cntl_free_wo_thrinfo(). Now,
bli_cntl_free() can check if the thread parameter is NULL, and if so,
call the latter, and otherwise call the former.
- Defined bli_gemm1mbp_cntx_init(), bli_gemm1mpb_cntx_init(), both in
terms of bli_gemm1mxx_cntx_init(), which behaves the same as
bli_gemm1m_cntx_init() did before, except that an extra bool parameter
(is_pb) is used to support both bp and pb algorithms (including to
support the anti-preference field described below).
- Added support for "anti-preference" in context. The anti_pref field,
when true, will toggle the boolean return value of routines such as
bli_cntx_l3_ukr_eff_prefers_storage_of(), which has the net effect of
causing BLIS to transpose the operation to achieve disagreement (rather
than agreement) between the storage of C and the micro-kernel output
preference. This disagreement is needed for panel-block implementations,
since they induce a transposition of the suboperation immediately before
the macro-kernel is called, which changes the apparent storage of C. For
now, anti-preference is used only with the pb algorithm for 1m (and not
with any other non-1m implementation).
- Defined new functions,
bli_cntx_l3_ukr_eff_prefers_storage_of()
bli_cntx_l3_ukr_eff_dislikes_storage_of()
bli_cntx_l3_nat_ukr_eff_prefers_storage_of()
bli_cntx_l3_nat_ukr_eff_dislikes_storage_of()
which are identical to their non-"eff" (effectively) counterparts except
that they take the anti-preference field of the context into account.
- Explicitly initialize the anti-pref field to FALSE in
bli_gks_cntx_set_l3_nat_ukr_prefs().
- Added bli_gemm_ker_var1.c, which implements a panel-block macro-kernel
in terms of the existing block-panel macro-kernel _ker_var2(). This
technique requires inducing transposes on all operands and swapping
A and B.
- Changed bli_obj_induce_trans() macro so that pack-related fields are
also changed to reflect the induced transposition.
- Added a temporary hack to bli_l3_3m4m1m_oapi.c that allows us to easily
specify the 1m algorithm (block-panel or panel-block).
- Renamed the following cntx_t-related macros:
bli_cntx_get_pack_schema_a() -> bli_cntx_get_pack_schema_a_block()
bli_cntx_get_pack_schema_b() -> bli_cntx_get_pack_schema_b_panel()
bli_cntx_get_pack_schema_c() -> bli_cntx_get_pack_schema_c_panel()
and updated all instantiations. Also updated the field names in the
cntx_t struct.
- Comment updates.
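The anti-preference toggle described above can be illustrated with a small
sketch. Everything here is hypothetical: the struct cntx_sketch_t, its
fields, and the function names only mimic the idea and do not reflect the
actual cntx_t layout or the signatures of the bli_cntx_l3_ukr_eff_*()
functions.

```c
#include <stdbool.h>

// Hypothetical, stripped-down context.
typedef struct
{
    bool ukr_prefers_rows; // microkernel's output storage preference
    bool anti_pref;        // when true, invert the preference queries
} cntx_sketch_t;

// Plain query: does the storage of C match the microkernel preference?
bool ukr_prefers_storage_of( const cntx_sketch_t* cntx, bool c_is_row_stored )
{
    return cntx->ukr_prefers_rows == c_is_row_stored;
}

// "Effective" query: same as above, but toggled by anti_pref.
bool ukr_eff_prefers_storage_of( const cntx_sketch_t* cntx, bool c_is_row_stored )
{
    bool r = ukr_prefers_storage_of( cntx, c_is_row_stored );
    return cntx->anti_pref ? !r : r;
}

bool ukr_eff_dislikes_storage_of( const cntx_sketch_t* cntx, bool c_is_row_stored )
{
    return !ukr_eff_prefers_storage_of( cntx, c_is_row_stored );
}

// Hypothetical decision point: transpose the operation when the
// effective query reports a dislike. With anti_pref set, this drives
// the operation toward disagreement rather than agreement.
bool should_induce_transpose( const cntx_sketch_t* cntx, bool c_is_row_stored )
{
    return ukr_eff_dislikes_storage_of( cntx, c_is_row_stored );
}
```

With anti_pref false, a row-preferring microkernel and row-stored C need no
transposition; with anti_pref true, the same inputs trigger one.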
Details:
- Implemented the 1m method for inducing complex domain matrix
multiplication. 1m support has been added to all level-3 operations,
including trsm, and is now the default induced method when native
complex domain gemm microkernels are omitted from the configuration.
- Updated _cntx_init() operations to take a datatype parameter. This was
needed for the corresponding function for 1m (because 1m requires us
to choose between column-oriented or row-oriented execution, which
requires us to query the context for the storage preference of the
gemm microkernel, which requires knowing the datatype) but I decided
that it made sense for consistency to add the parameter to all other
cntx initialization functions as well, even though those functions
don't use the parameter.
- Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take
a second scalar for each blocksize entry. The semantic meaning of the
two scalars now is that the first will scale the default blocksize
while the second will scale the maximum blocksize. This allows scaling
the two independently, and was needed to support 1m, which requires
scaling a register blocksize but not its register storage blocksize
(i.e., "packdim") analogue.
- Deprecated bli_blksz_reduce_dt_to() and defined two new functions,
bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing
default and maximum blocksizes to some desired blocksize multiple.
These functions are needed in the updated definitions of
bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs().
- Added support for the 1e and 1r packing schemas to packm, including
1e/1r packing kernels.
- Added a minor optimization to bli_gemm_ker_var2() that allows, under
certain circumstances (specifically, real domain beta and row- or
column-stored matrix C), the real domain macrokernel and microkernel
to be called directly, rather than using the virtual microkernel
via the complex domain macrokernel, which carries a slight additional
amount of overhead.
- Added 1m support to the testsuite.
- Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified
some code in test_gemm.c driver.
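The idea behind 1m can be sketched on unpacked, column-major matrices.
This is only an illustration under stated assumptions: it assumes the 1e
schema expands each complex element a into the 2x2 real block
[ Re(a) -Im(a); Im(a) Re(a) ] and the 1r schema places the real and
imaginary parts of each element in adjacent rows, which is my reading and
is not spelled out above. All names are hypothetical, and none of the
blocking, micro-panel packing, or row-oriented variant is shown.

```c
#include <stdlib.h>

typedef struct { double real; double imag; } dcomplex_sketch;

// Expand complex m x k matrix A into a real 2m x 2k matrix ("1e"-like).
static void pack_1e_sketch( int m, int k, const dcomplex_sketch* a, double* a_e )
{
    int ld = 2 * m;
    for ( int j = 0; j < k; ++j )
        for ( int i = 0; i < m; ++i )
        {
            dcomplex_sketch z = a[ i + j*m ];
            a_e[ (2*i  ) + (2*j  )*ld ] =  z.real;
            a_e[ (2*i+1) + (2*j  )*ld ] =  z.imag;
            a_e[ (2*i  ) + (2*j+1)*ld ] = -z.imag;
            a_e[ (2*i+1) + (2*j+1)*ld ] =  z.real;
        }
}

// Split complex k x n matrix B into a real 2k x n matrix ("1r"-like).
static void pack_1r_sketch( int k, int n, const dcomplex_sketch* b, double* b_r )
{
    int ld = 2 * k;
    for ( int j = 0; j < n; ++j )
        for ( int p = 0; p < k; ++p )
        {
            b_r[ (2*p  ) + j*ld ] = b[ p + j*k ].real;
            b_r[ (2*p+1) + j*ld ] = b[ p + j*k ].imag;
        }
}

// Plain real-domain C = A * B on column-major m x k and k x n inputs.
static void gemm_real_sketch( int m, int n, int k,
                              const double* a, const double* b, double* c )
{
    for ( int j = 0; j < n; ++j )
        for ( int i = 0; i < m; ++i )
        {
            double sum = 0.0;
            for ( int p = 0; p < k; ++p ) sum += a[ i + p*m ] * b[ p + j*k ];
            c[ i + j*m ] = sum;
        }
}

// Complex C = A * B computed with a single real-domain multiplication.
// Returns 0 on success, -1 if a workspace allocation fails.
int gemm_1m_sketch( int m, int n, int k, const dcomplex_sketch* a,
                    const dcomplex_sketch* b, dcomplex_sketch* c )
{
    double* a_e = malloc( sizeof( double ) * 4 * (size_t)m * (size_t)k );
    double* b_r = malloc( sizeof( double ) * 2 * (size_t)k * (size_t)n );
    double* c_r = malloc( sizeof( double ) * 2 * (size_t)m * (size_t)n );
    int     ok  = ( a_e != NULL && b_r != NULL && c_r != NULL );

    if ( ok )
    {
        pack_1e_sketch( m, k, a, a_e );
        pack_1r_sketch( k, n, b, b_r );
        gemm_real_sketch( 2*m, n, 2*k, a_e, b_r, c_r );

        // Rows 2i and 2i+1 of the real result hold Re and Im of c(i,j).
        for ( int j = 0; j < n; ++j )
            for ( int i = 0; i < m; ++i )
            {
                c[ i + j*m ].real = c_r[ (2*i  ) + j*2*m ];
                c[ i + j*m ].imag = c_r[ (2*i+1) + j*2*m ];
            }
    }

    free( a_e ); free( b_r ); free( c_r );
    return ok ? 0 : -1;
}
```

The real-domain product of the expanded A and the split B yields the real
and imaginary parts of each element of C in adjacent rows, which is what
allows a real domain gemm kernel to stand in for a complex one.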