Details:
- Textually replaced nearly all non-comment instances of bool_t with the
C99 bool type. A few remaining instances, such as those in the files
bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and
bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being
used not for boolean purposes but to index into an array.
- This commit constitutes the third phase of a transition toward using
C99's bool instead of bool_t, which was raised in issue #420. The first
phase, which cleaned up various typecasts in preparation for using
bool as the basis for bool_t (instead of gint_t), was implemented by
commit a69a4d7. The second phase, which redefined the bool_t typedef
in terms of bool (from gint_t), was implemented by commit 2c554c2.
Details:
- Changed the typedef that defines bool_t from:
typedef gint_t bool_t;
where gint_t is a signed integer that forms the basis of most other
integers in BLIS, to:
typedef bool bool_t;
- Changed BLIS's TRUE and FALSE macro definitions from being in terms of
integer literals:
#define TRUE 1
#define FALSE 0
to being in terms of C99 boolean constants:
#define TRUE true
#define FALSE false
which are provided by stdbool.h.
- This commit constitutes the second phase of a transition toward using
C99's bool instead of bool_t, which will address issue #420. The first
phase, which cleaned up various typecasts in preparation for using
bool as the basis for bool_t (instead of gint_t), was implemented by
commit a69a4d7.
Details:
- Fixed various typecasts in
frame/base/bli_cntx.h
frame/base/bli_mbool.h
frame/base/bli_rntm.h
frame/include/bli_misc_macro_defs.h
frame/include/bli_obj_macro_defs.h
frame/include/bli_param_macro_defs.h
that were missing or being done improperly/incompletely. For example,
many return values were being typecast as
(bool_t)x && y
rather than
(bool_t)(x && y)
Thankfully, none of these deficiencies had manifested as actual bugs
at the time of this commit.
- Changed the return type of bli_env_get_var() from dim_t to gint_t.
This reflects the fact that bli_env_get_var() needs to be able to
return a signed integer, and even though dim_t is currently defined
as a signed integer, it does not intuitively appear to necessarily be
signed by inspection (i.e., an integer named "dim_t" for matrix
"dimension"). Also, updated use of bli_env_get_var() within
bli_pack.c to reflect the changed return type.
- Redefined type of thrcomm_t.barrier_sense field from bool_t to gint_t
and added comments to the bli_thrcomm_*.h files that will explain a
planned replacement of bool_t with C99's bool type.
- Note: These changes are being made to facilitate the substitution of
'bool' for 'bool_t', which will eliminate the namespace conflict with
arm_sve.h as reported in issue #420. This commit implements the first
phase of that transition. Thanks to RuQing Xu for reporting this
issue.
- CREDITS file update.
Details:
- The 'ref99' sandbox was broken by multiple refactorings and internal
API changes over the last two years. Rather than try to fix it, I've
replaced it with a much simpler version based on var2 of gemmsup.
Why not fix the previous implementation? It occurred to me that the
old implementation was trying to be a lightly simplified duplication
of what exists in the framework. Duplication aside, this sandbox
would have worked fine if it had been completely independent of the
framework code. The problem was that it was only partially
independent, with many function calls calling a function in BLIS
rather than a duplicated/simplified version within the sandbox. (And
the reason I didn't make it fully independent to begin with was that
it seemed unnecessarily duplicative at the time.) Maintaining two
versions of the same implementation is problematic for obvious
reasons, especially when it wasn't even done properly to begin with.
This explains the reimplementation in this commit. The only catch is
that the newer implementation is single-threaded only and does not
perform any packing on either input matrix (A or B). Basically, it's
only meant to be a simple placeholder that shows how you could plug
in your own implementation. Thanks to Francisco Igual for reporting
this brokenness.
- Updated the three reference gemmsup kernels (defined in
ref_kernels/3/bli_gemmsup_ref.c) so that they properly handle
conjugation of conja and/or conjb. The general storage kernel, which
is currently identical to the column-storage kernel, is used in the
new ref99 sandbox to provide basic support for all datatypes
(including scomplex and dcomplex).
- Minor updates to docs/Sandboxes.md, including adding the threading
and packing limitations to the Caveats section.
- Fixed a comment typo in bli_l3_sup_var1n2m.c (upon which the new
sandbox implementation is based).
Details:
- Updated all static function definitions to use the cpp macro
BLIS_INLINE instead of the static keyword. This allows blis.h to
use a different keyword (inline) to define these functions when
compiling with C++, which might otherwise trigger "defined but
not used" warning messages. Thanks to Giorgos Margaritis for
reporting this issue and Devin Matthews for suggesting the fix.
- Updated the following files, which are used by configure's
hardware auto-detection facility, to unconditionally #define
BLIS_INLINE to the static keyword (since we know BLIS will be
compiled with C, not C++):
build/detect/config/config_detect.c
frame/base/bli_arch.c
frame/base/bli_cpuid.c
- CREDITS file update.
Details:
- Fixed an inadvertently disabled edge case optimization in the two
gemmsup variants in bli_l3_sup_var1n2m.c. Background: These edge case
optimizations allow the last millikernel operation in the jr loop to
be executed with inflated an register blocksize if it is the last
(or only) iteration. For example, if mr=6 and nr=8 and the gemmsup
problem is m=8, n=100, k=100. (In this case, the panel-block variant
(var1n) is executed, which places the jr loop in the m dimension.)
In principle, this problem could be executed as two millikernels: one
with dimensions 6x100x100, and one as 2x100x100. However, with the
support for inflated blocksizes in the kernel, the entire 8x100x100
problem can be passed to the millikernel function, which will then
execute it more favorably as two 4x100x100 millikernel sub-calls.
Now, this optimization is disabled under certain circumstances, such
as when multithreading. Previously, the is_mt predicate was being set
incorrectly such that it was non-zero even when running
single-threaded.
- Upon fixing the is_mt issue above, another bit of code needed to be
moved so that the result of the optimization could have an impact on
the assignment of loop bounds ranges to threads.
Details:
- Fixed an innocuous bug that manifested when running the testsuite on
extremely small matrices with randomization via the "powers of 2 in
narrow precision range" option enabled. When the randomization
function emits a perfect 0.0 to fill a 1x1 matrix, the testsuite will
then compute 0.0/0.0 during the normalization process, which leads to
NaN residuals. The solution entails smarter implementaions of randv,
randnv, randm, and randnm, each of which will compute the 1-norm of
the vector or matrix in question. If the object has a 1-norm of 0.0,
the object is re-randomized until the 1-norm is not 0.0. Thanks to
Kiran Varaganti for reporting this issue (#413).
- Updated the implementation of randm_unb_var1() so that it loops over
a call to the randv_unb_var1() implementation directly rather than
calling it indirectly via randv(). This was done to avoid the overhead
of multiple calls to norm1v() when randomizing the rows/columns of a
matrix.
- Updated comments.
Details:
- Added a function definition for xerbla_array_(), which largely mirrors
its netlib implementation. Thanks to Isuru Fernando for suggesting the
addition of this function.
Here adds two kernels for Arm SVE vector extensions.
1. a gemm kernel for double at sizes 8x8.
2. a packm kernel for double at dimension 8xk.
To achive best performance, variable length agonostic programming
is not used. Vector length (VL) of 256 bits is mandated in both kernels.
Kernels to support other VLs can be added later.
"SVE is a vector extension for AArch64 execution mode for the A64
instruction set of the Armv8 architecture. Unlike other SIMD architectures,
SVE does not define the size of the vector registers, but constrains into
a range of possible values, from a minimum of 128 bits up to a maximum of
2048 in 128-bit wide units. Therefore, any CPU vendor can implement the
extension by choosing the vector register size that better suits the
workloads the CPU is targeting. Instructions are provided specifically
to query an implementation for its register size, to guarantee that
the applications can run on different implementations of the ISA without
the need to recompile the code." [1]
[1] https://developer.arm.com/solutions/hpc/resources/hpc-white-papers/arm-scalable-vector-extensions-and-application-to-machine-learning
Signed-off-by: Guodong Xu <guodong.xu@linaro.org>
Details:
- Relocated the #include "cpuid.h" directive from bli_cpuid.h to
bli_cpuid.c. This was done because cpuid.h (which is pulled into
the post-build blis.h developer header) doesn't protect its
definitions with a preprocessor guard of the form:
#ifndef FOOBAR_H
#define FOOBAR_H
// header contents.
#endif
and as a result, applications (previously) could not #include both
blis.h and cpuid.h (since the former was already including the
latter). Thanks to Bhaskar Nallani for raising this issue via #393
and to Devin Matthews for suggesting this fix.
- CREDITS file update.
Details:
- Changed the behavior of bli_rntm_init() as well as the static
initializer, BLIS_RNTM_INITIALIZER, so that user-initialized rntm_t
objects by default specify the disabling of packing for A and B.
Packing of A/B was already disabled by default when calling non-expert
APIs (and enabled only when the user set environment variables
BLIS_PACK_A or BLIS_PACK_B). With this commit, the default behavior of
using user-initialized rntm_t objects with expert APIs comes into line
with the default behavior of non-expert APIs--that is, they now both
lead to the avoidance of packing in the sup code path. (Note: The
conventional code path is unaffected by the environment variables
BLIS_PACK_A/BLIS_PACK_B and/or the disabling of packing in a rntm_t
object when calling an expert API.) This addresses issue #392. Thanks
to Kiran Varaganti for bringing this inconsistency to our attention.
- The above change was accomplished by changing the the definitions of
static functions bli_rntm_clear_pack_a() and bli_rntm_clear_pack_b()
in bli_rntm.h, which are both for internal use only.
Details:
- Renamed instances of bli_thread_obarrier() and bli_thread_obroadcast()
that were made in the supmt-specific code commited to the 'amd'
branch, which has now been merged with 'master'. Prior to the merge,
'master' received commit c01d249, which applied these renamings to
the existing, non-sup codebase.
Details:
- Added a missing return statement to the body of an early case handling
branch in bli_thread_partition_2x2(). This bug only affected cases
where n_threads < 4, and even then, the code meant to handle cases
where n_threads >= 4 executes and does the right thing, albeit using
more CPU cycles than needed. Nonetheless, thanks to Kiran Varaganti
for reporting this bug via issue #377.
- Whitespace changes to bli_thread.c (spaces -> tabs).
Details:
- Renamed two bli_thread_*() APIs:
bli_thread_obarrier() -> bli_thread_barrier()
bli_thread_obroadcast() -> bli_thread_broadcast()
The 'o' was a leftover from when thrcomm_t objects tracked both
"inner" and "outer" communicators. They have long since been
simplified to only support the latter, and thus the 'o' is
superfluous.
Details:
- Return early from bli_thrinfo_sup_grow() if the thrinfo_t object
address is equal to either &BLIS_GEMM_SINGLE_THREADED or
&BLIS_PACKM_SINGLE_THREADED.
- Added preprocessor logic to bli_l3_sup_thread_decorator() in
bli_l3_sup_decor_single.c that (by default) disables code that
creates and frees the thrinfo_t tree and instead passes
&BLIS_GEMM_SINGLE_THREADED as the thrinfo_t pointer into the
sup implementation.
- The net effect of the above changes is that a small amount of
thrinfo_t overhead is avoided when running small/skinny dgemm
problems when BLIS is compiled with multithreading disabled.
Details:
- Fixed a syntax bug in bli_l3_sup_decor_single.c as a result of
changing function interface for the thread entry point function
(of type l3supint_t).
- Unfortunately, fixing the interface was not enough, as it caused
a memory leak in the sba at bli_finalize() time. It turns out that,
due to the new multithreading-capable variant code useing thrinfo_t
objects--specifically, their calling of bli_thrinfo_grow()--we
have to pass in a real thrinfo_t object rather than the global
objects &BLIS_PACKM_SINGLE_THREADED or &BLIS_GEMM_SINGLE_THREADED.
Thus, I inserted the appropriate logic from the OpenMP and pthreads
versions so that single-threaded execution would work as intended
with the newly upgraded variants.
Details:
- Added multithreading support to the sup framework (via either OpenMP
or pthreads). Both variants 1n and 2m now have the appropriate
threading infrastructure, including data partitioning logic, to
parallelize computation. This support handles all four combinations
of packing on matrices A and B (neither, A only, B only, or both).
This implementation tries to be a little smarter when automatic
threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
recalculate the factorization in units of micropanels (rather than
using the raw dimensions) in bli_l3_sup_int.c, when the final
problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
or column-stored matrices. (This is used for the rrc and crc storage
cases.) Previously, copym was used, but that would no longer suffice
because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
instead of from the variant functions. This has the effect of making
the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
and inserted usage of these functions within bli_thrinfo_init(), which
previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
tests, as well as appropriate octave/matlab scripts to plot the
resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
that specifying any BLIS_*_NT variable, even if it is set to 1, will
be considered manual specification for the purposes of determining
whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.
Details:
- Fixed an error that manifests only when using C++ (specifically,
modern versions of g++) to compile drivers in 'test' (and likely most
other application code that #includes blis.h. Thanks to Ajay Panyala
for reporting this issue (#374).
* Fix parsing in vpu_count on workstation SKX
* Document Skylake-X as Haswell for single FMA
* Update vpu_count for Skylake and Cascade Lake models
* Support printing the configuration selected, controlled by the environment
Intended particularly for diagnosing mis-selection of SKX through
unknown, or incorrect, number of VPUs.
* Move bli_log outside the cpp condition, and use it where intended
* Add Fixme comment (Skylake D)
* Mostly superficial edits to commits towards #351.
Details:
- Moved architecture/sub-config logging-related code from bli_cpuid.c
to bli_arch.c, tweaked names, and added more set/get layering.
- Tweaked log messages output from bli_cpuid_is_skx() in bli_cpuid.c.
- Content, whitespace changes to new bullet in HardwareSupport.md that
relates to single-VPU Skylake-Xs.
* Fix comment typos
Co-authored-by: Field G. Van Zee <field@cs.utexas.edu>
Details:
- Fixed a bug in sdsdot_sub() that redundantly added the "alpha" scalar,
named 'sb'. This value was already being added by the underlying
sdsdot_() function. Thus, we no longer add 'sb' within sdsdot_sub().
Thanks to Simon Lukas Märtens for reporting this bug via #367.
- Fixed a second bug in order of typecasting intermediate products in
sdsdot_(). Previously, the "alpha" scalar was being added after the
"outer" typecast to float. However, the operation is supposed to first
add the dot product to the (promoted) scalar and THEN downcast the sum
to float. Thanks to Devin Matthews for catching this bug.
Details:
- Added BLIS_EXPORT_BLIS annotation to function prototypes for
bli_thrcomm_bcast()
bli_thrcomm_barrier()
bli_thread_range_sub()
so that these functions are exported to shared libraries by default.
This (hopefully) fixes issue #366. Thanks to Kyungmin Lee for
reporting this bug.
- CREDITS file update.
Details:
- Defined dummy versions of bli_l3_sup_thread_decorator() for Openmp
and pthreads so that those builds don't fail when performing shared
library linking (especially for Windows DLLs via AppVeyor). For now,
these dummy implementations of bli_l3_sup_thread_decorator() are
merely carbon-copies of the implementation provided for single-
threaded execution (ie: the one found in bli_l3_sup_decor_single.c).
Thus, an OpenMP or pthreads build will be able to use the gemmsup
code (including the new selective packing functionality), as it did
before 39fa7136, even though it will not actually employ any
multithreaded parallelism.
Details:
- Implemented optional packing for A or B (or both) within the sup
framework (which currently only supports gemm). The request for
packing either matrix A or matrix B can be made via setting
environment variables BLIS_PACK_A or BLIS_PACK_B (to any
non-zero value; if set, zero means "disable packing"). It can also
be made globally at runtime via bli_pack_set_pack_a() and
bli_pack_set_pack_b() or with individual rntm_t objects via
bli_rntm_set_pack_a() and bli_rntm_set_pack_b() if using the expert
interface of either the BLIS typed or object APIs. (If using the
BLAS API, environment variables are the only way to communicate the
packing request.)
- One caveat (for now) with the current implementation of selective
packing is that any blocksize extension registered in the _cntx_init
function (such as is currently used by haswell and zen subconfigs)
will be ignored if the affected matrix is packed. The reason is
simply that I didn't get around to implementing the necessary logic
to pack a larger edge-case micropanel, though this is entirely
possible and should be done in the future.
- Spun off the variant-choosing portion of bli_gemmsup_ref() into
bli_gemmsup_int(), in bli_l3_sup_int.c.
- Added new files, bli_l3_sup_packm_a.c, bli_l3_sup_packm_b.c, along
with corresponding headers, in which higher-level packm-related
functions are defined for use within the sup framework. The actual
packm variant code resides in bli_l3_sup_packm_var.c.
- Pass the following new parameters into var1n and var2m: packa, packb
bool_t's, pointer to a rntm_t, pointer to a cntl_t (which is for now
always NULL), and pointer to a thrinfo_t* (which for nowis the address
of the global single-threaded packm thread control node).
- Added panel strides ps_a and ps_b to the auxinfo_t structure so that
the millikernel can query the panel stride of the packed matrix and
step through it accordingly. If the matrix isn't packed, the panel
stride of interest for the given millikernel will be set to the
appropriate value so that the mkernel may step through the unpacked
matrix as it normally would.
- Modified the rv_6x8m and rv_6x8n millikernels to read the appropriate
panel strides (ps_a and ps_b, respectively) instead of computing them
on the fly.
- Spun off the environment variable getting and setting functions into
a new file, bli_env.c (with a corresponding prototype header). These
functions are now used by the threading infrastructure (e.g.
BLIS_NUM_THREADS, BLIS_JC_NT, etc.) as well as the selective packing
infrastructure (e.g. BLIS_PACK_A, BLIS_PACK_B).
- Added a static initializer for mem_t objects, BLIS_MEM_INITIALIZER.
- Added a static initializer for pblk_t objects, BLIS_PBLK_INITIALIZER,
for use within the definition of BLIS_MEM_INITIALIZER.
- Moved the global_rntm object to bli_rntm.c and extern it where needed.
This means that the function bli_thread_init_rntm() was renamed to
bli_rntm_init_from_global() and relocated accordingly.
- Added a new bli_pack.c function, which serves as the home for
functions that manage the pack_a and pack_b fields of the global
rntm_t, including from environment variables, just as we have
functions to manage the threading fields of the global rntm_t in
bli_thread.c.
- Reorganized naming for files in frame/thread, which mostly involved
spinning off the bli_l3_thread_decorator() functions into their own
files. This change makes more sense when considering the further
addition of bli_l3_sup_thread_decorator() functions (for now limited
only to the single-threaded form found in the _single.c file).
- Explicitly initialize the reference sup handlers in both
bli_cntx_init_haswell.c and bli_cntx_init_zen.c so that it's more
obvious how to customize to a different handler, if desired.
- Removed various snippets of disabled code.
- Various comment updates.
Details:
- Fixed a subtle and complicated bug that only manifested via the BLAS
test drivers in the generic subconfiguration, and possibly any other
subconfiguration that did not register complex-domain gemm ukernels,
or registered ONLY real-domain ukernels as row-preferential. This is
a long story, but it boils down to an exception to the "transpose the
operation to bring storage of C into agreement with ukernel pref"
optimization in bli_hemm_front.c and bli_symm_front.c sabotaging the
proper functioning of the 1m method, but only when the imaginary
component of beta is zero. See the comments in issue #342 for more
details. Thanks to Dave Love for identifying the commit in which this
bug was introduced, and other feedback related to this bug.
Details:
- Fixed a bug in bli_acquire_mpart_mdim(), bli_acquire_mpart_ndim(),
and bli_acquire_mpart_mndim() that allowed the use of a blocksize b
that is too large given the current row/column index (i.e., the i/j
argument) and the size of the dimension being partitioned (i.e., the
m/n argument). This bug only affected backwards partitioning/motion
through the dimension and was the result of a misplaced conditional
check-and-redirect to the backwards code path. It should be noted
that this bug was discovered not because it manifested the way it
could (thanks to the callers in BLIS making sure to always pass in
the "correct" blocksize b), but could have manifested if the
functions were used by 3rd party callers. Thanks to Minh Quan Ho for
reporting the bug via issue #363.
Details:
- Added cpp macros to trmm and trmm3 front-ends to optionally force
those operations to be cast so the structured matrix is on the left.
symm and hemm already had such macros, but these too were renamed so
that the macros were individual to the operation. We now have four
such macros:
#define BLIS_DISABLE_HEMM_RIGHT
#define BLIS_DISABLE_SYMM_RIGHT
#define BLIS_DISABLE_TRMM_RIGHT
#define BLIS_DISABLE_TRMM3_RIGHT
Also, updated the comments in the symm and hemm front-ends related to
the first two macro guards, and added corresponding comments to the
trmm and trmm3 front-ends for the latter two guards. (They all
functionally do the same thing, just for their specific operations.)
Thanks to Jeff Hammond for reporting the bugs that led me to this
change (via #359).
- Updated config/old/haswellbb subconfiguration (used to debug issues
related to duplicating B during packing) to register: a packing
kernel for single-precision real; gemmbb ukernels for s, c, and z;
trsmbb ukernels for s, c, and z; gemmtrsmbb virtual ukrnels for s, c
and z; and to use non-default cache and register blocksizes for s, c,
and z datatypes. Also declared prototypes for all of the gemmbb,
trsmbb, and gemmtrsmbb ukernel functions within the
bli_cntx_init_haswellbb() function. This should, once applied to the
power9 configuration, fix the remaining issues in #359.
- Defined bli_spackm_6xk_bb4_ref(), which packs single reals with a
duplication factor of 4. This function is defined in the same file as
bli_dpackm_6xk_bb2_ref() (bli_packm_cxk_bb_ref.c).
Details:
- Added code that fixes false failures in the gemmtrsm_ukr module of the
testsuite. The tests were failing because the computation (bli_gemv())
that performs the numerical check was not able to properly travserse
the matrix operands bx1 and b11 that are views into the micropanel of
B, which has duplicated/broadcast elements under the power9 subconfig.
(For example, a micropanel of B with duplication factor of 2 needs to
use a column stride of 2; previously, the column stride was being
interpreted as 1.)
- Defined separate bli_obj_set_row_stride() and bli_obj_set_col_stride()
static functions in bli_obj_macro_defs.h. (Previously, only the
function bli_obj_set_strides() was defined. Amazing to think that we
got this far without these former functions.)
- Updated/expounded upon comments.
Details:
- Added missing license header to bli_pwr9_asm_macros_12x6.h.
- Reverted temporary changes to various files in 'test' and 'testsuite'
directories.
- Moved testsuite/jobscripts into testsuite/old.
- Minor whitespace/comment changes across various files.
Implemented and registered power9 dgemm ukernel.
Details:
- Implemented 12x6 dgemm microkernel for power9. This microkernel
assumes that elements of B have been duplicated/broadcast during the
packing step. The microkernel uses a column orientation for its
microtile vector registers and thus implements column storage and
general stride IO cases. (A row storage IO case via in-register
transposition may be added at a future date.) It should be noted that
we recommend using this microkernel with gcc and *not* xlc, as issues
with the latter cropped up during development, including but not
limited to slightly incompatible vector register mnemonics in the GNU
extended inline assembly clobber list.
Details:
- In bli_syrk_front(), moved the conditional call to bli_syrk_check()
(if error checking is enabled) and the conditional scaling of C by
beta (if alpha is zero) so that they occur after, instead of before,
the call to bli_syrk_small(). This sequencing now matches that of
bli_gemm_small() in bli_gemm_front() and bli_trsm_small() in
bli_trsm_front().
Details:
- NOTE: This is a merge commit of 'master' of git://github.com/amd/blis
into 'amd-master' of flame/blis.
- Fixed a bug in the downstream value of BLIS_NUM_ARCHS, which was
inadvertantly not incremented when the Zen2 subconfiguration was
added.
- In bli_gemm_front(), added a missing conditional constraint around the
call to bli_gemm_small() that ensures that the computation precision
of C matches the storage precision of C.
- In bli_syrk_front(), reorganized and relocated the notrans/trans logic
that existed around the call to bli_syrk_small() into bli_syrk_small()
to minimize the calling code footprint and also to bring that code
into stylistic harmony with similar code in bli_gemm_front() and
bli_trsm_front(). Also, replaced direct accessing of obj_t fields with
proper accessor static functions (e.g. 'a->dim[0]' becomes
'bli_obj_length( a )').
- Added #ifdef BLIS_ENABLE_SMALL_MATRIX guard around prototypes for
bli_gemm_small(), bli_syrk_small(), and bli_trsm_small(). This is
strictly speaking unnecessary, but it serves as a useful visual cue to
those who may be reading the files.
- Removed cpp macro-protected small matrix debugging code from
bli_trsm_front.c.
- Added a GCC_OT_9_1_0 variable to build/config.mk.in to facilitate gcc
version check for availability of -march=znver2, and added appropriate
support to configure script.
- Cleanups to compiler flags common to recent AMD microarchitectures in
config/zen/amd_config.mk, including: removal of -march=znver1 et al.
from CKVECFLAGS (since the -march flag is added within make_defs.mk);
setting CRVECFLAGS similarly to CKVECFLAGS.
- Cleanups to config/zen/bli_cntx_init_zen.c.
- Cleanups, added comments to config/zen/make_defs.mk.
- Cleanups to config/zen2/make_defs.mk, including making use of newly-
added GCC_OT_9_1_0 and existing GCC_OT_6_1_0 to choose the correct
set of compiler flags based on the version of gcc being used.
- Reverted downstream changes to test/test_gemm.c.
- Various whitespace/comment changes.
Details:
- Fixed a typo in the loop counter for the 'd' (duplication) dimension
in the complex macros of frame/include/level0/bb/bli_scal2bbs_mxn.h.
They shouldn't be used by anyone yet, but thankfully clang via
AppVeyor spit out warnings that alerted me to the issue.
Details:
- Added support for being able to duplicate (broadcast) elements in
memory when packing matrix B (ie: the left-hand operand) in level-3
operations. This turns out advantageous for some architectures that
can afford the cost of the extra bandwidth and somehow benefit from
the pre-broadcast elements (and thus being able to avoid using
broadcast-style load instructions on micro-rows of B in the gemm
microkernel).
- Support optionally disabling right-side hemm and symm. If this occurs,
hemm_r is implemented in terms of hemm_l (and symm_r in terms of
symm_l). This is needed when broadcasting during packing because the
alternative--supporting the broadcast of B while also allowing matrix
B to be Hermitian/symmetric--would be an absolute mess.
- Support alignment factors for packed blocks of A, B, and C separately
(as well as for general-purpose buffers). In addition, we support
byte offsets from those alignment values (which is different from
aligning by align+offset bytes to begin with). The default alignment
values are BLIS_PAGE_SIZE in all four cases, with the offset values
defaulting to zero.
- Pass pack_t schema into bli_?packm_cxk() so that it can be then passed
into the packm kernel, where it will be needed by packm kernels that
perform broadcasts of B, since the idea is that we *only* want to
broadcast when packing micropanels of B and not A.
- Added definition for variadic bli_cntx_set_l3_vir_ukrs(), which can be
used to set custom virtual level-3 microkernels in the cntx_t, which
would typically be done in the bli_cntx_init_*() function defined in
the subconfiguration of interest.
- Added a "broadcast B" kernel function for use with NP/NR = 12/6,
defined in in ref_kernels/1m/bli_packm_cxk_bb_ref.c.
- Added a gemm, gemmtrsm, and trsm "broadcast B" reference kernels
defined in ref_kernels/3/bb. (These kernels have been tested with
double real with NP/NR = 12/6.)
- Added #ifndef ... #endif guards around several macro constants defined
in frame/include/bli_kernel_macro_defs.h.
- Defined a few "broadcast B" static functions in
frame/include/level0/bb for use by "broadcast B"-style packm reference
kernels. For now, only the real domain kernels are tested and fully
defined.
- Output the alignment and offset values for packed blocks of A and B
in the testsuite's "BLIS configuration info" section.
- Comment updates to various files.
- Bumped so_version to 3.0.0.
Details:
- Rewrote bli_cpuid_query() for ARM architectures to use stdio-based
functions such as fopen() and fgets() instead of popen(). The new code
does more or less the same thing as before--searches /proc/cpuinfo for
various strings, which are then parsed in order to determine the
model, part number, and features. Thanks to Dave Love for suggesting
this change in issue #335.
* Always use sqsumv to compute normfv on MacOS.
* Unconditionally disable the "dot trick" in normfv.
* Added explanatory comment to normfv definition.
Details:
- Added a comment above the unconditional disabling of the dotv-based
implementation to normfv. Thanks to Roman Yurchak, Devin Matthews,
and Isuru Fernando in helping with this improvement.
- CREDITS file update.
Details:
- Defined two new static functions in bli_cntx.h:
bli_cntx_set_blksz_def_dt()
bli_cntx_set_blksz_max_dt()
which developers may find convenient when experimenting with different
values of cache blocksizes.
- Updated one- and two-socket multithreaded problem size range and
increment values in test/3/Makefile.
- Changed default to column storage in test/3/test_gemm.c.
- Fixed typo in comment in testsuite/src/test_subm.c.
Details:
- Fixed a bug in bli_l3_thrinfo_print_gemm_paths() and
bli_l3_thrinfo_print_trsm_paths(), defined in bli_l3_thrinfo.c,
whereby subnodes of the thrinfo_t tree are "dereferenced" near the
beginning of the functions, which may lead to segfaults in certain
situations where the thread tree was not fully formed because the
matrix problem was too small for the level of parallelism specified.
(That is, too small because some problems were assigned no work due
to the smallest units in the m and n dimensions being defined by the
register blocksizes mr and nr.) The fix requires several nested levels
of if statements, and this is one of those few instances where use of
goto statements results in (mostly) prettier code, especially in the
case of _gemm_paths(). And while it wasn't necessary, I ported this
goto usage to the loop body that prints the thrinfo_t work_id and
comm_id values for each thread. Thanks to Nicholai Tukanov for helping
to find this bug.
Details:
- Added a cpp-conditional directive block to bli_arch_config.h that
#includes "bli_family_thunderx2.h". The code has been missing since
adf5c17f. However, this never manifested as an error because the file
is virtually empty and not needed for thunderx2 (or most subconfigs).
Thanks to Jeff Diamond for helping to spot this.