Details:
- Fixed various typecasts in
frame/base/bli_cntx.h
frame/base/bli_mbool.h
frame/base/bli_rntm.h
frame/include/bli_misc_macro_defs.h
frame/include/bli_obj_macro_defs.h
frame/include/bli_param_macro_defs.h
that were missing or being done improperly/incompletely. For example,
many return values were being typecast as
(bool_t)x && y
rather than
(bool_t)(x && y)
Thankfully, none of these deficiencies had manifested as actual bugs
at the time of this commit.
- Changed the return type of bli_env_get_var() from dim_t to gint_t.
This reflects the fact that bli_env_get_var() needs to be able to
return a signed integer, and even though dim_t is currently defined
as a signed integer, it does not intuitively appear to necessarily be
signed by inspection (i.e., an integer named "dim_t" for matrix
"dimension"). Also, updated use of bli_env_get_var() within
bli_pack.c to reflect the changed return type.
- Redefined type of thrcomm_t.barrier_sense field from bool_t to gint_t
and added comments to the bli_thrcomm_*.h files that will explain a
planned replacement of bool_t with C99's bool type.
- Note: These changes are being made to facilitate the substitution of
'bool' for 'bool_t', which will eliminate the namespace conflict with
arm_sve.h as reported in issue #420. This commit implements the first
phase of that transition. Thanks to RuQing Xu for reporting this
issue.
- CREDITS file update.
Details:
- The 'ref99' sandbox was broken by multiple refactorings and internal
API changes over the last two years. Rather than try to fix it, I've
replaced it with a much simpler version based on var2 of gemmsup.
Why not fix the previous implementation? It occurred to me that the
old implementation was trying to be a lightly simplified duplication
of what exists in the framework. Duplication aside, this sandbox
would have worked fine if it had been completely independent of the
framework code. The problem was that it was only partially
independent, with many function calls calling a function in BLIS
rather than a duplicated/simplified version within the sandbox. (And
the reason I didn't make it fully independent to begin with was that
it seemed unnecessarily duplicative at the time.) Maintaining two
versions of the same implementation is problematic for obvious
reasons, especially when it wasn't even done properly to begin with.
This explains the reimplementation in this commit. The only catch is
that the newer implementation is single-threaded only and does not
perform any packing on either input matrix (A or B). Basically, it's
only meant to be a simple placeholder that shows how you could plug
in your own implementation. Thanks to Francisco Igual for reporting
this brokenness.
- Updated the three reference gemmsup kernels (defined in
ref_kernels/3/bli_gemmsup_ref.c) so that they properly handle
conjugation of conja and/or conjb. The general storage kernel, which
is currently identical to the column-storage kernel, is used in the
new ref99 sandbox to provide basic support for all datatypes
(including scomplex and dcomplex).
- Minor updates to docs/Sandboxes.md, including adding the threading
and packing limitations to the Caveats section.
- Fixed a comment typo in bli_l3_sup_var1n2m.c (upon which the new
sandbox implementation is based).
Details:
- Added support to bli_gemmsup_rv_haswell_asm_d6x8n.c for handling
various n = 6 edge cases with a single sup kernel call. Previously,
only n = {4,2,1} were handled explicitly as single kernel calls;
that is, cases where n = 6 were previously being executed via two
kernel calls (n = 4 and n = 2).
- Added commented debug line to testsuite's test_libblis.c.
Details:
- Updated all static function definitions to use the cpp macro
BLIS_INLINE instead of the static keyword. This allows blis.h to
use a different keyword (inline) to define these functions when
compiling with C++, which might otherwise trigger "defined but
not used" warning messages. Thanks to Giorgos Margaritis for
reporting this issue and Devin Matthews for suggesting the fix.
- Updated the following files, which are used by configure's
hardware auto-detection facility, to unconditionally #define
BLIS_INLINE to the static keyword (since we know BLIS will be
compiled with C, not C++):
build/detect/config/config_detect.c
frame/base/bli_arch.c
frame/base/bli_cpuid.c
- CREDITS file update.
Details:
- Added the -fomit-frame-pointer compiler option to the CKOPTFLAGS
variable in the following make_defs.mk files:
config/haswell/make_defs.mk
config/skx/make_defs.mk
as well as comments that mention why the compiler option is needed.
This option is needed to prevent the compiler from using the rbp
frame register (in the very early portion of kernel code, typically
where k_iter and k_left are defined and computed), which, as of
1c719c9, is used explicitly by the gemmsup millikernels. Thanks to
Devin Matthews for identifying this missing option and to Jeff
Diamond for reporting the original bug in #417.
- The file
config/zen/amd_config.mk
which feeds into the make_defs.mk for both zen and zen2 subconfigs,
was also touched, but only to add a commented-out compiler option
(and the aforementioned explanatory comment) since that file already
uses -fomit-frame-pointer in COPTFLAGS, which forms the basis of
CKOPTFLAGS.
Details:
- Fixed an inadvertently disabled edge case optimization in the two
gemmsup variants in bli_l3_sup_var1n2m.c. Background: These edge case
optimizations allow the last millikernel operation in the jr loop to
be executed with inflated an register blocksize if it is the last
(or only) iteration. For example, if mr=6 and nr=8 and the gemmsup
problem is m=8, n=100, k=100. (In this case, the panel-block variant
(var1n) is executed, which places the jr loop in the m dimension.)
In principle, this problem could be executed as two millikernels: one
with dimensions 6x100x100, and one as 2x100x100. However, with the
support for inflated blocksizes in the kernel, the entire 8x100x100
problem can be passed to the millikernel function, which will then
execute it more favorably as two 4x100x100 millikernel sub-calls.
Now, this optimization is disabled under certain circumstances, such
as when multithreading. Previously, the is_mt predicate was being set
incorrectly such that it was non-zero even when running
single-threaded.
- Upon fixing the is_mt issue above, another bit of code needed to be
moved so that the result of the optimization could have an impact on
the assignment of loop bounds ranges to threads.
Details:
- Updated the test/sup source file (test_gemm.c) and Makefile to support
building matrices with small or large leading dimensions, and updated
runme.sh to support executing both kinds of test drivers.
- Updated runme.sh to allow for executing sup drivers with unpacked (the
default) or packed matrices (via setting BLIS_PACK_A, BLIS_PACK_B
environment variables), and for capturing output to files that encode
both the leading dimension (small or large) and packing status into
the filenames.
- Consolidated octave scripts in test/sup/octave_st, test/sup/octave_mt
into test/sup/octave and updated the octave code in that consolidated
directory to read the new output filename format (encoding ldim and
packing). Also added comments and streamlined code, particularly in
plot_panel_trxsh.m. Tested the octave scripts with octave 5.2.0.
- Moved old octave_st, octave_mt directories to test/sup/old.
Details:
- Previously, the entry for shiftd in the Operation index section of
BLISTypedAPI.md was incorrectly linking to the shiftd operation entry
in BLISObjectAPI.md. This has been fixed. Thanks to Jeff Diamond for
helping find this incorrect link.
Details:
- Fixed an innocuous bug that manifested when running the testsuite on
extremely small matrices with randomization via the "powers of 2 in
narrow precision range" option enabled. When the randomization
function emits a perfect 0.0 to fill a 1x1 matrix, the testsuite will
then compute 0.0/0.0 during the normalization process, which leads to
NaN residuals. The solution entails smarter implementaions of randv,
randnv, randm, and randnm, each of which will compute the 1-norm of
the vector or matrix in question. If the object has a 1-norm of 0.0,
the object is re-randomized until the 1-norm is not 0.0. Thanks to
Kiran Varaganti for reporting this issue (#413).
- Updated the implementation of randm_unb_var1() so that it loops over
a call to the randv_unb_var1() implementation directly rather than
calling it indirectly via randv(). This was done to avoid the overhead
of multiple calls to norm1v() when randomizing the rows/columns of a
matrix.
- Updated comments.
Details:
- Fixed a few not-really-bugs:
- Previously, the d6x8m kernels were still prefetching the next upanel
of A using MR*rs_a instead of ps_a (same for prefetching of next
upanel of B in d6x8n kernels using NR*cs_b instead of ps_b). Given
that the upanels might be packed, using ps_a or ps_b is the correct
way to compute the prefetch address.
- Fixed an obscure bug in the rd_d6x8m kernel that, by dumb luck,
executed as intended even though it was based on a faulty pointer
management. Basically, in the rd_d6x8m kernel, the pointer for B
(stored in rdx) was loaded only once, outside of the jj loop, and in
the second iteration its new position was calculated by incrementing
rdx by the *absolute* offset (four columns), which happened to be the
same as the relative offset (also four columns) that was needed. It
worked only because that loop only executed twice. A similar issue
was fixed in the rd_d6x8n kernels.
- Various cleanups and additions, including:
- Factored out the loading of rs_c into rdi in rd_d6x8[mn] kernels so
that it is loaded only once outside of the loops rather than
multiple times inside the loops.
- Changed outer loop in rd kernels so that the jump/comparison and
loop bounds more closely mimic what you'd see in higher-level source
code. That is, something like:
for( i = 0; i < 6; i+=3 )
rather than something like:
for( i = 0; i <= 3; i+=3 )
- Switched row-based IO to use byte offsets instead of byte column
strides (e.g. via rsi register), which were known to be 8 anyway
since otherwise that conditional branch wouldn't have executed.
- Cleaned up and homogenized prefetching a bit.
- Updated the comments that show the before and after of the
in-register transpositions.
- Added comments to column-based IO cases to indicate which columns
are being accessed/updated.
- Added rbp register to clobber lists.
- Removed some dead (commented out) code.
- Fixed some copy-paste typos in comments in the rv_6x8n kernels.
- Cleaned up whitespace (including leading ws -> tabs).
- Moved edge case (non-milli) kernels to their own directory, d6x8,
and split them into separate files based on the "NR" value of the
kernels (Mx8, Mx4, Mx2, etc.).
- Moved config-specific reference Mx1 kernels into their own file
(e.g. bli_gemmsup_r_haswell_ref_dMx1.c) inside the d6x8 directory.
- Added rd_dMx1 assembly kernels, which seems marginally faster than
the corresponding reference kernels.
- Updated comments in ref_kernels/bli_cntx_ref.c and changed to using
the row-oriented reference kernels for all storage combos.
This bug happens at a corner case, when k_iter == 0 and we jump to
CONSIDERKLEFT.
In current design, first row/col. of a and b are loaded twice.
The fix is to rearrange a and b (first row/col.) loading instructions.
Signed-off-by: Guodong Xu <guodong.xu@linaro.org>
Details:
- Properly select cc_vendor based on the output of invoking CC with the
--version option, including cases where CC is the variant of clang
that is included with Intel oneAPI. (However, we continue to treat
the compiler as clang for other purposes, not icc.) Thanks to Ajay
Panyala and Devin Matthews for reporting on this issue via #402.
Details:
- Added a function definition for xerbla_array_(), which largely mirrors
its netlib implementation. Thanks to Isuru Fernando for suggesting the
addition of this function.
Details:
- Added Perl to list of prerequisites for building BLIS. This is in part
(and perhaps completely?) due to some substitution commands used at
the end of configure that include '\n' characters that are not
properly interpreted by the version of sed included on some versions
of OS X. This new documentation addresses issue #398.
Here adds two kernels for Arm SVE vector extensions.
1. a gemm kernel for double at sizes 8x8.
2. a packm kernel for double at dimension 8xk.
To achive best performance, variable length agonostic programming
is not used. Vector length (VL) of 256 bits is mandated in both kernels.
Kernels to support other VLs can be added later.
"SVE is a vector extension for AArch64 execution mode for the A64
instruction set of the Armv8 architecture. Unlike other SIMD architectures,
SVE does not define the size of the vector registers, but constrains into
a range of possible values, from a minimum of 128 bits up to a maximum of
2048 in 128-bit wide units. Therefore, any CPU vendor can implement the
extension by choosing the vector register size that better suits the
workloads the CPU is targeting. Instructions are provided specifically
to query an implementation for its register size, to guarantee that
the applications can run on different implementations of the ISA without
the need to recompile the code." [1]
[1] https://developer.arm.com/solutions/hpc/resources/hpc-white-papers/arm-scalable-vector-extensions-and-application-to-machine-learning
Signed-off-by: Guodong Xu <guodong.xu@linaro.org>
Details:
- Relocated the #include "cpuid.h" directive from bli_cpuid.h to
bli_cpuid.c. This was done because cpuid.h (which is pulled into
the post-build blis.h developer header) doesn't protect its
definitions with a preprocessor guard of the form:
#ifndef FOOBAR_H
#define FOOBAR_H
// header contents.
#endif
and as a result, applications (previously) could not #include both
blis.h and cpuid.h (since the former was already including the
latter). Thanks to Bhaskar Nallani for raising this issue via #393
and to Devin Matthews for suggesting this fix.
- CREDITS file update.
Details:
- Fixed a missing argument (conjy) in the function signatures of
bli_?her2() and bli_?syr2() in docs/BLISTypedAPI.md. Thanks to Robert
van de Geijn for reporting this omission.
Details:
- Changed the behavior of bli_rntm_init() as well as the static
initializer, BLIS_RNTM_INITIALIZER, so that user-initialized rntm_t
objects by default specify the disabling of packing for A and B.
Packing of A/B was already disabled by default when calling non-expert
APIs (and enabled only when the user set environment variables
BLIS_PACK_A or BLIS_PACK_B). With this commit, the default behavior of
using user-initialized rntm_t objects with expert APIs comes into line
with the default behavior of non-expert APIs--that is, they now both
lead to the avoidance of packing in the sup code path. (Note: The
conventional code path is unaffected by the environment variables
BLIS_PACK_A/BLIS_PACK_B and/or the disabling of packing in a rntm_t
object when calling an expert API.) This addresses issue #392. Thanks
to Kiran Varaganti for bringing this inconsistency to our attention.
- The above change was accomplished by changing the the definitions of
static functions bli_rntm_clear_pack_a() and bli_rntm_clear_pack_b()
in bli_rntm.h, which are both for internal use only.
Details:
- Updated the sup entry in the "What's New" section of the README.md
file to promote the multithreaded dgemm sup feature introduced in
c0558fd.
Details:
- Renamed instances of bli_thread_obarrier() and bli_thread_obroadcast()
that were made in the supmt-specific code commited to the 'amd'
branch, which has now been merged with 'master'. Prior to the merge,
'master' received commit c01d249, which applied these renamings to
the existing, non-sup codebase.
* OSX: specify the full path to the location of libblis.dylib so that it can be found at runtime
Before this change:
Appication gives runtime error [when linked with blis]
dyld: Library not loaded: libblis.3.dylib
balay@kpro lib % otool -L libblis.dylib
libblis.dylib:
libblis.3.dylib (compatibility version 0.0.0, current version 0.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1281.0.0)
After this change:
balay@kpro lib % otool -L libblis.dylib
libblis.dylib:
/Users/balay/petsc/arch-darwin-c-debug/lib/libblis.3.dylib (compatibility version 0.0.0, current version 0.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1281.0.0)
* INSTALL_LIBDIR -> libdir as INSTALL_LIBDIR has DESTDIR
Co-Authored-By: Jed Brown <jed@jedbrown.org>
* CREDITS file update.
Co-authored-by: Jed Brown <jed@jedbrown.org>
Co-authored-by: Field G. Van Zee <field@cs.utexas.edu>
Details:
- Made several updates to test/1m4m/runme.sh, including:
- Added missing handling for 1m and 4m1a implementations when setting
the BLIS_??_NT environment variables.
- Added support for using numactl to run the test executables.
- Several other cleanups.
Details:
- Added logic to configure that causes the script to output a warning
to the user if/when "./configure auto" is run and the underlying
hardware feature detection code is unable to identify the hardware.
In these cases, the auto-detect code will return 'generic', which
is likely not what the user expected, and a flag will be set so that
a message is printed at the end of the configure output. (Thankfully,
we don't expect this scenario to play out very often.) Thanks to
Devin Matthews for suggesting this fix#384.
* Fix vectorized version of bli_amaxv
To match Netlib, i?amax should return:
- the lowest index among equal values
- the first NaN if one is encountered
* Fix typos.
* And another one...
* Update ref. amaxv kernel too.
* Re-enabled optimized amaxv kernels.
Details:
- Re-enabled the optimized, intrinsics-based amaxv kernels in the 'zen'
kernel set for use in haswell, zen, zen2, knl, and skx subconfigs.
These two kernels (for s and d datatypes) were temporarily disabled in
e186d71 as part of issue #380. However, the key missing semantic
properties that prompted the disabling of these kernels--returning the
index of the *first* rather than of the last element with largest
absolute value, and returning the index of the first NaN if one is
encountered--were added as part of #382 thanks to Devin Matthews.
Thus, now that the kernels are working as expected once more, this
commit causes these kernels to once again be registered for the
affected subconfigs, which effectively reverts all code changes
included in e186d71.
- Whitespace/formatting updates to new macros in bli_amaxv_zen_int.c.
Co-authored-by: Field G. Van Zee <field@cs.utexas.edu>
Details:
- Disabled use of optimized amaxv kernels, which use vector intrinsics
for both 's' and 'd' datatypes. We disable these kernels because the
current implementations fail to observe a semantic property of the
BLAS i?amax_() subroutine, which is to return the index of the
*first* element containing the maximum absolute value (that is, the
first element if there exist two or more elements that contain the
same value). With the optimized kernels disabled, the affected
subconfigurations (haswell, zen, zen2, knl, and skx) will use the
default reference implementations. Thanks to Mat Cross for reporting
this issue via #380.
- CREDITS file update.
Details:
- Added a missing return statement to the body of an early case handling
branch in bli_thread_partition_2x2(). This bug only affected cases
where n_threads < 4, and even then, the code meant to handle cases
where n_threads >= 4 executes and does the right thing, albeit using
more CPU cycles than needed. Nonetheless, thanks to Kiran Varaganti
for reporting this bug via issue #377.
- Whitespace changes to bli_thread.c (spaces -> tabs).
Details:
- Updated the Makefile, test_gemm.c, and runme.sh in test/sup to be able
to compile and run both single-threaded and multithreaded experiments.
This should help with maintenance going forward.
- Created a test/sup/octave_st directory of scripts (based on the
previous test/sup/octave scripts) as well as a test/sup/octave_mt
directory (based on the previous test/supmt/octave scripts). The
octave scripts are slightly different and not easily mergeable, and
thus for now I'll maintain them separately.
- Preserved the previous test/sup directory as test/sup/old/supst and
the previous test/supmt directory as test/sup/old/supmt.
Details:
- Reran all existing single-threaded performance experiments comparing
BLIS sup to other implementations (including the conventional code
path within BLIS), using the latest versions (where appropriate).
- Added multithreaded results for the three existing hardware types
showcased in docs/PerformanceSmall.md: Kaby Lake, Haswell, and Epyc
(Zen1).
- Various minor updates to the text in docs/PerformanceSmall.md.
- Updates to the octave scripts in test/sup/octave, test/supmt/octave.
Details:
- Updated test/sup/Makefile and test/supmt/Makefile to allow specifying
different problem size ranges for the drivers where one, two, or three
matrix dimensions is large. This will facilitate the generation of
more meaningful graphs, particularly when two dimensions are tiny.
Details:
- Optimized scripts in test/sup/octave and test/supmt/octave for use
with octave 5.2.0 on Ubuntu 18.04.
- Fixed stray 'end' keywords in gen_opsupnames.m and plot_l3sup_perf.m,
which were not only unnecessary but also causing issues with versions
5.x.
Details:
- Renamed two bli_thread_*() APIs:
bli_thread_obarrier() -> bli_thread_barrier()
bli_thread_obroadcast() -> bli_thread_broadcast()
The 'o' was a leftover from when thrcomm_t objects tracked both
"inner" and "outer" communicators. They have long since been
simplified to only support the latter, and thus the 'o' is
superfluous.
Details:
- Return early from bli_thrinfo_sup_grow() if the thrinfo_t object
address is equal to either &BLIS_GEMM_SINGLE_THREADED or
&BLIS_PACKM_SINGLE_THREADED.
- Added preprocessor logic to bli_l3_sup_thread_decorator() in
bli_l3_sup_decor_single.c that (by default) disables code that
creates and frees the thrinfo_t tree and instead passes
&BLIS_GEMM_SINGLE_THREADED as the thrinfo_t pointer into the
sup implementation.
- The net effect of the above changes is that a small amount of
thrinfo_t overhead is avoided when running small/skinny dgemm
problems when BLIS is compiled with multithreading disabled.
Details:
- Fixed a syntax bug in bli_l3_sup_decor_single.c as a result of
changing function interface for the thread entry point function
(of type l3supint_t).
- Unfortunately, fixing the interface was not enough, as it caused
a memory leak in the sba at bli_finalize() time. It turns out that,
due to the new multithreading-capable variant code useing thrinfo_t
objects--specifically, their calling of bli_thrinfo_grow()--we
have to pass in a real thrinfo_t object rather than the global
objects &BLIS_PACKM_SINGLE_THREADED or &BLIS_GEMM_SINGLE_THREADED.
Thus, I inserted the appropriate logic from the OpenMP and pthreads
versions so that single-threaded execution would work as intended
with the newly upgraded variants.
Details:
- Added multithreading support to the sup framework (via either OpenMP
or pthreads). Both variants 1n and 2m now have the appropriate
threading infrastructure, including data partitioning logic, to
parallelize computation. This support handles all four combinations
of packing on matrices A and B (neither, A only, B only, or both).
This implementation tries to be a little smarter when automatic
threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
recalculate the factorization in units of micropanels (rather than
using the raw dimensions) in bli_l3_sup_int.c, when the final
problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
or column-stored matrices. (This is used for the rrc and crc storage
cases.) Previously, copym was used, but that would no longer suffice
because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
instead of from the variant functions. This has the effect of making
the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
and inserted usage of these functions within bli_thrinfo_init(), which
previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
tests, as well as appropriate octave/matlab scripts to plot the
resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
that specifying any BLIS_*_NT variable, even if it is set to 1, will
be considered manual specification for the purposes of determining
whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.