Details:
- Modified .travis.yml so that only commits to 'master', 'dev', and
'amd' branches get built by Travis CI. Thanks to Devin Matthews for
helping to track down the syntax for this change.
Details:
- Re-enabled the changes made in fb93d24.
- Defined BLIS_ENABLE_SYSTEM in bli_arch.c, bli_cpuid.c, and bli_env.c,
all of which needed the definition (in addition to config_detect.c) in
order for the configure-time hardware detection binary to be compiled
properly. Thanks to Minh Quan Ho for helping identify these additional
files as needing to be updated.
- Added additional comments to all four source files, most notably to
prompt the reader to remember to update all of the files when updating
any of the files. Also made the cpp code in each of the files as
consistent/similar as possible.
- Refer to issue #532 and PR #546 for more history.
Details:
- Re-enable the changes originally made in 8e0c425 but quickly reverted
in 2be78fc.
- Moved the #include of bli_config.h so that it occurs before the
#include of bli_system.h. This allows the #define BLIS_ENABLE_SYSTEM
or #define BLIS_DISABLE_SYSTEM in bli_config.h to be processed by the
time it is needed in bli_system.h. This change should have been
in the original 8e0c425, but was accidentally omitted. Thanks to Minh
Quan Ho for catching this.
- Add #define BLIS_ENABLE_SYSTEM to config_detect.c so that the proper
cpp conditional branch executes in bli_system.h when compiling the
hardware detection binary. The changes made in 8e0c425 were an attempt
to support the definition of BLIS_OS_NONE when configuring with
--disable-system (in issue #532). That commit failed because, aside
from the required but omitted header reordering (second bullet above),
AppVeyor was unable to compile the hardware detection binary as a
result of missing Windows headers. This commit, which builds on PR
#546, should help fix that issue. Thanks to Minh Quan Ho for his
assistance and patience on this matter.
Details:
- Removed the commented-out #define BLIS_NUM_ARCHS in bli_type_defs.h
and its associated (now outdated) comments. BLIS_NUM_ARCHS has been
part of the arch_t enum for some time now, and so this change is
mostly about removing any opportunity for confusion for people who
may be reading the code. Thanks to Minh Quan Ho for leading me to
this cleanup.
Details:
- Defined a new packm variant for the 'gemmlike' sandbox. This new
variant (bls_l3_packm_var3.c) parallelizes the packing operation over
the k dimension rather than the m or n dimensions. Note that the
gemmlike implementation still uses var1 by default, and use of the new
code would require changing bls_l3_packm_a.c and/or bls_l3_packm_b.c
so that var3 is called instead. Thanks to Jeff Diamond for proposing
this (perhaps NUMA-friendly) solution.
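To illustrate what parallelizing the packing over the k dimension means, here is a minimal sketch in C. It is not the code in bls_l3_packm_var3.c: the helper names (demo_k_range, demo_pack_k_slice), the column-major layout, and the contiguous destination panel are all assumptions made for illustration. Each thread copies only its own contiguous range of k indices, and every thread touches all m rows.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: split the range [0, k) into nt nearly equal
   contiguous pieces and return piece number tid as [*start, *end). */
void demo_k_range( size_t k, size_t nt, size_t tid,
                   size_t* start, size_t* end )
{
    assert( nt > 0 && tid < nt );

    size_t base = k / nt; /* minimum number of indices per thread */
    size_t rem  = k % nt; /* the first rem threads get one extra  */

    *start = tid * base + ( tid < rem ? tid : rem );
    *end   = *start + base + ( tid < rem ? 1 : 0 );
}

/* Hypothetical packing routine: copy thread tid's share of the k
   columns of the m x k column-major matrix a (leading dimension lda)
   into the contiguous m x k buffer p. */
void demo_pack_k_slice( size_t m, size_t k,
                        const double* a, size_t lda, double* p,
                        size_t nt, size_t tid )
{
    size_t k_start, k_end;

    demo_k_range( k, nt, tid, &k_start, &k_end );

    for ( size_t l = k_start; l < k_end; ++l )
        for ( size_t i = 0; i < m; ++i )
            p[ l * m + i ] = a[ l * lda + i ];
}
```

In a threaded setting, each of the nt threads would call demo_pack_k_slice() with its own tid; since the k ranges do not overlap, no two threads write to the same element of p.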
Details:
- Modified bli_system.h so that the cpp macro BLIS_OS_NONE is defined
when BLIS_DISABLE_SYSTEM is defined. Otherwise, the previous OS-
detecting macro conditionals are considered. This change is to
accommodate a solution to a cross-compilation issue described in
#532.
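The conditional structure described above might look roughly like the following sketch. It is not a copy of bli_system.h: the list of OS branches is abbreviated, the predefined compiler macros tested in each branch are assumptions, and demo_has_os() is a hypothetical helper added only so that the result can be inspected.

```c
#include <assert.h>

/* Assumption for this sketch: the build was configured with
   --disable-system, so bli_config.h would have emitted this line. */
#define BLIS_DISABLE_SYSTEM

#if defined(BLIS_DISABLE_SYSTEM)
  /* No operating system services may be assumed. */
  #define BLIS_OS_NONE 1
#elif defined(_WIN32)
  #define BLIS_OS_WINDOWS 1
#elif defined(__APPLE__)
  #define BLIS_OS_OSX 1
#elif defined(__linux__)
  #define BLIS_OS_LINUX 1
#endif

/* C11 compile-time check that the first branch was taken. */
static_assert( BLIS_OS_NONE == 1, "expected BLIS_OS_NONE to be defined" );

/* Hypothetical helper: returns 1 if any OS-specific macro was defined. */
int demo_has_os( void )
{
#if defined(BLIS_OS_WINDOWS) || defined(BLIS_OS_OSX) || defined(BLIS_OS_LINUX)
    return 1;
#else
    return 0;
#endif
}
```

Because the BLIS_DISABLE_SYSTEM test comes first, none of the OS-detecting branches are evaluated when it is defined, regardless of the platform performing the compilation.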
Details:
- Updated two out-of-date calls to bli_malloc_intl() within the gemmlike
sandbox. These calls to malloc_intl(), which resided in
bls_l3_decor_pthreads.c, were missing the err_t argument that the
function uses to report errors. Thanks to Jeff Diamond for helping
isolate this issue.
Details:
- Moved miscellaneous language-related definitions, including defs
related to the handling of the 'restrict' keyword, from the top half
of bli_macro_defs.h into a new file, bli_lang_defs.h, which is now
#included immediately after "bli_system.h" in blis.h. This change is
an attempt to fix a report of recent breakage of C++ compilers due
to the recent introduction of 'restrict' in bli_type_defs.h (which
previously was being included *before* bli_macro_defs.h and its
restrict handling therein). Thanks to Ivan Korostelev for reporting
this issue in #527.
- CREDITS file update.
Details:
- In the gemmlike sandbox, changed the loop index variable of inner
loop of packm_cxk() from 'd' to 'i' (and likewise for the
corresponding inlined code within packm_var2()).
- Pack matrices A and B using packm_var1() instead of packm_var2().
Details:
- Added code to the gemmlike sandbox that handles parameter checking.
Previously, the gemmlike implementation called bli_gemm_check(), which
resides within the BLIS framework proper. Certain modifications that a
user may wish to perform on the sandbox, such as adding a new matrix
or vector operand, would have required additional checks, and so these
changes make it easier for such a person to implement those checks for
their custom gemm-like operation.
Details:
- Changed the implementation in the 'gemmlike' sandbox to more easily
allow others to provide custom implementations of packm. These changes
include:
- Calling a local version of packm_cxk() that can be modified. This
version uses inlined loops rather than querying the context for
packm kernels (or even using scal2m).
- Providing two variants of packm, one of which calls the
aforementioned packm_cxk(), the other of which inlines the contents
of packm_cxk() into the variant itself, making it self-contained.
To switch from one to the other, simply change which function gets
called within bls_packm_a() and bls_packm_b().
- Simplified and cleaned up some variable names in both variants of
packm, relative to their parent code.
This fixes a bug where "make -j<N> check" may fail after a change to one or more header files, or where testsuite code doesn't get properly recompiled after internal changes.
Details:
- Disabled a sanity check in bli_pool_finalize() that was meant to alert
the user if a pool_t was being finalized while some blocks were still
checked out. However, this is exactly the situation that might happen
when a pool_t is re-initialized for a larger blocksize, and currently
bli_pool_reinit() is implemented as _finalize() followed by _init().
So, this sanity check is not universally appropriate. Thanks to
AMD-India for reporting this issue.
Details:
- Updated stale calls to the bli_membrk API within the 'gemmlike'
sandbox. This API is now called bli_pba (packed block allocator).
Ideally, this forgotten update would have been included as part of
21911d6, which is when the branch that introduced the membrk->pba
changes was merged into 'master'.
- Comment updates.
The added fields:
1. `pack_t schema`: storing the pack schema on the object allows the macrokernel to act accordingly without side-channel information from the rntm_t and cntx_t. The pack schema and "pack_[ab]" fields could be removed from those structs.
2. `void* user_data`: this field can be used to store any sort of additional information provided by the user. The pointer is propagated to submatrix objects and copies, but is otherwise ignored by the framework and the default implementations of the following three fields. User-specified pack, kernel, or ukr functions can do whatever they want with the data, and the user is 100% responsible for allocating, assigning, and freeing this buffer.
3. `obj_pack_fn_t pack`: the function called when a matrix is packed. This function receives the expected arguments, as well as a mdim_t and mem_t* as memory must be allocated inside this function, and behavior may differ based on which matrix is being packed (i.e. transposition for B). This could also be achieved by passing a desired pack schema, but this would require additional information to travel down the control tree.
4. `obj_ker_fn_t ker`: the function called when we get to the "second loop", or the macro-kernel. Behavior may depend on the pack schemas of the input matrices. The default implementation would perform the inner two loops around the ukr, and then call either the default ukr or a user-supplied one (next field).
5. `obj_ukr_fn_t ukr`: the function called by the default macrokernel. This would replace the various current "virtual" microkernels, and could also be used to supply user-defined behavior. Users could supply both a custom kernel (above) and microkernel, although the user-specified kernel does **not** necessarily have to call the ukr function specified on the obj_t.
Note that no macros or functions for accessing these new fields have been defined yet. That is next once these are finalized. Addresses https://github.com/flame/blis/projects/1#card-62357687.
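As a rough illustration of how the `user_data` and `ukr` fields might interact, here is a minimal sketch in C. Every name in it (`demo_obj_t`, `demo_ukr_fn_t`, `demo_alias`, `demo_ker`, and the two microkernels) is a hypothetical stand-in, not a BLIS definition; the real `obj_t` carries many more fields and the real function-pointer signatures are not shown in this description.

```c
#include <assert.h>
#include <stddef.h>

typedef struct demo_obj_s demo_obj_t;

/* Hypothetical microkernel signature: updates *c from scalars a and b. */
typedef void (*demo_ukr_fn_t)( const demo_obj_t* obj,
                               double a, double b, double* c );

struct demo_obj_s
{
    void*         user_data; /* owned by the user; only the pointer is copied */
    demo_ukr_fn_t ukr;       /* NULL means "use the default microkernel"      */
};

/* Default microkernel: c += a * b. It never looks at user_data. */
void demo_ukr_default( const demo_obj_t* obj, double a, double b, double* c )
{
    (void)obj;
    *c += a * b;
}

/* User-supplied microkernel: scales the product by a factor that the
   user stored behind user_data. */
void demo_ukr_scaled( const demo_obj_t* obj, double a, double b, double* c )
{
    assert( obj->user_data != NULL );

    const double* scale = obj->user_data;
    *c += ( *scale ) * a * b;
}

/* Copies and submatrix views carry the same user_data pointer. */
demo_obj_t demo_alias( const demo_obj_t* src )
{
    return *src;
}

/* Stand-in for the default macrokernel: call the ukr attached to the
   object, falling back to the default one. */
void demo_ker( const demo_obj_t* obj, double a, double b, double* c )
{
    demo_ukr_fn_t f = ( obj->ukr != NULL ? obj->ukr : demo_ukr_default );
    f( obj, a, b, c );
}
```

In this sketch the framework-side code (`demo_alias`, `demo_ker`, `demo_ukr_default`) never dereferences `user_data`; only the user-supplied function does, which mirrors the ownership rule stated in item 2.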
Details:
- Fixed a compile-time error in bli_init.c when compiling with OSX's
clang. This error was introduced in 868b901, which introduced a
post-declaration struct assignment where the RHS was a struct
initialization expression (i.e. { ... }). This use of struct
initializer expressions apparently works with gcc despite it not
being strict C99. The fix included in this commit declares a temporary
variable for the purposes of being initialized to the desired value,
via the struct initializer, and then copies the temporary struct (via
'=' struct assignment) to the persistent struct. Thanks to Devin
Matthews for his help with this.
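The pattern described above can be sketched as follows. The struct type and function names are hypothetical stand-ins chosen for illustration; they are not the ones used in bli_init.c.

```c
#include <assert.h>

/* Hypothetical struct standing in for the one assigned in bli_init.c. */
typedef struct
{
    int    state;
    double scale;
} demo_cfg_t;

static demo_cfg_t persistent;

void demo_reset( void )
{
    /* Not valid C: a bare braced initializer list is only allowed in a
       declaration, not on the right-hand side of an assignment.

       persistent = { 0, 1.0 };
    */

    /* Initialize a temporary in its declaration, then copy it to the
       persistent struct with ordinary struct assignment. */
    const demo_cfg_t init_val = { 0, 1.0 };
    persistent = init_val;

    /* Sanity check that the copy took effect. */
    assert( persistent.state == 0 );
}
```

The same approach works when the initializer is hidden behind a macro that expands to a braced list, since the macro is then only ever used inside a declaration.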
Details:
- Accept either 'clang' or 'LLVM' in vendor string when grepping for
the version number (after determining that we're working with clang).
Thanks to Devin Matthews for this fix.
Details:
- Fixes a rather obvious bug that resulted in a segmentation fault
whenever the calling application tried to re-initialize BLIS after
its first init/finalize cycle. The bug resulted from the fact that
the bli_init.c APIs made no effort to allow bli_init() to be called
more than once, because both it and bli_finalize() are implemented
in terms of pthread_once(). This has been fixed by
resetting the pthread_once_t control variable for initialization
at the end of bli_finalize_apis(), and by resetting the control
variable for finalization at the end of bli_init_apis(). Thanks to
@lschork2 for reporting this issue (#525), and to Minh Quan Ho and
Devin Matthews for suggesting the chosen solution.
- CREDITS file update.
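A minimal sketch of the re-arming scheme described above, using plain POSIX pthread_once(). The demo_* names and the counters are hypothetical and exist only so that the behavior can be observed; the actual code lives in bli_init.c. Note that overwriting a pthread_once_t after its first use is an assumption that happens to work on common implementations rather than something POSIX guarantees, and this sketch makes no attempt to handle init and finalize racing on different threads.

```c
#include <assert.h>
#include <pthread.h>

static pthread_once_t once_init     = PTHREAD_ONCE_INIT;
static pthread_once_t once_finalize = PTHREAD_ONCE_INIT;

static int init_count     = 0; /* times initialization actually ran */
static int finalize_count = 0; /* times finalization actually ran   */

static void demo_init_apis( void )
{
    ++init_count;

    /* Re-arm finalization so the next demo_finalize() does real work. */
    const pthread_once_t once_new = PTHREAD_ONCE_INIT;
    once_finalize = once_new;
}

static void demo_finalize_apis( void )
{
    ++finalize_count;

    /* Re-arm initialization so the library can be initialized again. */
    const pthread_once_t once_new = PTHREAD_ONCE_INIT;
    once_init = once_new;
}

void demo_init( void )
{
    int r = pthread_once( &once_init, demo_init_apis );
    assert( r == 0 );
    (void)r;
}

void demo_finalize( void )
{
    int r = pthread_once( &once_finalize, demo_finalize_apis );
    assert( r == 0 );
    (void)r;
}
```

Repeated calls to demo_init() without an intervening demo_finalize() still run the initialization only once, while an init/finalize/init sequence runs it a second time.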
Details:
- Added single-threaded and multithreaded performance results to
docs/Performance.md. These results were gathered on a Graviton2
Neoverse N1 server. Special thanks to Nicholai Tukanov for
collecting these results via the Arm-HPC/AWS hackathon.
- Corrected what was supposed to be a temporary tweak to the legend
labels in test/3/octave/plot_l3_perf.m.
Details:
- Added vzeroupper instruction to the end of all 'gemm' and 'gemmtrsm'
microkernels so as to avoid a performance penalty when mixing AVX
and SSE instructions. These vzeroupper instructions were once part
of the haswell kernels, but were inadvertently removed during a source
code shuffle some time ago when we were managing duplicate 'haswell'
and 'zen' kernel sets. Thanks to Devin Matthews for tracking this down
and re-inserting the missing instructions.