Details:
- Add a blurb about the new addons feature to the "Documentation for
BLIS developers" section of the README.md, which also links to the
Addons.md document.
Details:
- Replaced the hard-coded calls to double-precision real syr, syr2,
syrk, and syr2k in the corresponding standalone test drivers in the
'test' directory with conditional branches that will call the
appropriate BLAS interface depending on which datatype is enabled.
Thanks to Madan mohan Manokar for this improvement.
- CREDITS file update.
Details:
- Added a new 'zen3' subconfiguration targeting support for the AMD Zen3
microarchitecture (#561). Thanks to AMD for this contribution.
- Restructured clang and AOCC support for zen, zen2, and zen3
make_defs.mk files. The clang and AOCC version detection now happens
in configure, not in the subconfigurations' makefile fragments. That
is, we've added logic to configure that detects the version of
clang/AOCC, outputs an appropriate variable to config.mk
(i.e., CLANG_OT_*, AOCC_OT_*), and then checks for it within the
makefile fragment (as is currently done for the GCC_OT_* variables).
- Added configure support for a GCC_OT_10_1_0 variable (and associated
substitution anchor) to communicate whether the gcc version is older
than 10.1.0, and use this variable to check for recent enough versions
of gcc to use -march=znver3 in the zen3 subconfig.
- Inlined the contents of config/zen/amd_config.mk into the zen and zen2
make_defs.mk so that the files are self-contained, harmonizing the
format of all three Zen-based subconfigurations' make_defs.mk files.
- Added indenting (with spaces) of GNU make conditionals for easier
reading in zen, zen2, and zen3 make_defs.mk files.
- Adjusted the range of models checked by bli_cpuid_is_zen() (which was
previously 0x00 ~ 0xff and is now 0x00 ~ 0x2f) so that it is
completely disjoint from the models checked by bli_cpuid_is_zen2()
(0x30 ~ 0xff). This is normally necessary because Zen and Zen2
microarchitectures share the same family (23, or 0x17), and so the
model code is the only way to differentiate the two. But in our case,
fixing the model range for zen *wasn't* actually necessary since we
checked for zen2 first, and therefore the wide zen range acted like
the 'else' of an 'if-else' statement. That said, the change helps
improve clarity for the reader by encoding useful knowledge, which
was obtained from https://en.wikichip.org/wiki/amd/cpuid .
- Added zen2.def and zen3.def files to the collection in travis/cpuid.
Note that support for zen, zen2, and zen3 is now present, and while
all three microarchitectures have identical instruction sets from
the perspective of BLIS microkernels, they each correspond to
different subconfigurations and therefore merit separate testing.
Thanks to Devin Matthews for his guidance in hacking these files as
slight modifications of zen.def.
- Enabled testing of zen2 and zen3 via the SDE in travis/do_sde.sh.
Now, zen, zen2, and zen3 are tested through the SDE via Travis CI
builds.
- Updated travis/do_sde.sh to grab the SDE tarball from a new ci-utils
repository on GitHub rather than on Intel's website. This change was
made in an attempt to circumvent recent troubles with Travis CI not
being able to download the SDE directly from Intel's website via curl.
Thanks to Devin Matthews for suggesting the idea.
- Updated travis/do_sde.sh to grab the latest version (8.69.1) of the
Intel SDE from the flame/ci-utils repository.
- Updated .travis.yml to use gcc 9. The file was previously using gcc 8,
which did not support -march=znver2.
- Created amd64_legacy umbrella family in config_registry for targeting
older (bulldozer, piledriver, steamroller, and excavator)
microarchitectures and moved those same subconfigs out of the amd64
umbrella family. However, x86_64 retains amd64_legacy as a constituent
member.
- Fixed a bug in configure related to the building of the so-called
config list. When processing the contents of config_registry,
configure creates a series of structures and lists that allow for
various mappings related to configuration families, subconfigs, and
kernel sets. Two of those lists are built via substitution of
umbrella families with their subconfig members, and one of those
lists was improperly performing the substitution in a way that would
erroneously match on partial umbrella family names. That code was
changed to match the code that was already doing the substitution
properly, via substitute_words(). Also added comments noting the
importance of using substitute_words() in both instances.
- Comment updates.
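The disjoint model ranges described above for bli_cpuid_is_zen() and bli_cpuid_is_zen2() can be sketched as follows. This is only an illustration of the family/model test: the sketch_* names and the model_in_range() helper are hypothetical, and any other checks the real functions perform are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Family 23 (0x17) is shared by Zen and Zen2, so only the model differs. */
enum { SKETCH_AMD_FAMILY_17H = 0x17 };

/* Hypothetical helper (not a BLIS function): true if model is in [lo, hi]. */
static bool model_in_range(uint32_t model, uint32_t lo, uint32_t hi)
{
    assert(lo <= hi);
    return lo <= model && model <= hi;
}

/* Zen: family 0x17, models 0x00 ~ 0x2f. */
static bool sketch_is_zen(uint32_t family, uint32_t model)
{
    return family == SKETCH_AMD_FAMILY_17H && model_in_range(model, 0x00, 0x2f);
}

/* Zen2: family 0x17, models 0x30 ~ 0xff. */
static bool sketch_is_zen2(uint32_t family, uint32_t model)
{
    return family == SKETCH_AMD_FAMILY_17H && model_in_range(model, 0x30, 0xff);
}
```

Because the two ranges no longer overlap, the result does not depend on which predicate is evaluated first.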
Details:
- Reverted the annotation of some markdown code blocks with 'bash'
after realizing that the in-browser syntax highlighting was not
worthwhile.
Details:
- Inserted a new 'Example Code' section into the README.md immediately
after the 'Getting Started' section. Thanks to Devin Matthews for
recommending this addition.
- Moved the 'Performance' section of the README down slightly so that it
appears after the 'Documentation' section.
Details:
- Implemented a new feature called addons, which are similar to
sandboxes except that there is no requirement to define gemm or any
other particular operation.
- Updated configure to accept --enable-addon=<name> or -a <name> syntax
for requesting an addon be included within a BLIS build. configure now
outputs the list of enabled addons into config.mk. It also outputs the
corresponding #include directives for the addons' headers to a new
companion to the bli_config.h header file named bli_addon.h. Because
addons may wish to make use of existing BLIS types within their own
definitions, the addons' headers must be included sometime after that
of bli_config.h (which currently is #included before bli_type_defs.h).
This is why the #include directives needed to go into a new top-level
header file rather than the existing bli_config.h file.
- Added a markdown document, docs/Addons.md, to explain addons, how to
build with them, and what assumptions their authors should keep in
mind as they create them.
- Added a gemmlike-like implementation of sandwich gemm called 'gemmd'
as an addon in addon/gemmd. The code uses a 'bao_' prefix for local
functions, including the user-level object and typed APIs.
- Updated .gitignore so that git ignores bli_addon.h files.
Details:
- Expanded the BLAS compatibility layer to include support for
?axpby_() and ?gemm_batch_(). The former is a straightforward
BLAS-like interface into the axpbyv operation while the latter
implements a batched gemm via loops over bli_?gemm(). Also
expanded the CBLAS compatibility layer to include support for
cblas_?axpby() and cblas_?gemm_batch(), which serve as wrappers to
the corresponding (new) BLAS-like APIs. Thanks to Meghana Vankadari
for submitting these new APIs via #566.
- Fixed a long-standing bug in common.mk that for some reason never
manifested until now. Previously, CBLAS source files were compiled
*without* the location of cblas.h being specified via a -I flag.
I'm not sure why this worked, but it may be due to the fact that
the cblas.h file resided in the same directory as all of the CBLAS
source, and perhaps compilers implicitly add a -I flag for the
directory that corresponds to the location of the source file being
compiled. This bug only showed up because some CBLAS-like source code
was moved into an 'extra' subdirectory of that frame/compat/cblas/src
directory. After moving the code, compilation for those files failed
(because the cblas.h header file, presumably, could not be found in
the same location). This bug was fixed within common.mk by explicitly
adding the cblas.h directory to the list of -I flags passed to the
compiler.
- Added test_axpbyv.c and test_gemm_batch.c files to 'test' directory,
and updated test/Makefile to build those drivers.
- Fixed typo in error message string in cblas_sgemm.c.
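For readers unfamiliar with the operation behind ?axpby_(), the following is a minimal reference sketch of y := alpha * x + beta * y for double-precision real vectors. The function name sketch_daxpby() and its signature are hypothetical and do not reproduce the BLAS-like interface added here; only positive strides are handled.

```c
#include <assert.h>
#include <stddef.h>

/* Reference sketch (hypothetical name): y := alpha * x + beta * y.
   n is the vector length; incx and incy are positive element strides. */
static void sketch_daxpby(size_t n,
                          double alpha, const double *x, size_t incx,
                          double beta,  double *y,       size_t incy)
{
    assert(incx > 0 && incy > 0);

    for (size_t i = 0; i < n; ++i)
    {
        /* Scale the existing y element by beta and add the scaled x element. */
        y[i * incy] = alpha * x[i * incx] + beta * y[i * incy];
    }
}
```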
Details:
- Renamed herk macrokernels and supporting files and functions to gemmt,
which is possible since at the macrokernel level they are identical.
Then recast herk/her2k/syrk/syr2k in terms of gemmt within the expert
level-3 oapi (bli_l3_oapi_ex.c) while also redefining them as literal
functions rather than cpp macros that instantiate multiple functions.
Thanks to Devin Matthews for his efforts on this issue (#531).
- Check that the maximum stack buffer size is sufficiently large
relative to the register blocksizes for each datatype, and do so when
the context is initialized rather than when an operation is called.
Note that with this change, users who pass in their own contexts into
the expert interfaces currently will *not* have any checks performed.
Thanks to Devin Matthews for suggesting this change.
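The stack buffer check described above might look roughly like the sketch below. It assumes the requirement is that one MR x NR microtile of each datatype fits in the buffer; that formula, the 4096-byte limit, and all sketch_* names are assumptions made for illustration, not the actual BLIS code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limit; the real value is a build-time setting. */
#define SKETCH_STACK_BUF_MAX_SIZE 4096

/* Assumed rule: an mr x nr tile of elem_size-byte elements must fit. */
static bool sketch_stack_buf_ok(size_t mr, size_t nr, size_t elem_size)
{
    assert(elem_size > 0);
    return mr * nr * elem_size <= SKETCH_STACK_BUF_MAX_SIZE;
}

/* Intended to run once, when a context is initialized, rather than on
   every operation call: verify the register blocksizes of each datatype. */
static bool sketch_check_all(const size_t mr[], const size_t nr[],
                             const size_t elem_size[], size_t num_dt)
{
    for (size_t dt = 0; dt < num_dt; ++dt)
    {
        if (!sketch_stack_buf_ok(mr[dt], nr[dt], elem_size[dt]))
            return false;
    }
    return true;
}
```

A context built by the user and passed directly to an expert interface never goes through such an initialization path, which is why it would skip the check.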
Details:
- Fixed a bug that broke certain mixed-datatype gemm behavior. This
bug was introduced recently in e9da642 when the code that performs
the operation transposition (for microkernel IO preference purposes)
was moved up so that it occurred sooner. However, when I moved that
code, I failed to notice that there was a cpp-protected "if"
conditional that applied to the entire code block that was moved. Once
the code block was relocated, the orphaned if-statement was now
(erroneously) glomming on to the next thing that happened to be in the
function, which happened to be the call to bli_rntm_set_ways_for_op(),
causing a rather odd memory exhaustion error in the sba due to the
num_threads field of the rntm_t still being -1 (because the rntm_t
fields were never processed as they should have been). Thanks to
@ArcadioN09 (Snehith) for reporting this error and helpfully including
relevant memory trace output.
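The shape of this bug can be reproduced in a few lines. In the sketch below, the macro SKETCH_ENABLE_MIXED_DT, the condition cond, and the value 4 are all hypothetical stand-ins; the point is only that a preprocessor-guarded "if" left behind after its block is moved will silently govern the next statement.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_ENABLE_MIXED_DT 1   /* hypothetical feature macro */

/* Buggy shape: the guarded block was relocated, but its "if" stayed, so
   the conditional now applies to the statement that follows it. */
static int sketch_buggy(bool cond)
{
    int num_threads = -1;          /* -1 means "not yet processed" */

#if SKETCH_ENABLE_MIXED_DT
    if (cond)
#endif
        num_threads = 4;           /* meant to run always; now conditional */

    return num_threads;
}

/* Fixed shape: the guard travels with the relocated block, so the thread
   setup below runs unconditionally. */
static int sketch_fixed(bool cond)
{
    int num_threads = -1;

    (void)cond;                    /* the guard now lives with its block */
    num_threads = 4;

    assert(num_threads > 0);       /* the field has been processed */
    return num_threads;
}
```

When cond is false, sketch_buggy() returns -1, mirroring the unprocessed num_threads value described above.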
Details:
- Removed support for all induced methods except for 1m. This included
removing code related to 3mh, 3m1, 4mh, 4m1a, and 4m1b as well as any
code that existed only to support those implementations. These
implementations were rarely used and posed code maintenance challenges
for BLIS's maintainers going forward.
- Removed reference kernels for packm that pack 3m and 4m micropanels,
and removed 3m/4m-related code from bli_cntx_ref.c.
- Removed support for 3m/4m from the code in frame/ind, then reorganized
and streamlined the remaining code in that directory. The *ind(),
*nat(), and *1m() APIs were all removed. (These additional API layers
no longer made as much sense with only one induced method (1m) being
supported.) The bli_ind.c file (and header) were moved to frame/base
and bli_l3_ind.c (and header) and bli_l3_ind_tapi.h were moved to
frame/3.
- Removed 3m/4m support from the code in frame/1m/packm.
- Removed 3m/4m support from trmm/trsm macrokernels and simplified some
pointer arithmetic that was previously expressed in terms of the
bli_ptr_inc_by_frac() static inline function (whose definition was
also removed).
- Removed the following subdirectories of level-0 macro headers from
frame/include/level0: ri3, rih, ri, ro, rpi. The level-0 scalar macros
defined in these directories were used exclusively for 3m and 4m
method codes.
- Simplified bli_cntx_set_blkszs() and bli_cntx_set_ind_blkszs() in
light of 1m being the only induced method left within BLIS.
- Removed dt_on_output field within auxinfo_t and its associated
accessor functions.
- Re-indexed the 1e/1r pack schemas after removing those associated with
variants of the 3m and 4m methods. This leaves two bits unused within
the pack format portion of the schema bitfield. (See bli_type_defs.h
for more info.)
- Spun off the basic and expert interfaces to the object and typed APIs
into separate files: bli_l3_oapi.c and bli_l3_oapi_ex.c; bli_l3_tapi.c
and bli_l3_tapi_ex.c.
- Moved the level-3 operation-specific _check function calls from the
operations' _front() functions to the corresponding _ex() function of
the object API. (This change roughly maintains where the _check()
functions are called in the call stack but lays the groundwork for
future changes that may come to the level-3 object APIs.) Minor
modifications to bli_l3_check.c to allow the check() functions to be
called from the expert interface APIs.
- Removed support within the testsuite for testing the aforementioned
induced methods, and updated the standalone test drivers in the 'test'
directory to reflect the retirement of those induced methods.
- Modified the sandbox contract so that the user is obliged to define
bli_gemm_ex() instead of bli_gemmnat(). (This change was made in light
of the *nat() functions no longer existing.) Also updated the existing
'power10' and 'gemmlike' sandboxes to come into compliance with the
new sandbox rules.
- Updated BLISObjectAPI.md, BLISTypedAPI.md, Testsuite.md documentation
to reflect the retirement of 3m/4m, and also modified Sandboxes.md to
bring the document into alignment with new conventions.
- Updated various comments; removed segments of commented-out code.
Details:
- Updated travis/do_sde.sh so that the script downloads the SDE tarball
from a new ci-utils repository on GitHub rather than from Intel's
website. This change is being made in an attempt to circumvent Travis
CI's recent troubles with downloading the SDE from Intel's website via
curl. Thanks to Devin Matthews for suggesting the idea.
Details:
- Fixed a bug in configure related to the building of the so-called
config list. When processing the contents of config_registry,
configure creates a series of structures and lists that allow for
various mappings related to configuration families, subconfigs,
and kernel sets. Two of those lists are built via substitution
of umbrella families with their subconfig members, and one of
those lists was improperly performing the substitution in a way
that would erroneously match on partial umbrella family names.
That code was changed to match the code that was already doing
the substitution properly, via substitute_words().
- Added comments noting the importance of using substitute_words()
in both instances.
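The difference between whole-word and partial-name substitution can be illustrated in C, even though configure itself is a shell script. The function below is a hypothetical analogue of whole-word substitution, not a port of substitute_words(); the family names in the usage note are chosen only to show how one name can be a prefix of another.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Replace each whole-word occurrence of 'word' in the space-separated
   'list' with 'repl', writing the result to 'out'. Returns 0 on success
   or -1 if 'out' is too small. Hypothetical helper for illustration. */
static int sketch_substitute_words(const char *list, const char *word,
                                   const char *repl, char *out, size_t outsz)
{
    size_t used = 0;
    size_t wlen = strlen(word);

    assert(outsz > 0);
    out[0] = '\0';

    while (*list != '\0')
    {
        /* Skip separators, then measure the next token. */
        while (*list == ' ') ++list;
        if (*list == '\0') break;
        size_t tlen = strcspn(list, " ");

        /* Only an exact, full-token match is substituted. */
        int match = (tlen == wlen && strncmp(list, word, wlen) == 0);
        const char *src = match ? repl : list;
        size_t slen = match ? strlen(repl) : tlen;

        list += tlen;
        if (slen == 0) continue;   /* an empty replacement drops the word */

        /* Need room for an optional separator, the text, and the NUL. */
        size_t sep = (used > 0) ? 1 : 0;
        if (used + sep + slen + 1 > outsz) return -1;
        if (sep) out[used++] = ' ';
        memcpy(out + used, src, slen);
        used += slen;
        out[used] = '\0';
    }
    return 0;
}
```

For example, substituting "amd64" with "zen zen2" in the list "amd64 amd64_legacy" yields "zen zen2 amd64_legacy"; a naive substring replacement would also have rewritten the front of "amd64_legacy".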
Details:
- Fixed a bug that broke the use of 1m for dcomplex when the single-
precision real and double-precision real ukernels had opposing I/O
preferences (row-preferential sgemm ukernel + column-preferential
dgemm ukernel, or vice versa). The fix involved adjusting the API
to bli_cntx_set_ind_blkszs() so that the induced method context init
function (e.g., bli_cntx_init_<subconfig>_ind()) could call that
function for only one datatype at a time. This allowed the blocksize
scaling (which varies depending on whether we're doing 1m_r or 1m_c)
to happen on a per-datatype basis. This fixes issue #557. Thanks to
Devin Matthews and RuQing Xu for helping discover and report this bug.
- The aforementioned 1m fix required moving the 1m_r/1m_c logic from
bli_cntx_ref.c into a new function, bli_l3_set_schemas(), which is
called from each level-3 _front() function. The pack_t schemas in the
cntx_t were also removed entirely, along with the associated accessor
functions. This in turn required updating the trsm1m-related virtual
ukernels to read the pack schema for B from the auxinfo_t struct
rather than the context. This also required slight tweaks to
bli_gemm_md.c.
- Repositioned the logic for transposing the operation to accommodate
the microkernel IO preference. This mostly only affects gemm. Thanks
to Devin Matthews for his help with this.
- Updated dpackm pack ukernels in the 'armsve' kernel set to avoid
querying pack_t schemas from the context.
- Removed the num_t dt argument from the ind_cntx_init_ft type defined
in bli_gks.c. The context initialization functions for induced methods
were previously passed a dt argument, but I can no longer figure out
*why* they were passed this value. To reduce confusion, I've removed
the dt argument (including also from the function definition +
prototype).
- Commented out setting of cntx_t schemas in bli_cntx_ind_stage.c. This
breaks high-level implementations of 3m and 4m, but this is okay since
those implementations will be removed very soon.
- Removed some older blocks of preprocessor-disabled code.
- Comment update to test_libblis.c.
Details:
- Previously, the block_ptrs field of the pool_t was allowed to be
initialized as any unsigned integer, including 0. However, a length of
0 could be problematic given that the behavior of malloc(0) is
implementation-defined and therefore varies across implementations. As
a safety measure, we check for block_ptrs array lengths of 0 and, in
that case, increase them to 1.
- Co-authored-by: Minh Quan Ho <minh-quan.ho@kalray.eu>
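A minimal sketch of the safety measure described above, assuming the array holds plain void pointers. The function name and signature are hypothetical and are not the pool_t API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Allocate an array of *len block pointers. A requested length of 0 is
   bumped to 1 so that malloc() is never asked for zero bytes. The caller
   owns the returned memory and must free() it. */
static void **sketch_alloc_block_ptrs(size_t *len)
{
    assert(len != NULL);

    if (*len == 0)
        *len = 1;

    return malloc(*len * sizeof(void *));
}
```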
Details:
- The current mechanism for growing a pool_t doubles the length of the
block_ptrs array every time the array length needs to be increased
due to new blocks being added. However, that logic did not take into
account the new total number of blocks, and the fact that the caller
may be requesting more blocks than would fit even after doubling the
current length of block_ptrs. The code comments now contain two
illustrative examples that show why, even after doubling, we must
always have at least enough room to fit all of the old blocks plus
the newly requested blocks.
- This commit also happens to fix a memory corruption issue that stems
from growing any pool_t that is initialized with a block_ptrs length
of 0. (Previously, the memory pool for packed buffers of C was
initialized with a block_ptrs length of 0, but because it is unused
this bug did not manifest by default.)
- Co-authored-by: Minh Quan Ho <minh-quan.ho@kalray.eu>
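The growth rule described above can be sketched as a small pure function: double the current length, but never return less than the old block count plus the newly requested blocks. The name and signature are hypothetical; this is not the pool_t code itself.

```c
#include <assert.h>
#include <stddef.h>

/* Return the new block_ptrs length given the current length, the number
   of blocks currently stored, and the number of blocks being added. */
static size_t sketch_grown_len(size_t len_cur,
                               size_t num_blocks_cur,
                               size_t num_blocks_add)
{
    assert(num_blocks_cur <= len_cur);

    size_t num_blocks_new = num_blocks_cur + num_blocks_add;

    /* Nothing to do if the existing array already has room. */
    if (num_blocks_new <= len_cur)
        return len_cur;

    /* Double to amortize future growth... */
    size_t len_new = 2 * len_cur;

    /* ...but doubling alone may not be enough (for example when the
       current length is 0 or the request is large). */
    if (len_new < num_blocks_new)
        len_new = num_blocks_new;

    return len_new;
}
```

With a current length of 4 holding 4 blocks, adding 1 block gives 8, while adding 10 blocks gives 14 rather than the insufficient 8.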
This test will run on Linux, but all the kernels should run just fine. This does not test autodetection but then none of the other ARM tests do either.
Previously, this was a global variable. Setting the value was synchronized via a mutex but reading the value was not. Of course, these accesses are almost certainly atomic, but there is still the possibility of one thread attempting to set the value and then reading the value set by another thread. For correct operation under user threading (e.g. pthreads), this should probably be thread-local with no mutex.
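A thread-local replacement for such a global could be sketched with C11's _Thread_local storage class, as below. The variable and function names are hypothetical, since the entry does not name the variable in question.

```c
#include <assert.h>

/* Hypothetical per-thread setting: each thread gets its own copy, so no
   mutex is needed and no other thread can overwrite it between a set
   and a subsequent read. */
static _Thread_local int sketch_value = 0;

static void sketch_set(int v) { sketch_value = v; }
static int  sketch_get(void)  { return sketch_value; }

/* Set, then read back on the calling thread. */
static int sketch_set_and_get(int v)
{
    sketch_set(v);
    int seen = sketch_get();
    assert(seen == v);   /* holds because the storage is per-thread */
    return seen;
}
```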