Details:
- Fixed a harmless bug that would have allowed C++ headers into the list
of header suffixes specifically reserved for C99 headers. In practice,
this would have had no substantive effect on anything since the core
BLIS framework does not use C++ headers.
Details:
- Removed #defines for BLIS_BBN_s and BLIS_BBN_d from
bli_kernel_defs_power10.h. These were inadvertently set in ae10d949
because the power10 subconfig was registering bb packm ukernels, but
only for 6xk (power10 uses s8x16 and d8x8 ukernels) and only because
the original author (probably) copy-pasted from power9 when getting
started. That 6xk packm registration was effectively "dead code"
prior to ae10d949, but was then mistaken for live code during the
ae10d949 refactor. These improper bb factors may have been causing
bugs in power10 builds. Thanks to Nicholai Tukanov for helping remind
me what the power10 subconfig was supposed to look like.
- Removed extraneous microkernel preference registrations from power10
subconfig. Preferences for single and double complex gemm were being
registered despite there being no complex gemm ukernels registered to
go with them. Similarly, there were trsm preferences registered
without any trsm ukernels registered (and BLIS doesn't actually use a
preference for the trsm ukernel anyway). These extraneous
registrations were almost surely not hurting anything, even if they
were quite misleading.
Details:
- Fixed another out-of-bounds read access bug in the haswell sup
assembly kernels. This bug is similar to the one fixed in 17b0caa
and affects bli_sgemmsup_rv_haswell_asm_6x2m(). Thanks to Madeesh
Kannan for reporting this bug (and a suitable fix) in #635.
- CREDITS file update.
Details:
- Modified flatten-headers.py so that #line directives are inserted into
the flattened blis.h file. This facilitates easier debugging when
something is amiss in the flattened blis.h because the compiler will
be able to refer to the line number within the original constituent
header file (which is where the fix would go) rather than the line
number within the flattened header (which is not as helpful).
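For readers unfamiliar with #line, the following is a minimal C sketch of the mechanism, not the output of flatten-headers.py. The file names and line numbers are hypothetical; the only point is that the compiler reports whatever the most recent #line directive says.

```c
#include <assert.h>
#include <string.h>

/* Everything after this directive is reported as if it lived in the
   (hypothetical) file below, starting at line 120. */
#line 120 "frame/base/hypothetical_header.h"
int reported_line(void) { return __LINE__; }   /* reported as line 120 */

const char* reported_file(void) { return __FILE__; }

/* A flattening script would emit another directive when it returns to
   the including file; this name and number are also hypothetical. */
#line 57 "frame/include/hypothetical_blis.h"
int line_after_return(void) { return __LINE__; }   /* reported as line 57 */

/* Sanity check of the mapping described above. */
void check_line_mapping(void)
{
    assert(reported_line() == 120);
    assert(strcmp(reported_file(), "frame/base/hypothetical_header.h") == 0);
    assert(line_after_return() == 57);
}
```

Compiler diagnostics use the same remapped file name and line number, which is what makes an error in the flattened header point back at the constituent header.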
Details:
- Fixed memory access bugs in the bli_sgemmsup_rv_haswell_asm_Mx2()
kernels, where M = {1,2,3,4,5,6}. The bugs were caused by loading four
single-precision elements of C, via instructions such as:
vfmadd231ps(mem(rcx, 0*32), xmm3, xmm4)
in situations where only two elements are guaranteed to exist. (These
bugs may not have manifested in earlier tests due to the leading
dimension alignment that BLIS employs by default.) The issue was fixed
by replacing lines like the one above with:
vmovsd(mem(rcx), xmm0)
vfmadd231ps(xmm0, xmm3, xmm4)
Thus, we use vmovsd to explicitly load only two elements of C into
registers, and then operate on those values using register addressing.
Thanks to Daniël de Kok for reporting these bugs in #635, and to
Bhaskar Nallani for proposing the fix.
- CREDITS file update.
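The idea behind the fix can be illustrated with a portable scalar analogy in C. This is only a sketch: vec4f, load_two(), and fmadd231() are hypothetical stand-ins for an xmm register, the vmovsd load, and the register form of vfmadd231ps, respectively, and are not part of BLIS.

```c
#include <assert.h>
#include <string.h>

/* Scalar stand-in for a 128-bit register holding four floats. */
typedef struct { float v[4]; } vec4f;

/* Analogue of vmovsd(mem(rcx), xmm0): read only 8 bytes (two floats)
   from memory and zero the remaining lanes. */
vec4f load_two(const float* c)
{
    vec4f r = { { 0.0f, 0.0f, 0.0f, 0.0f } };
    assert(c != NULL);
    memcpy(r.v, c, 2 * sizeof(float));
    return r;
}

/* Analogue of vfmadd231ps(xmm0, xmm3, xmm4): acc += c * beta, lane by
   lane, using register operands only (no memory operand). */
vec4f fmadd231(vec4f c, vec4f beta, vec4f acc)
{
    for (int i = 0; i < 4; ++i)
        acc.v[i] += c.v[i] * beta.v[i];
    return acc;
}
```

Because the memory access happens only in load_two(), a row of C that holds exactly two elements is never read past its end, whereas a four-wide memory operand would touch two additional floats.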
Details:
- Tweaked test/3/runme.sh so that the test driver binaries for single-
threaded (st), single-socket (1s), and dual-socket (2s) execution can
be built using identical problem size ranges. Previously, this was not
possible because runme.sh used the maximum problem size, which was
embedded into the binary filename, to tell the three classes of
binaries apart from one another. Now, runme.sh uses the binary suffix
("st", "1s", or "2s") to tell them apart. This required only a few
changes to the logic, but it also required a change in format to the
threading config strings themselves (replacing the max problem size
with "st", "1s", or "2s"). Thanks to Jeff Diamond for inspiring this
improvement.
- Comment updates.
Details:
- Fixed a crash that occurs when either cblat1 or zblat1 is linked
with a build of BLIS that was compiled with '--complex-return=intel'.
This fix involved inserting preprocessor macro guards based on
BLIS_ENABLE_COMPLEX_RETURN_INTEL into blastest/src/cblat1.c and
blastest/src/zblat1.c to correctly handle situations where BLIS is
compiled with Intel/f2c-style calling conventions for complex numbers.
- Updated blastest/src/fortran/run-f2c.sh so that future executions
will insert the aforementioned cpp macro conditional where
appropriate.
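A rough C sketch of this kind of guard follows. The names scomplex_sketch, sketch_cdotu(), and call_cdotu() are hypothetical and do not reproduce the actual code in cblat1.c; the sketch only assumes that the Intel-style convention returns the complex result through an extra leading pointer argument, while the other convention returns it by value.

```c
#include <assert.h>

typedef struct { float real, imag; } scomplex_sketch;

/* Convention-independent reference: sum of x[i] * y[i] (no conjugation). */
static scomplex_sketch dotu_ref(int n, const scomplex_sketch* x,
                                const scomplex_sketch* y)
{
    scomplex_sketch r = { 0.0f, 0.0f };
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        r.real += x[i].real * y[i].real - x[i].imag * y[i].imag;
        r.imag += x[i].real * y[i].imag + x[i].imag * y[i].real;
    }
    return r;
}

#ifdef BLIS_ENABLE_COMPLEX_RETURN_INTEL
/* Intel-style: the result is written through a leading pointer argument. */
void sketch_cdotu(scomplex_sketch* ret, const int* n,
                  const scomplex_sketch* x, const scomplex_sketch* y)
{
    *ret = dotu_ref(*n, x, y);
}
#else
/* Default: the result is returned by value. */
scomplex_sketch sketch_cdotu(const int* n,
                             const scomplex_sketch* x, const scomplex_sketch* y)
{
    return dotu_ref(*n, x, y);
}
#endif

/* Caller side: the same guard selects the matching call form. */
scomplex_sketch call_cdotu(int n, const scomplex_sketch* x,
                           const scomplex_sketch* y)
{
    scomplex_sketch r;
#ifdef BLIS_ENABLE_COMPLEX_RETURN_INTEL
    sketch_cdotu(&r, &n, x, y);
#else
    r = sketch_cdotu(&n, x, y);
#endif
    return r;
}
```

If the caller and the library disagree about the convention, the caller passes the wrong number of arguments, which is consistent with the kind of crash described above.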
Details:
- When checking the version string of the Fortran compiler for the
purposes of determining a default return convention for complex
domain values, grep for "IFORT" instead of "ifort" since that string
is common to both the 'ifx' and 'ifort' binaries provided by Intel:
$ ifx --version
ifx (IFORT) 2022.1.0 20220316
Copyright (C) 1985-2022 Intel Corporation. All rights reserved.
$ ifort --version
ifort (IFORT) 2021.6.0 20220226
Copyright (C) 1985-2022 Intel Corporation. All rights reserved.
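The reasoning can be checked with a small C analogue of the case-sensitive match. configure itself is a shell script that uses grep; the helper name below is hypothetical and only mirrors the substring test.

```c
#include <assert.h>
#include <string.h>

/* Returns nonzero if the version banner contains the marker "IFORT".
   The match is case-sensitive, like a plain grep. */
int looks_like_intel_fortran(const char* version_output)
{
    assert(version_output != NULL);
    return strstr(version_output, "IFORT") != NULL;
}
```

Applied to the two banners shown above, both lines contain "IFORT", but only the ifort banner contains lowercase "ifort", which is why the old pattern missed ifx.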
Details:
- Macs create .DS_Store files in every directory visited. Updated
.gitignore file so these files won't be reported as untracked by
'git status'.
- Added Oracle Corporation to the LICENSE file.
- Updated UT copyright on behalf of SHPC.
Details:
- Removed a redundant registration of 'a64fx' subconfig in
bli_gks_init().
- Reordered registration of 'armsve', 'a64fx', and 'firestorm'
subconfigs. Thanks to Jeff Diamond for his input on this reordering.
- Comment updates to bli_gks.c and arch_t enum in bli_type_defs.h.
Details:
- Fixed a bug in bli_cntx_set_ukr_prefs() which erroneously typecast the
num_t value read from va_arg() down to a bool before being stored
within the cntx_t. This bug was introduced on April 6th 2022, in
ae10d94. This caused the ukernel preferences for double real and
double complex to go unchanged while the preferences for single real
and single complex were corrupted by the former datatypes'
preference values. The bug manifested as degraded performance for
subconfigurations that registered column-preferential ukernels. The
reason is that the erroneous preferences trigger unnecessary
transpositions in the operation, which forces the gemm ukernel to
compute on matrices that are not stored according to its preference.
Thanks to Devin Matthews, Jeff Diamond, and Leick Robinson for their
extensive efforts and assistance in tracking down this issue.
- Augmented the informational header that is output by the testsuite to
include ukernel preferences for gemm, gemmtrsm_[lu], and trsm_[lu].
- CREDITS file update.
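The class of bug described in the first item can be sketched in C as follows. All names and enum values here are hypothetical, and the sketch does not reproduce the code in bli_cntx_set_ukr_prefs(); in particular, which entries end up corrupted depends on details of the real code that this log does not spell out. The sketch only shows that narrowing a datatype id to bool collapses every nonzero id to 1.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>

/* Hypothetical datatype ids; the values are assumptions for illustration. */
enum { DT_S = 0, DT_C = 1, DT_D = 2, DT_Z = 3, DT_COUNT = 4 };

/* Buggy variant: the datatype id is narrowed to bool, so every nonzero
   id becomes 1 and several datatypes write to the same slot. */
void set_prefs_buggy(bool* prefs, int n_pairs, ...)
{
    va_list ap;
    va_start(ap, n_pairs);
    for (int i = 0; i < n_pairs; ++i) {
        bool dt   = (bool)va_arg(ap, int);   /* BUG: id collapses to 0 or 1 */
        bool pref = (bool)va_arg(ap, int);
        prefs[dt] = pref;
    }
    va_end(ap);
}

/* Fixed variant: the id keeps its full integer range. */
void set_prefs_fixed(bool* prefs, int n_pairs, ...)
{
    va_list ap;
    va_start(ap, n_pairs);
    for (int i = 0; i < n_pairs; ++i) {
        int  dt   = va_arg(ap, int);
        bool pref = (bool)va_arg(ap, int);
        assert(0 <= dt && dt < DT_COUNT);
        prefs[dt] = pref;
    }
    va_end(ap);
}
```

With these assumed ids, registering (DT_S, false), (DT_C, false), (DT_D, true), (DT_Z, true) through the buggy variant leaves the DT_D and DT_Z slots untouched and overwrites the DT_C slot with the double-precision values.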
Details:
- Removed the num_t datatype argument from bli_gks_query_ind_cntx().
This argument stopped being needed by the function in commit e9da642.
Its only use in bli_gks_query_ind_cntx() was to be passed through to
the context initialization function for the chosen induced method,
but even then, commit log notes from e9da642 indicate that I could not
recall why the datatype argument was ever needed by the context init
function to begin with.
- Updated all invocations of bli_gks_query_ind_cntx() to omit the dt
argument. Most of these invocations resided in various standalone test
drivers (and the testsuite).
Details:
- Added a citation to SMU and the Matthews Research Group to the general
attribution of maintainership and development in the Introduction of
the README.md file. Thanks to Robert van de Geijn and Devin Matthews
for suggesting this change.
Details:
- Defined and implemented a new pthread-like abstract datatype and API
in bli_pthread.c. The new type, bli_pthread_switch_t, is similar to
bli_pthread_once_t in some respects. The idea is that like a switch in
your home that controls a light or ceiling fan, it can either be on or
off. The switch starts in the off state. Moving from one state to the
other (on to off; off to on) causes some action (i.e., a startup or
shutdown function) to be executed. Trying to move from one state to
the same state (on to on; off to off) is safe in that it results in
no action. Unlike bli_pthread_once(), the API for bli_pthread_switch_t
contains both _on() and _off() interfaces. Also, unlike the _once()
function, the _on() and _off() functions return error codes so that
the 'int' error code returned from the startup or shutdown functions
may be passed back to the caller. Thanks to Devin Matthews for his
input and feedback on this feature.
- Replaced the previous implementation of bli_init_once() and
bli_finalize_once() -- both of which used bli_pthread_once() -- with
ones that rely upon bli_pthread_switch_on() and _switch_off(),
respectively. This also required updating the return types of
_init_apis() and _finalize_apis() to match the function pointer type
required by bli_pthread_switch_on()/_switch_off().
- Comment updates.
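The switch semantics described above can be sketched in C roughly as follows. This is a minimal illustration, not the code in bli_pthread.c: the type and function names are hypothetical, and leaving the state unchanged when the callback returns a nonzero error code is an assumption of this sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical switch type: a mutex plus an on/off flag. */
typedef struct {
    pthread_mutex_t mutex;
    bool            is_on;
} switch_sketch_t;

/* The switch starts in the off state. */
#define SWITCH_SKETCH_INIT { PTHREAD_MUTEX_INITIALIZER, false }

/* off -> on runs init(); on -> on does nothing and returns 0. */
int switch_sketch_on(switch_sketch_t* sw, int (*init)(void))
{
    int r = 0;
    pthread_mutex_lock(&sw->mutex);
    if (!sw->is_on) {
        r = init();
        if (r == 0) sw->is_on = true;   /* assumption: flip only on success */
    }
    pthread_mutex_unlock(&sw->mutex);
    return r;
}

/* on -> off runs deinit(); off -> off does nothing and returns 0. */
int switch_sketch_off(switch_sketch_t* sw, int (*deinit)(void))
{
    int r = 0;
    pthread_mutex_lock(&sw->mutex);
    if (sw->is_on) {
        r = deinit();
        if (r == 0) sw->is_on = false;  /* assumption: flip only on success */
    }
    pthread_mutex_unlock(&sw->mutex);
    return r;
}

/* Example callbacks that count how often they run. */
int demo_startups  = 0;
int demo_shutdowns = 0;
int demo_startup(void)  { ++demo_startups;  return 0; }
int demo_shutdown(void) { ++demo_shutdowns; return 0; }
```

Unlike a once-style primitive, a switch of this kind can be turned on again after it has been turned off, and the callback's error code reaches the caller.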
Details:
- Fixed a functionally harmless typo in bli_gemm_ker_var2.c where a few
instances of the substring "xpbys" were misspelled as "xbpys". The
misspellings were harmless because they were consistent, and because
they referenced only local symbols.
Details:
- Added missing 'const' qualifiers to signatures of functions defined in
kernels/zen/3/bli_gemm_small.c. This fixes compile-time errors when
targeting 'zen3' subconfig (which apparently is enabling AMD's
gemm_small code path by default). Thanks to Devin Matthews for
reporting this error.
Details:
- Added 'const' qualifier to applicable function arguments wherever the
pointed-to object is not internally modified. This change affects
all interfaces that reside above the level of the (micro)kernels.
- Typecast certain function return values to discard 'const' qualifier.
- Removed 'restrict' from various arguments, including cntx_t*,
auxinfo_t*, rntm_t*, thrinfo_t*, mem_t*, and others.
- Removed parts of some APIs, such as bli_cntx_*(), due to limited use.
- Merged some variable declarations with their corresponding
initialization statements.
- Whitespace changes.
Details:
- Reorganized the way kernels are stored within the cntx_t structure so
that rather than having a function pointer for every supported size of
unrolled packm kernel (2xk, 3xk, 4xk, etc.), we store only two packm
kernels per datatype: one to pack MRxk micropanels and one to pack
NRxk micropanels.
- NOTE: The "bb" (broadcast B) reference kernels have been merged into
the "standard" kernels (packm [including 1er and unpackm], gemm,
trsm, gemmtrsm). The replication factor is controlled by
BLIS_BB[MN]_[sdcz] etc. Power9/10 needs testing since only a
replication factor of 1 has been tested. armsve also needs testing
since the MR value isn't available as a macro.
- Simplified the bli_cntx_*() APIs to conform to the new unified kernel
array within the cntx_t. Updated existing bli_cntx_init_<subconfig>()
function definitions for all subconfigurations.
- Consolidated all kernel id types (e.g. l1vkr_t, l1mkr_t, l3ukr_t,
etc.) into one kernel id type: ukr_t.
- Various edits, updates, and rewrites of reference kernels pursuant to
the aforementioned changes.
- Define compile-time macro constants (BLIS_MR_[sdcz], BLIS_NR_[sdcz],
and friends) in bli_kernel_macro_defs.h, but only when the macro
BLIS_IN_REF_KERNEL is defined by the build system.
- Loose ends:
- Still need to update documentation, including:
- docs/ConfigurationHowTo.md
- docs/KernelsHowTo.md
to reflect changes made in this commit.
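As a rough picture of a unified kernel array, consider the C sketch below. Every name in it is hypothetical and the layout is an assumption; it is not the actual cntx_t definition. It only illustrates one id type indexing a single table, with exactly two packm entries (MRxk and NRxk) per datatype.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*void_fp)(void);

/* One hypothetical id type covering all kernels, including two packm ids. */
typedef enum { UKR_PACKM_MRXK, UKR_PACKM_NRXK, UKR_GEMM, UKR_COUNT } ukr_sketch_t;

/* Hypothetical datatype ids. */
typedef enum { SK_FLOAT, SK_SCOMPLEX, SK_DOUBLE, SK_DCOMPLEX, SK_COUNT } num_sketch_t;

/* A single table of function pointers indexed by kernel id and datatype. */
typedef struct { void_fp ukrs[UKR_COUNT][SK_COUNT]; } cntx_sketch_t;

void cntx_set_ukr(cntx_sketch_t* cntx, ukr_sketch_t id, num_sketch_t dt, void_fp fp)
{
    assert((int)id >= 0 && (int)id < UKR_COUNT);
    assert((int)dt >= 0 && (int)dt < SK_COUNT);
    cntx->ukrs[id][dt] = fp;
}

void_fp cntx_get_ukr(const cntx_sketch_t* cntx, ukr_sketch_t id, num_sketch_t dt)
{
    assert((int)id >= 0 && (int)id < UKR_COUNT);
    assert((int)dt >= 0 && (int)dt < SK_COUNT);
    return cntx->ukrs[id][dt];
}

/* Placeholder kernels used only to have distinct addresses to register. */
void demo_packm_mrxk_d(void) {}
void demo_packm_nrxk_d(void) {}
```

With a table like this, a subconfiguration registers one MRxk and one NRxk packm kernel per datatype instead of one entry for each unrolled panel width.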
Details:
- Fixed an unresolved symbol issue left over from #590 whereby ?gemm3m_()
as defined in bla_gemm3m.c was referencing bla_gemm3m_check(), which
does not exist. It should have simply called the _check() function for
gemm.
Details:
- Allow building BLIS with certain framework files (each with the '_amd'
suffix) that have been customized by AMD for Zen-based hardware. These
customized files were derived from portable versions of the same files
(i.e., those without the '_amd' suffix). Whether the portable or AMD-
specific files are compiled is now controlled by a new configure
option, --[en|dis]able-amd-frame-tweaks. This option is disabled by
default in vanilla BLIS, though AMD may choose to enable it by default
in their fork. For now, the added AMD-specific files are:
- bli_gemv_unf_var2_amd.c
- bla_copy_amd.c
- bla_gemv_amd.c
These files reside in 'amd' subdirectories found within the directory
housing their generic counterparts.
- Register optimized real-domain copyv, setv, and swapv kernels in
bli_cntx_init_zen.c.
- Various minor updates to level-1v kernels in 'zen' kernel set.
- Added caxpyf kernel as well as saxpyf and multiple daxpyf kernels to
the 'zen' kernel set.
- If the problem passed to ?gemm_() in bla_gemm.c has a unit m or n dim,
call gemv instead and return early.
- Combined variable declarations with their initialization in various
level-2 and level-3 BLAS compatibility files, and also inserted
'const' qualifiers in those same declaration statements.
- Moved frame/compat/bla_gemmt.c and .h to frame/compat/extra/ .
- Added copyv and swapv test drivers to 'test' directory.
- Whitespace, comment changes.
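The gemv early return mentioned above can be sketched in C as follows. This is an illustration under simplifying assumptions, not the code in bla_gemm.c: the function names are hypothetical, matrices are column-major with no transposition, and only the n = 1 case is shown (the m = 1 case would be handled analogously with a transposed gemv).

```c
#include <assert.h>

int gemv_calls = 0;   /* counts how often the gemv path is taken */

/* y := beta*y + alpha*A*x, with A an m x k column-major matrix. */
void gemv_sketch(int m, int k, double alpha, const double* a, int lda,
                 const double* x, double beta, double* y)
{
    ++gemv_calls;
    for (int i = 0; i < m; ++i) {
        double dot = 0.0;
        for (int p = 0; p < k; ++p) dot += a[i + p * lda] * x[p];
        y[i] = beta * y[i] + alpha * dot;
    }
}

/* C := beta*C + alpha*A*B, with A m x k, B k x n, C m x n (column-major). */
void gemm_sketch(int m, int n, int k, double alpha,
                 const double* a, int lda, const double* b, int ldb,
                 double beta, double* c, int ldc)
{
    assert(m >= 0 && n >= 0 && k >= 0);

    /* With a single column, B and C are vectors: call gemv and return. */
    if (n == 1) {
        gemv_sketch(m, k, alpha, a, lda, b, beta, c);
        return;
    }

    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            double dot = 0.0;
            for (int p = 0; p < k; ++p) dot += a[i + p * lda] * b[p + j * ldb];
            c[i + j * ldc] = beta * c[i + j * ldc] + alpha * dot;
        }
}
```

The early return gives the same result as the general loop for n = 1, but hands the problem to a routine specialized for matrix-vector products.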
Details:
- Created ?gemm3m_() and cblas_?gemm3m() APIs that (for now) simply
invoke the 1m implementation unconditionally. (Note that these APIs
bypass sup handling.)
- Added BLAS prototypes for gemm3m in frame/compat/bla_gemm3m.h.
- Added CBLAS prototypes for gemm3m in frame/compat/cblas/src/cblas.h.
- Relocated:
frame/compat/cblas/src/cblas_?gemmt.c
files into
frame/compat/cblas/src/extra/
- Relocated frame/compat/bla_gemmt.? into frame/compat/extra/ .
- Minor reorganization of prototypes and cpp macro directives in
bli_blas.h, cblas.h, and cblas_f77.h.
- Trivial whitespace change to cblas_zgemm.c.
Details:
- Implemented a multithreaded optimization for the special (and common)
case of employing the gemmsup code path when the user requests
(implicitly or explicitly) that neither A nor B be packed during
computation. This optimization takes the form of a greatly reduced
code branch in bli_thrinfo_sup_create_for_cntl(), which avoids a
broadcast and two barriers, and results in higher performance when
obtaining two-way or higher parallelism within BLIS. Thanks to
Bhaskar Nallani of AMD for proposing this change via issue #605.
- Added an early return branch to bli_thrinfo_create_for_cntl() that
detects and quickly handles cases where no parallelism is being
obtained within BLIS (i.e., single-threaded execution). Note that
this special case handling was/is already present in
bli_thrinfo_sup_create_for_cntl().
- CREDITS file update.
Details:
- Fixed a performance regression affecting nearly all level-3 operations
that use the 'haswell' sgemm and dgemm microkernels. This regression
was introduced in 54fa28b, caused by an ill-formed conditional
expression in the assembly code that controls whether cache lines of C
should be prefetched as rows or as columns. Essentially, the two
branches were reversed, causing incomplete prefetching to occur for
both row- and column-stored instances of matrix C. Thanks to Devin
Matthews for his help finding and fixing this bug.
Details:
- Consolidate handling of tools that are specifiable via CC, CXX, FC,
PYTHON, AR, and RANLIB into one bash function, select_tool_w_env().
- If the user specifies a tool via an environment variable (e.g.
CC=gcc) and that tool does not seem valid, print an error message
and abort configure, unless the tool is optional (e.g. CXX or FC),
in which case a warning message is printed instead.
- The definition of "seems valid" above amounts to:
- responding to at least one of a basic set of command line options
(e.g. --version, -V, -h) if the os_name is Linux (since GNU tools
tend to respond to flags such as --version) or if the tool in
question is CC, CXX, FC, or PYTHON (which tend to respond to the
expected flags regardless of OS)
- the binary merely existing for AR and RANLIB on Darwin/OSX/BSD.
(These OSes tend to have non-GNU versions of ar and ranlib, which
typically do not respond to --version and friends.)
- This PR addresses #584. Thanks to Devin Matthews for suggesting some
of the changes in this commit.
Fixes #613. Several macros and environment variables must be tuned to
get good cache block sizes. It would be nice to have a way of obtaining
these values automatically.
Details:
- Renamed the following macros defined in bli_kernel_macro_defs.h:
BLIS_SIMD_NUM_REGISTERS -> BLIS_SIMD_MAX_NUM_REGISTERS
BLIS_SIMD_SIZE -> BLIS_SIMD_MAX_SIZE
Also updated all instances of these macros elsewhere, including
subconfigurations, source code, and documentation. Thanks to Devin
Matthews for suggesting this change.
Details:
- Moved edge-case handling into the gemmtrsm microkernel. This required
changing the microkernel API to take m and n dimension parameters as
well as updating all existing gemmtrsm microkernel function pointer
types, function signatures, and related definitions to take m and n
dimensions. Also updated all existing gemmtrsm kernels in the
'kernels' directory (which for now is limited to haswell and penryn
kernel sets, plus native and 1m-based reference kernels in
'ref_kernels') to take m and n dimensions, and implemented edge-case
handling within those microkernels via a collection of new C
preprocessor macros defined within bli_edge_case_macro_defs.h. Note
that the edge-case handling for gemm-like operations had already
been relocated into the gemm microkernel in 54fa28b.
- Added descriptive comments to GEMM_UKR_SETUP_CT() and related macros in
bli_edge_case_macro_defs.h to allow for easier reading.
- Updated docs/KernelsHowTo.md to reflect above changes. Also cleaned up
the bullet under "Implementation Notes for gemm" that covers alignment
issues. (Thanks to Ivan Korostelev for pointing out the confusing and
outdated language in issue #591.)
- Other minor tweaks to KernelsHowTo.md.
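To illustrate what edge-case handling inside a microkernel means, here is a simplified C sketch for a plain gemm-like update. It does not use the macros in bli_edge_case_macro_defs.h; the name gemm_ukr_sketch, the 4x4 register blocksizes, the packing layout, and the choice to always go through a temporary tile are assumptions made for brevity.

```c
#include <assert.h>

enum { MR = 4, NR = 4 };   /* hypothetical register blocksizes */

/* Computes C := beta*C + A*B on an m x n corner of an MR x NR tile.
   a: packed MR x k micropanel, element (i,p) stored at a[p*MR + i]
   b: packed k x NR micropanel, element (p,j) stored at b[p*NR + j]
   c: output with row stride rs_c and column stride cs_c */
void gemm_ukr_sketch(int m, int n, int k,
                     const double* a, const double* b,
                     double beta, double* c, int rs_c, int cs_c)
{
    assert(0 <= m && m <= MR && 0 <= n && n <= NR);

    /* Always compute the full MR x NR product into a local tile. */
    double ct[MR * NR] = { 0.0 };
    for (int p = 0; p < k; ++p)
        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j)
                ct[i * NR + j] += a[p * MR + i] * b[p * NR + j];

    /* Write back only the m x n part that actually exists in C. */
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            c[i * rs_c + j * cs_c] = beta * c[i * rs_c + j * cs_c]
                                   + ct[i * NR + j];
}
```

Because the kernel receives m and n, the caller no longer needs its own temporary buffer for partial tiles; elements of C outside the m x n corner are never touched.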