Details:
- Reduced a code segment that appears in all of the bli_*_front()
functions except for bli_gemm_front(). Previously, the code looked
like this (taken from bli_herk_front()):
    if ( bli_cntx_method( cntx ) == BLIS_NAT )
    {
        bli_obj_set_pack_schema( BLIS_PACKED_ROW_PANELS, &a_local );
        bli_obj_set_pack_schema( BLIS_PACKED_COL_PANELS, &ah_local );
    }
    else // if ( bli_cntx_method( cntx ) != BLIS_NAT )
    {
        pack_t schema_a = bli_cntx_schema_a_block( cntx );
        pack_t schema_b = bli_cntx_schema_b_panel( cntx );
        bli_obj_set_pack_schema( schema_a, &a_local );
        bli_obj_set_pack_schema( schema_b, &ah_local );
    }
This code segment is part of a sort-of-hack that allows us to
communicate the pack schemas into the level-3 thread decorator, which
needs them so that they can be passed into bli_l3_cntl_create_if(),
where the control tree is created. However, the first conditional case
above is unnecessary because the second case is fully generalized.
That is, even in the native case, the context contains correct,
queryable schemas. Thus, these code segments were reduced to something
like:
    pack_t schema_a = bli_cntx_schema_a_block( cntx );
    pack_t schema_b = bli_cntx_schema_b_panel( cntx );
    bli_obj_set_pack_schema( schema_a, &a_local );
    bli_obj_set_pack_schema( schema_b, &ah_local );
There's always a small chance that the seemingly unnecessary code
in the first branch case has some special use that is not apparent to
me, but the testsuite's default input parameters seem to think this
commit will be fine.
Details:
- Implemented a configure-time option, --disable-trsm-preinversion, that
optionally disables the pre-inversion of diagonal elements of the
triangular matrix in the trsm operation and instead uses division
instructions within the gemmtrsm microkernels. Pre-inversion is
enabled by default. When it is disabled, performance may suffer
slightly, but numerical robustness should improve for certain
pathological cases involving denormal (subnormal) numbers that would
otherwise result in overflow in the pre-inverted value. Thanks to
Bhaskar Nallani for reporting this issue via #461.
- Added preprocessor macro guards to bli_trsm_cntl.c as well as the
gemmtrsm microkernels for 'haswell' and 'penryn' kernel sets pursuant
to the aforementioned feature.
- Added macros to frame/include/bli_x86_asm_macros.h related to division
instructions.
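The numerical trade-off described above can be illustrated with a scalar sketch. The two function names below are hypothetical and are not part of BLIS; they only contrast multiplying by a pre-inverted diagonal element with dividing by the element directly, assuming IEEE double arithmetic without flush-to-zero.

```c
/* Hypothetical scalar stand-ins for the final step of a triangular solve,
   x = rhs / diag. These are NOT the BLIS gemmtrsm microkernels. */

/* Pre-inversion: the reciprocal of the diagonal element is computed once
   (in a real solver, at packing time) and the solve multiplies by it. */
double solve_with_preinversion(double diag, double rhs)
{
    const double inv_diag = 1.0 / diag; /* overflows to +inf if diag is tiny */
    return rhs * inv_diag;
}

/* No pre-inversion: the solve divides by the diagonal element directly. */
double solve_with_division(double diag, double rhs)
{
    return rhs / diag;
}
```

With a subnormal diagonal such as 1.0e-310, the reciprocal is not representable in double precision and becomes infinity, so the pre-inverted path returns infinity even when the true quotient is finite. The division path avoids forming that intermediate value.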
Details:
- Reorganized logic of bli_thread_partition_2x2() so that the primary
guts were factored out into "fast" and "slow" variants. Then added
logic to the "fast" variant that allows for more optimal thread
factorizations in some situations where there is at least one factor
of 2.
- Changed BLIS_THREAD_RATIO_M from 2 to 1 in bli_kernel_macro_defs.h and
added comments to that file describing BLIS_THREAD_RATIO_? and
BLIS_THREAD_MAX_?R.
- In bli_family_zen.h and bli_family_zen2.h, preprocessed out several
macros not used in vanilla BLIS and removed the unused macro
BLIS_ENABLE_ZEN_BLOCK_SIZES from the former file.
- Disabled AMD's small matrix handling entry points in bli_syrk_front.c
and bli_trsm_front.c. (These branches of small matrix handling have
not been reviewed by vanilla BLIS developers.)
- Added commented-out calls to printf() to bli_rntm.c.
- Whitespace changes to bli_thread.c.
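The text above does not spell out how the "fast" or "slow" variants work, so the following is only a brute-force illustration of the underlying problem: splitting a thread count into two factors so that the work per thread is roughly balanced in both dimensions. The function name, its parameters, and the cost function are assumptions made for this sketch, not the logic of bli_thread_partition_2x2().

```c
#include <stdlib.h>

/* Hypothetical helper, not a BLIS function. Chooses pm * pn == nt so that
   work_m / pm is as close as possible to work_n / pn. The comparison is
   done on the cross-multiplied form to stay in integer arithmetic.
   Assumes nt >= 1. Ties are resolved in favor of the smaller pm. */
void partition_2x2_sketch(long nt, long work_m, long work_n,
                          long *pm_out, long *pn_out)
{
    long best_pm = 1;
    long best_pn = nt;
    long long best_cost = -1; /* -1 means "no candidate seen yet" */

    for (long pm = 1; pm <= nt; ++pm) {
        if (nt % pm != 0)
            continue; /* pm must divide nt exactly */

        const long pn = nt / pm;
        const long long cost =
            llabs((long long)work_m * pn - (long long)work_n * pm);

        if (best_cost < 0 || cost < best_cost) {
            best_cost = cost;
            best_pm = pm;
            best_pn = pn;
        }
    }

    *pm_out = best_pm;
    *pn_out = best_pn;
}
```

A prime thread count admits only the 1 x nt and nt x 1 factorizations, which is one reason thread counts with small factors such as 2 leave more room for a balanced split.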
Details:
- When requesting multithreaded parallelism by specifying the total
number of threads (whether it be via environment variable, globally at
runtime, or locally at runtime), reduce the number of threads actually
used by one if the original value (a) is prime and (b) exceeds a
minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which is set
to 11 by default. If, when specifying the total number of threads (and
not the individual ways of parallelism for each loop), prime numbers
of threads are desired, this feature may be overridden by defining the
BLIS_ENABLE_AUTO_PRIME_NUM_THREADS macro in the bli_family_*.h that
corresponds to the configuration family targeted at configure-time.
(For now, there is no configure option to control this feature.)
Thanks to Jeff Diamond for suggesting this change.
- Defined a new function in bli_thread.c, bli_is_prime(), that returns a
bool indicating whether an integer is prime. This function is
implemented in terms of existing functions in bli_thread.c.
- Updated docs/Multithreading.md to document the above feature, along
with unrelated minor edits.
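The rule described above can be sketched as follows. The names carrying a "sketch" prefix or suffix are hypothetical and this is not the code in bli_thread.c. The sketch reads "exceeds" as strictly greater than the threshold and uses plain trial division, whereas the text says bli_is_prime() is built on existing helper functions.

```c
#include <stdbool.h>

/* Mirrors the default value described for BLIS_NT_MAX_PRIME. */
#define SKETCH_NT_MAX_PRIME 11

/* Trial-division primality test (hypothetical, for illustration only). */
bool is_prime_sketch(long n)
{
    if (n < 2)
        return false;
    for (long d = 2; d * d <= n; ++d) {
        if (n % d == 0)
            return false;
    }
    return true;
}

/* Returns the number of threads that would actually be used when the
   caller requests nt total threads: one fewer if nt is a prime larger
   than the threshold, otherwise nt unchanged. */
long adjust_num_threads_sketch(long nt)
{
    if (nt > SKETCH_NT_MAX_PRIME && is_prime_sketch(nt))
        return nt - 1;
    return nt;
}
```

Under these assumptions, a request for 13 threads would run with 12, while requests for 7, 11, or 16 threads would be left unchanged.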
Details:
- Added a configure option, --[enable|disable]-system, which determines
whether the modest operating system dependencies in BLIS are included.
The most notable example of this on Linux and BSD/OSX is the use of
POSIX threads to ensure thread safety for when application-level
threads call BLIS. When --disable-system is given, the bli_pthreads
implementation is dummied out entirely, allowing the calling code
within BLIS to remain unchanged. Why would anyone want to build BLIS
like this? The motivating example was submitted via #454 in which a
user wanted to build BLIS for a simulator such as gem5 where thread
safety may not be a concern (and where the operating system is largely
absent anyway). Thanks to Stepan Nassyr for suggesting this feature.
- Another, more minor side effect of the --disable-system option is that
the implementation of bli_clock() unconditionally returns 0.0 instead
of the time elapsed since some fixed point in the past. The reasoning
for this is that if the operating system is truly minimal, the system
function call upon which bli_clock() would normally be implemented
(e.g. clock_gettime()) may not be available.
- Refactored preprocessor-guarded code in bli_pthread.c and bli_pthread.h
to remove redundancies.
- Removed old comments and commented #include of "bli_pthread_wrap.h"
from bli_system.h.
- Documented bli_clock() and bli_clock_min_diff() in BLISObjectAPI.md
and BLISTypedAPI.md, with a note that both are non-functional when
BLIS is configured with --disable-system.
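The bli_clock() behavior described above can be pictured with a small sketch. The macro SKETCH_DISABLE_SYSTEM and the function clock_sketch() are hypothetical stand-ins (the actual macro name used by BLIS is not given here), and the choice of CLOCK_MONOTONIC is an assumption.

```c
#ifndef SKETCH_DISABLE_SYSTEM
/* Request the POSIX declarations needed for clock_gettime(). */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <time.h>
#endif

/* Hypothetical stand-in for a timer routine that degrades gracefully
   when operating system services are assumed to be unavailable. */
double clock_sketch(void)
{
#ifdef SKETCH_DISABLE_SYSTEM
    /* No system calls available: report a constant. */
    return 0.0;
#else
    /* Seconds elapsed since some fixed point in the past. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
#endif
}
```

Callers keep the same signature in both builds, which matches the stated goal of leaving the calling code unchanged. Any timing differences computed from the disabled variant are simply zero.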
Merged contributions from AMD's AOCL BLIS (#448).
Details:
- Added support for level-3 operation gemmt, which performs a gemm on
only the lower or upper triangle of a square matrix C. For now, only
the conventional/large code path will be supported (in vanilla BLIS).
This was accomplished by leveraging the existing variant logic for
herk. However, some of the infrastructure to support a gemmtsup is
included in this commit, including
- A bli_gemmtsup() front-end, similar to bli_gemmsup().
- A bli_gemmtsup_ref() reference handler function.
- A bli_gemmtsup_int() variant chooser function (with variant calls
commented out).
- Added support for inducing complex domain gemmt via the 1m method.
- Added gemmt APIs to the BLAS and CBLAS compatibility layers.
- Added gemmt test module to testsuite.
- Added standalone gemmt test driver to 'test' directory.
- Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md.
- Added a C++ template header (blis.hh) containing a BLAS-inspired
wrapper to a set of polymorphic CBLAS-like function wrappers defined
in another header (cblas.hh). These two headers are installed when the
'install' target is run with INSTALL_HH set to 'yes'. (Also
added a set of unit tests that exercise blis.hh, although they are
disabled for now because they aren't compatible with out-of-tree
builds.) These files now live in the 'vendor' top-level directory.
- Various updates to 'zen' and 'zen2' subconfigurations, particularly
within the context initialization functions.
- Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and
various minor updates to dotv and scalv kernels. Also added various
sup kernels contributed by AMD to kernels/zen/3. However, these
kernels are (for now) not yet used, in part because they caused
AppVeyor clang failures, and also because I have not found time to
review and vet them.
- Output the python found during configure into the definition of PYTHON
in build/config.mk (via build/config.mk.in).
- Added early-return checks (A, B, or C with zero dimension; alpha = 0)
to bli_gemm_front.c.
- Implemented explicit beta = 0 handling for the sgemm ukernel in
bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent
bug surfaced because the gemmt module verifies its computation using
gemm with its beta parameter set to zero, which, on a cortexa15 system,
caused the gemm kernel code to unconditionally multiply the
uninitialized C data by beta. The C matrix likely contained
non-numeric values such as NaN, which then would have resulted in a
false failure.
- Fixed a bug whereby the implementation for bli_herk_determine_kc(),
in bli_l3_blocksize.c, was inadvertently being defined in terms of
helper functions meant for trmm. This bug was probably harmless since
the trmm code should have also done the right thing for herk.
- Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in
kernels/zen/3/bli_gemm_small.c since those macros are not used in
vanilla BLIS.
- Added cpp guard to definition of bli_mem_clear() in bli_mem.h to
accommodate C++'s stricter type checking.
- Added a cpp guard to test/*.c drivers that facilitates compilation on
Windows systems.
- Various whitespace changes.
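Two of the items above, the gemmt operation and the explicit beta = 0 handling, can be illustrated together with a naive reference loop. The function below is a hypothetical sketch with a simplified row-major interface. It is neither the BLIS gemmt implementation nor the armv7a microkernel.

```c
#include <stddef.h>

/* Hypothetical reference routine: C := beta * C + alpha * A * B, updating
   only the lower triangle (including the diagonal) of the n x n matrix C.
   A is n x k, B is k x n, and all matrices are row-major and contiguous. */
void gemmt_lower_sketch(size_t n, size_t k,
                        double alpha, const double *a, const double *b,
                        double beta, double *c)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j <= i; ++j) { /* j <= i: lower triangle only */
            double ab = 0.0;
            for (size_t p = 0; p < k; ++p)
                ab += a[i * k + p] * b[p * n + j];

            if (beta == 0.0) {
                /* Overwrite: C may hold NaN or Inf, and 0 * NaN is NaN,
                   so the old value must not be read at all. */
                c[i * n + j] = alpha * ab;
            } else {
                c[i * n + j] = beta * c[i * n + j] + alpha * ab;
            }
        }
    }
}
```

If the beta == 0.0 branch were removed, any NaN already present in the lower triangle of C would propagate into the result, which is the kind of false failure described above. Elements strictly above the diagonal are never read or written.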
Details:
- Increased the test thresholds used by the dotxaxpyf testsuite module
by a factor of five in order to avoid residuals that unnecessarily
fall in the MARGINAL range. This commit should fix #455. Thanks to
@nagsingh for reporting this issue.
Details:
- Moved the "Operation index" section of both the BLISObjectAPI.md and
BLISTypedAPI.md docs to appear immediately after the table of contents
of each document. This allows the reader to quickly jump to the
documentation for any operation without having to scroll through much
of the document (when rendered via a web browser).
- Fixed a mistake in the BLISObjectAPI.md for the setd operation, which
does *not* observe the diag property of its matrix argument. Thanks to
Jeff Diamond for reporting this.
Details:
- Implemented support for the user manually overriding the automatic
subconfiguration selection that happens at runtime. This override
can be requested by setting the BLIS_ARCH_TYPE environment variable.
The variable must be set to the arch_t id (as enumerated in
bli_type_defs.h) corresponding to the desired subconfiguration. If a
value outside this enumerated range is given, BLIS will abort with an
error message. If the value is in the valid range but corresponds to a
subconfiguration that was not activated at configure-time/compile-time,
BLIS will abort with a (different) error message. Thanks to decandia50
for suggesting this feature via issue #451.
- Defined a new function bli_gks_lookup_id() to return the address of an
internal data structure within the gks. If this address is NULL, then
it indicates that the subconfig corresponding to the arch_t id passed
into the function was not compiled into BLIS. This function is used
in the second of the two abort scenarios described above.
- Defined the enumerated error code BLIS_UNINITIALIZED_GKS_CNTX, which
is returned for the latter of the two abort scenarios mentioned above,
along with a corresponding error message and a function to perform
the error check.
- Added cpp macro branching to bli_env.c to support compilation of the
auto-detect.x executable during configure-time. This cpp branch is
similar to the cpp code already found in bli_arch.c and bli_cpuid.c.
- Cleaned up the auto_detect() function to facilitate easier maintenance
going forward. Also added a convenient debug switch that outputs the
compilation command for the auto-detect.x executable and exits.
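The override logic described above distinguishes three outcomes: no override, an id outside the enumerated range, and an in-range id whose subconfiguration was not compiled in. A minimal sketch follows. Only the BLIS_ARCH_TYPE variable name comes from the text. The table of ids, the return codes, and both function names are hypothetical, and treating non-numeric input as out of range is an assumption. The sketch returns error codes where BLIS is described as aborting.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the arch_t enumeration and for the set of
   subconfigurations activated at configure time. */
enum { SKETCH_NUM_ARCH_IDS = 4 };
static const bool sketch_arch_compiled_in[SKETCH_NUM_ARCH_IDS] = {
    true, false, true, false
};

/* Hypothetical result codes; BLIS itself aborts with an error message. */
enum {
    SKETCH_ARCH_AUTO           = -1, /* variable unset: auto-detect     */
    SKETCH_ERR_OUT_OF_RANGE    = -2, /* id outside the enumerated range */
    SKETCH_ERR_NOT_COMPILED_IN = -3  /* valid id, subconfig not built   */
};

/* Interprets the value of the environment variable (NULL if unset). */
int parse_arch_override_sketch(const char *value)
{
    if (value == NULL)
        return SKETCH_ARCH_AUTO;

    char *end = NULL;
    const long id = strtol(value, &end, 10);

    if (end == value || *end != '\0' || id < 0 || id >= SKETCH_NUM_ARCH_IDS)
        return SKETCH_ERR_OUT_OF_RANGE;

    if (!sketch_arch_compiled_in[id])
        return SKETCH_ERR_NOT_COMPILED_IN;

    return (int)id;
}

/* Reads the override from the environment. */
int arch_override_from_env_sketch(void)
{
    return parse_arch_override_sketch(getenv("BLIS_ARCH_TYPE"));
}
```

Separating the parsing from the getenv() call keeps the validation testable without touching the process environment.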
Details:
- Added single-threaded and multithreaded sup performance results to
docs/PerformanceSmall.md for both sgemm and dgemm. These results were
gathered on an Epyc 7742 "Rome" server featuring AMD's Zen2
microarchitecture. Special thanks to Jeff Diamond for facilitating
access to the system via the Oracle Cloud.
- Updates to octave scripts in test/sup/octave for use with Octave 5.2
and for use with subplot_tight().
- Minor updates to octave scripts in test/3/octave.
- Renamed files containing the previous Zen performance results for
consistency with the new results.
- Decreased line thickness slightly in large/conventional Zen2 graphs.
I'm done tweaking those this time. Really.
- Added missing line regarding eigen header installation for each
microarchitecture section.
Details:
- Registered full suite of sgemm and dgemm sup millikernels, blocksizes,
and crossover thresholds in bli_cntx_init_zen2.c.
- Minor updates to test/sup/runme.sh for running on Zen2 Epyc 7742
system.
Details:
- Added a frequently asked question to docs/FAQ.md regarding the
difference between upstream (vanilla) BLIS and AMD BLIS.
- Updated the name of ICES in the README.md to reflect the Oden
rebranding.
Details:
- Added single-threaded and multithreaded performance results to
docs/Performance.md. These results were gathered on an Epyc 7742
"Rome" server with AMD's Zen2 microarchitecture. Special thanks
to Jeff Diamond for facilitating access to the system via the
Oracle Cloud.
- Renamed files containing the previous Zen performance results for
consistency with the new results.
Details:
- Renamed test/3/matlab to test/3/octave.
- Within test/3, updated and tuned plot_l3_perf.m and plot_panel_4x5.m
files for use with octave (which is free and doesn't crash on me
mid-way through my use of subplot).
- Updated runthese.m scratchpad for zen2 invocations.
- Added Nikolay S.'s subplot_tight() function, along with its license.
Details:
- Created a set of single-precision real millikernels and microkernels
comparable to the dgemmsup kernels that already exist within BLIS.
- Added prototypes for all kernels within bli_kernels_haswell.h.
- Registered entry-point millikernels in bli_cntx_init_haswell.c and
bli_cntx_init_zen.c.
- Added sgemmsup support to the Makefile, runme.sh script, and source
file in test/sup. This included edits that allow for separate "small"
dimensions for single- and double-precision as well as for single-
vs. multithreaded execution.
Details:
- Changed double* pointers in sgemm function signature to float*. At
this point I've lost track of whether this was my fault or another
dormant bug like the one described in ece9f6a, but at this point I
no longer care. It's one of those days (aka I didn't ask for this).
Details:
- Fixed dormant type mismatches in the use of the prototype-generating
macros in bli_kernels_knl.h. Specifically, some float prototypes
were incorrectly using double as their ctype. This didn't actually
matter until the type changes in 645d771, as previously those types
were not used since packm was prototyped with void* pointers.
Details:
- In trying to clean up kappa_cast variables in the reference packm
kernels, which I initially believed to be redundant given the other
void* -> ctype* changes in 645d771, I accidentally ended up violating
restrict semantics for 1e/1r packing and possibly other packm kernels.
(Normally, my pre-commit testsuite run would have caught this, but I
was unknowingly using an edited input.operations file in which I'd
disabled most tests as part of unrelated work.) This commit reverts
the kappa_cast changes in 645d771.
Details:
- Changed all void* function arguments in reference packm kernels to
those of the native type (ctype*). These pointers no longer need to
be void* and are better represented by their native types anyway.
(See below for details.) Updated knl packm kernels accordingly.
- In the definition of the PACKM_KER_PROT prototype macro template in
frame/1m/bli_l1m_ker_prot.h, changed the pointer types for kappa, a,
and p from void* to ctype*. They were originally void* because these
function signatures had to share the same type so they could all be
stored in a single array of that shared type, from which they were
queried and called by packm_cxk(). This is no longer how the function
pointers are stored, and so it no longer makes sense to force the
caller of packm kernels to use void*, only so that the implementor
of the packm kernels can typecast back to the native datatype within
the kernel definition. This change has no effect internally within
BLIS because currently all packm kernels are called after querying
the function addresses from the context and then typecasting to the
appropriate function pointer type, which is based upon type-specific
function pointers like float* and double*.
- Removed a comment in frame/1m/bli_l1m_ft_ker.h that was outdated and
misleading due to changes to the handling of packm kernels since
moving them into the context.
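The pattern described above, in which a kernel address is stored generically, queried, and then cast to a type-specific function pointer before the call, can be sketched as follows. All names are hypothetical, and the kernel signature is heavily simplified relative to a real packm kernel.

```c
#include <stddef.h>

/* Generic function-pointer type used only for storage (hypothetical). */
typedef void (*generic_fp_sketch_t)(void);

/* Type-specific kernel signature: arguments use the native type
   (double*) rather than void*. */
typedef void (*dpackm_ker_sketch_ft)(size_t n, const double *kappa,
                                     const double *a, double *p);

/* A toy "packing" kernel: p[i] := kappa * a[i]. No casts from void* are
   needed inside the kernel body. */
void dpackm_ker_sketch(size_t n, const double *kappa,
                       const double *a, double *p)
{
    for (size_t i = 0; i < n; ++i)
        p[i] = *kappa * a[i];
}

/* Stands in for querying the kernel address from a context. */
generic_fp_sketch_t query_packm_ker_sketch(void)
{
    return (generic_fp_sketch_t)dpackm_ker_sketch;
}

/* The caller casts the queried address back to the typed signature
   before calling, so the kernel prototype can use native pointers. */
void pack_sketch(size_t n, double kappa, const double *a, double *p)
{
    dpackm_ker_sketch_ft ker = (dpackm_ker_sketch_ft)query_packm_ker_sketch();
    ker(n, &kappa, a, p);
}
```

Because the call always goes through a pointer of the kernel's true type, the typed prototype changes nothing for the caller while letting the compiler check argument types at the definition.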
It seems that -O3 might be causing intermittent problems with the f2c'ed packed and banded code. -O3 is retained for kernel code. Fixes #341 and fixes #342.
Details:
- Steer the reader towards the example code section of each API
document (object and typed).
- Trivial update to examples/oapi/README, examples/tapi/README.
Details:
- Added documentation for commonly-used object mutator functions in
BLISObjectAPI.md. Previously, only accessor functions were documented.
Thanks to Jeff Diamond for pointing out this omission.
- Explicitly set the 'diag' property of objects in oapi example modules
(08level2.c and 09level3.c).
ifort apparently does not return complex numbers in registers as in C/C++ (or gfortran), but instead creates a "hidden" first parameter for the return value. The option --complex-return=gnu|intel has been added. When the option is not specified, configure guesses based on FC if one was provided, and otherwise defaults to gnu. This option affects the signatures of cdotc, cdotu, zdotc, and zdotu, and a single library cannot be used with both GNU and Intel Fortran compilers. Fixes #433.
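The two calling conventions can be contrasted from the C side with a sketch of zdotc. Both function names are hypothetical, the argument lists only approximate a Fortran-callable interface, and the loop assumes positive increments.

```c
#include <complex.h>

/* "gnu" convention: the complex result is the function's return value. */
double complex zdotc_gnu_sketch(const int *n,
                                const double complex *x, const int *incx,
                                const double complex *y, const int *incy)
{
    double complex rho = 0.0;
    for (int i = 0; i < *n; ++i)
        rho += conj(x[i * *incx]) * y[i * *incy]; /* conjugate x, as in dotc */
    return rho;
}

/* "intel" convention: the result is written through a hidden first
   parameter, and the function itself returns nothing. */
void zdotc_intel_sketch(double complex *rho, const int *n,
                        const double complex *x, const int *incx,
                        const double complex *y, const int *incy)
{
    *rho = zdotc_gnu_sketch(n, x, incx, y, incy);
}
```

The two functions compute the same value but have incompatible signatures, so a caller compiled for one convention cannot safely call a routine built for the other. That is why a single library cannot serve both compilers.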
Details:
- Added language to remind the reader to disable sup if the intended
behavior is for the sandbox implementation to handle all problem
sizes, even the smaller ones that would normally be handled by the
sup code path.