Details:
- Updated the '-march=corei7-avx' flag in the sandybridge subconfig
to '-march=sandybridge' and the '-march=core-avx2' flag in the
haswell subconfig to '-march=haswell'. The older flags were used
by older versions of gcc and should have been updated to the newer
forms a long time ago. (The older flags were clearly working, even
though they are no longer documented in the gcc man page.)
Details:
- Updated most comments in testsuite modules that describe how the
correctness test is performed so that it is clear whether the vector
(normfv) or matrix (normfm) form of Frobenius norm is used.
Details:
- Defined two new static functions in bli_cntx.h:
bli_cntx_set_blksz_def_dt()
bli_cntx_set_blksz_max_dt()
which developers may find convenient when experimenting with different
values of cache blocksizes.
- Updated one- and two-socket multithreaded problem size range and
increment values in test/3/Makefile.
- Changed default to column storage in test/3/test_gemm.c.
- Fixed typo in comment in testsuite/src/test_subm.c.
Details:
- Fixed a bug in bli_l3_thrinfo_print_gemm_paths() and
bli_l3_thrinfo_print_trsm_paths(), defined in bli_l3_thrinfo.c,
whereby subnodes of the thrinfo_t tree are "dereferenced" near the
beginning of the functions, which may lead to segfaults in certain
situations where the thread tree was not fully formed because the
matrix problem was too small for the level of parallelism specified.
(That is, too small because some problems were assigned no work due
to the smallest units in the m and n dimensions being defined by the
register blocksizes mr and nr.) The fix requires several nested levels
of if statements, and this is one of those few instances where use of
goto statements results in (mostly) prettier code, especially in the
case of _gemm_paths(). And while it wasn't necessary, I ported this
goto usage to the loop body that prints the thrinfo_t work_id and
comm_id values for each thread. Thanks to Nicholai Tukanov for helping
to find this bug.
Details:
- Updated the BLASFEO performance graphs shown in PerformanceSmall.md
using a new commit of BLASFEO (2c9f312); updated PerformanceSmall.md
accordingly.
- Updated test/sup/octave/plot_l3sup_perf.m so that the .m files
containing the mpnpkp results do not need to be preprocessed in order
to plot half the problem size range (ie: up to 400 instead of the
800 range of the other shape cases).
- Trivial updates to runme.m.
Details:
- Use -qopenmp-simd instead of -fopenmp-simd when compiling with Intel
icc. Recall that this option is used for SIMD auto-vectorization in
reference kernels only. Support for the -f option has been completely
deprecated and removed in newer versions of icc in favor of -q. Thanks
to Victor Eijkhout for reporting this issue and suggesting the fix.
Details:
- Added a cpp-conditional directive block to bli_arch_config.h that
#includes "bli_family_thunderx2.h". The code has been missing since
adf5c17f. However, this never manifested as an error because the file
is virtually empty and not needed for thunderx2 (or most subconfigs).
Thanks to Jeff Diamond for helping to spot this.
Details:
- Updated the graphs linked in PerformanceSmall.md with BLASFEO results,
and added documenting language accordingly.
- Updated scripts in test/sup/octave to plot BLASFEO data.
- Minor tweak to language re: how OpenBLAS was configured for
docs/Performance.md.
Details:
- Changed starting problem and increment from 16 to 4.
- Added 'lll' (square problems) to list of problem size shapes to
compile and run with.
- Define BLASFEO location and added BLASFEO-related definitions.
Details:
- Added a new markdown document, docs/PerformanceSmall.md, which
publishes new performance graphs for Kaby Lake and Epyc showcasing
the new BLIS sup (small/skinny/unpacked) framework logic and kernels.
For now, only single-threaded dgemm performance is shown.
- Reorganized graphs in docs/graphs into docs/graphs/large, with new
graphs being placed in docs/graphs/sup.
- Updates to scripts in test/sup/octave, mostly to allow decent output
in both GNU octave and Matlab.
- Updated README.md to mention and refer to the new PerformanceSmall.md
document.
Details:
- Fine-tuned the double-precision real MT threshold (which controls
whether the sup implementation kicks for smaller m dimension values)
from 180 to 201 for haswell and 180 to 256 for zen.
- Updated octave scripts in test/sup/octave to include a seventh column
to display performance for m = n = k.
Details:
- Attempted a fix to issue #313, which reports that when building only
a shared library (ie: static library build is disabled), running the
BLAS test drivers can fail because those drivers provide their own
local version of xerbla_() as a clever (albeit still rather hackish)
way of checking the error codes that result from the individual tests.
This local xerbla_() function is never found at link-time because the
BLAS test drivers' Makefile imports BLIS compilation flags via the
get-user-cflags-for() function, which currently conveys the
-fvisibility=hidden flag, which hides symbols unless they are
explicitly annotated for export. The -fvisibility=hidden flag was
only ever intended for use when building BLIS (not for applications),
and so the attempted solution here is to omit the symbol export
flag(s) from get-user-cflags-for() by storing the symbol export
flag(s) to a new BULID_SYMFLAGS variable instead of appending it
to the subconfigurations' CMISCFLAGS variable (which is returned by
every get-*-cflags-for() function). Thanks to M. Zhou for reporting
this issue and also to Isuru Fernando for suggesting the fix.
- Renamed BUILD_FLAGS to BUILD_CPPFLAGS to harmonize with the newly
created BUILD_SYMFLAGS.
- Fixed typo in entry for --export-shared flag in 'configure --help'
text.
Updated copyright information for kernels/zen/bli_trsm_small.c file
Removed separate kernels for zen2 architecture
Instead added threshold conditions in zen kernels both for ROME and NAPLES
Change-Id: Ifd715731741d649b6ad16b123a86dbd6665d97e5
Details:
- Documented the BLIS environment variables that were set
(e.g. BLIS_JC_NT, BLIS_IC_NT, BLIS_JR_NT) for each machine and
threading configuration in order to achieve the parallelism reported
on in docs/Performance.md.
Details:
- Increased the double-precision real MT threshold (which controls
whether the sup implementation kicks for smaller m dimension values)
from 80 to 180, and this change was made for both haswell and zen
subconfigurations. This is less about the m dimension in particular
and more about facilitating a smoother performance transition when
m = n = k.
Details:
- Commented out redundant setting of LIBBLIS_LINK within all driver-
level Makefiles. This variable is already set within common.mk, and
so the only time it should be overridden is if the user wants to link
to a different copy of libblis.
- Very minor changes to build/gen-make-frags/gen-make-frag.sh.
- Whitespace and inconsequential quoting change to configure.
- Moved top-level 'windows' directory into a new 'attic' directory.
config/zen/bli_family_zen.h: deleted macro BLIS_ENBLE_ZEN_BLOCK_SIZES
config/zen/make_defs.mk: removed compiler flag -mno-avx256-split-unaligned-store
frame/base/bli_cpuid.c: ROME family is 17H but model # is from 0x30H.
test/test_gemm.c - commented out #define FILE_IN_OUT (some compilation error when BLIS is configured as amd64)
Now we can use single configuration has ./configure amd64 - this will work both for ROME & Naples
Change-Id: I91b4fc35380f8a35b4f4c345da040c6b5910b4a2
Details:
- Added
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
to bli_system.h so that an application that uses BLIS (specifically,
an application that #includes blis.h) does not need to remember to
#define the macro itself (either on the command line or in the code
that includes blis.h) in order to activate things like the pthreads.
Thanks to Christos Psarras for reporting this issue and suggesting
this fix.
- Commented out #include <sys/time.h> in bli_system.h, since I don't
think this header is used/needed anymore.
- Comment update to function macro for bli_?normiv_unb_var1() in
frame/util/bli_util_unb_var1.c.