Updated copyright information for kernels/zen/bli_trsm_small.c file
Removed separate kernels for zen2 architecture
Instead added threshold conditions in zen kernels both for ROME and NAPLES
Change-Id: Ifd715731741d649b6ad16b123a86dbd6665d97e5
config/zen/bli_family_zen.h: deleted macro BLIS_ENBLE_ZEN_BLOCK_SIZES
config/zen/make_defs.mk: removed compiler flag -mno-avx256-split-unaligned-store
frame/base/bli_cpuid.c: ROME family is 17H but model # is from 0x30H.
test/test_gemm.c - commented out #define FILE_IN_OUT (some compilation error when BLIS is configured as amd64)
Now we can use single configuration has ./configure amd64 - this will work both for ROME & Naples
Change-Id: I91b4fc35380f8a35b4f4c345da040c6b5910b4a2
Details:
- SYRK for small matrix was implemented by reusing small GEMM routine. This was
resulting in output written to the full C matrix, and C being symmetric the
lower and upper triangles of C matrix contained same results. BLAS SYRK API
spec demands either lower or upper triangle of C matrix to be written with
results. So, this was resulting in BLAS test failures, even though testsuite
of BLIS was passing small SYRK operation.
- To fix BLAS test failures of small matrix SYRK, separate kernel routines are
implemented for small SYRK for both single and double precision. The newly
added small SYRK routines are in file kernels/zen/3/bli_syrk_small.c.
Now the intermediate results of matrix C are written to a scratch buffer.
Final results are written from scratch buffer to matrix C using SIMD
copy to either lower or upper traingle part of matrix C.
- Source and header files frame/3/syrk/bli_syrk_front.c and
frame/3/syrk/bli_syrk_front.h are changed to invoke new small SYRK routines.
Change-Id: I9cfb1116c93d150aefac673fca033952ecac97cb
Issue: For the default values of mc, kc and nc with multi instance mode the performance across the cores dip drastically.
Fix: After experimentation found different set of values (mc, kc and nc) which fits in the cache size, and performance across the remains same across all the cores.
Change-Id: I98265e3b7e61cd7602a0cc5596240e86c08c03fe
BUG No: CPUPL-197 fixed by Thangaraj Santanu
The bli_clock_min_diff() function in BLIS assumed that if the time taken is greater than 1 hour then the reading must be wrong. However this is not the case in general, while the other checks such as time taken closer to zero or nsec is ofcourse valid.
gerrit review: http://git.amd.com:8080/#/c/118694/1/frame/base/bli_clock.c
Change-Id: I9dc313d7c5fdc20684f67a516bf3237de3e0694a
Details:
- Added a short section after the Intro of the README.md file titled
"Education and Learning" that directs interested readers to the
"LAFF-On Programming for High-Performance" massive open online course
(MOOC) hosted via edX.
Details:
- Added a new standalone test driver directory named '1m4m' that can
build and run performance experiments for BLIS 1m, 4m1a, assembly,
OpenBLAS, and the vendor library (MKL). This new driver directory
was used to regenerate performance results for the 1m paper.
- Added alternate (commented-out) cache blocksizes to
config/haswell/bli_cntx_init_haswell.c. These blocksizes tend to
work well on an a 12-core Intel Xeon E5-2650 v3.
Details:
- Updated the default MC cache blocksizes used by the haswell subconfig
for both row-preferential (the default) and column-preferential
microkernels.
Details:
- Updated the '-march=corei7-avx' flag in the sandybridge subconfig
to '-march=sandybridge' and the '-march=core-avx2' flag in the
haswell subconfig to '-march=haswell'. The older flags were used
by older versions of gcc and should have been updated to the newer
forms a long time ago. (The older flags were clearly working, even
though they are no longer documented in the gcc man page.)
Details:
- Updated most comments in testsuite modules that describe how the
correctness test is performed so that it is clear whether the vector
(normfv) or matrix (normfm) form of Frobenius norm is used.
Details:
- Defined two new static functions in bli_cntx.h:
bli_cntx_set_blksz_def_dt()
bli_cntx_set_blksz_max_dt()
which developers may find convenient when experimenting with different
values of cache blocksizes.
- Updated one- and two-socket multithreaded problem size range and
increment values in test/3/Makefile.
- Changed default to column storage in test/3/test_gemm.c.
- Fixed typo in comment in testsuite/src/test_subm.c.
Details:
- Fixed a bug in bli_l3_thrinfo_print_gemm_paths() and
bli_l3_thrinfo_print_trsm_paths(), defined in bli_l3_thrinfo.c,
whereby subnodes of the thrinfo_t tree are "dereferenced" near the
beginning of the functions, which may lead to segfaults in certain
situations where the thread tree was not fully formed because the
matrix problem was too small for the level of parallelism specified.
(That is, too small because some problems were assigned no work due
to the smallest units in the m and n dimensions being defined by the
register blocksizes mr and nr.) The fix requires several nested levels
of if statements, and this is one of those few instances where use of
goto statements results in (mostly) prettier code, especially in the
case of _gemm_paths(). And while it wasn't necessary, I ported this
goto usage to the loop body that prints the thrinfo_t work_id and
comm_id values for each thread. Thanks to Nicholai Tukanov for helping
to find this bug.
Details:
- Updated the BLASFEO performance graphs shown in PerformanceSmall.md
using a new commit of BLASFEO (2c9f312); updated PerformanceSmall.md
accordingly.
- Updated test/sup/octave/plot_l3sup_perf.m so that the .m files
containing the mpnpkp results do not need to be preprocessed in order
to plot half the problem size range (ie: up to 400 instead of the
800 range of the other shape cases).
- Trivial updates to runme.m.