Details:
- Added single-threaded and multithreaded sup performance results to
docs/PerformanceSmall.md for both sgemm and dgemm. These results were
gathered on an Epyc 7742 "Rome" server featuring AMD's Zen2
microarchitecture. Special thanks to Jeff Diamond for facilitating
access to the system via the Oracle Cloud.
- Updates to octave scripts in test/sup/octave for use with Octave 5.2
and for use with subplot_tight().
- Minor updates to octave scripts in test/3/octave.
- Renamed files containing the previous Zen performance results for
consistency with the new results.
- Decreased line thickness slightly in large/conventional Zen2 graphs.
I'm done tweaking those this time. Really.
- Added a missing line regarding Eigen header installation to each
microarchitecture section.
Details:
- Added a frequently asked question to docs/FAQ.md regarding the
difference between upstream (vanilla) BLIS and AMD BLIS.
- Updated the name of ICES in the README.md to reflect the Oden
rebranding.
Details:
- Added single-threaded and multithreaded performance results to
docs/Performance.md. These results were gathered on an Epyc 7742
"Rome" server with AMD's Zen2 microarchitecture. Special thanks
to Jeff Diamond for facilitating access to the system via the
Oracle Cloud.
- Renamed files containing the previous Zen performance results for
consistency with the new results.
Details:
- Steer the reader towards the example code section of each API
document (object and typed).
- Trivial update to examples/oapi/README, examples/tapi/README.
Details:
- Added documentation for commonly-used object mutator functions in
BLISObjectAPI.md. Previously, only accessor functions were documented.
Thanks to Jeff Diamond for pointing out this omission.
- Explicitly set the 'diag' property of objects in oapi example modules
(08level2.c and 09level3.c).
Details:
- Added a new markdown document, docs/PerformanceSmall.md, which
publishes new performance graphs for Kaby Lake and Epyc showcasing
the new BLIS sup (small/skinny/unpacked) framework logic and kernels.
For now, only single-threaded dgemm performance is shown.
- Reorganized graphs in docs/graphs into docs/graphs/large, with new
graphs being placed in docs/graphs/sup.
- Updates to scripts in test/sup/octave, mostly to allow decent output
in both GNU Octave and Matlab.
- Updated README.md to mention and refer to the new PerformanceSmall.md
document.
Details:
- Updated the level-3 performance graphs in docs/graphs with new Eigen
results, this time using a development version cloned from their git
mirror on March 27, 2019 (version 3.3.90). Performance is improved
over 3.3.7, though still noticeably short of BLIS/MKL in most cases.
- Very minor updates to docs/Performance.md and matlab scripts in
test/3/matlab.
Details:
- Updated the Haswell, SkylakeX, and Epyc performance graphs in
docs/graphs to report on Eigen implementations, where applicable.
Specifically, Eigen implements all level-3 operations sequentially;
however, of those operations, it only provides a multithreaded gemm.
Thus, mt results for symm/hemm, syrk/herk, trmm, and trsm are
omitted. Thanks to Sameer Agarwal for his help configuring and
using Eigen.
- Updated docs/Performance.md to note the new implementation tested.
- CREDITS file update.
Details:
- Added a new markdown document, docs/Performance.md, which reports
performance of a representative set of level-3 operations across a
variety of hardware architectures, comparing BLIS to OpenBLAS and a
vendor library (MKL on Intel/AMD, ARMPL on ARM). Performance graphs,
in pdf and png formats, reside in docs/graphs.
- Updated README.md to link to new Performance.md document.
- Minor updates to CREDITS, docs/Multithreading.md.
- Minor updates to matlab scripts in test/3/matlab.
Details:
- Made extra explicit the fact that: (a) multithreading in BLIS is
disabled by default; and (b) even with multithreading enabled, the
user must specify multithreading at runtime in order to observe
parallelism. Thanks to M. Zhou for suggesting these clarifications
in #292.
- Also made explicit that only the environment variable and global
runtime API methods are available when using the BLAS API. If the
user wishes to use the local runtime API (specify multithreading on
a per-call basis), one of the native BLIS APIs must be used.
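The rules above can be pictured with a small, self-contained model. This is a sketch only and not BLIS code: the function name, the enum, and the "0 means unspecified" convention are all hypothetical, and the ordering (local setting over global setting over environment variable) is an assumption made for illustration. What it takes from the notes above is that nothing specified means sequential execution, and that a per-call (local) setting is only honored through a native BLIS API.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model, not part of BLIS. */
typedef enum { API_BLAS_SKETCH, API_BLIS_NATIVE_SKETCH } api_kind_sketch_t;

/* Parse a thread count from an environment-style string.
   Returns 0 when the value is missing or not a positive integer. */
static long parse_env_threads_sketch(const char *value)
{
    if (value == NULL) return 0;
    char *end = NULL;
    long n = strtol(value, &end, 10);
    return (end != value && n > 0) ? n : 0;
}

/* Decide how many threads one call would use. A value of 0 for global_nt
   or local_nt means "not specified". */
long resolve_num_threads_sketch(const char *env_value, long global_nt,
                                long local_nt, api_kind_sketch_t api)
{
    assert(global_nt >= 0 && local_nt >= 0);

    long nt = 1; /* nothing specified at runtime: no parallelism */

    long env_nt = parse_env_threads_sketch(env_value);
    if (env_nt > 0) nt = env_nt;

    /* Assumption of this sketch: a global runtime setting replaces the
       value read from the environment. */
    if (global_nt > 0) nt = global_nt;

    /* The local (per-call) setting is only reachable via a native API. */
    if (api == API_BLIS_NATIVE_SKETCH && local_nt > 0) nt = local_nt;

    return nt;
}
```

In this model, a caller going through the BLAS API who passes a local value of 2 still gets the global or environment value, which mirrors the restriction described in the second bullet.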
Details:
- Added language to remind the reader to disable sup if the intended
behavior is for the sandbox implementation to handle all problem
sizes, even the smaller ones that would normally be handled by the
sup code path.
Details:
- Textually replaced nearly all non-comment instances of bool_t with the
C99 bool type. A few remaining instances, such as those in the files
bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and
bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being
used not for boolean purposes but to index into an array.
- This commit constitutes the third phase of a transition toward using
C99's bool instead of bool_t, which was raised in issue #420. The first
phase, which cleaned up various typecasts in preparation for using
bool as the basis for bool_t (instead of gint_t), was implemented by
commit a69a4d7. The second phase, which redefined the bool_t typedef
in terms of bool (from gint_t), was implemented by commit 2c554c2.
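To illustrate the distinction drawn in the first bullet, here is a minimal sketch of a value that looks boolean but is really an array index. Every name below is hypothetical, including the stand-in for dim_t; none of it is taken from the three kernel files mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an integer dimension/index type. */
typedef ptrdiff_t dim_sketch_t;

/* A genuine predicate: C99 bool is the right type here. */
static bool is_lower_sketch(char uplo)
{
    return uplo == 'l';
}

/* A two-entry table selected by a 0/1 value. */
static const int offsets_sketch[2] = { 10, 20 };

int pick_offset_sketch(char uplo)
{
    /* The selector takes only the values 0 and 1, but it is used to index
       an array, so it is declared as an integer type rather than bool. */
    dim_sketch_t idx = is_lower_sketch(uplo) ? 1 : 0;
    assert(idx == 0 || idx == 1);
    return offsets_sketch[idx];
}
```

Keeping the index as an integer type makes the intent visible at the declaration and leaves room for the table to grow beyond two entries.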
Details:
- The 'ref99' sandbox was broken by multiple refactorings and internal
API changes over the last two years. Rather than try to fix it, I've
replaced it with a much simpler version based on var2 of gemmsup.
Why not fix the previous implementation? It occurred to me that the
old implementation was trying to be a lightly simplified duplication
of what exists in the framework. Duplication aside, this sandbox
would have worked fine if it had been completely independent of the
framework code. The problem was that it was only partially
independent, with many of its calls going to functions in BLIS
rather than to duplicated/simplified versions within the sandbox. (And
the reason I didn't make it fully independent to begin with was that
it seemed unnecessarily duplicative at the time.) Maintaining two
versions of the same implementation is problematic for obvious
reasons, especially when it wasn't even done properly to begin with.
This explains the reimplementation in this commit. The only catch is
that the newer implementation is single-threaded only and does not
perform any packing on either input matrix (A or B). Basically, it's
only meant to be a simple placeholder that shows how you could plug
in your own implementation. Thanks to Francisco Igual for reporting
this brokenness.
- Updated the three reference gemmsup kernels (defined in
ref_kernels/3/bli_gemmsup_ref.c) so that they properly handle
conjugation of conja and/or conjb. The general storage kernel, which
is currently identical to the column-storage kernel, is used in the
new ref99 sandbox to provide basic support for all datatypes
(including scomplex and dcomplex).
- Minor updates to docs/Sandboxes.md, including adding the threading
and packing limitations to the Caveats section.
- Fixed a comment typo in bli_l3_sup_var1n2m.c (upon which the new
sandbox implementation is based).
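For readers unfamiliar with what handling conja and/or conjb means in a reference kernel, the following is a rough sketch of an unpacked, single-threaded, general-stride gemm that applies optional conjugation to each input. It is not the code in ref_kernels/3/bli_gemmsup_ref.c: the complex struct, the helper functions, and the signature are all hypothetical and chosen only to keep the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical double-precision complex type for this sketch. */
typedef struct { double real; double imag; } zsketch_t;

static zsketch_t z_add(zsketch_t x, zsketch_t y)
{
    zsketch_t r = { x.real + y.real, x.imag + y.imag };
    return r;
}

static zsketch_t z_mul(zsketch_t x, zsketch_t y)
{
    zsketch_t r = { x.real * y.real - x.imag * y.imag,
                    x.real * y.imag + x.imag * y.real };
    return r;
}

/* Conjugate x only when requested. */
static zsketch_t z_conj_if(zsketch_t x, bool do_conj)
{
    if (do_conj) x.imag = -x.imag;
    return x;
}

/* C := beta * C + alpha * conja(A) * conjb(B), where A is m x k, B is k x n,
   C is m x n, and each matrix has its own row and column strides. */
void ref_gemm_conj_sketch(size_t m, size_t n, size_t k, zsketch_t alpha,
                          const zsketch_t *a, size_t rs_a, size_t cs_a, bool conja,
                          const zsketch_t *b, size_t rs_b, size_t cs_b, bool conjb,
                          zsketch_t beta,
                          zsketch_t *c, size_t rs_c, size_t cs_c)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            zsketch_t ab = { 0.0, 0.0 };

            /* Dot product of row i of A with column j of B. */
            for (size_t p = 0; p < k; ++p) {
                zsketch_t aip = z_conj_if(a[i * rs_a + p * cs_a], conja);
                zsketch_t bpj = z_conj_if(b[p * rs_b + j * cs_b], conjb);
                ab = z_add(ab, z_mul(aip, bpj));
            }

            zsketch_t *cij = &c[i * rs_c + j * cs_c];
            *cij = z_add(z_mul(beta, *cij), z_mul(alpha, ab));
        }
    }
}
```

Because the strides are general, the same loop covers row-stored and column-stored operands, and no packing of A or B is performed.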
Details:
- Previously, the entry for shiftd in the Operation index section of
BLISTypedAPI.md was incorrectly linking to the shiftd operation entry
in BLISObjectAPI.md. This has been fixed. Thanks to Jeff Diamond for
helping find this incorrect link.
Details:
- Added Perl to list of prerequisites for building BLIS. This is in part
(and perhaps completely?) due to some substitution commands used at
the end of configure that include '\n' characters that are not
properly interpreted by the version of sed included on some versions
of OS X. This new documentation addresses issue #398.
Details:
- Fixed a missing argument (conjy) in the function signatures of
bli_?her2() and bli_?syr2() in docs/BLISTypedAPI.md. Thanks to Robert
van de Geijn for reporting this omission.
Change-Id: Ifd1e01d5d7f943db4b1d67b467eb57e4a5c44165
Details:
- Updated the Makefile, test_gemm.c, and runme.sh in test/sup to be able
to compile and run both single-threaded and multithreaded experiments.
This should help with maintenance going forward.
- Created a test/sup/octave_st directory of scripts (based on the
previous test/sup/octave scripts) as well as a test/sup/octave_mt
directory (based on the previous test/supmt/octave scripts). The
octave scripts are slightly different and not easily mergeable, and
thus for now I'll maintain them separately.
- Preserved the previous test/sup directory as test/sup/old/supst and
the previous test/supmt directory as test/sup/old/supmt.
Change-Id: Ia230fc65185fd9a34eec714721004aa9e0bd40ed
* Fix parsing in vpu_count on workstation SKX
* Document Skylake-X as Haswell for single FMA
* Update vpu_count for Skylake and Cascade Lake models
* Support printing the configuration selected, controlled by the environment.
Intended particularly for diagnosing mis-selection of SKX through an
unknown, or incorrect, number of VPUs.
* Move bli_log outside the cpp condition, and use it where intended
* Add Fixme comment (Skylake D)
* Mostly superficial edits to commits towards #351.
Details:
- Moved architecture/sub-config logging-related code from bli_cpuid.c
to bli_arch.c, tweaked names, and added more set/get layering.
- Tweaked log messages output from bli_cpuid_is_skx() in bli_cpuid.c.
- Content, whitespace changes to new bullet in HardwareSupport.md that
relates to single-VPU Skylake-Xs.
* Fix comment typos
Co-authored-by: Field G. Van Zee <field@cs.utexas.edu>
Details:
- Added multithreading support to the sup framework (via either OpenMP
or pthreads). Both variants 1n and 2m now have the appropriate
threading infrastructure, including data partitioning logic, to
parallelize computation. This support handles all four combinations
of packing on matrices A and B (neither, A only, B only, or both).
This implementation tries to be a little smarter when automatic
threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
recalculate the factorization in units of micropanels (rather than
using the raw dimensions) in bli_l3_sup_int.c, when the final
problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
or column-stored matrices. (This is used for the rrc and crc storage
cases.) Previously, copym was used, but that would no longer suffice
because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
instead of from the variant functions. This has the effect of making
the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
and inserted usage of these functions within bli_thrinfo_init(), which
previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
tests, as well as appropriate octave/matlab scripts to plot the
resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
that specifying any BLIS_*_NT variable, even if it is set to 1, will
be considered manual specification for the purposes of determining
whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.
AMD-Internal: [CPUPL-713]
Change-Id: I9536648e7befac4d2dc17805e44ef34470961662
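The idea of factorizing the thread count in units of micropanels, rather than raw dimensions, can be sketched as follows. This is an illustrative model only and is not the logic in bli_l3_sup_int.c or bli_thread_partition_2x2(): the function names, the struct, and the cost measure (micropanel tiles assigned to the busiest thread) are assumptions made for this example. The second function models the rule from docs/Multithreading.md noted above, that any BLIS_*_NT setting, even a value of 1, counts as manual specification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical 2D thread grid: ic ways along m, jc ways along n. */
typedef struct { size_t ic; size_t jc; } grid2_sketch_t;

static size_t ceil_div_sketch(size_t x, size_t y)
{
    return (x + y - 1) / y;
}

/* Choose ic * jc == nt so that the busiest thread gets as few
   micropanel tiles as possible. mr and nr are the micropanel sizes. */
grid2_sketch_t factor_threads_sketch(size_t nt, size_t m, size_t n,
                                     size_t mr, size_t nr)
{
    assert(nt > 0 && mr > 0 && nr > 0);

    /* Measure the problem in micropanels, not in raw rows and columns. */
    size_t m_up = ceil_div_sketch(m, mr);
    size_t n_up = ceil_div_sketch(n, nr);

    grid2_sketch_t best = { 1, nt };
    size_t best_cost = (size_t)-1;

    for (size_t ic = 1; ic <= nt; ++ic) {
        if (nt % ic != 0) continue; /* only exact factorizations */
        size_t jc = nt / ic;
        size_t cost = ceil_div_sketch(m_up, ic) * ceil_div_sketch(n_up, jc);
        if (cost < best_cost) {
            best_cost = cost;
            best.ic = ic;
            best.jc = jc;
        }
    }
    return best;
}

/* Automatic factorization applies only when a total thread count was
   requested and no per-loop BLIS_*_NT value was given at all. */
bool wants_auto_factor_sketch(bool num_threads_set, bool any_ways_set)
{
    return num_threads_set && !any_ways_set;
}
```

For a tall, narrow problem such as m=96, n=8 with mr=6 and nr=8, there are 16 micropanels along m and only one along n, so the sketch assigns all four threads to the m dimension.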
Details:
- Reran all existing single-threaded performance experiments comparing
BLIS sup to other implementations (including the conventional code
path within BLIS), using the latest versions (where appropriate).
- Added multithreaded results for the three existing hardware types
showcased in docs/PerformanceSmall.md: Kaby Lake, Haswell, and Epyc
(Zen1).
- Various minor updates to the text in docs/PerformanceSmall.md.
- Updates to the octave scripts in test/sup/octave, test/supmt/octave.