Merge branch 'dev'
@@ -8,6 +8,7 @@ project, as well as those we think a new user or developer might ask. If you do
* [Why did you create BLIS?](FAQ.md#why-did-you-create-blis)
* [Why should I use BLIS instead of GotoBLAS / OpenBLAS / ATLAS / MKL / ESSL / ACML / Accelerate?](FAQ.md#why-should-i-use-blis-instead-of-gotoblas--openblas--atlas--mkl--essl--acml--accelerate)
* [How is BLIS related to FLAME / libflame?](FAQ.md#how-is-blis-related-to-flame--libflame)
* [What is the difference between BLIS and the AMD fork of BLIS found in AOCL?](FAQ.md#what-is-the-difference-between-blis-and-the-amd-fork-of-blis-found-in-aocl)
* [Does BLIS automatically detect my hardware?](FAQ.md#does-blis-automatically-detect-my-hardware)
* [I understand that BLIS is mostly a tool for developers?](FAQ.md#i-understand-that-blis-is-mostly-a-tool-for-developers)
* [How do I link against BLIS?](FAQ.md#how-do-i-link-against-blis)

@@ -60,6 +61,12 @@ homepage](https://github.com/flame/blis#key-features). But here are a few reason

As explained [above](FAQ.md#why-did-you-create-blis), BLIS was initially a layer within `libflame` that allowed more convenient interfacing to the BLAS. So in some ways, BLIS is a spin-off project. Prior to developing BLIS, [its author](http://www.cs.utexas.edu/users/field/) worked as the primary maintainer of `libflame`. If you look closely, you can also see that the design of BLIS was influenced by some of the more useful and innovative aspects of `libflame`, such as internal object abstractions and control trees. Also, various members of the [SHPC research group](http://shpc.ices.utexas.edu/people.html) and its [collaborators](http://shpc.ices.utexas.edu/collaborators.html) routinely provide insight and feedback, and also contribute code (especially kernels) to the BLIS project.

### What is the difference between BLIS and the AMD fork of BLIS found in AOCL?

BLIS, also known as "vanilla BLIS" or "upstream BLIS," is maintained by its [original developer](https://github.com/fgvanzee) (with the [support of others](http://shpc.ices.utexas.edu/collaborators.html)) in the [Science of High-Performance Computing](http://shpc.ices.utexas.edu/) (SHPC) group within the [Oden Institute for Computational Engineering and Sciences](http://www.oden.utexas.edu/) at [The University of Texas at Austin](http://www.utexas.edu/). In 2015, [AMD](https://www.amd.com/) reorganized many of their software library efforts around existing open source projects. BLIS was chosen as the basis for their [CPU BLAS library](https://developer.amd.com/amd-aocl/blas-library/), and an AMD-maintained [fork of BLIS](https://github.com/amd/blis) was established.

AMD BLIS sometimes contains optimizations specific to AMD hardware. Many of these optimizations are (eventually) merged back into upstream BLIS. However, for various reasons, some changes may remain unique to AMD BLIS for quite some time. Thus, if you want the latest optimizations for AMD hardware, feel free to try AMD BLIS. However, please note that neither The University of Texas at Austin nor BLIS's developers can endorse or offer direct support for any outside fork of BLIS, including AMD BLIS.

### Does BLIS automatically detect my hardware?

On certain architectures (most notably x86_64), yes. To use auto-detection, specify `auto` as your configuration when running `configure`. (Please see the BLIS [Build System](BuildSystem.md) guide for more info.) A runtime detection option is also available. (Please see the [Configuration Guide](ConfigurationHowTo.md) for a comprehensive walkthrough.)
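For example, a build using auto-detection might look like the following sketch (the clone location and optional flags are illustrative; the [Build System](BuildSystem.md) guide is authoritative):

```shell
# Configure BLIS, letting configure detect the hardware and choose
# the appropriate sub-configuration, then build and install.
git clone https://github.com/flame/blis.git
cd blis
./configure auto      # hardware detected at configure time
make -j
make install          # honors --prefix given to configure, if any
```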

@@ -15,9 +15,12 @@
* **[Haswell](Performance.md#haswell)**
  * **[Experiment details](Performance.md#haswell-experiment-details)**
  * **[Results](Performance.md#haswell-results)**
* **[Epyc](Performance.md#epyc)**
  * **[Experiment details](Performance.md#epyc-experiment-details)**
  * **[Results](Performance.md#epyc-results)**
* **[Zen](Performance.md#zen)**
  * **[Experiment details](Performance.md#zen-experiment-details)**
  * **[Results](Performance.md#zen-results)**
* **[Zen2](Performance.md#zen2)**
  * **[Experiment details](Performance.md#zen2-experiment-details)**
  * **[Results](Performance.md#zen2-results)**
* **[Feedback](Performance.md#feedback)**

# Introduction

@@ -240,6 +243,7 @@ The `runthese.m` file will contain example invocations of the function.
    endif()
    ```
  * configured and built BLAS library via `mkdir build; cd build; cmake ..; make blas`
  * installed headers via `cmake . -DCMAKE_INSTALL_PREFIX=$HOME/flame/eigen; make install`
  * The `gemm` implementation was pulled in at compile-time via Eigen headers; other operations were linked to Eigen's BLAS library.
  * Single-threaded (1 core) execution requested via `export OMP_NUM_THREADS=1`
  * Multithreaded (26 core) execution requested via `export OMP_NUM_THREADS=26`

@@ -320,6 +324,7 @@ The `runthese.m` file will contain example invocations of the function.
    endif()
    ```
  * configured and built BLAS library via `mkdir build; cd build; cmake ..; make blas`
  * installed headers via `cmake . -DCMAKE_INSTALL_PREFIX=$HOME/flame/eigen; make install`
  * The `gemm` implementation was pulled in at compile-time via Eigen headers; other operations were linked to Eigen's BLAS library.
  * Single-threaded (1 core) execution requested via `export OMP_NUM_THREADS=1`
  * Multithreaded (12 core) execution requested via `export OMP_NUM_THREADS=12`

@@ -355,12 +360,12 @@ The `runthese.m` file will contain example invocations of the function.

---

## Epyc
## Zen

### Epyc experiment details
### Zen experiment details

* Location: Oracle cloud
* Processor model: AMD Epyc 7551 (Zen1)
* Processor model: AMD Epyc 7551 (Zen1 "Naples")
* Core topology: two sockets, 4 dies per socket, 2 core complexes (CCX) per die, 4 cores per CCX, 64 cores total
* SMT status: enabled, but not utilized
* Max clock rate: 3.0GHz (single-core), 2.55GHz (multicore)

@@ -398,6 +403,7 @@ The `runthese.m` file will contain example invocations of the function.
    endif()
    ```
  * configured and built BLAS library via `mkdir build; cd build; cmake ..; make blas`
  * installed headers via `cmake . -DCMAKE_INSTALL_PREFIX=$HOME/flame/eigen; make install`
  * The `gemm` implementation was pulled in at compile-time via Eigen headers; other operations were linked to Eigen's BLAS library.
  * Single-threaded (1 core) execution requested via `export OMP_NUM_THREADS=1`
  * Multithreaded (32 core) execution requested via `export OMP_NUM_THREADS=32`

@@ -417,22 +423,106 @@ The `runthese.m` file will contain example invocations of the function.
* Comments:
  * MKL performance is dismal, despite being linked in the same manner as on the Xeon Platinum. It's not clear what is causing the slowdown. It could be that MKL's runtime kernel/blocksize selection logic is falling back to some older, more basic implementation because CPUID is not returning Intel as the hardware vendor. Alternatively, it's possible that MKL is trying to use kernels for the closest Intel architectures (say, Haswell/Broadwell), but its implementations use Haswell-specific optimizations that, due to microarchitectural differences, degrade performance on Zen.

### Epyc results
### Zen results

#### pdf

* [Epyc single-threaded](graphs/large/l3_perf_epyc_nt1.pdf)
* [Epyc multithreaded (32 cores)](graphs/large/l3_perf_epyc_jc1ic8jr4_nt32.pdf)
* [Epyc multithreaded (64 cores)](graphs/large/l3_perf_epyc_jc2ic8jr4_nt64.pdf)
* [Zen single-threaded](graphs/large/l3_perf_zen_nt1.pdf)
* [Zen multithreaded (32 cores)](graphs/large/l3_perf_zen_jc1ic8jr4_nt32.pdf)
* [Zen multithreaded (64 cores)](graphs/large/l3_perf_zen_jc2ic8jr4_nt64.pdf)

#### png (inline)

* **Epyc single-threaded**
![l3_perf_epyc_nt1](graphs/large/l3_perf_epyc_nt1.png)
* **Epyc multithreaded (32 cores)**
![l3_perf_epyc_jc1ic8jr4_nt32](graphs/large/l3_perf_epyc_jc1ic8jr4_nt32.png)
* **Epyc multithreaded (64 cores)**
![l3_perf_epyc_jc2ic8jr4_nt64](graphs/large/l3_perf_epyc_jc2ic8jr4_nt64.png)
* **Zen single-threaded**
![l3_perf_zen_nt1](graphs/large/l3_perf_zen_nt1.png)
* **Zen multithreaded (32 cores)**
![l3_perf_zen_jc1ic8jr4_nt32](graphs/large/l3_perf_zen_jc1ic8jr4_nt32.png)
* **Zen multithreaded (64 cores)**
![l3_perf_zen_jc2ic8jr4_nt64](graphs/large/l3_perf_zen_jc2ic8jr4_nt64.png)

---

## Zen2

### Zen2 experiment details

* Location: Oracle cloud
* Processor model: AMD Epyc 7742 (Zen2 "Rome")
* Core topology: two sockets, 8 Core Complex Dies (CCDs) per socket, 2 Core Complexes (CCX) per CCD, 4 cores per CCX, 128 cores total
* SMT status: enabled, but not utilized
* Max clock rate: 2.25GHz (base, documented); 3.4GHz boost (single-core, documented); 2.6GHz boost (multicore, estimated)
* Max vector register length: 256 bits (AVX2)
* Max FMA vector IPC: 2
  * Alternatively, FMA vector IPC is 4 when vectors are limited to 128 bits each.
* Peak performance:
  * single-core: 54.4 GFLOPS (double-precision), 108.8 GFLOPS (single-precision)
  * multicore (estimated): 41.6 GFLOPS/core (double-precision), 83.2 GFLOPS/core (single-precision)
* Operating system: Ubuntu 18.04 (Linux kernel 4.15.0)
* Page size: 4096 bytes
* Compiler: gcc 9.3.0
* Results gathered: 24 September 2020, 29 September 2020
* Implementations tested:
  * BLIS 4fd8d9f (0.7.0-55)
    * configured with `./configure -t openmp auto` (single- and multithreaded)
    * sub-configuration exercised: `zen2`
    * Single-threaded (1 core) execution requested via no change in environment variables
    * Multithreaded (64 core) execution requested via `export BLIS_JC_NT=4 BLIS_IC_NT=4 BLIS_JR_NT=4`
    * Multithreaded (128 core) execution requested via `export BLIS_JC_NT=8 BLIS_IC_NT=4 BLIS_JR_NT=4`
  * OpenBLAS 0.3.10
    * configured `Makefile.rule` with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0` (single-threaded)
    * configured `Makefile.rule` with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=64` (multithreaded, 64 cores)
    * configured `Makefile.rule` with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=128` (multithreaded, 128 cores)
    * Single-threaded (1 core) execution requested via `export OPENBLAS_NUM_THREADS=1`
    * Multithreaded (64 core) execution requested via `export OPENBLAS_NUM_THREADS=64`
    * Multithreaded (128 core) execution requested via `export OPENBLAS_NUM_THREADS=128`
  * Eigen 3.3.90
    * Obtained via the [Eigen GitLab homepage](https://gitlab.com/libeigen/eigen) (24 September 2020)
    * Prior to compilation, modified top-level `CMakeLists.txt` to ensure that `-march=native` was added to the `CXX_FLAGS` variable (h/t Sameer Agarwal):
      ```
      # These lines added after line 60.
      check_cxx_compiler_flag("-march=native" COMPILER_SUPPORTS_MARCH_NATIVE)
      if(COMPILER_SUPPORTS_MARCH_NATIVE)
        set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=native")
      endif()
      ```
    * configured and built BLAS library via `mkdir build; cd build; cmake ..; make blas`
    * installed headers via `cmake . -DCMAKE_INSTALL_PREFIX=$HOME/flame/eigen; make install`
    * The `gemm` implementation was pulled in at compile-time via Eigen headers; other operations were linked to Eigen's BLAS library.
    * Single-threaded (1 core) execution requested via `export OMP_NUM_THREADS=1`
    * Multithreaded (64 core) execution requested via `export OMP_NUM_THREADS=64`
    * Multithreaded (128 core) execution requested via `export OMP_NUM_THREADS=128`
    * **NOTE**: This version of Eigen does not provide multithreaded implementations of `symm`/`hemm`, `syrk`/`herk`, `trmm`, or `trsm`, and therefore those curves are omitted from the multithreaded graphs.
  * MKL 2020 update 3
    * Single-threaded (1 core) execution requested via `export MKL_NUM_THREADS=1`
    * Multithreaded (64 core) execution requested via `export MKL_NUM_THREADS=64`
    * Multithreaded (128 core) execution requested via `export MKL_NUM_THREADS=128`
* Affinity:
  * Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0-127"`. However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` environment variable was unset.
  * All executables were run through `numactl --interleave=all`.
* Frequency throttling (via `cpupower`):
  * Driver: acpi-cpufreq
  * Governor: performance
  * Hardware limits (steps): 1.5GHz, 2.0GHz, 2.25GHz
  * Adjusted minimum: 2.25GHz
* Comments:
  * MKL performance is once again underwhelming. This is likely because Intel has decided that it does not want to give users of MKL a reason to purchase AMD hardware.
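The peak-performance numbers listed in the experiment details follow from a simple product: clock rate x FMA instructions per cycle x vector lanes per instruction x 2 flops per FMA. A quick sanity check of that arithmetic (ours, not part of the experiment scripts):

```shell
# Peak GFLOPS/core = clock (GHz) x FMA IPC x (vector bits / element bits) x 2 flops/FMA
awk 'BEGIN { printf "single-core double: %.1f GFLOPS\n",      3.4 * 2 * (256/64) * 2 }'
awk 'BEGIN { printf "single-core single: %.1f GFLOPS\n",      3.4 * 2 * (256/32) * 2 }'
awk 'BEGIN { printf "multicore double:   %.1f GFLOPS/core\n", 2.6 * 2 * (256/64) * 2 }'
awk 'BEGIN { printf "multicore single:   %.1f GFLOPS/core\n", 2.6 * 2 * (256/32) * 2 }'
```

These reproduce the 54.4/108.8 GFLOPS (single-core) and 41.6/83.2 GFLOPS/core (multicore, estimated boost clock) figures above.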

### Zen2 results

#### pdf

* [Zen2 single-threaded](graphs/large/l3_perf_zen2_nt1.pdf)
* [Zen2 multithreaded (64 cores)](graphs/large/l3_perf_zen2_jc4ic4jr4_nt64.pdf)
* [Zen2 multithreaded (128 cores)](graphs/large/l3_perf_zen2_jc8ic4jr4_nt128.pdf)

#### png (inline)

* **Zen2 single-threaded**
![l3_perf_zen2_nt1](graphs/large/l3_perf_zen2_nt1.png)
* **Zen2 multithreaded (64 cores)**
![l3_perf_zen2_jc4ic4jr4_nt64](graphs/large/l3_perf_zen2_jc4ic4jr4_nt64.png)
* **Zen2 multithreaded (128 cores)**
![l3_perf_zen2_jc8ic4jr4_nt128](graphs/large/l3_perf_zen2_jc8ic4jr4_nt128.png)

---

@@ -12,9 +12,12 @@
* **[Haswell](PerformanceSmall.md#haswell)**
  * **[Experiment details](PerformanceSmall.md#haswell-experiment-details)**
  * **[Results](PerformanceSmall.md#haswell-results)**
* **[Epyc](PerformanceSmall.md#epyc)**
  * **[Experiment details](PerformanceSmall.md#epyc-experiment-details)**
  * **[Results](PerformanceSmall.md#epyc-results)**
* **[Zen](PerformanceSmall.md#zen)**
  * **[Experiment details](PerformanceSmall.md#zen-experiment-details)**
  * **[Results](PerformanceSmall.md#zen-results)**
* **[Zen2](PerformanceSmall.md#zen2)**
  * **[Experiment details](PerformanceSmall.md#zen2-experiment-details)**
  * **[Results](PerformanceSmall.md#zen2-results)**
* **[Feedback](PerformanceSmall.md#feedback)**

# Introduction

@@ -295,9 +298,9 @@ The `runthese.m` file will contain example invocations of the function.

---

## Epyc
## Zen

### Epyc experiment details
### Zen experiment details

* Location: Oracle cloud
* Processor model: AMD Epyc 7551 (Zen1)

@@ -318,7 +321,7 @@ The `runthese.m` file will contain example invocations of the function.
  * BLIS 90db88e (0.6.1-8)
    * configured with `./configure --enable-cblas auto` (single-threaded)
    * configured with `./configure --enable-cblas -t openmp auto` (multithreaded)
    * sub-configuration exercised: `haswell`
    * sub-configuration exercised: `zen`
    * Multithreaded (32 cores) execution requested via `export BLIS_NUM_THREADS=32`
  * OpenBLAS 0.3.8
    * configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0 USE_LOCKING=1` (single-threaded)

@@ -357,25 +360,122 @@ The `runthese.m` file will contain example invocations of the function.
* Comments:
  * libxsmm is highly competitive for very small problems, but quickly gives up once the "large" dimension exceeds about 180-240 (or 64 in the case where all operands are square). Also, libxsmm's `gemm` cannot handle a transposition on matrix A and similarly dispatches the fallback implementation for those cases. libxsmm also does not export CBLAS interfaces, and therefore only appears on the graphs for column-stored matrices.

### Epyc results
### Zen results

#### pdf

* [Epyc single-threaded row-stored](graphs/sup/dgemm_rrr_epyc_nt1.pdf)
* [Epyc single-threaded column-stored](graphs/sup/dgemm_ccc_epyc_nt1.pdf)
* [Epyc multithreaded (32 cores) row-stored](graphs/sup/dgemm_rrr_epyc_nt32.pdf)
* [Epyc multithreaded (32 cores) column-stored](graphs/sup/dgemm_ccc_epyc_nt32.pdf)
* [Zen single-threaded row-stored](graphs/sup/dgemm_rrr_zen_nt1.pdf)
* [Zen single-threaded column-stored](graphs/sup/dgemm_ccc_zen_nt1.pdf)
* [Zen multithreaded (32 cores) row-stored](graphs/sup/dgemm_rrr_zen_nt32.pdf)
* [Zen multithreaded (32 cores) column-stored](graphs/sup/dgemm_ccc_zen_nt32.pdf)

#### png (inline)

* **Epyc single-threaded row-stored**
![dgemm_rrr_epyc_nt1](graphs/sup/dgemm_rrr_epyc_nt1.png)
* **Epyc single-threaded column-stored**
![dgemm_ccc_epyc_nt1](graphs/sup/dgemm_ccc_epyc_nt1.png)
* **Epyc multithreaded (32 cores) row-stored**
![dgemm_rrr_epyc_nt32](graphs/sup/dgemm_rrr_epyc_nt32.png)
* **Epyc multithreaded (32 cores) column-stored**
![dgemm_ccc_epyc_nt32](graphs/sup/dgemm_ccc_epyc_nt32.png)
* **Zen single-threaded row-stored**
![dgemm_rrr_zen_nt1](graphs/sup/dgemm_rrr_zen_nt1.png)
* **Zen single-threaded column-stored**
![dgemm_ccc_zen_nt1](graphs/sup/dgemm_ccc_zen_nt1.png)
* **Zen multithreaded (32 cores) row-stored**
![dgemm_rrr_zen_nt32](graphs/sup/dgemm_rrr_zen_nt32.png)
* **Zen multithreaded (32 cores) column-stored**
![dgemm_ccc_zen_nt32](graphs/sup/dgemm_ccc_zen_nt32.png)

---

## Zen2

### Zen2 experiment details

* Location: Oracle cloud
* Processor model: AMD Epyc 7742 (Zen2 "Rome")
* Core topology: two sockets, 8 Core Complex Dies (CCDs) per socket, 2 Core Complexes (CCX) per CCD, 4 cores per CCX, 128 cores total
* SMT status: enabled, but not utilized
* Max clock rate: 2.25GHz (base, documented); 3.4GHz boost (single-core, documented); 2.6GHz boost (multicore, estimated)
* Max vector register length: 256 bits (AVX2)
* Max FMA vector IPC: 2
  * Alternatively, FMA vector IPC is 4 when vectors are limited to 128 bits each.
* Peak performance:
  * single-core: 54.4 GFLOPS (double-precision), 108.8 GFLOPS (single-precision)
  * multicore (estimated): 41.6 GFLOPS/core (double-precision), 83.2 GFLOPS/core (single-precision)
* Operating system: Ubuntu 18.04 (Linux kernel 4.15.0)
* Page size: 4096 bytes
* Compiler: gcc 9.3.0
* Results gathered: 8 October 2020
* Implementations tested:
  * BLIS a0849d3 (0.7.0-67)
    * configured with `./configure --enable-cblas auto` (single-threaded)
    * configured with `./configure --enable-cblas -t openmp auto` (multithreaded)
    * sub-configuration exercised: `zen2`
    * Multithreaded (32 cores) execution requested via `export BLIS_NUM_THREADS=32`
  * OpenBLAS 0.3.10
    * configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0 USE_LOCKING=1` (single-threaded)
    * configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=32` (multithreaded)
    * Multithreaded (32 cores) execution requested via `export OPENBLAS_NUM_THREADS=32`
  * BLASFEO 5b26d40
    * configured `Makefile.rule` with: `BLAS_API=1 FORTRAN_BLAS_API=1 CBLAS_API=1`.
    * built BLAS library via `make CC=gcc`
  * Eigen 3.3.90
    * Obtained via the [Eigen GitLab homepage](https://gitlab.com/libeigen/eigen) (24 September 2020)
    * Prior to compilation, modified top-level `CMakeLists.txt` to ensure that `-march=native` was added to the `CXX_FLAGS` variable (h/t Sameer Agarwal):
      ```
      # These lines added after line 60.
      check_cxx_compiler_flag("-march=native" COMPILER_SUPPORTS_MARCH_NATIVE)
      if(COMPILER_SUPPORTS_MARCH_NATIVE)
        set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=native")
      endif()
      ```
    * configured and built BLAS library via `mkdir build; cd build; CC=gcc cmake ..; make blas`
    * installed headers via `cmake . -DCMAKE_INSTALL_PREFIX=$HOME/flame/eigen; make install`
    * The `gemm` implementation was pulled in at compile-time via Eigen headers; other operations were linked to Eigen's BLAS library.
    * Single-threaded (1 core) execution requested via `export OMP_NUM_THREADS=1`
    * Multithreaded (32 cores) execution requested via `export OMP_NUM_THREADS=32`
  * MKL 2020 update 3
    * Single-threaded (1 core) execution requested via `export MKL_NUM_THREADS=1`
    * Multithreaded (32 cores) execution requested via `export MKL_NUM_THREADS=32`
  * libxsmm f0ab9cb (post-1.16.1)
    * compiled with `make AVX=2`; linked with [netlib BLAS](http://www.netlib.org/blas/) 3.6.0 as the fallback library to better show where libxsmm stops handling the computation internally.
* Affinity:
  * Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0-31"`. However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` environment variable was unset.
  * All executables were run through `numactl --interleave=all`.
* Frequency throttling (via `cpupower`):
  * Driver: acpi-cpufreq
  * Governor: performance
  * Hardware limits (steps): 1.5GHz, 2.0GHz, 2.25GHz
  * Adjusted minimum: 2.25GHz
* Comments:
  * None.
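Concretely, the affinity and NUMA settings described in the experiment details amount to an environment like the following sketch (the benchmark executable name is a placeholder, not a file from the BLIS tree):

```shell
# Pin BLIS's OpenMP threads to cores 0-31. Unset GOMP_CPU_AFFINITY before
# timing OpenBLAS, which may revert to single-threaded execution otherwise.
export GOMP_CPU_AFFINITY="0-31"
export BLIS_NUM_THREADS=32

# Interleave allocations across NUMA nodes for every benchmark run.
numactl --interleave=all ./test_gemm.x   # placeholder executable name
```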

### Zen2 results

#### pdf

* [Zen2 sgemm single-threaded row-stored](graphs/sup/sgemm_rrr_zen2_nt1.pdf)
* [Zen2 sgemm single-threaded column-stored](graphs/sup/sgemm_ccc_zen2_nt1.pdf)
* [Zen2 dgemm single-threaded row-stored](graphs/sup/dgemm_rrr_zen2_nt1.pdf)
* [Zen2 dgemm single-threaded column-stored](graphs/sup/dgemm_ccc_zen2_nt1.pdf)
* [Zen2 sgemm multithreaded (32 cores) row-stored](graphs/sup/sgemm_rrr_zen2_nt32.pdf)
* [Zen2 sgemm multithreaded (32 cores) column-stored](graphs/sup/sgemm_ccc_zen2_nt32.pdf)
* [Zen2 dgemm multithreaded (32 cores) row-stored](graphs/sup/dgemm_rrr_zen2_nt32.pdf)
* [Zen2 dgemm multithreaded (32 cores) column-stored](graphs/sup/dgemm_ccc_zen2_nt32.pdf)

#### png (inline)

* **Zen2 sgemm single-threaded row-stored**
![sgemm_rrr_zen2_nt1](graphs/sup/sgemm_rrr_zen2_nt1.png)
* **Zen2 sgemm single-threaded column-stored**
![sgemm_ccc_zen2_nt1](graphs/sup/sgemm_ccc_zen2_nt1.png)
* **Zen2 dgemm single-threaded row-stored**
![dgemm_rrr_zen2_nt1](graphs/sup/dgemm_rrr_zen2_nt1.png)
* **Zen2 dgemm single-threaded column-stored**
![dgemm_ccc_zen2_nt1](graphs/sup/dgemm_ccc_zen2_nt1.png)
* **Zen2 sgemm multithreaded (32 cores) row-stored**
![sgemm_rrr_zen2_nt32](graphs/sup/sgemm_rrr_zen2_nt32.png)
* **Zen2 sgemm multithreaded (32 cores) column-stored**
![sgemm_ccc_zen2_nt32](graphs/sup/sgemm_ccc_zen2_nt32.png)
* **Zen2 dgemm multithreaded (32 cores) row-stored**
![dgemm_rrr_zen2_nt32](graphs/sup/dgemm_rrr_zen2_nt32.png)
* **Zen2 dgemm multithreaded (32 cores) column-stored**
![dgemm_ccc_zen2_nt32](graphs/sup/dgemm_ccc_zen2_nt32.png)

---

Binary files changed in this merge:

* docs/graphs/large/l3_perf_zen2_nt1.pdf / .png (new files; png 228 KiB)
* docs/graphs/large/l3_perf_zen2_jc4ic4jr4_nt64.pdf / .png (new files; png 243 KiB)
* docs/graphs/large/l3_perf_zen2_jc8ic4jr4_nt128.pdf / .png (new files; png 241 KiB)
* three existing large-graph images carried over unchanged (108 KiB, 115 KiB, 78 KiB)
* docs/graphs/sup/dgemm_ccc_zen2_nt1.pdf / .png (new files; png 346 KiB)
* docs/graphs/sup/dgemm_ccc_zen2_nt32.pdf / .png (new files; png 208 KiB)
* docs/graphs/sup/dgemm_rrr_zen2_nt1.pdf / .png (new files; png 326 KiB)
* docs/graphs/sup/dgemm_rrr_zen2_nt32.pdf / .png (new files; png 206 KiB)
* docs/graphs/sup/sgemm_ccc_zen2_nt1.pdf / .png (new files; png 310 KiB)
* docs/graphs/sup/sgemm_ccc_zen2_nt32.pdf / .png (new files; png 183 KiB)
* docs/graphs/sup/sgemm_rrr_zen2_nt1.pdf / .png (new files; png 303 KiB)
* docs/graphs/sup/sgemm_rrr_zen2_nt32.pdf / .png (new files; png 185 KiB)
* four existing sup-graph images carried over unchanged (236 KiB, 180 KiB, 222 KiB, 179 KiB)