Added sup performance graphs/document to 'docs'.

Details:
- Added a new markdown document, docs/PerformanceSmall.md, which publishes new performance graphs for Kaby Lake and Epyc showcasing the new BLIS sup (small/skinny/unpacked) framework logic and kernels. For now, only single-threaded dgemm performance is shown.
- Reorganized graphs in docs/graphs into docs/graphs/large, with new graphs being placed in docs/graphs/sup.
- Updates to scripts in test/sup/octave, mostly to allow decent output in both GNU octave and Matlab.
- Updated README.md to mention and refer to the new PerformanceSmall.md document.
2026-04-19 23:28:52 +00:00 · 2019-06-03 16:53:19 -05:00
parent 5e03ca6fc7
commit 55e7b045c3
38 changed files with 292 additions and 43 deletions
--- a/README.md
+++ b/README.md
@@ -79,6 +79,17 @@ and [other educational projects](http://www.ulaff.net/) (such as MOOCs).
 What's New
 ----------

+ * **Small/skinny matrix support for dgemm now available!** Thanks to funding
+from AMD, we have dramatically accelerated `gemm` for double-precision real
+matrix problems where one or two dimensions are exceedingly small. A natural
+byproduct of this optimization is that the traditional case of small _m = n = k_
+(i.e. square matrices) is also accelerated, even though it was not targeted
+specifically. And though only `dgemm` was optimized for now, support for other
+datatypes, other operations, and/or multithreading may be implemented in the
+future. We've also added a new [PerformanceSmall](docs/PerformanceSmall.md)
+document to showcase the improvement in performance when some matrix dimensions
+are small.
+
 * **Performance comparisons now available!** We recently measured the
 performance of various level-3 operations on a variety of hardware architectures,
 as implemented within BLIS and other BLAS libraries for all four of the standard
@@ -329,6 +340,11 @@ measured performance of a representative set of level-3 operations on a variety
 of hardware architectures, as implemented within BLIS and other BLAS libraries
 for all four of the standard floating-point datatypes.

+ * **[PerformanceSmall](docs/PerformanceSmall.md).** This document reports
+empirically measured performance of `gemm` on select hardware architectures
+within BLIS and other BLAS libraries when performing matrix problems where one
+or two dimensions are exceedingly small.
+
 * **[Release Notes](docs/ReleaseNotes.md).** This document tracks a summary of
 changes included with each new version of BLIS, along with contributor credits
 for key features.
--- a/docs/Performance.md
+++ b/docs/Performance.md
@@ -21,11 +21,13 @@
 # Introduction

 This document showcases performance results for a representative sample of
-level-3 operations with BLIS and BLAS for several hardware architectures.
+level-3 operations on large matrices with BLIS and BLAS for several hardware
+architectures.

 # General information

-Generally speaking, we publish three "panels" for each type of hardware,
+Generally speaking, for level-3 operations on large matrices, we publish three
+"panels" for each type of hardware,
 each of which reports one of: single-threaded performance, multithreaded
 performance on a single socket, or multithreaded performance on two sockets.
 Each panel will consist of a 4x5 grid of graphs, with each row representing
@@ -155,18 +157,18 @@ size of interest so that we can better assist you.

 #### pdf

-* [ThunderX2 single-threaded](graphs/l3_perf_tx2_nt1.pdf)
-* [ThunderX2 multithreaded (28 cores)](graphs/l3_perf_tx2_jc4ic7_nt28.pdf)
-* [ThunderX2 multithreaded (56 cores)](graphs/l3_perf_tx2_jc8ic7_nt56.pdf)
+* [ThunderX2 single-threaded](graphs/large/l3_perf_tx2_nt1.pdf)
+* [ThunderX2 multithreaded (28 cores)](graphs/large/l3_perf_tx2_jc4ic7_nt28.pdf)
+* [ThunderX2 multithreaded (56 cores)](graphs/large/l3_perf_tx2_jc8ic7_nt56.pdf)

 #### png (inline)

 * **ThunderX2 single-threaded**
-![single-threaded](graphs/l3_perf_tx2_nt1.png)
+![single-threaded](graphs/large/l3_perf_tx2_nt1.png)
 * **ThunderX2 multithreaded (28 cores)**
-![multithreaded (28 cores)](graphs/l3_perf_tx2_jc4ic7_nt28.png)
+![multithreaded (28 cores)](graphs/large/l3_perf_tx2_jc4ic7_nt28.png)
 * **ThunderX2 multithreaded (56 cores)**
-![multithreaded (56 cores)](graphs/l3_perf_tx2_jc8ic7_nt56.png)
+![multithreaded (56 cores)](graphs/large/l3_perf_tx2_jc8ic7_nt56.png)

 ---

@@ -227,18 +229,18 @@ size of interest so that we can better assist you.

 #### pdf

-* [SkylakeX single-threaded](graphs/l3_perf_skx_nt1.pdf)
-* [SkylakeX multithreaded (26 cores)](graphs/l3_perf_skx_jc2ic13_nt26.pdf)
-* [SkylakeX multithreaded (52 cores)](graphs/l3_perf_skx_jc4ic13_nt52.pdf)
+* [SkylakeX single-threaded](graphs/large/l3_perf_skx_nt1.pdf)
+* [SkylakeX multithreaded (26 cores)](graphs/large/l3_perf_skx_jc2ic13_nt26.pdf)
+* [SkylakeX multithreaded (52 cores)](graphs/large/l3_perf_skx_jc4ic13_nt52.pdf)

 #### png (inline)

 * **SkylakeX single-threaded**
-![single-threaded](graphs/l3_perf_skx_nt1.png)
+![single-threaded](graphs/large/l3_perf_skx_nt1.png)
 * **SkylakeX multithreaded (26 cores)**
-![multithreaded (26 cores)](graphs/l3_perf_skx_jc2ic13_nt26.png)
+![multithreaded (26 cores)](graphs/large/l3_perf_skx_jc2ic13_nt26.png)
 * **SkylakeX multithreaded (52 cores)**
-![multithreaded (52 cores)](graphs/l3_perf_skx_jc4ic13_nt52.png)
+![multithreaded (52 cores)](graphs/large/l3_perf_skx_jc4ic13_nt52.png)

 ---

@@ -296,18 +298,18 @@ size of interest so that we can better assist you.

 #### pdf

-* [Haswell single-threaded](graphs/l3_perf_has_nt1.pdf)
-* [Haswell multithreaded (12 cores)](graphs/l3_perf_has_jc2ic3jr2_nt12.pdf)
-* [Haswell multithreaded (24 cores)](graphs/l3_perf_has_jc4ic3jr2_nt24.pdf)
+* [Haswell single-threaded](graphs/large/l3_perf_has_nt1.pdf)
+* [Haswell multithreaded (12 cores)](graphs/large/l3_perf_has_jc2ic3jr2_nt12.pdf)
+* [Haswell multithreaded (24 cores)](graphs/large/l3_perf_has_jc4ic3jr2_nt24.pdf)

 #### png (inline)

 * **Haswell single-threaded**
-![single-threaded](graphs/l3_perf_has_nt1.png)
+![single-threaded](graphs/large/l3_perf_has_nt1.png)
 * **Haswell multithreaded (12 cores)**
-![multithreaded (12 cores)](graphs/l3_perf_has_jc2ic3jr2_nt12.png)
+![multithreaded (12 cores)](graphs/large/l3_perf_has_jc2ic3jr2_nt12.png)
 * **Haswell multithreaded (24 cores)**
-![multithreaded (24 cores)](graphs/l3_perf_has_jc4ic3jr2_nt24.png)
+![multithreaded (24 cores)](graphs/large/l3_perf_has_jc4ic3jr2_nt24.png)

 ---

@@ -369,18 +371,18 @@ size of interest so that we can better assist you.

 #### pdf

-* [Epyc single-threaded](graphs/l3_perf_epyc_nt1.pdf)
-* [Epyc multithreaded (32 cores)](graphs/l3_perf_epyc_jc1ic8jr4_nt32.pdf)
-* [Epyc multithreaded (64 cores)](graphs/l3_perf_epyc_jc2ic8jr4_nt64.pdf)
+* [Epyc single-threaded](graphs/large/l3_perf_epyc_nt1.pdf)
+* [Epyc multithreaded (32 cores)](graphs/large/l3_perf_epyc_jc1ic8jr4_nt32.pdf)
+* [Epyc multithreaded (64 cores)](graphs/large/l3_perf_epyc_jc2ic8jr4_nt64.pdf)

 #### png (inline)

 * **Epyc single-threaded**
-![single-threaded](graphs/l3_perf_epyc_nt1.png)
+![single-threaded](graphs/large/l3_perf_epyc_nt1.png)
 * **Epyc multithreaded (32 cores)**
-![multithreaded (32 cores)](graphs/l3_perf_epyc_jc1ic8jr4_nt32.png)
+![multithreaded (32 cores)](graphs/large/l3_perf_epyc_jc1ic8jr4_nt32.png)
 * **Epyc multithreaded (64 cores)**
-![multithreaded (64 cores)](graphs/l3_perf_epyc_jc2ic8jr4_nt64.png)
+![multithreaded (64 cores)](graphs/large/l3_perf_epyc_jc2ic8jr4_nt64.png)

 ---

--- a/docs/PerformanceSmall.md
+++ b/docs/PerformanceSmall.md
@@ -0,0 +1,219 @@
+# Contents
+
+* **[Contents](PerformanceSmall.md#contents)**
+* **[Introduction](PerformanceSmall.md#introduction)**
+* **[General information](PerformanceSmall.md#general-information)**
+* **[Level-3 performance](PerformanceSmall.md#level-3-performance)**
+  * **[Kaby Lake](PerformanceSmall.md#kaby-lake)**
+    * **[Experiment details](PerformanceSmall.md#kaby-lake-experiment-details)**
+    * **[Results](PerformanceSmall.md#kaby-lake-results)**
+  * **[Epyc](PerformanceSmall.md#epyc)**
+    * **[Experiment details](PerformanceSmall.md#epyc-experiment-details)**
+    * **[Results](PerformanceSmall.md#epyc-results)**
+* **[Feedback](PerformanceSmall.md#feedback)**
+
+# Introduction
+
+This document showcases performance results for the level-3 `gemm` operation
+on small matrices with BLIS and BLAS for select hardware architectures.
+
+# General information
+
+Generally speaking, for level-3 operations on small matrices, we publish 
+two "panels" for each type of hardware, one that reflects performance on
+row-stored matrices and another for column-stored matrices.
+Each panel will consist of a 4x7 grid of graphs, with each row representing
+a different transposition case (`nn`, `nt`, `tn`, `tt`)
+and each column representing a different shape scenario, usually
+with one or two matrix dimensions bound to a fixed size for all problem
+sizes tested.
+Each of the 28 graphs within a panel will contain an x-axis that reports
+problem size, with one, two, or all three matrix dimensions equal to the
+problem size (e.g. _m_ = 6; _n_ = _k_, also encoded as `m6npkp`).
+The y-axis will report performance in units of GFLOPS (billions of
+floating-point operations per second) on a single core.
+
+It's also worth pointing out that the top of each graph (e.g. the maximum
+y-axis value depicted) _always_ corresponds to the theoretical peak performance
+under the conditions associated with that graph.
+Theoretical peak performance, in units of GFLOPS, is calculated as the
+product of:
+1. the maximum sustainable clock rate in GHz; and
+2. the maximum number of floating-point operations (flops) that can be
+executed per cycle.
+
+Note that the maximum sustainable clock rate may change depending on the
+conditions.
+For example, on some systems the maximum clock rate is higher when only one
+core is active (e.g. single-threaded performance) versus when all cores are
+active (e.g. multithreaded performance).
+The maximum number of flops executable per cycle (per core) is generally
+computed as the product of:
+1. the maximum number of fused multiply-add (FMA) vector instructions that
+can be issued per cycle (per core);
+2. the maximum number of elements that can be stored within a single vector
+register (for the datatype in question); and
+3. 2.0, since an FMA instruction fuses two operations (a multiply and an add).
+
+The problem size range, represented on the x-axis, is usually sampled in
+increments of 4 up to 800 for the cases where one or two dimensions are small
+(and constant)
+and up to 400 in the case where all dimensions (m, n, and k) are bound to the
+problem size (i.e., square matrices).
+
+Note that the constant small matrix dimensions were chosen to be _very_
+small--in the neighborhood of 8--intentionally to showcase what happens when
+at least one of the matrices is abnormally "skinny." Typically, organizations
+and individuals only publish performance with square matrices, which can miss
+the problem sizes of interest to many applications. Here, in addition to square
+matrices (shown in the seventh column), we also show six other scenarios where
+one or two `gemm` dimensions (of m, n, and k) are small.
+
+The legend in each graph contains two entries for BLIS, corresponding to the
+two black lines, one solid and one dotted. The dotted line ("BLIS conv")
+represents the conventional implementation that targets large matrices. This
+was the only implementation available in BLIS prior to the addition of
+small/skinny matrix support. The solid line ("BLIS sup") makes use of the
+new small/skinny matrix implementation for certain small problems. Whenever
+these results differ by any significant amount (beyond noise), it denotes a
+problem size for which BLIS employed the new small/skinny implementation.
+Put another way, the delta between these two lines represents the performance
+improvement between BLIS's previous status quo and the new regime.
+
+Finally, each point along each curve represents the best of three trials.
+
+# Interpretation
+
+In general, the curves associated with higher-performing implementations
+will appear higher in the graphs than lower-performing implementations.
+Ideally, an implementation will climb in performance (as a function of problem
+size) as quickly as possible and asymptotically approach some high fraction of
+peak performance.
+
+When corresponding with us, via email or when opening an
+[issue](https://github.com/flame/blis/issues) on GitHub, we kindly ask that
+you specify as closely as possible (though a range is fine) your problem
+size of interest so that we can better assist you.
+
+# Level-3 performance
+
+## Kaby Lake
+
+### Kaby Lake experiment details
+
+* Location: undisclosed
+* Processor model: Intel Core i5-7500 (Kaby Lake)
+* Core topology: one socket, 4 cores total
+* SMT status: unavailable
+* Max clock rate: 3.8GHz (single-core)
+* Max vector register length: 256 bits (AVX2)
+* Max FMA vector IPC: 2
+* Peak performance:
+  * single-core: 57.6 GFLOPS (double-precision), 115.2 GFLOPS (single-precision)
+* Operating system: Gentoo Linux (Linux kernel 5.0.7)
+* Compiler: gcc 7.3.0
+* Results gathered: 31 May 2019
+* Implementations tested:
+  * BLIS 6bf449c (0.5.2-42)
+    * configured with `./configure --enable-cblas auto`
+    * sub-configuration exercised: `haswell`
+  * OpenBLAS 0.3.6
+    * configured with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0` (single-threaded)
+  * Eigen 3.3.90
+    * Obtained via the [Eigen git mirror](https://github.com/eigenteam/eigen-git-mirror) (30 May 2019)
+    * Prior to compilation, modified top-level `CMakeLists.txt` to ensure that `-march=native` was added to `CXX_FLAGS` variable (h/t Sameer Agarwal).
+    * configured and built BLAS library via `mkdir build; cd build; cmake ..; make blas`
+    * The `gemm` implementation was pulled in at compile-time via Eigen headers; other operations were linked to Eigen's BLAS library.
+    * Requested threading via `export OMP_NUM_THREADS=1` (single-threaded)
+  * MKL 2018 update 4
+    * Requested threading via `export MKL_NUM_THREADS=1` (single-threaded)
+* Affinity:
+  * N/A.
+* Frequency throttling (via `cpupower`):
+  * Driver: intel_pstate
+  * Governor: performance
+  * Hardware limits: 800MHz - 3.8GHz
+  * Adjusted minimum: 3.7GHz
+* Comments:
+  * 
+
+### Kaby Lake results
+
+#### pdf
+
+* [Kaby Lake row-stored](graphs/sup/dgemm_rrr_kbl_nt1.pdf)
+* [Kaby Lake column-stored](graphs/sup/dgemm_ccc_kbl_nt1.pdf)
+
+#### png (inline)
+
+* **Kaby Lake row-stored**
+![row-stored](graphs/sup/dgemm_rrr_kbl_nt1.png)
+* **Kaby Lake column-stored**
+![column-stored](graphs/sup/dgemm_ccc_kbl_nt1.png)
+
+---
+
+## Epyc
+
+### Epyc experiment details
+
+* Location: Oracle cloud
+* Processor model: AMD Epyc 7551 (Zen1)
+* Core topology: two sockets, 4 dies per socket, 2 core complexes (CCX) per die, 4 cores per CCX, 64 cores total
+* SMT status: enabled, but not utilized
+* Max clock rate: 3.0GHz (single-core), 2.55GHz (multicore)
+* Max vector register length: 256 bits (AVX2)
+* Max FMA vector IPC: 1
+  * Alternatively, FMA vector IPC is 2 when vectors are limited to 128 bits each.
+* Peak performance:
+  * single-core: 24 GFLOPS (double-precision), 48 GFLOPS (single-precision)
+* Operating system: Ubuntu 18.04 (Linux kernel 4.15.0)
+* Compiler: gcc 7.3.0
+* Results gathered: 31 May 2019
+* Implementations tested:
+  * BLIS 6bf449c (0.5.2-42)
+    * configured with `./configure --enable-cblas auto`
+    * sub-configuration exercised: `zen`
+  * OpenBLAS 0.3.6
+    * configured with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0` (single-threaded)
+  * Eigen 3.3.90
+    * Obtained via the [Eigen git mirror](https://github.com/eigenteam/eigen-git-mirror) (30 May 2019)
+    * Prior to compilation, modified top-level `CMakeLists.txt` to ensure that `-march=native` was added to `CXX_FLAGS` variable (h/t Sameer Agarwal).
+    * configured and built BLAS library via `mkdir build; cd build; cmake ..; make blas`
+    * The `gemm` implementation was pulled in at compile-time via Eigen headers; other operations were linked to Eigen's BLAS library.
+    * Requested threading via `export OMP_NUM_THREADS=1` (single-threaded)
+  * MKL 2019 update 4
+    * Requested threading via `export MKL_NUM_THREADS=1` (single-threaded)
+* Affinity:
+  * None.
+  * Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0 1 2 3 ... 63"`. However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` environment variable was unset.
+* Frequency throttling (via `cpupower`):
+  * Driver: acpi-cpufreq
+  * Governor: performance
+  * Hardware limits: 1.2GHz - 2.0GHz
+  * Adjusted minimum: 2.0GHz
+* Comments:
+  * 
+
+### Epyc results
+
+#### pdf
+
+* [Epyc row-stored](graphs/sup/dgemm_rrr_epyc_nt1.pdf)
+* [Epyc column-stored](graphs/sup/dgemm_ccc_epyc_nt1.pdf)
+
+#### png (inline)
+
+* **Epyc row-stored**
+![row-stored](graphs/sup/dgemm_rrr_epyc_nt1.png)
+* **Epyc column-stored**
+![column-stored](graphs/sup/dgemm_ccc_epyc_nt1.png)
+
+---
+
+# Feedback
+
+Please let us know what you think of these performance results! Similarly, if you have any questions or concerns, or are interested in reproducing these performance experiments on your own hardware, we invite you to [open an issue](https://github.com/flame/blis/issues) and start a conversation with BLIS developers.
+
+Thanks for your interest in BLIS!
+
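
The shape encoding and problem-size sampling described in `docs/PerformanceSmall.md` can be illustrated with a short Python sketch. The helper names (`parse_shape`, `dims_for`, `problem_sizes`, `gemm_gflops`) are hypothetical and are not part of BLIS. The shape-code grammar is inferred from the single example `m6npkp` given in the document, and the `2*m*n*k` flop count is the usual convention for `gemm`, assumed here rather than taken from the BLIS test driver.

```python
import re

# Assumed grammar, inferred from the one example `m6npkp`: each of m, n, k is
# followed either by a fixed size (digits) or by `p` (bound to the problem size).
_SHAPE_RE = re.compile(r"m(\d+|p)n(\d+|p)k(\d+|p)")


def parse_shape(code):
    """Return (m, n, k), where None marks a dimension bound to the problem size."""
    match = _SHAPE_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"unrecognized shape code: {code!r}")
    return tuple(None if field == "p" else int(field) for field in match.groups())


def dims_for(code, p):
    """Concrete (m, n, k) for shape `code` at problem size `p`."""
    return tuple(p if fixed is None else fixed for fixed in parse_shape(code))


def problem_sizes(first=4, last=800, step=4):
    """Sampled problem sizes: increments of 4 up to 800 (use last=400 for square)."""
    return range(first, last + 1, step)


def gemm_gflops(m, n, k, seconds):
    """GFLOPS for one gemm call, counting 2*m*n*k flops (assumed convention)."""
    return 2.0 * m * n * k / seconds / 1e9


def best_gflops(m, n, k, trial_seconds):
    """Best of several trials, i.e. the one with the shortest run time."""
    return gemm_gflops(m, n, k, min(trial_seconds))
```

Under these assumptions, `dims_for("m6npkp", 200)` yields `(6, 200, 200)`, and `problem_sizes()` covers 200 sample points from 4 through 800. Taking the minimum time over the trials mirrors the "best of three trials" rule stated in the document.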
--- a/docs/graphs/large/l3_perf_epyc_jc1ic8jr4_nt32.pdf
+++ b/docs/graphs/large/l3_perf_epyc_jc1ic8jr4_nt32.pdf
--- a/docs/graphs/large/l3_perf_epyc_jc1ic8jr4_nt32.png
+++ b/docs/graphs/large/l3_perf_epyc_jc1ic8jr4_nt32.png
--- a/docs/graphs/large/l3_perf_epyc_jc2ic8jr4_nt64.pdf
+++ b/docs/graphs/large/l3_perf_epyc_jc2ic8jr4_nt64.pdf
--- a/docs/graphs/large/l3_perf_epyc_jc2ic8jr4_nt64.png
+++ b/docs/graphs/large/l3_perf_epyc_jc2ic8jr4_nt64.png
--- a/docs/graphs/large/l3_perf_epyc_nt1.pdf
+++ b/docs/graphs/large/l3_perf_epyc_nt1.pdf
--- a/docs/graphs/large/l3_perf_epyc_nt1.png
+++ b/docs/graphs/large/l3_perf_epyc_nt1.png
--- a/docs/graphs/large/l3_perf_has_jc2ic3jr2_nt12.pdf
+++ b/docs/graphs/large/l3_perf_has_jc2ic3jr2_nt12.pdf
--- a/docs/graphs/large/l3_perf_has_jc2ic3jr2_nt12.png
+++ b/docs/graphs/large/l3_perf_has_jc2ic3jr2_nt12.png
--- a/docs/graphs/large/l3_perf_has_jc4ic3jr2_nt24.pdf
+++ b/docs/graphs/large/l3_perf_has_jc4ic3jr2_nt24.pdf
--- a/docs/graphs/large/l3_perf_has_jc4ic3jr2_nt24.png
+++ b/docs/graphs/large/l3_perf_has_jc4ic3jr2_nt24.png
--- a/docs/graphs/large/l3_perf_has_nt1.pdf
+++ b/docs/graphs/large/l3_perf_has_nt1.pdf
--- a/docs/graphs/large/l3_perf_has_nt1.png
+++ b/docs/graphs/large/l3_perf_has_nt1.png
--- a/docs/graphs/large/l3_perf_skx_jc2ic13_nt26.pdf
+++ b/docs/graphs/large/l3_perf_skx_jc2ic13_nt26.pdf
--- a/docs/graphs/large/l3_perf_skx_jc2ic13_nt26.png
+++ b/docs/graphs/large/l3_perf_skx_jc2ic13_nt26.png
--- a/docs/graphs/large/l3_perf_skx_jc4ic13_nt52.pdf
+++ b/docs/graphs/large/l3_perf_skx_jc4ic13_nt52.pdf
--- a/docs/graphs/large/l3_perf_skx_jc4ic13_nt52.png
+++ b/docs/graphs/large/l3_perf_skx_jc4ic13_nt52.png
--- a/docs/graphs/large/l3_perf_skx_nt1.pdf
+++ b/docs/graphs/large/l3_perf_skx_nt1.pdf
--- a/docs/graphs/large/l3_perf_skx_nt1.png
+++ b/docs/graphs/large/l3_perf_skx_nt1.png
--- a/docs/graphs/large/l3_perf_tx2_jc4ic7_nt28.pdf
+++ b/docs/graphs/large/l3_perf_tx2_jc4ic7_nt28.pdf
--- a/docs/graphs/large/l3_perf_tx2_jc4ic7_nt28.png
+++ b/docs/graphs/large/l3_perf_tx2_jc4ic7_nt28.png
--- a/docs/graphs/large/l3_perf_tx2_jc8ic7_nt56.pdf
+++ b/docs/graphs/large/l3_perf_tx2_jc8ic7_nt56.pdf
--- a/docs/graphs/large/l3_perf_tx2_jc8ic7_nt56.png
+++ b/docs/graphs/large/l3_perf_tx2_jc8ic7_nt56.png
--- a/docs/graphs/large/l3_perf_tx2_nt1.pdf
+++ b/docs/graphs/large/l3_perf_tx2_nt1.pdf
--- a/docs/graphs/large/l3_perf_tx2_nt1.png
+++ b/docs/graphs/large/l3_perf_tx2_nt1.png
--- a/docs/graphs/sup/dgemm_ccc_epyc_nt1.pdf
+++ b/docs/graphs/sup/dgemm_ccc_epyc_nt1.pdf
--- a/docs/graphs/sup/dgemm_ccc_epyc_nt1.png
+++ b/docs/graphs/sup/dgemm_ccc_epyc_nt1.png
--- a/docs/graphs/sup/dgemm_ccc_kbl_nt1.pdf
+++ b/docs/graphs/sup/dgemm_ccc_kbl_nt1.pdf
--- a/docs/graphs/sup/dgemm_ccc_kbl_nt1.png
+++ b/docs/graphs/sup/dgemm_ccc_kbl_nt1.png
--- a/docs/graphs/sup/dgemm_rrr_epyc_nt1.pdf
+++ b/docs/graphs/sup/dgemm_rrr_epyc_nt1.pdf
--- a/docs/graphs/sup/dgemm_rrr_epyc_nt1.png
+++ b/docs/graphs/sup/dgemm_rrr_epyc_nt1.png
--- a/docs/graphs/sup/dgemm_rrr_kbl_nt1.pdf
+++ b/docs/graphs/sup/dgemm_rrr_kbl_nt1.pdf
--- a/docs/graphs/sup/dgemm_rrr_kbl_nt1.png
+++ b/docs/graphs/sup/dgemm_rrr_kbl_nt1.png
--- a/test/sup/octave/plot_l3sup_perf.m
+++ b/test/sup/octave/plot_l3sup_perf.m
@@ -8,7 +8,7 @@ function r_val = plot_l3sup_perf( opname, ...
                                  rows, cols, ...
                                  cfreq, ...
                                  dfps, ...
-                                  theid )
+                                  theid, impl )
 if ... %mod(theid-1,cols) == 2 || ...
   ... %mod(theid-1,cols) == 3 || ...
   ... %mod(theid-1,cols) == 4 || ...
@@ -178,11 +178,16 @@ if rows == 4 && cols == 7
 		'Location', legend_loc );
 		set( leg,'Box','off' );
 		set( leg,'Color','none' );
-		set( leg,'FontSize',fontsize ); %-3 );
 		set( leg,'Units','inches' );
 		%                    xpos ypos
 		%set( leg,'Position',[11.32 6.36 1.15 0.7 ] ); % (1,4tl)
+		if impl == 'octave'
+		set( leg,'FontSize',fontsize );
 		set( leg,'Position',[11.92 6.54 1.15 0.7 ] ); % (1,4tl)
+		else
+		set( leg,'FontSize',fontsize );
+		set( leg,'Position',[18.34 10.22 1.15 0.7 ] ); % (1,4tl)
+		end
 	elseif nth > 1 && theid == legend_plot_id
 	end
 end
@@ -195,7 +200,7 @@ box( ax1, 'on' );
 titl = title( titlename );
 set( titl, 'FontWeight', 'normal' ); % default font style is now 'bold'.

-if 1 == 1
+if impl == 'octave'
 tpos = get( titl, 'Position' ); % default is to align across whole figure, not box.
 tpos(1) = tpos(1) + -40;
 set( titl, 'Position', tpos ); % here we nudge it back to centered with box.
--- a/test/sup/octave/plot_panel_trxsh.m
+++ b/test/sup/octave/plot_panel_trxsh.m
@@ -1,5 +1,5 @@
 function r_val = plot_panel_trxsh ...
-     (
+     ( ...
       cfreq, ...
       dflopspercycle, ...
       nth, ...
@@ -9,7 +9,8 @@ function r_val = plot_panel_trxsh ...
       smalldims, ...
       dirpath, ...
       arch_str, ...
-       vend_str
+       vend_str, ...
+       impl ...
     )

 %cfreq = 1.8;
@@ -45,14 +46,14 @@ n_opsupnames = size( opsupnames, 1 );

 if 1 == 1
 	%fig = figure('Position', [100, 100, 2400, 1500]);
-	fig = figure('Position', [100, 100, 1860, 1000]);
+	fig = figure('Position', [100, 100, 2800, 1500]);
 	orient( fig, 'portrait' );
 	set(gcf,'PaperUnits', 'inches');
-	if 0 == 1 % matlab
-		set(gcf,'PaperSize', [11 17.5]);
-		set(gcf,'PaperPosition', [0 0 11 17.5]);
+	if impl == 'matlab'
+		set(gcf,'PaperSize', [11.5 20.4]);
+		set(gcf,'PaperPosition', [0 0 11.5 20.4]);
 		set(gcf,'PaperPositionMode','manual');
-	else % octave 4.x
+	else % impl == 'octave' % octave 4.x
 	   set(gcf,'PaperSize', [10 17.5]);
 	   set(gcf,'PaperPositionMode','auto');
 	end
@@ -123,9 +124,9 @@ for opi = 1:n_opsupnames
 	                 4, 7, ...
 	                 cfreq, ...
 	                 dflopspercycle, ...
-	                 opi );
+	                 opi, impl );

-	clear data_st_?gemm_*
+	clear data_st_*gemm_*;
 	clear data_blissup;
 	clear data_blislpab;
 	clear data_eigen;
@@ -137,12 +138,14 @@ for opi = 1:n_opsupnames
 end

 % Construct the name of the file to which we will output the graph.
-outfile = sprintf( 'fig_%s_%s_%s_nt%d.pdf', oproot, stor_str, arch_str, nth );
+outfile = sprintf( 'l3sup_%s_%s_%s_nt%d.pdf', oproot, stor_str, arch_str, nth );

 % Output the graph to pdf format.
 %print(gcf, 'gemm_md','-fillpage','-dpdf');
 %print(gcf, outfile,'-bestfit','-dpdf');
-if 1 == 1
+if impl == 'octave'
 print(gcf, outfile);
+else % if impl == 'matlab'
+print(gcf, outfile,'-bestfit','-dpdf');
 end

--- a/test/sup/octave/runme.m
+++ b/test/sup/octave/runme.m
@@ -1,4 +1,8 @@

-% has
-plot_panel_trxsh(3.5,16,1,'st','d','ccc',[ 6 8 4 ],'../results','has','MKL'); close; clear all;
-plot_panel_trxsh(3.5,16,1,'st','d','rrr',[ 6 8 4 ],'../results7','has','MKL'); close; clear all;
+% kabylake
+plot_panel_trxsh(3.6,16,1,'st','d','rrr',[ 6 8 4 ],'../results/kabylake/20190531/4_800_4_mt201_last400','kbl','MKL','matlab'); close; clear all;
+plot_panel_trxsh(3.6,16,1,'st','d','ccc',[ 6 8 4 ],'../results/kabylake/20190531/4_800_4_mt201_last400','kbl','MKL','matlab'); close; clear all;
+
+% epyc
+plot_panel_trxsh(3.0,8,1,'st','d','rrr',[ 6 8 4 ],'../results/epyc/20190531/4_800_4_mt256_last400','epyc','MKL','matlab'); close; clear all;
+plot_panel_trxsh(3.0,8,1,'st','d','ccc',[ 6 8 4 ],'../results/epyc/20190531/4_800_4_mt256_last400','epyc','MKL','matlab'); close; clear all;
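
The first two arguments passed to `plot_panel_trxsh` in `runme.m` (`cfreq` and `dflopspercycle`: 3.6 and 16 for `kbl`, 3.0 and 8 for `epyc`) appear to follow the peak-performance formula given in `docs/PerformanceSmall.md`. The following Python sketch of that arithmetic uses hypothetical function names and assumes 64-bit elements for double precision and 32-bit elements for single precision.

```python
def flops_per_cycle(fma_per_cycle, vector_bits, element_bits=64):
    """Flops per cycle per core: FMA issue rate x elements per vector x 2.

    The factor of 2 reflects that an FMA fuses a multiply and an add.
    """
    elements_per_vector = vector_bits // element_bits
    return fma_per_cycle * elements_per_vector * 2


def peak_gflops(clock_ghz, fma_per_cycle, vector_bits, element_bits=64):
    """Theoretical peak in GFLOPS: clock rate (GHz) x flops per cycle."""
    return clock_ghz * flops_per_cycle(fma_per_cycle, vector_bits, element_bits)
```

With the documented Epyc parameters (3.0 GHz, FMA vector IPC of 1, 256-bit vectors), `peak_gflops(3.0, 1, 256)` gives 24 GFLOPS for double precision and `peak_gflops(3.0, 1, 256, 32)` gives 48 GFLOPS for single precision, matching the listed peaks. For Kaby Lake, `flops_per_cycle(2, 256)` gives 16, and the 57.6 GFLOPS peak listed in the document corresponds to the 3.6 GHz value passed in `runme.m`.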