Updated sup performance graphs; added mt results.

Details:
- Reran all existing single-threaded performance experiments comparing
  BLIS sup to other implementations (including the conventional code
  path within BLIS), using the latest versions (where appropriate).
- Added multithreaded results for the three existing hardware types
  showcased in docs/PerformanceSmall.md: Kaby Lake, Haswell, and Epyc
  (Zen1).
- Various minor updates to the text in docs/PerformanceSmall.md.
- Updates to the octave scripts in test/sup/octave, test/supmt/octave.
This commit is contained in:
Field G. Van Zee
2020-03-05 17:03:21 -06:00
parent 90db88e572
commit 0f9e0399e1
29 changed files with 210 additions and 114 deletions

@@ -35,14 +35,18 @@ sizes tested.
Each of the 28 graphs within a panel will contain an x-axis that reports
problem size, with one, two, or all three matrix dimensions equal to the
problem size (e.g. _m_ = 6; _n_ = _k_, also encoded as `m6npkp`).
The y-axis will report in units GFLOPS (billions of floating-point operations
per second) on a single core.
The y-axis will report in units of GFLOPS (billions of floating-point operations
per second) per core.
It's also worth pointing out that the top of each graph (e.g. the maximum
y-axis value depicted) _always_ corresponds to the theoretical peak performance
under the conditions associated with that graph.
Theoretical peak performance, in units of GFLOPS, is calculated as the
product of:
It's also worth pointing out that in some graphs the top (i.e. the maximum
y-axis value depicted) corresponds to the theoretical peak performance
under the conditions associated with that graph, while in other graphs the
y-axis has been adjusted to better show the differences between the various
curves. (We *strongly* prefer to always use peak performance as the top of
the graph; however, this is one of the few exceptions where we feel some
scaling is warranted.)
Theoretical peak performance on a single core, in units of GFLOPS, is
calculated as the product of:
1. the maximum sustainable clock rate in GHz; and
2. the maximum number of floating-point operations (flops) that can be
executed per cycle.
@@ -60,30 +64,32 @@ can be issued per cycle (per core);
register (for the datatype in question); and
3. 2.0, since an FMA instruction fuses two operations (a multiply and an add).
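As a sanity check on these factors, the peak figures quoted in the hardware sections below can be reproduced directly. The following sketch (illustrative Python, not part of the test drivers) multiplies the three factors above; the 3.6 GHz value is an assumption implied by the 57.6 GFLOPS double-precision figure quoted for Kaby Lake.

```python
def peak_gflops(clock_ghz, fma_per_cycle, elems_per_register):
    # Theoretical single-core peak, in GFLOPS: clock rate, times the number
    # of FMA instructions that can issue per cycle, times vector elements
    # per register, times 2.0 (an FMA fuses a multiply and an add).
    return clock_ghz * fma_per_cycle * elems_per_register * 2.0

# Kaby Lake, double precision: 2 FMA issues/cycle, 4 doubles per 256-bit
# AVX2 register, ~3.6 GHz sustained clock (assumed).
print(peak_gflops(3.6, 2, 4))  # 57.6
```

With 8 single-precision elements per 256-bit register, the same formula yields the 115.2 GFLOPS single-precision figure.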
The problem size range, represented on the x-axis, is sampled in
increments of 4 up to 800 for the cases where one or two dimensions is small
(and constant)
and up to 400 in the case where all dimensions (e.g. _m_, _n_, and _k_) are
bound to the problem size (i.e., square matrices).
Typically, organizations and individuals publish performance with square
matrices, which can miss the problem sizes of interest to many applications.
Here, in addition to square matrices (shown in the seventh column), we also
show six other scenarios where one or two `gemm` dimensions (of _m,_ _n_, and
_k_) is small. In these six columns, the constant small matrix dimensions were
chosen to be _very_ small--in the neighborhood of 8--intentionally to showcase
what happens when at least one of the matrices is abnormally "skinny."
Note that the constant small matrix dimensions were chosen to be _very_
small--in the neighborhood of 8--intentionally to showcase what happens when
at least one of the matrices is abnormally "skinny." Typically, organizations
and individuals only publish performance with square matrices, which can miss
the problem sizes of interest to many applications. Here, in addition to square
matrices (shown in the seventh column), we also show six other scenarios where
one or two `gemm` dimensions (of _m_, _n_, and _k_) are small.
The problem size range, represented on the x-axis, is sampled in
increments that vary. These increments (and the overall range) are generally
large for the cases where two dimensions are small (and constant), medium for
cases where one dimension is small (and constant), and small for cases where
all dimensions (e.g. _m_, _n_, and _k_) are variable and bound to the problem
size (i.e., square matrices).
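The sampling scheme can be sketched as follows (with hypothetical increments and endpoints; the actual values vary by column, as described above):

```python
def sample_sizes(start, stop, increment):
    # Problem sizes sampled along the x-axis: start, start+increment, ..., stop.
    return list(range(start, stop + 1, increment))

# Hypothetical values for illustration only: a coarse, wide sweep when two
# dimensions are held small and constant, and a finer, shorter sweep when
# all of m, n, and k track the problem size (square matrices).
two_dims_small = sample_sizes(32, 8000, 32)
square         = sample_sizes(4, 800, 4)
```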
The legend in each graph contains two entries for BLIS, corresponding to the
two black lines, one solid and one dotted. The dotted line, **"BLIS conv"**,
represents the conventional implementation that targets large matrices. This
was the only implementation available in BLIS prior to the addition of
small/skinny matrix support. The solid line, **"BLIS sup"**, makes use of the
new small/skinny matrix implementation for certain small problems. Whenever
these results differ by any significant amount (beyond noise), it denotes a
problem size for which BLIS employed the new small/skinny implementation.
Put another way, **the delta between these two lines represents the performance
improvement between BLIS's previous status quo and the new regime.**
new small/skinny matrix implementation. Sometimes, the performance of
**"BLIS sup"** drops below that of **"BLIS conv"** for somewhat larger problems.
However, in practice, we use a threshold to determine when to switch from the
former to the latter, and therefore the goal is for the performance of
**"BLIS conv"** to serve as an approximate floor below which BLIS performance
never drops.
Finally, each point along each curve represents the best of three trials.
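That best-of-three protocol can be sketched as follows (a hypothetical timing harness, not the actual drivers in `test/sup`); taking the minimum over repeated runs filters out one-off interference from the OS and cold caches:

```python
import time

def best_of(fn, trials=3):
    # Run fn() several times and keep the fastest wall-clock time;
    # each plotted point is derived from the best of three such timings.
    best = float("inf")
    for _ in range(trials):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best
```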
@@ -119,7 +125,8 @@ and/or install some (or all) of the implementations shown (e.g.
[BLASFEO](https://github.com/giaf/blasfeo), and
[libxsmm](https://github.com/hfp/libxsmm)), including BLIS. Be sure to consult
the detailed notes provided below; they should be *very* helpful in successfully
building the libraries. The `runme.sh` script in `test/sup` will help you run
building the libraries. The `runme.sh` script in `test/sup` (or `test/supmt`)
will help you run
some (or all) of the test drivers produced by the `Makefile`, and the
Matlab/Octave function `plot_panel_trxsh()` defined in the `octave` directory
will help you turn the output of those test drivers into a PDF file of graphs.
@@ -140,20 +147,25 @@ The `runthese.m` file will contain example invocations of the function.
* Max FMA vector IPC: 2
* Peak performance:
* single-core: 57.6 GFLOPS (double-precision), 115.2 GFLOPS (single-precision)
* multicore: 57.6 GFLOPS/core (double-precision), 115.2 GFLOPS/core (single-precision)
* Operating system: Gentoo Linux (Linux kernel 5.2.4)
* Page size: 4096 bytes
* Compiler: gcc 8.3.0
* Results gathered: 23-28 August 2019
* Results gathered: 3 March 2020
* Implementations tested:
* BLIS 4a0a6e8 (0.6.0-28)
* configured with `./configure --enable-cblas auto`
* BLIS 90db88e (0.6.1-8)
* configured with `./configure --enable-cblas auto` (single-threaded)
* configured with `./configure --enable-cblas -t openmp auto` (multithreaded)
* sub-configuration exercised: `haswell`
* OpenBLAS 0.3.7
* configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0` (single-threaded)
* BLASFEO 01f6b7f
* Multithreaded (4 cores) execution requested via `export BLIS_NUM_THREADS=4`
* OpenBLAS 0.3.8
* configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0 USE_LOCKING=1` (single-threaded)
* configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=4` (multithreaded)
* Multithreaded (4 cores) execution requested via `export OPENBLAS_NUM_THREADS=4`
* BLASFEO f9b78c6
* configured `Makefile.rule` with: `BLAS_API=1 FORTRAN_BLAS_API=1 CBLAS_API=1`.
* Eigen 3.3.90
* Obtained via the [Eigen git mirror](https://github.com/eigenteam/eigen-git-mirror) (28 August 2019)
* Obtained via the [Eigen git mirror](https://github.com/eigenteam/eigen-git-mirror) (36b9596)
* Prior to compilation, modified top-level `CMakeLists.txt` to ensure that `-march=native` was added to `CXX_FLAGS` variable (h/t Sameer Agarwal):
```
# These lines added after line 67.
@@ -165,18 +177,20 @@ The `runthese.m` file will contain example invocations of the function.
* configured and built BLAS library via `mkdir build; cd build; CC=gcc cmake ..; make blas`
* installed headers via `cmake . -DCMAKE_INSTALL_PREFIX=$HOME/flame/eigen; make install`
* The `gemm` implementation was pulled in at compile-time via Eigen headers; other operations were linked to Eigen's BLAS library.
* Requested threading via `export OMP_NUM_THREADS=1` (single-threaded)
* MKL 2019 update 4
* Requested threading via `export MKL_NUM_THREADS=1` (single-threaded)
* libxsmm 77a295c (1.6.5-6679)
* Single-threaded (1 core) execution requested via `export OMP_NUM_THREADS=1`
* Multithreaded (4 cores) execution requested via `export OMP_NUM_THREADS=4`
* MKL 2020 initial release
* Single-threaded (1 core) execution requested via `export MKL_NUM_THREADS=1`
* Multithreaded (4 cores) execution requested via `export MKL_NUM_THREADS=4`
* libxsmm a40a833 (post-1.14)
* compiled with `make AVX=2`; linked with [netlib BLAS](http://www.netlib.org/blas/) 3.6.0 as the fallback library to better show where libxsmm stops handling the computation internally.
* Affinity:
* N/A.
* Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0-3"`. However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` environment variable was unset.
* Frequency throttling (via `cpupower`):
* Driver: intel_pstate
* Governor: performance
* Hardware limits: 800MHz - 3.8GHz
* Adjusted minimum: 3.7GHz
* Adjusted minimum: 3.8GHz
* Comments:
* libxsmm is highly competitive for very small problems, but quickly gives up once the "large" dimension exceeds about 180-240 (or 64 in the case where all operands are square). Also, libxsmm's `gemm` cannot handle a transposition on matrix A and similarly dispatches the fallback implementation for those cases. libxsmm also does not export CBLAS interfaces, and therefore only appears on the graphs for column-stored matrices.
@@ -184,15 +198,21 @@ The `runthese.m` file will contain example invocations of the function.
#### pdf
* [Kaby Lake row-stored](graphs/sup/dgemm_rrr_kbl_nt1.pdf)
* [Kaby Lake column-stored](graphs/sup/dgemm_ccc_kbl_nt1.pdf)
* [Kaby Lake single-threaded row-stored](graphs/sup/dgemm_rrr_kbl_nt1.pdf)
* [Kaby Lake single-threaded column-stored](graphs/sup/dgemm_ccc_kbl_nt1.pdf)
* [Kaby Lake multithreaded (4 cores) row-stored](graphs/sup/dgemm_rrr_kbl_nt4.pdf)
* [Kaby Lake multithreaded (4 cores) column-stored](graphs/sup/dgemm_ccc_kbl_nt4.pdf)
#### png (inline)
* **Kaby Lake row-stored**
![row-stored](graphs/sup/dgemm_rrr_kbl_nt1.png)
* **Kaby Lake column-stored**
![column-stored](graphs/sup/dgemm_ccc_kbl_nt1.png)
* **Kaby Lake single-threaded row-stored**
![single-threaded row-stored](graphs/sup/dgemm_rrr_kbl_nt1.png)
* **Kaby Lake single-threaded column-stored**
![single-threaded column-stored](graphs/sup/dgemm_ccc_kbl_nt1.png)
* **Kaby Lake multithreaded (4 cores) row-stored**
![multithreaded row-stored](graphs/sup/dgemm_rrr_kbl_nt4.png)
* **Kaby Lake multithreaded (4 cores) column-stored**
![multithreaded column-stored](graphs/sup/dgemm_ccc_kbl_nt4.png)
---
@@ -209,20 +229,25 @@ The `runthese.m` file will contain example invocations of the function.
* Max FMA vector IPC: 2
* Peak performance:
* single-core: 56 GFLOPS (double-precision), 112 GFLOPS (single-precision)
* multicore: 49.6 GFLOPS/core (double-precision), 99.2 GFLOPS/core (single-precision)
* Operating system: Cray Linux Environment 6 (Linux kernel 4.4.103)
* Page size: 4096 bytes
* Compiler: gcc 7.3.0
* Results gathered: 23-28 August 2019
* Results gathered: 3 March 2020
* Implementations tested:
* BLIS 4a0a6e8 (0.6.0-28)
* configured with `./configure --enable-cblas auto`
* BLIS 90db88e (0.6.1-8)
* configured with `./configure --enable-cblas auto` (single-threaded)
* configured with `./configure --enable-cblas -t openmp auto` (multithreaded)
* sub-configuration exercised: `haswell`
* OpenBLAS 0.3.7
* configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0` (single-threaded)
* BLASFEO 01f6b7f
* Multithreaded (12 cores) execution requested via `export BLIS_NUM_THREADS=12`
* OpenBLAS 0.3.8
* configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0 USE_LOCKING=1` (single-threaded)
* configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=12` (multithreaded)
* Multithreaded (12 cores) execution requested via `export OPENBLAS_NUM_THREADS=12`
* BLASFEO f9b78c6
* configured `Makefile.rule` with: `BLAS_API=1 FORTRAN_BLAS_API=1 CBLAS_API=1`.
* Eigen 3.3.90
* Obtained via the [Eigen git mirror](https://github.com/eigenteam/eigen-git-mirror) (28 August 2019)
* Obtained via the [Eigen git mirror](https://github.com/eigenteam/eigen-git-mirror) (36b9596)
* Prior to compilation, modified top-level `CMakeLists.txt` to ensure that `-march=native` was added to `CXX_FLAGS` variable (h/t Sameer Agarwal):
```
# These lines added after line 67.
@@ -234,13 +259,15 @@ The `runthese.m` file will contain example invocations of the function.
* configured and built BLAS library via `mkdir build; cd build; CC=gcc cmake ..; make blas`
* installed headers via `cmake . -DCMAKE_INSTALL_PREFIX=$HOME/flame/eigen; make install`
* The `gemm` implementation was pulled in at compile-time via Eigen headers; other operations were linked to Eigen's BLAS library.
* Requested threading via `export OMP_NUM_THREADS=1` (single-threaded)
* MKL 2019 update 4
* Requested threading via `export MKL_NUM_THREADS=1` (single-threaded)
* libxsmm 77a295c (1.6.5-6679)
* Single-threaded (1 core) execution requested via `export OMP_NUM_THREADS=1`
* Multithreaded (12 cores) execution requested via `export OMP_NUM_THREADS=12`
* MKL 2020 initial release
* Single-threaded (1 core) execution requested via `export MKL_NUM_THREADS=1`
* Multithreaded (12 cores) execution requested via `export MKL_NUM_THREADS=12`
* libxsmm a40a833 (post-1.14)
* compiled with `make AVX=2`; linked with [netlib BLAS](http://www.netlib.org/blas/) 3.6.0 as the fallback library to better show where libxsmm stops handling the computation internally.
* Affinity:
* N/A.
* Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0-11"`. However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` environment variable was unset.
* Frequency throttling (via `cpupower`):
* No changes made.
* Comments:
@@ -250,15 +277,21 @@ The `runthese.m` file will contain example invocations of the function.
#### pdf
* [Haswell row-stored](graphs/sup/dgemm_rrr_has_nt1.pdf)
* [Haswell column-stored](graphs/sup/dgemm_ccc_has_nt1.pdf)
* [Haswell single-threaded row-stored](graphs/sup/dgemm_rrr_has_nt1.pdf)
* [Haswell single-threaded column-stored](graphs/sup/dgemm_ccc_has_nt1.pdf)
* [Haswell multithreaded (12 cores) row-stored](graphs/sup/dgemm_rrr_has_nt12.pdf)
* [Haswell multithreaded (12 cores) column-stored](graphs/sup/dgemm_ccc_has_nt12.pdf)
#### png (inline)
* **Haswell row-stored**
![row-stored](graphs/sup/dgemm_rrr_has_nt1.png)
* **Haswell column-stored**
![column-stored](graphs/sup/dgemm_ccc_has_nt1.png)
* **Haswell single-threaded row-stored**
![single-threaded row-stored](graphs/sup/dgemm_rrr_has_nt1.png)
* **Haswell single-threaded column-stored**
![single-threaded column-stored](graphs/sup/dgemm_ccc_has_nt1.png)
* **Haswell multithreaded (12 cores) row-stored**
![multithreaded row-stored](graphs/sup/dgemm_rrr_has_nt12.png)
* **Haswell multithreaded (12 cores) column-stored**
![multithreaded column-stored](graphs/sup/dgemm_ccc_has_nt12.png)
---
@@ -276,20 +309,26 @@ The `runthese.m` file will contain example invocations of the function.
* Alternatively, FMA vector IPC is 2 when vectors are limited to 128 bits each.
* Peak performance:
* single-core: 24 GFLOPS (double-precision), 48 GFLOPS (single-precision)
* multicore: 20.4 GFLOPS/core (double-precision), 40.8 GFLOPS/core (single-precision)
* Operating system: Ubuntu 18.04 (Linux kernel 4.15.0)
* Page size: 4096 bytes
* Compiler: gcc 7.4.0
* Results gathered: 23-28 August 2019
* Results gathered: 3 March 2020
* Implementations tested:
* BLIS 4a0a6e8 (0.6.0-28)
* configured with `./configure --enable-cblas auto`
* sub-configuration exercised: `zen`
* OpenBLAS 0.3.7
* configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0` (single-threaded)
* BLASFEO 01f6b7f
* BLIS 90db88e (0.6.1-8)
* configured with `./configure --enable-cblas auto` (single-threaded)
* configured with `./configure --enable-cblas -t openmp auto` (multithreaded)
* sub-configuration exercised: `zen`
* Multithreaded (32 cores) execution requested via `export BLIS_NUM_THREADS=32`
* OpenBLAS 0.3.8
* configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0 USE_LOCKING=1` (single-threaded)
* configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=32` (multithreaded)
* Multithreaded (32 cores) execution requested via `export OPENBLAS_NUM_THREADS=32`
* BLASFEO f9b78c6
* configured `Makefile.rule` with: `BLAS_API=1 FORTRAN_BLAS_API=1 CBLAS_API=1`.
* built BLAS library via `make CC=gcc`
* Eigen 3.3.90
* Obtained via the [Eigen git mirror](https://github.com/eigenteam/eigen-git-mirror) (28 August 2019)
* Obtained via the [Eigen git mirror](https://github.com/eigenteam/eigen-git-mirror) (36b9596)
* Prior to compilation, modified top-level `CMakeLists.txt` to ensure that `-march=native` was added to `CXX_FLAGS` variable (h/t Sameer Agarwal):
```
# These lines added after line 67.
@@ -301,13 +340,15 @@ The `runthese.m` file will contain example invocations of the function.
* configured and built BLAS library via `mkdir build; cd build; CC=gcc cmake ..; make blas`
* installed headers via `cmake . -DCMAKE_INSTALL_PREFIX=$HOME/flame/eigen; make install`
* The `gemm` implementation was pulled in at compile-time via Eigen headers; other operations were linked to Eigen's BLAS library.
* Requested threading via `export OMP_NUM_THREADS=1` (single-threaded)
* MKL 2019 update 4
* Requested threading via `export MKL_NUM_THREADS=1` (single-threaded)
* libxsmm 77a295c (1.6.5-6679)
* Single-threaded (1 core) execution requested via `export OMP_NUM_THREADS=1`
* Multithreaded (32 cores) execution requested via `export OMP_NUM_THREADS=32`
* MKL 2020 initial release
* Single-threaded (1 core) execution requested via `export MKL_NUM_THREADS=1`
* Multithreaded (32 cores) execution requested via `export MKL_NUM_THREADS=32`
* libxsmm a40a833 (post-1.14)
* compiled with `make AVX=2`; linked with [netlib BLAS](http://www.netlib.org/blas/) 3.6.0 as the fallback library to better show where libxsmm stops handling the computation internally.
* Affinity:
* N/A.
* Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0-31"`. However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` environment variable was unset.
* Frequency throttling (via `cpupower`):
* Driver: acpi-cpufreq
* Governor: performance
@@ -320,15 +361,21 @@ The `runthese.m` file will contain example invocations of the function.
#### pdf
* [Epyc row-stored](graphs/sup/dgemm_rrr_epyc_nt1.pdf)
* [Epyc column-stored](graphs/sup/dgemm_ccc_epyc_nt1.pdf)
* [Epyc single-threaded row-stored](graphs/sup/dgemm_rrr_epyc_nt1.pdf)
* [Epyc single-threaded column-stored](graphs/sup/dgemm_ccc_epyc_nt1.pdf)
* [Epyc multithreaded (32 cores) row-stored](graphs/sup/dgemm_rrr_epyc_nt32.pdf)
* [Epyc multithreaded (32 cores) column-stored](graphs/sup/dgemm_ccc_epyc_nt32.pdf)
#### png (inline)
* **Epyc row-stored**
![row-stored](graphs/sup/dgemm_rrr_epyc_nt1.png)
* **Epyc column-stored**
![column-stored](graphs/sup/dgemm_ccc_epyc_nt1.png)
* **Epyc single-threaded row-stored**
![single-threaded row-stored](graphs/sup/dgemm_rrr_epyc_nt1.png)
* **Epyc single-threaded column-stored**
![single-threaded column-stored](graphs/sup/dgemm_ccc_epyc_nt1.png)
* **Epyc multithreaded (32 cores) row-stored**
![multithreaded row-stored](graphs/sup/dgemm_rrr_epyc_nt32.png)
* **Epyc multithreaded (32 cores) column-stored**
![multithreaded column-stored](graphs/sup/dgemm_ccc_epyc_nt32.png)
---

@@ -102,13 +102,14 @@ for psize_col = 1:3
end
x_axis( :, 1 ) = data_blissup( :, psize_col );
% Compute the number of data points we have in the x-axis. Note that
% we only use half the data points for the m = n = k column of graphs.
if mod(theid-1,cols) == 6
np = size( data_blissup, 1 ) / 2;
else
np = size( data_blissup, 1 );
end
% Compute the number of data points we have in the x-axis. Note that we
% only use half the data points for the m = n = k column of graphs.
%if mod(theid-1,cols) == 6
% np = size( data_blissup, 1 ) / 2;
%else
% np = size( data_blissup, 1 );
%end
np = size( data_blissup, 1 );
has_xsmm = 1;
if data_xsmm( 1, flopscol ) == 0.0
@@ -177,10 +178,16 @@ end
xlim( ax1, [x_begin x_end] );
ylim( ax1, [y_begin y_end] );
if 6000 <= x_end && x_end < 10000
if 10000 <= x_end && x_end < 15000
x_tick2 = x_end - 2000;
x_tick1 = x_tick2/2;
xticks( ax1, [ x_tick1 x_tick2 ] );
%xticks( ax1, [ x_tick1 x_tick2 ] );
xticks( ax1, [ 3000 6000 9000 12000 ] );
elseif 6000 <= x_end && x_end < 10000
x_tick2 = x_end - 2000;
x_tick1 = x_tick2/2;
%xticks( ax1, [ x_tick1 x_tick2 ] );
xticks( ax1, [ 2000 4000 6000 8000 ] );
elseif 4000 <= x_end && x_end < 6000
x_tick2 = x_end - 1000;
x_tick1 = x_tick2/2;

@@ -1,12 +1,25 @@
% haswell
plot_panel_trxsh(3.25,16,1,'st','d','ccc',[ 6 8 4 ],'../results/haswell/20190823/4_800_4_mt201','has','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.25,16,1,'st','d','rrr',[ 6 8 4 ],'../results/haswell/20190823/4_800_4_mt201','has','MKL','matlab'); close; clear all;
% kabylake
plot_panel_trxsh(3.80,16,1,'st','d','rrr',[ 6 8 4 ],'../results/kabylake/20190823/4_800_4_mt201','kbl','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.80,16,1,'st','d','ccc',[ 6 8 4 ],'../results/kabylake/20190823/4_800_4_mt201','kbl','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.80,16,1,'st','d','rrr',[ 6 8 4 ],'../results/kabylake/20200302/mnkt100000_st','kbl','MKL','octave'); close; clear all;
% haswell
plot_panel_trxsh(3.5,16,1,'st','d','rrr',[ 6 8 4 ],'../results/haswell/20200302/mnkt100000_st','has','MKL','octave'); close; clear all;
% epyc
plot_panel_trxsh(3.00, 8,1,'st','d','rrr',[ 6 8 4 ],'../results/epyc/20190826/4_800_4_mt256','epyc','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.00, 8,1,'st','d','ccc',[ 6 8 4 ],'../results/epyc/20190826/4_800_4_mt256','epyc','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.00, 8,1,'st','d','rrr',[ 6 8 4 ],'../results/epyc/20200302/mnkt100000_st','epyc','MKL','octave'); close; clear all;

@@ -20,7 +20,7 @@ function r_val = plot_l3sup_perf( opname, ...
%end
%legend_plot_id = 11;
legend_plot_id = 0*cols + 1*4;
legend_plot_id = 0*cols + 1*6;
if 1
ax1 = subplot( rows, cols, theid );
@@ -96,13 +96,14 @@ for psize_col = 1:3
end
x_axis( :, 1 ) = data_blissup( :, psize_col );
% Compute the number of data points we have in the x-axis. Note that
% we only use quarter the data points for the m = n = k column of graphs.
if mod(theid-1,cols) == 6
np = size( data_blissup, 1 ) / 4;
else
np = size( data_blissup, 1 );
end
% Compute the number of data points we have in the x-axis. Note that we
% only use half the data points for the m = n = k column of graphs.
%if mod(theid-1,cols) == 6
% np = size( data_blissup, 1 ) / 2;
%else
% np = size( data_blissup, 1 );
%end
np = size( data_blissup, 1 );
% Grab the last x-axis value.
x_end = data_blissup( np, psize_col );
@@ -148,9 +149,23 @@ end
xlim( ax1, [x_begin x_end] );
ylim( ax1, [y_begin y_end] );
if 6000 <= x_end && x_end < 10000
if mod(theid-1,cols) == 3 || mod(theid-1,cols) == 4 || mod(theid-1,cols) == 5
if nth == 12
ylim( ax1, [y_begin y_end/2] );
elseif nth > 12
ylim( ax1, [y_begin y_end/6] );
end
end
if 10000 <= x_end && x_end < 15000
x_tick2 = x_end - 2000;
x_tick1 = x_tick2/2;
%xticks( ax1, [ x_tick1 x_tick2 ] );
xticks( ax1, [ 4000 8000 12000 ] );
elseif 6000 <= x_end && x_end < 10000
x_tick2 = x_end - 2000;
x_tick1 = x_tick2/2;
%xticks( ax1, [ x_tick1 x_tick2 ] );
xticks( ax1, [ x_tick1 x_tick2 ] );
elseif 4000 <= x_end && x_end < 6000
x_tick2 = x_end - 1000;
@@ -188,7 +203,8 @@ if show_plot == 1 || theid == legend_plot_id
set( leg,'Units','inches' );
if impl == 'octave'
set( leg,'FontSize',fontsize );
set( leg,'Position',[12.40 10.60 1.9 0.95 ] ); % (1,4tl)
%set( leg,'Position',[12.40 10.60 1.9 0.95 ] ); % (1,4tl)
set( leg,'Position',[18.80 10.60 1.9 0.95 ] ); % (1,4tl)
else
set( leg,'FontSize',fontsize-1 );
set( leg,'Position',[18.24 10.15 1.15 0.7 ] ); % (1,4tl)

@@ -1,12 +1,25 @@
% haswell
plot_panel_trxsh(3.25,16,1,'mt','d','ccc',[ 6 8 10 ],'../results/haswell/20190823/4_800_4_mt201','has','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.25,16,1,'mt','d','rrr',[ 6 8 10 ],'../results/haswell/20190823/4_800_4_mt201','has','MKL','matlab'); close; clear all;
% kabylake
plot_panel_trxsh(3.80,16,1,'mt','d','rrr',[ 6 8 10 ],'..','kbl','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.80,16,1,'mt','d','ccc',[ 6 8 10 ],'..','kbl','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.80,16,4,'mt','d','rrr',[ 6 8 10 ],'../../sup/results/kabylake/20200302/mnkt100000_mt4','kbl','MKL','octave'); close; clear all;
% haswell
plot_panel_trxsh(3.1,16,12,'mt','d','rrr',[ 6 8 10 ],'../../sup/results/haswell/20200302/mnkt100000_mt12','has','MKL','octave'); close; clear all;
% epyc
plot_panel_trxsh(3.00, 8,1,'mt','d','rrr',[ 6 8 10 ],'../results/epyc/20190826/4_800_4_mt256','epyc','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.00, 8,1,'mt','d','ccc',[ 6 8 10 ],'../results/epyc/20190826/4_800_4_mt256','epyc','MKL','matlab'); close; clear all;
plot_panel_trxsh(2.55,8,32,'mt','d','rrr',[ 6 8 10 ],'../../sup/results/epyc/20200302/mnkt100000_mt32','epyc','MKL','octave'); close; clear all;