Updated sup performance graphs; added mt results.

Details:
- Reran all existing single-threaded performance experiments comparing BLIS sup to other implementations (including the conventional code path within BLIS), using the latest versions (where appropriate).
- Added multithreaded results for the three existing hardware types showcased in docs/PerformanceSmall.md: Kaby Lake, Haswell, and Epyc (Zen1).
- Various minor updates to the text in docs/PerformanceSmall.md.
- Updates to the octave scripts in test/sup/octave, test/supmt/octave.
@@ -35,14 +35,18 @@ sizes tested.
Each of the 28 graphs within a panel will contain an x-axis that reports
problem size, with one, two, or all three matrix dimensions equal to the
problem size (e.g. _m_ = 6; _n_ = _k_, also encoded as `m6npkp`).
The y-axis will report in units GFLOPS (billions of floating-point operations
per second) on a single core.
The y-axis will report in units GFLOPS (or billions of floating-point operations
per second) per core.

It's also worth pointing out that the top of each graph (e.g. the maximum
y-axis value depicted) _always_ corresponds to the theoretical peak performance
under the conditions associated with that graph.
Theoretical peak performance, in units of GFLOPS, is calculated as the
product of:
It's also worth pointing out that the top of some graphs (e.g. the maximum
y-axis value depicted) corresponds to the theoretical peak performance
under the conditions associated with that graph, while in other graphs the
y-axis has been adjusted to better show the difference between the various
curves. (We *strongly* prefer to always use peak performance as the top of
the graph; however, this is one of the few exceptions where we feel some
scaling is warranted.)
Theoretical peak performance on a single core, in units of GFLOPS, is
calculated as the product of:
1. the maximum sustainable clock rate in GHz; and
2. the maximum number of floating-point operations (flops) that can be
executed per cycle.
@@ -60,30 +64,32 @@ can be issued per cycle (per core);
register (for the datatype in question); and
3. 2.0, since an FMA instruction fuses two operations (a multiply and an add).
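As a sanity check, these factors combine with the clock rate to reproduce the peak numbers quoted in the per-machine sections below. A minimal sketch (the 3.6 GHz sustained clock and 256-bit AVX2 vectors are assumptions chosen to match the Kaby Lake figures):

```python
# Peak GFLOPS = clock (GHz) x FMA issues/cycle x vector elements x 2 flops/FMA.
def peak_gflops(clock_ghz, fma_per_cycle, vector_bits, dtype_bits):
    elements_per_vector = vector_bits // dtype_bits
    return clock_ghz * fma_per_cycle * elements_per_vector * 2

# Assumed Kaby Lake parameters: 3.6 GHz sustained, 2 FMA issues/cycle, AVX2.
print(peak_gflops(3.6, 2, 256, 64))   # double precision -> 57.6 GFLOPS
print(peak_gflops(3.6, 2, 256, 32))   # single precision -> 115.2 GFLOPS
```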

The problem size range, represented on the x-axis, is sampled in
increments of 4 up to 800 for the cases where one or two dimensions is small
(and constant)
and up to 400 in the case where all dimensions (e.g. _m_, _n_, and _k_) are
bound to the problem size (i.e., square matrices).
Typically, organizations and individuals publish performance with square
matrices, which can miss the problem sizes of interest to many applications.
Here, in addition to square matrices (shown in the seventh column), we also
show six other scenarios where one or two `gemm` dimensions (of _m,_ _n_, and
_k_) is small. In these six columns, the constant small matrix dimensions were
chosen to be _very_ small--in the neighborhood of 8--intentionally to showcase
what happens when at least one of the matrices is abnormally "skinny."

Note that the constant small matrix dimensions were chosen to be _very_
small--in the neighborhood of 8--intentionally to showcase what happens when
at least one of the matrices is abnormally "skinny." Typically, organizations
and individuals only publish performance with square matrices, which can miss
the problem sizes of interest to many applications. Here, in addition to square
matrices (shown in the seventh column), we also show six other scenarios where
one or two `gemm` dimensions (of _m_, _n_, and _k_) are small.
The problem size range, represented on the x-axis, is sampled in
increments that vary. These increments (and the overall range) are generally
large for the cases where two dimensions are small (and constant), medium for
cases where one dimension is small (and constant), and small for cases where
all dimensions (e.g. _m_, _n_, and _k_) are variable and bound to the problem
size (i.e., square matrices).

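The varying-increment scheme described above can be sketched as follows; the increments and ranges below are illustrative assumptions, not the exact values used for these runs (those are recorded in the test drivers and run scripts):

```python
# Sketch of problem-size sampling: coarse steps over a wide range when one
# or two dimensions are held small and constant, finer steps over a narrower
# range for square problems. All specific values here are illustrative.
def sample_sizes(increment, maximum):
    # Sweep problem sizes from `increment` to `maximum` in fixed steps.
    return list(range(increment, maximum + 1, increment))

skinny_sweep = sample_sizes(100, 12000)  # two dimensions small and constant
square_sweep = sample_sizes(40, 2000)    # m = n = k = problem size
```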
The legend in each graph contains two entries for BLIS, corresponding to the
two black lines, one solid and one dotted. The dotted line, **"BLIS conv"**,
represents the conventional implementation that targets large matrices. This
was the only implementation available in BLIS prior to the addition of the
small/skinny matrix support. The solid line, **"BLIS sup"**, makes use of the
new small/skinny matrix implementation for certain small problems. Whenever
these results differ by any significant amount (beyond noise), it denotes a
problem size for which BLIS employed the new small/skinny implementation.
Put another way, **the delta between these two lines represents the performance
improvement between BLIS's previous status quo and the new regime.**
new small/skinny matrix implementation. Sometimes, the performance of
**"BLIS sup"** drops below that of **"BLIS conv"** for somewhat larger problems.
However, in practice, we use a threshold to determine when to switch from the
former to the latter, and therefore the goal is for the performance of
**"BLIS conv"** to serve as an approximate floor below which BLIS performance
never drops.
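The switching policy described above amounts to a size-based dispatch. A minimal sketch (the threshold value, the shape of the test, and the function name are hypothetical illustrations, not BLIS's actual internals):

```python
# Hypothetical dispatch between the small/skinny ("sup") path and the
# conventional large-matrix path. The threshold and the "any dimension
# small" test are illustrative assumptions only.
SMALL_DIM_THRESHOLD = 300

def choose_gemm_path(m, n, k):
    # If at least one dimension is small, the problem is small or skinny,
    # so try the sup path; otherwise use the conventional path.
    if min(m, n, k) < SMALL_DIM_THRESHOLD:
        return "sup"
    return "conv"
```

Under this sketch, a skinny problem such as `choose_gemm_path(6, 5000, 5000)` dispatches to the sup path, while a large square problem falls through to the conventional code, which then acts as the performance floor.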

Finally, each point along each curve represents the best of three trials.

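A best-of-n measurement like this can be sketched as follows (the timed workload below is a stand-in for the actual `gemm` test driver):

```python
import time

def best_of_trials(fn, trials=3):
    # Run fn several times and keep the shortest wall-clock time; taking
    # the minimum filters out one-off interference from the OS or other jobs.
    best = float("inf")
    for _ in range(trials):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in workload for illustration:
elapsed = best_of_trials(lambda: sum(range(100_000)))
```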
@@ -119,7 +125,8 @@ and/or install some (or all) of the implementations shown (e.g.
[BLASFEO](https://github.com/giaf/blasfeo), and
[libxsmm](https://github.com/hfp/libxsmm)), including BLIS. Be sure to consult
the detailed notes provided below; they should be *very* helpful in successfully
building the libraries. The `runme.sh` script in `test/sup` will help you run
building the libraries. The `runme.sh` script in `test/sup` (or `test/supmt`)
will help you run
some (or all) of the test drivers produced by the `Makefile`, and the
Matlab/Octave function `plot_panel_trxsh()` defined in the `octave` directory
will help you turn the output of those test drivers into a PDF file of graphs.
@@ -140,20 +147,25 @@ The `runthese.m` file will contain example invocations of the function.
* Max FMA vector IPC: 2
* Peak performance:
* single-core: 57.6 GFLOPS (double-precision), 115.2 GFLOPS (single-precision)
* multicore: 57.6 GFLOPS/core (double-precision), 115.2 GFLOPS/core (single-precision)
* Operating system: Gentoo Linux (Linux kernel 5.2.4)
* Page size: 4096 bytes
* Compiler: gcc 8.3.0
* Results gathered: 23-28 August 2019
* Results gathered: 3 March 2020
* Implementations tested:
* BLIS 4a0a6e8 (0.6.0-28)
* configured with `./configure --enable-cblas auto`
* BLIS 90db88e (0.6.1-8)
* configured with `./configure --enable-cblas auto` (single-threaded)
* configured with `./configure --enable-cblas -t openmp auto` (multithreaded)
* sub-configuration exercised: `haswell`
* OpenBLAS 0.3.7
* configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0` (single-threaded)
* BLASFEO 01f6b7f
* Multithreaded (4 cores) execution requested via `export BLIS_NUM_THREADS=4`
* OpenBLAS 0.3.8
* configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0 USE_LOCKING=1` (single-threaded)
* configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=4` (multithreaded)
* Multithreaded (4 cores) execution requested via `export OPENBLAS_NUM_THREADS=4`
* BLASFEO f9b78c6
* configured `Makefile.rule` with: `BLAS_API=1 FORTRAN_BLAS_API=1 CBLAS_API=1`.
* Eigen 3.3.90
* Obtained via the [Eigen git mirror](https://github.com/eigenteam/eigen-git-mirror) (28 August 2019)
* Obtained via the [Eigen git mirror](https://github.com/eigenteam/eigen-git-mirror) (36b9596)
* Prior to compilation, modified top-level `CMakeLists.txt` to ensure that `-march=native` was added to `CXX_FLAGS` variable (h/t Sameer Agarwal):
```
# These lines added after line 67.
@@ -165,18 +177,20 @@ The `runthese.m` file will contain example invocations of the function.
* configured and built BLAS library via `mkdir build; cd build; CC=gcc cmake ..; make blas`
* installed headers via `cmake . -DCMAKE_INSTALL_PREFIX=$HOME/flame/eigen; make install`
* The `gemm` implementation was pulled in at compile-time via Eigen headers; other operations were linked to Eigen's BLAS library.
* Requested threading via `export OMP_NUM_THREADS=1` (single-threaded)
* MKL 2019 update 4
* Requested threading via `export MKL_NUM_THREADS=1` (single-threaded)
* libxsmm 77a295c (1.6.5-6679)
* Single-threaded (1 core) execution requested via `export OMP_NUM_THREADS=1`
* Multithreaded (4 cores) execution requested via `export OMP_NUM_THREADS=4`
* MKL 2020 initial release
* Single-threaded (1 core) execution requested via `export MKL_NUM_THREADS=1`
* Multithreaded (4 cores) execution requested via `export MKL_NUM_THREADS=4`
* libxsmm a40a833 (post-1.14)
* compiled with `make AVX=2`; linked with [netlib BLAS](http://www.netlib.org/blas/) 3.6.0 as the fallback library to better show where libxsmm stops handling the computation internally.
* Affinity:
* N/A.
* Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0-3"`. However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` environment variable was unset.
* Frequency throttling (via `cpupower`):
* Driver: intel_pstate
* Governor: performance
* Hardware limits: 800MHz - 3.8GHz
* Adjusted minimum: 3.7GHz
* Adjusted minimum: 3.8GHz
* Comments:
* libxsmm is highly competitive for very small problems, but quickly gives up once the "large" dimension exceeds about 180-240 (or 64 in the case where all operands are square). Also, libxsmm's `gemm` cannot handle a transposition on matrix A and similarly dispatches the fallback implementation for those cases. libxsmm also does not export CBLAS interfaces, and therefore only appears on the graphs for column-stored matrices.

@@ -184,15 +198,21 @@ The `runthese.m` file will contain example invocations of the function.

#### pdf

* [Kaby Lake row-stored](graphs/sup/dgemm_rrr_kbl_nt1.pdf)
* [Kaby Lake column-stored](graphs/sup/dgemm_ccc_kbl_nt1.pdf)
* [Kaby Lake single-threaded row-stored](graphs/sup/dgemm_rrr_kbl_nt1.pdf)
* [Kaby Lake single-threaded column-stored](graphs/sup/dgemm_ccc_kbl_nt1.pdf)
* [Kaby Lake multithreaded (4 cores) row-stored](graphs/sup/dgemm_rrr_kbl_nt4.pdf)
* [Kaby Lake multithreaded (4 cores) column-stored](graphs/sup/dgemm_ccc_kbl_nt4.pdf)

#### png (inline)

* **Kaby Lake row-stored**

* **Kaby Lake column-stored**

* **Kaby Lake single-threaded row-stored**

* **Kaby Lake single-threaded column-stored**

* **Kaby Lake multithreaded (4 cores) row-stored**

* **Kaby Lake multithreaded (4 cores) column-stored**


---

@@ -209,20 +229,25 @@ The `runthese.m` file will contain example invocations of the function.
* Max FMA vector IPC: 2
* Peak performance:
* single-core: 56 GFLOPS (double-precision), 112 GFLOPS (single-precision)
* multicore: 49.6 GFLOPS/core (double-precision), 99.2 GFLOPS/core (single-precision)
* Operating system: Cray Linux Environment 6 (Linux kernel 4.4.103)
* Page size: 4096 bytes
* Compiler: gcc 7.3.0
* Results gathered: 23-28 August 2019
* Results gathered: 3 March 2020
* Implementations tested:
* BLIS 4a0a6e8 (0.6.0-28)
* configured with `./configure --enable-cblas auto`
* BLIS 90db88e (0.6.1-8)
* configured with `./configure --enable-cblas auto` (single-threaded)
* configured with `./configure --enable-cblas -t openmp auto` (multithreaded)
* sub-configuration exercised: `haswell`
* OpenBLAS 0.3.7
* configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0` (single-threaded)
* BLASFEO 01f6b7f
* Multithreaded (12 cores) execution requested via `export BLIS_NUM_THREADS=12`
* OpenBLAS 0.3.8
* configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0 USE_LOCKING=1` (single-threaded)
* configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=12` (multithreaded)
* Multithreaded (12 cores) execution requested via `export OPENBLAS_NUM_THREADS=12`
* BLASFEO f9b78c6
* configured `Makefile.rule` with: `BLAS_API=1 FORTRAN_BLAS_API=1 CBLAS_API=1`.
* Eigen 3.3.90
* Obtained via the [Eigen git mirror](https://github.com/eigenteam/eigen-git-mirror) (28 August 2019)
* Obtained via the [Eigen git mirror](https://github.com/eigenteam/eigen-git-mirror) (36b9596)
* Prior to compilation, modified top-level `CMakeLists.txt` to ensure that `-march=native` was added to `CXX_FLAGS` variable (h/t Sameer Agarwal):
```
# These lines added after line 67.
@@ -234,13 +259,15 @@ The `runthese.m` file will contain example invocations of the function.
* configured and built BLAS library via `mkdir build; cd build; CC=gcc cmake ..; make blas`
* installed headers via `cmake . -DCMAKE_INSTALL_PREFIX=$HOME/flame/eigen; make install`
* The `gemm` implementation was pulled in at compile-time via Eigen headers; other operations were linked to Eigen's BLAS library.
* Requested threading via `export OMP_NUM_THREADS=1` (single-threaded)
* MKL 2019 update 4
* Requested threading via `export MKL_NUM_THREADS=1` (single-threaded)
* libxsmm 77a295c (1.6.5-6679)
* Single-threaded (1 core) execution requested via `export OMP_NUM_THREADS=1`
* Multithreaded (12 cores) execution requested via `export OMP_NUM_THREADS=12`
* MKL 2020 initial release
* Single-threaded (1 core) execution requested via `export MKL_NUM_THREADS=1`
* Multithreaded (12 cores) execution requested via `export MKL_NUM_THREADS=12`
* libxsmm a40a833 (post-1.14)
* compiled with `make AVX=2`; linked with [netlib BLAS](http://www.netlib.org/blas/) 3.6.0 as the fallback library to better show where libxsmm stops handling the computation internally.
* Affinity:
* N/A.
* Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0-11"`. However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` environment variable was unset.
* Frequency throttling (via `cpupower`):
* No changes made.
* Comments:
@@ -250,15 +277,21 @@ The `runthese.m` file will contain example invocations of the function.

#### pdf

* [Haswell row-stored](graphs/sup/dgemm_rrr_has_nt1.pdf)
* [Haswell column-stored](graphs/sup/dgemm_ccc_has_nt1.pdf)
* [Haswell single-threaded row-stored](graphs/sup/dgemm_rrr_has_nt1.pdf)
* [Haswell single-threaded column-stored](graphs/sup/dgemm_ccc_has_nt1.pdf)
* [Haswell multithreaded (12 cores) row-stored](graphs/sup/dgemm_rrr_has_nt12.pdf)
* [Haswell multithreaded (12 cores) column-stored](graphs/sup/dgemm_ccc_has_nt12.pdf)

#### png (inline)

* **Haswell row-stored**

* **Haswell column-stored**

* **Haswell single-threaded row-stored**

* **Haswell single-threaded column-stored**

* **Haswell multithreaded (12 cores) row-stored**

* **Haswell multithreaded (12 cores) column-stored**


---

@@ -276,20 +309,26 @@ The `runthese.m` file will contain example invocations of the function.
* Alternatively, FMA vector IPC is 2 when vectors are limited to 128 bits each.
* Peak performance:
* single-core: 24 GFLOPS (double-precision), 48 GFLOPS (single-precision)
* multicore: 20.4 GFLOPS/core (double-precision), 40.8 GFLOPS/core (single-precision)
* Operating system: Ubuntu 18.04 (Linux kernel 4.15.0)
* Page size: 4096 bytes
* Compiler: gcc 7.4.0
* Results gathered: 23-28 August 2019
* Results gathered: 3 March 2020
* Implementations tested:
* BLIS 4a0a6e8 (0.6.0-28)
* configured with `./configure --enable-cblas auto`
* sub-configuration exercised: `zen`
* OpenBLAS 0.3.7
* configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0` (single-threaded)
* BLASFEO 01f6b7f
* BLIS 90db88e (0.6.1-8)
* configured with `./configure --enable-cblas auto` (single-threaded)
* configured with `./configure --enable-cblas -t openmp auto` (multithreaded)
* sub-configuration exercised: `zen`
* Multithreaded (32 cores) execution requested via `export BLIS_NUM_THREADS=32`
* OpenBLAS 0.3.8
* configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0 USE_LOCKING=1` (single-threaded)
* configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=32` (multithreaded)
* Multithreaded (32 cores) execution requested via `export OPENBLAS_NUM_THREADS=32`
* BLASFEO f9b78c6
* configured `Makefile.rule` with: `BLAS_API=1 FORTRAN_BLAS_API=1 CBLAS_API=1`.
* built BLAS library via `make CC=gcc`
* Eigen 3.3.90
* Obtained via the [Eigen git mirror](https://github.com/eigenteam/eigen-git-mirror) (28 August 2019)
* Obtained via the [Eigen git mirror](https://github.com/eigenteam/eigen-git-mirror) (36b9596)
* Prior to compilation, modified top-level `CMakeLists.txt` to ensure that `-march=native` was added to `CXX_FLAGS` variable (h/t Sameer Agarwal):
```
# These lines added after line 67.
@@ -301,13 +340,15 @@ The `runthese.m` file will contain example invocations of the function.
* configured and built BLAS library via `mkdir build; cd build; CC=gcc cmake ..; make blas`
* installed headers via `cmake . -DCMAKE_INSTALL_PREFIX=$HOME/flame/eigen; make install`
* The `gemm` implementation was pulled in at compile-time via Eigen headers; other operations were linked to Eigen's BLAS library.
* Requested threading via `export OMP_NUM_THREADS=1` (single-threaded)
* MKL 2019 update 4
* Requested threading via `export MKL_NUM_THREADS=1` (single-threaded)
* libxsmm 77a295c (1.6.5-6679)
* Single-threaded (1 core) execution requested via `export OMP_NUM_THREADS=1`
* Multithreaded (32 cores) execution requested via `export OMP_NUM_THREADS=32`
* MKL 2020 initial release
* Single-threaded (1 core) execution requested via `export MKL_NUM_THREADS=1`
* Multithreaded (32 cores) execution requested via `export MKL_NUM_THREADS=32`
* libxsmm a40a833 (post-1.14)
* compiled with `make AVX=2`; linked with [netlib BLAS](http://www.netlib.org/blas/) 3.6.0 as the fallback library to better show where libxsmm stops handling the computation internally.
* Affinity:
* N/A.
* Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0-31"`. However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` environment variable was unset.
* Frequency throttling (via `cpupower`):
* Driver: acpi-cpufreq
* Governor: performance
@@ -320,15 +361,21 @@ The `runthese.m` file will contain example invocations of the function.

#### pdf

* [Epyc row-stored](graphs/sup/dgemm_rrr_epyc_nt1.pdf)
* [Epyc column-stored](graphs/sup/dgemm_ccc_epyc_nt1.pdf)
* [Epyc single-threaded row-stored](graphs/sup/dgemm_rrr_epyc_nt1.pdf)
* [Epyc single-threaded column-stored](graphs/sup/dgemm_ccc_epyc_nt1.pdf)
* [Epyc multithreaded (32 cores) row-stored](graphs/sup/dgemm_rrr_epyc_nt32.pdf)
* [Epyc multithreaded (32 cores) column-stored](graphs/sup/dgemm_ccc_epyc_nt32.pdf)

#### png (inline)

* **Epyc row-stored**

* **Epyc column-stored**

* **Epyc single-threaded row-stored**

* **Epyc single-threaded column-stored**

* **Epyc multithreaded (32 cores) row-stored**

* **Epyc multithreaded (32 cores) column-stored**


---

BIN docs/graphs/sup/dgemm_ccc_epyc_nt32.pdf (new file)
BIN docs/graphs/sup/dgemm_ccc_epyc_nt32.png (new file)
BIN docs/graphs/sup/dgemm_ccc_has_nt12.pdf (new file)
BIN docs/graphs/sup/dgemm_ccc_has_nt12.png (new file)
BIN docs/graphs/sup/dgemm_ccc_kbl_nt4.pdf (new file)
BIN docs/graphs/sup/dgemm_ccc_kbl_nt4.png (new file)
BIN docs/graphs/sup/dgemm_rrr_epyc_nt32.pdf (new file)
BIN docs/graphs/sup/dgemm_rrr_epyc_nt32.png (new file)
BIN docs/graphs/sup/dgemm_rrr_has_nt12.pdf (new file)
BIN docs/graphs/sup/dgemm_rrr_has_nt12.png (new file)
BIN docs/graphs/sup/dgemm_rrr_kbl_nt4.pdf (new file)
BIN docs/graphs/sup/dgemm_rrr_kbl_nt4.png (new file)
@@ -102,13 +102,14 @@ for psize_col = 1:3
	end
	x_axis( :, 1 ) = data_blissup( :, psize_col );

	% Compute the number of data points we have in the x-axis. Note that
	% we only use half the data points for the m = n = k column of graphs.
	if mod(theid-1,cols) == 6
		np = size( data_blissup, 1 ) / 2;
	else
		np = size( data_blissup, 1 );
	end
	% Compute the number of data points we have in the x-axis. Note that we
	% only use half the data points for the m = n = k column of graphs.
	%if mod(theid-1,cols) == 6
	%	np = size( data_blissup, 1 ) / 2;
	%else
	%	np = size( data_blissup, 1 );
	%end
	np = size( data_blissup, 1 );

	has_xsmm = 1;
	if data_xsmm( 1, flopscol ) == 0.0
@@ -177,10 +178,16 @@ end
	xlim( ax1, [x_begin x_end] );
	ylim( ax1, [y_begin y_end] );

	if 6000 <= x_end && x_end < 10000
	if 10000 <= x_end && x_end < 15000
		x_tick2 = x_end - 2000;
		x_tick1 = x_tick2/2;
		xticks( ax1, [ x_tick1 x_tick2 ] );
		%xticks( ax1, [ x_tick1 x_tick2 ] );
		xticks( ax1, [ 3000 6000 9000 12000 ] );
	elseif 6000 <= x_end && x_end < 10000
		x_tick2 = x_end - 2000;
		x_tick1 = x_tick2/2;
		%xticks( ax1, [ x_tick1 x_tick2 ] );
		xticks( ax1, [ 2000 4000 6000 8000 ] );
	elseif 4000 <= x_end && x_end < 6000
		x_tick2 = x_end - 1000;
		x_tick1 = x_tick2/2;

@@ -1,12 +1,25 @@

% haswell
plot_panel_trxsh(3.25,16,1,'st','d','ccc',[ 6 8 4 ],'../results/haswell/20190823/4_800_4_mt201','has','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.25,16,1,'st','d','rrr',[ 6 8 4 ],'../results/haswell/20190823/4_800_4_mt201','has','MKL','matlab'); close; clear all;

% kabylake
plot_panel_trxsh(3.80,16,1,'st','d','rrr',[ 6 8 4 ],'../results/kabylake/20190823/4_800_4_mt201','kbl','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.80,16,1,'st','d','ccc',[ 6 8 4 ],'../results/kabylake/20190823/4_800_4_mt201','kbl','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.80,16,1,'st','d','rrr',[ 6 8 4 ],'../results/kabylake/20200302/mnkt100000_st','kbl','MKL','octave'); close; clear all;

% haswell
plot_panel_trxsh(3.5,16,1,'st','d','rrr',[ 6 8 4 ],'../results/haswell/20200302/mnkt100000_st','has','MKL','octave'); close; clear all;

% epyc
plot_panel_trxsh(3.00, 8,1,'st','d','rrr',[ 6 8 4 ],'../results/epyc/20190826/4_800_4_mt256','epyc','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.00, 8,1,'st','d','ccc',[ 6 8 4 ],'../results/epyc/20190826/4_800_4_mt256','epyc','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.00, 8,1,'st','d','rrr',[ 6 8 4 ],'../results/epyc/20200302/mnkt100000_st','epyc','MKL','octave'); close; clear all;

@@ -20,7 +20,7 @@ function r_val = plot_l3sup_perf( opname, ...
%end

%legend_plot_id = 11;
legend_plot_id = 0*cols + 1*4;
legend_plot_id = 0*cols + 1*6;

if 1
ax1 = subplot( rows, cols, theid );
@@ -96,13 +96,14 @@ for psize_col = 1:3
	end
	x_axis( :, 1 ) = data_blissup( :, psize_col );

	% Compute the number of data points we have in the x-axis. Note that
	% we only use quarter the data points for the m = n = k column of graphs.
	if mod(theid-1,cols) == 6
		np = size( data_blissup, 1 ) / 4;
	else
		np = size( data_blissup, 1 );
	end
	% Compute the number of data points we have in the x-axis. Note that we
	% only use half the data points for the m = n = k column of graphs.
	%if mod(theid-1,cols) == 6
	%	np = size( data_blissup, 1 ) / 2;
	%else
	%	np = size( data_blissup, 1 );
	%end
	np = size( data_blissup, 1 );

	% Grab the last x-axis value.
	x_end = data_blissup( np, psize_col );
@@ -148,9 +149,23 @@ end
	xlim( ax1, [x_begin x_end] );
	ylim( ax1, [y_begin y_end] );

	if 6000 <= x_end && x_end < 10000
	if mod(theid-1,cols) == 3 || mod(theid-1,cols) == 4 || mod(theid-1,cols) == 5
		if nth == 12
			ylim( ax1, [y_begin y_end/2] );
		elseif nth > 12
			ylim( ax1, [y_begin y_end/6] );
		end
	end

	if 10000 <= x_end && x_end < 15000
		x_tick2 = x_end - 2000;
		x_tick1 = x_tick2/2;
		%xticks( ax1, [ x_tick1 x_tick2 ] );
		xticks( ax1, [ 4000 8000 12000 ] );
	elseif 6000 <= x_end && x_end < 10000
		x_tick2 = x_end - 2000;
		x_tick1 = x_tick2/2;
		%xticks( ax1, [ x_tick1 x_tick2 ] );
		xticks( ax1, [ x_tick1 x_tick2 ] );
	elseif 4000 <= x_end && x_end < 6000
		x_tick2 = x_end - 1000;
@@ -188,7 +203,8 @@ if show_plot == 1 || theid == legend_plot_id
	set( leg,'Units','inches' );
	if impl == 'octave'
		set( leg,'FontSize',fontsize );
		set( leg,'Position',[12.40 10.60 1.9 0.95 ] ); % (1,4tl)
		%set( leg,'Position',[12.40 10.60 1.9 0.95 ] ); % (1,4tl)
		set( leg,'Position',[18.80 10.60 1.9 0.95 ] ); % (1,4tl)
	else
		set( leg,'FontSize',fontsize-1 );
		set( leg,'Position',[18.24 10.15 1.15 0.7 ] ); % (1,4tl)

@@ -1,12 +1,25 @@

% haswell
plot_panel_trxsh(3.25,16,1,'mt','d','ccc',[ 6 8 10 ],'../results/haswell/20190823/4_800_4_mt201','has','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.25,16,1,'mt','d','rrr',[ 6 8 10 ],'../results/haswell/20190823/4_800_4_mt201','has','MKL','matlab'); close; clear all;

% kabylake
plot_panel_trxsh(3.80,16,1,'mt','d','rrr',[ 6 8 10 ],'..','kbl','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.80,16,1,'mt','d','ccc',[ 6 8 10 ],'..','kbl','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.80,16,4,'mt','d','rrr',[ 6 8 10 ],'../../sup/results/kabylake/20200302/mnkt100000_mt4','kbl','MKL','octave'); close; clear all;

% haswell
plot_panel_trxsh(3.1,16,12,'mt','d','rrr',[ 6 8 10 ],'../../sup/results/haswell/20200302/mnkt100000_mt12','has','MKL','octave'); close; clear all;

% epyc
plot_panel_trxsh(3.00, 8,1,'mt','d','rrr',[ 6 8 10 ],'../results/epyc/20190826/4_800_4_mt256','epyc','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.00, 8,1,'mt','d','ccc',[ 6 8 10 ],'../results/epyc/20190826/4_800_4_mt256','epyc','MKL','matlab'); close; clear all;
plot_panel_trxsh(2.55,8,32,'mt','d','rrr',[ 6 8 10 ],'../../sup/results/epyc/20200302/mnkt100000_mt32','epyc','MKL','octave'); close; clear all;
