Updated sup performance graphs with libxsmm.

Details:
- Added libxsmm to column-stored sup graphs presented in
  docs/PerformanceSmall.md.
- Updated sup results for BLASFEO.
- Added sup results for Lonestar5 (Haswell).
- Addresses issue #326.
Field G. Van Zee
2019-08-26 16:47:37 -05:00
parent bfddf67132
commit 40781774df
16 changed files with 185 additions and 46 deletions


@@ -7,6 +7,9 @@
* **[Kaby Lake](Performance.md#kaby-lake)**
* **[Experiment details](Performance.md#kaby-lake-experiment-details)**
* **[Results](Performance.md#kaby-lake-results)**
* **[Haswell](Performance.md#haswell)**
* **[Experiment details](Performance.md#haswell-experiment-details)**
* **[Results](Performance.md#haswell-results)**
* **[Epyc](Performance.md#epyc)**
* **[Experiment details](Performance.md#epyc-experiment-details)**
* **[Results](Performance.md#epyc-results)**
@@ -112,16 +115,16 @@ size of interest so that we can better assist you.
* single-core: 57.6 GFLOPS (double-precision), 115.2 GFLOPS (single-precision)
* Operating system: Gentoo Linux (Linux kernel 5.0.7)
* Page size: 4096 bytes
* Compiler: gcc 7.3.0
* Compiler: gcc 8.3.0
* Driver source code directory: `test/sup`
* Results gathered: 31 May 2019, 3 June 2019, 19 June 2019
* Results gathered: 23 August 2019, 26 August 2019
* Implementations tested:
* BLIS 6bf449c (0.5.2-42)
* BLIS 4a0a6e8 (0.6.0-28)
* configured with `./configure --enable-cblas auto`
* sub-configuration exercised: `haswell`
* OpenBLAS 0.3.6
* configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0` (single-threaded)
* BLASFEO 2c9f312
* BLASFEO 01f6b7f
* configured `Makefile.rule` with: `BLAS_API=1 FORTRAN_BLAS_API=1 CBLAS_API=1`.
* Eigen 3.3.90
* Obtained via the [Eigen git mirror](https://github.com/eigenteam/eigen-git-mirror) (30 May 2019)
@@ -136,8 +139,10 @@ size of interest so that we can better assist you.
* configured and built BLAS library via `mkdir build; cd build; cmake ..; make blas`
* The `gemm` implementation was pulled in at compile-time via Eigen headers; other operations were linked to Eigen's BLAS library.
* Requested threading via `export OMP_NUM_THREADS=1` (single-threaded)
* MKL 2018 update 4
* MKL 2019 update 4
* Requested threading via `export MKL_NUM_THREADS=1` (single-threaded)
* libxsmm 77a295c (1.6.5-6679)
* compiled with `make AVX=2`; linked with [netlib BLAS](http://www.netlib.org/blas/) 3.6.0 as the fallback library to better show where libxsmm stops handling the computation internally.
* Affinity:
* N/A.
* Frequency throttling (via `cpupower`):
@@ -146,8 +151,7 @@ size of interest so that we can better assist you.
* Hardware limits: 800MHz - 3.8GHz
* Adjusted minimum: 3.7GHz
* Comments:
* For both row- and column-stored matrices, BLIS's new small/skinny matrix implementation is competitive with (or exceeds the performance of) the next highest-performing solution (typically MKL), except for a few cases where the _k_ dimension is very small. This shape scenario likely calls for a different kernel approach, since the BLIS microkernel is inherently designed to iterate over many _k_ dimension iterations (which leads it to incur considerable overhead for small values of _k_).
* For the classic case of `dgemm_nn` on square matrices, BLIS is the fastest implementation for the problem size range of approximately 80 to 180. BLIS is also competitive in this general range for other transpose parameter combinations (`nt`, `tn`, and `tt`).
* libxsmm is highly competitive for very small problems, but it stops handling the computation internally, and hands off to the fallback BLAS, once the "large" dimension exceeds about 180-240 (or 64 in the case where all operands are square). libxsmm's `gemm` also cannot handle a transposition on matrix A, and likewise dispatches the fallback implementation for those cases. Finally, libxsmm does not export CBLAS interfaces, and therefore appears only on the graphs for column-stored matrices.
### Kaby Lake results
@@ -165,6 +169,72 @@ size of interest so that we can better assist you.
---
## Haswell
### Haswell experiment details
* Location: TACC (Lonestar5)
* Processor model: Intel Xeon E5-2690 v3 (Haswell)
* Core topology: two sockets, 12 cores per socket, 24 cores total
* SMT status: enabled, but not utilized
* Max clock rate: 3.5GHz (single-core), 3.1GHz (multicore)
* Max vector register length: 256 bits (AVX2)
* Max FMA vector IPC: 2
* Peak performance:
* single-core: 56 GFLOPS (double-precision), 112 GFLOPS (single-precision)
* Operating system: Cray Linux Environment 6 (Linux kernel 4.4.103)
* Page size: 4096 bytes
* Compiler: gcc 7.3.0
* Driver source code directory: `test/3`
* Results gathered: 23 August 2019, 26 August 2019
* Implementations tested:
* BLIS 4a0a6e8 (0.6.0-28)
* configured with `./configure --enable-cblas auto`
* sub-configuration exercised: `haswell`
* OpenBLAS 0.3.6
* configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0` (single-threaded)
* BLASFEO 01f6b7f
* configured `Makefile.rule` with: `BLAS_API=1 FORTRAN_BLAS_API=1 CBLAS_API=1`.
* Eigen 3.3.90
* Obtained via the [Eigen git mirror](https://github.com/eigenteam/eigen-git-mirror) (27 March 2019)
* Prior to compilation, modified top-level `CMakeLists.txt` to ensure that `-march=native` was added to `CXX_FLAGS` variable (h/t Sameer Agarwal):
```
# These lines added after line 67.
check_cxx_compiler_flag("-march=native" COMPILER_SUPPORTS_MARCH_NATIVE)
if(COMPILER_SUPPORTS_MARCH_NATIVE)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=native")
endif()
```
* configured and built BLAS library via `mkdir build; cd build; cmake ..; make blas`
* The `gemm` implementation was pulled in at compile-time via Eigen headers; other operations were linked to Eigen's BLAS library.
* Requested threading via `export OMP_NUM_THREADS=1` (single-threaded)
* MKL 2019 update 4
* Requested threading via `export MKL_NUM_THREADS=1` (single-threaded)
* libxsmm 77a295c (1.6.5-6679)
* compiled with `make AVX=2`; linked with [netlib BLAS](http://www.netlib.org/blas/) 3.6.0 as the fallback library to better show where libxsmm stops handling the computation internally.
* Affinity:
* N/A.
* Frequency throttling (via `cpupower`):
* No changes made.
* Comments:
* libxsmm is highly competitive for very small problems, but it stops handling the computation internally, and hands off to the fallback BLAS, once the "large" dimension exceeds about 180-240 (or 64 in the case where all operands are square). libxsmm's `gemm` also cannot handle a transposition on matrix A, and likewise dispatches the fallback implementation for those cases. Finally, libxsmm does not export CBLAS interfaces, and therefore appears only on the graphs for column-stored matrices.
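
As a cross-check on the Haswell single-core peak figures listed above, the numbers can be reproduced from the stated clock rate (3.5GHz), FMA vector IPC (2), and vector length (256 bits). This assumes each FMA counts as two flops and that a 256-bit register holds four doubles or eight floats:

```latex
\begin{align*}
\text{peak}_{\text{double}} &= 3.5\,\text{GHz} \times 2\,\tfrac{\text{FMA}}{\text{cycle}} \times 4\,\tfrac{\text{doubles}}{\text{vector}} \times 2\,\tfrac{\text{flops}}{\text{FMA}} = 56\ \text{GFLOPS} \\
\text{peak}_{\text{single}} &= 3.5\,\text{GHz} \times 2\,\tfrac{\text{FMA}}{\text{cycle}} \times 8\,\tfrac{\text{floats}}{\text{vector}} \times 2\,\tfrac{\text{flops}}{\text{FMA}} = 112\ \text{GFLOPS}
\end{align*}
```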
### Haswell results
#### pdf
* [Haswell row-stored](graphs/sup/dgemm_rrr_has_nt1.pdf)
* [Haswell column-stored](graphs/sup/dgemm_ccc_has_nt1.pdf)
#### png (inline)
* **Haswell row-stored**
![row-stored](graphs/sup/dgemm_rrr_has_nt1.png)
* **Haswell column-stored**
![column-stored](graphs/sup/dgemm_ccc_has_nt1.png)
---
## Epyc
### Epyc experiment details
@@ -183,14 +253,14 @@ size of interest so that we can better assist you.
* Page size: 4096 bytes
* Compiler: gcc 7.3.0
* Driver source code directory: `test/sup`
* Results gathered: 31 May 2019, 3 June 2019, 19 June 2019
* Results gathered: 23 August 2019, 26 August 2019
* Implementations tested:
* BLIS 6bf449c (0.5.2-42)
* BLIS 4a0a6e8 (0.6.0-28)
* configured with `./configure --enable-cblas auto`
* sub-configuration exercised: `zen`
* OpenBLAS 0.3.6
* configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0` (single-threaded)
* BLASFEO 2c9f312
* BLASFEO 01f6b7f
* configured `Makefile.rule` with: `BLAS_API=1 FORTRAN_BLAS_API=1 CBLAS_API=1`.
* Eigen 3.3.90
* Obtained via the [Eigen git mirror](https://github.com/eigenteam/eigen-git-mirror) (30 May 2019)
@@ -207,6 +277,8 @@ size of interest so that we can better assist you.
* Requested threading via `export OMP_NUM_THREADS=1` (single-threaded)
* MKL 2019 update 4
* Requested threading via `export MKL_NUM_THREADS=1` (single-threaded)
* libxsmm 77a295c (1.6.5-6679)
* compiled with `make AVX=2`; linked with [netlib BLAS](http://www.netlib.org/blas/) 3.6.0 as the fallback library to better show where libxsmm stops handling the computation internally.
* Affinity:
* N/A.
* Frequency throttling (via `cpupower`):
@@ -215,8 +287,7 @@ size of interest so that we can better assist you.
* Hardware limits: 1.2GHz - 2.0GHz
* Adjusted minimum: 2.0GHz
* Comments:
* As with Kaby Lake, BLIS's new small/skinny matrix implementation is competitive with (or exceeds the performance of) the next highest-performing solution, except for a few cases where the _k_ dimension is very small.
* For the classic case of `dgemm_nn` on square matrices, BLIS is the fastest implementation for the problem size range of approximately 12 to 256. BLIS is also competitive in this general range for other transpose parameter combinations (`nt`, `tn`, and `tt`).
* libxsmm is highly competitive for very small problems, but it stops handling the computation internally, and hands off to the fallback BLAS, once the "large" dimension exceeds about 180-240 (or 64 in the case where all operands are square). libxsmm's `gemm` also cannot handle a transposition on matrix A, and likewise dispatches the fallback implementation for those cases. Finally, libxsmm does not export CBLAS interfaces, and therefore appears only on the graphs for column-stored matrices.
### Epyc results

Binary file not shown.



@@ -4,6 +4,7 @@ function r_val = plot_l3sup_perf( opname, ...
data_eigen, ...
data_open, ...
data_bfeo, ...
data_xsmm, ...
data_vend, vend_str, ...
nth, ...
rows, cols, ...
@@ -33,6 +34,7 @@ color_blislpab = 'k'; lines_blislpab = ':'; markr_blislpab = '';
color_eigen = 'm'; lines_eigen = '-.'; markr_eigen = 'o';
color_open = 'r'; lines_open = '--'; markr_open = 'o';
color_bfeo = 'c'; lines_bfeo = '-'; markr_bfeo = 'o';
color_xsmm = 'g'; lines_xsmm = '-'; markr_xsmm = 'o';
color_vend = 'b'; lines_vend = '-.'; markr_vend = '.';
% Compute the peak performance in terms of the number of double flops
@@ -57,6 +59,7 @@ blislpab_legend = sprintf( 'BLIS conv' );
eigen_legend = sprintf( 'Eigen' );
open_legend = sprintf( 'OpenBLAS' );
bfeo_legend = sprintf( 'BLASFEO' );
xsmm_legend = sprintf( 'libxsmm' );
%vend_legend = sprintf( 'MKL' );
%vend_legend = sprintf( 'ARMPL' );
vend_legend = vend_str;
@@ -106,6 +109,11 @@ else
np = size( data_blissup, 1 );
end
has_xsmm = 1;
if data_xsmm( 1, flopscol ) == 0.0
has_xsmm = 0;
end
% Grab the last x-axis value.
x_end = data_blissup( np, psize_col );
@@ -128,6 +136,15 @@ open_ln = line( x_axis( 1:np, 1 ), data_open( 1:np, flopscol ) / nth, ...
bfeo_ln = line( x_axis( 1:np, 1 ), data_bfeo( 1:np, flopscol ) / nth, ...
'Color',color_bfeo, 'LineStyle',lines_bfeo, ...
'LineWidth',linesize );
if has_xsmm == 1
xsmm_ln = line( x_axis( 1:np, 1 ), data_xsmm( 1:np, flopscol ) / nth, ...
'Color',color_xsmm, 'LineStyle',lines_xsmm, ...
'LineWidth',linesize );
else
xsmm_ln = line( nan, nan, ...
'Color',color_xsmm, 'LineStyle',lines_xsmm, ...
'LineWidth',linesize );
end
vend_ln = line( x_axis( 1:np, 1 ), data_vend( 1:np, flopscol ) / nth, ...
'Color',color_vend, 'LineStyle',lines_vend, ...
'LineWidth',linesize );
@@ -148,6 +165,9 @@ open_ln = line( nan, nan, ...
bfeo_ln = line( nan, nan, ...
'Color',color_bfeo, 'LineStyle',lines_bfeo, ...
'LineWidth',linesize );
xsmm_ln = line( nan, nan, ...
'Color',color_xsmm, 'LineStyle',lines_xsmm, ...
'LineWidth',linesize );
vend_ln = line( nan, nan, ...
'Color',color_vend, 'LineStyle',lines_vend, ...
'LineWidth',linesize );
@@ -178,40 +198,72 @@ elseif 500 <= x_end && x_end < 1000
end
if show_plot == 1 || theid == legend_plot_id
if rows == 4 && cols == 7
if nth == 1 && theid == legend_plot_id
leg = legend( ...
[ ...
blissup_ln ...
blislpab_ln ...
eigen_ln ...
open_ln ...
bfeo_ln ...
vend_ln ...
], ...
blissup_legend, ...
blislpab_legend, ...
eigen_legend, ...
open_legend, ...
bfeo_legend, ...
vend_legend, ...
'Location', legend_loc );
if has_xsmm == 1
leg = legend( ...
[ ...
blissup_ln ...
blislpab_ln ...
eigen_ln ...
open_ln ...
bfeo_ln ...
xsmm_ln ...
vend_ln ...
], ...
blissup_legend, ...
blislpab_legend, ...
eigen_legend, ...
open_legend, ...
bfeo_legend, ...
xsmm_legend, ...
vend_legend, ...
'Location', legend_loc );
set( leg,'Box','off' );
set( leg,'Color','none' );
set( leg,'Units','inches' );
if impl == 'octave'
set( leg,'FontSize',fontsize );
set( leg,'Position',[11.92 6.54 1.15 0.7 ] ); % (1,4tl)
else
set( leg,'FontSize',fontsize-3 );
set( leg,'Position',[18.20 10.20 1.15 0.7 ] ); % (1,4tl)
end
else
leg = legend( ...
[ ...
blissup_ln ...
blislpab_ln ...
eigen_ln ...
open_ln ...
bfeo_ln ...
vend_ln ...
], ...
blissup_legend, ...
blislpab_legend, ...
eigen_legend, ...
open_legend, ...
bfeo_legend, ...
vend_legend, ...
'Location', legend_loc );
set( leg,'Box','off' );
set( leg,'Color','none' );
set( leg,'Units','inches' );
if impl == 'octave'
set( leg,'FontSize',fontsize );
set( leg,'Position',[11.92 6.54 1.15 0.7 ] ); % (1,4tl)
else
set( leg,'FontSize',fontsize-1 );
set( leg,'Position',[18.24 10.15 1.15 0.7 ] ); % (1,4tl)
end
end
set( leg,'Box','off' );
set( leg,'Color','none' );
set( leg,'Units','inches' );
% xpos ypos
%set( leg,'Position',[11.32 6.36 1.15 0.7 ] ); % (1,4tl)
if impl == 'octave'
set( leg,'FontSize',fontsize );
set( leg,'Position',[11.92 6.54 1.15 0.7 ] ); % (1,4tl)
else
set( leg,'FontSize',fontsize-1 );
set( leg,'Position',[18.24 10.15 1.15 0.7 ] ); % (1,4tl)
end
elseif nth > 1 && theid == legend_plot_id
end
end
end
set( ax1,'FontSize',fontsize );
set( ax1,'TitleFontSizeMultiplier',1.0 ); % default is 1.1.
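
The `has_xsmm` logic in the hunks above can be illustrated in isolation. The following is a minimal sketch, not the script itself: the column index and the data are made up for illustration, and only the pattern is taken from the diff (a zero in the first row's flops column marks "no data", and a `line( nan, nan, ... )` call yields a valid but invisible handle that can still be passed to `legend`).

```matlab
% Sketch with hypothetical data; flopscol = 4 is an assumption, not a value
% taken from the commit.
flopscol  = 4;
data_xsmm = zeros( 5, flopscol );   % all-zeros placeholder (no libxsmm data)

% A zero in the first row's flops column is treated as "no data present".
has_xsmm = ( data_xsmm( 1, flopscol ) ~= 0.0 );

fig = figure( 'Visible', 'off' );
if has_xsmm
    % Real data: plot the flops column against the row index.
    xsmm_ln = line( ( 1:5 )', data_xsmm( :, flopscol ), 'Color', 'g' );
else
    % NaN coordinates draw nothing, but a line handle is still returned.
    xsmm_ln = line( nan, nan, 'Color', 'g' );
end
```

Keeping a handle in both branches means later code that refers to `xsmm_ln` does not need its own existence check; only the legend call has to branch on `has_xsmm`.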


@@ -23,6 +23,7 @@ filetemp_blislpab = '%s/output_%s_%s_blislpab.m';
filetemp_eigen = '%s/output_%s_%s_eigen.m';
filetemp_open = '%s/output_%s_%s_openblas.m';
filetemp_bfeo = '%s/output_%s_%s_blasfeo.m';
filetemp_xsmm = '%s/output_%s_%s_libxsmm.m';
filetemp_vend = '%s/output_%s_%s_vendor.m';
% Create a variable name "template" for the variables contained in the
@@ -83,15 +84,10 @@ for opi = 1:n_opsupnames
% Load the data files.
%str = sprintf( ' Loading %s', file_blissup ); disp(str);
run( file_blissup )
%str = sprintf( ' Loading %s', file_blislpab ); disp(str);
run( file_blislpab )
%str = sprintf( ' Loading %s', file_eigen ); disp(str);
run( file_eigen )
%str = sprintf( ' Loading %s', file_open ); disp(str);
run( file_open )
%str = sprintf( ' Loading %s', file_open ); disp(str);
run( file_bfeo )
%str = sprintf( ' Loading %s', file_vend ); disp(str);
run( file_vend )
% Construct variable names for the variables in the data files.
@@ -111,11 +107,25 @@ for opi = 1:n_opsupnames
data_bfeo = eval( var_bfeo ); % e.g. data_st_dgemm_blasfeo( :, : );
data_vend = eval( var_vend ); % e.g. data_st_dgemm_vendor( :, : );
if stor_str == 'ccc'
% Only read xsmm data for the column storage case, since that's the
% only format that libxsmm supports.
file_xsmm = sprintf( filetemp_xsmm, dirpath, thr_str, opsupname );
run( file_xsmm )
var_xsmm = sprintf( vartemp, thr_str, opname, 'libxsmm' );
data_xsmm = eval( var_xsmm ); % e.g. data_st_dgemm_libxsmm( :, : );
else
% Set the data variable to zeros using the same dimensions as the other
% variables.
data_xsmm = zeros( size( data_blissup, 1 ), ...
size( data_blissup, 2 ) );
end
%str = sprintf( ' Reading %s', var_blissup ); disp(str);
%str = sprintf( ' Reading %s', var_blislpab ); disp(str);
%str = sprintf( ' Reading %s', var_eigen ); disp(str);
%str = sprintf( ' Reading %s', var_open ); disp(str);
%str = sprintf( ' Reading %s', var_bfeo ); disp(str);
%str = sprintf( ' Reading %s', var_xsmm ); disp(str);
%str = sprintf( ' Reading %s', var_vend ); disp(str);
% Plot one result in an m x n grid of plots, via the subplot()
@@ -127,6 +137,7 @@ for opi = 1:n_opsupnames
data_eigen, ...
data_open, ...
data_bfeo, ...
data_xsmm, ...
data_vend, vend_str, ...
nth, ...
4, 7, ...
@@ -140,6 +151,7 @@ for opi = 1:n_opsupnames
clear data_eigen;
clear data_open;
clear data_bfeo;
clear data_xsmm;
clear data_vend;
end
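
The conditional load shown above (read libxsmm results only for column storage, otherwise substitute zeros) can be sketched on its own as follows. The directory, operation name, and stand-in data here are hypothetical; only the file template string is copied from the diff. The sketch uses `strcmp` rather than `==` for the string test, since `==` on character arrays compares element by element and requires both operands to have the same length.

```matlab
% Hypothetical inputs for illustration only.
filetemp_xsmm = '%s/output_%s_%s_libxsmm.m';   % template from the diff
dirpath       = 'results';                     % hypothetical directory
thr_str       = 'st';
opsupname     = 'dgemm_nn_example';            % hypothetical operation name
data_blissup  = ones( 3, 4 );                  % stand-in for loaded BLIS data

stor_strs = { 'ccc', 'rrr' };
for i = 1:numel( stor_strs )
    stor_str = stor_strs{ i };
    if strcmp( stor_str, 'ccc' )
        % Column storage: build the file name that would be passed to run().
        file_xsmm = sprintf( filetemp_xsmm, dirpath, thr_str, opsupname );
        fprintf( '%s: would load %s\n', stor_str, file_xsmm );
    else
        % Other storage: zeros with the same dimensions as the BLIS data.
        data_xsmm = zeros( size( data_blissup ) );
        fprintf( '%s: using %dx%d zero placeholder\n', stor_str, ...
                 size( data_xsmm, 1 ), size( data_xsmm, 2 ) );
    end
end
```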


@@ -1,8 +1,12 @@
% haswell
plot_panel_trxsh(3.25,16,1,'st','d','ccc',[ 6 8 4 ],'../results/haswell/20190823/4_800_4_mt201','has','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.25,16,1,'st','d','rrr',[ 6 8 4 ],'../results/haswell/20190823/4_800_4_mt201','has','MKL','matlab'); close; clear all;
% kabylake
plot_panel_trxsh(3.8,16,1,'st','d','rrr',[ 6 8 4 ],'../results/kabylake/20190619/4_800_4_mt201','kbl','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.8,16,1,'st','d','ccc',[ 6 8 4 ],'../results/kabylake/20190619/4_800_4_mt201','kbl','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.80,16,1,'st','d','rrr',[ 6 8 4 ],'../results/kabylake/20190823/4_800_4_mt201','kbl','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.80,16,1,'st','d','ccc',[ 6 8 4 ],'../results/kabylake/20190823/4_800_4_mt201','kbl','MKL','matlab'); close; clear all;
% epyc
plot_panel_trxsh(3.0,8,1,'st','d','rrr',[ 6 8 4 ],'../results/epyc/20190619/4_800_4_mt256','epyc','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.0,8,1,'st','d','ccc',[ 6 8 4 ],'../results/epyc/20190619/4_800_4_mt256','epyc','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.00, 8,1,'st','d','rrr',[ 6 8 4 ],'../results/epyc/20190826/4_800_4_mt256','epyc','MKL','matlab'); close; clear all;
plot_panel_trxsh(3.00, 8,1,'st','d','ccc',[ 6 8 4 ],'../results/epyc/20190826/4_800_4_mt256','epyc','MKL','matlab'); close; clear all;
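
For readers unfamiliar with these invocations, the following sketch tabulates the product of the first two arguments of each new call. It assumes, without confirmation from the diff, that those arguments are a clock rate in GHz and a count of double-precision flops per cycle per core; under that reading the product is a per-core peak in GFLOPS.

```matlab
% Assumption (not confirmed by the diff): columns 2 and 3 hold the clock
% rate in GHz and the double-precision flops per cycle, as passed in the
% first two arguments of the plot_panel_trxsh calls above.
configs = { 'haswell',  3.25, 16; ...
            'kabylake', 3.80, 16; ...
            'epyc',     3.00,  8 };

peaks = zeros( size( configs, 1 ), 1 );
for i = 1:size( configs, 1 )
    % Peak under the assumed reading: GHz times flops per cycle.
    peaks( i ) = configs{ i, 2 } * configs{ i, 3 };
    fprintf( '%-8s %6.2f GFLOPS\n', configs{ i, 1 }, peaks( i ) );
end
```

Under this reading, the values passed in the script need not match the maximum clock rates quoted in the documentation (for example, 3.25 here versus 3.5GHz listed for Haswell).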