Updated Fugaku (a64fx) performance results.

Details:
- Updated the performance graphs (pdfs and pngs) for the Fugaku/a64fx
  entry within Performance.md, and also updated the experiment details
  accordingly. Thanks to RuQing Xu for re-running the BLIS and SSL2
  experiments reflected in this commit.
- In Performance.md, added an English translation of the project name
  under which the Fugaku results were gathered, courtesy of RuQing Xu.
Field G. Van Zee
2021-05-25 15:25:08 -05:00
parent e5c85da376
commit 82af05f54c
10 changed files with 16 additions and 22 deletions

@@ -534,7 +534,7 @@ The `runthese.m` file will contain example invocations of the function.
### A64fx experiment details
* Location: RIKEN Center of Computational Science in Kobe, Japan
-* These test results were gathered on the Fugaku supercomputer under project "量子物質の創発と機能のための基礎科学 ―「富岳」と最先端実験の密連携による革新的強相関電子科学" (hp200132)
+* These test results were gathered on the Fugaku supercomputer under project "量子物質の創発と機能のための基礎科学 ―「富岳」と最先端実験の密連携による革新的強相関電子科学" (hp200132) (Basic Science for Emergence and Functionality in Quantum Matter: Innovative Strongly-Correlated Electron Science by Integration of "Fugaku" and Frontier Experiments)
* Processor model: Fujitsu A64fx
* Core topology: one socket, 4 NUMA groups per socket, 13 cores per group (one reserved for the OS), 48 cores total
* SMT status: Unknown
@@ -546,23 +546,17 @@ The `runthese.m` file will contain example invocations of the function.
* multicore: 70.4 GFLOPS/core (double-precision), 140.8 GFLOPS/core (single-precision)
* Operating system: RHEL 8.3
* Page size: 256 bytes
-* Compiler: gcc 9.3.0
-* Results gathered: 2 April 2021
+* Compiler: gcc 10.1.0
+* Results gathered: 2 April 2021; BLIS and SSL2 updated on 20 May 2021
* Implementations tested:
-* BLIS 757cb1c (post-0.8.1)
-* configured with `./configure -t openmp --sve-vector-size=vla CFLAGS="-D_A64FX -DPREFETCH256 -DSVE_NO_NAT_COMPLEX_KERNELS" arm64_sve` (single- and multithreaded)
-* sub-configuration exercised: `arm64_sve`
-* Single-threaded (1 core) execution requested via:
-* `export BLIS_SVE_KC_D=2048 BLIS_SVE_MC_D=128 BLIS_SVE_NC_D=26880 BLIS_SVE_KERNEL_IDX_D=14` (double precision)
-* `export BLIS_SVE_KC_S=2048 BLIS_SVE_MC_S=256 BLIS_SVE_NC_S=23040 BLIS_SVE_KERNEL_IDX_S=2` (single precision)
-* Multithreaded (12 core) execution requested via:
-* `export BLIS_JC_NT=1 BLIS_IC_NT=2 BLIS_JR_NT=6`
-* `export BLIS_SVE_KC_D=2400 BLIS_SVE_MC_D=64 BLIS_SVE_NC_D=26880 BLIS_SVE_KERNEL_IDX_D=14` (double precision)
-* `export BLIS_SVE_KC_S=2400 BLIS_SVE_MC_S=128 BLIS_SVE_NC_S=23040 BLIS_SVE_KERNEL_IDX_S=2` (single precision)
-* Multithreaded (48 core) execution requested via:
-* `export BLIS_JC_NT=1 BLIS_IC_NT=4 BLIS_JR_NT=12`
-* `export BLIS_SVE_KC_D=2048 BLIS_SVE_MC_D=128 BLIS_SVE_NC_D=26880 BLIS_SVE_KERNEL_IDX_D=14` (double precision)
-* `export BLIS_SVE_KC_S=2048 BLIS_SVE_MC_S=256 BLIS_SVE_NC_S=23040 BLIS_SVE_KERNEL_IDX_S=2` (single precision)
+* BLIS 61584de (post-0.8.1)
+* configured with:
+* `../configure -t none CFLAGS="-DCACHE_SECTOR_SIZE_READONLY" a64fx` (single-threaded)
+* `../configure -t openmp CFLAGS="-DCACHE_SECTOR_SIZE_READONLY" a64fx` (multithreaded)
+* sub-configuration exercised: `a64fx`
+* Single-threaded (1 core) execution requested via no change in environment variables
+* Multithreaded (12 core) execution requested via `export BLIS_JC_NT=1 BLIS_IC_NT=1 BLIS_JR_NT=12`
+* Multithreaded (48 core) execution requested via `export BLIS_JC_NT=1 BLIS_IC_NT=4 BLIS_JR_NT=12`
* Eigen 3.3.9
* Obtained via the [Eigen GitLab homepage](https://gitlab.com/libeigen/eigen)
* configured and built BLAS library via `mkdir build; cd build; cmake ..; make blas`
@@ -593,7 +587,7 @@ The `runthese.m` file will contain example invocations of the function.
#### pdf
* [A64fx single-threaded](graphs/large/l3_perf_a64fx_nt1.pdf)
-* [A64fx multithreaded (12 cores)](graphs/large/l3_perf_a64fx_jc1ic2jr6_nt12.pdf)
+* [A64fx multithreaded (12 cores)](graphs/large/l3_perf_a64fx_jc1ic1jr12_nt12.pdf)
* [A64fx multithreaded (48 cores)](graphs/large/l3_perf_a64fx_jc1ic4jr12_nt48.pdf)
#### png (inline)
@@ -601,7 +595,7 @@ The `runthese.m` file will contain example invocations of the function.
* **A64fx single-threaded**
![single-threaded](graphs/large/l3_perf_a64fx_nt1.png)
* **A64fx multithreaded (12 cores)**
-![multithreaded (12 cores)](graphs/large/l3_perf_a64fx_jc1ic2jr6_nt12.png)
+![multithreaded (12 cores)](graphs/large/l3_perf_a64fx_jc1ic1jr12_nt12.png)
* **A64fx multithreaded (48 cores)**
![multithreaded (48 cores)](graphs/large/l3_perf_a64fx_jc1ic4jr12_nt48.png)
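
The thread settings and the renamed graph files in this diff follow one pattern: the total thread count is the product of the per-loop values (`BLIS_JC_NT`, `BLIS_IC_NT`, `BLIS_JR_NT`), and the graph file stem encodes both the factorization and the total. The sketch below only illustrates that relationship. The helper names are hypothetical and not part of BLIS, and the loops that are not set here are assumed to stay at 1.

```python
# Hypothetical helpers for illustration; not part of BLIS or its test scripts.

def total_threads(jc_nt: int, ic_nt: int, jr_nt: int) -> int:
    """Threads implied by BLIS_JC_NT, BLIS_IC_NT and BLIS_JR_NT.

    The per-loop ways of parallelism multiply; loops that are not set
    are assumed to contribute a factor of 1.
    """
    return jc_nt * ic_nt * jr_nt


def export_line(jc_nt: int, ic_nt: int, jr_nt: int) -> str:
    """Shell line in the same form as the ones quoted in Performance.md."""
    return f"export BLIS_JC_NT={jc_nt} BLIS_IC_NT={ic_nt} BLIS_JR_NT={jr_nt}"


def graph_stem(arch: str, jc_nt: int, ic_nt: int, jr_nt: int) -> str:
    """File stem following the naming pattern visible in the graph links."""
    nt = total_threads(jc_nt, ic_nt, jr_nt)
    return f"l3_perf_{arch}_jc{jc_nt}ic{ic_nt}jr{jr_nt}_nt{nt}"


if __name__ == "__main__":
    # The two multithreaded configurations used after this commit.
    for jc, ic, jr in [(1, 1, 12), (1, 4, 12)]:
        print(export_line(jc, ic, jr))
        print("graphs/large/" + graph_stem("a64fx", jc, ic, jr) + ".pdf")
```

Under this reading, the 12-core change from `jc1ic2jr6` to `jc1ic1jr12` keeps the thread count at 12 and only redistributes the parallelism between the IC and JR loops.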

@@ -24,6 +24,6 @@ plot_panel_4x5(2.60,16,64, '1s','../results/zen2/20200929/jc4ic4jr4','zen2','MKL
plot_panel_4x5(2.60,16,128,'2s','../results/zen2/20200929/jc8ic4jr4','zen2','MKL'); close all; clear all;
% a64fx
-plot_panel_4x5(2.20,32,1, 'st','../results/a64fx/20210405/st', 'a64fx','Fujitsu SSL2'); close all; clear all;
-plot_panel_4x5(2.20,32,12,'1s','../results/a64fx/20210405/jc1ic4jr3', 'a64fx','Fujitsu SSL2'); close all; clear all;
-plot_panel_4x5(2.20,32,48,'2s','../results/a64fx/20210405/jc1ic4jr12','a64fx','Fujitsu SSL2'); close all; clear all;
+plot_panel_4x5(2.20,32,1, 'st','../results/a64fx/20210520/st', 'a64fx','Fujitsu SSL2'); close all; clear all;
+plot_panel_4x5(2.20,32,12,'1s','../results/a64fx/20210520/jc1ic1jr12','a64fx','Fujitsu SSL2'); close all; clear all;
+plot_panel_4x5(2.20,32,48,'2s','../results/a64fx/20210520/jc1ic4jr12','a64fx','Fujitsu SSL2'); close all; clear all;
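
The first three arguments of the `plot_panel_4x5` calls appear to be the clock frequency in GHz, the double-precision flops per cycle per core, and the thread count. That reading is an assumption, but it is consistent with the 70.4 GFLOPS/core figure quoted in Performance.md (2.20 × 32). A minimal sketch of the arithmetic, with hypothetical function names, and with the single-precision rate assumed to be twice the double-precision one (which matches the quoted 140.8 GFLOPS/core):

```python
# Hypothetical helpers; the argument meanings are inferred, not taken from
# the plot_panel_4x5 source.

def peak_gflops_per_core(freq_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak per core: clock rate (GHz) times flops per cycle."""
    return freq_ghz * flops_per_cycle


def peak_gflops(freq_ghz: float, flops_per_cycle: int, cores: int) -> float:
    """Theoretical aggregate peak across `cores` cores at the same clock."""
    return peak_gflops_per_core(freq_ghz, flops_per_cycle) * cores


if __name__ == "__main__":
    freq_ghz = 2.20          # first argument in the a64fx calls
    dp_flops_per_cycle = 32  # second argument in the a64fx calls
    sp_flops_per_cycle = 2 * dp_flops_per_cycle  # assumption: SP is 2x DP

    print(f"DP per core: {peak_gflops_per_core(freq_ghz, dp_flops_per_cycle):.1f} GFLOPS")
    print(f"SP per core: {peak_gflops_per_core(freq_ghz, sp_flops_per_cycle):.1f} GFLOPS")
    for cores in (1, 12, 48):  # thread counts used in the three calls
        print(f"DP peak, {cores} cores: {peak_gflops(freq_ghz, dp_flops_per_cycle, cores):.1f} GFLOPS")
```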