Added Epyc 7742 Zen2 ("Rome") performance results.
Details:
- Added single-threaded and multithreaded performance results to docs/Performance.md. These results were gathered on an Epyc 7742 "Rome" server with AMD's Zen2 microarchitecture. Special thanks to Jeff Diamond for facilitating access to the system via the Oracle Cloud.
- Renamed files containing the previous Zen performance results for consistency with the new results.
@@ -15,9 +15,9 @@
   * **[Haswell](Performance.md#haswell)**
     * **[Experiment details](Performance.md#haswell-experiment-details)**
     * **[Results](Performance.md#haswell-results)**
-  * **[Epyc](Performance.md#epyc)**
-    * **[Experiment details](Performance.md#epyc-experiment-details)**
-    * **[Results](Performance.md#epyc-results)**
+  * **[Zen](Performance.md#zen)**
+    * **[Experiment details](Performance.md#zen-experiment-details)**
+    * **[Results](Performance.md#zen-results)**
 * **[Feedback](Performance.md#feedback)**

 # Introduction
@@ -355,12 +355,12 @@ The `runthese.m` file will contain example invocations of the function.

 ---

-## Epyc
+## Zen

-### Epyc experiment details
+### Zen experiment details

 * Location: Oracle cloud
-* Processor model: AMD Epyc 7551 (Zen1)
+* Processor model: AMD Epyc 7551 (Zen1 "Naples")
 * Core topology: two sockets, 4 dies per socket, 2 core complexes (CCX) per die, 4 cores per CCX, 64 cores total
 * SMT status: enabled, but not utilized
 * Max clock rate: 3.0GHz (single-core), 2.55GHz (multicore)
@@ -417,22 +417,105 @@ The `runthese.m` file will contain example invocations of the function.
 * Comments:
   * MKL performance is dismal, despite being linked in the same manner as on the Xeon Platinum. It's not clear what is causing the slowdown. It could be that MKL's runtime kernel/blocksize selection logic is falling back to some older, more basic implementation because CPUID is not returning Intel as the hardware vendor. Alternatively, it's possible that MKL is trying to use kernels for the closest Intel architectures--say, Haswell/Broadwell--but its implementations use Haswell-specific optimizations that, due to microarchitectural differences, degrade performance on Zen.

-### Epyc results
+### Zen results

 #### pdf

-* [Epyc single-threaded](graphs/large/l3_perf_epyc_nt1.pdf)
-* [Epyc multithreaded (32 cores)](graphs/large/l3_perf_epyc_jc1ic8jr4_nt32.pdf)
-* [Epyc multithreaded (64 cores)](graphs/large/l3_perf_epyc_jc2ic8jr4_nt64.pdf)
+* [Zen single-threaded](graphs/large/l3_perf_zen_nt1.pdf)
+* [Zen multithreaded (32 cores)](graphs/large/l3_perf_zen_jc1ic8jr4_nt32.pdf)
+* [Zen multithreaded (64 cores)](graphs/large/l3_perf_zen_jc2ic8jr4_nt64.pdf)

 #### png (inline)

-* **Epyc single-threaded**
-![single-threaded](graphs/large/l3_perf_epyc_nt1.png)
-* **Epyc multithreaded (32 cores)**
-![multithreaded (32 cores)](graphs/large/l3_perf_epyc_jc1ic8jr4_nt32.png)
-* **Epyc multithreaded (64 cores)**
-![multithreaded (64 cores)](graphs/large/l3_perf_epyc_jc2ic8jr4_nt64.png)
+* **Zen single-threaded**
+![single-threaded](graphs/large/l3_perf_zen_nt1.png)
+* **Zen multithreaded (32 cores)**
+![multithreaded (32 cores)](graphs/large/l3_perf_zen_jc1ic8jr4_nt32.png)
+* **Zen multithreaded (64 cores)**
+![multithreaded (64 cores)](graphs/large/l3_perf_zen_jc2ic8jr4_nt64.png)

 ---

+## Zen2
+
+### Zen2 experiment details
+
+* Location: Oracle cloud
+* Processor model: AMD Epyc 7742 (Zen2 "Rome")
+* Core topology: two sockets, 8 Core Complex Dies (CCDs) per socket, 2 Core Complexes (CCX) per CCD, 4 cores per CCX, 128 cores total
+* SMT status: enabled, but not utilized
+* Max clock rate: 2.25GHz (base, documented); 3.4GHz boost (single-core, documented); 2.6GHz boost (multicore, estimated)
+* Max vector register length: 256 bits (AVX2)
+* Max FMA vector IPC: 2
+  * Alternatively, FMA vector IPC is 4 when vectors are limited to 128 bits each.
+* Peak performance:
+  * single-core: 54.4 GFLOPS (double-precision), 108.8 GFLOPS (single-precision)
+  * multicore (estimated): 41.6 GFLOPS/core (double-precision), 83.2 GFLOPS/core (single-precision)
+* Operating system: Ubuntu 18.04 (Linux kernel 4.15.0)
+* Page size: 4096 bytes
+* Compiler: gcc 9.3.0
+* Results gathered: 24 September 2020, 29 September 2020
+* Implementations tested:
+  * BLIS 4fd8d9f (0.7.0-55)
+    * configured with `./configure -t openmp auto` (single- and multithreaded)
+    * sub-configuration exercised: `zen2`
+    * Single-threaded (1 core) execution requested via no change in environment variables
+    * Multithreaded (64 core) execution requested via `export BLIS_JC_NT=4 BLIS_IC_NT=4 BLIS_JR_NT=4`
+    * Multithreaded (128 core) execution requested via `export BLIS_JC_NT=8 BLIS_IC_NT=4 BLIS_JR_NT=4`
+  * OpenBLAS 0.3.10
+    * configured `Makefile.rule` with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0` (single-threaded)
+    * configured `Makefile.rule` with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=64` (multithreaded, 64 cores)
+    * configured `Makefile.rule` with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=128` (multithreaded, 128 cores)
+    * Single-threaded (1 core) execution requested via `export OPENBLAS_NUM_THREADS=1`
+    * Multithreaded (64 core) execution requested via `export OPENBLAS_NUM_THREADS=64`
+    * Multithreaded (128 core) execution requested via `export OPENBLAS_NUM_THREADS=128`
+  * Eigen 3.3.90
+    * Obtained via the [Eigen GitLab homepage](https://gitlab.com/libeigen/eigen) (24 September 2020)
+    * Prior to compilation, modified top-level `CMakeLists.txt` to ensure that `-march=native` was added to `CXX_FLAGS` variable (h/t Sameer Agarwal):
+      ```
+      # These lines added after line 60.
+      check_cxx_compiler_flag("-march=native" COMPILER_SUPPORTS_MARCH_NATIVE)
+      if(COMPILER_SUPPORTS_MARCH_NATIVE)
+        set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=native")
+      endif()
+      ```
+    * configured and built BLAS library via `mkdir build; cd build; cmake ..; make blas`
+    * The `gemm` implementation was pulled in at compile-time via Eigen headers; other operations were linked to Eigen's BLAS library.
+    * Single-threaded (1 core) execution requested via `export OMP_NUM_THREADS=1`
+    * Multithreaded (64 core) execution requested via `export OMP_NUM_THREADS=64`
+    * Multithreaded (128 core) execution requested via `export OMP_NUM_THREADS=128`
+    * **NOTE**: This version of Eigen does not provide multithreaded implementations of `symm`/`hemm`, `syrk`/`herk`, `trmm`, or `trsm`, and therefore those curves are omitted from the multithreaded graphs.
+  * MKL 2020 update 3
+    * Single-threaded (1 core) execution requested via `export MKL_NUM_THREADS=1`
+    * Multithreaded (64 core) execution requested via `export MKL_NUM_THREADS=64`
+    * Multithreaded (128 core) execution requested via `export MKL_NUM_THREADS=128`
+* Affinity:
+  * Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0-127"`. However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` environment variable was unset.
+  * All executables were run through `numactl --interleave=all`.
+* Frequency throttling (via `cpupower`):
+  * Driver: acpi-cpufreq
+  * Governor: performance
+  * Hardware limits (steps): 1.5GHz, 2.0GHz, 2.25GHz
+  * Adjusted minimum: 2.25GHz
+* Comments:
+  * MKL performance is once again underwhelming. This is likely because Intel has decided that it does not want to give users of MKL a reason to purchase AMD hardware.
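
The peak-performance, core-count, and thread-count figures in the Zen2 experiment details appear to follow from simple products: clock rate × FMA IPC × vector lanes × 2 flops per FMA for peak GFLOPS, and the product of the topology factors (or of the `BLIS_JC_NT`, `BLIS_IC_NT`, `BLIS_JR_NT` values) for core and thread counts. The sketch below is a sanity check under that assumption. The helper names (`peak_gflops`, `total_cores`, `blis_thread_count`) are hypothetical and are not part of BLIS or this commit.

```python
# Illustrative sanity check only; helper names are hypothetical, not BLIS APIs.

def peak_gflops(clock_ghz, fma_ipc, vector_bits, element_bits):
    """Theoretical per-core peak, assuming each FMA counts as 2 flops per lane."""
    lanes = vector_bits // element_bits
    return clock_ghz * fma_ipc * lanes * 2


def total_cores(sockets, ccds_per_socket, ccx_per_ccd, cores_per_ccx):
    """Core count implied by the listed topology."""
    return sockets * ccds_per_socket * ccx_per_ccd * cores_per_ccx


def blis_thread_count(jc_nt, ic_nt, jr_nt):
    """Threads implied by the per-loop factors, assuming unset factors are 1."""
    return jc_nt * ic_nt * jr_nt


if __name__ == "__main__":
    # Single-core boost clock (3.4GHz), 256-bit AVX2 vectors, FMA IPC of 2.
    print(f"single-core DP: {peak_gflops(3.4, 2, 256, 64):.1f} GFLOPS")
    print(f"single-core SP: {peak_gflops(3.4, 2, 256, 32):.1f} GFLOPS")
    # Estimated multicore boost clock (2.6GHz).
    print(f"multicore DP:   {peak_gflops(2.6, 2, 256, 64):.1f} GFLOPS/core")
    print(f"multicore SP:   {peak_gflops(2.6, 2, 256, 32):.1f} GFLOPS/core")
    print(f"total cores:    {total_cores(2, 8, 2, 4)}")
    print(f"64-core run:    {blis_thread_count(4, 4, 4)} threads")
    print(f"128-core run:   {blis_thread_count(8, 4, 4)} threads")
```

Under these assumptions the products match the values listed above: 54.4 and 108.8 GFLOPS single-core, 41.6 and 83.2 GFLOPS/core multicore, 128 cores, and 64 or 128 threads for the two multithreaded configurations.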
+
+### Zen2 results
+
+#### pdf
+
+* [Zen2 single-threaded](graphs/large/l3_perf_zen2_nt1.pdf)
+* [Zen2 multithreaded (64 cores)](graphs/large/l3_perf_zen2_jc4ic4jr4_nt64.pdf)
+* [Zen2 multithreaded (128 cores)](graphs/large/l3_perf_zen2_jc8ic4jr4_nt128.pdf)
+
+#### png (inline)
+
+* **Zen2 single-threaded**
+![single-threaded](graphs/large/l3_perf_zen2_nt1.png)
+* **Zen2 multithreaded (64 cores)**
+![multithreaded (64 cores)](graphs/large/l3_perf_zen2_jc4ic4jr4_nt64.png)
+* **Zen2 multithreaded (128 cores)**
+![multithreaded (128 cores)](graphs/large/l3_perf_zen2_jc8ic4jr4_nt128.png)
+
+---
+
BIN docs/graphs/large/l3_perf_zen2_jc4ic4jr4_nt64.pdf (Normal file)
BIN docs/graphs/large/l3_perf_zen2_jc4ic4jr4_nt64.png (Normal file, Size: 235 KiB)
BIN docs/graphs/large/l3_perf_zen2_jc8ic4jr4_nt128.pdf (Normal file)
BIN docs/graphs/large/l3_perf_zen2_jc8ic4jr4_nt128.png (Normal file, Size: 233 KiB)
BIN docs/graphs/large/l3_perf_zen2_nt1.pdf (Normal file)
BIN docs/graphs/large/l3_perf_zen2_nt1.png (Normal file, Size: 221 KiB)