Merge commit 'e366665c' into amd-main
* commit 'e366665c':
Fixed stale API calls to membrk API in gemmlike.
Fixed bli_init.c compile-time error on OSX clang.
Fixed configure breakage on OSX clang.
Fixed one-time use property of bli_init() (#525).
CREDITS file update.
Added Graviton2 Neoverse N1 performance results.
Remove unnecessary windows/zen2 directory.
Add vzeroupper to Haswell microkernels. (#524)
Fix Win64 AVX512 bug.
Add comment about make checkblas on Windows
CREDITS file update.
Test installation in Travis CI
Add symlink to blis.pc.in for out-of-tree builds
Revert "Always run `make check`."
Always run `make check`.
Fixed configure script bug. Details: - Fixed a kernel list string substitution error by adding the function substitute_words to the configure script. Previously, if the string contained both zen and zen2 and zen needed to be replaced with another string, zen2 was also incorrectly replaced.
Update POWER10.md
Rework POWER10 sandbox
Skip clearing temp microtile in gemmlike sandbox.
Fix asm warning
Sandbox header edits trigger full library rebuild.
Add vhsubpd/vhsubpd.
Fixed bugs in cpackm kernels, gemmlike code.
Armv8A Rename Regs for Safe Darwin Compile
Armv8A Rename Regs for Clang Compile: FP32 Part
Armv8A Rename Regs for Clang Compile: FP64 Part
Asm Flag Mingling for Darwin_Aarch64
Added a new 'gemmlike' sandbox.
Updated Fugaku (a64fx) performance results.
Add explicit compiler check for Windows.
Remove `rm-dupls` function in common.mk.
Travis CI Revert Unnecessary Extras from 91d3636
Adjust TravisCI
Travis Support Arm SVE
Added 512b SVE-based a64fx subconfig + SVE kernels.
Replace bli_dlamch with something less archaic (#498)
Allow clang for ThunderX2 config
AMD-Internal: [CPUPL-2698]
Change-Id: I561ca3959b7049a00cc128dee3617be51ae11bc4
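The "Fixed configure script bug" entry above describes a whole-word substitution problem in a space-separated kernel list. The sketch below illustrates that idea in Python. It is not the `substitute_words` shell function from the configure script; the Python function of the same name, the list contents, and the replacement name `generic` are made up for illustration.

```python
def substitute_words(text, old, new):
    """Replace only whole words equal to `old` in a whitespace-separated list.

    Hypothetical Python analogue of the idea described in the commit message.
    Runs of whitespace are collapsed to single spaces by split()/join().
    """
    return " ".join(new if word == old else word for word in text.split())


kernel_list = "zen zen2 haswell"

# Naive substring replacement also rewrites the "zen" inside "zen2".
naive = kernel_list.replace("zen", "generic")

# Whole-word replacement leaves "zen2" untouched.
fixed = substitute_words(kernel_list, "zen", "generic")

print(naive)
print(fixed)
```

Comparing words for equality, rather than searching for substrings, is what keeps `zen2` intact when `zen` is replaced.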
@@ -24,6 +24,9 @@
 * **[A64fx](Performance.md#a64fx)**
   * **[Experiment details](Performance.md#a64fx-experiment-details)**
   * **[Results](Performance.md#a64fx-results)**
+* **[Neoverse N1](Performance.md#neoverse-n1)**
+  * **[Experiment details](Performance.md#neoverse-n1-experiment-details)**
+  * **[Results](Performance.md#neoverse-n1-results)**
 * **[Feedback](Performance.md#feedback)**

 # Introduction
@@ -534,7 +537,7 @@ The `runthese.m` file will contain example invocations of the function.
 ### A64fx experiment details

 * Location: RIKEN Center of Computational Science in Kobe, Japan
-* These test results were gathered on the Fugaku supercomputer under project "量子物質の創発と機能のための基礎科学 ―「富岳」と最先端実験の密連携による革新的強相関電子科学" (hp200132)
+* These test results were gathered on the Fugaku supercomputer under project "量子物質の創発と機能のための基礎科学 ―「富岳」と最先端実験の密連携による革新的強相関電子科学" (hp200132) (Basic Science for Emergence and Functionality in Quantum Matter: Innovative Strongly-Correlated Electron Science by Integration of "Fugaku" and Frontier Experiments)
 * Processor model: Fujitsu A64fx
 * Core topology: one socket, 4 NUMA groups per socket, 13 cores per group (one reserved for the OS), 48 cores total
 * SMT status: Unknown
@@ -546,23 +549,17 @@ The `runthese.m` file will contain example invocations of the function.
   * multicore: 70.4 GFLOPS/core (double-precision), 140.8 GFLOPS/core (single-precision)
 * Operating system: RHEL 8.3
 * Page size: 256 bytes
-* Compiler: gcc 9.3.0
-* Results gathered: 2 April 2021
+* Compiler: gcc 10.1.0
+* Results gathered: 2 April 2021; BLIS and SSL2 updated on 20 May 2021
 * Implementations tested:
-  * BLIS 757cb1c (post-0.8.1)
-    * configured with `./configure -t openmp --sve-vector-size=vla CFLAGS="-D_A64FX -DPREFETCH256 -DSVE_NO_NAT_COMPLEX_KERNELS" arm64_sve` (single- and multithreaded)
-    * sub-configuration exercised: `arm64_sve`
-    * Single-threaded (1 core) execution requested via:
-      * `export BLIS_SVE_KC_D=2048 BLIS_SVE_MC_D=128 BLIS_SVE_NC_D=26880 BLIS_SVE_KERNEL_IDX_D=14` (double precision)
-      * `export BLIS_SVE_KC_S=2048 BLIS_SVE_MC_S=256 BLIS_SVE_NC_S=23040 BLIS_SVE_KERNEL_IDX_S=2` (single precision)
-    * Multithreaded (12 core) execution requested via:
-      * `export BLIS_JC_NT=1 BLIS_IC_NT=2 BLIS_JR_NT=6`
-      * `export BLIS_SVE_KC_D=2400 BLIS_SVE_MC_D=64 BLIS_SVE_NC_D=26880 BLIS_SVE_KERNEL_IDX_D=14` (double precision)
-      * `export BLIS_SVE_KC_S=2400 BLIS_SVE_MC_S=128 BLIS_SVE_NC_S=23040 BLIS_SVE_KERNEL_IDX_S=2` (single precision)
-    * Multithreaded (48 core) execution requested via:
-      * `export BLIS_JC_NT=1 BLIS_IC_NT=4 BLIS_JR_NT=12`
-      * `export BLIS_SVE_KC_D=2048 BLIS_SVE_MC_D=128 BLIS_SVE_NC_D=26880 BLIS_SVE_KERNEL_IDX_D=14` (double precision)
-      * `export BLIS_SVE_KC_S=2048 BLIS_SVE_MC_S=256 BLIS_SVE_NC_S=23040 BLIS_SVE_KERNEL_IDX_S=2` (single precision)
+  * BLIS 61584de (post-0.8.1)
+    * configured with:
+      * `../configure -t none CFLAGS="-DCACHE_SECTOR_SIZE_READONLY" a64fx` (single-threaded)
+      * `../configure -t openmp CFLAGS="-DCACHE_SECTOR_SIZE_READONLY" a64fx` (multithreaded)
+    * sub-configuration exercised: `a64fx`
+    * Single-threaded (1 core) execution requested via no change in environment variables
+    * Multithreaded (12 core) execution requested via `export BLIS_JC_NT=1 BLIS_IC_NT=1 BLIS_JR_NT=12`
+    * Multithreaded (48 core) execution requested via `export BLIS_JC_NT=1 BLIS_IC_NT=4 BLIS_JR_NT=12`
   * Eigen 3.3.9
     * Obtained via the [Eigen GitLab homepage](https://gitlab.com/libeigen/eigen)
     * configured and built BLAS library via `mkdir build; cd build; cmake ..; make blas`
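In the thread settings above, the product of the `BLIS_JC_NT`, `BLIS_IC_NT`, and `BLIS_JR_NT` values matches the stated core count in each case (for example, 1 x 4 x 12 = 48). The Python check below is a minimal sketch that assumes the total thread count is the product of the per-loop factors given in an `export` line; the helper name `total_threads` is hypothetical and not part of BLIS.

```python
from math import prod  # available in Python 3.8 and later


def total_threads(export_line):
    """Multiply every BLIS_*_NT factor found in an `export ...` line.

    Tokens without '=' (such as the word `export`) and variables that are
    not per-loop factors (such as the BLIS_SVE_* block sizes) are ignored.
    An empty set of factors yields 1, i.e. single-threaded execution.
    """
    factors = []
    for token in export_line.split():
        name, sep, value = token.partition("=")
        if sep and name.startswith("BLIS_") and name.endswith("_NT"):
            factors.append(int(value))
    return prod(factors)


old_12 = total_threads("export BLIS_JC_NT=1 BLIS_IC_NT=2 BLIS_JR_NT=6")
new_12 = total_threads("export BLIS_JC_NT=1 BLIS_IC_NT=1 BLIS_JR_NT=12")
all_48 = total_threads("export BLIS_JC_NT=1 BLIS_IC_NT=4 BLIS_JR_NT=12")
print(old_12, new_12, all_48)
```

Both 12-core settings give the same total and differ only in how the threads are split between the loops, which is the change reflected in the graph file names (`jc1ic2jr6` versus `jc1ic1jr12`).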
@@ -593,7 +590,7 @@ The `runthese.m` file will contain example invocations of the function.
 #### pdf

 * [A64fx single-threaded](graphs/large/l3_perf_a64fx_nt1.pdf)
-* [A64fx multithreaded (12 cores)](graphs/large/l3_perf_a64fx_jc1ic2jr6_nt12.pdf)
+* [A64fx multithreaded (12 cores)](graphs/large/l3_perf_a64fx_jc1ic1jr12_nt12.pdf)
 * [A64fx multithreaded (48 cores)](graphs/large/l3_perf_a64fx_jc1ic4jr12_nt48.pdf)

 #### png (inline)
@@ -601,12 +598,64 @@ The `runthese.m` file will contain example invocations of the function.
 * **A64fx single-threaded**
 
 * **A64fx multithreaded (12 cores)**
-
+
 * **A64fx multithreaded (48 cores)**
 

 ---

+## Neoverse N1
+
+### Neoverse N1 experiment details
+
+* Location: AWS cloud
+* Processor model: Graviton2 Neoverse N1
+* Core topology: one socket, 64 cores per socket, 64 cores total
+* SMT status: none
+* Max clock rate: 2.5GHz (single-core and multicore)
+* Max vector register length: 128 bits (NEON)
+* Max FMA vector IPC: 2
+* Peak performance:
+  * single-core: 20.0 GFLOPS (double-precision), 40.0 GFLOPS (single-precision)
+  * multicore: 20.0 GFLOPS/core (double-precision), 40.0 GFLOPS/core (single-precision)
+* Operating system: unknown
+* Page size: unknown
+* Compiler: gcc 10.3.0
+* Results gathered: 15 July 2021
+* Implementations tested:
+  * BLIS fab5c86d (0.8.1-67)
+    * configured with `./configure -t openmp thunderx2` (single- and multithreaded)
+    * sub-configuration exercised: `thunderx2`
+    * Single-threaded (1 core) execution requested via no change in environment variables
+    * Multithreaded (64 core) execution requested via `export BLIS_NUM_THREADS=64`
+  * OpenBLAS 0.3.17
+    * configured `Makefile.rule` with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0` (single-threaded)
+    * configured `Makefile.rule` with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=64` (multithreaded, 64 cores)
+    * Single-threaded (1 core) execution requested via `export OPENBLAS_NUM_THREADS=1`
+    * Multithreaded (64 core) execution requested via `export OPENBLAS_NUM_THREADS=64`
+* Affinity:
+  * Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0-63"`. However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` environment variable was unset.
+* Frequency throttling (via `cpupower`):
+  * No changes made.
+* Comments:
+  * N/A
+
+### Neoverse N1 results
+
+#### pdf
+
+* [Neoverse N1 single-threaded](graphs/large/l3_perf_nn1_nt1.pdf)
+* [Neoverse N1 multithreaded (64 cores)](graphs/large/l3_perf_nn1_jc2ic8jr4_nt64.pdf)
+
+#### png (inline)
+
+* **Neoverse N1 single-threaded**
+
+* **Neoverse N1 multithreaded (64 cores)**
+
+
+---
+
 # Feedback

 Please let us know what you think of these performance results! Similarly, if you have any questions or concerns, or are interested in reproducing these performance experiments on your own hardware, we invite you to [open an issue](https://github.com/flame/blis/issues) and start a conversation with BLIS developers.
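The Neoverse N1 peak figures in the hunk above are consistent with clock rate x vector lanes x 2 flops per FMA x FMA instructions per cycle. The Python sketch below works through that arithmetic using the values listed in the experiment details. It assumes each FMA counts as two floating-point operations per lane; `peak_gflops` is a hypothetical helper, not part of BLIS.

```python
def peak_gflops(clock_ghz, vector_bits, element_bits, fma_ipc):
    """Theoretical per-core peak in GFLOPS.

    lanes    = elements that fit in one vector register
    2        = one multiply plus one add per FMA, per lane (assumption)
    fma_ipc  = FMA vector instructions issued per cycle
    """
    lanes = vector_bits // element_bits
    return clock_ghz * lanes * 2 * fma_ipc


# Values from the Neoverse N1 experiment details: 2.5 GHz, 128-bit NEON, FMA IPC of 2.
n1_double = peak_gflops(2.5, 128, 64, 2)  # 2 lanes of 64-bit doubles
n1_single = peak_gflops(2.5, 128, 32, 2)  # 4 lanes of 32-bit floats
print(n1_double, n1_single)
```

The results agree with the 20.0 GFLOPS (double-precision) and 40.0 GFLOPS (single-precision) per-core peaks stated in the list.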
BIN docs/graphs/large/l3_perf_a64fx_jc1ic1jr12_nt12.pdf (Normal file)
BIN docs/graphs/large/l3_perf_a64fx_jc1ic1jr12_nt12.png (Normal file; size after: 250 KiB)
Image (file name not shown): size before: 248 KiB
Image (file name not shown): size before: 258 KiB, after: 260 KiB
Image (file name not shown): size before: 249 KiB, after: 250 KiB
BIN docs/graphs/large/l3_perf_nn1_jc2ic8jr4_nt64.pdf (Normal file)
BIN docs/graphs/large/l3_perf_nn1_jc2ic8jr4_nt64.png (Normal file; size after: 215 KiB)
BIN docs/graphs/large/l3_perf_nn1_nt1.pdf (Normal file)
BIN docs/graphs/large/l3_perf_nn1_nt1.png (Normal file; size after: 150 KiB)