Added Eigen support to test/3 Makefile, runme.sh.

Details:
- Added targets to test/3/Makefile that link against a BLAS library
  built by Eigen. It appears, however, that Eigen's BLAS library does
  not support multithreading. (It may be that multithreading is only
  available when using the native C++ APIs.)
- Updated runme.sh with a few Eigen-related tweaks.
- Minor tweaks to docs/Performance.md.
Field G. Van Zee
2019-03-20 17:52:23 -05:00
parent 05c4e42642
commit 288843b06d
3 changed files with 117 additions and 48 deletions

docs/Performance.md

@@ -49,17 +49,17 @@ Theoretical peak performance, in units of GFLOPS/core, is calculated as the
 product of:
 1. the maximum sustainable clock rate in GHz; and
 2. the maximum number of floating-point operations (flops) that can be
-executed per cycle.
+executed per cycle (per core).
 Note that the maximum sustainable clock rate may change depending on the
 conditions.
 For example, on some systems the maximum clock rate is higher when only one
 core is active (e.g. single-threaded performance) versus when all cores are
 active (e.g. multithreaded performance).
-The maximum number of flops executable per cycle is generally computed as the
-product of:
+The maximum number of flops executable per cycle (per core) is generally
+computed as the product of:
 1. the maximum number of fused multiply-add (FMA) vector instructions that
-can be issued per cycle;
+can be issued per cycle (per core);
 2. the maximum number of elements that can be stored within a single vector
 register (for the datatype in question); and
 3. 2.0, since an FMA instruction fuses two operations (a multiply and an add).
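As a concrete illustration of the three-factor formula above, the peak can be computed in a couple of lines of shell. All of the hardware parameters below are hypothetical example values, not figures for any machine discussed later in this document:

```shell
# Theoretical peak, GFLOPS/core = GHz x FMA-instrs/cycle x elements/vector x 2.
clock_ghz=2.2        # hypothetical max sustainable clock rate (GHz)
fma_per_cycle=2      # hypothetical FMA vector instructions issued per cycle per core
elems_per_vector=8   # e.g. 8 doubles in a 512-bit vector register

peak=$(awk -v ghz="${clock_ghz}" -v fma="${fma_per_cycle}" -v ve="${elems_per_vector}" \
    'BEGIN { printf "%.1f", ghz * fma * ve * 2.0 }')
echo "theoretical peak: ${peak} GFLOPS/core"   # theoretical peak: 70.4 GFLOPS/core
```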
@@ -90,7 +90,7 @@ allow it to finish.
 Where along the x-axis you focus your attention will depend on the segment of
 the problem size range that you care about most. Some people's applications
-depend heavily on smaller problems, where "small" can mean anything from 200
+depend heavily on smaller problems, where "small" can mean anything from 10
 to 1000 or even higher. Some people consider 1000 to be quite large, while
 others insist that 5000 is merely "medium." What each of us considers to be
 small, medium, or large (naturally) depends heavily on the kinds of dense
@@ -132,10 +132,16 @@ size of interest so that we can better assist you.
 * sub-configuration exercised: `thunderx2`
 * OpenBLAS 52d3f7a
 * configured with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0` (single-threaded)
-* configured with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=56` (multithreaded)
+* configured with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=56` (multithreaded, 56 cores)
+* Requested threading via `export OPENBLAS_NUM_THREADS=1` (single-threaded)
+* Requested threading via `export OPENBLAS_NUM_THREADS=28` (multithreaded, 28 cores)
+* Requested threading via `export OPENBLAS_NUM_THREADS=56` (multithreaded, 56 cores)
 * ARMPL 18.4
+* Requested threading via `export OMP_NUM_THREADS=1` (single-threaded)
+* Requested threading via `export OMP_NUM_THREADS=28` (multithreaded, 28 cores)
+* Requested threading via `export OMP_NUM_THREADS=56` (multithreaded, 56 cores)
 * Affinity:
-* Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0 1 2 3 ... 55"`. However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` was unset.
+* Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0 1 2 3 ... 55"`. However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` environment variable was unset.
 * Frequency throttling (via `cpupower`):
 * No changes made.
 * Comments:
@@ -183,10 +189,16 @@ size of interest so that we can better assist you.
 * sub-configuration exercised: `skx`
 * OpenBLAS 0.3.5
 * configured with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0` (single-threaded)
-* configured with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=52` (multithreaded)
+* configured with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=52` (multithreaded, 52 cores)
+* Requested threading via `export OPENBLAS_NUM_THREADS=1` (single-threaded)
+* Requested threading via `export OPENBLAS_NUM_THREADS=26` (multithreaded, 26 cores)
+* Requested threading via `export OPENBLAS_NUM_THREADS=52` (multithreaded, 52 cores)
 * MKL 2019 update 1
+* Requested threading via `export MKL_NUM_THREADS=1` (single-threaded)
+* Requested threading via `export MKL_NUM_THREADS=26` (multithreaded, 26 cores)
+* Requested threading via `export MKL_NUM_THREADS=52` (multithreaded, 52 cores)
 * Affinity:
-* Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0 1 2 3 ... 51"`. However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` was unset.
+* Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0 1 2 3 ... 51"`. However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` environment variable was unset.
 * Frequency throttling (via `cpupower`):
 * No changes made.
 * Comments:
@@ -234,10 +246,16 @@ size of interest so that we can better assist you.
 * sub-configuration exercised: `haswell`
 * OpenBLAS 0.3.5
 * configured with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0` (single-threaded)
-* configured with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=24` (multithreaded)
+* configured with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=24` (multithreaded, 24 cores)
+* Requested threading via `export OPENBLAS_NUM_THREADS=1` (single-threaded)
+* Requested threading via `export OPENBLAS_NUM_THREADS=12` (multithreaded, 12 cores)
+* Requested threading via `export OPENBLAS_NUM_THREADS=24` (multithreaded, 24 cores)
 * MKL 2018 update 2
+* Requested threading via `export MKL_NUM_THREADS=1` (single-threaded)
+* Requested threading via `export MKL_NUM_THREADS=12` (multithreaded, 12 cores)
+* Requested threading via `export MKL_NUM_THREADS=24` (multithreaded, 24 cores)
 * Affinity:
-* Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0 1 2 3 ... 23"`. However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` was unset.
+* Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0 1 2 3 ... 23"`. However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` environment variable was unset.
 * Frequency throttling (via `cpupower`):
 * No changes made.
 * Comments:
@@ -286,10 +304,16 @@ size of interest so that we can better assist you.
 * sub-configuration exercised: `zen`
 * OpenBLAS 0.3.5
 * configured with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0` (single-threaded)
-* configured with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=64` (multithreaded)
+* configured with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=64` (multithreaded, 64 cores)
+* Requested threading via `export OPENBLAS_NUM_THREADS=1` (single-threaded)
+* Requested threading via `export OPENBLAS_NUM_THREADS=32` (multithreaded, 32 cores)
+* Requested threading via `export OPENBLAS_NUM_THREADS=64` (multithreaded, 64 cores)
 * MKL 2019 update 1
+* Requested threading via `export MKL_NUM_THREADS=1` (single-threaded)
+* Requested threading via `export MKL_NUM_THREADS=32` (multithreaded, 32 cores)
+* Requested threading via `export MKL_NUM_THREADS=64` (multithreaded, 64 cores)
 * Affinity:
-* Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0 1 2 3 ... 63"`. However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` was unset.
+* Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0 1 2 3 ... 63"`. However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` environment variable was unset.
 * Frequency throttling (via `cpupower`):
 * Driver: acpi-cpufreq
 * Governor: performance
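The affinity caveat repeated in each machine entry above amounts to the following shell logic. This is a sketch: `im` mirrors the implementation names used by the test driver, and the pin list is an illustrative 4-core example rather than any machine's actual core list:

```shell
# Pin threads explicitly for BLIS, but leave affinity unset for OpenBLAS,
# which appears to revert to single-threaded execution when it is set.
set_affinity() {
    if [ "$1" = "openblas" ]; then
        unset GOMP_CPU_AFFINITY
    else
        export GOMP_CPU_AFFINITY="0 1 2 3"   # illustrative pin list
    fi
}

set_affinity asm_blis
echo "asm_blis: '${GOMP_CPU_AFFINITY:-}'"   # asm_blis: '0 1 2 3'
set_affinity openblas
echo "openblas: '${GOMP_CPU_AFFINITY:-}'"   # openblas: ''
```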

test/3/Makefile

@@ -106,6 +106,10 @@ OPENBLASP_LIB := $(HOME_LIB_PATH)/libopenblasp.a
 #ATLAS_LIB := $(HOME_LIB_PATH)/libf77blas.a \
 # $(HOME_LIB_PATH)/libatlas.a
+# Eigen
+EIGEN_LIB := $(HOME_LIB_PATH)/libeigen.a
+EIGENP_LIB := $(HOME_LIB_PATH)/libeigen.a
 # MKL
 MKL_LIB := -L$(MKL_LIB_PATH) \
 -lmkl_intel_lp64 \
@@ -199,6 +203,7 @@ DNAT := -DIND=BLIS_NAT
 #STR_1M := -DSTR=\"1m\"
 STR_NAT := -DSTR=\"asm_blis\"
 STR_OBL := -DSTR=\"openblas\"
+STR_EIG := -DSTR=\"eigen\"
 STR_VEN := -DSTR=\"vendor\"
 # Single or multithreaded string
@@ -220,6 +225,7 @@ PDEF_2S := -DP_BEGIN=$(P2_BEGIN) -DP_INC=$(P2_INC) -DP_MAX=$(P2_MAX)
 all: all-st all-1s all-2s
 blis: blis-st blis-1s blis-2s
 openblas: openblas-st openblas-1s openblas-2s
+eigen: eigen-st eigen-1s eigen-2s
 vendor: vendor-st vendor-1s vendor-2s
 mkl: vendor
 armpl: vendor
@@ -238,7 +244,7 @@ blis-nat: blis-nat-st blis-nat-1s blis-nat-2s
 # Define the datatypes, operations, and implementations.
 DTS := s d c z
 OPS := gemm hemm herk trmm trsm
-IMPLS := asm_blis openblas vendor
+IMPLS := asm_blis openblas eigen vendor
 # Define functions to construct object filenames from the datatypes and
 # operations given an implementation. We define one function for single-
@@ -263,6 +269,13 @@ OPENBLAS_1S_BINS := $(patsubst %.o,%.x,$(OPENBLAS_1S_OBJS))
 OPENBLAS_2S_OBJS := $(call get-2s-objs,openblas)
 OPENBLAS_2S_BINS := $(patsubst %.o,%.x,$(OPENBLAS_2S_OBJS))
+EIGEN_ST_OBJS := $(call get-st-objs,eigen)
+EIGEN_ST_BINS := $(patsubst %.o,%.x,$(EIGEN_ST_OBJS))
+EIGEN_1S_OBJS := $(call get-1s-objs,eigen)
+EIGEN_1S_BINS := $(patsubst %.o,%.x,$(EIGEN_1S_OBJS))
+EIGEN_2S_OBJS := $(call get-2s-objs,eigen)
+EIGEN_2S_BINS := $(patsubst %.o,%.x,$(EIGEN_2S_OBJS))
 VENDOR_ST_OBJS := $(call get-st-objs,vendor)
 VENDOR_ST_BINS := $(patsubst %.o,%.x,$(VENDOR_ST_OBJS))
 VENDOR_1S_OBJS := $(call get-1s-objs,vendor)
@@ -279,6 +292,10 @@ openblas-st: $(OPENBLAS_ST_BINS)
 openblas-1s: $(OPENBLAS_1S_BINS)
 openblas-2s: $(OPENBLAS_2S_BINS)
+eigen-st: $(EIGEN_ST_BINS)
+eigen-1s: $(EIGEN_1S_BINS)
+eigen-2s: $(EIGEN_2S_BINS)
 vendor-st: $(VENDOR_ST_BINS)
 vendor-1s: $(VENDOR_1S_BINS)
 vendor-2s: $(VENDOR_2S_BINS)
@@ -293,9 +310,10 @@ armpl-2s: vendor-2s
 # Mark the object files as intermediate so that make will remove them
 # automatically after building the binaries on which they depend.
-.INTERMEDIATE: $(BLIS_NAT_ST_OBJS) $(OPENBLAS_ST_OBJS) $(VENDOR_ST_OBJS)
-.INTERMEDIATE: $(BLIS_NAT_1S_OBJS) $(OPENBLAS_1S_OBJS) $(VENDOR_1S_OBJS)
-.INTERMEDIATE: $(BLIS_NAT_2S_OBJS) $(OPENBLAS_2S_OBJS) $(VENDOR_2S_OBJS)
+.INTERMEDIATE: $(BLIS_NAT_ST_OBJS) $(BLIS_NAT_1S_OBJS) $(BLIS_NAT_2S_OBJS)
+.INTERMEDIATE: $(OPENBLAS_ST_OBJS) $(OPENBLAS_1S_OBJS) $(OPENBLAS_2S_OBJS)
+.INTERMEDIATE: $(EIGEN_ST_OBJS) $(EIGEN_1S_OBJS) $(EIGEN_2S_OBJS)
+.INTERMEDIATE: $(VENDOR_ST_OBJS) $(VENDOR_1S_OBJS) $(VENDOR_2S_OBJS)
 # --Object file rules --
@@ -312,7 +330,8 @@ get-dt-cpp = -DDT=bli_$(1)type
 get-bl-cpp = $(strip \
 $(if $(findstring blis,$(1)),$(STR_NAT) $(BLI_DEF),\
 $(if $(findstring openblas,$(1)),$(STR_OBL) $(BLA_DEF),\
-$(STR_VEN) $(BLA_DEF))))
+$(if $(findstring eigen,$(1)),$(STR_EIG) $(BLA_DEF),\
+$(STR_VEN) $(BLA_DEF)))))
 define make-st-rule
 test_$(1)$(2)_$(PS_MAX)_$(3)_st.o: test_$(op).c Makefile
@@ -349,34 +368,44 @@ $(foreach im,$(IMPLS),$(eval $(call make-2s-rule,$(dt),$(op),$(im))))))
 # compatibility layer. This prevents BLIS from inadvertently getting called
 # for the BLAS routines we are trying to test with.
-test_%_$(PS_MAX)_openblas_st.x: test_%_$(PS_MAX)_openblas_st.o $(LIBBLIS_LINK)
-	$(LINKER) $(strip $< $(OPENBLAS_LIB) $(LIBBLIS_LINK) $(LDFLAGS) -o $@)
-test_%_$(P1_MAX)_openblas_1s.x: test_%_$(P1_MAX)_openblas_1s.o $(LIBBLIS_LINK)
-	$(LINKER) $(strip $< $(OPENBLASP_LIB) $(LIBBLIS_LINK) $(LDFLAGS) -o $@)
-test_%_$(P2_MAX)_openblas_2s.x: test_%_$(P2_MAX)_openblas_2s.o $(LIBBLIS_LINK)
-	$(LINKER) $(strip $< $(OPENBLASP_LIB) $(LIBBLIS_LINK) $(LDFLAGS) -o $@)
-test_%_$(PS_MAX)_vendor_st.x: test_%_$(PS_MAX)_vendor_st.o $(LIBBLIS_LINK)
-	$(LINKER) $(strip $< $(VENDOR_LIB) $(LIBBLIS_LINK) $(LDFLAGS) -o $@)
-test_%_$(P1_MAX)_vendor_1s.x: test_%_$(P1_MAX)_vendor_1s.o $(LIBBLIS_LINK)
-	$(LINKER) $(strip $< $(VENDORP_LIB) $(LIBBLIS_LINK) $(LDFLAGS) -o $@)
-test_%_$(P2_MAX)_vendor_2s.x: test_%_$(P2_MAX)_vendor_2s.o $(LIBBLIS_LINK)
-	$(LINKER) $(strip $< $(VENDORP_LIB) $(LIBBLIS_LINK) $(LDFLAGS) -o $@)
 test_%_$(PS_MAX)_asm_blis_st.x: test_%_$(PS_MAX)_asm_blis_st.o $(LIBBLIS_LINK)
-	$(LINKER) $(strip $< $(LIBBLIS_LINK) $(LDFLAGS) -o $@)
+	$(CC) $(strip $< $(LIBBLIS_LINK) $(LDFLAGS) -o $@)
 test_%_$(P1_MAX)_asm_blis_1s.x: test_%_$(P1_MAX)_asm_blis_1s.o $(LIBBLIS_LINK)
-	$(LINKER) $(strip $< $(LIBBLIS_LINK) $(LDFLAGS) -o $@)
+	$(CC) $(strip $< $(LIBBLIS_LINK) $(LDFLAGS) -o $@)
 test_%_$(P2_MAX)_asm_blis_2s.x: test_%_$(P2_MAX)_asm_blis_2s.o $(LIBBLIS_LINK)
-	$(LINKER) $(strip $< $(LIBBLIS_LINK) $(LDFLAGS) -o $@)
+	$(CC) $(strip $< $(LIBBLIS_LINK) $(LDFLAGS) -o $@)
+test_%_$(PS_MAX)_openblas_st.x: test_%_$(PS_MAX)_openblas_st.o $(LIBBLIS_LINK)
+	$(CC) $(strip $< $(OPENBLAS_LIB) $(LIBBLIS_LINK) $(LDFLAGS) -o $@)
+test_%_$(P1_MAX)_openblas_1s.x: test_%_$(P1_MAX)_openblas_1s.o $(LIBBLIS_LINK)
+	$(CC) $(strip $< $(OPENBLASP_LIB) $(LIBBLIS_LINK) $(LDFLAGS) -o $@)
+test_%_$(P2_MAX)_openblas_2s.x: test_%_$(P2_MAX)_openblas_2s.o $(LIBBLIS_LINK)
+	$(CC) $(strip $< $(OPENBLASP_LIB) $(LIBBLIS_LINK) $(LDFLAGS) -o $@)
+test_%_$(PS_MAX)_eigen_st.x: test_%_$(PS_MAX)_eigen_st.o $(LIBBLIS_LINK)
+	$(CXX) $(strip $< $(EIGEN_LIB) $(LIBBLIS_LINK) $(LDFLAGS) -o $@)
+test_%_$(P1_MAX)_eigen_1s.x: test_%_$(P1_MAX)_eigen_1s.o $(LIBBLIS_LINK)
+	$(CXX) $(strip $< $(EIGENP_LIB) $(LIBBLIS_LINK) $(LDFLAGS) -o $@)
+test_%_$(P2_MAX)_eigen_2s.x: test_%_$(P2_MAX)_eigen_2s.o $(LIBBLIS_LINK)
+	$(CXX) $(strip $< $(EIGENP_LIB) $(LIBBLIS_LINK) $(LDFLAGS) -o $@)
+test_%_$(PS_MAX)_vendor_st.x: test_%_$(PS_MAX)_vendor_st.o $(LIBBLIS_LINK)
+	$(CC) $(strip $< $(VENDOR_LIB) $(LIBBLIS_LINK) $(LDFLAGS) -o $@)
+test_%_$(P1_MAX)_vendor_1s.x: test_%_$(P1_MAX)_vendor_1s.o $(LIBBLIS_LINK)
+	$(CC) $(strip $< $(VENDORP_LIB) $(LIBBLIS_LINK) $(LDFLAGS) -o $@)
+test_%_$(P2_MAX)_vendor_2s.x: test_%_$(P2_MAX)_vendor_2s.o $(LIBBLIS_LINK)
+	$(CC) $(strip $< $(VENDORP_LIB) $(LIBBLIS_LINK) $(LDFLAGS) -o $@)
 # -- Clean rules --
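Note that the new eigen link rules use `$(CXX)` while every other implementation links with `$(CC)`. A plausible reason (an assumption on my part, not stated in the diff) is that Eigen's BLAS library is compiled from C++ sources and therefore needs the C++ runtime at link time, which the C++ compiler driver supplies automatically. A hypothetical helper mirroring that choice:

```shell
# Hypothetical helper mirroring the Makefile's per-implementation link driver:
# eigen gets the C++ driver; asm_blis, openblas, and vendor use the C driver.
linker_for() {
    case "$1" in
        eigen) echo "g++" ;;
        *)     echo "gcc" ;;
    esac
}

linker_for eigen      # g++
linker_for openblas   # gcc
```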

test/3/runme.sh

@@ -78,6 +78,10 @@ if [ "${impls}" = "blis" ]; then
 test_impls="asm_blis"
+elif [ "${impls}" = "eigen" ]; then
+test_impls="eigen"
 elif [ "${impls}" = "other" ]; then
 test_impls="openblas vendor"
@@ -148,13 +152,24 @@ for th in ${threads}; do
 # Set the number of threads according to th.
 if [ "${suf}" = "1s" ] || [ "${suf}" = "2s" ]; then
-export BLIS_JC_NT=${jc_nt}
-export BLIS_PC_NT=${pc_nt}
-export BLIS_IC_NT=${ic_nt}
-export BLIS_JR_NT=${jr_nt}
-export BLIS_IR_NT=${ir_nt}
-export OPENBLAS_NUM_THREADS=${nt}
-export MKL_NUM_THREADS=${nt}
+# Set the threading parameters based on the implementation
+# that we are preparing to run.
+if [ "${im}" = "asm_blis" ]; then
+unset OMP_NUM_THREADS
+export BLIS_JC_NT=${jc_nt}
+export BLIS_PC_NT=${pc_nt}
+export BLIS_IC_NT=${ic_nt}
+export BLIS_JR_NT=${jr_nt}
+export BLIS_IR_NT=${ir_nt}
+elif [ "${im}" = "openblas" ]; then
+unset OMP_NUM_THREADS
+export OPENBLAS_NUM_THREADS=${nt}
+elif [ "${im}" = "eigen" ]; then
+export OMP_NUM_THREADS=${nt}
+elif [ "${im}" = "vendor" ]; then
+unset OMP_NUM_THREADS
+export MKL_NUM_THREADS=${nt}
+fi
 export nt_use=${nt}
 # Multithreaded OpenBLAS seems to have a problem running
@@ -173,6 +188,7 @@ for th in ${threads}; do
 export BLIS_IC_NT=1
 export BLIS_JR_NT=1
 export BLIS_IR_NT=1
+export OMP_NUM_THREADS=1
 export OPENBLAS_NUM_THREADS=1
 export MKL_NUM_THREADS=1
 export nt_use=1