diff --git a/CREDITS b/CREDITS index 52efd782f..73a90cc1b 100644 --- a/CREDITS +++ b/CREDITS @@ -71,6 +71,7 @@ but many others have contributed code and feedback, including Paul Springer @springer13 (RWTH-Aachen) Vladimir Sukarev Santanu Thangaraj (AMD) + Nicholai Tukanov @nicholaiTukanov (The University of Texas at Austin) Rhys Ulerich @RhysU (The University of Texas at Austin) Robert van de Geijn @rvdg (The University of Texas at Austin) Kiran Varaganti @kvaragan (AMD) diff --git a/README.md b/README.md index 13acd96ec..7653f3d24 100644 --- a/README.md +++ b/README.md @@ -79,6 +79,13 @@ and [other educational projects](http://www.ulaff.net/) (such as MOOCs). What's New ---------- + * **Performance comparisons now available!** We recently measured the +performance of various level-3 operations on a variety of hardware architectures, +as implemented within BLIS and other BLAS libraries for all four of the standard +floating-point datatypes. The results speak for themselves! Check out our +extensive performance graphs and background info in our new +[Performance](docs/Performance.md) document. + * **BLIS is now in Debian Unstable!** Thanks to Debian developer-maintainers [M. Zhou](https://github.com/cdluminate) and [Nico Schlömer](https://github.com/nschloe) for sponsoring our package in Debian. @@ -87,7 +94,7 @@ the second-most popular Linux distribution (behind Ubuntu, which Debian packages feed into). The Debian tracker page may be found [here](https://tracker.debian.org/pkg/blis). - * **BLIS now supports mixed-datatype gemm.** The `gemm` operation may now be + * **BLIS now supports mixed-datatype gemm!** The `gemm` operation may now be executed on operands of mixed domains and/or mixed precisions. Any combination of storage datatype for A, B, and C is now supported, along with a separate computation precision that can differ from the storage precision of A and B. @@ -317,6 +324,11 @@ use the multithreading features of BLIS. 
overview of BLIS's mixed-datatype functionality and provides a brief example of how to take advantage of this new code. + * **[Performance](docs/Performance.md).** This document reports empirically +measured performance of a representative set of level-3 operations on a variety +of hardware architectures, as implemented within BLIS and other BLAS libraries +for all four of the standard floating-point datatypes. + * **[Release Notes](docs/ReleaseNotes.md).** This document tracks a summary of changes included with each new version of BLIS, along with contributor credits for key features. diff --git a/docs/Multithreading.md b/docs/Multithreading.md index 61a337bc8..a2630b18d 100644 --- a/docs/Multithreading.md +++ b/docs/Multithreading.md @@ -23,7 +23,7 @@ # Introduction -Our paper [Anatomy of High-Performance Many-Threaded Matrix Multiplication](https://github.com/flame/blis#citations), presented at IPDPS'14, identified 5 loops around the microkernel as opportunities for parallelization within level-3 operations such as `gemm`. Within BLIS, we have enabled parallelism for 4 of those loops and have extended it to the rest of the level-3 operations except for `trsm`. +Our paper [Anatomy of High-Performance Many-Threaded Matrix Multiplication](https://github.com/flame/blis#citations), presented at IPDPS'14, identified five loops around the microkernel as opportunities for parallelization within level-3 operations such as `gemm`. Within BLIS, we have enabled parallelism for four of those loops, with the fifth planned for future work. This software architecture extends naturally to all level-3 operations except for `trsm`, where its application is necessarily limited to three of the five loops due to inter-iteration dependencies. **IMPORTANT**: Multithreading in BLIS is disabled by default. Furthermore, even when multithreading is enabled, BLIS will default to single-threaded execution at runtime. 
In order to both *allow* and *invoke* parallelism from within BLIS operations, you must both *enable* multithreading at configure-time and *specify* multithreading at runtime. diff --git a/docs/Performance.md b/docs/Performance.md new file mode 100644 index 000000000..d23ed99ea --- /dev/null +++ b/docs/Performance.md @@ -0,0 +1,325 @@ +# Contents + +* **[Contents](Performance.md#contents)** +* **[Introduction](Performance.md#introduction)** +* **[General information](Performance.md#general-information)** +* **[Level-3 performance](Performance.md#level-3-performance-results)** + * **[ThunderX2](Performance.md#thunderx2)** + * **[Experiment Details](Performance.md#thunderx2-experiment-details)** + * **[Results](Performance.md#thunderx2-results)** + * **[SkylakeX](Performance.md#skylakex)** + * **[Experiment Details](Performance.md#skylakex-experiment-details)** + * **[Results](Performance.md#skylakex-results)** + * **[Haswell](Performance.md#haswell)** + * **[Experiment Details](Performance.md#haswell-experiment-details)** + * **[Results](Performance.md#haswell-results)** + * **[Epyc](Performance.md#epyc)** + * **[Experiment Details](Performance.md#epyc-experiment-details)** + * **[Results](Performance.md#epyc-results)** +* **[Feedback](Performance.md#feedback)** + +# Introduction + +This document showcases performance results for a representative sample of +level-3 operations with BLIS and BLAS for several hardware architectures. + +# General information + +Generally speaking, we publish three "panels" for each type of hardware, +each of which reports one of: single-threaded performance, multithreaded +performance on a single socket, or multithreaded performance on two sockets. +Each panel will consist of a 4x5 grid of graphs, with each row representing +a different datatype (single real, double real, single complex, and double +complex) and each column representing a different operation (`gemm`, +`hemm`/`symm`, `herk`/`syrk`, `trmm`, and `trsm`). 
+Each of the 20 graphs within a panel will contain an x-axis that reports +problem size, with all matrix dimensions equal to the problem size (i.e. +_m_ = _n_ = _k_), resulting in square matrices. +The y-axis will report GFLOPS (in the case of single-threaded performance) +or GFLOPS/core (in the case of single- or dual-socket multithreaded +performance), which is simply the total GFLOPS divided by the number of +threads utilized. +This normalization is done intentionally in order to facilitate visual +comparison of multithreaded graphs and single-threaded graphs. + +It's also worth pointing out that the top of each graph (i.e. the maximum +y-axis value depicted) _always_ corresponds to the theoretical peak performance +under the conditions associated with that graph. +Theoretical peak performance, in units of GFLOPS/core, is calculated as the +product of: +1. the maximum sustainable clock rate in GHz; and +2. the maximum number of floating-point operations (flops) that can be +executed per cycle. + +Note that the maximum sustainable clock rate may change depending on the +conditions. +For example, on some systems the maximum clock rate is higher when only one +core is active (e.g. single-threaded performance) versus when all cores are +active (e.g. multithreaded performance). +The maximum number of flops executable per cycle is generally computed as the +product of: +1. the maximum number of fused multiply-add (FMA) vector instructions that +can be issued per cycle; +2. the maximum number of elements that can be stored within a single vector +register (for the datatype in question); and +3. 2.0, since an FMA instruction fuses two operations (a multiply and an add). + +The problem size range, represented on the x-axis, is usually sampled with 50 +equally-spaced problem sizes. +For example, for single-threaded execution, we might choose to execute with +problem sizes of 48 to 2400 in increments of 48, or 56 to 2800 in increments +of 56. 
+These values are almost never chosen for any particular (read: sneaky) reason; +rather, we start with a "good" maximum problem size, such as 2400 or 2800, and +then divide it by 50 to obtain the appropriate starting point and increment. + +Finally, each point along each curve represents the best of three trials. + +# Interpretation + +In general, the curves associated with higher-performing implementations +will appear higher in the graphs than those of lower-performing implementations. +Ideally, an implementation will climb in performance (as a function of problem +size) as quickly as possible and asymptotically approach some high fraction of +peak performance. + +Occasionally, we may publish graphs with incomplete curves--for example, +only the first 25 data points in a typical 50-point series--usually because +the implementation being tested was slow enough that it was not practical to +allow it to finish. + +Where along the x-axis you focus your attention will depend on the segment of +the problem size range that you care about most. Some people's applications +depend heavily on smaller problems, where "small" can mean anything from 200 +to 1000 or even higher. Some people consider 1000 to be quite large, while +others insist that 5000 is merely "medium." What each of us considers to be +small, medium, or large (naturally) depends heavily on the kinds of dense +linear algebra problems we tend to encounter. No one is "right" or "wrong" +about their characterization of matrix smallness or bigness since each person's +relative frame of reference can vary greatly. That said, the +[Science of High-Performance Computing](http://shpc.ices.utexas.edu/) group at +[The University of Texas at Austin](https://www.utexas.edu/) tends to target +matrices that it classifies as "medium-to-large", and so most of the graphs +presented in this document will reflect that targeting in their x-axis range. 
+ +When corresponding with us, via email or when opening an +[issue](https://github.com/flame/blis/issues) on github, we kindly ask that +you specify as closely as possible (though a range is fine) your problem +size of interest so that we can better assist you. + +# Level-3 performance results + +## ThunderX2 + +### ThunderX2 experiment details + +* Location: Unknown +* Processor model: Marvell ThunderX2 CN9975 +* Core topology: two sockets, 28 cores per socket, 56 cores total +* SMT status: disabled at boot-time +* Max clock rate: 2.2GHz (single-core and multicore) +* Max vector register length: 128 bits (NEON) +* Max FMA vector IPC: 2 +* Peak performance: + * single-core: 17.6 GFLOPS (double-precision), 35.2 GFLOPS (single-precision) + * multicore: 17.6 GFLOPS/core (double-precision), 35.2 GFLOPS/core (single-precision) +* Operating system: Ubuntu 16.04 (Linux kernel 4.15.0) +* Compiler: gcc 7.3.0 +* Results gathered: 14 February 2019 +* Implementations tested: + * BLIS 075143df (0.5.1-39) + * configured with `./configure -t openmp thunderx2` (single- and multithreaded) + * sub-configuration exercised: `thunderx2` + * OpenBLAS 52d3f7a + * configured with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0` (single-threaded) + * configured with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=56` (multithreaded) + * ARMPL 18.4 +* Affinity: + * Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0 1 2 3 ... 55"`. However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` was unset. +* Frequency throttling (via `cpupower`): + * No changes made. +* Comments: + * ARMPL performance is remarkably uneven across datatypes and operations, though it would appear their "base" consists of OpenBLAS, which they then optimize for select, targeted routines. 
Unfortunately, we were unable to test the absolute latest versions of OpenBLAS and ARMPL on this hardware before we lost access. We will rerun these experiments once we gain access to a similar system. + +### ThunderX2 results + +#### pdf + +* [ThunderX2 single-threaded](graphs/l3_perf_tx2_nt1.pdf) +* [ThunderX2 multithreaded (28 cores)](graphs/l3_perf_tx2_jc4ic7_nt28.pdf) +* [ThunderX2 multithreaded (56 cores)](graphs/l3_perf_tx2_jc8ic7_nt56.pdf) + +#### png (inline) + +* **ThunderX2 single-threaded** +![single-threaded](graphs/l3_perf_tx2_nt1.png) +* **ThunderX2 multithreaded (28 cores)** +![multithreaded (28 cores)](graphs/l3_perf_tx2_jc4ic7_nt28.png) +* **ThunderX2 multithreaded (56 cores)** +![multithreaded (56 cores)](graphs/l3_perf_tx2_jc8ic7_nt56.png) + +--- + +## SkylakeX + +### SkylakeX experiment details + +* Location: Oracle cloud +* Processor model: Intel Xeon Platinum 8167M (SkylakeX/AVX-512) +* Core topology: two sockets, 26 cores per socket, 52 cores total +* SMT status: enabled, but not utilized +* Max clock rate: 2.0GHz (single-core and multicore) +* Max vector register length: 512 bits (AVX-512) +* Max FMA vector IPC: 2 +* Peak performance: + * single-core: 64 GFLOPS (double-precision), 128 GFLOPS (single-precision) + * multicore: 64 GFLOPS/core (double-precision), 128 GFLOPS/core (single-precision) +* Operating system: Ubuntu 18.04 (Linux kernel 4.15.0) +* Compiler: gcc 7.3.0 +* Results gathered: 6 March 2019 +* Implementations tested: + * BLIS 9f1dbe5 (0.5.1-54) + * configured with `./configure -t openmp auto` (single- and multithreaded) + * sub-configuration exercised: `skx` + * OpenBLAS 0.3.5 + * configured with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0` (single-threaded) + * configured with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=52` (multithreaded) + * MKL 2019 update 1 +* Affinity: + * Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0 1 2 3 ... 51"`. 
However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` was unset. +* Frequency throttling (via `cpupower`): + * No changes made. +* Comments: + * MKL yields superb performance for most operations, though BLIS is not far behind except for trsm. (We understand the trsm underperformance and hope to address it in the future.) OpenBLAS lags far behind MKL and BLIS due to lack of full support for AVX-512, and possibly other reasons related to software architecture and register/cache blocksizes. + +### SkylakeX results + +#### pdf + +* [SkylakeX single-threaded](graphs/l3_perf_skx_nt1.pdf) +* [SkylakeX multithreaded (26 cores)](graphs/l3_perf_skx_jc2ic13_nt26.pdf) +* [SkylakeX multithreaded (52 cores)](graphs/l3_perf_skx_jc4ic13_nt52.pdf) + +#### png (inline) + +* **SkylakeX single-threaded** +![single-threaded](graphs/l3_perf_skx_nt1.png) +* **SkylakeX multithreaded (26 cores)** +![multithreaded (26 cores)](graphs/l3_perf_skx_jc2ic13_nt26.png) +* **SkylakeX multithreaded (52 cores)** +![multithreaded (52 cores)](graphs/l3_perf_skx_jc4ic13_nt52.png) + +--- + +## Haswell + +### Haswell experiment details + +* Location: TACC (Lonestar5) +* Processor model: Intel Xeon E5-2690 v3 (Haswell) +* Core topology: two sockets, 12 cores per socket, 24 cores total +* SMT status: enabled, but not utilized +* Max clock rate: 3.5GHz (single-core), 3.1GHz (multicore) +* Max vector register length: 256 bits (AVX2) +* Max FMA vector IPC: 2 +* Peak performance: + * single-core: 56 GFLOPS (double-precision), 112 GFLOPS (single-precision) + * multicore: 49.6 GFLOPS/core (double-precision), 99.2 GFLOPS/core (single-precision) +* Operating system: Cray Linux Environment 6 (Linux kernel 4.4.103) +* Compiler: gcc 6.3.0 +* Results gathered: 25-26 February 2019 +* Implementations tested: + * BLIS 075143df (0.5.1-39) + * configured with `./configure -t openmp auto` (single- 
and multithreaded) + * sub-configuration exercised: `haswell` + * OpenBLAS 0.3.5 + * configured with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0` (single-threaded) + * configured with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=24` (multithreaded) + * MKL 2018 update 2 +* Affinity: + * Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0 1 2 3 ... 23"`. However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` was unset. +* Frequency throttling (via `cpupower`): + * No changes made. +* Comments: + * We were pleasantly surprised by how competitively BLIS performs relative to MKL on this multicore Haswell system, which is a _very_ common microarchitecture, and _very_ similar to the more recent Broadwells, Skylakes (desktop), Kaby Lakes, and Coffee Lakes that succeeded it. + +### Haswell results + +#### pdf + +* [Haswell single-threaded](graphs/l3_perf_has_nt1.pdf) +* [Haswell multithreaded (12 cores)](graphs/l3_perf_has_jc2ic3jr2_nt12.pdf) +* [Haswell multithreaded (24 cores)](graphs/l3_perf_has_jc4ic3jr2_nt24.pdf) + +#### png (inline) + +* **Haswell single-threaded** +![single-threaded](graphs/l3_perf_has_nt1.png) +* **Haswell multithreaded (12 cores)** +![multithreaded (12 cores)](graphs/l3_perf_has_jc2ic3jr2_nt12.png) +* **Haswell multithreaded (24 cores)** +![multithreaded (24 cores)](graphs/l3_perf_has_jc4ic3jr2_nt24.png) + +--- + +## Epyc + +### Epyc experiment details + +* Location: Oracle cloud +* Processor model: AMD Epyc 7551 (Zen1) +* Core topology: two sockets, 4 dies per socket, 2 core complexes (CCX) per die, 4 cores per CCX, 64 cores total +* SMT status: enabled, but not utilized +* Max clock rate: 3.0GHz (single-core), 2.55GHz (multicore) +* Max vector register length: 256 bits (AVX2) +* Max FMA vector IPC: 1 + * Alternatively, FMA vector IPC is 2 when vectors are 
limited to 128 bits each. +* Peak performance: + * single-core: 24 GFLOPS (double-precision), 48 GFLOPS (single-precision) + * multicore: 20.4 GFLOPS/core (double-precision), 40.8 GFLOPS/core (single-precision) +* Operating system: Ubuntu 18.04 (Linux kernel 4.15.0) +* Compiler: gcc 7.3.0 +* Results gathered: 6 March 2019, 19 March 2019 +* Implementations tested: + * BLIS 9f1dbe5 (0.5.1-54) + * configured with `./configure -t openmp auto` (single- and multithreaded) + * sub-configuration exercised: `zen` + * OpenBLAS 0.3.5 + * configured with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0` (single-threaded) + * configured with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=64` (multithreaded) + * MKL 2019 update 1 +* Affinity: + * Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0 1 2 3 ... 63"`. However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` was unset. +* Frequency throttling (via `cpupower`): + * Driver: acpi-cpufreq + * Governor: performance + * Hardware limits: 1.2GHz - 2.0GHz + * Adjusted minimum: 2.0GHz +* Comments: + * MKL performance is dismal, despite being linked in the same manner as on the Xeon Platinum. It's not clear what is causing the slowdown. It could be that MKL's runtime kernel/blocksize selection logic is falling back to some older, more basic implementation because CPUID is not returning Intel as the hardware vendor. Alternatively, it's possible that MKL is trying to use kernels for the closest Intel architectures--say, Haswell/Broadwell--but its implementations use Haswell-specific optimizations that, due to microarchitectural differences, degrade performance on Zen. 
+ +### Epyc results + +#### pdf + +* [Epyc single-threaded](graphs/l3_perf_epyc_nt1.pdf) +* [Epyc multithreaded (32 cores)](graphs/l3_perf_epyc_jc1ic8jr4_nt32.pdf) +* [Epyc multithreaded (64 cores)](graphs/l3_perf_epyc_jc2ic8jr4_nt64.pdf) + +#### png (inline) + +* **Epyc single-threaded** +![single-threaded](graphs/l3_perf_epyc_nt1.png) +* **Epyc multithreaded (32 cores)** +![multithreaded (32 cores)](graphs/l3_perf_epyc_jc1ic8jr4_nt32.png) +* **Epyc multithreaded (64 cores)** +![multithreaded (64 cores)](graphs/l3_perf_epyc_jc2ic8jr4_nt64.png) + +--- + +# Feedback + +Please let us know what you think of these performance results! Similarly, if you have any questions or concerns, or are interested in reproducing these performance experiments on your own hardware, we invite you to [open an issue](https://github.com/flame/blis/issues) and start a conversation with BLIS developers. + +Thanks for your interest in BLIS! + diff --git a/docs/graphs/l3_perf_epyc_jc1ic8jr4_nt32.pdf b/docs/graphs/l3_perf_epyc_jc1ic8jr4_nt32.pdf new file mode 100644 index 000000000..d60329805 Binary files /dev/null and b/docs/graphs/l3_perf_epyc_jc1ic8jr4_nt32.pdf differ diff --git a/docs/graphs/l3_perf_epyc_jc1ic8jr4_nt32.png b/docs/graphs/l3_perf_epyc_jc1ic8jr4_nt32.png new file mode 100644 index 000000000..fa7959762 Binary files /dev/null and b/docs/graphs/l3_perf_epyc_jc1ic8jr4_nt32.png differ diff --git a/docs/graphs/l3_perf_epyc_jc2ic8jr4_nt64.pdf b/docs/graphs/l3_perf_epyc_jc2ic8jr4_nt64.pdf new file mode 100644 index 000000000..2c09c2ae3 Binary files /dev/null and b/docs/graphs/l3_perf_epyc_jc2ic8jr4_nt64.pdf differ diff --git a/docs/graphs/l3_perf_epyc_jc2ic8jr4_nt64.png b/docs/graphs/l3_perf_epyc_jc2ic8jr4_nt64.png new file mode 100644 index 000000000..83eb4e92c Binary files /dev/null and b/docs/graphs/l3_perf_epyc_jc2ic8jr4_nt64.png differ diff --git a/docs/graphs/l3_perf_epyc_nt1.pdf b/docs/graphs/l3_perf_epyc_nt1.pdf new file mode 100644 index 000000000..12c2742da Binary files 
/dev/null and b/docs/graphs/l3_perf_epyc_nt1.pdf differ diff --git a/docs/graphs/l3_perf_epyc_nt1.png b/docs/graphs/l3_perf_epyc_nt1.png new file mode 100644 index 000000000..2ac4b238c Binary files /dev/null and b/docs/graphs/l3_perf_epyc_nt1.png differ diff --git a/docs/graphs/l3_perf_has_jc2ic3jr2_nt12.pdf b/docs/graphs/l3_perf_has_jc2ic3jr2_nt12.pdf new file mode 100644 index 000000000..a6978f345 Binary files /dev/null and b/docs/graphs/l3_perf_has_jc2ic3jr2_nt12.pdf differ diff --git a/docs/graphs/l3_perf_has_jc2ic3jr2_nt12.png b/docs/graphs/l3_perf_has_jc2ic3jr2_nt12.png new file mode 100644 index 000000000..7a4e252d1 Binary files /dev/null and b/docs/graphs/l3_perf_has_jc2ic3jr2_nt12.png differ diff --git a/docs/graphs/l3_perf_has_jc4ic3jr2_nt24.pdf b/docs/graphs/l3_perf_has_jc4ic3jr2_nt24.pdf new file mode 100644 index 000000000..0717574cb Binary files /dev/null and b/docs/graphs/l3_perf_has_jc4ic3jr2_nt24.pdf differ diff --git a/docs/graphs/l3_perf_has_jc4ic3jr2_nt24.png b/docs/graphs/l3_perf_has_jc4ic3jr2_nt24.png new file mode 100644 index 000000000..0804e11c2 Binary files /dev/null and b/docs/graphs/l3_perf_has_jc4ic3jr2_nt24.png differ diff --git a/docs/graphs/l3_perf_has_nt1.pdf b/docs/graphs/l3_perf_has_nt1.pdf new file mode 100644 index 000000000..f2b95fbdc Binary files /dev/null and b/docs/graphs/l3_perf_has_nt1.pdf differ diff --git a/docs/graphs/l3_perf_has_nt1.png b/docs/graphs/l3_perf_has_nt1.png new file mode 100644 index 000000000..66bb33207 Binary files /dev/null and b/docs/graphs/l3_perf_has_nt1.png differ diff --git a/docs/graphs/l3_perf_skx_jc2ic13_nt26.pdf b/docs/graphs/l3_perf_skx_jc2ic13_nt26.pdf new file mode 100644 index 000000000..5d2f075f7 Binary files /dev/null and b/docs/graphs/l3_perf_skx_jc2ic13_nt26.pdf differ diff --git a/docs/graphs/l3_perf_skx_jc2ic13_nt26.png b/docs/graphs/l3_perf_skx_jc2ic13_nt26.png new file mode 100644 index 000000000..166764fc7 Binary files /dev/null and b/docs/graphs/l3_perf_skx_jc2ic13_nt26.png differ 
diff --git a/docs/graphs/l3_perf_skx_jc4ic13_nt52.pdf b/docs/graphs/l3_perf_skx_jc4ic13_nt52.pdf new file mode 100644 index 000000000..8a8edbce4 Binary files /dev/null and b/docs/graphs/l3_perf_skx_jc4ic13_nt52.pdf differ diff --git a/docs/graphs/l3_perf_skx_jc4ic13_nt52.png b/docs/graphs/l3_perf_skx_jc4ic13_nt52.png new file mode 100644 index 000000000..fed5b22e2 Binary files /dev/null and b/docs/graphs/l3_perf_skx_jc4ic13_nt52.png differ diff --git a/docs/graphs/l3_perf_skx_nt1.pdf b/docs/graphs/l3_perf_skx_nt1.pdf new file mode 100644 index 000000000..ad8af5327 Binary files /dev/null and b/docs/graphs/l3_perf_skx_nt1.pdf differ diff --git a/docs/graphs/l3_perf_skx_nt1.png b/docs/graphs/l3_perf_skx_nt1.png new file mode 100644 index 000000000..772ca0180 Binary files /dev/null and b/docs/graphs/l3_perf_skx_nt1.png differ diff --git a/docs/graphs/l3_perf_tx2_jc4ic7_nt28.pdf b/docs/graphs/l3_perf_tx2_jc4ic7_nt28.pdf new file mode 100644 index 000000000..352d0556c Binary files /dev/null and b/docs/graphs/l3_perf_tx2_jc4ic7_nt28.pdf differ diff --git a/docs/graphs/l3_perf_tx2_jc4ic7_nt28.png b/docs/graphs/l3_perf_tx2_jc4ic7_nt28.png new file mode 100644 index 000000000..ec691cd3e Binary files /dev/null and b/docs/graphs/l3_perf_tx2_jc4ic7_nt28.png differ diff --git a/docs/graphs/l3_perf_tx2_jc8ic7_nt56.pdf b/docs/graphs/l3_perf_tx2_jc8ic7_nt56.pdf new file mode 100644 index 000000000..c25ea9eee Binary files /dev/null and b/docs/graphs/l3_perf_tx2_jc8ic7_nt56.pdf differ diff --git a/docs/graphs/l3_perf_tx2_jc8ic7_nt56.png b/docs/graphs/l3_perf_tx2_jc8ic7_nt56.png new file mode 100644 index 000000000..861d80782 Binary files /dev/null and b/docs/graphs/l3_perf_tx2_jc8ic7_nt56.png differ diff --git a/docs/graphs/l3_perf_tx2_nt1.pdf b/docs/graphs/l3_perf_tx2_nt1.pdf new file mode 100644 index 000000000..66c808c9c Binary files /dev/null and b/docs/graphs/l3_perf_tx2_nt1.pdf differ diff --git a/docs/graphs/l3_perf_tx2_nt1.png b/docs/graphs/l3_perf_tx2_nt1.png new file mode 
100644 index 000000000..a6e5dc413 Binary files /dev/null and b/docs/graphs/l3_perf_tx2_nt1.png differ diff --git a/test/3/matlab/plot_l3_perf.m b/test/3/matlab/plot_l3_perf.m index 8717fb5eb..3942d2e59 100644 --- a/test/3/matlab/plot_l3_perf.m +++ b/test/3/matlab/plot_l3_perf.m @@ -152,8 +152,8 @@ if rows == 4 && cols == 5 set( leg,'FontSize',fontsize-3 ); set( leg,'Units','inches' ); %set( leg,'Position',[7.70 12.75 0.7 0.3 ] ); % (0,1br) - %set( leg,'Position',[10.47 14.28 0.7 0.3 ] ); % (0,2tl) - set( leg,'Position',[11.20 12.75 0.7 0.3 ] ); % (0,2br) + set( leg,'Position',[10.47 14.28 0.7 0.3 ] ); % (0,2tl) + %set( leg,'Position',[11.20 12.75 0.7 0.3 ] ); % (0,2br) %set( leg,'Position',[13.95 14.28 0.7 0.3 ] ); % (0,3tl) %set( leg,'Position',[14.70 12.75 0.7 0.3 ] ); % (0,3br) %set( leg,'Position',[17.45 14.28 0.7 0.3 ] ); % (0,4tl) diff --git a/test/3/matlab/plot_panel_4x5.m b/test/3/matlab/plot_panel_4x5.m index 40e212a68..c7e97434c 100644 --- a/test/3/matlab/plot_panel_4x5.m +++ b/test/3/matlab/plot_panel_4x5.m @@ -100,3 +100,4 @@ outfile = sprintf( 'l3_perf_%s_nt%d.pdf', arch_str, nth ); %print(gcf, 'gemm_md','-fillpage','-dpdf'); print(gcf, outfile,'-bestfit','-dpdf'); +end diff --git a/test/3/matlab/runme.m b/test/3/matlab/runme.m index 2da7d7442..747989b76 100644 --- a/test/3/matlab/runme.m +++ b/test/3/matlab/runme.m @@ -14,6 +14,6 @@ plot_panel_4x5(3.00,16,12,'1s','../results/has/20190206/jc2ic3jr2','has_jc2ic3jr plot_panel_4x5(3.00,16,24,'2s','../results/has/20190206/jc4ic3jr2','has_jc4ic3jr2','MKL'); close; clear all; % epyc -plot_panel_4x5(3.00,8,1, 'st','../results/epyc/20190306/st', 'epyc', 'MKL'); close; clear all; -plot_panel_4x5(2.55,8,32,'1s','../results/epyc/20190306/jc1ic8jr4','epyc_jc1ic8jr4','MKL'); close; clear all; -plot_panel_4x5(2.55,8,64,'2s','../results/epyc/20190306/jc2ic8jr4','epyc_jc2ic8jr4','MKL'); close; clear all; +plot_panel_4x5(3.00,8,1, 'st','../results/epyc/merged201903_0619/st','epyc', 'MKL'); close; clear all; 
+plot_panel_4x5(2.55,8,32,'1s','../results/epyc/merged201903_0619/jc1ic8jr4','epyc_jc1ic8jr4','MKL'); close; clear all; +plot_panel_4x5(2.55,8,64,'2s','../results/epyc/merged201903_0619/jc2ic8jr4','epyc_jc2ic8jr4','MKL'); close; clear all;
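
Review note: the peak-performance figures listed in the experiment details of `docs/Performance.md` can be sanity-checked against the formula given in its "General information" section (clock rate times FMA vector IPC times elements per vector register times 2). The following is a minimal sketch in Python, not part of the patch; the function name `peak_gflops_per_core` and the `systems` table are hypothetical, and the inputs are simply the clock rates, FMA vector IPC values, and vector register lengths quoted in the document.

```python
def peak_gflops_per_core(clock_ghz, fma_ipc, vector_bits, element_bits):
    """Theoretical peak in GFLOPS/core, following the formula in Performance.md.

    Hypothetical helper for illustration only; it is not part of BLIS.
    """
    # Number of elements of the given width that fit in one vector register.
    elements_per_vector = vector_bits // element_bits
    # Each FMA instruction counts as two flops (a multiply and an add).
    flops_per_cycle = fma_ipc * elements_per_vector * 2
    return clock_ghz * flops_per_cycle


# (max clock in GHz, max FMA vector IPC, max vector register length in bits),
# copied from the experiment details sections of the document.
systems = {
    "ThunderX2 (single-core and multicore)": (2.2, 2, 128),
    "SkylakeX (single-core and multicore)": (2.0, 2, 512),
    "Haswell (single-core)": (3.5, 2, 256),
    "Haswell (multicore)": (3.1, 2, 256),
    "Epyc (single-core)": (3.0, 1, 256),
    "Epyc (multicore)": (2.55, 1, 256),
}

for name, (ghz, ipc, bits) in systems.items():
    dp = peak_gflops_per_core(ghz, ipc, bits, 64)  # double-precision
    sp = peak_gflops_per_core(ghz, ipc, bits, 32)  # single-precision
    print(f"{name}: {dp:.1f} GFLOPS (double), {sp:.1f} GFLOPS (single)")
```

With these inputs the sketch reproduces the per-core peaks quoted in the document (for example, 17.6 and 35.2 GFLOPS for ThunderX2), which suggests the listed values are internally consistent. For Epyc, the alternative reading mentioned in the document (FMA vector IPC of 2 with 128-bit vectors) gives the same product.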