mirror of
https://github.com/amd/blis.git
synced 2026-04-20 15:48:50 +00:00
Updates to docs/Multithreading.md.
Details:
- Made extra explicit the fact that: (a) multithreading in BLIS is disabled by default; and (b) even with multithreading enabled, the user must specify multithreading at runtime in order to observe parallelism. Thanks to M. Zhou for suggesting these clarifications in #292.
- Also made explicit that only the environment variable and global runtime API methods are available when using the BLAS API. If the user wishes to use the local runtime API (specify multithreading on a per-call basis), one of the native BLIS APIs must be used.
@@ -25,9 +25,15 @@
Our paper [Anatomy of High-Performance Many-Threaded Matrix Multiplication](https://github.com/flame/blis#citations), presented at IPDPS'14, identified 5 loops around the microkernel as opportunities for parallelization within level-3 operations such as `gemm`. Within BLIS, we have enabled parallelism for 4 of those loops and have extended it to the rest of the level-3 operations except for `trsm`.
**IMPORTANT**: Multithreading in BLIS is disabled by default. Furthermore, even when multithreading is enabled, BLIS will default to single-threaded execution at runtime. In order to both *allow* and *invoke* parallelism from within BLIS operations, you must both *enable* multithreading at configure-time and *specify* multithreading at runtime.
To summarize: In order to observe multithreaded parallelism within a BLIS operation, you must do *both* of the following:
1. Enable multithreading at configure-time. This is discussed in the [next section](docs/Multithreading.md#enabling-multithreading).
2. Specify multithreading at runtime. This is also discussed [later on](docs/Multithreading.md#specifying-multithreading).
# Enabling multithreading
Note that BLIS disables multithreading by default. In order to extract multithreaded parallelism from BLIS, you must first enable multithreading explicitly at configure-time.
BLIS disables multithreading by default. In order to allow multithreaded parallelism from BLIS, you must first enable multithreading explicitly at configure-time.
As of this writing, BLIS optionally supports multithreading via either OpenMP or POSIX threads.
@@ -101,7 +107,7 @@ This pattern--automatic or manual--holds regardless of which of the three method
Regardless of which method is employed, and which specific way within each method, after setting the number of threads, the application may call the desired level-3 operation (via either the [typed API](docs/BLISTypedAPI.md) or the [object API](docs/BLISObjectAPI.md)) and the operation will execute in a multithreaded manner. (When calling BLIS via the BLAS API, only the first two (global) methods are available.)
NOTE: Please be aware of what happens if you try to specify both the automatic and manual ways, as it could otherwise confuse new users. Regardless of which broad method is used, **if multithreading is specified via both the automatic and manual ways, the manual way will always take precedence.** Also, specifying parallelism for even *one* loop counts as specifying the manual way (in which case the ways of parallelism for the remaining loops will be assumed to be 1).
**Note**: Please be aware of what happens if you try to specify both the automatic and manual ways, as it could otherwise confuse new users. Regardless of which broad method is used, **if multithreading is specified via both the automatic and manual ways, the manual way will always take precedence.** Also, specifying parallelism for even *one* loop counts as specifying the manual way (in which case the ways of parallelism for the remaining loops will be assumed to be 1).
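
The precedence rule in the note above can be sketched in a few lines of C. This is only an illustration: `sketch_request_t` and `sketch_resolve()` are hypothetical names invented for this example and are not part of BLIS, and the use of `-1` to mean "unset" is an assumption made here for clarity.

```c
#include <stdbool.h>

/* Hypothetical type for illustration only; this is NOT a BLIS data structure. */
typedef struct
{
    long num_threads; /* automatic way: total thread count, or -1 if unset  */
    long ways[ 4 ];   /* manual way: one slot per parallelized loop
                         (jc, ic, jr, ir); -1 if unset                      */
} sketch_request_t;

/* Applies the documented precedence rule. Returns true if the manual way
   is in effect after resolution. */
bool sketch_resolve( sketch_request_t* req )
{
    bool manual = false;

    /* Specifying parallelism for even one loop counts as the manual way. */
    for ( int i = 0; i < 4; ++i )
        if ( req->ways[ i ] >= 1 ) manual = true;

    if ( manual )
    {
        /* Loops left unspecified are assumed to have 1 way of parallelism. */
        for ( int i = 0; i < 4; ++i )
            if ( req->ways[ i ] < 1 ) req->ways[ i ] = 1;

        /* The manual way takes precedence: the automatic request is dropped. */
        req->num_threads = -1;
    }

    return manual;
}
```

For example, a request with `num_threads = 8` and only the second slot set to 4 resolves to the ways `1, 4, 1, 1`, and the automatic value of 8 is ignored.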
## Globally via environment variables
@@ -109,6 +115,8 @@ The most common method of specifying multithreading in BLIS is globally via envi
Regardless of whether you end up using the automatic or manual way of expressing a request for multithreading, note that the environment variables are read (via `getenv()`) by BLIS **only once**, when the library is initialized. Subsequent to library initialization, the global settings for parallelization may only be changed via the [global runtime API](Multithreading.md#globally-at-runtime). If this constraint is not a problem, then environment variables may work fine for you. Otherwise, please consider [local settings](Multithreading.md#locally-at-runtime). (Local settings may be used at any time, regardless of whether global settings were explicitly specified, and local settings always override global settings.)
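
To illustrate what "read only once" implies, here is a minimal sketch of a read-once global setting. The `sketch_*` names are hypothetical and do not correspond to BLIS's internal implementation; the sketch is also not thread-safe, unlike the real global runtime API.

```c
#include <stdlib.h>

/* Hypothetical global state for illustration; not BLIS internals. */
static int  sketch_initialized = 0;
static long sketch_global_nt   = 1; /* default: single-threaded */

/* Reads the environment exactly once; later calls return immediately. */
static void sketch_init_once( void )
{
    if ( sketch_initialized ) return;

    const char* s = getenv( "BLIS_NUM_THREADS" );
    if ( s != NULL )
    {
        long v = strtol( s, NULL, 10 );
        if ( v >= 1 ) sketch_global_nt = v;
    }

    sketch_initialized = 1;
}

/* Query the global setting (initializing on first use). */
long sketch_get_num_threads( void )
{
    sketch_init_once();
    return sketch_global_nt;
}

/* After initialization, only a runtime call can change the setting. */
void sketch_set_num_threads( long nt )
{
    sketch_init_once();
    if ( nt >= 1 ) sketch_global_nt = nt;
}
```

In this sketch, changing `BLIS_NUM_THREADS` after the first query has no effect; only `sketch_set_num_threads()` changes the value, which mirrors the constraint described above.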
**Note**: Regardless of which way ([automatic](Multithreading.md#environment-variables-the-automatic-way) or [manual](Multithreading.md#environment-variables-the-manual-way)) environment variables are used to specify multithreading, that specification will affect operation of BLIS through **both** the BLAS compatibility layer as well as the native [typed](docs/BLISTypedAPI.md) and [object](docs/BLISObjectAPI.md) APIs that are unique to BLIS.
### Environment variables: the automatic way
The automatic way of specifying parallelism entails simply setting the total number of threads you wish BLIS to employ in its parallelization. This total number of threads is captured by the `BLIS_NUM_THREADS` environment variable. You can set this variable prior to executing your BLIS-linked executable:
@@ -119,7 +127,7 @@ $ ./my_blis_program
```
This causes BLIS to automatically determine a reasonable threading strategy based on what is known about the operation and problem size. If `BLIS_NUM_THREADS` is not set, BLIS will attempt to query the value of `OMP_NUM_THREADS`. If neither variable is set, the default number of threads is 1.
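
The lookup order described above (first `BLIS_NUM_THREADS`, then `OMP_NUM_THREADS`, then a default of 1) might be sketched as follows. The helper names are hypothetical and not part of BLIS, and treating an empty or non-numeric value as "unset" is an assumption of this sketch rather than documented BLIS behavior.

```c
#include <stdlib.h>

/* Hypothetical helper (not part of BLIS): returns the positive integer
   stored in the named environment variable, or -1 if the variable is
   unset, empty, or not a positive integer. */
static long sketch_env_to_long( const char* name )
{
    const char* s = getenv( name );
    if ( s == NULL || *s == '\0' ) return -1;

    char* end = NULL;
    long  v   = strtol( s, &end, 10 );
    if ( *end != '\0' || v < 1 ) return -1;

    return v;
}

/* Hypothetical sketch of the documented fallback order. */
long sketch_resolve_num_threads( void )
{
    /* 1. Prefer BLIS_NUM_THREADS. */
    long nt = sketch_env_to_long( "BLIS_NUM_THREADS" );

    /* 2. Otherwise, fall back to OMP_NUM_THREADS. */
    if ( nt < 1 ) nt = sketch_env_to_long( "OMP_NUM_THREADS" );

    /* 3. If neither is set, default to a single thread. */
    if ( nt < 1 ) nt = 1;

    return nt;
}
```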
**Note:** We *highly* discourage use of the `OMP_NUM_THREADS` environment variable and may remove support for it in the future. If you wish to set parallelism globally via environment variables, please use `BLIS_NUM_THREADS`.
**Note**: We *highly* discourage use of the `OMP_NUM_THREADS` environment variable and may remove support for it in the future. If you wish to set parallelism globally via environment variables, please use `BLIS_NUM_THREADS`.
### Environment variables: the manual way
@@ -127,7 +135,7 @@ The manual way of specifying parallelism involves communicating which loops with
The below chart describes the five loops used in BLIS's matrix multiplication operations.
| Loop around microkernel | Environment variable | Direction | Notes |
| Loop around microkernel | Environment variable | Direction | Notes |
|:-------------------------|:---------------------|:----------|:------------|
| 5th loop | `BLIS_JC_NT` | `n` | |
| 4th loop | _N/A_ | `k` | Not enabled |
@@ -154,6 +162,8 @@ Next, which combinations of loops to parallelize depends on which caches are sha
If you still wish to set the parallelization scheme globally, but you want to do so at runtime, BLIS provides a thread-safe API for specifying multithreading. Think of these functions as a way to modify the same internal data structure into which the environment variables are read. (Recall that the environment variables are only read once, when BLIS is initialized.)
**Note**: Regardless of which way ([automatic](Multithreading.md#globally-at-runtime-the-automatic-way) or [manual](Multithreading.md#globally-at-runtime-the-manual-way)) the global runtime API is used to specify multithreading, that specification will affect operation of BLIS through **both** the BLAS compatibility layer as well as the native [typed](docs/BLISTypedAPI.md) and [object](docs/BLISObjectAPI.md) APIs that are unique to BLIS.
### Globally at runtime: the automatic way
If you simply want to specify an overall number of threads and let BLIS choose a thread factorization automatically, use the following function:
@@ -193,6 +203,8 @@ In addition to the global methods based on environment variables and runtime fun
As with environment variables and the global runtime API, there are two ways to specify parallelism: the automatic way and the manual way. Both ways involve allocating a BLIS-specific object, initializing the object and encoding the desired parallelization, and then passing a pointer to the object into one of the expert interfaces of either the [typed](docs/BLISTypedAPI.md) or [object](docs/BLISObjectAPI.md) APIs. We provide examples of utilizing this threading object below.
**Note**: Neither way ([automatic](Multithreading.md#locally-at-runtime-the-automatic-way) nor [manual](Multithreading.md#locally-at-runtime-the-manual-way)) of specifying multithreading via the local runtime API can be used via the BLAS interfaces. The local runtime API may *only* be used via the native [typed](docs/BLISTypedAPI.md) and [object](docs/BLISObjectAPI.md) APIs, which are unique to BLIS. (Furthermore, the expert interfaces of each API must be used. This is demonstrated later on in this section.)
### Initializing a rntm_t
Before specifying the parallelism (automatically or manually), you must first allocate a special BLIS object called a `rntm_t` (runtime). The object is quite small (about 64 bytes), and so we recommend allocating it statically on the function stack: