mirror of
https://github.com/amd/blis.git
synced 2026-04-19 23:28:52 +00:00
Defined rntm_t to relocate cntx_t.thrloop (#235).
Details:
- Defined a new struct datatype, rntm_t (runtime), to house the thrloop field of the cntx_t (context). The thrloop array holds the number of ways of parallelism (thread "splits") to extract per level-3 algorithmic loop until those values can be used to create a corresponding node in the thread control tree (thrinfo_t structure), which (for any given level-3 invocation) usually happens by the time the macrokernel is called for the first time.
- Relocating the thrloop from the cntx_t remedies a thread-safety issue when invoking level-3 operations from two or more application threads. The race condition existed because the cntx_t, a pointer to which is usually queried from the global kernel structure (gks), is supposed to be read-only. However, the previous code would write to the cntx_t's thrloop field *after* it had been queried, thus violating its read-only status. In practice, this would not cause a problem when a sequential application made a multithreaded call to BLIS, nor when two or more application threads used the same parallelization scheme when calling BLIS, because in either case all application threads would be using the same ways of parallelism for each loop. The true effects of the race condition were limited to situations where two or more application threads used *different* parallelization schemes for any given level-3 call.
- In remedying the above race condition, the application or calling library can now specify the parallelization scheme on a per-call basis. All that is required is that the thread encode its request for parallelism into the rntm_t struct prior to passing the address of the rntm_t to one of the expert interfaces of either the typed or object APIs. This allows, for example, one application thread to extract 4-way parallelism from a call to gemm while another application thread requests 2-way parallelism. Or, two threads could each request 4-way parallelism, but from different loops.
- A rntm_t* parameter has been added to the function signatures of most of the level-3 implementation stack (with the most notable exception being packm) as well as all level-1v, -1d, -1f, -1m, and -2 expert APIs. (A few internal functions gained the rntm_t* parameter even though they currently have no use for it, such as bli_l3_packm().) This required some internal calls to some of those functions to be updated since BLIS was already using those operations internally via the expert interfaces. For situations where a rntm_t object is not available, such as within packm/unpackm implementations, NULL is passed in to the relevant expert interfaces. This is acceptable for now since parallelism is not obtained for non-level-3 operations.
- Revamped how global parallelism is encoded. First, the conventional environment variables such as BLIS_NUM_THREADS and BLIS_*_NT are only read once, at library initialization. (Thanks to Nathaniel Smith for suggesting this to avoid repeated calls to getenv(), which can be slow.) Those values are recorded to a global rntm_t object. Public APIs, in bli_thread.c, are still available to get/set these values from the global rntm_t, though now the "set" functions have additional logic to ensure that the values are set in a synchronous manner via a mutex. If/when NULL is passed into an expert API (meaning the user opted to not provide a custom rntm_t), the values from the global rntm_t are copied to a local rntm_t, which is then passed down the function stack. Calling a basic API is equivalent to calling the expert APIs with NULL for the cntx and rntm parameters, which means the semantic behavior of these basic APIs (vis-a-vis multithreading) is unchanged from before.
- Renamed bli_cntx_set_thrloop_from_env() to bli_rntm_set_ways_for_op() and reimplemented, with the function now being able to treat the incoming rntm_t in a manner agnostic to its origin--whether it came from the application or is an internal copy of the global rntm_t.
- Removed various global runtime APIs for setting the number of ways of parallelism for individual loops (e.g. bli_thread_set_*_nt()) as well as the corresponding "get" functions. The new model simplifies these interfaces so that one must either set the total number of threads, OR set all of the ways of parallelism for each loop simultaneously (in a single function call).
- Updated sandbox/ref99 according to above changes.
- Rewrote/augmented docs/Multithreading.md to document the three methods (and two specific ways within each method) of requesting parallelism in BLIS.
- Removed old, disabled code from bli_l3_thrinfo.c.
- Whitespace changes to code (e.g. bli_obj.c) and docs/BuildSystem.md.
@@ -1,64 +1,83 @@
|
||||
## Contents
|
||||
# Contents
|
||||
|
||||
* **[Contents](Multithreading.md#contents)**
|
||||
* **[Introduction](Multithreading.md#introduction)**
|
||||
* **[Enabling multithreading](Multithreading.md#enabling-multithreading)**
|
||||
* **[Specifying multithreading](Multithreading.md#specifying-multithreading)**
|
||||
* [The automatic way](Multithreading.md#the-automatic-way)
|
||||
* [The manual way](Multithreading.md#the-manual-way)
|
||||
* [Globally via environment variables](Multithreading.md#globally-via-environment-variables)
|
||||
* [The automatic way](Multithreading.md#environment-variables-the-automatic-way)
|
||||
* [The manual way](Multithreading.md#environment-variables-the-manual-way)
|
||||
* [Globally at runtime](Multithreading.md#globally-at-runtime)
|
||||
* [The automatic way](Multithreading.md#globally-at-runtime-the-automatic-way)
|
||||
* [The manual way](Multithreading.md#globally-at-runtime-the-manual-way)
|
||||
* [Locally at runtime](Multithreading.md#locally-at-runtime)
|
||||
* [Initializing a rntm_t](Multithreading.md#initializing-a-rntm-t)
|
||||
* [The automatic way](Multithreading.md#locally-at-runtime-the-automatic-way)
|
||||
* [The manual way](Multithreading.md#locally-at-runtime-the-manual-way)
|
||||
* [Using the expert interfaces](Multithreading.md#locally-at-runtime-using-the-expert-interfaces)
|
||||
|
||||
## Introduction
|
||||
|
||||
Our paper [Anatomy of High-Performance Many-Threaded Matrix Multiplication](https://github.com/flame/blis#citations), presented at IPDPS'14, identified 5 loops around the micro-kernel as opportunities for parallelization. Within BLIS, we have enabled parallelism for 4 of those loops and have extended it to the rest of the level-3 operations except for `trsm`.
|
||||
# Introduction
|
||||
|
||||
## Enabling multithreading
|
||||
Our paper [Anatomy of High-Performance Many-Threaded Matrix Multiplication](https://github.com/flame/blis#citations), presented at IPDPS'14, identified 5 loops around the micro-kernel as opportunities for parallelization within level-3 operations such as `gemm`. Within BLIS, we have enabled parallelism for 4 of those loops and have extended it to the rest of the level-3 operations except for `trsm`.
|
||||
|
||||
Note that BLIS disables multithreading by default.
|
||||
# Enabling multithreading
|
||||
|
||||
Note that BLIS disables multithreading by default. In order to extract multithreaded parallelism from BLIS, you must first enable multithreading explicitly at configure-time.
|
||||
|
||||
As of this writing, BLIS optionally supports multithreading via either OpenMP or POSIX threads.
|
||||
|
||||
To enable multithreading via OpenMP, you must provide the `--enable-threading` option to the `configure` script:
|
||||
```
|
||||
$ ./configure --enable-threading=openmp haswell
|
||||
$ ./configure --enable-threading=openmp auto
|
||||
```
|
||||
In this example, we configure for the `haswell` configuration. Similarly, to enable multithreading via POSIX threads (pthreads), specify the threading model as `pthreads` instead of `openmp`:
|
||||
```
|
||||
$ ./configure --enable-threading=pthreads haswell
|
||||
$ ./configure --enable-threading=pthreads auto
|
||||
```
|
||||
You can also use the shorthand option for `--enable-threading`, which is `-t`:
|
||||
```
|
||||
$ ./configure -t pthreads
|
||||
```
|
||||
For more complete and up-to-date information on the `--enable-threading` option, simply run `configure` with the `--help` (or `-h`) option:
|
||||
```
|
||||
$ ./configure --help
|
||||
$ ./configure --help
|
||||
```
|
||||
|
||||
|
||||
## Specifying multithreading
|
||||
# Specifying multithreading
|
||||
|
||||
There are two broad ways to specify multithreading in BLIS: the "automatic way" or the "manual way".
|
||||
There are three broad methods of specifying multithreading in BLIS:
|
||||
* [Globally via environment variables](Multithreading.md#globally-via-environment-variables)
|
||||
* [Globally at runtime](Multithreading.md#globally-at-runtime)
|
||||
* [Locally at runtime](Multithreading.md#locally-at-runtime) (that is, on a per-call, thread-safe basis)
|
||||
|
||||
### The automatic way
|
||||
Within these three broad methods there are two specific ways of expressing a request for parallelism. First, the user may express a single number--the total number of threads, or ways of parallelism, to use within a single operation such as `gemm`. We call this the "automatic" way. Alternatively, the user may express the number of ways of parallelism to obtain within *each loop* of the level-3 operation. We call this the "manual" way. The latter way is actually what BLIS eventually needs before it can perform its multithreading; the former is viable only because we have a heuristic for determining a reasonable instance of the latter when given the former.
|
||||
This pattern--automatic or manual--holds regardless of which of the three methods is used.
|
||||
|
||||
The simplest way to enable multithreading in BLIS is to simply set the total number of threads you wish BLIS to employ in its parallelization. This total number of threads is captured by the `BLIS_NUM_THREADS` environment variable. You can set this variable prior to executing your BLIS-linked executable:
|
||||
Regardless of which method is employed, and which specific way within each method, after setting the number of threads, the application may simply call the desired level-3 operation via either the BLAS, the [typed API](docs/BLISTypedAPI.md), or the [object API](docs/BLISObjectAPI.md), and the operation will execute in a multithreaded manner.
|
||||
|
||||
## Globally via environment variables
|
||||
|
||||
The most common method of specifying multithreading in BLIS is globally via environment variables. With this method, the user sets one or more environment variables in the shell before launching the BLIS-linked executable.
|
||||
|
||||
Regardless of whether you end up using the automatic or manual way of expressing a request for multithreading, note that the environment variables are read (via `getenv()`) by BLIS **only once**, when the library is initialized. Subsequent to library initialization, the global settings for parallelization may only be changed via the [global runtime API](Multithreading.md#globally-at-runtime). If this constraint is not a problem, then environment variables may work fine for you.
|
||||
|
||||
### Environment variables: the automatic way
|
||||
|
||||
The automatic way of specifying parallelism entails simply setting the total number of threads you wish BLIS to employ in its parallelization. This total number of threads is captured by the `BLIS_NUM_THREADS` environment variable. You can set this variable prior to executing your BLIS-linked executable:
|
||||
```
|
||||
$ export BLIS_NUM_THREADS=16
|
||||
$ ./my_blis_program
|
||||
```
|
||||
This causes BLIS to automatically determine a reasonable threading strategy based on what is known about your architecture. If `BLIS_NUM_THREADS` is not set, then BLIS also looks at the value of `OMP_NUM_THREADS`, if set. If neither variable is set, the default number of threads is 1.
|
||||
|
||||
Alternatively, any time after calling `bli_init()` but before `bli_finalize()`, you can also set (or change) the value of `BLIS_NUM_THREADS` at run-time:
|
||||
```
|
||||
bli_thread_set_num_threads( 8 );
|
||||
```
|
||||
Similarly, the current value of `BLIS_NUM_THREADS` can always be queried as follows:
|
||||
```
|
||||
dim_t num_threads = bli_thread_get_num_threads();
|
||||
$ export GOMP_CPU_AFFINITY="..." # optional step when using GNU libgomp.
|
||||
$ export BLIS_NUM_THREADS=16
|
||||
$ ./my_blis_program
|
||||
```
|
||||
This causes BLIS to automatically determine a reasonable threading strategy based on what is known about the operation and problem size. If `BLIS_NUM_THREADS` is not set, then BLIS also looks at the value of `OMP_NUM_THREADS`, if set. If neither variable is set, the default number of threads is 1.
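The lookup order described above can be sketched in plain C. This is only an illustration of the documented precedence (`BLIS_NUM_THREADS`, then `OMP_NUM_THREADS`, then 1), not the code BLIS actually uses. The helper names `parse_positive()`, `resolve_num_threads()`, and `resolve_num_threads_from_env()` are hypothetical, and treating an empty or non-numeric value as "unset" is an assumption of this sketch.

```c
#include <stdlib.h>

/* Hypothetical helper: parse a positive integer from a string. Returns 0 if
   the string is NULL, empty, non-numeric, or not positive. Treating such
   values as "unset" is an assumption of this sketch. */
long parse_positive( const char* s )
{
    if ( s == NULL || *s == '\0' ) return 0;

    char* end = NULL;
    long  v   = strtol( s, &end, 10 );

    if ( *end != '\0' || v <= 0 ) return 0;
    return v;
}

/* Hypothetical helper: apply the documented precedence to two raw values.
   blis_val stands for BLIS_NUM_THREADS, omp_val for OMP_NUM_THREADS. */
long resolve_num_threads( const char* blis_val, const char* omp_val )
{
    long nt = parse_positive( blis_val );

    /* Fall back to OMP_NUM_THREADS only if BLIS_NUM_THREADS gave nothing. */
    if ( nt == 0 ) nt = parse_positive( omp_val );

    /* If neither variable is usable, default to one thread. */
    return nt > 0 ? nt : 1;
}

/* Convenience wrapper that reads the real environment. */
long resolve_num_threads_from_env( void )
{
    return resolve_num_threads( getenv( "BLIS_NUM_THREADS" ),
                                getenv( "OMP_NUM_THREADS" ) );
}
```

Keeping the precedence logic separate from the `getenv()` calls makes it easy to check the behavior without modifying the process environment.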
|
||||
|
||||
### The manual way
|
||||
### Environment variables: the manual way
|
||||
|
||||
The "manual way" of specifying parallelism in BLIS involves specifying which loops within the matrix multiplication algorithm to parallelize, and the degree of parallelism to be obtained from those loops.
|
||||
The manual way of specifying parallelism involves communicating which loops within the matrix multiplication algorithm to parallelize and the degree of parallelism to be obtained from each of those loops.
|
||||
|
||||
The below chart describes the five loops used in BLIS's matrix multiplication operations.
|
||||
The below chart describes the five loops used in BLIS's matrix multiplication operations.
|
||||
|
||||
| Loop around micro-kernel | Environment variable | Direction | Notes |
|
||||
|:-------------------------|:---------------------|:----------|:------------|
|
||||
@@ -68,9 +87,11 @@ The below chart describes the five loops used in BLIS's matrix multiplication op
|
||||
| 2nd loop | `BLIS_JR_NT` | `n` | |
|
||||
| 1st loop | `BLIS_IR_NT` | `m` | |
|
||||
|
||||
Note: Parallelization of the 4th loop is not currently enabled because each iteration of the loop updates the same part of the matrix C. Thus to parallelize it requires either a reduction or mutex locks when updating C.
|
||||
**Note**: Parallelization of the 4th loop is not currently enabled because each iteration of the loop updates the same part of the output matrix C. Thus, to safely parallelize it requires either a reduction or mutex locks when updating C.
|
||||
|
||||
Parallelization in BLIS is hierarchical. So if we parallelize multiple loops, the total number of threads will be the product of the amount of parallelism for each loop. Thus the total number of threads used is `BLIS_IR_NT * BLIS_JR_NT * BLIS_IC_NT * BLIS_JC_NT`.
|
||||
Parallelization in BLIS is hierarchical. So if we parallelize multiple loops, the total number of threads will be the product of the amount of parallelism for each loop. Thus the total number of threads used is the product of all the values:
|
||||
`BLIS_JC_NT * BLIS_IC_NT * BLIS_JR_NT * BLIS_IR_NT`.
|
||||
Note that if you set at least one of these loop-specific variables, any others that are unset will default to 1.
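As a rough illustration of the arithmetic above, the following C sketch computes the total thread count implied by the four loop-specific variables, treating any unset variable as 1. It covers only the case where at least one loop-specific variable is set. The function names `ways_or_one()`, `total_threads()`, and `total_threads_from_env()` are hypothetical and are not part of the BLIS API.

```c
#include <stdlib.h>

/* Hypothetical helper: return the ways of parallelism encoded in a string,
   or 1 if the string is NULL (variable unset) or not a positive number. */
long ways_or_one( const char* s )
{
    if ( s == NULL ) return 1;

    long v = strtol( s, NULL, 10 );
    return v > 0 ? v : 1;
}

/* Total threads is the product of the ways of parallelism of each loop. */
long total_threads( const char* jc, const char* ic,
                    const char* jr, const char* ir )
{
    return ways_or_one( jc ) * ways_or_one( ic ) *
           ways_or_one( jr ) * ways_or_one( ir );
}

/* Convenience wrapper that reads the variables named in the table above. */
long total_threads_from_env( void )
{
    return total_threads( getenv( "BLIS_JC_NT" ), getenv( "BLIS_IC_NT" ),
                          getenv( "BLIS_JR_NT" ), getenv( "BLIS_IR_NT" ) );
}
```

For example, `BLIS_JC_NT=2` and `BLIS_IC_NT=4` with the other two variables unset implies 2 * 4 * 1 * 1 = 8 threads.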
|
||||
|
||||
In general, the way to choose how to set these environment variables is as follows: The amount of parallelism from the M and N dimensions should be roughly the same. Thus `BLIS_IR_NT * BLIS_IC_NT` should be roughly equal to `BLIS_JR_NT * BLIS_JC_NT`.
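One simple way to apply this rule of thumb is to split a thread budget into two factors that are as close to each other as possible. The C sketch below does that. It is a toy illustration and not the heuristic BLIS uses internally; `balanced_split()` is a hypothetical name. Here `m_ways` stands for the product `BLIS_IR_NT * BLIS_IC_NT` and `n_ways` for the product `BLIS_JR_NT * BLIS_JC_NT`.

```c
/* Hypothetical helper: split nt into m_ways * n_ways == nt such that the two
   factors are as close as possible, with m_ways <= n_ways. Values of nt below
   1 are treated as 1. */
void balanced_split( long nt, long* m_ways, long* n_ways )
{
    if ( nt < 1 ) nt = 1;

    /* Find the largest divisor of nt that does not exceed sqrt(nt). */
    long best = 1;
    for ( long f = 1; f * f <= nt; ++f )
    {
        if ( nt % f == 0 ) best = f;
    }

    *m_ways = best;
    *n_ways = nt / best;
}
```

With 16 threads this yields a 4 x 4 split, and with 12 threads a 3 x 4 split. A prime thread count such as 7 cannot be balanced and degenerates to 1 x 7. How each factor is then divided between the inner and outer loop of its dimension depends on the cache topology.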
|
||||
|
||||
@@ -81,18 +102,123 @@ Next, which combinations of loops to parallelize depends on which caches are sha
|
||||
|
||||

|
||||
|
||||
As with specifying parallelism via `BLIS_NUM_THREADS`, you can set the `BLIS_xx_NT` environment variables in the shell, prior to launching your BLIS-linked executable, or you can set (or update) the environment variables at run-time. Here are some examples of using the run-time API:
|
||||
## Globally at runtime
|
||||
|
||||
If you still wish to set the parallelization scheme globally, but you want to do so at runtime, BLIS provides a thread-safe API for specifying multithreading. Think of these functions as a way to modify the same internal data structure into which the environment variables are read. (Recall that the environment variables are only read once, when BLIS is initialized.)
|
||||
|
||||
### Globally at runtime: the automatic way
|
||||
|
||||
If you simply want to specify an overall number of threads and let BLIS choose a thread factorization automatically, use the following function:
|
||||
```c
|
||||
bli_thread_set_jc_nt( 2 ); // Set BLIS_JC_NT to 2.
|
||||
bli_thread_set_ic_nt( 4 ); // Set BLIS_IC_NT to 4.
|
||||
bli_thread_set_jr_nt( 3 ); // Set BLIS_JR_NT to 3.
|
||||
bli_thread_set_ir_nt( 1 ); // Set BLIS_IR_NT to 1.
|
||||
void bli_thread_set_num_threads( dim_t n_threads );
|
||||
```
|
||||
There are also equivalent "get" functions that allow you to query the current values for the `BLIS_xx_NT` variables:
|
||||
This function takes one integer--the total number of threads for BLIS to utilize in any one operation. So, for example, if we call
|
||||
```c
|
||||
dim_t jc_nt = bli_thread_get_jc_nt();
|
||||
dim_t ic_nt = bli_thread_get_ic_nt();
|
||||
dim_t jr_nt = bli_thread_get_jr_nt();
|
||||
dim_t ir_nt = bli_thread_get_ir_nt();
|
||||
bli_thread_set_num_threads( 4 );
|
||||
```
|
||||
we are requesting that the global number of threads be set to 4. You may also query the global number of threads at any time via
|
||||
```c
|
||||
dim_t bli_thread_get_num_threads( void );
|
||||
```
|
||||
which may be called in the usual way:
|
||||
```c
|
||||
dim_t nt = bli_thread_get_num_threads();
|
||||
```
|
||||
|
||||
### Globally at runtime: the manual way
|
||||
|
||||
If you want to specify the number of ways of parallelism to obtain for each loop, use the following function:
|
||||
```c
|
||||
void bli_thread_set_ways( dim_t jc, dim_t pc, dim_t ic, dim_t jr, dim_t ir );
|
||||
```
|
||||
This function takes one integer for each loop in the level-3 operations. (**Note**: even though the function takes a `pc` argument, it will be ignored until parallelism is supported in the KC loop.)
|
||||
So, for example, if we call
|
||||
```c
|
||||
bli_thread_set_ways( 2, 1, 4, 1, 1 );
|
||||
```
|
||||
we are requesting two ways of parallelism in the `JC` loop and 4 ways of parallelism in the `IC` loop.
|
||||
Unlike environment variables, which only allow the user to set the parallelization strategy prior to running the executable, `bli_thread_set_ways()` may be called any time during the normal course of the BLIS-linked application's execution.
|
||||
|
||||
## Locally at runtime
|
||||
|
||||
In addition to the global methods based on environment variables and runtime function calls, BLIS also provides a local, *per-call* method of requesting parallelism at runtime. This method has the benefit of being thread-safe and flexible; your application can spawn two threads, with each thread requesting different degrees of parallelism from their respective calls to level-3 operations.
|
||||
|
||||
As with environment variables and the global runtime API, there are two ways to specify parallelism: the automatic way and the manual way. Both ways involve allocating a BLIS-specific object, initializing the object and encoding the desired parallelization, and then passing a pointer to the object into one of the expert interfaces of either the [typed](docs/BLISTypedAPI.md) or [object](docs/BLISObjectAPI.md) APIs. We provide examples of utilizing this threading object below.
|
||||
|
||||
### Initializing a rntm_t
|
||||
|
||||
Before specifying the parallelism (automatically or manually), you must first allocate a special BLIS object called a `rntm_t` (runtime). The object is quite small (about 64 bytes), and so we recommend allocating it statically on the function stack:
|
||||
```c
|
||||
rntm_t rntm;
|
||||
```
|
||||
We **strongly recommend** initializing the `rntm_t`. This can be done in either of two ways.
|
||||
If you want to also initialize it as part of the declaration, you may do so via the default `BLIS_RNTM_INITIALIZER` macro:
|
||||
```c
|
||||
rntm_t rntm = BLIS_RNTM_INITIALIZER;
|
||||
```
|
||||
Alternatively, you can perform the same initialization by passing the address of the `rntm_t` to an initialization function:
|
||||
```c
|
||||
bli_rntm_init( &rntm );
|
||||
```
|
||||
As of this writing, BLIS treats a default-initialized `rntm_t` as a request for single-threaded execution.
|
||||
|
||||
**Note**: If you choose to **not** initialize the `rntm_t` object, you **must** set its parallelism via either the automatic way or the manual way, described below. Passing a completely uninitialized `rntm_t` to a level-3 operation **will almost surely result in undefined behavior!**
|
||||
|
||||
### Locally at runtime: the automatic way
|
||||
|
||||
Once your `rntm_t` is initialized, you may request automatic parallelization by encoding only the total number of threads into the `rntm_t` via the following function:
|
||||
```c
|
||||
void bli_rntm_set_num_threads( dim_t n_threads, rntm_t* rntm );
|
||||
```
|
||||
As with `bli_thread_set_num_threads()` [discussed previously](Multithreading.md#globally-at-runtime-the-automatic-way), this function takes a single integer. It also takes the address of the `rntm_t` to modify. So, for example, if (after declaring and initializing a `rntm_t` as discussed above) we call
|
||||
```c
|
||||
bli_rntm_set_num_threads( 6, &rntm );
|
||||
```
|
||||
the `rntm_t` object will be encoded to use a total of 6 threads.
|
||||
|
||||
### Locally at runtime: the manual way
|
||||
|
||||
Once your `rntm_t` is initialized, you may manually encode the ways of parallelism for each loop into the `rntm_t` by using the following function:
|
||||
```c
|
||||
void bli_rntm_set_ways( dim_t jc, dim_t pc, dim_t ic, dim_t jr, dim_t ir, rntm_t* rntm );
|
||||
```
|
||||
As with `bli_thread_set_ways()` [discussed previously](Multithreading.md#globally-at-runtime-the-manual-way), this function takes one integer for each loop in the level-3 operations. It also takes the address of the `rntm_t` to modify.
|
||||
(**Note**: even though the function takes a `pc` argument, it will be ignored until parallelism is supported in the `KC` loop.)
|
||||
So, for example, if we call
|
||||
```c
|
||||
bli_rntm_set_ways( 1, 1, 2, 3, 1, &rntm );
|
||||
```
|
||||
we are requesting two ways of parallelism in the `IC` loop and three ways of parallelism in the `JR` loop.
|
||||
|
||||
### Locally at runtime: using the expert interfaces
|
||||
|
||||
Regardless of whether you specified parallelism into your `rntm_t` object via the automatic or manual method, eventually you must use the data structure when calling a BLIS operation.
|
||||
|
||||
Let's assume you wish to call `gemm`. To do so, simply use the expert interface, which takes two additional arguments: a `cntx_t` (context) and a `rntm_t`. For the context, you may simply pass in `NULL` and BLIS will select a default context (which is exactly what happens when you call the basic/non-expert interfaces). Here is an example of such a call:
|
||||
```c
|
||||
bli_gemm_ex( &alpha, &a, &b, &beta, &c, NULL, &rntm );
|
||||
```
|
||||
This will cause `gemm` to execute and be parallelized in the manner encoded by `rntm`.
|
||||
|
||||
To summarize, using a `rntm_t` involves three steps:
|
||||
```c
|
||||
// Declare and initialize a rntm_t object.
|
||||
rntm_t rntm = BLIS_RNTM_INITIALIZER;
|
||||
|
||||
// Call ONE (not both) of the following to encode your parallelization into
|
||||
// the rntm_t. (These are examples only--use numbers that make sense for your
|
||||
// application!)
|
||||
bli_rntm_set_num_threads( 6, &rntm );
|
||||
bli_rntm_set_ways( 1, 1, 2, 3, 1, &rntm );
|
||||
|
||||
// Finally, call BLIS via an expert interface and pass in your rntm_t.
|
||||
bli_gemm_ex( &alpha, &a, &b, &beta, &c, NULL, &rntm );
|
||||
```
|
||||
Note that `rntm_t` objects may be reused over and over again once they are initialized; there is no need to reinitialize them and re-encode their threading values!
|
||||
|
||||
# Conclusion
|
||||
|
||||
Please send us feedback if you have any concerns or questions, or [open an issue](http://github.com/flame/blis/issues) if you observe any reproducible behavior that you think is erroneous. (You are welcome to use the issue feature to start any non-trivial dialogue; we don't restrict them only to bug reports!)
|
||||
|
||||
Thanks for your interest in BLIS.
|
||||
|
||||
|
||||