Details:
- Defined a new struct datatype, rntm_t (runtime), to house the thrloop field of the cntx_t (context). The thrloop array holds the number of ways of parallelism (thread "splits") to extract per level-3 algorithmic loop until those values can be used to create a corresponding node in the thread control tree (thrinfo_t structure), which (for any given level-3 invocation) usually happens by the time the macrokernel is called for the first time.
- Relocating the thrloop from the cntx_t remedies a thread-safety issue when invoking level-3 operations from two or more application threads. The race condition existed because the cntx_t, a pointer to which is usually queried from the global kernel structure (gks), is supposed to be read-only. However, the previous code would write to the cntx_t's thrloop field *after* it had been queried, thus violating its read-only status. In practice, this would not cause a problem when a sequential application made a multithreaded call to BLIS, nor when two or more application threads used the same parallelization scheme when calling BLIS, because in either case all application threads would be using the same ways of parallelism for each loop. The true effects of the race condition were limited to situations where two or more application threads used *different* parallelization schemes for any given level-3 call.
- In remedying the above race condition, the application or calling library can now specify the parallelization scheme on a per-call basis. All that is required is that the thread encode its request for parallelism into the rntm_t struct prior to passing the address of the rntm_t to one of the expert interfaces of either the typed or object APIs. This allows, for example, one application thread to extract 4-way parallelism from a call to gemm while another application thread requests 2-way parallelism. Or, two threads could each request 4-way parallelism, but from different loops.
- A rntm_t* parameter has been added to the function signatures of most of the level-3 implementation stack (with the most notable exception being packm) as well as all level-1v, -1d, -1f, -1m, and -2 expert APIs. (A few internal functions gained the rntm_t* parameter even though they currently have no use for it, such as bli_l3_packm().) This required some internal calls to some of those functions to be updated since BLIS was already using those operations internally via the expert interfaces. For situations where a rntm_t object is not available, such as within packm/unpackm implementations, NULL is passed in to the relevant expert interfaces. This is acceptable for now since parallelism is not obtained for non-level-3 operations.
- Revamped how global parallelism is encoded. First, the conventional environment variables such as BLIS_NUM_THREADS and BLIS_*_NT are only read once, at library initialization. (Thanks to Nathaniel Smith for suggesting this to avoid repeated calls to getenv(), which can be slow.) Those values are recorded to a global rntm_t object. Public APIs, in bli_thread.c, are still available to get/set these values from the global rntm_t, though now the "set" functions have additional logic to ensure that the values are set in a synchronous manner via a mutex. If/when NULL is passed into an expert API (meaning the user opted to not provide a custom rntm_t), the values from the global rntm_t are copied to a local rntm_t, which is then passed down the function stack. Calling a basic API is equivalent to calling the expert APIs with NULL for the cntx and rntm parameters, which means the semantic behavior of these basic APIs (vis-a-vis multithreading) is unchanged from before.
- Renamed bli_cntx_set_thrloop_from_env() to bli_rntm_set_ways_for_op() and reimplemented it, so that the function can now treat the incoming rntm_t in a manner agnostic to its origin--whether it came from the application or is an internal copy of the global rntm_t.
- Removed various global runtime APIs for setting the number of ways of parallelism for individual loops (e.g. bli_thread_set_*_nt()) as well as the corresponding "get" functions. The new model simplifies these interfaces so that one must either set the total number of threads, OR set all of the ways of parallelism for each loop simultaneously (in a single function call).
- Updated sandbox/ref99 according to above changes.
- Rewrote/augmented docs/Multithreading.md to document the three methods (and two specific ways within each method) of requesting parallelism in BLIS.
- Removed old, disabled code from bli_l3_thrinfo.c.
- Whitespace changes to code (e.g. bli_obj.c) and docs/BuildSystem.md.
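The "NULL means use a copy of the global rntm_t" behavior described above can be pictured with a small toy model. This is only a sketch under stated assumptions: toy_rntm_t, its field layout, the five-loop ordering, and every toy_* function are hypothetical stand-ins rather than BLIS definitions. Taking the lock on reads is also an assumption, since the commit message only states that the "set" functions use a mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for rntm_t: one "ways of parallelism" value per
   level-3 loop. The real BLIS struct is not reproduced here. */
#define TOY_NUM_LOOPS 5

typedef struct {
    int ways[TOY_NUM_LOOPS];
} toy_rntm_t;

/* Global defaults, e.g. as recorded once at library initialization. */
static toy_rntm_t      toy_global_rntm = { { 1, 1, 1, 1, 1 } };
static pthread_mutex_t toy_global_lock = PTHREAD_MUTEX_INITIALIZER;

/* Set all ways at once, under a mutex, so concurrent setters cannot
   interleave their writes. */
void toy_global_set_ways(int w0, int w1, int w2, int w3, int w4)
{
    pthread_mutex_lock(&toy_global_lock);
    toy_global_rntm.ways[0] = w0;
    toy_global_rntm.ways[1] = w1;
    toy_global_rntm.ways[2] = w2;
    toy_global_rntm.ways[3] = w3;
    toy_global_rntm.ways[4] = w4;
    pthread_mutex_unlock(&toy_global_lock);
}

/* Resolve the settings for ONE call into caller-owned storage.
   user == NULL: copy the global defaults.
   user != NULL: copy the caller's request (copying here is a choice made
   for this sketch, so later code may modify 'local' freely). */
void toy_rntm_resolve(const toy_rntm_t *user, toy_rntm_t *local)
{
    assert(local != NULL);
    if (user != NULL) {
        *local = *user;
        return;
    }
    pthread_mutex_lock(&toy_global_lock);
    *local = toy_global_rntm;
    pthread_mutex_unlock(&toy_global_lock);
}

/* Total parallelism requested: the product of the per-loop ways. */
int toy_rntm_total_ways(const toy_rntm_t *r)
{
    int total = 1;
    for (size_t i = 0; i < TOY_NUM_LOOPS; ++i)
        total *= r->ways[i];
    return total;
}
```

Because each call works on its own local copy, one application thread can request ways of {1, 1, 4, 1, 1} while another requests {2, 1, 1, 1, 1}, and neither request writes to shared, read-only state.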
Introduction
This file briefly describes the requirements for building a custom BLIS sandbox.
Simply put, a sandbox in BLIS provides an alternative implementation to the
function bli_gemmnat(), which is the object-based API call for computing
the gemm operation via native execution. (Native execution simply means that
an induced method will not be used. It's what you probably already think of
when you think of implementing the gemm operation: a series of loops around
an optimized (usually assembly-based) microkernel with some packing functions
thrown in at various levels.)
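As a rough picture of what "a series of loops around a microkernel" means, here is a minimal, self-contained sketch. It is not BLIS code: the block sizes, the toy_* names, and the row-major layout are assumptions made for illustration, the microkernel is plain scalar C rather than assembly, and the packing steps are only marked by comments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block sizes, kept tiny so small inputs exercise every loop. */
enum { TOY_NC = 4, TOY_KC = 4, TOY_MC = 4, TOY_NR = 2, TOY_MR = 2 };

static size_t toy_min(size_t a, size_t b) { return a < b ? a : b; }

/* "Microkernel": C[mr x nr] += A[mr x k] * B[k x nr], row-major with
   leading dimensions lda, ldb, ldc. */
static void toy_ukernel(size_t mr, size_t nr, size_t k,
                        const double *a, size_t lda,
                        const double *b, size_t ldb,
                        double *c, size_t ldc)
{
    for (size_t p = 0; p < k; ++p)
        for (size_t i = 0; i < mr; ++i)
            for (size_t j = 0; j < nr; ++j)
                c[i * ldc + j] += a[i * lda + p] * b[p * ldb + j];
}

/* C (m x n) += A (m x k) * B (k x n); all matrices row-major, contiguous. */
void toy_gemm(size_t m, size_t n, size_t k,
              const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t jc = 0; jc < n; jc += TOY_NC) {           /* columns of C */
        size_t nc = toy_min(TOY_NC, n - jc);
        for (size_t pc = 0; pc < k; pc += TOY_KC) {       /* inner dimension */
            size_t kc = toy_min(TOY_KC, k - pc);
            /* A real implementation would pack a kc x nc panel of B here. */
            for (size_t ic = 0; ic < m; ic += TOY_MC) {   /* rows of C */
                size_t mc = toy_min(TOY_MC, m - ic);
                /* ...and pack an mc x kc block of A here. */
                for (size_t jr = 0; jr < nc; jr += TOY_NR) {
                    size_t nr = toy_min(TOY_NR, nc - jr);
                    for (size_t ir = 0; ir < mc; ir += TOY_MR) {
                        size_t mr = toy_min(TOY_MR, mc - ir);
                        toy_ukernel(mr, nr, kc,
                                    &a[(ic + ir) * k + pc], k,
                                    &b[pc * n + jc + jr], n,
                                    &c[(ic + ir) * n + jc + jr], n);
                    }
                }
            }
        }
    }
}
```

A sandbox is free to restructure loops like these, which is exactly the kind of experiment that is awkward to attempt inside the macroized framework code.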
Why sandboxes? Sometimes you want to experiment with tweaks or changes to the gemm operation, but you want to do so in a simple environment rather than the highly macroized and refactored (and somewhat obfuscated) code of the core framework (which, I will remind everyone, is highly macroized and refactored mostly so that all floating-point datatypes and all level-3 operations are supported with minimal source code). By building a BLIS sandbox, you can experiment (within limits) and still benefit from BLIS's existing build system, testsuite, and toolbox of utility functions.
Enabling a sandbox
To enable a sandbox at configure-time, you simply specify it as an option to
configure. Either of the following usages is accepted:
$ ./configure --enable-sandbox=ref99 auto
$ ./configure -s ref99 auto
Here, we tell configure that we want to use the ref99 sandbox, which
corresponds to a sub-directory of sandbox named ref99. (Reminder: the
auto argument is simply the configuration target and thus unrelated to
sandboxes.) As configure runs, you should get output that includes lines
similar to:
configure: configuring for alternate gemm implementation:
configure: sandbox/ref99
And when you build BLIS, the last files to be compiled will be the source code in the specified sandbox:
Compiling obj/haswell/sandbox/ref99/blx_gemm_front.o ('haswell' CFLAGS for sandboxes)
Compiling obj/haswell/sandbox/ref99/blx_gemm_int.o ('haswell' CFLAGS for sandboxes)
Compiling obj/haswell/sandbox/ref99/base/blx_blksz.o ('haswell' CFLAGS for sandboxes)
Compiling obj/haswell/sandbox/ref99/cntl/blx_gemm_cntl.o ('haswell' CFLAGS for sandboxes)
...
That's it! After the BLIS library is built, it will contain your chosen
sandbox's implementation of bli_gemmnat() instead of the default
implementation.
Sandbox rules
Like any decent sandbox, there are rules for playing here. Please follow these guidelines for the best sandbox developer experience.
- Don't bother worrying about makefiles. We've already taken care of the boring/annoying/headache-inducing build system stuff for you. :) By configuring BLIS with a sandbox enabled, make will scan your sandbox directory and compile all of its source code using similar compilation rules as were used for the rest of the framework. In addition, the compilation command line will automatically contain one -I<includepath> option for every subdirectory in your sandbox, so it doesn't matter where in your sandbox you place your header files. They will be found!
- Your sandbox must be written in C99 or C++11. If you write your sandbox in C++11, you must use one of the BLIS-approved file extensions for your source files (.cc, .cpp, .cxx) and your header files (.hh, .hpp, .hxx). Note that blis.h already contains all of its definitions inside of an extern "C" block, so you should be able to #include "blis.h" from your C++11 source code without any issues.
- All of your code to replace BLIS's default implementation of bli_gemmnat() should reside in the named sandbox directory, or some directory therein. (Obviously.) For example, this README.md file is located in the ref99 sandbox, located in sandbox/ref99. All of the code associated with this sandbox will be contained within sandbox/ref99.
- The only header file that is required of your sandbox is bli_sandbox.h. It must be named bli_sandbox.h because blis.h will #include this file when the sandbox is enabled at configure-time. That said, you will probably want to keep the file empty. Why require a file that is supposed to be empty? Well, it doesn't have to be empty. Anything placed in this file will be folded into the flattened (monolithic) blis.h at compile-time. Therefore, you should only place things (e.g. prototypes or type definitions) in bli_sandbox.h if those things would be needed at compile-time by: (a) the BLIS framework itself, or (b) an application that calls your sandbox-enabled BLIS library. Usually, neither of these situations will require any of your local definitions since those definitions are only needed to define your sandbox implementation of bli_gemmnat(), and this function is already prototyped by BLIS.
- Your definition of bli_gemmnat() should be the only function you define in your sandbox that begins with bli_. If you define other functions that begin with bli_, you risk a namespace collision with existing framework functions. To guarantee safety, please prefix your locally-defined sandbox functions with another prefix. Here, in the ref99 sandbox, we use the prefix blx_. (The x is for sandbox. Or experimental. Whatever, it doesn't matter.) Also, please avoid the prefix bla_ since that prefix is also used in BLIS for BLAS compatibility functions.
If you follow these rules, you will be much more likely to have a pleasant experience integrating your BLIS sandbox into the larger framework.
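To make the naming rule concrete, here is a minimal sketch of how a sandbox's entry point might be laid out. It is deliberately standalone: the obj_t, cntx_t, and rntm_t definitions below are empty placeholders so that the snippet compiles without blis.h, and the blx_gemm_front() signature and call counter are assumptions made for this sketch. In a real sandbox you would #include "blis.h" and match the prototype of bli_gemmnat() that BLIS already provides.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder types so the sketch compiles on its own. In a real sandbox
   these come from blis.h; their contents here are meaningless. */
typedef struct { int placeholder; } obj_t;
typedef struct { int placeholder; } cntx_t;
typedef struct { int placeholder; } rntm_t;

/* Sandbox-local state and helpers use the blx_ prefix, never bli_. */
static int blx_call_count = 0;

int blx_num_calls(void) { return blx_call_count; }

/* Hypothetical sandbox-local front end. */
void blx_gemm_front(obj_t *alpha, obj_t *a, obj_t *b,
                    obj_t *beta, obj_t *c,
                    cntx_t *cntx, rntm_t *rntm)
{
    /* Operands must exist; cntx and rntm may be NULL, as they are when a
       basic API is called. */
    assert(alpha != NULL && a != NULL && b != NULL);
    assert(beta != NULL && c != NULL);
    (void)cntx;
    (void)rntm;

    /* A real sandbox would run its gemm algorithm here. */
    ++blx_call_count;
}

/* The only bli_-prefixed function the sandbox defines. */
void bli_gemmnat(obj_t *alpha, obj_t *a, obj_t *b,
                 obj_t *beta, obj_t *c,
                 cntx_t *cntx, rntm_t *rntm)
{
    blx_gemm_front(alpha, a, b, beta, c, cntx, rntm);
}
```

Keeping bli_gemmnat() as a thin forwarder leaves everything else in the sandbox under the blx_ prefix, which avoids collisions with framework symbols.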
Caveats
Notice that the BLIS sandbox is not all-powerful. You are more-or-less stuck working with the existing BLIS infrastructure.
For example, with a BLIS sandbox you can do the following kinds of things:
- use a different gemm algorithmic partitioning path than the default Goto-like algorithm;
- experiment with different implementations of packm (not just packm kernels, which can already be customized within each sub-configuration);
- try inlining your functions manually;
- pivot away from using obj_t objects at higher algorithmic levels (such as immediately after calling bli_gemmnat()) to try to avoid some overhead;
- create experimental implementations of new BLAS-like operations (provided that you also provide an implementation of bli_gemmnat()).
You cannot, however, use a sandbox to do the following kinds of things:
- define new datatypes (half-precision, quad-precision, short integer, etc.) and expect the rest of BLIS to "know" how to handle them;
- use a sandbox to replace the default implementation of a different level-3 operation, such as Hermitian rank-k update;
- change the existing BLIS APIs;
- remove support for one or more BLIS datatypes (to cut down on library size, for example).
Another important limitation is the fact that the build system currently uses
"framework CFLAGS" when compiling the sandbox source files. These are the same
CFLAGS used when compiling general framework source code,
# Example framework CFLAGS used by 'haswell' sub-configuration
-O3 -Wall -Wno-unused-function -Wfatal-errors -fPIC -std=c99
-D_POSIX_C_SOURCE=200112L -I./include/haswell -I./frame/3/
-I./frame/ind/ukernels/ -I./frame/1m/ -I./frame/1f/ -I./frame/1/
-I./frame/include -DBLIS_VERSION_STRING=\"0.3.2-51\"
which are likely more general-purpose than the CFLAGS used for, say,
optimized kernels or even reference kernels.
# Example optimized kernel CFLAGS used by 'haswell' sub-configuration
-O3 -mavx2 -mfma -mfpmath=sse -march=core-avx2 -Wall -Wno-unused-function
-Wfatal-errors -fPIC -std=c99 -D_POSIX_C_SOURCE=200112L -I./include/haswell
-I./frame/3/ -I./frame/ind/ukernels/ -I./frame/1m/ -I./frame/1f/ -I./frame/1/
-I./frame/include -DBLIS_VERSION_STRING=\"0.3.2-51\"
(To see precisely which flags are being employed for any given file, enable
verbosity at compile-time via make V=1.) Compiling sandboxes with these more
general-purpose CFLAGS means that we only need to compile one
instance of each sandbox source file, even when targeting multiple
configurations (for example, via ./configure x86_64). However, it also means
that sandboxes are not ideal for microkernels, as they sometimes need additional
compiler flags not included in the set used for framework CFLAGS in order to
yield the highest performance. If you have a new microkernel you would like to
try out, you can always prototype it within a sandbox. However,
once it is stable and ready for use by others, it's best to formally register
the kernel(s) along with a new configuration, which will allow you to specify
kernel-specific compiler flags to be used when compiling your microkernel.
Please see the Configuration wiki for more details, and when in doubt, please
don't be shy about seeking guidance from BLIS developers by opening a new
issue or sending a message to the blis-devel mailing list.
Notwithstanding these limitations, hopefully you still find BLIS sandboxes useful!
Questions? Concerns? Feedback?
If you encounter any problems, please open a new issue on GitHub.
If you are unsure about how something works, you can still open an issue. Or, you can send a message to the blis-devel mailing list.
Happy sandboxing!