mirror of
https://github.com/amd/blis.git
synced 2026-05-05 06:51:11 +00:00
8f399c89403d5824ba767df1426706cf2d19d0a7
35 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
6218ac95a5 | Merge branch 'master' into amd | ||
|
|
29b0e1ef4e |
Code review + tweaks to AMD's AOCL 2.0 PR (#349).
Details: - NOTE: This is a merge commit of 'master' of git://github.com/amd/blis into 'amd-master' of flame/blis. - Fixed a bug in the downstream value of BLIS_NUM_ARCHS, which was inadvertantly not incremented when the Zen2 subconfiguration was added. - In bli_gemm_front(), added a missing conditional constraint around the call to bli_gemm_small() that ensures that the computation precision of C matches the storage precision of C. - In bli_syrk_front(), reorganized and relocated the notrans/trans logic that existed around the call to bli_syrk_small() into bli_syrk_small() to minimize the calling code footprint and also to bring that code into stylistic harmony with similar code in bli_gemm_front() and bli_trsm_front(). Also, replaced direct accessing of obj_t fields with proper accessor static functions (e.g. 'a->dim[0]' becomes 'bli_obj_length( a )'). - Added #ifdef BLIS_ENABLE_SMALL_MATRIX guard around prototypes for bli_gemm_small(), bli_syrk_small(), and bli_trsm_small(). This is strictly speaking unnecessary, but it serves as a useful visual cue to those who may be reading the files. - Removed cpp macro-protected small matrix debugging code from bli_trsm_front.c. - Added a GCC_OT_9_1_0 variable to build/config.mk.in to facilitate gcc version check for availability of -march=znver2, and added appropriate support to configure script. - Cleanups to compiler flags common to recent AMD microarchitectures in config/zen/amd_config.mk, including: removal of -march=znver1 et al. from CKVECFLAGS (since the -march flag is added within make_defs.mk); setting CRVECFLAGS similarly to CKVECFLAGS. - Cleanups to config/zen/bli_cntx_init_zen.c. - Cleanups, added comments to config/zen/make_defs.mk. - Cleanups to config/zen2/make_defs.mk, including making use of newly- added GCC_OT_9_1_0 and existing GCC_OT_6_1_0 to choose the correct set of compiler flags based on the version of gcc being used. - Reverted downstream changes to test/test_gemm.c. - Various whitespace/comment changes. |
||
|
|
31c8657f1d |
Added support for pre-broadcast when packing B.
Details: - Added support for being able to duplicate (broadcast) elements in memory when packing matrix B (ie: the left-hand operand) in level-3 operations. This turns out advantageous for some architectures that can afford the cost of the extra bandwidth and somehow benefit from the pre-broadcast elements (and thus being able to avoid using broadcast-style load instructions on micro-rows of B in the gemm microkernel). - Support optionally disabling right-side hemm and symm. If this occurs, hemm_r is implemented in terms of hemm_l (and symm_r in terms of symm_l). This is needed when broadcasting during packing because the alternative--supporting the broadcast of B while also allowing matrix B to be Hermitian/symmetric--would be an absolute mess. - Support alignment factors for packed blocks of A, B, and C separately (as well as for general-purpose buffers). In addition, we support byte offsets from those alignment values (which is different from aligning by align+offset bytes to begin with). The default alignment values are BLIS_PAGE_SIZE in all four cases, with the offset values defaulting to zero. - Pass pack_t schema into bli_?packm_cxk() so that it can be then passed into the packm kernel, where it will be needed by packm kernels that perform broadcasts of B, since the idea is that we *only* want to broadcast when packing micropanels of B and not A. - Added definition for variadic bli_cntx_set_l3_vir_ukrs(), which can be used to set custom virtual level-3 microkernels in the cntx_t, which would typically be done in the bli_cntx_init_*() function defined in the subconfiguration of interest. - Added a "broadcast B" kernel function for use with NP/NR = 12/6, defined in in ref_kernels/1m/bli_packm_cxk_bb_ref.c. - Added a gemm, gemmtrsm, and trsm "broadcast B" reference kernels defined in ref_kernels/3/bb. (These kernels have been tested with double real with NP/NR = 12/6.) - Added #ifndef ... #endif guards around several macro constants defined in frame/include/bli_kernel_macro_defs.h. - Defined a few "broadcast B" static functions in frame/include/level0/bb for use by "broadcast B"-style packm reference kernels. For now, only the real domain kernels are tested and fully defined. - Output the alignment and offset values for packed blocks of A and B in the testsuite's "BLIS configuration info" section. - Comment updates to various files. - Bumped so_version to 3.0.0. |
||
|
|
13806ba3b0 |
This check in has changes w.r.t Copyright information, which is changed to (start year) - 2019
Change-Id: Ide3c8f7172210b8d3538d3c36e88634ab1ba9041 |
||
|
|
b9c9f03502 |
Implemented gemm on skinny/unpacked matrices.
Details:
- Implemented a new sub-framework within BLIS to support the management
of code and kernels that specifically target matrix problems for which
at least one dimension is deemed to be small, which can result in long
and skinny matrix operands that are ill-suited for the conventional
level-3 implementations in BLIS. The new framework tackles the problem
in two ways. First the stripped-down algorithmic loops forgo the
packing that is famously performed in the classic code path. That is,
the computation is performed by a new family of kernels tailored
specifically for operating on the source matrices as-is (unpacked).
Second, these new kernels will typically (and in the case of haswell
and zen, do in fact) include separate assembly sub-kernels for
handling of edge cases, which helps smooth performance when performing
problems whose m and n dimension are not naturally multiples of the
register blocksizes. In a reference to the sub-framework's purpose of
supporting skinny/unpacked level-3 operations, the "sup" operation
suffix (e.g. gemmsup) is typically used to denote a separate namespace
for related code and kernels. NOTE: Since the sup framework does not
perform any packing, it targets row- and column-stored matrices A, B,
and C. For now, if any matrix has non-unit strides in both dimensions,
the problem is computed by the conventional implementation.
- Implemented the default sup handler as a front-end to two variants.
bli_gemmsup_ref_var2() provides a block-panel variant (in which the
2nd loop around the microkernel iterates over n and the 1st loop
iterates over m), while bli_gemmsup_ref_var1() provides a panel-block
variant (2nd loop over m and 1st loop over n). However, these variants
are not used by default and provided for reference only. Instead, the
default sup handler calls _var2m() and _var1n(), which are similar
to _var2() and _var1(), respectively, except that they defer to the
sup kernel itself to iterate over the m and n dimension, respectively.
In other words, these variants rely not on microkernels, but on
so-called "millikernels" that iterate along m and k, or n and k.
The benefit of using millikernels is a reduction of function call
and related (local integer typecast) overhead as well as the ability
for the kernel to know which micropanel (A or B) will change during
the next iteration of the 1st loop, which allows it to focus its
prefetching on that micropanel. (In _var2m()'s millikernel, the upanel
of A changes while the same upanel of B is reused. In _var1n()'s, the
upanel of B changes while the upanel of A is reused.)
- Added a new configure option, --[en|dis]able-sup-handling, which is
enabled by default. However, the default thresholds at which the
default sup handler is activated are set to zero for each of the m, n,
and k dimensions, which effectively disables the implementation. (The
default sup handler only accepts the problem if at least one dimension
is smaller than or equal to its corresponding threshold. If all
dimensions are larger than their thresholds, the problem is rejected
by the sup front-end and control is passed back to the conventional
implementation, which proceeds normally.)
- Added support to the cntx_t structure to track new fields related to
the sup framework, most notably:
- sup thresholds: the thresholds at which the sup handler is called.
- sup handlers: the address of the function to call to implement
the level-3 skinny/unpacked matrix implementation.
- sup blocksizes: the register and cache blocksizes used by the sup
implementation (which may be the same or different from those used
by the conventional packm-based approach).
- sup kernels: the kernels that the handler will use in implementing
the sup functionality.
- sup kernel prefs: the IO preference of the sup kernels, which may
differ from the preferences of the conventional gemm microkernels'
IO preferences.
- Added a bool_t to the rntm_t structure that indicates whether sup
handling should be enabled/disabled. This allows per-call control
of whether the sup implementation is used, which is useful for test
drivers that wish to switch between the conventional and sup codes
without having to link to different copies of BLIS. The corresponding
accessor functions for this new bool_t are defined in bli_rntm.h.
- Implemented several row-preferential gemmsup kernels in a new
directory, kernels/haswell/3/sup. These kernels include two general
implementation types--'rd' and 'rv'--for the 6x8 base shape, with
two specialized millikernels that embed the 1st loop within the kernel
itself.
- Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference
gemmsup microkernels. NOTE: These microkernels, unlike the current
crop of conventional (pack-based) microkernels, do not use constant
loop bounds. Additionally, their inner loop iterates over the k
dimension.
- Defined new typedef enums:
- stor3_t: captures the effective storage combination of the level-3
problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A
special value of BLIS_XXX is used to denote an arbitrary combination
which, in practice, means that at least one of the operands is
stored according to general stride.
- threshid_t: captures each of the three dimension thresholds.
- Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create()
can be passed "-1, -1" as a lazy request for row storage. (Note that
"0, 0" is still accepted as a lazy request for column storage.)
- Added support for various instructions to bli_x86_asm_macros.h,
including imul, vhaddps/pd, and other instructions related to integer
vectors.
- Disabled the older small matrix handling code inserted by AMD in
bli_gemm_front.c, since the sup framework introduced in this commit
is intended to provide a more generalized solution.
- Added test/sup directory, which contains standalone performance test
drivers, a Makefile, a runme.sh script, and an 'octave' directory
containing scripts compatible with GNU Octave. (They also may work
with matlab, but if not, they are probably close to working.)
- Reinterpret the storage combination string (sc_str) in the various
level-3 testsuite modules (e.g. src/test_gemm.c) so that the order
of each matrix storage char is "cab" rather than "abc".
- Comment updates in level-3 BLAS API wrappers in frame/compat.
|
||
|
|
89cd650e7b |
Use void_fp for function pointers instead of void*.
Change void*-typed function pointers to void_fp. - Updated all instances of void* variables that store function pointers to variables of a new type, void_fp. Originally, I wanted to define the type of void_fp as "void (*void_fp)( void )"--that is, a pointer to a function with no return value and no arguments. However, once I did this, I realized that gcc complains with incompatible pointer type (-Wincompatible-pointer-types) warnings every time any such a pointer is being assigned to its final, type-accurate function pointer type. That is, gcc will silently typecast a void* to another defined function pointer type (e.g. dscalv_ker_ft) during an assignment from the former to the latter, but the same statement will trigger a warning when typecasting from a void_fp type. I suspect an explicit typecast is needed in order to avoid the warning, which I'm not willing to insert at this time. - Added a typedef to bli_type_defs.h defining void_fp as void*, along with a commented-out version of the aborted definition described above. (Note that POSIX requires that void* and function pointers be interchangeable; it is the C standard that does not provide this guarantee.) - Comment updates to various _oapi.c files. |
||
|
|
eb97f778a1 |
Added missing AMD copyrights to previous commit.
Details:
- Forgot to add AMD copyrights to several touched files that did not
already have them in
|
||
|
|
2f3174330f |
Implemented a pool-based small block allocator.
Details:
- Implemented a sophisticated data structure and set of APIs that track
the small blocks of memory (around 80-100 bytes each) used when
creating nodes for control and thread trees (cntl_t and thrinfo_t) as
well as thread communicators (thrcomm_t). The purpose of the small
block allocator, or sba, is to allow the library to transition into a
runtime state in which it does not perform any calls to malloc() or
free() during normal execution of level-3 operations, regardless of
the threading environment (potentially multiple application threads
as well as multiple BLIS threads). The functionality relies on a new
data structure, apool_t, which is (roughly speaking) a pool of
arrays, where each array element is a pool of small blocks. The outer
pool, which is protected by a mutex, provides separate arrays for each
application thread while the arrays each handle multiple BLIS threads
for any given application thread. The design minimizes the potential
for lock contention, as only concurrent application threads would
need to fight for the apool_t lock, and only if they happen to begin
their level-3 operations at precisely the same time. Thanks to Kiran
Varaganti and AMD for requesting this feature.
- Added a configure option to disable the sba pools, which are enabled
by default; renamed the --[dis|en]able-packbuf-pools option to
--[dis|en]able-pba-pools; and rewrote the --help text associated with
this new option and consolidated it with the --help text for the
option associated with the sba (--[dis|en]able-sba-pools).
- Moved the membrk field from the cntx_t to the rntm_t. We now pass in
a rntm_t* to the bli_membrk_acquire() and _release() APIs, just as we
do for bli_sba_acquire() and _release().
- Replaced all calls to bli_malloc_intl() and bli_free_intl() that are
used for small blocks with calls to bli_sba_acquire(), which takes a
rntm (in addition to the bytes requested), and bli_sba_release().
These latter two functions reduce to the former two when the sba pools
are disabled at configure-time.
- Added rntm_t* arguments to various cntl_t and thrinfo_t functions, as
required by the new usage of bli_sba_acquire() and _release().
- Moved the freeing of "old" blocks (those allocated prior to a change
in the block_size) from bli_membrk_acquire_m() to the implementation
of the pool_t checkout function.
- Miscellaneous improvements to the pool_t API.
- Added a block_size field to the pblk_t.
- Harmonized the way that the trsm_ukr testsuite module performs packing
relative to that of gemmtrsm_ukr, in part to avoid the need to create
a packm control tree node, which now requires a rntm_t that has been
initialized with an sba and membrk.
- Re-enable explicit call bli_finalize() in testsuite so that users who
run the testsuite with memory tracing enabled can check for memory
leaks.
- Manually imported the compact/minor changes from
|
||
|
|
0645f239fb |
Remove UT-Austin from copyright headers' clause 3.
Details:
- Removed explicit reference to The University of Texas at Austin in the
third clause of the license comment blocks of all relevant files and
replaced it with a more all-encompassing "copyright holder(s)".
- Removed duplicate words ("derived") from a few kernels' license
comment blocks.
- Homogenized license comment block in kernels/zen/3/bli_gemm_small.c
with format of all other comment blocks.
|
||
|
|
4fa4cb0734 |
Trivial comment header updates.
Details: - Removed four trailing spaces after "BLIS" that occurs in most files' commented-out license headers. - Added UT copyright lines to some files. (These files previously had only AMD copyright lines but were contributed to by both UT and AMD.) - In some files' copyright lines, expanded 'The University of Texas' to 'The University of Texas at Austin'. - Fixed various typos/misspellings in some license headers. |
||
|
|
658f0a129b |
Fixed obscure integer size bug in va_arg() usage.
Details: - Fixed a bug in the way that the variadic bli_cntx_set_l3_nat_ukrs() function was defined. This function is meant to take a microkernel id, microkernel datatype, microkernel address, and microkernel preference as arguments, and is typically called within the bli_cntx_init_*() function defined within a sub-configuration for initializing an appropriate context. The problem is with the final argument: the microkernel preference. These preferences are actually boolean values, 0 or 1 (encoded as FALSE or TRUE). Since the variadic function does not give the compiler any type information for any variadic arguments, they are "promoted" in the course of internal (macroized) processing according to default argument promotion rules. Thus, integer literals such as 0 and 1 become int and floating-point literals (such as 0.0 or 1.0) become double. Previous to this commit, we indicated to va_arg() that the ukernel preference was a 'bool_t', which is a typedef of int64_t on 64-bit systems. On systems where int is defined as 64 bits, no problems manifest since int is the same size as the type we passed in to va_arg(), but on systems where int is 32 bits, the ukernel preference could be misinterpreted as a garbage value. (This was observed on a modern armv8 system.) The fix was to interpret the bool_t value as int and then immediately typecast it to and store it as a bool_t. Special thanks to Devangi Parikh for helping track down this issue, including deciphering the use of va_arg() and its byzantine treatment of types. - Added explicit typecasts for all invocations of va_arg() in bli_cntx.c. |
||
|
|
ecbebe7c2e |
Defined rntm_t to relocate cntx_t.thrloop (#235).
Details: - Defined a new struct datatype, rntm_t (runtime), to house the thrloop field of the cntx_t (context). The thrloop array holds the number of ways of parallelism (thread "splits") to extract per level-3 algorithmic loop until those values can be used to create a corresponding node in the thread control tree (thrinfo_t structure), which (for any given level-3 invocation) usually happens by the time the macrokernel is called for the first time. - Relocating the thrloop from the cntx_t remedies a thread-safety issue when invoking level-3 operations from two or more application threads. The race condition existed because the cntx_t, a pointer to which is usually queried from the global kernel structure (gks), is supposed to be a read-only. However, the previous code would write to the cntx_t's thrloop field *after* it had been queried, thus violating its read-only status. In practice, this would not cause a problem when a sequential application made a multithreaded call to BLIS, nor when two or more application threads used the same parallelization scheme when calling BLIS, because in either case all application theads would be using the same ways of parallelism for each loop. The true effects of the race condition were limited to situations where two or more application theads used *different* parallelization schemes for any given level-3 call. - In remedying the above race condition, the application or calling library can now specify the parallelization scheme on a per-call basis. All that is required is that the thread encode its request for parallelism into the rntm_t struct prior to passing the address of the rntm_t to one of the expert interfaces of either the typed or object APIs. This allows, for example, one application thread to extract 4-way parallelism from a call to gemm while another application thread requests 2-way parallelism. Or, two threads could each request 4-way parallelism, but from different loops. - A rntm_t* parameter has been added to the function signatures of most of the level-3 implementation stack (with the most notable exception being packm) as well as all level-1v, -1d, -1f, -1m, and -2 expert APIs. (A few internal functions gained the rntm_t* parameter even though they currently have no use for it, such as bli_l3_packm().) This required some internal calls to some of those functions to be updated since BLIS was already using those operations internally via the expert interfaces. For situations where a rntm_t object is not available, such as within packm/unpackm implementations, NULL is passed in to the relevant expert interfaces. This is acceptable for now since parallelism is not obtained for non-level-3 operations. - Revamped how global parallelism is encoded. First, the conventional environment variables such as BLIS_NUM_THREADS and BLIS_*_NT are only read once, at library initialization. (Thanks to Nathaniel Smith for suggesting this to avoid repeated calls getenv(), which can be slow.) Those values are recorded to a global rntm_t object. Public APIs, in bli_thread.c, are still available to get/set these values from the global rntm_t, though now the "set" functions have additional logic to ensure that the values are set in a synchronous manner via a mutex. If/when NULL is passed into an expert API (meaning the user opted to not provide a custom rntm_t), the values from the global rntm_t are copied to a local rntm_t, which is then passed down the function stack. Calling a basic API is equivalent to calling the expert APIs with NULL for the cntx and rntm parameters, which means the semantic behavior of these basic APIs (vis-a-vis multithreading) is unchanged from before. - Renamed bli_cntx_set_thrloop_from_env() to bli_rntm_set_ways_for_op() and reimplemented, with the function now being able to treat the incoming rntm_t in a manner agnostic to its origin--whether it came from the application or is an internal copy of the global rntm_t. - Removed various global runtime APIs for setting the number of ways of parallelism for individual loops (e.g. bli_thread_set_*_nt()) as well as the corresponding "get" functions. The new model simplifies these interfaces so that one must either set the total number of threads, OR set all of the ways of parallelism for each loop simultaneously (in a single function call). - Updated sandbox/ref99 according to above changes. - Rewrote/augmented docs/Multithreading.md to document the three methods (and two specific ways within each method) of requesting parallelism in BLIS. - Removed old, disabled code from bli_l3_thrinfo.c. - Whitespace changes to code (e.g. bli_obj.c) and docs/BuildSystem.md. |
||
|
|
87db5c048e |
Changed usage of virtual microkernel slots in cntx.
Details: - Changed the way virtual microkernels are handled in the context. Previously, there were query routines such as bli_cntx_get_l3_ukr_dt() which returned the native ukernel for a datatype if the method was equal to BLIS_NAT, or the virtual ukernel for that datatype if the method was some other value. Going forward, the context native and virtual ukernel slots will both be initialized to native ukernel function pointers for native execution, and for non-native execution the virtual ukernel pointer will be something else. This allows us to always query the virtual ukernel slot (from within, say, the macrokernel) without needing any logic in the query routine to decide which function pointer (native or virtual) to return. (Essentially, the logic has been shifted to init-time instead of compute-time.) This scheme will also allow generalized virtual ukernels as a way to insert extra logic in between the macrokernel and the native microkernel. - Initialize native contexts (in bli_cntx_ref.c) with native ukernel function addresses stored to the virtual ukernel slots pursuant to the above policy change. - Renamed all static functions that were native/virtual-ambiguous, such as bli_cntx_get_l3_ukr_dt() or bli_cntx_l3_ukr_prefers_cols_dt() pursuant to the above polilcy change. Those routines now use the substring "get_l3_vir_ukr" in their name instead of "get_l3_ukr". All of these functions were static functions defined in bli_cntx.h, and most uses were in level-3 front-ends and macrokernels. - Deprecated anti_pref bool_t in context, along with related functions such as bli_cntx_l3_ukr_eff_dislikes_storage_of(), now that 1m's panel-block execution is disabled. |
||
|
|
ed7dedfd4a | Merge branch 'master' into dev | ||
|
|
f97a86f322 |
Updated setting/querying pack schema (cntx->cntl).
- Query pack schemas in level-3 bli_*_front() functions and store those values in the schema bitfields of the correponding obj_t's when the cntx's method is not BLIS_NAT. (When method is BLIS_NAT, the default native schemas are stored to the obj_t's.) - In bli_l3_cntl_create_if(), query the schemas stored to the obj_t's in bli_*_front(), clear the schema bitfields, and pass the queried values into bli_gemm_cntl_create() and bli_trsm_cntl_create(). - Updated APIs for bli_gemm_cntl_create() and bli_trsm_cntl_create() to take schemas for A and B, and use these values to initialize the appropriate control tree nodes. (Also cpp-disabled the panel-block cntl tree creation variant, bli_gemmpb_cntl_create(), as it has not been employed by BLIS in quite some time.) - Simplified querying of schema in bli_packm_init() thanks to above changes. - Updated openmp and pthreads definitions of bli_l3_thread_decorator() so that thread-local aliases of matrix operands are guaranteed, even if aliasing is disabled within the internal back-end functions (e.g. bli_gemm_int.c). Also added a comment to bli_thrcomm_single.c explaining why the extra aliasing is not needed there. - Change bli_gemm() and level-3 friends so that the operation's ind() function is called only if all matrix operands have the same datatype, and only if that datatype is complex. The former condition is needed in preparation for work related to mixed domain operands, while the latter helps with readability, especially for those who don't want to venture into frame/ind. - Reshuffled arguments in bli_cntx_set_thrloop_from_env() to be consistent with BLIS calling conventions (modified argument(s) are last), and updated all invocations in the level-3 _front() functions. - Comment updates to bli_cntx_set_thrloop_from_env(). |
||
|
|
965db85d29 |
Updated macro invocations in bli_gemm_ker_var2.c.
Details:
- Updated "get next a/b micropanel" macro invocations in
bli_gemm_ker_var2.c according to changes in
|
||
|
|
469727d4f8 | Very minor comment updates. | ||
|
|
70640a3710 |
Implemented library self-initialization.
Details: - Defined two new functions in bli_init.c: bli_init_once() and bli_finalize_once(). Each is implemented with pthread_once(), which guarantees that, among the threads that pass in the same pthread_once_t data structure, exactly one thread will execute a user-defined function. (Thus, there is now a runtime dependency against libpthread even when multithreading is not enabled at configure-time.) - Added calls to bli_init_once() to top-level user APIs for all computational operations as well as many other functions in BLIS to all but guarantee that BLIS will self-initialize through the normal use of its functions. - Rewrote and simplified bli_init() and bli_finalize() and related functions. - Added -lpthread to LDFLAGS in common.mk. - Modified the bli_init_auto()/_finalize_auto() functions used by the BLAS compatibility layer to take and return no arguments. (The previous API that tracked whether BLIS was initialized, and then only finalized if it was initialized in the same function, was too cute by half and borderline useless because by default BLIS stays initialized when auto-initialized via the compatibility layer.) - Removed static variables that track initialization of the sub-APIs in bli_const.c, bli_error.c, bli_init.c, bli_memsys.c, bli_thread, and bli_ind.c. We don't need to track initialization at the sub-API level, especially now that BLIS can self-initialize. - Added a critical section around the changing of the error checking level in bli_error.c. - Deprecated bli_ind_oper_has_avail() as well as all functions bli_<opname>_ind_get_avail(), where <opname> is a level-3 operation name. These functions had no use cases within BLIS and likely none outside of BLIS. - Commented out calls to bli_init() and bli_finalize() in testsuite's main() function, and likewise for standalone test drivers in 'test' directory, so that self-initialization is exercised by default. |
||
|
|
513ef4d040 |
Various typecasting fixes, mis-typed enums, etc.
Details:
- Fixed implicit typecasting of conj_t to trans_t in bli_[un]packm_cxk.c.
- Properly typecast integer arguments to match format specifier in various
calls to printf() in bli_l3_thrinfo.c, bli_cntx.c, bli_pool.c, and
bli_util_oapi.c.
- Fixed "unsigned less-than-comparison with zero" checks in bli_check.c,
bli_cntx.h.
- Fixed mis-typed enums in bli_cntx.c (e.g., l1mkr_t that should have been
l1fkr_t or l1vkr_t).
- Fixed instances of opid_t value BLIS_GEMM that should have been l3ukr_t
value BLIS_GEMM_UKR in bli_cntx_ref.c.
- NOTE: These issues were identified via compiler warnings when building
BLIS with clang on a rather old installation of OS X:
$ clang --version
Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn)
Target: x86_64-apple-darwin15.2.0
Thread model: posix
|
||
|
|
9f39806c4e |
Fixed a bug in e31f0b3/b131b9a.
Details: - Erroneously placed the "don't overwrite existing blocksize" logic in bli_blksz_init*() rather than in bli_cntx_set_blkszs(). It belongs in the latter because that function copies blocksizes as-is from the blksz_t function argument to the appropriate field in the cntx_t. If the blksz_t was previously initialized selectively, based on the sign of the blocksize value passed into bli_blksz_init*(), that just leaves some fields possibly uninitialized (with garbage values), which definitely will not work. - The aforementioned logic has been moved to bli_cntx_set_blkszs() via a new function bli_blksz_copy_if_pos(), which selectively copies only the blocksizes that are greater than zero. |
||
|
|
453deb2906 |
Implemented runtime kernel management.
Details: - Reworked the build system around a configuration registry file, named config_registry', that identifies valid configuration targets, their constituent sub-configurations, and the kernel sets that are needed by those sub-configurations. The build system now facilitates the building of a single library that can contains kernels and cache/register blocksizes for multiple configurations (microarchitectures). Reference kernels are also built on a per-configuration basis. - Updated the Makefile to use new variables set by configure via the config.mk.in template, such as CONFIG_LIST, KERNEL_LIST, and KCONFIG_MAP, in determining which sub-configurations (CONFIG_LIST) and kernel sets (KERNEL_LIST) are included in the library, and which make_defs.mk files' CFLAGS (KCONFIG_MAP) are used when compiling kernels. - Reorganized 'kernels' directory into a "flat" structure. Renamed kernel functions into a standard format that includes the kernel set name (e.g. 'haswell'). Created a "bli_kernels_<kernelset>.h" file in each kernels sub-directory. These files exist to provide prototypes for the kernels present in those directories. - Reorganized reference kernels into a top-level 'ref_kernels' directory. This directory includes a new source file, bli_cntx_ref.c (compiled on a per-configuration basis), that defines the code needed to initialize a reference context and a context for induced methods for the microarchitecture in question. - Rewrote make_defs.mk files in each configuration so that the compiler variables (e.g. CFLAGS) are "stored" (renamed) on a per-configuration basis. - Modified bli_config.h.in template so that bli_config.h is generated with #defines for the config (family) name, the sub-configurations that are associated with the family, and the kernel sets needed by those sub-configurations. - Deprecated all kernel-related information in bli_kernel.h and transferred what remains to new header files named "bli_arch_<configname>.h", which are conditionally #included from a new header bli_arch.h. These files are still needed to set library-wide parameters such as custom malloc()/free() functions or SIMD alignment values. - Added bli_cntx_init_<configname>.c files to each configuration directory. The files contain a function, named the same as the file, that initializes a "native" context for a particular configuration (microarchitecture). The idea is that optimized kernels, if available, will be initialized into these contexts. Other fields will retain pointers to reference functions, which will be compiled on a per-configuration basis. These bli_cntx_init_*() functions will be called during the initialization of the global kernel structure. They are thought of as initializing for "native" execution, but they also form the basis for contexts that use induced methods. These functions are prototyped, along with their _ref() and _ind() brethren, by prototype-generating macros in bli_arch.h. - Added a new typedef enum in bli_type_defs.h to define an arch_t, which identifies the various sub-configurations. - Redesigned the global kernel structure (gks) around a 2D array of cntx_t structures (pointers to cntx_t, actually). The first dimension is indexed over arch_t and the inner dimension is the ind_t (induced method) for each microarchitecture. When a microarchitecture (configuration) is "registered" at init-time, the inner array for that configuration in the 2D array is initialized (and allocated, if it hasn't been already). The cntx_t slot for BLIS_NAT is initialized immediately and those for other induced method types are initialized and cached on-demand, as needed. At cntx_t registration, we also store function pointers to cntx_init functions that will initialize (a) "reference" contexts and (b) contexts for use with induced methods. We don't cache the full contexts for reference contexts since they are rarely needed. The functions that initialize these two kinds of contexts are generated automatically for each targeted sub-configuration from cpp-templatized code at compile-time. Induced method contexts that need "stage" adjustments can still obtain them via functions in bli_cntx_ind_stage.c. - Added new functions and functionality to bli_cntx.c, such as for setting the level-1f, level-1v, and packm kernels, and for converting a native context into one for executing an induced method. - Moved the checking of register/cache blocksize consistency from being cpp macros in bli_kernel_macro_defs.h to being runtime checks defined in bli_check.c and called from bli_gks_register_cntx() at the time that the global kernel structure's internal context is initialized for a given microarchitecture/configuration. - Deprecated all of the old per-operation bli_*_cntx.c files and removed the previous operation-level cntx_t_init()/_finalize() invocations. Instead, we now query the gks for a suitable context, usually via bli_gks_query_cntx(). - Deprecated support for the 3m2 and 3m3 induced methods. (They required hackery that I was no longer willing to support.) - Consolidated the 1e and 1r packm kernels for any given register blocksize into a single kernel that will branch on the schema and support packing to both formats. - Added the cntx_t* argument to all packm kernel signatures. - Deprecated the local function pointer array in all bli_packm_cxk*.c files and instead obtain the packm kernel from the cntx_t. - Added bli_calloc_intl(), which serves as the calloc-equivalent to to bli_malloc_intl(). Useful when we wish to allocate and initialize to zero/NULL. - Converted existing cpp macro functions defined in bli_blksz.h, bli_func.h, bli_cntx.h into static functions. |
||
|
|
c63980f4ca |
Moved 'family' field from cntx_t to cntl_t.
Details: - Removed the family field inside the cntx_t struct and re-added it to the cntl_t struct. Updated all accessor functions/macros accordingly, as well as all consumers and intermediaries of the family parameter (such as bli_l3_thread_decorator(), bli_l3_direct(), and bli_l3_prune_*()). This change was motivated by the desire to keep the context limited, as much as possible, to information about the computing environment. (The family field, by contrast, is a descriptor about the operation being executed.) - Added additional functions to bli_blksz_*() API. - Added additional functions to bli_cntx_*() API. - Minor updates to bli_func.c, bli_mbool.c. - Removed 'obj' from bli_blksz_*() API names. - Removed 'obj' from bli_cntx_*() API names. - Removed 'obj' from bli_cntl_*(), bli_*_cntl_*() API names. Renamed routines that operate only on a single struct to contain the "_node" suffix to differentiate with those routines that operate on the entire tree. - Added enums for packm and unpackm kernels to bli_type_defs.h. - Removed BLIS_1F and BLIS_VF from bszid_t definition in bli_type_defs.h. They weren't being used and probably never will be. |
||
|
|
0e58ba1b3a |
Added API to set mt environment variables.
Details:
- Renamed bli_env_get_nway() -> bli_thread_get_env().
- Added bli_thread_set_env() to allow setting environment variables
pertaining to multithreading, such as BLIS_JC_NT or BLIS_NUM_THREADS.
- Added the following convenience wrapper routines:
bli_thread_get_jc_nt()
bli_thread_get_ic_nt()
bli_thread_get_jr_nt()
bli_thread_get_ir_nt()
bli_thread_get_num_threads()
bli_thread_set_jc_nt()
bli_thread_set_ic_nt()
bli_thread_set_jr_nt()
bli_thread_set_ir_nt()
bli_thread_set_num_threads()
- Added #include "errno.h" to bli_system.h.
- This commit addresses issue #140.
- Thanks to Chris Goodyer for inspiring these updates.
|
||
|
|
fdc66f12d4 | Setting any one of BLIS_NT_[IJ][CR] overrides BLIS_NUM_THEADS. Missing BLIS_NT_XX's are defaulted to 1. Fixes #123. | ||
|
|
1c732d3ddc |
Added 1m-specific APIs for bp, pb gemm algorithms.
Details:
- Defined bli_gemmbp_cntl_create(), bli_gemmpb_cntl_create(), with the
body of bli_gemm_cntl_create() replaced with a call to the former.
- Defined bli_cntl_free_w_thrinfo(), bli_cntl_free_wo_thrinfo(). Now,
bli_cntl_free() can check if the thread parameter is NULL, and if so,
call the latter, and otherwise call the former.
- Defined bli_gemm1mbp_cntx_init(), bli_gemm1mpb_cntx_init(), both in
terms of bli_gemm1mxx_cntx_init(), which behaves the same as
bli_gemm1m_cntx_init() did before, except that an extra bool parameter
(is_pb) is used to support both bp and pb algorithms (including to
support the anti-preference field described below).
- Added support for "anti-preference" in context. The anti_pref field,
when true, will toggle the boolean return value of routines such as
bli_cntx_l3_ukr_eff_prefers_storage_of(), which has the net effect of
causing BLIS to transpose the operation to achieve disagreement (rather
than agreement) between the storage of C and the micro-kernel output
preference. This disagreement is needed for panel-block implementations,
since they induce a transposition of the suboperation immediately before
the macro-kernel is called, which changes the apparent storage of C. For
now, anti-preference is used only with the pb algorithm for 1m (and not
with any other non-1m implementation).
- Defined new functions,
bli_cntx_l3_ukr_eff_prefers_storage_of()
bli_cntx_l3_ukr_eff_dislikes_storage_of()
bli_cntx_l3_nat_ukr_eff_prefers_storage_of()
bli_cntx_l3_nat_ukr_eff_dislikes_storage_of()
which are identical to their non-"eff" (effectively) counterparts except
that they take the anti-preference field of the context into account.
- Explicitly initialize the anti-pref field to FALSE in
bli_gks_cntx_set_l3_nat_ukr_prefs().
- Added bli_gemm_ker_var1.c, which implements a panel-block macro-kernel
in terms of the existing block-panel macro-kernel _ker_var2(). This
technique requires inducing transposes on all operands and swapping
the A and B.
- Changed bli_obj_induce_trans() macro so that pack-related fields are
also changed to reflect the induced transposition.
- Added a temporary hack to bli_l3_3m4m1m_oapi.c that allows us to easily
specify the 1m algorithm (block-panel or panel-block).
- Renamed the following cntx_t-related macros:
bli_cntx_get_pack_schema_a() -> bli_cntx_get_pack_schema_a_block()
bli_cntx_get_pack_schema_b() -> bli_cntx_get_pack_schema_b_panel()
bli_cntx_get_pack_schema_c() -> bli_cntx_get_pack_schema_c_panel()
and updated all instantiations. Also updated the field names in the
cntx_t struct.
- Comment updates.
|
||
|
|
126482a3b6 |
Implemented the 1m method.
Details: - Implemented the 1m method for inducing complex domain matrix multiplication. 1m support has been added to all level-3 operations, including trsm, and is now the default induced method when native complex domain gemm microkernels are omitted from the configuration. - Updated _cntx_init() operations to take a datatype parameter. This was needed for the corresponding function for 1m (because 1m requires us to choose between column-oriented or row-oriented execution, which requires us to query the context for the storage preference of the gemm microkernel, which requires knowing the datatype) but I decided that it made sense for consistency to add the parameter to all other cntx initialization functions as well, even though those functions don't use the parameter. - Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take a second scalar for each blocksize entry. The semantic meaning of the two scalars now is that the first will scale the default blocksize while the second will scale the maximum blocksize. This allows scaling the two independently, and was needed to support 1m, which requires scaling for a register blocksize but not the register storage blocksize (ie: "packdim") analogue. - Deprecated bli_blksz_reduce_dt_to() and defined two new functions, bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing default and maximum blocksizes to some desired blocksize multiple. These functions are needed in the updated definitions of bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs(). - Added support for the 1e and 1r packing schemas to packm, including 1e/1r packing kernels. - Added a minor optimization to bli_gemm_ker_var2() that allows, under certain circumstances (specifically, real domain beta and row- or column-stored matrix C), the real domain macrokernel and microkernel to be called directly, rather than using the virtual microkernel via the complex domain macrokernel, which carries a slight additional amount of overhead. - Added 1m support to the testsuite. - Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified some code in test_gemm.c driver. |
||
|
|
bdc0a264d2 |
Adjusted stride selection of ct in macrokernels.
Details:
- Updated the changes introduced in
|
||
|
|
a8220e3a86 |
- Fix typo in bli_cntx.c
- Bump BLIS_DEFAULT_NR_THREAD_MAX to 4 |
||
|
|
c05b3862f6 |
Add automatic loop thread assignment.
- Number of threads is determined by BLIS_NUM_THREADS or OMP_NUM_THREADS, but can be overridden by BLIS_XX_NT as before. - Threads are assigned to loops (ic, jc, ir, and jc) automatically by weighted partitioning and heuristics, both of which are tunable via bli_kernel.h. - All level-3 BLAS covered. |
||
|
|
8d55033c96 |
Implemented distributed thrinfo_t management.
Details:
- Implemented Ricardo Magana's distributed thread info/communicator
management. Rather that fully construct the thrinfo_t structures, from
root to leaf, prior to spawning threads, the threads individually
construct their thrinfo_t trees (or, chains), and do so incrementally,
as needed, reusing the same structure nodes during subsequent blocked
variant iterations. This required moving the initial creation of the
thrinfo_t structure (now, the root nodes) from the _front() functions
to the bli_l3_thread_decorator(). The incremental "growing" of the tree
is performed in the internal back-end (ie: _int()) function, and so
mostly invisible. Also, the incremental growth of the thrinfo_t tree is
done as a function of the current and parent control tree nodes (as well
as the parent thrinfo_t node), further reinforcing the parallel
relationship between the two data structures.
- Removed the "inner" communicator from thrinfo_t structure definition,
as well as its id. Changed all APIs accordingly. Renamed
bli_thrinfo_needs_free_comms() to bli_thrinfo_needs_free_comm().
- Defined bli_l3_thrinfo_print_paths(), which prints the information
in an array of thrinfo_t* structure pointers. (Used only as a
debugging/verification tool.)
- Deprecated the following thrinfo_t creation functions:
bli_packm_thrinfo_create()
bli_l3_thrinfo_create()
because they are no longer used. bli_thrinfo_create() is now called
directly when creating thrinfo_t nodes.
|
||
|
|
701b9aa3ff |
Redesigned control tree infrastructure.
Details:
- Altered control tree node struct definitions so that all nodes have the
same struct definition, whose primary fields consist of a blocksize id,
a variant function pointer, a pointer to an optional parameter struct,
and a pointer to a (single) sub-node. This unified control tree type is
now named cntl_t.
- Changed the way control tree nodes are connected, and what computation
they represent, such that, for example, packing operations are now
associated with nodes that are "inline" in the tree, rather than off-
shoot braches. The original tree for the classic Goto gemm algorithm was
expressed (roughly) as:
blk_var2 -> blk_var3 -> blk_var1 -> ker_var2
| |
-> packb -> packa
and now, the same tree would look like:
blk_var2 -> blk_var3 -> packb -> blk_var1 -> packa -> ker_var2
Specifically, the packb and packa nodes perform their respective packing
operations and then recurse (without any loop) to a subproblem. This means
there are now two kinds of level-3 control tree nodes: partitioning and
non-partitioning. The blocked variants are members of the former, because
they iteratively partition off submatrices and perform suboperations on
those partitions, while the packing variants belong to the latter group.
(This change has the effect of allowing greatly simplified initialization
of the nodes, which previously involved setting many unused node fields to
NULL.)
- Changed the way thrinfo_t tree nodes are arranged to mirror the new
connective structure of control trees. That is, packm nodes are no longer
off-shoot branches of the main algorithmic nodes, but rather connected
"inline".
- Simplified control tree creation functions. Partitioning nodes are created
concisely with just a few fields needing initialization. By contrast, the
packing nodes require additional parameters, which are stored in a
packm-specific struct that is tracked via the optional parameters pointer
within the control tree struct. (This parameter struct must always begin
with a uint64_t that contains the byte size of the struct. This allows
us to use a generic function to recursively copy control trees.) gemm,
herk, and trmm control tree creation continues to be consolidated into
a single function, with the operation family being used to select
among the parameter-agnostic macro-kernel wrappers. A single routine,
bli_cntl_free(), is provided to free control trees recursively, whereby
the chief thread within a groups release the blocks associated with
mem_t entries back to the memory broker from which they were acquired.
- Updated internal back-ends, e.g. bli_gemm_int(), to query and call the
function pointer stored in the current control tree node (rather than
index into a local function pointer array). Before being invoked, these
function pointers are first cast to a gemm_voft (for gemm, herk, or trmm
families) or trsm_voft (for trsm family) type, which is defined in
frame/3/bli_l3_var_oft.h.
- Retired herk and trmm internal back-ends, since all execution now flows
through gemm or trsm blocked variants.
- Merged forwards- and backwards-moving variants by querying the direction
from routines as a function of the variant's matrix operands. gemm and
herk always move forward, while trmm and trsm move in a direction that
is dependent on which operand (a or b) is triangular.
- Added functions bli_thread_get_range_mdim(), bli_thread_get_range_ndim(),
each of which takes additional arguments and hides complexity in managing
the difference between the way ranges are computed for the four families
of operations.
- Simplified level-3 blocked variants according to the above changes, so that
the only steps taken are:
1. Query partitioning direction (forwards or backwards).
2. Prune unreferenced regions, if they exist.
3. Determine the thread partitioning sub-ranges.
<begin loop>
4. Determine the partitioning blocksize (passing in the partitioning
direction)
5. Acquire the curren iteration's partitions for the matrices affected
by the current variants's partitioning dimension (m, k, n).
6. Call the subproblem.
<end loop>
- Instantiate control trees once per thread, per operation invocation.
(This is a change from the previous regime in which control trees were
treated as stateless objects, initialized with the library, and shared
as read-only objects between threads.) This once-per-thread allocation
is done primarily to allow threads to use the control tree as as place
to cache certain data for use in subsequent loop iterations. Presently,
the only application of this caching is a mem_t entry for the packing
blocks checked out from the memory broker (allocator). If a non-NULL
control tree is passed in by the (expert) user, then the tree is copied
by each thread. This is done in bli_l3_thread_decorator(), in
bli_thrcomm_*.c.
- Added a new field to the context, and opid_t which tracks the "family"
of the operation being executed. For example, gemm, hemm, and symm are
all part of the gemm family, while herk, syrk, her2k, and syr2k are
all part of the herk family. Knowing the operation's family is necessary
when conditionally executing the internal (beta) scalar reset on on
C in blocked variant 3, which is needed for gemm and herk families,
but must not be performed for the trmm family (because beta has only
been applied to the current row-panel of C after the first rank-kc
iteration).
- Reexpressed 3m3 induced method blocked variant in frame/3/gemm/ind
to comform with the new control tree design, and renamed the macro-
kernel codes corresponding to 3m2 and 4m1b.
- Renamed bli_mem.c (and its APIs) to bli_memsys.c, and renamed/relocated
bli_mem_macro_defs.h from frame/include to frame/base/bli_mem.h.
- Renamed/relocated bli_auxinfo_macro_defs.h from frame/include to
frame/base/bli_auxinfo.h.
- Fixed a minor bug whereby the storage-to-ukr-preference matching
optimization in the various level-3 front-ends was not being applied
properly when the context indicated that execution would be via an
induced method. (Before, we always checked the native micro-kernel
corresponding to the datatype being executed, whereas now we check
the native micro-kernel corresponding to the datatype's real projection,
since that is the micro-kernel that is actually used by induced methods.
- Added an option to the testsuite to skip the testing of native level-3
complex implementations. Previously, it was always tested, provided that
the c/z datatypes were enabled. However, some configurations use
reference micro-kernels for complex datatypes, and testing these
implementations can slow down the testsuite considerably.
|
||
|
|
9dcd6f05c4 |
Implemented developer-configurable malloc()/free().
Details: - Replaced all instances of bli_malloc() and bli_free() with one of: - bli_malloc_pool()/bli_free_pool() - bli_malloc_user()/bli_free_user() - bli_malloc_intl()/bli_free_intl() each of which can be configured to call malloc()/free() substitutes, so long as the substitute functions have the same function type signatures as malloc() and free() defined by C's stdlib.h. The _pool() function is called when allocating blocks for the memory pools (used for packing buffers, primarily), the _user() function is called when obj_t's are created (via bli_obj_create() and friends), and the _intl() function is called for internal use by BLIS, such as when creating control tree nodes or temporary buffers for manipulating internal data structures. Substitutes for any of the three types of bli_malloc() may be specified by #defining the following pairs of cpp macros in bli_kernel.h: - BLIS_MALLOC_POOL/BLIS_FREE_POOL - BLIS_MALLOC_USER/BLIS_FREE_USER - BLIS_MALLOC_INTL/BLIS_FREE_INTL to be the name of the substitute functions. (Obviously, the object code that contains these functions must be provided at link-time.) These macros default to malloc() and free(). Subsitute functions are also automatically prototyped by BLIS (in bli_malloc_prototypes.h). - Removed definitions for bli_malloc() and bli_free(). - Note that bli_malloc_pool() and bli_malloc_user() are now defined in terms of a new function, bli_malloc_align(), which aligns memory to an arbitrary (power of two) alignment boundary, but does so manually, whereas before alignment was performed behind the scenes by posix_memalign(). Currently, bli_malloc_intl() is defined in terms of bli_malloc_noalign(), which serves as a simple wrapper to the designated function that is passed in (e.g. BLIS_MALLOC_INTL). Similarly, there are bli_free_align() and bli_free_noalign(), which are used in concert with their bli_malloc_*() counterparts. |
||
|
|
4136553f0d |
Clear level-3 cntx_t's via memset() before use.
Details: - In all level-3 operations' _cntx_init() functions, replaced calls to bli_cntx_obj_init() with calls to bli_cntx_obj_clear(), and in all level-3 operations' _cntx_finalize() functions, removed calls to bli_cntx_obj_finalize(), leaving those function definitions empty. - Changed the definition of bli_cntx_obj_clear() so that the clearing occurs via a single call to memset(). |
||
|
|
dd0ab1d93f |
Converted some bli_cntx query functions to macros.
Details: - Commented out several datatype-aware query functions (those ending in _dt) from bli_cntx.c, as well as their prototypes in bli_cntx.h, and added equivalent cpp query macros to bli_cntx.h. - Added 'bli_config.h' to .gitignore. |
||
|
|
537a1f4f85 |
Implemented runtime contexts and reorganized code.
Details:
- Retrofitted a new data structure, known as a context, into virtually
all internal APIs for computational operations in BLIS. The structure
is now present within the type-aware APIs, as well as many supporting
utility functions that require information stored in the context. User-
level object APIs were unaffected and continue to be "context-free,"
however, these APIs were duplicated/mirrored so that "context-aware"
APIs now also exist, differentiated with an "_ex" suffix (for "expert").
These new context-aware object APIs (along with the lower-level, type-
aware, BLAS-like APIs) contain the the address of a context as a last
parameter, after all other operands. Contexts, or specifically, cntx_t
object pointers, are passed all the way down the function stack into
the kernels and allow the code at any level to query information about
the runtime, such as kernel addresses and blocksizes, in a thread-
friendly manner--that is, one that allows thread-safety, even if the
original source of the information stored in the context changes at
run-time; see next bullet for more on this "original source" of info).
(Special thanks go to Lee Killough for suggesting the use of this kind
of data structure in discussions that transpired during the early
planning stages of BLIS, and also for suggesting such a perfectly
appropriate name.)
- Added a new API, in frame/base/bli_gks.c, to define a "global kernel
structure" (gks). This data structure and API will allow the caller to
initialize a context with the kernel addresses, blocksizes, and other
information associated with the currently active kernel configuration.
The currently active kernel configuration within the gks cannot be
changed (for now), and is initialized with the traditional cpp macros
that define kernel function names, blocksizes, and the like. However,
in the future, the gks API will be expanded to allow runtime management
of kernels and runtime parameters. The most obvious application of this
new infrastructure is the runtime detection of hardware (and the
implied selection of appropriate kernels). With contexts in place,
kernels may even be "hot swapped" at runtime within the gks. Once
execution enters a level-3 _front() function, the memory allocator will
be reinitialized on-the-fly, if necessary, to accommodate the new
kernels' blocksizes. If another application thread is executing with
another (previously loaded) kernel, it will finish in a deterministic
fashion because its kernel information was loaded into its context
before computation began, and also because the blocks it checked out
from the internal memory pools will be unaffected by the newer threads'
reinitialization of the allocator.
- Reorganized and streamlined the 'ind' directory, which contains much of
the code enabling use of induced methods for complex domain matrix
multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as
those APIs' functionality is now mostly subsumed within the global
kernel structure.
- Updated bli_pool.c to define a new function, bli_pool_reinit_if(),
that will reinitialize a memory pool if the necessary pool block size
has increased.
- Updated bli_mem.c to use bli_pool_reinit_if() instead of
bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed
usage of contexts where appropriate to communicate cache and register
blocksizes to bli_mem_compute_pool_block_sizes().
- Simplified control trees now that much of the information resides in
the context and/or the global kernel structure:
- Removed blocksize object pointers (blksz_t*) fields from all control
tree node definitions and replaced them with blocksize id (bszid_t)
values instead, which may be passed into a context query routine in
order to extract the corresponding blocksize from the given context.
- Removed micro-kernel function pointers (func_t*) fields from all
control tree node definitions. Now, any code that needs these function
pointers can query them from the local context, as identified by a
level-3 micro-kernel id (l3ukr_t), level-1f kernel id, (l1fkr_t), or
level-1v kernel id (l1vkr_t).
- Removed blksz_t object creation and initialization, as well as kernel
function object creation and initialization, from all operation-
specific control tree initialization files (bli_*_cntl.c), since this
information will now live in the gks and, secondarily, in the context.
- Removed blocksize multiples from blksz_t objects. Now, we track
blocksize multiples for each blocksize id (bszid_t) in the context
object.
- Removed the bool_t's that were required when a func_t was initialized.
These bools are meant to allow one to track the micro-kernel's storage
preferences (by rows or columns). This preference is now tracked
separately within the gks and contexts.
- Merged and reorganized many separate-but-related functions into single
files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and
util directories, but has the most obvious effect of allowing BLIS
to compile noticeably faster.
- Reorganized execution paths for level-1v, -1d, -1m, and -2 operations
in an attempt to reduce overhead for memory-bound operations. This
includes removal of default use of object-based variants for level-2
operations. Now, by default, level-2 operations will directly call a
low-level (non-object based) loop over a level-1v or -1f kernel.
- Converted many common query functions in blk_blksz.c (renamed from
bli_blocksize.c) and bli_func.c into cpp macros, now defined in their
respective header files.
- Defined bli_mbool.c API to create and query "multi-bools", or
heterogeneous bool_t's (one for each floating-point datatype), in the
same spirit as blksz_t and func_t.
- Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS
and BLIS_SIMD_SIZE. These values are needed in order to compute a third
new parameter, which may be set indirectly via the aforementioned
macros or directly: BLIS_STACK_BUF_MAX_SIZE. This value is used to
statically allocate memory in macro-kernels and the induced methods'
virtual kernels to be used as temporary space to hold a single
micro-tile. These values are now output by the testsuite. The default
value of BLIS_STACK_BUF_MAX_SIZE is computed as
"2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE".
- Cleaned up top-level 'kernels' directory (for example, renaming the
embarrassingly misleading "avx" and "avx2" directories to "sandybridge"
and "haswell," respectively, and gave more consistent and meaningful
names to many kernel files (as well as updating their interfaces to
conform to the new context-aware kernel APIs).
- Updated the testsuite to query blocksizes from a locally-initialized
context for test modules that need those values: axpyf, dotxf,
dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr.
- Reformatted many function signatures into a standard format that will
more easily facilitate future API-wide changes.
- Updated many "mxn" level-0 macros (ie: those used to inline double loops
for level-1m-like operations on small matrices) in frame/include/level0
to use more obscure local variable names in an effort to avoid variable
shaddowing. (Thanks to Devin Matthews for pointing these gcc warnings,
which are only output using -Wshadow.)
- Added a conj argument to setm, so that its interface now mirrors that
of scalm. The semantic meaning of the conj argument is to optionally
allow implicit conjugation of the scalar prior to being populated into
the object.
- Deprecated all type-aware mixed domain and mixed precision APIs. Note
that this does not preclude supporting mixed types via the object APIs,
where it produces absolutely zero API code bloat.
|