mirror of
https://github.com/amd/blis.git
synced 2026-05-05 06:51:11 +00:00
8f399c89403d5824ba767df1426706cf2d19d0a7
34 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
c84391314d |
Reverted minor temp/wspace changes from b426f9e.
Details: - Added missing license header to bli_pwr9_asm_macros_12x6.h. - Reverted temporary changes to various files in 'test' and 'testsuite' directories. - Moved testsuite/jobscripts into testsuite/old. - Minor whitespace/comment changes across various files. |
||
|
|
b426f9e04e |
POWER9 DGEMM (#355)
Implemented and registered power9 dgemm ukernel. Details: - Implemented 12x6 dgemm microkernel for power9. This microkernel assumes that elements of B have been duplicated/broadcast during the packing step. The microkernel uses a column orientation for its microtile vector registers and thus implements column storage and general stride IO cases. (A row storage IO case via in-register transposition may be added at a future date.) It should be noted that we recommend using this microkernel with gcc and *not* xlc, as issues with the latter cropped up during development, including but not limited to slightly incompatible vector register mnemonics in the GNU extended inline assembly clobber list. |
||
|
|
31c8657f1d |
Added support for pre-broadcast when packing B.
Details:
- Added support for being able to duplicate (broadcast) elements in memory when packing matrix B (i.e., the right-hand operand) in level-3 operations. This turns out to be advantageous for some architectures that can afford the cost of the extra bandwidth and somehow benefit from the pre-broadcast elements (and thus being able to avoid using broadcast-style load instructions on micro-rows of B in the gemm microkernel).
- Support optionally disabling right-side hemm and symm. If this occurs, hemm_r is implemented in terms of hemm_l (and symm_r in terms of symm_l). This is needed when broadcasting during packing because the alternative--supporting the broadcast of B while also allowing matrix B to be Hermitian/symmetric--would be an absolute mess.
- Support alignment factors for packed blocks of A, B, and C separately (as well as for general-purpose buffers). In addition, we support byte offsets from those alignment values (which is different from aligning by align+offset bytes to begin with). The default alignment values are BLIS_PAGE_SIZE in all four cases, with the offset values defaulting to zero.
- Pass pack_t schema into bli_?packm_cxk() so that it can be then passed into the packm kernel, where it will be needed by packm kernels that perform broadcasts of B, since the idea is that we *only* want to broadcast when packing micropanels of B and not A.
- Added definition for variadic bli_cntx_set_l3_vir_ukrs(), which can be used to set custom virtual level-3 microkernels in the cntx_t, which would typically be done in the bli_cntx_init_*() function defined in the subconfiguration of interest.
- Added a "broadcast B" kernel function for use with NP/NR = 12/6, defined in ref_kernels/1m/bli_packm_cxk_bb_ref.c.
- Added gemm, gemmtrsm, and trsm "broadcast B" reference kernels defined in ref_kernels/3/bb. (These kernels have been tested with double real with NP/NR = 12/6.)
- Added #ifndef ... #endif guards around several macro constants defined in frame/include/bli_kernel_macro_defs.h.
- Defined a few "broadcast B" static functions in frame/include/level0/bb for use by "broadcast B"-style packm reference kernels. For now, only the real domain kernels are tested and fully defined.
- Output the alignment and offset values for packed blocks of A and B in the testsuite's "BLIS configuration info" section.
- Comment updates to various files.
- Bumped so_version to 3.0.0.
|
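The pre-broadcast idea in this commit can be pictured with a small standalone sketch. The function name `pack_b_broadcast`, the duplication factor `dup`, and the row-major layout of B are assumptions made for illustration; this is not the BLIS packm kernel, only a minimal instance of duplicating each element of B while packing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a BLIS function): packs a k x nr micropanel of a
   row-major matrix B with leading dimension ldb into p, writing every element
   dup times in a row. A kernel reading p can then use plain vector loads
   instead of broadcast-style loads. p must hold k * nr * dup doubles. */
static void pack_b_broadcast(size_t k, size_t nr, size_t dup,
                             const double *b, size_t ldb, double *p)
{
    assert(dup >= 1);
    for (size_t l = 0; l < k; ++l)            /* each micro-row of B */
        for (size_t j = 0; j < nr; ++j)       /* each element in the row */
            for (size_t d = 0; d < dup; ++d)  /* duplicate it dup times */
                p[(l * nr + j) * dup + d] = b[l * ldb + j];
}
```

The cost is visible in the buffer size: the packed panel is `dup` times larger, which is the extra bandwidth the commit message says some architectures can afford.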
||
|
|
c4cc6fa702 |
New cntx_t blksz "set" functions + misc tweaks.
Details:
- Defined two new static functions in bli_cntx.h:
bli_cntx_set_blksz_def_dt()
bli_cntx_set_blksz_max_dt()
which developers may find convenient when experimenting with different
values of cache blocksizes.
- Updated one- and two-socket multithreaded problem size range and
increment values in test/3/Makefile.
- Changed default to column storage in test/3/test_gemm.c.
- Fixed typo in comment in testsuite/src/test_subm.c.
|
||
|
|
3df84f1b5d |
Minor bugfixes in sup dgemm implementation.
Details:
- Fixed an obscure bug in the bli_dgemmsup_rv_haswell_asm_5x8n() kernel that only affected the beta == 0, column-storage output case. Thanks to the BLAS test drivers for catching this bug.
- Previously, bli_gemmsup_ref_var1n() and _var2m() were returning if k = 0, when the correct action would be to scale by beta (and then return). Thanks to the BLAS test drivers for catching this bug.
- Changed the sup threshold behavior such that the sup implementation only kicks in if a matrix dimension is strictly less than (rather than less than or equal to) the threshold in question.
- Initialize all thresholds to zero (instead of 10) by default in ref_kernels/bli_cntx_ref.c. This, combined with the above change to threshold testing, means that calls to BLIS or BLAS with one or more matrix dimensions of zero will no longer trigger the sup implementation.
- Added disabled debugging output to frame/3/bli_l3_sup.c (for future use, perhaps).
|
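The two behavioral changes above (the strict threshold test and the k = 0 case) can be sketched in standalone C. Both function names are hypothetical and the column-major layout of C is an assumption; this is not the code in frame/3/bli_l3_sup.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical predicate: the sup path is taken only if some dimension is
   strictly below its threshold. With all thresholds at zero, no dimension
   (not even a zero dimension) can satisfy the test. */
static bool sup_is_eligible(size_t m, size_t n, size_t k,
                            size_t mt, size_t nt, size_t kt)
{
    return m < mt || n < nt || k < kt;
}

/* With k == 0 the product A*B is empty, so gemm reduces to C := beta * C.
   C is assumed column-major with leading dimension ldc. */
static void gemm_k_is_zero(size_t m, size_t n, double beta,
                           double *c, size_t ldc)
{
    assert(ldc >= m);
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i)
            /* beta == 0 overwrites C so stale NaN/Inf values do not survive. */
            c[i + j * ldc] = (beta == 0.0) ? 0.0 : beta * c[i + j * ldc];
}
```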
||
|
|
b9c9f03502 |
Implemented gemm on skinny/unpacked matrices.
Details:
- Implemented a new sub-framework within BLIS to support the management
of code and kernels that specifically target matrix problems for which
at least one dimension is deemed to be small, which can result in long
and skinny matrix operands that are ill-suited for the conventional
level-3 implementations in BLIS. The new framework tackles the problem
in two ways. First the stripped-down algorithmic loops forgo the
packing that is famously performed in the classic code path. That is,
the computation is performed by a new family of kernels tailored
specifically for operating on the source matrices as-is (unpacked).
Second, these new kernels will typically (and in the case of haswell
and zen, do in fact) include separate assembly sub-kernels for
handling of edge cases, which helps smooth performance when performing
problems whose m and n dimension are not naturally multiples of the
register blocksizes. In a reference to the sub-framework's purpose of
supporting skinny/unpacked level-3 operations, the "sup" operation
suffix (e.g. gemmsup) is typically used to denote a separate namespace
for related code and kernels. NOTE: Since the sup framework does not
perform any packing, it targets row- and column-stored matrices A, B,
and C. For now, if any matrix has non-unit strides in both dimensions,
the problem is computed by the conventional implementation.
- Implemented the default sup handler as a front-end to two variants.
bli_gemmsup_ref_var2() provides a block-panel variant (in which the
2nd loop around the microkernel iterates over n and the 1st loop
iterates over m), while bli_gemmsup_ref_var1() provides a panel-block
variant (2nd loop over m and 1st loop over n). However, these variants
are not used by default and provided for reference only. Instead, the
default sup handler calls _var2m() and _var1n(), which are similar
to _var2() and _var1(), respectively, except that they defer to the
sup kernel itself to iterate over the m and n dimension, respectively.
In other words, these variants rely not on microkernels, but on
so-called "millikernels" that iterate along m and k, or n and k.
The benefit of using millikernels is a reduction of function call
and related (local integer typecast) overhead as well as the ability
for the kernel to know which micropanel (A or B) will change during
the next iteration of the 1st loop, which allows it to focus its
prefetching on that micropanel. (In _var2m()'s millikernel, the upanel
of A changes while the same upanel of B is reused. In _var1n()'s, the
upanel of B changes while the upanel of A is reused.)
- Added a new configure option, --[en|dis]able-sup-handling, which is
enabled by default. However, the default thresholds at which the
default sup handler is activated are set to zero for each of the m, n,
and k dimensions, which effectively disables the implementation. (The
default sup handler only accepts the problem if at least one dimension
is smaller than or equal to its corresponding threshold. If all
dimensions are larger than their thresholds, the problem is rejected
by the sup front-end and control is passed back to the conventional
implementation, which proceeds normally.)
- Added support to the cntx_t structure to track new fields related to
the sup framework, most notably:
- sup thresholds: the thresholds at which the sup handler is called.
- sup handlers: the address of the function to call to implement
the level-3 skinny/unpacked matrix implementation.
- sup blocksizes: the register and cache blocksizes used by the sup
implementation (which may be the same or different from those used
by the conventional packm-based approach).
- sup kernels: the kernels that the handler will use in implementing
the sup functionality.
- sup kernel prefs: the IO preference of the sup kernels, which may
differ from the preferences of the conventional gemm microkernels'
IO preferences.
- Added a bool_t to the rntm_t structure that indicates whether sup
handling should be enabled/disabled. This allows per-call control
of whether the sup implementation is used, which is useful for test
drivers that wish to switch between the conventional and sup codes
without having to link to different copies of BLIS. The corresponding
accessor functions for this new bool_t are defined in bli_rntm.h.
- Implemented several row-preferential gemmsup kernels in a new
directory, kernels/haswell/3/sup. These kernels include two general
implementation types--'rd' and 'rv'--for the 6x8 base shape, with
two specialized millikernels that embed the 1st loop within the kernel
itself.
- Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference
gemmsup microkernels. NOTE: These microkernels, unlike the current
crop of conventional (pack-based) microkernels, do not use constant
loop bounds. Additionally, their inner loop iterates over the k
dimension.
- Defined new typedef enums:
- stor3_t: captures the effective storage combination of the level-3
problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A
special value of BLIS_XXX is used to denote an arbitrary combination
which, in practice, means that at least one of the operands is
stored according to general stride.
- threshid_t: captures each of the three dimension thresholds.
- Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create()
can be passed "-1, -1" as a lazy request for row storage. (Note that
"0, 0" is still accepted as a lazy request for column storage.)
- Added support for various instructions to bli_x86_asm_macros.h,
including imul, vhaddps/pd, and other instructions related to integer
vectors.
- Disabled the older small matrix handling code inserted by AMD in
bli_gemm_front.c, since the sup framework introduced in this commit
is intended to provide a more generalized solution.
- Added test/sup directory, which contains standalone performance test
drivers, a Makefile, a runme.sh script, and an 'octave' directory
containing scripts compatible with GNU Octave. (They also may work
with matlab, but if not, they are probably close to working.)
- Reinterpret the storage combination string (sc_str) in the various
level-3 testsuite modules (e.g. src/test_gemm.c) so that the order
of each matrix storage char is "cab" rather than "abc".
- Comment updates in level-3 BLAS API wrappers in frame/compat.
|
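The NOTE in this commit says that the sup framework handles only row- and column-stored operands and hands anything with non-unit strides in both dimensions back to the conventional path. A minimal sketch of that check follows, assuming positive row and column strides; the type `stor_kind_t` and both function names are hypothetical, not the BLIS stor3_t machinery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of a single matrix by its strides. */
typedef enum { STOR_ROW, STOR_COL, STOR_GEN } stor_kind_t;

/* rs is the row stride (distance between rows) and cs the column stride.
   Row storage has unit column stride; column storage has unit row stride. */
static stor_kind_t classify_storage(ptrdiff_t rs, ptrdiff_t cs)
{
    assert(rs > 0 && cs > 0); /* assumption of this sketch */
    if (cs == 1) return STOR_ROW;
    if (rs == 1) return STOR_COL;
    return STOR_GEN;          /* non-unit strides in both dimensions */
}

/* The sup path is usable only if none of A, B, C has general stride. */
static bool sup_accepts_storage(ptrdiff_t rs_a, ptrdiff_t cs_a,
                                ptrdiff_t rs_b, ptrdiff_t cs_b,
                                ptrdiff_t rs_c, ptrdiff_t cs_c)
{
    return classify_storage(rs_a, cs_a) != STOR_GEN &&
           classify_storage(rs_b, cs_b) != STOR_GEN &&
           classify_storage(rs_c, cs_c) != STOR_GEN;
}
```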
||
|
|
89cd650e7b |
Use void_fp for function pointers instead of void*.
Change void*-typed function pointers to void_fp.
- Updated all instances of void* variables that store function pointers to variables of a new type, void_fp. Originally, I wanted to define the type of void_fp as "void (*void_fp)( void )"--that is, a pointer to a function with no return value and no arguments. However, once I did this, I realized that gcc complains with incompatible pointer type (-Wincompatible-pointer-types) warnings every time such a pointer is assigned to its final, type-accurate function pointer type. That is, gcc will silently typecast a void* to another defined function pointer type (e.g. dscalv_ker_ft) during an assignment from the former to the latter, but the same statement will trigger a warning when typecasting from a void_fp type. I suspect an explicit typecast is needed in order to avoid the warning, which I'm not willing to insert at this time.
- Added a typedef to bli_type_defs.h defining void_fp as void*, along with a commented-out version of the aborted definition described above. (Note that POSIX requires that void* and function pointers be interchangeable; it is the C standard that does not provide this guarantee.)
- Comment updates to various _oapi.c files.
|
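A small sketch of the arrangement this commit settles on: `void_fp` as an alias for `void*`, with a kernel address stored in such a slot and later called through its real type. The kernel type `scal_ker_ft` and the functions below are hypothetical. Unlike the implicit conversions the commit message describes, this sketch writes the casts explicitly, and the conversions rely on the POSIX guarantee mentioned above rather than on ISO C.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the typedef described in the commit: void_fp is an alias for void*. */
typedef void *void_fp;

/* Hypothetical kernel type and kernel, for illustration only. */
typedef double (*scal_ker_ft)(double alpha, double x);

static double scal_ker(double alpha, double x)
{
    return alpha * x;
}

/* Stores a kernel address in a generic slot, then recovers the
   type-accurate function pointer before calling it. */
static double call_through_slot(double alpha, double x)
{
    void_fp slot = (void_fp)scal_ker;    /* function pointer -> void*  */
    assert(slot != NULL);
    scal_ker_ft ker = (scal_ker_ft)slot; /* void* -> function pointer  */
    return ker(alpha, x);
}
```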
||
|
|
f0dcc8944f |
Add symbol export macro for all functions (#302)
* initial export of blis functions
* Regenerate def file for master
* restore bli_extern_defs exporting for now
|
||
|
|
2f3174330f |
Implemented a pool-based small block allocator.
Details:
- Implemented a sophisticated data structure and set of APIs that track
the small blocks of memory (around 80-100 bytes each) used when
creating nodes for control and thread trees (cntl_t and thrinfo_t) as
well as thread communicators (thrcomm_t). The purpose of the small
block allocator, or sba, is to allow the library to transition into a
runtime state in which it does not perform any calls to malloc() or
free() during normal execution of level-3 operations, regardless of
the threading environment (potentially multiple application threads
as well as multiple BLIS threads). The functionality relies on a new
data structure, apool_t, which is (roughly speaking) a pool of
arrays, where each array element is a pool of small blocks. The outer
pool, which is protected by a mutex, provides separate arrays for each
application thread while the arrays each handle multiple BLIS threads
for any given application thread. The design minimizes the potential
for lock contention, as only concurrent application threads would
need to fight for the apool_t lock, and only if they happen to begin
their level-3 operations at precisely the same time. Thanks to Kiran
Varaganti and AMD for requesting this feature.
- Added a configure option to disable the sba pools, which are enabled
by default; renamed the --[dis|en]able-packbuf-pools option to
--[dis|en]able-pba-pools; and rewrote the --help text associated with
this new option and consolidated it with the --help text for the
option associated with the sba (--[dis|en]able-sba-pools).
- Moved the membrk field from the cntx_t to the rntm_t. We now pass in
a rntm_t* to the bli_membrk_acquire() and _release() APIs, just as we
do for bli_sba_acquire() and _release().
- Replaced all calls to bli_malloc_intl() and bli_free_intl() that are
used for small blocks with calls to bli_sba_acquire(), which takes a
rntm (in addition to the bytes requested), and bli_sba_release().
These latter two functions reduce to the former two when the sba pools
are disabled at configure-time.
- Added rntm_t* arguments to various cntl_t and thrinfo_t functions, as
required by the new usage of bli_sba_acquire() and _release().
- Moved the freeing of "old" blocks (those allocated prior to a change
in the block_size) from bli_membrk_acquire_m() to the implementation
of the pool_t checkout function.
- Miscellaneous improvements to the pool_t API.
- Added a block_size field to the pblk_t.
- Harmonized the way that the trsm_ukr testsuite module performs packing
relative to that of gemmtrsm_ukr, in part to avoid the need to create
a packm control tree node, which now requires a rntm_t that has been
initialized with an sba and membrk.
- Re-enable explicit call bli_finalize() in testsuite so that users who
run the testsuite with memory tracing enabled can check for memory
leaks.
- Manually imported the compact/minor changes from
|
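The innermost layer of the design described in this commit, a pool that recycles small fixed-size blocks so that steady-state execution stops calling malloc(), can be sketched as follows. This is a single-threaded illustration with hypothetical names (`sb_pool_t`, `sb_pool_checkout`, and so on) and a fixed capacity; it omits the outer mutex-protected pool of arrays (apool_t) and is not the BLIS implementation.

```c
#include <assert.h>
#include <stdlib.h>

#define SB_POOL_CAP 8 /* arbitrary capacity for this sketch */

/* Hypothetical pool of equally sized small blocks, kept as a stack. */
typedef struct {
    void  *blocks[SB_POOL_CAP];
    size_t num_free;    /* blocks currently sitting in the pool  */
    size_t block_size;  /* size in bytes of every block          */
    size_t num_mallocs; /* how often malloc() had to be called   */
} sb_pool_t;

static void sb_pool_init(sb_pool_t *p, size_t block_size)
{
    assert(p != NULL && block_size > 0);
    p->num_free = 0;
    p->block_size = block_size;
    p->num_mallocs = 0;
}

/* Reuses a pooled block if one exists; otherwise falls back to malloc(). */
static void *sb_pool_checkout(sb_pool_t *p)
{
    if (p->num_free > 0)
        return p->blocks[--p->num_free];
    p->num_mallocs++;
    return malloc(p->block_size);
}

/* Returns a block to the pool, or frees it if the pool is already full. */
static void sb_pool_checkin(sb_pool_t *p, void *blk)
{
    if (p->num_free < SB_POOL_CAP)
        p->blocks[p->num_free++] = blk;
    else
        free(blk);
}

/* Releases every pooled block back to the system. */
static void sb_pool_finalize(sb_pool_t *p)
{
    while (p->num_free > 0)
        free(p->blocks[--p->num_free]);
}
```

After the first checkout and checkin of a block, later checkouts are served from the pool, which is the "no malloc() during normal execution" state the commit aims for.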
||
|
|
0645f239fb |
Remove UT-Austin from copyright headers' clause 3.
Details:
- Removed explicit reference to The University of Texas at Austin in the
third clause of the license comment blocks of all relevant files and
replaced it with a more all-encompassing "copyright holder(s)".
- Removed duplicate words ("derived") from a few kernels' license
comment blocks.
- Homogenized license comment block in kernels/zen/3/bli_gemm_small.c
with format of all other comment blocks.
|
||
|
|
5fec95b99f |
Implemented mixed-datatype support for gemm.
Details: - Implemented support for gemm where A, B, and C may have different storage datatypes, as well as a computational precision (and implied computation domain) that may be different from the storage precision of either A or B. This results in 128 different combinations, all which are implemented within this commit. (For now, the mixed-datatype functionality is only supported via the object API.) If desired, the mixed-datatype support may be disabled at configure-time. - Added a memory-intensive optimization to certain mixed-datatype cases that requires a single m-by-n matrix be allocated (temporarily) per call to gemm. This optimization aims to avoid the overhead involved in repeatedly updating C with general stride, or updating C after a typecast from the computation precision. This memory optimization may be disabled at configure-time (provided that the mixed-datatype support is enabled in the first place). - Added support for testing mixed-datatype combinations to testsuite. The user may test gemm with mixed domains, precisions, both, or neither. - Added a standalone test driver directory for building and running mixed-datatype performance experiments. - Defined a new variation of castm, castnzm, which operates like castm except that imaginary values are not touched when casting a real operand to a complex operand. (By contrast, in these situations castm sets the imaginary components of the destination matrix to zero.) - Defined bli_obj_imag_is_zero() and substituted calls in lieu of all usages of bli_obj_imag_equals() that tested against BLIS_ZERO, and also simplified the implementation of bli_obj_imag_equals(). - Fixed bad behavior from bli_obj_is_real() and bli_obj_is_complex() when given BLIS_CONSTANT objects. - Disabled dt_on_output field in auxinfo_t structure as well as all accessor functions. Also commented out all usage of accessor functions within macrokernels. 
(Typecasting in the microkernel is still feasible, though probably unrealistic for now given the additional complexity required.) - Use void function pointer type (instead of void*) for storing function pointers in bli_l0_fpa.c. - Added documentation for using gemm with mixed datatypes in docs/MixedDatatypes.md and example code in examples/oapi/11gemm_md.c. - Defined level-1d operation xpbyd and level-1m operation xpbym. - Added xpbym test module to testsuite. - Updated frame/include/bli_x86_asm_macros.h with additional macros (courtesy of Devin Matthews). |
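The difference between castm and castnzm described in this commit can be shown on a flat array. The complex type `dcomplex_sketch_t` and the function `cast_real_to_complex` are hypothetical and operate on vectors rather than BLIS objects; only the treatment of the imaginary parts follows the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical complex type for this sketch. */
typedef struct { double real, imag; } dcomplex_sketch_t;

/* Casts n real values into a complex destination.
   zero_imag == true  : castm-like, imaginary parts are set to zero.
   zero_imag == false : castnzm-like, imaginary parts are left untouched. */
static void cast_real_to_complex(size_t n, const double *x,
                                 dcomplex_sketch_t *y, bool zero_imag)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i) {
        y[i].real = x[i];
        if (zero_imag)
            y[i].imag = 0.0;
    }
}
```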
||
|
|
4fa4cb0734 |
Trivial comment header updates.
Details: - Removed four trailing spaces after "BLIS" that occur in most files' commented-out license headers. - Added UT copyright lines to some files. (These files previously had only AMD copyright lines but were contributed to by both UT and AMD.) - In some files' copyright lines, expanded 'The University of Texas' to 'The University of Texas at Austin'. - Fixed various typos/misspellings in some license headers. |
||
|
|
6074082cd3 |
Fixed bug in bli_cntx_set_packm_ker_dt() implementation.
Details: - Fixed bug in static functions bli_cntx_set_[packm/unpackm]_ker_dt(), which were incorrectly calling bli_cntx_get_[packm/unpackm]_ker_dt() to get the corresponding func_t. |
||
|
|
b7db293323 |
Explicitly typecast return vals in static funcs.
Details: - Added explicit typecasting to various functions (mostly static functions), primarily those in bli_param_macro_defs.h, bli_obj_macro_defs.h, bli_cntx.h, bli_cntl.h, and a few other header files. - This change was prompted by feedback from Jacob Gorm Hansen, who reported that #including "blis.h" from his application caused gcc to output error messages (relating to types being returned mismatching the declared return types) when used via the C++ compiler front-end. This is the first pass of fixes, and we may need to iterate with additional follow-up commits (#233). |
||
|
|
ecbebe7c2e |
Defined rntm_t to relocate cntx_t.thrloop (#235).
Details: - Defined a new struct datatype, rntm_t (runtime), to house the thrloop field of the cntx_t (context). The thrloop array holds the number of ways of parallelism (thread "splits") to extract per level-3 algorithmic loop until those values can be used to create a corresponding node in the thread control tree (thrinfo_t structure), which (for any given level-3 invocation) usually happens by the time the macrokernel is called for the first time. - Relocating the thrloop from the cntx_t remedies a thread-safety issue when invoking level-3 operations from two or more application threads. The race condition existed because the cntx_t, a pointer to which is usually queried from the global kernel structure (gks), is supposed to be read-only. However, the previous code would write to the cntx_t's thrloop field *after* it had been queried, thus violating its read-only status. In practice, this would not cause a problem when a sequential application made a multithreaded call to BLIS, nor when two or more application threads used the same parallelization scheme when calling BLIS, because in either case all application threads would be using the same ways of parallelism for each loop. The true effects of the race condition were limited to situations where two or more application threads used *different* parallelization schemes for any given level-3 call. - In remedying the above race condition, the application or calling library can now specify the parallelization scheme on a per-call basis. All that is required is that the thread encode its request for parallelism into the rntm_t struct prior to passing the address of the rntm_t to one of the expert interfaces of either the typed or object APIs. This allows, for example, one application thread to extract 4-way parallelism from a call to gemm while another application thread requests 2-way parallelism. Or, two threads could each request 4-way parallelism, but from different loops. 
- A rntm_t* parameter has been added to the function signatures of most of the level-3 implementation stack (with the most notable exception being packm) as well as all level-1v, -1d, -1f, -1m, and -2 expert APIs. (A few internal functions gained the rntm_t* parameter even though they currently have no use for it, such as bli_l3_packm().) This required some internal calls to some of those functions to be updated since BLIS was already using those operations internally via the expert interfaces. For situations where a rntm_t object is not available, such as within packm/unpackm implementations, NULL is passed in to the relevant expert interfaces. This is acceptable for now since parallelism is not obtained for non-level-3 operations. - Revamped how global parallelism is encoded. First, the conventional environment variables such as BLIS_NUM_THREADS and BLIS_*_NT are only read once, at library initialization. (Thanks to Nathaniel Smith for suggesting this to avoid repeated calls to getenv(), which can be slow.) Those values are recorded to a global rntm_t object. Public APIs, in bli_thread.c, are still available to get/set these values from the global rntm_t, though now the "set" functions have additional logic to ensure that the values are set in a synchronous manner via a mutex. If/when NULL is passed into an expert API (meaning the user opted to not provide a custom rntm_t), the values from the global rntm_t are copied to a local rntm_t, which is then passed down the function stack. Calling a basic API is equivalent to calling the expert APIs with NULL for the cntx and rntm parameters, which means the semantic behavior of these basic APIs (vis-a-vis multithreading) is unchanged from before. - Renamed bli_cntx_set_thrloop_from_env() to bli_rntm_set_ways_for_op() and reimplemented, with the function now being able to treat the incoming rntm_t in a manner agnostic to its origin--whether it came from the application or is an internal copy of the global rntm_t. 
- Removed various global runtime APIs for setting the number of ways of parallelism for individual loops (e.g. bli_thread_set_*_nt()) as well as the corresponding "get" functions. The new model simplifies these interfaces so that one must either set the total number of threads, OR set all of the ways of parallelism for each loop simultaneously (in a single function call). - Updated sandbox/ref99 according to above changes. - Rewrote/augmented docs/Multithreading.md to document the three methods (and two specific ways within each method) of requesting parallelism in BLIS. - Removed old, disabled code from bli_l3_thrinfo.c. - Whitespace changes to code (e.g. bli_obj.c) and docs/BuildSystem.md. |
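The global-versus-per-call model described in this commit can be sketched with a stand-in struct: global settings are written under a mutex, and an expert-style entry point copies them into a local struct when the caller passes NULL. The type `runtime_sketch_t` and both functions are hypothetical, and taking the lock while copying is an assumption of this sketch (the commit message only says that the "set" functions use a mutex).

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for rntm_t: a thread count plus ways per loop. */
typedef struct {
    int num_threads;
    int ways[5];
} runtime_sketch_t;

static runtime_sketch_t global_rt       = { 1, { 1, 1, 1, 1, 1 } };
static pthread_mutex_t  global_rt_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Global "set" function: the update is serialized by a mutex. */
static void global_set_num_threads(int nt)
{
    assert(nt >= 1);
    pthread_mutex_lock(&global_rt_mutex);
    global_rt.num_threads = nt;
    pthread_mutex_unlock(&global_rt_mutex);
}

/* Expert-style entry point: NULL means "use the global settings", which
   are copied into a local struct so the global object is never modified
   by the operation itself. Returns the thread count that would be used. */
static int expert_op_num_threads(const runtime_sketch_t *user_rt)
{
    runtime_sketch_t local;
    if (user_rt == NULL) {
        pthread_mutex_lock(&global_rt_mutex);
        local = global_rt;
        pthread_mutex_unlock(&global_rt_mutex);
    } else {
        local = *user_rt;
    }
    return local.num_threads;
}
```

Because each call works on its own copy, two application threads can request different degrees of parallelism without touching shared state, which is the race the commit removes.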
||
|
|
87db5c048e |
Changed usage of virtual microkernel slots in cntx.
Details:
- Changed the way virtual microkernels are handled in the context. Previously, there were query routines such as bli_cntx_get_l3_ukr_dt() which returned the native ukernel for a datatype if the method was equal to BLIS_NAT, or the virtual ukernel for that datatype if the method was some other value. Going forward, the context native and virtual ukernel slots will both be initialized to native ukernel function pointers for native execution, and for non-native execution the virtual ukernel pointer will be something else. This allows us to always query the virtual ukernel slot (from within, say, the macrokernel) without needing any logic in the query routine to decide which function pointer (native or virtual) to return. (Essentially, the logic has been shifted to init-time instead of compute-time.) This scheme will also allow generalized virtual ukernels as a way to insert extra logic in between the macrokernel and the native microkernel.
- Initialize native contexts (in bli_cntx_ref.c) with native ukernel function addresses stored to the virtual ukernel slots pursuant to the above policy change.
- Renamed all static functions that were native/virtual-ambiguous, such as bli_cntx_get_l3_ukr_dt() or bli_cntx_l3_ukr_prefers_cols_dt() pursuant to the above policy change. Those routines now use the substring "get_l3_vir_ukr" in their name instead of "get_l3_ukr". All of these functions were static functions defined in bli_cntx.h, and most uses were in level-3 front-ends and macrokernels.
- Deprecated anti_pref bool_t in context, along with related functions such as bli_cntx_l3_ukr_eff_dislikes_storage_of(), now that 1m's panel-block execution is disabled.
|
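The shift from compute-time to init-time logic can be sketched with a two-slot context. All names here (`ukr_ctx_t`, `native_ukr`, `induced_ukr`, and the init functions) are hypothetical, and the integer "kernels" are placeholders; the point is only that the caller always reads the virtual slot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical microkernel signature and context with two slots. */
typedef int (*ukr_ft)(int x);

typedef struct {
    ukr_ft nat; /* native microkernel                              */
    ukr_ft vir; /* virtual microkernel (equals nat for native use) */
} ukr_ctx_t;

static int native_ukr(int x)  { return x + 1; }

/* A virtual kernel adds logic around the native one. */
static int induced_ukr(int x) { return 2 * native_ukr(x); }

/* Native execution: both slots point at the native kernel. */
static void ctx_init_native(ukr_ctx_t *c)
{
    c->nat = native_ukr;
    c->vir = native_ukr;
}

/* Non-native execution: the virtual slot points at something else. */
static void ctx_init_induced(ukr_ctx_t *c)
{
    c->nat = native_ukr;
    c->vir = induced_ukr;
}

/* The caller always queries the virtual slot, with no branch on the method. */
static int macrokernel(const ukr_ctx_t *c, int x)
{
    assert(c != NULL && c->vir != NULL);
    return c->vir(x);
}
```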
||
|
|
f97a86f322 |
Updated setting/querying pack schema (cntx->cntl).
- Query pack schemas in level-3 bli_*_front() functions and store those values in the schema bitfields of the corresponding obj_t's when the cntx's method is not BLIS_NAT. (When method is BLIS_NAT, the default native schemas are stored to the obj_t's.)
- In bli_l3_cntl_create_if(), query the schemas stored to the obj_t's in bli_*_front(), clear the schema bitfields, and pass the queried values into bli_gemm_cntl_create() and bli_trsm_cntl_create().
- Updated APIs for bli_gemm_cntl_create() and bli_trsm_cntl_create() to take schemas for A and B, and use these values to initialize the appropriate control tree nodes. (Also cpp-disabled the panel-block cntl tree creation variant, bli_gemmpb_cntl_create(), as it has not been employed by BLIS in quite some time.)
- Simplified querying of schema in bli_packm_init() thanks to above changes.
- Updated openmp and pthreads definitions of bli_l3_thread_decorator() so that thread-local aliases of matrix operands are guaranteed, even if aliasing is disabled within the internal back-end functions (e.g. bli_gemm_int.c). Also added a comment to bli_thrcomm_single.c explaining why the extra aliasing is not needed there.
- Change bli_gemm() and level-3 friends so that the operation's ind() function is called only if all matrix operands have the same datatype, and only if that datatype is complex. The former condition is needed in preparation for work related to mixed domain operands, while the latter helps with readability, especially for those who don't want to venture into frame/ind.
- Reshuffled arguments in bli_cntx_set_thrloop_from_env() to be consistent with BLIS calling conventions (modified argument(s) are last), and updated all invocations in the level-3 _front() functions.
- Comment updates to bli_cntx_set_thrloop_from_env().
|
||
|
|
962a706a6f |
Updated LICENSE file to mention HP Enterprise.
Details: - Added HP Enterprise to the LICENSE file. Previously, only the source files touched by HPE contained the corresponding copyright notices. (This oversight was unintentional.) - Updated file-level copyright notices to include a comma, to match the formatting used for UT and AMD copyrights. |
||
|
|
4b36e85be9 |
Converted function-like macros to static functions.
Details: - Converted most C preprocessor macros in bli_param_macro_defs.h and bli_obj_macro_defs.h to static functions. - Reshuffled some functions/macros to bli_misc_macro_defs.h and also between bli_param_macro_defs.h and bli_obj_macro_defs.h. - Changed obj_t-initializing macros in bli_type_defs.h to static functions. - Removed some old references to BLIS_TWO and BLIS_MINUS_TWO from bli_constants.h. - Whitespace changes in select files (four spaces to single tab). |
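The commit does not state its motivation, but one commonly cited benefit of replacing function-like macros with static functions is that arguments are evaluated exactly once and are type-checked. The sketch below uses a hypothetical `MAX_MACRO` and `max_func`, neither of which is taken from the BLIS headers, to count how often an argument expression runs in each form.

```c
#include <assert.h>
#include <stddef.h>

/* Function-like macro: each argument may be expanded more than once. */
#define MAX_MACRO(a, b) ((a) > (b) ? (a) : (b))

/* Static function: arguments are evaluated once and have declared types. */
static inline int max_func(int a, int b)
{
    return a > b ? a : b;
}

/* Helper with a visible side effect: counts how often it is called. */
static int next_value(int *calls)
{
    assert(calls != NULL);
    ++*calls;
    return 7;
}

/* Returns how many times next_value() runs when passed to the macro. */
static int evals_with_macro(void)
{
    int calls = 0;
    (void)MAX_MACRO(next_value(&calls), 3);
    return calls;
}

/* Returns how many times next_value() runs when passed to the function. */
static int evals_with_func(void)
{
    int calls = 0;
    (void)max_func(next_value(&calls), 3);
    return calls;
}
```

With the macro, the argument expression runs once for the comparison and again for the selected result; with the function it runs once.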
||
|
|
75d0d1057d |
Renamed various datatype-related macros/functions.
Details:
- Renamed the following macros in bli_obj_macro_defs.h and bli_param_macro_defs.h:
  - bli_obj_datatype() -> bli_obj_dt()
  - bli_obj_target_datatype() -> bli_obj_target_dt()
  - bli_obj_execution_datatype() -> bli_obj_exec_dt()
  - bli_obj_set_datatype() -> bli_obj_set_dt()
  - bli_obj_set_target_datatype() -> bli_obj_set_target_dt()
  - bli_obj_set_execution_datatype() -> bli_obj_set_exec_dt()
  - bli_obj_datatype_proj_to_real() -> bli_obj_dt_proj_to_real()
  - bli_obj_datatype_proj_to_complex() -> bli_obj_dt_proj_to_complex()
  - bli_datatype_proj_to_real() -> bli_dt_proj_to_real()
  - bli_datatype_proj_to_complex() -> bli_dt_proj_to_complex()
- Renamed the following functions in bli_obj.c:
  - bli_datatype_size() -> bli_dt_size()
  - bli_datatype_string() -> bli_dt_string()
  - bli_datatype_union() -> bli_dt_union()
- Removed a pair of old level-1f penryn intrinsics kernels that were no longer in use.
|
||
|
|
513ef4d040 |
Various typecasting fixes, mis-typed enums, etc.
Details:
- Fixed implicit typecasting of conj_t to trans_t in bli_[un]packm_cxk.c.
- Properly typecast integer arguments to match format specifier in various
calls to printf() in bli_l3_thrinfo.c, bli_cntx.c, bli_pool.c, and
bli_util_oapi.c.
- Fixed "unsigned less-than-comparison with zero" checks in bli_check.c,
bli_cntx.h.
- Fixed mis-typed enums in bli_cntx.c (e.g., l1mkr_t that should have been
l1fkr_t or l1vkr_t).
- Fixed instances of opid_t value BLIS_GEMM that should have been l3ukr_t
value BLIS_GEMM_UKR in bli_cntx_ref.c.
- NOTE: These issues were identified via compiler warnings when building
BLIS with clang on a rather old installation of OS X:
$ clang --version
Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn)
Target: x86_64-apple-darwin15.2.0
Thread model: posix
|
||
|
|
21360dd8e2 |
Fixed cntx_t packm query when ker_id > _NUM_PACKM_KERS.
Details: - Fixed a subtle bug in bli_cntx_get_[un]packm_ker_dt() in which the function fails to return NULL when passed a kernel id argument that is equal to or beyond BLIS_NUM_[UN]PACKM_KERS. Instead, the function was attempting to index into the cntx_t's packm kernel array, which resulted in undefined behavior. Thanks to Devangi Parikh for finding this bug. |
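The class of bug fixed here, indexing a kernel table with an id at or beyond its size, and the corrected behavior of returning NULL can be sketched as follows. The enum `pack_ker_id_t`, the struct `ker_table_t`, and the query function are hypothetical stand-ins, not the BLIS definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel ids; NUM_PACK_KERS doubles as the table size. */
typedef enum { PACK_KER_2XK = 0, PACK_KER_4XK, PACK_KER_8XK, NUM_PACK_KERS } pack_ker_id_t;

typedef void (*pack_ker_fp)(void);

typedef struct {
    pack_ker_fp kers[NUM_PACK_KERS];
} ker_table_t;

/* Returns the kernel for id, or NULL if id is at or beyond the table size.
   The range check happens before the array is indexed; comparing as
   unsigned also rejects negative values. */
static pack_ker_fp ker_table_get(const ker_table_t *t, pack_ker_id_t id)
{
    assert(t != NULL);
    if ((unsigned)id >= (unsigned)NUM_PACK_KERS)
        return NULL;
    return t->kers[id];
}

/* Placeholder kernel used to populate the table in examples. */
static void dummy_pack_ker(void) { }
```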
||
|
|
3e4f42a4d2 |
Typecast l1mkr_t enum value prior to comparison.
Details:
- Typecast l1mkr_t enum value in bli_cntx.h to guint_t before testing for
out-of-range value. This is an attempt to pacify a strange warning from
clang on OS X that is seemingly the result of the following compiler
warning flag:
-Wtautological-constant-out-of-range-compare
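The cast described above can be sketched as follows. The enum and function names are hypothetical, and whether the cast actually silences the warning depends on the compiler version; the sketch only shows the range check itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical enum standing in for a kernel id type such as l1mkr_t. */
typedef enum
{
    TOY_KER_A    = 0,
    TOY_KER_B    = 1,
    TOY_NUM_KERS = 2
} toy_kr_t;

/* Convert to an unsigned type before comparing. A single comparison then
   rejects both too-large ids and ids that were negative before conversion,
   since the latter wrap around to large unsigned values. */
static bool toy_kr_is_valid(toy_kr_t id)
{
    return (unsigned int)id < (unsigned int)TOY_NUM_KERS;
}

/* Usage: valid ids pass, the sentinel count value does not. */
void toy_kr_demo(void)
{
    assert(toy_kr_is_valid(TOY_KER_A));
    assert(!toy_kr_is_valid(TOY_NUM_KERS));
}
```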
|
||
|
|
453deb2906 |
Implemented runtime kernel management.
Details: - Reworked the build system around a configuration registry file, named 'config_registry', that identifies valid configuration targets, their constituent sub-configurations, and the kernel sets that are needed by those sub-configurations. The build system now facilitates the building of a single library that can contain kernels and cache/register blocksizes for multiple configurations (microarchitectures). Reference kernels are also built on a per-configuration basis. - Updated the Makefile to use new variables set by configure via the config.mk.in template, such as CONFIG_LIST, KERNEL_LIST, and KCONFIG_MAP, in determining which sub-configurations (CONFIG_LIST) and kernel sets (KERNEL_LIST) are included in the library, and which make_defs.mk files' CFLAGS (KCONFIG_MAP) are used when compiling kernels. - Reorganized 'kernels' directory into a "flat" structure. Renamed kernel functions into a standard format that includes the kernel set name (e.g. 'haswell'). Created a "bli_kernels_<kernelset>.h" file in each kernels sub-directory. These files exist to provide prototypes for the kernels present in those directories. - Reorganized reference kernels into a top-level 'ref_kernels' directory. This directory includes a new source file, bli_cntx_ref.c (compiled on a per-configuration basis), that defines the code needed to initialize a reference context and a context for induced methods for the microarchitecture in question. - Rewrote make_defs.mk files in each configuration so that the compiler variables (e.g. CFLAGS) are "stored" (renamed) on a per-configuration basis. - Modified bli_config.h.in template so that bli_config.h is generated with #defines for the config (family) name, the sub-configurations that are associated with the family, and the kernel sets needed by those sub-configurations.
- Deprecated all kernel-related information in bli_kernel.h and transferred what remains to new header files named "bli_arch_<configname>.h", which are conditionally #included from a new header bli_arch.h. These files are still needed to set library-wide parameters such as custom malloc()/free() functions or SIMD alignment values. - Added bli_cntx_init_<configname>.c files to each configuration directory. The files contain a function, named the same as the file, that initializes a "native" context for a particular configuration (microarchitecture). The idea is that optimized kernels, if available, will be initialized into these contexts. Other fields will retain pointers to reference functions, which will be compiled on a per-configuration basis. These bli_cntx_init_*() functions will be called during the initialization of the global kernel structure. They are thought of as initializing for "native" execution, but they also form the basis for contexts that use induced methods. These functions are prototyped, along with their _ref() and _ind() brethren, by prototype-generating macros in bli_arch.h. - Added a new typedef enum in bli_type_defs.h to define an arch_t, which identifies the various sub-configurations. - Redesigned the global kernel structure (gks) around a 2D array of cntx_t structures (pointers to cntx_t, actually). The first dimension is indexed over arch_t and the inner dimension is the ind_t (induced method) for each microarchitecture. When a microarchitecture (configuration) is "registered" at init-time, the inner array for that configuration in the 2D array is initialized (and allocated, if it hasn't been already). The cntx_t slot for BLIS_NAT is initialized immediately and those for other induced method types are initialized and cached on-demand, as needed. At cntx_t registration, we also store function pointers to cntx_init functions that will initialize (a) "reference" contexts and (b) contexts for use with induced methods. 
We don't cache the full contexts for reference contexts since they are rarely needed. The functions that initialize these two kinds of contexts are generated automatically for each targeted sub-configuration from cpp-templatized code at compile-time. Induced method contexts that need "stage" adjustments can still obtain them via functions in bli_cntx_ind_stage.c. - Added new functions and functionality to bli_cntx.c, such as for setting the level-1f, level-1v, and packm kernels, and for converting a native context into one for executing an induced method. - Moved the checking of register/cache blocksize consistency from being cpp macros in bli_kernel_macro_defs.h to being runtime checks defined in bli_check.c and called from bli_gks_register_cntx() at the time that the global kernel structure's internal context is initialized for a given microarchitecture/configuration. - Deprecated all of the old per-operation bli_*_cntx.c files and removed the previous operation-level cntx_t_init()/_finalize() invocations. Instead, we now query the gks for a suitable context, usually via bli_gks_query_cntx(). - Deprecated support for the 3m2 and 3m3 induced methods. (They required hackery that I was no longer willing to support.) - Consolidated the 1e and 1r packm kernels for any given register blocksize into a single kernel that will branch on the schema and support packing to both formats. - Added the cntx_t* argument to all packm kernel signatures. - Deprecated the local function pointer array in all bli_packm_cxk*.c files and instead obtain the packm kernel from the cntx_t. - Added bli_calloc_intl(), which serves as the calloc-equivalent to bli_malloc_intl(). Useful when we wish to allocate and initialize to zero/NULL. - Converted existing cpp macro functions defined in bli_blksz.h, bli_func.h, bli_cntx.h into static functions. |
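The 2D layout of the global kernel structure described above (outer index over architectures, inner index over induced methods, native slot filled at registration, other slots filled on demand) can be sketched roughly as below. The types and function names are hypothetical simplifications, the sketch is not thread-safe, and it never frees what it allocates; it is not the actual bli_gks.c code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for arch_t, ind_t, and cntx_t. */
typedef enum { TOY_ARCH_HASWELL, TOY_ARCH_GENERIC, TOY_NUM_ARCHS } toy_arch_t;
typedef enum { TOY_1M, TOY_NAT, TOY_NUM_INDS } toy_ind_t;
typedef struct { toy_arch_t arch; toy_ind_t method; } toy_cntx_t;

/* Outer index: architecture. Inner index: induced method. */
static toy_cntx_t **toy_gks[TOY_NUM_ARCHS];

/* Register an architecture: allocate its inner array if that has not been
   done yet, and initialize the native slot immediately. Returns 0 on success. */
static int toy_gks_register(toy_arch_t arch)
{
    if (toy_gks[arch] == NULL) {
        toy_gks[arch] = calloc(TOY_NUM_INDS, sizeof(toy_cntx_t *));
        if (toy_gks[arch] == NULL) return -1;
    }
    if (toy_gks[arch][TOY_NAT] == NULL) {
        toy_cntx_t *nat = malloc(sizeof(*nat));
        if (nat == NULL) return -1;
        nat->arch   = arch;
        nat->method = TOY_NAT;
        toy_gks[arch][TOY_NAT] = nat;
    }
    return 0;
}

/* Query a context. Non-native contexts are derived from the native one on
   first use and cached for later queries. */
static toy_cntx_t *toy_gks_query(toy_arch_t arch, toy_ind_t method)
{
    if (toy_gks[arch] == NULL) return NULL;   /* architecture not registered */
    if (toy_gks[arch][method] == NULL) {
        toy_cntx_t *c = malloc(sizeof(*c));
        if (c == NULL) return NULL;
        *c = *toy_gks[arch][TOY_NAT];          /* start from the native context */
        c->method = method;
        toy_gks[arch][method] = c;
    }
    return toy_gks[arch][method];
}

/* Usage: the induced-method context is created once and then reused. */
void toy_gks_demo(void)
{
    assert(toy_gks_register(TOY_ARCH_GENERIC) == 0);
    toy_cntx_t *first = toy_gks_query(TOY_ARCH_GENERIC, TOY_1M);
    assert(first != NULL && toy_gks_query(TOY_ARCH_GENERIC, TOY_1M) == first);
}
```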
||
|
|
c63980f4ca |
Moved 'family' field from cntx_t to cntl_t.
Details: - Removed the family field inside the cntx_t struct and re-added it to the cntl_t struct. Updated all accessor functions/macros accordingly, as well as all consumers and intermediaries of the family parameter (such as bli_l3_thread_decorator(), bli_l3_direct(), and bli_l3_prune_*()). This change was motivated by the desire to keep the context limited, as much as possible, to information about the computing environment. (The family field, by contrast, is a descriptor about the operation being executed.) - Added additional functions to bli_blksz_*() API. - Added additional functions to bli_cntx_*() API. - Minor updates to bli_func.c, bli_mbool.c. - Removed 'obj' from bli_blksz_*() API names. - Removed 'obj' from bli_cntx_*() API names. - Removed 'obj' from bli_cntl_*(), bli_*_cntl_*() API names. Renamed routines that operate only on a single struct to contain the "_node" suffix to differentiate with those routines that operate on the entire tree. - Added enums for packm and unpackm kernels to bli_type_defs.h. - Removed BLIS_1F and BLIS_VF from bszid_t definition in bli_type_defs.h. They weren't being used and probably never will be. |
||
|
|
1c732d3ddc |
Added 1m-specific APIs for bp, pb gemm algorithms.
Details:
- Defined bli_gemmbp_cntl_create(), bli_gemmpb_cntl_create(), with the
body of bli_gemm_cntl_create() replaced with a call to the former.
- Defined bli_cntl_free_w_thrinfo(), bli_cntl_free_wo_thrinfo(). Now,
bli_cntl_free() can check if the thread parameter is NULL, and if so,
call the latter, and otherwise call the former.
- Defined bli_gemm1mbp_cntx_init(), bli_gemm1mpb_cntx_init(), both in
terms of bli_gemm1mxx_cntx_init(), which behaves the same as
bli_gemm1m_cntx_init() did before, except that an extra bool parameter
(is_pb) is used to support both bp and pb algorithms (including to
support the anti-preference field described below).
- Added support for "anti-preference" in context. The anti_pref field,
when true, will toggle the boolean return value of routines such as
bli_cntx_l3_ukr_eff_prefers_storage_of(), which has the net effect of
causing BLIS to transpose the operation to achieve disagreement (rather
than agreement) between the storage of C and the micro-kernel output
preference. This disagreement is needed for panel-block implementations,
since they induce a transposition of the suboperation immediately before
the macro-kernel is called, which changes the apparent storage of C. For
now, anti-preference is used only with the pb algorithm for 1m (and not
with any other non-1m implementation).
- Defined new functions,
bli_cntx_l3_ukr_eff_prefers_storage_of()
bli_cntx_l3_ukr_eff_dislikes_storage_of()
bli_cntx_l3_nat_ukr_eff_prefers_storage_of()
bli_cntx_l3_nat_ukr_eff_dislikes_storage_of()
which are identical to their non-"eff" (effectively) counterparts except
that they take the anti-preference field of the context into account.
- Explicitly initialize the anti-pref field to FALSE in
bli_gks_cntx_set_l3_nat_ukr_prefs().
- Added bli_gemm_ker_var1.c, which implements a panel-block macro-kernel
in terms of the existing block-panel macro-kernel _ker_var2(). This
technique requires inducing transposes on all operands and swapping
the A and B.
- Changed bli_obj_induce_trans() macro so that pack-related fields are
also changed to reflect the induced transposition.
- Added a temporary hack to bli_l3_3m4m1m_oapi.c that allows us to easily
specify the 1m algorithm (block-panel or panel-block).
- Renamed the following cntx_t-related macros:
bli_cntx_get_pack_schema_a() -> bli_cntx_get_pack_schema_a_block()
bli_cntx_get_pack_schema_b() -> bli_cntx_get_pack_schema_b_panel()
bli_cntx_get_pack_schema_c() -> bli_cntx_get_pack_schema_c_panel()
and updated all instantiations. Also updated the field names in the
cntx_t struct.
- Comment updates.
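The anti-preference toggle described above can be sketched as below. The struct and function names are hypothetical and the context is reduced to two booleans; the point is only that the "eff" queries invert the plain answer when anti_pref is set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified context: one storage preference plus the
   anti-preference flag. This is not the actual cntx_t layout. */
typedef struct
{
    bool ukr_prefers_rows; /* micro-kernel output preference */
    bool anti_pref;        /* when true, invert the reported preference */
} toy_cntx_t;

/* Non-"eff" query: does the micro-kernel prefer the storage of C? */
static bool toy_ukr_prefers_storage_of(const toy_cntx_t *cntx, bool c_is_row_stored)
{
    return cntx->ukr_prefers_rows == c_is_row_stored;
}

/* "eff" query: the same answer, toggled when anti_pref is set. */
static bool toy_ukr_eff_prefers_storage_of(const toy_cntx_t *cntx, bool c_is_row_stored)
{
    return toy_ukr_prefers_storage_of(cntx, c_is_row_stored) != cntx->anti_pref;
}

/* "eff" dislike query: the logical complement of the "eff" prefer query. */
static bool toy_ukr_eff_dislikes_storage_of(const toy_cntx_t *cntx, bool c_is_row_stored)
{
    return !toy_ukr_eff_prefers_storage_of(cntx, c_is_row_stored);
}

/* Usage: with anti_pref set, a row-preferring kernel reports that it
   dislikes row-stored C, which would trigger a transposition upstream. */
void toy_anti_pref_demo(void)
{
    toy_cntx_t pb = { true, true };
    assert(toy_ukr_eff_dislikes_storage_of(&pb, true));
}
```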
|
||
|
|
126482a3b6 |
Implemented the 1m method.
Details: - Implemented the 1m method for inducing complex domain matrix multiplication. 1m support has been added to all level-3 operations, including trsm, and is now the default induced method when native complex domain gemm microkernels are omitted from the configuration. - Updated _cntx_init() operations to take a datatype parameter. This was needed for the corresponding function for 1m (because 1m requires us to choose between column-oriented or row-oriented execution, which requires us to query the context for the storage preference of the gemm microkernel, which requires knowing the datatype) but I decided that it made sense for consistency to add the parameter to all other cntx initialization functions as well, even though those functions don't use the parameter. - Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take a second scalar for each blocksize entry. The semantic meaning of the two scalars now is that the first will scale the default blocksize while the second will scale the maximum blocksize. This allows scaling the two independently, and was needed to support 1m, which requires scaling for a register blocksize but not the register storage blocksize (ie: "packdim") analogue. - Deprecated bli_blksz_reduce_dt_to() and defined two new functions, bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing default and maximum blocksizes to some desired blocksize multiple. These functions are needed in the updated definitions of bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs(). - Added support for the 1e and 1r packing schemas to packm, including 1e/1r packing kernels. - Added a minor optimization to bli_gemm_ker_var2() that allows, under certain circumstances (specifically, real domain beta and row- or column-stored matrix C), the real domain macrokernel and microkernel to be called directly, rather than using the virtual microkernel via the complex domain macrokernel, which carries a slight additional amount of overhead. 
- Added 1m support to the testsuite. - Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified some code in test_gemm.c driver. |
||
|
|
bdc0a264d2 |
Adjusted stride selection of ct in macrokernels.
Details:
- Updated the changes introduced in
|
||
|
|
c05b3862f6 |
Add automatic loop thread assignment.
- Number of threads is determined by BLIS_NUM_THREADS or OMP_NUM_THREADS, but can be overridden by BLIS_XX_NT as before. - Threads are assigned to loops (ic, jc, ir, and jr) automatically by weighted partitioning and heuristics, both of which are tunable via bli_kernel.h. - All level-3 BLAS covered. |
||
|
|
8d55033c96 |
Implemented distributed thrinfo_t management.
Details:
- Implemented Ricardo Magana's distributed thread info/communicator
management. Rather than fully construct the thrinfo_t structures, from
root to leaf, prior to spawning threads, the threads individually
construct their thrinfo_t trees (or, chains), and do so incrementally,
as needed, reusing the same structure nodes during subsequent blocked
variant iterations. This required moving the initial creation of the
thrinfo_t structure (now, the root nodes) from the _front() functions
to the bli_l3_thread_decorator(). The incremental "growing" of the tree
is performed in the internal back-end (ie: _int()) function, and so
mostly invisible. Also, the incremental growth of the thrinfo_t tree is
done as a function of the current and parent control tree nodes (as well
as the parent thrinfo_t node), further reinforcing the parallel
relationship between the two data structures.
- Removed the "inner" communicator from thrinfo_t structure definition,
as well as its id. Changed all APIs accordingly. Renamed
bli_thrinfo_needs_free_comms() to bli_thrinfo_needs_free_comm().
- Defined bli_l3_thrinfo_print_paths(), which prints the information
in an array of thrinfo_t* structure pointers. (Used only as a
debugging/verification tool.)
- Deprecated the following thrinfo_t creation functions:
bli_packm_thrinfo_create()
bli_l3_thrinfo_create()
because they are no longer used. bli_thrinfo_create() is now called
directly when creating thrinfo_t nodes.
|
||
|
|
701b9aa3ff |
Redesigned control tree infrastructure.
Details:
- Altered control tree node struct definitions so that all nodes have the
same struct definition, whose primary fields consist of a blocksize id,
a variant function pointer, a pointer to an optional parameter struct,
and a pointer to a (single) sub-node. This unified control tree type is
now named cntl_t.
- Changed the way control tree nodes are connected, and what computation
they represent, such that, for example, packing operations are now
associated with nodes that are "inline" in the tree, rather than off-
shoot branches. The original tree for the classic Goto gemm algorithm was
expressed (roughly) as:
blk_var2 -> blk_var3 -> blk_var1 -> ker_var2
| |
-> packb -> packa
and now, the same tree would look like:
blk_var2 -> blk_var3 -> packb -> blk_var1 -> packa -> ker_var2
Specifically, the packb and packa nodes perform their respective packing
operations and then recurse (without any loop) to a subproblem. This means
there are now two kinds of level-3 control tree nodes: partitioning and
non-partitioning. The blocked variants are members of the former, because
they iteratively partition off submatrices and perform suboperations on
those partitions, while the packing variants belong to the latter group.
(This change has the effect of allowing greatly simplified initialization
of the nodes, which previously involved setting many unused node fields to
NULL.)
- Changed the way thrinfo_t tree nodes are arranged to mirror the new
connective structure of control trees. That is, packm nodes are no longer
off-shoot branches of the main algorithmic nodes, but rather connected
"inline".
- Simplified control tree creation functions. Partitioning nodes are created
concisely with just a few fields needing initialization. By contrast, the
packing nodes require additional parameters, which are stored in a
packm-specific struct that is tracked via the optional parameters pointer
within the control tree struct. (This parameter struct must always begin
with a uint64_t that contains the byte size of the struct. This allows
us to use a generic function to recursively copy control trees.) gemm,
herk, and trmm control tree creation continues to be consolidated into
a single function, with the operation family being used to select
among the parameter-agnostic macro-kernel wrappers. A single routine,
bli_cntl_free(), is provided to free control trees recursively, whereby
the chief thread within a group releases the blocks associated with
mem_t entries back to the memory broker from which they were acquired.
- Updated internal back-ends, e.g. bli_gemm_int(), to query and call the
function pointer stored in the current control tree node (rather than
index into a local function pointer array). Before being invoked, these
function pointers are first cast to a gemm_voft (for gemm, herk, or trmm
families) or trsm_voft (for trsm family) type, which is defined in
frame/3/bli_l3_var_oft.h.
- Retired herk and trmm internal back-ends, since all execution now flows
through gemm or trsm blocked variants.
- Merged forwards- and backwards-moving variants by querying the direction
from routines as a function of the variant's matrix operands. gemm and
herk always move forward, while trmm and trsm move in a direction that
is dependent on which operand (a or b) is triangular.
- Added functions bli_thread_get_range_mdim(), bli_thread_get_range_ndim(),
each of which takes additional arguments and hides complexity in managing
the difference between the way ranges are computed for the four families
of operations.
- Simplified level-3 blocked variants according to the above changes, so that
the only steps taken are:
1. Query partitioning direction (forwards or backwards).
2. Prune unreferenced regions, if they exist.
3. Determine the thread partitioning sub-ranges.
<begin loop>
4. Determine the partitioning blocksize (passing in the partitioning
direction)
5. Acquire the current iteration's partitions for the matrices affected
by the current variant's partitioning dimension (m, k, n).
6. Call the subproblem.
<end loop>
- Instantiate control trees once per thread, per operation invocation.
(This is a change from the previous regime in which control trees were
treated as stateless objects, initialized with the library, and shared
as read-only objects between threads.) This once-per-thread allocation
is done primarily to allow threads to use the control tree as a place
to cache certain data for use in subsequent loop iterations. Presently,
the only application of this caching is a mem_t entry for the packing
blocks checked out from the memory broker (allocator). If a non-NULL
control tree is passed in by the (expert) user, then the tree is copied
by each thread. This is done in bli_l3_thread_decorator(), in
bli_thrcomm_*.c.
- Added a new field to the context, an opid_t, which tracks the "family"
of the operation being executed. For example, gemm, hemm, and symm are
all part of the gemm family, while herk, syrk, her2k, and syr2k are
all part of the herk family. Knowing the operation's family is necessary
when conditionally executing the internal (beta) scalar reset on
C in blocked variant 3, which is needed for gemm and herk families,
but must not be performed for the trmm family (because beta has only
been applied to the current row-panel of C after the first rank-kc
iteration).
- Reexpressed 3m3 induced method blocked variant in frame/3/gemm/ind
to conform with the new control tree design, and renamed the macro-
kernel codes corresponding to 3m2 and 4m1b.
- Renamed bli_mem.c (and its APIs) to bli_memsys.c, and renamed/relocated
bli_mem_macro_defs.h from frame/include to frame/base/bli_mem.h.
- Renamed/relocated bli_auxinfo_macro_defs.h from frame/include to
frame/base/bli_auxinfo.h.
- Fixed a minor bug whereby the storage-to-ukr-preference matching
optimization in the various level-3 front-ends was not being applied
properly when the context indicated that execution would be via an
induced method. (Before, we always checked the native micro-kernel
corresponding to the datatype being executed, whereas now we check
the native micro-kernel corresponding to the datatype's real projection,
since that is the micro-kernel that is actually used by induced methods.)
- Added an option to the testsuite to skip the testing of native level-3
complex implementations. Previously, it was always tested, provided that
the c/z datatypes were enabled. However, some configurations use
reference micro-kernels for complex datatypes, and testing these
implementations can slow down the testsuite considerably.
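The unified node layout and the generic recursive copy mentioned above (made possible by the parameter struct beginning with its own byte size) can be sketched as follows. Field and function names are hypothetical, and the node is simplified; this is not the actual cntl_t definition.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified node; field names are illustrative only. */
typedef struct toy_cntl_s
{
    int                 bszid;            /* blocksize id */
    void              (*var_func)(void);  /* variant function pointer */
    void               *params;           /* optional; begins with a uint64_t size */
    struct toy_cntl_s  *sub_node;         /* single child node */
} toy_cntl_t;

/* Example parameter struct for a packing node. */
typedef struct
{
    uint64_t size;         /* must be first: byte size of this struct */
    int      pack_schema;
} toy_packm_params_t;

/* Free a chain of nodes, including any parameter structs. */
static void toy_cntl_free(toy_cntl_t *node)
{
    if (node == NULL) return;
    toy_cntl_free(node->sub_node);
    free(node->params);
    free(node);
}

/* Copy a chain without knowing the concrete parameter type: the leading
   size field says how many bytes of params to duplicate. */
static toy_cntl_t *toy_cntl_copy(const toy_cntl_t *src)
{
    if (src == NULL) return NULL;
    toy_cntl_t *dst = malloc(sizeof(*dst));
    if (dst == NULL) return NULL;
    *dst = *src;
    dst->params   = NULL;
    dst->sub_node = NULL;
    if (src->params != NULL) {
        uint64_t size;
        memcpy(&size, src->params, sizeof(size));
        dst->params = malloc((size_t)size);
        if (dst->params == NULL) { free(dst); return NULL; }
        memcpy(dst->params, src->params, (size_t)size);
    }
    if (src->sub_node != NULL) {
        dst->sub_node = toy_cntl_copy(src->sub_node);
        if (dst->sub_node == NULL) { toy_cntl_free(dst); return NULL; }
    }
    return dst;
}

/* Usage: copy a two-node chain (packing node -> kernel node), then free it. */
void toy_cntl_demo(void)
{
    toy_packm_params_t p = { sizeof(toy_packm_params_t), 3 };
    toy_cntl_t ker   = { 0, NULL, NULL, NULL };
    toy_cntl_t packb = { 1, NULL, &p, &ker };
    toy_cntl_t *copy = toy_cntl_copy(&packb);
    assert(copy != NULL && copy->sub_node != NULL);
    toy_cntl_free(copy);
}
```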
|
||
|
|
a017062fdf |
Integrated "memory broker" (membrk_t) abstraction.
Details:
- Integrated a patch originally authored and submitted by Ricardo Magana
of HP Enterprise. The changeset inserts use of a new object type, membrk_t,
(memory broker) that allows multiple sets of memory pools on, for example,
separate NUMA nodes, each of which has a separate memory space.
- Added membrk field to cntx_t and defined corresponding accessor macros.
- Added membrk field to mem_t object and defined corresponding accessor macros.
- Created new bli_membrk.c file, which contains the new memory broker API,
including:
bli_membrk_init(), bli_membrk_finalize()
bli_membrk_acquire_[mv](), bli_membrk_release(),
bli_membrk_init_pools(), bli_membrk_reinit_pools(),
bli_membrk_finalize_pools(),
bli_membrk_pool_size()
- In bli_mem.c, changed function calls to
bli_mem_init_pools() -> bli_membrk_init()
bli_mem_reinit_pools() -> bli_membrk_reinit()
bli_mem_finalize_pools() -> bli_membrk_finalize()
- In bli_packv_init.c, bli_packm_init.c, changed function calls to:
bli_mem_acquire_[mv]() -> bli_membrk_acquire_[mv]()
bli_mem_release() -> bli_membrk_release()
- Added bli_mutex.c and related files to frame/thread. These files define
abstract mutexes (locks) and corresponding APIs for pthreads, openmp, or
single-threaded execution. This new API is employed within functions
such as bli_membrk_acquire_[mv]() and bli_membrk_release().
|
||
|
|
dd0ab1d93f |
Converted some bli_cntx query functions to macros.
Details: - Commented out several datatype-aware query functions (those ending in _dt) from bli_cntx.c, as well as their prototypes in bli_cntx.h, and added equivalent cpp query macros to bli_cntx.h. - Added 'bli_config.h' to .gitignore. |
||
|
|
537a1f4f85 |
Implemented runtime contexts and reorganized code.
Details:
- Retrofitted a new data structure, known as a context, into virtually
all internal APIs for computational operations in BLIS. The structure
is now present within the type-aware APIs, as well as many supporting
utility functions that require information stored in the context. User-
level object APIs were unaffected and continue to be "context-free,"
however, these APIs were duplicated/mirrored so that "context-aware"
APIs now also exist, differentiated with an "_ex" suffix (for "expert").
These new context-aware object APIs (along with the lower-level, type-
aware, BLAS-like APIs) contain the address of a context as a last
parameter, after all other operands. Contexts, or specifically, cntx_t
object pointers, are passed all the way down the function stack into
the kernels and allow the code at any level to query information about
the runtime, such as kernel addresses and blocksizes, in a thread-
friendly manner--that is, one that allows thread-safety, even if the
original source of the information stored in the context changes at
run-time (see next bullet for more on this "original source" of info).
(Special thanks go to Lee Killough for suggesting the use of this kind
of data structure in discussions that transpired during the early
planning stages of BLIS, and also for suggesting such a perfectly
appropriate name.)
- Added a new API, in frame/base/bli_gks.c, to define a "global kernel
structure" (gks). This data structure and API will allow the caller to
initialize a context with the kernel addresses, blocksizes, and other
information associated with the currently active kernel configuration.
The currently active kernel configuration within the gks cannot be
changed (for now), and is initialized with the traditional cpp macros
that define kernel function names, blocksizes, and the like. However,
in the future, the gks API will be expanded to allow runtime management
of kernels and runtime parameters. The most obvious application of this
new infrastructure is the runtime detection of hardware (and the
implied selection of appropriate kernels). With contexts in place,
kernels may even be "hot swapped" at runtime within the gks. Once
execution enters a level-3 _front() function, the memory allocator will
be reinitialized on-the-fly, if necessary, to accommodate the new
kernels' blocksizes. If another application thread is executing with
another (previously loaded) kernel, it will finish in a deterministic
fashion because its kernel information was loaded into its context
before computation began, and also because the blocks it checked out
from the internal memory pools will be unaffected by the newer threads'
reinitialization of the allocator.
- Reorganized and streamlined the 'ind' directory, which contains much of
the code enabling use of induced methods for complex domain matrix
multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as
those APIs' functionality is now mostly subsumed within the global
kernel structure.
- Updated bli_pool.c to define a new function, bli_pool_reinit_if(),
that will reinitialize a memory pool if the necessary pool block size
has increased.
- Updated bli_mem.c to use bli_pool_reinit_if() instead of
bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed
usage of contexts where appropriate to communicate cache and register
blocksizes to bli_mem_compute_pool_block_sizes().
- Simplified control trees now that much of the information resides in
the context and/or the global kernel structure:
- Removed blocksize object pointers (blksz_t*) fields from all control
tree node definitions and replaced them with blocksize id (bszid_t)
values instead, which may be passed into a context query routine in
order to extract the corresponding blocksize from the given context.
- Removed micro-kernel function pointers (func_t*) fields from all
control tree node definitions. Now, any code that needs these function
pointers can query them from the local context, as identified by a
level-3 micro-kernel id (l3ukr_t), level-1f kernel id, (l1fkr_t), or
level-1v kernel id (l1vkr_t).
- Removed blksz_t object creation and initialization, as well as kernel
function object creation and initialization, from all operation-
specific control tree initialization files (bli_*_cntl.c), since this
information will now live in the gks and, secondarily, in the context.
- Removed blocksize multiples from blksz_t objects. Now, we track
blocksize multiples for each blocksize id (bszid_t) in the context
object.
- Removed the bool_t's that were required when a func_t was initialized.
These bools are meant to allow one to track the micro-kernel's storage
preferences (by rows or columns). This preference is now tracked
separately within the gks and contexts.
- Merged and reorganized many separate-but-related functions into single
files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and
util directories, but has the most obvious effect of allowing BLIS
to compile noticeably faster.
- Reorganized execution paths for level-1v, -1d, -1m, and -2 operations
in an attempt to reduce overhead for memory-bound operations. This
includes removal of default use of object-based variants for level-2
operations. Now, by default, level-2 operations will directly call a
low-level (non-object based) loop over a level-1v or -1f kernel.
- Converted many common query functions in bli_blksz.c (renamed from
bli_blocksize.c) and bli_func.c into cpp macros, now defined in their
respective header files.
- Defined bli_mbool.c API to create and query "multi-bools", or
heterogeneous bool_t's (one for each floating-point datatype), in the
same spirit as blksz_t and func_t.
- Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS
and BLIS_SIMD_SIZE. These values are needed in order to compute a third
new parameter, which may be set indirectly via the aforementioned
macros or directly: BLIS_STACK_BUF_MAX_SIZE. This value is used to
statically allocate memory in macro-kernels and the induced methods'
virtual kernels to be used as temporary space to hold a single
micro-tile. These values are now output by the testsuite. The default
value of BLIS_STACK_BUF_MAX_SIZE is computed as
"2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE".
- Cleaned up top-level 'kernels' directory (for example, renaming the
embarrassingly misleading "avx" and "avx2" directories to "sandybridge"
and "haswell," respectively) and gave more consistent and meaningful
names to many kernel files (as well as updating their interfaces to
conform to the new context-aware kernel APIs).
- Updated the testsuite to query blocksizes from a locally-initialized
context for test modules that need those values: axpyf, dotxf,
dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr.
- Reformatted many function signatures into a standard format that will
more easily facilitate future API-wide changes.
- Updated many "mxn" level-0 macros (ie: those used to inline double loops
for level-1m-like operations on small matrices) in frame/include/level0
to use more obscure local variable names in an effort to avoid variable
shadowing. (Thanks to Devin Matthews for pointing out these gcc warnings,
which are only output using -Wshadow.)
- Added a conj argument to setm, so that its interface now mirrors that
of scalm. The semantic meaning of the conj argument is to optionally
allow implicit conjugation of the scalar prior to being populated into
the object.
- Deprecated all type-aware mixed domain and mixed precision APIs. Note
that this does not preclude supporting mixed types via the object APIs,
where it produces absolutely zero API code bloat.
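Two ideas from the list above, querying a blocksize from a context by id and sizing the temporary micro-tile buffer as "2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE", can be sketched together as below. The TOY_ names, the register count, and the register size are hypothetical values chosen for illustration, not those of any real configuration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hardware parameters; real values are set per configuration. */
#define TOY_SIMD_NUM_REGISTERS 16
#define TOY_SIMD_SIZE          32   /* bytes per SIMD register */

/* Default temporary buffer size, following the formula quoted above. */
#define TOY_STACK_BUF_MAX_SIZE (2 * TOY_SIMD_NUM_REGISTERS * TOY_SIMD_SIZE)

/* Hypothetical blocksize ids and a context that stores one value per id. */
typedef enum { TOY_MR, TOY_NR, TOY_NUM_BSZIDS } toy_bszid_t;
typedef struct { size_t blkszs[TOY_NUM_BSZIDS]; } toy_cntx_t;

/* Look up a blocksize by id in the context passed in by the caller. */
static size_t toy_cntx_get_blksz(toy_bszid_t id, const toy_cntx_t *cntx)
{
    return cntx->blkszs[id];
}

/* Return 1 if a single MR x NR micro-tile of doubles fits in the
   statically sized temporary buffer, and 0 otherwise. */
static int toy_microtile_fits(const toy_cntx_t *cntx)
{
    size_t mr = toy_cntx_get_blksz(TOY_MR, cntx);
    size_t nr = toy_cntx_get_blksz(TOY_NR, cntx);
    return mr * nr * sizeof(double) <= (size_t)TOY_STACK_BUF_MAX_SIZE;
}

/* Usage: a 6x8 tile of doubles (384 bytes) fits in the 1024-byte buffer. */
void toy_stack_buf_demo(void)
{
    toy_cntx_t cntx = { { 6, 8 } };
    assert(toy_microtile_fits(&cntx));
}
```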
|