mirror of
https://github.com/amd/blis.git
synced 2026-05-19 07:58:58 +00:00
810e90ee806510c57504f0cf8eeaf608d38bd9dd
24 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
00e14cb6d8 |
Replaced use of bool_t type with C99 bool.
Details: - Textually replaced nearly all non-comment instances of bool_t with the C99 bool type. A few remaining instances, such as those in the files bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being used not for boolean purposes but to index into an array. - This commit constitutes the third phase of a transition toward using C99's bool instead of bool_t, which was raised in issue #420. The first phase, which cleaned up various typecasts in preparation for using bool as the basis for bool_t (instead of gint_t), was implemented by commit |
||
|
|
72f6ed0637 |
Declare/define static functions via BLIS_INLINE.
Details:
- Updated all static function definitions to use the cpp macro
BLIS_INLINE instead of the static keyword. This allows blis.h to
use a different keyword (inline) to define these functions when
compiling with C++, which might otherwise trigger "defined but
not used" warning messages. Thanks to Giorgos Margaritis for
reporting this issue and Devin Matthews for suggesting the fix.
- Updated the following files, which are used by configure's
hardware auto-detection facility, to unconditionally #define
BLIS_INLINE to the static keyword (since we know BLIS will be
compiled with C, not C++):
build/detect/config/config_detect.c
frame/base/bli_arch.c
frame/base/bli_cpuid.c
- CREDITS file update.
|
||
|
|
13806ba3b0 |
This check in has changes w.r.t Copyright information, which is changed to (start year) - 2019
Change-Id: Ide3c8f7172210b8d3538d3c36e88634ab1ba9041 |
||
|
|
0645f239fb |
Remove UT-Austin from copyright headers' clause 3.
Details:
- Removed explicit reference to The University of Texas at Austin in the
third clause of the license comment blocks of all relevant files and
replaced it with a more all-encompassing "copyright holder(s)".
- Removed duplicate words ("derived") from a few kernels' license
comment blocks.
- Homogenized license comment block in kernels/zen/3/bli_gemm_small.c
with format of all other comment blocks.
|
||
|
|
375eb30b0a |
Added mixed-precision support to 1m method.
Details: - Lifted the constraint that 1m only be used when all operands' storage datatypes (along with the computation datatype) are equal. Now, 1m may be used as long as all operands are stored in the complex domain. This change largely consisted of adding the ability to pack to 1e and 1r formats from one precision to another. It also required adding logic for handling complex values of alpha to bli_packm_blk_var1_md() (similar to the logic in bli_packm_blk_var1()). - Fixed a bug in several virtual microkernels (bli_gemm_md_c2r_ref.c, bli_gemm1m_ref.c, and bli_gemmtrsm1m_ref.c) that resulted in the wrong ukernel output preference field being read. Previously, the preference for the native complex ukernel was being read instead of the pref for the native real domain ukernel. This bug would not manifest if the preference for the native complex ukernel happened to be equal to that of the native real ukernel. - Added support for testing mixed-precision 1m execution via the gemm module of the testsuite. - Tweaked/simplified bli_gemm_front() and bli_gemm_md.c so that pack schemas are always read from the context, rather than trying to sometimes embed them directly to the A and B objects. (They are still embedded, but now uniformly only after reading the schemas from the context.) - Redefined cpp macro bli_l3_ind_recast_1m_params() as a static function and renamed to bli_gemm_ind_recast_1m_params() (since gemm is the only consumer). - Added 1m optimization logic (via bli_gemm_ind_recast_1m_params()) to bli_gemm_ker_var2_md(). - Added explicit handling for beta == 1 and beta == 0 in the reference gemm1m virtual microkernel in ref_kernels/ind/bli_gemm1m_ref.c. - Rewrote various level-0 macro defs, including axpyris, axpbyris, scal2ris, and xpbyris (and their conjugating counterparts) to explicitly support three operand types and updated invocations to xpbyris in bli_gemmtrsm1m_ref.c. - Query and use the storage datatype of the packed object instead of the storage datatype of the source object in bli_packm_blk_var1(). - Relocated and renamed frame/ind/misc/bli_l3_ind_opt.h to frame/3/gemm/ind/bli_gemm_ind_opt.h. - Various whitespace/comment updates. |
||
|
|
c92762ecdc |
Added option of slab or rr partitioning in jr/ir.
Details:
- Updated existing macrokernel function names and definitions to
explicitly use slab assignment of micropanels to threads, then created
duplicate versions of macrokernels that explicitly use round-robin
assignment instead of slab. NOTE: As in
|
||
|
|
ac18949a4b |
Multithreading optimizations for l3 macrokernels.
Details:
- Adjusted the method by which micropanels are assigned to threads in
the 2nd (jr) and 1st (ir) loops around the microkernel to (mostly)
employ contiguous "slab" partitioning rather than interleaved (round
robin) partitioning. The new partitioning schemes and related details
for specific families of operations are listed below:
- gemm: slab partitioning.
- herk: slab partitioning for region corresponding to non-triangular
region of C; round robin partitioning for triangular region.
- trmm: slab partitioning for region corresponding to non-triangular
region of B; round robin partitioning for triangular region.
(NOTE: This affects both left- and right-side macrokernels:
trmm_ll, trmm_lu, trmm_rl, trmm_ru.)
- trsm: slab partitioning.
(NOTE: This only affects only left-side macrokernels trsm_ll,
trsm_lu; right-side macrokernels were not touched.)
Also note that the previous macrokernels were preserved inside of
the 'other' directory of each operation family directory (e.g.
frame/3/gemm/other, frame/3/herk/other, etc).
- Updated gemm macrokernel in sandbox/ref99 in light of above changes
and fixed a stale function pointer type in blx_gemm_int.c
(gemm_voft -> gemm_var_oft).
- Added standalone test drivers in test/3m4m for herk, trmm, and trsm
and minor changes to test/3m4m/Makefile.
- Updated the arguments and definitions of bli_*_get_next_[ab]_upanel()
and bli_trmm_?_?r_my_iter() macros defined in bli_l3_thrinfo.h.
- Renamed bli_thread_get_range*() APIs to bli_thread_range*().
|
||
|
|
4fa4cb0734 |
Trivial comment header updates.
Details: - Removed four trailing spaces after "BLIS" that occurs in most files' commented-out license headers. - Added UT copyright lines to some files. (These files previously had only AMD copyright lines but were contributed to by both UT and AMD.) - In some files' copyright lines, expanded 'The University of Texas' to 'The University of Texas at Austin'. - Fixed various typos/misspellings in some license headers. |
||
|
|
ecbebe7c2e |
Defined rntm_t to relocate cntx_t.thrloop (#235).
Details: - Defined a new struct datatype, rntm_t (runtime), to house the thrloop field of the cntx_t (context). The thrloop array holds the number of ways of parallelism (thread "splits") to extract per level-3 algorithmic loop until those values can be used to create a corresponding node in the thread control tree (thrinfo_t structure), which (for any given level-3 invocation) usually happens by the time the macrokernel is called for the first time. - Relocating the thrloop from the cntx_t remedies a thread-safety issue when invoking level-3 operations from two or more application threads. The race condition existed because the cntx_t, a pointer to which is usually queried from the global kernel structure (gks), is supposed to be a read-only. However, the previous code would write to the cntx_t's thrloop field *after* it had been queried, thus violating its read-only status. In practice, this would not cause a problem when a sequential application made a multithreaded call to BLIS, nor when two or more application threads used the same parallelization scheme when calling BLIS, because in either case all application theads would be using the same ways of parallelism for each loop. The true effects of the race condition were limited to situations where two or more application theads used *different* parallelization schemes for any given level-3 call. - In remedying the above race condition, the application or calling library can now specify the parallelization scheme on a per-call basis. All that is required is that the thread encode its request for parallelism into the rntm_t struct prior to passing the address of the rntm_t to one of the expert interfaces of either the typed or object APIs. This allows, for example, one application thread to extract 4-way parallelism from a call to gemm while another application thread requests 2-way parallelism. Or, two threads could each request 4-way parallelism, but from different loops. - A rntm_t* parameter has been added to the function signatures of most of the level-3 implementation stack (with the most notable exception being packm) as well as all level-1v, -1d, -1f, -1m, and -2 expert APIs. (A few internal functions gained the rntm_t* parameter even though they currently have no use for it, such as bli_l3_packm().) This required some internal calls to some of those functions to be updated since BLIS was already using those operations internally via the expert interfaces. For situations where a rntm_t object is not available, such as within packm/unpackm implementations, NULL is passed in to the relevant expert interfaces. This is acceptable for now since parallelism is not obtained for non-level-3 operations. - Revamped how global parallelism is encoded. First, the conventional environment variables such as BLIS_NUM_THREADS and BLIS_*_NT are only read once, at library initialization. (Thanks to Nathaniel Smith for suggesting this to avoid repeated calls getenv(), which can be slow.) Those values are recorded to a global rntm_t object. Public APIs, in bli_thread.c, are still available to get/set these values from the global rntm_t, though now the "set" functions have additional logic to ensure that the values are set in a synchronous manner via a mutex. If/when NULL is passed into an expert API (meaning the user opted to not provide a custom rntm_t), the values from the global rntm_t are copied to a local rntm_t, which is then passed down the function stack. Calling a basic API is equivalent to calling the expert APIs with NULL for the cntx and rntm parameters, which means the semantic behavior of these basic APIs (vis-a-vis multithreading) is unchanged from before. - Renamed bli_cntx_set_thrloop_from_env() to bli_rntm_set_ways_for_op() and reimplemented, with the function now being able to treat the incoming rntm_t in a manner agnostic to its origin--whether it came from the application or is an internal copy of the global rntm_t. - Removed various global runtime APIs for setting the number of ways of parallelism for individual loops (e.g. bli_thread_set_*_nt()) as well as the corresponding "get" functions. The new model simplifies these interfaces so that one must either set the total number of threads, OR set all of the ways of parallelism for each loop simultaneously (in a single function call). - Updated sandbox/ref99 according to above changes. - Rewrote/augmented docs/Multithreading.md to document the three methods (and two specific ways within each method) of requesting parallelism in BLIS. - Removed old, disabled code from bli_l3_thrinfo.c. - Whitespace changes to code (e.g. bli_obj.c) and docs/BuildSystem.md. |
||
|
|
87db5c048e |
Changed usage of virtual microkernel slots in cntx.
Details: - Changed the way virtual microkernels are handled in the context. Previously, there were query routines such as bli_cntx_get_l3_ukr_dt() which returned the native ukernel for a datatype if the method was equal to BLIS_NAT, or the virtual ukernel for that datatype if the method was some other value. Going forward, the context native and virtual ukernel slots will both be initialized to native ukernel function pointers for native execution, and for non-native execution the virtual ukernel pointer will be something else. This allows us to always query the virtual ukernel slot (from within, say, the macrokernel) without needing any logic in the query routine to decide which function pointer (native or virtual) to return. (Essentially, the logic has been shifted to init-time instead of compute-time.) This scheme will also allow generalized virtual ukernels as a way to insert extra logic in between the macrokernel and the native microkernel. - Initialize native contexts (in bli_cntx_ref.c) with native ukernel function addresses stored to the virtual ukernel slots pursuant to the above policy change. - Renamed all static functions that were native/virtual-ambiguous, such as bli_cntx_get_l3_ukr_dt() or bli_cntx_l3_ukr_prefers_cols_dt() pursuant to the above polilcy change. Those routines now use the substring "get_l3_vir_ukr" in their name instead of "get_l3_ukr". All of these functions were static functions defined in bli_cntx.h, and most uses were in level-3 front-ends and macrokernels. - Deprecated anti_pref bool_t in context, along with related functions such as bli_cntx_l3_ukr_eff_dislikes_storage_of(), now that 1m's panel-block execution is disabled. |
||
|
|
9588625c43 |
Renamed "next micropanel" macros in _l3_thrinfo.h.
Details: - Renamed several macros defined in bli_l3_thrinfo.h designed to compute the values of a_next and b_next to insert into an auxinfo_t struct in level-3 macrokernels. (Previously, the macros did not use a bli_ prefix.) - Updated instances of above macro usage within various macrokernels. |
||
|
|
4b36e85be9 |
Converted function-like macros to static functions.
Details: - Converted most C preprocessor macros in bli_param_macro_defs.h and bli_obj_macro_defs.h to static functions. - Reshuffled some functions/macros to bli_misc_macro_defs.h and also between bli_param_macro_defs.h and bli_obj_macro_defs.h. - Changed obj_t-initializing macros in bli_type_defs.h to static functions. - Removed some old references to BLIS_TWO and BLIS_MINUS_TWO from bli_constants.h. - Whitespace changes in select files (four spaces to single tab). |
||
|
|
75d0d1057d |
Renamed various datatype-related macros/functions.
Details: - Renamed the following macros in bli_obj_macro_defs.h and bli_param_macro_defs.h: - bli_obj_datatype() -> bli_obj_dt() - bli_obj_target_datatype() -> bli_obj_target_dt() - bli_obj_execution_datatype() -> bli_obj_exec_dt() - bli_obj_set_datatype() -> bli_obj_set_dt() - bli_obj_set_target_datatype() -> bli_obj_set_target_dt() - bli_obj_set_execution_datatype() -> bli_obj_set_exec_dt() - bli_obj_datatype_proj_to_real() -> bli_obj_dt_proj_to_real() - bli_obj_datatype_proj_to_complex() -> bli_obj_dt_proj_to_complex() - bli_datatype_proj_to_real() -> bli_dt_proj_to_real() - bli_datatype_proj_to_complex() -> bli_dt_proj_to_complex() - Renamed the following functions in bli_obj.c: - bli_datatype_size() -> bli_dt_size() - bli_datatype_string() -> bli_dt_string() - bli_datatype_union() -> bli_dt_union() - Removed a pair of old level-1f penryn intrinsics kernels that were no longer in use. |
||
|
|
39be59f2a8 |
Replaced several macros with static function APIs.
Details: - Reimplemented several sets of get/set-style preprocessor macros with static functions, including those in the following frame/base headers: auxinfo, cntl, mbool, mem, membrk, opid, and pool. A few headers in frame/thread were touched as well: mutex_*, thrcomm, and thrinfo. |
||
|
|
453deb2906 |
Implemented runtime kernel management.
Details: - Reworked the build system around a configuration registry file, named config_registry', that identifies valid configuration targets, their constituent sub-configurations, and the kernel sets that are needed by those sub-configurations. The build system now facilitates the building of a single library that can contains kernels and cache/register blocksizes for multiple configurations (microarchitectures). Reference kernels are also built on a per-configuration basis. - Updated the Makefile to use new variables set by configure via the config.mk.in template, such as CONFIG_LIST, KERNEL_LIST, and KCONFIG_MAP, in determining which sub-configurations (CONFIG_LIST) and kernel sets (KERNEL_LIST) are included in the library, and which make_defs.mk files' CFLAGS (KCONFIG_MAP) are used when compiling kernels. - Reorganized 'kernels' directory into a "flat" structure. Renamed kernel functions into a standard format that includes the kernel set name (e.g. 'haswell'). Created a "bli_kernels_<kernelset>.h" file in each kernels sub-directory. These files exist to provide prototypes for the kernels present in those directories. - Reorganized reference kernels into a top-level 'ref_kernels' directory. This directory includes a new source file, bli_cntx_ref.c (compiled on a per-configuration basis), that defines the code needed to initialize a reference context and a context for induced methods for the microarchitecture in question. - Rewrote make_defs.mk files in each configuration so that the compiler variables (e.g. CFLAGS) are "stored" (renamed) on a per-configuration basis. - Modified bli_config.h.in template so that bli_config.h is generated with #defines for the config (family) name, the sub-configurations that are associated with the family, and the kernel sets needed by those sub-configurations. - Deprecated all kernel-related information in bli_kernel.h and transferred what remains to new header files named "bli_arch_<configname>.h", which are conditionally #included from a new header bli_arch.h. These files are still needed to set library-wide parameters such as custom malloc()/free() functions or SIMD alignment values. - Added bli_cntx_init_<configname>.c files to each configuration directory. The files contain a function, named the same as the file, that initializes a "native" context for a particular configuration (microarchitecture). The idea is that optimized kernels, if available, will be initialized into these contexts. Other fields will retain pointers to reference functions, which will be compiled on a per-configuration basis. These bli_cntx_init_*() functions will be called during the initialization of the global kernel structure. They are thought of as initializing for "native" execution, but they also form the basis for contexts that use induced methods. These functions are prototyped, along with their _ref() and _ind() brethren, by prototype-generating macros in bli_arch.h. - Added a new typedef enum in bli_type_defs.h to define an arch_t, which identifies the various sub-configurations. - Redesigned the global kernel structure (gks) around a 2D array of cntx_t structures (pointers to cntx_t, actually). The first dimension is indexed over arch_t and the inner dimension is the ind_t (induced method) for each microarchitecture. When a microarchitecture (configuration) is "registered" at init-time, the inner array for that configuration in the 2D array is initialized (and allocated, if it hasn't been already). The cntx_t slot for BLIS_NAT is initialized immediately and those for other induced method types are initialized and cached on-demand, as needed. At cntx_t registration, we also store function pointers to cntx_init functions that will initialize (a) "reference" contexts and (b) contexts for use with induced methods. We don't cache the full contexts for reference contexts since they are rarely needed. The functions that initialize these two kinds of contexts are generated automatically for each targeted sub-configuration from cpp-templatized code at compile-time. Induced method contexts that need "stage" adjustments can still obtain them via functions in bli_cntx_ind_stage.c. - Added new functions and functionality to bli_cntx.c, such as for setting the level-1f, level-1v, and packm kernels, and for converting a native context into one for executing an induced method. - Moved the checking of register/cache blocksize consistency from being cpp macros in bli_kernel_macro_defs.h to being runtime checks defined in bli_check.c and called from bli_gks_register_cntx() at the time that the global kernel structure's internal context is initialized for a given microarchitecture/configuration. - Deprecated all of the old per-operation bli_*_cntx.c files and removed the previous operation-level cntx_t_init()/_finalize() invocations. Instead, we now query the gks for a suitable context, usually via bli_gks_query_cntx(). - Deprecated support for the 3m2 and 3m3 induced methods. (They required hackery that I was no longer willing to support.) - Consolidated the 1e and 1r packm kernels for any given register blocksize into a single kernel that will branch on the schema and support packing to both formats. - Added the cntx_t* argument to all packm kernel signatures. - Deprecated the local function pointer array in all bli_packm_cxk*.c files and instead obtain the packm kernel from the cntx_t. - Added bli_calloc_intl(), which serves as the calloc-equivalent to to bli_malloc_intl(). Useful when we wish to allocate and initialize to zero/NULL. - Converted existing cpp macro functions defined in bli_blksz.h, bli_func.h, bli_cntx.h into static functions. |
||
|
|
bdc0a264d2 |
Adjusted stride selection of ct in macrokernels.
Details:
- Updated the changes introduced in
|
||
|
|
35509818cb |
Added, moved some thread barriers.
Details: - Removed thread barriers from the end of the loop bodies of bli_gemm_blk_var1(), bli_gemm_blk_var2(), bli_trsm_blk_var1(), and bli_trsm_blk_var2(). - Moved the thread barrier at the end of bli_packm_int() to the end of bli_l3_packm(), and added missing barriers to that function. - Removed the no longer necessary (and now incorrect) ochief guard in bli_gemm3m3_packa() on the bli_obj_scalar_reset() on C. - Thanks to Tyler Smith for help with these changes. |
||
|
|
701b9aa3ff |
Redesigned control tree infrastructure.
Details:
- Altered control tree node struct definitions so that all nodes have the
same struct definition, whose primary fields consist of a blocksize id,
a variant function pointer, a pointer to an optional parameter struct,
and a pointer to a (single) sub-node. This unified control tree type is
now named cntl_t.
- Changed the way control tree nodes are connected, and what computation
they represent, such that, for example, packing operations are now
associated with nodes that are "inline" in the tree, rather than off-
shoot braches. The original tree for the classic Goto gemm algorithm was
expressed (roughly) as:
blk_var2 -> blk_var3 -> blk_var1 -> ker_var2
| |
-> packb -> packa
and now, the same tree would look like:
blk_var2 -> blk_var3 -> packb -> blk_var1 -> packa -> ker_var2
Specifically, the packb and packa nodes perform their respective packing
operations and then recurse (without any loop) to a subproblem. This means
there are now two kinds of level-3 control tree nodes: partitioning and
non-partitioning. The blocked variants are members of the former, because
they iteratively partition off submatrices and perform suboperations on
those partitions, while the packing variants belong to the latter group.
(This change has the effect of allowing greatly simplified initialization
of the nodes, which previously involved setting many unused node fields to
NULL.)
- Changed the way thrinfo_t tree nodes are arranged to mirror the new
connective structure of control trees. That is, packm nodes are no longer
off-shoot branches of the main algorithmic nodes, but rather connected
"inline".
- Simplified control tree creation functions. Partitioning nodes are created
concisely with just a few fields needing initialization. By contrast, the
packing nodes require additional parameters, which are stored in a
packm-specific struct that is tracked via the optional parameters pointer
within the control tree struct. (This parameter struct must always begin
with a uint64_t that contains the byte size of the struct. This allows
us to use a generic function to recursively copy control trees.) gemm,
herk, and trmm control tree creation continues to be consolidated into
a single function, with the operation family being used to select
among the parameter-agnostic macro-kernel wrappers. A single routine,
bli_cntl_free(), is provided to free control trees recursively, whereby
the chief thread within a groups release the blocks associated with
mem_t entries back to the memory broker from which they were acquired.
- Updated internal back-ends, e.g. bli_gemm_int(), to query and call the
function pointer stored in the current control tree node (rather than
index into a local function pointer array). Before being invoked, these
function pointers are first cast to a gemm_voft (for gemm, herk, or trmm
families) or trsm_voft (for trsm family) type, which is defined in
frame/3/bli_l3_var_oft.h.
- Retired herk and trmm internal back-ends, since all execution now flows
through gemm or trsm blocked variants.
- Merged forwards- and backwards-moving variants by querying the direction
from routines as a function of the variant's matrix operands. gemm and
herk always move forward, while trmm and trsm move in a direction that
is dependent on which operand (a or b) is triangular.
- Added functions bli_thread_get_range_mdim(), bli_thread_get_range_ndim(),
each of which takes additional arguments and hides complexity in managing
the difference between the way ranges are computed for the four families
of operations.
- Simplified level-3 blocked variants according to the above changes, so that
the only steps taken are:
1. Query partitioning direction (forwards or backwards).
2. Prune unreferenced regions, if they exist.
3. Determine the thread partitioning sub-ranges.
<begin loop>
4. Determine the partitioning blocksize (passing in the partitioning
direction)
5. Acquire the curren iteration's partitions for the matrices affected
by the current variants's partitioning dimension (m, k, n).
6. Call the subproblem.
<end loop>
- Instantiate control trees once per thread, per operation invocation.
(This is a change from the previous regime in which control trees were
treated as stateless objects, initialized with the library, and shared
as read-only objects between threads.) This once-per-thread allocation
is done primarily to allow threads to use the control tree as as place
to cache certain data for use in subsequent loop iterations. Presently,
the only application of this caching is a mem_t entry for the packing
blocks checked out from the memory broker (allocator). If a non-NULL
control tree is passed in by the (expert) user, then the tree is copied
by each thread. This is done in bli_l3_thread_decorator(), in
bli_thrcomm_*.c.
- Added a new field to the context, and opid_t which tracks the "family"
of the operation being executed. For example, gemm, hemm, and symm are
all part of the gemm family, while herk, syrk, her2k, and syr2k are
all part of the herk family. Knowing the operation's family is necessary
when conditionally executing the internal (beta) scalar reset on on
C in blocked variant 3, which is needed for gemm and herk families,
but must not be performed for the trmm family (because beta has only
been applied to the current row-panel of C after the first rank-kc
iteration).
- Reexpressed 3m3 induced method blocked variant in frame/3/gemm/ind
to comform with the new control tree design, and renamed the macro-
kernel codes corresponding to 3m2 and 4m1b.
- Renamed bli_mem.c (and its APIs) to bli_memsys.c, and renamed/relocated
bli_mem_macro_defs.h from frame/include to frame/base/bli_mem.h.
- Renamed/relocated bli_auxinfo_macro_defs.h from frame/include to
frame/base/bli_auxinfo.h.
- Fixed a minor bug whereby the storage-to-ukr-preference matching
optimization in the various level-3 front-ends was not being applied
properly when the context indicated that execution would be via an
induced method. (Before, we always checked the native micro-kernel
corresponding to the datatype being executed, whereas now we check
the native micro-kernel corresponding to the datatype's real projection,
since that is the micro-kernel that is actually used by induced methods.
- Added an option to the testsuite to skip the testing of native level-3
complex implementations. Previously, it was always tested, provided that
the c/z datatypes were enabled. However, some configurations use
reference micro-kernels for complex datatypes, and testing these
implementations can slow down the testsuite considerably.
|
||
|
|
31def12e26 |
First phase of control tree redesign.
Details: - These changes constitute the first set of changes in preparation to revamping the structure and use of control trees in BLIS. Modifications in this commit don't affect the control tree code yet, but rather lay the groundwork. - Defined wrappers for the following functions, where the the wrappers each take a direction parameter of a new enumerated type (BLIS_BWD or BLIS_FWD), dir_t, and executes the correct underlying function. - bli_acquire_mpart_*() and _vpart_*() - bli_*_determine_kc_[fb]() - bli_thread_get_range_*() and bli_thread_get_range_weighted_*() - Consolidated all 'f' (forwards-moving) and 'b' (backwards-moving) blocked variants for trmm and trsm, and renamed gemm and herk variants accordingly. The direction is now queried via routines such as bli_trmm_direct(), which deterines the direction from the implied side and uplo parameters. For gemm and herk, it is uncondtionally BLIS_FWD. - Defined wrappers to parameter-specific macrokernels for herk, trmm, and trsm, e.g. bli_trmm_xx_ker_var2(), that execute the correct underlying macrokernel based on the implied parameters. The same logic used to choose the dir_t in _direct() functions is used here. - Simplified the function pointer arrays in _int() functions given the consolidation and dir_t querying mentioned above. - Function signature (whitespace) reformatting for various functions. - Removed old code in various 'old' directories. |
||
|
|
096895c5d5 |
Reorganized code, APIs related to multithreading.
Details:
- Reorganized code and renamed files defining APIs related to multithreading.
All code that is not specific to a particular operation is now located in a
new directory: frame/thread. Code is now organized, roughly, by the
namespace to which it belongs (see below).
- Consolidated all operation-specific *_thrinfo_t object types into a single
thrinfo_t object type. Operation-specific level-3 *_thrinfo_t APIs were
also consolidated, leaving bli_l3_thrinfo_*() and bli_packm_thrinfo_*()
functions (aside from a few general purpose bli_thrinfo_*() functions).
- Renamed thread_comm_t object type to thrcomm_t.
- Renamed many of the routines and functions (and macros) for multithreading.
We now have the following API namespaces:
- bli_thrinfo_*(): functions related to thrinfo_t objects
- bli_thrcomm_*(): functions related to thrcomm_t objects.
- bli_thread_*(): general-purpose functions, such as initialization,
finalization, and computing ranges. (For now, some macros, such as
bli_thread_[io]broadcast() and bli_thread_[io]barrier() use the
bli_thread_ namespace prefix, even though bli_thrinfo_ may be more
appropriate.)
- Renamed thread-related macros so that they use a bli_ prefix.
- Renamed control tree-related macros so that they use a bli_ prefix (to be
consistent with the thread-related macros that were also renamed).
- Removed #undef BLIS_SIMD_ALIGN_SIZE from dunnington's bli_kernel.h. This
#undef was a temporary fix to some macro defaults which were being applied
in the wrong order, which was recently fixed.
|
||
|
|
537a1f4f85 |
Implemented runtime contexts and reorganized code.
Details:
- Retrofitted a new data structure, known as a context, into virtually
all internal APIs for computational operations in BLIS. The structure
is now present within the type-aware APIs, as well as many supporting
utility functions that require information stored in the context. User-
level object APIs were unaffected and continue to be "context-free,"
however, these APIs were duplicated/mirrored so that "context-aware"
APIs now also exist, differentiated with an "_ex" suffix (for "expert").
These new context-aware object APIs (along with the lower-level, type-
aware, BLAS-like APIs) contain the the address of a context as a last
parameter, after all other operands. Contexts, or specifically, cntx_t
object pointers, are passed all the way down the function stack into
the kernels and allow the code at any level to query information about
the runtime, such as kernel addresses and blocksizes, in a thread-
friendly manner--that is, one that allows thread-safety, even if the
original source of the information stored in the context changes at
run-time; see next bullet for more on this "original source" of info).
(Special thanks go to Lee Killough for suggesting the use of this kind
of data structure in discussions that transpired during the early
planning stages of BLIS, and also for suggesting such a perfectly
appropriate name.)
- Added a new API, in frame/base/bli_gks.c, to define a "global kernel
structure" (gks). This data structure and API will allow the caller to
initialize a context with the kernel addresses, blocksizes, and other
information associated with the currently active kernel configuration.
The currently active kernel configuration within the gks cannot be
changed (for now), and is initialized with the traditional cpp macros
that define kernel function names, blocksizes, and the like. However,
in the future, the gks API will be expanded to allow runtime management
of kernels and runtime parameters. The most obvious application of this
new infrastructure is the runtime detection of hardware (and the
implied selection of appropriate kernels). With contexts in place,
kernels may even be "hot swapped" at runtime within the gks. Once
execution enters a level-3 _front() function, the memory allocator will
be reinitialized on-the-fly, if necessary, to accommodate the new
kernels' blocksizes. If another application thread is executing with
another (previously loaded) kernel, it will finish in a deterministic
fashion because its kernel information was loaded into its context
before computation began, and also because the blocks it checked out
from the internal memory pools will be unaffected by the newer threads'
reinitialization of the allocator.
- Reorganized and streamlined the 'ind' directory, which contains much of
the code enabling use of induced methods for complex domain matrix
multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as
those APIs' functionality is now mostly subsumed within the global
kernel structure.
- Updated bli_pool.c to define a new function, bli_pool_reinit_if(),
that will reinitialize a memory pool if the necessary pool block size
has increased.
- Updated bli_mem.c to use bli_pool_reinit_if() instead of
bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed
usage of contexts where appropriate to communicate cache and register
blocksizes to bli_mem_compute_pool_block_sizes().
- Simplified control trees now that much of the information resides in
the context and/or the global kernel structure:
- Removed blocksize object pointers (blksz_t*) fields from all control
tree node definitions and replaced them with blocksize id (bszid_t)
values instead, which may be passed into a context query routine in
order to extract the corresponding blocksize from the given context.
- Removed micro-kernel function pointers (func_t*) fields from all
control tree node definitions. Now, any code that needs these function
pointers can query them from the local context, as identified by a
level-3 micro-kernel id (l3ukr_t), level-1f kernel id, (l1fkr_t), or
level-1v kernel id (l1vkr_t).
- Removed blksz_t object creation and initialization, as well as kernel
function object creation and initialization, from all operation-
specific control tree initialization files (bli_*_cntl.c), since this
information will now live in the gks and, secondarily, in the context.
- Removed blocksize multiples from blksz_t objects. Now, we track
blocksize multiples for each blocksize id (bszid_t) in the context
object.
- Removed the bool_t's that were required when a func_t was initialized.
These bools are meant to allow one to track the micro-kernel's storage
preferences (by rows or columns). This preference is now tracked
separately within the gks and contexts.
- Merged and reorganized many separate-but-related functions into single
files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and
util directories, but has the most obvious effect of allowing BLIS
to compile noticeably faster.
- Reorganized execution paths for level-1v, -1d, -1m, and -2 operations
in an attempt to reduce overhead for memory-bound operations. This
includes removal of default use of object-based variants for level-2
operations. Now, by default, level-2 operations will directly call a
low-level (non-object based) loop over a level-1v or -1f kernel.
- Converted many common query functions in blk_blksz.c (renamed from
bli_blocksize.c) and bli_func.c into cpp macros, now defined in their
respective header files.
- Defined bli_mbool.c API to create and query "multi-bools", or
heterogeneous bool_t's (one for each floating-point datatype), in the
same spirit as blksz_t and func_t.
- Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS
and BLIS_SIMD_SIZE. These values are needed in order to compute a third
new parameter, which may be set indirectly via the aforementioned
macros or directly: BLIS_STACK_BUF_MAX_SIZE. This value is used to
statically allocate memory in macro-kernels and the induced methods'
virtual kernels to be used as temporary space to hold a single
micro-tile. These values are now output by the testsuite. The default
value of BLIS_STACK_BUF_MAX_SIZE is computed as
"2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE".
- Cleaned up top-level 'kernels' directory (for example, renaming the
embarrassingly misleading "avx" and "avx2" directories to "sandybridge"
and "haswell," respectively, and gave more consistent and meaningful
names to many kernel files (as well as updating their interfaces to
conform to the new context-aware kernel APIs).
- Updated the testsuite to query blocksizes from a locally-initialized
context for test modules that need those values: axpyf, dotxf,
dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr.
- Reformatted many function signatures into a standard format that will
more easily facilitate future API-wide changes.
- Updated many "mxn" level-0 macros (ie: those used to inline double loops
for level-1m-like operations on small matrices) in frame/include/level0
to use more obscure local variable names in an effort to avoid variable
shaddowing. (Thanks to Devin Matthews for pointing these gcc warnings,
which are only output using -Wshadow.)
- Added a conj argument to setm, so that its interface now mirrors that
of scalm. The semantic meaning of the conj argument is to optionally
allow implicit conjugation of the scalar prior to being populated into
the object.
- Deprecated all type-aware mixed domain and mixed precision APIs. Note
that this does not preclude supporting mixed types via the object APIs,
where it produces absolutely zero API code bloat.
|
||
|
|
e2e9d64a63 |
Load balance thread ranges for arbitrary diagonals.
Details:
- Expanded/updated interface for bli_get_range_weighted() and
bli_get_range() so that the direction of movement is specified in the
function name (e.g. bli_get_range_l2r(), bli_get_range_weighted_t2b())
and also so that the object being partitioned is passed instead of an
uplo parameter. Updated invocations in level-3 blocked variants, as
appropriate.
- (Re)implemented bli_get_range_*() and bli_get_range_weighted_*() to
carefully take into account the location of the diagonal when computing
ranges so that the area of each subpartition (which, in all present
level-3 operations, is proportional to the amount of computation
engendered) is as equal as possible.
- Added calls to a new class of routines to all non-gemm level-3 blocked
variants:
bli_<oper>_prune_unref_mparts_[mnk]()
where <oper> is herk, trmm, or trsm and [mnk] is chosen based on which
dimension is being partitioned. These routines call a more basic
routine, bli_prune_unref_mparts(), to prune unreferenced/unstored
regions from matrices and simultaneously adjust other matrices which
share the same dimension accordingly.
- Simplified herk_blk_var2f, trmm_blk_var1f/b as a result of more the
new pruning routines.
- Fixed incorrect blocking factors passed into bli_get_range_*() in
bli_trsm_blk_var[12][fb].c
- Added a new test driver in test/thread_ranges that can exercise the new
bli_get_range_*() and bli_get_range_weighted_*() under a range of
conditions.
- Reimplemented m and n fields of obj_t as elements in a "dim"
array field so that dimensions could be queried via index constant
(e.g. BLIS_M, BLIS_N). Adjusted/added query and modification
macros accordingly.
- Defined mdim_t type to enumerate BLIS_M and BLIS_N indexing values.
- Added bli_round() macro, which calls C math library function round(),
and bli_round_to_mult(), which rounds a value to the nearest multiple
of some other value.
- Added miscellaneous pruning- and mdim_t-related macros.
- Renamed bli_obj_row_offset(), bli_obj_col_offset() macros to
bli_obj_row_off(), bli_obj_col_off().
|
||
|
|
ee129c6b02 |
Fixed bugs in _get_range(), _get_range_weighted().
Details: - Fixed some bugs that only manifested in multithreaded instances of some (non-gemm) level-3 operations. The bugs were related to invalid allocation of "edge" cases to thread subpartitions. (Here, we define an "edge" case to be one where the dimension being partitioned for parallelism is not a whole multiple of whatever register blocksize is needed in that dimension.) In BLIS, we always require edge cases to be part of the bottom, right, or bottom-right subpartitions. (This is so that zero-padding only has to happen at the bottom, right, or bottom-right edges of micro-panels.) The previous implementations of bli_get_range() and _get_range_weighted() did not adhere to this implicit policy and thus produced bad ranges for some combinations of operation, parameter cases, problem sizes, and n-way parallelism. - As part of the above fix, the functions bli_get_range() and _get_range_weighted() have been renamed to use _l2r, _r2l, _t2b, and _b2t suffixes, similar to the partitioning functions. This is an easy way to make sure that the variants are calling the right version of each function. The function signatures have also been changed slightly. - Comment/whitespace updates. - Removed unnecessary '/' from macros in bli_obj_macro_defs.h. |
||
|
|
26a4b8f6f9 |
Implemented 3m2, 3m3 induced algorithms (gemm only).
Details: - Defined a new "3ms" (separated 3m) pack schema and added appropriate support in packm_init(), packm_blk_var2(). - Generalized packm_struc_cxk_3mi to take the imaginary stride (is_p) as an argument instead of computing it locally. Exception: for trmm, is_p must be computed locally, since it changes for triangular packed matrices. Also exposed is_p in interface to dt-specific packm_blk_var2 (and _var1, even though it does not use imaginary stride). - Renamed many functions/variables from _3mi to _3mis to indicate that they work for either interleaved or separated 3m pack schemas. - Generalized gemm and herk macro-kernels to pass in imaginary stride rather than compute them locally. - Added support for 3m2 and 3m3 algorithms to frame/ind, including 3m2- and 3m3-specific virtual micro-kernels. - Added special gemm macro-kernels to support 3m2 and 3m3. - Added support for 3m2 and 3m3 to testsuite. - Corrected the type of the panel dimension (pd_) in various macro- kernels from inc_t to dim_t. - Renamed many functions defined in bli_blocksize.c. - Moved most induced-related macro defs from frame/include to frame/ind/include. - Updated the _ukernel.c files so that the micro-kernel function pointers are obtained from the func_t objects rather than the cpp macros that define the function names. - Updated test/3m4m driver, Makefile, and run script. |