amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-13 10:35:38 +00:00

Author	SHA1	Message	Date
Field G. Van Zee	00e14cb6d8	Replaced use of bool_t type with C99 bool. Details: - Textually replaced nearly all non-comment instances of bool_t with the C99 bool type. A few remaining instances, such as those in the files bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being used not for boolean purposes but to index into an array. - This commit constitutes the third phase of a transition toward using C99's bool instead of bool_t, which was raised in issue #420. The first phase, which cleaned up various typecasts in preparation for using bool as the basis for bool_t (instead of gint_t), was implemented by commit `a69a4d7`. The second phase, which redefined the bool_t typedef in terms of bool (from gint_t), was implemented by commit `2c554c2`.	2020-07-29 14:24:34 -05:00
Field G. Van Zee	a69a4d7e2f	Cleaned up bool_t usage and various typecasts. Details: - Fixed various typecasts in frame/base/bli_cntx.h, frame/base/bli_mbool.h, frame/base/bli_rntm.h, frame/include/bli_misc_macro_defs.h, frame/include/bli_obj_macro_defs.h, and frame/include/bli_param_macro_defs.h that were missing or being done improperly/incompletely. For example, many return values were being typecast as (bool_t)x && y rather than (bool_t)(x && y). Thankfully, none of these deficiencies had manifested as actual bugs at the time of this commit. - Changed the return type of bli_env_get_var() from dim_t to gint_t. This reflects the fact that bli_env_get_var() needs to be able to return a signed integer, and even though dim_t is currently defined as a signed integer, it does not intuitively appear to necessarily be signed by inspection (i.e., an integer named "dim_t" for matrix "dimension"). Also, updated use of bli_env_get_var() within bli_pack.c to reflect the changed return type. - Redefined type of thrcomm_t.barrier_sense field from bool_t to gint_t and added comments to the bli_thrcomm_*.h files that will explain a planned replacement of bool_t with C99's bool type. - Note: These changes are being made to facilitate the substitution of 'bool' for 'bool_t', which will eliminate the namespace conflict with arm_sve.h as reported in issue #420. This commit implements the first phase of that transition. Thanks to RuQing Xu for reporting this issue. - CREDITS file update.	2020-07-22 16:13:09 -05:00
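The typecast issue described in `a69a4d7` comes down to operator precedence: a cast binds more tightly than `&&`, so `(bool_t)x && y` converts only `x`. The following is a minimal sketch, not code from BLIS. It uses a hypothetical 8-bit type, `narrow_bool_t`, purely so that the difference becomes observable; the commit itself notes that no actual bugs had manifested.

```c
#include <stdint.h>

/* Hypothetical narrow boolean type, chosen only to make the effect visible.
   It is not a BLIS type. */
typedef uint8_t narrow_bool_t;

/* The cast applies to x alone; the && is evaluated afterwards on the
   already-truncated value. */
static int cast_operand_only( int x, int y )
{
    return ( narrow_bool_t )x && y;
}

/* The logical expression is evaluated first, and its 0/1 result is then
   converted to the narrow type. */
static narrow_bool_t cast_whole_expression( int x, int y )
{
    return ( narrow_bool_t )( x && y );
}
```

With `x = 256` and `y = 1`, the first form truncates 256 to 0 before the `&&` and yields 0, while the second form yields 1.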
Field G. Van Zee	72f6ed0637	Declare/define static functions via BLIS_INLINE. Details: - Updated all static function definitions to use the cpp macro BLIS_INLINE instead of the static keyword. This allows blis.h to use a different keyword (inline) to define these functions when compiling with C++, which might otherwise trigger "defined but not used" warning messages. Thanks to Giorgos Margaritis for reporting this issue and Devin Matthews for suggesting the fix. - Updated the following files, which are used by configure's hardware auto-detection facility, to unconditionally #define BLIS_INLINE to the static keyword (since we know BLIS will be compiled with C, not C++): build/detect/config/config_detect.c frame/base/bli_arch.c frame/base/bli_cpuid.c - CREDITS file update.	2020-07-03 17:55:54 -05:00
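The idea behind `72f6ed0` can be sketched as a macro that expands to `static` under C and to `inline` under C++. This is an illustration of the approach the message describes, not the definition in blis.h, which may differ. The helper `example_is_even()` is hypothetical.

```c
/* Sketch only: choose the keyword based on the language being compiled. */
#ifdef __cplusplus
  #define BLIS_INLINE inline
#else
  #define BLIS_INLINE static
#endif

/* Hypothetical helper (not a BLIS function) declared through the macro. */
BLIS_INLINE int example_is_even( int x )
{
    return ( x % 2 ) == 0;
}
```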
Field G. Van Zee	e29b1f9706	Fixed failing testsuite gemmtrsm_ukr for power9. Details: - Added code that fixes false failures in the gemmtrsm_ukr module of the testsuite. The tests were failing because the computation (bli_gemv()) that performs the numerical check was not able to properly traverse the matrix operands bx1 and b11 that are views into the micropanel of B, which has duplicated/broadcast elements under the power9 subconfig. (For example, a micropanel of B with duplication factor of 2 needs to use a column stride of 2; previously, the column stride was being interpreted as 1.) - Defined separate bli_obj_set_row_stride() and bli_obj_set_col_stride() static functions in bli_obj_macro_defs.h. (Previously, only the function bli_obj_set_strides() was defined. Amazing to think that we got this far without these former functions.) - Updated/expounded upon comments.	2019-11-05 17:15:19 -06:00
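The stride problem fixed in `e29b1f9` can be shown with a small buffer. The sketch below assumes a hypothetical 2x2 logical panel stored row by row with each element duplicated twice. The names `example_panel` and `example_elem_at()` are illustrative only, and the actual packed layout on power9 may differ.

```c
#include <stddef.h>

/* Hypothetical logical panel [[1, 2], [3, 4]], stored row by row with every
   element duplicated twice (duplication factor 2). */
static const double example_panel[ 8 ] = { 1, 1, 2, 2, 3, 3, 4, 4 };

/* Read logical element (i, j) using explicit row and column strides. */
static double example_elem_at( const double* buf, size_t i, size_t j,
                               size_t rs, size_t cs )
{
    return buf[ i * rs + j * cs ];
}
```

With a row stride of 4 and a column stride of 2, element (0, 1) is read as 2. With the column stride misread as 1, the same index lands on the duplicate of element (0, 0) and returns 1.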
Field G. Van Zee	b9c9f03502	Implemented gemm on skinny/unpacked matrices. Details: - Implemented a new sub-framework within BLIS to support the management of code and kernels that specifically target matrix problems for which at least one dimension is deemed to be small, which can result in long and skinny matrix operands that are ill-suited for the conventional level-3 implementations in BLIS. The new framework tackles the problem in two ways. First the stripped-down algorithmic loops forgo the packing that is famously performed in the classic code path. That is, the computation is performed by a new family of kernels tailored specifically for operating on the source matrices as-is (unpacked). Second, these new kernels will typically (and in the case of haswell and zen, do in fact) include separate assembly sub-kernels for handling of edge cases, which helps smooth performance when performing problems whose m and n dimension are not naturally multiples of the register blocksizes. In a reference to the sub-framework's purpose of supporting skinny/unpacked level-3 operations, the "sup" operation suffix (e.g. gemmsup) is typically used to denote a separate namespace for related code and kernels. NOTE: Since the sup framework does not perform any packing, it targets row- and column-stored matrices A, B, and C. For now, if any matrix has non-unit strides in both dimensions, the problem is computed by the conventional implementation. - Implemented the default sup handler as a front-end to two variants. bli_gemmsup_ref_var2() provides a block-panel variant (in which the 2nd loop around the microkernel iterates over n and the 1st loop iterates over m), while bli_gemmsup_ref_var1() provides a panel-block variant (2nd loop over m and 1st loop over n). However, these variants are not used by default and provided for reference only. 
Instead, the default sup handler calls _var2m() and _var1n(), which are similar to _var2() and _var1(), respectively, except that they defer to the sup kernel itself to iterate over the m and n dimension, respectively. In other words, these variants rely not on microkernels, but on so-called "millikernels" that iterate along m and k, or n and k. The benefit of using millikernels is a reduction of function call and related (local integer typecast) overhead as well as the ability for the kernel to know which micropanel (A or B) will change during the next iteration of the 1st loop, which allows it to focus its prefetching on that micropanel. (In _var2m()'s millikernel, the upanel of A changes while the same upanel of B is reused. In _var1n()'s, the upanel of B changes while the upanel of A is reused.) - Added a new configure option, --[en|dis]able-sup-handling, which is enabled by default. However, the default thresholds at which the default sup handler is activated are set to zero for each of the m, n, and k dimensions, which effectively disables the implementation. (The default sup handler only accepts the problem if at least one dimension is smaller than or equal to its corresponding threshold. If all dimensions are larger than their thresholds, the problem is rejected by the sup front-end and control is passed back to the conventional implementation, which proceeds normally.) - Added support to the cntx_t structure to track new fields related to the sup framework, most notably: - sup thresholds: the thresholds at which the sup handler is called. - sup handlers: the address of the function to call to implement the level-3 skinny/unpacked matrix implementation. - sup blocksizes: the register and cache blocksizes used by the sup implementation (which may be the same or different from those used by the conventional packm-based approach). - sup kernels: the kernels that the handler will use in implementing the sup functionality. 
- sup kernel prefs: the IO preference of the sup kernels, which may differ from the preferences of the conventional gemm microkernels' IO preferences. - Added a bool_t to the rntm_t structure that indicates whether sup handling should be enabled/disabled. This allows per-call control of whether the sup implementation is used, which is useful for test drivers that wish to switch between the conventional and sup codes without having to link to different copies of BLIS. The corresponding accessor functions for this new bool_t are defined in bli_rntm.h. - Implemented several row-preferential gemmsup kernels in a new directory, kernels/haswell/3/sup. These kernels include two general implementation types--'rd' and 'rv'--for the 6x8 base shape, with two specialized millikernels that embed the 1st loop within the kernel itself. - Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference gemmsup microkernels. NOTE: These microkernels, unlike the current crop of conventional (pack-based) microkernels, do not use constant loop bounds. Additionally, their inner loop iterates over the k dimension. - Defined new typedef enums: - stor3_t: captures the effective storage combination of the level-3 problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A special value of BLIS_XXX is used to denote an arbitrary combination which, in practice, means that at least one of the operands is stored according to general stride. - threshid_t: captures each of the three dimension thresholds. - Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create() can be passed "-1, -1" as a lazy request for row storage. (Note that "0, 0" is still accepted as a lazy request for column storage.) - Added support for various instructions to bli_x86_asm_macros.h, including imul, vhaddps/pd, and other instructions related to integer vectors. 
- Disabled the older small matrix handling code inserted by AMD in bli_gemm_front.c, since the sup framework introduced in this commit is intended to provide a more generalized solution. - Added test/sup directory, which contains standalone performance test drivers, a Makefile, a runme.sh script, and an 'octave' directory containing scripts compatible with GNU Octave. (They also may work with matlab, but if not, they are probably close to working.) - Reinterpret the storage combination string (sc_str) in the various level-3 testsuite modules (e.g. src/test_gemm.c) so that the order of each matrix storage char is "cab" rather than "abc". - Comment updates in level-3 BLAS API wrappers in frame/compat.	2019-04-27 18:44:50 -05:00
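The acceptance rule that `b9c9f03` describes for the default sup handler (accept if at least one dimension is at or below its threshold) can be written as a single predicate. This is a sketch of the rule as stated in the message, not the BLIS implementation. The function name and parameter names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the problem would be accepted by the sup handler under
   the rule described above; tm, tn, tk are the per-dimension thresholds. */
static bool example_sup_accepts( size_t m, size_t n, size_t k,
                                 size_t tm, size_t tn, size_t tk )
{
    return ( m <= tm ) || ( n <= tn ) || ( k <= tk );
}
```

With all three thresholds at zero, any problem with nonzero dimensions is rejected and falls through to the conventional implementation, which matches the "effectively disabled" default described in the message.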
Field G. Van Zee	540ec1b479	Updated level-3 BLAS to call object API directly. Details: - Updated the BLAS compatibility layer for level-3 operations so that the corresponding BLIS object API is called directly rather than first calling the typed BLIS API. The previous code based on the typed BLIS API calls is still available in a deactivated cpp macro branch, which may be re-activated by #defining BLIS_BLAS3_CALLS_TAPI. (This does not yet correspond to a configure option. If it seems like people might want to toggle this behavior more regularly, a configure option can be added in the future.) - Updated the BLIS typed API to statically "pre-initialize" objects via new initializer macros. Initialization is then finished via calls to static functions bli_obj_init_finish_1x1() and bli_obj_init_finish(), which are similar to the previously-called functions, bli_obj_create_1x1_with_attached_buffer() and bli_obj_create_with_attached_buffer(), respectively. (The BLAS compatibility layer updates mentioned above employ this new technique as well.) - Transformed certain routines in bli_param_map.c--specifically, the ones that convert netlib-style parameters to BLIS equivalents--into static functions, now in bli_param_map.h. (The remaining three classes of conversion routines were left unchanged.) - Added the aforementioned pre-initializer macros to bli_type_defs.h. - Relocated bli_obj_init_const() and bli_obj_init_constdata() from bli_obj_macro_defs.h to bli_type_defs.h. - Added a few macros to bli_param_macro_defs.h for testing domains for real/complexness and precisions for single/double-ness.	2019-02-24 19:09:10 -06:00
Field G. Van Zee	0645f239fb	Remove UT-Austin from copyright headers' clause 3. Details: - Removed explicit reference to The University of Texas at Austin in the third clause of the license comment blocks of all relevant files and replaced it with a more all-encompassing "copyright holder(s)". - Removed duplicate words ("derived") from a few kernels' license comment blocks. - Homogenized license comment block in kernels/zen/3/bli_gemm_small.c with format of all other comment blocks.	2018-12-04 14:31:06 -06:00
Field G. Van Zee	1d8aae220b	Track internal scalar datatypes. Details: - Added a num_t datatype bitfield to the obj_t in the form of a new info2 field in the obj_t. This change was made primarily so that in the case of mixed-datatype gemm, the alpha scalar would not need to be cast to the storage datatype of B (or A) before then being cast to the computation datatype just before the macrokernel is called. This double-casting regime could result in loss of precision if the storage datatype of B (or A) is less than the computation precision. In practice, it was likely not going to be a big deal since most usage of alpha is for -1.0, 0.0, and 1.0 (or integer multiples thereof), which can all be represented exactly in single or double precision. - The type of objbits_t was changed to uint32_t, so the new format potentially takes up the same space as the previous obj_t definition, assuming no padding inserted by the compiler. Shrinking info to 32 bits and spilling over into a second field was chosen over using the high 32 bits of a single 64-bit objbits_t info field because many of the bitwise operations are performed with enums such as num_t, dom_t, and prec_t, which may take on the type of 32-bit ints. It's easier to just keep all of those bitwise operations in 32 bits than perform a million typecasts throughout bli_type_defs.h and bli_obj_macro_defs.h to ensure that the integers are treated as 64-bit for the purposes of the ANDs, ORs, and bitshifts. - Many comment updates. - Thanks to Devin Matthews and Devangi Parikh for their feedback and involvement during this commit cycle.	2018-11-20 18:42:07 -06:00
Field G. Van Zee	5fec95b99f	Implemented mixed-datatype support for gemm. Details: - Implemented support for gemm where A, B, and C may have different storage datatypes, as well as a computational precision (and implied computation domain) that may be different from the storage precision of either A or B. This results in 128 different combinations, all of which are implemented within this commit. (For now, the mixed-datatype functionality is only supported via the object API.) If desired, the mixed-datatype support may be disabled at configure-time. - Added a memory-intensive optimization to certain mixed-datatype cases that requires a single m-by-n matrix be allocated (temporarily) per call to gemm. This optimization aims to avoid the overhead involved in repeatedly updating C with general stride, or updating C after a typecast from the computation precision. This memory optimization may be disabled at configure-time (provided that the mixed-datatype support is enabled in the first place). - Added support for testing mixed-datatype combinations to testsuite. The user may test gemm with mixed domains, precisions, both, or neither. - Added a standalone test driver directory for building and running mixed-datatype performance experiments. - Defined a new variation of castm, castnzm, which operates like castm except that imaginary values are not touched when casting a real operand to a complex operand. (By contrast, in these situations castm sets the imaginary components of the destination matrix to zero.) - Defined bli_obj_imag_is_zero() and substituted calls in lieu of all usages of bli_obj_imag_equals() that tested against BLIS_ZERO, and also simplified the implementation of bli_obj_imag_equals(). - Fixed bad behavior from bli_obj_is_real() and bli_obj_is_complex() when given BLIS_CONSTANT objects. - Disabled dt_on_output field in auxinfo_t structure as well as all accessor functions. Also commented out all usage of accessor functions within macrokernels. 
(Typecasting in the microkernel is still feasible, though probably unrealistic for now given the additional complexity required.) - Use void function pointer type (instead of void*) for storing function pointers in bli_l0_fpa.c. - Added documentation for using gemm with mixed datatypes in docs/MixedDatatypes.md and example code in examples/oapi/11gemm_md.c. - Defined level-1d operation xpbyd and level-1m operation xpbym. - Added xpbym test module to testsuite. - Updated frame/include/bli_x86_asm_macros.h with additional macros (courtesy of Devin Matthews).	2018-10-15 16:37:39 -05:00
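The difference between castm and castnzm described in `5fec95b` applies when a real operand is cast into a complex destination. The sketch below illustrates that behavior on plain arrays rather than BLIS matrix objects. The type `example_dcomplex` and both function names are hypothetical, not BLIS APIs.

```c
#include <stddef.h>

/* Hypothetical double-precision complex element. */
typedef struct { double real; double imag; } example_dcomplex;

/* castm-like behavior: copy the real values and zero the imaginary parts. */
static void example_castm_r2c( const double* src, example_dcomplex* dst, size_t n )
{
    for ( size_t i = 0; i < n; ++i )
    {
        dst[ i ].real = src[ i ];
        dst[ i ].imag = 0.0;
    }
}

/* castnzm-like behavior: copy the real values and leave the imaginary
   parts of the destination untouched. */
static void example_castnzm_r2c( const double* src, example_dcomplex* dst, size_t n )
{
    for ( size_t i = 0; i < n; ++i )
    {
        dst[ i ].real = src[ i ];
    }
}
```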
Field G. Van Zee	4fa4cb0734	Trivial comment header updates. Details: - Removed four trailing spaces after "BLIS" that occurs in most files' commented-out license headers. - Added UT copyright lines to some files. (These files previously had only AMD copyright lines but were contributed to by both UT and AMD.) - In some files' copyright lines, expanded 'The University of Texas' to 'The University of Texas at Austin'. - Fixed various typos/misspellings in some license headers.	2018-08-29 18:06:41 -05:00
Field G. Van Zee	b7db293323	Explicitly typecast return vals in static funcs. Details: - Added explicit typecasting to various functions (mostly static functions), primarily those in bli_param_macro_defs.h, bli_obj_macro_defs.h, bli_cntx.h, bli_cntl.h, and a few other header files. - This change was prompted by feedback from Jacob Gorm Hansen, who reported that #including "blis.h" from his application caused gcc to output error messages (relating to types being returned mismatching the declared return types) when used via the C++ compiler front-end. This is the first pass of fixes, and we may need to iterate with additional follow-up commits (#233).	2018-07-19 11:14:30 -05:00
Field G. Van Zee	52d80b5f09	Fixed static funcs related to target and exec dts. Details: - Fixed incorrect bit shifts in the following static functions: bli_obj_set_target_domain() bli_obj_set_target_prec() bli_obj_set_exec_domain() bli_obj_set_exec_prec() - Fixed incorrect bitmask in bli_dt_proj_to_single_prec(). - Updated bli_obj_real_part() and bli_obj_imag_part() so that they update the target and exec datatypes (in addition to the storage datatypes).	2018-06-29 12:30:44 -05:00
Field G. Van Zee	17928b1c99	Added static funcs bli_dt_domain(), bli_dt_prec(). Details: - Added definitions of static functions bli_dt_domain()/bli_dt_prec(), which extract a dom_t domain or prec_t precision value, respectively, from a num_t datatype. - Changed the return types of bli_obj_domain() and bli_obj_prec() from objbits_t to dom_t and prec_t. (Not sure why they were ever set to return objbits_t.)	2018-06-19 17:59:03 -05:00
Field G. Van Zee	5f7fbb7115	Static funcs for projecting dt to single/double. Details: - Added static functions for projecting a datatype to single precision or double precision, both for obj_t's storage datatypes and standalone datatypes.	2018-06-19 15:38:55 -05:00
Field G. Van Zee	f317c2e31b	Added get/set static funcs for exec dt/dom/prec. Details: - Added functions to bli_obj_macro_defs.h to get and set the target domain and target precision bits in the obj_t, and also added the appropriate support in bli_type_defs.h.	2018-06-19 12:21:23 -05:00
Field G. Van Zee	ed20392c50	Added get/set static funcs for exec dt/dom/prec. Details: - Added functions to bli_obj_macro_defs.h to get and set the execution domain and execution precision bits in the obj_t. - Added/rearranged a few functions in bli_obj_macro_defs.h. - Renamed some macros in bli_type_defs.h: EXECUTION -> EXEC.	2018-06-15 16:31:22 -05:00
Field G. Van Zee	22aa44ebec	Merge branch 'dev' of github.com:flame/blis into dev	2018-06-07 17:42:59 -05:00
Field G. Van Zee	65fae95074	Implemented bli_setrm, _setim, _setrv, _setiv. Details: - Defined new wrappers to setm/setv operations in frame/base/bli_setri.c that will target only the real or only the imaginary parts of a matrix/vector object. - Updated bli_obj_real_part() so that the complex-specific portions of the function are not executed if the object is real. - Defined bli_obj_imag_part(). - Caveat: If bli_obj_imag_part() is called on a real object, it does nothing, leaving the destination object untouched. The caller must take care to only call the function on complex objects. - Reordered some of the static functions in bli_obj_macro_defs.h related to aliasing.	2018-06-07 17:41:09 -05:00
Field G. Van Zee	b65d0b841b	Fixed bug in bli_dt_proj_to_complex(). Details: - Fixed a bug identical to the one fixed in `0a4a27e`, except this time in the bli_obj_param_defs.h header file. It looks like the only consumers of this static function were in bli_l0_oapi.c, and so this may not have been manifesting (yet).	2018-06-07 14:38:41 -05:00
Field G. Van Zee	0a4a27e1a4	Defined/implemented bli_projm(). Details: - Defined a new operation in frame/base/bli_proj.c, bli_projm(), which behaves like bli_copym(), except that operands a and b are allowed to contain data of differing domains (e.g. a is real while b is complex, or vice versa). The file is named bli_proj.c, rather than bli_projm.c, with the intention that a 'v' vector version of the function may be added to the same file (at some point in the future). - Added supporting bli_check_*() functions in bli_check.c to confirm consistent precisions between two datatypes/objects, as well as the appropriate error message in bli_error.c and a new error code in bli_type_defs.h. - Wrote a bli_projm_check() function to go along with bli_projm(). - Defined static function bli_obj_real_part() in bli_obj_macro_defs.h, which will initialize an obj_t alias to the real part of the source object. - Fixed a bug in the static function bli_dt_proj_to_complex(), found in bli_param_macro_defs.h. Thankfully, there were no calls to the function to produce buggy behavior.	2018-06-06 19:02:29 -05:00
Field G. Van Zee	962a706a6f	Updated LICENSE file to mention HP Enterprise. Details: - Added HP Enterprise to the LICENSE file. Previously, only the source files touched by HPE contained the corresponding copyright notices. (This oversight was unintentional.) - Updated file-level copyright notices to include a comma, to match the formatting used for UT and AMD copyrights.	2018-05-18 18:19:40 -05:00
Field G. Van Zee	4b36e85be9	Converted function-like macros to static functions. Details: - Converted most C preprocessor macros in bli_param_macro_defs.h and bli_obj_macro_defs.h to static functions. - Reshuffled some functions/macros to bli_misc_macro_defs.h and also between bli_param_macro_defs.h and bli_obj_macro_defs.h. - Changed obj_t-initializing macros in bli_type_defs.h to static functions. - Removed some old references to BLIS_TWO and BLIS_MINUS_TWO from bli_constants.h. - Whitespace changes in select files (four spaces to single tab).	2018-05-08 14:26:30 -05:00
Field G. Van Zee	75d0d1057d	Renamed various datatype-related macros/functions. Details: - Renamed the following macros in bli_obj_macro_defs.h and bli_param_macro_defs.h: - bli_obj_datatype() -> bli_obj_dt() - bli_obj_target_datatype() -> bli_obj_target_dt() - bli_obj_execution_datatype() -> bli_obj_exec_dt() - bli_obj_set_datatype() -> bli_obj_set_dt() - bli_obj_set_target_datatype() -> bli_obj_set_target_dt() - bli_obj_set_execution_datatype() -> bli_obj_set_exec_dt() - bli_obj_datatype_proj_to_real() -> bli_obj_dt_proj_to_real() - bli_obj_datatype_proj_to_complex() -> bli_obj_dt_proj_to_complex() - bli_datatype_proj_to_real() -> bli_dt_proj_to_real() - bli_datatype_proj_to_complex() -> bli_dt_proj_to_complex() - Renamed the following functions in bli_obj.c: - bli_datatype_size() -> bli_dt_size() - bli_datatype_string() -> bli_dt_string() - bli_datatype_union() -> bli_dt_union() - Removed a pair of old level-1f penryn intrinsics kernels that were no longer in use.	2018-04-30 14:57:33 -05:00
Field G. Van Zee	83316485ce	Simplified/fixed self-initialization. Details: - Fixed a race condition in self-initialization whereby the bli_is_init static variable could be erroneously read as TRUE by thread 1 while thread 0 is still executing bli_init_apis(), thus allowing thread 1 to use the library before it is actually ready. Thanks to Minh Quan Ho and Devin Matthews for pointing out this issue. - Part of the solution to the aforementioned race condition involved replacing the runtime initialization of the global scalar constants (e.g., BLIS_ONE, BLIS_ZERO, etc.) in bli_const.c with a static initialization of those same constants. This eliminates the need for bli_const_init() altogether. (The static initialization is made concise via preprocessor macros.) - Defined bli_gks_query_cntx_noinit(), which behaves just like bli_gks_query_cntx(), except that it does not call bli_init_once(). This function is called in lieu of bli_gks_query_cntx() in bli_ind_init() and bli_memsys_init() so as to not result in any recursion into bli_init_once(). - Removed BLIS_ONE_HALF, BLIS_MINUS_ONE_HALF global scalar constants. They have no use in BLIS or its test products, and we have little reason to believe they are used by others. - Removed testsuite/out file, which was accidentally committed as part of `70640a3`.	2017-12-13 14:14:50 -06:00
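The static initialization of scalar constants mentioned in `8331648` can be illustrated with a macro that fills in every precision of a constant at compile time, so no runtime init function is needed. This is a sketch under assumed names: `example_const_t`, `EXAMPLE_CONST_INIT`, and the `EXAMPLE_*` constants are hypothetical and do not reflect the actual layout used in bli_const.c.

```c
/* Hypothetical constant record holding one value in two precisions. */
typedef struct { float s; double d; } example_const_t;

/* Expand one literal into a complete static initializer. */
#define EXAMPLE_CONST_INIT( v ) { ( float )( v ), ( double )( v ) }

/* These objects are fully initialized before any thread can run, so there
   is no window in which a reader could observe a half-initialized value. */
static const example_const_t EXAMPLE_ONE       = EXAMPLE_CONST_INIT( 1.0 );
static const example_const_t EXAMPLE_ZERO      = EXAMPLE_CONST_INIT( 0.0 );
static const example_const_t EXAMPLE_MINUS_ONE = EXAMPLE_CONST_INIT( -1.0 );
```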
Field G. Van Zee	1c732d3ddc	Added 1m-specific APIs for bp, pb gemm algorithms. Details: - Defined bli_gemmbp_cntl_create(), bli_gemmpb_cntl_create(), with the body of bli_gemm_cntl_create() replaced with a call to the former. - Defined bli_cntl_free_w_thrinfo(), bli_cntl_free_wo_thrinfo(). Now, bli_cntl_free() can check if the thread parameter is NULL, and if so, call the latter, and otherwise call the former. - Defined bli_gemm1mbp_cntx_init(), bli_gemm1mpb_cntx_init(), both in terms of bli_gemm1mxx_cntx_init(), which behaves the same as bli_gemm1m_cntx_init() did before, except that an extra bool parameter (is_pb) is used to support both bp and pb algorithms (including to support the anti-preference field described below). - Added support for "anti-preference" in context. The anti_pref field, when true, will toggle the boolean return value of routines such as bli_cntx_l3_ukr_eff_prefers_storage_of(), which has the net effect of causing BLIS to transpose the operation to achieve disagreement (rather than agreement) between the storage of C and the micro-kernel output preference. This disagreement is needed for panel-block implementations, since they induce a transposition of the suboperation immediately before the macro-kernel is called, which changes the apparent storage of C. For now, anti-preference is used only with the pb algorithm for 1m (and not with any other non-1m implementation). - Defined new functions, bli_cntx_l3_ukr_eff_prefers_storage_of() bli_cntx_l3_ukr_eff_dislikes_storage_of() bli_cntx_l3_nat_ukr_eff_prefers_storage_of() bli_cntx_l3_nat_ukr_eff_dislikes_storage_of() which are identical to their non-"eff" (effectively) counterparts except that they take the anti-preference field of the context into account. - Explicitly initialize the anti-pref field to FALSE in bli_gks_cntx_set_l3_nat_ukr_prefs(). - Added bli_gemm_ker_var1.c, which implements a panel-block macro-kernel in terms of the existing block-panel macro-kernel _ker_var2(). 
This technique requires inducing transposes on all operands and swapping the A and B. - Changed bli_obj_induce_trans() macro so that pack-related fields are also changed to reflect the induced transposition. - Added a temporary hack to bli_l3_3m4m1m_oapi.c that allows us to easily specify the 1m algorithm (block-panel or panel-block). - Renamed the following cntx_t-related macros: bli_cntx_get_pack_schema_a() -> bli_cntx_get_pack_schema_a_block() bli_cntx_get_pack_schema_b() -> bli_cntx_get_pack_schema_b_panel() bli_cntx_get_pack_schema_c() -> bli_cntx_get_pack_schema_c_panel() and updated all instantiations. Also updated the field names in the cntx_t struct. - Comment updates.	2017-01-25 16:25:46 -06:00
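The "anti-preference" behavior described in `1c732d3` amounts to flipping a boolean answer when a flag is set. The following sketch shows only that logic. The function name is hypothetical, and the real bli_cntx_l3_ukr_eff_prefers_storage_of() takes different arguments.

```c
#include <stdbool.h>

/* Effective preference: the micro-kernel's own answer, toggled when the
   context's anti-preference flag is true. */
static bool example_eff_prefers( bool ukr_prefers, bool anti_pref )
{
    return anti_pref ? !ukr_prefers : ukr_prefers;
}
```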
Field G. Van Zee	701b9aa3ff	Redesigned control tree infrastructure. Details: - Altered control tree node struct definitions so that all nodes have the same struct definition, whose primary fields consist of a blocksize id, a variant function pointer, a pointer to an optional parameter struct, and a pointer to a (single) sub-node. This unified control tree type is now named cntl_t. - Changed the way control tree nodes are connected, and what computation they represent, such that, for example, packing operations are now associated with nodes that are "inline" in the tree, rather than off-shoot branches. The original tree for the classic Goto gemm algorithm was expressed (roughly) as: blk_var2 -> blk_var3 -> blk_var1 -> ker_var2, with packb and packa attached as off-shoot branches, and now, the same tree would look like: blk_var2 -> blk_var3 -> packb -> blk_var1 -> packa -> ker_var2. Specifically, the packb and packa nodes perform their respective packing operations and then recurse (without any loop) to a subproblem. This means there are now two kinds of level-3 control tree nodes: partitioning and non-partitioning. The blocked variants are members of the former, because they iteratively partition off submatrices and perform suboperations on those partitions, while the packing variants belong to the latter group. (This change has the effect of allowing greatly simplified initialization of the nodes, which previously involved setting many unused node fields to NULL.) - Changed the way thrinfo_t tree nodes are arranged to mirror the new connective structure of control trees. That is, packm nodes are no longer off-shoot branches of the main algorithmic nodes, but rather connected "inline". - Simplified control tree creation functions. Partitioning nodes are created concisely with just a few fields needing initialization. 
By contrast, the packing nodes require additional parameters, which are stored in a packm-specific struct that is tracked via the optional parameters pointer within the control tree struct. (This parameter struct must always begin with a uint64_t that contains the byte size of the struct. This allows us to use a generic function to recursively copy control trees.) gemm, herk, and trmm control tree creation continues to be consolidated into a single function, with the operation family being used to select among the parameter-agnostic macro-kernel wrappers. A single routine, bli_cntl_free(), is provided to free control trees recursively, whereby the chief thread within a group releases the blocks associated with mem_t entries back to the memory broker from which they were acquired. - Updated internal back-ends, e.g. bli_gemm_int(), to query and call the function pointer stored in the current control tree node (rather than index into a local function pointer array). Before being invoked, these function pointers are first cast to a gemm_voft (for gemm, herk, or trmm families) or trsm_voft (for trsm family) type, which is defined in frame/3/bli_l3_var_oft.h. - Retired herk and trmm internal back-ends, since all execution now flows through gemm or trsm blocked variants. - Merged forwards- and backwards-moving variants by querying the direction from routines as a function of the variant's matrix operands. gemm and herk always move forward, while trmm and trsm move in a direction that is dependent on which operand (a or b) is triangular. - Added functions bli_thread_get_range_mdim(), bli_thread_get_range_ndim(), each of which takes additional arguments and hides complexity in managing the difference between the way ranges are computed for the four families of operations. - Simplified level-3 blocked variants according to the above changes, so that the only steps taken are: 1. Query partitioning direction (forwards or backwards). 2. 
Prune unreferenced regions, if they exist. 3. Determine the thread partitioning sub-ranges. <begin loop> 4. Determine the partitioning blocksize (passing in the partitioning direction) 5. Acquire the current iteration's partitions for the matrices affected by the current variant's partitioning dimension (m, k, n). 6. Call the subproblem. <end loop> - Instantiate control trees once per thread, per operation invocation. (This is a change from the previous regime in which control trees were treated as stateless objects, initialized with the library, and shared as read-only objects between threads.) This once-per-thread allocation is done primarily to allow threads to use the control tree as a place to cache certain data for use in subsequent loop iterations. Presently, the only application of this caching is a mem_t entry for the packing blocks checked out from the memory broker (allocator). If a non-NULL control tree is passed in by the (expert) user, then the tree is copied by each thread. This is done in bli_l3_thread_decorator(), in bli_thrcomm_*.c. - Added a new field to the context, an opid_t, which tracks the "family" of the operation being executed. For example, gemm, hemm, and symm are all part of the gemm family, while herk, syrk, her2k, and syr2k are all part of the herk family. Knowing the operation's family is necessary when conditionally executing the internal (beta) scalar reset on C in blocked variant 3, which is needed for gemm and herk families, but must not be performed for the trmm family (because beta has only been applied to the current row-panel of C after the first rank-kc iteration). - Reexpressed 3m3 induced method blocked variant in frame/3/gemm/ind to conform with the new control tree design, and renamed the macro-kernel codes corresponding to 3m2 and 4m1b. - Renamed bli_mem.c (and its APIs) to bli_memsys.c, and renamed/relocated bli_mem_macro_defs.h from frame/include to frame/base/bli_mem.h. 
- Renamed/relocated bli_auxinfo_macro_defs.h from frame/include to frame/base/bli_auxinfo.h. - Fixed a minor bug whereby the storage-to-ukr-preference matching optimization in the various level-3 front-ends was not being applied properly when the context indicated that execution would be via an induced method. (Before, we always checked the native micro-kernel corresponding to the datatype being executed, whereas now we check the native micro-kernel corresponding to the datatype's real projection, since that is the micro-kernel that is actually used by induced methods.) - Added an option to the testsuite to skip the testing of native level-3 complex implementations. Previously, it was always tested, provided that the c/z datatypes were enabled. However, some configurations use reference micro-kernels for complex datatypes, and testing these implementations can slow down the testsuite considerably.	2016-08-26 19:04:45 -05:00
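The loop described in steps 4 through 6 of the 701b9aa3ff entry can be sketched roughly as follows. This is only an illustration, not BLIS code: the names dir_sketch_t, blocksize_sketch(), and walk_sketch() are hypothetical, the blocksize b is assumed to be nonzero, and the choice to handle the edge block first when moving backwards is an assumption inferred from the policy (stated in the ee129c6b02 entry) that edge cases belong to the bottom or right.

```c
#include <stddef.h>

/* Hypothetical direction type; not a BLIS type. */
typedef enum { DIR_FWD, DIR_BWD } dir_sketch_t;

/* Step 4: pick the blocksize for the iteration that starts after `done`
   elements of `dim` have been processed. Assumes b > 0. */
static size_t blocksize_sketch(dir_sketch_t dir, size_t done, size_t dim, size_t b)
{
    size_t left = dim - done;
    /* Assumption: when moving backwards, take the edge (dim % b) first,
       so that the edge block sits at the bottom/right of the matrix. */
    if (dir == DIR_BWD && done == 0 && dim % b != 0)
        return dim % b;
    return left < b ? left : b;
}

/* Steps 4-6 in a loop: record the offset and size of each partition.
   A real variant would call the subproblem where the arrays are filled.
   Returns the number of partitions recorded. */
static size_t walk_sketch(dir_sketch_t dir, size_t dim, size_t b,
                          size_t *offs, size_t *sizes, size_t max_parts)
{
    size_t n = 0;
    size_t done = 0;
    while (done < dim && n < max_parts) {
        size_t bs = blocksize_sketch(dir, done, dim, b);
        /* Step 5: locate this iteration's partition. */
        offs[n]  = (dir == DIR_FWD) ? done : dim - done - bs;
        sizes[n] = bs;
        /* Step 6 would recurse on the partition here. */
        done += bs;
        n++;
    }
    return n;
}
```

With dim = 10 and b = 4, both directions visit the same three partitions (offsets 0, 4, and 8), only in opposite order, and the size-2 edge block stays at offset 8 either way.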
Field G. Van Zee	a017062fdf	Integrated "memory broker" (membrk_t) abstraction. Details: - Integrated a patch originally authored and submitted by Ricardo Magana of HP Enterprise. The changeset inserts use of a new object type, membrk_t, (memory broker) that allows multiple sets of memory pools on, for example, separate NUMA nodes, each of which has a separate memory space. - Added membrk field to cntx_t and defined corresponding accessor macros. - Added membrk field to mem_t object and defined corresponding accessor macros. - Created new bli_membrk.c file, which contains the new memory broker API, including: bli_membrk_init(), bli_membrk_finalize() bli_membrk_acquire_[mv](), bli_membrk_release(), bli_membrk_init_pools(), bli_membrk_reinit_pools(), bli_membrk_finalize_pools(), bli_membrk_pool_size() - In bli_mem.c, changed function calls to bli_mem_init_pools() -> bli_membrk_init() bli_mem_reinit_pools() -> bli_membrk_reinit() bli_mem_finalize_pools() -> bli_membrk_finalize() - In bli_packv_init.c, bli_packm_init.c, changed function calls to: bli_mem_acquire_[mv]() -> bli_membrk_acquire_[mv]() bli_mem_release() -> bli_membrk_release() - Added bli_mutex.c and related files to frame/thread. These files define abstract mutexes (locks) and corresponding APIs for pthreads, openmp, or single-threaded execution. This new API is employed within functions such as bli_membrk_acquire_[mv]() and bli_membrk_release().	2016-07-22 17:02:59 -05:00
Field G. Van Zee	537a1f4f85	Implemented runtime contexts and reorganized code. Details: - Retrofitted a new data structure, known as a context, into virtually all internal APIs for computational operations in BLIS. The structure is now present within the type-aware APIs, as well as many supporting utility functions that require information stored in the context. User-level object APIs were unaffected and continue to be "context-free," however, these APIs were duplicated/mirrored so that "context-aware" APIs now also exist, differentiated with an "_ex" suffix (for "expert"). These new context-aware object APIs (along with the lower-level, type-aware, BLAS-like APIs) contain the address of a context as a last parameter, after all other operands. Contexts, or specifically, cntx_t object pointers, are passed all the way down the function stack into the kernels and allow the code at any level to query information about the runtime, such as kernel addresses and blocksizes, in a thread-friendly manner--that is, one that allows thread-safety, even if the original source of the information stored in the context changes at run-time (see next bullet for more on this "original source" of info). (Special thanks go to Lee Killough for suggesting the use of this kind of data structure in discussions that transpired during the early planning stages of BLIS, and also for suggesting such a perfectly appropriate name.) - Added a new API, in frame/base/bli_gks.c, to define a "global kernel structure" (gks). This data structure and API will allow the caller to initialize a context with the kernel addresses, blocksizes, and other information associated with the currently active kernel configuration. The currently active kernel configuration within the gks cannot be changed (for now), and is initialized with the traditional cpp macros that define kernel function names, blocksizes, and the like. 
However, in the future, the gks API will be expanded to allow runtime management of kernels and runtime parameters. The most obvious application of this new infrastructure is the runtime detection of hardware (and the implied selection of appropriate kernels). With contexts in place, kernels may even be "hot swapped" at runtime within the gks. Once execution enters a level-3 _front() function, the memory allocator will be reinitialized on-the-fly, if necessary, to accommodate the new kernels' blocksizes. If another application thread is executing with another (previously loaded) kernel, it will finish in a deterministic fashion because its kernel information was loaded into its context before computation began, and also because the blocks it checked out from the internal memory pools will be unaffected by the newer threads' reinitialization of the allocator. - Reorganized and streamlined the 'ind' directory, which contains much of the code enabling use of induced methods for complex domain matrix multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as those APIs' functionality is now mostly subsumed within the global kernel structure. - Updated bli_pool.c to define a new function, bli_pool_reinit_if(), that will reinitialize a memory pool if the necessary pool block size has increased. - Updated bli_mem.c to use bli_pool_reinit_if() instead of bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed usage of contexts where appropriate to communicate cache and register blocksizes to bli_mem_compute_pool_block_sizes(). - Simplified control trees now that much of the information resides in the context and/or the global kernel structure: - Removed blocksize object pointers (blksz_t) fields from all control tree node definitions and replaced them with blocksize id (bszid_t) values instead, which may be passed into a context query routine in order to extract the corresponding blocksize from the given context. 
- Removed micro-kernel function pointers (func_t) fields from all control tree node definitions. Now, any code that needs these function pointers can query them from the local context, as identified by a level-3 micro-kernel id (l3ukr_t), level-1f kernel id (l1fkr_t), or level-1v kernel id (l1vkr_t). - Removed blksz_t object creation and initialization, as well as kernel function object creation and initialization, from all operation-specific control tree initialization files (bli_*_cntl.c), since this information will now live in the gks and, secondarily, in the context. - Removed blocksize multiples from blksz_t objects. Now, we track blocksize multiples for each blocksize id (bszid_t) in the context object. - Removed the bool_t's that were required when a func_t was initialized. These bools are meant to allow one to track the micro-kernel's storage preferences (by rows or columns). This preference is now tracked separately within the gks and contexts. - Merged and reorganized many separate-but-related functions into single files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and util directories, but has the most obvious effect of allowing BLIS to compile noticeably faster. - Reorganized execution paths for level-1v, -1d, -1m, and -2 operations in an attempt to reduce overhead for memory-bound operations. This includes removal of default use of object-based variants for level-2 operations. Now, by default, level-2 operations will directly call a low-level (non-object based) loop over a level-1v or -1f kernel. - Converted many common query functions in bli_blksz.c (renamed from bli_blocksize.c) and bli_func.c into cpp macros, now defined in their respective header files. - Defined bli_mbool.c API to create and query "multi-bools", or heterogeneous bool_t's (one for each floating-point datatype), in the same spirit as blksz_t and func_t. - Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS and BLIS_SIMD_SIZE. 
These values are needed in order to compute a third new parameter, which may be set indirectly via the aforementioned macros or directly: BLIS_STACK_BUF_MAX_SIZE. This value is used to statically allocate memory in macro-kernels and the induced methods' virtual kernels to be used as temporary space to hold a single micro-tile. These values are now output by the testsuite. The default value of BLIS_STACK_BUF_MAX_SIZE is computed as "2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE". - Cleaned up top-level 'kernels' directory (for example, renaming the embarrassingly misleading "avx" and "avx2" directories to "sandybridge" and "haswell," respectively), and gave more consistent and meaningful names to many kernel files (as well as updating their interfaces to conform to the new context-aware kernel APIs). - Updated the testsuite to query blocksizes from a locally-initialized context for test modules that need those values: axpyf, dotxf, dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr. - Reformatted many function signatures into a standard format that will more easily facilitate future API-wide changes. - Updated many "mxn" level-0 macros (i.e., those used to inline double loops for level-1m-like operations on small matrices) in frame/include/level0 to use more obscure local variable names in an effort to avoid variable shadowing. (Thanks to Devin Matthews for pointing out these gcc warnings, which are only output using -Wshadow.) - Added a conj argument to setm, so that its interface now mirrors that of scalm. The semantic meaning of the conj argument is to optionally allow implicit conjugation of the scalar prior to being populated into the object. - Deprecated all type-aware mixed domain and mixed precision APIs. Note that this does not preclude supporting mixed types via the object APIs, where it produces absolutely zero API code bloat.	2016-04-11 17:21:28 -05:00
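As a rough illustration of the stack-buffer sizing rule in the 537a1f4f85 entry, the sketch below reads the quoted default as two times the register count times the register size. The SKETCH_* names and the values 16 and 32 are hypothetical placeholders, not the values of any real BLIS configuration, and microtile_fits_sketch() is an invented helper.

```c
#include <stddef.h>

/* Hypothetical example values; real configurations define their own. */
#define SKETCH_SIMD_NUM_REGISTERS 16   /* number of SIMD registers    */
#define SKETCH_SIMD_SIZE          32   /* bytes per SIMD register     */

/* Default sizing rule as quoted in the commit message:
   2 * (number of registers) * (register size in bytes). */
#define SKETCH_STACK_BUF_MAX_SIZE \
    (2 * SKETCH_SIMD_NUM_REGISTERS * SKETCH_SIMD_SIZE)

/* Returns nonzero if one mr x nr micro-tile of elem_size-byte elements
   fits in a statically allocated buffer of the default size. */
static int microtile_fits_sketch(size_t mr, size_t nr, size_t elem_size)
{
    return mr * nr * elem_size <= (size_t)SKETCH_STACK_BUF_MAX_SIZE;
}
```

Under these placeholder values the buffer is 1024 bytes, so an 8 x 4 tile of doubles (256 bytes) fits while a 16 x 16 tile of doubles (2048 bytes) does not.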
Field G. Van Zee	e2e9d64a63	Load balance thread ranges for arbitrary diagonals. Details: - Expanded/updated interface for bli_get_range_weighted() and bli_get_range() so that the direction of movement is specified in the function name (e.g. bli_get_range_l2r(), bli_get_range_weighted_t2b()) and also so that the object being partitioned is passed instead of an uplo parameter. Updated invocations in level-3 blocked variants, as appropriate. - (Re)implemented bli_get_range_*() and bli_get_range_weighted_*() to carefully take into account the location of the diagonal when computing ranges so that the area of each subpartition (which, in all present level-3 operations, is proportional to the amount of computation engendered) is as equal as possible. - Added calls to a new class of routines to all non-gemm level-3 blocked variants: bli_<oper>_prune_unref_mparts_[mnk]() where <oper> is herk, trmm, or trsm and [mnk] is chosen based on which dimension is being partitioned. These routines call a more basic routine, bli_prune_unref_mparts(), to prune unreferenced/unstored regions from matrices and simultaneously adjust other matrices which share the same dimension accordingly. - Simplified herk_blk_var2f, trmm_blk_var1f/b as a result of the new pruning routines. - Fixed incorrect blocking factors passed into bli_get_range_*() in bli_trsm_blk_var[12][fb].c - Added a new test driver in test/thread_ranges that can exercise the new bli_get_range_*() and bli_get_range_weighted_*() under a range of conditions. - Reimplemented m and n fields of obj_t as elements in a "dim" array field so that dimensions could be queried via index constant (e.g. BLIS_M, BLIS_N). Adjusted/added query and modification macros accordingly. - Defined mdim_t type to enumerate BLIS_M and BLIS_N indexing values. - Added bli_round() macro, which calls C math library function round(), and bli_round_to_mult(), which rounds a value to the nearest multiple of some other value. 
- Added miscellaneous pruning- and mdim_t-related macros. - Renamed bli_obj_row_offset(), bli_obj_col_offset() macros to bli_obj_row_off(), bli_obj_col_off().	2015-09-24 12:14:03 -05:00
Field G. Van Zee	ee129c6b02	Fixed bugs in _get_range(), _get_range_weighted(). Details: - Fixed some bugs that only manifested in multithreaded instances of some (non-gemm) level-3 operations. The bugs were related to invalid allocation of "edge" cases to thread subpartitions. (Here, we define an "edge" case to be one where the dimension being partitioned for parallelism is not a whole multiple of whatever register blocksize is needed in that dimension.) In BLIS, we always require edge cases to be part of the bottom, right, or bottom-right subpartitions. (This is so that zero-padding only has to happen at the bottom, right, or bottom-right edges of micro-panels.) The previous implementations of bli_get_range() and _get_range_weighted() did not adhere to this implicit policy and thus produced bad ranges for some combinations of operation, parameter cases, problem sizes, and n-way parallelism. - As part of the above fix, the functions bli_get_range() and _get_range_weighted() have been renamed to use _l2r, _r2l, _t2b, and _b2t suffixes, similar to the partitioning functions. This is an easy way to make sure that the variants are calling the right version of each function. The function signatures have also been changed slightly. - Comment/whitespace updates. - Removed unnecessary '/' from macros in bli_obj_macro_defs.h.	2015-06-10 12:53:28 -05:00
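The edge-case policy in the ee129c6b02 entry (leftover elements that do not fill a whole register block must land in the bottom or right subpartition) can be illustrated with a simplified, unweighted left-to-right range computation. This is a sketch under stated assumptions, not the BLIS implementation: range_l2r_sketch() and its parameters are hypothetical, and bf and n_way are assumed to be nonzero.

```c
#include <stddef.h>

/* Hypothetical helper: compute the half-open range [*start, *end) that
   thread t (0 <= t < n_way) owns along a dimension of length n, where work
   is handed out in whole blocks of bf elements. Assumes bf > 0, n_way > 0. */
static void range_l2r_sketch(size_t t, size_t n_way, size_t n, size_t bf,
                             size_t *start, size_t *end)
{
    size_t n_units  = n / bf;            /* whole blocks of size bf        */
    size_t lo_units = n_units / n_way;   /* blocks every thread receives   */
    size_t n_extra  = n_units % n_way;   /* first n_extra threads get +1   */

    /* Index of this thread's first block and how many blocks it owns. */
    size_t first = t * lo_units + (t < n_extra ? t : n_extra);
    size_t count = lo_units + (t < n_extra ? 1 : 0);

    *start = first * bf;
    *end   = (first + count) * bf;

    /* The edge (n % bf leftover elements) always goes to the last thread,
       i.e. the right-most subpartition when moving left to right. */
    if (t == n_way - 1)
        *end = n;
}
```

For n = 13, bf = 4, and two threads, thread 0 receives [0, 8) and thread 1 receives [8, 13), so the single leftover element ends up in the right-most range.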
Field G. Van Zee	26a4b8f6f9	Implemented 3m2, 3m3 induced algorithms (gemm only). Details: - Defined a new "3ms" (separated 3m) pack schema and added appropriate support in packm_init(), packm_blk_var2(). - Generalized packm_struc_cxk_3mi to take the imaginary stride (is_p) as an argument instead of computing it locally. Exception: for trmm, is_p must be computed locally, since it changes for triangular packed matrices. Also exposed is_p in interface to dt-specific packm_blk_var2 (and _var1, even though it does not use imaginary stride). - Renamed many functions/variables from _3mi to _3mis to indicate that they work for either interleaved or separated 3m pack schemas. - Generalized gemm and herk macro-kernels to pass in imaginary stride rather than compute them locally. - Added support for 3m2 and 3m3 algorithms to frame/ind, including 3m2- and 3m3-specific virtual micro-kernels. - Added special gemm macro-kernels to support 3m2 and 3m3. - Added support for 3m2 and 3m3 to testsuite. - Corrected the type of the panel dimension (pd_) in various macro- kernels from inc_t to dim_t. - Renamed many functions defined in bli_blocksize.c. - Moved most induced-related macro defs from frame/include to frame/ind/include. - Updated the _ukernel.c files so that the micro-kernel function pointers are obtained from the func_t objects rather than the cpp macros that define the function names. - Updated test/3m4m driver, Makefile, and run script.	2015-04-01 10:44:54 -05:00
Field G. Van Zee	8d5169ccda	Fixed bug in release of mem_t buffer. Details: - Fixed a bug that affects all level-2 and level-3 blocked variants. The bug only manifested, however, if the packing of operands (A and B in gemm, for example) spanned multiple nodes in the control tree. Until recently, the main consumers of packm were level-3 operations, all of which packed both input operands from blocked variant 1 (B outside of the loop, and A within the loop). This particular usage masked a flaw in the code whereby bli_obj_release_pack() would always release the underlying mem_t buffer (provided it was allocated), even if the buffer was not allocated in the current variant. This has been fixed by replacing all calls to bli_obj_release_pack() with calls to a new function, bli_packm_release(), which takes the same control tree node argument passed into the object's corresponding call to packm_init() or packv_init(). bli_packm_release() then proceeds to invoke bli_obj_release_pack() only if the control tree node indicates that packing was requested. Thanks to Devangi Parikh for identifying this bug.	2015-03-18 11:38:08 -05:00
Field G. Van Zee	441d47542a	Renamed 3m and 4m symbols/macros to 3mi and 4mi. Details: - Renamed several variables and macros from 3m/4m to 3mi/4mi. This is because those packing schemas were always implicitly "interleaved". This new naming scheme will make way for new schemas that separate instead of interleave the real and imaginary (and summed) parts. - Expanded the pack format sub-field of the pack schema field of the info_t to 4 bits (from 3). This will allow for more schema types going forward. - Removed old _cntl.c files for herk3m, herk4m, trmm3m, trmm4m.	2015-02-19 17:06:10 -06:00
Field G. Van Zee	650d2a6ff2	Added initial support for imaginary stride. Details: - Added an imaginary stride field ("is") to obj_t. - Renamed bli_obj_set_incs() macro to bli_obj_set_strides(). - Defined bli_obj_imag_stride() and bli_obj_set_imag_stride() and added invocations in key locations. - Added some basic error-checking related to imaginary stride. - For now, imaginary stride will not be exposed into the most-used BLIS APIs such as bli_obj_create(), and certainly not the computational APIs such as bli_dgemm().	2015-02-09 14:59:20 -06:00
Field G. Van Zee	4674ca8cff	Extended newly relaxed KC to hemm, symm. Details: - These changes were intended for the previous commit. - Defined bli_gemm_determine_kc_[fb](), which determine blocksizes for gemm-based operations, taking special care to "nudge" the kc dimension up to a multiple of MR or NR for hemm and symm operations, as needed. - Changed bli_gemm_blk_var3f.c to call bli_gemm_determine_kc_f() instead of bli_determine_blocksize_f(). - Comment updates to bli_trmm_blocksize.c, bli_trsm_blocksize.c.	2014-10-23 10:50:59 -05:00
Field G. Van Zee	e9899be090	Added high-level implementations of 4m, 3m. Details: - Added "4mh" and "3mh" APIs, which implement the 4m and 3m methods at high levels, respectively. APIs for trmm and trsm were NOT added due to the fact that these approaches are inherently incompatible with implementing 4m or 3m at high levels (because the input right-hand side matrix is overwritten). - Added 4mh, 3mh virtual micro-kernels, and updated the existing 4m and 3m so that all are stylistically consistent. - Added new "rih" packing kernels (both low-level and structure-aware) to support both 4mh and 3mh. - Defined new pack_t schemas to support real-only, imaginary-only, and real+imaginary packing formats. - Added various level0 scalar macros to support the rih packm kernels. - Minor tweaks to trmm macro-kernels to facilitate 4mh and 3mh. - Added the ability to enable/disable 4mh, 3m, and 3mh, and adjusted level-3 front-ends to check enabledness of 3mh, 3m, 4mh, and 4m (in that order) and execute the first one that is enabled, or the native implementation if none are enabled. - Added implementation query functions for each level-3 operation so that the user can query a string that describes the implementation that is currently enabled. - Updated test suite to output implementation types for each level-3 operation, as well as micro-kernel types for each of the five micro-kernels. - Renamed BLIS_ENABLE_?COMPLEX_VIA_4M macros to _ENABLE_VIRTUAL_?COMPLEX. - Fixed an obscure bug when packing Hermitian matrices (regular packing type) whereby the diagonal elements of the packed micro-panels could get tainted if the source matrix's imaginary diagonal part contained garbage.	2014-09-16 18:19:32 -05:00
Field G. Van Zee	9dc9b44a05	Renamed bli_obj_pack_status() to _pack_schema(). Details: - Renamed the bli_obj_pack_status() macro to bli_obj_pack_schema() in order to help avoid confusion as to what the macro returns.	2014-09-11 12:03:28 -05:00
Field G. Van Zee	7fc48a7d92	Combined 4m/3m bits into an expanded bitfield. Details: - Combined the 4m/3m bits into an expanded bitfield, which will encode the packing "format" of the micro-panels. This will allow for more easily and compactly encoding additional formats. - Other minor comment/whitespace updates to bli_type_defs.h. - Updated bli_obj_macro_defs.h and bli_param_macro_defs.h to use the new format bitfield. - Comment update to bli_kernel_post_macro_defs.h. - Whitespace changes to bli_kernel_3m_macro_defs.h, _4m_macro_defs.h.	2014-08-23 16:50:58 -05:00
Field G. Van Zee	98ec95877a	Corrected comment for _obj_is_[row|col]_stored(). Details: - Fixed a mistake in the comments introduced in the previous commit for bli_obj_is_row_stored() and bli_obj_is_col_stored().	2014-08-07 18:28:32 -05:00
Field G. Van Zee	43d5e419e1	Reverted _obj_is_[row|col]_stored() macros. Details: - Rolled back recent changes to bli_obj_is_row_stored() and bli_obj_is_col_stored() so that those macros now only inspect the strides (row or column). It turns out that the more sophisticated definitions introduced in `a51e32e` are not necessary, because these "obj" macros are virtually never used on packed matrices, and when they are, they can use bli_obj_is_[row|col]_packed() macros, which inspect the info bitfield.	2014-08-07 18:20:40 -05:00
Field G. Van Zee	383631b514	Redefined bit field macros with bitshift operator. Details: - Redefined many of the macros that define bit fields and bit values in the obj_t info field using the bitshift operator (<<). This makes it easier to reorder bit fields, or expand existing bit fields, or add new fields. The bitshifting should be evaluated by the compiler at compile-time.	2014-07-31 14:51:48 -05:00
Field G. Van Zee	137143345d	Reimplemented unit blocksize fix in prev commit. Details: - Instead of inferring the storage format of the micro-panels from within the packm variants, we now pass in a bool_t value that denotes whether the packed matrix contains row-stored column panels or column-stored row panels. This value can then be tested more easily inside the main packm variant loop. - Renumbered pack_t schema values in bli_type_defs.h so that there are now five bits, each with different meaning: - 4: packed or not packed? - 3: packed for 3m? - 2: packed for 4m? - 1: packed to panels? - 0: stored by rows or columns? - Added new macros that test for status of above bits in schema bit subfield, and renamed some existing macros related to 4m/3m.	2014-07-31 12:12:45 -05:00
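The five schema bits listed in the 137143345d entry, written with the bitshift style adopted in the 383631b514 entry, might look roughly like the sketch below. All SK_* macro names and sk_* functions are hypothetical, not the identifiers used in bli_type_defs.h; only the bit positions (4 down to 0) are taken from the commit message. Which value of bit 0 denotes rows and which denotes columns is not stated there, so the sketch only exposes the raw bit.

```c
#include <stdint.h>

/* Bit positions as listed in the commit message (names are hypothetical). */
#define SK_ROWCOL_SHIFT  0   /* stored by rows or columns? */
#define SK_PANEL_SHIFT   1   /* packed to panels?          */
#define SK_4M_SHIFT      2   /* packed for 4m?             */
#define SK_3M_SHIFT      3   /* packed for 3m?             */
#define SK_PACKED_SHIFT  4   /* packed or not packed?      */

/* Defining each bit with a shift keeps reordering or widening fields cheap;
   the shifts are constant expressions evaluated at compile time. */
#define SK_ROWCOL_BIT  (1u << SK_ROWCOL_SHIFT)
#define SK_PANEL_BIT   (1u << SK_PANEL_SHIFT)
#define SK_4M_BIT      (1u << SK_4M_SHIFT)
#define SK_3M_BIT      (1u << SK_3M_SHIFT)
#define SK_PACKED_BIT  (1u << SK_PACKED_SHIFT)

/* Simple predicates that test individual bits of a schema value. */
static int sk_is_packed(uint32_t schema)    { return (schema & SK_PACKED_BIT) != 0; }
static int sk_is_3m_packed(uint32_t schema) { return (schema & SK_3M_BIT) != 0; }
static int sk_is_4m_packed(uint32_t schema) { return (schema & SK_4M_BIT) != 0; }
static int sk_is_panel(uint32_t schema)     { return (schema & SK_PANEL_BIT) != 0; }
static int sk_rowcol_bit(uint32_t schema)   { return (schema & SK_ROWCOL_BIT) != 0; }
```

A schema for "packed, to panels, for 4m" would then be SK_PACKED_BIT | SK_PANEL_BIT | SK_4M_BIT, and moving a field only requires changing its shift constant.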
Field G. Van Zee	a51e32ec06	Fixed unit register blocksize brokenness. Details: - Fixed a breakdown in BLIS's ability to differentiate between row-stored and column-stored micro-panels when MR or NR is unit. When either register blocksize (or both) is equal to one, inspecting the strides of the affected packed micro-panel is no longer sufficient to determine whether the micro-panel is a row-stored column panel or a column-stored row panel (because both strides are unit). At that point, dimension information is necessary when invoking the bli_is_row_stored_f() and bli_is_col_stored_f() macros (and their "obj" counterparts). Thanks to Ilya Polkovnichenko for reporting this bug. - Added panel dimensions (m and n) to obj_t, which are set in packm_init() and then passed into the blocked variants to support the aforementioned update.	2014-07-30 10:41:48 -05:00
Field G. Van Zee	7ed415824d	Updated copyright headers (continued). Details: - Inserted "at Austin" into third clause of license declarations. Meant to include this change in previous commit.	2014-07-14 16:14:33 -05:00
Field G. Van Zee	5c2c6c8561	Updated copyright headers to contain "at Austin". Details: - Updated copyright headers to include "at Austin" in the name of the University of Texas. - Updated the copyright years of a few headers to 2014 (from 2011 and 2012).	2014-07-14 16:05:03 -05:00
Field G. Van Zee	c663ce3b51	Fixed various bugs when C99 complex is enabled. Details: - Fixed various bugs in packm_*_cxk(), the 4m/3m micro-kernels, and elsewhere in the framework that were not yet set up to work properly when BLIS_ENABLE_C99_COMPLEX is defined in bli_config.h - Extensive changes to f2c-derived files in frame/compat/f2c to allow C99 complex storage. Most of these changes center around accessing real and imaginary components via bli_?real()/bli_?imag() accessor macros, and setting of values via bli_?sets() assignment macros. (Thanks to Vladimir Sukarev for pointing out that _ENABLE_C99_COMPLEX was broken.)	2014-02-27 16:32:57 -06:00
Field G. Van Zee	6363a9f658	Added level-3 support for complex via 4m-/3m. Details: - Added the ability to induce complex domain level-3 operations via new virtual complex micro-kernels which are implemented via only real domain micro-kernels. Two new implementations are provided: 4m and 3m. 4m implements complex matrix multiplication in terms of four real matrix multiplications, whereas 3m uses only three and thus is capable of even higher (than peak) performance. However, the 3m method has somewhat weaker numerical properties, making it less desirable in general. - Further refined packing routines, which were recently revamped, and added packing functionality for 4m and 3m. - Some modifications to trmm and trsm macro-kernels to facilitate indexing into micro-panels which were packed for 4m/3m virtual kernels. - Added 4m and 3m interfaces for each level-3 operation. - Various other minor changes to facilitate 4m/3m methods.	2014-02-19 17:00:52 -06:00
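The difference between four and three real multiplications, as described in the 6363a9f658 entry, can be seen in scalar form. The sketch below is only an illustration of the arithmetic idea: BLIS applies it to real matrix products inside virtual micro-kernels, not to individual scalars, and the function names cmul_4m_sketch() and cmul_3m_sketch() are hypothetical.

```c
/* Complex product (ar + ai*i) * (br + bi*i) using four real
   multiplications, the scalar analogue of the "4m" approach. */
static void cmul_4m_sketch(double ar, double ai, double br, double bi,
                           double *cr, double *ci)
{
    *cr = ar * br - ai * bi;
    *ci = ar * bi + ai * br;
}

/* Same product using three real multiplications, the scalar analogue of
   the "3m" approach: the imaginary part is recovered from the product of
   the summed parts minus the two products already computed. */
static void cmul_3m_sketch(double ar, double ai, double br, double bi,
                           double *cr, double *ci)
{
    double p1 = ar * br;
    double p2 = ai * bi;
    double p3 = (ar + ai) * (br + bi);
    *cr = p1 - p2;
    *ci = p3 - p1 - p2;
}
```

The 3m form trades one multiplication for extra additions and subtractions; the subtraction of p1 and p2 from p3 is a plausible place for cancellation error to enter, which would be consistent with the entry's remark about weaker numerical properties.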
Field G. Van Zee	2cb13600f9	Updated year in copyright headers to 2014.	2014-01-03 12:29:13 -06:00
Field G. Van Zee	b444489f10	Added new "attached" scalar representation. Details: - Added infrastructure to support a new scalar representation, whereby every object contains an internal scalar that defaults to 1.0. This facilitates passing scalars around without having to house them in separate objects. These "attached" scalars are stored in the internal atom_t field of the obj_t struct, and are always stored to be the same datatype as the object to which they are attached. Level-3 variants no longer take scalar arguments, however, level-3 internal back-ends still do; this is so that the calling function can perform subproblems such as C := C - alpha * A * B on-the-fly without needing to change either of the scalars attached to A or B. - Removed scalar argument from packm_int(). - Observe and apply attached scalars in scalm_int(), and removed scalar from interface of scalm_unb_var1(). - Renamed the following functions (and corresponding invocations): bli_obj_init_scalar_copy_of() -> bli_obj_scalar_init_detached_copy_of() bli_obj_init_scalar() -> bli_obj_scalar_init_detached() bli_obj_create_scalar_with_attached_buffer() -> bli_obj_create_1x1_with_attached_buffer() bli_obj_scalar_equals() -> bli_obj_equals() - Defined new functions: bli_obj_scalar_detach() bli_obj_scalar_attach() bli_obj_scalar_apply_scalar() bli_obj_scalar_reset() bli_obj_scalar_has_nonzero_imag() bli_obj_scalar_equals() - Placed all bli_obj_scalar_* functions in a new file, bli_obj_scalar.c. - Renamed the following macros: bli_obj_scalar_buffer() -> bli_obj_buffer_for_1x1() bli_obj_is_scalar() -> bli_obj_is_1x1() - Defined new macros to set and copy internal scalars between objects: bli_obj_set_internal_scalar() bli_obj_copy_internal_scalar() - In level-3 internal back-ends, added conditional blocks where alpha and beta are checked for non-unit-ness. 
Those values for alpha and beta are applied to the scalars attached to aliases of A/B/C, as appropriate, before being passed into the variant specified by the control tree. - In level-3 blocked variants, pass BLIS_ONE into subproblems instead of alpha and/or beta. - In level-3 macro-kernels, changed how scalars are obtained. Now, scalars attached to A and B are multiplied together to obtain alpha, while beta is obtained directly from C. - In level-3 front-ends, removed old function calls meant to provide future support for mixed domain/precision. These can be added back later once that functionality is given proper treatment. Also, removed the creating of copy-casts of alpha and beta since typecasting of scalars is now implicitly handled in the internal back-ends when alpha and beta are applied to the attached scalars.	2013-12-03 16:08:30 -06:00
Field G. Van Zee	9552e6ee82	Removed optional scaling from packm control tree. Details: - Removed does_scale field from packm control tree node and bli_packm_cntl_obj_create() interface. Adjusted all invocations of _cntl_obj_create() accordingly. - Redefined/renamed macros that are used in aliasing so that now, bli_obj_alias_to() does a full alias (shallow copy) while bli_obj_alias_for_packing() does a partial alias that preserves the pack_mem-related fields of the aliasing (destination) object. - Removed bli_trmm3_cntl.c, .h after realizing that the trmm control tree will work just fine for bli_trmm3(). - Removed some commented vestiges of the typecasting functionality needed to support heterogeneous datatypes.	2013-11-24 11:40:31 -06:00

64 Commits