amd/blis - blis - Public git mirror

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-05-11 01:30:00 +00:00

Author	SHA1	Message	Date
kdevraje	13806ba3b0	This check in has changes w.r.t Copyright information, which is changed to (start year) - 2019 Change-Id: Ide3c8f7172210b8d3538d3c36e88634ab1ba9041	2019-05-27 16:24:43 +05:30
Field G. Van Zee	0645f239fb	Remove UT-Austin from copyright headers' clause 3. Details: - Removed explicit reference to The University of Texas at Austin in the third clause of the license comment blocks of all relevant files and replaced it with a more all-encompassing "copyright holder(s)". - Removed duplicate words ("derived") from a few kernels' license comment blocks. - Homogenized license comment block in kernels/zen/3/bli_gemm_small.c with format of all other comment blocks.	2018-12-04 14:31:06 -06:00
Field G. Van Zee	49d3f9fcbb	Merge branch 'master' into dev	2018-10-17 18:00:40 -05:00
Field G. Van Zee	71c5832d5f	Consolidated slab/rr-explicit level-3 macrokernels. Details: - Consolidated the sl.c and rr.c level-3 macrokernels into a single file per sl/rr pair, with those files named as they were before `c92762e`. The consolidation does not take away the option of using slab or round-robin assignment of micropanels to threads; it merely hides the choice within the definitions of functions such as bli_thread_range_jrir(), bli_packm_my_iter(), and bli_is_last_iter() rather than expose that choice explicitly in the code. The choice of slab or rr is not always hidden, however; there are some cases involving herk and trmm, for example, that require some part of the computation to use rr unconditionally. (The --thread-part-jrir option controls the partitioning in all other cases.) - Note: Originally, the sl and rr macrokernels were separated out for clarity. However, aside from the additional binary code bloat, I later deemed that clarity not worth the price of maintaining the additional (mostly similar) codes.	2018-10-17 14:11:01 -05:00
Field G. Van Zee	5fec95b99f	Implemented mixed-datatype support for gemm. Details: - Implemented support for gemm where A, B, and C may have different storage datatypes, as well as a computational precision (and implied computation domain) that may be different from the storage precision of either A or B. This results in 128 different combinations, all which are implemented within this commit. (For now, the mixed-datatype functionality is only supported via the object API.) If desired, the mixed-datatype support may be disabled at configure-time. - Added a memory-intensive optimization to certain mixed-datatype cases that requires a single m-by-n matrix be allocated (temporarily) per call to gemm. This optimization aims to avoid the overhead involved in repeatedly updating C with general stride, or updating C after a typecast from the computation precision. This memory optimization may be disabled at configure-time (provided that the mixed-datatype support is enabled in the first place). - Added support for testing mixed-datatype combinations to testsuite. The user may test gemm with mixed domains, precisions, both, or neither. - Added a standalone test driver directory for building and running mixed-datatype performance experiments. - Defined a new variation of castm, castnzm, which operates like castm except that imaginary values are not touched when casting a real operand to a complex operand. (By contrast, in these situations castm sets the imaginary components of the destination matrix to zero.) - Defined bli_obj_imag_is_zero() and substituted calls in lieu of all usages of bli_obj_imag_equals() that tested against BLIS_ZERO, and also simplified the implementation of bli_obj_imag_equals(). - Fixed bad behavior from bli_obj_is_real() and bli_obj_is_complex() when given BLIS_CONSTANT objects. - Disabled dt_on_output field in auxinfo_t structure as well as all accessor functions. Also commented out all usage of accessor functions within macrokernels. (Typecasting in the microkernel is still feasible, though probably unrealistic for now given the additional complexity required.) - Use void function pointer type (instead of void*) for storing function pointers in bli_l0_fpa.c. - Added documentation for using gemm with mixed datatypes in docs/MixedDatatypes.md and example code in examples/oapi/11gemm_md.c. - Defined level-1d operation xpbyd and level-1m operation xpbym. - Added xpbym test module to testsuite. - Updated frame/include/bli_x86_asm_macros.h with additional macros (courtsey of Devin Matthews).	2018-10-15 16:37:39 -05:00
Field G. Van Zee	c92762ecdc	Added option of slab or rr partitioning in jr/ir. Details: - Updated existing macrokernel function names and definitions to explicitly use slab assignment of micropanels to threads, then created duplicate versions of macrokernels that explicitly use round-robin assignment instead of slab. NOTE: As in `ac18949`, trsm_r macrokernels were not substantially updated in this commit because they are currently disabled in bli_trsm_front.c. - Updated existing packing function (in blk_packm_blk_var1.c) to explicitly use slab partitioning, and then duplicated for round-robin. - Updated control tree initialization to use the appropriate macrokernel and packm function pointers depending on which method (slab or rr) was enabled at configure-time. - Updated configure script to accept new --thread-part-jrir=[slab\|rr] option (-m [slab\|rr] for short), which allows the user to explicitly request either slab or round-robin assignment (partitioning) of micropanels to threads. - Updated sandbox/ref99 according to above changes. - Minor updates to build/add-copyright.py.	2018-10-07 20:30:32 -05:00
Field G. Van Zee	4fa4cb0734	Trivial comment header updates. Details: - Removed four trailing spaces after "BLIS" that occurs in most files' commented-out license headers. - Added UT copyright lines to some files. (These files previously had only AMD copyright lines but were contributed to by both UT and AMD.) - In some files' copyright lines, expanded 'The University of Texas' to 'The University of Texas at Austin'. - Fixed various typos/misspellings in some license headers.	2018-08-29 18:06:41 -05:00
Field G. Van Zee	71f9787195	Whitespace changes to macrokernels' func ptr defs.	2018-07-25 15:55:36 -05:00
Field G. Van Zee	ecbebe7c2e	Defined rntm_t to relocate cntx_t.thrloop (#235 ). Details: - Defined a new struct datatype, rntm_t (runtime), to house the thrloop field of the cntx_t (context). The thrloop array holds the number of ways of parallelism (thread "splits") to extract per level-3 algorithmic loop until those values can be used to create a corresponding node in the thread control tree (thrinfo_t structure), which (for any given level-3 invocation) usually happens by the time the macrokernel is called for the first time. - Relocating the thrloop from the cntx_t remedies a thread-safety issue when invoking level-3 operations from two or more application threads. The race condition existed because the cntx_t, a pointer to which is usually queried from the global kernel structure (gks), is supposed to be a read-only. However, the previous code would write to the cntx_t's thrloop field after it had been queried, thus violating its read-only status. In practice, this would not cause a problem when a sequential application made a multithreaded call to BLIS, nor when two or more application threads used the same parallelization scheme when calling BLIS, because in either case all application theads would be using the same ways of parallelism for each loop. The true effects of the race condition were limited to situations where two or more application theads used different parallelization schemes for any given level-3 call. - In remedying the above race condition, the application or calling library can now specify the parallelization scheme on a per-call basis. All that is required is that the thread encode its request for parallelism into the rntm_t struct prior to passing the address of the rntm_t to one of the expert interfaces of either the typed or object APIs. This allows, for example, one application thread to extract 4-way parallelism from a call to gemm while another application thread requests 2-way parallelism. Or, two threads could each request 4-way parallelism, but from different loops. - A rntm_t* parameter has been added to the function signatures of most of the level-3 implementation stack (with the most notable exception being packm) as well as all level-1v, -1d, -1f, -1m, and -2 expert APIs. (A few internal functions gained the rntm_t* parameter even though they currently have no use for it, such as bli_l3_packm().) This required some internal calls to some of those functions to be updated since BLIS was already using those operations internally via the expert interfaces. For situations where a rntm_t object is not available, such as within packm/unpackm implementations, NULL is passed in to the relevant expert interfaces. This is acceptable for now since parallelism is not obtained for non-level-3 operations. - Revamped how global parallelism is encoded. First, the conventional environment variables such as BLIS_NUM_THREADS and BLIS__NT are only read once, at library initialization. (Thanks to Nathaniel Smith for suggesting this to avoid repeated calls getenv(), which can be slow.) Those values are recorded to a global rntm_t object. Public APIs, in bli_thread.c, are still available to get/set these values from the global rntm_t, though now the "set" functions have additional logic to ensure that the values are set in a synchronous manner via a mutex. If/when NULL is passed into an expert API (meaning the user opted to not provide a custom rntm_t), the values from the global rntm_t are copied to a local rntm_t, which is then passed down the function stack. Calling a basic API is equivalent to calling the expert APIs with NULL for the cntx and rntm parameters, which means the semantic behavior of these basic APIs (vis-a-vis multithreading) is unchanged from before. - Renamed bli_cntx_set_thrloop_from_env() to bli_rntm_set_ways_for_op() and reimplemented, with the function now being able to treat the incoming rntm_t in a manner agnostic to its origin--whether it came from the application or is an internal copy of the global rntm_t. - Removed various global runtime APIs for setting the number of ways of parallelism for individual loops (e.g. bli_thread_set__nt()) as well as the corresponding "get" functions. The new model simplifies these interfaces so that one must either set the total number of threads, OR set all of the ways of parallelism for each loop simultaneously (in a single function call). - Updated sandbox/ref99 according to above changes. - Rewrote/augmented docs/Multithreading.md to document the three methods (and two specific ways within each method) of requesting parallelism in BLIS. - Removed old, disabled code from bli_l3_thrinfo.c. - Whitespace changes to code (e.g. bli_obj.c) and docs/BuildSystem.md.	2018-07-17 18:37:32 -05:00
Field G. Van Zee	87db5c048e	Changed usage of virtual microkernel slots in cntx. Details: - Changed the way virtual microkernels are handled in the context. Previously, there were query routines such as bli_cntx_get_l3_ukr_dt() which returned the native ukernel for a datatype if the method was equal to BLIS_NAT, or the virtual ukernel for that datatype if the method was some other value. Going forward, the context native and virtual ukernel slots will both be initialized to native ukernel function pointers for native execution, and for non-native execution the virtual ukernel pointer will be something else. This allows us to always query the virtual ukernel slot (from within, say, the macrokernel) without needing any logic in the query routine to decide which function pointer (native or virtual) to return. (Essentially, the logic has been shifted to init-time instead of compute-time.) This scheme will also allow generalized virtual ukernels as a way to insert extra logic in between the macrokernel and the native microkernel. - Initialize native contexts (in bli_cntx_ref.c) with native ukernel function addresses stored to the virtual ukernel slots pursuant to the above policy change. - Renamed all static functions that were native/virtual-ambiguous, such as bli_cntx_get_l3_ukr_dt() or bli_cntx_l3_ukr_prefers_cols_dt() pursuant to the above polilcy change. Those routines now use the substring "get_l3_vir_ukr" in their name instead of "get_l3_ukr". All of these functions were static functions defined in bli_cntx.h, and most uses were in level-3 front-ends and macrokernels. - Deprecated anti_pref bool_t in context, along with related functions such as bli_cntx_l3_ukr_eff_dislikes_storage_of(), now that 1m's panel-block execution is disabled.	2018-06-12 19:38:37 -05:00
Field G. Van Zee	9588625c43	Renamed "next micropanel" macros in _l3_thrinfo.h. Details: - Renamed several macros defined in bli_l3_thrinfo.h designed to compute the values of a_next and b_next to insert into an auxinfo_t struct in level-3 macrokernels. (Previously, the macros did not use a bli_ prefix.) - Updated instances of above macro usage within various macrokernels.	2018-05-30 15:19:53 -05:00
Field G. Van Zee	4b36e85be9	Converted function-like macros to static functions. Details: - Converted most C preprocessor macros in bli_param_macro_defs.h and bli_obj_macro_defs.h to static functions. - Reshuffled some functions/macros to bli_misc_macro_defs.h and also between bli_param_macro_defs.h and bli_obj_macro_defs.h. - Changed obj_t-initializing macros in bli_type_defs.h to static functions. - Removed some old references to BLIS_TWO and BLIS_MINUS_TWO from bli_constants.h. - Whitespace changes in select files (four spaces to single tab).	2018-05-08 14:26:30 -05:00
Field G. Van Zee	75d0d1057d	Renamed various datatype-related macros/functions. Details: - Renamed the following macros in bli_obj_macro_defs.h and bli_param_macro_defs.h: - bli_obj_datatype() -> bli_obj_dt() - bli_obj_target_datatype() -> bli_obj_target_dt() - bli_obj_execution_datatype() -> bli_obj_exec_dt() - bli_obj_set_datatype() -> bli_obj_set_dt() - bli_obj_set_target_datatype() -> bli_obj_set_target_dt() - bli_obj_set_execution_datatype() -> bli_obj_set_exec_dt() - bli_obj_datatype_proj_to_real() -> bli_obj_dt_proj_to_real() - bli_obj_datatype_proj_to_complex() -> bli_obj_dt_proj_to_complex() - bli_datatype_proj_to_real() -> bli_dt_proj_to_real() - bli_datatype_proj_to_complex() -> bli_dt_proj_to_complex() - Renamed the following functions in bli_obj.c: - bli_datatype_size() -> bli_dt_size() - bli_datatype_string() -> bli_dt_string() - bli_datatype_union() -> bli_dt_union() - Removed a pair of old level-1f penryn intrinsics kernels that were no longer in use.	2018-04-30 14:57:33 -05:00
Field G. Van Zee	39be59f2a8	Replaced several macros with static function APIs. Details: - Reimplemented several sets of get/set-style preprocessor macros with static functions, including those in the following frame/base headers: auxinfo, cntl, mbool, mem, membrk, opid, and pool. A few headers in frame/thread were touched as well: mutex_*, thrcomm, and thrinfo.	2017-12-07 17:35:20 -06:00
Field G. Van Zee	bdc0a264d2	Adjusted stride selection of ct in macrokernels. Details: - Updated the changes introduced in `618f433` so that the strides of the temporary microtile ct used in the macrokernels is determined based on the storage preference of the microkernel (via the new functions below), rather than the strides of c. In almost all cases, presently, this change results in no net effect, as a high-level optimization in the _front() functions aligns the storage of c to that of the microkernel's preference. However, I encountered some cases where this is not always the case in some development code that has yet to be committed, and therefore I'm generalizing the framework code in advance. - Defined two new functions in bli_cntx.c: bli_cntx_l3_ukr_prefers_rows_dt() bli_cntx_l3_ukr_prefers_cols_dt() which return bool_t's based on the current micro-kernel's storage preferences. For induced methods, the preference of the underlying real domain microkernel is returned. - Updated definition of bli_cntx_l3_ukr_dislikes_storage_of(), and by proxy bli_cntx_l3_ukr_prefers_storage_of(), to be in terms of the above functions, rather than querying the preferences of the native microkernel directly (which did the wrong thing for induced methods).	2016-11-16 14:13:08 -06:00
Field G. Van Zee	618f4331eb	Align strides of ct in macrokernels to that of c. Details: - Previously, rs_ct and cs_ct, the strides of the temporary microtile used primarily in the macrokernels' edge case handling, were unconditionally set to 1 and MR, respectively. However, Devin Matthews noted that this ought to be changed so that the strides of ct were in agreement with the strides of C. (That is, if C was row-stored, then ct should be accessed as by rows as well.) The implicit assumption is that the strides of C have already been adjusted, via induced transposition, if the storage preference of the microkernel is at odds with the storage of C. So, if the microkernel prefers row storage, the macrokernel's interior cases would present row-stored (ideal) microkernel subproblems to the microkernel, but for edge cases, it would still see column-stored subproblems (not ideal). This commit fixes this issue. Thanks to Devin for his suggestion.	2016-10-31 14:40:51 -05:00
Field G. Van Zee	701b9aa3ff	Redesigned control tree infrastructure. Details: - Altered control tree node struct definitions so that all nodes have the same struct definition, whose primary fields consist of a blocksize id, a variant function pointer, a pointer to an optional parameter struct, and a pointer to a (single) sub-node. This unified control tree type is now named cntl_t. - Changed the way control tree nodes are connected, and what computation they represent, such that, for example, packing operations are now associated with nodes that are "inline" in the tree, rather than off- shoot braches. The original tree for the classic Goto gemm algorithm was expressed (roughly) as: blk_var2 -> blk_var3 -> blk_var1 -> ker_var2 \| \| -> packb -> packa and now, the same tree would look like: blk_var2 -> blk_var3 -> packb -> blk_var1 -> packa -> ker_var2 Specifically, the packb and packa nodes perform their respective packing operations and then recurse (without any loop) to a subproblem. This means there are now two kinds of level-3 control tree nodes: partitioning and non-partitioning. The blocked variants are members of the former, because they iteratively partition off submatrices and perform suboperations on those partitions, while the packing variants belong to the latter group. (This change has the effect of allowing greatly simplified initialization of the nodes, which previously involved setting many unused node fields to NULL.) - Changed the way thrinfo_t tree nodes are arranged to mirror the new connective structure of control trees. That is, packm nodes are no longer off-shoot branches of the main algorithmic nodes, but rather connected "inline". - Simplified control tree creation functions. Partitioning nodes are created concisely with just a few fields needing initialization. By contrast, the packing nodes require additional parameters, which are stored in a packm-specific struct that is tracked via the optional parameters pointer within the control tree struct. (This parameter struct must always begin with a uint64_t that contains the byte size of the struct. This allows us to use a generic function to recursively copy control trees.) gemm, herk, and trmm control tree creation continues to be consolidated into a single function, with the operation family being used to select among the parameter-agnostic macro-kernel wrappers. A single routine, bli_cntl_free(), is provided to free control trees recursively, whereby the chief thread within a groups release the blocks associated with mem_t entries back to the memory broker from which they were acquired. - Updated internal back-ends, e.g. bli_gemm_int(), to query and call the function pointer stored in the current control tree node (rather than index into a local function pointer array). Before being invoked, these function pointers are first cast to a gemm_voft (for gemm, herk, or trmm families) or trsm_voft (for trsm family) type, which is defined in frame/3/bli_l3_var_oft.h. - Retired herk and trmm internal back-ends, since all execution now flows through gemm or trsm blocked variants. - Merged forwards- and backwards-moving variants by querying the direction from routines as a function of the variant's matrix operands. gemm and herk always move forward, while trmm and trsm move in a direction that is dependent on which operand (a or b) is triangular. - Added functions bli_thread_get_range_mdim(), bli_thread_get_range_ndim(), each of which takes additional arguments and hides complexity in managing the difference between the way ranges are computed for the four families of operations. - Simplified level-3 blocked variants according to the above changes, so that the only steps taken are: 1. Query partitioning direction (forwards or backwards). 2. Prune unreferenced regions, if they exist. 3. Determine the thread partitioning sub-ranges. <begin loop> 4. Determine the partitioning blocksize (passing in the partitioning direction) 5. Acquire the curren iteration's partitions for the matrices affected by the current variants's partitioning dimension (m, k, n). 6. Call the subproblem. <end loop> - Instantiate control trees once per thread, per operation invocation. (This is a change from the previous regime in which control trees were treated as stateless objects, initialized with the library, and shared as read-only objects between threads.) This once-per-thread allocation is done primarily to allow threads to use the control tree as as place to cache certain data for use in subsequent loop iterations. Presently, the only application of this caching is a mem_t entry for the packing blocks checked out from the memory broker (allocator). If a non-NULL control tree is passed in by the (expert) user, then the tree is copied by each thread. This is done in bli_l3_thread_decorator(), in bli_thrcomm_*.c. - Added a new field to the context, and opid_t which tracks the "family" of the operation being executed. For example, gemm, hemm, and symm are all part of the gemm family, while herk, syrk, her2k, and syr2k are all part of the herk family. Knowing the operation's family is necessary when conditionally executing the internal (beta) scalar reset on on C in blocked variant 3, which is needed for gemm and herk families, but must not be performed for the trmm family (because beta has only been applied to the current row-panel of C after the first rank-kc iteration). - Reexpressed 3m3 induced method blocked variant in frame/3/gemm/ind to comform with the new control tree design, and renamed the macro- kernel codes corresponding to 3m2 and 4m1b. - Renamed bli_mem.c (and its APIs) to bli_memsys.c, and renamed/relocated bli_mem_macro_defs.h from frame/include to frame/base/bli_mem.h. - Renamed/relocated bli_auxinfo_macro_defs.h from frame/include to frame/base/bli_auxinfo.h. - Fixed a minor bug whereby the storage-to-ukr-preference matching optimization in the various level-3 front-ends was not being applied properly when the context indicated that execution would be via an induced method. (Before, we always checked the native micro-kernel corresponding to the datatype being executed, whereas now we check the native micro-kernel corresponding to the datatype's real projection, since that is the micro-kernel that is actually used by induced methods. - Added an option to the testsuite to skip the testing of native level-3 complex implementations. Previously, it was always tested, provided that the c/z datatypes were enabled. However, some configurations use reference micro-kernels for complex datatypes, and testing these implementations can slow down the testsuite considerably.	2016-08-26 19:04:45 -05:00
Field G. Van Zee	31def12e26	First phase of control tree redesign. Details: - These changes constitute the first set of changes in preparation to revamping the structure and use of control trees in BLIS. Modifications in this commit don't affect the control tree code yet, but rather lay the groundwork. - Defined wrappers for the following functions, where the the wrappers each take a direction parameter of a new enumerated type (BLIS_BWD or BLIS_FWD), dir_t, and executes the correct underlying function. - bli_acquire_mpart_() and _vpart_() - bli__determine_kc_[fb]() - bli_thread_get_range_() and bli_thread_get_range_weighted_*() - Consolidated all 'f' (forwards-moving) and 'b' (backwards-moving) blocked variants for trmm and trsm, and renamed gemm and herk variants accordingly. The direction is now queried via routines such as bli_trmm_direct(), which deterines the direction from the implied side and uplo parameters. For gemm and herk, it is uncondtionally BLIS_FWD. - Defined wrappers to parameter-specific macrokernels for herk, trmm, and trsm, e.g. bli_trmm_xx_ker_var2(), that execute the correct underlying macrokernel based on the implied parameters. The same logic used to choose the dir_t in _direct() functions is used here. - Simplified the function pointer arrays in _int() functions given the consolidation and dir_t querying mentioned above. - Function signature (whitespace) reformatting for various functions. - Removed old code in various 'old' directories.	2016-06-30 15:19:20 -05:00
Field G. Van Zee	096895c5d5	Reorganized code, APIs related to multithreading. Details: - Reorganized code and renamed files defining APIs related to multithreading. All code that is not specific to a particular operation is now located in a new directory: frame/thread. Code is now organized, roughly, by the namespace to which it belongs (see below). - Consolidated all operation-specific _thrinfo_t object types into a single thrinfo_t object type. Operation-specific level-3 _thrinfo_t APIs were also consolidated, leaving bli_l3_thrinfo_() and bli_packm_thrinfo_() functions (aside from a few general purpose bli_thrinfo_() functions). - Renamed thread_comm_t object type to thrcomm_t. - Renamed many of the routines and functions (and macros) for multithreading. We now have the following API namespaces: - bli_thrinfo_(): functions related to thrinfo_t objects - bli_thrcomm_(): functions related to thrcomm_t objects. - bli_thread_(): general-purpose functions, such as initialization, finalization, and computing ranges. (For now, some macros, such as bli_thread_[io]broadcast() and bli_thread_[io]barrier() use the bli_thread_ namespace prefix, even though bli_thrinfo_ may be more appropriate.) - Renamed thread-related macros so that they use a bli_ prefix. - Renamed control tree-related macros so that they use a bli_ prefix (to be consistent with the thread-related macros that were also renamed). - Removed #undef BLIS_SIMD_ALIGN_SIZE from dunnington's bli_kernel.h. This #undef was a temporary fix to some macro defaults which were being applied in the wrong order, which was recently fixed.	2016-06-06 13:32:04 -05:00
Field G. Van Zee	537a1f4f85	Implemented runtime contexts and reorganized code. Details: - Retrofitted a new data structure, known as a context, into virtually all internal APIs for computational operations in BLIS. The structure is now present within the type-aware APIs, as well as many supporting utility functions that require information stored in the context. User- level object APIs were unaffected and continue to be "context-free," however, these APIs were duplicated/mirrored so that "context-aware" APIs now also exist, differentiated with an "_ex" suffix (for "expert"). These new context-aware object APIs (along with the lower-level, type- aware, BLAS-like APIs) contain the the address of a context as a last parameter, after all other operands. Contexts, or specifically, cntx_t object pointers, are passed all the way down the function stack into the kernels and allow the code at any level to query information about the runtime, such as kernel addresses and blocksizes, in a thread- friendly manner--that is, one that allows thread-safety, even if the original source of the information stored in the context changes at run-time; see next bullet for more on this "original source" of info). (Special thanks go to Lee Killough for suggesting the use of this kind of data structure in discussions that transpired during the early planning stages of BLIS, and also for suggesting such a perfectly appropriate name.) - Added a new API, in frame/base/bli_gks.c, to define a "global kernel structure" (gks). This data structure and API will allow the caller to initialize a context with the kernel addresses, blocksizes, and other information associated with the currently active kernel configuration. The currently active kernel configuration within the gks cannot be changed (for now), and is initialized with the traditional cpp macros that define kernel function names, blocksizes, and the like. However, in the future, the gks API will be expanded to allow runtime management of kernels and runtime parameters. The most obvious application of this new infrastructure is the runtime detection of hardware (and the implied selection of appropriate kernels). With contexts in place, kernels may even be "hot swapped" at runtime within the gks. Once execution enters a level-3 _front() function, the memory allocator will be reinitialized on-the-fly, if necessary, to accommodate the new kernels' blocksizes. If another application thread is executing with another (previously loaded) kernel, it will finish in a deterministic fashion because its kernel information was loaded into its context before computation began, and also because the blocks it checked out from the internal memory pools will be unaffected by the newer threads' reinitialization of the allocator. - Reorganized and streamlined the 'ind' directory, which contains much of the code enabling use of induced methods for complex domain matrix multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as those APIs' functionality is now mostly subsumed within the global kernel structure. - Updated bli_pool.c to define a new function, bli_pool_reinit_if(), that will reinitialize a memory pool if the necessary pool block size has increased. - Updated bli_mem.c to use bli_pool_reinit_if() instead of bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed usage of contexts where appropriate to communicate cache and register blocksizes to bli_mem_compute_pool_block_sizes(). - Simplified control trees now that much of the information resides in the context and/or the global kernel structure: - Removed blocksize object pointers (blksz_t) fields from all control tree node definitions and replaced them with blocksize id (bszid_t) values instead, which may be passed into a context query routine in order to extract the corresponding blocksize from the given context. - Removed micro-kernel function pointers (func_t) fields from all control tree node definitions. Now, any code that needs these function pointers can query them from the local context, as identified by a level-3 micro-kernel id (l3ukr_t), level-1f kernel id, (l1fkr_t), or level-1v kernel id (l1vkr_t). - Removed blksz_t object creation and initialization, as well as kernel function object creation and initialization, from all operation- specific control tree initialization files (bli__cntl.c), since this information will now live in the gks and, secondarily, in the context. - Removed blocksize multiples from blksz_t objects. Now, we track blocksize multiples for each blocksize id (bszid_t) in the context object. - Removed the bool_t's that were required when a func_t was initialized. These bools are meant to allow one to track the micro-kernel's storage preferences (by rows or columns). This preference is now tracked separately within the gks and contexts. - Merged and reorganized many separate-but-related functions into single files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and util directories, but has the most obvious effect of allowing BLIS to compile noticeably faster. - Reorganized execution paths for level-1v, -1d, -1m, and -2 operations in an attempt to reduce overhead for memory-bound operations. This includes removal of default use of object-based variants for level-2 operations. Now, by default, level-2 operations will directly call a low-level (non-object based) loop over a level-1v or -1f kernel. - Converted many common query functions in blk_blksz.c (renamed from bli_blocksize.c) and bli_func.c into cpp macros, now defined in their respective header files. - Defined bli_mbool.c API to create and query "multi-bools", or heterogeneous bool_t's (one for each floating-point datatype), in the same spirit as blksz_t and func_t. - Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS and BLIS_SIMD_SIZE. These values are needed in order to compute a third new parameter, which may be set indirectly via the aforementioned macros or directly: BLIS_STACK_BUF_MAX_SIZE. This value is used to statically allocate memory in macro-kernels and the induced methods' virtual kernels to be used as temporary space to hold a single micro-tile. These values are now output by the testsuite. The default value of BLIS_STACK_BUF_MAX_SIZE is computed as "2 BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE". - Cleaned up top-level 'kernels' directory (for example, renaming the embarrassingly misleading "avx" and "avx2" directories to "sandybridge" and "haswell," respectively, and gave more consistent and meaningful names to many kernel files (as well as updating their interfaces to conform to the new context-aware kernel APIs). - Updated the testsuite to query blocksizes from a locally-initialized context for test modules that need those values: axpyf, dotxf, dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr. - Reformatted many function signatures into a standard format that will more easily facilitate future API-wide changes. - Updated many "mxn" level-0 macros (ie: those used to inline double loops for level-1m-like operations on small matrices) in frame/include/level0 to use more obscure local variable names in an effort to avoid variable shaddowing. (Thanks to Devin Matthews for pointing these gcc warnings, which are only output using -Wshadow.) - Added a conj argument to setm, so that its interface now mirrors that of scalm. The semantic meaning of the conj argument is to optionally allow implicit conjugation of the scalar prior to being populated into the object. - Deprecated all type-aware mixed domain and mixed precision APIs. Note that this does not preclude supporting mixed types via the object APIs, where it produces absolutely zero API code bloat.	2016-04-11 17:21:28 -05:00
Field G. Van Zee	37e55ca39b	Fixed obscure 3m1/4m1a bugs in trmm[3] and trsm. Details: - Fixed a family of bugs in the triangular level-3 operations for certain complex implementations (3m1 and 4m1a) that only manifest if one of the register blocksizes (PACKMR/PACKNR, actually) is odd: - Fixed incorrect imaginary stride computation in bli_packm_blk_var2() for the triangular case. - Fixed the incorrect computation of imaginary stride, as stored in the auxinfo_t struct in trmm and trsm macro-kernels. - Fixed incorrect pointer arithmetic in the trsm macro-kernels in the cases where the the register blocksize for the triangular matrix is odd. Introduced a new byte-granular pointer arithmetic macro, bli_ptr_add(), that computes the correct value. - Added cpp macro to bli_macro_defs.h for typeof() operator, defined in terms of __typeof__, which is used by bli_ptr_add() macro. - Disabled the row- vs. column-storage optimization in bli_trmm_front() for singleton problems because the inherent ambiguity of whether a scalar is row-stored or column-stored causes the wrong parameter combination code to be executed (by dumb luck of our checking for row storage first). - Added commented-out debugging lines to 3m1/4m1a and reference micro-kernels, and trsm_ll macro-kernel.	2015-10-30 18:25:04 -05:00
Field G. Van Zee	26a4b8f6f9	Implemented 3m2, 3m3 induced algorithms (gemm only). Details: - Defined a new "3ms" (separated 3m) pack schema and added appropriate support in packm_init(), packm_blk_var2(). - Generalized packm_struc_cxk_3mi to take the imaginary stride (is_p) as an argument instead of computing it locally. Exception: for trmm, is_p must be computed locally, since it changes for triangular packed matrices. Also exposed is_p in interface to dt-specific packm_blk_var2 (and _var1, even though it does not use imaginary stride). - Renamed many functions/variables from _3mi to _3mis to indicate that they work for either interleaved or separated 3m pack schemas. - Generalized gemm and herk macro-kernels to pass in imaginary stride rather than compute them locally. - Added support for 3m2 and 3m3 algorithms to frame/ind, including 3m2- and 3m3-specific virtual micro-kernels. - Added special gemm macro-kernels to support 3m2 and 3m3. - Added support for 3m2 and 3m3 to testsuite. - Corrected the type of the panel dimension (pd_) in various macro- kernels from inc_t to dim_t. - Renamed many functions defined in bli_blocksize.c. - Moved most induced-related macro defs from frame/include to frame/ind/include. - Updated the _ukernel.c files so that the micro-kernel function pointers are obtained from the func_t objects rather than the cpp macros that define the function names. - Updated test/3m4m driver, Makefile, and run script.	2015-04-01 10:44:54 -05:00
Field G. Van Zee	a86db60ee2	Extensive renaming of 3m/4m-related files, symbols. Details: - Renamed all remaining 3m/4m packing files and symbols to 3mi/4mi ('i' for "interleaved"). Similar changes to 3M/4M macros. - Renamed all 3m/4m files and functions to 3m1/4m1. - Whitespace changes.	2015-02-23 18:42:39 -06:00
Field G. Van Zee	441d47542a	Renamed 3m and 4m symbols/macros to 3mi and 4mi. Details: - Renamed several variables and macros from 3m/4m to 3mi/4mi. This is because those packing schemas were always implicitly "interleaved". This new naming scheme will make way for new schemas that separate instead of interleve the real and imaginary (and summed) parts. - Expanded the pack format sub-field of the pack schema field of the info_t to 4 bits (from 3). This will allow for more schema types going forward. - Removed old _cntl.c files for herk3m, herk4m, trmm3m, trmm4m.	2015-02-19 17:06:10 -06:00
Field G. Van Zee	518a1756cc	Fixed indexing bug for trmm3 via 3mh, 4mh. Details: - Fixed a bug that only affected trmm3 when performed via 3mh or 4mh, whereby micro-panels of the triangular matrix were packed with "dead space" between them due to failing to adjust for the fact that pointer arithmetic was occurring in units of complex elements while the data being packed consisted of real elements. It turns out that the macro- kernel suffered from the same bug, meaning the panels were actually being packed and read consistently. The only way I was able to discover the bug in the first place was because the packed block of A was overflowing into the beginning of the packed row panel of B using the sandybridge configuration.	2015-02-19 14:27:09 -06:00
Field G. Van Zee	fe2b8d39a4	Fixed an obscure bug in 3mh/3m/4mh/4m packing. Details: - Modified bli_packm_blk_var1.c and _var2.c to increase the triangular case's panel increment by 1 if it would otherwise be odd. This is particularly necessary in _var2.c when handling the interleaved 3m or ro/io/rpi pack schemas, since division of an odd number by 2 can happen if both the panel length and the panel packing dimension (register packing blocksize) are odd, thus making their product odd. - Modified bli_packm_init.c so that panel strides are increased by 1 if they would otherwise be odd, even for non-3m related packing. - Modified the trmm and trsm macro-kernels so that triangular packed micro-panels are traversed with this new "increment by 1 if odd" policy. - Added sanity checks in trmm and trsm macro-kernels that would result in an abort() if the conditions that would lead to a "divide odd integer by 2" scenario ever manifest. - Defined bli_is_odd(), _is_even() macros in bli_scalar_macro_defs.h.	2015-02-11 19:33:10 -06:00
Field G. Van Zee	e64dba5633	Re-implemented micro-panel alignment. Details: - This commit re-implements a feature that was removed in commit `c2b2ab62`. It was removed because, at the time, I wasn't sure how the micro-panel alignment feature would interact with the 4m method (when applied at the micro-kernrel level), and so it seemed safer to disable the feature entirely rather than allow possible breakage. This commit revisits the issue and safely re-implements the feature in a way that is compatible with 4m, 3m, 4mh, and 3mh (and native execution). - Modified the static memory pool to account for micro-panel alignment space. - Modified packm_init and blocked variants to align whole micro-panels by a datatype-specific alignment value that may be set by the configuration. (If it is not set by the configuration, it will default to BLIS_SIZEOF_?.) - Modified macro-kernels so that: - storage stride is handled properly given the new micro-panel alignment behavior; - indexing through 3m/4m/rih-type sub-panels, as is done by trmm and trsm, is more robust (e.g. will work if the applicable packing register blocksize is odd); - imaginary strides are computed and stored within auxinfo_t structs, which allows the virtual micro-kernels to more easily determine how to index into the micro-panel operands. - Modified virtual 3m and 4m micro-kernels to use the imaginary strides within the auxinfo_t structs instead of panel strides. - Deprecated the panel stride fields from the auxinfo_t structs. - Updated test suite to print out the micro-panel alignment values.	2014-10-20 19:23:06 -05:00
Field G. Van Zee	0d954087b2	Minor changes and fixes. Details: - Redefined bli_is_last_iter() to take thread_id and num_thread arguments, which allows the macro to correctly compute whether a given iteration is the last that the thread will compute in that particular loop. The new definition, however, remains disabled (commented out) until someone can look at this more closely, as the new definition seems to actually hurt performance slightly. - Whitespace and related updates to level-3 macro-kernels. - Updated test suite so that performance results in the hundreds of gigaflops does not disrupt the column alignment of the output.	2014-10-17 11:19:34 -05:00
Field G. Van Zee	9dc9b44a05	Renamed bli_obj_pack_status() to _pack_schema(). Details: - Renamed the bli_obj_pack_status() macro to bli_obj_pack_schema() in order to help avoid confusion as to what the macro returns.	2014-09-11 12:03:28 -05:00
Field G. Van Zee	cf5efdde05	Pass pack_t schemas into ukernels via auxinfo_t. Details: - Modified macro-kernels to pass the pack_t schema values for matrices A and B into the datatype-specific functions, where they are now inserted into a newly-expanded auxinfo_t struct. This gives gives the micro-kernels access to the pack_t schema values embedded in the control trees, which determine the precise format into which the matrix elements are packed. - Updated a call to bli_packm_init_pack() in src/test_libblis.c to remove densify argument. Meant to include this in commit `c472993b`.	2014-09-11 11:47:56 -05:00
Field G. Van Zee	137143345d	Reimplemented unit blocksize fix in prev commit. Details: - Instead of inferring the storage format of the micro-panels from within the packm variants, we now pass in a bool_t value that denotes whether the packed matrix contains row-stored column panels or column-stored row panels. This value can then be tested more easily inside the main packm variant loop. - Renumbered pack_t schema values in bli_type_defs.h so that there are now five bits, each with different meaning: - 4: packed or not packed? - 3: packed for 3m? - 2: packed for 4m? - 1: packed to panels? - 0: stored by rows or columns? - Added new macros that test for status of above bits in schema bit subfield, and renamed some existing macros related to 4m/3m.	2014-07-31 12:12:45 -05:00
Field G. Van Zee	7ed415824d	Updated copyright headers (continued). Details: - Inserted "at Austin" into third clause of license declarations. Meant to include this change in previous commit.	2014-07-14 16:14:33 -05:00
Field G. Van Zee	5c2c6c8561	Updated copyright headers to contain "at Austin". Details: - Updated copyright headers to include "at Austin" in the name of the University of Texas. - Updated the copyright years of a few headers to 2014 (from 2011 and 2012).	2014-07-14 16:05:03 -05:00
Tyler Smith	b8134b720b	Quick and dirty multithreading for TRSM Should work fine for small number of threads (up to 8 or maybe even 16). However, performance is yet untested. This parallelizes the "JR" loop for the left sided cases and the "IR" loop for the right sided cases. Future work is to parallelize the outer loops as well.	2014-07-02 16:02:39 -05:00
Tyler Smith	f0824a04fc	Initial commit to enable threading in TRSM, Also enabled weighted partitioning for herk, trmm Fixed bug where multiple threads would try to modify the same state in the internal level 3 functions Correctly computed a_next and b_next for gemm, herk macrokernels a_next and b_next point to the current micropanels in trmm	2014-03-24 15:21:42 -05:00
Field G. Van Zee	9b0e715f29	Minor simplifications to trmm, trsm macro-kernels. Details: - Simplified some code that would have allowed the diagonal of a trmm or trsm triangular matrix to intersect the short end of a micro-panel. This is disallowed via higher-level constraints on cache blocksizes, so this code was never needed and only served to obfuscate. - Updated some comments in trmm, trsm macro-kernels.	2014-03-19 15:47:54 -05:00
Field G. Van Zee	6363a9f658	Added level-3 support for complex via 4m-/3m. Details: - Added the ability to induce complex domain level-3 operations via new virtual complex micro-kernels which are implemented via only real domain micro-kernels. Two new implementations are provided: 4m and 3m. 4m implements complex matrix multiplication in terms of four real matrix multiplications, where as 3m uses only three and thus is capable of even higher (than peak) performance. However, the 3m method has somewhat weaker numerical properties, making it less desirable in general. - Further refined packing routines, which were recently revamped, and added packing functionality for 4m and 3m. - Some modifications to trmm and trsm macro-kernels to facilitate indexing into micro-panels which were packed for 4m/3m virtual kernels. - Added 4m and 3m interfaces for each level-3 operation. - Various other minor changes to facilitate 4m/3m methods.	2014-02-19 17:00:52 -06:00
Field G. Van Zee	bd3ab1ad4c	Minor fixes to trsm consistent with prev on trmm. Details: - Removed use of bli_min() and bli_max() that were only being used to try to support situations where the diagonal would intersect the short end of some micro-panels, which is situation that is disallowed at a higher level by various constraints on the register and cache blocksize. This only affected trsm_ll and trsm_lu. - Use panel stride as passed into the macro-kernel rather than compute it via k and PACKMR/PACKNR. This affects all macro-kernels of trsm.	2014-02-13 09:29:55 -06:00
Field G. Van Zee	7aceef7683	Updated comments in macro-kernels. Details: - Updated (and fixed some errors in) the "Assumptions/assertions" comment section of macro-kernels. - Changed register blocksizes of reference configuration to MR = 8 and NR = 4. It's always good for MR != NR in the reference configuration since it may help uncover bugs related to non-square micro-kernels.	2014-02-06 17:31:19 -06:00
Field G. Van Zee	8fd292aa78	Pass panel dimensions into macro-kernels. Details: - Modified the interfaces to the datatype-specific macro-kernels so that: - pd_a and pd_b are passed in (which contain the panel dimensions of packed panels of a and b). - rs_a and cs_b are no longer passed in (they were guaranteed to be 1). - Modified implementations of datatype-specific macro-kernels so pd_a, pd_b, cs_a, and rs_b are used instead of cpp macros for MR, NR, PACKMR, and PACKNR, respectively. - Declare temporary c matrices (ct) as being maxmr-by-maxnr, which for now is equivalent to being mr-by-nr. maxmr and maxnr are declared in a new header file bli_kernel_post_macro_defs.h.	2014-02-06 14:32:21 -06:00
Field G. Van Zee	5a36e5bf2f	Embed func_t microkernel objects in control trees. Details: - Modified all control tree node definitions to include a new field of type func_t, which is similar to a blksz_t except that it contains one function pointer (each typed simply as void) for each datatype. We use the func_t* to embed pointers to the micro-kernels to use for the leaf-level nodes of each control tree. This change is a natural extension of control trees and will allow more flexibility in the future. - Modified all macro-kernel wrappers to obtain the micro-kernel pointers from the incomming (previously ignored) control tree node and then pass the queried pointer into the datatype-specific macro-kernel code, which then casts the pointer to the appropriate type (new typedefs residing in bli_kernel_type_defs.h) and then uses the pointer to call the micro- kernel. Thus, the micro-kernel function is no longer "hard-coded" (that is, determined when the datatype-specific macro-kernel functions are instantiated by the C preprocessor). - Added macros to bli_kernel_macro_defs.h that build datatype-specific base names if they do not exist already, and then uses those to build datatype-specific micro-kernel function names. This will allow developers extra flexibility if they wanted to, for example, name each of their datatype-specific micro-kernels differently (e.g. double real might be named bli_dgemm_opt_4x4() while double complex might be named bli_zgemm_opt_2x2()). - Inserted appropriate code into _cntl_init() functions that allocates and initializes a func_t object for the corresponding micro-kernels. The gemm ukernel func_t object is created once, in bli_gemm_cntl_init(), and then reused via extern wherever possible.	2014-01-27 11:13:00 -06:00
Field G. Van Zee	2cb13600f9	Updated year in copyright headers to 2014.	2014-01-03 12:29:13 -06:00
Field G. Van Zee	290fa54e00	Store variable panel strides in trmm/trsm auxinfo. Details: - Changed the value being stored into the auxinfo_t structure in trmm and trsm macro-kernels. Whereas before we stored whatever value was provided to the macro-kernel implementation via ps_a/ps_b, now we store the stride that will advance to the next variable-length micro-panel of the triangular matrix A (left) or B (right). - Whitespace changes to the files affected above.	2013-12-20 14:10:26 -06:00
Field G. Van Zee	e3a6c7e776	Macroized conditionals for a2/b2 in macro-kernels. Details: - Replaced conditional expressions in macro-kernels related to computing the addresses a2 and b2 (a_next and b_next) with a preprocessor macro invocation, bli_is_last_iter(), that tests the same condition. - Updated gemm_ukr module to use auxinfo_t argument. - Whitespace changes in test suite ukr modules.	2013-12-19 16:29:31 -06:00
Field G. Van Zee	a0331fb10a	Introduced auxinfo_t argument to micro-kernels. Details: - Removed a_next and b_next arguments to micro-kernels and replaced them with a pointer to a new datatype, auxinfo_t, which is simply a struct that holds a_next and b_next. The struct may hold other auxiliary information that may be useful to a micro-kernel, such as micro-panel stride. Micro-kernels may access struct fields via accessor macros defined in bli_auxinfo_macro_defs.h. - Updated all instances of micro-kernel definitions, micro-kernel calls, as well as macro-kernels (for declaring and initializing the structs) according to above change.	2013-12-19 14:50:11 -06:00
Field G. Van Zee	b444489f10	Added new "attached" scalar representation. Details: - Added infrastructure to support a new scalar representation, whereby every object contains an internal scalar that defaults to 1.0. This facilitates passing scalars around without having to house them in separate objects. These "attached" scalars are stored in the internal atom_t field of the obj_t struct, and are always stored to be the same datatype as the object to which they are attached. Level-3 variants no longer take scalar arguments, however, level-3 internal back-ends stll do; this is so that the calling function can perform subproblems such as C := C - alpha * A * B on-the-fly without needing to change either of the scalars attached to A or B. - Removed scalar argument from packm_int(). - Observe and apply attached scalars in scalm_int(), and removed scalar from interface of scalm_unb_var1(). - Renamed the following functions (and corresponding invocations): bli_obj_init_scalar_copy_of() -> bli_obj_scalar_init_detached_copy_of() bli_obj_init_scalar() -> bli_obj_scalar_init_detached() bli_obj_create_scalar_with_attached_buffer() -> bli_obj_create_1x1_with_attached_buffer() bli_obj_scalar_equals() -> bli_obj_equals() - Defined new functions: bli_obj_scalar_detach() bli_obj_scalar_attach() bli_obj_scalar_apply_scalar() bli_obj_scalar_reset() bli_obj_scalar_has_nonzero_imag() bli_obj_scalar_equals() - Placed all bli_obj_scalar_* functions in a new file, bli_obj_scalar.c. - Renamed the following macros: bli_obj_scalar_buffer() -> bli_obj_buffer_for_1x1() bli_obj_is_scalar() -> bli_obj_is_1x1() - Defined new macros to set and copy internal scalars between objects: bli_obj_set_internal_scalar() bli_obj_copy_internal_scalar() - In level-3 internal back-ends, added conditional blocks where alpha and beta are checked for non-unit-ness. Those values for alpha and beta are applied to the scalars attached to aliases of A/B/C, as appropriate, before being passed into the variant specified by the control tree. - In level-3 blocked variants, pass BLIS_ONE into subproblems instead of alpha and/or beta. - In level-3 macro-kernels, changed how scalars are obtained. Now, scalars attached to A and B are multiplied together to obtain alpha, while beta is obtained directly from C. - In level-3 front-ends, removed old function calls meant to provide future support for mixed domain/precision. These can be added back later once that functionality is given proper treatment. Also, removed the creating of copy-casts of alpha and beta since typecasting of scalars is now implicitly handled in the internal back-ends when alpha and beta are applied to the attached scalars.	2013-12-03 16:08:30 -06:00
Field G. Van Zee	67761e224c	Attempting to fix errors in bgq build. Details: - Removed restrict declaration from b_cast and c_cast from bli_trsm_lu_ker_var2.c and bli_trsm_rl_ker_var2.c. Curiously, they are causing problems for xlc only in those two files and no other macro-kernels. - Fixed (hopefully) kernel function parameter type declarations in kernels/bgq/1f/bli_axpyf_opt_var1.c and kernels/bgq/3/bli_gemm_8x8.c.	2013-11-18 11:57:40 -06:00
Field G. Van Zee	9bd7fcfd43	Outer-to-inner 'restrict' fix in macro-kernels. Details: - Fixed sloppy placement of 'restrict' pointer declarations in level-3 macro-kernels. Previously, all restricted pointers were being declared at the outer-most function scope level. While this violates the C99 standard, very few of the compilers used with BLIS so far have seemed to care. The lone exception has been IBM's xlc. Thanks to Tyler Smith for identifying this bug (and suggesting the fix).	2013-11-18 10:58:09 -06:00
Field G. Van Zee	376bbb59c8	Removed support for duplication. Details: - Removed support for duplication from the gemmtrsm/trsm micro-kernels and all framework code. - Updated test suite modules according to above changes.	2013-11-08 11:17:34 -06:00
Field G. Van Zee	6072d7c848	Fixed bugs in trsm, trmm macro-kernels. Details: - Fixed a bug in trsm_rl_ker_var2() caused by incorrect edge case handling. - Fixed a bug in trsm_rl_ker_var2() and trsm_ru_ker_var2() whereby k was incorrectly being adjusted upward by MR, instead of NR. The rl and ru trmm macro-kernels were updated in a similar fashion. - Fixed a bug in trsm_ru_ker_var2() that was due to a missing negation on diagoffb when recomputing k to skip a zero region below where the diagonal intersects the right side of the block. The corresponding trmm macro-kernel was also updated. - Fixed a bug in trsm_ru_ker_var2() where the the adjustment of k (by NR) needed to be placed AFTER the block that recomputes k to skip the zero region (if present). The other three trsm macro-kernels, as well as the trmm macro-kernels, were updated in the same manner, for consistency. - Fixed a bug in trmm_lu_ker_var2() in which the wrong dimension (n) was being updated to skip a zero region to the left of where the diagonal of A intersects the top edge of the block. - Comment updates to all trsm and trmm macro-kernels. - Comment updates to bli_packm_init.c.	2013-07-17 12:27:45 -05:00

1 2

52 Commits