amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-04 22:41:11 +00:00

Author	SHA1	Message	Date
Field G. Van Zee	bd02c4e9f7	Cleanups to testsuite, input.operations format. Details: - Removed the line in each operation entry in input.operations titled "test sequential front-end" and the corresponding support for the lines in the testsuite input parsing code. This line was included in some of the earliest versions of the testsuite, back when I intended to eventually have separate multithreaded APIs. Specifically, I envisioned that multithreaded and sequential testing could be enabled or disabled on an operation level. However, BLIS evolved in a different direction and still does not have multithreaded-specific APIs (even if it will eventually someday). But even if it did have such APIs, I doubt I would allow the user to enable/disable them on an operation level. Thus, this was a zombie future parameter that was never used and never made sense to begin with. The one instance of the front_seq variable, used in the various libblis_test_<operation>() functions to guard the call to the operation test driver, that remains was commented out instead of deleted so that someday it could be easily changed via sed, if desired. - Various minor cleanups to the testsuite code, including consolidating use of DISABLE and DISABLE_ALL and reexpressing certain conditional expressions in the libblis_test_<operation>() functions in terms of boolean functions.	2018-06-04 13:42:17 -05:00
Field G. Van Zee	1ef9360b1f	Enable non-unit vector stride tests by default. Details: - Change "vector storage schemes to test" parameter in testsuite's input.general file to "cj". This means that both unit stride column vectors and non-unit stride column vectors will be tested in operations with vector operands (e.g. level-1v, level-1f, level-2). - Very minor comment (typo) changes to input.operations.	2018-03-01 14:36:39 -06:00
Field G. Van Zee	8c4e55a1a1	Added individual operation overrides in testsuite. Details: - Updated the testsuite driver so that setting one or more individual operation test switches to "2" in input.operations will enable ONLY those operations and disable all others, regardless of the values of the section overrides and other operation switches. This makes it very easy to quickly test only one or two operations, and equally easy to revert back to the previous combination of operation tests. - Added more comments to input.operations describing the use of individual "enable only" overrides.	2018-02-28 17:01:47 -06:00
Field G. Van Zee	86969873b5	Reclassified amaxv operation as a level-1v kernel. Details: - Moved amaxv from being a utility operation to being a level-1v operation. This includes the establishment of a new amaxv kernel to live beside all of the other level-1v kernels. - Added two new functions to bli_part.c: bli_acquire_mij() bli_acquire_vi() The first acquires a scalar object for the (i,j) element of a matrix, and the second acquires a scalar object for the ith element of a vector. - Added integer support to bli_getsc level-0 operation. This involved adding integer support to the bli_*gets level-0 scalar macros. - Added a new test module to test amaxv as a level-1v operation. The test module works by comparing the value identified by bli_amaxv() to the value found from a reference-like code local to the test module source file. In other words, it (intentionally) does not guarantee the same index is found; only the same value. This allows for different implementations in the case where a vector contains two or more elements containing exactly the same floating point value (or values, in the case of the complex domain). - Removed the directory frame/include/old/.	2016-10-04 14:24:59 -05:00
Devin Matthews	bdbda6e6ac	Give the level1v operations some love: - Add missing axpby and xpby operations (plus test cases). - Add special case for scal2v with alpha=1. - Add restrict qualifiers. - Add special-case algorithms for incx=incy=1.	2016-04-25 11:05:57 -05:00
Field G. Van Zee	537a1f4f85	Implemented runtime contexts and reorganized code. Details: - Retrofitted a new data structure, known as a context, into virtually all internal APIs for computational operations in BLIS. The structure is now present within the type-aware APIs, as well as many supporting utility functions that require information stored in the context. User-level object APIs were unaffected and continue to be "context-free"; however, these APIs were duplicated/mirrored so that "context-aware" APIs now also exist, differentiated with an "_ex" suffix (for "expert"). These new context-aware object APIs (along with the lower-level, type-aware, BLAS-like APIs) contain the address of a context as a last parameter, after all other operands. Contexts, or specifically, cntx_t object pointers, are passed all the way down the function stack into the kernels and allow the code at any level to query information about the runtime, such as kernel addresses and blocksizes, in a thread-friendly manner--that is, one that allows thread-safety, even if the original source of the information stored in the context changes at run-time (see next bullet for more on this "original source" of info). (Special thanks go to Lee Killough for suggesting the use of this kind of data structure in discussions that transpired during the early planning stages of BLIS, and also for suggesting such a perfectly appropriate name.) - Added a new API, in frame/base/bli_gks.c, to define a "global kernel structure" (gks). This data structure and API will allow the caller to initialize a context with the kernel addresses, blocksizes, and other information associated with the currently active kernel configuration. The currently active kernel configuration within the gks cannot be changed (for now), and is initialized with the traditional cpp macros that define kernel function names, blocksizes, and the like. 
However, in the future, the gks API will be expanded to allow runtime management of kernels and runtime parameters. The most obvious application of this new infrastructure is the runtime detection of hardware (and the implied selection of appropriate kernels). With contexts in place, kernels may even be "hot swapped" at runtime within the gks. Once execution enters a level-3 _front() function, the memory allocator will be reinitialized on-the-fly, if necessary, to accommodate the new kernels' blocksizes. If another application thread is executing with another (previously loaded) kernel, it will finish in a deterministic fashion because its kernel information was loaded into its context before computation began, and also because the blocks it checked out from the internal memory pools will be unaffected by the newer threads' reinitialization of the allocator. - Reorganized and streamlined the 'ind' directory, which contains much of the code enabling use of induced methods for complex domain matrix multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as those APIs' functionality is now mostly subsumed within the global kernel structure. - Updated bli_pool.c to define a new function, bli_pool_reinit_if(), that will reinitialize a memory pool if the necessary pool block size has increased. - Updated bli_mem.c to use bli_pool_reinit_if() instead of bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed usage of contexts where appropriate to communicate cache and register blocksizes to bli_mem_compute_pool_block_sizes(). - Simplified control trees now that much of the information resides in the context and/or the global kernel structure: - Removed blocksize object pointers (blksz_t) fields from all control tree node definitions and replaced them with blocksize id (bszid_t) values instead, which may be passed into a context query routine in order to extract the corresponding blocksize from the given context. 
- Removed micro-kernel function pointers (func_t) fields from all control tree node definitions. Now, any code that needs these function pointers can query them from the local context, as identified by a level-3 micro-kernel id (l3ukr_t), level-1f kernel id (l1fkr_t), or level-1v kernel id (l1vkr_t). - Removed blksz_t object creation and initialization, as well as kernel function object creation and initialization, from all operation-specific control tree initialization files (bli__cntl.c), since this information will now live in the gks and, secondarily, in the context. - Removed blocksize multiples from blksz_t objects. Now, we track blocksize multiples for each blocksize id (bszid_t) in the context object. - Removed the bool_t's that were required when a func_t was initialized. These bools are meant to allow one to track the micro-kernel's storage preferences (by rows or columns). This preference is now tracked separately within the gks and contexts. - Merged and reorganized many separate-but-related functions into single files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and util directories, but has the most obvious effect of allowing BLIS to compile noticeably faster. - Reorganized execution paths for level-1v, -1d, -1m, and -2 operations in an attempt to reduce overhead for memory-bound operations. This includes removal of default use of object-based variants for level-2 operations. Now, by default, level-2 operations will directly call a low-level (non-object based) loop over a level-1v or -1f kernel. - Converted many common query functions in blk_blksz.c (renamed from bli_blocksize.c) and bli_func.c into cpp macros, now defined in their respective header files. - Defined bli_mbool.c API to create and query "multi-bools", or heterogeneous bool_t's (one for each floating-point datatype), in the same spirit as blksz_t and func_t. - Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS and BLIS_SIMD_SIZE. 
These values are needed in order to compute a third new parameter, which may be set indirectly via the aforementioned macros or directly: BLIS_STACK_BUF_MAX_SIZE. This value is used to statically allocate memory in macro-kernels and the induced methods' virtual kernels to be used as temporary space to hold a single micro-tile. These values are now output by the testsuite. The default value of BLIS_STACK_BUF_MAX_SIZE is computed as "2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE". - Cleaned up top-level 'kernels' directory (for example, renaming the embarrassingly misleading "avx" and "avx2" directories to "sandybridge" and "haswell," respectively) and gave more consistent and meaningful names to many kernel files (as well as updating their interfaces to conform to the new context-aware kernel APIs). - Updated the testsuite to query blocksizes from a locally-initialized context for test modules that need those values: axpyf, dotxf, dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr. - Reformatted many function signatures into a standard format that will more easily facilitate future API-wide changes. - Updated many "mxn" level-0 macros (i.e., those used to inline double loops for level-1m-like operations on small matrices) in frame/include/level0 to use more obscure local variable names in an effort to avoid variable shadowing. (Thanks to Devin Matthews for pointing out these gcc warnings, which are only output using -Wshadow.) - Added a conj argument to setm, so that its interface now mirrors that of scalm. The semantic meaning of the conj argument is to optionally allow implicit conjugation of the scalar prior to being populated into the object. - Deprecated all type-aware mixed domain and mixed precision APIs. Note that this does not preclude supporting mixed types via the object APIs, where it produces absolutely zero API code bloat.	2016-04-11 17:21:28 -05:00
Field G. Van Zee	f1a6b7d028	Reorganized code for induced complex methods. Details: - Consolidated most of the code relating to induced complex methods (e.g. 4mh, 4m1, 3mh, 3m1, etc.) into frame/ind. Induced methods are now enabled on a per-operation basis. The current "available" (enabled and implemented) implementation can then be queried on an operation basis. Micro-kernel func_t objects as well as blksz_t objects can also be queried in a similar manner. - Redefined several micro-kernel and operation-related functions in bli_info_() API, in accordance with above changes. - Added mr and nr fields to blksz_t object, which point to the mr and nr blksz_t objects for each cache blocksize (and are NULL for register blocksizes). Renamed the sub-blocksize field "sub" to "mult" since it is really expressing a blocksize multiple. - Updated bli__determine_kc_[fb]() for gemm/hemm/symm, trmm, and trsm to correctly query mr and nr (for purposes of nudging kc). - Introduced an enumerated opid_t in bli_type_defs.h that uniquely identifies an operation. For now, only level-3 id values are defined, along with a generic, catch-all BLIS_NOID value. - Reworked testsuite so that all induced methods that are enabled are tested (one at a time) rather than only testing the first available method. - Reformatted summary at the beginning of testsuite output so that blocksize and micro-kernel info is shown for each induced method that was requested (as well as native execution). - Reduced the number of columns needed to display non-matlab testsuite output (from approx. 90 to 80).	2015-03-18 15:37:10 -05:00
Field G. Van Zee	c0acca0f51	Clarified comments in testsuite input.operations.	2015-03-03 10:56:22 -06:00
Field G. Van Zee	45692e3ad4	Reverted some accidental changes. Details: - Reverted some changes that were unintentionally included in the previous commit (`9526ce98`). Thanks to Tony Kelman for pointing this out. (Note: a few select changes were not reverted.)	2014-08-07 13:21:15 -05:00
Field G. Van Zee	9526ce9881	Updated copyright headers of emscripten configuration files.	2014-08-06 14:15:34 -05:00
Field G. Van Zee	970b431416	Minor bugfixes to BLAS compatibility layer. Details: - Changed bla_amax.c so that i?amax() routines now correctly return 0 if ( n < 1 || incx <= 0 ). - Changed bla_rotg.c and bla_rotmg.c to use bli_fabs() macro instead of f2c's abs() macro for float and double cases. - Thanks to Murtaza Ali for suggesting the two fixes above. - Updated label of fnormv to normfv in testsuite/input.operations.	2014-07-10 09:30:00 -05:00
Field G. Van Zee	fde5f1fdec	Added extensive support for configuration defaults. Details: - Standard names for reference kernels (levels-1v, -1f and 3) are now macro constants. Examples: BLIS_SAXPYV_KERNEL_REF BLIS_DDOTXF_KERNEL_REF BLIS_ZGEMM_UKERNEL_REF - Developers no longer have to name all datatype instances of a kernel with a common base name; [sdcz] datatype flavors of each kernel or micro-kernel (level-1v, -1f, or 3) may now be named independently. This means you can now, if you wish, encode the datatype-specific register blocksizes in the name of the micro-kernel functions. - Any datatype instance of any kernel (1v, 1f, or 3) that is left undefined in bli_kernel.h will default to the corresponding reference implementation. For example, if BLIS_DGEMM_UKERNEL is left undefined, it will be defined to be BLIS_DGEMM_UKERNEL_REF. - Developers no longer need to name level-1v/-1f kernels with multiple datatype chars to match the number of types the kernel WOULD take in a mixed type environment, as in bli_dddaxpyv_opt(). Now, one char is sufficient, as in bli_daxpyv_opt(). - There is no longer a need to define an obj_t wrapper to go along with your level-1v/-1f kernels. The framework now provides a _kernel() function which serves as the obj_t wrapper for whatever kernels are specified (or defaulted to) via bli_kernel.h - Developers no longer need to prototype their kernels, and thus no longer need to include any prototyping headers from within bli_kernel.h. The framework now generates kernel prototypes, with the proper type signature, based on the kernel names defined (or defaulted to) via bli_kernel.h. - If the complex datatype x (of [cz]) implementation of the gemm micro-kernel is left undefined by bli_kernel.h, but its same-precision real domain equivalent IS defined, BLIS will use a 4m-based implementation for the datatype x implementations of all level-3 operations, using only the real gemm micro-kernel.	2014-02-25 13:34:56 -06:00
Field G. Van Zee	d37c2cff62	Minor comment and Makefile changes. Details: - Added missing 'check-config' and 'check-make-defs' targets to testsuite/Makefile. - Removed unused 'test' target from top-level Makefile. - Comment changes to testsuite input files.	2013-11-13 10:47:11 -06:00
Field G. Van Zee	68a5910974	Added comments to testsuite/input.operations. Details: - Added extensive comments to the top of testsuite/input.operations, which describe how to edit the file. - Removed input.operations.0 and input.operations.1. - Changed input.general to test all datatypes ("sdcz") by default.	2013-11-07 11:36:11 -06:00
Field G. Van Zee	a091a219bd	Minor fixes to piledriver configuration, ukernel. Details: - Applied a patch from Tyler that fixes minor staleness in the piledriver configuration and gemm micro-kernel. - Very minor changes to test suite input files.	2013-10-14 10:11:29 -05:00
Field G. Van Zee	be4833bd91	Added test suite modules for level-1f, 3 kernels. Details: - Added test modules in test suite for level-1f kernels and level-3 micro-kernels. (Duplication in the micro-kernels, for now, is NOT supported by these test modules.) - Added section override switches to test suite's input.operations file. - Added obj_t APIs for level-1f front-ends and their unblocked variants to facilitate the level-1f test modules. Also added front-end for dupl operation. - Added obj_t-based check routines for level-1f operations, which are called from the new front-ends mentioned above. - Added query routines for axpyf, dotxf, and dotxaxpyf that return fusing factors as a function of datatype, which is needed by their respective test modules. - Whitespace changes to bli_kernel.h of all existing configurations.	2013-10-10 14:20:06 -05:00
Field G. Van Zee	73aa1e9f31	Added section overrides to test suite. Details: - Added new lines of input to the test suite's input.operations file, which allows the user to disable entire sections (levels) of tests. Before this change, the user had to manually disable each operation test's "master switch". (This is why input.operations.0 existed: to allow a more convenient starting point for someone who only wanted to test one or a few operations.)	2013-10-01 17:01:18 -05:00
Field G. Van Zee	5e54f46ccb	Added template implementations and other tweaks. Details: - Added a 'template' configuration, which contains stub implementations of the level 1, 1f, and 3 kernels with one datatype implemented in C for each, with lots of in-file comments and documentation. - Modified some variable/parameter names for some 1/1f operations. (e.g. renaming vector length parameter from m to n.) - Moved level-1f fusing factors from axpyf, dotxf, and dotxaxpyf header files to bli_kernel.h. - Modified test suite to print out fusing factors for axpyf, dotxf, and dotxaxpyf, as well as the default fusing factor (which are all equal in the reference and template implementations). - Cleaned up some sloppiness in the level-1f unb_var1.c files whereby these reference variants were implemented in terms of front-end routines rather than directly in terms of the kernels. (For example, axpy2v was implemented as two calls to axpyv rather than two calls to AXPYV_KERNEL.) - Changed the interface to dotxf so that it matches that of axpyf, in that A is assumed to be m x b_n in both cases, and for dotxf A is actually used as A^T. - Minor variable naming and comment changes to reference micro-kernels in frame/3/gemm/ukernels and frame/3/trsm/ukernels.	2013-09-30 12:58:18 -05:00
Field G. Van Zee	9013ad6ff2	Switched integer typedefs (again) to C types. Details: - Redefined gint_t and guint_t in terms of the standard C types long int and unsigned long int, respectively. - Changed testsuite default max problem size to 500. - Changed testsuite input.operations to use square problems for level-3 operation tests.	2013-09-04 13:36:07 -05:00
Field G. Van Zee	9ee6e12537	Changed dimension spec for gemm in testsuite. Details: - Encountered a bizarre typecasting bug whereby the test suite was not computing the proper dimension from the problem size and dimension specification when the latter was set to -3. Will investigate. Thanks to Fran for finding this "bug".	2013-09-03 21:53:27 -05:00
Field G. Van Zee	e8be081e68	Generalized matlab and file output in testsuite. Details: - Added a new option in input.general that allows outputting in matlab/octave format so that one can output in matlab format independently from outputting to files. - Adjusted input.operations according to above. - Added input.operations.0 and input.operations.1 with all options disabled and enabled, respectively.	2013-08-28 15:52:34 -05:00
Field G. Van Zee	2d9c667f3c	Fixed x86_64 kernel bugs and other minor issues. Details: - Fixed bugs in trmv_l and trsv_u due to backwards iteration resulting in unaligned subpartitions. We were already going out of our way a bit to handle edge cases in the first iteration for blocked variants, and this was simply the unblocked-fused extension of that idea. - Fixed control tree handling in her/her2/syr/syr2 that was not taking into account how the choice of variant needed to be altered for upper-stored matrices (given that only lower-stored algorithms are explicitly implemented). - Added bli_determine_blocksize_dim_f(), bli_determine_blocksize_dim_b() macros to provide inlined versions of bli_determine_blocksize_[fb]() for use by unblocked-fused variants. - Integrated new blocksize_dim macros into gemv/hemv unf variants for consistency with that of the bugfix for trmv/trsv (both of which now use the same macros). - Modified bli_obj_vector_inc() so that 1 is returned if the object is a vector of length 1 (ie: 1 x 1). This fixes a bug whereby under certain conditions (e.g. dotv_opt_var1), an invalid increment was returned, which was invalid only because the code was expecting 1 (for purposes of performing contiguous vector loads) but got a value greater than 1 because the column stride of the object (e.g. rho) was inflated for alignment purposes (albeit unnecessarily since there is only one element in the object). - Replaced some old invocations of set0 with set0s. - Added alpha parameter to gemmtrsm ukernels for x86_64 and use accordingly. - Fixed increment bug in cleanup loop of gemm ukernel for x86_64. - Added safeguard to test modules so that testing a problem with a zero dimension does not result in a failure. - Tweaked handling of zero dimensions in level-2 and level-3 operations' internal back-ends to correctly handle cases where output operand still needs to be scaled (e.g. by beta, in the case of gemm with k = 0).	2013-05-24 16:28:10 -05:00
Field G. Van Zee	768fcebaa8	Added unified test suite, and many fixes. Details: - Added a highly configurable, unified test suite. - Removed DUPB configuration constant from bl2_kernel.h and macro-kernel header files. Now, instead, DUPB is computed as (NDUP != 1) within each macro-kernel. This fixes a bug in trmm/trsm whereby bp was indexed into incorrectly when DUPB was set to FALSE but the NDUP was still non-unit. By encoding both pieces of information into one constant in _kernel.h, it seems somewhat less likely others will encounter this bug in the future. - Added level-2 cache blocksizes to _kernel.h for reference configuration, and defined blocksizes in _cntl.c files to these default values. - Changed semantics of her2k and syr2k such that these operations no longer expect the B matrix to already be conjugate-transposed (or just transposed for syr2k). However, these semantics are preserved for the internal mechanics of the implementations, including the internal back-end and all blocked variants. - Inserted checks for real-valued alpha and beta for herk/her2k and herk, respectively. - Relaxed general object structure constraints in _basic_check() for gemv, ger. - Changed her front-end to NOT copy-cast to real projection; instead, this is replaced by selecting either the real part or both parts within the unblocked algorithm implementation, depending on the value of conjh. - Added conjh to all _check routines for her so that the code knows when to verify that alpha has an imaginary component equal to zero (for her, but not syr). - Changed control tree for her to forgo packing. - Added unit diagonal support to fnormm. - Redefined real versions of abval2s macros in terms of fabs(), fabsf(). - Redefined complex versions of sqrt2s macros using the actual "complex square root" formula. - Created new level-0 object-based routines, suffixed with "sc" (for "scalar"). - Defined new level-1v, -1d, and -1m versions of add and sub operations (two-operand add and subtract). 
- Added new scalar macros: - getris: acquire real and imaginary components. - setris: set real and imaginary components. - addjs: addition with conjugated x. - subjs: subtraction with conjugated x. - Defined new utility operations: - absumv: element-wise sum of absolute values for vector elements. - absumm: element-wise sum of absolute values for matrix elements. - mkherm: convert existing matrix to Hermitian. - mksymm: convert existing matrix to symmetric. - mktrim: convert existing matrix to triangular. - Added various error checking routines. - Added bl2_clock_min_diff(), which is used to more cleanly measure the wall clock time of a code block. - Added general stride support to bl2_obj_alloc_buffer(). - Added bl2_obj_init_scalar(). - Updated parameter mapping in bl2_param_map.c. - Added support for queriable version string. - Fixed a bug in the her2k macro-kernels (which currently are simply implemented in terms of two invocations of herk) whereby beta was being applied to both the first and second rank-k updates, rather than only the first. - Fixed a bug in trmm/trsm whereby transpose and right side cases were not properly implemented due to erroneous assumptions regarding aliasing and root objects. - Fixed a bug in the upper triangular trsm macro-kernel in which the wrong MR x NR block of B was being updated. - Fixed a bug in the inverts macro in the double real case whereby the value was typecast to float before inversion. This affected non-unit cases of dtrsm. - Fixed a bug in the reference kernels for gemmtrsm whereby the minus one constant was being applied incorrectly. - Fixed a bug in the overall treatment of non-unit alpha for trsm. The code now mimics the rank-k strategy of gemm, whereby alpha is applied during the first iteration of variant 3, with BLIS_ONE passed in instead for subsequent iterations. This also required passing alpha into the macro-kernels as well as the fused gemmtrsm micro-kernels. 
- Fixed a bug in trsm_u_blk_var1 whereby the gemm macro-kernel was being called for blocks strictly above the diagonal. While this sounds good in theory, this cannot be done because gemm_ker_var2 expects row panels of A to be packed from top to bottom, while for trsm_u, A is actually packed from bottom to top due to the reverse (BR->TL) nature of the algorithm. - Fixed a bug in packm_cxk() whereby panel packings with unit panel dimensions were mishandled due to incorrect arguments to the copyv kernel. Also changed the copyv kernel invocation to scal2v so that these edge cases are properly handled when scaling is requested. - Fixed a bug in packv_int() whereby an uninitialized object is passed in instead of the source object. - Fixed a bug whereby level-2 code could allocate memory dynamically via bl2_malloc() and then attempt to free it via bl2_mm_release(). Also fixed a potential future bug whereby a mem_t object that is actually no longer "allocated" from the static pool is mistaken for being allocated due to failure to NULLify the buffer when the block was most recently released. - Fixed a bug in bl2_acquire_mpart_*() whereby the uplo field was mistakenly toggled when the requested subpartition needed to be "reflected" due to it residing in an unstored region.	2013-02-11 13:20:44 -06:00
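
The "enable only" override described in commit 8c4e55a1a1 (a switch value of "2" enables only those operations, regardless of section overrides and other switches) can be sketched roughly as follows. This is an illustration only, not the testsuite's actual code: the enum names and both helper functions are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical switch values mirroring the 0/1/2 settings in input.operations. */
enum { OP_DISABLE = 0, OP_ENABLE = 1, OP_ENABLE_ONLY = 2 };

/* Returns true if at least one operation switch is set to "enable only". */
static bool any_enable_only(const int *switches, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        if (switches[i] == OP_ENABLE_ONLY)
            return true;
    return false;
}

/* Decides whether operation i should be tested.
   If any switch is "enable only", just those operations run and the
   section override is ignored; otherwise the operation runs only when
   both its section and its own switch are enabled. */
static bool op_should_run(const int *switches, size_t n, size_t i,
                          bool section_enabled)
{
    if (any_enable_only(switches, n))
        return switches[i] == OP_ENABLE_ONLY;
    return section_enabled && switches[i] == OP_ENABLE;
}
```

Under this sketch, resetting the "2" back to "1" restores the earlier combination of tests, since no other switch had to be touched.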

23 Commits
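
The argument guard from commit 970b431416 (i?amax() returns 0 if n < 1 || incx <= 0) can be illustrated with a small reference-style routine. This is a sketch under assumptions, not the code in bla_amax.c: the name ref_idamax is hypothetical, and it follows the BLAS convention of returning a 1-based index of the first element with the largest absolute value.

```c
/* Hypothetical reference-style idamax for doubles.
   Returns the 1-based index of the first element with the largest
   absolute value, or 0 when the arguments are invalid. */
static int ref_idamax(int n, const double *x, int incx)
{
    /* Guard described in the commit message: nothing to search. */
    if (n < 1 || incx <= 0)
        return 0;

    int best = 1;
    double best_abs = x[0] < 0.0 ? -x[0] : x[0];

    for (int i = 1; i < n; ++i) {
        double v = x[(long)i * incx];
        double a = v < 0.0 ? -v : v;
        /* Strict comparison keeps the first occurrence on ties. */
        if (a > best_abs) {
            best_abs = a;
            best = i + 1;
        }
    }
    return best;
}
```

Because ties resolve to the first occurrence here, a test that compares only the maximum value (as the amaxv test module in commit 86969873b5 does) would also accept an implementation that resolves ties differently.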