amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-04 14:31:12 +00:00

Author	SHA1	Message	Date
Field G. Van Zee	e29b1f9706	Fixed failing testsuite gemmtrsm_ukr for power9. Details: - Added code that fixes false failures in the gemmtrsm_ukr module of the testsuite. The tests were failing because the computation (bli_gemv()) that performs the numerical check was not able to properly traverse the matrix operands bx1 and b11 that are views into the micropanel of B, which has duplicated/broadcast elements under the power9 subconfig. (For example, a micropanel of B with duplication factor of 2 needs to use a column stride of 2; previously, the column stride was being interpreted as 1.) - Defined separate bli_obj_set_row_stride() and bli_obj_set_col_stride() static functions in bli_obj_macro_defs.h. (Previously, only the function bli_obj_set_strides() was defined. Amazing to think that we got this far without these former functions.) - Updated/expounded upon comments.	2019-11-05 17:15:19 -06:00
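The stride issue described in `e29b1f9706` can be illustrated with a minimal sketch. It assumes a packed micropanel of B that stores each row contiguously and repeats every element `dup` times; the helper name `bpanel_get_sketch` and this exact layout are assumptions for illustration, not BLIS APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: read logical element (i, j) of a k-by-nr micropanel
   of B in which every element is stored dup times back to back. */
static double bpanel_get_sketch(const double *bp, size_t i, size_t j,
                                size_t nr, size_t dup)
{
    assert(dup > 0);
    size_t rs = nr * dup; /* row stride: one full row of duplicated elements */
    size_t cs = dup;      /* column stride: skip over the duplicates */
    return bp[i * rs + j * cs];
}
```

With `dup = 2`, treating the column stride as 1 would read the second copy of element (i, 0) where element (i, 1) was wanted, which is the kind of mismatch that would produce a false failure in a numerical check.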
Field G. Van Zee	49177a6b9a	Fixed latent testsuite ukr module bugs for power9. Details: - Fixed a latent bug in the testsuite ukernel modules (gemm, trsm, and gemmtrsm) that only manifested once we began running with parameters that mimic those of power9. The problem was rooted in the way those modules were creating objects (and thus allocating memory) for the micropanel operands to the microkernel being tested. Since power9 duplicates/broadcasts elements of B in memory, we needed an easy way of asking for more than one storage element per logical element in the matrix. I incorrectly expressed this as: bli_obj_create( datatype, k, n, ldbp, 1, &bp ); The problem here is that bli_obj_create() is exceedingly efficient at calculating the size it passes to malloc() and doesn't allocate a full leading dimension's worth of elements for the last column (or row, in this example). This would normally not bother anyone since you're not supposed to access that memory anyway. But here, my attempted "hack" for getting extra elements was insufficient, and needed to be changed to: bli_obj_create( datatype, k, ldbp, ldbp, 1, &bp ); That is, the extra elements needed to be baked into the dimensions of the matrix object in order to have the intended effect on the number of elements actually allocated. Thanks to Jeff Hammond for reporting this bug. - Fixed a typically harmless memory leak in the aforementioned test modules (the objects for the packed micropanels were not being freed). - Updated/expanded a common comment across all three ukr test modules.	2019-11-04 18:09:37 -06:00
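The allocation shortfall described in `49177a6b9a` comes down to simple arithmetic. The sketch below assumes the buffer size is the distance from the lowest-addressed to the highest-addressed element plus one; the function name `span_elems_sketch` is hypothetical and the formula is an assumption about how such a size would be computed, not a quotation of the library's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of elements spanned by an m-by-n matrix with
   row stride rs and column stride cs (positive strides assumed). */
static size_t span_elems_sketch(size_t m, size_t n, size_t rs, size_t cs)
{
    assert(rs > 0 && cs > 0);
    if (m == 0 || n == 0)
        return 0;
    /* Offset of element (m-1, n-1), plus one for that element itself. */
    return (m - 1) * rs + (n - 1) * cs + 1;
}
```

For k = 4, n = 6, and ldbp = 12, the dimensions `k, n` with strides `ldbp, 1` span 42 elements, while four full rows of 12 need 48. Passing `ldbp` as the second dimension makes the span equal to 48.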
Field G. Van Zee	c84391314d	Reverted minor temp/wspace changes from `b426f9e`. Details: - Added missing license header to bli_pwr9_asm_macros_12x6.h. - Reverted temporary changes to various files in 'test' and 'testsuite' directories. - Moved testsuite/jobscripts into testsuite/old. - Minor whitespace/comment changes across various files.	2019-11-04 13:57:12 -06:00
Nicholai Tukanov	b426f9e04e	POWER9 DGEMM (#355 ) Implemented and registered power9 dgemm ukernel. Details: - Implemented 12x6 dgemm microkernel for power9. This microkernel assumes that elements of B have been duplicated/broadcast during the packing step. The microkernel uses a column orientation for its microtile vector registers and thus implements column storage and general stride IO cases. (A row storage IO case via in-register transposition may be added at a future date.) It should be noted that we recommend using this microkernel with gcc and not xlc, as issues with the latter cropped up during development, including but not limited to slightly incompatible vector register mnemonics in the GNU extended inline assembly clobber list.	2019-11-01 17:57:03 -05:00
Field G. Van Zee	6218ac95a5	Merge branch 'master' into amd	2019-10-11 11:53:51 -05:00
Field G. Van Zee	29b0e1ef4e	Code review + tweaks to AMD's AOCL 2.0 PR (#349 ). Details: - NOTE: This is a merge commit of 'master' of git://github.com/amd/blis into 'amd-master' of flame/blis. - Fixed a bug in the downstream value of BLIS_NUM_ARCHS, which was inadvertently not incremented when the Zen2 subconfiguration was added. - In bli_gemm_front(), added a missing conditional constraint around the call to bli_gemm_small() that ensures that the computation precision of C matches the storage precision of C. - In bli_syrk_front(), reorganized and relocated the notrans/trans logic that existed around the call to bli_syrk_small() into bli_syrk_small() to minimize the calling code footprint and also to bring that code into stylistic harmony with similar code in bli_gemm_front() and bli_trsm_front(). Also, replaced direct accessing of obj_t fields with proper accessor static functions (e.g. 'a->dim[0]' becomes 'bli_obj_length( a )'). - Added #ifdef BLIS_ENABLE_SMALL_MATRIX guard around prototypes for bli_gemm_small(), bli_syrk_small(), and bli_trsm_small(). This is strictly speaking unnecessary, but it serves as a useful visual cue to those who may be reading the files. - Removed cpp macro-protected small matrix debugging code from bli_trsm_front.c. - Added a GCC_OT_9_1_0 variable to build/config.mk.in to facilitate gcc version check for availability of -march=znver2, and added appropriate support to configure script. - Cleanups to compiler flags common to recent AMD microarchitectures in config/zen/amd_config.mk, including: removal of -march=znver1 et al. from CKVECFLAGS (since the -march flag is added within make_defs.mk); setting CRVECFLAGS similarly to CKVECFLAGS. - Cleanups to config/zen/bli_cntx_init_zen.c. - Cleanups, added comments to config/zen/make_defs.mk. - Cleanups to config/zen2/make_defs.mk, including making use of newly-added GCC_OT_9_1_0 and existing GCC_OT_6_1_0 to choose the correct set of compiler flags based on the version of gcc being used. - Reverted downstream changes to test/test_gemm.c. - Various whitespace/comment changes.	2019-10-11 10:24:24 -05:00
Field G. Van Zee	31c8657f1d	Added support for pre-broadcast when packing B. Details: - Added support for being able to duplicate (broadcast) elements in memory when packing matrix B (i.e., the right-hand operand) in level-3 operations. This turns out advantageous for some architectures that can afford the cost of the extra bandwidth and somehow benefit from the pre-broadcast elements (and thus being able to avoid using broadcast-style load instructions on micro-rows of B in the gemm microkernel). - Support optionally disabling right-side hemm and symm. If this occurs, hemm_r is implemented in terms of hemm_l (and symm_r in terms of symm_l). This is needed when broadcasting during packing because the alternative--supporting the broadcast of B while also allowing matrix B to be Hermitian/symmetric--would be an absolute mess. - Support alignment factors for packed blocks of A, B, and C separately (as well as for general-purpose buffers). In addition, we support byte offsets from those alignment values (which is different from aligning by align+offset bytes to begin with). The default alignment values are BLIS_PAGE_SIZE in all four cases, with the offset values defaulting to zero. - Pass pack_t schema into bli_?packm_cxk() so that it can be then passed into the packm kernel, where it will be needed by packm kernels that perform broadcasts of B, since the idea is that we only want to broadcast when packing micropanels of B and not A. - Added definition for variadic bli_cntx_set_l3_vir_ukrs(), which can be used to set custom virtual level-3 microkernels in the cntx_t, which would typically be done in the bli_cntx_init_*() function defined in the subconfiguration of interest. - Added a "broadcast B" kernel function for use with NP/NR = 12/6, defined in ref_kernels/1m/bli_packm_cxk_bb_ref.c. - Added a gemm, gemmtrsm, and trsm "broadcast B" reference kernels defined in ref_kernels/3/bb. (These kernels have been tested with double real with NP/NR = 12/6.) - Added #ifndef ... #endif guards around several macro constants defined in frame/include/bli_kernel_macro_defs.h. - Defined a few "broadcast B" static functions in frame/include/level0/bb for use by "broadcast B"-style packm reference kernels. For now, only the real domain kernels are tested and fully defined. - Output the alignment and offset values for packed blocks of A and B in the testsuite's "BLIS configuration info" section. - Comment updates to various files. - Bumped so_version to 3.0.0.	2019-09-17 17:42:10 -05:00
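The pre-broadcast packing idea in `31c8657f1d` can be sketched as a loop that writes each element of a k-by-nr slice of B several times in a row. The function name `dpack_b_bcast_sketch`, its signature, and the packed layout are assumptions for illustration; the actual reference kernel in ref_kernels/1m may be organized differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: pack a k-by-nr slice of B (strides rs_b, cs_b) into p,
   storing each element dup times consecutively. p must hold k*nr*dup doubles. */
static void dpack_b_bcast_sketch(size_t k, size_t nr, size_t dup,
                                 const double *b, size_t rs_b, size_t cs_b,
                                 double *p)
{
    assert(dup > 0);
    for (size_t l = 0; l < k; ++l)          /* rows of the micropanel */
        for (size_t j = 0; j < nr; ++j)     /* logical columns */
            for (size_t d = 0; d < dup; ++d) /* duplicates of one element */
                p[(l * nr + j) * dup + d] = b[l * rs_b + j * cs_b];
}
```

A microkernel reading such a panel can use plain vector loads where it would otherwise need broadcast-style loads, at the cost of `dup` times the memory traffic for B.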
Field G. Van Zee	c22b9dba58	More updates to comments in testsuite modules. Details: - Updated most comments in testsuite modules that describe how the correctness test is performed so that it is clear whether the vector (normfv) or matrix (normfm) form of Frobenius norm is used.	2019-07-16 13:14:47 -05:00
Field G. Van Zee	c4cc6fa702	New cntx_t blksz "set" functions + misc tweaks. Details: - Defined two new static functions in bli_cntx.h: bli_cntx_set_blksz_def_dt() bli_cntx_set_blksz_max_dt() which developers may find convenient when experimenting with different values of cache blocksizes. - Updated one- and two-socket multithreaded problem size range and increment values in test/3/Makefile. - Changed default to column storage in test/3/test_gemm.c. - Fixed typo in comment in testsuite/src/test_subm.c.	2019-07-16 13:00:35 -05:00
Field G. Van Zee	6bf449cc69	Merge branch 'amd'	2019-05-31 17:42:40 -05:00
kdevraje	13806ba3b0	Updated copyright information to the form "(start year) - 2019". Change-Id: Ide3c8f7172210b8d3538d3c36e88634ab1ba9041	2019-05-27 16:24:43 +05:30
Field G. Van Zee	057f5f3d21	Minor build system housekeeping. Details: - Commented out redundant setting of LIBBLIS_LINK within all driver- level Makefiles. This variable is already set within common.mk, and so the only time it should be overridden is if the user wants to link to a different copy of libblis. - Very minor changes to build/gen-make-frags/gen-make-frag.sh. - Whitespace and inconsequential quoting change to configure. - Moved top-level 'windows' directory into a new 'attic' directory.	2019-05-23 12:51:17 -05:00
Field G. Van Zee	b9c9f03502	Implemented gemm on skinny/unpacked matrices. Details: - Implemented a new sub-framework within BLIS to support the management of code and kernels that specifically target matrix problems for which at least one dimension is deemed to be small, which can result in long and skinny matrix operands that are ill-suited for the conventional level-3 implementations in BLIS. The new framework tackles the problem in two ways. First the stripped-down algorithmic loops forgo the packing that is famously performed in the classic code path. That is, the computation is performed by a new family of kernels tailored specifically for operating on the source matrices as-is (unpacked). Second, these new kernels will typically (and in the case of haswell and zen, do in fact) include separate assembly sub-kernels for handling of edge cases, which helps smooth performance when performing problems whose m and n dimension are not naturally multiples of the register blocksizes. In a reference to the sub-framework's purpose of supporting skinny/unpacked level-3 operations, the "sup" operation suffix (e.g. gemmsup) is typically used to denote a separate namespace for related code and kernels. NOTE: Since the sup framework does not perform any packing, it targets row- and column-stored matrices A, B, and C. For now, if any matrix has non-unit strides in both dimensions, the problem is computed by the conventional implementation. - Implemented the default sup handler as a front-end to two variants. bli_gemmsup_ref_var2() provides a block-panel variant (in which the 2nd loop around the microkernel iterates over n and the 1st loop iterates over m), while bli_gemmsup_ref_var1() provides a panel-block variant (2nd loop over m and 1st loop over n). However, these variants are not used by default and provided for reference only. 
Instead, the default sup handler calls _var2m() and _var1n(), which are similar to _var2() and _var1(), respectively, except that they defer to the sup kernel itself to iterate over the m and n dimension, respectively. In other words, these variants rely not on microkernels, but on so-called "millikernels" that iterate along m and k, or n and k. The benefit of using millikernels is a reduction of function call and related (local integer typecast) overhead as well as the ability for the kernel to know which micropanel (A or B) will change during the next iteration of the 1st loop, which allows it to focus its prefetching on that micropanel. (In _var2m()'s millikernel, the upanel of A changes while the same upanel of B is reused. In _var1n()'s, the upanel of B changes while the upanel of A is reused.) - Added a new configure option, --[en|dis]able-sup-handling, which is enabled by default. However, the default thresholds at which the default sup handler is activated are set to zero for each of the m, n, and k dimensions, which effectively disables the implementation. (The default sup handler only accepts the problem if at least one dimension is smaller than or equal to its corresponding threshold. If all dimensions are larger than their thresholds, the problem is rejected by the sup front-end and control is passed back to the conventional implementation, which proceeds normally.) - Added support to the cntx_t structure to track new fields related to the sup framework, most notably: - sup thresholds: the thresholds at which the sup handler is called. - sup handlers: the address of the function to call to implement the level-3 skinny/unpacked matrix implementation. - sup blocksizes: the register and cache blocksizes used by the sup implementation (which may be the same or different from those used by the conventional packm-based approach). - sup kernels: the kernels that the handler will use in implementing the sup functionality. 
- sup kernel prefs: the IO preference of the sup kernels, which may differ from the preferences of the conventional gemm microkernels' IO preferences. - Added a bool_t to the rntm_t structure that indicates whether sup handling should be enabled/disabled. This allows per-call control of whether the sup implementation is used, which is useful for test drivers that wish to switch between the conventional and sup codes without having to link to different copies of BLIS. The corresponding accessor functions for this new bool_t are defined in bli_rntm.h. - Implemented several row-preferential gemmsup kernels in a new directory, kernels/haswell/3/sup. These kernels include two general implementation types--'rd' and 'rv'--for the 6x8 base shape, with two specialized millikernels that embed the 1st loop within the kernel itself. - Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference gemmsup microkernels. NOTE: These microkernels, unlike the current crop of conventional (pack-based) microkernels, do not use constant loop bounds. Additionally, their inner loop iterates over the k dimension. - Defined new typedef enums: - stor3_t: captures the effective storage combination of the level-3 problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A special value of BLIS_XXX is used to denote an arbitrary combination which, in practice, means that at least one of the operands is stored according to general stride. - threshid_t: captures each of the three dimension thresholds. - Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create() can be passed "-1, -1" as a lazy request for row storage. (Note that "0, 0" is still accepted as a lazy request for column storage.) - Added support for various instructions to bli_x86_asm_macros.h, including imul, vhaddps/pd, and other instructions related to integer vectors. 
- Disabled the older small matrix handling code inserted by AMD in bli_gemm_front.c, since the sup framework introduced in this commit is intended to provide a more generalized solution. - Added test/sup directory, which contains standalone performance test drivers, a Makefile, a runme.sh script, and an 'octave' directory containing scripts compatible with GNU Octave. (They also may work with matlab, but if not, they are probably close to working.) - Reinterpret the storage combination string (sc_str) in the various level-3 testsuite modules (e.g. src/test_gemm.c) so that the order of each matrix storage char is "cab" rather than "abc". - Comment updates in level-3 BLAS API wrappers in frame/compat.	2019-04-27 18:44:50 -05:00
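The acceptance rule that `b9c9f03502` describes for the default sup handler (accept if at least one dimension is at or below its threshold) can be sketched as follows. The struct `sup_thresh_sketch_t` and the function `sup_accepts_sketch` are hypothetical names; the real thresholds live in the cntx_t and are queried through the library's own accessors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container for the three per-dimension thresholds. */
typedef struct { size_t mt, nt, kt; } sup_thresh_sketch_t;

/* Returns true if the skinny/unpacked path would take the problem; otherwise
   the caller falls back to the conventional (packing) implementation. */
static bool sup_accepts_sketch(size_t m, size_t n, size_t k,
                               const sup_thresh_sketch_t *t)
{
    assert(t != NULL);
    return m <= t->mt || n <= t->nt || k <= t->kt;
}
```

With all thresholds at zero, no problem with nonzero dimensions is accepted, which matches the message's remark that the default settings effectively disable the implementation.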
Field G. Van Zee	89cd650e7b	Use void_fp for function pointers instead of void*. Change void*-typed function pointers to void_fp. - Updated all instances of void* variables that store function pointers to variables of a new type, void_fp. Originally, I wanted to define the type of void_fp as "void (*void_fp)( void )"--that is, a pointer to a function with no return value and no arguments. However, once I did this, I realized that gcc complains with incompatible pointer type (-Wincompatible-pointer-types) warnings every time any such pointer is being assigned to its final, type-accurate function pointer type. That is, gcc will silently typecast a void* to another defined function pointer type (e.g. dscalv_ker_ft) during an assignment from the former to the latter, but the same statement will trigger a warning when typecasting from a void_fp type. I suspect an explicit typecast is needed in order to avoid the warning, which I'm not willing to insert at this time. - Added a typedef to bli_type_defs.h defining void_fp as void*, along with a commented-out version of the aborted definition described above. (Note that POSIX requires that void* and function pointers be interchangeable; it is the C standard that does not provide this guarantee.) - Comment updates to various _oapi.c files.	2019-04-02 17:23:55 -05:00
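A minimal sketch of the `void_fp` pattern from `89cd650e7b`: a generic handle stores a kernel address and is converted back to a typed function pointer before the call. Only the `void_fp` typedef comes from the message; the kernel type `dscalv_sketch_ft` and the three functions are hypothetical. The casts are written out explicitly here, and the conversion relies on the platform treating `void*` and function pointers as interchangeable, which ISO C does not guarantee.

```c
#include <assert.h>
#include <stddef.h>

typedef void *void_fp; /* generic function-pointer handle, as in the commit */

/* Hypothetical typed kernel signature and kernel. */
typedef void (*dscalv_sketch_ft)(size_t n, double alpha, double *x);

static void dscalv_sketch(size_t n, double alpha, double *x)
{
    for (size_t i = 0; i < n; ++i)
        x[i] *= alpha;
}

/* Hypothetical query that hands the kernel back as an untyped handle. */
static void_fp query_kernel_sketch(void)
{
    return (void_fp)dscalv_sketch;
}

static void run_dscalv_sketch(size_t n, double alpha, double *x)
{
    /* Recover the type-accurate function pointer before calling. */
    dscalv_sketch_ft f = (dscalv_sketch_ft)query_kernel_sketch();
    assert(f != NULL);
    f(n, alpha, x);
}
```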
Field G. Van Zee	809395649c	Annotated additional symbols for export. Details: - Added export annotations to additional function prototypes in order to accommodate the testsuite. - Disabled calling bli_amaxv_check() from within the testsuite's test_amaxv.c.	2019-03-13 18:21:35 -05:00
Field G. Van Zee	c665eb9b88	Minor updates to docs, Makefiles. Details: - Changed all occurrences of micro-kernel -> microkernel, macro-kernel -> macrokernel, and micro-panel -> micropanel in all markdown documents in 'docs' directory. This change is being made since we've reached the point in adoption and acceptance of BLIS's insights where words such as "microkernel" are no longer new, and therefore now merit being unhyphenated. - Updated "Implementation Notes" sections of KernelsHowTo.md, which still contained references to nonexistent cpp macros such as BLIS_DEFAULT_MR_? and BLIS_PACKDIM_MR_?. - Added 'run-fast' and 'check-fast' targets to testsuite/Makefile. - Minor updates to Testsuite.md, including suggesting use of 'make check' and 'make check-fast' when running from the local testsuite directory. - Added a comment to top-level Makefile explaining the purpose behind the TESTSUITE_WRAPPER variable, which at first glance appears to serve no purpose.	2019-01-28 16:22:23 -06:00
Field G. Van Zee	bdd46f9ee8	Rewrote reference kernels to use #pragma omp simd. Details: - Rewrote level-1v, -1f, and -3 reference kernels in terms of simplified indexing annotated by the #pragma omp simd directive, which a compiler can use to vectorize certain constant-bounded loops. (The new kernels actually use _Pragma("omp simd") since the kernels are defined via templatizing macros.) Modest speedup was observed in most cases using gcc 5.4.0, which may improve with newer versions. Thanks to Devin Matthews for suggesting this via issue #286 and #259. - Updated default blocksizes defined in ref_kernels/bli_cntx_ref.c to be 4x16, 4x8, 4x8, and 4x4 for single, double, scomplex and dcomplex, respectively, with a default row preference for the gemm ukernel. Also updated axpyf, dotxf, and dotxaxpyf fusing factors to 8, 6, and 4, respectively, for all datatypes. - Modified configure to verify that -fopenmp-simd is a valid compiler option (via a new detect/omp_simd/omp_simd_detect.c file). - Added a new header in which prefetch macros are defined according to which compiler is detected (via macros such as __GNUC__). These prefetch macros are not yet employed anywhere, though. - Updated the year in copyrights of template license headers in build/templates and removed AMD as a default copyright holder.	2019-01-24 17:23:18 -06:00
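The remark in `bdd46f9ee8` about using `_Pragma("omp simd")` because the kernels are generated by macros can be illustrated with a small sketch. A `#pragma` line cannot appear inside a macro body, whereas the `_Pragma` operator can. The macro `GEN_AXPYV_REF_SKETCH` and the generated function names are hypothetical and far simpler than the real reference kernels.

```c
#include <assert.h>

/* Hypothetical templatizing macro: generates an axpyv-style loop for one
   datatype. _Pragma is used because #pragma is not allowed in a macro body. */
#define GEN_AXPYV_REF_SKETCH(ch, ctype)                                   \
static void ch##axpyv_ref_sketch(int n, ctype alpha,                      \
                                 const ctype *x, ctype *y)                \
{                                                                         \
    assert(n >= 0);                                                       \
    _Pragma("omp simd")                                                   \
    for (int i = 0; i < n; ++i)                                           \
        y[i] += alpha * x[i];                                             \
}

GEN_AXPYV_REF_SKETCH(s, float)  /* defines saxpyv_ref_sketch() */
GEN_AXPYV_REF_SKETCH(d, double) /* defines daxpyv_ref_sketch() */
```

Compiling with gcc's `-fopenmp-simd` lets the compiler act on the directive; without that flag the pragma is ignored and the loop still computes the same result.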
Field G. Van Zee	2f3174330f	Implemented a pool-based small block allocator. Details: - Implemented a sophisticated data structure and set of APIs that track the small blocks of memory (around 80-100 bytes each) used when creating nodes for control and thread trees (cntl_t and thrinfo_t) as well as thread communicators (thrcomm_t). The purpose of the small block allocator, or sba, is to allow the library to transition into a runtime state in which it does not perform any calls to malloc() or free() during normal execution of level-3 operations, regardless of the threading environment (potentially multiple application threads as well as multiple BLIS threads). The functionality relies on a new data structure, apool_t, which is (roughly speaking) a pool of arrays, where each array element is a pool of small blocks. The outer pool, which is protected by a mutex, provides separate arrays for each application thread while the arrays each handle multiple BLIS threads for any given application thread. The design minimizes the potential for lock contention, as only concurrent application threads would need to fight for the apool_t lock, and only if they happen to begin their level-3 operations at precisely the same time. Thanks to Kiran Varaganti and AMD for requesting this feature. - Added a configure option to disable the sba pools, which are enabled by default; renamed the --[dis|en]able-packbuf-pools option to --[dis|en]able-pba-pools; and rewrote the --help text associated with this new option and consolidated it with the --help text for the option associated with the sba (--[dis|en]able-sba-pools). - Moved the membrk field from the cntx_t to the rntm_t. We now pass in a rntm_t* to the bli_membrk_acquire() and _release() APIs, just as we do for bli_sba_acquire() and _release(). - Replaced all calls to bli_malloc_intl() and bli_free_intl() that are used for small blocks with calls to bli_sba_acquire(), which takes a rntm (in addition to the bytes requested), and bli_sba_release(). These latter two functions reduce to the former two when the sba pools are disabled at configure-time. - Added rntm_t* arguments to various cntl_t and thrinfo_t functions, as required by the new usage of bli_sba_acquire() and _release(). - Moved the freeing of "old" blocks (those allocated prior to a change in the block_size) from bli_membrk_acquire_m() to the implementation of the pool_t checkout function. - Miscellaneous improvements to the pool_t API. - Added a block_size field to the pblk_t. - Harmonized the way that the trsm_ukr testsuite module performs packing relative to that of gemmtrsm_ukr, in part to avoid the need to create a packm control tree node, which now requires a rntm_t that has been initialized with an sba and membrk. - Re-enabled the explicit call to bli_finalize() in testsuite so that users who run the testsuite with memory tracing enabled can check for memory leaks. - Manually imported the compact/minor changes from `61441b24` that cause the rntm to be copied locally when it is passed in via one of the expert APIs. - Reordered parameters to various bli_thrcomm_() functions so that the thrcomm_t to the comm being modified is last, not first. - Added more descriptive tracing for allocating/freeing small blocks and formalized via a new configure option: --[dis|en]able-mem-tracing. - Moved some unused scalm code and headers into frame/1m/other. - Whitespace changes to bli_pthread.c. - Regenerated build/libblis-symbols.def.	2018-12-25 19:35:01 -06:00
Field G. Van Zee	f808d829c5	Handle edge cases, zero-filling in packm kernels. Details: - Updated the API and semantics of packm kernels such that they must now handle edge cases, meaning that a c-by-k packm kernel must be able to pack edge cases that are fewer than c rows/columns and be able to zero-fill the remaining elements. They must also be able to zero-fill the equivalent region when copying fewer than k columns/rows (which is needed by trsm). The new packm kernel API is generally: void packm_kernel ( conj_t conja, dim_t cdim, dim_t n, dim_t n_max, ctype* restrict kappa, ctype* restrict a, inc_t inca, inc_t lda, ctype* restrict p, inc_t ldp, cntx_t* restrict cntx ); where cdim and n are the dimensions (short and long, respectively) of the submatrix being copied from the source matrix A, and n_max is the "full" long dimension (corresponding to the k dimension in gemm) of the micropanel. The "full" short dimension (corresponding to the register blocksize MR or NR) is not part of the API because it is known intrinsically by the packm kernel implementation. Thanks to Devin Matthews for prompting us to make this change (#282). - Updated all reference packm kernels in ref_kernels/1m according to above changes, as well as all optimized packm kernels (which only consisted of those for knl). - Bumped the major soname version number in 'so_version' to 2. At first I was considering leaving it unchanged, but I couldn't escape the reality that the packm kernel API is much closer to an expert API than it is some obscure helper function interface within the framework that nobody would ever notice. - Removed reference packm kernels for mr/nr = 30. The only sub-config that would have been using those kernels is knc, which is likely no longer being used by very many people (if any). (This also mostly offset the larger object code footprint incurred by moving the edge- case handling into the individual packm kernels.) 
- Fixed an obscure race condition for 3mh and 4mh induced methods in which those implementations were modifying the contexts stored in the gks rather than a local copy. - Fixed a minor bug in the testsuite that prevented non-1m-based induced method implementations of trsm from executing.	2018-12-12 15:22:59 -06:00
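The edge-case and zero-fill semantics that `f808d829c5` requires of packm kernels can be sketched for a single real datatype. The parameter names follow the API quoted in the message, but this version is a simplification under stated assumptions: it drops `conja` and `cntx`, passes `kappa` by value, and hard-codes a hypothetical register blocksize `MR_SKETCH`.

```c
#include <assert.h>
#include <stddef.h>

enum { MR_SKETCH = 4 }; /* hypothetical "full" short dimension of the panel */

/* Pack a cdim-by-n submatrix of A into an MR_SKETCH-by-n_max micropanel p
   (column stride ldp), scaling by kappa and zero-filling everything outside
   the copied region. */
static void dpackm_4xk_sketch(size_t cdim, size_t n, size_t n_max, double kappa,
                              const double *a, size_t inca, size_t lda,
                              double *p, size_t ldp)
{
    assert(cdim <= MR_SKETCH && n <= n_max && ldp >= MR_SKETCH);
    for (size_t j = 0; j < n_max; ++j)
        for (size_t i = 0; i < MR_SKETCH; ++i)
            p[i + j * ldp] = (i < cdim && j < n)
                                 ? kappa * a[i * inca + j * lda]
                                 : 0.0; /* edge rows and trailing columns */
}
```

Rows `cdim..MR_SKETCH-1` cover the short-dimension edge case, and columns `n..n_max-1` cover the long-dimension fill that the message says trsm needs.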
Field G. Van Zee	0645f239fb	Remove UT-Austin from copyright headers' clause 3. Details: - Removed explicit reference to The University of Texas at Austin in the third clause of the license comment blocks of all relevant files and replaced it with a more all-encompassing "copyright holder(s)". - Removed duplicate words ("derived") from a few kernels' license comment blocks. - Homogenized license comment block in kernels/zen/3/bli_gemm_small.c with format of all other comment blocks.	2018-12-04 14:31:06 -06:00
Field G. Van Zee	375eb30b0a	Added mixed-precision support to 1m method. Details: - Lifted the constraint that 1m only be used when all operands' storage datatypes (along with the computation datatype) are equal. Now, 1m may be used as long as all operands are stored in the complex domain. This change largely consisted of adding the ability to pack to 1e and 1r formats from one precision to another. It also required adding logic for handling complex values of alpha to bli_packm_blk_var1_md() (similar to the logic in bli_packm_blk_var1()). - Fixed a bug in several virtual microkernels (bli_gemm_md_c2r_ref.c, bli_gemm1m_ref.c, and bli_gemmtrsm1m_ref.c) that resulted in the wrong ukernel output preference field being read. Previously, the preference for the native complex ukernel was being read instead of the pref for the native real domain ukernel. This bug would not manifest if the preference for the native complex ukernel happened to be equal to that of the native real ukernel. - Added support for testing mixed-precision 1m execution via the gemm module of the testsuite. - Tweaked/simplified bli_gemm_front() and bli_gemm_md.c so that pack schemas are always read from the context, rather than trying to sometimes embed them directly to the A and B objects. (They are still embedded, but now uniformly only after reading the schemas from the context.) - Redefined cpp macro bli_l3_ind_recast_1m_params() as a static function and renamed to bli_gemm_ind_recast_1m_params() (since gemm is the only consumer). - Added 1m optimization logic (via bli_gemm_ind_recast_1m_params()) to bli_gemm_ker_var2_md(). - Added explicit handling for beta == 1 and beta == 0 in the reference gemm1m virtual microkernel in ref_kernels/ind/bli_gemm1m_ref.c. - Rewrote various level-0 macro defs, including axpyris, axpbyris, scal2ris, and xpbyris (and their conjugating counterparts) to explicitly support three operand types and updated invocations to xpbyris in bli_gemmtrsm1m_ref.c. 
- Query and use the storage datatype of the packed object instead of the storage datatype of the source object in bli_packm_blk_var1(). - Relocated and renamed frame/ind/misc/bli_l3_ind_opt.h to frame/3/gemm/ind/bli_gemm_ind_opt.h. - Various whitespace/comment updates.	2018-12-03 17:49:52 -06:00
Field G. Van Zee	e769bf46b0	Tweak testsuite to issue FAIL for Nan, Inf (#279 ). Details: - Adjusted the definition for libblis_test_get_string_for_result() in testsuite/src/test_libblis.c so that the "FAIL" string is returned if the computed residual contains either NaN or Inf. Previously, a residual containing NaN would result in the selection of the "PASS" string. Thanks to Devin Matthews for reporting this issue (#279). - Expounded on comment for the macro definitions of bli_isnan() and bli_isinf() in bli_misc_macro_defs.h to make it more obvious why they must remain macros.	2018-11-20 16:16:53 -06:00
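The failure mode fixed in `e769bf46b0` follows from IEEE 754 semantics: every ordered comparison involving NaN is false, so a check of the form `resid > thresh` never fires for a NaN residual. The sketch below shows a classification that tests for NaN and Inf first. The function name `result_string_sketch` and the single-threshold logic are simplifications, not the testsuite's actual code.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical, simplified classifier for a computed residual. */
static const char *result_string_sketch(double resid, double thresh)
{
    assert(!isnan(thresh));
    /* Without this check a NaN residual would fall through to "PASS",
       because (NaN > thresh) evaluates to false. */
    if (isnan(resid) || isinf(resid))
        return "FAIL";
    return (resid > thresh) ? "FAIL" : "PASS";
}
```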
Field G. Van Zee	4bbb454bf3	Testsuite docs update for mixed-datatype gemm. Details: - Updated docs/Testsuite.md to include mention of the new mixed-domain and mixed-precision settings, including descriptions. - Updated docs/MixedDatatypes.md to include a brief section on running the testsuite to exercise mixed-datatype functionality, which mostly amounts to a link to the Testsuite.md document. - Minor verbiage change to testsuite output to correct a misleading label associated with the value returned by the query function bli_info_get_simd_num_registers(). (The function does not return the number of SIMD registers present in the hardware, but rather a maximum assumed value for the purposes of allocating temporary microtile workspace on the function stack.)	2018-11-03 19:11:01 -05:00
Field G. Van Zee	f19c33af4c	Disallow 64b BLAS integers + 32b BLIS integers. Details: - Print an error message from configure if the user attempts to explicitly configure BLIS for simultaneous use of 64-bit integers in the BLAS API with 32-bit integers in the BLIS API. - Added cpp macro conditional to bli_type_defs.h to mandate that BLIS integers be 64 bits if the BLAS integers are 64 bits. This and the above item take care of issue #274. Thanks to Devin Matthews and Jeff Hammond for suggesting these safeguards. - Slight reorganization and relabeling (for clarity) of BLAS/CBLAS sections and BLIS integer size line of the testsuite configuration output. - Very minor edits to docs/MixedDatatypes.md.	2018-10-26 17:07:15 -05:00
Field G. Van Zee	6fbc456fb3	Added SALT testing to Travis CI. Details: - Modified .travis.yml to automatically employ the simulation of application-level threading within the testsuite, with supporting changes to common.mk, the top-level Makefile, and travis/do_testsuite.sh. - Added a new pair of input files to testsuite directory with the '.salt' suffix (similar to those with the '.fast' suffix) for testing application-level threading. - Updated docs/BuildSystem.md to document the new make targets 'testblis-salt' and 'checkblis-salt'.	2018-10-25 13:20:25 -05:00
Field G. Van Zee	4ee986f0a7	Added mixed-datatype testing to Travis CI (#271 ). Details: - Modified .travis.yml to automatically test the mixed-datatype support of the gemm operation, with supporting changes to common.mk, the top-level Makefile, and travis/do_testsuite.sh. - Added a new pair of input files to testsuite directory with the '.mixed' suffix (similar to those with the '.fast' suffix) for testing mixed-datatype gemm. - Updated docs/BuildSystem.md to document the new make targets 'testblis-md' and 'checkblis-md'.	2018-10-22 14:09:44 -05:00
Field G. Van Zee	090e4f08fc	Merge branch 'master' into dev	2018-10-19 18:41:10 -05:00
Field G. Van Zee	3678a1cd51	Merge branch 'master' into win-pthreads	2018-10-19 16:11:31 -05:00
Field G. Van Zee	473ce54f5f	Added bli_pthread_() API. Details: - Defined a bli_pthread_() API so that the testsuite, when being linked against a Windows DLL, will be able to access pthreads functionality without those pthreads functions being explicitly exported by the DLL. Instead, we export the bli_pthread_() layer, which uses types and functions that are identical to pthreads, but adds a 'bli_' prefix. Only a few basic functions are present in the bli_pthread_() API for now. Thanks to Devin Matthews and Isuru Fernando for their help on a related PR (#261) that this commit will hopefully facilitate. - Updated testsuite so that it calls bli_pthread_() layer instead of pthread_() functions directly. - Regenerated build/libblis-symbols.def. - Comment updated to build/regen-symbols.sh.	2018-10-18 19:03:56 -05:00
Field G. Van Zee	bb6df2814f	Defined a new level-1d operation: shiftd. Details: - Defined a new level-1d operation called 'shiftd', including object and typed APIs. This operation adds a scalar value to every element along an arbitrary diagonal of a matrix. Currently, shiftd is implemented in terms of the addv kernel. (The scalar is passed in as the x vector with an increment of zero.) - Replaced ad-hoc usage of setd and addd (after creating a temporary matrix object) with use of shiftd, which is much more concise, in various test driver files in the testsuite. Similar changes were made to the standalone test drivers and the example code. - Added documentation entries in BLISObjectAPI.md and BLISTypedAPI.md for bli_shiftd() and bli_?shiftd(), respectively. - Added observed object properties to level-1d documentation in BLISObjectAPI.md.	2018-10-18 17:11:39 -05:00
Field G. Van Zee	49d3f9fcbb	Merge branch 'master' into dev	2018-10-17 18:00:40 -05:00
Devin Matthews	29e6245816	Merge branch 'master' into win-pthreads	2018-10-16 10:12:25 -05:00
Field G. Van Zee	ed65771482	Fixed merge fail on testsuite threading macros. Details: - Applied the following C preprocessor macro renames: BLIS_DEFAULT_MR_THREAD_MAX -> BLIS_THREAD_MAX_IR, BLIS_DEFAULT_NR_THREAD_MAX -> BLIS_THREAD_MAX_JR, BLIS_DEFAULT_M_THREAD_RATIO -> BLIS_THREAD_RATIO_M, BLIS_DEFAULT_N_THREAD_RATIO -> BLIS_THREAD_RATIO_N in src/test_libblis.c. This is apparently the result of a failure by git to properly merge the 'master' and 'amd' branches in the previous commit. (The 'master' branch contained a commit, `53a9ab1`, in which these same cpp macros were renamed throughout the source distribution.)	2018-10-15 17:54:45 -05:00
Field G. Van Zee	779d64dc30	Added entry for xpbym to input.operations.fast. Details: - Forgot to add an entry for the new xpbym operation to input.operations.fast in previous commit.	2018-10-15 17:13:18 -05:00
Field G. Van Zee	5fec95b99f	Implemented mixed-datatype support for gemm. Details: - Implemented support for gemm where A, B, and C may have different storage datatypes, as well as a computational precision (and implied computation domain) that may be different from the storage precision of either A or B. This results in 128 different combinations, all of which are implemented within this commit. (For now, the mixed-datatype functionality is only supported via the object API.) If desired, the mixed-datatype support may be disabled at configure-time. - Added a memory-intensive optimization to certain mixed-datatype cases that requires a single m-by-n matrix be allocated (temporarily) per call to gemm. This optimization aims to avoid the overhead involved in repeatedly updating C with general stride, or updating C after a typecast from the computation precision. This memory optimization may be disabled at configure-time (provided that the mixed-datatype support is enabled in the first place). - Added support for testing mixed-datatype combinations to testsuite. The user may test gemm with mixed domains, precisions, both, or neither. - Added a standalone test driver directory for building and running mixed-datatype performance experiments. - Defined a new variation of castm, castnzm, which operates like castm except that imaginary values are not touched when casting a real operand to a complex operand. (By contrast, in these situations castm sets the imaginary components of the destination matrix to zero.) - Defined bli_obj_imag_is_zero() and substituted calls in lieu of all usages of bli_obj_imag_equals() that tested against BLIS_ZERO, and also simplified the implementation of bli_obj_imag_equals(). - Fixed bad behavior from bli_obj_is_real() and bli_obj_is_complex() when given BLIS_CONSTANT objects. - Disabled dt_on_output field in auxinfo_t structure as well as all accessor functions. Also commented out all usage of accessor functions within macrokernels. (Typecasting in the microkernel is still feasible, though probably unrealistic for now given the additional complexity required.) - Use void function pointer type (instead of void*) for storing function pointers in bli_l0_fpa.c. - Added documentation for using gemm with mixed datatypes in docs/MixedDatatypes.md and example code in examples/oapi/11gemm_md.c. - Defined level-1d operation xpbyd and level-1m operation xpbym. - Added xpbym test module to testsuite. - Updated frame/include/bli_x86_asm_macros.h with additional macros (courtesy of Devin Matthews).	2018-10-15 16:37:39 -05:00
Field G. Van Zee	f1dba506c9	Output threading status/params from testsuite. Details: - Updated testsuite to output various parameters related to parallelism in BLIS. These parameters include: - threading status: disabled, openmp, or pthreads; - thread partitioning for jr/ir loops: slab or rr (round-robin); - ways of parallelism from environment variables, and also actual values used by gemm, herk, trmm_l, trmm_r, trsm_l, and trsm_r for square problems (assuming all dimensions are set to 1000); - automatic thread factorization parameters. - Also output the status of two relatively new configure-time options: libmemkind and the sandbox.	2018-10-08 17:59:41 -05:00
Devin Matthews	b8dfd82e0d	Get pthreads via blis.h in the test driver.	2018-10-02 15:37:12 -05:00
Devin Matthews	627d0c5bfd	Combine the alternative barrier implementation for macOS with the pthread wrapper for Windows. Also implement pthread_{create,join} for Windows.	2018-10-02 14:40:55 -05:00
Field G. Van Zee	c03728f1f4	Various minor cleanups. Details: - Rewrote bli_winsys.c to define bli_setenv() and bli_sleep() unconditionally, but differently for Windows and non-Windows, but then disabled the definition of bli_setenv() entirely since BLIS no longer needs to set environment variables. Updated bli_winsys.h accordingly, and call bli_sleep() from within testsuite instead of sleep() directly. - Use #if !defined(_POSIX_BARRIERS) || (_POSIX_BARRIERS != 200809L) instead of #if !defined(_POSIX_BARRIERS) || (_POSIX_BARRIERS < 0) when guarding against local definition of pthread barrier in testsuite. (The description for unistd.h implies that _POSIX_BARRIERS should always be set to 200809L when barriers are supported, though I won't be surprised if we encounter a case in the future where it is set to something else such as 1 while still supported.) - Removed old _VERS_CONF_INST definitions and installation rules in top-level Makefile. These are no longer needed because we no longer output libraries with the version and configuration name as substrings. - Comment/whitespace updates in Makefile, config.mk.in, common.mk, configure, bli_extern_defs.h, and test_libblis.h. - Added mention of 1m to README.md and other trivial tweaks.	2018-09-10 17:54:27 -05:00
Field G. Van Zee	4b5437ec7a	Define a cpp macro specific to BLIS compilation. Details: - Tweaked the cflags functions in common.mk so that a new preprocessor macro, BLIS_IS_BUILDING_LIBRARY, is defined, but only when BLIS itself is being built. This macro will not be defined when, for example, the testsuite or example code compiles code local to those applications. This was done in part by defining a new cflags function get-user-cflags-for(), which is now the designated function for application Makefiles if they wish to inherit a basic set of CFLAGS from BLIS. (The compiler flags returned are identical to that of get-frame-cflags-for() except that -DBLIS_IS_BUILDING_LIBRARY is omitted.) - Updated all test driver-like makefiles to call get-user-cflags-for() instead of get-frame-cflags-for().	2018-09-07 17:24:32 -05:00
Mathieu Poumeyrol	4e7d06700f	second __APPLE__	2018-09-06 23:48:31 +02:00
Mathieu Poumeyrol	24ecc0d94a	use _POSIX_BARRIERS instead of __APPLE__	2018-09-06 22:10:16 +02:00
Mathieu Poumeyrol	d688a2b7e5	add an adhoc impl for pthread_barrier	2018-09-06 15:31:14 +02:00
Field G. Van Zee	4fa4cb0734	Trivial comment header updates. Details: - Removed four trailing spaces after "BLIS" that occurs in most files' commented-out license headers. - Added UT copyright lines to some files. (These files previously had only AMD copyright lines but were contributed to by both UT and AMD.) - In some files' copyright lines, expanded 'The University of Texas' to 'The University of Texas at Austin'. - Fixed various typos/misspellings in some license headers.	2018-08-29 18:06:41 -05:00
Field G. Van Zee	b051ffb815	Merge branch 'dev'	2018-08-29 17:06:48 -05:00
Field G. Van Zee	8199e339ae	Added testsuite threading to input.general.fast. Details: - Added lines associated with the testsuite's new threading option to input.general.fast. This change was intended for the previous commit (`10d0735`).	2018-08-27 07:00:12 -05:00
Field G. Van Zee	10d07357af	Better thread safety; added threading to testsuite. Details: - Replaced critical sections that were conditional upon multithreading being enabled (via pthreads or OpenMP) with unconditional use of pthreads mutexes. (Why pthreads? Because BLIS already requires it for its initialization mechanism: pthread_once().) This was done in bli_error.c, bli_gks.c, bli_l3_ind.c. Also, replaced usage of BLIS's mtx_t object and bli_mutex_*() API with pthread mutexes in bli_thread.c. The previous status quo could result in a race condition if the application called BLIS from more than one thread. The new pthread-based code should be completely agnostic to the application's threading configuration. Thanks to AMD for bringing to our attention the need for a thread-safety review. - Added an option to the testsuite to simulate application-level multithreading. Specifically, each thread maintains a counter that is incremented after each experiment. The thread only executes the experiment if: counter % n_threads == thread_id. In other words, the threads simply take turns executing each problem experiment. Also, POSIX guarantees that fprintf() will not intermingle output, so output was switched to fprintf() instead of libblis_test_fprintf(). - Changed membrk_t objects to use pthread_mutex_t instead of mtx_t and replaced use of bli_mutex_init()/_finalize() in bli_membrk.c with wrappers to pthread_mutex_init()/_destroy(). - Changed the implementation of bli_l3_ind_oper_enable_only() to fix a race condition; specifically, two threads calling the function with the same parameters could lead to a non-deterministic outcome. - Added #include <pthread.h> to bli_cpuid.c and moved the same in bli_arch.c. - Added 'const' to declaration of OPT_MARKER in bli_getopt.c. - Added #include <pthread.h> to bli_system.h. - Added add-copyright.py script to automate adding new copyright lines to (and updating existing lines of) source files.	2018-08-26 20:34:30 -05:00
Field G. Van Zee	0f491e994a	Allow lesser Makefiles to reference installed BLIS. Details: - Updated the build system so that "lesser" Makefiles, such as those belonging to example code or the testsuite, may be run even if the directory is orphaned from the original build tree. This allows a user to configure, compile, and install BLIS, delete the build tree (that is, the source distribution, or the build directory for out-of-tree builds) and then compile example or testsuite code and link against the installed copy of BLIS (provided the example or testsuite directory was preserved or obtained from another source). The only requirement is that make be invoked while setting the BLIS_INSTALL_PATH variable to the same installation prefix used when BLIS was configured. The easiest syntax is: make BLIS_INSTALL_PATH=/install/prefix though it's also permissible to set BLIS_INSTALL_PATH as an environment variable prior to running 'make'. - Updated all lesser Makefiles to implement the new aforementioned build behavior. - Relocated check-blastest.sh and check-blistest.sh from build to blastest and testsuite, respectively, so that if those directories are copied elsewhere the user can still run 'make check' locally. - Updated docs/Testsuite.md with language that mentions this new option of building/linking against an installed copy of BLIS.	2018-08-25 20:12:36 -05:00
Field G. Van Zee	017548314f	Replaced function chooser macros w/ func ptr arrays. Details: - Previously, most object API functions (_oapi.c) used a function chooser macro that would expand out to an if-elseif-elseif-else conditional that used a num_t datatype to call the appropriate type-specific API (_tapi.c). This always felt a little hackish, and would get in the way somewhat of adding support for new num_t datatypes in the future. So, I've replaced that functionality with code that queries a function pointer that is then typecast appropriately. This model of function calling was already pervasive for kernels queried from the cntx_t structure. It was also already in use in various other functions, such as macrokernels, and this commit simply extends that pattern. - The above change required many new files, mostly header files, that define the function types (mostly _ft.h) for the queriable functions as well as some source files to define the function pointer arrays and their corresponding query functions (_fpa.c). Various other function types, mostly for kernel function types, were renamed to reduce the potential for confusion with the function types for expert and basic (non-expert) typed API functions. - Removed definitions for all of the "bli_call_ft_*()" function chooser macros from bli_misc_macro_defs.h.	2018-08-07 14:13:25 -05:00
Field G. Van Zee	94d5ef42c8	Adjusted gflops format spec in testsuite, test/3m4m. Details: - Changed the format specifier for the gflops column in the testsuite output from %7.3f to %7.2f. This was done mainly to keep the output aligned properly when the expected performance exceeded 1000 gflops. Also, two decimal places still conveys plenty of precision for all practical applications, including just eyeballing performance deltas between two executions (let alone two implementations). - Changed the format specifier for gflops in the test/3m4m drivers from %6.3f to %7.2f (for the same reasons listed above).	2018-08-04 15:57:17 -05:00
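
The alignment issue described in commit 94d5ef42c8 can be reproduced with standard printf-style formatting alone. The sketch below is not taken from the BLIS testsuite. The helper name gflops_width() is hypothetical and only reports how many characters a given format specifier produces for a gflops value. With %7.3f, a value of 1000 or more needs eight characters and overflows a seven-character column, while %7.2f still fits.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not part of BLIS): format a gflops value with the
   given printf-style specifier and return the number of characters that
   the formatted field occupies. */
static int gflops_width( const char* fmt, double gflops, char* buf, size_t len )
{
    /* snprintf() returns the length of the formatted string, which is
       the printed column width when the buffer is large enough. */
    return snprintf( buf, len, fmt, gflops );
}
```

Under these assumptions, gflops_width( "%7.3f", 1234.5, buf, sizeof( buf ) ) yields a width of 8, one more than the intended column, whereas the same call with "%7.2f" yields 7 and leaves "1234.50" in buf.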

186 Commits