amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-05 06:51:11 +00:00

Author	SHA1	Message	Date
Jérôme Duval	f377bb4485	Add Haiku to the known OS list (#361)	2019-11-07 16:39:29 -06:00
Field G. Van Zee	e29b1f9706	Fixed failing testsuite gemmtrsm_ukr for power9. Details: - Added code that fixes false failures in the gemmtrsm_ukr module of the testsuite. The tests were failing because the computation (bli_gemv()) that performs the numerical check was not able to properly traverse the matrix operands bx1 and b11 that are views into the micropanel of B, which has duplicated/broadcast elements under the power9 subconfig. (For example, a micropanel of B with duplication factor of 2 needs to use a column stride of 2; previously, the column stride was being interpreted as 1.) - Defined separate bli_obj_set_row_stride() and bli_obj_set_col_stride() static functions in bli_obj_macro_defs.h. (Previously, only the function bli_obj_set_strides() was defined. Amazing to think that we got this far without these former functions.) - Updated/expounded upon comments.	2019-11-05 17:15:19 -06:00
Field G. Van Zee	c84391314d	Reverted minor temp/wspace changes from `b426f9e`. Details: - Added missing license header to bli_pwr9_asm_macros_12x6.h. - Reverted temporary changes to various files in 'test' and 'testsuite' directories. - Moved testsuite/jobscripts into testsuite/old. - Minor whitespace/comment changes across various files.	2019-11-04 13:57:12 -06:00
Nicholai Tukanov	b426f9e04e	POWER9 DGEMM (#355) Implemented and registered power9 dgemm ukernel. Details: - Implemented 12x6 dgemm microkernel for power9. This microkernel assumes that elements of B have been duplicated/broadcast during the packing step. The microkernel uses a column orientation for its microtile vector registers and thus implements column storage and general stride IO cases. (A row storage IO case via in-register transposition may be added at a future date.) It should be noted that we recommend using this microkernel with gcc and not xlc, as issues with the latter cropped up during development, including but not limited to slightly incompatible vector register mnemonics in the GNU extended inline assembly clobber list.	2019-11-01 17:57:03 -05:00
Field G. Van Zee	6218ac95a5	Merge branch 'master' into amd	2019-10-11 11:53:51 -05:00
Field G. Van Zee	29b0e1ef4e	Code review + tweaks to AMD's AOCL 2.0 PR (#349). Details: - NOTE: This is a merge commit of 'master' of git://github.com/amd/blis into 'amd-master' of flame/blis. - Fixed a bug in the downstream value of BLIS_NUM_ARCHS, which was inadvertently not incremented when the Zen2 subconfiguration was added. - In bli_gemm_front(), added a missing conditional constraint around the call to bli_gemm_small() that ensures that the computation precision of C matches the storage precision of C. - In bli_syrk_front(), reorganized and relocated the notrans/trans logic that existed around the call to bli_syrk_small() into bli_syrk_small() to minimize the calling code footprint and also to bring that code into stylistic harmony with similar code in bli_gemm_front() and bli_trsm_front(). Also, replaced direct accessing of obj_t fields with proper accessor static functions (e.g. 'a->dim[0]' becomes 'bli_obj_length( a )'). - Added #ifdef BLIS_ENABLE_SMALL_MATRIX guard around prototypes for bli_gemm_small(), bli_syrk_small(), and bli_trsm_small(). This is strictly speaking unnecessary, but it serves as a useful visual cue to those who may be reading the files. - Removed cpp macro-protected small matrix debugging code from bli_trsm_front.c. - Added a GCC_OT_9_1_0 variable to build/config.mk.in to facilitate gcc version check for availability of -march=znver2, and added appropriate support to configure script. - Cleanups to compiler flags common to recent AMD microarchitectures in config/zen/amd_config.mk, including: removal of -march=znver1 et al. from CKVECFLAGS (since the -march flag is added within make_defs.mk); setting CRVECFLAGS similarly to CKVECFLAGS. - Cleanups to config/zen/bli_cntx_init_zen.c. - Cleanups, added comments to config/zen/make_defs.mk. - Cleanups to config/zen2/make_defs.mk, including making use of newly-added GCC_OT_9_1_0 and existing GCC_OT_6_1_0 to choose the correct set of compiler flags based on the version of gcc being used.
- Reverted downstream changes to test/test_gemm.c. - Various whitespace/comment changes.	2019-10-11 10:24:24 -05:00
Field G. Van Zee	c60db26aee	Fixed bad loop counter in bli_[cz]scal2bbs_mxn(). Details: - Fixed a typo in the loop counter for the 'd' (duplication) dimension in the complex macros of frame/include/level0/bb/bli_scal2bbs_mxn.h. They shouldn't be used by anyone yet, but thankfully clang via AppVeyor spit out warnings that alerted me to the issue.	2019-09-17 18:04:17 -05:00
Field G. Van Zee	31c8657f1d	Added support for pre-broadcast when packing B. Details: - Added support for being able to duplicate (broadcast) elements in memory when packing matrix B (i.e., the right-hand operand) in level-3 operations. This turns out advantageous for some architectures that can afford the cost of the extra bandwidth and somehow benefit from the pre-broadcast elements (and thus being able to avoid using broadcast-style load instructions on micro-rows of B in the gemm microkernel). - Support optionally disabling right-side hemm and symm. If this occurs, hemm_r is implemented in terms of hemm_l (and symm_r in terms of symm_l). This is needed when broadcasting during packing because the alternative--supporting the broadcast of B while also allowing matrix B to be Hermitian/symmetric--would be an absolute mess. - Support alignment factors for packed blocks of A, B, and C separately (as well as for general-purpose buffers). In addition, we support byte offsets from those alignment values (which is different from aligning by align+offset bytes to begin with). The default alignment values are BLIS_PAGE_SIZE in all four cases, with the offset values defaulting to zero. - Pass pack_t schema into bli_?packm_cxk() so that it can be then passed into the packm kernel, where it will be needed by packm kernels that perform broadcasts of B, since the idea is that we only want to broadcast when packing micropanels of B and not A. - Added definition for variadic bli_cntx_set_l3_vir_ukrs(), which can be used to set custom virtual level-3 microkernels in the cntx_t, which would typically be done in the bli_cntx_init_*() function defined in the subconfiguration of interest. - Added a "broadcast B" kernel function for use with NP/NR = 12/6, defined in ref_kernels/1m/bli_packm_cxk_bb_ref.c. - Added gemm, gemmtrsm, and trsm "broadcast B" reference kernels defined in ref_kernels/3/bb. (These kernels have been tested with double real with NP/NR = 12/6.) - Added #ifndef ... 
#endif guards around several macro constants defined in frame/include/bli_kernel_macro_defs.h. - Defined a few "broadcast B" static functions in frame/include/level0/bb for use by "broadcast B"-style packm reference kernels. For now, only the real domain kernels are tested and fully defined. - Output the alignment and offset values for packed blocks of A and B in the testsuite's "BLIS configuration info" section. - Comment updates to various files. - Bumped so_version to 3.0.0.	2019-09-17 17:42:10 -05:00
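The pre-broadcast layout described in this commit can be sketched in plain C. This is a minimal illustration only: `pack_row_bb()` and `packed_elem()` are hypothetical helpers, not BLIS packm kernels, and the flat layout is an assumption. Each source element is written `d` times, so logical element `j` sits at offset `j*d` and the packed row has to be read with a column stride of `d`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a BLIS function): pack one micro-row of B,
   writing each of the n source elements d times in a row. */
static void pack_row_bb(const double *b, size_t n, size_t d, double *packed)
{
    assert(d >= 1);  /* d is the duplication (broadcast) factor */
    for (size_t j = 0; j < n; ++j)
        for (size_t p = 0; p < d; ++p)
            packed[j * d + p] = b[j];
}

/* Read logical element j of a packed micro-row. The column stride is d,
   not 1, because every element occupies d consecutive slots. */
static double packed_elem(const double *packed, size_t d, size_t j)
{
    return packed[j * d];
}
```

With `d = 2`, a source row `{1, 2, 3}` is stored as `{1, 1, 2, 2, 3, 3}`; reading it with a stride of 1 would return the duplicate of the first element where the second element is expected.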
kdevraje	cac127182d	Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis with public repo commit id `565fa3853b`. Change-Id: I68b9824b110cf14df248217a24a6191b3df79d42	2019-06-24 14:05:54 +05:30
Field G. Van Zee	ad937db950	Added missing #include "bli_family_thunderx2.h". Details: - Added a cpp-conditional directive block to bli_arch_config.h that #includes "bli_family_thunderx2.h". The code has been missing since `adf5c17f`. However, this never manifested as an error because the file is virtually empty and not needed for thunderx2 (or most subconfigs). Thanks to Jeff Diamond for helping to spot this.	2019-06-07 11:34:08 -05:00
Field G. Van Zee	6bf449cc69	Merge branch 'amd'	2019-05-31 17:42:40 -05:00
kdevraje	13806ba3b0	This check in has changes w.r.t Copyright information, which is changed to (start year) - 2019 Change-Id: Ide3c8f7172210b8d3538d3c36e88634ab1ba9041	2019-05-27 16:24:43 +05:30
kdevraje	a3554eb1dc	Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis to configure zen2 Change-Id: I97e17bca9716b80b862925f97bb513c07b4b0cae	2019-05-23 11:53:32 +05:30
Kiran Varaganti	a23f92594c	config_registry: New AMD zen2 architecture configuration added. frame/base/bli_arch.c: #ifdef BLIS_FAMILY_ZEN2 id = BLIS_ARCH_ZEN2; #endif added. zen2 is added in config_name[BLIS_NUM_ARCHS] frame/base/bli_cpuid.c : #ifdef BLIS_CONFIG_ZEN2 if ( bli_cpuid_is_zen2( family, model, features ) ) return BLIS_ARCH_ZEN2; #endif, defined new function bool bli_cpuid_is_zen2(...). frame/base/bli_cpuid.h : declared bli_cpuid_is_zen2(..). frame/base/bli_gks.c : #ifdef BLIS_CONFIG_ZEN2 bli_gks_register_cntx(BLIS_ARCH_ZEN2, bli_cntx_init_zen2, bli_cntx_init_zen2_ref, bli_cntx_init_zen2_ind); #endif frame/include/bli_arch_config.h : #ifdef BLIS_CONFIG_ZEN2 CNTX_INIT_PROTS(zen2) #endif #ifdef BLIS_FAMILY_ZEN2 #include "bli_family_zen2.h" #endif frame/include/bli_type_defs.h : added BLIS_ARCH_ZEN2 in arch_t enum. BLIS_NUM_ARCHS 20 Change-Id: I2a2d9b7266673e78a4f8543b1bfb5425b0aa7866	2019-05-22 05:28:16 -04:00
kdevraje	df755848b8	Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis into rome2.0 Change-Id: Ie8aad1ab810f0f3c0b90ec67f9dd3dfb8dcc74cc	2019-05-22 13:30:07 +05:30
Field G. Van Zee	fa7e6b182b	Define _POSIX_C_SOURCE in bli_system.h. Details: - Added #ifndef _POSIX_C_SOURCE #define _POSIX_C_SOURCE 200809L #endif to bli_system.h so that an application that uses BLIS (specifically, an application that #includes blis.h) does not need to remember to #define the macro itself (either on the command line or in the code that includes blis.h) in order to activate things like the pthreads. Thanks to Christos Psarras for reporting this issue and suggesting this fix. - Commented out #include <sys/time.h> in bli_system.h, since I don't think this header is used/needed anymore. - Comment update to function macro for bli_?normiv_unb_var1() in frame/util/bli_util_unb_var1.c.	2019-05-01 19:13:00 -05:00
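The guard quoted in this commit can be tried in isolation. The sketch below assumes a POSIX system; `bounded_len()` is a hypothetical helper, and `strnlen()` is used only as a convenient POSIX.1-2008 function that a strict ISO C build may hide unless the feature test macro is set. The guard has to appear before the first system header to have any effect.

```c
/* Define the feature test macro only if the application has not already
   chosen a value. This must precede every system header. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of s, capped at max bytes. strnlen() is
   specified by POSIX.1-2008, not by ISO C99. */
static size_t bounded_len(const char *s, size_t max)
{
    assert(s != NULL);
    return strnlen(s, max);
}
```

Because of the `#ifndef`, an application that already passes its own `-D_POSIX_C_SOURCE=...` value keeps that value and does not trigger a redefinition warning.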
Field G. Van Zee	b9c9f03502	Implemented gemm on skinny/unpacked matrices. Details: - Implemented a new sub-framework within BLIS to support the management of code and kernels that specifically target matrix problems for which at least one dimension is deemed to be small, which can result in long and skinny matrix operands that are ill-suited for the conventional level-3 implementations in BLIS. The new framework tackles the problem in two ways. First the stripped-down algorithmic loops forgo the packing that is famously performed in the classic code path. That is, the computation is performed by a new family of kernels tailored specifically for operating on the source matrices as-is (unpacked). Second, these new kernels will typically (and in the case of haswell and zen, do in fact) include separate assembly sub-kernels for handling of edge cases, which helps smooth performance when performing problems whose m and n dimension are not naturally multiples of the register blocksizes. In a reference to the sub-framework's purpose of supporting skinny/unpacked level-3 operations, the "sup" operation suffix (e.g. gemmsup) is typically used to denote a separate namespace for related code and kernels. NOTE: Since the sup framework does not perform any packing, it targets row- and column-stored matrices A, B, and C. For now, if any matrix has non-unit strides in both dimensions, the problem is computed by the conventional implementation. - Implemented the default sup handler as a front-end to two variants. bli_gemmsup_ref_var2() provides a block-panel variant (in which the 2nd loop around the microkernel iterates over n and the 1st loop iterates over m), while bli_gemmsup_ref_var1() provides a panel-block variant (2nd loop over m and 1st loop over n). However, these variants are not used by default and provided for reference only. 
Instead, the default sup handler calls _var2m() and _var1n(), which are similar to _var2() and _var1(), respectively, except that they defer to the sup kernel itself to iterate over the m and n dimension, respectively. In other words, these variants rely not on microkernels, but on so-called "millikernels" that iterate along m and k, or n and k. The benefit of using millikernels is a reduction of function call and related (local integer typecast) overhead as well as the ability for the kernel to know which micropanel (A or B) will change during the next iteration of the 1st loop, which allows it to focus its prefetching on that micropanel. (In _var2m()'s millikernel, the upanel of A changes while the same upanel of B is reused. In _var1n()'s, the upanel of B changes while the upanel of A is reused.) - Added a new configure option, --[en|dis]able-sup-handling, which is enabled by default. However, the default thresholds at which the default sup handler is activated are set to zero for each of the m, n, and k dimensions, which effectively disables the implementation. (The default sup handler only accepts the problem if at least one dimension is smaller than or equal to its corresponding threshold. If all dimensions are larger than their thresholds, the problem is rejected by the sup front-end and control is passed back to the conventional implementation, which proceeds normally.) - Added support to the cntx_t structure to track new fields related to the sup framework, most notably: - sup thresholds: the thresholds at which the sup handler is called. - sup handlers: the address of the function to call to implement the level-3 skinny/unpacked matrix implementation. - sup blocksizes: the register and cache blocksizes used by the sup implementation (which may be the same or different from those used by the conventional packm-based approach). - sup kernels: the kernels that the handler will use in implementing the sup functionality. 
- sup kernel prefs: the IO preference of the sup kernels, which may differ from the preferences of the conventional gemm microkernels' IO preferences. - Added a bool_t to the rntm_t structure that indicates whether sup handling should be enabled/disabled. This allows per-call control of whether the sup implementation is used, which is useful for test drivers that wish to switch between the conventional and sup codes without having to link to different copies of BLIS. The corresponding accessor functions for this new bool_t are defined in bli_rntm.h. - Implemented several row-preferential gemmsup kernels in a new directory, kernels/haswell/3/sup. These kernels include two general implementation types--'rd' and 'rv'--for the 6x8 base shape, with two specialized millikernels that embed the 1st loop within the kernel itself. - Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference gemmsup microkernels. NOTE: These microkernels, unlike the current crop of conventional (pack-based) microkernels, do not use constant loop bounds. Additionally, their inner loop iterates over the k dimension. - Defined new typedef enums: - stor3_t: captures the effective storage combination of the level-3 problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A special value of BLIS_XXX is used to denote an arbitrary combination which, in practice, means that at least one of the operands is stored according to general stride. - threshid_t: captures each of the three dimension thresholds. - Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create() can be passed "-1, -1" as a lazy request for row storage. (Note that "0, 0" is still accepted as a lazy request for column storage.) - Added support for various instructions to bli_x86_asm_macros.h, including imul, vhaddps/pd, and other instructions related to integer vectors. 
- Disabled the older small matrix handling code inserted by AMD in bli_gemm_front.c, since the sup framework introduced in this commit is intended to provide a more generalized solution. - Added test/sup directory, which contains standalone performance test drivers, a Makefile, a runme.sh script, and an 'octave' directory containing scripts compatible with GNU Octave. (They also may work with matlab, but if not, they are probably close to working.) - Reinterpret the storage combination string (sc_str) in the various level-3 testsuite modules (e.g. src/test_gemm.c) so that the order of each matrix storage char is "cab" rather than "abc". - Comment updates in level-3 BLAS API wrappers in frame/compat.	2019-04-27 18:44:50 -05:00
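The acceptance rule stated in this commit (the sup handler takes a problem only if at least one dimension is at or below its threshold) can be written as a small predicate. The type `sup_thresh_t`, the function `sup_accepts()`, and the threshold values used below are hypothetical; only the rule itself comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the three per-dimension thresholds. */
typedef struct { long m, n, k; } sup_thresh_t;

/* Return true if the skinny/unpacked path would accept an m x n x k
   problem: at least one dimension must be <= its threshold. Otherwise
   the conventional (packing) implementation handles it. */
static bool sup_accepts(long m, long n, long k, sup_thresh_t t)
{
    assert(m > 0 && n > 0 && k > 0);  /* sketch covers non-empty problems */
    return m <= t.m || n <= t.n || k <= t.k;
}
```

With all three thresholds at zero, no non-empty problem passes the test, which is why the default settings effectively disable the sup path even though the configure option is on.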
Field G. Van Zee	945928c650	Merge branch 'amd' of github.com:flame/blis into amd	2019-04-17 15:58:56 -05:00
Field G. Van Zee	89cd650e7b	Use void_fp for function pointers instead of void*. Change void*-typed function pointers to void_fp. - Updated all instances of void* variables that store function pointers to variables of a new type, void_fp. Originally, I wanted to define the type of void_fp as "void (*void_fp)( void )"--that is, a pointer to a function with no return value and no arguments. However, once I did this, I realized that gcc complains with incompatible pointer type (-Wincompatible-pointer-types) warnings every time any such pointer is being assigned to its final, type-accurate function pointer type. That is, gcc will silently typecast a void* to another defined function pointer type (e.g. dscalv_ker_ft) during an assignment from the former to the latter, but the same statement will trigger a warning when typecasting from a void_fp type. I suspect an explicit typecast is needed in order to avoid the warning, which I'm not willing to insert at this time. - Added a typedef to bli_type_defs.h defining void_fp as void*, along with a commented-out version of the aborted definition described above. (Note that POSIX requires that void* and function pointers be interchangeable; it is the C standard that does not provide this guarantee.) - Comment updates to various _oapi.c files.	2019-04-02 17:23:55 -05:00
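For contrast, here is a sketch of the alternative this commit describes but does not adopt: a generic function-pointer type with explicit casts at every conversion. The names `scale_ft`, `scale()`, `store_kernel()`, and `call_kernel()` are hypothetical stand-ins (for instance, `scale_ft` plays the role of a kernel type such as `dscalv_ker_ft`). ISO C allows converting one function pointer type to another and back, provided the call is made through the original type.

```c
#include <assert.h>

/* The generic function-pointer type from the aborted definition. */
typedef void (*void_fp)(void);

/* Hypothetical typed kernel signature. */
typedef double (*scale_ft)(double alpha, double x);

static double scale(double alpha, double x) { return alpha * x; }

/* Store a typed function in generic form; the cast must be explicit. */
static void_fp store_kernel(void)
{
    return (void_fp)scale;
}

/* Recover the true type with another explicit cast before calling. */
static double call_kernel(void_fp fp, double alpha, double x)
{
    assert(fp != 0);
    scale_ft f = (scale_ft)fp;
    return f(alpha, x);
}
```

Dropping either cast is what produces the incompatible-pointer-type warnings mentioned in the commit, and inserting such casts throughout the code base is the cost the author chose to avoid.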
Isuru Fernando	044df9506f	Test with shared on windows (#306) Export macros can't support both shared and static at the same time. When blis is built with both shared and static, headers assume that shared is used at link time and dllimports the symbols with __imp_ prefix. To use the headers with static libraries a user can give -DBLIS_EXPORT= to import the symbol without the __imp_ prefix	2019-03-27 12:39:31 -05:00
Field G. Van Zee	663f662932	Merge branch 'amd' of github.com:flame/blis into amd	2019-03-16 16:17:12 -05:00
Field G. Van Zee	938c05ef86	Merge branch 'amd' of github.com:flame/blis into amd	2019-03-16 16:01:43 -05:00
Field G. Van Zee	e095926c64	Support shared lib export of only public symbols. Details: - Introduced a new configure option, --enable-export-all, which will cause all shared library symbols to be exported by default, or, alternatively, --disable-export-all, which will cause all symbols to be hidden by default, with only those symbols that are annotated for visibility, via BLIS_EXPORT_BLIS (and BLIS_EXPORT_BLAS for BLAS symbols), to be exported. The default for this configure option is --disable-export-all. Thanks to Isuru Fernando for consulting on this commit. - Removed BLIS_EXPORT_BLIS annotations from frame/1m/bli_l1m_unb_var1.h, which was intended for `5a5f494`. - Relocated BLIS_EXPORT-related cpp logic from bli_config.h.in to frame/include/bli_config_macro_defs.h. - Provided appropriate logic within common.mk to implement variable symbol visibility for gcc, clang, and icc (to the extent that each of these compilers allows). - Relocated --help text associated with debug option (-d) to configure slightly further down in the list.	2019-03-13 17:35:18 -05:00
Field G. Van Zee	5a5f494e42	Removed export macros from all internal prototypes. Details: - After merging PR #303, at Isuru's request, I removed the use of BLIS_EXPORT_BLIS from all function prototypes except those that we potentially wish to be exported in shared/dynamic libraries. In other words, I removed the use of BLIS_EXPORT_BLIS from all prototypes of functions that can be considered private or for internal use only. This is likely the last big modification along the path towards implementing the functionality spelled out in issue #248. Thanks again to Isuru Fernando for his initial efforts of sprinkling the export macros throughout BLIS, which made removing them where necessary relatively painless. Also, I'd like to thank Tony Kelman, Nathaniel Smith, Ian Henriksen, Marat Dukhan, and Matthew Brett for participating in the initial discussion in issue #37 that was later summarized and restated in issue #248. - CREDITS file update.	2019-03-12 18:45:09 -05:00
Field G. Van Zee	3dc18920b6	Merge branch 'master' into dev	2019-03-12 11:20:25 -05:00
Isuru Fernando	766769eeb9	Export functions without def file (#303) * Revert "restore bli_extern_defs exporting for now" This reverts commit 09fb07c350b2acee17645e8e9e1b8d829c73dca8. * Remove symbols not intended to be public * No need of def file anymore * Fix whitespace * No need of configure option * Remove export macro from definitions * Remove blas export macro from definitions	2019-03-11 19:05:32 -05:00
Field G. Van Zee	4ed39c0971	Merge branch 'amd' of github.com:flame/blis into amd	2019-03-08 11:56:58 -06:00
Field G. Van Zee	3bdab823fa	Merge branch 'master' into dev	2019-02-28 14:07:24 -06:00
Isuru Fernando	f0dcc8944f	Add symbol export macro for all functions (#302) * initial export of blis functions * Regenerate def file for master * restore bli_extern_defs exporting for now	2019-02-27 17:27:23 -06:00
Field G. Van Zee	540ec1b479	Updated level-3 BLAS to call object API directly. Details: - Updated the BLAS compatibility layer for level-3 operations so that the corresponding BLIS object API is called directly rather than first calling the typed BLIS API. The previous code based on the typed BLIS API calls is still available in a deactivated cpp macro branch, which may be re-activated by #defining BLIS_BLAS3_CALLS_TAPI. (This does not yet correspond to a configure option. If it seems like people might want to toggle this behavior more regularly, a configure option can be added in the future.) - Updated the BLIS typed API to statically "pre-initialize" objects via new initializer macros. Initialization is then finished via calls to static functions bli_obj_init_finish_1x1() and bli_obj_init_finish(), which are similar to the previously-called functions, bli_obj_create_1x1_with_attached_buffer() and bli_obj_create_with_attached_buffer(), respectively. (The BLAS compatibility layer updates mentioned above employ this new technique as well.) - Transformed certain routines in bli_param_map.c--specifically, the ones that convert netlib-style parameters to BLIS equivalents--into static functions, now in bli_param_map.h. (The remaining three classes of conversion routines were left unchanged.) - Added the aforementioned pre-initializer macros to bli_type_defs.h. - Relocated bli_obj_init_const() and bli_obj_init_constdata() from bli_obj_macro_defs.h to bli_type_defs.h. - Added a few macros to bli_param_macro_defs.h for testing domains for real/complexness and precisions for single/double-ness.	2019-02-24 19:09:10 -06:00
Field G. Van Zee	075143dfd9	Added support for IC loop parallelism to trsm. Details: - Parallelism within the IC loop (3rd loop around the microkernel) is now supported within the trsm operation. This is done via a new branch on each of the control and thread trees, which guide execution of a new trsm-only subproblem from within bli_trsm_blk_var1(). This trsm subproblem corresponds to the macrokernel computation on only the block of A that contains the diagonal (labeled as A11 in algorithms with FLAME-like partitioning), and the corresponding row panel of C. During the trsm subproblem, all threads within the JC communicator participate and parallelize along the JR loop, including any parallelism that was specified for the IC loop. (IR loop parallelism is not supported for trsm due to inter-iteration dependencies.) After this trsm subproblem is complete, a barrier synchronizes all participating threads and then they proceed to apply the prescribed BLIS_IC_NT (or equivalent) ways of parallelism (and any BLIS_JR_NT parallelism specified within) to the remaining gemm subproblem (the rank-k update that is performed using the newly updated row-panel of B). Thus, trsm now supports JC, IC, and JR loop parallelism. - Modified bli_trsm_l_cntl_create() to create the new "prenode" branch of the trsm_l cntl_t tree. The trsm_r tree was left unchanged, for now, since it is not currently used. (All trsm problems are cast in terms of left-side trsm.) - Updated bli_cntl_free_w_thrinfo() to be able to free the newly shaped trsm cntl_t trees. Fixed a potentially latent bug whereby a cntl_t subnode is only recursed upon if there existed a corresponding thrinfo_t node, which may not always exist (for problems too small to employ full parallelization due to the minimum granularity imposed by micropanels). - Updated other functions in frame/base/bli_cntl.c, such as bli_cntl_copy() and bli_cntl_mark_family(), to recurse on sub-prenodes if they exist. 
- Updated bli_thrinfo_free() to recurse into sub-nodes and prenodes when they exist, and added support for growing a prenode branch to bli_thrinfo_grow() via a corresponding set of help functions named with the _prenode() suffix. - Added a bszid_t field to thrinfo_t nodes. This field comes in handy when debugging the allocation/release of thrinfo_t nodes, as it helps trace the "identity" of each node as it is created/destroyed. - Renamed bli_l3_thrinfo_print_paths() -> bli_l3_thrinfo_print_gemm_paths() and created a separate bli_l3_thrinfo_print_trsm_paths() function to print out the newly reconfigured thrinfo_t trees for the trsm operation. - Trivial changes to bli_gemm_blk_var?.c and bli_trsm_blk_var?.c regarding variable declarations. - Removed subpart_t enum values BLIS_SUBPART1T, BLIS_SUBPART1B, BLIS_SUBPART1L, BLIS_SUBPART1R. Then added support for two new labels (semantically speaking): BLIS_SUBPART1A and BLIS_SUBPART1B, which represent the subpartition ahead of and behind, respectively, BLIS_SUBPART1. Updated check functions in bli_check.c accordingly. - Shuffled layering/APIs for bli_acquire_mpart_[mn]dim() and bli_acquire_mpart_t2b/b2t(), _l2r/r2l(). - Deprecated old functions in frame/3/bli_l3_thrinfo.c.	2019-02-14 18:52:45 -06:00
Nicholai Tukanov	78bc0bc8b6	Power9 sub-configuration (#298) Formally registered power9 sub-configuration. Details: - Added and registered power9 sub-configuration into the build system. Thanks to Nicholai Tukanov and Devangi Parikh for these contributions. - Note: The sub-configuration does not yet have a corresponding architecture-specific kernel set registered, and so for now the sub-config is using the generic kernel set.	2019-02-14 13:29:02 -06:00
Field G. Van Zee	6b83273126	Generalized ref kernels' pragma omp simd usage. Details: - Replaced direct usage of _Pragma( "omp simd" ) in reference kernels with PRAGMA_SIMD, which is defined as a function of the compiler being used in a new bli_pragma_macro_defs.h file. That definition is cleared when BLIS detects that the -fopenmp-simd command line option is unsupported. Thanks to Devin Matthews and Jeff Hammond for suggestions that guided this commit. - Updated configure and bli_config.h.in so that the appropriate anchor is substituted in (when the corresponding pragma omp simd support is present).	2019-02-12 16:01:28 -06:00
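The indirection described in this commit can be illustrated with a short sketch. `PRAGMA_SIMD` is the macro name used in the commit; `DEMO_HAVE_OMP_SIMD` is a hypothetical stand-in for whatever symbol a build system defines once it has confirmed that `-fopenmp-simd` works, and `axpy()` is an illustrative loop rather than a BLIS reference kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Expand to the OpenMP SIMD pragma only when the build has confirmed
   support; otherwise expand to nothing so the loop compiles unchanged.
   _Pragma is used because #pragma cannot appear inside a macro body. */
#ifdef DEMO_HAVE_OMP_SIMD
#define PRAGMA_SIMD _Pragma("omp simd")
#else
#define PRAGMA_SIMD
#endif

/* y := y + alpha * x over n elements. */
static void axpy(size_t n, double alpha, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    PRAGMA_SIMD
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}
```

The loop computes the same result either way; the pragma is only a vectorization hint, so clearing the macro on unsupported compilers changes performance, not behavior.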
M. Zhou	1aa280d052	Amend OS detection for kFreeBSD. (#295)	2019-01-27 15:40:48 -06:00
Field G. Van Zee	bdd46f9ee8	Rewrote reference kernels to use #pragma omp simd. Details: - Rewrote level-1v, -1f, and -3 reference kernels in terms of simplified indexing annotated by the #pragma omp simd directive, which a compiler can use to vectorize certain constant-bounded loops. (The new kernels actually use _Pragma("omp simd") since the kernels are defined via templatizing macros.) Modest speedup was observed in most cases using gcc 5.4.0, which may improve with newer versions. Thanks to Devin Matthews for suggesting this via issue #286 and #259. - Updated default blocksizes defined in ref_kernels/bli_cntx_ref.c to be 4x16, 4x8, 4x8, and 4x4 for single, double, scomplex and dcomplex, respectively, with a default row preference for the gemm ukernel. Also updated axpyf, dotxf, and dotxaxpyf fusing factors to 8, 6, and 4, respectively, for all datatypes. - Modified configure to verify that -fopenmp-simd is a valid compiler option (via a new detect/omp_simd/omp_simd_detect.c file). - Added a new header in which prefetch macros are defined according to which compiler is detected (via macros such as __GNUC__). These prefetch macros are not yet employed anywhere, though. - Updated the year in copyrights of template license headers in build/templates and removed AMD as a default copyright holder.	2019-01-24 17:23:18 -06:00
Field G. Van Zee	adf5c17f08	Formally registered thunderx2 subconfiguration. Details: - Added a separate subconfiguration for thunderx2, which now uses different optimization flags than cortexa57/cortexa53.	2019-01-18 15:14:45 -06:00
M. Zhou	094cfdf7df	Port BLIS to GNU Hurd OS. (#294) Prevent blis.h from misidentifying Hurd as OSX.	2019-01-18 12:46:13 -06:00
Field G. Van Zee	706cbd9d56	Minor tweaks/cleanups to bli_malloc.c, _apool.c. Details: - Removed malloc_ft and free_ft function pointer arguments from the interface to bli_apool_init() after deciding that there is no need to specify the malloc()/free() for blocks within the apool. (The apool blocks are actually just array_t structs.) Instead, we simply call bli_malloc_intl()/_free_intl() directly. This has the added benefit of allowing additional output when memory tracing is enabled via --enable-mem-tracing. Also made corresponding changes elsewhere in the apool API. - Changed the inner pools (elements of the array_t within the apool_t) to use BLIS_MALLOC_POOL and BLIS_FREE_POOL instead of BLIS_MALLOC_INTL and BLIS_FREE_INTL. - Disabled definitions of bli_malloc_pool() and bli_free_pool() since there are no longer any consumers of these functions. - Very minor comment / printf() updates.	2019-01-07 18:28:19 -06:00
Field G. Van Zee	2f3174330f	Implemented a pool-based small block allocator. Details: - Implemented a sophisticated data structure and set of APIs that track the small blocks of memory (around 80-100 bytes each) used when creating nodes for control and thread trees (cntl_t and thrinfo_t) as well as thread communicators (thrcomm_t). The purpose of the small block allocator, or sba, is to allow the library to transition into a runtime state in which it does not perform any calls to malloc() or free() during normal execution of level-3 operations, regardless of the threading environment (potentially multiple application threads as well as multiple BLIS threads). The functionality relies on a new data structure, apool_t, which is (roughly speaking) a pool of arrays, where each array element is a pool of small blocks. The outer pool, which is protected by a mutex, provides separate arrays for each application thread while the arrays each handle multiple BLIS threads for any given application thread. The design minimizes the potential for lock contention, as only concurrent application threads would need to fight for the apool_t lock, and only if they happen to begin their level-3 operations at precisely the same time. Thanks to Kiran Varaganti and AMD for requesting this feature. - Added a configure option to disable the sba pools, which are enabled by default; renamed the --[dis|en]able-packbuf-pools option to --[dis|en]able-pba-pools; and rewrote the --help text associated with this new option and consolidated it with the --help text for the option associated with the sba (--[dis|en]able-sba-pools). - Moved the membrk field from the cntx_t to the rntm_t. We now pass in a rntm_t* to the bli_membrk_acquire() and _release() APIs, just as we do for bli_sba_acquire() and _release(). 
- Replaced all calls to bli_malloc_intl() and bli_free_intl() that are used for small blocks with calls to bli_sba_acquire(), which takes a rntm (in addition to the bytes requested), and bli_sba_release(). These latter two functions reduce to the former two when the sba pools are disabled at configure-time. - Added rntm_t* arguments to various cntl_t and thrinfo_t functions, as required by the new usage of bli_sba_acquire() and _release(). - Moved the freeing of "old" blocks (those allocated prior to a change in the block_size) from bli_membrk_acquire_m() to the implementation of the pool_t checkout function. - Miscellaneous improvements to the pool_t API. - Added a block_size field to the pblk_t. - Harmonized the way that the trsm_ukr testsuite module performs packing relative to that of gemmtrsm_ukr, in part to avoid the need to create a packm control tree node, which now requires a rntm_t that has been initialized with an sba and membrk. - Re-enabled the explicit call to bli_finalize() in testsuite so that users who run the testsuite with memory tracing enabled can check for memory leaks. - Manually imported the compact/minor changes from `61441b24` that cause the rntm to be copied locally when it is passed in via one of the expert APIs. - Reordered parameters to various bli_thrcomm_() functions so that the thrcomm_t to the comm being modified is last, not first. - Added more descriptive tracing for allocating/freeing small blocks and formalized via a new configure option: --[dis|en]able-mem-tracing. - Moved some unused scalm code and headers into frame/1m/other. - Whitespace changes to bli_pthread.c. - Regenerated build/libblis-symbols.def.	2018-12-25 19:35:01 -06:00
Field G. Van Zee	e809b5d2f1	Merge branch 'master' into amd	2018-12-20 16:27:26 -06:00
Field G. Van Zee	76016691e2	Improvements to bli_pool; malloc()/free() tracing. Details: - Added malloc_ft and free_ft fields to pool_t, which are provided when the pool is initialized, to allow bli_pool_alloc_block() and bli_pool_free_block() to call bli_fmalloc_align()/bli_ffree_align() with arbitrary align_size values (according to how the pool_t was initialized). - Added a block_ptrs_len argument to bli_pool_init(), which allows the caller to specify an initial length for the block_ptrs array, which previously suffered the cost of being reallocated, copied, and freed each time a new block was added to the pool. - Consolidated the "buf_sys" and "buf_align" pointer fields in pblk_t into a single "buf" field. Consolidated the bli_pblk API accordingly and also updated the bli_mem API implementation. This was done because I'd previously already implemented opaque alignment via bli_malloc_align(), which allocates extra space and stores the original pointer returned by malloc() one element before the element whose address is aligned. - Tweaked bli_membrk_acquire_m() and bli_membrk_release() to call bli_fmalloc_align() and bli_ffree_align(), which required adding an align_size field to the membrk_t struct. - Pass the pack schemas directly into bli_l3_cntl_create_if() rather than transmit them via objects for A and B. - Simplified bli_l3_cntl_free_if() and renamed to bli_l3_cntl_free(). The function had not been conditionally freeing control trees for quite some time. Also, removed obj_t* parameters since they aren't needed anymore (or never were). - Spun-off OpenMP nesting code in bli_l3_thread_decorator() to a separate function, bli_l3_thread_decorator_thread_check(). - Renamed: bli_malloc_align() -> bli_fmalloc_align() bli_free_align() -> bli_ffree_align() bli_malloc_noalign() -> bli_fmalloc_noalign() bli_free_noalign() -> bli_ffree_noalign() The 'f' is for "function" since they each take a malloc_ft or free_ft function pointer argument. 
- Inserted various printf() calls for the purposes of tracing memory allocation and freeing, guarded by cpp macro ENABLE_MEM_DEBUG, which, for now, is intended to be a "hidden" feature rather than one hooked up to a configure-time option. - Defined bli_rntm_equals(), which compares two rntm_t for equality. (There are no use cases for this function yet, but there may be soon.) - Whitespace changes to function parameter lists in bli_pool.c, .h.	2018-12-13 17:23:09 -06:00
Field G. Van Zee	f808d829c5	Handle edge cases, zero-filling in packm kernels. Details: - Updated the API and semantics of packm kernels such that they must now handle edge cases, meaning that a c-by-k packm kernel must be able to pack edge cases that are fewer than c rows/columns and be able to zero-fill the remaining elements. They must also be able to zero-fill the equivalent region when copying fewer than k columns/rows (which is needed by trsm). The new packm kernel API is generally: void packm_kernel ( conj_t conja, dim_t cdim, dim_t n, dim_t n_max, ctype* restrict kappa, ctype* restrict a, inc_t inca, inc_t lda, ctype* restrict p, inc_t ldp, cntx_t* restrict cntx ); where cdim and n are the dimensions (short and long, respectively) of the submatrix being copied from the source matrix A, and n_max is the "full" long dimension (corresponding to the k dimension in gemm) of the micropanel. The "full" short dimension (corresponding to the register blocksize MR or NR) is not part of the API because it is known intrinsically by the packm kernel implementation. Thanks to Devin Matthews for prompting us to make this change (#282). - Updated all reference packm kernels in ref_kernels/1m according to above changes, as well as all optimized packm kernels (which only consisted of those for knl). - Bumped the major soname version number in 'so_version' to 2. At first I was considering leaving it unchanged, but I couldn't escape the reality that the packm kernel API is much closer to an expert API than it is some obscure helper function interface within the framework that nobody would ever notice. - Removed reference packm kernels for mr/nr = 30. The only sub-config that would have been using those kernels is knc, which is likely no longer being used by very many people (if any). (This also mostly offset the larger object code footprint incurred by moving the edge-case handling into the individual packm kernels.)
- Fixed an obscure race condition for 3mh and 4mh induced methods in which those implementations were modifying the contexts stored in the gks rather than a local copy. - Fixed a minor bug in the testsuite that prevented non-1m-based induced method implementations of trsm from executing.	2018-12-12 15:22:59 -06:00
Field G. Van Zee	0645f239fb	Remove UT-Austin from copyright headers' clause 3. Details: - Removed explicit reference to The University of Texas at Austin in the third clause of the license comment blocks of all relevant files and replaced it with a more all-encompassing "copyright holder(s)". - Removed duplicate words ("derived") from a few kernels' license comment blocks. - Homogenized license comment block in kernels/zen/3/bli_gemm_small.c with format of all other comment blocks.	2018-12-04 14:31:06 -06:00
Field G. Van Zee	375eb30b0a	Added mixed-precision support to 1m method. Details: - Lifted the constraint that 1m only be used when all operands' storage datatypes (along with the computation datatype) are equal. Now, 1m may be used as long as all operands are stored in the complex domain. This change largely consisted of adding the ability to pack to 1e and 1r formats from one precision to another. It also required adding logic for handling complex values of alpha to bli_packm_blk_var1_md() (similar to the logic in bli_packm_blk_var1()). - Fixed a bug in several virtual microkernels (bli_gemm_md_c2r_ref.c, bli_gemm1m_ref.c, and bli_gemmtrsm1m_ref.c) that resulted in the wrong ukernel output preference field being read. Previously, the preference for the native complex ukernel was being read instead of the pref for the native real domain ukernel. This bug would not manifest if the preference for the native complex ukernel happened to be equal to that of the native real ukernel. - Added support for testing mixed-precision 1m execution via the gemm module of the testsuite. - Tweaked/simplified bli_gemm_front() and bli_gemm_md.c so that pack schemas are always read from the context, rather than trying to sometimes embed them directly to the A and B objects. (They are still embedded, but now uniformly only after reading the schemas from the context.) - Redefined cpp macro bli_l3_ind_recast_1m_params() as a static function and renamed to bli_gemm_ind_recast_1m_params() (since gemm is the only consumer). - Added 1m optimization logic (via bli_gemm_ind_recast_1m_params()) to bli_gemm_ker_var2_md(). - Added explicit handling for beta == 1 and beta == 0 in the reference gemm1m virtual microkernel in ref_kernels/ind/bli_gemm1m_ref.c. - Rewrote various level-0 macro defs, including axpyris, axpbyris, scal2ris, and xpbyris (and their conjugating counterparts) to explicitly support three operand types and updated invocations to xpbyris in bli_gemmtrsm1m_ref.c. 
- Query and use the storage datatype of the packed object instead of the storage datatype of the source object in bli_packm_blk_var1(). - Relocated and renamed frame/ind/misc/bli_l3_ind_opt.h to frame/3/gemm/ind/bli_gemm_ind_opt.h. - Various whitespace/comment updates.	2018-12-03 17:49:52 -06:00
Field G. Van Zee	6a4885f8be	Merge branch 'master' into dev	2018-11-27 13:22:59 -06:00
Isuru Fernando	9ddffba584	Fix MinGW build failure Fixes https://github.com/flame/blis/issues/278	2018-11-21 00:23:34 -06:00
Field G. Van Zee	1d8aae220b	Track internal scalar datatypes. Details: - Added a num_t datatype bitfield to the obj_t in the form of a new info2 field in the obj_t. This change was made primarily so that in the case of mixed-datatype gemm, the alpha scalar would not need to be cast to the storage datatype of B (or A) before then being cast to the computation datatype just before the macrokernel is called. This double-casting regime could result in loss of precision if the storage datatype of B (or A) is less than the computation precision. In practice, it was likely not going to be a big deal since most usage of alpha is for -1.0, 0.0, and 1.0 (or integer multiples thereof), which can all be represented exactly in single or double precision. - The type of objbits_t was changed to uint32_t, so the new format potentially takes up the same space as the previous obj_t definition, assuming no padding inserted by the compiler. Shrinking info to 32 bits and spilling over into a second field was chosen over using the high 32 bits of a single 64-bit objbits_t info field because many of the bitwise operations are performed with enums such as num_t, dom_t, and prec_t, which may take on the type of 32-bit ints. It's easier to just keep all of those bitwise operations in 32 bits than perform a million typecasts throughout bli_type_defs.h and bli_obj_macro_defs.h to ensure that the integers are treated as 64-bit for the purposes of the ANDs, ORs, and bitshifts. - Many comment updates. - Thanks to Devin Matthews and Devangi Parikh for their feedback and involvement during this commit cycle.	2018-11-20 18:42:07 -06:00
Field G. Van Zee	e769bf46b0	Tweak testsuite to issue FAIL for Nan, Inf (#279). Details: - Adjusted the definition for libblis_test_get_string_for_result() in testsuite/src/test_libblis.c so that the "FAIL" string is returned if the computed residual contains either NaN or Inf. Previously, a residual containing NaN would result in the selection of the "PASS" string. Thanks to Devin Matthews for reporting this issue (#279). - Expounded on comment for the macro definitions of bli_isnan() and bli_isinf() in bli_misc_macro_defs.h to make it more obvious why they must remain macros.	2018-11-20 16:16:53 -06:00
Field G. Van Zee	84dd298a27	Patch to fix msys2/Windows build failure (#277). Details: - Expanded cpp guard in frame/include/bli_x86_asm_macros.h to also check __MINGW32__ in addition to _WIN32, __clang__, and __MIC__. Thanks to Isuru Fernando for suggesting this fix, and also to Costas Yamin for originally reporting the issue (#277).	2018-11-14 13:47:45 -06:00
Field G. Van Zee	f19c33af4c	Disallow 64b BLAS integers + 32b BLIS integers. Details: - Print an error message from configure if the user attempts to explicitly configure BLIS for simultaneous use of 64-bit integers in the BLAS API with 32-bit integers in the BLIS API. - Added cpp macro conditional to bli_type_defs.h to mandate that BLIS integers be 64 bits if the BLAS integers are 64 bits. This and the above item take care of issue #274. Thanks to Devin Matthews and Jeff Hammond for suggesting these safeguards. - Slight reorganization and relabeling (for clarity) of BLAS/CBLAS sections and BLIS integer size line of the testsuite configuration output. - Very minor edits to docs/MixedDatatypes.md.	2018-10-26 17:07:15 -05:00
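
The integer-size safeguard described in f19c33af4c can be illustrated with a small preprocessor guard. This is only a sketch of the general technique, not the code in bli_type_defs.h: the macro names SKETCH_BLAS_INT_SIZE and SKETCH_BLIS_INT_SIZE, the typedefs, and the helper function are all hypothetical stand-ins for whatever configure actually emits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for configure-time settings (not BLIS's real names). */
#ifndef SKETCH_BLAS_INT_SIZE
#define SKETCH_BLAS_INT_SIZE 64
#endif
#ifndef SKETCH_BLIS_INT_SIZE
#define SKETCH_BLIS_INT_SIZE 64
#endif

/* Reject the unsupported pairing at compile time: 64-bit BLAS integers
   combined with BLIS integers of any other width. */
#if SKETCH_BLAS_INT_SIZE == 64 && SKETCH_BLIS_INT_SIZE != 64
#error "64-bit BLAS integers require 64-bit BLIS integers."
#endif

/* Select the integer types from the (assumed) size macros. */
#if SKETCH_BLIS_INT_SIZE == 64
typedef int64_t sketch_blis_int_t;
#else
typedef int32_t sketch_blis_int_t;
#endif

#if SKETCH_BLAS_INT_SIZE == 64
typedef int64_t sketch_blas_int_t;
#else
typedef int32_t sketch_blas_int_t;
#endif

/* Returns 1 when every BLAS integer value fits in a BLIS integer. */
static int sketch_blas_fits_in_blis(void)
{
    return sizeof(sketch_blas_int_t) <= sizeof(sketch_blis_int_t);
}
```

Compiling the sketch with -DSKETCH_BLAS_INT_SIZE=64 -DSKETCH_BLIS_INT_SIZE=32 should stop at the #error line, which is the kind of early failure the commit message describes adding alongside the configure-time error message.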


348 Commits