amd/blis - blis - Public git mirror


mirror of https://github.com/amd/blis.git synced 2026-05-04 22:41:11 +00:00

Author	SHA1	Message	Date
Field G. Van Zee	a4e8801d08	Increased MT sup threshold for double to 201. Details: - Fine-tuned the double-precision real MT threshold (which controls whether the sup implementation kicks in for smaller m dimension values) from 180 to 201 for haswell and 180 to 256 for zen. - Updated octave scripts in test/sup/octave to include a seventh column to display performance for m = n = k.	2019-05-31 17:30:51 -05:00
Field G. Van Zee	cb788ffc89	Increased MT sup threshold for double to 180. Details: - Increased the double-precision real MT threshold (which controls whether the sup implementation kicks in for smaller m dimension values) from 80 to 180, and this change was made for both haswell and zen subconfigurations. This is less about the m dimension in particular and more about facilitating a smoother performance transition when m = n = k.	2019-05-23 13:00:53 -05:00
Field G. Van Zee	ecbdd1c42d	Ceased use of BLIS_ENABLE_SUP_MR/NR_EXT macros. Details: - Removed already limited use of the BLIS_ENABLE_SUP_MR_EXT and BLIS_ENABLE_SUP_NR_EXT macros in bli_gemmsup_ref_var1n() and bli_gemmsup_ref_var2m(). Their purpose was merely to avoid a long conditional that would determine whether to allow the last iteration to be merged with the second-to-last iteration. Functionally, the macros were not needed, and they ended up causing problems when building configuration families such as intel64 and x86_64.	2019-04-27 19:38:11 -05:00
Field G. Van Zee	b9c9f03502	Implemented gemm on skinny/unpacked matrices. Details: - Implemented a new sub-framework within BLIS to support the management of code and kernels that specifically target matrix problems for which at least one dimension is deemed to be small, which can result in long and skinny matrix operands that are ill-suited for the conventional level-3 implementations in BLIS. The new framework tackles the problem in two ways. First the stripped-down algorithmic loops forgo the packing that is famously performed in the classic code path. That is, the computation is performed by a new family of kernels tailored specifically for operating on the source matrices as-is (unpacked). Second, these new kernels will typically (and in the case of haswell and zen, do in fact) include separate assembly sub-kernels for handling of edge cases, which helps smooth performance when performing problems whose m and n dimension are not naturally multiples of the register blocksizes. In a reference to the sub-framework's purpose of supporting skinny/unpacked level-3 operations, the "sup" operation suffix (e.g. gemmsup) is typically used to denote a separate namespace for related code and kernels. NOTE: Since the sup framework does not perform any packing, it targets row- and column-stored matrices A, B, and C. For now, if any matrix has non-unit strides in both dimensions, the problem is computed by the conventional implementation. - Implemented the default sup handler as a front-end to two variants. bli_gemmsup_ref_var2() provides a block-panel variant (in which the 2nd loop around the microkernel iterates over n and the 1st loop iterates over m), while bli_gemmsup_ref_var1() provides a panel-block variant (2nd loop over m and 1st loop over n). However, these variants are not used by default and provided for reference only. 
Instead, the default sup handler calls _var2m() and _var1n(), which are similar to _var2() and _var1(), respectively, except that they defer to the sup kernel itself to iterate over the m and n dimension, respectively. In other words, these variants rely not on microkernels, but on so-called "millikernels" that iterate along m and k, or n and k. The benefit of using millikernels is a reduction of function call and related (local integer typecast) overhead as well as the ability for the kernel to know which micropanel (A or B) will change during the next iteration of the 1st loop, which allows it to focus its prefetching on that micropanel. (In _var2m()'s millikernel, the upanel of A changes while the same upanel of B is reused. In _var1n()'s, the upanel of B changes while the upanel of A is reused.) - Added a new configure option, --[en|dis]able-sup-handling, which is enabled by default. However, the default thresholds at which the default sup handler is activated are set to zero for each of the m, n, and k dimensions, which effectively disables the implementation. (The default sup handler only accepts the problem if at least one dimension is smaller than or equal to its corresponding threshold. If all dimensions are larger than their thresholds, the problem is rejected by the sup front-end and control is passed back to the conventional implementation, which proceeds normally.) - Added support to the cntx_t structure to track new fields related to the sup framework, most notably: - sup thresholds: the thresholds at which the sup handler is called. - sup handlers: the address of the function to call to implement the level-3 skinny/unpacked matrix implementation. - sup blocksizes: the register and cache blocksizes used by the sup implementation (which may be the same or different from those used by the conventional packm-based approach). - sup kernels: the kernels that the handler will use in implementing the sup functionality. 
- sup kernel prefs: the IO preference of the sup kernels, which may differ from the preferences of the conventional gemm microkernels' IO preferences. - Added a bool_t to the rntm_t structure that indicates whether sup handling should be enabled/disabled. This allows per-call control of whether the sup implementation is used, which is useful for test drivers that wish to switch between the conventional and sup codes without having to link to different copies of BLIS. The corresponding accessor functions for this new bool_t are defined in bli_rntm.h. - Implemented several row-preferential gemmsup kernels in a new directory, kernels/haswell/3/sup. These kernels include two general implementation types--'rd' and 'rv'--for the 6x8 base shape, with two specialized millikernels that embed the 1st loop within the kernel itself. - Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference gemmsup microkernels. NOTE: These microkernels, unlike the current crop of conventional (pack-based) microkernels, do not use constant loop bounds. Additionally, their inner loop iterates over the k dimension. - Defined new typedef enums: - stor3_t: captures the effective storage combination of the level-3 problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A special value of BLIS_XXX is used to denote an arbitrary combination which, in practice, means that at least one of the operands is stored according to general stride. - threshid_t: captures each of the three dimension thresholds. - Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create() can be passed "-1, -1" as a lazy request for row storage. (Note that "0, 0" is still accepted as a lazy request for column storage.) - Added support for various instructions to bli_x86_asm_macros.h, including imul, vhaddps/pd, and other instructions related to integer vectors. 
- Disabled the older small matrix handling code inserted by AMD in bli_gemm_front.c, since the sup framework introduced in this commit is intended to provide a more generalized solution. - Added test/sup directory, which contains standalone performance test drivers, a Makefile, a runme.sh script, and an 'octave' directory containing scripts compatible with GNU Octave. (They also may work with matlab, but if not, they are probably close to working.) - Reinterpret the storage combination string (sc_str) in the various level-3 testsuite modules (e.g. src/test_gemm.c) so that the order of each matrix storage char is "cab" rather than "abc". - Comment updates in level-3 BLAS API wrappers in frame/compat.	2019-04-27 18:44:50 -05:00
Field G. Van Zee	9945ef24fd	Adjusted cache blocksizes for zen subconfig. Details: - Adjusted the zen sub-configuration's cache blocksizes for float, scomplex, and dcomplex based on the existing values for double. (The previous values were taken directly from the haswell subconfig, which targets Intel Haswell/Broadwell/Skylake systems.)	2019-03-19 15:28:44 -05:00
Field G. Van Zee	70f12f209b	Changed unsafe-loop to unsafe-math optimizations. Details: - Changed -funsafe-loop-optimizations (re-)introduced in `7690855` for make_defs.mk files' CRVECFLAGS to -funsafe-math-optimizations (to account for a miscommunication in issue #300). Thanks to Dave Love for this suggestion and Jeff Hammond for his feedback on the topic.	2019-02-20 16:10:10 -06:00
Field G. Van Zee	7690855c51	Restored -funsafe-loop-optimizations to subconfigs. Details: - Restored use of -funsafe-loop-optimizations in the definitions of CRVECFLAGS (when using gcc), but only for sub-configurations (and not configuration families such as amd64, intel64, and x86_64). This more or less reverts `5190d05` and `6cf1550`.	2019-02-18 19:16:01 -06:00
Field G. Van Zee	44994d1490	Disable TBM, XOP, LWP instructions in AMD configs. Details: - Added -mno-tbm -mno-xop -mno-lwp to CKVECFLAGS in bulldozer, piledriver, steamroller, and excavator configurations to explicitly disable AMD's bulldozer-era TBM, XOP, and LWP instruction sets in an attempt to fix the invalid instruction error that has plagued Travis CI builds since `6a014a3`. Thanks to Devin Matthews for pointing out that the offending instruction was part of TBM (issue #300). - Restored -O3 to piledriver configuration's COPTFLAGS.	2019-02-18 18:35:30 -06:00
Field G. Van Zee	1e5b530744	Reverted piledriver COPTFLAGS from -O3 to -O2. Details: - Debugging continues; changing COPTFLAGS for piledriver subconfig from -O3 to -O2, its original value prior to `6a014a3`.	2019-02-18 18:04:38 -06:00
Field G. Van Zee	6cf1550491	Removed -funsafe-loop-optimizations from all configs. Details: - Error persists. Removed -funsafe-loop-optimizations from all remaining sub-configurations.	2019-02-18 17:29:51 -06:00
Field G. Van Zee	5190d05a27	Removed -funsafe-loop-optimizations from piledriver. Details: - Error persists; continuing debugging from `bf0fb78c` by removing -funsafe-loop-optimizations from piledriver configuration.	2019-02-18 17:07:35 -06:00
Field G. Van Zee	bf0fb78c5e	Removed -funsafe-loop-optimizations from families. Details: - Removed -funsafe-loop-optimizations from the configuration families affected by `6a014a3`, specifically: intel64, amd64, and x86_64. This is part of an attempt to debug why the sde, as executed by Travis CI, is crashing via the following error: TID 0 SDE-ERROR: Executed instruction not valid for specified chip (ICELAKE): 0x9172a5: bextr_xop rax, rcx, 0x103	2019-02-18 16:51:38 -06:00
Field G. Van Zee	6a014a3377	Standardized optimization flags in make_defs.mk. Details: - Per Dave Love's recommendation in issue #300, this commit defines COPTFLAGS := -O3 and CRVECFLAGS := $(CKVECFLAGS) -funsafe-loop-optimizations in the make_defs.mk for all Intel- and AMD-based configurations.	2019-02-18 14:52:29 -06:00
Nicholai Tukanov	78bc0bc8b6	Power9 sub-configuration (#298) Formally registered power9 sub-configuration. Details: - Added and registered power9 sub-configuration into the build system. Thanks to Nicholai Tukanov and Devangi Parikh for these contributions. - Note: The sub-configuration does not yet have a corresponding architecture-specific kernel set registered, and so for now the sub-config is using the generic kernel set.	2019-02-14 13:29:02 -06:00
Devangi N. Parikh	dfc91843ea	Fixed gcc flags for thunderx2 subconfiguration Details: - Fixed -march flag. Thunderx2 is an armv8.1a architecture not armv8a.	2019-02-04 15:23:40 -05:00
Field G. Van Zee	26c5cf495c	Fixed bug in skx subconfig related to `bdd46f9`. Details: - Fixed code in the skx subconfiguration that became a bug after committing `bdd46f9`. Specifically, the bli_cntx_init_skx() function was overwriting default blocksizes for the scomplex and dcomplex microkernels despite the fact that only single and double real microkernels were being registered. This was not a problem prior to `bdd46f9` since all microkernels used dynamically-queried (at runtime) register blocksizes for loop bounds. However, post-bdd46f9, this became a bug because the reference ukernels for scomplex and dcomplex were written with their register blocksizes hard-coded as constant loop bounds, which conflicted with the erroneous scomplex and dcomplex values that bli_cntx_init_skx() was setting in the context. The lesson here is that going forward, all subconfigurations must not set any blocksizes for datatypes corresponding to default/reference microkernels. (Note that a blocksize is left unchanged by the bli_cntx_set_blkszs() function if it was set to -1.)	2019-01-24 18:49:31 -06:00
Field G. Van Zee	adf5c17f08	Formally registered thunderx2 subconfiguration. Details: - Added a separate subconfiguration for thunderx2, which now uses different optimization flags than cortexa57/cortexa53.	2019-01-18 15:14:45 -06:00
Field G. Van Zee	0645f239fb	Remove UT-Austin from copyright headers' clause 3. Details: - Removed explicit reference to The University of Texas at Austin in the third clause of the license comment blocks of all relevant files and replaced it with a more all-encompassing "copyright holder(s)". - Removed duplicate words ("derived") from a few kernels' license comment blocks. - Homogenized license comment block in kernels/zen/3/bli_gemm_small.c with format of all other comment blocks.	2018-12-04 14:31:06 -06:00
Field G. Van Zee	3c52725693	Renamed/moved l3 zen ukernels to haswell kernel set. Details: - Renamed the microkernels in kernels/zen/3 to kernels/haswell/3 and then updated the file contents to use the 'haswell' infix. - Updated bli_cntx_init_zen.c and bli_cntx_init_haswell.c according to above function renames. - Moved/updated the corresponding prototypes in bli_kernels_zen.h to bli_kernels_haswell.h. - Updated config_registry according to above changes. - NOTE: This rename reflects the fact that haswell microkernels are specifically written to overcome the floating-point latency for FMA instructions on Intel Haswell-like architectures, which can issue two FMA instructions per cycle. These ukernels happen to work fine on AMD Zen-based architectures. However, Zen only issues one FMA per cycle, which, while halving its floating-point throughput, gives it extra flexibility in the design of its microkernels--namely, mr and nr can be smaller and still overcome the floating-point latency for those single-issue cores. A smaller value of mr and nr allows for a larger value of kc, which may be useful in some situations. In the future, we may write such Zen-specific microkernels to take advantage of this additional flexibility.	2018-10-17 14:56:22 -05:00
Ye Luo	6722ec2181	Fix bgclang compilation on BGQ (#270) * Fix bgq kernels * Support bgq with bgclang	2018-10-17 11:26:00 -05:00
Field G. Van Zee	53a9ab1c85	Renamed thread auto-factorization macro constants. Details: - Renamed the following C preprocessor macros whose fallback/default values are specified within frame/include/bli_kernel_macro_defs.h: BLIS_DEFAULT_MR_THREAD_MAX -> BLIS_THREAD_MAX_IR BLIS_DEFAULT_NR_THREAD_MAX -> BLIS_THREAD_MAX_JR BLIS_DEFAULT_M_THREAD_RATIO -> BLIS_THREAD_RATIO_M BLIS_DEFAULT_N_THREAD_RATIO -> BLIS_THREAD_RATIO_N - Renamed the above cpp macro overrides within the knl, skx, and zen sub-configurations, as well as invocations of those macros in bli_rntm.c. - Moved config/zen/bli_kernel.h to an 'old' directory as it is no longer used by any code within BLIS.	2018-10-10 15:11:09 -05:00
Field G. Van Zee	e249a00a82	Imported skx dgemm ukernel from skx-redux branch. Details: - Added the new bli_dgemm_skx_asm_16x14.c microkernel from the skx-redux branch, along with appropriate blocksizes in bli_cntx_init_skx.c and a prototype in bli_kernels_skx.h. (Devin has not yet written the sgemm analogue, so for now we will continue using the older sgemm ukernel.) - Updated frame/include/bli_x86_asm_macros.h with a minor change that was present within the skx-redux branch.	2018-09-10 16:48:35 -05:00
Field G. Van Zee	cc2cca4f56	Merge branch 'dev'	2018-09-06 17:12:13 -05:00
Field G. Van Zee	fb81c7fc66	Defined cortexa53 sub-configuration. Details: - Added a new sub-configuration 'cortexa53', which is a mirror image of cortexa57 except that it will use slightly different compiler flags. Thanks to Mathieu Poumeyrol for making this suggestion after discovering that the compiler flags being used by cortexa57 were not working properly in certain OS X environments (the fix to which is currently pending in pull request #245).	2018-09-06 16:29:39 -05:00
Mathieu Poumeyrol	97965b0905	cortexa9 and cortexa53 travis build + qemu test (#245)	2018-09-06 14:10:29 -05:00
Field G. Van Zee	4fa4cb0734	Trivial comment header updates. Details: - Removed four trailing spaces after "BLIS" that occurs in most files' commented-out license headers. - Added UT copyright lines to some files. (These files previously had only AMD copyright lines but were contributed to by both UT and AMD.) - In some files' copyright lines, expanded 'The University of Texas' to 'The University of Texas at Austin'. - Fixed various typos/misspellings in some license headers.	2018-08-29 18:06:41 -05:00
Field G. Van Zee	8e10cac5f3	Updates to CREDITS, RELEASING, config/README.md. Details: - Added individuals' github handles to CREDITS file. - Updated RELEASING, config/README.md files.	2018-07-27 14:45:35 -05:00
Field G. Van Zee	89e178ce38	Merge branch 'master' into dev	2018-07-04 17:51:16 -05:00
Isuru Fernando	14648e1376	Native windows support using clang (#227) * Add appveyor file * Build script * Remove fPIC for now * copy as * set CC and CXX * Change the order of immintrin.h * Fix testsuite header * Move testsuite defs to .c * Fix appveyor file * Remove fPIC again and fix strerror_r missing bug * Remove appveyor script * cd to blis directory * Fix sleep implementation * Add f2c_types_win.h * Fix f2c compilation * Remove rdp and rename appveyor.yml * Remove setenv declaration in test header * set CPICFLAGS to empty * Fix another immintrin.h issue * Escape CFLAGS and LDFLAGS * Fix more ?mmintrin.h issues * Build x86_64 in appveyor * override LIBM LIBPTHREAD AR AS * override pthreads in configure * Move windows definitions to bli_winsys.h * Fix LIBPTHREAD default value * Build intel64 in appveyor for now	2018-07-04 17:48:42 -05:00
Field G. Van Zee	195480beb5	Merge branch 'master' into dev	2018-06-25 13:24:21 -05:00
Field G. Van Zee	d4a22702c7	Set up haswell config for optional col-pref ukrs. Details: - Added two presently-disabled cpp blocks in bli_cntx_init_haswell.c to easily allow one to switch to a set of column-preferential gemm microkernels (in the haswell subconfiguration). The second column-preferring block sets the register blocksizes to their appropriate values. However, cache blocksizes are left unchanged, and therefore are likely suboptimal. This should be addressed later.	2018-06-19 14:54:57 -05:00
Field G. Van Zee	ed2c8aed84	Temporarily disabled small matrix handling on zen. Details: - Disabled small matrix handling in config/zen/bli_family_zen.h due to what appears to be a bug that manifests as failures in the single and double precision real level-3 BLAS test drivers (visible via out.sblat3 and out.dblat3). Thanks to Robin Christ for reporting this issue.	2018-06-18 11:49:34 -05:00
Field G. Van Zee	dbaf440540	Merge branch 'master' into dev	2018-06-11 12:37:04 -05:00
Field G. Van Zee	262a62e348	Fixed undefined ref in steamroller/excavator configs. Details: - Fixed erroneous calls to bli_cntx_init_piledriver_ref() in bli_cntx_init_steamroller() and bli_cntx_init_excavator(), which should have been to their respectively-named bli_cntx_init_*() functions instead. Thanks to qnerd for bringing these bugs to our attention.	2018-06-08 12:10:54 -05:00
Devin Matthews	850a8a46c0	Test all x86_64 configurations... (#212) Add custom SDE cpuid files. * Set up testing of all x86_64 architectures (except bulldozer) using SDE. * Update .travis.yml [ci skip] * Update do_testsuite.sh [ci skip] * Updated .travis.yml with my secret token. Details: - Replaced Devin's temporary secret token with my own, which is used by Travis when accessing the Intel SDE via Dropbox. * Work around CPUID dispatch in glibc/libm by patching ld.so. * Detect path of loader at runtime. * Attempt to make SDE run on Travis * Allow unpatched ld.so if we don't know how to patch it. I think this only happens for older glibc without the multi-arch stuff (e.g. Ubuntu 14.04 on Travis), but who knows? * Upgrade Travis to gcc-6 and binutils-2.26. * Try to get Travis to use the right assembler. * Apparently you need ld-2.26 too. * Try to also patch ld.so from Ubuntu 14.04. * Take the nuclear option. * Account for non-absolute dependencies in ldd output. * String manipulation fail. * Update patch-ld-so.py * Add Zen to SDE testing. * Removed dead variable from travis/do_testsuite.sh. Details: - Removed 'BLIS_ENABLE_TEST_OUTPUT=yes' from make invocations in travis/do_testsuite.sh. This variable is no longer present in the BLIS build system (if it ever was?), and therefore has no effect.	2018-05-29 13:51:21 -05:00
Field G. Van Zee	ad67dc4e34	Communicate cc, cc_vendor to make via config.mk. Details: - Historically, the compiler selection has happened statically in the various make_defs.mk and would only be overridden by setting CC (either prior to running configure or as a configure argument). However, in the last couple months, configure has evolved to contain rather sophisticated compiler detection logic for the purposes of blacklisting sub-configurations. It only makes sense that configure now fully take over the responsibility of selecting a compiler from the GNU make side of the build system. Thanks to Alex Arslan for his help exposing this issue. - Substitute found_cc into CC in config.mk via configure. - Set a new variable, CC_VENDOR, in config.mk via substitution from configure, and disable the corresponding CC_VENDOR code in common.mk. - Disabled default compiler selection (usually gcc) in the sub-configs' various make_def.mk files.	2018-05-14 18:35:28 -05:00
Field G. Van Zee	20af119fc9	Added README.md to 'config' directory. Details: - Added a brief README.md file to the config directory to redirect those who may be exploring the source tree to the ConfigurationHowTo wiki. (Included is a very brief explanation of configurations for those who don't have time to read the wiki.) Thanks to Nico Schlömer for this suggestion.	2018-05-14 17:44:58 -05:00
Field G. Van Zee	4fb353bd90	Merge branch 'master' into dev	2018-05-13 17:50:51 -05:00
Field G. Van Zee	bf03503059	Renamed (shortened) a few build system variables. Details: - Renamed the following variables in config.mk (via build/config.mk.in): BLIS_ENABLE_VERBOSE_MAKE_OUTPUT -> ENABLE_VERBOSE BLIS_ENABLE_STATIC_BUILD -> MK_ENABLE_STATIC BLIS_ENABLE_SHARED_BUILD -> MK_ENABLE_SHARED BLIS_ENABLE_BLAS2BLIS -> MK_ENABLE_BLAS BLIS_ENABLE_CBLAS -> MK_ENABLE_CBLAS BLIS_ENABLE_MEMKIND -> MK_ENABLE_MEMKIND and also renamed all uses of these variables in makefiles and makefile fragments. Notice that we use the "MK_" prefix so that those variables can be easily differentiated (such as via grep) from their "BLIS_" C preprocessor macro counterparts. - Other whitespace changes to build/config.mk.in. - Renamed the following C preprocessor macros in bli_config.h (via build/bli_config.h.in): BLIS_ENABLE_BLAS2BLIS -> BLIS_ENABLE_BLAS BLIS_DISABLE_BLAS2BLIS -> BLIS_DISABLE_BLAS BLIS_BLAS2BLIS_INT_TYPE_SIZE -> BLIS_BLAS_INT_TYPE_SIZE and also renamed all relevant uses of these macros in BLIS source files. - Renamed "blas2blis" variable occurrences in configure to "blas", as was done in build/config.mk.in and build/bli_config.h.in. - Renamed the following functions in frame/base/bli_info.c: bli_info_get_enable_blas2blis() -> bli_info_get_enable_blas() bli_info_get_blas2blis_int_type_size() -> bli_info_get_blas_int_type_size() - Remove bli_config.h during 'make cleanh' target of top-level Makefile.	2018-05-08 16:49:22 -05:00
Field G. Van Zee	4b36e85be9	Converted function-like macros to static functions. Details: - Converted most C preprocessor macros in bli_param_macro_defs.h and bli_obj_macro_defs.h to static functions. - Reshuffled some functions/macros to bli_misc_macro_defs.h and also between bli_param_macro_defs.h and bli_obj_macro_defs.h. - Changed obj_t-initializing macros in bli_type_defs.h to static functions. - Removed some old references to BLIS_TWO and BLIS_MINUS_TWO from bli_constants.h. - Whitespace changes in select files (four spaces to single tab).	2018-05-08 14:26:30 -05:00
Nisanth M P	4cff432d70	AMD specific optimizations for target 'zen' (#194) Re-enabled AMD-specific optimizations for zen. Details: - Re-enabled Zen-specific cache blocksizes for 'zen' sub-configuration. - Re-enabled small matrix gemm optimization for 'zen'. - These were both temporarily disabled during a previous merge simply due to lack of Zen hardware for testing.	2018-05-02 12:50:42 -05:00
Field G. Van Zee	60366a3fab	Updates to knl kernels and related code. Details: - Imported the 24x16 knl sgemm microkernel (and its corresponding spackm kernel) from TBLIS and enabled its use in the knl sub-config. Also added sgemm microkernel prototype to bli_kernels_knl.h. - Updated dgemm and dpackm microkernels from TBLIS, which included an important change regarding the offsets array (changed from extern declaration to static declaration/definition). - Activated use of level-1v and -1f zen kernels in skx and knl sub-configs. - Removed some old macros no longer needed in bli_family_skx.h now that libmemkind support exists in configure. - Moved bli_avx512_macros.h to frame/include and adjusted #includes in skx and knl kernels accordingly. - Moved unused kernels in kernels/knl/3 to kernels/knl/3/other directory. - Fixed a minor bug in the 'make' output per compile when verboseness is not turned on. The rule-generating function 'make-kernel-rule' was previously passing in the name of the config, rather than the name of the kernel set returned by get-config-for-kset, which could give misleading information to the user when the kconfig_map mapped a kernel set to a sub-configuration that did not share the same name. (This didn't affect the CFLAGS that were actually used.) - Updated test/3m4m/Makefile, removing acml targets and renaming the remaining targets.	2018-04-16 18:46:21 -05:00
Field G. Van Zee	78a24e7dad	Updated bli_avx512_macros.h in knl and skx configs. Details: - Downloaded updated version of bli_avx512_macros.h from TBLIS [1] in attempt to address issue #192. [1] https://github.com/devinamatthews/tblis/	2018-04-09 17:02:13 -05:00
Field G. Van Zee	45fbe66b3e	Fixed libmemkind dependency for x86_64. Details: - Removed some old conditional code in config/knl/make_defs.mk that added -lmemkind to LDFLAGS if DEBUG_TYPE was not 'sde' and inserted code into common.mk that affirmatively filters out -lmemkind from LDFLAGS if DEBUG_TYPE is 'sde'. (Thanks to Dave Love for reporting this issue.) Other minor cleanups to neighboring code in common.mk. - Updated CRVECFLAGS in knl/make_defs.mk to be based on -march=knl, and then AVX-512 functionality is manually removed via various -mno-avx512* flags. Also, make the setting of CRVECFLAGS conditional on CC_VENDOR. Similar change to skx/make_defs.mk. - Comment/whitespace updates.	2018-04-09 14:01:08 -05:00
Field G. Van Zee	bd0276752c	Track separate ref kernel flags for each sub-config. Details: - Renamed CVECFLAGS variables in sub-configurations' make_defs.mk files to CKVECFLAGS. - Added default definitions of two new make variables to most sub-configurations' make_defs.mk files--CROPTFLAGS and CRVECFLAGS--which correspond to reference kernel analogues of the CKOPTFLAGS and CKVECFLAGS, which track optimization and vectorization flags for optimized kernels. Currently, two sub-configurations (knl and skx) explicitly set CRVECFLAGS to non-default values (using AVX2 instead of AVX-512 for reference kernels). Thanks to Jeff Hammond, whose feedback prompted me to make this change (issue #187). - Changed common.mk so that the get-refkern-cflags-for function returns the flags associated with the given sub-configuration's CROPTFLAGS and CRVECFLAGS (instead of CKOPTFLAGS and CKVECFLAGS).	2018-04-06 18:51:43 -05:00
Field G. Van Zee	786d15c5ef	Added skx, knl to x86_64 configuration family. Details: - Added 'skx' and 'knl' sub-configurations to the 'x86_64' configuration family in the config_registry file. - Added logic to configure that avoids committing certain sub-configs to the configuration/kernel registries if those sub-configs cannot be handled properly by the chosen compiler. (This was modeled after similar logic in TBLIS's configure; thanks to Devin Matthews for pointing this out.) First, the compiler and its version are inspected and, based on the results, certain configurations are added to a "blacklist". Then, as the configuration registries are being created, configurations and/or kernels that match items in the blacklist are skipped over and not committed to the registries. Under certain circumstances, omitting a blacklisted configuration will indirectly invalidate other configurations due to the loss of availability of the original blacklisted configuration's kernel set. This additional indirect blacklist is also accounted for. - Added output to the beginning of configure that echoes information about the chosen compiler as well as the configurations that are blacklisted and must be stripped from the registries. - Various other cleanups in configure, especially with respect to explicitly declaring local variables in functions. - Comment updates to config/zen/make_defs.mk regarding choice of -march flags based on compiler version.	2018-04-04 16:06:47 -05:00
Field G. Van Zee	e2192a8fd5	Removed vzeroupper intrinsics from zen kernels. Details: - Fixed a bug in the zen (also used by haswell) dotxf kernels whereby a vzeroupper instruction destroyed part of the intermediate result stored by the vdpps instructions that came right before. (The vzeroupper intrinsic was removed.) - Removed remaining vzeroupper intrinsics from other zen kernels. Previously, the vzeroupper instructions were included because BLIS is typically compiled with -mfpmath=sse. But it was brought to my attention that inserting these vzeroupper instructions is unnecessary for our purposes, since (a) -mfpmath=sse results in VEX-encoded scalar code rather than literal SSE instructions, and (b) compilers already (likely) insert vzeroupper instructions where necessary. Thanks to Devin Matthews for zeroing in on the dotxf bug. - Removed -malign-double from bulldozer make_defs.mk. This alignment was already happening by default since bulldozer is an x86_64 system.	2018-03-23 12:53:48 -05:00
Field G. Van Zee	22289ad23c	Added build system support for libmemkind. Details: - Added support for libmemkind to configure. configure attempts to detect the presence of libmemkind by compiling a small program containing #include <hbwmalloc.h> and a call to hbw_malloc(). If successful, it is assumed that libmemkind is present and available. If present, use of libmemkind is enabled by default, and otherwise use is disabled by default. If libmemkind is present, the user may explicitly disable use of the library by running configure with the --without-memkind option. Furthermore, a configuration may disable libmemkind, perhaps conditional on some aspect of the build system, by including -DBLIS_DISABLE_MEMKIND in the configuration's CPPROCFLAGS make variable and setting the BLIS_ENABLE_MEMKIND makefile variable, set in config.mk, to 'no'. (The knl configuration makes use of this latter feature; see below.) - If enabled at configure-time, bli_system.h will #include <hbwmalloc.h> and bli_kernel_macro_defs.h will define BLIS_MALLOC_POOL and BLIS_FREE_POOL to use hbw_malloc() and hbw_free(), respectively. - Deprecated explicit use of BLIS_NO_HBWMALLOC in config/knl/bli_family.knl.h and replaced use of -DBLIS_NO_HBWMALLOC in config/knl/make_defs.mk with -DBLIS_DISABLE_MEMKIND, which overrides (#undefs) the definition of BLIS_ENABLE_MEMKIND in bli_system.h, if it would otherwise be defined. Also, set the BLIS_ENABLE_MEMKIND makefile variable to 'no'. - common.mk now adds libmemkind to LDFLAGS if libmemkind is enabled.	2018-03-22 18:21:30 -05:00
Devin Matthews	8f2fabec80	Make arm32 and arm64 families work. (#176)	2018-03-14 17:43:42 -05:00
Devin Matthews	9cee78e006	Fix Cortex-A9 and Cortex-A15 configs. Tested with QEMU.	2018-03-14 13:09:48 -05:00
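
The acceptance rule described in commit b9c9f03502 (the sup handler takes a problem only if at least one dimension is less than or equal to its threshold) can be sketched in C as follows. This is a minimal illustration, not BLIS source: the type `sup_thresh_t` and the function `sup_accepts()` are hypothetical names, and the real thresholds live in the cntx_t structure, whose layout is not shown in this log.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the three per-dimension sup thresholds.
   This is NOT the layout BLIS uses inside cntx_t. */
typedef struct { long mt, nt, kt; } sup_thresh_t;

/* Hypothetical helper: returns true when the sup path would accept the
   problem, i.e. when at least one of m, n, k is less than or equal to
   its threshold. Returning false corresponds to handing control back
   to the conventional (packing) implementation. */
static bool sup_accepts(long m, long n, long k, sup_thresh_t t)
{
    assert(m >= 0 && n >= 0 && k >= 0); /* dimensions are non-negative */
    return m <= t.mt || n <= t.nt || k <= t.kt;
}
```

With all three thresholds at zero (the default stated in the commit message), any problem whose dimensions are all nonzero is rejected, which is why the default effectively disables the sup path. Raising only the m threshold, as in a4e8801d08 (201 for haswell), lets a problem with m = 100 through regardless of n and k.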


312 Commits