amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-12 18:15:37 +00:00

Author	SHA1	Message	Date
Nicholai Tukanov	670bc7b60f	Add low-precision POWER10 gemm kernels (#467) Details: - This commit adds a new BLIS sandbox that (1) provides implementations based on low-precision gemm kernels, and (2) extends the BLIS typed API for those new implementations. Currently, these new kernels can only be used for the POWER10 microarchitecture; however, they may provide a template for developing similar kernels for other microarchitectures (even those beyond POWER), as changes would likely be limited to select places in the microkernel and possibly the packing routines. The new low-precision operations that are now supported include: shgemm, sbgemm, i16gemm, i8gemm, i4gemm. For more information, refer to the POWER10.md document that is included in 'sandbox/power10'.	2021-03-05 13:53:43 -06:00
Field G. Van Zee	f5871c7e06	Added complex asm packm kernels for 'haswell' set. Details: - Implemented assembly-based packm kernels for single- and double- precision complex domain (c and z) and housed them in the 'haswell' kernel set. This means c3xk, c8xk, z3xk, and z4xk are now all optimized. - Registered the aforementioned packm kernels in the haswell, zen, and zen2 subconfigs. - Minor modifications to the corresponding s and d packm kernels that were introduced in `426ad67`. - Thanks to AMD, who originally contributed the double-precision real packm kernels (d6xk and d8xk), upon which these complex kernels are partially based.	2021-02-28 17:03:57 -06:00
Field G. Van Zee	426ad679f5	Added assembly packm kernels for 'haswell' set. Details: - Implemented assembly-based packm kernels for single- and double- precision real domain (s and d) and housed them in the 'haswell' kernel set. This means s6xk, s16xk, d6xk, and d8xk are now all optimized. - Registered the aforementioned packm kernels in the haswell, zen, and zen2 subconfigs. - Thanks to AMD, who originally contributed the double-precision real packm kernels (d6xk and d8xk), which I have now tweaked and used to create comparable single-precision real kernels (s6xk and s16xk).	2021-02-27 18:39:56 -06:00
Field G. Van Zee	11dfc176a3	Reorganized thread auto-factorization logic. Details: - Reorganized logic of bli_thread_partition_2x2() so that the primary guts were factored out into "fast" and "slow" variants. Then added logic to the "fast" variant that allows for more optimal thread factorizations in some situations where there is at least one factor of 2. - Changed BLIS_THREAD_RATIO_M from 2 to 1 in bli_kernel_macro_defs.h and added comments to that file describing BLIS_THREAD_RATIO_? and BLIS_THREAD_MAX_?R. - In bli_family_zen.h and bli_family_zen2.h, preprocessed out several macros not used in vanilla BLIS and removed the unused macro BLIS_ENABLE_ZEN_BLOCK_SIZES from the former file. - Disabled AMD's small matrix handling entry points in bli_syrk_front.c and bli_trsm_front.c. (These branches of small matrix handling have not been reviewed by vanilla BLIS developers.) - Added commented-out printf() calls to bli_rntm.c. - Whitespace changes to bli_thread.c.	2020-12-01 19:51:27 +00:00
Field G. Van Zee	88ad841434	Squash-merge 'pr' into 'squash'. (#457) Merged contributions from AMD's AOCL BLIS (#448). Details: - Added support for level-3 operation gemmt, which performs a gemm on only the lower or upper triangle of a square matrix C. For now, only the conventional/large code path will be supported (in vanilla BLIS). This was accomplished by leveraging the existing variant logic for herk. However, some of the infrastructure to support a gemmtsup is included in this commit, including - A bli_gemmtsup() front-end, similar to bli_gemmsup(). - A bli_gemmtsup_ref() reference handler function. - A bli_gemmtsup_int() variant chooser function (with variant calls commented out). - Added support for inducing complex domain gemmt via the 1m method. - Added gemmt APIs to the BLAS and CBLAS compatibility layers. - Added gemmt test module to testsuite. - Added standalone gemmt test driver to 'test' directory. - Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md. - Added a C++ template header (blis.hh) containing a BLAS-inspired wrapper to a set of polymorphic CBLAS-like function wrappers defined in another header (cblas.hh). These two headers are installed if running the 'install' target with INSTALL_HH is set to 'yes'. (Also added a set of unit tests that exercise blis.hh, although they are disabled for now because they aren't compatible with out-of-tree builds.) These files now live in the 'vendor' top-level directory. - Various updates to 'zen' and 'zen2' subconfigurations, particularly within the context initialization functions. - Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and various minor updates to dotv and scalv kernels. Also added various sup kernels contributed by AMD to kernels/zen/3. However, these kernels are (for now) not yet used, in part because they caused AppVeyor clang failures, and also because I have not found time to review and vet them. 
- Output the python found during configure into the definition of PYTHON in build/config.mk (via build/config.mk.in). - Added early-return checks (A, B, or C with zero dimension; alpha = 0) to bli_gemm_front.c. - Implemented explicit beta = 0 handling for the sgemm ukernel in bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent bug surfaced because the gemmt module verifies its computation using gemm with its beta parameter set to zero, which, on a cortexa15 system caused the gemm kernel code to unconditionally multiply the uninitialized C data by beta. The C matrix likely contained non-numeric values such as NaN, which then would have resulted in a false failure. - Fixed a bug whereby the implementation for bli_herk_determine_kc(), in bli_l3_blocksize.c, was inadvertently being defined in terms of helper functions meant for trmm. This bug was probably harmless since the trmm code should have also done the right thing for herk. - Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in kernels/zen/3/bli_gemm_small.c since those macros are not used in vanilla BLIS. - Added cpp guard to definition of bli_mem_clear() in bli_mem.h to accommodate C++'s stricter type checking. - Added cpp guard to test/*.c drivers that facilitate compilation on Windows systems. - Various whitespace changes.	2020-11-14 09:39:48 -06:00
Field G. Van Zee	e14424f55b	Merge branch 'dev'	2020-11-07 13:02:50 -06:00
Field G. Van Zee	a0849d390d	Register l3 sup kernels in zen2 subconfig. Details: - Registered full suite of sgemm and dgemm sup millikernels, blocksizes, and crossover thresholds in bli_cntx_init_zen2.c. - Minor updates to test/sup/runme.sh for running on Zen2 Epyc 7742 system.	2020-10-09 20:22:17 +00:00
Nicholai Tukanov	2d8ec164e7	Add POWER10 support to BLIS (#450)	2020-09-29 16:52:18 -05:00
Field G. Van Zee	4fd8d9fec2	Tweaked zen2 subconfig's MC cache blocksizes. Details: - Updated the MC cache blocksizes registered by the 'zen2' subconfig. - Minor updates to test/3/Makefile and test/3/runme.sh.	2020-09-28 23:39:05 +00:00
Field G. Van Zee	e293cae2d1	Implemented sgemmsup assembly kernels. Details: - Created a set of single-precision real millikernels and microkernels comparable to the dgemmsup kernels that already exist within BLIS. - Added prototypes for all kernels within bli_kernels_haswell.h. - Registered entry-point millikernels in bli_cntx_init_haswell.c and bli_cntx_init_zen.c. - Added sgemmsup support to the Makefile, runme.sh script, and source file in test/sup. This included edits that allow for separate "small" dimensions for single- and double-precision as well as for single- vs. multithreaded execution.	2020-09-15 16:09:11 -05:00
Devin Matthews	7d41128219	Use -O2 for all framework code. (#435) It seems that -O3 might be causing intermittent problems with the f2c'ed packed and banded code. -O3 is retained for kernel code. Fixes #341 and fixes #342.	2020-08-13 17:50:58 -05:00
Dave Love	9c5b485d35	Don't override -mcpu with -march on ARM (#353) * Use -mcpu for ARM See the GCC doc about -march, -mtune, and -mcpu and maybe https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/compiler-flags-across-architectures-march-mtune-and-mcpu * Fix typo in flags * Fix typo in cortexa9 flags * Modify cortexa53 compilation flags to fix failing BLAS check (#341)	2020-08-07 15:11:18 -05:00
Field G. Van Zee	00e14cb6d8	Replaced use of bool_t type with C99 bool. Details: - Textually replaced nearly all non-comment instances of bool_t with the C99 bool type. A few remaining instances, such as those in the files bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being used not for boolean purposes but to index into an array. - This commit constitutes the third phase of a transition toward using C99's bool instead of bool_t, which was raised in issue #420. The first phase, which cleaned up various typecasts in preparation for using bool as the basis for bool_t (instead of gint_t), was implemented by commit `a69a4d7`. The second phase, which redefined the bool_t typedef in terms of bool (from gint_t), was implemented by commit `2c554c2`.	2020-07-29 14:24:34 -05:00
Field G. Van Zee	5fc701ac5f	Added -fomit-frame-pointer option to CKOPTFLAGS. Details: - Added the -fomit-frame-pointer compiler option to the CKOPTFLAGS variable in the following make_defs.mk files: config/haswell/make_defs.mk config/skx/make_defs.mk as well as comments that mention why the compiler option is needed. This option is needed to prevent the compiler from using the rbp frame register (in the very early portion of kernel code, typically where k_iter and k_left are defined and computed), which, as of `1c719c9`, is used explicitly by the gemmsup millikernels. Thanks to Devin Matthews for identifying this missing option and to Jeff Diamond for reporting the original bug in #417. - The file config/zen/amd_config.mk which feeds into the make_defs.mk for both zen and zen2 subconfigs, was also touched, but only to add a commented-out compiler option (and the aforementioned explanatory comment) since that file already uses -fomit-frame-pointer in COPTFLAGS, which forms the basis of CKOPTFLAGS.	2020-07-01 15:48:58 -05:00
Field G. Van Zee	2e3b3782cf	Merge branch 'master' into amd	2020-04-06 14:55:35 -05:00
Devin Matthews	492a736fab	Fix vectorized version of bli_amaxv (#382) * Fix vectorized version of bli_amaxv To match Netlib, i?amax should return: - the lowest index among equal values - the first NaN if one is encountered * Fix typos. * And another one... * Update ref. amaxv kernel too. * Re-enabled optimized amaxv kernels. Details: - Re-enabled the optimized, intrinsics-based amaxv kernels in the 'zen' kernel set for use in haswell, zen, zen2, knl, and skx subconfigs. These two kernels (for s and d datatypes) were temporarily disabled in `e186d71` as part of issue #380. However, the key missing semantic properties that prompted the disabling of these kernels--returning the index of the first rather than of the last element with largest absolute value, and returning the index of the first NaN if one is encountered--were added as part of #382 thanks to Devin Matthews. Thus, now that the kernels are working as expected once more, this commit causes these kernels to once again be registered for the affected subconfigs, which effectively reverts all code changes included in `e186d71`. - Whitespace/formatting updates to new macros in bli_amaxv_zen_int.c. Co-authored-by: Field G. Van Zee <field@cs.utexas.edu>	2020-03-24 17:28:47 -05:00
Field G. Van Zee	e186d7141a	Disabled optimized amaxv kernels. Details: - Disabled use of optimized amaxv kernels, which use vector intrinsics for both 's' and 'd' datatypes. We disable these kernels because the current implementations fail to observe a semantic property of the BLAS i?amax_() subroutine, which is to return the index of the first element containing the maximum absolute value (that is, the first element if there exist two or more elements that contain the same value). With the optimized kernels disabled, the affected subconfigurations (haswell, zen, zen2, knl, and skx) will use the default reference implementations. Thanks to Mat Cross for reporting this issue via #380. - CREDITS file update.	2020-03-21 18:40:36 -05:00
Field G. Van Zee	c0558fde45	Support multithreading within the sup framework. Details: - Added multithreading support to the sup framework (via either OpenMP or pthreads). Both variants 1n and 2m now have the appropriate threading infrastructure, including data partitioning logic, to parallelize computation. This support handles all four combinations of packing on matrices A and B (neither, A only, B only, or both). This implementation tries to be a little smarter when automatic threading is requested (e.g. via BLIS_NUM_THREADS) in that it will recalculate the factorization in units of micropanels (rather than using the raw dimensions) in bli_l3_sup_int.c, when the final problem shape is known and after threads have already been spawned. - Implemented bli_?packm_sup_var2(), which packs to conventional row- or column-stored matrices. (This is used for the rrc and crc storage cases.) Previously, copym was used, but that would no longer suffice because it could not be parallelized. - Minor reorganization of packing-related sup functions. Specifically, bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]() instead of from the variant functions. This has the effect of making the variant functions more readable. - Added additional bli_thrinfo_set_() static functions to bli_thrinfo.h and inserted usage of these functions within bli_thrinfo_init(), which previously was accessing thrinfo_t fields via the -> operator. - Renamed bli_partition_2x2() to bli_thread_partition_2x2(). - Added an auto_factor field to the rntm_t struct in order to track whether automatic thread factorization was originally requested. - Added new test drivers in test/supmt that perform multithreaded sup tests, as well as appropriate octave/matlab scripts to plot the resulting output files. 
- Added additional language to docs/Multithreading.md to make it clear that specifying any BLIS__NT variable, even if it is set to 1, will be considered manual specification for the purposes of determining whether to auto-factorize via BLIS_NUM_THREADS. - Minor comment updates.	2020-02-17 14:08:08 -06:00
Field G. Van Zee	2853825234	Merge branch 'master' into amd	2019-12-06 16:06:46 -06:00
Nicholai Tukanov	61b1f0b060	Add prototypes for POWER9 reference kernels (#365) Updates and fixes to power9 subconfig. Details: - Register s,c,z reference gemm and trsm ukernels that assume elements of B have been broadcast. - Added prototypes for level-3 ukernels that assume elements of B have been broadcast. Also added prototype for an spackm function that employs a duplication/broadcast factor of 4. - Register virtual gemmtrsm ukernels that work with broadcasting of B. - Disable right-side hemm, symm, trmm, and trmm3 in bli_family_power9.h. - Thanks to Nicholai Tukanov for providing these updates.	2019-12-04 14:18:47 -06:00
Field G. Van Zee	39fa7136f4	Added support for selective packing to gemmsup. Details: - Implemented optional packing for A or B (or both) within the sup framework (which currently only supports gemm). The request for packing either matrix A or matrix B can be made via setting environment variables BLIS_PACK_A or BLIS_PACK_B (to any non-zero value; if set, zero means "disable packing"). It can also be made globally at runtime via bli_pack_set_pack_a() and bli_pack_set_pack_b() or with individual rntm_t objects via bli_rntm_set_pack_a() and bli_rntm_set_pack_b() if using the expert interface of either the BLIS typed or object APIs. (If using the BLAS API, environment variables are the only way to communicate the packing request.) - One caveat (for now) with the current implementation of selective packing is that any blocksize extension registered in the _cntx_init function (such as is currently used by haswell and zen subconfigs) will be ignored if the affected matrix is packed. The reason is simply that I didn't get around to implementing the necessary logic to pack a larger edge-case micropanel, though this is entirely possible and should be done in the future. - Spun off the variant-choosing portion of bli_gemmsup_ref() into bli_gemmsup_int(), in bli_l3_sup_int.c. - Added new files, bli_l3_sup_packm_a.c, bli_l3_sup_packm_b.c, along with corresponding headers, in which higher-level packm-related functions are defined for use within the sup framework. The actual packm variant code resides in bli_l3_sup_packm_var.c. - Pass the following new parameters into var1n and var2m: packa, packb bool_t's, pointer to a rntm_t, pointer to a cntl_t (which is for now always NULL), and pointer to a thrinfo_t* (which for nowis the address of the global single-threaded packm thread control node). - Added panel strides ps_a and ps_b to the auxinfo_t structure so that the millikernel can query the panel stride of the packed matrix and step through it accordingly. 
If the matrix isn't packed, the panel stride of interest for the given millikernel will be set to the appropriate value so that the mkernel may step through the unpacked matrix as it normally would. - Modified the rv_6x8m and rv_6x8n millikernels to read the appropriate panel strides (ps_a and ps_b, respectively) instead of computing them on the fly. - Spun off the environment variable getting and setting functions into a new file, bli_env.c (with a corresponding prototype header). These functions are now used by the threading infrastructure (e.g. BLIS_NUM_THREADS, BLIS_JC_NT, etc.) as well as the selective packing infrastructure (e.g. BLIS_PACK_A, BLIS_PACK_B). - Added a static initializer for mem_t objects, BLIS_MEM_INITIALIZER. - Added a static initializer for pblk_t objects, BLIS_PBLK_INITIALIZER, for use within the definition of BLIS_MEM_INITIALIZER. - Moved the global_rntm object to bli_rntm.c and extern it where needed. This means that the function bli_thread_init_rntm() was renamed to bli_rntm_init_from_global() and relocated accordingly. - Added a new bli_pack.c function, which serves as the home for functions that manage the pack_a and pack_b fields of the global rntm_t, including from environment variables, just as we have functions to manage the threading fields of the global rntm_t in bli_thread.c. - Reorganized naming for files in frame/thread, which mostly involved spinning off the bli_l3_thread_decorator() functions into their own files. This change makes more sense when considering the further addition of bli_l3_sup_thread_decorator() functions (for now limited only to the single-threaded form found in the _single.c file). - Explicitly initialize the reference sup handlers in both bli_cntx_init_haswell.c and bli_cntx_init_zen.c so that it's more obvious how to customize to a different handler, if desired. - Removed various snippets of disabled code. - Various comment updates.	2019-11-29 15:27:07 -06:00
Field G. Van Zee	fb8bef9982	Fixed copy-paste bug in bli_spackm_6xk_bb4_ref(). Details: - Fixed a copy-paste bug in the new bli_spackm_6xk_bb4_ref() that manifested as failures in single-precision real level-3 operations. Also replaced the duplication factor constants with a const-qualified variable, dfac, so that this won't happen again. - Changed NC for single-precision real from 4080 to 8160 so that the packed matrix B will have the same byte footprint in both single and double real.	2019-11-14 13:05:28 -06:00
Field G. Van Zee	bdc7ee3394	Various fixes to support packing duplication in B. Details: - Added cpp macros to trmm and trmm3 front-ends to optionally force those operations to be cast so the structured matrix is on the left. symm and hemm already had such macros, but these too were renamed so that the macros were individual to the operation. We now have four such macros: #define BLIS_DISABLE_HEMM_RIGHT #define BLIS_DISABLE_SYMM_RIGHT #define BLIS_DISABLE_TRMM_RIGHT #define BLIS_DISABLE_TRMM3_RIGHT Also, updated the comments in the symm and hemm front-ends related to the first two macro guards, and added corresponding comments to the trmm and trmm3 front-ends for the latter two guards. (They all functionally do the same thing, just for their specific operations.) Thanks to Jeff Hammond for reporting the bugs that led me to this change (via #359). - Updated config/old/haswellbb subconfiguration (used to debug issues related to duplicating B during packing) to register: a packing kernel for single-precision real; gemmbb ukernels for s, c, and z; trsmbb ukernels for s, c, and z; gemmtrsmbb virtual ukernels for s, c and z; and to use non-default cache and register blocksizes for s, c, and z datatypes. Also declared prototypes for all of the gemmbb, trsmbb, and gemmtrsmbb ukernel functions within the bli_cntx_init_haswellbb() function. This should, once applied to the power9 configuration, fix the remaining issues in #359. - Defined bli_spackm_6xk_bb4_ref(), which packs single reals with a duplication factor of 4. This function is defined in the same file as bli_dpackm_6xk_bb2_ref() (bli_packm_cxk_bb_ref.c).	2019-11-11 15:47:17 -06:00
Field G. Van Zee	c84391314d	Reverted minor temp/wspace changes from `b426f9e`. Details: - Added missing license header to bli_pwr9_asm_macros_12x6.h. - Reverted temporary changes to various files in 'test' and 'testsuite' directories. - Moved testsuite/jobscripts into testsuite/old. - Minor whitespace/comment changes across various files.	2019-11-04 13:57:12 -06:00
Nicholai Tukanov	b426f9e04e	POWER9 DGEMM (#355) Implemented and registered power9 dgemm ukernel. Details: - Implemented 12x6 dgemm microkernel for power9. This microkernel assumes that elements of B have been duplicated/broadcast during the packing step. The microkernel uses a column orientation for its microtile vector registers and thus implements column storage and general stride IO cases. (A row storage IO case via in-register transposition may be added at a future date.) It should be noted that we recommend using this microkernel with gcc and not xlc, as issues with the latter cropped up during development, including but not limited to slightly incompatible vector register mnemonics in the GNU extended inline assembly clobber list.	2019-11-01 17:57:03 -05:00
Field G. Van Zee	6218ac95a5	Merge branch 'master' into amd	2019-10-11 11:53:51 -05:00
Field G. Van Zee	0016d541e6	Changed -march=znver2 to =znver1 for clang on zen2. Details: - In config/zen2/make_defs.mk, changed the -march= flag so that -march=znver1 is used instead of -march=znver2 when CC_VENDOR is clang. (The gcc branch attempts to differentiate between various versions, but the equivalent version cutoffs for clang are not yet known by us, so we have to use a single flag for all versions of clang. Hopefully -march=znver1 is new enough. If not, we'll fall back to -march=bdver4 -mno-fma4 -mno-tbm -mno-xop -mno-lwp.) This issue was discovered thanks to AppVeyor.	2019-10-11 11:09:44 -05:00
Field G. Van Zee	e94a0530e5	Corrected zen NC that was non-multiple of NR. Details: - Updated an incorrectly set cache blocksize NC for single real within config/zen/bli_cntx_init_zen.c that was not a multiple of the corresponding value of NR. This issue, which was caught by Travis CI, was introduced in `29b0e1e`.	2019-10-11 10:48:27 -05:00
Field G. Van Zee	29b0e1ef4e	Code review + tweaks to AMD's AOCL 2.0 PR (#349). Details: - NOTE: This is a merge commit of 'master' of git://github.com/amd/blis into 'amd-master' of flame/blis. - Fixed a bug in the downstream value of BLIS_NUM_ARCHS, which was inadvertently not incremented when the Zen2 subconfiguration was added. - In bli_gemm_front(), added a missing conditional constraint around the call to bli_gemm_small() that ensures that the computation precision of C matches the storage precision of C. - In bli_syrk_front(), reorganized and relocated the notrans/trans logic that existed around the call to bli_syrk_small() into bli_syrk_small() to minimize the calling code footprint and also to bring that code into stylistic harmony with similar code in bli_gemm_front() and bli_trsm_front(). Also, replaced direct accessing of obj_t fields with proper accessor static functions (e.g. 'a->dim[0]' becomes 'bli_obj_length( a )'). - Added #ifdef BLIS_ENABLE_SMALL_MATRIX guard around prototypes for bli_gemm_small(), bli_syrk_small(), and bli_trsm_small(). This is strictly speaking unnecessary, but it serves as a useful visual cue to those who may be reading the files. - Removed cpp macro-protected small matrix debugging code from bli_trsm_front.c. - Added a GCC_OT_9_1_0 variable to build/config.mk.in to facilitate gcc version check for availability of -march=znver2, and added appropriate support to configure script. - Cleanups to compiler flags common to recent AMD microarchitectures in config/zen/amd_config.mk, including: removal of -march=znver1 et al. from CKVECFLAGS (since the -march flag is added within make_defs.mk); setting CRVECFLAGS similarly to CKVECFLAGS. - Cleanups to config/zen/bli_cntx_init_zen.c. - Cleanups, added comments to config/zen/make_defs.mk. - Cleanups to config/zen2/make_defs.mk, including making use of newly- added GCC_OT_9_1_0 and existing GCC_OT_6_1_0 to choose the correct set of compiler flags based on the version of gcc being used. 
- Reverted downstream changes to test/test_gemm.c. - Various whitespace/comment changes.	2019-10-11 10:24:24 -05:00
Field G. Van Zee	31c8657f1d	Added support for pre-broadcast when packing B. Details: - Added support for being able to duplicate (broadcast) elements in memory when packing matrix B (i.e., the right-hand operand) in level-3 operations. This turns out to be advantageous for some architectures that can afford the cost of the extra bandwidth and somehow benefit from the pre-broadcast elements (and thus being able to avoid using broadcast-style load instructions on micro-rows of B in the gemm microkernel). - Support optionally disabling right-side hemm and symm. If this occurs, hemm_r is implemented in terms of hemm_l (and symm_r in terms of symm_l). This is needed when broadcasting during packing because the alternative--supporting the broadcast of B while also allowing matrix B to be Hermitian/symmetric--would be an absolute mess. - Support alignment factors for packed blocks of A, B, and C separately (as well as for general-purpose buffers). In addition, we support byte offsets from those alignment values (which is different from aligning by align+offset bytes to begin with). The default alignment values are BLIS_PAGE_SIZE in all four cases, with the offset values defaulting to zero. - Pass pack_t schema into bli_?packm_cxk() so that it can be then passed into the packm kernel, where it will be needed by packm kernels that perform broadcasts of B, since the idea is that we only want to broadcast when packing micropanels of B and not A. - Added definition for variadic bli_cntx_set_l3_vir_ukrs(), which can be used to set custom virtual level-3 microkernels in the cntx_t, which would typically be done in the bli_cntx_init_*() function defined in the subconfiguration of interest. - Added a "broadcast B" kernel function for use with NP/NR = 12/6, defined in ref_kernels/1m/bli_packm_cxk_bb_ref.c. - Added a gemm, gemmtrsm, and trsm "broadcast B" reference kernels defined in ref_kernels/3/bb. (These kernels have been tested with double real with NP/NR = 12/6.) - Added #ifndef ... 
#endif guards around several macro constants defined in frame/include/bli_kernel_macro_defs.h. - Defined a few "broadcast B" static functions in frame/include/level0/bb for use by "broadcast B"-style packm reference kernels. For now, only the real domain kernels are tested and fully defined. - Output the alignment and offset values for packed blocks of A and B in the testsuite's "BLIS configuration info" section. - Comment updates to various files. - Bumped so_version to 3.0.0.	2019-09-17 17:42:10 -05:00
Devin Matthews	138d403b6b	Use -funsafe-math-optimizations and -ffp-contract=fast for all reference kernels when using gcc or clang. (#331)	2019-08-26 18:11:27 -05:00
Field G. Van Zee	e8c6281f13	Add -march support for specific gcc version ranges. Details: - Added logic to configure that checks the version of the compiler against known version ranges that could cause problems later in the build process. For example, versions of gcc older than 4.9.0 use different -march labels than version 4.9.0 or later ('-march=corei7-avx' vs '-march=sandybridge', respectively). Similarly, before 6.1, compilation on Zen was possible, but you need to start with -march=bdver4 and then disable instruction sets that were discarded during the transition from Excavator to Zen. So now, configure substitutes 'yes'/'no' values into anchors in config.mk.in, which sets various make variables (e.g. GCC_OT_4_9_0), which can be accessed and branched upon by the various configurations' make_defs.mk files when setting their compiler flags. - Updated config/haswell/make_defs.mk to branch on GCC_OT_4_9_0. - Updated config/sandybridge/make_defs.mk to branch on GCC_OT_4_9_0. - Updated config/zen/make_defs.mk to branch on GCC_OT_6_1_0.	2019-08-21 12:38:53 -05:00
Meghana	fdce1a5648	changed gcc version check condition from 'ifeq' to 'if greater or equal' Change-Id: Ie4c461867829bcc113210791bbefb9517e52c226	2019-07-24 15:04:41 +05:30
Meghana	c9486e0c4f	code to detect version of gcc and set flags accordingly for zen2 Change-Id: I29b0311d0000dee1a2533ee29941acf53f9e9f34	2019-07-24 09:45:17 +05:30
Field G. Van Zee	deda4ca8a0	Added test/1m4m driver directory. Details: - Added a new standalone test driver directory named '1m4m' that can build and run performance experiments for BLIS 1m, 4m1a, assembly, OpenBLAS, and the vendor library (MKL). This new driver directory was used to regenerate performance results for the 1m paper. - Added alternate (commented-out) cache blocksizes to config/haswell/bli_cntx_init_haswell.c. These blocksizes tend to work well on a 12-core Intel Xeon E5-2650 v3.	2019-07-22 13:59:05 -05:00
Meghana	dcc0ce12fd	Added a global Makefile for AMD architectures in config/zen folder This Makefile(amd_config.mk) has all the flags that are common to EPYC series Change-Id: Ic02c60a8293ccdd37f0f292e631acd198e6895de	2019-07-22 17:12:01 +05:30
Field G. Van Zee	af17bca26a	Updated haswell MC cache blocksizes. Details: - Updated the default MC cache blocksizes used by the haswell subconfig for both row-preferential (the default) and column-preferential microkernels.	2019-07-19 14:46:23 -05:00
Field G. Van Zee	b5e9bce4dd	Updated -march flags for sandybridge, haswell. Details: - Updated the '-march=corei7-avx' flag in the sandybridge subconfig to '-march=sandybridge' and the '-march=core-avx2' flag in the haswell subconfig to '-march=haswell'. The older flags were used by older versions of gcc and should have been updated to the newer forms a long time ago. (The older flags were clearly working, even though they are no longer documented in the gcc man page.)	2019-07-19 14:42:37 -05:00
Meghana Vankadari	b84cee29f4	Merge "Added compiler flags for vanilla clang" into amd-staging-rome2.0	2019-07-08 02:03:07 -04:00
kdevraje	1f80858abf	This checkin solves the dgemm performance issue jira ticket CPUPL 458, as #else was missed during integration, it was always following else path to get the block sizes Change-Id: I0084b5856c2513ab1066c08c15b5086db6532717	2019-07-05 16:05:11 +05:30
Meghana	c7dd6e6cd2	Added compiler flags for vanilla clang Change-Id: I13c00b4c0d65bbda4c929848fd48b0ab611952ab	2019-07-04 09:32:51 +05:30
Meghana	2acd49b764	fix for test failures using AOCC 2.0 Change-Id: If44eaccc64bbe96bbbe1d32279b1b5773aba08d1	2019-07-01 15:44:07 +05:30
kdevraje	cac127182d	Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis with public repo commit id `565fa3853b`. Change-Id: I68b9824b110cf14df248217a24a6191b3df79d42	2019-06-24 14:05:54 +05:30
Field G. Van Zee	a4e8801d08	Increased MT sup threshold for double to 201. Details: - Fine-tuned the double-precision real MT threshold (which controls whether the sup implementation kicks in for smaller m dimension values) from 180 to 201 for haswell and 180 to 256 for zen. - Updated octave scripts in test/sup/octave to include a seventh column to display performance for m = n = k.	2019-05-31 17:30:51 -05:00
Kiran Devrajegowda	3a45ecb154	Merge "Added back BLIS_ENABLE_ZEN_BLOCK_SIZES macro to zen configuration, this is same as release 1.3. This was added before to improve DGEMM Multithreaded scalability on Naples for when number of threads is greater than 16. By mistake this got deleted in many changes done for 2.0 release, now we are adding this change back., in bli_gemm_front.c - code cleanup" into amd-staging-rome2.0	2019-05-31 06:47:02 -04:00
Kiran Varaganti	b69fb0b74a	Added back BLIS_ENABLE_ZEN_BLOCK_SIZES macro to zen configuration, this is same as release 1.3. This was added before to improve DGEMM Multithreaded scalability on Naples for when number of threads is greater than 16. By mistake this got deleted in many changes done for 2.0 release, now we are adding this change back., in bli_gemm_front.c - code cleanup Change-Id: I9f5d8225254676a99c6f2b09a0825e545206d0fc	2019-05-31 15:14:22 +05:30
kdevraje	3f867c96ca	When running HPL with pure MPI without DGEMM Threading (Single Threaded BLIS), making this macro 1 gives best performance. Change-Id: I24fd0bf99216f315e49f1c74c44c3feaffd7078d	2019-05-31 14:31:49 +05:30
kdevraje	13806ba3b0	This check in has changes w.r.t Copyright information, which is changed to (start year) - 2019 Change-Id: Ide3c8f7172210b8d3538d3c36e88634ab1ba9041	2019-05-27 16:24:43 +05:30
Meghana	ee123f5358	Defined small matrix thresholds for TRSM for various cases for NAPLES and ROME Updated copyright information for kernels/zen/bli_trsm_small.c file Removed separate kernels for zen2 architecture Instead added threshold conditions in zen kernels both for ROME and NAPLES Change-Id: Ifd715731741d649b6ad16b123a86dbd6665d97e5	2019-05-27 15:36:44 +05:30
Field G. Van Zee	cb788ffc89	Increased MT sup threshold for double to 180. Details: - Increased the double-precision real MT threshold (which controls whether the sup implementation kicks in for smaller m dimension values) from 80 to 180, and this change was made for both haswell and zen subconfigurations. This is less about the m dimension in particular and more about facilitating a smoother performance transition when m = n = k.	2019-05-23 13:00:53 -05:00
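
Commits `e186d7141a` and `492a736fab` in the log above turn on two semantic properties of the BLAS i?amax routine: return the lowest index among elements with equal largest absolute value, and return the index of the first NaN if one is encountered. The following is a minimal sketch of those two rules only, not the BLIS reference or zen kernel. The function name `amax_index_sketch` is hypothetical, the vector is assumed to have unit stride, the index is zero-based (Netlib's Fortran interface is one-based), and returning 0 for an empty vector is an arbitrary choice made here.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical illustration of the i?amax rules quoted in the commit
 * messages; this is NOT a BLIS function. Unit stride, zero-based index. */
static size_t amax_index_sketch(size_t n, const double *x)
{
    size_t best_index = 0;
    double best_abs   = -1.0;  /* below any real absolute value */

    for (size_t i = 0; i < n; ++i) {
        /* Rule 2: the first NaN encountered decides the result. */
        if (isnan(x[i]))
            return i;

        /* Rule 1: strict '>' keeps the lowest index among equal values. */
        double a = fabs(x[i]);
        if (a > best_abs) {
            best_abs   = a;
            best_index = i;
        }
    }
    return best_index;  /* 0 when n == 0 (assumption of this sketch) */
}
```

Under these assumptions, the vector {1.0, -3.0, 3.0, 2.0} yields index 1 rather than 2, which is the "first rather than last" behavior whose absence caused the optimized kernels to be disabled.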

367 Commits