amd/blis - blis - Public git mirror


mirror of https://github.com/amd/blis.git synced 2026-05-13 10:35:38 +00:00

Author	SHA1	Message	Date
Nicholai Tukanov	670bc7b60f	Add low-precision POWER10 gemm kernels (#467) Details: - This commit adds a new BLIS sandbox that (1) provides implementations based on low-precision gemm kernels, and (2) extends the BLIS typed API for those new implementations. Currently, these new kernels can only be used for the POWER10 microarchitecture; however, they may provide a template for developing similar kernels for other microarchitectures (even those beyond POWER), as changes would likely be limited to select places in the microkernel and possibly the packing routines. The new low-precision operations that are now supported include: shgemm, sbgemm, i16gemm, i8gemm, i4gemm. For more information, refer to the POWER10.md document that is included in 'sandbox/power10'.	2021-03-05 13:53:43 -06:00
RuQing Xu	b8dcc5bc75	Fixed typed API definition for gemmt (#476) Details: - Fixed incorrect definition and prototype of bli_?gemmt() in frame/3/bli_l3_tapi.c and .h, respectively. gemmt was previously defined identically to gemm, which was wrong because it did not take into account the uplo property of C. - Fixed incorrect API documentation for her2k/syr2k in BLISTypedAPI.md. Specifically, the document erroneously listed only a single transab parameter instead of transa and transb.	2021-03-01 16:58:24 -06:00
Field G. Van Zee	2a815d5b36	Support trsm pre-inversion in 1m, bb, ref kernels. Details: - Expanded support for disabling trsm diagonal pre-inversion to other microkernel types, including the reference microkernel as well as the kernel implementations for 1m and the pre-broadcast B (bb) format used by the power9 subconfig. This builds on the 'haswell' and 'penryn' kernel support added in `7038bba`. Thanks to Bhaskar Nallani for reminding me, in #461 (post-closure), that 1m support was missing from that commit. - Removed cpp branch of ref_kernels/3/bli_trsm_ref.c that contained the omp simd implementation after making a stripped-down copy in 'old'. This code has been disabled for some time and it seemed better suited to rot away out of sight rather than clutter up a file that is already cluttered by the presence of lower and upper versions. - Minor comment update to bli_ind_init().	2021-01-04 18:03:39 -06:00
Field G. Van Zee	c3ed2cbb9f	Enable 1m only if real domain ukr is not reference. Details: - Previously, BLIS would automatically enable use of the 1m method for a given precision if the complex domain microkernel was a reference kernel. This commit adds an additional constraint so that 1m is only enabled if the corresponding real domain microkernel is NOT reference. That is, BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations. Note that this does not prevent 1m from being enabled manually under those conditions; it only means that 1m will not be enabled automatically at initialization-time.	2021-01-04 16:16:32 -06:00
Field G. Van Zee	ed50c94738	Merge branch 'master' into dev	2021-01-04 14:31:44 -06:00
Devin Matthews	ae6ef66ef8	bli_diag_offset_with_trans had wrong return type. Fixes #468.	2020-12-30 17:34:55 -06:00
Field G. Van Zee	0cef09aa92	Consolidated code in level-3 _front() functions. Details: - Reduced a code segment that appears in all of the bli_*_front() functions except for bli_gemm_front(). Previously, the code looked like this (taken from bli_herk_front()): if ( bli_cntx_method( cntx ) == BLIS_NAT ) { bli_obj_set_pack_schema( BLIS_PACKED_ROW_PANELS, &a_local ); bli_obj_set_pack_schema( BLIS_PACKED_COL_PANELS, &ah_local ); } else // if ( bli_cntx_method( cntx ) != BLIS_NAT ) { pack_t schema_a = bli_cntx_schema_a_block( cntx ); pack_t schema_b = bli_cntx_schema_b_panel( cntx ); bli_obj_set_pack_schema( schema_a, &a_local ); bli_obj_set_pack_schema( schema_b, &ah_local ); } This code segment is part of a sort-of-hack that allows us to communicate the pack schemas into the level-3 thread decorator, which needs them so that they can be passed into bli_l3_cntl_create_if(), where the control tree is created. However, the first conditional case above is unnecessary because the second case is fully generalized. That is, even in the native case, the context contains correct, queryable schemas. Thus, these code segments were reduced to something like: pack_t schema_a = bli_cntx_schema_a_block( cntx ); pack_t schema_b = bli_cntx_schema_b_panel( cntx ); bli_obj_set_pack_schema( schema_a, &a_local ); bli_obj_set_pack_schema( schema_b, &ah_local ); There's always a small chance that the seemingly unnecessary code in the first branch case has some special use that is not apparent to me, but the testsuite's default input parameters seem to think this commit will be fine.	2020-12-04 16:40:59 -06:00
Field G. Van Zee	7038bbaa05	Optionally disable trsm diagonal pre-inversion. Details: - Implemented a configure-time option, --disable-trsm-preinversion, that optionally disables the pre-inversion of diagonal elements of the triangular matrix in the trsm operation and instead uses division instructions within the gemmtrsm microkernels. Pre-inversion is enabled by default. When it is disabled, performance may suffer slightly, but numerical robustness should improve for certain pathological cases involving denormal (subnormal) numbers that would otherwise result in overflow in the pre-inverted value. Thanks to Bhaskar Nallani for reporting this issue via #461. - Added preprocessor macro guards to bli_trsm_cntl.c as well as the gemmtrsm microkernels for 'haswell' and 'penryn' kernel sets pursuant to the aforementioned feature. - Added macros to frame/include/bli_x86_asm_macros.h related to division instructions.	2020-12-04 16:08:15 -06:00
Field G. Van Zee	11dfc176a3	Reorganized thread auto-factorization logic. Details: - Reorganized logic of bli_thread_partition_2x2() so that the primary guts were factored out into "fast" and "slow" variants. Then added logic to the "fast" variant that allows for more optimal thread factorizations in some situations where there is at least one factor of 2. - Changed BLIS_THREAD_RATIO_M from 2 to 1 in bli_kernel_macro_defs.h and added comments to that file describing BLIS_THREAD_RATIO_? and BLIS_THREAD_MAX_?R. - In bli_family_zen.h and bli_family_zen2.h, preprocessed out several macros not used in vanilla BLIS and removed the unused macro BLIS_ENABLE_ZEN_BLOCK_SIZES from the former file. - Disabled AMD's small matrix handling entry points in bli_syrk_front.c and bli_trsm_front.c. (These branches of small matrix handling have not been reviewed by vanilla BLIS developers.) - Added commented-out calls to printf() in bli_rntm.c. - Whitespace changes to bli_thread.c.	2020-12-01 19:51:27 +00:00
Field G. Van Zee	64856ea5a6	Auto-reduce (by default) prime numbers of threads. Details: - When requesting multithreaded parallelism by specifying the total number of threads (whether it be via environment variable, globally at runtime, or locally at runtime), reduce the number of threads actually used by one if the original value (a) is prime and (b) exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which is set to 11 by default. If, when specifying the total number of threads (and not the individual ways of parallelism for each loop), prime numbers of threads are desired, this feature may be overridden by defining the BLIS_ENABLE_AUTO_PRIME_NUM_THREADS macro in the bli_family_*.h that corresponds to the configuration family targeted at configure-time. (For now, there is no configure option(s) to control this feature.) Thanks to Jeff Diamond for suggesting this change. - Defined a new function in bli_thread.c, bli_is_prime(), that returns a bool that determines whether an integer is prime. This function is implemented in terms of existing functions in bli_thread.c. - Updated docs/Multithreading.md to document the above feature, along with unrelated minor edits.	2020-11-23 16:54:51 -06:00
Field G. Van Zee	9bb23e6c2a	Added support for systemless build (no pthreads). Details: - Added a configure option, --[enable|disable]-system, which determines whether the modest operating system dependencies in BLIS are included. The most notable example of this on Linux and BSD/OSX is the use of POSIX threads to ensure thread safety for when application-level threads call BLIS. When --disable-system is given, the bli_pthreads implementation is dummied out entirely, allowing the calling code within BLIS to remain unchanged. Why would anyone want to build BLIS like this? The motivating example was submitted via #454 in which a user wanted to build BLIS for a simulator such as gem5 where thread safety may not be a concern (and where the operating system is largely absent anyway). Thanks to Stepan Nassyr for suggesting this feature. - Another, more minor side effect of the --disable-system option is that the implementation of bli_clock() unconditionally returns 0.0 instead of the time elapsed since some fixed point in the past. The reasoning for this is that if the operating system is truly minimal, the system function call upon which bli_clock() would normally be implemented (e.g. clock_gettime()) may not be available. - Refactored preprocess-guarded code in bli_pthread.c and bli_pthread.h to remove redundancies. - Removed old comments and commented #include of "bli_pthread_wrap.h" from bli_system.h. - Documented bli_clock() and bli_clock_min_diff() in BLISObjectAPI.md and BLISTypedAPI.md, with a note that both are non-functional when BLIS is configured with --disable-system.	2020-11-16 15:55:45 -06:00
Field G. Van Zee	88ad841434	Squash-merge 'pr' into 'squash'. (#457) Merged contributions from AMD's AOCL BLIS (#448). Details: - Added support for level-3 operation gemmt, which performs a gemm on only the lower or upper triangle of a square matrix C. For now, only the conventional/large code path will be supported (in vanilla BLIS). This was accomplished by leveraging the existing variant logic for herk. However, some of the infrastructure to support a gemmtsup is included in this commit, including - A bli_gemmtsup() front-end, similar to bli_gemmsup(). - A bli_gemmtsup_ref() reference handler function. - A bli_gemmtsup_int() variant chooser function (with variant calls commented out). - Added support for inducing complex domain gemmt via the 1m method. - Added gemmt APIs to the BLAS and CBLAS compatibility layers. - Added gemmt test module to testsuite. - Added standalone gemmt test driver to 'test' directory. - Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md. - Added a C++ template header (blis.hh) containing a BLAS-inspired wrapper to a set of polymorphic CBLAS-like function wrappers defined in another header (cblas.hh). These two headers are installed if running the 'install' target with INSTALL_HH set to 'yes'. (Also added a set of unit tests that exercise blis.hh, although they are disabled for now because they aren't compatible with out-of-tree builds.) These files now live in the 'vendor' top-level directory. - Various updates to 'zen' and 'zen2' subconfigurations, particularly within the context initialization functions. - Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and various minor updates to dotv and scalv kernels. Also added various sup kernels contributed by AMD to kernels/zen/3. However, these kernels are (for now) not yet used, in part because they caused AppVeyor clang failures, and also because I have not found time to review and vet them.
- Output the python found during configure into the definition of PYTHON in build/config.mk (via build/config.mk.in). - Added early-return checks (A, B, or C with zero dimension; alpha = 0) to bli_gemm_front.c. - Implemented explicit beta = 0 handling for the sgemm ukernel in bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent bug surfaced because the gemmt module verifies its computation using gemm with its beta parameter set to zero, which, on a cortexa15 system caused the gemm kernel code to unconditionally multiply the uninitialized C data by beta. The C matrix likely contained non-numeric values such as NaN, which then would have resulted in a false failure. - Fixed a bug whereby the implementation for bli_herk_determine_kc(), in bli_l3_blocksize.c, was inadvertently being defined in terms of helper functions meant for trmm. This bug was probably harmless since the trmm code should have also done the right thing for herk. - Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in kernels/zen/3/bli_gemm_small.c since those macros are not used in vanilla BLIS. - Added cpp guard to definition of bli_mem_clear() in bli_mem.h to accommodate C++'s stricter type checking. - Added cpp guard to test/*.c drivers that facilitate compilation on Windows systems. - Various whitespace changes.	2020-11-14 09:39:48 -06:00
Field G. Van Zee	2a0682f8e5	Implemented runtime subconfig selection (#451). Details: - Implemented support for the user manually overriding the automatic subconfiguration selection that happens at runtime. This override can be requested by setting the BLIS_ARCH_TYPE environment variable. The variable must be set to the arch_t id (as enumerated in bli_type_defs.h) corresponding to the desired subconfiguration. If a value outside this enumerated range is given, BLIS will abort with an error message. If the value is in the valid range but corresponds to a subconfiguration that was not activated at configure-time/compile-time, BLIS will abort with a (different) error message. Thanks to decandia50 for suggesting this feature via issue #451. - Defined a new function bli_gks_lookup_id to return the address of an internal data structure within the gks. If this address is NULL, then it indicates that the subconfig corresponding to the arch_t id passed into the function was not compiled into BLIS. This function is used in the second of the two abort scenarios described above. - Defined the enumerated error code BLIS_UNINITIALIZED_GKS_CNTX, which is returned for the latter of the two abort scenarios mentioned above, along with a corresponding error message and a function to perform the error check. - Added cpp macro branching to bli_env.c to support compilation of the auto-detect.x executable during configure-time. This cpp branch is similar to the cpp code already found in bli_arch.c and bli_cpuid.c. - Cleaned up the auto_detect() function to facilitate easier maintenance going forward. Also added a convenient debug switch that outputs the compilation command for the auto-detect.x executable and exits.	2020-10-18 18:04:03 -05:00
Nicholai Tukanov	2d8ec164e7	Add POWER10 support to BLIS (#450)	2020-09-29 16:52:18 -05:00
Field G. Van Zee	645d771a14	Minor packm kernel type cleanup (void* -> ctype*). Details: - Changed all void* function arguments in reference packm kernels to those of the native type (ctype*). These pointers no longer need to be void* and are better represented by their native types anyway. (See below for details.) Updated knl packm kernels accordingly. - In the definition of the PACKM_KER_PROT prototype macro template in frame/1m/bli_l1m_ker_prot.h, changed the pointer types for kappa, a, and p from void* to ctype*. They were originally void* because these function signatures had to share the same type so they could all be stored in a single array of that shared type, from which they were queried and called by packm_cxk(). This is no longer how the function pointers are stored, and so it no longer makes sense to force the caller of packm kernels to use void*, only so that the implementor of the packm kernels can typecast back to the native datatype within the kernel definition. This change has no effect internally within BLIS because currently all packm kernels are called after querying the function addresses from the context and then typecasting to the appropriate function pointer type, which is based upon type-specific function pointers like float* and double*. - Removed a comment in frame/1m/bli_l1m_ft_ker.h that was outdated and misleading due to changes to the handling of packm kernels since moving them into the context.	2020-09-12 15:31:56 -05:00
Devin Matthews	a8efb72074	Merge pull request #434 from flame/intel-zdot Add an option to change the complex return type.	2020-09-07 16:18:19 -05:00
Field G. Van Zee	97e87f2c9f	Whitespace/comment updates to #434 PR.	2020-09-07 15:56:42 -05:00
Devin Matthews	c253d14a72	Also handle Intel-style complex return in CBLAS interface.	2020-08-07 09:39:04 -05:00
Devin Matthews	b1b5870dd3	Add checks so that s390x is detected as 64-bit.	2020-08-06 17:34:20 -05:00
Devin Matthews	5b5278ff49	Use #ifdef instead of #if as macro may be undefined.	2020-08-06 14:19:37 -05:00
Devin Matthews	7fdc0fc893	Add an option to change the complex return type. ifort apparently does not return complex numbers in registers as in C/C++ (or gfortran), but instead creates a "hidden" first parameter for the return value. The option --complex-return=gnu|intel has been added, as well as a guess based on a provided FC if not specified (otherwise default to gnu). This option affects the signatures of cdotc, cdotu, zdotc, and zdotu, and a single library cannot be used with both GNU and Intel Fortran compilers. Fixes #433.	2020-08-06 14:09:23 -05:00
Field G. Van Zee	00e14cb6d8	Replaced use of bool_t type with C99 bool. Details: - Textually replaced nearly all non-comment instances of bool_t with the C99 bool type. A few remaining instances, such as those in the files bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being used not for boolean purposes but to index into an array. - This commit constitutes the third phase of a transition toward using C99's bool instead of bool_t, which was raised in issue #420. The first phase, which cleaned up various typecasts in preparation for using bool as the basis for bool_t (instead of gint_t), was implemented by commit `a69a4d7`. The second phase, which redefined the bool_t typedef in terms of bool (from gint_t), was implemented by commit `2c554c2`.	2020-07-29 14:24:34 -05:00
Field G. Van Zee	2c554c2fce	Redefined bool_t typedef in terms of C99 bool. Details: - Changed the typedef that defines bool_t from: typedef gint_t bool_t; where gint_t is a signed integer that forms the basis of most other integers in BLIS, to: typedef bool bool_t; - Changed BLIS's TRUE and FALSE macro definitions from being in terms of integer literals: #define TRUE 1 #define FALSE 0 to being in terms of C99 boolean constants: #define TRUE true #define FALSE false which are provided by stdbool.h. - This commit constitutes the second phase of a transition toward using C99's bool instead of bool_t, which will address issue #420. The first phase, which cleaned up various typecasts in preparation for using bool as the basis for bool_t (instead of gint_t), was implemented by commit `a69a4d7`.	2020-07-24 15:57:19 -05:00
Devin Matthews	b4f47f7540	Add BLIS_EXPORT_BLIS to bli_abort. (#429) Fixes #428.	2020-07-24 13:56:13 -05:00
Field G. Van Zee	a69a4d7e2f	Cleaned up bool_t usage and various typecasts. Details: - Fixed various typecasts in frame/base/bli_cntx.h frame/base/bli_mbool.h frame/base/bli_rntm.h frame/include/bli_misc_macro_defs.h frame/include/bli_obj_macro_defs.h frame/include/bli_param_macro_defs.h that were missing or being done improperly/incompletely. For example, many return values were being typecast as (bool_t)x && y rather than (bool_t)(x && y) Thankfully, none of these deficiencies had manifested as actual bugs at the time of this commit. - Changed the return type of bli_env_get_var() from dim_t to gint_t. This reflects the fact that bli_env_get_var() needs to be able to return a signed integer, and even though dim_t is currently defined as a signed integer, it does not intuitively appear to necessarily be signed by inspection (i.e., an integer named "dim_t" for matrix "dimension"). Also, updated use of bli_env_get_var() within bli_pack.c to reflect the changed return type. - Redefined type of thrcomm_t.barrier_sense field from bool_t to gint_t and added comments to the bli_thrcomm_*.h files that will explain a planned replacement of bool_t with C99's bool type. - Note: These changes are being made to facilitate the substitution of 'bool' for 'bool_t', which will eliminate the namespace conflict with arm_sve.h as reported in issue #420. This commit implements the first phase of that transition. Thanks to RuQing Xu for reporting this issue. - CREDITS file update.	2020-07-22 16:13:09 -05:00
Field G. Van Zee	a6437a5c11	Replaced broken ref99 sandbox w/ simpler version. Details: - The 'ref99' sandbox was broken by multiple refactorings and internal API changes over the last two years. Rather than try to fix it, I've replaced it with a much simpler version based on var2 of gemmsup. Why not fix the previous implementation? It occurred to me that the old implementation was trying to be a lightly simplified duplication of what exists in the framework. Duplication aside, this sandbox would have worked fine if it had been completely independent of the framework code. The problem was that it was only partially independent, with many function calls calling a function in BLIS rather than a duplicated/simplified version within the sandbox. (And the reason I didn't make it fully independent to begin with was that it seemed unnecessarily duplicative at the time.) Maintaining two versions of the same implementation is problematic for obvious reasons, especially when it wasn't even done properly to begin with. This explains the reimplementation in this commit. The only catch is that the newer implementation is single-threaded only and does not perform any packing on either input matrix (A or B). Basically, it's only meant to be a simple placeholder that shows how you could plug in your own implementation. Thanks to Francisco Igual for reporting this brokenness. - Updated the three reference gemmsup kernels (defined in ref_kernels/3/bli_gemmsup_ref.c) so that they properly handle conjugation of conja and/or conjb. The general storage kernel, which is currently identical to the column-storage kernel, is used in the new ref99 sandbox to provide basic support for all datatypes (including scomplex and dcomplex). - Minor updates to docs/Sandboxes.md, including adding the threading and packing limitations to the Caveats section. - Fixed a comment typo in bli_l3_sup_var1n2m.c (upon which the new sandbox implementation is based).	2020-07-20 19:21:07 -05:00
Field G. Van Zee	72f6ed0637	Declare/define static functions via BLIS_INLINE. Details: - Updated all static function definitions to use the cpp macro BLIS_INLINE instead of the static keyword. This allows blis.h to use a different keyword (inline) to define these functions when compiling with C++, which might otherwise trigger "defined but not used" warning messages. Thanks to Giorgos Margaritis for reporting this issue and Devin Matthews for suggesting the fix. - Updated the following files, which are used by configure's hardware auto-detection facility, to unconditionally #define BLIS_INLINE to the static keyword (since we know BLIS will be compiled with C, not C++): build/detect/config/config_detect.c frame/base/bli_arch.c frame/base/bli_cpuid.c - CREDITS file update.	2020-07-03 17:55:54 -05:00
Field G. Van Zee	6af59b7057	Fixed disabled edge case optimization in gemmsup. Details: - Fixed an inadvertently disabled edge case optimization in the two gemmsup variants in bli_l3_sup_var1n2m.c. Background: These edge case optimizations allow the last millikernel operation in the jr loop to be executed with an inflated register blocksize if it is the last (or only) iteration. For example, suppose mr=6 and nr=8 and the gemmsup problem is m=8, n=100, k=100. (In this case, the panel-block variant (var1n) is executed, which places the jr loop in the m dimension.) In principle, this problem could be executed as two millikernels: one with dimensions 6x100x100, and one as 2x100x100. However, with the support for inflated blocksizes in the kernel, the entire 8x100x100 problem can be passed to the millikernel function, which will then execute it more favorably as two 4x100x100 millikernel sub-calls. Now, this optimization is disabled under certain circumstances, such as when multithreading. Previously, the is_mt predicate was being set incorrectly such that it was non-zero even when running single-threaded. - Upon fixing the is_mt issue above, another bit of code needed to be moved so that the result of the optimization could have an impact on the assignment of loop bounds ranges to threads.	2020-07-01 14:54:23 -05:00
Field G. Van Zee	b5b604e106	Ensure random objects' 1-norms are non-zero. Details: - Fixed an innocuous bug that manifested when running the testsuite on extremely small matrices with randomization via the "powers of 2 in narrow precision range" option enabled. When the randomization function emits a perfect 0.0 to fill a 1x1 matrix, the testsuite will then compute 0.0/0.0 during the normalization process, which leads to NaN residuals. The solution entails smarter implementations of randv, randnv, randm, and randnm, each of which will compute the 1-norm of the vector or matrix in question. If the object has a 1-norm of 0.0, the object is re-randomized until the 1-norm is not 0.0. Thanks to Kiran Varaganti for reporting this issue (#413). - Updated the implementation of randm_unb_var1() so that it loops over a call to the randv_unb_var1() implementation directly rather than calling it indirectly via randv(). This was done to avoid the overhead of multiple calls to norm1v() when randomizing the rows/columns of a matrix. - Updated comments.	2020-06-17 16:42:24 -05:00
Field G. Van Zee	787adad73b	Defined netlib equivalent of xerbla_array(). Details: - Added a function definition for xerbla_array_(), which largely mirrors its netlib implementation. Thanks to Isuru Fernando for suggesting the addition of this function.	2020-05-08 16:18:20 -05:00
Guodong Xu	f032d5d4a6	New kernel set for Arm SVE using assembly (#396) This adds two kernels for Arm SVE vector extensions. 1. a gemm kernel for double at sizes 8x8. 2. a packm kernel for double at dimension 8xk. To achieve best performance, variable length agnostic programming is not used. Vector length (VL) of 256 bits is mandated in both kernels. Kernels to support other VLs can be added later. "SVE is a vector extension for AArch64 execution mode for the A64 instruction set of the Armv8 architecture. Unlike other SIMD architectures, SVE does not define the size of the vector registers, but constrains into a range of possible values, from a minimum of 128 bits up to a maximum of 2048 in 128-bit wide units. Therefore, any CPU vendor can implement the extension by choosing the vector register size that better suits the workloads the CPU is targeting. Instructions are provided specifically to query an implementation for its register size, to guarantee that the applications can run on different implementations of the ISA without the need to recompile the code." [1] [1] https://developer.arm.com/solutions/hpc/resources/hpc-white-papers/arm-scalable-vector-extensions-and-application-to-machine-learning Signed-off-by: Guodong Xu <guodong.xu@linaro.org>	2020-04-29 12:08:46 -05:00
Field G. Van Zee	477ce91c52	Moved #include "cpuid.h" to bli_cpuid.c. Details: - Relocated the #include "cpuid.h" directive from bli_cpuid.h to bli_cpuid.c. This was done because cpuid.h (which is pulled into the post-build blis.h developer header) doesn't protect its definitions with a preprocessor guard of the form: #ifndef FOOBAR_H #define FOOBAR_H // header contents. #endif and as a result, applications (previously) could not #include both blis.h and cpuid.h (since the former was already including the latter). Thanks to Bhaskar Nallani for raising this issue via #393 and to Devin Matthews for suggesting this fix. - CREDITS file update.	2020-04-22 14:26:49 -05:00
Field G. Van Zee	976902406b	Disable packing by default in expert rntm_t init. Details: - Changed the behavior of bli_rntm_init() as well as the static initializer, BLIS_RNTM_INITIALIZER, so that user-initialized rntm_t objects by default specify the disabling of packing for A and B. Packing of A/B was already disabled by default when calling non-expert APIs (and enabled only when the user set environment variables BLIS_PACK_A or BLIS_PACK_B). With this commit, the default behavior of using user-initialized rntm_t objects with expert APIs comes into line with the default behavior of non-expert APIs--that is, they now both lead to the avoidance of packing in the sup code path. (Note: The conventional code path is unaffected by the environment variables BLIS_PACK_A/BLIS_PACK_B and/or the disabling of packing in a rntm_t object when calling an expert API.) This addresses issue #392. Thanks to Kiran Varaganti for bringing this inconsistency to our attention. - The above change was accomplished by changing the definitions of static functions bli_rntm_clear_pack_a() and bli_rntm_clear_pack_b() in bli_rntm.h, which are both for internal use only.	2020-04-17 15:11:10 -05:00
Field G. Van Zee	2cb604ba47	Rename more bli_thread_obarrier(), _obroadcast(). Details: - Renamed instances of bli_thread_obarrier() and bli_thread_obroadcast() that were made in the supmt-specific code committed to the 'amd' branch, which has now been merged with 'master'. Prior to the merge, 'master' received commit `c01d249`, which applied these renamings to the existing, non-sup codebase.	2020-04-06 16:42:14 -05:00
Field G. Van Zee	2e3b3782cf	Merge branch 'master' into amd	2020-04-06 14:55:35 -05:00
Field G. Van Zee	9f3a8d4d85	Added missing return to bli_thread_partition_2x2(). Details: - Added a missing return statement to the body of an early case handling branch in bli_thread_partition_2x2(). This bug only affected cases where n_threads < 4, and even then, the code meant to handle cases where n_threads >= 4 executes and does the right thing, albeit using more CPU cycles than needed. Nonetheless, thanks to Kiran Varaganti for reporting this bug via issue #377. - Whitespace changes to bli_thread.c (spaces -> tabs).	2020-03-14 17:48:43 -05:00
Field G. Van Zee	c01d249d7c	Renamed bli_thread_obarrier(), _obroadcast(). Details: - Renamed two bli_thread_*() APIs: bli_thread_obarrier() -> bli_thread_barrier() bli_thread_obroadcast() -> bli_thread_broadcast() The 'o' was a leftover from when thrcomm_t objects tracked both "inner" and "outer" communicators. They have long since been simplified to only support the latter, and thus the 'o' is superfluous.	2020-02-25 14:50:53 -06:00
Field G. Van Zee	9e5f7296cc	Skip building thrinfo_t tree when mt is disabled. Details: - Return early from bli_thrinfo_sup_grow() if the thrinfo_t object address is equal to either &BLIS_GEMM_SINGLE_THREADED or &BLIS_PACKM_SINGLE_THREADED. - Added preprocessor logic to bli_l3_sup_thread_decorator() in bli_l3_sup_decor_single.c that (by default) disables code that creates and frees the thrinfo_t tree and instead passes &BLIS_GEMM_SINGLE_THREADED as the thrinfo_t pointer into the sup implementation. - The net effect of the above changes is that a small amount of thrinfo_t overhead is avoided when running small/skinny dgemm problems when BLIS is compiled with multithreading disabled.	2020-02-18 15:16:03 -06:00
Field G. Van Zee	90081e6a64	Fixed bug(s) in mt sup when single-threaded. Details: - Fixed a syntax bug in bli_l3_sup_decor_single.c as a result of changing function interface for the thread entry point function (of type l3supint_t). - Unfortunately, fixing the interface was not enough, as it caused a memory leak in the sba at bli_finalize() time. It turns out that, due to the new multithreading-capable variant code using thrinfo_t objects--specifically, their calling of bli_thrinfo_grow()--we have to pass in a real thrinfo_t object rather than the global objects &BLIS_PACKM_SINGLE_THREADED or &BLIS_GEMM_SINGLE_THREADED. Thus, I inserted the appropriate logic from the OpenMP and pthreads versions so that single-threaded execution would work as intended with the newly upgraded variants.	2020-02-17 14:57:25 -06:00
Field G. Van Zee	c0558fde45	Support multithreading within the sup framework. Details: - Added multithreading support to the sup framework (via either OpenMP or pthreads). Both variants 1n and 2m now have the appropriate threading infrastructure, including data partitioning logic, to parallelize computation. This support handles all four combinations of packing on matrices A and B (neither, A only, B only, or both). This implementation tries to be a little smarter when automatic threading is requested (e.g. via BLIS_NUM_THREADS) in that it will recalculate the factorization in units of micropanels (rather than using the raw dimensions) in bli_l3_sup_int.c, when the final problem shape is known and after threads have already been spawned. - Implemented bli_?packm_sup_var2(), which packs to conventional row- or column-stored matrices. (This is used for the rrc and crc storage cases.) Previously, copym was used, but that would no longer suffice because it could not be parallelized. - Minor reorganization of packing-related sup functions. Specifically, bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]() instead of from the variant functions. This has the effect of making the variant functions more readable. - Added additional bli_thrinfo_set_() static functions to bli_thrinfo.h and inserted usage of these functions within bli_thrinfo_init(), which previously was accessing thrinfo_t fields via the -> operator. - Renamed bli_partition_2x2() to bli_thread_partition_2x2(). - Added an auto_factor field to the rntm_t struct in order to track whether automatic thread factorization was originally requested. - Added new test drivers in test/supmt that perform multithreaded sup tests, as well as appropriate octave/matlab scripts to plot the resulting output files. 
- Added additional language to docs/Multithreading.md to make it clear that specifying any BLIS_*_NT variable, even if it is set to 1, will be considered manual specification for the purposes of determining whether to auto-factorize via BLIS_NUM_THREADS. - Minor comment updates.	2020-02-17 14:08:08 -06:00
Field G. Van Zee	d7a7679182	Fixed int-to-packbuf_t conversion error (C++ only). Details: - Fixed an error that manifests only when using C++ (specifically, modern versions of g++) to compile drivers in 'test' (and likely most other application code that #includes blis.h). Thanks to Ajay Panyala for reporting this issue (#374).	2020-02-07 17:37:03 -06:00
Dave Love	f391b3e2e7	Fix parsing in vpu_count on workstation SKX (#351 ) * Fix parsing in vpu_count on workstation SKX * Document Skylake-X as Haswell for single FMA * Update vpu_count for Skylake and Cascade Lake models * Support printing the configuration selected, controlled by the environment Intended particularly for diagnosing mis-selection of SKX through unknown, or incorrect, number of VPUs. * Move bli_log outside the cpp condition, and use it where intended * Add Fixme comment (Skylake D) * Mostly superficial edits to commits towards #351. Details: - Moved architecture/sub-config logging-related code from bli_cpuid.c to bli_arch.c, tweaked names, and added more set/get layering. - Tweaked log messages output from bli_cpuid_is_skx() in bli_cpuid.c. - Content, whitespace changes to new bullet in HardwareSupport.md that relates to single-VPU Skylake-Xs. * Fix comment typos Co-authored-by: Field G. Van Zee <field@cs.utexas.edu>	2020-01-06 14:15:48 -06:00
Field G. Van Zee	5271107378	Fixed bugs in cblas_sdsdot(), sdsdot_(). Details: - Fixed a bug in sdsdot_sub() that redundantly added the "alpha" scalar, named 'sb'. This value was already being added by the underlying sdsdot_() function. Thus, we no longer add 'sb' within sdsdot_sub(). Thanks to Simon Lukas Märtens for reporting this bug via #367. - Fixed a second bug in order of typecasting intermediate products in sdsdot_(). Previously, the "alpha" scalar was being added after the "outer" typecast to float. However, the operation is supposed to first add the dot product to the (promoted) scalar and THEN downcast the sum to float. Thanks to Devin Matthews for catching this bug.	2019-12-16 16:30:26 -06:00
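The ordering described in the second fix above (promote the scalar, add the double-precision dot product, then downcast once) can be illustrated with a minimal sketch. The function name `sdsdot_sketch` and its signature are hypothetical and are not the BLIS or BLAS interface; unit strides are assumed for brevity.

```c
#include <stddef.h>

/* Hypothetical helper, not the BLIS implementation: computes
   sb + sum(x[i] * y[i]) with double-precision accumulation and a
   single downcast of the final sum to float. Unit strides assumed. */
float sdsdot_sketch(size_t n, float sb, const float *x, const float *y)
{
    /* Promote the scalar first so that it joins the sum in double. */
    double acc = (double)sb;

    for (size_t i = 0; i < n; ++i)
        acc += (double)x[i] * (double)y[i];

    /* Downcast only once, after the scalar has been included. */
    return (float)acc;
}
```

Adding `sb` after the cast to float would instead round the dot product to single precision before the scalar is included, which is the ordering the commit message describes as incorrect.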
Field G. Van Zee	fe2560a4b1	Annotated missing thread-related symbols for export. Details: - Added BLIS_EXPORT_BLIS annotation to function prototypes for bli_thrcomm_bcast(), bli_thrcomm_barrier(), and bli_thread_range_sub() so that these functions are exported to shared libraries by default. This (hopefully) fixes issue #366. Thanks to Kyungmin Lee for reporting this bug. - CREDITS file update.	2019-12-06 17:12:44 -06:00
Field G. Van Zee	efa61a6c8b	Added missing bli_l3_sup_thread_decorator() symbol. Details: - Defined dummy versions of bli_l3_sup_thread_decorator() for OpenMP and pthreads so that those builds don't fail when performing shared library linking (especially for Windows DLLs via AppVeyor). For now, these dummy implementations of bli_l3_sup_thread_decorator() are merely carbon-copies of the implementation provided for single-threaded execution (i.e., the one found in bli_l3_sup_decor_single.c). Thus, an OpenMP or pthreads build will be able to use the gemmsup code (including the new selective packing functionality), as it did before `39fa7136`, even though it will not actually employ any multithreaded parallelism.	2019-11-29 16:17:04 -06:00
Field G. Van Zee	39fa7136f4	Added support for selective packing to gemmsup. Details: - Implemented optional packing for A or B (or both) within the sup framework (which currently only supports gemm). The request for packing either matrix A or matrix B can be made via setting environment variables BLIS_PACK_A or BLIS_PACK_B (to any non-zero value; if set, zero means "disable packing"). It can also be made globally at runtime via bli_pack_set_pack_a() and bli_pack_set_pack_b() or with individual rntm_t objects via bli_rntm_set_pack_a() and bli_rntm_set_pack_b() if using the expert interface of either the BLIS typed or object APIs. (If using the BLAS API, environment variables are the only way to communicate the packing request.) - One caveat (for now) with the current implementation of selective packing is that any blocksize extension registered in the _cntx_init function (such as is currently used by haswell and zen subconfigs) will be ignored if the affected matrix is packed. The reason is simply that I didn't get around to implementing the necessary logic to pack a larger edge-case micropanel, though this is entirely possible and should be done in the future. - Spun off the variant-choosing portion of bli_gemmsup_ref() into bli_gemmsup_int(), in bli_l3_sup_int.c. - Added new files, bli_l3_sup_packm_a.c, bli_l3_sup_packm_b.c, along with corresponding headers, in which higher-level packm-related functions are defined for use within the sup framework. The actual packm variant code resides in bli_l3_sup_packm_var.c. - Pass the following new parameters into var1n and var2m: packa, packb bool_t's, pointer to a rntm_t, pointer to a cntl_t (which is for now always NULL), and pointer to a thrinfo_t* (which for now is the address of the global single-threaded packm thread control node). - Added panel strides ps_a and ps_b to the auxinfo_t structure so that the millikernel can query the panel stride of the packed matrix and step through it accordingly. 
If the matrix isn't packed, the panel stride of interest for the given millikernel will be set to the appropriate value so that the mkernel may step through the unpacked matrix as it normally would. - Modified the rv_6x8m and rv_6x8n millikernels to read the appropriate panel strides (ps_a and ps_b, respectively) instead of computing them on the fly. - Spun off the environment variable getting and setting functions into a new file, bli_env.c (with a corresponding prototype header). These functions are now used by the threading infrastructure (e.g. BLIS_NUM_THREADS, BLIS_JC_NT, etc.) as well as the selective packing infrastructure (e.g. BLIS_PACK_A, BLIS_PACK_B). - Added a static initializer for mem_t objects, BLIS_MEM_INITIALIZER. - Added a static initializer for pblk_t objects, BLIS_PBLK_INITIALIZER, for use within the definition of BLIS_MEM_INITIALIZER. - Moved the global_rntm object to bli_rntm.c and extern it where needed. This means that the function bli_thread_init_rntm() was renamed to bli_rntm_init_from_global() and relocated accordingly. - Added a new bli_pack.c function, which serves as the home for functions that manage the pack_a and pack_b fields of the global rntm_t, including from environment variables, just as we have functions to manage the threading fields of the global rntm_t in bli_thread.c. - Reorganized naming for files in frame/thread, which mostly involved spinning off the bli_l3_thread_decorator() functions into their own files. This change makes more sense when considering the further addition of bli_l3_sup_thread_decorator() functions (for now limited only to the single-threaded form found in the _single.c file). - Explicitly initialize the reference sup handlers in both bli_cntx_init_haswell.c and bli_cntx_init_zen.c so that it's more obvious how to customize to a different handler, if desired. - Removed various snippets of disabled code. - Various comment updates.	2019-11-29 15:27:07 -06:00
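The environment-variable convention stated in this commit (any non-zero value requests packing; zero, if set, disables it) can be sketched as follows. Both function names are hypothetical and do not reproduce the functions in bli_env.c; treating a non-numeric value as zero is an assumption of this sketch, not something the commit message specifies.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: interpret the text of a packing variable.
   NULL models an unset variable, in which case the caller's fallback
   is used. Assumption: non-numeric text parses as 0 (disabled). */
bool pack_flag_from_string(const char *value, bool fallback)
{
    if (value == NULL)
        return fallback;

    /* Any non-zero integer enables packing; zero disables it. */
    return strtol(value, NULL, 10) != 0;
}

/* Hypothetical wrapper that reads the variable from the environment,
   e.g. pack_flag_from_env("BLIS_PACK_A", false). */
bool pack_flag_from_env(const char *name, bool fallback)
{
    return pack_flag_from_string(getenv(name), fallback);
}
```

Separating the string parsing from the `getenv()` call keeps the parsing rule testable without modifying the process environment.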
Field G. Van Zee	881b05ecd4	Fixed blastest failure for 'generic' subconfig. Details: - Fixed a subtle and complicated bug that only manifested via the BLAS test drivers in the generic subconfiguration, and possibly any other subconfiguration that did not register complex-domain gemm ukernels, or registered ONLY real-domain ukernels as row-preferential. This is a long story, but it boils down to an exception to the "transpose the operation to bring storage of C into agreement with ukernel pref" optimization in bli_hemm_front.c and bli_symm_front.c sabotaging the proper functioning of the 1m method, but only when the imaginary component of beta is zero. See the comments in issue #342 for more details. Thanks to Dave Love for identifying the commit in which this bug was introduced, and other feedback related to this bug.	2019-11-21 16:34:27 -06:00
Field G. Van Zee	0c7165fb01	Fixed obscure bug in bli_acquire_mpart_[mn]dim(). Details: - Fixed a bug in bli_acquire_mpart_mdim(), bli_acquire_mpart_ndim(), and bli_acquire_mpart_mndim() that allowed the use of a blocksize b that is too large given the current row/column index (i.e., the i/j argument) and the size of the dimension being partitioned (i.e., the m/n argument). This bug only affected backwards partitioning/motion through the dimension and was the result of a misplaced conditional check-and-redirect to the backwards code path. It should be noted that this bug was discovered not because it manifested the way it could (thanks to the callers in BLIS making sure to always pass in the "correct" blocksize b), but could have manifested if the functions were used by 3rd party callers. Thanks to Minh Quan Ho for reporting the bug via issue #363.	2019-11-14 16:48:14 -06:00
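The kind of check described above, limiting the blocksize b to what remains of the dimension before taking either the forward or the backward path, can be sketched as below. The type and function names are hypothetical, and the interpretation of i in the backward case (the number of elements already consumed from the far end) is an assumption of this sketch rather than a statement about the BLIS source.

```c
#include <stddef.h>

/* Hypothetical result type: the half-open range [offset, offset+length). */
typedef struct { size_t offset; size_t length; } range_sketch_t;

/* Hypothetical helper: select the next block of at most b elements from a
   dimension of size m, given that i elements were already consumed.
   Assumption: when moving backward, i counts from the far end. */
range_sketch_t next_block_sketch(size_t m, size_t i, size_t b, int backward)
{
    range_sketch_t r;

    /* Guard against an index past the end of the dimension. */
    if (i > m)
        i = m;

    /* Clamp the blocksize BEFORE branching on direction, so that both
       code paths see a value no larger than the remaining extent. */
    if (b > m - i)
        b = m - i;

    r.length = b;
    r.offset = backward ? (m - i - b) : i;
    return r;
}
```

With m = 10, i = 8, and b = 4, both directions yield a block of length 2 rather than 4, which is the behavior the clamp is meant to guarantee.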
Field G. Van Zee	bdc7ee3394	Various fixes to support packing duplication in B. Details: - Added cpp macros to trmm and trmm3 front-ends to optionally force those operations to be cast so the structured matrix is on the left. symm and hemm already had such macros, but these too were renamed so that the macros were individual to the operation. We now have four such macros: #define BLIS_DISABLE_HEMM_RIGHT #define BLIS_DISABLE_SYMM_RIGHT #define BLIS_DISABLE_TRMM_RIGHT #define BLIS_DISABLE_TRMM3_RIGHT Also, updated the comments in the symm and hemm front-ends related to the first two macro guards, and added corresponding comments to the trmm and trmm3 front-ends for the latter two guards. (They all functionally do the same thing, just for their specific operations.) Thanks to Jeff Hammond for reporting the bugs that led me to this change (via #359). - Updated config/old/haswellbb subconfiguration (used to debug issues related to duplicating B during packing) to register: a packing kernel for single-precision real; gemmbb ukernels for s, c, and z; trsmbb ukernels for s, c, and z; gemmtrsmbb virtual ukernels for s, c, and z; and to use non-default cache and register blocksizes for s, c, and z datatypes. Also declared prototypes for all of the gemmbb, trsmbb, and gemmtrsmbb ukernel functions within the bli_cntx_init_haswellbb() function. This should, once applied to the power9 configuration, fix the remaining issues in #359. - Defined bli_spackm_6xk_bb4_ref(), which packs single reals with a duplication factor of 4. This function is defined in the same file as bli_dpackm_6xk_bb2_ref() (bli_packm_cxk_bb_ref.c).	2019-11-11 15:47:17 -06:00
Jérôme Duval	f377bb4485	Add Haiku to the known OS list (#361 )	2019-11-07 16:39:29 -06:00

799 Commits