amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-11 17:50:00 +00:00

Author	SHA1	Message	Date
Field G. Van Zee	0cef09aa92	Consolidated code in level-3 _front() functions. Details: - Reduced a code segment that appears in all of the bli_*_front() functions except for bli_gemm_front(). Previously, the code looked like this (taken from bli_herk_front()): if ( bli_cntx_method( cntx ) == BLIS_NAT ) { bli_obj_set_pack_schema( BLIS_PACKED_ROW_PANELS, &a_local ); bli_obj_set_pack_schema( BLIS_PACKED_COL_PANELS, &ah_local ); } else // if ( bli_cntx_method( cntx ) != BLIS_NAT ) { pack_t schema_a = bli_cntx_schema_a_block( cntx ); pack_t schema_b = bli_cntx_schema_b_panel( cntx ); bli_obj_set_pack_schema( schema_a, &a_local ); bli_obj_set_pack_schema( schema_b, &ah_local ); } This code segment is part of a sort-of-hack that allows us to communicate the pack schemas into the level-3 thread decorator, which needs them so that they can be passed into bli_l3_cntl_create_if(), where the control tree is created. However, the first conditional case above is unnecessary because the second case is fully generalized. That is, even in the native case, the context contains correct, queryable schemas. Thus, these code segments were reduced to something like: pack_t schema_a = bli_cntx_schema_a_block( cntx ); pack_t schema_b = bli_cntx_schema_b_panel( cntx ); bli_obj_set_pack_schema( schema_a, &a_local ); bli_obj_set_pack_schema( schema_b, &ah_local ); There's always a small chance that the seemingly unnecessary code in the first branch case has some special use that is not apparent to me, but the testsuite's default input parameters seem to think this commit will be fine.	2020-12-04 16:40:59 -06:00
Field G. Van Zee	7038bbaa05	Optionally disable trsm diagonal pre-inversion. Details: - Implemented a configure-time option, --disable-trsm-preinversion, that optionally disables the pre-inversion of diagonal elements of the triangular matrix in the trsm operation and instead uses division instructions within the gemmtrsm microkernels. Pre-inversion is enabled by default. When it is disabled, performance may suffer slightly, but numerical robustness should improve for certain pathological cases involving denormal (subnormal) numbers that would otherwise result in overflow in the pre-inverted value. Thanks to Bhaskar Nallani for reporting this issue via #461. - Added preprocessor macro guards to bli_trsm_cntl.c as well as the gemmtrsm microkernels for 'haswell' and 'penryn' kernel sets pursuant to the aforementioned feature. - Added macros to frame/include/bli_x86_asm_macros.h related to division instructions.	2020-12-04 16:08:15 -06:00
Field G. Van Zee	11dfc176a3	Reorganized thread auto-factorization logic. Details: - Reorganized logic of bli_thread_partition_2x2() so that the primary guts were factored out into "fast" and "slow" variants. Then added logic to the "fast" variant that allows for more optimal thread factorizations in some situations where there is at least one factor of 2. - Changed BLIS_THREAD_RATIO_M from 2 to 1 in bli_kernel_macro_defs.h and added comments to that file describing BLIS_THREAD_RATIO_? and BLIS_THREAD_MAX_?R. - In bli_family_zen.h and bli_family_zen2.h, preprocessed out several macros not used in vanilla BLIS and removed the unused macro BLIS_ENABLE_ZEN_BLOCK_SIZES from the former file. - Disabled AMD's small matrix handling entry points in bli_syrk_front.c and bli_trsm_front.c. (These branches of small matrix handling have not been reviewed by vanilla BLIS developers.) - Added commented-out calls to printf() to bli_rntm.c. - Whitespace changes to bli_thread.c.	2020-12-01 19:51:27 +00:00
Field G. Van Zee	64856ea5a6	Auto-reduce (by default) prime numbers of threads. Details: - When requesting multithreaded parallelism by specifying the total number of threads (whether it be via environment variable, globally at runtime, or locally at runtime), reduce the number of threads actually used by one if the original value (a) is prime and (b) exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which is set to 11 by default. If, when specifying the total number of threads (and not the individual ways of parallelism for each loop), prime numbers of threads are desired, this feature may be overridden by defining the BLIS_ENABLE_AUTO_PRIME_NUM_THREADS macro in the bli_family_*.h that corresponds to the configuration family targeted at configure-time. (For now, there is no configure option(s) to control this feature.) Thanks to Jeff Diamond for suggesting this change. - Defined a new function in bli_thread.c, bli_is_prime(), that returns a bool that determines whether an integer is prime. This function is implemented in terms of existing functions in bli_thread.c. - Updated docs/Multithreading.md to document the above feature, along with unrelated minor edits.	2020-11-23 16:54:51 -06:00
Field G. Van Zee	9bb23e6c2a	Added support for systemless build (no pthreads). Details: - Added a configure option, --[enable|disable]-system, which determines whether the modest operating system dependencies in BLIS are included. The most notable example of this on Linux and BSD/OSX is the use of POSIX threads to ensure thread safety for when application-level threads call BLIS. When --disable-system is given, the bli_pthreads implementation is dummied out entirely, allowing the calling code within BLIS to remain unchanged. Why would anyone want to build BLIS like this? The motivating example was submitted via #454 in which a user wanted to build BLIS for a simulator such as gem5 where thread safety may not be a concern (and where the operating system is largely absent anyway). Thanks to Stepan Nassyr for suggesting this feature. - Another, more minor side effect of the --disable-system option is that the implementation of bli_clock() unconditionally returns 0.0 instead of the time elapsed since some fixed point in the past. The reasoning for this is that if the operating system is truly minimal, the system function call upon which bli_clock() would normally be implemented (e.g. clock_gettime()) may not be available. - Refactored preprocess-guarded code in bli_pthread.c and bli_pthread.h to remove redundancies. - Removed old comments and commented #include of "bli_pthread_wrap.h" from bli_system.h. - Documented bli_clock() and bli_clock_min_diff() in BLISObjectAPI.md and BLISTypedAPI.md, with a note that both are non-functional when BLIS is configured with --disable-system.	2020-11-16 15:55:45 -06:00
Field G. Van Zee	88ad841434	Squash-merge 'pr' into 'squash'. (#457) Merged contributions from AMD's AOCL BLIS (#448). Details: - Added support for level-3 operation gemmt, which performs a gemm on only the lower or upper triangle of a square matrix C. For now, only the conventional/large code path will be supported (in vanilla BLIS). This was accomplished by leveraging the existing variant logic for herk. However, some of the infrastructure to support a gemmtsup is included in this commit, including - A bli_gemmtsup() front-end, similar to bli_gemmsup(). - A bli_gemmtsup_ref() reference handler function. - A bli_gemmtsup_int() variant chooser function (with variant calls commented out). - Added support for inducing complex domain gemmt via the 1m method. - Added gemmt APIs to the BLAS and CBLAS compatibility layers. - Added gemmt test module to testsuite. - Added standalone gemmt test driver to 'test' directory. - Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md. - Added a C++ template header (blis.hh) containing a BLAS-inspired wrapper to a set of polymorphic CBLAS-like function wrappers defined in another header (cblas.hh). These two headers are installed if running the 'install' target with INSTALL_HH set to 'yes'. (Also added a set of unit tests that exercise blis.hh, although they are disabled for now because they aren't compatible with out-of-tree builds.) These files now live in the 'vendor' top-level directory. - Various updates to 'zen' and 'zen2' subconfigurations, particularly within the context initialization functions. - Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and various minor updates to dotv and scalv kernels. Also added various sup kernels contributed by AMD to kernels/zen/3. However, these kernels are (for now) not yet used, in part because they caused AppVeyor clang failures, and also because I have not found time to review and vet them. - Output the python found during configure into the definition of PYTHON in build/config.mk (via build/config.mk.in). - Added early-return checks (A, B, or C with zero dimension; alpha = 0) to bli_gemm_front.c. - Implemented explicit beta = 0 handling for the sgemm ukernel in bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent bug surfaced because the gemmt module verifies its computation using gemm with its beta parameter set to zero, which, on a cortexa15 system caused the gemm kernel code to unconditionally multiply the uninitialized C data by beta. The C matrix likely contained non-numeric values such as NaN, which then would have resulted in a false failure. - Fixed a bug whereby the implementation for bli_herk_determine_kc(), in bli_l3_blocksize.c, was inadvertently being defined in terms of helper functions meant for trmm. This bug was probably harmless since the trmm code should have also done the right thing for herk. - Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in kernels/zen/3/bli_gemm_small.c since those macros are not used in vanilla BLIS. - Added cpp guard to definition of bli_mem_clear() in bli_mem.h to accommodate C++'s stricter type checking. - Added cpp guard to test/*.c drivers that facilitate compilation on Windows systems. - Various whitespace changes.	2020-11-14 09:39:48 -06:00
Field G. Van Zee	234b8b0cf4	Increased dotxaxpyf testsuite thresholds. Details: - Increased the test thresholds used by the dotxaxpyf testsuite module by a factor of five in order to avoid residuals that unnecessarily fall in the MARGINAL range. This commit should fix #455. Thanks to @nagsingh for reporting this issue.	2020-11-12 19:11:16 -06:00
Field G. Van Zee	ed612dd82c	Updated README.md with sgemmsup blurb. Details: - Added an entry to the "What's New" section of the README.md to announce the availability of sgemmsup.	2020-11-07 13:09:42 -06:00
Field G. Van Zee	e14424f55b	Merge branch 'dev'	2020-11-07 13:02:50 -06:00
Field G. Van Zee	0cfe1aac22	Relocated operation index to ToC in API docs. Details: - Moved the "Operation index" section of both the BLISObjectAPI.md and BLISTypedAPI.md docs to appear immediately after the table of contents of each document. This allows the reader to quickly jump to the documentation for any operation without having to scroll through much of the document (when rendered via a web browser). - Fixed a mistake in the BLISObjectAPI.md for the setd operation, which does not observe the diag property of its matrix argument. Thanks to Jeff Diamond for reporting this.	2020-10-30 17:10:36 -05:00
Field G. Van Zee	2a0682f8e5	Implemented runtime subconfig selection (#451). Details: - Implemented support for the user manually overriding the automatic subconfiguration selection that happens at runtime. This override can be requested by setting the BLIS_ARCH_TYPE environment variable. The variable must be set to the arch_t id (as enumerated in bli_type_defs.h) corresponding to the desired subconfiguration. If a value outside this enumerated range is given, BLIS will abort with an error message. If the value is in the valid range but corresponds to a subconfiguration that was not activated at configure-time/compile-time, BLIS will abort with a (different) error message. Thanks to decandia50 for suggesting this feature via issue #451. - Defined a new function bli_gks_lookup_id to return the address of an internal data structure within the gks. If this address is NULL, then it indicates that the subconfig corresponding to the arch_t id passed into the function was not compiled into BLIS. This function is used in the second of the two abort scenarios described above. - Defined the enumerated error code BLIS_UNINITIALIZED_GKS_CNTX, which is returned for the latter of the two abort scenarios mentioned above, along with a corresponding error message and a function to perform the error check. - Added cpp macro branching to bli_env.c to support compilation of the auto-detect.x executable during configure-time. This cpp branch is similar to the cpp code already found in bli_arch.c and bli_cpuid.c. - Cleaned up the auto_detect() function to facilitate easier maintenance going forward. Also added a convenient debug switch that outputs the compilation command for the auto-detect.x executable and exits.	2020-10-18 18:04:03 -05:00
Field G. Van Zee	eccdd75a2d	Whitespace tweak in docs/PerformanceSmall.md.	2020-10-09 15:44:16 -05:00
Field G. Van Zee	7677e9ba60	Merge branch 'dev' of github.com:flame/blis into dev	2020-10-09 15:41:25 -05:00
Field G. Van Zee	addcd46b05	Added Epyc 7742 Zen2 ("Rome") sup perf results. Details: - Added single-threaded and multithreaded sup performance results to docs/PerformanceSmall.md for both sgemm and dgemm. These results were gathered on an Epyc 7742 "Rome" server featuring AMD's Zen2 microarchitecture. Special thanks to Jeff Diamond for facilitating access to the system via the Oracle Cloud. - Updates to octave scripts in test/sup/octave for use with Octave 5.2 and for use with subplot_tight(). - Minor updates to octave scripts in test/3/octave. - Renamed files containing the previous Zen performance results for consistency with the new results. - Decreased line thickness slightly in large/conventional Zen2 graphs. I'm done tweaking those this time. Really. - Added missing line regarding eigen header installation for each microarchitecture section.	2020-10-09 15:41:09 -05:00
Field G. Van Zee	a0849d390d	Register l3 sup kernels in zen2 subconfig. Details: - Registered full suite of sgemm and dgemm sup millikernels, blocksizes, and crossover thresholds in bli_cntx_init_zen2.c. - Minor updates to test/sup/runme.sh for running on Zen2 Epyc 7742 system.	2020-10-09 20:22:17 +00:00
Field G. Van Zee	d98368c32d	Another tweak to line thickness of Zen2 graphs.	2020-10-08 19:05:51 -05:00
Field G. Van Zee	1855dfbdaa	Tweaked line thickness in Zen2 graphs once more. Details: - Decreased (relative to previous commit) line thickness in recent Zen2 graphs.	2020-10-08 19:01:00 -05:00
Field G. Van Zee	0991611e7e	Increased line thickness in recent Zen2 graphs. Details: - Increased the width of the lines in the graphs introduced in `74ec6b8`.	2020-10-08 18:54:49 -05:00
Field G. Van Zee	8273cbacd7	README.md, docs/FAQ.md updates. Details: - Added a frequently asked question to docs/FAQ.md regarding the difference between upstream (vanilla) BLIS and AMD BLIS. - Updated the name of ICES in the README.md to reflect the Oden rebranding.	2020-10-07 14:51:33 -05:00
Field G. Van Zee	a178a822ad	Added Zen2 links to docs/Performance.md Contents.	2020-09-30 16:00:52 -05:00
Field G. Van Zee	74ec6b8f45	Added Epyc 7742 Zen2 ("Rome") performance results. Details: - Added single-threaded and multithreaded performance results to docs/Performance.md. These results were gathered on an Epyc 7742 "Rome" server with AMD's Zen2 microarchitecture. Special thanks to Jeff Diamond for facilitating access to the system via the Oracle Cloud. - Renamed files containing the previous Zen performance results for consistency with the new results.	2020-09-30 15:54:18 -05:00
Field G. Van Zee	bc4a213a2c	Updated matlab (now octave) plot code in test/3. Details: - Renamed test/3/matlab to test/3/octave. - Within test/3, updated and tuned plot_l3_perf.m and plot_panel_4x5.m files for use with octave (which is free and doesn't crash on me mid-way through my use of subplot). - Updated runthese.m scratchpad for zen2 invocations. - Added Nikolay S.'s subplot_tight() function, along with its license.	2020-09-30 15:28:20 -05:00
Field G. Van Zee	c77ddc4181	Added optional numactl usage to test/3/runme.sh.	2020-09-30 20:15:43 +00:00
Nicholai Tukanov	2d8ec164e7	Add POWER10 support to BLIS (#450)	2020-09-29 16:52:18 -05:00
Field G. Van Zee	4fd8d9fec2	Tweaked zen2 subconfig's MC cache blocksizes. Details: - Updated the MC cache blocksizes registered by the 'zen2' subconfig. - Minor updates to test/3/Makefile and test/3/runme.sh.	2020-09-28 23:39:05 +00:00
Field G. Van Zee	5efcdeffd5	More minor README.md updates.	2020-09-25 14:25:24 -05:00
Field G. Van Zee	9e940f8aad	Added 1m SISC bibtex to README.md. Details: - Added final citation info to 1m bibtex in README.md file. - Updated draft 1m paper link. - Changed some http to https.	2020-09-25 13:53:35 -05:00
Field G. Van Zee	e293cae2d1	Implemented sgemmsup assembly kernels. Details: - Created a set of single-precision real millikernels and microkernels comparable to the dgemmsup kernels that already exist within BLIS. - Added prototypes for all kernels within bli_kernels_haswell.h. - Registered entry-point millikernels in bli_cntx_init_haswell.c and bli_cntx_init_zen.c. - Added sgemmsup support to the Makefile, runme.sh script, and source file in test/sup. This included edits that allow for separate "small" dimensions for single- and double-precision as well as for single- vs. multithreaded execution.	2020-09-15 16:09:11 -05:00
Field G. Van Zee	2765c6f37c	Type saga continues; fixed sgemm ukernel signature. Details: - Changed double* pointers in sgemm function signature to float*. At this point I've lost track of whether this was my fault or another dormant bug like the one described in `ece9f6a`, but at this point I no longer care. It's one of those days (aka I didn't ask for this).	2020-09-12 17:48:15 -05:00
Field G. Van Zee	0779559509	Fixed missing restrict in knl sgemm prototype. Details: - Added a missing 'restrict' qualifier in the sgemm ukernel prototype for knl. (Not sure how that code was ever compiling before now.)	2020-09-12 17:37:21 -05:00
Field G. Van Zee	ece9f6a3ef	Fixed dormant type bugs in bli_kernels_knl.h. Details: - Fixed dormant type mismatches in the use of the prototype-generating macros in bli_kernels_knl.h. Specifically, some float prototypes were incorrectly using double as their ctype. This didn't actually matter until the type changes in `645d771`, as previously those types were not used since packm was prototyped with void* pointers.	2020-09-12 17:22:42 -05:00
Field G. Van Zee	8ebb3b60e1	Fixed accidental breakage in `645d771`. Details: - In trying to clean up kappa_cast variables in the reference packm kernels, which I initially believed to be redundant given the other void* -> ctype* changes in `645d771`, I accidentally ended up violating restrict semantics for 1e/1r packing and possibly other packm kernels. (Normally, my pre-commit testsuite run would have caught this, but I was unknowingly using an edited input.operations file in which I'd disabled most tests as part of unrelated work.) This commit reverts the kappa_cast changes in `645d771`.	2020-09-12 17:00:47 -05:00
Field G. Van Zee	645d771a14	Minor packm kernel type cleanup (void* -> ctype*). Details: - Changed all void* function arguments in reference packm kernels to those of the native type (ctype*). These pointers no longer need to be void* and are better represented by their native types anyway. (See below for details.) Updated knl packm kernels accordingly. - In the definition of the PACKM_KER_PROT prototype macro template in frame/1m/bli_l1m_ker_prot.h, changed the pointer types for kappa, a, and p from void* to ctype*. They were originally void* because these function signatures had to share the same type so they could all be stored in a single array of that shared type, from which they were queried and called by packm_cxk(). This is no longer how the function pointers are stored, and so it no longer makes sense to force the caller of packm kernels to use void*, only so that the implementor of the packm kernels can typecast back to the native datatype within the kernel definition. This change has no effect internally within BLIS because currently all packm kernels are called after querying the function addresses from the context and then typecasting to the appropriate function pointer type, which is based upon type-specific function pointers like float* and double*. - Removed a comment in frame/1m/bli_l1m_ft_ker.h that was outdated and misleading due to changes to the handling of packm kernels since moving them into the context.	2020-09-12 15:31:56 -05:00
Field G. Van Zee	54bf6c3554	Minor README.md update. Details: - Added a new entry to the "What people are saying about BLIS" section.	2020-09-10 15:42:01 -05:00
Field G. Van Zee	e50b4d4046	Minor update to README.md (SIAM Best Paper Prize).	2020-09-09 14:12:53 -05:00
Devin Matthews	a8efb72074	Merge pull request #434 from flame/intel-zdot Add an option to change the complex return type.	2020-09-07 16:18:19 -05:00
Field G. Van Zee	97e87f2c9f	Whitespace/comment updates to #434 PR.	2020-09-07 15:56:42 -05:00
Devin Matthews	b0c4da1732	Merge pull request #436 from flame/s390x Add checks so that s390x is detected as 64-bit.	2020-09-07 15:47:54 -05:00
Field G. Van Zee	810e90ee80	Minor README.md update. Details: - Added HPE to list of funders. - Changed http to https in funders' website links.	2020-09-01 16:11:40 -05:00
Devin Matthews	7d41128219	Use -O2 for all framework code. (#435) It seems that -O3 might be causing intermittent problems with the f2c'ed packed and banded code. -O3 is retained for kernel code. Fixes #341 and fixes #342.	2020-08-13 17:50:58 -05:00
Dave Love	9c5b485d35	Don't override -mcpu with -march on ARM (#353) * Use -mcpu for ARM See the GCC doc about -march, -mtune, and -mcpu and maybe https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/compiler-flags-across-architectures-march-mtune-and-mcpu * Fix typo in flags * Fix typo in cortexa9 flags * Modify cortexa53 compilation flags to fix failing BLAS check (#341)	2020-08-07 15:11:18 -05:00
Devin Matthews	c253d14a72	Also handle Intel-style complex return in CBLAS interface.	2020-08-07 09:39:04 -05:00
Devin Matthews	5d653a11a0	Update Multithreading.md Addresses the issue raised in #426.	2020-08-06 17:58:26 -05:00
Devin Matthews	b1b5870dd3	Add checks so that s390x is detected as 64-bit.	2020-08-06 17:34:20 -05:00
Field G. Van Zee	882dcb11bf	Mention example code at top of documentation docs. Details: - Steer the reader towards the example code section of each documentation doc (object and typed). - Trivial update to examples/oapi/README, examples/tapi/README.	2020-08-06 17:28:14 -05:00
Field G. Van Zee	f4894512e5	Very minor updates to previous commit.	2020-08-06 17:20:00 -05:00
Field G. Van Zee	adedb893ae	Documented mutator functions in BLISObjectAPI.md. Details: - Added documentation for commonly-used object mutator functions in BLISObjectAPI.md. Previously, only accessor functions were documented. Thanks to Jeff Diamond for pointing out this omission. - Explicitly set the 'diag' property of objects in oapi example modules (08level2.c and 09level3.c).	2020-08-06 17:14:01 -05:00
Devin Matthews	5b5278ff49	Use #ifdef instead of #if as macro may be undefined.	2020-08-06 14:19:37 -05:00
Devin Matthews	7fdc0fc893	Add an option to change the complex return type. ifort apparently does not return complex numbers in registers as in C/C++ (or gfortran), but instead creates a "hidden" first parameter for the return value. The option --complex-return=gnu|intel has been added, as well as a guess based on a provided FC if not specified (otherwise default to gnu). This option affects the signatures of cdotc, cdotu, zdotc, and zdotu, and a single library cannot be used with both GNU and Intel Fortran compilers. Fixes #433.	2020-08-06 14:09:23 -05:00
Field G. Van Zee	6e522e5823	Mention disabling of sup in docs/Sandboxes.md. Details: - Added language to remind the reader to disable sup if the intended behavior is for the sandbox implementation to handle all problem sizes, even the smaller ones that would normally be handled by the sup code path.	2020-07-30 19:31:37 -05:00
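
The thread-count rule described in commit 64856ea5a6 can be sketched in C as follows. This is a minimal illustration, not the code in bli_thread.c: the names NT_MAX_PRIME, is_prime(), and reduce_prime_nt() are hypothetical, the primality test uses plain trial division rather than the existing BLIS helpers that the commit mentions, and the threshold of 11 is taken from the default stated for BLIS_NT_MAX_PRIME.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed threshold, mirroring the default of 11 quoted for BLIS_NT_MAX_PRIME. */
#define NT_MAX_PRIME 11

/* Hypothetical helper: trial-division primality test. */
static bool is_prime(long n)
{
    if (n < 2) return false;
    for (long d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

/* Hypothetical helper: if the requested total thread count is prime and
   exceeds the threshold, use one thread fewer; otherwise leave it alone. */
static long reduce_prime_nt(long nt)
{
    assert(nt >= 1);  /* a thread count must be positive */
    if (nt > NT_MAX_PRIME && is_prime(nt))
        return nt - 1;
    return nt;
}
```

Under these assumptions a request for 13 threads would be reduced to 12, while requests for 11 or 12 threads would pass through unchanged.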

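The numerical concern behind commit 7038bbaa05 (pre-inverting a subnormal diagonal element can overflow) can be illustrated with two scalar stand-ins. This is only a sketch under the assumption of IEEE 754 double arithmetic with subnormals enabled. The function names are hypothetical and are not BLIS kernels, which operate on packed panels rather than on single scalars.

```c
#include <math.h>

/* Hypothetical: solve d * x = b using a pre-inverted diagonal element.
   The volatile store forces the reciprocal to be rounded to a double,
   as a stored pre-inverted value would be. */
static double solve_preinverted(double b, double d)
{
    volatile double inv_d = 1.0 / d;  /* overflows to +inf if d is tiny enough */
    return b * inv_d;
}

/* Hypothetical: solve d * x = b with a division at the point of use. */
static double solve_division(double b, double d)
{
    return b / d;
}
```

With d = 1e-310 (a subnormal) and b = 1e-305, the pre-inverted form yields infinity because 1/d exceeds the largest finite double, whereas the division form yields a finite value close to 1e5.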

1928 Commits