amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-06-29 10:47:16 +00:00

Author	SHA1	Message	Date
Field G. Van Zee	810e90ee80	Minor README.md update. Details: - Added HPE to list of funders. - Changed http to https in funders' website links.	2020-09-01 16:11:40 -05:00
Devin Matthews	7d41128219	Use -O2 for all framework code. (#435) It seems that -O3 might be causing intermittent problems with the f2c'ed packed and banded code. -O3 is retained for kernel code. Fixes #341 and fixes #342.	2020-08-13 17:50:58 -05:00
Dave Love	9c5b485d35	Don't override -mcpu with -march on ARM (#353) * Use -mcpu for ARM See the GCC doc about -march, -mtune, and -mcpu and maybe https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/compiler-flags-across-architectures-march-mtune-and-mcpu * Fix typo in flags * Fix typo in cortexa9 flags * Modify cortexa53 compilation flags to fix failing BLAS check (#341)	2020-08-07 15:11:18 -05:00
Devin Matthews	5d653a11a0	Update Multithreading.md Addresses the issue raised in #426.	2020-08-06 17:58:26 -05:00
Field G. Van Zee	882dcb11bf	Mention example code at top of documentation docs. Details: - Steer the reader towards the example code section of each documentation doc (object and typed). - Trivial update to examples/oapi/README, examples/tapi/README.	2020-08-06 17:28:14 -05:00
Field G. Van Zee	f4894512e5	Very minor updates to previous commit.	2020-08-06 17:20:00 -05:00
Field G. Van Zee	adedb893ae	Documented mutator functions in BLISObjectAPI.md. Details: - Added documentation for commonly-used object mutator functions in BLISObjectAPI.md. Previously, only accessor functions were documented. Thanks to Jeff Diamond for pointing out this omission. - Explicitly set the 'diag' property of objects in oapi example modules (08level2.c and 09level3.c).	2020-08-06 17:14:01 -05:00
Field G. Van Zee	6e522e5823	Mention disabling of sup in docs/Sandboxes.md. Details: - Added language to remind the reader to disable sup if the intended behavior is for the sandbox implementation to handle all problem sizes, even the smaller ones that would normally be handled by the sup code path.	2020-07-30 19:31:37 -05:00
Field G. Van Zee	00e14cb6d8	Replaced use of bool_t type with C99 bool. Details: - Textually replaced nearly all non-comment instances of bool_t with the C99 bool type. A few remaining instances, such as those in the files bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being used not for boolean purposes but to index into an array. - This commit constitutes the third phase of a transition toward using C99's bool instead of bool_t, which was raised in issue #420. The first phase, which cleaned up various typecasts in preparation for using bool as the basis for bool_t (instead of gint_t), was implemented by commit `a69a4d7`. The second phase, which redefined the bool_t typedef in terms of bool (from gint_t), was implemented by commit `2c554c2`.	2020-07-29 14:24:34 -05:00
Field G. Van Zee	2c554c2fce	Redefined bool_t typedef in terms of C99 bool. Details: - Changed the typedef that defines bool_t from: typedef gint_t bool_t; where gint_t is a signed integer that forms the basis of most other integers in BLIS, to: typedef bool bool_t; - Changed BLIS's TRUE and FALSE macro definitions from being in terms of integer literals: #define TRUE 1 #define FALSE 0 to being in terms of C99 boolean constants: #define TRUE true #define FALSE false which are provided by stdbool.h. - This commit constitutes the second phase of a transition toward using C99's bool instead of bool_t, which will address issue #420. The first phase, which cleaned up various typecasts in preparation for using bool as the basis for bool_t (instead of gint_t), was implemented by commit `a69a4d7`.	2020-07-24 15:57:19 -05:00
Field G. Van Zee	e01dd12558	Fail-safe updates to Makefiles in 'test' dir. Details: - Updated Makefiles in test, test/3, and test/sup so that running any of the usual targets without having first built BLIS results in a helpful error message. For example, if BLIS is not yet configured, make will output: Makefile:327: * Cannot proceed: config.mk not detected! Run configure first. Stop. Similarly, if BLIS is configured but not yet built, make will output: Makefile:340: * Cannot proceed: BLIS library not yet built! Run make first. Stop. In previous commits, these actions would result in a rather cryptic make error such as: make: *** No rule to make target 'test_sgemm_2400_asm_blis_st.x', needed by 'blis-nat-st'. Stop.	2020-07-24 15:41:46 -05:00
Devin Matthews	b4f47f7540	Add BLIS_EXPORT_BLIS to bli_abort. (#429) Fixes #428.	2020-07-24 13:56:13 -05:00
Field G. Van Zee	a69a4d7e2f	Cleaned up bool_t usage and various typecasts. Details: - Fixed various typecasts in frame/base/bli_cntx.h frame/base/bli_mbool.h frame/base/bli_rntm.h frame/include/bli_misc_macro_defs.h frame/include/bli_obj_macro_defs.h frame/include/bli_param_macro_defs.h that were missing or being done improperly/incompletely. For example, many return values were being typecast as (bool_t)x && y rather than (bool_t)(x && y) Thankfully, none of these deficiencies had manifested as actual bugs at the time of this commit. - Changed the return type of bli_env_get_var() from dim_t to gint_t. This reflects the fact that bli_env_get_var() needs to be able to return a signed integer, and even though dim_t is currently defined as a signed integer, it does not intuitively appear to necessarily be signed by inspection (i.e., an integer named "dim_t" for matrix "dimension"). Also, updated use of bli_env_get_var() within bli_pack.c to reflect the changed return type. - Redefined type of thrcomm_t.barrier_sense field from bool_t to gint_t and added comments to the bli_thrcomm_*.h files that will explain a planned replacement of bool_t with C99's bool type. - Note: These changes are being made to facilitate the substitution of 'bool' for 'bool_t', which will eliminate the namespace conflict with arm_sve.h as reported in issue #420. This commit implements the first phase of that transition. Thanks to RuQing Xu for reporting this issue. - CREDITS file update.	2020-07-22 16:13:09 -05:00
Field G. Van Zee	a6437a5c11	Replaced broken ref99 sandbox w/ simpler version. Details: - The 'ref99' sandbox was broken by multiple refactorings and internal API changes over the last two years. Rather than try to fix it, I've replaced it with a much simpler version based on var2 of gemmsup. Why not fix the previous implementation? It occurred to me that the old implementation was trying to be a lightly simplified duplication of what exists in the framework. Duplication aside, this sandbox would have worked fine if it had been completely independent of the framework code. The problem was that it was only partially independent, with many function calls calling a function in BLIS rather than a duplicated/simplified version within the sandbox. (And the reason I didn't make it fully independent to begin with was that it seemed unnecessarily duplicative at the time.) Maintaining two versions of the same implementation is problematic for obvious reasons, especially when it wasn't even done properly to begin with. This explains the reimplementation in this commit. The only catch is that the newer implementation is single-threaded only and does not perform any packing on either input matrix (A or B). Basically, it's only meant to be a simple placeholder that shows how you could plug in your own implementation. Thanks to Francisco Igual for reporting this brokenness. - Updated the three reference gemmsup kernels (defined in ref_kernels/3/bli_gemmsup_ref.c) so that they properly handle conjugation of conja and/or conjb. The general storage kernel, which is currently identical to the column-storage kernel, is used in the new ref99 sandbox to provide basic support for all datatypes (including scomplex and dcomplex). - Minor updates to docs/Sandboxes.md, including adding the threading and packing limitations to the Caveats section. - Fixed a comment typo in bli_l3_sup_var1n2m.c (upon which the new sandbox implementation is based).	2020-07-20 19:21:07 -05:00
Devin Matthews	bca040be9d	Merge pull request #425 from gmargari/patch-1 Update Multithreading.md	2020-07-20 09:27:30 -05:00
Giorgos Margaritis	171ecc1dc6	Update Multithreading.md	2020-07-20 12:24:06 +03:00
Field G. Van Zee	2605eb4d99	Added missing rv_d?x6 edge cases to sup kernel. Details: - Added support to bli_gemmsup_rv_haswell_asm_d6x8n.c for handling various n = 6 edge cases with a single sup kernel call. Previously, only n = {4,2,1} were handled explicitly as single kernel calls; that is, cases where n = 6 were previously being executed via two kernel calls (n = 4 and n = 2). - Added commented debug line to testsuite's test_libblis.c.	2020-07-15 15:25:19 -05:00
Field G. Van Zee	72f6ed0637	Declare/define static functions via BLIS_INLINE. Details: - Updated all static function definitions to use the cpp macro BLIS_INLINE instead of the static keyword. This allows blis.h to use a different keyword (inline) to define these functions when compiling with C++, which might otherwise trigger "defined but not used" warning messages. Thanks to Giorgos Margaritis for reporting this issue and Devin Matthews for suggesting the fix. - Updated the following files, which are used by configure's hardware auto-detection facility, to unconditionally #define BLIS_INLINE to the static keyword (since we know BLIS will be compiled with C, not C++): build/detect/config/config_detect.c frame/base/bli_arch.c frame/base/bli_cpuid.c - CREDITS file update.	2020-07-03 17:55:54 -05:00
Field G. Van Zee	5fc701ac5f	Added -fomit-frame-pointer option to CKOPTFLAGS. Details: - Added the -fomit-frame-pointer compiler option to the CKOPTFLAGS variable in the following make_defs.mk files: config/haswell/make_defs.mk config/skx/make_defs.mk as well as comments that mention why the compiler option is needed. This option is needed to prevent the compiler from using the rbp frame register (in the very early portion of kernel code, typically where k_iter and k_left are defined and computed), which, as of `1c719c9`, is used explicitly by the gemmsup millikernels. Thanks to Devin Matthews for identifying this missing option and to Jeff Diamond for reporting the original bug in #417. - The file config/zen/amd_config.mk which feeds into the make_defs.mk for both zen and zen2 subconfigs, was also touched, but only to add a commented-out compiler option (and the aforementioned explanatory comment) since that file already uses -fomit-frame-pointer in COPTFLAGS, which forms the basis of CKOPTFLAGS.	2020-07-01 15:48:58 -05:00
Field G. Van Zee	6af59b7057	Fixed disabled edge case optimization in gemmsup. Details: - Fixed an inadvertently disabled edge case optimization in the two gemmsup variants in bli_l3_sup_var1n2m.c. Background: These edge case optimizations allow the last millikernel operation in the jr loop to be executed with an inflated register blocksize if it is the last (or only) iteration. For example, suppose mr=6 and nr=8 and the gemmsup problem is m=8, n=100, k=100. (In this case, the panel-block variant (var1n) is executed, which places the jr loop in the m dimension.) In principle, this problem could be executed as two millikernels: one with dimensions 6x100x100, and one as 2x100x100. However, with the support for inflated blocksizes in the kernel, the entire 8x100x100 problem can be passed to the millikernel function, which will then execute it more favorably as two 4x100x100 millikernel sub-calls. Now, this optimization is disabled under certain circumstances, such as when multithreading. Previously, the is_mt predicate was being set incorrectly such that it was non-zero even when running single-threaded. - Upon fixing the is_mt issue above, another bit of code needed to be moved so that the result of the optimization could have an impact on the assignment of loop bounds ranges to threads.	2020-07-01 14:54:23 -05:00
Field G. Van Zee	b37634540f	Support ldims, packing in sup/test drivers. Details: - Updated the test/sup source file (test_gemm.c) and Makefile to support building matrices with small or large leading dimensions, and updated runme.sh to support executing both kinds of test drivers. - Updated runme.sh to allow for executing sup drivers with unpacked (the default) or packed matrices (via setting BLIS_PACK_A, BLIS_PACK_B environment variables), and for capturing output to files that encode both the leading dimension (small or large) and packing status into the filenames. - Consolidated octave scripts in test/sup/octave_st, test/sup/octave_mt into test/sup/octave and updated the octave code in that consolidated directory to read the new output filename format (encoding ldim and packing). Also added comments and streamlined code, particularly in plot_panel_trxsh.m. Tested the octave scripts with octave 5.2.0. - Moved old octave_st, octave_mt directories to test/sup/old.	2020-06-25 16:05:12 -05:00
Field G. Van Zee	ceb9b95a96	Fixed incorrect link to shiftd in BLISTypedAPI.md. Details: - Previously, the entry for shiftd in the Operation index section of BLISTypedAPI.md was incorrectly linking to the shiftd operation entry in BLISObjectAPI.md. This has been fixed. Thanks to Jeff Diamond for helping find this incorrect link.	2020-06-18 17:15:25 -05:00
Field G. Van Zee	b3c4201681	CREDITS file update.	2020-06-18 14:00:56 -05:00
Isuru Fernando	31af73c11a	Expand windows instructions (#414) * Expand windows instructions * Windows: both static and shared don't work at the same time	2020-06-18 13:35:54 -05:00
Field G. Van Zee	b5b604e106	Ensure random objects' 1-norms are non-zero. Details: - Fixed an innocuous bug that manifested when running the testsuite on extremely small matrices with randomization via the "powers of 2 in narrow precision range" option enabled. When the randomization function emits a perfect 0.0 to fill a 1x1 matrix, the testsuite will then compute 0.0/0.0 during the normalization process, which leads to NaN residuals. The solution entails smarter implementations of randv, randnv, randm, and randnm, each of which will compute the 1-norm of the vector or matrix in question. If the object has a 1-norm of 0.0, the object is re-randomized until the 1-norm is not 0.0. Thanks to Kiran Varaganti for reporting this issue (#413). - Updated the implementation of randm_unb_var1() so that it loops over a call to the randv_unb_var1() implementation directly rather than calling it indirectly via randv(). This was done to avoid the overhead of multiple calls to norm1v() when randomizing the rows/columns of a matrix. - Updated comments.	2020-06-17 16:42:24 -05:00
Isuru Fernando	35e38fb693	Fix typo in FAQ	2020-06-16 09:08:31 -07:00
Field G. Van Zee	1c719c91a3	Bugfixes, cleanup of sup dgemm ukernels. Details: - Fixed a few not-really-bugs: - Previously, the d6x8m kernels were still prefetching the next upanel of A using MR*rs_a instead of ps_a (same for prefetching of next upanel of B in d6x8n kernels using NR*cs_b instead of ps_b). Given that the upanels might be packed, using ps_a or ps_b is the correct way to compute the prefetch address. - Fixed an obscure bug in the rd_d6x8m kernel that, by dumb luck, executed as intended even though it was based on faulty pointer management. Basically, in the rd_d6x8m kernel, the pointer for B (stored in rdx) was loaded only once, outside of the jj loop, and in the second iteration its new position was calculated by incrementing rdx by the absolute offset (four columns), which happened to be the same as the relative offset (also four columns) that was needed. It worked only because that loop only executed twice. A similar issue was fixed in the rd_d6x8n kernels. - Various cleanups and additions, including: - Factored out the loading of rs_c into rdi in rd_d6x8[mn] kernels so that it is loaded only once outside of the loops rather than multiple times inside the loops. - Changed outer loop in rd kernels so that the jump/comparison and loop bounds more closely mimic what you'd see in higher-level source code. That is, something like: for( i = 0; i < 6; i+=3 ) rather than something like: for( i = 0; i <= 3; i+=3 ) - Switched row-based IO to use byte offsets instead of byte column strides (e.g. via rsi register), which were known to be 8 anyway since otherwise that conditional branch wouldn't have executed. - Cleaned up and homogenized prefetching a bit. - Updated the comments that show the before and after of the in-register transpositions. - Added comments to column-based IO cases to indicate which columns are being accessed/updated. - Added rbp register to clobber lists. - Removed some dead (commented out) code. - Fixed some copy-paste typos in comments in the rv_6x8n kernels. - Cleaned up whitespace (including leading ws -> tabs). - Moved edge case (non-milli) kernels to their own directory, d6x8, and split them into separate files based on the "NR" value of the kernels (Mx8, Mx4, Mx2, etc.). - Moved config-specific reference Mx1 kernels into their own file (e.g. bli_gemmsup_r_haswell_ref_dMx1.c) inside the d6x8 directory. - Added rd_dMx1 assembly kernels, which seem marginally faster than the corresponding reference kernels. - Updated comments in ref_kernels/bli_cntx_ref.c and changed to using the row-oriented reference kernels for all storage combos.	2020-06-04 17:21:08 -05:00
Isuru Fernando	943a21def0	Add build instructions for Windows (#404)	2020-05-21 14:09:21 -05:00
Field G. Van Zee	fbef422f0d	Separate OS X and Windows into separate FAQs. Details: - Separated the unified Mac OS X / Windows frequently asked question into two separate questions, one for each OS.	2020-05-21 10:30:41 -05:00
Guodong Xu	28be1a4265	Avoid loading twice in armv8a gemm kernel (#403) This bug happens in a corner case, when k_iter == 0 and we jump to CONSIDERKLEFT. In the current design, the first row/col. of a and b are loaded twice. The fix is to rearrange the a and b (first row/col.) loading instructions. Signed-off-by: Guodong Xu <guodong.xu@linaro.org>	2020-05-20 13:22:22 -05:00
Field G. Van Zee	d51245e58b	Add support for Intel oneAPI in configure. Details: - Properly select cc_vendor based on the output of invoking CC with the --version option, including cases where CC is the variant of clang that is included with Intel oneAPI. (However, we continue to treat the compiler as clang for other purposes, not icc.) Thanks to Ajay Panyala and Devin Matthews for reporting on this issue via #402.	2020-05-08 18:00:54 -05:00
Field G. Van Zee	787adad73b	Defined netlib equivalent of xerbla_array(). Details: - Added a function definition for xerbla_array_(), which largely mirrors its netlib implementation. Thanks to Isuru Fernando for suggesting the addition of this function.	2020-05-08 16:18:20 -05:00
Field G. Van Zee	c53b5153be	Documented Perl prerequisite for build system. Details: - Added Perl to list of prerequisites for building BLIS. This is in part (and perhaps completely?) due to some substitution commands used at the end of configure that include '\n' characters that are not properly interpreted by the version of sed included on some versions of OS X. This new documentation addresses issue #398.	2020-05-05 12:39:12 -05:00
Guodong Xu	f032d5d4a6	New kernel set for Arm SVE using assembly (#396) This adds two kernels for Arm SVE vector extensions. 1. a gemm kernel for double at sizes 8x8. 2. a packm kernel for double at dimension 8xk. To achieve best performance, vector-length-agnostic programming is not used. Vector length (VL) of 256 bits is mandated in both kernels. Kernels to support other VLs can be added later. "SVE is a vector extension for AArch64 execution mode for the A64 instruction set of the Armv8 architecture. Unlike other SIMD architectures, SVE does not define the size of the vector registers, but constrains into a range of possible values, from a minimum of 128 bits up to a maximum of 2048 in 128-bit wide units. Therefore, any CPU vendor can implement the extension by choosing the vector register size that better suits the workloads the CPU is targeting. Instructions are provided specifically to query an implementation for its register size, to guarantee that the applications can run on different implementations of the ISA without the need to recompile the code." [1] [1] https://developer.arm.com/solutions/hpc/resources/hpc-white-papers/arm-scalable-vector-extensions-and-application-to-machine-learning Signed-off-by: Guodong Xu <guodong.xu@linaro.org>	2020-04-29 12:08:46 -05:00
Yingbo Ma	4d87eb24e8	Update KernelsHowTo.md (#395)	2020-04-27 16:02:47 -05:00
Field G. Van Zee	477ce91c52	Moved #include "cpuid.h" to bli_cpuid.c. Details: - Relocated the #include "cpuid.h" directive from bli_cpuid.h to bli_cpuid.c. This was done because cpuid.h (which is pulled into the post-build blis.h developer header) doesn't protect its definitions with a preprocessor guard of the form: #ifndef FOOBAR_H #define FOOBAR_H // header contents. #endif and as a result, applications (previously) could not #include both blis.h and cpuid.h (since the former was already including the latter). Thanks to Bhaskar Nallani for raising this issue via #393 and to Devin Matthews for suggesting this fix. - CREDITS file update.	2020-04-22 14:26:49 -05:00
Field G. Van Zee	8bde63ffd7	Adding missing conjy to her2/syr2 in typed API doc. Details: - Fixed a missing argument (conjy) in the function signatures of bli_?her2() and bli_?syr2() in docs/BLISTypedAPI.md. Thanks to Robert van de Geijn for reporting this omission.	2020-04-18 12:50:12 -05:00
Field G. Van Zee	976902406b	Disable packing by default in expert rntm_t init. Details: - Changed the behavior of bli_rntm_init() as well as the static initializer, BLIS_RNTM_INITIALIZER, so that user-initialized rntm_t objects by default specify the disabling of packing for A and B. Packing of A/B was already disabled by default when calling non-expert APIs (and enabled only when the user set environment variables BLIS_PACK_A or BLIS_PACK_B). With this commit, the default behavior of using user-initialized rntm_t objects with expert APIs comes into line with the default behavior of non-expert APIs--that is, they now both lead to the avoidance of packing in the sup code path. (Note: The conventional code path is unaffected by the environment variables BLIS_PACK_A/BLIS_PACK_B and/or the disabling of packing in a rntm_t object when calling an expert API.) This addresses issue #392. Thanks to Kiran Varaganti for bringing this inconsistency to our attention. - The above change was accomplished by changing the definitions of static functions bli_rntm_clear_pack_a() and bli_rntm_clear_pack_b() in bli_rntm.h, which are both for internal use only.	2020-04-17 15:11:10 -05:00
Field G. Van Zee	5f2aee7c5f	README.md update to promote supmt dgemm. Details: - Updated the sup entry in the "What's New" section of the README.md file to promote the multithreaded dgemm sup feature introduced in `c0558fd`.	2020-04-07 14:55:15 -05:00
Field G. Van Zee	f5923cd9ff	CHANGELOG update (0.7.0)	2020-04-07 14:41:45 -05:00
Field G. Van Zee	68b88aca66	Version file update (0.7.0) 0.7.0	2020-04-07 14:41:44 -05:00
Field G. Van Zee	b04de636c1	ReleaseNotes.md update in advance of next version. Details: - Updated docs/ReleaseNotes.md in preparation for next version.	2020-04-07 14:37:43 -05:00
Field G. Van Zee	2cb604ba47	Rename more bli_thread_obarrier(), _obroadcast(). Details: - Renamed instances of bli_thread_obarrier() and bli_thread_obroadcast() that were made in the supmt-specific code committed to the 'amd' branch, which has now been merged with 'master'. Prior to the merge, 'master' received commit `c01d249`, which applied these renamings to the existing, non-sup codebase.	2020-04-06 16:42:14 -05:00
Field G. Van Zee	efb12bc895	Minor updates/elaborations to RELEASING file.	2020-04-06 15:01:53 -05:00
Field G. Van Zee	2e3b3782cf	Merge branch 'master' into amd	2020-04-06 14:55:35 -05:00
Satish Balay	da0c086f46	OSX: specify the full path to the location of libblis.dylib (#390) * OSX: specify the full path to the location of libblis.dylib so that it can be found at runtime Before this change: Application gives runtime error [when linked with blis] dyld: Library not loaded: libblis.3.dylib balay@kpro lib % otool -L libblis.dylib libblis.dylib: libblis.3.dylib (compatibility version 0.0.0, current version 0.0.0) /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1281.0.0) After this change: balay@kpro lib % otool -L libblis.dylib libblis.dylib: /Users/balay/petsc/arch-darwin-c-debug/lib/libblis.3.dylib (compatibility version 0.0.0, current version 0.0.0) /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1281.0.0) * INSTALL_LIBDIR -> libdir as INSTALL_LIBDIR has DESTDIR Co-Authored-By: Jed Brown <jed@jedbrown.org> * CREDITS file update. Co-authored-by: Jed Brown <jed@jedbrown.org> Co-authored-by: Field G. Van Zee <field@cs.utexas.edu>	2020-03-31 17:09:41 -05:00
Field G. Van Zee	2bca03ea9d	Updates, tweaks to runme.sh in test/1m4m. Details: - Made several updates to test/1m4m/runme.sh, including: - Added missing handling for 1m and 4m1a implementations when setting the BLIS_??_NT environment variables. - Added support for using numactl to run the test executables. - Several other cleanups.	2020-03-28 22:10:00 +00:00
Field G. Van Zee	c40a33190b	Warn user when auto-detection returns 'generic'. Details: - Added logic to configure that causes the script to output a warning to the user if/when "./configure auto" is run and the underlying hardware feature detection code is unable to identify the hardware. In these cases, the auto-detect code will return 'generic', which is likely not what the user expected, and a flag will be set so that a message is printed at the end of the configure output. (Thankfully, we don't expect this scenario to play out very often.) Thanks to Devin Matthews for suggesting this fix #384.	2020-03-26 16:55:00 -05:00
Devin Matthews	492a736fab	Fix vectorized version of bli_amaxv (#382) * Fix vectorized version of bli_amaxv To match Netlib, i?amax should return: - the lowest index among equal values - the first NaN if one is encountered * Fix typos. * And another one... * Update ref. amaxv kernel too. * Re-enabled optimized amaxv kernels. Details: - Re-enabled the optimized, intrinsics-based amaxv kernels in the 'zen' kernel set for use in haswell, zen, zen2, knl, and skx subconfigs. These two kernels (for s and d datatypes) were temporarily disabled in `e186d71` as part of issue #380. However, the key missing semantic properties that prompted the disabling of these kernels--returning the index of the first rather than of the last element with largest absolute value, and returning the index of the first NaN if one is encountered--were added as part of #382 thanks to Devin Matthews. Thus, now that the kernels are working as expected once more, this commit causes these kernels to once again be registered for the affected subconfigs, which effectively reverts all code changes included in `e186d71`. - Whitespace/formatting updates to new macros in bli_amaxv_zen_int.c. Co-authored-by: Field G. Van Zee <field@cs.utexas.edu>	2020-03-24 17:28:47 -05:00
Field G. Van Zee	e186d7141a	Disabled optimized amaxv kernels. Details: - Disabled use of optimized amaxv kernels, which use vector intrinsics for both 's' and 'd' datatypes. We disable these kernels because the current implementations fail to observe a semantic property of the BLAS i?amax_() subroutine, which is to return the index of the first element containing the maximum absolute value (that is, the first element if there exist two or more elements that contain the same value). With the optimized kernels disabled, the affected subconfigurations (haswell, zen, zen2, knl, and skx) will use the default reference implementations. Thanks to Mat Cross for reporting this issue via #380. - CREDITS file update.	2020-03-21 18:40:36 -05:00
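
The i?amax semantics described in the two amaxv commits above (lowest index among equal absolute values, and the first NaN if one is encountered) can be sketched in scalar C. This is only an illustration under stated assumptions: the name `first_amax` and its 0-based return value are hypothetical and are not the BLIS reference kernel or its API.

```c
#include <math.h>    /* isnan */
#include <stddef.h>  /* size_t */

/* Hypothetical helper (not part of BLIS): returns the 0-based index of the
   first element of x with the largest absolute value. If a NaN is present,
   the index of the first NaN is returned instead. Returns 0 when n == 0. */
size_t first_amax(size_t n, const double *x)
{
    size_t best = 0;
    double best_abs = -1.0;  /* smaller than any absolute value */

    for (size_t i = 0; i < n; ++i) {
        double a = (x[i] < 0.0) ? -x[i] : x[i];

        /* The first NaN encountered ends the search immediately. */
        if (isnan(a)) return i;

        /* Strict '>' keeps the lowest index among equal values. */
        if (a > best_abs) {
            best_abs = a;
            best = i;
        }
    }
    return best;
}
```

The strict comparison is the design point: a `>=` test would return the last of several tied elements, which is the behavior the commits describe as incorrect.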

1886 Commits