amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-11 17:50:00 +00:00

Author	SHA1	Message	Date
Devin Matthews	de16beb83b	PACKDIM_MR=8 didn't work out, but messing with the prefetching helps 2%.	2017-12-11 12:07:31 +05:30
Devin Matthews	25d0e61854	Revert "Change PACKDIM_MR (double) for haswell to 8." This reverts commit `681eec913d`.	2017-12-11 12:07:31 +05:30
Devin Matthews	c5bdd84b35	Change PACKDIM_MR (double) for haswell to 8.	2017-12-11 12:07:31 +05:30
Field G. Van Zee	172789d562	Restored deleted lines from makefile fragments.	2017-12-11 12:07:31 +05:30
Devin Matthews	3ea9bd2c8e	Change to /bin/sh. All scripts checked with Debian's checkbashisms. Also check for clang first in auto-detect.sh.	2017-12-11 12:07:31 +05:30
Devin Matthews	49438409ee	Remove shebangs from makefiles.	2017-12-11 12:07:31 +05:30
J M Dieterich	497e264047	Fix if/else structure. Thanks to TravisCI.	2017-12-11 12:07:31 +05:30
J M Dieterich	835035c56a	Mark piledriver compilable w/ clang.	2017-12-11 12:06:40 +05:30
J M Dieterich	6cdb533472	Mark bulldozer compilable w/ clang.	2017-12-11 12:06:40 +05:30
J M Dieterich	a85697d622	Correct error message.	2017-12-11 12:06:40 +05:30
J M Dieterich	e0c64cad27	Indeed one can compile for carrizo also using clang.	2017-12-11 12:06:40 +05:30
J M Dieterich	4aafe0505d	A bunch of shebang fixes from unportable /bin/bash to portable /usr/bin/env bash	2017-12-11 12:06:40 +05:30
Field G. Van Zee	abaeaa68ea	Fixed a bug in norm1v, norm1m. Details: - Fixed a bug that manifested as improperly-computed 1-norm for vectors and matrices. This is one of the few operations in BLIS that does not have its own test module within the testsuite, hence why it went undetected for so long. The bad 1-norms were being used to normalize matrices in the testsuite after initialization, which led to some matrices containing a combination of "large" and "small" values. This tended to push the residuals computed after each test away from zero. In some cases, they were off just enough for the testsuite to label it a "failure". Many thanks to Jeff Hammond for reporting this bug. (Wonky details: the bug was due to improperly-defined level-0 scalar macros for abval2, an operation that computes the absolute square, or complex magnitude/modulus. Certain complex domain instances of abval2 were being incorrectly defined in terms of real-only solutions, leading to bad results. This level-0 operation forms the basis of norm1v/norm1m. absq2 was also affected, but almost nothing uses this operation.)	2017-12-11 12:05:22 +05:30
Devin Matthews	cc3107ae1c	Setting any one of BLIS_NT_[IJ][CR] overrides BLIS_NUM_THREADS. Missing BLIS_NT_XX's are defaulted to 1. Fixes #123.	2017-12-11 12:05:22 +05:30
Field G. Van Zee	c8ab91f70d	Disable complex 3m/4m in testsuite by default. Details: - Disabled testsuite tests of all level-3 implementations based on 3m and 4m. This will improve testing runtime on Travis CI as well as for anyone manually running the testsuite using default test parameters. Thanks to Devin Matthews for suggesting this change.	2017-12-11 12:05:22 +05:30
Jeff Hammond	9700f0e578	Allow KNL build without hbwmalloc.h (i.e. emulated). We want to be able to run BLIS KNL binaries on non-KNL machines via SDE. Although it is possible to install an hbwmalloc implementation on such systems, it is easier not to: the performance of SDE execution is not representative, so there is no reason to emulate HBW allocation.	2017-12-11 12:05:22 +05:30
Field G. Van Zee	17dcd5a33f	Fixed stray parentheses in README citations.	2017-12-11 12:05:22 +05:30
Field G. Van Zee	2910d44ff9	CHANGELOG update (0.2.2)	2017-12-11 12:05:22 +05:30
Field G. Van Zee	5ca3863220	Fixed a trsm1m bug that affected right-side cases. Details: - Fixed a bug introduced in `1c732d3` that affected trsm1m_r. The result was nondeterministic behavior (usually segmentation faults) for certain problem sizes beyond the 1m instance of kc (e.g. 128 on haswell). The cause of the bug was my commenting out lines in bli_gemm1m_ukr_ref.c which explicitly directed the virtual gemm micro-kernel to use temporary space if the storage preference of the [real domain] gemm ukernel did not match the storage of the output matrix C. In the context of gemm, this handling is not needed because agreement between the storage pref and the matrix is guaranteed by a high-level optimization in BLIS. However, this optimization is not applied to trsm because the storage of C is not necessarily the same as the storage of the micro-panels of B--both of which are updated by the micro-kernel during a trsm operation. Thus, the guarantee of storage/preference agreement is not in place for trsm, which means we must handle that case within the virtual gemm micro-kernel. - Comment updates and a minor macro change to bli_trsm*_cntx_init() for 3m1, 4m1a, and 1m.	2017-12-11 12:03:07 +05:30
Field G. Van Zee	1af0b09f5c	README.md update. Details: - Updated bibtex entries for 4th BLIS paper, and adds entries for 5th and 6th BLIS papers.	2017-12-11 12:03:07 +05:30
Field G. Van Zee	db4a0bb8ba	Whitespace reformatting to armv8a kernels file. Details: - Updated formatting of function signature/header in kernels/armv8a/3/bli_gemm_opt_4x4.c.	2017-12-11 11:58:33 +05:30
Field G. Van Zee	e3eb01f6b9	Disabled experiment-related 1m code. Details: - Commented out code in frame/ind/oapi/bli_l3_3m4m1m_oapi.c that was specifically inserted to facilitate the benchmarking of 1m block-panel and panel-block algorithms. - Updates to test/3m4m/Makefile, runme.sh script, and test_gemm.c to reflect changes used/needed during benchmarking.	2017-12-11 11:58:33 +05:30
Field G. Van Zee	4f61528d56	Added 1m-specific APIs for bp, pb gemm algorithms. Details: - Defined bli_gemmbp_cntl_create(), bli_gemmpb_cntl_create(), with the body of bli_gemm_cntl_create() replaced with a call to the former. - Defined bli_cntl_free_w_thrinfo(), bli_cntl_free_wo_thrinfo(). Now, bli_cntl_free() can check if the thread parameter is NULL, and if so, call the latter, and otherwise call the former. - Defined bli_gemm1mbp_cntx_init(), bli_gemm1mpb_cntx_init(), both in terms of bli_gemm1mxx_cntx_init(), which behaves the same as bli_gemm1m_cntx_init() did before, except that an extra bool parameter (is_pb) is used to support both bp and pb algorithms (including to support the anti-preference field described below). - Added support for "anti-preference" in context. The anti_pref field, when true, will toggle the boolean return value of routines such as bli_cntx_l3_ukr_eff_prefers_storage_of(), which has the net effect of causing BLIS to transpose the operation to achieve disagreement (rather than agreement) between the storage of C and the micro-kernel output preference. This disagreement is needed for panel-block implementations, since they induce a transposition of the suboperation immediately before the macro-kernel is called, which changes the apparent storage of C. For now, anti-preference is used only with the pb algorithm for 1m (and not with any other non-1m implementation). - Defined new functions, bli_cntx_l3_ukr_eff_prefers_storage_of() bli_cntx_l3_ukr_eff_dislikes_storage_of() bli_cntx_l3_nat_ukr_eff_prefers_storage_of() bli_cntx_l3_nat_ukr_eff_dislikes_storage_of() which are identical to their non-"eff" (effectively) counterparts except that they take the anti-preference field of the context into account. - Explicitly initialize the anti-pref field to FALSE in bli_gks_cntx_set_l3_nat_ukr_prefs(). - Added bli_gemm_ker_var1.c, which implements a panel-block macro-kernel in terms of the existing block-panel macro-kernel _ker_var2(). 
This technique requires inducing transposes on all operands and swapping the A and B. - Changed bli_obj_induce_trans() macro so that pack-related fields are also changed to reflect the induced transposition. - Added a temporary hack to bli_l3_3m4m1m_oapi.c that allows us to easily specify the 1m algorithm (block-panel or panel-block). - Renamed the following cntx_t-related macros: bli_cntx_get_pack_schema_a() -> bli_cntx_get_pack_schema_a_block() bli_cntx_get_pack_schema_b() -> bli_cntx_get_pack_schema_b_panel() bli_cntx_get_pack_schema_c() -> bli_cntx_get_pack_schema_c_panel() and updated all instantiations. Also updated the field names in the cntx_t struct. - Comment updates.	2017-12-11 11:58:33 +05:30
Field G. Van Zee	1d728ccb23	Implemented the 1m method. Details: - Implemented the 1m method for inducing complex domain matrix multiplication. 1m support has been added to all level-3 operations, including trsm, and is now the default induced method when native complex domain gemm microkernels are omitted from the configuration. - Updated _cntx_init() operations to take a datatype parameter. This was needed for the corresponding function for 1m (because 1m requires us to choose between column-oriented or row-oriented execution, which requires us to query the context for the storage preference of the gemm microkernel, which requires knowing the datatype) but I decided that it made sense for consistency to add the parameter to all other cntx initialization functions as well, even though those functions don't use the parameter. - Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take a second scalar for each blocksize entry. The semantic meaning of the two scalars now is that the first will scale the default blocksize while the second will scale the maximum blocksize. This allows scaling the two independently, and was needed to support 1m, which requires scaling for a register blocksize but not the register storage blocksize (ie: "packdim") analogue. - Deprecated bli_blksz_reduce_dt_to() and defined two new functions, bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing default and maximum blocksizes to some desired blocksize multiple. These functions are needed in the updated definitions of bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs(). - Added support for the 1e and 1r packing schemas to packm, including 1e/1r packing kernels. 
- Added a minor optimization to bli_gemm_ker_var2() that allows, under certain circumstances (specifically, real domain beta and row- or column-stored matrix C), the real domain macrokernel and microkernel to be called directly, rather than using the virtual microkernel via the complex domain macrokernel, which carries a slight additional amount of overhead. - Added 1m support to the testsuite. - Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified some code in test_gemm.c driver.	2017-12-11 11:55:31 +05:30
Jeff Hammond	0d1b90286e	Never use libm with Intel compilers. Intel compilers include a highly optimized math library (libimf) that should be used instead of GNU libm. Yes, this change is for ALL targets, including those that are not supported by the Intel compiler. There is no harm in doing this, and it is future-proof in the event that the Intel compilers support other architectures.	2017-12-11 11:52:25 +05:30
prangana	d6ef56c6db	Update version number Change-Id: Ib6e52d1d34c0791367ab9152dfab31f94deedeb4 betarelease-0.9	2017-06-01 16:22:23 +05:30
prangana	9d93f8481a	Update Licence File Change-Id: I4c5cf1690d0cef92a68400f9a89e454ab6856ad2	2017-05-30 14:00:03 +05:30
sthangar	42e7f6fb2a	fixed license attribute issues in AMD added files Change-Id: I303f870a777c7cd1c1af29ea0b93f3e0a27948e4	2017-03-31 14:33:02 +05:30
prangana	5600001e97	Fix merge conflicts after sync with release branch Change-Id: Icf14a09f728befb69a73fff9fa79c4128e728310	2017-03-20 14:02:40 +05:30
Kiran Varaganti	0b19029342	Code cleanup, removed warnings from trsm, removed unused routines in axpyv & scalv Change-Id: I02867f394c5f416194c4b1769a6c75f39243ec81	2017-03-14 14:51:31 +05:30
praveeng	825363bd2a	Merge code from master to amd-staging as on 2017_03_08 by praveeng Change-Id: I80740081b2cb54c9b77a3e78b9fe540e170be23d	2017-03-08 15:43:42 +05:30
sthangar	093bdb80c8	Checked in Unpacked DGEMM code Change-Id: I39dcc7b238b328f73ee2675d21a5e521d0488723	2017-03-07 13:35:50 +05:30
Kiran Varaganti	33923da9a1	Added variant 10 for double precision axpyv microkernel Change-Id: I7a20cc113a422603250bc450825c965136354974	2017-03-06 14:31:31 +05:30
Kiran Varaganti	bc828f7f8e	Added new axpyv (single precision) microkernel that performs 10 FMAs per loop. This gives better performance than all other implementations of axpyv. Change-Id: Ic4f0e4c67e367d67d0b24febcf34f81a70a39972	2017-03-03 14:45:35 +05:30
sthangar	c9949f4603	Checked in DGEMMTRSM and edge case handling routine in DDOTXF Change-Id: I65f00661af6c09b2507294fd43e0a10641c0597e	2017-03-01 11:14:34 +05:30
Devin Matthews	513944e4a9	Merge pull request #118 from devinamatthews/master Handle k=0 correctly in KNL dgemm ukernel.	2017-02-20 10:04:33 -05:00
Devin Matthews	0e18f68cf1	Handle k=0 correctly in KNL dgemm ukernel.	2017-02-20 09:03:21 -06:00
Devin Matthews	8b462a0e8c	Merge pull request #117 from devinamatthews/master Cast dim_t and inc_t parameters to 64-bit in KNL microkernels.	2017-02-19 23:03:03 -05:00
Devin Matthews	7d42fc0796	Cast dim_t and inc_t parameters to 64-bit in KNL microkernels.	2017-02-19 21:10:55 -05:00
Kiran Varaganti	04245c9ff7	Reoptimized scalv routines - two vector multiplies are done per iteration, and these routines are enabled in bli_kernel.h Change-Id: Ic5654508573d1f6bde2edef06aefe117e581feb5	2017-02-10 14:24:30 +05:30
Field G. Van Zee	c362afc525	Added missing "level-0" BLAS [sd]cabs1_(). Details: - Fixed issue #115 by adding implementations for scabs1_() and dcabs1_() to the BLAS compatibility layer. Thanks to heroxbd for pointing out their absence.	2017-02-09 11:54:59 -06:00
Field G. Van Zee	018180c938	Fixed a minor bug in configure (issue #114). Details: - Fixed a bug in the configure script whereby a non-preferred value for --enable-threading would cause problems in common.mk vis-a-vis detecting which threading model was chosen. Thanks to heroxbd for reporting this issue.	2017-02-08 11:20:52 -06:00
Kiran Varaganti	58b5b77e5f	Fixed a bug in axpyv, the arguments passed to intrinsic fmad instruction are corrected Change-Id: If12f24c6bc74b22ac9e4acd6b9378e06d79f2f5e	2017-02-08 21:43:34 +05:30
Kiran Varaganti	85de4ebf74	variant 4 axpyv single precision modified: explicitly used FMA intrinsics, replaced vector multiply and add operations Change-Id: I975feef56696d479d2b9e9441b0660021cf4f6ff	2017-02-08 14:41:04 +05:30
Kiran Varaganti	3fa53e8af3	Merged axpyv and gemm small in bli_kernel.h Merge branch 'amd-staging' of ssh://git.amd.com:29418/cpulibraries/er/blis into amd-staging modified: config/zen/bli_kernel.h modified: frame/3/gemm/bli_gemm_front.c modified: kernels/x86_64/zen/3/bli_gemm_small_matrix.c Change-Id: If181cf9345178c448b3530beb8bef453917fe295	2017-02-08 11:51:57 +05:30
sthangar	95be7b0470	Added logic for packing matrix A and prefetching matrix C in Unpacked SGEMM code Change-Id: I99efeca9eb5b4449286ec0ec133fd554ef1bb4f0	2017-02-08 11:24:10 +05:30
Kiran Varaganti	b5291a445b	Added optimization variant 4 for axpyv single precision - this performs 5 FMA per loop, keeping the IPC always full Change-Id: Ie77ed22584271136a257e673bcd3b1ba71136bc9	2017-02-07 12:39:31 +05:30
Kiran Varaganti	f4bfc1662a	New routines implemented for axpyv to improve performance for small vector sizes. Vectorization is done for vectors as small as 8 (single precision) or 4 (double precision). Since this operation has a low compute-to-memory ratio, memory operations dominate at larger sizes, so there is not much gain there. This still needs some work. Added saxpyv and daxpyv var 3 routines in the file bli_axpyv_opt_var1.c. Change-Id: Ic1b33bd5516e10113b00e44ab41b97eb19d46072	2017-02-06 15:04:27 +05:30
Devin Matthews	ddf45e7177	Merge pull request #113 from devinamatthews/knl_thread_params Change default threading parameters for KNL.	2017-01-27 14:25:40 -06:00
Devin Matthews	78e1b16e16	Change default threading parameters for KNL.	2017-01-27 14:22:20 -06:00
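
Commit cc3107ae1c above describes a precedence rule for the threading environment variables: any per-loop setting overrides the global thread count, and unset per-loop values default to 1. A minimal sketch of that rule follows. It is not the BLIS implementation: the four variable names are assumed by expanding the BLIS_NT_[IJ][CR] pattern from the commit message, `resolve_threading` is a hypothetical helper, and the fallback of 1 when nothing is set is an assumption.

```python
# Hypothetical model of the override rule from commit cc3107ae1c.
# Variable names are assumed from the BLIS_NT_[IJ][CR] pattern in the message.
PER_LOOP_VARS = ("BLIS_NT_JC", "BLIS_NT_IC", "BLIS_NT_JR", "BLIS_NT_IR")


def resolve_threading(env):
    """Return the threading request implied by a mapping of environment variables."""
    if any(name in env for name in PER_LOOP_VARS):
        # Any per-loop setting wins; loops that are not set default to 1.
        ways = {name: int(env.get(name, 1)) for name in PER_LOOP_VARS}
        return {"mode": "per-loop", "ways": ways}
    # No per-loop setting: use the global count.
    # Falling back to 1 when it is unset is an assumption of this sketch.
    return {"mode": "global", "total": int(env.get("BLIS_NUM_THREADS", 1))}
```

Passing `os.environ` as `env` would evaluate the rule against the current process environment; a plain dict is enough for experimenting with the precedence.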

891 Commits