amd/blis

mirror of https://github.com/amd/blis.git synced 2026-05-11 09:39:59 +00:00

Author	SHA1	Message	Date
Field G. Van Zee	4f61528d56	Added 1m-specific APIs for bp, pb gemm algorithms. Details: - Defined bli_gemmbp_cntl_create(), bli_gemmpb_cntl_create(), with the body of bli_gemm_cntl_create() replaced with a call to the former. - Defined bli_cntl_free_w_thrinfo(), bli_cntl_free_wo_thrinfo(). Now, bli_cntl_free() can check if the thread parameter is NULL, and if so, call the latter, and otherwise call the former. - Defined bli_gemm1mbp_cntx_init(), bli_gemm1mpb_cntx_init(), both in terms of bli_gemm1mxx_cntx_init(), which behaves the same as bli_gemm1m_cntx_init() did before, except that an extra bool parameter (is_pb) is used to support both bp and pb algorithms (including to support the anti-preference field described below). - Added support for "anti-preference" in context. The anti_pref field, when true, will toggle the boolean return value of routines such as bli_cntx_l3_ukr_eff_prefers_storage_of(), which has the net effect of causing BLIS to transpose the operation to achieve disagreement (rather than agreement) between the storage of C and the micro-kernel output preference. This disagreement is needed for panel-block implementations, since they induce a transposition of the suboperation immediately before the macro-kernel is called, which changes the apparent storage of C. For now, anti-preference is used only with the pb algorithm for 1m (and not with any other non-1m implementation). - Defined new functions, bli_cntx_l3_ukr_eff_prefers_storage_of() bli_cntx_l3_ukr_eff_dislikes_storage_of() bli_cntx_l3_nat_ukr_eff_prefers_storage_of() bli_cntx_l3_nat_ukr_eff_dislikes_storage_of() which are identical to their non-"eff" (effectively) counterparts except that they take the anti-preference field of the context into account. - Explicitly initialize the anti-pref field to FALSE in bli_gks_cntx_set_l3_nat_ukr_prefs(). - Added bli_gemm_ker_var1.c, which implements a panel-block macro-kernel in terms of the existing block-panel macro-kernel _ker_var2(). This technique requires inducing transposes on all operands and swapping the A and B. - Changed bli_obj_induce_trans() macro so that pack-related fields are also changed to reflect the induced transposition. - Added a temporary hack to bli_l3_3m4m1m_oapi.c that allows us to easily specify the 1m algorithm (block-panel or panel-block). - Renamed the following cntx_t-related macros: bli_cntx_get_pack_schema_a() -> bli_cntx_get_pack_schema_a_block() bli_cntx_get_pack_schema_b() -> bli_cntx_get_pack_schema_b_panel() bli_cntx_get_pack_schema_c() -> bli_cntx_get_pack_schema_c_panel() and updated all instantiations. Also updated the field names in the cntx_t struct. - Comment updates.	2017-12-11 11:58:33 +05:30
Field G. Van Zee	1d728ccb23	Implemented the 1m method. Details: - Implemented the 1m method for inducing complex domain matrix multiplication. 1m support has been added to all level-3 operations, including trsm, and is now the default induced method when native complex domain gemm microkernels are omitted from the configuration. - Updated _cntx_init() operations to take a datatype parameter. This was needed for the corresponding function for 1m (because 1m requires us to choose between column-oriented or row-oriented execution, which requires us to query the context for the storage preference of the gemm microkernel, which requires knowing the datatype) but I decided that it made sense for consistency to add the parameter to all other cntx initialization functions as well, even though those functions don't use the parameter. - Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take a second scalar for each blocksize entry. The semantic meaning of the two scalars now is that the first will scale the default blocksize while the second will scale the maximum blocksize. This allows scaling the two independently, and was needed to support 1m, which requires scaling for a register blocksize but not the register storage blocksize (ie: "packdim") analogue. - Deprecated bli_blksz_reduce_dt_to() and defined two new functions, bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing default and maximum blocksizes to some desired blocksize multiple. These functions are needed in the updated definitions of bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs(). - Added support for the 1e and 1r packing schemas to packm, including 1e/1r packing kernels. - Added a minor optimization to bli_gemm_ker_var2() that allows, under certain circumstances (specifically, real domain beta and row- or column-stored matrix C), the real domain macrokernel and microkernel to be called directly, rather than using the virtual microkernel via the complex domain macrokernel, which carries a slight additional amount of overhead. - Added 1m support to the testsuite. - Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified some code in test_gemm.c driver.	2017-12-11 11:55:31 +05:30
Jeff Hammond	0d1b90286e	never use libm with Intel compilers Intel compilers include a highly optimized math library (libimf) that should be used instead of GNU libm. yes, this change is for ALL targets, including those that are not supported by the Intel compiler. there is no harm in doing this, and it is future-proof in the event that the Intel compilers support other architectures.	2017-12-11 11:52:25 +05:30
prangana	d6ef56c6db	Update version number Change-Id: Ib6e52d1d34c0791367ab9152dfab31f94deedeb4 betarelease-0.9	2017-06-01 16:22:23 +05:30
prangana	9d93f8481a	Update Licence File Change-Id: I4c5cf1690d0cef92a68400f9a89e454ab6856ad2	2017-05-30 14:00:03 +05:30
sthangar	42e7f6fb2a	fixed license attribute issues in AMD added files Change-Id: I303f870a777c7cd1c1af29ea0b93f3e0a27948e4	2017-03-31 14:33:02 +05:30
prangana	5600001e97	Fix merge conflicts after sync with release branch Change-Id: Icf14a09f728befb69a73fff9fa79c4128e728310	2017-03-20 14:02:40 +05:30
Kiran Varaganti	0b19029342	Code cleanup, removed warnings from trsm, removed unused routines in axpyv & scalv Change-Id: I02867f394c5f416194c4b1769a6c75f39243ec81	2017-03-14 14:51:31 +05:30
praveeng	825363bd2a	Merge code from master to amd-staging as on 2017_03_08 by praveeng Change-Id: I80740081b2cb54c9b77a3e78b9fe540e170be23d	2017-03-08 15:43:42 +05:30
sthangar	093bdb80c8	Checked in Unpacked DGEMM code Change-Id: I39dcc7b238b328f73ee2675d21a5e521d0488723	2017-03-07 13:35:50 +05:30
Kiran Varaganti	33923da9a1	Added variant 10 for double precision axpyv microkernel Change-Id: I7a20cc113a422603250bc450825c965136354974	2017-03-06 14:31:31 +05:30
Kiran Varaganti	bc828f7f8e	Added new axpyv (single precision) microkernel where it performs 10 FMAs per loop- This gives better performance than all other implementations of axpyv Change-Id: Ic4f0e4c67e367d67d0b24febcf34f81a70a39972	2017-03-03 14:45:35 +05:30
sthangar	c9949f4603	Checked in DGEMMTRSM and edge case handling routine in DDOTXF Change-Id: I65f00661af6c09b2507294fd43e0a10641c0597e	2017-03-01 11:14:34 +05:30
Devin Matthews	513944e4a9	Merge pull request #118 from devinamatthews/master Handle k=0 correctly in KNL dgemm ukernel.	2017-02-20 10:04:33 -05:00
Devin Matthews	0e18f68cf1	Handle k=0 correctly in KNL dgemm ukernel.	2017-02-20 09:03:21 -06:00
Devin Matthews	8b462a0e8c	Merge pull request #117 from devinamatthews/master Cast dim_t and inc_t parameters to 64-bit in KNL microkernels.	2017-02-19 23:03:03 -05:00
Devin Matthews	7d42fc0796	Cast dim_t and inc_t parameters to 64-bit in KNL microkernels.	2017-02-19 21:10:55 -05:00
Kiran Varaganti	04245c9ff7	Reoptimized scalv routines - two vector multiplies are done per iteration, and these routines are enabled in bli_kernel.h Change-Id: Ic5654508573d1f6bde2edef06aefe117e581feb5	2017-02-10 14:24:30 +05:30
Field G. Van Zee	c362afc525	Added missing "level-0" BLAS [sd]cabs1_(). Details: - Fixed issue #115 by adding implementations for scabs1_() and dcabs1_() to the BLAS compatibility layer. Thanks to heroxbd for pointing out their absence.	2017-02-09 11:54:59 -06:00
Field G. Van Zee	018180c938	Fixed a minor bug in configure (issue #114 ). Details: - Fixed a bug in the configure script whereby a non-preferred value for --enable-threading would cause problems in common.mk vis-a-vis detecting which threading model was chosen. Thanks to heroxbd for reporting this issue.	2017-02-08 11:20:52 -06:00
Kiran Varaganti	58b5b77e5f	Fixed a bug in axpyv, the arguments passed to intrinsic fmad instruction are corrected Change-Id: If12f24c6bc74b22ac9e4acd6b9378e06d79f2f5e	2017-02-08 21:43:34 +05:30
Kiran Varaganti	85de4ebf74	variant 4 axpyv single precision modified: explicitly used FMA intrinsics, replaced vector multiply and add operations Change-Id: I975feef56696d479d2b9e9441b0660021cf4f6ff	2017-02-08 14:41:04 +05:30
Kiran Varaganti	3fa53e8af3	Merged axpyv and gemm small in bli_kernel.h Merge branch 'amd-staging' of ssh://git.amd.com:29418/cpulibraries/er/blis into amd-staging modified: config/zen/bli_kernel.h modified: frame/3/gemm/bli_gemm_front.c modified: kernels/x86_64/zen/3/bli_gemm_small_matrix.c Change-Id: If181cf9345178c448b3530beb8bef453917fe295	2017-02-08 11:51:57 +05:30
sthangar	95be7b0470	Added logic for packing matrix A and prefetching matrix C in Unpacked SGEMM code Change-Id: I99efeca9eb5b4449286ec0ec133fd554ef1bb4f0	2017-02-08 11:24:10 +05:30
Kiran Varaganti	b5291a445b	Added optimization variant 4 for axpyv single precision - this performs 5 FMA per loop, keeping the IPC always full Change-Id: Ie77ed22584271136a257e673bcd3b1ba71136bc9	2017-02-07 12:39:31 +05:30
Kiran Varaganti	f4bfc1662a	New routines implemented for axpyv to improve performance for small vector sizes, vectorization is done for vectors as small as 8 (single precision) 4(double precision), since this operation has low compute to memory ratio, higher matrix sizes memory operations are dominating and hence not much gain - This still needs some work- added saxpyv and daxpyv var 3 routines in the file bli_axpyv_opt_var1.c Change-Id: Ic1b33bd5516e10113b00e44ab41b97eb19d46072	2017-02-06 15:04:27 +05:30
Devin Matthews	ddf45e7177	Merge pull request #113 from devinamatthews/knl_thread_params Change default threading parameters for KNL.	2017-01-27 14:25:40 -06:00
Devin Matthews	78e1b16e16	Change default threading parameters for KNL.	2017-01-27 14:22:20 -06:00
sthangar	574472ba5a	checked in unpacked SGEMM optimization Change-Id: I8e4ea374415c0c402c660b656fb076af15354181	2017-01-27 14:32:02 +05:30
praveeng	41595e98ee	Merge master code as on 2016_12_07 to amd-staging Change-Id: I5d9ecef9bff960aeb9b51ca4e4b21714e789e44f	2016-12-07 15:14:02 +05:30
sthangar	d625c49e20	checked-in SGEMMTRSM microkernel for Zen Change-Id: Ib61936418dea911b2154aa99f703b66e9669f94f	2016-12-01 16:17:09 +05:30
Field G. Van Zee	a6ab91bc61	Merge pull request #111 from figual/master Fixed missing cntx argument in ARMv8 microkernels.	2016-11-30 09:26:58 -06:00
Francisco Igual	7f31a6307b	Fixed missing cntx argument in ARMv8 microkernels.	2016-11-27 14:40:47 +01:00
praveeng	d8f13beeea	Merge master code till 2016_11_25 to amd-staging	2016-11-25 17:31:08 +05:30
praveeng	c25a9205fd	Merge master code till Switched to simpler trsm_r 2016_11_25 to amd-staging Change-Id: Ibf71d224d8fb6cf0bc497f84d50c27d276512cc1	2016-11-25 17:08:22 +05:30
Field G. Van Zee	145a551d52	Switched to simpler trsm_r implementation. Details: - Disabled the implementation of trsm_r that allows the right-hand matrix B to be triangular, and switched to the implementation that simply transposes the operation (and thus the storage of C) in order to recast the operation as trsm_l. This avoids the need to use trsm_rl and trsm_ru macrokernels, which require an awkward swapping of MR and NR. For now, the support for trsm_r macrokernels, via separate control trees, remains. - Modified bli_config_macro_defs.h so that BLIS_RELAX_MCNR_NCMR_CONSTRAINTS is defined by default. This is mostly a safety precaution in case someone tries to switch back to the previous trsm_r implementation, but also serves as a convenience on some systems where one does not naturally choose blocksizes in a way that satisfies MC % NR = 0 and NC % MR = 0.	2016-11-23 17:59:06 -06:00
Field G. Van Zee	b3e58ee303	Reimplemented 4x12 haswell ukernels (real only). Details: - Replaced permutation-based implementations in bli_gemm_asm_d4x12.c, which defines 4x24 single real and 4x12 double real gemm microkernels, with broadcast-based implementations. (The previous microkernel file has been moved to an 'old' subdirectory.)	2016-11-23 17:58:26 -06:00
sthangar	65298762ff	removed a redundant copy operation in DNRM2 Change-Id: I673b08efde4480e871779716f7715566740ad9ce	2016-11-22 12:15:33 +05:30
sthangar	d6863e851a	checked-in DNRM2 optimizations Change-Id: I3b31d768bd7f4fbf43042aa5a0762995c73c4522	2016-11-21 11:30:30 +05:30
Field G. Van Zee	bdc0a264d2	Adjusted stride selection of ct in macrokernels. Details: - Updated the changes introduced in `618f433` so that the strides of the temporary microtile ct used in the macrokernels is determined based on the storage preference of the microkernel (via the new functions below), rather than the strides of c. In almost all cases, presently, this change results in no net effect, as a high-level optimization in the _front() functions aligns the storage of c to that of the microkernel's preference. However, I encountered some cases where this is not always the case in some development code that has yet to be committed, and therefore I'm generalizing the framework code in advance. - Defined two new functions in bli_cntx.c: bli_cntx_l3_ukr_prefers_rows_dt() bli_cntx_l3_ukr_prefers_cols_dt() which return bool_t's based on the current micro-kernel's storage preferences. For induced methods, the preference of the underlying real domain microkernel is returned. - Updated definition of bli_cntx_l3_ukr_dislikes_storage_of(), and by proxy bli_cntx_l3_ukr_prefers_storage_of(), to be in terms of the above functions, rather than querying the preferences of the native microkernel directly (which did the wrong thing for induced methods).	2016-11-16 14:13:08 -06:00
Field G. Van Zee	031978d264	Fixed inactive trsm_r blocksize constraint code. Details: - Changed a cpp macro that was meant to prevent using certain trsm_r code if BLIS_RELAX_MCNR_NCMR_CONSTRAINTS was defined. It was actually coded incorrectly at first. I've now fixed its location and changed its consequence to a compile-time #error message.	2016-11-16 14:04:33 -06:00
sthangar	9772218cae	Added optimized DAMAX routines for Zen Change-Id: I499c0c8f0f4ce6c19235c47b86d5608db6ba50f8	2016-11-16 15:19:19 +05:30
Santanu Thangaraj	9c448e3017	Merge "Added new optimized micro-kernel for dotxv routine" into amd-staging	2016-11-16 04:18:57 -05:00
praveeng	998d824044	Merge master code till devinamatthews/omp_num_thrds 2016_11_16 to amd-staging Change-Id: I601ff1d3ec8a680e1be039ffc7b299744e8a27c5	2016-11-16 14:24:15 +05:30
Field G. Van Zee	6b5a4032d2	Merge pull request #109 from devinamatthews/omp_num_threads Add automatic loop thread assignment.	2016-11-10 15:28:24 -06:00
Devin Matthews	a8220e3a86	- Fix typo in bli_cntx.c - Bump BLIS_DEFAULT_NR_THREAD_MAX to 4	2016-11-10 14:19:34 -06:00
Kiran Varaganti	e35d3c23f2	Added new optimized micro-kernel for dotxv routine Change-Id: I2c544e9b25a454d971ad690353502a55cd668391	2016-11-10 14:30:53 +05:30
praveeng	0d13e9a4f6	bli_kernel.h Change-Id: I425d089f79497a0de7d1622e829c3ca9edf7f091	2016-11-07 14:40:41 +05:30
Devin Matthews	c05b3862f6	Add automatic loop thread assignment. - Number of threads is determined by BLIS_NUM_THREADS or OMP_NUM_THREADS, but can be overridden by BLIS_XX_NT as before. - Threads are assigned to loops (ic, jc, ir, and jr) automatically by weighted partitioning and heuristics, both of which are tunable via bli_kernel.h. - All level-3 BLAS covered.	2016-11-04 15:48:02 -05:00
Field G. Van Zee	3b524a08e3	Consolidated 3m1/4m1 gemmtrsm, trsm ukernel code. Details: - Consolidated the macros that define the lower and upper versions of the gemmtrsm microkernels into a single macro that is instantiated twice. Did this for both 3m1 and 4m1 microkernels. - Consolidated lower and upper versions of the trsm microkernels for 3m1 and 4m1 into single files (each).	2016-11-02 17:45:18 -05:00

869 Commits