amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-11 17:50:00 +00:00

Author	SHA1	Message	Date
Tyler Smith	cefd3d5d20	A couple of functions were incorrectly ifdeffed away on Xeon Phi. Fixed this	2015-02-05 11:09:12 -06:00
Field G. Van Zee	7574c9947d	Added basic flop-counting mechanism (level-3 only). Details: - Added optional flop counting to all level-3 front-ends, which is enabled via BLIS_ENABLE_FLOP_COUNT. The flop count can be reset at any time via bli_flop_count_reset() and queried via bli_flop_count(). Caveats: - flop counts are approximate for her[2]k, syr[2]k, trmm, and trsm operations; - flop counts ignore extra flops due to non-unit alpha; - flop counts do not account for situations where beta is zero.	2015-02-04 12:11:55 -06:00
Field G. Van Zee	ceda4f27d1	Implemented bli_obj_imag_equals(). Details: - Implemented a new function, bli_obj_imag_equals(), which compares the imaginary part of the first argument to the second argument, which may be a BLIS_CONSTANT or of a regular real datatype.	2015-01-29 13:22:54 -06:00
Field G. Van Zee	81114824a0	Minor 4m/3m consolidation to mem_pool_macro_defs.h. Details: - Merged the 4m and 3m definitions in bli_mem_pool_macro_defs.h to reduce code and improve readability.	2015-01-06 12:15:21 -06:00
Tyler Michael Smith	36a9b7b743	reduced the default number of MC by KC blocks for bgq	2014-12-17 21:55:50 +00:00
Field G. Van Zee	c60619c7c3	Minor tweaks for 3m4m test drivers. Details: - Changed gemm_kc blocksizes to be reduced by two-thirds instead of half. - Changed 3m4m/test_gemm.c driver to divide by 3 instead of 2 when computing the fixed k dimension. - Fixed runme.sh so that it would use multiple threads for s/dgemm cases.	2014-12-16 17:08:22 -06:00
Field G. Van Zee	c6929ba6a5	Added 4m_1b to test/3m4m test driver and script.	2014-12-16 11:27:50 -06:00
Field G. Van Zee	785d480805	Merge branch 'master' of github.com:flame/blis	2014-12-12 14:34:19 -06:00
Field G. Van Zee	9456f330af	Added 4m_1b implementation for gemm. Details: - Added yet another 4m-based implementation for complex domain level-3 operations. This method, which the 3m/4m paper identifies as Algorithm "4m_1b" fissures the first loop around the micro-kernel so that the real sub-panel of the current micro-panel of B is multiplied against (both sub-panels of) all micro-panels of A, before doing the same for the imaginary sub-panel of the micro-panel of B. For now, only gemm is supported, and 4m_1b (labeled "4mb" within the framework) is not yet integrated into the test suite.	2014-12-12 14:31:57 -06:00
Field G. Van Zee	4156c0880d	Fixed obscure level-2 packing / general stride bug. Details: - Fixed a bug in certain structured level-2 operations that manifested only when the structured matrix was provided to BLIS as a matrix stored with general stride. The bug was introduced in `c472993b` when the densify field was removed from the packm control tree node and associated APIs. Since then, the packed object was unconditionally marked with an uplo field of BLIS_DENSE. This is fine for level-3 operations where micro-panels are always densified, but in level-2 contexts, the underlying unblocked variant (fused or unfused) of structured operations (e.g. trmv) still needs to know whether to execute its "lower" or "upper" branches of code. Since this field was unconditionally being set to BLIS_DENSE, the unblocked variants always executed the "else" branch, which happened to be the "lower" case code. Thus, running an upper case produced the wrong answer. This most obviously manifested in the form of failures for trmm, trmm3, and trsm in the test suite. The bug was fixed by setting the packed object's uplo field to BLIS_DENSE only if the schema indicated that micro-panels were to be packed. Otherwise, we can assume we are packing to regular row or column storage, as is the case with level-2 packing. Thanks to Francisco Igual for reporting the testsuite failures and ultimately leading us to this bug.	2014-12-09 16:03:14 -06:00
Field G. Van Zee	689f60a578	Merge pull request #21 from figual/master Adding armv8a configuration and micro-kernels.	2014-12-07 14:03:30 -06:00
Francisco D. Igual	483e4d6a3f	Adding armv8a configuration and micro-kernels. Only sgemm micro-kernel is fully functional at this point.	2014-12-07 20:27:49 +01:00
Tyler Smith	bef24e67e0	Fixed a type of race condition exposed by pthreads implementation. Lead thread of the inner thread communicator could exit subproblem, move on the next iteration of the loop and modify a1_pack, b1_pack, or c1_pack while other threads were still using those. Barriers were inserted to fix this.	2014-11-26 18:00:56 -06:00
Field G. Van Zee	76bde44411	Merge branch 'master' of github.com:flame/blis	2014-11-26 17:25:24 -06:00
Tyler Michael Smith	f3d729e504	Added static mutex to bli_init and bli_finalize	2014-11-26 22:25:24 -06:00
Tyler Michael Smith	d71cc79786	Refactored bli_threading files and added support for pthreads	2014-11-26 21:36:39 -06:00
Field G. Van Zee	e56e61438f	Minor cleanups to bli_threading.h and friends. Details: - No longer need to define BLIS_ENABLE_MULTITHREADING manually in bli_config.h; it now gets defined when BLIS_ENABLE_OPENMP or BLIS_ENABLE_PTHREADS is defined. - Added sanity check to prevent both BLIS_ENABLE_OPENMP and BLIS_ENABLE_PTHREADS from being enabled simultaneously. - Reorganization of bli_threading*.h header files, which led to simplification of threading-related part of blis.h. - Added "-fopenmp -lpthread" to LDFLAGS of sandybridge make_defs.mk file.	2014-11-26 17:20:35 -06:00
Field G. Van Zee	3be2744cbe	Update to template gemm ukernel comments. Details: - Updated comments on alignment of a1 and b1 to match wiki.	2014-11-21 12:28:08 -06:00
Field G. Van Zee	994429c688	Merge pull request #20 from TimmyLiu/master #define PASTEF773 required by cblas compatibility layer	2014-11-20 13:55:35 -06:00
Timmy	694029d9d7	#define PASTEF773 required by cblas compatibility layer	2014-11-19 15:25:14 -06:00
Field G. Van Zee	58796abda6	Removed KC constraint comments from _kernel.h files. Details: - Since `4674ca8c`, the constraint that KC be a multiple of both MR and NR has been relaxed, and thus it was time to remove the comments from the top of the bli_kernel.h files of all configurations.	2014-11-06 14:31:52 -06:00
Field G. Van Zee	7bbc95a54f	Added new piledriver micro-kernels. Details: - Added new micro-kernels for the AMD piledriver architecture (one for each datatype). - Updates and tweaks to piledriver configuration. - Added 3xk packm micro-kernel support. - Explicitly unrolled some of the smaller packm micro-kernels. - Added notes to avx/sandybridge and piledriver micro-kernel files acknowledging the influence of the corresponding kernel code in OpenBLAS.	2014-10-29 10:52:23 -05:00
Field G. Van Zee	59613f1d55	Added separate micro-panel alignment for A and B. Details: - Changed the recently-added micro-panel alignment macros so that we now have two sets--one for micro-panels of matrix A and one for micro-panels of matrix B: BLIS_UPANEL_[AB]_ALIGN_SIZE_?. - Store each set of alignment values into a separate blksz_t object in bli_gemm_cntl_init(). - Adjusted packm_init() to use the separate alignment values. - Added query routines for the new alignment values to bli_info.c. - Modified test suite output accordingly.	2014-10-23 17:21:37 -05:00
Field G. Van Zee	a8e12884ee	CHANGELOG update (0.1.6)	2014-10-23 11:35:48 -05:00
Field G. Van Zee	38ea5022e4	Version file update (0.1.6) 0.1.6	2014-10-23 11:35:45 -05:00
Field G. Van Zee	a3e6341bdb	Factored common code from blocksize functions. Details: - Split bli_determine_blocksize_[fb]() into two functions each, the newer ones ending with the _sub suffix. These new sub-functions are now called from bli_[gemm|trmm|trsm]_determine_kc_[fb](), which eliminates redundant code and will allow any future tweaks to the core sub-functions to automatically be inherited by the operation-specific versions.	2014-10-23 11:13:28 -05:00
Field G. Van Zee	4674ca8cff	Extended newly relaxed KC to hemm, symm. Details: - These changes were intended for the previous commit. - Defined bli_gemm_determine_kc_[fb]() and bli_gemm_determine_kc_[fb](), which determine blocksizes for gemm-based operations, taking special care to "nudge" the kc dimension up to a multiple of MR or NR for hemm and symm operations, as needed. - Changed bli_gemm_blk_var3f.c to call bli_gemm_determine_kc_f() instead of bli_determine_blocksize_f(). - Comment updates to bli_trmm_blocksize.c, bli_trsm_blocksize.c.	2014-10-23 10:50:59 -05:00
Field G. Van Zee	ab954ba6f8	Relaxed constraint that KC be multiple of MR, NR. Details: - Relaxed a long-held requirement in register blocksizes that required the kernel programmer to choose a KC that was divisible by both MR and NR. This was very constraining on some architectures that did not use register blocksizes that were powers of two. The constraint is now enforced only for trmm and trsm, where it is needed, and it is now handled by "nudging" kc upward at runtime, if necessary, to be a multiple of MR or NR, as needed. - Defined bli_trmm_determine_kc_[fb]() and bli_trsm_determine_kc_[fb](), which determine blocksizes for trmm and trsm, taking special care to "nudge" the kc dimension up to a multiple of MR or NR, as needed. - Changed bli_trmm_blk_var3[fb].c to call bli_trmm_determine_kc_[fb]() instead of bli_determine_blocksize_[fb](). - Added safeguard to bli_align_dim_to_mult() that returns the dimension unmodified if the dimension multiple is zero (to avoid division by zero). - Removed cpp guard/check for KC % MR == 0 and KC % NR == 0 from bli_kernel_macro_defs.h. - Whitespace, variable name changes to bli_blocksize.c. - Removed old commented code from bli_gemm_cntl.c.	2014-10-23 10:12:27 -05:00
Tyler Smith	95cdae65d6	Fixed bug in KNC microkernel where k=0 and beta != 1	2014-10-22 16:30:16 -05:00
Field G. Van Zee	e64dba5633	Re-implemented micro-panel alignment. Details: - This commit re-implements a feature that was removed in commit `c2b2ab62`. It was removed because, at the time, I wasn't sure how the micro-panel alignment feature would interact with the 4m method (when applied at the micro-kernel level), and so it seemed safer to disable the feature entirely rather than allow possible breakage. This commit revisits the issue and safely re-implements the feature in a way that is compatible with 4m, 3m, 4mh, and 3mh (and native execution). - Modified the static memory pool to account for micro-panel alignment space. - Modified packm_init and blocked variants to align whole micro-panels by a datatype-specific alignment value that may be set by the configuration. (If it is not set by the configuration, it will default to BLIS_SIZEOF_?.) - Modified macro-kernels so that: - storage stride is handled properly given the new micro-panel alignment behavior; - indexing through 3m/4m/rih-type sub-panels, as is done by trmm and trsm, is more robust (e.g. will work if the applicable packing register blocksize is odd); - imaginary strides are computed and stored within auxinfo_t structs, which allows the virtual micro-kernels to more easily determine how to index into the micro-panel operands. - Modified virtual 3m and 4m micro-kernels to use the imaginary strides within the auxinfo_t structs instead of panel strides. - Deprecated the panel stride fields from the auxinfo_t structs. - Updated test suite to print out the micro-panel alignment values.	2014-10-20 19:23:06 -05:00
Field G. Van Zee	add16b0e54	Added 3m4m test driver subdir of 'test'. Details: - Added a modified test driver for [cz]gemm that will test all 3m/4m as well as assembly-based and OpenBLAS implementations of gemm in single and multithreaded modes.	2014-10-17 11:49:24 -05:00
Field G. Van Zee	e171504a72	Use correct definition of bli_is_last_iter(). Details: - As intended for previous commit, the new definition of bli_is_last_iter() is now disabled in favor of the old definition.	2014-10-17 11:25:59 -05:00
Field G. Van Zee	0d954087b2	Minor changes and fixes. Details: - Redefined bli_is_last_iter() to take thread_id and num_thread arguments, which allows the macro to correctly compute whether a given iteration is the last that the thread will compute in that particular loop. The new definition, however, remains disabled (commented out) until someone can look at this more closely, as the new definition seems to actually hurt performance slightly. - Whitespace and related updates to level-3 macro-kernels. - Updated test suite so that performance results in the hundreds of gigaflops do not disrupt the column alignment of the output.	2014-10-17 11:19:34 -05:00
Field G. Van Zee	d1e86e1876	More minor tweaks to sandybridge/avx micro-kernel. Details: - Re-enabled use of b_next for dgemm and cgemm micro-kernels.	2014-10-12 13:43:47 -05:00
Field G. Van Zee	7b6fe4cae5	Minor tweaks to sandybridge/avx micro-kernels. Details: - Changed the MC blocksize for zgemm micro-kernel from 128 to 64. - Removed usage of b_next in all x86_64/avx gemm micro-kernels.	2014-10-12 12:01:51 -05:00
Field G. Van Zee	a6a156e9fe	Added cgemm ukernel for avx/sandybridge. Details: - Implemented AVX-based cgemm micro-kernel (via GNU extended inline assembly syntax). - Updated sandybridge configuration accordingly.	2014-10-10 14:26:41 -05:00
Field G. Van Zee	6f8575ab25	Added zgemm ukernel for avx/sandybridge. Details: - Implemented AVX-based zgemm micro-kernel (via GNU extended inline assembly syntax). - Updated sandybridge configuration accordingly.	2014-10-10 10:01:45 -05:00
Field G. Van Zee	23ce7ee542	Merge branch 'master' of github.com:flame/blis	2014-10-09 16:41:22 -05:00
Field G. Van Zee	99fd9a3971	Fixed two minor bugs. Details: - Fixed a bug in the test suite for the trsm_ukr and gemmtrsm_ukr test modules whereby the uplo bits of some packed matrix objects were not being set properly, resulting in false FAILURE results for those tests. Thanks to Tyler Smith for bringing this issue to my attention. - Fixed a bug in bli_obj_alloc_buffer() that caused an unnecessary "not yet implemented" abort() when creating a 1x1 object with non-unit strides.	2014-10-09 16:38:04 -05:00
Tyler Smith	7a8ad47fb2	Minor changes to knc configuration, including a preference for row-major storage. Also fixed a bug in the knc micro-kernel where it would fail if k == 0	2014-10-08 15:52:13 -05:00
Field G. Van Zee	76b7c34af0	Fixed a bug in the pack schema-related bit macros. Details: - Expanded the BLIS_PACK_SCHEMA_BITS value in bli_type_defs.h to include all six bits presently used in the pack schema bitfield of the info field of obj_t structs. Prior to this commit, the macro constant only included the lowest five bits, which excluded the "is or is not packed" bit. This manifested as a strange bug in probably many level-2 codes that invoked packing, though we only observed it in ger before fixing. Thanks to Devin Matthews for finding and reporting this bug.	2014-10-02 14:15:38 -05:00
Field G. Van Zee	a5763e3322	Added extra output to bli_obj_print(). Details: - Print extra values from info field of obj_t struct within bli_obj_print().	2014-10-02 13:28:17 -05:00
Tyler Smith	9bba209fc4	Fixed bug when packing anywhere besides in blk_var_1 for gemm.	2014-09-29 14:56:36 -05:00
Tyler Smith	614a4afc92	Merge branch 'master' of http://github.com/flame/blis	2014-09-26 10:49:57 -05:00
Field G. Van Zee	4a7df04e8a	Added 30xk support for packm ukernels. Details: - Updated bli_kernel__macro_defs.h headers to include default definitions for 30xk packm kernels. - Extended function pointer arrays in bli_packm_cxk_() out to 31 and included 30xk kernels. - Added 30xk kernels to frame/1m/packm/ukernels/bli_packm_ref_cxk_*.c.	2014-09-22 16:06:15 -05:00
Field G. Van Zee	b6d4bd792e	Fixed missing tabs from Makefile patch.	2014-09-22 16:02:37 -05:00
Field G. Van Zee	32630f9b6f	Comment update to virtual micro-kernels.	2014-09-19 17:18:20 -05:00
Field G. Van Zee	13447cffea	Minor bugfix to top-level Makefile. Details: - Applied a patch that allows the top-level Makefile to work on certain systems. The patch simply separates out the source-to-object code generation rules for .c and .S files into two separate rules. Thanks to Devin Matthews for submitting this patch.	2014-09-19 13:00:48 -05:00
Field G. Van Zee	e80a453784	Fixed bug introduced by bugfix in `25b258d`. Details: - We actually need to check alignment of lda*sizeof(double) and NOT a+lda because in the latter case, alignment could cancel out and still allow the optimized code to run when it shouldn't. Thanks to Devin for pointing this out.	2014-09-18 10:24:20 -05:00
Field G. Van Zee	25b258d61f	Fixed a non-fatal problem with bugfix in `a68b316c`. Details: - The bugfix in `a68b316c` was inadvertently checking alignment of the leading dimension itself, rather than the byte size of the leading dimension. Now, we simply check alignment of a+lda.	2014-09-18 10:10:49 -05:00
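
Note on the KC "nudging" described in ab954ba6f8 and 4674ca8cff: the sketch below illustrates the idea of rounding a kc blocksize up to a multiple of MR or NR, including the zero-multiple safeguard that the log mentions for bli_align_dim_to_mult(). It is not taken from the BLIS source. The names align_dim_to_mult(), nudge_kc(), and check_examples(), as well as the choice to clamp kc to the remaining k dimension before rounding, are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not the BLIS implementation): round dim up to the
   next multiple of mult. If mult is zero, return dim unmodified so that
   no division by zero can occur. */
size_t align_dim_to_mult(size_t dim, size_t mult)
{
    if (mult == 0)
        return dim;
    return ((dim + mult - 1) / mult) * mult;
}

/* Hypothetical helper: pick a kc for one iteration of a trmm/trsm-like
   loop. kc_default is the configured cache blocksize, k_left is the part
   of the k dimension not yet processed, and mult is MR or NR. The result
   may exceed k_left; a real caller would need packing that tolerates the
   padding. */
size_t nudge_kc(size_t kc_default, size_t k_left, size_t mult)
{
    size_t kc = (k_left < kc_default) ? k_left : kc_default;
    return align_dim_to_mult(kc, mult);
}

/* Small self-check of the arithmetic above. */
void check_examples(void)
{
    assert(align_dim_to_mult(250, 6) == 252); /* rounded up            */
    assert(align_dim_to_mult(252, 6) == 252); /* already a multiple    */
    assert(align_dim_to_mult(250, 0) == 250); /* zero multiple: as-is  */
    assert(nudge_kc(256, 1000, 6) == 258);    /* default kc nudged up  */
    assert(nudge_kc(256, 100, 8) == 104);     /* edge case nudged up   */
}
```

With a register blocksize such as 6, which is not a power of two, a configured kc of 256 no longer has to be chosen as a multiple of 6 ahead of time; under the assumptions above it would simply be raised to 258 at run time.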

494 Commits