amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-11 17:50:00 +00:00

Author	SHA1	Message	Date
Field G. Van Zee	8f739cc847	Added API to set mt environment variables. Details: - Renamed bli_env_get_nway() -> bli_thread_get_env(). - Added bli_thread_set_env() to allow setting environment variables pertaining to multithreading, such as BLIS_JC_NT or BLIS_NUM_THREADS. - Added the following convenience wrapper routines: bli_thread_get_jc_nt() bli_thread_get_ic_nt() bli_thread_get_jr_nt() bli_thread_get_ir_nt() bli_thread_get_num_threads() bli_thread_set_jc_nt() bli_thread_set_ic_nt() bli_thread_set_jr_nt() bli_thread_set_ir_nt() bli_thread_set_num_threads() - Added #include "errno.h" to bli_system.h. - This commit addresses issue #140. - Thanks to Chris Goodyer for inspiring these updates.	2017-12-11 12:08:58 +05:30
Marat Dukhan	1016383307	Fix Emscripten builds	2017-12-11 12:08:58 +05:30
Minh Quan HO	c09b30d115	set missing free_fp in bli_membrk_init for free-ing GEN_USE buffers The membrk's free_fp is called when releasing GEN_USE buffers, but this free_fp is not set in bli_membrk_init	2017-12-11 12:08:58 +05:30
sthangar	997628ed97	Reducing the framework overhead of GEMV routines Change-Id: I83607ad767bff74e305e915b54b0ea34ec3e5684	2017-12-11 12:08:58 +05:30
Field G. Van Zee	abaeaa68ea	Fixed a bug in norm1v, norm1m. Details: - Fixed a bug that manifested as improperly computed 1-norms for vectors and matrices. This is one of the few operations in BLIS that does not have its own test module within the testsuite, hence why it went undetected for so long. The bad 1-norms were being used to normalize matrices in the testsuite after initialization, which led to some matrices containing a combination of "large" and "small" values. This tended to push the residuals computed after each test away from zero. In some cases, they were off just enough for the testsuite to label it a "failure". Many thanks to Jeff Hammond for reporting this bug. (Wonky details: the bug was due to improperly-defined level-0 scalar macros for abval2, an operation that computes the absolute square, or complex magnitude/modulus. Certain complex domain instances of abval2 were being incorrectly defined in terms of real-only solutions, leading to bad results. This level-0 operation forms the basis of norm1v/norm1m. absq2 was also affected, but almost nothing uses this operation.)	2017-12-11 12:05:22 +05:30
Devin Matthews	cc3107ae1c	Setting any one of BLIS_NT_[IJ][CR] overrides BLIS_NUM_THREADS. Missing BLIS_NT_XX's are defaulted to 1. Fixes #123.	2017-12-11 12:05:22 +05:30
Field G. Van Zee	5ca3863220	Fixed a trsm1m bug that affected right-side cases. Details: - Fixed a bug introduced in `1c732d3` that affected trsm1m_r. The result was nondeterministic behavior (usually segmentation faults) for certain problem sizes beyond the 1m instance of kc (e.g. 128 on haswell). The cause of the bug was my commenting out lines in bli_gemm1m_ukr_ref.c which explicitly directed the virtual gemm micro-kernel to use temporary space if the storage preference of the [real domain] gemm ukernel did not match the storage of the output matrix C. In the context of gemm, this handling is not needed because agreement between the storage pref and the matrix is guaranteed by a high-level optimization in BLIS. However, this optimization is not applied to trsm because the storage of C is not necessarily the same as the storage of the micro-panels of B--both of which are updated by the micro-kernel during a trsm operation. Thus, the guarantee of storage/preference agreement is not in place for trsm, which means we must handle that case within the virtual gemm micro-kernel. - Comment updates and a minor macro change to bli_trsm*_cntx_init() for 3m1, 4m1a, and 1m.	2017-12-11 12:03:07 +05:30
Field G. Van Zee	e3eb01f6b9	Disabled experiment-related 1m code. Details: - Commented out code in frame/ind/oapi/bli_l3_3m4m1m_oapi.c that was specifically inserted to facilitate the benchmarking of 1m block-panel and panel-block algorithms. - Updates to test/3m4m/Makefile, runme.sh script, and test_gemm.c to reflect changes used/needed during benchmarking.	2017-12-11 11:58:33 +05:30
Field G. Van Zee	4f61528d56	Added 1m-specific APIs for bp, pb gemm algorithms. Details: - Defined bli_gemmbp_cntl_create(), bli_gemmpb_cntl_create(), with the body of bli_gemm_cntl_create() replaced with a call to the former. - Defined bli_cntl_free_w_thrinfo(), bli_cntl_free_wo_thrinfo(). Now, bli_cntl_free() can check if the thread parameter is NULL, and if so, call the latter, and otherwise call the former. - Defined bli_gemm1mbp_cntx_init(), bli_gemm1mpb_cntx_init(), both in terms of bli_gemm1mxx_cntx_init(), which behaves the same as bli_gemm1m_cntx_init() did before, except that an extra bool parameter (is_pb) is used to support both bp and pb algorithms (including to support the anti-preference field described below). - Added support for "anti-preference" in context. The anti_pref field, when true, will toggle the boolean return value of routines such as bli_cntx_l3_ukr_eff_prefers_storage_of(), which has the net effect of causing BLIS to transpose the operation to achieve disagreement (rather than agreement) between the storage of C and the micro-kernel output preference. This disagreement is needed for panel-block implementations, since they induce a transposition of the suboperation immediately before the macro-kernel is called, which changes the apparent storage of C. For now, anti-preference is used only with the pb algorithm for 1m (and not with any other non-1m implementation). - Defined new functions, bli_cntx_l3_ukr_eff_prefers_storage_of() bli_cntx_l3_ukr_eff_dislikes_storage_of() bli_cntx_l3_nat_ukr_eff_prefers_storage_of() bli_cntx_l3_nat_ukr_eff_dislikes_storage_of() which are identical to their non-"eff" (effectively) counterparts except that they take the anti-preference field of the context into account. - Explicitly initialize the anti-pref field to FALSE in bli_gks_cntx_set_l3_nat_ukr_prefs(). - Added bli_gemm_ker_var1.c, which implements a panel-block macro-kernel in terms of the existing block-panel macro-kernel _ker_var2(). 
This technique requires inducing transposes on all operands and swapping the A and B. - Changed bli_obj_induce_trans() macro so that pack-related fields are also changed to reflect the induced transposition. - Added a temporary hack to bli_l3_3m4m1m_oapi.c that allows us to easily specify the 1m algorithm (block-panel or panel-block). - Renamed the following cntx_t-related macros: bli_cntx_get_pack_schema_a() -> bli_cntx_get_pack_schema_a_block() bli_cntx_get_pack_schema_b() -> bli_cntx_get_pack_schema_b_panel() bli_cntx_get_pack_schema_c() -> bli_cntx_get_pack_schema_c_panel() and updated all instantiations. Also updated the field names in the cntx_t struct. - Comment updates.	2017-12-11 11:58:33 +05:30
Field G. Van Zee	1d728ccb23	Implemented the 1m method. Details: - Implemented the 1m method for inducing complex domain matrix multiplication. 1m support has been added to all level-3 operations, including trsm, and is now the default induced method when native complex domain gemm microkernels are omitted from the configuration. - Updated _cntx_init() operations to take a datatype parameter. This was needed for the corresponding function for 1m (because 1m requires us to choose between column-oriented or row-oriented execution, which requires us to query the context for the storage preference of the gemm microkernel, which requires knowing the datatype) but I decided that it made sense for consistency to add the parameter to all other cntx initialization functions as well, even though those functions don't use the parameter. - Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take a second scalar for each blocksize entry. The semantic meaning of the two scalars now is that the first will scale the default blocksize while the second will scale the maximum blocksize. This allows scaling the two independently, and was needed to support 1m, which requires scaling for a register blocksize but not the register storage blocksize (ie: "packdim") analogue. - Deprecated bli_blksz_reduce_dt_to() and defined two new functions, bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing default and maximum blocksizes to some desired blocksize multiple. These functions are needed in the updated definitions of bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs(). - Added support for the 1e and 1r packing schemas to packm, including 1e/1r packing kernels. 
- Added a minor optimization to bli_gemm_ker_var2() that allows, under certain circumstances (specifically, real domain beta and row- or column-stored matrix C), the real domain macrokernel and microkernel to be called directly, rather than using the virtual microkernel via the complex domain macrokernel, which carries a slight additional amount of overhead. - Added 1m support to the testsuite. - Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified some code in test_gemm.c driver.	2017-12-11 11:55:31 +05:30
praveeng	825363bd2a	Merge code from master to amd-staging as on 2017_03_08 by praveeng Change-Id: I80740081b2cb54c9b77a3e78b9fe540e170be23d	2017-03-08 15:43:42 +05:30
Field G. Van Zee	c362afc525	Added missing "level-0" BLAS [sd]cabs1_(). Details: - Fixed issue #115 by adding implementations for scabs1_() and dcabs1_() to the BLAS compatibility layer. Thanks to heroxbd for pointing out their absence.	2017-02-09 11:54:59 -06:00
sthangar	95be7b0470	Added logic for packing matrix A and prefetching matrix C in Unpacked SGEMM code Change-Id: I99efeca9eb5b4449286ec0ec133fd554ef1bb4f0	2017-02-08 11:24:10 +05:30
sthangar	574472ba5a	checked in unpacked SGEMM optimization Change-Id: I8e4ea374415c0c402c660b656fb076af15354181	2017-01-27 14:32:02 +05:30
praveeng	d8f13beeea	Merge master code till 2016_11_25 to amd-staging	2016-11-25 17:31:08 +05:30
praveeng	c25a9205fd	Merge master code till Switched to simpler trsm_r 2016_11_25 to amd-staging Change-Id: Ibf71d224d8fb6cf0bc497f84d50c27d276512cc1	2016-11-25 17:08:22 +05:30
Field G. Van Zee	145a551d52	Switched to simpler trsm_r implementation. Details: - Disabled the implementation of trsm_r that allows the right-hand matrix B to be triangular, and switched to the implementation that simply transposes the operation (and thus the storage of C) in order to recast the operation as trsm_l. This avoids the need to use trsm_rl and trsm_ru macrokernels, which require an awkward swapping of MR and NR. For now, the support for trsm_r macrokernels, via separate control trees, remains. - Modified bli_config_macro_defs.h so that BLIS_RELAX_MCNR_NCMR_CONSTRAINTS is defined by default. This is mostly a safety precaution in case someone tries to switch back to the previous trsm_r implementation, but also serves as a convenience on some systems where one does not naturally choose blocksizes in a way that satisfies MC % NR = 0 and NC % MR = 0.	2016-11-23 17:59:06 -06:00
sthangar	65298762ff	removed a redundant copy operation in DNRM2 Change-Id: I673b08efde4480e871779716f7715566740ad9ce	2016-11-22 12:15:33 +05:30
sthangar	d6863e851a	checked-in DNRM2 optimizations Change-Id: I3b31d768bd7f4fbf43042aa5a0762995c73c4522	2016-11-21 11:30:30 +05:30
Field G. Van Zee	bdc0a264d2	Adjusted stride selection of ct in macrokernels. Details: - Updated the changes introduced in `618f433` so that the strides of the temporary microtile ct used in the macrokernels are determined based on the storage preference of the microkernel (via the new functions below), rather than the strides of c. In almost all cases, presently, this change results in no net effect, as a high-level optimization in the _front() functions aligns the storage of c to that of the microkernel's preference. However, I encountered some cases where this is not always the case in some development code that has yet to be committed, and therefore I'm generalizing the framework code in advance. - Defined two new functions in bli_cntx.c: bli_cntx_l3_ukr_prefers_rows_dt() bli_cntx_l3_ukr_prefers_cols_dt() which return bool_t's based on the current micro-kernel's storage preferences. For induced methods, the preference of the underlying real domain microkernel is returned. - Updated definition of bli_cntx_l3_ukr_dislikes_storage_of(), and by proxy bli_cntx_l3_ukr_prefers_storage_of(), to be in terms of the above functions, rather than querying the preferences of the native microkernel directly (which did the wrong thing for induced methods).	2016-11-16 14:13:08 -06:00
Field G. Van Zee	031978d264	Fixed inactive trsm_r blocksize constraint code. Details: - Changed a cpp macro that was meant to prevent using certain trsm_r code if BLIS_RELAX_MCNR_NCMR_CONSTRAINTS was defined. It was actually coded incorrectly at first. I've now fixed its location and changed its consequence to a compile-time #error message.	2016-11-16 14:04:33 -06:00
praveeng	998d824044	Merge master code till devinamatthews/omp_num_thrds 2016_11_16 to amd-staging Change-Id: I601ff1d3ec8a680e1be039ffc7b299744e8a27c5	2016-11-16 14:24:15 +05:30
Field G. Van Zee	6b5a4032d2	Merge pull request #109 from devinamatthews/omp_num_threads Add automatic loop thread assignment.	2016-11-10 15:28:24 -06:00
Devin Matthews	a8220e3a86	- Fix typo in bli_cntx.c - Bump BLIS_DEFAULT_NR_THREAD_MAX to 4	2016-11-10 14:19:34 -06:00
praveeng	0d13e9a4f6	bli_kernel.h Change-Id: I425d089f79497a0de7d1622e829c3ca9edf7f091	2016-11-07 14:40:41 +05:30
Devin Matthews	c05b3862f6	Add automatic loop thread assignment. - Number of threads is determined by BLIS_NUM_THREADS or OMP_NUM_THREADS, but can be overridden by BLIS_XX_NT as before. - Threads are assigned to loops (ic, jc, ir, and jr) automatically by weighted partitioning and heuristics, both of which are tunable via bli_kernel.h. - All level-3 BLAS covered.	2016-11-04 15:48:02 -05:00
Field G. Van Zee	3b524a08e3	Consolidated 3m1/4m1 gemmtrsm, trsm ukernel code. Details: - Consolidated the macros that define the lower and upper versions of the gemmtrsm microkernels into a single macro that is instantiated twice. Did this for both 3m1 and 4m1 microkernels. - Consolidated lower and upper versions of the trsm microkernels for 3m1 and 4m1 into single files (each).	2016-11-02 17:45:18 -05:00
Field G. Van Zee	d25e6f8b63	Can disable trsm_r-specific blocksize constraints. Details: - Added cpp guards around the constraints in bli_kernel_macro_defs.h that enforce MC % NR = 0 and NC % MR = 0. These constraints are ONLY needed when handling right-side trsm by allowing the matrix on the right (matrix B) to be triangular, because it involves swapping register, but not cache, blocksizes (packing A by NR and B by MR) and then swapping the operands to gemmtrsm just before that kernel is called. It may be useful to disable these constraints if, for example, the developer wishes to test the configuration with a different set of cache blocksizes where only MC % MR = 0 and NC % NR = 0 are enforced. - In summary, #defining BLIS_RELAX_MCNR_NCMR_CONSTRAINTS will bypass the enforcement of MC % NR = 0 and NC % MR = 0.	2016-11-01 14:35:15 -05:00
Field G. Van Zee	618f4331eb	Align strides of ct in macrokernels to that of c. Details: - Previously, rs_ct and cs_ct, the strides of the temporary microtile used primarily in the macrokernels' edge case handling, were unconditionally set to 1 and MR, respectively. However, Devin Matthews noted that this ought to be changed so that the strides of ct were in agreement with the strides of C. (That is, if C was row-stored, then ct should be accessed by rows as well.) The implicit assumption is that the strides of C have already been adjusted, via induced transposition, if the storage preference of the microkernel is at odds with the storage of C. So, if the microkernel prefers row storage, the macrokernel's interior cases would present row-stored (ideal) microkernel subproblems to the microkernel, but for edge cases, it would still see column-stored subproblems (not ideal). This commit fixes this issue. Thanks to Devin for his suggestion.	2016-10-31 14:40:51 -05:00
Devin Matthews	216206c1d3	Fix up for merge to master.	2016-10-25 13:56:18 -05:00
Devin Matthews	11eb7957ab	Merge branch 'master' into knl # Conflicts: # frame/thread/bli_thread.h	2016-10-25 13:51:07 -05:00
Field G. Van Zee	936d5fdc26	Fixed multithreading compilation bug in `970745a`. Details: - Moved the definition of the cpp macro BLIS_ENABLE_MULTITHREADING from bli_thread.h to bli_config_macro_defs.h. Also moved the sanity check that OpenMP and POSIX threads are not both enabled. - Thanks to Krzysztof Drewniak for reporting this bug.	2016-10-21 14:34:27 -05:00
Field G. Van Zee	8feb0f85a6	Removed auto-prototyping of malloc()/free() substitutes. Details: - Removed the header file, bli_malloc_prototypes.h, which automatically generated prototypes for the functions specified by the following cpp macros: BLIS_MALLOC_INTL BLIS_FREE_INTL BLIS_MALLOC_POOL BLIS_FREE_POOL BLIS_MALLOC_USER BLIS_FREE_USER These prototypes were originally provided primarily as a convenience to those developers who specified their own malloc()/free() substitutes for one or more of the following. However, we generated these prototypes regardless, even when the default values (malloc and free) of the macros above were used. A problem arose under certain circumstances (e.g., gcc in C++ mode on Linux with glibc) when including blis.h that stemmed from the "throw" specification which was added to the glibc's malloc() prototype, resulting in a prototype mismatch. Therefore, going forward, developers who specify their own custom malloc()/free() substitutes must also prototype those substitutes via bli_kernel.h. Thanks to Krzysztof Drewniak for reporting this bug, and Devin Matthews for researching the nature and potential solutions.	2016-10-19 16:05:41 -05:00
Field G. Van Zee	970745a5fc	Reorganized typedefs to avoid compiler warnings. Details: - Relocated membrk_t definition from bli_membrk.h to bli_type_defs.h. - Moved #include of bli_malloc.h from blis.h to bli_type_defs.h. - Removed standalone mtx_t and mutex_t typedefs in bli_type_defs.h. - Moved #include of bli_mutex.h from bli_thread.h to bli_type_defs.h. - The redundant typedefs of membrk_t and mtx_t caused a warning on some C compilers. Thanks to Tyler Smith for reporting this issue.	2016-10-19 15:58:03 -05:00
praveeng	d864ea9f4f	Merge master code 2016_10_14 till Added disabled code thrinfo_t structures Change-Id: If7db98d286c1471fcd30f00757abee9b253ef987	2016-10-14 17:01:31 +05:30
Field G. Van Zee	28b2af8a71	Added disabled code to print thrinfo_t structures. Details: - Added cpp-guarded code to bli_thrcomm_openmp.c that allows a curious developer to print the contents of the thrinfo_t structures of each thread, for verification purposes or just to study the way thread information and communicators are used in BLIS. - Enabled some previously-disabled code in bli_l3_thrinfo.c for freeing an array of thrinfo_t* values that is used in the new, cpp-guarded code mentioned above. - Removed some old commented lines from bli_gemm_front.c.	2016-10-13 14:50:08 -05:00
praveeng	7045fcbf0b	Merge master code 2016_10_13 Removed previously renamed/old files Change-Id: I8106d371afaa0af474a8967388d44481b05de923	2016-10-13 12:03:24 +05:30
sthangar	7e04490002	Checked in the SAMAX optimizations Change-Id: I7faf8c3adf52ff01432188ad3b9866ee4b9a9dfd	2016-10-13 10:07:51 +05:30
Field G. Van Zee	9cda6057ea	Removed previously renamed/old files. Details: - Removed frame/base/bli_mem.c and frame/include/bli_auxinfo_macro_defs.h, both of which were renamed/removed in `701b9aa`. For some reason, these files survived when the compose branch was merged back into master. (Clearly, git's merging algorithm is not perfect.) - Removed frame/base/bli_mem.c.prev (an artifact of the long-ago changed memory allocator that I was keeping around for no particular reason).	2016-10-11 13:21:26 -05:00
Field G. Van Zee	22377abd84	Fixed bli_gemm() segfault on empty C matrices. Details: - Fixed a bug that would manifest in the form of a segmentation fault in bli_cntl_free() when calling any level-3 operation on an empty output matrix (ie: m = n = 0). Specifically, the code previously assumed that the entire control tree was built prior to it being freed. However, if the level-3 operation performs an early exit, the control tree will be incomplete, and this scenario is now handled. Thanks to Elmar Peise for reporting this bug.	2016-10-10 13:43:56 -05:00
Field G. Van Zee	0b571cd94d	Fixed segfault in bli_free_align() for NULL ptrs. Details: - Fixed a bug in bli_free_align() caused by failing to handle NULL pointers up-front, which led to performing pointer arithmetic on NULL pointers in order to free the address immediately before the pointer. Thanks to Devin Matthews for reporting this bug.	2016-10-06 14:48:15 -05:00
praveeng	f2e7ea113a	conflicts merge for bli_kernel.h Change-Id: I15d846bd34e11f86ebfd7ed091ff671a1f3366a0	2016-10-06 12:35:30 +05:30
Field G. Van Zee	87fddeab3c	Merge branch 'compose'	2016-10-05 13:35:01 -05:00
Field G. Van Zee	86969873b5	Reclassified amaxv operation as a level-1v kernel. Details: - Moved amaxv from being a utility operation to being a level-1v operation. This includes the establishment of a new amaxv kernel to live beside all of the other level-1v kernels. - Added two new functions to bli_part.c: bli_acquire_mij() bli_acquire_vi() The first acquires a scalar object for the (i,j) element of a matrix, and the second acquires a scalar object for the ith element of a vector. - Added integer support to bli_getsc level-0 operation. This involved adding integer support to the bli_*gets level-0 scalar macros. - Added a new test module to test amaxv as a level-1v operation. The test module works by comparing the value identified by bli_amaxv() to the value found from a reference-like code local to the test module source file. In other words, it (intentionally) does not guarantee the same index is found; only the same value. This allows for different implementations in the case where a vector contains two or more elements containing exactly the same floating point value (or values, in the case of the complex domain). - Removed the directory frame/include/old/.	2016-10-04 14:24:59 -05:00
Field G. Van Zee	8d55033c96	Implemented distributed thrinfo_t management. Details: - Implemented Ricardo Magana's distributed thread info/communicator management. Rather than fully construct the thrinfo_t structures, from root to leaf, prior to spawning threads, the threads individually construct their thrinfo_t trees (or, chains), and do so incrementally, as needed, reusing the same structure nodes during subsequent blocked variant iterations. This required moving the initial creation of the thrinfo_t structure (now, the root nodes) from the _front() functions to the bli_l3_thread_decorator(). The incremental "growing" of the tree is performed in the internal back-end (ie: _int()) function, and so mostly invisible. Also, the incremental growth of the thrinfo_t tree is done as a function of the current and parent control tree nodes (as well as the parent thrinfo_t node), further reinforcing the parallel relationship between the two data structures. - Removed the "inner" communicator from thrinfo_t structure definition, as well as its id. Changed all APIs accordingly. Renamed bli_thrinfo_needs_free_comms() to bli_thrinfo_needs_free_comm(). - Defined bli_l3_thrinfo_print_paths(), which prints the information in an array of thrinfo_t* structure pointers. (Used only as a debugging/verification tool.) - Deprecated the following thrinfo_t creation functions: bli_packm_thrinfo_create() bli_l3_thrinfo_create() because they are no longer used. bli_thrinfo_create() is now called directly when creating thrinfo_t nodes.	2016-09-27 15:20:58 -05:00
Field G. Van Zee	c0630c4024	Added debugging printf()'s to bli_l3_thrinfo.c. Details: - Added optional printf() statements to print out thread communicator info as the thrinfo_t structure is built in bli_l3_thrinfo.c. - Minor changes to frame/thread/bli_thrinfo.h.	2016-09-12 13:59:02 -05:00
Field G. Van Zee	35509818cb	Added, moved some thread barriers. Details: - Removed thread barriers from the end of the loop bodies of bli_gemm_blk_var1(), bli_gemm_blk_var2(), bli_trsm_blk_var1(), and bli_trsm_blk_var2(). - Moved the thread barrier at the end of bli_packm_int() to the end of bli_l3_packm(), and added missing barriers to that function. - Removed the no longer necessary (and now incorrect) ochief guard in bli_gemm3m3_packa() on the bli_obj_scalar_reset() on C. - Thanks to Tyler Smith for help with these changes.	2016-08-31 17:34:15 -05:00
sthangar	8a2373f26b	Norm 2 optimization Change-Id: Ide9decaccd20bf0ccc32c9abb6556e038dceed2b	2016-08-29 14:28:39 +05:30
Field G. Van Zee	701b9aa3ff	Redesigned control tree infrastructure. Details: - Altered control tree node struct definitions so that all nodes have the same struct definition, whose primary fields consist of a blocksize id, a variant function pointer, a pointer to an optional parameter struct, and a pointer to a (single) sub-node. This unified control tree type is now named cntl_t. - Changed the way control tree nodes are connected, and what computation they represent, such that, for example, packing operations are now associated with nodes that are "inline" in the tree, rather than off-shoot branches. The original tree for the classic Goto gemm algorithm was expressed (roughly) as: blk_var2 -> blk_var3 -> blk_var1 -> ker_var2, with packb and packa attached as off-shoot branches, and now, the same tree would look like: blk_var2 -> blk_var3 -> packb -> blk_var1 -> packa -> ker_var2 Specifically, the packb and packa nodes perform their respective packing operations and then recurse (without any loop) to a subproblem. This means there are now two kinds of level-3 control tree nodes: partitioning and non-partitioning. The blocked variants are members of the former, because they iteratively partition off submatrices and perform suboperations on those partitions, while the packing variants belong to the latter group. (This change has the effect of allowing greatly simplified initialization of the nodes, which previously involved setting many unused node fields to NULL.) - Changed the way thrinfo_t tree nodes are arranged to mirror the new connective structure of control trees. That is, packm nodes are no longer off-shoot branches of the main algorithmic nodes, but rather connected "inline". - Simplified control tree creation functions. Partitioning nodes are created concisely with just a few fields needing initialization. 
By contrast, the packing nodes require additional parameters, which are stored in a packm-specific struct that is tracked via the optional parameters pointer within the control tree struct. (This parameter struct must always begin with a uint64_t that contains the byte size of the struct. This allows us to use a generic function to recursively copy control trees.) gemm, herk, and trmm control tree creation continues to be consolidated into a single function, with the operation family being used to select among the parameter-agnostic macro-kernel wrappers. A single routine, bli_cntl_free(), is provided to free control trees recursively, whereby the chief thread within a group releases the blocks associated with mem_t entries back to the memory broker from which they were acquired. - Updated internal back-ends, e.g. bli_gemm_int(), to query and call the function pointer stored in the current control tree node (rather than index into a local function pointer array). Before being invoked, these function pointers are first cast to a gemm_voft (for gemm, herk, or trmm families) or trsm_voft (for trsm family) type, which is defined in frame/3/bli_l3_var_oft.h. - Retired herk and trmm internal back-ends, since all execution now flows through gemm or trsm blocked variants. - Merged forwards- and backwards-moving variants by querying the direction from routines as a function of the variant's matrix operands. gemm and herk always move forward, while trmm and trsm move in a direction that is dependent on which operand (a or b) is triangular. - Added functions bli_thread_get_range_mdim(), bli_thread_get_range_ndim(), each of which takes additional arguments and hides complexity in managing the difference between the way ranges are computed for the four families of operations. - Simplified level-3 blocked variants according to the above changes, so that the only steps taken are: 1. Query partitioning direction (forwards or backwards). 2. 
Prune unreferenced regions, if they exist. 3. Determine the thread partitioning sub-ranges. <begin loop> 4. Determine the partitioning blocksize (passing in the partitioning direction). 5. Acquire the current iteration's partitions for the matrices affected by the current variant's partitioning dimension (m, k, n). 6. Call the subproblem. <end loop> - Instantiate control trees once per thread, per operation invocation. (This is a change from the previous regime in which control trees were treated as stateless objects, initialized with the library, and shared as read-only objects between threads.) This once-per-thread allocation is done primarily to allow threads to use the control tree as a place to cache certain data for use in subsequent loop iterations. Presently, the only application of this caching is a mem_t entry for the packing blocks checked out from the memory broker (allocator). If a non-NULL control tree is passed in by the (expert) user, then the tree is copied by each thread. This is done in bli_l3_thread_decorator(), in bli_thrcomm_*.c. - Added a new field to the context, an opid_t, which tracks the "family" of the operation being executed. For example, gemm, hemm, and symm are all part of the gemm family, while herk, syrk, her2k, and syr2k are all part of the herk family. Knowing the operation's family is necessary when conditionally executing the internal (beta) scalar reset on C in blocked variant 3, which is needed for gemm and herk families, but must not be performed for the trmm family (because beta has only been applied to the current row-panel of C after the first rank-kc iteration). - Reexpressed 3m3 induced method blocked variant in frame/3/gemm/ind to conform with the new control tree design, and renamed the macro-kernel codes corresponding to 3m2 and 4m1b. - Renamed bli_mem.c (and its APIs) to bli_memsys.c, and renamed/relocated bli_mem_macro_defs.h from frame/include to frame/base/bli_mem.h. 
- Renamed/relocated bli_auxinfo_macro_defs.h from frame/include to frame/base/bli_auxinfo.h. - Fixed a minor bug whereby the storage-to-ukr-preference matching optimization in the various level-3 front-ends was not being applied properly when the context indicated that execution would be via an induced method. (Before, we always checked the native micro-kernel corresponding to the datatype being executed, whereas now we check the native micro-kernel corresponding to the datatype's real projection, since that is the micro-kernel that is actually used by induced methods.) - Added an option to the testsuite to skip the testing of native level-3 complex implementations. Previously, it was always tested, provided that the c/z datatypes were enabled. However, some configurations use reference micro-kernels for complex datatypes, and testing these implementations can slow down the testsuite considerably.	2016-08-26 19:04:45 -05:00
Field G. Van Zee	c6f5c215ee	Merge branch 'master' into compose	2016-08-22 17:33:02 -05:00
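
Commit `0b571cd94d` above describes a bug class that is easy to reproduce in any aligned allocator that hides bookkeeping just before the pointer it returns. The sketch below is not the BLIS implementation. The names `sketch_malloc_align()` and `sketch_free_align()` are hypothetical, and the layout (the original malloc() pointer stored in the bytes immediately before the aligned address) is an assumption inferred from the commit message's wording. It only illustrates why the NULL check has to come before any pointer arithmetic.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a BLIS function): over-allocate, round the
   address up to a multiple of `align`, and save the pointer returned by
   malloc() in the bytes immediately before the aligned address. */
void* sketch_malloc_align( size_t size, size_t align )
{
    /* This sketch assumes `align` is a nonzero power of two. */
    assert( align != 0 && ( align & ( align - 1 ) ) == 0 );

    /* Room for the payload, the worst-case padding, and the saved pointer. */
    unsigned char* raw = ( unsigned char* )malloc( size + align + sizeof( void* ) );
    if ( raw == NULL ) return NULL;

    /* Skip past the slot for the saved pointer, then round up. */
    uintptr_t first   = ( uintptr_t )raw + sizeof( void* );
    uintptr_t aligned = ( first + align - 1 ) & ~( ( uintptr_t )align - 1 );

    /* memcpy() avoids a misaligned store when `align` is small. */
    void* base = raw;
    memcpy( ( unsigned char* )aligned - sizeof( void* ), &base, sizeof( base ) );

    return ( void* )aligned;
}

/* Hypothetical counterpart: recover the saved pointer and free it. */
void sketch_free_align( void* p )
{
    /* Handle NULL up front. Without this check, the arithmetic below
       would read from an address computed from a NULL pointer. */
    if ( p == NULL ) return;

    void* base;
    memcpy( &base, ( unsigned char* )p - sizeof( void* ), sizeof( base ) );
    free( base );
}
```

With these definitions, `sketch_free_align( NULL )` is a no-op, matching the behavior of free(), while a pointer obtained from `sketch_malloc_align( 100, 64 )` is a multiple of 64 and is released through the saved base pointer.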

459 Commits