amd/blis

mirror of https://github.com/amd/blis.git synced 2026-05-20 00:18:56 +00:00

Author	SHA1	Message	Date
kdevraje	cac127182d	Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis with public repo commit id `565fa3853b`. Change-Id: I68b9824b110cf14df248217a24a6191b3df79d42	2019-06-24 14:05:54 +05:30
Kiran Varaganti	b69fb0b74a	Added the BLIS_ENABLE_ZEN_BLOCK_SIZES macro back to the zen configuration; this is the same as in release 1.3. The macro was originally added to improve multithreaded DGEMM scalability on Naples when the number of threads is greater than 16. It was deleted by mistake during the many changes made for the 2.0 release, so we are adding it back. Also cleaned up code in bli_gemm_front.c. Change-Id: I9f5d8225254676a99c6f2b09a0825e545206d0fc	2019-05-31 15:14:22 +05:30
kdevraje	13806ba3b0	This check in has changes w.r.t Copyright information, which is changed to (start year) - 2019 Change-Id: Ide3c8f7172210b8d3538d3c36e88634ab1ba9041	2019-05-27 16:24:43 +05:30
kdevraje	02920f5c48	make checkblis fails for the matrix dimension check at the beginning, hence reverting it Change-Id: Ibd2ee8c2d4914598b72003fbfc5845be9c9c1e87	2019-05-23 15:29:59 +05:30
kdevraje	84215022f2	Adding threshold condition to dgemm small matrix kernels, defining the constants in zen2 configuration Change-Id: I53a58b5d734925a6fcb8d8bea5a02ddb8971fcd5	2019-05-23 14:33:47 +05:30
kdevraje	df755848b8	Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis into rome2.0 Change-Id: Ie8aad1ab810f0f3c0b90ec67f9dd3dfb8dcc74cc	2019-05-22 13:30:07 +05:30
Kiran Varaganti	f5ed95ecd7	Merged BLIS Release 1.3 Modified config/zen/make_defs.mk, now CKVECFLAGS := -mavx2 -mfpmath=sse -mfma -march=znver1 Change-Id: Ia0942d285a21447cd0c470de1bc021fe63e80d81	2019-03-05 15:03:57 +05:30
Field G. Van Zee	075143dfd9	Added support for IC loop parallelism to trsm. Details: - Parallelism within the IC loop (3rd loop around the microkernel) is now supported within the trsm operation. This is done via a new branch on each of the control and thread trees, which guide execution of a new trsm-only subproblem from within bli_trsm_blk_var1(). This trsm subproblem corresponds to the macrokernel computation on only the block of A that contains the diagonal (labeled as A11 in algorithms with FLAME-like partitioning), and the corresponding row panel of C. During the trsm subproblem, all threads within the JC communicator participate and parallelize along the JR loop, including any parallelism that was specified for the IC loop. (IR loop parallelism is not supported for trsm due to inter-iteration dependencies.) After this trsm subproblem is complete, a barrier synchronizes all participating threads and then they proceed to apply the prescribed BLIS_IC_NT (or equivalent) ways of parallelism (and any BLIS_JR_NT parallelism specified within) to the remaining gemm subproblem (the rank-k update that is performed using the newly updated row-panel of B). Thus, trsm now supports JC, IC, and JR loop parallelism. - Modified bli_trsm_l_cntl_create() to create the new "prenode" branch of the trsm_l cntl_t tree. The trsm_r tree was left unchanged, for now, since it is not currently used. (All trsm problems are cast in terms of left-side trsm.) - Updated bli_cntl_free_w_thrinfo() to be able to free the newly shaped trsm cntl_t trees. Fixed a potentially latent bug whereby a cntl_t subnode is only recursed upon if there existed a corresponding thrinfo_t node, which may not always exist (for problems too small to employ full parallelization due to the minimum granularity imposed by micropanels). - Updated other functions in frame/base/bli_cntl.c, such as bli_cntl_copy() and bli_cntl_mark_family(), to recurse on sub-prenodes if they exist. 
- Updated bli_thrinfo_free() to recurse into sub-nodes and prenodes when they exist, and added support for growing a prenode branch to bli_thrinfo_grow() via a corresponding set of helper functions named with the _prenode() suffix. - Added a bszid_t field to thrinfo_t nodes. This field comes in handy when debugging the allocation/release of thrinfo_t nodes, as it helps trace the "identity" of each node as it is created/destroyed. - Renamed bli_l3_thrinfo_print_paths() -> bli_l3_thrinfo_print_gemm_paths() and created a separate bli_l3_thrinfo_print_trsm_paths() function to print out the newly reconfigured thrinfo_t trees for the trsm operation. - Trivial changes to bli_gemm_blk_var?.c and bli_trsm_blk_var?.c regarding variable declarations. - Removed subpart_t enum values BLIS_SUBPART1T, BLIS_SUBPART1B, BLIS_SUBPART1L, BLIS_SUBPART1R. Then added support for two new labels (semantically speaking): BLIS_SUBPART1A and BLIS_SUBPART1B, which represent the subpartition ahead of and behind, respectively, BLIS_SUBPART1. Updated check functions in bli_check.c accordingly. - Shuffled layering/APIs for bli_acquire_mpart_[mn]dim() and bli_acquire_mpart_t2b/b2t(), _l2r/r2l(). - Deprecated old functions in frame/3/bli_l3_thrinfo.c.	2019-02-14 18:52:45 -06:00
Field G. Van Zee	eb97f778a1	Added missing AMD copyrights to previous commit. Details: - Forgot to add AMD copyrights to several touched files that did not already have them in `2f31743`.	2018-12-25 20:17:09 -06:00
Field G. Van Zee	2f3174330f	Implemented a pool-based small block allocator. Details: - Implemented a sophisticated data structure and set of APIs that track the small blocks of memory (around 80-100 bytes each) used when creating nodes for control and thread trees (cntl_t and thrinfo_t) as well as thread communicators (thrcomm_t). The purpose of the small block allocator, or sba, is to allow the library to transition into a runtime state in which it does not perform any calls to malloc() or free() during normal execution of level-3 operations, regardless of the threading environment (potentially multiple application threads as well as multiple BLIS threads). The functionality relies on a new data structure, apool_t, which is (roughly speaking) a pool of arrays, where each array element is a pool of small blocks. The outer pool, which is protected by a mutex, provides separate arrays for each application thread while the arrays each handle multiple BLIS threads for any given application thread. The design minimizes the potential for lock contention, as only concurrent application threads would need to fight for the apool_t lock, and only if they happen to begin their level-3 operations at precisely the same time. Thanks to Kiran Varaganti and AMD for requesting this feature. - Added a configure option to disable the sba pools, which are enabled by default; renamed the --[dis|en]able-packbuf-pools option to --[dis|en]able-pba-pools; and rewrote the --help text associated with this new option and consolidated it with the --help text for the option associated with the sba (--[dis|en]able-sba-pools). - Moved the membrk field from the cntx_t to the rntm_t. We now pass in a rntm_t* to the bli_membrk_acquire() and _release() APIs, just as we do for bli_sba_acquire() and _release(). 
- Replaced all calls to bli_malloc_intl() and bli_free_intl() that are used for small blocks with calls to bli_sba_acquire(), which takes a rntm (in addition to the bytes requested), and bli_sba_release(). These latter two functions reduce to the former two when the sba pools are disabled at configure-time. - Added rntm_t* arguments to various cntl_t and thrinfo_t functions, as required by the new usage of bli_sba_acquire() and _release(). - Moved the freeing of "old" blocks (those allocated prior to a change in the block_size) from bli_membrk_acquire_m() to the implementation of the pool_t checkout function. - Miscellaneous improvements to the pool_t API. - Added a block_size field to the pblk_t. - Harmonized the way that the trsm_ukr testsuite module performs packing relative to that of gemmtrsm_ukr, in part to avoid the need to create a packm control tree node, which now requires a rntm_t that has been initialized with an sba and membrk. - Re-enabled the explicit call to bli_finalize() in testsuite so that users who run the testsuite with memory tracing enabled can check for memory leaks. - Manually imported the compact/minor changes from `61441b24` that cause the rntm to be copied locally when it is passed in via one of the expert APIs. - Reordered parameters to various bli_thrcomm_() functions so that the thrcomm_t to the comm being modified is last, not first. - Added more descriptive tracing for allocating/freeing small blocks and formalized via a new configure option: --[dis|en]able-mem-tracing. - Moved some unused scalm code and headers into frame/1m/other. - Whitespace changes to bli_pthread.c. - Regenerated build/libblis-symbols.def.	2018-12-25 19:35:01 -06:00
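The reuse idea behind the small block allocator in `2f3174330f` can be illustrated with a much simpler structure. The following is a minimal single-threaded sketch, not the apool_t design the commit describes: the names `small_pool_t`, `small_pool_acquire()`, and `small_pool_release()` are hypothetical, there is no mutex, and there are no per-thread arrays. It only shows how released blocks are kept and handed out again so that steady-state execution avoids malloc() and free().

```c
#include <stddef.h>
#include <stdlib.h>

#define SMALL_POOL_CAPACITY 8  /* hypothetical limit on cached blocks */

typedef struct
{
    void*  free_blocks[SMALL_POOL_CAPACITY]; /* stack of blocks ready for reuse */
    size_t num_free;                         /* number of cached blocks */
    size_t block_size;                       /* size of every block in the pool */
    size_t num_mallocs;                      /* counts trips to malloc() */
} small_pool_t;

void small_pool_init(small_pool_t* pool, size_t block_size)
{
    pool->num_free    = 0;
    pool->block_size  = block_size;
    pool->num_mallocs = 0;
}

/* Hand out a cached block if one exists; otherwise fall back to malloc(). */
void* small_pool_acquire(small_pool_t* pool)
{
    if (pool->num_free > 0)
        return pool->free_blocks[--pool->num_free];

    pool->num_mallocs++;
    return malloc(pool->block_size);
}

/* Cache the block for later reuse; free it only if the cache is full. */
void small_pool_release(small_pool_t* pool, void* block)
{
    if (pool->num_free < SMALL_POOL_CAPACITY)
        pool->free_blocks[pool->num_free++] = block;
    else
        free(block);
}

/* Return all cached blocks to the system. */
void small_pool_finalize(small_pool_t* pool)
{
    while (pool->num_free > 0)
        free(pool->free_blocks[--pool->num_free]);
}
```

After a warm-up phase in which the pool fills, repeated acquire/release pairs are served entirely from `free_blocks`, which is the runtime state the commit message aims for.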
sraut	1f4eeee517	Fixed BLAS test failures of small matrix SYRK for single and double precision. Details: - SYRK for small matrix was implemented by reusing small GEMM routine. This was resulting in output written to the full C matrix, and C being symmetric the lower and upper triangles of C matrix contained same results. BLAS SYRK API spec demands either lower or upper triangle of C matrix to be written with results. So, this was resulting in BLAS test failures, even though testsuite of BLIS was passing small SYRK operation. - To fix BLAS test failures of small matrix SYRK, separate kernel routines are implemented for small SYRK for both single and double precision. The newly added small SYRK routines are in file kernels/zen/3/bli_syrk_small.c. Now the intermediate results of matrix C are written to a scratch buffer. Final results are written from scratch buffer to matrix C using SIMD copy to either lower or upper triangle part of matrix C. - Source and header files frame/3/syrk/bli_syrk_front.c and frame/3/syrk/bli_syrk_front.h are changed to invoke new small SYRK routines. Change-Id: I9cfb1116c93d150aefac673fca033952ecac97cb	2018-12-19 21:23:05 +05:30
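The final step described in `1f4eeee517`, copying from the scratch buffer into only one triangle of C, might look roughly like the sketch below. It assumes column-major storage and uses plain scalar loops in place of the SIMD copy; `copy_triangle_sketch()` and its parameters are hypothetical names, not the routines in kernels/zen/3/bli_syrk_small.c.

```c
#include <stddef.h>

/* Hypothetical helper: copy one triangle (including the diagonal) of an
   n-by-n column-major scratch buffer into C, leaving the other triangle
   of C untouched. lds and ldc are the leading dimensions. */
void copy_triangle_sketch(size_t n, const double* scratch, size_t lds,
                          double* c, size_t ldc, int lower)
{
    for (size_t j = 0; j < n; j++)
    {
        /* Rows j..n-1 for the lower triangle, rows 0..j for the upper. */
        size_t i_begin = lower ? j : 0;
        size_t i_end   = lower ? n : j + 1;

        for (size_t i = i_begin; i < i_end; i++)
            c[i + j * ldc] = scratch[i + j * lds];
    }
}
```

Because the strictly opposite triangle is never written, whatever the caller stored there is preserved, which is the behavior the BLAS tests check for.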
Field G. Van Zee	76016691e2	Improvements to bli_pool; malloc()/free() tracing. Details: - Added malloc_ft and free_ft fields to pool_t, which are provided when the pool is initialized, to allow bli_pool_alloc_block() and bli_pool_free_block() to call bli_fmalloc_align()/bli_ffree_align() with arbitrary align_size values (according to how the pool_t was initialized). - Added a block_ptrs_len argument to bli_pool_init(), which allows the caller to specify an initial length for the block_ptrs array, which previously suffered the cost of being reallocated, copied, and freed each time a new block was added to the pool. - Consolidated the "buf_sys" and "buf_align" pointer fields in pblk_t into a single "buf" field. Consolidated the bli_pblk API accordingly and also updated the bli_mem API implementation. This was done because I'd previously already implemented opaque alignment via bli_malloc_align(), which allocates extra space and stores the original pointer returned by malloc() one element before the element whose address is aligned. - Tweaked bli_membrk_acquire_m() and bli_membrk_release() to call bli_fmalloc_align() and bli_ffree_align(), which required adding an align_size field to the membrk_t struct. - Pass the pack schemas directly into bli_l3_cntl_create_if() rather than transmit them via objects for A and B. - Simplified bli_l3_cntl_free_if() and renamed to bli_l3_cntl_free(). The function had not been conditionally freeing control trees for quite some time. Also, removed obj_t* parameters since they aren't needed anymore (or never were). - Spun-off OpenMP nesting code in bli_l3_thread_decorator() to a separate function, bli_l3_thread_decorator_thread_check(). - Renamed: bli_malloc_align() -> bli_fmalloc_align() bli_free_align() -> bli_ffree_align() bli_malloc_noalign() -> bli_fmalloc_noalign() bli_free_noalign() -> bli_ffree_noalign() The 'f' is for "function" since they each take a malloc_ft or free_ft function pointer argument. 
- Inserted various printf() calls for the purposes of tracing memory allocation and freeing, guarded by cpp macro ENABLE_MEM_DEBUG, which, for now, is intended to be a "hidden" feature rather than one hooked up to a configure-time option. - Defined bli_rntm_equals(), which compares two rntm_t for equality. (There are no use cases for this function yet, but there may be soon.) - Whitespace changes to function parameter lists in bli_pool.c, .h.	2018-12-13 17:23:09 -06:00
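The "opaque alignment" technique mentioned in `76016691e2` (allocate extra space, then store the pointer returned by malloc() one element before the aligned address) can be sketched as follows. This is an illustration under assumptions, not the body of bli_fmalloc_align(): the `sketch_` names are hypothetical, and `align_size` is assumed to be a power of two.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Function pointer types in the spirit of the malloc_ft/free_ft fields. */
typedef void* (*malloc_ft_sketch)(size_t);
typedef void  (*free_ft_sketch)(void*);

void* sketch_fmalloc_align(malloc_ft_sketch f, size_t size, size_t align_size)
{
    /* Over-allocate: payload, worst-case padding, and one saved pointer. */
    uint8_t* raw = f(size + align_size + sizeof(void*));
    if (raw == NULL) return NULL;

    /* Skip the slot for the saved pointer, then round up to the alignment. */
    uintptr_t addr = (uintptr_t)(raw + sizeof(void*));
    addr = (addr + align_size - 1) & ~(uintptr_t)(align_size - 1);

    /* Store the original pointer one element before the aligned address. */
    ((void**)addr)[-1] = raw;

    return (void*)addr;
}

void sketch_ffree_align(free_ft_sketch f, void* p)
{
    if (p == NULL) return;

    /* Recover the original pointer and hand it to the free function. */
    f(((void**)p)[-1]);
}
```

Since the original pointer travels with the block, the caller only needs to keep the aligned address, which is why the separate "buf_sys" and "buf_align" fields could be merged into one.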
Field G. Van Zee	0645f239fb	Remove UT-Austin from copyright headers' clause 3. Details: - Removed explicit reference to The University of Texas at Austin in the third clause of the license comment blocks of all relevant files and replaced it with a more all-encompassing "copyright holder(s)". - Removed duplicate words ("derived") from a few kernels' license comment blocks. - Homogenized license comment block in kernels/zen/3/bli_gemm_small.c with format of all other comment blocks.	2018-12-04 14:31:06 -06:00
Field G. Van Zee	375eb30b0a	Added mixed-precision support to 1m method. Details: - Lifted the constraint that 1m only be used when all operands' storage datatypes (along with the computation datatype) are equal. Now, 1m may be used as long as all operands are stored in the complex domain. This change largely consisted of adding the ability to pack to 1e and 1r formats from one precision to another. It also required adding logic for handling complex values of alpha to bli_packm_blk_var1_md() (similar to the logic in bli_packm_blk_var1()). - Fixed a bug in several virtual microkernels (bli_gemm_md_c2r_ref.c, bli_gemm1m_ref.c, and bli_gemmtrsm1m_ref.c) that resulted in the wrong ukernel output preference field being read. Previously, the preference for the native complex ukernel was being read instead of the pref for the native real domain ukernel. This bug would not manifest if the preference for the native complex ukernel happened to be equal to that of the native real ukernel. - Added support for testing mixed-precision 1m execution via the gemm module of the testsuite. - Tweaked/simplified bli_gemm_front() and bli_gemm_md.c so that pack schemas are always read from the context, rather than trying to sometimes embed them directly to the A and B objects. (They are still embedded, but now uniformly only after reading the schemas from the context.) - Redefined cpp macro bli_l3_ind_recast_1m_params() as a static function and renamed to bli_gemm_ind_recast_1m_params() (since gemm is the only consumer). - Added 1m optimization logic (via bli_gemm_ind_recast_1m_params()) to bli_gemm_ker_var2_md(). - Added explicit handling for beta == 1 and beta == 0 in the reference gemm1m virtual microkernel in ref_kernels/ind/bli_gemm1m_ref.c. - Rewrote various level-0 macro defs, including axpyris, axpbyris, scal2ris, and xpbyris (and their conjugating counterparts) to explicitly support three operand types and updated invocations to xpbyris in bli_gemmtrsm1m_ref.c. 
- Query and use the storage datatype of the packed object instead of the storage datatype of the source object in bli_packm_blk_var1(). - Relocated and renamed frame/ind/misc/bli_l3_ind_opt.h to frame/3/gemm/ind/bli_gemm_ind_opt.h. - Various whitespace/comment updates.	2018-12-03 17:49:52 -06:00
Field G. Van Zee	1d8aae220b	Track internal scalar datatypes. Details: - Added a num_t datatype bitfield to the obj_t in the form of a new info2 field in the obj_t. This change was made primarily so that in the case of mixed-datatype gemm, the alpha scalar would not need to be cast to the storage datatype of B (or A) before then being cast to the computation datatype just before the macrokernel is called. This double-casting regime could result in loss of precision if the storage datatype of B (or A) is less than the computation precision. In practice, it was likely not going to be a big deal since most usage of alpha is for -1.0, 0.0, and 1.0 (or integer multiples thereof), which can all be represented exactly in single or double precision. - The type of objbits_t was changed to uint32_t, so the new format potentially takes up the same space as the previous obj_t definition, assuming no padding inserted by the compiler. Shrinking info to 32 bits and spilling over into a second field was chosen over using the high 32 bits of a single 64-bit objbits_t info field because many of the bitwise operations are performed with enums such as num_t, dom_t, and prec_t, which may take on the type of 32-bit ints. It's easier to just keep all of those bitwise operations in 32 bits than perform a million typecasts throughout bli_type_defs.h and bli_obj_macro_defs.h to ensure that the integers are treated as 64-bit for the purposes of the ANDs, ORs, and bitshifts. - Many comment updates. - Thanks to Devin Matthews and Devangi Parikh for their feedback and involvement during this commit cycle.	2018-11-20 18:42:07 -06:00
Field G. Van Zee	0f9b53e84b	Fixed a bug in high-level mixeddt conditional. Details: - Fixed a bug in frame/3/bli_l3_oapi.c in the conditional that divides use of induced method (1m) execution from native execution. The former was intended to only be used in cases where all storage datatypes are complex and the datatype of C is equal to the computation datatype. (If mixed datatypes are detected, native execution would be used.) However, the code in bli_gemm() was erroneously checking the execution datatype instead of the computation datatype, which at that point is guaranteed to be equal to the storage datatype even if the computation datatype contains a different value. Thanks to Devangi Parikh for helping in isolating this bug.	2018-11-13 13:03:15 -06:00
Field G. Van Zee	c3c6ebc9c6	Fixed thrinfo_t printing for small problems. Details: - Fixed a bug in the code that prints out the communicator and work ids from the various threads' thrinfo_t nodes. This bug manifested when the dimension being parallelized was not large enough such that every thread was assigned actual work (since the minimum amount of work is determined by the register blocksize in the dimension being parallelized). In those cases, the threads that receive no work in that dimension do not finish building their thrinfo_t tree, leaving lower-level nodes non-existent. (The bug itself was usually observed as a segfault when the printing code attempted to dereference all the way down the thrinfo_t tree.) The solution involves explicitly checking each node as it is dereferenced, and if at any time NULL is found, all subsequent communicator and work ids are set to -1.	2018-10-21 18:48:54 -05:00
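The fix in `c3c6ebc9c6` amounts to a NULL-checked walk down a partially built tree. A minimal sketch, assuming a simplified node type (`thr_node_sketch_t` and `collect_ids_sketch()` are hypothetical and do not mirror the real thrinfo_t layout):

```c
#include <stddef.h>

/* Simplified stand-in for a thrinfo_t node. */
typedef struct thr_node_sketch
{
    int comm_id;                  /* id within the communicator */
    int work_id;                  /* id of the assigned work */
    struct thr_node_sketch* sub;  /* next level down, or NULL if never built */
} thr_node_sketch_t;

/* Record ids for `depth` levels. Once a NULL node is reached, that level
   and every level below it are reported as -1 instead of being
   dereferenced. */
void collect_ids_sketch(const thr_node_sketch_t* node, int depth,
                        int* comm_ids, int* work_ids)
{
    for (int level = 0; level < depth; level++)
    {
        if (node != NULL)
        {
            comm_ids[level] = node->comm_id;
            work_ids[level] = node->work_id;
            node = node->sub;
        }
        else
        {
            comm_ids[level] = -1;
            work_ids[level] = -1;
        }
    }
}
```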
Field G. Van Zee	49d3f9fcbb	Merge branch 'master' into dev	2018-10-17 18:00:40 -05:00
Field G. Van Zee	71c5832d5f	Consolidated slab/rr-explicit level-3 macrokernels. Details: - Consolidated the sl.c and rr.c level-3 macrokernels into a single file per sl/rr pair, with those files named as they were before `c92762e`. The consolidation does not take away the option of using slab or round-robin assignment of micropanels to threads; it merely hides the choice within the definitions of functions such as bli_thread_range_jrir(), bli_packm_my_iter(), and bli_is_last_iter() rather than expose that choice explicitly in the code. The choice of slab or rr is not always hidden, however; there are some cases involving herk and trmm, for example, that require some part of the computation to use rr unconditionally. (The --thread-part-jrir option controls the partitioning in all other cases.) - Note: Originally, the sl and rr macrokernels were separated out for clarity. However, aside from the additional binary code bloat, I later deemed that clarity not worth the price of maintaining the additional (mostly similar) codes.	2018-10-17 14:11:01 -05:00
Field G. Van Zee	5fec95b99f	Implemented mixed-datatype support for gemm. Details: - Implemented support for gemm where A, B, and C may have different storage datatypes, as well as a computational precision (and implied computation domain) that may be different from the storage precision of either A or B. This results in 128 different combinations, all which are implemented within this commit. (For now, the mixed-datatype functionality is only supported via the object API.) If desired, the mixed-datatype support may be disabled at configure-time. - Added a memory-intensive optimization to certain mixed-datatype cases that requires a single m-by-n matrix be allocated (temporarily) per call to gemm. This optimization aims to avoid the overhead involved in repeatedly updating C with general stride, or updating C after a typecast from the computation precision. This memory optimization may be disabled at configure-time (provided that the mixed-datatype support is enabled in the first place). - Added support for testing mixed-datatype combinations to testsuite. The user may test gemm with mixed domains, precisions, both, or neither. - Added a standalone test driver directory for building and running mixed-datatype performance experiments. - Defined a new variation of castm, castnzm, which operates like castm except that imaginary values are not touched when casting a real operand to a complex operand. (By contrast, in these situations castm sets the imaginary components of the destination matrix to zero.) - Defined bli_obj_imag_is_zero() and substituted calls in lieu of all usages of bli_obj_imag_equals() that tested against BLIS_ZERO, and also simplified the implementation of bli_obj_imag_equals(). - Fixed bad behavior from bli_obj_is_real() and bli_obj_is_complex() when given BLIS_CONSTANT objects. - Disabled dt_on_output field in auxinfo_t structure as well as all accessor functions. Also commented out all usage of accessor functions within macrokernels. 
(Typecasting in the microkernel is still feasible, though probably unrealistic for now given the additional complexity required.) - Use void function pointer type (instead of void*) for storing function pointers in bli_l0_fpa.c. - Added documentation for using gemm with mixed datatypes in docs/MixedDatatypes.md and example code in examples/oapi/11gemm_md.c. - Defined level-1d operation xpbyd and level-1m operation xpbym. - Added xpbym test module to testsuite. - Updated frame/include/bli_x86_asm_macros.h with additional macros (courtesy of Devin Matthews).	2018-10-15 16:37:39 -05:00
sraut	78a6935483	Added comments for the change in syrk small matrix change. Change-Id: I958939e9953323730da49ef07d1b10e578837d82	2018-10-11 16:30:57 +05:30
Field G. Van Zee	c92762ecdc	Added option of slab or rr partitioning in jr/ir. Details: - Updated existing macrokernel function names and definitions to explicitly use slab assignment of micropanels to threads, then created duplicate versions of macrokernels that explicitly use round-robin assignment instead of slab. NOTE: As in `ac18949`, trsm_r macrokernels were not substantially updated in this commit because they are currently disabled in bli_trsm_front.c. - Updated existing packing function (in blk_packm_blk_var1.c) to explicitly use slab partitioning, and then duplicated for round-robin. - Updated control tree initialization to use the appropriate macrokernel and packm function pointers depending on which method (slab or rr) was enabled at configure-time. - Updated configure script to accept new --thread-part-jrir=[slab|rr] option (-m [slab|rr] for short), which allows the user to explicitly request either slab or round-robin assignment (partitioning) of micropanels to threads. - Updated sandbox/ref99 according to above changes. - Minor updates to build/add-copyright.py.	2018-10-07 20:30:32 -05:00
Field G. Van Zee	ac18949a4b	Multithreading optimizations for l3 macrokernels. Details: - Adjusted the method by which micropanels are assigned to threads in the 2nd (jr) and 1st (ir) loops around the microkernel to (mostly) employ contiguous "slab" partitioning rather than interleaved (round robin) partitioning. The new partitioning schemes and related details for specific families of operations are listed below: - gemm: slab partitioning. - herk: slab partitioning for region corresponding to non-triangular region of C; round robin partitioning for triangular region. - trmm: slab partitioning for region corresponding to non-triangular region of B; round robin partitioning for triangular region. (NOTE: This affects both left- and right-side macrokernels: trmm_ll, trmm_lu, trmm_rl, trmm_ru.) - trsm: slab partitioning. (NOTE: This affects only left-side macrokernels trsm_ll, trsm_lu; right-side macrokernels were not touched.) Also note that the previous macrokernels were preserved inside of the 'other' directory of each operation family directory (e.g. frame/3/gemm/other, frame/3/herk/other, etc). - Updated gemm macrokernel in sandbox/ref99 in light of above changes and fixed a stale function pointer type in blx_gemm_int.c (gemm_voft -> gemm_var_oft). - Added standalone test drivers in test/3m4m for herk, trmm, and trsm and minor changes to test/3m4m/Makefile. - Updated the arguments and definitions of bli__get_next_[ab]_upanel() and bli_trmm_?_?r_my_iter() macros defined in bli_l3_thrinfo.h. - Renamed bli_thread_get_range() APIs to bli_thread_range*().	2018-09-30 18:54:56 -05:00
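The difference between slab and round-robin assignment of micropanels, as discussed in `ac18949a4b` and `c92762ecdc`, can be shown with two small helpers. This is a sketch under assumptions: the function names are hypothetical, and the way leftover iterations are spread across threads in `slab_range_sketch()` is one reasonable choice, not necessarily what bli_thread_range_jrir() does.

```c
#include <stddef.h>

/* Slab: thread `tid` of `nt` receives one contiguous range [*start, *end)
   of the n_iter iterations. The first (n_iter % nt) threads take one
   extra iteration each. */
void slab_range_sketch(size_t n_iter, size_t nt, size_t tid,
                       size_t* start, size_t* end)
{
    size_t base = n_iter / nt;
    size_t rem  = n_iter % nt;

    *start = tid * base + (tid < rem ? tid : rem);
    *end   = *start + base + (tid < rem ? 1 : 0);
}

/* Round robin: thread `tid` owns iterations tid, tid + nt, tid + 2*nt, ...
   Returns nonzero if iteration i belongs to thread tid. */
int rr_owns_sketch(size_t i, size_t nt, size_t tid)
{
    return i % nt == tid;
}
```

With 10 iterations and 4 threads, slab assignment gives the threads the ranges [0,3), [3,6), [6,8), and [8,10), whereas round robin gives thread 1 the iterations 1, 5, and 9. Round robin balances work better when the cost per iteration varies, as in the triangular regions mentioned for herk and trmm.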
praveeng	86330953b1	Resolved conflicts and modified bli_trsm_small.c Change-Id: I578d419cff658003e0fdd4c4cdc93145d951ce31	2018-09-28 10:08:06 +05:30
Field G. Van Zee	4fa4cb0734	Trivial comment header updates. Details: - Removed four trailing spaces after "BLIS" that occurs in most files' commented-out license headers. - Added UT copyright lines to some files. (These files previously had only AMD copyright lines but were contributed to by both UT and AMD.) - In some files' copyright lines, expanded 'The University of Texas' to 'The University of Texas at Austin'. - Fixed various typos/misspellings in some license headers.	2018-08-29 18:06:41 -05:00
Field G. Van Zee	017548314f	Replaced function chooser macros w/ func ptr arrays. Details: - Previously, most object API functions (_oapi.c) used a function chooser macro that would expand out to an if-elseif-elseif-else conditional that used a num_t datatype to call the appropriate type-specific API (_tapi.c). This always felt a little hackish, and would get in the way somewhat of adding support for new num_t datatypes in the future. So, I've replaced that functionality with code that queries a function pointer that is then typecast appropriately. This model of function calling was already pervasive for kernels queried from the cntx_t structure. It was also already in use in various other functions, such as macrokernels, and this commit simply extends that pattern. - The above change required many new files, mostly header files, that define the function types (mostly _ft.h) for the queriable functions as well as some source files to define the function pointer arrays and their corresponding query functions (_fpa.c). Various other function types, mostly for kernel function types, were renamed to reduce the potential for confusion with the function types for expert and basic (non-expert) typed API functions. - Removed definitions for all of the "bli_call_ft_*()" function chooser macros from bli_misc_macro_defs.h.	2018-08-07 14:13:25 -05:00
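The dispatch pattern described in `017548314f` (query a function pointer by datatype, typecast it, call it) might be sketched as below. All names here are hypothetical stand-ins: `dt_sketch_t` plays the role of num_t, and `scalv_fpa_sketch` plays the role of one of the _fpa.c arrays. Only two datatypes are shown.

```c
#include <stddef.h>

typedef enum
{
    DT_FLOAT_SKETCH  = 0,
    DT_DOUBLE_SKETCH = 1,
    DT_COUNT_SKETCH  = 2
} dt_sketch_t;

/* Generic function pointer type used for storage in the array. */
typedef void (*void_fp_sketch)(void);

/* Function type shared by the type-specific implementations. */
typedef void (*scalv_ft_sketch)(size_t n, const void* alpha, void* x);

static void scalv_float_sketch(size_t n, const void* alpha, void* x)
{
    float  a  = *(const float*)alpha;
    float* xf = x;
    for (size_t i = 0; i < n; i++) xf[i] *= a;
}

static void scalv_double_sketch(size_t n, const void* alpha, void* x)
{
    double  a  = *(const double*)alpha;
    double* xd = x;
    for (size_t i = 0; i < n; i++) xd[i] *= a;
}

/* One entry per datatype, indexed directly by the datatype value. */
static const void_fp_sketch scalv_fpa_sketch[DT_COUNT_SKETCH] =
{
    (void_fp_sketch)scalv_float_sketch,
    (void_fp_sketch)scalv_double_sketch
};

/* Object-style entry point: query the pointer, cast it back, call it. */
void scalv_sketch(dt_sketch_t dt, size_t n, const void* alpha, void* x)
{
    scalv_ft_sketch f = (scalv_ft_sketch)scalv_fpa_sketch[dt];
    f(n, alpha, x);
}
```

Supporting a further datatype then means adding one array entry rather than another branch in every chooser conditional.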
Field G. Van Zee	71f9787195	Whitespace changes to macrokernels' func ptr defs.	2018-07-25 15:55:36 -05:00
Field G. Van Zee	ecbebe7c2e	Defined rntm_t to relocate cntx_t.thrloop (#235). Details: - Defined a new struct datatype, rntm_t (runtime), to house the thrloop field of the cntx_t (context). The thrloop array holds the number of ways of parallelism (thread "splits") to extract per level-3 algorithmic loop until those values can be used to create a corresponding node in the thread control tree (thrinfo_t structure), which (for any given level-3 invocation) usually happens by the time the macrokernel is called for the first time. - Relocating the thrloop from the cntx_t remedies a thread-safety issue when invoking level-3 operations from two or more application threads. The race condition existed because the cntx_t, a pointer to which is usually queried from the global kernel structure (gks), is supposed to be read-only. However, the previous code would write to the cntx_t's thrloop field after it had been queried, thus violating its read-only status. In practice, this would not cause a problem when a sequential application made a multithreaded call to BLIS, nor when two or more application threads used the same parallelization scheme when calling BLIS, because in either case all application threads would be using the same ways of parallelism for each loop. The true effects of the race condition were limited to situations where two or more application threads used different parallelization schemes for any given level-3 call. - In remedying the above race condition, the application or calling library can now specify the parallelization scheme on a per-call basis. All that is required is that the thread encode its request for parallelism into the rntm_t struct prior to passing the address of the rntm_t to one of the expert interfaces of either the typed or object APIs. This allows, for example, one application thread to extract 4-way parallelism from a call to gemm while another application thread requests 2-way parallelism. 
Or, two threads could each request 4-way parallelism, but from different loops. - A rntm_t* parameter has been added to the function signatures of most of the level-3 implementation stack (with the most notable exception being packm) as well as all level-1v, -1d, -1f, -1m, and -2 expert APIs. (A few internal functions gained the rntm_t* parameter even though they currently have no use for it, such as bli_l3_packm().) This required some internal calls to some of those functions to be updated since BLIS was already using those operations internally via the expert interfaces. For situations where a rntm_t object is not available, such as within packm/unpackm implementations, NULL is passed in to the relevant expert interfaces. This is acceptable for now since parallelism is not obtained for non-level-3 operations. - Revamped how global parallelism is encoded. First, the conventional environment variables such as BLIS_NUM_THREADS and BLIS__NT are only read once, at library initialization. (Thanks to Nathaniel Smith for suggesting this to avoid repeated calls getenv(), which can be slow.) Those values are recorded to a global rntm_t object. Public APIs, in bli_thread.c, are still available to get/set these values from the global rntm_t, though now the "set" functions have additional logic to ensure that the values are set in a synchronous manner via a mutex. If/when NULL is passed into an expert API (meaning the user opted to not provide a custom rntm_t), the values from the global rntm_t are copied to a local rntm_t, which is then passed down the function stack. Calling a basic API is equivalent to calling the expert APIs with NULL for the cntx and rntm parameters, which means the semantic behavior of these basic APIs (vis-a-vis multithreading) is unchanged from before. 
- Renamed bli_cntx_set_thrloop_from_env() to bli_rntm_set_ways_for_op() and reimplemented, with the function now being able to treat the incoming rntm_t in a manner agnostic to its origin--whether it came from the application or is an internal copy of the global rntm_t. - Removed various global runtime APIs for setting the number of ways of parallelism for individual loops (e.g. bli_thread_set__nt()) as well as the corresponding "get" functions. The new model simplifies these interfaces so that one must either set the total number of threads, OR set all of the ways of parallelism for each loop simultaneously (in a single function call). - Updated sandbox/ref99 according to above changes. - Rewrote/augmented docs/Multithreading.md to document the three methods (and two specific ways within each method) of requesting parallelism in BLIS. - Removed old, disabled code from bli_l3_thrinfo.c. - Whitespace changes to code (e.g. bli_obj.c) and docs/BuildSystem.md.	2018-07-17 18:37:32 -05:00
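The per-call model introduced in `ecbebe7c2e` can be sketched in a few lines. The types and functions below are hypothetical simplifications, not the real rntm_t or expert APIs: only three loops are modeled, the mutex that guards updates to the global object is omitted, and the "operation" merely reports how many threads it would use.

```c
#include <stddef.h>

/* Simplified stand-in for rntm_t: ways of parallelism for three loops. */
typedef struct
{
    int jc_ways;
    int ic_ways;
    int jr_ways;
} rntm_sketch_t;

/* Stand-in for the global rntm_t recorded once at library initialization. */
static rntm_sketch_t global_rntm_sketch = { 1, 1, 1 };

/* Expert-style entry point. A NULL rntm means "use the global settings".
   Either way, the call works on a private copy, so nothing shared is
   written while the operation runs. */
int gemm_ex_sketch(const rntm_sketch_t* rntm)
{
    rntm_sketch_t local = (rntm != NULL) ? *rntm : global_rntm_sketch;

    return local.jc_ways * local.ic_ways * local.jr_ways;
}

/* Basic-style entry point: equivalent to the expert call with NULL. */
int gemm_sketch(void)
{
    return gemm_ex_sketch(NULL);
}
```

Two application threads can then pass different `rntm_sketch_t` values (say, 4-way in the IC loop for one and 2-way in the JC loop for the other) without interfering with each other or with callers of the basic interface.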
Field G. Van Zee	3ee2bc0f7a	Renamed files that distinguish basic/expert APIs. Details: - Renamed various files that were previously named according to a "with context" or "without context" convention. For example, the following files in frame/3 were renamed: frame/3/bli_l3_oapi_woc.c -> frame/3/bli_l3_oapi_ba.c frame/3/bli_l3_oapi_wc.c -> frame/3/bli_l3_oapi_ex.c frame/3/bli_l3_tapi_woc.c -> frame/3/bli_l3_tapi_ba.c frame/3/bli_l3_tapi_wc.c -> frame/3/bli_l3_tapi_ex.c Here, the "ba" is for "basic" and "ex" is for "expert". This new naming scheme will make more sense especially if/when additional expert parameters are added to the expert APIs (typed and object).	2018-07-07 16:02:16 -05:00
Field G. Van Zee	e88aedae73	Separated expert, non-expert typed APIs. Details: - Split existing typed APIs into two subsets of interfaces: one for use with expert parameters, such as the cntx_t, and one without. This separation was already in place for the object APIs, and after this commit the typed and object APIs will have similar expert and non-expert APIs. The expert functions will be suffixed with "_ex" just as is the case for expert interfaces in the object APIs. - Updated internal invocations of typed APIs (functions such as bli_?setm() and bli_?scalv()) throughout BLIS to reflect use of the new explicitly expert APIs. - Updated example code in examples/tapi to reflect the existence (and usage) of non-expert APIs. - Bumped the major soname version number in 'so_version'. While code compiled against a previous version/commit will likely still work (since the old typed function symbol names still exist in the new API, just with one less function argument) the semantics of the function have changed if the cntx_t parameter the application passes in is non-NULL. For example, calling bli_daxpyv() with a non-NULL context does not behave the same way now as it did before; before, the context would be used in the computation, and now the context would be ignored since the interface for that function no longer expects a context argument.	2018-07-06 19:14:02 -05:00
Field G. Van Zee	87db5c048e	Changed usage of virtual microkernel slots in cntx. Details: - Changed the way virtual microkernels are handled in the context. Previously, there were query routines such as bli_cntx_get_l3_ukr_dt() which returned the native ukernel for a datatype if the method was equal to BLIS_NAT, or the virtual ukernel for that datatype if the method was some other value. Going forward, the context native and virtual ukernel slots will both be initialized to native ukernel function pointers for native execution, and for non-native execution the virtual ukernel pointer will be something else. This allows us to always query the virtual ukernel slot (from within, say, the macrokernel) without needing any logic in the query routine to decide which function pointer (native or virtual) to return. (Essentially, the logic has been shifted to init-time instead of compute-time.) This scheme will also allow generalized virtual ukernels as a way to insert extra logic in between the macrokernel and the native microkernel. - Initialize native contexts (in bli_cntx_ref.c) with native ukernel function addresses stored to the virtual ukernel slots pursuant to the above policy change. - Renamed all static functions that were native/virtual-ambiguous, such as bli_cntx_get_l3_ukr_dt() or bli_cntx_l3_ukr_prefers_cols_dt() pursuant to the above policy change. Those routines now use the substring "get_l3_vir_ukr" in their name instead of "get_l3_ukr". All of these functions were static functions defined in bli_cntx.h, and most uses were in level-3 front-ends and macrokernels. - Deprecated anti_pref bool_t in context, along with related functions such as bli_cntx_l3_ukr_eff_dislikes_storage_of(), now that 1m's panel-block execution is disabled.	2018-06-12 19:38:37 -05:00
Field G. Van Zee	55b6abdf74	Enforce consistent datatypes in most object APIs. Details: - Added logic to level-1v, -1d, -1f, -1m, -2, and -3 operations' _check() functions to ensure that all operands are of the same datatype. There are some exceptions that were left out, such as the _check() function for the various norm operations since they have a different idea of datatype consistency (ie: the norm object must be the real projection of the primary input vector/matrix object).	2018-06-07 14:08:12 -05:00
sraut	695cd520e2	AMD Copyright information changed to 2018 Change-Id: Idfd11afd5d252f8063d0158680d24bf7e2854469	2018-06-06 11:48:56 +05:30
sraut	df1dd24fd8	small matrix trsm intrinsics optimization code for AX=B and XA'=B Change-Id: I90123c4d9adbd314c867995cd19dc975150b448c	2018-06-06 11:24:33 +05:30
Field G. Van Zee	f97a86f322	Updated setting/querying pack schema (cntx->cntl). - Query pack schemas in level-3 bli__front() functions and store those values in the schema bitfields of the corresponding obj_t's when the cntx's method is not BLIS_NAT. (When method is BLIS_NAT, the default native schemas are stored to the obj_t's.) - In bli_l3_cntl_create_if(), query the schemas stored to the obj_t's in bli__front(), clear the schema bitfields, and pass the queried values into bli_gemm_cntl_create() and bli_trsm_cntl_create(). - Updated APIs for bli_gemm_cntl_create() and bli_trsm_cntl_create() to take schemas for A and B, and use these values to initialize the appropriate control tree nodes. (Also cpp-disabled the panel-block cntl tree creation variant, bli_gemmpb_cntl_create(), as it has not been employed by BLIS in quite some time.) - Simplified querying of schema in bli_packm_init() thanks to above changes. - Updated openmp and pthreads definitions of bli_l3_thread_decorator() so that thread-local aliases of matrix operands are guaranteed, even if aliasing is disabled within the internal back-end functions (e.g. bli_gemm_int.c). Also added a comment to bli_thrcomm_single.c explaining why the extra aliasing is not needed there. - Change bli_gemm() and level-3 friends so that the operation's ind() function is called only if all matrix operands have the same datatype, and only if that datatype is complex. The former condition is needed in preparation for work related to mixed domain operands, while the latter helps with readability, especially for those who don't want to venture into frame/ind. - Reshuffled arguments in bli_cntx_set_thrloop_from_env() to be consistent with BLIS calling conventions (modified argument(s) are last), and updated all invocations in the level-3 _front() functions. - Comment updates to bli_cntx_set_thrloop_from_env().	2018-06-02 20:28:20 -05:00
Field G. Van Zee	9588625c43	Renamed "next micropanel" macros in _l3_thrinfo.h. Details: - Renamed several macros defined in bli_l3_thrinfo.h designed to compute the values of a_next and b_next to insert into an auxinfo_t struct in level-3 macrokernels. (Previously, the macros did not use a bli_ prefix.) - Updated instances of above macro usage within various macrokernels.	2018-05-30 15:19:53 -05:00
Field G. Van Zee	4b36e85be9	Converted function-like macros to static functions. Details: - Converted most C preprocessor macros in bli_param_macro_defs.h and bli_obj_macro_defs.h to static functions. - Reshuffled some functions/macros to bli_misc_macro_defs.h and also between bli_param_macro_defs.h and bli_obj_macro_defs.h. - Changed obj_t-initializing macros in bli_type_defs.h to static functions. - Removed some old references to BLIS_TWO and BLIS_MINUS_TWO from bli_constants.h. - Whitespace changes in select files (four spaces to single tab).	2018-05-08 14:26:30 -05:00
Field G. Van Zee	75d0d1057d	Renamed various datatype-related macros/functions. Details: - Renamed the following macros in bli_obj_macro_defs.h and bli_param_macro_defs.h: - bli_obj_datatype() -> bli_obj_dt() - bli_obj_target_datatype() -> bli_obj_target_dt() - bli_obj_execution_datatype() -> bli_obj_exec_dt() - bli_obj_set_datatype() -> bli_obj_set_dt() - bli_obj_set_target_datatype() -> bli_obj_set_target_dt() - bli_obj_set_execution_datatype() -> bli_obj_set_exec_dt() - bli_obj_datatype_proj_to_real() -> bli_obj_dt_proj_to_real() - bli_obj_datatype_proj_to_complex() -> bli_obj_dt_proj_to_complex() - bli_datatype_proj_to_real() -> bli_dt_proj_to_real() - bli_datatype_proj_to_complex() -> bli_dt_proj_to_complex() - Renamed the following functions in bli_obj.c: - bli_datatype_size() -> bli_dt_size() - bli_datatype_string() -> bli_dt_string() - bli_datatype_union() -> bli_dt_union() - Removed a pair of old level-1f penryn intrinsics kernels that were no longer in use.	2018-04-30 14:57:33 -05:00
Field G. Van Zee	5112e1859e	Added missing 'restrict' to some kernels' cntx_t. Details: - Added missing 'restrict' keyword to cntx_t argument of function signatures corresponding to level-1v, level-1f, and level-1m kernels. This affected bli_l1v_ker_prot.h, bli_l1f_ker_prot.h, and bli_l1m_ker_prot.h. (The 'restrict' was already being used to qualify cntx_t* arguments for kernels defined in bli_l3_ker_prot.h.) - Added comments to bli_l1v_ker.h, bli_l1f_ker.h, bli_l1m_ker.h, and bli_l3_ukr.h that help explain how those headers function to produce kernel prototypes using the prototype macros defined in the files mentioned above.	2018-02-23 14:31:26 -06:00
Field G. Van Zee	16813335bd	Merge branch 'amd' into rt Details: - Merged contributions made by AMD via 'amd' branch (see summary below). Special thanks to AMD for their contributions to-date, especially with regard to intrinsic- and assembly-based kernels. - Added column storage output cases to microkernels in bli_gemm_zen_asm_d6x8.c and bli_gemmtrsm_l_zen_asm_d6x8.c. Even with the extra cost of transposing the microtile in registers, this is much faster than using the general storage case when the underlying matrix is column-stored. - Added s and d assembly-based zen gemmtrsm_u microkernel (including column storage optimization mentioned above). - Updated zen sub-configuration to reflect presence of new native kernels. - Temporarily reverted zen sub-configuration's level-3 cache blocksizes to smaller haswell values. - Temporarily disabled small matrix handling for zen configuration family in config/zen/bli_family_zen.h. - Updated zen CFLAGS according to changes in `1e4365b`. - Updated haswell microkernels such that: - only one vzeroupper instruction is called prior to returning - movapd/movupd are used in lieu of movaps/movups for double-real microkernels. (Note that single-real microkernels still use movaps/movups.) - Added kernel prototypes to kernels/zen/bli_kernels_zen.h, which is now included via frame/include/bli_arch_config.h. - Minor updates to bli_amaxv_ref.c (and to inlined "test" implementation in testsuite/src/test_amaxv.c). - Added early return for alpha == 0 in bli_dotxv_ref.c. - Integrated changes from `f07b176`, including a fix for undefined behavior when executing the 1m method under certain conditions. - Updated config_registry; no longer need haswell kernels for zen sub-configuration. - Tweaked marginal and pass thresholds for dotxf. - Reformatted level-1v, -1f, and -3 amd kernels and inserted additional comments. - Updated LICENSE file to explicitly mention that parts are copyright UT-Austin and AMD. 
- Added AMD copyright to header templates in build/templates. Summary of previous changes from 'amd' branch. - Added s and d assembly-based zen gemm microkernels (d6x8 and d8x6) and s and d assembly-based zen gemmtrsm_l microkernels (d6x8). - Added s and d intrinsics-based zen kernels for amaxv, axpyv, dotv, dotxv, and scalv, with extra-unrolling variants for axpyv and scalv. - Added a small matrix handler to bli_gemm_front(), with the handler implemented in kernels/zen/3/bli_gemm_small_matrix.c. - Added additional logic to sumsqv that first attempts to compute the sum of the squares via dotv(). If there is a floating-point exception (FE_OVERFLOW), then the previous (numerically conservative) code is used; otherwise, the result of dotv() is square-rooted and stored as the result. This new implementation is only enabled when FE_OVERFLOW is #defined. If the macro is not #defined, then the previous implementation is used. - Added axpyv and dotv standalone test drivers to test directory. - Added zen support to old cpuid_x86.c driver in build/auto-detect/old. - Added thread-local and __attribute__-related macros to bli_macro_defs.h.	2018-02-21 17:43:32 -06:00
Nisanth M P	5a7005dd44	Merge changes in AMD beta release 0.95 into amd branch	2018-01-03 12:37:53 +05:30
Field G. Van Zee	9804adfd40	Added option to disable pack buffer memory pools. Details: - Added a new configure option, --[en|dis]able-packbuf-pools, which will enable or disable the use of internal memory pools for managing buffers used for packing. When disabled, the function specified by the cpp macro BLIS_MALLOC_POOL is called whenever a packing buffer is needed (and BLIS_FREE_POOL is called when the buffer is ready to be released, usually at the end of a loop). When enabled, which was the status quo prior to this commit, a memory pool data structure is created and managed to provide threads with packing buffers. The memory pool minimizes calls to bli_malloc_pool() (i.e., the wrapper that calls BLIS_MALLOC_POOL), but does so through a somewhat more complex mechanism that may incur additional overhead in some (but not all) situations. The new option defaults to --enable-packbuf-pools. - Removed the reinitialization of the memory pools from the level-3 front-ends and replaced it with automatic reinitialization within the pool API's implementation. This required an extra argument to bli_pool_checkout_block() in the form of a requested size, but hides the complexity entirely from BLIS. And since bli_pool_checkout_block() is only ever called within a critical section, this change fixes a potential race condition in which threads using contexts with different cache blocksizes--most likely a heterogeneous environment--can check out pool blocks that are too small for the submatrices they wish to pack. Thanks to Nisanth Padinharepatt for reporting this potential issue. - Removed several functions in light of the relocation of pool reinit, including bli_membrk_reinit_pools(), bli_memsys_reinit(), bli_pool_reinit_if(), and bli_check_requested_block_size_for_pool(). - Updated the testsuite to print whether the memory pools are enabled or disabled.	2017-12-21 19:22:57 -06:00
Field G. Van Zee	70640a3710	Implemented library self-initialization. Details: - Defined two new functions in bli_init.c: bli_init_once() and bli_finalize_once(). Each is implemented with pthread_once(), which guarantees that, among the threads that pass in the same pthread_once_t data structure, exactly one thread will execute a user-defined function. (Thus, there is now a runtime dependency against libpthread even when multithreading is not enabled at configure-time.) - Added calls to bli_init_once() to top-level user APIs for all computational operations as well as many other functions in BLIS to all but guarantee that BLIS will self-initialize through the normal use of its functions. - Rewrote and simplified bli_init() and bli_finalize() and related functions. - Added -lpthread to LDFLAGS in common.mk. - Modified the bli_init_auto()/_finalize_auto() functions used by the BLAS compatibility layer to take and return no arguments. (The previous API that tracked whether BLIS was initialized, and then only finalized if it was initialized in the same function, was too cute by half and borderline useless because by default BLIS stays initialized when auto-initialized via the compatibility layer.) - Removed static variables that track initialization of the sub-APIs in bli_const.c, bli_error.c, bli_init.c, bli_memsys.c, bli_thread, and bli_ind.c. We don't need to track initialization at the sub-API level, especially now that BLIS can self-initialize. - Added a critical section around the changing of the error checking level in bli_error.c. - Deprecated bli_ind_oper_has_avail() as well as all functions bli_<opname>_ind_get_avail(), where <opname> is a level-3 operation name. These functions had no use cases within BLIS and likely none outside of BLIS. - Commented out calls to bli_init() and bli_finalize() in testsuite's main() function, and likewise for standalone test drivers in 'test' directory, so that self-initialization is exercised by default.	2017-12-11 17:18:43 -06:00
Field G. Van Zee	513ef4d040	Various typecasting fixes, mis-typed enums, etc. Details: - Fixed implicit typecasting of conj_t to trans_t in bli_[un]packm_cxk.c. - Properly typecast integer arguments to match format specifier in various calls to printf() in bli_l3_thrinfo.c, bli_cntx.c, bli_pool.c, and bli_util_oapi.c. - Fixed "unsigned less-than-comparison with zero" checks in bli_check.c, bli_cntx.h. - Fixed mis-typed enums in bli_cntx.c (e.g., l1mkr_t that should have been l1fkr_t or l1vkr_t). - Fixed instances of opid_t value BLIS_GEMM that should have been l3ukr_t value BLIS_GEMM_UKR in bli_cntx_ref.c. - NOTE: These issues were identified via compiler warnings when building BLIS with clang on a rather old installation of OS X: $ clang --version Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn) Target: x86_64-apple-darwin15.2.0 Thread model: posix	2017-12-11 12:35:59 -06:00
prangana	3bc99a96a3	Fix merge conflicts after rebase with release branch Change-Id: I581b26c6d515f717ff0dce91c7c0c92553aa2630	2017-12-11 13:07:59 +05:30
Field G. Van Zee	0c8afa546d	Fixed a minor bug in level-3 packm management. Details: - Fixed a bug in bli_l3_packm() that caused cntl_t-cached packed mem_t entries to be released and then re-acquired unnecessarily. (In essence, the "<" operands in the conditional that guards the release-and-reacquire code block simply needed to be swapped.) The bug should have only affected performance (rather than the computed result). Thanks to Minh Quan for identifying and reporting the bug.	2017-12-11 12:12:29 +05:30
Field G. Van Zee	95adc43d80	Moved 'family' field from cntx_t to cntl_t. Details: - Removed the family field inside the cntx_t struct and re-added it to the cntl_t struct. Updated all accessor functions/macros accordingly, as well as all consumers and intermediaries of the family parameter (such as bli_l3_thread_decorator(), bli_l3_direct(), and bli_l3_prune_()). This change was motivated by the desire to keep the context limited, as much as possible, to information about the computing environment. (The family field, by contrast, is a descriptor about the operation being executed.) - Added additional functions to bli_blksz_() API. - Added additional functions to bli_cntx_() API. - Minor updates to bli_func.c, bli_mbool.c. - Removed 'obj' from bli_blksz_() API names. - Removed 'obj' from bli_cntx_() API names. - Removed 'obj' from bli_cntl_(), bli__cntl_() API names. Renamed routines that operate only on a single struct to contain the "_node" suffix to differentiate with those routines that operate on the entire tree. - Added enums for packm and unpackm kernels to bli_type_defs.h. - Removed BLIS_1F and BLIS_VF from bszid_t definition in bli_type_defs.h. They weren't being used and probably never will be.	2017-12-11 12:12:29 +05:30
Field G. Van Zee	4f61528d56	Added 1m-specific APIs for bp, pb gemm algorithms. Details: - Defined bli_gemmbp_cntl_create(), bli_gemmpb_cntl_create(), with the body of bli_gemm_cntl_create() replaced with a call to the former. - Defined bli_cntl_free_w_thrinfo(), bli_cntl_free_wo_thrinfo(). Now, bli_cntl_free() can check if the thread parameter is NULL, and if so, call the latter, and otherwise call the former. - Defined bli_gemm1mbp_cntx_init(), bli_gemm1mpb_cntx_init(), both in terms of bli_gemm1mxx_cntx_init(), which behaves the same as bli_gemm1m_cntx_init() did before, except that an extra bool parameter (is_pb) is used to support both bp and pb algorithms (including to support the anti-preference field described below). - Added support for "anti-preference" in context. The anti_pref field, when true, will toggle the boolean return value of routines such as bli_cntx_l3_ukr_eff_prefers_storage_of(), which has the net effect of causing BLIS to transpose the operation to achieve disagreement (rather than agreement) between the storage of C and the micro-kernel output preference. This disagreement is needed for panel-block implementations, since they induce a transposition of the suboperation immediately before the macro-kernel is called, which changes the apparent storage of C. For now, anti-preference is used only with the pb algorithm for 1m (and not with any other non-1m implementation). - Defined new functions, bli_cntx_l3_ukr_eff_prefers_storage_of() bli_cntx_l3_ukr_eff_dislikes_storage_of() bli_cntx_l3_nat_ukr_eff_prefers_storage_of() bli_cntx_l3_nat_ukr_eff_dislikes_storage_of() which are identical to their non-"eff" (effectively) counterparts except that they take the anti-preference field of the context into account. - Explicitly initialize the anti-pref field to FALSE in bli_gks_cntx_set_l3_nat_ukr_prefs(). - Added bli_gemm_ker_var1.c, which implements a panel-block macro-kernel in terms of the existing block-panel macro-kernel _ker_var2(). 
This technique requires inducing transposes on all operands and swapping the A and B. - Changed bli_obj_induce_trans() macro so that pack-related fields are also changed to reflect the induced transposition. - Added a temporary hack to bli_l3_3m4m1m_oapi.c that allows us to easily specify the 1m algorithm (block-panel or panel-block). - Renamed the following cntx_t-related macros: bli_cntx_get_pack_schema_a() -> bli_cntx_get_pack_schema_a_block() bli_cntx_get_pack_schema_b() -> bli_cntx_get_pack_schema_b_panel() bli_cntx_get_pack_schema_c() -> bli_cntx_get_pack_schema_c_panel() and updated all instantiations. Also updated the field names in the cntx_t struct. - Comment updates.	2017-12-11 11:58:33 +05:30
Field G. Van Zee	1d728ccb23	Implemented the 1m method. Details: - Implemented the 1m method for inducing complex domain matrix multiplication. 1m support has been added to all level-3 operations, including trsm, and is now the default induced method when native complex domain gemm microkernels are omitted from the configuration. - Updated _cntx_init() operations to take a datatype parameter. This was needed for the corresponding function for 1m (because 1m requires us to choose between column-oriented or row-oriented execution, which requires us to query the context for the storage preference of the gemm microkernel, which requires knowing the datatype) but I decided that it made sense for consistency to add the parameter to all other cntx initialization functions as well, even though those functions don't use the parameter. - Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take a second scalar for each blocksize entry. The semantic meaning of the two scalars now is that the first will scale the default blocksize while the second will scale the maximum blocksize. This allows scaling the two independently, and was needed to support 1m, which requires scaling for a register blocksize but not the register storage blocksize (ie: "packdim") analogue. - Deprecated bli_blksz_reduce_dt_to() and defined two new functions, bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing default and maximum blocksizes to some desired blocksize multiple. These functions are needed in the updated definitions of bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs(). - Added support for the 1e and 1r packing schemas to packm, including 1e/1r packing kernels. 
- Added a minor optimization to bli_gemm_ker_var2() that allows, under certain circumstances (specifically, real domain beta and row- or column-stored matrix C), the real domain macrokernel and microkernel to be called directly, rather than using the virtual microkernel via the complex domain macrokernel, which carries a slight additional amount of overhead. - Added 1m support to the testsuite. - Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified some code in test_gemm.c driver.	2017-12-11 11:55:31 +05:30
Field G. Van Zee	b150870397	Removed most "old" directories. Details: - Removed the vast majority of directories named "old", which contained deprecated code that I wasn't quite ready to jettison from the source tree.	2017-12-08 16:08:41 -06:00

260 Commits