amd/blis - blis - Public git mirror

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-05-13 10:35:38 +00:00

Author	SHA1	Message	Date
Field G. Van Zee	cf06364327	Fixed typo in BLAS gemm3m call to _check(). Details: - Fixed an unresolved symbol issue leftover from #590 whereby ?gemm3m_() as defined in bla_gemm3m.c was referencing bla_gemm3m_check(), which does not exist. It should have simply called the _check() function for gemm.	2022-03-29 16:18:25 -05:00
Dipal M Zambare	1ec020b33e	AMD kernel updates; frame-specific AMD updates. (#597) Details: - Allow building BLIS with certain framework files (each with the '_amd' suffix) that have been customized by AMD for Zen-based hardware. These customized files were derived from portable versions of the same files (i.e., those without the '_amd' suffix). Whether the portable or AMD-specific files are compiled is now controlled by a new configure option, --[en|dis]able-amd-frame-tweaks. This option is disabled by default in vanilla BLIS, though AMD may choose to enable it by default in their fork. For now, the added AMD-specific files are: - bli_gemv_unf_var2_amd.c - bla_copy_amd.c - bla_gemv_amd.c These files reside in 'amd' subdirectories found within the directory housing their generic counterparts. - Register optimized real-domain copyv, setv, and swapv kernels in bli_cntx_init_zen.c. - Various minor updates to level-1v kernels in 'zen' kernel set. - Added caxpyf kernel as well as saxpyf and multiple daxpyf kernels to the 'zen' kernel set. - If the problem passed to ?gemm_() in bla_gemm.c has a unit m or n dim, call gemv instead and return early. - Combined variable declarations with their initialization in various level-2 and level-3 BLAS compatibility files, and also inserted 'const' qualifier in those same declaration statements. - Moved frame/compat/bla_gemmt.c and .h to frame/compat/extra/ . - Added copyv and swapv test drivers to 'test' directory. - Whitespace, comment changes.	2022-03-29 16:15:36 -05:00
Bhaskar Nallani	0db2bd5341	Added BLAS/CBLAS APIs for gemm3m. (#590) Details: - Created ?gemm3m_() and cblas_?gemm3m() APIs that (for now) simply invoke the 1m implementation unconditionally. (Note that these APIs bypass sup handling.) - Added BLAS prototypes for gemm3m in frame/compat/bla_gemm3m.h. - Added CBLAS prototypes for gemm3m in frame/compat/cblas/src/cblas.h. - Relocated: frame/compat/cblas/src/cblas_?gemmt.c files into frame/compat/cblas/src/extra/ - Relocated frame/compat/bla_gemmt.? into frame/compat/extra/ . - Minor reorganization of prototypes and cpp macro directives in bli_blas.h, cblas.h, and cblas_f77.h. - Trivial whitespace change to cblas_zgemm.c.	2022-03-24 18:41:55 -05:00
Field G. Van Zee	f1dbb0e514	Trivial whitespace change; commit log addendum. Details: - A co-attribution to Mithun Mohan was inadvertently omitted from the commit log for the headline change in the previous commit, `7c07b47`.	2022-03-11 13:38:28 -06:00
Field G. Van Zee	7c07b477e4	Avoid gemmsup barriers when not packing A or B. (#622 ) Details: - Implemented a multithreaded optimization for the special (and common) case of employing the gemmsup code path when the user requests (implicitly or explicitly) that neither A nor B be packed during computation. This optimization takes the form of a greatly reduced code branch in bli_thrinfo_sup_create_for_cntl(), which avoids a broadcast and two barriers, and results in higher performance when obtaining two-way or higher parallelism within BLIS. Thanks to Bhaskar Nallani of AMD for proposing this change via issue #605. - Added an early return branch to bli_thrinfo_create_for_cntl() that detects and quickly handles cases where no parallelism is being obtained within BLIS (i.e., single-threaded execution). Note that this special case handling was/is already present in bli_thrinfo_sup_create_for_cntl(). - CREDITS file update.	2022-03-11 13:28:50 -06:00
Field G. Van Zee	84732bf956	Revamp how tools are handled/checked by configure. Details: - Consolidate handling of tools that are specifiable via CC, CXX, FC, PYTHON, AR, and RANLIB into one bash function, select_tool_w_env(). - If the user specifies a tool via an environment variable (e.g. CC=gcc) and that tool does not seem valid, print an error message and abort configure, unless the tool is optional (e.g. CXX or FC), in which case a warning message is printed instead. - The definition of "seems valid" above amounts to: - responding to at least one of a basic set of command line options (e.g. --version, -V, -h) if the os_name is Linux (since GNU tools tend to respond to flags such as --version) or if the tool in question is CC, CXX, FC, or PYTHON (which tend to respond to the expected flags regardless of OS) - the binary merely existing for AR and RANLIB on Darwin/OSX/BSD. (These OSes tend to have non-GNU versions of ar and ranlib, which typically do not respond to --version and friends.) - This PR addresses #584. Thanks to Devin Matthews for suggesting some of the changes in this commit.	2022-02-28 12:19:31 -06:00
Field G. Van Zee	c9700f369a	Renamed SIMD-related macro constants for clarity. Details: - Renamed the following macros defined in bli_kernel_macro_defs.h: BLIS_SIMD_NUM_REGISTERS -> BLIS_SIMD_MAX_NUM_REGISTERS BLIS_SIMD_SIZE -> BLIS_SIMD_MAX_SIZE Also updated all instances of these macros elsewhere, including subconfigurations, source code, and documentation. Thanks to Devin Matthews for suggesting this change.	2022-02-15 15:36:52 -06:00
Field G. Van Zee	ee9ff988c4	Move edge cases to gemmtrsm ukrs; doc updates. Details: - Moved edge-case handling into the gemmtrsm microkernel. This required changing the microkernel API to take m and n dimension parameters as well as updating all existing gemmtrsm microkernel function pointer types, function signatures, and related definitions to take m and n dimensions. Also updated all existing gemmtrsm kernels in the 'kernels' directory (which for now is limited to haswell and penryn kernel sets, plus native and 1m-based reference kernels in 'ref_kernels') to take m and n dimensions, and implemented edge-case handling within those microkernels via a collection of new C preprocessor macros defined within bli_edge_case_macro_defs.h. Note that the edge-case handling for gemm-like operations had already been relocated into the gemm microkernel in `54fa28b`. - Added descriptive comments to GEMM_UKR_SETUP_CT() and related macros in bli_edge_case_macro_defs.h to allow for easier reading. - Updated docs/KernelsHowTo.md to reflect above changes. Also cleaned up the bullet under "Implementation Notes for gemm" that covers alignment issues. (Thanks to Ivan Korostelev for pointing out the confusing and outdated language in issue #591.) - Other minor tweaks to KernelsHowTo.md.	2022-02-15 15:01:51 -06:00
Devin Matthews	2506159346	Don't use `-Wl,-flat-namespace`. Flat namespaces can cause problems due to conflicting system libraries, etc., so just mark `xerbla_` as a weak symbol on macOS instead.	2022-02-13 20:11:55 -06:00
Devin Matthews	268ce1f29a	Relax alignment constraints. Remove alignment of temporary AB buffer in edge case handling macros unless alignment is specifically requested (e.g. Core2, SDB/IVB). Fixes #595.	2022-01-10 18:32:25 -06:00
Devin Matthews	54fa28bd84	Move edge cases to gemm ukr; more user-custom mods. (#583) Details: - Moved edge-case handling into the gemm microkernel. This required changing the microkernel API to take m and n dimension parameters. This required updating all existing gemm microkernel function pointer types, function signatures, and related definitions to take m and n dimensions. We also updated all existing kernels in the 'kernels' directory to take m and n dimensions, and implemented edge-case handling within those microkernels via a collection of new C preprocessor macros defined within bli_edge_case_macro_defs.h. Also removed the assembly code that formerly would handle general stride IO on the microtile, since this can now be handled by the same code that does edge cases. - Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and bli_trsm_cntl_create(), where this function pointer is used in lieu of the default macrokernel when it is non-NULL, and ignored when it is NULL. - Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single function using byte pointers rather than one function for each floating-point datatype. Also, obtain the microkernel function pointer from the .ukr field of the params struct embedded within the obj_t for matrix C (assuming params is non-NULL and contains a non-NULL value in the .ukr field). Communicate both the gemm microkernel pointer to use as well as the params struct to the microkernel via the auxinfo_t struct. - Defined gemm_ker_params_t type (for the aforementioned obj_t.params struct) in bli_gemm_var.h. - Retired the separate _md macrokernel for mixed datatype computation. We now use the reimplemented bli_gemm_ker_var2() instead. - Updated gemmt macrokernels to pass m and n dimensions into microkernel calls. - Removed edge-case handling from trmm and trsm macrokernels. - Moved most of bli_packm_alloc() code into a new helper function, bli_packm_alloc_ex().
- Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c. - Added test/syrk_diagonal and test/tensor_contraction directories with associated code to test those operations.	2021-12-24 08:00:33 -06:00
Devin Matthews	cf7d616a2f	Enable user-customized packm ukernel/variant. (#549 ) Details: - Added four new fields to obj_t: .pack_fn, .pack_params, .ker_fn, and .ker_params. These fields store pointers to functions and data that will allow the user to more flexibly create custom operations while recycling BLIS's existing partitioning infrastructure. - Updated typed API to packm variant and structure-aware kernels to replace the diagonal offset with panel offsets, and changed strides of both C and P to inc/ldim semantics. Updated object API to the packm variant to include rntm_t. - Removed the packm variant function pointer from the packm cntl_t node definition since it has been replaced by the .pack_fn pointer in the obj_t. - Updated bli_packm_int() to read the new packm variant function pointer from the obj_t and call it instead of from the cntl_t node. - Moved some of the logic of bli_l3_packm.c to a new file, bli_packm_alloc.c. - Rewrote bli_packm_blk_var1.c so that it uses byte (char) pointers instead of typed pointers, allowing a single function to be used regardless of datatype. This obviated having a separate implementation in bli_packm_blk_var1_md.c. Also relegated handling of scalars to a new function, bli_packm_scalar(). - Employed a new standard whereby right-hand matrix operands ("B") are always packed as column-stored row panels -- that is, identically to that of left-hand matrix operands ("A"). This means that while we pack matrix A normally, we actually pack B in a transposed state. This allowed us to simplify a lot of code throughout the framework, and also affected some of the logic in bli_l3_packa() and _packb(). - Simplified bli_packm_init.c in light of the new B^T convention described above. bli_packm_init()--which is now called from within bli_packm_blk_var1()--also now calls bli_packm_alloc() and returns a bool that indicates whether packing should be performed (or skipped). 
- Consolidated bli_gemm_int() and bli_trsm_int() into a bli_l3_int(), which, among other things, defaults the new .pack_fn field of the obj_t to bli_packm_blk_var1() if the field is NULL. - Defined a new function, bli_obj_reset_origin(), which permanently refocuses the view of an object so that it "forgets" any offsets from its original pointer. This function also sets the object's root field to itself. Calls to bli_obj_reset_origin() for each matrix operand appear in the _front() functions, after the obj_t's are aliased. This resetting of the underlying matrices' origins is needed in preparation for more advanced features from within custom packm kernels. - Redefined bli_pba_rntm_set_pba() from a regular function to a static inline function. - Updated gemm_ukr, gemmtrsm_ukr, and trsm_ukr testsuite modules to use libblis_test_pobj_create() to create local packed objects. Previously, these packed objects were created by calling lower-level functions.	2021-12-02 17:10:03 -06:00
Field G. Van Zee	b727645eb7	Merge branch 'dev'	2021-11-19 13:22:09 -06:00
Dipal M Zambare	26e4b6b293	Added support for AMD's Zen3 microarchitecture. Details: - Added a new 'zen3' subconfiguration targeting support for the AMD Zen3 microarchitecture (#561). Thanks to AMD for this contribution. - Restructured clang and AOCC support for zen, zen2, and zen3 make_defs.mk files. The clang and AOCC version detection now happens in configure, not in the subconfigurations' makefile fragments. That is, we've added logic to configure that detects the version of clang/AOCC, outputs an appropriate variable to config.mk (i.e., CLANG_OT_*, AOCC_OT_*), and then checks for it within the makefile fragment (as is currently done for the GCC_OT_* variables). - Added configure support for a GCC_OT_10_1_0 variable (and associated substitution anchor) to communicate whether the gcc version is older than 10.1.0, and use this variable to check for recent enough versions of gcc to use -march=znver3 in the zen3 subconfig. - Inlined the contents of config/zen/amd_config.mk into the zen and zen2 make_defs.mk so that the files are self-contained, harmonizing the format of all three Zen-based subconfigurations' make_defs.mk files. - Added indenting (with spaces) of GNU make conditionals for easier reading in zen, zen2, and zen3 make_defs.mk files. - Adjusted the range of models checked by bli_cpuid_is_zen() (which was previously 0x00 ~ 0xff and is now 0x00 ~ 0x2f) so that it is completely disjoint from the models checked by bli_cpuid_is_zen2() (0x30 ~ 0xff). This is normally necessary because Zen and Zen2 microarchitectures share the same family (23, or 0x17), and so the model code is the only way to differentiate the two. But in our case, fixing the model range for zen wasn't actually necessary since we checked for zen2 first, and therefore the wide zen range acted like the 'else' of an 'if-else' statement. That said, the change helps improve clarity for the reader by encoding useful knowledge, which was obtained from https://en.wikichip.org/wiki/amd/cpuid .
- Added zen2.def and zen3.def files to the collection in travis/cpuid. Note that support for zen, zen2, and zen3 is now present, and while all the three microarchitectures have identical instruction sets from the perspective of BLIS microkernels, they each correspond to different subconfigurations and therefore merit separate testing. Thanks to Devin Matthews for his guidance in hacking these files as slight modifications of zen.def. - Enabled testing of zen2 and zen3 via the SDE in travis/do_sde.sh. Now, zen, zen2, and zen3 are tested through the SDE via Travis CI builds. - Updated travis/do_sde.sh to grab the SDE tarball from a new ci-utils repository on GitHub rather than on Intel's website. This change was made in an attempt to circumvent recent troubles with Travis CI not being able to download the SDE directly from Intel's website via curl. Thanks to Devin Matthews for suggesting the idea. - Updated travis/do_sde.sh to grab the latest version (8.69.1) of the Intel SDE from the flame/ci-utils repository. - Updated .travis.yml to use gcc 9. The file was previously using gcc 8, which did not support -march=znver2. - Created amd64_legacy umbrella family in config_registry for targeting older (bulldozer, piledriver, steamroller, and excavator) microarchitectures and moved those same subconfigs out of the amd64 umbrella family. However, x86_64 retains amd64_legacy as a constituent member. - Fixed a bug in configure related to the building of the so-called config list. When processing the contents of config_registry, configure creates a series of structures and lists that allow for various mappings related to configuration families, subconfigs, and kernel sets. Two of those lists are built via substitution of umbrella families with their subconfig members, and one of those lists was improperly performing the substitution in a way that would erroneously match on partial umbrella family names. 
That code was changed to match the code that was already doing the substitution properly, via substitute_words(). Also added comments noting the importance of using substitute_words() in both instances. - Comment updates.	2021-11-17 13:02:00 -06:00
Field G. Van Zee	7bde468c6f	Added support for addons. Details: - Implemented a new feature called addons, which are similar to sandboxes except that there is no requirement to define gemm or any other particular operation. - Updated configure to accept --enable-addon=<name> or -a <name> syntax for requesting an addon be included within a BLIS build. configure now outputs the list of enabled addons into config.mk. It also outputs the corresponding #include directives for the addons' headers to a new companion to the bli_config.h header file named bli_addon.h. Because addons may wish to make use of existing BLIS types within their own definitions, the addons' headers must be included sometime after that of bli_config.h (which currently is #included before bli_type_defs.h). This is why the #include directives needed to go into a new top-level header file rather than the existing bli_config.h file. - Added a markdown document, docs/Addons.md, to explain addons, how to build with them, and what assumptions their authors should keep in mind as they create them. - Added a gemmlike-like implementation of sandwich gemm called 'gemmd' as an addon in addon/gemmd. The code uses a 'bao_' prefix for local functions, including the user-level object and typed APIs. - Updated .gitignore so that git ignores bli_addon.h files.	2021-11-13 16:39:37 -06:00
Meghana-vankadari	7bc8ab485e	Added BLAS/CBLAS APIs for axpby, gemm_batch. (#566 ) Details: - Expanded the BLAS compatibility layer to include support for ?axpby_() and ?gemm_batch_(). The former is a straightforward BLAS-like interface into the axpbyv operation while the latter implements a batched gemm via loops over bli_?gemm(). Also expanded the CBLAS compatibility layer to include support for cblas_?axpby() and cblas_?gemm_batch(), which serve as wrappers to the corresponding (new) BLAS-like APIs. Thanks to Meghana Vankadari for submitting these new APIs via #566. - Fixed a long-standing bug in common.mk that for some reason never manifested until now. Previously, CBLAS source files were compiled without the location of cblas.h being specified via a -I flag. I'm not sure why this worked, but it may be due to the fact that the cblas.h file resided in the same directory as all of the CBLAS source, and perhaps compilers implicitly add a -I flag for the directory that corresponds to the location of the source file being compiled. This bug only showed up because some CBLAS-like source code was moved into an 'extra' subdirectory of that frame/compat/cblas/src directory. After moving the code, compilation for those files failed (because the cblas.h header file, presumably, could not be found in the same location). This bug was fixed within common.mk by explicitly adding the cblas.h directory to the list of -I flags passed to the compiler. - Added test_axpbyv.c and test_gemm_batch.c files to 'test' directory, and updated test/Makefile to build those drivers. - Fixed typo in error message string in cblas_sgemm.c.	2021-11-11 16:46:14 -06:00
Devin Matthews	28b0982ea7	Refactored her[2]k/syr[2]k in terms of gemmt. (#531 ) Details: - Renamed herk macrokernels and supporting files and functions to gemmt, which is possible since at the macrokernel level they are identical. Then recast herk/her2k/syrk/syr2k in terms of gemmt within the expert level-3 oapi (bli_l3_oapi_ex.c) while also redefining them as literal functions rather than cpp macros that instantiate multiple functions. Thanks to Devin Matthews for his efforts on this issue (#531). - Check that the maximum stack buffer size is sufficiently large relative to the register blocksizes for each datatype, and do so when the context is initialized rather than when an operation is called. Note that with this change, users who pass in their own contexts into the expert interfaces currently will not have any checks performed. Thanks to Devin Matthews for suggesting this change.	2021-11-10 12:34:50 -06:00
Field G. Van Zee	cfa3db3f34	Fixed bug in mixed-dt gemm introduced in `e9da642`. Details: - Fixed a bug that broke certain mixed-datatype gemm behavior. This bug was introduced recently in `e9da642` when the code that performs the operation transposition (for microkernel IO preference purposes) was moved up so that it occurred sooner. However, when I moved that code, I failed to notice that there was a cpp-protected "if" conditional that applied to the entire code block that was moved. Once the code block was relocated, the orphaned if-statement was now (erroneously) glomming on to the next thing that happened to be in the function, which happened to be the call to bli_rntm_set_ways_for_op(), causing a rather odd memory exhaustion error in the sba due to the num_threads field of the rntm_t still being -1 (because the rntm_t fields were never processed as they should have been). Thanks to @ArcadioN09 (Snehith) for reporting this error and helpfully including relevant memory trace output.	2021-11-03 18:13:56 -05:00
Field G. Van Zee	f065a8070f	Removed support for 3m, 4m induced methods. Details: - Removed support for all induced methods except for 1m. This included removing code related to 3mh, 3m1, 4mh, 4m1a, and 4m1b as well as any code that existed only to support those implementations. These implementations were rarely used and posed code maintenance challenges for BLIS's maintainers going forward. - Removed reference kernels for packm that pack 3m and 4m micropanels, and removed 3m/4m-related code from bli_cntx_ref.c. - Removed support for 3m/4m from the code in frame/ind, then reorganized and streamlined the remaining code in that directory. The ind(), nat(), and 1m() APIs were all removed. (These additional API layers no longer made as much sense with only one induced method (1m) being supported.) The bli_ind.c file (and header) were moved to frame/base and bli_l3_ind.c (and header) and bli_l3_ind_tapi.h were moved to frame/3. - Removed 3m/4m support from the code in frame/1m/packm. - Removed 3m/4m support from trmm/trsm macrokernels and simplified some pointer arithmetic that was previously expressed in terms of the bli_ptr_inc_by_frac() static inline function (whose definition was also removed). - Removed the following subdirectories of level-0 macro headers from frame/include/level0: ri3, rih, ri, ro, rpi. The level-0 scalar macros defined in these directories were used exclusively for 3m and 4m method codes. - Simplified bli_cntx_set_blkszs() and bli_cntx_set_ind_blkszs() in light of 1m being the only induced method left within BLIS. - Removed dt_on_output field within auxinfo_t and its associated accessor functions. - Re-indexed the 1e/1r pack schemas after removing those associated with variants of the 3m and 4m methods. This leaves two bits unused within the pack format portion of the schema bitfield. (See bli_type_defs.h for more info.) 
- Spun off the basic and expert interfaces to the object and typed APIs into separate files: bli_l3_oapi.c and bli_l3_oapi_ex.c; bli_l3_tapi.c and bli_l3_tapi_ex.c. - Moved the level-3 operation-specific _check function calls from the operations' _front() functions to the corresponding _ex() function of the object API. (This change roughly maintains where the _check() functions are called in the call stack but lays the groundwork for future changes that may come to the level-3 object APIs.) Minor modifications to bli_l3_check.c to allow the check() functions to be called from the expert interface APIs. - Removed support within the testsuite for testing the aforementioned induced methods, and updated the standalone test drivers in the 'test' directory to reflect the retirement of those induced methods. - Modified the sandbox contract so that the user is obliged to define bli_gemm_ex() instead of bli_gemmnat(). (This change was made in light of the nat() functions no longer existing.) Also updated the existing 'power10' and 'gemmlike' sandboxes to come into compliance with the new sandbox rules. - Updated BLISObjectAPI.md, BLISTypedAPI.md, Testsuite.md documentation to reflect the retirement of 3m/4m, and also modified Sandboxes.md to bring the document into alignment with new conventions. - Updated various comments; removed segments of commented-out code.	2021-10-28 16:05:43 -05:00
Field G. Van Zee	e9da6425e2	Allow use of 1m with mixing of row/col-pref ukrs. Details: - Fixed a bug that broke the use of 1m for dcomplex when the single-precision real and double-precision real ukernels had opposing I/O preferences (row-preferential sgemm ukernel + column-preferential dgemm ukernel, or vice versa). The fix involved adjusting the API to bli_cntx_set_ind_blkszs() so that the induced method context init function (e.g., bli_cntx_init_<subconfig>_ind()) could call that function for only one datatype at a time. This allowed the blocksize scaling (which varies depending on whether we're doing 1m_r or 1m_c) to happen on a per-datatype basis. This fixes issue #557. Thanks to Devin Matthews and RuQing Xu for helping discover and report this bug. - The aforementioned 1m fix required moving the 1m_r/1m_c logic from bli_cntx_ref.c into a new function, bli_l3_set_schemas(), which is called from each level-3 _front() function. The pack_t schemas in the cntx_t were also removed entirely, along with the associated accessor functions. This in turn required updating the trsm1m-related virtual ukernels to read the pack schema for B from the auxinfo_t struct rather than the context. This also required slight tweaks to bli_gemm_md.c. - Repositioned the logic for transposing the operation to accommodate the microkernel IO preference. This mostly only affects gemm. Thanks to Devin Matthews for his help with this. - Updated dpackm pack ukernels in the 'armsve' kernel set to avoid querying pack_t schemas from the context. - Removed the num_t dt argument from the ind_cntx_init_ft type defined in bli_gks.c. The context initialization functions for induced methods were previously passed a dt argument, but I can no longer figure out why they were passed this value. To reduce confusion, I've removed the dt argument (including also from the function definition + prototype). - Commented out setting of cntx_t schemas in bli_cntx_ind_stage.c.
This breaks high-level implementations of 3m and 4m, but this is okay since those implementations will be removed very soon. - Removed some older blocks of preprocessor-disabled code. - Comment update to test_libblis.c.	2021-10-13 14:15:38 -05:00
Minh Quan Ho	81e1034632	Alloc at least 1 elem in pool_t block_ptrs. (#560) Details: - Previously, the block_ptrs field of the pool_t was allowed to be initialized as any unsigned integer, including 0. However, a length of 0 could be problematic given that the behavior of malloc(0) is implementation-defined and therefore variable across implementations. As a safety measure, we check for block_ptrs array lengths of 0 and, in that case, increase them to 1. - Co-authored-by: Minh Quan Ho <minh-quan.ho@kalray.eu>	2021-10-13 13:28:02 -05:00
Minh Quan Ho	327481a4b0	Fix insufficient pool-growing logic in bli_pool.c. (#559) Details: - The current mechanism for growing a pool_t doubles the length of the block_ptrs array every time the array length needs to be increased due to new blocks being added. However, that logic did not take into account the new total number of blocks, and the fact that the caller may be requesting more blocks than would fit even after doubling the current length of block_ptrs. The code comments now contain two illustrating examples that show why, even after doubling, we must always have at least enough room to fit all of the old blocks plus the newly requested blocks. - This commit also happens to fix a memory corruption issue that stems from growing any pool_t that is initialized with a block_ptrs length of 0. (Previously, the memory pool for packed buffers of C was initialized with a block_ptrs length of 0, but because it is unused this bug did not manifest by default.) - Co-authored-by: Minh Quan Ho <minh-quan.ho@kalray.eu>	2021-10-12 12:53:04 -05:00
Devin Matthews	4277fec0d0	Merge pull request #533 from xrq-phys/arm64-hi-bw ARMv8 PACKM and GEMMSUP Kernels + Apple Firestorm Subconfig	2021-10-07 13:47:22 -05:00
RuQing Xu	a024715065	Firestorm CPUID Dispatcher. Commenting out <sys/sysctl.h>, possibly due to an Xcode bug.	2021-10-07 00:15:54 +09:00
Devin Matthews	34919de3df	Make error checking level a thread-local variable. Previously, this was a global variable. Setting the value was synchronized via a mutex but reading the value was not. Of course, these accesses are almost certainly atomic, but there is still the possibility of one thread attempting to set the value and then reading the value set by another thread. For correct operation under user threading (e.g. pthreads), this should probably be thread-local with no mutex.	2021-10-05 15:22:31 -05:00
Devin Matthews	079fbd42ce	Merge branch 'master' into arm64-hi-bw	2021-10-04 17:21:48 -05:00
Devin Matthews	6d3036e31d	Merge pull request #545 from hominhquan/clean_error bli_error: more cleanup on the error strings array	2021-10-04 15:58:43 -05:00
Dave Love	d0a0b4b841	Arm micro-architecture dispatch (#344 ) Details: - Reworked support for ARM hardware detection in bli_cpuid.c to parse the result of a CPUID-like instruction. - Added a64fx support to bli_gks.c. - #include arm64 and arm32 family headers from bli_arch_config.h. - Fix the ordering of the "armsve" and "a64fx" strings in the config_name string array in bli_arch.c. The ordering did not match the ordering of the corresponding arch_t values in bli_type_defs.h, as it should have all along. - Added clang support to make_defs.mk in arm64, cortexa53, cortexa57 subconfigs. - Updated arm64 and arm32 families in config_registry. - Updated docs/HardwareSupport.md to reflect added ARM support. - Thanks to Dave Love, RuQing Xu, and Devin Matthews for their contributions in this PR (#344).	2021-10-04 13:03:04 -05:00
Field G. Van Zee	1f527a93b9	Re-enable and fix `fb93d24`. Details: - Re-enabled the changes made in `fb93d24`. - Defined BLIS_ENABLE_SYSTEM in bli_arch.c, bli_cpuid.c, and bli_env.c, all of which needed the definition (in addition to config_detect.c) in order for the configure-time hardware detection binary to be compiled properly. Thanks to Minh Quan Ho for helping identify these additional files as needing to be updated. - Added additional comments to all four source files, most notably to prompt the reader to remember to update all of the files when updating any of the files. Also made the cpp code in each of the files as consistent/similar as possible. - Refer to issue #532 and PR #546 for more history.	2021-09-20 17:56:36 -05:00
Field G. Van Zee	7b39c14920	Reverted `fb93d24`. Details: - The latest changes in `fb93d24` are still causing problems. Reverting and preparing to move them to a branch.	2021-09-20 16:13:50 -05:00
Field G. Van Zee	fb93d242a4	Re-enable and fix `8e0c425` (BLIS_ENABLE_SYSTEM). Details: - Re-enable the changes originally made in `8e0c425` but quickly reverted in `2be78fc`. - Moved the #include of bli_config.h so that it occurs before the #include of bli_system.h. This allows the #define BLIS_ENABLE_SYSTEM or #define BLIS_DISABLE_SYSTEM in bli_config.h to be processed by the time it is needed in bli_system.h. This change should have been in the original `8e0c425`, but was accidentally omitted. Thanks to Minh Quan Ho for catching this. - Add #define BLIS_ENABLE_SYSTEM to config_detect.c so that the proper cpp conditional branch executes in bli_system.h when compiling the hardware detection binary. The changes made in `8e0c425` were an attempt to support the definition of BLIS_OS_NONE when configuring with --disable-system (in issue #532). That commit failed because, aside from the required but omitted header reordering (second bullet above), AppVeyor was unable to compile the hardware detection binary as a result of missing Windows headers. This commit, which builds on PR #546, should help fix that issue. Thanks to Minh Quan Ho for his assistance and patience on this matter.	2021-09-20 15:42:08 -05:00
Minh Quan HO	eaa554aa52	bli_error: more cleanup on the error strings array - There was redundancy between the macro BLIS_MAX_NUM_ERR_MSGS (=200) and the enum BLIS_ERROR_CODE_MAX (-170), while they both mean the same thing: the maximal number of error codes/messages. - The previous initialization of error messages at compile time ignored that the 'bli_error_string' array still occupies useless memory due to its 2D char[][] declaration. Instead, it should be just an array of pointers, pointing at strings in the .rodata section. - This commit makes two modifications: * retired macros BLIS_MAX_NUM_ERR_MSGS and BLIS_MAX_ERR_MSG_LENGTH everywhere * switched bli_error_string from char[][] to char *[] to reduce its footprint from 40KB (200*200) to 1.3KB (170*sizeof(char*)). (No problem to use the enum BLIS_ERROR_CODE_MAX at compile-time, since the compiler is smart enough to determine its value is 170.)	2021-09-20 10:39:05 +02:00
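The footprint figures quoted in eaa554aa52 can be checked with a small sketch. The array names, the size macros, and the two sample messages below are hypothetical stand-ins, not BLIS identifiers; only the dimensions (200 by 200 before, 170 pointers after) come from the commit message.

```c
#include <stddef.h>

/* Hypothetical sizes mirroring the figures in the commit message. */
#define OLD_NUM_MSGS 200
#define OLD_MSG_LEN  200
#define NEW_NUM_MSGS 170

/* 2D layout: every slot reserves OLD_MSG_LEN bytes, whether used or not. */
static const char msgs_2d[OLD_NUM_MSGS][OLD_MSG_LEN] = { "ok", "invalid argument" };

/* Pointer layout: one pointer per slot; the string literals themselves
   are stored once, typically in read-only data. */
static const char *msgs_ptr[NEW_NUM_MSGS] = { "ok", "invalid argument" };

/* Size in bytes of each table (excluding the literals in the pointer case). */
size_t footprint_2d(void)  { return sizeof msgs_2d; }
size_t footprint_ptr(void) { return sizeof msgs_ptr; }
```

On a platform with 8-byte pointers, `footprint_ptr()` is 1360 bytes, which matches the roughly 1.3KB stated in the message; `footprint_2d()` is 40000 bytes regardless of pointer width.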
Field G. Van Zee	52f29f739d	Removed last vestige of #define BLIS_NUM_ARCHS. Details: - Removed the commented-out #define BLIS_NUM_ARCHS in bli_type_defs.h and its associated (now outdated) comments. BLIS_NUM_ARCHS has been part of the arch_t enum for some time now, and so this change is mostly about removing any opportunity for confusion for people who may be reading the code. Thanks to Minh Quan Ho for leading me to cleanup.	2021-09-17 08:38:29 -05:00
Devin Matthews	9c0064f3f6	Fix config_name in bli_arch.c	2021-09-10 10:39:04 -05:00
Field G. Van Zee	2be78fc977	Disabled (at least temporarily) commit `8e0c425`. Details: - Reverted changes in `8e0c425` due to AppVeyor build failures that we do not yet understand.	2021-08-27 12:17:26 -05:00
Field G. Van Zee	8e0c4255de	Define BLIS_OS_NONE when using --disable-system. Details: - Modified bli_system.h so that the cpp macro BLIS_OS_NONE is defined when BLIS_DISABLE_SYSTEM is defined. Otherwise, the previous OS- detecting macro conditionals are considered. This change is to accommodate a solution to a cross-compilation issue described in #532.	2021-08-26 15:29:18 -05:00
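A minimal sketch of the conditional described in 8e0c4255de, and of why the configuration header has to be seen first. Only BLIS_DISABLE_SYSTEM and BLIS_OS_NONE are taken from the commit messages; the DEMO_* macros and the function are hypothetical, and the real bli_system.h may be organized differently.

```c
/* Stand-in for bli_config.h: configure is assumed to emit one of
   BLIS_ENABLE_SYSTEM or BLIS_DISABLE_SYSTEM. */
#define BLIS_DISABLE_SYSTEM

/* Stand-in for the OS detection in bli_system.h. This part only works
   if the definition above has already been processed, which is why the
   order of the two #include directives matters. */
#if defined(BLIS_DISABLE_SYSTEM)
  #define BLIS_OS_NONE
#elif defined(_WIN32)
  #define DEMO_OS_WINDOWS   /* hypothetical name */
#elif defined(__APPLE__)
  #define DEMO_OS_OSX       /* hypothetical name */
#else
  #define DEMO_OS_OTHER     /* hypothetical name */
#endif

/* Returns 1 when the OS-less branch was selected at compile time. */
int demo_os_is_none(void)
{
#ifdef BLIS_OS_NONE
    return 1;
#else
    return 0;
#endif
}
```

If the first `#define` were moved below the `#if` chain, the chain would fall through to the host OS checks, which is the kind of ordering problem the later follow-up commits describe.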
Field G. Van Zee	e320ec6d5c	Moved lang defs from _macro_def.h to _lang_defs.h. Details: - Moved miscellaneous language-related definitions, including defs related to the handling of the 'restrict' keyword, from the top half of bli_macro_defs.h into a new file, bli_lang_defs.h, which is now #included immediately after "bli_system.h" in blis.h. This change is an attempt to fix a report of recent breakage of C++ compilers due to the recent introduction of 'restrict' in bli_type_defs.h (which previously was being included before bli_macro_defs.h and its restrict handling therein). Thanks to Ivan Korostelev for reporting this issue in #527. - CREDITS file update.	2021-08-20 17:15:20 -05:00
Devin Matthews	2c0b4150e4	Merge pull request #527 from flame/obj_t_makeover Implement proposed new function pointer fields for obj_t.	2021-08-14 18:41:35 -05:00
Field G. Van Zee	4b8ed99d92	Whitespace tweaks.	2021-08-13 15:31:10 -05:00
Devin Matthews	1772db029e	Add row- and column-strides for A/B in obj_ukr_fn_t.	2021-08-13 14:46:35 -05:00
Devin Matthews	4f70eb7913	Clean up some warnings that show up on clang/OSX.	2021-08-13 11:12:43 -05:00
Devin Matthews	3cddce1e2a	Remove schema field on obj_t (redundant) and add new API functions.	2021-08-12 22:32:34 -05:00
Field G. Van Zee	20a1c4014c	Disabled sanity check in bli_pool_finalize(). Details: - Disabled a sanity check in bli_pool_finalize() that was meant to alert the user if a pool_t was being finalized while some blocks were still checked out. However, this is exactly the situation that might happen when a pool_t is re-initialized for a larger blocksize, and currently bli_pool_reinit() is implemented as _finalize() followed by _init(). So, this sanity check is not universally appropriate. Thanks to AMD-India for reporting this issue.	2021-08-12 14:44:04 -05:00
RuQing Xu	e38ca28689	Added Apple Firestorm (A14/M1) Subconfig - Use the same bulk kernel as Cortex-A53 / ThunderX2; - Larger block size; - Use gemmsup kernels for double precision.	2021-08-13 03:21:19 +09:00
Devin Matthews	64a1f786d5	Implement proposed new function pointer fields for obj_t. The added fields: 1. `pack_t schema`: storing the pack schema on the object allows the macrokernel to act accordingly without side-channel information from the rntm_t and cntx_t. The pack schema and "pack_[ab]" fields could be removed from those structs. 2. `void* user_data`: this field can be used to store any sort of additional information provided by the user. The pointer is propagated to submatrix objects and copies, but is otherwise ignored by the framework and the default implementations of the following three fields. User-specified pack, kernel, or ukr functions can do whatever they want with the data, and the user is 100% responsible for allocating, assigning, and freeing this buffer. 3. `obj_pack_fn_t pack`: the function called when a matrix is packed. This function receives the expected arguments, as well as a mdim_t and mem_t* as memory must be allocated inside this function, and behavior may differ based on which matrix is being packed (i.e. transposition for B). This could also be achieved by passing a desired pack schema, but this would require additional information to travel down the control tree. 4. `obj_ker_fn_t ker`: the function called when we get to the "second loop", or the macro-kernel. Behavior may depend on the pack schemas of the input matrices. The default implementation would perform the inner two loops around the ukr, and then call either the default ukr or a user-supplied one (next field). 5. `obj_ukr_fn_t ukr`: the function called by the default macrokernel. This would replace the various current "virtual" microkernels, and could also be used to supply user-defined behavior. Users could supply both a custom kernel (above) and microkernel, although the user-specified kernel does not necessarily have to call the ukr function specified on the obj_t. Note that no macros or functions for accessing these new fields have been defined yet. That is next once these are finalized. Addresses https://github.com/flame/blis/projects/1#card-62357687.	2021-08-11 18:11:47 -05:00
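The idea in 64a1f786d5 of carrying a user data pointer and an overridable function pointer on the object itself can be sketched as follows. Every name here (`demo_obj_t`, `demo_ukr_fn_t`, `demo_run`, and so on) is hypothetical; the real obj_t fields and the obj_ukr_fn_t signature are not shown in this log and are not reproduced here.

```c
#include <stddef.h>

typedef struct demo_obj demo_obj_t;

/* Hypothetical per-object "microkernel" hook. */
typedef void (*demo_ukr_fn_t)(const demo_obj_t *obj, double *out);

struct demo_obj {
    double        value;      /* stand-in for the matrix data          */
    void         *user_data;  /* owned by the user, ignored by default */
    demo_ukr_fn_t ukr;        /* NULL selects the default behavior     */
};

/* Default hook: never looks at user_data. */
static void demo_default_ukr(const demo_obj_t *obj, double *out)
{
    *out = obj->value;
}

/* User-supplied hook: interprets user_data as a scaling factor. */
void demo_scaling_ukr(const demo_obj_t *obj, double *out)
{
    const double *factor = obj->user_data;
    *out = obj->value * (*factor);
}

/* Stand-in for the framework: dispatch through the object's hook. */
double demo_run(const demo_obj_t *obj)
{
    double out = 0.0;
    demo_ukr_fn_t fn = obj->ukr ? obj->ukr : demo_default_ukr;
    fn(obj, &out);
    return out;
}
```

In this sketch an object with `ukr` left NULL takes the default path, while an object with `demo_scaling_ukr` and a pointer to a `double` factor in `user_data` produces the scaled value. The caller keeps ownership of whatever `user_data` points to, as the commit message requires.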
Field G. Van Zee	a32257eeab	Fixed bli_init.c compile-time error on OSX clang. Details: - Fixed a compile-time error in bli_init.c when compiling with OSX's clang. This error was introduced in `868b901`, which introduced a post-declaration struct assignment where the RHS was a struct initialization expression (i.e. { ... }). This use of struct initializer expressions apparently works with gcc despite it not being strict C99. The fix included in this commit declares a temporary variable for the purposes of being initialized to the desired value, via the struct initializer, and then copies the temporary struct (via '=' struct assignment) to the persistent struct. Thanks to Devin Matthews for his help with this.	2021-08-05 16:23:02 -05:00
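The workaround described in a32257eeab can be illustrated with a hypothetical struct type. `demo_ctl_t`, `DEMO_CTL_INIT`, and the functions below are invented for this sketch and stand in for whatever type and initializer macro bli_init.c actually uses.

```c
/* Hypothetical control struct with a brace-style initializer macro. */
typedef struct { int state; int owner; } demo_ctl_t;
#define DEMO_CTL_INIT { 0, -1 }

static demo_ctl_t ctl = DEMO_CTL_INIT;   /* fine: this is a declaration */

/* Simulate the struct being modified during use. */
void demo_ctl_use(void)
{
    ctl.state = 1;
    ctl.owner = 7;
}

void demo_ctl_reset(void)
{
    /* ctl = DEMO_CTL_INIT;   <- rejected by strict C99 compilers: a
       brace-enclosed initializer is only valid in a declaration. */
    const demo_ctl_t fresh = DEMO_CTL_INIT;  /* initialize a temporary  */
    ctl = fresh;                             /* plain struct assignment */
}

int demo_ctl_state(void) { return ctl.state; }
int demo_ctl_owner(void) { return ctl.owner; }
```

A C99 compound literal, `ctl = (demo_ctl_t){ 0, -1 };`, is another standard way to write the same assignment; the temporary variable matches the approach the commit message describes.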
Field G. Van Zee	868b90138e	Fixed one-time use property of bli_init() (#525). Details: - Fixes a rather obvious bug that resulted in segmentation fault whenever the calling application tried to re-initialize BLIS after its first init/finalize cycle. The bug resulted from the fact that the bli_init.c APIs made no effort to allow bli_init() to be called subsequent times at all due to it, and bli_finalize(), being implemented in terms of pthread_once(). This has been fixed by resetting the pthread_once_t control variable for initialization at the end of bli_finalize_apis(), and by resetting the control variable for finalization at the end of bli_init_apis(). Thanks to @lschork2 for reporting this issue (#525), and to Minh Quan Ho and Devin Matthews for suggesting the chosen solution. - CREDITS file update.	2021-08-04 18:31:01 -05:00
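The re-arming scheme from 868b90138e can be sketched with two pthread_once_t control variables that reset each other. All `demo_*` names are hypothetical. Note that POSIX does not explicitly bless re-assigning a control variable after use, so this mirrors the approach described in the commit message rather than a guaranteed-portable idiom, and it assumes init and finalize are not racing each other.

```c
#include <pthread.h>

static pthread_once_t once_init     = PTHREAD_ONCE_INIT;
static pthread_once_t once_finalize = PTHREAD_ONCE_INIT;

static int init_count     = 0;  /* how many times init really ran */
static int is_initialized = 0;

static void demo_init_apis(void)
{
    init_count++;
    is_initialized = 1;
    /* Re-arm finalization at the end of initialization. A temporary is
       used because PTHREAD_ONCE_INIT may expand to a brace initializer. */
    const pthread_once_t fresh = PTHREAD_ONCE_INIT;
    once_finalize = fresh;
}

static void demo_finalize_apis(void)
{
    is_initialized = 0;
    /* Re-arm initialization at the end of finalization. */
    const pthread_once_t fresh = PTHREAD_ONCE_INIT;
    once_init = fresh;
}

void demo_init(void)     { pthread_once(&once_init, demo_init_apis); }
void demo_finalize(void) { pthread_once(&once_finalize, demo_finalize_apis); }

int demo_init_count(void)     { return init_count; }
int demo_is_initialized(void) { return is_initialized; }
```

Calling `demo_init()` twice in a row still runs the initializer once, but an init, finalize, init sequence now runs it a second time instead of silently skipping it.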
Field G. Van Zee	689fa0f403	Merge branch 'master' into dev	2021-06-13 19:44:14 -05:00
Devin Matthews	7c3eb44efa	Add vhsubpd/vhsubpd. Horizontal subtraction instructions added to bli_x86_asm_macros.h, currently unused [ci skip].	2021-06-02 11:28:22 -05:00
Field G. Van Zee	213dce32d2	Added a new 'gemmlike' sandbox. Details: - Added a new sandbox called 'gemmlike', which implements sequential and multithreaded gemm in the style of gemmsup but also unconditionally employs packing. The purpose of this sandbox is to (1) avoid select abstractions, such as objects and control trees, in order to allow readers to better understand how a real-world implementation of high-performance gemm can be constructed; (2) provide a starting point for expert users who wish to build something that is gemm-like without "reinventing the wheel." Thanks to Jeff Diamond, Tze Meng Low, Nicholai Tukanov, and Devangi Parikh for requesting and inspiring this work. - The functions defined in this sandbox currently use the "bls_" prefix instead of "bli_" in order to avoid any symbol collisions in the main library. - The sandbox contains two variants, each of which implements gemm via a block-panel algorithm. The only difference between the two is that variant 1 calls the microkernel directly while variant 2 calls the microkernel indirectly, via a function wrapper, which allows the edge case handling to be abstracted away from the classic five loops. - This sandbox implementation utilizes the conventional gemm microkernel (not the skinny/unpacked gemmsup kernels). - Updated some typos in the comments of a few files in the main framework.	2021-05-28 14:49:57 -05:00
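The difference between the two sandbox variants in 213dce32d2 (calling the microkernel directly versus through a wrapper that hides edge cases) can be sketched as below. This is not the sandbox code: `MR`, `NR`, `demo_ukr`, and `demo_ukr_wrapped` are hypothetical, the microkernel is a plain reference loop, and the packed layouts are an assumption made for the example.

```c
#include <stddef.h>

enum { MR = 4, NR = 4 };  /* hypothetical register block sizes */

/* Reference microkernel: C (MR x NR) += A (MR x k) * B (k x NR).
   A is assumed packed so that each k step holds MR contiguous values,
   B so that each k step holds NR contiguous values. */
static void demo_ukr(size_t k, const double *a, const double *b,
                     double *c, size_t rs_c, size_t cs_c)
{
    for (size_t p = 0; p < k; ++p)
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                c[i * rs_c + j * cs_c] += a[p * MR + i] * b[p * NR + j];
}

/* Wrapper: accepts m <= MR and n <= NR so the loops around it never
   need to special-case partial tiles. */
void demo_ukr_wrapped(size_t m, size_t n, size_t k,
                      const double *a, const double *b,
                      double *c, size_t rs_c, size_t cs_c)
{
    if (m == MR && n == NR) {          /* full tile: call directly */
        demo_ukr(k, a, b, c, rs_c, cs_c);
        return;
    }
    double tmp[MR * NR] = { 0.0 };     /* full-size scratch tile */
    demo_ukr(k, a, b, tmp, NR, 1);
    for (size_t i = 0; i < m; ++i)     /* accumulate only the valid part */
        for (size_t j = 0; j < n; ++j)
            c[i * rs_c + j * cs_c] += tmp[i * NR + j];
}
```

With the wrapper, a caller holding a 2 x 2 corner of C can pass `m = 2, n = 2` and the full-size packed panels; only the four valid elements of C are touched.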

867 Commits