852 Commits

Author SHA1 Message Date
Meghana-vankadari
7bc8ab485e Added BLAS/CBLAS APIs for axpby, gemm_batch. (#566)
Details:
- Expanded the BLAS compatibility layer to include support for 
  ?axpby_() and ?gemm_batch_(). The former is a straightforward
  BLAS-like interface into the axpbyv operation while the latter
  implements a batched gemm via loops over bli_?gemm(). Also
  expanded the CBLAS compatibility layer to include support for
  cblas_?axpby() and cblas_?gemm_batch(), which serve as wrappers to 
  the corresponding (new) BLAS-like APIs. Thanks to Meghana Vankadari
  for submitting these new APIs via #566.
- Fixed a long-standing bug in common.mk that for some reason never
  manifested until now. Previously, CBLAS source files were compiled
  *without* the location of cblas.h being specified via a -I flag.
  I'm not sure why this worked, but it may be due to the fact that
  the cblas.h file resided in the same directory as all of the CBLAS
  source, and perhaps compilers implicitly add a -I flag for the
  directory that corresponds to the location of the source file being
  compiled. This bug only showed up because some CBLAS-like source code
  was moved into an 'extra' subdirectory of that frame/compat/cblas/src
  directory. After moving the code, compilation for those files failed
  (because the cblas.h header file, presumably, could not be found in
  the same location). This bug was fixed within common.mk by explicitly
  adding the cblas.h directory to the list of -I flags passed to the
  compiler.
- Added test_axpbyv.c and test_gemm_batch.c files to 'test' directory,
  and updated test/Makefile to build those drivers.
- Fixed typo in error message string in cblas_sgemm.c.
2021-11-11 16:46:14 -06:00
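For readers unfamiliar with the operation behind the new ?axpby_() interface in 7bc8ab485e, the sketch below spells out y := alpha*x + beta*y for strided double-precision vectors. The name `ref_daxpby` is hypothetical and used only for illustration; this is not the BLIS or BLAS implementation, and it assumes positive strides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine (not part of BLIS):
   y := alpha*x + beta*y over n elements, with positive strides. */
void ref_daxpby(size_t n, double alpha, const double *x, size_t incx,
                double beta, double *y, size_t incy)
{
    assert(incx > 0 && incy > 0);
    for (size_t i = 0; i < n; ++i)
        y[i * incy] = alpha * x[i * incx] + beta * y[i * incy];
}
```

A batched gemm "via loops" follows the same idea one level up: an outer loop over the batch entries, each of which calls the ordinary gemm.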
Devin Matthews
28b0982ea7 Refactored her[2]k/syr[2]k in terms of gemmt. (#531)
Details:
- Renamed herk macrokernels and supporting files and functions to gemmt, 
  which is possible since at the macrokernel level they are identical. 
  Then recast herk/her2k/syrk/syr2k in terms of gemmt within the expert
  level-3 oapi (bli_l3_oapi_ex.c) while also redefining them as literal
  functions rather than cpp macros that instantiate multiple functions.
  Thanks to Devin Matthews for his efforts on this issue (#531).
- Check that the maximum stack buffer size is sufficiently large
  relative to the register blocksizes for each datatype, and do so when
  the context is initialized rather than when an operation is called.
  Note that with this change, users who pass in their own contexts into
  the expert interfaces currently will *not* have any checks performed.
  Thanks to Devin Matthews for suggesting this change.
2021-11-10 12:34:50 -06:00
Field G. Van Zee
cfa3db3f34 Fixed bug in mixed-dt gemm introduced in e9da642.
Details:
- Fixed a bug that broke certain mixed-datatype gemm behavior. This
  bug was introduced recently in e9da642 when the code that performs
  the operation transposition (for microkernel IO preference purposes)
  was moved up so that it occurred sooner. However, when I moved that
  code, I failed to notice that there was a cpp-protected "if"
  conditional that applied to the entire code block that was moved. Once
  the code block was relocated, the orphaned if-statement was now
  (erroneously) glomming on to the next thing that happened to be in the
  function, which happened to be the call to bli_rntm_set_ways_for_op(),
  causing a rather odd memory exhaustion error in the sba due to the
  num_threads field of the rntm_t still being -1 (because the rntm_t
  fields were never processed as they should have been). Thanks to
  @ArcadioN09 (Snehith) for reporting this error and helpfully including
  relevant memory trace output.
2021-11-03 18:13:56 -05:00
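The failure mode described in cfa3db3f34 can be reproduced in miniature. The sketch below is an assumption-laden illustration, not the BLIS code: `demo_rntm_t`, `demo_setup_buggy()`, and `demo_setup_fixed()` are hypothetical names, and the value 4 stands in for whatever the real thread setup would compute.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the runtime object; -1 means "not yet set". */
typedef struct { int num_threads; } demo_rntm_t;

/* Buggy shape: the block the `if` used to guard was moved elsewhere, so the
   `if` now captures the next statement and the setup is skipped whenever
   the condition is false. */
int demo_setup_buggy(bool transpose_wanted, demo_rntm_t *rntm)
{
    assert(rntm != NULL);
    rntm->num_threads = -1;
    if (transpose_wanted)
        rntm->num_threads = 4;   /* was meant to run unconditionally */
    return rntm->num_threads;
}

/* Fixed shape: the orphaned `if` is removed along with the moved block. */
int demo_setup_fixed(bool transpose_wanted, demo_rntm_t *rntm)
{
    assert(rntm != NULL);
    (void)transpose_wanted;      /* handled where the moved block now lives */
    rntm->num_threads = 4;
    return rntm->num_threads;
}
```

With `transpose_wanted` false, the buggy version leaves `num_threads` at -1, which mirrors the symptom reported in the commit message.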
Field G. Van Zee
f065a8070f Removed support for 3m, 4m induced methods.
Details:
- Removed support for all induced methods except for 1m. This included
  removing code related to 3mh, 3m1, 4mh, 4m1a, and 4m1b as well as any
  code that existed only to support those implementations. These
  implementations were rarely used and posed code maintenance challenges
  for BLIS's maintainers going forward.
- Removed reference kernels for packm that pack 3m and 4m micropanels,
  and removed 3m/4m-related code from bli_cntx_ref.c.
- Removed support for 3m/4m from the code in frame/ind, then reorganized
  and streamlined the remaining code in that directory. The *ind(),
  *nat(), and *1m() APIs were all removed. (These additional API layers
  no longer made as much sense with only one induced method (1m) being
  supported.) The bli_ind.c file (and header) were moved to frame/base
  and bli_l3_ind.c (and header) and bli_l3_ind_tapi.h were moved to
  frame/3.
- Removed 3m/4m support from the code in frame/1m/packm.
- Removed 3m/4m support from trmm/trsm macrokernels and simplified some
  pointer arithmetic that was previously expressed in terms of the
  bli_ptr_inc_by_frac() static inline function (whose definition was
  also removed).
- Removed the following subdirectories of level-0 macro headers from
  frame/include/level0: ri3, rih, ri, ro, rpi. The level-0 scalar macros
  defined in these directories were used exclusively for 3m and 4m
  method codes.
- Simplified bli_cntx_set_blkszs() and bli_cntx_set_ind_blkszs() in
  light of 1m being the only induced method left within BLIS.
- Removed dt_on_output field within auxinfo_t and its associated
  accessor functions.
- Re-indexed the 1e/1r pack schemas after removing those associated with
  variants of the 3m and 4m methods. This leaves two bits unused within
  the pack format portion of the schema bitfield. (See bli_type_defs.h
  for more info.)
- Spun off the basic and expert interfaces to the object and typed APIs
  into separate files: bli_l3_oapi.c and bli_l3_oapi_ex.c; bli_l3_tapi.c
  and bli_l3_tapi_ex.c.
- Moved the level-3 operation-specific _check function calls from the
  operations' _front() functions to the corresponding _ex() function of
  the object API. (This change roughly maintains where the _check()
  functions are called in the call stack but lays the groundwork for
  future changes that may come to the level-3 object APIs.) Minor
  modifications to bli_l3_check.c to allow the check() functions to be
  called from the expert interface APIs.
- Removed support within the testsuite for testing the aforementioned
  induced methods, and updated the standalone test drivers in the 'test'
  directory to reflect the retirement of those induced methods.
- Modified the sandbox contract so that the user is obliged to define
  bli_gemm_ex() instead of bli_gemmnat(). (This change was made in light
  of the *nat() functions no longer existing.) Also updated the existing
  'power10' and 'gemmlike' sandboxes to come into compliance with the
  new sandbox rules.
- Updated BLISObjectAPI.md, BLISTypedAPI.md, Testsuite.md documentation
  to reflect the retirement of 3m/4m, and also modified Sandboxes.md to
  bring the document into alignment with new conventions.
- Updated various comments; removed segments of commented-out code.
2021-10-28 16:05:43 -05:00
Field G. Van Zee
e9da6425e2 Allow use of 1m with mixing of row/col-pref ukrs.
Details:
- Fixed a bug that broke the use of 1m for dcomplex when the single-
  precision real and double-precision real ukernels had opposing I/O
  preferences (row-preferential sgemm ukernel + column-preferential
  dgemm ukernel, or vice versa). The fix involved adjusting the API
  to bli_cntx_set_ind_blkszs() so that the induced method context init
  function (e.g., bli_cntx_init_<subconfig>_ind()) could call that
  function for only one datatype at a time. This allowed the blocksize
  scaling (which varies depending on whether we're doing 1m_r or 1m_c)
  to happen on a per-datatype basis. This fixes issue #557. Thanks to
  Devin Matthews and RuQing Xu for helping discover and report this bug.
- The aforementioned 1m fix required moving the 1m_r/1m_c logic from
  bli_cntx_ref.c into a new function, bli_l3_set_schemas(), which is
  called from each level-3 _front() function. The pack_t schemas in the
  cntx_t were also removed entirely, along with the associated accessor
  functions. This in turn required updating the trsm1m-related virtual
  ukernels to read the pack schema for B from the auxinfo_t struct
  rather than the context. This also required slight tweaks to
  bli_gemm_md.c.
- Repositioned the logic for transposing the operation to accommodate
  the microkernel IO preference. This mostly only affects gemm. Thanks
  to Devin Matthews for his help with this.
- Updated dpackm pack ukernels in the 'armsve' kernel set to avoid
  querying pack_t schemas from the context.
- Removed the num_t dt argument from the ind_cntx_init_ft type defined
  in bli_gks.c. The context initialization functions for induced methods
  were previously passed a dt argument, but I can no longer figure out
  *why* they were passed this value. To reduce confusion, I've removed
  the dt argument (including also from the function definition +
  prototype).
- Commented out setting of cntx_t schemas in bli_cntx_ind_stage.c. This
  breaks high-level implementations of 3m and 4m, but this is okay since
  those implementations will be removed very soon.
- Removed some older blocks of preprocessor-disabled code.
- Comment update to test_libblis.c.
2021-10-13 14:15:38 -05:00
Minh Quan Ho
81e1034632 Alloc at least 1 elem in pool_t block_ptrs. (#560)
Details:
- Previously, the block_ptrs field of the pool_t was allowed to be
  initialized as any unsigned integer, including 0. However, a length of
  0 could be problematic given that the behavior of malloc(0) is
  implementation-defined. As a safety measure, we check for
  block_ptrs array lengths of 0 and, in that case, increase them to 1.
- Co-authored-by: Minh Quan Ho <minh-quan.ho@kalray.eu>
2021-10-13 13:28:02 -05:00
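The safety measure in 81e1034632 amounts to never asking the allocator for zero bytes, since the result of malloc(0) differs between C libraries. A minimal sketch, assuming a hypothetical helper `demo_alloc_block_ptrs()` rather than the actual bli_pool.c code:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: allocate an array of block pointers, bumping a
   requested length of 0 up to 1. *len is updated to the length used. */
void **demo_alloc_block_ptrs(size_t *len)
{
    assert(len != NULL);
    if (*len == 0)
        *len = 1;
    return malloc(*len * sizeof(void *));
}
```

The caller remains responsible for checking the returned pointer and for calling free().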
Minh Quan Ho
327481a4b0 Fix insufficient pool-growing logic in bli_pool.c. (#559)
Details:
- The current mechanism for growing a pool_t doubles the length of the
  block_ptrs array every time the array length needs to be increased
  due to new blocks being added. However, that logic did not take into
  account the new total number of blocks, and the fact that the caller
  may be requesting more blocks than would fit even after doubling the
  current length of block_ptrs. The code comments now contain two 
  illustrating examples that show why, even after doubling, we must 
  always have at least enough room to fit all of the old blocks plus
  the newly requested blocks.
- This commit also happens to fix a memory corruption issue that stems
  from growing any pool_t that is initialized with a block_ptrs length
  of 0. (Previously, the memory pool for packed buffers of C was 
  initialized with a block_ptrs length of 0, but because it is unused 
  this bug did not manifest by default.)
- Co-authored-by: Minh Quan Ho <minh-quan.ho@kalray.eu>
2021-10-12 12:53:04 -05:00
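One way to express the growth rule described in 327481a4b0 is to take the larger of "double the current length" and "old blocks plus newly requested blocks". The function below is a sketch under that assumption; `demo_grown_len()` is a hypothetical name, the real logic in bli_pool.c may be organized differently, and integer overflow is ignored for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: new length of the block_ptrs array.
   cur_len    - current array length
   num_blocks - blocks already stored (must fit in cur_len)
   num_new    - blocks the caller wants to add */
size_t demo_grown_len(size_t cur_len, size_t num_blocks, size_t num_new)
{
    assert(num_blocks <= cur_len);
    size_t needed  = num_blocks + num_new;  /* old blocks + new blocks */
    size_t doubled = 2 * cur_len;           /* the original growth rule */
    return doubled > needed ? doubled : needed;
}
```

For example, an array of length 4 holding 4 blocks that must take 10 more needs 14 slots, which doubling alone (8) would not provide. A current length of 0 also works here, because doubling 0 no longer decides the result.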
Devin Matthews
4277fec0d0 Merge pull request #533 from xrq-phys/arm64-hi-bw
ARMv8 PACKM and GEMMSUP Kernels + Apple Firestorm Subconfig
2021-10-07 13:47:22 -05:00
RuQing Xu
a024715065 Firestorm CPUID Dispatcher
Commenting out <sys/sysctl.h> because of a possible Xcode bug.
2021-10-07 00:15:54 +09:00
Devin Matthews
34919de3df Make error checking level a thread-local variable.
Previously, this was a global variable. Setting the value was synchronized via a mutex but reading the value was not. Of course, these accesses are almost certainly atomic, but there is still the possibility of one thread attempting to set the value and then reading the value set by another thread. For correct operation under user threading (e.g. pthreads), this should probably be thread-local with no mutex.
2021-10-05 15:22:31 -05:00
Devin Matthews
079fbd42ce Merge branch 'master' into arm64-hi-bw 2021-10-04 17:21:48 -05:00
Devin Matthews
6d3036e31d Merge pull request #545 from hominhquan/clean_error
bli_error: more cleanup on the error strings array
2021-10-04 15:58:43 -05:00
Dave Love
d0a0b4b841 Arm micro-architecture dispatch (#344)
Details:
- Reworked support for ARM hardware detection in bli_cpuid.c to parse 
  the result of a CPUID-like instruction.
- Added a64fx support to bli_gks.c.
- #include arm64 and arm32 family headers from bli_arch_config.h.
- Fix the ordering of the "armsve" and "a64fx" strings in the 
  config_name string array in bli_arch.c. The ordering did not match
  the ordering of the corresponding arch_t values in bli_type_defs.h,
  as it should have all along.
- Added clang support to make_defs.mk in arm64, cortexa53, cortexa57 
  subconfigs.
- Updated arm64 and arm32 families in config_registry.
- Updated docs/HardwareSupport.md to reflect added ARM support.
- Thanks to Dave Love, RuQing Xu, and Devin Matthews for their
  contributions in this PR (#344).
2021-10-04 13:03:04 -05:00
Field G. Van Zee
1f527a93b9 Re-enable and fix fb93d24.
Details:
- Re-enabled the changes made in fb93d24.
- Defined BLIS_ENABLE_SYSTEM in bli_arch.c, bli_cpuid.c, and bli_env.c,
  all of which needed the definition (in addition to config_detect.c) in
  order for the configure-time hardware detection binary to be compiled
  properly. Thanks to Minh Quan Ho for helping identify these additional
  files as needing to be updated.
- Added additional comments to all four source files, most notably to
  prompt the reader to remember to update all of the files when updating
  any of the files. Also made the cpp code in each of the files as
  consistent/similar as possible.
- Refer to issue #532 and PR #546 for more history.
2021-09-20 17:56:36 -05:00
Field G. Van Zee
7b39c14920 Reverted fb93d24.
Details:
- The latest changes in fb93d24 are still causing problems. Reverting
  and preparing to move them to a branch.
2021-09-20 16:13:50 -05:00
Field G. Van Zee
fb93d242a4 Re-enable and fix 8e0c425 (BLIS_ENABLE_SYSTEM).
Details:
- Re-enable the changes originally made in 8e0c425 but quickly reverted
  in 2be78fc.
- Moved the #include of bli_config.h so that it occurs before the
  #include of bli_system.h. This allows the #define BLIS_ENABLE_SYSTEM
  or #define BLIS_DISABLE_SYSTEM in bli_config.h to be processed by the
  time it is needed in bli_system.h. This change should have been
  in the original 8e0c425, but was accidentally omitted. Thanks to Minh
  Quan Ho for catching this.
- Add #define BLIS_ENABLE_SYSTEM to config_detect.c so that the proper
  cpp conditional branch executes in bli_system.h when compiling the
  hardware detection binary. The changes made in 8e0c425 were an attempt
  to support the definition of BLIS_OS_NONE when configuring with
  --disable-system (in issue #532).  That commit failed because, aside
  from the required but omitted header reordering (second bullet above),
  AppVeyor was unable to compile the hardware detection binary as a
  result of missing Windows headers. This commit, which builds on PR
  #546, should help fix that issue. Thanks to Minh Quan Ho for his
  assistance and patience on this matter.
2021-09-20 15:42:08 -05:00
Minh Quan HO
eaa554aa52 bli_error: more cleanup on the error strings array
- There was redundancy between the macro BLIS_MAX_NUM_ERR_MSGS (=200) and
  the enum BLIS_ERROR_CODE_MAX (-170), while they both mean the same thing:
  the maximal number of error codes/messages.
- The previous initialization of error messages at compile time ignored that
  the 'bli_error_string' array still occupies useless memory due to 2D char[][]
  declaration. Instead, it should be just an array of pointers, pointing at
  strings in .rodata section.
- This commit does the two modifications:
   * retired macros BLIS_MAX_NUM_ERR_MSGS and BLIS_MAX_ERR_MSG_LENGTH everywhere
   * switch bli_error_string from char[][] to char *[] to reduce its footprint
     from 40KB (200*200) to 1.3KB (170*sizeof(char*)).
     (No problem to use the enum BLIS_ERROR_CODE_MAX at compile-time,
     since compiler is smart enough to determine its value is 170.)
2021-09-20 10:39:05 +02:00
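The footprint argument in eaa554aa52 can be checked with sizeof. The sketch below uses hypothetical names (`demo_error_string`, `DEMO_NUM_CODES`, and so on) and non-negative indices for simplicity; it is not the BLIS table itself.

```c
#include <assert.h>
#include <stddef.h>

enum { DEMO_NUM_MSGS = 200, DEMO_MSG_LEN = 200, DEMO_NUM_CODES = 170 };

/* Old layout: every slot reserves DEMO_MSG_LEN bytes, used or not. */
size_t demo_size_2d(void)
{
    return sizeof(char[DEMO_NUM_MSGS][DEMO_MSG_LEN]);
}

/* New layout: one pointer per code; the string literals themselves are
   stored once, typically in read-only data. Unset slots are NULL. */
static const char *demo_error_string[DEMO_NUM_CODES] = {
    [0] = "no error",
    [1] = "hypothetical failure",
};

size_t demo_size_ptrs(void)
{
    return sizeof(demo_error_string);
}

const char *demo_lookup(int index)
{
    assert(index >= 0 && index < DEMO_NUM_CODES);
    return demo_error_string[index];
}
```

The 2D array occupies 200*200 bytes regardless of how long the messages are, while the pointer table occupies 170*sizeof(char*) bytes, which is 1360 bytes when pointers are 8 bytes wide.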
Field G. Van Zee
52f29f739d Removed last vestige of #define BLIS_NUM_ARCHS.
Details:
- Removed the commented-out #define BLIS_NUM_ARCHS in bli_type_defs.h
  and its associated (now outdated) comments. BLIS_NUM_ARCHS has been
  part of the arch_t enum for some time now, and so this change is
  mostly about removing any opportunity for confusion for people who
  may be reading the code. Thanks to Minh Quan Ho for leading me to
  this cleanup.
2021-09-17 08:38:29 -05:00
Devin Matthews
9c0064f3f6 Fix config_name in bli_arch.c 2021-09-10 10:39:04 -05:00
Field G. Van Zee
2be78fc977 Disabled (at least temporarily) commit 8e0c425.
Details:
- Reverted changes in 8e0c425 due to AppVeyor build failures that we do
  not yet understand.
2021-08-27 12:17:26 -05:00
Field G. Van Zee
8e0c4255de Define BLIS_OS_NONE when using --disable-system.
Details:
- Modified bli_system.h so that the cpp macro BLIS_OS_NONE is defined
  when BLIS_DISABLE_SYSTEM is defined. Otherwise, the previous OS-
  detecting macro conditionals are considered. This change is to
  accommodate a solution to a cross-compilation issue described in
  #532.
2021-08-26 15:29:18 -05:00
Field G. Van Zee
e320ec6d5c Moved lang defs from _macro_def.h to _lang_defs.h.
Details:
- Moved miscellaneous language-related definitions, including defs
  related to the handling of the 'restrict' keyword, from the top half
  of bli_macro_defs.h into a new file, bli_lang_defs.h, which is now
  #included immediately after "bli_system.h" in blis.h. This change is
  an attempt to fix a report of recent breakage of C++ compilers due
  to the recent introduction of 'restrict' in bli_type_defs.h (which
  previously was being included *before* bli_macro_defs.h and its
  restrict handling therein). Thanks to Ivan Korostelev for reporting
  this issue in #527.
- CREDITS file update.
2021-08-20 17:15:20 -05:00
Devin Matthews
2c0b4150e4 Merge pull request #527 from flame/obj_t_makeover
Implement proposed new function pointer fields for obj_t.
2021-08-14 18:41:35 -05:00
Field G. Van Zee
4b8ed99d92 Whitespace tweaks. 2021-08-13 15:31:10 -05:00
Devin Matthews
1772db029e Add row- and column-strides for A/B in obj_ukr_fn_t. 2021-08-13 14:46:35 -05:00
Devin Matthews
4f70eb7913 Clean up some warnings that show up on clang/OSX. 2021-08-13 11:12:43 -05:00
Devin Matthews
3cddce1e2a Remove schema field on obj_t (redundant) and add new API functions. 2021-08-12 22:32:34 -05:00
Field G. Van Zee
20a1c4014c Disabled sanity check in bli_pool_finalize().
Details:
- Disabled a sanity check in bli_pool_finalize() that was meant to alert
  the user if a pool_t was being finalized while some blocks were still
  checked out. However, this is exactly the situation that might happen
  when a pool_t is re-initialized for a larger blocksize, and currently
  bli_pool_reinit() is implemented as _finalize() followed by _init().
  So, this sanity check is not universally appropriate. Thanks to
  AMD-India for reporting this issue.
2021-08-12 14:44:04 -05:00
RuQing Xu
e38ca28689 Added Apple Firestorm (A14/M1) Subconfig
- Use the same bulk kernel as Cortex-A53 / ThunderX2;
- Larger block size;
- Use gemmsup kernels for double precision.
2021-08-13 03:21:19 +09:00
Devin Matthews
64a1f786d5 Implement proposed new function pointer fields for obj_t.
The added fields:
1. `pack_t schema`: storing the pack schema on the object allows the macrokernel to act accordingly without side-channel information from the rntm_t and cntx_t. The pack schema and "pack_[ab]" fields could be removed from those structs.
2. `void* user_data`: this field can be used to store any sort of additional information provided by the user. The pointer is propagated to submatrix objects and copies, but is otherwise ignored by the framework and the default implementations of the following three fields. User-specified pack, kernel, or ukr functions can do whatever they want with the data, and the user is 100% responsible for allocating, assigning, and freeing this buffer.
3. `obj_pack_fn_t pack`: the function called when a matrix is packed. This function receives the expected arguments, as well as a mdim_t and mem_t* as memory must be allocated inside this function, and behavior may differ based on which matrix is being packed (i.e. transposition for B). This could also be achieved by passing a desired pack schema, but this would require additional information to travel down the control tree.
4. `obj_ker_fn_t ker`: the function called when we get to the "second loop", or the macro-kernel. Behavior may depend on the pack schemas of the input matrices. The default implementation would perform the inner two loops around the ukr, and then call either the default ukr or a user-supplied one (next field).
5. `obj_ukr_fn_t ukr`: the function called by the default macrokernel. This would replace the various current "virtual" microkernels, and could also be used to supply user-defined behavior. Users could supply both a custom kernel (above) and microkernel, although the user-specified kernel does **not** necessarily have to call the ukr function specified on the obj_t.

Note that no macros or functions for accessing these new fields have been defined yet. That is next once these are finalized. Addresses https://github.com/flame/blis/projects/1#card-62357687.
2021-08-11 18:11:47 -05:00
Field G. Van Zee
a32257eeab Fixed bli_init.c compile-time error on OSX clang.
Details:
- Fixed a compile-time error in bli_init.c when compiling with OSX's
  clang. This error was introduced in 868b901, which introduced a
  post-declaration struct assignment where the RHS was a struct
  initialization expression (i.e. { ... }). This use of struct
  initializer expressions apparently works with gcc despite it not
  being strict C99. The fix included in this commit declares a temporary
  variable for the purposes of being initialized to the desired value,
  via the struct initializer, and then copies the temporary struct (via
  '=' struct assignment) to the persistent struct. Thanks to Devin
  Matthews for his help with this.
2021-08-05 16:23:02 -05:00
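The workaround described in a32257eeab relies on the fact that a brace-enclosed initializer is valid in a declaration but is not an expression that can appear on the right-hand side of an assignment. A minimal sketch with a hypothetical struct `demo_ctl_t` (the real code concerns a different type):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical control struct standing in for the real one. */
typedef struct { int state; int count; } demo_ctl_t;

void demo_ctl_reset(demo_ctl_t *ctl)
{
    assert(ctl != NULL);
    /* *ctl = { 0, 0 };  <- rejected: a bare initializer is not an expression */
    const demo_ctl_t fresh = { 0, 0 };  /* initialize a temporary instead */
    *ctl = fresh;                       /* then copy via struct assignment */
}
```

A C99 compound literal would be another option, but the temporary-variable form matches what the commit message describes.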
Field G. Van Zee
868b90138e Fixed one-time use property of bli_init() (#525).
Details:
- Fixes a rather obvious bug that resulted in segmentation fault
  whenever the calling application tried to re-initialize BLIS after
  its first init/finalize cycle. The bug resulted from the fact that
  the bli_init.c APIs made no effort to allow bli_init() to be called
  subsequent times at all due to it, and bli_finalize(), being
  implemented in terms of pthread_once(). This has been fixed by
  resetting the pthread_once_t control variable for initialization
  at the end of bli_finalize_apis(), and by resetting the control
  variable for finalization at the end of bli_init_apis(). Thanks to
  @lschork2 for reporting this issue (#525), and to Minh Quan Ho and
  Devin Matthews for suggesting the chosen solution.
- CREDITS file update.
2021-08-04 18:31:01 -05:00
Field G. Van Zee
689fa0f403 Merge branch 'master' into dev 2021-06-13 19:44:14 -05:00
Devin Matthews
7c3eb44efa Add vhsubps/vhsubpd.
Horizontal subtraction instructions added to bli_x86_asm_macros.h, currently unused [ci skip].
2021-06-02 11:28:22 -05:00
Field G. Van Zee
213dce32d2 Added a new 'gemmlike' sandbox.
Details:
- Added a new sandbox called 'gemmlike', which implements sequential and
  multithreaded gemm in the style of gemmsup but also unconditionally
  employs packing. The purpose of this sandbox is to
  (1) avoid select abstractions, such as objects and control trees, in
      order to allow readers to better understand how a real-world
      implementation of high-performance gemm can be constructed;
  (2) provide a starting point for expert users who wish to build
      something that is gemm-like without "reinventing the wheel."
  Thanks to Jeff Diamond, Tze Meng Low, Nicholai Tukanov, and Devangi
  Parikh for requesting and inspiring this work.
- The functions defined in this sandbox currently use the "bls_" prefix
  instead of "bli_" in order to avoid any symbol collisions in the main
  library.
- The sandbox contains two variants, each of which implements gemm via a
  block-panel algorithm. The only difference between the two is that
  variant 1 calls the microkernel directly while variant 2 calls the
  microkernel indirectly, via a function wrapper, which allows the edge
  case handling to be abstracted away from the classic five loops.
- This sandbox implementation utilizes the conventional gemm microkernel
  (not the skinny/unpacked gemmsup kernels).
- Updated some typos in the comments of a few files in the main
  framework.
2021-05-28 14:49:57 -05:00
RuQing Xu
61584deddf Added 512b SVE-based a64fx subconfig + SVE kernels.
Details:
- Added 512-bit specific 'a64fx' subconfiguration that uses empirically 
  tuned block size by Stepan Nassyr. This subconfig also sets the sector 
  cache size and enables memory-tagging code in SVE gemm kernels. This 
  subconfig utilizes (16, k) and (10, k) DPACKM kernels.
- Added a vector-length agnostic 'armsve' subconfiguration that computes
  blocksizes according to the analytical model. This part is ported from 
  Stepan Nassyr's repository.
- Implemented vector-length-agnostic [d/s/sh] gemm kernels for Arm SVE 
  at size (2*VL, 10). These kernels use unindexed FMLA instructions 
  because indexed FMLA takes 2 FMA units in many implementations.
  PS: There are indexed-FMLA kernels in Stepan Nassyr's repository.
- Implemented 512-bit SVE dpackm kernels with in-register transpose
  support for sizes (16, k) and (10, k).
- Extended 256-bit SVE dpackm kernels by Linaro Ltd. to 512-bit for 
  size (12, k). This dpackm kernel is not currently used by any 
  subconfiguration.
- Implemented several experimental dgemmsup kernels which would 
  improve performance in a few cases. However, those dgemmsup kernels 
  generally underperform hence they are not currently used in any 
  subconfig.
- Note: This commit squashes several commits submitted by RuQing Xu via
  PR #424.
2021-05-19 09:52:29 -05:00
Field G. Van Zee
b683d01b9c Use extra #undef when including ba/ex API headers.
Details:
- Inserted a "#include bli_xapi_undef.h" after each usage of the basic
  and expert API macro setup headers: bli_oapi_ba.h, bli_oapi_ex.h,
  bli_tapi_ba.h, and bli_tapi_ex.h. This is functionally equivalent to
  the previous status quo, in which each header made minimal #undef
  prior to its own definitions and then a single instance of
  "#include bli_xapi_undef.h" cleaned up any remaining macro defs after
  all other headers were used. This commit will guarantee that macro
  defs from the setup of one header (say, bli_oapi_ex.h) don't "infect"
  the definitions made in a subsequent header. As with the previous
  commit, this change does not fix any issue but rather attempts to
  avoid creating orphaned macro definitions that are only needed within
  a very limited scope.
- Removed minimal #undef from bli_?api_[ba|ex].h.
- Removed old commented-out lines from bli_?api_[ba|ex].h.
2021-05-13 15:23:22 -05:00
Field G. Van Zee
d4427a5b2f Minor preprocessor/header cleanup.
Details:
- Added frame/include/bli_xapi_undef.h, which explicitly undefines all
  macros defined in bli_oapi_ba.h, bli_oapi_ex.h, bli_tapi_ba.h, and
  bli_tapi_ex.h. (This is for safety and good cpp coding practice, not
  because it fixes anything.)
- Added #include "bli_xapi_undef.h" to bli_l1v.h, bli_l1d.h, bli_l1f.h,
  bli_l1m.h, bli_l2.h, bli_l3.h, and bli_util.h.
- Comment updates to bli_oapi_ba.h, bli_oapi_ex.h, bli_tapi_ba.h, and
  bli_tapi_ex.h.
- Moved frame/3/bli_l3_ft_ex.h to local 'old' directory after realizing
  that nothing in BLIS used those function pointer types. Also commented
  out the "#include bli_l3_ft_ex.h" directive in frame/3/bli_l3.h.
2021-05-13 13:55:11 -05:00
Field G. Van Zee
5aa63cd927 Fixed typo in cpp guard in bli_util_ft.h.
Details:
- Changed #ifdef BLIS_OAPI_BASIC to #ifdef BLIS_TAPI_BASIC in
  bli_util_ft.h. This typo was causing some types to be redefined when
  they weren't supposed to be.
2021-05-12 19:53:35 -05:00
Field G. Van Zee
f0e8634775 Defined eqsc, eqv, eqm to test object equality.
Details:
- Defined eqsc, eqv, and eqm operations, which set a bool depending on
  whether the two scalars, two vectors, or two matrix operands are equal
  (element-wise). eqsc and eqv support implicit conjugation and eqm
  supports diagonal offset, diag, uplo, and trans parameters (in a
  manner consistent with other level-1m operations). These operations
  are currently housed under frame/util, at least for now, because they
  are not computational in nature.
- Redefined bli_obj_equals() in terms of eqsc, eqv, and eqm.
- Documented eqsc, eqv, and eqm in BLISObjectAPI.md and BLISTypedAPI.md.
  Also:
  - Documented getsc and setsc in both docs.
  - Reordered entry for setijv in BLISTypedAPI.md, and added separator
    bars to both docs.
  - Added missing "Observed object properties" clauses to various
    level-1v entries in BLISObjectAPI.md.
- Defined bli_apply_trans() in bli_param_macro_defs.h.
- Defined supporting _check() function, bli_l0_xxbsc_check(), in
  bli_l0_check.c for eqsc.
- Programming style and whitespace updates to bli_l1m_unb_var1.c.
- Whitespace updates to bli_l0_oapi.c, bli_l1m_oapi.c
- Consolidated redundant macro redefinition for copym function pointer
  type in bli_l1m_ft.h.
- Added macros to bli_oapi_ba.h, _ex.h, and bli_tapi_ba.h, _ex.h that
  allow oapi and tapi source files to forego defining certain expert
  functions. (Certain operations such as printv and printm do not need
  to have both basic and expert interfaces. This also includes eqsc, eqv,
  and eqm.)
2021-05-12 18:45:32 -05:00
Devin Matthews
5d46dbee4a Replace bli_dlamch with something less archaic (#498)
Details:
- Added new implementations of bli_slamch() and bli_dlamch() that use
  constants from the standard C library in lieu of dynamically-computed
  values (via code inherited from netlib). The previous implementation
  is still available when the cpp macro BLIS_ENABLE_LEGACY_LAMCH is 
  defined by the subconfiguration at compile-time. Thanks to Devin
  Matthews for providing this patch, and to Stefano Zampini for
  reporting the issue (#497) that prompted Devin to propose the patch.
2021-05-12 18:42:09 -05:00
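The difference between the two approaches in 5d46dbee4a can be illustrated for one machine parameter, the double-precision epsilon. The functions below are hypothetical sketches: `demo_eps_by_search()` computes the value at run time in the spirit of the older approach, while `demo_eps_from_float_h()` simply returns the constant from <float.h>. Neither is the BLIS code, and the exact definition of each lamch parameter is not reproduced here.

```c
#include <assert.h>
#include <float.h>

/* Run-time search: halve eps until adding half of it to 1.0 no longer
   changes the sum. volatile forces each result to be stored as a double. */
double demo_eps_by_search(void)
{
    volatile double eps = 1.0;
    volatile double probe = 1.0 + eps / 2.0;
    while (probe > 1.0) {
        eps = eps / 2.0;
        assert(eps > 0.0);
        probe = 1.0 + eps / 2.0;
    }
    return eps;
}

/* Compile-time constant supplied by the standard C library headers. */
double demo_eps_from_float_h(void)
{
    return DBL_EPSILON;
}
```

On platforms with IEEE 754 binary64 doubles and round-to-nearest, the two functions are expected to agree; the constant-based version avoids the loop and any dependence on how the compiler evaluates floating-point expressions.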
Field G. Van Zee
6a89c7d8f9 Defined setijv, getijv to set/get vector elements.
Details:
- Defined getijv, setijv operations to get and set elements of a vector,
  in bli_setgetijv.c and .h.
- Renamed bli_setgetij.c and .h to bli_setgetijm.c and .h, respectively.
- Added additional bounds checking to getijm and setijm to prevent
  actions with negative indices.
- Added documentation to BLISObjectAPI.md and BLISTypedAPI.md for getijv
  and setijv.
- Added documentation to BLISTypedAPI.md for getijm and setijm, which
  were inadvertently missing.
- Added a new entry to the FAQ titled "Why does BLIS have vector
  (level-1v) and matrix (level-1m) variations of most level-1
  operations?"
- Comment updates.
2021-05-01 18:54:48 -05:00
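The kind of bounds checking mentioned in 6a89c7d8f9 can be sketched for a plain array of doubles. The names `demo_getijv()` and `demo_setijv()` and their signatures are hypothetical; the real BLIS functions operate on objects or typed buffers and have different interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical accessor: read element i of an n-element vector.
   A signed index type is used so negative indices can be rejected. */
bool demo_getijv(long i, const double *x, long n, double *value)
{
    assert(x != NULL && value != NULL);
    if (i < 0 || i >= n)
        return false;            /* out of range: nothing is read */
    *value = x[i];
    return true;
}

/* Hypothetical accessor: write element i of an n-element vector. */
bool demo_setijv(double value, long i, double *x, long n)
{
    assert(x != NULL);
    if (i < 0 || i >= n)
        return false;            /* out of range: nothing is written */
    x[i] = value;
    return true;
}
```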
Field G. Van Zee
4534daffd1 Minor API breakage in bli_pack API.
Details:
- Changed bli_pack_get_pack_a() and bli_pack_get_pack_b() so that
  instead of returning a bool, they set a bool that is passed in by
  address. This does break the public exported API, but I expect very
  few users actually use this function. (This change is being made in
  preparation for a much more extensive commit relating to error
  checking.)
2021-04-27 18:16:44 -05:00
Field G. Van Zee
09bd4f4f12 Add err_t* "return" parameter to malloc functions.
Details:
- Added an err_t* parameter to memory allocation functions including
  bli_malloc_intl(), bli_calloc_intl(), bli_malloc_user(),
  bli_fmalloc_align(), and bli_fmalloc_noalign(). Since these functions
  already use the return value to return the allocated memory address,
  they can't communicate errors to the caller through the return value.
  This commit does not employ any error checking within these functions
  or their callers, but this sets up BLIS for a more comprehensive
  commit that moves in that direction.
- Moved the typedefs for malloc_ft and free_ft from bli_malloc.h to
  bli_type_defs.h. This was done so that what remains of bli_malloc.h
  can be included after the definition of the err_t enum. (This ordering
  was needed because bli_malloc.h now contains function prototypes that
  use err_t.)
- Defined bli_is_success() and bli_is_failure() static functions in
  bli_param_macro_defs.h. These functions provide easy checks for error
  codes and will be used more heavily in future commits.
- Unfortunately, the additional err_t* argument discussed above breaks
  the API for bli_malloc_user(), which is an exported symbol in the
  shared library. However, it's quite possible that the only application
  that calls bli_malloc_user()--indeed, the reason it was marked for
  symbol exporting to begin with--is the BLIS testsuite. And if that's
  the case, this breakage won't affect anyone. Nonetheless, the "major"
  part of the so_version file has been updated accordingly to 4.0.0.
2021-03-31 17:09:36 -05:00
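The reasoning in commit 09bd4f4f12 (an allocator already uses its return value for the address, so an error code needs a separate channel) can be sketched as below. The names `my_err_t`, `my_malloc()`, `my_is_success()`, and `my_is_failure()` are hypothetical and only mirror the shape of the functions named in the commit. Unlike the commit, which states that it adds no error checking yet, this sketch does set the error code, as an assumption about where the design is headed.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical error codes; the real err_t enum defines its own values. */
typedef enum
{
    MY_SUCCESS       =  0,
    MY_MALLOC_FAILED = -1
} my_err_t;

/* Small predicates so callers need not compare against enum values. */
static bool my_is_success( my_err_t r ) { return r == MY_SUCCESS; }
static bool my_is_failure( my_err_t r ) { return r != MY_SUCCESS; }

/* The return value carries the allocated address, so the status is
   reported through the r_val out-parameter instead. Note that for
   size == 0 malloc() may legally return NULL, which this sketch would
   report as a failure. */
void* my_malloc( size_t size, my_err_t* r_val )
{
    void* p = malloc( size );
    *r_val = ( p == NULL ) ? MY_MALLOC_FAILED : MY_SUCCESS;
    return p;
}
```

A caller would then write `void* p = my_malloc( n, &r );` followed by `if ( my_is_failure( r ) ) { /* handle the error */ }`.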
Field G. Van Zee
0450249267 Always stay initialized after BLAS compat calls.
Details:
- Removed the option to finalize BLIS after every BLAS call, which also
  means that BLIS would initialize at the beginning of every BLAS call.
  This option never really made sense and wasn't even implemented
  properly to begin with. (Because bli_init_auto() and _finalize_auto()
  were implemented in terms of bli_init_once() and _finalize_once(),
  respectively, the application would have only been able to call one
  BLAS routine before BLIS would find itself in an unusable, permanently
  uninitialized state.) Because this option was never meant for regular
  use, it never made it into configure as an actual configure-time
  option, and therefore this commit only removes parts of the code
  affected by the cpp macro guard BLIS_ENABLE_STAY_AUTO_INITIALIZED.
2021-03-28 19:11:43 -05:00
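The failure mode described in commit 0450249267 can be reproduced with a small single-threaded model. This is a sketch under stated assumptions, not BLIS code: `lib_init_once()`, `lib_finalize_once()`, and `compat_call()` are hypothetical names, and plain bool flags stand in for whatever "once" mechanism the library actually uses.

```c
#include <stdbool.h>

static bool lib_ready     = false; /* is the library usable right now? */
static bool init_done     = false; /* "once" guard for initialization  */
static bool finalize_done = false; /* "once" guard for finalization    */

/* Runs its body only on the first call, ever. */
void lib_init_once( void )
{
    if ( init_done ) return;
    init_done = true;
    lib_ready = true;
}

/* Also runs its body only on the first call, ever. */
void lib_finalize_once( void )
{
    if ( finalize_done ) return;
    finalize_done = true;
    lib_ready = false;
}

/* Models one BLAS-style call under the removed option: initialize,
   do the work, finalize. Returns whether the library was usable. */
bool compat_call( void )
{
    lib_init_once();
    bool usable = lib_ready;
    lib_finalize_once();
    return usable;
}
```

The first `compat_call()` finds the library usable. Every later call finds it unusable, because the init guard has already fired and so nothing re-initializes the library after the first finalize.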
Field G. Van Zee
3a6f41afb8 Renamed membrk files/vars/functions to pba.
Details:
- Renamed the files, variables, and functions relating to the packing
  block allocator from its legacy name (membrk) to its current name
  (pba). This more clearly contrasts the packing block allocator with
  the small block allocator (sba).
- Fixed a typo in bli_pack_set_pack_b(), defined in bli_pack.c, that
  caused the function to erroneously change the value of the pack_a
  field of the global rntm_t instead of the pack_b field. (Apparently
  nobody has used this API yet.)
- Comment updates.
2021-03-27 17:22:14 -05:00
Field G. Van Zee
36cb4116d1 Switch allocator mutexes to static initialization.
Details:
- Switched the small block allocator (sba), as defined in bli_sba.c and
  bli_apool.c, to static initialization of its internal mutex. Did a
  similar thing for the packing block allocator (pba), which appears as
  global_membrk in bli_membrk.c.
- Commented out bli_membrk_init_mutex() and bli_membrk_finalize_mutex()
  to ensure they won't be used in the future.
- In bli_thrcomm_pthreads.c and .h, removed old, commented-out cpp
  blocks guarded by BLIS_USE_PTHREAD_MUTEX.
2021-03-27 15:15:09 -05:00
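Static mutex initialization, the technique named in commit 36cb4116d1, can be sketched with plain POSIX threads. The example below uses `pthread_mutex_t` and `PTHREAD_MUTEX_INITIALIZER` directly rather than any BLIS wrapper, and the counter it protects is a hypothetical stand-in for allocator state.

```c
#include <pthread.h>

/* Statically initialized: the mutex is valid before any code runs, so
   no separate init function (and no matching finalize) is required. */
static pthread_mutex_t counter_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical shared state protected by the mutex. */
static int counter = 0;

/* Atomically increment the counter and return its new value. */
int counter_increment( void )
{
    pthread_mutex_lock( &counter_mutex );
    int value = ++counter;
    pthread_mutex_unlock( &counter_mutex );
    return value;
}
```

Compared with calling `pthread_mutex_init()` at run time, the static form removes any ordering concern about whether the mutex was initialized before its first use, which is presumably what makes dedicated init and finalize functions for the mutex unnecessary.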
Field G. Van Zee
4493cf516e Redefined BLIS_NUM_ARCHS to update automatically.
Details:
- Changed BLIS_NUM_ARCHS from a cpp macro definition to the last enum
  value in the arch_t enum. This means that it no longer needs to get
  updated manually whenever new subconfigurations are added to BLIS.
  Also removed the explicit initial index assignment of 0 from the
  first enum value, which was unnecessary due to how the C language
  standard mandates indexing of enum values. Thanks to Devin Matthews
  for originally submitting this as a PR in #446.
- Updated docs/ConfigurationHowTo.md to reflect the aforementioned
  change.
2021-03-15 13:12:49 -05:00
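The enum technique described in commit 4493cf516e can be sketched as follows. The enumerator names and the `arch_name()` helper are hypothetical and chosen only for illustration; they are not the actual arch_t values.

```c
/* Hypothetical subset of architecture ids. In C the first enumerator
   defaults to 0 and each later one is the previous value plus 1. */
typedef enum
{
    ARCH_FOO,     /* 0 */
    ARCH_BAR,     /* 1 */
    ARCH_GENERIC, /* 2 */

    /* Must remain last: its value equals the number of ids above, and
       it updates automatically when a new id is inserted before it. */
    NUM_ARCHS
} my_arch_t;

/* A table sized by the trailing enumerator grows with the enum. */
static const char* arch_names[ NUM_ARCHS ] = { "foo", "bar", "generic" };

/* Map an id to its name, guarding against out-of-range values. */
const char* arch_name( my_arch_t id )
{
    if ( ( unsigned )id >= ( unsigned )NUM_ARCHS ) return "unknown";
    return arch_names[ id ];
}
```

With a separate macro, adding an architecture requires two edits (the enum and the count). With the trailing enumerator, only the enum changes, provided new ids are always added above `NUM_ARCHS`.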
Field G. Van Zee
a4b73de84c Disabled _self() and _equal() in bli_pthread API.
Details:
- Disabled the _self() and _equal() extensions to the bli_pthread API
  introduced in d479654. These functions were disabled after I realized
  that they aren't actually needed yet. Thanks to Devin Matthews for
  helping me reason through the appropriate consumer code that will
  appear in BLIS (eventually) in a future commit. (Also, I could never
  get the Windows branch to link properly in clang builds in AppVeyor.
  See the comment I left in the code, and #485, for more info.)
2021-03-12 19:47:39 -06:00
Field G. Van Zee
f9d604679d Added _self() and _equal() to bli_pthread API.
Details:
- Expanded the bli_pthread API to include equivalents to pthread_self()
  and pthread_equal(). Implemented these two functions for all three cpp
  branches present within bli_pthread.c: systemless, Windows, and
  Linux/BSD.
2021-03-12 19:47:39 -06:00