Commit Graph

78 Commits

Author SHA1 Message Date
Field G. Van Zee
efe85b3c19 Added missing return to bli_thread_partition_2x2().
Details:
- Added a missing return statement to the body of an early case handling
  branch in bli_thread_partition_2x2(). This bug only affected cases
  where n_threads < 4, and even then, the code meant to handle cases
  where n_threads >= 4 executes and does the right thing, albeit using
  more CPU cycles than needed. Nonetheless, thanks to Kiran Varaganti
  for reporting this bug via issue #377.
- Whitespace changes to bli_thread.c (spaces -> tabs).

Change-Id: I2182be0911f76861dd14bec9b6bacb6c20c2725d
2020-03-16 12:28:25 +05:30
Field G. Van Zee
a7c5723e77 Skip building thrinfo_t tree when mt is disabled.
Details:
- Return early from bli_thrinfo_sup_grow() if the thrinfo_t object
  address is equal to either &BLIS_GEMM_SINGLE_THREADED or
  &BLIS_PACKM_SINGLE_THREADED.
- Added preprocessor logic to bli_l3_sup_thread_decorator() in
  bli_l3_sup_decor_single.c that (by default) disables code that
  creates and frees the thrinfo_t tree and instead passes
  &BLIS_GEMM_SINGLE_THREADED as the thrinfo_t pointer into the
  sup implementation.
- The net effect of the above changes is that a small amount of
  thrinfo_t overhead is avoided when running small/skinny dgemm
  problems when BLIS is compiled with multithreading disabled.

Change-Id: Ia1066752849f1dfc0cd98f8ac0302e2f7b0f8bf0
2020-03-13 01:10:34 -04:00
Kiran Devrajegowda
04fc9d3710 Merge "Fixed bug(s) in mt sup when single-threaded." into amd-staging-rome-2.2 2020-03-13 01:10:22 -04:00
Field G. Van Zee
574de9e29e Fixed bug(s) in mt sup when single-threaded.
Details:
- Fixed a syntax bug in bli_l3_sup_decor_single.c as a result of
  changing function interface for the thread entry point function
  (of type l3supint_t).
- Unfortunately, fixing the interface was not enough, as it caused
  a memory leak in the sba at bli_finalize() time. It turns out that,
  due to the new multithreading-capable variant code useing thrinfo_t
  objects--specifically, their calling of bli_thrinfo_grow()--we
  have to pass in a real thrinfo_t object rather than the global
  objects &BLIS_PACKM_SINGLE_THREADED or &BLIS_GEMM_SINGLE_THREADED.
  Thus, I inserted the appropriate logic from the OpenMP and pthreads
  versions so that single-threaded execution would work as intended
  with the newly upgraded variants.

Change-Id: I2bfff849abf3fa30c73e0c5876128400854bbcb5
2020-03-13 01:10:04 -04:00
Field G. Van Zee
1a284828d1 Support multithreading within the sup framework.
Details:
- Added multithreading support to the sup framework (via either OpenMP
  or pthreads). Both variants 1n and 2m now have the appropriate
  threading infrastructure, including data partitioning logic, to
  parallelize computation. This support handles all four combinations
  of packing on matrices A and B (neither, A only, B only, or both).
  This implementation tries to be a little smarter when automatic
  threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
  recalculate the factorization in units of micropanels (rather than
  using the raw dimensions) in bli_l3_sup_int.c, when the final
  problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
  or column-stored matrices. (This is used for the rrc and crc storage
  cases.) Previously, copym was used, but that would no longer suffice
  because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
  bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
  instead of from the variant functions. This has the effect of making
  the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
  and inserted usage of these functions within bli_thrinfo_init(), which
  previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
  whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
  tests, as well as appropriate octave/matlab scripts to plot the
  resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
  that specifying any BLIS_*_NT variable, even if it is set to 1, will
  be considered manual specification for the purposes of determining
  whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.
AMD-Internal: [CPUPL-713]

Change-Id: I9536648e7befac4d2dc17805e44ef34470961662
2020-03-13 01:09:29 -04:00
Devrajegowda, Kiran
1fe8edbed0 "Merge Selective Packing code from amd branch flame/blis"
Change-Id: Ifbdf49735f56a66fbbc96dab6d3ca6069302daed
2019-12-16 14:48:53 +05:30
Kiran Varaganti
1650bcb623 Revert " Merge Selective Packing code from amd branch flame/blis"
This reverts commit e4a6af33f5.

Reason for revert: <Review not done>

Change-Id: Iae548f949a81a66281023c860c2bcffdfdae21b2
2019-12-13 00:01:35 -05:00
Field G. Van Zee
efa61a6c8b Added missing bli_l3_sup_thread_decorator() symbol.
Details:
- Defined dummy versions of bli_l3_sup_thread_decorator() for Openmp
  and pthreads so that those builds don't fail when performing shared
  library linking (especially for Windows DLLs via AppVeyor). For now,
  these dummy implementations of bli_l3_sup_thread_decorator() are
  merely carbon-copies of the implementation provided for single-
  threaded execution (ie: the one found in bli_l3_sup_decor_single.c).
  Thus, an OpenMP or pthreads build will be able to use the gemmsup
  code (including the new selective packing functionality), as it did
  before 39fa7136, even though it will not actually employ any
  multithreaded parallelism.
2019-11-29 16:17:04 -06:00
Field G. Van Zee
39fa7136f4 Added support for selective packing to gemmsup.
Details:
- Implemented optional packing for A or B (or both) within the sup
  framework (which currently only supports gemm). The request for
  packing either matrix A or matrix B can be made via setting
  environment variables BLIS_PACK_A or BLIS_PACK_B (to any
  non-zero value; if set, zero means "disable packing"). It can also
  be made globally at runtime via bli_pack_set_pack_a() and
  bli_pack_set_pack_b() or with individual rntm_t objects via
  bli_rntm_set_pack_a() and bli_rntm_set_pack_b() if using the expert
  interface of either the BLIS typed or object APIs. (If using the
  BLAS API, environment variables are the only way to communicate the
  packing request.)
- One caveat (for now) with the current implementation of selective
  packing is that any blocksize extension registered in the _cntx_init
  function (such as is currently used by haswell and zen subconfigs)
  will be ignored if the affected matrix is packed. The reason is
  simply that I didn't get around to implementing the necessary logic
  to pack a larger edge-case micropanel, though this is entirely
  possible and should be done in the future.
- Spun off the variant-choosing portion of bli_gemmsup_ref() into
  bli_gemmsup_int(), in bli_l3_sup_int.c.
- Added new files, bli_l3_sup_packm_a.c, bli_l3_sup_packm_b.c, along
  with corresponding headers, in which higher-level packm-related
  functions are defined for use within the sup framework. The actual
  packm variant code resides in bli_l3_sup_packm_var.c.
- Pass the following new parameters into var1n and var2m: packa, packb
  bool_t's, pointer to a rntm_t, pointer to a cntl_t (which is for now
  always NULL), and pointer to a thrinfo_t* (which for nowis the address
  of the global single-threaded packm thread control node).
- Added panel strides ps_a and ps_b to the auxinfo_t structure so that
  the millikernel can query the panel stride of the packed matrix and
  step through it accordingly. If the matrix isn't packed, the panel
  stride of interest for the given millikernel will be set to the
  appropriate value so that the mkernel may step through the unpacked
  matrix as it normally would.
- Modified the rv_6x8m and rv_6x8n millikernels to read the appropriate
  panel strides (ps_a and ps_b, respectively) instead of computing them
  on the fly.
- Spun off the environment variable getting and setting functions into
  a new file, bli_env.c (with a corresponding prototype header). These
  functions are now used by the threading infrastructure (e.g.
  BLIS_NUM_THREADS, BLIS_JC_NT, etc.) as well as the selective packing
  infrastructure (e.g. BLIS_PACK_A, BLIS_PACK_B).
- Added a static initializer for mem_t objects, BLIS_MEM_INITIALIZER.
- Added a static initializer for pblk_t objects, BLIS_PBLK_INITIALIZER,
  for use within the definition of BLIS_MEM_INITIALIZER.
- Moved the global_rntm object to bli_rntm.c and extern it where needed.
  This means that the function bli_thread_init_rntm() was renamed to
  bli_rntm_init_from_global() and relocated accordingly.
- Added a new bli_pack.c function, which serves as the home for
  functions that manage the pack_a and pack_b fields of the global
  rntm_t, including from environment variables, just as we have
  functions to manage the threading fields of the global rntm_t in
  bli_thread.c.
- Reorganized naming for files in frame/thread, which mostly involved
  spinning off the bli_l3_thread_decorator() functions into their own
  files. This change makes more sense when considering the further
  addition of bli_l3_sup_thread_decorator() functions (for now limited
  only to the single-threaded form found in the  _single.c file).
- Explicitly initialize the reference sup handlers in both
  bli_cntx_init_haswell.c and bli_cntx_init_zen.c so that it's more
  obvious how to customize to a different handler, if desired.
- Removed various snippets of disabled code.
- Various comment updates.
2019-11-29 15:27:07 -06:00
Field G. Van Zee
29b0e1ef4e Code review + tweaks to AMD's AOCL 2.0 PR (#349).
Details:
- NOTE: This is a merge commit of 'master' of git://github.com/amd/blis
  into 'amd-master' of flame/blis.
- Fixed a bug in the downstream value of BLIS_NUM_ARCHS, which was
  inadvertantly not incremented when the Zen2 subconfiguration was
  added.
- In bli_gemm_front(), added a missing conditional constraint around the
  call to bli_gemm_small() that ensures that the computation precision
  of C matches the storage precision of C.
- In bli_syrk_front(), reorganized and relocated the notrans/trans logic
  that existed around the call to bli_syrk_small() into bli_syrk_small()
  to minimize the calling code footprint and also to bring that code
  into stylistic harmony with similar code in bli_gemm_front() and
  bli_trsm_front(). Also, replaced direct accessing of obj_t fields with
  proper accessor static functions (e.g. 'a->dim[0]' becomes
  'bli_obj_length( a )').
- Added #ifdef BLIS_ENABLE_SMALL_MATRIX guard around prototypes for
  bli_gemm_small(), bli_syrk_small(), and bli_trsm_small(). This is
  strictly speaking unnecessary, but it serves as a useful visual cue to
  those who may be reading the files.
- Removed cpp macro-protected small matrix debugging code from
  bli_trsm_front.c.
- Added a GCC_OT_9_1_0 variable to build/config.mk.in to facilitate gcc
  version check for availability of -march=znver2, and added appropriate
  support to configure script.
- Cleanups to compiler flags common to recent AMD microarchitectures in
  config/zen/amd_config.mk, including: removal of -march=znver1 et al.
  from CKVECFLAGS (since the -march flag is added within make_defs.mk);
  setting CRVECFLAGS similarly to CKVECFLAGS.
- Cleanups to config/zen/bli_cntx_init_zen.c.
- Cleanups, added comments to config/zen/make_defs.mk.
- Cleanups to config/zen2/make_defs.mk, including making use of newly-
  added GCC_OT_9_1_0 and existing GCC_OT_6_1_0 to choose the correct
  set of compiler flags based on the version of gcc being used.
- Reverted downstream changes to test/test_gemm.c.
- Various whitespace/comment changes.
2019-10-11 10:24:24 -05:00
kdevraje
cac127182d Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis
with public repo commit id 565fa3853b.

Change-Id: I68b9824b110cf14df248217a24a6191b3df79d42
2019-06-24 14:05:54 +05:30
kdevraje
13806ba3b0 This check in has changes w.r.t Copyright information, which is changed to (start year) - 2019
Change-Id: Ide3c8f7172210b8d3538d3c36e88634ab1ba9041
2019-05-27 16:24:43 +05:30
Field G. Van Zee
5a5f494e42 Removed export macros from all internal prototypes.
Details:
- After merging PR #303, at Isuru's request, I removed the use of
  BLIS_EXPORT_BLIS from all function prototypes *except* those that we
  potentially wish to be exported in shared/dynamic libraries. In other
  words, I removed the use of BLIS_EXPORT_BLIS from all prototypes of
  functions that can be considered private or for internal use only.
  This is likely the last big modification along the path towards
  implementing the functionality spelled out in issue #248. Thanks
  again to Isuru Fernando for his initial efforts of sprinkling the
  export macros throughout BLIS, which made removing them where
  necessary relatively painless. Also, I'd like to thank Tony Kelman,
  Nathaniel Smith, Ian Henriksen, Marat Dukhan, and Matthew Brett for
  participating in the initial discussion in issue #37 that was later
  summarized and restated in issue #248.
- CREDITS file update.
2019-03-12 18:45:09 -05:00
Field G. Van Zee
3bdab823fa Merge branch 'master' into dev 2019-02-28 14:07:24 -06:00
Isuru Fernando
f0dcc8944f Add symbol export macro for all functions (#302)
* initial export of blis functions

* Regenerate def file for master

* restore bli_extern_defs exporting for now
2019-02-27 17:27:23 -06:00
Field G. Van Zee
075143dfd9 Added support for IC loop parallelism to trsm.
Details:
- Parallelism within the IC loop (3rd loop around the microkernel) is
  now supported within the trsm operation. This is done via a new branch
  on each of the control and thread trees, which guide execution of a
  new trsm-only subproblem from within bli_trsm_blk_var1(). This trsm
  subproblem corresponds to the macrokernel computation on only the
  block of A that contains the diagonal (labeled as A11 in algorithms
  with FLAME-like partitioning), and the corresponding row panel of C.
  During the trsm subproblem, all threads within the JC communicator
  participate and parallelize along the JR loop, including any
  parallelism that was specified for the IC loop. (IR loop parallelism
  is not supported for trsm due to inter-iteration dependencies.) After
  this trsm subproblem is complete, a barrier synchronizes all
  participating threads and then they proceed to apply the prescribed
  BLIS_IC_NT (or equivalent) ways of parallelism (and any BLIS_JR_NT
  parallelism specified within) to the remaining gemm subproblem (the
  rank-k update that is performed using the newly updated row-panel of
  B). Thus, trsm now supports JC, IC, and JR loop parallelism.
- Modified bli_trsm_l_cntl_create() to create the new "prenode" branch
  of the trsm_l cntl_t tree. The trsm_r tree was left unchanged, for
  now, since it is not currently used. (All trsm problems are cast in
  terms of left-side trsm.)
- Updated bli_cntl_free_w_thrinfo() to be able to free the newly shaped
  trsm cntl_t trees. Fixed a potentially latent bug whereby a cntl_t
  subnode is only recursed upon if there existed a corresponding
  thrinfo_t node, which may not always exist (for problems too small
  to employ full parallelization due to the minimum granularity imposed
  by micropanels).
- Updated other functions in frame/base/bli_cntl.c, such as
  bli_cntl_copy() and bli_cntl_mark_family(), to recurse on sub-prenodes
  if they exist.
- Updated bli_thrinfo_free() to recurse into sub-nodes and prenodes
  when they exist, and added support for growing a prenode branch to
  bli_thrinfo_grow() via a corresponding set of help functions named
  with the _prenode() suffix.
- Added a bszid_t field thrinfo_t nodes. This field comes in handy when
  debugging the allocation/release of thrinfo_t nodes, as it helps trace
  the "identity" of each nodes as it is created/destroyed.
- Renamed
    bli_l3_thrinfo_print_paths() -> bli_l3_thrinfo_print_gemm_paths()
  and created a separate bli_l3_thrinfo_print_trsm_paths() function to
  print out the newly reconfigured thrinfo_t trees for the trsm
  operation.
- Trival changes to bli_gemm_blk_var?.c and bli_trsm_blk_var?.c
  regarding variable declarations.
- Removed subpart_t enum values BLIS_SUBPART1T, BLIS_SUBPART1B,
  BLIS_SUBPART1L, BLIS_SUBPART1R. Then added support for two new labels
  (semantically speaking): BLIS_SUBPART1A and BLIS_SUBPART1B, which
  represent the subpartition ahead of and behind, respectively,
  BLIS_SUBPART1. Updated check functions in bli_check.c accordingly.
- Shuffled layering/APIs for bli_acquire_mpart_[mn]dim() and
  bli_acquire_mpart_t2b/b2t(), _l2r/r2l().
- Deprecated old functions in frame/3/bli_l3_thrinfo.c.
2019-02-14 18:52:45 -06:00
Field G. Van Zee
eb97f778a1 Added missing AMD copyrights to previous commit.
Details:
- Forgot to add AMD copyrights to several touched files that did not
  already have them in 2f31743.
2018-12-25 20:17:09 -06:00
Field G. Van Zee
2f3174330f Implemented a pool-based small block allocator.
Details:
- Implemented a sophisticated data structure and set of APIs that track
  the small blocks of memory (around 80-100 bytes each) used when
  creating nodes for control and thread trees (cntl_t and thrinfo_t) as
  well as thread communicators (thrcomm_t). The purpose of the small
  block allocator, or sba, is to allow the library to transition into a
  runtime state in which it does not perform any calls to malloc() or
  free() during normal execution of level-3 operations, regardless of
  the threading environment (potentially multiple application threads
  as well as multiple BLIS threads). The functionality relies on a new
  data structure, apool_t, which is (roughly speaking) a pool of
  arrays, where each array element is a pool of small blocks. The outer
  pool, which is protected by a mutex, provides separate arrays for each
  application thread while the arrays each handle multiple BLIS threads
  for any given application thread. The design minimizes the potential
  for lock contention, as only concurrent application threads would
  need to fight for the apool_t lock, and only if they happen to begin
  their level-3 operations at precisely the same time. Thanks to Kiran
  Varaganti and AMD for requesting this feature.
- Added a configure option to disable the sba pools, which are enabled
  by default; renamed the --[dis|en]able-packbuf-pools option to
  --[dis|en]able-pba-pools; and rewrote the --help text associated with
  this new option and consolidated it with the --help text for the
  option associated with the sba (--[dis|en]able-sba-pools).
- Moved the membrk field from the cntx_t to the rntm_t. We now pass in
  a rntm_t* to the bli_membrk_acquire() and _release() APIs, just as we
  do for bli_sba_acquire() and _release().
- Replaced all calls to bli_malloc_intl() and bli_free_intl() that are
  used for small blocks with calls to bli_sba_acquire(), which takes a
  rntm (in addition to the bytes requested), and bli_sba_release().
  These latter two functions reduce to the former two when the sba pools
  are disabled at configure-time.
- Added rntm_t* arguments to various cntl_t and thrinfo_t functions, as
  required by the new usage of bli_sba_acquire() and _release().
- Moved the freeing of "old" blocks (those allocated prior to a change
  in the block_size) from bli_membrk_acquire_m() to the implementation
  of the pool_t checkout function.
- Miscellaneous improvements to the pool_t API.
- Added a block_size field to the pblk_t.
- Harmonized the way that the trsm_ukr testsuite module performs packing
  relative to that of gemmtrsm_ukr, in part to avoid the need to create
  a packm control tree node, which now requires a rntm_t that has been
  initialized with an sba and membrk.
- Re-enable explicit call bli_finalize() in testsuite so that users who
  run the testsuite with memory tracing enabled can check for memory
  leaks.
- Manually imported the compact/minor changes from 61441b24 that cause
  the rntm to be copied locally when it is passed in via one of the
  expert APIs.
- Reordered parameters to various bli_thrcomm_*() functions so that the
  thrcomm_t* to the comm being modified is last, not first.
- Added more descriptive tracing for allocating/freeing small blocks and
  formalized via a new configure option: --[dis|en]able-mem-tracing.
- Moved some unused scalm code and headers into frame/1m/other.
- Whitespace changes to bli_pthread.c.
- Regenerated build/libblis-symbols.def.
2018-12-25 19:35:01 -06:00
Field G. Van Zee
e809b5d2f1 Merge branch 'master' into amd 2018-12-20 16:27:26 -06:00
Field G. Van Zee
93d56319f2 Added missing bli_init_once() in bli_thread API.
Details:
- Fixed an issue with specifying threading globally at runtime via
  bli_thread_set_num_threads() (the automatic way) or via
  bli_thread_set_ways() (the manual way), with bli_thread_init_rntm()
  also affected. These functions were not calling bli_init_once() prior
  to acting, and therefore their effects on the global rntm_t structure
  were being wiped out by the eventual call to bli_init_once(), by some
  other BLIS function. Thanks to Ali Emre Gülcü for reporting the
  behavior associated with this bug.
- Added additional content to docs/Multithreading.md covering topics of
  choosing between OpenMP and pthreads, and specifying affinity via
  OpenMP.
- CREDITS file update.
2018-12-17 19:17:30 -06:00
Field G. Van Zee
76016691e2 Improvements to bli_pool; malloc()/free() tracing.
Details:
- Added malloc_ft and free_ft fields to pool_t, which are provided when
  the pool is initialized, to allow bli_pool_alloc_block() and
  bli_pool_free_block() to call bli_fmalloc_align()/bli_ffree_align()
  with arbitrary align_size values (according to how the pool_t was
  initialized).
- Added a block_ptrs_len argument to bli_pool_init(), which allows the
  caller to specify an initial length for the block_ptrs array, which
  previously suffered the cost of being reallocated, copied, and freed
  each time a new block was added to the pool.
- Consolidated the "buf_sys" and "buf_align" pointer fields in pblk_t
  into a single "buf" field. Consolidated the bli_pblk API accordingly
  and also updated the bli_mem API implementation. This was done
  because I'd previously already implemented opaque alignment via
  bli_malloc_align(), which allocates extra space and stores the
  original pointer returned by malloc() one element before the element
  whose address is aligned.
- Tweaked bli_membrk_acquire_m() and bli_membrk_release() to call
  bli_fmalloc_align() and bli_ffree_align(), which required adding an
  align_size field to the membrk_t struct.
- Pass the pack schemas directly into bli_l3_cntl_create_if() rather
  than transmit them via objects for A and B.
- Simplified bli_l3_cntl_free_if() and renamed to bli_l3_cntl_free().
  The function had not been conditionally freeing control trees for
  quite some time. Also, removed obj_t* parameters since they aren't
  needed anymore (or never were).
- Spun-off OpenMP nesting code in bli_l3_thread_decorator() to a
  separate function, bli_l3_thread_decorator_thread_check().
- Renamed:
    bli_malloc_align()   -> bli_fmalloc_align()
    bli_free_align()     -> bli_ffree_align()
    bli_malloc_noalign() -> bli_fmalloc_noalign()
    bli_free_noalign()   -> bli_ffree_noalign()
  The 'f' is for "function" since they each take a malloc_ft or free_ft
  function pointer argument.
- Inserted various printf() calls for the purposes of tracing memory
  allocation and freeing, guarded by cpp macro ENABLE_MEM_DEBUG, which,
  for now, is intended to be a "hidden" feature rather than one hooked
  up to a configure-time option.
- Defined bli_rntm_equals(), which compares two rntm_t for equality.
  (There are no use cases for this function yet, but there may be soon.)
- Whitespace changes to function parameter lists in bli_pool.c, .h.
2018-12-13 17:23:09 -06:00
Field G. Van Zee
0645f239fb Remove UT-Austin from copyright headers' clause 3.
Details:
- Removed explicit reference to The University of Texas at Austin in the
  third clause of the license comment blocks of all relevant files and
  replaced it with a more all-encompassing "copyright holder(s)".
- Removed duplicate words ("derived") from a few kernels' license
  comment blocks.
- Homogenized license comment block in kernels/zen/3/bli_gemm_small.c
  with format of all other comment blocks.
2018-12-04 14:31:06 -06:00
Field G. Van Zee
0e27963a67 Add bli_pthread_mutex_trylock().
Details:
- Added the missing bli_pthread_mutex_trylock() function and prototype
  to the non-Windows sections of bli_pthread.c and .h. This function
  isn't needed by BLIS, but I figured why not make the Windows and
  non-Windows sections consistent with one another.
2018-10-24 12:16:19 -05:00
Field G. Van Zee
4b683740c1 Defined bli_pthread_cond_*() and related defs.
Details:
- Added function definitions for bli_pthread_cond_*() as well as related
  types and constants to bli_pthread.c, and corresponding prototypes to
  bli_pthread.h.
2018-10-24 11:56:16 -05:00
Field G. Van Zee
4b4f8072b9 Define bli_pthreads barrier types on OS X.
Details:
- Fully define bli_pthreads barrier-related types on OS X. Only typedef
  those types in terms of pthreads types on non-Windows, non-Apple OSes
  (i.e. Linux).
2018-10-24 11:31:46 -05:00
Field G. Van Zee
ad98790dce Fix names of Windows pthread initializer macros.
Details:
- Renamed the PTHREAD_ initializer macros in the Windows cpp case to use
  BLIS_ prefixes to match their non-Windows counterparts.
2018-10-23 20:35:05 -05:00
Field G. Van Zee
06c23954e6 Defined unified bli_pthreads_*() API for all OSes.
Details:
- Expanded the bli_pthread_*() -> pthread_*() wrappers in
  frame/thread/bli_pthread.c to include cases for Windows taken from
  frame/base/bli_pthread_wrap.c. Now, bli_thread_*() is always defined
  and always used by BLIS and the BLIS testsuite (in lieu of calling
  pthreads directly, as before). The implementation used in this new
  API depends on whether we are building for Windows, and to a lesser
  extent, whether we are building on OS X. For the core API, Windows
  uses Windows threads, non-Windows (Linux, OS X) uses pthreads.
  OS X and Windows get barriers implemented in terms of other
  bli_pthread_*() functions, and Linux gets barriers implemented in
  terms of pthread_barrier*(). This commit addresses issue #273.
- Fixed a bug in the Linux definition of bli_pthread_mutex_unlock(),
  which was erroneously calling pthread_mutex_lock().
- Minor changes to configure so that the auto-detection executable
  can be built given the above changes (most notably, turning on
  POSIX extensions via -D_GNU_SOURCE).
- Removed temporary play-test code for shiftd that accidentally got
  committed into test/3m4m/test_gemm.c.
2018-10-23 19:16:54 -05:00
Field G. Van Zee
eac7d267a0 Unconditionally define bli_l3_thread_entry().
Details:
- Define a dummy bli_l3_thread_entry() function when multithreading is
  disabled altogether, or enabled via OpenMP. This function was
  originally necessary when multithreading is enabled via pthreads.
  By defining the function no matter the threading options given, it is
  less likely that an AppVeyor Windows build will complain due to a
  missing symbol in the DLL. (To be clear: AppVeyor was working fine
  before, but a problem may have arisen if it were switched to an
  OpenMP build.)
- Removed the prototype for bli_l3_thread_entry() from
  bli_thrcomm_pthreads.c and placed it in bli_thrcomm.h.
- Regenerated the symbols list file build/libblis-symbols.def.
2018-10-22 18:10:59 -05:00
Field G. Van Zee
85397cd4fa Added explanatory comment to bli_pthread.c.
Details:
- Added a verbose comment to bli_pthread.c that explains why a bli_
  wrapper to pthreads APIs is useful.
2018-10-19 13:12:43 -05:00
Field G. Van Zee
473ce54f5f Added bli_pthread_*() API.
Details:
- Defined a bli_pthread_*() API so that the testsuite, when being linked
  against a Windows DLL, will be able to access pthreads functionality
  without those pthreads functions being explicitly exported by the DLL.
  Instead, we export the bli_pthread_*() layer, which uses types and
  functions that are identical to pthreads, but adds a 'bli_' prefix.
  Only a few basic functions are present in the bli_pthreads_*() API
  for now. Thanks to Devin Matthews and Isuru Fernando for their help
  on a related PR (#261) that this commit will hopefully facilitate.
- Updated testsuite so that it calls bli_pthread_*() layer instead of
  pthread_*() functions directly.
- Regenerated build/libblis-symbols.def.
- Comment updated to build/regen-symbols.sh.
2018-10-18 19:03:56 -05:00
Field G. Van Zee
53e0a0c9b3 Merge branch 'master' into win-pthreads 2018-10-18 14:54:59 -05:00
Field G. Van Zee
71c5832d5f Consolidated slab/rr-explicit level-3 macrokernels.
Details:
- Consolidated the *sl.c and *rr.c level-3 macrokernels into a single
  file per sl/rr pair, with those files named as they were before
  c92762e. The consolidation does not take away the *option* of using
  slab or round-robin assignment of micropanels to threads; it merely
  *hides* the choice within the definitions of functions such as
  bli_thread_range_jrir(), bli_packm_my_iter(), and bli_is_last_iter()
  rather than expose that choice explicitly in the code. The choice of
  slab or rr is not always hidden, however; there are some cases
  involving herk and trmm, for example, that require some part of the
  computation to use rr unconditionally. (The --thread-part-jrir option
  controls the partitioning in all other cases.)
- Note: Originally, the sl and rr macrokernels were separated out for
  clarity. However, aside from the additional binary code bloat, I later
  deemed that clarity not worth the price of maintaining the additional
  (mostly similar) codes.
2018-10-17 14:11:01 -05:00
Field G. Van Zee
b9c61d03f5 Merge branch 'nested-omp-patch' 2018-10-16 14:39:57 -05:00
Devin Matthews
29e6245816 Merge branch 'master' into win-pthreads 2018-10-16 10:12:25 -05:00
Field G. Van Zee
dc5fd898af Merge branch 'amd' 2018-10-15 17:41:35 -05:00
Field G. Van Zee
3612ecac98 Added comments to nested OpenMP handling code.
Details:
- Added comments to bli_thrcomm_openmp.c relating to changes made in
  6ac0c80 and 1064d79.
2018-10-11 15:16:41 -05:00
Field G. Van Zee
667d3929ee Added Fortran APIs for some thread functions.
Details:
- Defined Fortran-77 compatible APIs for bli_thread_set_num_threads()
  and bli_thread_set_ways(). These wrappers are defined in
  frame/compat/blis/thread/b77_thread.c. Thanks to Kay Dewhurst for
  suggesting these new interfaces.
- Added missing prototype for bli_thread_set_ways() in bli_thread.h and
  removed prototypes for non-existent functions bli_thread_set_*_nt().
- CREDITS file update.
2018-10-11 11:47:57 -05:00
Devin Matthews
1064d79711 Adjust rntm_t struct as well. 2018-10-11 11:14:25 -05:00
Devin Matthews
6ac0c80560 Fix OMP nesting problem.
Detect when OpenMP uses fewer threads than requested and correct accordingly, so that we don't wait forever for nonexistent threads. Fixes #267.
2018-10-11 10:45:07 -05:00
Field G. Van Zee
c92762ecdc Added option of slab or rr partitioning in jr/ir.
Details:
- Updated existing macrokernel function names and definitions to
  explicitly use slab assignment of micropanels to threads, then created
  duplicate versions of macrokernels that explicitly use round-robin
  assignment instead of slab. NOTE: As in ac18949, trsm_r macrokernels
  were not substantially updated in this commit because they are
  currently disabled in bli_trsm_front.c.
- Updated existing packing function (in blk_packm_blk_var1.c) to
  explicitly use slab partitioning, and then duplicated for round-robin.
- Updated control tree initialization to use the appropriate macrokernel
  and packm function pointers depending on which method (slab or rr) was
  enabled at configure-time.
- Updated configure script to accept new --thread-part-jrir=[slab|rr]
  option (-m [slab|rr] for short), which allows the user to explicitly
  request either slab or round-robin assignment (partitioning) of
  micropanels to threads.
- Updated sandbox/ref99 according to above changes.
- Minor updates to build/add-copyright.py.
2018-10-07 20:30:32 -05:00
Devin Matthews
627d0c5bfd Combine the alternative barrier implementation for macOS with the pthread wrapper for Windows. Also implement pthread_{create,join} for Windows. 2018-10-02 14:40:55 -05:00
Field G. Van Zee
ac18949a4b Multithreading optimizations for l3 macrokernels.
Details:
- Adjusted the method by which micropanels are assigned to threads in
  the 2nd (jr) and 1st (ir) loops around the microkernel to (mostly)
  employ contiguous "slab" partitioning rather than interleaved (round
  robin) partitioning. The new partitioning schemes and related details
  for specific families of operations are listed below:
  - gemm: slab partitioning.
  - herk: slab partitioning for region corresponding to non-triangular
          region of C; round robin partitioning for triangular region.
  - trmm: slab partitioning for region corresponding to non-triangular
          region of B; round robin partitioning for triangular region.
          (NOTE: This affects both left- and right-side macrokernels:
          trmm_ll, trmm_lu, trmm_rl, trmm_ru.)
  - trsm: slab partitioning.
          (NOTE: This only affects only left-side macrokernels trsm_ll,
          trsm_lu; right-side macrokernels were not touched.)
  Also note that the previous macrokernels were preserved inside of
  the 'other' directory of each operation family directory (e.g.
  frame/3/gemm/other, frame/3/herk/other, etc).
- Updated gemm macrokernel in sandbox/ref99 in light of above changes
  and fixed a stale function pointer type in blx_gemm_int.c
  (gemm_voft -> gemm_var_oft).
- Added standalone test drivers in test/3m4m for herk, trmm, and trsm
  and minor changes to test/3m4m/Makefile.
- Updated the arguments and definitions of bli_*_get_next_[ab]_upanel()
  and bli_trmm_?_?r_my_iter() macros defined in bli_l3_thrinfo.h.
- Renamed bli_thread_get_range*() APIs to bli_thread_range*().
2018-09-30 18:54:56 -05:00
Field G. Van Zee
c03728f1f4 Various minor cleanups.
Details:
- Rewrote bli_winsys.c to define bli_setenv() and bli_sleep()
  unconditionally, but differently for Windows and non-Windows, but
  then disabled the definition of bli_setenv() entirely since BLIS
  no longer needs to set environment variables. Updated bli_winsys.h
  accordingly, and call bli_sleep() from within testsuite instead of
  sleep() directly.
- Use
    #if !defined(_POSIX_BARRIERS) || (_POSIX_BARRIERS != 200809L)
  instead of
    #if !defined(_POSIX_BARRIERS) || (_POSIX_BARRIERS < 0)
  when guarding against local definition of pthread barrier in
  testsuite. (The description for unistd.h implies that _POSIX_BARRIERS
  should always be set to 200809L when barriers are supported, though I
  won't be surprised if we encounter a case in the future where it is
  set to something else such as 1 while still supported.)
- Removed old _VERS_CONF_INST definitions and installation rules in
  top-level Makefile. These are no longer needed because we no longer
  output libraries with the version and configuration name as
  substrings.
- Comment/whitespace updates in Makefile, config.mk.in, common.mk,
  configure, bli_extern_defs.h, and test_libblis.h.
- Added mention of 1m to README.md and other trivial tweaks.
2018-09-10 17:54:27 -05:00
Field G. Van Zee
4fa4cb0734 Trivial comment header updates.
Details:
- Removed four trailing spaces after "BLIS" that occurs in most files'
  commented-out license headers.
- Added UT copyright lines to some files. (These files previously had
  only AMD copyright lines but were contributed to by both UT and AMD.)
- In some files' copyright lines, expanded 'The University of Texas' to
  'The University of Texas at Austin'.
- Fixed various typos/misspellings in some license headers.
2018-08-29 18:06:41 -05:00
Field G. Van Zee
10d07357af Better thread safety; added threading to testsuite.
Details:
- Replaced critical sections that were conditional upon multithreading
  being enabled (via pthreads or OpenMP) with unconditional use of
  pthreads mutexes. (Why pthreads? Because BLIS already requires it
  for its initialization mechanism: pthread_once().) This was done in
  bli_error.c, bli_gks.c, bli_l3_ind.c. Also, replaced usage of BLIS's
  mtx_t object and bli_mutex_*() API with pthread mutexes in
  bli_thread.c. The previous status quo could result in a race condition
  if the application called BLIS from more than one thread. The new
  pthread-based code should be completely agnostic to the application's
  threading configuration. Thanks to AMD for bringing to our attention
  the need for a thread-safety review.
- Added an option to the testsuite to simulate application-level
  multithreading. Specifically, each thread maintains a counter that is
  incremented after each experiment. The thread only executes the
  experiment if: counter % n_threads == thread_id. In other words, the
  threads simply take turns executing each problem experiment. Also,
  POSIX guarantees that fprintf() will not intermingle output, so
  output was switched to fprintf() instead of libblis_test_fprintf().
- Changed membrk_t objects to use pthread_mutex_t intead of mtx_t and
  replaced use of bli_mutex_init()/_finalize() in bli_membrk.c with
  wrappers to pthread_mutex_init()/_destroy().
- Changed the implementation of bli_l3_ind_oper_enable_only() to fix
  a race condition; specifically, two threads calling the function with
  the same parameters could lead to a non-deterministic outcome.
- Added #include <pthread.h> to bli_cpuid.c and moved the same in
  bli_arch.c.
- Added 'const' to declaration of OPT_MARKER in bli_getopt.c.
- Added #include <pthread.h> to bli_system.h.
- Added add-copyright.py script to automate adding new copyright lines
  to (and updating existing lines of) source files.
2018-08-26 20:34:30 -05:00
Field G. Van Zee
fa08e5ead9 Fixed minor issues in ecbebe7 with mt disabled.
Details:
- Fixed an unused variable warning in frame/base/bli_rntm.c when
  multithreading is disabled.
- Fixed a missing variable declaration in bli_thread_init_rntm_from_env()
  when multithreading is disabled.
2018-07-17 19:02:15 -05:00
Field G. Van Zee
ecbebe7c2e Defined rntm_t to relocate cntx_t.thrloop (#235).
Details:
- Defined a new struct datatype, rntm_t (runtime), to house the thrloop
  field of the cntx_t (context). The thrloop array holds the number of
  ways of parallelism (thread "splits") to extract per level-3
  algorithmic loop until those values can be used to create a
  corresponding node in the thread control tree (thrinfo_t structure),
  which (for any given level-3 invocation) usually happens by the time
  the macrokernel is called for the first time.
- Relocating the thrloop from the cntx_t remedies a thread-safety issue
  when invoking level-3 operations from two or more application threads.
  The race condition existed because the cntx_t, a pointer to which is
  usually queried from the global kernel structure (gks), is supposed to
  be a read-only. However, the previous code would write to the cntx_t's
  thrloop field *after* it had been queried, thus violating its read-only
  status. In practice, this would not cause a problem when a sequential
  application made a multithreaded call to BLIS, nor when two or more
  application threads used the same parallelization scheme when calling
  BLIS, because in either case all application theads would be using
  the same ways of parallelism for each loop. The true effects of the
  race condition were limited to situations where two or more application
  theads used *different* parallelization schemes for any given level-3
  call.
- In remedying the above race condition, the application or calling
  library can now specify the parallelization scheme on a per-call basis.
  All that is required is that the thread encode its request for
  parallelism into the rntm_t struct prior to passing the address of the
  rntm_t to one of the expert interfaces of either the typed or object
  APIs. This allows, for example, one application thread to extract 4-way
  parallelism from a call to gemm while another application thread
  requests 2-way parallelism. Or, two threads could each request 4-way
  parallelism, but from different loops.
- A rntm_t* parameter has been added to the function signatures of most
  of the level-3 implementation stack (with the most notable exception
  being packm) as well as all level-1v, -1d, -1f, -1m, and -2 expert
  APIs. (A few internal functions gained the rntm_t* parameter even
  though they currently have no use for it, such as bli_l3_packm().)
  This required some internal calls to some of those functions to
  be updated since BLIS was already using those operations internally
  via the expert interfaces. For situations where a rntm_t object is
  not available, such as within packm/unpackm implementations, NULL is
  passed in to the relevant expert interfaces. This is acceptable for
  now since parallelism is not obtained for non-level-3 operations.
- Revamped how global parallelism is encoded. First, the conventional
  environment variables such as BLIS_NUM_THREADS and BLIS_*_NT  are only
  read once, at library initialization. (Thanks to Nathaniel Smith for
  suggesting this to avoid repeated calls getenv(), which can be slow.)
  Those values are recorded to a global rntm_t object. Public APIs, in
  bli_thread.c, are still available to get/set these values from the
  global rntm_t, though now the "set" functions have additional logic
  to ensure that the values are set in a synchronous manner via a mutex.
  If/when NULL is passed into an expert API (meaning the user opted to
  not provide a custom rntm_t), the values from the global rntm_t are
  copied to a local rntm_t, which is then passed down the function stack.
  Calling a basic API is equivalent to calling the expert APIs with NULL
  for the cntx and rntm parameters, which means the semantic behavior of
  these basic APIs (vis-a-vis multithreading) is unchanged from before.
- Renamed bli_cntx_set_thrloop_from_env() to bli_rntm_set_ways_for_op()
  and reimplemented, with the function now being able to treat the
  incoming rntm_t in a manner agnostic to its origin--whether it came
  from the application or is an internal copy of the global rntm_t.
- Removed various global runtime APIs for setting the number of ways of
  parallelism for individual loops (e.g. bli_thread_set_*_nt()) as well
  as the corresponding "get" functions. The new model simplifies these
  interfaces so that one must either set the total number of threads, OR
  set all of the ways of parallelism for each loop simultaneously (in a
  single function call).
- Updated sandbox/ref99 according to above changes.
- Rewrote/augmented docs/Multithreading.md to document the three methods
  (and two specific ways within each method) of requesting parallelism
  in BLIS.
- Removed old, disabled code from bli_l3_thrinfo.c.
- Whitespace changes to code (e.g. bli_obj.c) and docs/BuildSystem.md.
2018-07-17 18:37:32 -05:00
Field G. Van Zee
89e178ce38 Merge branch 'master' into dev 2018-07-04 17:51:16 -05:00
Isuru Fernando
14648e1376 Native windows support using clang (#227)
* Add appveyor file

* Build script

* Remove fPIC for now

* copy as

* set CC and CXX

* Change the order of immintrin.h

* Fix testsuite header

* Move testsuite defs to .c

* Fix appveyor file

* Remove fPIC again and fix strerror_r missing bug

* Remove appveyor script

* cd to blis directory

* Fix sleep implementation

* Add f2c_types_win.h

* Fix f2c compilation

* Remove rdp and rename appveyor.yml

* Remove setenv declaration in test header

* set CPICFLAGS to empty

* Fix another immintrin.h issue

* Escape CFLAGS and LDFLAGS

* Fix more ?mmintrin.h issues

* Build x86_64 in appveyor

* override LIBM LIBPTHREAD AR AS

* override pthreads in configure

* Move windows definitions to bli_winsys.h

* Fix LIBPTHREAD default value

* Build intel64 in appveyor for now
2018-07-04 17:48:42 -05:00
Field G. Van Zee
f97a86f322 Updated setting/querying pack schema (cntx->cntl).
- Query pack schemas in level-3 bli_*_front() functions and store those
  values in the schema bitfields of the correponding obj_t's when the
  cntx's method is not BLIS_NAT. (When method is BLIS_NAT, the default
  native schemas are stored to the obj_t's.)
- In bli_l3_cntl_create_if(), query the schemas stored to the obj_t's in
  bli_*_front(), clear the schema bitfields, and pass the queried values
  into bli_gemm_cntl_create() and bli_trsm_cntl_create().
- Updated APIs for bli_gemm_cntl_create() and bli_trsm_cntl_create() to
  take schemas for A and B, and use these values to initialize the
  appropriate control tree nodes. (Also cpp-disabled the panel-block cntl
  tree creation variant, bli_gemmpb_cntl_create(), as it has not been
  employed by BLIS in quite some time.)
- Simplified querying of schema in bli_packm_init() thanks to above
  changes.
- Updated openmp and pthreads definitions of bli_l3_thread_decorator()
  so that thread-local aliases of matrix operands are guaranteed, even
  if aliasing is disabled within the internal back-end functions (e.g.
  bli_gemm_int.c). Also added a comment to bli_thrcomm_single.c
  explaining why the extra aliasing is not needed there.
- Change bli_gemm() and level-3 friends so that the operation's ind()
  function is called only if all matrix operands have the same datatype,
  and only if that datatype is complex. The former condition is needed
  in preparation for work related to mixed domain operands, while the
  latter helps with readability, especially for those who don't want to
  venture into frame/ind.
- Reshuffled arguments in bli_cntx_set_thrloop_from_env() to be
  consistent with BLIS calling conventions (modified argument(s) are
  last), and updated all invocations in the level-3 _front() functions.
- Comment updates to bli_cntx_set_thrloop_from_env().
2018-06-02 20:28:20 -05:00