1479 Commits

Author SHA1 Message Date
Field G. Van Zee
eac7d267a0 Unconditionally define bli_l3_thread_entry().
Details:
- Define a dummy bli_l3_thread_entry() function when multithreading is
  disabled altogether, or enabled via OpenMP. This function was
  originally necessary when multithreading was enabled via pthreads.
  By defining the function no matter the threading options given, it is
  less likely that an AppVeyor Windows build will complain due to a
  missing symbol in the DLL. (To be clear: AppVeyor was working fine
  before, but a problem may have arisen if it were switched to an
  OpenMP build.)
- Removed the prototype for bli_l3_thread_entry() from
  bli_thrcomm_pthreads.c and placed it in bli_thrcomm.h.
- Regenerated the symbols list file build/libblis-symbols.def.
2018-10-22 18:10:59 -05:00
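A minimal sketch of the pattern described in eac7d267a0: the symbol is defined under every threading configuration, so an exported-symbols list can always reference it. The macro EXAMPLE_ENABLE_PTHREADS and the function example_thread_entry() are hypothetical names chosen for illustration; they are not the BLIS identifiers or signatures.

```c
#include <stddef.h>

/* EXAMPLE_ENABLE_PTHREADS is a hypothetical configuration macro. */
#ifdef EXAMPLE_ENABLE_PTHREADS

/* Variant used when threads are created via pthreads. A real entry point
   would unpack 'data' and execute the calling thread's share of the work. */
void* example_thread_entry( void* data )
{
    (void)data;
    return NULL;
}

#else

/* Dummy variant: never called, but always defined so that the symbol
   exists regardless of the threading option chosen at build time. */
void* example_thread_entry( void* data )
{
    (void)data;
    return NULL;
}

#endif
```

Because both branches provide a definition, a linker step that expects the symbol succeeds whether or not the hypothetical macro is set.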
Field G. Van Zee
4ee986f0a7 Added mixed-datatype testing to Travis CI (#271).
Details:
- Modified .travis.yml to automatically test the mixed-datatype support
  of the gemm operation, with supporting changes to common.mk, the
  top-level Makefile, and travis/do_testsuite.sh.
- Added a new pair of input files to testsuite directory with the
  '.mixed' suffix (similar to those with the '.fast' suffix) for testing
  mixed-datatype gemm.
- Updated docs/BuildSystem.md to document the new make targets
  'testblis-md' and 'checkblis-md'.
2018-10-22 14:09:44 -05:00
Field G. Van Zee
c3c6ebc9c6 Fixed thrinfo_t printing for small problems.
Details:
- Fixed a bug in the code that prints out the communicator and work ids
  from the various threads' thrinfo_t nodes. This bug manifested when
  the dimension being parallelized was not large enough such that every
  thread was assigned actual work (since the minimum amount of work is
  determined by the register blocksize in the dimension being
  parallelized). In those cases, the threads that receive no work in
  that dimension do not finish building their thrinfo_t tree, leaving
  lower-level nodes non-existent. (The bug itself was usually observed as
  a segfault when the printing code attempted to dereference all the way
  down the thrinfo_t tree.) The solution involves explicitly checking
  each node as it is dereferenced, and if at any time NULL is found, all
  subsequent communicator and work ids are set to -1.
2018-10-21 18:48:54 -05:00
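The NULL-checking walk described in c3c6ebc9c6 can be sketched roughly as follows. The node_t type and collect_ids() function are hypothetical stand-ins for illustration only; the real thrinfo_t structure and printing code in BLIS differ.

```c
#include <stddef.h>

/* Hypothetical stand-in for one node of a per-thread info tree. */
typedef struct node_s
{
    int            comm_id; /* communicator id at this level */
    int            work_id; /* work id at this level */
    struct node_s* sub;     /* next lower level, or NULL if never built */
} node_t;

/* Record the ids for n_levels levels, starting at root. Each node is
   checked before it is dereferenced; once a NULL node is found, that
   level and every level below it are reported as -1. */
void collect_ids( const node_t* root, int n_levels,
                  int* comm_ids, int* work_ids )
{
    const node_t* cur = root;

    for ( int i = 0; i < n_levels; ++i )
    {
        if ( cur != NULL )
        {
            comm_ids[ i ] = cur->comm_id;
            work_ids[ i ] = cur->work_id;
            cur = cur->sub;
        }
        else
        {
            comm_ids[ i ] = -1;
            work_ids[ i ] = -1;
        }
    }
}
```

A thread whose tree stops after two levels would thus print valid ids for those two levels and -1 for the rest, instead of dereferencing a NULL pointer.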
Field G. Van Zee
73a222c0d9 Minor edits to 'configure --help' text. 2018-10-20 14:13:04 -05:00
Field G. Van Zee
14f3d5e6df Refresh libblis-symbols.def post-merge 090e4f0. 2018-10-19 20:39:35 -05:00
Field G. Van Zee
090e4f08fc Merge branch 'master' into dev 2018-10-19 18:41:10 -05:00
Field G. Van Zee
0854e880b0 Merge pull request #261 from flame/win-pthreads
Implement missing pthreads function on Windows
2018-10-19 18:05:00 -05:00
Field G. Van Zee
c9be5889fb Added "Known issues" section to Multithreading.md.
Details:
- Added known issues section to Multithreading.md.
- Trivial changes to MixedDatatypes.md, Sandboxes.md.
2018-10-19 17:42:40 -05:00
Field G. Van Zee
343a2715eb Whitespace changes to configure, bli_pthread_wrap.
Details:
- Mostly whitespace changes (spaces to tabs) to configure and
  bli_pthread_wrap.c and .h.
2018-10-19 16:59:19 -05:00
Field G. Van Zee
3678a1cd51 Merge branch 'master' into win-pthreads 2018-10-19 16:11:31 -05:00
Field G. Van Zee
4e38a8d4ee Implemented python version checking in configure.
Details:
- Added python version checking to configure script. (Recall that python
  is needed to execute the flatten-headers.py script.) Minimum versions
  of python needed are currently as follows:
    python2: 2.7 or later
    python3: 3.5 or later
  The standard search order for python interpreters is:
    python python3 python2
  The PYTHON environment variable is also supported and will be checked
  before the standard search order list.
- Updated BuildSystem.md to include: a minimum make version; mention
  that the C compiler must actually be a C99 compiler; and the caveat
  that Windows builds do not require pthreads since BLIS can provide
  an implementation of pthreads internally.
2018-10-19 15:54:15 -05:00
Field G. Van Zee
85397cd4fa Added explanatory comment to bli_pthread.c.
Details:
- Added a verbose comment to bli_pthread.c that explains why a bli_
  wrapper to pthreads APIs is useful.
2018-10-19 13:12:43 -05:00
Field G. Van Zee
53c07035ef Refresh libblis-symbols.def from bb6df28.
Details:
- Forgot to regenerate the symbols file after the previous commit
  (bb6df281) in which shiftd operation was introduced.
2018-10-19 12:53:03 -05:00
Field G. Van Zee
473ce54f5f Added bli_pthread_*() API.
Details:
- Defined a bli_pthread_*() API so that the testsuite, when being linked
  against a Windows DLL, will be able to access pthreads functionality
  without those pthreads functions being explicitly exported by the DLL.
  Instead, we export the bli_pthread_*() layer, which uses types and
  functions that are identical to pthreads, but adds a 'bli_' prefix.
  Only a few basic functions are present in the bli_pthread_*() API
  for now. Thanks to Devin Matthews and Isuru Fernando for their help
  on a related PR (#261) that this commit will hopefully facilitate.
- Updated testsuite so that it calls bli_pthread_*() layer instead of
  pthread_*() functions directly.
- Regenerated build/libblis-symbols.def.
- Updated a comment in build/regen-symbols.sh.
2018-10-18 19:03:56 -05:00
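The idea behind 473ce54f5f, a thin prefixed layer over pthreads that a DLL can export in place of the pthread_*() symbols themselves, might look like the sketch below. The wrap_ prefix and these four functions are hypothetical; the commit only states that the real layer mirrors pthreads types and functions with a 'bli_' prefix, and its exact signatures are not shown here.

```c
#include <pthread.h>

/* Hypothetical wrapper type: identical to the pthreads type, renamed. */
typedef pthread_mutex_t     wrap_pthread_mutex_t;
typedef pthread_mutexattr_t wrap_pthread_mutexattr_t;

/* Each wrapper forwards directly to the corresponding pthreads call and
   returns its result unchanged. */
int wrap_pthread_mutex_init( wrap_pthread_mutex_t* mutex,
                             const wrap_pthread_mutexattr_t* attr )
{
    return pthread_mutex_init( mutex, attr );
}

int wrap_pthread_mutex_lock( wrap_pthread_mutex_t* mutex )
{
    return pthread_mutex_lock( mutex );
}

int wrap_pthread_mutex_unlock( wrap_pthread_mutex_t* mutex )
{
    return pthread_mutex_unlock( mutex );
}

int wrap_pthread_mutex_destroy( wrap_pthread_mutex_t* mutex )
{
    return pthread_mutex_destroy( mutex );
}
```

A client such as a test driver would then call only the wrap_*() functions, so the library needs to export just that layer.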
Field G. Van Zee
bb6df2814f Defined a new level-1d operation: shiftd.
Details:
- Defined a new level-1d operation called 'shiftd', including object and
  typed APIs. This operation adds a scalar value to every element along
  an arbitrary diagonal of a matrix. Currently, shiftd is implemented in
  terms of the addv kernel. (The scalar is passed in as the x vector
  with an increment of zero.)
- Replaced ad-hoc usage of setd and addd (after creating a temporary
  matrix object) with use of shiftd, which is much more concise, in
  various test driver files in the testsuite. Similar changes were made
  to the standalone test drivers and the example code.
- Added documentation entries in BLISObjectAPI.md and BLISTypedAPI.md
  for bli_shiftd() and bli_?shiftd(), respectively.
- Added observed object properties to level-1d documentation in
  BLISObjectAPI.md.
2018-10-18 17:11:39 -05:00
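The trick mentioned in bb6df2814f, expressing shiftd through an addv kernel by passing the scalar as an x vector with an increment of zero, can be illustrated with the simplified sketch below. The functions example_addv() and example_shiftd() are hypothetical, double-precision-only stand-ins and do not reproduce the BLIS APIs.

```c
#include <stddef.h>

/* Generic addv: y[ i*incy ] += x[ i*incx ] for i in [0, n). */
static void example_addv( ptrdiff_t n,
                          const double* x, ptrdiff_t incx,
                          double*       y, ptrdiff_t incy )
{
    for ( ptrdiff_t i = 0; i < n; ++i )
        y[ i * incy ] += x[ i * incx ];
}

/* Add alpha to every element along diagonal 'diagoff' of the m-by-n
   matrix a, which has row stride rs and column stride cs. A diagoff of 0
   is the main diagonal; positive offsets lie above it, negative below. */
void example_shiftd( ptrdiff_t diagoff, ptrdiff_t m, ptrdiff_t n,
                     double alpha, double* a, ptrdiff_t rs, ptrdiff_t cs )
{
    /* Row and column index of the first element on the diagonal. */
    ptrdiff_t i0 = ( diagoff < 0 ? -diagoff : 0 );
    ptrdiff_t j0 = ( diagoff > 0 ?  diagoff : 0 );

    /* Number of elements on the diagonal. */
    ptrdiff_t len = ( m - i0 < n - j0 ? m - i0 : n - j0 );
    if ( len <= 0 ) return;

    /* With incx = 0, every iteration of addv rereads the same scalar.
       Stepping by rs + cs moves one row down and one column right. */
    example_addv( len, &alpha, 0, a + i0 * rs + j0 * cs, rs + cs );
}
```

For a 3-by-3 column-major matrix (rs = 1, cs = 3), a call with diagoff = 0 touches elements 0, 4, and 8 of the underlying array.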
Field G. Van Zee
53e0a0c9b3 Merge branch 'master' into win-pthreads 2018-10-18 14:54:59 -05:00
Field G. Van Zee
ec67679990 Refreshed Windows symbol list; added regen script.
Details:
- Moved windows/build/libblis-symbols.def to build/libblis-symbols.def.
  Updated link commands in common.mk accordingly.
- Added a new script build/regen-symbols.sh that will regenerate the
  libblis-symbols.def file in its new location after building a
  haswell-targeted shared library. Thanks to Isuru Fernando for
  providing the symbol generation command.
- Ran the new script to refresh the symbols file.
2018-10-18 14:27:02 -05:00
Field G. Van Zee
fdad54ab8e Removed old symbol from libblis-symbols.def.
Details:
- Removed bli_gemm_ker_var1() from windows/build/libblis-symbols.def
  since this function is no longer compiled.
2018-10-18 12:43:22 -05:00
Field G. Van Zee
49d3f9fcbb Merge branch 'master' into dev 2018-10-17 18:00:40 -05:00
Field G. Van Zee
3c52725693 Renamed/moved l3 zen ukernels to haswell kernel set.
Details:
- Renamed the microkernels in kernels/zen/3 to kernels/haswell/3 and
  then updated the file contents to use the 'haswell' infix.
- Updated bli_cntx_init_zen.c and bli_cntx_init_haswell.c according to
  above function renames.
- Moved/updated the corresponding prototypes in bli_kernels_zen.h to
  bli_kernels_haswell.h.
- Updated config_registry according to above changes.
- NOTE: This rename reflects the fact that haswell microkernels are
  specifically written to overcome the floating-point latency for FMA
  instructions on Intel Haswell-like architectures, which can issue two
  FMA instructions per cycle. These ukernels happen to work fine on AMD
  Zen-based architectures. However, Zen only issues one FMA per cycle,
  which, while halving its floating-point throughput, gives it extra
  flexibility in the design of its microkernels--namely, mr and nr can
  be smaller and still overcome the floating-point latency for those
  single-issue cores. A smaller value of mr and nr allows for a larger
  value of kc, which may be useful in some situations. In the future,
  we may write such Zen-specific microkernels to take advantage of this
  additional flexibility.
2018-10-17 14:56:22 -05:00
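The NOTE in 3c52725693 rests on a simple piece of arithmetic: to hide floating-point latency, a microkernel needs roughly (FMA latency in cycles) x (FMAs issued per cycle) independent accumulations in flight, and the mr-by-nr tile of accumulator registers supplies them. The helper below is a hypothetical illustration of that product; the latency used in the usage note is an assumed value, not a figure taken from the commit.

```c
/* Minimum number of independent FMA accumulations that must be in flight
   so that a new FMA can be issued in every issue slot while earlier ones
   are still completing. */
int example_min_independent_fmas( int latency_cycles, int fmas_per_cycle )
{
    return latency_cycles * fmas_per_cycle;
}
```

Assuming a latency of 5 cycles, a core that issues two FMAs per cycle needs 10 independent accumulations, while a single-issue core needs only 5. This is why smaller mr and nr values can suffice on the single-issue design.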
Field G. Van Zee
71c5832d5f Consolidated slab/rr-explicit level-3 macrokernels.
Details:
- Consolidated the *sl.c and *rr.c level-3 macrokernels into a single
  file per sl/rr pair, with those files named as they were before
  c92762e. The consolidation does not take away the *option* of using
  slab or round-robin assignment of micropanels to threads; it merely
  *hides* the choice within the definitions of functions such as
  bli_thread_range_jrir(), bli_packm_my_iter(), and bli_is_last_iter()
  rather than expose that choice explicitly in the code. The choice of
  slab or rr is not always hidden, however; there are some cases
  involving herk and trmm, for example, that require some part of the
  computation to use rr unconditionally. (The --thread-part-jrir option
  controls the partitioning in all other cases.)
- Note: Originally, the sl and rr macrokernels were separated out for
  clarity. However, aside from the additional binary code bloat, I later
  deemed that clarity not worth the price of maintaining the additional
  (mostly similar) codes.
2018-10-17 14:11:01 -05:00
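The difference between slab and round-robin assignment discussed in 71c5832d5f, and the way a range function can hide that choice from the loop that uses it, can be sketched as follows. The names example_range_t, example_range_slab(), and example_range_rr() are hypothetical, and the slab split shown is a plain ceiling-division split; the actual BLIS functions (for example bli_thread_range_jrir()) may partition the work differently.

```c
/* Loop bounds for one thread: for ( i = start; i < end; i += inc ). */
typedef struct
{
    int start;
    int end;
    int inc;
} example_range_t;

/* Slab: thread tid of nt receives one contiguous chunk of n iterations. */
example_range_t example_range_slab( int tid, int nt, int n )
{
    int chunk = ( n + nt - 1 ) / nt; /* ceiling of n / nt */
    int start = tid * chunk;
    int end   = start + chunk;

    if ( start > n ) start = n;
    if ( end   > n ) end   = n;

    example_range_t r = { start, end, 1 };
    return r;
}

/* Round-robin: thread tid takes iterations tid, tid+nt, tid+2*nt, ... */
example_range_t example_range_rr( int tid, int nt, int n )
{
    example_range_t r = { tid, n, nt };
    return r;
}
```

Either function returns bounds that plug into the same loop, so a macrokernel written against start/end/inc does not need a separate copy per partitioning scheme.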
Field G. Van Zee
57eab3a4f0 CREDITS file update. 2018-10-17 11:29:20 -05:00
Ye Luo
6722ec2181 Fix bgclang compilation on BGQ (#270)
* Fix bgq kernels

* Support bgq with bgclang
2018-10-17 11:26:00 -05:00
Devin Matthews
1c7247b6d1 Merge branch 'win-pthreads' of github.com:flame/blis into win-pthreads 2018-10-16 14:44:32 -05:00
Devin Matthews
c1bc5530d5 Don't call pthread_once in auto-detect. 2018-10-16 14:44:10 -05:00
Field G. Van Zee
b9c61d03f5 Merge branch 'nested-omp-patch' 2018-10-16 14:39:57 -05:00
Field G. Van Zee
5a1e461ffe Execute flatten-headers.py via $(PYTHON).
Details:
- Execute build/flatten-headers.py python script via $(PYTHON) in
  common.mk. This allows distributions that define the current/preferred
  python interpreter in the PYTHON environment variable to use that
  interpreter when executing flatten-headers.py. Thanks to Isuru
  Fernando for this suggestion, and to Dave Love for submitting the
  initial issue/request.
2018-10-16 14:21:45 -05:00
Devin Matthews
6c5a1aaff5 Fix type in bli_pthread_wrap.c 2018-10-16 10:15:59 -05:00
Devin Matthews
29e6245816 Merge branch 'master' into win-pthreads 2018-10-16 10:12:25 -05:00
Devin Matthews
0b73209f6b Add missing argument to WaitForSingleObject and use $is_win in configure
to turn off pthreads.
2018-10-16 10:02:06 -05:00
Field G. Van Zee
ed65771482 Fixed merge fail on testsuite threading macros.
Details:
- Applied the following C preprocessor macro renames

    BLIS_DEFAULT_MR_THREAD_MAX  -> BLIS_THREAD_MAX_IR
    BLIS_DEFAULT_NR_THREAD_MAX  -> BLIS_THREAD_MAX_JR
    BLIS_DEFAULT_M_THREAD_RATIO -> BLIS_THREAD_RATIO_M
    BLIS_DEFAULT_N_THREAD_RATIO -> BLIS_THREAD_RATIO_N

  in src/test_libblis.c. This is apparently the result of a failure by
  git to properly merge the 'master' and 'amd' branches in the previous
  commit. (The 'master' branch contained a commit, 53a9ab1, in which
  these same cpp macros were renamed throughout the source distribution.)
2018-10-15 17:54:45 -05:00
Field G. Van Zee
dc5fd898af Merge branch 'amd' 2018-10-15 17:41:35 -05:00
Field G. Van Zee
779d64dc30 Added entry for xpbym to input.operations.fast.
Details:
- Forgot to add an entry for the new xpbym operation to
  input.operations.fast in previous commit.
2018-10-15 17:13:18 -05:00
Field G. Van Zee
5fec95b99f Implemented mixed-datatype support for gemm.
Details:
- Implemented support for gemm where A, B, and C may have different
  storage datatypes, as well as a computational precision (and implied
  computation domain) that may be different from the storage precision
  of either A or B. This results in 128 different combinations, all of
  which are implemented within this commit. (For now, the mixed-datatype
  functionality is only supported via the object API.) If desired, the
  mixed-datatype support may be disabled at configure-time.
- Added a memory-intensive optimization to certain mixed-datatype cases
  that requires a single m-by-n matrix be allocated (temporarily) per
  call to gemm. This optimization aims to avoid the overhead involved in
  repeatedly updating C with general stride, or updating C after a
  typecast from the computation precision. This memory optimization may
  be disabled at configure-time (provided that the mixed-datatype
  support is enabled in the first place).
- Added support for testing mixed-datatype combinations to testsuite.
  The user may test gemm with mixed domains, precisions, both, or
  neither.
- Added a standalone test driver directory for building and running
  mixed-datatype performance experiments.
- Defined a new variation of castm, castnzm, which operates like castm
  except that imaginary values are not touched when casting a real
  operand to a complex operand. (By contrast, in these situations castm
  sets the imaginary components of the destination matrix to zero.)
- Defined bli_obj_imag_is_zero() and substituted calls in lieu of all
  usages of bli_obj_imag_equals() that tested against BLIS_ZERO, and
  also simplified the implementation of bli_obj_imag_equals().
- Fixed bad behavior from bli_obj_is_real() and bli_obj_is_complex()
  when given BLIS_CONSTANT objects.
- Disabled dt_on_output field in auxinfo_t structure as well as all
  accessor functions. Also commented out all usage of accessor
  functions within macrokernels. (Typecasting in the microkernel is
  still feasible, though probably unrealistic for now given the
  additional complexity required.)
- Use void function pointer type (instead of void*) for storing function
  pointers in bli_l0_fpa.c.
- Added documentation for using gemm with mixed datatypes in
  docs/MixedDatatypes.md and example code in examples/oapi/11gemm_md.c.
- Defined level-1d operation xpbyd and level-1m operation xpbym.
- Added xpbym test module to testsuite.
- Updated frame/include/bli_x86_asm_macros.h with additional macros
  (courtesy of Devin Matthews).
2018-10-15 16:37:39 -05:00
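The distinction drawn in 5fec95b99f between castm and castnzm, for the case of casting a real operand into a complex one, can be illustrated with the sketch below. It operates on plain vectors rather than matrices, and example_complex_t, example_cast(), and example_cast_nz() are hypothetical names rather than BLIS types or functions.

```c
#include <stddef.h>

/* Hypothetical complex type used only for this illustration. */
typedef struct
{
    double real;
    double imag;
} example_complex_t;

/* castm-like behavior: the real part is copied and the imaginary part of
   the destination is set to zero. */
void example_cast( size_t n, const double* x, example_complex_t* y )
{
    for ( size_t i = 0; i < n; ++i )
    {
        y[ i ].real = x[ i ];
        y[ i ].imag = 0.0;
    }
}

/* castnzm-like behavior: the real part is copied and the imaginary part
   of the destination is left untouched. */
void example_cast_nz( size_t n, const double* x, example_complex_t* y )
{
    for ( size_t i = 0; i < n; ++i )
    {
        y[ i ].real = x[ i ];
    }
}
```

The second form is useful when the destination already holds imaginary values that must survive the update of its real part.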
Field G. Van Zee
3612ecac98 Added comments to nested OpenMP handling code.
Details:
- Added comments to bli_thrcomm_openmp.c relating to changes made in
  6ac0c80 and 1064d79.
2018-10-11 15:16:41 -05:00
Field G. Van Zee
667d3929ee Added Fortran APIs for some thread functions.
Details:
- Defined Fortran-77 compatible APIs for bli_thread_set_num_threads()
  and bli_thread_set_ways(). These wrappers are defined in
  frame/compat/blis/thread/b77_thread.c. Thanks to Kay Dewhurst for
  suggesting these new interfaces.
- Added missing prototype for bli_thread_set_ways() in bli_thread.h and
  removed prototypes for non-existent functions bli_thread_set_*_nt().
- CREDITS file update.
2018-10-11 11:47:57 -05:00
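A Fortran-77 compatible wrapper of the kind described in 667d3929ee typically takes its arguments by reference and uses a lowercase name with a trailing underscore, which is a common (but compiler-dependent) Fortran name-mangling convention. The sketch below uses hypothetical example_*() names and a simple global variable; it is not the contents of b77_thread.c.

```c
/* Hypothetical library state and C-level setter/getter. */
static int example_num_threads = 1;

void example_set_num_threads( int nt )
{
    /* Ignore non-positive requests in this simplified illustration. */
    if ( nt > 0 ) example_num_threads = nt;
}

int example_get_num_threads( void )
{
    return example_num_threads;
}

/* Fortran-callable wrapper: Fortran passes arguments by reference, so the
   wrapper dereferences the pointer and forwards to the C function. */
void example_set_num_threads_( const int* nt )
{
    example_set_num_threads( *nt );
}
```

Under that naming convention, a Fortran caller would write something like "call example_set_num_threads( 4 )" and the linker would resolve it to the underscore-suffixed symbol.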
Devin Matthews
1064d79711 Adjust rntm_t struct as well. 2018-10-11 11:14:25 -05:00
Devin Matthews
6ac0c80560 Fix OMP nesting problem.
Detect when OpenMP uses fewer threads than requested and correct accordingly, so that we don't wait forever for nonexistent threads. Fixes #267.
2018-10-11 10:45:07 -05:00
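The detection step described in 6ac0c80560 can be sketched with standard OpenMP calls: request a team size, then ask the runtime how many threads were actually granted. The function example_actual_num_threads() is hypothetical and is not the BLIS fix itself; without OpenMP support at compile time it simply reports one thread.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns the number of threads the runtime actually provides when
   'requested' threads are asked for. Inside a nested parallel region
   (with nesting disabled) this may be fewer than requested. */
int example_actual_num_threads( int requested )
{
    int actual = 1;

#ifdef _OPENMP
    #pragma omp parallel num_threads( requested )
    {
        /* Exactly one thread of the team records the team size. */
        #pragma omp single
        actual = omp_get_num_threads();
    }
#else
    (void)requested;
#endif

    return actual;
}
```

A caller would size its barriers and work partitioning from the returned value rather than from the requested value, so that no thread waits on teammates that were never created.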
Field G. Van Zee
53a9ab1c85 Renamed thread auto-factorization macro constants.
Details:
- Renamed the following C preprocessor macros whose fallback/default
  values are specified within frame/include/bli_kernel_macro_defs.h:

    BLIS_DEFAULT_MR_THREAD_MAX  -> BLIS_THREAD_MAX_IR
    BLIS_DEFAULT_NR_THREAD_MAX  -> BLIS_THREAD_MAX_JR
    BLIS_DEFAULT_M_THREAD_RATIO -> BLIS_THREAD_RATIO_M
    BLIS_DEFAULT_N_THREAD_RATIO -> BLIS_THREAD_RATIO_N

- Renamed the above cpp macro overrides within the knl, skx, and zen
  sub-configurations, as well as invocations of those macros in
  bli_rntm.c.
- Moved config/zen/bli_kernel.h to an 'old' directory as it is no longer
  used by any code within BLIS.
2018-10-10 15:11:09 -05:00
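The fallback/default mechanism referred to in 53a9ab1c85 is the usual C preprocessor pattern: a sub-configuration may define a macro first, and the shared header supplies a value only if none was given. The macro names below are the renamed ones from the commit, but the numeric values are placeholders for illustration and are not necessarily the BLIS defaults.

```c
/* A sub-configuration header could override a value before this point,
   for example: #define BLIS_THREAD_MAX_JR 8  (placeholder value). */

#ifndef BLIS_THREAD_RATIO_M
#define BLIS_THREAD_RATIO_M 2 /* placeholder */
#endif

#ifndef BLIS_THREAD_RATIO_N
#define BLIS_THREAD_RATIO_N 1 /* placeholder */
#endif

#ifndef BLIS_THREAD_MAX_IR
#define BLIS_THREAD_MAX_IR 1 /* placeholder */
#endif

#ifndef BLIS_THREAD_MAX_JR
#define BLIS_THREAD_MAX_JR 4 /* placeholder */
#endif
```

Because each definition is guarded by #ifndef, renaming a macro in the shared header without renaming it in the overriding headers would silently drop the override, which is why the commit renames both sides together.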
Field G. Van Zee
637c2ce794 Updated column index range for irun.py -q.
Details:
- Forgot to apply the column index range fix in 10f179f to situations
  when "quiet" mode (-q) is requested. This commit applies the new
  column index range modifications to the quiet case.
2018-10-09 17:18:04 -05:00
Field G. Van Zee
e2a59400bd Allow trsm_l parallelism in the jc loop.
Details:
- Previously, trsm was consolidating all ways of parallelism into the jr
  loop. This was unnecessary and to some degree detrimental on some
  types of hardware. Now, any parallelism bound for the jc loop will be
  applied to the jc loop, while all other loops' parallelism is funneled
  to the jr loop. Thanks to Devangi Parikh for helping investigate this
  issue and suggesting the fix.
- NOTE: This change affects only left-side trsm. However, right-side
  trsm is currently implemented in terms of the left-side
  case, and thus the change effectively applies to both left and right
  cases.
2018-10-09 15:29:48 -05:00
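The redistribution described in e2a59400bd, keeping jc parallelism in the jc loop and funneling every other loop's parallelism into jr, can be sketched as below. The example_ways_t structure and example_trsm_ways() function are hypothetical; which loops the real code folds into jr is an assumption based on the commit text.

```c
/* Hypothetical record of the ways of parallelism requested per loop. */
typedef struct
{
    int jc;
    int pc;
    int ic;
    int jr;
    int ir;
} example_ways_t;

/* Keep the jc ways as requested; multiply all remaining ways together
   and assign the product to the jr loop. */
example_ways_t example_trsm_ways( example_ways_t w )
{
    example_ways_t out;

    out.jc = w.jc;
    out.pc = 1;
    out.ic = 1;
    out.jr = w.pc * w.ic * w.jr * w.ir;
    out.ir = 1;

    return out;
}
```

The total number of threads (the product of all ways) is unchanged; only its distribution across loops differs.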
Field G. Van Zee
f1dba506c9 Output threading status/params from testsuite.
Details:
- Updated testsuite to output various parameters related to parallelism
  in BLIS. These parameters include:
  - threading status: disabled, openmp, or pthreads;
  - thread partitioning for jr/ir loops: slab or rr (round-robin);
  - ways of parallelism from environment variables, and also actual
    values used by gemm, herk, trmm_l, trmm_r, trsm_l, and trsm_r for
    square problems (assuming all dimensions are set to 1000);
  - automatic thread factorization parameters.
- Also output the status of two relatively new configure-time options:
  libmemkind and the sandbox.
2018-10-08 17:59:41 -05:00
Field G. Van Zee
10f179fb13 Updated irun.py to use updated column index range.
Details:
- Updated the irun.py script so that it updates the matlab column index
  range (if found) to reflect the additional columns of data that are
  substituted in. Thanks to Devangi Parikh for recognizing and reporting
  this issue.
2018-10-08 14:36:38 -05:00
Field G. Van Zee
c244a716c9 Added missing -r option to configure --help output.
Details:
- Added the inadvertently omitted mention of the -r option (equivalent
  to --thread-part-jrir) to the output of 'configure --help'. Also made
  minor edits to the same text.
2018-10-07 20:59:40 -05:00
Field G. Van Zee
c92762ecdc Added option of slab or rr partitioning in jr/ir.
Details:
- Updated existing macrokernel function names and definitions to
  explicitly use slab assignment of micropanels to threads, then created
  duplicate versions of macrokernels that explicitly use round-robin
  assignment instead of slab. NOTE: As in ac18949, trsm_r macrokernels
  were not substantially updated in this commit because they are
  currently disabled in bli_trsm_front.c.
- Updated existing packing function (in bli_packm_blk_var1.c) to
  explicitly use slab partitioning, and then duplicated for round-robin.
- Updated control tree initialization to use the appropriate macrokernel
  and packm function pointers depending on which method (slab or rr) was
  enabled at configure-time.
- Updated configure script to accept new --thread-part-jrir=[slab|rr]
  option (-m [slab|rr] for short), which allows the user to explicitly
  request either slab or round-robin assignment (partitioning) of
  micropanels to threads.
- Updated sandbox/ref99 according to above changes.
- Minor updates to build/add-copyright.py.
2018-10-07 20:30:32 -05:00
Field G. Van Zee
98e01ea04b Merge branch 'master' into amd 2018-10-04 20:44:12 -05:00
Field G. Van Zee
541b8a3b3e Removed 1h short-circuit from bli_clock_min_diff().
Details:
- Removed a guard from bli_clock_min_diff() that would return 0 if the
  time delta was greater than 60 minutes. This was originally intended
  to disregard extremely large values under the assumption that the
  user probably didn't intend to run a test that long. However, since
  it is in bli_clock_min_diff(), it doesn't actually help short-circuit
  an implementation that is hanging or looping infinitely, since such
  an implementation would first have to finish before the
  bli_clock_min_diff() is called. Thanks to Kiran Varaganti for
  reporting this issue.
2018-10-04 20:39:06 -05:00
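The reasoning in 541b8a3b3e is easier to see with a sketch of what a "minimum time so far" helper does. The function example_min_diff() below is hypothetical and simplified; it is not the BLIS implementation of bli_clock_min_diff(). The removed guard is shown only as a comment, based on the description in the commit message.

```c
/* Given the best (smallest) time observed so far and the start/finish
   timestamps of the latest trial, in seconds, return the new best. */
double example_min_diff( double time_min_prev,
                         double time_start, double time_finish )
{
    double time_diff = time_finish - time_start;

    /* A guard of roughly this form is what the commit removes:
           if ( time_diff > 60.0 * 60.0 ) return 0.0;
       It could not interrupt a hanging trial, because this function only
       runs after the trial has already finished. */

    return ( time_diff < time_min_prev ? time_diff : time_min_prev );
}
```

A test driver would typically call such a helper once per repetition, after the timed operation returns, which is why the guard had no short-circuiting effect.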
Devangi N. Parikh
8bf30eb473 Fixed runme.sh in test/studies/thunderx2
Details:
- Fixed the setting of threads for a single core run.
2018-10-03 22:22:29 -04:00
Devangi N. Parikh
f6f2456ba2 Fixed the Makefile in test/studies/thunderx2
Details:
- Fixed target for make-all-st and make-all-mt so that the armpl
  targets are built
2018-10-03 21:43:46 -04:00
Field G. Van Zee
743a1a6dec Fixed misleading version query from gcc 7+.
Details:
- gcc 7 introduced new behavior to the -dumpversion option whereby only
  the major version component is output. However, as part of this
  change, gcc 7 also introduced a new option, -dumpfullversion, which is
  guaranteed to always output the major, minor, and revision numbers. If
  we are using gcc 7 or later, we re-query the version string with this
  new option and then re-parse the result so as to avoid misleading
  output from configure (e.g. using gcc 7.3.0 is reported as 7.7.7).
2018-10-03 14:40:10 -05:00
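The version-string problem addressed in 743a1a6dec comes down to how many dot-separated fields the compiler reports. BLIS's configure is a shell script, so the C sketch below is only an illustration of the parsing idea; example_parse_version() is a hypothetical helper.

```c
#include <stdio.h>

/* Parse a "major.minor.revision" string. Fields that are absent are left
   at 0. Returns the number of fields actually found (0 to 3), so that a
   caller can tell a full version string from a truncated one. */
int example_parse_version( const char* s,
                           int* major, int* minor, int* revision )
{
    *major = 0;
    *minor = 0;
    *revision = 0;

    int n = sscanf( s, "%d.%d.%d", major, minor, revision );

    /* sscanf() returns EOF (a negative value) on empty input. */
    return ( n < 0 ? 0 : n );
}
```

A string with all three fields, as -dumpfullversion is described as producing, yields a count of 3. A bare major number, as gcc 7 and later print for -dumpversion, yields a count of 1, which signals that the version should be queried again with the fuller option.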