1846 Commits

Author SHA1 Message Date
Field G. Van Zee
68b88aca66 Version file update (0.7.0) 0.7.0 2020-04-07 14:41:44 -05:00
Field G. Van Zee
b04de636c1 ReleaseNotes.md update in advance of next version.
Details:
- Updated docs/ReleaseNotes.md in preparation for next version.
2020-04-07 14:37:43 -05:00
Field G. Van Zee
2cb604ba47 Rename more bli_thread_obarrier(), _obroadcast().
Details:
- Renamed instances of bli_thread_obarrier() and bli_thread_obroadcast()
  that appeared in the supmt-specific code committed to the 'amd'
  branch, which has now been merged with 'master'. Prior to the merge,
  'master' received commit c01d249, which applied these renamings to
  the existing, non-sup codebase.
2020-04-06 16:42:14 -05:00
Field G. Van Zee
efb12bc895 Minor updates/elaborations to RELEASING file. 2020-04-06 15:01:53 -05:00
Field G. Van Zee
2e3b3782cf Merge branch 'master' into amd 2020-04-06 14:55:35 -05:00
Satish Balay
da0c086f46 OSX: specify the full path to the location of libblis.dylib (#390)
* OSX: specify the full path to the location of libblis.dylib so that it can be found at runtime

Before this change:

Application gives runtime error [when linked with blis]
dyld: Library not loaded: libblis.3.dylib

balay@kpro lib % otool -L libblis.dylib
libblis.dylib:
        libblis.3.dylib (compatibility version 0.0.0, current version 0.0.0)
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1281.0.0)

After this change:
balay@kpro lib % otool -L libblis.dylib
libblis.dylib:
	/Users/balay/petsc/arch-darwin-c-debug/lib/libblis.3.dylib (compatibility version 0.0.0, current version 0.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1281.0.0)

* Use libdir instead of INSTALL_LIBDIR, since INSTALL_LIBDIR includes DESTDIR

Co-Authored-By: Jed Brown <jed@jedbrown.org>

* CREDITS file update.

Co-authored-by: Jed Brown <jed@jedbrown.org>
Co-authored-by: Field G. Van Zee <field@cs.utexas.edu>
2020-03-31 17:09:41 -05:00
Field G. Van Zee
2bca03ea9d Updates, tweaks to runme.sh in test/1m4m.
Details:
- Made several updates to test/1m4m/runme.sh, including:
  - Added missing handling for 1m and 4m1a implementations when setting
    the BLIS_??_NT environment variables.
  - Added support for using numactl to run the test executables.
  - Several other cleanups.
2020-03-28 22:10:00 +00:00
Field G. Van Zee
c40a33190b Warn user when auto-detection returns 'generic'.
Details:
- Added logic to configure that causes the script to output a warning
  to the user if/when "./configure auto" is run and the underlying
  hardware feature detection code is unable to identify the hardware.
  In these cases, the auto-detect code will return 'generic', which
  is likely not what the user expected, and a flag will be set so that
  a message is printed at the end of the configure output. (Thankfully,
  we don't expect this scenario to play out very often.) Thanks to
  Devin Matthews for suggesting this fix (#384).
2020-03-26 16:55:00 -05:00
Devin Matthews
492a736fab Fix vectorized version of bli_amaxv (#382)
* Fix vectorized version of bli_amaxv

To match Netlib, i?amax should return:
- the lowest index among equal values
- the first NaN if one is encountered
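A minimal scalar sketch of those two rules, for illustration only (amax_index is a hypothetical name, not the BLIS kernel). It returns a 0-based index, so a Fortran-style i?amax result would be this value plus one, and it assumes n >= 1.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical reference sketch (not the BLIS kernel): return the 0-based
   index of the first element with the largest absolute value, or the index
   of the first NaN encountered. Assumes n >= 1. */
size_t amax_index(size_t n, const double *x)
{
    size_t imax = 0;
    double amax = -1.0; /* below any |x[i]|, so element 0 is always taken */

    for (size_t i = 0; i < n; ++i) {
        double a = fabs(x[i]);
        if (isnan(a))
            return i;   /* first NaN wins immediately */
        if (a > amax) { /* strict '>' keeps the lowest index on ties */
            amax = a;
            imax = i;
        }
    }
    return imax;
}
```

A vectorized kernel has to reproduce exactly these tie-breaking and NaN rules, even though it compares several elements at once.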

* Fix typos.

* And another one...

* Update ref. amaxv kernel too.

* Re-enabled optimized amaxv kernels.

Details:
- Re-enabled the optimized, intrinsics-based amaxv kernels in the 'zen'
  kernel set for use in haswell, zen, zen2, knl, and skx subconfigs.
  These two kernels (for s and d datatypes) were temporarily disabled in
  e186d71 as part of issue #380. However, the key missing semantic
  properties that prompted the disabling of these kernels--returning the
  index of the *first* rather than of the last element with largest
  absolute value, and returning the index of the first NaN if one is
  encountered--were added as part of #382 thanks to Devin Matthews.
  Thus, now that the kernels are working as expected once more, this
  commit causes these kernels to once again be registered for the
  affected subconfigs, which effectively reverts all code changes
  included in e186d71.
- Whitespace/formatting updates to new macros in bli_amaxv_zen_int.c.

Co-authored-by: Field G. Van Zee <field@cs.utexas.edu>
2020-03-24 17:28:47 -05:00
Field G. Van Zee
e186d7141a Disabled optimized amaxv kernels.
Details:
- Disabled use of optimized amaxv kernels, which use vector intrinsics
  for both 's' and 'd' datatypes. We disable these kernels because the
  current implementations fail to observe a semantic property of the
  BLAS i?amax_() subroutine, which is to return the index of the
  *first* element containing the maximum absolute value (that is, the
  first such element when two or more elements share the maximum
  absolute value). With the optimized kernels disabled, the affected
  subconfigurations (haswell, zen, zen2, knl, and skx) will use the
  default reference implementations. Thanks to Mat Cross for reporting
  this issue via #380.
- CREDITS file update.
2020-03-21 18:40:36 -05:00
Field G. Van Zee
9f3a8d4d85 Added missing return to bli_thread_partition_2x2().
Details:
- Added a missing return statement to the body of an early case-handling
  branch in bli_thread_partition_2x2(). This bug only affected cases
  where n_threads < 4, and even then, the code meant to handle cases
  where n_threads >= 4 executes and does the right thing, albeit using
  more CPU cycles than needed. Nonetheless, thanks to Kiran Varaganti
  for reporting this bug via issue #377.
- Whitespace changes to bli_thread.c (spaces -> tabs).
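The control-flow issue can be illustrated with a simplified factorization routine. partition_2x2_sketch and its divisor heuristic are hypothetical and are not the BLIS implementation; the point is only that an early branch without its return falls through into the general loop, which recomputes the factors and does extra work (and, in this sketch, might pick a different, though still valid, split).

```c
/* Hypothetical sketch (not the BLIS routine): factor nt threads as
   nt1 * nt2 == nt, where nt1 parallelizes dimension m and nt2 dimension n. */
void partition_2x2_sketch(int nt, long m, long n, int *nt1, int *nt2)
{
    if (nt < 4) {
        /* Too few threads to split both ways usefully: give them all to
           the larger dimension. Without this return, control would fall
           through into the general loop below. */
        if (m >= n) { *nt1 = nt; *nt2 = 1; }
        else        { *nt1 = 1;  *nt2 = nt; }
        return;
    }

    /* General case: choose the divisor f of nt that makes the per-thread
       extents m/f and n/(nt/f) as close to each other as possible. */
    int best = 1;
    double best_diff = -1.0;
    for (int f = 1; f <= nt; ++f) {
        if (nt % f != 0)
            continue;
        double d = (double)m / f - (double)n / (nt / f);
        if (d < 0.0)
            d = -d;
        if (best_diff < 0.0 || d < best_diff) {
            best_diff = d;
            best = f;
        }
    }
    *nt1 = best;
    *nt2 = nt / best;
}
```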
2020-03-14 17:48:43 -05:00
Field G. Van Zee
8c3d9b9eeb Merge branch 'amd' of github.com:flame/blis into amd 2020-03-10 14:03:33 -05:00
Field G. Van Zee
71249fe8dd Merged test/sup, test/supmt into test/sup.
Details:
- Updated the Makefile, test_gemm.c, and runme.sh in test/sup to be able
  to compile and run both single-threaded and multithreaded experiments.
  This should help with maintenance going forward.
- Created a test/sup/octave_st directory of scripts (based on the
  previous test/sup/octave scripts) as well as a test/sup/octave_mt
  directory (based on the previous test/supmt/octave scripts). The
  octave scripts are slightly different and not easily mergeable, and
  thus for now I'll maintain them separately.
- Preserved the previous test/sup directory as test/sup/old/supst and
  the previous test/supmt directory as test/sup/old/supmt.
2020-03-10 13:55:29 -05:00
Field G. Van Zee
0f9e0399e1 Updated sup performance graphs; added mt results.
Details:
- Reran all existing single-threaded performance experiments comparing
  BLIS sup to other implementations (including the conventional code
  path within BLIS), using the latest versions (where appropriate).
- Added multithreaded results for the three existing hardware types
  showcased in docs/PerformanceSmall.md: Kaby Lake, Haswell, and Epyc
  (Zen1).
- Various minor updates to the text in docs/PerformanceSmall.md.
- Updates to the octave scripts in test/sup/octave, test/supmt/octave.
2020-03-05 17:03:21 -06:00
Field G. Van Zee
90db88e572 Updated sup[mt] Makefiles for variable dim ranges.
Details:
- Updated test/sup/Makefile and test/supmt/Makefile to allow specifying
  different problem size ranges for the drivers where one, two, or three
  matrix dimensions are large. This will facilitate the generation of
  more meaningful graphs, particularly when two dimensions are tiny.
2020-03-02 15:06:48 -06:00
Field G. Van Zee
31f11a06ea Updates to octave scripts in test/sup[mt]/octave.
Details:
- Optimized scripts in test/sup/octave and test/supmt/octave for use
  with octave 5.2.0 on Ubuntu 18.04.
- Fixed stray 'end' keywords in gen_opsupnames.m and plot_l3sup_perf.m,
  which were not only unnecessary but also causing issues with versions
  5.x.
2020-02-27 14:33:20 -06:00
Field G. Van Zee
c01d249d7c Renamed bli_thread_obarrier(), _obroadcast().
Details:
- Renamed two bli_thread_*() APIs:
    bli_thread_obarrier()   -> bli_thread_barrier()
    bli_thread_obroadcast() -> bli_thread_broadcast()
  The 'o' was a leftover from when thrcomm_t objects tracked both
  "inner" and "outer" communicators. They have long since been
  simplified to only support the latter, and thus the 'o' is
  superfluous.
2020-02-25 14:50:53 -06:00
Field G. Van Zee
f6e6bf73e6 List Gentoo under supported external packages.
Details:
- Add mention of Gentoo Linux under the list of external packages in
  the README.md file. Thanks to M. Zhou for maintaining this package.
2020-02-24 17:52:23 -06:00
Field G. Van Zee
9e5f7296cc Skip building thrinfo_t tree when mt is disabled.
Details:
- Return early from bli_thrinfo_sup_grow() if the thrinfo_t object
  address is equal to either &BLIS_GEMM_SINGLE_THREADED or
  &BLIS_PACKM_SINGLE_THREADED.
- Added preprocessor logic to bli_l3_sup_thread_decorator() in
  bli_l3_sup_decor_single.c that (by default) disables code that
  creates and frees the thrinfo_t tree and instead passes
  &BLIS_GEMM_SINGLE_THREADED as the thrinfo_t pointer into the
  sup implementation.
- The net effect of the above changes is that a small amount of
  thrinfo_t overhead is avoided when running small/skinny dgemm
  problems when BLIS is compiled with multithreading disabled.
2020-02-18 15:16:03 -06:00
Field G. Van Zee
90081e6a64 Fixed bug(s) in mt sup when single-threaded.
Details:
- Fixed a syntax bug in bli_l3_sup_decor_single.c that resulted from
  changing the function interface for the thread entry point function
  (of type l3supint_t).
- Unfortunately, fixing the interface alone was not enough, as it led to
  a memory leak in the sba at bli_finalize() time. It turns out that,
  due to the new multithreading-capable variant code using thrinfo_t
  objects--specifically, their calling of bli_thrinfo_grow()--we
  have to pass in a real thrinfo_t object rather than the global
  objects &BLIS_PACKM_SINGLE_THREADED or &BLIS_GEMM_SINGLE_THREADED.
  Thus, I inserted the appropriate logic from the OpenMP and pthreads
  versions so that single-threaded execution would work as intended
  with the newly upgraded variants.
2020-02-17 14:57:25 -06:00
Field G. Van Zee
c0558fde45 Support multithreading within the sup framework.
Details:
- Added multithreading support to the sup framework (via either OpenMP
  or pthreads). Both variants 1n and 2m now have the appropriate
  threading infrastructure, including data partitioning logic, to
  parallelize computation. This support handles all four combinations
  of packing on matrices A and B (neither, A only, B only, or both).
  This implementation tries to be a little smarter when automatic
  threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
  recalculate the factorization in units of micropanels (rather than
  using the raw dimensions) in bli_l3_sup_int.c, when the final
  problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
  or column-stored matrices. (This is used for the rrc and crc storage
  cases.) Previously, copym was used, but that would no longer suffice
  because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
  bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
  instead of from the variant functions. This has the effect of making
  the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
  and inserted usage of these functions within bli_thrinfo_init(), which
  previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
  whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
  tests, as well as appropriate octave/matlab scripts to plot the
  resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
  that specifying any BLIS_*_NT variable, even if it is set to 1, will
  be considered manual specification for the purposes of determining
  whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.
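The rule spelled out for docs/Multithreading.md (setting any BLIS_*_NT variable counts as manual specification, even when it is set to 1) can be sketched as a pure predicate. The function name and the choice to pass the variables' values in as strings, rather than calling getenv() directly, are assumptions made for illustration and testability.

```c
#include <stddef.h>

/* Hypothetical sketch: decide whether automatic thread factorization via
   BLIS_NUM_THREADS applies. nt_vars[] holds the values of the per-loop
   BLIS_*_NT variables (e.g. BLIS_JC_NT), with NULL meaning "unset". */
int auto_factor_requested(const char *num_threads,
                          const char *const nt_vars[], size_t n_vars)
{
    for (size_t i = 0; i < n_vars; ++i)
        if (nt_vars[i] != NULL)
            return 0; /* any BLIS_*_NT that is set, even to "1", means manual */

    return num_threads != NULL; /* otherwise auto only if BLIS_NUM_THREADS is set */
}
```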
2020-02-17 14:08:08 -06:00
Field G. Van Zee
d7a7679182 Fixed int-to-packbuf_t conversion error (C++ only).
Details:
- Fixed an error that manifests only when using C++ (specifically,
  modern versions of g++) to compile drivers in 'test' (and likely most
  other application code that #includes blis.h). Thanks to Ajay Panyala
  for reporting this issue (#374).
2020-02-07 17:37:03 -06:00
Field G. Van Zee
d626112b8d Removed sorting on LDFLAGS in common.mk (#373).
Details:
- Removed a line of code in common.mk that passed LDFLAGS through the
  sort function. The purpose was not to sort the contents, but rather
  to remove duplicates. However, some valid strings of linker flags
  depend on the relative order of their elements, and sorting them
  yields different or broken behavior. So I've removed the line in
  common.mk that sorts LDFLAGS. Also, for future use, I've added a new
  function, rm-dupls, that removes duplicates without sorting. (This
  function was based on code from a Stack Overflow thread that is
  linked to in the comments for that code.) Thanks to Isuru Fernando
  for reporting this issue (#373).
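rm-dupls itself is a GNU make function. As an illustration in C only (rm_dupls here is a hypothetical name), the idea is to drop repeated words while preserving the original relative order, which make's sort does not do. This sketch keeps the first occurrence of each word; the make function's exact policy is not shown in this log.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: keep the first occurrence of each word, preserving
   the original order. Compacts words[] in place and returns the new count.
   O(n^2), which is fine for a short list of linker flags. */
size_t rm_dupls(const char **words, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; ++i) {
        int seen = 0;
        for (size_t j = 0; j < out; ++j) {
            if (strcmp(words[j], words[i]) == 0) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            words[out++] = words[i];
    }
    return out;
}
```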
2020-01-15 13:27:02 -06:00
Field G. Van Zee
e67deb22aa CHANGELOG update (0.6.1) 2020-01-14 16:01:34 -06:00
Field G. Van Zee
10949f528c Version file update (0.6.1) 0.6.1 2020-01-14 16:01:33 -06:00
Field G. Van Zee
5db8e710a2 ReleaseNotes.md update in advance of next version.
Details:
- Updated ReleaseNotes.md in preparation for next version.
2020-01-14 15:59:59 -06:00
Field G. Van Zee
cde4d9d7a2 Removed 'attic/windows' (to prevent confusion).
Details:
- Finally removed 'attic/windows' and its contents. This directory once
  contained "proto" Windows support for BLIS, but we've since moved on
  to providing Windows DLL support via AppVeyor's build artifacts
  (thanks to Isuru Fernando). Furthermore, since 'windows' was the only
  subdirectory within 'attic', the directory path would show up in
  GitHub's listing at https://github.com/flame/blis, which probably led
  to someone being confused about how BLIS provides Windows support. I
  assume (but don't know for sure) that nobody is using these files, so
  this is admittedly a case of shoot first and ask questions later.
2020-01-14 15:19:25 -06:00
Field G. Van Zee
7d3407d468 CREDITS file update. 2020-01-14 15:17:53 -06:00
Dave Love
f391b3e2e7 Fix parsing in vpu_count on workstation SKX (#351)
* Fix parsing in vpu_count on workstation SKX

* Document Skylake-X as Haswell for single FMA

* Update vpu_count for Skylake and Cascade Lake models

* Support printing the configuration selected, controlled by the environment

Intended particularly for diagnosing mis-selection of SKX through
unknown, or incorrect, number of VPUs.

* Move bli_log outside the cpp condition, and use it where intended

* Add Fixme comment (Skylake D)

* Mostly superficial edits to commits towards #351.

Details:
- Moved architecture/sub-config logging-related code from bli_cpuid.c
  to bli_arch.c, tweaked names, and added more set/get layering.
- Tweaked log messages output from bli_cpuid_is_skx() in bli_cpuid.c.
- Content, whitespace changes to new bullet in HardwareSupport.md that
  relates to single-VPU Skylake-Xs.

* Fix comment typos

Co-authored-by: Field G. Van Zee <field@cs.utexas.edu>
2020-01-06 14:15:48 -06:00
Field G. Van Zee
5ca1a3cfc1 Fixed 'configure' breakage introduced in 6433831.
Details:
- Added a missing 'fi' (endif) keyword to a conditional block added in
  the configure script in commit 6433831.
2020-01-06 12:29:12 -06:00
Field G. Van Zee
e7431b4a83 Updated 1m draft article link in README.md. 2020-01-06 12:01:41 -06:00
Jeff Hammond
6433831cc3 blacklist ICC 18 for knl/skx due to test failures
Signed-off-by: Jeff Hammond <jeff.r.hammond@intel.com>
2020-01-03 17:51:05 -08:00
Jeff Hammond
af3589f1f9 blacklist Intel 19+
Signed-off-by: Jeff Hammond <jeff.r.hammond@intel.com>
2020-01-03 17:51:05 -08:00
Jeff Hammond
60de939deb fix link to docs
the comment contains an incorrect link, which is trivially fixed here.

@fgvanzee I hope you don't mind that I committed directly to master but this cannot break anything.
2020-01-01 21:30:38 -08:00
Field G. Van Zee
5271107378 Fixed bugs in cblas_sdsdot(), sdsdot_().
Details:
- Fixed a bug in sdsdot_sub() that redundantly added the "alpha" scalar,
  named 'sb'. This value was already being added by the underlying
  sdsdot_() function. Thus, we no longer add 'sb' within sdsdot_sub().
  Thanks to Simon Lukas Märtens for reporting this bug via #367.
- Fixed a second bug in the order of typecasting intermediate products in
  sdsdot_(). Previously, the "alpha" scalar was being added after the
  "outer" typecast to float. However, the operation is supposed to first
  add the dot product to the (promoted) scalar and THEN downcast the sum
  to float. Thanks to Devin Matthews for catching this bug.
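The corrected order of operations can be sketched as follows (sdsdot_sketch is a hypothetical name, not the BLIS or BLAS symbol): the products are accumulated in double precision, the promoted scalar sb is added exactly once, and only the final sum is rounded to float.

```c
#include <stddef.h>

/* Hypothetical sketch of the sdsdot ordering described above: accumulate in
   double, add the promoted scalar sb once, and only then round to float. */
float sdsdot_sketch(size_t n, float sb, const float *x, const float *y)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += (double)x[i] * (double)y[i];
    return (float)((double)sb + acc); /* add first, downcast last */
}
```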
2019-12-16 16:30:26 -06:00
Field G. Van Zee
fe2560a4b1 Annotated missing thread-related symbols for export.
Details:
- Added BLIS_EXPORT_BLIS annotation to function prototypes for

    bli_thrcomm_bcast()
    bli_thrcomm_barrier()
    bli_thread_range_sub()

  so that these functions are exported to shared libraries by default.
  This (hopefully) fixes issue #366. Thanks to Kyungmin Lee for
  reporting this bug.
- CREDITS file update.
2019-12-06 17:12:44 -06:00
Field G. Van Zee
2853825234 Merge branch 'master' into amd 2019-12-06 16:06:46 -06:00
Nicholai Tukanov
61b1f0b060 Add prototypes for POWER9 reference kernels (#365)
Updates and fixes to power9 subconfig.

Details:
- Register s, c, and z reference gemm and trsm ukernels that assume elements
  of B have been broadcast.
- Added prototypes for level-3 ukernels that assume elements of B have
  been broadcast. Also added prototype for an spackm function that
  employs a duplication/broadcast factor of 4.
- Register virtual gemmtrsm ukernels that work with broadcasting of B.
- Disable right-side hemm, symm, trmm, and trmm3 in bli_family_power9.h.
- Thanks to Nicholai Tukanov for providing these updates.
2019-12-04 14:18:47 -06:00
Field G. Van Zee
efa61a6c8b Added missing bli_l3_sup_thread_decorator() symbol.
Details:
- Defined dummy versions of bli_l3_sup_thread_decorator() for OpenMP
  and pthreads so that those builds don't fail when performing shared
  library linking (especially for Windows DLLs via AppVeyor). For now,
  these dummy implementations of bli_l3_sup_thread_decorator() are
  merely carbon copies of the implementation provided for
  single-threaded execution (i.e., the one found in bli_l3_sup_decor_single.c).
  Thus, an OpenMP or pthreads build will be able to use the gemmsup
  code (including the new selective packing functionality), as it did
  before 39fa7136, even though it will not actually employ any
  multithreaded parallelism.
2019-11-29 16:17:04 -06:00
Field G. Van Zee
39fa7136f4 Added support for selective packing to gemmsup.
Details:
- Implemented optional packing for A or B (or both) within the sup
  framework (which currently only supports gemm). The request for
  packing either matrix A or matrix B can be made via setting
  environment variables BLIS_PACK_A or BLIS_PACK_B (to any
  non-zero value; if set, zero means "disable packing"). It can also
  be made globally at runtime via bli_pack_set_pack_a() and
  bli_pack_set_pack_b() or with individual rntm_t objects via
  bli_rntm_set_pack_a() and bli_rntm_set_pack_b() if using the expert
  interface of either the BLIS typed or object APIs. (If using the
  BLAS API, environment variables are the only way to communicate the
  packing request.)
- One caveat (for now) with the current implementation of selective
  packing is that any blocksize extension registered in the _cntx_init
  function (such as is currently used by haswell and zen subconfigs)
  will be ignored if the affected matrix is packed. The reason is
  simply that I didn't get around to implementing the necessary logic
  to pack a larger edge-case micropanel, though this is entirely
  possible and should be done in the future.
- Spun off the variant-choosing portion of bli_gemmsup_ref() into
  bli_gemmsup_int(), in bli_l3_sup_int.c.
- Added new files, bli_l3_sup_packm_a.c, bli_l3_sup_packm_b.c, along
  with corresponding headers, in which higher-level packm-related
  functions are defined for use within the sup framework. The actual
  packm variant code resides in bli_l3_sup_packm_var.c.
- Pass the following new parameters into var1n and var2m: packa, packb
  bool_t's, pointer to a rntm_t, pointer to a cntl_t (which is for now
  always NULL), and pointer to a thrinfo_t* (which for now is the address
  of the global single-threaded packm thread control node).
- Added panel strides ps_a and ps_b to the auxinfo_t structure so that
  the millikernel can query the panel stride of the packed matrix and
  step through it accordingly. If the matrix isn't packed, the panel
  stride of interest for the given millikernel will be set to the
  appropriate value so that the mkernel may step through the unpacked
  matrix as it normally would.
- Modified the rv_6x8m and rv_6x8n millikernels to read the appropriate
  panel strides (ps_a and ps_b, respectively) instead of computing them
  on the fly.
- Spun off the environment variable getting and setting functions into
  a new file, bli_env.c (with a corresponding prototype header). These
  functions are now used by the threading infrastructure (e.g.
  BLIS_NUM_THREADS, BLIS_JC_NT, etc.) as well as the selective packing
  infrastructure (e.g. BLIS_PACK_A, BLIS_PACK_B).
- Added a static initializer for mem_t objects, BLIS_MEM_INITIALIZER.
- Added a static initializer for pblk_t objects, BLIS_PBLK_INITIALIZER,
  for use within the definition of BLIS_MEM_INITIALIZER.
- Moved the global_rntm object to bli_rntm.c and extern it where needed.
  This means that the function bli_thread_init_rntm() was renamed to
  bli_rntm_init_from_global() and relocated accordingly.
- Added a new bli_pack.c file, which serves as the home for
  functions that manage the pack_a and pack_b fields of the global
  rntm_t, including from environment variables, just as we have
  functions to manage the threading fields of the global rntm_t in
  bli_thread.c.
- Reorganized naming for files in frame/thread, which mostly involved
  spinning off the bli_l3_thread_decorator() functions into their own
  files. This change makes more sense when considering the further
  addition of bli_l3_sup_thread_decorator() functions (for now limited
  only to the single-threaded form found in the _single.c file).
- Explicitly initialize the reference sup handlers in both
  bli_cntx_init_haswell.c and bli_cntx_init_zen.c so that it's more
  obvious how to customize to a different handler, if desired.
- Removed various snippets of disabled code.
- Various comment updates.
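Per the first bullet, BLIS_PACK_A and BLIS_PACK_B act as on/off switches: unset means no packing request, zero disables, and any other value enables. A minimal sketch of that convention, assuming the value is parsed as an integer; the helper names are hypothetical, and the real parsing lives in bli_env.c and may differ in details.

```c
#include <stdlib.h>

/* Hypothetical helper: interpret an environment-variable value as an
   on/off packing request. Assumes integer parsing; NULL means "unset". */
int pack_flag_from_str(const char *s)
{
    if (s == NULL)
        return 0;                    /* unset: no packing requested */
    return strtol(s, NULL, 10) != 0; /* "0" disables; other integers enable */
}

/* Hypothetical wrappers that read the variables named in this commit. */
int pack_a_requested(void) { return pack_flag_from_str(getenv("BLIS_PACK_A")); }
int pack_b_requested(void) { return pack_flag_from_str(getenv("BLIS_PACK_B")); }
```

Callers using the BLAS API have no rntm_t to pass, which is why the environment is their only channel for this request.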
2019-11-29 15:27:07 -06:00
Field G. Van Zee
bbb21fd0a9 Tweaked SIAM/SC Best Prize language in README.md. 2019-11-21 18:15:16 -06:00
Field G. Van Zee
043366f92d Fixed typo in previous commit (SIAM/SC prize). 2019-11-21 18:13:51 -06:00
Field G. Van Zee
05a4d583e6 Added SIAM/SC prize to "What's New" in README.md. 2019-11-21 18:12:24 -06:00
Field G. Van Zee
881b05ecd4 Fixed blastest failure for 'generic' subconfig.
Details:
- Fixed a subtle and complicated bug that only manifested via the BLAS
  test drivers in the generic subconfiguration, and possibly any other
  subconfiguration that did not register complex-domain gemm ukernels,
  or registered ONLY real-domain ukernels as row-preferential. This is
  a long story, but it boils down to an exception to the "transpose the
  operation to bring storage of C into agreement with ukernel pref"
  optimization in bli_hemm_front.c and bli_symm_front.c sabotaging the
  proper functioning of the 1m method, but only when the imaginary
  component of beta is zero. See the comments in issue #342 for more
  details. Thanks to Dave Love for identifying the commit in which this
  bug was introduced, and other feedback related to this bug.
2019-11-21 16:34:27 -06:00
Field G. Van Zee
0c7165fb01 Fixed obscure bug in bli_acquire_mpart_[mn]dim().
Details:
- Fixed a bug in bli_acquire_mpart_mdim(), bli_acquire_mpart_ndim(),
  and bli_acquire_mpart_mndim() that allowed the use of a blocksize b
  that is too large given the current row/column index (i.e., the i/j
  argument) and the size of the dimension being partitioned (i.e., the
  m/n argument). This bug only affected backwards partitioning/motion
  through the dimension and was the result of a misplaced conditional
  check-and-redirect to the backwards code path. It should be noted
  that this bug was not discovered because it manifested within BLIS
  (the callers in BLIS always make sure to pass in the "correct"
  blocksize b); rather, it could have manifested if the functions were
  used by 3rd party callers. Thanks to Minh Quan Ho for reporting the
  bug via issue #363.
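The invariant at stake is that a partition must never extend past the unconsumed part of the dimension. Assuming i counts elements already consumed from the end where traversal starts, a pair of hypothetical helpers (not the BLIS code) shows the clamp being applied before the offset is translated for backward motion.

```c
/* Hypothetical sketch: size of the next partition of a dimension of length
   m, given that i elements have already been consumed and the nominal
   blocksize is b. Clamped so it never exceeds what remains. */
long part_size(long i, long m, long b)
{
    long remaining = m - i;
    return (b < remaining) ? b : remaining;
}

/* Backward motion: clamp first, then compute the starting offset measured
   from the front, so the partition ends where the unconsumed region ends. */
long part_offset_bwd(long i, long m, long b)
{
    return m - i - part_size(i, m, b);
}
```

Without the clamp, i = 8 with m = 10 and b = 4 would give an offset of 10 - 8 - 4 = -2, i.e. a partition starting before the beginning of the dimension.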
2019-11-14 16:48:14 -06:00
Field G. Van Zee
fb8bef9982 Fixed copy-paste bug in bli_spackm_6xk_bb4_ref().
Details:
- Fixed a copy-paste bug in the new bli_spackm_6xk_bb4_ref() that
  manifested as failures in single-precision real level-3 operations.
  Also replaced the duplication factor constants with a const-qualified
  variable, dfac, so that this won't happen again.
- Changed NC for single-precision real from 4080 to 8160 so that the
  packed matrix B will have the same byte footprint in both single
  and double real.
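The idea of a duplication factor can be illustrated with a hypothetical packing loop (pack_b_dup is not bli_spackm_6xk_bb4_ref(), and the row/column-stride layout is an assumption): every element of the source micropanel is written dfac consecutive times, with dfac declared once as a const-qualified variable rather than repeated as a literal.

```c
#include <stddef.h>

/* Hypothetical sketch: pack a k x nr micropanel of B, stored with row
   stride rs and column stride cs, so that each element appears dfac
   consecutive times in the packed buffer p (which must hold k*nr*dfac
   floats). */
void pack_b_dup(ptrdiff_t k, ptrdiff_t nr, const float *b,
                ptrdiff_t rs, ptrdiff_t cs, float *p)
{
    const ptrdiff_t dfac = 4; /* duplication factor, named once */

    for (ptrdiff_t l = 0; l < k; ++l)
        for (ptrdiff_t j = 0; j < nr; ++j)
            for (ptrdiff_t d = 0; d < dfac; ++d)
                *p++ = b[l * rs + j * cs];
}
```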
2019-11-14 13:05:28 -06:00
Field G. Van Zee
8f399c8940 Tweaked/added notes to docs/Multithreading.md.
Details:
- Added language to docs/Multithreading.md cautioning the reader about
  the nuances of setting multithreading parameters via the manual and
  automatic ways simultaneously, and also about how these parameters
  behave when multithreading is disabled at configure-time. These
  changes are an attempt to address the issues that arose in issue #362.
  Thanks to Jérémie du Boisberranger for his feedback on this topic.
- CREDITS file update.
2019-11-12 15:32:57 -06:00
Field G. Van Zee
bdc7ee3394 Various fixes to support packing duplication in B.
Details:
- Added cpp macros to trmm and trmm3 front-ends to optionally force
  those operations to be cast so the structured matrix is on the left.
  symm and hemm already had such macros, but these too were renamed so
  that the macros were individual to the operation. We now have four
  such macros:
    #define BLIS_DISABLE_HEMM_RIGHT
    #define BLIS_DISABLE_SYMM_RIGHT
    #define BLIS_DISABLE_TRMM_RIGHT
    #define BLIS_DISABLE_TRMM3_RIGHT
  Also, updated the comments in the symm and hemm front-ends related to
  the first two macro guards, and added corresponding comments to the
  trmm and trmm3 front-ends for the latter two guards. (They all
  functionally do the same thing, just for their specific operations.)
  Thanks to Jeff Hammond for reporting the bugs that led me to this
  change (via #359).
- Updated config/old/haswellbb subconfiguration (used to debug issues
  related to duplicating B during packing) to register: a packing
  kernel for single-precision real; gemmbb ukernels for s, c, and z;
  trsmbb ukernels for s, c, and z; gemmtrsmbb virtual ukernels for s, c,
  and z; and to use non-default cache and register blocksizes for s, c,
  and z datatypes. Also declared prototypes for all of the gemmbb,
  trsmbb, and gemmtrsmbb ukernel functions within the
  bli_cntx_init_haswellbb() function. This should, once applied to the
  power9 configuration, fix the remaining issues in #359.
- Defined bli_spackm_6xk_bb4_ref(), which packs single reals with a
  duplication factor of 4. This function is defined in the same file as
  bli_dpackm_6xk_bb2_ref() (bli_packm_cxk_bb_ref.c).
2019-11-11 15:47:17 -06:00
Field G. Van Zee
0eb79ca850 Avoid unused variable warning in lread.c (#356).
Details:
- Replaced the line

    f = f;

  with

    ( void )f;

  for the unused variable 'f' in blastest/f2c/lread.c. (Hopefully)
  addresses issue #356, but since we don't use xlc who knows. Thanks
  to Jeff Hammond for reporting this.
2019-11-08 14:48:48 -06:00
Jérôme Duval
f377bb4485 Add Haiku to the known OS list (#361) 2019-11-07 16:39:29 -06:00