Commit Graph

1605 Commits

Author SHA1 Message Date
Field G. Van Zee
4ed39c0971 Merge branch 'amd' of github.com:flame/blis into amd 2019-03-08 11:56:58 -06:00
Field G. Van Zee
b938c16b0c Renamed test/3m4m to test/3.
Details:
- Renamed '3m4m' directory to '3', which captures the directory nicely
  since it builds test drivers to test level-3 operations.
- These test drivers ceased to be used to test the 3m and 4m (or even
  1m) induced methods long ago, hence the name change.
2019-03-07 16:40:39 -06:00
Field G. Van Zee
ab89a40582 More minor updates and edits to test/3m4m.
Details:
- Further updates to matlab scripts, mostly for compatibility with
  GNU Octave.
- More tweaks to runme.sh.
- Updates to runme.m that allow copy-paste into matlab interactive
  session to generate graphs.
2019-03-07 16:26:12 -06:00
Field G. Van Zee
f0e70dfbf3 Very minor updates to test/3m4m for ul252.
Details:
- Very minor updates to the newly revamped test/3m4m drivers when used
  on a Xeon Platinum (SkylakeX).
2019-03-07 01:04:05 +00:00
Field G. Van Zee
9f1dbe572b Overhauled test/3m4m Makefile and scripts.
Details:
- Rewrote much of Makefile to generate executables for single- and dual-
  socket multithreading as well as single-threaded. Each of the three
  can also use a different problem size range/increment, as is often
  appropriate when doubling/halving the number of threads.
- Rewrote runme.sh script to flexibly execute as many threading
  parameter scenarios as is given in the input parameter string
  (currently set within the script itself). The string also encodes
  the maximum problem size for each threading scenario, which is used
  to identify the executable to run. Also improved the "progress" output
  of the script to reduce redundant info and improve readability in
  terminals that are not especially wide.
- Minor updates to test_*.c source files.
- Updated matlab scripts according to changes made to the Makefile,
  test drivers, and runme.sh script, and renamed 'plot_all.m' to
  'runme.m'.
2019-03-05 17:47:55 -06:00
Field G. Van Zee
3bdab823fa Merge branch 'master' into dev 2019-02-28 14:07:24 -06:00
Field G. Van Zee
e2a02ebd00 Updates (from ls5) to test/3m4m/runme.sh.
Details:
- Lonestar5-specific updates to runme.sh.
2019-02-28 13:58:59 -06:00
Isuru Fernando
f0dcc8944f Add symbol export macro for all functions (#302)
* initial export of blis functions

* Regenerate def file for master

* restore bli_extern_defs exporting for now
2019-02-27 17:27:23 -06:00
Field G. Van Zee
540ec1b479 Updated level-3 BLAS to call object API directly.
Details:
- Updated the BLAS compatibility layer for level-3 operations so that
  the corresponding BLIS object API is called directly rather than first
  calling the typed BLIS API. The previous code based on the typed BLIS
  API calls is still available in a deactivated cpp macro branch, which
  may be re-activated by #defining BLIS_BLAS3_CALLS_TAPI. (This does not
  yet correspond to a configure option. If it seems like people might
  want to toggle this behavior more regularly, a configure option can be
  added in the future.)
- Updated the BLIS typed API to statically "pre-initialize" objects via
  new initializor macros. Initialization is then finished via calls to
  static functions bli_obj_init_finish_1x1() and bli_obj_init_finish(),
  which are similar to the previously-called functions,
  bli_obj_create_1x1_with_attached_buffer() and
  bli_obj_create_with_attached_buffer(), respectively. (The BLAS
  compatibility layer updates mentioned above employ this new technique
  as well.)
- Transformed certain routines in bli_param_map.c--specifically, the
  ones that convert netlib-style parameters to BLIS equivalents--into
  static functions, now in bli_param_map.h. (The remaining three classes
  of conversation routines were left unchanged.)
- Added the aforementioned pre-initializor macros to bli_type_defs.h.
- Relocated bli_obj_init_const() and bli_obj_init_constdata() from
  bli_obj_macro_defs.h to bli_type_defs.h.
- Added a few macros to bli_param_macro_defs.h for testing domains for
  real/complexness and precisions for single/double-ness.
2019-02-24 19:09:10 -06:00
Field G. Van Zee
8e023bc914 Updates to 3m4m/matlab scripts.
Details:
- Minor updates to matlab graph-generating scripts.
- Added a plot_all.m script that is more of a scratchpad for copying and
  pasting function invocations into matlab to generate plots that are
  presently of interest to us.
2019-02-22 16:55:30 -06:00
Field G. Van Zee
70f12f209b Changed unsafe-loop to unsafe-math optimizations.
Details:
- Changed -funsafe-loop-optimizations (re-)introduced in 7690855 for
  make_defs.mk files' CRVECFLAGS to -funsafe-math-optimizations (to
  account for a miscommunication in issue #300). Thanks to Dave Love
  for this suggestion and Jeff Hammond for his feedback on the topic.
2019-02-20 16:10:10 -06:00
Field G. Van Zee
7690855c51 Restored -funsafe-loop-optimizations to subconfigs.
Details:
- Restored use of -funsafe-loop-optimizations in the definitions of
  CRVECFLAGS (when using gcc), but only for sub-configurations (and
  not configuration families such as amd64, intel64, and x86_64).
  This more or less reverts 5190d05 and 6cf1550.
2019-02-18 19:16:01 -06:00
Field G. Van Zee
44994d1490 Disable TBM, XOP, LWP instructions in AMD configs.
Details:
- Added -mno-tbm -mno-xop -mno-lwp to CKVECFLAGS in bulldozer,
  piledriver, steamroller, and excavator configurations to explicitly
  disable AMD's bulldozer-era TBM, XOP, and LWP instruction sets in an
  attempt to fix the invalid instruction error that has plagued Travis
  CI builds since 6a014a3. Thanks to Devin Matthews for pointing out
  that the offending instruction was part of TBM (issue #300).
- Restored -O3 to piledriver configuration's COPTFLAGS.
2019-02-18 18:35:30 -06:00
Field G. Van Zee
1e5b530744 Reverted piledriver COPTFLAGS from -O3 to -O2.
Details:
- Debugging continues; changing COPTFLAGS for piledriver subconfig from
  -O3 to -O2, its original value prior to 6a014a3.
2019-02-18 18:04:38 -06:00
Field G. Van Zee
6cf1550491 Removed -funsafe-loop-optimizations from all configs.
Details:
- Error persists. Removed -funsafe-loop-optimizations from all remaining
  sub-configurations.
2019-02-18 17:29:51 -06:00
Field G. Van Zee
5190d05a27 Removed -funsafe-loop-optimizations from piledriver.
Details:
- Error persists; continuing debugging from bf0fb78c by removing
  -funsafe-loop-optimizations from piledriver configuration.
2019-02-18 17:07:35 -06:00
Field G. Van Zee
bf0fb78c5e Removed -funsafe-loop-optimizations from families.
Details:
- Removed -funsafe-loop-optimizations from the configuration families
  affected by 6a014a3, specifically: intel64, amd64, and x86_64.
  This is part of an attempt to debug why the sde, as executed by
  Travis CI, is crashing via the following error:

    TID 0 SDE-ERROR: Executed instruction not valid for specified chip
    (ICELAKE): 0x9172a5: bextr_xop rax, rcx, 0x103
2019-02-18 16:51:38 -06:00
Field G. Van Zee
6a014a3377 Standardized optimization flags in make_defs.mk.
Details:
- Per Dave Love's recommendation in issue #300, this commit defines
    COPTFLAGS := -03
  and
    CRVECFLAGS := $(CKVECFLAGS) -funsafe-loop-optimizations
  in the make_defs.mk for all Intel- and AMD-based configurations.
2019-02-18 14:52:29 -06:00
Field G. Van Zee
565fa3853b Redirect trsm pc, ir parallelism to ic, jr loops.
Details:
- trsm parallelization was temporarily simplifed in 075143d to entirely
  ignore any parallelism specified via the pc or ir loops. Now, any
  parallelism specified to the pc loop will be redirected to the ic
  loop, and any parallelism specified to the ir loop will be redirected
  to the jr loop. (Note that because of inter-iteration dependencies,
  trsm cannot parallelize the ir loop. Parallelism via the pc loop is
  at least somewhat feasible in theory, but it would require tracking
  dependencies between blocks--something for which BLIS currently lacks
  the necessary supporting infrastructure.)
2019-02-18 11:43:58 -06:00
Field G. Van Zee
a023c643f2 Regenerated symbols in build/libblis-symbols.def.
Details:
- Reran ./build/regen-symbols.sh after running
  'configure --enable-cblas auto'
2019-02-14 20:18:55 -06:00
Field G. Van Zee
075143dfd9 Added support for IC loop parallelism to trsm.
Details:
- Parallelism within the IC loop (3rd loop around the microkernel) is
  now supported within the trsm operation. This is done via a new branch
  on each of the control and thread trees, which guide execution of a
  new trsm-only subproblem from within bli_trsm_blk_var1(). This trsm
  subproblem corresponds to the macrokernel computation on only the
  block of A that contains the diagonal (labeled as A11 in algorithms
  with FLAME-like partitioning), and the corresponding row panel of C.
  During the trsm subproblem, all threads within the JC communicator
  participate and parallelize along the JR loop, including any
  parallelism that was specified for the IC loop. (IR loop parallelism
  is not supported for trsm due to inter-iteration dependencies.) After
  this trsm subproblem is complete, a barrier synchronizes all
  participating threads and then they proceed to apply the prescribed
  BLIS_IC_NT (or equivalent) ways of parallelism (and any BLIS_JR_NT
  parallelism specified within) to the remaining gemm subproblem (the
  rank-k update that is performed using the newly updated row-panel of
  B). Thus, trsm now supports JC, IC, and JR loop parallelism.
- Modified bli_trsm_l_cntl_create() to create the new "prenode" branch
  of the trsm_l cntl_t tree. The trsm_r tree was left unchanged, for
  now, since it is not currently used. (All trsm problems are cast in
  terms of left-side trsm.)
- Updated bli_cntl_free_w_thrinfo() to be able to free the newly shaped
  trsm cntl_t trees. Fixed a potentially latent bug whereby a cntl_t
  subnode is only recursed upon if there existed a corresponding
  thrinfo_t node, which may not always exist (for problems too small
  to employ full parallelization due to the minimum granularity imposed
  by micropanels).
- Updated other functions in frame/base/bli_cntl.c, such as
  bli_cntl_copy() and bli_cntl_mark_family(), to recurse on sub-prenodes
  if they exist.
- Updated bli_thrinfo_free() to recurse into sub-nodes and prenodes
  when they exist, and added support for growing a prenode branch to
  bli_thrinfo_grow() via a corresponding set of help functions named
  with the _prenode() suffix.
- Added a bszid_t field thrinfo_t nodes. This field comes in handy when
  debugging the allocation/release of thrinfo_t nodes, as it helps trace
  the "identity" of each nodes as it is created/destroyed.
- Renamed
    bli_l3_thrinfo_print_paths() -> bli_l3_thrinfo_print_gemm_paths()
  and created a separate bli_l3_thrinfo_print_trsm_paths() function to
  print out the newly reconfigured thrinfo_t trees for the trsm
  operation.
- Trival changes to bli_gemm_blk_var?.c and bli_trsm_blk_var?.c
  regarding variable declarations.
- Removed subpart_t enum values BLIS_SUBPART1T, BLIS_SUBPART1B,
  BLIS_SUBPART1L, BLIS_SUBPART1R. Then added support for two new labels
  (semantically speaking): BLIS_SUBPART1A and BLIS_SUBPART1B, which
  represent the subpartition ahead of and behind, respectively,
  BLIS_SUBPART1. Updated check functions in bli_check.c accordingly.
- Shuffled layering/APIs for bli_acquire_mpart_[mn]dim() and
  bli_acquire_mpart_t2b/b2t(), _l2r/r2l().
- Deprecated old functions in frame/3/bli_l3_thrinfo.c.
2019-02-14 18:52:45 -06:00
Nicholai Tukanov
78bc0bc8b6 Power9 sub-configuration (#298)
Formally registered power9 sub-configuration.

Details:
- Added and registered power9 sub-configuration into the build system.
  Thanks to Nicholai Tukanov and Devangi Parikh for these contributions.
- Note: The sub-configuration does not yet have a corresponding
  architecture-specific kernel set registered, and so for now the 
  sub-config is using the generic kernel set.
2019-02-14 13:29:02 -06:00
Field G. Van Zee
6b83273126 Generalized ref kernels' pragma omp simd usage.
Details:
- Replaced direct usage of _Pragma( "omp simd" ) in reference kernels
  with PRAGMA_SIMD, which is defined as a function of the compiler being
  used in a new bli_pragma_macro_defs.h file. That definition is cleared
  when BLIS detects that the -fopenmp-simd command line option is
  unsupported. Thanks to Devin Matthews and Jeff Hammond for suggestions
  that guided this commit.
- Updated configure and bli_config.h.in so that the appropriate anchor
  is substituted in (when the corresponding pragma omp simd support is
  present).
2019-02-12 16:01:28 -06:00
Field G. Van Zee
b1f5ce8622 Minor updates to scripts in test/mixeddt/matlab. 2019-02-05 17:38:50 -06:00
Devangi N. Parikh
38203ecd15 Added thunderx2 system in the mixeddt test scripts
Details:
 - Added thunderx2 (tx2) as a system in the runme.sh in test/mixeddt
2019-02-04 15:28:28 -05:00
Devangi N. Parikh
dfc91843ea Fixed gcc flags for thunderx2 subconfiguration
Details:
- Fixed -march flag. Thunderx2 is an armv8.1a architecture not armv8a.
2019-02-04 15:23:40 -05:00
Field G. Van Zee
c665eb9b88 Minor updates to docs, Makefiles.
Details:
- Changed all occurrances of
    micro-kernel -> microkernel
    macro-kernel -> macrokernel
    micro-panel  -> micropanel
  in all markdown documents in 'docs' directory. This change is being
  made since we've reached the point in adoption and acceptance of
  BLIS's insights where words such as "microkernel" are no longer new,
  and therefore now merit being unhyphenated.
- Updated "Implementation Notes" sections of KernelsHowTo.md, which
  still contained references to nonexistent cpp macros such as
  BLIS_DEFAULT_MR_? and BLIS_PACKDIM_MR_?.
- Added 'run-fast' and 'check-fast' targets to testsuite/Makefile.
- Minor updates to Testsuite.md, including suggesting use of
  'make check' and 'make check-fast' when running from the local
  testsuite directory.
- Added a comment to top-level Makefile explaining the purpose behind
  the TESTSUITE_WRAPPER variable, which at first glance appears to serve
  no purpose.
2019-01-28 16:22:23 -06:00
M. Zhou
1aa280d052 Amend OS detection for kFreeBSD. (#295) 2019-01-27 15:40:48 -06:00
Field G. Van Zee
fffc23bb35 CREDITS file update. 2019-01-25 13:35:31 -06:00
Field G. Van Zee
26c5cf495c Fixed bug in skx subconfig related to bdd46f9.
Details:
- Fixed code in the skx subconfiguration that became a bug after
  committing bdd46f9. Specifically, the bli_cntx_init_skx() function
  was overwriting default blocksizes for the scomplex and dcomplex
  microkernels despite the fact that only single and double real
  microkernels were being registered. This was not a problem prior to
  bdd46f9 since all microkernels used dynamically-queried (at runtime)
  register blocksizes for loop bounds. However, post-bdd46f9, this
  became a bug because the reference ukernels for scomplex and dcomplex
  were written with their register blocksizes hard-coded as constant
  loop bounds, which conflicted the the erroneous scomplex and dcomplex
  values that bli_cntx_init_skx() was setting in the context. The
  lesson here is that going forward, all subconfigurations must not set
  any blocksizes for datatypes corresponding to default/reference
  microkernels. (Note that a blocksize is left unchanged by the
  bli_cntx_set_blkszs() function if it was set to -1.)
2019-01-24 18:49:31 -06:00
Field G. Van Zee
180f8e42e1 Fixed undefined behavior trsm ukr bug in bdd46f9.
Details:
- Fixed a bug that mainfested anytime a configuration was used in which
  optimized microkernels were registered and the trsm operation (or
  kernel) was invoked. The bug resulted from the optimized microkernels'
  register blocksizes conflicting with the hard-coded values--expressed
  in the form of constant loop bounds--used in the new reference trsm
  ukernels that were introduced in bdd46f9. The fix was easy: reverting
  back to the implementation that uses variable-bound loops, which
  amounted to changing an #if 0 to #if 1 (since I preserved the older
  implementation in the file alongside the new code based on constant-
  bound loops). It should be noted that this fix must be permanent,
  since the trsm kernel code with constant-bound loops can never work
  with gemm ukernels that use different register blocksizes.
2019-01-24 18:01:15 -06:00
Field G. Van Zee
bdd46f9ee8 Rewrote reference kernels to use #pragma omp simd.
Details:
- Rewrote level-1v, -1f, and -3 reference kernels in terms of simplified
  indexing annotated by the #pragma omp simd directive, which a compiler
  can use to vectorize certain constant-bounded loops. (The new kernels
  actually use _Pragma("omp simd") since the kernels are defined via
  templatizing macros.) Modest speedup was observed in most cases using
  gcc 5.4.0, which may improve with newer versions. Thanks to Devin
  Matthews for suggesting this via issue #286 and #259.
- Updated default blocksizes defined in ref_kernels/bli_cntx_ref.c to
  be 4x16, 4x8, 4x8, and 4x4 for single, double, scomplex and dcomplex,
  respectively, with a default row preference for the gemm ukernel. Also
  updated axpyf, dotxf, and dotxaxpyf fusing factors to 8, 6, and 4,
  respectively, for all datatypes.
- Modified configure to verify that -fopenmp-simd is a valid compiler
  option (via a new detect/omp_simd/omp_simd_detect.c file).
- Added a new header in which prefetch macros are defined according to
  which compiler is detected (via macros such as __GNUC__). These
  prefetch macros are not yet employed anywhere, though.
- Updated the year in copyrights of template license headers in
  build/templates and removed AMD as a default copyright holder.
2019-01-24 17:23:18 -06:00
Field G. Van Zee
63de2b0090 Prevent redef of ftnlen in blastest f2c_types.h.
Details:
- Guard typedef of ftnlen in f2c_types.h with a #ifndef HAVE_BLIS_H
  directive to prevent the redefinition of that type. Thanks to Jeff
  Diamond for reporting this compiler warning (and apologies for the
  delay in committing a fix).
2019-01-23 12:16:27 -06:00
Field G. Van Zee
eec2e183a7 Added escaping to '/' in os_name in configure.
Details:
- Add os_name to the list of variables into which the '/' character is
  escaped. This is meant to address (or at least make progress toward
  addressing) #293. Thanks to Isuru Fernando for spotting this as the
  potential fix, and also thanks to M. Zhou for the original report.
2019-01-21 12:12:18 -06:00
Field G. Van Zee
adf5c17f08 Formally registered thunderx2 subconfiguration.
Details:
- Added a separate subconfiguration for thunderx2, which now uses
  different optimization flags than cortexa57/cortexa53.
2019-01-18 15:14:45 -06:00
M. Zhou
094cfdf7df Port BLIS to GNU Hurd OS. (#294)
Prevent blis.h from misidentifying Hurd as OSX.
2019-01-18 12:46:13 -06:00
Field G. Van Zee
5d7d616e8e README.md update re: mixeddt TOMS paper. 2019-01-15 20:52:51 -06:00
Field G. Van Zee
58c7fb4788 Added more matlab scripts for mixeddt paper.
Details:
- Added a variant set of matlab scripts geared to producing plots that
  reflect performance data gathered with and without extra memory
  optimizations enabled. These scripts reside (for now) in
  test/mixeddt/matlab/wawoxmem.
2019-01-08 17:00:27 -06:00
Field G. Van Zee
34286eb914 Minor update to docs/HardwareSupport.md. 2019-01-08 11:41:20 -06:00
Field G. Van Zee
108b04dc5b Regenerated symbols in build/libblis-symbols.def.
Details:
- Reran ./build/regen-symbols.sh after running
  'configure --enable-cblas auto' to reflect removal of
  bli_malloc_pool() and bli_free_pool().
2019-01-07 20:16:31 -06:00
Field G. Van Zee
706cbd9d56 Minor tweaks/cleanups to bli_malloc.c, _apool.c.
Details:
- Removed malloc_ft and free_ft function pointer arguments from the
  interface to bli_apool_init() after deciding that there is no need to
  specify the malloc()/free() for blocks within the apool. (The apool
  blocks are actually just array_t structs.) Instead, we simply call
  bli_malloc_intl()/_free_intl() directly. This has the added benefit
  of allowing additional output when memory tracing is enabled via
  --enable-mem-tracing. Also made corresponding changes elsewhere in
  the apool API.
- Changed the inner pools (elements of the array_t within the apool_t)
  to use BLIS_MALLOC_POOL and BLIS_FREE_POOL instead of BLIS_MALLOC_INTL
  and BLIS_FREE_INTL.
- Disabled definitions of bli_malloc_pool() and bli_free_pool() since
  there are no longer any consumers of these functions.
- Very minor comment / printf() updates.
2019-01-07 18:28:19 -06:00
Minh Quan Ho
579145039d Initialize error messages at compile time (#289)
* Initialize error messages at compile time

- Assigning strings directly to the bli_error_string array, instead of
snprintf() at execution-time.

* Retired bli_error_init(), _finalize().

Details:
- Removed functions obviated by changes in 80e8dc6: bli_error_init(),
  bli_error_finalize(), and bli_error_init_msgs(), as well as calls to
  the former two in bli_init.c.

* Regenerated symbols in build/libblis-symbols.def.

Details:
- Reran ./build/regen-symbols.sh after running
  'configure --enable-cblas auto'.
2019-01-07 16:00:15 -06:00
Field G. Van Zee
aafbca086e Updated external package language in README.md.
Details:
- Updated/added comments about Fedora, OpenSUSE, and GNU Guix under the
  newly-renamed "External GNU/Linux packages" section. Thanks to Dave
  Love for providing these revisions.
2019-01-07 12:38:21 -06:00
Field G. Van Zee
daacfe6840 Allow running configure with python 3.4.
Details:
- Relax version blacklisting of python3 to allow 3.4 or later instead
  of 3.5 or later. Thanks to Dave Love for pointing out that 3.4 was
  sufficient for the purpose of BLIS's build system. (It should be
  noted that we're not sure which, if any, python3 versions prior to
  3.4 are insufficient, and that the only thing stopping us from
  determining this is the fact that these earlier versions of python3
  are not readily available for us to test with.)
- Updated docs/BuildSystem.md to be explicit about current python2 vs
  python3 version requirements.
2019-01-07 12:12:47 -06:00
Field G. Van Zee
ad8d9adb09 README.md, CREDITS update.
Details:
- Added "What's New" and "What People Are Saying About BLIS" sections to
  README.md.
- Added missing github handles to various individuals' entries in the
  CREDITS file.
2019-01-03 16:08:24 -06:00
Field G. Van Zee
7052fca5ae Apply f272c289 to bli_fmalloc_noalign().
Details:
- Perform the same check for NULL return values and error message output
  in bli_fmalloc_noalign() as is performed by bli_fmalloc_align(). (This
  change was intended for f272c289.)
2019-01-02 13:48:40 -06:00
Field G. Van Zee
528e3ad16a Merge branch 'amd' 2019-01-02 13:39:19 -06:00
Field G. Van Zee
3126c52ea7 Merge branch 'amd' 2019-01-02 13:37:37 -06:00
Field G. Van Zee
f272c2899a Add error message to malloc() check for NULL.
Details:
- Output an error message if and when the malloc()-equivalent called by
  bli_fmalloc_align() ever returns NULL. Everything was already in place
  for this to happen, including the error return code, the error string
  sprintf(), the error checking function bli_check_valid_malloc_buf()
  definition, and its prototype. Thanks to Minh Quan Ho for pointing out
  the missing error message.
- Increased the default block_ptrs_len for each inner pool stored in the
  small block allocator from 10 to 25. Under normal execution, each
  thread uses only 21 blocks, so this change will prevent the sba from
  needing to resize the block_ptrs array of any given inner pool as
  threads initially populate the pool with small blocks upon first
  execution of a level-3 operation.
- Nix stray newline echo in configure.
2019-01-02 12:34:15 -06:00
Field G. Van Zee
eb97f778a1 Added missing AMD copyrights to previous commit.
Details:
- Forgot to add AMD copyrights to several touched files that did not
  already have them in 2f31743.
2018-12-25 20:17:09 -06:00