756 Commits

Author SHA1 Message Date
Field G. Van Zee
618f4331eb Align strides of ct in macrokernels to that of c.
Details:
- Previously, rs_ct and cs_ct, the strides of the temporary microtile used
  primarily in the macrokernels' edge case handling, were unconditionally
  set to 1 and MR, respectively. However, Devin Matthews noted that this
  ought to be changed so that the strides of ct were in agreement with the
  strides of C. (That is, if C was row-stored, then ct should be accessed
  by rows as well.) The implicit assumption is that the strides of C
  have already been adjusted, via induced transposition, if the storage
  preference of the microkernel is at odds with the storage of C. So, if
  the microkernel prefers row storage, the macrokernel's interior cases
  would present row-stored (ideal) microkernel subproblems to the
  microkernel, but for edge cases, it would still see column-stored
  subproblems (not ideal). This commit fixes this issue. Thanks to Devin
  for his suggestion.
2016-10-31 14:40:51 -05:00
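
The stride selection described in 618f4331eb can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `sk_choose_ct_strides()` and `sk_strides_t` are hypothetical names, and treating a unit column stride as the test for row storage is a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pair of strides for the temporary MR x NR microtile ct. */
typedef struct { ptrdiff_t rs; ptrdiff_t cs; } sk_strides_t;

/* Pick strides for ct that agree with the storage of C.
   Assumption: C counts as row-stored when its column stride is 1
   and its row stride is not. */
static sk_strides_t sk_choose_ct_strides(ptrdiff_t rs_c, ptrdiff_t cs_c,
                                         ptrdiff_t mr, ptrdiff_t nr)
{
    sk_strides_t s;
    assert(mr > 0 && nr > 0);
    if (cs_c == 1 && rs_c != 1) {
        /* Row-stored C: consecutive elements of a row of ct are adjacent. */
        s.rs = nr;
        s.cs = 1;
    } else {
        /* Column-stored (or general) C: the old unconditional choice. */
        s.rs = 1;
        s.cs = mr;
    }
    return s;
}
```

With this choice, an edge-case subproblem handed to a row-preferring microkernel is laid out the same way as the interior cases.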
Field G. Van Zee
6303910023 Merge pull request #105 from devinamatthews/knl
Support for Intel Knights Landing.
2016-10-25 19:34:51 -05:00
Devin Matthews
216206c1d3 Fix up for merge to master. 2016-10-25 13:56:18 -05:00
Devin Matthews
11eb7957ab Merge branch 'master' into knl
# Conflicts:
#	frame/thread/bli_thread.h
2016-10-25 13:51:07 -05:00
Devin Matthews
cd5b668183 Don't use %rbp in KNL packing kernels. 2016-10-25 13:49:27 -05:00
Field G. Van Zee
956b3edf8e Merge pull request #104 from devinamatthews/misspellings
Add flexible options for thread model (pthread/posix for pthreads etc.).
2016-10-25 13:02:57 -05:00
Devin Matthews
0662a3c1b1 Add flexible options for thread model (pthread/posix for pthreads etc.). 2016-10-25 12:42:44 -05:00
Field G. Van Zee
b7e41d71b0 Merge pull request #103 from devinamatthews/patch-1
Change .align to .p2align in Bulldozer ukernels.
2016-10-24 16:47:46 -05:00
Devin Matthews
5117d444f7 Change .align to .p2align in Bulldozer ukernels
Apparently OSX doesn't allow .align directives for >16B, so I've changed these to their .p2align counterparts.
2016-10-24 16:20:47 -05:00
Field G. Van Zee
4bd905bd45 Merge pull request #93 from ShadenSmith/config_check
Adds sanity check to configuration choice.
2016-10-21 14:48:44 -05:00
Field G. Van Zee
936d5fdc26 Fixed multithreading compilation bug in 970745a.
Details:
- Moved the definition of the cpp macro BLIS_ENABLE_MULTITHREADING
  from bli_thread.h to bli_config_macro_defs.h. Also moved the
  sanity check that OpenMP and POSIX threads are not both enabled.
- Thanks to Krzysztof Drewniak for reporting this bug.
2016-10-21 14:34:27 -05:00
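
A configuration-header sanity check of the kind mentioned in 936d5fdc26 might look like the sketch below. Only BLIS_ENABLE_MULTITHREADING is named in the commit message; the macro names BLIS_ENABLE_OPENMP and BLIS_ENABLE_PTHREADS and the helper `sk_multithreading_enabled()` are assumptions made for illustration.

```c
#include <assert.h>

/* Assumed macro names for the two threading models. */
#if defined(BLIS_ENABLE_OPENMP) && defined(BLIS_ENABLE_PTHREADS)
  #error "OpenMP and POSIX threads must not both be enabled."
#endif

/* Define the umbrella macro once, in a header that every other
   header can rely on having been included first. */
#if defined(BLIS_ENABLE_OPENMP) || defined(BLIS_ENABLE_PTHREADS)
  #define BLIS_ENABLE_MULTITHREADING
#endif

/* Hypothetical helper: report whether multithreading was compiled in. */
static int sk_multithreading_enabled(void)
{
#ifdef BLIS_ENABLE_MULTITHREADING
    return 1;
#else
    return 0;
#endif
}
```

Placing both the definition and the check in one early configuration header means later headers see a consistent value regardless of include order.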
Field G. Van Zee
8feb0f85a6 Removed auto-prototyping of malloc()/free() substitutes.
Details:
- Removed the header file, bli_malloc_prototypes.h, which automatically
  generated prototypes for the functions specified by the following
  cpp macros:
    BLIS_MALLOC_INTL
    BLIS_FREE_INTL
    BLIS_MALLOC_POOL
    BLIS_FREE_POOL
    BLIS_MALLOC_USER
    BLIS_FREE_USER
  These prototypes were originally provided primarily as a convenience
  to those developers who specified their own malloc()/free() substitutes
  for one or more of the above. However, we generated these prototypes
  regardless, even when the default values (malloc and free) of the
  macros above were used. A problem arose under certain circumstances
  (e.g., gcc in C++ mode on Linux with glibc) when including blis.h that
  stemmed from the "throw" specification which was added to glibc's
  malloc() prototype, resulting in a prototype mismatch. Therefore, going
  forward, developers who specify their own custom malloc()/free()
  substitutes must also prototype those substitutes via bli_kernel.h.
  Thanks to Krzysztof Drewniak for reporting this bug, and Devin Matthews
  for researching the nature and potential solutions.
2016-10-19 16:05:41 -05:00
Field G. Van Zee
970745a5fc Reorganized typedefs to avoid compiler warnings.
Details:
- Relocated membrk_t definition from bli_membrk.h to bli_type_defs.h.
- Moved #include of bli_malloc.h from blis.h to bli_type_defs.h.
- Removed standalone mtx_t and mutex_t typedefs in bli_type_defs.h.
- Moved #include of bli_mutex.h from bli_thread.h to bli_type_defs.h.
- The redundant typedefs of membrk_t and mtx_t caused a warning on some C
  compilers. Thanks to Tyler Smith for reporting this issue.
2016-10-19 15:58:03 -05:00
Field G. Van Zee
28b2af8a71 Added disabled code to print thrinfo_t structures.
Details:
- Added cpp-guarded code to bli_thrcomm_openmp.c that allows a curious
  developer to print the contents of the thrinfo_t structures of each
  thread, for verification purposes or just to study the way thread
  information and communicators are used in BLIS.
- Enabled some previously-disabled code in bli_l3_thrinfo.c for freeing
  an array of thrinfo_t* values that is used in the new, cpp-guarded code
  mentioned above.
- Removed some old commented lines from bli_gemm_front.c.
2016-10-13 14:50:08 -05:00
Field G. Van Zee
11eed3f683 Fixed a configure -t omp/openmp bug from fd04869.
Details:
- Forgot to update certain occurrences of "omp" in common.mk during
  commit fd04869, which changed the preferred configure option string
  for enabling OpenMP from "omp" to "openmp".
2016-10-13 14:23:23 -05:00
Field G. Van Zee
9cda6057ea Removed previously renamed/old files.
Details:
- Removed frame/base/bli_mem.c and frame/include/bli_auxinfo_macro_defs.h,
  both of which were renamed/removed in 701b9aa. For some reason, these
  files survived when the compose branch was merged back into master.
  (Clearly, git's merging algorithm is not perfect.)
- Removed frame/base/bli_mem.c.prev (an artifact of the long-ago changed
  memory allocator that I was keeping around for no particular reason).
2016-10-11 13:21:26 -05:00
Field G. Van Zee
22377abd84 Fixed bli_gemm() segfault on empty C matrices.
Details:
- Fixed a bug that would manifest in the form of a segmentation fault
  in bli_cntl_free() when calling any level-3 operation on an empty
  output matrix (i.e., m = n = 0). Specifically, the code previously
  assumed that the entire control tree was built prior to it being
  freed. However, if the level-3 operation performs an early exit, the
  control tree will be incomplete, and this scenario is now handled.
  Thanks to Elmar Peise for reporting this bug.
2016-10-10 13:43:56 -05:00
Field G. Van Zee
0b571cd94d Fixed segfault in bli_free_align() for NULL ptrs.
Details:
- Fixed a bug in bli_free_align() caused by failing to handle NULL pointers
  up-front, which led to performing pointer arithmetic on NULL pointers in
  order to free the address immediately before the pointer. Thanks to Devin
  Matthews for reporting this bug.
2016-10-06 14:48:15 -05:00
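
The failure mode fixed in 0b571cd94d can be illustrated with a generic aligned allocator that stores the original malloc() address immediately before the returned pointer. This is a sketch under that assumption, not the BLIS implementation; `sk_malloc_align()` and `sk_free_align()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate 'size' bytes aligned to 'align' (a power of two that is at
   least sizeof(void *)). The address returned by malloc() is saved in
   the pointer-sized slot just before the aligned block. */
static void *sk_malloc_align(size_t size, size_t align)
{
    assert(align >= sizeof(void *) && (align & (align - 1)) == 0);

    unsigned char *raw = malloc(size + align + sizeof(void *));
    if (raw == NULL) return NULL;

    /* Leave room for the saved pointer, then round up to the alignment. */
    uintptr_t addr = (uintptr_t)(raw + sizeof(void *));
    addr = (addr + align - 1) & ~(uintptr_t)(align - 1);

    void **aligned = (void **)addr;
    aligned[-1] = raw;
    return aligned;
}

/* Free a block obtained from sk_malloc_align(). */
static void sk_free_align(void *p)
{
    /* Handle NULL up-front; without this guard the line below would
       read the slot "before" a NULL pointer. */
    if (p == NULL) return;

    free(((void **)p)[-1]);
}
```

The guard makes `sk_free_align(NULL)` a no-op, matching the behavior of free().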
Field G. Van Zee
4fb9b4ef2e CHANGELOG update (0.2.1) 2016-10-05 14:41:35 -05:00
Field G. Van Zee
866b2dde3f Version file update (0.2.1) 0.2.1 2016-10-05 14:41:34 -05:00
Field G. Van Zee
87fddeab3c Merge branch 'compose' 2016-10-05 13:35:01 -05:00
Field G. Van Zee
6f71cd3449 Merge pull request #94 from flame/distcomm
Implemented distributed thrinfo_t management.
2016-10-04 15:53:46 -05:00
Field G. Van Zee
86969873b5 Reclassified amaxv operation as a level-1v kernel.
Details:
- Moved amaxv from being a utility operation to being a level-1v operation.
  This includes the establishment of a new amaxv kernel to live beside all
  of the other level-1v kernels.
- Added two new functions to bli_part.c:
    bli_acquire_mij()
    bli_acquire_vi()
  The first acquires a scalar object for the (i,j) element of a matrix,
  and the second acquires a scalar object for the ith element of a vector.
- Added integer support to bli_getsc level-0 operation. This involved
  adding integer support to the bli_*gets level-0 scalar macros.
- Added a new test module to test amaxv as a level-1v operation. The test
  module works by comparing the value identified by bli_amaxv() to the
  value found from a reference-like code local to the test module
  source file. In other words, it (intentionally) does not guarantee the
  same index is found; only the same value. This allows for different
  implementations in the case where a vector contains two or more elements
  containing exactly the same floating point value (or values, in the case
  of the complex domain).
- Removed the directory frame/include/old/.
2016-10-04 14:24:59 -05:00
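
The value-based (rather than index-based) check described for the amaxv test module in 86969873b5 could be sketched for the real domain as follows. The names `sk_ref_amaxv()` and `sk_amaxv_result_ok()` are hypothetical, and the actual test module may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value without depending on libm. */
static double sk_abs(double v) { return v < 0.0 ? -v : v; }

/* Reference search: index of the first element of maximum absolute value. */
static size_t sk_ref_amaxv(const double *x, size_t n)
{
    assert(x != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; ++i)
        if (sk_abs(x[i]) > sk_abs(x[best])) best = i;
    return best;
}

/* Accept any index whose element has the same absolute value as the
   reference result, so ties may be resolved differently by the
   implementation under test. */
static int sk_amaxv_result_ok(const double *x, size_t n, size_t idx)
{
    assert(idx < n);
    return sk_abs(x[idx]) == sk_abs(x[sk_ref_amaxv(x, n)]);
}
```

For x = {1, -3, 3, 2}, both index 1 and index 2 pass the check, while index 3 does not.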
Field G. Van Zee
8d55033c96 Implemented distributed thrinfo_t management.
Details:
- Implemented Ricardo Magana's distributed thread info/communicator
  management. Rather than fully construct the thrinfo_t structures, from
  root to leaf, prior to spawning threads, the threads individually
  construct their thrinfo_t trees (or, chains), and do so incrementally,
  as needed, reusing the same structure nodes during subsequent blocked
  variant iterations. This required moving the initial creation of the
  thrinfo_t structure (now, the root nodes) from the _front() functions
  to the bli_l3_thread_decorator(). The incremental "growing" of the tree
  is performed in the internal back-end (i.e., _int()) function, and so is
  mostly invisible. Also, the incremental growth of the thrinfo_t tree is
  done as a function of the current and parent control tree nodes (as well
  as the parent thrinfo_t node), further reinforcing the parallel
  relationship between the two data structures.
- Removed the "inner" communicator from thrinfo_t structure definition,
  as well as its id. Changed all APIs accordingly. Renamed
  bli_thrinfo_needs_free_comms() to bli_thrinfo_needs_free_comm().
- Defined bli_l3_thrinfo_print_paths(), which prints the information
  in an array of thrinfo_t* structure pointers. (Used only as a
  debugging/verification tool.)
- Deprecated the following thrinfo_t creation functions:
    bli_packm_thrinfo_create()
    bli_l3_thrinfo_create()
  because they are no longer used. bli_thrinfo_create() is now called
  directly when creating thrinfo_t nodes.
2016-09-27 15:20:58 -05:00
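
The incremental, per-thread growth of the thrinfo_t chain described in 8d55033c96 follows a "create on first use, reuse afterwards" pattern, sketched below. The struct and function names (`sk_thrinfo_t`, `sk_thrinfo_grow()`, `sk_thrinfo_free_below()`) are hypothetical, and the real structure carries more state (communicators, ids, and so on).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, stripped-down thread info node. */
typedef struct sk_thrinfo_s {
    int                  n_way;     /* ways of parallelism at this level */
    struct sk_thrinfo_s *sub_node;  /* child, created lazily */
} sk_thrinfo_t;

/* Return the child of 'parent', creating it only if it does not exist
   yet. Later loop iterations reuse the same node. */
static sk_thrinfo_t *sk_thrinfo_grow(sk_thrinfo_t *parent, int n_way)
{
    assert(parent != NULL);
    if (parent->sub_node == NULL) {
        sk_thrinfo_t *child = malloc(sizeof *child);
        if (child == NULL) return NULL;
        child->n_way    = n_way;
        child->sub_node = NULL;
        parent->sub_node = child;
    }
    return parent->sub_node;
}

/* Free every node below 'root' (the root itself is owned by the caller). */
static void sk_thrinfo_free_below(sk_thrinfo_t *root)
{
    sk_thrinfo_t *node = root->sub_node;
    while (node != NULL) {
        sk_thrinfo_t *next = node->sub_node;
        free(node);
        node = next;
    }
    root->sub_node = NULL;
}
```

Calling `sk_thrinfo_grow()` twice on the same parent returns the same child, which is the reuse across blocked variant iterations that the commit message describes.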
Field G. Van Zee
fd04869ae4 Changed configure's 'omp' threading to 'openmp'.
Details:
- Changed the configure script so that the expected string argument to the
  -t (or --enable-threading=) option that enables OpenMP multithreading is
  'openmp'. The previous expected string, 'omp', is still supported but
  should be considered deprecated.
2016-09-27 14:14:11 -05:00
Field G. Van Zee
9424af8720 Merge branch 'compose' 2016-09-27 12:51:08 -05:00
Shaden Smith
7f32dd57c6 Adds sanity check to configuration choice. 2016-09-17 11:33:57 -05:00
Field G. Van Zee
efa7341df0 Merge pull request #92 from ShadenSmith/readme_fix
Fixes broken URL in README.md
2016-09-16 11:01:57 -05:00
Shaden Smith
e1453f68f6 Fixes broken URL in README.md 2016-09-16 09:29:28 -05:00
Field G. Van Zee
c0630c4024 Added debugging printf()'s to bli_l3_thrinfo.c.
Details:
- Added optional printf() statements to print out thread communicator
  info as the thrinfo_t structure is built in bli_l3_thrinfo.c.
- Minor changes to frame/thread/bli_thrinfo.h.
2016-09-12 13:59:02 -05:00
Field G. Van Zee
7b3bf1ffcd Merge branch 'master' into compose 2016-09-06 15:47:13 -05:00
Field G. Van Zee
121c39d455 Added complex gemm micro-kernels for haswell.
Details:
- Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based
  architectures. As with their real domain brethren, these kernels prefer
  row storage (though this doesn't affect most users due to high-level
  optimizations in most level-3 operations that induce a transpose to
  whatever storage preference the kernel may have).
2016-09-05 13:11:42 -05:00
Field G. Van Zee
35509818cb Added, moved some thread barriers.
Details:
- Removed thread barriers from the end of the loop bodies of
  bli_gemm_blk_var1(), bli_gemm_blk_var2(), bli_trsm_blk_var1(),
  and bli_trsm_blk_var2().
- Moved the thread barrier at the end of bli_packm_int() to the
  end of bli_l3_packm(), and added missing barriers to that function.
- Removed the no longer necessary (and now incorrect) ochief guard
  in bli_gemm3m3_packa() on the bli_obj_scalar_reset() on C.
- Thanks to Tyler Smith for help with these changes.
2016-08-31 17:34:15 -05:00
Field G. Van Zee
abd61f9fa7 Updated BLIS4 TOMS citation in README.md. 2016-08-30 12:34:19 -05:00
Field G. Van Zee
701b9aa3ff Redesigned control tree infrastructure.
Details:
- Altered control tree node struct definitions so that all nodes have the
  same struct definition, whose primary fields consist of a blocksize id,
  a variant function pointer, a pointer to an optional parameter struct,
  and a pointer to a (single) sub-node. This unified control tree type is
  now named cntl_t.
- Changed the way control tree nodes are connected, and what computation
  they represent, such that, for example, packing operations are now
  associated with nodes that are "inline" in the tree, rather than
  off-shoot branches. The original tree for the classic Goto gemm algorithm was
  expressed (roughly) as:

    blk_var2 -> blk_var3 -> blk_var1 -> ker_var2
                         |           |
                         -> packb    -> packa

  and now, the same tree would look like:

    blk_var2 -> blk_var3 -> packb -> blk_var1 -> packa -> ker_var2

  Specifically, the packb and packa nodes perform their respective packing
  operations and then recurse (without any loop) to a subproblem. This means
  there are now two kinds of level-3 control tree nodes: partitioning and
  non-partitioning. The blocked variants are members of the former, because
  they iteratively partition off submatrices and perform suboperations on
  those partitions, while the packing variants belong to the latter group.
  (This change has the effect of allowing greatly simplified initialization
  of the nodes, which previously involved setting many unused node fields to
  NULL.)
- Changed the way thrinfo_t tree nodes are arranged to mirror the new
  connective structure of control trees. That is, packm nodes are no longer
  off-shoot branches of the main algorithmic nodes, but rather connected
  "inline".
- Simplified control tree creation functions. Partitioning nodes are created
  concisely with just a few fields needing initialization. By contrast, the
  packing nodes require additional parameters, which are stored in a
  packm-specific struct that is tracked via the optional parameters pointer
  within the control tree struct. (This parameter struct must always begin
  with a uint64_t that contains the byte size of the struct. This allows
  us to use a generic function to recursively copy control trees.) gemm,
  herk, and trmm control tree creation continues to be consolidated into
  a single function, with the operation family being used to select
  among the parameter-agnostic macro-kernel wrappers. A single routine,
  bli_cntl_free(), is provided to free control trees recursively, whereby
  the chief thread within a group releases the blocks associated with
  mem_t entries back to the memory broker from which they were acquired.
- Updated internal back-ends, e.g. bli_gemm_int(), to query and call the
  function pointer stored in the current control tree node (rather than
  index into a local function pointer array). Before being invoked, these
  function pointers are first cast to a gemm_voft (for gemm, herk, or trmm
  families) or trsm_voft (for trsm family) type, which is defined in
  frame/3/bli_l3_var_oft.h.
- Retired herk and trmm internal back-ends, since all execution now flows
  through gemm or trsm blocked variants.
- Merged forwards- and backwards-moving variants by querying the direction
  from routines as a function of the variant's matrix operands. gemm and
  herk always move forward, while trmm and trsm move in a direction that
  is dependent on which operand (a or b) is triangular.
- Added functions bli_thread_get_range_mdim(), bli_thread_get_range_ndim(),
  each of which takes additional arguments and hides complexity in managing
  the difference between the way ranges are computed for the four families
  of operations.
- Simplified level-3 blocked variants according to the above changes, so that
  the only steps taken are:
  1. Query partitioning direction (forwards or backwards).
  2. Prune unreferenced regions, if they exist.
  3. Determine the thread partitioning sub-ranges.
  <begin loop>
    4. Determine the partitioning blocksize (passing in the partitioning
       direction)
    5. Acquire the current iteration's partitions for the matrices affected
       by the current variant's partitioning dimension (m, k, n).
    6. Call the subproblem.
  <end loop>
- Instantiate control trees once per thread, per operation invocation.
  (This is a change from the previous regime in which control trees were
  treated as stateless objects, initialized with the library, and shared
  as read-only objects between threads.) This once-per-thread allocation
  is done primarily to allow threads to use the control tree as a place
  to cache certain data for use in subsequent loop iterations. Presently,
  the only application of this caching is a mem_t entry for the packing
  blocks checked out from the memory broker (allocator). If a non-NULL
  control tree is passed in by the (expert) user, then the tree is copied
  by each thread. This is done in bli_l3_thread_decorator(), in
  bli_thrcomm_*.c.
- Added a new field to the context, an opid_t, which tracks the "family"
  of the operation being executed. For example, gemm, hemm, and symm are
  all part of the gemm family, while herk, syrk, her2k, and syr2k are
  all part of the herk family. Knowing the operation's family is necessary
  when conditionally executing the internal (beta) scalar reset on
  C in blocked variant 3, which is needed for gemm and herk families,
  but must not be performed for the trmm family (because beta has only
  been applied to the current row-panel of C after the first rank-kc
  iteration).
- Reexpressed 3m3 induced method blocked variant in frame/3/gemm/ind
  to conform with the new control tree design, and renamed the macro-
  kernel codes corresponding to 3m2 and 4m1b.
- Renamed bli_mem.c (and its APIs) to bli_memsys.c, and renamed/relocated
  bli_mem_macro_defs.h from frame/include to frame/base/bli_mem.h.
- Renamed/relocated bli_auxinfo_macro_defs.h from frame/include to
  frame/base/bli_auxinfo.h.
- Fixed a minor bug whereby the storage-to-ukr-preference matching
  optimization in the various level-3 front-ends was not being applied
  properly when the context indicated that execution would be via an
  induced method. (Before, we always checked the native micro-kernel
  corresponding to the datatype being executed, whereas now we check
  the native micro-kernel corresponding to the datatype's real projection,
  since that is the micro-kernel that is actually used by induced methods.)
- Added an option to the testsuite to skip the testing of native level-3
  complex implementations. Previously, it was always tested, provided that
  the c/z datatypes were enabled. However, some configurations use
  reference micro-kernels for complex datatypes, and testing these
  implementations can slow down the testsuite considerably.
2016-08-26 19:04:45 -05:00
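
The unified node layout and the size-prefixed parameter struct described in 701b9aa3ff can be sketched as follows. All names here (`sk_cntl_t`, `sk_packm_params_t`, `sk_cntl_copy()`, `sk_cntl_free()`) are hypothetical stand-ins; the actual cntl_t definition and bli_cntl_free() differ in detail, and this sketch ignores the memory broker entirely.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef void (*sk_void_fp)(void);

/* One struct for every node: blocksize id, variant function pointer,
   optional parameter struct, and a single sub-node. */
typedef struct sk_cntl_s {
    int               bszid;
    sk_void_fp        var_func;
    void             *params;    /* NULL, or begins with a uint64_t size */
    struct sk_cntl_s *sub_node;
} sk_cntl_t;

/* Example parameter struct for a packing node. The leading size field
   lets generic code copy it without knowing its type. */
typedef struct {
    uint64_t size;               /* must be the first field */
    int      pack_schema;
} sk_packm_params_t;

/* Free a (possibly partially built) chain of nodes. */
static void sk_cntl_free(sk_cntl_t *node)
{
    while (node != NULL) {
        sk_cntl_t *next = node->sub_node;
        free(node->params);
        free(node);
        node = next;
    }
}

/* Recursively copy a tree, using only the size prefix to copy params. */
static sk_cntl_t *sk_cntl_copy(const sk_cntl_t *src)
{
    if (src == NULL) return NULL;

    sk_cntl_t *dst = malloc(sizeof *dst);
    if (dst == NULL) return NULL;
    *dst = *src;
    dst->params   = NULL;
    dst->sub_node = NULL;

    if (src->params != NULL) {
        uint64_t size;
        memcpy(&size, src->params, sizeof size);
        assert(size >= sizeof size);
        dst->params = malloc((size_t)size);
        if (dst->params == NULL) { free(dst); return NULL; }
        memcpy(dst->params, src->params, (size_t)size);
    }

    dst->sub_node = sk_cntl_copy(src->sub_node);
    if (src->sub_node != NULL && dst->sub_node == NULL) {
        sk_cntl_free(dst);       /* allocation failed further down */
        return NULL;
    }
    return dst;
}
```

Because the free routine walks whatever chain exists and stops at the first NULL sub-node, it also tolerates a tree that was only partially constructed.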
Field G. Van Zee
73517f522b Merge branch 'master' into compose 2016-08-23 13:46:59 -05:00
Field G. Van Zee
50293da38d Avoid compiling BLAS/CBLAS files when disabled.
Details:
- Updated the top-level Makefile, build/config.mk.in template, and
  configure script so that object files corresponding to source files
  belonging to the BLAS compatibility layer are not compiled (or archived)
  when the compatibility layer is disabled. (Same for CBLAS.) Thanks
  to Devin Matthews for suggesting this optimization.
- Slight change to the way configure handles internal variables. Instead
  of converting (overwriting) some, such as enable_blas2blis and
  enable_cblas, from a "yes" or "no" to a "1" or "0" value, the latter are
  now stored in new variables that live alongside the originals (with the
  suffix "_01").  This is convenient since some values need to be
  sed-substituted into the config.mk.in template, which requires "yes" or
  "no", while some need to be written to the bli_config.h.in template,
  which requires "0" or "1".
2016-08-23 13:38:36 -05:00
Field G. Van Zee
c6f5c215ee Merge branch 'master' into compose 2016-08-22 17:33:02 -05:00
Field G. Van Zee
16a4c7a823 Fixed bugs in bli_mutex_init() and friends.
Details:
- Fixed a couple of bugs that affected OpenMP and POSIX threads
  configurations that resulted in compiler errors and warnings due
  to type mismatch, and in the case of pthreads, a missing function
  argument. The bugs are fairly recent, introduced in a017062.
2016-08-19 11:38:36 -05:00
Devin Matthews
c8e4ef9395 Add prefetchw to 30x8 kernel. 2016-08-03 16:13:03 -05:00
Devin Matthews
4b5a2f3d6e Merge remote-tracking branch 'origin/knl' into knl
# Conflicts:
#	kernels/x86_64/knl/3/bli_dgemm_opt_24x8.c
2016-08-03 16:09:51 -05:00
Devin Matthews
380736bfe9 Add (new) 30x8 KNL kernel and fix non-scatter prefetch bug. 2016-08-03 16:08:28 -05:00
Devin Matthews
9f52a587de Try prefetchw[t1] instead of regular prefetch for C. 2016-08-03 16:03:53 -05:00
Devin Matthews
8945a1512d This version gets ~1550 GFLOPs on KNL with 16x4. 2016-08-03 11:28:24 -05:00
Devin Matthews
6ce4c022eb Switch back to 24x8. I could only squeeze 24.5GFLOP out of 8x24, and scalability is not improved. 2016-07-27 16:26:36 -05:00
Field G. Van Zee
d52cb76715 Merge branch 'master' into compose 2016-07-27 16:04:55 -05:00
Field G. Van Zee
c31b1e7b9d Relax alignment restrictions for sandybridge ukrs.
Details:
- Relaxed the base pointer and leading dimension alignment restrictions
  in the sandybridge gemm microkernels, allowing the use of vmovups/vmovupd
  instead of vmovaps/vmovapd. These changes mimic those made to the haswell
  microkernels in e0d2fa0 and ee2c139.
- Updated testsuite modules as well as standalone test drivers in 'test'
  directory to use DBL_MAX as the initial time candidate. Thanks to Devin
  Matthews for suggesting this change.
- Inserted #include "float.h" into bli_system.h (to gain access to DBL_MAX).
- Minor update (vis-a-vis contexts) to driver code in test/3m4m.
2016-07-27 15:58:07 -05:00
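
Using DBL_MAX as the initial time candidate, as mentioned in c31b1e7b9d, is the usual way to seed a best-of-N timing loop: any measured time will compare smaller. A minimal sketch, with the hypothetical helper `sk_best_time()` standing in for the drivers' timing loops:

```c
#include <assert.h>
#include <float.h>   /* DBL_MAX */
#include <stddef.h>

/* Return the smallest of n timing samples (in seconds). Starting from
   DBL_MAX avoids guessing an arbitrary "large enough" initial value. */
static double sk_best_time(const double *samples, size_t n)
{
    assert(samples != NULL || n == 0);
    double best = DBL_MAX;
    for (size_t i = 0; i < n; ++i)
        if (samples[i] < best) best = samples[i];
    return best;
}
```

If no samples are taken, the function returns DBL_MAX, which makes an empty run easy to detect.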
Devin Matthews
b8f2b55532 Try an 8x24 kernel for the hell of it. 2016-07-27 15:22:55 -05:00
Devin Matthews
7ede5863ae Allocate pack buffer on MCDRAM for KNL. 2016-07-27 13:42:32 -06:00
Devin Matthews
ad89ed2e82 Merge branch 'knl' of github.com:devinamatthews/blis into knl 2016-07-27 11:45:40 -05:00