Commit Graph

1886 Commits

Author SHA1 Message Date
praveeng
cd84fb9518 syntax erros in configure file
Change-Id: Ibe8a6071aad97df550df64c009fec33a9d8f43a1
2016-10-06 15:08:21 +05:30
praveeng
f2e7ea113a conflicts merge for bli_kernel.h
Change-Id: I15d846bd34e11f86ebfd7ed091ff671a1f3366a0
2016-10-06 12:35:30 +05:30
sthangar
133983c36f code clean up in bli_kernel.h
Change-Id: I11d9cdf2af8e8199209eb084f6c3a7c910b83d5d
2016-10-06 11:26:22 +05:30
Field G. Van Zee
4fb9b4ef2e CHANGELOG update (0.2.1) 2016-10-05 14:41:35 -05:00
Field G. Van Zee
866b2dde3f Version file update (0.2.1) 0.2.1 2016-10-05 14:41:34 -05:00
Field G. Van Zee
87fddeab3c Merge branch 'compose' 2016-10-05 13:35:01 -05:00
Field G. Van Zee
6f71cd3449 Merge pull request #94 from flame/distcomm
Implemented distributed thrinfo_t management.
2016-10-04 15:53:46 -05:00
Field G. Van Zee
86969873b5 Reclassified amaxv operation as a level-1v kernel.
Details:
- Moved amaxv from being a utility operation to being a level-1v operation.
  This includes the establishment of a new amaxv kernel to live beside all
  of the other level-1v kernels.
- Added two new functions to bli_part.c:
    bli_acquire_mij()
    bli_acquire_vi()
  The first acquires a scalar object for the (i,j) element of a matrix,
  and the second acquires a scalar object for the ith element of a vector.
- Added integer support to bli_getsc level-0 operation. This involved
  adding integer support to the bli_*gets level-0 scalar macros.
- Added a new test module to test amaxv as a level-1v operation. The test
  module works by comparing the value identified by bli_amaxv() to the
  the value found from a reference-like code local to the test module
  source file. In other words, it (intentionally) does not guarantee the
  same index is found; only the same value. This allows for different
  implementations in the case where a vector contains two or more elements
  containing exactly the same floating point value (or values, in the case
  of the complex domain).
- Removed the directory frame/include/old/.
2016-10-04 14:24:59 -05:00
Field G. Van Zee
8d55033c96 Implemented distributed thrinfo_t management.
Details:
- Implemented Ricardo Magana's distributed thread info/communicator
  management. Rather that fully construct the thrinfo_t structures, from
  root to leaf, prior to spawning threads, the threads individually
  construct their thrinfo_t trees (or, chains), and do so incrementally,
  as needed, reusing the same structure nodes during subsequent blocked
  variant iterations. This required moving the initial creation of the
  thrinfo_t structure (now, the root nodes) from the _front() functions
  to the bli_l3_thread_decorator(). The incremental "growing" of the tree
  is performed in the internal back-end (ie: _int()) function, and so
  mostly invisible. Also, the incremental growth of the thrinfo_t tree is
  done as a function of the current and parent control tree nodes (as well
  as the parent thrinfo_t node), further reinforcing the parallel
  relationship between the two data structures.
- Removed the "inner" communicator from thrinfo_t structure definition,
  as well as its id. Changed all APIs accordingly. Renamed
  bli_thrinfo_needs_free_comms() to bli_thrinfo_needs_free_comm().
- Defined bli_l3_thrinfo_print_paths(), which prints the information
  in an array of thrinfo_t* structure pointers. (Used only as a
  debugging/verification tool.)
- Deprecated the following thrinfo_t creation functions:
    bli_packm_thrinfo_create()
    bli_l3_thrinfo_create()
  because they are no longer used. bli_thrinfo_create() is now called
  directly when creating thrinfo_t nodes.
2016-09-27 15:20:58 -05:00
Field G. Van Zee
fd04869ae4 Changed configure's 'omp' threading to 'openmp'.
Details:
- Changed the configure script so that the expected string argument to the
  -t (or --enable-threading=) option that enables OpenMP multithreading is
  'openmp'. The previous expected string, 'omp', is still supported but
  should be considered deprecated.
2016-09-27 14:14:11 -05:00
Field G. Van Zee
9424af8720 Merge branch 'compose' 2016-09-27 12:51:08 -05:00
Shaden Smith
7f32dd57c6 Adds sanity check to configuration choice. 2016-09-17 11:33:57 -05:00
Field G. Van Zee
efa7341df0 Merge pull request #92 from ShadenSmith/readme_fix
Fixes broken URL in README.md
2016-09-16 11:01:57 -05:00
Shaden Smith
e1453f68f6 Fixes broken URL in README.md 2016-09-16 09:29:28 -05:00
Field G. Van Zee
b922d75634 Avoid compiling BLAS/CBLAS files when disabled.
Details:
- Updated the top-level Makefile, build/config.mk.in template, and
  configure script so that object files corresponding to source files
  belonging to the BLAS compatibility layer are not compiled (or archived)
  when the compatibility layer is disabled. (Same for CBLAS.) Thanks
  to Devin Matthews for suggesting this optimization.
- Slight change to the way configure handles internal variables. Instead
  of converting (overwriting) some, such as enable_blas2blis and
  enable_cblas, from a "yes" or "no" to a "1" or "0" value, the latter are
  now stored in new variables that live alongside the originals (with the
  suffix "_01").  This is convenient since some values need to be
  sed-substituted into the config.mk.in template, which requires "yes" or
  "no", while some need to be written to the bli_config.h.in template,
  which requires "0" or "1".

Updated BLIS4 TOMS citation in README.md.

Added complex gemm micro-kernels for haswell.

Details:
- Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based
  architectures. As with their real domain brethren, these kernels perfer
  row storage, (though this doesn't affect most users due to high-level
  optimizations in most level-3 operations that induce a transpose to
  whatever storage preference the kernel may have).

Change-Id: I512ab90784ecbb7cdaee24928d2ccebb544ba5c1
2016-09-15 12:24:07 +05:30
Pradeep Rao
69826110ba Merge "Implemented trsm single precision for lower triangular matrices, files added bli_trsm_l_int_6x16.cfiles modified bli_kernel.h to enable optimized trsm microkernel and test_trsm.c is modified to test trsm single precision" into amd-staging 2016-09-14 03:26:25 -04:00
Field G. Van Zee
c0630c4024 Added debugging printf()'s to bli_l3_thrinfo.c.
Details:
- Added optional printf() statements to print out thread communicator
  info as the thrinfo_t structure is built in bli_l3_thrinfo.c.
- Minor changes to frame/thread/bli_thrinfo.h.
2016-09-12 13:59:02 -05:00
Field G. Van Zee
7b3bf1ffcd Merge branch 'master' into compose 2016-09-06 15:47:13 -05:00
Field G. Van Zee
121c39d455 Added complex gemm micro-kernels for haswell.
Details:
- Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based
  architectures. As with their real domain brethren, these kernels perfer
  row storage, (though this doesn't affect most users due to high-level
  optimizations in most level-3 operations that induce a transpose to
  whatever storage preference the kernel may have).
2016-09-05 13:11:42 -05:00
Field G. Van Zee
35509818cb Added, moved some thread barriers.
Details:
- Removed thread barriers from the end of the loop bodies of
  bli_gemm_blk_var1(), bli_gemm_blk_var2(), bli_trsm_blk_var1(),
  and bli_trsm_blk_var2().
- Moved the thread barrier at the end of bli_packm_int() to the
  end of bli_l3_packm(), and added missing barriers to that function.
- Removed the no longer necessary (and now incorrect) ochief guard
  in bli_gemm3m3_packa() on the bli_obj_scalar_reset() on C.
- Thanks to Tyler Smith for help with these changes.
2016-08-31 17:34:15 -05:00
sthangar
64598ee4cf fixed the symlink issue
Change-Id: I2186d529f295c576597c189e1ae219bc1a83f955
2016-08-31 12:54:50 +05:30
Field G. Van Zee
abd61f9fa7 Updated BLIS4 TOMS citation in README.md. 2016-08-30 12:34:19 -05:00
sthangar
8a2373f26b Norm 2 optimization
Change-Id: Ide9decaccd20bf0ccc32c9abb6556e038dceed2b
2016-08-29 14:28:39 +05:30
sthangar
fdc6639023 Placed 1 and 1f AMD optimized AVX routines under zen folder
Change-Id: I26795211ef11d232ed794ce36dd0a9c1f8706328
2016-08-29 10:43:38 +05:30
Field G. Van Zee
701b9aa3ff Redesigned control tree infrastructure.
Details:
- Altered control tree node struct definitions so that all nodes have the
  same struct definition, whose primary fields consist of a blocksize id,
  a variant function pointer, a pointer to an optional parameter struct,
  and a pointer to a (single) sub-node. This unified control tree type is
  now named cntl_t.
- Changed the way control tree nodes are connected, and what computation
  they represent, such that, for example, packing operations are now
  associated with nodes that are "inline" in the tree, rather than off-
  shoot braches. The original tree for the classic Goto gemm algorithm was
  expressed (roughly) as:

    blk_var2 -> blk_var3 -> blk_var1 -> ker_var2
                         |           |
                         -> packb    -> packa

  and now, the same tree would look like:

    blk_var2 -> blk_var3 -> packb -> blk_var1 -> packa -> ker_var2

  Specifically, the packb and packa nodes perform their respective packing
  operations and then recurse (without any loop) to a subproblem. This means
  there are now two kinds of level-3 control tree nodes: partitioning and
  non-partitioning. The blocked variants are members of the former, because
  they iteratively partition off submatrices and perform suboperations on
  those partitions, while the packing variants belong to the latter group.
  (This change has the effect of allowing greatly simplified initialization
  of the nodes, which previously involved setting many unused node fields to
  NULL.)
- Changed the way thrinfo_t tree nodes are arranged to mirror the new
  connective structure of control trees. That is, packm nodes are no longer
  off-shoot branches of the main algorithmic nodes, but rather connected
  "inline".
- Simplified control tree creation functions. Partitioning nodes are created
  concisely with just a few fields needing initialization. By contrast, the
  packing nodes require additional parameters, which are stored in a
  packm-specific struct that is tracked via the optional parameters pointer
  within the control tree struct. (This parameter struct must always begin
  with a uint64_t that contains the byte size of the struct. This allows
  us to use a generic function to recursively copy control trees.) gemm,
  herk, and trmm control tree creation continues to be consolidated into
  a single function, with the operation family being used to select
  among the parameter-agnostic macro-kernel wrappers. A single routine,
  bli_cntl_free(), is provided to free control trees recursively, whereby
  the chief thread within a groups release the blocks associated with
  mem_t entries back to the memory broker from which they were acquired.
- Updated internal back-ends, e.g. bli_gemm_int(), to query and call the
  function pointer stored in the current control tree node (rather than
  index into a local function pointer array). Before being invoked, these
  function pointers are first cast to a gemm_voft (for gemm, herk, or trmm
  families) or trsm_voft (for trsm family) type, which is defined in
  frame/3/bli_l3_var_oft.h.
- Retired herk and trmm internal back-ends, since all execution now flows
  through gemm or trsm blocked variants.
- Merged forwards- and backwards-moving variants by querying the direction
  from routines as a function of the variant's matrix operands. gemm and
  herk always move forward, while trmm and trsm move in a direction that
  is dependent on which operand (a or b) is triangular.
- Added functions bli_thread_get_range_mdim(), bli_thread_get_range_ndim(),
  each of which takes additional arguments and hides complexity in managing
  the difference between the way ranges are computed for the four families
  of operations.
- Simplified level-3 blocked variants according to the above changes, so that
  the only steps taken are:
  1. Query partitioning direction (forwards or backwards).
  2. Prune unreferenced regions, if they exist.
  3. Determine the thread partitioning sub-ranges.
  <begin loop>
    4. Determine the partitioning blocksize (passing in the partitioning
       direction)
    5. Acquire the curren iteration's partitions for the matrices affected
       by the current variants's partitioning dimension (m, k, n).
    6. Call the subproblem.
  <end loop>
- Instantiate control trees once per thread, per operation invocation.
  (This is a change from the previous regime in which control trees were
  treated as stateless objects, initialized with the library, and shared
  as read-only objects between threads.) This once-per-thread allocation
  is done primarily to allow threads to use the control tree as as place
  to cache certain data for use in subsequent loop iterations. Presently,
  the only application of this caching is a mem_t entry for the packing
  blocks checked out from the memory broker (allocator). If a non-NULL
  control tree is passed in by the (expert) user, then the tree is copied
  by each thread. This is done in bli_l3_thread_decorator(), in
  bli_thrcomm_*.c.
- Added a new field to the context, and opid_t which tracks the "family"
  of the operation being executed. For example, gemm, hemm, and symm are
  all part of the gemm family, while herk, syrk, her2k, and syr2k are
  all part of the herk family. Knowing the operation's family is necessary
  when conditionally executing the internal (beta) scalar reset on on
  C in blocked variant 3, which is needed for gemm and herk families,
  but must not be performed for the trmm family (because beta has only
  been applied to the current row-panel of C after the first rank-kc
  iteration).
- Reexpressed 3m3 induced method blocked variant in frame/3/gemm/ind
  to comform with the new control tree design, and renamed the macro-
  kernel codes corresponding to 3m2 and 4m1b.
- Renamed bli_mem.c (and its APIs) to bli_memsys.c, and renamed/relocated
  bli_mem_macro_defs.h from frame/include to frame/base/bli_mem.h.
- Renamed/relocated bli_auxinfo_macro_defs.h from frame/include to
  frame/base/bli_auxinfo.h.
- Fixed a minor bug whereby the storage-to-ukr-preference matching
  optimization in the various level-3 front-ends was not being applied
  properly when the context indicated that execution would be via an
  induced method. (Before, we always checked the native micro-kernel
  corresponding to the datatype being executed, whereas now we check
  the native micro-kernel corresponding to the datatype's real projection,
  since that is the micro-kernel that is actually used by induced methods.
- Added an option to the testsuite to skip the testing of native level-3
  complex implementations. Previously, it was always tested, provided that
  the c/z datatypes were enabled. However, some configurations use
  reference micro-kernels for complex datatypes, and testing these
  implementations can slow down the testsuite considerably.
2016-08-26 19:04:45 -05:00
Kiran Varaganti
a58dd35ed7 Implemented trsm single precision for lower triangular matrices, files added bli_trsm_l_int_6x16.cfiles modified bli_kernel.h to enable optimized trsm microkernel and test_trsm.c is modified to test trsm single precision
Change-Id: Ibddf989f4aad577e89558673e1038cf6ece654d9
2016-08-26 14:55:12 +05:30
Field G. Van Zee
73517f522b Merge branch 'master' into compose 2016-08-23 13:46:59 -05:00
Field G. Van Zee
50293da38d Avoid compiling BLAS/CBLAS files when disabled.
Details:
- Updated the top-level Makefile, build/config.mk.in template, and
  configure script so that object files corresponding to source files
  belonging to the BLAS compatibility layer are not compiled (or archived)
  when the compatibility layer is disabled. (Same for CBLAS.) Thanks
  to Devin Matthews for suggesting this optimization.
- Slight change to the way configure handles internal variables. Instead
  of converting (overwriting) some, such as enable_blas2blis and
  enable_cblas, from a "yes" or "no" to a "1" or "0" value, the latter are
  now stored in new variables that live alongside the originals (with the
  suffix "_01").  This is convenient since some values need to be
  sed-substituted into the config.mk.in template, which requires "yes" or
  "no", while some need to be written to the bli_config.h.in template,
  which requires "0" or "1".
2016-08-23 13:38:36 -05:00
praveeng
22dd6a353d Merge master code as on 2016_08_23 to amd-staging branch by praveeng
Changes to be committed:
	modified:   frame/thread/bli_mutex_openmp.h
	modified:   frame/thread/bli_mutex_pthreads.h

Change-Id: Ica522edbb1d0173f53f38d5057b1f7aef73666be
2016-08-23 15:16:49 +05:30
Field G. Van Zee
c6f5c215ee Merge branch 'master' into compose 2016-08-22 17:33:02 -05:00
praveeng
f20ed3885d Merge branch 'master' of https://github.com/clMathLibraries/blis-amd for "Fixed bugs in bli_mutex_init() and friends." 2016-08-22 15:27:33 +05:30
praveeng
02ac597e4b Revert commits 357c990bdd
Change-Id: I12a34456d7eed93fda4369e76bcddb42ba7ccb99
2016-08-22 15:12:27 +05:30
praveeng
84e41cc73c Revert commits 8aee306
Change-Id: I3dd999c77c6779332a40dbb84371ca487216f189
2016-08-22 15:12:27 +05:30
praveeng
30ccfcee82 removed changes from readme file which are giving confilcts
Change-Id: Ic71ad1313e1404fed444e899466043704d875af6
2016-08-22 15:12:27 +05:30
praveeng
aeca25cd63 first commit
Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2
2016-08-22 15:12:27 +05:30
praveeng
6b2274864b small modification to readme for git push test
Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a
2016-08-22 15:12:27 +05:30
praveeng
daa7a9ecb2 first commit
Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2
2016-08-22 15:12:27 +05:30
praveeng
5f66a4aa05 small modification to readme for git push test
Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a
2016-08-22 15:11:44 +05:30
praveeng
c6cbd78d23 Revert commits 357c990bdd
Change-Id: I12a34456d7eed93fda4369e76bcddb42ba7ccb99
2016-08-22 15:11:09 +05:30
praveeng
9219a90607 Revert commits 8aee306
Change-Id: I3dd999c77c6779332a40dbb84371ca487216f189
2016-08-22 15:11:09 +05:30
praveeng
728573296e removed changes from readme file which are giving confilcts
Change-Id: Ic71ad1313e1404fed444e899466043704d875af6
2016-08-22 15:11:09 +05:30
praveeng
ad7862e291 first commit
Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2
2016-08-22 15:11:09 +05:30
praveeng
ad4b471a25 small modification to readme for git push test
Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a
2016-08-22 15:09:22 +05:30
praveeng
55d641363f first commit
Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2
2016-08-22 15:08:00 +05:30
praveeng
f3b6b15f6d small modification to readme for git push test
Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a
2016-08-22 15:08:00 +05:30
Field G. Van Zee
16a4c7a823 Fixed bugs in bli_mutex_init() and friends.
Details:
- Fixed a couple of bugs that affected OpenMP and POSIX threads
  configurations that resulted in compiler errors and warnings due
  to type mismatch, and in the case of pthreads, a missing function
  argument. The bugs are fairly recent, introduced in a017062.
2016-08-19 11:38:36 -05:00
Devin Matthews
c8e4ef9395 Add prefetchw to 30x8 kernel. 2016-08-03 16:13:03 -05:00
Devin Matthews
4b5a2f3d6e Merge remote-tracking branch 'origin/knl' into knl
# Conflicts:
#	kernels/x86_64/knl/3/bli_dgemm_opt_24x8.c
2016-08-03 16:09:51 -05:00
Devin Matthews
380736bfe9 Add (new) 30x8 KNL kernel and fix non-scatter prefetch bug. 2016-08-03 16:08:28 -05:00
Devin Matthews
9f52a587de Try prefetchw[t1] instead of regular prefetch for C. 2016-08-03 16:03:53 -05:00