Commit Graph

514 Commits

Author SHA1 Message Date
Tyler Smith
ddf62ba7d2 Refuse to free the packm thread info if it uses the single threaded version 2015-03-27 14:27:51 -05:00
Tyler Smith
016fc58758 Don't free packm thread info if it is null 2015-03-27 14:23:02 -05:00
Tyler Smith
00a443c529 Use bli_malloc instead of malloc for the thread info paths 2015-03-27 14:11:07 -05:00
Field G. Van Zee
f1a6b7d028 Reorganized code for induced complex methods.
Details:
- Consolidated most of the code relating to induced complex methods
  (e.g. 4mh, 4m1, 3mh, 3m1, etc.) into frame/ind. Induced methods
  are now enabled on a per-operation basis. The current "available"
  (enabled and implemented) implementation can then be queried on
  an operation basis. Micro-kernel func_t objects as well as blksz_t
  objects can also be queried in a similar manner.
- Redefined several micro-kernel and operation-related functions in
  bli_info_*() API, in accordance with above changes.
- Added mr and nr fields to blksz_t object, which point to the mr
  and nr blksz_t objects for each cache blocksize (and are NULL for
  register blocksizes). Renamed the sub-blocksize field "sub" to
  "mult" since it is really expressing a blocksize multiple.
- Updated bli_*_determine_kc_[fb]() for gemm/hemm/symm, trmm, and
  trsm to correctly query mr and nr (for purposes of nudging kc).
- Introduced an enumerated opid_t in bli_type_defs.h that uniquely
  identifies an operation. For now, only level-3 id values are defined,
  along with a generic, catch-all BLIS_NOID value.
- Reworked testsuite so that all induced methods that are enabled
  are tested (one at a time) rather than only testing the first
  available method.
- Reformatted summary at the beginning of testsuite output so that
  blocksize and micro-kernel info is shown for each induced method
  that was requested (as well as native execution).
- Reduced the number of columns needed to display non-matlab
  testsuite output (from approx. 90 to 80).
2015-03-18 15:37:10 -05:00
Field G. Van Zee
8d5169ccda Fixed bug in release of mem_t buffer.
Details:
- Fixed a bug that affects all level-2 and level-3 blocked variants. The
  bug only manifested, however, if the packing of operands (A and B in
  gemm, for example) spanned multiple nodes in the control tree. Until
  recently, the main consumers of packm were level-3 operations, all of
  which packed both input operands from blocked variant 1 (B outside of
  the loop, and A within the loop). This particular usage masked a flaw
  in the code whereby bli_obj_release_pack() would always release the
  underlying mem_t buffer (provided it was allocated), even if the buffer
  was not allocated in the current variant. This has been fixed by
  replacing all calls to bli_obj_release_pack() with calls to a new
  function, bli_packm_release(), which takes the same control tree node
  argument passed into the object's corresponding call to packm_init()
  or packv_init(). bli_packm_release() then proceeds to invoke
  bli_obj_release_pack() only if the control tree node indicates that
  packing was requested. Thanks to Devangi Parikh for identifying this
  bug.
2015-03-18 11:38:08 -05:00
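The guarded release described in commit 8d5169ccda can be illustrated with a minimal sketch. The types and function names below (`packed_obj_t`, `packm_cntl_t`, `does_pack`, `obj_release_pack`, `packm_release`) are hypothetical stand-ins, not the BLIS definitions; the sketch only assumes that the control tree node can report whether packing was requested in the current variant.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an object that owns a packing buffer. */
typedef struct { void *buf; } packed_obj_t;

/* Hypothetical stand-in for a packm control tree node. */
typedef struct { bool does_pack; } packm_cntl_t;

/* Unconditional release: the behavior that caused the bug when called
   from a variant that did not allocate the buffer. */
static void obj_release_pack(packed_obj_t *p)
{
    free(p->buf);
    p->buf = NULL;
}

/* Guarded release: only free the buffer if this node requested packing. */
static void packm_release(packed_obj_t *p, const packm_cntl_t *cntl)
{
    if (cntl != NULL && cntl->does_pack)
        obj_release_pack(p);
}
```

With this shape, a variant whose node did not request packing leaves the buffer alone, so a buffer allocated higher up in the control tree survives until its owner releases it.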
Field G. Van Zee
c0acca0f51 Clarified comments in testsuite input.operations. 2015-03-03 10:56:22 -06:00
Field G. Van Zee
03ba9a6b17 Removed some 'old' directories. 2015-02-24 10:33:28 -06:00
Field G. Van Zee
a86db60ee2 Extensive renaming of 3m/4m-related files, symbols.
Details:
- Renamed all remaining 3m/4m packing files and symbols to 3mi/4mi
  ('i' for "interleaved"). Similar changes to 3M/4M macros.
- Renamed all 3m/4m files and functions to 3m1/4m1.
- Whitespace changes.
2015-02-23 18:42:39 -06:00
Field G. Van Zee
8cf8da291a Minor updates to induced complex mode management.
Details:
- Relocated bli_4mh.c, bli_4mb.c, bli_4m.c, bli_3mh.c, bli_3m.c (and
  associated headers) from frame/base to frame/base/induced.
- Added bli_xm.? to frame/base/induced, which implements
  bli_xm_is_enabled(), which detects whether ANY induced complex method
  is currently enabled.
- The new function bli_xm_is_enabled() is now used in bli_info.c to
  detect when an induced complex method is used, so we know when to
  return blocksizes from one of the induced methods' blocksize objects.
2015-02-20 15:24:27 -06:00
Tyler Michael Smith
411e637ee7 Merge branch 'master' of http://github.com/flame/blis 2015-02-20 20:39:25 -06:00
Tyler Michael Smith
c2569b8803 Fixed a memory leak in freeing the thread infos 2015-02-20 20:38:19 -06:00
Field G. Van Zee
fc0b771227 Added max(mr,nr) to kc in static mem pools.
Details:
- Changed the static memory definitions to compute the maximum register
  blocksize for each datatype and add it to kc when computing the size
  of blocks of A and B. This formally accounts for the nudging of kc
  up to a multiple of mr or nr at runtime for triangular operations
  (e.g. trmm).
2015-02-20 11:47:44 -06:00
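As a rough illustration of the sizing change in commit fc0b771227, the sketch below pads kc by max(mr, nr) when computing the size of a packed block of A. The function names and parameter layout are assumptions made for this example; the actual BLIS definitions are preprocessor macros and are not reproduced here.

```c
#include <stddef.h>

static size_t max_sz(size_t a, size_t b)
{
    return a > b ? a : b;
}

/* Hypothetical helper: bytes reserved for one mc x kc packed block of A,
   with kc padded by max(mr, nr) so that a runtime nudge of kc up to a
   multiple of mr or nr still fits in the block. */
static size_t block_a_bytes(size_t mc, size_t kc,
                            size_t mr, size_t nr, size_t elem_size)
{
    return mc * (kc + max_sz(mr, nr)) * elem_size;
}
```

Padding by the full register blocksize is slightly conservative, since rounding up to a multiple of mr or nr never adds more than one less than that blocksize.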
Tyler Michael Smith
af32e3a608 Fixed a bug where get_range_weighted would return end = 0 for small problem sizes 2015-02-19 22:51:11 -06:00
Field G. Van Zee
441d47542a Renamed 3m and 4m symbols/macros to 3mi and 4mi.
Details:
- Renamed several variables and macros from 3m/4m to 3mi/4mi. This is
  because those packing schemas were always implicitly "interleaved".
  This new naming scheme will make way for new schemas that separate
  instead of interleave the real and imaginary (and summed) parts.
- Expanded the pack format sub-field of the pack schema field of the
  info_t to 4 bits (from 3). This will allow for more schema types
  going forward.
- Removed old _cntl.c files for herk3m, herk4m, trmm3m, trmm4m.
2015-02-19 17:06:10 -06:00
Field G. Van Zee
518a1756cc Fixed indexing bug for trmm3 via 3mh, 4mh.
Details:
- Fixed a bug that only affected trmm3 when performed via 3mh or 4mh,
  whereby micro-panels of the triangular matrix were packed with "dead
  space" between them due to failing to adjust for the fact that pointer
  arithmetic was occurring in units of complex elements while the data
  being packed consisted of real elements. It turns out that the macro-
  kernel suffered from the same bug, meaning the panels were actually
  being packed and read consistently. The only way I was able to
  discover the bug in the first place was because the packed block of A
  was overflowing into the beginning of the packed row panel of B using
  the sandybridge configuration.
2015-02-19 14:27:09 -06:00
Field G. Van Zee
493087d730 Merge branch 'master' of github.com:flame/blis 2015-02-18 09:45:51 -06:00
Field G. Van Zee
25021299b6 Merge branch 'master' of github.com:flame/blis 2015-02-11 20:03:21 -06:00
Field G. Van Zee
fe2b8d39a4 Fixed an obscure bug in 3mh/3m/4mh/4m packing.
Details:
- Modified bli_packm_blk_var1.c and _var2.c to increase the triangular
  case's panel increment by 1 if it would otherwise be odd. This is
  particularly necessary in _var2.c when handling the interleaved 3m
  or ro/io/rpi pack schemas, since division of an odd number by 2 can
  happen if both the panel length and the panel packing dimension
  (register packing blocksize) are odd, thus making their product odd.
- Modified bli_packm_init.c so that panel strides are increased by 1
  if they would otherwise be odd, even for non-3m related packing.
- Modified the trmm and trsm macro-kernels so that triangular packed
  micro-panels are traversed with this new "increment by 1 if odd"
  policy.
- Added sanity checks in trmm and trsm macro-kernels that would result
  in an abort() if the conditions that would lead to a "divide odd
  integer by 2" scenario ever manifest.
- Defined bli_is_odd(), _is_even() macros in bli_scalar_macro_defs.h.
2015-02-11 19:33:10 -06:00
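The "increment by 1 if odd" policy from commit fe2b8d39a4 can be sketched as follows. The names `is_odd` and `panel_stride` are illustrative only (the commit defines `bli_is_odd()` as a macro, which is not reproduced here), and the stride is assumed to be the product of the panel length and the packing register blocksize, as the message describes.

```c
#include <assert.h>
#include <stddef.h>

static int is_odd(size_t n)
{
    return (n % 2) == 1;
}

/* Hypothetical helper: stride between packed micro-panels. If the product
   of the two dimensions is odd, bump it by one so that a later division
   by 2 is exact. */
static size_t panel_stride(size_t panel_len, size_t pack_dim)
{
    size_t ps = panel_len * pack_dim;
    if (is_odd(ps))
        ps += 1;
    assert(!is_odd(ps)); /* sanity check, in the spirit of the commit's abort() guards */
    return ps;
}
```

For example, a panel length of 5 with a packing dimension of 3 gives a product of 15, which is bumped to 16; a product that is already even is left unchanged.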
Field G. Van Zee
650d2a6ff2 Added initial support for imaginary stride.
Details:
- Added an imaginary stride field ("is") to obj_t.
- Renamed bli_obj_set_incs() macro to bli_obj_set_strides().
- Defined bli_obj_imag_stride() and bli_obj_set_imag_stride() and
  added invocations in key locations.
- Added some basic error-checking related to imaginary stride.
- For now, imaginary stride will not be exposed into the most-used
  BLIS APIs such as bli_obj_create(), and certainly not the
  computational APIs such as bli_dgemm().
2015-02-09 14:59:20 -06:00
Field G. Van Zee
f05a57634a Defined gemm cntl function to query ukrs func_t.
Details:
- Added a new function, bli_gemm_cntl_ukrs(), that returns the func_t*
  for the gemm micro-kernels from the leaf node of the control tree.
  This allows all the func_t* fields from higher-level nodes in the tree
  to be NULL, which makes the function that builds the control trees
  slightly easier to read.
- Call bli_gemm_cntl_ukrs() instead of the cntl_gemm_ukrs() macro in
  all bli_*_front() functions (which is needed to apply the row/column
  preference optimization).
- In all level-3 bli_*_cntl_init() functions, changed the _obj_create()
  function arguments corresponding to the gemm_ukrs fields in higher-
  level cntl tree nodes to NULL.
- Removed some old her2k macro-kernels.
2015-02-08 19:40:34 -06:00
Tyler Smith
cefd3d5d20 A couple of functions were incorrectly ifdeffed away on Xeon Phi. Fixed this 2015-02-05 11:09:12 -06:00
Field G. Van Zee
7574c9947d Added basic flop-counting mechanism (level-3 only).
Details:
- Added optional flop counting to all level-3 front-ends, which is
  enabled via BLIS_ENABLE_FLOP_COUNT. The flop count can be
  reset at any time via bli_flop_count_reset() and queried via
  bli_flop_count(). Caveats:
  - flop counts are approximate for her[2]k, syr[2]k, trmm, and
    trsm operations;
  - flop counts ignore extra flops due to non-unit alpha;
  - flop counts do not account for situations where beta is zero.
2015-02-04 12:11:55 -06:00
Field G. Van Zee
ceda4f27d1 Implemented bli_obj_imag_equals().
Details:
- Implemented a new function, bli_obj_imag_equals(), which compares the
  imaginary part of the first argument to the second argument, which may
  be a BLIS_CONSTANT or of a regular real datatype.
2015-01-29 13:22:54 -06:00
Field G. Van Zee
81114824a0 Minor 4m/3m consolidation to mem_pool_macro_defs.h.
Details:
- Merged the 4m and 3m definitions in bli_mem_pool_macro_defs.h to
  reduce code and improve readability.
2015-01-06 12:15:21 -06:00
Tyler Michael Smith
36a9b7b743 reduced the default number of MC by KC blocks for bgq 2014-12-17 21:55:50 +00:00
Field G. Van Zee
c60619c7c3 Minor tweaks for 3m4m test drivers.
Details:
- Changed gemm_kc blocksizes to be reduced by two-thirds instead of
  half.
- Changed 3m4m/test_gemm.c driver to divide by 3 instead of 2 when
  computing the fixed k dimension.
- Fixed runme.sh so that it would use multiple threads for s/dgemm
  cases.
2014-12-16 17:08:22 -06:00
Field G. Van Zee
c6929ba6a5 Added 4m_1b to test/3m4m test driver and script. 2014-12-16 11:27:50 -06:00
Field G. Van Zee
785d480805 Merge branch 'master' of github.com:flame/blis 2014-12-12 14:34:19 -06:00
Field G. Van Zee
9456f330af Added 4m_1b implementation for gemm.
Details:
- Added yet another 4m-based implementation for complex domain level-3
  operations. This method, which the 3m/4m paper identifies as Algorithm
  "4m_1b" fissures the first loop around the micro-kernel so that the
  real sub-panel of the current micro-panel of B is multiplied against
  (both sub-panels of) all micro-panels of A, before doing the same for
  the imaginary sub-panel of the micro-panel of B. For now, only gemm is
  supported, and 4m_1b (labeled "4mb" within the framework) is not yet
  integrated into the test suite.
2014-12-12 14:31:57 -06:00
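The loop ordering described in commit 9456f330af can be illustrated at a much coarser granularity than the real implementation. The sketch below is not the BLIS code: it works on whole row-major matrices with separately stored real and imaginary parts rather than on packed micro-panels, and the function name is hypothetical. It only shows the idea of using the real part of B against both parts of A before touching the imaginary part of B.

```c
#include <stddef.h>

/* C += A * B for complex matrices stored as split real/imaginary arrays
   (row-major; A is m x k, B is k x n, C is m x n). */
static void gemm_4m_split_b(size_t m, size_t n, size_t k,
                            const double *a_r, const double *a_i,
                            const double *b_r, const double *b_i,
                            double *c_r, double *c_i)
{
    /* Pass 1: real part of B against both parts of A. */
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            for (size_t p = 0; p < k; ++p) {
                c_r[i * n + j] += a_r[i * k + p] * b_r[p * n + j];
                c_i[i * n + j] += a_i[i * k + p] * b_r[p * n + j];
            }

    /* Pass 2: imaginary part of B against both parts of A. */
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            for (size_t p = 0; p < k; ++p) {
                c_r[i * n + j] -= a_i[i * k + p] * b_i[p * n + j];
                c_i[i * n + j] += a_r[i * k + p] * b_i[p * n + j];
            }
}
```

Each pass performs two real multiplications per term, for four in total, which is where the "4m" name comes from.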
Field G. Van Zee
4156c0880d Fixed obscure level-2 packing / general stride bug.
Details:
- Fixed a bug in certain structured level-2 operations that manifested
  only when the structured matrix was provided to BLIS as a matrix stored
  with general stride. The bug was introduced in c472993b when the
  densify field was removed from the packm control tree node and
  associated APIs. Since then, the packed object was unconditionally
  marked with an uplo field of BLIS_DENSE. This is fine for level-3
  operations where micro-panels are always densified, but in level-2
  contexts, the underlying unblocked variant (fused or unfused) of
  structured operations (e.g. trmv) still needs to know whether to
  execute its "lower" or "upper" branches of code. Since this field
  was unconditionally being set to BLIS_DENSE, the unblocked variants
  always executed the "else" branch, which happened to be the
  "lower" case code. Thus, running an upper case produced the wrong
  answer. This most obviously manifested in the form of failures for
  trmm, trmm3, and trsm in the test suite.
  The bug was fixed by setting the packed object's uplo field to
  BLIS_DENSE only if the schema indicated that micro-panels were to be
  packed. Otherwise, we can assume we are packing to regular row or
  column storage, as is the case with level-2 packing. Thanks to
  Francisco Igual for reporting the testsuite failures and ultimately
  leading us to this bug.
2014-12-09 16:03:14 -06:00
Field G. Van Zee
689f60a578 Merge pull request #21 from figual/master
Adding armv8a configuration and micro-kernels.
2014-12-07 14:03:30 -06:00
Francisco D. Igual
483e4d6a3f Adding armv8a configuration and micro-kernels.
Only sgemm micro-kernel is fully functional at this point.
2014-12-07 20:27:49 +01:00
Tyler Smith
bef24e67e0 Fixed a type of race condition exposed by pthreads implementation.
The lead thread of the inner thread communicator could exit its subproblem, move on to the next iteration of the loop, and modify a1_pack, b1_pack, or c1_pack while other threads were still using them.

Barriers were inserted to fix this.
2014-11-26 18:00:56 -06:00
Field G. Van Zee
76bde44411 Merge branch 'master' of github.com:flame/blis 2014-11-26 17:25:24 -06:00
Tyler Michael Smith
f3d729e504 Added static mutex to bli_init and bli_finalize 2014-11-26 22:25:24 -06:00
Tyler Michael Smith
d71cc79786 Refactored bli_threading files and added support for pthreads 2014-11-26 21:36:39 -06:00
Field G. Van Zee
e56e61438f Minor cleanups to bli_threading.h and friends.
Details:
- No longer need to define BLIS_ENABLE_MULTITHREADING manually in
  bli_config.h; it now gets defined when BLIS_ENABLE_OPENMP or
  BLIS_ENABLE_PTHREADS is defined.
- Added sanity check to prevent both BLIS_ENABLE_OPENMP and
  BLIS_ENABLE_PTHREADS from being enabled simultaneously.
- Reorganization of bli_threading*.h header files, which led to
  simplification of threading-related part of blis.h.
- Added "-fopenmp -lpthread" to LDFLAGS of sandybridge make_defs.mk
  file.
2014-11-26 17:20:35 -06:00
Field G. Van Zee
3be2744cbe Update to template gemm ukernel comments.
Details:
- Updated comments on alignment of a1 and b1 to match wiki.
2014-11-21 12:28:08 -06:00
Field G. Van Zee
994429c688 Merge pull request #20 from TimmyLiu/master
#define PASTEF773 required by cblas compatibility layer
2014-11-20 13:55:35 -06:00
Timmy
694029d9d7 #define PASTEF773 required by cblas compatibility layer 2014-11-19 15:25:14 -06:00
Field G. Van Zee
58796abda6 Removed KC constraint comments from _kernel.h files.
Details:
- Since 4674ca8c, the constraint that KC be a multiple of both MR and
  NR has been relaxed, and thus it was time to remove the comments
  from the top of the bli_kernel.h files of all configurations.
2014-11-06 14:31:52 -06:00
Field G. Van Zee
7bbc95a54f Added new piledriver micro-kernels.
Details:
- Added new micro-kernels for the AMD piledriver architecture (one
  for each datatype).
- Updates and tweaks to piledriver configuration.
- Added 3xk packm micro-kernel support.
- Explicitly unrolled some of the smaller packm micro-kernels.
- Added notes to avx/sandybridge and piledriver micro-kernel files
  acknowledging the influence of the corresponding kernel code in
  OpenBLAS.
2014-10-29 10:52:23 -05:00
Field G. Van Zee
59613f1d55 Added separate micro-panel alignment for A and B.
Details:
- Changed the recently-added micro-panel alignment macros so that we now
  have two sets--one for micro-panels of matrix A and one for micro-
  panels of matrix B: BLIS_UPANEL_[AB]_ALIGN_SIZE_?.
- Store each set of alignment values into a separate blksz_t object in
  bli_gemm_cntl_init().
- Adjusted packm_init() to use the separate alignment values.
- Added query routines for the new alignment values to bli_info.c.
- Modified test suite output accordingly.
2014-10-23 17:21:37 -05:00
Field G. Van Zee
a8e12884ee CHANGELOG update (0.1.6) 2014-10-23 11:35:48 -05:00
Field G. Van Zee
38ea5022e4 Version file update (0.1.6) 0.1.6 2014-10-23 11:35:45 -05:00
Field G. Van Zee
a3e6341bdb Factored common code from blocksize functions.
Details:
- Split bli_determine_blocksize_[fb]() into two functions each, the
  newer ones ending with the _sub suffix. These new sub-functions are
  now called from bli_[gemm|trmm|trsm]_determine_kc_[fb](), which
  eliminates redundant code and will allow any future tweaks to the
  core sub-functions to automatically be inherited by the operation-
  specific versions.
2014-10-23 11:13:28 -05:00
Field G. Van Zee
4674ca8cff Extended newly relaxed KC to hemm, symm.
Details:
- These changes were intended for the previous commit.
- Defined bli_gemm_determine_kc_[fb](), which determine blocksizes for
  gemm-based operations, taking special
  care to "nudge" the kc dimension up to a multiple of MR or NR for
  hemm and symm operations, as needed.
- Changed bli_gemm_blk_var3f.c to call bli_gemm_determine_kc_f()
  instead of bli_determine_blocksize_f().
- Comment updates to bli_trmm_blocksize.c, bli_trsm_blocksize.c.
2014-10-23 10:50:59 -05:00
Field G. Van Zee
ab954ba6f8 Relaxed constraint that KC be multiple of MR, NR.
Details:
- Relaxed a long-held requirement in register blocksizes that required
  the kernel programmer to choose a KC that was divisible by both MR
  and NR. This was very constraining on some architectures that did not
  use register blocksizes that were powers of two. The constraint is
  now enforced only for trmm and trsm, where it is needed, and it is
  now handled by "nudging" kc upward at runtime, if necessary, to be a
  multiple of MR or NR, as needed.
- Defined bli_trmm_determine_kc_[fb]() and bli_trsm_determine_kc_[fb](),
  which determine blocksizes for trmm and trsm, taking special care to
  "nudge" the kc dimension up to a multiple of MR or NR, as needed.
- Changed bli_trmm_blk_var3[fb].c to call bli_trmm_determine_kc_[fb]()
  instead of bli_determine_blocksize_[fb]().
- Added safeguard to bli_align_dim_to_mult() that returns the dimension
  unmodified if the dimension multiple is zero (to avoid division by
  zero).
- Removed cpp guard/check for KC % MR == 0 and KC % NR == 0 from
  bli_kernel_macro_defs.h.
- Whitespace, variable name changes to bli_blocksize.c.
- Removed old commented code from bli_gemm_cntl.c.
2014-10-23 10:12:27 -05:00
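The runtime "nudging" of kc described in commit ab954ba6f8 amounts to rounding a dimension up to the next multiple of a register blocksize, with a guard for a zero multiple. The sketch below is a minimal illustration under that reading; `align_dim_to_mult` and `nudge_kc` are hypothetical names chosen for this example and are not the BLIS functions themselves.

```c
#include <assert.h>
#include <stddef.h>

/* Round dim up to the next multiple of mult. A zero multiple returns dim
   unmodified, which avoids a division by zero. */
static size_t align_dim_to_mult(size_t dim, size_t mult)
{
    if (mult == 0)
        return dim;
    return ((dim + mult - 1) / mult) * mult;
}

/* Nudge kc upward for a triangular operation. reg_blksz stands for
   whichever of MR or NR applies; choosing between them is outside the
   scope of this sketch. */
static size_t nudge_kc(size_t kc, size_t reg_blksz)
{
    size_t kc_nudged = align_dim_to_mult(kc, reg_blksz);
    assert(kc_nudged >= kc); /* nudging never shrinks kc */
    return kc_nudged;
}
```

For instance, a kc of 256 with a register blocksize of 6 is nudged to 258, while a kc that is already a multiple is returned unchanged.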
Tyler Smith
95cdae65d6 Fixed bug in KNC microkernel where k=0 and beta != 1 2014-10-22 16:30:16 -05:00
Field G. Van Zee
e64dba5633 Re-implemented micro-panel alignment.
Details:
- This commit re-implements a feature that was removed in commit
  c2b2ab62. It was removed because, at the time, I wasn't sure how the
  micro-panel alignment feature would interact with the 4m method (when
  applied at the micro-kernel level), and so it seemed safer to disable
  the feature entirely rather than allow possible breakage. This commit
  revisits the issue and safely re-implements the feature in a way that
  is compatible with 4m, 3m, 4mh, and 3mh (and native execution).
- Modified the static memory pool to account for micro-panel alignment
  space.
- Modified packm_init and blocked variants to align whole micro-panels
  by a datatype-specific alignment value that may be set by the
  configuration. (If it is not set by the configuration, it will default
  to BLIS_SIZEOF_?.)
- Modified macro-kernels so that:
  - storage stride is handled properly given the new micro-panel
    alignment behavior;
  - indexing through 3m/4m/rih-type sub-panels, as is done by trmm and
    trsm, is more robust (e.g. will work if the applicable packing
    register blocksize is odd);
  - imaginary strides are computed and stored within auxinfo_t structs,
    which allows the virtual micro-kernels to more easily determine how
    to index into the micro-panel operands.
- Modified virtual 3m and 4m micro-kernels to use the imaginary strides
  within the auxinfo_t structs instead of panel strides.
- Deprecated the panel stride fields from the auxinfo_t structs.
- Updated test suite to print out the micro-panel alignment values.
2014-10-20 19:23:06 -05:00