Commit Graph

288 Commits

Author SHA1 Message Date
Tyler Michael Smith
73b3db5948 Some fixes for the bgq configuration 2014-03-26 15:39:05 +00:00
Tyler Smith
f0824a04fc Initial commit to enable threading in TRSM.
Also enabled weighted partitioning for herk, trmm
Fixed bug where multiple threads would try to modify the same state in the internal level 3 functions
Correctly computed a_next and b_next for gemm, herk macrokernels
a_next and b_next point to the current micropanels in trmm
2014-03-24 15:21:42 -05:00
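The commit above does not say how weighted partitioning is computed. As an illustration only, the sketch below splits the columns of an n x n lower-triangular matrix among threads so that each contiguous range holds roughly the same number of stored elements, rather than the same number of columns. The function name and the balancing rule are assumptions, not BLIS code.

```python
def weighted_ranges(n, n_threads):
    """Split columns 0..n-1 of a lower-triangular n x n matrix into
    n_threads contiguous ranges with roughly equal element counts.

    Column j of a lower-triangular matrix stores n - j elements, so
    early columns carry more work than late ones.
    """
    total = n * (n + 1) // 2      # elements in the whole triangle
    ranges = []
    start = 0
    done = 0                      # elements assigned so far
    for t in range(n_threads):
        # Cumulative amount of work that should be assigned once
        # thread t has received its range.
        target = total * (t + 1) / n_threads
        end = start
        while end < n and done < target:
            done += n - end
            end += 1
        if t == n_threads - 1:
            end = n               # last thread takes any remainder
        ranges.append((start, end))
        start = end
    return ranges
```

For n = 8 and two threads this gives the ranges (0, 3) and (3, 8), which hold 21 and 15 elements. An even split by column count, (0, 4) and (4, 8), would hold 26 and 10.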
Tyler Smith
23d9eab354 Merge https://github.com/flame/blis 2014-03-20 16:54:35 -05:00
Tyler Smith
5d5dc2eede Parallelized trmm and trmm3
Also fixed bugs in packm
2014-03-20 16:43:36 -05:00
Field G. Van Zee
fd3e32a5f4 Refined INSERT_GENTFUNC macro usage.
Details:
- Defined new INSERT_GENTFUNC macros so that the macro always takes
  exactly the number of arguments needed for the particular operation or
  variant being defined. Many operations were using INSERT_GENTFUNC
  macros that expected one auxiliary argument even though none were
  needed. Those instances have now been updated. Most of these instances
  were in the level-0 and -1v operations, as well as some operations
  defined in frame/util.
2014-03-20 13:59:48 -05:00
Field G. Van Zee
9b0e715f29 Minor simplifications to trmm, trsm macro-kernels.
Details:
- Simplified some code that would have allowed the diagonal of a trmm
  or trsm triangular matrix to intersect the short end of a micro-panel.
  This is disallowed via higher-level constraints on cache blocksizes, so
  this code was never needed and only served to obfuscate.
- Updated some comments in trmm, trsm macro-kernels.
2014-03-19 15:47:54 -05:00
Field G. Van Zee
a3902750b9 Reorganized norm operations.
Details:
- Completely reorganized norm operations:
  - Renames:
    - fnormsc, fnormv, fnormm -> normfsc, normfv, normfm (2-norm)
    - absumv -> norm1v (vector 1-norm)
  - New operations:
    - norm1m (matrix 1-norm)
    - normiv, normim (infinity-norm)
    - amaxv (BLAS-like absolute maximum value index)
    - asumv (BLAS-like absolute sum)
- Deprecated absumm, as it did not correspond to any actual norm.
  (However, an inlined version now exists in the testsuite module for
  randm.)
2014-03-19 12:35:17 -05:00
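To make the renamed vector operations concrete, here is a minimal sketch of what each name computes, written with plain Python lists. It borrows the operation names from the commit above but is not the BLIS implementation: it uses naive, unscaled arithmetic, and it assumes that the "BLAS-like" operations measure a complex element as |re| + |im|, as the reference BLAS asum and amax routines do.

```python
import math

def norm1v(x):
    """Vector 1-norm: sum of element magnitudes."""
    return sum(abs(v) for v in x)

def normfv(x):
    """Vector Frobenius (2-) norm: square root of the sum of squared magnitudes."""
    return math.sqrt(sum(abs(v) ** 2 for v in x))

def normiv(x):
    """Vector infinity-norm: largest element magnitude."""
    return max(abs(v) for v in x)

def _abs1(v):
    """BLAS-style magnitude |re| + |im| (assumed convention, see lead-in)."""
    v = complex(v)
    return abs(v.real) + abs(v.imag)

def asumv(x):
    """BLAS-like absolute sum."""
    return sum(_abs1(v) for v in x)

def amaxv(x):
    """BLAS-like index of the first element with the largest |re| + |im|."""
    best = 0
    for i, v in enumerate(x):
        if _abs1(v) > _abs1(x[best]):
            best = i
    return best
```

For real vectors norm1v and asumv agree. For x = [3+4j, -1, 2j] they differ: norm1v(x) is 8.0 while asumv(x) is 10.0.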
Tyler Smith
c0140cb752 Fixed packm variants 3 and 4 where every thread was trying to manipulate the same state
Now just performed by the master thread.
2014-03-19 11:21:16 -05:00
Tyler Smith
fb42983bd9 Fixed a barrier bug and a thread decorator bug 2014-03-18 16:37:28 -05:00
Tyler Smith
aa2405f8b2 Fixing function pointer issues with thread decorator 2014-03-18 15:23:09 -05:00
Tyler Smith
ec8b88f935 Enabled threading for packm blocked variants 3 and 4 2014-03-18 14:35:37 -05:00
Tyler Smith
0ac534cdf6 Added decorator for calling parallelized internal functions
Will allow for easy support for different threading models
2014-03-18 13:26:27 -05:00
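The idea of a thread decorator can be sketched as a single wrapper that owns thread creation, so the functions it calls never touch the threading model directly. The example below is a hypothetical illustration using Python's threading module. The names run_on_threads and scale_slice are invented here, and the commit does not say which threading model BLIS used.

```python
import threading

def run_on_threads(n_threads, func, *args):
    """Call func(thread_id, n_threads, *args) once per thread and wait.

    Swapping the threading model only requires changing this wrapper.
    """
    threads = [
        threading.Thread(target=func, args=(tid, n_threads) + args)
        for tid in range(n_threads)
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()

def scale_slice(tid, n_threads, x, alpha):
    """Example worker: each thread scales a disjoint, strided part of x."""
    for i in range(tid, len(x), n_threads):
        x[i] *= alpha
```

Calling run_on_threads(2, scale_slice, x, 10) scales every element of the list x by 10, with each thread writing only to its own indices.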
Tyler Smith
5296f58975 Fixing some bugs with herk parallelization 2014-03-17 17:15:35 -05:00
Tyler Smith
c51d011083 Initial multithreading support for HERK 2014-03-17 15:00:47 -05:00
Tyler Smith
c720b14156 Switched to using environment variables to control threading.
The environment variables all follow the format BLIS_X_NT,
where X is the index of the loop as described in our paper
Anatomy of High Performance Many-Threaded Matrix Multiplication.
These indices are IR, JR, IC, KC, and JC.

Also enabled parallelism for hemm and symm, but these are currently untested.
2014-03-17 11:39:32 -05:00
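A minimal sketch of reading the BLIS_X_NT variables described above. The variable names follow the format given in the commit message. The default of one thread for an unset variable, the clamping to at least one, and the total being the product of the per-loop counts are assumptions made for this example.

```python
import math
import os

# Loop indices named in the commit message.
LOOPS = ("JC", "KC", "IC", "JR", "IR")

def read_thread_counts(env=None):
    """Return {loop: thread count} read from BLIS_<loop>_NT variables."""
    env = os.environ if env is None else env
    counts = {}
    for loop in LOOPS:
        raw = env.get("BLIS_%s_NT" % loop, "1")   # assumed default: 1
        counts[loop] = max(1, int(raw))
    return counts

def total_threads(counts):
    """Total threads, assuming the loops nest so the counts multiply."""
    return math.prod(counts.values())
```

Passing a plain dictionary instead of os.environ makes the function easy to try out: with BLIS_JC_NT=2 and BLIS_IC_NT=4 the counts are 2 for JC, 4 for IC, and 1 for the remaining loops.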
Tyler Smith
92233cf642 Some fixes to gemm thread info tree creation.
Changed microkernel tests to use the new BLIS_PACKM_SINGLE_THREADED
instead of BLIS_SINGLE_THREADED
2014-03-11 14:16:08 -05:00
Tyler Smith
020f80c302 Added files specific to threading for gemm and packm operations 2014-03-11 12:08:17 -05:00
Tyler Smith
8d8f4352a4 Added single threaded thread info data structures specifically for gemm and packm 2014-03-10 15:47:28 -05:00
Tyler Smith
0e86777611 Merge branch 'master' of https://github.com/tlrmchlsmth/blis 2014-03-10 15:16:21 -05:00
Tyler Smith
2e727a025a Modifying the thread info data structures
This change makes each operation have its own thread info type,
allowing more fine control of threading in operations that have different types of suboperations
2014-03-10 15:14:33 -05:00
Field G. Van Zee
a770590cf2 Minor fixes to sumsqv, abmaxv.
Details:
- Minor update to bli_sumsqv_unb_var1() to bring it up-to-date with
  LAPACK 3.5.0's zlassq.f, which, starting with 3.4.2, returns NaN when
  the vector (or matrix) contains a NaN.
- Minor change to bli_abmaxv_unb_var1() to more closely mimic the
  behavior of netlib BLAS's izamax(). There, a "less than or equal to"
  operator is used in the search instead of "less than", which would
  change the element index returned if there were multiple maximum values.
- Added macro function definitions for bli_isinf() and bli_isnan(), which
  are currently implemented in terms of isinf() and isnan() from math.h.
2014-03-05 09:23:46 -06:00
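The effect of the comparison operator on the returned index can be shown with a short sketch. This is not bli_abmaxv_unb_var1() itself: it uses abs() on real values and a hypothetical keep_first flag to switch between the two comparisons, and it makes no claim about which one BLIS adopted.

```python
def argmax_abs(x, keep_first=True):
    """Index of the element of x with the largest magnitude.

    keep_first=True replaces the current best only on a strictly larger
    value, so ties keep the earliest index. keep_first=False also
    replaces on equality, so ties end at the latest index.
    """
    best_i, best_v = 0, abs(x[0])
    for i in range(1, len(x)):
        v = abs(x[i])
        replace = v > best_v if keep_first else v >= best_v
        if replace:
            best_i, best_v = i, v
    return best_i
```

With x = [1.0, -3.0, 2.0, 3.0] the maximum magnitude 3.0 occurs at indices 1 and 3. The strict comparison returns 1 and the non-strict comparison returns 3.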
Tyler Smith
b3bff631ea Merge https://github.com/flame/blis 2014-02-27 16:53:24 -06:00
Tyler Smith
2c158fb885 Merge https://github.com/flame/blis
Conflicts:
	frame/1m/packm/bli_packm_blk_var1.c
2014-02-27 16:46:23 -06:00
Field G. Van Zee
e8757b03a7 Use "%ld" as int format specifier in fprintm.
Details:
- Changed "%d" to "%ld" when printing integers via bli_fprintm().
- Meant to include this in previous commit.
2014-02-27 16:40:07 -06:00
Field G. Van Zee
c663ce3b51 Fixed various bugs when C99 complex is enabled.
Details:
- Fixed various bugs in packm_*_cxk(), the 4m/3m micro-kernels, and
  elsewhere in the framework that were not yet set up to work properly
  when BLIS_ENABLE_C99_COMPLEX is defined in bli_config.h.
- Extensive changes to f2c-derived files in frame/compat/f2c to allow
  C99 complex storage. Most of these changes center around accessing
  real and imaginary components via bli_?real()/bli_?imag() accessor
  macros, and setting of values via bli_?sets() assignment macros.
  (Thanks to Vladimir Sukarev for pointing out that _ENABLE_C99_COMPLEX
  was broken.)
2014-02-27 16:32:57 -06:00
Tyler Smith
e4738c48e0 Added support for parallelism in gemm micro-kernel 2014-02-27 16:29:46 -06:00
Tyler Smith
bfe214b633 Fixed bug with parallel packing, and bug with allocating an array of thread infos
In packm variant 1, the variable p_begin was incremented each iteration, causing a dependency.
This dependency was removed, allowing each iteration to be executed in parallel.

Somewhere in bli_threading.c, I was allocating an array of pointers instead of an array of structs.
2014-02-27 15:53:10 -06:00
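The p_begin fix is an instance of removing a loop-carried dependency. The sketch below contrasts the two styles with hypothetical helper names: the serial version advances a running offset, so iteration i needs the result of iteration i - 1, while the independent version derives each offset from the iteration index alone. A fixed panel stride is assumed.

```python
def panel_offsets_serial(n_panels, panel_stride):
    """Running-offset style: each iteration depends on the previous one."""
    offsets = []
    p_begin = 0
    for _ in range(n_panels):
        offsets.append(p_begin)
        p_begin += panel_stride    # loop-carried dependency
    return offsets

def panel_offset(i, panel_stride):
    """Independent style: the offset of panel i needs no earlier iteration."""
    return i * panel_stride
```

Because panel_offset(i, panel_stride) depends only on i, any thread can compute the start of any panel, and the iterations can be divided among threads in any order.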
Tyler Smith
6193d9ceea Fixed bug in thread trees 2014-02-27 14:09:19 -06:00
Tyler Smith
ac5a2de1d1 Merge branch 'master' of https://github.com/tlrmchlsmth/blis 2014-02-27 11:59:33 -06:00
Tyler Smith
01b125e815 First pass at adding parallelism to BLIS.
Added a multithreading infrastructure that should be independent of multithreading implementation in the future.
Currently, gemm blocked variants 1f and 2f and packm blocked variant 1 are parallelized.
2014-02-27 11:55:45 -06:00
Field G. Van Zee
c2b2ab6270 Deprecated panel stride alignment in bli_config.h.
Details:
- Removed BLIS_CONTIG_STRIDE_ALIGN_SIZE from bli_config.h of all
  configurations. It was already going unused in packm_init() since the
  recent 4m/3m commit. This setting was rarely, if ever, useful, and its
  existence only posed a potential risk for 4m/3m-based implementations.
- Removed BLIS_CONTIG_STRIDE_ALIGN_SIZE usage from mem_pool_macro_defs.h.
- Updated comments regarding CONTIG_STRIDE_ALIGN_SIZE in template
  micro-kernels.
2014-02-26 12:46:45 -06:00
Field G. Van Zee
f18aee83a5 CHANGELOG update (for 0.1.1). 2014-02-25 17:58:42 -06:00
Field G. Van Zee
fde5f1fdec Added extensive support for configuration defaults.
Details:
- Standard names for reference kernels (levels-1v, -1f and 3) are now
  macro constants. Examples:
    BLIS_SAXPYV_KERNEL_REF
    BLIS_DDOTXF_KERNEL_REF
    BLIS_ZGEMM_UKERNEL_REF
- Developers no longer have to name all datatype instances of a kernel
  with a common base name; [sdcz] datatype flavors of each kernel or
  micro-kernel (level-1v, -1f, or 3) may now be named independently.
  This means you can now, if you wish, encode the datatype-specific
  register blocksizes in the name of the micro-kernel functions.
- Any datatype instance of any kernel (1v, 1f, or 3) that is left
  undefined in bli_kernel.h will default to the corresponding reference
  implementation. For example, if BLIS_DGEMM_UKERNEL is left undefined,
  it will be defined to be BLIS_DGEMM_UKERNEL_REF.
- Developers no longer need to name level-1v/-1f kernels with multiple
  datatype chars to match the number of types the kernel WOULD take in
  a mixed type environment, as in bli_dddaxpyv_opt(). Now, one char is
  sufficient, as in bli_daxpyv_opt().
- There is no longer a need to define an obj_t wrapper to go along with
  your level-1v/-1f kernels. The framework now provides a _kernel()
  function which serves as the obj_t wrapper for whatever kernels are
  specified (or defaulted to) via bli_kernel.h.
- Developers no longer need to prototype their kernels, and thus no
  longer need to include any prototyping headers from within
  bli_kernel.h. The framework now generates kernel prototypes, with the
  proper type signature, based on the kernel names defined (or defaulted
  to) via bli_kernel.h.
- If the complex datatype x (of [cz]) implementation of the gemm micro-
  kernel is left undefined by bli_kernel.h, but its same-precision real
  domain equivalent IS defined, BLIS will use a 4m-based implementation
  for the datatype x implementations of all level-3 operations, using
  only the real gemm micro-kernel.
0.1.1
2014-02-25 13:34:56 -06:00
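The defaulting rules for the gemm micro-kernel can be summarized as a small lookup. In BLIS this is done with preprocessor macros in C, so the sketch below only models the decision logic. The function name, the dictionary input, and the "4m via ..." label are invented for illustration, and the reference macro names are extrapolated from the BLIS_ZGEMM_UKERNEL_REF pattern shown in the commit message.

```python
# Same-precision real datatype for each complex datatype.
REAL_OF = {"c": "s", "z": "d"}

def resolve_gemm_ukernels(defined):
    """Pick a gemm micro-kernel for each datatype in [sdcz].

    defined maps datatype chars to the kernel names a configuration
    supplies; datatypes missing from it fall back as described above.
    """
    chosen = {}
    for dt in "sdcz":
        if dt in defined:
            # Explicitly defined kernels are always used.
            chosen[dt] = defined[dt]
        elif dt in REAL_OF and REAL_OF[dt] in defined:
            # Complex kernel missing but real one present: induce via 4m.
            chosen[dt] = "4m via " + defined[REAL_OF[dt]]
        else:
            # Otherwise default to the reference implementation.
            chosen[dt] = "BLIS_%sGEMM_UKERNEL_REF" % dt.upper()
    return chosen
```

For a configuration that defines only a double-precision kernel, the s and c datatypes fall back to their reference kernels and the z datatype is induced via 4m from the d kernel.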
Field G. Van Zee
15b51e990f Merge branch 'master' of github.com:fgvanzee/blis 2014-02-21 09:04:32 -06:00
Field G. Van Zee
fc04b5eb69 Merge pull request #3 from figual/master
New ARM armv7a kernels and Assembly file consideration in Makefile
2014-02-21 09:04:13 -06:00
Francisco Igual
d1813c9dee Added new armv7a micro-kernels and configuration files from Werner Saar. 2014-02-21 15:14:31 +01:00
Francisco Igual
0cd098c03a Modified Makefile to consider .S assembly microkernels. 2014-02-21 15:12:30 +01:00
Field G. Van Zee
6363a9f658 Added level-3 support for complex via 4m-/3m.
Details:
- Added the ability to induce complex domain level-3 operations via new
  virtual complex micro-kernels which are implemented via only real
  domain micro-kernels. Two new implementations are provided: 4m and 3m.
  4m implements complex matrix multiplication in terms of four real
  matrix multiplications, whereas 3m uses only three and thus is
  capable of even higher (than peak) performance. However, the 3m method
  has somewhat weaker numerical properties, making it less desirable
  in general.
- Further refined packing routines, which were recently revamped, and
  added packing functionality for 4m and 3m.
- Some modifications to trmm and trsm macro-kernels to facilitate indexing
  into micro-panels which were packed for 4m/3m virtual kernels.
- Added 4m and 3m interfaces for each level-3 operation.
- Various other minor changes to facilitate 4m/3m methods.
2014-02-19 17:00:52 -06:00
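The arithmetic behind the two methods can be sketched at the matrix level, with the real and imaginary parts of each operand stored as separate real matrices. This is a plain-list illustration of the identities only. It does not reflect how the BLIS virtual micro-kernels or packing routines are organized, and the function names are chosen for this example.

```python
def matmul(a, b):
    """Naive matrix product of two list-of-lists matrices."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def gemm_4m(ar, ai, br, bi):
    """C = A * B using four real products.

    Cr = Ar*Br - Ai*Bi,  Ci = Ar*Bi + Ai*Br
    """
    cr = sub(matmul(ar, br), matmul(ai, bi))
    ci = add(matmul(ar, bi), matmul(ai, br))
    return cr, ci

def gemm_3m(ar, ai, br, bi):
    """C = A * B using three real products.

    Ci is recovered as (Ar+Ai)*(Br+Bi) - Ar*Br - Ai*Bi, which trades one
    product for extra additions and subtractions.
    """
    p1 = matmul(ar, br)
    p2 = matmul(ai, bi)
    p3 = matmul(add(ar, ai), add(br, bi))
    cr = sub(p1, p2)
    ci = sub(sub(p3, p1), p2)
    return cr, ci
```

Both functions return the same result in exact arithmetic. In floating point, the subtractions used to recover Ci in gemm_3m can cancel, which is one way to see why the commit describes 3m as numerically weaker.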
Field G. Van Zee
b29e1c2b27 Merge pull request #2 from tlrmchlsmth/master
Fixes and improvements to xeon phi implementation.
2014-02-14 14:11:54 -06:00
Tyler Smith
bd3c7ecfb5 Removing changes to input.general and input.operations 2014-02-14 14:05:57 -06:00
Tyler Smith
ce06686368 Fixed more Xeon Phi bugs, especially with scattered update 2014-02-14 13:52:18 -06:00
Tyler Smith
31134b5c70 Some fixes, changes, and improvements to the Xeon Phi microkernel 2014-02-14 11:19:44 -06:00
Field G. Van Zee
ee60377e46 Shifted some fields in info_t.
Details:
- Shifted the pack order, pack buffer type, and structure type fields
  to make room for an extra bit in the pack type/status field.
2014-02-13 14:03:31 -06:00
Field G. Van Zee
bd3ab1ad4c Minor fixes to trsm consistent with prev on trmm.
Details:
- Removed use of bli_min() and bli_max() that were only being used to
  try to support situations where the diagonal would intersect the
  short end of some micro-panels, which is a situation that is disallowed
  at a higher level by various constraints on the register and cache
  blocksize. This only affected trsm_ll and trsm_lu.
- Use panel stride as passed into the macro-kernel rather than compute
  it via k and PACKMR/PACKNR. This affects all macro-kernels of trsm.
2014-02-13 09:29:55 -06:00
Field G. Van Zee
6260b0b5f8 Fixed obscure bug in trmm_ll, trmm_lu.
Details:
- Fixed an obscure bug in left-hand trmm that would only manifest when
  non-zero register blocksize extensions (PACKMR > MR or PACKNR > NR)
  are used.
- Removed use of bli_min() and bli_max() that were only being used to
  try to support situations where the diagonal would intersect the
  short end of some micro-panels, which is a situation that is disallowed
  at a higher level by various constraints on the register and cache
  blocksize. This only affected trmm_ll and trmm_lu.
- Use panel stride as passed into the macro-kernel rather than compute
  it via k and PACKMR/PACKNR. This affects all macro-kernels of trmm.
2014-02-13 09:19:56 -06:00
Field G. Van Zee
16915c1c1e Fixed an obscure bug in packm_cxk().
Details:
- Fixed a bug in packm_cxk() whereby the packm ukernel was being chosen
  from ldp, which is always equal to PACKMR or PACKNR. The problem with
  this is that the pack ukernels were implicitly assuming that the
  panel dimension of the panel being packed was equal to ldp, which
  is not the case when the register blocksize extensions are non-zero
  (i.e., when PACKMR > MR or PACKNR > NR, whichever is applicable). This
  problem has been fixed by passing ldp into the pack ukernels, which
  now walk through the packed micro-panel region by incrementing by this
  value, rather than incrementing by the inherent panel dimension value
  assumed by each packm ukernel (e.g. 4 in the case of packm_ref_4xk).
- Also fixed a very minor edge case inefficiency whereby pack ukernels
  smaller than the default were not being used in edge cases, and instead
  those situations were being handled by scal2m. This is related to the
  issue above, because the pack ukernel itself was being chosen based on
  ldp instead of the panel dimension.
2014-02-11 10:54:19 -06:00
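The distinction between the panel dimension and ldp can be illustrated with a small packing sketch. It copies a panel_dim x k block into a flat buffer whose columns are ldp elements apart, so the walk through the packed region advances by ldp rather than by the panel dimension. The function name, the list-based buffer, and the zero fill of the padding rows are assumptions for this example, not details taken from packm_cxk().

```python
def pack_micropanel(a, panel_dim, k, ldp):
    """Pack the panel_dim x k block a (list of rows) column by column.

    Column j starts at offset j * ldp in the packed buffer. When
    ldp > panel_dim, the trailing ldp - panel_dim entries of each
    column are padding (zero here, by assumption).
    """
    assert ldp >= panel_dim
    buf = [0.0] * (ldp * k)
    for j in range(k):
        for i in range(panel_dim):
            buf[j * ldp + i] = a[i][j]   # stride by ldp, not panel_dim
    return buf
```

For a 3 x 2 block packed with ldp = 4, the second column starts at offset 4. A routine that instead stepped by the panel dimension 3 would write that column one element too early, which is the kind of mismatch the fix above addresses.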
Field G. Van Zee
b7da57b282 Updated calls to packm_blk_var2() in testsuite.
Details:
- In ukernel testsuite modules, replaced calls to packm_blk_var2() with
  _var1(). Meant to include this in previous commit.
2014-02-11 10:28:23 -06:00
Field G. Van Zee
c255a293e2 Consolidated packm_blk_var2 and var3.
Details:
- Consolidated the functionality previously supported by packm_blk_var2()
  and packm_blk_var3() into a new variant, packm_blk_var1().
- Updates to packm_gen_cxk(), packm_herm_cxk(), and packm_tri_cxk()
  to accommodate above changes.
- Removed packm_blk_var3() and retired packm_blk_var2() to
  frame/1m/packm/old.
- Updated all level-3 _cntl_init() functions so that the new, more
  versatile packm_blk_var1 is used for all level-3 matrix packing.
2014-02-10 14:31:24 -06:00
Field G. Van Zee
32d8f264ae Refactored packm variants.
Details:
- Revised packm_blk_var2() and _var3() by encapsulating the general,
  hermitian/symmetric, and triangular panel-packing subproblems into
  separate functions: packm_gen_cxk(), packm_herm_cxk(), and
  packm_tri_cxk(), respectively. Also, homogenized the packm code as
  well as the new specialized packm_*_cxk() code to further improve
  readability.
2014-02-09 10:07:37 -06:00
Field G. Van Zee
6c80670287 Renamed enumerated type in testsuite and modules.
Details:
- Renamed the test suite's "mt_impl_t" enumerated type to "iface_t", and
  renamed all corresponding "impl" variables to "iface".
2014-02-07 11:27:15 -06:00