Commit Graph

275 Commits

Author SHA1 Message Date
Tyler Smith
5296f58975 Fixing some bugs with herk parallelization 2014-03-17 17:15:35 -05:00
Tyler Smith
c51d011083 Initial multithreading support for HERK 2014-03-17 15:00:47 -05:00
Tyler Smith
c720b14156 Switched to using environment variables to control threading.
The environment variables all follow the format BLIS_X_NT,
where X is the index of the loop as described in our paper
Anatomy of High Performance Many-Threaded Matrix Multiplication.
These indices are IR, JR, IC, KC, and JC.

Also enabled parallelism for hemm and symm, but these are currently untested.
2014-03-17 11:39:32 -05:00
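A minimal sketch of how environment variables of the form BLIS_X_NT might be read, with X one of IR, JR, IC, KC, and JC as listed in the commit above. The helper names read_nt_env() and total_threads() are hypothetical and are not taken from the BLIS sources. Defaulting to 1 when a variable is unset, and treating the total thread count as the product of the five per-loop counts, are assumptions made for illustration only.

```c
#include <stdlib.h>

/* Hypothetical helper (not from BLIS): read a positive thread count from
   the named environment variable. Falls back to 1 if the variable is
   unset, empty, not a plain decimal integer, or less than 1. */
static long read_nt_env(const char *name)
{
    const char *s = getenv(name);
    char *end;
    long v;

    if (s == NULL)
        return 1;

    v = strtol(s, &end, 10);
    if (end == s || *end != '\0' || v < 1)
        return 1;

    return v;
}

/* Assumption: the overall thread count is the product of the per-loop
   counts. The variable names follow the BLIS_X_NT format from the log. */
static long total_threads(void)
{
    static const char *names[] = {
        "BLIS_JC_NT", "BLIS_KC_NT", "BLIS_IC_NT", "BLIS_JR_NT", "BLIS_IR_NT"
    };
    long total = 1;
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; ++i)
        total *= read_nt_env(names[i]);

    return total;
}
```

With BLIS_IC_NT=4 and BLIS_JR_NT=2 set and the other three variables unset, this sketch would report eight threads in total.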
Tyler Smith
92233cf642 Some fixes to gemm thread info tree creation.
Changed microkernel tests to use the new BLIS_PACKM_SINGLE_THREADED
instead of BLIS_SINGLE_THREADED
2014-03-11 14:16:08 -05:00
Tyler Smith
020f80c302 Added files specific to threading for gemm and packm operations 2014-03-11 12:08:17 -05:00
Tyler Smith
8d8f4352a4 Added single threaded thread info data structures specifically for gemm and packm 2014-03-10 15:47:28 -05:00
Tyler Smith
0e86777611 Merge branch 'master' of https://github.com/tlrmchlsmth/blis 2014-03-10 15:16:21 -05:00
Tyler Smith
2e727a025a Modifying the thread info data structures
This change makes each operation have its own thread info type,
allowing finer control of threading in operations that have different types of suboperations
2014-03-10 15:14:33 -05:00
Tyler Smith
b3bff631ea Merge https://github.com/flame/blis 2014-02-27 16:53:24 -06:00
Tyler Smith
2c158fb885 Merge https://github.com/flame/blis
Conflicts:
	frame/1m/packm/bli_packm_blk_var1.c
2014-02-27 16:46:23 -06:00
Field G. Van Zee
e8757b03a7 Use "%ld" as int format specifier in fprintm.
Details:
- Changed "%d" to "%ld" when printing integers via bli_fprintm().
- Meant to include this in previous commit.
2014-02-27 16:40:07 -06:00
Field G. Van Zee
c663ce3b51 Fixed various bugs when C99 complex is enabled.
Details:
- Fixed various bugs in packm_*_cxk(), the 4m/3m micro-kernels, and
  elsewhere in the framework that were not yet set up to work properly
  when BLIS_ENABLE_C99_COMPLEX is defined in bli_config.h
- Extensive changes to f2c-derived files in frame/compat/f2c to allow
  C99 complex storage. Most of these changes center around accessing
  real and imaginary components via bli_?real()/bli_?imag() accessor
  macros, and setting of values via bli_?sets() assignment macros.
  (Thanks to Vladimir Sukarev for pointing out that _ENABLE_C99_COMPLEX
  was broken.)
2014-02-27 16:32:57 -06:00
Tyler Smith
e4738c48e0 Added support for parallelism in gemm micro-kernel 2014-02-27 16:29:46 -06:00
Tyler Smith
bfe214b633 Fixed bug with parallel packing, and bug with allocating an array of thread infos
In packm variant 1, the variable p_begin was incremented each iteration, causing a dependency.
This dependency was removed, allowing each iteration to be executed in parallel.

Somewhere in bli_threading.c, I was allocating an array of pointers instead of an array of structs.
2014-02-27 15:53:10 -06:00
Tyler Smith
6193d9ceea Fixed bug in thread trees 2014-02-27 14:09:19 -06:00
Tyler Smith
ac5a2de1d1 Merge branch 'master' of https://github.com/tlrmchlsmth/blis 2014-02-27 11:59:33 -06:00
Tyler Smith
01b125e815 First pass at adding parallelism to BLIS.
Added a multithreading infrastructure that should be independent of multithreading implementation in the future.
Currently, gemm blocked variants 1f and 2f and packm blocked variant 1 are parallelized.
2014-02-27 11:55:45 -06:00
Field G. Van Zee
c2b2ab6270 Deprecated panel stride alignment in bli_config.h.
Details:
- Removed BLIS_CONTIG_STRIDE_ALIGN_SIZE from bli_config.h of all
  configurations. It was already going unused in packm_init() since the
  recent 4m/3m commit. This setting was rarely, if ever, useful, and its
  existence only posed a potential risk for 4m/3m-based implementations.
- Removed BLIS_CONTIG_STRIDE_ALIGN_SIZE usage from mem_pool_macro_defs.h.
- Updated comments regarding CONTIG_STRIDE_ALIGN_SIZE in template
  micro-kernels.
2014-02-26 12:46:45 -06:00
Field G. Van Zee
f18aee83a5 CHANGELOG update (for 0.1.1). 2014-02-25 17:58:42 -06:00
Field G. Van Zee
fde5f1fdec Added extensive support for configuration defaults.
Details:
- Standard names for reference kernels (levels-1v, -1f and 3) are now
  macro constants. Examples:
    BLIS_SAXPYV_KERNEL_REF
    BLIS_DDOTXF_KERNEL_REF
    BLIS_ZGEMM_UKERNEL_REF
- Developers no longer have to name all datatype instances of a kernel
  with a common base name; [sdcz] datatype flavors of each kernel or
  micro-kernel (level-1v, -1f, or 3) may now be named independently.
  This means you can now, if you wish, encode the datatype-specific
  register blocksizes in the name of the micro-kernel functions.
- Any datatype instance of any kernel (1v, 1f, or 3) that is left
  undefined in bli_kernel.h will default to the corresponding reference
  implementation. For example, if BLIS_DGEMM_UKERNEL is left undefined,
  it will be defined to be BLIS_DGEMM_UKERNEL_REF.
- Developers no longer need to name level-1v/-1f kernels with multiple
  datatype chars to match the number of types the kernel WOULD take in
  a mixed type environment, as in bli_dddaxpyv_opt(). Now, one char is
  sufficient, as in bli_daxpyv_opt().
- There is no longer a need to define an obj_t wrapper to go along with
  your level-1v/-1f kernels. The framework now provides a _kernel()
  function which serves as the obj_t wrapper for whatever kernels are
  specified (or defaulted to) via bli_kernel.h
- Developers no longer need to prototype their kernels, and thus no
  longer need to include any prototyping headers from within
  bli_kernel.h. The framework now generates kernel prototypes, with the
  proper type signature, based on the kernel names defined (or defaulted
  to) via bli_kernel.h.
- If the complex datatype x (of [cz]) implementation of the gemm micro-
  kernel is left undefined by bli_kernel.h, but its same-precision real
  domain equivalent IS defined, BLIS will use a 4m-based implementation
  for the datatype x implementations of all level-3 operations, using
  only the real gemm micro-kernel.
0.1.1
2014-02-25 13:34:56 -06:00
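The defaulting rule described in the commit above (an undefined BLIS_DGEMM_UKERNEL is defined to be BLIS_DGEMM_UKERNEL_REF) can be sketched with an ordinary preprocessor guard. This is only an illustration of the pattern. The function ref_kernel_sketch() and its signature are hypothetical stand-ins, and the actual values of these macros in BLIS are not shown in the log.

```c
/* Hypothetical stand-in for a reference kernel; real BLIS micro-kernels
   have a different signature. */
static double ref_kernel_sketch(double a, double b)
{
    return a * b;
}

/* Macro constant naming the reference implementation. The macro name is
   from the commit message; the value is an assumption. */
#define BLIS_DGEMM_UKERNEL_REF ref_kernel_sketch

/* A configuration's bli_kernel.h could define BLIS_DGEMM_UKERNEL here.
   In this sketch it is deliberately left undefined. */

/* Defaulting rule: fall back to the reference kernel if no optimized
   kernel was named. */
#ifndef BLIS_DGEMM_UKERNEL
#define BLIS_DGEMM_UKERNEL BLIS_DGEMM_UKERNEL_REF
#endif

/* Callers refer only to BLIS_DGEMM_UKERNEL and so pick up whichever
   kernel was selected at compile time. */
static double call_dgemm_ukernel(double a, double b)
{
    return BLIS_DGEMM_UKERNEL(a, b);
}
```

Because the selection happens in the preprocessor, a configuration that does define BLIS_DGEMM_UKERNEL would override the default without any change to the calling code.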
Field G. Van Zee
15b51e990f Merge branch 'master' of github.com:fgvanzee/blis 2014-02-21 09:04:32 -06:00
Field G. Van Zee
fc04b5eb69 Merge pull request #3 from figual/master
New ARM armv7a kernels and Assembly file consideration in Makefile
2014-02-21 09:04:13 -06:00
Francisco Igual
d1813c9dee Added new armv7a micro-kernels and configuration files from Werner Saar. 2014-02-21 15:14:31 +01:00
Francisco Igual
0cd098c03a Modified Makefile to consider .S assembly microkernels. 2014-02-21 15:12:30 +01:00
Field G. Van Zee
6363a9f658 Added level-3 support for complex via 4m-/3m.
Details:
- Added the ability to induce complex domain level-3 operations via new
  virtual complex micro-kernels which are implemented via only real
  domain micro-kernels. Two new implementations are provided: 4m and 3m.
  4m implements complex matrix multiplication in terms of four real
  matrix multiplications, whereas 3m uses only three and thus is
  capable of even higher (than peak) performance. However, the 3m method
  has somewhat weaker numerical properties, making it less desirable
  in general.
- Further refined packing routines, which were recently revamped, and
  added packing functionality for 4m and 3m.
- Some modifications to trmm and trsm macro-kernels to facilitate indexing
  into micro-panels which were packed for 4m/3m virtual kernels.
- Added 4m and 3m interfaces for each level-3 operation.
- Various other minor changes to facilitate 4m/3m methods.
2014-02-19 17:00:52 -06:00
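The difference between the 4m and 3m methods named in the commit above can be illustrated on scalars. The sketch below multiplies two complex numbers using four real multiplications and then using three. The log does not spell out which 3m formulation BLIS uses, so the three-multiplication identity shown here is the standard textbook one and is an assumption. The type and function names are hypothetical. In the matrix setting each real multiplication below would become a real matrix multiplication.

```c
/* Hypothetical complex type for illustration only. */
typedef struct { double re, im; } cplx_sketch_t;

/* 4m-style: (a + bi)(c + di) = (ac - bd) + (ad + bc)i,
   using four real multiplications. */
static cplx_sketch_t mul_4m_sketch(cplx_sketch_t x, cplx_sketch_t y)
{
    cplx_sketch_t r;
    r.re = x.re * y.re - x.im * y.im;
    r.im = x.re * y.im + x.im * y.re;
    return r;
}

/* 3m-style (assumed standard identity): three real multiplications at
   the cost of extra additions and subtractions. The subtraction in the
   imaginary part is one place where rounding behaviour can differ
   from the 4m form. */
static cplx_sketch_t mul_3m_sketch(cplx_sketch_t x, cplx_sketch_t y)
{
    double p1 = x.re * y.re;                    /* ac           */
    double p2 = x.im * y.im;                    /* bd           */
    double p3 = (x.re + x.im) * (y.re + y.im);  /* (a+b)(c+d)   */
    cplx_sketch_t r;
    r.re = p1 - p2;
    r.im = p3 - p1 - p2;                        /* ad + bc      */
    return r;
}
```

For (1 + 2i)(3 + 4i) both functions give -5 + 10i. With less benign floating-point inputs the two forms can round differently, which is consistent with the commit's remark about the weaker numerical properties of 3m.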
Field G. Van Zee
b29e1c2b27 Merge pull request #2 from tlrmchlsmth/master
Fixes and improvements to xeon phi implementation.
2014-02-14 14:11:54 -06:00
Tyler Smith
bd3c7ecfb5 Removing changes to input.general and input.operations 2014-02-14 14:05:57 -06:00
Tyler Smith
ce06686368 Fixed more Xeon Phi bugs, especially with scattered update 2014-02-14 13:52:18 -06:00
Tyler Smith
31134b5c70 Some fixes, changes, and improvements to the microkernel for the Xeon Phi 2014-02-14 11:19:44 -06:00
Field G. Van Zee
ee60377e46 Shifted some fields in info_t.
Details:
- Shifted the pack order, pack buffer type, and structure type fields
  to make room for an extra bit in the pack type/status field.
2014-02-13 14:03:31 -06:00
Field G. Van Zee
bd3ab1ad4c Minor fixes to trsm consistent with prev on trmm.
Details:
- Removed use of bli_min() and bli_max() that were only being used to
  try to support situations where the diagonal would intersect the
  short end of some micro-panels, which is a situation that is disallowed
  at a higher level by various constraints on the register and cache
  blocksize. This only affected trsm_ll and trsm_lu.
- Use panel stride as passed into the macro-kernel rather than compute
  it via k and PACKMR/PACKNR. This affects all macro-kernels of trsm.
2014-02-13 09:29:55 -06:00
Field G. Van Zee
6260b0b5f8 Fixed obscure bug in trmm_ll, trmm_lu.
Details:
- Fixed an obscure bug in left-hand trmm that would only manifest when
  non-zero register blocksize extensions (PACKMR > MR or PACKNR > NR)
  are used.
- Removed use of bli_min() and bli_max() that were only being used to
  try to support situations where the diagonal would intersect the
  short end of some micro-panels, which is a situation that is disallowed
  at a higher level by various constraints on the register and cache
  blocksize. This only affected trmm_ll and trmm_lu.
- Use panel stride as passed into the macro-kernel rather than compute
  it via k and PACKMR/PACKNR. This affects all macro-kernels of trmm.
2014-02-13 09:19:56 -06:00
Field G. Van Zee
16915c1c1e Fixed an obscure bug in packm_cxk().
Details:
- Fixed a bug in packm_cxk() whereby the packm ukernel was being chosen
  from ldp, which is always equal to PACKMR or PACKNR. The problem with
  this is that the pack ukernels were implicitly assuming that the
  panel dimension of the panel being packed was equal to ldp, which
  is not the case when the register blocksize extensions are non-zero
  (i.e., when PACKMR > MR or PACKNR > NR, whichever is applicable). This
  problem has been fixed by passing ldp into the pack ukernels, which
  now walk through the packed micro-panel region by incrementing by this
  value, rather than incrementing by the inherent panel dimension value
  assumed by each packm ukernel (e.g. 4 in the case of packm_ref_4xk).
- Also fixed a very minor edge case inefficiency whereby pack ukernels
  smaller than the default were not being used in edge cases, and instead
  those situations were being handled by scal2m. This is related to the
  issue above, because the pack ukernel itself was being chosen based on
  ldp instead of the panel dimension.
2014-02-11 10:54:19 -06:00
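The fix described in the commit above, passing ldp into the pack ukernels so that they step through the packed micro-panel by ldp rather than by the panel dimension, can be sketched as follows. The function pack_panel_sketch() and its signature are hypothetical and much simpler than the real packm ukernels. It assumes a column-major source panel of m rows and k columns, and a packed buffer whose columns are ldp elements apart with ldp >= m.

```c
#include <stddef.h>

/* Hypothetical sketch of a pack ukernel. Copies an m-by-k panel from a
   (column-major, leading dimension lda) into the packed buffer p, whose
   consecutive columns are ldp elements apart. */
static void pack_panel_sketch(size_t m, size_t k,
                              const double *a, size_t lda,
                              double *p, size_t ldp)
{
    size_t i, j;

    for (j = 0; j < k; ++j) {
        for (i = 0; i < m; ++i)
            p[i] = a[i];

        /* Elements p[m] .. p[ldp-1] are the blocksize extension region
           and are left untouched by this sketch. */

        a += lda;
        p += ldp;  /* advance by ldp, not by m */
    }
}
```

When ldp equals m the two stepping rules coincide, which fits the commit's description of a bug that only appears with non-zero register blocksize extensions.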
Field G. Van Zee
b7da57b282 Updated calls to packm_blk_var2() in testsuite.
Details:
- In ukernel testsuite modules, replaced calls to packm_blk_var2() with
  _var1(). Meant to include this in previous commit.
2014-02-11 10:28:23 -06:00
Field G. Van Zee
c255a293e2 Consolidated packm_blk_var2 and var3.
Details:
- Consolidated the functionality previously supported by packm_blk_var2()
  and packm_blk_var3() into a new variant, packm_blk_var1().
- Updates to packm_gen_cxk(), packm_herm_cxk(), and packm_tri_cxk()
  to accommodate above changes.
- Removed packm_blk_var3() and retired packm_blk_var2() to
  frame/1m/packm/old.
- Updated all level-3 _cntl_init() functions so that the new, more
  versatile packm_blk_var1 is used for all level-3 matrix packing.
2014-02-10 14:31:24 -06:00
Field G. Van Zee
32d8f264ae Refactored packm variants.
Details:
- Revised packm_blk_var2() and _var3() by encapsulating the general,
  hermitian/symmetric, and triangular panel-packing subproblems into
  separate functions: packm_gen_cxk(), packm_herm_cxk(), and
  packm_tri_cxk(), respectively. Also, homogenized the packm code as
  well as the new specialized packm_*_cxk() code to further improve
  readability.
2014-02-09 10:07:37 -06:00
Field G. Van Zee
6c80670287 Renamed enumerated type in testsuite and modules.
Details:
- Renamed the test suite's "mt_impl_t" enumerated type to "iface_t", and
  renamed all corresponding "impl" variables to "iface".
2014-02-07 11:27:15 -06:00
Field G. Van Zee
6c12598b1b Employ simpler INSERT_ macro for ref ukernels.
Details:
- Defined a new macro, INSERT_GENTFUNC_BASIC0, which takes only one
  argument--the base name of the function--and employed this macro
  in the reference micro-kernel files instead of the _BASIC macro,
  which takes one auxiliary argument. That argument was not being
  used and probably just acted to unnecessarily obfuscate.
2014-02-06 18:26:35 -06:00
Field G. Van Zee
32cae66326 Fixed some instances of sloppy 'restrict' usage.
Details:
- Fixed some technical incorrectness with some usage of the 'restrict'
  keyword in the reference trsm micro-kernels.
- Tweak to testsuite/Makefile that causes rebuild if libblis was
  touched.
2014-02-06 18:06:42 -06:00
Field G. Van Zee
7aceef7683 Updated comments in macro-kernels.
Details:
- Updated (and fixed some errors in) the "Assumptions/assertions" comment
  section of macro-kernels.
- Changed register blocksizes of reference configuration to MR = 8 and
  NR = 4. It's always good for MR != NR in the reference configuration
  since it may help uncover bugs related to non-square micro-kernels.
2014-02-06 17:31:19 -06:00
Field G. Van Zee
8fd292aa78 Pass panel dimensions into macro-kernels.
Details:
- Modified the interfaces to the datatype-specific macro-kernels so that:
  - pd_a and pd_b are passed in (which contain the panel dimensions of
    packed panels of a and b).
  - rs_a and cs_b are no longer passed in (they were guaranteed to be 1).
- Modified implementations of datatype-specific macro-kernels so pd_a,
  pd_b, cs_a, and rs_b are used instead of cpp macros for MR, NR, PACKMR,
  and PACKNR, respectively.
- Declare temporary c matrices (ct) as being maxmr-by-maxnr, which for now
  is equivalent to being mr-by-nr. maxmr and maxnr are declared in a new
  header file bli_kernel_post_macro_defs.h.
2014-02-06 14:32:21 -06:00
Field G. Van Zee
3404e6657e Deprecated incremental blocksize macro const defs.
Details:
- Removed macro constant definitions related to incremental blocksizes
  from all configurations' bli_kernel.h files. This change is minor and
  is mostly a cleanup related to a previous commit.
2014-02-05 11:19:10 -06:00
Field G. Van Zee
1e9afd39a6 Comment updates (removed vestiges of "bd"). 2014-02-04 20:15:19 -06:00
Field G. Van Zee
5cf58f7c2d Added early returns for "object is zeros" case.
Details:
- Added some logic to packm_init(), pack_int() and gemm_int() so that
  (a) objects marked as BLIS_ZEROS are not packed, and (b) those
  objects are not computed with. This functionality is not currently
  needed by any existing implementations, but may be used in the
  future.
2014-02-04 09:15:19 -06:00
Field G. Van Zee
6bbd4be769 Added 'f' on some gemm and trmm blocked variants.
Details:
- Added 'f' to some block variant files/functions to be consistent with
  other file/functions' naming convention. Here, the f indicates
  partitioning in the "forward" direction.
2014-02-03 13:15:25 -06:00
Field G. Van Zee
eb13cb2c6b Removed redundant non-gemm blksz_t creation.
Details:
- Removed code that creates duplicate blksz_t objects for herk, trmm,
  and trsm. Instead, the gemm blksz_t objects are accessed via extern
  and used directly. This reduces the amount of code associated with
  each of the three _cntl_init() and _cntl_finalize() functions.
2014-02-03 11:07:01 -06:00
Field G. Van Zee
0a023a7d9e Introduced new level-3 front-end layer.
Details:
- Added new _front() functions for each level-3 operation. This is done
  so that the choosing of the control tree (and *only* the choosing of
  the control tree) happens in what was previously the "front end"
  (e.g. bli_gemm()). That control tree is then passed into the _front()
  function, which then performs up-front tasks such as parameter
  checking.
2014-01-29 14:02:08 -06:00
Field G. Van Zee
251c5d1121 Removed redundant hemm, her2k control trees.
Details:
- Removed code that generated a control tree specifically for hemm and
  symm. Instead, the gemm control tree is now configured so that it
  works for gemm, hemm, or symm.
- Retired most her2k code, as it was not being used. (Currently, her2k is
  implemented as two invocations of herk.) I couldn't think of many
  situations where her2k variants were needed.
- Removed some older her2k code.
2014-01-28 19:40:29 -06:00
Field G. Van Zee
5a36e5bf2f Embed func_t microkernel objects in control trees.
Details:
- Modified all control tree node definitions to include a new field of
  type func_t*, which is similar to a blksz_t except that it contains
  one function pointer (each typed simply as void*) for each datatype.
  We use the func_t* to embed pointers to the micro-kernels to use for
  the leaf-level nodes of each control tree. This change is a natural
  extension of control trees and will allow more flexibility in the
  future.
- Modified all macro-kernel wrappers to obtain the micro-kernel pointers
  from the incoming (previously ignored) control tree node and then pass
  the queried pointer into the datatype-specific macro-kernel code, which
  then casts the pointer to the appropriate type (new typedefs residing
  in bli_kernel_type_defs.h) and then uses the pointer to call the micro-
  kernel. Thus, the micro-kernel function is no longer "hard-coded" (that
  is, determined when the datatype-specific macro-kernel functions are
  instantiated by the C preprocessor).
- Added macros to bli_kernel_macro_defs.h that build datatype-specific
  base names if they do not exist already, and then use those to build
  datatype-specific micro-kernel function names. This will allow
  developers extra flexibility if they want to, for example, name each
  of their datatype-specific micro-kernels differently (e.g. double
  real might be named bli_dgemm_opt_4x4() while double complex might be
  named bli_zgemm_opt_2x2()).
- Inserted appropriate code into _cntl_init() functions that allocates
  and initializes a func_t object for the corresponding micro-kernels.
  The gemm ukernel func_t object is created once, in bli_gemm_cntl_init(),
  and then reused via extern wherever possible.
2014-01-27 11:13:00 -06:00
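A minimal sketch of the func_t idea from the commit above: a small object holding one function pointer per datatype, from which a macro-kernel obtains the micro-kernel, casts it to the appropriate type, and calls it. All names below are hypothetical and the micro-kernel signature is invented for illustration. The commit says the pointers are typed as void*. This sketch substitutes a generic function-pointer type instead, because ISO C does not guarantee that function pointers can be converted to and from void*.

```c
/* Hypothetical datatype indices (single/double, real/complex). */
typedef enum {
    DT_FLOAT_SKETCH = 0,
    DT_DOUBLE_SKETCH,
    DT_SCOMPLEX_SKETCH,
    DT_DCOMPLEX_SKETCH,
    DT_COUNT_SKETCH
} dt_sketch_t;

/* Generic function-pointer type used in place of void* (see lead-in). */
typedef void (*generic_fp_sketch_t)(void);

/* func_t-like object: one micro-kernel pointer per datatype. */
typedef struct {
    generic_fp_sketch_t ptr[DT_COUNT_SKETCH];
} func_sketch_t;

/* Invented micro-kernel signature for double precision. */
typedef void (*dgemm_ukr_sketch_t)(int k, const double *a,
                                   const double *b, double *c);

/* Toy 1x1 "micro-kernel": c += sum over p of a[p] * b[p]. */
static void dgemm_ukr_sketch(int k, const double *a,
                             const double *b, double *c)
{
    int p;
    for (p = 0; p < k; ++p)
        *c += a[p] * b[p];
}

/* Toy "macro-kernel": queries the pointer for its datatype, casts it
   back to the typed signature, and calls through it, so the micro-kernel
   is chosen at run time rather than fixed by the preprocessor. */
static void macro_kernel_sketch(const func_sketch_t *f, int k,
                                const double *a, const double *b,
                                double *c)
{
    dgemm_ukr_sketch_t ukr = (dgemm_ukr_sketch_t)f->ptr[DT_DOUBLE_SKETCH];
    ukr(k, a, b, c);
}
```

A caller would store the kernel with f.ptr[DT_DOUBLE_SKETCH] = (generic_fp_sketch_t)dgemm_ukr_sketch before invoking macro_kernel_sketch(). Swapping in a different micro-kernel then requires changing only that assignment.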
Field G. Van Zee
6cbd6f1c7f Removed commented mixed domain macro-kernel code.
Details:
- Removed commented-out code from macro-kernels that was supposed to
  facilitate implementing mixed domain (complex times real) matrix
  multiplication. This functionality is (probably) still possible,
  but I'm getting tired of looking at the code every time I edit
  a macro-kernel. Plus, there are probably ways of doing it at a
  higher level, via control trees.
2014-01-24 10:38:29 -06:00