361 Commits

Author SHA1 Message Date
Field G. Van Zee
f2809fc5f7 Merge pull request #39 from devinamatthews/fix_f2c_conflicts
Devin's f2c type namespace update.

Details:
- Added "bla_" prefix to f2c type names to prevent conflicts with external user code.
- Removed most of the body of bli_f2c.h, which was unused.
2016-02-27 13:06:03 -06:00
Devin Matthews
8624a33ccc Fix remaining f2c conflicts. 2016-02-25 13:51:26 -06:00
Devin Matthews
372eef0b6c Fixed most conflicts after hack-n-slash of bli_f2c.h, cleanup in
progress.
2016-02-25 12:01:58 -06:00
Field G. Van Zee
f86b94f206 Included missing blas2blis integer def to CBLAS.
Details:
- Added #include "bli_config_macro_defs.h" to all cblas_*.c files in
  compat/cblas/src. This has the effect of defining
  BLIS_BLAS2BLIS_INT_TYPE_SIZE to the default value if bli_config.h does
  not define it. Thanks to Tony Kelman for reporting this bug.
- In cblas_i?amax.c, changed the type of the variable 'iamax' from 'int'
  to 'f77_int'. This eliminates a compiler warning and a potential
  runtime bug and/or crash when the size of an int differs from the size
  of f77_int (as determined by BLIS_BLAS2BLIS_INT_TYPE_SIZE).
2016-02-23 18:12:34 -06:00
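The int versus f77_int mismatch fixed in this commit can be illustrated with a minimal sketch. It assumes a hypothetical size macro MY_INT_TYPE_SIZE and type my_f77_int standing in for BLIS_BLAS2BLIS_INT_TYPE_SIZE and f77_int; the default of 32 and the helper my_iamax() are illustrative assumptions, not the BLIS definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a configurable integer size. A default is
   supplied only if the configuration did not already define one. */
#ifndef MY_INT_TYPE_SIZE
#define MY_INT_TYPE_SIZE 32
#endif

#if MY_INT_TYPE_SIZE == 64
typedef int64_t my_f77_int;
#else
typedef int32_t my_f77_int;
#endif

/* Illustrative helper: returns the zero-based index of the element with
   the largest absolute value. The index is a my_f77_int throughout, so
   its width always matches the configured integer size, which a plain
   'int' would not guarantee. */
my_f77_int my_iamax(const double *x, my_f77_int n)
{
    assert(x != NULL && n > 0);
    my_f77_int best = 0;
    for (my_f77_int i = 1; i < n; ++i) {
        double a = x[i] < 0.0 ? -x[i] : x[i];
        double b = x[best] < 0.0 ? -x[best] : x[best];
        if (a > b) best = i;
    }
    return best;
}
```

Keeping the local index variable in the same type as the return value avoids any narrowing when the configured size is 64 bits.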
Field G. Van Zee
0b126de134 Consolidated packm_blk_var1 and packm_blk_var2.
Details:
- Consolidated the two blocked variants for packm into a single
  implementation (packm_blk_var1) and removed the other variant.
- Updated all induced method _cntl_init() functions in frame/cntl/ind/
  to use the new blocked variant 1.
- Defined two new macros, bli_is_ind_packed() and bli_is_nat_packed(),
  to detect pack_t schemas for induced methods and native execution,
  respectively.
2015-11-13 16:29:12 -06:00
Field G. Van Zee
30e5eb29e0 Minor changes to treatment of rs, cs in bli_obj.c.
Details:
- Applied a patch submitted by Devin Matthews that:
  - implements subtle changes to handling of somewhat unusual cases of
    row and column strides to accommodate certain tensor cases, which
    includes adding dimension parameters to _is_col_tilted() and
    _is_row_tilted() macros,
  - simplifies how buffers are sized when requesting BLIS-allocated
    objects,
  - re-consolidates bli_adjust_strides_*() into one function, and
  - defines 'restrict' keyword as a "nothing" macro for C++ and pre-C99
    environments.
2015-11-13 12:14:19 -06:00
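The last item of the patch, defining 'restrict' as a "nothing" macro, might look roughly like the sketch below. The preprocessor test and the routine my_axpy() are assumptions made for illustration; the exact condition used in BLIS is not shown in the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: when compiling as C++ or as pre-C99 C, 'restrict' is not a
   keyword, so it is defined away. This mirrors the idea in the commit,
   not the exact BLIS preprocessor test. */
#if defined(__cplusplus) || !defined(__STDC_VERSION__) || (__STDC_VERSION__ < 199901L)
#define restrict
#endif

/* Hypothetical routine: y := y + alpha * x for non-overlapping x and y.
   With the macro above, this signature compiles in all three settings. */
void my_axpy(size_t n, double alpha, const double *restrict x, double *restrict y)
{
    size_t i;
    assert(x != NULL && y != NULL);
    for (i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}
```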
Field G. Van Zee
42810bbfa0 Fixed minor bugs for uncommon obj_create cases.
Details:
- Separated bli_adjust_strides() into _alloc() and _attach() flavors so
  that the latter can avoid a test performed by the former, in which the
  rs and cs are overridden and set to zero if either matrix dimension is
  zero. Actually, we also disable this overriding behavior, even for the
  _alloc() case, since keeping the original strides (probably) does not
  hurt anything. The original code has been kept commented-out, though,
  in case an unintended consequence is later discovered.
- Fixed a typo in an error check for general stride cases where rs == cs.
2015-11-12 12:07:46 -06:00
Field G. Van Zee
3e6dd11467 Minor re-expression in quadratic partitioning code.
Details:
- Minor change to quadratic equation solution code that avoids
  recomputation of the sqrt() parameter when the compiler is not
  smart enough to perform this optimization automatically.
2015-11-03 10:30:08 -06:00
Field G. Van Zee
3e116f0a29 Fixed imaginary bug in quadratic partitioning code.
Details:
- Fixed a bug in the relatively new quadratic partitioning code that,
  under the right conditions, would perform sqrt() on a negative value.
  If the solution is imaginary, we discard it and use an alternate
  partition width that assumes no diagonal intersection. That alternate
  width is actually already computed, so, the fix was quite simple.
  Thanks to Devangi Parikh for reporting this bug.
2015-11-02 17:18:23 -06:00
Field G. Van Zee
4a502fbe77 Laid groundwork for runtime memory pool resizing.
Details:
- Changed bli_pool_finalize() so that the freeing begins with the block
  at top_index instead of block 0. This allows us to use the function
  for terminal finalization as well as temporary cleanup prior to
  reinitialization. Also, clear the pool_t struct upon _pool_finalize()
  in case it is called in the terminal case with some blocks still
  checked out to threads (in which case the threads will see the new
  block size as 0 and thus release the block as intended).
- Added bli_pool_reinit(), which calls _pool_finalize() followed by
  _pool_init() with new parameters.
- Added bli_mem_reinit(), which is based on bli_pool_reinit().
- Added new wrapper, _mem_compute_pool_block_sizes(), which calls
  _mem_compute_pool_block_sizes_dt().
- Updated bli_mem_release() so that the pblk_t is freed, via
  _pool_free_block(), if the block size recorded in the mem_t at the
  time the pblk_t was acquired is now different from the value in the
  pool_t.
2015-11-02 13:28:34 -06:00
Field G. Van Zee
37e55ca39b Fixed obscure 3m1/4m1a bugs in trmm[3] and trsm.
Details:
- Fixed a family of bugs in the triangular level-3 operations for
  certain complex implementations (3m1 and 4m1a) that only manifest if
  one of the register blocksizes (PACKMR/PACKNR, actually) is odd:
  - Fixed incorrect imaginary stride computation in bli_packm_blk_var2()
    for the triangular case.
  - Fixed the incorrect computation of imaginary stride, as stored in
    the auxinfo_t struct in trmm and trsm macro-kernels.
  - Fixed incorrect pointer arithmetic in the trsm macro-kernels in the
    cases where the register blocksize for the triangular matrix is
    odd. Introduced a new byte-granular pointer arithmetic macro,
    bli_ptr_add(), that computes the correct value.
- Added cpp macro to bli_macro_defs.h for typeof() operator, defined in
  terms of __typeof__, which is used by bli_ptr_add() macro.
- Disabled the row- vs. column-storage optimization in bli_trmm_front()
  for singleton problems because the inherent ambiguity of whether a
  scalar is row-stored or column-stored causes the wrong parameter
  combination code to be executed (by dumb luck of our checking for
  row storage first).
- Added commented-out debugging lines to 3m1/4m1a and reference
  micro-kernels, and trsm_ll macro-kernel.
2015-10-30 18:25:04 -05:00
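A byte-granular pointer arithmetic macro of the kind described in this commit could be sketched as follows. The names my_typeof(), my_ptr_add(), my_dcomplex, and my_advance_by_reals() are hypothetical stand-ins, and the macro body is an assumption rather than the BLIS definition of bli_ptr_add(). Like the commit, it relies on the __typeof__ compiler extension.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical complex type: two reals stored back to back. */
typedef struct { double real; double imag; } my_dcomplex;

/* typeof() expressed in terms of __typeof__ (a GCC/Clang extension). */
#define my_typeof(x) __typeof__(x)

/* Advance 'ptr' by 'nbytes' bytes, then cast back to the original type.
   Going through char* makes the arithmetic byte-granular. */
#define my_ptr_add(ptr, nbytes) \
    ((my_typeof(ptr))((char *)(ptr) + (nbytes)))

/* Advance a complex pointer by a count of REAL elements. When n_reals is
   odd, the offset is not a whole number of complex elements, so ordinary
   pointer arithmetic (p + n_reals / 2) would truncate and land short. */
my_dcomplex *my_advance_by_reals(my_dcomplex *p, size_t n_reals)
{
    assert(p != NULL);
    return my_ptr_add(p, n_reals * sizeof(double));
}
```

For example, advancing by three reals moves the pointer one and a half complex elements, which element-granular arithmetic cannot express.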
Field G. Van Zee
77ddb0b1d3 Removed flop-counting mechanism.
Details:
- Removed the optional flop-counting feature introduced in commit
  7574c994.
2015-10-13 12:53:06 -05:00
Field G. Van Zee
e2e9d64a63 Load balance thread ranges for arbitrary diagonals.
Details:
- Expanded/updated interface for bli_get_range_weighted() and
  bli_get_range() so that the direction of movement is specified in the
  function name (e.g. bli_get_range_l2r(), bli_get_range_weighted_t2b())
  and also so that the object being partitioned is passed instead of an
  uplo parameter. Updated invocations in level-3 blocked variants, as
  appropriate.
- (Re)implemented bli_get_range_*() and bli_get_range_weighted_*() to
  carefully take into account the location of the diagonal when computing
  ranges so that the area of each subpartition (which, in all present
  level-3 operations, is proportional to the amount of computation
  engendered) is as equal as possible.
- Added calls to a new class of routines to all non-gemm level-3 blocked
  variants:
    bli_<oper>_prune_unref_mparts_[mnk]()
  where <oper> is herk, trmm, or trsm and [mnk] is chosen based on which
  dimension is being partitioned. These routines call a more basic
  routine, bli_prune_unref_mparts(), to prune unreferenced/unstored
  regions from matrices and simultaneously adjust other matrices which
  share the same dimension accordingly.
- Simplified herk_blk_var2f, trmm_blk_var1f/b as a result of the new
  pruning routines.
- Fixed incorrect blocking factors passed into bli_get_range_*() in
  bli_trsm_blk_var[12][fb].c
- Added a new test driver in test/thread_ranges that can exercise the new
  bli_get_range_*() and bli_get_range_weighted_*() under a range of
  conditions.
- Reimplemented m and n fields of obj_t as elements in a "dim"
  array field so that dimensions could be queried via index constant
  (e.g. BLIS_M, BLIS_N). Adjusted/added query and modification
  macros accordingly.
- Defined mdim_t type to enumerate BLIS_M and BLIS_N indexing values.
- Added bli_round() macro, which calls C math library function round(),
  and bli_round_to_mult(), which rounds a value to the nearest multiple
  of some other value.
- Added miscellaneous pruning- and mdim_t-related macros.
- Renamed bli_obj_row_offset(), bli_obj_col_offset() macros to
  bli_obj_row_off(), bli_obj_col_off().
2015-09-24 12:14:03 -05:00
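Rounding a value to the nearest multiple of another value, as bli_round_to_mult() is described as doing, can be sketched with integer arithmetic. The function my_round_to_mult() is hypothetical; the commit states that bli_round() calls the C library's round(), whereas this sketch assumes unsigned integer inputs and rounds exact ties upward, which may differ from the BLIS macro.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round 'val' to the nearest multiple of 'mult'.
   Adding half of 'mult' before the truncating division rounds to
   nearest; exact ties round up. */
size_t my_round_to_mult(size_t val, size_t mult)
{
    assert(mult > 0);
    return ((val + mult / 2) / mult) * mult;
}
```

With a multiple of 4, a value of 9 rounds down to 8 and a value of 10 (a tie) rounds up to 12.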
Field G. Van Zee
4dd9dd3e1d Fixed minor alignment ambiguity bug in bli_pool.c.
Details:
- Fixed a typecasting ambiguity in bli_pool_alloc_block() in which
  pointer arithmetic was performed on a void* as if it were a byte
  pointer (such as char*). Some compilers may have already been
  interpreting this situation as intended, despite the sloppiness.
  Thanks to Aleksei Rechinskii for reporting this issue.
- Redefined pointer alignment macros to typecast to uintptr_t instead of
  siz_t.
2015-08-21 11:52:37 -05:00
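The two points in this commit, doing pointer arithmetic through a byte pointer and casting to uintptr_t for alignment, can be combined in a short sketch. The function my_align_up() is a hypothetical illustration, not the code in bli_pool_alloc_block() or the BLIS alignment macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the first address at or after 'p' that is
   a multiple of 'align' (which must be a power of two). */
void *my_align_up(void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    /* uintptr_t is the integer type intended to hold a pointer value. */
    uintptr_t addr    = (uintptr_t)p;
    uintptr_t aligned = (addr + (align - 1)) & ~(uintptr_t)(align - 1);

    /* Arithmetic on void* is not standard C, so step through char*. */
    return (char *)p + (aligned - addr);
}
```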
Field G. Van Zee
7cd01b71b5 Implemented dynamic allocation for packing buffers.
Details:
- Replaced the old memory allocator, which was based on statically-
  allocated arrays, with one based on a new internal pool_t type, which,
  combined with a new bli_pool_*() API, provides a new abstract data
  type that implements the same memory pool functionality but with blocks
  from the heap (i.e., malloc() or equivalent). Hiding the details of the
  pool in a separate API also allows for a much simpler bli_mem.c family
  of functions.
- Added a new internal header, bli_config_macro_defs.h, which enables
  sane defaults for the values previously found in bli_config. Those
  values can be overridden by #defining them in bli_config.h the same
  way kernel defaults can be overridden in bli_kernel.h. This file most
  resembles what was previously a typical configuration's bli_config.h.
- Added a new configuration macro, BLIS_POOL_ADDR_ALIGN_SIZE, which
  defaults to BLIS_PAGE_SIZE, to specify the alignment of individual
  blocks in the memory pool. Also added a corresponding query routine to
  the bli_info API.
- Deprecated (once again) the micro-panel alignment feature. Upon further
  reflection, it seems that the goal of more predictable L1 cache
  replacement behavior is outweighed by the harm caused by non-contiguous
  micro-panels when k % kc != 0. I honestly don't think anyone will even
  miss this feature.
- Changed bli_ukr_get_funcs() and bli_ukr_get_ref_funcs() to call
  bli_cntl_init() instead of bli_init().
- Removed query functions from bli_info.c that are no longer applicable
  given the dynamic memory allocator.
- Removed unnecessary definitions from configurations' bli_config.h files,
  which are now pleasantly sparse.
- Fixed incorrect flop counts in addv, subv, scal2v, scal2m testsuite
  modules. Thanks to Devangi Parikh for pointing out these
  miscalculations.
- Comment, whitespace changes.
2015-06-19 11:31:53 -05:00
Field G. Van Zee
9848f255a3 Added early return to API-level _init() routines.
Details:
- Added conditional code that returns early from the API-level _init()
  routines if the API is already initialized. Actually meant for this to
  be included in 5f93cbe8.
2015-06-11 19:14:22 -05:00
Field G. Van Zee
5f93cbe870 Introduced API-level initialization.
Details:
- Added API-level initialization state to _const, _error, _mem, _thread,
  _ind, and _cntl APIs. While this functionality will mostly go unused,
  adding minuscule overhead at init-time, there will be at least one
  instance in the near future where, in order to avoid an infinite loop,
  a certain portion of the initialization will call a query function that
  itself attempts to call bli_init(). API-level initialization will allow
  this later stage to verify that an earlier stage of initialization has
  completed, even if the overall call to bli_init() has not yet returned.
- Added _is_initialized() functions for each API, setting the underlying
  bool_t during _init() and unsetting it during _finalize().
- Comment, whitespace changes.
2015-06-11 18:52:12 -05:00
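The per-API initialization state, together with an early return when the API is already initialized, can be sketched as below. All names (my_api_init() and friends) are hypothetical, and plain bool stands in for bool_t; this is a single-threaded illustration and omits any locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-API state: a flag set by _init() and cleared by
   _finalize(), plus a counter used only to observe how often the real
   setup work ran. */
static bool my_api_initialized = false;
static int  my_api_setup_runs  = 0;

bool my_api_is_initialized(void)
{
    return my_api_initialized;
}

void my_api_init(void)
{
    /* Early return: a second call does no work. */
    if (my_api_is_initialized()) return;

    ++my_api_setup_runs;          /* stands in for the real setup work */
    my_api_initialized = true;
}

void my_api_finalize(void)
{
    my_api_initialized = false;
}

/* A query that is only meaningful once the API is initialized. */
int my_api_setup_runs_so_far(void)
{
    assert(my_api_is_initialized());
    return my_api_setup_runs;
}
```

A later initialization stage can call my_api_is_initialized() to confirm that an earlier stage has completed without re-entering the top-level initializer.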
Field G. Van Zee
ee129c6b02 Fixed bugs in _get_range(), _get_range_weighted().
Details:
- Fixed some bugs that only manifested in multithreaded instances of
  some (non-gemm) level-3 operations. The bugs were related to invalid
  allocation of "edge" cases to thread subpartitions. (Here, we define
  an "edge" case to be one where the dimension being partitioned for
  parallelism is not a whole multiple of whatever register blocksize
  is needed in that dimension.) In BLIS, we always require edge cases
  to be part of the bottom, right, or bottom-right subpartitions.
  (This is so that zero-padding only has to happen at the bottom, right,
  or bottom-right edges of micro-panels.) The previous implementations
  of bli_get_range() and _get_range_weighted() did not adhere to this
  implicit policy and thus produced bad ranges for some combinations of
  operation, parameter cases, problem sizes, and n-way parallelism.
- As part of the above fix, the functions bli_get_range() and
  _get_range_weighted() have been renamed to use _l2r, _r2l, _t2b,
  and _b2t suffixes, similar to the partitioning functions. This is
  an easy way to make sure that the variants are calling the right
  version of each function. The function signatures have also been
  changed slightly.
- Comment/whitespace updates.
- Removed unnecessary '/' from macros in bli_obj_macro_defs.h.
2015-06-10 12:53:28 -05:00
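The policy that the "edge" case must belong to the last (right or bottom) subpartition can be sketched for a left-to-right partitioning. The function my_range_l2r() is hypothetical and deliberately simple: it is not the algorithm used by bli_get_range_l2r(), and it ignores weighting entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: split [0, n) among 'nt' threads, left to right,
   in whole multiples of the blocking factor 'bf'. Any leftover "edge"
   (n % bf) is given to the last thread only. */
void my_range_l2r(size_t n, size_t bf, size_t nt, size_t tid,
                  size_t *start, size_t *end)
{
    assert(bf > 0 && nt > 0 && tid < nt);
    assert(start != NULL && end != NULL);

    size_t n_blocks = n / bf;        /* whole blocks available          */
    size_t edge     = n % bf;        /* leftover smaller than one block */
    size_t base     = n_blocks / nt; /* blocks every thread receives    */
    size_t extra    = n_blocks % nt; /* first 'extra' threads get one more */

    size_t blocks_before = tid * base + (tid < extra ? tid : extra);
    size_t my_blocks     = base + (tid < extra ? 1 : 0);

    *start = blocks_before * bf;
    *end   = *start + my_blocks * bf;

    /* Only the rightmost subpartition may contain the edge. */
    if (tid == nt - 1) *end += edge;
}
```

Because every range other than the last begins and ends on a multiple of bf, zero-padding is only ever needed at the right edge.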
Field G. Van Zee
b6ee82a3d4 Minor cleanup to bli_init() and friends.
Details:
- Spun-off initialization of global scalar constants to bli_const_init()
  and of threading stuff to bli_thread_init().
- Added some missing _finalize() functions, even when there is nothing
  to do.
2015-06-03 12:14:23 -05:00
Field G. Van Zee
1213f5ceba POSIX thread bugfixes/edits to bli_init.c, _mem.c.
Details:
- Fixed a sort-of bug in bli_init.c whereby the wrong pthread mutex
  was used to lock access to initialization/finalization actions.
  But everything worked out okay as long as bli_init() was called by
  single-threaded code.
- Changed to static initialization for memory allocator mutex in
  bli_mem.c, and moved mutex to that file (from bli_init.c).
- Fixed some type mismatches in bli_threading_pthreads.c that resulted
  in compiler warnings.
- Fixed a small memory leak with allocated-but-never-freed (and unused)
  pthread_attr_t objects.
- Whitespace changes to bli_init.c and bli_mem.c.
2015-06-02 13:27:47 -05:00
Field G. Van Zee
426b648858 Fixed a packing bug that manifested in trsm_r.
Details:
- Fixed a bug that caused a memory leak in the contiguous memory
  allocator. Because packm_init() was using simple aliasing when
  a subpartition object was marked as zeros by bli_acquire_mpart_*(),
  the "destination" pack object's mem_t entry was being overwritten
  by the corresponding field of the "source" object (which was likely
  NULL). This prevented the block from being released back to the
  memory allocator. But this bug only manifested when changing the
  location of packing B from outside the var1 loop to inside the
  var3 loop, and only for trsm with triangular B (side = right). The
  bug was fixed by changing the type of alias used in packm_init()
  when handling zero partition cases. Specifically, we now use
  bli_obj_alias_for_packing(), which does not clobber the destination
  (pack) object's mem_t field. Thanks to Devangi Parikh for this bug
  report.
2015-04-08 15:12:21 -05:00
Field G. Van Zee
26a4b8f6f9 Implemented 3m2, 3m3 induced algorithms (gemm only).
Details:
- Defined a new "3ms" (separated 3m) pack schema and added appropriate
  support in packm_init(), packm_blk_var2().
- Generalized packm_struc_cxk_3mi to take the imaginary stride (is_p)
  as an argument instead of computing it locally. Exception: for trmm,
  is_p must be computed locally, since it changes for triangular
  packed matrices. Also exposed is_p in interface to dt-specific
  packm_blk_var2 (and _var1, even though it does not use imaginary
  stride).
- Renamed many functions/variables from _3mi to _3mis to indicate that
  they work for either interleaved or separated 3m pack schemas.
- Generalized gemm and herk macro-kernels to pass in imaginary stride
  rather than compute them locally.
- Added support for 3m2 and 3m3 algorithms to frame/ind, including 3m2-
  and 3m3-specific virtual micro-kernels.
- Added special gemm macro-kernels to support 3m2 and 3m3.
- Added support for 3m2 and 3m3 to testsuite.
- Corrected the type of the panel dimension (pd_) in various macro-
  kernels from inc_t to dim_t.
- Renamed many functions defined in bli_blocksize.c.
- Moved most induced-related macro defs from frame/include to
  frame/ind/include.
- Updated the _ukernel.c files so that the micro-kernel function pointers
  are obtained from the func_t objects rather than the cpp macros that
  define the function names.
- Updated test/3m4m driver, Makefile, and run script.
2015-04-01 10:44:54 -05:00
Tyler Smith
ddf62ba7d2 Refuse to free the packm thread info if it uses the single threaded version 2015-03-27 14:27:51 -05:00
Tyler Smith
016fc58758 Don't free packm thread info if it is null 2015-03-27 14:23:02 -05:00
Tyler Smith
00a443c529 Use bli_malloc instead of malloc for the thread info paths 2015-03-27 14:11:07 -05:00
Field G. Van Zee
f1a6b7d028 Reorganized code for induced complex methods.
Details:
- Consolidated most of the code relating to induced complex methods
  (e.g. 4mh, 4m1, 3mh, 3m1, etc.) into frame/ind. Induced methods
  are now enabled on a per-operation basis. The current "available"
  (enabled and implemented) implementation can then be queried on
  an operation basis. Micro-kernel func_t objects as well as blksz_t
  objects can also be queried in a similar manner.
- Redefined several micro-kernel and operation-related functions in
  bli_info_*() API, in accordance with above changes.
- Added mr and nr fields to blksz_t object, which point to the mr
  and nr blksz_t objects for each cache blocksize (and are NULL for
  register blocksizes). Renamed the sub-blocksize field "sub" to
  "mult" since it is really expressing a blocksize multiple.
- Updated bli_*_determine_kc_[fb]() for gemm/hemm/symm, trmm, and
  trsm to correctly query mr and nr (for purposes of nudging kc).
- Introduced an enumerated opid_t in bli_type_defs.h that uniquely
  identifies an operation. For now, only level-3 id values are defined,
  along with a generic, catch-all BLIS_NOID value.
- Reworked testsuite so that all induced methods that are enabled
  are tested (one at a time) rather than only testing the first
  available method.
- Reformatted summary at the beginning of testsuite output so that
  blocksize and micro-kernel info is shown for each induced method
  that was requested (as well as native execution).
- Reduced the number of columns needed to display non-matlab
  testsuite output (from approx. 90 to 80).
2015-03-18 15:37:10 -05:00
Field G. Van Zee
8d5169ccda Fixed bug in release of mem_t buffer.
Details:
- Fixed a bug that affects all level-2 and level-3 blocked variants. The
  bug only manifested, however, if the packing of operands (A and B in
  gemm, for example) spanned multiple nodes in the control tree. Until
  recently, the main consumers of packm were level-3 operations, all of
  which packed both input operands from blocked variant 1 (B outside of
  the loop, and A within the loop). This particular usage masked a flaw
  in the code whereby bli_obj_release_pack() would always release the
  underlying mem_t buffer (provided it was allocated), even if the buffer
  was not allocated in the current variant. This has been fixed by
  replacing all calls to bli_obj_release_pack() with calls to a new
  function, bli_packm_release(), which takes the same control tree node
  argument passed into the object's corresponding call to packm_init()
  or packv_init(). bli_packm_release() then proceeds to invoke
  bli_obj_release_pack() only if the control tree node indicates that
  packing was requested. Thanks to Devangi Parikh for identifying this
  bug.
2015-03-18 11:38:08 -05:00
Field G. Van Zee
03ba9a6b17 Removed some 'old' directories. 2015-02-24 10:33:28 -06:00
Field G. Van Zee
a86db60ee2 Extensive renaming of 3m/4m-related files, symbols.
Details:
- Renamed all remaining 3m/4m packing files and symbols to 3mi/4mi
  ('i' for "interleaved"). Similar changes to 3M/4M macros.
- Renamed all 3m/4m files and functions to 3m1/4m1.
- Whitespace changes.
2015-02-23 18:42:39 -06:00
Field G. Van Zee
8cf8da291a Minor updates to induced complex mode management.
Details:
- Relocated bli_4mh.c, bli_4mb.c, bli_4m.c, bli_3mh.c, bli_3m.c (and
  associated headers) from frame/base to frame/base/induced.
- Added bli_xm.? to frame/base/induced, which implements
  bli_xm_is_enabled(), which detects whether ANY induced complex method
  is currently enabled.
- The new function bli_xm_is_enabled() is now used in bli_info.c to
  detect when an induced complex method is used, so we know when to
  return blocksizes from one of the induced methods' blocksize objects.
2015-02-20 15:24:27 -06:00
Tyler Michael Smith
411e637ee7 Merge branch 'master' of http://github.com/flame/blis 2015-02-20 20:39:25 -06:00
Tyler Michael Smith
c2569b8803 Fixed a memory leak in freeing the thread infos 2015-02-20 20:38:19 -06:00
Field G. Van Zee
fc0b771227 Added max(mr,nr) to kc in static mem pools.
Details:
- Changed the static memory definitions to compute the maximum register
  blocksize for each datatype and add it to kc when computing the size
  of blocks of A and B. This formally accounts for the nudging of kc
  up to a multiple of mr or nr at runtime for triangular operations
  (e.g. trmm).
2015-02-20 11:47:44 -06:00
Tyler Michael Smith
af32e3a608 Fixed a bug where get_range_weighted would return end = 0 for small problem sizes 2015-02-19 22:51:11 -06:00
Field G. Van Zee
441d47542a Renamed 3m and 4m symbols/macros to 3mi and 4mi.
Details:
- Renamed several variables and macros from 3m/4m to 3mi/4mi. This is
  because those packing schemas were always implicitly "interleaved".
  This new naming scheme will make way for new schemas that separate
  instead of interleave the real and imaginary (and summed) parts.
- Expanded the pack format sub-field of the pack schema field of the
  info_t to 4 bits (from 3). This will allow for more schema types
  going forward.
- Removed old _cntl.c files for herk3m, herk4m, trmm3m, trmm4m.
2015-02-19 17:06:10 -06:00
Field G. Van Zee
518a1756cc Fixed indexing bug for trmm3 via 3mh, 4mh.
Details:
- Fixed a bug that only affected trmm3 when performed via 3mh or 4mh,
  whereby micro-panels of the triangular matrix were packed with "dead
  space" between them due to failing to adjust for the fact that pointer
  arithmetic was occurring in units of complex elements while the data
  being packed consisted of real elements. It turns out that the macro-
  kernel suffered from the same bug, meaning the panels were actually
  being packed and read consistently. The only way I was able to
  discover the bug in the first place was because the packed block of A
  was overflowing into the beginning of the packed row panel of B using
  the sandybridge configuration.
2015-02-19 14:27:09 -06:00
Field G. Van Zee
493087d730 Merge branch 'master' of github.com:flame/blis 2015-02-18 09:45:51 -06:00
Field G. Van Zee
25021299b6 Merge branch 'master' of github.com:flame/blis 2015-02-11 20:03:21 -06:00
Field G. Van Zee
fe2b8d39a4 Fixed an obscure bug in 3mh/3m/4mh/4m packing.
Details:
- Modified bli_packm_blk_var1.c and _var2.c to increase the triangular
  case's panel increment by 1 if it would otherwise be odd. This is
  particularly necessary in _var2.c when handling the interleaved 3m
  or ro/io/rpi pack schemas, since division of an odd number by 2 can
  happen if both the panel length and the panel packing dimension
  (register packing blocksize) are odd, thus making their product odd.
- Modified bli_packm_init.c so that panel strides are increased by 1
  if they would otherwise be odd, even for non-3m related packing.
- Modified the trmm and trsm macro-kernels so that triangular packed
  micro-panels are traversed with this new "increment by 1 if odd"
  policy.
- Added sanity checks in trmm and trsm macro-kernels that would result
  in an abort() if the conditions that would lead to a "divide odd
  integer by 2" scenario ever manifest.
- Defined bli_is_odd(), _is_even() macros in bli_scalar_macro_defs.h.
2015-02-11 19:33:10 -06:00
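The "increment by 1 if odd" policy and the accompanying sanity check can be sketched as follows. The macros and functions here are hypothetical stand-ins (the real bli_is_odd() and bli_is_even() definitions are not shown in the commit message), and assert() stands in for the abort() mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parity tests. */
#define my_is_odd(a)  ((a) % 2 == 1)
#define my_is_even(a) ((a) % 2 == 0)

/* Bump an increment up by one if it would otherwise be odd. */
size_t my_make_even(size_t x)
{
    return my_is_odd(x) ? x + 1 : x;
}

/* Halve a stride, refusing to proceed if the division would truncate. */
size_t my_half_stride(size_t stride)
{
    assert(my_is_even(stride));
    return stride / 2;
}
```

With an odd panel length of 5 and an odd packing dimension of 3, the product 15 would lose its remainder when divided by 2; bumping it to 16 first makes the halved value exact.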
Field G. Van Zee
650d2a6ff2 Added initial support for imaginary stride.
Details:
- Added an imaginary stride field ("is") to obj_t.
- Renamed bli_obj_set_incs() macro to bli_obj_set_strides().
- Defined bli_obj_imag_stride() and bli_obj_set_imag_stride() and
  added invocations in key locations.
- Added some basic error-checking related to imaginary stride.
- For now, imaginary stride will not be exposed into the most-used
  BLIS APIs such as bli_obj_create(), and certainly not the
  computational APIs such as bli_dgemm().
2015-02-09 14:59:20 -06:00
Field G. Van Zee
f05a57634a Defined gemm cntl function to query ukrs func_t.
Details:
- Added a new function, bli_gemm_cntl_ukrs(), that returns the func_t*
  for the gemm micro-kernels from the leaf node of the control tree.
  This allows all the func_t* fields from higher-level nodes in the tree
  to be NULL, which makes the function that builds the control trees
  slightly easier to read.
- Call bli_gemm_cntl_ukrs() instead of the cntl_gemm_ukrs() macro in
  all bli_*_front() functions (which is needed to apply the row/column
  preference optimization).
- In all level-3 bli_*_cntl_init() functions, changed the _obj_create()
  function arguments corresponding to the gemm_ukrs fields in higher-
  level cntl tree nodes to NULL.
- Removed some old her2k macro-kernels.
2015-02-08 19:40:34 -06:00
Tyler Smith
cefd3d5d20 A couple of functions were incorrectly ifdeffed away on Xeon Phi. Fixed this 2015-02-05 11:09:12 -06:00
Field G. Van Zee
7574c9947d Added basic flop-counting mechanism (level-3 only).
Details:
- Added optional flop counting to all level-3 front-ends, which is
  enabled via BLIS_ENABLE_FLOP_COUNT. The flop count can be
  reset at any time via bli_flop_count_reset() and queried via
  bli_flop_count(). Caveats:
  - flop counts are approximate for her[2]k, syr[2]k, trmm, and
    trsm operations;
  - flop counts ignore extra flops due to non-unit alpha;
  - flop counts do not account for situations where beta is zero.
2015-02-04 12:11:55 -06:00
Field G. Van Zee
ceda4f27d1 Implemented bli_obj_imag_equals().
Details:
- Implemented a new function, bli_obj_imag_equals(), which compares the
  imaginary part of the first argument to the second argument, which may
  be a BLIS_CONSTANT or of a regular real datatype.
2015-01-29 13:22:54 -06:00
Field G. Van Zee
81114824a0 Minor 4m/3m consolidation to mem_pool_macro_defs.h.
Details:
- Merged the 4m and 3m definitions in bli_mem_pool_macro_defs.h to
  reduce code and improve readability.
2015-01-06 12:15:21 -06:00
Field G. Van Zee
c60619c7c3 Minor tweaks for 3m4m test drivers.
Details:
- Changed gemm_kc blocksizes to be reduced by two-thirds instead of
  half.
- Changed 3m4m/test_gemm.c driver to divide by 3 instead of 2 when
  computing the fixed k dimension.
- Fixed runme.sh so that it would use multiple threads for s/dgemm
  cases.
2014-12-16 17:08:22 -06:00
Field G. Van Zee
785d480805 Merge branch 'master' of github.com:flame/blis 2014-12-12 14:34:19 -06:00
Field G. Van Zee
9456f330af Added 4m_1b implementation for gemm.
Details:
- Added yet another 4m-based implementation for complex domain level-3
  operations. This method, which the 3m/4m paper identifies as Algorithm
  "4m_1b" fissures the first loop around the micro-kernel so that the
  real sub-panel of the current micro-panel of B is multiplied against
  (both sub-panels of) all micro-panels of A, before doing the same for
  the imaginary sub-panel of the micro-panel of B. For now, only gemm is
  supported, and 4m_1b (labeled "4mb" within the framework) is not yet
  integrated into the test suite.
2014-12-12 14:31:57 -06:00
Field G. Van Zee
4156c0880d Fixed obscure level-2 packing / general stride bug.
Details:
- Fixed a bug in certain structured level-2 operations that manifested
  only when the structured matrix was provided to BLIS as a matrix stored
  with general stride. The bug was introduced in c472993b when the
  densify field was removed from the packm control tree node and
  associated APIs. Since then, the packed object was unconditionally
  marked with an uplo field of BLIS_DENSE. This is fine for level-3
  operations where micro-panels are always densified, but in level-2
  contexts, the underlying unblocked variant (fused or unfused) of
  structured operations (e.g. trmv) still needs to know whether to
  execute its "lower" or "upper" branches of code. Since this field
  was unconditionally being set to BLIS_DENSE, the unblocked variants
  always executed the "else" branch, which happened to be the
  "lower" case code. Thus, running an upper case produced the wrong
  answer. This most obviously manifested in the form of failures for
  trmm, trmm3, and trsm in the test suite.
  The bug was fixed by setting the packed object's uplo field to
  BLIS_DENSE only if the schema indicated that micro-panels were to be
  packed. Otherwise, we can assume we are packing to regular row or
  column storage, as is the case with level-2 packing. Thanks to
  Francisco Igual for reporting the testsuite failures and ultimately
  leading us to this bug.
2014-12-09 16:03:14 -06:00
Tyler Smith
bef24e67e0 Fixed a type of race condition exposed by pthreads implementation.
The lead thread of the inner thread communicator could exit the subproblem, move on to the next iteration of the loop, and modify a1_pack, b1_pack, or c1_pack while other threads were still using them.

Barriers were inserted to fix this.
2014-11-26 18:00:56 -06:00