Commit Graph

560 Commits

Author SHA1 Message Date
Field G. Van Zee
37e55ca39b Fixed obscure 3m1/4m1a bugs in trmm[3] and trsm.
Details:
- Fixed a family of bugs in the triangular level-3 operations for
  certain complex implementations (3m1 and 4m1a) that only manifest if
  one of the register blocksizes (PACKMR/PACKNR, actually) is odd:
  - Fixed incorrect imaginary stride computation in bli_packm_blk_var2()
    for the triangular case.
  - Fixed the incorrect computation of imaginary stride, as stored in
    the auxinfo_t struct in trmm and trsm macro-kernels.
  - Fixed incorrect pointer arithmetic in the trsm macro-kernels in the
    cases where the register blocksize for the triangular matrix is
    odd. Introduced a new byte-granular pointer arithmetic macro,
    bli_ptr_add(), that computes the correct value.
- Added cpp macro to bli_macro_defs.h for typeof() operator, defined in
  terms of __typeof__, which is used by bli_ptr_add() macro.
- Disabled the row- vs. column-storage optimization in bli_trmm_front()
  for singleton problems because the inherent ambiguity of whether a
  scalar is row-stored or column-stored causes the wrong parameter
  combination code to be executed (by dumb luck of our checking for
  row storage first).
- Added commented-out debugging lines to 3m1/4m1a and reference
  micro-kernels, and trsm_ll macro-kernel.
2015-10-30 18:25:04 -05:00
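A minimal sketch of what a byte-granular pointer-add macro could look like, for readers unfamiliar with the idea. The names toy_typeof(), toy_ptr_add(), toy_dcomplex_t, and toy_panel_at() are hypothetical and are not the BLIS definitions; the sketch assumes a compiler that provides __typeof__ (GCC, Clang). The point is that an offset of 1.5 elements (24 bytes for a 16-byte element) cannot be written as ordinary typed pointer arithmetic, but it can be written as a byte offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the BLIS definitions. */
#define toy_typeof( x )  __typeof__( x )   /* GCC/Clang extension */

/* Advance p by nbytes BYTES (not elements) and keep p's pointer type. */
#define toy_ptr_add( p, nbytes ) \
    ( ( toy_typeof( p ) )( ( char* )( p ) + ( nbytes ) ) )

typedef struct { double real; double imag; } toy_dcomplex_t;

/* Address of the i-th panel when consecutive panels are panel_bytes apart.
   panel_bytes need not be a multiple of sizeof( toy_dcomplex_t ). */
static toy_dcomplex_t* toy_panel_at( toy_dcomplex_t* base,
                                     size_t          panel_bytes,
                                     size_t          i )
{
    assert( base != NULL );
    return toy_ptr_add( base, i * panel_bytes );
}
```

With a 24-byte panel stride, toy_panel_at( buf, 24, 1 ) lands halfway through the second element of buf, which `buf + n` cannot express for any integer n.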
Field G. Van Zee
46294d80e5 Merge pull request #35 from figual/master
Fixed incomplete code in the double precision ARMv8 microkernel.
2015-10-27 12:41:23 -05:00
Francisco Igual
a0a7b85ac3 Fixed incomplete code in the double precision ARMv8 microkernel. 2015-10-27 08:59:15 +00:00
Field G. Van Zee
d3159c5740 Merge branch 'master' of github.com:flame/blis 2015-10-21 14:54:00 -05:00
Field G. Van Zee
b489152e11 Use vzeroall in haswell micro-kernels. 2015-10-21 14:53:17 -05:00
Field G. Van Zee
7e03e45bfe Merge pull request #33 from xianyi/master
Enable Travis CI
2015-10-14 13:26:07 -05:00
Zhang Xianyi
4f88c29f9e Detect Intel Broadwell (using Haswell config). 2015-10-14 12:57:50 -05:00
Zhang Xianyi
4b0ac1a998 Merge branch 'upstream_master' 2015-10-14 12:51:05 -05:00
Field G. Van Zee
77ddb0b1d3 Removed flop-counting mechanism.
Details:
- Removed the optional flop-counting feature introduced in commit
  7574c994.
2015-10-13 12:53:06 -05:00
Field G. Van Zee
276da36618 Minor formatting change to README.md. 2015-10-12 11:43:03 -05:00
Field G. Van Zee
d17057446f Added "Getting Started" section to README.md.
Details:
- Added section to README.md file containing links to wikis with brief
  descriptions.
2015-10-12 11:39:49 -05:00
Field G. Van Zee
e7e1f2f7b6 Minor updates to CREDITS, README files. 2015-10-02 16:51:52 -05:00
Field G. Van Zee
55329906ec Minor edits to README.md, testsuite.
Details:
- Fixed typos in README.md.
- Fixed column heading alignment for testsuite when matlab output is
  enabled.
- Minor updates to test/3m4m/runme.sh and test/3m4m/Makefile.
2015-09-26 20:47:19 -05:00
Field G. Van Zee
bbebdb5793 Replaced README with README.md.
Details:
- Replaced the old (and short) README file with a much more comprehensive
  version written in github-flavored markdown. The new file is based on
  content taken from the old Google Code homepage.
2015-09-25 14:47:27 -05:00
Field G. Van Zee
e2e9d64a63 Load balance thread ranges for arbitrary diagonals.
Details:
- Expanded/updated interface for bli_get_range_weighted() and
  bli_get_range() so that the direction of movement is specified in the
  function name (e.g. bli_get_range_l2r(), bli_get_range_weighted_t2b())
  and also so that the object being partitioned is passed instead of an
  uplo parameter. Updated invocations in level-3 blocked variants, as
  appropriate.
- (Re)implemented bli_get_range_*() and bli_get_range_weighted_*() to
  carefully take into account the location of the diagonal when computing
  ranges so that the area of each subpartition (which, in all present
  level-3 operations, is proportional to the amount of computation
  engendered) is as equal as possible.
- Added calls to a new class of routines to all non-gemm level-3 blocked
  variants:
    bli_<oper>_prune_unref_mparts_[mnk]()
  where <oper> is herk, trmm, or trsm and [mnk] is chosen based on which
  dimension is being partitioned. These routines call a more basic
  routine, bli_prune_unref_mparts(), to prune unreferenced/unstored
  regions from matrices and simultaneously adjust other matrices which
  share the same dimension accordingly.
- Simplified herk_blk_var2f, trmm_blk_var1f/b as a result of the new
  pruning routines.
- Fixed incorrect blocking factors passed into bli_get_range_*() in
  bli_trsm_blk_var[12][fb].c
- Added a new test driver in test/thread_ranges that can exercise the new
  bli_get_range_*() and bli_get_range_weighted_*() under a range of
  conditions.
- Reimplemented m and n fields of obj_t as elements in a "dim"
  array field so that dimensions could be queried via index constant
  (e.g. BLIS_M, BLIS_N). Adjusted/added query and modification
  macros accordingly.
- Defined mdim_t type to enumerate BLIS_M and BLIS_N indexing values.
- Added bli_round() macro, which calls C math library function round(),
  and bli_round_to_mult(), which rounds a value to the nearest multiple
  of some other value.
- Added miscellaneous pruning- and mdim_t-related macros.
- Renamed bli_obj_row_offset(), bli_obj_col_offset() macros to
  bli_obj_row_off(), bli_obj_col_off().
2015-09-24 12:14:03 -05:00
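The area-balancing idea described in the commit above can be illustrated with a deliberately simplified sketch. It assumes a square n x n lower-triangular matrix partitioned by columns from left to right, ignores blocksize rounding and arbitrary diagonal offsets, and uses a greedy scan; the function name toy_weighted_ranges_l2r() is hypothetical and this is not the BLIS algorithm.

```c
#include <assert.h>

/* Split the columns of an n x n lower-triangular matrix into nt contiguous
   ranges [start[t], end[t]) with roughly equal stored area.
   Column j of a lower-triangular matrix stores n - j elements. */
static void toy_weighted_ranges_l2r( long n, int nt, long* start, long* end )
{
    assert( n >= 0 && nt > 0 );

    long total = n * ( n + 1 ) / 2;  /* total stored elements */
    long j     = 0;                  /* next unassigned column */
    long acc   = 0;                  /* area assigned so far */

    for ( int t = 0; t < nt; ++t )
    {
        /* Cumulative area that threads 0..t should cover together. */
        long target = total * ( t + 1 ) / nt;

        start[t] = j;
        while ( j < n && acc < target )
        {
            acc += n - j;
            ++j;
        }
        end[t] = j;
    }
}
```

For n = 8 and two threads, this yields the column ranges [0,3) and [3,8), with areas 21 and 15. An even split by column count, [0,4) and [4,8), would give areas 26 and 10.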
Zhang Xianyi
fe3e355c9c Merge branch 'upstream_master' 2015-08-21 14:38:36 -05:00
Zhang Xianyi
efa641e36b Try to fix the compiling bug on travis. 2015-08-22 03:15:50 +08:00
Field G. Van Zee
4dd9dd3e1d Fixed minor alignment ambiguity bug in bli_pool.c.
Details:
- Fixed a typecasting ambiguity in bli_pool_alloc_block() in which
  pointer arithmetic was performed on a void* as if it were a byte
  pointer (such as char*). Some compilers may have already been
  interpreting this situation as intended, despite the sloppiness.
  Thanks to Aleksei Rechinskii for reporting this issue.
- Redefined pointer alignment macros to typecast to uintptr_t instead of
  siz_t.
2015-08-21 11:52:37 -05:00
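For context, a short sketch of the two portable idioms this commit alludes to: doing byte offsets through a char* rather than a void*, and doing alignment math on uintptr_t. The helper names toy_byte_offset() and toy_align_ptr_up() are hypothetical and are not the macros in bli_pool.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Arithmetic on void* is a compiler extension; casting to char* first
   makes the byte offset well defined in standard C. */
static void* toy_byte_offset( void* p, size_t nbytes )
{
    return ( char* )p + nbytes;
}

/* Round p up to the next multiple of align. uintptr_t is the integer type
   intended to hold a pointer value, so it is a natural type for this math. */
static void* toy_align_ptr_up( void* p, size_t align )
{
    assert( align != 0 );

    uintptr_t addr = ( uintptr_t )p;
    uintptr_t rem  = addr % align;

    if ( rem != 0 ) addr += align - rem;

    return ( void* )addr;
}
```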
Zhang Xianyi
12ffd568b0 Add Travis CI. 2015-08-22 00:24:28 +08:00
Field G. Van Zee
ecc3ebb749 CHANGELOG update (0.1.8) 2015-07-29 13:31:12 -05:00
Field G. Van Zee
47caa33485 Version file update (0.1.8) 0.1.8 2015-07-29 13:31:09 -05:00
Field G. Van Zee
ef0fbbbdb6 Merge branch 'master' of github.com:flame/blis 2015-07-09 13:54:54 -05:00
Field G. Van Zee
fdfe14f1e1 Added support for Intel Haswell/Broadwell.
Details:
- Added sgemm and dgemm micro-kernels, which employ 256-bit AVX vectors
  and FMA instructions. (Complex support is currently provided by default
  induced method, 4m1a.)
- Added a 'haswell' configuration, which uses the aforementioned kernels.
- Inserted auto-detection support for haswell configuration in
  build/auto-detect/cpuid_x86.c.
- Modified configure script to explicitly echo when automatic or manual
  configuration is in progress.
- Changed beta scalar in test_gemm.c module of test suite to -1.0 to 0.9.
2015-07-09 13:52:39 -05:00
Field G. Van Zee
d4b891369c Added 'carrizo' configuration.
Details:
- Added a new configuration for AMD Excavator-based hardware, also known
  as Carrizo when referring to the entire APU. This configuration uses
  the same micro-kernels as the piledriver, but with different
  cache blocksizes.
2015-07-07 10:06:53 -05:00
Field G. Van Zee
0b7255a642 CHANGELOG update (0.1.7) 2015-06-19 12:01:50 -05:00
Field G. Van Zee
267253de8a Version file update (0.1.7) 0.1.7 2015-06-19 12:01:49 -05:00
Field G. Van Zee
7cd01b71b5 Implemented dynamic allocation for packing buffers.
Details:
- Replaced the old memory allocator, which was based on statically-
  allocated arrays, with one based on a new internal pool_t type, which,
  combined with a new bli_pool_*() API, provides a new abstract data
  type that implements the same memory pool functionality but with blocks
  from the heap (i.e., malloc() or equivalent). Hiding the details of the
  pool in a separate API also allows for a much simpler bli_mem.c family
  of functions.
- Added a new internal header, bli_config_macro_defs.h, which enables
  sane defaults for the values previously found in bli_config. Those
  values can be overridden by #defining them in bli_config.h the same
  way kernel defaults can be overridden in bli_kernel.h. This file most
  resembles what was previously a typical configuration's bli_config.h.
- Added a new configuration macro, BLIS_POOL_ADDR_ALIGN_SIZE, which
  defaults to BLIS_PAGE_SIZE, to specify the alignment of individual
  blocks in the memory pool. Also added a corresponding query routine to
  the bli_info API.
- Deprecated (once again) the micro-panel alignment feature. Upon further
  reflection, it seems that the goal of more predictable L1 cache
  replacement behavior is outweighed by the harm caused by non-contiguous
  micro-panels when k % kc != 0. I honestly don't think anyone will even
  miss this feature.
- Changed bli_ukr_get_funcs() and bli_ukr_get_ref_funcs() to call
  bli_cntl_init() instead of bli_init().
- Removed query functions from bli_info.c that are no longer applicable
  given the dynamic memory allocator.
- Removed unnecessary definitions from configurations' bli_config.h files,
  which are now pleasantly sparse.
- Fixed incorrect flop counts in addv, subv, scal2v, scal2m testsuite
  modules. Thanks to Devangi Parikh for pointing out these
  miscalculations.
- Comment, whitespace changes.
2015-06-19 11:31:53 -05:00
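To make the pool idea concrete, here is a minimal sketch of a fixed-size pool whose blocks come from the heap. The type toy_pool_t and the toy_pool_*() functions are hypothetical; the real pool_t and bli_pool_*() API are not described in enough detail here to reproduce, and this sketch omits alignment and thread safety entirely.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct
{
    void** blocks;      /* blocks[top .. num_blocks-1] are available */
    size_t num_blocks;
    size_t top;         /* number of blocks currently checked out */
} toy_pool_t;

/* Allocate num_blocks heap blocks of block_size bytes; 0 on success. */
static int toy_pool_init( toy_pool_t* pool, size_t num_blocks, size_t block_size )
{
    assert( num_blocks > 0 && block_size > 0 );

    pool->blocks = malloc( num_blocks * sizeof( void* ) );
    if ( pool->blocks == NULL ) return -1;

    for ( size_t i = 0; i < num_blocks; ++i )
    {
        pool->blocks[i] = malloc( block_size );
        if ( pool->blocks[i] == NULL )
        {
            /* Undo the partial allocation before reporting failure. */
            while ( i > 0 ) free( pool->blocks[--i] );
            free( pool->blocks );
            return -1;
        }
    }
    pool->num_blocks = num_blocks;
    pool->top        = 0;
    return 0;
}

/* Hand out one block, or NULL if the pool is exhausted. */
static void* toy_pool_checkout( toy_pool_t* pool )
{
    if ( pool->top == pool->num_blocks ) return NULL;
    return pool->blocks[pool->top++];
}

/* Return a previously checked-out block to the pool. */
static void toy_pool_checkin( toy_pool_t* pool, void* block )
{
    assert( pool->top > 0 );
    pool->blocks[--pool->top] = block;
}

/* Free everything; all blocks must have been checked in first. */
static void toy_pool_finalize( toy_pool_t* pool )
{
    assert( pool->top == 0 );
    for ( size_t i = 0; i < pool->num_blocks; ++i ) free( pool->blocks[i] );
    free( pool->blocks );
}
```

Keeping the pool behind its own small interface is what lets the caller (bli_mem.c in the commit above) stay simple: it only checks blocks out and back in.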
Field G. Van Zee
9848f255a3 Added early return to API-level _init() routines.
Details:
- Added conditional code that returns early from the API-level _init()
  routines if the API is already initialized. Actually meant for this to
  be included in 5f93cbe8.
2015-06-11 19:14:22 -05:00
Field G. Van Zee
5f93cbe870 Introduced API-level initialization.
Details:
- Added API-level initialization state to _const, _error, _mem, _thread,
  _ind, and _cntl APIs. While this functionality will mostly go unused,
  adding minuscule overhead at init-time, there will be at least one
  instance in the near future where, in order to avoid an infinite loop,
  a certain portion of the initialization will call a query function that
  itself attempts to call bli_init(). API-level initialization will allow
  this later stage to verify that an earlier stage of initialization has
  completed, even if the overall call to bli_init() has not yet returned.
- Added _is_initialized() functions for each API, setting the underlying
  bool_t during _init() and unsetting it during _finalize().
- Comment, whitespace changes.
2015-06-11 18:52:12 -05:00
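A rough sketch of the per-API initialization pattern described in this commit and the follow-up above it (early return when already initialized). The toy_const_*() names and the call counter are hypothetical and exist only to make the behavior observable; the real routines are not shown in the log.

```c
#include <stdbool.h>

/* Per-API state: set during _init(), cleared during _finalize(). */
static bool toy_const_is_init   = false;
static int  toy_const_init_runs = 0;   /* illustration only */

static bool toy_const_is_initialized( void )
{
    return toy_const_is_init;
}

static void toy_const_init( void )
{
    /* Early return: a repeated call must not redo the work. */
    if ( toy_const_is_initialized() ) return;

    ++toy_const_init_runs;      /* stands in for the real setup work */
    toy_const_is_init = true;
}

static void toy_const_finalize( void )
{
    toy_const_is_init = false;
}

static int toy_const_get_init_runs( void )
{
    return toy_const_init_runs;
}
```

A later initialization stage can call toy_const_is_initialized() to confirm that this stage has completed, without calling the top-level initializer again.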
Field G. Van Zee
ee129c6b02 Fixed bugs in _get_range(), _get_range_weighted().
Details:
- Fixed some bugs that only manifested in multithreaded instances of
  some (non-gemm) level-3 operations. The bugs were related to invalid
  allocation of "edge" cases to thread subpartitions. (Here, we define
  an "edge" case to be one where the dimension being partitioned for
  parallelism is not a whole multiple of whatever register blocksize
  is needed in that dimension.) In BLIS, we always require edge cases
  to be part of the bottom, right, or bottom-right subpartitions.
  (This is so that zero-padding only has to happen at the bottom, right,
  or bottom-right edges of micro-panels.) The previous implementations
  of bli_get_range() and _get_range_weighted() did not adhere to this
  implicit policy and thus produced bad ranges for some combinations of
  operation, parameter cases, problem sizes, and n-way parallelism.
- As part of the above fix, the functions bli_get_range() and
  _get_range_weighted() have been renamed to use _l2r, _r2l, _t2b,
  and _b2t suffixes, similar to the partitioning functions. This is
  an easy way to make sure that the variants are calling the right
  version of each function. The function signatures have also been
  changed slightly.
- Comment/whitespace updates.
- Removed unnecessary '/' from macros in bli_obj_macro_defs.h.
2015-06-10 12:53:28 -05:00
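The edge-case policy above can be sketched as follows for a left-to-right partitioning: every thread receives a whole number of blocks of size bf, and the leftover "edge" columns go to the last (rightmost) thread. The function toy_get_range_l2r() and its signature are hypothetical; this is not the BLIS implementation, and how the surplus whole blocks are distributed is an arbitrary choice made here.

```c
#include <assert.h>

/* Compute thread tid's column range [*start, *end) out of n columns,
   using blocks of bf columns and nt threads. */
static void toy_get_range_l2r( long n, long bf, int nt, int tid,
                               long* start, long* end )
{
    assert( n >= 0 && bf > 0 && nt > 0 && tid >= 0 && tid < nt );

    long n_blocks = n / bf;         /* whole blocks available */
    long n_edge   = n % bf;         /* leftover columns (the edge case) */
    long lo       = n_blocks / nt;  /* blocks every thread gets */
    long extra    = n_blocks % nt;  /* first `extra` threads get one more */

    long my_blocks     = lo + ( tid < extra ? 1 : 0 );
    long blocks_before = tid * lo + ( tid < extra ? tid : extra );

    *start = blocks_before * bf;
    *end   = *start + my_blocks * bf;

    /* The edge always lands in the rightmost subpartition. */
    if ( tid == nt - 1 ) *end += n_edge;
}
```

For n = 23, bf = 4, and three threads, the ranges are [0,8), [8,16), and [16,23); only the last range is not a multiple of 4.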
Field G. Van Zee
9135dfd69d Minor updates to test/3m4m files. 2015-06-05 13:37:44 -05:00
Field G. Van Zee
d62ceece94 Minor update to test/3m4m/runme.sh.
Details:
- Removed some stale script code that should have been removed
  during 590bb3b8c.
2015-06-03 12:56:45 -05:00
Field G. Van Zee
b6ee82a3d4 Minor cleanup to bli_init() and friends.
Details:
- Spun-off initialization of global scalar constants to bli_const_init()
  and of threading stuff to bli_thread_init().
- Added some missing _finalize() functions, even when there is nothing
  to do.
2015-06-03 12:14:23 -05:00
Field G. Van Zee
1213f5ceba POSIX thread bugfixes/edits to bli_init.c, _mem.c.
Details:
- Fixed a sort-of bug in bli_init.c whereby the wrong pthread mutex
  was used to lock access to initialization/finalization actions.
  But everything worked out okay as long as bli_init() was called by
  single-threaded code.
- Changed to static initialization for memory allocator mutex in
  bli_mem.c, and moved mutex to that file (from bli_init.c).
- Fixed some type mismatches in bli_threading_pthreads.c that resulted
  in compiler warnings.
- Fixed a small memory leak with allocated-but-never-freed (and unused)
  pthread_attr_t objects.
- Whitespace changes to bli_init.c and bli_mem.c.
2015-06-02 13:27:47 -05:00
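As a brief illustration of static mutex initialization, the sketch below keeps a mutex in the same file as the state it protects and initializes it with PTHREAD_MUTEX_INITIALIZER, so no runtime pthread_mutex_init() call is needed. The variable and function names are hypothetical and do not come from bli_mem.c.

```c
#include <pthread.h>

/* Statically initialized mutex, private to this file. */
static pthread_mutex_t toy_mem_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Shared state guarded by toy_mem_mutex. */
static int toy_blocks_in_use = 0;

/* Record one more block in use and return the new count. */
static int toy_acquire_block( void )
{
    pthread_mutex_lock( &toy_mem_mutex );
    int count = ++toy_blocks_in_use;
    pthread_mutex_unlock( &toy_mem_mutex );

    return count;
}
```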
Field G. Van Zee
590bb3b8c5 Backed-out adjusted dim changes to test/3m4m.
Details:
- Reverted most changes applied during commit ec25807b.
2015-05-24 16:02:53 -05:00
Field G. Van Zee
ec25807b26 Tweaks to test/3m4m to test with adjusted dims.
Details:
- Updated test/3m4m driver files to build test drivers that allow
  comparison of real "asm_blis" results to complex "asm_blis" results,
  except with the latter's problem sizes adjusted so that problems are
  generated with equal flop counts.
2015-04-10 13:23:50 -05:00
Field G. Van Zee
426b648858 Fixed a packing bug that manifested in trsm_r.
Details:
- Fixed a bug that caused a memory leak in the contiguous memory
  allocator. Because packm_init() was using simple aliasing when
  a subpartition object was marked as zeros by bli_acquire_mpart_*(),
  the "destination" pack object's mem_t entry was being overwritten
  by the corresponding field of the "source" object (which was likely
  NULL). This prevented the block from being released back to the
  memory allocator. But this bug only manifested when changing the
  location of packing B from outside the var1 loop to inside the
  var3 loop, and only for trsm with triangular B (side = right). The
  bug was fixed by changing the type of alias used in packm_init()
  when handling zero partition cases. Specifically, we now use
  bli_obj_alias_for_packing(), which does not clobber the destination
  (pack) object's mem_t field. Thanks to Devangi Parikh for this bug
  report.
2015-04-08 15:12:21 -05:00
Field G. Van Zee
c84286d5ce More minor tweaks to test/3m4m.
Details:
- Added a line of output that forces matlab to allocate the entire array
  up-front.
- Re-enabled real domain benchmarks in runme.sh, which were temporarily
  disabled.
2015-04-04 15:39:14 -05:00
Field G. Van Zee
309717c8eb More tweaks to test/3m4m, configurations.
Details:
- Fixed incorrect number of mc_x_kc memory blocks in
  sandybridge/bli_config.h.
- Enabled OpenMP multithreading in piledriver/bli_config.h.
- More updates to test/3m4m driver files.
2015-04-03 19:28:49 -05:00
Field G. Van Zee
4baf3b9c69 Tweaked test/3m4m driver, including acml support.
Details:
- Added ACML support to test/3m4m driver Makefile and runme.sh script.
2015-04-03 16:44:32 -05:00
Field G. Van Zee
a32f7c49ca Merge pull request #23 from xianyi/master
Add auto-detecting CPU on configure stage.
2015-04-03 08:28:11 -05:00
Field G. Van Zee
349e075ad6 Tweaks to sandybridge config, test/3m4m driver.
Details:
- Enable OpenMP support by default in sandybridge's bli_config.h.
- Reorganized sandybridge's bli_kernel.h.
- Updated 3m4m Makefile, runme.sh to also test MKL implementation.
2015-04-02 18:12:28 -05:00
Zhang Xianyi
4bfd1ce8ca Detect NEON for cortex-a9 and cortex-a15. 2015-04-02 16:40:21 -05:00
Zhang Xianyi
aa6eec4f43 Detect the CPU architecture. Support ARM cores.
Detect the CPU architecture by compiler's predefined macros.
Then, detect the CPU cores.

Support detecting x86 and ARM architectures.
2015-04-02 16:09:02 -05:00
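Detecting the architecture from the compiler's predefined macros might look roughly like the sketch below. The macros tested are the common ones defined by GCC, Clang, and MSVC; the function name toy_detect_arch() and the returned strings are hypothetical, and the actual detection code in this commit may check a different set.

```c
/* Return a coarse architecture name chosen at compile time. */
static const char* toy_detect_arch( void )
{
#if defined( __x86_64__ ) || defined( _M_X64 ) || \
    defined( __i386__ )   || defined( _M_IX86 )
    return "x86";
#elif defined( __aarch64__ ) || defined( _M_ARM64 ) || \
      defined( __arm__ )     || defined( _M_ARM )
    return "arm";
#else
    return "unknown";
#endif
}
```

This only identifies the architecture family the compiler is targeting; identifying the specific core is a separate, later step.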
Zhang Xianyi
2947cfb749 Add auto-detecting CPU on configure stage.
e.g. /Path_to_BLIS/configure auto

For now, it only supports detecting x86 CPUs.
2015-04-01 12:24:00 -05:00
Field G. Van Zee
26a4b8f6f9 Implemented 3m2, 3m3 induced algorithms (gemm only).
Details:
- Defined a new "3ms" (separated 3m) pack schema and added appropriate
  support in packm_init(), packm_blk_var2().
- Generalized packm_struc_cxk_3mi to take the imaginary stride (is_p)
  as an argument instead of computing it locally. Exception: for trmm,
  is_p must be computed locally, since it changes for triangular
  packed matrices. Also exposed is_p in interface to dt-specific
  packm_blk_var2 (and _var1, even though it does not use imaginary
  stride).
- Renamed many functions/variables from _3mi to _3mis to indicate that
  they work for either interleaved or separated 3m pack schemas.
- Generalized gemm and herk macro-kernels to pass in imaginary stride
  rather than compute them locally.
- Added support for 3m2 and 3m3 algorithms to frame/ind, including 3m2-
  and 3m3-specific virtual micro-kernels.
- Added special gemm macro-kernels to support 3m2 and 3m3.
- Added support for 3m2 and 3m3 to testsuite.
- Corrected the type of the panel dimension (pd_) in various macro-
  kernels from inc_t to dim_t.
- Renamed many functions defined in bli_blocksize.c.
- Moved most induced-related macro defs from frame/include to
  frame/ind/include.
- Updated the _ukernel.c files so that the micro-kernel function pointers
  are obtained from the func_t objects rather than the cpp macros that
  define the function names.
- Updated test/3m4m driver, Makefile, and run script.
2015-04-01 10:44:54 -05:00
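For readers new to the terminology, the sketch below shows the scalar identity that "3m" is assumed here to refer to: forming a complex product with three real multiplications instead of four. In the commit above the idea is applied at the level of matrix products in gemm; how 3m2 and 3m3 differ from each other is not described in this log and is not modeled. The names toy_dcomplex_t and toy_mul_3m() are hypothetical.

```c
typedef struct { double real; double imag; } toy_dcomplex_t;

/* Complex product c = a * b using three real multiplications:
     p1 = ar*br, p2 = ai*bi, p3 = (ar + ai)*(br + bi)
     real(c) = p1 - p2
     imag(c) = p3 - p1 - p2   ( = ar*bi + ai*br ) */
static toy_dcomplex_t toy_mul_3m( toy_dcomplex_t a, toy_dcomplex_t b )
{
    double p1 = a.real * b.real;
    double p2 = a.imag * b.imag;
    double p3 = ( a.real + a.imag ) * ( b.real + b.imag );

    toy_dcomplex_t c = { p1 - p2, p3 - p1 - p2 };
    return c;
}
```

The saving of one multiplication is paid for with extra additions, and the imaginary part is computed by cancellation, so its rounding behavior differs from the conventional four-multiplication formula.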
Tyler Smith
ddf62ba7d2 Refuse to free the packm thread info if it uses the single threaded version 2015-03-27 14:27:51 -05:00
Tyler Smith
016fc58758 Don't free packm thread info if it is null 2015-03-27 14:23:02 -05:00
Tyler Smith
00a443c529 Use bli_malloc instead of malloc for the thread info paths 2015-03-27 14:11:07 -05:00
Field G. Van Zee
f1a6b7d028 Reorganized code for induced complex methods.
Details:
- Consolidated most of the code relating to induced complex methods
  (e.g. 4mh, 4m1, 3mh, 3m1, etc.) into frame/ind. Induced methods
  are now enabled on a per-operation basis. The current "available"
  (enabled and implemented) implementation can then be queried on
  an operation basis. Micro-kernel func_t objects as well as blksz_t
  objects can also be queried in a similar manner.
- Redefined several micro-kernel and operation-related functions in
  bli_info_*() API, in accordance with above changes.
- Added mr and nr fields to blksz_t object, which point to the mr
  and nr blksz_t objects for each cache blocksize (and are NULL for
  register blocksizes). Renamed the sub-blocksize field "sub" to
  "mult" since it is really expressing a blocksize multiple.
- Updated bli_*_determine_kc_[fb]() for gemm/hemm/symm, trmm, and
  trsm to correctly query mr and nr (for purposes of nudging kc).
- Introduced an enumerated opid_t in bli_type_defs.h that uniquely
  identifies an operation. For now, only level-3 id values are defined,
  along with a generic, catch-all BLIS_NOID value.
- Reworked testsuite so that all induced methods that are enabled
  are tested (one at a time) rather than only testing the first
  available method.
- Reformatted summary at the beginning of testsuite output so that
  blocksize and micro-kernel info is shown for each induced method
  that was requested (as well as native execution).
- Reduced the number of columns needed to display non-matlab
  testsuite output (from approx. 90 to 80).
2015-03-18 15:37:10 -05:00