Commit Graph

402 Commits

Author SHA1 Message Date
Field G. Van Zee
701b9aa3ff Redesigned control tree infrastructure.
Details:
- Altered control tree node struct definitions so that all nodes have the
  same struct definition, whose primary fields consist of a blocksize id,
  a variant function pointer, a pointer to an optional parameter struct,
  and a pointer to a (single) sub-node. This unified control tree type is
  now named cntl_t.
- Changed the way control tree nodes are connected, and what computation
  they represent, such that, for example, packing operations are now
  associated with nodes that are "inline" in the tree, rather than off-
  shoot braches. The original tree for the classic Goto gemm algorithm was
  expressed (roughly) as:

    blk_var2 -> blk_var3 -> blk_var1 -> ker_var2
                         |           |
                         -> packb    -> packa

  and now, the same tree would look like:

    blk_var2 -> blk_var3 -> packb -> blk_var1 -> packa -> ker_var2

  Specifically, the packb and packa nodes perform their respective packing
  operations and then recurse (without any loop) to a subproblem. This means
  there are now two kinds of level-3 control tree nodes: partitioning and
  non-partitioning. The blocked variants are members of the former, because
  they iteratively partition off submatrices and perform suboperations on
  those partitions, while the packing variants belong to the latter group.
  (This change has the effect of allowing greatly simplified initialization
  of the nodes, which previously involved setting many unused node fields to
  NULL.)
- Changed the way thrinfo_t tree nodes are arranged to mirror the new
  connective structure of control trees. That is, packm nodes are no longer
  off-shoot branches of the main algorithmic nodes, but rather connected
  "inline".
- Simplified control tree creation functions. Partitioning nodes are created
  concisely with just a few fields needing initialization. By contrast, the
  packing nodes require additional parameters, which are stored in a
  packm-specific struct that is tracked via the optional parameters pointer
  within the control tree struct. (This parameter struct must always begin
  with a uint64_t that contains the byte size of the struct. This allows
  us to use a generic function to recursively copy control trees.) gemm,
  herk, and trmm control tree creation continues to be consolidated into
  a single function, with the operation family being used to select
  among the parameter-agnostic macro-kernel wrappers. A single routine,
  bli_cntl_free(), is provided to free control trees recursively, whereby
  the chief thread within a groups release the blocks associated with
  mem_t entries back to the memory broker from which they were acquired.
- Updated internal back-ends, e.g. bli_gemm_int(), to query and call the
  function pointer stored in the current control tree node (rather than
  index into a local function pointer array). Before being invoked, these
  function pointers are first cast to a gemm_voft (for gemm, herk, or trmm
  families) or trsm_voft (for trsm family) type, which is defined in
  frame/3/bli_l3_var_oft.h.
- Retired herk and trmm internal back-ends, since all execution now flows
  through gemm or trsm blocked variants.
- Merged forwards- and backwards-moving variants by querying the direction
  from routines as a function of the variant's matrix operands. gemm and
  herk always move forward, while trmm and trsm move in a direction that
  is dependent on which operand (a or b) is triangular.
- Added functions bli_thread_get_range_mdim(), bli_thread_get_range_ndim(),
  each of which takes additional arguments and hides complexity in managing
  the difference between the way ranges are computed for the four families
  of operations.
- Simplified level-3 blocked variants according to the above changes, so that
  the only steps taken are:
  1. Query partitioning direction (forwards or backwards).
  2. Prune unreferenced regions, if they exist.
  3. Determine the thread partitioning sub-ranges.
  <begin loop>
    4. Determine the partitioning blocksize (passing in the partitioning
       direction)
    5. Acquire the curren iteration's partitions for the matrices affected
       by the current variants's partitioning dimension (m, k, n).
    6. Call the subproblem.
  <end loop>
- Instantiate control trees once per thread, per operation invocation.
  (This is a change from the previous regime in which control trees were
  treated as stateless objects, initialized with the library, and shared
  as read-only objects between threads.) This once-per-thread allocation
  is done primarily to allow threads to use the control tree as as place
  to cache certain data for use in subsequent loop iterations. Presently,
  the only application of this caching is a mem_t entry for the packing
  blocks checked out from the memory broker (allocator). If a non-NULL
  control tree is passed in by the (expert) user, then the tree is copied
  by each thread. This is done in bli_l3_thread_decorator(), in
  bli_thrcomm_*.c.
- Added a new field to the context, and opid_t which tracks the "family"
  of the operation being executed. For example, gemm, hemm, and symm are
  all part of the gemm family, while herk, syrk, her2k, and syr2k are
  all part of the herk family. Knowing the operation's family is necessary
  when conditionally executing the internal (beta) scalar reset on on
  C in blocked variant 3, which is needed for gemm and herk families,
  but must not be performed for the trmm family (because beta has only
  been applied to the current row-panel of C after the first rank-kc
  iteration).
- Reexpressed 3m3 induced method blocked variant in frame/3/gemm/ind
  to comform with the new control tree design, and renamed the macro-
  kernel codes corresponding to 3m2 and 4m1b.
- Renamed bli_mem.c (and its APIs) to bli_memsys.c, and renamed/relocated
  bli_mem_macro_defs.h from frame/include to frame/base/bli_mem.h.
- Renamed/relocated bli_auxinfo_macro_defs.h from frame/include to
  frame/base/bli_auxinfo.h.
- Fixed a minor bug whereby the storage-to-ukr-preference matching
  optimization in the various level-3 front-ends was not being applied
  properly when the context indicated that execution would be via an
  induced method. (Before, we always checked the native micro-kernel
  corresponding to the datatype being executed, whereas now we check
  the native micro-kernel corresponding to the datatype's real projection,
  since that is the micro-kernel that is actually used by induced methods.
- Added an option to the testsuite to skip the testing of native level-3
  complex implementations. Previously, it was always tested, provided that
  the c/z datatypes were enabled. However, some configurations use
  reference micro-kernels for complex datatypes, and testing these
  implementations can slow down the testsuite considerably.
2016-08-26 19:04:45 -05:00
Field G. Van Zee
c6f5c215ee Merge branch 'master' into compose 2016-08-22 17:33:02 -05:00
Field G. Van Zee
16a4c7a823 Fixed bugs in bli_mutex_init() and friends.
Details:
- Fixed a couple of bugs that affected OpenMP and POSIX threads
  configurations that resulted in compiler errors and warnings due
  to type mismatch, and in the case of pthreads, a missing function
  argument. The bugs are fairly recent, introduced in a017062.
2016-08-19 11:38:36 -05:00
Field G. Van Zee
d52cb76715 Merge branch 'master' into compose 2016-07-27 16:04:55 -05:00
Field G. Van Zee
c31b1e7b9d Relax alignment restrictions for sandybridge ukrs.
Details:
- Relaxed the base pointer and leading dimension alignment restrictions
  in the sandybridge gemm microkernels, allowing the use of vmovups/vmovupd
  instead of vmovaps/vmovapd. These change mimic those made to the haswell
  microkernels in e0d2fa0 and ee2c139.
- Updated testsuite modules as well as standalone test drivers in 'test'
  directory to use DBL_MAX as the initial time candidate. Thanks to Devin
  Matthews for suggesting this change.
- Inserted #include "float.h" into bli_system.h (to gain access to DBL_MAX).
- Minor update (vis-a-vis contexts) to driver code in test/3m4m.
2016-07-27 15:58:07 -05:00
Field G. Van Zee
95abea46f8 Merge branch 'master' into compose 2016-07-23 15:38:33 -05:00
Field G. Van Zee
a017062fdf Integrated "memory broker" (membrk_t) abstraction.
Details:
- Integrated a patch originally authored and submitted by Ricardo Magana
  of HP Enterprise. The changeset inserts use of a new object type, membrk_t,
  (memory broker) that allows multiple sets of memory pools on, for example,
  separate NUMA nodes, each of which has a separate memory space.
- Added membrk field to cntx_t and defined corresponding accessor macros.
- Added membrk field to mem_t object and defined corresponding accessor macros.
- Created new bli_membrk.c file, which contains the new memory broker API,
  including:
    bli_membrk_init(), bli_membrk_finalize()
    bli_membrk_acquire_[mv](), bli_membrk_release(),
    bli_membrk_init_pools(), bli_membrk_reinit_pools(),
    bli_membrk_finalize_pools(),
    bli_membrk_pool_size()
- In bli_mem.c, changed function calls to
    bli_mem_init_pools()     -> bli_membrk_init()
    bli_mem_reinit_pools()   -> bli_membrk_reinit()
    bli_mem_finalize_pools() -> bli_membrk_finalize()
- In bli_packv_init.c, bli_packm_init.c, changed function calls to:
    bli_mem_acquire_[mv]() -> bli_membrk_acquire_[mv]()
    bli_mem_release()      -> bli_membrk_release()
- Added bli_mutex.c and related files to frame/thread. These files define
  abstract mutexes (locks) and corresponding APIs for pthreads, openmp, or
  single-threaded execution. This new API is employed within functions
  such as bli_membrk_acquire_[mv]() and bli_membrk_release().
2016-07-22 17:02:59 -05:00
Field G. Van Zee
31def12e26 First phase of control tree redesign.
Details:
- These changes constitute the first set of changes in preparation to
  revamping the structure and use of control trees in BLIS. Modifications
  in this commit don't affect the control tree code yet, but rather lay
  the groundwork.
- Defined wrappers for the following functions, where the the wrappers
  each take a direction parameter of a new enumerated type (BLIS_BWD or
  BLIS_FWD), dir_t, and executes the correct underlying function.
  - bli_acquire_mpart_*() and _vpart_*()
  - bli_*_determine_kc_[fb]()
  - bli_thread_get_range_*() and bli_thread_get_range_weighted_*()
- Consolidated all 'f' (forwards-moving) and 'b' (backwards-moving)
  blocked variants for trmm and trsm, and renamed gemm and herk variants
  accordingly. The direction is now queried via routines such as
  bli_trmm_direct(), which deterines the direction from the implied side
  and uplo parameters. For gemm and herk, it is uncondtionally BLIS_FWD.
- Defined wrappers to parameter-specific macrokernels for herk, trmm, and
  trsm, e.g. bli_trmm_xx_ker_var2(), that execute the correct underlying
  macrokernel based on the implied parameters. The same logic used to
  choose the dir_t in _direct() functions is used here.
- Simplified the function pointer arrays in _int() functions given the
  consolidation and dir_t querying mentioned above.
- Function signature (whitespace) reformatting for various functions.
- Removed old code in various 'old' directories.
2016-06-30 15:19:20 -05:00
Field G. Van Zee
232754feec Fixed compiler warning in rand[vm], randn[vm].
Details:
- Fixed compiler warnings about unused variables related to the disabling
  of normalization in the structured cases of the rand[vm] and randn[vm]
  operations.
2016-06-21 14:25:39 -05:00
Field G. Van Zee
a89555d160 Added randn[vm] operations, support in testsuite.
Details:
- Defined a new randomization operation, randn, on vectors and matrices.
  The randnv and randnm operations randomize each element of the target
  object with values from a narrow range of values. Presently, those
  values are all integer powers of two, but they do not need to be powers
  of two in order to achieve the primary goal, which is to initialize
  objects that can be operated on with plenty of precision "slack"
  available to allow computations that avoid roundoff. Using this method
  of randomization makes it much more likely that testsuite residuals of
  properly-functioning operations are close to zero, if not exactly zero.
- Updated existing randomization operations randv and randm to skip
  special diagonal handling and normalization for matrices with structure.
  This is now handled by the testsuite modules by explicitly calling a
  testsuite function that loads the diagonal (and scales off-diagonal
  elements).
- Added support for randnv and randnm in the testsuite with a new switch
  in input.general that universally toggles between use of the classic
  randv/randm, which use real values on the interval [-1,1], and
  randnv/randnm, which use only values from a narrow range. Currently,
  the narrow range is: +/-{2^0, 2^-1, 2^-2, 2^-3, 2^-4, 2^-5, 2^-6}, as
  well as 0.0.
- Updated testsuite modules so that a testsutie wrapper function is called
  instead of directly calling the randomization operations (such as
  bli_randv() and bli_randm()). This wrapper also takes a bool_t that
  indicates whether the object's elements should be normalized. (NOTE: As
  alluded to above, in the test modules of triangular solve operations such
  as trsv and trsm, we perform the extra step of loading the diagonal.)
- Defined a new level-0 operation, invertsc, which inverts a scalar.
- Updated the abval2ris and sqrt2ris level-0 macros to avoid an unlikely
  but possible divide-by-zero.
- Updated function signature and prototype formatting in testsuite.
2016-06-17 14:08:35 -05:00
Field G. Van Zee
096895c5d5 Reorganized code, APIs related to multithreading.
Details:
- Reorganized code and renamed files defining APIs related to multithreading.
  All code that is not specific to a particular operation is now located in a
  new directory: frame/thread. Code is now organized, roughly, by the
  namespace to which it belongs (see below).
- Consolidated all operation-specific *_thrinfo_t object types into a single
  thrinfo_t object type. Operation-specific level-3 *_thrinfo_t APIs were
  also consolidated, leaving bli_l3_thrinfo_*() and bli_packm_thrinfo_*()
  functions (aside from a few general purpose bli_thrinfo_*() functions).
- Renamed thread_comm_t object type to thrcomm_t.
- Renamed many of the routines and functions (and macros) for multithreading.
  We now have the following API namespaces:
  - bli_thrinfo_*(): functions related to thrinfo_t objects
  - bli_thrcomm_*(): functions related to thrcomm_t objects.
  - bli_thread_*(): general-purpose functions, such as initialization,
    finalization, and computing ranges. (For now, some macros, such as
    bli_thread_[io]broadcast() and bli_thread_[io]barrier() use the
    bli_thread_ namespace prefix, even though bli_thrinfo_ may be more
    appropriate.)
- Renamed thread-related macros so that they use a bli_ prefix.
- Renamed control tree-related macros so that they use a bli_ prefix (to be
  consistent with the thread-related macros that were also renamed).
- Removed #undef BLIS_SIMD_ALIGN_SIZE from dunnington's bli_kernel.h. This
  #undef was a temporary fix to some macro defaults which were being applied
  in the wrong order, which was recently fixed.
2016-06-06 13:32:04 -05:00
Tyler Michael Smith
232530e88f Merge commit 'refs/pull/81/head' of https://github.com/flame/blis
Conflicts:
	frame/base/bli_threading_pthreads.c
	frame/base/bli_threading_pthreads.h
2016-06-01 15:14:10 -05:00
Tyler Michael Smith
4bcabd1bf6 Use spin locks instead of pthread barriers 2016-06-01 13:27:28 -05:00
Jeff Hammond
eef37f8b4d use GCC intrinsic instead of pthread_mutex for atomic increment and fetch 2016-05-29 22:28:13 -07:00
Field G. Van Zee
9dcd6f05c4 Implemented developer-configurable malloc()/free().
Details:
- Replaced all instances of bli_malloc() and bli_free() with one of:
  - bli_malloc_pool()/bli_free_pool()
  - bli_malloc_user()/bli_free_user()
  - bli_malloc_intl()/bli_free_intl()
  each of which can be configured to call malloc()/free() substitutes,
  so long as the substitute functions have the same function type
  signatures as malloc() and free() defined by C's stdlib.h. The _pool()
  function is called when allocating blocks for the memory pools (used
  for packing buffers, primarily), the _user() function is called when
  obj_t's are created (via bli_obj_create() and friends), and the _intl()
  function is called for internal use by BLIS, such as when creating
  control tree nodes or temporary buffers for manipulating internal data
  structures. Substitutes for any of the three types of bli_malloc() may
  be specified by #defining the following pairs of cpp macros in
  bli_kernel.h:
  - BLIS_MALLOC_POOL/BLIS_FREE_POOL
  - BLIS_MALLOC_USER/BLIS_FREE_USER
  - BLIS_MALLOC_INTL/BLIS_FREE_INTL
  to be the name of the substitute functions. (Obviously, the object
  code that contains these functions must be provided at link-time.)
  These macros default to malloc() and free(). Subsitute functions are
  also automatically prototyped by BLIS (in bli_malloc_prototypes.h).
- Removed definitions for bli_malloc() and bli_free().
- Note that bli_malloc_pool() and bli_malloc_user() are now defined in
  terms of a new function, bli_malloc_align(), which aligns memory to an
  arbitrary (power of two) alignment boundary, but does so manually,
  whereas before alignment was performed behind the scenes by
  posix_memalign(). Currently, bli_malloc_intl() is defined in terms
  of bli_malloc_noalign(), which serves as a simple wrapper to the
  designated function that is passed in (e.g. BLIS_MALLOC_INTL).
  Similarly, there are bli_free_align() and bli_free_noalign(), which
  are used in concert with their bli_malloc_*() counterparts.
2016-05-24 13:15:32 -05:00
Field G. Van Zee
32db0adc21 Generate prototypes for user-defined packm kernels.
Details:
- Created template prototypes for packm kernels (in bli_l1m_ker.h), and
  then redefined reference packm kernels' prototyping headers in terms of
  this template, as is already done for level-1v, -1f, and -3 kernels.
- Automatically generate prototypes for user-defined packm kernels in
  bli_kernel_prototypes.h (using the new template prototypes in
  bli_l1m_ker.h).
- Defined packm kernel function types in bli_l1m_ft.h, including for
  packm kernels specific to induced methods, which are now used in
  bli_packm_cxk.c and friends rather than using a locally-defined
  function type.
- In bli_packm_cxk.c, extended function pointer for packm kernels array
  from out to index 31 (from previous maximum of 17). This allows us to
  store the unrolled 30xk kernel in the array for use (on knc, for
  example). Note: This should have been done a long time ago.
2016-05-17 15:20:16 -05:00
Field G. Van Zee
4bcf1b35ab Fixed bli_get_range_*() bugs in trsm variants.
Details:
- Fixed incorrect calls to bli_get_range_*() from within trsm blocked
  variants 1f, 2b, and 2f. The bug somehow went undetected since the
  big commit (537a1f4), and, strangely, did not manifest via the BLIS
  testsuite. The bug finally came to our attention when running thei
  libflame test suite while linking to BLIS. Thanks to Kiran Varaganti
  for submitting the initial report that led to this bug.
2016-05-11 16:09:49 -05:00
Field G. Van Zee
9cfa33023f Minor updates to bli_f2c.h.
Details:
- Added #undef guards to certain #define statements in bli_f2c.h,
  and renamed the file guard to BLIS_F2C_H. This helps when
  #including "blis.h" from an application or library that already
  #includes an "f2c.h" header.
2016-05-11 16:02:30 -05:00
Devin Matthews
7c604e1cbc Move default SIMD-related definitions to bli_kernel_macro_defs.h. Otherwise, configurations which customize these fail as these are now defined in bli_kernel.h. 2016-05-10 12:11:55 -05:00
Field G. Van Zee
a7be2d28e8 Merge pull request #74 from devinamatthews/fix_common_symbols
Default-initialize all extern global variables to avoid generating common symbols.
2016-05-10 11:48:51 -05:00
Devin Matthews
4b1e55edbf Default-initialize all extern global variables to avoid generating common symbols. Fixes #73. 2016-05-10 10:08:47 -05:00
Field G. Van Zee
97b512ef62 Include headers from cblas.h to pull in f77_int.
Details:
- Added #include statements for certain key BLIS headers so that the
  definition of f77_int is pulled in when a user compiles application
  code with only #include "cblas.h" (and no other BLIS header). This
  is necessary since f77_int is now used within the cblas API.
2016-05-06 10:24:30 -05:00
Field G. Van Zee
c3a4d39d03 Updates to haswell gemm micro-kernels.
Details:
- Added two new sets of [sd]gemm micro-kernels for haswell architectures,
  one that is 4x24/4x12 (s and d) and one that is 6x16/6x8.
- Changed the haswell configuration to use the 6x16/6x8 micro-kernels
  by default.
- Updated various Makefiles, in test, test/3m4m, and testsuite.
2016-05-04 17:22:56 -05:00
Field G. Van Zee
0b01d355ae Miscellaneous cleanups, fixes to recent commits.
Details:
- Fixed a typo in bli_l1f_ref.h, introduced into bbb8569, that only
  manifested when non-reference level-1f kernels were used.
- Added an #undef BLIS_SIMD_ALIGN_SIZE to bli_kernel.h of dunnington
  configuration to prevent a compile-time warning until I can figure out
  the proper permanent fix.
- Moved frame/1f/kernels/bli_dotxaxpyf_ref_var1.c out of the compilation
  path (into 'other' directory). _ref_var2 is used by default, which is
  the variant that is built on axpyf and dotxf instead of dotaxpyv.
- Removed section of frame/include/bli_config_macro_defs.h pertaining to
  mixed datatype support.
2016-04-27 15:21:10 -05:00
Field G. Van Zee
ed7326c836 Added 'restrict' to l1v/l1f code in 'kernels' dir.
Details:
- Added 'restrict' keyword to existing kernel definitions in 'kernels'
  directory. These changes were meant for inclusion in bbb8569.
2016-04-27 14:57:40 -05:00
Field G. Van Zee
bbb8569b2a Use 'restrict' in all kernel APIs; wspace changes.
Details:
- Updated level-1v, level-1f kernel function types (bli_l1?_ft.h) and
  generic kernel prototypes (bli_l1?_ker.h) to use 'restrict' for all
  numerical operand pointers (ie: all pointers except the cntx_t).
- Updated level-1f reference kernel definitions to use 'restrict' for
  all numerical operand pointers. (Level-1v reference kernel definitions
  were already updated in bdbda6e.)
- Rewrote the level-1v and level-1f reference kernel prototypes in
  bli_l1v_ref.h and bli_l1f_ref.h, respectively, to simply #include
  bli_l1v_ker.h and bli_l1f_ker.h with redefined function base names
  (as was already being done for the level-3 micro-kernel prototypes
  in bli_l3_ref.h), rather than duplicate the signatures from the
  _ker.h files.
- Added definitions to frame/include/bli_kernel_prototypes.h for axpbyv
  and xpbyv, which were probably meant for inclusion in bdbda6e.
- Converted a number of instances of four spaces, as introduced in
  bdbda6e, to tabs.
2016-04-27 14:13:46 -05:00
Field G. Van Zee
4ea419c72c Merge pull request #70 from devinamatthews/daxpby
Give the level1v operations some love
2016-04-26 12:50:45 -05:00
Devin Matthews
bdbda6e6ac Give the level1v operations some love:
- Add missing axpby and xpby operations (plus test cases).
- Add special case for scal2v with alpha=1.
- Add restrict qualifiers.
- Add special-case algorithms for incx=incy=1.
2016-04-25 11:05:57 -05:00
Field G. Van Zee
aa0bceec27 Merge branch 'master' of github.com:flame/blis 2016-04-22 12:01:31 -05:00
Field G. Van Zee
4136553f0d Clear level-3 cntx_t's via memset() before use.
Details:
- In all level-3 operations' _cntx_init() functions, replaced calls to
  bli_cntx_obj_init() with calls to bli_cntx_obj_clear(), and in all
  level-3 operations' _cntx_finalize() functions, removed calls to
  bli_cntx_obj_finalize(), leaving those function definitions empty.
- Changed the definition of bli_cntx_obj_clear() so that the clearing
  occurs via a single call to memset().
2016-04-22 11:53:53 -05:00
Devin Matthews
a9b6c3abda Merge remote-tracking branch 'origin/master' into cblas-f77-int
# Conflicts:
#	config/haswell/bli_config.h
2016-04-20 16:00:10 -05:00
Devin Matthews
e4c54c8146 Change integer type in CBLAS function signatures to f77_int, and add proper const-correctness to BLAS layer. 2016-04-20 15:56:46 -05:00
Field G. Van Zee
dd0ab1d93f Converted some bli_cntx query functions to macros.
Details:
- Commented out several datatype-aware query functions (those ending in
  _dt) from bli_cntx.c, as well as their prototypes in bli_cntx.h, and
  added equivalent cpp query macros to bli_cntx.h.
- Added 'bli_config.h' to .gitignore.
2016-04-20 14:38:23 -05:00
Devin Matthews
0e1a9821d8 Add configure options and generate bli_config.h automatically.
Options to configure have been added for:
- Setting the internal BLIS and BLAS/CBLAS integer sizes.
- Enabling and disabling the BLAS and CBLAS layers.

Additionally, configure options which require defining macros (the above plus the threading model), write their macros to the automatically-generated bli_config.h file in the top-level build directory. The old bli_config.h files in the config dirs were removed, and any kernel-related macros (SIMD size and alignment etc.) were moved to bli_kernel.h. The Makefiles were also modified to find the new bli_config.h file.

Lastly, support for OMP in clang has been added (closes #56).
2016-04-19 11:44:37 -05:00
Tyler Michael Smith
cbcd0b739d Changing ifdef for OSX pthread barriers 2016-04-18 03:12:57 -05:00
Tyler Smith
41694675e4 pthreads bugfixes
Getting pthreads to work on my Mac
Implemented a pthread barrier when _POSIX_BARRIER isn't defined
Now spawn n-1 threads instead of n threads so that master thread isn't just spinning the whole time
Add -lpthread instead of -pthread to LDFLAGS (for clang)
2016-04-13 15:51:08 -05:00
Field G. Van Zee
537a1f4f85 Implemented runtime contexts and reorganized code.
Details:
- Retrofitted a new data structure, known as a context, into virtually
  all internal APIs for computational operations in BLIS. The structure
  is now present within the type-aware APIs, as well as many supporting
  utility functions that require information stored in the context. User-
  level object APIs were unaffected and continue to be "context-free,"
  however, these APIs were duplicated/mirrored so that "context-aware"
  APIs now also exist, differentiated with an "_ex" suffix (for "expert").
  These new context-aware object APIs (along with the lower-level, type-
  aware, BLAS-like APIs) contain the the address of a context as a last
  parameter, after all other operands. Contexts, or specifically, cntx_t
  object pointers, are passed all the way down the function stack into
  the kernels and allow the code at any level to query information about
  the runtime, such as kernel addresses and blocksizes, in a thread-
  friendly manner--that is, one that allows thread-safety, even if the
  original source of the information stored in the context changes at
  run-time; see next bullet for more on this "original source" of info).
  (Special thanks go to Lee Killough for suggesting the use of this kind
  of data structure in discussions that transpired during the early
  planning stages of BLIS, and also for suggesting such a perfectly
  appropriate name.)
- Added a new API, in frame/base/bli_gks.c, to define a "global kernel
  structure" (gks). This data structure and API will allow the caller to
  initialize a context with the kernel addresses, blocksizes, and other
  information associated with the currently active kernel configuration.
  The currently active kernel configuration within the gks cannot be
  changed (for now), and is initialized with the traditional cpp macros
  that define kernel function names, blocksizes, and the like. However,
  in the future, the gks API will be expanded to allow runtime management
  of kernels and runtime parameters. The most obvious application of this
  new infrastructure is the runtime detection of hardware (and the
  implied selection of appropriate kernels). With contexts in place,
  kernels may even be "hot swapped" at runtime within the gks. Once
  execution enters a level-3 _front() function, the memory allocator will
  be reinitialized on-the-fly, if necessary, to accommodate the new
  kernels' blocksizes. If another application thread is executing with
  another (previously loaded) kernel, it will finish in a deterministic
  fashion because its kernel information was loaded into its context
  before computation began, and also because the blocks it checked out
  from the internal memory pools will be unaffected by the newer threads'
  reinitialization of the allocator.
- Reorganized and streamlined the 'ind' directory, which contains much of
  the code enabling use of induced methods for complex domain matrix
  multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as
  those APIs' functionality is now mostly subsumed within the global
  kernel structure.
- Updated bli_pool.c to define a new function, bli_pool_reinit_if(),
  that will reinitialize a memory pool if the necessary pool block size
  has increased.
- Updated bli_mem.c to use bli_pool_reinit_if() instead of
  bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed
  usage of contexts where appropriate to communicate cache and register
  blocksizes to bli_mem_compute_pool_block_sizes().
- Simplified control trees now that much of the information resides in
  the context and/or the global kernel structure:
  - Removed blocksize object pointers (blksz_t*) fields from all control
    tree node definitions and replaced them with blocksize id (bszid_t)
    values instead, which may be passed into a context query routine in
    order to extract the corresponding blocksize from the given context.
  - Removed micro-kernel function pointers (func_t*) fields from all
    control tree node definitions. Now, any code that needs these function
    pointers can query them from the local context, as identified by a
    level-3 micro-kernel id (l3ukr_t), level-1f kernel id, (l1fkr_t), or
    level-1v kernel id (l1vkr_t).
  - Removed blksz_t object creation and initialization, as well as kernel
    function object creation and initialization, from all operation-
    specific control tree initialization files (bli_*_cntl.c), since this
    information will now live in the gks and, secondarily, in the context.
- Removed blocksize multiples from blksz_t objects. Now, we track
  blocksize multiples for each blocksize id (bszid_t) in the context
  object.
- Removed the bool_t's that were required when a func_t was initialized.
  These bools are meant to allow one to track the micro-kernel's storage
  preferences (by rows or columns). This preference is now tracked
  separately within the gks and contexts.
- Merged and reorganized many separate-but-related functions into single
  files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and
  util directories, but has the most obvious effect of allowing BLIS
  to compile noticeably faster.
- Reorganized execution paths for level-1v, -1d, -1m, and -2 operations
  in an attempt to reduce overhead for memory-bound operations. This
  includes removal of default use of object-based variants for level-2
  operations. Now, by default, level-2 operations will directly call a
  low-level (non-object based) loop over a level-1v or -1f kernel.
- Converted many common query functions in blk_blksz.c (renamed from
  bli_blocksize.c) and bli_func.c into cpp macros, now defined in their
  respective header files.
- Defined bli_mbool.c API to create and query "multi-bools", or
  heterogeneous bool_t's (one for each floating-point datatype), in the
  same spirit as blksz_t and func_t.
- Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS
  and BLIS_SIMD_SIZE. These values are needed in order to compute a third
  new parameter, which may be set indirectly via the aforementioned
  macros or directly: BLIS_STACK_BUF_MAX_SIZE. This value is used to
  statically allocate memory in macro-kernels and the induced methods'
  virtual kernels to be used as temporary space to hold a single
  micro-tile. These values are now output by the testsuite. The default
  value of BLIS_STACK_BUF_MAX_SIZE is computed as
  "2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE".
- Cleaned up top-level 'kernels' directory (for example, renaming the
  embarrassingly misleading "avx" and "avx2" directories to "sandybridge"
  and "haswell," respectively, and gave more consistent and meaningful
  names to many kernel files (as well as updating their interfaces to
  conform to the new context-aware kernel APIs).
- Updated the testsuite to query blocksizes from a locally-initialized
  context for test modules that need those values: axpyf, dotxf,
  dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr.
- Reformatted many function signatures into a standard format that will
  more easily facilitate future API-wide changes.
- Updated many "mxn" level-0 macros (ie: those used to inline double loops
  for level-1m-like operations on small matrices) in frame/include/level0
  to use more obscure local variable names in an effort to avoid variable
  shaddowing. (Thanks to Devin Matthews for pointing these gcc warnings,
  which are only output using -Wshadow.)
- Added a conj argument to setm, so that its interface now mirrors that
  of scalm. The semantic meaning of the conj argument is to optionally
  allow implicit conjugation of the scalar prior to being populated into
  the object.
- Deprecated all type-aware mixed domain and mixed precision APIs. Note
  that this does not preclude supporting mixed types via the object APIs,
  where it produces absolutely zero API code bloat.
2016-04-11 17:21:28 -05:00
Field G. Van Zee
5a978fffdb Merge pull request #45 from devinamatthews/high_prec_timers
Use clock_gettime(CLOCK_MONOTONIC) and mach_absolute_time instead of gettimeofday
2016-03-04 17:26:58 -06:00
Devin Matthews
44fddd48dc Add missing \. 2016-03-04 12:36:38 -06:00
Devin Matthews
7cabd2131f Use clock_gettime(CLOCK_MONOTONIC) and mach_absolute_time instead of gettimeofday. 2016-03-03 11:43:07 -06:00
Tyler Smith
adb2b4e096 Fixing guard for non implemented partitioning through packed matrices 2016-03-02 14:48:12 -06:00
Field G. Van Zee
f2809fc5f7 Merge pull request #39 from devinamatthews/fix_f2c_conflicts
Devin's f2c type namespace update.

Details:
- Added "bla_" prefix to f2c type names to prevent conflicts with external user code.
- Removed most of the body of bli_f2c.h, which was unused.
2016-02-27 13:06:03 -06:00
Devin Matthews
8624a33ccc Fix remaining f2c conflicts. 2016-02-25 13:51:26 -06:00
Devin Matthews
372eef0b6c Fixed most conflicts after hack-n-slash ofr bli_f2c.h, cleanup in
progress.
2016-02-25 12:01:58 -06:00
Field G. Van Zee
f86b94f206 Included missing blas2blis integer def to CBLAS.
Details:
- Added #include "bli_config_macro_defs" to all cblas_*.c files in
  compat/cblas/src. This has the effect of defining
  BLIS_BLAS2BLIS_INT_TYPE_SIZE to the default value if bli_config.h does
  not define it. Thanks to Tony Kelman for reporting this bug.
- In cblas_i?amax.c, changed the type of the variable 'iamax' from 'int'
  to 'f77_int'. This eliminates a compiler warning and a potential
  runtime bug and/or crash when the size of an int differs from the size
  of f77_int (as determined by BLIS_BLAS2BLIS_INT_TYPE_SIZE).
2016-02-23 18:12:34 -06:00
Field G. Van Zee
0b126de134 Consolidated packm_blk_var1 and packm_blk_var2.
Details:
- Consolidated the two blocked variants for packm into a single
  implementation (packm_blk_var1) and removed the other variant.
- Updated all induced method _cntl_init() functions in frame/cntl/ind/
  to use the new blocked variant 1.
- Defined two new macros, bli_is_ind_packed() and bli_is_nat_packed(),
  to detect pack_t schemas for induced methods and native execution,
  respectively.
2015-11-13 16:29:12 -06:00
Field G. Van Zee
30e5eb29e0 Minor changes to treatment of rs, cs in bli_obj.c.
Details:
- Applied a patch submitted by Devin Matthews that:
  - implements subtle changes to handling of somewhat unusual cases of
    row and column strides to accommodate certail tensor cases, which
    includes adding dimension parameters to _is_col_tilted() and
    _is_row_tilted() macros,
  - simplifies how buffers are sized when requested BLIS-allocated
    objects,
  - re-consolidates bli_adjust_strides_*() into one function, and
  - defines 'restrict' keyword as a "nothing" macro for C++ and pre-C99
    environments.
2015-11-13 12:14:19 -06:00
Field G. Van Zee
42810bbfa0 Fixed minor bugs for uncommon obj_create cases.
Details:
- Separated bli_adjust_strides() into _alloc() and _attach() flavors so
  that the latter can avoid a test performed by the former, in which the
  rs and cs are overridden and set to zero if either matrix dimension is
  zero. Actually, we also disable this overridding behavior, even for the
  _alloc() case, since keeping the original strides (probably) does not
  hurt anything. The original code has been kept commented-out, though,
  in case an unintended consequence is later discovered.
- Fixed a typo in an error check for general stride cases where rs == cs.
2015-11-12 12:07:46 -06:00
Field G. Van Zee
3e6dd11467 Minor re-expression in quadratic partitioning code.
Details:
- Minor change to quadratic equation solution code that avoids
  recomputation of the sqrt() parameter when the compiler is not
  smart enough to perform this optimization automatically.
2015-11-03 10:30:08 -06:00
Field G. Van Zee
3e116f0a29 Fixed imaginary bug in quadratic partitioning code.
Details:
- Fixed a bug in the relatively new quadratic partitioning code that,
  under the right conditions, would perform sqrt() on a negative value.
  If the solution is imaginary, we discard it and use an alternate
  partition width that assumes no diagonal intersection. That alternate
  width is actually already computed, so, the fix was quite simple.
  Thanks to Devangi Parikh for reporting this bug.
2015-11-02 17:18:23 -06:00