Commit Graph

319 Commits

Author SHA1 Message Date
Field G. Van Zee
d5bf79e50b Miscellaneous tweaks and fixes.
Details:
- Fixed incorrect calling sequence in bli_cntx_init_knl.c--an instance of
  bli_blksz_init_easy() that should have been bli_blksz_init().
- Fixed a bug in code that is supposed to output the list of sub-directories
  in the 'config' directory when configure script is run with no arguments.
- Expanded the output of "make showconfig" to include more info from config.mk.
- Minor changes to build/auto-detect/cpuid_x86.c, mostly in preparation for
  someone to add excavator and zen support.
- Added a link to the ConfigurationHowTo wiki to config_registry.
- Other minor tweaks to configure.
2017-11-13 14:24:29 -06:00
Field G. Van Zee
2c51356a8b Implemented runtime hardware detection via cpuid.
Details:
- Added runtime support for selecting an appropriate arch_t value based
  on the results of the cpuid instruction (for x86_64). This allows
  deferral of choosing a context (kernels, blocksizes, etc.) until
  runtime, which allows BLIS to be built with support for multiple
  microarchitectures. Currently, only amd64 and intel64 configurations
  are registered in the config_registry; however, one could create
  custom configuration families to support arbitrary sets of x86_64
  microarchitectures.
- Current Intel microarchitectures supported via cpuid are knl, haswell,
  sandybridge, and penryn.
- Current AMD microarchitectures supported via cpuid are: zen, excavator,
  steamroller, piledriver, and bulldozer.
2017-11-01 17:37:02 -05:00
Field G. Van Zee
3e4f42a4d2 Typecast l1mkr_t enum value prior to comparison.
Details:
- Typecast l1mkr_t enum value in bli_cntx.h to guint_t before testing for
  out-of-range value. This is an attempt to pacify a strange warning from
  clang on OS X that is seemingly the result of the following compiler
  warning flag:
    -Wtautological-constant-out-of-range-compare
2017-10-27 11:41:37 -05:00
Field G. Van Zee
07c352188b Added "generic" configuration.
Details:
- Added a "generic" configuration that leaves the default blocksizes and
  kernels unchanged. This replaces the older "reference" configuration.
  Updated auto-detect script and code accordingly.
- Added support for generic configuration to arch_t (bli_type_defs.h),
  bli_gks_init() (bli_gks.c), and bli_arch_config.h
- Moved bli_arch_query_id() to bli_arch.c (and prototype to bli_arch.h).
- Whitespace changes to configurations' make_defs.mk files.
2017-10-23 16:59:22 -05:00
Field G. Van Zee
453deb2906 Implemented runtime kernel management.
Details:
- Reworked the build system around a configuration registry file, named
  config_registry', that identifies valid configuration targets, their
  constituent sub-configurations, and the kernel sets that are needed by
  those sub-configurations. The build system now facilitates the building
  of a single library that can contains kernels and cache/register
  blocksizes for multiple configurations (microarchitectures). Reference
  kernels are also built on a per-configuration basis.
- Updated the Makefile to use new variables set by configure via the
  config.mk.in template, such as CONFIG_LIST, KERNEL_LIST, and KCONFIG_MAP,
  in determining which sub-configurations (CONFIG_LIST) and kernel sets
  (KERNEL_LIST) are included in the library, and which make_defs.mk files'
  CFLAGS (KCONFIG_MAP) are used when compiling kernels.
- Reorganized 'kernels' directory into a "flat" structure. Renamed kernel
  functions into a standard format that includes the kernel set name
  (e.g. 'haswell'). Created a "bli_kernels_<kernelset>.h" file in each
  kernels sub-directory. These files exist to provide prototypes for the
  kernels present in those directories.
- Reorganized reference kernels into a top-level 'ref_kernels' directory.
  This directory includes a new source file, bli_cntx_ref.c (compiled on
  a per-configuration basis), that defines the code needed to initialize
  a reference context and a context for induced methods for the
  microarchitecture in question.
- Rewrote make_defs.mk files in each configuration so that the compiler
  variables (e.g. CFLAGS) are "stored" (renamed) on a per-configuration
  basis.
- Modified bli_config.h.in template so that bli_config.h is generated with
  #defines for the config (family) name, the sub-configurations that are
  associated with the family, and the kernel sets needed by those
  sub-configurations.
- Deprecated all kernel-related information in bli_kernel.h and transferred
  what remains to new header files named "bli_arch_<configname>.h", which
  are conditionally #included from a new header bli_arch.h. These files
  are still needed to set library-wide parameters such as custom
  malloc()/free() functions or SIMD alignment values.
- Added bli_cntx_init_<configname>.c files to each configuration directory.
  The files contain a function, named the same as the file, that initializes
  a "native" context for a particular configuration (microarchitecture). The
  idea is that optimized kernels, if available, will be initialized into
  these contexts. Other fields will retain pointers to reference functions,
  which will be compiled on a per-configuration basis. These bli_cntx_init_*()
  functions will be called during the initialization of the global kernel
  structure. They are thought of as initializing for "native" execution, but
  they also form the basis for contexts that use induced methods. These
  functions are prototyped, along with their _ref() and _ind() brethren, by
  prototype-generating macros in bli_arch.h.
- Added a new typedef enum in bli_type_defs.h to define an arch_t, which
  identifies the various sub-configurations.
- Redesigned the global kernel structure (gks) around a 2D array of cntx_t
  structures (pointers to cntx_t, actually). The first dimension is indexed
  over arch_t and the inner dimension is the ind_t (induced method) for
  each microarchitecture. When a microarchitecture (configuration) is
  "registered" at init-time, the inner array for that configuration in the
  2D array is initialized (and allocated, if it hasn't been already). The
  cntx_t slot for BLIS_NAT is initialized immediately and those for other
  induced method types are initialized and cached on-demand, as needed. At
  cntx_t registration, we also store function pointers to cntx_init functions
  that will initialize (a) "reference" contexts and (b) contexts for use with
  induced methods. We don't cache the full contexts for reference contexts
  since they are rarely needed. The functions that initialize these two kinds
  of contexts are generated automatically for each targeted sub-configuration
  from cpp-templatized code at compile-time. Induced method contexts that
  need "stage" adjustments can still obtain them via functions in
  bli_cntx_ind_stage.c.
- Added new functions and functionality to bli_cntx.c, such as for setting
  the level-1f, level-1v, and packm kernels, and for converting a native
  context into one for executing an induced method.
- Moved the checking of register/cache blocksize consistency from being cpp
  macros in bli_kernel_macro_defs.h to being runtime checks defined in
  bli_check.c and called from bli_gks_register_cntx() at the time that the
  global kernel structure's internal context is initialized for a given
  microarchitecture/configuration.
- Deprecated all of the old per-operation bli_*_cntx.c files and removed
  the previous operation-level cntx_t_init()/_finalize() invocations.
  Instead, we now query the gks for a suitable context, usually via
  bli_gks_query_cntx().
- Deprecated support for the 3m2 and 3m3 induced methods. (They required
  hackery that I was no longer willing to support.)
- Consolidated the 1e and 1r packm kernels for any given register blocksize
  into a single kernel that will branch on the schema and support packing
  to both formats.
- Added the cntx_t* argument to all packm kernel signatures.
- Deprecated the local function pointer array in all bli_packm_cxk*.c files
  and instead obtain the packm kernel from the cntx_t.
- Added bli_calloc_intl(), which serves as the calloc-equivalent to to
  bli_malloc_intl(). Useful when we wish to allocate and initialize to
  zero/NULL.
- Converted existing cpp macro functions defined in bli_blksz.h, bli_func.h,
  bli_cntx.h into static functions.
2017-10-18 13:29:32 -05:00
Field G. Van Zee
e02d3cb841 Fixed a pthread typo in previous commit.
Details:
- Misnamed 'pthread_mutex_t' type in bli_memsys.c as 'thread_mutex_t'.
2017-09-26 19:02:53 -05:00
Field G. Van Zee
f5962a1aae Fixed bugs in gemm/gemmtrsm ukr tests in testsuite.
Details:
- Fixed a bug in gemmtrsm test module that was due to improper partitioning
  into a k x k triangular matrix for the purposes of obtaining an mr x k
  micropanel of A with which to test.
- Fixed a bug in gemm and gemmtrsm test modules that would only manifest for
  very large k (depending on the product of mr x kc on that architecture).
  The bug arose from the fact that the test module was triggering the
  allocation of blocks from the internal memory pools, which are limited in
  size. This allocation imposes an implicit assumption that the micro-
  panel being tested with will fit inside, and this assumption is violated
  for large values of k. Arbitrarily large k may now be tested for both
  operation tests.
- Added OpenMP/pthread critical sections around the setting or getting of
  statuses from the induced method operation lookup table in bli_l3_ind.c.
- Added the 'static' keyword to all pthread_mutex_t global variables in BLIS.
- Thanks to Nisanth Padinharepatt of AMD for reporting the first and third
  issues.
2017-09-26 17:00:04 -05:00
Field G. Van Zee
60a1eeb231 Added edge handling to _determine_blocksize_b().
Details:
- Added explicit handling of situations where i == dim to
  bli_determine_blocksize_b_sub(). This isn't actually needed by any
  current use case within BLIS, but handling the situation is nonetheless
  prudent. Thanks to Minh Quan for reporting this issue and requesting
  the fix.
2017-08-05 13:04:31 -05:00
Field G. Van Zee
c63980f4ca Moved 'family' field from cntx_t to cntl_t.
Details:
- Removed the family field inside the cntx_t struct and re-added it to the
  cntl_t struct. Updated all accessor functions/macros accordingly, as well
  as all consumers and intermediaries of the family parameter (such as
  bli_l3_thread_decorator(), bli_l3_direct(), and bli_l3_prune_*()). This
  change was motivated by the desire to keep the context limited, as much
  as possible, to information about the computing environment. (The family
  field, by contrast, is a descriptor about the operation being executed.)
- Added additional functions to bli_blksz_*() API.
- Added additional functions to bli_cntx_*() API.
- Minor updates to bli_func.c, bli_mbool.c.
- Removed 'obj' from bli_blksz_*() API names.
- Removed 'obj' from bli_cntx_*() API names.
- Removed 'obj' from bli_cntl_*(), bli_*_cntl_*() API names. Renamed routines
  that operate only on a single struct to contain the "_node" suffix to
  differentiate with those routines that operate on the entire tree.
- Added enums for packm and unpackm kernels to bli_type_defs.h.
- Removed BLIS_1F and BLIS_VF from bszid_t definition in bli_type_defs.h.
  They weren't being used and probably never will be.
2017-07-29 14:53:39 -05:00
Field G. Van Zee
0e58ba1b3a Added API to set mt environment variables.
Details:
- Renamed bli_env_get_nway() -> bli_thread_get_env().
- Added bli_thread_set_env() to allow setting environment variables
  pertaining to multithreading, such as BLIS_JC_NT or BLIS_NUM_THREADS.
- Added the following convenience wrapper routines:
    bli_thread_get_jc_nt()
    bli_thread_get_ic_nt()
    bli_thread_get_jr_nt()
    bli_thread_get_ir_nt()
    bli_thread_get_num_threads()
    bli_thread_set_jc_nt()
    bli_thread_set_ic_nt()
    bli_thread_set_jr_nt()
    bli_thread_set_ir_nt()
    bli_thread_set_num_threads()
- Added #include "errno.h" to bli_system.h.
- This commit addresses issue #140.
- Thanks to Chris Goodyer for inspiring these updates.
2017-07-17 19:03:22 -05:00
Minh Quan HO
ba7cada51a set missing free_fp in bli_membrk_init for free-ing GEN_USE buffers
The membrk's free_fp is called when releasing GEN_USE buffers, but this free_fp is
not set in bli_membrk_init
2017-07-07 10:52:05 +02:00
Devin Matthews
fdc66f12d4 Setting any one of BLIS_NT_[IJ][CR] overrides BLIS_NUM_THEADS. Missing BLIS_NT_XX's are defaulted to 1. Fixes #123. 2017-05-04 10:36:04 -05:00
Field G. Van Zee
1c732d3ddc Added 1m-specific APIs for bp, pb gemm algorithms.
Details:
- Defined bli_gemmbp_cntl_create(), bli_gemmpb_cntl_create(), with the
  body of bli_gemm_cntl_create() replaced with a call to the former.
- Defined bli_cntl_free_w_thrinfo(), bli_cntl_free_wo_thrinfo(). Now,
  bli_cntl_free() can check if the thread parameter is NULL, and if so,
  call the latter, and otherwise call the former.
- Defined bli_gemm1mbp_cntx_init(), bli_gemm1mpb_cntx_init(), both in
  terms of bli_gemm1mxx_cntx_init(), which behaves the same as
  bli_gemm1m_cntx_init() did before, except that an extra bool parameter
  (is_pb) is used to support both bp and pb algorithms (including to
  support the anti-preference field described below).
- Added support for "anti-preference" in context. The anti_pref field,
  when true, will toggle the boolean return value of routines such as
  bli_cntx_l3_ukr_eff_prefers_storage_of(), which has the net effect of
  causing BLIS to transpose the operation to achieve disagreement (rather
  than agreement) between the storage of C and the micro-kernel output
  preference. This disagreement is needed for panel-block implementations,
  since they induce a transposition of the suboperation immediately before
  the macro-kernel is called, which changes the apparent storage of C. For
  now, anti-preference is used only with the pb algorithm for 1m (and not
  with any other non-1m implementation).
- Defined new functions,
    bli_cntx_l3_ukr_eff_prefers_storage_of()
    bli_cntx_l3_ukr_eff_dislikes_storage_of()
    bli_cntx_l3_nat_ukr_eff_prefers_storage_of()
    bli_cntx_l3_nat_ukr_eff_dislikes_storage_of()
  which are identical to their non-"eff" (effectively) counterparts except
  that they take the anti-preference field of the context into account.
- Explicitly initialize the anti-pref field to FALSE in
  bli_gks_cntx_set_l3_nat_ukr_prefs().
- Added bli_gemm_ker_var1.c, which implements a panel-block macro-kernel
  in terms of the existing block-panel macro-kernel _ker_var2(). This
  technique requires inducing transposes on all operands and swapping
  the A and B.
- Changed bli_obj_induce_trans() macro so that pack-related fields are
  also changed to reflect the induced transposition.
- Added a temporary hack to bli_l3_3m4m1m_oapi.c that allows us to easily
  specify the 1m algorithm (block-panel or panel-block).
- Renamed the following cntx_t-related macros:
    bli_cntx_get_pack_schema_a() -> bli_cntx_get_pack_schema_a_block()
    bli_cntx_get_pack_schema_b() -> bli_cntx_get_pack_schema_b_panel()
    bli_cntx_get_pack_schema_c() -> bli_cntx_get_pack_schema_c_panel()
  and updated all instantiations. Also updated the field names in the
  cntx_t struct.
- Comment updates.
2017-01-25 16:25:46 -06:00
Field G. Van Zee
126482a3b6 Implemented the 1m method.
Details:
- Implemented the 1m method for inducing complex domain matrix
  multiplication. 1m support has been added to all level-3 operations,
  including trsm, and is now the default induced method when native
  complex domain gemm microkernels are omitted from the configuration.
- Updated _cntx_init() operations to take a datatype parameter. This was
  needed for the corresponding function for 1m (because 1m requires us
  to choose between column-oriented or row-oriented execution, which
  requires us to query the context for the storage preference of the
  gemm microkernel, which requires knowing the datatype) but I decided
  that it made sense for consistency to add the parameter to all other
  cntx initialization functions as well, even though those functions
  don't use the parameter.
- Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take
  a second scalar for each blocksize entry. The semantic meaning of the
  two scalars now is that the first will scale the default blocksize
  while the second will scale the maximum blocksize. This allows scaling
  the two independently, and was needed to support 1m, which requires
  scaling for a register blocksize but not the register storage
  blocksize (ie: "packdim") analogue.
- Deprecated bli_blksz_reduce_dt_to() and defined two new functions,
  bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing
  default and maximum blocksizes to some desired blocksize multiple.
  These functions are needed in the updated definitions of
  bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs().
- Added support for the 1e and 1r packing schemas to packm, including
  1e/1r packing kernels.
- Added a minor optimization to bli_gemm_ker_var2() that allows, under
  certain circumstances (specifically, real domain beta and row- or
  column-stored matrix C), the real domain macrokernel and microkernel
  to be called directly, rather than using the virtual microkernel
  via the complex domain macrokernel, which carries a slight additional
  amount of overhead.
- Added 1m support to the testsuite.
- Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified
  some code in test_gemm.c driver.
2016-11-25 18:29:49 -06:00
Field G. Van Zee
bdc0a264d2 Adjusted stride selection of ct in macrokernels.
Details:
- Updated the changes introduced in 618f433 so that the strides of the
  temporary microtile ct used in the macrokernels is determined based
  on the storage preference of the microkernel (via the new functions
  below), rather than the strides of c. In almost all cases, presently,
  this change results in no net effect, as a high-level optimization
  in the _front() functions aligns the storage of c to that of the
  microkernel's preference. However, I encountered some cases where
  this is not always the case in some development code that has yet
  to be committed, and therefore I'm generalizing the framework code
  in advance.
- Defined two new functions in bli_cntx.c:
    bli_cntx_l3_ukr_prefers_rows_dt()
    bli_cntx_l3_ukr_prefers_cols_dt()
  which return bool_t's based on the current micro-kernel's storage
  preferences. For induced methods, the preference of the underlying
  real domain microkernel is returned.
- Updated definition of bli_cntx_l3_ukr_dislikes_storage_of(), and
  by proxy bli_cntx_l3_ukr_prefers_storage_of(), to be in terms of
  the above functions, rather than querying the preferences of the
  native microkernel directly (which did the wrong thing for induced
  methods).
2016-11-16 14:13:08 -06:00
Devin Matthews
a8220e3a86 - Fix typo in bli_cntx.c
- Bump BLIS_DEFAULT_NR_THREAD_MAX to 4
2016-11-10 14:19:34 -06:00
Devin Matthews
c05b3862f6 Add automatic loop thread assignment.
- Number of threads is determined by BLIS_NUM_THREADS or OMP_NUM_THREADS, but can be overridden by BLIS_XX_NT as before.
- Threads are assigned to loops (ic, jc, ir, and jc) automatically by weighted partitioning and heuristics, both of which are tunable via bli_kernel.h.
- All level-3 BLAS covered.
2016-11-04 15:48:02 -05:00
Field G. Van Zee
970745a5fc Reorganized typedefs to avoid compiler warnings.
Details:
- Relocated membrk_t definition from bli_membrk.h to bli_type_defs.h.
- Moved #include of bli_malloc.h from blis.h to bli_type_defs.h.
- Removed standalone mtx_t and mutex_t typedefs in bli_type_defs.h.
- Moved #include of bli_mutex.h from bli_thread.h to bli_typedefs.h.
- The redundant typedefs of membrk_t and mtx_t caused a warning on some C
  compilers. Thanks to Tyler Smith for reporting this issue.
2016-10-19 15:58:03 -05:00
Field G. Van Zee
9cda6057ea Removed previously renamed/old files.
Details:
- Removed frame/base/bli_mem.c and frame/include/bli_auxinfo_macro_defs.h,
  both of which were renamed/removed in 701b9aa. For some reason, these
  files survived when the compose branch was merged back into master.
  (Clearly, git's merging algorithm is not perfect.)
- Removed frame/base/bli_mem.c.prev (an artifact of the long-ago changed
  memory allocator that I was keeping around for no particular reason).
2016-10-11 13:21:26 -05:00
Field G. Van Zee
22377abd84 Fixed bli_gemm() segfault on empty C matrices.
Details:
- Fixed a bug that would manifest in the form of a segmentation fault
  in bli_cntl_free() when calling any level-3 operation on an empty
  output matrix (ie: m = n = 0). Specifically, the code previously
  assumed that the entire control tree was built prior to it being
  freed. However, if the level-3 operation performs an early exit, the
  control tree will be incomplete, and this scenario is now handled.
  Thanks to Elmar Peise for reporting this bug.
2016-10-10 13:43:56 -05:00
Field G. Van Zee
0b571cd94d Fixed segfault in bli_free_align() for NULL ptrs.
Details:
- Fixed a bug in bli_free_align() caused by failing to handle NULL pointers
  up-front, which led to performing pointer arithmetic on NULL pointers in
  order to free the address immediately before the pointer. Thanks to Devin
  Matthews for reporting this bug.
2016-10-06 14:48:15 -05:00
Field G. Van Zee
87fddeab3c Merge branch 'compose' 2016-10-05 13:35:01 -05:00
Field G. Van Zee
86969873b5 Reclassified amaxv operation as a level-1v kernel.
Details:
- Moved amaxv from being a utility operation to being a level-1v operation.
  This includes the establishment of a new amaxv kernel to live beside all
  of the other level-1v kernels.
- Added two new functions to bli_part.c:
    bli_acquire_mij()
    bli_acquire_vi()
  The first acquires a scalar object for the (i,j) element of a matrix,
  and the second acquires a scalar object for the ith element of a vector.
- Added integer support to bli_getsc level-0 operation. This involved
  adding integer support to the bli_*gets level-0 scalar macros.
- Added a new test module to test amaxv as a level-1v operation. The test
  module works by comparing the value identified by bli_amaxv() to the
  the value found from a reference-like code local to the test module
  source file. In other words, it (intentionally) does not guarantee the
  same index is found; only the same value. This allows for different
  implementations in the case where a vector contains two or more elements
  containing exactly the same floating point value (or values, in the case
  of the complex domain).
- Removed the directory frame/include/old/.
2016-10-04 14:24:59 -05:00
Field G. Van Zee
8d55033c96 Implemented distributed thrinfo_t management.
Details:
- Implemented Ricardo Magana's distributed thread info/communicator
  management. Rather that fully construct the thrinfo_t structures, from
  root to leaf, prior to spawning threads, the threads individually
  construct their thrinfo_t trees (or, chains), and do so incrementally,
  as needed, reusing the same structure nodes during subsequent blocked
  variant iterations. This required moving the initial creation of the
  thrinfo_t structure (now, the root nodes) from the _front() functions
  to the bli_l3_thread_decorator(). The incremental "growing" of the tree
  is performed in the internal back-end (ie: _int()) function, and so
  mostly invisible. Also, the incremental growth of the thrinfo_t tree is
  done as a function of the current and parent control tree nodes (as well
  as the parent thrinfo_t node), further reinforcing the parallel
  relationship between the two data structures.
- Removed the "inner" communicator from thrinfo_t structure definition,
  as well as its id. Changed all APIs accordingly. Renamed
  bli_thrinfo_needs_free_comms() to bli_thrinfo_needs_free_comm().
- Defined bli_l3_thrinfo_print_paths(), which prints the information
  in an array of thrinfo_t* structure pointers. (Used only as a
  debugging/verification tool.)
- Deprecated the following thrinfo_t creation functions:
    bli_packm_thrinfo_create()
    bli_l3_thrinfo_create()
  because they are no longer used. bli_thrinfo_create() is now called
  directly when creating thrinfo_t nodes.
2016-09-27 15:20:58 -05:00
Field G. Van Zee
701b9aa3ff Redesigned control tree infrastructure.
Details:
- Altered control tree node struct definitions so that all nodes have the
  same struct definition, whose primary fields consist of a blocksize id,
  a variant function pointer, a pointer to an optional parameter struct,
  and a pointer to a (single) sub-node. This unified control tree type is
  now named cntl_t.
- Changed the way control tree nodes are connected, and what computation
  they represent, such that, for example, packing operations are now
  associated with nodes that are "inline" in the tree, rather than off-
  shoot braches. The original tree for the classic Goto gemm algorithm was
  expressed (roughly) as:

    blk_var2 -> blk_var3 -> blk_var1 -> ker_var2
                         |           |
                         -> packb    -> packa

  and now, the same tree would look like:

    blk_var2 -> blk_var3 -> packb -> blk_var1 -> packa -> ker_var2

  Specifically, the packb and packa nodes perform their respective packing
  operations and then recurse (without any loop) to a subproblem. This means
  there are now two kinds of level-3 control tree nodes: partitioning and
  non-partitioning. The blocked variants are members of the former, because
  they iteratively partition off submatrices and perform suboperations on
  those partitions, while the packing variants belong to the latter group.
  (This change has the effect of allowing greatly simplified initialization
  of the nodes, which previously involved setting many unused node fields to
  NULL.)
- Changed the way thrinfo_t tree nodes are arranged to mirror the new
  connective structure of control trees. That is, packm nodes are no longer
  off-shoot branches of the main algorithmic nodes, but rather connected
  "inline".
- Simplified control tree creation functions. Partitioning nodes are created
  concisely with just a few fields needing initialization. By contrast, the
  packing nodes require additional parameters, which are stored in a
  packm-specific struct that is tracked via the optional parameters pointer
  within the control tree struct. (This parameter struct must always begin
  with a uint64_t that contains the byte size of the struct. This allows
  us to use a generic function to recursively copy control trees.) gemm,
  herk, and trmm control tree creation continues to be consolidated into
  a single function, with the operation family being used to select
  among the parameter-agnostic macro-kernel wrappers. A single routine,
  bli_cntl_free(), is provided to free control trees recursively, whereby
  the chief thread within a groups release the blocks associated with
  mem_t entries back to the memory broker from which they were acquired.
- Updated internal back-ends, e.g. bli_gemm_int(), to query and call the
  function pointer stored in the current control tree node (rather than
  index into a local function pointer array). Before being invoked, these
  function pointers are first cast to a gemm_voft (for gemm, herk, or trmm
  families) or trsm_voft (for trsm family) type, which is defined in
  frame/3/bli_l3_var_oft.h.
- Retired herk and trmm internal back-ends, since all execution now flows
  through gemm or trsm blocked variants.
- Merged forwards- and backwards-moving variants by querying the direction
  from routines as a function of the variant's matrix operands. gemm and
  herk always move forward, while trmm and trsm move in a direction that
  is dependent on which operand (a or b) is triangular.
- Added functions bli_thread_get_range_mdim(), bli_thread_get_range_ndim(),
  each of which takes additional arguments and hides complexity in managing
  the difference between the way ranges are computed for the four families
  of operations.
- Simplified level-3 blocked variants according to the above changes, so that
  the only steps taken are:
  1. Query partitioning direction (forwards or backwards).
  2. Prune unreferenced regions, if they exist.
  3. Determine the thread partitioning sub-ranges.
  <begin loop>
    4. Determine the partitioning blocksize (passing in the partitioning
       direction)
    5. Acquire the curren iteration's partitions for the matrices affected
       by the current variants's partitioning dimension (m, k, n).
    6. Call the subproblem.
  <end loop>
- Instantiate control trees once per thread, per operation invocation.
  (This is a change from the previous regime in which control trees were
  treated as stateless objects, initialized with the library, and shared
  as read-only objects between threads.) This once-per-thread allocation
  is done primarily to allow threads to use the control tree as as place
  to cache certain data for use in subsequent loop iterations. Presently,
  the only application of this caching is a mem_t entry for the packing
  blocks checked out from the memory broker (allocator). If a non-NULL
  control tree is passed in by the (expert) user, then the tree is copied
  by each thread. This is done in bli_l3_thread_decorator(), in
  bli_thrcomm_*.c.
- Added a new field to the context, and opid_t which tracks the "family"
  of the operation being executed. For example, gemm, hemm, and symm are
  all part of the gemm family, while herk, syrk, her2k, and syr2k are
  all part of the herk family. Knowing the operation's family is necessary
  when conditionally executing the internal (beta) scalar reset on on
  C in blocked variant 3, which is needed for gemm and herk families,
  but must not be performed for the trmm family (because beta has only
  been applied to the current row-panel of C after the first rank-kc
  iteration).
- Reexpressed 3m3 induced method blocked variant in frame/3/gemm/ind
  to comform with the new control tree design, and renamed the macro-
  kernel codes corresponding to 3m2 and 4m1b.
- Renamed bli_mem.c (and its APIs) to bli_memsys.c, and renamed/relocated
  bli_mem_macro_defs.h from frame/include to frame/base/bli_mem.h.
- Renamed/relocated bli_auxinfo_macro_defs.h from frame/include to
  frame/base/bli_auxinfo.h.
- Fixed a minor bug whereby the storage-to-ukr-preference matching
  optimization in the various level-3 front-ends was not being applied
  properly when the context indicated that execution would be via an
  induced method. (Before, we always checked the native micro-kernel
  corresponding to the datatype being executed, whereas now we check
  the native micro-kernel corresponding to the datatype's real projection,
  since that is the micro-kernel that is actually used by induced methods.
- Added an option to the testsuite to skip the testing of native level-3
  complex implementations. Previously, it was always tested, provided that
  the c/z datatypes were enabled. However, some configurations use
  reference micro-kernels for complex datatypes, and testing these
  implementations can slow down the testsuite considerably.
2016-08-26 19:04:45 -05:00
Field G. Van Zee
95abea46f8 Merge branch 'master' into compose 2016-07-23 15:38:33 -05:00
Field G. Van Zee
a017062fdf Integrated "memory broker" (membrk_t) abstraction.
Details:
- Integrated a patch originally authored and submitted by Ricardo Magana
  of HP Enterprise. The changeset inserts use of a new object type, membrk_t,
  (memory broker) that allows multiple sets of memory pools on, for example,
  separate NUMA nodes, each of which has a separate memory space.
- Added membrk field to cntx_t and defined corresponding accessor macros.
- Added membrk field to mem_t object and defined corresponding accessor macros.
- Created new bli_membrk.c file, which contains the new memory broker API,
  including:
    bli_membrk_init(), bli_membrk_finalize()
    bli_membrk_acquire_[mv](), bli_membrk_release(),
    bli_membrk_init_pools(), bli_membrk_reinit_pools(),
    bli_membrk_finalize_pools(),
    bli_membrk_pool_size()
- In bli_mem.c, changed function calls to
    bli_mem_init_pools()     -> bli_membrk_init()
    bli_mem_reinit_pools()   -> bli_membrk_reinit()
    bli_mem_finalize_pools() -> bli_membrk_finalize()
- In bli_packv_init.c, bli_packm_init.c, changed function calls to:
    bli_mem_acquire_[mv]() -> bli_membrk_acquire_[mv]()
    bli_mem_release()      -> bli_membrk_release()
- Added bli_mutex.c and related files to frame/thread. These files define
  abstract mutexes (locks) and corresponding APIs for pthreads, openmp, or
  single-threaded execution. This new API is employed within functions
  such as bli_membrk_acquire_[mv]() and bli_membrk_release().
2016-07-22 17:02:59 -05:00
Field G. Van Zee
31def12e26 First phase of control tree redesign.
Details:
- These changes constitute the first set of changes in preparation to
  revamping the structure and use of control trees in BLIS. Modifications
  in this commit don't affect the control tree code yet, but rather lay
  the groundwork.
- Defined wrappers for the following functions, where the the wrappers
  each take a direction parameter of a new enumerated type (BLIS_BWD or
  BLIS_FWD), dir_t, and executes the correct underlying function.
  - bli_acquire_mpart_*() and _vpart_*()
  - bli_*_determine_kc_[fb]()
  - bli_thread_get_range_*() and bli_thread_get_range_weighted_*()
- Consolidated all 'f' (forwards-moving) and 'b' (backwards-moving)
  blocked variants for trmm and trsm, and renamed gemm and herk variants
  accordingly. The direction is now queried via routines such as
  bli_trmm_direct(), which deterines the direction from the implied side
  and uplo parameters. For gemm and herk, it is uncondtionally BLIS_FWD.
- Defined wrappers to parameter-specific macrokernels for herk, trmm, and
  trsm, e.g. bli_trmm_xx_ker_var2(), that execute the correct underlying
  macrokernel based on the implied parameters. The same logic used to
  choose the dir_t in _direct() functions is used here.
- Simplified the function pointer arrays in _int() functions given the
  consolidation and dir_t querying mentioned above.
- Function signature (whitespace) reformatting for various functions.
- Removed old code in various 'old' directories.
2016-06-30 15:19:20 -05:00
Field G. Van Zee
096895c5d5 Reorganized code, APIs related to multithreading.
Details:
- Reorganized code and renamed files defining APIs related to multithreading.
  All code that is not specific to a particular operation is now located in a
  new directory: frame/thread. Code is now organized, roughly, by the
  namespace to which it belongs (see below).
- Consolidated all operation-specific *_thrinfo_t object types into a single
  thrinfo_t object type. Operation-specific level-3 *_thrinfo_t APIs were
  also consolidated, leaving bli_l3_thrinfo_*() and bli_packm_thrinfo_*()
  functions (aside from a few general purpose bli_thrinfo_*() functions).
- Renamed thread_comm_t object type to thrcomm_t.
- Renamed many of the routines and functions (and macros) for multithreading.
  We now have the following API namespaces:
  - bli_thrinfo_*(): functions related to thrinfo_t objects
  - bli_thrcomm_*(): functions related to thrcomm_t objects.
  - bli_thread_*(): general-purpose functions, such as initialization,
    finalization, and computing ranges. (For now, some macros, such as
    bli_thread_[io]broadcast() and bli_thread_[io]barrier() use the
    bli_thread_ namespace prefix, even though bli_thrinfo_ may be more
    appropriate.)
- Renamed thread-related macros so that they use a bli_ prefix.
- Renamed control tree-related macros so that they use a bli_ prefix (to be
  consistent with the thread-related macros that were also renamed).
- Removed #undef BLIS_SIMD_ALIGN_SIZE from dunnington's bli_kernel.h. This
  #undef was a temporary fix to some macro defaults which were being applied
  in the wrong order, which was recently fixed.
2016-06-06 13:32:04 -05:00
Tyler Michael Smith
232530e88f Merge commit 'refs/pull/81/head' of https://github.com/flame/blis
Conflicts:
	frame/base/bli_threading_pthreads.c
	frame/base/bli_threading_pthreads.h
2016-06-01 15:14:10 -05:00
Tyler Michael Smith
4bcabd1bf6 Use spin locks instead of pthread barriers 2016-06-01 13:27:28 -05:00
Jeff Hammond
eef37f8b4d use GCC intrinsic instead of pthread_mutex for atomic increment and fetch 2016-05-29 22:28:13 -07:00
Field G. Van Zee
9dcd6f05c4 Implemented developer-configurable malloc()/free().
Details:
- Replaced all instances of bli_malloc() and bli_free() with one of:
  - bli_malloc_pool()/bli_free_pool()
  - bli_malloc_user()/bli_free_user()
  - bli_malloc_intl()/bli_free_intl()
  each of which can be configured to call malloc()/free() substitutes,
  so long as the substitute functions have the same function type
  signatures as malloc() and free() defined by C's stdlib.h. The _pool()
  function is called when allocating blocks for the memory pools (used
  for packing buffers, primarily), the _user() function is called when
  obj_t's are created (via bli_obj_create() and friends), and the _intl()
  function is called for internal use by BLIS, such as when creating
  control tree nodes or temporary buffers for manipulating internal data
  structures. Substitutes for any of the three types of bli_malloc() may
  be specified by #defining the following pairs of cpp macros in
  bli_kernel.h:
  - BLIS_MALLOC_POOL/BLIS_FREE_POOL
  - BLIS_MALLOC_USER/BLIS_FREE_USER
  - BLIS_MALLOC_INTL/BLIS_FREE_INTL
  to be the name of the substitute functions. (Obviously, the object
  code that contains these functions must be provided at link-time.)
  These macros default to malloc() and free(). Subsitute functions are
  also automatically prototyped by BLIS (in bli_malloc_prototypes.h).
- Removed definitions for bli_malloc() and bli_free().
- Note that bli_malloc_pool() and bli_malloc_user() are now defined in
  terms of a new function, bli_malloc_align(), which aligns memory to an
  arbitrary (power of two) alignment boundary, but does so manually,
  whereas before alignment was performed behind the scenes by
  posix_memalign(). Currently, bli_malloc_intl() is defined in terms
  of bli_malloc_noalign(), which serves as a simple wrapper to the
  designated function that is passed in (e.g. BLIS_MALLOC_INTL).
  Similarly, there are bli_free_align() and bli_free_noalign(), which
  are used in concert with their bli_malloc_*() counterparts.
2016-05-24 13:15:32 -05:00
Devin Matthews
4b1e55edbf Default-initialize all extern global variables to avoid generating common symbols. Fixes #73. 2016-05-10 10:08:47 -05:00
Field G. Van Zee
4ea419c72c Merge pull request #70 from devinamatthews/daxpby
Give the level1v operations some love
2016-04-26 12:50:45 -05:00
Devin Matthews
bdbda6e6ac Give the level1v operations some love:
- Add missing axpby and xpby operations (plus test cases).
- Add special case for scal2v with alpha=1.
- Add restrict qualifiers.
- Add special-case algorithms for incx=incy=1.
2016-04-25 11:05:57 -05:00
Field G. Van Zee
4136553f0d Clear level-3 cntx_t's via memset() before use.
Details:
- In all level-3 operations' _cntx_init() functions, replaced calls to
  bli_cntx_obj_init() with calls to bli_cntx_obj_clear(), and in all
  level-3 operations' _cntx_finalize() functions, removed calls to
  bli_cntx_obj_finalize(), leaving those function definitions empty.
- Changed the definition of bli_cntx_obj_clear() so that the clearing
  occurs via a single call to memset().
2016-04-22 11:53:53 -05:00
Field G. Van Zee
dd0ab1d93f Converted some bli_cntx query functions to macros.
Details:
- Commented out several datatype-aware query functions (those ending in
  _dt) from bli_cntx.c, as well as their prototypes in bli_cntx.h, and
  added equivalent cpp query macros to bli_cntx.h.
- Added 'bli_config.h' to .gitignore.
2016-04-20 14:38:23 -05:00
Tyler Michael Smith
cbcd0b739d Changing ifdef for OSX pthread barriers 2016-04-18 03:12:57 -05:00
Tyler Smith
41694675e4 pthreads bugfixes
Getting pthreads to work on my Mac
Implemented a pthread barrier when _POSIX_BARRIER isn't defined
Now spawn n-1 threads instead of n threads so that master thread isn't just spinning the whole time
Add -lpthread instead of -pthread to LDFLAGS (for clang)
2016-04-13 15:51:08 -05:00
Field G. Van Zee
537a1f4f85 Implemented runtime contexts and reorganized code.
Details:
- Retrofitted a new data structure, known as a context, into virtually
  all internal APIs for computational operations in BLIS. The structure
  is now present within the type-aware APIs, as well as many supporting
  utility functions that require information stored in the context. User-
  level object APIs were unaffected and continue to be "context-free,"
  however, these APIs were duplicated/mirrored so that "context-aware"
  APIs now also exist, differentiated with an "_ex" suffix (for "expert").
  These new context-aware object APIs (along with the lower-level, type-
  aware, BLAS-like APIs) contain the the address of a context as a last
  parameter, after all other operands. Contexts, or specifically, cntx_t
  object pointers, are passed all the way down the function stack into
  the kernels and allow the code at any level to query information about
  the runtime, such as kernel addresses and blocksizes, in a thread-
  friendly manner--that is, one that allows thread-safety, even if the
  original source of the information stored in the context changes at
  run-time; see next bullet for more on this "original source" of info).
  (Special thanks go to Lee Killough for suggesting the use of this kind
  of data structure in discussions that transpired during the early
  planning stages of BLIS, and also for suggesting such a perfectly
  appropriate name.)
- Added a new API, in frame/base/bli_gks.c, to define a "global kernel
  structure" (gks). This data structure and API will allow the caller to
  initialize a context with the kernel addresses, blocksizes, and other
  information associated with the currently active kernel configuration.
  The currently active kernel configuration within the gks cannot be
  changed (for now), and is initialized with the traditional cpp macros
  that define kernel function names, blocksizes, and the like. However,
  in the future, the gks API will be expanded to allow runtime management
  of kernels and runtime parameters. The most obvious application of this
  new infrastructure is the runtime detection of hardware (and the
  implied selection of appropriate kernels). With contexts in place,
  kernels may even be "hot swapped" at runtime within the gks. Once
  execution enters a level-3 _front() function, the memory allocator will
  be reinitialized on-the-fly, if necessary, to accommodate the new
  kernels' blocksizes. If another application thread is executing with
  another (previously loaded) kernel, it will finish in a deterministic
  fashion because its kernel information was loaded into its context
  before computation began, and also because the blocks it checked out
  from the internal memory pools will be unaffected by the newer threads'
  reinitialization of the allocator.
- Reorganized and streamlined the 'ind' directory, which contains much of
  the code enabling use of induced methods for complex domain matrix
  multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as
  those APIs' functionality is now mostly subsumed within the global
  kernel structure.
- Updated bli_pool.c to define a new function, bli_pool_reinit_if(),
  that will reinitialize a memory pool if the necessary pool block size
  has increased.
- Updated bli_mem.c to use bli_pool_reinit_if() instead of
  bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed
  usage of contexts where appropriate to communicate cache and register
  blocksizes to bli_mem_compute_pool_block_sizes().
- Simplified control trees now that much of the information resides in
  the context and/or the global kernel structure:
  - Removed blocksize object pointers (blksz_t*) fields from all control
    tree node definitions and replaced them with blocksize id (bszid_t)
    values instead, which may be passed into a context query routine in
    order to extract the corresponding blocksize from the given context.
  - Removed micro-kernel function pointers (func_t*) fields from all
    control tree node definitions. Now, any code that needs these function
    pointers can query them from the local context, as identified by a
    level-3 micro-kernel id (l3ukr_t), level-1f kernel id, (l1fkr_t), or
    level-1v kernel id (l1vkr_t).
  - Removed blksz_t object creation and initialization, as well as kernel
    function object creation and initialization, from all operation-
    specific control tree initialization files (bli_*_cntl.c), since this
    information will now live in the gks and, secondarily, in the context.
- Removed blocksize multiples from blksz_t objects. Now, we track
  blocksize multiples for each blocksize id (bszid_t) in the context
  object.
- Removed the bool_t's that were required when a func_t was initialized.
  These bools are meant to allow one to track the micro-kernel's storage
  preferences (by rows or columns). This preference is now tracked
  separately within the gks and contexts.
- Merged and reorganized many separate-but-related functions into single
  files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and
  util directories, but has the most obvious effect of allowing BLIS
  to compile noticeably faster.
- Reorganized execution paths for level-1v, -1d, -1m, and -2 operations
  in an attempt to reduce overhead for memory-bound operations. This
  includes removal of default use of object-based variants for level-2
  operations. Now, by default, level-2 operations will directly call a
  low-level (non-object based) loop over a level-1v or -1f kernel.
- Converted many common query functions in blk_blksz.c (renamed from
  bli_blocksize.c) and bli_func.c into cpp macros, now defined in their
  respective header files.
- Defined bli_mbool.c API to create and query "multi-bools", or
  heterogeneous bool_t's (one for each floating-point datatype), in the
  same spirit as blksz_t and func_t.
- Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS
  and BLIS_SIMD_SIZE. These values are needed in order to compute a third
  new parameter, which may be set indirectly via the aforementioned
  macros or directly: BLIS_STACK_BUF_MAX_SIZE. This value is used to
  statically allocate memory in macro-kernels and the induced methods'
  virtual kernels to be used as temporary space to hold a single
  micro-tile. These values are now output by the testsuite. The default
  value of BLIS_STACK_BUF_MAX_SIZE is computed as
  "2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE".
- Cleaned up top-level 'kernels' directory (for example, renaming the
  embarrassingly misleading "avx" and "avx2" directories to "sandybridge"
  and "haswell," respectively, and gave more consistent and meaningful
  names to many kernel files (as well as updating their interfaces to
  conform to the new context-aware kernel APIs).
- Updated the testsuite to query blocksizes from a locally-initialized
  context for test modules that need those values: axpyf, dotxf,
  dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr.
- Reformatted many function signatures into a standard format that will
  more easily facilitate future API-wide changes.
- Updated many "mxn" level-0 macros (ie: those used to inline double loops
  for level-1m-like operations on small matrices) in frame/include/level0
  to use more obscure local variable names in an effort to avoid variable
  shaddowing. (Thanks to Devin Matthews for pointing these gcc warnings,
  which are only output using -Wshadow.)
- Added a conj argument to setm, so that its interface now mirrors that
  of scalm. The semantic meaning of the conj argument is to optionally
  allow implicit conjugation of the scalar prior to being populated into
  the object.
- Deprecated all type-aware mixed domain and mixed precision APIs. Note
  that this does not preclude supporting mixed types via the object APIs,
  where it produces absolutely zero API code bloat.
2016-04-11 17:21:28 -05:00
Devin Matthews
7cabd2131f Use clock_gettime(CLOCK_MONOTONIC) and mach_absolute_time instead of gettimeofday. 2016-03-03 11:43:07 -06:00
Devin Matthews
372eef0b6c Fixed most conflicts after hack-n-slash ofr bli_f2c.h, cleanup in
progress.
2016-02-25 12:01:58 -06:00
Field G. Van Zee
30e5eb29e0 Minor changes to treatment of rs, cs in bli_obj.c.
Details:
- Applied a patch submitted by Devin Matthews that:
  - implements subtle changes to handling of somewhat unusual cases of
    row and column strides to accommodate certail tensor cases, which
    includes adding dimension parameters to _is_col_tilted() and
    _is_row_tilted() macros,
  - simplifies how buffers are sized when requested BLIS-allocated
    objects,
  - re-consolidates bli_adjust_strides_*() into one function, and
  - defines 'restrict' keyword as a "nothing" macro for C++ and pre-C99
    environments.
2015-11-13 12:14:19 -06:00
Field G. Van Zee
42810bbfa0 Fixed minor bugs for uncommon obj_create cases.
Details:
- Separated bli_adjust_strides() into _alloc() and _attach() flavors so
  that the latter can avoid a test performed by the former, in which the
  rs and cs are overridden and set to zero if either matrix dimension is
  zero. Actually, we also disable this overridding behavior, even for the
  _alloc() case, since keeping the original strides (probably) does not
  hurt anything. The original code has been kept commented-out, though,
  in case an unintended consequence is later discovered.
- Fixed a typo in an error check for general stride cases where rs == cs.
2015-11-12 12:07:46 -06:00
Field G. Van Zee
3e6dd11467 Minor re-expression in quadratic partitioning code.
Details:
- Minor change to quadratic equation solution code that avoids
  recomputation of the sqrt() parameter when the compiler is not
  smart enough to perform this optimization automatically.
2015-11-03 10:30:08 -06:00
Field G. Van Zee
3e116f0a29 Fixed imaginary bug in quadratic partitioning code.
Details:
- Fixed a bug in the relatively new quadratic partitioning code that,
  under the right conditions, would perform sqrt() on a negative value.
  If the solution is imaginary, we discard it and use an alternate
  partition width that assumes no diagonal intersection. That alternate
  width is actually already computed, so, the fix was quite simple.
  Thanks to Devangi Parikh for reporting this bug.
2015-11-02 17:18:23 -06:00
Field G. Van Zee
4a502fbe77 Laid groundwork for runtime memory pool resizing.
Details:
- Changed bli_pool_finalize() so that the freeing begins with the block
  at top_index instead of block 0. This allows us to use the function
  for terminal finalization as well as temporary cleanup prior to
  reinitialization. Also, clear the pool_t struct upon _pool_finalize()
  in case it is called in the terminal case with some blocks still
  checked out to threads (in which case the threads will see the new
  block size as 0 and thus release the block as intended).
- Added bli_pool_reinit(), which calls _pool_finalize() followed by
  _pool_init() with new parameters.
- Added bli_mem_reinit(), which is based on bli_pool_reinit().
- Added new wrapper, _mem_compute_pool_block_sizes(), which calls
  _mem_compute_pool_block_sizes_dt().
- Updated bli_mem_release() so that the pblk_t is freed, via
  _pool_free_block(), if the block size recorded in the mem_t at the
  time the pblk_t was acquired is now different from the value in the
  pool_t.
2015-11-02 13:28:34 -06:00
Field G. Van Zee
77ddb0b1d3 Removed flop-counting mechanism.
Details:
- Removed the optional flop-counting feature introduced in commit
  7574c994.
2015-10-13 12:53:06 -05:00
Field G. Van Zee
e2e9d64a63 Load balance thread ranges for arbitrary diagonals.
Details:
- Expanded/updated interface for bli_get_range_weighted() and
  bli_get_range() so that the direction of movement is specified in the
  function name (e.g. bli_get_range_l2r(), bli_get_range_weighted_t2b())
  and also so that the object being partitioned is passed instead of an
  uplo parameter. Updated invocations in level-3 blocked variants, as
  appropriate.
- (Re)implemented bli_get_range_*() and bli_get_range_weighted_*() to
  carefully take into account the location of the diagonal when computing
  ranges so that the area of each subpartition (which, in all present
  level-3 operations, is proportional to the amount of computation
  engendered) is as equal as possible.
- Added calls to a new class of routines to all non-gemm level-3 blocked
  variants:
    bli_<oper>_prune_unref_mparts_[mnk]()
  where <oper> is herk, trmm, or trsm and [mnk] is chosen based on which
  dimension is being partitioned. These routines call a more basic
  routine, bli_prune_unref_mparts(), to prune unreferenced/unstored
  regions from matrices and simultaneously adjust other matrices which
  share the same dimension accordingly.
- Simplified herk_blk_var2f, trmm_blk_var1f/b as a result of more the
  new pruning routines.
- Fixed incorrect blocking factors passed into bli_get_range_*() in
  bli_trsm_blk_var[12][fb].c
- Added a new test driver in test/thread_ranges that can exercise the new
  bli_get_range_*() and bli_get_range_weighted_*() under a range of
  conditions.
- Reimplemented m and n fields of obj_t as elements in a "dim"
  array field so that dimensions could be queried via index constant
  (e.g. BLIS_M, BLIS_N). Adjusted/added query and modification
  macros accordingly.
- Defined mdim_t type to enumerate BLIS_M and BLIS_N indexing values.
- Added bli_round() macro, which calls C math library function round(),
  and bli_round_to_mult(), which rounds a value to the nearest multiple
  of some other value.
- Added miscellaneous pruning- and mdim_t-related macros.
- Renamed bli_obj_row_offset(), bli_obj_col_offset() macros to
  bli_obj_row_off(), bli_obj_col_off().
2015-09-24 12:14:03 -05:00