Commit Graph

28 Commits

Author SHA1 Message Date
Devin Matthews
7c604e1cbc Move default SIMD-related definitions to bli_kernel_macro_defs.h. Otherwise, configurations which customize these fail as these are now defined in bli_kernel.h. 2016-05-10 12:11:55 -05:00
Devin Matthews
bdbda6e6ac Give the level1v operations some love:
- Add missing axpby and xpby operations (plus test cases).
- Add special case for scal2v with alpha=1.
- Add restrict qualifiers.
- Add special-case algorithms for incx=incy=1.
2016-04-25 11:05:57 -05:00
Field G. Van Zee
537a1f4f85 Implemented runtime contexts and reorganized code.
Details:
- Retrofitted a new data structure, known as a context, into virtually
  all internal APIs for computational operations in BLIS. The structure
  is now present within the type-aware APIs, as well as many supporting
  utility functions that require information stored in the context. User-
  level object APIs were unaffected and continue to be "context-free,"
  however, these APIs were duplicated/mirrored so that "context-aware"
  APIs now also exist, differentiated with an "_ex" suffix (for "expert").
  These new context-aware object APIs (along with the lower-level, type-
  aware, BLAS-like APIs) contain the the address of a context as a last
  parameter, after all other operands. Contexts, or specifically, cntx_t
  object pointers, are passed all the way down the function stack into
  the kernels and allow the code at any level to query information about
  the runtime, such as kernel addresses and blocksizes, in a thread-
  friendly manner--that is, one that allows thread-safety, even if the
  original source of the information stored in the context changes at
  run-time; see next bullet for more on this "original source" of info).
  (Special thanks go to Lee Killough for suggesting the use of this kind
  of data structure in discussions that transpired during the early
  planning stages of BLIS, and also for suggesting such a perfectly
  appropriate name.)
- Added a new API, in frame/base/bli_gks.c, to define a "global kernel
  structure" (gks). This data structure and API will allow the caller to
  initialize a context with the kernel addresses, blocksizes, and other
  information associated with the currently active kernel configuration.
  The currently active kernel configuration within the gks cannot be
  changed (for now), and is initialized with the traditional cpp macros
  that define kernel function names, blocksizes, and the like. However,
  in the future, the gks API will be expanded to allow runtime management
  of kernels and runtime parameters. The most obvious application of this
  new infrastructure is the runtime detection of hardware (and the
  implied selection of appropriate kernels). With contexts in place,
  kernels may even be "hot swapped" at runtime within the gks. Once
  execution enters a level-3 _front() function, the memory allocator will
  be reinitialized on-the-fly, if necessary, to accommodate the new
  kernels' blocksizes. If another application thread is executing with
  another (previously loaded) kernel, it will finish in a deterministic
  fashion because its kernel information was loaded into its context
  before computation began, and also because the blocks it checked out
  from the internal memory pools will be unaffected by the newer threads'
  reinitialization of the allocator.
- Reorganized and streamlined the 'ind' directory, which contains much of
  the code enabling use of induced methods for complex domain matrix
  multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as
  those APIs' functionality is now mostly subsumed within the global
  kernel structure.
- Updated bli_pool.c to define a new function, bli_pool_reinit_if(),
  that will reinitialize a memory pool if the necessary pool block size
  has increased.
- Updated bli_mem.c to use bli_pool_reinit_if() instead of
  bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed
  usage of contexts where appropriate to communicate cache and register
  blocksizes to bli_mem_compute_pool_block_sizes().
- Simplified control trees now that much of the information resides in
  the context and/or the global kernel structure:
  - Removed blocksize object pointers (blksz_t*) fields from all control
    tree node definitions and replaced them with blocksize id (bszid_t)
    values instead, which may be passed into a context query routine in
    order to extract the corresponding blocksize from the given context.
  - Removed micro-kernel function pointers (func_t*) fields from all
    control tree node definitions. Now, any code that needs these function
    pointers can query them from the local context, as identified by a
    level-3 micro-kernel id (l3ukr_t), level-1f kernel id, (l1fkr_t), or
    level-1v kernel id (l1vkr_t).
  - Removed blksz_t object creation and initialization, as well as kernel
    function object creation and initialization, from all operation-
    specific control tree initialization files (bli_*_cntl.c), since this
    information will now live in the gks and, secondarily, in the context.
- Removed blocksize multiples from blksz_t objects. Now, we track
  blocksize multiples for each blocksize id (bszid_t) in the context
  object.
- Removed the bool_t's that were required when a func_t was initialized.
  These bools are meant to allow one to track the micro-kernel's storage
  preferences (by rows or columns). This preference is now tracked
  separately within the gks and contexts.
- Merged and reorganized many separate-but-related functions into single
  files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and
  util directories, but has the most obvious effect of allowing BLIS
  to compile noticeably faster.
- Reorganized execution paths for level-1v, -1d, -1m, and -2 operations
  in an attempt to reduce overhead for memory-bound operations. This
  includes removal of default use of object-based variants for level-2
  operations. Now, by default, level-2 operations will directly call a
  low-level (non-object based) loop over a level-1v or -1f kernel.
- Converted many common query functions in blk_blksz.c (renamed from
  bli_blocksize.c) and bli_func.c into cpp macros, now defined in their
  respective header files.
- Defined bli_mbool.c API to create and query "multi-bools", or
  heterogeneous bool_t's (one for each floating-point datatype), in the
  same spirit as blksz_t and func_t.
- Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS
  and BLIS_SIMD_SIZE. These values are needed in order to compute a third
  new parameter, which may be set indirectly via the aforementioned
  macros or directly: BLIS_STACK_BUF_MAX_SIZE. This value is used to
  statically allocate memory in macro-kernels and the induced methods'
  virtual kernels to be used as temporary space to hold a single
  micro-tile. These values are now output by the testsuite. The default
  value of BLIS_STACK_BUF_MAX_SIZE is computed as
  "2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE".
- Cleaned up top-level 'kernels' directory (for example, renaming the
  embarrassingly misleading "avx" and "avx2" directories to "sandybridge"
  and "haswell," respectively, and gave more consistent and meaningful
  names to many kernel files (as well as updating their interfaces to
  conform to the new context-aware kernel APIs).
- Updated the testsuite to query blocksizes from a locally-initialized
  context for test modules that need those values: axpyf, dotxf,
  dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr.
- Reformatted many function signatures into a standard format that will
  more easily facilitate future API-wide changes.
- Updated many "mxn" level-0 macros (ie: those used to inline double loops
  for level-1m-like operations on small matrices) in frame/include/level0
  to use more obscure local variable names in an effort to avoid variable
  shaddowing. (Thanks to Devin Matthews for pointing these gcc warnings,
  which are only output using -Wshadow.)
- Added a conj argument to setm, so that its interface now mirrors that
  of scalm. The semantic meaning of the conj argument is to optionally
  allow implicit conjugation of the scalar prior to being populated into
  the object.
- Deprecated all type-aware mixed domain and mixed precision APIs. Note
  that this does not preclude supporting mixed types via the object APIs,
  where it produces absolutely zero API code bloat.
2016-04-11 17:21:28 -05:00
Field G. Van Zee
7cd01b71b5 Implemented dynamic allocation for packing buffers.
Details:
- Replaced the old memory allocator, which was based on statically-
  allocated arrays, with one based on a new internal pool_t type, which,
  combined with a new bli_pool_*() API, provides a new abstract data
  type that implements the same memory pool functionality but with blocks
  from the heap (ie: malloc() or equivalent). Hiding the details of the
  pool in a separate API also allows for a much simpler bli_mem.c family
  of functions.
- Added a new internal header, bli_config_macro_defs.h, which enables
  sane defaults for the values previously found in bli_config. Those
  values can be overridden by #defining them in bli_config.h the same
  way kernel defaults can be overridden in bli_kernel.h. This file most
  resembles what was previously a typical configuration's bli_config.h.
- Added a new configuration macro, BLIS_POOL_ADDR_ALIGN_SIZE, which
  defaults to BLIS_PAGE_SIZE, to specify the alignment of individual
  blocks in the memory pool. Also added a corresponding query routine to
  the bli_info API.
- Deprecated (once again) the micro-panel alignment feature. Upon further
  reflection, it seems that the goal of more predictable L1 cache
  replacement behavior is outweighed by the harm caused by non-contiguous
  micro-panels when k % kc != 0. I honestly don't think anyone will even
  miss this feature.
- Changed bli_ukr_get_funcs() and bli_ukr_get_ref_funcs() to call
  bli_cntl_init() instead of bli_init().
- Removed query functions from bli_info.c that are no longer applicable
  given the dynamic memory allocator.
- Removed unnecessary definitions from configurations' bli_config.h files,
  which are now pleasantly sparse.
- Fixed incorrect flop counts in addv, subv, scal2v, scal2m testsuite
  modules. Thanks to Devangi Parikh for pointing out these
  miscalculations.
- Comment, whitespace changes.
2015-06-19 11:31:53 -05:00
Field G. Van Zee
f1a6b7d028 Reorganized code for induced complex methods.
Details:
- Consolidated most of the code relating to induced complex methods
  (e.g. 4mh, 4m1, 3mh, 3m1, etc.) into frame/ind. Induced methods
  are now enabled on a per-operation basis. The current "available"
  (enabled and implemented) implementation can then be queried on
  an operation basis. Micro-kernel func_t objects as well as blksz_t
  objects can also be queried in a similar maner.
- Redefined several micro-kernel and operation-related functions in
  bli_info_*() API, in accordance with above changes.
- Added mr and nr fields to blksz_t object, which point to the mr
  and nr blksz_t objects for each cache blocksize (and are NULL for
  register blocksizes). Renamed the sub-blocksize field "sub" to
  "mult" since it is really expressing a blocksize multiple.
- Updated bli_*_determine_kc_[fb]() for gemm/hemm/symm, trmm, and
  trsm to correctly query mr and nr (for purposes of nudging kc).
- Introduced an enumerated opid_t in bli_type_defs.h that uniquely
  identifies an operation. For now, only level-3 id values are defined,
  along with a generic, catch-all BLIS_NOID value.
- Reworked testsuite so that all induced methods that are enabled
  are tested (one at a time) rather than only testing the first
  available method.
- Reformated summary at the beginning of testsuite output so that
  blocksize and micro-kernel info is shown for each induced method
  that was requested (as well as native execution).
- Reduced the number of columns needed to display non-matlab
  testsuite output (from approx. 90 to 80).
2015-03-18 15:37:10 -05:00
Field G. Van Zee
a86db60ee2 Extensive renaming of 3m/4m-related files, symbols.
Details:
- Renamed all remaining 3m/4m packing files and symbols to 3mi/4mi
  ('i' for "interleaved"). Similar changes to 3M/4M macros.
- Renamed all 3m/4m files and functions to 3m1/4m1.
- Whitespace changes.
2015-02-23 18:42:39 -06:00
Field G. Van Zee
7bbc95a54f Added new piledriver micro-kernels.
Details:
- Added new micro-kernels for the AMD piledriver architecture (one
  for each datatype).
- Updates and tweaks to piledriver configuration.
- Added 3xk packm micro-kernel support.
- Explicitly unrolled some of the smaller packm micro-kernels.
- Added notes to avx/sandybridge and piledriver micro-kernel files
  acknowledging the influence of the corresponding kernel code in
  OpenBLAS.
2014-10-29 10:52:23 -05:00
Field G. Van Zee
59613f1d55 Added separeate micro-panel alignment for A and B.
Details:
- Changed the recently-added micro-panel alignment macros so that we now
  have two sets--one for micro-panels of matrix A and one for micro-
  panels of matrix B: BLIS_UPANEL_[AB]_ALIGN_SIZE_?.
- Store each set of alignment values into a separate blksz_t object in
  bli_gemm_cntl_init().
- Adjusted packm_init() to use the separate alignment values.
- Added query routines for the new alignment values to bli_info.c.
- Modified test suite output accordingly.
2014-10-23 17:21:37 -05:00
Field G. Van Zee
ab954ba6f8 Relaxed constraint that KC be multiple of MR, NR.
Details:
- Relaxed a long-held requirement in register blocksizes that required
  the kernel programmer to choose a KC that was divisible by both MR
  and NR. This was very constraining on some architectures that did not
  use register blocksizes that were powers of two. The constraint is
  now enforced only for trmm and trsm, where it is needed, and it is
  now handled by "nudging" kc upward at runtime, if necessary, to be a
  multiple of MR or NR, as needed.
- Defined bli_trmm_determine_kc_[fb]() and bli_trsm_determine_kc_[fb](),
  which determine blocksizes for trmm and trsm, taking special care to
  "nudge" the kc dimension up to a multiple of MR or NR, as needed.
- Changed bli_trmm_blk_var3[fb].c to call bli_trmm_determine_kc_[fb]()
  instead of bli_determine_blocksize_[fb]().
- Added safeguard to bli_align_dim_to_mult() that returns the dimension
  unmodified if the dimension multiple is zero (to avoid division by
  zero).
- Removed cpp guard/check for KC % MR == 0 and KC % NR == 0 from
  bli_kernel_macro_defs.h.
- Whitespace, variable name changes to bli_blocksize.c.
- Removed old commented code from bli_gemm_cntl.c.
2014-10-23 10:12:27 -05:00
Field G. Van Zee
e64dba5633 Re-implemented micro-panel alignment.
Details:
- This commit re-implements a feature that was removed in commit
  c2b2ab62. It was removed because, at the time, I wasn't sure how the
  micro-panel alignment feature would interact with the 4m method (when
  applied at the micro-kernrel level), and so it seemed safer to disable
  the feature entirely rather than allow possible breakage. This commit
  revisits the issue and safely re-implements the feature in a way that
  is compatible with 4m, 3m, 4mh, and 3mh (and native execution).
- Modified the static memory pool to account for micro-panel alignment
  space.
- Modified packm_init and blocked variants to align whole micro-panels
  by a datatype-specific alignment value that may be set by the
  configuration. (If it is not set by the configuration, it will default
  to BLIS_SIZEOF_?.)
- Modified macro-kernels so that:
  - storage stride is handled properly given the new micro-panel
    alignment behavior;
  - indexing through 3m/4m/rih-type sub-panels, as is done by trmm and
    trsm, is more robust (e.g. will work if the applicable packing
    register blocksize is odd);
  - imaginary strides are computed and stored within auxinfo_t structs,
    which allows the virtual micro-kernels to more easily determine how
    to index into the micro-panel operands.
- Modified virtual 3m and 4m micro-kernels to use the imaginary strides
  within the auxinfo_t structs instead of panel strides.
- Deprecated the panel stride fields from the auxinfo_t structs.
- Updated test suite to print out the micro-panel alignment values.
2014-10-20 19:23:06 -05:00
Field G. Van Zee
4a7df04e8a Added 30xk support for packm ukernels.
Details:
- Updated bli_kernel_*_macro_defs.h headers to include default
  definitions for 30xk packm kernels.
- Extended function pointer arrays in bli_packm_cxk_*() out to 31 and
  included 30xk kernels.
- Addex 30xk kernels to frame/1m/packm/ukernels/bli_packm_ref_cxk_*.c.
2014-09-22 16:06:15 -05:00
Field G. Van Zee
e9899be090 Added high-level implementations of 4m, 3m.
Details:
- Added "4mh" and "3mh" APIs, which implement the 4m and 3m methods at
  high levels, respectively. APIs for trmm and trsm were NOT added due
  to the fact that these approaches are inherently incompatible with
  implementing 4m or 3m at high levels (because the input right-hand
  side matrix is overwritten).
- Added 4mh, 3mh virtual micro-kernels, and updated the existing 4m and
  3m so that all are stylistically consistent.
- Added new "rih" packing kernels (both low-level and structure-aware)
  to support both 4mh and 3mh.
- Defined new pack_t schemas to support real-only, imaginary-only, and
  real+imaginary packing formats.
- Added various level0 scalar macros to support the rih packm kernels.
- Minor tweaks to trmm macro-kernels to facilitate 4mh and 3mh.
- Added the ability to enable/disable 4mh, 3m, and 3mh, and adjusted
  level-3 front-ends to check enabledness of 3mh, 3m, 4mh, and 4m (in
  that order) and execute the first one that is enabled, or the native
  implementation if none are enabled.
- Added implementation query functions for each level-3 operation so
  that the user can query a string that describes the implementation
  that is currently enabled.
- Updated test suite to output implementation types for reach level-3
  operation, as well as micro-kernel types for each of the five micro-
  kernels.
- Renamed BLIS_ENABLE_?COMPLEX_VIA_4M macros to _ENABLE_VIRTUAL_?COMPLEX.
- Fixed an obscure bug when packing Hermitian matrices (regular packing
  type) whereby the diagonal elements of the packed micro-panels could
  get tainted if the source matrix's imaginary diagonal part contained
  garbage.
2014-09-16 18:19:32 -05:00
Field G. Van Zee
af521ee6f2 Changed semantics of blocksize extensions.
Details:
- Changed semantics of cache and register blocksize extensions so that
  the extended values are tracked, rather than just the marginal
  extensions.
- BLIS_EXTEND_[MKN]C_? has been renamed BLIS_MAXIMUM_[MKN]C_?.
- BLIS_EXTEND_[MKN]R_? has been renamed BLIS_PACKDIM_[MKN]R_?.
- bli_blksz_ext_*() APIs have been renamed to bli_blksz_max_*(). Note
  that these "max" query routines grab the maximum value for cache
  blocksizes and the packdim value for register blocksizes.
- bli_info_*() API has been updated accordingly.
- All configurations have been updated accordingly.
2014-09-01 14:06:46 -05:00
Field G. Van Zee
d0eec4bddd Added optional row preference to ukernel config.
Details:
- Added the ability for the kernel developer to indicate the gemm micro-
  kernel as having a preference for accessing the micro-tile of C via
  contiguous rows (as opposed to contiguous columns). This property may
  be encoded in bli_kernel.h as BLIS_?GEMM_UKERNEL_PREFERS_CONTIG_ROWS,
  which may be defined or left undefined. Leaving it undefined leads to
  the default assumption of column preference.
- Changed conditionals in frame/3/*/*_front.c that induce transposition
  of the operation so that the transposition is induced only if there
  is disagreement between the storage of C and the preference of the
  micro-kernel. Previously, the only conditional that needed to be met
  was that C was row-stored, which is to say that we assumed the micro-
  kernel preferred column-contiguous access on C.
- Added a "prefers_contig_rows" property to func_t objects, and updated
  calls to bli_func_obj_create() in _cntl.c files in order to support
  the above changes.
- Removed the row-storage optimization from bli_trsm_front.c because
  it is actually ineffective. This is because the right-side case of
  trsm flips the A and B micro-panel operands (since BLIS only requires
  left-side gemmtrsm/trsm kernels), meaning any transposition done
  at the high level is then undone at the low level.
- Tweaked trmm, trmm3 _front.c files to eliminate a possible redundant
  invocation of the bli_obj_swap() macro.
2014-08-19 15:49:19 -05:00
Field G. Van Zee
7ed415824d Updated copyright headers (continued).
Details:
- Inserted "at Austin" into third clause of license declarations.
  Meant to include this change in previous commit.
2014-07-14 16:14:33 -05:00
Field G. Van Zee
5c2c6c8561 Updated copyright headers to contain "at Austin".
Details:
- Updated copyright headers to include "at Austin" in the name of the
  University of Texas.
- Updated the copyright years of a few headers to 2014 (from 2011 and
  2012).
2014-07-14 16:05:03 -05:00
Field G. Van Zee
fde5f1fdec Added extensive support for configuration defaults.
Details:
- Standard names for reference kernels (levels-1v, -1f and 3) are now
  macro constants. Examples:
    BLIS_SAXPYV_KERNEL_REF
    BLIS_DDOTXF_KERNEL_REF
    BLIS_ZGEMM_UKERNEL_REF
- Developers no longer have to name all datatype instances of a kernel
  with a common base name; [sdcz] datatype flavors of each kernel or
  micro-kernel (level-1v, -1f, or 3) may now be named independently.
  This means you can now, if you wish, encode the datatype-specific
  register blocksizes in the name of the micro-kernel functions.
- Any datatype instances of any kernel (1v, 1f, or 3) that is left
  undefined in bli_kernel.h will default to the corresponding reference
  implementation. For example, if BLIS_DGEMM_UKERNEL is left undefined,
  it will be defined to be BLIS_DGEMM_UKERNEL_REF.
- Developers no longer need to name level-1v/-1f kernels with multiple
  datatype chars to match the number of types the kernel WOULD take in
  a mixed type environment, as in bli_dddaxpyv_opt(). Now, one char is
  sufficient, as in bli_daxpyv_opt().
- There is no longer a need to define an obj_t wrapper to go along with
  your level-1v/-1f kernels. The framework now prvides a _kernel()
  function which serves as the obj_t wrapper for whatever kernels are
  specified (or defaulted to) via bli_kernel.h
- Developers no longer need to prototype their kernels, and thus no
  longer need to include any prototyping headers from within
  bli_kernel.h. The framework now generates kernel prototypes, with the
  proper type signature, based on the kernel names defined (or defaulted
  to) via bli_kernel.h.
- If the complex datatype x (of [cz]) implementation of the gemm micro-
  kernel is left undefined by bli_kernel.h, but its same-precision real
  domain equivalent IS defined, BLIS will use a 4m-based implementation
  for the datatype x implementations of all level-3 operations, using
  only the real gemm micro-kernel.
2014-02-25 13:34:56 -06:00
Field G. Van Zee
5a36e5bf2f Embed func_t microkernel objects in control trees.
Details:
- Modified all control tree node definitions to include a new field of
  type func_t*, which is similar to a blksz_t except that it contains
  one function pointer (each typed simply as void*) for each datatype.
  We use the func_t* to embed pointers to the micro-kernels to use for
  the leaf-level nodes of each control tree. This change is a natural
  extension of control trees and will allow more flexibility in the
  future.
- Modified all macro-kernel wrappers to obtain the micro-kernel pointers
  from the incomming (previously ignored) control tree node and then pass
  the queried pointer into the datatype-specific macro-kernel code, which
  then casts the pointer to the appropriate type (new typedefs residing
  in bli_kernel_type_defs.h) and then uses the pointer to call the micro-
  kernel. Thus, the micro-kernel function is no longer "hard-coded" (that
  is, determined when the datatype-specific macro-kernel functions are
  instantiated by the C preprocessor).
- Added macros to bli_kernel_macro_defs.h that build datatype-specific
  base names if they do not exist already, and then uses those to build
  datatype-specific micro-kernel function names. This will allow
  developers extra flexibility if they wanted to, for example, name each
  of their datatype-specific micro-kernels differently (e.g. double
  real might be named bli_dgemm_opt_4x4() while double complex might be
  named bli_zgemm_opt_2x2()).
- Inserted appropriate code into _cntl_init() functions that allocates
  and initializes a func_t object for the corresponding micro-kernels.
  The gemm ukernel func_t object is created once, in bli_gemm_cntl_init(),
  and then reused via extern wherever possible.
2014-01-27 11:13:00 -06:00
Field G. Van Zee
2cb13600f9 Updated year in copyright headers to 2014. 2014-01-03 12:29:13 -06:00
Field G. Van Zee
376bbb59c8 Removed support for duplication.
Details:
- Removed support for duplication from the gemmtrsm/trsm micro-kernels
  and all framework code.
- Updated test suite modules according to above changes.
2013-11-08 11:17:34 -06:00
Field G. Van Zee
be4833bd91 Added test suite modules for level-1f, 3 kernels.
Details:
- Added test modules in test suite for level-1f kernels and level-3
  micro-kernels. (Duplication in the micro-kernels, for now, is NOT
  supported by these test modules.)
- Added section override switches to test suite's input.operations file.
- Added obj_t APIs for level-1f front-ends and their unblocked variants to
  facilitate the level-1f test modules. Also added front-end for dupl
  operation.
- Added obj_t-based check routines for level-1f operations, which are
  called from the new front-ends mentioned above.
- Added query routines for axpyf, dotxf, and dotxaxpyf that return fusing
  factors as a function of datatype, which is needed by their respective
  test modules.
- Whitespace changes to bli_kernel.h of all existing configurations.
2013-10-10 14:20:06 -05:00
Field G. Van Zee
5e54f46ccb Added template implementations and other tweaks.
Details:
- Added a 'template' configuration, which contains stub implementations of the
  level 1, 1f, and 3 kernels with one datatype implemented in C for each, with
  lots of in-file comments and documentation.
- Modified some variable/parameter names for some 1/1f operations. (e.g.
  renaming vector length parameter from m to n.)
- Moved level-1f fusing factors from axpyf, dotxf, and dotxaxpyf header files
  to bli_kernel.h.
- Modifed test suite to print out fusing factors for axpyf, dotxf, and
  dotxaxpyf, as well as the default fusing factor (which are all equal
  in the reference and template implementations).
- Cleaned up some sloppiness in the level-1f unb_var1.c files whereby these
  reference variants were implemented in terms of front-end routines rather
  that directly in terms of the kernels. (For example, axpy2v was implemented
  as two calls to axpyv rather than two calls to AXPYV_KERNEL.)
- Changed the interface to dotxf so that it matches that of axpyf, in that
  A is assumed to be m x b_n in both cases, and for dotxf A is actually used
  as A^T.
- Minor variable naming and comment changes to reference micro-kernels in
  frame/3/gemm/ukernels and frame/3/trsm/ukernels.
2013-09-30 12:58:18 -05:00
Field G. Van Zee
da77e9614f Minor improvements to static memory allocator.
Details:
- Expanded on cpp macro definitions from bli_mem.c and relocated them to
  a new header file, frame/include/bli_mem_pool_macro_defs.h. The expanded
  functionality includes computing the pool size for each datatype (using
  that datatype's cache blocksizes) and using the maximum to size the
  actual pool array. This addresses the somewhat common pitfall whereby a
  developer updates cache blocksizes in bli_kernel.h for only one datatype
  (say, single-precision real), while the memory pools are sized using the
  double-precision real values. Then, when the developer attempts to link
  to and run a level-3 BLIS routine (e.g. dgemm), the library aborts with
  a message saying the static memory pool was exhausted. Clearly, this
  message is misleading when the pool was not sized properly to begin with.
- Removed previously disabled code in bli_kernel_macro_defs.h that was
  meant to check for size consistency among the various cache blocksizes.
  (Obviously the memory pool size-based solution mentioned above is better.)
- Added BLIS_SIZEOF_? cpp macros to bli_type_defs.h. This seemed like a
  reasonable place to put these constants, rather than further crowd up
  bli_config.h.
- Updated testsuite driver to output memory pool sizes for A, B, and C.
- Minor comment updates to bli_config.h.
- Removed 'flame' configuration. It was beginning to get out-of-date, and
  I hadn't used it in months. We can always re-create it later.
2013-09-13 12:00:37 -05:00
Field G. Van Zee
67a8b9498d Added missing cpp kernel blocksize constraints.
Details:
- Added missing C preprocessor guards in bli_kernel_macro_defs.h that enforce
  constraints on the register blocksizes relative to the cache blocksizes.
  Thanks to Tyler for helping me stumble across this issue.
2013-07-26 11:12:37 -05:00
Field G. Van Zee
3725013985 Added experimental bli_gemm_ker_var5().
Details:
- Added support for an experimental gemm macro-kernel incrementally
  packs one micro-panel of B at a time. This is useful for certain
  special cases of gemm where m is small.
- Minor changes to default values of clarksville configuration.
- Defined BLIS_PACKED_BLOCKS as part of pack_t type, even though we
  do not yet have any use (or implementation support) for block storage.
- Comment update to bli_packm_init.c.
2013-07-08 11:24:18 -05:00
Field G. Van Zee
2d6f9e8379 Disabled blocksize checks for memory pools.
Details:
- Temporarily disabled checks that ensure that enough memory will be allocated
  by the contiguous memory allocator for all types, given that the values for
  double precision real are the ones used to allocate the space. These checks
  can easily go awry in certain situations, especially if you are developing for
  only one datatype. So for now, they are probably more trouble than they are
  worth.
2013-04-21 15:10:34 -05:00
Field G. Van Zee
b6ef84fad1 Allow ldim of packed micro-panels != MR, NR.
Details:
- Made substantial changes throughout the framework to decouple the leading
  dimension (row or column stride) used within each packed micro-panel from
  the corresponding register blocksize. It appears advantageous on some
  systems to use, for example, packed micro-panels of A where the column
  stride is greater than MR (whereas previously it was always equal to MR).
- Changes include:
  - Added BLIS_EXTEND_[MNK]R_? macros, which specify how much extra padding
    to use when packing micro-panels of A and B.
  - Adjusted all packing routines and macro-kernels to use PACKMR and PACKNR
    where appropriate, instead of MR and NR.
  - Added pd field (panel dimension) to obj_t.
  - New interface to bli_packm_cntl_obj_create().
  - Renamed bli_obj_packed_length()/_width() macros to
    bli_obj_padded_length()/_width().
  - Removed local #defines for cache/register blocksizes in level-3 *_cntl.c.
  - Print out new cache and register blocksize extensions in test suite.
- Also added new BLIS_EXTEND_[MNK]C_? macros for future use in using a larger
  blocksize for edge cases, which can improve performance at the margins.
2013-04-21 15:00:24 -05:00
Field G. Van Zee
31b100e7bf Added new kernel blocksize macro aliases.
Details:
- Added new macros that alias level-3 cache and register blocksize macros
  to names that can be constructed via the PASTEMAC macro. These aliased
  macro definitions live inside bli_kernel_macro_defs.h, which is now
  #included after bli_kernel.h.
- Modified macro-kernels to use new aliased blocksize macros instead of
  operation-specific ones.
- Removed local, operation-specific kernel blocksize macro definitions
  (found in macro-kernel header files).
2013-04-11 11:11:52 -05:00