Commit Graph

48 Commits

Author SHA1 Message Date
Field G. Van Zee
b150870397 Removed most "old" directories.
Details:
- Removed the vast majority of directories named "old", which contained
  deprecated code that I wasn't quite ready to jettison from the source
  tree.
2017-12-08 16:08:41 -06:00
Field G. Van Zee
453deb2906 Implemented runtime kernel management.
Details:
- Reworked the build system around a configuration registry file, named
  config_registry', that identifies valid configuration targets, their
  constituent sub-configurations, and the kernel sets that are needed by
  those sub-configurations. The build system now facilitates the building
  of a single library that can contains kernels and cache/register
  blocksizes for multiple configurations (microarchitectures). Reference
  kernels are also built on a per-configuration basis.
- Updated the Makefile to use new variables set by configure via the
  config.mk.in template, such as CONFIG_LIST, KERNEL_LIST, and KCONFIG_MAP,
  in determining which sub-configurations (CONFIG_LIST) and kernel sets
  (KERNEL_LIST) are included in the library, and which make_defs.mk files'
  CFLAGS (KCONFIG_MAP) are used when compiling kernels.
- Reorganized 'kernels' directory into a "flat" structure. Renamed kernel
  functions into a standard format that includes the kernel set name
  (e.g. 'haswell'). Created a "bli_kernels_<kernelset>.h" file in each
  kernels sub-directory. These files exist to provide prototypes for the
  kernels present in those directories.
- Reorganized reference kernels into a top-level 'ref_kernels' directory.
  This directory includes a new source file, bli_cntx_ref.c (compiled on
  a per-configuration basis), that defines the code needed to initialize
  a reference context and a context for induced methods for the
  microarchitecture in question.
- Rewrote make_defs.mk files in each configuration so that the compiler
  variables (e.g. CFLAGS) are "stored" (renamed) on a per-configuration
  basis.
- Modified bli_config.h.in template so that bli_config.h is generated with
  #defines for the config (family) name, the sub-configurations that are
  associated with the family, and the kernel sets needed by those
  sub-configurations.
- Deprecated all kernel-related information in bli_kernel.h and transferred
  what remains to new header files named "bli_arch_<configname>.h", which
  are conditionally #included from a new header bli_arch.h. These files
  are still needed to set library-wide parameters such as custom
  malloc()/free() functions or SIMD alignment values.
- Added bli_cntx_init_<configname>.c files to each configuration directory.
  The files contain a function, named the same as the file, that initializes
  a "native" context for a particular configuration (microarchitecture). The
  idea is that optimized kernels, if available, will be initialized into
  these contexts. Other fields will retain pointers to reference functions,
  which will be compiled on a per-configuration basis. These bli_cntx_init_*()
  functions will be called during the initialization of the global kernel
  structure. They are thought of as initializing for "native" execution, but
  they also form the basis for contexts that use induced methods. These
  functions are prototyped, along with their _ref() and _ind() brethren, by
  prototype-generating macros in bli_arch.h.
- Added a new typedef enum in bli_type_defs.h to define an arch_t, which
  identifies the various sub-configurations.
- Redesigned the global kernel structure (gks) around a 2D array of cntx_t
  structures (pointers to cntx_t, actually). The first dimension is indexed
  over arch_t and the inner dimension is the ind_t (induced method) for
  each microarchitecture. When a microarchitecture (configuration) is
  "registered" at init-time, the inner array for that configuration in the
  2D array is initialized (and allocated, if it hasn't been already). The
  cntx_t slot for BLIS_NAT is initialized immediately and those for other
  induced method types are initialized and cached on-demand, as needed. At
  cntx_t registration, we also store function pointers to cntx_init functions
  that will initialize (a) "reference" contexts and (b) contexts for use with
  induced methods. We don't cache the full contexts for reference contexts
  since they are rarely needed. The functions that initialize these two kinds
  of contexts are generated automatically for each targeted sub-configuration
  from cpp-templatized code at compile-time. Induced method contexts that
  need "stage" adjustments can still obtain them via functions in
  bli_cntx_ind_stage.c.
- Added new functions and functionality to bli_cntx.c, such as for setting
  the level-1f, level-1v, and packm kernels, and for converting a native
  context into one for executing an induced method.
- Moved the checking of register/cache blocksize consistency from being cpp
  macros in bli_kernel_macro_defs.h to being runtime checks defined in
  bli_check.c and called from bli_gks_register_cntx() at the time that the
  global kernel structure's internal context is initialized for a given
  microarchitecture/configuration.
- Deprecated all of the old per-operation bli_*_cntx.c files and removed
  the previous operation-level cntx_t_init()/_finalize() invocations.
  Instead, we now query the gks for a suitable context, usually via
  bli_gks_query_cntx().
- Deprecated support for the 3m2 and 3m3 induced methods. (They required
  hackery that I was no longer willing to support.)
- Consolidated the 1e and 1r packm kernels for any given register blocksize
  into a single kernel that will branch on the schema and support packing
  to both formats.
- Added the cntx_t* argument to all packm kernel signatures.
- Deprecated the local function pointer array in all bli_packm_cxk*.c files
  and instead obtain the packm kernel from the cntx_t.
- Added bli_calloc_intl(), which serves as the calloc-equivalent to to
  bli_malloc_intl(). Useful when we wish to allocate and initialize to
  zero/NULL.
- Converted existing cpp macro functions defined in bli_blksz.h, bli_func.h,
  bli_cntx.h into static functions.
2017-10-18 13:29:32 -05:00
Field G. Van Zee
c63980f4ca Moved 'family' field from cntx_t to cntl_t.
Details:
- Removed the family field inside the cntx_t struct and re-added it to the
  cntl_t struct. Updated all accessor functions/macros accordingly, as well
  as all consumers and intermediaries of the family parameter (such as
  bli_l3_thread_decorator(), bli_l3_direct(), and bli_l3_prune_*()). This
  change was motivated by the desire to keep the context limited, as much
  as possible, to information about the computing environment. (The family
  field, by contrast, is a descriptor about the operation being executed.)
- Added additional functions to bli_blksz_*() API.
- Added additional functions to bli_cntx_*() API.
- Minor updates to bli_func.c, bli_mbool.c.
- Removed 'obj' from bli_blksz_*() API names.
- Removed 'obj' from bli_cntx_*() API names.
- Removed 'obj' from bli_cntl_*(), bli_*_cntl_*() API names. Renamed routines
  that operate only on a single struct to contain the "_node" suffix to
  differentiate with those routines that operate on the entire tree.
- Added enums for packm and unpackm kernels to bli_type_defs.h.
- Removed BLIS_1F and BLIS_VF from bszid_t definition in bli_type_defs.h.
  They weren't being used and probably never will be.
2017-07-29 14:53:39 -05:00
Field G. Van Zee
126482a3b6 Implemented the 1m method.
Details:
- Implemented the 1m method for inducing complex domain matrix
  multiplication. 1m support has been added to all level-3 operations,
  including trsm, and is now the default induced method when native
  complex domain gemm microkernels are omitted from the configuration.
- Updated _cntx_init() operations to take a datatype parameter. This was
  needed for the corresponding function for 1m (because 1m requires us
  to choose between column-oriented or row-oriented execution, which
  requires us to query the context for the storage preference of the
  gemm microkernel, which requires knowing the datatype) but I decided
  that it made sense for consistency to add the parameter to all other
  cntx initialization functions as well, even though those functions
  don't use the parameter.
- Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take
  a second scalar for each blocksize entry. The semantic meaning of the
  two scalars now is that the first will scale the default blocksize
  while the second will scale the maximum blocksize. This allows scaling
  the two independently, and was needed to support 1m, which requires
  scaling for a register blocksize but not the register storage
  blocksize (ie: "packdim") analogue.
- Deprecated bli_blksz_reduce_dt_to() and defined two new functions,
  bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing
  default and maximum blocksizes to some desired blocksize multiple.
  These functions are needed in the updated definitions of
  bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs().
- Added support for the 1e and 1r packing schemas to packm, including
  1e/1r packing kernels.
- Added a minor optimization to bli_gemm_ker_var2() that allows, under
  certain circumstances (specifically, real domain beta and row- or
  column-stored matrix C), the real domain macrokernel and microkernel
  to be called directly, rather than using the virtual microkernel
  via the complex domain macrokernel, which carries a slight additional
  amount of overhead.
- Added 1m support to the testsuite.
- Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified
  some code in test_gemm.c driver.
2016-11-25 18:29:49 -06:00
Field G. Van Zee
701b9aa3ff Redesigned control tree infrastructure.
Details:
- Altered control tree node struct definitions so that all nodes have the
  same struct definition, whose primary fields consist of a blocksize id,
  a variant function pointer, a pointer to an optional parameter struct,
  and a pointer to a (single) sub-node. This unified control tree type is
  now named cntl_t.
- Changed the way control tree nodes are connected, and what computation
  they represent, such that, for example, packing operations are now
  associated with nodes that are "inline" in the tree, rather than off-
  shoot braches. The original tree for the classic Goto gemm algorithm was
  expressed (roughly) as:

    blk_var2 -> blk_var3 -> blk_var1 -> ker_var2
                         |           |
                         -> packb    -> packa

  and now, the same tree would look like:

    blk_var2 -> blk_var3 -> packb -> blk_var1 -> packa -> ker_var2

  Specifically, the packb and packa nodes perform their respective packing
  operations and then recurse (without any loop) to a subproblem. This means
  there are now two kinds of level-3 control tree nodes: partitioning and
  non-partitioning. The blocked variants are members of the former, because
  they iteratively partition off submatrices and perform suboperations on
  those partitions, while the packing variants belong to the latter group.
  (This change has the effect of allowing greatly simplified initialization
  of the nodes, which previously involved setting many unused node fields to
  NULL.)
- Changed the way thrinfo_t tree nodes are arranged to mirror the new
  connective structure of control trees. That is, packm nodes are no longer
  off-shoot branches of the main algorithmic nodes, but rather connected
  "inline".
- Simplified control tree creation functions. Partitioning nodes are created
  concisely with just a few fields needing initialization. By contrast, the
  packing nodes require additional parameters, which are stored in a
  packm-specific struct that is tracked via the optional parameters pointer
  within the control tree struct. (This parameter struct must always begin
  with a uint64_t that contains the byte size of the struct. This allows
  us to use a generic function to recursively copy control trees.) gemm,
  herk, and trmm control tree creation continues to be consolidated into
  a single function, with the operation family being used to select
  among the parameter-agnostic macro-kernel wrappers. A single routine,
  bli_cntl_free(), is provided to free control trees recursively, whereby
  the chief thread within a groups release the blocks associated with
  mem_t entries back to the memory broker from which they were acquired.
- Updated internal back-ends, e.g. bli_gemm_int(), to query and call the
  function pointer stored in the current control tree node (rather than
  index into a local function pointer array). Before being invoked, these
  function pointers are first cast to a gemm_voft (for gemm, herk, or trmm
  families) or trsm_voft (for trsm family) type, which is defined in
  frame/3/bli_l3_var_oft.h.
- Retired herk and trmm internal back-ends, since all execution now flows
  through gemm or trsm blocked variants.
- Merged forwards- and backwards-moving variants by querying the direction
  from routines as a function of the variant's matrix operands. gemm and
  herk always move forward, while trmm and trsm move in a direction that
  is dependent on which operand (a or b) is triangular.
- Added functions bli_thread_get_range_mdim(), bli_thread_get_range_ndim(),
  each of which takes additional arguments and hides complexity in managing
  the difference between the way ranges are computed for the four families
  of operations.
- Simplified level-3 blocked variants according to the above changes, so that
  the only steps taken are:
  1. Query partitioning direction (forwards or backwards).
  2. Prune unreferenced regions, if they exist.
  3. Determine the thread partitioning sub-ranges.
  <begin loop>
    4. Determine the partitioning blocksize (passing in the partitioning
       direction)
    5. Acquire the curren iteration's partitions for the matrices affected
       by the current variants's partitioning dimension (m, k, n).
    6. Call the subproblem.
  <end loop>
- Instantiate control trees once per thread, per operation invocation.
  (This is a change from the previous regime in which control trees were
  treated as stateless objects, initialized with the library, and shared
  as read-only objects between threads.) This once-per-thread allocation
  is done primarily to allow threads to use the control tree as as place
  to cache certain data for use in subsequent loop iterations. Presently,
  the only application of this caching is a mem_t entry for the packing
  blocks checked out from the memory broker (allocator). If a non-NULL
  control tree is passed in by the (expert) user, then the tree is copied
  by each thread. This is done in bli_l3_thread_decorator(), in
  bli_thrcomm_*.c.
- Added a new field to the context, and opid_t which tracks the "family"
  of the operation being executed. For example, gemm, hemm, and symm are
  all part of the gemm family, while herk, syrk, her2k, and syr2k are
  all part of the herk family. Knowing the operation's family is necessary
  when conditionally executing the internal (beta) scalar reset on on
  C in blocked variant 3, which is needed for gemm and herk families,
  but must not be performed for the trmm family (because beta has only
  been applied to the current row-panel of C after the first rank-kc
  iteration).
- Reexpressed 3m3 induced method blocked variant in frame/3/gemm/ind
  to comform with the new control tree design, and renamed the macro-
  kernel codes corresponding to 3m2 and 4m1b.
- Renamed bli_mem.c (and its APIs) to bli_memsys.c, and renamed/relocated
  bli_mem_macro_defs.h from frame/include to frame/base/bli_mem.h.
- Renamed/relocated bli_auxinfo_macro_defs.h from frame/include to
  frame/base/bli_auxinfo.h.
- Fixed a minor bug whereby the storage-to-ukr-preference matching
  optimization in the various level-3 front-ends was not being applied
  properly when the context indicated that execution would be via an
  induced method. (Before, we always checked the native micro-kernel
  corresponding to the datatype being executed, whereas now we check
  the native micro-kernel corresponding to the datatype's real projection,
  since that is the micro-kernel that is actually used by induced methods.
- Added an option to the testsuite to skip the testing of native level-3
  complex implementations. Previously, it was always tested, provided that
  the c/z datatypes were enabled. However, some configurations use
  reference micro-kernels for complex datatypes, and testing these
  implementations can slow down the testsuite considerably.
2016-08-26 19:04:45 -05:00
Field G. Van Zee
096895c5d5 Reorganized code, APIs related to multithreading.
Details:
- Reorganized code and renamed files defining APIs related to multithreading.
  All code that is not specific to a particular operation is now located in a
  new directory: frame/thread. Code is now organized, roughly, by the
  namespace to which it belongs (see below).
- Consolidated all operation-specific *_thrinfo_t object types into a single
  thrinfo_t object type. Operation-specific level-3 *_thrinfo_t APIs were
  also consolidated, leaving bli_l3_thrinfo_*() and bli_packm_thrinfo_*()
  functions (aside from a few general purpose bli_thrinfo_*() functions).
- Renamed thread_comm_t object type to thrcomm_t.
- Renamed many of the routines and functions (and macros) for multithreading.
  We now have the following API namespaces:
  - bli_thrinfo_*(): functions related to thrinfo_t objects
  - bli_thrcomm_*(): functions related to thrcomm_t objects.
  - bli_thread_*(): general-purpose functions, such as initialization,
    finalization, and computing ranges. (For now, some macros, such as
    bli_thread_[io]broadcast() and bli_thread_[io]barrier() use the
    bli_thread_ namespace prefix, even though bli_thrinfo_ may be more
    appropriate.)
- Renamed thread-related macros so that they use a bli_ prefix.
- Renamed control tree-related macros so that they use a bli_ prefix (to be
  consistent with the thread-related macros that were also renamed).
- Removed #undef BLIS_SIMD_ALIGN_SIZE from dunnington's bli_kernel.h. This
  #undef was a temporary fix to some macro defaults which were being applied
  in the wrong order, which was recently fixed.
2016-06-06 13:32:04 -05:00
Field G. Van Zee
9dcd6f05c4 Implemented developer-configurable malloc()/free().
Details:
- Replaced all instances of bli_malloc() and bli_free() with one of:
  - bli_malloc_pool()/bli_free_pool()
  - bli_malloc_user()/bli_free_user()
  - bli_malloc_intl()/bli_free_intl()
  each of which can be configured to call malloc()/free() substitutes,
  so long as the substitute functions have the same function type
  signatures as malloc() and free() defined by C's stdlib.h. The _pool()
  function is called when allocating blocks for the memory pools (used
  for packing buffers, primarily), the _user() function is called when
  obj_t's are created (via bli_obj_create() and friends), and the _intl()
  function is called for internal use by BLIS, such as when creating
  control tree nodes or temporary buffers for manipulating internal data
  structures. Substitutes for any of the three types of bli_malloc() may
  be specified by #defining the following pairs of cpp macros in
  bli_kernel.h:
  - BLIS_MALLOC_POOL/BLIS_FREE_POOL
  - BLIS_MALLOC_USER/BLIS_FREE_USER
  - BLIS_MALLOC_INTL/BLIS_FREE_INTL
  to be the name of the substitute functions. (Obviously, the object
  code that contains these functions must be provided at link-time.)
  These macros default to malloc() and free(). Subsitute functions are
  also automatically prototyped by BLIS (in bli_malloc_prototypes.h).
- Removed definitions for bli_malloc() and bli_free().
- Note that bli_malloc_pool() and bli_malloc_user() are now defined in
  terms of a new function, bli_malloc_align(), which aligns memory to an
  arbitrary (power of two) alignment boundary, but does so manually,
  whereas before alignment was performed behind the scenes by
  posix_memalign(). Currently, bli_malloc_intl() is defined in terms
  of bli_malloc_noalign(), which serves as a simple wrapper to the
  designated function that is passed in (e.g. BLIS_MALLOC_INTL).
  Similarly, there are bli_free_align() and bli_free_noalign(), which
  are used in concert with their bli_malloc_*() counterparts.
2016-05-24 13:15:32 -05:00
Devin Matthews
4b1e55edbf Default-initialize all extern global variables to avoid generating common symbols. Fixes #73. 2016-05-10 10:08:47 -05:00
Field G. Van Zee
537a1f4f85 Implemented runtime contexts and reorganized code.
Details:
- Retrofitted a new data structure, known as a context, into virtually
  all internal APIs for computational operations in BLIS. The structure
  is now present within the type-aware APIs, as well as many supporting
  utility functions that require information stored in the context. User-
  level object APIs were unaffected and continue to be "context-free,"
  however, these APIs were duplicated/mirrored so that "context-aware"
  APIs now also exist, differentiated with an "_ex" suffix (for "expert").
  These new context-aware object APIs (along with the lower-level, type-
  aware, BLAS-like APIs) contain the the address of a context as a last
  parameter, after all other operands. Contexts, or specifically, cntx_t
  object pointers, are passed all the way down the function stack into
  the kernels and allow the code at any level to query information about
  the runtime, such as kernel addresses and blocksizes, in a thread-
  friendly manner--that is, one that allows thread-safety, even if the
  original source of the information stored in the context changes at
  run-time; see next bullet for more on this "original source" of info).
  (Special thanks go to Lee Killough for suggesting the use of this kind
  of data structure in discussions that transpired during the early
  planning stages of BLIS, and also for suggesting such a perfectly
  appropriate name.)
- Added a new API, in frame/base/bli_gks.c, to define a "global kernel
  structure" (gks). This data structure and API will allow the caller to
  initialize a context with the kernel addresses, blocksizes, and other
  information associated with the currently active kernel configuration.
  The currently active kernel configuration within the gks cannot be
  changed (for now), and is initialized with the traditional cpp macros
  that define kernel function names, blocksizes, and the like. However,
  in the future, the gks API will be expanded to allow runtime management
  of kernels and runtime parameters. The most obvious application of this
  new infrastructure is the runtime detection of hardware (and the
  implied selection of appropriate kernels). With contexts in place,
  kernels may even be "hot swapped" at runtime within the gks. Once
  execution enters a level-3 _front() function, the memory allocator will
  be reinitialized on-the-fly, if necessary, to accommodate the new
  kernels' blocksizes. If another application thread is executing with
  another (previously loaded) kernel, it will finish in a deterministic
  fashion because its kernel information was loaded into its context
  before computation began, and also because the blocks it checked out
  from the internal memory pools will be unaffected by the newer threads'
  reinitialization of the allocator.
- Reorganized and streamlined the 'ind' directory, which contains much of
  the code enabling use of induced methods for complex domain matrix
  multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as
  those APIs' functionality is now mostly subsumed within the global
  kernel structure.
- Updated bli_pool.c to define a new function, bli_pool_reinit_if(),
  that will reinitialize a memory pool if the necessary pool block size
  has increased.
- Updated bli_mem.c to use bli_pool_reinit_if() instead of
  bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed
  usage of contexts where appropriate to communicate cache and register
  blocksizes to bli_mem_compute_pool_block_sizes().
- Simplified control trees now that much of the information resides in
  the context and/or the global kernel structure:
  - Removed blocksize object pointers (blksz_t*) fields from all control
    tree node definitions and replaced them with blocksize id (bszid_t)
    values instead, which may be passed into a context query routine in
    order to extract the corresponding blocksize from the given context.
  - Removed micro-kernel function pointers (func_t*) fields from all
    control tree node definitions. Now, any code that needs these function
    pointers can query them from the local context, as identified by a
    level-3 micro-kernel id (l3ukr_t), level-1f kernel id, (l1fkr_t), or
    level-1v kernel id (l1vkr_t).
  - Removed blksz_t object creation and initialization, as well as kernel
    function object creation and initialization, from all operation-
    specific control tree initialization files (bli_*_cntl.c), since this
    information will now live in the gks and, secondarily, in the context.
- Removed blocksize multiples from blksz_t objects. Now, we track
  blocksize multiples for each blocksize id (bszid_t) in the context
  object.
- Removed the bool_t's that were required when a func_t was initialized.
  These bools are meant to allow one to track the micro-kernel's storage
  preferences (by rows or columns). This preference is now tracked
  separately within the gks and contexts.
- Merged and reorganized many separate-but-related functions into single
  files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and
  util directories, but has the most obvious effect of allowing BLIS
  to compile noticeably faster.
- Reorganized execution paths for level-1v, -1d, -1m, and -2 operations
  in an attempt to reduce overhead for memory-bound operations. This
  includes removal of default use of object-based variants for level-2
  operations. Now, by default, level-2 operations will directly call a
  low-level (non-object based) loop over a level-1v or -1f kernel.
- Converted many common query functions in blk_blksz.c (renamed from
  bli_blocksize.c) and bli_func.c into cpp macros, now defined in their
  respective header files.
- Defined bli_mbool.c API to create and query "multi-bools", or
  heterogeneous bool_t's (one for each floating-point datatype), in the
  same spirit as blksz_t and func_t.
- Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS
  and BLIS_SIMD_SIZE. These values are needed in order to compute a third
  new parameter, which may be set indirectly via the aforementioned
  macros or directly: BLIS_STACK_BUF_MAX_SIZE. This value is used to
  statically allocate memory in macro-kernels and the induced methods'
  virtual kernels to be used as temporary space to hold a single
  micro-tile. These values are now output by the testsuite. The default
  value of BLIS_STACK_BUF_MAX_SIZE is computed as
  "2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE".
- Cleaned up top-level 'kernels' directory (for example, renaming the
  embarrassingly misleading "avx" and "avx2" directories to "sandybridge"
  and "haswell," respectively, and gave more consistent and meaningful
  names to many kernel files (as well as updating their interfaces to
  conform to the new context-aware kernel APIs).
- Updated the testsuite to query blocksizes from a locally-initialized
  context for test modules that need those values: axpyf, dotxf,
  dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr.
- Reformatted many function signatures into a standard format that will
  more easily facilitate future API-wide changes.
- Updated many "mxn" level-0 macros (ie: those used to inline double loops
  for level-1m-like operations on small matrices) in frame/include/level0
  to use more obscure local variable names in an effort to avoid variable
  shaddowing. (Thanks to Devin Matthews for pointing these gcc warnings,
  which are only output using -Wshadow.)
- Added a conj argument to setm, so that its interface now mirrors that
  of scalm. The semantic meaning of the conj argument is to optionally
  allow implicit conjugation of the scalar prior to being populated into
  the object.
- Deprecated all type-aware mixed domain and mixed precision APIs. Note
  that this does not preclude supporting mixed types via the object APIs,
  where it produces absolutely zero API code bloat.
2016-04-11 17:21:28 -05:00
Field G. Van Zee
8d5169ccda Fixed bug in release of mem_t buffer.
Details:
- Fixed a bug that affects all level-2 and level-3 blocked variants. The
  bug only manifested, however, if the packing of operands (A and B in
  gemm, for example) spanned multiple nodes in the control tree. Until
  recently, the main consumers of packm were level-3 operations, all of
  which packed both input operands from blocked variant 1 (B outside of
  the loop, and A within the loop). This particular usage masked a flaw
  in the code whereby bli_obj_release_pack() would always release the
  underlying mem_t buffer (provided it was allocated), even if the buffer
  was not allocated in the current variant. This has been fixed by
  replacing all calls to bli_obj_release_pack() with calls to a new
  function, bli_packm_release(), which takes the same control tree node
  argument passed into the object's corresponding call to packm_init()
  or packv_init(). bli_packm_release() then proceeds to invoke
  bli_obj_release_pack() only if the control tree node indicates that
  packing was requested. Thanks to Devangi Parikh for identifying this
  bug.
2015-03-18 11:38:08 -05:00
Field G. Van Zee
670b63926a Added whitespace to bli_obj_scalar_ routine calls.
Details:
- Added extra spaces to align arguments of
  bli_obj_scalar_init_detached_copy_of(). This misalignment was due to
  the fact that the function was previously named
  bli_obj_init_scalar_copy_of() and the name change, performed in
  b444489f, was done via recursive sed commands which left subsequent
  lines untouched.
2014-08-24 10:46:27 -05:00
Field G. Van Zee
7ed415824d Updated copyright headers (continued).
Details:
- Inserted "at Austin" into third clause of license declarations.
  Meant to include this change in previous commit.
2014-07-14 16:14:33 -05:00
Field G. Van Zee
5c2c6c8561 Updated copyright headers to contain "at Austin".
Details:
- Updated copyright headers to include "at Austin" in the name of the
  University of Texas.
- Updated the copyright years of a few headers to 2014 (from 2011 and
  2012).
2014-07-14 16:05:03 -05:00
Field G. Van Zee
94c0df797e Changed order of zero dim / error checking.
Details:
- Updated level-2 and level-3 internal back-ends so that the operation's
  _check() function is called BEFORE any attempt to return early due to
  the presence of zero dimensions. This ordering makes more sense because
  (for example) object dimensions should match even if one of them is
  zero. Previously, a dimension mismatch could result in an early return
  with no error message.
- Updated bli_check_object_buffer() so that NULL buffers result in an
  error only if the object is dimensionally non-empty (i.e., only if both
  of the object's dimensions are non-zero). This allows BLIS operations
  to be performed on dimensionally empty objects (i.e., where at least one
  dimension is zero).
- Updated the error message associated with bli_check_object_buffer()
  to mention the newly relaxed constraint mentioned above, vis-a-vis
  non-zero dimensions.
2014-07-14 11:24:36 -05:00
Tyler Smith
f4fdfe8fc5 Merge http://github.com/flame/blis 2014-04-30 11:46:35 -05:00
Field G. Van Zee
262cdabcc8 Changed treatment of NULL object buffers.
Details:
- Relaxed the constraint in bli_obj_attach_buffer_check(), which required
  the buffer address being attached to be non-NULL. This is acceptable
  because the user was already able to create and use objects with NULL
  buffers (via bli_obj_create_without_buffer(), which initializes the
  buffer to NULL).
- Inserted calls to newly defined function, bli_check_object_buffer(),
  into nearly all operations' _check() or _int_check() functions. This
  allows BLIS to abort peacefully if a computational routine is called
  with an object containing a NULL buffer. By contrast, under such
  conditions, BLAS would typically fail with a segmentation fault.
- Within operation front-ends, moved the calls to _check()/_int_check()
  so that zero dimensions are checked first (and if found, execution
  returns with trivial or no computation). This resolves issue #7. Thanks
  to Jack Poulson for reporting this bug.
2014-04-28 16:48:25 -05:00
Tyler Smith
31bb065ba4 Merge http://github.com/flame/blis 2014-04-23 12:30:19 -05:00
Field G. Van Zee
58671597d3 Minor cleanups to level-2 _cntl.c files.
Details:
- Changed level-2 _cntl.c files so that the blocksizes for gemv are
  imported and used, rather than blocksizes being declared locally.
- Whitespace changes to gemv_cntl.c and gemm_cntl.c files (as well as
  4m/3m variants).
- Removed test/old/test_blis2.c.
2014-04-10 15:35:30 -05:00
Tyler Smith
5ec93bd9a7 Bunch of minor fixes
Removed barrier after unpackm in all level3 blocked variants
Now there is an implicit barrier inside unpackm that only occurs if C is packed (which is usually not the case)

Moved the enabling of the tree barriers into bli_config.h
Fed the default MR and NR for double precision into bli_get_range instead of the number 8
2014-04-04 15:09:10 -05:00
Tyler Smith
2e727a025a Modifying the thread info data structures
This change makes each operation have its own thread info type,
allowing more fine control of threading in operations that have different types of suboperations
2014-03-10 15:14:33 -05:00
Tyler Smith
01b125e815 First pass at adding parallelism to BLIS.
Added a multithreading infrastructure that should be independent of multithreading implementation in the future.
Currently, gemm blocked variants 1f and 2f, and packm variant blocked variant 1 is parallelized.
2014-02-27 11:55:45 -06:00
Field G. Van Zee
d628bf1da1 Consolidated pack_t enums; retired VECTOR value.
Details:
- Changed the pack_t enumerations so that BLIS_PACKED_VECTOR no longer has
  its own value, and instead simply aliases to BLIS_PACKED_UNSPEC. This
  makes room in the three pack_t bits of the info field of obj_t so that
  two values are now unused, and may be used for other future purposes.
- Updated sloppy terminology usage in comments in level-2 front-ends.
  (Replaced "is contiguous" with more accurate "has unit stride".)
2014-01-15 11:40:12 -06:00
Field G. Van Zee
2cb13600f9 Updated year in copyright headers to 2014. 2014-01-03 12:29:13 -06:00
Field G. Van Zee
392428dea4 Added "ri" scalar macros.
Details:
- Added set of basic scalar macros that take arguments' real and
  imaginary components separately, named like the previous set except
  with the "ris" (instead of "s") suffix.
- Redefined the previous set of scalar macros (those that take arguments
  "whole") in terms of the new "ri" set.
- Renamed setris and getris macros to sets and gets.
- Renamed setimag0 macros to seti0s.
- Use bli_?1 macro instead of a local constant in bla_trmv.c, bla_trsv.c.
2013-12-12 19:01:47 -06:00
Field G. Van Zee
d289f5d3a9 Whitespace changes to level-2 blocked variants.
Details:
- Joined some lines in level-2 blocked variants to match formatting used
  in level-3 blocked variants.
- Streamlined implementation of bli_obj_equals() in bli_query.c.
2013-12-05 10:56:13 -06:00
Field G. Van Zee
b444489f10 Added new "attached" scalar representation.
Details:
- Added infrastructure to support a new scalar representation, whereby
  every object contains an internal scalar that defaults to 1.0. This
  facilitates passing scalars around without having to house them in
  separate objects. These "attached" scalars are stored in the internal
  atom_t field of the obj_t struct, and are always stored to be the same
  datatype as the object to which they are attached. Level-3 variants no
  longer take scalar arguments, however, level-3 internal back-ends stll
  do; this is so that the calling function can perform subproblems such
  as C := C - alpha * A * B on-the-fly without needing to change either
  of the scalars attached to A or B.
- Removed scalar argument from packm_int().
- Observe and apply attached scalars in scalm_int(), and removed scalar
  from interface of scalm_unb_var1().
- Renamed the following functions (and corresponding invocations):

   bli_obj_init_scalar_copy_of()
                           -> bli_obj_scalar_init_detached_copy_of()
   bli_obj_init_scalar()   -> bli_obj_scalar_init_detached()
   bli_obj_create_scalar_with_attached_buffer()
                           -> bli_obj_create_1x1_with_attached_buffer()
   bli_obj_scalar_equals() -> bli_obj_equals()

- Defined new functions:

   bli_obj_scalar_detach()
   bli_obj_scalar_attach()
   bli_obj_scalar_apply_scalar()
   bli_obj_scalar_reset()
   bli_obj_scalar_has_nonzero_imag()
   bli_obj_scalar_equals()

- Placed all bli_obj_scalar_* functions in a new file, bli_obj_scalar.c.
- Renamed the following macros:

   bli_obj_scalar_buffer() -> bli_obj_buffer_for_1x1()
   bli_obj_is_scalar()     -> bli_obj_is_1x1()

- Defined new macros to set and copy internal scalars between objects:

   bli_obj_set_internal_scalar()
   bli_obj_copy_internal_scalar()

- In level-3 internal back-ends, added conditional blocks where alpha and
  beta are checked for non-unit-ness. Those values for alpha and beta are
  applied to the scalars attached to aliases of A/B/C, as appropriate,
  before being passed into the variant specified by the control tree.
- In level-3 blocked variants, pass BLIS_ONE into subproblems instead of
  alpha and/or beta.
- In level-3 macro-kernels, changed how scalars are obtained. Now, scalars
  attached to A and B are multiplied together to obtain alpha, while beta
  is obtained directly from C.
- In level-3 front-ends, removed old function calls meant to provide
  future support for mixed domain/precision. These can be added back later
  once that functionality is given proper treatment. Also, removed the
  creating of copy-casts of alpha and beta since typecasting of scalars
  is now implicitly handled in the internal back-ends when alpha and
  beta are applied to the attached scalars.
2013-12-03 16:08:30 -06:00
Field G. Van Zee
9552e6ee82 Removed optional scaling from packm control tree.
Details:
- Removed does_scale field from packm control tree node and
  bli_packm_cntl_obj_create() interface. Adjusted all invocations of
  _cntl_obj_create() accordingly.
- Redefined/renamted macros that are used in aliasing so that now,
  bli_obj_alias_to() does a full alias (shallow copy) while
  bli_obj_alias_for_packing() does a partial alias that preserves the
  pack_mem-related fields of the aliasing (destination) object.
- Removed bli_trmm3_cntl.c, .h after realizing that the trmm control tree
  will work just fine for bli_trmm3().
- Removed some commented vestiges of the typecasting functionality needed
  to support heterogeneous datatypes.
2013-11-24 11:40:31 -06:00
Field G. Van Zee
a98f78b715 Changed dim_t and inc_t to be signed integers.
Details:
- Redefined dim_t and inc_t in terms of gint_t (instead of guint_t).
  This will facilitate interoperability with Fortran in the future.
  (Fortran does not support unsigned integers.)
- Redefined many instances of stride-related macros so that they return
  or use the absolute value of the strides, rather than the raw strides
  which may now be signed. Added new macros bli_is_row_stored_f() and
  bli_is_col_stored_f(), which assume positive (forward-oriented) strides,
  and changed the packm_blk_var[23] variants to use these macros instead
  of the existing bli_is_row_stored(), bli_is_col_stored().
- Added/adjusted typecasting to to various functions/macros, including
  bli_obj_alloc_buffer(), bli_obj_buffer_at_off(), and various pointer-
  related macros in bli_param_macro_defs.h.
- Redefined bli_convert_blas_incv() macro so that the BLAS compatibility
  layer properly handles situations where vector increments are negative.
  Thanks to Vladimir Sukharev for pointing out this issue.
- Changed type of increment parameters in bli_adjust_strides() from dim_t
  to inc_t. Likewise in bli_check_matrix_strides().
- Defined bli_check_matrix_object(), which checks for negative strides.
- Redefined bli_check_scalar_object() and bli_check_vector_object() so
  that they also check for negative stride.
- Added instances of bli_check_matrix_object() to various operations'
  _check routines.
2013-11-06 15:32:47 -06:00
Field G. Van Zee
5e54f46ccb Added template implementations and other tweaks.
Details:
- Added a 'template' configuration, which contains stub implementations of the
  level 1, 1f, and 3 kernels with one datatype implemented in C for each, with
  lots of in-file comments and documentation.
- Modified some variable/parameter names for some 1/1f operations. (e.g.
  renaming vector length parameter from m to n.)
- Moved level-1f fusing factors from axpyf, dotxf, and dotxaxpyf header files
  to bli_kernel.h.
- Modifed test suite to print out fusing factors for axpyf, dotxf, and
  dotxaxpyf, as well as the default fusing factor (which are all equal
  in the reference and template implementations).
- Cleaned up some sloppiness in the level-1f unb_var1.c files whereby these
  reference variants were implemented in terms of front-end routines rather
  that directly in terms of the kernels. (For example, axpy2v was implemented
  as two calls to axpyv rather than two calls to AXPYV_KERNEL.)
- Changed the interface to dotxf so that it matches that of axpyf, in that
  A is assumed to be m x b_n in both cases, and for dotxf A is actually used
  as A^T.
- Minor variable naming and comment changes to reference micro-kernels in
  frame/3/gemm/ukernels and frame/3/trsm/ukernels.
2013-09-30 12:58:18 -05:00
Field G. Van Zee
068437736b Fixed set-but-not-used compiler (gcc) warnings.
Details:
- Used void-casts of certain variables to appease gcc (and perhaps other
  compilers) when such variables are only used in the complex instances of
  the functions. Special thanks to Karl Rupp for suggesting a portable fix
  for these warnings.
2013-09-09 14:07:58 -05:00
Field G. Van Zee
12dbd2f334 Moved init_safe(), finalize_safe() to BLAS compat.
Details:
- Moved the bli_init_safe() and bli_finalize_safe() function calls from the
  BLAS-like BLIS layer to the BLAS compatibility layer. Having these auto-
  initializers in the BLIS layer wasn't buying us anything because the user
  could still call the library with uninitialized global scalar constants,
  for example. Thus, we will just have to live with the constraint that
  bli_init() MUST be called before calling ANY routine with a bli_ prefix.
- Added the missing _init_safe() and finalize_safe() calls to the level-1
  BLAS compatibility wrappers.
2013-08-08 14:39:35 -05:00
Field G. Van Zee
6e7e452343 Fixed minor warnings and misc issues.
Details:
- Fixed various warnings output by gcc 4.6.3-1, including removing some
  set-but-not-used variables and addressing some instances of typecasting
  of pointer types to integer types of different sizes.
2013-07-22 14:50:57 -05:00
Field G. Van Zee
0288c827d3 Updated ukernels for x86_64.
Details:
- Tweaked micro-kernels and configuration for clarksville.
- Updated/cleaned up old test drivers in test directory.
- Fixed syntax bug in trsv_unb_var1 and trsv_unf_var1 (introduced
  recently).
2013-06-01 08:02:23 -05:00
Field G. Van Zee
85a6d1c9a5 Replaced axpys usage with subs in trsv.
Details:
- Replaced instances of axpys with alpha equal to -1 with subs.
- Use BLIS_MAX_TYPE_SIZE to define BLIS_CONSTANT_SLOT_SIZE instead of
  sizeof(dcomplex).
2013-05-29 10:58:24 -05:00
Field G. Van Zee
2d9c667f3c Fixed x86_64 kernel bugs and other minor issues.
Details:
- Fixed bugs in trmv_l and trsv_u due to backwards iteration resulting in
  unaligned subpartitions. We were already going out of our way a bit to
  handle edge cases in the first iteration for blocked variants, and this
  was simply the unblocked-fused extension of that idea.
- Fixed control tree handling in her/her2/syr/syr2 that was not taking
  into account how the choice of variant needed to be altered for
  upper-stored matrices (given that only lower-stored algorithms are
  explicitly implemented).
- Added bli_determine_blocksize_dim_f(), bli_determine_blocksize_dim_b()
  macros to provide inlined versions of bli_determine_blocksize_[fb]() for
  use by unblocked-fused variants.
- Integrated new blocksize_dim macros into gemv/hemv unf variants for
  consistency with that of the bugfix for trmv/trsv (both of which now
  use the same macros).
- Modified bli_obj_vector_inc() so that 1 is returned if the object is a
  vector of length 1 (ie: 1 x 1). This fixes a bug whereby under certain
  conditions (e.g. dotv_opt_var1), an invalid increment was returned, which
  was invalid only because the code was expecting 1 (for purposes of
  performing contiguous vector loads) but got a value greater than 1 because
  the column stride of the object (e.g. rho) was inflated for alignment
  purposes (albeit unnecessarily since there is only one element in the
  object).
- Replaced some old invocations of set0 with set0s.
- Added alpha parameter to gemmtrsm ukernels for x86_64 and use accordingly.
- Fixed increment bug in cleanup loop of gemm ukernel for x86_64.
- Added safeguard to test modules so that testing a problem with a zero
  dimension does not result in a failure.
- Tweaked handling of zero dimensions in level-2 and level-3 operations'
  internal back-ends to correctly handle cases where output operand still
  needs to be scaled (e.g. by beta, in the case of gemm with k = 0).
2013-05-24 16:28:10 -05:00
Field G. Van Zee
9e2b227866 Renamed _set_trans(), _trans_status() macros.
Details:
- Renamed the following macros:
    bli_obj_set_trans()    -> bli_obj_set_onlytrans()
    bli_obj_trans_status() -> bli_obj_onlytrans_status()
  to remove ambiguity as to which bits are read/updated.
2013-05-03 17:24:58 -05:00
Field G. Van Zee
6bfa96f848 Absorbed blocksize extensions into main objects.
Details:
- Revamped some parts of commit b6ef84fad1 by adding blocksize extension
  fields to the blksz_t object rather than have them as separate structs.
- Updated all packm interfaces/invocations according to above change.
- Generalized bli_determine_blocksize_?() so that edge case optimization
  happens if and only if cache blocksizes are created with non-zero
  extensions.
- Updated comments in bli_kernel.h files to indicate that the edge case
  blocksize extension mechanism is now available for use.
2013-04-30 19:35:54 -05:00
Field G. Van Zee
6a538fa7b1 Formatting change to mods in previous commit. 2013-04-15 14:40:31 -05:00
Field G. Van Zee
ea079d3559 Set structure of objects in level-2 BLIS APIs.
Details:
- Added missing statement to set structure field of local objects in
  top-level BLIS (BLAS-like) API wrappers. Thanks to Bryan Marker for
  reporting this bug.
2013-04-15 14:31:40 -05:00
Field G. Van Zee
b65cdc57d9 Migrated 'bl2' prefix to 'bli'.
Details:
- Changed all filename and function prefixes from 'bl2' to 'bli'.
- Changed the "blis2.h" header filename to "blis.h" and changed all
  corresponding #include statements accordingly.
- Fixed incorrect association for Fran in CREDITS file.
2013-03-24 20:01:49 -05:00
Field G. Van Zee
551ea4767a Removed #include "blis2.h" from low-level headers.
Details:
- Removed #include of "blis2.h" from various lower-level, operation-specific
  header files throughout the framework. Given that these low-level headers
  are included within #blis2.h in a very specific order, #include'ing blis2.h
  within them directly is unnecessary.
2013-03-24 18:00:10 -05:00
Field G. Van Zee
bb612f864e Updated behavior of bl2_obj_induce_trans() macro.
Details:
- Changed bl2_obj_induce_trans() so that the transposition bit is no longer
  updated as part of the macro. All current uses of the macro have been
  coupled with instances of bl2_obj_set_trans() to clear the bit.
- Added Jed to CREDITS file.
2013-03-01 12:55:42 -06:00
Field G. Van Zee
ede75693e5 Implemented blas2blis compatibility layer.
Details:
- Added the blas2blis compatibility layer, located in frame/compat. This
  includes virtually all of the BLAS, including banded and packed level-2
  operations.

- Defined bl2_init_safe(), bl2_finalize_safe(). The former allows a conditional
  initialization, which stores the "exit status" in an err_t, which is then
  read by the latter function to determine whether finalization should actually
  take place.
- Added calls to bl2_init_safe(), bl2_finalize_safe() to all level-2 and
  level-3 BLAS-like wrappers.
- Added configuration option to instruct BLIS to remain initialized whenever
  it automatically initializes itself (via bl2_init_safe()), until/unless the
  application code explicitly calls bl2_finalize().

- Added INSERT_GENTFUNC* and INSERT_GENTPROT* macros to facilitate type
  templatization of blas2blis wrappers.
- Defined level-0 scalar macro bl2_??swaps().
- Defined level-1v operation bl2_swapv().
- Defined some "Fortran" types to bl2_type_defs.h for use with BLAS
  wrappers.
2013-02-22 12:11:24 -06:00
Field G. Van Zee
e823b08aaf Fixed some scalar types in BLAS-like Herm APIs.
Details:
- Some of the scalars of Hermitian operations, such as alpha in her,
  alpha and beta in herk, and beta in her2k, need to be real. These
  arguments were typed incorrectly as the complex types. This has been
  fixed. Note the issue was only present in the BLAS-like APIs for
  these operations (not the native object-based interfaces).
2013-02-21 12:00:17 -06:00
Field G. Van Zee
e6ac623a90 Properly implemented beta == 0 semantics.
Details:
- Changed name of set0 and set0_mxn macros to set0s and set0s_mxn,
  respectively.
- Added code to the following operations that sets the output operand to
  zero if the corresponding scalar is zero (rather than performing the
  floating-point multiply, or in the case of setv, copying the value).
  This will prevent nan's and inf's from creeping into results from
  uninitialized memory.
  - axpy
  - dotxv
  - scalv
  - scal2v
  - setv
  - gemv
  - ger
  - hemv
  - her
  - her2
  - gemm reference ukernels
2013-02-13 18:44:59 -06:00
Field G. Van Zee
1274e12437 Updated copyright headers from 2012 to 2013. 2013-02-11 14:37:47 -06:00
Field G. Van Zee
768fcebaa8 Added unified test suite, and many fixes.
Details:
- Added a highly configurable, unified test suite.

- Removed DUPB configuration constant from bl2_kernel.h and macro-kernel
  header files. Now, instead, DUPB is computed as (NDUP != 1) within each
  macro-kernel. This fixes a bug in trmm/trsm whereby bp was indexed into
  incorrectly when DUPB was set to FALSE but the NDUP was still non-unit.
  By encoding both pieces of information into one constant in _kernel.h,
  it seems somewhat less likely others will encounter this bug in the
  future.
- Added level-2 cache blocksizes to _kernel.h for reference configuration,
  and defined blocksizes in _cntl.c files to these default values.

- Changed semantics of her2k and syr2k such that these operations no longer
  expect the B matrix to already be conjugate-transposed (or just transposed
  for syr2k). However, these semantics are preserved for the internal
  mechanics of the implementations, including the internal back-end and all
  blocked variants.
- Inserted checks for real-valued alpha and beta for herk/her2k and herk,
  respectively.

- Relaxed general object structure constraints in _basic_check() for gemv, ger.
- Changed her front-end to NOT copy-cast to real projection; instead, this is
  replaced by selecting either the real part or both parts within the unblocked
  algorithm implementation, depending on the value of conjh.
- Added conjh to all _check routines for her so that the code knows when to
  verify that alpha has an imaginary component equal to zero (for her, but
  not syr).
- Changed control tree for her to forgo packing.

- Added unit diagonal support to fnormm.
- Redefined real versions of abval2s macros in terms of fabs(), fabsf().
- Redefined complex versions of sqrt2s macros using the actual "complex square
  root" formula.
- Created new level-0 object-based routines, suffixed with "sc" (for "scalar").
- Defined new level-1v, -1d, and -1m versions of add and sub operations
  (two-operand add and subtract).
- Added new scalar macros:
  - getris: acquire real and imaginary components.
  - setris: set real and imaginary components.
  - addjs: addition with conjugated x.
  - subjs: subtraction with conjugated x.
- Defined new utility operations:
  - absumv: element-wise sum of absolute values for vector elements.
  - absumm: element-wise sum of absolute values for matrix elements.
  - mkherm: convert existing matrix to Hermitian.
  - mksymm: convert existing matrix to symmetric.
  - mktrim: convert existing matrix to triangular.

- Added various error checking routines.
- Added bl2_clock_min_diff(), which is used to more cleanly measure the
  wall clock time of a code block.
- Added general stride support to bl2_obj_alloc_buffer().
- Added bl2_obj_init_scalar().
- Updated parameter mapping in bl2_param_map.c.
- Added support for queriable version string.

- Fixed a bug in the her2k macro-kernels (which currently are simply
  implemented in terms of two invocations of herk) whereby beta was being
  applied to both the first and second rank-k updates, rather than only
  the first.
- Fixed a bug in trmm/trsm whereby transpose and right side cases were not
  properly implemented due to erroneous assumptions regarding aliasing and
  root objects.
- Fixed a bug in the upper triangular trsm macro-kernel in which the wrong
  MR x NR block of B was being updated.
- Fixed a bug in the inverts macro in the double real case whereby the
  value was typecast to float before inversion. This affected non-unit cases
  of dtrsm.
- Fixed a bug in the reference kernels for gemmtrsm whereby the minus one
  constant was being applied incorrectly.
- Fixed a bug in the overall treatment of non-unit alpha for trsm. The code
  now mimics the rank-k strategy of gemm, whereby alpah is applied during
  the first iteration of variant 3, with BLIS_ONE passed in instead for
  subsequent iterations. This also required passing alpha into the macro-
  kernels as well as the fused gemmtrsm micro-kernels.
- Fixed a bug in trsm_u_blk_var1 whereby the gemm macro-kernel was being
  called for blocks strictly above the diagonal. While this sounds good in
  theory, this cannot be done because gemm_ker_var2 expects row panels of
  A to be packed from top to bottom, while for trsm_u, A is actually packed
  from bottom to top due to the reverse (BR->TL) nature of the algorithm.
- Fixed a bug in packm_cxk() whereby panel packings with unit panel
  dimensions were mishandled due to incorrect arguments to the copyv kernel.
  Also changed the copyv kernel invocation to scal2v so that these edge
  cases are properly handled when scaling is requested.
- Fixed a bug in packv_int() whereby an uninitialized object is passed in
  instead of the source object.
- Fixed a bug whereby level-2 code could allocate memory dynamically via
  bl2_malloc() and then attempt to free it via bl2_mm_release(). Also fixed
  a potential future bug whereby a mem_t object that is actually no longer
  "allocated" from the static pool is mistaken for being allocated due to
  failure to NULLify the buffer when the block was most recently released.
- Fixed a bug in bl2_acquire_mpart_*() whreby the uplo field was mistakenly
  toggled when the requested subpartition needed to be "reflected" due to it
  residing in an unstored region.
2013-02-11 13:20:44 -06:00
Field G. Van Zee
00f3498a89 Initial commit. 2012-12-03 12:36:11 -06:00