Merge remote-tracking branch 'origin/master' into knl

2026-05-11 09:39:59 +00:00 · 2016-04-18 10:21:35 -05:00
parent bd5e2296e9 cbcd0b739d
commit c38e0dab05
1343 changed files with 54329 additions and 31224 deletions
--- a/766
+++ b/766
@@ -1,10 +1,772 @@
-commit 47caa33485b91ea6f2a5e386e61210c90c5f489f (HEAD -> master, tag: 0.1.8)
+commit 898614a555ea0aa7de4ca07bb3cb8f5708b6a002 (HEAD -> master, tag: 0.2.0)
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Mon Apr 11 17:32:09 2016 -0500
+
+    Version file update (0.2.0)
+
+commit 537a1f4f85ce1aa008901857cb3182e6b4546d7f
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Mon Apr 11 17:21:28 2016 -0500
+
+    Implemented runtime contexts and reorganized code.
+    
+    Details:
+    - Retrofitted a new data structure, known as a context, into virtually
+      all internal APIs for computational operations in BLIS. The structure
+      is now present within the type-aware APIs, as well as many supporting
+      utility functions that require information stored in the context. User-
+      level object APIs were unaffected and continue to be "context-free";
+      however, these APIs were duplicated/mirrored so that "context-aware"
+      APIs now also exist, differentiated with an "_ex" suffix (for "expert").
+      These new context-aware object APIs (along with the lower-level, type-
+      aware, BLAS-like APIs) contain the address of a context as a last
+      parameter, after all other operands. Contexts, or specifically, cntx_t
+      object pointers, are passed all the way down the function stack into
+      the kernels and allow the code at any level to query information about
+      the runtime, such as kernel addresses and blocksizes, in a thread-
+      friendly manner--that is, one that allows thread-safety, even if the
+      original source of the information stored in the context changes at
+      run-time (see next bullet for more on this "original source" of info).
+      (Special thanks go to Lee Killough for suggesting the use of this kind
+      of data structure in discussions that transpired during the early
+      planning stages of BLIS, and also for suggesting such a perfectly
+      appropriate name.)
+    - Added a new API, in frame/base/bli_gks.c, to define a "global kernel
+      structure" (gks). This data structure and API will allow the caller to
+      initialize a context with the kernel addresses, blocksizes, and other
+      information associated with the currently active kernel configuration.
+      The currently active kernel configuration within the gks cannot be
+      changed (for now), and is initialized with the traditional cpp macros
+      that define kernel function names, blocksizes, and the like. However,
+      in the future, the gks API will be expanded to allow runtime management
+      of kernels and runtime parameters. The most obvious application of this
+      new infrastructure is the runtime detection of hardware (and the
+      implied selection of appropriate kernels). With contexts in place,
+      kernels may even be "hot swapped" at runtime within the gks. Once
+      execution enters a level-3 _front() function, the memory allocator will
+      be reinitialized on-the-fly, if necessary, to accommodate the new
+      kernels' blocksizes. If another application thread is executing with
+      another (previously loaded) kernel, it will finish in a deterministic
+      fashion because its kernel information was loaded into its context
+      before computation began, and also because the blocks it checked out
+      from the internal memory pools will be unaffected by the newer threads'
+      reinitialization of the allocator.
+    - Reorganized and streamlined the 'ind' directory, which contains much of
+      the code enabling use of induced methods for complex domain matrix
+      multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as
+      those APIs' functionality is now mostly subsumed within the global
+      kernel structure.
+    - Updated bli_pool.c to define a new function, bli_pool_reinit_if(),
+      that will reinitialize a memory pool if the necessary pool block size
+      has increased.
+    - Updated bli_mem.c to use bli_pool_reinit_if() instead of
+      bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed
+      usage of contexts where appropriate to communicate cache and register
+      blocksizes to bli_mem_compute_pool_block_sizes().
+    - Simplified control trees now that much of the information resides in
+      the context and/or the global kernel structure:
+      - Removed blocksize object pointer (blksz_t*) fields from all control
+        tree node definitions and replaced them with blocksize id (bszid_t)
+        values instead, which may be passed into a context query routine in
+        order to extract the corresponding blocksize from the given context.
+      - Removed micro-kernel function pointer (func_t*) fields from all
+        control tree node definitions. Now, any code that needs these function
+        pointers can query them from the local context, as identified by a
+        level-3 micro-kernel id (l3ukr_t), level-1f kernel id (l1fkr_t), or
+        level-1v kernel id (l1vkr_t).
+      - Removed blksz_t object creation and initialization, as well as kernel
+        function object creation and initialization, from all operation-
+        specific control tree initialization files (bli_*_cntl.c), since this
+        information will now live in the gks and, secondarily, in the context.
+    - Removed blocksize multiples from blksz_t objects. Now, we track
+      blocksize multiples for each blocksize id (bszid_t) in the context
+      object.
+    - Removed the bool_t's that were required when a func_t was initialized.
+      These bools were meant to allow one to track the micro-kernel's storage
+      preferences (by rows or columns). This preference is now tracked
+      separately within the gks and contexts.
+    - Merged and reorganized many separate-but-related functions into single
+      files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and
+      util directories, but has the most obvious effect of allowing BLIS
+      to compile noticeably faster.
+    - Reorganized execution paths for level-1v, -1d, -1m, and -2 operations
+      in an attempt to reduce overhead for memory-bound operations. This
+      includes removal of default use of object-based variants for level-2
+      operations. Now, by default, level-2 operations will directly call a
+      low-level (non-object based) loop over a level-1v or -1f kernel.
+    - Converted many common query functions in bli_blksz.c (renamed from
+      bli_blocksize.c) and bli_func.c into cpp macros, now defined in their
+      respective header files.
+    - Defined bli_mbool.c API to create and query "multi-bools", or
+      heterogeneous bool_t's (one for each floating-point datatype), in the
+      same spirit as blksz_t and func_t.
+    - Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS
+      and BLIS_SIMD_SIZE. These values are needed in order to compute a third
+      new parameter, which may be set indirectly via the aforementioned
+      macros or directly: BLIS_STACK_BUF_MAX_SIZE. This value is used to
+      statically allocate memory in macro-kernels and the induced methods'
+      virtual kernels to be used as temporary space to hold a single
+      micro-tile. These values are now output by the testsuite. The default
+      value of BLIS_STACK_BUF_MAX_SIZE is computed as
+      "2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE".
+    - Cleaned up top-level 'kernels' directory (for example, renaming the
+      embarrassingly misleading "avx" and "avx2" directories to "sandybridge"
+      and "haswell," respectively), and gave more consistent and meaningful
+      names to many kernel files (as well as updating their interfaces to
+      conform to the new context-aware kernel APIs).
+    - Updated the testsuite to query blocksizes from a locally-initialized
+      context for test modules that need those values: axpyf, dotxf,
+      dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr.
+    - Reformatted many function signatures into a standard format that will
+      more easily facilitate future API-wide changes.
+    - Updated many "mxn" level-0 macros (i.e., those used to inline double loops
+      for level-1m-like operations on small matrices) in frame/include/level0
+      to use more obscure local variable names in an effort to avoid variable
+      shadowing. (Thanks to Devin Matthews for pointing out these gcc warnings,
+      which are only output using -Wshadow.)
+    - Added a conj argument to setm, so that its interface now mirrors that
+      of scalm. The semantic meaning of the conj argument is to optionally
+      allow implicit conjugation of the scalar prior to being populated into
+      the object.
+    - Deprecated all type-aware mixed domain and mixed precision APIs. Note
+      that this does not preclude supporting mixed types via the object APIs,
+      where it produces absolutely zero API code bloat.
+
+commit d1f8e5d9b2ecd054ed103f4d642d748db2d4f173 (origin/master)
+Merge: 20af937 c11d28e
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Tue Apr 5 12:21:27 2016 -0500
+
+    Merge pull request #60 from esauvage/master
+    
+    sgemm µkernel for bulldozer : bug correction for k%4 != 0
+
+commit c11d28eed89d65494bc4019f04d046520866c0ff
+Author: Etienne Sauvage <etienne.sauvage@gmail.com>
+Date:   Sat Apr 2 21:15:48 2016 +0200
+
+    cgemm µkernel for bulldozer : bug correction for k%4 != 0
+
+commit 20af937b57f82bb3acb09418d5c0206e1b24f2c7
+Merge: 36c3abb fc61a11
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Thu Mar 31 14:37:30 2016 -0500
+
+    Merge pull request #59 from devinamatthews/fix_testsuite_makefile
+    
+    Fix testsuite makefile
+
+commit fc61a1143edeba4946d4b9915f1775bb08e643fc
+Author: Devin Matthews <dmatthews@utexas.edu>
+Date:   Thu Mar 31 10:53:01 2016 -0500
+
+    Fix formatting in configure.
+
+commit 26379b14de630e3a6c6eef5dfe87ff001558a8a6
+Author: Devin Matthews <dmatthews@utexas.edu>
+Date:   Thu Mar 31 10:45:48 2016 -0500
+
+    Adjust paths in common.mk to support building from testsuite dir.
+
+commit 36c3abb05fecb02d4a9ab13b2b69d133adf34583
+Merge: 64b41fa 917ce75
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Thu Mar 31 10:26:17 2016 -0500
+
+    Merge pull request #58 from esauvage/master
+    
+    cgemm & zgemm micro-kernels for FMA4 instruction set (bulldozer confi…
+
+commit 356d854fc9e34642cc46e0e02a8ceb56114878af
+Author: Devin Matthews <dmatthews@utexas.edu>
+Date:   Wed Mar 30 16:33:15 2016 -0500
+
+    Make symlink to common.mk in build directory.
+
+commit edbb8470044f82ef959583ee09613a5a985292b5
+Author: Devin Matthews <dmatthews@utexas.edu>
+Date:   Wed Mar 30 16:27:11 2016 -0500
+
+    Refactor out some definitions which moved from make_defs.mk to Makefile for use in testsuite Makefile.
+
+commit 917ce75482a543fef46553efff6c246939761e59
+Author: Etienne Sauvage <etienne.sauvage@gmail.com>
+Date:   Wed Mar 30 22:03:09 2016 +0200
+
+    cgemm & zgemm micro-kernels for FMA4 instruction set (bulldozer configuration), based on x86_64/avx micro-kernel
+
+commit 64b41fa554dff44b2f9ad48901b67c63836407a8
+Merge: 1b09e34 0171ad5
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Tue Mar 29 15:19:41 2016 -0500
+
+    Merge pull request #54 from devinamatthews/more_config_opts
+    
+    More config opts
+
+commit 1b09e343dfe5b48b4842e2cb96f41c8cc249bad0
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Tue Mar 29 12:55:28 2016 -0500
+
+    Updated gcc version from 4.8 to 4.9 in .travis.yml.
+
+commit 0171ad58997b3a5a9b76301511dbe0751fffc940
+Author: Devin Matthews <dmatthews@utexas.edu>
+Date:   Mon Mar 28 13:55:06 2016 -0500
+
+    Add icc and clang support for Intel architectures, fixes #47. 2bd036f fixes #49 BTW.
+
+commit 3090fff64cc87ff2519a09f38e6b8699cf3cba11
+Merge: 8624e36 4ca5d5b
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Mon Mar 28 12:36:25 2016 -0500
+
+    Merge pull request #44 from esauvage/master
+    
+    sgemm micro-kernel for FMA4 instruction set
+
+commit e6e566426ac3ded7ef87cd8ff9be98accfdc4acc
+Merge: 469429e 8624e36
+Author: Devin Matthews <dmatthews@utexas.edu>
+Date:   Sat Mar 26 14:10:15 2016 -0500
+
+    Merge branch 'master' into more_config_opts
+
+commit 8624e36543160739d954c4dbcc5a5594458f3a12
+Merge: a315833 2bd036f
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Sat Mar 26 13:56:28 2016 -0500
+
+    Merge pull request #50 from devinamatthews/fix_noopt_avx
+    
+    Fix configuration issue where instruction set flags are not specified for debug builds.
+
+commit 469429ec34e5b1a172ce35596f9c7afdaacac131
+Author: Devin Matthews <dmatthews@utexas.edu>
+Date:   Fri Mar 25 20:45:41 2016 -0500
+
+     Fix LD_FLAGS -> LDFLAGS.
+
+commit 8442d65c9ead0376fc5f2dfad62fd4862ab9b2b3
+Author: Devin Matthews <dmatthews@utexas.edu>
+Date:   Fri Mar 25 20:06:48 2016 -0500
+
+    Replace -march=native with specific architecture flags to support cross-compiling, and add icc support for Intel architectures.
+
+commit 76099f20be1b49ac960f7e3c5a8296bbf4e1782d
+Author: Devin Matthews <dmatthews@utexas.edu>
+Date:   Fri Mar 25 17:22:58 2016 -0500
+
+    Add threading option to configure.
+
+commit ad43eab4c7899d56d8d7caa6e2d92bc0581ea5a5
+Merge: 9452bdb 2bd036f
+Author: Devin Matthews <dmatthews@utexas.edu>
+Date:   Fri Mar 25 15:00:02 2016 -0500
+
+    Merge branch 'fix_noopt_avx' into more_config_opts
+
+commit 9452bdb3afbf2d7f898134a091d7790817e7be9c
+Author: Devin Matthews <dmatthews@utexas.edu>
+Date:   Fri Mar 25 14:59:50 2016 -0500
+
+    Add options for verbose make output and static/shared linking to configure.
+
+commit 2bd036f1f9ce1ee0864365557f66d9415dd42de3
+Author: Devin Matthews <dmatthews@utexas.edu>
+Date:   Fri Mar 25 12:16:49 2016 -0500
+
+    Fix configuration issue where instruction set flags are not specified for debug builds.
+
+commit a315833f067944fb0bc14cf60f0c7dcb5dc897b6
+Merge: 1d1a426 af92773
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Thu Mar 24 12:30:21 2016 -0500
+
+    Merge pull request #48 from figual/master
+    
+    Updated and improved ARMv8 micro-kernels.
+
+commit af92773f4f85a2441fe0c6e3a52c31b07253d08e
+Author: figual <figual@ucm.es>
+Date:   Wed Mar 23 22:07:02 2016 +0100
+
+    Updated and improved ARMv8 micro-kernels.
+
+commit 1d1a426d18ec03754021456862a1f4d1dfec1fbf
+Merge: 5a978ff d226dfa
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Mon Mar 7 15:17:53 2016 -0600
+
+    Merge pull request #46 from devinamatthews/new-config-opts
+    
+    Add several changes to the build system.
+
+commit d226dfa05190eb477b33563b1edccf8603973336
+Author: Devin Matthews <dmatthews@utexas.edu>
+Date:   Sat Mar 5 16:18:14 2016 -0600
+
+    Add several changes to the build system.
+    
+    1) Add -- options.
+    2) Add -d/--enable-debug option to enable debugging symbols with and without optimization.
+    3) Allow user to specify CC at configure time, and determine vendor (gcc/icc/etc.). For now configurations enforce a particular vendor.
+    4) Add make V=[0,1] option to control build verbosity.
+
+commit 5a978fffdb8f09a81c89541d541d4a6830cd70a4
+Merge: adb2b4e 63e2642
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Fri Mar 4 17:26:58 2016 -0600
+
+    Merge pull request #45 from devinamatthews/high_prec_timers
+    
+    Use clock_gettime(CLOCK_MONOTONIC) and mach_absolute_time instead of gettimeofday
+
+commit 63e264239053b913164a849dd8a45829087eaddc
+Author: Devin Matthews <dmatthews@utexas.edu>
+Date:   Fri Mar 4 13:17:50 2016 -0600
+
+    Make sure that -lrt is linked on Linux.
+
+commit 44fddd48dc1708a956803d1948f04429ec0d8700
+Author: Devin Matthews <dmatthews@utexas.edu>
+Date:   Fri Mar 4 12:36:38 2016 -0600
+
+    Add missing \.
+
+commit 7cabd2131f953de23e7015d760b0ddfda51b1251
+Author: Devin Matthews <dmatthews@utexas.edu>
+Date:   Thu Mar 3 11:43:07 2016 -0600
+
+    Use clock_gettime(CLOCK_MONOTONIC) and mach_absolute_time instead of gettimeofday.
+
+commit adb2b4e096c78e8b2f85fd372cf0d5eb04af5be8
+Author: Tyler Smith <tms@cs.utexas.edu>
+Date:   Wed Mar 2 14:48:12 2016 -0600
+
+    Fixing guard for non implemented partitioning through packed matrices
+
+commit 4ca5d5b1fd6f2e4a8b2e139c5405475239581e51
+Author: Etienne Sauvage <etienne.sauvage@gmail.com>
+Date:   Tue Mar 1 21:33:01 2016 +0100
+
+    sgemm micro-kernel for FMA4 instruction set (bulldozer configuration), based on x86_64/avx micro-kernel
+
+commit 627d59b5ba06866b26f46e4434a0435b600925e3
+Author: Etienne Sauvage <etienne.sauvage@gmail.com>
+Date:   Mon Feb 29 21:53:12 2016 +0100
+
+    symbolic link for bulldozer configuration to kernels
+
+commit 2dc5c0ae038ed175fab85751803ada05734d1ba1
+Merge: f2809fc 3d0fae8
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Mon Feb 29 12:22:51 2016 -0600
+
+    Merge pull request #40 from tkelman/bulldozer-symlink
+    
+    Add symlink from config/bulldozer/kernels to kernels/x86_64/bulldozer
+
+commit f2809fc5f74466c755da6a5b4632853e634060b5
+Merge: f86b94f 8624a33
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Sat Feb 27 13:06:03 2016 -0600
+
+    Merge pull request #39 from devinamatthews/fix_f2c_conflicts
+    
+    Devin's f2c type namespace update.
+    
+    Details:
+    - Added "bla_" prefix to f2c type names to prevent conflicts with external user code.
+    - Removed most of the body of bli_f2c.h, which was unused.
+
+commit 3d0fae810d942085d8f2d389820b4e0027577db8
+Author: Tony Kelman <tony@kelman.net>
+Date:   Thu Feb 25 23:24:03 2016 -0800
+
+    Add symlink from config/bulldozer/kernels to kernels/x86_64/bulldozer
+    
+    to fix linking issue mentioned in #37 and https://groups.google.com/forum/#!topic/blis-devel/iypwljcaeEI
+
+commit 8624a33ccc12dff6f6c4f92992ca5636af1576a6
+Author: Devin Matthews <dmatthews@utexas.edu>
+Date:   Thu Feb 25 13:51:26 2016 -0600
+
+    Fix remaining f2c conflicts.
+
+commit 372eef0b6c0a535bf88d4b46b72f61266e8491ba
+Author: Devin Matthews <dmatthews@utexas.edu>
+Date:   Thu Feb 25 12:01:58 2016 -0600
+
+     Fixed most conflicts after hack-n-slash of bli_f2c.h, cleanup in
+    progress.
+
+commit f86b94f206e2e09fa3221cc55c3dc5b05ca4775a
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Tue Feb 23 18:12:34 2016 -0600
+
+    Included missing blas2blis integer def to CBLAS.
+    
+    Details:
+    - Added #include "bli_config_macro_defs" to all cblas_*.c files in
+      compat/cblas/src. This has the effect of defining
+      BLIS_BLAS2BLIS_INT_TYPE_SIZE to the default value if bli_config.h does
+      not define it. Thanks to Tony Kelman for reporting this bug.
+    - In cblas_i?amax.c, changed the type of the variable 'iamax' from 'int'
+      to 'f77_int'. This eliminates a compiler warning and a potential
+      runtime bug and/or crash when the size of an int differs from the size
+      of f77_int (as determined by BLIS_BLAS2BLIS_INT_TYPE_SIZE).
+
+commit 0b126de1342c11c65623bcb38e258e21e9244e3d
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Fri Nov 13 16:29:12 2015 -0600
+
+    Consolidated packm_blk_var1 and packm_blk_var2.
+    
+    Details:
+    - Consolidated the two blocked variants for packm into a single
+      implementation (packm_blk_var1) and removed the other variant.
+    - Updated all induced method _cntl_init() functions in frame/cntl/ind/
+      to use the new blocked variant 1.
+    - Defined two new macros, bli_is_ind_packed() and bli_is_nat_packed(),
+      to detect pack_t schemas for induced methods and native execution,
+      respectively.
+
+commit 30e5eb29e060b97752f702d2ea5d101d950f53b2
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Fri Nov 13 12:14:19 2015 -0600
+
+    Minor changes to treatment of rs, cs in bli_obj.c.
+    
+    Details:
+    - Applied a patch submitted by Devin Matthews that:
+      - implements subtle changes to handling of somewhat unusual cases of
+        row and column strides to accommodate certain tensor cases, which
+        includes adding dimension parameters to _is_col_tilted() and
+        _is_row_tilted() macros,
+      - simplifies how buffers are sized when requesting BLIS-allocated
+        objects,
+      - re-consolidates bli_adjust_strides_*() into one function, and
+      - defines 'restrict' keyword as a "nothing" macro for C++ and pre-C99
+        environments.
+
+commit f0a4f41b5acf55b41707ec821c4c5f9076dfbc24
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Thu Nov 12 15:22:50 2015 -0600
+
+    Fixed unimplemented case in core2 sgemm ukernel.
+    
+    Details:
+    - Implemented the "beta == 0" case for general stride output for the
+      dunnington sgemm micro-kernel. This case had been, up until now,
+      identical to the "beta != 0" case, which does not work when the
+      output matrix has nan's and inf's. It had manifested as nan residuals
+      in the test suite for right-side tests of ctrsm4m1a. Thanks to Devin
+      Matthews for reporting this bug.
+
+commit 42810bbfa0b8f006ecc5128d903909ec13ea63f9
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Thu Nov 12 12:07:46 2015 -0600
+
+    Fixed minor bugs for uncommon obj_create cases.
+    
+    Details:
+    - Separated bli_adjust_strides() into _alloc() and _attach() flavors so
+      that the latter can avoid a test performed by the former, in which the
+      rs and cs are overridden and set to zero if either matrix dimension is
+      zero. Actually, we also disable this overriding behavior, even for the
+      _alloc() case, since keeping the original strides (probably) does not
+      hurt anything. The original code has been kept commented-out, though,
+      in case an unintended consequence is later discovered.
+    - Fixed a typo in an error check for general stride cases where rs == cs.
+
+commit 3e6dd11467643fbc2cb45c13cec8dd6024232833
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Tue Nov 3 10:30:08 2015 -0600
+
+    Minor re-expression in quadratic partitioning code.
+    
+    Details:
+    - Minor change to quadratic equation solution code that avoids
+      recomputation of the sqrt() parameter when the compiler is not
+      smart enough to perform this optimization automatically.
+
+commit 0694b722f7e4df00efb32639095a2aca80e67f52
+Merge: 3e116f0 33557ec
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Mon Nov 2 17:24:25 2015 -0600
+
+    Merge branch 'master' of github.com:flame/blis
+
+commit 3e116f0a2953f50b3c068759a775ad7ffae04e49
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Mon Nov 2 17:18:23 2015 -0600
+
+    Fixed imaginary bug in quadratic partitioning code.
+    
+    Details:
+    - Fixed a bug in the relatively new quadratic partitioning code that,
+      under the right conditions, would perform sqrt() on a negative value.
+      If the solution is imaginary, we discard it and use an alternate
+      partition width that assumes no diagonal intersection. That alternate
+      width is actually already computed, so, the fix was quite simple.
+      Thanks to Devangi Parikh for reporting this bug.
+
+commit 33557ecccaf49b2569b7f3d7bcea52c2aab94c68
+Author: Jeff Hammond <jeff.science@gmail.com>
+Date:   Mon Nov 2 12:18:43 2015 -0800
+
+    add Travis CI build status icon to the README
+
+commit 4a502fbe77bd0f701108baaa559d9cfb483f88de
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Mon Nov 2 13:28:34 2015 -0600
+
+    Laid groundwork for runtime memory pool resizing.
+    
+    Details:
+    - Changed bli_pool_finalize() so that the freeing begins with the block
+      at top_index instead of block 0. This allows us to use the function
+      for terminal finalization as well as temporary cleanup prior to
+      reinitialization. Also, clear the pool_t struct upon _pool_finalize()
+      in case it is called in the terminal case with some blocks still
+      checked out to threads (in which case the threads will see the new
+      block size as 0 and thus release the block as intended).
+    - Added bli_pool_reinit(), which calls _pool_finalize() followed by
+      _pool_init() with new parameters.
+    - Added bli_mem_reinit(), which is based on bli_pool_reinit().
+    - Added new wrapper, _mem_compute_pool_block_sizes(), which calls
+      _mem_compute_pool_block_sizes_dt().
+    - Updated bli_mem_release() so that the pblk_t is freed, via
+      _pool_free_block(), if the block size recorded in the mem_t at the
+      time the pblk_t was acquired is now different from the value in the
+      pool_t.
+
+commit 37e55ca39bdbddaec03ad30d43e8ad2b3e549c96
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Fri Oct 30 18:25:04 2015 -0500
+
+    Fixed obscure 3m1/4m1a bugs in trmm[3] and trsm.
+    
+    Details:
+    - Fixed a family of bugs in the triangular level-3 operations for
+      certain complex implementations (3m1 and 4m1a) that only manifest if
+      one of the register blocksizes (PACKMR/PACKNR, actually) is odd:
+      - Fixed incorrect imaginary stride computation in bli_packm_blk_var2()
+        for the triangular case.
+      - Fixed the incorrect computation of imaginary stride, as stored in
+        the auxinfo_t struct in trmm and trsm macro-kernels.
+      - Fixed incorrect pointer arithmetic in the trsm macro-kernels in the
+        cases where the register blocksize for the triangular matrix is
+        odd. Introduced a new byte-granular pointer arithmetic macro,
+        bli_ptr_add(), that computes the correct value.
+    - Added cpp macro to bli_macro_defs.h for typeof() operator, defined in
+      terms of __typeof__, which is used by bli_ptr_add() macro.
+    - Disabled the row- vs. column-storage optimization in bli_trmm_front()
+      for singleton problems because the inherent ambiguity of whether a
+      scalar is row-stored or column-stored causes the wrong parameter
+      combination code to be executed (by dumb luck of our checking for
+      row storage first).
+    - Added commented-out debugging lines to 3m1/4m1a and reference
+      micro-kernels, and trsm_ll macro-kernel.
+
+commit 46294d80e5a79c598e200e1c8ec2a642ff839971
+Merge: d3159c5 a0a7b85
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Tue Oct 27 12:41:23 2015 -0500
+
+    Merge pull request #35 from figual/master
+    
+    Fixed incomplete code in the double precision ARMv8 microkernel.
+
+commit a0a7b85ac3e157af53cff8db0e008f4a3f90372c
+Author: Francisco Igual <figual@ucm.es>
+Date:   Tue Oct 27 08:59:15 2015 +0000
+
+    Fixed incomplete code in the double precision ARMv8 microkernel.
+
+commit d3159c5740c9ee7f8c0b661003aab6f00646ad6f
+Merge: b489152 7e03e45
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Wed Oct 21 14:54:00 2015 -0500
+
+    Merge branch 'master' of github.com:flame/blis
+
+commit b489152e112644ec3b6d19e687231a9607f7694f
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Wed Oct 21 14:53:17 2015 -0500
+
+    Use vzeroall in haswell micro-kernels.
+
+commit 7e03e45bfe6c27c4fdbf06b1caa7f49e9a5fef49
+Merge: 77ddb0b 4f88c29
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Wed Oct 14 13:26:07 2015 -0500
+
+    Merge pull request #33 from xianyi/master
+    
+    Enable Travis CI
+
+commit 4f88c29f9e634cbb6fb22d8c88931f0ec78ad7db
+Author: Zhang Xianyi <traits.zhang@gmail.com>
+Date:   Wed Oct 14 12:57:50 2015 -0500
+
+    Detect Intel Broadwell (using Haswell config).
+
+commit 4b0ac1a9984a93f7ad4369b10fca63991107d9f5
+Merge: fe3e355 77ddb0b
+Author: Zhang Xianyi <traits.zhang@gmail.com>
+Date:   Wed Oct 14 12:51:05 2015 -0500
+
+    Merge branch 'upstream_master'
+
+commit 77ddb0b1d31ada111dadf392766ba6d9210ed9fb
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Tue Oct 13 12:53:06 2015 -0500
+
+    Removed flop-counting mechanism.
+    
+    Details:
+    - Removed the optional flop-counting feature introduced in commit
+      7574c994.
+
+commit 276da366187460a4c8e6e0910e79cb39ce780bfe
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Mon Oct 12 11:43:03 2015 -0500
+
+    Minor formatting change to README.md.
+
+commit d17057446f5404824478e8a6cd08f242ab75544a
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Mon Oct 12 11:39:49 2015 -0500
+
+    Added "Getting Started" section to README.md.
+    
+    Details:
+    - Added section to README.md file containing links to wikis with brief
+      descriptions.
+
+commit e7e1f2f7b601b21b50e3cdad8972cb3fe11018d3
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Fri Oct 2 16:51:52 2015 -0500
+
+    Minor updates to CREDITS, README files.
+
+commit 55329906ecd7ce1ab910e4d30a29354a9172e7ea
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Sat Sep 26 20:47:19 2015 -0500
+
+    Minor edits to README.md, testsuite.
+    
+    Details:
+    - Fixed typos in README.md.
+    - Fixed column heading alignment for testsuite when matlab output is
+      enabled.
+    - Minor updates to test/3m4m/runme.sh and test/3m4m/Makefile.
+
+commit bbebdb5793a8fd6aaf257012ab0272beaa04a0de
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Fri Sep 25 14:47:27 2015 -0500
+
+    Replaced README with README.md.
+    
+    Details:
+    - Replaced the old (and short) README file with a much more comprehensive
+      version written in github-flavored markdown. The new file is based on
+      content taken from the old Google Code homepage.
+
+commit e2e9d64a63485461192d9c2a6dd0183a8b71013c
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Thu Sep 24 12:14:03 2015 -0500
+
+    Load balance thread ranges for arbitrary diagonals.
+    
+    Details:
+    - Expanded/updated interface for bli_get_range_weighted() and
+      bli_get_range() so that the direction of movement is specified in the
+      function name (e.g. bli_get_range_l2r(), bli_get_range_weighted_t2b())
+      and also so that the object being partitioned is passed instead of an
+      uplo parameter. Updated invocations in level-3 blocked variants, as
+      appropriate.
+    - (Re)implemented bli_get_range_*() and bli_get_range_weighted_*() to
+      carefully take into account the location of the diagonal when computing
+      ranges so that the area of each subpartition (which, in all present
+      level-3 operations, is proportional to the amount of computation
+      engendered) is as equal as possible.
+    - Added calls to a new class of routines to all non-gemm level-3 blocked
+      variants:
+        bli_<oper>_prune_unref_mparts_[mnk]()
+      where <oper> is herk, trmm, or trsm and [mnk] is chosen based on which
+      dimension is being partitioned. These routines call a more basic
+      routine, bli_prune_unref_mparts(), to prune unreferenced/unstored
+      regions from matrices and simultaneously adjust other matrices which
+      share the same dimension accordingly.
+    - Simplified herk_blk_var2f, trmm_blk_var1f/b as a result of the
+      new pruning routines.
+    - Fixed incorrect blocking factors passed into bli_get_range_*() in
+      bli_trsm_blk_var[12][fb].c
+    - Added a new test driver in test/thread_ranges that can exercise the new
+      bli_get_range_*() and bli_get_range_weighted_*() under a range of
+      conditions.
+    - Reimplemented m and n fields of obj_t as elements in a "dim"
+      array field so that dimensions could be queried via index constant
+      (e.g. BLIS_M, BLIS_N). Adjusted/added query and modification
+      macros accordingly.
+    - Defined mdim_t type to enumerate BLIS_M and BLIS_N indexing values.
+    - Added bli_round() macro, which calls C math library function round(),
+      and bli_round_to_mult(), which rounds a value to the nearest multiple
+      of some other value.
+    - Added miscellaneous pruning- and mdim_t-related macros.
+    - Renamed bli_obj_row_offset(), bli_obj_col_offset() macros to
+      bli_obj_row_off(), bli_obj_col_off().
+
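Two of the smaller items in the entry above, the "dim" array indexed by BLIS_M/BLIS_N and rounding to the nearest multiple, can be sketched in a few lines of C. This is an illustration only, not the BLIS source: `demo_mdim_t`, `demo_obj_t`, `demo_obj_length_along()` and `demo_round_to_mult()` are hypothetical names, and the rounding below uses integer arithmetic (ties round up, unsigned inputs) rather than the C library's round().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an index type such as mdim_t. */
typedef enum { DEMO_M = 0, DEMO_N = 1 } demo_mdim_t;

/* Hypothetical object that stores both dimensions in one array, so that
   code can select a dimension with an index instead of a field name. */
typedef struct { size_t dim[2]; } demo_obj_t;

/* Return the length of the object along the requested dimension. */
static size_t demo_obj_length_along(const demo_obj_t *obj, demo_mdim_t which)
{
    return obj->dim[which];
}

/* Round val to the nearest multiple of mult (ties round up). */
static size_t demo_round_to_mult(size_t val, size_t mult)
{
    assert(mult > 0);
    return ((val + mult / 2) / mult) * mult;
}
```

A partitioning loop written against `demo_obj_length_along(obj, which)` can walk either dimension without duplicating code for m and n. With the helper above, `demo_round_to_mult(13, 8)` gives 16 and `demo_round_to_mult(11, 8)` gives 8.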
+commit fe3e355c9c5a6f65b8736b009e2d501b62a83ea1
+Merge: efa641e 4dd9dd3
+Author: Zhang Xianyi <traits.zhang@gmail.com>
+Date:   Fri Aug 21 14:38:36 2015 -0500
+
+    Merge branch 'upstream_master'
+
+commit efa641e36b73abee34166a252e90e28a6281d92d
+Author: Zhang Xianyi <traits.zhang@gmail.com>
+Date:   Sat Aug 22 03:15:50 2015 +0800
+
+    Try to fix the compiling bug on travis.
+
+commit 4dd9dd3e1de626b51bfe85d9ee65f193d60e8d38
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Fri Aug 21 11:52:37 2015 -0500
+
+    Fixed minor alignment ambiguity bug in bli_pool.c.
+    
+    Details:
+    - Fixed a typecasting ambiguity in bli_pool_alloc_block() in which
+      pointer arithmetic was performed on a void* as if it were a byte
+      pointer (such as char*). Some compilers may have already been
+      interpreting this situation as intended, despite the sloppiness.
+      Thanks to Aleksei Rechinskii for reporting this issue.
+    - Redefined pointer alignment macros to typecast to uintptr_t instead of
+      siz_t.
+
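The two fixes in the entry above concern how an address is manipulated when a block is aligned. A minimal sketch of the idea, assuming a power-of-two alignment; `demo_align_up()` is a hypothetical helper, not the code in bli_pool.c. The address is inspected as a uintptr_t, and the byte offset is applied through an unsigned char* because arithmetic on void* is not defined by standard C (some compilers accept it as an extension).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the first address at or after p that is a
   multiple of align. align is assumed to be a nonzero power of two. */
static void *demo_align_up(void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    /* Inspect the address as an integer wide enough to hold a pointer. */
    uintptr_t addr = (uintptr_t)p;
    uintptr_t rem  = addr & (uintptr_t)(align - 1);

    if (rem == 0) return p;

    /* Do the byte arithmetic on a byte pointer, never on void*. */
    unsigned char *bytes = (unsigned char *)p;
    return (void *)(bytes + (align - rem));
}
```

The caller is expected to have allocated at least `align - 1` extra bytes so that the adjusted pointer still lies inside the block.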
+commit 12ffd568b04feda57147c13b67717416a01c82f8
+Author: Zhang Xianyi <traits.zhang@gmail.com>
+Date:   Sat Aug 22 00:24:28 2015 +0800
+
+    Add Travis CI.
+
+commit ecc3ebb749e0861c27deda52b5f87236ede4901b
+Author: Field G. Van Zee <field@cs.utexas.edu>
+Date:   Wed Jul 29 13:31:12 2015 -0500
+
+    CHANGELOG update (0.1.8)
+
+commit 47caa33485b91ea6f2a5e386e61210c90c5f489f (tag: 0.1.8)
 Author: Field G. Van Zee <field@cs.utexas.edu>
 Date:   Wed Jul 29 13:31:09 2015 -0500

    Version file update (0.1.8)

-commit ef0fbbbdb6148b96938733fce72cb4ed7dad685e (origin/master)
+commit ef0fbbbdb6148b96938733fce72cb4ed7dad685e
 Merge: fdfe14f d4b8913
 Author: Field G. Van Zee <field@cs.utexas.edu>
 Date:   Thu Jul 9 13:54:54 2015 -0500
--- a/65
+++ b/65
@@ -154,22 +154,13 @@ BLIS_DLL_NAME      := $(BLIS_LIB_BASE_NAME).so
 # --- BLIS framework source and object variable names ---

 # These are the makefile variables that source code files will be accumulated
-# into by the makefile fragments. Notice that we include separate variables
-# for regular and "special" source.
+# into by the makefile fragments.
 MK_FRAME_SRC           :=
-MK_FRAME_NOOPT_SRC     :=
-MK_FRAME_KERNELS_SRC   :=
 MK_CONFIG_SRC          :=
-MK_CONFIG_NOOPT_SRC    :=
-MK_CONFIG_KERNELS_SRC  :=

 # These hold object filenames corresponding to above.
 MK_FRAME_OBJS          :=
-MK_FRAME_NOOPT_OBJS    :=
-MK_FRAME_KERNELS_OBJS  :=
 MK_CONFIG_OBJS         :=
-MK_CONFIG_NOOPT_OBJS   :=
-MK_CONFIG_KERNELS_OBJS :=

 # Append the base library path to the library names.
 MK_ALL_BLIS_LIB        := $(BASE_LIB_PATH)/$(BLIS_LIB_NAME)
@@ -309,41 +300,17 @@ CFLAGS_KERNELS := $(CFLAGS_KERNELS) $(VERS_DEF)
 # Convert source file paths to object file paths by replacing the base source
 # directories with the base object directories, and also replacing the source
 # file suffix (eg: '.c') with '.o'.
-MK_BLIS_FRAME_OBJS         := $(patsubst $(FRAME_PATH)/%.c, $(BASE_OBJ_FRAME_PATH)/%.o, \
-                                          $(filter %.c, $(MK_FRAME_SRC)))
-MK_BLIS_FRAME_NOOPT_OBJS   := $(patsubst $(FRAME_PATH)/%.c, $(BASE_OBJ_FRAME_PATH)/%.o, \
-                                          $(filter %.c, $(MK_FRAME_NOOPT_SRC)))
-MK_BLIS_FRAME_KERNELS_OBJS := $(patsubst $(FRAME_PATH)/%.c, $(BASE_OBJ_FRAME_PATH)/%.o, \
-                                          $(filter %.c, $(MK_FRAME_KERNELS_SRC)))
+MK_BLIS_FRAME_OBJS   := $(patsubst $(FRAME_PATH)/%.c, $(BASE_OBJ_FRAME_PATH)/%.o, \
+                                        $(filter %.c, $(MK_FRAME_SRC)))

-MK_BLIS_CONFIG_OBJS          := $(patsubst $(CONFIG_PATH)/%.S, $(BASE_OBJ_CONFIG_PATH)/%.o, \
-                                          $(filter %.S, $(MK_CONFIG_SRC)))
-MK_BLIS_CONFIG_OBJS          += $(patsubst $(CONFIG_PATH)/%.c, $(BASE_OBJ_CONFIG_PATH)/%.o, \
-                                          $(filter %.c, $(MK_CONFIG_SRC)))
-
-MK_BLIS_CONFIG_NOOPT_OBJS    := $(patsubst $(CONFIG_PATH)/%.S, $(BASE_OBJ_CONFIG_PATH)/%.o, \
-                                          $(filter %.S, $(MK_CONFIG_NOOPT_SRC)))
-MK_BLIS_CONFIG_NOOPT_OBJS    += $(patsubst $(CONFIG_PATH)/%.c, $(BASE_OBJ_CONFIG_PATH)/%.o, \
-                                          $(filter %.c, $(MK_CONFIG_NOOPT_SRC)))
-
-MK_BLIS_CONFIG_KERNELS_OBJS  := $(patsubst $(CONFIG_PATH)/%.S, $(BASE_OBJ_CONFIG_PATH)/%.o, \
-                                          $(filter %.S, $(MK_CONFIG_KERNELS_SRC)))
-MK_BLIS_CONFIG_KERNELS_OBJS  += $(patsubst $(CONFIG_PATH)/%.c, $(BASE_OBJ_CONFIG_PATH)/%.o, \
-                                          $(filter %.c, $(MK_CONFIG_KERNELS_SRC)))
+MK_BLIS_CONFIG_OBJS  := $(patsubst $(CONFIG_PATH)/%.S, $(BASE_OBJ_CONFIG_PATH)/%.o, \
+                                         $(filter %.S, $(MK_CONFIG_SRC)))
+MK_BLIS_CONFIG_OBJS  += $(patsubst $(CONFIG_PATH)/%.c, $(BASE_OBJ_CONFIG_PATH)/%.o, \
+                                         $(filter %.c, $(MK_CONFIG_SRC)))

 # Combine all of the object files into some readily-accessible variables.
-MK_ALL_BLIS_OPT_OBJS      := $(MK_BLIS_CONFIG_OBJS) \
-                             $(MK_BLIS_FRAME_OBJS)
-
-MK_ALL_BLIS_NOOPT_OBJS    := $(MK_BLIS_CONFIG_NOOPT_OBJS) \
-                             $(MK_BLIS_FRAME_NOOPT_OBJS)
-
-MK_ALL_BLIS_KERNELS_OBJS  := $(MK_BLIS_CONFIG_KERNELS_OBJS) \
-                             $(MK_BLIS_FRAME_KERNELS_OBJS)
-
-MK_ALL_BLIS_OBJS          := $(MK_ALL_BLIS_OPT_OBJS) \
-                             $(MK_ALL_BLIS_NOOPT_OBJS) \
-                             $(MK_ALL_BLIS_KERNELS_OBJS)
+MK_ALL_BLIS_OBJS     := $(MK_BLIS_CONFIG_OBJS) \
+                        $(MK_BLIS_FRAME_OBJS)



@@ -424,15 +391,15 @@ clean: cleanlib cleantest

 # Define two functions, each of which takes one argument (an object file
 # path). The functions determine which CFLAGS and text string are needed to
-# compile the object file. Note that we match with a preceding forward slash,
-# so the directory name must begin with the special directory name, but it
-# can have trailing characters (e.g. 'kernels_x86').
-get_cflags_for_obj = $(if $(findstring /$(NOOPT_DIR),$1),$(CFLAGS_NOOPT),\
-                     $(if $(findstring /$(KERNELS_DIR),$1),$(CFLAGS_KERNELS),\
+# compile the object file. Note that we match without a preceding forward slash,
+# so the directory name may have 'kernels' as a substring (e.g. 'ukernels' or
+# 'kernels_opt').
+get_cflags_for_obj = $(if $(findstring $(NOOPT_DIR),$1),$(CFLAGS_NOOPT),\
+                     $(if $(findstring $(KERNELS_DIR),$1),$(CFLAGS_KERNELS),\
                     $(CFLAGS)))

-get_ctext_for_obj = $(if $(findstring /$(NOOPT_DIR),$1),$(NOOPT_TEXT),\
-                    $(if $(findstring /$(KERNELS_DIR),$1),$(KERNELS_TEXT),))
+get_ctext_for_obj = $(if $(findstring $(NOOPT_DIR),$1),$(NOOPT_TEXT),\
+                    $(if $(findstring $(KERNELS_DIR),$1),$(KERNELS_TEXT),))

 $(BASE_OBJ_FRAME_PATH)/%.o: $(FRAME_PATH)/%.c $(MK_HEADER_FILES) $(MAKE_DEFS_MK_PATH)
 ifeq ($(BLIS_ENABLE_VERBOSE_MAKE_OUTPUT),yes)
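The change to get_cflags_for_obj and get_ctext_for_obj above drops the leading slash from the search pattern, so a special directory name now matches anywhere in the object path rather than only at the start of a path component. The difference can be illustrated outside of make with a small C sketch in which strstr() stands in for $(findstring ...). The function name, the example paths, and the literal names "noopt" and "kernels" are assumptions for illustration; the real values of NOOPT_DIR and KERNELS_DIR are set elsewhere in the build system.

```c
#include <stdbool.h>
#include <string.h>

typedef enum {
    DEMO_FLAGS_DEFAULT,
    DEMO_FLAGS_NOOPT,
    DEMO_FLAGS_KERNELS
} demo_flags_t;

/* Mirror the nested $(if $(findstring ...)) selection: check the "noopt"
   name first, then the "kernels" name, else fall back to the default.
   When with_slash is true, the name must directly follow a '/' (the old
   behavior); otherwise any occurrence of the name counts (the new one). */
static demo_flags_t demo_select_flags(const char *obj_path, bool with_slash)
{
    const char *noopt   = with_slash ? "/noopt"   : "noopt";
    const char *kernels = with_slash ? "/kernels" : "kernels";

    if (strstr(obj_path, noopt)   != NULL) return DEMO_FLAGS_NOOPT;
    if (strstr(obj_path, kernels) != NULL) return DEMO_FLAGS_KERNELS;
    return DEMO_FLAGS_DEFAULT;
}
```

Under the old pattern a path such as `obj/cfg/ukernels/a.o` selects the default flags, while under the new pattern it selects the kernel flags. The looser match also means that any unrelated path containing the same substring would be treated as special.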
--- a/build/gen-make-frags/gen-make-frag.sh
+++ b/build/gen-make-frags/gen-make-frag.sh
@@ -254,7 +254,9 @@ gen_mkfiles()
 	
 	
 	# Append a relevant suffix to the makefile variable name, if necessary
-	all_add_src_var_name "$cur_dir"
+	# NOTE: This step is disabled because special directories are presently
+	# ignored when generating makefile variable names.
+	#all_add_src_var_name "$cur_dir"
 	
 	
 	# Be verbose if level 2 was requested
@@ -286,7 +288,9 @@ gen_mkfiles()
 	
 	
 	# Remove a relevant suffix from the makefile variable name, if necessary
-	all_del_src_var_name "$cur_dir"
+	# NOTE: This step is disabled because special directories are presently
+	# ignored when generating makefile variable names.
+	#all_del_src_var_name "$cur_dir"
 	
 	
 	# Return peacefully
@@ -295,42 +299,44 @@ gen_mkfiles()



-update_src_var_name_special()
-{
-	local dir act i name var_suffix
-	
-	# Extract arguments.
-	act="$1"
-	dir="$2"
-	
-	# Strip / from end of directory path, if there is one, and then strip
-	# path from directory name.
-	dir=${dir%/}
-	dir=${dir##*/}
-	
-	# Run through our list.
-	for specdir in "${special_dirs}"; do
-		
-		# If the current item matches sdir, then we'll have
-		# to make a modification of some form.
-		if [ "$dir" = "$specdir" ]; then
-			
-			# Convert the directory name to uppercase.
-			var_suffix=$(echo "$dir" | tr '[:lower:]' '[:upper:]')
-			
-			# Either add or remove the suffix, and also update the
-			# source file suffix variable.
-			if [ "$act" == "+" ]; then
-				src_var_name=${src_var_name}_$var_suffix
-			else
-				src_var_name=${src_var_name%_$var_suffix}
-			fi
-			
-			# No need to continue iterating.
-			break;
-		fi
-	done
-}
+#update_src_var_name_special()
+#{
+#	local dir act i name var_suffix
+#	
+#	# Extract arguments.
+#	act="$1"
+#	dir="$2"
+#	
+#	# Strip / from end of directory path, if there is one, and then strip
+#	# path from directory name.
+#	dir=${dir%/}
+#	dir=${dir##*/}
+#	
+#	# Run through our list.
+#	# NOTE: CURRENTLY, SPECIAL DIRECTORY NAMES ARE IGNORED. In order to
+#	#       re-enable them, remove the quotes from "${special_dirs}".
+#	for specdir in "${special_dirs}"; do
+#		
+#		# If the current item matches sdir, then we'll have
+#		# to make a modification of some form.
+#		if [ "$dir" = "$specdir" ]; then
+#			
+#			# Convert the directory name to uppercase.
+#			var_suffix=$(echo "$dir" | tr '[:lower:]' '[:upper:]')
+#			
+#			# Either add or remove the suffix, and also update the
+#			# source file suffix variable.
+#			if [ "$act" == "+" ]; then
+#				src_var_name=${src_var_name}_$var_suffix
+#			else
+#				src_var_name=${src_var_name%_$var_suffix}
+#			fi
+#			
+#			# No need to continue iterating.
+#			break;
+#		fi
+#	done
+#}

 #init_src_var_name()
 #{
@@ -351,20 +357,20 @@ update_src_var_name_special()
 #	done
 #}

-all_add_src_var_name()
-{
-	local dir="$1"
-	
-	update_src_var_name_special "+" "$dir"
+#all_add_src_var_name()
+#{
+#	local dir="$1"
+#	
+#	update_src_var_name_special "+" "$dir"
+#
+#}

-}
-
-all_del_src_var_name()
-{
-	local dir="$1"
-	
-	update_src_var_name_special "-" "$dir"
-}
+#all_del_src_var_name()
+#{
+#	local dir="$1"
+#	
+#	update_src_var_name_special "-" "$dir"
+#}

 read_mkfile_config()
 {
--- a/common.mk
+++ b/common.mk
@@ -161,7 +161,7 @@ LDFLAGS      += -fopenmp
 endif
 ifeq ($(THREADING_MODEL),pthreads)
 CTHREADFLAGS := -pthread -DBLIS_ENABLE_PTHREADS
-LDFLAGS      += -pthread
+LDFLAGS      += -lpthread
 endif
 endif

@@ -175,7 +175,7 @@ LDFLAGS      += -openmp
 endif
 ifeq ($(THREADING_MODEL),pthreads)
 CTHREADFLAGS := -pthread -DBLIS_ENABLE_PTHREADS
-LDFLAGS      += -pthread
+LDFLAGS      += -lpthread
 endif
 endif

@@ -188,7 +188,7 @@ $(error OpenMP is not supported with Clang.)
 endif
 ifeq ($(THREADING_MODEL),pthreads)
 CTHREADFLAGS := -pthread -DBLIS_ENABLE_PTHREADS
-LDFLAGS      += -pthread
+LDFLAGS      += -lpthread
 endif
 endif

--- a/config/bgq/bli_kernel.h
+++ b/config/bgq/bli_kernel.h
@@ -144,25 +144,7 @@

 // -- Default fusing factors for level-1f operations --

-#define BLIS_L1F_FUSE_FAC_S        8
-#define BLIS_L1F_FUSE_FAC_D        8
-#define BLIS_L1F_FUSE_FAC_C        4
-#define BLIS_L1F_FUSE_FAC_Z        2
-
-#define BLIS_AXPYF_FUSE_FAC_S          BLIS_L1F_FUSE_FAC_S
-#define BLIS_AXPYF_FUSE_FAC_D          BLIS_L1F_FUSE_FAC_D
-#define BLIS_AXPYF_FUSE_FAC_C          BLIS_L1F_FUSE_FAC_C
-#define BLIS_AXPYF_FUSE_FAC_Z          BLIS_L1F_FUSE_FAC_Z
-
-#define BLIS_DOTXF_FUSE_FAC_S          BLIS_L1F_FUSE_FAC_S
-#define BLIS_DOTXF_FUSE_FAC_D          BLIS_L1F_FUSE_FAC_D
-#define BLIS_DOTXF_FUSE_FAC_C          BLIS_L1F_FUSE_FAC_C
-#define BLIS_DOTXF_FUSE_FAC_Z          BLIS_L1F_FUSE_FAC_Z
-
-#define BLIS_DOTXAXPYF_FUSE_FAC_S      BLIS_L1F_FUSE_FAC_S
-#define BLIS_DOTXAXPYF_FUSE_FAC_D      BLIS_L1F_FUSE_FAC_D
-#define BLIS_DOTXAXPYF_FUSE_FAC_C      BLIS_L1F_FUSE_FAC_C
-#define BLIS_DOTXAXPYF_FUSE_FAC_Z      BLIS_L1F_FUSE_FAC_Z
+#define BLIS_DEFAULT_AF_D          8



@@ -171,10 +153,8 @@

 // -- gemm --

-#include "bli_gemm_8x8.h"
-
-#define BLIS_DGEMM_UKERNEL         bli_dgemm_8x8
-#define BLIS_ZGEMM_UKERNEL         bli_zgemm_8x8
+#define BLIS_DGEMM_UKERNEL         bli_dgemm_int_8x8
+#define BLIS_ZGEMM_UKERNEL         bli_zgemm_int_8x8

 // -- trsm-related --

--- a/config/bulldozer/bli_kernel.h
+++ b/config/bulldozer/bli_kernel.h
@@ -51,87 +51,6 @@
 //     (b) MR (for zero-padding purposes when MR and NR are "swapped")
 //

-// #define BLIS_DEFAULT_MC_S              128
-// #define BLIS_DEFAULT_KC_S              384
-// #define BLIS_DEFAULT_NC_S              4096
-
-#define BLIS_DEFAULT_MC_D              1080
-#define BLIS_DEFAULT_KC_D              120
-#define BLIS_DEFAULT_NC_D              8400
-
-// #define BLIS_DEFAULT_MC_C              128
-// #define BLIS_DEFAULT_KC_C              256
-// #define BLIS_DEFAULT_NC_C              4096
-// 
-// #define BLIS_DEFAULT_MC_Z              64
-// #define BLIS_DEFAULT_KC_Z              256
-// #define BLIS_DEFAULT_NC_Z              2048
-
-// -- Register blocksizes --
-
-// #define BLIS_DEFAULT_MR_S              8
-// #define BLIS_DEFAULT_NR_S              8
-
-#define BLIS_DEFAULT_MR_D              4
-#define BLIS_DEFAULT_NR_D              6
-
-	// #define BLIS_DEFAULT_MR_C              8
-	// #define BLIS_DEFAULT_NR_C              4
-	// 
-	// #define BLIS_DEFAULT_MR_Z              8
-	// #define BLIS_DEFAULT_NR_Z              4
-
-// NOTE: If the micro-kernel, which is typically unrolled to a factor
-// of f, handles leftover edge cases (ie: when k % f > 0) then these
-// register blocksizes in the k dimension can be defined to 1.
-
-//#define BLIS_DEFAULT_KR_S              1
-//#define BLIS_DEFAULT_KR_D              1
-//#define BLIS_DEFAULT_KR_C              1
-//#define BLIS_DEFAULT_KR_Z              1
-
-// -- Maximum cache blocksizes (for optimizing edge cases) --
-
-// NOTE: These cache blocksize "extensions" have the same constraints as
-// the corresponding default blocksizes above. When these values are
-// larger than the default blocksizes, blocksizes used at edge cases are
-// enlarged if such an extension would encompass the remaining portion of
-// the matrix dimension.
-
-//#define BLIS_MAXIMUM_MC_S              (BLIS_DEFAULT_MC_S + BLIS_DEFAULT_MC_S/4)
-//#define BLIS_MAXIMUM_KC_S              (BLIS_DEFAULT_KC_S + BLIS_DEFAULT_KC_S/4)
-//#define BLIS_MAXIMUM_NC_S              (BLIS_DEFAULT_NC_S + BLIS_DEFAULT_NC_S/4)
-
-//#define BLIS_MAXIMUM_MC_D              (BLIS_DEFAULT_MC_D + BLIS_DEFAULT_MC_D/4)
-//#define BLIS_MAXIMUM_KC_D              (BLIS_DEFAULT_KC_D + BLIS_DEFAULT_KC_D/4)
-//#define BLIS_MAXIMUM_NC_D              (BLIS_DEFAULT_NC_D + BLIS_DEFAULT_NC_D/4)
-
-//#define BLIS_MAXIMUM_MC_C              (BLIS_DEFAULT_MC_C + BLIS_DEFAULT_MC_C/4)
-//#define BLIS_MAXIMUM_KC_C              (BLIS_DEFAULT_KC_C + BLIS_DEFAULT_KC_C/4)
-//#define BLIS_MAXIMUM_NC_C              (BLIS_DEFAULT_NC_C + BLIS_DEFAULT_NC_C/4)
-
-//#define BLIS_MAXIMUM_MC_Z              (BLIS_DEFAULT_MC_Z + BLIS_DEFAULT_MC_Z/4)
-//#define BLIS_MAXIMUM_KC_Z              (BLIS_DEFAULT_KC_Z + BLIS_DEFAULT_KC_Z/4)
-//#define BLIS_MAXIMUM_NC_Z              (BLIS_DEFAULT_NC_Z + BLIS_DEFAULT_NC_Z/4)
-
-// -- Packing register blocksize (for packed micro-panels) --
-
-// NOTE: These register blocksize "extensions" determine whether the
-// leading dimensions used within the packed micro-panels are equal to
-// or greater than their corresponding register blocksizes above.
-
-//#define BLIS_PACKDIM_MR_S              (BLIS_DEFAULT_MR_S + ...)
-//#define BLIS_PACKDIM_NR_S              (BLIS_DEFAULT_NR_S + ...)
-
-//#define BLIS_PACKDIM_MR_D              (BLIS_DEFAULT_MR_D + ...)
-//#define BLIS_PACKDIM_NR_D              (BLIS_DEFAULT_NR_D + ...)
-
-//#define BLIS_PACKDIM_MR_C              (BLIS_DEFAULT_MR_C + ...)
-//#define BLIS_PACKDIM_NR_C              (BLIS_DEFAULT_NR_C + ...)
-
-//#define BLIS_PACKDIM_MR_Z              (BLIS_DEFAULT_MR_Z + ...)
-//#define BLIS_PACKDIM_NR_Z              (BLIS_DEFAULT_NR_Z + ...)
-



@@ -149,23 +68,28 @@

 // -- gemm --

-#define BLIS_SGEMM_UKERNEL         bli_sgemm_8x8_FMA4
+#define BLIS_SGEMM_UKERNEL         bli_sgemm_asm_8x8_fma4
 #define BLIS_DEFAULT_MC_S          128
 #define BLIS_DEFAULT_KC_S          384
 #define BLIS_DEFAULT_NC_S          4096
 #define BLIS_DEFAULT_MR_S          8
 #define BLIS_DEFAULT_NR_S          8

-#define BLIS_DGEMM_UKERNEL         bli_dgemm_4x6_FMA4
+#define BLIS_DGEMM_UKERNEL         bli_dgemm_asm_4x6_fma4
+#define BLIS_DEFAULT_MC_D          1080
+#define BLIS_DEFAULT_KC_D          120
+#define BLIS_DEFAULT_NC_D          8400
+#define BLIS_DEFAULT_MR_D          4
+#define BLIS_DEFAULT_NR_D          6

-#define BLIS_CGEMM_UKERNEL         bli_cgemm_8x4_FMA4
+#define BLIS_CGEMM_UKERNEL         bli_cgemm_asm_8x4_fma4
 #define BLIS_DEFAULT_MC_C          96
 #define BLIS_DEFAULT_KC_C          256
 #define BLIS_DEFAULT_NC_C          4096
 #define BLIS_DEFAULT_MR_C          8
 #define BLIS_DEFAULT_NR_C          4

-#define BLIS_ZGEMM_UKERNEL         bli_zgemm_4x4_FMA4
+#define BLIS_ZGEMM_UKERNEL         bli_zgemm_asm_4x4_fma4
 #define BLIS_DEFAULT_MC_Z          64 
 #define BLIS_DEFAULT_KC_Z          192
 #define BLIS_DEFAULT_NC_Z          4096
--- a/config/carrizo/bli_kernel.h
+++ b/config/carrizo/bli_kernel.h
@@ -51,28 +51,28 @@
 //     (b) MR (for zero-padding purposes when MR and NR are "swapped")
 //

-#define BLIS_SGEMM_UKERNEL             bli_sgemm_new_16x3
+#define BLIS_SGEMM_UKERNEL             bli_sgemm_asm_16x3
 #define BLIS_DEFAULT_MC_S              528
 #define BLIS_DEFAULT_KC_S              256
 #define BLIS_DEFAULT_NC_S              8400
 #define BLIS_DEFAULT_MR_S              16
 #define BLIS_DEFAULT_NR_S              3

-#define BLIS_DGEMM_UKERNEL             bli_dgemm_new_8x3
+#define BLIS_DGEMM_UKERNEL             bli_dgemm_asm_8x3
 #define BLIS_DEFAULT_MC_D              264
 #define BLIS_DEFAULT_KC_D              256
 #define BLIS_DEFAULT_NC_D              8400
 #define BLIS_DEFAULT_MR_D              8
 #define BLIS_DEFAULT_NR_D              3

-#define BLIS_CGEMM_UKERNEL             bli_cgemm_new_4x2
+#define BLIS_CGEMM_UKERNEL             bli_cgemm_asm_4x2
 #define BLIS_DEFAULT_MC_C              264
 #define BLIS_DEFAULT_KC_C              256
 #define BLIS_DEFAULT_NC_C              8400
 #define BLIS_DEFAULT_MR_C              4
 #define BLIS_DEFAULT_NR_C              2

-#define BLIS_ZGEMM_UKERNEL             bli_zgemm_new_2x2
+#define BLIS_ZGEMM_UKERNEL             bli_zgemm_asm_2x2
 #define BLIS_DEFAULT_MC_Z              100
 #define BLIS_DEFAULT_KC_Z              320
 #define BLIS_DEFAULT_NC_Z              8400
--- a/config/cortex-a15/kernels
+++ b/config/cortex-a15/kernels
@@ -1 +1 @@
-../../kernels/arm/neon
+../../kernels/arm
--- a/config/cortex-a9/kernels
+++ b/config/cortex-a9/kernels
@@ -1 +1 @@
-../../kernels/arm/neon
+../../kernels/arm
--- a/config/dunnington/bli_kernel.h
+++ b/config/dunnington/bli_kernel.h
@@ -67,26 +67,6 @@
 //#define BLIS_DEFAULT_KC_Z              384
 //#define BLIS_DEFAULT_NC_Z              4096

-// NOTE: If 4m blocksizes are not defined here, they will be determined
-// from the corresponding real domain blocksizes.
-#define BLIS_DEFAULT_4M_MC_C           384
-#define BLIS_DEFAULT_4M_KC_C           512
-#define BLIS_DEFAULT_4M_NC_C           4096
-
-#define BLIS_DEFAULT_4M_MC_Z           192
-#define BLIS_DEFAULT_4M_KC_Z           256
-#define BLIS_DEFAULT_4M_NC_Z           4096
-
-// NOTE: If 3m blocksizes are not defined here, they will be determined
-// from the corresponding real domain blocksizes.
-#define BLIS_DEFAULT_3M_MC_C           384
-#define BLIS_DEFAULT_3M_KC_C           512
-#define BLIS_DEFAULT_3M_NC_C           4096
-
-#define BLIS_DEFAULT_3M_MC_Z           192
-#define BLIS_DEFAULT_3M_KC_Z           256
-#define BLIS_DEFAULT_3M_NC_Z           4096
-
 // -- Register blocksizes --

 #define BLIS_DEFAULT_MR_S              8
@@ -101,56 +81,6 @@
 #define BLIS_DEFAULT_MR_Z              2
 #define BLIS_DEFAULT_NR_Z              2

-// NOTE: If the micro-kernel, which is typically unrolled to a factor
-// of f, handles leftover edge cases (ie: when k % f > 0) then these
-// register blocksizes in the k dimension can be defined to 1.
-
-//#define BLIS_DEFAULT_KR_S              1
-//#define BLIS_DEFAULT_KR_D              1
-//#define BLIS_DEFAULT_KR_C              1
-//#define BLIS_DEFAULT_KR_Z              1
-
-// -- Maximum cache blocksizes (for optimizing edge cases) --
-
-// NOTE: These cache blocksize "extensions" have the same constraints as
-// the corresponding default blocksizes above. When these values are
-// larger than the default blocksizes, blocksizes used at edge cases are
-// enlarged if such an extension would encompass the remaining portion of
-// the matrix dimension.
-
-//#define BLIS_MAXIMUM_MC_S              (BLIS_DEFAULT_MC_S + BLIS_DEFAULT_MC_S/4)
-//#define BLIS_MAXIMUM_KC_S              (BLIS_DEFAULT_KC_S + BLIS_DEFAULT_KC_S/4)
-//#define BLIS_MAXIMUM_NC_S              (BLIS_DEFAULT_NC_S + BLIS_DEFAULT_NC_S/4)
-
-//#define BLIS_MAXIMUM_MC_D              (BLIS_DEFAULT_MC_D + BLIS_DEFAULT_MC_D/4)
-//#define BLIS_MAXIMUM_KC_D              (BLIS_DEFAULT_KC_D + BLIS_DEFAULT_KC_D/4)
-//#define BLIS_MAXIMUM_NC_D              (BLIS_DEFAULT_NC_D + BLIS_DEFAULT_NC_D/4)
-
-//#define BLIS_MAXIMUM_MC_C              (BLIS_DEFAULT_MC_C + BLIS_DEFAULT_MC_C/4)
-//#define BLIS_MAXIMUM_KC_C              (BLIS_DEFAULT_KC_C + BLIS_DEFAULT_KC_C/4)
-//#define BLIS_MAXIMUM_NC_C              (BLIS_DEFAULT_NC_C + BLIS_DEFAULT_NC_C/4)
-
-//#define BLIS_MAXIMUM_MC_Z              (BLIS_DEFAULT_MC_Z + BLIS_DEFAULT_MC_Z/4)
-//#define BLIS_MAXIMUM_KC_Z              (BLIS_DEFAULT_KC_Z + BLIS_DEFAULT_KC_Z/4)
-//#define BLIS_MAXIMUM_NC_Z              (BLIS_DEFAULT_NC_Z + BLIS_DEFAULT_NC_Z/4)
-
-// -- Packing register blocksize (for packed micro-panels) --
-
-// NOTE: These register blocksize "extensions" determine whether the
-// leading dimensions used within the packed micro-panels are equal to
-// or greater than their corresponding register blocksizes above.
-
-//#define BLIS_PACKDIM_MR_S              (BLIS_DEFAULT_MR_S + ...)
-//#define BLIS_PACKDIM_NR_S              (BLIS_DEFAULT_NR_S + ...)
-
-//#define BLIS_PACKDIM_MR_D              (BLIS_DEFAULT_MR_D + ...)
-//#define BLIS_PACKDIM_NR_D              (BLIS_DEFAULT_NR_D + ...)
-
-//#define BLIS_PACKDIM_MR_C              (BLIS_DEFAULT_MR_C + ...)
-//#define BLIS_PACKDIM_NR_C              (BLIS_DEFAULT_NR_C + ...)
-
-//#define BLIS_PACKDIM_MR_Z              (BLIS_DEFAULT_MR_Z + ...)
-//#define BLIS_PACKDIM_NR_Z              (BLIS_DEFAULT_NR_Z + ...)



@@ -169,13 +99,13 @@

 // -- gemm --

-#define BLIS_SGEMM_UKERNEL    bli_sgemm_opt_8x4
-#define BLIS_DGEMM_UKERNEL    bli_dgemm_opt_4x4
+#define BLIS_SGEMM_UKERNEL    bli_sgemm_asm_8x4
+#define BLIS_DGEMM_UKERNEL    bli_dgemm_asm_4x4

 // -- trsm-related --

-#define BLIS_DGEMMTRSM_L_UKERNEL   bli_dgemmtrsm_l_opt_4x4
-#define BLIS_DGEMMTRSM_U_UKERNEL   bli_dgemmtrsm_u_opt_4x4
+#define BLIS_DGEMMTRSM_L_UKERNEL   bli_dgemmtrsm_l_asm_4x4
+#define BLIS_DGEMMTRSM_U_UKERNEL   bli_dgemmtrsm_u_asm_4x4



@@ -184,23 +114,23 @@

 // -- axpy2v --

-#define BLIS_DAXPY2V_KERNEL     bli_daxpy2v_opt_var1
+#define BLIS_DAXPY2V_KERNEL     bli_daxpy2v_int_var1

 // -- dotaxpyv --

-#define BLIS_DDOTAXPYV_KERNEL   bli_ddotaxpyv_opt_var1
+#define BLIS_DDOTAXPYV_KERNEL   bli_ddotaxpyv_int_var1

 // -- axpyf --

-#define BLIS_DAXPYF_KERNEL      bli_daxpyf_opt_var1
+#define BLIS_DAXPYF_KERNEL      bli_daxpyf_int_var1

 // -- dotxf --

-#define BLIS_DDOTXF_KERNEL      bli_ddotxf_opt_var1
+#define BLIS_DDOTXF_KERNEL      bli_ddotxf_int_var1

 // -- dotxaxpyf --

-#define BLIS_DDOTXAXPYF_KERNEL  bli_ddotxaxpyf_opt_var1
+#define BLIS_DDOTXAXPYF_KERNEL  bli_ddotxaxpyf_int_var1



--- a/config/dunnington/kernels
+++ b/config/dunnington/kernels
@@ -1 +1 @@
-../../kernels/x86_64/core2-sse3
+../../kernels/x86_64/penryn
--- a/config/haswell/bli_kernel.h
+++ b/config/haswell/bli_kernel.h
@@ -89,21 +89,6 @@

 #endif

-/*
-#define BLIS_CGEMM_UKERNEL         bli_cgemm_asm_8x4
-#define BLIS_DEFAULT_MC_C          96
-#define BLIS_DEFAULT_KC_C          256
-#define BLIS_DEFAULT_NC_C          4096
-#define BLIS_DEFAULT_MR_C          8
-#define BLIS_DEFAULT_NR_C          4
-
-#define BLIS_ZGEMM_UKERNEL         bli_zgemm_asm_4x4
-#define BLIS_DEFAULT_MC_Z          64 
-#define BLIS_DEFAULT_KC_Z          192
-#define BLIS_DEFAULT_NC_Z          4096
-#define BLIS_DEFAULT_MR_Z          4
-#define BLIS_DEFAULT_NR_Z          4
-*/



--- a/config/haswell/kernels
+++ b/config/haswell/kernels
@@ -1 +1 @@
-../../kernels/x86_64/avx2
+../../kernels/x86_64/haswell
--- a/config/loongson3a/bli_kernel.h
+++ b/config/loongson3a/bli_kernel.h
@@ -149,7 +149,7 @@

 // -- gemm --

-#define BLIS_DGEMM_UKERNEL         bli_dgemm_opt_d4x4
+#define BLIS_DGEMM_UKERNEL         bli_dgemm_opt_4x4

 // -- trsm-related --

--- a/config/mic/bli_config.h
+++ b/config/mic/bli_config.h
@@ -42,6 +42,9 @@

 #define BLIS_SIMD_ALIGN_SIZE             32

+#define BLIS_SIMD_SIZE                   64
+#define BLIS_SIMD_NUM_REGISTERS          32
+


 #endif
--- a/config/mic/bli_kernel.h
+++ b/config/mic/bli_kernel.h
@@ -153,8 +153,8 @@

 #define BLIS_DGEMM_UKERNEL_PREFERS_CONTIG_ROWS

-#define BLIS_DGEMM_UKERNEL         bli_dgemm_opt_30x8
-#define BLIS_SGEMM_UKERNEL         bli_sgemm_opt_30x16
+#define BLIS_SGEMM_UKERNEL         bli_sgemm_asm_30x16
+#define BLIS_DGEMM_UKERNEL         bli_dgemm_asm_30x8

 // -- trsm-related --

--- a/config/piledriver/bli_kernel.h
+++ b/config/piledriver/bli_kernel.h
@@ -51,7 +51,7 @@
 //     (b) MR (for zero-padding purposes when MR and NR are "swapped")
 //

-#define BLIS_SGEMM_UKERNEL         bli_sgemm_new_16x3
+#define BLIS_SGEMM_UKERNEL         bli_sgemm_asm_16x3
 #define BLIS_DEFAULT_MC_S              2016
 #define BLIS_DEFAULT_KC_S              128
 #define BLIS_DEFAULT_NC_S              8400
@@ -59,7 +59,7 @@
 #define BLIS_DEFAULT_NR_S              3
 //#define BLIS_UPANEL_B_ALIGN_SIZE_S     4096

-#define BLIS_DGEMM_UKERNEL         bli_dgemm_new_8x3
+#define BLIS_DGEMM_UKERNEL         bli_dgemm_asm_8x3
 //#define BLIS_DEFAULT_MC_D              768
 //#define BLIS_DEFAULT_KC_D              168
 #define BLIS_DEFAULT_MC_D              1008
@@ -69,14 +69,14 @@
 #define BLIS_DEFAULT_NR_D              3
 //#define BLIS_UPANEL_B_ALIGN_SIZE_D     4096

-#define BLIS_CGEMM_UKERNEL         bli_cgemm_new_4x2
+#define BLIS_CGEMM_UKERNEL         bli_cgemm_asm_4x2
 #define BLIS_DEFAULT_MC_C              512
 #define BLIS_DEFAULT_KC_C              256
 #define BLIS_DEFAULT_NC_C              8400
 #define BLIS_DEFAULT_MR_C              4
 #define BLIS_DEFAULT_NR_C              2

-#define BLIS_ZGEMM_UKERNEL         bli_zgemm_new_2x2
+#define BLIS_ZGEMM_UKERNEL         bli_zgemm_asm_2x2
 #define BLIS_DEFAULT_MC_Z              400
 #define BLIS_DEFAULT_KC_Z              160
 #define BLIS_DEFAULT_NC_Z              8400
--- a/config/sandybridge/kernels
+++ b/config/sandybridge/kernels
@@ -1 +1 @@
-../../kernels/x86_64/avx
+../../kernels/x86_64/sandybridge
--- a/config/template/bli_kernel.h
+++ b/config/template/bli_kernel.h
@@ -177,17 +177,17 @@
 // be packed here, but this tends to be much too expensive in practice to
 // actually employ.)

-//#define BLIS_DEFAULT_L2_MC_S           1000
-//#define BLIS_DEFAULT_L2_NC_S           1000
+//#define BLIS_DEFAULT_M2_S           1000
+//#define BLIS_DEFAULT_N2_S           1000

-//#define BLIS_DEFAULT_L2_MC_D           1000
-//#define BLIS_DEFAULT_L2_NC_D           1000
+//#define BLIS_DEFAULT_M2_D           1000
+//#define BLIS_DEFAULT_N2_D           1000

-//#define BLIS_DEFAULT_L2_MC_C           1000
-//#define BLIS_DEFAULT_L2_NC_C           1000
+//#define BLIS_DEFAULT_M2_C           1000
+//#define BLIS_DEFAULT_N2_C           1000

-//#define BLIS_DEFAULT_L2_MC_Z           1000
-//#define BLIS_DEFAULT_L2_NC_Z           1000
+//#define BLIS_DEFAULT_M2_Z           1000
+//#define BLIS_DEFAULT_N2_Z           1000



@@ -196,25 +196,25 @@

 // -- Default fusing factors for level-1f operations --

-//#define BLIS_L1F_FUSE_FAC_S            8
-//#define BLIS_L1F_FUSE_FAC_D            4
-//#define BLIS_L1F_FUSE_FAC_C            4
-//#define BLIS_L1F_FUSE_FAC_Z            2
+//#define BLIS_DEFAULT_1F_S            8
+//#define BLIS_DEFAULT_1F_D            4
+//#define BLIS_DEFAULT_1F_C            4
+//#define BLIS_DEFAULT_1F_Z            2

-//#define BLIS_AXPYF_FUSE_FAC_S          BLIS_L1F_FUSE_FAC_S
-//#define BLIS_AXPYF_FUSE_FAC_D          BLIS_L1F_FUSE_FAC_D
-//#define BLIS_AXPYF_FUSE_FAC_C          BLIS_L1F_FUSE_FAC_C
-//#define BLIS_AXPYF_FUSE_FAC_Z          BLIS_L1F_FUSE_FAC_Z
+//#define BLIS_DEFAULT_AF_S          BLIS_DEFAULT_1F_S
+//#define BLIS_DEFAULT_AF_D          BLIS_DEFAULT_1F_D
+//#define BLIS_DEFAULT_AF_C          BLIS_DEFAULT_1F_C
+//#define BLIS_DEFAULT_AF_Z          BLIS_DEFAULT_1F_Z

-//#define BLIS_DOTXF_FUSE_FAC_S          BLIS_L1F_FUSE_FAC_S
-//#define BLIS_DOTXF_FUSE_FAC_D          BLIS_L1F_FUSE_FAC_D
-//#define BLIS_DOTXF_FUSE_FAC_C          BLIS_L1F_FUSE_FAC_C
-//#define BLIS_DOTXF_FUSE_FAC_Z          BLIS_L1F_FUSE_FAC_Z
+//#define BLIS_DEFAULT_DF_S          BLIS_DEFAULT_1F_S
+//#define BLIS_DEFAULT_DF_D          BLIS_DEFAULT_1F_D
+//#define BLIS_DEFAULT_DF_C          BLIS_DEFAULT_1F_C
+//#define BLIS_DEFAULT_DF_Z          BLIS_DEFAULT_1F_Z

-//#define BLIS_DOTXAXPYF_FUSE_FAC_S      BLIS_L1F_FUSE_FAC_S
-//#define BLIS_DOTXAXPYF_FUSE_FAC_D      BLIS_L1F_FUSE_FAC_D
-//#define BLIS_DOTXAXPYF_FUSE_FAC_C      BLIS_L1F_FUSE_FAC_C
-//#define BLIS_DOTXAXPYF_FUSE_FAC_Z      BLIS_L1F_FUSE_FAC_Z
+//#define BLIS_DEFAULT_XF_S      BLIS_DEFAULT_1F_S
+//#define BLIS_DEFAULT_XF_D      BLIS_DEFAULT_1F_D
+//#define BLIS_DEFAULT_XF_C      BLIS_DEFAULT_1F_C
+//#define BLIS_DEFAULT_XF_Z      BLIS_DEFAULT_1F_Z



--- a/config/template/kernels/1/bli_axpyv_opt_var1.c
+++ b/config/template/kernels/1/bli_axpyv_opt_var1.c
@@ -36,59 +36,87 @@



-void bli_saxpyv_opt_var1( conj_t             conjx,
-                          dim_t              n,
-                          float*    restrict alpha,
-                          float*    restrict x, inc_t incx,
-                          float*    restrict y, inc_t incy )
+void bli_saxpyv_opt_var1
+     (
+       conj_t    conjx,
+       dim_t     n,
+       float*    alpha,
+       float*    x, inc_t incx,
+       float*    y, inc_t incy,
+       cntx_t*   cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_SAXPYV_KERNEL_REF( conjx,
-	                        n,
-	                        alpha,
-	                        x, incx,
-	                        y, incy );
+	BLIS_SAXPYV_KERNEL_REF
+	(
+	  conjx,
+	  n,
+	  alpha,
+	  x, incx,
+	  y, incy,
+	  cntx
+	);
 }



-void bli_daxpyv_opt_var1( conj_t             conjx,
-                          dim_t              n,
-                          double*   restrict alpha,
-                          double*   restrict x, inc_t incx,
-                          double*   restrict y, inc_t incy )
+void bli_daxpyv_opt_var1
+     (
+       conj_t    conjx,
+       dim_t     n,
+       double*   alpha,
+       double*   x, inc_t incx,
+       double*   y, inc_t incy,
+       cntx_t*   cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_DAXPYV_KERNEL_REF( conjx,
-	                        n,
-	                        alpha,
-	                        x, incx,
-	                        y, incy );
+	BLIS_DAXPYV_KERNEL_REF
+	(
+	  conjx,
+	  n,
+	  alpha,
+	  x, incx,
+	  y, incy,
+	  cntx
+	);
 }



-void bli_caxpyv_opt_var1( conj_t             conjx,
-                          dim_t              n,
-                          scomplex* restrict alpha,
-                          scomplex* restrict x, inc_t incx,
-                          scomplex* restrict y, inc_t incy )
+void bli_caxpyv_opt_var1
+     (
+       conj_t    conjx,
+       dim_t     n,
+       scomplex* alpha,
+       scomplex* x, inc_t incx,
+       scomplex* y, inc_t incy,
+       cntx_t*   cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_CAXPYV_KERNEL_REF( conjx,
-	                        n,
-	                        alpha,
-	                        x, incx,
-	                        y, incy );
+	BLIS_CAXPYV_KERNEL_REF
+	(
+	  conjx,
+	  n,
+	  alpha,
+	  x, incx,
+	  y, incy,
+	  cntx
+	);
 }



-void bli_zaxpyv_opt_var1( conj_t             conjx,
-                          dim_t              n,
-                          dcomplex* restrict alpha,
-                          dcomplex* restrict x, inc_t incx,
-                          dcomplex* restrict y, inc_t incy )
+void bli_zaxpyv_opt_var1
+     (
+       conj_t    conjx,
+       dim_t     n,
+       dcomplex* alpha,
+       dcomplex* x, inc_t incx,
+       dcomplex* y, inc_t incy,
+       cntx_t*   cntx
+     )
 {
 /*
  Template axpyv kernel implementation
@@ -193,11 +221,15 @@ void bli_zaxpyv_opt_var1( conj_t             conjx,
 	// Call the reference implementation if needed.
 	if ( use_ref == TRUE )
 	{
-		BLIS_ZAXPYV_KERNEL_REF( conjx,
-		                        n,
-		                        alpha,
-		                        x, incx,
-		                        y, incy );
+		BLIS_ZAXPYV_KERNEL_REF
+		(
+		  conjx,
+		  n,
+		  alpha,
+		  x, incx,
+		  y, incy,
+		  cntx
+		);
        return;
 	}

@@ -219,7 +251,7 @@ void bli_zaxpyv_opt_var1( conj_t             conjx,
 		// Compute front edge cases if x and y were unaligned.
 		for ( i = 0; i < n_pre; ++i )
 		{
-			bli_zzzaxpys( *alpha, *xp, *yp );
+			bli_zaxpys( *alpha, *xp, *yp );

 			xp += 1; yp += 1;
 		}
@@ -228,7 +260,7 @@ void bli_zaxpyv_opt_var1( conj_t             conjx,
 		// yp are guaranteed to be aligned to BLIS_SIMD_ALIGN_SIZE.
 		for ( i = 0; i < n_iter; ++i )
 		{
-			bli_zzzaxpys( *alpha, *xp, *yp );
+			bli_zaxpys( *alpha, *xp, *yp );

 			xp += n_elem_per_iter;
 			yp += n_elem_per_iter;
@@ -237,7 +269,7 @@ void bli_zaxpyv_opt_var1( conj_t             conjx,
 		// Compute tail edge cases, if applicable.
 		for ( i = 0; i < n_left; ++i )
 		{
-			bli_zzzaxpys( *alpha, *xp, *yp );
+			bli_zaxpys( *alpha, *xp, *yp );

 			xp += 1; yp += 1;
 		}
@@ -247,7 +279,7 @@ void bli_zaxpyv_opt_var1( conj_t             conjx,
 		// Compute front edge cases if x and y were unaligned.
 		for ( i = 0; i < n_pre; ++i )
 		{
-			bli_zzzaxpyjs( *alpha, *xp, *yp );
+			bli_zaxpyjs( *alpha, *xp, *yp );

 			xp += 1; yp += 1;
 		}
@@ -256,7 +288,7 @@ void bli_zaxpyv_opt_var1( conj_t             conjx,
 		// yp are guaranteed to be aligned to BLIS_SIMD_ALIGN_SIZE.
 		for ( i = 0; i < n_iter; ++i )
 		{
-			bli_zzzaxpyjs( *alpha, *xp, *yp );
+			bli_zaxpyjs( *alpha, *xp, *yp );

 			xp += n_elem_per_iter;
 			yp += n_elem_per_iter;
@@ -265,7 +297,7 @@ void bli_zaxpyv_opt_var1( conj_t             conjx,
 		// Compute tail edge cases, if applicable.
 		for ( i = 0; i < n_left; ++i )
 		{
-			bli_zzzaxpyjs( *alpha, *xp, *yp );
+			bli_zaxpyjs( *alpha, *xp, *yp );

 			xp += 1; yp += 1;
 		}
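Note on the axpyv hunks above: the zaxpyv template splits the vector into a front edge (n_pre), an aligned main loop (n_iter steps of n_elem_per_iter elements), and a tail (n_left), and now takes a context pointer as its last argument. The following is a minimal, self-contained sketch of that loop structure for real doubles. It is not BLIS code: sketch_cntx_t, sketch_daxpyv, and sketch_daxpyv_ref are hypothetical names, the context fields are assumptions made for illustration, and conjugation is omitted because the data is real.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a runtime context (NOT the BLIS cntx_t). */
typedef struct
{
    size_t    simd_align;      /* assumed alignment in bytes, e.g. 32 */
    ptrdiff_t elems_per_iter;  /* elements handled per main-loop step */
} sketch_cntx_t;

/* Reference path: plain loop that accepts any stride. */
static void sketch_daxpyv_ref( ptrdiff_t n, const double* alpha,
                               const double* x, ptrdiff_t incx,
                               double* y, ptrdiff_t incy,
                               const sketch_cntx_t* cntx )
{
    ptrdiff_t i;
    (void)cntx; /* unused here; passed along to mirror the new signature */
    for ( i = 0; i < n; ++i ) y[ i * incy ] += *alpha * x[ i * incx ];
}

/* Sketch of y := y + alpha * x with front edge, main loop, and tail. */
void sketch_daxpyv( ptrdiff_t n, const double* alpha,
                    const double* x, ptrdiff_t incx,
                    double* y, ptrdiff_t incy,
                    const sketch_cntx_t* cntx )
{
    size_t    off_x, off_y;
    ptrdiff_t n_pre = 0, n_iter, n_left, per, i, k;

    assert( cntx != NULL && cntx->simd_align > 0 && cntx->elems_per_iter > 0 );
    if ( n <= 0 ) return;

    off_x = ( size_t )( ( uintptr_t )x % cntx->simd_align );
    off_y = ( size_t )( ( uintptr_t )y % cntx->simd_align );

    /* Use the reference path for non-unit strides or mismatched offsets. */
    if ( incx != 1 || incy != 1 || off_x != off_y ||
         off_x % sizeof( double ) != 0 )
    {
        sketch_daxpyv_ref( n, alpha, x, incx, y, incy, cntx );
        return;
    }

    /* Elements to process before x and y reach an aligned address. */
    if ( off_x != 0 )
        n_pre = ( ptrdiff_t )( ( cntx->simd_align - off_x ) / sizeof( double ) );
    if ( n_pre > n ) n_pre = n;

    per    = cntx->elems_per_iter;
    n_iter = ( n - n_pre ) / per;
    n_left = ( n - n_pre ) % per;

    /* Front edge. */
    for ( i = 0; i < n_pre; ++i ) { *y += *alpha * *x; x += 1; y += 1; }

    /* Main loop: x and y are aligned here; a real kernel would use SIMD. */
    for ( i = 0; i < n_iter; ++i )
    {
        for ( k = 0; k < per; ++k ) y[ k ] += *alpha * x[ k ];
        x += per; y += per;
    }

    /* Tail. */
    for ( i = 0; i < n_left; ++i ) { *y += *alpha * *x; x += 1; y += 1; }
}
```

The three phases partition the index range, so the result does not depend on where the buffers happen to be aligned; only the amount of work done in the main loop changes.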
--- a/config/template/kernels/1/bli_dotv_opt_var1.c
+++ b/config/template/kernels/1/bli_dotv_opt_var1.c
@@ -36,66 +36,94 @@



-void bli_sdotv_opt_var1( conj_t             conjx,
-                         conj_t             conjy,
-                         dim_t              n,
-                         float*    restrict x, inc_t incx,
-                         float*    restrict y, inc_t incy,
-                         float*    restrict rho )
+void bli_sdotv_opt_var1
+     (
+       conj_t    conjx,
+       conj_t    conjy,
+       dim_t     n,
+       float*    x, inc_t incx,
+       float*    y, inc_t incy,
+       float*    rho,
+       cntx_t*   cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_SDOTV_KERNEL_REF( conjx,
-	                       conjy,
-	                       n,
-	                       x, incx,
-	                       y, incy,
-	                       rho );
+	BLIS_SDOTV_KERNEL_REF
+	(
+	  conjx,
+	  conjy,
+	  n,
+	  x, incx,
+	  y, incy,
+	  rho,
+	  cntx
+	);
 }



-void bli_ddotv_opt_var1( conj_t             conjx,
-                         conj_t             conjy,
-                         dim_t              n,
-                         double*   restrict x, inc_t incx,
-                         double*   restrict y, inc_t incy,
-                         double*   restrict rho )
+void bli_ddotv_opt_var1
+     (
+       conj_t    conjx,
+       conj_t    conjy,
+       dim_t     n,
+       double*   x, inc_t incx,
+       double*   y, inc_t incy,
+       double*   rho,
+       cntx_t*   cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_DDOTV_KERNEL_REF( conjx,
-	                       conjy,
-	                       n,
-	                       x, incx,
-	                       y, incy,
-	                       rho );
+	BLIS_DDOTV_KERNEL_REF
+	(
+	  conjx,
+	  conjy,
+	  n,
+	  x, incx,
+	  y, incy,
+	  rho,
+	  cntx
+	);
 }



-void bli_cdotv_opt_var1( conj_t             conjx,
-                         conj_t             conjy,
-                         dim_t              n,
-                         scomplex* restrict x, inc_t incx,
-                         scomplex* restrict y, inc_t incy,
-                         scomplex* restrict rho )
+void bli_cdotv_opt_var1
+     (
+       conj_t    conjx,
+       conj_t    conjy,
+       dim_t     n,
+       scomplex* x, inc_t incx,
+       scomplex* y, inc_t incy,
+       scomplex* rho,
+       cntx_t*   cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_CDOTV_KERNEL_REF( conjx,
-	                       conjy,
-	                       n,
-	                       x, incx,
-	                       y, incy,
-	                       rho );
+	BLIS_CDOTV_KERNEL_REF
+	(
+	  conjx,
+	  conjy,
+	  n,
+	  x, incx,
+	  y, incy,
+	  rho,
+	  cntx
+	);
 }



-void bli_zdotv_opt_var1( conj_t             conjx,
-                         conj_t             conjy,
-                         dim_t              n,
-                         dcomplex* restrict x, inc_t incx,
-                         dcomplex* restrict y, inc_t incy,
-                         dcomplex* restrict rho )
+void bli_zdotv_opt_var1
+     (
+       conj_t    conjx,
+       conj_t    conjy,
+       dim_t     n,
+       dcomplex* x, inc_t incx,
+       dcomplex* y, inc_t incy,
+       dcomplex* rho,
+       cntx_t*   cntx
+     )
 {
 /*
  Template dotv kernel implementation
@@ -210,12 +238,16 @@ void bli_zdotv_opt_var1( conj_t             conjx,
 	// Call the reference implementation if needed.
 	if ( use_ref == TRUE )
 	{
-		BLIS_ZDOTV_KERNEL_REF( conjx,
-		                       conjy,
-		                       n,
-		                       x, incx,
-		                       y, incy,
-		                       rho );
+		BLIS_ZDOTV_KERNEL_REF
+		(
+		  conjx,
+		  conjy,
+		  n,
+		  x, incx,
+		  y, incy,
+		  rho,
+		  cntx
+		);
        return;
 	}

@@ -250,7 +282,7 @@ void bli_zdotv_opt_var1( conj_t             conjx,
 		// Compute front edge cases if x and y were unaligned.
 		for ( i = 0; i < n_pre; ++i )
 		{
-			bli_zzzdots( *xp, *yp, dotxy );
+			bli_zdots( *xp, *yp, dotxy );

 			xp += 1; yp += 1;
 		}
@@ -259,7 +291,7 @@ void bli_zdotv_opt_var1( conj_t             conjx,
 		// yp are guaranteed to be aligned to BLIS_SIMD_ALIGN_SIZE.
 		for ( i = 0; i < n_iter; ++i )
 		{
-			bli_zzzdots( *xp, *yp, dotxy );
+			bli_zdots( *xp, *yp, dotxy );

 			xp += n_elem_per_iter;
 			yp += n_elem_per_iter;
@@ -268,7 +300,7 @@ void bli_zdotv_opt_var1( conj_t             conjx,
 		// Compute tail edge cases, if applicable.
 		for ( i = 0; i < n_left; ++i )
 		{
-			bli_zzzdots( *xp, *yp, dotxy );
+			bli_zdots( *xp, *yp, dotxy );

 			xp += 1; yp += 1;
 		}
@@ -278,7 +310,7 @@ void bli_zdotv_opt_var1( conj_t             conjx,
 		// Compute front edge cases if x and y were unaligned.
 		for ( i = 0; i < n_pre; ++i )
 		{
-			bli_zzzdotjs( *xp, *yp, dotxy );
+			bli_zdotjs( *xp, *yp, dotxy );

 			xp += 1; yp += 1;
 		}
@@ -287,7 +319,7 @@ void bli_zdotv_opt_var1( conj_t             conjx,
 		// yp are guaranteed to be aligned to BLIS_SIMD_ALIGN_SIZE.
 		for ( i = 0; i < n_iter; ++i )
 		{
-			bli_zzzdotjs( *xp, *yp, dotxy );
+			bli_zdotjs( *xp, *yp, dotxy );

 			xp += n_elem_per_iter;
 			yp += n_elem_per_iter;
@@ -296,7 +328,7 @@ void bli_zdotv_opt_var1( conj_t             conjx,
 		// Compute tail edge cases, if applicable.
 		for ( i = 0; i < n_left; ++i )
 		{
-			bli_zzzdotjs( *xp, *yp, dotxy );
+			bli_zdotjs( *xp, *yp, dotxy );

 			xp += 1; yp += 1;
 		}
@@ -307,6 +339,6 @@ void bli_zdotv_opt_var1( conj_t             conjx,
 	if ( bli_is_conj( conjy ) )
 		bli_zconjs( dotxy );

-	bli_zzcopys( dotxy, *rho );
+	bli_zcopys( dotxy, *rho );
 }

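Note on the dotv hunks above: the zdotv template accumulates into a local dotxy, conjugates it when conjy is set, and then copies it into rho. One way to arrive at that shape, sketched below under stated assumptions, is to fold conjy into the conjugation applied to x, since sum( cx(x_i) * conj(y_i) ) equals conj( sum( conj(cx(x_i)) * y_i ) ). The names sketch_dcomplex and sketch_zdotv are hypothetical, the struct layout is an assumption, and the sign multiplier replaces the separate conjugating and non-conjugating loops of the template.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical complex type for illustration (not taken from the diff). */
typedef struct { double real; double imag; } sketch_dcomplex;

/* Sketch of rho := conjx(x)^T conjy(y). */
void sketch_zdotv( bool conjx, bool conjy, ptrdiff_t n,
                   const sketch_dcomplex* x,
                   const sketch_dcomplex* y,
                   sketch_dcomplex* rho )
{
    /* Fold conjy into the conjugation applied to x. */
    bool            conjx_use = ( conjx != conjy );
    double          sx        = conjx_use ? -1.0 : 1.0;
    sketch_dcomplex dotxy     = { 0.0, 0.0 };
    ptrdiff_t       i;

    for ( i = 0; i < n; ++i )
    {
        double xr = x[ i ].real;
        double xi = sx * x[ i ].imag; /* optionally conjugated x element */

        dotxy.real += xr * y[ i ].real - xi * y[ i ].imag;
        dotxy.imag += xr * y[ i ].imag + xi * y[ i ].real;
    }

    /* Undo the fold: conjugate the accumulated result if y was conjugated. */
    if ( conjy ) dotxy.imag = -dotxy.imag;

    *rho = dotxy;
}
```

With this arrangement the inner loop only ever conjugates one operand, which keeps the number of loop variants at two instead of four.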
--- a/config/template/kernels/1f/bli_axpy2v_opt_var1.c
+++ b/config/template/kernels/1f/bli_axpy2v_opt_var1.c
@@ -36,88 +36,108 @@



-void bli_saxpy2v_opt_var1(
-                           conj_t             conjx,
-                           conj_t             conjy,
-                           dim_t              n,
-                           float*    restrict alpha1,
-                           float*    restrict alpha2,
-                           float*    restrict x, inc_t incx,
-                           float*    restrict y, inc_t incy,
-                           float*    restrict z, inc_t incz
-                         )
+void bli_saxpy2v_opt_var1
+     (
+       conj_t    conjx,
+       conj_t    conjy,
+       dim_t     n,
+       float*    alpha1,
+       float*    alpha2,
+       float*    x, inc_t incx,
+       float*    y, inc_t incy,
+       float*    z, inc_t incz,
+       cntx_t*   cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_SAXPY2V_KERNEL_REF( conjx,
-	                         conjy,
-	                         n,
-	                         alpha1,
-	                         alpha2,
-	                         x, incx,
-	                         y, incy,
-	                         z, incz );
+	BLIS_SAXPY2V_KERNEL_REF
+	(
+	  conjx,
+	  conjy,
+	  n,
+	  alpha1,
+	  alpha2,
+	  x, incx,
+	  y, incy,
+	  z, incz,
+	  cntx
+	);
 }



-void bli_daxpy2v_opt_var1(
-                           conj_t             conjx,
-                           conj_t             conjy,
-                           dim_t              n,
-                           double*   restrict alpha1,
-                           double*   restrict alpha2,
-                           double*   restrict x, inc_t incx,
-                           double*   restrict y, inc_t incy,
-                           double*   restrict z, inc_t incz
-                         )
+void bli_daxpy2v_opt_var1
+     (
+       conj_t    conjx,
+       conj_t    conjy,
+       dim_t     n,
+       double*   alpha1,
+       double*   alpha2,
+       double*   x, inc_t incx,
+       double*   y, inc_t incy,
+       double*   z, inc_t incz,
+       cntx_t*   cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_DAXPY2V_KERNEL_REF( conjx,
-	                         conjy,
-	                         n,
-	                         alpha1,
-	                         alpha2,
-	                         x, incx,
-	                         y, incy,
-	                         z, incz );
+	BLIS_DAXPY2V_KERNEL_REF
+	(
+	  conjx,
+	  conjy,
+	  n,
+	  alpha1,
+	  alpha2,
+	  x, incx,
+	  y, incy,
+	  z, incz,
+	  cntx
+	);
 }



-void bli_caxpy2v_opt_var1(
-                           conj_t             conjx,
-                           conj_t             conjy,
-                           dim_t              n,
-                           scomplex* restrict alpha1,
-                           scomplex* restrict alpha2,
-                           scomplex* restrict x, inc_t incx,
-                           scomplex* restrict y, inc_t incy,
-                           scomplex* restrict z, inc_t incz
-                         )
+void bli_caxpy2v_opt_var1
+     (
+       conj_t    conjx,
+       conj_t    conjy,
+       dim_t     n,
+       scomplex* alpha1,
+       scomplex* alpha2,
+       scomplex* x, inc_t incx,
+       scomplex* y, inc_t incy,
+       scomplex* z, inc_t incz,
+       cntx_t*   cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_CAXPY2V_KERNEL_REF( conjx,
-	                         conjy,
-	                         n,
-	                         alpha1,
-	                         alpha2,
-	                         x, incx,
-	                         y, incy,
-	                         z, incz );
+	BLIS_CAXPY2V_KERNEL_REF
+	(
+	  conjx,
+	  conjy,
+	  n,
+	  alpha1,
+	  alpha2,
+	  x, incx,
+	  y, incy,
+	  z, incz,
+	  cntx
+	);
 }



-void bli_zaxpy2v_opt_var1(
-                           conj_t             conjx,
-                           conj_t             conjy,
-                           dim_t              n,
-                           dcomplex* restrict alpha1,
-                           dcomplex* restrict alpha2,
-                           dcomplex* restrict x, inc_t incx,
-                           dcomplex* restrict y, inc_t incy,
-                           dcomplex* restrict z, inc_t incz
-                         )
+void bli_zaxpy2v_opt_var1
+     (
+       conj_t    conjx,
+       conj_t    conjy,
+       dim_t     n,
+       dcomplex* alpha1,
+       dcomplex* alpha2,
+       dcomplex* x, inc_t incx,
+       dcomplex* y, inc_t incy,
+       dcomplex* z, inc_t incz,
+       cntx_t*   cntx
+     )
 {
 /*
  Template axpy2v kernel implementation
@@ -229,14 +249,18 @@ void bli_zaxpy2v_opt_var1(
 	// Call the reference implementation if needed.
 	if ( use_ref == TRUE )
 	{
-		BLIS_ZAXPY2V_KERNEL_REF( conjx,
-		                         conjy,
-		                         n,
-		                         alpha1,
-		                         alpha2,
-		                         x, incx,
-		                         y, incy,
-		                         z, incz );
+		BLIS_ZAXPY2V_KERNEL_REF
+		(
+		  conjx,
+		  conjy,
+		  n,
+		  alpha1,
+		  alpha2,
+		  x, incx,
+		  y, incy,
+		  z, incz,
+		  cntx
+		);
        return;
 	}

@@ -259,8 +283,8 @@ void bli_zaxpy2v_opt_var1(
 		// Compute front edge cases if x, y, and z were unaligned.
 		for ( i = 0; i < n_pre; ++i )
 		{
-			bli_zzzaxpys( *alpha1, *xp, *zp );
-			bli_zzzaxpys( *alpha2, *yp, *zp );
+			bli_zaxpys( *alpha1, *xp, *zp );
+			bli_zaxpys( *alpha2, *yp, *zp );

 			xp += 1; yp += 1; zp += 1;
 		}
@@ -272,8 +296,8 @@ void bli_zaxpy2v_opt_var1(
 		// to BLIS_SIMD_ALIGN_SIZE.
 		for ( i = 0; i < n_iter; ++i )
 		{
-			bli_zzzaxpys( *alpha1, *xp, *zp );
-			bli_zzzaxpys( *alpha2, *yp, *zp );
+			bli_zaxpys( *alpha1, *xp, *zp );
+			bli_zaxpys( *alpha2, *yp, *zp );

 			xp += n_elem_per_iter;
 			yp += n_elem_per_iter;
@@ -283,8 +307,8 @@ void bli_zaxpy2v_opt_var1(
 		// Compute tail edge cases, if applicable.
 		for ( i = 0; i < n_left; ++i )
 		{
-			bli_zzzaxpys( *alpha1, *xp, *zp );
-			bli_zzzaxpys( *alpha2, *yp, *zp );
+			bli_zaxpys( *alpha1, *xp, *zp );
+			bli_zaxpys( *alpha2, *yp, *zp );

 			xp += 1; yp += 1; zp += 1;
 		}
@@ -294,8 +318,8 @@ void bli_zaxpy2v_opt_var1(
 		// Compute front edge cases if x, y, and z were unaligned.
 		for ( i = 0; i < n_pre; ++i )
 		{
-			bli_zzzaxpys(  *alpha1, *xp, *zp );
-			bli_zzzaxpyjs( *alpha2, *yp, *zp );
+			bli_zaxpys(  *alpha1, *xp, *zp );
+			bli_zaxpyjs( *alpha2, *yp, *zp );

 			xp += 1; yp += 1; zp += 1;
 		}
@@ -307,8 +331,8 @@ void bli_zaxpy2v_opt_var1(
 		// to BLIS_SIMD_ALIGN_SIZE.
 		for ( i = 0; i < n_iter; ++i )
 		{
-			bli_zzzaxpys(  *alpha1, *xp, *zp );
-			bli_zzzaxpyjs( *alpha2, *yp, *zp );
+			bli_zaxpys(  *alpha1, *xp, *zp );
+			bli_zaxpyjs( *alpha2, *yp, *zp );

 			xp += n_elem_per_iter;
 			yp += n_elem_per_iter;
@@ -318,8 +342,8 @@ void bli_zaxpy2v_opt_var1(
 		// Compute tail edge cases, if applicable.
 		for ( i = 0; i < n_left; ++i )
 		{
-			bli_zzzaxpys(  *alpha1, *xp, *zp );
-			bli_zzzaxpyjs( *alpha2, *yp, *zp );
+			bli_zaxpys(  *alpha1, *xp, *zp );
+			bli_zaxpyjs( *alpha2, *yp, *zp );

 			xp += 1; yp += 1; zp += 1;
 		}
@@ -329,8 +353,8 @@ void bli_zaxpy2v_opt_var1(
 		// Compute front edge cases if x, y, and z were unaligned.
 		for ( i = 0; i < n_pre; ++i )
 		{
-			bli_zzzaxpyjs( *alpha1, *xp, *zp );
-			bli_zzzaxpys(  *alpha2, *yp, *zp );
+			bli_zaxpyjs( *alpha1, *xp, *zp );
+			bli_zaxpys(  *alpha2, *yp, *zp );

 			xp += 1; yp += 1; zp += 1;
 		}
@@ -342,8 +366,8 @@ void bli_zaxpy2v_opt_var1(
 		// to BLIS_SIMD_ALIGN_SIZE.
 		for ( i = 0; i < n_iter; ++i )
 		{
-			bli_zzzaxpyjs( *alpha1, *xp, *zp );
-			bli_zzzaxpys(  *alpha2, *yp, *zp );
+			bli_zaxpyjs( *alpha1, *xp, *zp );
+			bli_zaxpys(  *alpha2, *yp, *zp );

 			xp += n_elem_per_iter;
 			yp += n_elem_per_iter;
@@ -353,8 +377,8 @@ void bli_zaxpy2v_opt_var1(
 		// Compute tail edge cases, if applicable.
 		for ( i = 0; i < n_left; ++i )
 		{
-			bli_zzzaxpyjs( *alpha1, *xp, *zp );
-			bli_zzzaxpys(  *alpha2, *yp, *zp );
+			bli_zaxpyjs( *alpha1, *xp, *zp );
+			bli_zaxpys(  *alpha2, *yp, *zp );

 			xp += 1; yp += 1; zp += 1;
 		}
@@ -364,8 +388,8 @@ void bli_zaxpy2v_opt_var1(
 		// Compute front edge cases if x, y, and z were unaligned.
 		for ( i = 0; i < n_pre; ++i )
 		{
-			bli_zzzaxpyjs( *alpha1, *xp, *zp );
-			bli_zzzaxpyjs( *alpha2, *yp, *zp );
+			bli_zaxpyjs( *alpha1, *xp, *zp );
+			bli_zaxpyjs( *alpha2, *yp, *zp );

 			xp += 1; yp += 1; zp += 1;
 		}
@@ -377,8 +401,8 @@ void bli_zaxpy2v_opt_var1(
 		// to BLIS_SIMD_ALIGN_SIZE.
 		for ( i = 0; i < n_iter; ++i )
 		{
-			bli_zzzaxpyjs( *alpha1, *xp, *zp );
-			bli_zzzaxpyjs( *alpha2, *yp, *zp );
+			bli_zaxpyjs( *alpha1, *xp, *zp );
+			bli_zaxpyjs( *alpha2, *yp, *zp );

 			xp += n_elem_per_iter;
 			yp += n_elem_per_iter;
@@ -388,8 +412,8 @@ void bli_zaxpy2v_opt_var1(
 		// Compute tail edge cases, if applicable.
 		for ( i = 0; i < n_left; ++i )
 		{
-			bli_zzzaxpyjs( *alpha1, *xp, *zp );
-			bli_zzzaxpyjs( *alpha2, *yp, *zp );
+			bli_zaxpyjs( *alpha1, *xp, *zp );
+			bli_zaxpyjs( *alpha2, *yp, *zp );

 			xp += 1; yp += 1; zp += 1;
 		}
--- a/config/template/kernels/1f/bli_axpyf_opt_var1.c
+++ b/config/template/kernels/1f/bli_axpyf_opt_var1.c
@@ -36,87 +36,107 @@



-void bli_saxpyf_opt_var1(
-                          conj_t             conja,
-                          conj_t             conjx,
-                          dim_t              m,
-                          dim_t              b_n,
-                          float*    restrict alpha,
-                          float*    restrict a, inc_t inca, inc_t lda,
-                          float*    restrict x, inc_t incx,
-                          float*    restrict y, inc_t incy
-                        )
+void bli_saxpyf_opt_var1
+     (
+       conj_t    conja,
+       conj_t    conjx,
+       dim_t     m,
+       dim_t     b_n,
+       float*    alpha,
+       float*    a, inc_t inca, inc_t lda,
+       float*    x, inc_t incx,
+       float*    y, inc_t incy,
+       cntx_t*   cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_SAXPYF_KERNEL_REF( conja,
-	                        conjx,
-	                        m,
-	                        b_n,
-	                        alpha,
-	                        a, inca, lda,
-	                        x, incx,
-	                        y, incy );
+	BLIS_SAXPYF_KERNEL_REF
+	(
+	  conja,
+	  conjx,
+	  m,
+	  b_n,
+	  alpha,
+	  a, inca, lda,
+	  x, incx,
+	  y, incy,
+	  cntx
+	);
 }



-void bli_daxpyf_opt_var1(
-                          conj_t             conja,
-                          conj_t             conjx,
-                          dim_t              m,
-                          dim_t              b_n,
-                          double*   restrict alpha,
-                          double*   restrict a, inc_t inca, inc_t lda,
-                          double*   restrict x, inc_t incx,
-                          double*   restrict y, inc_t incy
-                        )
+void bli_daxpyf_opt_var1
+     (
+       conj_t    conja,
+       conj_t    conjx,
+       dim_t     m,
+       dim_t     b_n,
+       double*   alpha,
+       double*   a, inc_t inca, inc_t lda,
+       double*   x, inc_t incx,
+       double*   y, inc_t incy,
+       cntx_t*   cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_DAXPYF_KERNEL_REF( conja,
-	                        conjx,
-	                        m,
-	                        b_n,
-	                        alpha,
-	                        a, inca, lda,
-	                        x, incx,
-	                        y, incy );
+	BLIS_DAXPYF_KERNEL_REF
+	(
+	  conja,
+	  conjx,
+	  m,
+	  b_n,
+	  alpha,
+	  a, inca, lda,
+	  x, incx,
+	  y, incy,
+	  cntx
+	);
 }



-void bli_caxpyf_opt_var1(
-                          conj_t             conja,
-                          conj_t             conjx,
-                          dim_t              m,
-                          dim_t              b_n,
-                          scomplex* restrict alpha,
-                          scomplex* restrict a, inc_t inca, inc_t lda,
-                          scomplex* restrict x, inc_t incx,
-                          scomplex* restrict y, inc_t incy
-                        )
+void bli_caxpyf_opt_var1
+     (
+       conj_t    conja,
+       conj_t    conjx,
+       dim_t     m,
+       dim_t     b_n,
+       scomplex* alpha,
+       scomplex* a, inc_t inca, inc_t lda,
+       scomplex* x, inc_t incx,
+       scomplex* y, inc_t incy,
+       cntx_t*   cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_CAXPYF_KERNEL_REF( conja,
-	                        conjx,
-	                        m,
-	                        b_n,
-	                        alpha,
-	                        a, inca, lda,
-	                        x, incx,
-	                        y, incy );
+	BLIS_CAXPYF_KERNEL_REF
+	(
+	  conja,
+	  conjx,
+	  m,
+	  b_n,
+	  alpha,
+	  a, inca, lda,
+	  x, incx,
+	  y, incy,
+	  cntx
+	);
 }


-void bli_zaxpyf_opt_var1(
-                          conj_t             conja,
-                          conj_t             conjx,
-                          dim_t              m,
-                          dim_t              b_n,
-                          dcomplex* restrict alpha,
-                          dcomplex* restrict a, inc_t inca, inc_t lda,
-                          dcomplex* restrict x, inc_t incx,
-                          dcomplex* restrict y, inc_t incy
-                        )
+void bli_zaxpyf_opt_var1
+     (
+       conj_t    conja,
+       conj_t    conjx,
+       dim_t     m,
+       dim_t     b_n,
+       dcomplex* alpha,
+       dcomplex* a, inc_t inca, inc_t lda,
+       dcomplex* x, inc_t incx,
+       dcomplex* y, inc_t incy,
+       cntx_t*   cntx
+     )
 {
 /*
  Template axpyf kernel implementation
@@ -243,14 +263,18 @@ void bli_zaxpyf_opt_var1(
 	// Call the reference implementation if needed.
 	if ( use_ref == TRUE )
 	{
-		BLIS_ZAXPYF_KERNEL_REF( conja,
-		                        conjx,
-		                        m,
-		                        b_n,
-		                        alpha,
-		                        a, inca, lda,
-		                        x, incx,
-		                        y, incy );
+		BLIS_ZAXPYF_KERNEL_REF
+		(
+		  conja,
+		  conjx,
+		  m,
+		  b_n,
+		  alpha,
+		  a, inca, lda,
+		  x, incx,
+		  y, incy,
+		  cntx
+		);
        return;
 	}

@@ -274,16 +298,16 @@ void bli_zaxpyf_opt_var1(
 	{
 		for ( j = 0; j < b_n; ++j )
 		{
-			bli_zzcopys( *xp[ j ], alpha_x[ j ] );
-			bli_zzscals( *alpha, alpha_x[ j ] );
+			bli_zcopys( *xp[ j ], alpha_x[ j ] );
+			bli_zscals( *alpha, alpha_x[ j ] );
 		}
 	}
 	else // if ( bli_is_conj( conjx ) )
 	{
 		for ( j = 0; j < b_n; ++j )
 		{
-			bli_zzcopyjs( *xp[ j ], alpha_x[ j ] );
-			bli_zzscals( *alpha, alpha_x[ j ] );
+			bli_zcopyjs( *xp[ j ], alpha_x[ j ] );
+			bli_zscals( *alpha, alpha_x[ j ] );
 		}
 	}

@@ -296,7 +320,7 @@ void bli_zaxpyf_opt_var1(
 		{
 			for ( j = 0; j < b_n; ++j )
 			{
-				bli_zzzaxpys( alpha_x[ j ], *ap[ j ], *yp );
+				bli_zaxpys( alpha_x[ j ], *ap[ j ], *yp );

 				ap[ j ] += 1;
 			}
@@ -312,7 +336,7 @@ void bli_zaxpyf_opt_var1(
 		{
 			for ( j = 0; j < b_n; ++j )
 			{
-				bli_zzzaxpys( alpha_x[ j ], *ap[ j ], *yp );
+				bli_zaxpys( alpha_x[ j ], *ap[ j ], *yp );

 				ap[ j ] += n_elem_per_iter;
 			}
@@ -324,7 +348,7 @@ void bli_zaxpyf_opt_var1(
 		{
 			for ( j = 0; j < b_n; ++j )
 			{
-				bli_zzzaxpys( alpha_x[ j ], *ap[ j ], *yp );
+				bli_zaxpys( alpha_x[ j ], *ap[ j ], *yp );

 				ap[ j ] += 1;
 			}
@@ -338,7 +362,7 @@ void bli_zaxpyf_opt_var1(
 		{
 			for ( j = 0; j < b_n; ++j )
 			{
-				bli_zzzaxpyjs( alpha_x[ j ], *ap[ j ], *yp );
+				bli_zaxpyjs( alpha_x[ j ], *ap[ j ], *yp );

 				ap[ j ] += 1;
 			}
@@ -354,7 +378,7 @@ void bli_zaxpyf_opt_var1(
 		{
 			for ( j = 0; j < b_n; ++j )
 			{
-				bli_zzzaxpyjs( alpha_x[ j ], *ap[ j ], *yp );
+				bli_zaxpyjs( alpha_x[ j ], *ap[ j ], *yp );

 				ap[ j ] += n_elem_per_iter;
 			}
@@ -366,7 +390,7 @@ void bli_zaxpyf_opt_var1(
 		{
 			for ( j = 0; j < b_n; ++j )
 			{
-				bli_zzzaxpyjs( alpha_x[ j ], *ap[ j ], *yp );
+				bli_zaxpyjs( alpha_x[ j ], *ap[ j ], *yp );

 				ap[ j ] += 1;
 			}
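Note on the axpyf hunks above: the zaxpyf template first forms alpha_x[ j ] = alpha * conjx( x[ j ] ) for each of the b_n fused columns, then walks down the rows and updates each element of y with all b_n columns of a. A minimal sketch of that two-step structure for real doubles follows. The names sketch_daxpyf and SKETCH_MAX_FUSE are hypothetical, the fixed upper bound on b_n is an assumption of this sketch, and conjugation and the alignment phases are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed upper bound on the fusing factor for this sketch only. */
#define SKETCH_MAX_FUSE 8

/* Sketch of y := y + alpha * A * x, where A is m x b_n. */
void sketch_daxpyf( ptrdiff_t m, ptrdiff_t b_n,
                    const double* alpha,
                    const double* a, ptrdiff_t inca, ptrdiff_t lda,
                    const double* x, ptrdiff_t incx,
                    double* y, ptrdiff_t incy )
{
    double        alpha_x[ SKETCH_MAX_FUSE ];
    const double* ap[ SKETCH_MAX_FUSE ];
    double*       yp = y;
    ptrdiff_t     i, j;

    assert( 0 <= b_n && b_n <= SKETCH_MAX_FUSE );

    /* Step 1: scale each x element by alpha once, and set column pointers. */
    for ( j = 0; j < b_n; ++j )
    {
        alpha_x[ j ] = *alpha * x[ j * incx ];
        ap[ j ]      = a + j * lda;
    }

    /* Step 2: each element of y is loaded once and updated by all columns. */
    for ( i = 0; i < m; ++i )
    {
        for ( j = 0; j < b_n; ++j )
        {
            *yp += alpha_x[ j ] * *ap[ j ];
            ap[ j ] += inca;
        }
        yp += incy;
    }
}
```

Compared with calling an axpyv kernel b_n times, the fused form reads and writes each element of y once per row rather than once per column.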
--- a/config/template/kernels/1f/bli_dotaxpyv_opt_var1.c
+++ b/config/template/kernels/1f/bli_dotaxpyv_opt_var1.c
@@ -36,87 +36,115 @@



-void bli_sdotaxpyv_opt_var1( conj_t             conjxt,
-                             conj_t             conjx,
-                             conj_t             conjy,
-                             dim_t              n,
-                             float*    restrict alpha,
-                             float*    restrict x, inc_t incx,
-                             float*    restrict y, inc_t incy,
-                             float*    restrict rho,
-                             float*    restrict z, inc_t incz )
+void bli_sdotaxpyv_opt_var1
+     (
+       conj_t    conjxt,
+       conj_t    conjx,
+       conj_t    conjy,
+       dim_t     n,
+       float*    alpha,
+       float*    x, inc_t incx,
+       float*    y, inc_t incy,
+       float*    rho,
+       float*    z, inc_t incz,
+       cntx_t*   cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_SDOTAXPYV_KERNEL_REF( conjxt,
-	                           conjx,
-	                           conjy,
-	                           n,
-	                           alpha,
-	                           x, incx,
-	                           y, incy,
-	                           rho,
-	                           z, incz );
+	BLIS_SDOTAXPYV_KERNEL_REF
+	(
+	  conjxt,
+	  conjx,
+	  conjy,
+	  n,
+	  alpha,
+	  x, incx,
+	  y, incy,
+	  rho,
+	  z, incz,
+	  cntx
+	);
 }



-void bli_ddotaxpyv_opt_var1( conj_t             conjxt,
-                             conj_t             conjx,
-                             conj_t             conjy,
-                             dim_t              n,
-                             double*   restrict alpha,
-                             double*   restrict x, inc_t incx,
-                             double*   restrict y, inc_t incy,
-                             double*   restrict rho,
-                             double*   restrict z, inc_t incz )
+void bli_ddotaxpyv_opt_var1
+     (
+       conj_t    conjxt,
+       conj_t    conjx,
+       conj_t    conjy,
+       dim_t     n,
+       double*   alpha,
+       double*   x, inc_t incx,
+       double*   y, inc_t incy,
+       double*   rho,
+       double*   z, inc_t incz,
+       cntx_t*   cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_DDOTAXPYV_KERNEL_REF( conjxt,
-	                           conjx,
-	                           conjy,
-	                           n,
-	                           alpha,
-	                           x, incx,
-	                           y, incy,
-	                           rho,
-	                           z, incz );
+	BLIS_DDOTAXPYV_KERNEL_REF
+	(
+	  conjxt,
+	  conjx,
+	  conjy,
+	  n,
+	  alpha,
+	  x, incx,
+	  y, incy,
+	  rho,
+	  z, incz,
+	  cntx
+	);
 }



-void bli_cdotaxpyv_opt_var1( conj_t             conjxt,
-                             conj_t             conjx,
-                             conj_t             conjy,
-                             dim_t              n,
-                             scomplex* restrict alpha,
-                             scomplex* restrict x, inc_t incx,
-                             scomplex* restrict y, inc_t incy,
-                             scomplex* restrict rho,
-                             scomplex* restrict z, inc_t incz )
+void bli_cdotaxpyv_opt_var1
+     (
+       conj_t    conjxt,
+       conj_t    conjx,
+       conj_t    conjy,
+       dim_t     n,
+       scomplex* alpha,
+       scomplex* x, inc_t incx,
+       scomplex* y, inc_t incy,
+       scomplex* rho,
+       scomplex* z, inc_t incz,
+       cntx_t*   cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_CDOTAXPYV_KERNEL_REF( conjxt,
-	                           conjx,
-	                           conjy,
-	                           n,
-	                           alpha,
-	                           x, incx,
-	                           y, incy,
-	                           rho,
-	                           z, incz );
+	BLIS_CDOTAXPYV_KERNEL_REF
+	(
+	  conjxt,
+	  conjx,
+	  conjy,
+	  n,
+	  alpha,
+	  x, incx,
+	  y, incy,
+	  rho,
+	  z, incz,
+	  cntx
+	);
 }



-void bli_zdotaxpyv_opt_var1( conj_t             conjxt,
-                             conj_t             conjx,
-                             conj_t             conjy,
-                             dim_t              n,
-                             dcomplex* restrict alpha,
-                             dcomplex* restrict x, inc_t incx,
-                             dcomplex* restrict y, inc_t incy,
-                             dcomplex* restrict rho,
-                             dcomplex* restrict z, inc_t incz )
+void bli_zdotaxpyv_opt_var1
+     (
+       conj_t    conjxt,
+       conj_t    conjx,
+       conj_t    conjy,
+       dim_t     n,
+       dcomplex* alpha,
+       dcomplex* x, inc_t incx,
+       dcomplex* y, inc_t incy,
+       dcomplex* rho,
+       dcomplex* z, inc_t incz,
+       cntx_t*   cntx
+     )
 {
 /*
  Template dotaxpyv kernel implementation
@@ -240,15 +268,19 @@ void bli_zdotaxpyv_opt_var1( conj_t             conjxt,
 	// Call the reference implementation if needed.
 	if ( use_ref == TRUE )
 	{
-		BLIS_ZDOTAXPYV_KERNEL_REF( conjxt,
-		                           conjx,
-		                           conjy,
-		                           n,
-		                           alpha,
-		                           x, incx,
-		                           y, incy,
-		                           rho,
-		                           z, incz );
+		BLIS_ZDOTAXPYV_KERNEL_REF
+		(
+		  conjxt,
+		  conjx,
+		  conjy,
+		  n,
+		  alpha,
+		  x, incx,
+		  y, incy,
+		  rho,
+		  z, incz,
+		  cntx
+		);
        return;
 	}

@@ -285,8 +317,8 @@ void bli_zdotaxpyv_opt_var1( conj_t             conjxt,
 		// Compute front edge cases if x, y, and z were unaligned.
 		for ( i = 0; i < n_pre; ++i )
 		{
-			bli_zzzdots( *xp, *yp, dotxy );
-			bli_zzzaxpys( *alpha, *xp, *zp );
+			bli_zdots( *xp, *yp, dotxy );
+			bli_zaxpys( *alpha, *xp, *zp );

 			xp += 1; yp += 1; zp += 1;
 		}
@@ -298,8 +330,8 @@ void bli_zdotaxpyv_opt_var1( conj_t             conjxt,
 		// guaranteed to be aligned to BLIS_SIMD_ALIGN_SIZE.
 		for ( i = 0; i < n_iter; ++i )
 		{
-			bli_zzzdots( *xp, *yp, dotxy );
-			bli_zzzaxpys( *alpha, *xp, *zp );
+			bli_zdots( *xp, *yp, dotxy );
+			bli_zaxpys( *alpha, *xp, *zp );

 			xp += n_elem_per_iter;
 			yp += n_elem_per_iter;
@@ -309,8 +341,8 @@ void bli_zdotaxpyv_opt_var1( conj_t             conjxt,
 		// Compute tail edge cases, if applicable.
 		for ( i = 0; i < n_left; ++i )
 		{
-			bli_zzzdots( *xp, *yp, dotxy );
-			bli_zzzaxpys( *alpha, *xp, *zp );
+			bli_zdots( *xp, *yp, dotxy );
+			bli_zaxpys( *alpha, *xp, *zp );

 			xp += 1; yp += 1; zp += 1;
 		}
@@ -320,8 +352,8 @@ void bli_zdotaxpyv_opt_var1( conj_t             conjxt,
 		// Compute front edge cases if x, y, and z were unaligned.
 		for ( i = 0; i < n_pre; ++i )
 		{
-			bli_zzzdotjs( *xp, *yp, dotxy );
-			bli_zzzaxpys( *alpha, *xp, *zp );
+			bli_zdotjs( *xp, *yp, dotxy );
+			bli_zaxpys( *alpha, *xp, *zp );

 			xp += 1; yp += 1; zp += 1;
 		}
@@ -333,8 +365,8 @@ void bli_zdotaxpyv_opt_var1( conj_t             conjxt,
 		// guaranteed to be aligned to BLIS_SIMD_ALIGN_SIZE.
 		for ( i = 0; i < n_iter; ++i )
 		{
-			bli_zzzdotjs( *xp, *yp, dotxy );
-			bli_zzzaxpys( *alpha, *xp, *zp );
+			bli_zdotjs( *xp, *yp, dotxy );
+			bli_zaxpys( *alpha, *xp, *zp );

 			xp += n_elem_per_iter;
 			yp += n_elem_per_iter;
@@ -344,8 +376,8 @@ void bli_zdotaxpyv_opt_var1( conj_t             conjxt,
 		// Compute tail edge cases, if applicable.
 		for ( i = 0; i < n_left; ++i )
 		{
-			bli_zzzdotjs( *xp, *yp, dotxy );
-			bli_zzzaxpys( *alpha, *xp, *zp );
+			bli_zdotjs( *xp, *yp, dotxy );
+			bli_zaxpys( *alpha, *xp, *zp );

 			xp += 1; yp += 1; zp += 1;
 		}
@@ -355,8 +387,8 @@ void bli_zdotaxpyv_opt_var1( conj_t             conjxt,
 		// Compute front edge cases if x, y, and z were unaligned.
 		for ( i = 0; i < n_pre; ++i )
 		{
-			bli_zzzdots( *xp, *yp, dotxy );
-			bli_zzzaxpyjs( *alpha, *xp, *zp );
+			bli_zdots( *xp, *yp, dotxy );
+			bli_zaxpyjs( *alpha, *xp, *zp );

 			xp += 1; yp += 1; zp += 1;
 		}
@@ -368,8 +400,8 @@ void bli_zdotaxpyv_opt_var1( conj_t             conjxt,
 		// guaranteed to be aligned to BLIS_SIMD_ALIGN_SIZE.
 		for ( i = 0; i < n_iter; ++i )
 		{
-			bli_zzzdots( *xp, *yp, dotxy );
-			bli_zzzaxpyjs( *alpha, *xp, *zp );
+			bli_zdots( *xp, *yp, dotxy );
+			bli_zaxpyjs( *alpha, *xp, *zp );

 			xp += n_elem_per_iter;
 			yp += n_elem_per_iter;
@@ -379,8 +411,8 @@ void bli_zdotaxpyv_opt_var1( conj_t             conjxt,
 		// Compute tail edge cases, if applicable.
 		for ( i = 0; i < n_left; ++i )
 		{
-			bli_zzzdots( *xp, *yp, dotxy );
-			bli_zzzaxpyjs( *alpha, *xp, *zp );
+			bli_zdots( *xp, *yp, dotxy );
+			bli_zaxpyjs( *alpha, *xp, *zp );

 			xp += 1; yp += 1; zp += 1;
 		}
@@ -390,8 +422,8 @@ void bli_zdotaxpyv_opt_var1( conj_t             conjxt,
 		// Compute front edge cases if x, y, and z were unaligned.
 		for ( i = 0; i < n_pre; ++i )
 		{
-			bli_zzzdotjs( *xp, *yp, dotxy );
-			bli_zzzaxpyjs( *alpha, *xp, *zp );
+			bli_zdotjs( *xp, *yp, dotxy );
+			bli_zaxpyjs( *alpha, *xp, *zp );

 			xp += 1; yp += 1; zp += 1;
 		}
@@ -403,8 +435,8 @@ void bli_zdotaxpyv_opt_var1( conj_t             conjxt,
 		// guaranteed to be aligned to BLIS_SIMD_ALIGN_SIZE.
 		for ( i = 0; i < n_iter; ++i )
 		{
-			bli_zzzdotjs( *xp, *yp, dotxy );
-			bli_zzzaxpyjs( *alpha, *xp, *zp );
+			bli_zdotjs( *xp, *yp, dotxy );
+			bli_zaxpyjs( *alpha, *xp, *zp );

 			xp += n_elem_per_iter;
 			yp += n_elem_per_iter;
@@ -414,8 +446,8 @@ void bli_zdotaxpyv_opt_var1( conj_t             conjxt,
 		// Compute tail edge cases, if applicable.
 		for ( i = 0; i < n_left; ++i )
 		{
-			bli_zzzdotjs( *xp, *yp, dotxy );
-			bli_zzzaxpyjs( *alpha, *xp, *zp );
+			bli_zdotjs( *xp, *yp, dotxy );
+			bli_zaxpyjs( *alpha, *xp, *zp );

 			xp += 1; yp += 1; zp += 1;
 		}
@@ -426,6 +458,6 @@ void bli_zdotaxpyv_opt_var1( conj_t             conjxt,
 	if ( bli_is_conj( conjy ) )
 		bli_zconjs( dotxy );

-	bli_zzcopys( dotxy, *rho );
+	bli_zcopys( dotxy, *rho );
 }

--- a/config/template/kernels/1f/bli_dotxaxpyf_opt_var1.c
+++ b/config/template/kernels/1f/bli_dotxaxpyf_opt_var1.c
@@ -36,115 +36,143 @@



-void bli_sdotxaxpyf_opt_var1( conj_t             conjat,
-                              conj_t             conja,
-                              conj_t             conjw,
-                              conj_t             conjx,
-                              dim_t              m,
-                              dim_t              b_n,
-                              float*    restrict alpha,
-                              float*    restrict a, inc_t inca, inc_t lda,
-                              float*    restrict w, inc_t incw,
-                              float*    restrict x, inc_t incx,
-                              float*    restrict beta,
-                              float*    restrict y, inc_t incy,
-                              float*    restrict z, inc_t incz )
+void bli_sdotxaxpyf_opt_var1
+     (
+       conj_t    conjat,
+       conj_t    conja,
+       conj_t    conjw,
+       conj_t    conjx,
+       dim_t     m,
+       dim_t     b_n,
+       float*    alpha,
+       float*    a, inc_t inca, inc_t lda,
+       float*    w, inc_t incw,
+       float*    x, inc_t incx,
+       float*    beta,
+       float*    y, inc_t incy,
+       float*    z, inc_t incz,
+       cntx_t*   cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_SDOTXAXPYF_KERNEL_REF( conjat,
-	                            conja,
-	                            conjw,
-	                            conjx,
-	                            m,
-	                            b_n,
-	                            alpha,
-	                            a, inca, lda,
-	                            w, incw,
-	                            x, incx,
-	                            beta,
-	                            y, incy,
-	                            z, incz );
+	BLIS_SDOTXAXPYF_KERNEL_REF
+	(
+	  conjat,
+	  conja,
+	  conjw,
+	  conjx,
+	  m,
+	  b_n,
+	  alpha,
+	  a, inca, lda,
+	  w, incw,
+	  x, incx,
+	  beta,
+	  y, incy,
+	  z, incz,
+	  cntx
+	);
 }



-void bli_ddotxaxpyf_opt_var1( conj_t             conjat,
-                              conj_t             conja,
-                              conj_t             conjw,
-                              conj_t             conjx,
-                              dim_t              m,
-                              dim_t              b_n,
-                              double*   restrict alpha,
-                              double*   restrict a, inc_t inca, inc_t lda,
-                              double*   restrict w, inc_t incw,
-                              double*   restrict x, inc_t incx,
-                              double*   restrict beta,
-                              double*   restrict y, inc_t incy,
-                              double*   restrict z, inc_t incz )
+void bli_ddotxaxpyf_opt_var1
+     (
+       conj_t    conjat,
+       conj_t    conja,
+       conj_t    conjw,
+       conj_t    conjx,
+       dim_t     m,
+       dim_t     b_n,
+       double*   alpha,
+       double*   a, inc_t inca, inc_t lda,
+       double*   w, inc_t incw,
+       double*   x, inc_t incx,
+       double*   beta,
+       double*   y, inc_t incy,
+       double*   z, inc_t incz,
+       cntx_t*   cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_DDOTXAXPYF_KERNEL_REF( conjat,
-	                            conja,
-	                            conjw,
-	                            conjx,
-	                            m,
-	                            b_n,
-	                            alpha,
-	                            a, inca, lda,
-	                            w, incw,
-	                            x, incx,
-	                            beta,
-	                            y, incy,
-	                            z, incz );
+	BLIS_DDOTXAXPYF_KERNEL_REF
+	(
+	  conjat,
+	  conja,
+	  conjw,
+	  conjx,
+	  m,
+	  b_n,
+	  alpha,
+	  a, inca, lda,
+	  w, incw,
+	  x, incx,
+	  beta,
+	  y, incy,
+	  z, incz,
+	  cntx
+	);
 }



-void bli_cdotxaxpyf_opt_var1( conj_t             conjat,
-                              conj_t             conja,
-                              conj_t             conjw,
-                              conj_t             conjx,
-                              dim_t              m,
-                              dim_t              b_n,
-                              scomplex* restrict alpha,
-                              scomplex* restrict a, inc_t inca, inc_t lda,
-                              scomplex* restrict w, inc_t incw,
-                              scomplex* restrict x, inc_t incx,
-                              scomplex* restrict beta,
-                              scomplex* restrict y, inc_t incy,
-                              scomplex* restrict z, inc_t incz )
+void bli_cdotxaxpyf_opt_var1
+     (
+       conj_t    conjat,
+       conj_t    conja,
+       conj_t    conjw,
+       conj_t    conjx,
+       dim_t     m,
+       dim_t     b_n,
+       scomplex* alpha,
+       scomplex* a, inc_t inca, inc_t lda,
+       scomplex* w, inc_t incw,
+       scomplex* x, inc_t incx,
+       scomplex* beta,
+       scomplex* y, inc_t incy,
+       scomplex* z, inc_t incz,
+       cntx_t*   cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_CDOTXAXPYF_KERNEL_REF( conjat,
-	                            conja,
-	                            conjw,
-	                            conjx,
-	                            m,
-	                            b_n,
-	                            alpha,
-	                            a, inca, lda,
-	                            w, incw,
-	                            x, incx,
-	                            beta,
-	                            y, incy,
-	                            z, incz );
+	BLIS_CDOTXAXPYF_KERNEL_REF
+	(
+	  conjat,
+	  conja,
+	  conjw,
+	  conjx,
+	  m,
+	  b_n,
+	  alpha,
+	  a, inca, lda,
+	  w, incw,
+	  x, incx,
+	  beta,
+	  y, incy,
+	  z, incz,
+	  cntx
+	);
 }



-void bli_zdotxaxpyf_opt_var1( conj_t             conjat,
-                              conj_t             conja,
-                              conj_t             conjw,
-                              conj_t             conjx,
-                              dim_t              m,
-                              dim_t              b_n,
-                              dcomplex* restrict alpha,
-                              dcomplex* restrict a, inc_t inca, inc_t lda,
-                              dcomplex* restrict w, inc_t incw,
-                              dcomplex* restrict x, inc_t incx,
-                              dcomplex* restrict beta,
-                              dcomplex* restrict y, inc_t incy,
-                              dcomplex* restrict z, inc_t incz )
+void bli_zdotxaxpyf_opt_var1
+     (
+       conj_t    conjat,
+       conj_t    conja,
+       conj_t    conjw,
+       conj_t    conjx,
+       dim_t     m,
+       dim_t     b_n,
+       dcomplex* alpha,
+       dcomplex* a, inc_t inca, inc_t lda,
+       dcomplex* w, inc_t incw,
+       dcomplex* x, inc_t incx,
+       dcomplex* beta,
+       dcomplex* y, inc_t incy,
+       dcomplex* z, inc_t incz,
+       cntx_t*   cntx
+     )

 {
 /*
@@ -289,19 +317,23 @@ void bli_zdotxaxpyf_opt_var1( conj_t             conjat,
 	// Call the reference implementation if needed.
 	if ( use_ref == TRUE )
 	{
-		BLIS_ZDOTXAXPYF_KERNEL_REF( conjat,
-		                            conja,
-		                            conjw,
-		                            conjx,
-		                            m,
-		                            b_n,
-		                            alpha,
-		                            a, inca, lda,
-		                            w, incw,
-		                            x, incx,
-		                            beta,
-		                            y, incy,
-		                            z, incz );
+		BLIS_ZDOTXAXPYF_KERNEL_REF
+		(
+		  conjat,
+		  conja,
+		  conjw,
+		  conjx,
+		  m,
+		  b_n,
+		  alpha,
+		  a, inca, lda,
+		  w, incw,
+		  x, incx,
+		  beta,
+		  y, incy,
+		  z, incz,
+		  cntx
+		);
        return;
 	}

@@ -326,16 +358,16 @@ void bli_zdotxaxpyf_opt_var1( conj_t             conjat,
 	{
 		for ( j = 0; j < b_n; ++j )
 		{
-			bli_zzcopys( *xp[ j ], alpha_x[ j ] );
-			bli_zzscals( *alpha, alpha_x[ j ] );
+			bli_zcopys( *xp[ j ], alpha_x[ j ] );
+			bli_zscals( *alpha, alpha_x[ j ] );
 		}
 	}
 	else // if ( bli_is_conj( conjx ) )
 	{
 		for ( j = 0; j < b_n; ++j )
 		{
-			bli_zzcopyjs( *xp[ j ], alpha_x[ j ] );
-			bli_zzscals( *alpha, alpha_x[ j ] );
+			bli_zcopyjs( *xp[ j ], alpha_x[ j ] );
+			bli_zscals( *alpha, alpha_x[ j ] );
 		}
 	}

@@ -366,8 +398,8 @@ void bli_zdotxaxpyf_opt_var1( conj_t             conjat,
 		{
 			for ( j = 0; j < b_n; ++j )
 			{
-				bli_zzzdots( *ap[ j ], *wp, At_w[ j ] );
-				bli_zzzdots( *ap[ j ], alpha_x[ j ], *zp );
+				bli_zdots( *ap[ j ], *wp, At_w[ j ] );
+				bli_zdots( *ap[ j ], alpha_x[ j ], *zp );

 				ap[ j ] += 1;
 			}
@@ -383,8 +415,8 @@ void bli_zdotxaxpyf_opt_var1( conj_t             conjat,
 		{
 			for ( j = 0; j < b_n; ++j )
 			{
-				bli_zzzdots( *ap[ j ], *wp, At_w[ j ] );
-				bli_zzzdots( *ap[ j ], alpha_x[ j ], *zp );
+				bli_zdots( *ap[ j ], *wp, At_w[ j ] );
+				bli_zdots( *ap[ j ], alpha_x[ j ], *zp );

 				ap[ j ] += n_elem_per_iter;
 			}
@@ -396,8 +428,8 @@ void bli_zdotxaxpyf_opt_var1( conj_t             conjat,
 		{
 			for ( j = 0; j < b_n; ++j )
 			{
-				bli_zzzdots( *ap[ j ], *wp, At_w[ j ] );
-				bli_zzzdots( *ap[ j ], alpha_x[ j ], *zp );
+				bli_zdots( *ap[ j ], *wp, At_w[ j ] );
+				bli_zdots( *ap[ j ], alpha_x[ j ], *zp );

 				ap[ j ] += 1;
 			}
@@ -411,8 +443,8 @@ void bli_zdotxaxpyf_opt_var1( conj_t             conjat,
 		{
 			for ( j = 0; j < b_n; ++j )
 			{
-				bli_zzzdotjs( *ap[ j ], *wp, At_w[ j ] );
-				bli_zzzdots( *ap[ j ], alpha_x[ j ], *zp );
+				bli_zdotjs( *ap[ j ], *wp, At_w[ j ] );
+				bli_zdots( *ap[ j ], alpha_x[ j ], *zp );

 				ap[ j ] += 1;
 			}
@@ -428,8 +460,8 @@ void bli_zdotxaxpyf_opt_var1( conj_t             conjat,
 		{
 			for ( j = 0; j < b_n; ++j )
 			{
-				bli_zzzdotjs( *ap[ j ], *wp, At_w[ j ] );
-				bli_zzzdots( *ap[ j ], alpha_x[ j ], *zp );
+				bli_zdotjs( *ap[ j ], *wp, At_w[ j ] );
+				bli_zdots( *ap[ j ], alpha_x[ j ], *zp );

 				ap[ j ] += n_elem_per_iter;
 			}
@@ -441,8 +473,8 @@ void bli_zdotxaxpyf_opt_var1( conj_t             conjat,
 		{
 			for ( j = 0; j < b_n; ++j )
 			{
-				bli_zzzdotjs( *ap[ j ], *wp, At_w[ j ] );
-				bli_zzzdots( *ap[ j ], alpha_x[ j ], *zp );
+				bli_zdotjs( *ap[ j ], *wp, At_w[ j ] );
+				bli_zdots( *ap[ j ], alpha_x[ j ], *zp );

 				ap[ j ] += 1;
 			}
@@ -456,8 +488,8 @@ void bli_zdotxaxpyf_opt_var1( conj_t             conjat,
 		{
 			for ( j = 0; j < b_n; ++j )
 			{
-				bli_zzzdots( *ap[ j ], *wp, At_w[ j ] );
-				bli_zzzdotjs( *ap[ j ], alpha_x[ j ], *zp );
+				bli_zdots( *ap[ j ], *wp, At_w[ j ] );
+				bli_zdotjs( *ap[ j ], alpha_x[ j ], *zp );

 				ap[ j ] += 1;
 			}
@@ -473,8 +505,8 @@ void bli_zdotxaxpyf_opt_var1( conj_t             conjat,
 		{
 			for ( j = 0; j < b_n; ++j )
 			{
-				bli_zzzdots( *ap[ j ], *wp, At_w[ j ] );
-				bli_zzzdotjs( *ap[ j ], alpha_x[ j ], *zp );
+				bli_zdots( *ap[ j ], *wp, At_w[ j ] );
+				bli_zdotjs( *ap[ j ], alpha_x[ j ], *zp );

 				ap[ j ] += n_elem_per_iter;
 			}
@@ -486,8 +518,8 @@ void bli_zdotxaxpyf_opt_var1( conj_t             conjat,
 		{
 			for ( j = 0; j < b_n; ++j )
 			{
-				bli_zzzdots( *ap[ j ], *wp, At_w[ j ] );
-				bli_zzzdotjs( *ap[ j ], alpha_x[ j ], *zp );
+				bli_zdots( *ap[ j ], *wp, At_w[ j ] );
+				bli_zdotjs( *ap[ j ], alpha_x[ j ], *zp );

 				ap[ j ] += 1;
 			}
@@ -501,8 +533,8 @@ void bli_zdotxaxpyf_opt_var1( conj_t             conjat,
 		{
 			for ( j = 0; j < b_n; ++j )
 			{
-				bli_zzzdotjs( *ap[ j ], *wp, At_w[ j ] );
-				bli_zzzdotjs( *ap[ j ], alpha_x[ j ], *zp );
+				bli_zdotjs( *ap[ j ], *wp, At_w[ j ] );
+				bli_zdotjs( *ap[ j ], alpha_x[ j ], *zp );

 				ap[ j ] += 1;
 			}
@@ -518,8 +550,8 @@ void bli_zdotxaxpyf_opt_var1( conj_t             conjat,
 		{
 			for ( j = 0; j < b_n; ++j )
 			{
-				bli_zzzdotjs( *ap[ j ], *wp, At_w[ j ] );
-				bli_zzzdotjs( *ap[ j ], alpha_x[ j ], *zp );
+				bli_zdotjs( *ap[ j ], *wp, At_w[ j ] );
+				bli_zdotjs( *ap[ j ], alpha_x[ j ], *zp );

 				ap[ j ] += n_elem_per_iter;
 			}
@@ -531,8 +563,8 @@ void bli_zdotxaxpyf_opt_var1( conj_t             conjat,
 		{
 			for ( j = 0; j < b_n; ++j )
 			{
-				bli_zzzdotjs( *ap[ j ], *wp, At_w[ j ] );
-				bli_zzzdotjs( *ap[ j ], alpha_x[ j ], *zp );
+				bli_zdotjs( *ap[ j ], *wp, At_w[ j ] );
+				bli_zdotjs( *ap[ j ], alpha_x[ j ], *zp );

 				ap[ j ] += 1;
 			}
@@ -555,8 +587,8 @@ void bli_zdotxaxpyf_opt_var1( conj_t             conjat,
 	// scaling by beta.
 	for ( j = 0; j < b_n; ++j )
 	{
-		bli_zzscals( *beta, *yp[ j ] );
-		bli_zzzaxpys( *alpha, At_w[ j ], *yp[ j ] );
+		bli_zscals( *beta, *yp[ j ] );
+		bli_zaxpys( *alpha, At_w[ j ], *yp[ j ] );
 	}
 }

--- a/config/template/kernels/1f/bli_dotxf_opt_var1.c
+++ b/config/template/kernels/1f/bli_dotxf_opt_var1.c
@@ -36,95 +36,115 @@



-void bli_sdotxf_opt_var1(
-                          conj_t             conjat,
-                          conj_t             conjx,
-                          dim_t              m,
-                          dim_t              b_n,
-                          float*    restrict alpha,
-                          float*    restrict a, inc_t inca, inc_t lda,
-                          float*    restrict x, inc_t incx,
-                          float*    restrict beta,
-                          float*    restrict y, inc_t incy
-                        )
+void bli_sdotxf_opt_var1
+     (
+       conj_t    conjat,
+       conj_t    conjx,
+       dim_t     m,
+       dim_t     b_n,
+       float*    alpha,
+       float*    a, inc_t inca, inc_t lda,
+       float*    x, inc_t incx,
+       float*    beta,
+       float*    y, inc_t incy,
+       cntx_t*   cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_SDOTXF_KERNEL_REF( conjat,
-	                        conjx,
-	                        m,
-	                        b_n,
-	                        alpha,
-	                        a, inca, lda,
-	                        x, incx,
-	                        beta,
-	                        y, incy );
+	BLIS_SDOTXF_KERNEL_REF
+	(
+	  conjat,
+	  conjx,
+	  m,
+	  b_n,
+	  alpha,
+	  a, inca, lda,
+	  x, incx,
+	  beta,
+	  y, incy,
+	  cntx
+	);
 }



-void bli_ddotxf_opt_var1(
-                          conj_t             conjat,
-                          conj_t             conjx,
-                          dim_t              m,
-                          dim_t              b_n,
-                          double*   restrict alpha,
-                          double*   restrict a, inc_t inca, inc_t lda,
-                          double*   restrict x, inc_t incx,
-                          double*   restrict beta,
-                          double*   restrict y, inc_t incy
-                        )
+void bli_ddotxf_opt_var1
+     (
+       conj_t    conjat,
+       conj_t    conjx,
+       dim_t     m,
+       dim_t     b_n,
+       double*   alpha,
+       double*   a, inc_t inca, inc_t lda,
+       double*   x, inc_t incx,
+       double*   beta,
+       double*   y, inc_t incy,
+       cntx_t*   cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_DDOTXF_KERNEL_REF( conjat,
-	                        conjx,
-	                        m,
-	                        b_n,
-	                        alpha,
-	                        a, inca, lda,
-	                        x, incx,
-	                        beta,
-	                        y, incy );
+	BLIS_DDOTXF_KERNEL_REF
+	(
+	  conjat,
+	  conjx,
+	  m,
+	  b_n,
+	  alpha,
+	  a, inca, lda,
+	  x, incx,
+	  beta,
+	  y, incy,
+	  cntx
+	);
 }



-void bli_cdotxf_opt_var1(
-                          conj_t             conjat,
-                          conj_t             conjx,
-                          dim_t              m,
-                          dim_t              b_n,
-                          scomplex* restrict alpha,
-                          scomplex* restrict a, inc_t inca, inc_t lda,
-                          scomplex* restrict x, inc_t incx,
-                          scomplex* restrict beta,
-                          scomplex* restrict y, inc_t incy
-                        )
+void bli_cdotxf_opt_var1
+     (
+       conj_t    conjat,
+       conj_t    conjx,
+       dim_t     m,
+       dim_t     b_n,
+       scomplex* alpha,
+       scomplex* a, inc_t inca, inc_t lda,
+       scomplex* x, inc_t incx,
+       scomplex* beta,
+       scomplex* y, inc_t incy,
+       cntx_t*   cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_CDOTXF_KERNEL_REF( conjat,
-	                        conjx,
-	                        m,
-	                        b_n,
-	                        alpha,
-	                        a, inca, lda,
-	                        x, incx,
-	                        beta,
-	                        y, incy );
+	BLIS_CDOTXF_KERNEL_REF
+	(
+	  conjat,
+	  conjx,
+	  m,
+	  b_n,
+	  alpha,
+	  a, inca, lda,
+	  x, incx,
+	  beta,
+	  y, incy,
+	  cntx
+	);
 }



-void bli_zdotxf_opt_var1(
-                          conj_t             conjat,
-                          conj_t             conjx,
-                          dim_t              m,
-                          dim_t              b_n,
-                          dcomplex* restrict alpha,
-                          dcomplex* restrict a, inc_t inca, inc_t lda,
-                          dcomplex* restrict x, inc_t incx,
-                          dcomplex* restrict beta,
-                          dcomplex* restrict y, inc_t incy
-                        )
+void bli_zdotxf_opt_var1
+     (
+       conj_t    conjat,
+       conj_t    conjx,
+       dim_t     m,
+       dim_t     b_n,
+       dcomplex* alpha,
+       dcomplex* a, inc_t inca, inc_t lda,
+       dcomplex* x, inc_t incx,
+       dcomplex* beta,
+       dcomplex* y, inc_t incy,
+       cntx_t*   cntx
+     )
 {
 /*
  Template dotxf kernel implementation
@@ -225,10 +245,14 @@ void bli_zdotxf_opt_var1(
 	// If the vector lengths are zero, scale r by beta and return.
 	if ( bli_zero_dim1( m ) )
 	{
-		bli_zzscalv( BLIS_NO_CONJUGATE,
-		             b_n,
-		             beta,
-		             y, incy );
+		bli_zscalv_ex
+		(
+		  BLIS_NO_CONJUGATE,
+		  b_n,
+		  beta,
+		  y, incy,
+		  cntx
+		);
 		return;
 	}

@@ -265,15 +289,19 @@ void bli_zdotxf_opt_var1(
 	// Call the reference implementation if needed.
 	if ( use_ref == TRUE )
 	{
-		BLIS_ZDOTXF_KERNEL_REF( conjat,
-		                        conjx,
-		                        m,
-		                        b_n,
-		                        alpha,
-		                        a, inca, lda,
-		                        x, incx,
-		                        beta,
-		                        y, incy );
+		BLIS_ZDOTXF_KERNEL_REF
+		(
+		  conjat,
+		  conjx,
+		  m,
+		  b_n,
+		  alpha,
+		  a, inca, lda,
+		  x, incx,
+		  beta,
+		  y, incy,
+		  cntx
+		);
        return;
 	}

--- a/config/template/kernels/3/bli_gemm_opt_mxn.c
+++ b/config/template/kernels/3/bli_gemm_opt_mxn.c
@@ -36,37 +36,45 @@



-void bli_sgemm_opt_mxn(
-                        dim_t              k,
-                        float*    restrict alpha,
-                        float*    restrict a1,
-                        float*    restrict b1,
-                        float*    restrict beta,
-                        float*    restrict c11, inc_t rs_c, inc_t cs_c,
-                        auxinfo_t*         data
-                      )
+void bli_sgemm_opt_mxn
+     (
+       dim_t               k,
+       float*     restrict alpha,
+       float*     restrict a1,
+       float*     restrict b1,
+       float*     restrict beta,
+       float*     restrict c11, inc_t rs_c, inc_t cs_c,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_SGEMM_UKERNEL_REF( k,
-	                   alpha,
-	                   a1,
-	                   b1,
-	                   beta,
-	                   c11, rs_c, cs_c,
-	                   data );
+	BLIS_SGEMM_UKERNEL_REF
+	(
+	  k,
+	  alpha,
+	  a1,
+	  b1,
+	  beta,
+	  c11, rs_c, cs_c,
+	  data,
+	  cntx
+	);
 }



-void bli_dgemm_opt_mxn(
-                        dim_t              k,
-                        double*   restrict alpha,
-                        double*   restrict a1,
-                        double*   restrict b1,
-                        double*   restrict beta,
-                        double*   restrict c11, inc_t rs_c, inc_t cs_c,
-                        auxinfo_t*         data
-                      )
+void bli_dgemm_opt_mxn
+     (
+       dim_t               k,
+       double*    restrict alpha,
+       double*    restrict a1,
+       double*    restrict b1,
+       double*    restrict beta,
+       double*    restrict c11, inc_t rs_c, inc_t cs_c,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     )
 {
 /*
  Template gemm micro-kernel implementation
@@ -85,133 +93,27 @@ void bli_dgemm_opt_mxn(
  where A1 is MR x k, B1 is k x NR, C11 is MR x NR, and alpha and beta are
  scalars.

-  Parameters:
+  For more info, please refer to the BLIS website's wiki on kernels:

-  - k:      The number of columns of A1 and rows of B1.
-  - alpha:  The address of a scalar to the A1 * B1 product.
-  - a1:     The address of a micro-panel of matrix A of dimension MR x k,
-            stored by columns with leading dimension PACKMR, where
-            typically PACKMR = MR.
-  - b1:     The address of a micro-panel of matrix B of dimension k x NR,
-            stored by rows with leading dimension PACKNR, where typically
-            PACKNR = NR.
-  - beta:   The address of a scalar to the input value of matrix C11.
-  - c11:    The address of a submatrix C11 of dimension MR x NR, stored
-            according to rs_c and cs_c.
-  - rs_c:   The row stride of matrix C11 (ie: the distance to the next row,
-            in units of matrix elements).
-  - cs_c:   The column stride of matrix C11 (ie: the distance to the next
-            column, in units of matrix elements).
-  - data:   The address of an auxinfo_t object that contains auxiliary
-            information that may be useful when optimizing the gemm
-            micro-kernel implementation. (See BLIS KernelsHowTo wiki for
-            more info.)
+    https://github.com/flame/blis/wiki/KernelsHowTo

-  Diagram for gemm
-
-  The diagram below shows the packed micro-panel operands and how elements
-  of each would be stored when MR = NR = 4. The hex digits indicate the
-  layout and order (but NOT the numeric contents) of the elements in
-  memory. Note that the storage of C11 is not shown since it is determined
-  by the row and column strides of C11.
-
-         c11:           a1:                        b1:       
-         _______        ______________________     _______   
-        |       |      |0 4 8 C               |   |0 1 2 3|  
-    MR  |       |      |1 5 9 D . . .         |   |4 5 6 7|  
-        |       |  +=  |2 6 A E               |   |8 9 A B|  
-        |_______|      |3_7_B_F_______________|   |C D E F|  
-                                                  |   .   |  
-            NR                    k               |   .   | k
-                                                  |   .   |  
-                                                  |       |  
-                                                  |       |  
-                                                  |_______|  
-                                                             
-                                                      NR     
-  Implementation Notes for gemm
-
-  - Register blocksizes. The C preprocessor macros bli_?mr and bli_?nr
-    evaluate to the MR and NR register blocksizes for the datatype
-    corresponding to the '?' character. These values are abbreviations
-    of the macro constants BLIS_DEFAULT_MR_? and BLIS_DEFAULT_NR_?,
-    which are defined in the bli_kernel.h header file of the BLIS
-    configuration.
-  - Leading dimensions of a1 and b1: PACKMR and PACKNR. The packed
-    micro-panels a1 and b1 are simply stored in column-major and row-major
-    order, respectively. Usually, the width of either micro-panel (ie:
-    the number of rows of A1, or MR, and the number of columns of B1, or
-    NR) is equal to that micro-panel's so-called "leading dimension."
-    Sometimes, it may be beneficial to specify a leading dimension that
-    is larger than the panel width. This may be desirable because it
-    allows each column of A1 or row of B1 to maintain a certain alignment
-    in memory that would not otherwise be maintained by MR and/or NR. In
-    this case, you should index through a1 and b1 using the values PACKMR
-    and PACKNR, respectively, as defined by bli_?packmr and bli_?packnr.
-    These values are defined as BLIS_PACKDIM_MR_? and BLIS_PACKDIM_NR_?,
-    respectively, in the bli_kernel.h header file of the BLIS
-    configuration.
-  - Storage preference of c11: Sometimes, an optimized micro-kernel will
-    have a preferred storage format for C11--typically either contiguous
-    row-storage or contiguous column-storage. This preference comes from
-    how the micro-kernel is most efficiently able to load/store elements
-    of C11 from/to memory. Most micro-kernels use vector instructions to
-    load and store contigous columns (or column segments) of C11. However,
-    the developer may decide that loading contiguous rows (or row
-    segments) is desirable. If this is the case, this preference should be
-    noted in bli_kernel.h by defining the macro
-    BLIS_?GEMM_UKERNEL_PREFERS_CONTIG_ROWS. Leaving the macro undefined
-    leaves the default assumption (contiguous column preference) in
-    place. Setting this macro allows the framework to perform a minor
-    optimization at run-time that will ensure the micro-kernel preference
-    is honored, if at all possible.
-  - Edge cases in MR, NR dimensions. Sometimes the micro-kernel will be
-    called with micro-panels a1 and b1 that correspond to edge cases,
-    where only partial results are needed. Zero-padding is handled
-    automatically by the packing function to facilitate reuse of the same
-    micro-kernel. Similarly, the logic for computing to temporary storage
-    and then saving only the elements that correspond to elements of C11
-    that exist (at the edges) is handled automatically within the
-    macro-kernel.
-  - Alignment of a1 and b1. By default, the alignment of addresses a1 and
-    b1 are aligned only to sizeof(type). If BLIS_CONTIG_ADDR_ALIGN_SIZE is
-    set to some larger multiple of sizeof(type), such as the page size,
-    then a1 and b1 will be aligned to PACKMR * sizeof(type) and PACKNR *
-    sizeof(type), respectively. Alignment of a1 and b1 is also affected
-    by BLIS_UPANEL_A_ALIGN_SIZE_? and BLIS_UPANEL_B_ALIGN_SIZE_?, which
-    align the distance (stride) between subsequent micro-panels. (By
-    default, those values are simply sizeof(type), in which case they have
-    no effect.)
-  - Unrolling loops. As a general rule of thumb, the loop over k is
-    sometimes moderately unrolled; for example, in our experience, an
-    unrolling factor of u = 4 is fairly common. If unrolling is applied
-    in the k dimension, edge cases must be handled to support values of k
-    that are not multiples of u. It is nearly universally true that there
-    should be no loops in the MR or NR directions; in other words,
-    iteration over these dimensions should always be fully unrolled
-    (within the loop over k).
-  - Zero beta. If beta = 0.0 (or 0.0 + 0.0i for complex datatypes), then
-    the micro-kernel should NOT use it explicitly, as C11 may contain
-    uninitialized memory (including NaNs). This case should be detected
-    and handled separately, preferably by simply overwriting C11 with the
-    alpha * A1 * B1 product. An example of how to perform this "beta equals
-    zero" handling is included in the gemm micro-kernel associated with
-    the template configuration.
-
-  For more info, please refer to the BLIS website and/or contact the
-  blis-devel mailing list.
+  and/or contact the blis-devel mailing list.

  -FGVZ
 */
-	const dim_t        mr    = bli_dmr;
-	const dim_t        nr    = bli_dnr;
+	const num_t        dt     = BLIS_DOUBLE;

-	const inc_t        cs_a  = bli_dpackmr;
+	const dim_t        mr     = bli_cntx_get_blksz_def_dt( dt, BLIS_MR, cntx );
+	const dim_t        nr     = bli_cntx_get_blksz_def_dt( dt, BLIS_NR, cntx );

-	const inc_t        rs_b  = bli_dpacknr;
+	const inc_t        packmr = bli_cntx_get_blksz_max_dt( dt, BLIS_MR, cntx );
+	const inc_t        packnr = bli_cntx_get_blksz_max_dt( dt, BLIS_NR, cntx );

-	const inc_t        rs_ab = 1;
-	const inc_t        cs_ab = bli_dmr;
+	const inc_t        cs_a   = packmr;
+	const inc_t        rs_b   = packnr;
+
+	const inc_t        rs_ab  = 1;
+	const inc_t        cs_ab  = mr;

 	dim_t              l, j, i;

@@ -291,36 +193,56 @@ void bli_cgemm_opt_mxn(
                        scomplex* restrict c11, inc_t rs_c, inc_t cs_c,
                        auxinfo_t*         data
                      )
+     (
+       dim_t               k,
+       scomplex*  restrict alpha,
+       scomplex*  restrict a1,
+       scomplex*  restrict b1,
+       scomplex*  restrict beta,
+       scomplex*  restrict c11, inc_t rs_c, inc_t cs_c,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_CGEMM_UKERNEL_REF( k,
-	                   alpha,
-	                   a1,
-	                   b1,
-	                   beta,
-	                   c11, rs_c, cs_c,
-	                   data );
+	BLIS_CGEMM_UKERNEL_REF
+	(
+	  k,
+	  alpha,
+	  a1,
+	  b1,
+	  beta,
+	  c11, rs_c, cs_c,
+	  data,
+	  cntx
+	);
 }



-void bli_zgemm_opt_mxn(
-                        dim_t              k,
-                        dcomplex* restrict alpha,
-                        dcomplex* restrict a1,
-                        dcomplex* restrict b1,
-                        dcomplex* restrict beta,
-                        dcomplex* restrict c11, inc_t rs_c, inc_t cs_c,
-                        auxinfo_t*         data
-                      )
+void bli_zgemm_opt_mxn
+     (
+       dim_t               k,
+       dcomplex*  restrict alpha,
+       dcomplex*  restrict a1,
+       dcomplex*  restrict b1,
+       dcomplex*  restrict beta,
+       dcomplex*  restrict c11, inc_t rs_c, inc_t cs_c,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_ZGEMM_UKERNEL_REF( k,
-	                   alpha,
-	                   a1,
-	                   b1,
-	                   beta,
-	                   c11, rs_c, cs_c,
-	                   data );
+	BLIS_ZGEMM_UKERNEL_REF
+	(
+	  k,
+	  alpha,
+	  a1,
+	  b1,
+	  beta,
+	  c11, rs_c, cs_c,
+	  data,
+	  cntx
+	);
 }

--- a/config/template/kernels/3/bli_gemmtrsm_l_opt_mxn.c
+++ b/config/template/kernels/3/bli_gemmtrsm_l_opt_mxn.c
@@ -36,18 +36,24 @@



-void bli_sgemmtrsm_l_opt_mxn(
-                              dim_t              k,
-                              float*    restrict alpha,
-                              float*    restrict a10,
-                              float*    restrict a11,
-                              float*    restrict b01,
-                              float*    restrict b11,
-                              float*    restrict c11, inc_t rs_c, inc_t cs_c,
-                              auxinfo_t*         data
-                            )
+void bli_sgemmtrsm_l_opt_mxn
+     (
+       dim_t               k,
+       float*     restrict alpha,
+       float*     restrict a10,
+       float*     restrict a11,
+       float*     restrict b01,
+       float*     restrict b11,
+       float*     restrict c11, inc_t rs_c, inc_t cs_c,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     )
 {
-	const inc_t        rs_b      = bli_spacknr;
+	const num_t        dt        = BLIS_FLOAT;
+
+	const inc_t        packnr    = bli_cntx_get_blksz_max_dt( dt, BLIS_NR, cntx );
+
+	const inc_t        rs_b      = packnr;
 	const inc_t        cs_b      = 1;

 	float*    restrict minus_one = bli_sm1;
@@ -69,16 +75,18 @@ void bli_sgemmtrsm_l_opt_mxn(



-void bli_dgemmtrsm_l_opt_mxn(
-                              dim_t              k,
-                              double*   restrict alpha,
-                              double*   restrict a10,
-                              double*   restrict a11,
-                              double*   restrict b01,
-                              double*   restrict b11,
-                              double*   restrict c11, inc_t rs_c, inc_t cs_c,
-                              auxinfo_t*         data
-                            )
+void bli_dgemmtrsm_l_opt_mxn
+     (
+       dim_t               k,
+       double*    restrict alpha,
+       double*    restrict a10,
+       double*    restrict a11,
+       double*    restrict b01,
+       double*    restrict b11,
+       double*    restrict c11, inc_t rs_c, inc_t cs_c,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     )
 {
 /*
  Template gemmtrsm_l micro-kernel implementation
@@ -96,114 +104,19 @@ void bli_dgemmtrsm_l_opt_mxn(
  B11 is MR x NR, and alpha is a scalar. Here, inv() denotes matrix
  inverse.

-  Parameters:
+  For more info, please refer to the BLIS website's wiki on kernels:

-  - k:      The number of columns of A10 and rows of B01.
-  - alpha:  The address of a scalar to be applied to B11.
-  - a10:    The address of A10, which is the MR x k submatrix of the packed
-            micro-panel of A that is situated to the left of the MR x MR
-            triangular submatrix A11. A10 is stored by columns with leading
-            dimension PACKMR, where typically PACKMR = MR.
-  - a11:    The address of A11, which is the MR x MR lower triangular
-            submatrix within the packed micro-panel of matrix A that is
-            situated to the right of A10. A11 is stored by columns with
-            leading dimension PACKMR, where typically PACKMR = MR. Note
-            that A11 contains elements in both triangles, though elements
-            in the unstored triangle are not guaranteed to be zero and
-            thus should not be referenced.
-  - b01:    The address of B01, which is the k x NR submatrix of the packed
-            micro-panel of B that is situated above the MR x NR submatrix
-            B11. B01 is stored by rows with leading dimension PACKNR, where
-            typically PACKNR = NR.
-  - b11:    The address B11, which is the MR x NR submatrix of the packed
-            micro-panel of B, situated below B01. B11 is stored by rows
-            with leading dimension PACKNR, where typically PACKNR = NR.
-  - c11:    The address of C11, which is the MR x NR submatrix of matrix
-            C, stored according to rs_c and cs_c. C11 is the submatrix
-            within C that corresponds to the elements which were packed
-            into B11. Thus, C is the original input matrix B to the overall
-            trsm operation.
-  - rs_c:   The row stride of C11 (ie: the distance to the next row of C11,
-            in units of matrix elements).
-  - cs_c:   The column stride of C11 (ie: the distance to the next column of
-            C11, in units of matrix elements).
-  - data:   The address of an auxinfo_t object that contains auxiliary
-            information that may be useful when optimizing the gemmtrsm
-            micro-kernel implementation. (See BLIS KernelsHowTo wiki for
-            more info.)
+    https://github.com/flame/blis/wiki/KernelsHowTo

-  Diagram for gemmtrsm_l
-
-  The diagram below shows the packed micro-panel operands for trsm_l and
-  how elements of each would be stored when MR = NR = 4. (The hex digits
-  indicate the layout and order (but NOT the numeric contents) in memory.
-  Here, matrix A11 (referenced by a11) is lower triangular. Matrix A11
-  does contain elements corresponding to the strictly upper triangle,
-  however, they are not guaranteed to contain zeros and thus these elements
-  should not be referenced.
-
-                                                NR    
-                                              _______ 
-                                         b01:|0 1 2 3|
-                                             |4 5 6 7|
-                                             |8 9 A B|
-                                             |C D E F|
-                                           k |   .   |
-                                             |   .   |
-         a10:                a11:            |   .   |
-         ___________________  _______        |_______|
-        |0 4 8 C            |`.      |   b11:|       |
-    MR  |1 5 9 D . . .      |  `.    |       |       |
-        |2 6 A E            |    `.  |    MR |       |
-        |3_7_B_F____________|______`.|       |_______|
-                                                      
-                  k             MR                    
-
-
-  Implementation Notes for gemmtrsm
-
-  - Register blocksizes. See Implementation Notes for gemm.
-  - Leading dimensions of a1 and b1: PACKMR and PACKNR. See Implementation
-    Notes for gemm.
-  - Edge cases in MR, NR dimensions. See Implementation Notes for gemm.
-  - Alignment of a1 and b1. The addresses a1 and b1 are aligned according
-    to PACKMR*sizeof(type) and PACKNR*sizeof(type), respectively.
-  - Unrolling loops. Most optimized implementations should unroll all
-    three loops within the trsm subproblem of gemmtrsm. See Implementation
-    Notes for gemm for remarks on unrolling the gemm subproblem.
-  - Prefetching next micro-panels of A and B. When invoked from within a
-    gemmtrsm_l micro-kernel, the addresses accessible via
-    bli_auxinfo_next_a() and bli_auxinfo_next_b() refer to the next
-    invocation's a10 and b01, respectively, while in gemmtrsm_u, the
-    _next_a() and _next_b() macros return the addresses of the next
-    invocation's a11 and b11 (since those submatrices precede a12 and b21).
-    (See BLIS KernelsHowTo wiki for more info.)
-  - Zero alpha. The micro-kernel can safely assume that alpha is non-zero;
-    "alpha equals zero" handling is performed at a much higher level,
-    which means that, in such a scenario, the micro-kernel will never get
-    called.
-  - Diagonal elements of A11. See Implementation Notes for trsm.
-  - Zero elements of A11. See Implementation Notes for trsm.
-  - Output. See Implementation Notes for trsm.
-  - Optimization. Let's assume that the gemm micro-kernel has already been
-    optimized. You have two options with regard to optimizing the fused
-    gemmtrsm micro-kernels:
-    (1) Optimize only the trsm micro-kernels. This will result in the gemm
-        and trsm_l micro-kernels being called in sequence. (Likewise for
-        gemm and trsm_u.)
-    (2) Fuse the implementation of the gemm micro-kernel with that of the
-        trsm micro-kernels by inlining both into the gemmtrsm_l and
-        gemmtrsm_u micro-kernel definitions. This option is more labor-
-        intensive, but also more likely to yield higher performance because
-        it avoids redundant memory operations on the packed MR x NR
-        submatrix B11.
-
-  For more info, please refer to the BLIS website and/or contact the
-  blis-devel mailing list.
+  and/or contact the blis-devel mailing list.

  -FGVZ
 */
-	const inc_t        rs_b      = bli_dpacknr;
+	const num_t        dt        = BLIS_DOUBLE;
+
+	const inc_t        packnr    = bli_cntx_get_blksz_max_dt( dt, BLIS_NR, cntx );
+
+	const inc_t        rs_b      = packnr;
 	const inc_t        cs_b      = 1;

 	double*   restrict minus_one = bli_dm1;
@@ -227,18 +140,24 @@ void bli_dgemmtrsm_l_opt_mxn(



-void bli_cgemmtrsm_l_opt_mxn(
-                              dim_t              k,
-                              scomplex* restrict alpha,
-                              scomplex* restrict a10,
-                              scomplex* restrict a11,
-                              scomplex* restrict b01,
-                              scomplex* restrict b11,
-                              scomplex* restrict c11, inc_t rs_c, inc_t cs_c,
-                              auxinfo_t*         data
-                            )
+void bli_cgemmtrsm_l_opt_mxn
+     (
+       dim_t               k,
+       scomplex*  restrict alpha,
+       scomplex*  restrict a10,
+       scomplex*  restrict a11,
+       scomplex*  restrict b01,
+       scomplex*  restrict b11,
+       scomplex*  restrict c11, inc_t rs_c, inc_t cs_c,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     )
 {
-	const inc_t        rs_b      = bli_cpacknr;
+	const num_t        dt        = BLIS_SCOMPLEX;
+
+	const inc_t        packnr    = bli_cntx_get_blksz_max_dt( dt, BLIS_NR, cntx );
+
+	const inc_t        rs_b      = packnr;
 	const inc_t        cs_b      = 1;

 	scomplex* restrict minus_one = bli_cm1;
@@ -260,18 +179,24 @@ void bli_cgemmtrsm_l_opt_mxn(



-void bli_zgemmtrsm_l_opt_mxn(
-                              dim_t              k,
-                              dcomplex* restrict alpha,
-                              dcomplex* restrict a10,
-                              dcomplex* restrict a11,
-                              dcomplex* restrict b01,
-                              dcomplex* restrict b11,
-                              dcomplex* restrict c11, inc_t rs_c, inc_t cs_c,
-                              auxinfo_t*         data
-                            )
+void bli_zgemmtrsm_l_opt_mxn
+     (
+       dim_t               k,
+       dcomplex*  restrict alpha,
+       dcomplex*  restrict a10,
+       dcomplex*  restrict a11,
+       dcomplex*  restrict b01,
+       dcomplex*  restrict b11,
+       dcomplex*  restrict c11, inc_t rs_c, inc_t cs_c,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     )
 {
-	const inc_t        rs_b      = bli_zpacknr;
+	const num_t        dt        = BLIS_DCOMPLEX;
+
+	const inc_t        packnr    = bli_cntx_get_blksz_max_dt( dt, BLIS_NR, cntx );
+
+	const inc_t        rs_b      = packnr;
 	const inc_t        cs_b      = 1;

 	dcomplex* restrict minus_one = bli_zm1;
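Reviewer note on the gemmtrsm_l template above. Option (1) from the removed "Optimization" note, calling the gemm and trsm_l micro-kernels in sequence, can be sketched as below. The helper names (sketch_gemm, sketch_trsm_l, sketch_gemmtrsm_l) and the 2 x 2 blocksizes are hypothetical and chosen for the example. The use of -1 as the gemm scalar and of rs_b = PACKNR, cs_b = 1 as the output strides is inferred from the minus_one, rs_b and cs_b locals visible in the hunks; the actual call sequence is elided from this diff, so treat it as an assumption.

```c
#include <stddef.h>

/* Hypothetical blocksizes for this sketch only. */
enum { MR = 2, NR = 2, PACKMR = 2, PACKNR = 2 };

/* C := beta * C + alpha * A * B on packed micro-panels.
   beta is assumed non-zero here, matching the removed "Zero alpha" note. */
static void sketch_gemm(size_t k, double alpha,
                        const double *a, const double *b,
                        double beta, double *c, size_t rs_c, size_t cs_c)
{
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j) {
            double ab = 0.0;
            for (size_t l = 0; l < k; ++l)
                ab += a[i + l * PACKMR] * b[l * PACKNR + j];
            double *gamma = &c[i * rs_c + j * cs_c];
            *gamma = beta * (*gamma) + alpha * ab;
        }
}

/* B11 := inv(A11) * B11; C11 := B11. A11 is lower triangular and its
   diagonal already holds the inverted elements (done during packing). */
static void sketch_trsm_l(const double *a11, double *b11,
                          double *c11, size_t rs_c, size_t cs_c)
{
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j) {
            double rho = 0.0;
            for (size_t l = 0; l < i; ++l)
                rho += a11[i + l * PACKMR] * b11[l * PACKNR + j];
            const double x = (b11[i * PACKNR + j] - rho) * a11[i + i * PACKMR];
            b11[i * PACKNR + j]      = x;
            c11[i * rs_c + j * cs_c] = x;
        }
}

/* B11 := alpha * B11 - A10 * B01, followed by the triangular solve. */
static void sketch_gemmtrsm_l(size_t k, double alpha,
                              const double *a10, const double *a11,
                              const double *b01, double *b11,
                              double *c11, size_t rs_c, size_t cs_c)
{
    const size_t rs_b = PACKNR;
    const size_t cs_b = 1;

    sketch_gemm(k, -1.0, a10, b01, alpha, b11, rs_b, cs_b);
    sketch_trsm_l(a11, b11, c11, rs_c, cs_c);
}
```

Option (2) in the removed note would inline both steps into one kernel so that B11 stays in registers between the update and the solve, instead of being stored by the gemm step and reloaded by the trsm step as it is here.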
--- a/config/template/kernels/3/bli_gemmtrsm_u_opt_mxn.c
+++ b/config/template/kernels/3/bli_gemmtrsm_u_opt_mxn.c
@@ -36,18 +36,24 @@



-void bli_sgemmtrsm_u_opt_mxn(
-                              dim_t              k,
-                              float*    restrict alpha,
-                              float*    restrict a12,
-                              float*    restrict a11,
-                              float*    restrict b21,
-                              float*    restrict b11,
-                              float*    restrict c11, inc_t rs_c, inc_t cs_c,
-                              auxinfo_t*         data
-                            )
+void bli_sgemmtrsm_u_opt_mxn
+     (
+       dim_t               k,
+       float*     restrict alpha,
+       float*     restrict a10,
+       float*     restrict a11,
+       float*     restrict b01,
+       float*     restrict b11,
+       float*     restrict c11, inc_t rs_c, inc_t cs_c,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     )
 {
-	const inc_t        rs_b      = bli_spacknr;
+	const num_t        dt        = BLIS_FLOAT;
+
+	const inc_t        packnr    = bli_cntx_get_blksz_max_dt( dt, BLIS_NR, cntx );
+
+	const inc_t        rs_b      = packnr;
 	const inc_t        cs_b      = 1;

 	float*    restrict minus_one = bli_sm1;
@@ -69,16 +75,18 @@ void bli_sgemmtrsm_u_opt_mxn(



-void bli_dgemmtrsm_u_opt_mxn(
-                              dim_t              k,
-                              double*   restrict alpha,
-                              double*   restrict a12,
-                              double*   restrict a11,
-                              double*   restrict b21,
-                              double*   restrict b11,
-                              double*   restrict c11, inc_t rs_c, inc_t cs_c,
-                              auxinfo_t*         data
-                            )
+void bli_dgemmtrsm_u_opt_mxn
+     (
+       dim_t               k,
+       double*    restrict alpha,
+       double*    restrict a10,
+       double*    restrict a11,
+       double*    restrict b01,
+       double*    restrict b11,
+       double*    restrict c11, inc_t rs_c, inc_t cs_c,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     )
 {
 /*
  Template gemmtrsm_u micro-kernel implementation
@@ -96,111 +104,19 @@ void bli_dgemmtrsm_u_opt_mxn(
  B11 is MR x NR, and alpha is a scalar. Here, inv() denotes matrix
  inverse.

-  Parameters:
+  For more info, please refer to the BLIS website's wiki on kernels:

-  - k:      The number of columns of A12 and rows of B21.
-  - alpha:  The address of a scalar to be applied to B11.
-  - a12:    The address of A12, which is the MR x k submatrix of the packed
-            micro-panel of A that is situated to the right of the MR x MR
-            triangular submatrix A11. A12 is stored by columns with leading
-            dimension PACKMR, where typically PACKMR = MR.
-  - a11:    The address of A11, which is the MR x MR upper triangular
-            submatrix within the packed micro-panel of matrix A that is
-            situated to the left of A12. A11 is stored by columns with
-            leading dimension PACKMR, where typically PACKMR = MR. Note
-            that A11 contains elements in both triangles, though elements
-            in the unstored triangle are not guaranteed to be zero and
-            thus should not be referenced.
-  - b21:    The address of B21, which is the k x NR submatrix of the packed
-            micro-panel of B that is situated above the MR x NR submatrix
-            B11. B01 is stored by rows with leading dimension PACKNR, where
-            typically PACKNR = NR.
-  - b11:    The address B11, which is the MR x NR submatrix of the packed
-            micro-panel of B, situated below B01. B11 is stored by rows
-            with leading dimension PACKNR, where typically PACKNR = NR.
-  - c11:    The address of C11, which is the MR x NR submatrix of matrix
-            C, stored according to rs_c and cs_c. C11 is the submatrix
-            within C that corresponds to the elements which were packed
-            into B11. Thus, C is the original input matrix B to the overall
-            trsm operation.
-  - rs_c:   The row stride of C11 (ie: the distance to the next row of C11,
-            in units of matrix elements).
-  - cs_c:   The column stride of C11 (ie: the distance to the next column of
-            C11, in units of matrix elements).
-  - data:   The address of an auxinfo_t object that contains auxiliary
-            information that may be useful when optimizing the gemmtrsm
-            micro-kernel implementation. (See BLIS KernelsHowTo wiki for
-            more info.)
+    https://github.com/flame/blis/wiki/KernelsHowTo

-  Diagram for gemmtrsm_u
-
-  The diagram below shows the packed micro-panel operands for trsm_l and
-  how elements of each would be stored when MR = NR = 4. (The hex digits
-  indicate the layout and order (but NOT the numeric contents) in memory.
-  Here, matrix A11 (referenced by a11) is upper triangular. Matrix A11
-  does contain elements corresponding to the strictly lower triangle,
-  however, they are not guaranteed to contain zeros and thus these elements
-  should not be referenced.
-
-       a11:     a12:                          NR    
-       ________ ___________________         _______ 
-      |`.      |0 4 8              |   b11:|0 1 2 3|
-  MR  |  `.    |1 5 9 . . .        |       |4 5 6 7|
-      |    `.  |2 6 A              |    MR |8 9 A B|
-      |______`.|3_7_B______________|       |___.___|
-                                       b21:|   .   |
-          MR             k                 |   .   |
-                                           |       |
-                                           |       |
-    NOTE: Storage digits are shown       k |       |
-    starting with a12 to avoid             |       |
-    obscuring triangular structure         |       |
-    of a11.                                |_______|
-                                                    
-
-  Implementation Notes for gemmtrsm
-
-  - Register blocksizes. See Implementation Notes for gemm.
-  - Leading dimensions of a1 and b1: PACKMR and PACKNR. See Implementation
-    Notes for gemm.
-  - Edge cases in MR, NR dimensions. See Implementation Notes for gemm.
-  - Alignment of a1 and b1. The addresses a1 and b1 are aligned according
-    to PACKMR*sizeof(type) and PACKNR*sizeof(type), respectively.
-  - Unrolling loops. Most optimized implementations should unroll all
-    three loops within the trsm subproblem of gemmtrsm. See Implementation
-    Notes for gemm for remarks on unrolling the gemm subproblem.
-  - Prefetching next micro-panels of A and B. When invoked from within a
-    gemmtrsm_l micro-kernel, the addresses accessible via
-    bli_auxinfo_next_a() and bli_auxinfo_next_b() refer to the next
-    invocation's a10 and b01, respectively, while in gemmtrsm_u, the
-    _next_a() and _next_b() macros return the addresses of the next
-    invocation's a11 and b11 (since those submatrices precede a12 and b21).
-    (See BLIS KernelsHowTo wiki for more info.)
-  - Zero alpha. The micro-kernel can safely assume that alpha is non-zero;
-    "alpha equals zero" handling is performed at a much higher level,
-    which means that, in such a scenario, the micro-kernel will never get
-    called.
-  - Diagonal elements of A11. See Implementation Notes for trsm.
-  - Zero elements of A11. See Implementation Notes for trsm.
-  - Output. See Implementation Notes for trsm.
-  - Optimization. Let's assume that the gemm micro-kernel has already been
-    optimized. You have two options with regard to optimizing the fused
-    gemmtrsm micro-kernels:
-    (1) Optimize only the trsm micro-kernels. This will result in the gemm
-        and trsm_l micro-kernels being called in sequence. (Likewise for
-        gemm and trsm_u.)
-    (2) Fuse the implementation of the gemm micro-kernel with that of the
-        trsm micro-kernels by inlining both into the gemmtrsm_l and
-        gemmtrsm_u micro-kernel definitions. This option is more labor-
-        intensive, but also more likely to yield higher performance because
-        it avoids redundant memory operations on the packed MR x NR
-        submatrix B11.
-
-  For more info, please refer to the BLIS website and/or contact the
-  blis-devel mailing list.
+  and/or contact the blis-devel mailing list.

+  -FGVZ
 */
-	const inc_t        rs_b      = bli_dpacknr;
+	const num_t        dt        = BLIS_DOUBLE;
+
+	const inc_t        packnr    = bli_cntx_get_blksz_max_dt( dt, BLIS_NR, cntx );
+
+	const inc_t        rs_b      = packnr;
 	const inc_t        cs_b      = 1;

 	double*   restrict minus_one = bli_dm1;
@@ -224,18 +140,24 @@ void bli_dgemmtrsm_u_opt_mxn(



-void bli_cgemmtrsm_u_opt_mxn(
-                              dim_t              k,
-                              scomplex* restrict alpha,
-                              scomplex* restrict a12,
-                              scomplex* restrict a11,
-                              scomplex* restrict b21,
-                              scomplex* restrict b11,
-                              scomplex* restrict c11, inc_t rs_c, inc_t cs_c,
-                              auxinfo_t*         data
-                            )
+void bli_cgemmtrsm_u_opt_mxn
+     (
+       dim_t               k,
+       scomplex*  restrict alpha,
+       scomplex*  restrict a10,
+       scomplex*  restrict a11,
+       scomplex*  restrict b01,
+       scomplex*  restrict b11,
+       scomplex*  restrict c11, inc_t rs_c, inc_t cs_c,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     )
 {
-	const inc_t        rs_b      = bli_cpacknr;
+	const num_t        dt        = BLIS_SCOMPLEX;
+
+	const inc_t        packnr    = bli_cntx_get_blksz_max_dt( dt, BLIS_NR, cntx );
+
+	const inc_t        rs_b      = packnr;
 	const inc_t        cs_b      = 1;

 	scomplex* restrict minus_one = bli_cm1;
@@ -257,18 +179,24 @@ void bli_cgemmtrsm_u_opt_mxn(



-void bli_zgemmtrsm_u_opt_mxn(
-                              dim_t              k,
-                              dcomplex* restrict alpha,
-                              dcomplex* restrict a12,
-                              dcomplex* restrict a11,
-                              dcomplex* restrict b21,
-                              dcomplex* restrict b11,
-                              dcomplex* restrict c11, inc_t rs_c, inc_t cs_c,
-                              auxinfo_t*         data
-                            )
+void bli_zgemmtrsm_u_opt_mxn
+     (
+       dim_t               k,
+       dcomplex*  restrict alpha,
+       dcomplex*  restrict a10,
+       dcomplex*  restrict a11,
+       dcomplex*  restrict b01,
+       dcomplex*  restrict b11,
+       dcomplex*  restrict c11, inc_t rs_c, inc_t cs_c,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     )
 {
-	const inc_t        rs_b      = bli_zpacknr;
+	const num_t        dt        = BLIS_DCOMPLEX;
+
+	const inc_t        packnr    = bli_cntx_get_blksz_max_dt( dt, BLIS_NR, cntx );
+
+	const inc_t        rs_b      = packnr;
 	const inc_t        cs_b      = 1;

 	dcomplex* restrict minus_one = bli_zm1;
--- a/config/template/kernels/3/bli_trsm_l_opt_mxn.c
+++ b/config/template/kernels/3/bli_trsm_l_opt_mxn.c
@@ -36,28 +36,36 @@



-void bli_strsm_l_opt_mxn(
-                          float*    restrict a11,
-                          float*    restrict b11,
-                          float*    restrict c11, inc_t rs_c, inc_t cs_c,
-                          auxinfo_t*         data
-                        )
+void bli_strsm_l_opt_mxn
+     (
+       float*     restrict a11,
+       float*     restrict b11,
+       float*     restrict c11, inc_t rs_c, inc_t cs_c,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_STRSM_L_UKERNEL_REF( a11,
-	                     b11,
-	                     c11, rs_c, cs_c,
-	                     data );
+	BLIS_STRSM_L_UKERNEL_REF
+	(
+	  a11,
+	  b11,
+	  c11, rs_c, cs_c,
+	  data,
+	  cntx
+	);
 }



-void bli_dtrsm_l_opt_mxn(
-                          double*   restrict a11,
-                          double*   restrict b11,
-                          double*   restrict c11, inc_t rs_c, inc_t cs_c,
-                          auxinfo_t*         data
-                        )
+void bli_dtrsm_l_opt_mxn
+     (
+       double*    restrict a11,
+       double*    restrict b11,
+       double*    restrict c11, inc_t rs_c, inc_t cs_c,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     )
 {
 /*
  Template trsm_l micro-kernel implementation
@@ -76,80 +84,28 @@ void bli_dtrsm_l_opt_mxn(
  where A11 is MR x MR and lower triangular, B11 is MR x NR, and C11 is
  MR x NR.

-  Parameters:
+  For more info, please refer to the BLIS website's wiki on kernels:

-  - a11:    The address of A11, which is the MR x MR lower triangular
-            submatrix within the packed micro-panel of matrix A. A11 is
-            stored by columns with leading dimension PACKMR, where
-            typically PACKMR = MR. Note that A11 contains elements in both
-            triangles, though elements in the unstored triangle are not
-            guaranteed to be zero and thus should not be referenced.
-  - b11:    The address of B11, which is an MR x NR submatrix of the
-            packed micro-panel of B. B11 is stored by rows with leading
-            dimension PACKNR, where typically PACKNR = NR.
-  - c11:    The address of C11, which is an MR x NR submatrix of matrix C,
-            stored according to rs_c and cs_c. C11 is the submatrix within
-            C that corresponds to the elements which were packed into B11.
-            Thus, C is the original input matrix B to the overall trsm
-            operation.
-  - rs_c:   The row stride of C11 (ie: the distance to the next row of C11,
-            in units of matrix elements).
-  - cs_c:   The column stride of C11 (ie: the distance to the next column of
-            C11, in units of matrix elements).
-  - data:   The address of an auxinfo_t object that contains auxiliary
-            information that may be useful when optimizing the trsm
-            micro-kernel implementation. (See BLIS KernelsHowTo wiki for
-            more info.)
+    https://github.com/flame/blis/wiki/KernelsHowTo

-  Diagrams for trsm
-
-  Please see the diagram for gemmtrsm_l to see depiction of the trsm_l and
-  where it fits in with its preceding gemm subproblem.
-
-  Implementation Notes for trsm
-
-  - Register blocksizes. See Implementation Notes for gemm.
-  - Leading dimensions of a11 and b11: PACKMR and PACKNR. See
-    Implementation Notes for gemm.
-  - Edge cases in MR, NR dimensions. See Implementation Notes for gemm.
-  - Alignment of a11 and b11. See Implementation Notes for gemmtrsm.
-  - Unrolling loops. Most optimized implementations should unroll all
-    three loops within the trsm micro-kernel.
-  - Prefetching next micro-panels of A and B. We advise against using
-    the bli_auxinfo_next_a() and bli_auxinfo_next_b() macros from within
-    the trsm_l and trsm_u micro-kernels, since the values returned usually
-    only make sense in the context of the overall gemmtrsm subproblem. 
-  - Diagonal elements of A11. At the time this micro-kernel is called,
-    the diagonal entries of triangular matrix A11 contain the inverse of
-    the original elements. This inversion is done during packing so that
-    we can avoid expensive division instructions within the micro-kernel
-    itself. If the diag parameter to the higher level trsm operation was
-    equal to BLIS_UNIT_DIAG, the diagonal elements will be explicitly
-    unit.
-  - Zero elements of A11. Since A11 is lower triangular (for trsm_l), the
-    strictly upper triangle implicitly contains zeros. Similarly, the
-    strictly lower triangle of A11 implicitly contains zeros when A11 is
-    upper triangular (for trsm_u). However, the packing function may or
-    may not actually write zeros to this region. Thus, while the
-    implementation may reference these elements, it should not use them
-    in any computation.
-  - Output. This micro-kernel must write its result to two places: the
-    submatrix B11 of the current packed micro-panel of B and the submatrix
-    C11 of the output matrix C.
-
-  For more info, please refer to the BLIS website and/or contact the
-  blis-devel mailing list.
+  and/or contact the blis-devel mailing list.

  -FGVZ
 */
-	const dim_t        m     = bli_dmr;
-	const dim_t        n     = bli_dnr;
+	const dim_t        mr     = bli_cntx_get_blksz_def_dt( dt, BLIS_MR, cntx );
+	const dim_t        nr     = bli_cntx_get_blksz_def_dt( dt, BLIS_NR, cntx );

-	const inc_t        rs_a  = 1;
-	const inc_t        cs_a  = bli_dpackmr;
+	const inc_t        packmr = bli_cntx_get_blksz_max_dt( dt, BLIS_MR, cntx );
+	const inc_t        packnr = bli_cntx_get_blksz_max_dt( dt, BLIS_NR, cntx );

-	const inc_t        rs_b  = bli_dpacknr;
-	const inc_t        cs_b  = 1;
+	const dim_t        m      = mr;
+	const dim_t        n      = nr;
+
+	const inc_t        rs_a   = 1;
+	const inc_t        cs_a   = packmr;
+
+	const inc_t        rs_b   = packnr;
+	const inc_t        cs_b   = 1;

 	dim_t              iter, i, j, l;
 	dim_t              n_behind;
@@ -208,33 +164,45 @@ void bli_dtrsm_l_opt_mxn(



-void bli_ctrsm_l_opt_mxn(
-                          scomplex* restrict a11,
-                          scomplex* restrict b11,
-                          scomplex* restrict c11, inc_t rs_c, inc_t cs_c,
-                          auxinfo_t*         data
-                        )
+void bli_ctrsm_l_opt_mxn
+     (
+       scomplex*  restrict a11,
+       scomplex*  restrict b11,
+       scomplex*  restrict c11, inc_t rs_c, inc_t cs_c,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_CTRSM_L_UKERNEL_REF( a11,
-	                     b11,
-	                     c11, rs_c, cs_c,
-	                     data );
+	BLIS_CTRSM_L_UKERNEL_REF
+	(
+	  a11,
+	  b11,
+	  c11, rs_c, cs_c,
+	  data,
+	  cntx
+	);
 }



-void bli_ztrsm_l_opt_mxn(
-                          dcomplex* restrict a11,
-                          dcomplex* restrict b11,
-                          dcomplex* restrict c11, inc_t rs_c, inc_t cs_c,
-                          auxinfo_t*         data
-                        )
+void bli_ztrsm_l_opt_mxn
+     (
+       dcomplex*  restrict a11,
+       dcomplex*  restrict b11,
+       dcomplex*  restrict c11, inc_t rs_c, inc_t cs_c,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_ZTRSM_L_UKERNEL_REF( a11,
-	                     b11,
-	                     c11, rs_c, cs_c,
-	                     data );
+	BLIS_ZTRSM_L_UKERNEL_REF
+	(
+	  a11,
+	  b11,
+	  c11, rs_c, cs_c,
+	  data,
+	  cntx
+	);
 }

--- a/config/template/kernels/3/bli_trsm_u_opt_mxn.c
+++ b/config/template/kernels/3/bli_trsm_u_opt_mxn.c
@@ -36,18 +36,24 @@



-void bli_strsm_u_opt_mxn(
-                          float*    restrict a11,
-                          float*    restrict b11,
-                          float*    restrict c11, inc_t rs_c, inc_t cs_c,
-                          auxinfo_t*         data
-                        )
+void bli_strsm_u_opt_mxn
+     (
+       float*     restrict a11,
+       float*     restrict b11,
+       float*     restrict c11, inc_t rs_c, inc_t cs_c,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_STRSM_U_UKERNEL_REF( a11,
-	                     b11,
-	                     c11, rs_c, cs_c,
-	                     data );
+	BLIS_STRSM_U_UKERNEL_REF
+	(
+	  a11,
+	  b11,
+	  c11, rs_c, cs_c,
+	  data,
+	  cntx
+	);
 }


@@ -58,6 +64,13 @@ void bli_dtrsm_u_opt_mxn(
                          double*   restrict c11, inc_t rs_c, inc_t cs_c,
                          auxinfo_t*         data
                        )
+     (
+       double*    restrict a11,
+       double*    restrict b11,
+       double*    restrict c11, inc_t rs_c, inc_t cs_c,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     )
 {
 /*
  Template trsm_u micro-kernel implementation
@@ -76,79 +89,28 @@ void bli_dtrsm_u_opt_mxn(
  where A11 is MR x MR and upper triangular, B11 is MR x NR, and C11 is
  MR x NR.

-  Parameters:
+  For more info, please refer to the BLIS website's wiki on kernels:

-  - a11:    The address of A11, which is the MR x MR upper triangular
-            submatrix within the packed micro-panel of matrix A. A11 is
-            stored by columns with leading dimension PACKMR, where
-            typically PACKMR = MR. Note that A11 contains elements in both
-            triangles, though elements in the unstored triangle are not
-            guaranteed to be zero and thus should not be referenced.
-  - b11:    The address of B11, which is an MR x NR submatrix of the
-            packed micro-panel of B. B11 is stored by rows with leading
-            dimension PACKNR, where typically PACKNR = NR.
-  - c11:    The address of C11, which is an MR x NR submatrix of matrix C,
-            stored according to rs_c and cs_c. C11 is the submatrix within
-            C that corresponds to the elements which were packed into B11.
-            Thus, C is the original input matrix B to the overall trsm
-            operation.
-  - rs_c:   The row stride of C11 (ie: the distance to the next row of C11,
-            in units of matrix elements).
-  - cs_c:   The column stride of C11 (ie: the distance to the next column of
-            C11, in units of matrix elements).
-  - data:   The address of an auxinfo_t object that contains auxiliary
-            information that may be useful when optimizing the trsm
-            micro-kernel implementation. (See BLIS KernelsHowTo wiki for
-            more info.)
+    https://github.com/flame/blis/wiki/KernelsHowTo

-  Diagrams for trsm
-
-  Please see the diagram for gemmtrsm_u to see depiction of the trsm_u and
-  where it fits in with its preceding gemm subproblem.
-
-  Implementation Notes for trsm
-
-  - Register blocksizes. See Implementation Notes for gemm.
-  - Leading dimensions of a11 and b11: PACKMR and PACKNR. See
-    Implementation Notes for gemm.
-  - Edge cases in MR, NR dimensions. See Implementation Notes for gemm.
-  - Alignment of a11 and b11. See Implementation Notes for gemmtrsm.
-  - Unrolling loops. Most optimized implementations should unroll all
-    three loops within the trsm micro-kernel.
-  - Prefetching next micro-panels of A and B. We advise against using
-    the bli_auxinfo_next_a() and bli_auxinfo_next_b() macros from within
-    the trsm_l and trsm_u micro-kernels, since the values returned usually
-    only make sense in the context of the overall gemmtrsm subproblem. 
-  - Diagonal elements of A11. At the time this micro-kernel is called,
-    the diagonal entries of triangular matrix A11 contain the inverse of
-    the original elements. This inversion is done during packing so that
-    we can avoid expensive division instructions within the micro-kernel
-    itself. If the diag parameter to the higher level trsm operation was
-    equal to BLIS_UNIT_DIAG, the diagonal elements will be explicitly
-    unit.
-  - Zero elements of A11. Since A11 is lower triangular (for trsm_l), the
-    strictly upper triangle implicitly contains zeros. Similarly, the
-    strictly lower triangle of A11 implicitly contains zeros when A11 is
-    upper triangular (for trsm_u). However, the packing function may or
-    may not actually write zeros to this region. Thus, the implementation
-    should not reference these elements.
-  - Output. This micro-kernel must write its result to two places: the
-    submatrix B11 of the current packed micro-panel of B and the submatrix
-    C11 of the output matrix C.
-
-  For more info, please refer to the BLIS website and/or contact the
-  blis-devel mailing list.
+  and/or contact the blis-devel mailing list.

  -FGVZ
 */
-	const dim_t        m     = bli_dmr;
-	const dim_t        n     = bli_dnr;
+	const dim_t        mr     = bli_cntx_get_blksz_def_dt( dt, BLIS_MR, cntx );
+	const dim_t        nr     = bli_cntx_get_blksz_def_dt( dt, BLIS_NR, cntx );

-	const inc_t        rs_a  = 1;
-	const inc_t        cs_a  = bli_dpackmr;
+	const inc_t        packmr = bli_cntx_get_blksz_max_dt( dt, BLIS_MR, cntx );
+	const inc_t        packnr = bli_cntx_get_blksz_max_dt( dt, BLIS_NR, cntx );

-	const inc_t        rs_b  = bli_dpacknr;
-	const inc_t        cs_b  = 1;
+	const dim_t        m      = mr;
+	const dim_t        n      = nr;
+
+	const inc_t        rs_a   = 1;
+	const inc_t        cs_a   = packmr;
+
+	const inc_t        rs_b   = packnr;
+	const inc_t        cs_b   = 1;

 	dim_t              iter, i, j, l;
 	dim_t              n_behind;
@@ -207,33 +169,45 @@ void bli_dtrsm_u_opt_mxn(



-void bli_ctrsm_u_opt_mxn(
-                          scomplex* restrict a11,
-                          scomplex* restrict b11,
-                          scomplex* restrict c11, inc_t rs_c, inc_t cs_c,
-                          auxinfo_t*         data
-                        )
+void bli_ctrsm_u_opt_mxn
+     (
+       scomplex*  restrict a11,
+       scomplex*  restrict b11,
+       scomplex*  restrict c11, inc_t rs_c, inc_t cs_c,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_CTRSM_U_UKERNEL_REF( a11,
-	                     b11,
-	                     c11, rs_c, cs_c,
-	                     data );
+	BLIS_CTRSM_U_UKERNEL_REF
+	(
+	  a11,
+	  b11,
+	  c11, rs_c, cs_c,
+	  data,
+	  cntx
+	);
 }



-void bli_ztrsm_u_opt_mxn(
-                          dcomplex* restrict a11,
-                          dcomplex* restrict b11,
-                          dcomplex* restrict c11, inc_t rs_c, inc_t cs_c,
-                          auxinfo_t*         data
-                        )
+void bli_ztrsm_u_opt_mxn
+     (
+       dcomplex*  restrict a11,
+       dcomplex*  restrict b11,
+       dcomplex*  restrict c11, inc_t rs_c, inc_t cs_c,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     )
 {
 	/* Just call the reference implementation. */
-	BLIS_ZTRSM_U_UKERNEL_REF( a11,
-	                     b11,
-	                     c11, rs_c, cs_c,
-	                     data );
+	BLIS_ZTRSM_U_UKERNEL_REF
+	(
+	  a11,
+	  b11,
+	  c11, rs_c, cs_c,
+	  data,
+	  cntx
+	);
 }

--- a/frame/1f/dotxaxpyf/bli_dotxaxpyf_fusefac.h
+++ b/frame/1f/dotxaxpyf/bli_dotxaxpyf_fusefac.h
@@ -32,8 +32,10 @@

 */

-//
-// Prototype object-based fusing factor query routine.
-//
-dim_t bli_dotxaxpyf_fusefac( num_t dt );
+#include "bli_l0_check.h"

+#include "bli_l0_oapi.h"
+#include "bli_l0_tapi.h"
+
+// copysc
+#include "bli_copysc.h"
--- a/frame/0/bli_l0_check.c
+++ b/frame/0/bli_l0_check.c
@@ -0,0 +1,314 @@
+/*
+
+   BLIS    
+   An object-based framework for developing high-performance BLAS-like
+   libraries.
+
+   Copyright (C) 2014, The University of Texas at Austin
+
+   Redistribution and use in source and binary forms, with or without
+   modification, are permitted provided that the following conditions are
+   met:
+    - Redistributions of source code must retain the above copyright
+      notice, this list of conditions and the following disclaimer.
+    - Redistributions in binary form must reproduce the above copyright
+      notice, this list of conditions and the following disclaimer in the
+      documentation and/or other materials provided with the distribution.
+    - Neither the name of The University of Texas at Austin nor the names
+      of its contributors may be used to endorse or promote products
+      derived from this software without specific prior written permission.
+
+   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+   HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+*/
+
+#include "blis.h"
+
+//
+// Define object-based check functions.
+//
+
+#undef  GENFRONT
+#define GENFRONT( opname ) \
+\
+void PASTEMAC(opname,_check) \
+     ( \
+       obj_t*  chi, \
+       obj_t*  psi  \
+     ) \
+{ \
+	bli_l0_xxsc_check( chi, psi ); \
+}
+
+GENFRONT( addsc )
+GENFRONT( copysc )
+GENFRONT( divsc )
+GENFRONT( mulsc )
+GENFRONT( sqrtsc )
+GENFRONT( subsc )
+
+
+#undef  GENFRONT
+#define GENFRONT( opname ) \
+\
+void PASTEMAC(opname,_check) \
+     ( \
+       obj_t*  chi, \
+       obj_t*  norm  \
+     ) \
+{ \
+	bli_l0_xx2sc_check( chi, norm ); \
+}
+
+GENFRONT( absqsc )
+GENFRONT( normfsc )
+
+
+void bli_getsc_check
+     (
+       obj_t*  chi,
+       double* zeta_r,
+       double* zeta_i 
+     )
+{
+	err_t e_val;
+
+	// Check object datatypes.
+
+	e_val = bli_check_noninteger_object( chi );
+	bli_check_error_code( e_val );
+
+	// Check object dimensions.
+
+	e_val = bli_check_scalar_object( chi );
+	bli_check_error_code( e_val );
+
+	// Check object buffers (for non-NULLness).
+
+	e_val = bli_check_object_buffer( chi );
+	bli_check_error_code( e_val );
+}
+
+
+void bli_setsc_check
+     (
+       double  zeta_r,
+       double  zeta_i,
+       obj_t*  chi 
+     )
+{
+	err_t e_val;
+
+	// Check object datatypes.
+
+	e_val = bli_check_floating_object( chi );
+	bli_check_error_code( e_val );
+
+	// Check object dimensions.
+
+	e_val = bli_check_scalar_object( chi );
+	bli_check_error_code( e_val );
+
+	// Check object buffers (for non-NULLness).
+
+	e_val = bli_check_object_buffer( chi );
+	bli_check_error_code( e_val );
+}
+
+
+void bli_unzipsc_check
+     (
+       obj_t*  chi,
+       obj_t*  zeta_r,
+       obj_t*  zeta_i 
+     )
+{
+	err_t e_val;
+
+	// Check object datatypes.
+
+	e_val = bli_check_noninteger_object( chi );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_real_object( zeta_r );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_real_object( zeta_i );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_nonconstant_object( zeta_r );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_nonconstant_object( zeta_i );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_object_real_proj_of( chi, zeta_r );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_object_real_proj_of( chi, zeta_i );
+	bli_check_error_code( e_val );
+
+	// Check object dimensions.
+
+	e_val = bli_check_scalar_object( chi );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_scalar_object( zeta_r );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_scalar_object( zeta_i );
+	bli_check_error_code( e_val );
+
+	// Check object buffers (for non-NULLness).
+
+	e_val = bli_check_object_buffer( chi );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_object_buffer( zeta_r );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_object_buffer( zeta_i );
+	bli_check_error_code( e_val );
+}
+
+
+void bli_zipsc_check
+     (
+       obj_t*  zeta_r,
+       obj_t*  zeta_i,
+       obj_t*  chi 
+     )
+{
+	err_t e_val;
+
+	// Check object datatypes.
+
+	e_val = bli_check_real_object( zeta_r );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_real_object( zeta_i );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_noninteger_object( chi );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_nonconstant_object( chi );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_object_real_proj_of( chi, zeta_r );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_object_real_proj_of( chi, zeta_i );
+	bli_check_error_code( e_val );
+
+	// Check object dimensions.
+
+	e_val = bli_check_scalar_object( zeta_r );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_scalar_object( zeta_i );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_scalar_object( chi );
+	bli_check_error_code( e_val );
+
+	// Check object buffers (for non-NULLness).
+
+	e_val = bli_check_object_buffer( zeta_r );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_object_buffer( zeta_i );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_object_buffer( chi );
+	bli_check_error_code( e_val );
+}
+
+
+// -----------------------------------------------------------------------------
+
+void bli_l0_xxsc_check
+     (
+       obj_t*  chi,
+       obj_t*  psi 
+     )
+{
+	err_t e_val;
+
+	// Check object datatypes.
+
+	e_val = bli_check_noninteger_object( chi );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_noninteger_object( psi );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_nonconstant_object( psi );
+	bli_check_error_code( e_val );
+
+	// Check object dimensions.
+
+	e_val = bli_check_scalar_object( chi );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_scalar_object( psi );
+	bli_check_error_code( e_val );
+
+	// Check object buffers (for non-NULLness).
+
+	e_val = bli_check_object_buffer( chi );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_object_buffer( psi );
+	bli_check_error_code( e_val );
+}
+
+void bli_l0_xx2sc_check
+     (
+       obj_t*  chi,
+       obj_t*  absq 
+     )
+{
+	err_t e_val;
+
+	// Check object datatypes.
+
+	e_val = bli_check_noninteger_object( chi );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_nonconstant_object( absq );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_real_object( absq );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_object_real_proj_of( chi, absq );
+	bli_check_error_code( e_val );
+
+	// Check object dimensions.
+
+	e_val = bli_check_scalar_object( chi );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_scalar_object( absq );
+	bli_check_error_code( e_val );
+
+	// Check object buffers (for non-NULLness).
+
+	e_val = bli_check_object_buffer( chi );
+	bli_check_error_code( e_val );
+
+	e_val = bli_check_object_buffer( absq );
+	bli_check_error_code( e_val );
+}
+
--- a/frame/0/bli_l0_check.h
+++ b/frame/0/bli_l0_check.h
@@ -0,0 +1,134 @@
+/*
+
+   BLIS    
+   An object-based framework for developing high-performance BLAS-like
+   libraries.
+
+   Copyright (C) 2014, The University of Texas at Austin
+
+   Redistribution and use in source and binary forms, with or without
+   modification, are permitted provided that the following conditions are
+   met:
+    - Redistributions of source code must retain the above copyright
+      notice, this list of conditions and the following disclaimer.
+    - Redistributions in binary form must reproduce the above copyright
+      notice, this list of conditions and the following disclaimer in the
+      documentation and/or other materials provided with the distribution.
+    - Neither the name of The University of Texas at Austin nor the names
+      of its contributors may be used to endorse or promote products
+      derived from this software without specific prior written permission.
+
+   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+   HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+*/
+
+
+//
+// Prototype object-based check functions.
+//
+
+#undef  GENTPROT
+#define GENTPROT( opname ) \
+\
+void PASTEMAC(opname,_check) \
+     ( \
+       obj_t*  chi, \
+       obj_t*  psi  \
+     );
+
+GENTPROT( addsc )
+GENTPROT( copysc )
+GENTPROT( divsc )
+GENTPROT( mulsc )
+GENTPROT( sqrtsc )
+GENTPROT( subsc )
+
+
+#undef  GENTPROT
+#define GENTPROT( opname ) \
+\
+void PASTEMAC(opname,_check) \
+     ( \
+       obj_t*  chi, \
+       obj_t*  absq  \
+     );
+
+GENTPROT( absqsc )
+GENTPROT( normfsc )
+
+
+#undef  GENTPROT
+#define GENTPROT( opname ) \
+\
+void PASTEMAC(opname,_check) \
+     ( \
+       obj_t*  chi, \
+       double* zeta_r, \
+       double* zeta_i  \
+     );
+
+GENTPROT( getsc )
+
+
+#undef  GENTPROT
+#define GENTPROT( opname ) \
+\
+void PASTEMAC(opname,_check) \
+     ( \
+       double  zeta_r, \
+       double  zeta_i, \
+       obj_t*  chi  \
+     );
+
+GENTPROT( setsc )
+
+
+#undef  GENTPROT
+#define GENTPROT( opname ) \
+\
+void PASTEMAC(opname,_check) \
+     ( \
+       obj_t*  chi, \
+       obj_t*  zeta_r, \
+       obj_t*  zeta_i  \
+     );
+
+GENTPROT( unzipsc )
+
+
+#undef  GENTPROT
+#define GENTPROT( opname ) \
+\
+void PASTEMAC(opname,_check) \
+     ( \
+       obj_t*  zeta_r, \
+       obj_t*  zeta_i, \
+       obj_t*  chi  \
+     );
+
+GENTPROT( zipsc )
+
+
+// -----------------------------------------------------------------------------
+
+void bli_l0_xxsc_check
+     (
+       obj_t*  chi,
+       obj_t*  psi 
+     );
+
+void bli_l0_xx2sc_check
+     (
+       obj_t*  chi,
+       obj_t*  norm 
+     );
--- a/frame/0/bli_l0_oapi.c
+++ b/frame/0/bli_l0_oapi.c
@@ -0,0 +1,288 @@
+/*
+
+   BLIS    
+   An object-based framework for developing high-performance BLAS-like
+   libraries.
+
+   Copyright (C) 2014, The University of Texas at Austin
+
+   Redistribution and use in source and binary forms, with or without
+   modification, are permitted provided that the following conditions are
+   met:
+    - Redistributions of source code must retain the above copyright
+      notice, this list of conditions and the following disclaimer.
+    - Redistributions in binary form must reproduce the above copyright
+      notice, this list of conditions and the following disclaimer in the
+      documentation and/or other materials provided with the distribution.
+    - Neither the name of The University of Texas at Austin nor the names
+      of its contributors may be used to endorse or promote products
+      derived from this software without specific prior written permission.
+
+   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+   HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+*/
+
+#include "blis.h"
+
+//
+// Define object-based interfaces.
+//
+
+#undef  GENFRONT
+#define GENFRONT( opname ) \
+\
+void PASTEMAC0(opname) \
+     ( \
+       obj_t*  chi, \
+       obj_t*  absq  \
+     ) \
+{ \
+	num_t     dt_chi; \
+	num_t     dt_absq_c  = bli_obj_datatype_proj_to_complex( *absq ); \
+\
+    void*     buf_chi; \
+    void*     buf_absq   = bli_obj_buffer_at_off( *absq ); \
+\
+	if ( bli_error_checking_is_enabled() ) \
+	    PASTEMAC(opname,_check)( chi, absq ); \
+\
+	/* If chi is a scalar constant, use dt_absq_c to extract the address of the
+	   corresponding constant value; otherwise, use the datatype encoded
+	   within the chi object and extract the buffer at the chi offset. */ \
+	bli_set_scalar_dt_buffer( chi, dt_absq_c, dt_chi, buf_chi ); \
+\
+	/* Invoke the typed function. */ \
+	bli_call_ft_2 \
+	( \
+	   dt_chi, \
+	   opname, \
+	   buf_chi, \
+	   buf_absq  \
+	); \
+}
+
+GENFRONT( absqsc )
+GENFRONT( normfsc )
+
+
+#undef  GENFRONT
+#define GENFRONT( opname ) \
+\
+void PASTEMAC0(opname) \
+     ( \
+       obj_t*  chi, \
+       obj_t*  psi  \
+     ) \
+{ \
+	num_t     dt        = bli_obj_datatype( *psi ); \
+\
+	conj_t    conjchi   = bli_obj_conj_status( *chi ); \
+\
+    void*     buf_chi   = bli_obj_buffer_for_1x1( dt, *chi ); \
+    void*     buf_psi   = bli_obj_buffer_at_off( *psi ); \
+\
+	if ( bli_error_checking_is_enabled() ) \
+	    PASTEMAC(opname,_check)( chi, psi ); \
+\
+	/* Invoke the typed function. */ \
+	bli_call_ft_3 \
+	( \
+	   dt, \
+	   opname, \
+	   conjchi, \
+	   buf_chi, \
+	   buf_psi  \
+	); \
+}
+
+GENFRONT( addsc )
+GENFRONT( divsc )
+GENFRONT( mulsc )
+GENFRONT( subsc )
+
+
+#undef  GENFRONT
+#define GENFRONT( opname ) \
+\
+void PASTEMAC0(opname) \
+     ( \
+       obj_t*  chi, \
+       obj_t*  psi  \
+     ) \
+{ \
+	num_t     dt        = bli_obj_datatype( *psi ); \
+\
+    void*     buf_chi   = bli_obj_buffer_for_1x1( dt, *chi ); \
+	void*     buf_psi   = bli_obj_buffer_at_off( *psi ); \
+\
+	if ( bli_error_checking_is_enabled() ) \
+	    PASTEMAC(opname,_check)( chi, psi ); \
+\
+	/* Invoke the typed function. */ \
+	bli_call_ft_2 \
+	( \
+	   dt, \
+	   opname, \
+	   buf_chi, \
+	   buf_psi  \
+	); \
+}
+
+GENFRONT( sqrtsc )
+
+
+#undef  GENFRONT
+#define GENFRONT( opname ) \
+\
+void PASTEMAC0(opname) \
+     ( \
+       obj_t*  chi, \
+       double* zeta_r, \
+       double* zeta_i  \
+     ) \
+{ \
+	num_t     dt_chi    = bli_obj_datatype( *chi ); \
+	num_t     dt_def    = BLIS_DCOMPLEX; \
+	num_t     dt_use; \
+\
+	/* If chi is a constant object, default to using the dcomplex
+	   value to maximize precision, and since we don't know if the
+	   caller needs just the real or the real and imaginary parts. */ \
+	void*     buf_chi   = bli_obj_buffer_for_1x1( dt_def, *chi ); \
+\
+	if ( bli_error_checking_is_enabled() ) \
+	    PASTEMAC(opname,_check)( chi, zeta_r, zeta_i ); \
+\
+	/* The _check() routine prevents integer types, so we know that chi
+	   is either a constant or an actual floating-point type. */ \
+	if ( bli_is_constant( dt_chi ) ) dt_use = dt_def; \
+	else                             dt_use = dt_chi; \
+\
+	/* Invoke the typed function. */ \
+	bli_call_ft_3 \
+	( \
+	   dt_use, \
+	   opname, \
+	   buf_chi, \
+	   zeta_r, \
+	   zeta_i  \
+	); \
+}
+
+GENFRONT( getsc )
+
+
+#undef  GENFRONT
+#define GENFRONT( opname ) \
+\
+void PASTEMAC0(opname) \
+     ( \
+       double  zeta_r, \
+       double  zeta_i, \
+       obj_t*  chi  \
+     ) \
+{ \
+	num_t     dt_chi    = bli_obj_datatype( *chi ); \
+\
+	void*     buf_chi   = bli_obj_buffer_at_off( *chi ); \
+\
+	if ( bli_error_checking_is_enabled() ) \
+	    PASTEMAC(opname,_check)( zeta_r, zeta_i, chi ); \
+\
+	/* Invoke the typed function. */ \
+	bli_call_ft_3 \
+	( \
+	   dt_chi, \
+	   opname, \
+	   zeta_r, \
+	   zeta_i, \
+	   buf_chi  \
+	); \
+}
+
+GENFRONT( setsc )
+
+
+#undef  GENFRONT
+#define GENFRONT( opname ) \
+\
+void PASTEMAC0(opname) \
+     ( \
+       obj_t*  chi, \
+       obj_t*  zeta_r, \
+       obj_t*  zeta_i  \
+     ) \
+{ \
+	num_t     dt_chi; \
+	num_t     dt_zeta_c   = bli_obj_datatype_proj_to_complex( *zeta_r ); \
+\
+    void*     buf_chi; \
+\
+    void*     buf_zeta_r  = bli_obj_buffer_at_off( *zeta_r ); \
+    void*     buf_zeta_i  = bli_obj_buffer_at_off( *zeta_i ); \
+\
+	if ( bli_error_checking_is_enabled() ) \
+	    PASTEMAC(opname,_check)( chi, zeta_r, zeta_i ); \
+\
+	/* If chi is a scalar constant, use dt_zeta_c to extract the address of the
+	   corresponding constant value; otherwise, use the datatype encoded
+	   within the chi object and extract the buffer at the chi offset. */ \
+	bli_set_scalar_dt_buffer( chi, dt_zeta_c, dt_chi, buf_chi ); \
+\
+	/* Invoke the typed function. */ \
+	bli_call_ft_3 \
+	( \
+	   dt_chi, \
+	   opname, \
+	   buf_chi, \
+	   buf_zeta_r, \
+	   buf_zeta_i  \
+	); \
+}
+
+GENFRONT( unzipsc )
+
+
+#undef  GENFRONT
+#define GENFRONT( opname ) \
+\
+void PASTEMAC0(opname) \
+     ( \
+       obj_t*  zeta_r, \
+       obj_t*  zeta_i, \
+       obj_t*  chi  \
+     ) \
+{ \
+	num_t     dt_chi      = bli_obj_datatype( *chi ); \
+\
+    void*     buf_zeta_r  = bli_obj_buffer_for_1x1( dt_chi, *zeta_r ); \
+    void*     buf_zeta_i  = bli_obj_buffer_for_1x1( dt_chi, *zeta_i ); \
+\
+    void*     buf_chi     = bli_obj_buffer_at_off( *chi ); \
+\
+	if ( bli_error_checking_is_enabled() ) \
+	    PASTEMAC(opname,_check)( zeta_r, zeta_i, chi ); \
+\
+	/* Invoke the typed function. */ \
+	bli_call_ft_3 \
+	( \
+	   dt_chi, \
+	   opname, \
+	   buf_zeta_r, \
+	   buf_zeta_i, \
+	   buf_chi  \
+	); \
+}
+
+GENFRONT( zipsc )
+
--- a/frame/0/bli_l0_oapi.h
+++ b/frame/0/bli_l0_oapi.h
@@ -0,0 +1,125 @@
+/*
+
+   BLIS    
+   An object-based framework for developing high-performance BLAS-like
+   libraries.
+
+   Copyright (C) 2014, The University of Texas at Austin
+
+   Redistribution and use in source and binary forms, with or without
+   modification, are permitted provided that the following conditions are
+   met:
+    - Redistributions of source code must retain the above copyright
+      notice, this list of conditions and the following disclaimer.
+    - Redistributions in binary form must reproduce the above copyright
+      notice, this list of conditions and the following disclaimer in the
+      documentation and/or other materials provided with the distribution.
+    - Neither the name of The University of Texas at Austin nor the names
+      of its contributors may be used to endorse or promote products
+      derived from this software without specific prior written permission.
+
+   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+   HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+*/
+
+
+//
+// Prototype object-based interfaces.
+//
+
+#undef  GENPROT
+#define GENPROT( opname ) \
+\
+void PASTEMAC0(opname) \
+     ( \
+       obj_t*  chi, \
+       obj_t*  absq  \
+     );
+
+GENPROT( absqsc )
+GENPROT( normfsc )
+
+
+#undef  GENPROT
+#define GENPROT( opname ) \
+\
+void PASTEMAC0(opname) \
+     ( \
+       obj_t*  chi, \
+       obj_t*  psi  \
+     );
+
+GENPROT( addsc )
+GENPROT( divsc )
+GENPROT( mulsc )
+GENPROT( sqrtsc )
+GENPROT( subsc )
+
+
+#undef  GENPROT
+#define GENPROT( opname ) \
+\
+void PASTEMAC0(opname) \
+     ( \
+       obj_t*  chi, \
+       double* zeta_r, \
+       double* zeta_i  \
+     );
+
+GENPROT( getsc )
+
+
+#undef  GENPROT
+#define GENPROT( opname ) \
+\
+void PASTEMAC0(opname) \
+     ( \
+       double  zeta_r, \
+       double  zeta_i, \
+       obj_t*  chi  \
+     );
+
+GENPROT( setsc )
+
+
+#undef  GENPROT
+#define GENPROT( opname ) \
+\
+void PASTEMAC0(opname) \
+     ( \
+       obj_t*  chi, \
+       obj_t*  zeta_r, \
+       obj_t*  zeta_i  \
+     );
+
+GENPROT( unzipsc )
+
+
+#undef  GENPROT
+#define GENPROT( opname ) \
+\
+void PASTEMAC0(opname) \
+     ( \
+       obj_t*  zeta_r, \
+       obj_t*  zeta_i, \
+       obj_t*  chi  \
+     );
+
+GENPROT( zipsc )
+
+
+
+
+
+
+
--- a/frame/0/bli_l0_tapi.c
+++ b/frame/0/bli_l0_tapi.c
@@ -0,0 +1,210 @@
+/*
+
+   BLIS    
+   An object-based framework for developing high-performance BLAS-like
+   libraries.
+
+   Copyright (C) 2014, The University of Texas at Austin
+
+   Redistribution and use in source and binary forms, with or without
+   modification, are permitted provided that the following conditions are
+   met:
+    - Redistributions of source code must retain the above copyright
+      notice, this list of conditions and the following disclaimer.
+    - Redistributions in binary form must reproduce the above copyright
+      notice, this list of conditions and the following disclaimer in the
+      documentation and/or other materials provided with the distribution.
+    - Neither the name of The University of Texas at Austin nor the names
+      of its contributors may be used to endorse or promote products
+      derived from this software without specific prior written permission.
+
+   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+   HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+*/
+
+#include "blis.h"
+
+//
+// Define BLAS-like interfaces with typed operands.
+//
+
+#undef  GENTFUNC
+#define GENTFUNC( ctype, ch, opname, kername ) \
+\
+void PASTEMAC(ch,opname) \
+     ( \
+       conj_t  conjchi, \
+       ctype*  chi, \
+       ctype*  psi  \
+     ) \
+{ \
+	ctype chi_conj; \
+\
+	PASTEMAC(ch,copycjs)( conjchi, *chi, chi_conj ); \
+	PASTEMAC(ch,kername)( chi_conj, *psi ); \
+}
+
+INSERT_GENTFUNC_BASIC( addsc, adds )
+INSERT_GENTFUNC_BASIC( divsc, invscals )
+INSERT_GENTFUNC_BASIC( subsc, subs )
+
+
+#undef  GENTFUNC
+#define GENTFUNC( ctype, ch, opname, kername ) \
+\
+void PASTEMAC(ch,opname) \
+     ( \
+       conj_t  conjchi, \
+       ctype*  chi, \
+       ctype*  psi  \
+     ) \
+{ \
+	if ( PASTEMAC(ch,eq0)( *chi ) ) \
+	{ \
+		/* Overwrite potential Infs and NaNs. */ \
+		PASTEMAC(ch,set0s)( *psi ); \
+	} \
+	else \
+	{ \
+		ctype chi_conj; \
+\
+		PASTEMAC(ch,copycjs)( conjchi, *chi, chi_conj ); \
+		PASTEMAC(ch,kername)( chi_conj, *psi ); \
+	} \
+}
+
+INSERT_GENTFUNC_BASIC( mulsc, scals )
+
+
+#undef  GENTFUNCR
+#define GENTFUNCR( ctype, ctype_r, ch, chr, opname ) \
+\
+void PASTEMAC(ch,opname) \
+     ( \
+       ctype*   chi, \
+       ctype_r* absq  \
+     ) \
+{ \
+    ctype_r chi_r; \
+    ctype_r chi_i; \
+    ctype_r absq_i; \
+\
+	( void )absq_i; \
+\
+    PASTEMAC2(ch,chr,gets)( *chi, chi_r, chi_i ); \
+\
+    /* absq   = chi_r * chi_r + chi_i * chi_i; \
+	   absq_i = 0.0; (thrown away) */ \
+	PASTEMAC(ch,absq2ris)( chi_r, chi_i, *absq, absq_i ); \
+\
+	( void )chi_i; \
+}
+
+INSERT_GENTFUNCR_BASIC0( absqsc )
+
+
+#undef  GENTFUNCR
+#define GENTFUNCR( ctype, ctype_r, ch, chr, opname ) \
+\
+void PASTEMAC(ch,opname) \
+     ( \
+       ctype*   chi, \
+       ctype_r* norm  \
+     ) \
+{ \
+    /* norm = sqrt( chi_r * chi_r + chi_i * chi_i ); */ \
+    PASTEMAC2(ch,chr,abval2s)( *chi, *norm ); \
+}
+
+INSERT_GENTFUNCR_BASIC0( normfsc )
+
+
+#undef  GENTFUNC
+#define GENTFUNC( ctype, ch, opname ) \
+\
+void PASTEMAC(ch,opname) \
+     ( \
+       ctype*  chi, \
+       ctype*  psi  \
+     ) \
+{ \
+	/* NOTE: sqrtsc/sqrt2s differs from normfsc/abval2s in the complex domain. */ \
+	PASTEMAC(ch,sqrt2s)( *chi, *psi ); \
+}
+
+INSERT_GENTFUNC_BASIC0( sqrtsc )
+
+
+#undef  GENTFUNC
+#define GENTFUNC( ctype, ch, opname ) \
+\
+void PASTEMAC(ch,opname) \
+     ( \
+       ctype*  chi, \
+       double* zeta_r, \
+       double* zeta_i  \
+     ) \
+{ \
+	PASTEMAC2(ch,d,gets)( *chi, *zeta_r, *zeta_i ); \
+}
+
+INSERT_GENTFUNC_BASIC0( getsc )
+
+
+#undef  GENTFUNC
+#define GENTFUNC( ctype, ch, opname ) \
+\
+void PASTEMAC(ch,opname) \
+     ( \
+       double  zeta_r, \
+       double  zeta_i, \
+       ctype*  chi  \
+     ) \
+{ \
+	PASTEMAC2(d,ch,sets)( zeta_r, zeta_i, *chi ); \
+}
+
+INSERT_GENTFUNC_BASIC0( setsc )
+
+
+#undef  GENTFUNCR
+#define GENTFUNCR( ctype, ctype_r, ch, chr, opname ) \
+\
+void PASTEMAC(ch,opname) \
+     ( \
+       ctype*   chi, \
+       ctype_r* zeta_r, \
+       ctype_r* zeta_i  \
+     ) \
+{ \
+	PASTEMAC2(ch,chr,gets)( *chi, *zeta_r, *zeta_i ); \
+}
+
+INSERT_GENTFUNCR_BASIC0( unzipsc )
+
+
+#undef  GENTFUNCR
+#define GENTFUNCR( ctype, ctype_r, ch, chr, opname ) \
+\
+void PASTEMAC(ch,opname) \
+     ( \
+       ctype_r* zeta_r, \
+       ctype_r* zeta_i, \
+       ctype*   chi  \
+     ) \
+{ \
+	PASTEMAC2(chr,ch,sets)( *zeta_r, *zeta_i, *chi ); \
+}
+
+INSERT_GENTFUNCR_BASIC0( zipsc )
+
--- a/frame/0/bli_l0_tapi.h
+++ b/frame/0/bli_l0_tapi.h
@@ -0,0 +1,131 @@
+/*
+
+   BLIS    
+   An object-based framework for developing high-performance BLAS-like
+   libraries.
+
+   Copyright (C) 2014, The University of Texas at Austin
+
+   Redistribution and use in source and binary forms, with or without
+   modification, are permitted provided that the following conditions are
+   met:
+    - Redistributions of source code must retain the above copyright
+      notice, this list of conditions and the following disclaimer.
+    - Redistributions in binary form must reproduce the above copyright
+      notice, this list of conditions and the following disclaimer in the
+      documentation and/or other materials provided with the distribution.
+    - Neither the name of The University of Texas at Austin nor the names
+      of its contributors may be used to endorse or promote products
+      derived from this software without specific prior written permission.
+
+   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+   HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+*/
+
+
+//
+// Prototype BLAS-like interfaces with typed operands.
+//
+
+#undef  GENTPROT
+#define GENTPROT( ctype, ch, opname ) \
+\
+void PASTEMAC(ch,opname) \
+     ( \
+       conj_t  conjchi, \
+       ctype*  chi, \
+       ctype*  psi  \
+     );
+
+INSERT_GENTPROT_BASIC( addsc )
+INSERT_GENTPROT_BASIC( divsc )
+INSERT_GENTPROT_BASIC( mulsc )
+INSERT_GENTPROT_BASIC( subsc )
+
+
+#undef  GENTPROTR
+#define GENTPROTR( ctype, ctype_r, ch, chr, opname ) \
+\
+void PASTEMAC(ch,opname) \
+     ( \
+       ctype*   chi, \
+       ctype_r* absq  \
+     );
+
+INSERT_GENTPROTR_BASIC( absqsc )
+INSERT_GENTPROTR_BASIC( normfsc )
+
+
+#undef  GENTPROT
+#define GENTPROT( ctype, ch, opname ) \
+\
+void PASTEMAC(ch,opname) \
+     ( \
+       ctype*  chi, \
+       ctype*  psi  \
+     );
+
+INSERT_GENTPROT_BASIC( sqrtsc )
+
+
+#undef  GENTPROT
+#define GENTPROT( ctype, ch, opname ) \
+\
+void PASTEMAC(ch,opname) \
+     ( \
+       ctype*  chi, \
+       double* zeta_r, \
+       double* zeta_i  \
+     );
+
+INSERT_GENTPROT_BASIC( getsc )
+
+
+#undef  GENTPROT
+#define GENTPROT( ctype, ch, opname ) \
+\
+void PASTEMAC(ch,opname) \
+     ( \
+       double  zeta_r, \
+       double  zeta_i, \
+       ctype*  chi  \
+     );
+
+INSERT_GENTPROT_BASIC( setsc )
+
+
+#undef  GENTPROTR
+#define GENTPROTR( ctype, ctype_r, ch, chr, opname ) \
+\
+void PASTEMAC(ch,opname) \
+     ( \
+       ctype*   chi, \
+       ctype_r* zeta_r, \
+       ctype_r* zeta_i  \
+     );
+
+INSERT_GENTPROTR_BASIC( unzipsc )
+
+
+#undef  GENTPROTR
+#define GENTPROTR( ctype, ctype_r, ch, chr, opname ) \
+\
+void PASTEMAC(ch,opname) \
+     ( \
+       ctype_r* zeta_r, \
+       ctype_r* zeta_i, \
+       ctype*   chi  \
+     );
+
+INSERT_GENTPROTR_BASIC( zipsc )
+
--- a/frame/0/copysc/bli_copysc.c
+++ b/frame/0/copysc/bli_copysc.c
@@ -34,66 +34,93 @@

 #include "blis.h"

+// NOTE: This is one of the few functions in BLIS that is defined
+// with heterogeneous type support. This is done so that we have
+// an operation that can be used to typecast (copy-cast) a scalar
+// of one datatype to a scalar of another datatype.
+
+typedef void (*FUNCPTR_T)(
+                           conj_t conjchi,
+                           void*  chi,
+                           void*  psi
+                         );
+
+static FUNCPTR_T GENARRAY2_ALL(ftypes,copysc);

 //
-// Define object-based interface.
+// Define object-based interfaces.
 //
-void bli_copysc( obj_t* chi,
-                 obj_t* psi )
-{
-	if ( bli_error_checking_is_enabled() )
-		bli_copysc_check( chi, psi );

-	bli_copysc_unb_var1( chi, psi );
-}
-
-
-//
-// Define BLAS-like interfaces with homogeneous-typed operands.
-//
-#undef  GENTFUNC
-#define GENTFUNC( ctype, ch, opname, varname ) \
+#undef  GENFRONT
+#define GENFRONT( opname ) \
 \
-void PASTEMAC(ch,opname)( \
-                          conj_t conjchi, \
-                          ctype* chi, \
-                          ctype* psi  \
-                        ) \
+void PASTEMAC0(opname) \
+     ( \
+       obj_t*  chi, \
+       obj_t*  psi  \
+     ) \
 { \
-	PASTEMAC2(ch,ch,varname)( conjchi, \
-	                          chi, \
-	                          psi ); \
+	conj_t    conjchi   = bli_obj_conj_status( *chi ); \
+\
+	num_t     dt_psi    = bli_obj_datatype( *psi ); \
+    void*     buf_psi   = bli_obj_buffer_at_off( *psi ); \
+\
+	num_t     dt_chi; \
+	void*     buf_chi; \
+\
+	FUNCPTR_T f; \
+\
+	if ( bli_error_checking_is_enabled() ) \
+	    PASTEMAC(opname,_check)( chi, psi ); \
+\
+	/* If chi is a scalar constant, use dt_psi to extract the address of the
+	   corresponding constant value; otherwise, use the datatype encoded
+	   within the chi object and extract the buffer at the chi offset. */ \
+	bli_set_scalar_dt_buffer( chi, dt_psi, dt_chi, buf_chi ); \
+\
+	/* Index into the type combination array to extract the correct
+	   function pointer. */ \
+	f = ftypes[dt_chi][dt_psi]; \
+\
+	/* Invoke the void pointer-based function. */ \
+	f( \
+	   conjchi, \
+	   buf_chi, \
+	   buf_psi  \
+	 ); \
 }

-INSERT_GENTFUNC_BASIC( copysc, copysc_unb_var1 )
+GENFRONT( copysc )


 //
-// Define BLAS-like interfaces with heterogeneous-typed operands.
+// Define BLAS-like interfaces with typed operands.
 //
+
 #undef  GENTFUNC2
-#define GENTFUNC2( ctype_x, ctype_y, chx, chy, opname, varname ) \
+#define GENTFUNC2( ctype_x, ctype_y, chx, chy, varname ) \
 \
-void PASTEMAC2(chx,chy,opname)( \
-                                conj_t   conjchi, \
-                                ctype_x* chi, \
-                                ctype_y* psi  \
-                              ) \
+void PASTEMAC2(chx,chy,varname) \
+     ( \
+       conj_t conjchi, \
+       void*  chi, \
+       void*  psi \
+     ) \
 { \
-	PASTEMAC2(chx,chy,varname)( conjchi, \
-	                            chi, \
-	                            psi ); \
+	ctype_x* chi_cast = chi; \
+	ctype_y* psi_cast = psi; \
+\
+	if ( bli_is_conj( conjchi ) ) \
+	{ \
+		PASTEMAC2(chx,chy,copyjs)( *chi_cast, *psi_cast ); \
+	} \
+	else \
+	{ \
+		PASTEMAC2(chx,chy,copys)( *chi_cast, *psi_cast ); \
+	} \
 }

-// Define the basic set of functions unconditionally, and then also some
-// mixed datatype functions if requested.
-INSERT_GENTFUNC2_BASIC( copysc, copysc_unb_var1 )
-
-#ifdef BLIS_ENABLE_MIXED_DOMAIN_SUPPORT
-INSERT_GENTFUNC2_MIX_D( copysc, copysc_unb_var1 )
-#endif
-
-#ifdef BLIS_ENABLE_MIXED_PRECISION_SUPPORT
-INSERT_GENTFUNC2_MIX_P( copysc, copysc_unb_var1 )
-#endif
+INSERT_GENTFUNC2_BASIC0( copysc )
+INSERT_GENTFUNC2_MIX_D0( copysc )
+INSERT_GENTFUNC2_MIX_P0( copysc )

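The new bli_copysc front end looks up a void-pointer function in a two-dimensional array indexed by the source and destination datatypes. The sketch below illustrates that dispatch idea only: demo_num_t, GEN_COPYSC, demo_ftypes and demo_copysc are hypothetical names, just two real types are covered, and the conjugation argument is left out.

```c
/* Hypothetical datatype tags; the real num_t has more values. */
typedef enum { DEMO_FLOAT = 0, DEMO_DOUBLE = 1, DEMO_NUM_TYPES = 2 } demo_num_t;

typedef void (*demo_copysc_fp)( const void* chi, void* psi );

/* Generate one void*-based copy-cast function per (source, destination)
   type pair. */
#define GEN_COPYSC( ctype_x, ctype_y, chx, chy ) \
\
static void demo_ ## chx ## chy ## copysc( const void* chi, void* psi ) \
{ \
	const ctype_x* chi_cast = chi; \
	ctype_y*       psi_cast = psi; \
\
	*psi_cast = ( ctype_y )( *chi_cast ); \
}

GEN_COPYSC( float,  float,  s, s )
GEN_COPYSC( float,  double, s, d )
GEN_COPYSC( double, float,  d, s )
GEN_COPYSC( double, double, d, d )

/* Rows are indexed by the source type, columns by the destination type. */
static const demo_copysc_fp demo_ftypes[DEMO_NUM_TYPES][DEMO_NUM_TYPES] =
{
	{ demo_sscopysc, demo_sdcopysc },
	{ demo_dscopysc, demo_ddcopysc }
};

/* Front end: select the function for this type pair and invoke it. */
void demo_copysc( demo_num_t dt_chi, const void* chi,
                  demo_num_t dt_psi, void*       psi )
{
	demo_ftypes[dt_chi][dt_psi]( chi, psi );
}
```

Keeping the per-pair functions behind a void* signature is what lets a single table type hold every combination; the casts back to the concrete types happen inside each generated function.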
--- a/frame/0/copysc/bli_copysc.h
+++ b/frame/0/copysc/bli_copysc.h
@@ -32,51 +32,37 @@

 */

-#include "bli_copysc_check.h"
-#include "bli_copysc_unb_var1.h"
-

 //
-// Prototype object-based interface.
+// Prototype object-based interfaces.
 //
-void bli_copysc( obj_t* chi,
-                 obj_t* psi );

-
-//
-// Prototype BLAS-like interfaces with homogeneous-typed operands.
-//
-#undef  GENTPROT
-#define GENTPROT( ctype, ch, opname ) \
+#undef  GENFRONT
+#define GENFRONT( opname ) \
 \
-void PASTEMAC(ch,opname)( \
-                          conj_t conjchi, \
-                          ctype* chi, \
-                          ctype* psi  \
-                        );
-
-INSERT_GENTPROT_BASIC( copysc )
+void PASTEMAC0(opname) \
+     ( \
+       obj_t*  chi, \
+       obj_t*  psi  \
+     );
+GENFRONT( copysc )


 //
-// Prototype BLAS-like interfaces with heterogeneous-typed operands.
+// Prototype BLAS-like interfaces with heterogeneous-typed operands.
 //
+
 #undef  GENTPROT2
-#define GENTPROT2( ctype_x, ctype_y, chx, chy, opname ) \
+#define GENTPROT2( ctype_x, ctype_y, chx, chy, varname ) \
 \
-void PASTEMAC2(chx,chy,opname)( \
-                                conj_t   conjchi, \
-                                ctype_x* chi, \
-                                ctype_y* psi  \
-                              );
+void PASTEMAC2(chx,chy,varname) \
+     ( \
+       conj_t conjchi, \
+       void*  chi, \
+       void*  psi \
+     );

 INSERT_GENTPROT2_BASIC( copysc )
-
-#ifdef BLIS_ENABLE_MIXED_DOMAIN_SUPPORT
 INSERT_GENTPROT2_MIX_D( copysc )
-#endif
-
-#ifdef BLIS_ENABLE_MIXED_PRECISION_SUPPORT
 INSERT_GENTPROT2_MIX_P( copysc )
-#endif

--- a/frame/0/old/absqsc/bli_absqsc.c
+++ b/frame/0/old/absqsc/bli_absqsc.c
--- a/frame/0/old/absqsc/bli_absqsc.h
+++ b/frame/0/old/absqsc/bli_absqsc.h
--- a/frame/0/old/absqsc/bli_absqsc_check.c
+++ b/frame/0/old/absqsc/bli_absqsc_check.c
--- a/frame/0/old/absqsc/bli_absqsc_check.h
+++ b/frame/0/old/absqsc/bli_absqsc_check.h
--- a/frame/0/old/absqsc/bli_absqsc_unb_var1.c
+++ b/frame/0/old/absqsc/bli_absqsc_unb_var1.c
--- a/frame/0/old/absqsc/bli_absqsc_unb_var1.h
+++ b/frame/0/old/absqsc/bli_absqsc_unb_var1.h
--- a/frame/0/old/addsc/bli_addsc.c
+++ b/frame/0/old/addsc/bli_addsc.c
--- a/frame/0/old/addsc/bli_addsc.h
+++ b/frame/0/old/addsc/bli_addsc.h
--- a/frame/0/old/addsc/bli_addsc_check.c
+++ b/frame/0/old/addsc/bli_addsc_check.c
--- a/frame/0/old/addsc/bli_addsc_check.h
+++ b/frame/0/old/addsc/bli_addsc_check.h
--- a/frame/0/old/addsc/bli_addsc_unb_var1.c
+++ b/frame/0/old/addsc/bli_addsc_unb_var1.c
--- a/frame/0/old/addsc/bli_addsc_unb_var1.h
+++ b/frame/0/old/addsc/bli_addsc_unb_var1.h
--- a/frame/0/old/bli_getsc.c
+++ b/frame/0/old/bli_getsc.c
@@ -34,76 +34,78 @@

 #include "blis.h"

+typedef void (*FUNCPTR_T)(
+                           void*   chi,
+                           double* zeta_r,
+                           double* zeta_i 
+                         );
+
+static FUNCPTR_T GENARRAY(ftypes,getsc);

 //
-// Define object-based interface.
+// Define object-based interfaces.
 //
+
 #undef  GENFRONT
-#define GENFRONT( opname, varname ) \
+#define GENFRONT( opname ) \
 \
 void PASTEMAC0(opname)( \
-                        obj_t* x, \
-                        obj_t* y  \
+                        obj_t*  chi, \
+                        double* zeta_r, \
+                        double* zeta_i  \
                      ) \
 { \
-    if ( bli_error_checking_is_enabled() ) \
-        PASTEMAC(opname,_check)( x, y ); \
+	num_t     dt_chi    = bli_obj_datatype( *chi ); \
+	num_t     dt_def    = BLIS_DCOMPLEX; \
+	num_t     dt_use; \
 \
-    PASTEMAC0(varname)( x, \
-                        y ); \
+	/* If chi is a constant object, default to the dcomplex value, both
+	   to maximize precision and because we don't know whether the
+	   caller needs only the real part or both parts. */ \
+	void*     buf_chi   = bli_obj_buffer_for_1x1( dt_def, *chi ); \
+\
+	FUNCPTR_T f; \
+\
+	if ( bli_error_checking_is_enabled() ) \
+	    PASTEMAC(opname,_check)( chi, zeta_r, zeta_i ); \
+\
+	/* The _check() routine prevents integer types, so we know that chi
+	   is either a constant or an actual floating-point type. */ \
+	if ( bli_is_constant( dt_chi ) ) dt_use = dt_def; \
+	else                             dt_use = dt_chi; \
+\
+	/* Index into the type combination array to extract the correct
+	   function pointer. */ \
+	f = ftypes[dt_use]; \
+\
+	/* Invoke the function. */ \
+	f( \
+	   buf_chi, \
+	   zeta_r, \
+	   zeta_i  \
+	 ); \
 }

-GENFRONT( addv, addv_kernel )
+GENFRONT( getsc )


 //
-// Define BLAS-like interfaces with homogeneous-typed operands.
+// Define BLAS-like interfaces with typed operands.
 //
+
 #undef  GENTFUNC
-#define GENTFUNC( ctype, ch, opname, varname ) \
+#define GENTFUNC( ctype, ch, opname ) \
 \
 void PASTEMAC(ch,opname)( \
-                          conj_t conjx, \
-                          dim_t  n, \
-                          ctype* x, inc_t incx, \
-                          ctype* y, inc_t incy \
+                          void*   chi, \
+                          double* zeta_r, \
+                          double* zeta_i  \
                        ) \
 { \
-	PASTEMAC2(ch,ch,varname)( conjx, \
-	                          n, \
-	                          x, incx, \
-	                          y, incy ); \
-}
-
-INSERT_GENTFUNC_BASIC( addv, ADDV_KERNEL )
-
-
-//
-// Define BLAS-like interfaces with heterogeneous-typed operands.
-//
-#undef  GENTFUNC2
-#define GENTFUNC2( ctype_x, ctype_y, chx, chy, opname, varname ) \
+	ctype* chi_cast = chi; \
 \
-void PASTEMAC2(chx,chy,opname)( \
-                                conj_t   conjx, \
-                                dim_t    n, \
-                                ctype_x* x, inc_t incx, \
-                                ctype_y* y, inc_t incy \
-                              ) \
-{ \
-	PASTEMAC2(chx,chy,varname)( conjx, \
-	                            n, \
-	                            x, incx, \
-	                            y, incy ); \
+	PASTEMAC2(ch,d,gets)( *chi_cast, *zeta_r, *zeta_i ); \
 }

-INSERT_GENTFUNC2_BASIC( addv, ADDV_KERNEL )
-
-#ifdef BLIS_ENABLE_MIXED_DOMAIN_SUPPORT
-INSERT_GENTFUNC2_MIX_D( addv, ADDV_KERNEL )
-#endif
-
-#ifdef BLIS_ENABLE_MIXED_PRECISION_SUPPORT
-INSERT_GENTFUNC2_MIX_P( addv, ADDV_KERNEL )
-#endif
+INSERT_GENTFUNC_BASIC( getsc )

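The typed getsc functions above cast a void pointer to the concrete scalar type and then write the real and imaginary parts out as doubles. A minimal sketch of that behavior follows, assuming a struct-based complex type; demo_scomplex, demo_dgetsc and demo_cgetsc are hypothetical names used only for illustration.

```c
/* Hypothetical struct-based complex type holding a real/imag pair;
   this is an illustration, not the BLIS definition. */
typedef struct { float real; float imag; } demo_scomplex;

/* Real input: the imaginary output is set to zero. */
void demo_dgetsc( const void* chi, double* zeta_r, double* zeta_i )
{
	const double* chi_cast = chi;

	*zeta_r = *chi_cast;
	*zeta_i = 0.0;
}

/* Single-precision complex input: both parts are widened to double. */
void demo_cgetsc( const void* chi, double* zeta_r, double* zeta_i )
{
	const demo_scomplex* chi_cast = chi;

	*zeta_r = ( double )chi_cast->real;
	*zeta_i = ( double )chi_cast->imag;
}
```

Because every variant shares the same signature, the variants can be stored in a one-dimensional array indexed by the datatype of chi, which is how the front end above selects f.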
--- a/frame/1/invertv/bli_invertv.c
+++ b/frame/1/invertv/bli_invertv.c
@@ -32,42 +32,33 @@

 */

-#include "blis.h"
-

 //
-// Define object-based interface.
+// Prototype object-based interfaces.
 //
+
 #undef  GENFRONT
-#define GENFRONT( opname, varname ) \
+#define GENFRONT( opname ) \
 \
 void PASTEMAC0(opname)( \
-                        obj_t* x  \
-                      ) \
-{ \
-    if ( bli_error_checking_is_enabled() ) \
-        PASTEMAC(opname,_check)( x ); \
-\
-    PASTEMAC0(varname)( x ); \
-}
-
-GENFRONT( invertv, invertv_kernel )
+                        obj_t*  chi, \
+                        double* zeta_r, \
+                        double* zeta_i  \
+                      );
+GENFRONT( getsc )


 //
-// Define BLAS-like interfaces.
+// Prototype BLAS-like interfaces with typed operands.
 //
-#undef  GENTFUNC
-#define GENTFUNC( ctype, ch, opname, varname ) \
+
+#undef  GENTPROT
+#define GENTPROT( ctype, ch, opname ) \
 \
 void PASTEMAC(ch,opname)( \
-                          dim_t  n, \
-                          ctype* x, inc_t incx \
-                        ) \
-{ \
-	PASTEMAC(ch,varname)( n, \
-	                      x, incx ); \
-}
-
-INSERT_GENTFUNC_BASIC( invertv, INVERTV_KERNEL )
+                          void*   chi, \
+                          double* zeta_r, \
+                          double* zeta_i  \
+                        );

+INSERT_GENTPROT_BASIC( getsc )
--- a/frame/0/old/bli_setsc.c
+++ b/frame/0/old/bli_setsc.c
@@ -0,0 +1,101 @@
+/*
+
+   BLIS    
+   An object-based framework for developing high-performance BLAS-like
+   libraries.
+
+   Copyright (C) 2014, The University of Texas at Austin
+
+   Redistribution and use in source and binary forms, with or without
+   modification, are permitted provided that the following conditions are
+   met:
+    - Redistributions of source code must retain the above copyright
+      notice, this list of conditions and the following disclaimer.
+    - Redistributions in binary form must reproduce the above copyright
+      notice, this list of conditions and the following disclaimer in the
+      documentation and/or other materials provided with the distribution.
+    - Neither the name of The University of Texas at Austin nor the names
+      of its contributors may be used to endorse or promote products
+      derived from this software without specific prior written permission.
+
+   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+   HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+*/
+
+#include "blis.h"
+
+typedef void (*FUNCPTR_T)(
+                           double* zeta_r,
+                           double* zeta_i,
+                           void*   chi 
+                         );
+
+static FUNCPTR_T GENARRAY(ftypes,setsc);
+
+//
+// Define object-based interfaces.
+//
+
+#undef  GENFRONT
+#define GENFRONT( opname ) \
+\
+void PASTEMAC0(opname)( \
+                        double* zeta_r, \
+                        double* zeta_i, \
+                        obj_t*  chi  \
+                      ) \
+{ \
+	num_t     dt_chi    = bli_obj_datatype( *chi ); \
+\
+	void*     buf_chi   = bli_obj_buffer_at_off( *chi ); \
+\
+	FUNCPTR_T f; \
+\
+	if ( bli_error_checking_is_enabled() ) \
+	    PASTEMAC(opname,_check)( zeta_r, zeta_i, chi ); \
+\
+	/* Index into the type combination array to extract the correct
+	   function pointer. */ \
+	f = ftypes[dt_chi]; \
+\
+	/* Invoke the function. */ \
+	f( \
+	   zeta_r, \
+	   zeta_i, \
+	   buf_chi  \
+	 ); \
+}
+
+GENFRONT( setsc )
+
+
+//
+// Define BLAS-like interfaces with typed operands.
+//
+
+#undef  GENTFUNC
+#define GENTFUNC( ctype, ch, opname ) \
+\
+void PASTEMAC(ch,opname)( \
+                          double* zeta_r, \
+                          double* zeta_i, \
+                          void*   chi  \
+                        ) \
+{ \
+	ctype* chi_cast = chi; \
+\
+	PASTEMAC2(d,ch,sets)( *zeta_r, *zeta_i, *chi_cast ); \
+}
+
+INSERT_GENTFUNC_BASIC( setsc )
+
--- a/frame/0/old/bli_setsc.h
+++ b/frame/0/old/bli_setsc.h
@@ -0,0 +1,64 @@
+/*
+
+   BLIS    
+   An object-based framework for developing high-performance BLAS-like
+   libraries.
+
+   Copyright (C) 2014, The University of Texas at Austin
+
+   Redistribution and use in source and binary forms, with or without
+   modification, are permitted provided that the following conditions are
+   met:
+    - Redistributions of source code must retain the above copyright
+      notice, this list of conditions and the following disclaimer.
+    - Redistributions in binary form must reproduce the above copyright
+      notice, this list of conditions and the following disclaimer in the
+      documentation and/or other materials provided with the distribution.
+    - Neither the name of The University of Texas at Austin nor the names
+      of its contributors may be used to endorse or promote products
+      derived from this software without specific prior written permission.
+
+   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+   HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+*/
+
+
+//
+// Prototype object-based interfaces.
+//
+
+#undef  GENFRONT
+#define GENFRONT( opname ) \
+\
+void PASTEMAC0(opname)( \
+                        double* zeta_r, \
+                        double* zeta_i, \
+                        obj_t*  chi  \
+                      );
+GENFRONT( setsc )
+
+
+//
+// Prototype BLAS-like interfaces with typed operands.
+//
+
+#undef  GENTPROT
+#define GENTPROT( ctype, ch, opname ) \
+\
+void PASTEMAC(ch,opname)( \
+                          double* zeta_r, \
+                          double* zeta_i, \
+                          void*   chi  \
+                        );
+
+INSERT_GENTPROT_BASIC( setsc )
--- a/frame/0/old/copysc/bli_copysc.c
+++ b/frame/0/old/copysc/bli_copysc.c
@@ -38,22 +38,14 @@
 //
 // Define object-based interface.
 //
-#undef  GENFRONT
-#define GENFRONT( opname, varname ) \
-\
-void PASTEMAC0(opname)( \
-                        obj_t* x, \
-                        obj_t* y  \
-                      ) \
-{ \
-    if ( bli_error_checking_is_enabled() ) \
-        PASTEMAC(opname,_check)( x, y ); \
-\
-    PASTEMAC0(varname)( x, \
-                        y ); \
-}
+void bli_copysc( obj_t* chi,
+                 obj_t* psi )
+{
+	if ( bli_error_checking_is_enabled() )
+		bli_copysc_check( chi, psi );

-GENFRONT( swapv, swapv_kernel )
+	bli_copysc_unb_var1( chi, psi );
+}


 //
@@ -63,17 +55,17 @@ GENFRONT( swapv, swapv_kernel )
 #define GENTFUNC( ctype, ch, opname, varname ) \
 \
 void PASTEMAC(ch,opname)( \
-                          dim_t  n, \
-                          ctype* x, inc_t incx, \
-                          ctype* y, inc_t incy \
+                          conj_t conjchi, \
+                          ctype* chi, \
+                          ctype* psi  \
                        ) \
 { \
-	PASTEMAC2(ch,ch,varname)( n, \
-	                          x, incx, \
-	                          y, incy ); \
+	PASTEMAC2(ch,ch,varname)( conjchi, \
+	                          chi, \
+	                          psi ); \
 }

-INSERT_GENTFUNC_BASIC( swapv, SWAPV_KERNEL )
+INSERT_GENTFUNC_BASIC( copysc, copysc_unb_var1 )


 //
@@ -83,23 +75,25 @@ INSERT_GENTFUNC_BASIC( swapv, SWAPV_KERNEL )
 #define GENTFUNC2( ctype_x, ctype_y, chx, chy, opname, varname ) \
 \
 void PASTEMAC2(chx,chy,opname)( \
-                                dim_t    n, \
-                                ctype_x* x, inc_t incx, \
-                                ctype_y* y, inc_t incy \
+                                conj_t   conjchi, \
+                                ctype_x* chi, \
+                                ctype_y* psi  \
                              ) \
 { \
-	PASTEMAC2(chx,chy,varname)( n, \
-	                            x, incx, \
-	                            y, incy ); \
+	PASTEMAC2(chx,chy,varname)( conjchi, \
+	                            chi, \
+	                            psi ); \
 }

-INSERT_GENTFUNC2_BASIC( swapv, SWAPV_KERNEL )
+// Define the basic set of functions unconditionally, and then also some
+// mixed datatype functions if requested.
+INSERT_GENTFUNC2_BASIC( copysc, copysc_unb_var1 )

 #ifdef BLIS_ENABLE_MIXED_DOMAIN_SUPPORT
-INSERT_GENTFUNC2_MIX_D( swapv, SWAPV_KERNEL )
+INSERT_GENTFUNC2_MIX_D( copysc, copysc_unb_var1 )
 #endif

 #ifdef BLIS_ENABLE_MIXED_PRECISION_SUPPORT
-INSERT_GENTFUNC2_MIX_P( swapv, SWAPV_KERNEL )
+INSERT_GENTFUNC2_MIX_P( copysc, copysc_unb_var1 )
 #endif

--- a/frame/0/old/copysc/bli_copysc.h
+++ b/frame/0/old/copysc/bli_copysc.h
@@ -32,17 +32,15 @@

 */

-#include "bli_setv_check.h"
-
-#include "bli_setv_kernel.h"
-#include "bli_setv_ref.h"
+#include "bli_copysc_check.h"
+#include "bli_copysc_unb_var1.h"


 //
 // Prototype object-based interface.
 //
-void bli_setv( obj_t* beta,
-               obj_t* x );
+void bli_copysc( obj_t* chi,
+                 obj_t* psi );


 //
@@ -52,33 +50,33 @@ void bli_setv( obj_t* beta,
 #define GENTPROT( ctype, ch, opname ) \
 \
 void PASTEMAC(ch,opname)( \
-                          dim_t  n, \
-                          ctype* beta, \
-                          ctype* x, inc_t incx \
+                          conj_t conjchi, \
+                          ctype* chi, \
+                          ctype* psi  \
                        );

-INSERT_GENTPROT_BASIC( setv )
+INSERT_GENTPROT_BASIC( copysc )


 //
 // Prototype BLAS-like interfaces with heterogeneous-typed operands.
 //
 #undef  GENTPROT2
-#define GENTPROT2( ctype_b, ctype_x, chb, chx, opname ) \
+#define GENTPROT2( ctype_x, ctype_y, chx, chy, opname ) \
 \
-void PASTEMAC2(chb,chx,opname)( \
-                                dim_t    n, \
-                                ctype_b* beta, \
-                                ctype_x* x, inc_t incx \
+void PASTEMAC2(chx,chy,opname)( \
+                                conj_t   conjchi, \
+                                ctype_x* chi, \
+                                ctype_y* psi  \
                              );

-INSERT_GENTPROT2_BASIC( setv )
+INSERT_GENTPROT2_BASIC( copysc )

 #ifdef BLIS_ENABLE_MIXED_DOMAIN_SUPPORT
-INSERT_GENTPROT2_MIX_D( setv )
+INSERT_GENTPROT2_MIX_D( copysc )
 #endif

 #ifdef BLIS_ENABLE_MIXED_PRECISION_SUPPORT
-INSERT_GENTPROT2_MIX_P( setv )
+INSERT_GENTPROT2_MIX_P( copysc )
 #endif

--- a/frame/0/old/copysc/bli_copysc_check.c
+++ b/frame/0/old/copysc/bli_copysc_check.c
--- a/frame/0/old/copysc/bli_copysc_check.h
+++ b/frame/0/old/copysc/bli_copysc_check.h
--- a/frame/0/old/copysc/bli_copysc_unb_var1.c
+++ b/frame/0/old/copysc/bli_copysc_unb_var1.c
--- a/frame/0/old/copysc/bli_copysc_unb_var1.h
+++ b/frame/0/old/copysc/bli_copysc_unb_var1.h
--- a/frame/0/old/divsc/bli_divsc.c
+++ b/frame/0/old/divsc/bli_divsc.c
--- a/frame/0/old/divsc/bli_divsc.h
+++ b/frame/0/old/divsc/bli_divsc.h
--- a/frame/0/old/divsc/bli_divsc_check.c
+++ b/frame/0/old/divsc/bli_divsc_check.c
--- a/frame/0/old/divsc/bli_divsc_check.h
+++ b/frame/0/old/divsc/bli_divsc_check.h
--- a/frame/0/old/divsc/bli_divsc_unb_var1.c
+++ b/frame/0/old/divsc/bli_divsc_unb_var1.c
--- a/frame/0/old/divsc/bli_divsc_unb_var1.h
+++ b/frame/0/old/divsc/bli_divsc_unb_var1.h
--- a/frame/0/old/getsc/bli_getsc.c
+++ b/frame/0/old/getsc/bli_getsc.c
--- a/frame/0/old/getsc/bli_getsc.h
+++ b/frame/0/old/getsc/bli_getsc.h
--- a/frame/0/old/getsc/bli_getsc_check.c
+++ b/frame/0/old/getsc/bli_getsc_check.c
--- a/frame/0/old/getsc/bli_getsc_check.h
+++ b/frame/0/old/getsc/bli_getsc_check.h
--- a/frame/0/old/mulsc/bli_mulsc.c
+++ b/frame/0/old/mulsc/bli_mulsc.c
--- a/frame/0/old/mulsc/bli_mulsc.h
+++ b/frame/0/old/mulsc/bli_mulsc.h
--- a/frame/0/old/mulsc/bli_mulsc_check.c
+++ b/frame/0/old/mulsc/bli_mulsc_check.c
--- a/frame/0/old/mulsc/bli_mulsc_check.h
+++ b/frame/0/old/mulsc/bli_mulsc_check.h
--- a/frame/0/old/mulsc/bli_mulsc_unb_var1.c
+++ b/frame/0/old/mulsc/bli_mulsc_unb_var1.c
--- a/frame/0/old/mulsc/bli_mulsc_unb_var1.h
+++ b/frame/0/old/mulsc/bli_mulsc_unb_var1.h
--- a/frame/0/old/normfsc/bli_normfsc.c
+++ b/frame/0/old/normfsc/bli_normfsc.c
--- a/frame/0/old/normfsc/bli_normfsc.h
+++ b/frame/0/old/normfsc/bli_normfsc.h
--- a/frame/0/old/normfsc/bli_normfsc_check.c
+++ b/frame/0/old/normfsc/bli_normfsc_check.c
--- a/frame/0/old/normfsc/bli_normfsc_check.h
+++ b/frame/0/old/normfsc/bli_normfsc_check.h
--- a/frame/0/old/normfsc/bli_normfsc_unb_var1.c
+++ b/frame/0/old/normfsc/bli_normfsc_unb_var1.c
--- a/frame/0/old/normfsc/bli_normfsc_unb_var1.h
+++ b/frame/0/old/normfsc/bli_normfsc_unb_var1.h
--- a/frame/0/old/setsc/bli_setsc.c
+++ b/frame/0/old/setsc/bli_setsc.c
--- a/frame/0/old/setsc/bli_setsc.h
+++ b/frame/0/old/setsc/bli_setsc.h
--- a/frame/0/old/setsc/bli_setsc_check.c
+++ b/frame/0/old/setsc/bli_setsc_check.c
--- a/frame/0/old/setsc/bli_setsc_check.h
+++ b/frame/0/old/setsc/bli_setsc_check.h
--- a/frame/0/old/sqrtsc/bli_sqrtsc.c
+++ b/frame/0/old/sqrtsc/bli_sqrtsc.c
--- a/frame/0/old/sqrtsc/bli_sqrtsc.h
+++ b/frame/0/old/sqrtsc/bli_sqrtsc.h
--- a/frame/0/old/sqrtsc/bli_sqrtsc_check.c
+++ b/frame/0/old/sqrtsc/bli_sqrtsc_check.c
--- a/frame/0/old/sqrtsc/bli_sqrtsc_check.h
+++ b/frame/0/old/sqrtsc/bli_sqrtsc_check.h
--- a/frame/0/old/sqrtsc/bli_sqrtsc_unb_var1.c
+++ b/frame/0/old/sqrtsc/bli_sqrtsc_unb_var1.c
--- a/frame/0/old/sqrtsc/bli_sqrtsc_unb_var1.h
+++ b/frame/0/old/sqrtsc/bli_sqrtsc_unb_var1.h
--- a/frame/0/old/subsc/bli_subsc.c
+++ b/frame/0/old/subsc/bli_subsc.c
--- a/frame/0/old/subsc/bli_subsc.h
+++ b/frame/0/old/subsc/bli_subsc.h
--- a/frame/0/old/subsc/bli_subsc_check.c
+++ b/frame/0/old/subsc/bli_subsc_check.c
--- a/frame/0/old/subsc/bli_subsc_check.h
+++ b/frame/0/old/subsc/bli_subsc_check.h
--- a/frame/0/old/subsc/bli_subsc_unb_var1.c
+++ b/frame/0/old/subsc/bli_subsc_unb_var1.c
--- a/frame/0/old/subsc/bli_subsc_unb_var1.h
+++ b/frame/0/old/subsc/bli_subsc_unb_var1.h