Added 'docs' directory with wiki markdown files.

Details:
- Exported all github wikis to a new 'docs' directory.
- Renamed 'BLISAPIQuickReference' wiki to 'BLISTypedAPI' and removed all cntx_t* arguments from the (now non-expert) APIs (with the exception of the kernel APIs).
- Added section to BuildSystem documenting new ARG_MAX hack.
2026-04-19 23:28:52 +00:00 · 2018-07-07 16:45:29 -05:00
parent 3ee2bc0f7a
commit bcacddfad7
10 changed files with 4734 additions and 0 deletions
--- a/docs/BLISTypedAPI.md
+++ b/docs/BLISTypedAPI.md
--- a/docs/BuildSystem.md
+++ b/docs/BuildSystem.md
@@ -0,0 +1,382 @@
+## Contents
+
+* **[Contents](BuildSystem#contents)**
+* **[Introduction](BuildSystem#introduction)**
+* **[Obtaining BLIS](BuildSystem#obtaining-blis)**
+* **[Step 1: Choose a framework configuration](BuildSystem#step-1-choose-a-framework-configuration)**
+* **[Step 2: Running `configure`](BuildSystem#step-2-running-configure)**
+* **[Step 3: Compilation](BuildSystem#step-3-compilation)**
+* **[Step 3b: Testing (optional)](BuildSystem#step-3b-testing-optional)**
+* **[Step 4: Installation](BuildSystem#step-4-installation)**
+* **[Cleaning out build products](BuildSystem#cleaning-out-build-products)**
+* **[Linking against BLIS](BuildSystem#linking-against-blis)**
+* **[Uninstalling](BuildSystem#uninstalling)**
+* **[Conclusion](BuildSystem#conclusion)**
+
+## Introduction
+
+This wiki describes how to configure, compile, and install a BLIS library on your local system.
+
+The BLIS build system was designed for use with GNU/Linux (or some other sane UNIX). Other requirements are:
+
+  * Python (2.7 or later)
+  * GNU `bash` (3.2 or later)
+  * GNU `make`
+  * a working C compiler
+
+We also require various other shell utilities that are so ubiquitous that they are not worth mentioning (such as `mv`, `mkdir`, `find`, and so forth). If you are missing these utilities, then you have much bigger problems than not being able to build BLIS.
+
+
+## Obtaining BLIS
+
+Before starting, you must obtain a copy of BLIS.
+
+If you are an end-user (i.e., not a developer), you can download a tarball or zip file of the latest tagged version by returning to the main [BLIS homepage](https://github.com/flame/blis) and clicking on the [releases](https://github.com/flame/blis/releases) link. **However**, we highly recommend that you instead clone a copy using the command:
+```
+$ git clone https://github.com/flame/blis.git
+```
+
+Cloning a repository allows users and developers alike to quickly and easily pull in new commits as they are available, including commits that occur **between** tagged releases.
+
+Once you download the BLIS distribution, the top-level directory should look something like:
+```
+$ ls
+CHANGELOG  Makefile      common.mk        configure  mpi_test     testsuite
+CREDITS    README.md     config           frame      obj          version
+INSTALL    bli_config.h  config.mk        kernels    ref_kernels  windows
+LICENSE    build         config_registry  lib        test
+```
+
+
+## Step 1: Choose a framework configuration
+
+The first step is to choose how to configure BLIS. Specifically, you must decide which configuration to use, or whether to allow `configure` to automatically guess the best configuration for your hardware. (Note: This automatic configuration selection only applies to x86_64 systems.)
+
+Configurations are described in detail in the [BLIS configuration guide](ConfigurationHowTo) wiki.
+
+Generally speaking, a configuration consists of several files that reside in a sub-directory of the `config` directory. To see a list of the available configurations, you may inspect this directory, or run `configure` with no arguments. Here are the current (as of this writing) contents of the `config` directory:
+```
+$ ls config
+amd64      cortexa15  excavator  intel64  old         power7       template
+bgq        cortexa57  generic    knc      penryn      sandybridge  zen
+bulldozer  cortexa9   haswell    knl      piledriver  steamroller
+```
+There is one additional configuration available that is not present in the `config` directory, and that is `auto`.
+By targeting the `auto` configuration (i.e., `./configure auto`), the user is requesting that `configure` select a configuration automatically based on the detected features of the processor.
+
+Another special configuration (one that, unlike `auto`, _is_ present in `config`) is the `generic` configuration. This configuration, like its name suggests, is architecture-agnostic and may be targeted in virtually any environment that supports the minimum build requirements of BLIS. The `generic` configuration uses a set of built-in, portable reference kernels (written in C99) that should work without modification on most, if not all, architectures. These reference kernels, however, should be expected to yield relatively low performance since they do not employ any architecture-specific optimizations beyond those the compiler provides automatically. (Historical note: The `generic` configuration corresponds to the `reference` configuration of previous releases of BLIS.)
+
+If you are a BLIS developer and wish to create your own configuration, either from scratch or using an existing configuration as a starting point, please read the [BLIS configuration guide](ConfigurationHowTo).
+
+## Step 2: Running `configure`
+
+This step should be somewhat familiar to many people who use open source software. To configure the build system, simply run:
+```
+$ ./configure <configname>
+```
+where `<configname>` is the configuration sub-directory name you chose in [Step 1](BuildSystem#step-1-choose-a-framework-configuration) above. If `<configname>` is not given, a helpful message is printed reminding you to explicitly specify a configuration name, along with a list of valid configuration families and their implied sub-configurations. For more information on sub-configurations and families, please see the [BLIS configuration guide](ConfigurationHowTo).
+
+Alternatively, `configure` can automatically select a configuration based on your hardware:
+```
+$ ./configure auto
+```
+However, as of this writing, only a limited number of architectures are detected. If the `configure` script is not able to detect your architecture, the `generic` configuration will be used. 
+
+Upon running configure, you will get output similar to the following. The exact output will depend on whether you cloned BLIS from a `git` repository or whether you obtained BLIS via a downloadable tarball from the [releases](https://github.com/flame/blis/releases) page.
+```
+$ ./configure haswell
+configure: using 'gcc' compiler.
+configure: found gcc version 5.4.0 (maj: 5, min: 4, rev: 0).
+configure: checking for blacklisted configurations due to gcc 5.4.0.
+configure: warning: gcc 5.4.0 does not support 'skx'; adding to blacklist.
+configure: found assembler ('as') version 2.26.1 (maj: 2, min: 26, rev: 1).
+configure: checking for blacklisted configurations due to as 2.26.1.
+configure: configuration blacklist:
+configure:   skx
+configure: reading configuration registry...done.
+configure: determining default version string.
+configure: found '.git' directory; assuming git clone.
+configure: executing: git describe --tags.
+configure: got back 0.3.2-16-gb699bb1f.
+configure: truncating to 0.3.2-16.
+configure: starting configuration of BLIS 0.3.2-16.
+configure: configuring with official version string.
+configure: found shared library .so version '0.0.0'.
+configure:   .so major version: 0
+configure:   .so minor.build version: 0.0
+configure: manual configuration requested; configuring with 'haswell'.
+configure: checking configuration against contents of 'config_registry'.
+configure: configuration 'haswell' is registered.
+configure: 'haswell' is defined as having the following sub-configurations:
+configure:    haswell
+configure: which collectively require the following kernels:
+configure:    haswell zen
+configure: checking sub-configurations:
+configure:   'haswell' is registered...and exists.
+configure: checking sub-configurations' requisite kernels:
+configure:   'haswell' kernels...exist.
+configure:   'zen' kernels...exist.
+configure: no install prefix option given; defaulting to '/u/field/blis'.
+configure: no install libdir option given; defaulting to PREFIX/lib.
+configure: no install includedir option given; defaulting to PREFIX/include.
+configure: final installation directories:
+configure:   libdir:     /u/field/blis/lib
+configure:   includedir: /u/field/blis/include
+configure: debug symbols disabled.
+configure: disabling verbose make output. (enable with 'make V=1'.)
+configure: building BLIS as a static library.
+configure: threading is disabled.
+configure: internal memory pools for packing buffers are enabled.
+configure: libmemkind not found; disabling.
+configure: the BLAS compatibility layer is enabled.
+configure: the CBLAS compatibility layer is disabled.
+configure: the internal integer size is automatically determined.
+configure: the BLAS/CBLAS interface integer size is 32-bit.
+configure: creating ./config.mk from ./build/config.mk.in
+configure: creating ./bli_config.h from ./build/bli_config.h.in
+configure: creating ./obj/haswell
+configure: creating ./obj/haswell/config
+configure: creating ./obj/haswell/config/haswell
+configure: creating ./obj/haswell/kernels
+configure: creating ./obj/haswell/kernels/haswell
+configure: creating ./obj/haswell/kernels/zen
+configure: creating ./obj/haswell/ref_kernels
+configure: creating ./obj/haswell/ref_kernels/haswell
+configure: creating ./obj/haswell/frame
+configure: creating ./obj/haswell/blastest
+configure: creating ./obj/haswell/testsuite
+configure: creating ./lib/haswell
+configure: creating ./include/haswell
+configure: mirroring ./config/haswell to ./obj/haswell/config/haswell
+configure: mirroring ./kernels/haswell to ./obj/haswell/kernels/haswell
+configure: mirroring ./kernels/zen to ./obj/haswell/kernels/zen
+configure: mirroring ./ref_kernels to ./obj/haswell/ref_kernels/haswell
+configure: mirroring ./frame to ./obj/haswell/frame
+configure: creating makefile fragments in ./config/haswell
+configure: creating makefile fragments in ./kernels/haswell
+configure: creating makefile fragments in ./kernels/zen
+configure: creating makefile fragments in ./ref_kernels
+configure: creating makefile fragments in ./frame
+configure: configured to build within top-level directory of source distribution.
+```
+The installation prefix can be specified via the `--prefix=PREFIX` option:
+```
+  $ ./configure --prefix=/usr <configname>
+```
+This will cause libraries to eventually be installed (via `make install`) to `PREFIX/lib` and development headers to be installed to `PREFIX/include`. (The default value of `PREFIX` is `$(HOME)/blis`.) You can also specify the library install directory separately from the development header install directory with the `--libdir=LIBDIR` and `--includedir=INCDIR` options, respectively:
+```
+  $ ./configure --libdir=/usr/lib --includedir=/usr/include <configname>
+```
+The `--libdir=LIBDIR` and `--includedir=INCDIR` options will override any `PREFIX` path, whether it was specified explicitly via `--prefix` or implicitly (via the default). That is, `LIBDIR` defaults to `PREFIX/lib` and `INCDIR` defaults to `PREFIX/include`, but each will be overridden by its respective `--libdir`/`--includedir` option. So,
+```
+  $ ./configure --libdir=/usr/lib <configname>
+
+```
+will configure BLIS to install libraries to `/usr/lib` and header files to the default location (`$HOME/blis/include`).
+Also, note that `configure` will create any installation directories that do not already exist.
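The precedence rules above can be sketched in a few lines of shell. This is only an illustration of the described behavior, not the logic used inside `configure`; the function name `resolve_install_dirs` is hypothetical.

```shell
# Hypothetical helper mirroring the documented precedence rules.
# It is NOT taken from the configure script.
resolve_install_dirs() {
    prefix="$HOME/blis"   # default PREFIX
    libdir=""
    includedir=""

    for arg in "$@"; do
        case "$arg" in
            --prefix=*)     prefix="${arg#--prefix=}" ;;
            --libdir=*)     libdir="${arg#--libdir=}" ;;
            --includedir=*) includedir="${arg#--includedir=}" ;;
        esac
    done

    # LIBDIR and INCDIR fall back to PREFIX-based paths only when not given.
    : "${libdir:=$prefix/lib}"
    : "${includedir:=$prefix/include}"

    echo "libdir=$libdir includedir=$includedir"
}

resolve_install_dirs --libdir=/usr/lib haswell
```

With only `--libdir=/usr/lib` given, the sketch reports `/usr/lib` for libraries and the `PREFIX`-based default for headers, which matches the behavior described above.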
+
+For a complete list of supported `configure` options and arguments, run `configure` with the `-h` option:
+```
+  $ ./configure -h
+```
+The output from this invocation of `configure` should give you an up-to-date list of options and their descriptions.
+
+
+## Step 3: Compilation
+
+Once `configure` is finished, you are ready to instantiate (compile) BLIS into a library by running `make`. Running `make` will result in output similar to:
+```
+$ make
+Generating monolithic blis.h.........................................................
+.....................................................................................
+.....................................................................................
+.....................................................................................
+.....................................................................................
+..........................................
+Generated include/haswell/blis.h
+Compiling obj/haswell/config/haswell/bli_cntx_init_haswell.o ('haswell' CFLAGS for config code)
+Compiling obj/haswell/kernels/zen/1/bli_amaxv_zen_int.o ('haswell' CFLAGS for kernels)
+Compiling obj/haswell/kernels/zen/1/bli_axpyv_zen_int.o ('haswell' CFLAGS for kernels)
+Compiling obj/haswell/kernels/zen/1/bli_axpyv_zen_int10.o ('haswell' CFLAGS for kernels)
+Compiling obj/haswell/kernels/zen/1/bli_dotv_zen_int.o ('haswell' CFLAGS for kernels)
+Compiling obj/haswell/kernels/zen/1/bli_dotv_zen_int10.o ('haswell' CFLAGS for kernels)
+```
+If you want to see the individual command line invocations of the compiler, you can run `make` as follows:
+```
+$ make V=1
+```
+Also, if you are compiling on a multicore system, you can get parallelism via:
+```
+$ make -j<n>
+```
+where `<n>` is the number of jobs `make` is allowed to run simultaneously. Generally, you should limit `<n>` to p+1, where p is the number of processor cores on your system.
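As a rough sketch of the p+1 guideline, the job count can be computed from the number of online cores. This assumes `getconf _NPROCESSORS_ONLN` is available (it is on common GNU/Linux and macOS systems); the variable names are illustrative, and the snippet only prints the resulting command rather than running it.

```shell
# Count online processor cores; fall back to 1 if the query is unavailable.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Apply the p+1 rule of thumb.
jobs=$((cores + 1))

# Show the command that would be run.
echo "make -j$jobs"
```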
+
+### Running into the ARG_MAX limit
+
+On some systems, you may observe an error message when the build system attempts to archive BLIS object files into the static library (or perhaps when the linker attempts to generate the shared library):
+```
+Archiving lib/x86_64/libblis.a
+bash: ar: Argument list too long
+Makefile:584: recipe for target 'lib/x86_64/libblis.a' failed
+make: *** [lib/x86_64/libblis.a] Error 126
+```
+This error message results when the user attempts to execute a program with too many arguments (or more specifically, a program-argument string that occupies too many bytes)--that is, when the command exceeds the [ARG_MAX limit](https://www.in-ulm.de/~mascheck/various/argmax/). This doesn't occur very often, but if it does, don't worry--we have a workaround. Simply rerun `configure` as you did previously, except this time include an additional option: `--enable-arg-max-hack`. You will see confirmation that the option was accepted as `configure` runs:
+```
+configure: enabling ARG_MAX hack.
+```
+The archiver and/or linker should no longer choke when creating the libraries.
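If you are curious how close a command might come to the limit, the following sketch queries `ARG_MAX` and estimates the bytes occupied by a list of paths. The object-file paths are made up for illustration, and the estimate ignores the environment, which also counts toward the limit on most systems.

```shell
# Query the system limit (in bytes) via the POSIX getconf utility.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX on this system: $arg_max bytes"

# Estimate the size of an argument list of (hypothetical) object files.
bytes=0
for path in obj/haswell/frame/a.o obj/haswell/frame/b.o obj/haswell/frame/c.o; do
    bytes=$((bytes + ${#path} + 1))   # +1 for each terminating NUL byte
done
echo "Example argument list: $bytes bytes"
```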
+
+## Step 3b: Testing (optional)
+
+If you would like to run some ready-made tests that exercise BLIS in a number of ways, including through its BLAS compatibility layer, run `make check`:
+```
+$ make check
+```
+Watch the output near the end. You should see the following messages, though not necessarily in immediate succession:
+```
+All BLIS tests passed!
+All BLAS tests passed!
+```
+Please see the [BLIS testsuite wiki](Testsuite) for more details on running either the BLIS testsuite or the BLAS test drivers. If you have any trouble, please report your problem to BLIS developers by opening a [new issue](https://github.com/flame/blis/issues/).
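Because the two messages may be separated by other output, it can be convenient to search a captured log for them. The sketch below uses a stand-in log file so that it is self-contained; the log contents other than the two pass messages are placeholders, and capturing the real output (for example with `make check 2>&1 | tee check.log`) is left as an assumption about your workflow.

```shell
# Stand-in for captured 'make check' output.
log=$(mktemp)
printf '%s\n' '(other output)' 'All BLIS tests passed!' \
              '(more output)'  'All BLAS tests passed!' > "$log"

status=ok
for msg in 'All BLIS tests passed!' 'All BLAS tests passed!'; do
    # -F treats the message as a fixed string; -q suppresses output.
    if ! grep -Fq "$msg" "$log"; then
        echo "missing: $msg"
        status=fail
    fi
done
echo "check status: $status"
rm -f "$log"
```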
+
+
+## Step 4: Installation
+
+Toward the end of compilation, you should get output similar to:
+```
+Compiling obj/haswell/frame/thread/bli_thread.o ('haswell' CFLAGS for framework code)
+Compiling obj/haswell/frame/thread/bli_thrinfo.o ('haswell' CFLAGS for framework code)
+Compiling obj/haswell/frame/util/bli_util_check.o ('haswell' CFLAGS for framework code)
+Compiling obj/haswell/frame/util/bli_util_oapi.o ('haswell' CFLAGS for framework code)
+Compiling obj/haswell/frame/util/bli_util_oapi_wc.o ('haswell' CFLAGS for framework code)
+Compiling obj/haswell/frame/util/bli_util_oapi_woc.o ('haswell' CFLAGS for framework code)
+Compiling obj/haswell/frame/util/bli_util_tapi.o ('haswell' CFLAGS for framework code)
+Compiling obj/haswell/frame/util/bli_util_unb_var1.o ('haswell' CFLAGS for framework code)
+Archiving lib/haswell/libblis.a
+Dynamically linking lib/haswell/libblis.so
+```
+Now you have a BLIS library (in static and shared forms) residing in the `lib/<configname>/` directory. To install the libraries and the header files associated with it, simply execute:
+```
+$ make install
+```
+This installs copies of the libraries and header files, and also creates conventional symbolic links of shared libraries:
+```
+Installing libblis.a into /u/field/blis/lib/
+Installing libblis.so.0.0.0 into /u/field/blis/lib/
+Installing symlink libblis.so into /u/field/blis/lib/
+Installing symlink libblis.so.0 into /u/field/blis/lib/
+Installing blis.h into /u/field/blis/include/blis/
+```
+This results in your `PREFIX` directory looking like:
+```
+# Check the contents of 'PREFIX'.
+$ ls -l $HOME/blis
+drwxr-xr-x 3 field dept 4096 May 10 17:36 include
+drwxr-xr-x 2 field dept 4096 May 10 17:42 lib
+# Check the contents of 'PREFIX/include'.
+$ ls -l $HOME/blis/include
+drwxr-xr-x 2 field dept 4096 May 10 17:42 blis
+$ ls -l $HOME/blis/include/blis
+-rw-r--r-- 1 field dept 915324 May 10 17:42 blis.h
+# Check the contents of 'PREFIX/lib'.
+$ ls -l $HOME/blis/lib
+-rw-r--r-- 1 field dept 2979052 May 10 17:42 libblis.a
+lrwxrwxrwx 1 field dept      16 May 10 17:42 libblis.so -> libblis.so.0.0.0
+lrwxrwxrwx 1 field dept      16 May 10 17:42 libblis.so.0 -> libblis.so.0.0.0
+-rw-r--r-- 1 field dept 2185976 May 10 17:42 libblis.so.0.0.0
+```
+
+## Cleaning out build products
+
+If you want to remove various build products, you can use one of the `make` targets already defined for you in the BLIS Makefile:
+```
+$ make clean
+Removing flattened header files from ./include/haswell.
+Removing object files from ./obj/haswell.
+Removing libraries from ./lib/haswell.
+```
+Executing the `clean` target will remove all binary object files and library builds from the `obj` and `lib` directories, as well as any flattened header files. Any other configurations' build products are left untouched.
+```
+$ make cleanmk
+Removing makefile fragments from ./config.
+Removing makefile fragments from ./frame.
+Removing makefile fragments from ./ref_kernels.
+Removing makefile fragments from ./kernels.
+```
+The `cleanmk` target results in removal of all makefile fragments from the framework source tree. (Makefile fragments are named `.fragment.mk` and are generated at configure-time.)
+```
+$ make distclean
+Removing makefile fragments from ./config.
+Removing makefile fragments from ./frame.
+Removing makefile fragments from ./ref_kernels.
+Removing makefile fragments from ./kernels.
+Removing flattened header files from ./include/haswell.
+Removing object files from ./obj/haswell.
+Removing libraries from ./lib/haswell.
+Removing object files from ./obj/haswell/blastest.
+Removing libf2c.a from ./obj/haswell/blastest.
+Removing binaries from ./obj/haswell/blastest.
+Removing driver output files 'out.*'.
+Removing object files from ./blastest/obj.
+Removing libf2c.a from ./blastest.
+Removing binaries from ./blastest.
+Removing driver output files 'out.*' from ./blastest.
+Removing object files from ./obj/haswell/testsuite.
+Removing binary test_libblis.x.
+Removing output.testsuite.
+Removing object files from testsuite/obj.
+Removing binary testsuite/test_libblis.x.
+Removing ./bli_config.h.
+Removing config.mk.
+Removing obj.
+Removing lib.
+Removing include.
+```
+Running the `distclean` target is like saying, "Remove anything ever created by the build system."
+
+
+## Linking against BLIS
+
+Once you have instantiated (configured, compiled, and installed) a BLIS library, you can link to it in your application's makefile as you would any other library. The following is an abbreviated makefile for a small hypothetical application that has just two external dependencies: BLIS and the standard C math library.
+```make
+BLIS_PREFIX = $(HOME)/blis
+BLIS_INC    = $(BLIS_PREFIX)/include/blis
+BLIS_LIB    = $(BLIS_PREFIX)/lib/libblis.a
+
+OTHER_LIBS  = -L/usr/lib -lm
+
+CC          = gcc
+CFLAGS      = -O2 -g -I$(BLIS_INC)
+LINKER      = $(CC)
+
+OBJS        = main.o util.o other.o
+
+%.o: %.c
+    $(CC) $(CFLAGS) -c $< -o $@
+
+all: $(OBJS) 
+    $(LINKER) $(OBJS) $(BLIS_LIB) $(OTHER_LIBS) -o my_program.x
+```
+The above example assumes you will want to include BLIS definitions and function prototypes into your application via `#include "blis.h"`. (If you are only using BLIS via the BLAS compatibility layer, including `blis.h` is not necessary.) Since BLIS headers are installed into a `blis` subdirectory of `PREFIX/include`, you must make sure that the compiler knows where to find the `blis.h` header file. This is typically accomplished by inserting `#include "blis.h"` into your application's source code files and compiling the code with `-I PREFIX/include/blis`.
+
+The makefile shown above is a very simple example. If you need help linking your application to your BLIS library, please [open an issue](https://github.com/flame/blis/issues).
+
+
+## Uninstalling
+
+If you decide that you want to uninstall BLIS, simply run `make uninstall`:
+```
+$ make uninstall
+Uninstalling libraries libblis.a libblis.so.0.0.0 from /u/field/blis/lib/.
+Uninstalling symlinks libblis.so libblis.so.0 from /u/field/blis/lib/.
+Uninstalling directory 'blis' from /u/field/blis/include/.
+```
+This removes the libraries, symlinks, and header directory that were installed by `make install`. Before running `make uninstall`, however, make sure that BLIS is configured with the same `LIBDIR` and `INCDIR` paths used during installation.
+
+
+
+## Conclusion
+
+If you have feedback, please consider keeping in touch with the project maintainers, contributors, and other users by joining and posting to the [BLIS mailing lists](https://github.com/flame/blis#discussion).
+
+Thanks for using BLIS!
--- a/docs/CodingConventions.md
+++ b/docs/CodingConventions.md
@@ -0,0 +1,245 @@
+## Contents
+
+* **[Contents](CodingConventions#contents)**
+* **[Introduction](CodingConventions#introduction)**
+* **[C99](CodingConventions#c99)**
+  * [Placement of braces](CodingConventions#placement-of-braces)
+  * [Indentation](CodingConventions#indentation)
+  * [Comments](CodingConventions#comments)
+  * [Blank lines](CodingConventions#blank-lines)
+  * [Condensing short code to single lines](CodingConventions#condensing-short-code-to-single-lines)
+  * [Whitespace in function calls](CodingConventions#whitespace-in-function-calls)
+  * [Whitespace in function definitions](CodingConventions#whitespace-in-function-definitions)
+  * [Whitespace in expressions](CodingConventions#whitespace-in-expressions)
+  * [Trailing whitespace](CodingConventions#trailing-whitespace)
+
+## Introduction
+
+This wiki describes the coding conventions used in BLIS. Please try to adhere to these conventions when submitting pull requests and/or (if you have permission) committing directly to the repository.
+
+## C99
+
+Most of the code in BLIS is written in C, and specifically in ISO C99. This section describes the C coding standards used within BLIS.
+
+### Placement of braces
+
+Please use braces either to denote the indentation limits of scope, or to enclose multiple statements on a single line. But do not place the open brace on the same line as a conditional if the body of the conditional will span more than one line.
+```
+{
+    // This is fine.
+    if ( bli_obj_is_real( x ) )
+    {
+        foo = 1;
+    }
+
+    // This is also fine.
+    if ( bli_obj_is_real( x ) ) { foo = 1; return; }
+
+    // This is bad. Please use one of the two forms above.
+    if ( bli_obj_is_real( x ) ) {
+        foo = 1;
+    }
+}
+```
+
+### Indentation
+
+If at all possible, **please use tabs to denote changing levels of scope!** If you can't use tabs or doing so would be very inconvenient given your editor and setup, please set your indentation to use exactly four spaces per level of indentation. Below is what it would look like if you used tabs (with a tab width set to four spaces), or four actual spaces per indentation level.
+```
+bool_t bli_obj_is_real( obj_t* x )
+{
+    bool_t r_val;
+
+    if ( bli_obj_is_float( x ) || bli_obj_is_double( x ) )
+        r_val = TRUE;
+    else
+        r_val = FALSE;
+
+    return r_val;
+}
+```
+Ideally, tabs should be used to indicate changes in levels of scope, but then spaces should be used for multi-line statements within the same scope. In the example below, I've marked the characters that should be spaces with `.` (with tabs used for the first level of indentation):
+
+```
+bool_t bli_obj_is_complex( obj_t* x )
+{
+    bool_t r_val;
+
+    if ( bli_obj_is_scomplex( x ) ||
+    .....bli_obj_is_dcomplex( x ) ) r_val = TRUE;
+    else............................r_val = FALSE;
+
+    return r_val;
+}
+```
+
+### Comments
+
+Please use C++-style comments, and line-break your comments somewhere between character (column) 72 and 80.
+```
+{
+    // This is a comment. This comment can span multiple lines, but it should 
+    // not extend beyond column 80. (For these purposes, you can count a tab 
+    // as anywhere from one to four spaces.)
+}
+```
+If you are inserting comments in a macro definition, you must use C-style comments:
+```
+#define bli_some_macro( x ) \
+\
+    /* This is a comment in a macro definition. It, too, should not spill
+       beyond column 80. Please place the ending comment marker on the last
+       line containing words, unless the comment marker would cause you to
+       go beyond column 80, in which case you can place it on the next line
+       aligned with the first comment marker. */
+```
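One way to spot comment lines that run past column 80 is a small shell check. This is only a sketch: it counts each tab as four columns (the upper end of the range mentioned above), relies on the standard `expand` and `awk` utilities, and uses a throwaway sample file rather than real BLIS source.

```shell
# Sample text; the second line is deliberately too long.
sample=$(mktemp)
printf '%s\n' \
    '// A short comment.' \
    '// This comment line keeps going and going well past the eightieth column of the file.' \
    > "$sample"

# Expand tabs to four columns, then print "line: width" for each offender.
long=$(expand -t 4 "$sample" | awk 'length($0) > 80 { print NR ": " length($0) }')
echo "$long"
rm -f "$sample"
```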
+
+### Blank lines
+
+Please use blank lines to separate unrelated lines of code. However, if adjacent lines of code are meaningfully related, please skip the blank line.
+```
+{
+    // Set the matrix datatype.
+    bli_obj_set_dt( BLIS_DOUBLE, x );
+
+    // Set the matrix dimensions.
+    bli_obj_set_length( 10, x );
+    bli_obj_set_width( 5, x );
+}
+```
+
+### Condensing short code to single lines
+
+Sometimes, to more efficiently display code on the screen, it's helpful to skip certain newlines, such as those in conditional statements. This is fine; just try to line things up in a way that is visually appealing.
+```
+{
+    bool_t r_val;
+    dim_t  foo;
+
+    // This is fine.
+    if ( bli_obj_is_real( x ) ) r_val = TRUE;
+    else                        r_val = FALSE;
+
+    // This is okay. (Notice the spaces after '{' and before '}'.)
+    // However, the next example is preferred over this style.
+    if ( bli_obj_is_real( x ) ) { r_val = TRUE; foo = 1; }
+    else                        { r_val = FALSE; foo = 0; }
+
+    // Similar to above, but with some extra alignment. This is better
+    // than above.
+    if ( bli_obj_is_real( x ) ) { r_val = TRUE;  foo = 1; }
+    else                        { r_val = FALSE; foo = 0; }
+}
+```
+
+### Whitespace in function calls
+
+For single-line function calls, please **avoid** a space between the last character in the function/macro name and the open parenthesis. Also, please do not insert any spaces before commas that separate arguments to a function/macro invocation.
+```
+{
+    obj_t x;
+
+    // Good.
+    bli_obj_create( BLIS_DOUBLE, 3, 4, 0, 0, &x );
+    bli_obj_set_length( 10, x );
+
+    // Bad. Please avoid.
+    bli_obj_set_dt ( BLIS_FLOAT, x );
+
+    // Bad. Please avoid.
+    bli_obj_set_dt( BLIS_FLOAT , x );
+}
+```
+For multi-line function calls, please use the following template:
+```
+{
+    bli_dgemm
+    (
+      BLIS_NO_TRANSPOSE,
+      BLIS_TRANSPOSE,
+      m, n, k,
+      &BLIS_ONE,
+      a, rs_a, cs_a,
+      b, rs_b, cs_b,
+      &BLIS_ZERO,
+      c, rs_c, cs_c
+    );
+}
+```
+Notice that here, the parentheses are formatted similarly to braces. However, notice that the arguments do not constitute a new level of "scope." Instead, you should use exactly two additional spaces before each line of arguments.
+
+### Whitespace in function definitions
+
+When defining a function with few arguments, insert a single space after commas and types, and after the first parenthesis and before the last parenthesis:
+```
+void bli_obj_set_length( dim_t m, obj_t* a )
+{
+    // Body of function
+}
+```
+As with single-line function calls, please do not place a space between the last character of the function name and the open parenthesis of the argument list!
+
+When defining a function with many arguments, especially those that would not comfortably fit in a single 80-character line, you can split the type signature into multiple lines:
+```
+void bli_gemm
+     (
+       obj_t*  alpha,
+       obj_t*  a,
+       obj_t*  b,
+       obj_t*  beta,
+       obj_t*  c,
+       cntx_t* cntx
+     )
+{
+    // Body of function
+}
+```
+If you are going to use this style of function definition, please indent the parentheses exactly five spaces (don't use tabs here). Then, indent the arguments with an additional two spaces. Thus, parentheses should be in column 6 (counting from 1) and argument types should begin in column 8. Also notice that the number of spaces after each argument's type specifier varies so that the argument names are aligned. If you insert qualifiers such as `restrict`, please right-justify them:
+```
+void bli_gemm
+     (
+       obj_t*  restrict alpha,
+       obj_t*  restrict a,
+       obj_t*  restrict b,
+       obj_t*  restrict beta,
+       obj_t*  restrict c,
+       cntx_t* restrict cntx
+     )
+{
+    // Body of function
+}
+```
+
+### Whitespace in expressions
+
+Please insert whitespace into conditional expressions.
+```
+{
+    // Good.
+    if ( m == 10 && n > 0 ) return;
+
+    // Bad.
+    if ( m==10 && n>0 ) return;
+
+    // Worse!
+    if (m==10&&n>0) return;
+}
+```
+Unlike with the parentheses that surround the argument list of a function call, there should be exactly one space between a conditional keyword and the open parenthesis of its associated conditional expression: `if (...)`, `else if (...)`, and `while (...)`.
+
+Sometimes, extra spaces for alignment are desired:
+```
+{
+    // This is okay.
+    if ( m == 0 ) return 0;
+    else if ( n == 0 ) return 1;
+
+    // This is sometimes preferred because it allows your eyes to more easily
+    // see the differences between the 'if' conditional expression and the
+    // 'else if' conditional expression.
+    if      ( m == 0 ) return 0;
+    else if ( n == 0 ) return 1;
+}
+```
+
+### Trailing whitespace
+
+Please try to avoid inserting any trailing whitespace. This also means that "blank" lines should not contain any tabs or spaces.
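A quick way to check a file for trailing whitespace before submitting is a `grep` over the POSIX `[[:space:]]` class. The sketch below runs against a small generated sample rather than real BLIS source; the sample contents and variable names are purely illustrative.

```shell
# Sample file: line 2 ends in spaces, and line 3 is a "blank" line holding a tab.
sample=$(mktemp)
printf 'int a;\nint b;  \n\t\nint c;\n' > "$sample"

# Collect the line numbers of lines that end in a space or tab.
hits=$(grep -n '[[:space:]]$' "$sample" | cut -d: -f1 | tr '\n' ' ')
echo "lines with trailing whitespace: $hits"
rm -f "$sample"
```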
--- a/docs/ConfigurationHowTo.md
+++ b/docs/ConfigurationHowTo.md
@@ -0,0 +1,802 @@
+## Contents
+
+* **[Contents](ConfigurationHowTo#contents)**
+* **[Introduction](ConfigurationHowTo#introduction)**
+* **[Sub-configurations](ConfigurationHowTo#sub-configurations)**
+  * [`bli_cntx_init_*.c`](ConfigurationHowTo#bli_cntx_init_c)
+  * [`bli_family_*.h`](ConfigurationHowTo#bli_family_h)
+  * [`make_defs.mk`](ConfigurationHowTo#make_defsmk)
+* **[Configuration families](ConfigurationHowTo#configuration-families)**
+* **[Configuration registry](ConfigurationHowTo#configuration-registry)**
+  * [Walkthrough](ConfigurationHowTo#walkthrough)
+  * [Printing the configuration registry lists](ConfigurationHowTo#printing-the-configuration-registry-lists)
+* **[Adding a new kernel set](ConfigurationHowTo#adding-a-new-kernel-set)**
+* **[Adding a new configuration family](ConfigurationHowTo#adding-a-new-configuration-family)**
+* **[Adding a new sub-configuration](ConfigurationHowTo#adding-a-new-sub-configuration)**
+* **[Further development topics](ConfigurationHowTo#further-development-topics)**
+  * [Querying the current configuration](ConfigurationHowTo#querying-the-current-configuration)
+  * [Header dependencies](ConfigurationHowTo#header-dependencies)
+  * [Still have questions?](ConfigurationHowTo#still-have-questions)
+
+## Introduction
+
+This wiki describes how to manage, edit, and create BLIS framework configurations. **The target audience is primarily BLIS developers** who wish to add support for new types of hardware, and developers who write (or tinker with) BLIS kernels.
+
+The [wiki](BuildSystem) for the BLIS build system introduces the concept of a BLIS [configuration](BuildSystem#step-1-choose-a-framework-configuration). There are actually two types of configurations: sub-configurations and configuration families.
+
+A _sub-configuration_ encapsulates all of the information needed to build BLIS for a particular microarchitecture. For example, the `haswell` configuration allows a user or developer to build a BLIS library that targets hardware based on Intel Haswell (or Broadwell or Skylake/Kabylake desktop) microprocessors. Such a sub-configuration typically includes optimized kernels as well as the corresponding cache and register blocksizes that allow those kernels to work well on the target hardware.
+
+A _configuration family_ simply specifies a collection of other registered sub-configurations. For example, the `intel64` configuration allows a user or developer to build a BLIS library that includes several Intel x86_64 configurations, and hence supports multiple microarchitectures simultaneously. The appropriate configuration information (e.g. kernels and blocksizes) will be selected via some hardware detection heuristic (e.g. the `CPUID` instruction) at runtime. (**Note:** Prior to 290dd4a, configuration families could only be defined in terms of sub-configurations. Starting with 290dd4a, configuration families may be defined in terms of other families.)
+
+Both of these configuration types are organized as directories of files and then "registered" into a configuration registry file named `config_registry`, which resides in the top-level directory.
+
+
+
+## Sub-configurations
+
+A sub-configuration is represented by a sub-directory of the `config` directory in the top-level of the BLIS distribution:
+```
+$ ls config
+amd64      cortexa15  excavator  intel64  old         power7       template
+bgq        cortexa57  generic    knc      penryn      sandybridge  zen
+bulldozer  cortexa9   haswell    knl      piledriver  steamroller
+```
+Let's inspect the `haswell` configuration as an example:
+```
+$ ls config/haswell
+bli_cntx_init_haswell.c  bli_family_haswell.h  make_defs.mk
+```
+A sub-configuration (`haswell`, in this case) usually contains just three files:
+  * `bli_cntx_init_haswell.c`. This file contains the initialization function for a context targeting the hardware in question, in this case, Intel Haswell. A context, or `cntx_t` object, in BLIS encapsulates all of the hardware-specific information--including kernel function pointers and cache and register blocksizes--necessary to support all of the main computational operations in BLIS. The initialization function inside this file should be named the same as the filename (excluding `.c` suffix), which should begin with prefix `bli_cntx_init_` and end with the (lowercase) name of the sub-configuration. The context initialization function (in this case, `bli_cntx_init_haswell()`) is used internally by BLIS when setting up the global kernel structure--a mechanism for managing and supporting multiple microarchitectures simultaneously, so that the choice of which context to use can be deferred until the computation is ready to execute. 
+  * `bli_family_haswell.h`. This header file is `#included` when the configuration in question, in this case `haswell`, is the target of `./configure`. This is where you would specify certain global parameters and settings. For example, if you wanted to specify custom implementations of `malloc()` and `free()`, this is where you would specify them. The file is oftentimes empty. (In the case of configuration families, the definitions in this file apply to the _entire_ build, and not any specific sub-configuration, but for consistency we support them for all configuration targets, whether they be singleton sub-configurations or configuration families.)
+  * `make_defs.mk`. This makefile fragment defines the compiler and compiler flags to use during compilation. Specifically, the values defined in this file are used whenever compiling source code specific to the sub-configuration (i.e., reference kernels and optimized kernels). If the sub-configuration is the target of `configure`, then these flags are also used to compile general framework code.
+
+Providing these three components constitutes a complete sub-configuration. A more detailed description of each file will follow.
+
+
+
+### bli_cntx_init_*.c
+
+As mentioned above, the kernels used by a sub-configuration are specified in the `bli_cntx_init_` function. This function is flexible in that the context is typically initialized with a set of "reference" kernels. Then, the kernel developer overwrites the fields in the context that correspond to kernel operations that have optimized counterparts that should be used instead.
+
+Let's use the following hypothetical function definition to guide our walkthrough.
+```
+#include "blis.h"
+
+void bli_cntx_init_fooarch( cntx_t* cntx )
+{
+    blksz_t blkszs[ BLIS_NUM_BLKSZS ];
+
+    // Set default kernel blocksizes and functions.
+    bli_cntx_init_fooarch_ref( cntx );
+
+    // -------------------------------------------------------------------------
+
+    // Update the context with optimized native gemm micro-kernels and
+    // their storage preferences.
+    bli_cntx_set_l3_nat_ukrs
+    (
+      5,
+      BLIS_GEMM_UKR,       BLIS_DOUBLE, bli_dgemm_bararch_asm,       FALSE,
+      BLIS_GEMMTRSM_L_UKR, BLIS_DOUBLE, bli_dgemmtrsm_l_bararch_asm, FALSE,
+      BLIS_GEMMTRSM_U_UKR, BLIS_DOUBLE, bli_dgemmtrsm_u_bararch_asm, FALSE,
+      BLIS_TRSM_L_UKR,     BLIS_DOUBLE, bli_dtrsm_l_bararch_asm,     FALSE,
+      BLIS_TRSM_U_UKR,     BLIS_DOUBLE, bli_dtrsm_u_bararch_asm,     FALSE,
+      cntx
+    );
+
+    // Update the context with optimized packm kernels.
+    bli_cntx_set_packm_kers
+    (
+      2,
+      BLIS_PACKM_4XK_KER, BLIS_DOUBLE, bli_dpackm_bararch_asm_4xk,
+      BLIS_PACKM_8XK_KER, BLIS_DOUBLE, bli_dpackm_bararch_asm_8xk,
+      cntx
+    );
+
+    // Update the context with optimized level-1f kernels.
+    bli_cntx_set_l1f_kers
+    (
+      5,
+      BLIS_AXPY2V_KER,    BLIS_DOUBLE, bli_daxpy2v_fooarch_asm,
+      BLIS_DOTAXPYV_KER,  BLIS_DOUBLE, bli_ddotaxpyv_fooarch_asm,
+      BLIS_AXPYF_KER,     BLIS_DOUBLE, bli_daxpyf_fooarch_asm,
+      BLIS_DOTXF_KER,     BLIS_DOUBLE, bli_ddotxf_fooarch_asm,
+      BLIS_DOTXAXPYF_KER, BLIS_DOUBLE, bli_ddotxaxpyf_fooarch_asm,
+      cntx
+    );
+
+    // Update the context with optimized level-1v kernels.
+    bli_cntx_set_l1v_kers
+    (
+      2,
+      BLIS_AXPYV_KER, BLIS_DOUBLE, bli_daxpyv_fooarch_asm,
+      BLIS_DOTV_KER,  BLIS_DOUBLE, bli_ddotv_fooarch_asm,
+      cntx
+    );
+
+    // Initialize level-3 blocksize objects with architecture-specific values.
+    //                                           s      d      c      z
+    bli_blksz_init_easy( &blkszs[ BLIS_MR ],     8,     8,     8,     4 );
+    bli_blksz_init_easy( &blkszs[ BLIS_NR ],     8,     4,     4,     4 );
+    bli_blksz_init_easy( &blkszs[ BLIS_MC ],   128,   128,   128,   128 );
+    bli_blksz_init_easy( &blkszs[ BLIS_KC ],   256,   256,   256,   256 );
+    bli_blksz_init_easy( &blkszs[ BLIS_NC ],  4096,  4096,  4096,  4096 );
+
+    // Update the context with the current architecture's register and cache
+    // blocksizes (and multiples) for native execution.
+    bli_cntx_set_blkszs
+    (
+      BLIS_NAT, 5,
+      BLIS_NC, &blkszs[ BLIS_NC ], BLIS_NR,
+      BLIS_KC, &blkszs[ BLIS_KC ], BLIS_KR,
+      BLIS_MC, &blkszs[ BLIS_MC ], BLIS_MR,
+      BLIS_NR, &blkszs[ BLIS_NR ], BLIS_NR,
+      BLIS_MR, &blkszs[ BLIS_MR ], BLIS_MR,
+      cntx
+    );
+}
+```
+_**Function name/signature.**_ This function always takes one argument, a pointer to a `cntx_t` object. As with the name of the file, it should be named with the prefix `bli_cntx_init_` followed by the lowercase name of the configuration--in this case, `fooarch`.
+
+_**Blocksize object array.**_ The `blkszs` array declaration is needed later in the function and should generally be consistent (and unchanged) across all configurations.
+
+_**Reference initialization.**_ The first function call, `bli_cntx_init_fooarch_ref()`, initializes the context `cntx` with function pointers to reference implementations of all of the kernels supported by BLIS (as well as cache and register blocksizes, and other fields). This function is automatically generated by BLIS for every sub-configuration enabled at configure-time. The function prototype is generated by a preprocessor macro in `frame/include/bli_arch_config.h`.
+
+_**Level-3 micro-kernels.**_ The second function call is to a variable argument function, `bli_cntx_set_l3_nat_ukrs()`, which updates `cntx` with five optimized double-precision real level-3 micro-kernels. The first argument encodes the number of individual kernels being registered into the context. Every subsequent line, except for the last line, is associated with the registration of a single kernel, and each of these lines is independent of one another and can occur in any order, provided that the kernel parameters of each line occur in the same order--kernel ID, followed by datatype, followed by function name, followed by storage preference boolean (i.e., whether the micro-kernel prefers row storage). The last argument of the function call is the address of the context being updated, `cntx`. Notice that we are registering micro-kernels written for another type of hardware, `bararch`, because in our hypothetical universe `bararch` is very similar to `fooarch` and so we recycle the code between the two configurations. After the function returns, the context contains pointers to optimized double-precision real level-3 micro-kernels. Note that the context will still contain reference micro-kernels for single-precision real and complex, and double-precision complex computation, as those kernels were not updated.
+
+_Note:_ Currently, BLIS only allows the kernel developer to signal a preference (row or column) for `gemm` micro-kernels. The preferences of the `gemmtrsm` and `trsm` micro-kernels can (and must) be set, but they are ignored by the framework during execution.
+
+_**Level-1m (packm) kernels.**_ The third function call is to another variable argument function, `bli_cntx_set_packm_kers()`. This function works very similarly to `bli_cntx_set_l3_nat_ukrs()`, except that it expects a different set of kernel IDs (because now we are registering level-1m kernels) and it does not take a storage preference boolean. After this function returns, `cntx` contains function pointers to optimized double-precision real `packm` kernels. These kernels, like the level-3 kernels previously, are also borrowed from the `bararch` kernel set. Unregistered `packm` kernels will continue to point to reference code.
+
+_**Level-1f kernels.**_ The fourth function call is to yet another variable argument function, `bli_cntx_set_l1f_kers()`. This function has the same signature as `bli_cntx_set_packm_kers()`, except that it expects a different set of kernel IDs (because now we are registering level-1f kernels). After this function returns, `cntx` contains function pointers to optimized double-precision real level-1f kernels. These kernels are written for `fooarch` specifically. The unregistered level-1f kernels will continue to point to reference code.
+
+_**Level-1v kernels.**_ The fifth function call is to `bli_cntx_set_l1v_kers()`, which operates similarly to `bli_cntx_set_l1f_kers()`, except here we are registering level-1v kernels. After the function returns, most kernels will continue to point to reference code, except double-precision real instances of `axpyv` and `dotv`.
+
+For a complete list of kernel IDs, please see the definitions of `l3ukr_t`, `l1mkr_t`, `l1fkr_t`, `l1vkr_t` in [frame/include/bli_type_defs.h](https://github.com/flame/blis/blob/master/frame/include/bli_type_defs.h).
+
+_**Setting blocksizes.**_ The next block of code initializes the `blkszs` array with register and cache blocksize values for each datatype. The values here are used by the level-3 operations that employ the level-3 micro-kernels we registered previously. We use `bli_blksz_init_easy()` when initializing only the primary value. If the auxiliary value needs to be set to a different value than the primary, `bli_blksz_init()` should be used instead, as in:
+```
+    //                                           s      d      c      z
+    bli_blksz_init_easy( &blkszs[ BLIS_MR ],     0,     8,     0,     0 );
+    bli_blksz_init_easy( &blkszs[ BLIS_NR ],     0,     4,     0,     0 );
+    bli_blksz_init     ( &blkszs[ BLIS_MC ],     0,   128,     0,     0,
+                                                 0,   160,     0,     0 );
+    bli_blksz_init     ( &blkszs[ BLIS_KC ],     0,   256,     0,     0,
+                                                 0,   288,     0,     0 );
+    bli_blksz_init_easy( &blkszs[ BLIS_NC ],     0,  4096,     0,     0 );
+```
+Here, we use `bli_blksz_init()` to set different auxiliary (maximum) cache blocksizes for _MC_ and _KC_. The same function could be used to set auxiliary (packing) register blocksizes for _MR_ and _NR_, which correspond to the _PACKMR_ and _PACKNR_ parameters. Other blocksizes, particularly those corresponding to level-1f operations, may be set. For a complete list of blocksize IDs, please see the definitions of `bszid_t` in [frame/include/bli_type_defs.h](https://github.com/flame/blis/blob/master/frame/include/bli_type_defs.h). For more information on interpretations of the auxiliary blocksize value, see the digressions below.
+
+Note that we set level-3 blocksizes even for datatypes that retain reference code kernels; however, by passing in `0` for those blocksizes, we indicate to `bli_blksz_init()` and `bli_blksz_init_easy()` that the current value should be left untouched. In the example above, this leaves the blocksizes associated with the reference kernels (set by `bli_cntx_init_fooarch_ref()`) intact for the single real, single complex, and double complex datatypes.
+
+_Digression:_ Auxiliary blocksize values for register blocksizes are interpreted as the "packing" register blocksizes. _PACKMR_ and _PACKNR_ serve as "leading dimensions" of the packed micro-panels that are passed into the micro-kernel. Oftentimes, _PACKMR = MR_ and _PACKNR = NR_, and thus the developer does not typically need to set these values manually. (See the [implementation notes for gemm](KernelsHowTo#implementation-notes-for-gemm) in the BLIS Kernel guide for more details on these topics.)
+
+_Digression:_ Auxiliary blocksize values for cache blocksizes are interpreted as the maximum cache blocksizes. The maximum cache blocksizes are a convenient and portable way of smoothing performance of the level-3 operations when computing with a matrix operand that is just slightly larger than a multiple of the preferred cache blocksize in that dimension. In these "edge cases," iterations run with highly sub-optimal blocking. We can address this problem by merging the "edge case" iteration with the second-to-last iteration, such that the cache blocksizes are slightly larger--rather than significantly smaller--than optimal. The maximum cache blocksizes allow the developer to specify the _maximum_ size of this merged iteration; if the edge case causes the merged iteration to exceed this maximum, then the edge case is _not_ merged and is instead computed in a separate (final) iteration.
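+
+To make the merging rule concrete, here is a worked example. It assumes the double-precision values from the `bli_blksz_init()` snippet above (_MC_ = 128 with a maximum of 160) and applies the rule exactly as described in this digression; the matrix dimensions are made up for illustration:
+```
+m = 270:  128 + 128 + 14   edge case is 14; merged iteration would be 128 + 14 = 142
+          142 <= 160, so the partitioning becomes 128 + 142
+
+m = 300:  128 + 128 + 44   edge case is 44; merged iteration would be 128 + 44 = 172
+          172 >  160, so the partitioning stays 128 + 128 + 44
+```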
+
+_**Committing blocksizes.**_ Finally, we commit the values in `blkszs` to the context by calling the variable argument function `bli_cntx_set_blkszs()`. This function call generally should be considered boilerplate and thus should not be changed unless you are altering the matrix multiplication _algorithm_ as specified in the control tree. If this is your goal, please get in contact with BLIS developers via the [blis-devel](http://groups.google.com/group/blis-devel) mailing list for guidance, if you have not done so already.
+
+_**Availability of kernels.**_ Note that any kernel made available to the `fooarch` configuration within `config_registry` may be referenced inside `bli_cntx_init_fooarch()`. In this example, we referenced `fooarch` kernels as well as kernels native to another configuration, `bararch`. Thus, the `config_registry` would contain a line such as:
+```
+fooarch: fooarch/fooarch/bararch
+```
+Interpreting the line left-to-right: the `fooarch` configuration family contains only itself, `fooarch`, but must be able to refer to kernels from its own kernel set (`fooarch`) as well as kernels belonging to the `bararch` kernel set. The configuration registry is described more completely [in a later section](ConfigurationHowTo#configuration-registry).
+
+
+
+### bli_family_*.h
+
+This file is conditionally `#included` only for the configuration family targeted at configure-time. For example, if you run `./configure haswell`, `bli_family_haswell.h` will be `#included`, and if you run `./configure intel64`, `bli_family_intel64.h` will be `#included`. The header file is `#included` by [frame/include/bli_arch_config.h](https://github.com/flame/blis/blob/master/frame/include/bli_arch_config.h).
+
+This header file is oftentimes empty. This is because the parameters specified here usually work fine with their default values, which are defined in [frame/include/bli_kernel_macro_defs.h](https://github.com/flame/blis/blob/master/frame/include/bli_kernel_macro_defs.h). However, there may be some configurations for which a kernel developer will wish to adjust some of these parameters. Furthermore, when creating a configuration family, the parameters set in the corresponding `bli_family_*.h` file must work for **all** sub-configurations in the family.
+
+A description of the parameters that may be set in `bli_family_*.h` follows.
+
+_**Memory allocation functions.**_ BLIS allows the developer to customize the functions called for memory allocation for three different categories of memory: user, pool, and internal. The functions for user allocation are called any time the creation of a BLIS matrix or vector `obj_t` requires that a matrix buffer be allocated, such as via `bli_obj_create()`. The functions for pool allocation are called only when allocating blocks to the memory pools used to manage packed matrix buffers. The functions for internal allocation are called by BLIS when allocating internal data structures, such as control trees. By default, the three pairs of parameters are defined via preprocessor macros to call the implementation of `malloc()` and `free()` provided by `stdlib.h`:
+```
+#define BLIS_MALLOC_USER  malloc
+#define BLIS_FREE_USER    free
+
+#define BLIS_MALLOC_POOL  malloc
+#define BLIS_FREE_POOL    free
+
+#define BLIS_MALLOC_INTL  malloc
+#define BLIS_FREE_INTL    free
+```
+Any substitute for `malloc()` and `free()` defined by customizing these parameters must use the same function prototypes as the original functions. Namely:
+```
+void* malloc( size_t size );
+void  free( void* p );
+```
+Furthermore, if a header file needs to be included, such as `my_malloc.h`, it should be `#included` within the `bli_family_*.h` file (before `#defining` any of the `BLIS_MALLOC_` and `BLIS_FREE_` macros).
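+
+As a minimal sketch, a hypothetical `bli_family_fooarch.h` that routes pool allocation through a custom allocator might look like the following. The function names `my_malloc()` and `my_free()` are assumptions made for this example; they stand in for whatever `my_malloc.h` actually declares, and they must match the prototypes shown above:
+```
+// bli_family_fooarch.h (hypothetical example)
+
+// Include the custom allocator's header before defining the macros.
+#include "my_malloc.h"
+
+// Use the custom allocator only for the packing-buffer memory pools;
+// user and internal allocation keep their default values.
+#define BLIS_MALLOC_POOL  my_malloc
+#define BLIS_FREE_POOL    my_free
+```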
+
+_**SIMD register file.**_ BLIS allows you to specify the _maximum_ number of SIMD registers available for use by your kernels, as well as the _maximum_ size (in bytes) of those registers. These values default to:
+```
+#define BLIS_SIMD_NUM_REGISTERS  32
+#define BLIS_SIMD_SIZE           64
+```
+These macros are used in computing the maximum amount of temporary storage (typically allocated statically, on the function stack) that will be needed to hold a single micro-tile of any datatype (and for any induced method):
+```
+#define BLIS_STACK_BUF_MAX_SIZE  ( BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE * 2 )
+```
+These temporary buffers are used when handling edge cases (m % _MR_ != 0 || n % _NR_ != 0) within the level-3 macro-kernels, and also in the virtual micro-kernels of various implementations of induced methods for complex matrix multiplication. It is **very important** that these values be set correctly; otherwise, you may experience undefined behavior as stack data is overwritten at run-time. A kernel developer may set `BLIS_SIMD_NUM_REGISTERS` and `BLIS_SIMD_SIZE`, which will indirectly affect `BLIS_STACK_BUF_MAX_SIZE`, or may set `BLIS_STACK_BUF_MAX_SIZE` directly. Notice that the default values are already set to work with modern x86_64 systems.
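+
+As a rough illustration, a hypothetical `bli_family_fooarch.h` for hardware with sixteen 32-byte SIMD registers (the architecture and its register file are assumptions made for this example) might override the defaults as follows:
+```
+// bli_family_fooarch.h (hypothetical example)
+
+// Assumed register file for fooarch: 16 SIMD registers of 32 bytes each.
+#define BLIS_SIMD_NUM_REGISTERS  16
+#define BLIS_SIMD_SIZE           32
+
+// With these values, the expression shown above evaluates to
+// 16 * 32 * 2 = 1024 bytes, versus 32 * 64 * 2 = 4096 bytes by default.
+```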
+
+_**Memory alignment.**_ BLIS implements memory alignment internally, rather than relying on a function such as `posix_memalign()`, and thus it can provide aligned memory even with functions that adhere to the `malloc()` and `free()` API in the standard C library.
+```
+#define BLIS_SIMD_ALIGN_SIZE             BLIS_SIMD_SIZE
+#define BLIS_PAGE_SIZE                   4096
+
+#define BLIS_STACK_BUF_ALIGN_SIZE        BLIS_SIMD_ALIGN_SIZE
+#define BLIS_HEAP_ADDR_ALIGN_SIZE        BLIS_SIMD_ALIGN_SIZE
+#define BLIS_HEAP_STRIDE_ALIGN_SIZE      BLIS_SIMD_ALIGN_SIZE
+#define BLIS_POOL_ADDR_ALIGN_SIZE        BLIS_PAGE_SIZE
+```
+The value `BLIS_STACK_BUF_ALIGN_SIZE` defines the alignment of stack memory used as temporary internal buffers, such as for output matrices to the micro-kernel when computing edge cases. (See [implementation notes](KernelsHowTo#implementation-notes-for-gemm) for the `gemm` micro-kernel for details.) This value defaults to `BLIS_SIMD_ALIGN_SIZE`, which defaults to `BLIS_SIMD_SIZE`.
+
+The value `BLIS_HEAP_ADDR_ALIGN_SIZE` defines the alignment used when allocating memory via the `malloc()` function defined by `BLIS_MALLOC_USER`. Setting this value to `BLIS_SIMD_ALIGN_SIZE` may speed up certain level-1v and -1f kernels. 
+
+The value `BLIS_HEAP_STRIDE_ALIGN_SIZE` defines the alignment used for so-called "leading dimensions" (i.e. column strides for column-stored matrices, and row strides for row-stored matrices) when creating BLIS matrices via the object-based API (e.g. `bli_obj_create()`). While setting `BLIS_HEAP_ADDR_ALIGN_SIZE` guarantees alignment for the first column (or row), creating a matrix with certain dimension values (_m_ and _n_) may cause subsequent columns (or rows) to be misaligned. Setting this value to `BLIS_SIMD_ALIGN_SIZE` is usually desirable. Additional alignment may or may not be beneficial.
+
+The value `BLIS_POOL_ADDR_ALIGN_SIZE` defines the alignment used when allocating blocks to the memory pools used to manage internal packing buffers. Any block of memory returned by the memory allocator is guaranteed to be aligned to this value. Aligning these blocks to the virtual memory page size (usually 4096 bytes) is standard practice.
+
+
+
+### make_defs.mk
+
+The `make_defs.mk` file primarily contains compiler and compiler flag definitions used by `make` when building a BLIS library. 
+
+The format of the file is mostly self-explanatory. However, we will expound on the contents here, using the `make_defs.mk` file for the `haswell` configuration as an example:
+```
+# Declare the name of the current configuration and add it to the
+# running list of configurations included by common.mk.
+THIS_CONFIG    := haswell
+
+ifeq ($(CC),)
+CC             := gcc
+CC_VENDOR      := gcc
+endif
+
+CPPROCFLAGS    := -D_POSIX_C_SOURCE=200112L
+CMISCFLAGS     := -std=c99 -m64
+CPICFLAGS      := -fPIC
+CWARNFLAGS     := -Wall -Wno-unused-function -Wfatal-errors
+
+ifneq ($(DEBUG_TYPE),off)
+CDBGFLAGS      := -g
+endif
+
+ifeq ($(DEBUG_TYPE),noopt)
+COPTFLAGS      := -O0
+else
+COPTFLAGS      := -O3
+endif
+
+CKOPTFLAGS     := $(COPTFLAGS)
+
+ifeq ($(CC_VENDOR),gcc)
+CVECFLAGS      := -mavx2 -mfma -mfpmath=sse -march=core-avx2
+else
+ifeq ($(CC_VENDOR),icc)
+CVECFLAGS      := -xCORE-AVX2
+else
+ifeq ($(CC_VENDOR),clang)
+CVECFLAGS      := -mavx2 -mfma -mfpmath=sse -march=core-avx2
+else
+$(error gcc, icc, or clang is required for this configuration.)
+endif
+endif
+endif
+
+# Store all of the variables here to new variables containing the
+# configuration name.
+$(eval $(call store-make-defs,$(THIS_CONFIG)))
+```
+_**Configuration name.**_ The first statement reaffirms the name of the configuration. The `THIS_CONFIG` variable is used later to attach the configuration name as a suffix to the remaining variables so that they can co-exist with variables read from other `make_defs.mk` files during multi-configuration builds. Note that if the configuration name defined here does not match the name of the directory in which `make_defs.mk` is stored, `make` will output an error when executing the top-level `Makefile`.
+
+_**Compiler definitions.**_ Next, we set the values of `CC` and `CC_VENDOR`. The former is the name (or path) to the actual compiler executable to use during compilation. The latter is the compiler family. Currently, BLIS generally supports three compiler families: `gcc`, `clang`, and `icc`. `CC_VENDOR` is used when conditionally setting various variables based on the type of flags available--flags that might not vary across different versions or installations of the same compiler (e.g. `gcc-4.9` vs `gcc-5.0`, or `gcc` vs `/usr/local/bin/gcc`), but may vary across compiler families (e.g. `gcc` vs. `icc`). If the compiler you wish to use is in your `PATH` environment variable, `CC` and `CC_VENDOR` will usually contain the same value.
+
+_**Basic compiler flags.**_ The variables `CPPROCFLAGS` and `CWARNFLAGS` should be assigned C preprocessor flags and compiler warning flags, respectively, while `CPICFLAGS` should be assigned the flags that enable position-independent code (as needed for shared libraries). Finally, `CMISCFLAGS` may be assigned any miscellaneous flags that do not neatly fit into any other category, such as language flags and 32-/64-bit flags. These four categories of flags are usually recognized across compiler families.
+
+_**Debugging flags.**_ The `CDBGFLAGS` variable should be assigned flags that insert debugging symbols into the object code emitted by the compiler. Typically, this amounts to no more than the `-g` flag, but some compilers or situations may call for different (or additional) flags. This variable is conditionally set only if `$(DEBUG_TYPE)`, which is set by the `configure` script, is not equal to `off`.
+
+_**Optimization flags.**_ The `COPTFLAGS` variable should be assigned any flags relating to general compiler optimization. Usually this takes the form of `-O2` or `-O3`, but more specific optimization flags may be included as well, such as `-fomit-frame-pointer`. Note that, as with `CDBGFLAGS`, `COPTFLAGS` is conditionally assigned based on the value of `$(DEBUG_TYPE)`. A separate `CKOPTFLAGS` variable tracks optimization flags used when compiling kernels. For most configurations, `CKOPTFLAGS` is assigned as a copy of `COPTFLAGS`, but if the kernel developer needs different optimization flags to be applied when compiling kernel source code, `CKOPTFLAGS` should be set accordingly.
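+
+For instance, a hypothetical `make_defs.mk` whose kernels need one extra optimization flag beyond the general set could extend `CKOPTFLAGS` rather than copy `COPTFLAGS` verbatim (the choice of flag here is only illustrative):
+```
+# General framework code uses COPTFLAGS as-is.
+COPTFLAGS      := -O3
+
+# Kernel code gets the same flags plus one more.
+CKOPTFLAGS     := $(COPTFLAGS) -fomit-frame-pointer
+```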
+
+_**Vectorization flags.**_ The second-to-last block sets `CVECFLAGS`, which should be assigned any flags that must be given to the compiler in order to enable use of a vector instruction set needed or assumed by the kernel source code. Also, if you wish to enable automatic use of certain instruction sets (e.g. `-mfpmath=sse` for many Intel architectures), this is where you should set those flags. These flags often differ among compiler families, especially between `icc` and `gcc`/`clang`.
+
+_**Variable storage/renaming.**_ Finally, the last statement commits the variables defined in the file to "storage". That is, they are copied to variable names that contain `THIS_CONFIG` as a suffix. This allows the variables for one configuration to co-exist with variables of another configuration.
+
+
+## Configuration families
+
+A configuration family is represented similarly to a sub-configuration: as a sub-directory of the `config` directory. Additionally, there are two types of families: singleton families and umbrella families.
+
+A _singleton_ family simply refers to a sub-configuration. The `configure` script only targets configuration families. But since every sub-configuration is also a valid configuration family, every sub-configuration is a valid configuration target.
+
+An _umbrella_ family is the more interesting type of configuration family. These families are defined as collections of architecturally related sub-configurations. (**Important:** an umbrella family should always be named something different than any of its constituent sub-configurations.) BLIS provides a mechanism to define umbrella families so that users and developers can build a single instance of BLIS that supports multiple configurations, where some heuristic is used at runtime to choose among the configurations. For example, you may wish to deploy a BLIS library on a storage device that is shared among several computers, each of which is based on a different x86_64 microarchitecture.
+
+Throughout the remainder of this document, we will sometimes refer to "umbrella families" as simply "families". Similarly, we will refer to "singleton families" and "sub-configurations" interchangeably. To the extent that any ambiguity may remain, context should clarify which type of family is germane to the discussion.
+
+Let's inspect the `amd64` configuration family as an example:
+```
+$ ls config/amd64
+bli_family_amd64.h  make_defs.mk
+```
+A configuration family contains a subset of the files contained within a sub-configuration: A `bli_family_*.h` header file and a `make_defs.mk` makefile fragment:
+  * `bli_family_amd64.h`. This header file is `#included` only when the configuration family in question, in this case `amd64`, is the target of `./configure`. The file serves a similar purpose as with sub-configurations--a place to define various parameters, such as those relating to memory allocation and alignment. However, in the context of configuration families, the uniqueness of this file makes a bit more sense. Importantly, the definitions in this file will affect **all** sub-configurations within the family. Thus, it is useful to think of these as "global" parameters. For example, if custom implementations of `malloc()` and `free()` are specified in the `bli_family_amd64.h` file, these implementations will be used for every sub-configuration member of the `amd64` family. (The configuration registry, described in [the next section](ConfigurationHowTo#configuration-registry), specifies each configuration family's membership.) As with sub-configurations, this file may be empty, in which case reasonable defaults are selected by the framework.
+  * `make_defs.mk`. This makefile fragment defines the compiler and compiler flags in a manner identical to that of sub-configurations. However, these configuration flags are used when compiling source code that is not specific to any one particular sub-configuration. (The build system compiles a set of reference kernels and optimized kernels for each sub-configuration, during which it uses flags read from the individual sub-configurations' `make_defs.mk` files. By contrast, the general framework code is compiled once--using the flags read from the family's `make_defs.mk` file--and executed by all sub-configurations.)
+
+For a more detailed walkthrough of these files' expected/allowed contents, please see the descriptions provided in the section on [sub-configurations](ConfigurationHowTo#sub-configurations):
+ * [bli_family_*.h](ConfigurationHowTo#bli_family_h)
+ * [make_defs.mk](ConfigurationHowTo#make_defsmk)
+
+With these two files defined and present, the configuration family is properly constituted and ready to be registered within the configuration registry.
+
+
+
+## Configuration registry
+
+The configuration registry is the official place for declaring a sub-configuration or configuration family. Unless a configuration (singleton or family) is declared within the registry, `configure` will not accept it as a valid configuration target at configure-time.
+
+Before describing the syntax and semantics of the registry, we'll first briefly describe three types of information we wish to encode into the registry:
+
+_**Configuration list.**_ First and foremost, the registry needs to enumerate the registered sub-configurations. That is, it needs to list the sub-configurations (or, singleton families) that are available to be targeted by `configure`. The registry also needs to specify configuration family membership--that is, the (umbrella) families to which those sub-configurations belong.
+
+_**Kernel list.**_ Next, the registry needs to specify the list of kernel sets that will be needed by each sub-configuration, and by proxy, each configuration family. It's easy to think of different configurations as corresponding to different microarchitectures, and that generally holds true. However, sometimes we use the same configuration for multiple microarchitectures (e.g. `haswell` is used for Intel Haswell, Broadwell, and non-server Skylake variants). It might also be tempting to think of each microarchitecture as having its own set of kernels. However, in practice, we find that some microarchitectures' kernels are identical to those of a previous microarchitectural revision, or to those of another vendor's microarchitecture. Thus, sometimes a sub-configuration will wish to use a kernel set that is "native" to a different configuration. In these cases, there is not a one-to-one mapping of sub-configuration names to kernel set names, and therefore the configuration registry must separately specify the kernel sets needed by any sub-configuration (and by proxy, any configuration family).
+
+_**Kernel-to-configuration map.**_ Lastly, and most subtly, for each kernel set in the kernel list, the registry needs to specify the sub-configuration(s) that depend on that particular kernel set. Notice that the kernel list can be obtained by mapping sub-configurations to kernel sets they require. By contrast, the kernel-to-configuration map tracks the reverse dependency and helps us answer: for any given kernel set, which sub-configurations caused the kernel set to be pulled into the build? This mapping is needed when determining which sub-configuration's compiler flags (as defined in its `make_defs.mk` file) to use when compiling that kernel set. The most obvious solution to this problem would have been to associate compiler flags with the individual kernel sets. However, given the desire to share kernel sets among sub-configurations, we needed the flexibility of applying different compiler flags to any given kernel set based on the sub-configuration that would be utilizing that kernel set. In the case that multiple sub-configurations pull in the same kernel set, a set of heuristics is used to choose between the sub-configurations so that a single set of compiler flags can be chosen for use when compiling that kernel set.
+
+
+
+### Walkthrough
+
+The configuration registry exists as a human-readable file, `config_registry`, located at the top level of the BLIS distribution. What follows is an example of a `config_registry` file that is based on the actual contents of a BLIS commit that was recent as of this writing. Note that lines containing only whitespace are ignored. Furthermore, any characters that appear after (and including) a `#` are treated as comments and also ignored.
+```
+#
+# config_registry
+#
+
+# Processor families.
+x86_64:      intel64 amd64
+intel64:     haswell sandybridge penryn generic
+amd64:       zen excavator steamroller piledriver bulldozer generic
+arm64:       cortexa57 generic
+arm32:       cortexa15 cortexa9 generic
+
+# Intel architectures.
+haswell:     haswell
+sandybridge: sandybridge
+penryn:      penryn
+knl:         knl
+
+# AMD architectures.
+zen:         zen/haswell/sandybridge
+excavator:   excavator/piledriver
+steamroller: steamroller/piledriver
+piledriver:  piledriver
+bulldozer:   bulldozer
+
+# ARM architectures.
+cortexa57:   cortexa57/armv8a
+cortexa15:   cortexa15/armv7a
+cortexa9:    cortexa9/armv7a
+
+
+# Generic architectures.
+generic:     generic
+```
+Generally speaking, the registry can be thought of as defining a very simple grammar. (However, as you'll soon see, there are nuances that are un-grammar-like.) The registry can contain two kinds of lines. The first type defines a singleton configuration family. For example, the line
+```
+haswell:     haswell
+```
+defines a configuration family `haswell` (the left side of the `:`) as containing only itself: the sub-configuration by the same name, `haswell` (the right side of the `:`). When singleton families are defined in this way, it implicitly pulls in the kernel set by the same name as the sub-configuration (in this case, `haswell`). More specifically, the `haswell` sub-configuration depends on the kernels residing in the `kernels/haswell` sub-directory.
+
+The second type of line defines an umbrella configuration family. For example, the line
+```
+intel64:     haswell sandybridge penryn generic
+```
+defines the configuration family `intel64` as containing the `haswell`, `sandybridge`, `penryn`, and `generic` sub-configurations as members (technically speaking, it is more accurate to think of the family as containing singleton families rather than their corresponding sub-configurations). Thus, if the user runs `./configure intel64`, the library will be built to support all sub-configurations defined within the `intel64` family.
+
+**Note:** `generic` is a somewhat special sub-configuration that uses only reference kernels and reference blocksizes. It is included in every umbrella family so that when those families are instantiated into BLIS libraries and linked to an application, the application will be able to run even if none of the other sub-configurations (`haswell`, `sandybridge`, `penryn`) are chosen at runtime by the hardware detection heuristic.
+
+Some sub-configurations, for various reasons, do not rely on their own set of kernels and instead use the kernel set that is native to another sub-configuration. For example, the `excavator` and `steamroller` configurations each correspond to hardware that is very similar to the hardware targeted by the `piledriver` configuration. In fact, the former two configurations rely exclusively on kernels written for the latter configuration. (Presently, there are no `excavator` or `steamroller` kernel sets in BLIS.) We denote this kernel dependency with a `/` character:
+```
+excavator:   excavator/piledriver
+steamroller: steamroller/piledriver
+```
+Here, the first line (reading from left-to-right) defines the `excavator` singleton family as containing only itself, the `excavator` sub-configuration, and also specifies that this sub-configuration must have access to the `piledriver` kernel set. The second line defines the `steamroller` singleton family in a similar manner. 
+
+**Note:** Specifying non-native kernel sets via the `/` character is only allowed when defining singleton configuration families. They may NOT appear in the definitions of umbrella families! When an umbrella family includes a singleton family that is defined to require non-native kernels, this will be accounted for during the parsing of the `config_registry` file.
+
+Sometimes, a sub-configuration may need access to more than one kernel set. If additional kernel sets are needed, they should be listed with additional `/` characters:
+```
+zen:         zen/haswell/sandybridge
+```
+The line above defines the `zen` singleton family as containing only itself, the `zen` sub-configuration, and also specifies that this sub-configuration must have access to the `haswell` kernel set as well as the `sandybridge` kernel set. What if there exists a `zen` kernel set as well, which the `zen` sub-configuration must access in addition to those of `haswell` and `sandybridge`? In this case, it would need to be annotated explicitly as:
+```
+zen:         zen/zen/haswell/sandybridge
+```
+This line (which is hypothetical and does not appear in the `config_registry` example above) defines the `zen` singleton family in terms of only the `zen` sub-configuration, and provides that sub-configuration access to `zen`, `haswell`, and `sandybridge` kernel sets. (Also: the kernel sets may appear in any order.)
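
The `/` syntax for singleton families can be illustrated with a small parser. The following is a minimal sketch in C, not the logic used by `configure` (which is a shell script); the `demo_` type and function names are hypothetical. It splits the right-hand side of a singleton definition into the sub-configuration name and its kernel sets, and falls back to the same-named kernel set when no `/` is present.

```c
#include <assert.h>
#include <string.h>

#define DEMO_MAX_KERNELS 8
#define DEMO_NAME_LEN    32

/* Hypothetical result type for this sketch; not part of BLIS. */
typedef struct
{
    char subconfig[ DEMO_NAME_LEN ];
    char kernels[ DEMO_MAX_KERNELS ][ DEMO_NAME_LEN ];
    int  n_kernels;
} demo_singleton_t;

/* Parse the right-hand side of a singleton definition, such as
   "zen/haswell/sandybridge". Returns 0 on success and -1 on error. */
int demo_parse_singleton( const char* rhs, demo_singleton_t* out )
{
    char  buf[ 256 ];
    char* tok;

    if ( strlen( rhs ) >= sizeof( buf ) ) return -1;
    strcpy( buf, rhs );

    out->n_kernels = 0;

    /* The first '/'-separated token names the sub-configuration. */
    tok = strtok( buf, "/" );
    if ( tok == NULL || strlen( tok ) >= DEMO_NAME_LEN ) return -1;
    strcpy( out->subconfig, tok );

    /* Any remaining tokens name the kernel sets explicitly. */
    while ( ( tok = strtok( NULL, "/" ) ) != NULL )
    {
        if ( out->n_kernels == DEMO_MAX_KERNELS ) return -1;
        if ( strlen( tok ) >= DEMO_NAME_LEN )     return -1;
        strcpy( out->kernels[ out->n_kernels++ ], tok );
    }

    /* With no '/', the kernel set of the same name is implied. */
    if ( out->n_kernels == 0 )
        strcpy( out->kernels[ out->n_kernels++ ], out->subconfig );

    return 0;
}
```

Under these assumptions, `"haswell"` yields the single kernel set `haswell`, `"zen/haswell/sandybridge"` yields `haswell` and `sandybridge` (but not `zen`), and `"zen/zen/haswell/sandybridge"` yields all three.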
+
+Notice that while kernel sets usually correspond to a sub-configuration, they do not always. For example, while the `armv7a` and `armv8a` kernel sets are referenced in the example `config_registry` file, there do not exist any registered sub-configurations by those names. However, the kernel directories exist and the kernel sets appear in the definitions of a few `cortex` singleton families.
+
+One last thing to point out: take a look at the `x86_64` configuration family:
+```
+x86_64:      intel64 amd64
+```
+Unlike most of the registered families, which are defined in terms of sub-configurations, `x86_64` is defined in terms of *other* families--specifically, `intel64` and `amd64`:
+```
+intel64:     haswell sandybridge penryn generic
+amd64:       zen excavator steamroller piledriver bulldozer generic
+```
+This multi-level style of specifying sub-configurations became available starting in 290dd4a. The behavior of `configure` in this situation is as you would expect; that is, including `intel64` and `amd64` in the definition of `x86_64` is equivalent to:
+```
+x86_64:      haswell sandybridge penryn zen excavator steamroller piledriver bulldozer generic
+```
+Any duplicates that may result are removed automatically.
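
The flattening of multi-level families can be sketched as a recursive expansion with duplicate removal. This is only an illustration in C under stated assumptions: the registry is a hand-copied, in-memory subset of the example above, all `demo_` names are hypothetical, and the output order (first occurrence wins) is a choice of this sketch that need not match the order printed by `configure`.

```c
#include <assert.h>
#include <string.h>

#define DEMO_MAX_MEMBERS 16

/* One umbrella-family line: a name and its NULL-terminated member list. */
typedef struct
{
    const char* name;
    const char* members[ DEMO_MAX_MEMBERS ];
} demo_entry_t;

/* Hand-copied subset of the example registry. Singleton lines are omitted
   on purpose: any name not found here is treated as a sub-configuration. */
static const demo_entry_t demo_registry[] =
{
    { "x86_64",  { "intel64", "amd64", NULL } },
    { "intel64", { "haswell", "sandybridge", "penryn", "generic", NULL } },
    { "amd64",   { "zen", "excavator", "steamroller", "piledriver",
                   "bulldozer", "generic", NULL } },
};
static const int demo_n_entries =
    sizeof( demo_registry ) / sizeof( demo_registry[0] );

static const demo_entry_t* demo_find( const char* name )
{
    for ( int i = 0; i < demo_n_entries; ++i )
        if ( strcmp( demo_registry[i].name, name ) == 0 )
            return &demo_registry[i];
    return NULL;
}

/* Append name to out[] unless it is already present. */
static void demo_add_unique( const char* name, const char** out, int* n )
{
    for ( int i = 0; i < *n; ++i )
        if ( strcmp( out[i], name ) == 0 ) return;
    out[ (*n)++ ] = name;
}

/* Recursively flatten a family into its sub-configurations. The caller
   must supply an out[] array large enough for the result. There is no
   guard against cyclic definitions in this sketch. */
void demo_expand( const char* name, const char** out, int* n )
{
    const demo_entry_t* e = demo_find( name );

    if ( e == NULL ) { demo_add_unique( name, out, n ); return; }

    for ( int i = 0; e->members[i] != NULL; ++i )
        demo_expand( e->members[i], out, n );
}
```

Expanding `x86_64` with this sketch produces nine sub-configurations, with `generic` appearing only once even though both `intel64` and `amd64` list it.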
+
+
+### Printing the configuration registry lists
+
+The configuration list, kernel list, and kernel-to-configuration map are constructed internally by `configure`, but these structures can be inspected by running `configure` with the `-c` (which is the short form of `--show-config-lists`) option. This can be useful as a sanity check to make sure `configure` is properly parsing and interpreting the `config_registry` file.
+
+The first thing printed is the configuration list:
+```
+$ ./configure -c amd64
+configure: reading configuration registry...done.
+...
+configure: configuration list:
+configure:   amd64: zen excavator steamroller piledriver bulldozer generic
+configure:   arm32: cortexa15 cortexa9 generic
+configure:   arm64: cortexa57 generic
+configure:   bulldozer: bulldozer
+configure:   cortexa15: cortexa15
+configure:   cortexa57: cortexa57
+configure:   cortexa9: cortexa9
+configure:   excavator: excavator
+configure:   generic: generic
+configure:   haswell: haswell
+configure:   intel64: haswell sandybridge penryn generic
+configure:   knl: knl
+configure:   penryn: penryn
+configure:   piledriver: piledriver
+configure:   sandybridge: sandybridge
+configure:   skx: skx
+configure:   steamroller: steamroller
+configure:   x86_64: haswell sandybridge penryn zen excavator steamroller piledriver bulldozer generic
+```
+This simply lists the sub-configurations associated with each defined configuration family (singleton or umbrella). Note that they are sorted alphabetically. 
+
+Next, the kernel list (actually, all kernel lists) is printed:
+```
+configure: kernel list:
+configure:   amd64: zen piledriver bulldozer generic
+configure:   arm32: armv7a generic
+configure:   arm64: armv8a generic
+configure:   bulldozer: bulldozer
+configure:   cortexa15: armv7a
+configure:   cortexa57: armv8a
+configure:   cortexa9: armv7a
+configure:   excavator: piledriver
+configure:   generic: generic
+configure:   haswell: haswell zen
+configure:   intel64: haswell zen sandybridge penryn generic
+configure:   knl: knl
+configure:   penryn: penryn
+configure:   piledriver: piledriver
+configure:   sandybridge: sandybridge
+configure:   skx: skx
+configure:   steamroller: piledriver
+configure:   x86_64: haswell sandybridge penryn zen piledriver bulldozer generic
+configure:   zen: zen
+```
+This shows the kernel sets that are pulled in by each configuration family. For singleton families, this is specified in a straightforward manner via the `/` character described [in the previous section](ConfigurationHowTo#walkthrough). For umbrella families, this is determined indirectly by looking up the definitions of the singleton families that are members of the umbrella family.
+
+Next, the full kernel-to-configuration map is printed:
+```
+configure: kernel-to-config map for 'amd64':
+configure:   bulldozer: bulldozer
+configure:   generic: generic
+configure:   piledriver: excavator steamroller piledriver
+configure:   zen: zen
+```
+For each of the kernel sets required of the selected configuration family above, the kernel-to-configuration map shows the sub-configurations that required that kernel set. Notice that sometimes a single kernel set may be pulled in by more than one sub-configuration, as with the `piledriver` kernel set.
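
Conceptually, the kernel-to-configuration map is the inverse of the per-sub-configuration kernel lists. The C sketch below shows that inversion for the `amd64` family, using dependency pairs copied by hand from the kernel list printed above. The `demo_` names are hypothetical, and the heuristics that `configure` later applies to pick a single sub-configuration per kernel set are not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A single "sub-configuration requires kernel set" dependency. */
typedef struct
{
    const char* config;
    const char* kernel;
} demo_dep_t;

/* Dependencies of the amd64 family members, per the kernel list above. */
static const demo_dep_t demo_deps[] =
{
    { "zen",         "zen"        },
    { "excavator",   "piledriver" },
    { "steamroller", "piledriver" },
    { "piledriver",  "piledriver" },
    { "bulldozer",   "bulldozer"  },
    { "generic",     "generic"    },
};

/* Collect the sub-configurations that pull in the given kernel set.
   Returns the number of names written to out[] (at most max_out). */
int demo_configs_for_kernel( const char* kernel, const char** out, int max_out )
{
    int n = 0;

    for ( size_t i = 0; i < sizeof( demo_deps ) / sizeof( demo_deps[0] ); ++i )
        if ( strcmp( demo_deps[i].kernel, kernel ) == 0 && n < max_out )
            out[ n++ ] = demo_deps[i].config;

    return n;
}
```

Querying `piledriver` returns `excavator`, `steamroller`, and `piledriver`, which corresponds to the `piledriver` line of the map above.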
+
+Lastly, we print a version of the kernel-to-configuration map in which we've used a set of heuristics to select a single sub-configuration for each kernel set in the map:
+```
+configure: kernel-to-config map for 'amd64' (chosen pairs):
+configure:   bulldozer:bulldozer
+configure:   generic:generic
+configure:   piledriver:piledriver
+configure:   zen:zen
+```
+This variant of the kernel-to-config map is formatted as a series of "sub-configuration:kernel-set" pairs. These pairs are used during the processing of the top-level `Makefile` to determine which sub-configuration's compiler flags should be used when compiling the source code within each kernel set.
+
+
+## Adding a new kernel set
+
+Adding support for a new set of kernels in BLIS is easy and can be done via the following steps.
+
+
+
+_**Create and populate the kernel set directory.**_ First, we must create a directory in `kernels` that corresponds to the new kernel set. Suppose we wanted to add kernels for Intel's Knights Landing microarchitecture. In BLIS, this corresponds to the `knl` configuration, and so we should name the directory `knl`. This is because we want the `knl` kernel set to be pulled by default into builds that include the `knl` sub-configuration.
+```
+$ mkdir kernels/knl
+$ ls kernels
+armv7a  bgq        generic  knc  old     piledriver  sandybridge
+armv8a  bulldozer  haswell  knl  penryn  power7
+```
+Next, we must write the `knl` kernels and place them inside `kernels/knl`. (For more information on writing BLIS kernels, please see the [BLIS Kernels guide](KernelsHowTo).) We recommend separating level-1v, level-1f, and level-3 kernels into separate `1`, `1f`, and `3` sub-directories, respectively. The kernel files and functions therein do not need to follow any particular naming convention, though we strongly recommend using the conventions already used by other kernel sets. Take a look at other kernel files, such as those for `haswell`, [for examples](https://github.com/flame/blis/tree/master/kernels). Finally, for the `knl` kernel set, you should insert a file named `bli_kernels_knl.h` into `kernels/knl` that prototypes all of your new kernel set's kernel functions. You are welcome to write your own prototypes, but to make the prototyping of kernels easier we recommend using the prototype-generating macros for level-1v, level-1f, level-1m, and level-3 functions defined in [frame/1/bli_l1v_ker_prot.h](https://github.com/flame/blis/blob/master/frame/1/bli_l1v_ker_prot.h), [frame/1f/bli_l1f_ker_prot.h](https://github.com/flame/blis/blob/master/frame/1f/bli_l1f_ker_prot.h), [frame/1m/bli_l1m_ker_prot.h](https://github.com/flame/blis/blob/master/frame/1m/bli_l1m_ker_prot.h), and [frame/3/bli_l3_ukr_prot.h](https://github.com/flame/blis/blob/master/frame/3/bli_l3_ukr_prot.h), respectively. The following example illustrates how a select subset of these macros can be used to generate kernel function prototypes.
+```
+GEMM_UKR_PROT( double, d, gemm_knl_asm_24x8 )
+
+PACKM_KER_PROT( double, d, packm_knl_asm_24xk )
+PACKM_KER_PROT( double, d, packm_knl_asm_8xk )
+
+AXPYF_KER_PROT( dcomplex, z, axpyf_knl_asm )
+DOTXF_KER_PROT( dcomplex, z, dotxf_knl_asm )
+
+AXPYV_KER_PROT( float, s, axpyv_knl_asm )
+DOTXV_KER_PROT( float, s, dotxv_knl_asm )
+```
+The first line generates a function prototype for a double-precision real `gemm` micro-kernel named `bli_dgemm_knl_asm_24x8()`. Notice how the macro takes three arguments: the C language datatype, the single character corresponding to the datatype, and the base name of the function, which includes the operation (`gemm`), the kernel set name (`knl`), and a substring specifying its implementation (`asm_24x8`).
+
+The second and third lines generate prototypes for double-precision real `packm` kernels to go along with the `gemm` micro-kernel above. The fourth and fifth lines generate prototypes for double-precision complex instances of the level-1f kernels `axpyf` and `dotxf`. The last two lines generate prototypes for single-precision real instances of the level-1v kernels `axpyv` and `dotxv`.
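
To make the naming scheme concrete, here is a simplified sketch of how a prototype-generating macro of this kind can build a function name by token pasting. `DEMO_AXPYV_KER_PROT` is a hypothetical stand-in, and its four-argument signature is an assumption chosen for brevity; the real BLIS macros generate the full kernel signatures, which are not reproduced here.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for a BLIS prototype macro. It pastes
   the "bli_" prefix, the datatype character, and the base name together. */
#define DEMO_AXPYV_KER_PROT( ctype, ch, opname ) \
\
void bli_ ## ch ## opname \
     ( \
       int          n, \
       const ctype* alpha, \
       const ctype* x, \
       ctype*       y  \
     );

/* Expands to a prototype for bli_saxpyv_demo_ref(). */
DEMO_AXPYV_KER_PROT( float, s, axpyv_demo_ref )

/* A definition that matches the generated prototype: y := y + alpha * x. */
void bli_saxpyv_demo_ref( int n, const float* alpha, const float* x, float* y )
{
    for ( int i = 0; i < n; ++i )
        y[i] += (*alpha) * x[i];
}
```

If the definition's signature drifted from what the macro generates, the compiler would report conflicting types, which is one practical benefit of generating the prototypes from a single macro.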
+
+
+
+_**Add support within the framework source code.**_ We also need to make a minor update to the framework to support the new kernels--specifically, to pull in the kernels' function prototypes.
+
+
+
+**`frame/include/bli_arch_config.h`**. When adding support for the `knl` kernel set to the framework, we must modify this file to `#include` the `bli_kernels_knl.h` header file: 
+```
+#ifdef BLIS_KERNELS_KNL
+#include "bli_kernels_knl.h"
+#endif
+```
+The `BLIS_KERNELS_KNL` macro, which guards the `#include` directive, is automatically defined by the build system when the `knl` kernel set is required by _any_ sub-configuration.
+
+
+## Adding a new configuration family
+
+Adding support for a new umbrella configuration family in BLIS is fairly straightforward and can be done via the following steps. The hypothetical examples used in these steps assume you are trying to create a new configuration family `intelavx` that supports only Intel microarchitectures that support the Intel AVX instruction set. 
+
+
+
+_**Create and populate the family directory.**_ First, we must create a directory in `config` that corresponds to the new family. Since we are adding a new family named `intelavx`, we would name our directory `intelavx`.
+```
+$ mkdir config/intelavx
+$ ls config
+amd64      cortexa15  excavator  intel64   knl     piledriver   steamroller
+bgq        cortexa57  generic    intelavx  old     power7       template
+bulldozer  cortexa9   haswell    knc       penryn  sandybridge  zen
+```
+We also need to create `bli_family_intelavx.h` and `make_defs.mk` files inside our new sub-directory. Since they will be very similar to those of the `intel64` family's files, we can copy those files over and then modify them accordingly:
+```
+$ cp config/intel64/bli_family_intel64.h config/intelavx/bli_family_intelavx.h
+$ cp config/intel64/make_defs.mk config/intelavx/
+```
+First, we update the configuration name inside of `make_defs.mk`:
+```
+THIS_CONFIG    := intelavx
+```
+and while we're editing the file, we can make any other changes to compiler flags we wish (if any). Similarly, the `bli_family_intelavx.h` header file should be updated, though in our case it does not need any changes; the original file is empty and thus the copied file can remain empty as well. Note that other configuration families may have different needs. Remember that all of the parameters set in this file, either explicitly or implicitly (via their defaults), must work for **all** sub-configurations in the family. When creating or modifying a family, it's worth reviewing the parameters' defaults, which are set in [frame/include/bli_kernel_macro_defs.h](https://github.com/flame/blis/blob/master/frame/include/bli_kernel_macro_defs.h) and convincing yourself that each parameter default (or overriding definition in `bli_family_*.h`) will work for each sub-configuration.
+
+
+
+_**Add support within the framework source code.**_ Next, we need to update the BLIS framework source code so that the new configuration family is recognized and supported. Configuration families require updates to two files.
+
+**`frame/include/bli_arch_config.h`**. This file must be updated to `#include` the `bli_family_intelavx.h` header file. Notice that the preprocessor directive should be guarded as follows:
+```
+#ifdef BLIS_FAMILY_INTELAVX
+#include "bli_family_intelavx.h"
+#endif
+```
+The `BLIS_FAMILY_INTELAVX` macro will automatically be defined by the build system whenever the family targeted by `configure` is `intelavx`. (In general, if the user runs `./configure foobar`, the C preprocessor macro `BLIS_FAMILY_FOOBAR` will be defined.)
+
+**`frame/base/bli_arch.c`**. This file must be updated so that `bli_arch_query_id()` returns the correct `arch_t` microarchitecture ID value to the caller. This function is called when the framework is trying to choose which sub-configuration to use at runtime. For x86_64 architectures, this is supported via the `CPUID` instruction, as implemented via `bli_cpuid_query_id()`. Thus, you can simply mimic what is done for the `intel64` family by inserting lines such as:
+```
+#ifdef BLIS_FAMILY_INTELAVX
+    id = bli_cpuid_query_id();
+#endif
+```
+This results in `bli_cpuid_query_id()` being called, which will return the `arch_t` ID value corresponding to the hardware detected by `CPUID`. (If your configuration family does not consist of x86_64 architectures, then you'll need some other heuristic to determine how to choose the correct sub-configuration at runtime. When in doubt, please [open an issue](https://github.com/flame/blis/issues) to begin a dialogue with developers.)
+
+
+
+_**Update the configuration registry.**_ The last step is to update the `config_registry` file so that it defines the new family. Since we want the family to include only Intel sub-configurations that support AVX, we would add the following line:
+```
+intelavx: haswell sandybridge
+```
+Notice that we left out the Core2-based `penryn` sub-configuration since it targets hardware that only supports SSE vector instructions.
+
+
+## Adding a new sub-configuration
+
+Adding support for a new sub-configuration to BLIS is similar to adding support for a family, though there are a few additional steps. Throughout this section, we will use the `knl` (Knights Landing) configuration as an example to illustrate the typical changes necessary to various files in BLIS.
+
+
+
+_**Create and populate the sub-configuration directory.**_ First, we must create a directory in `config` that corresponds to the new sub-configuration.
+```
+$ mkdir config/knl
+$ ls config
+amd64      cortexa15  excavator  intel64  old         power7       template
+bgq        cortexa57  generic    knc      penryn      sandybridge  zen
+bulldozer  cortexa9   haswell    knl      piledriver  steamroller
+```
+We also need to create `bli_cntx_init_knl.c`, `bli_family_knl.h`, and `make_defs.mk` files inside our new sub-directory. Since they will be very similar to those of the `haswell` sub-configuration's files, we can copy those files over and then modify them accordingly:
+```
+$ cp config/haswell/bli_cntx_init_haswell.c config/knl/bli_cntx_init_knl.c
+$ cp config/haswell/bli_family_haswell.h config/knl/bli_family_knl.h
+$ cp config/haswell/make_defs.mk config/knl/
+```
+First, we update the configuration name inside of `make_defs.mk`:
+```
+THIS_CONFIG    := knl
+```
+and while we're editing the file, we can make any other changes to compiler flags we wish (if any). Similarly, the `bli_family_knl.h` header file should be updated as needed. Since the number of vector registers and the vector register size on `knl` differ from the defaults, we must explicitly set them. (The role of these parameters was explained in a [previous section](ConfigurationHowTo#bli_family_h).) Furthermore, provided that a macro `BLIS_NO_HBWMALLOC` is not set, we use a different implementation of `malloc()` and `free()` and `#include` that implementation's header file. 
+```
+#define BLIS_SIMD_NUM_REGISTERS  32
+#define BLIS_SIMD_SIZE           64
+
+#ifdef BLIS_NO_HBWMALLOC
+  #include <stdlib.h>
+  #define BLIS_MALLOC_POOL  malloc
+  #define BLIS_FREE_POOL    free
+#else
+  #include <hbwmalloc.h>
+  #define BLIS_MALLOC_POOL  hbw_malloc
+  #define BLIS_FREE_POOL    hbw_free
+#endif
+```
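
As an illustration of how macros such as `BLIS_MALLOC_POOL` and `BLIS_FREE_POOL` can be consumed, consider the sketch below. It assumes a fallback to `malloc()` and `free()` when the family header leaves the macros undefined, and it wraps them in two hypothetical `demo_` helpers. This mirrors the idea of overridable defaults only; it is not the framework's actual allocation code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Fallbacks for when the family header does not define the macros.
   (Assumption of this sketch, not a copy of the framework's defaults.) */
#ifndef BLIS_MALLOC_POOL
  #define BLIS_MALLOC_POOL  malloc
#endif
#ifndef BLIS_FREE_POOL
  #define BLIS_FREE_POOL    free
#endif

/* Hypothetical helper: acquire a zeroed block via the configured allocator. */
void* demo_pool_acquire( size_t n_bytes )
{
    void* p = BLIS_MALLOC_POOL( n_bytes );

    if ( p != NULL ) memset( p, 0, n_bytes );

    return p;
}

/* Hypothetical helper: release a block via the matching deallocator. */
void demo_pool_release( void* p )
{
    BLIS_FREE_POOL( p );
}
```

Because the allocator and deallocator are selected together by the same preprocessor branch, a block obtained from `hbw_malloc()` would never be passed to `free()`, and vice versa.
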
+Finally, we update `bli_cntx_init_knl.c` to initialize the context with the appropriate kernel function pointers and blocksize values. The functions used to perform this initialization are explained in [an earlier section](ConfigurationHowTo#bli_cntx_init_c).
+
+
+
+_**Add support within the framework source code.**_ Next, we need to update the BLIS framework source code so that the new sub-configuration is recognized and supported. Sub-configurations require updates to four files--six if hardware detection logic is added.
+
+
+
+**`frame/include/bli_type_defs.h`**. First, we need to define an ID to associate with the microarchitecture for which we are adding support. All microarchitecture type IDs are defined in [bli_type_defs.h](https://github.com/flame/blis/blob/master/frame/include/bli_type_defs.h) as an enumerated type that we `typedef` to `arch_t`. To support `knl`, we add a new enumerated type value `BLIS_ARCH_KNL`:
+```
+typedef enum
+{
+    BLIS_ARCH_KNL,
+    BLIS_ARCH_KNC,
+    BLIS_ARCH_HASWELL,
+    BLIS_ARCH_SANDYBRIDGE,
+    BLIS_ARCH_PENRYN,
+
+    BLIS_ARCH_ZEN,
+    BLIS_ARCH_EXCAVATOR,
+    BLIS_ARCH_STEAMROLLER,
+    BLIS_ARCH_PILEDRIVER,
+    BLIS_ARCH_BULLDOZER,
+
+    BLIS_ARCH_CORTEXA57,
+    BLIS_ARCH_CORTEXA15,
+    BLIS_ARCH_CORTEXA9,
+
+    BLIS_ARCH_POWER7,
+    BLIS_ARCH_BGQ,
+
+    BLIS_ARCH_GENERIC
+
+} arch_t;
+```
+Additionally, you'll need to update the definition of `BLIS_NUM_ARCHS` to reflect the new total number of enumerated `arch_t` values:
+```
+#define BLIS_NUM_ARCHS 16
+```
+
+
+**`frame/base/bli_gks.c`**. We must also update the global kernel structure, or gks, to register the new sub-configuration during library initialization. Sub-configuration registration occurs in `bli_gks_init()`. For `knl`, updating this function amounts to inserting the following lines
+```
+#ifdef BLIS_CONFIG_KNL
+        bli_gks_register_cntx( BLIS_ARCH_KNL, bli_cntx_init_knl,
+                                              bli_cntx_init_knl_ref,
+                                              bli_cntx_init_knl_ind );
+#endif
+```
+This function submits pointers to various context initialization functions to the global kernel structure, which are then stored and called at the appropriate time. The functions **must** be named strictly according to the format shown in the example above, with `knl` replaced with the sub-configuration name. Also, note the call to `bli_gks_register_cntx` is guarded by `BLIS_CONFIG_KNL`. This macro is automatically `#defined` by the build system if and when the `knl` sub-configuration is enabled at configure-time, either directly as a singleton family or indirectly via an umbrella family.
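
The registration pattern described above (store function pointers at initialization, call them later by ID) can be sketched as follows. Everything here is hypothetical and heavily simplified: the `demo_` types and functions, the single init function per entry (the real call registers three), and the `mr`/`nr` values are illustrative assumptions rather than the contents of `bli_gks.c`.

```c
#include <assert.h>
#include <stddef.h>

/* Normally a macro like this would be defined by the build system;
   it is defined by hand here so that the sketch is self-contained. */
#define DEMO_CONFIG_KNL

typedef enum
{
    DEMO_ARCH_KNL,
    DEMO_ARCH_HASWELL,
    DEMO_ARCH_GENERIC,
    DEMO_NUM_ARCHS
} demo_arch_t;

/* A toy "context" holding two register blocksizes. */
typedef struct { int mr; int nr; } demo_cntx_t;

typedef void (*demo_cntx_init_ft)( demo_cntx_t* cntx );

/* One slot per architecture ID; unregistered slots remain NULL. */
static demo_cntx_init_ft demo_gks[ DEMO_NUM_ARCHS ];

void demo_gks_register( demo_arch_t id, demo_cntx_init_ft init )
{
    demo_gks[ id ] = init;
}

/* Illustrative values only. */
void demo_cntx_init_knl( demo_cntx_t* cntx )
{
    cntx->mr = 24;
    cntx->nr = 8;
}

/* Register only the sub-configurations enabled at configure-time. */
void demo_gks_init( void )
{
#ifdef DEMO_CONFIG_KNL
    demo_gks_register( DEMO_ARCH_KNL, demo_cntx_init_knl );
#endif
}

/* Returns 0 if the context was initialized, or -1 if nothing was
   registered for the given ID. */
int demo_gks_init_cntx( demo_arch_t id, demo_cntx_t* cntx )
{
    if ( demo_gks[ id ] == NULL ) return -1;

    demo_gks[ id ]( cntx );

    return 0;
}
```

In this sketch, a lookup for an ID whose guard macro was not defined simply fails, which shows why the registration call and the guard must agree with what was enabled at configure-time.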
+
+
+
+**`frame/include/bli_arch_config.h`**. This file must be updated in two places. First, we must modify it to generate prototypes for the `bli_cntx_init_*()` functions, including the developer-provided function `bli_cntx_init_knl()` (defined in `config/knl/bli_cntx_init_knl.c`), by inserting:
+```
+#ifdef BLIS_CONFIG_KNL
+CNTX_INIT_PROTS( knl )
+#endif
+```
+Here, the `CNTX_INIT_PROTS` macro generates the appropriate prototypes based on the name of the sub-configuration. Next, we must `#include` the `bli_family_knl.h` header file, just as we would if we were adding support for an umbrella family:
+```
+#ifdef BLIS_FAMILY_KNL
+#include "bli_family_knl.h"
+#endif
+```
+As before with umbrella families, the `BLIS_FAMILY_KNL` macro is automatically defined by the build system when `knl` is the family targeted by `configure`. (That is, if the user runs `./configure foobar`, the C preprocessor macro `BLIS_FAMILY_FOOBAR` will be defined.)
+
+
+
+**`frame/base/bli_arch.c`**. This file must be updated so that `bli_arch_query_id()` returns the correct `arch_t` architecture ID value to the caller. `bli_arch_query_id()` is called when the framework is trying to choose which sub-configuration to use at runtime. When adding support for a sub-configuration as a singleton family, this amounts to adding a block of code such as:
+```
+#ifdef BLIS_FAMILY_KNL
+    id = BLIS_ARCH_KNL;
+#endif
+```
+The `BLIS_FAMILY_KNL` macro is automatically `#defined` by the build system if the `knl` sub-configuration was targeted directly (as a singleton family) at configure-time. Other ID values are returned only if their respective family macros are defined. (Recall that only one family is ever enabled at a time.) If, however, the `knl` sub-configuration was enabled indirectly via an umbrella family, `bli_arch_query_id()` will return the `arch_t` ID value via lines similar to the following:
+```
+#ifdef BLIS_FAMILY_INTEL64
+    id = bli_cpuid_query_id();
+#endif
+#ifdef BLIS_FAMILY_AMD64
+    id = bli_cpuid_query_id();
+#endif
+```
+Supporting runtime detection of `knl` microarchitectures requires adding `knl` support to `bli_cpuid_query_id()`, which is addressed in the next step.
+
+
+
+**`frame/base/bli_cpuid.c`**. To support the aforementioned runtime microarchitecture detection, the function `bli_cpuid_query_id()`, defined in [bli_cpuid.c](https://github.com/flame/blis/blob/master/frame/base/bli_cpuid.c), will need to be updated. Specifically, we need to insert logic that will detect the presence of the new hardware based on the results of the `CPUID` instruction (assuming the new microarchitecture belongs to the x86_64 architecture family). For example, when support for `knl` was added, this entailed adding the following code block to `bli_cpuid_query_id()`:
+```
+#ifdef BLIS_CONFIG_KNL
+    if ( bli_cpuid_is_knl( family, model, features ) )
+        return BLIS_ARCH_KNL;
+#endif
+```
+Additionally, we had to define the function `bli_cpuid_is_knl()`, which checks for various processor features known to be present on `knl` systems and returns a boolean `TRUE` if all relevant feature checks are satisfied by the hardware. Note that the order in which we check for the sub-configurations is important. We must check for microarchitectural matches from most recent to most dated. This prevents an older sub-configuration from being selected on newer hardware when a newer sub-configuration would have also matched.
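
The importance of checking from newest to oldest can be seen in a small sketch. It assumes a simplified model in which each microarchitecture is identified by a single feature bit; the `demo_` names and the feature-to-microarchitecture mapping are assumptions made for illustration, and the real checks also consider the family and model values, among other things.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified feature bits; a real implementation would derive these
   from the results of the CPUID instruction. */
enum
{
    DEMO_FEATURE_SSE3    = 1 << 0,
    DEMO_FEATURE_AVX     = 1 << 1,
    DEMO_FEATURE_AVX2    = 1 << 2,
    DEMO_FEATURE_AVX512F = 1 << 3
};

typedef enum
{
    DEMO_UARCH_KNL,
    DEMO_UARCH_HASWELL,
    DEMO_UARCH_SANDYBRIDGE,
    DEMO_UARCH_PENRYN,
    DEMO_UARCH_GENERIC
} demo_uarch_t;

/* True if every bit in required is also set in features. */
static int demo_has( uint32_t features, uint32_t required )
{
    return ( features & required ) == required;
}

/* Newer hardware typically retains older features, so the checks must
   run from most recent to most dated. */
demo_uarch_t demo_query_id( uint32_t features )
{
    if ( demo_has( features, DEMO_FEATURE_AVX512F ) ) return DEMO_UARCH_KNL;
    if ( demo_has( features, DEMO_FEATURE_AVX2    ) ) return DEMO_UARCH_HASWELL;
    if ( demo_has( features, DEMO_FEATURE_AVX     ) ) return DEMO_UARCH_SANDYBRIDGE;
    if ( demo_has( features, DEMO_FEATURE_SSE3    ) ) return DEMO_UARCH_PENRYN;

    return DEMO_UARCH_GENERIC;
}
```

If the order were reversed, a machine reporting all four features would match the SSE3 check first and be classified as `penryn`, even though a newer sub-configuration also matches.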
+
+
+
+**`frame/base/bli_cpuid.h`**. After defining the function `bli_cpuid_is_knl()`, we must also update [bli_cpuid.h](https://github.com/flame/blis/blob/master/frame/base/bli_cpuid.h) to contain a prototype for the function.
+
+
+
+_**Update the configuration registry.**_ Lastly, we update the `config_registry` file so that it defines the new sub-configuration. For example, if we want to define a sub-configuration called `knl` that used only `knl` kernels, we would add the following line:
+```
+knl: knl
+```
+If, when defining `bli_cntx_init_knl()`, we referenced kernels from a non-native kernel set--say, those of `haswell`--in addition to `knl`-specific kernels, we would need to explicitly pull in both `knl` and `haswell` kernel sets:
+```
+knl: knl/knl/haswell
+```
+
+
+## Further Development Topics
+
+### Querying the current configuration
+
+If you are ever unsure which configuration is "active", or the configuration parameters that were specified (or implied by default) at configure-time, simply run:
+
+```
+$ make showconfig
+configuration family:  intel64
+sub-configurations:    haswell sandybridge penryn
+requisite kernels:     haswell sandybridge penryn
+kernel-to-config map:  haswell:haswell penryn:penryn sandybridge:sandybridge
+-----------------------
+BLIS version string:   0.2.2-73
+install prefix:        /home/field/blis
+debugging status:      off
+multithreading status: no
+enable BLAS API?       yes
+enable CBLAS API?      no
+build static library?  yes
+build shared library?  no
+```
+
+This will tell you the current configuration name, the [configuration registry lists](ConfigurationHowTo#printing-the-configuration-registry-lists), as well as other information stored by `configure` in the `config.mk` file.
+
+
+
+### Header dependencies
+
+Due to the way the BLIS framework handles header files, **any** change to **any** header file will result in the entire library being rebuilt. This policy is in place mostly out of an abundance of caution. If two or more files use definitions in a header that is modified, and one or more of those files somehow does not get recompiled to reflect the updated definitions, you could end up sinking hours of time trying to track down a bug that didn't ever need to be an issue to begin with. Thus, to prevent developers (including the framework developer(s)) from shooting themselves in the foot with this problem, the BLIS build system recompiles **all** object files if any header file is touched. We apologize for the inconvenience this may cause.
+
+
+
+### Still have questions?
+
+If you have further questions about BLIS configurations, please do not hesitate to contact the BLIS developer community. To do so, simply join and post to the [blis-devel](http://groups.google.com/group/blis-devel) mailing list.
+***
--- a/docs/FAQ.md
+++ b/docs/FAQ.md
@@ -0,0 +1,193 @@
+## Introduction
+
+Here we attempt to answer some frequently asked questions about the BLIS framework project, as well as those we think a new user or developer might ask. If you do not see the answer to your question here, please join and post your question to one of the [BLIS mailing lists](https://github.com/flame/blis#discussion).
+
+## Contents
+
+  * [Why did you create BLIS?](FAQ#why-did-you-create-blis)
+  * [Why should I use BLIS instead of GotoBLAS / OpenBLAS / ATLAS / MKL / ESSL / ACML / Accelerate?](FAQ#why-should-i-use-blis-instead-of-gotoblas--openblas--atlas--mkl--essl--acml--accelerate)
+  * [How is BLIS related to FLAME / libflame?](FAQ#how-is-blis-related-to-flame--libflame)
+  * [Does BLIS automatically detect my hardware?](FAQ#does-blis-automatically-detect-my-hardware)
+  * [I understand that BLIS is mostly a tool for developers?](FAQ#i-understand-that-blis-is-mostly-a-tool-for-developers)
+  * [How do I link against BLIS?](FAQ#how-do-i-link-against-blis)
+  * [Must I use git? Can I download a tarball?](FAQ#must-i-use-git-can-i-download-a-tarball)
+  * [What is a micro-kernel?](FAQ#what-is-a-micro-kernel)
+  * [What is a macro-kernel?](FAQ#what-is-a-macro-kernel)
+  * [What is a context?](FAQ#what-is-a-context)
+  * [I'm used to thinking in terms of column-major/row-major storage and leading dimensions. What is a "row stride" / "column stride"?](FAQ#im-used-to-thinking-in-terms-of-column-majorrow-major-storage-and-leading-dimensions-what-is-a-row-stride--column-stride)
+  * [What does it mean when a matrix with general stride is column-tilted or row-tilted?](FAQ#what-does-it-mean-when-a-matrix-with-general-stride-is-column-tilted-or-row-tilted)
+  * [I'm not really interested in all of these newfangled features in BLIS. Can I just use BLIS as a BLAS library?](FAQ#im-not-really-interested-in-all-of-these-newfangled-features-in-blis-can-i-just-use-blis-as-a-blas-library)
+  * [What about CBLAS?](FAQ#what-about-cblas)
+  * [Can I call the native BLIS API from Fortran-77/90/95/2000/C++/Python?](FAQ#can-i-call-the-native-blis-api-from-fortran-7790952000cpython)
+  * [Do I need to call initialization/finalization functions before being able to use BLIS from my application?](FAQ#do-i-need-to-call-initializationfinalization-functions-before-being-able-to-use-blis-from-my-application)
+  * [Does BLIS support multithreading?](FAQ#does-blis-support-multithreading)
+  * [Does BLIS support NUMA environments?](FAQ#does-blis-support-numa-environments)
+  * [Does BLIS work with GPUs?](FAQ#does-blis-work-with-gpus)
+  * [Does BLIS work on (some architecture)?](FAQ#does-blis-work-on-some-architecture)
+  * [What about distributed-memory parallelism?](FAQ#what-about-distributed-memory-parallelism)
+  * [Can I build BLIS on Windows / Mac OS X?](FAQ#can-i-build-blis-on-windows--mac-os-x)
+  * [Can I build BLIS as a shared library?](FAQ#can-i-build-blis-as-a-shared-library)
+  * [Can I use the mixed domain / mixed precision support in BLIS?](FAQ#can-i-use-the-mixed-domain--mixed-precision-support-in-blis)
+  * [Who is involved in the project?](FAQ#who-is-involved-in-the-project)
+  * [Who funded the development of BLIS?](FAQ#who-funded-the-development-of-blis)
+  * [I found a bug. How do I report it?](FAQ#i-found-a-bug-how-do-i-report-it)
+  * [How do I request a new feature?](FAQ#how-do-i-request-a-new-feature)
+  * [Where did you get the photo for the BLIS logo / mascot?](FAQ#where-did-you-get-the-photo-for-the-blis-logo--mascot)
+
+
+
+### Why did you create BLIS?
+
+Initially, BLIS was conceived as simply "BLAS with a more flexible interface". The original BLIS was written as a wrapper layer around BLAS that allowed generalized matrix storage (i.e., separate row and column strides). We also took the opportunity to implement some complex domain features that were missing from the BLAS (mostly related to conjugating input operands). This "proto-BLIS" was deployed in [libflame](http://shpc.ices.utexas.edu/libFLAME.html) to facilitate cleaner implementations of some LAPACK-level operations.
+
+Over time, we wanted more than just a more flexible interface; we wanted an entire framework from which we could build operations in the BLAS as well as those not present within the BLAS. After this new BLIS framework was created, it turned out that the interface improvements were much less interesting (and consequential) than some of the framework's other features, and the fact that it allowed developers to rapidly instantiate new BLAS libraries by optimizing only a small amount of code.
+
+### Why should I use BLIS instead of GotoBLAS / OpenBLAS / ATLAS / MKL / ESSL / ACML / Accelerate?
+
+BLIS has numerous advantages over existing BLAS implementations. Many of these advantages are summarized on the [BLIS homepage](https://github.com/flame/blis#key-features). But here are a few reasons one might choose BLIS over some other implementation of BLAS:
+  * BLIS facilitates high performance while remaining very portable. BLIS isolates performance-sensitive code to a micro-kernel which contains only one loop and which, when optimized, accelerates virtually all level-3 operations. Thus, BLIS serves as a powerful tool for quickly instantiating BLAS on new or experimental hardware architectures, as well as a flexible "laboratory" in which to conduct research and experiments.
+  * BLIS provides robust multithreading support, allowing symmetric multicore/many-core parallelism via either OpenMP or POSIX threads. It also computes proper load balance for structured matrix subpartitions, regardless of the location of the diagonal, or whether the subpartition is lower- or upper-stored.
+  * BLIS supports a superset of BLAS functionality, providing operations omitted from the BLAS as well as some complex domain support that is missing in BLAS operations. BLIS is especially useful to researchers who need to develop and prototype new BLAS-like operations that do not exist in the BLAS.
+  * BLIS is backwards compatible with BLAS. BLIS contains a BLAS compatibility layer that allows an application to treat BLIS as if it were a traditional BLAS library.
+  * BLIS supports generalized matrix storage, which can be used to express column-major, row-major, and general stride storage.
+  * BLIS is free software, available under a [new/modified/3-clause BSD license](http://opensource.org/licenses/BSD-3-Clause).
+
+### How is BLIS related to FLAME / `libflame`?
+
+As explained [above](FAQ#why-did-you-create-blis), BLIS was initially a layer within `libflame` that allowed more convenient interfacing to the BLAS. So in some ways, BLIS is a spin-off project. Prior to developing BLIS, [its author](http://www.cs.utexas.edu/users/field/) worked as the primary maintainer of `libflame`. If you look closely, you can also see that the design of BLIS was influenced by some of the more useful and innovative aspects of `libflame`, such as internal object abstractions and control trees. Also, various members of the [SHPC research group](http://shpc.ices.utexas.edu/people.html) and its [collaborators](http://shpc.ices.utexas.edu/collaborators.html) routinely provide insight and feedback, and also contribute code (especially kernels) to the BLIS project.
+
+### Does BLIS automatically detect my hardware?
+
+On certain architectures, yes. In order to use auto-detection, you must specify `auto` as your configuration when running `configure`. (Please see the [BLIS build system wiki](BuildSystem) for more info.) A runtime detection option is also available. (Please see the [BLIS configuration guide](ConfigurationHowTo) for more info.)
+
+If automatic hardware detection is requested at configure-time and the build process does not recognize your architecture, the `generic` configuration is selected.
+
+### I understand that BLIS is mostly a tool for developers?
+
+Yes. In order to achieve high performance, BLIS requires that hand-coded kernels and micro-kernels be written and referenced in a valid [BLIS configuration](ConfigurationHowTo). These components are usually written by developers and then included within BLIS for use by others.
+
+The good news, however, is that end-users can use BLIS too. Once the aforementioned kernels are integrated into BLIS, they can be used without any developer-level knowledge. Usually, `./configure auto; make; make install` is sufficient for typical users with typical hardware.
+
+### How do I link against BLIS?
+
+Linking against BLIS is easy! Most people can link to it as if it were a generic BLAS library. Please see the [Linking against BLIS](BuildSystem#linking-against-blis) section of the [build system wiki](BuildSystem).
+
+### Must I use git? Can I download a tarball?
+
+We **strongly encourage** you to obtain the BLIS source code by cloning a `git` repository (via the [git
+clone](https://github.com/flame/blis/wiki/BuildSystem#obtaining-blis) command). The reason for this is that it will allow you to easily update your local copy of BLIS by executing `git pull`.
+
+Tarballs and zip files may be obtained from the [releases](https://github.com/flame/blis/releases) page.
+
+### What is a micro-kernel?
+
+The micro-kernel (usually short for "`gemm` micro-kernel") is the basic unit of level-3 (matrix-matrix) computation within BLIS. It consists of one loop, where each iteration performs a very small outer product to update a very small matrix. The micro-kernel is typically the only piece of code that must be carefully optimized (via vector intrinsics or assembly code) to enable high performance in most of the level-3 operations such as `gemm`, `hemm`, `herk`, and `trmm`.
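
As a rough illustration of the single loop of small outer products described above, here is a plain C reference sketch. It is not BLIS's actual micro-kernel interface: the name `gemm_ukr_ref`, the blocksizes `MR` and `NR`, and the packed layouts assumed for `a` and `b` are all assumptions chosen for readability.

```c
#include <stddef.h>

enum { MR = 4, NR = 4 };  /* hypothetical register blocksizes */

/* Reference sketch: C := C + A * B, where A is MR x k and B is k x NR.
   Assumed layout: column p of A is contiguous at a + p*MR, and row p of B
   is contiguous at b + p*NR. C is addressed via row/column strides. */
void gemm_ukr_ref( size_t k, const double* a, const double* b,
                   double* c, size_t rs_c, size_t cs_c )
{
    /* The one loop of the micro-kernel: each iteration applies a rank-1
       (outer product) update to the MR x NR block of C. */
    for ( size_t p = 0; p < k; ++p )
    {
        /* An optimized kernel would fully unroll this outer product into
           vector instructions; it is written as loops here for clarity. */
        for ( size_t i = 0; i < MR; ++i )
            for ( size_t j = 0; j < NR; ++j )
                c[ i*rs_c + j*cs_c ] += a[ p*MR + i ] * b[ p*NR + j ];
    }
}
```

An optimized micro-kernel would typically keep the `MR x NR` block in registers for the duration of the loop and write it back to memory only once.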
+
+For a more thorough explanation of the micro-kernel and its role in the overall level-3 computations, please read our [ACM TOMS papers](https://github.com/flame/blis#citations). For API and technical reference, please see the [gemm micro-kernel section](KernelsHowTo#gemm-micro-kernel) of the [BLIS Kernels guide](KernelsHowTo).
+
+### What is a macro-kernel?
+
+The macro-kernels are portable codes within the BLIS framework that implement relatively small subproblems within an overall level-3 operation. The overall problem (say, general matrix-matrix multiplication, or `gemm`) is partitioned down, according to cache blocksizes, such that its operands are (1) a suitable size and (2) stored in a special packed format. At that time, the macro-kernel is called. The macro-kernel is implemented as two loops around the micro-kernel.
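
The "two loops around the micro-kernel" structure can be sketched as follows. This is a simplified illustration under stated assumptions, not the BLIS implementation: the names, the tiny blocksizes, the packed layouts, and the loop order are hypothetical, and `m` and `n` are assumed to be multiples of `MR` and `NR`.

```c
#include <stddef.h>

enum { MR = 2, NR = 2 };  /* hypothetical register blocksizes */

/* Minimal reference micro-kernel: C := C + A * B on one MR x NR block.
   Column p of the A micro-panel is at a + p*MR; row p of the B micro-panel
   is at b + p*NR (assumed packed layouts). */
static void ukr( size_t k, const double* a, const double* b,
                 double* c, size_t rs_c, size_t cs_c )
{
    for ( size_t p = 0; p < k; ++p )
        for ( size_t i = 0; i < MR; ++i )
            for ( size_t j = 0; j < NR; ++j )
                c[ i*rs_c + j*cs_c ] += a[ p*MR + i ] * b[ p*NR + j ];
}

/* Macro-kernel sketch: two loops over micro-panels, calling the micro-kernel.
   a_packed holds m/MR micro-panels of MR*k elements each;
   b_packed holds n/NR micro-panels of k*NR elements each. */
void gemm_macro_ref( size_t m, size_t n, size_t k,
                     const double* a_packed, const double* b_packed,
                     double* c, size_t rs_c, size_t cs_c )
{
    for ( size_t j = 0; j < n; j += NR )      /* micro-panels of B */
        for ( size_t i = 0; i < m; i += MR )  /* micro-panels of A */
            ukr( k,
                 a_packed + i*k,              /* (i/MR) panels of MR*k elements */
                 b_packed + j*k,              /* (j/NR) panels of k*NR elements */
                 c + i*rs_c + j*cs_c, rs_c, cs_c );
}
```

Because the operands are already packed, every micro-kernel call reads its two micro-panels contiguously, which is one reason packing is done before the macro-kernel is invoked.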
+
+The macro-kernels in BLIS correspond to the so-called "inner kernels" (or simply "kernels") that formed the fundamental unit of computation in Kazushige Goto's GotoBLAS (and now in the successor library, OpenBLAS).
+
+For more information on macro-kernels, please read our [ACM TOMS papers](https://github.com/flame/blis#citations).
+
+### What is a context?
+
+As of 0.2.0, BLIS contains a new infrastructure for communicating runtime information (such as kernel addresses and blocksizes) from the highest levels of code all the way down the function stack, even into the kernels themselves. This new data structure is called a *context*, and together with its API, it helped us clean up some hacks and other awkwardness that existed in BLIS prior to 0.2.0. Contexts also lay the groundwork for managing kernels and related kernel information at runtime.
+
+If you are a kernel developer, you can usually ignore the `cntx_t*` argument that is passed into each kernel, since the kernels already inherently "know" this information (such as register blocksizes). And if you are a user, and the function you want to call takes a `cntx_t*` argument, you can safely pass in `NULL` and BLIS will automatically build a suitable context for you at runtime. 
+
+### I'm used to thinking in terms of column-major/row-major storage and leading dimensions. What is a "row stride" / "column stride"?
+
+Traditional BLAS assumes that matrices are stored in column-major order, where a leading dimension measures the distance from one element to the next element in the same row. But column-major order is really just a special case of BLIS's more generalized storage scheme.
+
+In generalized storage, we have a row stride and a column stride. The row stride measures the distance in memory between rows (within a single column) while the column stride measures the distance between columns (within a single row). Column-major storage corresponds to the situation where the row stride equals 1. Since the row stride is unit, you only have to track the column stride (i.e., the leading dimension). Similarly, in row-major order, the column stride is equal to 1 and only the row stride must be tracked.
+
+BLIS also supports situations where both the row stride and column stride are non-unit. We call this situation "general stride".
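
To make the stride arithmetic concrete, here is a small sketch of element addressing under generalized storage. The helper names `elem` and `row_sum` are hypothetical and not part of the BLIS API.

```c
#include <stddef.h>

/* Address of element (i, j) in a matrix with row stride rs and column stride cs. */
double* elem( double* a, size_t i, size_t j, size_t rs, size_t cs )
{
    return a + i*rs + j*cs;
}

/* Sum of row i of a matrix with n columns; the same code handles
   column-major, row-major, and general stride storage. */
double row_sum( double* a, size_t i, size_t n, size_t rs, size_t cs )
{
    double sum = 0.0;
    for ( size_t j = 0; j < n; ++j )
        sum += *elem( a, i, j, rs, cs );
    return sum;
}
```

For a 2 x 3 matrix, column-major storage corresponds to `rs = 1, cs = 2` and row-major storage to `rs = 3, cs = 1`; only the stride arguments change, not the code.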
+
+### What does it mean when a matrix with general stride is column-tilted or row-tilted?
+
+When a matrix is stored with general stride, both the row stride and column stride (let's call them `rs` and `cs`) are non-unit. When `rs` < `cs`, we call the general stride matrix "column-tilted" because it is "closer" to being column-stored (than row-stored). Similarly, when `rs` > `cs`, the matrix is "row-tilted" because it is closer to being row-stored.
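
The comparison can be written as a tiny helper. The names below are hypothetical, and the handling of `rs == cs` (reported as no tilt) is an assumption, since that case is not discussed here.

```c
typedef enum { TILT_NONE, TILT_COLUMN, TILT_ROW } tilt_t;

/* Classify a general stride matrix by comparing its (positive) strides. */
tilt_t general_stride_tilt( long rs, long cs )
{
    if ( rs < cs ) return TILT_COLUMN;  /* closer to column storage */
    if ( rs > cs ) return TILT_ROW;     /* closer to row storage */
    return TILT_NONE;                   /* assumption: equal strides */
}
```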
+
+### I'm not really interested in all of these newfangled features in BLIS. Can I just use BLIS as a BLAS library?
+
+Absolutely. Just link your application to BLIS the same way you would link to a BLAS library. For a simple linking example, see the [Linking against BLIS](BuildSystem#linking-against-blis) section of the [BLIS Build System wiki](BuildSystem).
+
+### What about CBLAS?
+
+BLIS also contains an optional CBLAS compatibility layer, which leverages the BLAS compatibility layer to help map CBLAS function calls to the corresponding functionality in BLIS. Once BLIS is built with CBLAS support, your application can access CBLAS prototypes via either `cblas.h` or `blis.h`.
+
+### Can I call the native BLIS API from Fortran-77/90/95/2000/C++/Python?
+
+In principle, BLIS's [native BLAS-like API](BLISTypedAPI) can be called from Fortran. However, you must ensure that the size of the integer in BLIS is equal to the size of integer used by your Fortran program/compiler/environment. The size of BLIS integers is set in `bli_config.h`. Please see the [bli\_config.h](ConfigurationHowTo#bli_configh) section of the [BLIS Configuration guide](ConfigurationHowTo) for more details.
+
+As for bindings to other languages, please contact the [blis-devel](http://groups.google.com/group/blis-devel) mailing list.
+
+### Do I need to call initialization/finalization functions before being able to use BLIS from my application?
+
+Originally, BLIS did indeed require the application to explicitly set up (initialize) various internal data structures via `bli_init()`. Likewise, calling `bli_finalize()` was recommended to clean up (finalize) the library. However, since commit 9804adf, BLIS has implemented self-initialization. These explicit calls to `bli_init()` and `bli_finalize()` are no longer necessary, though experts may still use them in special cases to control the allocation and freeing of resources. This topic is discussed in the [BLIS typed API reference](BLISTypedAPI#initialization-and-cleanup).
+
+### Does BLIS support multithreading?
+
+Yes! BLIS supports multithreading (via OpenMP or POSIX threads) for all of its level-3 operations. For more information on enabling and controlling multithreading, please see the wiki on [Multithreading](Multithreading).
+
+BLIS can also very easily be made thread-safe so that you can call BLIS from threads within a multithreaded library or application. For more information on making BLIS thread-safe, see the "Multithreading" subsection of the [bli\_config.h](ConfigurationHowTo#bli_configh) header file section in the [BLIS Configuration guide](ConfigurationHowTo).
+
+### Does BLIS support NUMA environments?
+
+No. We have integrated some early foundational support for NUMA *development*, but currently BLIS will execute sub-optimally on NUMA systems. If you are interested in adapting BLIS to a NUMA architecture, please contact us via the [blis-devel](http://groups.google.com/group/blis-devel) mailing list.
+
+### Does BLIS work with GPUs?
+
+BLIS does not currently support graphics processing units (GPUs).
+
+### Does BLIS work on _(some architecture)_?
+
+Please see the [BLIS Hardware Support](HardwareSupport) wiki for a full list of supported architectures. If your favorite hardware is not listed and you have the expertise, please consider developing your own kernels and sharing them with the project! We will, of course, gratefully credit your contribution.
+
+### What about distributed-memory parallelism?
+
+BLIS does not provide distributed-memory parallelism. It is a framework for sequential and shared-memory/multicore implementations of BLAS-like operations. If you need distributed-memory dense linear algebra implementations, we recommend the [Elemental](http://libelemental.org/) library.
+
+### Can I build BLIS on Windows / Mac OS X?
+
+BLIS was designed for use in a GNU/Linux environment; however, it should work on other UNIX-like systems as well, such as OS X. System software requirements for UNIX-like systems are discussed in the [BLIS build system guide](BuildSystem).
+
+Building in Windows is not directly supported. However, Windows 10 now provides a Linux-like environment. We suspect this is the best route for those trying to build BLIS in Windows. If you have success and would like to share your experiences, please join the [blis-devel](http://groups.google.com/group/blis-devel) mailing list and send us a message!
+
+### Can I build BLIS as a shared library?
+
+Yes. By default, most configurations output only a static library archive (e.g. `.a` file). However, you can also request a shared object (e.g. `.so` file), sometimes also called a "dynamically-linked" library. For information on enabling shared library output, simply run `./configure --help`.
+
+### Can I use the mixed domain / mixed precision support in BLIS?
+
+Enabling mixed domain / mixed precision support in BLIS is a long-term goal of ours. In the meantime, if this feature is important to you, please contact us via the [blis-devel](http://groups.google.com/group/blis-devel) mailing list and tell us about your application and why you need/want support for BLAS-like operations with mixed-domain/mixed-precision operands. We are interested to hear from you!
+
+### Who is involved in the project?
+
+Lots of people! For a full list of those involved, see the
+[CREDITS](https://github.com/flame/blis/blob/master/CREDITS) file within the BLIS framework source distribution.
+
+### Who funded the development of BLIS?
+
+BLIS was primarily funded by grants from [Microsoft](http://www.microsoft.com/), [Intel](http://www.intel.com/), [Texas Instruments](http://www.ti.com/), and [AMD](http://www.amd.com/), as well as grants from the [National Science Foundation](http://www.nsf.gov/) (Awards CCF-0917167, ACI-1148125/1340293, and CCF-1320112).
+
+Reminder: _Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF)._
+
+### I found a bug. How do I report it?
+
+If you think you've found a bug, we request that you [open an issue](http://github.com/flame/blis/issues). Don't be shy! Really, it's the best and most convenient way for us to track your issues/bugs/concerns. Other discussions that are not primarily bug reports should take place via the [blis-devel](http://groups.google.com/group/blis-devel) mailing list.
+
+### How do I request a new feature?
+
+Feature requests should also be submitted by [opening a new issue](http://github.com/flame/blis/issues).
+
+### Where did you get the photo for the BLIS logo / mascot?
+
+The sleeping ["BLIS cat"](https://github.com/flame/blis/blob/master/README.md) photo was taken by Petar Mitchev and is used with his permission.
--- a/docs/HardwareSupport.md
+++ b/docs/HardwareSupport.md
@@ -0,0 +1,44 @@
+## Introduction
+
+This wiki is intended to track the support for various hardware types within the BLIS framework source distribution.
+
+We apologize if this wiki falls out of date. For the latest support, we recommend peeking inside of the relevant sub-configuration (specifically, in the `bli_cntx_init_<configname>.c` file) and looking at which kernels are registered. You may also contact the [blis-devel](http://groups.google.com/group/blis-devel) mailing list.
+
+
+## Level-3 micro-kernels
+
+The following table lists architectures for which there exist optimized level-3 micro-kernels, which micro-kernels are optimized, the name of the author or maintainer, and the current status of the micro-kernels.
+
+A few remarks / reminders:
+  * Optimizing only the [gemm micro-kernel](KernelsHowTo#gemm-micro-kernel) will result in optimal performance for all [level-3 operations](BLISTypedAPI#level-3-operations) except `trsm` (which will typically achieve 60 - 80% of attainable peak performance).
+  * The [trsm](BLISTypedAPI#trsm) operation needs the [gemmtrsm micro-kernel(s)](KernelsHowTo#gemmtrsm-micro-kernels), in addition to the aforementioned [gemm micro-kernel](KernelsHowTo#gemm-micro-kernel), in order to reach optimal performance.
+  * Induced complex (1m) implementations are employed in all situations where the real domain [gemm micro-kernel](KernelsHowTo#gemm-micro-kernel) of the corresponding precision is available. Please see our [ACM TOMS article on the 1m method](https://github.com/flame/blis#citations) for more info on this topic.
+  * Some microarchitectures use the same sub-configuration. This is not a typo. For example, Haswell and Broadwell systems as well as "desktop" (non-server) versions of Skylake, Kabylake, and Coffeelake all use the `haswell` sub-configuration and the kernels registered therein.
+  * Remember that you (usually) don't have to choose your sub-configuration manually! Instead, you can always request configure-time hardware detection via `./configure auto`. This will defer to internal logic (based on CPUID for x86_64 systems) that will attempt to choose the appropriate sub-configuration automatically.
+
+| Vendor/Microarchitecture             | BLIS sub-configuration | `gemm` | `gemmtrsm` |
+|:-------------------------------------|:-----------------------|:-------|:-----------|
+| AMD Bulldozer (AVX/FMA4)             | `bulldozer`            | `sdcz` |            |
+| AMD Piledriver (AVX/FMA3)            | `piledriver`           | `sdcz` |            |
+| AMD Steamroller (AVX/FMA3)           | `steamroller`          | `sdcz` |            |
+| AMD Excavator (AVX/FMA3)             | `excavator`            | `sdcz` |            |
+| AMD Zen (AVX/FMA3)                   | `zen`                  | `sdcz` |  `sd`      |
+| Intel Core2 (SSE3)                   | `penryn`               | `sd`   |  `d`       |
+| Intel Sandy/Ivy Bridge (AVX/FMA3)    | `sandybridge`          | `sdcz` |            |
+| Intel Haswell, Broadwell (AVX/FMA3)  | `haswell`              | `sdcz` |  `sd`      |
+| Intel Sky/Kaby/Coffeelake (AVX/FMA3) | `haswell`              | `sdcz` |  `sd`      |
+| Intel Knights Landing (AVX-512/FMA3) | `knl`                  | `sd`   |            |
+| Intel SkylakeX (AVX-512/FMA3)        | `skx`                  | `sd`   |            |
+| ARMv7 Cortex-A9/A15 (NEON)           | `cortex-a9,-a15`       | `sd`   |            |
+| ARMv8 Cortex-A57 (NEON)              | `cortex-a57`           | `sd`   |            |
+| IBM Blue Gene/Q (QPX int)            | `bgq`                  |  `d`   |            |
+| IBM Power7 (VSX int)                 | `power7`               |  `d`   |            |
+| template (C99)                       | `template`             | `sdcz` | `sdcz`     |
+
+## Level-1f kernels
+
+Not yet written. Please see the relevant sub-configuration (`bli_cntx_init_<configname>.c`) to determine which kernels are implemented/registered.
+
+## Level-1v kernels
+
+Not yet written. Please see the relevant sub-configuration (`bli_cntx_init_<configname>.c`) to determine which kernels are implemented/registered.
--- a/docs/KernelsHowTo.md
+++ b/docs/KernelsHowTo.md
@@ -0,0 +1,504 @@
+## Introduction
+
+This wiki describes the computational kernels used by the BLIS framework.
+
+One of the primary features of BLIS is that it provides a large set of dense linear algebra functionality while simultaneously minimizing the amount of kernel code that must be optimized for a given architecture. BLIS does this by isolating a handful of kernels which, when implemented, facilitate functionality and performance of several of the higher-level operations.
+
+Presently, BLIS supports several groups of operations:
+  * **[Level-1v](BLISTypedAPI#level-1v-operations)**: Operations on vectors:
+    * [addv](BLISTypedAPI#addv), [amaxv](BLISTypedAPI#amaxv), [axpyv](BLISTypedAPI#axpyv), [copyv](BLISTypedAPI#copyv), [dotv](BLISTypedAPI#dotv), [dotxv](BLISTypedAPI#dotxv), [invertv](BLISTypedAPI#invertv), [scal2v](BLISTypedAPI#scal2v), [scalv](BLISTypedAPI#scalv), [setv](BLISTypedAPI#setv), [subv](BLISTypedAPI#subv), [swapv](BLISTypedAPI#swapv)
+  * **[Level-1d](BLISTypedAPI#level-1d-operations)**: Element-wise operations on matrix diagonals:
+    * [addd](BLISTypedAPI#addd), [axpyd](BLISTypedAPI#axpyd), [copyd](BLISTypedAPI#copyd), [invertd](BLISTypedAPI#invertd), [scald](BLISTypedAPI#scald), [scal2d](BLISTypedAPI#scal2d), [setd](BLISTypedAPI#setd), [setid](BLISTypedAPI#setid), [subd](BLISTypedAPI#subd)
+  * **[Level-1m](BLISTypedAPI#level-1m-operations)**: Element-wise operations on matrices:
+    * [addm](BLISTypedAPI#addm), [axpym](BLISTypedAPI#axpym), [copym](BLISTypedAPI#copym), [scalm](BLISTypedAPI#scalm), [scal2m](BLISTypedAPI#scal2m), [setm](BLISTypedAPI#setm), [subm](BLISTypedAPI#subm)
+  * **[Level-1f](BLISTypedAPI#level-1f-operations)**: Fused operations on multiple vectors:
+    * [axpy2v](BLISTypedAPI#axpy2v), [dotaxpyv](BLISTypedAPI#dotaxpyv), [axpyf](BLISTypedAPI#axpyf), [dotxf](BLISTypedAPI#dotxf), [dotxaxpyf](BLISTypedAPI#dotxaxpyf)
+  * **[Level-2](BLISTypedAPI#level-2-operations)**: Operations with one matrix and (at least) one vector operand:
+    * [gemv](BLISTypedAPI#gemv), [ger](BLISTypedAPI#ger), [hemv](BLISTypedAPI#hemv), [her](BLISTypedAPI#her), [her2](BLISTypedAPI#her2), [symv](BLISTypedAPI#symv), [syr](BLISTypedAPI#syr), [syr2](BLISTypedAPI#syr2), [trmv](BLISTypedAPI#trmv), [trsv](BLISTypedAPI#trsv)
+  * **[Level-3](BLISTypedAPI#level-3-operations)**: Operations with matrices that are multiplication-like:
+    * [gemm](BLISTypedAPI#gemm), [hemm](BLISTypedAPI#hemm), [herk](BLISTypedAPI#herk), [her2k](BLISTypedAPI#her2k), [symm](BLISTypedAPI#symm), [syrk](BLISTypedAPI#syrk), [syr2k](BLISTypedAPI#syr2k), [trmm](BLISTypedAPI#trmm), [trmm3](BLISTypedAPI#trmm3), [trsm](BLISTypedAPI#trsm)
+  * **[Utility](BLISTypedAPI#utility-operations)**: Miscellaneous operations on matrices and vectors:
+    * [asumv](BLISTypedAPI#asumv), [norm1v](BLISTypedAPI#norm1v), [normfv](BLISTypedAPI#normfv), [normiv](BLISTypedAPI#normiv), [norm1m](BLISTypedAPI#norm1m), [normfm](BLISTypedAPI#normfm), [normim](BLISTypedAPI#normim), [mkherm](BLISTypedAPI#mkherm), [mksymm](BLISTypedAPI#mksymm), [mktrim](BLISTypedAPI#mktrim), [fprintv](BLISTypedAPI#fprintv), [fprintm](BLISTypedAPI#fprintm), [printv](BLISTypedAPI#printv), [printm](BLISTypedAPI#printm), [randv](BLISTypedAPI#randv), [randm](BLISTypedAPI#randm), [sumsqv](BLISTypedAPI#sumsqv)
+
+Most of the interest with BLAS libraries centers around level-3 operations because they exhibit favorable ratios of floating-point operations (flops) to memory operations (memops), which allows high performance. Some applications also require level-2 computation; however, these operations are at an inherent disadvantage on modern architectures due to their less favorable flop-to-memop ratio. The BLIS framework allows developers to quickly and easily build high performance level-3 operations, as well as relatively well-performing level-2 operations, simply by optimizing a small set of kernels. These kernels, and their relationship to the other higher-level operations supported by BLIS, are the subject of this wiki.
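
The difference in flop-to-memop ratio can be estimated with a simple counting model. The sketch below assumes square `n x n` operands and that each operand element is read once (and each output element written once), which idealizes cache behavior; the function names are hypothetical.

```c
/* gemm (C := C + A*B): 2n^3 flops; reads A, B, C and writes C: 4n^2 memops. */
double gemm_flops_per_memop( double n )
{
    return ( 2.0*n*n*n ) / ( 4.0*n*n );
}

/* gemv (y := y + A*x): 2n^2 flops; reads A, x, y and writes y: n^2 + 3n memops. */
double gemv_flops_per_memop( double n )
{
    return ( 2.0*n*n ) / ( n*n + 3.0*n );
}
```

Under this model the `gemm` ratio grows as `n/2`, while the `gemv` ratio never reaches 2 regardless of problem size, which is why level-2 operations tend to be limited by memory bandwidth.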
+
+Some level-1v, level-1m, and level-1d operations may also be accelerated, but since they are memory-bound, optimization typically yields minor performance improvement.
+
+
+---
+
+
+## BLIS kernels summary
+
+This section lists and briefly describes each of the main computational kernels supported by the BLIS framework. (Other kernels are supported, but they are not of interest to most developers.)
+
+### Level-3
+
+BLIS supports the following three level-3 micro-kernels. These micro-kernels are used to implement optimized level-3 operations.
+  * **gemm**: The `gemm` micro-kernel performs a small matrix multiplication and is used by every level-3 operation.
+  * **trsm**: The `trsm` micro-kernel performs a small triangular solve with multiple right-hand sides. It is not required for optimal performance and in fact is only needed when the developer opts to not implement the fused `gemmtrsm` kernel.
+  * **gemmtrsm**: The `gemmtrsm` micro-kernel implements a fused operation whereby a `gemm` and a `trsm` subproblem are fused together in a single routine. This avoids redundant memory operations that would otherwise be incurred if the operations were executed separately.
+
+The following shows the steps one would take to optimize, to varying degrees, the level-3 operations supported by BLIS:
+  1. By implementing and optimizing the `gemm` micro-kernel, **all** level-3 operations **except** `trsm` are fully optimized. In this scenario, the `trsm` operation may achieve 60-90% of attainable peak performance, depending on the architecture and problem size.
+  1. If one goes further and implements and optimizes the `trsm` micro-kernel, this kernel, when paired with an optimized `gemm` micro-kernel, results in a `trsm` implementation that is accelerated (but not optimized).
+  1. Alternatively, if one implements and optimizes the fused `gemmtrsm` micro-kernel, this kernel, when paired with an optimized `gemm` micro-kernel, enables a fully optimized `trsm` implementation.
+
+### Level-1f
+
+BLIS supports the following five level-1f (fused) kernels. These kernels are used to implement optimized level-2 operations.
+  * **axpy2v**: Performs and fuses two [axpyv](BLISTypedAPI#axpyv) operations, accumulating to the same output vector.
+  * **dotaxpyv**: Performs and fuses a [dotv](BLISTypedAPI#dotv) followed by an [axpyv](BLISTypedAPI#axpyv) operation with x.
+  * **axpyf**: Performs and fuses some implementation-dependent number of [axpyv](BLISTypedAPI#axpyv) operations, accumulating to the same output vector. Can also be expressed as a [gemv](BLISTypedAPI#gemv) operation where matrix A is _m x nf_, where nf is the number of fused operations (fusing factor).
+  * **dotxf**: Performs and fuses some implementation-dependent number of [dotxv](BLISTypedAPI#dotxv) operations, reusing the `y` vector for each [dotxv](BLISTypedAPI#dotxv).
+  * **dotxaxpyf**: Performs and fuses a [dotxf](BLISTypedAPI#dotxf) and [axpyf](BLISTypedAPI#axpyf) in which the matrix operand is reused.
+
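As an illustration of the fusing described above, here is a minimal reference sketch of `axpyf` for real doubles, written as the equivalent `gemv`-like update with an _m x nf_ matrix. The name `ref_daxpyf` and its argument list are assumptions made for this example and do not match the BLIS kernel prototype.

```c
#include <stddef.h>

/* Reference axpyf for real doubles: y := y + alpha * A * x, where A is
   m x nf. This equals nf axpyv operations (one per column of A) that all
   accumulate into the same vector y.
   NOTE: ref_daxpyf is a hypothetical name, not a BLIS symbol. */
void ref_daxpyf(size_t m, size_t nf, double alpha,
                const double *a, size_t rsa, size_t csa,
                const double *x, size_t incx,
                double *y, size_t incy)
{
    for (size_t i = 0; i < m; ++i) {
        double acc = 0.0;

        /* Combine the contributions of all nf columns for row i. */
        for (size_t j = 0; j < nf; ++j)
            acc += a[i * rsa + j * csa] * x[j * incx];

        /* y[i] is read and written once instead of nf times. */
        y[i * incy] += alpha * acc;
    }
}
```

The benefit of fusing is visible in the last statement: an unfused sequence of `nf` separate `axpyv` calls would load and store each element of `y` once per call.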
+
+### Level-1v
+
+BLIS supports kernels for the following level-1 operations. Aside from their self-similar operations (ie: the use of an `axpyv` kernel to implement the `axpyv` operation), these kernels are used only to implement level-2 operations, and only when the developer decides to forgo more optimized approaches that involve level-1f kernels (where applicable).
+  * **axpyv**: Performs a [scale-and-accumulate vector](BLISTypedAPI#axpyv) operation.
+  * **dotv**: Performs a [dot product](BLISTypedAPI#dotv) where the output scalar is overwritten.
+  * **dotxv**: Performs an [extended dot product](BLISTypedAPI#dotxv) operation where the dot product is first scaled and then accumulated into a scaled output scalar.
+
+There are other level-1v kernels that may be optimized, such as [addv](BLISTypedAPI#addv), [subv](BLISTypedAPI#subv), and [scalv](BLISTypedAPI#scalv), but their use is less common and therefore of much less importance to most users and developers.
+
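The difference between `dotv` and `dotxv` can be sketched as follows for real doubles. The function names and signatures are hypothetical and chosen only for this example. The sketch also assumes that a zero `beta` means the old value of the output scalar is ignored rather than multiplied.

```c
#include <stddef.h>

/* dotv: the result overwrites the output scalar (returned here).
   NOTE: hypothetical name and signature, not the BLIS prototype. */
double ref_ddotv(size_t n, const double *x, size_t incx,
                 const double *y, size_t incy)
{
    double dot = 0.0;
    for (size_t i = 0; i < n; ++i)
        dot += x[i * incx] * y[i * incy];
    return dot;
}

/* dotxv: rho := beta * rho + alpha * x^T y.
   NOTE: hypothetical name and signature, not the BLIS prototype. */
void ref_ddotxv(size_t n, double alpha,
                const double *x, size_t incx,
                const double *y, size_t incy,
                double beta, double *rho)
{
    double dot = ref_ddotv(n, x, incx, y, incy);

    if (beta == 0.0)
        *rho = alpha * dot;              /* assumption: old rho is not read */
    else
        *rho = beta * (*rho) + alpha * dot;
}
```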
+
+### Level-1v/-1f Dependencies for Level-2 operations
+
+The table below shows dependencies between level-2 operations and each of the level-1v and level-1f kernels.
+
+Kernels marked with a "1" for a given level-2 operation are preferred for optimization because they facilitate an optimal implementation on most architectures. Kernels marked with a "2", "3", or "4" denote those which need to be optimized for alternative implementations that would typically be second, third, or fourth choices, respectively, if the preferred kernels are not optimized.
+
+| operation / kernel | effective storage   | `axpyv` | `dotxv` | `axpy2v` | `dotaxpyv` | `axpyf` | `dotxf` | `dotxaxpyf` |
+|:-------------------|:--------------------|:--------|:--------|:---------|:-----------|:--------|:--------|:------------|
+| `gemv, trmv, trsv` | row-wise            |         |   2     |          |            |         |   1     |             |
+|                    | column-wise         |   2     |         |          |            |   1     |         |             |
+| `hemv, symv`       | row- or column-wise |   4     |   4     |          |    3       |   2     |   2     |     1       |
+| `ger, her, syr`    | row- or column-wise |   1     |         |          |            |         |         |             |
+| `her2, syr2`       | row- or column-wise |   2     |         |    1     |            |         |         |             |
+
+**Note:** The "effective storage" column reflects the orientation of the matrix operand **after** transposition via the corresponding `trans_t` parameter (if applicable). For example, calling `gemv` with a column-stored matrix `A` and the `transa` parameter equal to `BLIS_TRANSPOSE` would be effectively equivalent to row-wise storage.
+
+
+---
+
+
+## BLIS kernels reference
+
+This section seeks to provide developers with a complete reference for each of the following BLIS kernels, including function prototypes, parameter descriptions, implementation notes, and diagrams:
+  * [Level-3 micro-kernels](KernelsHowTo#level-3-micro-kernels)
+    * [gemm](KernelsHowTo#gemm-micro-kernel)
+    * [trsm](KernelsHowTo#trsm-micro-kernels)
+    * [gemmtrsm](KernelsHowTo#gemmtrsm-micro-kernels)
+  * [Level-1f kernels](KernelsHowTo#level-1f-kernels)
+    * axpy2v
+    * dotaxpyv
+    * axpyf
+    * dotxf
+    * dotxaxpyf
+  * [Level-1v kernels](KernelsHowTo#level-1v-kernels)
+    * axpyv
+    * dotv
+    * dotxv
+
+The function prototypes in this section follow the same guidelines as those listed in the [BLIS typed API reference](BLISTypedAPI#Notes_for_using_this_reference). Namely:
+  * Any occurrence of `?` should be replaced with `s`, `d`, `c`, or `z` to form an actual function name.
+  * Any occurrence of `ctype` should be replaced with the actual C type corresponding to the datatype instance in question.
+  * Some matrix arguments have associated row and column stride arguments that follow them, typically listed as `rsX` and `csX` for a given matrix `X`. Row strides are always listed first, and column strides are always listed second. The semantic meaning of a row stride is "the distance, in units of elements, from any given element to the corresponding element (within the same column) of the next row," and the meaning of a column stride is "the distance, in units of elements, from any given element to the corresponding element (within the same row) of the next column." Thus, unit row stride implies column-major storage and unit column stride implies row-major storage.
+  * All occurrences of `alpha` and `beta` parameters are scalars.
+
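The stride semantics above amount to a single address computation. The helper below is a small sketch with a hypothetical name (`elem_addr`); it is not part of BLIS.

```c
#include <stddef.h>

/* Address of element (i, j) of a matrix whose row stride is rs and whose
   column stride is cs, both measured in elements.
   NOTE: elem_addr is a hypothetical helper used only for illustration. */
double *elem_addr(double *x, size_t rs, size_t cs, size_t i, size_t j)
{
    return x + i * rs + j * cs;
}
```

For a 2 x 3 matrix, `rs = 1, cs = 2` describes column-major storage, while `rs = 3, cs = 1` describes row-major storage of the same logical matrix.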
+
+
+### Level-3 micro-kernels
+
+This section describes in detail the various level-3 micro-kernels supported by BLIS:
+  * [gemm](KernelsHowTo#gemm-micro-kernel)
+  * [trsm](KernelsHowTo#trsm-micro-kernels)
+  * [gemmtrsm](KernelsHowTo#gemmtrsm-micro-kernels)
+
+
+#### gemm micro-kernel
+
+```
+void bli_?gemm_<suffix>
+     (
+       dim_t               k,
+       ctype*     restrict alpha,
+       ctype*     restrict a1,
+       ctype*     restrict b1,
+       ctype*     restrict beta,
+       ctype*     restrict c11, inc_t rsc, inc_t csc,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     );
+```
+
+where `<suffix>` is implementation-dependent. The following (more portable) wrapper is also defined:
+
+```
+void bli_?gemm_ukernel
+     (
+       dim_t               k,
+       ctype*     restrict alpha,
+       ctype*     restrict a1,
+       ctype*     restrict b1,
+       ctype*     restrict beta,
+       ctype*     restrict c11, inc_t rsc, inc_t csc,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     );
+```
+
+The `gemm` micro-kernel, sometimes simply referred to as "the BLIS micro-kernel" or "the micro-kernel", performs the following operation:
+
+```
+  C11 := beta * C11 + alpha * A1 * B1
+```
+
+where `A1` is an _MR x k_ "micro-panel" matrix stored in packed (column-wise) format, `B1` is a _k x NR_ "micro-panel" matrix stored in packed (row-wise) format, `C11` is an _MR x NR_ general matrix stored according to its row and column strides `rsc` and `csc`, and `alpha` and `beta` are scalars.
+
+_MR_ and _NR_ are the register blocksizes associated with the micro-kernel. They are chosen by the developer when the micro-kernel is written and then encoded into a BLIS configuration, which will reference the micro-kernel when the BLIS framework is instantiated into a library. For more information on setting register blocksizes and related constants, please see the [BLIS developer configuration guide](ConfigurationHowTo).
+
+Parameters:
+
+  * `k`:      The number of columns of `A1` and rows of `B1`.
+  * `alpha`:  The address of a scalar to be applied to the `A1 * B1` product.
+  * `a1`:     The address of a micro-panel of matrix `A` of dimension _MR x k_, stored by columns with leading dimension _PACKMR_, where typically _PACKMR_ = _MR_. (See [Implementation Notes for gemm](KernelsHowTo#implementation-notes-for-gemm) for a discussion of _PACKMR_.)
+  * `b1`:     The address of a micro-panel of matrix `B` of dimension _k x NR_, stored by rows with leading dimension _PACKNR_, where typically _PACKNR_ = _NR_. (See [Implementation Notes for gemm](KernelsHowTo#implementation-notes-for-gemm) for a discussion of _PACKNR_.)
+  * `beta`:   The address of a scalar to be applied to the input value of matrix `C11`.
+  * `c11`:    The address of a matrix `C11` of dimension _MR x NR_, stored according to `rsc` and `csc`.
+  * `rsc`:    The row stride of matrix `C11` (ie: the distance to the next row, in units of matrix elements).
+  * `csc`:    The column stride of matrix `C11` (ie: the distance to the next column, in units of matrix elements).
+  * `data`:   The address of an `auxinfo_t` object that contains auxiliary information that may be useful when optimizing the `gemm` micro-kernel implementation. (See [Using the auxinfo\_t object](KernelsHowTo#Using_the_auxinfo_t_object) for a discussion of the kinds of values available via `auxinfo_t`.)
+  * `cntx`:   The address of the runtime context. The context can be queried for implementation-specific values such as cache and register blocksizes. However, most micro-kernels intrinsically "know" these values already, and thus the `cntx` argument usually can be safely ignored.
+
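To make the operation and the operand layouts concrete, the following is an unoptimized reference sketch for real doubles. It assumes _MR_ = _NR_ = 4 and _PACKMR_ = _MR_, _PACKNR_ = _NR_. The name `ref_dgemm_ukr` is hypothetical, and the `auxinfo_t` and `cntx_t` arguments are omitted so that the example stays self-contained.

```c
#include <stddef.h>

/* Hypothetical register blocksizes and packing dimensions. */
enum { MR = 4, NR = 4, PACKMR = MR, PACKNR = NR };

/* C11 := beta * C11 + alpha * A1 * B1
   a1: MR x k, stored by columns with leading dimension PACKMR
   b1: k x NR, stored by rows with leading dimension PACKNR
   NOTE: ref_dgemm_ukr is a hypothetical name, not a BLIS symbol. */
void ref_dgemm_ukr(size_t k, const double *alpha,
                   const double *a1, const double *b1,
                   const double *beta,
                   double *c11, size_t rsc, size_t csc)
{
    double ab[MR][NR] = {{0.0}};

    /* Accumulate k rank-1 updates: column p of A1 times row p of B1. */
    for (size_t p = 0; p < k; ++p)
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                ab[i][j] += a1[p * PACKMR + i] * b1[p * PACKNR + j];

    /* Scale and write to C11 using its general strides. */
    for (size_t i = 0; i < MR; ++i) {
        for (size_t j = 0; j < NR; ++j) {
            double *cij = c11 + i * rsc + j * csc;

            if (*beta == 0.0)
                *cij = *alpha * ab[i][j];   /* never read C11 when beta is zero */
            else
                *cij = *beta * (*cij) + *alpha * ab[i][j];
        }
    }
}
```

An optimized micro-kernel would keep `ab` in vector registers and replace the loops over `i` and `j` with fully unrolled vector instructions.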
+#### Diagram for gemm
+
+The diagram below shows the packed micro-panel operands and how elements of each would be stored when _MR_ = _NR_ = 4. The hex digits indicate the layout and order (but NOT the numeric contents) of the elements in memory. Note that the storage of `C11` is not shown since it is determined by the row and column strides of `C11`.
+
+```
+         c11:           a1:                        b1:                   
+         _______        ______________________     _______              
+        |       |      |0 4 8 C               |   |0 1 2 3|             
+    MR  |       |      |1 5 9 D . . .         |   |4 5 6 7|             
+        |       |  +=  |2 6 A E               |   |8 9 A B|             
+        |_______|      |3_7_B_F_______________|   |C D E F|             
+                                                  |   .   |             
+            NR                    k               |   .   | k           
+                                                  |   .   |             
+                                                  |       |             
+                                                  |       |             
+                                                  |_______|             
+                                                                        
+                                                      NR                
+```
+
+#### Implementation Notes for gemm
+
+  * **Register blocksizes.** The C preprocessor macros `bli_?mr` and `bli_?nr` evaluate to the _MR_ and _NR_ register blocksizes for the datatype corresponding to the '?' character. These values are abbreviations of the macro constants `BLIS_DEFAULT_MR_?` and `BLIS_DEFAULT_NR_?`, which are defined in the `bli_kernel.h` header file of the BLIS configuration.
+  * **Leading dimensions of `a1` and `b1`: _PACKMR_ and _PACKNR_.** The packed micro-panels `a1` and `b1` are simply stored in column-major and row-major order, respectively. Usually, the width of either micro-panel (ie: the number of rows of `A1`, or _MR_, and the number of columns of `B1`, or _NR_) is equal to that micro-panel's so-called "leading dimension." Sometimes, it may be beneficial to specify a leading dimension that is larger than the panel width. This may be desirable because it allows each column of `A1` or row of `B1` to maintain a certain alignment in memory that would not otherwise be maintained by _MR_ and/or _NR_. In this case, you should index through `a1` and `b1` using the values _PACKMR_ and _PACKNR_, respectively (which are stored in the context as the blocksize maximums associated with the `bszid_t` values `BLIS_MR` and `BLIS_NR`). These values are defined as `BLIS_PACKDIM_MR_?` and `BLIS_PACKDIM_NR_?`, respectively, in the `bli_kernel.h` header file of the BLIS configuration.
+  * **Storage preference of `c11`.** Sometimes, an optimized `gemm` micro-kernel will have a "preferred" storage format for `C11`--typically either contiguous row-storage (i.e. `csc` = 1) or contiguous column-storage (i.e. `rsc` = 1). This preference comes from how the micro-kernel is most efficiently able to load/store elements of `C11` from/to memory. Most micro-kernels use vector instructions to access contiguous columns (or column segments) of `C11`. However, the developer may decide that accessing contiguous rows (or row segments) is more desirable. If this is the case, this preference should be noted in `bli_kernel.h` by defining the macro `BLIS_?GEMM_UKERNEL_PREFERS_CONTIG_ROWS`. Leaving the macro undefined leaves the default assumption (contiguous column preference) in place. Setting this macro allows the framework to perform a minor optimization at run-time that will ensure the micro-kernel preference is honored, if at all possible.
+  * **Edge cases in _MR_, _NR_ dimensions.** Sometimes the micro-kernel will be called with micro-panels `a1` and `b1` that correspond to edge cases, where only partial results are needed. Zero-padding is handled automatically by the packing function to facilitate reuse of the same micro-kernel. Similarly, the logic for computing to temporary storage and then saving only the elements that correspond to elements of `C11` that exist (at the edges) is handled automatically within the macro-kernel.
+  * **Alignment of `a1` and `b1`.** By default, the addresses `a1` and `b1` are aligned only to `sizeof(type)`. If `BLIS_POOL_ADDR_ALIGN_SIZE` is set to some larger multiple of `sizeof(type)`, such as the page size, then the *first* `a1` and `b1` micro-panels will be aligned to that value, but subsequent micro-panels will only be aligned to `sizeof(type)`. The exception is when `BLIS_POOL_ADDR_ALIGN_SIZE` is a multiple of `PACKMR` and `PACKNR`, in which case subsequent micro-panels `a1` and `b1` will be aligned to `PACKMR * sizeof(type)` and `PACKNR * sizeof(type)`, respectively.
+  * **Unrolling loops.** As a general rule of thumb, the loop over _k_ is sometimes moderately unrolled; for example, in our experience, an unrolling factor of _u_ = 4 is fairly common. If unrolling is applied in the _k_ dimension, edge cases must be handled to support values of _k_ that are not multiples of _u_. It is nearly universally true that there should be no loops in the _MR_ or _NR_ directions; in other words, iteration over these dimensions should always be fully unrolled (within the loop over _k_).
+  * **Zero `beta`.** If `beta` = 0.0 (or 0.0 + 0.0i for complex datatypes), then the micro-kernel should NOT use it explicitly, as `C11` may contain uninitialized memory (including elements containing `NaN` or `Inf`). This case should be detected and handled separately, preferably by simply overwriting `C11` with the `alpha * A1 * B1` product. An example of how to perform this "beta equals zero" handling is included in the `gemm` micro-kernel associated with the `template` configuration.
+
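The unrolling note above can be sketched as follows, assuming an unrolling factor of _u_ = 4, _MR_ = _NR_ = 4, and _PACKMR_ = _MR_, _PACKNR_ = _NR_. The names `rank1` and `accumulate_ab` are hypothetical. The main loop handles four values of _k_ per iteration, and a cleanup loop handles the remaining `k % 4` iterations.

```c
#include <stddef.h>

/* Hypothetical register blocksizes and unrolling factor. */
enum { MR = 4, NR = 4, U = 4 };

/* One rank-1 update: ab += a * b^T, where a has MR elements and b has NR.
   A real kernel would fully unroll these two loops with vector code. */
static void rank1(double ab[MR][NR], const double *a, const double *b)
{
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j)
            ab[i][j] += a[i] * b[j];
}

/* Accumulate A1 * B1 into ab with the k loop unrolled by U.
   NOTE: accumulate_ab is a hypothetical name, not a BLIS symbol. */
void accumulate_ab(size_t k, const double *a1, const double *b1,
                   double ab[MR][NR])
{
    size_t k_iter = k / U;   /* number of unrolled iterations */
    size_t k_left = k % U;   /* edge-case iterations          */

    for (size_t it = 0; it < k_iter; ++it) {
        rank1(ab, a1 + 0 * MR, b1 + 0 * NR);
        rank1(ab, a1 + 1 * MR, b1 + 1 * NR);
        rank1(ab, a1 + 2 * MR, b1 + 2 * NR);
        rank1(ab, a1 + 3 * MR, b1 + 3 * NR);
        a1 += U * MR;
        b1 += U * NR;
    }

    for (size_t it = 0; it < k_left; ++it) {
        rank1(ab, a1, b1);
        a1 += MR;
        b1 += NR;
    }
}
```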
+#### Using the auxinfo\_t object
+
+Each micro-kernel ([gemm](KernelsHowTo#gemm-micro-kernel), [trsm](KernelsHowTo#trsm-micro-kernels), and [gemmtrsm](KernelsHowTo#gemmtrsm-micro-kernels)) takes as its second-to-last argument (immediately before `cntx`) a pointer to an `auxinfo_t`. This BLIS-defined type is a `struct` whose fields contain auxiliary values that may be useful to some micro-kernel authors, particularly when implementing certain optimization techniques. BLIS provides kernel authors access to the fields of the `auxinfo_t` object via the following function-like preprocessor macros. Each macro takes a single argument, the `auxinfo_t` pointer, and returns one of the values stored within the object.
+
+  * `bli_auxinfo_next_a()`. Returns the address (`void*`) of the micro-panel of `A` that will be used the next time the micro-kernel will be called.
+  * `bli_auxinfo_next_b()`. Returns the address (`void*`) of the micro-panel of `B` that will be used the next time the micro-kernel will be called.
+  * `bli_auxinfo_ps_a()`. Returns the panel stride (`inc_t`) of the current micro-panel of `A`.
+  * `bli_auxinfo_ps_b()`. Returns the panel stride (`inc_t`) of the current micro-panel of `B`.
+
+The addresses of the next micro-panels of `A` and `B` may be used by the micro-kernel to perform prefetching, if prefetching is supported by the architecture. Similarly, it may be useful to know the precise distance in memory to the next micro-panel. (Note that sometimes the next micro-panel to be used is **not** the same as the next micro-panel in memory.)
+
+Any and all of these values may be safely ignored; they are completely optional. However, BLIS guarantees that all values accessed via the macros listed above will **always** be initialized and meaningful, for every invocation of each micro-kernel (`gemm`, `trsm`, and `gemmtrsm`).
+
+
+#### Example code for gemm
+
+An example implementation of the `gemm` micro-kernel may be found in the `template` configuration directory in:
+  * [config/template/kernels/3/bli\_gemm\_opt\_mxn.c](https://github.com/flame/blis/tree/master/config/template/kernels/3/bli_gemm_opt_mxn.c)
+
+
+Note that this implementation is coded in C99 and lacks several kinds of optimization that are typical of real-world optimized micro-kernels, such as vector instructions (or intrinsics) and loop unrolling in _MR_ or _NR_. It is meant to serve only as a starting point for a micro-kernel developer.
+
+
+
+
+---
+
+
+#### trsm micro-kernels
+
+```
+void bli_?trsm_l_<suffix>
+     (
+       ctype*     restrict a11,
+       ctype*     restrict b11,
+       ctype*     restrict c11, inc_t rsc, inc_t csc,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     );
+
+void bli_?trsm_u_<suffix>
+     (
+       ctype*     restrict a11,
+       ctype*     restrict b11,
+       ctype*     restrict c11, inc_t rsc, inc_t csc,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     );
+```
+
+where `<suffix>` is implementation-dependent. The following (more portable) wrappers are also defined:
+
+```
+void bli_?trsm_l_ukernel
+     (
+       ctype*     restrict a11,
+       ctype*     restrict b11,
+       ctype*     restrict c11, inc_t rsc, inc_t csc,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     );
+
+void bli_?trsm_u_ukernel
+     (
+       ctype*     restrict a11,
+       ctype*     restrict b11,
+       ctype*     restrict c11, inc_t rsc, inc_t csc,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     );
+```
+
+The `trsm_l` and `trsm_u` micro-kernels perform the following operation:
+
+```
+  C11 := inv(A11) * B11
+```
+
+where `A11` is _MR x MR_ and lower (`trsm_l`) or upper (`trsm_u`) triangular, `B11` is _MR x NR_, and `C11` is _MR x NR_.
+
+_MR_ and _NR_ are the register blocksizes associated with the micro-kernel. They are chosen by the developer when the micro-kernel is written and then encoded into a BLIS configuration, which will reference the micro-kernel when the BLIS framework is instantiated into a library. For more information on setting register blocksizes and related constants, please see the [BLIS developer configuration guide](ConfigurationHowTo).
+
+Parameters:
+
+  * `a11`:    The address of `A11`, which is the _MR x MR_ lower (`trsm_l`) or upper (`trsm_u`) triangular submatrix within the packed micro-panel of matrix `A`. `A11` is stored by columns with leading dimension _PACKMR_, where typically _PACKMR_ = _MR_. (See [Implementation Notes for gemm](KernelsHowTo#implementation-notes-for-gemm) for a discussion of _PACKMR_.) Note that `A11` contains elements in both triangles, though elements in the unstored triangle are not guaranteed to be zero and thus should not be referenced.
+  * `b11`:    The address of `B11`, which is an _MR x NR_ submatrix of the packed micro-panel of `B`. `B11` is stored by rows with leading dimension _PACKNR_, where typically _PACKNR_ = _NR_. (See [Implementation Notes for gemm](KernelsHowTo#implementation-notes-for-gemm) for a discussion of _PACKNR_.)
+  * `c11`:    The address of `C11`, which is an _MR x NR_ submatrix of matrix `C`, stored according to `rsc` and `csc`. `C11` is the submatrix within `C` that corresponds to the elements which were packed into `B11`. Thus, `C` is the original input matrix `B` to the overall `trsm` operation.
+  * `rsc`:    The row stride of matrix `C11` (ie: the distance to the next row, in units of matrix elements).
+  * `csc`:    The column stride of matrix `C11` (ie: the distance to the next column, in units of matrix elements).
+  * `data`:   The address of an `auxinfo_t` object that contains auxiliary information that may be useful when optimizing the `trsm` micro-kernel implementation. (See [Using the auxinfo\_t object](KernelsHowTo#Using_the_auxinfo_t_object) for a discussion of the kinds of values available via `auxinfo_t`, and also [Implementation Notes for trsm](KernelsHowTo#implementation-notes-for-trsm) for caveats.)
+  * `cntx`:   The address of the runtime context. The context can be queried for implementation-specific values such as cache and register blocksizes. However, most micro-kernels intrinsically "know" these values already, and thus the `cntx` argument usually can be safely ignored.
+
+#### Diagrams for trsm
+
+Please see the diagrams for [gemmtrsm\_l](KernelsHowTo#diagram-for-gemmtrsm-l) and [gemmtrsm\_u](KernelsHowTo#diagram-for-gemmtrsm-u) to see depictions of the `trsm_l` and `trsm_u` micro-kernel operations and where they fit in with their preceding `gemm` subproblems.
+
+#### Implementation Notes for trsm
+
+  * **Register blocksizes.** See [Implementation Notes for gemm](KernelsHowTo#implementation-notes-for-gemm).
+  * **Leading dimensions of `a11` and `b11`: _PACKMR_ and _PACKNR_.** See [Implementation Notes for gemm](KernelsHowTo#implementation-notes-for-gemm).
+  * **Edge cases in _MR_, _NR_ dimensions.** See [Implementation Notes for gemm](KernelsHowTo#implementation-notes-for-gemm).
+  * **Alignment of `a11` and `b11`.** The addresses `a11` and `b11` are aligned according to `PACKMR * sizeof(type)` and `PACKNR * sizeof(type)`, respectively.
+  * **Unrolling loops.** Most optimized implementations should unroll all three loops within the `trsm` micro-kernel.
+  * **Prefetching next micro-panels of `A` and `B`.** We advise against using the `bli_auxinfo_next_a()` and `bli_auxinfo_next_b()` macros from within the `trsm_l` and `trsm_u` micro-kernels, since the values returned usually only make sense in the context of the overall `gemmtrsm` subproblem.
+  * **Diagonal elements of `A11`.** At the time this micro-kernel is called, the diagonal entries of triangular matrix `A11` contain the **_inverse_** of the original elements. This inversion is done during packing so that we can avoid expensive division instructions within the micro-kernel itself. If the `diag` parameter to the higher level `trsm` operation was equal to `BLIS_UNIT_DIAG`, the diagonal elements will be explicitly unit.
+  * **Zero elements of `A11`.** Since `A11` is lower triangular (for `trsm_l`), the strictly upper triangle implicitly contains zeros. Similarly, the strictly lower triangle of `A11` implicitly contains zeros when `A11` is upper triangular (for `trsm_u`). However, the packing function may or may not actually write zeros to this region. Thus, the implementation should not reference these elements.
+  * **Output.** This micro-kernel must write its result to two places: the submatrix `B11` of the current packed micro-panel of `B` _and_ the submatrix `C11` of the output matrix `C`.
+
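The notes above on the pre-inverted diagonal, the unreferenced triangle, and the two output locations can be combined into a short reference sketch of `trsm_l` for real doubles. It assumes _MR_ = _NR_ = 4 and _PACKMR_ = _MR_, _PACKNR_ = _NR_. The name `ref_dtrsm_l_ukr` is hypothetical, and the `auxinfo_t` and `cntx_t` arguments are omitted.

```c
#include <stddef.h>

/* Hypothetical register blocksizes and packing dimensions. */
enum { MR = 4, NR = 4, PACKMR = MR, PACKNR = NR };

/* C11 := inv(A11) * B11 with A11 lower triangular (forward substitution).
   a11: MR x MR, stored by columns; its diagonal holds the INVERSE of the
        original diagonal, and its strictly upper triangle is never read.
   b11: MR x NR, stored by rows; overwritten with the solution.
   NOTE: ref_dtrsm_l_ukr is a hypothetical name, not a BLIS symbol. */
void ref_dtrsm_l_ukr(const double *a11, double *b11,
                     double *c11, size_t rsc, size_t csc)
{
    for (size_t i = 0; i < MR; ++i) {
        double inv_alpha11 = a11[i * PACKMR + i];

        for (size_t j = 0; j < NR; ++j) {
            double rho = 0.0;

            /* Dot product of row i of A11 (left of the diagonal) with the
               already-solved part of column j. */
            for (size_t l = 0; l < i; ++l)
                rho += a11[l * PACKMR + i] * b11[l * PACKNR + j];

            /* Multiply by the stored inverse instead of dividing. */
            double x = (b11[i * PACKNR + j] - rho) * inv_alpha11;

            b11[i * PACKNR + j] = x;       /* output 1: packed B11 */
            c11[i * rsc + j * csc] = x;    /* output 2: matrix C11 */
        }
    }
}
```

The `trsm_u` case would follow the same pattern with backward substitution, iterating over `i` from _MR_ - 1 down to 0 and reading only elements to the right of the diagonal.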
+#### Example code for trsm
+
+Example implementations of the `trsm` micro-kernels may be found in the `template` configuration directory in:
+  * [config/template/kernels/3/bli\_trsm\_l\_opt\_mxn.c](https://github.com/flame/blis/tree/master/config/template/kernels/3/bli_trsm_l_opt_mxn.c)
+  * [config/template/kernels/3/bli\_trsm\_u\_opt\_mxn.c](https://github.com/flame/blis/tree/master/config/template/kernels/3/bli_trsm_u_opt_mxn.c)
+
+Note that these implementations are coded in C99 and lack several kinds of optimization that are typical of real-world optimized micro-kernels, such as vector instructions (or intrinsics) and loop unrolling in _MR_ or _NR_. They are meant to serve only as a starting point for a micro-kernel developer.
+
+
+---
+
+
+#### gemmtrsm micro-kernels
+
+```
+void bli_?gemmtrsm_l_<suffix>
+     (
+       dim_t               k,
+       ctype*     restrict alpha,
+       ctype*     restrict a10,
+       ctype*     restrict a11,
+       ctype*     restrict b01,
+       ctype*     restrict b11,
+       ctype*     restrict c11, inc_t rsc, inc_t csc,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     );
+
+void bli_?gemmtrsm_u_<suffix>
+     (
+       dim_t               k,
+       ctype*     restrict alpha,
+       ctype*     restrict a12,
+       ctype*     restrict a11,
+       ctype*     restrict b21,
+       ctype*     restrict b11,
+       ctype*     restrict c11, inc_t rsc, inc_t csc,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     );
+```
+
+where `<suffix>` is implementation-dependent. The following (more portable) wrappers are also defined:
+
+```
+void bli_?gemmtrsm_l_ukernel
+     (
+       dim_t               k,
+       ctype*     restrict alpha,
+       ctype*     restrict a10,
+       ctype*     restrict a11,
+       ctype*     restrict b01,
+       ctype*     restrict b11,
+       ctype*     restrict c11, inc_t rsc, inc_t csc,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     );
+
+void bli_?gemmtrsm_u_ukernel
+     (
+       dim_t               k,
+       ctype*     restrict alpha,
+       ctype*     restrict a12,
+       ctype*     restrict a11,
+       ctype*     restrict b21,
+       ctype*     restrict b11,
+       ctype*     restrict c11, inc_t rsc, inc_t csc,
+       auxinfo_t* restrict data,
+       cntx_t*    restrict cntx
+     );
+```
+
+The `gemmtrsm_l` micro-kernel performs the following compound operation:
+
+```
+  B11 := alpha * B11 - A10 * B01
+  B11 := inv(A11) * B11
+  C11 := B11
+```
+
+where `A11` is _MR_ x _MR_ and lower triangular, `A10` is _MR_ x _k_, and `B01` is _k_ x _NR_.
+
+The `gemmtrsm_u` micro-kernel performs:
+
+```
+  B11 := alpha * B11 - A12 * B21
+  B11 := inv(A11) * B11
+  C11 := B11
+```
+
+where `A11` is _MR_ x _MR_ and upper triangular, `A12` is _MR_ x _k_, and `B21` is _k_ x _NR_.
+In both cases, `B11` is _MR_ x _NR_ and `alpha` is a scalar. Here, `inv()` denotes matrix inverse.
+
+_MR_ and _NR_ are the register blocksizes associated with the micro-kernel. They are chosen by the developer when the micro-kernel is written and then encoded into a BLIS configuration, which will reference the micro-kernel when the BLIS framework is instantiated into a library. For more information on setting register blocksizes and related constants, please see the [BLIS developer configuration guide](ConfigurationHowTo).
+
+Parameters:
+
+  * `k`:      The number of columns of `A10` and rows of `B01` (`trsm_l`); the number of columns of `A12` and rows of `B21` (`trsm_u`).
+  * `alpha`:  The address of a scalar to be applied to `B11`.
+  * `a10`, `a12`:    The address of `A10` or `A12`, which is the _MR x k_ submatrix of the packed micro-panel of `A` that is situated to the left (`trsm_l`) or right (`trsm_u`) of the _MR x MR_ triangular submatrix `A11`. `A10` and `A12` are stored by columns with leading dimension _PACKMR_, where typically _PACKMR_ = _MR_. (See [Implementation Notes for gemm](KernelsHowTo#implementation-notes-for-gemm) for a discussion of _PACKMR_.)
+  * `a11`:    The address of `A11`, which is the _MR x MR_ lower (`trsm_l`) or upper (`trsm_u`) triangular submatrix within the packed micro-panel of matrix `A` that is situated to the right of `A10` (`trsm_l`) or the left of `A12` (`trsm_u`). `A11` is stored by columns with leading dimension _PACKMR_, where typically _PACKMR_ = _MR_. (See [Implementation Notes for gemm](KernelsHowTo#implementation-notes-for-gemm) for a discussion of _PACKMR_.) Note that `A11` contains elements in both triangles, though elements in the unstored triangle are not guaranteed to be zero and thus should not be referenced.
+  * `b01`, `b21`:   The address of `B01` or `B21`, which is the _k x NR_ submatrix of the packed micro-panel of `B` that is situated above (`trsm_l`) or below (`trsm_u`) the _MR x NR_ block `B11`. `B01` and `B21` are stored by rows with leading dimension _PACKNR_, where typically _PACKNR_ = _NR_. (See [Implementation Notes for gemm](KernelsHowTo#implementation-notes-for-gemm) for a discussion of _PACKNR_.)
+  * `b11`:    The address of `B11`, which is the _MR x NR_ submatrix of the packed micro-panel of `B`, situated below `B01` (`trsm_l`) or above `B21` (`trsm_u`). `B11` is stored by rows with leading dimension _PACKNR_, where typically _PACKNR_ = _NR_. (See [Implementation Notes for gemm](KernelsHowTo#implementation-notes-for-gemm) for a discussion of _PACKNR_.)
+  * `c11`:    The address of `C11`, which is an _MR x NR_ submatrix of matrix `C`, stored according to `rsc` and `csc`. `C11` is the submatrix within `C` that corresponds to the elements which were packed into `B11`. Thus, `C` is the original input matrix `B` to the overall `trsm` operation.
+  * `rsc`:    The row stride of matrix `C11` (ie: the distance to the next row, in units of matrix elements).
+  * `csc`:    The column stride of matrix `C11` (ie: the distance to the next column, in units of matrix elements).
+  * `data`:   The address of an `auxinfo_t` object that contains auxiliary information that may be useful when optimizing the `gemmtrsm` micro-kernel implementation. (See [Using the auxinfo\_t object](KernelsHowTo#Using_the_auxinfo_t_object) for a discussion of the kinds of values available via `auxinfo_t`, and also [Implementation Notes for gemmtrsm](KernelsHowTo#implementation-notes-for-gemmtrsm) for caveats.)
+  * `cntx`:   The address of the runtime context. The context can be queried for implementation-specific values such as cache and register blocksizes. However, most micro-kernels intrinsically "know" these values already, and thus the `cntx` argument usually can be safely ignored.
+
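The compound operation for `gemmtrsm_l` can be sketched as an unoptimized reference routine for real doubles. It assumes _MR_ = _NR_ = 4, _PACKMR_ = _MR_, _PACKNR_ = _NR_, and a diagonal of `A11` that already holds inverted elements. The name `ref_dgemmtrsm_l_ukr` is hypothetical, and the `auxinfo_t` and `cntx_t` arguments are omitted.

```c
#include <stddef.h>

/* Hypothetical register blocksizes and packing dimensions. */
enum { MR = 4, NR = 4, PACKMR = MR, PACKNR = NR };

/* B11 := alpha * B11 - A10 * B01;  B11 := inv(A11) * B11;  C11 := B11
   a10: MR x k, by columns;   b01: k x NR, by rows
   a11: MR x MR lower triangular, by columns, diagonal pre-inverted
   b11: MR x NR, by rows
   NOTE: ref_dgemmtrsm_l_ukr is a hypothetical name, not a BLIS symbol. */
void ref_dgemmtrsm_l_ukr(size_t k, const double *alpha,
                         const double *a10, const double *a11,
                         const double *b01, double *b11,
                         double *c11, size_t rsc, size_t csc)
{
    /* gemm subproblem: update B11 in place. */
    for (size_t i = 0; i < MR; ++i) {
        for (size_t j = 0; j < NR; ++j) {
            double ab = 0.0;
            for (size_t p = 0; p < k; ++p)
                ab += a10[p * PACKMR + i] * b01[p * PACKNR + j];
            b11[i * PACKNR + j] = *alpha * b11[i * PACKNR + j] - ab;
        }
    }

    /* trsm subproblem: forward substitution, storing to B11 and C11. */
    for (size_t i = 0; i < MR; ++i) {
        for (size_t j = 0; j < NR; ++j) {
            double rho = 0.0;
            for (size_t l = 0; l < i; ++l)
                rho += a11[l * PACKMR + i] * b11[l * PACKNR + j];

            double x = (b11[i * PACKNR + j] - rho) * a11[i * PACKMR + i];
            b11[i * PACKNR + j] = x;
            c11[i * rsc + j * csc] = x;
        }
    }
}
```

In a fused, optimized kernel the updated `B11` would stay in registers between the two subproblems, which is the memory traffic that a separate `gemm` followed by `trsm` cannot avoid.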
+#### Diagram for gemmtrsm\_l
+
+The diagram below shows the packed micro-panel operands for `trsm_l` and how elements of each would be stored when _MR_ = _NR_ = 4. (The hex digits indicate the layout and order, but NOT the numeric contents, of the elements in memory.) Here, matrix `A11` (referenced by `a11`) is **lower triangular**. Matrix `A11` **does contain** elements corresponding to the strictly upper triangle; however, they are not guaranteed to contain zeros and thus these elements should not be referenced.
+
+```
+                                              NR    
+                                            _______ 
+                                       b01:|0 1 2 3|
+                                           |4 5 6 7|
+                                           |8 9 A B|
+                                           |C D E F|
+                                         k |   .   |
+                                           |   .   |
+       a10:                a11:            |   .   |
+       ___________________  _______        |_______|
+      |0 4 8 C            |`.      |   b11:|       |
+  MR  |1 5 9 D . . .      |  `.    |       |       |
+      |2 6 A E            |    `.  |    MR |       |
+      |3_7_B_F____________|______`.|       |_______|
+                                                    
+                k             MR                    
+```
+
+
+#### Diagram for gemmtrsm\_u
+
+The diagram below shows the packed micro-panel operands for `trsm_u` and how elements of each would be stored when _MR_ = _NR_ = 4. (The hex digits indicate the layout and order, but NOT the numeric contents, of the elements in memory.) Here, matrix `A11` (referenced by `a11`) is **upper triangular**. Matrix `A11` **does contain** elements corresponding to the strictly lower triangle; however, they are not guaranteed to contain zeros and thus these elements should not be referenced.
+
+```
+       a11:     a12:                          NR    
+       ________ ___________________         _______ 
+      |`.      |0 4 8              |   b11:|0 1 2 3|
+  MR  |  `.    |1 5 9 . . .        |       |4 5 6 7|
+      |    `.  |2 6 A              |    MR |8 9 A B|
+      |______`.|3_7_B______________|       |___.___|
+                                       b21:|   .   |
+          MR             k                 |   .   |
+                                           |       |
+                                           |       |
+     NOTE: Storage digits are shown      k |       |
+     starting with a12 to avoid            |       |
+     obscuring triangular structure        |       |
+     of a11.                               |_______|
+                                                                            
+```
+
+
+#### Implementation Notes for gemmtrsm
+
+  * **Register blocksizes.** See [Implementation Notes for gemm](KernelsHowTo#implementation-notes-for-gemm).
+  * **Leading dimensions of `a1` and `b1`: _PACKMR_ and _PACKNR_.** See [Implementation Notes for gemm](KernelsHowTo#implementation-notes-for-gemm).
+  * **Edge cases in _MR_, _NR_ dimensions.** See [Implementation Notes for gemm](KernelsHowTo#implementation-notes-for-gemm).
+  * **Alignment of `a1` and `b1`.** See [Implementation Notes for gemm](KernelsHowTo#implementation-notes-for-gemm).
+  * **Unrolling loops.** Most optimized implementations should unroll all three loops within the `trsm` subproblem of `gemmtrsm`. See [Implementation Notes for gemm](KernelsHowTo#implementation-notes-for-gemm) for remarks on unrolling the `gemm` subproblem.
+  * **Prefetching next micro-panels of `A` and `B`.** When invoked from within a `gemmtrsm_l` micro-kernel, the addresses accessible via `bli_auxinfo_next_a()` and `bli_auxinfo_next_b()` refer to the next invocation's `a10` and `b01`, respectively, while in `gemmtrsm_u`, the `_next_a()` and `_next_b()` macros return the addresses of the next invocation's `a11` and `b11` (since those submatrices precede `a12` and `b21`).
+  * **Zero `alpha`.** The micro-kernel can safely assume that `alpha` is non-zero; "alpha equals zero" handling is performed at a much higher level, which means that, in such a scenario, the micro-kernel will never get called.
+  * **Diagonal elements of `A11`.** See [Implementation Notes for trsm](KernelsHowTo#implementation-notes-for-trsm).
+  * **Zero elements of `A11`.** See [Implementation Notes for trsm](KernelsHowTo#implementation-notes-for-trsm).
+  * **Output.** See [Implementation Notes for trsm](KernelsHowTo#implementation-notes-for-trsm).
+  * **Optimization.** Let's assume that the [gemm micro-kernel](KernelsHowTo#gemm-micro-kernel) has already been optimized. You have two options with regard to optimizing the fused `gemmtrsm` micro-kernels:
+    1. Optimize only the [trsm micro-kernels](KernelsHowTo#trsm-micro-kernels). This will result in the `gemm` and `trsm_l` micro-kernels being called in sequence. (Likewise for `gemm` and `trsm_u`.)
+    1. Fuse the implementation of the `gemm` micro-kernel with that of the `trsm` micro-kernels by inlining both into the `gemmtrsm_l` and `gemmtrsm_u` micro-kernel definitions. This option is more labor-intensive, but also more likely to yield higher performance because it avoids redundant memory operations on the packed _MR x NR_ submatrix `B11`.
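+
+The first option above, in which the `gemm` and `trsm_l` steps run in sequence, can be sketched in plain C as follows. This is only an illustrative sketch, not the BLIS reference code: the function names are hypothetical, the datatype is fixed to `double`, _PACKMR_ = _MR_ and _PACKNR_ = _NR_ are assumed, the diagonal of `a11` is assumed to hold ordinary (non-inverted) values, and the additional arguments of the real micro-kernel (such as `cntx`) are omitted. The indexing follows the storage order shown in the `gemmtrsm_l` diagram.
+```c
+#include <stddef.h>
+
+/* Illustrative register blocksizes, matching the diagram (MR = NR = 4). */
+enum { MR = 4, NR = 4 };
+
+/* gemm step: b11 := alpha * b11 - a10 * b01.
+   a10 is MR x k, stored by columns; b01 is k x NR, stored by rows;
+   b11 is MR x NR, stored by rows. */
+void gemm_step_sketch(size_t k, double alpha, const double *a10,
+                      const double *b01, double *b11)
+{
+    for (size_t i = 0; i < MR; ++i) {
+        for (size_t j = 0; j < NR; ++j) {
+            double ab = 0.0;
+            for (size_t p = 0; p < k; ++p)
+                ab += a10[p * MR + i] * b01[p * NR + j];
+            b11[i * NR + j] = alpha * b11[i * NR + j] - ab;
+        }
+    }
+}
+
+/* trsm_l step: b11 := inv(a11) * b11 by forward substitution.
+   a11 is MR x MR, stored by columns. Only the lower triangle is read;
+   the strictly upper triangle is never referenced. */
+void trsm_l_step_sketch(const double *a11, double *b11)
+{
+    for (size_t i = 0; i < MR; ++i) {
+        for (size_t j = 0; j < NR; ++j) {
+            double x = b11[i * NR + j];
+            for (size_t l = 0; l < i; ++l)
+                x -= a11[l * MR + i] * b11[l * NR + j];
+            b11[i * NR + j] = x / a11[i * MR + i];
+        }
+    }
+}
+
+/* Unfused gemmtrsm_l: the two steps are simply called in sequence. */
+void gemmtrsm_l_sketch(size_t k, double alpha, const double *a10,
+                       const double *a11, const double *b01, double *b11)
+{
+    gemm_step_sketch(k, alpha, a10, b01, b11);
+    trsm_l_step_sketch(a11, b11);
+}
+```
+Note that `b11` is written by the first step and then read again by the second. A fused implementation would keep the intermediate _MR x NR_ result in registers between the two steps, which is the redundant memory traffic that the second option avoids.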
+
+
+#### Example code for gemmtrsm
+
+Example implementations of the `gemmtrsm` micro-kernels may be found in the `template` configuration directory in:
+  * [config/template/kernels/3/bli\_gemmtrsm\_l\_opt\_mxn.c](https://github.com/flame/blis/tree/master/config/template/kernels/3/bli_gemmtrsm_l_opt_mxn.c)
+  * [config/template/kernels/3/bli\_gemmtrsm\_u\_opt\_mxn.c](https://github.com/flame/blis/tree/master/config/template/kernels/3/bli_gemmtrsm_u_opt_mxn.c)
+
+Note that these implementations are coded in C99 and lack several kinds of optimization that are typical of real-world optimized micro-kernels, such as vector instructions (or intrinsics) and loop unrolling in _MR_ or _NR_. They are meant to serve only as a starting point for a micro-kernel developer.
+
+
+
+
+### Level-1f kernels
+
+_This section has yet to be written._
+
+### Level-1v kernels
+
+_This section has yet to be written._
--- a/docs/Multithreading.md
+++ b/docs/Multithreading.md
@@ -0,0 +1,98 @@
+## Contents
+
+* **[Contents](Multithreading#contents)**
+* **[Introduction](Multithreading#introduction)**
+* **[Enabling multithreading](Multithreading#enabling-multithreading)**
+* **[Specifying multithreading](Multithreading#specifying-multithreading)**
+  * [The automatic way](Multithreading#the-automatic-way)
+  * [The manual way](Multithreading#the-manual-way)
+
+## Introduction
+
+Our paper [Anatomy of High-Performance Many-Threaded Matrix Multiplication](https://github.com/flame/blis#citations), presented at IPDPS'14, identified 5 loops around the micro-kernel as opportunities for parallelization. Within BLIS, we have enabled parallelism for 4 of those loops and have extended it to the rest of the level-3 operations except for `trsm`.
+
+## Enabling multithreading
+
+Note that BLIS disables multithreading by default.
+
+As of this writing, BLIS optionally supports multithreading via either OpenMP or POSIX threads.
+
+To enable multithreading via OpenMP, you must provide the `--enable-threading` option to the `configure` script:
+```
+  $ ./configure --enable-threading=openmp haswell
+```
+In this example, we configure for the `haswell` configuration. Similarly, to enable multithreading via POSIX threads (pthreads), specify the threading model as `pthreads` instead of `openmp`:
+```
+  $ ./configure --enable-threading=pthreads haswell
+```
+For more complete and up-to-date information on the `--enable-threading` option, simply run `configure` with the `--help` (or `-h`) option:
+```
+  $ ./configure --help
+```
+
+
+## Specifying multithreading
+
+There are two broad ways to specify multithreading in BLIS: the "automatic way" or the "manual way".
+
+### The automatic way
+
+The simplest way to enable multithreading in BLIS is to set the total number of threads you wish BLIS to employ in its parallelization. This total number of threads is captured by the `BLIS_NUM_THREADS` environment variable. You can set this variable prior to executing your BLIS-linked executable:
+
+```
+  $ export BLIS_NUM_THREADS=16
+  $ ./my_blis_program
+```
+This causes BLIS to automatically determine a reasonable threading strategy based on what is known about your architecture. If `BLIS_NUM_THREADS` is not set, then BLIS also looks at the value of `OMP_NUM_THREADS`, if set. If neither variable is set, the default number of threads is 1.
+ 
+Alternatively, any time after calling `bli_init()` but before `bli_finalize()`, you can also set (or change) the value of `BLIS_NUM_THREADS` at run-time:
+```
+  bli_thread_set_num_threads( 8 );
+```
+Similarly, the current value of `BLIS_NUM_THREADS` can always be queried as follows:
+```
+  dim_t num_threads = bli_thread_get_num_threads();
+```
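+
+The fallback order described above (`BLIS_NUM_THREADS` first, then `OMP_NUM_THREADS`, then a default of 1) can be illustrated with a small standalone sketch. The function names below are hypothetical and this is not how BLIS implements the lookup internally. Treating an unparsable or non-positive value as "not set" is also an assumption made only for this illustration.
+```c
+#include <stdlib.h>
+
+/* Parse a positive integer; return -1 if s is NULL, malformed, or < 1. */
+long parse_positive_sketch(const char *s)
+{
+    char *end;
+    long  v;
+
+    if (s == NULL) return -1;
+    v = strtol(s, &end, 10);
+    if (end == s || *end != '\0' || v < 1) return -1;
+    return v;
+}
+
+/* Apply the documented precedence to two (possibly NULL) strings. */
+long resolve_num_threads(const char *blis_nt, const char *omp_nt)
+{
+    long v = parse_positive_sketch(blis_nt);
+    if (v > 0) return v;            /* BLIS_NUM_THREADS takes priority */
+
+    v = parse_positive_sketch(omp_nt);
+    if (v > 0) return v;            /* otherwise fall back to OMP_NUM_THREADS */
+
+    return 1;                       /* neither is set: single-threaded */
+}
+
+/* Convenience wrapper that reads the two environment variables. */
+long num_threads_from_env(void)
+{
+    return resolve_num_threads(getenv("BLIS_NUM_THREADS"),
+                               getenv("OMP_NUM_THREADS"));
+}
+```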
+
+### The manual way
+
+The "manual way" of specifying parallelism in BLIS involves specifying which loops within the matrix multiplication algorithm to parallelize, and the degree of parallelism to be obtained from those loops.
+
+The chart below describes the five loops used in BLIS's matrix multiplication operations.
+
+| Loop around micro-kernel | Environment variable | Direction | Notes       |
+|:-------------------------|:---------------------|:----------|:------------|
+| 5th loop                 | `BLIS_JC_NT`         | `n`       |             |
+| 4th loop                 | _N/A_                | `k`       | Not enabled |
+| 3rd loop                 | `BLIS_IC_NT`         | `m`       |             |
+| 2nd loop                 | `BLIS_JR_NT`         | `n`       |             |
+| 1st loop                 | `BLIS_IR_NT`         | `m`       |             |
+
+Note: Parallelization of the 4th loop is not currently enabled because each iteration of the loop updates the same part of the matrix C. Thus to parallelize it requires either a reduction or mutex locks when updating C.
+
+Parallelization in BLIS is hierarchical. So if we parallelize multiple loops, the total number of threads will be the product of the amount of parallelism for each loop. Thus the total number of threads used is `BLIS_IR_NT * BLIS_JR_NT * BLIS_IC_NT * BLIS_JC_NT`.
+
+In general, the way to choose how to set these environment variables is as follows: The amount of parallelism from the M and N dimensions should be roughly the same. Thus `BLIS_IR_NT * BLIS_IC_NT` should be roughly equal to `BLIS_JR_NT * BLIS_JC_NT`.
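+
+As a rough illustration of the two rules above, the following standalone sketch computes the total thread count and the amount of parallelism in the `m` and `n` dimensions for a given choice of per-loop values. The struct and function names are hypothetical and are not part of the BLIS API.
+```c
+/* Hypothetical container for the four per-loop thread counts. */
+typedef struct
+{
+    long jc_nt;  /* 5th loop, n dimension */
+    long ic_nt;  /* 3rd loop, m dimension */
+    long jr_nt;  /* 2nd loop, n dimension */
+    long ir_nt;  /* 1st loop, m dimension */
+} loop_threads_t;
+
+/* Total threads: the product of the parallelism of all four loops. */
+long total_threads(const loop_threads_t *t)
+{
+    return t->ir_nt * t->jr_nt * t->ic_nt * t->jc_nt;
+}
+
+/* Parallelism obtained from the m dimension (IR and IC loops). */
+long m_parallelism(const loop_threads_t *t)
+{
+    return t->ir_nt * t->ic_nt;
+}
+
+/* Parallelism obtained from the n dimension (JR and JC loops). */
+long n_parallelism(const loop_threads_t *t)
+{
+    return t->jr_nt * t->jc_nt;
+}
+```
+For example, `BLIS_JC_NT=2`, `BLIS_IC_NT=4`, `BLIS_JR_NT=3`, and `BLIS_IR_NT=1` give 24 threads in total, with 4-way parallelism in `m` and 6-way parallelism in `n`, which is reasonably balanced.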
+
+Next, which combinations of loops to parallelize depends on which caches are shared. Here are some of the more common scenarios:
+ * When compute resources have private L3 caches (example: multi-socket systems), try parallelizing the `JC` loop. This means threads (or thread groups) will pack and compute with different row panels from matrix B.
+ * For compute resources that have private L2 caches but that share an L3 cache (example: cores on a socket), try parallelizing the `IC` loop. In this situation, threads will share the same packed row panel from matrix B, but pack and compute with different blocks of matrix A.
+ * If compute resources share an L2 cache but have private L1 caches (example: pairs of cores), try parallelizing the `JR` loop. Here, threads share the same packed block of matrix A but read different packed micro-panels of B into their private L1 caches. In some situations, parallelizing the `IR` loop may also be effective.
+
+![The primary algorithm for level-3 operations in BLIS](http://www.cs.utexas.edu/users/field/mm_algorithm.png)
+
+As with specifying parallelism via `BLIS_NUM_THREADS`, you can set the `BLIS_xx_NT` environment variables in the shell, prior to launching your BLIS-linked executable, or you can set (or update) the environment variables at run-time. Here are some examples of using the run-time API:
+```
+  bli_thread_set_jc_nt( 2 );  // Set BLIS_JC_NT to 2.
+  bli_thread_set_ic_nt( 4 );  // Set BLIS_IC_NT to 4.
+  bli_thread_set_jr_nt( 3 );  // Set BLIS_JR_NT to 3.
+  bli_thread_set_ir_nt( 1 );  // Set BLIS_IR_NT to 1.
+```
+There are also equivalent "get" functions that allow you to query the current values for the `BLIS_xx_NT` variables:
+```
+  dim_t jc_nt = bli_thread_get_jc_nt();
+  dim_t ic_nt = bli_thread_get_ic_nt();
+  dim_t jr_nt = bli_thread_get_jr_nt();
+  dim_t ir_nt = bli_thread_get_ir_nt();
+```
+
--- a/docs/ReleaseNotes.md
+++ b/docs/ReleaseNotes.md
@@ -0,0 +1,326 @@
+# Release Notes
+
+Note: Individual credits, where appropriate, are shown in parentheses.
+
+## Contents
+
+* [Changes in 0.3.2](ReleaseNotes#changes-in-032)
+* [Changes in 0.3.1](ReleaseNotes#changes-in-031)
+* [Changes in 0.3.0](ReleaseNotes#changes-in-030)
+* [Changes in 0.2.2](ReleaseNotes#changes-in-022)
+* [Changes in 0.2.1](ReleaseNotes#changes-in-021)
+* [Changes in 0.2.0](ReleaseNotes#changes-in-020)
+* [Changes in 0.1.8](ReleaseNotes#changes-in-018)
+* [Changes in 0.1.7](ReleaseNotes#changes-in-017)
+* [Changes in 0.1.6](ReleaseNotes#changes-in-016)
+* [Changes in 0.1.5](ReleaseNotes#changes-in-015)
+* [Changes in 0.1.4](ReleaseNotes#changes-in-014)
+* [Changes in 0.1.3](ReleaseNotes#changes-in-013)
+* [Changes in 0.1.2](ReleaseNotes#changes-in-012)
+* [Changes in 0.1.1](ReleaseNotes#changes-in-011)
+* [Changes in 0.1.0](ReleaseNotes#changes-in-010)
+* [Changes in 0.0.9](ReleaseNotes#changes-in-009)
+* [Changes in 0.0.8](ReleaseNotes#changes-in-008)
+* [Changes in 0.0.7](ReleaseNotes#changes-in-007)
+* [Changes in 0.0.6](ReleaseNotes#changes-in-006)
+* [Changes in 0.0.5](ReleaseNotes#changes-in-005)
+* [Changes in 0.0.4](ReleaseNotes#changes-in-004)
+* [Changes in 0.0.3](ReleaseNotes#changes-in-003)
+* [Changes in 0.0.2](ReleaseNotes#changes-in-002)
+* [Changes in 0.0.1](ReleaseNotes#changes-in-001)
+
+## Changes in 0.3.2
+April 28, 2018
+
+- Added `setijm`, `getijm` operations for updating and querying individual matrix elements via the object API.
+- Added `examples/oapi` directory containing a code-based tutorial on using the object-based API in BLIS.
+- Track separate reference kernel `CFLAGS` for each sub-configuration.
+- Added support for blacklisting sub-configurations based on the assembler/binutils.
+- Added 64-bit support to BLAS test drivers.
+- Various bugfixes.
+
+## Changes in 0.3.1
+April 4, 2018
+
+- Enable use of new zen kernels in haswell sub-configuration.
+- Added row-storage optimizations to zen `dotxf` kernels (now also used by haswell).
+- Integrated an `f2c`ed version of the BLAS test drivers from netlib LAPACK into BLIS build system (e.g. `make testblas`, `make checkblas`). See the [Testsuite](https://github.com/flame/blis/wiki/Testsuite) wiki for more info. Also scheduled these BLAS drivers to execute regularly via Travis CI.
+- Added a new `make check` target that executes a fast version of the BLIS testsuite as well as the BLAS test drivers (primarily targeting package maintainers).
+- Allow individual operation overriding in the BLIS testsuite. (This makes it easy to quickly test one or two operations of interest.)
+- Added build system support for libmemkind. If present, `hbw_malloc()` is used as the default value for `BLIS_MALLOC_POOL` instead of `malloc()`. It can be disabled via `--disable-memkind`.
+- Tweaks and fixes to BLAS compatibility layer, courtesy of the new BLAS test drivers.
+- Output the active sub-configuration in testsuite output header.
+- Allow arbitrary nesting of "umbrella" configuration families in `config_registry`, allowing us to define x86_64 in terms of amd64 and intel64.
+- Added skx and knl to intel64 (and by proxy, x86_64) configuration families.
+- Implemented basic support for ARM hardware detection (via `/proc/cpuinfo`).
+- Various bugfixes.
+
+## Changes in 0.3.0
+February 23, 2018
+
+This version contains significant improvements from 0.2.2. Major changes include:
+- Real and complex domain (s,d,c,z) assembly-based gemm microkernels for AMD's Zen microarchitecture. (AMD, Field Van Zee)
+- Real domain (s,d) assembly-based `gemmtrsm_l` and `gemmtrsm_u` microkernels for Zen. (AMD, Field Van Zee)
+- Real domain (s,d) intrinsics-based `amaxv`, `axpyv`, `dotv`, `dotxv`, `scalv`, `axpyf`, and `dotxf` kernels for Zen. (AMD, Field Van Zee)
+- Generalized the configuration system to allow multi-configuration builds targeting configuration "families". A single sub-configuration is chosen at runtime via some heuristic, such as querying CPUID (e.g. runtime hardware detection). This change was extensive and required a reorganization of the build system, configuration semantics, reference kernels, a new naming scheme for native kernels, and a rewrite of the global kernel structure (gks). Please see the rewritten [Configuration wiki](ConfigurationHowTo) for details.
+- Implemented runtime hardware detection for x86_64 hardware.
+- Reimplemented configure-time hardware detection in terms of new runtime hardware detection code, which queries for CPU features rather than individual models.
+- Implemented library self-initialization by rewriting `bli_init()` in terms of `pthread_once()` and inserting invocations to `bli_init()` in key places throughout BLIS. The expectation is that through normal use of any BLIS API (BLAS, typed BLIS, or object-based BLIS), the user no longer needs to explicitly initialize the library, and that `bli_finalize()` should never be called by the user unless he is absolutely sure he no longer needs BLIS functionality. Related to this: global scalar constants (`BLIS_ONE`, `BLIS_ZERO`, etc.) are now statically initialized and thus ready to use immediately. Collectively, these changes provide improved thread safety at the application level.
+- Compile with and install a single monolithic (flattened) `blis.h` header to (1) speed up compilation and (2) reduce the number of build product files.
+- Added a sub-API for setting multithreading environment variables at runtime. For a few examples, please see the [Multithreading](Multithreading) wiki.
+- Reimplemented OpenMP/pthread barriers in terms of GNU atomic built-ins.
+- Other small changes and fixes.
+
+## Changes in 0.2.2
+May 2, 2017
+
+- Implemented the 1m method for inducing complex matrix multiplication. (Please see ACM TOMS publication ["Implementing high-performance complex matrix multiplication via the 1m method"](https://github.com/flame/blis#citations) for more details.)
+- Switched to simpler `trsm_r` implementation.
+- Relaxed constraints that `MC % NR = 0` and `NC % MR = 0`, as this was only needed for the more sophisticated `trsm_r` implementation.
+- Automatic loop thread assignment. (Devin Matthews) 
+- Updates to `.travis.yml` configuration file. (Devin Matthews) 
+- Updates to non-default haswell micro-kernels.
+- Match storage format of the temporary micro-tiles in macro-kernels to that of the micro-kernel storage preference for edge cases.
+- Added support for Intel's Knights Landing. (Devin Matthews)
+- Added more flexible options to specify multithreading via the configure script. (Devin Matthews) 
+- OS X compatibility fixes. (Devin Matthews) 
+- Other small changes and fixes. 
+
+Also, thanks to Elmar Peise, Krzysztof Drewniak, and Francisco Igual for their contributions in reporting/fixing certain bugs that were addressed in this version. 
+
+## Changes in 0.2.1
+October 5, 2016
+
+- Implemented distributed `thrinfo_t` structure management. (Ricardo Magana)
+- Redesigned BLIS's level-3 algorithmic control tree structure. (suggested by Tyler Smith)
+- Consolidated `gemm`, `herk`, and `trmm` blocked variants into one set of three bidirectional variants.
+- Integrated a new "memory broker" (`membrk_t`) abstraction in place of the previous memory allocator, which allows one set of pools per broker (or, in other words, per memory space). (Ricardo Magana)
+- Reorganized multithreading APIs, including more consistent namespace prefixes: `bli_thrinfo_*()`, `bli_thrcomm_*()`, etc.
+- Added `randnm`, `randnv` operations, which produce random powers of two in a narrow range, and integrated a corresponding option into the testsuite. (suggested by AMD)
+- Reclassified `amaxv` as a level-1v operation and kernel.
+- Added complex `gemm` micro-kernels for haswell, which have register allocations consistent with the existing 6x16 `sgemm` and 6x8 `dgemm` micro-kernels.
+- Adjusted existing micro-kernels to work properly when BLIS is configured to use 32-bit integers. (Devin Matthews)
+- Relaxed alignment constraints in sandybridge and haswell micro-kernels. (Devin Matthews)
+- Define CBLAS API with `f77_int` instead of `int`, which means the BLAS compatibility integer size is inherited by the CBLAS compatibility layer. (Devin Matthews)
+- Added an alignment switch to the testsuite to globally enable/disable starting address and leading dimension alignment. (suggested by Devin Matthews)
+- Various enhancements to configure script. (Devin Matthews)
+- Avoid compiling BLAS/CBLAS compatibility layer when it is disabled via configure. (suggested by Devin Matthews)
+- Disabled compilation of object-based blocked partitioning code for level-2 operations, as it was already functionally disabled.
+- Fixes and tweaks to POSIX thread support. (Tyler Smith, Jeff Hammond)
+- Other small changes and fixes.
+
+## Changes in 0.2.0
+April 11, 2016
+
+Most of BLIS 0.2.0's changes are contained within a single commit, 537a1f4 (aka "the big commit"). An executive summary of the most consequential of these changes follows:
+
+- BLIS has been retrofitted with a new data structure, known as a "context," affecting virtually every internal API for every computational operation, as well as many supporting, non-computational functions that must access information within the context.
+- In addition to appearing within these internal APIs, the context--specifically, a pointer to a `cntx_t`--is now present within all user-level datatype-aware APIs, e.g. `bli_zgemm()`, appearing as the last argument.
+- User-level object APIs, e.g. `bli_gemm()`, were unaffected and continue to be "context-free." However, these APIs were duplicated so that corresponding "context-aware" APIs now also exist, differentiated with an `_ex` suffix (for "expert").
+- Contexts are initialized very soon after a computational function is called (if one was not passed in by the caller) and are passed all the way down the function stack, even into the kernels, and thus allow the code at any level to query information about the runtime instantiation of the current operation being executed, such as kernel addresses, micro-kernel storage preferences, and cache/register blocksizes.
+- Contexts are thread-friendly. For example, consider the situation where a developer wishes two or more threads to execute simultaneously with somewhat different runtime parameters. Contexts also inherently promote thread-safety, such as in the event that the original source of the information stored in the context changes at run-time (see next two bullets).
+- BLIS now consolidates virtually all kernel/hardware information in a new "global kernel structure" (gks) API. This new API will allow the caller to initialize a context in a thread-safe manner according to the currently active kernel configuration. For now, the currently active configuration cannot be changed once the library is built. However, in the future, this API will be expanded to allow run-time management of kernels and related parameters.
+- The most obvious application of this new infrastructure is the run-time detection of hardware (and the implied selection of appropriate kernels). With contexts, kernels may even be "hot swapped" within the gks, and once execution begins on a level-3 operation, the memory allocator will be reinitialized on-the-fly, if necessary, to accommodate the new kernels' blocksizes. If a different application thread is executing with another (previously loaded) kernel, it will finish in a deterministic fashion because its kernel info was loaded into its context before computation began, and also because the blocks it checked out from the memory pools will be unaffected by the newer threads' reinitialization of the allocator.
+
+This version contains other changes that were committed prior to 537a1f4:
+
+- Inline assembly FMA4 micro-kernels for AMD bulldozer. (Etienne Sauvage)
+- A more feature-rich configure script and build system. Certain long-style options are now accepted, including convenient command-line switches for things like enabling debugging symbols. Important definitions were also consolidated into a new makefile fragment, `common.mk`, which can be included by the BLIS build system as well as quasi-independent build systems, such as the BLIS test suite. (Devin Matthews)
+- Updated and improved armv8 micro-kernels. (Francisco Igual)
+- Define `bli_clock()` in terms of `clock_gettime()` instead of `gettimeofday()`, which has been languishing on my to-do list for years, literally. (Devin Matthews)
+- Minor but extensive modifications to parts of the BLAS compatibility layer to avoid potential namespace conflicts with external user code when `blis.h` is included. (Devin Matthews)
+- Fixed a missing BLIS integer type definition (`BLIS_BLAS2BLIS_INT_TYPE_SIZE`) when CBLAS was enabled. Thanks to Tony Kelman for reporting this bug.
+- Merged `packm_blk_var2()` into `packm_blk_var1()`. The former's functionality is used by induced methods for complex level-3 operations. (Field Van Zee)
+- Subtle changes to treatment of row and column strides in `bli_obj.c` that pertain to somewhat unusual use cases, in an effort to support certain situations that arise in the context of tensor computations. (Devin Matthews)
+- Fixed an unimplemented `beta == 0` case in the penryn (formerly "dunnington") `sgemm` micro-kernel. (Field Van Zee)
+- Enhancements to the internal memory allocator in anticipation of the context retrofit. (Field Van Zee)
+- Implemented so-called "quadratic" matrix partitioning for thread-level parallelism, whereby threads compute thread index ranges to produce partitions of roughly equal area (and thus computation), subject to the (register) blocksize multiple, even when given a structured rectangular subpartition with an arbitrary diagonal offset. Thanks to Devangi Parikh for reporting bugs related to this feature. (Field Van Zee)
+- Enabled use of Travis CI for automatic testing of github commits and pull requests. (Xianyi Zhang)
+- New `README.md`, written in github markdown. (Field Van Zee)
+- Many other minor bug fixes.
+
+Special thanks go to Lee Killough for suggesting the use of a "context" data structure in discussions that transpired years ago, during the early planning stages of BLIS, and also for suggesting such a perfectly appropriate name.
+
+## Changes in 0.1.8
+July 29, 2015
+
+This release contains only two commits, but they are non-trivial: we now have configuration support for AMD Excavator (Carrizo) and micro-kernels for Intel Haswell/Broadwell.
+
+## Changes in 0.1.7
+June 19, 2015
+
+- Replaced the static memory allocator used to manage internal packing buffers with one that dynamically allocates memory, on-demand, and then recycles the allocated blocks in a software cache, or "pool". This significantly simplifies the memory-related configuration parameter set, and it completely eliminates the need to specify a maximum number of threads.
+- Implemented default values for all macro constants previously found in `bli_config.h`. The default values are now set in `frame/include/bli_config_macro_defs.h`. Any value #defined in `bli_config.h` will override these defaults.
+- Initial support for configure-time detection of hardware. By specifying the `auto` configuration at configure-time, the configure script chooses a configuration for you. If an optimized configuration does not exist, the reference implementation serves as a fallback.
+- Completely reorganized implementations for complex induced methods and added support for new algorithms.
+- Added optimized micro-kernels for AMD Piledriver family of hardware.
+- Several bugfixes to multithreaded execution.
+- Various other minor tweaks, code reorganizations, and bugfixes.
+
+## Changes in 0.1.6
+October 23, 2014
+
+- New complex domain AVX micro-kernels are now available and used by default by the sandybridge configuration.
+- Added new high-level 4m and 3m implementations presently known as "4mh" and "3mh".
+- Cleaned up 4m/3m front-end layering and added routines to enable, disable, and query which implementation will be called for a given level-3 operation. The test suite now prints this information in its pre-test summary. 4m (not 4mh) is still the default when complex micro-kernels are not present.
+- Consolidated control tree code and usage so that all level-3 multiplication operations use the same gemm_t structure, leaving only `trsm` to have a custom tree structure and associated code.
+- Re-implemented micro-panel alignment, which was removed in commit c2b2ab6 earlier this year.
+- Relaxed the long-standing constraint that `KC` be a multiple of `MR` and `NR` by allowing the developer to specify target values and then adjusting them up to the next multiple of `MR` or `NR`, as needed by the affected operations (`hemm`, `symm`, `trmm`, `trsm`).
+- Added a new "row preference" flag that the developer can use to signal to the framework that a micro-kernel prefers to output micro-tiles of C that are row-stored (rather than column-stored). Column storage preference is still the default.
+- Changed semantics of blocksize extensions to instead be "maximum" blocksizes (and thus emphasizing the "extended" values rather than the difference).
+- Various other minor tweaks, code reorganizations, and bugfixes.
+
+Thanks go to those whose contributions, feedback, and bug reports led to these improvements--in particular, Tony Kelman, Kevin Locke, Devin Matthews, Tyler Smith, and perhaps others whose feedback I've lost track of.
+
+## Changes in 0.1.5
+August 4, 2014
+
+- Added a CBLAS compatibility layer, which can be enabled at configure-time via `BLIS_ENABLE_CBLAS` in `bli_config.h`. Enabling the CBLAS layer implicitly forces the BLAS compatibility layer to also be enabled. Once enabled, the application may access CBLAS prototypes via `blis.h` or `cblas.h`.
+- Fixed a packing bug for cases when `MR` or `NR` (or both) are 1.
+- Redefined bit field macros in `bli_type_defs.h` with bitshift operator to ease future rearranging, expanding, or adding of info bits.
+
+## Changes in 0.1.4
+July 27, 2014
+
+- Added shared library support to build system.
+- Preliminary parallelization of `trsm` (Tyler Smith).
+- Added generic `_void()` micro-kernel wrappers so that users (or developers) can call the micro-kernel without knowing the implementation/developer-specific function names, which are specified at configure-time.
+- Added `bli_info_*()` API for querying general information about BLIS, including blocksizes.
+- Reimplemented initialization/finalization for thread safety.
+- Fixed a possible `Inf`/`NaN` issue in several level-3 operations when beta is zero.
+- Minor fixes to BLAS compatibility layer.
+- Added initial support for Emscripten (Marat Dukhan).
+
+## Changes in 0.1.3
+June 23, 2014
+
+This is a relatively minor release. The changes can be summarized as:
+- Added experimental support for PNaCL (Marat Dukhan).
+- Fixed aligned memory allocation on Windows (Tony Kelman).
+- Fixed missing version string in build products when downloading tarballs/zip files (Field Van Zee). Thanks to Victor Eijkhout for pointing out this bug.
+
+## Changes in 0.1.2
+June 2, 2014
+
+Tyler has been hard at work developing and refining extensions to BLIS that provide multithreading support (currently via OpenMP, though POSIX threads may be supported in the future). These extensions enable multithreading within all level-3 operations except for `trsm`. We are pleased to announce that these code changes are now part of BLIS.
+
+## Changes in 0.1.1
+February 25, 2014
+
+I. I am excited to announce that BLIS now provides high-performance complex domain support to ALL level-3 operations when ONLY the same-precision real domain equivalent gemm micro-kernel is present and optimized. In other words, BLIS's productivity lever just got twice as strong: optimize the `dgemm` micro-kernel, and you will get double-precision complex versions of all level-3 operations, for free. Same for `sgemm` micro-kernel and single-precision complex.
+
+II. We also now offer complex domain support based on the 3m method, but this support is ONLY accessible via separate interfaces. This separation is a safety feature, since the 3m method's numerical properties are inherently less robust. Furthermore, we think the 3m method, as implemented, is somewhat performance-limited on systems with L1 caches that have less than 8-way associativity.
+
+We plan on writing a paper on (I) and (II), so if you are curious how exactly we accomplish this, please be patient and wait for the paper. :)
+
+III. The second, user-oriented change facilitates a much more developer-friendly configuration system. This "change" actually represents a family of smaller changes. What follows is a list of those changes taken from the git log:
+- We now have standard names for reference kernels (levels-1v, -1f and 3) in the form of macro constants. Examples:
+      `BLIS_SAXPYV_KERNEL_REF`
+      `BLIS_DDOTXF_KERNEL_REF`
+      `BLIS_ZGEMM_UKERNEL_REF`
+- Developers no longer have to name all datatype instances of a kernel with a common base name; [sdcz] datatype flavors of each kernel or micro-kernel (level-1v, -1f, or 3) may now be named independently. This means you can now, if you wish, encode the datatype-specific register blocksizes in the name of the micro-kernel functions.
+- Any datatype instances of any kernel (1v, 1f, or 3) that is left undefined in `bli_kernel.h` will default to the corresponding reference implementation. For example, if `BLIS_DGEMM_UKERNEL` is left undefined, it will be defined to be `BLIS_DGEMM_UKERNEL_REF`.
+- Developers no longer need to name level-1v/-1f kernels with multiple datatype chars to match the number of types the kernel WOULD take in a mixed type environment, as in `bli_dddaxpyv_opt()`. Now, one char is sufficient, as in `bli_daxpyv_opt()`.
+- There is no longer a need to define an obj_t wrapper to go along with your level-1v/-1f kernels. The framework now provides a `_kernel()` function, as in `bli_axpyv_kernel()`, which serves as the `obj_t` wrapper for whatever kernels are specified (or defaulted to) via `bli_kernel.h`.
+- Developers no longer need to prototype their kernels, and thus no longer need to include any prototyping headers from within `bli_kernel.h`. The framework now generates kernel prototypes, with the proper type signature, based on the kernel names defined (or defaulted to) via `bli_kernel.h`.
+- If the complex datatype x (of [cz]) implementation of the gemm micro-kernel is left undefined by `bli_kernel.h`, but its same-precision real domain equivalent IS defined, BLIS will enable the automatic complex domain feature described above in (I) for the datatype x implementations of all level-3 operations, using only the corresponding real domain gemm micro-kernel. If the complex gemm micro-kernel for x IS defined, then all complex level-3 operations will be defined in terms of that micro-kernel.
+
+The net effect of (III) is that your `bli_kernel.h` files can be MUCH simpler and less cluttered. (Extreme example: the reference configuration's `bli_kernel.h` is now completely empty!) I have updated all configurations and kernels that are currently part of BLIS by stripping out unnecessary/outdated definitions and migrating existing definitions to their new names. (If you ever need to reference the complete list of options and macros, please refer to the `bli_kernel.h` inside the template configuration.) Please set aside some time to test and, if necessary, tweak the configurations which you originally developed and submitted. I may have broken some of them. If so, please accept my apologies and contact me for assistance. I will work with you to get them functional again.
+
+The changes mentioned in (I), (II), and (III), along with all other changes since 0.1.0, are included in BLIS 0.1.1 (fde5f1fd).
+
+I know these changes may be a little disruptive to some, but I think that most developers will find the new complex functionality very useful, and the new configuration system much easier to use.
+
+## Changes in 0.1.0
+November 9, 2013
+
+- Added `sgemm` micro-kernel for dunnington.
+- Added `dgemm` micro-kernels and configurations for sandybridge, bgq, mic, power7, piledriver, loongson3a, which were used to gather performance data in our second ACM TOMS paper. Many thanks to Francisco Igual, Tyler Smith, Mike Kistler, and Xianyi Zhang for developing, testing, and contributing these kernels.
+- Migrated to signed integer for `dim_t`, `inc_t` (to facilitate calling BLIS from Fortran).
+- Added "template" configuration and kernel set for developers to use as a starting point when developing new kernels from scratch.
+- Improvements to test suite, including section overrides and standalone level-1f/level-3 kernel modules.
+- Improvements to Windows build system (though it may still not yet be functional out-of-the-box). Thanks to Martin Schatz for his help here.
+- Removed support for element "duplication" in level-3 macro-kernels.
+- Several bug fixes to BLAS compatibility layer. Thanks to Vladimir Sukharev for his numerous bug reports wrt the LAPACK test suite.
+- Various other minor bugfixes.
+
+## Changes in 0.0.9
+July 18, 2013
+
+- A few algorithmic optimizations and bug fixes to `trmm` and `trsm`.
+- Parameter checking in the compatibility layer that mimics netlib BLAS.
+- Default use of `stdint.h` types (`int64_t`, `uint64_t` by default).
+- Optional (and very much untested) C99 built-in complex type/arithmetic support.
+
+Note that `bli_config.h` has changed since 0.0.8. Added configuration macros are:
+```
+  #define BLIS_ENABLE_C99_COMPLEX
+  #define BLIS_ENABLE_BLAS2BLIS_INT64
+  #define PASTEF770(name) // ...
+```
+The first macro enables C99 built-in complex types. The second causes a Fortran integer to be defined as an `int64_t` (rather than `int32_t`). The third is a macro to name-mangle a full routine name for Fortran (i.e., add an underscore) and should be obtained from `config/reference/bli_config.h`.
+
+## Changes in 0.0.8
+June 12, 2013
+
+This version includes several kernel optimizations and bug fixes.
+
+While neither `bli_config.h` nor `bli_kernel.h` has changed formats since 0.0.7, `make_defs.mk` **has** changed, so please update your copy of this file when you git-pull. Specifically, we now define a new `CFLAGS_KERNELS` variable that allows one to use different compiler flags when compiling kernels. It works like this: At compile time, make will use `CFLAGS_KERNELS` to compile any source code that resides in any directory that begins with the name `kernels`. My recommendation is to simply apply this naming convention to the symbolic link to your kernels directory that resides in your configuration directory. Thanks to Tyler for suggesting this change.
+
+## Changes in 0.0.7
+April 30, 2013
+
+This version incorporates many small fixes and feature enhancements made during our SC13 collaboration. 
+
+## Changes in 0.0.6
+April 13, 2013
+
+Several changes regarding memory alignment were made since 0.0.5, including modifications to `bli_config.h`. Also, this update fixes a few bugs.
+
+## Changes in 0.0.5
+March 24, 2013
+
+The most obvious change in this version is the migration to the `bli` function (and source code filename) prefix, from the old `bl2` prefix, as well as a rename of the main BLIS header (`blis2.h` -> `blis.h`). The test suite seems to indicate that the change was successful.
+
+A few other much more minor changes were made, one pertaining to a renamed constant in the `_config.h` file.
+
+## Changes in 0.0.4
+March 15, 2013
+
+The changes included in 0.0.4 mostly relate to the contiguous (static) memory allocator. The previous implementation was intended as a temporary solution that would work for benchmarking purposes, until enough other priorities had been tended to that I could go back and do it right.
+
+I began with the assumption that the benefit of packing matrices into contiguous memory is non-negligible and worth the effort. Furthermore, we assume that:
+- the only portable way to acquire contiguous memory is to reserve a region of static memory and manage it ourselves;
+- the cache blocksizes used for one level-3 operation will be the same as those used for another level-3 operation, since all of them boil down to some form of matrix-matrix multiplication;
+- only three types of contiguous memory will ever be needed (for level-3 operations): a block of matrix A, a panel of matrix B, or a panel of matrix C--and the last case is not commonly used;
+- when a block or panel is to be acquired from the allocator, the caller knows which of the three types of memory is needed.
+
+Given these assumptions, I was able to come up with an implementation that is simple, easy to understand, and thread-safe (provided you add OpenMP directives to protect the critical sections, which are clearly marked with comments). It can also both allocate and release in O(1) time. And of course, page-alignment is taken care of behind the scenes. So while it is not a generalized solution by any means, I think it will work very well for our purposes.
+
+Also, note that based on the level of the overall matrix multiplication algorithm at which you parallelize, the minimum number of blocks/panels of each type of contiguous memory will vary. For example, if you want all of your threads to work on different iterations of a single rank-k update (via block-panel multiply), the threads share the packed panel of B, but each one needs memory to hold its own packed block of A. Thus, the memory allocator needs to be initialized so that it contains enough memory for at least one panel of B and at least t blocks of A, where t is the number of threads. All of this can be adjusted at configure-time in `bl2_config.h`.
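+
+The allocator design described above can be sketched as follows. This is only an illustrative model written in Python, not the actual BLIS code (which is C and uses a static memory region): the class name `BlockPool`, the pool labels, and the buffer sizes are all hypothetical. It shows one fixed pool per memory type, constant-time acquire and release, a lock around the critical sections, and the sizing rule of one panel of B plus t blocks of A.
+
```python
import threading


class BlockPool:
    """Fixed pool of equally sized buffers for one type of contiguous memory.

    Hypothetical sketch: the real allocator carves blocks out of a static
    region in C; here each block is simply a preallocated bytearray.
    """

    def __init__(self, num_blocks, block_size):
        # All memory is reserved up front; nothing is allocated later.
        self._free = [bytearray(block_size) for _ in range(num_blocks)]
        self._lock = threading.Lock()  # protects the critical sections

    def acquire(self):
        with self._lock:
            if not self._free:
                raise MemoryError("pool exhausted")
            return self._free.pop()  # constant-time: take the last free block

    def release(self, block):
        with self._lock:
            self._free.append(block)  # constant-time: push the block back

    def available(self):
        with self._lock:
            return len(self._free)


# Sizing for t threads that share one packed panel of B but each need
# their own packed block of A (sizes below are made up for illustration).
t = 4
pools = {
    "A_block": BlockPool(num_blocks=t, block_size=4096),
    "B_panel": BlockPool(num_blocks=1, block_size=8192),
    "C_panel": BlockPool(num_blocks=1, block_size=8192),
}

block = pools["A_block"].acquire()
print(pools["A_block"].available())  # 3
pools["A_block"].release(block)
print(pools["A_block"].available())  # 4
```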
+
+## Changes in 0.0.3
+February 22, 2013
+
+The biggest change in this version is that the BLAS-to-BLIS compatibility layer is now available. Virtually every BLAS interface is included, even those corresponding to functionality that BLIS does not implement (such as banded and packed level-2 operations). If the application code attempts to call one of these unimplemented routines, the code aborts with a generic not-yet-implemented error message.
+
+The compatibility layer is enabled via a configuration option in `bl2_config.h`. For now, it is enabled by default (provided you have an up-to-date copy of `bl2_config.h`).
+
+## Changes in 0.0.2
+February 11, 2013
+
+Most notably, this version contains the new test suite I've been working on for the last month. 
+
+What is the test suite? It is a highly configurable test driver that allows one to test an arbitrary set of BLIS operations, with an arbitrary set of parameter combinations, and matrix/vector storage formats, as well as whichever datatypes you are interested in. (For now, only homogeneous datatyping is supported, which is what most people want.) You can also specify an arbitrary problem size range with arbitrary increments, and arbitrary ratios between dimensions (or anchor a dimension to a single value), and you can output directly to files which store the output in matlab syntax, which makes it easy to generate performance graphs.
+
+BLIS developers: note that 0.0.2 makes small changes to the configuration files. This new version also contains many bug fixes. (Most of these fixes address bugs which were found using the test suite.)
+
+## Changes in 0.0.1
+December 10, 2012
+
+- Added auto-detection of string version (via `git`).
+- Wrote basic INSTALL, CHANGELOG, AUTHORS, and CREDITS files.
+- Updates to standalone `test` directory `Makefile`.
+- Added initial build system
+- Various code reorganizations.
+
--- a/docs/Testsuite.md
+++ b/docs/Testsuite.md
@@ -0,0 +1,330 @@
+# Contents
+
+* **[Contents](Testsuite#contents)**
+* **[BLIS testsuite](Testsuite#blis-testsuite)**
+  * **[Introduction](Testsuite#introduction)**
+  * **[Compiling](Testsuite#compiling)**
+  * **[Setting test parameters](Testsuite#setting-test-parameters)**
+    * [`input.general`](Testsuite#inputgeneral)
+    * [`input.operations`](Testsuite#inputoperations)
+  * **[Running tests](Testsuite#running-tests)**
+  * **[Interpreting the results](Testsuite#interpreting-the-results)**
+* **[BLAS test drivers](Testsuite#blas-test-drivers)**
+
+# BLIS testsuite
+
+## Introduction
+
+This wiki explains how to use the test suite included with the BLIS framework.
+
+The test suite exists in the `testsuite` directory within the top-level source distribution:
+```
+$ ls
+CHANGELOG  Makefile      common.mk        configure  mpi_test     testsuite
+CREDITS    README.md     config           frame      obj          version
+INSTALL    bli_config.h  config.mk        kernels    ref_kernels  windows
+LICENSE    build         config_registry  lib        test
+```
+There, you will find a `Makefile`, two input files, and two directories:
+```
+$ cd testsuite
+$ ls
+Makefile  input.general  input.operations  obj  src
+```
+As you would expect, the test suite's source code lives in `src` and the object files, upon being built, are placed in `obj`. The two `input.*` files control how the test suite runs, while the `Makefile` controls how the test suite executable is compiled and linked.
+
+## Compiling
+
+Before running the test suite, you must first configure, compile, and install a BLIS library. For directions on how to build and install a BLIS library, please see the [BLIS build system](BuildSystem) wiki.
+
+Once BLIS is installed, you are ready to compile the test suite.
+
+**Note:** The `Makefile` includes the same `make_defs.mk` file that was used by the top-level `Makefile` when building BLIS. This is meant to serve as a convenience so you don't have to specify things like the C compiler or compiler flags a second time. If you do wish to tweak these parameters, you may override the values included from `make_defs.mk` by editing the local `Makefile` within the `testsuite` directory. Scroll down to the section labeled "Optional overrides" and uncomment/edit values as needed.
+
+Unless special circumstances apply in your situation (such as the optional overrides mentioned above), the only value you may have to modify in `testsuite/Makefile` (if any) is the linker library flags variable, `LDFLAGS`. You may need to modify it to include the path to your standard C libraries, such as `libm` (oftentimes communicated to the linker via `-lm`):
+```
+LDFLAGS   := -L/path/to/system/libs -lm
+```
+
+When you are ready to compile, simply run `make`. Running `make` will result in output similar to:
+```
+$ make
+Compiling src/test_addm.c
+Compiling src/test_addv.c
+Compiling src/test_amaxv.c
+Compiling src/test_axpbyv.c
+Compiling src/test_axpy2v.c
+Compiling src/test_axpyf.c
+Compiling src/test_axpym.c
+Compiling src/test_axpyv.c
+Compiling src/test_copym.c
+Compiling src/test_copyv.c
+```
+As with compiling a BLIS library, if you are working in a multicore environment, you may use the `-j<n>` option to compile source code using `<n>` parallel jobs:
+```
+$ make -j4
+```
+After `make` is complete, an executable named `test_libblis.x` is created:
+```
+$ ls
+Makefile  input.general  input.operations  obj  src  test_libblis.x
+```
+
+## Setting test parameters
+
+The BLIS test suite reads two input files, `input.general` and `input.operations`, to determine which tests to run and how those tests are run. Each file contains comments, and thus you may find them intuitive to use without formal instructions. However, for completeness and as a reference-of-last-resort, we describe each file and its contents in detail.
+
+### `input.general`
+
+The `input.general` input file, as its name suggests, contains parameters that control the general behavior of the test suite. These parameters (more or less) apply to all operations that get tested. Below is a representative example of the default contents of `input.general`.
+```
+# ----------------------------------------------------------------------
+#
+#  input.general   
+#  BLIS test suite
+#
+#  This file contains input values that control how BLIS operations are
+#  tested. Comments explain the purpose of each parameter as well as
+#  accepted values.
+#
+
+1       # Number of repeats per experiment (best result is reported)
+c       # Matrix storage scheme(s) to test:
+        #   'c' = col-major storage; 'g' = general stride storage;
+        #   'r' = row-major storage
+c       # Vector storage scheme(s) to test:
+        #   'c' = colvec / unit stride; 'j' = colvec / non-unit stride;
+        #   'r' = rowvec / unit stride; 'i' = rowvec / non-unit stride
+0       # Test all combinations of storage schemes?
+1       # Perform all tests with alignment?
+        #   '0' = do NOT align buffers/ldims; '1' = align buffers/ldims
+0       # Randomize vectors and matrices using:
+        #   '0' = real values on [-1,1];
+        #   '1' = powers of 2 in narrow precision range
+32      # General stride spacing (for cases when testing general stride)
+sdcz    # Datatype(s) to test:
+        #   's' = single real; 'c' = single complex;
+        #   'd' = double real; 'z' = double complex
+100     # Problem size: first to test
+300     # Problem size: maximum to test
+100     # Problem size: increment between experiments
+        # Complex level-3 implementations to test
+1       #   3mh  ('1' = enable; '0' = disable)
+1       #   3m1  ('1' = enable; '0' = disable)
+1       #   4mh  ('1' = enable; '0' = disable)
+1       #   4m1b ('1' = enable; '0' = disable)
+1       #   4m1a ('1' = enable; '0' = disable)
+1       #   1m   ('1' = enable; '0' = disable)
+1       #   native ('1' = enable; '0' = disable)
+1       # Error-checking level:
+        #   '0' = disable error checking; '1' = full error checking
+i       # Reaction to test failure:
+        #   'i' = ignore; 's' = sleep() and continue; 'a' = abort
+0       # Output results in matlab/octave format? ('1' = yes; '0' = no)
+0       # Output results to stdout AND files? ('1' = yes; '0' = no)
+```
+The remainder of this section explains each parameter switch in detail.
+
+_**Number of repeats.**_ This is the number of times an operation is run for each result that is reported. The result with the best performance is reported.
+
+_**Matrix storage scheme.**_ This string encodes all of the matrix storage schemes that are tested (for operations that contain matrix operands). There are three valid values: `'c'` for column storage, `'r'` for row storage, and `'g'` for general stride storage. You may choose one storage scheme, or combine more than one. The order of the characters determines the order in which the corresponding storage schemes are tested.
+
+_**Vector storage scheme.**_ Similar to the matrix storage scheme string, this string determines which vector storage schemes are tested (for operations that contain vector operands). There are four valid values: `'c'` for column vectors with unit stride, `'r'` for row vectors with unit stride, `'j'` for column vectors with non-unit stride, and `'i'` for row vectors with non-unit stride. You may choose any one storage scheme, or combine more than one. The ordering behaves similarly to that of the matrix storage scheme string. Using `cj` will test both unit and non-unit vector strides, and since row and column vectors are logically equivalent, this should provide complete test coverage for operations with vector operands. 
+
+_**Test all combinations of storage schemes?**_ Enabling this option causes all combinations of storage schemes to be tested. For example, if the option is disabled, a matrix storage scheme string of `cr` would cause the `gemm` test module to test execution where all matrix operands are column-stored, and then where all matrix operands are row-stored. Enabling this option with the same matrix storage string (`cr`) would cause the test suite to test `gemm` under all eight scenarios where the three `gemm` matrix operands are either column-stored or row-stored.
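+
+To make the difference between the two settings concrete, here is a small Python sketch that enumerates the storage scenarios for an operation with three matrix operands. The function name `storage_combos` is hypothetical, and the order in which the test suite actually visits the combinations is an assumption.
+
```python
from itertools import product


def storage_combos(schemes, num_operands, all_combinations):
    """Return one storage string per scenario, e.g. 'ccc' or 'crc'.

    schemes          -- the matrix storage string, such as 'cr'
    num_operands     -- number of matrix operands (3 for gemm)
    all_combinations -- value of the 'test all combinations' switch
    """
    if all_combinations:
        # Each operand independently takes every listed scheme.
        return ["".join(combo) for combo in product(schemes, repeat=num_operands)]
    # Otherwise all operands share one scheme; one scenario per scheme.
    return [scheme * num_operands for scheme in schemes]


print(storage_combos("cr", 3, all_combinations=False))
# ['ccc', 'rrr']
print(storage_combos("cr", 3, all_combinations=True))
# ['ccc', 'ccr', 'crc', 'crr', 'rcc', 'rcr', 'rrc', 'rrr']
```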
+
+_**Perform all tests with alignment?**_ Disabling this option causes the leading dimension (row or column stride) of test matrices to **not** be aligned according to `BLIS_HEAP_STRIDE_ALIGN_SIZE`, which defaults to `BLIS_SIMD_ALIGN_SIZE`, which defaults to `BLIS_SIMD_SIZE`, which defaults to 64 (bytes). (If any of these values is set to a non-default value, it would be in `bli_family_<arch>.h` where `<arch>` is the configuration family.) Sometimes it's useful to disable leading dimension alignment in order to test certain aspects of BLIS that need to handle computing with unaligned user data, such as level-1v and level-1f kernels.
+
+_**Randomize vectors and matrices.**_ The default randomization method uses real values on the interval [-1,1]. However, we offer an alternate randomization using powers of two in a narrow precision range, which is more likely to result in test residuals exactly equal to zero. This method is somewhat niche/experimental and most people should use random values on the [-1,1] interval.
+
+_**General stride spacing.**_ This value determines the simulated "inner" stride when testing general stride storage. For simplicity, the test suite always generates and tests general stride storage that is ["column-tilted"](FAQ#What_does_it_mean_when_a_matrix_with_general_stride_is_column-ti). If general stride storage is not being tested, then this value is ignored.
+
+_**Datatype(s) to test.**_ This string determines which floating-point datatypes are tested. There are four valid values: `'s'` for single-precision real, `'d'` for double-precision real, `'c'` for single-precision complex, and `'z'` for double-precision complex. You may choose one datatype, or combine more than one. The order of the datatype characters determines the order in which they are tested.
+
+_**Problem size.**_ These values determine the first problem size to test, the maximum problem size to test, and the increment between problem sizes. Note that the maximum problem size only bounds the range of problem sizes; it is not guaranteed to be tested. Example: If the initial problem size is 128, the maximum is 1000, and the increment is 64, then the last problem size to be tested will be 960.
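+
+The sweep of problem sizes can be reproduced with a short Python sketch (the function name `problem_sizes` is hypothetical). It confirms the example above, and shows that the default settings of 100, 300, and 100 do reach the maximum.
+
```python
def problem_sizes(first, maximum, increment):
    """List the problem sizes tested: first, first + increment, ...

    The maximum only bounds the range; it is included only when the
    increments happen to land on it exactly.
    """
    return list(range(first, maximum + 1, increment))


print(problem_sizes(128, 1000, 64)[-1])  # 960
print(problem_sizes(100, 300, 100))      # [100, 200, 300]
```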
+
+_**Complex level-3 implementations to test.**_ With the exception of the switch marked `native`, these switches control whether experimental complex domain implementations are tested (when applicable). These implementations employ induced methods for complex matrix multiplication and apply to some (though not all) of the level-3 operations. If you don't know what these are, you can ignore them. The `native` switch corresponds to native execution of complex domain level-3 operations, which we test by default. We also test the `1m` method, since it is the induced method of choice when complex micro-kernels are not available. Note that all of these induced method tests (including `native`) are automatically disabled if the `c` and `z` datatypes are disabled.
+
+_**Error-checking level.**_ BLIS supports various "levels" of error checking prior to executing most operations. For now, only two error-checking levels are implemented: fully disabled (`'0'`) and fully enabled (`'1'`). Disabling error-checking may improve performance on some systems for small problem sizes, but generally speaking the cost is negligible.
+
+_**Reaction to test failure.**_ If the test suite executes a test that results in a numerical result that is considered a "failure", this character determines how the test suite should proceed. There are three valid values: `'i'` will cause the test suite to ignore the failure and immediately continue with all remaining tests, `'s'` will cause the test suite to sleep for some short period of time before continuing, and `'a'` will cause the test suite to abort all remaining tests. The user must specify only **one** option via its character encoding.
+
+_**Output results in Matlab/Octave format?**_ When this option is disabled, the test suite outputs results in a simple human-readable format of one experiment per line. When this option is enabled, the test suite similarly outputs results for one experiment per line, but in a format that may be read into Matlab or Octave. This is useful if the user intends to use the results of the test suite to plot performance data using one of these tools.
+
+_**Output results to `stdout` AND files?**_ When this option is disabled, the test suite outputs only to standard output. When enabled, the test suite also writes its output to files, one for each operation tested. As with the Matlab/Octave option above, this option may be useful to some users who wish to gather and retain performance data for later use.
+
+
+### `input.operations`
+
+The `input.operations` input file determines **which** operations are tested, which parameter combinations are tested, and the relative sizes of the operation's dimensions. The file itself contains comments that explain various sections. However, we reproduce this information here for your convenience.
+
+_**Enabling/disabling entire sections.**_ The values in the "Section overrides" section allow you to disable all operations in a given "level". Enabling a level here by itself does not enable every operation in that level; it simply means that the individual switches for each operation (in that level) determine whether or not the tests are executed. Use 1 to enable a section, or 0 to disable.
+
+_**Enabling/disabling individual operation tests.**_ Given that an operation's section override switch is set to 1 (enabled), whether or not that operation will get tested is determined by its local switch. For example, if the level-1v section override is set to 1, and there is a 1 on the line marked `addv`, then the `addv` operation will be tested. NOTE: You may ignore the lines marked "test sequential front-end." These lines are for future use, to distinguish tests of the sequential implementation from tests of the multithreaded implementation. For now, BLIS does not contain separate APIs for multithreaded execution, even though multithreading is supported. So, these should be left set to 1.
+
+_**Enabling only select operations.**_ If you would like to enable just a few operations (or even just one) without adjusting any section overrides (or individual operation switches), change the desired operation switch(es) to 2. This will cause any operation that is not set to 2 to be disabled, regardless of section override values. For example, setting the `axpyv` and `gemv` operation switches to 2 will cause the test suite to test ONLY `axpyv` and `gemv`, even if all other sections and operations are set to 1. NOTE: As long as there is at least one operation switch set to 2, no other operations will be tested. When you are done testing your select operations, you should revert the operation switch(es) back to 1.
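+
+The interaction between section overrides, per-operation switches, and the special value 2 can be summarized in a few lines of Python. This is a sketch of the rules as described above, not the test suite's own code; the function name, the dictionary layout, and the level labels are hypothetical, and it assumes that an operation set to 2 runs even when its section override is 0.
+
```python
def operations_to_test(section_overrides, operation_switches):
    """Decide which operations run.

    section_overrides  -- dict mapping a level name to 0 or 1
    operation_switches -- dict mapping an operation name to a
                          (level name, switch value) pair
    """
    # Any switch set to 2 takes over: only those operations are tested.
    selected = [op for op, (_, sw) in operation_switches.items() if sw == 2]
    if selected:
        return selected
    # Otherwise an operation runs when both its section and its own
    # switch are enabled.
    return [
        op
        for op, (level, sw) in operation_switches.items()
        if sw == 1 and section_overrides.get(level, 0) == 1
    ]


overrides = {"level-1v": 1, "level-2": 1}
switches = {
    "addv": ("level-1v", 1),
    "axpyv": ("level-1v", 2),
    "gemv": ("level-2", 2),
    "ger": ("level-2", 1),
}
print(operations_to_test(overrides, switches))  # ['axpyv', 'gemv']
```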
+
+_**Changing the problem size/shapes tested.**_ The problem sizes tested by an operation are determined by the dimension specifiers on the line marked `dimensions: <spec_labels>`. If, for example, `<spec_labels>` contains two dimension labels (e.g. `m n`), then the line should begin with two dimension specifiers. Dimension specifiers of `-1` cause the corresponding dimension to be bound to the problem size, which is determined by values set in `input.general`. Positive values cause the corresponding dimension to be fixed to that value and held constant. Examples of dimension specifiers (where the dimensions are _m_ and _n_):
+  * `-1 -1 `   ...Dimensions m and n grow with problem size (resulting in square matrices).
+  * `-1 150 `   ...Dimension m grows with problem size and n is fixed at 150.
+  * `-1 -2 `   ...Dimension m grows with problem size and n grows proportional to half the problem size.
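+
+The dimension specifiers above can be resolved for a given problem size as in the following Python sketch. The function name `resolve_dims` is hypothetical. Only `-1` and `-2` appear in the examples; treating a general negative value `-k` as "problem size divided by k", and rounding down, are assumptions made for illustration.
+
```python
def resolve_dims(specifiers, problem_size):
    """Map dimension specifiers to concrete dimensions for one problem size."""
    dims = []
    for spec in specifiers:
        if spec > 0:
            # Positive values fix the dimension to that value.
            dims.append(spec)
        elif spec < 0:
            # -1 binds to the problem size, -2 to half of it (assumed: -k
            # divides the problem size by k, rounding down).
            dims.append(problem_size // (-spec))
        else:
            raise ValueError("a dimension specifier of 0 is not meaningful here")
    return dims


print(resolve_dims([-1, -1], 200))       # [200, 200]
print(resolve_dims([-1, 150], 200))      # [200, 150]
print(resolve_dims([-1, -1, -2], 100))   # [100, 100, 50]
```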
+
+_**Changing parameter combinations tested.**_ The parameter combinations tested by an operation are determined by the parameter specifier characters on the line marked `parameters: <param_labels>`. If, for example, `<param_labels>` contains two parameter labels (e.g. `transa conjx`), then the line should contain two parameter specifier characters. The `'?'` specifier character serves as a wildcard--it causes all possible values of that parameter to be tested. A character such as `'n'` or `'t'` causes only that value to be tested. Examples of parameter specifiers (where the parameters are `transa` and `conjx`):
+  * `??`   ...All combinations of the `transa` and `conjx` parameters are tested: `nn, nc, tn, tc, cn, cc, hn, hc`.
+  * `?n`   ...`conjx` is fixed to "no conjugate" but `transa` is allowed to vary: `nn, tn, cn, hn`.
+  * `hc`   ...Only the case where `transa` is "Hermitian-transpose" and `conjx` is "conjugate" is tested.
+
+Here is a full list of the parameter types used by the various BLIS operations along with their possible character encodings:
+  * `side`: `l` = left,  `r` = right
+  * `uplo`: `l` = lower-stored, `u` = upper-stored
+  * `trans`: `n` = no transpose, `t` = transpose, `c` = conjugate, `h` = Hermitian-transpose (conjugate-transpose)
+  * `conj`: `n` = no conjugate, `c` = conjugate
+  * `diag`: `n` = non-unit diagonal, `u` = unit diagonal
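+
+As a rough illustration of how a parameter specifier string expands, the Python sketch below uses the character encodings listed above and treats `'?'` as a wildcard. The names `PARAM_VALUES` and `expand_params` are hypothetical, and the enumeration order is an assumption; the test suite may visit the combinations in a different order.
+
```python
from itertools import product

# Character encodings for each parameter type, as listed above.
PARAM_VALUES = {
    "side": "lr",
    "uplo": "lu",
    "trans": "ntch",
    "conj": "nc",
    "diag": "nu",
}


def expand_params(param_types, spec):
    """Expand a specifier such as '?n' into every concrete combination."""
    if len(param_types) != len(spec):
        raise ValueError("need one specifier character per parameter")
    choices = []
    for ptype, ch in zip(param_types, spec):
        valid = PARAM_VALUES[ptype]
        if ch == "?":
            choices.append(valid)  # wildcard: every value of this type
        elif ch in valid:
            choices.append(ch)     # fixed to a single value
        else:
            raise ValueError(f"{ch!r} is not a valid {ptype} value")
    return ["".join(combo) for combo in product(*choices)]


# transa is a 'trans' parameter and conjx is a 'conj' parameter.
print(expand_params(["trans", "conj"], "??"))
# ['nn', 'nc', 'tn', 'tc', 'cn', 'cc', 'hn', 'hc']
print(expand_params(["trans", "conj"], "?n"))
# ['nn', 'tn', 'cn', 'hn']
print(expand_params(["trans", "conj"], "hc"))
# ['hc']
```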
+
+
+## Running tests
+
+Running the test suite is easy. Once `input.general` and `input.operations` have been tailored to your liking, simply run the test suite executable:
+```
+$ ./test_libblis.x
+```
+For sanity-checking purposes, the test suite begins by echoing the parameters it found in `input.general` to standard output. This is useful when troubleshooting the test suite if/when it exhibits strange behavior (such as seemingly skipped tests).
+
+## Interpreting the results
+
+The output of the test suite is more-or-less intuitive. Here is a snippet of output for the `gemm` test module when problem sizes of 100 to 300 in increments of 100 are tested.
+```
+% --- gemm ---
+%
+% test gemm seq front-end?    1
+% gemm m n k                  -1 -1 -2
+% gemm operand params         ??
+%
+
+% blis_<dt><oper>_<params>_<storage>           m     n     k   gflops  resid       result
+blis_sgemm_nn_ccc                            100   100    50   1.447   1.14e-07    PASS
+blis_sgemm_nn_ccc                            200   200   100   1.537   1.18e-07    PASS
+blis_sgemm_nn_ccc                            300   300   150   1.532   1.38e-07    PASS
+blis_sgemm_nc_ccc                            100   100    50   1.449   7.79e-08    PASS
+blis_sgemm_nc_ccc                            200   200   100   1.540   1.23e-07    PASS
+blis_sgemm_nc_ccc                            300   300   150   1.537   1.54e-07    PASS
+blis_sgemm_nt_ccc                            100   100    50   1.479   7.40e-08    PASS
+blis_sgemm_nt_ccc                            200   200   100   1.549   1.33e-07    PASS
+blis_sgemm_nt_ccc                            300   300   150   1.534   1.44e-07    PASS
+blis_sgemm_nh_ccc                            100   100    50   1.477   9.23e-08    PASS
+blis_sgemm_nh_ccc                            200   200   100   1.547   1.13e-07    PASS
+blis_sgemm_nh_ccc                            300   300   150   1.535   1.51e-07    PASS
+blis_sgemm_cn_ccc                            100   100    50   1.477   9.62e-08    PASS
+blis_sgemm_cn_ccc                            200   200   100   1.548   1.36e-07    PASS
+blis_sgemm_cn_ccc                            300   300   150   1.539   1.51e-07    PASS
+blis_sgemm_cc_ccc                            100   100    50   1.481   8.66e-08    PASS
+blis_sgemm_cc_ccc                            200   200   100   1.549   1.41e-07    PASS
+blis_sgemm_cc_ccc                            300   300   150   1.539   1.63e-07    PASS
+blis_sgemm_ct_ccc                            100   100    50   1.484   7.09e-08    PASS
+blis_sgemm_ct_ccc                            200   200   100   1.549   1.08e-07    PASS
+blis_sgemm_ct_ccc                            300   300   150   1.539   1.33e-07    PASS
+blis_sgemm_ch_ccc                            100   100    50   1.471   8.06e-08    PASS
+blis_sgemm_ch_ccc                            200   200   100   1.546   1.24e-07    PASS
+blis_sgemm_ch_ccc                            300   300   150   1.539   1.66e-07    PASS
+```
+
+Before each operation is tested, the test suite echoes information it obtained from the `input.operations` file, such as the dimension specifier string (in this case, `"-1 -1 -2"`) and parameter specifier string (`"??"`).
+
+Each line of output contains several sections. We will cover them now, from left to right.
+
+_**Test identifier.**_ The left-most labels are strings which identify the specific test being performed. This string is generally a concatenation of substrings, joined by underscores, which identify the operation being run, the parameter combination tested, and the storage scheme of each operand. When Matlab/Octave output formatting is enabled, these identifiers serve as the names of the arrays in which the data are stored.
+
+_**Dimensions.**_ The values near the middle of the output show the size of each dimension. Different operations have different dimension sets. For example, `gemv` only has two dimensions, _m_ and _n_, while `gemm` has an additional _k_ dimension. In the snippet above, you can see that the dimension specifier string, `"-1 -1 -2"`, explains the relative sizes of the dimensions for each test: _m_ and _n_ are bound to the problem size, while _k_ is always equal to half the problem size.
+
+_**Performance.**_ The next value output is raw performance, reported in GFLOPS (billions of floating-point operations per second).
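
As a rough illustration of what a GFLOPS figure means, the sketch below converts a wall-clock time into GFLOPS for a real-domain `gemm`. It assumes the conventional count of `2*m*n*k` floating-point operations; the function name `gemm_gflops` is hypothetical, and the test suite's own accounting may differ in detail.

```python
def gemm_gflops(m, n, k, seconds):
    """Estimate GFLOPS for a real-domain gemm that ran in `seconds`.

    Assumption: one multiply and one add per (i, j, p) triple,
    i.e. 2*m*n*k floating-point operations in total.
    """
    if seconds <= 0.0:
        raise ValueError("elapsed time must be positive")
    flops = 2.0 * m * n * k
    return flops / seconds / 1.0e9


# A 100 x 100 x 50 problem performs 1e6 flops under this convention,
# so a run time of one millisecond corresponds to about 1 GFLOPS.
print(gemm_gflops(100, 100, 50, 1.0e-3))
```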
+
+_**Residual.**_ The next value, which we loosely refer to as a "residual", reports the result of the numerical correctness test for the operation. The actual method of computing the residual (and hence its exact meaning) depends on the operation being tested. However, these residuals are always computed such that the result should be no more than 2-3 orders of magnitude away from machine precision for the datatype being tested. Thus, "good" results are typically in the neighborhood of `1e-07` for single precision and `1e-15` for double precision.
+
+_**Test result.**_ The BLIS test suite compares the residual to internally-defined accuracy thresholds to categorize the test as either `PASS`, `MARGINAL`, or `FAIL`. The vast majority of tests should result in a `PASS` result, with perhaps a handful resulting in `MARGINAL`.
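
The three-way categorization can be pictured with the sketch below. The threshold values and the exact comparison rules are invented for illustration; the real thresholds are defined internally by the test suite and vary by datatype.

```python
import math


def classify(resid, marginal_thresh, fail_thresh):
    """Categorize a residual as PASS, MARGINAL, or FAIL.

    Assumptions: residuals above `fail_thresh` fail, residuals above
    `marginal_thresh` are marginal, and a NaN residual counts as a failure.
    """
    if math.isnan(resid) or resid > fail_thresh:
        return "FAIL"
    if resid > marginal_thresh:
        return "MARGINAL"
    return "PASS"


# Hypothetical single-precision thresholds, chosen only for this example.
MARGINAL_THRESH = 1.0e-05
FAIL_THRESH = 1.0e-04

for resid in (8.66e-08, 5.0e-05, 1.0e-03):
    print(resid, classify(resid, MARGINAL_THRESH, FAIL_THRESH))
```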
+
+Note that the various sections of output, which line up nicely as columns, are labeled on a line beginning with `%` immediately before the results:
+```
+% blis_<dt><oper>_<params>_<storage>           m     n     k   gflops  resid       result
+blis_sgemm_nn_ccc                            100   100    50   1.447   1.14e-07    PASS
+```
+These labels are useful as concise reminders of the meaning of each column. They are especially useful in differentiating the various dimensions from each other for operations that contain two or three dimensions.
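
Because the number of dimension columns varies by operation, a script that post-processes this output is easiest to write by reading each line from the right. The following sketch shows one way to do so; `parse_result_line` is a hypothetical helper, and it assumes that result lines begin with `blis_` and end with the gflops, residual, and result columns, as in the snippets above.

```python
def parse_result_line(line):
    """Parse one result line into a dict, or return None for other lines.

    Assumptions: result lines start with "blis_", and the last three
    whitespace-separated fields are gflops, residual, and test result.
    Everything between the identifier and those fields is a dimension.
    """
    line = line.strip()
    if not line or line.startswith("%"):
        return None  # blank line or column-label line
    tokens = line.split()
    if len(tokens) < 5 or not tokens[0].startswith("blis_"):
        return None
    try:
        return {
            "test": tokens[0],
            "dims": [int(t) for t in tokens[1:-3]],
            "gflops": float(tokens[-3]),
            "resid": float(tokens[-2]),
            "result": tokens[-1],
        }
    except ValueError:
        return None  # not a result line after all


sample = "blis_sgemm_nn_ccc                            100   100    50   1.447   1.14e-07    PASS"
print(parse_result_line(sample))
```

Collecting the parsed records and filtering on `result != "PASS"` gives a quick list of tests that deserve a closer look.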
+
+# BLAS test drivers
+
+In addition to the monolithic testsuite located in the `testsuite` directory, which exercises BLIS functionality in general (and via one of its native/preferred APIs), we also provide a C port of the netlib BLAS test drivers included in netlib LAPACK. These BLAS drivers are located in `blastest`, along with other files needed in order to build the drivers, such as a subset of `libf2c`. After configuring and compiling BLIS, the BLAS test drivers may be run from within `blastest`:
+```
+$ ./configure haswell
+# Lots of configure output...
+$ make -j4
+# Lots of compilation output...
+$ cd blastest
+$ ls
+Makefile  f2c  input  obj  src
+```
+Simply run `make`:
+```
+$ make
+Compiling obj/abs.o
+Compiling obj/acos.o
+Compiling obj/asin.o
+Compiling obj/atan.o
+...
+Compiling obj/wsfe.o
+Compiling obj/wsle.o
+Archiving libf2c.a
+Compiling obj/cblat1.o
+Linking cblat1.x against 'libf2c.a ../lib/haswell/libblis.a -lm -lpthread -lrt'
+Compiling obj/cblat2.o
+Linking cblat2.x against 'libf2c.a ../lib/haswell/libblis.a -lm -lpthread -lrt'
+Compiling obj/cblat3.o
+Linking cblat3.x against 'libf2c.a ../lib/haswell/libblis.a -lm -lpthread -lrt'
+...
+```
+And then `make run`:
+```
+$ make run
+Running cblat1.x > 'out.cblat1'
+Running cblat2.x < 'input/cblat2.in' (output to 'out.cblat2')
+Running cblat3.x < 'input/cblat3.in' (output to 'out.cblat3')
+Running dblat1.x > 'out.dblat1'
+Running dblat2.x < 'input/dblat2.in' (output to 'out.dblat2')
+Running dblat3.x < 'input/dblat3.in' (output to 'out.dblat3')
+Running sblat1.x > 'out.sblat1'
+Running sblat2.x < 'input/sblat2.in' (output to 'out.sblat2')
+Running sblat3.x < 'input/sblat3.in' (output to 'out.sblat3')
+Running zblat1.x > 'out.zblat1'
+Running zblat2.x < 'input/zblat2.in' (output to 'out.zblat2')
+Running zblat3.x < 'input/zblat3.in' (output to 'out.zblat3')
+```
+The results can quickly be checked via a script in the top-level `build` directory:
+```
+$ ../build/check-blastest.sh 
+All BLAS tests passed!
+```
+This is the message you should see when all of the BLAS tests pass.
+
+Alternatively, you can perform all of the steps described above (`make ; make run; ../build/check-blastest.sh`) from the top-level directory in one shot. After running `configure` and `make`, simply run `make checkblas`:
+```
+$ ./configure haswell
+# Lots of configure output...
+$ make -j4
+# Lots of compilation output...
+$ make checkblas
+```
+This will build all of the necessary BLAS test driver object files, link them, and run the drivers. Output will go to the current directory (either the top-level directory of the source distribution, or the out-of-tree build directory from which you ran `configure`), with each output file (prefixed with `out.`) named according to the BLAS driver that generated its contents:
+```
+$ ls
+CHANGELOG  bli_config.h     frame       out.cblat2  out.sblat3        testsuite
+CREDITS    build            include     out.cblat3  out.zblat1        version
+INSTALL    common.mk        kernels     out.dblat1  out.zblat2        windows
+LICENSE    config           lib         out.dblat2  out.zblat3
+Makefile   config.mk        mpi_test    out.dblat3  output.testsuite
+README.md  config_registry  obj         out.sblat1  ref_kernels
+blastest   configure        out.cblat1  out.sblat2  test
+```
+If any of the tests fail, you'll instead see the message:
+```
+$ make checkblas
+At least one BLAS test failed. Please see out.* files for details.
+```
+As the message suggests, you should inspect the `out.*` files for more details about what went wrong.
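
When there are many `out.*` files, a small script can help locate the lines of interest. The sketch below is not part of BLIS; `find_failures` is a hypothetical helper, and it assumes that failing tests appear as lines containing the string `FAIL`. Adjust the `marker` argument if the drivers report failures differently.

```python
import glob


def find_failures(pattern="out.*", marker="FAIL"):
    """Return (path, line number, text) for each line containing `marker`.

    Assumption: failures in the driver output are flagged by lines that
    contain the marker string; this is not verified against the drivers.
    """
    hits = []
    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as f:
            for lineno, line in enumerate(f, start=1):
                if marker in line:
                    hits.append((path, lineno, line.rstrip()))
    return hits


if __name__ == "__main__":
    # Run from the directory that holds the out.* files.
    for path, lineno, text in find_failures():
        print("%s:%d: %s" % (path, lineno, text))
```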