BLIS:merge:

Fixed merge conflicts that arose while downstreaming BLIS code from the master branch to the milan-3.1 branch.

Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (i.e., the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond)
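The reduction described above amounts to a small guard applied before parallelism is configured. Below is a minimal sketch of that policy; `is_prime` and `adjust_num_threads` are hypothetical helper names for illustration, not BLIS's actual internal functions:

```c
#include <stdbool.h>

#ifndef BLIS_NT_MAX_PRIME
#define BLIS_NT_MAX_PRIME 11  // threshold above which prime counts are reduced
#endif

// Hypothetical helper: primality test by trial division.
static bool is_prime( int n )
{
	if ( n < 2 ) return false;
	for ( int d = 2; d * d <= n; ++d )
		if ( n % d == 0 ) return false;
	return true;
}

// Sketch of the policy: if the requested thread count is prime and
// exceeds the threshold, fall back to one fewer thread. nt - 1 is
// necessarily composite (for nt > 3) and factors more evenly across
// BLIS's loops.
static int adjust_num_threads( int nt )
{
#ifndef BLIS_ENABLE_AUTO_PRIME_NUM_THREADS
	if ( is_prime( nt ) && nt > BLIS_NT_MAX_PRIME ) nt -= 1;
#endif
	return nt;
}
```

Small primes pass through untouched (e.g. 7 or 11 threads), as do all composite counts; only large primes such as 13 or 17 are reduced.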

Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations.
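BLIS_THREAD_RATIO_M weights the m dimension when a single thread count is factored into a two-dimensional thread grid. The following is a deliberately simplified model of such a factorization (not BLIS's actual partitioning code) that illustrates why changing the ratio from 2 to 1 shifts the chosen factor pair:

```c
#include <stdlib.h>

static double absd( double x ) { return x < 0.0 ? -x : x; }

// Simplified model: split nt into p * q so that p/q tracks the weighted
// dimension ratio (m * ratio_m) : (n * ratio_n). Returns p; q = nt / p.
static int partition_threads( int nt, double m, double n,
                              double ratio_m, double ratio_n )
{
	double target = ( m * ratio_m ) / ( n * ratio_n );
	int    best_p = 1;
	double best_d = absd( 1.0 / nt - target );

	for ( int p = 2; p <= nt; ++p )
	{
		if ( nt % p != 0 ) continue;
		double d = absd( (double)p / ( nt / p ) - target );
		if ( d < best_d ) { best_d = d; best_p = p; }
	}
	return best_p;
}
```

Under this model, a square problem with 12 threads factors as 4x3 when the m ratio is 2 but as 3x4 when it is 1 — the "slightly different automatic thread factorizations" noted above.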

Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations.
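The new 1m policy reduces to a simple predicate. A sketch follows, with boolean parameters standing in for BLIS's internal kernel queries (the names here are illustrative, not real BLIS API):

```c
#include <stdbool.h>

// The 1m method induces complex-domain gemm from a real-domain
// microkernel, so it only pays off when that real kernel is optimized.
// If the native complex kernel is already optimized, 1m is unnecessary;
// if both are reference kernels, 1m offers no benefit over the native
// reference path.
static bool should_use_1m( bool real_ukr_is_ref, bool complex_ukr_is_ref )
{
	return complex_ukr_is_ref && !real_ukr_is_ref;
}
```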

Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger the conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu)

Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu)

Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs.
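This change uses a common C idiom: making the count a trailing member of the enum itself. An abbreviated sketch (only a few representative architecture IDs are shown):

```c
typedef enum
{
	BLIS_ARCH_SKX = 0,
	BLIS_ARCH_HASWELL,
	BLIS_ARCH_ZEN,
	BLIS_ARCH_GENERIC,

	// Keep this last: because enumerators increment from 0, its value
	// always equals the number of entries above it, so registering a
	// new subconfig updates the count automatically.
	BLIS_NUM_ARCHS
} arch_t;
```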

Minor code consolidation in all level-3 _front() functions.

Reorganized Windows cpp branch of bli_pthreads.c.

Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS.

Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion.
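Pre-inversion trades per-element division in the trsm inner loop for a one-time inversion of the diagonal at pack time, at the cost of slightly different rounding. A minimal sketch of the two code paths the new configure option selects between (function and variable names are illustrative):

```c
// With pre-inversion, the packed diagonal stores 1/l_ii so the solve
// step multiplies; with --disable-trsm-preinversion it divides directly,
// which is slower but avoids the extra rounding from forming 1/l_ii.
static double trsm_solve_diag( double b, double l_ii )
{
#ifdef BLIS_ENABLE_TRSM_PREINVERSION
	double inv_l_ii = 1.0 / l_ii;  // in BLIS this happens once, at pack time
	return b * inv_l_ii;
#else
	return b / l_ii;
#endif
}
```

For diagonal entries whose reciprocals are exactly representable, both paths agree; otherwise the pre-inverted path may differ in the last bits.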

Fixed an obscure testsuite bug in the gemmt test module related to its dependency on gemv.

AMD-internal-[CPUPL-1523]

Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd
Author: lcpu
Date: 2021-04-26 23:41:13 +05:30
320 changed files with 147817 additions and 4448 deletions

CHANGELOG (1731 lines changed)
Diff suppressed because it is too large.

CREDITS (10 lines changed)

@@ -20,7 +20,9 @@ but many others have contributed code and feedback, including
Jérémie du Boisberranger @jeremiedbb
Jed Brown @jedbrown (Argonne National Laboratory)
Robin Christ @robinchrist
Dilyn Corner @dilyn-corner
Mat Cross @matcross (NAG)
@decandia50
Kay Dewhurst @jkd2016 (Max Planck Institute, Halle, Germany)
Jeff Diamond (Oracle)
Johannes Dieterich @iotamudelta
@@ -58,7 +60,10 @@ but many others have contributed code and feedback, including
Simon Lukas Märtens @ACSimon33 (RWTH Aachen University)
Devin Matthews @devinamatthews (The University of Texas at Austin)
Stefanos Mavros @smavros
Ilknur Mustafazade @Runkli
@nagsingh
Bhaskar Nallani @BhaskarNallani (AMD)
Stepan Nassyr @stepannassyr (Jülich Supercomputing Centre)
Nisanth Padinharepatt (AMD)
Ajay Panyala @ajaypanyala
Devangi Parikh @dnparikh (The University of Texas at Austin)
@@ -68,6 +73,7 @@ but many others have contributed code and feedback, including
Jack Poulson @poulson (Stanford)
Mathieu Poumeyrol @kali
Christos Psarras @ChrisPsa (RWTH Aachen University)
@pkubaj
@qnerd
Michael Rader @mrader1248
Pradeep Rao @pradeeptrgit (AMD)
@@ -88,11 +94,13 @@ but many others have contributed code and feedback, including
Nicholai Tukanov @nicholaiTukanov (The University of Texas at Austin)
Rhys Ulerich @RhysU (The University of Texas at Austin)
Robert van de Geijn @rvdg (The University of Texas at Austin)
Meghana Vankadari @Meghana-vankadari (AMD)
Kiran Varaganti @kvaragan (AMD)
Natalia Vassilieva (Hewlett Packard Enterprise)
Zhang Xianyi @xianyi (Chinese Academy of Sciences)
RuQing Xu @xrq-phys (The University of Tokyo)
Benda Xu @heroxbd
Guodong Xu @docularxu (Linaro.org)
RuQing Xu @xrq-phys (The University of Tokyo)
Costas Yamin @cosstas
Chenhan Yu @ChenhanYu (The University of Texas at Austin)
Roman Yurchak @rth (Symerio)


@@ -112,7 +112,6 @@ BASE_OBJ_PATH := ./$(OBJ_DIR)/$(CONFIG_NAME)
# of source code.
BASE_OBJ_CONFIG_PATH := $(BASE_OBJ_PATH)/$(CONFIG_DIR)
BASE_OBJ_FRAME_PATH := $(BASE_OBJ_PATH)/$(FRAME_DIR)
BASE_OBJ_AOCLDTL_PATH := $(BASE_OBJ_PATH)/$(AOCLDTL_DIR)
BASE_OBJ_REFKERN_PATH := $(BASE_OBJ_PATH)/$(REFKERN_DIR)
BASE_OBJ_KERNELS_PATH := $(BASE_OBJ_PATH)/$(KERNELS_DIR)
BASE_OBJ_SANDBOX_PATH := $(BASE_OBJ_PATH)/$(SANDBOX_DIR)
@@ -150,7 +149,7 @@ MK_LIBS += $(LIBBLIS_SO_PATH) \
$(LIBBLIS_SO_MAJ_PATH)
MK_LIBS_INST += $(LIBBLIS_SO_MMB_INST)
MK_LIBS_SYML += $(LIBBLIS_SO_INST) \
$(LIBBLIS_SO_MAJ_INST)
$(LIBBLIS_SO_MAJ_INST)
endif
# Strip leading, internal, and trailing whitespace.
@@ -167,6 +166,7 @@ MK_INCL_DIR_INST := $(INSTALL_INCDIR)/blis
# Set the path to the subdirectory of the share installation directory.
MK_SHARE_DIR_INST := $(INSTALL_SHAREDIR)/blis
PC_SHARE_DIR_INST := $(INSTALL_SHAREDIR)/pkgconfig
#
@@ -210,11 +210,6 @@ MK_REFKERN_OBJS := $(foreach arch, $(CONFIG_LIST), \
# Generate object file paths for all of the portable framework source code.
MK_FRAME_OBJS := $(call gen-obj-paths-from-src,$(FRAME_SRC_SUFS),$(MK_FRAME_SRC),$(FRAME_PATH),$(BASE_OBJ_FRAME_PATH))
# Generate object file paths for all of the debug and trace logger.
MK_AOCLDTL_OBJS := $(call gen-obj-paths-from-src,$(AOCLDTL_SRC_SUFS),$(MK_AOCLDTL_SRC),$(AOCLDTL_PATH),$(BASE_OBJ_AOCLDTL_PATH))
# Generate object file paths for the sandbox source code. If a sandbox was not
# enabled at configure-time, this variable will be empty.
MK_SANDBOX_OBJS := $(call gen-obj-paths-from-src,$(SANDBOX_SRC_SUFS),$(MK_SANDBOX_SRC),$(SANDBOX_PATH),$(BASE_OBJ_SANDBOX_PATH))
@@ -224,7 +219,6 @@ MK_BLIS_OBJS := $(MK_CONFIG_OBJS) \
$(MK_KERNELS_OBJS) \
$(MK_REFKERN_OBJS) \
$(MK_FRAME_OBJS) \
$(MK_AOCLDTL_OBJS) \
$(MK_SANDBOX_OBJS)
# Optionally filter out the BLAS and CBLAS compatibility layer object files.
@@ -256,8 +250,13 @@ ifeq ($(MK_ENABLE_CBLAS),yes)
HEADERS_TO_INSTALL += $(CBLAS_H_FLAT)
endif
# Install BLIS CPP Template header files
HEADERS_TO_INSTALL += $(CPP_HEADER_DIR)/*.hh
# If requested, include AMD's C++ template header files in the list of headers
# to install.
ifeq ($(INSTALL_HH),yes)
HEADERS_TO_INSTALL += $(wildcard $(VEND_CPP_PATH)/*.hh)
endif
#
# --- public makefile fragment definitions -------------------------------------
@@ -267,6 +266,8 @@ HEADERS_TO_INSTALL += $(CPP_HEADER_DIR)/*.hh
FRAGS_TO_INSTALL := $(CONFIG_MK_FILE) \
$(COMMON_MK_FILE)
PC_IN_FILE := blis.pc.in
PC_OUT_FILE := blis.pc
#
@@ -423,7 +424,7 @@ all: libs
libs: libblis
test: checkblis checkblas
test: checkblis checkblas
check: checkblis-fast checkblas
@@ -675,7 +676,7 @@ ifeq ($(ARG_MAX_HACK),yes)
$(LINKER) $(SOFLAGS) -o $(LIBBLIS_SO_OUTPUT_NAME) @$@.in $(LDFLAGS)
$(RM_F) $@.in
else
$(LINKER) $(SOFLAGS) -o $(LIBBLIS_SO_OUTPUT_NAME) $? $(LDFLAGS)
$(LINKER) $(SOFLAGS) -o $(LIBBLIS_SO_OUTPUT_NAME) $^ $(LDFLAGS)
endif
else # ifeq ($(ENABLE_VERBOSE),no)
ifeq ($(ARG_MAX_HACK),yes)
@@ -685,7 +686,7 @@ ifeq ($(ARG_MAX_HACK),yes)
@$(RM_F) $@.in
else
@echo "Dynamically linking $@"
@$(LINKER) $(SOFLAGS) -o $(LIBBLIS_SO_OUTPUT_NAME) $? $(LDFLAGS)
@$(LINKER) $(SOFLAGS) -o $(LIBBLIS_SO_OUTPUT_NAME) $^ $(LDFLAGS)
endif
endif
@@ -926,6 +927,19 @@ else
@- $(TESTSUITE_CHECK_PATH) $(TESTSUITE_OUT_FILE)
endif
# --- AMD's C++ template header test rules ---
# NOTE: The targets below won't work as intended for an out-of-tree build,
# and so it's disabled for now.
#testcpp: testvendcpp
# Recursively run the test for AMD's C++ template header.
#testvendcpp:
# $(MAKE) -C $(VEND_TESTCPP_PATH)
# --- Install header rules ---
install-headers: check-env $(MK_INCL_DIR_INST)
@@ -943,7 +957,7 @@ endif
# --- Install share rules ---
install-share: check-env $(MK_SHARE_DIR_INST)
install-share: check-env $(MK_SHARE_DIR_INST) $(PC_SHARE_DIR_INST)
$(MK_SHARE_DIR_INST): $(FRAGS_TO_INSTALL) $(CONFIG_MK_FILE)
ifeq ($(ENABLE_VERBOSE),yes)
@@ -962,6 +976,20 @@ else
$(@)/$(CONFIG_DIR)/$(CONFIG_NAME)/
endif
$(PC_SHARE_DIR_INST): $(PC_IN_FILE)
$(MKDIR) $(@)
ifeq ($(ENABLE_VERBOSE),no)
@echo "Installing $(PC_OUT_FILE) into $(@)/"
endif
$(shell cat "$(PC_IN_FILE)" \
| sed -e "s#@PACKAGE_VERSION@#$(VERSION)#g" \
| sed -e "s#@prefix@#$(prefix)#g" \
| sed -e "s#@exec_prefix@#$(exec_prefix)#g" \
| sed -e "s#@libdir@#$(libdir)#g" \
| sed -e "s#@includedir@#$(includedir)#g" \
| sed -e "s#@LDFLAGS@#$(LDFLAGS)#g" \
> "$(PC_OUT_FILE)" )
$(INSTALL) -m 0644 $(PC_OUT_FILE) $(@)
# --- Install library rules ---
@@ -1219,6 +1247,7 @@ ifeq ($(IS_CONFIGURED),yes)
ifeq ($(ENABLE_VERBOSE),yes)
- $(RM_F) $(BLIS_CONFIG_H)
- $(RM_F) $(CONFIG_MK_FILE)
- $(RM_F) $(PC_OUT_FILE)
- $(RM_RF) $(OBJ_DIR)
- $(RM_RF) $(LIB_DIR)
- $(RM_RF) $(INCLUDE_DIR)
@@ -1227,6 +1256,8 @@ else
@$(RM_F) $(BLIS_CONFIG_H)
@echo "Removing $(CONFIG_MK_FILE)"
@- $(RM_F) $(CONFIG_MK_FILE)
@echo "Removing $(PC_OUT_FILE)"
@- $(RM_F) $(PC_OUT_FILE)
@echo "Removing $(OBJ_DIR)"
@- $(RM_RF) $(OBJ_DIR)
@echo "Removing $(LIB_DIR)"

README.md (107 lines changed)

@@ -64,12 +64,12 @@ compared to conventional approaches to developing BLAS libraries, as well as a
much-needed refinement of the BLAS interface, and thus constitutes a major
advance in dense linear algebra computation. While BLIS remains a
work-in-progress, we are excited to continue its development and further
cultivate its use within the community.
cultivate its use within the community.
The BLIS framework is primarily developed and maintained by individuals in the
[Science of High-Performance Computing](http://shpc.ices.utexas.edu/)
(SHPC) group in the
[Institute for Computational Engineering and Sciences](https://www.ices.utexas.edu/)
[Oden Institute for Computational Engineering and Sciences](https://www.oden.utexas.edu/)
at [The University of Texas at Austin](https://www.utexas.edu/).
Please visit the [SHPC](http://shpc.ices.utexas.edu/) website for more
information about our research group, such as a list of
@@ -93,6 +93,14 @@ all of which are available for free via the [edX platform](http://www.edx.org/).
What's New
----------
* **Multithreaded small/skinny matrix support for sgemm now available!** Thanks to
funding and hardware support from Oracle, we have now accelerated `gemm` for
single-precision real matrix problems where one or two dimensions is exceedingly
small. This work is similar to the `gemm` optimization announced last year.
For now, we have only gathered performance results on an AMD Epyc Zen2 system, but
we hope to publish additional graphs for other architectures in the future. You may
find these Zen2 graphs via the [PerformanceSmall](docs/PerformanceSmall.md) document.
* **BLIS awarded SIAM Activity Group on Supercomputing Best Paper Prize for 2020!**
We are thrilled to announce that the paper that we internally refer to as the
second BLIS paper,
@@ -103,10 +111,11 @@ second BLIS paper,
for 2020. The prize is awarded once every two years to a paper judged to be
the most outstanding paper in the field of parallel scientific and engineering
computing, and has only been awarded once before (in 2016) since its inception
in 2015 (the committee did not award the prize in 2018). The prize will be
awarded at the [SIAM Conference on Parallel Processing for Scientific Computing](https://www.siam.org/conferences/cm/conference/pp20) in Seattle next February. Robert will
be present at the conference to accept the prize and give
[a talk on BLIS](https://meetings.siam.org/sess/dsp_programsess.cfm?SESSIONCODE=68266).
in 2015 (the committee did not award the prize in 2018). The prize
[was awarded](https://www.oden.utexas.edu/about/news/ScienceHighPerfomanceComputingSIAMBestPaperPrize/)
at the [2020 SIAM Conference on Parallel Processing for Scientific Computing](https://www.siam.org/conferences/cm/conference/pp20) in Seattle. Robert was present at
the conference to give
[a talk on BLIS](https://meetings.siam.org/sess/dsp_programsess.cfm?SESSIONCODE=68266) and accept the prize alongside other coauthors.
The selection committee sought to recognize the paper, "which validates BLIS,
a framework relying on the notion of microkernels that enables both productivity
and high performance." Their statement continues, "The framework will continue
@@ -159,6 +168,9 @@ draft](http://www.cs.utexas.edu/users/flame/pubs/blis6_toms_rev2.pdf)).
What People Are Saying About BLIS
---------------------------------
*["I noticed a substantial increase in multithreaded performance on my own
machine, which was extremely satisfying."](https://groups.google.com/d/msg/blis-discuss/8iu9B5KCxpA/uftpjgIsBwAJ)* ... *["[I was] happy it worked so well!"](https://groups.google.com/d/msg/blis-discuss/8iu9B5KCxpA/uftpjgIsBwAJ)* (Justin Shea)
*["This is an awesome library."](https://github.com/flame/blis/issues/288#issuecomment-447488637)* ... *["I want to thank you and the blis team for your efforts."](https://github.com/flame/blis/issues/288#issuecomment-448074704)* ([@Lephar](https://github.com/Lephar))
*["Any time somebody outside Intel beats MKL by a nontrivial amount, I report it to the MKL team. It is fantastic for any open-source project to get within 10% of MKL... [T]his is why Intel funds BLIS development."](https://github.com/flame/blis/issues/264#issuecomment-428673275)* ([@jeffhammond](https://github.com/jeffhammond))
@@ -207,7 +219,7 @@ seeking to implement tensor contractions on multidimensional arrays.)
Furthermore, since BLIS tracks stride information for each matrix, operands of
different storage formats can be used within the same operation invocation. By
contrast, BLAS requires column-major storage. And while the CBLAS interface
supports row-major storage, it does not allow mixing storage formats.
supports row-major storage, it does not allow mixing storage formats.
* **Rich support for the complex domain.** BLIS operations are developed and
expressed in their most general form, which is typically in the complex domain.
@@ -249,7 +261,7 @@ of BLIS's native APIs directly. BLIS's typed API will feel familiar to many
veterans of BLAS since these interfaces use BLAS-like calling sequences. And
many will find BLIS's object-based APIs a delight to use when customizing
or writing their own BLIS operations. (Objects are relatively lightweight
`structs` and passed by address, which helps tame function calling overhead.)
`structs` and passed by address, which helps tame function calling overhead.)
* **Multilayered API, exposed kernels, and sandboxes.** The BLIS framework
exposes its
@@ -272,7 +284,7 @@ a nearly-complete template for instantiating high-performance BLAS-like
libraries. Furthermore, the framework is extensible, allowing developers to
leverage existing components to support new operations as they are identified.
If such operations require new kernels for optimal efficiency, the framework
and its APIs will be adjusted and extended accordingly.
and its APIs will be adjusted and extended accordingly.
* **Code re-use.** Auto-generation approaches to achieving the aforementioned
goals tend to quickly lead to code bloat due to the multiple dimensions of
@@ -441,14 +453,14 @@ about BLIS, please read this FAQ. If you can't find the answer to your question,
please feel free to join the [blis-devel](https://groups.google.com/group/blis-devel)
mailing list and post a question. We also have a
[blis-discuss](https://groups.google.com/group/blis-discuss) mailing list that
anyone can post to (even without joining).
anyone can post to (even without joining).
**Documents for github contributors:**
* **[Contributing bug reports, feature requests, PRs, etc](CONTRIBUTING.md).**
Interested in contributing to BLIS? Please read this document before getting
started. It provides a general overview of how best to report bugs, propose new
features, and offer code patches.
features, and offer code patches.
* **[Coding Conventions](docs/CodingConventions.md).** If you are interested or
planning on contributing code to BLIS, please read this document so that you can
@@ -540,7 +552,7 @@ Contributing
------------
For information on how to contribute to our project, including preferred
[coding conventions](docs/CodingConventions), please refer to the
[coding conventions](docs/CodingConventions.md), please refer to the
[CONTRIBUTING](CONTRIBUTING.md) file at the top-level of the BLIS source
distribution.
@@ -549,8 +561,8 @@ Citations
For those of you looking for the appropriate article to cite regarding BLIS, we
recommend citing our
[first ACM TOMS journal paper](http://dl.acm.org/authorize?N91172)
([unofficial backup link](http://www.cs.utexas.edu/users/flame/pubs/blis1_toms_rev3.pdf)):
[first ACM TOMS journal paper]( https://dl.acm.org/doi/10.1145/2764454?cid=81314495332)
([unofficial backup link](https://www.cs.utexas.edu/users/flame/pubs/blis1_toms_rev3.pdf)):
```
@article{BLIS1,
@@ -560,16 +572,16 @@ recommend citing our
volume = {41},
number = {3},
pages = {14:1--14:33},
month = jun,
month = {June},
year = {2015},
issue_date = {June 2015},
url = {http://doi.acm.org/10.1145/2764454},
}
```
```
You may also cite the
[second ACM TOMS journal paper](http://dl.acm.org/authorize?N16240)
([unofficial backup link](http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf)):
[second ACM TOMS journal paper]( https://dl.acm.org/doi/10.1145/2755561?cid=81314495332)
([unofficial backup link](https://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf)):
```
@article{BLIS2,
@@ -582,15 +594,16 @@ You may also cite the
volume = {42},
number = {2},
pages = {12:1--12:19},
month = jun,
month = {June},
year = {2016},
issue_date = {June 2016},
url = {http://doi.acm.org/10.1145/2755561},
}
```
```
We also have a third paper, submitted to IPDPS 2014, on achieving
[multithreaded parallelism in BLIS](http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf):
[multithreaded parallelism in BLIS](https://dl.acm.org/doi/10.1109/IPDPS.2014.110)
([unofficial backup link](https://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf)):
```
@inproceedings{BLIS3,
@@ -599,14 +612,15 @@ We also have a third paper, submitted to IPDPS 2014, on achieving
title = {Anatomy of High-Performance Many-Threaded Matrix Multiplication},
booktitle = {28th IEEE International Parallel \& Distributed Processing Symposium
(IPDPS 2014)},
year = 2014,
year = {2014},
url = {https://doi.org/10.1109/IPDPS.2014.110},
}
```
A fourth paper, submitted to ACM TOMS, also exists, which proposes an
[analytical model](http://dl.acm.org/citation.cfm?id=2925987)
([unofficial backup link](http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf))
for determining blocksize parameters in BLIS:
[analytical model](https://dl.acm.org/doi/10.1145/2925987)
for determining blocksize parameters in BLIS
([unofficial backup link](https://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf)):
```
@article{BLIS4,
@@ -617,7 +631,7 @@ for determining blocksize parameters in BLIS:
volume = {43},
number = {2},
pages = {12:1--12:18},
month = aug,
month = {August},
year = {2016},
issue_date = {August 2016},
url = {http://doi.acm.org/10.1145/2925987},
@@ -625,7 +639,8 @@ for determining blocksize parameters in BLIS:
```
A fifth paper, submitted to ACM TOMS, begins the study of so-called
[induced methods for complex matrix multiplication](http://www.cs.utexas.edu/users/flame/pubs/blis5_toms_rev2.pdf):
[induced methods for complex matrix multiplication]( https://dl.acm.org/doi/10.1145/3086466?cid=81314495332)
([unofficial backup link](https://www.cs.utexas.edu/users/flame/pubs/blis5_toms_rev2.pdf)):
```
@article{BLIS5,
@@ -635,27 +650,36 @@ A fifth paper, submitted to ACM TOMS, begins the study of so-called
volume = {44},
number = {1},
pages = {7:1--7:36},
month = jul,
month = {July},
year = {2017},
issue_date = {July 2017},
url = {http://doi.acm.org/10.1145/3086466},
}
```
```
A sixth paper, submitted to ACM TOMS, revisits the topic of the previous
article and derives a [superior induced method](http://www.cs.utexas.edu/users/flame/pubs/blis6_sisc_rev1.pdf):
article and derives a
[superior induced method](https://epubs.siam.org/doi/10.1137/19M1282040)
([unofficial backup link](https://www.cs.utexas.edu/users/flame/pubs/blis6_sisc_rev3.pdf)):
```
@article{BLIS6,
author = {Field G. {V}an~{Z}ee},
title = {Implementing High-Performance Complex Matrix Multiplication via the 1m Method},
journal = {SIAM Journal on Scientific Computing},
note = {submitted}
volume = {42},
number = {5},
pages = {C221--C244},
month = {September},
year = {2020},
issue_date = {September 2020},
url = {https://doi.org/10.1137/19M1282040}
}
```
```
A seventh paper, submitted to ACM TOMS, explores the implementation of `gemm` for
[mixed-domain and/or mixed-precision](http://www.cs.utexas.edu/users/flame/pubs/blis7_toms_rev0.pdf) operands:
[mixed-domain and/or mixed-precision](https://www.cs.utexas.edu/users/flame/pubs/blis7_toms_rev0.pdf) operands
([unofficial backup link](https://www.cs.utexas.edu/users/flame/pubs/blis7_toms_rev0.pdf)):
```
@article{BLIS7,
@@ -671,16 +695,17 @@ Funding
-------
This project and its associated research were partially sponsored by grants from
[Microsoft](http://www.microsoft.com/),
[Intel](http://www.intel.com/),
[Texas Instruments](http://www.ti.com/),
[AMD](http://www.amd.com/),
[Oracle](http://www.oracle.com/),
[Huawei](http://www.huawei.com/),
[Microsoft](https://www.microsoft.com/),
[Intel](https://www.intel.com/),
[Texas Instruments](https://www.ti.com/),
[AMD](https://www.amd.com/),
[HPE](https://www.hpe.com/),
[Oracle](https://www.oracle.com/),
[Huawei](https://www.huawei.com/),
and
[Facebook](http://www.facebook.com/),
[Facebook](https://www.facebook.com/),
as well as grants from the
[National Science Foundation](http://www.nsf.gov/) (Awards
[National Science Foundation](https://www.nsf.gov/) (Awards
CCF-0917167, ACI-1148125/1340293, CCF-1320112, and ACI-1550493).
_Any opinions, findings and conclusions or recommendations expressed in this

blis.pc.in (new file, 11 lines)

@@ -0,0 +1,11 @@
prefix=@prefix@
exec_prefix=@exec_prefix@
libdir=@libdir@
includedir=@includedir@
Name: BLIS
Description: BLAS-like Library Instantiation Software Framework
Version: @PACKAGE_VERSION@
Libs: -L${libdir} -lblis
Libs.private: @LDFLAGS@
Cflags: -I${includedir}/blis


@@ -45,12 +45,19 @@
// Enabled kernel sets (kernel_list)
@kernel_list_defines@
//This macro is enabled only for ZEN family configurations.
//This enables us to use different cache-blocking sizes for TRSM instead of common level-3 cache-block sizes.
#if @enable_aocl_zen@
#define AOCL_BLIS_ZEN
#endif
#if @enable_system@
#define BLIS_ENABLE_SYSTEM
#else
#define BLIS_DISABLE_SYSTEM
#endif
#if @enable_openmp@
#define BLIS_ENABLE_OPENMP
#endif
@@ -153,6 +160,12 @@
#define BLIS_DISABLE_MEMKIND
#endif
#if @enable_trsm_preinversion@
#define BLIS_ENABLE_TRSM_PREINVERSION
#else
#define BLIS_DISABLE_TRSM_PREINVERSION
#endif
#if @enable_pragma_omp_simd@
#define BLIS_ENABLE_PRAGMA_OMP_SIMD
#else


@@ -117,6 +117,9 @@ LDFLAGS_PRESET := @ldflags_preset@
# The level of debugging info to generate.
DEBUG_TYPE := @debug_type@
# Whether operating system support was requested via --enable-system.
ENABLE_SYSTEM := @enable_system@
# The requested threading model.
THREADING_MODEL := @threading_model@
@@ -184,7 +187,8 @@ MK_ENABLE_MEMKIND := @enable_memkind@
# enabled.
SANDBOX := @sandbox@
# The name of the pthreads library.
# The name of the pthreads library. If --disable-system was given, then this
# variable is set to the empty value.
LIBPTHREAD := @libpthread@
# end of ifndef CONFIG_MK_INCLUDED conditional block


@@ -5,6 +5,7 @@
libraries.
Copyright (C) 2019, The University of Texas at Austin
Copyright (C) 2018, Advanced Micro Devices, Inc.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are


@@ -5,6 +5,7 @@
libraries.
Copyright (C) 2019, The University of Texas at Austin
Copyright (C) 2018, Advanced Micro Devices, Inc.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are


@@ -312,6 +312,10 @@ TESTSUITE_DIR := testsuite
CPP_HEADER_DIR := cpp
CPP_TEST_DIR := testcpp
VEND_DIR := vendor
VEND_CPP_DIR := $(VEND_DIR)/cpp
VEND_TESTCPP_DIR := $(VEND_DIR)/testcpp
# The filename suffix for reference kernels.
REFNM := ref
@@ -378,6 +382,10 @@ REFKERN_PATH := $(DIST_PATH)/$(REFKERN_DIR)
KERNELS_PATH := $(DIST_PATH)/$(KERNELS_DIR)
SANDBOX_PATH := $(DIST_PATH)/$(SANDBOX_DIR)
# Construct paths to some optional C++ template headers contributed by AMD.
VEND_CPP_PATH := $(DIST_PATH)/$(VEND_CPP_DIR)
VEND_TESTCPP_PATH := $(DIST_PATH)/$(VEND_TESTCPP_DIR)
# Construct paths to the makefile fragments for the four primary directories
# of source code: the config directory, general framework code, reference
# kernel code, and optimized kernel code.
@@ -517,7 +525,8 @@ LIBMEMKIND := -lmemkind
# Default linker flags.
# NOTE: -lpthread is needed unconditionally because BLIS uses pthread_once()
# to initialize itself in a thread-safe manner.
# to initialize itself in a thread-safe manner. The one exception to this
# rule: if --disable-system is given at configure-time, LIBPTHREAD is empty.
LDFLAGS := $(LDFLAGS_PRESET) $(LIBM) $(LIBPTHREAD)
# Add libmemkind to the link-time flags, if it was enabled at configure-time.
@@ -748,6 +757,10 @@ $(foreach c, $(CONFIG_LIST_FAM), $(eval $(call append-var-for,CPPROCFLAGS,$(c)))
# --- Threading flags ---
# NOTE: We don't have to explicitly omit -pthread when --disable-system is given
# since that option forces --enable-threading=none, and thus -pthread never gets
# added to begin with.
ifeq ($(CC_VENDOR),gcc)
ifeq ($(THREADING_MODEL),auto)
THREADING_MODEL := openmp


@@ -61,7 +61,7 @@ COPTFLAGS := -O2
endif
# Flags specific to optimized kernels.
CKOPTFLAGS := $(COPTFLAGS)
CKOPTFLAGS := $(COPTFLAGS) -O3
ifeq ($(CC_VENDOR),gcc)
CKVECFLAGS := -march=armv7-a
else


@@ -57,11 +57,11 @@ endif
ifeq ($(DEBUG_TYPE),noopt)
COPTFLAGS := -O0
else
COPTFLAGS := -O3
COPTFLAGS := -O2
endif
# Flags specific to optimized kernels.
CKOPTFLAGS := $(COPTFLAGS)
CKOPTFLAGS := $(COPTFLAGS) -O3
ifeq ($(CC_VENDOR),gcc)
CKVECFLAGS := -march=armv8-a
else


@@ -68,11 +68,11 @@ endif
ifeq ($(DEBUG_TYPE),noopt)
COPTFLAGS := -O0
else
COPTFLAGS := -O3
COPTFLAGS := -O2
endif
# Flags specific to optimized kernels.
CKOPTFLAGS := $(COPTFLAGS)
CKOPTFLAGS := $(COPTFLAGS) -O3
ifeq ($(CC_VENDOR),ibm)
CKVECFLAGS := -qarch=qp -qtune=qp -qsimd=auto -qhot=level=1 -qprefetch -qunroll=yes -qnoipa
endif


@@ -57,11 +57,11 @@ endif
ifeq ($(DEBUG_TYPE),noopt)
COPTFLAGS := -O0
else
COPTFLAGS := -O3
COPTFLAGS := -O2
endif
# Flags specific to optimized kernels.
CKOPTFLAGS := $(COPTFLAGS)
CKOPTFLAGS := $(COPTFLAGS) -O3
ifeq ($(CC_VENDOR),gcc)
CKVECFLAGS := -mfpmath=sse -mavx -mfma4 -march=bdver1 -mno-tbm -mno-xop -mno-lwp
else


@@ -55,11 +55,19 @@ void bli_cntx_init_cortexa15( cntx_t* cntx )
// Initialize level-3 blocksize objects with architecture-specific values.
// s d c z
bli_blksz_init_easy( &blkszs[ BLIS_MR ], 4, 4, 0, 0 );
bli_blksz_init_easy( &blkszs[ BLIS_NR ], 4, 4, 0, 0 );
bli_blksz_init_easy( &blkszs[ BLIS_MC ], 336, 176, 0, 0 );
bli_blksz_init_easy( &blkszs[ BLIS_KC ], 528, 368, 0, 0 );
bli_blksz_init_easy( &blkszs[ BLIS_NC ], 4096, 4096, 0, 0 );
#if 1
bli_blksz_init_easy( &blkszs[ BLIS_MR ], 4, 4, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_NR ], 4, 4, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_MC ], 336, 176, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_KC ], 528, 368, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_NC ], 4096, 4096, -1, -1 );
#else
bli_blksz_init_easy( &blkszs[ BLIS_MR ], -1, 4, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_NR ], -1, 4, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_MC ], -1, 176, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_KC ], -1, 368, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_NC ], -1, 4096, -1, -1 );
#endif
// Update the context with the current architecture's register and cache
// blocksizes (and multiples) for native execution.


@@ -61,9 +61,9 @@ COPTFLAGS := -O2
endif
# Flags specific to optimized kernels.
CKOPTFLAGS := $(COPTFLAGS)
CKOPTFLAGS := $(COPTFLAGS) -O3
ifeq ($(CC_VENDOR),gcc)
CKVECFLAGS := -march=armv7-a
CKVECFLAGS := -mcpu=cortex-a15
else
$(error gcc is required for this configuration.)
endif


@@ -57,19 +57,19 @@ endif
ifeq ($(DEBUG_TYPE),noopt)
COPTFLAGS := -O0
else
COPTFLAGS := -O3 -ftree-vectorize -mtune=cortex-a53
COPTFLAGS := -O2 -mcpu=cortex-a53
endif
# Flags specific to optimized kernels.
CKOPTFLAGS := $(COPTFLAGS)
CKOPTFLAGS := $(COPTFLAGS) -O3 -ftree-vectorize
ifeq ($(CC_VENDOR),gcc)
CKVECFLAGS := -march=armv8-a+fp+simd -mcpu=cortex-a53
CKVECFLAGS := -mcpu=cortex-a53
else
$(error gcc is required for this configuration.)
endif
# Flags specific to reference kernels.
CROPTFLAGS := $(CKOPTFLAGS)
CROPTFLAGS := $(CKOPTFLAGS) -O3
ifeq ($(CC_VENDOR),gcc)
CRVECFLAGS := $(CKVECFLAGS) -funsafe-math-optimizations -ffp-contract=fast
else


@@ -57,13 +57,13 @@ endif
ifeq ($(DEBUG_TYPE),noopt)
COPTFLAGS := -O0
else
COPTFLAGS := -O3 -ftree-vectorize -mtune=cortex-a57
COPTFLAGS := -O2 -mcpu=cortex-a57
endif
# Flags specific to optimized kernels.
CKOPTFLAGS := $(COPTFLAGS)
CKOPTFLAGS := $(COPTFLAGS) -O3 -ftree-vectorize
ifeq ($(CC_VENDOR),gcc)
CKVECFLAGS := -march=armv8-a+fp+simd -mcpu=cortex-a57
CKVECFLAGS := -mcpu=cortex-a57
else
$(error gcc is required for this configuration.)
endif


@@ -61,9 +61,9 @@ COPTFLAGS := -O2
endif
# Flags specific to optimized kernels.
CKOPTFLAGS := $(COPTFLAGS)
CKOPTFLAGS := $(COPTFLAGS) -O3
ifeq ($(CC_VENDOR),gcc)
CKVECFLAGS := -march=armv7-a
CKVECFLAGS := -mcpu=cortex-a9
else
$(error gcc is required for this configuration.)
endif


@@ -57,11 +57,11 @@ endif
ifeq ($(DEBUG_TYPE),noopt)
COPTFLAGS := -O0
else
COPTFLAGS := -O3
COPTFLAGS := -O2
endif
# Flags specific to optimized kernels.
CKOPTFLAGS := $(COPTFLAGS)
CKOPTFLAGS := $(COPTFLAGS) -O3
ifeq ($(CC_VENDOR),gcc)
CKVECFLAGS := -mfpmath=sse -mavx -mfma -march=bdver4 -mno-fma4 -mno-tbm -mno-xop -mno-lwp
else


@@ -57,11 +57,11 @@ endif
ifeq ($(DEBUG_TYPE),noopt)
COPTFLAGS := -O0
else
COPTFLAGS := -O3
COPTFLAGS := -O2
endif
# Flags specific to optimized kernels.
CKOPTFLAGS := $(COPTFLAGS)
CKOPTFLAGS := $(COPTFLAGS) -O3
ifeq ($(CC_VENDOR),gcc)
CKVECFLAGS :=
else


@@ -67,12 +67,30 @@ void bli_cntx_init_haswell( cntx_t* cntx )
// gemmtrsm_l
BLIS_GEMMTRSM_L_UKR, BLIS_FLOAT, bli_sgemmtrsm_l_haswell_asm_6x16, TRUE,
BLIS_GEMMTRSM_L_UKR, BLIS_DOUBLE, bli_dgemmtrsm_l_haswell_asm_6x8, TRUE,
// gemmtrsm_u
BLIS_GEMMTRSM_U_UKR, BLIS_FLOAT, bli_sgemmtrsm_u_haswell_asm_6x16, TRUE,
BLIS_GEMMTRSM_U_UKR, BLIS_DOUBLE, bli_dgemmtrsm_u_haswell_asm_6x8, TRUE,
cntx
);
#if 1
// Update the context with optimized packm kernels.
bli_cntx_set_packm_kers
(
8,
BLIS_PACKM_6XK_KER, BLIS_FLOAT, bli_spackm_haswell_asm_6xk,
BLIS_PACKM_16XK_KER, BLIS_FLOAT, bli_spackm_haswell_asm_16xk,
BLIS_PACKM_6XK_KER, BLIS_DOUBLE, bli_dpackm_haswell_asm_6xk,
BLIS_PACKM_8XK_KER, BLIS_DOUBLE, bli_dpackm_haswell_asm_8xk,
BLIS_PACKM_3XK_KER, BLIS_SCOMPLEX, bli_cpackm_haswell_asm_3xk,
BLIS_PACKM_8XK_KER, BLIS_SCOMPLEX, bli_cpackm_haswell_asm_8xk,
BLIS_PACKM_3XK_KER, BLIS_DCOMPLEX, bli_zpackm_haswell_asm_3xk,
BLIS_PACKM_4XK_KER, BLIS_DCOMPLEX, bli_zpackm_haswell_asm_4xk,
cntx
);
#endif
// Update the context with optimized level-1f kernels.
bli_cntx_set_l1f_kers
(
@@ -90,11 +108,11 @@ void bli_cntx_init_haswell( cntx_t* cntx )
bli_cntx_set_l1v_kers
(
10,
#if 1
// amaxv
BLIS_AMAXV_KER, BLIS_FLOAT, bli_samaxv_zen_int,
BLIS_AMAXV_KER, BLIS_DOUBLE, bli_damaxv_zen_int,
#endif
// axpyv
#if 0
BLIS_AXPYV_KER, BLIS_FLOAT, bli_saxpyv_zen_int,
@@ -106,9 +124,11 @@ void bli_cntx_init_haswell( cntx_t* cntx )
// dotv
BLIS_DOTV_KER, BLIS_FLOAT, bli_sdotv_zen_int,
BLIS_DOTV_KER, BLIS_DOUBLE, bli_ddotv_zen_int,
// dotxv
BLIS_DOTXV_KER, BLIS_FLOAT, bli_sdotxv_zen_int,
BLIS_DOTXV_KER, BLIS_DOUBLE, bli_ddotxv_zen_int,
// scalv
#if 0
BLIS_SCALV_KER, BLIS_FLOAT, bli_sscalv_zen_int,
@@ -162,9 +182,9 @@ void bli_cntx_init_haswell( cntx_t* cntx )
// Initialize sup thresholds with architecture-appropriate values.
// s d c z
bli_blksz_init_easy( &thresh[ BLIS_MT ], -1, 201, -1, -1 );
bli_blksz_init_easy( &thresh[ BLIS_NT ], -1, 201, -1, -1 );
bli_blksz_init_easy( &thresh[ BLIS_KT ], -1, 201, -1, -1 );
bli_blksz_init_easy( &thresh[ BLIS_MT ], 201, 201, -1, -1 );
bli_blksz_init_easy( &thresh[ BLIS_NT ], 201, 201, -1, -1 );
bli_blksz_init_easy( &thresh[ BLIS_KT ], 201, 201, -1, -1 );
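The raised thresholds above now route single-precision problems to the sup (small/unpacked) path as well. A hedged sketch of how such thresholds gate the sup path; the real BLIS dispatch checks more conditions than this, and the helper name is made up:

```python
# Illustrative only: a problem is a sup candidate when every dimension
# falls below its per-datatype threshold (here the values
# MT = NT = KT = 201 from the table above).
def sup_thresholds_met(m, n, k, mt=201, nt=201, kt=201):
    return m < mt and n < nt and k < kt

assert sup_thresholds_met(100, 150, 200)       # small problem: try sup
assert not sup_thresholds_met(100, 150, 500)   # large k: conventional path
```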
// Initialize the context with the sup thresholds.
bli_cntx_set_l3_sup_thresh
@@ -189,7 +209,7 @@ void bli_cntx_init_haswell( cntx_t* cntx )
// Update the context with optimized small/unpacked gemm kernels.
bli_cntx_set_l3_sup_kers
(
8,
16,
//BLIS_RCR, BLIS_DOUBLE, bli_dgemmsup_r_haswell_ref,
BLIS_RRR, BLIS_DOUBLE, bli_dgemmsup_rv_haswell_asm_6x8m, TRUE,
BLIS_RRC, BLIS_DOUBLE, bli_dgemmsup_rd_haswell_asm_6x8m, TRUE,
@@ -199,18 +219,27 @@ void bli_cntx_init_haswell( cntx_t* cntx )
BLIS_CRC, BLIS_DOUBLE, bli_dgemmsup_rd_haswell_asm_6x8n, TRUE,
BLIS_CCR, BLIS_DOUBLE, bli_dgemmsup_rv_haswell_asm_6x8n, TRUE,
BLIS_CCC, BLIS_DOUBLE, bli_dgemmsup_rv_haswell_asm_6x8n, TRUE,
BLIS_RRR, BLIS_FLOAT, bli_sgemmsup_rv_haswell_asm_6x16m, TRUE,
BLIS_RRC, BLIS_FLOAT, bli_sgemmsup_rd_haswell_asm_6x16m, TRUE,
BLIS_RCR, BLIS_FLOAT, bli_sgemmsup_rv_haswell_asm_6x16m, TRUE,
BLIS_RCC, BLIS_FLOAT, bli_sgemmsup_rv_haswell_asm_6x16n, TRUE,
BLIS_CRR, BLIS_FLOAT, bli_sgemmsup_rv_haswell_asm_6x16m, TRUE,
BLIS_CRC, BLIS_FLOAT, bli_sgemmsup_rd_haswell_asm_6x16n, TRUE,
BLIS_CCR, BLIS_FLOAT, bli_sgemmsup_rv_haswell_asm_6x16n, TRUE,
BLIS_CCC, BLIS_FLOAT, bli_sgemmsup_rv_haswell_asm_6x16n, TRUE,
cntx
);
// Initialize level-3 sup blocksize objects with architecture-specific
// values.
// s d c z
bli_blksz_init ( &blkszs[ BLIS_MR ], -1, 6, -1, -1,
-1, 9, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_NR ], -1, 8, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_MC ], -1, 72, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_KC ], -1, 256, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_NC ], -1, 4080, -1, -1 );
bli_blksz_init ( &blkszs[ BLIS_MR ], 6, 6, -1, -1,
9, 9, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_NR ], 16, 8, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_MC ], 168, 72, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_KC ], 256, 256, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_NC ], 4080, 4080, -1, -1 );
// Update the context with the current architecture's register and cache
// blocksizes for small/unpacked level-3 problems.
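The columns in these bli_blksz_* calls follow a fixed s/d/c/z datatype order, with -1 marking a slot left at its reference value; bli_blksz_init additionally takes a second row of maximum values (e.g. MR 6 with max 9). A hedged Python rendering of the single-precision sup table filled in above (the dictionary form is purely illustrative):

```python
# Single-precision (s) sup blocksizes registered above; -1 would mean
# "leave the reference value in place".
sup_blkszs_s = {"MR": 6, "NR": 16, "MC": 168, "KC": 256, "NC": 4080}

# Cache blocksizes are normally kept multiples of the register blocksizes
# so micropanels tile evenly.
assert sup_blkszs_s["MC"] % sup_blkszs_s["MR"] == 0
assert sup_blkszs_s["NC"] % sup_blkszs_s["NR"] == 0
```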


@@ -57,13 +57,13 @@ endif
ifeq ($(DEBUG_TYPE),noopt)
COPTFLAGS := -O0
else
COPTFLAGS := -O3
COPTFLAGS := -O2
endif
# Flags specific to optimized kernels.
# NOTE: The -fomit-frame-pointer option is needed for some kernels because
# they make explicit use of the rbp register.
CKOPTFLAGS := $(COPTFLAGS) -fomit-frame-pointer
CKOPTFLAGS := $(COPTFLAGS) -O3 -fomit-frame-pointer
ifeq ($(CC_VENDOR),gcc)
CKVECFLAGS := -mavx2 -mfma -mfpmath=sse -march=haswell
ifeq ($(GCC_OT_4_9_0),yes)


@@ -57,11 +57,11 @@ endif
ifeq ($(DEBUG_TYPE),noopt)
COPTFLAGS := -O0
else
COPTFLAGS := -O3
COPTFLAGS := -O2
endif
# Flags specific to optimized kernels.
CKOPTFLAGS := $(COPTFLAGS)
CKOPTFLAGS := $(COPTFLAGS) -O3
ifeq ($(CC_VENDOR),gcc)
CKVECFLAGS := -mssse3 -mfpmath=sse -march=core2
else


@@ -57,11 +57,11 @@ endif
ifeq ($(DEBUG_TYPE),noopt)
COPTFLAGS := -O0
else
COPTFLAGS := -O3
COPTFLAGS := -O2
endif
# Flags specific to optimized kernels.
CKOPTFLAGS := $(COPTFLAGS)
CKOPTFLAGS := $(COPTFLAGS) -O3
ifeq ($(CC_VENDOR),icc)
CKVECFLAGS :=
else


@@ -57,7 +57,7 @@ endif
ifeq ($(DEBUG_TYPE),noopt)
COPTFLAGS := -O0
else
COPTFLAGS := -O3
COPTFLAGS := -O2
endif
ifeq ($(DEBUG_TYPE),sde)
@@ -73,7 +73,7 @@ MK_ENABLE_MEMKIND := no
endif
# Flags specific to optimized kernels.
CKOPTFLAGS := $(COPTFLAGS)
CKOPTFLAGS := $(COPTFLAGS) -O3
ifeq ($(CC_VENDOR),gcc)
CKVECFLAGS := -mavx512f -mavx512pf -mfpmath=sse -march=knl
else


@@ -57,11 +57,11 @@ endif
ifeq ($(DEBUG_TYPE),noopt)
COPTFLAGS := -O0
else
COPTFLAGS := -O3
COPTFLAGS := -O2
endif
# Flags specific to optimized kernels.
CKOPTFLAGS := $(COPTFLAGS)
CKOPTFLAGS := $(COPTFLAGS) -O3
ifeq ($(CC_VENDOR),gcc)
CKVECFLAGS := -mssse3 -mfpmath=sse -march=core2
else


@@ -57,11 +57,11 @@ endif
ifeq ($(DEBUG_TYPE),noopt)
COPTFLAGS := -O0
else
COPTFLAGS := -O3
COPTFLAGS := -O2
endif
# Flags specific to optimized kernels.
CKOPTFLAGS := $(COPTFLAGS)
CKOPTFLAGS := $(COPTFLAGS) -O3
ifeq ($(CC_VENDOR),gcc)
CKVECFLAGS := -mfpmath=sse -mavx -mfma -march=bdver2 -mno-fma4 -mno-tbm -mno-xop -mno-lwp
else


@@ -0,0 +1,144 @@
/*
BLIS
An object-based framework for developing high-performance BLAS-like
libraries.
Copyright (C) 2019, The University of Texas at Austin
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
- Neither the name(s) of the copyright holder(s) nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include "blis.h"
// Instantiate prototypes for packm kernels.
PACKM_KER_PROT( float, s, packm_6xk_bb4_power10_ref )
PACKM_KER_PROT( double, d, packm_6xk_bb2_power10_ref )
// Instantiate prototypes for level-3 kernels.
GEMM_UKR_PROT( float, s, gemmbb_power10_ref )
GEMMTRSM_UKR_PROT( float, s, gemmtrsmbb_l_power10_ref )
GEMMTRSM_UKR_PROT( float, s, gemmtrsmbb_u_power10_ref )
TRSM_UKR_PROT( float, s, trsmbb_l_power10_ref )
TRSM_UKR_PROT( float, s, trsmbb_u_power10_ref )
GEMM_UKR_PROT( double, d, gemmbb_power10_ref )
GEMMTRSM_UKR_PROT( double, d, gemmtrsmbb_l_power10_ref )
GEMMTRSM_UKR_PROT( double, d, gemmtrsmbb_u_power10_ref )
TRSM_UKR_PROT( double, d, trsmbb_l_power10_ref )
TRSM_UKR_PROT( double, d, trsmbb_u_power10_ref )
GEMM_UKR_PROT( scomplex, c, gemmbb_power10_ref )
GEMMTRSM_UKR_PROT( scomplex, c, gemmtrsmbb_l_power10_ref )
GEMMTRSM_UKR_PROT( scomplex, c, gemmtrsmbb_u_power10_ref )
TRSM_UKR_PROT( scomplex, c, trsmbb_l_power10_ref )
TRSM_UKR_PROT( scomplex, c, trsmbb_u_power10_ref )
GEMM_UKR_PROT( dcomplex, z, gemmbb_power10_ref )
GEMMTRSM_UKR_PROT( dcomplex, z, gemmtrsmbb_l_power10_ref )
GEMMTRSM_UKR_PROT( dcomplex, z, gemmtrsmbb_u_power10_ref )
TRSM_UKR_PROT( dcomplex, z, trsmbb_l_power10_ref )
TRSM_UKR_PROT( dcomplex, z, trsmbb_u_power10_ref )
void bli_cntx_init_power10( cntx_t* cntx )
{
blksz_t blkszs[ BLIS_NUM_BLKSZS ];
// Set default kernel blocksizes and functions.
bli_cntx_init_power10_ref( cntx );
// -------------------------------------------------------------------------
// Update the context with optimized native gemm micro-kernels and
// their storage preferences.
bli_cntx_set_l3_nat_ukrs
(
12,
BLIS_GEMM_UKR, BLIS_FLOAT, bli_sgemm_power10_mma_8x16, TRUE,
BLIS_TRSM_L_UKR, BLIS_FLOAT, bli_strsmbb_l_power10_ref, FALSE,
BLIS_TRSM_U_UKR, BLIS_FLOAT, bli_strsmbb_u_power10_ref, FALSE,
BLIS_GEMM_UKR, BLIS_DOUBLE, bli_dgemm_power10_mma_8x8, TRUE,
BLIS_TRSM_L_UKR, BLIS_DOUBLE, bli_dtrsmbb_l_power10_ref, FALSE,
BLIS_TRSM_U_UKR, BLIS_DOUBLE, bli_dtrsmbb_u_power10_ref, FALSE,
BLIS_GEMM_UKR, BLIS_SCOMPLEX, bli_cgemmbb_power10_ref, FALSE,
BLIS_TRSM_L_UKR, BLIS_SCOMPLEX, bli_ctrsmbb_l_power10_ref, FALSE,
BLIS_TRSM_U_UKR, BLIS_SCOMPLEX, bli_ctrsmbb_u_power10_ref, FALSE,
BLIS_GEMM_UKR, BLIS_DCOMPLEX, bli_zgemmbb_power10_ref, FALSE,
BLIS_TRSM_L_UKR, BLIS_DCOMPLEX, bli_ztrsmbb_l_power10_ref, FALSE,
BLIS_TRSM_U_UKR, BLIS_DCOMPLEX, bli_ztrsmbb_u_power10_ref, FALSE,
cntx
);
// Update the context with customized virtual [gemm]trsm micro-kernels.
bli_cntx_set_l3_vir_ukrs
(
8,
BLIS_GEMMTRSM_L_UKR, BLIS_FLOAT, bli_sgemmtrsmbb_l_power10_ref,
BLIS_GEMMTRSM_U_UKR, BLIS_FLOAT, bli_sgemmtrsmbb_u_power10_ref,
BLIS_GEMMTRSM_L_UKR, BLIS_DOUBLE, bli_dgemmtrsmbb_l_power10_ref,
BLIS_GEMMTRSM_U_UKR, BLIS_DOUBLE, bli_dgemmtrsmbb_u_power10_ref,
BLIS_GEMMTRSM_L_UKR, BLIS_SCOMPLEX, bli_cgemmtrsmbb_l_power10_ref,
BLIS_GEMMTRSM_U_UKR, BLIS_SCOMPLEX, bli_cgemmtrsmbb_u_power10_ref,
BLIS_GEMMTRSM_L_UKR, BLIS_DCOMPLEX, bli_zgemmtrsmbb_l_power10_ref,
BLIS_GEMMTRSM_U_UKR, BLIS_DCOMPLEX, bli_zgemmtrsmbb_u_power10_ref,
cntx
);
// Update the context with optimized packm kernels.
bli_cntx_set_packm_kers
(
2,
BLIS_PACKM_6XK_KER, BLIS_FLOAT, bli_spackm_6xk_bb4_power10_ref,
BLIS_PACKM_6XK_KER, BLIS_DOUBLE, bli_dpackm_6xk_bb2_power10_ref,
cntx
);
// s d c z
bli_blksz_init_easy( &blkszs[ BLIS_MR ], 8, 8, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_NR ], 16, 8, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_MC ], 832, 320, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_KC ], 1026, 960, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_NC ], 4096, 4096, -1, -1 );
// Update the context with the current architecture's register and cache
// blocksizes (and multiples) for native execution.
bli_cntx_set_blkszs
(
BLIS_NAT, 5,
// level-3
BLIS_NC, &blkszs[ BLIS_NC ], BLIS_NR,
BLIS_KC, &blkszs[ BLIS_KC ], BLIS_KR,
BLIS_MC, &blkszs[ BLIS_MC ], BLIS_MR,
BLIS_NR, &blkszs[ BLIS_NR ], BLIS_NR,
BLIS_MR, &blkszs[ BLIS_MR ], BLIS_MR,
cntx
);
}
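The function above follows the standard BLIS context-initialization pattern: populate every slot with reference kernels, then overwrite selected slots with optimized ones (here, MMA gemm microkernels for s and d only, while c and z stay on reference bb kernels). A hedged Python sketch of that pattern; the dictionary and helper name are illustrative, not BLIS internals:

```python
# Start from reference defaults, then override only the slots that have
# optimized implementations; untouched slots keep the reference kernel.
def init_power10_cntx():
    cntx = {dt: "reference" for dt in ("s", "d", "c", "z")}
    cntx["s"] = "bli_sgemm_power10_mma_8x16"  # optimized, from the code above
    cntx["d"] = "bli_dgemm_power10_mma_8x8"
    return cntx

cntx = init_power10_cntx()
assert cntx["c"] == "reference"  # complex gemm stays on the reference path
```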


@@ -0,0 +1,39 @@
/*
BLIS
An object-based framework for developing high-performance BLAS-like
libraries.
Copyright (C) 2019, The University of Texas at Austin
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
- Neither the name(s) of the copyright holder(s) nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#define BLIS_POOL_ADDR_ALIGN_SIZE_A 4096
#define BLIS_POOL_ADDR_ALIGN_SIZE_B 4096
#define BLIS_POOL_ADDR_OFFSET_SIZE_A 192
#define BLIS_POOL_ADDR_OFFSET_SIZE_B 152
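These macros request 4 KiB base alignment for the A and B packing pools plus small distinct offsets (192 and 152 bytes), a common trick to keep the two packed buffers from starting at the same cache-set index. A hedged sketch of the usual align-then-offset arithmetic (not the actual BLIS pool code):

```python
# Round an address up to the alignment boundary, then add the offset, so
# the buffer start is congruent to `offset` modulo `align`.
def align_with_offset(addr, align, offset):
    aligned = (addr + align - 1) & ~(align - 1)  # round up to multiple of align
    return aligned + offset

a_start = align_with_offset(0x12345, 4096, 192)   # pool A
b_start = align_with_offset(0x12345, 4096, 152)   # pool B
assert a_start % 4096 == 192
assert b_start % 4096 == 152
# The differing offsets keep the two buffers from starting at the same
# page-relative position (and hence the same cache-set index).
assert a_start != b_start
```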


@@ -0,0 +1,82 @@
#
#
# BLIS
# An object-based framework for developing high-performance BLAS-like
# libraries.
#
# Copyright (C) 2019, The University of Texas at Austin
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are
# met:
# - Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# - Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# - Neither the name(s) of the copyright holder(s) nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
# HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#
# Declare the name of the current configuration and add it to the
# running list of configurations included by common.mk.
THIS_CONFIG := power10
#CONFIGS_INCL += $(THIS_CONFIG)
#
# --- Determine the C compiler and related flags ---
#
# NOTE: The build system will append these variables with various
# general-purpose/configuration-agnostic flags in common.mk. You
# may specify additional flags here as needed.
CPPROCFLAGS :=
CMISCFLAGS :=
CPICFLAGS :=
CWARNFLAGS :=
ifneq ($(DEBUG_TYPE),off)
CDBGFLAGS := -g
endif
ifeq ($(DEBUG_TYPE),noopt)
COPTFLAGS := -O0
else
COPTFLAGS := -O2
endif
# Flags specific to optimized kernels.
CKOPTFLAGS := $(COPTFLAGS) -O3
ifeq ($(CC_VENDOR),gcc)
CKVECFLAGS := -mcpu=power10 -mtune=power10
else
ifeq ($(CC_VENDOR),clang)
CKVECFLAGS := -mcpu=power10 -mtune=power10
else
$(info $(CC_VENDOR))
$(error gcc or clang is required for this configuration.)
endif
endif
# Flags specific to reference kernels.
CROPTFLAGS := $(CKOPTFLAGS)
CRVECFLAGS := $(CKVECFLAGS)
# Store all of the variables here to new variables containing the
# configuration name.
$(eval $(call store-make-defs,$(THIS_CONFIG)))


@@ -57,11 +57,11 @@ endif
ifeq ($(DEBUG_TYPE),noopt)
COPTFLAGS := -O0
else
COPTFLAGS := -O3 -mtune=power7
COPTFLAGS := -O2 -mtune=power7
endif
# Flags specific to optimized kernels.
CKOPTFLAGS := $(COPTFLAGS)
CKOPTFLAGS := $(COPTFLAGS) -O3
ifeq ($(CC_VENDOR),gcc)
CKVECFLAGS := -mvsx
else


@@ -58,11 +58,11 @@ endif
ifeq ($(DEBUG_TYPE),noopt)
COPTFLAGS := -O0
else
COPTFLAGS := -O3
COPTFLAGS := -O2
endif
# Flags specific to optimized kernels.
CKOPTFLAGS := $(COPTFLAGS)
CKOPTFLAGS := $(COPTFLAGS) -O3
ifeq ($(CC_VENDOR),gcc)
CKVECFLAGS := -mcpu=power9 -mtune=power9 -DXLC=0
else


@@ -57,11 +57,11 @@ endif
ifeq ($(DEBUG_TYPE),noopt)
COPTFLAGS := -O0
else
COPTFLAGS := -O3
COPTFLAGS := -O2
endif
# Flags specific to optimized kernels.
CKOPTFLAGS := $(COPTFLAGS)
CKOPTFLAGS := $(COPTFLAGS) -O3
ifeq ($(CC_VENDOR),gcc)
CKVECFLAGS := -mavx -mfpmath=sse -march=sandybridge
ifeq ($(GCC_OT_4_9_0),yes)


@@ -106,8 +106,8 @@ void bli_cntx_init_skx( cntx_t* cntx )
bli_blksz_init_easy( &blkszs[ BLIS_MR ], 32, 16, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_NR ], 12, 14, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_MC ], 480, 240, -1, -1 );
bli_blksz_init ( &blkszs[ BLIS_KC ], 384, 384, -1, -1,
480, 480, -1, -1 );
bli_blksz_init ( &blkszs[ BLIS_KC ], 384, 256, -1, -1,
480, 320, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_NC ], 3072, 3752, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_AF ], 8, 8, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_DF ], 8, 8, -1, -1 );


@@ -57,13 +57,14 @@ endif
ifeq ($(DEBUG_TYPE),noopt)
COPTFLAGS := -O0
else
COPTFLAGS := -O3
COPTFLAGS := -O2
endif
# Flags specific to optimized kernels.
# NOTE: The -fomit-frame-pointer option is needed for some kernels because
# they make explicit use of the rbp register.
CKOPTFLAGS := $(COPTFLAGS) -fomit-frame-pointer
CKOPTFLAGS := $(COPTFLAGS) -O3 -fomit-frame-pointer
ifeq ($(CC_VENDOR),gcc)
CKVECFLAGS := -mavx512f -mavx512dq -mavx512bw -mavx512vl -mfpmath=sse -march=skylake-avx512
else


@@ -57,11 +57,11 @@ endif
ifeq ($(DEBUG_TYPE),noopt)
COPTFLAGS := -O0
else
COPTFLAGS := -O3
COPTFLAGS := -O2
endif
# Flags specific to optimized kernels.
CKOPTFLAGS := $(COPTFLAGS)
CKOPTFLAGS := $(COPTFLAGS) -O3
ifeq ($(CC_VENDOR),gcc)
CKVECFLAGS := -mfpmath=sse -mavx -mfma -march=bdver3 -mno-fma4 -mno-tbm -mno-xop -mno-lwp
else


@@ -57,11 +57,11 @@ endif
ifeq ($(DEBUG_TYPE),noopt)
COPTFLAGS := -O0
else
COPTFLAGS := -O3
COPTFLAGS := -O2
endif
# Flags specific to optimized kernels.
CKOPTFLAGS := $(COPTFLAGS)
CKOPTFLAGS := $(COPTFLAGS) -O3
CKVECFLAGS :=
# Flags specific to reference kernels.


@@ -57,13 +57,13 @@ endif
ifeq ($(DEBUG_TYPE),noopt)
COPTFLAGS := -O0
else
COPTFLAGS := -O3 -ftree-vectorize -mtune=thunderx2t99
COPTFLAGS := -O2 -mcpu=thunderx2t99
endif
# Flags specific to optimized kernels.
CKOPTFLAGS := $(COPTFLAGS)
CKOPTFLAGS := $(COPTFLAGS) -O3 -ftree-vectorize
ifeq ($(CC_VENDOR),gcc)
CKVECFLAGS := -march=armv8.1-a+fp+simd -mcpu=thunderx2t99
CKVECFLAGS := -mcpu=thunderx2t99
else
$(error gcc is required for this configuration.)
endif


@@ -57,11 +57,11 @@ endif
ifeq ($(DEBUG_TYPE),noopt)
COPTFLAGS := -O0
else
COPTFLAGS := -O3
COPTFLAGS := -O2
endif
# Flags specific to optimized kernels.
CKOPTFLAGS := $(COPTFLAGS)
CKOPTFLAGS := $(COPTFLAGS) -O3
ifeq ($(CC_VENDOR),gcc)
CKVECFLAGS := -mssse3 -mfpmath=sse -march=core2
else


@@ -49,11 +49,13 @@ endif
ifeq ($(DEBUG_TYPE),noopt)
COPTFLAGS := -O0
else
COPTFLAGS := -O3 -fomit-frame-pointer
COPTFLAGS := -O2 -fomit-frame-pointer
endif
# Flags specific to optimized kernels.
CKOPTFLAGS := $(COPTFLAGS)
# NOTE: The -fomit-frame-pointer option is needed for some kernels because
# they make explicit use of the rbp register.
CKOPTFLAGS := $(COPTFLAGS) -O3
ifeq ($(CC_VENDOR),gcc)
CKVECFLAGS := -mavx2 -mfpmath=sse -mfma
else


@@ -1,9 +1,7 @@
/*
BLIS
An object-based framework for developing high-performance BLAS-like
libraries.
Copyright (C) 2014, The University of Texas at Austin
Copyright (C) 2018 - 2021, Advanced Micro Devices, Inc. All rights reserved.
@@ -18,7 +16,6 @@
- Neither the name(s) of the copyright holder(s) nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
@@ -30,147 +27,157 @@
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include "blis.h"
void bli_cntx_init_zen2( cntx_t* cntx )
{
blksz_t blkszs[ BLIS_NUM_BLKSZS ];
blksz_t thresh[ BLIS_NUM_THRESH ];
// -------------------------------------------------------------------------
// Set default kernel blocksizes and functions.
bli_cntx_init_zen2_ref( cntx );
// -------------------------------------------------------------------------
// packm kernels
bli_cntx_set_packm_kers
(
2,
BLIS_PACKM_8XK_KER, BLIS_DOUBLE, bli_dpackm_haswell_asm_8xk,
BLIS_PACKM_6XK_KER, BLIS_DOUBLE, bli_dpackm_haswell_asm_6xk,
cntx
);
// Update the context with optimized native gemm micro-kernels and
// their storage preferences.
bli_cntx_set_l3_nat_ukrs
(
8,
// gemm
BLIS_GEMM_UKR, BLIS_FLOAT, bli_sgemm_haswell_asm_6x16, TRUE,
BLIS_GEMM_UKR, BLIS_DOUBLE, bli_dgemm_haswell_asm_6x8, TRUE,
BLIS_GEMM_UKR, BLIS_SCOMPLEX, bli_cgemm_haswell_asm_3x8, TRUE,
BLIS_GEMM_UKR, BLIS_DCOMPLEX, bli_zgemm_haswell_asm_3x4, TRUE,
// gemmtrsm_l
BLIS_GEMMTRSM_L_UKR, BLIS_FLOAT, bli_sgemmtrsm_l_haswell_asm_6x16, TRUE,
BLIS_GEMMTRSM_L_UKR, BLIS_DOUBLE, bli_dgemmtrsm_l_haswell_asm_6x8, TRUE,
// gemmtrsm_u
BLIS_GEMMTRSM_U_UKR, BLIS_FLOAT, bli_sgemmtrsm_u_haswell_asm_6x16, TRUE,
BLIS_GEMMTRSM_U_UKR, BLIS_DOUBLE, bli_dgemmtrsm_u_haswell_asm_6x8, TRUE,
cntx
);
// Update the context with optimized level-1f kernels.
bli_cntx_set_l1f_kers
(
6,
// axpyf
BLIS_AXPYF_KER, BLIS_FLOAT, bli_saxpyf_zen_int_5,
BLIS_AXPYF_KER, BLIS_DOUBLE, bli_daxpyf_zen_int_5,
BLIS_AXPYF_KER, BLIS_SCOMPLEX, bli_caxpyf_zen_int_5,
BLIS_AXPYF_KER, BLIS_DCOMPLEX, bli_zaxpyf_zen_int_5,
// dotxf
BLIS_DOTXF_KER, BLIS_FLOAT, bli_sdotxf_zen_int_8,
BLIS_DOTXF_KER, BLIS_DOUBLE, bli_ddotxf_zen_int_8,
cntx
);
// Update the context with optimized level-1v kernels.
bli_cntx_set_l1v_kers
(
20,
#if 1
// amaxv
BLIS_AMAXV_KER, BLIS_FLOAT, bli_samaxv_zen_int,
BLIS_AMAXV_KER, BLIS_DOUBLE, bli_damaxv_zen_int,
#endif
// axpyv
BLIS_AXPYV_KER, BLIS_FLOAT, bli_saxpyv_zen_int10,
BLIS_AXPYV_KER, BLIS_DOUBLE, bli_daxpyv_zen_int10,
BLIS_AXPYV_KER, BLIS_SCOMPLEX, bli_caxpyv_zen_int5,
BLIS_AXPYV_KER, BLIS_DCOMPLEX, bli_zaxpyv_zen_int5,
// dotv
BLIS_DOTV_KER, BLIS_FLOAT, bli_sdotv_zen_int10,
BLIS_DOTV_KER, BLIS_DOUBLE, bli_ddotv_zen_int10,
BLIS_DOTV_KER, BLIS_SCOMPLEX, bli_cdotv_zen_int5,
BLIS_DOTV_KER, BLIS_DCOMPLEX, bli_zdotv_zen_int5,
// dotxv
BLIS_DOTXV_KER, BLIS_FLOAT, bli_sdotxv_zen_int,
BLIS_DOTXV_KER, BLIS_DOUBLE, bli_ddotxv_zen_int,
// scalv
BLIS_SCALV_KER, BLIS_FLOAT, bli_sscalv_zen_int10,
BLIS_SCALV_KER, BLIS_DOUBLE, bli_dscalv_zen_int10,
// swapv
BLIS_SWAPV_KER, BLIS_FLOAT, bli_sswapv_zen_int8,
BLIS_SWAPV_KER, BLIS_DOUBLE, bli_dswapv_zen_int8,
// copyv
BLIS_COPYV_KER, BLIS_FLOAT, bli_scopyv_zen_int,
BLIS_COPYV_KER, BLIS_DOUBLE, bli_dcopyv_zen_int,
// setv
BLIS_SETV_KER, BLIS_FLOAT, bli_ssetv_zen_int,
BLIS_SETV_KER, BLIS_DOUBLE, bli_dsetv_zen_int,
cntx
);
// Initialize level-3 blocksize objects with architecture-specific values.
// s d c z
bli_blksz_init_easy( &blkszs[ BLIS_MR ], 6, 6, 3, 3 );
bli_blksz_init_easy( &blkszs[ BLIS_NR ], 16, 8, 8, 4 );
#if AOCL_BLIS_MULTIINSTANCE
bli_blksz_init_easy( &blkszs[ BLIS_MC ], 144, 240, 144, 72 );
bli_blksz_init_easy( &blkszs[ BLIS_KC ], 256, 512, 256, 256 );
bli_blksz_init_easy( &blkszs[ BLIS_NC ], 4080, 2040, 4080, 4080 );
#else
bli_blksz_init_easy( &blkszs[ BLIS_MC ], 144, 72, 144, 72 );
bli_blksz_init_easy( &blkszs[ BLIS_KC ], 256, 256, 256, 256 );
bli_blksz_init_easy( &blkszs[ BLIS_NC ], 4080, 4080, 4080, 4080 );
#endif
bli_blksz_init_easy( &blkszs[ BLIS_AF ], 5, 5, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_DF ], 8, 8, -1, -1 );
// Update the context with the current architecture's register and cache
// blocksizes (and multiples) for native execution.
bli_cntx_set_blkszs
(
BLIS_NAT, 7,
// level-3
BLIS_NC, &blkszs[ BLIS_NC ], BLIS_NR,
BLIS_KC, &blkszs[ BLIS_KC ], BLIS_KR,
BLIS_MC, &blkszs[ BLIS_MC ], BLIS_MR,
BLIS_NR, &blkszs[ BLIS_NR ], BLIS_NR,
BLIS_MR, &blkszs[ BLIS_MR ], BLIS_MR,
// level-1f
BLIS_AF, &blkszs[ BLIS_AF ], BLIS_AF,
BLIS_DF, &blkszs[ BLIS_DF ], BLIS_DF,
cntx
);
// -------------------------------------------------------------------------
//Initialize TRSM blocksize objects with architecture-specific values.
//Using different cache block sizes for TRSM instead of common level-3 block sizes.
//Tuning is done for double-precision only.
// s d c z
bli_blksz_init_easy( &blkszs[ BLIS_MC ], 144, 72, 144, 72 );
bli_blksz_init_easy( &blkszs[ BLIS_KC ], 256, 492, 256, 256 );
bli_blksz_init_easy( &blkszs[ BLIS_NC ], 4080, 1600, 4080, 4080 );
@@ -193,64 +200,65 @@ void bli_cntx_init_zen2( cntx_t* cntx )
bli_blksz_init_easy( &thresh[ BLIS_NT ], 200, 256, 256, 128 );
bli_blksz_init_easy( &thresh[ BLIS_KT ], 240, 220, 220, 110 );
// Initialize the context with the sup thresholds.
bli_cntx_set_l3_sup_thresh
(
3,
BLIS_MT, &thresh[ BLIS_MT ],
BLIS_NT, &thresh[ BLIS_NT ],
BLIS_KT, &thresh[ BLIS_KT ],
cntx
);
// Initialize the context with the sup handlers.
bli_cntx_set_l3_sup_handlers
(
2,
BLIS_GEMM, bli_gemmsup_ref,
BLIS_GEMMT, bli_gemmtsup_ref,
cntx
);
// Update the context with optimized small/unpacked gemm kernels.
bli_cntx_set_l3_sup_kers
(
28,
//BLIS_RCR, BLIS_DOUBLE, bli_dgemmsup_r_haswell_ref,
BLIS_RRR, BLIS_DOUBLE, bli_dgemmsup_rv_haswell_asm_6x8m, TRUE,
BLIS_RRC, BLIS_DOUBLE, bli_dgemmsup_rd_haswell_asm_6x8m, TRUE,
BLIS_RCR, BLIS_DOUBLE, bli_dgemmsup_rv_haswell_asm_6x8m, TRUE,
BLIS_RCC, BLIS_DOUBLE, bli_dgemmsup_rv_haswell_asm_6x8n, TRUE,
BLIS_CRR, BLIS_DOUBLE, bli_dgemmsup_rv_haswell_asm_6x8m, TRUE,
BLIS_CRC, BLIS_DOUBLE, bli_dgemmsup_rd_haswell_asm_6x8n, TRUE,
BLIS_CCR, BLIS_DOUBLE, bli_dgemmsup_rv_haswell_asm_6x8n, TRUE,
BLIS_CCC, BLIS_DOUBLE, bli_dgemmsup_rv_haswell_asm_6x8n, TRUE,
BLIS_RRR, BLIS_FLOAT, bli_sgemmsup_rv_zen_asm_6x16m, TRUE,
BLIS_RRC, BLIS_FLOAT, bli_sgemmsup_rd_zen_asm_6x16m, TRUE,
BLIS_RCR, BLIS_FLOAT, bli_sgemmsup_rv_zen_asm_6x16m, TRUE,
BLIS_RCC, BLIS_FLOAT, bli_sgemmsup_rv_zen_asm_6x16n, TRUE,
BLIS_CRR, BLIS_FLOAT, bli_sgemmsup_rv_zen_asm_6x16m, TRUE,
BLIS_CRC, BLIS_FLOAT, bli_sgemmsup_rd_zen_asm_6x16n, TRUE,
BLIS_CCR, BLIS_FLOAT, bli_sgemmsup_rv_zen_asm_6x16n, TRUE,
BLIS_CCC, BLIS_FLOAT, bli_sgemmsup_rv_zen_asm_6x16n, TRUE,
BLIS_RRR, BLIS_SCOMPLEX, bli_cgemmsup_rv_zen_asm_3x8m, TRUE,
BLIS_RCR, BLIS_SCOMPLEX, bli_cgemmsup_rv_zen_asm_3x8m, TRUE,
BLIS_CRR, BLIS_SCOMPLEX, bli_cgemmsup_rv_zen_asm_3x8m, TRUE,
BLIS_RCC, BLIS_SCOMPLEX, bli_cgemmsup_rv_zen_asm_3x8n, TRUE,
BLIS_CCR, BLIS_SCOMPLEX, bli_cgemmsup_rv_zen_asm_3x8n, TRUE,
BLIS_CCC, BLIS_SCOMPLEX, bli_cgemmsup_rv_zen_asm_3x8n, TRUE,
BLIS_RRR, BLIS_DCOMPLEX, bli_zgemmsup_rv_zen_asm_3x4m, TRUE,
BLIS_RCR, BLIS_DCOMPLEX, bli_zgemmsup_rv_zen_asm_3x4m, TRUE,
BLIS_CRR, BLIS_DCOMPLEX, bli_zgemmsup_rv_zen_asm_3x4m, TRUE,
BLIS_RCC, BLIS_DCOMPLEX, bli_zgemmsup_rv_zen_asm_3x4n, TRUE,
BLIS_CCR, BLIS_DCOMPLEX, bli_zgemmsup_rv_zen_asm_3x4n, TRUE,
BLIS_CCC, BLIS_DCOMPLEX, bli_zgemmsup_rv_zen_asm_3x4n, TRUE,
cntx
);
// Initialize level-3 sup blocksize objects with architecture-specific
// values.
// s d c z
bli_blksz_init ( &blkszs[ BLIS_MR ], 6, 6, 3, 3,
9, 9, 3, 3 );
bli_blksz_init_easy( &blkszs[ BLIS_NR ], 16, 8, 8, 4 );
@@ -258,17 +266,17 @@ void bli_cntx_init_zen2( cntx_t* cntx )
bli_blksz_init_easy( &blkszs[ BLIS_KC ], 512, 256, 128, 64 );
bli_blksz_init_easy( &blkszs[ BLIS_NC ], 8160, 4080, 2040, 1020 );
// Update the context with the current architecture's register and cache
// blocksizes for small/unpacked level-3 problems.
bli_cntx_set_l3_sup_blkszs
(
5,
BLIS_NC, &blkszs[ BLIS_NC ],
BLIS_KC, &blkszs[ BLIS_KC ],
BLIS_MC, &blkszs[ BLIS_MC ],
BLIS_NR, &blkszs[ BLIS_NR ],
BLIS_MR, &blkszs[ BLIS_MR ],
cntx
);
}


@@ -48,11 +48,9 @@
#define BLIS_THREAD_MAX_IR 1
#define BLIS_THREAD_MAX_JR 1
#define BLIS_ENABLE_SMALL_MATRIX
#define BLIS_ENABLE_SMALL_MATRIX_TRSM
// This will select the threshold below which small matrix code will be called.
#define BLIS_SMALL_MATRIX_THRES 700
#define BLIS_SMALL_M_RECT_MATRIX_THRES 160


@@ -42,6 +42,7 @@ cortexa15: cortexa15/armv7a
cortexa9: cortexa9/armv7a
# IBM architectures.
power10: power10
power9: power9
bgq: bgq

configure (vendored)

@@ -168,6 +168,18 @@ print_usage()
echo " --disable-threading is specified, threading will be"
echo " disabled. The default is 'no'."
echo " "
echo " --enable-system, --disable-system"
echo " "
echo " Enable conventional operating system support, such as"
echo " pthreads for thread-safety. The default state is enabled."
echo " However, in rare circumstances you may wish to configure"
echo " BLIS for use with a minimal or nonexistent operating"
echo " system (e.g. hardware simulators). In these situations,"
echo " --disable-system may be used to jettison all compile-time"
echo " and link-time dependencies outside of the standard C"
echo " library. When disabled, this option also forces the use"
echo " of --disable-threading."
echo " "
echo " --disable-pba-pools, --enable-pba-pools"
echo " --disable-sba-pools, --enable-sba-pools"
echo " "
@@ -286,6 +298,20 @@ print_usage()
echo " which may be ignored in select situations if the"
echo " implementation has a good reason to do so."
echo " "
echo " --disable-trsm-preinversion, --enable-trsm-preinversion"
echo " "
echo " Disable (enabled by default) pre-inversion of triangular"
echo " matrix diagonals when performing trsm. When pre-inversion"
echo " is enabled, diagonal elements are inverted outside of the"
echo " microkernel (e.g. during packing) so that the microkernel"
echo " can use multiply instructions. When disabled, division"
echo " instructions are used within the microkernel. Executing"
echo " these division instructions within the microkernel will"
echo " incur a performance penalty, but numerical robustness will"
echo " improve for certain cases involving denormal numbers that"
echo " would otherwise result in overflow in the pre-inverted"
echo " values."
echo " "
echo " --force-version=STRING"
echo " "
echo " Force configure to use an arbitrary version string"
@@ -300,9 +326,6 @@ print_usage()
echo " a sanity check to make sure these lists are constituted"
echo " as expected."
echo " "
echo " --complex-return=gnu|intel"
echo " "
echo " Specify the way in which complex numbers are returned"
@@ -312,6 +335,9 @@ print_usage()
echo " attempt to determine the return type from the compiler."
echo " Otherwise, the default is \"gnu\"."
echo " "
echo " -q, --quiet Suppress informational output. By default, configure"
echo " is verbose. (NOTE: -q is not yet implemented)"
echo " "
echo " -h, --help Output this information and quit."
echo " "
echo " Environment Variables:"
@@ -1000,7 +1026,7 @@ select_tool()
auto_detect()
{
local cc cflags config_defines detected_config rval cmd
# Use the same compiler that was found earlier.
cc="${found_cc}"
@@ -1017,68 +1043,100 @@ auto_detect()
cflags=
fi
# Accumulate a list of source files we'll need to compile along with
# the top-level (root) directory in which they are located.
c_src_pairs=""
c_src_pairs="${c_src_pairs} frame:bli_arch.c"
c_src_pairs="${c_src_pairs} frame:bli_cpuid.c"
c_src_pairs="${c_src_pairs} frame:bli_env.c"
c_src_pairs="${c_src_pairs} build:config_detect.c"
# Accumulate a list of full filepaths to the source files listed above.
c_src_filepaths=""
for pair in ${c_src_pairs}; do
filename=${pair#*:}
rootdir=${pair%:*}
filepath=$(find ${dist_path}/${rootdir} -name "${filename}")
c_src_filepaths="${c_src_filepaths} ${filepath}"
done
# Accumulate a list of header files we'll need to locate along with
# the top-level (root) directory in which they are located.
c_hdr_pairs=""
c_hdr_pairs="${c_hdr_pairs} frame:bli_system.h"
c_hdr_pairs="${c_hdr_pairs} frame:bli_type_defs.h"
c_hdr_pairs="${c_hdr_pairs} frame:bli_arch.h"
c_hdr_pairs="${c_hdr_pairs} frame:bli_cpuid.h"
c_hdr_pairs="${c_hdr_pairs} frame:bli_env.h"
# NOTE: These headers are needed by bli_type_defs.h.
c_hdr_pairs="${c_hdr_pairs} frame:bli_malloc.h"
c_hdr_pairs="${c_hdr_pairs} frame:bli_pthread.h"
# Accumulate a list of full paths to the header files listed above.
# While we are at it, we include the "-I" compiler option to indicate
# adding the path to the list of directories to search when encountering
# #include directives.
c_hdr_paths=""
for pair in ${c_hdr_pairs}; do
filename=${pair#*:}
rootdir=${pair%:*}
filepath=$(find ${dist_path}/${rootdir} -name "${filename}")
path=${filepath%/*}
c_hdr_paths="${c_hdr_paths} -I${path}"
done
# Define the executable name.
autodetect_x="auto-detect.x"
# Create #defines for all of the BLIS_CONFIG_ macros in bli_cpuid.c.
bli_cpuid_c_filepath=$(find ${dist_path}/frame -name "bli_cpuid.c")
config_defines=$(grep BLIS_CONFIG_ ${bli_cpuid_c_filepath} \
| sed -e 's/#ifdef /-D/g')
# Set the linker flags. We typically need pthreads (or BLIS's home-rolled
# equivalent) because it is needed for parts of bli_arch.c unrelated to
# bli_arch_string(), which is called by the main() function in ${main_c}.
if [[ "$is_win" == "no" || "$cc_vendor" != "clang" ]]; then
ldflags="${LIBPTHREAD--lpthread}"
fi
# However, if --disable-system was given, we override the choice made above
# and do not use any pthread link flags.
if [[ "$enable_system" == "no" ]]; then
ldflags=
fi
# Compile the auto-detect program using source code inside the
# framework.
# NOTE: -D_GNU_SOURCE is needed to enable POSIX extensions to
# pthreads (i.e., barriers).
cmd="${cc} ${config_defines} \
-DBLIS_CONFIGURETIME_CPUID \
${c_hdr_paths} \
-std=c99 -D_GNU_SOURCE \
${cflags} \
${c_src_filepaths} \
${ldflags} \
-o ${autodetect_x}"
if [ "${debug_auto_detect}" == "no" ]; then
# Execute the compilation command.
eval ${cmd}
else
# Debugging stuff. Instead of executing ${cmd}, join the lines together
# with tr and trim excess whitespace via awk.
cmd=$(echo "${cmd}" | tr '\n' ' ' | awk '{$1=$1;print}')
echo "${cmd}"
return
fi
# Run the auto-detect program.
detected_config=$(./${autodetect_x})
@@ -1333,13 +1391,17 @@ get_compiler_version()
# isolate the version number.
# The last part ({ read first rest ; echo $first ; }) is a workaround
# to OS X's egrep only returning the first match.
cc_vendor=$(echo "${vendor_string}" | egrep -o 'icc|gcc|clang|emcc|pnacl|IBM|oneAPI|crosstool-NG' | { read first rest ; echo $first ; })
if [ "${cc_vendor}" = "crosstool-NG" ]; then
# Treat compilers built by crosstool-NG (e.g., conda) as gcc.
cc_vendor="gcc"
fi
if [ "${cc_vendor}" = "icc" -o \
"${cc_vendor}" = "gcc" ]; then
cc_version=$(${cc} -dumpversion)
# If compiler is AOCC, first grep for clang and then the version number.
elif [ "${cc_vendor}" = "clang" ]; then
cc_version=$(echo "${vendor_string}" | egrep -o 'clang version [0-9]+\.[0-9]+\.?[0-9]*' | egrep -o '[0-9]+\.[0-9]+\.?[0-9]*')
elif [ "${cc_vendor}" = "oneAPI" ]; then
# Treat Intel oneAPI's clang as clang, not icc.
cc_vendor="clang"
@@ -1463,6 +1525,15 @@ check_compiler()
blacklistcc_add "skx"
fi
fi
if [ ${cc_major} -eq 18 ]; then
echo "${script_name}: ${cc} ${cc_version} is known to cause erroneous results. See https://github.com/flame/blis/issues/371 for details."
blacklistcc_add "knl"
blacklistcc_add "skx"
fi
if [ ${cc_major} -ge 19 ]; then
echo "${script_name}: ${cc} ${cc_version} is known to cause erroneous results. See https://github.com/flame/blis/issues/371 for details."
echoerr_unsupportedcc
fi
fi
# clang
@@ -1934,6 +2005,9 @@ main()
debug_type=''
debug_flag=''
# The system flag.
enable_system='yes'
# The threading flag.
threading_model='off'
@@ -1961,6 +2035,7 @@ main()
enable_mixed_dt_extra_mem='yes'
enable_sup_handling='yes'
enable_memkind='' # The default memkind value is determined later on.
enable_trsm_preinversion='yes'
force_version='no'
complex_return='default'
@@ -1993,6 +2068,13 @@ main()
# source distribution directory.
dummy_file='_blis_dir_detect.tmp'
# -- Debugging --
# A global flag to help debug the compilation command for the executable
# that configure builds on-the-fly to perform hardware auto-detection.
debug_auto_detect="no"
# -- Command line option/argument parsing ----------------------------------
@@ -2069,6 +2151,12 @@ main()
export-shared=*)
export_shared=${OPTARG#*=}
;;
enable-system)
enable_system='yes'
;;
disable-system)
enable_system='no'
;;
enable-threading=*)
threading_model=${OPTARG#*=}
;;
@@ -2145,6 +2233,12 @@ main()
without-memkind)
enable_memkind='no'
;;
enable-trsm-preinversion)
enable_trsm_preinversion='yes'
;;
disable-trsm-preinversion)
enable_trsm_preinversion='no'
;;
force-version=*)
force_version=${OPTARG#*=}
;;
@@ -2441,6 +2535,15 @@ main()
config_name=$(auto_detect)
#config_name="generic"
# Debugging stuff. When confirming the behavior of auto_detect(),
# it is useful to output ${config_name}, which in theory could be
# set temporarily to something other than the config_name, such as
# the compilation command.
if [ "${debug_auto_detect}" = "yes" ]; then
echo "auto-detect program compilation command: ${config_name}"
exit 1
fi
echo "${script_name}: hardware detection driver returned '${config_name}'."
# If the auto-detect code returned the "generic" string, it means we
@@ -2543,11 +2646,11 @@ main()
reducedclist="${kernel}"
# Otherwise, use the last name.
else
last_config=${configs##* }
reducedclist="${last_config}"
fi
# Create a new "kernel:subconfig" pair and add it to the kconfig_map
@@ -2819,6 +2922,19 @@ main()
exit 1
fi
# Check if we are building with or without operating system support.
if [ "x${enable_system}" = "xyes" ]; then
echo "${script_name}: enabling operating system support."
enable_system_01=1
else
echo "${script_name}: disabling operating system support."
echo "${script_name}: WARNING: all threading will be disabled!"
enable_system_01=0
# Force threading to be disabled.
threading_model='off'
fi
# Check the threading model flag and standardize its value, if needed.
# NOTE: 'omp' is deprecated but still supported; 'openmp' is preferred.
enable_openmp='no'
@@ -2962,6 +3078,13 @@ main()
echo "${script_name}: small matrix handling is disabled."
enable_sup_handling_01=0
fi
if [ "x${enable_trsm_preinversion}" = "xyes" ]; then
echo "${script_name}: trsm diagonal element pre-inversion is enabled."
enable_trsm_preinversion_01=1
else
echo "${script_name}: trsm diagonal element pre-inversion is disabled."
enable_trsm_preinversion_01=0
fi
# Report integer sizes.
if [ "x${int_type_size}" = "x32" ]; then
@@ -3008,8 +3131,7 @@ main()
enable_sandbox_01=0
fi
# Check the method used for returning complex numbers
if [ "x${complex_return}" = "xdefault" ]; then
if [ -n "${FC}" ]; then
@@ -3040,7 +3162,7 @@ main()
complex_return='gnu'
fi
fi
if [ "x${complex_return}" = "xgnu" ]; then
complex_return_intel01='0'
elif [ "x${complex_return}" = "xintel" ]; then
@@ -3080,7 +3202,13 @@ main()
# For Windows builds, clear the libpthread_esc variable so that
# no pthreads library is substituted into config.mk. (Windows builds
# employ an implementation of pthreads that is internal to BLIS.)
if [[ "$is_win" == "yes" && "$cc_vendor" == "clang" ]]; then
libpthread_esc=
fi
# We also clear the libpthread_esc variable for systemless builds
# (--disable-system).
if [[ "$enable_system" == "no" ]]; then
libpthread_esc=
fi
@@ -3103,7 +3231,7 @@ main()
enable_aocl_zen='yes'
enable_aocl_zen_01=1
else
enable_aocl_zen='no'
enable_aocl_zen_01=0
fi
@@ -3187,6 +3315,7 @@ main()
| sed -e "s/@cflags_preset@/${cflags_preset_esc}/g" \
| sed -e "s/@ldflags_preset@/${ldflags_preset_esc}/g" \
| sed -e "s/@debug_type@/${debug_type}/g" \
| sed -e "s/@enable_system@/${enable_system}/g" \
| sed -e "s/@threading_model@/${threading_model}/g" \
| sed -e "s/@prefix@/${prefix_esc}/g" \
| sed -e "s/@exec_prefix@/${exec_prefix_esc}/g" \
@@ -3218,6 +3347,7 @@ main()
| perl -pe "s/\@config_list_defines\@/${config_list_defines}/g" \
| perl -pe "s/\@kernel_list_defines\@/${kernel_list_defines}/g" \
| sed -e "s/\@enable_aocl_zen\@/${enable_aocl_zen_01}/g" \
| sed -e "s/@enable_system@/${enable_system_01}/g" \
| sed -e "s/@enable_openmp@/${enable_openmp_01}/g" \
| sed -e "s/@enable_pthreads@/${enable_pthreads_01}/g" \
| sed -e "s/@enable_jrir_slab@/${enable_jrir_slab_01}/g" \
@@ -3233,6 +3363,7 @@ main()
| sed -e "s/@enable_mixed_dt_extra_mem@/${enable_mixed_dt_extra_mem_01}/g" \
| sed -e "s/@enable_sup_handling@/${enable_sup_handling_01}/g" \
| sed -e "s/@enable_memkind@/${enable_memkind_01}/g" \
| sed -e "s/@enable_trsm_preinversion@/${enable_trsm_preinversion_01}/g" \
| sed -e "s/@enable_pragma_omp_simd@/${enable_pragma_omp_simd_01}/g" \
| sed -e "s/@enable_sandbox@/${enable_sandbox_01}/g" \
| sed -e "s/@enable_shared@/${enable_shared_01}/g" \


@@ -1,6 +1,7 @@
# Contents
* **[Contents](BLISObjectAPI.md#contents)**
* **[Operation index](BLISObjectAPI.md#operation-index)**
* **[Introduction](BLISObjectAPI.md#introduction)**
* [BLIS types](BLISObjectAPI.md#blis-types)
* [Integer-based types](BLISObjectAPI.md#integer-based-types)
@@ -15,8 +16,9 @@
* **[Object management](BLISObjectAPI.md#object-management)**
* [Object creation function reference](BLISObjectAPI.md#object-creation-function-reference)
* [Object accessor function reference](BLISObjectAPI.md#object-accessor-function-reference)
* [Object mutator function reference](BLISObjectAPI.md#object-mutator-function-reference)
* [Other object function reference](BLISObjectAPI.md#other-object-function-reference)
* **[Computational function reference](BLISObjectAPI.md#computational-function-reference)**
* [Level-1v operations](BLISObjectAPI.md#level-1v-operations)
* [Level-1d operations](BLISObjectAPI.md#level-1d-operations)
* [Level-1m operations](BLISObjectAPI.md#level-1m-operations)
@@ -24,14 +26,37 @@
* [Level-2 operations](BLISObjectAPI.md#level-2-operations)
* [Level-3 operations](BLISObjectAPI.md#level-3-operations)
* [Utility operations](BLISObjectAPI.md#utility-operations)
* [Level-3 microkernels](BLISObjectAPI.md#level-3-microkernels)
* **[Query function reference](BLISObjectAPI.md#query-function-reference)**
* [General library information](BLISObjectAPI.md#general-library-information)
* [Specific configuration](BLISObjectAPI.md#specific-configuration)
* [General configuration](BLISObjectAPI.md#general-configuration)
* [Kernel information](BLISObjectAPI.md#kernel-information)
* [Clock functions](BLISObjectAPI.md#clock-functions)
* **[Example code](BLISObjectAPI.md#example-code)**
# Operation index
This index provides a quick way to jump directly to the description for each operation discussed later in the [Computational function reference](BLISObjectAPI.md#computational-function-reference) section:
* **[Level-1v](BLISObjectAPI.md#level-1v-operations)**: Operations on vectors:
* [addv](BLISObjectAPI.md#addv), [amaxv](BLISObjectAPI.md#amaxv), [axpyv](BLISObjectAPI.md#axpyv), [axpbyv](BLISObjectAPI.md#axpbyv), [copyv](BLISObjectAPI.md#copyv), [dotv](BLISObjectAPI.md#dotv), [dotxv](BLISObjectAPI.md#dotxv), [invertv](BLISObjectAPI.md#invertv), [scal2v](BLISObjectAPI.md#scal2v), [scalv](BLISObjectAPI.md#scalv), [setv](BLISObjectAPI.md#setv), [setrv](BLISObjectAPI.md#setrv), [setiv](BLISObjectAPI.md#setiv), [subv](BLISObjectAPI.md#subv), [swapv](BLISObjectAPI.md#swapv), [xpbyv](BLISObjectAPI.md#xpbyv)
* **[Level-1d](BLISObjectAPI.md#level-1d-operations)**: Element-wise operations on matrix diagonals:
* [addd](BLISObjectAPI.md#addd), [axpyd](BLISObjectAPI.md#axpyd), [copyd](BLISObjectAPI.md#copyd), [invertd](BLISObjectAPI.md#invertd), [scald](BLISObjectAPI.md#scald), [scal2d](BLISObjectAPI.md#scal2d), [setd](BLISObjectAPI.md#setd), [setid](BLISObjectAPI.md#setid), [shiftd](BLISObjectAPI.md#shiftd), [subd](BLISObjectAPI.md#subd), [xpbyd](BLISObjectAPI.md#xpbyd)
* **[Level-1m](BLISObjectAPI.md#level-1m-operations)**: Element-wise operations on matrices:
* [addm](BLISObjectAPI.md#addm), [axpym](BLISObjectAPI.md#axpym), [copym](BLISObjectAPI.md#copym), [scalm](BLISObjectAPI.md#scalm), [scal2m](BLISObjectAPI.md#scal2m), [setm](BLISObjectAPI.md#setm), [setrm](BLISObjectAPI.md#setrm), [setim](BLISObjectAPI.md#setim), [subm](BLISObjectAPI.md#subm)
* **[Level-1f](BLISObjectAPI.md#level-1f-operations)**: Fused operations on multiple vectors:
* [axpy2v](BLISObjectAPI.md#axpy2v), [dotaxpyv](BLISObjectAPI.md#dotaxpyv), [axpyf](BLISObjectAPI.md#axpyf), [dotxf](BLISObjectAPI.md#dotxf), [dotxaxpyf](BLISObjectAPI.md#dotxaxpyf)
* **[Level-2](BLISObjectAPI.md#level-2-operations)**: Operations with one matrix and (at least) one vector operand:
* [gemv](BLISObjectAPI.md#gemv), [ger](BLISObjectAPI.md#ger), [hemv](BLISObjectAPI.md#hemv), [her](BLISObjectAPI.md#her), [her2](BLISObjectAPI.md#her2), [symv](BLISObjectAPI.md#symv), [syr](BLISObjectAPI.md#syr), [syr2](BLISObjectAPI.md#syr2), [trmv](BLISObjectAPI.md#trmv), [trsv](BLISObjectAPI.md#trsv)
* **[Level-3](BLISObjectAPI.md#level-3-operations)**: Operations with matrices that are multiplication-like:
* [gemm](BLISObjectAPI.md#gemm), [hemm](BLISObjectAPI.md#hemm), [herk](BLISObjectAPI.md#herk), [her2k](BLISObjectAPI.md#her2k), [symm](BLISObjectAPI.md#symm), [syrk](BLISObjectAPI.md#syrk), [syr2k](BLISObjectAPI.md#syr2k), [trmm](BLISObjectAPI.md#trmm), [trmm3](BLISObjectAPI.md#trmm3), [trsm](BLISObjectAPI.md#trsm)
* **[Utility](BLISObjectAPI.md#utility-operations)**: Miscellaneous operations on matrices and vectors:
* [asumv](BLISObjectAPI.md#asumv), [norm1v](BLISObjectAPI.md#norm1v), [normfv](BLISObjectAPI.md#normfv), [normiv](BLISObjectAPI.md#normiv), [norm1m](BLISObjectAPI.md#norm1m), [normfm](BLISObjectAPI.md#normfm), [normim](BLISObjectAPI.md#normim), [mkherm](BLISObjectAPI.md#mkherm), [mksymm](BLISObjectAPI.md#mksymm), [mktrim](BLISObjectAPI.md#mktrim), [fprintv](BLISObjectAPI.md#fprintv), [fprintm](BLISObjectAPI.md#fprintm), [printv](BLISObjectAPI.md#printv), [printm](BLISObjectAPI.md#printm), [randv](BLISObjectAPI.md#randv), [randm](BLISObjectAPI.md#randm), [sumsqv](BLISObjectAPI.md#sumsqv), [getijm](BLISObjectAPI.md#getijm), [setijm](BLISObjectAPI.md#setijm)
# Introduction
This document summarizes one of the primary native APIs in BLIS--the object API. Here, we also discuss BLIS-specific type definitions, header files, and prototypes to auxiliary functions.
@@ -40,6 +65,9 @@ There are many functions that BLIS implements that are not listed here, either b
The object API was given its name (a) because it abstracts the floating-point types of its operands (along with many other properties) within a `typedef struct {...}` data structure, and (b) to contrast it with the other native API in BLIS, the typed API, which is [documented here](BLISTypedAPI.md). (The third API supported by BLIS is the BLAS compatibility layer, which mimics conventional Fortran-77 BLAS.)
In general, this document should be treated more as a reference than a place to learn how to use BLIS in your application. Thus, we highly encourage all readers to first study the [example code](BLISObjectAPI.md#example-code) provided within the BLIS source distribution.
## BLIS types
The following tables list various types used throughout the BLIS object API.
@@ -393,9 +421,7 @@ Objects initialized via this function should **never** be passed to `bli_obj_fre
Notes for interpreting function descriptions:
* Object accessor functions allow the caller to query certain properties of objects.
* These functions are only guaranteed to return meaningful values when called upon objects that have been fully initialized/created.
* Many specialized functions are omitted from this section for brevity. For a full list of accessor functions, please see [frame/include/bli_obj_macro_defs.h](https://github.com/flame/blis/tree/master/frame/include/bli_obj_macro_defs.h), though most users are unlikely to need functions beyond those documented below.
---
@@ -423,7 +449,7 @@ Return the precision component of the storage datatype property of `obj`.
```c
trans_t bli_obj_conjtrans_status( obj_t* obj );
```
Return the `trans_t` property of `obj`, which may indicate transposition, conjugation, both, or neither. Thus, possible return values are `BLIS_NO_TRANSPOSE`, `BLIS_CONJ_NO_TRANSPOSE`, `BLIS_TRANSPOSE`, or `BLIS_CONJ_TRANSPOSE`.
---
@@ -444,23 +470,30 @@ Thus, possible return values are `BLIS_NO_CONJUGATE` or `BLIS_CONJUGATE`.
---
```c
struc_t bli_obj_struc( obj_t* obj );
```
Return the structure property of `obj`.
---
```c
uplo_t bli_obj_uplo( obj_t* obj );
```
Return the uplo (i.e., storage) property of `obj`.
---
```c
diag_t bli_obj_diag( obj_t* obj );
```
Return the diagonal property of `obj`.
---
```c
doff_t bli_obj_diag_offset( obj_t* obj );
```
Return the diagonal offset of `obj`. Note that the diagonal offset will be negative, `-i`, if the diagonal begins at element `(-i,0)` and positive `j` if the diagonal begins at element `(0,j)`.
---
@@ -492,13 +525,6 @@ Return the number of columns (or _n_ dimension) of `obj` after taking into accou
---
```c
inc_t bli_obj_row_stride( obj_t* obj );
```
@@ -542,6 +568,90 @@ siz_t bli_obj_elem_size( obj_t* obj );
```
Return the size, in bytes, of the storage datatype as indicated by `bli_obj_dt()`.
## Object mutator function reference
Notes for interpreting function descriptions:
* Object mutator functions allow the caller to modify certain properties of objects.
* The user should be extra careful about modifying properties after objects are created. For typical use of these functions, please study the example code provided in [examples/oapi](https://github.com/flame/blis/tree/master/examples/oapi).
* The list of mutators below is much shorter than the list of accessor functions in the previous section. Most mutator functions should *not* be called by users (unless you know what you are doing). For a full list of mutator functions, please see [frame/include/bli_obj_macro_defs.h](https://github.com/flame/blis/tree/master/frame/include/bli_obj_macro_defs.h), though most users are unlikely to need methods beyond those documented below.
---
```c
void bli_obj_set_conjtrans( trans_t trans, obj_t* obj );
```
Set both conjugation and transposition properties of `obj` using the corresponding components of `trans`.
---
```c
void bli_obj_set_onlytrans( trans_t trans, obj_t* obj );
```
Set the transposition property of `obj` using the transposition component of `trans`. Leaves the conjugation property of `obj` unchanged.
---
```c
void bli_obj_set_conj( conj_t conj, obj_t* obj );
```
Set the conjugation property of `obj` using `conj`. Leaves the transposition property of `obj` unchanged.
---
```c
void bli_obj_apply_trans( trans_t trans, obj_t* obj );
```
Apply `trans` to the transposition property of `obj`. For example, applying `BLIS_TRANSPOSE` will toggle the transposition property of `obj` but leave the conjugation property unchanged; applying `BLIS_CONJ_TRANSPOSE` will toggle both the conjugation and transposition properties of `obj`.
---
```c
void bli_obj_apply_conj( conj_t conj, obj_t* obj );
```
Apply `conj` to the conjugation property of `obj`. Specifically, applying `BLIS_CONJUGATE` will toggle the conjugation property of `obj`; applying `BLIS_NO_CONJUGATE` will have no effect. Leaves the transposition property of `obj` unchanged.
---
```c
void bli_obj_set_struc( struc_t struc, obj_t* obj );
```
Set the structure property of `obj` to `struc`.
---
```c
void bli_obj_set_uplo( uplo_t uplo, obj_t* obj );
```
Set the uplo (i.e., storage) property of `obj` to `uplo`.
---
```c
void bli_obj_set_diag( diag_t diag, obj_t* obj );
```
Set the diagonal property of `obj` to `diag`.
---
```c
void bli_obj_set_diag_offset( doff_t doff, obj_t* obj );
```
Set the diagonal offset property of `obj` to `doff`. Note that `doff_t` may be typecast from any signed integer.
---
## Other object function reference
---
```c
void bli_obj_induce_trans( obj_t* obj );
```
Modify the properties of `obj` to induce a logical transposition. This function operates without regard to whether the transposition property is already set. Therefore, depending on the circumstance, the caller may or may not wish to clear the transposition property after calling this function.
---
@@ -567,13 +677,6 @@

```c
void bli_obj_imag_part( obj_t* c, obj_t* i );
```
Initialize `i` to be a modified shallow copy of `c` that refers only to the imaginary part of `c`.
# Computational function reference
@@ -591,26 +694,6 @@ Notes for interpreting function descriptions:
---
## Operation index
* **[Level-1v](BLISObjectAPI.md#level-1v-operations)**: Operations on vectors:
* [addv](BLISObjectAPI.md#addv), [amaxv](BLISObjectAPI.md#amaxv), [axpyv](BLISObjectAPI.md#axpyv), [axpbyv](BLISObjectAPI.md#axpbyv), [copyv](BLISObjectAPI.md#copyv), [dotv](BLISObjectAPI.md#dotv), [dotxv](BLISObjectAPI.md#dotxv), [invertv](BLISObjectAPI.md#invertv), [scal2v](BLISObjectAPI.md#scal2v), [scalv](BLISObjectAPI.md#scalv), [setv](BLISObjectAPI.md#setv), [setrv](BLISObjectAPI.md#setrv), [setiv](BLISObjectAPI.md#setiv), [subv](BLISObjectAPI.md#subv), [swapv](BLISObjectAPI.md#swapv), [xpbyv](BLISObjectAPI.md#xpbyv)
* **[Level-1d](BLISObjectAPI.md#level-1d-operations)**: Element-wise operations on matrix diagonals:
* [addd](BLISObjectAPI.md#addd), [axpyd](BLISObjectAPI.md#axpyd), [copyd](BLISObjectAPI.md#copyd), [invertd](BLISObjectAPI.md#invertd), [scald](BLISObjectAPI.md#scald), [scal2d](BLISObjectAPI.md#scal2d), [setd](BLISObjectAPI.md#setd), [setid](BLISObjectAPI.md#setid), [shiftd](BLISObjectAPI.md#shiftd), [subd](BLISObjectAPI.md#subd), [xpbyd](BLISObjectAPI.md#xpbyd)
* **[Level-1m](BLISObjectAPI.md#level-1m-operations)**: Element-wise operations on matrices:
* [addm](BLISObjectAPI.md#addm), [axpym](BLISObjectAPI.md#axpym), [copym](BLISObjectAPI.md#copym), [scalm](BLISObjectAPI.md#scalm), [scal2m](BLISObjectAPI.md#scal2m), [setm](BLISObjectAPI.md#setm), [setrm](BLISObjectAPI.md#setrm), [setim](BLISObjectAPI.md#setim), [subm](BLISObjectAPI.md#subm)
* **[Level-1f](BLISObjectAPI.md#level-1f-operations)**: Fused operations on multiple vectors:
* [axpy2v](BLISObjectAPI.md#axpy2v), [dotaxpyv](BLISObjectAPI.md#dotaxpyv), [axpyf](BLISObjectAPI.md#axpyf), [dotxf](BLISObjectAPI.md#dotxf), [dotxaxpyf](BLISObjectAPI.md#dotxaxpyf)
* **[Level-2](BLISObjectAPI.md#level-2-operations)**: Operations with one matrix and (at least) one vector operand:
* [gemv](BLISObjectAPI.md#gemv), [ger](BLISObjectAPI.md#ger), [hemv](BLISObjectAPI.md#hemv), [her](BLISObjectAPI.md#her), [her2](BLISObjectAPI.md#her2), [symv](BLISObjectAPI.md#symv), [syr](BLISObjectAPI.md#syr), [syr2](BLISObjectAPI.md#syr2), [trmv](BLISObjectAPI.md#trmv), [trsv](BLISObjectAPI.md#trsv)
* **[Level-3](BLISObjectAPI.md#level-3-operations)**: Operations with matrices that are multiplication-like:
* [gemm](BLISObjectAPI.md#gemm), [hemm](BLISObjectAPI.md#hemm), [herk](BLISObjectAPI.md#herk), [her2k](BLISObjectAPI.md#her2k), [symm](BLISObjectAPI.md#symm), [syrk](BLISObjectAPI.md#syrk), [syr2k](BLISObjectAPI.md#syr2k), [trmm](BLISObjectAPI.md#trmm), [trmm3](BLISObjectAPI.md#trmm3), [trsm](BLISObjectAPI.md#trsm)
* **[Utility](BLISObjectAPI.md#Utility-operations)**: Miscellaneous operations on matrices and vectors:
* [asumv](BLISObjectAPI.md#asumv), [norm1v](BLISObjectAPI.md#norm1v), [normfv](BLISObjectAPI.md#normfv), [normiv](BLISObjectAPI.md#normiv), [norm1m](BLISObjectAPI.md#norm1m), [normfm](BLISObjectAPI.md#normfm), [normim](BLISObjectAPI.md#normim), [mkherm](BLISObjectAPI.md#mkherm), [mksymm](BLISObjectAPI.md#mksymm), [mktrim](BLISObjectAPI.md#mktrim), [fprintv](BLISObjectAPI.md#fprintv), [fprintm](BLISObjectAPI.md#fprintm), [printv](BLISObjectAPI.md#printv), [printm](BLISObjectAPI.md#printm), [randv](BLISObjectAPI.md#randv), [randm](BLISObjectAPI.md#randm), [sumsqv](BLISObjectAPI.md#sumsqv), [getijm](BLISObjectAPI.md#getijm), [setijm](BLISObjectAPI.md#setijm)
---
## Level-1v operations
Level-1v operations perform various level-1 BLAS-like operations on vectors (hence the _v_).
@@ -996,7 +1079,7 @@ void bli_setd
);
```
Observed object properties: `conj?(alpha)`, `diagoff(A)`.
---
@@ -1599,6 +1682,27 @@ Observed object properties: `trans?(A)`, `trans?(B)`.
---
#### gemmt
```c
void bli_gemmt
(
obj_t* alpha,
obj_t* a,
obj_t* b,
obj_t* beta,
obj_t* c
);
```
Perform
```
C := beta * C + alpha * trans?(A) * trans?(B)
```
where `C` is an _m x m_ matrix, `trans?(A)` is an _m x k_ matrix, and `trans?(B)` is a _k x m_ matrix. This operation is similar to `bli_gemm()` except that it only updates the lower or upper triangle of `C` as specified by `uplo(C)`.
Observed object properties: `trans?(A)`, `trans?(B)`, `uplo(C)`.
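To make the triangular update concrete, the following reference loop sketches what `gemmt` computes for the lower-triangular case with column-major storage and no transposition. It is only an illustration of the semantics (the function name is hypothetical), not how BLIS implements the operation:

```c
/* Reference loop illustrating the gemmt semantics described above for the
   lower-triangular, column-major case. This is an illustrative sketch
   only (helper name hypothetical), not the BLIS implementation. */
void gemmt_lower_ref( int m, int k, double alpha,
                      const double* a, int lda,   /* A is m x k */
                      const double* b, int ldb,   /* B is k x m */
                      double beta,
                      double* c, int ldc )        /* C is m x m */
{
    for ( int j = 0; j < m; ++j )
        for ( int i = j; i < m; ++i )  /* only i >= j: lower triangle */
        {
            double dot = 0.0;
            for ( int p = 0; p < k; ++p )
                dot += a[ i + p*lda ] * b[ p + j*ldb ];

            c[ i + j*ldc ] = beta * c[ i + j*ldc ] + alpha * dot;
        }
}
```

Elements strictly above the diagonal of `C` are left untouched, which is the defining difference from `gemm`.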
---
#### hemm
```c
void bli_hemm
@@ -2132,7 +2236,55 @@ Possible microkernel types (ie: the return values for `bli_info_get_*_ukr_impl_s
* `BLIS_OPTIMIZED_UKERNEL` (`"optimzd"`): This value is returned when the queried microkernel is provided by an implementation that is neither reference nor virtual, and thus we assume the kernel author would deem it to be "optimized". Such a microkernel may not be optimal in the literal sense of the word, but nonetheless is _intended_ to be optimized, at least relative to the reference microkernels.
* `BLIS_NOTAPPLIC_UKERNEL` (`"notappl"`): This value is usually returned when performing a `gemmtrsm` or `trsm` microkernel type query for any `method` value that is not `BLIS_NAT` (ie: native). That is, induced methods cannot be (purely) used on `trsm`-based microkernels because these microkernels perform more than just matrix multiplication; they also perform a triangular inversion.
## Clock functions
---
#### clock
```c
double bli_clock
(
void
);
```
Return the amount of time that has elapsed since some fixed time in the past. The return values of `bli_clock()` typically feature nanosecond precision, though this is not guaranteed.
**Note:** On Linux, `bli_clock()` is implemented in terms of `clock_gettime()` using the `clockid_t` value of `CLOCK_MONOTONIC`. On OS X, `bli_clock` is implemented in terms of `mach_absolute_time()`. And on Windows, `bli_clock` is implemented in terms of `QueryPerformanceFrequency()`. Please see [frame/base/bli_clock.c](https://github.com/flame/blis/blob/master/frame/base/bli_clock.c) for more details.
**Note:** This function returns meaningless values when BLIS is configured with `--disable-system`.
---
#### clock_min_diff
```c
double bli_clock_min_diff
(
double time_prev_min,
double time_start
);
```
This function computes an intermediate value, `time_diff`, equal to `bli_clock() - time_start`, and then returns the minimum of `time_diff` and `time_prev_min`. However, if that minimum value is extremely small (close to zero), the function returns `time_prev_min` instead.
This function is meant to be used in conjunction with `bli_clock()` for performance timing within applications--specifically in loops where only the fastest timing is of interest. For example:
```c
double t_save = DBL_MAX;
for ( int i = 0; i < 3; ++i )
{
double t = bli_clock();
bli_gemm( ... );
t_save = bli_clock_min_diff( t_save, t );
}
double gflops = ( 2.0 * m * k * n ) / ( t_save * 1.0e9 );
```
This code calls `bli_gemm()` three times and computes the performance, in GFLOPS, of the fastest of the three executions.
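The logic described above can be sketched as follows. This is a hypothetical stand-in that takes the current time as an argument rather than calling `bli_clock()` itself; the real implementation in `frame/base/bli_clock.c` may differ, for example in its exact smallness threshold:

```c
/* Hypothetical sketch of the bli_clock_min_diff() logic; the smallness
   threshold below is an assumption, not the actual BLIS value. */
double clock_min_diff_sketch( double time_prev_min, double time_start,
                              double time_now )
{
    double time_diff = time_now - time_start;

    /* Keep the smaller of the previous minimum and the new measurement,
       but discard measurements suspiciously close to zero. */
    if ( time_diff > 1.0e-9 && time_diff < time_prev_min )
        return time_diff;

    return time_prev_min;
}
```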
---
# Example code
BLIS provides lots of example code in the [examples/oapi](https://github.com/flame/blis/tree/master/examples/oapi) directory of the BLIS source distribution. The example code in this directory is set up like a tutorial, and so we recommend starting from the beginning. Topics include creating and managing objects, printing vectors and matrices, setting and querying object properties, and calling a representative subset of the computational level-1v, -1m, -2, -3, and utility operations documented above. Please read the `README` contained within the `examples/oapi` directory for further details.

@@ -1,6 +1,7 @@
# Contents
* **[Contents](BLISTypedAPI.md#contents)**
* **[Operation index](BLISTypedAPI.md#operation-index)**
* **[Introduction](BLISTypedAPI.md#introduction)**
* [BLIS types](BLISTypedAPI.md#blis-types)
* [Integer-based types](BLISTypedAPI.md#integer-based-types)
@@ -12,7 +13,6 @@
* [BLIS header file](BLISTypedAPI.md#blis-header-file)
* [Initialization and cleanup](BLISTypedAPI.md#initialization-and-cleanup)
* **[Computational function reference](BLISTypedAPI.md#computational-function-reference)**
* [Level-1v operations](BLISTypedAPI.md#level-1v-operations)
* [Level-1d operations](BLISTypedAPI.md#level-1d-operations)
* [Level-1m operations](BLISTypedAPI.md#level-1m-operations)
@@ -26,8 +26,32 @@
* [Specific configuration](BLISTypedAPI.md#specific-configuration)
* [General configuration](BLISTypedAPI.md#general-configuration)
* [Kernel information](BLISTypedAPI.md#kernel-information)
* [Clock functions](BLISTypedAPI.md#clock-functions)
* **[Example code](BLISTypedAPI.md#example-code)**
# Operation index
This index provides a quick way to jump directly to the description for each operation discussed later in the [Computational function reference](BLISTypedAPI.md#computational-function-reference) section:
* **[Level-1v](BLISTypedAPI.md#level-1v-operations)**: Operations on vectors:
* [addv](BLISTypedAPI.md#addv), [amaxv](BLISTypedAPI.md#amaxv), [axpyv](BLISTypedAPI.md#axpyv), [axpbyv](BLISTypedAPI.md#axpbyv), [copyv](BLISTypedAPI.md#copyv), [dotv](BLISTypedAPI.md#dotv), [dotxv](BLISTypedAPI.md#dotxv), [invertv](BLISTypedAPI.md#invertv), [scal2v](BLISTypedAPI.md#scal2v), [scalv](BLISTypedAPI.md#scalv), [setv](BLISTypedAPI.md#setv), [subv](BLISTypedAPI.md#subv), [swapv](BLISTypedAPI.md#swapv), [xpbyv](BLISTypedAPI.md#xpbyv)
* **[Level-1d](BLISTypedAPI.md#level-1d-operations)**: Element-wise operations on matrix diagonals:
* [addd](BLISTypedAPI.md#addd), [axpyd](BLISTypedAPI.md#axpyd), [copyd](BLISTypedAPI.md#copyd), [invertd](BLISTypedAPI.md#invertd), [scald](BLISTypedAPI.md#scald), [scal2d](BLISTypedAPI.md#scal2d), [setd](BLISTypedAPI.md#setd), [setid](BLISTypedAPI.md#setid), [shiftd](BLISTypedAPI.md#shiftd), [subd](BLISTypedAPI.md#subd), [xpbyd](BLISTypedAPI.md#xpbyd)
* **[Level-1m](BLISTypedAPI.md#level-1m-operations)**: Element-wise operations on matrices:
* [addm](BLISTypedAPI.md#addm), [axpym](BLISTypedAPI.md#axpym), [copym](BLISTypedAPI.md#copym), [scalm](BLISTypedAPI.md#scalm), [scal2m](BLISTypedAPI.md#scal2m), [setm](BLISTypedAPI.md#setm), [subm](BLISTypedAPI.md#subm)
* **[Level-1f](BLISTypedAPI.md#level-1f-operations)**: Fused operations on multiple vectors:
* [axpy2v](BLISTypedAPI.md#axpy2v), [dotaxpyv](BLISTypedAPI.md#dotaxpyv), [axpyf](BLISTypedAPI.md#axpyf), [dotxf](BLISTypedAPI.md#dotxf), [dotxaxpyf](BLISTypedAPI.md#dotxaxpyf)
* **[Level-2](BLISTypedAPI.md#level-2-operations)**: Operations with one matrix and (at least) one vector operand:
* [gemv](BLISTypedAPI.md#gemv), [ger](BLISTypedAPI.md#ger), [hemv](BLISTypedAPI.md#hemv), [her](BLISTypedAPI.md#her), [her2](BLISTypedAPI.md#her2), [symv](BLISTypedAPI.md#symv), [syr](BLISTypedAPI.md#syr), [syr2](BLISTypedAPI.md#syr2), [trmv](BLISTypedAPI.md#trmv), [trsv](BLISTypedAPI.md#trsv)
* **[Level-3](BLISTypedAPI.md#level-3-operations)**: Operations with matrices that are multiplication-like:
* [gemm](BLISTypedAPI.md#gemm), [hemm](BLISTypedAPI.md#hemm), [herk](BLISTypedAPI.md#herk), [her2k](BLISTypedAPI.md#her2k), [symm](BLISTypedAPI.md#symm), [syrk](BLISTypedAPI.md#syrk), [syr2k](BLISTypedAPI.md#syr2k), [trmm](BLISTypedAPI.md#trmm), [trmm3](BLISTypedAPI.md#trmm3), [trsm](BLISTypedAPI.md#trsm)
* **[Utility](BLISTypedAPI.md#Utility-operations)**: Miscellaneous operations on matrices and vectors:
* [asumv](BLISTypedAPI.md#asumv), [norm1v](BLISTypedAPI.md#norm1v), [normfv](BLISTypedAPI.md#normfv), [normiv](BLISTypedAPI.md#normiv), [norm1m](BLISTypedAPI.md#norm1m), [normfm](BLISTypedAPI.md#normfm), [normim](BLISTypedAPI.md#normim), [mkherm](BLISTypedAPI.md#mkherm), [mksymm](BLISTypedAPI.md#mksymm), [mktrim](BLISTypedAPI.md#mktrim), [fprintv](BLISTypedAPI.md#fprintv), [fprintm](BLISTypedAPI.md#fprintm), [printv](BLISTypedAPI.md#printv), [printm](BLISTypedAPI.md#printm), [randv](BLISTypedAPI.md#randv), [randm](BLISTypedAPI.md#randm), [sumsqv](BLISTypedAPI.md#sumsqv)
# Introduction
This document summarizes one of the primary native APIs in BLIS--the "typed" API. Here, we also discuss BLIS-specific type definitions, header files, and prototypes for auxiliary functions. This document also includes APIs for key kernels which are used to accelerate and optimize various level-2 and level-3 operations, though the [Kernels Guide](KernelsHowTo.md) goes into more detail, especially for level-3 microkernels.
@@ -36,6 +60,8 @@ There are many functions that BLIS implements that are not listed here, either b
For curious readers, the typed API was given its name (a) because it exposes the floating-point types in the names of its functions, and (b) to contrast it with the other native API in BLIS, the object API, which is [documented here](BLISObjectAPI.md). (The third API supported by BLIS is the BLAS compatibility layer, which mimics conventional Fortran-77 BLAS.)
In general, this document should be treated more as a reference than a place to learn how to use BLIS in your application. Thus, we highly encourage all readers to first study the [example code](BLISTypedAPI.md#example-code) provided within the BLIS source distribution.
## BLIS types
The following tables list various types used throughout the BLIS typed API.
@@ -190,26 +216,6 @@ Notes for interpreting function descriptions:
---
## Operation index
* **[Level-1v](BLISTypedAPI.md#level-1v-operations)**: Operations on vectors:
* [addv](BLISTypedAPI.md#addv), [amaxv](BLISTypedAPI.md#amaxv), [axpyv](BLISTypedAPI.md#axpyv), [axpbyv](BLISTypedAPI.md#axpbyv), [copyv](BLISTypedAPI.md#copyv), [dotv](BLISTypedAPI.md#dotv), [dotxv](BLISTypedAPI.md#dotxv), [invertv](BLISTypedAPI.md#invertv), [scal2v](BLISTypedAPI.md#scal2v), [scalv](BLISTypedAPI.md#scalv), [setv](BLISTypedAPI.md#setv), [subv](BLISTypedAPI.md#subv), [swapv](BLISTypedAPI.md#swapv), [xpbyv](BLISTypedAPI.md#xpbyv)
* **[Level-1d](BLISTypedAPI.md#level-1d-operations)**: Element-wise operations on matrix diagonals:
* [addd](BLISTypedAPI.md#addd), [axpyd](BLISTypedAPI.md#axpyd), [copyd](BLISTypedAPI.md#copyd), [invertd](BLISTypedAPI.md#invertd), [scald](BLISTypedAPI.md#scald), [scal2d](BLISTypedAPI.md#scal2d), [setd](BLISTypedAPI.md#setd), [setid](BLISTypedAPI.md#setid), [shiftd](BLISTypedAPI.md#shiftd), [subd](BLISTypedAPI.md#subd), [xpbyd](BLISTypedAPI.md#xpbyd)
* **[Level-1m](BLISTypedAPI.md#level-1m-operations)**: Element-wise operations on matrices:
* [addm](BLISTypedAPI.md#addm), [axpym](BLISTypedAPI.md#axpym), [copym](BLISTypedAPI.md#copym), [scalm](BLISTypedAPI.md#scalm), [scal2m](BLISTypedAPI.md#scal2m), [setm](BLISTypedAPI.md#setm), [subm](BLISTypedAPI.md#subm)
* **[Level-1f](BLISTypedAPI.md#level-1f-operations)**: Fused operations on multiple vectors:
* [axpy2v](BLISTypedAPI.md#axpy2v), [dotaxpyv](BLISTypedAPI.md#dotaxpyv), [axpyf](BLISTypedAPI.md#axpyf), [dotxf](BLISTypedAPI.md#dotxf), [dotxaxpyf](BLISTypedAPI.md#dotxaxpyf)
* **[Level-2](BLISTypedAPI.md#level-2-operations)**: Operations with one matrix and (at least) one vector operand:
* [gemv](BLISTypedAPI.md#gemv), [ger](BLISTypedAPI.md#ger), [hemv](BLISTypedAPI.md#hemv), [her](BLISTypedAPI.md#her), [her2](BLISTypedAPI.md#her2), [symv](BLISTypedAPI.md#symv), [syr](BLISTypedAPI.md#syr), [syr2](BLISTypedAPI.md#syr2), [trmv](BLISTypedAPI.md#trmv), [trsv](BLISTypedAPI.md#trsv)
* **[Level-3](BLISTypedAPI.md#level-3-operations)**: Operations with matrices that are multiplication-like:
* [gemm](BLISTypedAPI.md#gemm), [hemm](BLISTypedAPI.md#hemm), [herk](BLISTypedAPI.md#herk), [her2k](BLISTypedAPI.md#her2k), [symm](BLISTypedAPI.md#symm), [syrk](BLISTypedAPI.md#syrk), [syr2k](BLISTypedAPI.md#syr2k), [trmm](BLISTypedAPI.md#trmm), [trmm3](BLISTypedAPI.md#trmm3), [trsm](BLISTypedAPI.md#trsm)
* **[Utility](BLISTypedAPI.md#Utility-operations)**: Miscellaneous operations on matrices and vectors:
* [asumv](BLISTypedAPI.md#asumv), [norm1v](BLISTypedAPI.md#norm1v), [normfv](BLISTypedAPI.md#normfv), [normiv](BLISTypedAPI.md#normiv), [norm1m](BLISTypedAPI.md#norm1m), [normfm](BLISTypedAPI.md#normfm), [normim](BLISTypedAPI.md#normim), [mkherm](BLISTypedAPI.md#mkherm), [mksymm](BLISTypedAPI.md#mksymm), [mktrim](BLISTypedAPI.md#mktrim), [fprintv](BLISTypedAPI.md#fprintv), [fprintm](BLISTypedAPI.md#fprintm), [printv](BLISTypedAPI.md#printv), [printm](BLISTypedAPI.md#printm), [randv](BLISTypedAPI.md#randv), [randm](BLISTypedAPI.md#randm), [sumsqv](BLISTypedAPI.md#sumsqv)
---
## Level-1v operations
Level-1v operations perform various level-1 BLAS-like operations on vectors (hence the _v_).
@@ -1208,6 +1214,30 @@ where C is an _m x n_ matrix, `transa(A)` is an _m x k_ matrix, and `transb(B)`
---
#### gemmt
```c
void bli_?gemmt
(
uplo_t uploc,
trans_t transa,
trans_t transb,
dim_t m,
dim_t k,
ctype* alpha,
ctype* a, inc_t rsa, inc_t csa,
ctype* b, inc_t rsb, inc_t csb,
ctype* beta,
ctype* c, inc_t rsc, inc_t csc
);
```
Perform
```
C := beta * C + alpha * transa(A) * transb(B)
```
where C is an _m x m_ matrix, `transa(A)` is an _m x k_ matrix, and `transb(B)` is a _k x m_ matrix. This operation is similar to `bli_?gemm()` except that it only updates the lower or upper triangle of `C` as specified by `uploc`.
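The `rs`/`cs` arguments above follow the typed API's general stride convention: element (i,j) of a matrix resides at `base[ i*rs + j*cs ]`. Column-major storage thus corresponds to `rs = 1, cs = ldc`, and row-major storage to `rs = ldc, cs = 1`. A minimal sketch (the helper name and plain integer types are illustrative, not from BLIS):

```c
/* Element (i,j) under a (row stride, column stride) addressing scheme,
   as used by the typed API's rs/cs parameters. */
double elem_at( const double* x, long i, long j, long rs, long cs )
{
    return x[ i * rs + j * cs ];
}
```

With column-major storage, passing `rsa = 1, csa = lda` (and similarly for B and C) recovers the familiar BLAS convention.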
---
#### hemm
```c
void bli_?hemm
@@ -1266,7 +1296,8 @@ where C is an _m x m_ Hermitian matrix stored in the lower or upper triangle as
void bli_?her2k
(
uplo_t uploc,
trans_t transa,
trans_t transb,
dim_t m,
dim_t k,
ctype* alpha,
@@ -1278,9 +1309,9 @@ void bli_?her2k
```
Perform
```
C := beta * C + alpha * transa(A) * transb(B)^H + conj(alpha) * transb(B) * transa(A)^H
```
where C is an _m x m_ Hermitian matrix stored in the lower or upper triangle as specified by `uploc` and `transa(A)` and `transb(B)` are _m x k_ matrices.
**Note:** The floating-point type of `beta` is always the real projection of the floating-point types of `A` and `C`.
@@ -1342,7 +1373,8 @@ where C is an _m x m_ symmetric matrix stored in the lower or upper triangle as
void bli_?syr2k
(
uplo_t uploc,
trans_t transa,
trans_t transb,
dim_t m,
dim_t k,
ctype* alpha,
@@ -1354,9 +1386,9 @@ void bli_?syr2k
```
Perform
```
C := beta * C + alpha * transa(A) * transb(B)^T + alpha * transb(B) * transa(A)^T
```
where C is an _m x m_ symmetric matrix stored in the lower or upper triangle as specified by `uploc` and `transa(A)` and `transb(B)` are _m x k_ matrices.
---
@@ -1873,7 +1905,55 @@ char* bli_info_get_trmm3_impl_string( num_t dt );
char* bli_info_get_trsm_impl_string( num_t dt );
```
## Clock functions
---
#### clock
```c
double bli_clock
(
void
);
```
Return the amount of time that has elapsed since some fixed time in the past. The return values of `bli_clock()` typically feature nanosecond precision, though this is not guaranteed.
**Note:** On Linux, `bli_clock()` is implemented in terms of `clock_gettime()` using the `clockid_t` value of `CLOCK_MONOTONIC`. On OS X, `bli_clock` is implemented in terms of `mach_absolute_time()`. And on Windows, `bli_clock` is implemented in terms of `QueryPerformanceFrequency()`. Please see [frame/base/bli_clock.c](https://github.com/flame/blis/blob/master/frame/base/bli_clock.c) for more details.
**Note:** This function returns meaningless values when BLIS is configured with `--disable-system`.
---
#### clock_min_diff
```c
double bli_clock_min_diff
(
double time_prev_min,
double time_start
);
```
This function computes an intermediate value, `time_diff`, equal to `bli_clock() - time_start`, and then returns the minimum of `time_diff` and `time_prev_min`. However, if that minimum value is extremely small (close to zero), the function returns `time_prev_min` instead.
This function is meant to be used in conjunction with `bli_clock()` for performance timing within applications--specifically in loops where only the fastest timing is of interest. For example:
```c
double t_save = DBL_MAX;
for ( int i = 0; i < 3; ++i )
{
double t = bli_clock();
bli_gemm( ... );
t_save = bli_clock_min_diff( t_save, t );
}
double gflops = ( 2.0 * m * k * n ) / ( t_save * 1.0e9 );
```
This code calls `bli_gemm()` three times and computes the performance, in GFLOPS, of the fastest of the three executions.
---
# Example code
BLIS provides lots of example code in the [examples/tapi](https://github.com/flame/blis/tree/master/examples/tapi) directory of the BLIS source distribution. The example code in this directory is set up like a tutorial, and so we recommend starting from the beginning. Topics include printing vectors and matrices and calling a representative subset of the computational level-1v, -1m, -2, -3, and utility operations documented above. Please read the `README` contained within the `examples/tapi` directory for further details.

@@ -28,6 +28,7 @@ The BLIS build system was designed for use with GNU/Linux (or some other sane UN
* GNU `make` (3.81 or later)
* a working C99 compiler
* Perl (any version)
* `git` (1.8.5 or later, only required if cloning from Github)
BLIS also requires a POSIX threads library at link-time (`-lpthread` or `libpthread.so`). This requirement holds even when configuring BLIS with multithreading disabled (the default) or with multithreading via OpenMP (`--enable-multithreading=openmp`). (Note: BLIS implements basic pthreads functionality automatically for Windows builds via [AppVeyor](https://ci.appveyor.com/project/shpc/blis/).)

@@ -677,14 +677,14 @@ Adding support for a new-subconfiguration to BLIS is similar to adding support f
BLIS_ARCH_POWER7,
BLIS_ARCH_BGQ,
BLIS_ARCH_GENERIC,
BLIS_NUM_ARCHS
} arch_t;
```
Notice that the total number of `arch_t` values, `BLIS_NUM_ARCHS`, is updated automatically.
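The idiom at work can be seen in miniature below: because C assigns consecutive values to enumerators, a final enumerator placed after the last real value always equals the count of preceding values. (The names here are illustrative, not the actual `arch_t` values.)

```c
/* A self-counting enum in miniature: the trailing enumerator equals the
   number of "real" values that precede it, and stays correct as values
   are added before it. */
typedef enum
{
    DEMO_ARCH_A,        /* 0 */
    DEMO_ARCH_B,        /* 1 */
    DEMO_ARCH_GENERIC,  /* 2 */
    DEMO_NUM_ARCHS      /* 3 == number of enumerated architectures */
} demo_arch_t;
```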
* **`frame/base/bli_gks.c`**. We must also update the global kernel structure, or gks, to register the new sub-configuration during library initialization. Sub-configuration registration occurs in `bli_gks_init()`. For `knl`, updating this function amounts to inserting the following lines

@@ -8,6 +8,7 @@ project, as well as those we think a new user or developer might ask. If you do
* [Why did you create BLIS?](FAQ.md#why-did-you-create-blis)
* [Why should I use BLIS instead of GotoBLAS / OpenBLAS / ATLAS / MKL / ESSL / ACML / Accelerate?](FAQ.md#why-should-i-use-blis-instead-of-gotoblas--openblas--atlas--mkl--essl--acml--accelerate)
* [How is BLIS related to FLAME / libflame?](FAQ.md#how-is-blis-related-to-flame--libflame)
* [What is the difference between BLIS and the AMD fork of BLIS found in AOCL?](FAQ.md#what-is-the-difference-between-blis-and-the-amd-fork-of-blis-found-in-aocl)
* [Does BLIS automatically detect my hardware?](FAQ.md#does-blis-automatically-detect-my-hardware)
* [I understand that BLIS is mostly a tool for developers?](FAQ.md#i-understand-that-blis-is-mostly-a-tool-for-developers)
* [How do I link against BLIS?](FAQ.md#how-do-i-link-against-blis)
@@ -60,6 +61,12 @@ homepage](https://github.com/flame/blis#key-features). But here are a few reason
As explained [above](FAQ.md#why-did-you-create-blis?), BLIS was initially a layer within `libflame` that allowed more convenient interfacing to the BLAS. So in some ways, BLIS is a spin-off project. Prior to developing BLIS, [its author](http://www.cs.utexas.edu/users/field/) worked as the primary maintainer of `libflame`. If you look closely, you can also see that the design of BLIS was influenced by some of the more useful and innovative aspects of `libflame`, such as internal object abstractions and control trees. Also, various members of the [SHPC research group](http://shpc.ices.utexas.edu/people.html) and its [collaborators](http://shpc.ices.utexas.edu/collaborators.html) routinely provide insight, feedback, and also contribute code (especially kernels) to the BLIS project.
### What is the difference between BLIS and the AMD fork of BLIS found in AOCL?
BLIS, also known as "vanilla BLIS" or "upstream BLIS," is maintained by its [original developer](https://github.com/fgvanzee) (with the [support of others](http://shpc.ices.utexas.edu/collaborators.html)) in the [Science of High-Performance Computing](http://shpc.ices.utexas.edu/) (SHPC) group within the [The Oden Institute for Computational Engineering and Sciences](http://www.oden.utexas.edu/) at [The University of Texas at Austin](http://www.utexas.edu/). In 2015, [AMD](https://www.amd.com/) reorganized many of their software library efforts around existing open source projects. BLIS was chosen as the basis for their [CPU BLAS library](https://developer.amd.com/amd-aocl/blas-library/), and an AMD-maintained [fork of BLIS](https://github.com/amd/blis) was established.
AMD BLIS sometimes contains certain optimizations specific to AMD hardware. Many of these optimizations are (eventually) merged back into upstream BLIS. However, for various reasons, some changes may remain unique to AMD BLIS for quite some time. Thus, if you want the latest optimizations for AMD hardware, feel free to try AMD BLIS. However, please note that neither The University of Texas at Austin nor BLIS's developers can endorse or offer direct support for any outside fork of BLIS, including AMD BLIS.
### Does BLIS automatically detect my hardware?
On certain architectures (most notably x86_64), yes. In order to use auto-detection, you must specify `auto` as your configuration when running `configure`. (Please see the BLIS [Build System](BuildSystem.md) guide for more info.) A runtime detection option is also available. (Please see the [Configuration Guide](ConfigurationHowTo.md) for a comprehensive walkthrough.)

@@ -110,16 +110,19 @@ Regardless of which method is employed, and which specific way within each metho
**Note**: Please be aware of what happens if you try to specify both the automatic and manual ways, as it could otherwise confuse new users. Here are the important points:
* Regardless of which broad method is used, **if multithreading is specified via both the automatic and manual ways, the values set via the manual way will always take precedence.**
* Specifying parallelism for even *one* loop counts as specifying the manual way (in which case the ways of parallelism for the remaining loops will be assumed to be 1). And in the case of the environment variable method, setting the ways of parallelism for a loop to 1 counts as specifying parallelism! If you want to switch from using the manual way to automatic way, you must not only set (`export`) the `BLIS_NUM_THREADS` variable, but you must also `unset` all of the `BLIS_*_NT` variables.
* If you have specified multithreading via *both* the automatic and manual ways, BLIS will **not** complain if the values are inconsistent with one another. (For example, you may request 12 total threads be used while also specifying 2 and 4 ways of parallelism within the JC and IC loops, respectively, for a total of 8 ways.) Furthermore, you will be able to query these inconsistent values via the runtime API both before and after multithreading executes.
* If multithreading is disabled, you **may still** specify multithreading values via either the manual or automatic ways. However, BLIS will silently ignore **all** of these values. A BLIS library that is built with multithreading disabled at configure-time will always run sequentially (from the perspective of a single application thread).
Furthermore:
* For small numbers of threads, the number requested will be honored faithfully. However, if you request a larger number of threads that happens to also be prime, BLIS will reduce the number by one in order to allow more efficient thread factorizations. This behavior can be overridden by configuring BLIS with the `BLIS_ENABLE_AUTO_PRIME_NUM_THREADS` macro defined in the `bli_family_*.h` file of the relevant subconfiguration. Similarly, the threshold beyond which BLIS will reduce primes by one can be set via `BLIS_NT_MAX_PRIME`. (This latter value is ignored if the former macro is defined.)
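A sketch of this reduction rule, under the assumption that the check is a simple primality test against the threshold (the actual logic resides in BLIS's automatic thread factorization code and may differ in detail; the function names here are hypothetical):

```c
#include <stdbool.h>

#ifndef BLIS_NT_MAX_PRIME
#define BLIS_NT_MAX_PRIME 11  /* default threshold mentioned above */
#endif

static bool is_prime( int n )
{
    if ( n < 2 ) return false;
    for ( int d = 2; d * d <= n; ++d )
        if ( n % d == 0 ) return false;
    return true;
}

/* Hypothetical sketch: reduce a requested thread count by one if it is
   prime and exceeds the BLIS_NT_MAX_PRIME threshold. */
int adjust_num_threads( int nt )
{
    if ( is_prime( nt ) && nt > BLIS_NT_MAX_PRIME ) nt -= 1;
    return nt;
}
```

For example, a request for 13 threads would be reduced to 12, while 11 (at the threshold) and 12 (composite) would be honored as-is.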
## Globally via environment variables
The most common method of specifying multithreading in BLIS is globally via environment variables. With this method, the user sets one or more environment variables in the shell before launching the BLIS-linked executable.
Regardless of whether you end up using the automatic or manual way of expressing a request for multithreading, note that the environment variables are read (via `getenv()`) by BLIS **only once**, when the library is initialized. Subsequent to library initialization, the global settings for parallelization may only be changed via the [global runtime API](Multithreading.md#globally-at-runtime). If this constraint is not a problem, then environment variables may work fine for you. Otherwise, please consider [local settings](Multithreading.md#locally-at-runtime). (Local settings may be used at any time, regardless of whether global settings were explicitly specified, and local settings always override global settings.)
**Note**: Regardless of which way ([automatic](Multithreading.md#environment-variables-the-automatic-way) or [manual](Multithreading.md#environment-variables-the-manual-way)) environment variables are used to specify multithreading, that specification will affect operation of BLIS through **both** the BLAS compatibility layer as well as the native ([typed](docs/BLISTypedAPI.md) and [object](docs/BLISObjectAPI.md)) APIs that are unique to BLIS.
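For example, in a shell, the two ways look like this (recall that the manual way takes precedence if both are set):

```shell
# Automatic way: request a total number of threads; BLIS factorizes it
# across the loops on its own.
export BLIS_NUM_THREADS=8

# Manual way (takes precedence if both are set): per-loop ways of
# parallelism, e.g. 2-way in the JC loop and 4-way in the IC loop.
export BLIS_JC_NT=2
export BLIS_IC_NT=4
```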
### Environment variables: the automatic way
@@ -166,7 +169,7 @@ Next, which combinations of loops to parallelize depends on which caches are sha
If you still wish to set the parallelization scheme globally, but you want to do so at runtime, BLIS provides a thread-safe API for specifying multithreading. Think of these functions as a way to modify the same internal data structure into which the environment variables are read. (Recall that the environment variables are only read once, when BLIS is initialized).
**Note**: Regardless of which way ([automatic](Multithreading.md#globally-at-runtime-the-automatic-way) or [manual](Multithreading.md#globally-at-runtime-the-manual-way)) the global runtime API is used to specify multithreading, that specification will affect operation of BLIS through **both** the BLAS compatibility layer as well as the native ([typed](docs/BLISTypedAPI.md) and [object](docs/BLISObjectAPI.md)) APIs that are unique to BLIS.
### Globally at runtime: the automatic way
@@ -207,7 +210,7 @@ In addition to the global methods based on environment variables and runtime fun
As with environment variables and the global runtime API, there are two ways to specify parallelism: the automatic way and the manual way. Both ways involve allocating a BLIS-specific object, initializing the object and encoding the desired parallelization, and then passing a pointer to the object into one of the expert interfaces of either the [typed](docs/BLISTypedAPI.md) or [object](docs/BLISObjectAPI.md) APIs. We provide examples of utilizing this threading object below.
**Note**: Neither way ([automatic](Multithreading.md#locally-at-runtime-the-automatic-way) nor [manual](Multithreading.md#locally-at-runtime-the-manual-way)) of specifying multithreading via the local runtime API can be used via the BLAS interfaces. The local runtime API may *only* be used via the native ([typed](docs/BLISTypedAPI.md) and [object](docs/BLISObjectAPI.md)) APIs, which are unique to BLIS. (Furthermore, the expert interfaces of each API must be used. This is demonstrated later on in this section.)
### Initializing a rntm_t
This situation could lead to unexpectedly low multithreaded performance. Suppose the user calls `gemm` on a problem with a large m dimension and small k and n dimensions, and explicitly requests parallelism only in the IC loop, but also suppose that the storage of C does not match that of the microkernel's preference. After BLIS transposes the operation internally, the *effective* m dimension will no longer be large; instead, it will be small (because the original m and n dimension will have been swapped). The multithreaded implementation will then proceed to parallelize this small m dimension.
There are currently no good *and* easy solutions to this problem. Eventually, though, we plan to add support for two microkernels per datatype per configuration--one for use with matrices C that are row-stored, and one for those that are column-stored. This will obviate the logic within BLIS that sometimes induces the operation transposition, and the problem will go away.
* **Thread affinity when BLIS and MKL are used together.** Some users have reported that when running a program that links both BLIS (configured with OpenMP) and MKL, **and** when OpenMP thread affinity has been specified (e.g. via `OMP_PROC_BIND` and `OMP_PLACES`), very poor performance is observed. This may be due to incorrect thread masking, which causes all threads to run on one physical core. The exact circumstances leading to this behavior have not been identified, but unsetting the OpenMP thread affinity variables appears to resolve the problem.
# Conclusion

* **[Haswell](Performance.md#haswell)**
* **[Experiment details](Performance.md#haswell-experiment-details)**
* **[Results](Performance.md#haswell-results)**
* **[Zen](Performance.md#zen)**
* **[Experiment details](Performance.md#zen-experiment-details)**
* **[Results](Performance.md#zen-results)**
* **[Zen2](Performance.md#zen2)**
* **[Experiment details](Performance.md#zen2-experiment-details)**
* **[Results](Performance.md#zen2-results)**
* **[A64fx](Performance.md#a64fx)**
* **[Experiment details](Performance.md#a64fx-experiment-details)**
* **[Results](Performance.md#a64fx-results)**
* **[Feedback](Performance.md#feedback)**
# Introduction
endif()
```
* configured and built BLAS library via `mkdir build; cd build; cmake ..; make blas`
* installed headers via `cmake . -DCMAKE_INSTALL_PREFIX=$HOME/flame/eigen; make install`
* The `gemm` implementation was pulled in at compile-time via Eigen headers; other operations were linked to Eigen's BLAS library.
* Single-threaded (1 core) execution requested via `export OMP_NUM_THREADS=1`
* Multithreaded (26 core) execution requested via `export OMP_NUM_THREADS=26`
endif()
```
* configured and built BLAS library via `mkdir build; cd build; cmake ..; make blas`
* installed headers via `cmake . -DCMAKE_INSTALL_PREFIX=$HOME/flame/eigen; make install`
* The `gemm` implementation was pulled in at compile-time via Eigen headers; other operations were linked to Eigen's BLAS library.
* Single-threaded (1 core) execution requested via `export OMP_NUM_THREADS=1`
* Multithreaded (12 core) execution requested via `export OMP_NUM_THREADS=12`
---
## Zen
### Zen experiment details
* Location: Oracle cloud
* Processor model: AMD Epyc 7551 (Zen1 "Naples")
* Core topology: two sockets, 4 dies per socket, 2 core complexes (CCX) per die, 4 cores per CCX, 64 cores total
* SMT status: enabled, but not utilized
* Max clock rate: 3.0GHz (single-core), 2.55GHz (multicore)
endif()
```
* configured and built BLAS library via `mkdir build; cd build; cmake ..; make blas`
* installed headers via `cmake . -DCMAKE_INSTALL_PREFIX=$HOME/flame/eigen; make install`
* The `gemm` implementation was pulled in at compile-time via Eigen headers; other operations were linked to Eigen's BLAS library.
* Single-threaded (1 core) execution requested via `export OMP_NUM_THREADS=1`
* Multithreaded (32 core) execution requested via `export OMP_NUM_THREADS=32`
* Comments:
* MKL performance is dismal, despite being linked in the same manner as on the Xeon Platinum. It's not clear what is causing the slowdown. It could be that MKL's runtime kernel/blocksize selection logic is falling back to some older, more basic implementation because CPUID is not returning Intel as the hardware vendor. Alternatively, it's possible that MKL is trying to use kernels for the closest Intel architectures--say, Haswell/Broadwell--but its implementations use Haswell-specific optimizations that, due to microarchitectural differences, degrade performance on Zen.
### Zen results
#### pdf
* [Zen single-threaded](graphs/large/l3_perf_zen_nt1.pdf)
* [Zen multithreaded (32 cores)](graphs/large/l3_perf_zen_jc1ic8jr4_nt32.pdf)
* [Zen multithreaded (64 cores)](graphs/large/l3_perf_zen_jc2ic8jr4_nt64.pdf)
#### png (inline)
* **Zen single-threaded**
![single-threaded](graphs/large/l3_perf_zen_nt1.png)
* **Zen multithreaded (32 cores)**
![multithreaded (32 cores)](graphs/large/l3_perf_zen_jc1ic8jr4_nt32.png)
* **Zen multithreaded (64 cores)**
![multithreaded (64 cores)](graphs/large/l3_perf_zen_jc2ic8jr4_nt64.png)
---
## Zen2
### Zen2 experiment details
* Location: Oracle cloud
* Processor model: AMD Epyc 7742 (Zen2 "Rome")
* Core topology: two sockets, 8 Core Complex Dies (CCDs) per socket, 2 Core Complexes (CCX) per CCD, 4 cores per CCX, 128 cores total
* SMT status: enabled, but not utilized
* Max clock rate: 2.25GHz (base, documented); 3.4GHz boost (single-core, documented); 2.6GHz boost (multicore, estimated)
* Max vector register length: 256 bits (AVX2)
* Max FMA vector IPC: 2
* Alternatively, FMA vector IPC is 4 when vectors are limited to 128 bits each.
* Peak performance:
* single-core: 54.4 GFLOPS (double-precision), 108.8 GFLOPS (single-precision)
* multicore (estimated): 41.6 GFLOPS/core (double-precision), 83.2 GFLOPS/core (single-precision)
* Operating system: Ubuntu 18.04 (Linux kernel 4.15.0)
* Page size: 4096 bytes
* Compiler: gcc 9.3.0
* Results gathered: 24 September 2020, 29 September 2020
* Implementations tested:
* BLIS 4fd8d9f (0.7.0-55)
* configured with `./configure -t openmp auto` (single- and multithreaded)
* sub-configuration exercised: `zen2`
* Single-threaded (1 core) execution requested via no change in environment variables
* Multithreaded (64 core) execution requested via `export BLIS_JC_NT=4 BLIS_IC_NT=4 BLIS_JR_NT=4`
* Multithreaded (128 core) execution requested via `export BLIS_JC_NT=8 BLIS_IC_NT=4 BLIS_JR_NT=4`
* OpenBLAS 0.3.10
* configured `Makefile.rule` with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0` (single-threaded)
* configured `Makefile.rule` with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=64` (multithreaded, 64 cores)
* configured `Makefile.rule` with `BINARY=64 NO_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=128` (multithreaded, 128 cores)
* Single-threaded (1 core) execution requested via `export OPENBLAS_NUM_THREADS=1`
* Multithreaded (64 core) execution requested via `export OPENBLAS_NUM_THREADS=64`
* Multithreaded (128 core) execution requested via `export OPENBLAS_NUM_THREADS=128`
* Eigen 3.3.90
* Obtained via the [Eigen GitLab homepage](https://gitlab.com/libeigen/eigen) (24 September 2020)
* Prior to compilation, modified top-level `CMakeLists.txt` to ensure that `-march=native` was added to `CXX_FLAGS` variable (h/t Sameer Agarwal):
```
# These lines added after line 60.
check_cxx_compiler_flag("-march=native" COMPILER_SUPPORTS_MARCH_NATIVE)
if(COMPILER_SUPPORTS_MARCH_NATIVE)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=native")
endif()
```
* configured and built BLAS library via `mkdir build; cd build; cmake ..; make blas`
* installed headers via `cmake . -DCMAKE_INSTALL_PREFIX=$HOME/flame/eigen; make install`
* The `gemm` implementation was pulled in at compile-time via Eigen headers; other operations were linked to Eigen's BLAS library.
* Single-threaded (1 core) execution requested via `export OMP_NUM_THREADS=1`
* Multithreaded (64 core) execution requested via `export OMP_NUM_THREADS=64`
* Multithreaded (128 core) execution requested via `export OMP_NUM_THREADS=128`
* **NOTE**: This version of Eigen does not provide multithreaded implementations of `symm`/`hemm`, `syrk`/`herk`, `trmm`, or `trsm`, and therefore those curves are omitted from the multithreaded graphs.
* MKL 2020 update 3
* Single-threaded (1 core) execution requested via `export MKL_NUM_THREADS=1`
* Multithreaded (64 core) execution requested via `export MKL_NUM_THREADS=64`
* Multithreaded (128 core) execution requested via `export MKL_NUM_THREADS=128`
* Affinity:
* Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0-127"`. However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` environment variable was unset.
* All executables were run through `numactl --interleave=all`.
* Frequency throttling (via `cpupower`):
* Driver: acpi-cpufreq
* Governor: performance
* Hardware limits (steps): 1.5GHz, 2.0GHz, 2.25GHz
* Adjusted minimum: 2.25GHz
* Comments:
* MKL performance is once again underwhelming. This is likely because Intel has decided that it does not want to give users of MKL a reason to purchase AMD hardware.
### Zen2 results
#### pdf
* [Zen2 single-threaded](graphs/large/l3_perf_zen2_nt1.pdf)
* [Zen2 multithreaded (64 cores)](graphs/large/l3_perf_zen2_jc4ic4jr4_nt64.pdf)
* [Zen2 multithreaded (128 cores)](graphs/large/l3_perf_zen2_jc8ic4jr4_nt128.pdf)
#### png (inline)
* **Zen2 single-threaded**
![single-threaded](graphs/large/l3_perf_zen2_nt1.png)
* **Zen2 multithreaded (64 cores)**
![multithreaded (64 cores)](graphs/large/l3_perf_zen2_jc4ic4jr4_nt64.png)
* **Zen2 multithreaded (128 cores)**
![multithreaded (128 cores)](graphs/large/l3_perf_zen2_jc8ic4jr4_nt128.png)
---
## A64fx
### A64fx experiment details
* Location: RIKEN Center for Computational Science in Kobe, Japan
* These test results were gathered on the Fugaku supercomputer under project "量子物質の創発と機能のための基礎科学 ―「富岳」と最先端実験の密連携による革新的強相関電子科学" (hp200132)
* Processor model: Fujitsu A64fx
* Core topology: one socket, 4 NUMA groups per socket, 13 cores per group (one reserved for the OS), 48 cores total
* SMT status: Unknown
* Max clock rate: 2.2GHz (single- and multicore, observed)
* Max vector register length: 512 bits (SVE)
* Max FMA vector IPC: 2
* Peak performance:
* single-core: 70.4 GFLOPS (double-precision), 140.8 GFLOPS (single-precision)
* multicore: 70.4 GFLOPS/core (double-precision), 140.8 GFLOPS/core (single-precision)
* Operating system: RHEL 8.3
* Page size: 256 bytes
* Compiler: gcc 9.3.0
* Results gathered: 2 April 2021
* Implementations tested:
* BLIS 757cb1c (post-0.8.1)
* configured with `./configure -t openmp --sve-vector-size=vla CFLAGS="-D_A64FX -DPREFETCH256 -DSVE_NO_NAT_COMPLEX_KERNELS" arm64_sve` (single- and multithreaded)
* sub-configuration exercised: `arm64_sve`
* Single-threaded (1 core) execution requested via:
* `export BLIS_SVE_KC_D=2048 BLIS_SVE_MC_D=128 BLIS_SVE_NC_D=26880 BLIS_SVE_KERNEL_IDX_D=14` (double precision)
* `export BLIS_SVE_KC_S=2048 BLIS_SVE_MC_S=256 BLIS_SVE_NC_S=23040 BLIS_SVE_KERNEL_IDX_S=2` (single precision)
* Multithreaded (12 core) execution requested via:
* `export BLIS_JC_NT=1 BLIS_IC_NT=2 BLIS_JR_NT=6`
* `export BLIS_SVE_KC_D=2400 BLIS_SVE_MC_D=64 BLIS_SVE_NC_D=26880 BLIS_SVE_KERNEL_IDX_D=14` (double precision)
* `export BLIS_SVE_KC_S=2400 BLIS_SVE_MC_S=128 BLIS_SVE_NC_S=23040 BLIS_SVE_KERNEL_IDX_S=2` (single precision)
* Multithreaded (48 core) execution requested via:
* `export BLIS_JC_NT=1 BLIS_IC_NT=4 BLIS_JR_NT=12`
* `export BLIS_SVE_KC_D=2048 BLIS_SVE_MC_D=128 BLIS_SVE_NC_D=26880 BLIS_SVE_KERNEL_IDX_D=14` (double precision)
* `export BLIS_SVE_KC_S=2048 BLIS_SVE_MC_S=256 BLIS_SVE_NC_S=23040 BLIS_SVE_KERNEL_IDX_S=2` (single precision)
* Eigen 3.3.9
* Obtained via the [Eigen GitLab homepage](https://gitlab.com/libeigen/eigen)
* configured and built BLAS library via `mkdir build; cd build; cmake ..; make blas`
* installed headers via `cmake . -DCMAKE_INSTALL_PREFIX=$HOME/flame/eigen; make install`
* The `gemm` implementation was pulled in at compile-time via Eigen headers; other operations were linked to Eigen's BLAS library.
* Single-threaded (1 core) execution requested via `export OMP_NUM_THREADS=1`
* Multithreaded (12 core) execution requested via `export OMP_NUM_THREADS=12`
* Multithreaded (48 core) execution requested via `export OMP_NUM_THREADS=48`
* **NOTE**: This version of Eigen does not provide multithreaded implementations of `symm`/`hemm`, `syrk`/`herk`, `trmm`, or `trsm`, and therefore those curves are omitted from the multithreaded graphs.
* ARMPL (20.1.0 for A64fx)
* Single-threaded (1 core) execution requested via `export OMP_NUM_THREADS=1`
* Multithreaded (12 core) execution requested via `export OMP_NUM_THREADS=12`
* Multithreaded (48 core) execution requested via `export OMP_NUM_THREADS=48`
* **NOTE**: While this version of ARMPL does provide multithreaded implementations of `symm`/`hemm`, `syrk`/`herk`, `trmm`, and `trsm` (with the exception of `dtrsm`), these implementations yield very low performance, and their long run times led us to skip collecting these data altogether.
* Fujitsu SSL2 (Fujitsu toolchain 1.2.31)
* Single-threaded (1 core) execution requested via `export OMP_NUM_THREADS=1 NPARALLEL=1`
* Multithreaded (12 core) execution requested via `export OMP_NUM_THREADS=12 NPARALLEL=12`
* Multithreaded (48 core) execution requested via `export OMP_NUM_THREADS=48 NPARALLEL=48`
* Affinity:
* Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="12-23 24-35 36-47 48-59"`.
* All executables were run through `numactl --interleave=all` (multithreaded only).
* Frequency throttling: No change made. No frequency lowering observed.
* Comments:
* Special thanks to Stepan Nassyr and RuQing G. Xu for their work in developing and optimizing A64fx support. Also, thanks to RuQing G. Xu for collecting the data that appear in these graphs.
### A64fx results
#### pdf
* [A64fx single-threaded](graphs/large/l3_perf_a64fx_nt1.pdf)
* [A64fx multithreaded (12 cores)](graphs/large/l3_perf_a64fx_jc1ic2jr6_nt12.pdf)
* [A64fx multithreaded (48 cores)](graphs/large/l3_perf_a64fx_jc1ic4jr12_nt48.pdf)
#### png (inline)
* **A64fx single-threaded**
![single-threaded](graphs/large/l3_perf_a64fx_nt1.png)
* **A64fx multithreaded (12 cores)**
![multithreaded (12 cores)](graphs/large/l3_perf_a64fx_jc1ic2jr6_nt12.png)
* **A64fx multithreaded (48 cores)**
![multithreaded (48 cores)](graphs/large/l3_perf_a64fx_jc1ic4jr12_nt48.png)
---

* **[Haswell](PerformanceSmall.md#haswell)**
* **[Experiment details](PerformanceSmall.md#haswell-experiment-details)**
* **[Results](PerformanceSmall.md#haswell-results)**
* **[Zen](PerformanceSmall.md#zen)**
* **[Experiment details](PerformanceSmall.md#zen-experiment-details)**
* **[Results](PerformanceSmall.md#zen-results)**
* **[Zen2](PerformanceSmall.md#zen2)**
* **[Experiment details](PerformanceSmall.md#zen2-experiment-details)**
* **[Results](PerformanceSmall.md#zen2-results)**
* **[Feedback](PerformanceSmall.md#feedback)**
# Introduction
---
## Zen
### Zen experiment details
* Location: Oracle cloud
* Processor model: AMD Epyc 7551 (Zen1)
* BLIS 90db88e (0.6.1-8)
* configured with `./configure --enable-cblas auto` (single-threaded)
* configured with `./configure --enable-cblas -t openmp auto` (multithreaded)
* sub-configuration exercised: `zen`
* Multithreaded (32 cores) execution requested via `export BLIS_NUM_THREADS=32`
* OpenBLAS 0.3.8
* configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0 USE_LOCKING=1` (single-threaded)
* Comments:
* libxsmm is highly competitive for very small problems, but quickly gives up once the "large" dimension exceeds about 180-240 (or 64 in the case where all operands are square). Also, libxsmm's `gemm` cannot handle a transposition on matrix A and similarly dispatches the fallback implementation for those cases. libxsmm also does not export CBLAS interfaces, and therefore only appears on the graphs for column-stored matrices.
### Zen results
#### pdf
* [Zen single-threaded row-stored](graphs/sup/dgemm_rrr_zen_nt1.pdf)
* [Zen single-threaded column-stored](graphs/sup/dgemm_ccc_zen_nt1.pdf)
* [Zen multithreaded (32 cores) row-stored](graphs/sup/dgemm_rrr_zen_nt32.pdf)
* [Zen multithreaded (32 cores) column-stored](graphs/sup/dgemm_ccc_zen_nt32.pdf)
#### png (inline)
* **Zen single-threaded row-stored**
![single-threaded row-stored](graphs/sup/dgemm_rrr_zen_nt1.png)
* **Zen single-threaded column-stored**
![single-threaded column-stored](graphs/sup/dgemm_ccc_zen_nt1.png)
* **Zen multithreaded (32 cores) row-stored**
![multithreaded row-stored](graphs/sup/dgemm_rrr_zen_nt32.png)
* **Zen multithreaded (32 cores) column-stored**
![multithreaded column-stored](graphs/sup/dgemm_ccc_zen_nt32.png)
---
## Zen2
### Zen2 experiment details
* Location: Oracle cloud
* Processor model: AMD Epyc 7742 (Zen2 "Rome")
* Core topology: two sockets, 8 Core Complex Dies (CCDs) per socket, 2 Core Complexes (CCX) per CCD, 4 cores per CCX, 128 cores total
* SMT status: enabled, but not utilized
* Max clock rate: 2.25GHz (base, documented); 3.4GHz boost (single-core, documented); 2.6GHz boost (multicore, estimated)
* Max vector register length: 256 bits (AVX2)
* Max FMA vector IPC: 2
* Alternatively, FMA vector IPC is 4 when vectors are limited to 128 bits each.
* Peak performance:
* single-core: 54.4 GFLOPS (double-precision), 108.8 GFLOPS (single-precision)
* multicore (estimated): 41.6 GFLOPS/core (double-precision), 83.2 GFLOPS/core (single-precision)
* Operating system: Ubuntu 18.04 (Linux kernel 4.15.0)
* Page size: 4096 bytes
* Compiler: gcc 9.3.0
* Results gathered: 8 October 2020
* Implementations tested:
* BLIS a0849d3 (0.7.0-67)
* configured with `./configure --enable-cblas auto` (single-threaded)
* configured with `./configure --enable-cblas -t openmp auto` (multithreaded)
* sub-configuration exercised: `zen2`
* Multithreaded (32 cores) execution requested via `export BLIS_NUM_THREADS=32`
* OpenBLAS 0.3.10
* configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0 USE_LOCKING=1` (single-threaded)
* configured `Makefile.rule` with `BINARY=64 NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=1 NUM_THREADS=32` (multithreaded)
* Multithreaded (32 cores) execution requested via `export OPENBLAS_NUM_THREADS=32`
* BLASFEO 5b26d40
* configured `Makefile.rule` with: `BLAS_API=1 FORTRAN_BLAS_API=1 CBLAS_API=1`.
* built BLAS library via `make CC=gcc`
* Eigen 3.3.90
* Obtained via the [Eigen GitLab homepage](https://gitlab.com/libeigen/eigen) (24 September 2020)
* Prior to compilation, modified top-level `CMakeLists.txt` to ensure that `-march=native` was added to `CXX_FLAGS` variable (h/t Sameer Agarwal):
```
# These lines added after line 60.
check_cxx_compiler_flag("-march=native" COMPILER_SUPPORTS_MARCH_NATIVE)
if(COMPILER_SUPPORTS_MARCH_NATIVE)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=native")
endif()
```
* configured and built BLAS library via `mkdir build; cd build; CC=gcc cmake ..; make blas`
* installed headers via `cmake . -DCMAKE_INSTALL_PREFIX=$HOME/flame/eigen; make install`
* The `gemm` implementation was pulled in at compile-time via Eigen headers; other operations were linked to Eigen's BLAS library.
* Single-threaded (1 core) execution requested via `export OMP_NUM_THREADS=1`
* Multithreaded (32 cores) execution requested via `export OMP_NUM_THREADS=32`
* MKL 2020 update 3
* Single-threaded (1 core) execution requested via `export MKL_NUM_THREADS=1`
* Multithreaded (32 cores) execution requested via `export MKL_NUM_THREADS=32`
* libxsmm f0ab9cb (post-1.16.1)
* compiled with `make AVX=2`; linked with [netlib BLAS](http://www.netlib.org/blas/) 3.6.0 as the fallback library to better show where libxsmm stops handling the computation internally.
* Affinity:
* Thread affinity for BLIS was specified manually via `GOMP_CPU_AFFINITY="0-31"`. However, multithreaded OpenBLAS appears to revert to single-threaded execution if `GOMP_CPU_AFFINITY` is set. Therefore, when measuring OpenBLAS performance, the `GOMP_CPU_AFFINITY` environment variable was unset.
* All executables were run through `numactl --interleave=all`.
* Frequency throttling (via `cpupower`):
* Driver: acpi-cpufreq
* Governor: performance
* Hardware limits (steps): 1.5GHz, 2.0GHz, 2.25GHz
* Adjusted minimum: 2.25GHz
* Comments:
* None.
### Zen2 results
#### pdf
* [Zen2 sgemm single-threaded row-stored](graphs/sup/sgemm_rrr_zen2_nt1.pdf)
* [Zen2 sgemm single-threaded column-stored](graphs/sup/sgemm_ccc_zen2_nt1.pdf)
* [Zen2 dgemm single-threaded row-stored](graphs/sup/dgemm_rrr_zen2_nt1.pdf)
* [Zen2 dgemm single-threaded column-stored](graphs/sup/dgemm_ccc_zen2_nt1.pdf)
* [Zen2 sgemm multithreaded (32 cores) row-stored](graphs/sup/sgemm_rrr_zen2_nt32.pdf)
* [Zen2 sgemm multithreaded (32 cores) column-stored](graphs/sup/sgemm_ccc_zen2_nt32.pdf)
* [Zen2 dgemm multithreaded (32 cores) row-stored](graphs/sup/dgemm_rrr_zen2_nt32.pdf)
* [Zen2 dgemm multithreaded (32 cores) column-stored](graphs/sup/dgemm_ccc_zen2_nt32.pdf)
#### png (inline)
* **Zen2 sgemm single-threaded row-stored**
![sgemm single-threaded row-stored](graphs/sup/sgemm_rrr_zen2_nt1.png)
* **Zen2 sgemm single-threaded column-stored**
![sgemm single-threaded column-stored](graphs/sup/sgemm_ccc_zen2_nt1.png)
* **Zen2 dgemm single-threaded row-stored**
![dgemm single-threaded row-stored](graphs/sup/dgemm_rrr_zen2_nt1.png)
* **Zen2 dgemm single-threaded column-stored**
![dgemm single-threaded column-stored](graphs/sup/dgemm_ccc_zen2_nt1.png)
* **Zen2 sgemm multithreaded (32 cores) row-stored**
![sgemm multithreaded row-stored](graphs/sup/sgemm_rrr_zen2_nt32.png)
* **Zen2 sgemm multithreaded (32 cores) column-stored**
![sgemm multithreaded column-stored](graphs/sup/sgemm_ccc_zen2_nt32.png)
* **Zen2 dgemm multithreaded (32 cores) row-stored**
![dgemm multithreaded row-stored](graphs/sup/dgemm_rrr_zen2_nt32.png)
* **Zen2 dgemm multithreaded (32 cores) column-stored**
![dgemm multithreaded column-stored](graphs/sup/dgemm_ccc_zen2_nt32.png)
---

## Contents
* [Changes in 0.8.1](ReleaseNotes.md#changes-in-081)
* [Changes in 0.8.0](ReleaseNotes.md#changes-in-080)
* [Changes in 0.7.0](ReleaseNotes.md#changes-in-070)
* [Changes in 0.6.1](ReleaseNotes.md#changes-in-061)
* [Changes in 0.6.0](ReleaseNotes.md#changes-in-060)
* [Changes in 0.0.2](ReleaseNotes.md#changes-in-002)
* [Changes in 0.0.1](ReleaseNotes.md#changes-in-001)
## Changes in 0.8.1
March 22, 2021
Improvements present in 0.8.1:
Framework:
- Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (i.e., the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro `BLIS_NT_MAX_PRIME`, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro `BLIS_ENABLE_AUTO_PRIME_NUM_THREADS` in the appropriate configuration family's `bli_family_*.h`. (Jeff Diamond)
- Changed default value of `BLIS_THREAD_RATIO_M` from 2 to 1, which leads to slightly different automatic thread factorizations.
- Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations.
- Relocated the general stride handling for `gemmsup`. This fixed an issue whereby `gemm` would fail to trigger the conventional code path for cases that use general stride, even after `gemmsup` rejected the problem. (RuQing Xu)
- Fixed an incorrect function signature (and prototype) of `bli_?gemmt()`. (RuQing Xu)
- Redefined `BLIS_NUM_ARCHS` to be part of the `arch_t` enum, which means it will be updated automatically when defining future subconfigs.
- Minor code consolidation in all level-3 `_front()` functions.
- Reorganized Windows cpp branch of `bli_pthreads.c`.
- Implemented `bli_pthread_self()` and `_equals()`, but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS.
Kernels:
- Added low-precision POWER10 `gemm` kernels via a `power10` sandbox. This sandbox also provides an API for implementations that use these kernels. See the `sandbox/power10/POWER10.md` document for more info. (Nicholai Tukanov)
- Added assembly `packm` kernels for the `haswell` kernel set and registered to `haswell`, `zen`, and `zen2` subconfigs accordingly. The `s`, `c`, and `z` kernels were modeled on the `d` kernel, which was contributed by AMD.
- Reduced KC in the `skx` subconfig from 384 to 256. (Tze Meng Low)
- Fixed bugs in two `haswell` dgemmsup kernels, which involved extraneous assembly instructions left over from when the kernels were first written. (Kiran Varaganti, Bhaskar Nallani)
- Minor updates to all of the `gemmtrsm` kernels to allow division by diagonal elements rather than scaling by pre-inverted elements. This change was applied to `haswell` and `penryn` kernel sets as well as reference kernels, 1m kernels, and the pre-broadcast B (bb) format kernels used by the `power9` subconfig. (Bhaskar Nallani)
- Fixed incorrect return type on `bli_diag_offset_with_trans()`. (Devin Matthews)
Build system:
- Output a pkgconfig file so that CMake users that use BLIS can find and incorporate BLIS build products. (Ajay Panyala)
- Fixed an issue in the configure script's kernel-to-config map that caused `skx` kernel flags to be used when compiling kernels from the `zen` kernel set. This issue wasn't really fixed, but rather tweaked in such a way that it happens to now work. A more proper fix would require a serious rethinking of the configuration system. (Devin Matthews)
- Fixed the shared library build rule in top-level Makefile. The previous rule was incorrectly only linking prerequisites that were newer than the target (`$?`) rather than correctly linking all prerequisites (`$^`). (Devin Matthews)
- Fixed `cc_vendor` for crosstool-ng toolchains. (Isuru Fernando)
- Allow disabling of `trsm` diagonal pre-inversion at compile time via `--disable-trsm-preinversion`.
Testing:
- Fixed obscure testsuite bug for the `gemmt` test module that relates to its dependency on `gemv`.
- Allow the `amaxv` testsuite module to run with a dimension of 0. (Meghana Vankadari)
Documentation:
- Documented auto-reduction for prime numbers of threads in `docs/Multithreading.md`.
- Fixed a missing `trans_t` argument in the API documentation for `her2k`/`syr2k` in `docs/BLISTypedAPI.md`. (RuQing Xu)
- Removed an extra call to `free()` in the level-1v typed API example code. (Ilknur Mustafazade)
## Changes in 0.8.0
November 19, 2020
Improvements present in 0.8.0:
Framework:
- Implemented support for the level-3 operation `gemmt`, which performs a `gemm` on only the lower or only the upper triangle of a square matrix C. For now, only the conventional/large code path (and not the sup code path) is provided. This support also includes `gemmt` APIs in the BLAS and CBLAS compatibility layers. (AMD)
- Added a C++ template header, `blis.hh`, containing a BLAS-inspired wrapper to a set of polymorphic CBLAS-like function wrappers defined in another header, `cblas.hh`. These headers are installed only when running the `install` target with `INSTALL_HH` set to `yes`. (AMD)
- Disallow `randv`, `randm`, `randnv`, and `randnm` from producing vectors and matrices with 1-norms of zero.
- Changed the behavior of user-initialized `rntm_t` objects so that packing of A and B is disabled by default. (Kiran Varaganti)
- Transitioned to using `bool` keyword instead of the previous integer-based `bool_t` typedef. (RuQing Xu)
- Updated all inline function definitions to use the cpp macro `BLIS_INLINE` instead of the `static` keyword. (Giorgos Margaritis, Devin Matthews)
- Relocated `#include "cpuid.h"` directive from `bli_cpuid.h` to `bli_cpuid.c` so that applications can `#include` both `blis.h` and `cpuid.h`. (Bhaskar Nallani, Devin Matthews)
- Defined `xerbla_array_()` to complement the netlib routine `xerbla_array()`. (Isuru Fernando)
- Replaced the previously broken `ref99` sandbox with a simpler, functioning alternative. (Francisco Igual)
- Fixed a harmless bug whereby `herk` was calling `trmm`-related code for determining the blocksize of KC in the 4th loop.
Kernels:
- Implemented a full set of `sgemmsup` assembly millikernels and microkernels for `haswell` kernel set.
- Implemented POWER10 `sgemm` and `dgemm` microkernels. (Nicholai Tukanov)
- Added two kernels (`dgemm` and `dpackm`) that employ ARM SVE vector extensions. (Guodong Xu)
- Implemented explicit beta = 0 handling in the `sgemm` microkernel in `bli_gemm_armv7a_int_d4x4.c`. This omission was causing testsuite failures in the new `gemmt` testsuite module for `cortexa15` builds given that the `gemmt` correctness check relies on `gemm` with beta = 0.
- Updated `void*` function arguments in reference `packm` kernels to use the native pointer type, and fixed a related dormant type bug in `bli_kernels_knl.h`.
- Fixed missing `restrict` qualifier in `sgemm` microkernel prototype for `knl` kernel set header.
- Added some missing n = 6 edge cases to `dgemmsup` kernels.
- Fixed an erroneously disabled edge case optimization in `gemmsup` variant code.
- Various bugfixes and cleanups to `dgemmsup` kernels.
Build system:
- Implemented runtime subconfiguration selection override via `BLIS_ARCH_TYPE`. (decandia50)
- Output the python interpreter found during `configure` into the `PYTHON` variable set in `build/config.mk`. (AMD)
- Added configure support for Intel oneAPI via the `CC` environment variable. (Ajay Panyala, Devin Matthews)
- Use `-O2` for all framework code, potentially avoiding intermittent issues with `f2c`'ed packed and banded code. (Devin Matthews)
- Tweaked `zen2` subconfiguration's cache blocksizes and registered full suite of `sgemm` and `dgemm` millikernels.
- Use the `-fomit-frame-pointer` compiler optimization option in the `haswell` and `skx` subconfigurations. (Jeff Diamond, Devin Matthews)
- Tweaked Makefiles in `test`, `test/3`, and `test/sup` so that running any of the usual targets without having first built BLIS results in a helpful error message.
- Added support for `--complex-return=[gnu|intel]` to `configure`, which allows the user to toggle between the GNU and Intel return value conventions for functions such as `cdotc`, `cdotu`, `zdotc`, and `zdotu`.
- Updated compilation flags for the `cortexa9` and `cortexa53` subconfigurations. (Dave Love)
Testing:
- Added a `gemmt` module to the testsuite and a standalone test driver to the `test` directory, both of which exercise the new `gemmt` functionality. (AMD)
- Support creating matrices with small or large leading dimensions in `test/sup` test drivers.
- Support executing `test/sup` drivers with unpacked or packed matrices.
- Added optional `numactl` usage to `test/3/runme.sh`.
- Updated and/or consolidated octave scripts in `test/3` and `test/sup`.
- Increased `dotxaxpyf` testsuite thresholds to avoid false `MARGINAL` results during normal execution. (nagsingh)
Documentation:
- Added Epyc 7742 Zen2 ("Rome") performance results (single- and multithreaded) to `Performance.md` and `PerformanceSmall.md`. (Jeff Diamond)
- Documented `gemmt` APIs in `BLISObjectAPI.md` and `BLISTypedAPI.md`. (AMD)
- Documented commonly-used object mutator functions in `BLISObjectAPI.md`. (Jeff Diamond)
- Relocated the operation indices of `BLISObjectAPI.md` and `BLISTypedAPI.md` to appear immediately after their respective tables of contents. (Jeff Diamond)
- Added missing perl prerequisite to `BuildSystem.md`. (pkubaj, Dilyn Corner)
- Fixed missing `conjy` parameter in `BLISTypedAPI.md` documentation for `her2` and `syr2`. (Robert van de Geijn)
- Fixed incorrect link to `shiftd` in `BLISTypedAPI.md`. (Jeff Diamond)
- Added a mention of the example code to the top of `BLISObjectAPI.md` and `BLISTypedAPI.md`.
- Minor updates to `README.md`, `FAQ.md`, `Multithreading.md`, and `Sandboxes.md` documents.
## Changes in 0.7.0
April 7, 2020

```diff
@@ -246,10 +246,11 @@ int main( int argc, char** argv )
 	// displaying junk values in the unstored triangle.
 	bli_setm( &BLIS_ZERO, &a );
-	// Mark matrix 'a' as triangular and stored in the lower triangle, and
-	// then randomize that lower triangle.
+	// Mark matrix 'a' as triangular, stored in the lower triangle, and
+	// having a non-unit diagonal. Then randomize that lower triangle.
 	bli_obj_set_struc( BLIS_TRIANGULAR, &a );
 	bli_obj_set_uplo( BLIS_LOWER, &a );
+	bli_obj_set_diag( BLIS_NONUNIT_DIAG, &a );
 	bli_randm( &a );
 	bli_printm( "a: randomized (zeros in upper triangle)", &a, "%4.1f", "" );
@@ -288,10 +289,11 @@ int main( int argc, char** argv )
 	// displaying junk values in the unstored triangle.
 	bli_setm( &BLIS_ZERO, &a );
-	// Mark matrix 'a' as triangular and stored in the lower triangle, and
-	// then randomize that lower triangle.
+	// Mark matrix 'a' as triangular, stored in the lower triangle, and
+	// having a non-unit diagonal. Then randomize that lower triangle.
 	bli_obj_set_struc( BLIS_TRIANGULAR, &a );
 	bli_obj_set_uplo( BLIS_LOWER, &a );
+	bli_obj_set_diag( BLIS_NONUNIT_DIAG, &a );
 	bli_randm( &a );
 	// Load the diagonal. By setting the diagonal to something of greater
```

```diff
@@ -244,10 +244,11 @@ int main( int argc, char** argv )
 	// displaying junk values in the unstored triangle.
 	bli_setm( &BLIS_ZERO, &a );
-	// Mark matrix 'a' as triangular and stored in the lower triangle, and
-	// then randomize that lower triangle.
+	// Mark matrix 'a' as triangular, stored in the lower triangle, and
+	// having a non-unit diagonal. Then randomize that lower triangle.
 	bli_obj_set_struc( BLIS_TRIANGULAR, &a );
 	bli_obj_set_uplo( BLIS_LOWER, &a );
+	bli_obj_set_diag( BLIS_NONUNIT_DIAG, &a );
 	bli_randm( &a );
 	bli_printm( "a: randomized (zeros in upper triangle)", &a, "%4.1f", "" );
@@ -290,10 +291,11 @@ int main( int argc, char** argv )
 	// displaying junk values in the unstored triangle.
 	bli_setm( &BLIS_ZERO, &a );
-	// Mark matrix 'a' as triangular and stored in the lower triangle, and
-	// then randomize that lower triangle.
+	// Mark matrix 'a' as triangular, stored in the lower triangle, and
+	// having a non-unit diagonal. Then randomize that lower triangle.
 	bli_obj_set_struc( BLIS_TRIANGULAR, &a );
 	bli_obj_set_uplo( BLIS_LOWER, &a );
+	bli_obj_set_diag( BLIS_NONUNIT_DIAG, &a );
 	bli_randm( &a );
 	// Load the diagonal. By setting the diagonal to something of greater
```

```diff
@@ -22,8 +22,8 @@ or by setting the same variable as part of the make command:
     make BLIS_INSTALL_PATH=/usr/local
 Once the executable files have been built, we recommend reading the code in
-one terminal window alongside the executable output in another. This will
-help you see the effects of each section of code.
+one terminal window alongside the executable output in another terminal.
+This will help you see the effects of each section of code.
 This tutorial is not exhaustive or complete; several object API functions
 were omitted (mostly for brevity's sake) and thus more examples could be
```

```diff
@@ -175,7 +175,7 @@ int main( int argc, char** argv )
 	free( y );
 	free( z );
 	free( w );
-	free( z );
+	free( a );
 	return 0;
 }
```