Reverted docs/*.md links to relative paths.

Details:
- Within the documents in docs/*.md, reverted links to other local
  documents to relative paths.
- Fixed some links/documents that did not yet have the '.md' suffix.
- Testing whether we can use relative links ('docs/BLISTypedAPI.md')
  from within README.md.
Field G. Van Zee
2018-07-07 20:01:29 -05:00
parent d97c862c2b
commit 7d3e8a7e5f
10 changed files with 191 additions and 191 deletions

@@ -14,7 +14,7 @@ intensive operations. BLIS is written in [ISO
C99](http://en.wikipedia.org/wiki/C99) and available under a
[new/modified/3-clause BSD
license](http://opensource.org/licenses/BSD-3-Clause). While BLIS exports a
[new BLAS-like API](https://github.com/flame/blis/wiki/BLISAPIQuickReference),
[new BLAS-like API](docs/BLISTypedAPI.md),
it also includes a BLAS compatibility layer which gives application developers
access to BLIS implementations via traditional [BLAS routine
calls](http://www.netlib.org/lapack/lug/node145.html). An object-based API

@@ -1,29 +1,29 @@
# Contents
* **[Contents](BLISTypedAPI#contents)**
* **[Introduction](BLISTypedAPI#introduction)**
* [BLIS types](BLISTypedAPI#blis-types)
* [Integer-based types](BLISTypedAPI#integer-based-types)
* [Floating-point types](BLISTypedAPI#floating-point-types)
* [Enumerated parameter types](BLISTypedAPI#enumerated-parameter-types)
* [Context type](BLISTypedAPI#context-type)
* [BLIS header file](BLISTypedAPI#blis-header-file)
* [Initialization and cleanup](BLISTypedAPI#initialization-and-cleanup)
* **[Computational function reference](BLISTypedAPI#computational-function-reference)**
* [Operation index](BLISTypedAPI#operation-index)
* [Level-1v operations](BLISTypedAPI#level-1v-operations)
* [Level-1d operations](BLISTypedAPI#level-1d-operations)
* [Level-1m operations](BLISTypedAPI#level-1m-operations)
* [Level-1f operations](BLISTypedAPI#level-1f-operations)
* [Level-2 operations](BLISTypedAPI#level-2-operations)
* [Level-3 operations](BLISTypedAPI#level-3-operations)
* [Utility operations](BLISTypedAPI#utility-operations)
* [Level-3 micro-kernels](BLISTypedAPI#level-3-micro-kernels)
* **[Query function reference](BLISTypedAPI#query-function-reference)**
* [General library information](BLISTypedAPI#general-library-information)
* [Specific configuration](BLISTypedAPI#specific-configuration)
* [General configuration](BLISTypedAPI#general-configuration)
* [Kernel information](BLISTypedAPI#kernel-information)
* **[Contents](BLISTypedAPI.md#contents)**
* **[Introduction](BLISTypedAPI.md#introduction)**
* [BLIS types](BLISTypedAPI.md#blis-types)
* [Integer-based types](BLISTypedAPI.md#integer-based-types)
* [Floating-point types](BLISTypedAPI.md#floating-point-types)
* [Enumerated parameter types](BLISTypedAPI.md#enumerated-parameter-types)
* [Context type](BLISTypedAPI.md#context-type)
* [BLIS header file](BLISTypedAPI.md#blis-header-file)
* [Initialization and cleanup](BLISTypedAPI.md#initialization-and-cleanup)
* **[Computational function reference](BLISTypedAPI.md#computational-function-reference)**
* [Operation index](BLISTypedAPI.md#operation-index)
* [Level-1v operations](BLISTypedAPI.md#level-1v-operations)
* [Level-1d operations](BLISTypedAPI.md#level-1d-operations)
* [Level-1m operations](BLISTypedAPI.md#level-1m-operations)
* [Level-1f operations](BLISTypedAPI.md#level-1f-operations)
* [Level-2 operations](BLISTypedAPI.md#level-2-operations)
* [Level-3 operations](BLISTypedAPI.md#level-3-operations)
* [Utility operations](BLISTypedAPI.md#utility-operations)
* [Level-3 micro-kernels](BLISTypedAPI.md#level-3-micro-kernels)
* **[Query function reference](BLISTypedAPI.md#query-function-reference)**
* [General library information](BLISTypedAPI.md#general-library-information)
* [Specific configuration](BLISTypedAPI.md#specific-configuration)
* [General configuration](BLISTypedAPI.md#general-configuration)
* [Kernel information](BLISTypedAPI.md#kernel-information)
# Introduction
@@ -176,20 +176,20 @@ Notes for interpreting function descriptions:
## Operation index
* **[Level-1v](BLISTypedAPI#level-1v-operations)**: Operations on vectors:
* [addv](BLISTypedAPI#addv), [amaxv](BLISTypedAPI#amaxv), [axpyv](BLISTypedAPI#axpyv), [copyv](BLISTypedAPI#copyv), [dotv](BLISTypedAPI#dotv), [dotxv](BLISTypedAPI#dotxv), [invertv](BLISTypedAPI#invertv), [scal2v](BLISTypedAPI#scal2v), [scalv](BLISTypedAPI#scalv), [setv](BLISTypedAPI#setv), [subv](BLISTypedAPI#subv), [swapv](BLISTypedAPI#swapv)
* **[Level-1d](BLISTypedAPI#level-1d-operations)**: Element-wise operations on matrix diagonals:
* [addd](BLISTypedAPI#addd), [axpyd](BLISTypedAPI#axpyd), [copyd](BLISTypedAPI#copyd), [invertd](BLISTypedAPI#invertd), [scald](BLISTypedAPI#scald), [scal2d](BLISTypedAPI#scal2d), [setd](BLISTypedAPI#setd), [setid](BLISTypedAPI#setid), [subd](BLISTypedAPI#subd)
* **[Level-1m](BLISTypedAPI#level-1m-operations)**: Element-wise operations on matrices:
* [addm](BLISTypedAPI#addm), [axpym](BLISTypedAPI#axpym), [copym](BLISTypedAPI#copym), [scalm](BLISTypedAPI#scalm), [scal2m](BLISTypedAPI#scal2m), [setm](BLISTypedAPI#setm), [subm](BLISTypedAPI#subm)
* **[Level-1f](BLISTypedAPI#level-1f-operations)**: Fused operations on multiple vectors:
* [axpy2v](BLISTypedAPI#axpy2v), [dotaxpyv](BLISTypedAPI#dotaxpyv), [axpyf](BLISTypedAPI#axpyf), [dotxf](BLISTypedAPI#dotxf), [dotxaxpyf](BLISTypedAPI#dotxaxpyf)
* **[Level-2](BLISTypedAPI#level-2-operations)**: Operations with one matrix and (at least) one vector operand:
* [gemv](BLISTypedAPI#gemv), [ger](BLISTypedAPI#ger), [hemv](BLISTypedAPI#hemv), [her](BLISTypedAPI#her), [her2](BLISTypedAPI#her2), [symv](BLISTypedAPI#symv), [syr](BLISTypedAPI#syr), [syr2](BLISTypedAPI#syr2), [trmv](BLISTypedAPI#trmv), [trsv](BLISTypedAPI#trsv)
* **[Level-3](BLISTypedAPI#level-3-operations)**: Operations with matrices that are multiplication-like:
* [gemm](BLISTypedAPI#gemm), [hemm](BLISTypedAPI#hemm), [herk](BLISTypedAPI#herk), [her2k](BLISTypedAPI#her2k), [symm](BLISTypedAPI#symm), [syrk](BLISTypedAPI#syrk), [syr2k](BLISTypedAPI#syr2k), [trmm](BLISTypedAPI#trmm), [trmm3](BLISTypedAPI#trmm3), [trsm](BLISTypedAPI#trsm)
* **[Utility](BLISTypedAPI#Utility-operations)**: Miscellaneous operations on matrices and vectors:
* [asumv](BLISTypedAPI#asumv), [norm1v](BLISTypedAPI#norm1v), [normfv](BLISTypedAPI#normfv), [normiv](BLISTypedAPI#normiv), [norm1m](BLISTypedAPI#norm1m), [normfm](BLISTypedAPI#normfm), [normim](BLISTypedAPI#normim), [mkherm](BLISTypedAPI#mkherm), [mksymm](BLISTypedAPI#mksymm), [mktrim](BLISTypedAPI#mktrim), [fprintv](BLISTypedAPI#fprintv), [fprintm](BLISTypedAPI#fprintm),[printv](BLISTypedAPI#printv), [printm](BLISTypedAPI#printm), [randv](BLISTypedAPI#randv), [randm](BLISTypedAPI#randm), [sumsqv](BLISTypedAPI#sumsqv)
* **[Level-1v](BLISTypedAPI.md#level-1v-operations)**: Operations on vectors:
* [addv](BLISTypedAPI.md#addv), [amaxv](BLISTypedAPI.md#amaxv), [axpyv](BLISTypedAPI.md#axpyv), [copyv](BLISTypedAPI.md#copyv), [dotv](BLISTypedAPI.md#dotv), [dotxv](BLISTypedAPI.md#dotxv), [invertv](BLISTypedAPI.md#invertv), [scal2v](BLISTypedAPI.md#scal2v), [scalv](BLISTypedAPI.md#scalv), [setv](BLISTypedAPI.md#setv), [subv](BLISTypedAPI.md#subv), [swapv](BLISTypedAPI.md#swapv)
* **[Level-1d](BLISTypedAPI.md#level-1d-operations)**: Element-wise operations on matrix diagonals:
* [addd](BLISTypedAPI.md#addd), [axpyd](BLISTypedAPI.md#axpyd), [copyd](BLISTypedAPI.md#copyd), [invertd](BLISTypedAPI.md#invertd), [scald](BLISTypedAPI.md#scald), [scal2d](BLISTypedAPI.md#scal2d), [setd](BLISTypedAPI.md#setd), [setid](BLISTypedAPI.md#setid), [subd](BLISTypedAPI.md#subd)
* **[Level-1m](BLISTypedAPI.md#level-1m-operations)**: Element-wise operations on matrices:
* [addm](BLISTypedAPI.md#addm), [axpym](BLISTypedAPI.md#axpym), [copym](BLISTypedAPI.md#copym), [scalm](BLISTypedAPI.md#scalm), [scal2m](BLISTypedAPI.md#scal2m), [setm](BLISTypedAPI.md#setm), [subm](BLISTypedAPI.md#subm)
* **[Level-1f](BLISTypedAPI.md#level-1f-operations)**: Fused operations on multiple vectors:
* [axpy2v](BLISTypedAPI.md#axpy2v), [dotaxpyv](BLISTypedAPI.md#dotaxpyv), [axpyf](BLISTypedAPI.md#axpyf), [dotxf](BLISTypedAPI.md#dotxf), [dotxaxpyf](BLISTypedAPI.md#dotxaxpyf)
* **[Level-2](BLISTypedAPI.md#level-2-operations)**: Operations with one matrix and (at least) one vector operand:
* [gemv](BLISTypedAPI.md#gemv), [ger](BLISTypedAPI.md#ger), [hemv](BLISTypedAPI.md#hemv), [her](BLISTypedAPI.md#her), [her2](BLISTypedAPI.md#her2), [symv](BLISTypedAPI.md#symv), [syr](BLISTypedAPI.md#syr), [syr2](BLISTypedAPI.md#syr2), [trmv](BLISTypedAPI.md#trmv), [trsv](BLISTypedAPI.md#trsv)
* **[Level-3](BLISTypedAPI.md#level-3-operations)**: Operations with matrices that are multiplication-like:
* [gemm](BLISTypedAPI.md#gemm), [hemm](BLISTypedAPI.md#hemm), [herk](BLISTypedAPI.md#herk), [her2k](BLISTypedAPI.md#her2k), [symm](BLISTypedAPI.md#symm), [syrk](BLISTypedAPI.md#syrk), [syr2k](BLISTypedAPI.md#syr2k), [trmm](BLISTypedAPI.md#trmm), [trmm3](BLISTypedAPI.md#trmm3), [trsm](BLISTypedAPI.md#trsm)
* **[Utility](BLISTypedAPI.md#Utility-operations)**: Miscellaneous operations on matrices and vectors:
* [asumv](BLISTypedAPI.md#asumv), [norm1v](BLISTypedAPI.md#norm1v), [normfv](BLISTypedAPI.md#normfv), [normiv](BLISTypedAPI.md#normiv), [norm1m](BLISTypedAPI.md#norm1m), [normfm](BLISTypedAPI.md#normfm), [normim](BLISTypedAPI.md#normim), [mkherm](BLISTypedAPI.md#mkherm), [mksymm](BLISTypedAPI.md#mksymm), [mktrim](BLISTypedAPI.md#mktrim), [fprintv](BLISTypedAPI.md#fprintv), [fprintm](BLISTypedAPI.md#fprintm),[printv](BLISTypedAPI.md#printv), [printm](BLISTypedAPI.md#printm), [randv](BLISTypedAPI.md#randv), [randm](BLISTypedAPI.md#randm), [sumsqv](BLISTypedAPI.md#sumsqv)
---
@@ -765,7 +765,7 @@ Perform
y := y + alphax * conjx(x) + alphay * conjy(y)
```
where `x`, `y`, and `z` are vectors of length _m_. The kernel, if optimized, is implemented as a fused pair of calls to [axpyv](BLISTypedAPI#axpyv).
where `x`, `y`, and `z` are vectors of length _m_. The kernel, if optimized, is implemented as a fused pair of calls to [axpyv](BLISTypedAPI.md#axpyv).
---
@@ -790,7 +790,7 @@ Perform
y := y + alpha * conjx(x)
```
where `x`, `y`, and `z` are vectors of length _m_ and `alpha` and `rho` are scalars. The kernel, if optimized, is implemented as a fusion of calls to [dotv](BLISTypedAPI#dotv) and [axpyv](BLISTypedAPI#axpyv).
where `x`, `y`, and `z` are vectors of length _m_ and `alpha` and `rho` are scalars. The kernel, if optimized, is implemented as a fusion of calls to [dotv](BLISTypedAPI.md#dotv) and [axpyv](BLISTypedAPI.md#axpyv).
---
@@ -813,7 +813,7 @@ Perform
y := y + alpha * conja(A) * conjx(x)
```
where `A` is an _m x nf_ matrix, and `y` and `x` are vectors. The kernel, if optimized, is implemented as a fused series of calls to [axpyv](BLISTypedAPI#axpyv) where _nf_ is less than or equal to an implementation-dependent fusing factor specific to `axpyf`.
where `A` is an _m x nf_ matrix, and `y` and `x` are vectors. The kernel, if optimized, is implemented as a fused series of calls to [axpyv](BLISTypedAPI.md#axpyv) where _nf_ is less than or equal to an implementation-dependent fusing factor specific to `axpyf`.
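The relationship between `axpyf` and `axpyv` described above can be sketched as an unfused reference loop. This is only an illustration, not the BLIS implementation: the names `axpyv_ref` and `axpyf_ref` are hypothetical, the sketch assumes real `double` data (so `conja` and `conjx` are no-ops), and it assumes `A` is stored column-major with leading dimension `lda`.

```c
#include <stddef.h>

/* Hypothetical reference axpyv for real doubles: y := y + alpha * x. */
void axpyv_ref(size_t m, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < m; ++i)
        y[i] += alpha * x[i];
}

/* Hypothetical unfused axpyf: y := y + alpha * A * x, where A is m x nf
   and stored column-major with leading dimension lda (an assumption).
   Each column of A contributes one axpyv scaled by alpha * x[j]. */
void axpyf_ref(size_t m, size_t nf, double alpha,
               const double *a, size_t lda,
               const double *x, double *y)
{
    for (size_t j = 0; j < nf; ++j)
        axpyv_ref(m, alpha * x[j], a + j * lda, y);
}
```

An optimized kernel would presumably fuse the `nf` passes so that `y` is loaded and stored once rather than once per column; the loop above only shows what is being computed.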
---
@@ -837,7 +837,7 @@ Perform
y := y + alpha * conjat(A^T) * conjx(x)
```
where `A` is an _m x nf_ matrix, and `y` and `x` are vectors. The kernel, if optimized, is implemented as a fused series of calls to [dotxv](BLISTypedAPI#dotxv) where _nf_ is less than or equal to an implementation-dependent fusing factor specific to `dotxf`.
where `A` is an _m x nf_ matrix, and `y` and `x` are vectors. The kernel, if optimized, is implemented as a fused series of calls to [dotxv](BLISTypedAPI.md#dotxv) where _nf_ is less than or equal to an implementation-dependent fusing factor specific to `dotxf`.
---
@@ -866,7 +866,7 @@ Perform
z := z + alpha * conja(A) * conjx(x)
```
where `A` is an _m x nf_ matrix, `w` and `z` are vectors of length _m_, `x` and `y` are vectors of length `nf`, and `alpha` and `beta` are scalars. The kernel, if optimized, is implemented as a fusion of calls to [dotxf](BLISTypedAPI#dotxf) and [axpyf](BLISTypedAPI#axpyf).
where `A` is an _m x nf_ matrix, `w` and `z` are vectors of length _m_, `x` and `y` are vectors of length `nf`, and `alpha` and `beta` are scalars. The kernel, if optimized, is implemented as a fusion of calls to [dotxf](BLISTypedAPI.md#dotxf) and [axpyf](BLISTypedAPI.md#axpyf).
@@ -1109,7 +1109,7 @@ where `A` is an _m x m_ triangular matrix stored in the lower or upper triangle
## Level-3 operations
Level-3 operations perform various level-3 BLAS-like operations.
**Note**: All level-3 operations are implemented through a handful of level-3 micro-kernels. Please see the [Kernels Guide](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md) for more details.
**Note**: All level-3 operations are implemented through a handful of level-3 micro-kernels. Please see the [Kernels Guide](KernelsHowTo.md) for more details.
---
@@ -1626,7 +1626,7 @@ Perform
```
where `C11` is an _MR x NR_ matrix, `A1` is an _MR x k_ "micro-panel" matrix stored in packed (column-stored) format, `B1` is a _k x NR_ "micro-panel" matrix in packed (row-stored) format, and alpha and beta are scalars. The storage of `C11` is specified by its row and column strides, `rsc` and `csc`.
Please see the [Kernel Guide](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md) for more information on the `gemm` micro-kernel.
Please see the [Kernel Guide](KernelsHowTo.md) for more information on the `gemm` micro-kernel.
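As a rough illustration of the storage layout described above, the following is a minimal, unoptimized sketch of the `gemm` micro-kernel update `C11 := beta * C11 + alpha * A1 * B1`. It is not the BLIS reference kernel: the function name `gemm_ukr_ref` and its signature are hypothetical, it handles real `double` data only, it takes _MR_ and _NR_ as run-time arguments, and it assumes the packed micro-panels use leading dimensions equal to _MR_ and _NR_.

```c
#include <stddef.h>

/* Hypothetical sketch of the gemm micro-kernel computation.
   a1:  MR x k micro-panel, packed column-stored: A1(i,p) = a1[p*mr + i]
   b1:  k x NR micro-panel, packed row-stored:    B1(p,j) = b1[p*nr + j]
   c11: MR x NR block with row stride rsc and column stride csc */
void gemm_ukr_ref(size_t mr, size_t nr, size_t k,
                  double alpha, const double *a1, const double *b1,
                  double beta, double *c11, size_t rsc, size_t csc)
{
    for (size_t i = 0; i < mr; ++i) {
        for (size_t j = 0; j < nr; ++j) {
            double ab = 0.0;

            /* Dot product of row i of A1 with column j of B1. */
            for (size_t p = 0; p < k; ++p)
                ab += a1[p * mr + i] * b1[p * nr + j];

            double *cij = &c11[i * rsc + j * csc];
            *cij = beta * (*cij) + alpha * ab;
        }
    }
}
```

Because `C11` is addressed only through `rsc` and `csc`, the same loop covers row-major output (`rsc = NR`, `csc = 1`), column-major output (`rsc = 1`, `csc = MR`), and general strides.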
---
@@ -1658,7 +1658,7 @@ Perform
```
where `A11` is an _MR x MR_ lower or upper triangular matrix stored in packed (column-stored) format, `B11` is an _MR x NR_ matrix stored in packed (row-stored) format, and `C11` is an _MR x NR_ matrix stored according to row and column strides `rsc` and `csc`.
Please see the [Kernel Guide](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md) for more information on the `trsm` micro-kernel.
Please see the [Kernel Guide](KernelsHowTo.md) for more information on the `trsm` micro-kernel.
---
@@ -1704,7 +1704,7 @@ if `A11` is lower triangular, or
```
if `A11` is upper triangular.
Please see the [Kernel Guide](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md) for more information on the `gemmtrsm` micro-kernel.
Please see the [Kernel Guide](KernelsHowTo.md) for more information on the `gemmtrsm` micro-kernel.
@@ -1795,7 +1795,7 @@ Possible micro-kernel types (ie: the return values for `bli_info_get_*_ukr_impl_
### Operation implementation type query
The following routines allow the caller to obtain a string that identifies the implementation (`ind_t`) that is currently active (i.e., implemented and enabled) for each level-3 operation. Possible implementation types are listed in the section above covering [micro-kernel implementation query](BLISTypedAPI#micro-kernel-implementation-type-query).
The following routines allow the caller to obtain a string that identifies the implementation (`ind_t`) that is currently active (i.e., implemented and enabled) for each level-3 operation. Possible implementation types are listed in the section above covering [micro-kernel implementation query](BLISTypedAPI.md#micro-kernel-implementation-type-query).
```c
char* bli_info_get_gemm_impl_string( num_t dt );
char* bli_info_get_hemm_impl_string( num_t dt );

@@ -1,17 +1,17 @@
## Contents
* **[Contents](BuildSystem#contents)**
* **[Introduction](BuildSystem#introduction)**
* **[Obtaining BLIS](BuildSystem#obtaining-blis)**
* **[Step 1: Choose a framework configuration](BuildSystem#step-1-choose-a-framework-configuration)**
* **[Step 2: Running `configure`](BuildSystem#step-2-running-configure)**
* **[Step 3: Compilation](BuildSystem#step-3-compilation)**
* **[Step 3b: Testing (optional)](BuildSystem#step-3b-testing-optional)**
* **[Step 4: Installation](BuildSystem#step-4-installation)**
* **[Cleaning out build products](BuildSystem#cleaning-out-build-products)**
* **[Linking against BLIS](BuildSystem#linking-against-blis)**
* **[Uninstalling](BuildSystem#uninstalling)**
* **[Conclusion](BuildSystem#conclusion)**
* **[Contents](BuildSystem.md#contents)**
* **[Introduction](BuildSystem.md#introduction)**
* **[Obtaining BLIS](BuildSystem.md#obtaining-blis)**
* **[Step 1: Choose a framework configuration](BuildSystem.md#step-1-choose-a-framework-configuration)**
* **[Step 2: Running `configure`](BuildSystem.md#step-2-running-configure)**
* **[Step 3: Compilation](BuildSystem.md#step-3-compilation)**
* **[Step 3b: Testing (optional)](BuildSystem.md#step-3b-testing-optional)**
* **[Step 4: Installation](BuildSystem.md#step-4-installation)**
* **[Cleaning out build products](BuildSystem.md#cleaning-out-build-products)**
* **[Linking against BLIS](BuildSystem.md#linking-against-blis)**
* **[Uninstalling](BuildSystem.md#uninstalling)**
* **[Conclusion](BuildSystem.md#conclusion)**
## Introduction
@@ -52,7 +52,7 @@ LICENSE build config_registry lib test
The first step is to choose how to configure BLIS. Specifically, a user must decide which configuration to use, or whether to allow `configure` to automatically guess the best configuration for your hardware. (Note: This automatic configuration selection only applies to x86_64 systems.)
Configurations are described in detail in the [Configuration Guide](https://github.com/flame/blis/blob/master/docs/ConfigurationHowTo.md).
Configurations are described in detail in the [Configuration Guide](ConfigurationHowTo.md).
Generally speaking, a configuration consists of several files that reside in a sub-directory of the `config` directory. To see a list of the available configurations, you may inspect this directory, or run `configure` with no arguments. Here are the current (as of this writing) contents of the `config` directory:
```
@@ -66,7 +66,7 @@ By targeting the `auto` configuration (i.e., `./configure auto`), the user is re
Another special configuration (one that, unlike `auto`, _is_ present in `config`) is the `generic` configuration. This configuration, like its name suggests, is architecture-agnostic and may be targeted in virtually any environment that supports the minimum build requirements of BLIS. The `generic` configuration uses a set of built-in, portable reference kernels (written in C99) that should work without modification on most, if not all, architectures. These reference kernels, however, should be expected to yield relatively low performance since they do not employ any architecture-specific optimizations beyond those the compiler provides automatically. (Historical note: The `generic` configuration corresponds to the `reference` configuration of previous releases of BLIS.)
If you are a BLIS developer and wish to create your own configuration, either from scratch or using an existing configuration as a starting point, please read the [BLIS configuration guide](ConfigurationHowTo).
If you are a BLIS developer and wish to create your own configuration, either from scratch or using an existing configuration as a starting point, please read the BLIS [Configuration Guide](ConfigurationHowTo.md).
## Step 2: Running `configure`
@@ -74,7 +74,7 @@ This step should be somewhat familiar to many people who use open source softwar
```
$ ./configure <configname>
```
where `<configname>` is the configuration sub-directory name you chose in [Step 1](BuildSystem#step-1-choose-a-framework-configuration) above. If `<configname>` is not given, a helpful message is printed reminding you to explicitly specify a configuration name along with a list of valid configuration families and their implied sub-configurations. For more information on sub-configurations and families, please see the [BLIS configuration guide](ConfigurationHowTo).
where `<configname>` is the configuration sub-directory name you chose in [Step 1](BuildSystem.md#step-1-choose-a-framework-configuration) above. If `<configname>` is not given, a helpful message is printed reminding you to explicitly specify a configuration name along with a list of valid configuration families and their implied sub-configurations. For more information on sub-configurations and families, please see the BLIS [Configuration Guide](ConfigurationHowTo.md).
Alternatively, `configure` can automatically select a configuration based on your hardware:
```
@@ -237,7 +237,7 @@ Watch the output near the end. You should see the following messages, though not
All BLIS tests passed!
All BLAS tests passed!
```
Please see the [Testsuite](https://github.com/flame/blis/blob/master/docs/Testsuite.md) document for more details on running either the BLIS testsuite or the BLAS test drivers. If you have any trouble, please report your problem to BLIS developers by opening a [new issue](https://github.com/flame/blis/issues/).
Please see the [Testsuite](Testsuite.md) document for more details on running either the BLIS testsuite or the BLAS test drivers. If you have any trouble, please report your problem to BLIS developers by opening a [new issue](https://github.com/flame/blis/issues/).
## Step 4: Installation

@@ -1,17 +1,17 @@
## Contents
* **[Contents](CodingConventions#contents)**
* **[Introduction](CodingConventions#introduction)**
* **[C99](CodingConventions#c99)**
* [Placement of braces](CodingConventions#placement-of-braces)
* [Indentation](CodingConventions#indentation)
* [Comments](CodingConventions#comments)
* [Blank lines](CodingConventions#blank-lines)
* [Condensing short code to single lines](CodingConventions#condensing-short-code-to-single-lines)
* [Whitespace in function calls](CodingConventions#whitespace-in-function-calls)
* [Whitespace in function definitions](CodingConventions#whitespace-in-function-definitions)
* [Whitespace in expressions](CodingConventions#whitespace-in-expressions)
* [Trailing whitespace](CodingConventions#trailing-whitespace)
* **[Contents](CodingConventions.md#contents)**
* **[Introduction](CodingConventions.md#introduction)**
* **[C99](CodingConventions.md#c99)**
* [Placement of braces](CodingConventions.md#placement-of-braces)
* [Indentation](CodingConventions.md#indentation)
* [Comments](CodingConventions.md#comments)
* [Blank lines](CodingConventions.md#blank-lines)
* [Condensing short code to single lines](CodingConventions.md#condensing-short-code-to-single-lines)
* [Whitespace in function calls](CodingConventions.md#whitespace-in-function-calls)
* [Whitespace in function definitions](CodingConventions.md#whitespace-in-function-definitions)
* [Whitespace in expressions](CodingConventions.md#whitespace-in-expressions)
* [Trailing whitespace](CodingConventions.md#trailing-whitespace)
## Introduction

@@ -1,28 +1,28 @@
## Contents
* **[Contents](ConfigurationHowTo#contents)**
* **[Introduction](ConfigurationHowTo#introduction)**
* **[Sub-configurations](ConfigurationHowTo#sub-configurations)**
* [`bli_cntx_init_*.c`](ConfigurationHowTo#bli_cntx_init_c)
* [`bli_family_*.h`](ConfigurationHowTo#bli_family_h)
* [`make_defs.mk`](ConfigurationHowTo#make_defsmk)
* **[Configuration families](ConfigurationHowTo#configuration-families)**
* **[Configuration registry](ConfigurationHowTo#configuration-registry)**
* [Walkthrough](ConfigurationHowTo#walkthrough)
* [Printing the configuration registry lists](ConfigurationHowTo#printing-the-configuration-registry-lists)
* **[Adding a new kernel set](ConfigurationHowTo#adding-a-new-kernel-set)**
* **[Adding a new configuration family](ConfigurationHowTo#adding-a-new-configuration-family)**
* **[Adding a new sub-configuration](ConfigurationHowTo#adding-a-new-sub-configuration)**
* **[Further development topics](ConfigurationHowTo#further-development-topics)**
* [Querying the current configuration](ConfigurationHowTo#querying-the-current-configuration)
* [Header dependencies](ConfigurationHowTo#header-dependencies)
* [Still have questions?](ConfigurationHowTo#still-have-questions)
* **[Contents](ConfigurationHowTo.md#contents)**
* **[Introduction](ConfigurationHowTo.md#introduction)**
* **[Sub-configurations](ConfigurationHowTo.md#sub-configurations)**
* [`bli_cntx_init_*.c`](ConfigurationHowTo.md#bli_cntx_init_c)
* [`bli_family_*.h`](ConfigurationHowTo.md#bli_family_h)
* [`make_defs.mk`](ConfigurationHowTo.md#make_defsmk)
* **[Configuration families](ConfigurationHowTo.md#configuration-families)**
* **[Configuration registry](ConfigurationHowTo.md#configuration-registry)**
* [Walkthrough](ConfigurationHowTo.md#walkthrough)
* [Printing the configuration registry lists](ConfigurationHowTo.md#printing-the-configuration-registry-lists)
* **[Adding a new kernel set](ConfigurationHowTo.md#adding-a-new-kernel-set)**
* **[Adding a new configuration family](ConfigurationHowTo.md#adding-a-new-configuration-family)**
* **[Adding a new sub-configuration](ConfigurationHowTo.md#adding-a-new-sub-configuration)**
* **[Further development topics](ConfigurationHowTo.md#further-development-topics)**
* [Querying the current configuration](ConfigurationHowTo.md#querying-the-current-configuration)
* [Header dependencies](ConfigurationHowTo.md#header-dependencies)
* [Still have questions?](ConfigurationHowTo.md#still-have-questions)
## Introduction
This document describes how to manage, edit, and create BLIS framework configurations. **The target audience is primarily BLIS developers** who wish to add support for new types of hardware, and developers who write (or tinker with) BLIS kernels.
The BLIS [Build System](https://github.com/flame/blis/blob/master/docs/BuildSystem.md) guide introduces the concept of a BLIS [configuration](https://github.com/flame/blis/blob/master/docs/BuildSystem#Step_1:_Choose_a_framework_configuration). There are actually two types of configurations: sub-configurations and configuration families.
The BLIS [Build System](BuildSystem.md) guide introduces the concept of a BLIS [configuration](BuildSystem.md#Step_1:_Choose_a_framework_configuration). There are actually two types of configurations: sub-configurations and configuration families.
A _sub-configuration_ encapsulates all of the information needed to build BLIS for a particular microarchitecture. For example, the `haswell` configuration allows a user or developer to build a BLIS library that targets hardware based on Intel Haswell (or Broadwell or Skylake/Kabylake desktop) microprocessors. Such a sub-configuration typically includes optimized kernels as well as the corresponding cache and register blocksizes that allow those kernels to work well on the target hardware.
@@ -170,7 +170,7 @@ Here, we use `bli_blksz_init()` to set different auxiliary (maximum) cache block
Note that we set level-3 blocksizes even for datatypes that retain reference code kernels; however, by passing in `0` for those blocksizes, we indicate to `bli_blksz_init()` and `bli_blksz_init_easy()` that the current value should be left untouched. In the example above, this leaves the blocksizes associated with the reference kernels (set by `bli_cntx_init_fooarch_ref()`) intact for the single real, single complex, and double complex datatypes.
_Digression:_ Auxiliary blocksize values for register blocksizes are interpreted as the "packing" register blocksizes. _PACKMR_ and _PACKNR_ serve as "leading dimensions" of the packed micro-panels that are passed into the micro-kernel. Oftentimes, _PACKMR = MR_ and _PACKNR = NR_, and thus the developer does not typically need to set these values manually. (See the [implementation notes for gemm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#Implementation_Notes_for_gemm) in the BLIS Kernel guide for more details on these topics.)
_Digression:_ Auxiliary blocksize values for register blocksizes are interpreted as the "packing" register blocksizes. _PACKMR_ and _PACKNR_ serve as "leading dimensions" of the packed micro-panels that are passed into the micro-kernel. Oftentimes, _PACKMR = MR_ and _PACKNR = NR_, and thus the developer does not typically need to set these values manually. (See the [implementation notes for gemm](KernelsHowTo.md#Implementation_Notes_for_gemm) in the BLIS Kernel guide for more details on these topics.)
_Digression:_ Auxiliary blocksize values for cache blocksizes are interpreted as the maximum cache blocksizes. The maximum cache blocksizes are a convenient and portable way of smoothing performance of the level-3 operations when computing with a matrix operand that is just slightly larger than a multiple of the preferred cache blocksize in that dimension. In these "edge cases," iterations run with highly sub-optimal blocking. We can address this problem by merging the "edge case" iteration with the second-to-last iteration, such that the cache blocksizes are slightly larger--rather than significantly smaller--than optimal. The maximum cache blocksizes allow the developer to specify the _maximum_ size of this merged iteration; if the edge case causes the merged iteration to exceed this maximum, then the edge case is _not_ merged and instead is computed in a separate (final) iteration.
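The merging rule for maximum cache blocksizes can be sketched as follows. This is a simplified illustration rather than the framework's actual partitioning code: the function names and the parameters `b_alg` (the preferred cache blocksize) and `b_max` (the maximum cache blocksize) are hypothetical, and the sketch assumes `b_alg > 0` and `b_max >= b_alg`.

```c
#include <stddef.h>

/* Hypothetical helper: choose the size of the next iteration given how much
   of the dimension remains. If everything that is left fits within b_max,
   the edge case is merged into this (final) iteration; otherwise a normal
   b_alg-sized iteration is taken. */
size_t next_blocksize(size_t dim_left, size_t b_alg, size_t b_max)
{
    if (dim_left <= b_max)
        return dim_left;
    return b_alg;
}

/* Count how many iterations are needed to cover a dimension of size dim. */
size_t count_iterations(size_t dim, size_t b_alg, size_t b_max)
{
    size_t n_iter = 0;

    while (dim > 0) {
        dim -= next_blocksize(dim, b_alg, b_max);
        ++n_iter;
    }
    return n_iter;
}
```

For example, with `b_alg = 256` and `b_max = 320`, a dimension of 1030 is covered by blocks of 256, 256, 256, and 262, so the 6-element edge case is absorbed into the last iteration. With a dimension of 1100, the merged iteration would be 332, which exceeds `b_max`, so the edge case of 76 is computed separately.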
@@ -180,7 +180,7 @@ _**Availability of kernels.**_ Note that any kernel made available to the `fooar
```
fooarch: fooarch/fooarch/bararch
```
Interpreting the line left-to-right: the `fooarch` configuration family contains only itself, `fooarch`, but must be able to refer to kernels from its own kernel set (`fooarch`) as well as kernels belonging to the `bararch` kernel set. The configuration registry is described more completely [in a later section](https://github.com/flame/blis/blob/master/docs/ConfigurationHowTo.md#configuration-registry).
Interpreting the line left-to-right: the `fooarch` configuration family contains only itself, `fooarch`, but must be able to refer to kernels from its own kernel set (`fooarch`) as well as kernels belonging to the `bararch` kernel set. The configuration registry is described more completely [in a later section](ConfigurationHowTo.md#configuration-registry).
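To make the left-to-right reading of such a line concrete, here is a small parsing sketch. It is purely illustrative and is not how `configure` processes the registry: the `registry_entry` type, the `parse_registry_line` function, and the fixed buffer sizes are all hypothetical, and the sketch assumes a line with a single member entry of the form `family: member/kernelset/kernelset`.

```c
#include <stdio.h>
#include <string.h>

#define MAX_KSETS 8
#define NAME_LEN  32

/* Hypothetical in-memory form of one registry line. */
typedef struct {
    char family[NAME_LEN];           /* name to the left of the colon    */
    char member[NAME_LEN];           /* first name to the right          */
    char ksets[MAX_KSETS][NAME_LEN]; /* names following each '/'         */
    int  n_ksets;
} registry_entry;

/* Parse "family: member/kset1/kset2". Returns 0 on success, -1 on error. */
int parse_registry_line(const char *line, registry_entry *e)
{
    char rhs[128];

    /* Split at the colon: family name on the left, the rest on the right. */
    if (sscanf(line, " %31[^: ] : %127s", e->family, rhs) != 2)
        return -1;

    /* The first '/'-separated token is the member sub-configuration. */
    char *tok = strtok(rhs, "/");
    if (tok == NULL)
        return -1;
    snprintf(e->member, NAME_LEN, "%s", tok);

    /* Any remaining tokens name the kernel sets the member may refer to. */
    e->n_ksets = 0;
    while ((tok = strtok(NULL, "/")) != NULL && e->n_ksets < MAX_KSETS)
        snprintf(e->ksets[e->n_ksets++], NAME_LEN, "%s", tok);

    return 0;
}
```

Applied to `fooarch: fooarch/fooarch/bararch`, the sketch yields the family `fooarch`, the member `fooarch`, and the two kernel sets `fooarch` and `bararch`.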
@@ -322,12 +322,12 @@ $ ls config/amd64
bli_family_amd64.h make_defs.mk
```
A configuration family contains a subset of the files contained within a sub-configuration: A `bli_family_*.h` header file and a `make_defs.mk` makefile fragment:
* `bli_family_amd64.h`. This header file is `#included` only when the configuration family in question, in this case `amd64`, was the target to `./configure`. The file serves a similar purpose as with sub-configurations--a place to define various parameters, such as those relating to memory allocation and alignment. However, in the context of configuration families, the uniqueness of this file makes a bit more sense. Importantly, the definitions in this file will affect **all** sub-configurations within the family. Thus, it is useful to think of these as "global" parameters. For example, if custom implementations of `malloc()` and `free()` are specified in the `bli_family_amd64.h` file, these implementations will be used for every sub-configuration member of the `amd64` family. (The configuration registry, described in [the next section](https://github.com/flame/blis/blob/master/docs/ConfigurationHowTo.md#configuration-registry), specifies each configuration family's membership.) As with sub-configurations, this file may be empty, in which case reasonable defaults are selected by the framework.
* `bli_family_amd64.h`. This header file is `#included` only when the configuration family in question, in this case `amd64`, was the target to `./configure`. The file serves a similar purpose as with sub-configurations--a place to define various parameters, such as those relating to memory allocation and alignment. However, in the context of configuration families, the uniqueness of this file makes a bit more sense. Importantly, the definitions in this file will affect **all** sub-configurations within the family. Thus, it is useful to think of these as "global" parameters. For example, if custom implementations of `malloc()` and `free()` are specified in the `bli_family_amd64.h` file, these implementations will be used for every sub-configuration member of the `amd64` family. (The configuration registry, described in [the next section](ConfigurationHowTo.md#configuration-registry), specifies each configuration family's membership.) As with sub-configurations, this file may be empty, in which case reasonable defaults are selected by the framework.
* `make_defs.mk`. This makefile fragment defines the compiler and compiler flags in a manner identical to that of sub-configurations. However, these configuration flags are used when compiling source code that is not specific to any one particular sub-configuration. (The build system compiles a set of reference kernels and optimized kernels for each sub-configuration, during which it uses flags read from the individual sub-configurations' `make_defs.mk` files. By contrast, the general framework code is compiled once--using the flags read from the family's `make_defs.mk` file--and executed by all sub-configurations.)
For a more detailed walkthrough of these files' expected/allowed contents, please see the descriptions provided in the section on [sub-configurations](https://github.com/flame/blis/blob/master/docs/ConfigurationHowTo.md#sub-configurations):
* [bli_family_*.h](https://github.com/flame/blis/blob/master/docs/ConfigurationHowTo.md#bli_family_h)
* [make_defs.mk](https://github.com/flame/blis/blob/master/docs/ConfigurationHowTo.md#make_defsmk)
For a more detailed walkthrough of these files' expected/allowed contents, please see the descriptions provided in the section on [sub-configurations](ConfigurationHowTo.md#sub-configurations):
* [bli_family_*.h](ConfigurationHowTo.md#bli_family_h)
* [make_defs.mk](ConfigurationHowTo.md#make_defsmk)
With these two files defined and present, the configuration family is properly constituted and ready to be registered within the configuration registry.
@@ -489,7 +489,7 @@ configure: steamroller: piledriver
configure: x86_64: haswell sandybridge penryn zen piledriver bulldozer generic
configure: zen: zen
```
This shows the kernel sets that are pulled in by each configuration family. For singleton families, this is specified in a straightforward manner via the `/` character described [in the previous section](ConfigurationHowTo#Walkthrough). For umbrella families, this is determined indirectly by looking up the definitions of the singleton families that are members of the umbrella family.
This shows the kernel sets that are pulled in by each configuration family. For singleton families, this is specified in a straightforward manner via the `/` character described [in the previous section](ConfigurationHowTo.md#Walkthrough). For umbrella families, this is determined indirectly by looking up the definitions of the singleton families that are members of the umbrella family.
Next, the full kernel-to-configuration map is printed:
```
@@ -525,7 +525,7 @@ $ ls kernels
armv7a bgq generic knc old piledriver sandybridge
armv8a bulldozer haswell knl penryn power7
```
Next, we must write the `knl` kernels and place them inside `kernels/knl`. (For more information on writing BLIS kernels, please see the [Kernels Guide](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md).) We recommend separating level-1v, level-1f, and level-3 kernels into separate `1`, `1f`, and `3` sub-directories, respectively. The kernel files and functions therein do not need to follow any particular naming convention, though we strongly recommend using the conventions already used by other kernel sets. Take a look at other kernel files, such as those for `haswell`, [for examples](https://github.com/flame/blis/tree/master/kernels). Finally, for the `knl` kernel set, you should insert a file named `bli_kernels_knl.h` into `kernels/knl` that prototypes all of your new kernel set's kernel functions. You are welcome to write your own prototypes, but to make the prototyping of kernels easier we recommend using the prototype-generating macros for level-1v, level-1f, level-1m, and level-3 functions defined in [frame/1/bli_l1v_ker_prot.h](https://github.com/flame/blis/blob/master/frame/1/bli_l1v_ker_prot.h), [frame/1f/bli_l1f_ker_prot.h](https://github.com/flame/blis/blob/master/frame/1f/bli_l1f_ker_prot.h), [frame/1m/bli_l1m_ker_prot.h](https://github.com/flame/blis/blob/master/frame/1m/bli_l1m_ker_prot.h), and [frame/3/bli_l3_ukr_prot.h](https://github.com/flame/blis/blob/master/frame/3/bli_l3_ukr_prot.h), respectively. The following example illustrates how a select subset of these macros can be used to generate kernel function prototypes.
Next, we must write the `knl` kernels and place them inside `kernels/knl`. (For more information on writing BLIS kernels, please see the [Kernels Guide](KernelsHowTo.md).) We recommend separating level-1v, level-1f, and level-3 kernels into separate `1`, `1f`, and `3` sub-directories, respectively. The kernel files and functions therein do not need to follow any particular naming convention, though we strongly recommend using the conventions already used by other kernel sets. Take a look at other kernel files, such as those for `haswell`, [for examples](https://github.com/flame/blis/tree/master/kernels). Finally, for the `knl` kernel set, you should insert a file named `bli_kernels_knl.h` into `kernels/knl` that prototypes all of your new kernel set's kernel functions. You are welcome to write your own prototypes, but to make the prototyping of kernels easier we recommend using the prototype-generating macros for level-1v, level-1f, level-1m, and level-3 functions defined in [frame/1/bli_l1v_ker_prot.h](https://github.com/flame/blis/blob/master/frame/1/bli_l1v_ker_prot.h), [frame/1f/bli_l1f_ker_prot.h](https://github.com/flame/blis/blob/master/frame/1f/bli_l1f_ker_prot.h), [frame/1m/bli_l1m_ker_prot.h](https://github.com/flame/blis/blob/master/frame/1m/bli_l1m_ker_prot.h), and [frame/3/bli_l3_ukr_prot.h](https://github.com/flame/blis/blob/master/frame/3/bli_l3_ukr_prot.h), respectively. The following example illustrates how a select subset of these macros can be used to generate kernel function prototypes.
```
GEMM_UKR_PROT( double, d, gemm_knl_asm_24x8 )
@@ -635,7 +635,7 @@ First, we update the configuration name inside of `make_defs.mk`:
```
THIS_CONFIG := knl
```
and while we're editing the file, we can make any other changes to compiler flags we wish (if any). Similarly, the `bli_family_knl.h` header file should be updated as needed. Since the number of vector registers and the vector register size on `knl` differ from the defaults, we must explicitly set them. (The role of these parameters was explained in a [previous section](https://github.com/flame/blis/blob/master/docs/ConfigurationHowTo.md#bli_family_h).) Furthermore, provided that a macro `BLIS_NO_HBWMALLOC` is not set, we use a different implementation of `malloc()` and `free()` and `#include` that implementation's header file.
and while we're editing the file, we can make any other changes to compiler flags we wish (if any). Similarly, the `bli_family_knl.h` header file should be updated as needed. Since the number of vector registers and the vector register size on `knl` differ from the defaults, we must explicitly set them. (The role of these parameters was explained in a [previous section](ConfigurationHowTo.md#bli_family_h).) Furthermore, provided that a macro `BLIS_NO_HBWMALLOC` is not set, we use a different implementation of `malloc()` and `free()` and `#include` that implementation's header file.
```
#define BLIS_SIMD_NUM_REGISTERS 32
#define BLIS_SIMD_SIZE 64
@@ -650,7 +650,7 @@ and while we're editing the file, we can make any other changes to compiler flag
#define BLIS_FREE_POOL hbw_free
#endif
```
Finally, we update `bli_cntx_init_knl.c` to initialize the context with the appropriate kernel function pointers and blocksize values. The functions used to perform this initialization are explained in [an earlier section](https://github.com/flame/blis/blob/master/docs/ConfigurationHowTo.md#bli_cntx_init_c).
Finally, we update `bli_cntx_init_knl.c` to initialize the context with the appropriate kernel function pointers and blocksize values. The functions used to perform this initialization are explained in [an earlier section](ConfigurationHowTo.md#bli_cntx_init_c).
@@ -786,7 +786,7 @@ build static library? yes
build shared library? no
```
This will tell you the current configuration name, the [configuration registry lists](https://github.com/flame/blis/blob/master/docs/ConfigurationHowTo.md#printing-the-configuration-registry-lists), as well as other information stored by `configure` in the `config.mk` file.
This will tell you the current configuration name, the [configuration registry lists](ConfigurationHowTo.md#printing-the-configuration-registry-lists), as well as other information stored by `configure` in the `config.mk` file.


@@ -5,35 +5,35 @@ project, as well as those we think a new user or developer might ask. If you do
## Contents
* [Why did you create BLIS?](FAQ#why-did-you-create-blis)
* [Why should I use BLIS instead of GotoBLAS / OpenBLAS / ATLAS / MKL / ESSL / ACML / Accelerate?](FAQ#why-should-i-use-blis-instead-of-gotoblas--openblas--atlas--mkl--essl--acml--accelerate)
* [How is BLIS related to FLAME / libflame?](FAQ#how-is-blis-related-to-flame--libflame)
* [Does BLIS automatically detect my hardware?](FAQ#does-blis-automatically-detect-my-hardware)
* [I understand that BLIS is mostly a tool for developers?](FAQ#i-understand-that-blis-is-mostly-a-tool-for-developers)
* [How do I link against BLIS?](FAQ#how-do-i-link-against-blis)
* [Must I use git? Can I download a tarball?](FAQ#must-i-use-git-can-i-download-a-tarball)
* [What is a micro-kernel?](FAQ#what-is-a-micro-kernel)
* [What is a macro-kernel?](FAQ#what-is-a-macro-kernel)
* [What is a context?](FAQ#what-is-a-context)
* [I am used to thinking in terms of column-major/row-major storage and leading dimensions. What is a "row stride" / "column stride"?](FAQ#im-used-to-thinking-in-terms-of-column-majorrow-major-storage-and-leading-dimensions-what-is-a-row-stride--column-stride)
* [What does it mean when a matrix with general stride is column-tilted or row-tilted?](FAQ#what-does-it-mean-when-a-matrix-with-general-stride-is-column-tilted-or-row-tilted)
* [I am not really interested in all of these newfangled features in BLIS. Can I just use BLIS as a BLAS library?](FAQ#im-not-really-interested-in-all-of-these-newfangled-features-in-blis-can-i-just-use-blis-as-a-blas-library)
* [What about CBLAS?](FAQ#what-about-cblas)
* [Can I call the native BLIS API from Fortran-77/90/95/2000/C++/Python?](FAQ#can-i-call-the-native-blis-api-from-fortran-7790952000cpython)
* [Do I need to call initialization/finalization functions before being able to use BLIS from my application?](FAQ#do-i-need-to-call-initializationfinalization-functions-before-being-able-to-use-blis-from-my-application)
* [Does BLIS support multithreading?](FAQ#does-blis-support-multithreading)
* [Does BLIS support NUMA environments?](FAQ#does-blis-support-numa-environments)
* [Does BLIS work with GPUs?](FAQ#does-blis-work-with-gpus)
* [Does BLIS work on (some architecture)?](FAQ#does-blis-work-on-some-architecture)
* [What about distributed-memory parallelism?](FAQ#what-about-distributed-memory-parallelism)
* [Can I build BLIS on Windows / Mac OS X?](FAQ#can-i-build-blis-on-windows--mac-os-x)
* [Can I build BLIS as a shared library?](FAQ#can-i-build-blis-as-a-shared-library)
* [Can I use the mixed domain / mixed precision support in BLIS?](FAQ#can-i-use-the-mixed-domain--mixed-precision-support-in-blis)
* [Who is involved in the project?](FAQ#who-is-involved-in-the-project)
* [Who funded the development of BLIS?](FAQ#who-funded-the-development-of-blis)
* [I found a bug. How do I report it?](FAQ#i-found-a-bug-how-do-i-report-it)
* [How do I request a new feature?](FAQ#how-do-i-request-a-new-feature)
* [Where did you get the photo for the BLIS logo / mascot?](FAQ#where-did-you-get-the-photo-for-the-blis-logo--mascot)
* [Why did you create BLIS?](FAQ.md#why-did-you-create-blis)
* [Why should I use BLIS instead of GotoBLAS / OpenBLAS / ATLAS / MKL / ESSL / ACML / Accelerate?](FAQ.md#why-should-i-use-blis-instead-of-gotoblas--openblas--atlas--mkl--essl--acml--accelerate)
* [How is BLIS related to FLAME / libflame?](FAQ.md#how-is-blis-related-to-flame--libflame)
* [Does BLIS automatically detect my hardware?](FAQ.md#does-blis-automatically-detect-my-hardware)
* [I understand that BLIS is mostly a tool for developers?](FAQ.md#i-understand-that-blis-is-mostly-a-tool-for-developers)
* [How do I link against BLIS?](FAQ.md#how-do-i-link-against-blis)
* [Must I use git? Can I download a tarball?](FAQ.md#must-i-use-git-can-i-download-a-tarball)
* [What is a micro-kernel?](FAQ.md#what-is-a-micro-kernel)
* [What is a macro-kernel?](FAQ.md#what-is-a-macro-kernel)
* [What is a context?](FAQ.md#what-is-a-context)
* [I am used to thinking in terms of column-major/row-major storage and leading dimensions. What is a "row stride" / "column stride"?](FAQ.md#im-used-to-thinking-in-terms-of-column-majorrow-major-storage-and-leading-dimensions-what-is-a-row-stride--column-stride)
* [What does it mean when a matrix with general stride is column-tilted or row-tilted?](FAQ.md#what-does-it-mean-when-a-matrix-with-general-stride-is-column-tilted-or-row-tilted)
* [I am not really interested in all of these newfangled features in BLIS. Can I just use BLIS as a BLAS library?](FAQ.md#im-not-really-interested-in-all-of-these-newfangled-features-in-blis-can-i-just-use-blis-as-a-blas-library)
* [What about CBLAS?](FAQ.md#what-about-cblas)
* [Can I call the native BLIS API from Fortran-77/90/95/2000/C++/Python?](FAQ.md#can-i-call-the-native-blis-api-from-fortran-7790952000cpython)
* [Do I need to call initialization/finalization functions before being able to use BLIS from my application?](FAQ.md#do-i-need-to-call-initializationfinalization-functions-before-being-able-to-use-blis-from-my-application)
* [Does BLIS support multithreading?](FAQ.md#does-blis-support-multithreading)
* [Does BLIS support NUMA environments?](FAQ.md#does-blis-support-numa-environments)
* [Does BLIS work with GPUs?](FAQ.md#does-blis-work-with-gpus)
* [Does BLIS work on (some architecture)?](FAQ.md#does-blis-work-on-some-architecture)
* [What about distributed-memory parallelism?](FAQ.md#what-about-distributed-memory-parallelism)
* [Can I build BLIS on Windows / Mac OS X?](FAQ.md#can-i-build-blis-on-windows--mac-os-x)
* [Can I build BLIS as a shared library?](FAQ.md#can-i-build-blis-as-a-shared-library)
* [Can I use the mixed domain / mixed precision support in BLIS?](FAQ.md#can-i-use-the-mixed-domain--mixed-precision-support-in-blis)
* [Who is involved in the project?](FAQ.md#who-is-involved-in-the-project)
* [Who funded the development of BLIS?](FAQ.md#who-funded-the-development-of-blis)
* [I found a bug. How do I report it?](FAQ.md#i-found-a-bug-how-do-i-report-it)
* [How do I request a new feature?](FAQ.md#how-do-i-request-a-new-feature)
* [Where did you get the photo for the BLIS logo / mascot?](FAQ.md#where-did-you-get-the-photo-for-the-blis-logo--mascot)
@@ -56,28 +56,28 @@ homepage](https://github.com/flame/blis#key-features). But here are a few reason
### How is BLIS related to FLAME / `libflame`?
As explained [above](FAQ#why-did-you-create-blis?), BLIS was initially a layer within `libflame` that allowed more convenient interfacing to the BLAS. So in some ways, BLIS is a spin-off project. Prior to developing BLIS, [its author](http://www.cs.utexas.edu/users/field/) worked as the primary maintainer of `libflame`. If you look closely, you can also see that the design of BLIS was influenced by some of the more useful and innovative aspects of `libflame`, such as internal object abstractions and control trees. Also, various members of the [SHPC research group](http://shpc.ices.utexas.edu/people.html) and its [collaborators](http://shpc.ices.utexas.edu/collaborators.html) routinely provide insight, feedback, and also contribute code (especially kernels) to the BLIS project.
As explained [above](FAQ.md#why-did-you-create-blis?), BLIS was initially a layer within `libflame` that allowed more convenient interfacing to the BLAS. So in some ways, BLIS is a spin-off project. Prior to developing BLIS, [its author](http://www.cs.utexas.edu/users/field/) worked as the primary maintainer of `libflame`. If you look closely, you can also see that the design of BLIS was influenced by some of the more useful and innovative aspects of `libflame`, such as internal object abstractions and control trees. Also, various members of the [SHPC research group](http://shpc.ices.utexas.edu/people.html) and its [collaborators](http://shpc.ices.utexas.edu/collaborators.html) routinely provide insight, feedback, and also contribute code (especially kernels) to the BLIS project.
### Does BLIS automatically detect my hardware?
On certain architectures, yes. In order to use auto-detection, you must specify `auto` as your configuration when running `configure`. (Please see the BLIS [Build System](https://github.com/flame/blis/blob/master/docs/BuildSystem.md) guide for more info.) A runtime detection option is also available. (Please see the [Configuration Guide](https://github.com/flame/blis/blob/master/docs/ConfigurationHowTo.md) for a comprehensive walkthrough.)
On certain architectures, yes. In order to use auto-detection, you must specify `auto` as your configuration when running `configure`. (Please see the BLIS [Build System](BuildSystem.md) guide for more info.) A runtime detection option is also available. (Please see the [Configuration Guide](ConfigurationHowTo.md) for a comprehensive walkthrough.)
If automatic hardware detection is requested at configure-time and the build process does not recognize your architecture, the `generic` configuration is selected.
### I understand that BLIS is mostly a tool for developers?
Yes. In order to achieve high performance, BLIS requires that hand-coded kernels and micro-kernels be written and referenced in a valid [BLIS configuration](https://github.com/flame/blis/blob/master/docs/ConfigurationHowTo.md). These components are usually written by developers and then included within BLIS for use by others.
Yes. In order to achieve high performance, BLIS requires that hand-coded kernels and micro-kernels be written and referenced in a valid [BLIS configuration](ConfigurationHowTo.md). These components are usually written by developers and then included within BLIS for use by others.
The good news, however, is that end-users can use BLIS too. Once the aforementioned kernels are integrated into BLIS, they can be used without any developer-level knowledge. Usually, `./configure auto; make; make install` is sufficient for typical users with typical hardware.
### How do I link against BLIS?
Linking against BLIS is easy! Most people can link to it as if it were a generic BLAS library. Please see the [Linking against BLIS](https://github.com/flame/blis/blob/master/docs/BuildSystem.md#linking-against-blis) section of the [Build System](https://github.com/flame/blis/blob/master/docs/BuildSystem.md) guide.
Linking against BLIS is easy! Most people can link to it as if it were a generic BLAS library. Please see the [Linking against BLIS](BuildSystem.md#linking-against-blis) section of the [Build System](BuildSystem.md) guide.
### Must I use git? Can I download a tarball?
We **strongly encourage** you to obtain the BLIS source code by cloning a `git` repository (via the [git
clone](https://github.com/flame/blis/blob/master/docs/BuildSystem.md#obtaining-blis) command). The reason for this is that it will allow you to easily update your local copy of BLIS by executing `git pull`.
clone](BuildSystem.md#obtaining-blis) command). The reason for this is that it will allow you to easily update your local copy of BLIS by executing `git pull`.
Tarballs and zip files may be obtained from the [releases](https://github.com/flame/blis/releases) page.
@@ -85,7 +85,7 @@ Tarballs and zip files may be obtained from the [releases](https://github.com/fl
The micro-kernel (usually short for "`gemm` micro-kernel") is the basic unit of level-3 (matrix-matrix) computation within BLIS. It consists of one loop, where each iteration performs a very small outer product to update a very small matrix. The micro-kernel is typically the only piece of code that must be carefully optimized (via vector intrinsics or assembly code) to enable high performance in most of the level-3 operations such as `gemm`, `hemm`, `herk`, and `trmm`.
For a more thorough explanation of the micro-kernel and its role in the overall level-3 computations, please read our [ACM TOMS papers](https://github.com/flame/blis#citations). For API and technical reference, please see the [gemm micro-kernel section](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#gemm-micro-kernel) of the BLIS [Kernels Guide](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md).
For a more thorough explanation of the micro-kernel and its role in the overall level-3 computations, please read our [ACM TOMS papers](https://github.com/flame/blis#citations). For API and technical reference, please see the [gemm micro-kernel section](KernelsHowTo.md#gemm-micro-kernel) of the BLIS [Kernels Guide](KernelsHowTo.md).
### What is a macro-kernel?
@@ -115,7 +115,7 @@ When a matrix is stored with general stride, both the row stride and column stri
### I'm not really interested in all of these newfangled features in BLIS. Can I just use BLIS as a BLAS library?
Absolutely. Just link your application to BLIS the same way you would link to a BLAS library. For a simple linking example, see the [Linking to BLIS](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#linking-to-blis) section of the BLIS [Build System](https://github.com/flame/blis/blob/master/docs/BuildSystem.md) guide.
Absolutely. Just link your application to BLIS the same way you would link to a BLAS library. For a simple linking example, see the [Linking to BLIS](KernelsHowTo.md#linking-to-blis) section of the BLIS [Build System](BuildSystem.md) guide.
### What about CBLAS?
@@ -123,19 +123,19 @@ BLIS also contains an optional CBLAS compatibility layer, which leverages the BL
### Can I call the native BLIS API from Fortran-77/90/95/2000/C++/Python?
In principle, BLIS's native (and BLAS-like) [typed API](BLISTypedAPI) can be called from Fortran. However, you must ensure that the size of the integer in BLIS is equal to the size of integer used by your Fortran program/compiler/environment. The size of BLIS integers is set in `bli_config.h`. Please see the [bli\_config.h](ConfigurationHowTo#bli_configh) section of the BLIS [Configuration Guide](https://github.com/flame/blis/blob/master/docs/ConfigurationHowTo.md) for more details.
In principle, BLIS's native (and BLAS-like) [typed API](BLISTypedAPI) can be called from Fortran. However, you must ensure that the size of the integer in BLIS is equal to the size of integer used by your Fortran program/compiler/environment. The size of BLIS integers is set in `bli_config.h`. Please see the [bli\_config.h](ConfigurationHowTo#bli_configh) section of the BLIS [Configuration Guide](ConfigurationHowTo.md) for more details.
As for bindings to other languages, please contact the [blis-devel](http://groups.google.com/group/blis-devel) mailing list.
### Do I need to call initialization/finalization functions before being able to use BLIS from my application?
Originally, BLIS did indeed require the application to explicitly set up (initialize) various internal data structures via `bli_init()`. Likewise, calling `bli_finalize()` was recommended to clean up (finalize) the library. However, since commit 9804adf, BLIS has implemented self-initialization. These explicit calls to `bli_init()` and `bli_finalize()` are no longer necessary, though experts may still use them in special cases to control the allocation and freeing of resources. This topic is discussed in the BLIS [typed API reference](https://github.com/flame/blis/blob/master/docs/BLISTypedAPI.md#initialization-and-cleanup).
Originally, BLIS did indeed require the application to explicitly set up (initialize) various internal data structures via `bli_init()`. Likewise, calling `bli_finalize()` was recommended to clean up (finalize) the library. However, since commit 9804adf, BLIS has implemented self-initialization. These explicit calls to `bli_init()` and `bli_finalize()` are no longer necessary, though experts may still use them in special cases to control the allocation and freeing of resources. This topic is discussed in the BLIS [typed API reference](BLISTypedAPI.md#initialization-and-cleanup).
### Does BLIS support multithreading?
Yes! BLIS supports multithreading (via OpenMP or POSIX threads) for all of its level-3 operations. For more information on enabling and controlling multithreading, please see the [Multithreading](https://github.com/flame/blis/blob/master/docs/Multithreading.md) guide.
Yes! BLIS supports multithreading (via OpenMP or POSIX threads) for all of its level-3 operations. For more information on enabling and controlling multithreading, please see the [Multithreading](Multithreading.md) guide.
BLIS can also very easily be made thread-safe so that you can call BLIS from threads within a multithreaded library or application. For more information on making BLIS thread-safe, see the "Multithreading" subsection of the [bli\_config.h](ConfigurationHowTo#bli_configh) header file section in the BLIS [Configuration Guide](https://github.com/flame/blis/blob/master/docs/ConfigurationHowTo.md).
BLIS can also very easily be made thread-safe so that you can call BLIS from threads within a multithreaded library or application. For more information on making BLIS thread-safe, see the "Multithreading" subsection of the [bli\_config.h](ConfigurationHowTo#bli_configh) header file section in the BLIS [Configuration Guide](ConfigurationHowTo.md).
### Does BLIS support NUMA environments?
@@ -147,7 +147,7 @@ BLIS does not currently support graphical processing units (GPUs).
### Does BLIS work on _(some architecture)_?
Please see the BLIS [Hardware Support](https://github.com/flame/blis/blob/master/docs/HardwareSupport.md) guide for a full list of supported architectures. If your favorite hardware is not listed and you have the expertise, please consider developing your own kernels and sharing them with the project! We will, of course, gratefully credit your contribution.
Please see the BLIS [Hardware Support](HardwareSupport.md) guide for a full list of supported architectures. If your favorite hardware is not listed and you have the expertise, please consider developing your own kernels and sharing them with the project! We will, of course, gratefully credit your contribution.
### What about distributed-memory parallelism?
@@ -155,7 +155,7 @@ No. BLIS is a framework for sequential and shared-memory/multicore implementatio
### Can I build BLIS on Windows / Mac OS X?
BLIS was designed for use in a GNU/Linux environment; however, it should work on other UNIX-like systems as well, such as OS X. System software requirements for UNIX-like systems are discussed in the BLIS [Build System](https://github.com/flame/blis/blob/master/docs/BuildSystem.md) guide.
BLIS was designed for use in a GNU/Linux environment; however, it should work on other UNIX-like systems as well, such as OS X. System software requirements for UNIX-like systems are discussed in the BLIS [Build System](BuildSystem.md) guide.
Building in Windows is not directly supported. However, Windows 10 now provides a Linux-like environment. We suspect this is the best route for those trying to build BLIS in Windows. If you have success and would like to share your experiences, please join the [blis-devel](http://groups.google.com/group/blis-devel) mailing list and send us a message!


@@ -10,9 +10,9 @@ We apologize if this wiki falls out of date. For the latest support, we recommen
The following table lists architectures for which there exist optimized level-3 micro-kernels, which micro-kernels are optimized, the name of the author or maintainer, and the current status of the micro-kernels.
A few remarks / reminders:
* Optimizing only the [gemm micro-kernel](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#gemm-micro-kernel) will result in optimal performance for all [level-3 operations](BLISTypedAPI#level-3-operations) except `trsm` (which will typically achieve 60 - 80% of attainable peak performance).
* The [trsm](BLISTypedAPI#trsm) operation needs the [gemmtrsm micro-kernel(s)](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#gemmtrsm-micro-kernels), in addition to the aforementioned [gemm micro-kernel](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#gemm-micro-kernel), in order to reach optimal performance.
* Induced complex (1m) implementations are employed in all situations where the real domain [gemm micro-kernel](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#gemm-micro-kernel) of the corresponding precision is available. Please see our [ACM TOMS article on the 1m method](https://github.com/flame/blis#citations) for more info on this topic.
* Optimizing only the [gemm micro-kernel](KernelsHowTo.md#gemm-micro-kernel) will result in optimal performance for all [level-3 operations](BLISTypedAPI#level-3-operations) except `trsm` (which will typically achieve 60 - 80% of attainable peak performance).
* The [trsm](BLISTypedAPI#trsm) operation needs the [gemmtrsm micro-kernel(s)](KernelsHowTo.md#gemmtrsm-micro-kernels), in addition to the aforementioned [gemm micro-kernel](KernelsHowTo.md#gemm-micro-kernel), in order to reach optimal performance.
* Induced complex (1m) implementations are employed in all situations where the real domain [gemm micro-kernel](KernelsHowTo.md#gemm-micro-kernel) of the corresponding precision is available. Please see our [ACM TOMS article on the 1m method](https://github.com/flame/blis#citations) for more info on this topic.
* Some microarchitectures use the same sub-configuration. This is not a typo. For example, Haswell and Broadwell systems as well as "desktop" (non-server) versions of Skylake, Kabylake, and Coffeelake all use the `haswell` sub-configuration and the kernels registered therein.
* Remember that you (usually) don't have to choose your sub-configuration manually! Instead, you can always request configure-time hardware detection via `./configure auto`. This will defer to internal logic (based on CPUID for x86_64 systems) that will attempt to choose the appropriate sub-configuration automatically.


@@ -87,17 +87,17 @@ Kernels marked with a "1" for a given level-2 operation are preferred for optimi
## BLIS kernels reference
This section seeks to provide developers with a complete reference for each of the following BLIS kernels, including function prototypes, parameter descriptions, implementation notes, and diagrams:
* [Level-3 micro-kernels](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#level-3-micro-kernels)
* [gemm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#gemm-micro-kernel)
* [trsm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#trsm-micro-kernels)
* [gemmtrsm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#gemmtrsm-micro-kernels)
* [Level-1f kernels](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#level-1f-kernels)
* [Level-3 micro-kernels](KernelsHowTo.md#level-3-micro-kernels)
* [gemm](KernelsHowTo.md#gemm-micro-kernel)
* [trsm](KernelsHowTo.md#trsm-micro-kernels)
* [gemmtrsm](KernelsHowTo.md#gemmtrsm-micro-kernels)
* [Level-1f kernels](KernelsHowTo.md#level-1f-kernels)
* axpy2v
* dotaxpyv
* axpyf
* dotxf
* dotxaxpyf
* [Level-1v kernels](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#level-1v-kernels)
* [Level-1v kernels](KernelsHowTo.md#level-1v-kernels)
* axpyv
* dotv
* dotxv
@@ -113,9 +113,9 @@ The function prototypes in this section follow the same guidelines as those list
### Level-3 micro-kernels
This section describes in detail the various level-3 micro-kernels supported by BLIS:
* [gemm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#gemm-micro-kernel)
* [trsm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#trsm_micro-kernels)
* [gemmtrsm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#gemmtrsm-micro-kernels)
* [gemm](KernelsHowTo.md#gemm-micro-kernel)
* [trsm](KernelsHowTo.md#trsm_micro-kernels)
* [gemmtrsm](KernelsHowTo.md#gemmtrsm-micro-kernels)
#### gemm micro-kernel
@@ -164,13 +164,13 @@ Parameters:
* `k`: The number of columns of `A1` and rows of `B1`.
* `alpha`: The address of a scalar to be applied to the `A1 * B1` product.
* `a1`: The address of a micro-panel of matrix `A` of dimension _MR x k_, stored by columns with leading dimension _PACKMR_, where typically _PACKMR_ = _MR_. (See [Implementation Notes for gemm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#implementation-notes-for-gemm) for a discussion of _PACKMR_.)
* `b1`: The address of a micro-panel of matrix `B` of dimension _k x NR_, stored by rows with leading dimension _PACKNR_, where typically _PACKNR_ = _NR_. (See [Implementation Notes for gemm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#implementation-notes-for-gemm) for a discussion of _PACKNR_.)
* `a1`: The address of a micro-panel of matrix `A` of dimension _MR x k_, stored by columns with leading dimension _PACKMR_, where typically _PACKMR_ = _MR_. (See [Implementation Notes for gemm](KernelsHowTo.md#implementation-notes-for-gemm) for a discussion of _PACKMR_.)
* `b1`: The address of a micro-panel of matrix `B` of dimension _k x NR_, stored by rows with leading dimension _PACKNR_, where typically _PACKNR_ = _NR_. (See [Implementation Notes for gemm](KernelsHowTo.md#implementation-notes-for-gemm) for a discussion of _PACKNR_.)
* `beta`: The address of a scalar to be applied to the input value of matrix `C11`.
* `c11`: The address of a matrix `C11` of dimension _MR x NR_, stored according to `rsc` and `csc`.
* `rsc`: The row stride of matrix `C11` (i.e., the distance to the next row, in units of matrix elements).
* `csc`: The column stride of matrix `C11` (i.e., the distance to the next column, in units of matrix elements).
* `data`: The address of an `auxinfo_t` object that contains auxiliary information that may be useful when optimizing the `gemm` micro-kernel implementation. (See [Using the auxinfo\_t object](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#Using_the_auxinfo_t_object) for a discussion of the kinds of values available via `auxinfo_t`.)
* `data`: The address of an `auxinfo_t` object that contains auxiliary information that may be useful when optimizing the `gemm` micro-kernel implementation. (See [Using the auxinfo\_t object](KernelsHowTo.md#Using_the_auxinfo_t_object) for a discussion of the kinds of values available via `auxinfo_t`.)
* `cntx`: The address of the runtime context. The context can be queried for implementation-specific values such as cache and register blocksizes. However, most micro-kernels intrinsically "know" these values already, and thus the `cntx` argument usually can be safely ignored.
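
The parameter descriptions above can be sketched as an unoptimized reference loop nest. This is only an illustration of the storage conventions (`a1` by columns with leading dimension _PACKMR_, `b1` by rows with leading dimension _PACKNR_, `c11` addressed through `rsc` and `csc`), not the BLIS implementation. The function name `ref_dgemm_ukr`, the blocksize values, and the shortened signature (no `data` or `cntx` arguments) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical register blocksizes and packing dimensions, chosen only for
   illustration; real values depend on the target hardware. */
enum { MR = 4, NR = 2, PACKMR = MR, PACKNR = NR };

/* Unoptimized sketch of C11 := beta * C11 + alpha * A1 * B1 for doubles.
   The name and the simplified signature are illustrative, not the BLIS API
   (the data and cntx arguments are omitted). */
void ref_dgemm_ukr(size_t k, const double *alpha,
                   const double *a1, const double *b1,
                   const double *beta, double *c11,
                   ptrdiff_t rsc, ptrdiff_t csc)
{
    assert(alpha && a1 && b1 && beta && c11);

    double ab[MR * NR] = { 0.0 };   /* accumulator for A1 * B1, row-major */

    /* One rank-1 update per iteration: column p of A1 times row p of B1. */
    for (size_t p = 0; p < k; ++p) {
        const double *a_col = a1 + p * PACKMR;  /* column p of A1 */
        const double *b_row = b1 + p * PACKNR;  /* row p of B1 */
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                ab[i * NR + j] += a_col[i] * b_row[j];
    }

    /* Scale and accumulate into C11, honoring its row and column strides. */
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j) {
            double *cij = c11 + (ptrdiff_t)i * rsc + (ptrdiff_t)j * csc;
            *cij = (*beta) * (*cij) + (*alpha) * ab[i * NR + j];
        }
}
```

Calling the sketch with `rsc = NR, csc = 1` treats `c11` as row-stored, while `rsc = 1, csc = MR` treats it as column-stored. The loop body is the same in both cases, which is the point of passing both strides.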
#### Diagram for gemm
@@ -206,7 +206,7 @@ The diagram below shows the packed micro-panel operands and how elements of each
#### Using the auxinfo\_t object
Each micro-kernel ([gemm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#gemm-micro-kernel), [trsm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#trsm_micro-kernels), and [gemmtrsm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#gemmtrsm-micro-kernels)) takes as one of its arguments a pointer to an `auxinfo_t` object (the `data` parameter). This BLIS-defined type is a `struct` whose fields contain auxiliary values that may be useful to some micro-kernel authors, particularly when implementing certain optimization techniques. BLIS provides kernel authors access to the fields of the `auxinfo_t` object via the following function-like preprocessor macros. Each macro takes a single argument, the `auxinfo_t` pointer, and returns one of the values stored within the object.
Each micro-kernel ([gemm](KernelsHowTo.md#gemm-micro-kernel), [trsm](KernelsHowTo.md#trsm_micro-kernels), and [gemmtrsm](KernelsHowTo.md#gemmtrsm-micro-kernels)) takes as one of its arguments a pointer to an `auxinfo_t` object (the `data` parameter). This BLIS-defined type is a `struct` whose fields contain auxiliary values that may be useful to some micro-kernel authors, particularly when implementing certain optimization techniques. BLIS provides kernel authors access to the fields of the `auxinfo_t` object via the following function-like preprocessor macros. Each macro takes a single argument, the `auxinfo_t` pointer, and returns one of the values stored within the object.
* `bli_auxinfo_next_a()`. Returns the address (`void*`) of the micro-panel of `A` that will be used the next time the micro-kernel will be called.
* `bli_auxinfo_next_b()`. Returns the address (`void*`) of the micro-panel of `B` that will be used the next time the micro-kernel will be called.
@@ -288,23 +288,23 @@ _MR_ and _NR_ are the register blocksizes associated with the micro-kernel. They
Parameters:
* `a11`: The address of `A11`, which is the _MR x MR_ lower (`trsm_l`) or upper (`trsm_u`) triangular submatrix within the packed micro-panel of matrix `A`. `A11` is stored by columns with leading dimension _PACKMR_, where typically _PACKMR_ = _MR_. (See [Implementation Notes for gemm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#implementation-notes-for-gemm) for a discussion of _PACKMR_.) Note that `A11` contains elements in both triangles, though elements in the unstored triangle are not guaranteed to be zero and thus should not be referenced.
* `b11`: The address of `B11`, which is an _MR x NR_ submatrix of the packed micro-panel of `B`. `B11` is stored by rows with leading dimension _PACKNR_, where typically _PACKNR_ = _NR_. (See [Implementation Notes for gemm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#implementation-notes-for-gemm) for a discussion of _PACKNR_.)
* `a11`: The address of `A11`, which is the _MR x MR_ lower (`trsm_l`) or upper (`trsm_u`) triangular submatrix within the packed micro-panel of matrix `A`. `A11` is stored by columns with leading dimension _PACKMR_, where typically _PACKMR_ = _MR_. (See [Implementation Notes for gemm](KernelsHowTo.md#implementation-notes-for-gemm) for a discussion of _PACKMR_.) Note that `A11` contains elements in both triangles, though elements in the unstored triangle are not guaranteed to be zero and thus should not be referenced.
* `b11`: The address of `B11`, which is an _MR x NR_ submatrix of the packed micro-panel of `B`. `B11` is stored by rows with leading dimension _PACKNR_, where typically _PACKNR_ = _NR_. (See [Implementation Notes for gemm](KernelsHowTo.md#implementation-notes-for-gemm) for a discussion of _PACKNR_.)
* `c11`: The address of `C11`, which is an _MR x NR_ submatrix of matrix `C`, stored according to `rsc` and `csc`. `C11` is the submatrix within `C` that corresponds to the elements which were packed into `B11`. Thus, `C` is the original input matrix `B` to the overall `trsm` operation.
* `rsc`: The row stride of matrix `C11` (i.e., the distance to the next row, in units of matrix elements).
* `csc`: The column stride of matrix `C11` (i.e., the distance to the next column, in units of matrix elements).
* `data`: The address of an `auxinfo_t` object that contains auxiliary information that may be useful when optimizing the `trsm` micro-kernel implementation. (See [Using the auxinfo\_t object](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#Using_the_auxinfo_t_object) for a discussion of the kinds of values available via `auxinfo_t`, and also [Implementation Notes for trsm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#implementation-notes-for-trsm) for caveats.)
* `data`: The address of an `auxinfo_t` object that contains auxiliary information that may be useful when optimizing the `trsm` micro-kernel implementation. (See [Using the auxinfo\_t object](KernelsHowTo.md#Using_the_auxinfo_t_object) for a discussion of the kinds of values available via `auxinfo_t`, and also [Implementation Notes for trsm](KernelsHowTo.md#implementation-notes-for-trsm) for caveats.)
* `cntx`: The address of the runtime context. The context can be queried for implementation-specific values such as cache and register blocksizes. However, most micro-kernels intrinsically "know" these values already, and thus the `cntx` argument usually can be safely ignored.
#### Diagrams for trsm
Please see the diagram for [gemmtrsm\_l](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#diagram-for-gemmtrsm-l) and [gemmtrsm\_u](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#diagram-for-gemmtrsm-u) to see depictions of the `trsm_l` and `trsm_u` micro-kernel operations and where they fit in with their preceding `gemm` subproblems.
Please see the diagram for [gemmtrsm\_l](KernelsHowTo.md#diagram-for-gemmtrsm-l) and [gemmtrsm\_u](KernelsHowTo.md#diagram-for-gemmtrsm-u) to see depictions of the `trsm_l` and `trsm_u` micro-kernel operations and where they fit in with their preceding `gemm` subproblems.
#### Implementation Notes for trsm
* **Register blocksizes.** See [Implementation Notes for gemm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#implementation-notes-for-gemm).
* **Leading dimensions of `a11` and `b11`: _PACKMR_ and _PACKNR_.** See [Implementation Notes for gemm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#implementation-notes-for-gemm).
* **Edge cases in _MR_, _NR_ dimensions.** See [Implementation Notes for gemm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#implementation-notes-for-gemm).
* **Register blocksizes.** See [Implementation Notes for gemm](KernelsHowTo.md#implementation-notes-for-gemm).
* **Leading dimensions of `a11` and `b11`: _PACKMR_ and _PACKNR_.** See [Implementation Notes for gemm](KernelsHowTo.md#implementation-notes-for-gemm).
* **Edge cases in _MR_, _NR_ dimensions.** See [Implementation Notes for gemm](KernelsHowTo.md#implementation-notes-for-gemm).
* **Alignment of `a11` and `b11`.** The addresses `a11` and `b11` are aligned according to `PACKMR * sizeof(type)` and `PACKNR * sizeof(type)`, respectively.
* **Unrolling loops.** Most optimized implementations should unroll all three loops within the `trsm` micro-kernel.
* **Prefetching next micro-panels of `A` and `B`.** We advise against using the `bli_auxinfo_next_a()` and `bli_auxinfo_next_b()` macros from within the `trsm_l` and `trsm_u` micro-kernels, since the values returned usually only make sense in the context of the overall `gemmtrsm` subproblem.
@@ -410,14 +410,14 @@ Parameters:
* `k`: The number of columns of `A10` and rows of `B01` (`trsm_l`); the number of columns of `A12` and rows of `B21` (`trsm_u`).
* `alpha`: The address of a scalar to be applied to `B11`.
* `a10`, `a12`: The address of `A10` or `A12`, which is the _MR x k_ submatrix of the packed micro-panel of `A` that is situated to the left (`trsm_l`) or right (`trsm_u`) of the _MR x MR_ triangular submatrix `A11`. `A10` and `A12` are stored by columns with leading dimension _PACKMR_, where typically _PACKMR_ = _MR_. (See [Implementation Notes for gemm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#implementation-notes-for-gemm) for a discussion of _PACKMR_.)
* `a11`: The address of `A11`, which is the _MR x MR_ lower (`trsm_l`) or upper (`trsm_u`) triangular submatrix within the packed micro-panel of matrix `A` that is situated to the right of `A10` (`trsm_l`) or the left of `A12` (`trsm_u`). `A11` is stored by columns with leading dimension _PACKMR_, where typically _PACKMR_ = _MR_. (See [Implementation Notes for gemm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#implementation-notes-for-gemm) for a discussion of _PACKMR_.) Note that `A11` contains elements in both triangles, though elements in the unstored triangle are not guaranteed to be zero and thus should not be referenced.
* `b01`, `b21`: The address of `B01` and `B21`, which is the _k x NR_ submatrix of the packed micro-panel of `B` that is situated above (`trsm_l`) or below (`trsm_u`) the _MR x NR_ block `B11`. `B01` and `B21` are stored by rows with leading dimension _PACKNR_, where typically _PACKNR_ = _NR_. (See [Implementation Notes for gemm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#implementation-notes-for-gemm) for a discussion of _PACKNR_.)
* `b11`: The address of `B11`, which is the _MR x NR_ submatrix of the packed micro-panel of `B`, situated below `B01` (`trsm_l`) or above `B21` (`trsm_u`). `B11` is stored by rows with leading dimension _PACKNR_, where typically _PACKNR_ = _NR_. (See [Implementation Notes for gemm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#implementation-notes-for-gemm) for a discussion of _PACKNR_.)
* `a10`, `a12`: The address of `A10` or `A12`, which is the _MR x k_ submatrix of the packed micro-panel of `A` that is situated to the left (`trsm_l`) or right (`trsm_u`) of the _MR x MR_ triangular submatrix `A11`. `A10` and `A12` are stored by columns with leading dimension _PACKMR_, where typically _PACKMR_ = _MR_. (See [Implementation Notes for gemm](KernelsHowTo.md#implementation-notes-for-gemm) for a discussion of _PACKMR_.)
* `a11`: The address of `A11`, which is the _MR x MR_ lower (`trsm_l`) or upper (`trsm_u`) triangular submatrix within the packed micro-panel of matrix `A` that is situated to the right of `A10` (`trsm_l`) or the left of `A12` (`trsm_u`). `A11` is stored by columns with leading dimension _PACKMR_, where typically _PACKMR_ = _MR_. (See [Implementation Notes for gemm](KernelsHowTo.md#implementation-notes-for-gemm) for a discussion of _PACKMR_.) Note that `A11` contains elements in both triangles, though elements in the unstored triangle are not guaranteed to be zero and thus should not be referenced.
* `b01`, `b21`: The address of `B01` and `B21`, which is the _k x NR_ submatrix of the packed micro-panel of `B` that is situated above (`trsm_l`) or below (`trsm_u`) the _MR x NR_ block `B11`. `B01` and `B21` are stored by rows with leading dimension _PACKNR_, where typically _PACKNR_ = _NR_. (See [Implementation Notes for gemm](KernelsHowTo.md#implementation-notes-for-gemm) for a discussion of _PACKNR_.)
* `b11`: The address of `B11`, which is the _MR x NR_ submatrix of the packed micro-panel of `B`, situated below `B01` (`trsm_l`) or above `B21` (`trsm_u`). `B11` is stored by rows with leading dimension _PACKNR_, where typically _PACKNR_ = _NR_. (See [Implementation Notes for gemm](KernelsHowTo.md#implementation-notes-for-gemm) for a discussion of _PACKNR_.)
* `c11`: The address of `C11`, which is an _MR x NR_ submatrix of matrix `C`, stored according to `rsc` and `csc`. `C11` is the submatrix within `C` that corresponds to the elements which were packed into `B11`. Thus, `C` is the original input matrix `B` to the overall `trsm` operation.
* `rsc`: The row stride of matrix `C11` (i.e., the distance to the next row, in units of matrix elements).
* `csc`: The column stride of matrix `C11` (i.e., the distance to the next column, in units of matrix elements).
* `data`: The address of an `auxinfo_t` object that contains auxiliary information that may be useful when optimizing the `gemmtrsm` micro-kernel implementation. (See [Using the auxinfo\_t object](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#Using_the_auxinfo_t_object) for a discussion of the kinds of values available via `auxinfo_t`, and also [Implementation Notes for gemmtrsm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#implementation-notes-for-gemmtrsm) for caveats.)
* `data`: The address of an `auxinfo_t` object that contains auxiliary information that may be useful when optimizing the `gemmtrsm` micro-kernel implementation. (See [Using the auxinfo\_t object](KernelsHowTo.md#Using_the_auxinfo_t_object) for a discussion of the kinds of values available via `auxinfo_t`, and also [Implementation Notes for gemmtrsm](KernelsHowTo.md#implementation-notes-for-gemmtrsm) for caveats.)
* `cntx`: The address of the runtime context. The context can be queried for implementation-specific values such as cache and register blocksizes. However, most micro-kernels intrinsically "know" these values already, and thus the `cntx` argument usually can be safely ignored.
#### Diagram for gemmtrsm\_l
@@ -469,18 +469,18 @@ The diagram below shows the packed micro-panel operands for `trsm_u` and how ele
#### Implementation Notes for gemmtrsm
* **Register blocksizes.** See [Implementation Notes for gemm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#implementation-notes-for-gemm).
* **Leading dimensions of `a1` and `b1`: _PACKMR_ and _PACKNR_.** See [Implementation Notes for gemm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#implementation-notes-for-gemm).
* **Edge cases in _MR_, _NR_ dimensions.** See [Implementation Notes for gemm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#implementation-notes-for-gemm).
* **Alignment of `a1` and `b1`.** See [Implementation Notes for gemm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#implementation-notes-for-gemm).
* **Unrolling loops.** Most optimized implementations should unroll all three loops within the `trsm` subproblem of `gemmtrsm`. See [Implementation Notes for gemm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#implementation-notes-for-gemm) for remarks on unrolling the `gemm` subproblem.
* **Register blocksizes.** See [Implementation Notes for gemm](KernelsHowTo.md#implementation-notes-for-gemm).
* **Leading dimensions of `a1` and `b1`: _PACKMR_ and _PACKNR_.** See [Implementation Notes for gemm](KernelsHowTo.md#implementation-notes-for-gemm).
* **Edge cases in _MR_, _NR_ dimensions.** See [Implementation Notes for gemm](KernelsHowTo.md#implementation-notes-for-gemm).
* **Alignment of `a1` and `b1`.** See [Implementation Notes for gemm](KernelsHowTo.md#implementation-notes-for-gemm).
* **Unrolling loops.** Most optimized implementations should unroll all three loops within the `trsm` subproblem of `gemmtrsm`. See [Implementation Notes for gemm](KernelsHowTo.md#implementation-notes-for-gemm) for remarks on unrolling the `gemm` subproblem.
* **Prefetching next micro-panels of `A` and `B`.** When invoked from within a `gemmtrsm_l` micro-kernel, the addresses accessible via `bli_auxinfo_next_a()` and `bli_auxinfo_next_b()` refer to the next invocation's `a10` and `b01`, respectively, while in `gemmtrsm_u`, the `_next_a()` and `_next_b()` macros return the addresses of the next invocation's `a11` and `b11` (since those submatrices precede `a12` and `b21`).
* **Zero `alpha`.** The micro-kernel can safely assume that `alpha` is non-zero; "alpha equals zero" handling is performed at a much higher level, which means that, in such a scenario, the micro-kernel will never get called.
* **Diagonal elements of `A11`.** See [Implementation Notes for trsm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#implementation-notes-for-trsm).
* **Zero elements of `A11`.** See [Implementation Notes for trsm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#implementation-notes-for-trsm).
* **Output.** See [Implementation Notes for trsm](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#implementation-notes-for-trsm).
* **Optimization.** Let's assume that the [gemm micro-kernel](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#gemm-micro-kernel) has already been optimized. You have two options with regard to optimizing the fused `gemmtrsm` micro-kernels:
1. Optimize only the [trsm micro-kernels](https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#trsm-micro-kernels). This will result in the `gemm` and `trsm_l` micro-kernels being called in sequence. (Likewise for `gemm` and `trsm_u`.)
* **Diagonal elements of `A11`.** See [Implementation Notes for trsm](KernelsHowTo.md#implementation-notes-for-trsm).
* **Zero elements of `A11`.** See [Implementation Notes for trsm](KernelsHowTo.md#implementation-notes-for-trsm).
* **Output.** See [Implementation Notes for trsm](KernelsHowTo.md#implementation-notes-for-trsm).
* **Optimization.** Let's assume that the [gemm micro-kernel](KernelsHowTo.md#gemm-micro-kernel) has already been optimized. You have two options with regard to optimizing the fused `gemmtrsm` micro-kernels:
1. Optimize only the [trsm micro-kernels](KernelsHowTo.md#trsm-micro-kernels). This will result in the `gemm` and `trsm_l` micro-kernels being called in sequence. (Likewise for `gemm` and `trsm_u`.)
1. Fuse the implementation of the `gemm` micro-kernel with that of the `trsm` micro-kernels by inlining both into the `gemmtrsm_l` and `gemmtrsm_u` micro-kernel definitions. This option is more labor-intensive, but also more likely to yield higher performance because it avoids redundant memory operations on the packed _MR x NR_ submatrix `B11`.
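
The first option above (calling `gemm` and then `trsm_l` in sequence) can be sketched in plain C as follows. This is a rough illustration under several assumptions, not the BLIS reference code: the function names, the blocksize values, and the by-value `alpha` are made up for the example, and the `data` and `cntx` arguments are omitted. The sketch also assumes that the solution is written to both `B11` and `C11`, and that the diagonal of `A11` holds ordinary values that can be divided by directly. The "Output" and "Diagonal elements of `A11`" notes referenced above govern the actual conventions, so check them before relying on either assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical blocksizes for illustration only. */
enum { MR = 2, NR = 2, PACKMR = MR, PACKNR = NR };

/* Unoptimized gemm sketch: C11 := beta * C11 + alpha * A1 * B1. */
static void ref_gemm(size_t k, double alpha, const double *a1,
                     const double *b1, double beta, double *c11,
                     ptrdiff_t rsc, ptrdiff_t csc)
{
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j) {
            double ab = 0.0;
            for (size_t p = 0; p < k; ++p)
                ab += a1[i + p * PACKMR] * b1[p * PACKNR + j];
            double *cij = c11 + (ptrdiff_t)i * rsc + (ptrdiff_t)j * csc;
            *cij = beta * (*cij) + alpha * ab;
        }
}

/* Unoptimized trsm_l sketch: solve A11 * X = B11 by forward substitution,
   storing X to both B11 and C11. Only the lower triangle of A11 is read.
   ASSUMPTION: the diagonal of A11 holds ordinary (non-inverted) values. */
static void ref_trsm_l(const double *a11, double *b11, double *c11,
                       ptrdiff_t rsc, ptrdiff_t csc)
{
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j) {
            double x = b11[i * PACKNR + j];
            for (size_t l = 0; l < i; ++l)   /* rows above i are solved */
                x -= a11[i + l * PACKMR] * b11[l * PACKNR + j];
            assert(a11[i + i * PACKMR] != 0.0);
            x /= a11[i + i * PACKMR];
            b11[i * PACKNR + j] = x;
            c11[(ptrdiff_t)i * rsc + (ptrdiff_t)j * csc] = x;
        }
}

/* Option 1: gemmtrsm_l as gemm followed by trsm_l.
   Step 1: B11 := alpha * B11 - A10 * B01, where B11 plays the role of C11
           with row stride PACKNR and column stride 1.
   Step 2: solve A11 * X = B11, then store X to B11 and C11. */
void ref_gemmtrsm_l(size_t k, double alpha,
                    const double *a10, const double *a11,
                    const double *b01, double *b11,
                    double *c11, ptrdiff_t rsc, ptrdiff_t csc)
{
    ref_gemm(k, -1.0, a10, b01, alpha, b11, PACKNR, 1);
    ref_trsm_l(a11, b11, c11, rsc, csc);
}
```

In this sequential form, `B11` is written by the `gemm` step and then read back by the `trsm_l` step. A fused kernel can keep those values in registers between the two steps, which is the redundant memory traffic the second option avoids.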

@@ -44,7 +44,7 @@ April 4, 2018
- Enable use of new zen kernels in haswell sub-configuration.
- Added row-storage optimizations to zen `dotxf` kernels (now also used by haswell).
- Integrated an `f2c`ed version of the BLAS test drivers from netlib LAPACK into BLIS build system (e.g. `make testblas`, `make checkblas`). See the [Testsuite](https://github.com/flame/blis/blob/master/docs/Testsuite.md) document for more info. Also scheduled these BLAS drivers to execute regularly via Travis CI.
- Integrated an `f2c`ed version of the BLAS test drivers from netlib LAPACK into BLIS build system (e.g. `make testblas`, `make checkblas`). See the [Testsuite](Testsuite.md) document for more info. Also scheduled these BLAS drivers to execute regularly via Travis CI.
- Added a new `make check` target that executes a fast version of the BLIS testsuite as well as the BLAS test drivers (primarily targeting package maintainers).
- Allow individual operation overriding in the BLIS testsuite. (This makes it easy to quickly test one or two operations of interest.)
- Added build system support for libmemkind. If present, `hbw_malloc()` is used as the default value for `BLIS_MALLOC_POOL` instead of `malloc()`. It can be disabled via `--disable-memkind`.
@@ -62,12 +62,12 @@ This version contains significant improvements from 0.2.2. Major changes include
- Real and complex domain (s,d,c,z) assembly-based gemm microkernels for AMD's Zen microarchitecture. (AMD, Field Van Zee)
- Real domain (s,d) assembly-based `gemmtrsm_l` and `gemmtrsm_u` microkernels for Zen. (AMD, Field Van Zee)
- Real domain (s,d) intrinsics-based `amaxv`, `axpyv`, `dotv`, `dotxv`, `scalv`, `axpyf`, and `dotxf` kernels for Zen. (AMD, Field Van Zee)
- Generalized the configuration system to allow multi-configuration builds targeting configuration "families". A single sub-configuration is chosen at runtime via some heuristic, such as querying CPUID (e.g. runtime hardware detection). This change was extensive and required a reorganization of the build system, configuration semantics, reference kernels, a new naming scheme for native kernels, and a rewrite of the global kernel structure (gks). Please see the rewritten [Configuration Guide](https://github.com/flame/blis/blob/master/docs/ConfigurationHowTo.md) for details.
- Generalized the configuration system to allow multi-configuration builds targeting configuration "families". A single sub-configuration is chosen at runtime via some heuristic, such as querying CPUID (e.g. runtime hardware detection). This change was extensive and required a reorganization of the build system, configuration semantics, reference kernels, a new naming scheme for native kernels, and a rewrite of the global kernel structure (gks). Please see the rewritten [Configuration Guide](ConfigurationHowTo.md) for details.
- Implemented runtime hardware detection for x86_64 hardware.
- Reimplemented configure-time hardware detection in terms of new runtime hardware detection code, which queries for CPU features rather than individual models.
- Implemented library self-initialization by rewriting `bli_init()` in terms of `pthread_once()` and inserting invocations of `bli_init()` in key places throughout BLIS. The expectation is that through normal use of any BLIS API (BLAS, typed BLIS, or object-based BLIS), the user no longer needs to explicitly initialize the library, and that `bli_finalize()` should never be called by the user unless they are absolutely sure they no longer need BLIS functionality. Related to this: global scalar constants (`BLIS_ONE`, `BLIS_ZERO`, etc.) are now statically initialized and thus ready to use immediately. Collectively, these changes provide improved thread safety at the application level.
- Compile with and install a single monolithic (flattened) `blis.h` header to (1) speed up compilation and (2) reduce the number of build product files.
- Added a sub-API for setting multithreading environment variables at runtime. For a few examples, please see the [Multithreading](https://github.com/flame/blis/blob/master/docs/Multithreading.md) guide.
- Added a sub-API for setting multithreading environment variables at runtime. For a few examples, please see the [Multithreading](Multithreading.md) guide.
- Reimplemented OpenMP/pthread barriers in terms of GNU atomic built-ins.
- Other small changes and fixes.

@@ -35,7 +35,7 @@ As you would expect, the test suite's source code lives in `src` and the object
## Compiling
Before running the test suite, you must first configure, compile, and install a BLIS library. For directions on how to build and install a BLIS library, please see the [Build System](https://github.com/flame/blis/blob/master/docs/BuildSystem.md) guide.
Before running the test suite, you must first configure, compile, and install a BLIS library. For directions on how to build and install a BLIS library, please see the [Build System](BuildSystem.md) guide.
Once BLIS is installed, you are ready to compile the test suite.
@@ -138,7 +138,7 @@ _**Perform all tests with alignment?**_ Disabling this option causes the leading
_**Randomize vectors and matrices.**_ The default randomization method uses real values on the interval [-1,1]. However, we offer an alternate randomization using powers of two in a narrow precision range, which is more likely to result in test residuals exactly equal to zero. This method is somewhat niche/experimental and most people should use random values on the [-1,1] interval.
_**General stride spacing.**_ This value determines the simulated "inner" stride when testing general stride storage. For simplicity, the test suite always generates and tests general stride storage that is ["column-tilted"](https://github.com/flame/blis/blob/master/docs/FAQ.md#What_does_it_mean_when_a_matrix_with_general_stride_is_column-ti). If general stride storage is not being tested, then this value is ignored.
_**General stride spacing.**_ This value determines the simulated "inner" stride when testing general stride storage. For simplicity, the test suite always generates and tests general stride storage that is ["column-tilted"](FAQ.md#What_does_it_mean_when_a_matrix_with_general_stride_is_column-ti). If general stride storage is not being tested, then this value is ignored.
_**Datatype(s) to test.**_ This string determines which floating-point datatypes are tested. There are four valid values: `'s'` for single-precision real, `'d'` for double-precision real, `'c'` for single-precision complex, and `'z'` for double-precision complex. You may choose one datatype, or combine more than one. The order of the datatype characters determines the order in which they are tested.