Introduction
This wiki describes the computational kernels used by the BLIS framework.
One of the primary features of BLIS is that it provides a large set of dense linear algebra functionality while simultaneously minimizing the amount of kernel code that must be optimized for a given architecture. BLIS does this by isolating a handful of kernels which, when implemented, facilitate functionality and performance of several of the higher-level operations.
Presently, BLIS supports several groups of operations:
- Level-1v: Operations on vectors:
- Level-1d: Element-wise operations on matrix diagonals:
- Level-1m: Element-wise operations on matrices:
- Level-1f: Fused operations on multiple vectors:
- Level-2: Operations with one matrix and (at least) one vector operand:
- Level-3: Operations with matrices that are multiplication-like:
- Utility: Miscellaneous operations on matrices and vectors:
Most of the interest with BLAS libraries centers around level-3 operations because they exhibit favorable ratios of floating-point operations (flops) to memory operations (memops), which allows high performance. Some applications also require level-2 computation; however, these operations are at an inherent disadvantage on modern architectures due to their less favorable flop-to-memop ratio. The BLIS framework allows developers to quickly and easily build high performance level-3 operations, as well as relatively well-performing level-2 operations, simply by optimizing a small set of kernels. These kernels, and their relationship to the other higher-level operations supported by BLIS, are the subject of this wiki.
Some level-1v, level-1m, and level-1d operations may also be accelerated, but since they are memory-bound, optimization typically yields minor performance improvement.
BLIS kernels summary
This section lists and briefly describes each of the main computational kernels supported by the BLIS framework. (Other kernels are supported, but they are not of interest to most developers.)
Level-3
BLIS supports the following three level-3 micro-kernels. These micro-kernels are used to implement optimized level-3 operations.
- `gemm`: The `gemm` micro-kernel performs a small matrix multiplication and is used by every level-3 operation.
- `trsm`: The `trsm` micro-kernel performs a small triangular solve with multiple right-hand sides. It is not required for optimal performance and in fact is only needed when the developer opts to not implement the fused `gemmtrsm` kernel.
- `gemmtrsm`: The `gemmtrsm` micro-kernel implements a fused operation whereby a `gemm` and a `trsm` subproblem are fused together in a single routine. This avoids redundant memory operations that would otherwise be incurred if the operations were executed separately.
The following shows the steps one would take to optimize, to varying degrees, the level-3 operations supported by BLIS:
- By implementing and optimizing the `gemm` micro-kernel, all level-3 operations except `trsm` are fully optimized. In this scenario, the `trsm` operation may achieve 60-90% of attainable peak performance, depending on the architecture and problem size.
- If one goes further and implements and optimizes the `trsm` micro-kernel, this kernel, when paired with an optimized `gemm` micro-kernel, results in a `trsm` implementation that is accelerated (but not fully optimized).
- Alternatively, if one implements and optimizes the fused `gemmtrsm` micro-kernel, this kernel, when paired with an optimized `gemm` micro-kernel, enables a fully optimized `trsm` implementation.
Level-1f
BLIS supports the following five level-1f (fused) kernels. These kernels are used to implement optimized level-2 operations.
- `axpy2v`: Performs and fuses two `axpyv` operations, accumulating to the same output vector.
- `dotaxpyv`: Performs and fuses a `dotv` followed by an `axpyv` operation, reusing the vector `x` in both.
- `axpyf`: Performs and fuses some implementation-dependent number of `axpyv` operations, accumulating to the same output vector. Can also be expressed as a `gemv` operation where matrix `A` is m x nf, where nf is the number of fused operations (the fusing factor).
- `dotxf`: Performs and fuses some implementation-dependent number of `dotxv` operations, reusing the `y` vector for each `dotxv`.
- `dotxaxpyf`: Performs and fuses a `dotxf` and an `axpyf` in which the matrix operand is reused.
Level-1v
BLIS supports kernels for the following level-1 operations. Aside from their self-similar operations (i.e., the use of an `axpyv` kernel to implement the `axpyv` operation), these kernels are used only to implement level-2 operations, and only when the developer decides to forgo more optimized approaches that involve level-1f kernels (where applicable).
- axpyv: Performs a scale-and-accumulate vector operation.
- dotv: Performs a dot product where the output scalar is overwritten.
- dotxv: Performs an extended dot product operation where the dot product is first scaled and then accumulated into a scaled output scalar.
There are other level-1v kernels that may be optimized, such as addv, subv, and scalv, but their use is less common and therefore of much less importance to most users and developers.
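For reference, here are minimal real-domain (double-precision) sketches of two of these operations. The names `daxpyv_ref` and `ddotxv_ref` are hypothetical, and the conjugation arguments of the actual BLIS kernels are omitted for simplicity.

```c
/* axpyv: y := y + alpha * x */
void daxpyv_ref( int n, double alpha, const double* x, int incx,
                 double* y, int incy )
{
    for ( int i = 0; i < n; ++i )
        y[ i * incy ] += alpha * x[ i * incx ];
}

/* dotxv: rho := beta * rho + alpha * x^T y */
void ddotxv_ref( int n, double alpha, const double* x, int incx,
                 const double* y, int incy, double beta, double* rho )
{
    double dot = 0.0;
    for ( int i = 0; i < n; ++i )
        dot += x[ i * incx ] * y[ i * incy ];
    /* dotv is the special case alpha = 1, beta = 0, where rho is
       simply overwritten. */
    *rho = beta * (*rho) + alpha * dot;
}
```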
Level-1v/-1f Dependencies for Level-2 operations
The table below shows dependencies between level-2 operations and each of the level-1v and level-1f kernels.
Kernels marked with a "1" for a given level-2 operation are preferred for optimization because they facilitate an optimal implementation on most architectures. Kernels marked with a "2", "3", or "4" denote those which need to be optimized for alternative implementations that would typically be second, third, or fourth choices, respectively, if the preferred kernels are not optimized.
| operation / kernel | effective storage | `axpyv` | `dotxv` | `axpy2v` | `dotaxpyv` | `axpyf` | `dotxf` | `dotxaxpyf` |
|---|---|---|---|---|---|---|---|---|
| `gemv`, `trmv`, `trsv` | row-wise | | 2 | | | | 1 | |
| | column-wise | 2 | | | | 1 | | |
| `hemv`, `symv` | row- or column-wise | 4 | 4 | 3 | | 2 | 2 | 1 |
| `ger`, `her`, `syr` | row- or column-wise | 1 | | | | | | |
| `her2`, `syr2` | row- or column-wise | 2 | | 1 | | | | |
Note: The "effective storage" column reflects the orientation of the matrix operand after transposition via the corresponding trans_t parameter (if applicable). For example, calling gemv with a column-stored matrix A and the transa parameter equal to BLIS_TRANSPOSE would be effectively equivalent to row-wise storage.
BLIS kernels reference
This section seeks to provide developers with a complete reference for each of the following BLIS kernels, including function prototypes, parameter descriptions, implementation notes, and diagrams:
- Level-3 micro-kernels
- Level-1f kernels
- axpy2v
- dotaxpyv
- axpyf
- dotxf
- dotxaxpyf
- Level-1v kernels
- axpyv
- dotv
- dotxv
The function prototypes in this section follow the same guidelines as those listed in the BLIS typed API reference. Namely:
- Any occurrence of `?` should be replaced with `s`, `d`, `c`, or `z` to form an actual function name.
- Any occurrence of `ctype` should be replaced with the actual C type corresponding to the datatype instance in question.
- Some matrix arguments have associated row and column stride arguments that follow them, typically listed as `rsX` and `csX` for a given matrix `X`. Row strides are always listed first, and column strides are always listed second. The semantic meaning of a row stride is "the distance, in units of elements, from any given element to the corresponding element (within the same column) of the next row," and the meaning of a column stride is "the distance, in units of elements, from any given element to the corresponding element (within the same row) of the next column." Thus, unit row stride implies column-major storage and unit column stride implies row-major storage.
- All occurrences of `alpha` and `beta` parameters are scalars.
Level-3 micro-kernels
This section describes in detail the various level-3 micro-kernels supported by BLIS:
gemm micro-kernel
```c
void bli_?gemm_<suffix>
     (
       dim_t               k,
       ctype*     restrict alpha,
       ctype*     restrict a1,
       ctype*     restrict b1,
       ctype*     restrict beta,
       ctype*     restrict c11, inc_t rsc, inc_t csc,
       auxinfo_t* restrict data,
       cntx_t*    restrict cntx
     );
```
where <suffix> is implementation-dependent. The following (more portable) wrapper is also defined:
```c
void bli_?gemm_ukernel
     (
       dim_t               k,
       ctype*     restrict alpha,
       ctype*     restrict a1,
       ctype*     restrict b1,
       ctype*     restrict beta,
       ctype*     restrict c11, inc_t rsc, inc_t csc,
       auxinfo_t* restrict data,
       cntx_t*    restrict cntx
     );
```
The gemm micro-kernel, sometimes simply referred to as "the BLIS micro-kernel" or "the micro-kernel", performs the following operation:
C11 := beta * C11 + A1 * B1
where A1 is an MR x k "micro-panel" matrix stored in packed (column-wise) format, B1 is a k x NR "micro-panel" matrix stored in packed (row-wise) format, C11 is an MR x NR general matrix stored according to its row and column strides rsc and csc, and alpha and beta are scalars.
MR and NR are the register blocksizes associated with the micro-kernel. They are chosen by the developer when the micro-kernel is written and then encoded into a BLIS configuration, which will reference the micro-kernel when the BLIS framework is instantiated into a library. For more information on setting register blocksizes and related constants, please see the BLIS developer configuration guide.
Parameters:
- `k`: The number of columns of `A1` and rows of `B1`.
- `alpha`: The address of a scalar to be applied to the `A1 * B1` product.
- `a1`: The address of a micro-panel of matrix `A` of dimension MR x k, stored by columns with leading dimension PACKMR, where typically PACKMR = MR. (See Implementation Notes for gemm for a discussion of PACKMR.)
- `b1`: The address of a micro-panel of matrix `B` of dimension k x NR, stored by rows with leading dimension PACKNR, where typically PACKNR = NR. (See Implementation Notes for gemm for a discussion of PACKNR.)
- `beta`: The address of a scalar to be applied to the input value of matrix `C11`.
- `c11`: The address of a matrix `C11` of dimension MR x NR, stored according to `rsc` and `csc`.
- `rsc`: The row stride of matrix `C11` (i.e., the distance to the next row, in units of matrix elements).
- `csc`: The column stride of matrix `C11` (i.e., the distance to the next column, in units of matrix elements).
- `data`: The address of an `auxinfo_t` object that contains auxiliary information that may be useful when optimizing the `gemm` micro-kernel implementation. (See Using the auxinfo_t object for a discussion of the kinds of values available via `auxinfo_t`.)
- `cntx`: The address of the runtime context. The context can be queried for implementation-specific values such as cache and register blocksizes. However, most micro-kernels intrinsically "know" these values already, and thus the `cntx` argument usually can be safely ignored.
Diagram for gemm
The diagram below shows the packed micro-panel operands and how elements of each would be stored when MR = NR = 4. The hex digits indicate the layout and order (but NOT the numeric contents) of the elements in memory. Note that the storage of C11 is not shown since it is determined by the row and column strides of C11.
```
         c11:          a1:                        b1:
         _______       ______________________    _______
        |       |     |0 4 8 C              |   |0 1 2 3|
    MR  |       |     |1 5 9 D . . .        |   |4 5 6 7|
        |       |  += |2 6 A E              |   |8 9 A B|
        |_______|     |3_7_B_F______________|   |C D E F|
                                                |   .   |
           NR                   k               |   .   | k
                                                |   .   |
                                                |       |
                                                |       |
                                                |_______|
                                                   NR
```
Implementation Notes for gemm
- Register blocksizes. The C preprocessor macros `bli_?mr` and `bli_?nr` evaluate to the MR and NR register blocksizes for the datatype corresponding to the `?` character. These values are abbreviations of the macro constants `BLIS_DEFAULT_MR_?` and `BLIS_DEFAULT_NR_?`, which are defined in the `bli_kernel.h` header file of the BLIS configuration.
- Leading dimensions of `a1` and `b1`: PACKMR and PACKNR. The packed micro-panels `a1` and `b1` are simply stored in column-major and row-major order, respectively. Usually, the width of either micro-panel (i.e., the number of rows of `A1`, or MR, and the number of columns of `B1`, or NR) is equal to that micro-panel's so-called "leading dimension." Sometimes, it may be beneficial to specify a leading dimension that is larger than the panel width. This may be desirable because it allows each column of `a1` or row of `b1` to maintain a certain alignment in memory that would not otherwise be maintained by MR and/or NR. In this case, you should index through `a1` and `b1` using the values PACKMR and PACKNR, respectively (which are stored in the context as the blocksize maximums associated with the `bszid_t` values `BLIS_MR` and `BLIS_NR`). These values are defined as `BLIS_PACKDIM_MR_?` and `BLIS_PACKDIM_NR_?`, respectively, in the `bli_kernel.h` header file of the BLIS configuration.
- Storage preference of `c11`. Sometimes, an optimized `gemm` micro-kernel will have a "preferred" storage format for `C11`--typically either contiguous row-storage (i.e. `cs_c` = 1) or contiguous column-storage (i.e. `rs_c` = 1). This preference comes from how the micro-kernel is most efficiently able to load/store elements of `C11` from/to memory. Most micro-kernels use vector instructions to access contiguous columns (or column segments) of `C11`. However, the developer may decide that accessing contiguous rows (or row segments) is more desirable. If this is the case, this preference should be noted in `bli_kernel.h` by defining the macro `BLIS_?GEMM_UKERNEL_PREFERS_CONTIG_ROWS`. Leaving the macro undefined leaves the default assumption (contiguous column preference) in place. Setting this macro allows the framework to perform a minor optimization at run-time that will ensure the micro-kernel preference is honored, if at all possible.
- Edge cases in MR, NR dimensions. Sometimes the micro-kernel will be called with micro-panels `a1` and `b1` that correspond to edge cases, where only partial results are needed. Zero-padding is handled automatically by the packing function to facilitate reuse of the same micro-kernel. Similarly, the logic for computing to temporary storage and then saving only the elements that correspond to elements of `C11` that exist (at the edges) is handled automatically within the macro-kernel.
- Alignment of `a1` and `b1`. By default, the addresses `a1` and `b1` are aligned only to `sizeof(type)`. If `BLIS_POOL_ADDR_ALIGN_SIZE` is set to some larger multiple of `sizeof(type)`, such as the page size, then the first `a1` and `b1` micro-panels will be aligned to that value, but subsequent micro-panels will only be aligned to `sizeof(type)`; or, if `BLIS_POOL_ADDR_ALIGN_SIZE` is a multiple of PACKMR and PACKNR, then subsequent micro-panels `a1` and `b1` will be aligned to `PACKMR * sizeof(type)` and `PACKNR * sizeof(type)`, respectively.
- Unrolling loops. As a general rule of thumb, the loop over k is sometimes moderately unrolled; for example, in our experience, an unrolling factor of u = 4 is fairly common. If unrolling is applied in the k dimension, edge cases must be handled to support values of k that are not multiples of u. It is nearly universally true that there should be no loops in the MR or NR directions; in other words, iteration over these dimensions should always be fully unrolled (within the loop over k).
- Zero `beta`. If `beta` = 0.0 (or 0.0 + 0.0i for complex datatypes), then the micro-kernel should NOT use it explicitly, as `C11` may contain uninitialized memory (including elements containing `NaN` or `Inf`). This case should be detected and handled separately, preferably by simply overwriting `C11` with the `alpha * A1 * B1` product. An example of how to perform this "beta equals zero" handling is included in the `gemm` micro-kernel associated with the template configuration.
Using the auxinfo_t object
Each micro-kernel (gemm, trsm, and gemmtrsm) takes as its last argument a pointer of type auxinfo_t. This BLIS-defined type is defined as a struct whose fields contain auxiliary values that may be useful to some micro-kernel authors, particularly when implementing certain optimization techniques. BLIS provides kernel authors access to the fields of the auxinfo_t object via the following function-like preprocessor macros. Each macro takes a single argument, the auxinfo_t pointer, and returns one of the values stored within the object.
- `bli_auxinfo_next_a()`. Returns the address (`void*`) of the micro-panel of `A` that will be used the next time the micro-kernel will be called.
- `bli_auxinfo_next_b()`. Returns the address (`void*`) of the micro-panel of `B` that will be used the next time the micro-kernel will be called.
- `bli_auxinfo_ps_a()`. Returns the panel stride (`inc_t`) of the current micro-panel of `A`.
- `bli_auxinfo_ps_b()`. Returns the panel stride (`inc_t`) of the current micro-panel of `B`.
The addresses of the next micro-panels of A and B may be used by the micro-kernel to perform prefetching, if prefetching is supported by the architecture. Similarly, it may be useful to know the precise distance in memory to the next micro-panel. (Note that sometimes the next micro-panel to be used is not the same as the next micro-panel in memory.)
Any and all of these values may be safely ignored; they are completely optional. However, BLIS guarantees that all values accessed via the macros listed above will always be initialized and meaningful, for every invocation of each micro-kernel (gemm, trsm, and gemmtrsm).
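As an illustration of the prefetching idea, the sketch below uses a mock stand-in for `auxinfo_t` so that the example is self-contained (the real accessors are the BLIS macros listed above, and a real kernel would receive the actual `auxinfo_t*` argument). `__builtin_prefetch` is a GCC/Clang builtin; other compilers would use an architecture-specific prefetch instruction or intrinsic.

```c
/* Mock stand-in for auxinfo_t and its accessors, for illustration only. */
typedef struct { const void* next_a; const void* next_b; } mock_auxinfo_t;

#define mock_auxinfo_next_a( d ) ( (d)->next_a )
#define mock_auxinfo_next_b( d ) ( (d)->next_b )

/* Issue software prefetches for the next micro-panels of A and B, as a
   micro-kernel might do near the start of its k loop. */
void prefetch_next_panels( const mock_auxinfo_t* data )
{
    const void* a_next = mock_auxinfo_next_a( data );
    const void* b_next = mock_auxinfo_next_b( data );

#if defined(__GNUC__)
    __builtin_prefetch( a_next, 0, 3 );  /* read, high temporal locality */
    __builtin_prefetch( b_next, 0, 3 );
#endif
    (void)a_next; (void)b_next;  /* no-op on other compilers */
}
```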
Example code for gemm
An example implementation of the gemm micro-kernel may be found in the template configuration directory in:
Note that this implementation is coded in C99 and lacks several kinds of optimization that are typical of real-world optimized micro-kernels, such as vector instructions (or intrinsics) and loop unrolling in MR or NR. It is meant to serve only as a starting point for a micro-kernel developer.
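In the same spirit as the template code, here is a minimal, unoptimized C sketch of what a double-precision `gemm` micro-kernel computes. The name `dgemm_ukernel_ref` is hypothetical; it assumes MR = NR = 4 with PACKMR = MR and PACKNR = NR, uses plain `int` strides, and omits the `auxinfo_t`/`cntx_t` arguments for brevity.

```c
#define MR 4
#define NR 4

/* Reference gemm micro-kernel sketch: C11 := beta * C11 + alpha * A1 * B1,
   where a1 is column-stored (ld = MR) and b1 is row-stored (ld = NR). */
void dgemm_ukernel_ref( int k, const double* alpha,
                        const double* a1, const double* b1,
                        const double* beta,
                        double* c11, int rsc, int csc )
{
    double ab[ MR * NR ] = { 0.0 };  /* local accumulator for A1 * B1 */

    for ( int p = 0; p < k; ++p )
        for ( int i = 0; i < MR; ++i )
            for ( int j = 0; j < NR; ++j )
                ab[ i * NR + j ] += a1[ p * MR + i ] * b1[ p * NR + j ];

    /* When beta == 0, c11 must not be read at all, since it may contain
       uninitialized values (even NaN/Inf); overwrite it instead. */
    for ( int i = 0; i < MR; ++i )
        for ( int j = 0; j < NR; ++j )
        {
            double* cij = &c11[ i * rsc + j * csc ];
            if ( *beta == 0.0 ) *cij = (*alpha) * ab[ i * NR + j ];
            else                *cij = (*beta) * (*cij)
                                       + (*alpha) * ab[ i * NR + j ];
        }
}
```

A real micro-kernel would keep `ab` in vector registers, fully unroll the MR and NR loops, and partially unroll the loop over k, as described in the implementation notes above.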
trsm micro-kernels
```c
void bli_?trsm_l_<suffix>
     (
       ctype*     restrict a11,
       ctype*     restrict b11,
       ctype*     restrict c11, inc_t rsc, inc_t csc,
       auxinfo_t* restrict data,
       cntx_t*    restrict cntx
     );

void bli_?trsm_u_<suffix>
     (
       ctype*     restrict a11,
       ctype*     restrict b11,
       ctype*     restrict c11, inc_t rsc, inc_t csc,
       auxinfo_t* restrict data,
       cntx_t*    restrict cntx
     );
```
where <suffix> is implementation-dependent. The following (more portable) wrappers are also defined:
```c
void bli_?trsm_l_ukernel
     (
       ctype*     restrict a11,
       ctype*     restrict b11,
       ctype*     restrict c11, inc_t rsc, inc_t csc,
       auxinfo_t* restrict data,
       cntx_t*    restrict cntx
     );

void bli_?trsm_u_ukernel
     (
       ctype*     restrict a11,
       ctype*     restrict b11,
       ctype*     restrict c11, inc_t rsc, inc_t csc,
       auxinfo_t* restrict data,
       cntx_t*    restrict cntx
     );
```
The trsm_l and trsm_u micro-kernels perform the following operation:
C11 := inv(A11) * B11
where A11 is MR x MR and lower (trsm_l) or upper (trsm_u) triangular, B11 is MR x NR, and C11 is MR x NR.
MR and NR are the register blocksizes associated with the micro-kernel. They are chosen by the developer when the micro-kernel is written and then encoded into a BLIS configuration, which will reference the micro-kernel when the BLIS framework is instantiated into a library. For more information on setting register blocksizes and related constants, please see the BLIS developer configuration guide.
Parameters:
- `a11`: The address of `A11`, which is the MR x MR lower (`trsm_l`) or upper (`trsm_u`) triangular submatrix within the packed micro-panel of matrix `A`. `A11` is stored by columns with leading dimension PACKMR, where typically PACKMR = MR. (See Implementation Notes for gemm for a discussion of PACKMR.) Note that `A11` contains elements in both triangles, though elements in the unstored triangle are not guaranteed to be zero and thus should not be referenced.
- `b11`: The address of `B11`, which is an MR x NR submatrix of the packed micro-panel of `B`. `B11` is stored by rows with leading dimension PACKNR, where typically PACKNR = NR. (See Implementation Notes for gemm for a discussion of PACKNR.)
- `c11`: The address of `C11`, which is an MR x NR submatrix of matrix `C`, stored according to `rsc` and `csc`. `C11` is the submatrix within `C` that corresponds to the elements which were packed into `B11`. Thus, `C` is the original input matrix `B` to the overall `trsm` operation.
- `rsc`: The row stride of matrix `C11` (i.e., the distance to the next row, in units of matrix elements).
- `csc`: The column stride of matrix `C11` (i.e., the distance to the next column, in units of matrix elements).
- `data`: The address of an `auxinfo_t` object that contains auxiliary information that may be useful when optimizing the `trsm` micro-kernel implementation. (See Using the auxinfo_t object for a discussion of the kinds of values available via `auxinfo_t`, and also Implementation Notes for trsm for caveats.)
- `cntx`: The address of the runtime context. The context can be queried for implementation-specific values such as cache and register blocksizes. However, most micro-kernels intrinsically "know" these values already, and thus the `cntx` argument usually can be safely ignored.
Diagrams for trsm
Please see the diagram for gemmtrsm_l and gemmtrsm_u to see depictions of the trsm_l and trsm_u micro-kernel operations and where they fit in with their preceding gemm subproblems.
Implementation Notes for trsm
- Register blocksizes. See Implementation Notes for gemm.
- Leading dimensions of `a11` and `b11`: PACKMR and PACKNR. See Implementation Notes for gemm.
- Edge cases in MR, NR dimensions. See Implementation Notes for gemm.
- Alignment of `a11` and `b11`. The addresses `a11` and `b11` are aligned according to `PACKMR * sizeof(type)` and `PACKNR * sizeof(type)`, respectively.
- Unrolling loops. Most optimized implementations should unroll all three loops within the `trsm` micro-kernel.
- Prefetching next micro-panels of `A` and `B`. We advise against using the `bli_auxinfo_next_a()` and `bli_auxinfo_next_b()` macros from within the `trsm_l` and `trsm_u` micro-kernels, since the values returned usually only make sense in the context of the overall `gemmtrsm` subproblem.
- Diagonal elements of `A11`. At the time this micro-kernel is called, the diagonal entries of triangular matrix `A11` contain the inverse of the original elements. This inversion is done during packing so that we can avoid expensive division instructions within the micro-kernel itself. If the `diag` parameter to the higher-level `trsm` operation was equal to `BLIS_UNIT_DIAG`, the diagonal elements will be explicitly unit.
- Zero elements of `A11`. Since `A11` is lower triangular (for `trsm_l`), the strictly upper triangle implicitly contains zeros. Similarly, the strictly lower triangle of `A11` implicitly contains zeros when `A11` is upper triangular (for `trsm_u`). However, the packing function may or may not actually write zeros to this region. Thus, the implementation should not reference these elements.
- Output. This micro-kernel must write its result to two places: the submatrix `B11` of the current packed micro-panel of `B` and the submatrix `C11` of the output matrix `C`.
Example code for trsm
Example implementations of the trsm micro-kernels may be found in the template configuration directory in:
Note that these implementations are coded in C99 and lack several kinds of optimization that are typical of real-world optimized micro-kernels, such as vector instructions (or intrinsics) and loop unrolling in MR or NR. They are meant to serve only as a starting point for a micro-kernel developer.
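In the same spirit, here is a minimal C sketch of a double-precision `trsm_l` micro-kernel. The name `dtrsm_l_ukernel_ref` is hypothetical; it assumes MR = NR = 4 with PACKMR = MR and PACKNR = NR, and, per the implementation notes above, assumes the diagonal of `a11` has been pre-inverted by the packing function so the kernel multiplies instead of divides.

```c
#define MR 4
#define NR 4

/* Reference trsm_l micro-kernel sketch: B11 := inv(A11) * B11, with the
   result also written to C11. a11 is column-stored (ld = MR) with a
   pre-inverted diagonal; b11 is row-stored (ld = NR). */
void dtrsm_l_ukernel_ref( const double* a11, double* b11,
                          double* c11, int rsc, int csc )
{
    /* Forward substitution, row by row. */
    for ( int i = 0; i < MR; ++i )
        for ( int j = 0; j < NR; ++j )
        {
            double x = b11[ i * NR + j ];
            for ( int l = 0; l < i; ++l )            /* subtract prior rows */
                x -= a11[ l * MR + i ] * b11[ l * NR + j ];
            x *= a11[ i * MR + i ];                  /* multiply by 1/alpha_ii */
            b11[ i * NR + j ] = x;                   /* update packed B11 ... */
            c11[ i * rsc + j * csc ] = x;            /* ... and output C11 */
        }
}
```

Note that the strictly upper triangle of `a11` is never referenced, matching the "zero elements of A11" note above, and that the result is written to both `b11` and `c11` as required.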
gemmtrsm micro-kernels
```c
void bli_?gemmtrsm_l_<suffix>
     (
       dim_t               k,
       ctype*     restrict alpha,
       ctype*     restrict a10,
       ctype*     restrict a11,
       ctype*     restrict b01,
       ctype*     restrict b11,
       ctype*     restrict c11, inc_t rsc, inc_t csc,
       auxinfo_t* restrict data,
       cntx_t*    restrict cntx
     );

void bli_?gemmtrsm_u_<suffix>
     (
       dim_t               k,
       ctype*     restrict alpha,
       ctype*     restrict a12,
       ctype*     restrict a11,
       ctype*     restrict b21,
       ctype*     restrict b11,
       ctype*     restrict c11, inc_t rsc, inc_t csc,
       auxinfo_t* restrict data,
       cntx_t*    restrict cntx
     );
```
where <suffix> is implementation-dependent. The following (more portable) wrappers are also defined:
```c
void bli_?gemmtrsm_l_ukernel
     (
       dim_t               k,
       ctype*     restrict alpha,
       ctype*     restrict a10,
       ctype*     restrict a11,
       ctype*     restrict b01,
       ctype*     restrict b11,
       ctype*     restrict c11, inc_t rsc, inc_t csc,
       auxinfo_t* restrict data,
       cntx_t*    restrict cntx
     );

void bli_?gemmtrsm_u_ukernel
     (
       dim_t               k,
       ctype*     restrict alpha,
       ctype*     restrict a12,
       ctype*     restrict a11,
       ctype*     restrict b21,
       ctype*     restrict b11,
       ctype*     restrict c11, inc_t rsc, inc_t csc,
       auxinfo_t* restrict data,
       cntx_t*    restrict cntx
     );
```
The gemmtrsm_l micro-kernel performs the following compound operation:
B11 := alpha * B11 - A10 * B01
B11 := inv(A11) * B11
C11 := B11
where A11 is MR x MR and lower triangular, A10 is MR x k, and B01 is k x NR.
The gemmtrsm_u micro-kernel performs:
B11 := alpha * B11 - A12 * B21
B11 := inv(A11) * B11
C11 := B11
where A11 is MR x MR and upper triangular, A12 is MR x k, and B21 is k x NR.
In both cases, B11 is MR x NR and alpha is a scalar. Here, inv() denotes matrix inverse.
MR and NR are the register blocksizes associated with the micro-kernel. They are chosen by the developer when the micro-kernel is written and then encoded into a BLIS configuration, which will reference the micro-kernel when the BLIS framework is instantiated into a library. For more information on setting register blocksizes and related constants, please see the BLIS developer configuration guide.
Parameters:
- `k`: The number of columns of `A10` and rows of `B01` (`gemmtrsm_l`); the number of columns of `A12` and rows of `B21` (`gemmtrsm_u`).
- `alpha`: The address of a scalar to be applied to `B11`.
- `a10`, `a12`: The address of `A10` or `A12`, which is the MR x k submatrix of the packed micro-panel of `A` that is situated to the left (`gemmtrsm_l`) or right (`gemmtrsm_u`) of the MR x MR triangular submatrix `A11`. `A10` and `A12` are stored by columns with leading dimension PACKMR, where typically PACKMR = MR. (See Implementation Notes for gemm for a discussion of PACKMR.)
- `a11`: The address of `A11`, which is the MR x MR lower (`gemmtrsm_l`) or upper (`gemmtrsm_u`) triangular submatrix within the packed micro-panel of matrix `A` that is situated to the right of `A10` (`gemmtrsm_l`) or the left of `A12` (`gemmtrsm_u`). `A11` is stored by columns with leading dimension PACKMR, where typically PACKMR = MR. (See Implementation Notes for gemm for a discussion of PACKMR.) Note that `A11` contains elements in both triangles, though elements in the unstored triangle are not guaranteed to be zero and thus should not be referenced.
- `b01`, `b21`: The address of `B01` or `B21`, which is the k x NR submatrix of the packed micro-panel of `B` that is situated above (`gemmtrsm_l`) or below (`gemmtrsm_u`) the MR x NR block `B11`. `B01` and `B21` are stored by rows with leading dimension PACKNR, where typically PACKNR = NR. (See Implementation Notes for gemm for a discussion of PACKNR.)
- `b11`: The address of `B11`, which is the MR x NR submatrix of the packed micro-panel of `B`, situated below `B01` (`gemmtrsm_l`) or above `B21` (`gemmtrsm_u`). `B11` is stored by rows with leading dimension PACKNR, where typically PACKNR = NR. (See Implementation Notes for gemm for a discussion of PACKNR.)
- `c11`: The address of `C11`, which is an MR x NR submatrix of matrix `C`, stored according to `rsc` and `csc`. `C11` is the submatrix within `C` that corresponds to the elements which were packed into `B11`. Thus, `C` is the original input matrix `B` to the overall `trsm` operation.
- `rsc`: The row stride of matrix `C11` (i.e., the distance to the next row, in units of matrix elements).
- `csc`: The column stride of matrix `C11` (i.e., the distance to the next column, in units of matrix elements).
- `data`: The address of an `auxinfo_t` object that contains auxiliary information that may be useful when optimizing the `gemmtrsm` micro-kernel implementation. (See Using the auxinfo_t object for a discussion of the kinds of values available via `auxinfo_t`, and also Implementation Notes for gemmtrsm for caveats.)
- `cntx`: The address of the runtime context. The context can be queried for implementation-specific values such as cache and register blocksizes. However, most micro-kernels intrinsically "know" these values already, and thus the `cntx` argument usually can be safely ignored.
Diagram for gemmtrsm_l
The diagram below shows the packed micro-panel operands for `gemmtrsm_l` and how elements of each would be stored when MR = NR = 4. The hex digits indicate the layout and order (but NOT the numeric contents) of the elements in memory. Here, matrix `A11` (referenced by `a11`) is lower triangular. Matrix `A11` does contain elements corresponding to the strictly upper triangle; however, they are not guaranteed to contain zeros, and thus these elements should not be referenced.
```
                                          NR
                                       _______
                                 b01: |0 1 2 3|
                                      |4 5 6 7|
                                      |8 9 A B|
                                      |C D E F|
                                   k  |   .   |
                                      |   .   |
     a10:                a11:         |   .   |
     ___________________ _______      |_______|
    |0 4 8 C            |`.     |b11: |       |
 MR |1 5 9 D . . .      |  `.   |     |       |
    |2 6 A E            |    `. |  MR |       |
    |3_7_B_F____________|______`.|    |_______|

              k             MR
```
Diagram for gemmtrsm_u
The diagram below shows the packed micro-panel operands for `gemmtrsm_u` and how elements of each would be stored when MR = NR = 4. The hex digits indicate the layout and order (but NOT the numeric contents) of the elements in memory. Here, matrix `A11` (referenced by `a11`) is upper triangular. Matrix `A11` does contain elements corresponding to the strictly lower triangle; however, they are not guaranteed to contain zeros, and thus these elements should not be referenced.
```
     a11:     a12:                        NR
     ________ ___________________      _______
    |`.      |0 4 8              |b11: |0 1 2 3|
 MR |  `.    |1 5 9 . . .        |     |4 5 6 7|
    |    `.  |2 6 A              |  MR |8 9 A B|
    |______`.|3_7_B______________|     |___.___|
                                  b21: |   .   |
        MR            k                |   .   |
                                       |       |
                                       |       |
  NOTE: Storage digits are shown    k  |       |
  starting with a12 to avoid           |       |
  obscuring triangular structure       |       |
  of a11.                              |_______|
```
Implementation Notes for gemmtrsm
- Register blocksizes. See Implementation Notes for gemm.
- Leading dimensions of `a1` and `b1`: PACKMR and PACKNR. See Implementation Notes for gemm.
- Edge cases in MR, NR dimensions. See Implementation Notes for gemm.
- Alignment of `a1` and `b1`. See Implementation Notes for gemm.
- Unrolling loops. Most optimized implementations should unroll all three loops within the `trsm` subproblem of `gemmtrsm`. See Implementation Notes for gemm for remarks on unrolling the `gemm` subproblem.
- Prefetching next micro-panels of `A` and `B`. When invoked from within a `gemmtrsm_l` micro-kernel, the addresses accessible via `bli_auxinfo_next_a()` and `bli_auxinfo_next_b()` refer to the next invocation's `a10` and `b01`, respectively, while in `gemmtrsm_u`, the `_next_a()` and `_next_b()` macros return the addresses of the next invocation's `a11` and `b11` (since those submatrices precede `a12` and `b21`).
- Zero `alpha`. The micro-kernel can safely assume that `alpha` is non-zero; "alpha equals zero" handling is performed at a much higher level, which means that, in such a scenario, the micro-kernel will never get called.
- Diagonal elements of `A11`. See Implementation Notes for trsm.
- Zero elements of `A11`. See Implementation Notes for trsm.
- Output. See Implementation Notes for trsm.
- Optimization. Let's assume that the `gemm` micro-kernel has already been optimized. You have two options with regard to optimizing the fused `gemmtrsm` micro-kernels:
  1. Optimize only the `trsm` micro-kernels. This will result in the `gemm` and `trsm_l` micro-kernels being called in sequence. (Likewise for `gemm` and `trsm_u`.)
  2. Fuse the implementation of the `gemm` micro-kernel with that of the `trsm` micro-kernels by inlining both into the `gemmtrsm_l` and `gemmtrsm_u` micro-kernel definitions. This option is more labor-intensive, but also more likely to yield higher performance because it avoids redundant memory operations on the packed MR x NR submatrix `B11`.
Example code for gemmtrsm
Example implementations of the gemmtrsm micro-kernels may be found in the template configuration directory in:
- config/template/kernels/3/bli_gemmtrsm_l_opt_mxn.c
- config/template/kernels/3/bli_gemmtrsm_u_opt_mxn.c
Note that these implementations are coded in C99 and lack several kinds of optimization that are typical of real-world optimized micro-kernels, such as vector instructions (or intrinsics) and loop unrolling in MR or NR. They are meant to serve only as a starting point for a micro-kernel developer.
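For orientation, the compound operation of `gemmtrsm_l` can also be sketched in plain C as a gemm update followed by a forward substitution. The name `dgemmtrsm_l_ref` is hypothetical; the sketch assumes real double-precision data, MR = NR = 4 with PACKMR = MR and PACKNR = NR, and a pre-inverted diagonal in `a11` as described in the trsm implementation notes.

```c
#define MR 4
#define NR 4

/* Illustrative composed gemmtrsm_l:
     B11 := alpha * B11 - A10 * B01
     B11 := inv(A11) * B11
     C11 := B11
   a10 and a11 are column-stored (ld = MR); b01 and b11 are row-stored
   (ld = NR); the diagonal of a11 holds pre-inverted values. */
void dgemmtrsm_l_ref( int k, double alpha,
                      const double* a10, const double* a11,
                      const double* b01, double* b11,
                      double* c11, int rsc, int csc )
{
    /* Step 1 (gemm subproblem): B11 := alpha * B11 - A10 * B01. */
    for ( int i = 0; i < MR; ++i )
        for ( int j = 0; j < NR; ++j )
        {
            double ab = 0.0;
            for ( int p = 0; p < k; ++p )
                ab += a10[ p * MR + i ] * b01[ p * NR + j ];
            b11[ i * NR + j ] = alpha * b11[ i * NR + j ] - ab;
        }

    /* Step 2 (trsm subproblem): forward substitution, writing the
       result to both the packed B11 and the output C11. */
    for ( int i = 0; i < MR; ++i )
        for ( int j = 0; j < NR; ++j )
        {
            double x = b11[ i * NR + j ];
            for ( int l = 0; l < i; ++l )
                x -= a11[ l * MR + i ] * b11[ l * NR + j ];
            x *= a11[ i * MR + i ];        /* diagonal holds 1/alpha_ii */
            b11[ i * NR + j ] = x;
            c11[ i * rsc + j * csc ] = x;
        }
}
```

A fused implementation (option 2 above) would interleave these two steps so that `B11` stays in registers between the gemm update and the solve, rather than making two passes over memory as this sketch does.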
Level-1f kernels
This section has yet to be written.
Level-1v kernels
This section has yet to be written.