mirror of
https://github.com/amd/blis.git
synced 2026-04-29 03:51:11 +00:00
Details:
- Retrofitted a new data structure, known as a context, into virtually
all internal APIs for computational operations in BLIS. The structure
is now present within the type-aware APIs, as well as many supporting
utility functions that require information stored in the context.
User-level object APIs were unaffected and continue to be
"context-free"; however, these APIs were duplicated/mirrored so that
"context-aware" APIs now also exist, differentiated with an "_ex"
suffix (for "expert"). These new context-aware object APIs (along with
the lower-level, type-aware, BLAS-like APIs) take the address of a
context as their last parameter, after all other operands. Contexts,
or specifically, cntx_t object pointers, are passed all the way down
the function stack into the kernels and allow the code at any level to
query information about the runtime, such as kernel addresses and
blocksizes, in a thread-friendly manner--that is, one that allows
thread-safety even if the original source of the information stored in
the context changes at run-time (see the next bullet for more on this
"original source" of info).
(Special thanks go to Lee Killough for suggesting the use of this kind
of data structure in discussions that transpired during the early
planning stages of BLIS, and also for suggesting such a perfectly
appropriate name.)
- Added a new API, in frame/base/bli_gks.c, to define a "global kernel
structure" (gks). This data structure and API will allow the caller to
initialize a context with the kernel addresses, blocksizes, and other
information associated with the currently active kernel configuration.
The currently active kernel configuration within the gks cannot be
changed (for now), and is initialized with the traditional cpp macros
that define kernel function names, blocksizes, and the like. However,
in the future, the gks API will be expanded to allow runtime management
of kernels and runtime parameters. The most obvious application of this
new infrastructure is the runtime detection of hardware (and the
implied selection of appropriate kernels). With contexts in place,
kernels may even be "hot swapped" at runtime within the gks. Once
execution enters a level-3 _front() function, the memory allocator will
be reinitialized on-the-fly, if necessary, to accommodate the new
kernels' blocksizes. If another application thread is executing with
another (previously loaded) kernel, it will finish in a deterministic
fashion because its kernel information was loaded into its context
before computation began, and also because the blocks it checked out
from the internal memory pools will be unaffected by the newer threads'
reinitialization of the allocator.
- Reorganized and streamlined the 'ind' directory, which contains much of
the code enabling use of induced methods for complex domain matrix
multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as
those APIs' functionality is now mostly subsumed within the global
kernel structure.
- Updated bli_pool.c to define a new function, bli_pool_reinit_if(),
that will reinitialize a memory pool if the necessary pool block size
has increased.
- Updated bli_mem.c to use bli_pool_reinit_if() instead of
bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed
usage of contexts where appropriate to communicate cache and register
blocksizes to bli_mem_compute_pool_block_sizes().
- Simplified control trees now that much of the information resides in
the context and/or the global kernel structure:
- Removed blocksize object pointer (blksz_t*) fields from all control
tree node definitions and replaced them with blocksize id (bszid_t)
values, which may be passed into a context query routine in order to
extract the corresponding blocksize from the given context.
- Removed micro-kernel function pointer (func_t*) fields from all
control tree node definitions. Now, any code that needs these function
pointers can query them from the local context, as identified by a
level-3 micro-kernel id (l3ukr_t), level-1f kernel id (l1fkr_t), or
level-1v kernel id (l1vkr_t).
- Removed blksz_t object creation and initialization, as well as kernel
function object creation and initialization, from all operation-
specific control tree initialization files (bli_*_cntl.c), since this
information will now live in the gks and, secondarily, in the context.
- Removed blocksize multiples from blksz_t objects. Now, we track
blocksize multiples for each blocksize id (bszid_t) in the context
object.
- Removed the bool_t's that were required when a func_t was initialized.
These bools are meant to allow one to track the micro-kernel's storage
preferences (by rows or columns). This preference is now tracked
separately within the gks and contexts.
- Merged and reorganized many separate-but-related functions into single
files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and
util directories, but has the most obvious effect of allowing BLIS
to compile noticeably faster.
- Reorganized execution paths for level-1v, -1d, -1m, and -2 operations
in an attempt to reduce overhead for memory-bound operations. This
includes removal of default use of object-based variants for level-2
operations. Now, by default, level-2 operations will directly call a
low-level (non-object based) loop over a level-1v or -1f kernel.
- Converted many common query functions in bli_blksz.c (renamed from
bli_blocksize.c) and bli_func.c into cpp macros, now defined in their
respective header files.
- Defined bli_mbool.c API to create and query "multi-bools", or
heterogeneous bool_t's (one for each floating-point datatype), in the
same spirit as blksz_t and func_t.
- Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS
and BLIS_SIMD_SIZE. These values are needed in order to compute a third
new parameter, which may be set indirectly via the aforementioned
macros or directly: BLIS_STACK_BUF_MAX_SIZE. This value is used to
statically allocate memory in macro-kernels and the induced methods'
virtual kernels to be used as temporary space to hold a single
micro-tile. These values are now output by the testsuite. The default
value of BLIS_STACK_BUF_MAX_SIZE is computed as
"2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE".
- Cleaned up top-level 'kernels' directory (for example, renaming the
embarrassingly misleading "avx" and "avx2" directories to "sandybridge"
and "haswell," respectively), and gave more consistent and meaningful
names to many kernel files (as well as updating their interfaces to
conform to the new context-aware kernel APIs).
- Updated the testsuite to query blocksizes from a locally-initialized
context for test modules that need those values: axpyf, dotxf,
dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr.
- Reformatted many function signatures into a standard format that will
more easily facilitate future API-wide changes.
- Updated many "mxn" level-0 macros (ie: those used to inline double loops
for level-1m-like operations on small matrices) in frame/include/level0
to use more obscure local variable names in an effort to avoid variable
shadowing. (Thanks to Devin Matthews for pointing out these gcc warnings,
which are only output using -Wshadow.)
- Added a conj argument to setm, so that its interface now mirrors that
of scalm. The semantic meaning of the conj argument is to optionally
allow implicit conjugation of the scalar prior to being populated into
the object.
- Deprecated all type-aware mixed domain and mixed precision APIs. Note
that this does not preclude supporting mixed types via the object APIs,
where it produces absolutely zero API code bloat.
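
The context mechanism described in the first two bullets can be pictured with a small standalone sketch. Everything named demo_* below is hypothetical and exists only for illustration; none of it is the BLIS API. The sketch assumes a context is a by-value snapshot of a global kernel structure, and that the operation receives the context as its last parameter and never consults global state. It is single-threaded and omits the synchronization a real library would need.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the BLIS types or API. */
typedef double (*demo_ukr_ft)(double x);

typedef struct
{
    size_t      mr;   /* register blocksize in the m dimension */
    size_t      nr;   /* register blocksize in the n dimension */
    demo_ukr_ft ukr;  /* address of the active micro-kernel */
} demo_cntx_t;

/* Plays the role of the global kernel structure: the shared "original
   source" of kernel info, which may change while the program runs. */
static demo_cntx_t demo_gks;

/* Two toy "kernels" that differ only so the swap is observable. */
double demo_ukr_old(double x) { return x + 1.0; }
double demo_ukr_new(double x) { return x * 2.0; }

/* Install a kernel configuration in the global structure. */
void demo_gks_set(size_t mr, size_t nr, demo_ukr_ft ukr)
{
    demo_gks.mr  = mr;
    demo_gks.nr  = nr;
    demo_gks.ukr = ukr;
}

/* Initialize a caller-owned context by copying the current global state.
   A real multithreaded library would also have to synchronize this copy;
   that is omitted here. */
void demo_gks_init_cntx(demo_cntx_t *cntx)
{
    assert(cntx != NULL);
    *cntx = demo_gks;
}

/* An "expert"-style operation: the context is the last parameter, and the
   kernel is looked up in the context rather than in global state. */
double demo_apply_ex(double x, const demo_cntx_t *cntx)
{
    assert(cntx != NULL && cntx->ukr != NULL);
    return cntx->ukr(x);
}
```

A caller would set a configuration, initialize a context, and then pass that context to demo_apply_ex(). If the global structure is changed afterwards, a computation that already holds a context keeps using the blocksizes and kernel it captured, which is the property the commit message relies on for "hot swapping."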
367 lines
13 KiB
C
/*

   BLIS
   An object-based framework for developing high-performance BLAS-like
   libraries.

   Copyright (C) 2014, The University of Texas at Austin

   Redistribution and use in source and binary forms, with or without
   modification, are permitted provided that the following conditions are
   met:
    - Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    - Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.
    - Neither the name of The University of Texas at Austin nor the names
      of its contributors may be used to endorse or promote products
      derived from this software without specific prior written permission.

   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
   HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

*/

#include "blis.h"



void bli_sgemm_opt_mxn
     (
       dim_t               k,
       float*     restrict alpha,
       float*     restrict a1,
       float*     restrict b1,
       float*     restrict beta,
       float*     restrict c11, inc_t rs_c, inc_t cs_c,
       auxinfo_t* restrict data,
       cntx_t*    restrict cntx
     )
{
    /* Just call the reference implementation. */
    BLIS_SGEMM_UKERNEL_REF
    (
      k,
      alpha,
      a1,
      b1,
      beta,
      c11, rs_c, cs_c,
      data,
      cntx
    );
}



void bli_dgemm_opt_mxn
     (
       dim_t               k,
       double*    restrict alpha,
       double*    restrict a1,
       double*    restrict b1,
       double*    restrict beta,
       double*    restrict c11, inc_t rs_c, inc_t cs_c,
       auxinfo_t* restrict data,
       cntx_t*    restrict cntx
     )
{
/*
  Template gemm micro-kernel implementation

  This function contains a template implementation for a double-precision
  real micro-kernel, coded in C, which can serve as the starting point for
  one to write an optimized micro-kernel on an arbitrary architecture. (We
  show a template implementation for only double-precision real because
  the templates for the other three floating-point types would be nearly
  identical.)

  This micro-kernel performs a matrix-matrix multiplication of the form:

    C11 := beta * C11 + alpha * A1 * B1

  where A1 is MR x k, B1 is k x NR, C11 is MR x NR, and alpha and beta are
  scalars.

  Parameters:

  - k:      The number of columns of A1 and rows of B1.
  - alpha:  The address of a scalar applied to the A1 * B1 product.
  - a1:     The address of a micro-panel of matrix A of dimension MR x k,
            stored by columns with leading dimension PACKMR, where
            typically PACKMR = MR.
  - b1:     The address of a micro-panel of matrix B of dimension k x NR,
            stored by rows with leading dimension PACKNR, where typically
            PACKNR = NR.
  - beta:   The address of a scalar applied to the input value of matrix
            C11.
  - c11:    The address of a submatrix C11 of dimension MR x NR, stored
            according to rs_c and cs_c.
  - rs_c:   The row stride of matrix C11 (ie: the distance to the next row,
            in units of matrix elements).
  - cs_c:   The column stride of matrix C11 (ie: the distance to the next
            column, in units of matrix elements).
  - data:   The address of an auxinfo_t object that contains auxiliary
            information that may be useful when optimizing the gemm
            micro-kernel implementation. (See BLIS KernelsHowTo wiki for
            more info.)
  - cntx:   The address of the runtime context. The context can be queried
            for implementation-specific values such as cache and register
            blocksizes. However, most micro-kernels intrinsically "know"
            these values already, and thus the cntx argument usually can
            be safely ignored. (The following template micro-kernel code
            does in fact query MR, NR, PACKMR, and PACKNR, as needed, but
            only because those values are not hard-coded, as they would be
            in a typical optimized micro-kernel implementation.)

  Diagram for gemm

  The diagram below shows the packed micro-panel operands and how elements
  of each would be stored when MR = NR = 4. The hex digits indicate the
  layout and order (but NOT the numeric contents) of the elements in
  memory. Note that the storage of C11 is not shown since it is determined
  by the row and column strides of C11.

        c11:           a1:                        b1:
        _______        ______________________     _______
       |       |      |0 4 8 C               |   |0 1 2 3|
   MR  |       |      |1 5 9 D . . .         |   |4 5 6 7|
       |       |  +=  |2 6 A E               |   |8 9 A B|
       |_______|      |3_7_B_F_______________|   |C D E F|
                                                 |   .   |
           NR                    k               |   .   | k
                                                 |   .   |
                                                 |       |
                                                 |       |
                                                 |_______|

                                                     NR

  Implementation Notes for gemm

  - Register blocksizes. The C preprocessor macros bli_?mr and bli_?nr
    evaluate to the MR and NR register blocksizes for the datatype
    corresponding to the '?' character. These values are abbreviations
    of the macro constants BLIS_DEFAULT_MR_? and BLIS_DEFAULT_NR_?,
    which are defined in the bli_kernel.h header file of the BLIS
    configuration.
  - Leading dimensions of a1 and b1: PACKMR and PACKNR. The packed
    micro-panels a1 and b1 are simply stored in column-major and row-major
    order, respectively. Usually, the width of either micro-panel (ie:
    the number of rows of A1, or MR, and the number of columns of B1, or
    NR) is equal to that micro-panel's so-called "leading dimension."
    Sometimes, it may be beneficial to specify a leading dimension that
    is larger than the panel width. This may be desirable because it
    allows each column of A1 or row of B1 to maintain a certain alignment
    in memory that would not otherwise be maintained by MR and/or NR. In
    this case, you should index through a1 and b1 using the values PACKMR
    and PACKNR, respectively, as defined by bli_?packmr and bli_?packnr.
    These values are defined as BLIS_PACKDIM_MR_? and BLIS_PACKDIM_NR_?,
    respectively, in the bli_kernel.h header file of the BLIS
    configuration.
  - Storage preference of c11: Sometimes, an optimized micro-kernel will
    have a preferred storage format for C11--typically either contiguous
    row-storage or contiguous column-storage. This preference comes from
    how the micro-kernel is most efficiently able to load/store elements
    of C11 from/to memory. Most micro-kernels use vector instructions to
    load and store contiguous columns (or column segments) of C11. However,
    the developer may decide that loading contiguous rows (or row
    segments) is desirable. If this is the case, this preference should be
    noted in bli_kernel.h by defining the macro
    BLIS_?GEMM_UKERNEL_PREFERS_CONTIG_ROWS. Leaving the macro undefined
    leaves the default assumption (contiguous column preference) in
    place. Setting this macro allows the framework to perform a minor
    optimization at run-time that will ensure the micro-kernel preference
    is honored, if at all possible.
  - Edge cases in MR, NR dimensions. Sometimes the micro-kernel will be
    called with micro-panels a1 and b1 that correspond to edge cases,
    where only partial results are needed. Zero-padding is handled
    automatically by the packing function to facilitate reuse of the same
    micro-kernel. Similarly, the logic for computing to temporary storage
    and then saving only the elements that correspond to elements of C11
    that exist (at the edges) is handled automatically within the
    macro-kernel.
  - Alignment of a1 and b1. By default, the addresses a1 and b1 are
    aligned only to sizeof(type). If BLIS_CONTIG_ADDR_ALIGN_SIZE is
    set to some larger multiple of sizeof(type), such as the page size,
    then a1 and b1 will be aligned to PACKMR * sizeof(type) and PACKNR *
    sizeof(type), respectively. Alignment of a1 and b1 is also affected
    by BLIS_UPANEL_A_ALIGN_SIZE_? and BLIS_UPANEL_B_ALIGN_SIZE_?, which
    align the distance (stride) between subsequent micro-panels. (By
    default, those values are simply sizeof(type), in which case they have
    no effect.)
  - Unrolling loops. As a general rule of thumb, the loop over k is
    sometimes moderately unrolled; for example, in our experience, an
    unrolling factor of u = 4 is fairly common. If unrolling is applied
    in the k dimension, edge cases must be handled to support values of k
    that are not multiples of u. It is nearly universally true that there
    should be no loops in the MR or NR directions; in other words,
    iteration over these dimensions should always be fully unrolled
    (within the loop over k).
  - Zero beta. If beta = 0.0 (or 0.0 + 0.0i for complex datatypes), then
    the micro-kernel should NOT use it explicitly (ie: it should not
    compute beta * C11), as C11 may contain uninitialized memory
    (including NaNs). This case should be detected and handled
    separately, preferably by simply overwriting C11 with the
    alpha * A1 * B1 product. An example of how to perform this "beta equals
    zero" handling is included in the gemm micro-kernel associated with
    the template configuration.

  For more info, please refer to the BLIS website and/or contact the
  blis-devel mailing list.

  -FGVZ
*/
    const num_t dt     = BLIS_DOUBLE;

    const dim_t mr     = bli_cntx_get_blksz_def_dt( dt, BLIS_MR, cntx );
    const dim_t nr     = bli_cntx_get_blksz_def_dt( dt, BLIS_NR, cntx );

    const inc_t packmr = bli_cntx_get_blksz_max_dt( dt, BLIS_MR, cntx );
    const inc_t packnr = bli_cntx_get_blksz_max_dt( dt, BLIS_NR, cntx );

    const inc_t cs_a   = packmr;
    const inc_t rs_b   = packnr;

    const inc_t rs_ab  = 1;
    const inc_t cs_ab  = mr;

    dim_t       l, j, i;

    double      ab[ bli_dmr *
                    bli_dnr ];
    double*     abij;
    double      ai, bj;


    /* Initialize the accumulator elements in ab to zero. */
    for ( i = 0; i < mr * nr; ++i )
    {
        bli_dset0s( *(ab + i) );
    }

    /* Perform a series of k rank-1 updates into ab. */
    for ( l = 0; l < k; ++l )
    {
        abij = ab;

        /* In an optimized implementation, these two loops over MR and NR
           are typically fully unrolled. */
        for ( j = 0; j < nr; ++j )
        {
            bj = *(b1 + j);

            for ( i = 0; i < mr; ++i )
            {
                ai = *(a1 + i);

                bli_ddots( ai, bj, *abij );

                abij += rs_ab;
            }
        }

        a1 += cs_a;
        b1 += rs_b;
    }

    /* Scale each element of ab by alpha. */
    for ( i = 0; i < mr * nr; ++i )
    {
        bli_dscals( *alpha, *(ab + i) );
    }

    /* If beta is zero, overwrite c11 with the scaled result in ab.
       Otherwise, scale c11 by beta and then add the scaled result in
       ab. */
    if ( bli_deq0( *beta ) )
    {
        /* c11 := ab */
        bli_dcopys_mxn( mr,
                        nr,
                        ab,  rs_ab, cs_ab,
                        c11, rs_c,  cs_c );
    }
    else
    {
        /* c11 := beta * c11 + ab */
        bli_dxpbys_mxn( mr,
                        nr,
                        ab,  rs_ab, cs_ab,
                        beta,
                        c11, rs_c,  cs_c );
    }
}



void bli_cgemm_opt_mxn
     (
       dim_t               k,
       scomplex*  restrict alpha,
       scomplex*  restrict a1,
       scomplex*  restrict b1,
       scomplex*  restrict beta,
       scomplex*  restrict c11, inc_t rs_c, inc_t cs_c,
       auxinfo_t* restrict data,
       cntx_t*    restrict cntx
     )
{
    /* Just call the reference implementation. */
    BLIS_CGEMM_UKERNEL_REF
    (
      k,
      alpha,
      a1,
      b1,
      beta,
      c11, rs_c, cs_c,
      data,
      cntx
    );
}



void bli_zgemm_opt_mxn
     (
       dim_t               k,
       dcomplex*  restrict alpha,
       dcomplex*  restrict a1,
       dcomplex*  restrict b1,
       dcomplex*  restrict beta,
       dcomplex*  restrict c11, inc_t rs_c, inc_t cs_c,
       auxinfo_t* restrict data,
       cntx_t*    restrict cntx
     )
{
    /* Just call the reference implementation. */
    BLIS_ZGEMM_UKERNEL_REF
    (
      k,
      alpha,
      a1,
      b1,
      beta,
      c11, rs_c, cs_c,
      data,
      cntx
    );
}

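
The arithmetic performed by the template double-precision kernel above can also be tried without a BLIS build. The following is a minimal standalone sketch in plain C, not the BLIS implementation: the names demo_dgemm_ukr, DEMO_MR and DEMO_NR are hypothetical, the register blocksizes are fixed at 4 x 4 (as in the diagram), and it assumes PACKMR = MR and PACKNR = NR. It keeps the two properties the implementation notes stress: k rank-1 updates into a local accumulator, and a separate path for beta equal to zero so that stale contents of C11 (including NaNs) never reach the result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed register blocksizes; a real kernel would hard-code
   whatever suits the target architecture. */
enum { DEMO_MR = 4, DEMO_NR = 4 };

/* Computes C11 := beta * C11 + alpha * A1 * B1, where a1 is an MR x k
   micro-panel packed by columns and b1 is a k x NR micro-panel packed by
   rows. Assumes PACKMR == MR and PACKNR == NR. */
void demo_dgemm_ukr(size_t k,
                    const double *alpha,
                    const double *a1,
                    const double *b1,
                    const double *beta,
                    double *c11, size_t rs_c, size_t cs_c)
{
    double ab[DEMO_MR * DEMO_NR] = { 0.0 };  /* column-major accumulator */
    size_t l, j, i;

    assert(alpha != NULL && beta != NULL && c11 != NULL);
    assert(k == 0 || (a1 != NULL && b1 != NULL));

    /* Perform k rank-1 updates: ab += a1(:,l) * b1(l,:). */
    for (l = 0; l < k; ++l)
    {
        for (j = 0; j < DEMO_NR; ++j)
            for (i = 0; i < DEMO_MR; ++i)
                ab[i + j * DEMO_MR] += a1[i] * b1[j];

        a1 += DEMO_MR;  /* advance to the next column of A1 */
        b1 += DEMO_NR;  /* advance to the next row of B1 */
    }

    /* Scale by alpha and store, never reading C11 when beta is zero. */
    for (j = 0; j < DEMO_NR; ++j)
    {
        for (i = 0; i < DEMO_MR; ++i)
        {
            double *cij = c11 + i * rs_c + j * cs_c;
            double  v   = *alpha * ab[i + j * DEMO_MR];

            if (*beta == 0.0) *cij = v;
            else              *cij = *beta * *cij + v;
        }
    }
}
```

For column-major C11, pass rs_c = 1 and cs_c = 4; for row-major, pass rs_c = 4 and cs_c = 1. Unlike an optimized kernel, this sketch loops over MR and NR rather than fully unrolling them, which keeps it short at the cost of performance.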