blis/config/template/kernels/1f/bli_dotaxpyv_opt_var1.c
Field G. Van Zee 537a1f4f85 Implemented runtime contexts and reorganized code.
Details:
- Retrofitted a new data structure, known as a context, into virtually
  all internal APIs for computational operations in BLIS. The structure
  is now present within the type-aware APIs, as well as many supporting
  utility functions that require information stored in the context. User-
  level object APIs were unaffected and continue to be "context-free";
  however, these APIs were duplicated/mirrored so that "context-aware"
  APIs now also exist, differentiated with an "_ex" suffix (for "expert").
  These new context-aware object APIs (along with the lower-level, type-
  aware, BLAS-like APIs) contain the address of a context as a last
  parameter, after all other operands. Contexts, or specifically, cntx_t
  object pointers, are passed all the way down the function stack into
  the kernels and allow the code at any level to query information about
  the runtime, such as kernel addresses and blocksizes, in a thread-
  friendly manner--that is, one that allows thread-safety, even if the
  original source of the information stored in the context changes at
  run-time (see next bullet for more on this "original source" of info).
  (Special thanks go to Lee Killough for suggesting the use of this kind
  of data structure in discussions that transpired during the early
  planning stages of BLIS, and also for suggesting such a perfectly
  appropriate name.)
- Added a new API, in frame/base/bli_gks.c, to define a "global kernel
  structure" (gks). This data structure and API will allow the caller to
  initialize a context with the kernel addresses, blocksizes, and other
  information associated with the currently active kernel configuration.
  The currently active kernel configuration within the gks cannot be
  changed (for now), and is initialized with the traditional cpp macros
  that define kernel function names, blocksizes, and the like. However,
  in the future, the gks API will be expanded to allow runtime management
  of kernels and runtime parameters. The most obvious application of this
  new infrastructure is the runtime detection of hardware (and the
  implied selection of appropriate kernels). With contexts in place,
  kernels may even be "hot swapped" at runtime within the gks. Once
  execution enters a level-3 _front() function, the memory allocator will
  be reinitialized on-the-fly, if necessary, to accommodate the new
  kernels' blocksizes. If another application thread is executing with
  another (previously loaded) kernel, it will finish in a deterministic
  fashion because its kernel information was loaded into its context
  before computation began, and also because the blocks it checked out
  from the internal memory pools will be unaffected by the newer threads'
  reinitialization of the allocator.
- Reorganized and streamlined the 'ind' directory, which contains much of
  the code enabling use of induced methods for complex domain matrix
  multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as
  those APIs' functionality is now mostly subsumed within the global
  kernel structure.
- Updated bli_pool.c to define a new function, bli_pool_reinit_if(),
  that will reinitialize a memory pool if the necessary pool block size
  has increased.
- Updated bli_mem.c to use bli_pool_reinit_if() instead of
  bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed
  usage of contexts where appropriate to communicate cache and register
  blocksizes to bli_mem_compute_pool_block_sizes().
- Simplified control trees now that much of the information resides in
  the context and/or the global kernel structure:
  - Removed blocksize object pointers (blksz_t*) fields from all control
    tree node definitions and replaced them with blocksize id (bszid_t)
    values instead, which may be passed into a context query routine in
    order to extract the corresponding blocksize from the given context.
  - Removed micro-kernel function pointers (func_t*) fields from all
    control tree node definitions. Now, any code that needs these function
    pointers can query them from the local context, as identified by a
    level-3 micro-kernel id (l3ukr_t), level-1f kernel id (l1fkr_t), or
    level-1v kernel id (l1vkr_t).
  - Removed blksz_t object creation and initialization, as well as kernel
    function object creation and initialization, from all operation-
    specific control tree initialization files (bli_*_cntl.c), since this
    information will now live in the gks and, secondarily, in the context.
- Removed blocksize multiples from blksz_t objects. Now, we track
  blocksize multiples for each blocksize id (bszid_t) in the context
  object.
- Removed the bool_t's that were required when a func_t was initialized.
  These bools are meant to allow one to track the micro-kernel's storage
  preferences (by rows or columns). This preference is now tracked
  separately within the gks and contexts.
- Merged and reorganized many separate-but-related functions into single
  files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and
  util directories, but has the most obvious effect of allowing BLIS
  to compile noticeably faster.
- Reorganized execution paths for level-1v, -1d, -1m, and -2 operations
  in an attempt to reduce overhead for memory-bound operations. This
  includes removal of default use of object-based variants for level-2
  operations. Now, by default, level-2 operations will directly call a
  low-level (non-object based) loop over a level-1v or -1f kernel.
- Converted many common query functions in bli_blksz.c (renamed from
  bli_blocksize.c) and bli_func.c into cpp macros, now defined in their
  respective header files.
- Defined bli_mbool.c API to create and query "multi-bools", or
  heterogeneous bool_t's (one for each floating-point datatype), in the
  same spirit as blksz_t and func_t.
- Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS
  and BLIS_SIMD_SIZE. These values are needed in order to compute a third
  new parameter, which may be set indirectly via the aforementioned
  macros or directly: BLIS_STACK_BUF_MAX_SIZE. This value is used to
  statically allocate memory in macro-kernels and the induced methods'
  virtual kernels to be used as temporary space to hold a single
  micro-tile. These values are now output by the testsuite. The default
  value of BLIS_STACK_BUF_MAX_SIZE is computed as
  "2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE".
- Cleaned up top-level 'kernels' directory (for example, renaming the
  embarrassingly misleading "avx" and "avx2" directories to "sandybridge"
  and "haswell," respectively), and gave more consistent and meaningful
  names to many kernel files (as well as updating their interfaces to
  conform to the new context-aware kernel APIs).
- Updated the testsuite to query blocksizes from a locally-initialized
  context for test modules that need those values: axpyf, dotxf,
  dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr.
- Reformatted many function signatures into a standard format that will
  more easily facilitate future API-wide changes.
- Updated many "mxn" level-0 macros (i.e., those used to inline double loops
  for level-1m-like operations on small matrices) in frame/include/level0
  to use more obscure local variable names in an effort to avoid variable
  shadowing. (Thanks to Devin Matthews for pointing out these gcc warnings,
  which are only output using -Wshadow.)
- Added a conj argument to setm, so that its interface now mirrors that
  of scalm. The semantic meaning of the conj argument is to optionally
  allow implicit conjugation of the scalar prior to being populated into
  the object.
- Deprecated all type-aware mixed domain and mixed precision APIs. Note
  that this does not preclude supporting mixed types via the object APIs,
  where it produces absolutely zero API code bloat.
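
The context mechanism described in the first two bullets can be illustrated with a minimal, self-contained sketch. Every name here (`toy_cntx_t`, `toy_gks`, `toy_gks_set`, `toy_cntx_init`, `toy_dot_ex`, and the two toy kernels) is hypothetical and invented for illustration; none of it is the BLIS API or its data layout. The sketch assumes only what the message states: a context is filled from a global structure before computation begins, and it is passed as the last parameter of an "expert" interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel signature; not a BLIS type. */
typedef double (*toy_dot_fn)( size_t n, const double* x, const double* y );

/* Hypothetical context: a by-value snapshot of kernel info. */
typedef struct
{
    toy_dot_fn dot;     /* kernel address */
    size_t     blocksz; /* a blocksize    */
} toy_cntx_t;

/* Toy "global kernel structure": the original source of the info. */
static toy_cntx_t toy_gks;

/* Straightforward kernel. */
double toy_dot_ref( size_t n, const double* x, const double* y )
{
    double rho = 0.0;
    for ( size_t i = 0; i < n; ++i ) rho += x[i] * y[i];
    return rho;
}

/* A second kernel (unrolled by two) standing in for a newer one. */
double toy_dot_unrolled( size_t n, const double* x, const double* y )
{
    double r0 = 0.0, r1 = 0.0;
    size_t i = 0;
    for ( ; i + 1 < n; i += 2 )
    {
        r0 += x[i] * y[i];
        r1 += x[i + 1] * y[i + 1];
    }
    if ( i < n ) r0 += x[i] * y[i];
    return r0 + r1;
}

/* Change the globally active kernel and blocksize. */
void toy_gks_set( toy_dot_fn dot, size_t blocksz )
{
    toy_gks.dot     = dot;
    toy_gks.blocksz = blocksz;
}

/* Copy the current global state into a caller-owned context. */
void toy_cntx_init( toy_cntx_t* cntx )
{
    assert( cntx != NULL );
    *cntx = toy_gks;
}

/* "Expert"-style entry point: the context is the last parameter. */
double toy_dot_ex( size_t n, const double* x, const double* y,
                   const toy_cntx_t* cntx )
{
    return cntx->dot( n, x, y );
}
```

Because `toy_cntx_init()` copies by value, a context initialized before a later `toy_gks_set()` call keeps dispatching to the kernel and blocksize it captured, which is the property the message relies on for threads that are already executing. A real implementation would also need to synchronize access to the global structure; this sketch does not attempt that.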
2016-04-11 17:21:28 -05:00


/*
BLIS
An object-based framework for developing high-performance BLAS-like
libraries.
Copyright (C) 2014, The University of Texas at Austin
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
- Neither the name of The University of Texas at Austin nor the names
of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include "blis.h"
void bli_sdotaxpyv_opt_var1
(
conj_t conjxt,
conj_t conjx,
conj_t conjy,
dim_t n,
float* alpha,
float* x, inc_t incx,
float* y, inc_t incy,
float* rho,
float* z, inc_t incz,
cntx_t* cntx
)
{
/* Just call the reference implementation. */
BLIS_SDOTAXPYV_KERNEL_REF
(
conjxt,
conjx,
conjy,
n,
alpha,
x, incx,
y, incy,
rho,
z, incz,
cntx
);
}
void bli_ddotaxpyv_opt_var1
(
conj_t conjxt,
conj_t conjx,
conj_t conjy,
dim_t n,
double* alpha,
double* x, inc_t incx,
double* y, inc_t incy,
double* rho,
double* z, inc_t incz,
cntx_t* cntx
)
{
/* Just call the reference implementation. */
BLIS_DDOTAXPYV_KERNEL_REF
(
conjxt,
conjx,
conjy,
n,
alpha,
x, incx,
y, incy,
rho,
z, incz,
cntx
);
}
void bli_cdotaxpyv_opt_var1
(
conj_t conjxt,
conj_t conjx,
conj_t conjy,
dim_t n,
scomplex* alpha,
scomplex* x, inc_t incx,
scomplex* y, inc_t incy,
scomplex* rho,
scomplex* z, inc_t incz,
cntx_t* cntx
)
{
/* Just call the reference implementation. */
BLIS_CDOTAXPYV_KERNEL_REF
(
conjxt,
conjx,
conjy,
n,
alpha,
x, incx,
y, incy,
rho,
z, incz,
cntx
);
}
void bli_zdotaxpyv_opt_var1
(
conj_t conjxt,
conj_t conjx,
conj_t conjy,
dim_t n,
dcomplex* alpha,
dcomplex* x, inc_t incx,
dcomplex* y, inc_t incy,
dcomplex* rho,
dcomplex* z, inc_t incz,
cntx_t* cntx
)
{
/*
Template dotaxpyv kernel implementation
This function contains a template implementation for a double-precision
complex kernel, coded in C, which can serve as the starting point for one
to write an optimized kernel on an arbitrary architecture. (We show a
template implementation for only double-precision complex because the
templates for the other three floating-point types would be similar, with
the real instantiations being noticeably simpler due to the disappearance
of conjugation in the real domain.)
This kernel fuses a dotv and axpyv operation:
rho := conjxt( x^T ) * conjy( y )
z := z + alpha * conjx( x )
where x, y, and z are vectors of length n and alpha and rho are scalars.
Parameters:
- conjxt: Compute with conjugated values of x^T?
- conjx: Compute with conjugated values of x?
- conjy: Compute with conjugated values of y?
- n: The number of elements in vectors x, y, and z.
- alpha: The address of the scalar to be applied to x.
- x: The address of vector x.
- incx: The vector increment of x. incx should be unit unless the
implementation makes special accommodation for non-unit values.
- y: The address of vector y.
- incy: The vector increment of y. incy should be unit unless the
implementation makes special accommodation for non-unit values.
- rho: The address of the output scalar of the dotv subproblem.
- z: The address of vector z.
- incz: The vector increment of z. incz should be unit unless the
implementation makes special accommodation for non-unit values.
This template code calls the reference implementation if any of the
following conditions are true:
- Any of the strides incx, incy, or incz is non-unit.
- Vectors x, y, and z are unaligned with different offsets.
If the vectors are aligned, or unaligned by the same offset, then optimized
code can be used for the bulk of the computation. This template shows how
the front-edge case can be handled so that the remaining computation is
aligned. (This template guarantees alignment in the main loops to be
BLIS_SIMD_ALIGN_SIZE, which is defined in bli_config.h.)
Here are a few additional things to consider:
- While four combinations of possible values of conjxt and conjy exist, we
implement only conjugation on x^T explicitly; we induce the other two cases
by toggling the effective conjugation on x^T and then conjugating the dot
product result.
- Because conjugation disappears in the real domain, real instances of
this kernel can safely ignore the values of any conjugation parameters,
thereby simplifying the implementation.
For more info, please refer to the BLIS website and/or contact the
blis-devel mailing list.
-FGVZ
*/
const dim_t n_elem_per_reg = 1;
const dim_t n_iter_unroll = 1;
const dim_t n_elem_per_iter = n_elem_per_reg * n_iter_unroll;
const siz_t type_size = sizeof( *x );
dcomplex* xp;
dcomplex* yp;
dcomplex* zp;
dcomplex dotxy;
bool_t use_ref = FALSE;
dim_t n_pre = 0;
dim_t n_iter;
dim_t n_left;
dim_t off_x, off_y, off_z;
dim_t i;
conj_t conjxt_use;
// If the vector lengths are zero, set rho to zero and return.
if ( bli_zero_dim1( n ) )
{
bli_zset0s( *rho );
return;
}
// If there is anything that would interfere with our use of aligned
// vector loads/stores, call the reference implementation.
if ( bli_has_nonunit_inc3( incx, incy, incz ) )
{
use_ref = TRUE;
}
else if ( bli_is_unaligned_to( x, BLIS_SIMD_ALIGN_SIZE ) ||
bli_is_unaligned_to( y, BLIS_SIMD_ALIGN_SIZE ) ||
bli_is_unaligned_to( z, BLIS_SIMD_ALIGN_SIZE ) )
{
use_ref = TRUE;
// If x, y, and z are unaligned by the same offset, then we can
// still use an implementation that depends on alignment for most
// of the operation.
off_x = bli_offset_from_alignment( x, BLIS_SIMD_ALIGN_SIZE );
off_y = bli_offset_from_alignment( y, BLIS_SIMD_ALIGN_SIZE );
off_z = bli_offset_from_alignment( z, BLIS_SIMD_ALIGN_SIZE );
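if ( off_x == off_y && off_x == off_z && off_x % type_size == 0 )
{
use_ref = FALSE;
// Count the elements up to the next aligned address (not the
// elements past the previous one), and never peel more than n.
n_pre = ( BLIS_SIMD_ALIGN_SIZE - off_x ) / type_size;
if ( n_pre > n ) n_pre = n;
}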
}
// Call the reference implementation if needed.
if ( use_ref == TRUE )
{
BLIS_ZDOTAXPYV_KERNEL_REF
(
conjxt,
conjx,
conjy,
n,
alpha,
x, incx,
y, incy,
rho,
z, incz,
cntx
);
return;
}
// Compute the number of unrolled and leftover (edge) iterations.
n_iter = ( n - n_pre ) / n_elem_per_iter;
n_left = ( n - n_pre ) % n_elem_per_iter;
// Initialize pointers into x, y, and z.
xp = x;
yp = y;
zp = z;
// Initialize accumulator to zero.
bli_zset0s( dotxy );
conjxt_use = conjxt;
// If y must be conjugated, we compute the result indirectly by first
// toggling the effective conjugation of xt and then conjugating the
// resulting dot product.
if ( bli_is_conj( conjy ) )
bli_toggle_conj( conjxt_use );
// Iterate over elements of x, y, and z to compute:
// rho = conjxt( x^T ) * conjy( y );
// z += alpha * conjx( x );
if ( bli_is_noconj( conjx ) && bli_is_noconj( conjxt_use ) )
{
// Compute front edge cases if x, y, and z were unaligned.
for ( i = 0; i < n_pre; ++i )
{
bli_zdots( *xp, *yp, dotxy );
bli_zaxpys( *alpha, *xp, *zp );
xp += 1; yp += 1; zp += 1;
}
// The bulk of the operation is executed here. For best performance,
// alpha should be loaded once prior to the n_iter loop, dotxy
// should be kept in registers, and each element of x should be
// loaded only once. The addresses xp, yp, and zp are
// guaranteed to be aligned to BLIS_SIMD_ALIGN_SIZE.
for ( i = 0; i < n_iter; ++i )
{
bli_zdots( *xp, *yp, dotxy );
bli_zaxpys( *alpha, *xp, *zp );
xp += n_elem_per_iter;
yp += n_elem_per_iter;
zp += n_elem_per_iter;
}
// Compute tail edge cases, if applicable.
for ( i = 0; i < n_left; ++i )
{
bli_zdots( *xp, *yp, dotxy );
bli_zaxpys( *alpha, *xp, *zp );
xp += 1; yp += 1; zp += 1;
}
}
else if ( bli_is_noconj( conjx ) && bli_is_conj( conjxt_use ) )
{
// Compute front edge cases if x, y, and z were unaligned.
for ( i = 0; i < n_pre; ++i )
{
bli_zdotjs( *xp, *yp, dotxy );
bli_zaxpys( *alpha, *xp, *zp );
xp += 1; yp += 1; zp += 1;
}
// The bulk of the operation is executed here. For best performance,
// alpha should be loaded once prior to the n_iter loop, dotxy
// should be kept in registers, and each element of x should be
// loaded only once. The addresses xp, yp, and zp are
// guaranteed to be aligned to BLIS_SIMD_ALIGN_SIZE.
for ( i = 0; i < n_iter; ++i )
{
bli_zdotjs( *xp, *yp, dotxy );
bli_zaxpys( *alpha, *xp, *zp );
xp += n_elem_per_iter;
yp += n_elem_per_iter;
zp += n_elem_per_iter;
}
// Compute tail edge cases, if applicable.
for ( i = 0; i < n_left; ++i )
{
bli_zdotjs( *xp, *yp, dotxy );
bli_zaxpys( *alpha, *xp, *zp );
xp += 1; yp += 1; zp += 1;
}
}
else if ( bli_is_conj( conjx ) && bli_is_noconj( conjxt_use ) )
{
// Compute front edge cases if x, y, and z were unaligned.
for ( i = 0; i < n_pre; ++i )
{
bli_zdots( *xp, *yp, dotxy );
bli_zaxpyjs( *alpha, *xp, *zp );
xp += 1; yp += 1; zp += 1;
}
// The bulk of the operation is executed here. For best performance,
// alpha should be loaded once prior to the n_iter loop, dotxy
// should be kept in registers, and each element of x should be
// loaded only once. The addresses xp, yp, and zp are
// guaranteed to be aligned to BLIS_SIMD_ALIGN_SIZE.
for ( i = 0; i < n_iter; ++i )
{
bli_zdots( *xp, *yp, dotxy );
bli_zaxpyjs( *alpha, *xp, *zp );
xp += n_elem_per_iter;
yp += n_elem_per_iter;
zp += n_elem_per_iter;
}
// Compute tail edge cases, if applicable.
for ( i = 0; i < n_left; ++i )
{
bli_zdots( *xp, *yp, dotxy );
bli_zaxpyjs( *alpha, *xp, *zp );
xp += 1; yp += 1; zp += 1;
}
}
else // if ( bli_is_conj( conjx ) && bli_is_conj( conjxt_use ) )
{
// Compute front edge cases if x, y, and z were unaligned.
for ( i = 0; i < n_pre; ++i )
{
bli_zdotjs( *xp, *yp, dotxy );
bli_zaxpyjs( *alpha, *xp, *zp );
xp += 1; yp += 1; zp += 1;
}
// The bulk of the operation is executed here. For best performance,
// alpha should be loaded once prior to the n_iter loop, dotxy
// should be kept in registers, and each element of x should be
// loaded only once. The addresses xp, yp, and zp are
// guaranteed to be aligned to BLIS_SIMD_ALIGN_SIZE.
for ( i = 0; i < n_iter; ++i )
{
bli_zdotjs( *xp, *yp, dotxy );
bli_zaxpyjs( *alpha, *xp, *zp );
xp += n_elem_per_iter;
yp += n_elem_per_iter;
zp += n_elem_per_iter;
}
// Compute tail edge cases, if applicable.
for ( i = 0; i < n_left; ++i )
{
bli_zdotjs( *xp, *yp, dotxy );
bli_zaxpyjs( *alpha, *xp, *zp );
xp += 1; yp += 1; zp += 1;
}
}
// If conjugation on y was requested, we induce it by conjugating
// the contents of rho.
if ( bli_is_conj( conjy ) )
bli_zconjs( dotxy );
bli_zcopys( dotxy, *rho );
}
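
The way the template handles conjy (toggle the conjugation applied to x^T, never conjugate y, then conjugate the accumulated dot product) can be checked against a direct evaluation of the fused operation. The following is a minimal sketch in plain C, assuming unit strides and no alignment handling. The names `toy_z`, `toy_conj_if`, `toy_axpy`, `toy_zdotaxpyv_direct`, and `toy_zdotaxpyv_induced` are hypothetical and are not BLIS types, macros, or kernels; alpha is also passed by value here rather than by address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a double-precision complex scalar. */
typedef struct { double real; double imag; } toy_z;

/* Return v, conjugated only when do_conj is true. */
toy_z toy_conj_if( bool do_conj, toy_z v )
{
    if ( do_conj ) v.imag = -v.imag;
    return v;
}

/* acc := acc + a * b (complex multiply-accumulate). */
void toy_axpy( toy_z a, toy_z b, toy_z* acc )
{
    acc->real += a.real * b.real - a.imag * b.imag;
    acc->imag += a.real * b.imag + a.imag * b.real;
}

/* Direct form: conjugate x and y exactly as requested.
     rho := conjxt( x^T ) * conjy( y )
     z   := z + alpha * conjx( x )                       */
void toy_zdotaxpyv_direct( bool conjxt, bool conjx, bool conjy, size_t n,
                           toy_z alpha, const toy_z* x, const toy_z* y,
                           toy_z* rho, toy_z* z )
{
    toy_z acc = { 0.0, 0.0 };

    assert( x != NULL && y != NULL && rho != NULL && z != NULL );

    for ( size_t i = 0; i < n; ++i )
    {
        toy_axpy( toy_conj_if( conjxt, x[i] ),
                  toy_conj_if( conjy, y[i] ), &acc );
        toy_axpy( alpha, toy_conj_if( conjx, x[i] ), &z[i] );
    }
    *rho = acc;
}

/* Induced form: y is never conjugated. When conjy is requested, the
   conjugation on x^T is toggled and the final sum is conjugated.    */
void toy_zdotaxpyv_induced( bool conjxt, bool conjx, bool conjy, size_t n,
                            toy_z alpha, const toy_z* x, const toy_z* y,
                            toy_z* rho, toy_z* z )
{
    const bool conjxt_use = conjy ? !conjxt : conjxt;
    toy_z      acc        = { 0.0, 0.0 };

    assert( x != NULL && y != NULL && rho != NULL && z != NULL );

    for ( size_t i = 0; i < n; ++i )
    {
        toy_axpy( toy_conj_if( conjxt_use, x[i] ), y[i], &acc );
        toy_axpy( alpha, toy_conj_if( conjx, x[i] ), &z[i] );
    }
    *rho = toy_conj_if( conjy, acc );
}
```

The two forms agree because conj(a) * conj(b) equals conj(a * b), so conjugating y inside the sum is equivalent to toggling the conjugation on x and conjugating the total. The axpyv half depends only on conjx, which is why the template keeps conjx and conjxt_use as separate flags.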