Implemented mixed-datatype support for gemm.

Details:
- Implemented support for gemm where A, B, and C may have different
  storage datatypes, as well as a computational precision (and implied
  computation domain) that may be different from the storage precision
  of either A or B. This results in 128 different combinations, all
  which are implemented within this commit. (For now, the mixed-datatype
  functionality is only supported via the object API.) If desired, the
  mixed-datatype support may be disabled at configure-time.
- Added a memory-intensive optimization to certain mixed-datatype cases
  that requires a single m-by-n matrix be allocated (temporarily) per
  call to gemm. This optimization aims to avoid the overhead involved in
  repeatedly updating C with general stride, or updating C after a
  typecast from the computation precision. This memory optimization may
  be disabled at configure-time (provided that the mixed-datatype
  support is enabled in the first place).
- Added support for testing mixed-datatype combinations to testsuite.
  The user may test gemm with mixed domains, precisions, both, or
  neither.
- Added a standalone test driver directory for building and running
  mixed-datatype performance experiments.
- Defined a new variation of castm, castnzm, which operates like castm
  except that imaginary values are not touched when casting a real
  operand to a complex operand. (By contrast, in these situations castm
  sets the imaginary components of the destination matrix to zero.)
- Defined bli_obj_imag_is_zero() and substituted calls in lieu of all
  usages of bli_obj_imag_equals() that tested against BLIS_ZERO, and
  also simplified the implementation of bli_obj_imag_equals().
- Fixed bad behavior from bli_obj_is_real() and bli_obj_is_complex()
  when given BLIS_CONSTANT objects.
- Disabled dt_on_output field in auxinfo_t structure as well as all
  accessor functions. Also commented out all usage of accessor
  functions within macrokernels. (Typecasting in the microkernel is
  still feasible, though probably unrealistic for now given the
  additional complexity required.)
- Use void function pointer type (instead of void*) for storing function
  pointers in bli_l0_fpa.c.
- Added documentation for using gemm with mixed datatypes in
  docs/MixedDatatypes.md and example code in examples/oapi/11gemm_md.c.
- Defined level-1d operation xpbyd and level-1m operation xpbym.
- Added xpbym test module to testsuite.
- Updated frame/include/bli_x86_asm_macros.h with additional macros
  (courtsey of Devin Matthews).
This commit is contained in:
Field G. Van Zee
2018-10-15 16:37:39 -05:00
parent 667d3929ee
commit 5fec95b99f
155 changed files with 10302 additions and 943 deletions

View File

@@ -169,10 +169,14 @@ performance remain attainable.
* **A foundation for mixed domain and/or mixed precision operations.** BLIS
was designed with the hope of one day allowing computation on real and complex
operands within the same operation. Similarly, we wanted to allow mixing
operands' floating-point precisions, or both domain and precision.
While this feature is not yet implemented, we plan to prototype and explore
the potential for adding mixed domain, mixed precision support to operations
such as `gemm`.
operands' numerical domains, floating-point precisions, or both domain and
precision, and to optionally compute in a precision different than one or both
operands' storage precisions. This feature has been implemented for the general
matrix multiplication (`gemm`) operation, providing 128 different possible type
combinations, which, when combined with existing transposition, conjugation,
and storage parameters, enables 55,296 different `gemm` use cases. For more
details, please see the documentation on [mixed datatype](docs/MixedDatatypes.md)
support.
Getting Started
---------------
@@ -230,6 +234,9 @@ included in the BLIS source distribution.
table of supported microarchitectures.
* **[Multithreading](docs/Multithreading.md).** This document describes how to
use the multithreading features of BLIS.
* **[Mixed-Datatype](docs/MixedDatatype.md).** This document provides an
overview of BLIS's mixed-datatype functionality and provides a brief example
of how to take advantage of this new code.
* **[Release Notes](docs/ReleaseNotes.md).** This document tracks a summary of
changes included with each new version of BLIS, along with contributor credits
for key features.

View File

@@ -92,6 +92,26 @@
#endif
#endif
#ifndef BLIS_ENABLE_MIXED_DT
#ifndef BLIS_DISABLE_MIXED_DT
#if @enable_mixed_dt@
#define BLIS_ENABLE_MIXED_DT
#else
#define BLIS_DISABLE_MIXED_DT
#endif
#endif
#endif
#ifndef BLIS_ENABLE_MIXED_DT_EXTRA_MEM
#ifndef BLIS_DISABLE_MIXED_DT_EXTRA_MEM
#if @enable_mixed_dt_extra_mem@
#define BLIS_ENABLE_MIXED_DT_EXTRA_MEM
#else
#define BLIS_DISABLE_MIXED_DT_EXTRA_MEM
#endif
#endif
#endif
#if @enable_memkind@
#define BLIS_ENABLE_MEMKIND
#else

65
configure vendored
View File

@@ -191,6 +191,24 @@ print_usage()
echo " compatibility layer. This automatically enables the"
echo " BLAS compatibility layer as well."
echo " "
echo " --disable-mixed-dt, --enable-mixed-dt"
echo " "
echo " Disable (enabled by default) support for mixing the"
echo " storage domain and/or storage precision of matrix"
echo " operands for the gemm operation, as well as support"
echo " for computing in a precision different from one or."
echo " both of matrices A and B."
echo " "
echo " --disable-mixed-dt-extra-mem, --enable-mixed-dt-extra-mem"
echo " "
echo " Disable (enabled by default) support for additional"
echo " mixed datatype optimizations that require temporarily"
echo " allocating extra memory--specifically, a single m x n"
echo " matrix (per application thread) whose storage datatype"
echo " is equal to the computation datatype. This option may"
echo " only be enabled when mixed domain/precision support is"
echo " enabled."
echo " "
echo " -s NAME --enable-sandbox=NAME"
echo " "
echo " Enable a separate sandbox implementation of gemm. This"
@@ -1605,6 +1623,8 @@ main()
blas_int_type_size=32
enable_blas='yes'
enable_cblas='no'
enable_mixed_dt='yes'
enable_mixed_dt_extra_mem='yes'
enable_memkind='' # The default memkind value is determined later on.
force_version='no'
@@ -1739,6 +1759,18 @@ main()
disable-cblas)
enable_cblas='no'
;;
enable-mixed-dt)
enable_mixed_dt='yes'
;;
disable-mixed-dt)
enable_mixed_dt='no'
;;
enable-mixed-dt-extra-mem)
enable_mixed_dt_extra_mem='yes'
;;
disable-mixed-dt-extra-mem)
enable_mixed_dt_extra_mem='no'
;;
with-memkind)
enable_memkind='yes'
;;
@@ -2414,8 +2446,35 @@ main()
echo "${script_name}: the CBLAS compatibility layer is disabled."
enable_cblas_01=0
fi
# Report integer sizes
if [ "x${enable_mixed_dt}" = "xyes" ]; then
echo "${script_name}: mixed datatype support is enabled."
if [ "x${enable_mixed_dt_extra_mem}" = "xyes" ]; then
echo "${script_name}: mixed datatype optimizations requiring extra memory are enabled."
enable_mixed_dt_extra_mem_01=1
else
echo "${script_name}: mixed datatype optimizations requiring extra memory are disabled."
enable_mixed_dt_extra_mem_01=0
fi
enable_mixed_dt_01=1
else
echo "${script_name}: mixed datatype support is disabled."
if [ "x${enable_mixed_dt_extra_mem}" = "xyes" ]; then
echo "${script_name}: *** Mixed datatype optimizations requiring extra memory are only"
echo "${script_name}: *** available when mixed datatype support is also enabled."
echo "${script_name}: *** Please enable mixed datatype support, or disable mixed datatype"
echo "${script_name}: *** optimizations requiring extra memory, and re-run configure."
exit 1
else
enable_mixed_dt_extra_mem_01=0
fi
enable_mixed_dt_01=0
fi
# Report integer sizes.
if [ "x${int_type_size}" = "x32" ]; then
echo "${script_name}: the internal integer size is 32-bit."
elif [ "x${int_type_size}" = "x64" ]; then
@@ -2595,6 +2654,8 @@ main()
| sed -e "s/@blas_int_type_size@/${blas_int_type_size}/g" \
| sed -e "s/@enable_blas@/${enable_blas_01}/g" \
| sed -e "s/@enable_cblas@/${enable_cblas_01}/g" \
| sed -e "s/@enable_mixed_dt@/${enable_mixed_dt_01}/g" \
| sed -e "s/@enable_mixed_dt_extra_mem@/${enable_mixed_dt_extra_mem_01}/g" \
| sed -e "s/@enable_memkind@/${enable_memkind_01}/g" \
| sed -e "s/@enable_sandbox@/${enable_sandbox_01}/g" \
| sed -e "s/@enable_shared@/${enable_shared_01}/g" \

217
docs/MixedDatatypes.md Normal file
View File

@@ -0,0 +1,217 @@
## Contents
* **[Contents](MixedDatatypes.md#contents)**
* **[Introduction](MixedDatatypes.md#introduction)**
* **[Categories of mixed datatypes](MixedDatatypes.md#categories-of-mixed-datatypes)**
* **[Computation precision](MixedDatatypes.md#computation-precision)**
* **[Computation domain](MixedDatatypes.md#computation-domain)**
* **[Performing gemm with mixed datatypes](MixedDatatypes.md#performing-gemm-with-mixed-datatypes)**
* **[Known Issues](MixedDatatypes.md#known-issues)**
* **[Conclusion](MixedDatatypes.md#conclusion)**
## Introduction
This document serves as a guide to users interested in taking advantage of
BLIS's support for performing the `gemm` operation on operands of differing
types.
## Categories of mixed datatypes
Before going any further, we find it useful to categorize mixed datatype
support into four categories:
1. **Fully identical datatypes.** This is what people generally think of when
they think about the `gemm` operation: all operands are stored in the same
datatype (precision and domain), and the matrix product computation is
performed in the arithmetic represented by that datatype. (This category
doesn't actually involve mixing datatypes, but it's still worthwhile to
define.)
Example: matrix C updated by the product of matrix A and matrix B
(all matrices double-precision real).
2. **Mixed domain with identical precisions.** This category includes all
combinations of datatypes where the domain (real or complex) of each
operand may vary while the precisions (single or double precision) are
held constant across all operands.
Example: complex matrix C updated by the product of real matrix A and
complex matrix B (all matrices single-precision).
3. **Mixed precision within a single domain.** Here, all operands are stored
in the same domain (real or complex), however, the precision of each operand
may vary.
Example: double-precision real matrix C updated by the product of
single-precision real matrix A and single-precision real matrix B.
4. **Mixed precision and mixed domain.** This category allows both domains and
precision of each matrix operand to vary.
Example: double-precision complex matrix C updated by the product of
single-precision complex matrix A and single-precision real matrix B.
BLIS's implementation of mixed-datatype `gemm` supports all combinations
within all four categories.
### Computation precision
Because categories 3 and 4 involve mixing precisions, they come with an added
parameter: the *computation precision*. This parameter specifies the precision
in which the matrix multiplication (product) takes place. This precision
can be different than the storage precision of matrices A or B, and/or the
storage precision of matrix C.
When the computation precision differs from the storage precision of matrix A,
it implies that a typecast must occur when BLIS packs matrix A to contiguous
storage. Similarly, B may also need to be typecast during packing.
When the computation precision differs from the storage precision of C, it
means the result of the matrix product A*B must be typecast just before it
is accumulated back into matrix C.
### Computation domain
In addition to the computation precision, we also track a computation domain.
(Together, they form the computation datatype.) However, for now we do not
allow the user to explicitly specify the computation domain. Instead, the
computation domain is implied by the domains of A, B, and C. The following
table enumerates the six cases where there is at least one operand of each
domain, along with the corresponding same-domain cases from category 1 for
reference. We also list the total number of floating-point operations
performed in each case.
In the table, an 'R' denotes a real domain matrix operand while a 'C' denotes
a matrix in the complex domain. The R's and C's appear in the following
format of C += A * B, where A, B, and C are the matrix operands of `gemm`.
| Case # | Mixed domain case | Implied computation domain | flops performed |
|--------|:-----------------:|:--------------------------:|:---------------:|
| 1 | R += R * R | real | 2mnk |
| 2 | R += R * C | real | 2mnk |
| 3 | R += C * R | real | 2mnk |
| 4 | R += C * C | complex | 4mnk |
| 5 | C += R * R | real | 2mnk |
| 6 | C += R * C | complex | 4mnk |
| 7 | C += C * R | complex | 4mnk |
| 8 | C += C * C | complex | 8mnk |
The computation domain is implied in cases 1 and 8 in the same way that
it would be if mixed datatype support were absent entirely. These
cases execute 2mnk and 8mnk flops, respectively, as any traditional
implementation would.
In cases 2 and 3, we assume the computation domain is real because only
B or A, respectively, is complex. Thus, in these cases, the imaginary
components of the complex matrix are ignored, allowing us to perform
only 2mnk flops.
In case 5, we take the computation domain to be real because A and B are
both real, and thus it makes no sense to compute in the complex domain.
This means that we need only update the real components of C, leaving
the imaginary components untouched. This also results in 2mnk flops
being performed.
In case 4, we have complex A and B, allowing us to compute a complex
product. However, we can only save the real part of that complex product
since the output matrix C is real. Since we cannot update the imaginary
component of C (since it is not stored), we avoid computing that half of
the update entirely, reducing the flops performed to 4mnk. (Alternatively,
one may wish to request real domain computation, in which case the
imaginary components of A and B were ignored *prior* to computing the
matrix product. This approach would result in only 2mnk flops being
performed.)
In case 6, we wish for both the real and imaginary parts of B to participate
in the multiplication by A, with the result updating the corresponding real
and imaginary parts of C. Granted, the imaginary part of A is zero, and this
is taken advantage of in the computation to optimize performance, as indicated
by the 4mnk flop count. But fundamentally this computation executes in the
complex domain because both the real and imaginary parts of C are updated.
A similar story can be told about case 7.
## Performing gemm with mixed datatypes
In BLIS, performing a mixed-datatype `gemm` operation is easy. However,
it will require that the user call `gemm` through BLIS's object API.
For a basic series of examples for using the object-based API, please
see the example codes in the `examples/oapi` directory of the BLIS source
distribution.
The first step is to ensure that BLIS is configured with mixed datatype support.
Please consult with your current distribution's `configure` script for the
current semantics:
```
$ ./configure --help
```
As of this writing, mixed datatype support is enabled by default, and thus
no additional options are needed.
With mixed datatype support enabled in BLIS, using the functionality is
simply a matter of creating and initializing matrices of different precisions
and/or domains.
```c
dim_t m = 5, n = 4, k = 2;
obj_t a, b, c;
obj_t* alpha;
obj_t* beta;
bli_obj_create( BLIS_DOUBLE, m, k, 0, 0, &a );
bli_obj_create( BLIS_FLOAT, k, n, 0, 0, &b );
bli_obj_create( BLIS_SCOMPLEX, m, n, 0, 0, &c );
alpha = &BLIS_ONE;
beta = &BLIS_ONE;
bli_randm( &a );
bli_randm( &b );
bli_randm( &c );
```
Then, you specify the computation precision by setting the computation
precision property of matrix C.
```c
bli_obj_set_comp_prec( BLIS_DOUBLE_PREC, &c );
```
If you do not explicitly specify the computation precision, it will default
to the *storage* precision of C.
With the objects created and the computation precision specified, call
`bli_gemm()` just as you would if the datatypes were identical:
```c
bli_gemm( alpha, &a, &b, beta, &c );
```
For more examples of using BLIS's object-based API, including methods
of initializing an matrix object with arbitrary values, please review the
example code found in the `examples/oapi` directory of the BLIS source
distribution.
## Known Issues
While BLIS implements 128 mixed-datatype combinations of `gemm`, there may be
odd behavior in the current implementation that does not conform to the reader's
expectations. Below is a list of issues that BLIS developers are aware of in
the context of mixed-datatype `gemm`. If any of these issues poses a problem for
your application, please contact us by
[opening an issue](https://github.com/flame/blis/issues).
* **alpha with non-zero imaginary components.** Currently, there are many cases
of mixed-datatype `gemm` that do not yet support computing with `alpha` scalars
that have non-zero imaginary components--in other words, values of `alpha` that
are not in the real domain. (By contrast, non-real values for `beta` are fully
supported.) In order to support these use cases, additional code complexity and
logic would be required. Thus, we have chosen, for now, to not implement them.
If mixed-datatype `gemm` is invoked with a non-real valued `alpha` scalar, a
runtime error message will be printed and the linked program will abort.
* **Manually specifying the computation domain.** As mentioned in the section
discussing the [computation domain](MixedDatatype.md#computation-domain),
the computation domain of any case of mixed domain `gemm` is implied by the
operands and thus fixed; the user may not specify a different computation
domain, even if the mixed-domain case would reasonably allow for computing
in either domain.
## Conclusion
For more information and documentation on BLIS, please visit the [BLIS github page](https://github.com/flame/blis/).
If you found a bug or wish to request a feature, please [open an issue](https://github.com/flame/blis/issues).
For general discussion or questions, please join and post a message to the [blis-devel mailing list](http://groups.google.com/group/blis-devel).
Thanks for your interest in BLIS!

269
examples/oapi/11gemm_md.c Normal file
View File

@@ -0,0 +1,269 @@
/*
BLIS
An object-based framework for developing high-performance BLAS-like
libraries.
Copyright (C) 2014, The University of Texas
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
- Neither the name of The University of Texas nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include <stdio.h>
#include "blis.h"
int main( int argc, char** argv )
{
num_t dt_r, dt_c;
num_t dt_s, dt_d;
num_t dt_a, dt_b;
dim_t m, n, k;
inc_t rs, cs;
obj_t a, b, c;
obj_t* alpha;
obj_t* beta;
//
// This file demonstrates mixing datatypes in gemm.
//
// NOTE: Please make sure that mixed datatype support is enabled in BLIS
// before proceeding to build and run the example binaries. If you're not
// sure whether mixed datatype support is enabled in BLIS, please refer
// to './configure --help' for the relevant options.
//
//
// Example 1: Perform a general matrix-matrix multiply (gemm) operation
// with operands of different domains (but identical precisions).
//
printf( "\n#\n# -- Example 1 --\n#\n\n" );
// Create some matrix operands to work with.
dt_r = BLIS_DOUBLE;
dt_c = BLIS_DCOMPLEX;
m = 4; n = 5; k = 1; rs = 0; cs = 0;
bli_obj_create( dt_c, m, n, rs, cs, &c );
bli_obj_create( dt_r, m, k, rs, cs, &a );
bli_obj_create( dt_c, k, n, rs, cs, &b );
// Set the scalars to use.
alpha = &BLIS_ONE;
beta = &BLIS_ONE;
// Initialize the matrix operands.
bli_randm( &a );
bli_randm( &b );
bli_setm( &BLIS_ZERO, &c );
bli_printm( "a (double real): randomized", &a, "%4.1f", "" );
bli_printm( "b (double complex): randomized", &b, "%4.1f", "" );
bli_printm( "c (double complex): initial value", &c, "%4.1f", "" );
// c := beta * c + alpha * a * b, where 'a' is real, and 'b' and 'c' are
// complex.
bli_gemm( alpha, &a, &b, beta, &c );
bli_printm( "c (double complex): after gemm", &c, "%4.1f", "" );
// Free the objects.
bli_obj_free( &a );
bli_obj_free( &b );
bli_obj_free( &c );
//
// Example 2: Perform a general matrix-matrix multiply (gemm) operation
// with operands of different precisions (but identical domains).
//
printf( "\n#\n# -- Example 2 --\n#\n\n" );
// Create some matrix operands to work with.
dt_s = BLIS_FLOAT;
dt_d = BLIS_DOUBLE;
m = 4; n = 5; k = 1; rs = 0; cs = 0;
bli_obj_create( dt_d, m, n, rs, cs, &c );
bli_obj_create( dt_s, m, k, rs, cs, &a );
bli_obj_create( dt_s, k, n, rs, cs, &b );
// Notice that we've chosen C to be double-precision real and A and B to be
// single-precision real.
// Since we are mixing precisions, we will also need to specify the
// so-called "computation precision." That is, we need to signal to
// bli_gemm() whether we want the A*B product to be computed in single
// precision or double precision (prior to the result being accumulated
// back to C). To specify the computation precision, we need to set the
// corresponding bit in the C object. Here, we specify double-precision
// computation.
// NOTE: If you do not explicitly specify the computation precision, it
// will default to the storage precision of the C object.
bli_obj_set_comp_prec( BLIS_DOUBLE_PREC, &c );
// Initialize the matrix operands.
bli_randm( &a );
bli_randm( &b );
bli_setm( &BLIS_ZERO, &c );
bli_printm( "a (single real): randomized", &a, "%4.1f", "" );
bli_printm( "b (single real): randomized", &b, "%4.1f", "" );
bli_printm( "c (double real): initial value", &c, "%4.1f", "" );
// c := beta * c + alpha * a * b, where 'a' and 'b' are single-precision
// real, 'c' is double-precision real, and the matrix product is performed
// in double-precision arithmetic.
bli_gemm( alpha, &a, &b, beta, &c );
bli_printm( "c (double real): after gemm (exec prec = double precision)", &c, "%4.1f", "" );
// Free the objects.
bli_obj_free( &a );
bli_obj_free( &b );
bli_obj_free( &c );
//
// Example 3: Perform a general matrix-matrix multiply (gemm) operation
// with operands of different domains AND precisions.
//
printf( "\n#\n# -- Example 3 --\n#\n\n" );
// Create some matrix operands to work with.
dt_a = BLIS_FLOAT;
dt_b = BLIS_DCOMPLEX;
dt_c = BLIS_SCOMPLEX;
m = 4; n = 5; k = 1; rs = 0; cs = 0;
bli_obj_create( dt_c, m, n, rs, cs, &c );
bli_obj_create( dt_a, m, k, rs, cs, &a );
bli_obj_create( dt_b, k, n, rs, cs, &b );
// Notice that we've chosen C to be single-precision complex, and A to be
// single-precision real, and B to be double-precision complex.
// Set the computation precision to single precision this time.
bli_obj_set_comp_prec( BLIS_SINGLE_PREC, &c );
// Initialize the matrix operands.
bli_randm( &a );
bli_randm( &b );
bli_setm( &BLIS_ZERO, &c );
bli_printm( "a (single real): randomized", &a, "%4.1f", "" );
bli_printm( "b (double complex): randomized", &b, "%4.1f", "" );
bli_printm( "c (single complex): initial value", &c, "%4.1f", "" );
// c := beta * c + alpha * a * b, where 'a' is single-precision real, 'b'
// is double-precision complex, 'c' is single-precision complex, and the
// matrix product is performed in single-precision arithmetic.
bli_gemm( alpha, &a, &b, beta, &c );
bli_printm( "c (single complex): after gemm (exec prec = single precision)", &c, "%4.1f", "" );
// Free the objects.
bli_obj_free( &a );
bli_obj_free( &b );
bli_obj_free( &c );
//
// Example 4: Project objects between the real and complex domains.
//
printf( "\n#\n# -- Example 4 --\n#\n\n" );
// Create some matrix operands to work with.
dt_r = BLIS_DOUBLE;
dt_c = BLIS_DCOMPLEX;
m = 4; n = 5; rs = 0; cs = 0;
bli_obj_create( dt_r, m, n, rs, cs, &a );
bli_obj_create( dt_c, m, n, rs, cs, &b );
// Initialize a real matrix A.
bli_randm( &a );
bli_printm( "a (double real): randomized", &a, "%4.1f", "" );
// Project real matrix A to the complex domain (in B).
bli_projm( &a, &b );
bli_printm( "b (double complex): projected from 'a'", &b, "%4.1f", "" );
// Notice how the imaginary components in B are zero since any real
// matrix implicitly has imaginary values that are equal to zero.
// Now let's project in the other direction.
// Initialize the complex matrix B.
bli_randm( &b );
bli_printm( "b (double complex): randomized", &b, "%4.1f", "" );
// Project complex matrix B to the real domain (in A).
bli_projm( &b, &a );
bli_printm( "a (double real): projected from 'b'", &a, "%4.1f", "" );
// Notice how the imaginary components are lost in the projection from
// the complex domain to the real domain.
// Free the objects.
bli_obj_free( &a );
bli_obj_free( &b );
//
// Example 5: Typecast objects between the single and double precisions.
//
printf( "\n#\n# -- Example 5 --\n#\n\n" );
// Create some matrix operands to work with.
dt_s = BLIS_FLOAT;
dt_d = BLIS_DOUBLE;
m = 4; n = 3; rs = 0; cs = 0;
bli_obj_create( dt_d, m, n, rs, cs, &a );
bli_obj_create( dt_s, m, n, rs, cs, &b );
// Initialize a double-precision real matrix A.
bli_randm( &a );
bli_printm( "a (double real): randomized", &a, "%23.16e", "" );
// Typecast A to single precision.
bli_castm( &a, &b );
bli_printm( "b (single real): typecast from 'a'", &b, "%23.16e", "" );
// Notice how the values in B are only accurate to the 6th or 7th decimal
// place relative to the true values in A.
// Free the objects.
bli_obj_free( &a );
bli_obj_free( &b );
return 0;
}

View File

@@ -127,7 +127,8 @@ TEST_BINS := 00obj_basic.x \
07level1m_diag.x \
08level2.x \
09level3.x \
10util.x
10util.x \
11gemm_md.x

View File

@@ -41,7 +41,7 @@
#undef GENFRONT
#define GENFRONT( opname ) \
\
GENARRAY_FPA( void*, opname ); \
GENARRAY_FPA( PASTECH(opname,_vft), opname ); \
\
PASTECH(opname,_vft) PASTEMAC(opname,_qfp)( num_t dt ) \
{ \
@@ -63,7 +63,7 @@ GENFRONT( zipsc )
#undef GENFRONT
#define GENFRONT( opname ) \
\
GENARRAY_FPA_I( void*, opname ); \
GENARRAY_FPA_I( PASTECH(opname,_vft), opname ); \
\
PASTECH(opname,_vft) PASTEMAC(opname,_qfp)( num_t dt ) \
{ \

View File

@@ -103,6 +103,22 @@ GENFRONT( setd )
GENFRONT( setid )
#undef GENFRONT
#define GENFRONT( opname ) \
\
void PASTEMAC(opname,_check) \
( \
obj_t* x, \
obj_t* beta, \
obj_t* y \
) \
{ \
bli_l1d_axy_check( beta, x, y ); \
}
GENFRONT( xpbyd )
// -----------------------------------------------------------------------------
void bli_l1d_xy_check

View File

@@ -90,6 +90,19 @@ GENTPROT( setd )
GENTPROT( setid )
#undef GENTPROT
#define GENTPROT( opname ) \
\
void PASTEMAC(opname,_check) \
( \
obj_t* x, \
obj_t* beta, \
obj_t* y \
);
GENTPROT( xpbyd )
// -----------------------------------------------------------------------------
void bli_l1d_xy_check

View File

@@ -59,4 +59,5 @@ GENFRONT( invertd )
GENFRONT( scald )
GENFRONT( setd )
GENFRONT( setid )
GENFRONT( xpbyd )

View File

@@ -51,3 +51,5 @@ GENPROT( invertd )
GENPROT( scald )
GENPROT( setd )
GENPROT( setid )
GENPROT( xpbyd )

View File

@@ -131,3 +131,23 @@ typedef void (*PASTECH3(ch,opname,EX_SUF,tsuf)) \
INSERT_GENTDEFR( setid )
// xpbyd
#undef GENTDEF
#define GENTDEF( ctype, ch, opname, tsuf ) \
\
typedef void (*PASTECH3(ch,opname,EX_SUF,tsuf)) \
( \
doff_t diagoffx, \
diag_t diagx, \
trans_t transx, \
dim_t m, \
dim_t n, \
ctype* x, inc_t rs_x, inc_t cs_x, \
ctype* beta, \
ctype* y, inc_t rs_y, inc_t cs_y \
BLIS_TAPI_EX_PARAMS \
);
INSERT_GENTDEF( xpbyd )

View File

@@ -312,5 +312,70 @@ void PASTEMAC(opname,EX_SUF) \
GENFRONT( setid )
#undef GENFRONT
#define GENFRONT( opname ) \
\
void PASTEMAC(opname,EX_SUF) \
( \
obj_t* x, \
obj_t* beta, \
obj_t* y \
BLIS_OAPI_EX_PARAMS \
) \
{ \
bli_init_once(); \
\
BLIS_OAPI_EX_DECLS \
\
num_t dt = bli_obj_dt( x ); \
\
doff_t diagoffx = bli_obj_diag_offset( x ); \
diag_t diagx = bli_obj_diag( x ); \
trans_t transx = bli_obj_conjtrans_status( x ); \
dim_t m = bli_obj_length( y ); \
dim_t n = bli_obj_width( y ); \
void* buf_x = bli_obj_buffer_at_off( x ); \
inc_t rs_x = bli_obj_row_stride( x ); \
inc_t cs_x = bli_obj_col_stride( x ); \
void* buf_y = bli_obj_buffer_at_off( y ); \
inc_t rs_y = bli_obj_row_stride( y ); \
inc_t cs_y = bli_obj_col_stride( y ); \
\
void* buf_beta; \
\
obj_t beta_local; \
\
if ( bli_error_checking_is_enabled() ) \
PASTEMAC(opname,_check)( x, beta, y ); \
\
/* Create local copy-casts of scalars (and apply internal conjugation
as needed). */ \
bli_obj_scalar_init_detached_copy_of( dt, BLIS_NO_CONJUGATE, \
beta, &beta_local ); \
buf_beta = bli_obj_buffer_for_1x1( dt, &beta_local ); \
\
/* Query a type-specific function pointer, except one that uses
void* instead of typed pointers. */ \
PASTECH2(opname,BLIS_TAPI_EX_SUF,_vft) f = \
PASTEMAC2(opname,BLIS_TAPI_EX_SUF,_qfp)( dt ); \
\
f \
( \
diagoffx, \
diagx, \
transx, \
m, \
n, \
buf_x, rs_x, cs_x, \
buf_beta, \
buf_y, rs_y, cs_y, \
cntx, \
rntm \
); \
}
GENFRONT( xpbyd )
#endif

View File

@@ -93,3 +93,17 @@ GENTPROT( scald )
GENTPROT( setd )
GENTPROT( setid )
#undef GENTPROT
#define GENTPROT( opname ) \
\
void PASTEMAC(opname,EX_SUF) \
( \
obj_t* x, \
obj_t* beta, \
obj_t* y \
BLIS_OAPI_EX_PARAMS \
);
GENTPROT( xpbyd )

View File

@@ -387,5 +387,83 @@ void PASTEMAC2(ch,opname,EX_SUF) \
INSERT_GENTFUNCR_BASIC2( setid, setv, BLIS_SETV_KER )
#undef GENTFUNC
#define GENTFUNC( ctype, ch, opname, kername, kerid ) \
\
void PASTEMAC2(ch,opname,EX_SUF) \
( \
doff_t diagoffx, \
diag_t diagx, \
trans_t transx, \
dim_t m, \
dim_t n, \
ctype* x, inc_t rs_x, inc_t cs_x, \
ctype* beta, \
ctype* y, inc_t rs_y, inc_t cs_y \
BLIS_TAPI_EX_PARAMS \
) \
{ \
bli_init_once(); \
\
BLIS_TAPI_EX_DECLS \
\
const num_t dt = PASTEMAC(ch,type); \
\
ctype* x1; \
ctype* y1; \
conj_t conjx; \
dim_t n_elem; \
dim_t offx, offy; \
inc_t incx, incy; \
\
if ( bli_zero_dim2( m, n ) ) return; \
\
if ( bli_is_outside_diag( diagoffx, transx, m, n ) ) return; \
\
/* Determine the distance to the diagonals, the number of diagonal
elements, and the diagonal increments. */ \
bli_set_dims_incs_2d \
( \
diagoffx, transx, \
m, n, rs_x, cs_x, rs_y, cs_y, \
&offx, &offy, &n_elem, &incx, &incy \
); \
\
conjx = bli_extract_conj( transx ); \
\
if ( bli_is_nonunit_diag( diagx ) ) \
{ \
x1 = x + offx; \
y1 = y + offy; \
} \
else /* if ( bli_is_unit_diag( diagx ) ) */ \
{ \
/* Simulate a unit diagonal for x with a zero increment over a unit
scalar. */ \
x1 = PASTEMAC(ch,1); \
incx = 0; \
y1 = y + offy; \
} \
\
/* Obtain a valid context from the gks if necessary. */ \
if ( cntx == NULL ) cntx = bli_gks_query_cntx(); \
\
/* Query the context for the operation's kernel address. */ \
PASTECH2(ch,kername,_ker_ft) f = bli_cntx_get_l1v_ker_dt( dt, kerid, cntx ); \
\
/* Invoke the kernel with the appropriate parameters. */ \
f( \
conjx, \
n_elem, \
x1, incx, \
beta, \
y1, incy, \
cntx \
); \
}
INSERT_GENTFUNC_BASIC2( xpbyd, xpbyv, BLIS_XPBYV_KER )
#endif

View File

@@ -125,3 +125,22 @@ void PASTEMAC2(ch,opname,EX_SUF) \
INSERT_GENTPROTR_BASIC0( setid )
#undef GENTPROT
#define GENTPROT( ctype, ch, opname ) \
\
void PASTEMAC2(ch,opname,EX_SUF) \
( \
doff_t diagoffx, \
diag_t diagx, \
trans_t transx, \
dim_t m, \
dim_t n, \
ctype* x, inc_t rs_x, inc_t cs_x, \
ctype* beta, \
ctype* y, inc_t rs_y, inc_t cs_y \
BLIS_TAPI_EX_PARAMS \
);
INSERT_GENTPROT_BASIC0( xpbyd )

View File

@@ -88,6 +88,22 @@ GENFRONT( scalm )
GENFRONT( setm )
#undef GENFRONT
#define GENFRONT( opname ) \
\
void PASTEMAC(opname,_check) \
( \
obj_t* x, \
obj_t* beta, \
obj_t* y \
) \
{ \
bli_l1m_axy_check( beta, x, y ); \
}
GENFRONT( xpbym )
// -----------------------------------------------------------------------------
void bli_l1m_xy_check

View File

@@ -78,6 +78,19 @@ GENPROT( scalm )
GENPROT( setm )
#undef GENPROT
#define GENPROT( opname ) \
\
void PASTEMAC(opname,_check) \
( \
obj_t* x, \
obj_t* beta, \
obj_t* y \
);
GENPROT( xpbym )
// -----------------------------------------------------------------------------
void bli_l1m_xy_check

View File

@@ -57,4 +57,23 @@ GENFRONT( axpym )
GENFRONT( scal2m )
GENFRONT( scalm )
GENFRONT( setm )
GENFRONT( xpbym )
//
// Define function pointer query interfaces for two-datatype operations.
//
#undef GENFRONT
#define GENFRONT( opname ) \
\
GENARRAY_FPA2( PASTECH2(opname,BLIS_TAPI_EX_SUF,_vft), \
PASTECH(opname,BLIS_TAPI_EX_SUF) ); \
\
PASTECH2(opname,BLIS_TAPI_EX_SUF,_vft) \
PASTEMAC2(opname,BLIS_TAPI_EX_SUF,_qfp2)( num_t dtx, num_t dty ) \
{ \
return PASTECH2(opname,BLIS_TAPI_EX_SUF,_fpa2)[ dtx ][ dty ]; \
}
GENFRONT( xpbym_md )

View File

@@ -49,4 +49,13 @@ GENPROT( axpym )
GENPROT( scal2m )
GENPROT( scalm )
GENPROT( setm )
GENPROT( xpbym )
#undef GENPROT
#define GENPROT( opname ) \
\
PASTECH2(opname,BLIS_TAPI_EX_SUF,_vft) \
PASTEMAC2(opname,BLIS_TAPI_EX_SUF,_qfp2)( num_t dtx, num_t dty );
GENPROT( xpbym_md )

View File

@@ -141,3 +141,25 @@ typedef void (*PASTECH3(ch,opname,EX_SUF,tsuf)) \
INSERT_GENTDEF( scalm )
INSERT_GENTDEF( setm )
// xpbym
#undef GENTDEF
#define GENTDEF( ctype, ch, opname, tsuf ) \
\
typedef void (*PASTECH3(ch,opname,EX_SUF,tsuf)) \
( \
doff_t diagoffx, \
diag_t diagx, \
uplo_t uplox, \
trans_t transx, \
dim_t m, \
dim_t n, \
ctype* x, inc_t rs_x, inc_t cs_x, \
ctype* beta, \
ctype* y, inc_t rs_y, inc_t cs_y \
BLIS_TAPI_EX_PARAMS \
);
INSERT_GENTDEF( xpbym )
INSERT_GENTDEF( xpbym_md )

View File

@@ -302,5 +302,141 @@ void PASTEMAC(opname,EX_SUF) \
GENFRONT( setm )
#undef GENFRONT
#define GENFRONT( opname ) \
\
void PASTEMAC(opname,EX_SUF) \
( \
obj_t* x, \
obj_t* beta, \
obj_t* y \
BLIS_OAPI_EX_PARAMS \
) \
{ \
bli_init_once(); \
\
BLIS_OAPI_EX_DECLS \
\
if ( bli_obj_dt( x ) != bli_obj_dt( y ) ) \
return bli_xpbym_md( x, beta, y ); \
\
num_t dt = bli_obj_dt( x ); \
\
doff_t diagoffx = bli_obj_diag_offset( x ); \
diag_t diagx = bli_obj_diag( x ); \
uplo_t uplox = bli_obj_uplo( x ); \
trans_t transx = bli_obj_conjtrans_status( x ); \
dim_t m = bli_obj_length( y ); \
dim_t n = bli_obj_width( y ); \
void* buf_x = bli_obj_buffer_at_off( x ); \
inc_t rs_x = bli_obj_row_stride( x ); \
inc_t cs_x = bli_obj_col_stride( x ); \
void* buf_y = bli_obj_buffer_at_off( y ); \
inc_t rs_y = bli_obj_row_stride( y ); \
inc_t cs_y = bli_obj_col_stride( y ); \
\
void* buf_beta; \
\
obj_t beta_local; \
\
if ( bli_error_checking_is_enabled() ) \
PASTEMAC(opname,_check)( x, beta, y ); \
\
/* Create local copy-casts of scalars (and apply internal conjugation
as needed). */ \
bli_obj_scalar_init_detached_copy_of( dt, BLIS_NO_CONJUGATE, \
beta, &beta_local ); \
buf_beta = bli_obj_buffer_for_1x1( dt, &beta_local ); \
\
/* Query a type-specific function pointer, except one that uses
void* instead of typed pointers. */ \
PASTECH2(opname,BLIS_TAPI_EX_SUF,_vft) f = \
PASTEMAC2(opname,BLIS_TAPI_EX_SUF,_qfp)( dt ); \
\
f \
( \
diagoffx, \
diagx, \
uplox, \
transx, \
m, \
n, \
buf_x, rs_x, cs_x, \
buf_beta, \
buf_y, rs_y, cs_y, \
cntx, \
rntm \
); \
}
GENFRONT( xpbym )
#undef GENFRONT
#define GENFRONT( opname ) \
\
void PASTEMAC(opname,EX_SUF) \
( \
obj_t* x, \
obj_t* beta, \
obj_t* y \
BLIS_OAPI_EX_PARAMS \
) \
{ \
bli_init_once(); \
\
BLIS_OAPI_EX_DECLS \
\
num_t dtx = bli_obj_dt( x ); \
num_t dty = bli_obj_dt( y ); \
\
doff_t diagoffx = bli_obj_diag_offset( x ); \
diag_t diagx = bli_obj_diag( x ); \
uplo_t uplox = bli_obj_uplo( x ); \
trans_t transx = bli_obj_conjtrans_status( x ); \
dim_t m = bli_obj_length( y ); \
dim_t n = bli_obj_width( y ); \
void* buf_x = bli_obj_buffer_at_off( x ); \
inc_t rs_x = bli_obj_row_stride( x ); \
inc_t cs_x = bli_obj_col_stride( x ); \
void* buf_y = bli_obj_buffer_at_off( y ); \
inc_t rs_y = bli_obj_row_stride( y ); \
inc_t cs_y = bli_obj_col_stride( y ); \
\
void* buf_beta; \
\
obj_t beta_local; \
\
/* Create local copy-casts of scalars (and apply internal conjugation
as needed). */ \
bli_obj_scalar_init_detached_copy_of( dty, BLIS_NO_CONJUGATE, \
beta, &beta_local ); \
buf_beta = bli_obj_buffer_for_1x1( dty, &beta_local ); \
\
/* Query a (multi) type-specific function pointer, except one that uses
void* instead of typed pointers. */ \
PASTECH2(opname,BLIS_TAPI_EX_SUF,_vft) f = \
PASTEMAC2(opname,BLIS_TAPI_EX_SUF,_qfp2)( dtx, dty ); \
\
f \
( \
diagoffx, \
diagx, \
uplox, \
transx, \
m, \
n, \
buf_x, rs_x, cs_x, \
buf_beta, \
buf_y, rs_y, cs_y, \
cntx, \
rntm \
); \
}
GENFRONT( xpbym_md )
#endif

View File

@@ -80,3 +80,18 @@ void PASTEMAC(opname,EX_SUF) \
GENPROT( scalm )
GENPROT( setm )
#undef GENPROT
#define GENPROT( opname ) \
\
void PASTEMAC(opname,EX_SUF) \
( \
obj_t* x, \
obj_t* beta, \
obj_t* y \
BLIS_OAPI_EX_PARAMS \
);
GENPROT( xpbym )
GENPROT( xpbym_md )

View File

@@ -382,5 +382,155 @@ INSERT_GENTFUNC_BASIC0( scalm )
INSERT_GENTFUNC_BASIC0( setm )
#undef GENTFUNC
#define GENTFUNC( ctype, ch, opname ) \
\
void PASTEMAC2(ch,opname,EX_SUF) \
( \
doff_t diagoffx, \
diag_t diagx, \
uplo_t uplox, \
trans_t transx, \
dim_t m, \
dim_t n, \
ctype* x, inc_t rs_x, inc_t cs_x, \
ctype* beta, \
ctype* y, inc_t rs_y, inc_t cs_y \
BLIS_TAPI_EX_PARAMS \
) \
{ \
bli_init_once(); \
\
BLIS_TAPI_EX_DECLS \
\
if ( bli_zero_dim2( m, n ) ) return; \
\
/* Obtain a valid context from the gks if necessary. */ \
if ( cntx == NULL ) cntx = bli_gks_query_cntx(); \
\
/* If beta is zero, then the operation reduces to copym. */ \
if ( PASTEMAC(ch,eq0)( *beta ) ) \
{ \
PASTEMAC2(ch,copym,_unb_var1) \
( \
diagoffx, \
diagx, \
uplox, \
transx, \
m, \
n, \
x, rs_x, cs_x, \
y, rs_y, cs_y, \
cntx, \
rntm \
); \
\
return; \
} \
\
/* Invoke the helper variant, which loops over the appropriate kernel
to implement the current operation. */ \
PASTEMAC2(ch,opname,_unb_var1) \
( \
diagoffx, \
diagx, \
uplox, \
transx, \
m, \
n, \
x, rs_x, cs_x, \
beta, \
y, rs_y, cs_y, \
cntx, \
rntm \
); \
\
/* When the diagonal of an upper- or lower-stored matrix is unit,
we handle it with a separate post-processing step. */ \
if ( bli_is_upper_or_lower( uplox ) && \
bli_is_unit_diag( diagx ) ) \
{ \
PASTEMAC2(ch,xpbyd,BLIS_TAPI_EX_SUF) \
( \
diagoffx, \
diagx, \
transx, \
m, \
n, \
x, rs_x, cs_x, \
beta, \
y, rs_y, cs_y, \
cntx, \
rntm \
); \
} \
}
INSERT_GENTFUNC_BASIC0( xpbym )
#undef GENTFUNC2
#define GENTFUNC2( ctype_x, ctype_y, chx, chy, opname ) \
\
void PASTEMAC3(chx,chy,opname,EX_SUF) \
( \
doff_t diagoffx, \
diag_t diagx, \
uplo_t uplox, \
trans_t transx, \
dim_t m, \
dim_t n, \
ctype_x* x, inc_t rs_x, inc_t cs_x, \
ctype_y* beta, \
ctype_y* y, inc_t rs_y, inc_t cs_y \
BLIS_TAPI_EX_PARAMS \
) \
{ \
bli_init_once(); \
\
BLIS_TAPI_EX_DECLS \
\
if ( bli_zero_dim2( m, n ) ) return; \
\
/* Obtain a valid context from the gks if necessary. */ \
if ( cntx == NULL ) cntx = bli_gks_query_cntx(); \
\
/* If beta is zero, then the operation reduces to copym. */ \
if ( PASTEMAC(chy,eq0)( *beta ) ) \
{ \
PASTEMAC2(chx,chy,castm) \
( \
transx, \
m, \
n, \
x, rs_x, cs_x, \
y, rs_y, cs_y \
); \
\
return; \
} \
\
/* Invoke the helper variant, which loops over the appropriate kernel
to implement the current operation. */ \
PASTEMAC3(chx,chy,opname,_unb_var1) \
( \
diagoffx, \
diagx, \
uplox, \
transx, \
m, \
n, \
x, rs_x, cs_x, \
beta, \
y, rs_y, cs_y, \
cntx, \
rntm \
); \
}
INSERT_GENTFUNC2_BASIC0( xpbym_md )
INSERT_GENTFUNC2_MIXDP0( xpbym_md )
#endif

View File

@@ -98,3 +98,44 @@ void PASTEMAC2(ch,opname,EX_SUF) \
INSERT_GENTPROT_BASIC0( scalm )
INSERT_GENTPROT_BASIC0( setm )
#undef GENTPROT
#define GENTPROT( ctype, ch, opname ) \
\
void PASTEMAC2(ch,opname,EX_SUF) \
( \
doff_t diagoffx, \
diag_t diagx, \
uplo_t uplox, \
trans_t transx, \
dim_t m, \
dim_t n, \
ctype* x, inc_t rs_x, inc_t cs_x, \
ctype* beta, \
ctype* y, inc_t rs_y, inc_t cs_y \
BLIS_TAPI_EX_PARAMS \
);
INSERT_GENTPROT_BASIC0( xpbym )
#undef GENTPROT2
#define GENTPROT2( ctype_x, ctype_y, chx, chy, opname ) \
\
void PASTEMAC3(chx,chy,opname,EX_SUF) \
( \
doff_t diagoffx, \
diag_t diagx, \
uplo_t uplox, \
trans_t transx, \
dim_t m, \
dim_t n, \
ctype_x* x, inc_t rs_x, inc_t cs_x, \
ctype_y* beta, \
ctype_y* y, inc_t rs_y, inc_t cs_y \
BLIS_TAPI_EX_PARAMS \
);
INSERT_GENTPROT2_BASIC0( xpbym_md )
INSERT_GENTPROT2_MIXDP0( xpbym_md )

View File

@@ -378,3 +378,252 @@ void PASTEMAC(ch,opname) \
INSERT_GENTFUNC_BASIC2( scalm_unb_var1, scalv, BLIS_SCALV_KER )
INSERT_GENTFUNC_BASIC2( setm_unb_var1, setv, BLIS_SETV_KER )
#undef GENTFUNC
#define GENTFUNC( ctype, ch, opname, kername, kerid ) \
\
void PASTEMAC(ch,opname) \
( \
doff_t diagoffx, \
diag_t diagx, \
uplo_t uplox, \
trans_t transx, \
dim_t m, \
dim_t n, \
ctype* x, inc_t rs_x, inc_t cs_x, \
ctype* beta, \
ctype* y, inc_t rs_y, inc_t cs_y, \
cntx_t* cntx, \
rntm_t* rntm \
) \
{ \
const num_t dt = PASTEMAC(ch,type); \
\
ctype* x1; \
ctype* y1; \
uplo_t uplox_eff; \
conj_t conjx; \
dim_t n_iter; \
dim_t n_elem, n_elem_max; \
inc_t ldx, incx; \
inc_t ldy, incy; \
dim_t j, i; \
dim_t ij0, n_shift; \
\
/* Set various loop parameters. */ \
bli_set_dims_incs_uplo_2m \
( \
diagoffx, diagx, transx, \
uplox, m, n, rs_x, cs_x, rs_y, cs_y, \
&uplox_eff, &n_elem_max, &n_iter, &incx, &ldx, &incy, &ldy, \
&ij0, &n_shift \
); \
\
if ( bli_is_zeros( uplox_eff ) ) return; \
\
/* Extract the conjugation component from the transx parameter. */ \
conjx = bli_extract_conj( transx ); \
\
/* Query the kernel needed for this operation. */ \
PASTECH2(ch,kername,_ker_ft) f = bli_cntx_get_l1v_ker_dt( dt, kerid, cntx ); \
\
/* Handle dense and upper/lower storage cases separately. */ \
if ( bli_is_dense( uplox_eff ) ) \
{ \
for ( j = 0; j < n_iter; ++j ) \
{ \
n_elem = n_elem_max; \
\
x1 = x + (j )*ldx + (0 )*incx; \
y1 = y + (j )*ldy + (0 )*incy; \
\
/* Invoke the kernel with the appropriate parameters. */ \
f( \
conjx, \
n_elem, \
x1, incx, \
beta, \
y1, incy, \
cntx \
); \
} \
} \
else \
{ \
if ( bli_is_upper( uplox_eff ) ) \
{ \
for ( j = 0; j < n_iter; ++j ) \
{ \
n_elem = bli_min( n_shift + j + 1, n_elem_max ); \
\
x1 = x + (ij0+j )*ldx + (0 )*incx; \
y1 = y + (ij0+j )*ldy + (0 )*incy; \
\
/* Invoke the kernel with the appropriate parameters. */ \
f( \
conjx, \
n_elem, \
x1, incx, \
beta, \
y1, incy, \
cntx \
); \
} \
} \
else if ( bli_is_lower( uplox_eff ) ) \
{ \
for ( j = 0; j < n_iter; ++j ) \
{ \
i = bli_max( 0, ( doff_t )j - ( doff_t )n_shift ); \
n_elem = n_elem_max - i; \
\
x1 = x + (j )*ldx + (ij0+i )*incx; \
y1 = y + (j )*ldy + (ij0+i )*incy; \
\
/* Invoke the kernel with the appropriate parameters. */ \
f( \
conjx, \
n_elem, \
x1, incx, \
beta, \
y1, incy, \
cntx \
); \
} \
} \
} \
}
INSERT_GENTFUNC_BASIC2( xpbym_unb_var1, xpbyv, BLIS_XPBYV_KER )
#undef GENTFUNC2
#define GENTFUNC2( ctype_x, ctype_y, chx, chy, opname ) \
\
void PASTEMAC2(chx,chy,opname) \
( \
doff_t diagoffx, \
diag_t diagx, \
uplo_t uplox, \
trans_t transx, \
dim_t m, \
dim_t n, \
ctype_x* x, inc_t rs_x, inc_t cs_x, \
ctype_y* beta, \
ctype_y* y, inc_t rs_y, inc_t cs_y, \
cntx_t* cntx, \
rntm_t* rntm \
) \
{ \
ctype_x* restrict x1; \
ctype_y* restrict y1; \
uplo_t uplox_eff; \
dim_t n_iter; \
dim_t n_elem, n_elem_max; \
inc_t ldx, incx; \
inc_t ldy, incy; \
dim_t j, i; \
dim_t ij0, n_shift; \
\
/* Set various loop parameters. */ \
bli_set_dims_incs_uplo_2m \
( \
diagoffx, diagx, transx, \
uplox, m, n, rs_x, cs_x, rs_y, cs_y, \
&uplox_eff, &n_elem_max, &n_iter, &incx, &ldx, &incy, &ldy, \
&ij0, &n_shift \
); \
\
/* Extract the conjugation component from the transx parameter. */ \
/*conjx = bli_extract_conj( transx );*/ \
\
/* Handle dense and upper/lower storage cases separately. */ \
if ( PASTEMAC(chy,eq1)( *beta ) ) \
{ \
if ( incx == 1 && incy == 1 ) \
{ \
n_elem = n_elem_max; \
\
for ( j = 0; j < n_iter; ++j ) \
{ \
x1 = x + (j )*ldx + (0 )*incx; \
y1 = y + (j )*ldy + (0 )*incy; \
\
ctype_x* restrict chi1 = x1; \
ctype_y* restrict psi1 = y1; \
\
for ( i = 0; i < n_elem; ++i ) \
{ \
PASTEMAC2(chx,chy,adds)( chi1[i], psi1[i] ); \
} \
} \
} \
else \
{ \
n_elem = n_elem_max; \
\
for ( j = 0; j < n_iter; ++j ) \
{ \
x1 = x + (j )*ldx + (0 )*incx; \
y1 = y + (j )*ldy + (0 )*incy; \
\
ctype_x* restrict chi1 = x1; \
ctype_y* restrict psi1 = y1; \
\
for ( i = 0; i < n_elem; ++i ) \
{ \
PASTEMAC2(chx,chy,adds)( *chi1, *psi1 ); \
\
chi1 += incx; \
psi1 += incy; \
} \
} \
} \
} \
else /* ( !PASTEMAC(chy,eq1)( *beta ) ) */ \
{ \
if ( incx == 1 && incy == 1 ) \
{ \
n_elem = n_elem_max; \
\
for ( j = 0; j < n_iter; ++j ) \
{ \
x1 = x + (j )*ldx + (0 )*incx; \
y1 = y + (j )*ldy + (0 )*incy; \
\
ctype_x* restrict chi1 = x1; \
ctype_y* restrict psi1 = y1; \
\
for ( i = 0; i < n_elem; ++i ) \
{ \
PASTEMAC3(chx,chy,chy,xpbys)( chi1[i], *beta, psi1[i] ); \
} \
} \
} \
else \
{ \
n_elem = n_elem_max; \
\
for ( j = 0; j < n_iter; ++j ) \
{ \
x1 = x + (j )*ldx + (0 )*incx; \
y1 = y + (j )*ldy + (0 )*incy; \
\
ctype_x* restrict chi1 = x1; \
ctype_y* restrict psi1 = y1; \
\
for ( i = 0; i < n_elem; ++i ) \
{ \
PASTEMAC3(chx,chy,chy,xpbys)( *chi1, *beta, *psi1 ); \
\
chi1 += incx; \
psi1 += incy; \
} \
} \
} \
} \
}
INSERT_GENTFUNC2_BASIC0( xpbym_md_unb_var1 )
INSERT_GENTFUNC2_MIXDP0( xpbym_md_unb_var1 )

View File

@@ -101,3 +101,46 @@ void PASTEMAC2(ch,opname,_unb_var1) \
INSERT_GENTPROT_BASIC0( scalm )
INSERT_GENTPROT_BASIC0( setm )
#undef GENTPROT
#define GENTPROT( ctype, ch, opname ) \
\
void PASTEMAC2(ch,opname,_unb_var1) \
( \
doff_t diagoffx, \
diag_t diagx, \
uplo_t uplox, \
trans_t transx, \
dim_t m, \
dim_t n, \
ctype* x, inc_t rs_x, inc_t cs_x, \
ctype* beta, \
ctype* y, inc_t rs_y, inc_t cs_y, \
cntx_t* cntx, \
rntm_t* rntm \
);
INSERT_GENTPROT_BASIC0( xpbym )
#undef GENTPROT2
#define GENTPROT2( ctype_x, ctype_y, chx, chy, opname ) \
\
void PASTEMAC3(chx,chy,opname,_unb_var1) \
( \
doff_t diagoffx, \
diag_t diagx, \
uplo_t uplox, \
trans_t transx, \
dim_t m, \
dim_t n, \
ctype_x* x, inc_t rs_x, inc_t cs_x, \
ctype_y* beta, \
ctype_y* y, inc_t rs_y, inc_t cs_y, \
cntx_t* cntx, \
rntm_t* rntm \
);
INSERT_GENTPROT2_BASIC0( xpbym_md )
INSERT_GENTPROT2_MIXDP0( xpbym_md )

View File

@@ -55,3 +55,8 @@
#include "bli_packm_cxk_rih.h"
#include "bli_packm_cxk_1er.h"
// Mixed datatype support.
#ifdef BLIS_ENABLE_GEMM_MD
#include "bli_packm_md.h"
#endif

View File

@@ -108,7 +108,17 @@ void bli_packm_blk_var1
thrinfo_t* t
)
{
num_t dt_cp = bli_obj_dt( c );
#ifdef BLIS_ENABLE_GEMM_MD
// Call a different packm implementation when the storage and target
// datatypes differ.
if ( bli_obj_dt( c ) != bli_obj_target_dt( c ) )
{
bli_packm_blk_var1_md( c, p, cntx, cntl, t );
return;
}
#endif
num_t dt_c = bli_obj_dt( c );
struc_t strucc = bli_obj_struc( c );
doff_t diagoffc = bli_obj_diag_offset( c );
@@ -155,7 +165,7 @@ void bli_packm_blk_var1
// higher-level operation. Thus, we use BLIS_ONE for kappa so
// that the underlying packm implementation does not perform
// any scaling during packing.
buf_kappa = bli_obj_buffer_for_const( dt_cp, &BLIS_ONE );
buf_kappa = bli_obj_buffer_for_const( dt_c, &BLIS_ONE );
}
else // if ( bli_is_ind_packed( schema ) )
{
@@ -187,11 +197,10 @@ void bli_packm_blk_var1
}
// Acquire the buffer to the kappa chosen above.
buf_kappa = bli_obj_buffer_for_1x1( dt_cp, kappa_p );
buf_kappa = bli_obj_buffer_for_1x1( dt_c, kappa_p );
}
// Choose the correct func_t object based on the pack_t schema.
#if 0
if ( bli_is_4mi_packed( schema ) ) packm_kers = packm_struc_cxk_4mi_kers;
else if ( bli_is_3mi_packed( schema ) ||
@@ -208,7 +217,7 @@ void bli_packm_blk_var1
//func_t* cntx_packm_kers = bli_cntx_get_packm_ukr( cntx );
//if ( bli_func_is_null_dt( dt_cp, cntx_packm_kers ) )
//if ( bli_func_is_null_dt( dt_c, cntx_packm_kers ) )
{
// If the packm structure-aware kernel func_t in the context is
// NULL (which is the default value after the context is created),
@@ -230,11 +239,11 @@ void bli_packm_blk_var1
#endif
// Query the datatype-specific function pointer from the func_t object.
packm_ker = bli_func_get_dt( dt_cp, packm_kers );
packm_ker = bli_func_get_dt( dt_c, packm_kers );
// Index into the type combination array to extract the correct
// function pointer.
f = ftypes[dt_cp];
f = ftypes[dt_c];
// Invoke the function.
f( strucc,
@@ -433,10 +442,10 @@ void PASTEMAC(ch,varname) \
\
/*
if ( row_stored ) \
PASTEMAC(ch,fprintm)( stdout, "packm_var2: b", m, n, \
PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: b", m, n, \
c_cast, rs_c, cs_c, "%4.1f", "" ); \
if ( col_stored ) \
PASTEMAC(ch,fprintm)( stdout, "packm_var2: a", m, n, \
PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: a", m, n, \
c_cast, rs_c, cs_c, "%4.1f", "" ); \
*/ \
\
@@ -605,6 +614,15 @@ PASTEMAC(ch,fprintm)( stdout, "packm_var2: a", m, n, \
} \
\
/*
if ( row_stored ) \
PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: b packed", *m_panel_max, *n_panel_max, \
p_use, rs_p, cs_p, "%5.2f", "" ); \
else \
PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: a packed", *m_panel_max, *n_panel_max, \
p_use, rs_p, cs_p, "%5.2f", "" ); \
*/ \
\
/*
if ( col_stored ) { \
if ( bli_thread_work_id( thread ) == 0 ) \
{ \

View File

@@ -1,463 +0,0 @@
/*
BLIS
An object-based framework for developing high-performance BLAS-like
libraries.
Copyright (C) 2014, The University of Texas at Austin
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
- Neither the name of The University of Texas at Austin nor the names
of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include "blis.h"
#define FUNCPTR_T packm_fp
typedef void (*FUNCPTR_T)(
struc_t strucc,
doff_t diagoffc,
diag_t diagc,
uplo_t uploc,
trans_t transc,
pack_t schema,
bool_t invdiag,
bool_t revifup,
bool_t reviflo,
dim_t m,
dim_t n,
dim_t m_max,
dim_t n_max,
void* kappa,
void* c, inc_t rs_c, inc_t cs_c,
void* p, inc_t rs_p, inc_t cs_p,
inc_t is_p,
dim_t pd_p, inc_t ps_p,
void* packm_ker,
packm_thrinfo_t* thread
);
static FUNCPTR_T GENARRAY(ftypes,packm_blk_var1);
extern func_t* packm_struc_cxk_kers;
void bli_packm_blk_var1( obj_t* c,
obj_t* p,
packm_thrinfo_t* t )
{
num_t dt_cp = bli_obj_dt( c );
struc_t strucc = bli_obj_struc( c );
doff_t diagoffc = bli_obj_diag_offset( c );
diag_t diagc = bli_obj_diag( c );
uplo_t uploc = bli_obj_uplo( c );
trans_t transc = bli_obj_conjtrans_status( c );
pack_t schema = bli_obj_pack_schema( p );
bool_t invdiag = bli_obj_has_inverted_diag( p );
bool_t revifup = bli_obj_is_pack_rev_if_upper( p );
bool_t reviflo = bli_obj_is_pack_rev_if_lower( p );
dim_t m_p = bli_obj_length( p );
dim_t n_p = bli_obj_width( p );
dim_t m_max_p = bli_obj_padded_length( p );
dim_t n_max_p = bli_obj_padded_width( p );
void* buf_c = bli_obj_buffer_at_off( c );
inc_t rs_c = bli_obj_row_stride( c );
inc_t cs_c = bli_obj_col_stride( c );
void* buf_p = bli_obj_buffer_at_off( p );
inc_t rs_p = bli_obj_row_stride( p );
inc_t cs_p = bli_obj_col_stride( p );
inc_t is_p = bli_obj_imag_stride( p );
dim_t pd_p = bli_obj_panel_dim( p );
inc_t ps_p = bli_obj_panel_stride( p );
void* buf_kappa;
func_t* packm_kers;
void* packm_ker;
FUNCPTR_T f;
// This variant assumes that the micro-kernel will always apply the
// alpha scalar of the higher-level operation. Thus, we use BLIS_ONE
// for kappa so that the underlying packm implementation does not
// scale during packing.
buf_kappa = bli_obj_buffer_for_const( dt_cp, &BLIS_ONE );
// Choose the correct func_t object.
packm_kers = packm_struc_cxk_kers;
// Query the datatype-specific function pointer from the func_t object.
packm_ker = bli_func_obj_query( dt_cp, packm_kers );
// Index into the type combination array to extract the correct
// function pointer.
f = ftypes[dt_cp];
// Invoke the function.
f( strucc,
diagoffc,
diagc,
uploc,
transc,
schema,
invdiag,
revifup,
reviflo,
m_p,
n_p,
m_max_p,
n_max_p,
buf_kappa,
buf_c, rs_c, cs_c,
buf_p, rs_p, cs_p,
is_p,
pd_p, ps_p,
packm_ker,
t );
}
#undef GENTFUNC
#define GENTFUNC( ctype, ch, varname, kertype ) \
\
void PASTEMAC(ch,varname) \
struc_t strucc, \
doff_t diagoffc, \
diag_t diagc, \
uplo_t uploc, \
trans_t transc, \
pack_t schema, \
bool_t invdiag, \
bool_t revifup, \
bool_t reviflo, \
dim_t m, \
dim_t n, \
dim_t m_max, \
dim_t n_max, \
void* kappa, \
void* c, inc_t rs_c, inc_t cs_c, \
void* p, inc_t rs_p, inc_t cs_p, \
inc_t is_p, \
dim_t pd_p, inc_t ps_p, \
void* packm_ker, \
packm_thrinfo_t* thread \
) \
{ \
PASTECH(ch,kertype) packm_ker_cast = packm_ker; \
\
ctype* restrict kappa_cast = kappa; \
ctype* restrict c_cast = c; \
ctype* restrict p_cast = p; \
ctype* restrict c_begin; \
ctype* restrict p_begin; \
\
dim_t iter_dim; \
dim_t num_iter; \
dim_t it, ic, ip; \
dim_t ic0, ip0; \
doff_t ic_inc, ip_inc; \
doff_t diagoffc_i; \
doff_t diagoffc_inc; \
dim_t panel_len_full; \
dim_t panel_len_i; \
dim_t panel_len_max; \
dim_t panel_len_max_i; \
dim_t panel_dim_i; \
dim_t panel_dim_max; \
dim_t panel_off_i; \
inc_t vs_c; \
inc_t ldc; \
inc_t ldp, p_inc; \
dim_t* m_panel_full; \
dim_t* n_panel_full; \
dim_t* m_panel_use; \
dim_t* n_panel_use; \
dim_t* m_panel_max; \
dim_t* n_panel_max; \
conj_t conjc; \
bool_t row_stored; \
bool_t col_stored; \
\
ctype* restrict c_use; \
ctype* restrict p_use; \
doff_t diagoffp_i; \
\
\
/* If C is zeros and part of a triangular matrix, then we don't need
to pack it. */ \
if ( bli_is_zeros( uploc ) && \
bli_is_triangular( strucc ) ) return; \
\
/* Extract the conjugation bit from the transposition argument. */ \
conjc = bli_extract_conj( transc ); \
\
/* If c needs a transposition, induce it so that we can more simply
express the remaining parameters and code. */ \
if ( bli_does_trans( transc ) ) \
{ \
bli_swap_incs( &rs_c, &cs_c ); \
bli_negate_diag_offset( &diagoffc ); \
bli_toggle_uplo( &uploc ); \
bli_toggle_trans( &transc ); \
} \
\
/* Create flags to incidate row or column storage. Note that the
schema bit that encodes row or column is describing the form of
micro-panel, not the storage in the micro-panel. Hence the
mismatch in "row" and "column" semantics. */ \
row_stored = bli_is_col_packed( schema ); \
col_stored = bli_is_row_packed( schema ); \
\
/* If the row storage flag indicates row storage, then we are packing
to column panels; otherwise, if the strides indicate column storage,
we are packing to row panels. */ \
if ( row_stored ) \
{ \
/* Prepare to pack to row-stored column panels. */ \
iter_dim = n; \
panel_len_full = m; \
panel_len_max = m_max; \
panel_dim_max = pd_p; \
ldc = rs_c; \
vs_c = cs_c; \
diagoffc_inc = -( doff_t )panel_dim_max; \
ldp = rs_p; \
m_panel_full = &m; \
n_panel_full = &panel_dim_i; \
m_panel_use = &panel_len_i; \
n_panel_use = &panel_dim_i; \
m_panel_max = &panel_len_max_i; \
n_panel_max = &panel_dim_max; \
} \
else /* if ( col_stored ) */ \
{ \
/* Prepare to pack to column-stored row panels. */ \
iter_dim = m; \
panel_len_full = n; \
panel_len_max = n_max; \
panel_dim_max = pd_p; \
ldc = cs_c; \
vs_c = rs_c; \
diagoffc_inc = ( doff_t )panel_dim_max; \
ldp = cs_p; \
m_panel_full = &panel_dim_i; \
n_panel_full = &n; \
m_panel_use = &panel_dim_i; \
n_panel_use = &panel_len_i; \
m_panel_max = &panel_dim_max; \
n_panel_max = &panel_len_max_i; \
} \
\
/* Compute the total number of iterations we'll need. */ \
num_iter = iter_dim / panel_dim_max + ( iter_dim % panel_dim_max ? 1 : 0 ); \
\
/* Set the initial values and increments for indices related to C and P
based on whether reverse iteration was requested. */ \
if ( ( revifup && bli_is_upper( uploc ) && bli_is_triangular( strucc ) ) || \
( reviflo && bli_is_lower( uploc ) && bli_is_triangular( strucc ) ) ) \
{ \
ic0 = (num_iter - 1) * panel_dim_max; \
ic_inc = -panel_dim_max; \
ip0 = num_iter - 1; \
ip_inc = -1; \
} \
else \
{ \
ic0 = 0; \
ic_inc = panel_dim_max; \
ip0 = 0; \
ip_inc = 1; \
} \
\
p_begin = p_cast; \
\
for ( ic = ic0, ip = ip0, it = 0; it < num_iter; \
ic += ic_inc, ip += ip_inc, it += 1 ) \
{ \
panel_dim_i = bli_min( panel_dim_max, iter_dim - ic ); \
\
diagoffc_i = diagoffc + (ip )*diagoffc_inc; \
c_begin = c_cast + (ic )*vs_c; \
\
if ( bli_is_triangular( strucc ) && \
bli_is_unstored_subpart_n( diagoffc_i, uploc, *m_panel_full, *n_panel_full ) ) \
{ \
/* This case executes if the panel belongs to a triangular
matrix AND is completely unstored (ie: zero). If the panel
is unstored, we do nothing. (Notice that we don't even
increment p_begin.) */ \
\
continue; \
} \
else if ( bli_is_triangular( strucc ) && \
bli_intersects_diag_n( diagoffc_i, *m_panel_full, *n_panel_full ) ) \
{ \
/* This case executes if the panel belongs to a triangular
matrix AND is diagonal-intersecting. Notice that we
cannot bury the following conditional logic into
packm_struc_cxk() because we need to know the value of
panel_len_max_i so we can properly increment p_inc. */ \
\
/* Sanity check. Diagonals should not intersect the short end of
a micro-panel. If they do, then somehow the constraints on
cache blocksizes being a whole multiple of the register
blocksizes was somehow violated. */ \
if ( ( col_stored && diagoffc_i < 0 ) || \
( row_stored && diagoffc_i > 0 ) ) \
bli_check_error_code( BLIS_NOT_YET_IMPLEMENTED ); \
\
if ( ( row_stored && bli_is_upper( uploc ) ) || \
( col_stored && bli_is_lower( uploc ) ) ) \
{ \
panel_off_i = 0; \
panel_len_i = bli_abs( diagoffc_i ) + panel_dim_i; \
panel_len_max_i = bli_min( bli_abs( diagoffc_i ) + panel_dim_max, \
panel_len_max ); \
diagoffp_i = diagoffc_i; \
} \
else /* if ( ( row_stored && bli_is_lower( uploc ) ) || \
( col_stored && bli_is_upper( uploc ) ) ) */ \
{ \
panel_off_i = bli_abs( diagoffc_i ); \
panel_len_i = panel_len_full - panel_off_i; \
panel_len_max_i = panel_len_max - panel_off_i; \
diagoffp_i = 0; \
} \
\
c_use = c_begin + (panel_off_i )*ldc; \
p_use = p_begin; \
\
if( packm_thread_my_iter( it, thread ) ) \
{ \
packm_ker_cast( strucc, \
diagoffp_i, \
diagc, \
uploc, \
conjc, \
schema, \
invdiag, \
*m_panel_use, \
*n_panel_use, \
*m_panel_max, \
*n_panel_max, \
kappa_cast, \
c_use, rs_c, cs_c, \
p_use, rs_p, cs_p, \
is_p ); \
} \
\
/* NOTE: This value is usually LESS than ps_p because triangular
matrices usually have several micro-panels that are shorter
than a "full" micro-panel. */ \
p_inc = ldp * panel_len_max_i; \
\
/* We nudge the panel increment up by one if it is odd. */ \
p_inc += ( bli_is_odd( p_inc ) ? 1 : 0 ); \
} \
else if ( bli_is_herm_or_symm( strucc ) ) \
{ \
/* This case executes if the panel belongs to a Hermitian or
symmetric matrix, which includes stored, unstored, and
diagonal-intersecting panels. */ \
\
panel_len_i = panel_len_full; \
panel_len_max_i = panel_len_max; \
\
if( packm_thread_my_iter( it, thread ) ) \
{ \
packm_ker_cast( strucc, \
diagoffc_i, \
diagc, \
uploc, \
conjc, \
schema, \
invdiag, \
*m_panel_use, \
*n_panel_use, \
*m_panel_max, \
*n_panel_max, \
kappa_cast, \
c_begin, rs_c, cs_c, \
p_begin, rs_p, cs_p, \
is_p ); \
} \
\
/* NOTE: This value is equivalent to ps_p. */ \
/*p_inc = ldp * panel_len_max_i;*/ \
p_inc = ps_p; \
} \
else \
{ \
/* This case executes if the panel is general, or, if the
panel is part of a triangular matrix and is neither unstored
(ie: zero) nor diagonal-intersecting. */ \
\
panel_len_i = panel_len_full; \
panel_len_max_i = panel_len_max; \
\
if( packm_thread_my_iter( it, thread ) ) \
{ \
packm_ker_cast( BLIS_GENERAL, \
0, \
diagc, \
BLIS_DENSE, \
conjc, \
schema, \
invdiag, \
*m_panel_use, \
*n_panel_use, \
*m_panel_max, \
*n_panel_max, \
kappa_cast, \
c_begin, rs_c, cs_c, \
p_begin, rs_p, cs_p, \
is_p ); \
} \
/*
if ( row_stored ) \
PASTEMAC(ch,fprintm)( stdout, "packm_var1: bp copied", panel_len_max_i, panel_dim_max, \
p_begin, rs_p, cs_p, "%9.2e", "" ); \
else if ( col_stored ) \
PASTEMAC(ch,fprintm)( stdout, "packm_var1: ap copied", panel_dim_max, panel_len_max_i, \
p_begin, rs_p, cs_p, "%9.2e", "" ); \
*/ \
\
/* NOTE: This value is equivalent to ps_p. */ \
/*p_inc = ldp * panel_len_max_i;*/ \
p_inc = ps_p; \
} \
\
\
p_begin += p_inc; \
} \
}
INSERT_GENTFUNC_BASIC( packm_blk_var1, packm_ker_t )

View File

@@ -0,0 +1,293 @@
/*
BLIS
An object-based framework for developing high-performance BLAS-like
libraries.
Copyright (C) 2014, The University of Texas at Austin
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
- Neither the name of The University of Texas at Austin nor the names
of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include "blis.h"
#ifdef BLIS_ENABLE_GEMM_MD
#define FUNCPTR_T packm_fp
typedef void (*FUNCPTR_T)(
trans_t transc,
pack_t schema,
dim_t m,
dim_t n,
dim_t m_max,
dim_t n_max,
void* kappa,
void* c, inc_t rs_c, inc_t cs_c,
void* p, inc_t rs_p, inc_t cs_p,
inc_t is_p,
dim_t pd_p, inc_t ps_p,
cntx_t* cntx,
thrinfo_t* thread
);
static FUNCPTR_T GENARRAY2_ALL(ftypes,packm_blk_var1_md);
void bli_packm_blk_var1_md
(
obj_t* c,
obj_t* p,
cntx_t* cntx,
cntl_t* cntl,
thrinfo_t* t
)
{
num_t dt_c = bli_obj_dt( c );
num_t dt_p = bli_obj_dt( p );
trans_t transc = bli_obj_conjtrans_status( c );
pack_t schema = bli_obj_pack_schema( p );
dim_t m_p = bli_obj_length( p );
dim_t n_p = bli_obj_width( p );
dim_t m_max_p = bli_obj_padded_length( p );
dim_t n_max_p = bli_obj_padded_width( p );
void* buf_c = bli_obj_buffer_at_off( c );
inc_t rs_c = bli_obj_row_stride( c );
inc_t cs_c = bli_obj_col_stride( c );
void* buf_p = bli_obj_buffer_at_off( p );
inc_t rs_p = bli_obj_row_stride( p );
inc_t cs_p = bli_obj_col_stride( p );
inc_t is_p = bli_obj_imag_stride( p );
dim_t pd_p = bli_obj_panel_dim( p );
inc_t ps_p = bli_obj_panel_stride( p );
void* buf_kappa;
FUNCPTR_T f;
// Unconditionally use kappa = 1.0. Thus, we don't support scaling
// during packing when mixing datatypes.
buf_kappa = bli_obj_buffer_for_const( dt_p, &BLIS_ONE );
// Index into the type combination array to extract the correct
// function pointer.
f = ftypes[dt_c][dt_p];
// Invoke the function.
f(
transc,
schema,
m_p,
n_p,
m_max_p,
n_max_p,
buf_kappa,
buf_c, rs_c, cs_c,
buf_p, rs_p, cs_p,
is_p,
pd_p, ps_p,
cntx,
t );
}
#undef GENTFUNC2
#define GENTFUNC2( ctype_c, ctype_p, chc, chp, varname ) \
\
void PASTEMAC2(chc,chp,varname) \
( \
trans_t transc, \
pack_t schema, \
dim_t m, \
dim_t n, \
dim_t m_max, \
dim_t n_max, \
void* kappa, \
void* c, inc_t rs_c, inc_t cs_c, \
void* p, inc_t rs_p, inc_t cs_p, \
inc_t is_p, \
dim_t pd_p, inc_t ps_p, \
cntx_t* cntx, \
thrinfo_t* thread \
) \
{ \
ctype_p* restrict kappa_cast = kappa; \
ctype_c* restrict c_cast = c; \
ctype_p* restrict p_cast = p; \
ctype_c* restrict c_begin; \
ctype_p* restrict p_begin; \
\
dim_t iter_dim; \
dim_t num_iter; \
dim_t it, ic, ip; \
doff_t ic_inc, ip_inc; \
dim_t panel_len_full; \
dim_t panel_len_i; \
dim_t panel_len_max; \
dim_t panel_len_max_i; \
dim_t panel_dim_i; \
dim_t panel_dim_max; \
inc_t vs_c; \
inc_t p_inc; \
dim_t* m_panel_use; \
dim_t* n_panel_use; \
dim_t* m_panel_max; \
dim_t* n_panel_max; \
conj_t conjc; \
bool_t row_stored; \
bool_t col_stored; \
\
ctype_c* restrict c_use; \
ctype_p* restrict p_use; \
\
\
/* Extract the conjugation bit from the transposition argument. */ \
conjc = bli_extract_conj( transc ); \
\
/* If c needs a transposition, induce it so that we can more simply
express the remaining parameters and code. */ \
if ( bli_does_trans( transc ) ) \
{ \
bli_swap_incs( &rs_c, &cs_c ); \
bli_toggle_trans( &transc ); \
} \
\
/* Create flags to incidate row or column storage. Note that the
schema bit that encodes row or column is describing the form of
micro-panel, not the storage in the micro-panel. Hence the
mismatch in "row" and "column" semantics. */ \
row_stored = bli_is_col_packed( schema ); \
col_stored = bli_is_row_packed( schema ); \
\
( void )col_stored; \
\
/* If the row storage flag indicates row storage, then we are packing
to column panels; otherwise, if the strides indicate column storage,
we are packing to row panels. */ \
if ( row_stored ) \
{ \
/* Prepare to pack to row-stored column panels. */ \
iter_dim = n; \
panel_len_full = m; \
panel_len_max = m_max; \
panel_dim_max = pd_p; \
vs_c = cs_c; \
m_panel_use = &panel_len_i; \
n_panel_use = &panel_dim_i; \
m_panel_max = &panel_len_max_i; \
n_panel_max = &panel_dim_max; \
} \
else /* if ( col_stored ) */ \
{ \
/* Prepare to pack to column-stored row panels. */ \
iter_dim = m; \
panel_len_full = n; \
panel_len_max = n_max; \
panel_dim_max = pd_p; \
vs_c = rs_c; \
m_panel_use = &panel_dim_i; \
n_panel_use = &panel_len_i; \
m_panel_max = &panel_dim_max; \
n_panel_max = &panel_len_max_i; \
} \
\
/* Compute the total number of iterations we'll need. */ \
num_iter = iter_dim / panel_dim_max + ( iter_dim % panel_dim_max ? 1 : 0 ); \
\
{ \
ic_inc = panel_dim_max; \
ip_inc = 1; \
} \
\
p_begin = p_cast; \
\
/*
if ( row_stored ) \
PASTEMAC(chc,fprintm)( stdout, "packm_blk_var1_md: b orig", m, n, \
c_cast, rs_c, cs_c, "%5.2f", "" ); \
if ( col_stored ) \
PASTEMAC(chc,fprintm)( stdout, "packm_blk_var1_md: a orig", m, n, \
c_cast, rs_c, cs_c, "%5.2f", "" ); \
*/ \
\
for ( ic = 0, ip = 0, it = 0; it < num_iter; \
ic += ic_inc, ip += ip_inc, it += 1 ) \
{ \
panel_dim_i = bli_min( panel_dim_max, iter_dim - ic ); \
\
c_begin = c_cast + (ic )*vs_c; \
\
{ \
c_use = c_begin; \
p_use = p_begin; \
\
panel_len_i = panel_len_full; \
panel_len_max_i = panel_len_max; \
\
if( packm_thread_my_iter( it, thread ) ) \
{ \
PASTEMAC2(chc,chp,packm_struc_cxk_md) \
( \
conjc, \
schema, \
*m_panel_use, \
*n_panel_use, \
*m_panel_max, \
*n_panel_max, \
kappa_cast, \
c_use, rs_c, cs_c, \
p_use, rs_p, cs_p, \
is_p, \
cntx \
); \
} \
\
p_inc = ps_p; \
} \
\
/*
if ( row_stored ) \
PASTEMAC(chp,fprintm)( stdout, "packm_blk_var1_md: b packed", *m_panel_max, *n_panel_max, \
p_use, rs_p, cs_p, "%5.2f", "" ); \
else \
PASTEMAC(chp,fprintm)( stdout, "packm_blk_var1_md: a packed", *m_panel_max, *n_panel_max, \
p_use, rs_p, cs_p, "%5.2f", "" ); \
*/ \
\
p_begin += p_inc; \
\
} \
}
INSERT_GENTFUNC2_BASIC0( packm_blk_var1_md )
INSERT_GENTFUNC2_MIXDP0( packm_blk_var1_md )
#endif

View File

@@ -0,0 +1,67 @@
/*
BLIS
An object-based framework for developing high-performance BLAS-like
libraries.
Copyright (C) 2014, The University of Texas at Austin
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
- Neither the name of The University of Texas at Austin nor the names
of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
void bli_packm_blk_var1_md
(
obj_t* c,
obj_t* p,
cntx_t* cntx,
cntl_t* cntl,
thrinfo_t* t
);
#undef GENTPROT2
#define GENTPROT2( ctype_c, ctype_p, chc, chp, varname ) \
\
void PASTEMAC2(chc,chp,varname) \
( \
trans_t transc, \
pack_t schema, \
dim_t m, \
dim_t n, \
dim_t m_max, \
dim_t n_max, \
void* kappa, \
void* c, inc_t rs_c, inc_t cs_c, \
void* p, inc_t rs_p, inc_t cs_p, \
inc_t is_p, \
dim_t pd_p, inc_t ps_p, \
cntx_t* cntx, \
thrinfo_t* thread \
);
INSERT_GENTPROT2_BASIC0( packm_blk_var1_md )
INSERT_GENTPROT2_MIXDP0( packm_blk_var1_md )

View File

@@ -211,13 +211,14 @@ siz_t bli_packm_init_pack
bli_init_once();
num_t dt = bli_obj_dt( a );
num_t dt_tar = bli_obj_target_dt( a );
trans_t transa = bli_obj_onlytrans_status( a );
dim_t m_a = bli_obj_length( a );
dim_t n_a = bli_obj_width( a );
dim_t bmult_m_def = bli_cntx_get_blksz_def_dt( dt, bmult_id_m, cntx );
dim_t bmult_m_pack = bli_cntx_get_blksz_max_dt( dt, bmult_id_m, cntx );
dim_t bmult_n_def = bli_cntx_get_blksz_def_dt( dt, bmult_id_n, cntx );
dim_t bmult_n_pack = bli_cntx_get_blksz_max_dt( dt, bmult_id_n, cntx );
dim_t bmult_m_def = bli_cntx_get_blksz_def_dt( dt_tar, bmult_id_m, cntx );
dim_t bmult_m_pack = bli_cntx_get_blksz_max_dt( dt_tar, bmult_id_m, cntx );
dim_t bmult_n_def = bli_cntx_get_blksz_def_dt( dt_tar, bmult_id_n, cntx );
dim_t bmult_n_pack = bli_cntx_get_blksz_max_dt( dt_tar, bmult_id_n, cntx );
dim_t m_p, n_p;
dim_t m_p_pad, n_p_pad;
@@ -230,6 +231,17 @@ siz_t bli_packm_init_pack
// We begin by copying the fields of A.
bli_obj_alias_to( a, p );
// Typecast the internal scalar value to the target datatype.
// NOTE: This must happen BEFORE we change the datatype of P to reflect
// the target_dt.
if ( dt != dt_tar )
{
bli_obj_scalar_cast_to( dt_tar, p );
}
// Update the datatype of P to be the target datatype of A.
bli_obj_set_dt( dt_tar, p );
// Update the dimension fields to explicitly reflect a transposition,
// if needed.
// Then, clear the conjugation and transposition fields from the object

View File

@@ -0,0 +1,37 @@
/*
BLIS
An object-based framework for developing high-performance BLAS-like
libraries.
Copyright (C) 2014, The University of Texas at Austin
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
- Neither the name of The University of Texas at Austin nor the names
of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include "bli_packm_blk_var1_md.h"
#include "bli_packm_struc_cxk_md.h"

View File

@@ -0,0 +1,294 @@
/*
BLIS
An object-based framework for developing high-performance BLAS-like
libraries.
Copyright (C) 2014, The University of Texas at Austin
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
- Neither the name of The University of Texas at Austin nor the names
of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include "blis.h"
#ifdef BLIS_ENABLE_GEMM_MD
#undef GENTFUNC2
#define GENTFUNC2( ctype_c, ctype_p, chc, chp, varname ) \
\
void PASTEMAC2(chc,chp,varname) \
( \
conj_t conjc, \
pack_t schema, \
dim_t m_panel, \
dim_t n_panel, \
dim_t m_panel_max, \
dim_t n_panel_max, \
ctype_p* restrict kappa, \
ctype_c* restrict c, inc_t rs_c, inc_t cs_c, \
ctype_p* restrict p, inc_t rs_p, inc_t cs_p, \
inc_t is_p, \
cntx_t* cntx \
) \
{ \
dim_t panel_dim; \
dim_t panel_len; \
inc_t incc, ldc; \
inc_t ldp; \
\
\
/* Determine the dimensions and relative strides of the micro-panel
based on its pack schema. */ \
if ( bli_is_col_packed( schema ) ) \
{ \
/* Prepare to pack to row-stored column panel. */ \
panel_dim = n_panel; \
panel_len = m_panel; \
incc = cs_c; \
ldc = rs_c; \
ldp = rs_p; \
} \
else /* if ( bli_is_row_packed( schema ) ) */ \
{ \
/* Prepare to pack to column-stored row panel. */ \
panel_dim = m_panel; \
panel_len = n_panel; \
incc = rs_c; \
ldc = cs_c; \
ldp = cs_p; \
} \
\
\
if ( bli_is_nat_packed( schema ) ) \
{ \
trans_t transc = ( trans_t )conjc; \
\
/* NOTE: We ignore kappa for now, since it should be 1.0. */ \
PASTEMAC2(chc,chp,castm) \
( \
transc, \
panel_dim, \
panel_len, \
c, incc, ldc, \
p, 1, ldp \
); \
\
/* The packed memory region was acquired/allocated with "aligned"
dimensions (ie: dimensions that were possibly inflated up to a
multiple). When these dimension are inflated, it creates empty
regions along the bottom and/or right edges of the matrix. If
either region exists, we set them to zero. This allows the
micro-kernel to remain simple since it does not need to support
different register blockings for the edge cases. */ \
if ( m_panel != m_panel_max ) \
{ \
ctype_p* restrict zero = PASTEMAC(chp,0); \
dim_t i = m_panel; \
dim_t m_edge = m_panel_max - i; \
dim_t n_edge = n_panel_max; \
ctype_p* p_edge = p + (i )*rs_p; \
\
PASTEMAC2(chp,setm,BLIS_TAPI_EX_SUF) \
( \
BLIS_NO_CONJUGATE, \
0, \
BLIS_NONUNIT_DIAG, \
BLIS_DENSE, \
m_edge, \
n_edge, \
zero, \
p_edge, rs_p, cs_p, \
cntx, \
NULL \
); \
} \
\
if ( n_panel != n_panel_max ) \
{ \
ctype_p* restrict zero = PASTEMAC(chp,0); \
dim_t j = n_panel; \
dim_t m_edge = m_panel_max; \
dim_t n_edge = n_panel_max - j; \
ctype_p* p_edge = p + (j )*cs_p; \
\
PASTEMAC2(chp,setm,BLIS_TAPI_EX_SUF) \
( \
BLIS_NO_CONJUGATE, \
0, \
BLIS_NONUNIT_DIAG, \
BLIS_DENSE, \
m_edge, \
n_edge, \
zero, \
p_edge, rs_p, cs_p, \
cntx, \
NULL \
); \
} \
} \
else /* if ( bli_is_1r_packed( schema ) ) */ \
{ \
/* NOTE: We ignore kappa for now, since it should be 1.0. */ \
PASTEMAC2(chc,chp,packm_cxk_1r_md) \
( \
conjc, \
panel_dim, \
panel_len, \
c, incc, ldc, \
p, ldp \
); \
\
if ( m_panel != m_panel_max ) \
{ \
ctype_p* restrict zero = PASTEMAC(chp,0); \
dim_t offm = m_panel; \
dim_t offn = 0; \
dim_t m_edge = m_panel_max - m_panel; \
dim_t n_edge = n_panel_max; \
\
( void ) zero; \
( void ) m_edge; ( void )offm; \
( void ) n_edge; ( void )offn; \
\
PASTEMAC(chp,set1ms_mxn) \
( \
schema, \
offm, \
offn, \
m_edge, \
n_edge, \
zero, \
p, rs_p, cs_p, ldp \
); \
} \
\
if ( n_panel != n_panel_max ) \
{ \
ctype_p* restrict zero = PASTEMAC(chp,0); \
dim_t offm = 0; \
dim_t offn = n_panel; \
dim_t m_edge = m_panel_max; \
dim_t n_edge = n_panel_max - n_panel; \
\
( void ) zero; \
( void ) m_edge; ( void )offm; \
( void ) n_edge; ( void )offn; \
\
PASTEMAC(chp,set1ms_mxn) \
( \
schema, \
offm, \
offn, \
m_edge, \
n_edge, \
zero, \
p, rs_p, cs_p, ldp \
); \
} \
} \
\
\
/*
if ( bli_is_col_packed( schema ) ) \
PASTEMAC(ch,fprintm)( stdout, "packm_struc_cxk: bp copied", m_panel_max, n_panel_max, \
p, rs_p, cs_p, "%4.1f", "" ); \
else if ( bli_is_row_packed( schema ) ) \
PASTEMAC(ch,fprintm)( stdout, "packm_struc_cxk: ap copied", m_panel_max, n_panel_max, \
p, rs_p, cs_p, "%4.1f", "" ); \
*/ \
}
INSERT_GENTFUNC2_BASIC0( packm_struc_cxk_md )
INSERT_GENTFUNC2_MIXDP0( packm_struc_cxk_md )
// -----------------------------------------------------------------------------
#undef GENTFUNC2
#define GENTFUNC2( ctype_a, ctype_p, cha, chp, opname ) \
\
void PASTEMAC2(cha,chp,opname) \
( \
conj_t conja, \
dim_t m, \
dim_t n, \
ctype_a* restrict a, inc_t inca, inc_t lda, \
ctype_p* restrict p, inc_t ldp \
) \
{ \
const inc_t inca2 = 2 * inca; \
const inc_t lda2 = 2 * lda; \
const inc_t ldp2 = 2 * ldp; \
\
PASTEMAC(cha,ctyper)* restrict alpha1_r = ( PASTEMAC(cha,ctyper)* )a; \
PASTEMAC(cha,ctyper)* restrict alpha1_i = ( PASTEMAC(cha,ctyper)* )a + 1; \
PASTEMAC(chp,ctyper)* restrict pi1_r = ( PASTEMAC(chp,ctyper)* )p; \
PASTEMAC(chp,ctyper)* restrict pi1_i = ( PASTEMAC(chp,ctyper)* )p + ldp; \
\
dim_t i; \
\
if ( bli_is_conj( conja ) ) \
{ \
for ( ; n != 0; --n ) \
{ \
for ( i = 0; i < m; ++i ) \
{ \
PASTEMAC2(cha,chp,copyjris)( *(alpha1_r + i*inca2), \
*(alpha1_i + i*inca2), \
*(pi1_r + i*1), \
*(pi1_i + i*1) ); \
} \
\
alpha1_r += lda2; \
alpha1_i += lda2; \
pi1_r += ldp2; \
pi1_i += ldp2; \
} \
} \
else \
{ \
for ( ; n != 0; --n ) \
{ \
for ( i = 0; i < m; ++i ) \
{ \
PASTEMAC2(cha,chp,copyris)( *(alpha1_r + i*inca2), \
*(alpha1_i + i*inca2), \
*(pi1_r + i*1), \
*(pi1_i + i*1) ); \
} \
\
alpha1_r += lda2; \
alpha1_i += lda2; \
pi1_r += ldp2; \
pi1_i += ldp2; \
} \
} \
}
INSERT_GENTFUNC2_BASIC0( packm_cxk_1r_md )
INSERT_GENTFUNC2_MIXDP0( packm_cxk_1r_md )
#endif

View File

@@ -0,0 +1,71 @@
/*
BLIS
An object-based framework for developing high-performance BLAS-like
libraries.
Copyright (C) 2014, The University of Texas at Austin
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
- Neither the name of The University of Texas at Austin nor the names
of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#undef GENTPROT2
#define GENTPROT2( ctype_c, ctype_p, chc, chp, varname ) \
\
void PASTEMAC2(chc,chp,varname) \
( \
conj_t conjc, \
pack_t schema, \
dim_t m_panel, \
dim_t n_panel, \
dim_t m_panel_max, \
dim_t n_panel_max, \
ctype_p* restrict kappa, \
ctype_c* restrict c, inc_t rs_c, inc_t cs_c, \
ctype_p* restrict p, inc_t rs_p, inc_t cs_p, \
inc_t is_p, \
cntx_t* cntx \
);
INSERT_GENTPROT2_BASIC0( packm_struc_cxk_md )
INSERT_GENTPROT2_MIXDP0( packm_struc_cxk_md )
#undef GENTPROT2
#define GENTPROT2( ctype_a, ctype_p, cha, chp, opname ) \
\
void PASTEMAC2(cha,chp,opname) \
( \
conj_t conja, \
dim_t m, \
dim_t n, \
ctype_a* restrict a, inc_t inca, inc_t lda, \
ctype_p* restrict p, inc_t ldp \
);
INSERT_GENTPROT2_BASIC0( packm_cxk_1r_md )
INSERT_GENTPROT2_MIXDP0( packm_cxk_1r_md )

View File

@@ -295,13 +295,33 @@ void bli_gemm_basic_check
e_val = bli_check_level3_dims( a, b, c );
bli_check_error_code( e_val );
#ifdef BLIS_ENABLE_GEMM_MD
// Skip checking for consistent datatypes between A, B, and C since
// that is totally valid for mixed-datatype gemm.
// When mixing datatypes, make sure that alpha does not have a non-zero
// imaginary component.
if ( bli_obj_dt( c ) != bli_obj_dt( a ) ||
bli_obj_dt( c ) != bli_obj_dt( b ) ||
bli_obj_comp_prec( c ) != bli_obj_prec( c ) )
if ( !bli_obj_imag_is_zero( alpha ) )
{
bli_print_msg( "Mixed-datatype gemm does not yet support alpha with a non-zero imaginary component. Please contact BLIS developers for further support.", __FILE__, __LINE__ );
bli_abort();
}
#else // BLIS_DISABLE_GEMM_MD
// Check for consistent datatypes.
// NOTE: We only perform these tests when mixed datatype support is
// disabled.
e_val = bli_check_consistent_object_datatypes( c, a );
bli_check_error_code( e_val );
e_val = bli_check_consistent_object_datatypes( c, b );
bli_check_error_code( e_val );
#endif
}
void bli_hemm_basic_check

View File

@@ -58,11 +58,13 @@ void PASTEMAC(opname,EX_SUF) \
BLIS_OAPI_EX_DECLS \
\
/* Only proceed with an induced method if all operands have the same
(complex) datatype. If any datatypes differ, skip the induced method
chooser function and proceed directly with native execution, which is
(complex) datatype, and if that datatype matches the execution
datatype. If any datatypes differ, skip the induced method chooser
function and proceed directly with native execution, which is
where mixed datatype support will be implemented (if at all). */ \
if ( bli_obj_dt( a ) == bli_obj_dt( c ) && \
bli_obj_dt( b ) == bli_obj_dt( c ) && \
if ( bli_obj_dt( c ) == bli_obj_dt( a ) && \
bli_obj_dt( c ) == bli_obj_dt( b ) && \
bli_obj_dt( c ) == bli_obj_exec_dt( c ) && \
bli_obj_is_complex( c ) ) \
{ \
/* Invoke the operation's "ind" function--its induced method front-end.

View File

@@ -38,3 +38,7 @@
#include "bli_gemm_var.h"
// Mixed datatype support.
#ifdef BLIS_ENABLE_GEMM_MD
#include "bli_gemm_md.h"
#endif

View File

@@ -54,8 +54,13 @@ void bli_gemm_front
obj_t c_local;
#ifdef BLIS_ENABLE_SMALL_MATRIX
gint_t status = bli_gemm_small( alpha, a, b, beta, c, cntx, cntl );
if ( status == BLIS_SUCCESS ) return;
// Only handle small problems separately for homogeneous datatypes.
if ( bli_obj_dt( a ) == bli_obj_dt( b ) &&
bli_obj_dt( a ) == bli_obj_dt( c ) )
{
gint_t status = bli_gemm_small( alpha, a, b, beta, c, cntx, cntl );
if ( status == BLIS_SUCCESS ) return;
}
#endif
// Check parameters.
@@ -74,38 +79,33 @@ void bli_gemm_front
bli_obj_alias_to( b, &b_local );
bli_obj_alias_to( c, &c_local );
// An optimization: If C is stored by rows and the micro-kernel prefers
// contiguous columns, or if C is stored by columns and the micro-kernel
// prefers contiguous rows, transpose the entire operation to allow the
// micro-kernel to access elements of C in its preferred manner.
if ( bli_cntx_l3_vir_ukr_dislikes_storage_of( &c_local, BLIS_GEMM_UKR, cntx ) )
{
bli_obj_swap( &a_local, &b_local );
#ifdef BLIS_ENABLE_GEMM_MD
cntx_t cntx_local;
bli_obj_induce_trans( &a_local );
bli_obj_induce_trans( &b_local );
bli_obj_induce_trans( &c_local );
// If any of the storage datatypes differ, or if the computation precision
// differs from the storage precision of C, utilize the mixed datatype
// code path.
// NOTE: If we ever want to support the caller setting the computation
// domain explicitly, we will need to check the computation dt against the
// storage dt of C (instead of the computation precision against the
// storage precision of C).
if ( bli_obj_dt( &c_local ) != bli_obj_dt( &a_local ) ||
bli_obj_dt( &c_local ) != bli_obj_dt( &b_local ) ||
bli_obj_comp_prec( &c_local ) != bli_obj_prec( &c_local ) )
{
// Handle mixed datatype cases in bli_gemm_md(), which may modify
// the objects or the context. (If the context is modified, cntx
// is adjusted to point to cntx_local.)
bli_gemm_md( &a_local, &b_local, beta, &c_local, &cntx_local, &cntx );
}
// Parse and interpret the contents of the rntm_t object to properly
// set the ways of parallelism for each loop, and then make any
// additional modifications necessary for the current operation.
bli_rntm_set_ways_for_op
(
BLIS_GEMM,
BLIS_LEFT, // ignored for gemm/hemm/symm
bli_obj_length( &c_local ),
bli_obj_width( &c_local ),
bli_obj_width( &a_local ),
rntm
);
else // homogeneous datatypes
#endif
{
// A sort of hack for communicating the desired pach schemas for A and B
// to bli_gemm_cntl_create() (via bli_l3_thread_decorator() and
// A sort of hack for communicating the desired pach schemas for A and
// B to bli_gemm_cntl_create() (via bli_l3_thread_decorator() and
// bli_l3_cntl_create_if()). This allows us to access the schemas from
// the control tree, which hopefully reduces some confusion, particularly
// in bli_packm_init().
// the control tree, which hopefully reduces some confusion,
// particularly in bli_packm_init().
if ( bli_cntx_method( cntx ) == BLIS_NAT )
{
bli_obj_set_pack_schema( BLIS_PACKED_ROW_PANELS, &a_local );
@@ -121,6 +121,129 @@ void bli_gemm_front
}
}
#ifdef BLIS_ENABLE_GEMM_MD
// Don't perform the following optimization for ccr or crc cases, as
// those cases are sensitive to the ukernel storage preference (ie:
// transposing the operation would break them).
if ( !bli_gemm_md_is_ccr( &a_local, &b_local, &c_local ) &&
!bli_gemm_md_is_crc( &a_local, &b_local, &c_local ) )
#endif
// An optimization: If C is stored by rows and the micro-kernel prefers
// contiguous columns, or if C is stored by columns and the micro-kernel
// prefers contiguous rows, transpose the entire operation to allow the
// micro-kernel to access elements of C in its preferred manner.
if ( bli_cntx_l3_vir_ukr_dislikes_storage_of( &c_local, BLIS_GEMM_UKR, cntx ) )
{
bli_obj_swap( &a_local, &b_local );
bli_obj_induce_trans( &a_local );
bli_obj_induce_trans( &b_local );
bli_obj_induce_trans( &c_local );
// We must also swap the pack schemas, which were set by bli_gemm_md()
// or the inlined code above.
bli_obj_swap_pack_schemas( &a_local, &b_local );
}
// Parse and interpret the contents of the rntm_t object to properly
// set the ways of parallelism for each loop, and then make any
// additional modifications necessary for the current operation.
bli_rntm_set_ways_for_op
(
BLIS_GEMM,
BLIS_LEFT, // ignored for gemm/hemm/symm
bli_obj_length( &c_local ),
bli_obj_width( &c_local ),
bli_obj_width( &a_local ),
rntm
);
obj_t* cp = &c_local;
obj_t* betap = beta;
#ifdef BLIS_ENABLE_GEMM_MD
#ifdef BLIS_ENABLE_GEMM_MD_EXTRA_MEM
// If any of the following conditions are met, create a temporary matrix
// conformal to C into which we will accumulate the matrix product:
// - the storage precision of C differs from the computation precision;
// - the domains are mixed as crr;
// - the storage format of C does not match the preferred orientation
// of the ccr or crc cases.
// Then, after the computation is complete, this matrix will be copied
// or accumulated back to C.
const bool_t is_ccr_mismatch =
( bli_gemm_md_is_ccr( &a_local, &b_local, &c_local ) &&
!bli_obj_is_col_stored( &c_local ) );
const bool_t is_crc_mismatch =
( bli_gemm_md_is_crc( &a_local, &b_local, &c_local ) &&
!bli_obj_is_row_stored( &c_local ) );
obj_t ct;
bool_t use_ct = FALSE;
// FGVZ: Consider adding another guard here that only creates and uses a
// temporary matrix for accumulation if k < c * kc, where c is some small
// constant like 2. And don't forget to use the same conditional for the
// castm() and free() at the end.
if (
bli_obj_prec( &c_local ) != bli_obj_comp_prec( &c_local ) ||
bli_gemm_md_is_crr( &a_local, &b_local, &c_local ) ||
is_ccr_mismatch ||
is_crc_mismatch
)
{
use_ct = TRUE;
}
// If we need a temporary matrix conformal to C for whatever reason,
// we create it and prepare to use it now.
if ( use_ct )
{
const dim_t m = bli_obj_length( &c_local );
const dim_t n = bli_obj_width( &c_local );
inc_t rs = bli_obj_row_stride( &c_local );
inc_t cs = bli_obj_col_stride( &c_local );
num_t dt_ct = bli_obj_domain( &c_local ) |
bli_obj_comp_prec( &c_local );
// When performing the crr case, accumulate to a contiguously-stored
// real matrix so we do not have to repeatedly update C with general
// stride.
if ( bli_gemm_md_is_crr( &a_local, &b_local, &c_local ) )
dt_ct = BLIS_REAL | bli_obj_comp_prec( &c_local );
// When performing the mismatched ccr or crc cases, now is the time
// to specify the appropriate storage so the gemm_md_c2r_ref() virtual
// microkernel can output directly to C (instead of using a temporary
// microtile).
if ( is_ccr_mismatch ) { rs = 1; cs = m; }
else if ( is_crc_mismatch ) { rs = n; cs = 1; }
bli_obj_create( dt_ct, m, n, rs, cs, &ct );
const num_t dt_exec = bli_obj_exec_dt( &c_local );
const num_t dt_comp = bli_obj_comp_dt( &c_local );
bli_obj_set_target_dt( dt_ct, &ct );
bli_obj_set_exec_dt( dt_exec, &ct );
bli_obj_set_comp_dt( dt_comp, &ct );
// A naive approach would cast C to the comptuation datatype,
// compute with beta, and then cast the result back to the
// user-provided output matrix. However, we employ a different
// approach that halves the number of memops on C (or its
// typecast temporary) by writing the A*B product directly to
// temporary storage, and then using xpbym to scale the
// output matrix by beta and accumulate/cast the A*B product.
//bli_castm( &c_local, &ct );
betap = &BLIS_ZERO;
cp = &ct;
}
#endif
#endif
// Invoke the internal back-end via the thread handler.
bli_l3_thread_decorator
(
@@ -129,11 +252,112 @@ void bli_gemm_front
alpha,
&a_local,
&b_local,
beta,
&c_local,
betap,
cp,
cntx,
rntm,
cntl
);
#ifdef BLIS_ENABLE_GEMM_MD
#ifdef BLIS_ENABLE_GEMM_MD_EXTRA_MEM
// If we created a temporary matrix conformal to C for whatever reason,
// we copy/accumulate the result back to C and then release the object.
if ( use_ct )
{
//bli_castnzm( &ct, &c_local );
bli_xpbym( &ct, beta, &c_local );
bli_obj_free( &ct );
}
#endif
#endif
}
// -----------------------------------------------------------------------------
#if 0
if ( bli_obj_dt( a ) != bli_obj_dt( b ) ||
bli_obj_dt( a ) != bli_obj_dt( c ) ||
bli_obj_comp_prec( c ) != bli_obj_prec( c ) )
{
const bool_t a_is_real = bli_obj_is_real( a );
const bool_t a_is_comp = bli_obj_is_complex( a );
const bool_t b_is_real = bli_obj_is_real( b );
const bool_t b_is_comp = bli_obj_is_complex( b );
const bool_t c_is_real = bli_obj_is_real( c );
const bool_t c_is_comp = bli_obj_is_complex( c );
const bool_t a_is_single = bli_obj_is_single_prec( a );
const bool_t a_is_double = bli_obj_is_double_prec( a );
const bool_t b_is_single = bli_obj_is_single_prec( b );
const bool_t b_is_double = bli_obj_is_double_prec( b );
const bool_t c_is_single = bli_obj_is_single_prec( c );
const bool_t c_is_double = bli_obj_is_double_prec( c );
const bool_t comp_single = bli_obj_comp_prec( c ) == BLIS_SINGLE_PREC;
const bool_t comp_double = bli_obj_comp_prec( c ) == BLIS_DOUBLE_PREC;
const bool_t mixeddomain = bli_obj_domain( c ) != bli_obj_domain( a ) ||
bli_obj_domain( c ) != bli_obj_domain( b );
( void )a_is_real; ( void )a_is_comp;
( void )b_is_real; ( void )b_is_comp;
( void )c_is_real; ( void )c_is_comp;
( void )a_is_single; ( void )a_is_double;
( void )b_is_single; ( void )b_is_double;
( void )c_is_single; ( void )c_is_double;
( void )comp_single; ( void )comp_double;
if (
//( c_is_comp && a_is_comp && b_is_real ) ||
//( c_is_comp && a_is_real && b_is_comp ) ||
//( c_is_real && a_is_comp && b_is_comp ) ||
//( c_is_comp && a_is_real && b_is_real ) ||
//( c_is_real && a_is_comp && b_is_real ) ||
//( c_is_real && a_is_real && b_is_comp ) ||
//FALSE
TRUE
)
{
if (
( c_is_single && a_is_single && b_is_single && mixeddomain ) ||
( c_is_single && a_is_single && b_is_single && comp_single ) ||
( c_is_single && a_is_single && b_is_single && comp_double ) ||
( c_is_single && a_is_single && b_is_double ) ||
( c_is_single && a_is_double && b_is_single ) ||
( c_is_double && a_is_single && b_is_single ) ||
( c_is_single && a_is_double && b_is_double ) ||
( c_is_double && a_is_single && b_is_double ) ||
( c_is_double && a_is_double && b_is_single ) ||
( c_is_double && a_is_double && b_is_double && comp_single ) ||
( c_is_double && a_is_double && b_is_double && comp_double ) ||
( c_is_double && a_is_double && b_is_double && mixeddomain ) ||
FALSE
)
bli_gemm_md_front( alpha, a, b, beta, c, cntx, cntl );
else
bli_gemm_md_zgemm( alpha, a, b, beta, c, cntx, cntl );
}
else
bli_gemm_md_zgemm( alpha, a, b, beta, c, cntx, cntl );
return;
}
#else
#if 0
// If any of the storage datatypes differ, or if the execution precision
// differs from the storage precision of C, utilize the mixed datatype
// code path.
// NOTE: We could check the exec dt against the storage dt of C, but for
// now we don't support the caller setting the execution domain
// explicitly.
if ( bli_obj_dt( a ) != bli_obj_dt( b ) ||
bli_obj_dt( a ) != bli_obj_dt( c ) ||
bli_obj_comp_prec( c ) != bli_obj_prec( c ) )
{
bli_gemm_md_front( alpha, a, b, beta, c, cntx, cntl );
return;
}
#endif
#endif

View File

@@ -69,6 +69,22 @@ void bli_gemm_ker_var2
thrinfo_t* thread
)
{
#ifdef BLIS_ENABLE_GEMM_MD
// By now, A and B have been packed and cast to the execution precision.
// In most cases, such as when storage precision of C differs from the
// execution precision, we utilize the mixed datatype code path. However,
// a few cases still fall within this kernel, such as mixed domain with
// equal precision (ccr, crc, rcc), hence those expressions being disabled
// in the conditional below.
if ( //( bli_obj_domain( c ) != bli_obj_domain( a ) ) ||
//( bli_obj_domain( c ) != bli_obj_domain( b ) ) ||
( bli_obj_dt( c ) != bli_obj_exec_dt( c ) ) )
{
bli_gemm_ker_var2_md( a, b, c, cntx, rntm, cntl, thread );
return;
}
#endif
num_t dt_exec = bli_obj_exec_dt( c );
pack_t schema_a = bli_obj_pack_schema( a );
@@ -112,12 +128,12 @@ void bli_gemm_ker_var2
buf_alpha = bli_obj_internal_scalar_buffer( &scalar_b );
buf_beta = bli_obj_internal_scalar_buffer( c );
// If 1m is being employed on a column- or row-stored matrix with a
// real-valued beta, we can use the real domain macro-kernel, which
// If 1m is being employed on a column- or row-stored matrix with a
// real-valued beta, we can use the real domain macro-kernel, which
// eliminates a little overhead associated with the 1m virtual
// micro-kernel.
#if 1
if ( bli_is_1m_packed( schema_a ) )
if ( bli_cntx_method( cntx ) == BLIS_1M )
{
bli_l3_ind_recast_1m_params
(
@@ -132,6 +148,22 @@ void bli_gemm_ker_var2
}
#endif
#ifdef BLIS_ENABLE_GEMM_MD
// Tweak parameters in select mixed domain cases cases.
bli_gemm_md_ker_var2_recast
(
&dt_exec,
bli_obj_dt( a ),
bli_obj_dt( b ),
bli_obj_dt( c ),
&m, &n, &k,
&pd_a, &ps_a,
&pd_b, &ps_b,
c,
&rs_c, &cs_c
);
#endif
// Index into the type combination array to extract the correct
// function pointer.
f = ftypes[dt_exec];
@@ -267,6 +299,9 @@ void PASTEMAC(ch,varname) \
/* Save the imaginary stride of A and B to the auxinfo_t object. */ \
bli_auxinfo_set_is_a( is_a, &aux ); \
bli_auxinfo_set_is_b( is_b, &aux ); \
\
/* Save the desired output datatype (indicating no typecasting). */ \
/*bli_auxinfo_set_dt_on_output( dt, &aux );*/ \
\
thrinfo_t* caucus = bli_thrinfo_sub_node( thread ); \
dim_t jr_num_threads = bli_thread_n_way( thread ); \

View File

@@ -0,0 +1,405 @@
/*
BLIS
An object-based framework for developing high-performance BLAS-like
libraries.
Copyright (C) 2014, The University of Texas at Austin
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
- Neither the name of The University of Texas at Austin nor the names
of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include "blis.h"
#ifdef BLIS_ENABLE_GEMM_MD
#define FUNCPTR_T gemm_fp
typedef void (*FUNCPTR_T)
(
pack_t schema_a,
pack_t schema_b,
dim_t m,
dim_t n,
dim_t k,
void* alpha,
void* a, inc_t cs_a, inc_t is_a,
dim_t pd_a, inc_t ps_a,
void* b, inc_t rs_b, inc_t is_b,
dim_t pd_b, inc_t ps_b,
void* beta,
void* c, inc_t rs_c, inc_t cs_c,
cntx_t* cntx,
rntm_t* rntm,
thrinfo_t* thread
);
static FUNCPTR_T GENARRAY2_ALL(ftypes,gemm_ker_var2_md);
void bli_gemm_ker_var2_md
(
obj_t* a,
obj_t* b,
obj_t* c,
cntx_t* cntx,
rntm_t* rntm,
cntl_t* cntl,
thrinfo_t* thread
)
{
num_t dt_exec = bli_obj_exec_dt( c );
num_t dt_c = bli_obj_dt( c );
pack_t schema_a = bli_obj_pack_schema( a );
pack_t schema_b = bli_obj_pack_schema( b );
dim_t m = bli_obj_length( c );
dim_t n = bli_obj_width( c );
dim_t k = bli_obj_width( a );
void* buf_a = bli_obj_buffer_at_off( a );
inc_t cs_a = bli_obj_col_stride( a );
inc_t is_a = bli_obj_imag_stride( a );
dim_t pd_a = bli_obj_panel_dim( a );
inc_t ps_a = bli_obj_panel_stride( a );
void* buf_b = bli_obj_buffer_at_off( b );
inc_t rs_b = bli_obj_row_stride( b );
inc_t is_b = bli_obj_imag_stride( b );
dim_t pd_b = bli_obj_panel_dim( b );
inc_t ps_b = bli_obj_panel_stride( b );
void* buf_c = bli_obj_buffer_at_off( c );
inc_t rs_c = bli_obj_row_stride( c );
inc_t cs_c = bli_obj_col_stride( c );
obj_t scalar_a;
obj_t scalar_b;
void* buf_alpha;
void* buf_beta;
FUNCPTR_T f;
// Detach and multiply the scalars attached to A and B.
// NOTE: We know that the internal scalars of A and B are already of the
// target datatypes because the necessary typecasting would have already
// taken place during bli_packm_init().
bli_obj_scalar_detach( a, &scalar_a );
bli_obj_scalar_detach( b, &scalar_b );
bli_mulsc( &scalar_a, &scalar_b );
// Grab the addresses of the internal scalar buffers for the scalar
// merged above and the scalar attached to C.
// NOTE: We know that scalar_b is of type dt_exec due to the above code
// that casts the scalars of A and B to dt_exec via scalar_a and scalar_b,
// and we know that the internal scalar in C is already of the type dt_c
// due to the casting in the implementation of bli_obj_scalar_attach().
buf_alpha = bli_obj_internal_scalar_buffer( &scalar_b );
buf_beta = bli_obj_internal_scalar_buffer( c );
// Tweak parameters in select mixed domain cases cases.
bli_gemm_md_ker_var2_recast
(
&dt_exec,
bli_obj_dt( a ),
bli_obj_dt( b ),
bli_obj_dt( c ),
&m, &n, &k,
&pd_a, &ps_a,
&pd_b, &ps_b,
c,
&rs_c, &cs_c
);
// Index into the type combination array to extract the correct
// function pointer.
f = ftypes[dt_c][dt_exec];
// Invoke the function.
f( schema_a,
schema_b,
m,
n,
k,
buf_alpha,
buf_a, cs_a, is_a,
pd_a, ps_a,
buf_b, rs_b, is_b,
pd_b, ps_b,
buf_beta,
buf_c, rs_c, cs_c,
cntx,
rntm,
thread );
}
#undef GENTFUNC2
#define GENTFUNC2( ctype_c, ctype_e, chc, che, varname ) \
\
void PASTEMAC2(chc,che,varname) \
( \
pack_t schema_a, \
pack_t schema_b, \
dim_t m, \
dim_t n, \
dim_t k, \
void* alpha, \
void* a, inc_t cs_a, inc_t is_a, \
dim_t pd_a, inc_t ps_a, \
void* b, inc_t rs_b, inc_t is_b, \
dim_t pd_b, inc_t ps_b, \
void* beta, \
void* c, inc_t rs_c, inc_t cs_c, \
cntx_t* cntx, \
rntm_t* rntm, \
thrinfo_t* thread \
) \
{ \
const num_t dte = PASTEMAC(che,type); \
/*const num_t dtc = PASTEMAC(chc,type);*/ \
\
/* Alias some constants to simpler names. */ \
const dim_t MR = pd_a; \
const dim_t NR = pd_b; \
/*const dim_t PACKMR = cs_a;*/ \
/*const dim_t PACKNR = rs_b;*/ \
\
/* Query the context for the micro-kernel address and cast it to its
function pointer type. */ \
PASTECH(che,gemm_ukr_ft) \
gemm_ukr = bli_cntx_get_l3_vir_ukr_dt( dte, BLIS_GEMM_UKR, cntx ); \
\
/* Temporary C buffer for edge cases. Note that the strides of this
temporary buffer are set so that they match the storage of the
original C matrix. For example, if C is column-stored, ct will be
column-stored as well. */ \
ctype_e ct[ BLIS_STACK_BUF_MAX_SIZE \
/ sizeof( ctype_e ) ] \
__attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \
const bool_t col_pref = bli_cntx_l3_vir_ukr_prefers_cols_dt( dte, BLIS_GEMM_UKR, cntx ); \
const inc_t rs_ct = ( col_pref ? 1 : NR ); \
const inc_t cs_ct = ( col_pref ? MR : 1 ); \
\
ctype_e* restrict zero = PASTEMAC(che,0); \
ctype_e* restrict a_cast = a; \
ctype_e* restrict b_cast = b; \
ctype_c* restrict c_cast = c; \
ctype_e* restrict alpha_cast = alpha; \
ctype_c* restrict beta_cast = beta; \
ctype_e* restrict b1; \
ctype_c* restrict c1; \
\
dim_t m_iter, m_left; \
dim_t n_iter, n_left; \
dim_t i, j; \
dim_t m_cur; \
dim_t n_cur; \
inc_t rstep_a; \
inc_t cstep_b; \
inc_t rstep_c, cstep_c; \
auxinfo_t aux; \
\
/*
Assumptions/assertions:
rs_a == 1
cs_a == PACKMR
pd_a == MR
ps_a == stride to next micro-panel of A
rs_b == PACKNR
cs_b == 1
pd_b == NR
ps_b == stride to next micro-panel of B
rs_c == (no assumptions)
cs_c == (no assumptions)
*/ \
\
/* If any dimension is zero, return immediately. */ \
if ( bli_zero_dim3( m, n, k ) ) return; \
\
/* Clear the temporary C buffer in case it has any infs or NaNs. */ \
PASTEMAC(che,set0s_mxn)( MR, NR, \
ct, rs_ct, cs_ct ); \
\
/* Compute number of primary and leftover components of the m and n
dimensions. */ \
n_iter = n / NR; \
n_left = n % NR; \
\
m_iter = m / MR; \
m_left = m % MR; \
\
if ( n_left ) ++n_iter; \
if ( m_left ) ++m_iter; \
\
/* Determine some increments used to step through A, B, and C. */ \
rstep_a = ps_a; \
\
cstep_b = ps_b; \
\
rstep_c = rs_c * MR; \
cstep_c = cs_c * NR; \
\
/* Save the pack schemas of A and B to the auxinfo_t object. */ \
bli_auxinfo_set_schema_a( schema_a, &aux ); \
bli_auxinfo_set_schema_b( schema_b, &aux ); \
\
/* Save the imaginary stride of A and B to the auxinfo_t object. */ \
bli_auxinfo_set_is_a( is_a, &aux ); \
bli_auxinfo_set_is_b( is_b, &aux ); \
\
thrinfo_t* caucus = bli_thrinfo_sub_node( thread ); \
dim_t jr_num_threads = bli_thread_n_way( thread ); \
dim_t jr_thread_id = bli_thread_work_id( thread ); \
dim_t ir_num_threads = bli_thread_n_way( caucus ); \
dim_t ir_thread_id = bli_thread_work_id( caucus ); \
\
/* Loop over the n dimension (NR columns at a time). */ \
for ( j = jr_thread_id; j < n_iter; j += jr_num_threads ) \
{ \
ctype_e* restrict a1; \
ctype_c* restrict c11; \
ctype_e* restrict b2; \
\
b1 = b_cast + j * cstep_b; \
c1 = c_cast + j * cstep_c; \
\
n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \
\
/* Initialize our next panel of B to be the current panel of B. */ \
b2 = b1; \
\
/* Loop over the m dimension (MR rows at a time). */ \
for ( i = ir_thread_id; i < m_iter; i += ir_num_threads ) \
{ \
ctype_e* restrict a2; \
\
a1 = a_cast + i * rstep_a; \
c11 = c1 + i * rstep_c; \
\
m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \
\
/* Compute the addresses of the next panels of A and B. */ \
a2 = bli_gemm_get_next_a_upanel( caucus, a1, rstep_a ); \
if ( bli_is_last_iter( i, m_iter, ir_thread_id, ir_num_threads ) ) \
{ \
a2 = a_cast; \
b2 = bli_gemm_get_next_b_upanel( thread, b1, cstep_b ); \
if ( bli_is_last_iter( j, n_iter, jr_thread_id, jr_num_threads ) ) \
b2 = b_cast; \
} \
\
/* Save addresses of next panels of A and B to the auxinfo_t
object. */ \
bli_auxinfo_set_next_a( a2, &aux ); \
bli_auxinfo_set_next_b( b2, &aux ); \
\
/* Always save the micropanel product to the local microtile and
then accumulate it into C via the xpbys_mxn macro. */ \
/*if ( 1 )*/ \
{ \
/*bli_auxinfo_set_dt_on_output( dte, &aux );*/ \
\
/* Invoke the gemm micro-kernel. */ \
gemm_ukr \
( \
k, \
alpha_cast, \
a1, \
b1, \
zero, \
ct, rs_ct, cs_ct, \
&aux, \
cntx \
); \
\
/* Scale the microtile of C and add the result from above. */ \
PASTEMAC3(che,chc,chc,xpbys_mxn) \
( \
m_cur, n_cur, \
ct, rs_ct, cs_ct, \
beta_cast, \
c11, rs_c, cs_c \
); \
} \
/*
else if ( m_cur == MR && n_cur == NR ) \
{ \
bli_auxinfo_set_dt_on_output( dtc, &aux ); \
\
gemm_ukr \
( \
k, \
alpha_cast, \
a1, \
b1, \
( ctype_e* )beta_cast, \
( ctype_e* )c11, rs_c, cs_c, \
&aux, \
cntx \
); \
} \
else \
{ \
bli_auxinfo_set_dt_on_output( dte, &aux ); \
\
gemm_ukr \
( \
k, \
alpha_cast, \
a1, \
b1, \
zero, \
ct, rs_ct, cs_ct, \
&aux, \
cntx \
); \
\
PASTEMAC3(che,chc,chc,xpbys_mxn) \
( \
m_cur, n_cur, \
ct, rs_ct, cs_ct, \
beta_cast, \
c11, rs_c, cs_c \
); \
} \
*/ \
} \
} \
\
/*
PASTEMAC(ch,fprintm)( stdout, "gemm_ker_var2: b1", k, NR, b1, NR, 1, "%4.1f", "" ); \
PASTEMAC(ch,fprintm)( stdout, "gemm_ker_var2: a1", MR, k, a1, 1, MR, "%4.1f", "" ); \
PASTEMAC(ch,fprintm)( stdout, "gemm_ker_var2: c after", m_cur, n_cur, c11, rs_c, cs_c, "%4.1f", "" ); \
*/ \
}
INSERT_GENTFUNC2_BASIC0( gemm_ker_var2_md )
INSERT_GENTFUNC2_MIXDP0( gemm_ker_var2_md )
#endif

901
frame/3/gemm/bli_gemm_md.c Normal file
View File

@@ -0,0 +1,901 @@
/*
BLIS
An object-based framework for developing high-performance BLAS-like
libraries.
Copyright (C) 2014, The University of Texas at Austin
Copyright (C) 2017, Advanced Micro Devices, Inc.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
- Neither the name of The University of Texas at Austin nor the names
of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include "blis.h"
#ifdef BLIS_ENABLE_GEMM_MD
void bli_gemm_md
(
obj_t* a,
obj_t* b,
obj_t* beta,
obj_t* c,
cntx_t* cntx_local,
cntx_t** cntx
)
{
mddm_t doms;
const bool_t a_is_real = bli_obj_is_real( a );
const bool_t a_is_comp = bli_obj_is_complex( a );
const bool_t b_is_real = bli_obj_is_real( b );
const bool_t b_is_comp = bli_obj_is_complex( b );
const bool_t c_is_real = bli_obj_is_real( c );
const bool_t c_is_comp = bli_obj_is_complex( c );
if ( c_is_real && a_is_real && b_is_real )
{
// C_real += A_real * B_real
doms = bli_gemm_md_rrr( a, b, beta, c, cntx_local, cntx );
}
else if ( c_is_comp && a_is_comp && b_is_comp )
{
// C_complex += A_complex * B_complex
doms = bli_gemm_md_ccc( a, b, beta, c, cntx_local, cntx );
}
else if ( c_is_comp && a_is_comp && b_is_real )
{
// C_complex += A_complex * B_real
doms = bli_gemm_md_ccr( a, b, beta, c, cntx_local, cntx );
}
else if ( c_is_comp && a_is_real && b_is_comp )
{
// C_complex += A_real * B_complex
doms = bli_gemm_md_crc( a, b, beta, c, cntx_local, cntx );
}
else if ( c_is_real && a_is_comp && b_is_comp )
{
// C_real += A_complex * B_complex
doms = bli_gemm_md_rcc( a, b, beta, c, cntx_local, cntx );
}
else if ( c_is_comp && a_is_real && b_is_real )
{
// C_complex += A_real * B_real
doms = bli_gemm_md_crr( a, b, beta, c, cntx_local, cntx );
}
else if ( c_is_real && a_is_comp && b_is_real )
{
// C_real += A_complex * B_real
doms = bli_gemm_md_rcr( a, b, beta, c, cntx_local, cntx );
}
else if ( c_is_real && a_is_real && b_is_comp )
{
// C_real += A_real * B_complex
doms = bli_gemm_md_rrc( a, b, beta, c, cntx_local, cntx );
}
else
{
doms.comp = BLIS_REAL;
doms.exec = BLIS_REAL;
// This should never execute.
bli_abort();
}
// Extract the computation and execution domains from the struct
// returned above.
dom_t dom_comp = doms.comp;
dom_t dom_exec = doms.exec;
// Inspect the computation precision of C. (The user may have set
// this explicitly to request the precision in which the computation
// should take place.)
prec_t prec_comp = bli_obj_comp_prec( c );
// The computation precision tells us the target precision of A and B.
// NOTE: We don't set the target domain here. The target domain would
// either be unchanged, or would have been changed in one of the eight
// domain cases above.
bli_obj_set_target_prec( prec_comp, a );
bli_obj_set_target_prec( prec_comp, b );
// Combine the execution domain with the computation precision to form
// the execution datatype. (The computation precision and execution
// precision are always equal.)
num_t dt_exec = dom_exec | prec_comp;
// Set the execution datatypes of A, B, and C.
bli_obj_set_exec_dt( dt_exec, a );
bli_obj_set_exec_dt( dt_exec, b );
bli_obj_set_exec_dt( dt_exec, c );
// Combine the computation precision and computation domain to form the
// computation datatype.
num_t dt_comp = dom_comp | prec_comp;
// Set the computation datatypes of A, B, and C.
bli_obj_set_comp_dt( dt_comp, a );
bli_obj_set_comp_dt( dt_comp, b );
bli_obj_set_comp_dt( dt_comp, c );
#if 0
if ( bli_obj_is_single_prec( c ) ) printf( "%% --> s += " );
else printf( "%% --> d += " );
if ( bli_obj_is_single_prec( a ) ) printf( "s " );
else printf( "d " );
if ( bli_obj_is_single_prec( b ) ) printf( "s\n" );
else printf( "d\n" );
//if ( bli_obj_is_scomplex( a ) &&
// bli_obj_is_dcomplex( b ) &&
// bli_obj_is_float( c ) )
{
printf( "bli_gemm_md(): stor precs after: %d %d %d\n", bli_obj_prec( a ),
bli_obj_prec( b ), bli_obj_prec( c ) );
printf( "bli_gemm_md(): targ precs after: %d %d %d\n", bli_obj_target_prec( a ),
bli_obj_target_prec( b ), bli_obj_target_prec( c ) );
printf( "bli_gemm_md(): exec precs after: %d %d %d\n", bli_obj_exec_prec( a ),
bli_obj_exec_prec( b ), bli_obj_exec_prec( c ) );
printf( "bli_gemm_md(): stor domain after: %d %d %d\n", bli_obj_domain( a ),
bli_obj_domain( b ), bli_obj_domain( c ) );
printf( "bli_gemm_md(): targ domain after: %d %d %d\n", bli_obj_target_domain( a ),
bli_obj_target_domain( b ), bli_obj_target_domain( c ) );
printf( "bli_gemm_md(): exec domain after: %d %d %d\n", bli_obj_exec_domain( a ),
bli_obj_exec_domain( b ), bli_obj_exec_domain( c ) );
}
#endif
}
// -----------------------------------------------------------------------------
// cab
mddm_t bli_gemm_md_ccr
(
obj_t* a,
obj_t* b,
obj_t* beta,
obj_t* c,
cntx_t* cntx_local,
cntx_t** cntx
)
{
mddm_t doms;
// We assume that the requested computation domain is complex.
//dom_t dom_comp_in = bli_obj_comp_domain( c );
//dom_t dom_comp_in = BLIS_COMPLEX;
// For ccr, the computation (ukernel) will be real, but the execution
// will appear complex to other parts of the implementation.
doms.comp = BLIS_REAL;
doms.exec = BLIS_COMPLEX;
// Here we construct the computation datatype, which for the ccr case
// is equal to the real projection of the execution datatype, and use
// that computation datatype to query the corresponding ukernel output
// preference.
const num_t dt = BLIS_REAL | bli_obj_comp_prec( c );
const bool_t row_pref
= bli_cntx_l3_nat_ukr_prefers_rows_dt( dt, BLIS_GEMM_UKR, *cntx );
// We can only perform this case of mixed-domain gemm, C += A*B where
// B is real, if the microkernel prefers column output. If it prefers
// row output, we must induce a transposition and perform C += A*B
// where A (formerly B) is real.
if ( row_pref )
{
bli_obj_swap( a, b );
bli_obj_induce_trans( a );
bli_obj_induce_trans( b );
bli_obj_induce_trans( c );
return bli_gemm_md_crc( a, b, beta, c, cntx_local, cntx );
}
// Create a local copy of the context and then prepare to use this
// context instead of the one passed in.
*cntx_local = **cntx;
*cntx = cntx_local;
// Copy the real domain blocksizes into the slots of their complex
// counterparts.
blksz_t* blksz_mr = bli_cntx_get_blksz( BLIS_MR, *cntx );
blksz_t* blksz_nr = bli_cntx_get_blksz( BLIS_NR, *cntx );
blksz_t* blksz_mc = bli_cntx_get_blksz( BLIS_MC, *cntx );
blksz_t* blksz_nc = bli_cntx_get_blksz( BLIS_NC, *cntx );
blksz_t* blksz_kc = bli_cntx_get_blksz( BLIS_KC, *cntx );
bli_blksz_copy_dt( BLIS_FLOAT, blksz_mr, BLIS_SCOMPLEX, blksz_mr );
bli_blksz_copy_dt( BLIS_DOUBLE, blksz_mr, BLIS_DCOMPLEX, blksz_mr );
bli_blksz_copy_dt( BLIS_FLOAT, blksz_nr, BLIS_SCOMPLEX, blksz_nr );
bli_blksz_copy_dt( BLIS_DOUBLE, blksz_nr, BLIS_DCOMPLEX, blksz_nr );
bli_blksz_copy_dt( BLIS_FLOAT, blksz_mc, BLIS_SCOMPLEX, blksz_mc );
bli_blksz_copy_dt( BLIS_DOUBLE, blksz_mc, BLIS_DCOMPLEX, blksz_mc );
bli_blksz_copy_dt( BLIS_FLOAT, blksz_nc, BLIS_SCOMPLEX, blksz_nc );
bli_blksz_copy_dt( BLIS_DOUBLE, blksz_nc, BLIS_DCOMPLEX, blksz_nc );
bli_blksz_copy_dt( BLIS_FLOAT, blksz_kc, BLIS_SCOMPLEX, blksz_kc );
bli_blksz_copy_dt( BLIS_DOUBLE, blksz_kc, BLIS_DCOMPLEX, blksz_kc );
// Halve both the real and complex MR's (which are both real MR's).
bli_blksz_scale_def_max( 1, 2, BLIS_FLOAT, blksz_mr );
bli_blksz_scale_def_max( 1, 2, BLIS_DOUBLE, blksz_mr );
bli_blksz_scale_def_max( 1, 2, BLIS_SCOMPLEX, blksz_mr );
bli_blksz_scale_def_max( 1, 2, BLIS_DCOMPLEX, blksz_mr );
// Halve both the real and complex MC's (which are both real MC's).
bli_blksz_scale_def_max( 1, 2, BLIS_FLOAT, blksz_mc );
bli_blksz_scale_def_max( 1, 2, BLIS_DOUBLE, blksz_mc );
bli_blksz_scale_def_max( 1, 2, BLIS_SCOMPLEX, blksz_mc );
bli_blksz_scale_def_max( 1, 2, BLIS_DCOMPLEX, blksz_mc );
// Set the pack schemas of objects A and B for normal execution.
bli_obj_set_pack_schema( BLIS_PACKED_ROW_PANELS, a );
bli_obj_set_pack_schema( BLIS_PACKED_COL_PANELS, b );
// static func_t* bli_cntx_get_l3_vir_ukrs( l3ukr_t ukr_id, cntx_t* cntx )
func_t* l3_vir_ukrs = bli_cntx_get_l3_vir_ukrs( BLIS_GEMM_UKR, *cntx );
// Rather than check which complex datatype dt_comp refers to, we set
// the mixed-domain virtual microkernel for both types.
bli_func_set_dt( bli_cgemm_md_c2r_ref, BLIS_SCOMPLEX, l3_vir_ukrs );
bli_func_set_dt( bli_zgemm_md_c2r_ref, BLIS_DCOMPLEX, l3_vir_ukrs );
// Return the computation and execution domains.
return doms;
}
// -----------------------------------------------------------------------------
// cab
mddm_t bli_gemm_md_crc
(
obj_t* a,
obj_t* b,
obj_t* beta,
obj_t* c,
cntx_t* cntx_local,
cntx_t** cntx
)
{
mddm_t doms;
// We assume that the requested computation domain is complex.
//dom_t dom_comp_in = bli_obj_comp_domain( c );
//dom_t dom_comp_in = BLIS_COMPLEX;
// For crc, the computation (ukernel) will be real, but the execution
// will appear complex to other parts of the implementation.
doms.comp = BLIS_REAL;
doms.exec = BLIS_COMPLEX;
// Here we construct the computation datatype, which for the crc case
// is equal to the real projection of the execution datatype, and use
// that computation datatype to query the corresponding ukernel output
// preference.
const num_t dt = BLIS_REAL | bli_obj_comp_prec( c );
const bool_t col_pref
= bli_cntx_l3_nat_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, *cntx );
// We can only perform this case of mixed-domain gemm, C += A*B where
// A is real, if the microkernel prefers row output. If it prefers
// column output, we must induce a transposition and perform C += A*B
// where B (formerly A) is real.
if ( col_pref )
{
bli_obj_swap( a, b );
bli_obj_induce_trans( a );
bli_obj_induce_trans( b );
bli_obj_induce_trans( c );
return bli_gemm_md_ccr( a, b, beta, c, cntx_local, cntx );
}
// Create a local copy of the context and then prepare to use this
// context instead of the one passed in.
*cntx_local = **cntx;
*cntx = cntx_local;
// Copy the real domain blocksizes into the slots of their complex
// counterparts.
blksz_t* blksz_mr = bli_cntx_get_blksz( BLIS_MR, *cntx );
blksz_t* blksz_nr = bli_cntx_get_blksz( BLIS_NR, *cntx );
blksz_t* blksz_mc = bli_cntx_get_blksz( BLIS_MC, *cntx );
blksz_t* blksz_nc = bli_cntx_get_blksz( BLIS_NC, *cntx );
blksz_t* blksz_kc = bli_cntx_get_blksz( BLIS_KC, *cntx );
bli_blksz_copy_dt( BLIS_FLOAT, blksz_mr, BLIS_SCOMPLEX, blksz_mr );
bli_blksz_copy_dt( BLIS_DOUBLE, blksz_mr, BLIS_DCOMPLEX, blksz_mr );
bli_blksz_copy_dt( BLIS_FLOAT, blksz_nr, BLIS_SCOMPLEX, blksz_nr );
bli_blksz_copy_dt( BLIS_DOUBLE, blksz_nr, BLIS_DCOMPLEX, blksz_nr );
bli_blksz_copy_dt( BLIS_FLOAT, blksz_mc, BLIS_SCOMPLEX, blksz_mc );
bli_blksz_copy_dt( BLIS_DOUBLE, blksz_mc, BLIS_DCOMPLEX, blksz_mc );
bli_blksz_copy_dt( BLIS_FLOAT, blksz_nc, BLIS_SCOMPLEX, blksz_nc );
bli_blksz_copy_dt( BLIS_DOUBLE, blksz_nc, BLIS_DCOMPLEX, blksz_nc );
bli_blksz_copy_dt( BLIS_FLOAT, blksz_kc, BLIS_SCOMPLEX, blksz_kc );
bli_blksz_copy_dt( BLIS_DOUBLE, blksz_kc, BLIS_DCOMPLEX, blksz_kc );
// Halve both the real and complex NR's (which are both real NR's).
bli_blksz_scale_def_max( 1, 2, BLIS_FLOAT, blksz_nr );
bli_blksz_scale_def_max( 1, 2, BLIS_DOUBLE, blksz_nr );
bli_blksz_scale_def_max( 1, 2, BLIS_SCOMPLEX, blksz_nr );
bli_blksz_scale_def_max( 1, 2, BLIS_DCOMPLEX, blksz_nr );
// Halve both the real and complex NC's (which are both real NC's).
bli_blksz_scale_def_max( 1, 2, BLIS_FLOAT, blksz_nc );
bli_blksz_scale_def_max( 1, 2, BLIS_DOUBLE, blksz_nc );
bli_blksz_scale_def_max( 1, 2, BLIS_SCOMPLEX, blksz_nc );
bli_blksz_scale_def_max( 1, 2, BLIS_DCOMPLEX, blksz_nc );
// Set the pack schemas of objects A and B for normal execution.
bli_obj_set_pack_schema( BLIS_PACKED_ROW_PANELS, a );
bli_obj_set_pack_schema( BLIS_PACKED_COL_PANELS, b );
// static func_t* bli_cntx_get_l3_vir_ukrs( l3ukr_t ukr_id, cntx_t* cntx )
func_t* l3_vir_ukrs = bli_cntx_get_l3_vir_ukrs( BLIS_GEMM_UKR, *cntx );
// Rather than check which complex datatype dt_comp refers to, we set
// the mixed-domain virtual microkernel for both types.
bli_func_set_dt( bli_cgemm_md_c2r_ref, BLIS_SCOMPLEX, l3_vir_ukrs );
bli_func_set_dt( bli_zgemm_md_c2r_ref, BLIS_DCOMPLEX, l3_vir_ukrs );
// Return the computation and execution domains.
return doms;
}
// -----------------------------------------------------------------------------
// cab
mddm_t bli_gemm_md_rcc
(
obj_t* a,
obj_t* b,
obj_t* beta,
obj_t* c,
cntx_t* cntx_local,
cntx_t** cntx
)
{
mddm_t doms;
// We assume that the requested computation domain is complex.
//dom_t dom_comp_in = bli_obj_comp_domain( c );
//dom_t dom_comp_in = BLIS_COMPLEX;
// For rcc, the computation (ukernel) will be real, and since the output
// matrix C is also real, so must be the execution domain.
doms.comp = BLIS_REAL;
doms.exec = BLIS_REAL;
// Create a local copy of the context and then prepare to use this
// context instead of the one passed in.
*cntx_local = **cntx;
*cntx = cntx_local;
// Copy the real domain blocksizes into the slots of their complex
// counterparts.
blksz_t* blksz_mr = bli_cntx_get_blksz( BLIS_MR, *cntx );
blksz_t* blksz_nr = bli_cntx_get_blksz( BLIS_NR, *cntx );
blksz_t* blksz_mc = bli_cntx_get_blksz( BLIS_MC, *cntx );
blksz_t* blksz_nc = bli_cntx_get_blksz( BLIS_NC, *cntx );
blksz_t* blksz_kc = bli_cntx_get_blksz( BLIS_KC, *cntx );
bli_blksz_copy_dt( BLIS_FLOAT, blksz_mr, BLIS_SCOMPLEX, blksz_mr );
bli_blksz_copy_dt( BLIS_DOUBLE, blksz_mr, BLIS_DCOMPLEX, blksz_mr );
bli_blksz_copy_dt( BLIS_FLOAT, blksz_nr, BLIS_SCOMPLEX, blksz_nr );
bli_blksz_copy_dt( BLIS_DOUBLE, blksz_nr, BLIS_DCOMPLEX, blksz_nr );
bli_blksz_copy_dt( BLIS_FLOAT, blksz_mc, BLIS_SCOMPLEX, blksz_mc );
bli_blksz_copy_dt( BLIS_DOUBLE, blksz_mc, BLIS_DCOMPLEX, blksz_mc );
bli_blksz_copy_dt( BLIS_FLOAT, blksz_nc, BLIS_SCOMPLEX, blksz_nc );
bli_blksz_copy_dt( BLIS_DOUBLE, blksz_nc, BLIS_DCOMPLEX, blksz_nc );
bli_blksz_copy_dt( BLIS_FLOAT, blksz_kc, BLIS_SCOMPLEX, blksz_kc );
bli_blksz_copy_dt( BLIS_DOUBLE, blksz_kc, BLIS_DCOMPLEX, blksz_kc );
// Halve both the real and complex KC's (which are both real KC's).
bli_blksz_scale_def_max( 1, 2, BLIS_FLOAT, blksz_kc );
bli_blksz_scale_def_max( 1, 2, BLIS_DOUBLE, blksz_kc );
bli_blksz_scale_def_max( 1, 2, BLIS_SCOMPLEX, blksz_kc );
bli_blksz_scale_def_max( 1, 2, BLIS_DCOMPLEX, blksz_kc );
// Use the 1r pack schema for both A and B with the conjugation
// of A or B toggled (to produce ar * br - ai * bi).
bli_obj_set_pack_schema( BLIS_PACKED_ROW_PANELS_1R, a );
bli_obj_set_pack_schema( BLIS_PACKED_COL_PANELS_1R, b );
bli_obj_toggle_conj( b );
// We also need to copy over the packm kernels from the 1m
// context. We query the address of that context here.
const num_t dt_comp = bli_obj_dt( a );
cntx_t* cntx_1m = bli_gks_query_ind_cntx( BLIS_1M, dt_comp );
func_t* cntx_funcs = bli_cntx_packm_kers_buf( *cntx );
func_t* cntx_1m_funcs = bli_cntx_packm_kers_buf( cntx_1m );
for ( dim_t i = 0; i <= BLIS_PACKM_31XK_KER; ++i )
{
cntx_funcs[ i ] = cntx_1m_funcs[ i ];
}
// Return the computation and execution domains.
return doms;
}
// -----------------------------------------------------------------------------
// cab
mddm_t bli_gemm_md_crr
(
obj_t* a,
obj_t* b,
obj_t* beta,
obj_t* c,
cntx_t* cntx_local,
cntx_t** cntx
)
{
mddm_t doms;
#ifndef BLIS_ENABLE_GEMM_MD_EXTRA_MEM
obj_t c_real;
#endif
// We assume that the requested computation domain is real.
//dom_t dom_comp_in = bli_obj_comp_domain( c );
//dom_t dom_comp_in = BLIS_REAL;
// For crr, the computation (ukernel) will be real, and since we will
// be updating only the real part of the output matrix C, the exectuion
// domain is also real.
doms.comp = BLIS_REAL;
doms.exec = BLIS_REAL;
// Since the A*B product is real, we can update only the real part of
// C. Thus, we convert the obj_t for the complex matrix to one that
// represents only the real part. HOWEVER, there are two situations in
// which we forgo this trick:
// - If extra memory optimizations are enabled, we should leave C alone
// since we'll be computing A*B to a temporary matrix and accumulating
// that result back to C, and in order for that to work, we need to
// allow that code to continue accessing C as a complex matrix.
// - Even if extra memory optimizations are diabled, logically projecting
// C as a real matrix can still cause problems if beta is non-unit. In
// that situation, the implementation won't get a chance to scale the
// imaginary components of C by beta, and thus it would compute the
// wrong answer. Thus, if beta is non-unit, we must leave C alone.
#ifndef BLIS_ENABLE_GEMM_MD_EXTRA_MEM
if ( bli_obj_equals( beta, &BLIS_ONE ) )
{
bli_obj_real_part( c, &c_real );
// Overwrite the complex obj_t with its real-only alias.
*c = c_real;
}
#endif
// Set the pack schemas of objects A and B for normal execution.
bli_obj_set_pack_schema( BLIS_PACKED_ROW_PANELS, a );
bli_obj_set_pack_schema( BLIS_PACKED_COL_PANELS, b );
// Return the computation and execution domains.
return doms;
}
// -----------------------------------------------------------------------------
// cab
mddm_t bli_gemm_md_rcr
(
obj_t* a,
obj_t* b,
obj_t* beta,
obj_t* c,
cntx_t* cntx_local,
cntx_t** cntx
)
{
mddm_t doms;
obj_t a_real;
// We assume that the requested computation domain is real.
//dom_t dom_comp_in = bli_obj_comp_domain( c );
//dom_t dom_comp_in = BLIS_REAL;
// For rcr, the computation (ukernel) will be real, and since the output
// matrix C is also real, so must be the execution domain.
doms.comp = BLIS_REAL;
doms.exec = BLIS_REAL;
// Convert the obj_t for the complex matrix to one that represents only
// the real part.
bli_obj_real_part( a, &a_real );
// Overwrite the complex obj_t with its real-only alias.
*a = a_real;
// Set the pack schemas of objects A and B for normal execution.
bli_obj_set_pack_schema( BLIS_PACKED_ROW_PANELS, a );
bli_obj_set_pack_schema( BLIS_PACKED_COL_PANELS, b );
// Return the computation and execution domains.
return doms;
}
// -----------------------------------------------------------------------------
// cab
mddm_t bli_gemm_md_rrc
(
obj_t* a,
obj_t* b,
obj_t* beta,
obj_t* c,
cntx_t* cntx_local,
cntx_t** cntx
)
{
mddm_t doms;
obj_t b_real;
// We assume that the requested computation domain is real.
//dom_t dom_comp_in = bli_obj_comp_domain( c );
//dom_t dom_comp_in = BLIS_REAL;
// For rcr, the computation (ukernel) will be real, and since the output
// matrix C is also real, so must be the execution domain.
doms.comp = BLIS_REAL;
doms.exec = BLIS_REAL;
// Convert the obj_t for the complex matrix to one that represents only
// the real part.
bli_obj_real_part( b, &b_real );
// Overwrite the complex obj_t with its real-only alias.
*b = b_real;
// Set the pack schemas of objects A and B for normal execution.
bli_obj_set_pack_schema( BLIS_PACKED_ROW_PANELS, a );
bli_obj_set_pack_schema( BLIS_PACKED_COL_PANELS, b );
// Return the computation and execution domains.
return doms;
}
// -----------------------------------------------------------------------------
// cab
mddm_t bli_gemm_md_rrr
(
obj_t* a,
obj_t* b,
obj_t* beta,
obj_t* c,
cntx_t* cntx_local,
cntx_t** cntx
)
{
mddm_t doms;
// We assume that the requested computation domain is real.
//dom_t dom_comp_in = bli_obj_comp_domain( c );
//dom_t dom_comp_in = BLIS_REAL;
// For rrr, the computation (ukernel) and execution domains are both
// real.
doms.comp = BLIS_REAL;
doms.exec = BLIS_REAL;
// Set the pack schemas of objects A and B for normal execution.
bli_obj_set_pack_schema( BLIS_PACKED_ROW_PANELS, a );
bli_obj_set_pack_schema( BLIS_PACKED_COL_PANELS, b );
// Return the computation and execution domains.
return doms;
}
// -----------------------------------------------------------------------------
// cab
mddm_t bli_gemm_md_ccc
(
obj_t* a,
obj_t* b,
obj_t* beta,
obj_t* c,
cntx_t* cntx_local,
cntx_t** cntx
)
{
mddm_t doms;
// We assume that the requested computation domain is complex.
//dom_t dom_comp_in = bli_obj_comp_domain( c );
//dom_t dom_comp_in = BLIS_COMPLEX;
// For ccc, the computation (ukernel) and execution domains are both
// complex.
doms.comp = BLIS_COMPLEX;
doms.exec = BLIS_COMPLEX;
// Set the pack schemas of objects A and B for normal execution.
bli_obj_set_pack_schema( BLIS_PACKED_ROW_PANELS, a );
bli_obj_set_pack_schema( BLIS_PACKED_COL_PANELS, b );
// Return the computation and execution domains.
return doms;
}
// -----------------------------------------------------------------------------
void bli_gemm_md_front
(
obj_t* alpha,
obj_t* a,
obj_t* b,
obj_t* beta,
obj_t* c,
cntx_t* cntx,
rntm_t* rntm,
cntl_t* cntl
)
{
bli_init_once();
obj_t a_local;
obj_t b_local;
obj_t c_local;
// Check parameters.
if ( bli_error_checking_is_enabled() )
bli_gemm_check( alpha, a, b, beta, c, cntx );
// If alpha is zero, scale by beta and return.
if ( bli_obj_equals( alpha, &BLIS_ZERO ) )
{
bli_scalm( beta, c );
return;
}
// Alias A, B, and C in case we need to apply transformations.
bli_obj_alias_to( a, &a_local );
bli_obj_alias_to( b, &b_local );
bli_obj_alias_to( c, &c_local );
// An optimization: If C is stored by rows and the micro-kernel prefers
// contiguous columns, or if C is stored by columns and the micro-kernel
// prefers contiguous rows, transpose the entire operation to allow the
// micro-kernel to access elements of C in its preferred manner.
if ( bli_cntx_l3_vir_ukr_dislikes_storage_of( &c_local, BLIS_GEMM_UKR, cntx ) )
{
bli_obj_swap( &a_local, &b_local );
bli_obj_induce_trans( &a_local );
bli_obj_induce_trans( &b_local );
bli_obj_induce_trans( &c_local );
}
cntx_t cntx_local;
// Handle mixed domain cases in bli_gemm_md(), which may modify
// the objects or the context. (If the context is modified, cntx
// is adjusted to point to cntx_local.)
bli_gemm_md( &a_local, &b_local, beta, &c_local, &cntx_local, &cntx );
// Record the threading for each level within the context.
bli_rntm_set_ways_for_op
(
BLIS_GEMM,
BLIS_LEFT, // ignored for gemm/hemm/symm
bli_obj_length( &c_local ),
bli_obj_width( &c_local ),
bli_obj_width( &a_local ),
rntm
);
// Invoke the internal back-end via the thread handler.
bli_l3_thread_decorator
(
bli_gemm_int,
BLIS_GEMM, // operation family id
alpha,
&a_local,
&b_local,
beta,
&c_local,
cntx,
rntm,
cntl
);
}
// -----------------------------------------------------------------------------
void bli_gemm_md_zgemm
(
obj_t* alpha,
obj_t* a,
obj_t* b,
obj_t* beta,
obj_t* c,
cntx_t* cntx,
rntm_t* rntm,
cntl_t* cntl
)
{
bli_init_once();
obj_t a_local;
obj_t b_local;
obj_t c_local;
#if 1
obj_t am, bm, cm;
obj_t* c_orig;
//if ( is_md == TRUE )
{
//num_t dt_c2 = bli_obj_dt( c );
//num_t dt_c1 = bli_dt_proj_to_complex( dt_c2 );
//num_t dt_c = bli_dt_proj_to_double_prec( dt_c1 );
//num_t dt_c = bli_obj_dt_proj_to_complex( c );
num_t dt_c = BLIS_DCOMPLEX;
if ( bli_obj_is_single_prec( c ) ) dt_c = BLIS_SCOMPLEX;
else dt_c = BLIS_DCOMPLEX;
if ( bli_obj_is_real( a ) &&
bli_obj_is_real( b ) &&
bli_obj_is_real( c ) ) dt_c = bli_dt_proj_to_real( dt_c );
dim_t m = bli_obj_length( c );
dim_t n = bli_obj_width( c );
dim_t k = bli_obj_width_after_trans( a );
bli_obj_create( dt_c, m, k, 0, 0, &am );
bli_obj_create( dt_c, k, n, 0, 0, &bm );
bli_obj_create( dt_c, m, n, 0, 0, &cm );
//bli_projm( a, &am );
//bli_projm( b, &bm );
//bli_projm( c, &cm );
bli_castm( a, &am );
bli_castm( b, &bm );
bli_castm( c, &cm );
c_orig = c;
a = &am;
b = &bm;
c = &cm;
}
#endif
// Check parameters.
if ( bli_error_checking_is_enabled() )
bli_gemm_check( alpha, a, b, beta, c, cntx );
// If alpha is zero, scale by beta and return.
if ( bli_obj_equals( alpha, &BLIS_ZERO ) )
{
bli_scalm( beta, c );
return;
}
// Alias A, B, and C in case we need to apply transformations.
bli_obj_alias_to( a, &a_local );
bli_obj_alias_to( b, &b_local );
bli_obj_alias_to( c, &c_local );
// An optimization: If C is stored by rows and the micro-kernel prefers
// contiguous columns, or if C is stored by columns and the micro-kernel
// prefers contiguous rows, transpose the entire operation to allow the
// micro-kernel to access elements of C in its preferred manner.
if ( bli_cntx_l3_vir_ukr_dislikes_storage_of( &c_local, BLIS_GEMM_UKR, cntx ) )
{
bli_obj_swap( &a_local, &b_local );
bli_obj_induce_trans( &a_local );
bli_obj_induce_trans( &b_local );
bli_obj_induce_trans( &c_local );
}
{
// A sort of hack for communicating the desired pach schemas for A and B
// to bli_gemm_cntl_create() (via bli_l3_thread_decorator() and
// bli_l3_cntl_create_if()). This allows us to access the schemas from
// the control tree, which hopefully reduces some confusion, particularly
// in bli_packm_init().
if ( bli_cntx_method( cntx ) == BLIS_NAT )
{
bli_obj_set_pack_schema( BLIS_PACKED_ROW_PANELS, &a_local );
bli_obj_set_pack_schema( BLIS_PACKED_COL_PANELS, &b_local );
}
else // if ( bli_cntx_method( cntx ) != BLIS_NAT )
{
pack_t schema_a = bli_cntx_schema_a_block( cntx );
pack_t schema_b = bli_cntx_schema_b_panel( cntx );
bli_obj_set_pack_schema( schema_a, &a_local );
bli_obj_set_pack_schema( schema_b, &b_local );
}
}
// Parse and interpret the contents of the rntm_t object to properly
// set the ways of parallelism for each loop, and then make any
// additional modifications necessary for the current operation.
bli_rntm_set_ways_for_op
(
BLIS_GEMM,
BLIS_LEFT, // ignored for gemm/hemm/symm
bli_obj_length( &c_local ),
bli_obj_width( &c_local ),
bli_obj_width( &a_local ),
rntm
);
// Invoke the internal back-end via the thread handler.
bli_l3_thread_decorator
(
bli_gemm_int,
BLIS_GEMM, // operation family id
alpha,
&a_local,
&b_local,
beta,
&c_local,
cntx,
rntm,
cntl
);
#if 1
//if ( is_md == TRUE )
{
//bli_projm( &cm, c_orig );
bli_castm( &cm, c_orig );
bli_obj_free( &am );
bli_obj_free( &bm );
bli_obj_free( &cm );
}
#endif
}
#endif

327
frame/3/gemm/bli_gemm_md.h Normal file
View File

@@ -0,0 +1,327 @@
/*
BLIS
An object-based framework for developing high-performance BLAS-like
libraries.
Copyright (C) 2014, The University of Texas at Austin
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
- Neither the name of The University of Texas at Austin nor the names
of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include "bli_gemm_md_c2r_ref.h"
// Define a local struct type that makes returning two values easier.
typedef struct mddm_s
{
dom_t comp;
dom_t exec;
} mddm_t;
void bli_gemm_md
(
obj_t* a,
obj_t* b,
obj_t* beta,
obj_t* c,
cntx_t* cntx_local,
cntx_t** cntx
);
mddm_t bli_gemm_md_ccc( obj_t* a, obj_t* b, obj_t* beta, obj_t* c, cntx_t* cntx_l, cntx_t** cntx );
mddm_t bli_gemm_md_ccr( obj_t* a, obj_t* b, obj_t* beta, obj_t* c, cntx_t* cntx_l, cntx_t** cntx );
mddm_t bli_gemm_md_crc( obj_t* a, obj_t* b, obj_t* beta, obj_t* c, cntx_t* cntx_l, cntx_t** cntx );
mddm_t bli_gemm_md_rcc( obj_t* a, obj_t* b, obj_t* beta, obj_t* c, cntx_t* cntx_l, cntx_t** cntx );
mddm_t bli_gemm_md_rrc( obj_t* a, obj_t* b, obj_t* beta, obj_t* c, cntx_t* cntx_l, cntx_t** cntx );
mddm_t bli_gemm_md_rcr( obj_t* a, obj_t* b, obj_t* beta, obj_t* c, cntx_t* cntx_l, cntx_t** cntx );
mddm_t bli_gemm_md_crr( obj_t* a, obj_t* b, obj_t* beta, obj_t* c, cntx_t* cntx_l, cntx_t** cntx );
mddm_t bli_gemm_md_rrr( obj_t* a, obj_t* b, obj_t* beta, obj_t* c, cntx_t* cntx_l, cntx_t** cntx );
// -----------------------------------------------------------------------------
void bli_gemm_md_front
(
obj_t* alpha,
obj_t* a,
obj_t* b,
obj_t* beta,
obj_t* c,
cntx_t* cntx,
rntm_t* rntm,
cntl_t* cntl
);
void bli_gemm_md_zgemm
(
obj_t* alpha,
obj_t* a,
obj_t* b,
obj_t* beta,
obj_t* c,
cntx_t* cntx,
rntm_t* rntm,
cntl_t* cntl
);
// -----------------------------------------------------------------------------
static bool_t bli_gemm_md_is_crr( obj_t* a, obj_t* b, obj_t* c )
{
bool_t r_val = FALSE;
// NOTE: The last conditional subexpression is necessary if/when we
// allow the user to specify the computation domain. (The computation
// domain is currently ignored, but once it is honored as a user-
// settable value, it will affect the execution domain, which is what
// is checked below. Until then, the last expression is not actually
// necessary since crr is already unconditionally associated with an
// execution domain of BLIS_REAL.)
if ( bli_obj_is_complex( c ) &&
bli_obj_is_real( a ) &&
bli_obj_is_real( b ) &&
bli_obj_exec_domain( c ) == BLIS_REAL )
r_val = TRUE;
return r_val;
}
static bool_t bli_gemm_md_is_ccr( obj_t* a, obj_t* b, obj_t* c )
{
bool_t r_val = FALSE;
// NOTE: The last conditional subexpression is necessary if/when we
// allow the user to specify the computation domain. (The computation
// domain is currently ignored, but once it is honored as a user-
// settable value, it will affect the execution domain, which is what
// is checked below. Until then, the last expression is not actually
// necessary since ccr is already unconditionally associated with an
// execution domain of BLIS_COMPLEX.)
if ( bli_obj_is_complex( c ) &&
bli_obj_is_complex( a ) &&
bli_obj_is_real( b ) &&
bli_obj_exec_domain( c ) == BLIS_COMPLEX )
r_val = TRUE;
return r_val;
}
static bool_t bli_gemm_md_is_crc( obj_t* a, obj_t* b, obj_t* c )
{
bool_t r_val = FALSE;
// NOTE: The last conditional subexpression is necessary if/when we
// allow the user to specify the computation domain. (The computation
// domain is currently ignored, but once it is honored as a user-
// settable value, it will affect the execution domain, which is what
// is checked below. Until then, the last expression is not actually
// necessary since crc is already unconditionally associated with an
// execution domain of BLIS_COMPLEX.)
if ( bli_obj_is_complex( c ) &&
bli_obj_is_real( a ) &&
bli_obj_is_complex( b ) &&
bli_obj_exec_domain( c ) == BLIS_COMPLEX )
r_val = TRUE;
return r_val;
}
// -----------------------------------------------------------------------------
static void bli_gemm_md_ker_var2_recast
(
num_t* dt_comp,
num_t dt_a,
num_t dt_b,
num_t dt_c,
dim_t* m,
dim_t* n,
dim_t* k,
inc_t* pd_a, inc_t* ps_a,
inc_t* pd_b, inc_t* ps_b,
obj_t* c,
inc_t* rs_c, inc_t* cs_c
)
{
if ( bli_is_real( dt_c ) &&
bli_is_complex( dt_a ) &&
bli_is_complex( dt_b ) )
{
// The rcc case is executed with a real macrokernel, so we need to
// double the k dimension (because both A and B are packed to the 1r
// schema), and also the panel strides of A and B since they were
// packed as complex matrices and we now need to convert them to
// units of real elements.
*k *= 2;
*ps_a *= 2;
*ps_b *= 2;
}
else if ( bli_is_complex( dt_c ) &&
bli_is_real( dt_a ) &&
bli_is_complex( dt_b ) )
{
#if 1
obj_t beta;
bli_obj_scalar_detach( c, &beta );
if ( //bli_obj_imag_equals( &beta, &BLIS_ZERO ) &&
bli_obj_imag_is_zero( &beta ) &&
bli_is_row_stored( *rs_c, *cs_c ) &&
bli_obj_prec( c ) == bli_obj_comp_prec( c ) )
{
// If beta is real, and C is not general-stored, and the computation
// precision is equal to the storage precision of C, we can use the
// real macrokernel (and real microkernel, which is already stored
// to the real virtual microkernel slots of the context) instead of
// the complex macrokernel and c2r virtual microkernel.
*dt_comp = bli_dt_proj_to_real( *dt_comp );
*n *= 2;
*pd_b *= 2; *ps_b *= 2;
*rs_c *= 2;
}
else
#endif
{
// Generally speaking, the crc case is executed with a complex
// macrokernel, so we need to halve the panel stride of A (which
// is real) since the macrokernel will perform the pointer
// arithmetic in units of complex elements.
*ps_a /= 2;
}
}
else if ( bli_is_complex( dt_c ) &&
bli_is_complex( dt_a ) &&
bli_is_real( dt_b ) )
{
#if 1
obj_t beta;
bli_obj_scalar_detach( c, &beta );
if ( //bli_obj_imag_equals( &beta, &BLIS_ZERO ) &&
bli_obj_imag_is_zero( &beta ) &&
bli_is_col_stored( *rs_c, *cs_c ) &&
bli_obj_prec( c ) == bli_obj_comp_prec( c ) )
{
// If beta is real, and C is not general-stored, and the computation
// precision is equal to the storage precision of C, we can use the
// real macrokernel (and real microkernel, which is already stored
// to the real virtual microkernel slots of the context) instead of
// the complex macrokernel and c2r virtual microkernel.
*dt_comp = bli_dt_proj_to_real( *dt_comp );
*m *= 2;
*pd_a *= 2; *ps_a *= 2;
*cs_c *= 2;
}
else
#endif
{
// Generally speaking, the ccr case is executed with a complex
// macrokernel, so we need to halve the panel stride of B (which
// is real) since the macrokernel will perform the pointer
// arithmetic in units of complex elements.
*ps_b /= 2;
}
}
#if 0
else if ( bli_is_real( dt_c ) &&
bli_is_real( dt_a ) &&
bli_is_real( dt_b ) )
{
// No action needed.
//printf( "gemm_md.h: rrr: m n k are now %d %d %d\n", (int)*m, (int)*n, (int)*k );
}
else if ( bli_is_complex( dt_c ) &&
bli_is_real( dt_a ) &&
bli_is_real( dt_b ) )
{
// No action needed.
}
else if ( bli_is_real( dt_c ) &&
bli_is_complex( dt_a ) &&
bli_is_real( dt_b ) )
{
// No action needed.
}
else if ( bli_is_real( dt_c ) &&
bli_is_real( dt_a ) &&
bli_is_complex( dt_b ) )
{
// No action needed.
}
#endif
}
// -----------------------------------------------------------------------------
//
// Prototype object-based interfaces.
//
#undef GENPROT
#define GENPROT( opname ) \
\
void PASTEMAC0(opname) \
( \
obj_t* a, \
obj_t* b, \
obj_t* c, \
cntx_t* cntx, \
rntm_t* rntm, \
cntl_t* cntl, \
thrinfo_t* thread \
);
GENPROT( gemm_ker_var2_md )
//
// Prototype BLAS-like interfaces with void pointer operands.
//
#undef GENTPROT2
#define GENTPROT2( ctype_c, ctype_e, chc, che, varname ) \
\
void PASTEMAC2(chc,che,varname) \
( \
pack_t schema_a, \
pack_t schema_b, \
dim_t m, \
dim_t n, \
dim_t k, \
void* alpha, \
void* a, inc_t cs_a, inc_t is_a, \
dim_t pd_a, inc_t ps_a, \
void* b, inc_t rs_b, inc_t is_b, \
dim_t pd_b, inc_t ps_b, \
void* beta, \
void* c, inc_t rs_c, inc_t cs_c, \
cntx_t* cntx, \
rntm_t* rntm, \
thrinfo_t* thread \
);
INSERT_GENTPROT2_BASIC0( gemm_ker_var2_md )
INSERT_GENTPROT2_MIXDP0( gemm_ker_var2_md )

View File

@@ -0,0 +1,223 @@
/*
BLIS
An object-based framework for developing high-performance BLAS-like
libraries.
Copyright (C) 2014, The University of Texas at Austin
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
- Neither the name of The University of Texas at Austin nor the names
of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include "blis.h"
#ifdef BLIS_ENABLE_GEMM_MD
#undef GENTFUNCCO
#define GENTFUNCCO( ctype, ctype_r, ch, chr, opname, suf ) \
\
void PASTEMAC2(ch,opname,suf) \
( \
dim_t k, \
ctype* restrict alpha, \
ctype* restrict a, \
ctype* restrict b, \
ctype* restrict beta, \
ctype* restrict c, inc_t rs_c, inc_t cs_c, \
auxinfo_t* restrict data, \
cntx_t* restrict cntx \
) \
{ \
const num_t dt = PASTEMAC(ch,type); \
const num_t dt_r = PASTEMAC(chr,type); \
\
PASTECH(chr,gemm_ukr_ft) \
rgemm_ukr = bli_cntx_get_l3_nat_ukr_dt( dt_r, BLIS_GEMM_UKR, cntx ); \
const bool_t col_pref = bli_cntx_l3_nat_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \
const bool_t row_pref = !col_pref; \
\
const dim_t mr = bli_cntx_get_blksz_def_dt( dt, BLIS_MR, cntx ); \
const dim_t nr = bli_cntx_get_blksz_def_dt( dt, BLIS_NR, cntx ); \
\
ctype ct[ BLIS_STACK_BUF_MAX_SIZE \
/ sizeof( ctype_r ) ] \
__attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \
inc_t rs_ct; \
inc_t cs_ct; \
\
ctype_r* restrict a_r = ( ctype_r* )a; \
\
ctype_r* restrict b_r = ( ctype_r* )b; \
\
ctype_r* restrict zero_r = PASTEMAC(chr,0); \
\
ctype_r* restrict alpha_r = &PASTEMAC(ch,real)( *alpha ); \
/*
ctype_r* restrict alpha_i = &PASTEMAC(ch,imag)( *alpha ); \
*/ \
\
ctype_r* restrict beta_r = &PASTEMAC(ch,real)( *beta ); \
ctype_r* restrict beta_i = &PASTEMAC(ch,imag)( *beta ); \
\
ctype_r* c_use; \
inc_t rs_c_use; \
inc_t cs_c_use; \
\
bool_t using_ct; \
\
/*
PASTEMAC(d,fprintm)( stdout, "gemm_ukr: a", 2*mr, k, \
a_r, 1, 6, "%5.2f", "" ); \
PASTEMAC(d,fprintm)( stdout, "gemm_ukr: b", k, nr, \
b_r, 8, 1, "%5.2f", "" ); \
*/ \
\
/* SAFETY CHECK: The higher level implementation should never
allow an alpha with non-zero imaginary component to be passed
in, because it can't be applied properly using the 1m method.
If alpha is not real, then something is very wrong. */ \
/*
if ( !PASTEMAC(chr,eq0)( *alpha_i ) ) \
bli_check_error_code( BLIS_NOT_YET_IMPLEMENTED ); \
*/ \
\
/* If beta has a non-zero imaginary component OR if c is stored with
general stride, then we compute the alpha*a*b product into temporary
storage and then accumulate that result into c afterwards. Note that
the other two cases concerning disagreement between the storage of C
and the output preference of the micro-kernel, should ONLY occur in
the context of trsm, whereby this virtual micro-kernel is called
directly from the trsm macro-kernel to update the micro-tile b11
that exists within the packed row-panel of B. Indeed that is the
reason those cases MUST be explicitly handled. */ \
if ( !PASTEMAC(chr,eq0)( *beta_i ) ) using_ct = TRUE; \
else if ( bli_is_col_stored( rs_c, cs_c ) && row_pref ) using_ct = TRUE; \
else if ( bli_is_row_stored( rs_c, cs_c ) && col_pref ) using_ct = TRUE; \
else if ( bli_is_gen_stored( rs_c, cs_c ) ) using_ct = TRUE; \
else using_ct = FALSE; \
\
\
if ( using_ct ) \
{ \
/* In the atypical cases, we compute the result into temporary
workspace ct and then accumulated it back to c at the end. */ \
\
/* Set the strides of ct based on the preference of the underlying
native real domain gemm micro-kernel. Note that we set the ct
strides in units of complex elements. */ \
if ( col_pref ) { rs_ct = 1; cs_ct = mr; } \
else { rs_ct = nr; cs_ct = 1; } \
\
c_use = ( ctype_r* )ct; \
rs_c_use = rs_ct; \
cs_c_use = cs_ct; \
\
/* Convert the strides from being in units of complex elements to
be in units of real elements. Note that we don't need to check for
general storage here because that case corresponds to the scenario
where we are using the ct buffer and its rs_ct/cs_ct strides. */ \
if ( bli_is_col_stored( rs_c_use, cs_c_use ) ) cs_c_use *= 2; \
else rs_c_use *= 2; \
\
/* c = beta * c + alpha_r * a * b; */ \
rgemm_ukr \
( \
k, \
alpha_r, \
a_r, \
b_r, \
zero_r, \
c_use, rs_c_use, cs_c_use, \
data, \
cntx \
); \
\
dim_t i, j; \
\
/* Accumulate the final result in ct back to c. */ \
if ( PASTEMAC(ch,eq1)( *beta ) ) \
{ \
for ( j = 0; j < nr; ++j ) \
for ( i = 0; i < mr; ++i ) \
{ \
PASTEMAC(ch,adds)( *(ct + i*rs_ct + j*cs_ct), \
*(c + i*rs_c + j*cs_c ) ); \
} \
} \
else if ( PASTEMAC(ch,eq0)( *beta ) ) \
{ \
for ( j = 0; j < nr; ++j ) \
for ( i = 0; i < mr; ++i ) \
{ \
PASTEMAC(ch,copys)( *(ct + i*rs_ct + j*cs_ct), \
*(c + i*rs_c + j*cs_c ) ); \
} \
} \
else /*if ( !PASTEMAC(ch,eq1)( *beta ) )*/ \
{ \
for ( j = 0; j < nr; ++j ) \
for ( i = 0; i < mr; ++i ) \
{ \
PASTEMAC(ch,xpbys)( *(ct + i*rs_ct + j*cs_ct), \
*beta, \
*(c + i*rs_c + j*cs_c ) ); \
} \
} \
} \
else \
{ \
/* In the typical cases, we use the real part of beta and
accumulate directly into the output matrix c. */ \
\
c_use = ( ctype_r* )c; \
rs_c_use = rs_c; \
cs_c_use = cs_c; \
\
/* Convert the strides from being in units of complex elements to
be in units of real elements. Note that we don't need to check for
general storage here because that case corresponds to the scenario
where we are using the ct buffer and its rs_ct/cs_ct strides. */ \
if ( bli_is_col_stored( rs_c_use, cs_c_use ) ) cs_c_use *= 2; \
else rs_c_use *= 2; \
\
/* c = beta * c + alpha_r * a * b; */ \
rgemm_ukr \
( \
k, \
alpha_r, \
a_r, \
b_r, \
beta_r, \
c_use, rs_c_use, cs_c_use, \
data, \
cntx \
); \
} \
}
INSERT_GENTFUNCCO_BASIC( gemm_md_c2r, BLIS_REF_SUFFIX )
#endif

View File

@@ -0,0 +1,41 @@
/*
BLIS
An object-based framework for developing high-performance BLAS-like
libraries.
Copyright (C) 2014, The University of Texas at Austin
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
- Neither the name of The University of Texas at Austin nor the names
of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
// -- Level-3 native micro-kernel prototype redefinitions ----------------------
#undef gemm_ukr_name
#define gemm_ukr_name gemm_md_c2r_ref
// Include the native micro-kernel API template.
#include "bli_l3_ukr.h"

View File

@@ -278,6 +278,9 @@ void PASTEMAC(ch,varname) \
/* Save the imaginary stride of A and B to the auxinfo_t object. */ \
bli_auxinfo_set_is_a( is_a, &aux ); \
bli_auxinfo_set_is_b( is_b, &aux ); \
\
/* Save the desired output datatype (indicating no typecasting). */ \
/*bli_auxinfo_set_dt_on_output( dt, &aux );*/ \
\
b1 = b_cast; \
c1 = c_cast; \

View File

@@ -278,6 +278,9 @@ void PASTEMAC(ch,varname) \
/* Save the imaginary stride of A and B to the auxinfo_t object. */ \
bli_auxinfo_set_is_a( is_a, &aux ); \
bli_auxinfo_set_is_b( is_b, &aux ); \
\
/* Save the desired output datatype (indicating no typecasting). */ \
/*bli_auxinfo_set_dt_on_output( dt, &aux );*/ \
\
b1 = b_cast; \
c1 = c_cast; \

View File

@@ -316,6 +316,9 @@ void PASTEMAC(ch,varname) \
\
/* Save the imaginary stride of B to the auxinfo_t object. */ \
bli_auxinfo_set_is_b( istep_b, &aux ); \
\
/* Save the desired output datatype (indicating no typecasting). */ \
/*bli_auxinfo_set_dt_on_output( dt, &aux );*/ \
\
b1 = b_cast; \
c1 = c_cast; \

View File

@@ -323,6 +323,9 @@ void PASTEMAC(ch,varname) \
\
/* Save the imaginary stride of B to the auxinfo_t object. */ \
bli_auxinfo_set_is_b( istep_b, &aux ); \
\
/* Save the desired output datatype (indicating no typecasting). */ \
/*bli_auxinfo_set_dt_on_output( dt, &aux );*/ \
\
b1 = b_cast; \
c1 = c_cast; \

View File

@@ -323,6 +323,9 @@ void PASTEMAC(ch,varname) \
\
/* Save the imaginary stride of A to the auxinfo_t object. */ \
bli_auxinfo_set_is_a( istep_a, &aux ); \
\
/* Save the desired output datatype (indicating no typecasting). */ \
/*bli_auxinfo_set_dt_on_output( dt, &aux );*/ \
\
b1 = b_cast; \
c1 = c_cast; \

View File

@@ -324,6 +324,9 @@ void PASTEMAC(ch,varname) \
\
/* Save the imaginary stride of A to the auxinfo_t object. */ \
bli_auxinfo_set_is_a( istep_a, &aux ); \
\
/* Save the desired output datatype (indicating no typecasting). */ \
/*bli_auxinfo_set_dt_on_output( dt, &aux );*/ \
\
b1 = b_cast; \
c1 = c_cast; \

View File

@@ -338,6 +338,9 @@ void PASTEMAC(ch,varname) \
\
/* Save the imaginary stride of B to the auxinfo_t object. */ \
bli_auxinfo_set_is_b( istep_b, &aux ); \
\
/* Save the desired output datatype (indicating no typecasting). */ \
/*bli_auxinfo_set_dt_on_output( dt, &aux );*/ \
\
b1 = b_cast; \
c1 = c_cast; \

View File

@@ -346,6 +346,9 @@ void PASTEMAC(ch,varname) \
\
/* Save the imaginary stride of B to the auxinfo_t object. */ \
bli_auxinfo_set_is_b( istep_b, &aux ); \
\
/* Save the desired output datatype (indicating no typecasting). */ \
/*bli_auxinfo_set_dt_on_output( dt, &aux );*/ \
\
b1 = b_cast; \
c1 = c_cast; \

View File

@@ -368,6 +368,9 @@ void PASTEMAC(ch,varname) \
NOTE: We swap the values for A and B since the triangular
"A" matrix is actually contained within B. */ \
bli_auxinfo_set_is_b( istep_a, &aux ); \
\
/* Save the desired output datatype (indicating no typecasting). */ \
/*bli_auxinfo_set_dt_on_output( dt, &aux );*/ \
\
b1 = b_cast; \
c1 = c_cast; \

View File

@@ -363,6 +363,9 @@ void PASTEMAC(ch,varname) \
NOTE: We swap the values for A and B since the triangular
"A" matrix is actually contained within B. */ \
bli_auxinfo_set_is_b( istep_a, &aux ); \
\
/* Save the desired output datatype (indicating no typecasting). */ \
/*bli_auxinfo_set_dt_on_output( dt, &aux );*/ \
\
b1 = b_cast; \
c1 = c_cast; \

View File

@@ -65,10 +65,12 @@ static inc_t bli_auxinfo_is_b( auxinfo_t* ai )
return ai->is_b;
}
#if 0
static inc_t bli_auxinfo_dt_on_output( auxinfo_t* ai )
{
return ai->dt_on_output;
}
#endif
// auxinfo_t field modification
@@ -105,10 +107,12 @@ static void bli_auxinfo_set_is_b( inc_t is, auxinfo_t* ai )
ai->is_b = is;
}
#if 0
static void bli_auxinfo_set_dt_on_output( num_t dt_on_output, auxinfo_t* ai )
{
ai->dt_on_output = dt_on_output;
}
#endif
#endif

View File

@@ -393,7 +393,9 @@ static bool_t bli_cntx_l3_nat_ukr_prefers_cols_dt( num_t dt, l3ukr_t ukr_id, cnt
static bool_t bli_cntx_l3_nat_ukr_prefers_storage_of( obj_t* obj, l3ukr_t ukr_id, cntx_t* cntx )
{
const num_t dt = bli_obj_dt( obj );
// Note that we use the computation datatype, which may differ from the
// storage datatype of C (when performing a mixed datatype operation).
const num_t dt = bli_obj_comp_dt( obj );
const bool_t ukr_prefers_rows
= bli_cntx_l3_nat_ukr_prefers_rows_dt( dt, ukr_id, cntx );
const bool_t ukr_prefers_cols
@@ -442,9 +444,9 @@ static bool_t bli_cntx_l3_vir_ukr_prefers_cols_dt( num_t dt, l3ukr_t ukr_id, cnt
static bool_t bli_cntx_l3_vir_ukr_prefers_storage_of( obj_t* obj, l3ukr_t ukr_id, cntx_t* cntx )
{
// Note that we use the execution datatype, which may differ from the
// storage datatype of C (though this would happen in very few situations).
const num_t dt = bli_obj_exec_dt( obj );
// Note that we use the computation datatype, which may differ from the
// storage datatype of C (when performing a mixed datatype operation).
const num_t dt = bli_obj_comp_dt( obj );
const bool_t ukr_prefers_rows
= bli_cntx_l3_vir_ukr_prefers_rows_dt( dt, ukr_id, cntx );
const bool_t ukr_prefers_cols

View File

@@ -108,6 +108,7 @@ void bli_obj_create_without_buffer
bli_obj_set_elem_size( elem_size, obj );
bli_obj_set_target_dt( dt, obj );
bli_obj_set_exec_dt( dt, obj );
bli_obj_set_comp_dt( dt, obj );
bli_obj_set_dims( m, n, obj );
bli_obj_set_offs( 0, 0, obj );
bli_obj_set_diag_offset( 0, obj );
@@ -115,8 +116,14 @@ void bli_obj_create_without_buffer
// Set the internal scalar to 1.0.
s = bli_obj_internal_scalar_buffer( obj );
if ( bli_is_float( dt ) ) { bli_sset1s( *(( float* )s) ); }
else if ( bli_is_double( dt ) ) { bli_dset1s( *(( double* )s) ); }
// Always writing the imaginary component is needed in mixed-domain
// scenarios. Failing to do this can lead to reading uninitialized
// memory just before calling the macrokernel (as the internal scalars
// for A and B are merged).
//if ( bli_is_float( dt ) ) { bli_sset1s( *(( float* )s) ); }
//else if ( bli_is_double( dt ) ) { bli_dset1s( *(( double* )s) ); }
if ( bli_is_float( dt ) ) { bli_cset1s( *(( scomplex* )s) ); }
else if ( bli_is_double( dt ) ) { bli_zset1s( *(( dcomplex* )s) ); }
else if ( bli_is_scomplex( dt ) ) { bli_cset1s( *(( scomplex* )s) ); }
else if ( bli_is_dcomplex( dt ) ) { bli_zset1s( *(( dcomplex* )s) ); }
}

View File

@@ -86,6 +86,7 @@ bool_t bli_obj_equals( obj_t* a,
bool_t bli_obj_imag_equals( obj_t* a,
obj_t* b )
{
#if 0
bool_t r_val = FALSE;
num_t dt_a;
num_t dt_b;
@@ -128,7 +129,51 @@ bool_t bli_obj_imag_equals( obj_t* a,
r_val = bli_deq( bli_zimag( *ap_z ), *bp_z );
}
}
#endif
bool_t r_val = FALSE;
// The function is not yet implemented for vectors and matrices.
if ( !bli_obj_is_1x1( a ) ||
!bli_obj_is_1x1( b ) ||
bli_obj_is_complex( b ) )
bli_check_error_code( BLIS_NOT_YET_IMPLEMENTED );
double a_r, a_i;
double b_r, b_i;
// Get the real and imaginary parts of a and cast them to local doubles.
bli_getsc( a, &a_r, &a_i );
// Get the value of b and cast to a local double. (Note: the imaginary part
// of b is ignored since we know b is real.)
bli_getsc( b, &b_r, &b_i );
// Compare the imaginary part of a to the real part of b.
if ( a_i == b_r ) r_val = TRUE;
return r_val;
}
bool_t bli_obj_imag_is_zero( obj_t* a )
{
bool_t r_val = TRUE;
// The function is not yet implemented for vectors and matrices.
if ( !bli_obj_is_1x1( a ) )
bli_check_error_code( BLIS_NOT_YET_IMPLEMENTED );
if ( bli_obj_is_complex( a ) )
{
double a_r, a_i;
// Get the real and imaginary parts and cast them to local doubles.
bli_getsc( a, &a_r, &a_i );
// Compare the imaginary part of a to double-precision zero.
if ( !bli_deq0( a_i ) ) r_val = FALSE;
}
return r_val;
}

View File

@@ -37,3 +37,5 @@ bool_t bli_obj_equals( obj_t* a,
bool_t bli_obj_imag_equals( obj_t* a,
obj_t* b );
bool_t bli_obj_imag_is_zero( obj_t* a );

View File

@@ -0,0 +1,267 @@
/*
BLIS
An object-based framework for developing high-performance BLAS-like
libraries.
Copyright (C) 2014, The University of Texas at Austin
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
- Neither the name of The University of Texas at Austin nor the names
of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include "blis.h"
// NOTE: This is one of the few functions in BLIS that is defined
// with heterogeneous type support. This is done so that we have
// an operation that can be used to typecast (copy-cast) a matrix
// of one datatype to a scalar of another datatype.
typedef void (*FUNCPTR_T)
(
trans_t transa,
dim_t m,
dim_t n,
void* restrict a, inc_t rs_a, inc_t cs_a,
void* restrict b, inc_t rs_b, inc_t cs_b
);
static FUNCPTR_T GENARRAY2_ALL(ftypes,castnzm);
//
// Define object-based interface.
//
void bli_castnzm
(
obj_t* a,
obj_t* b
)
{
num_t dt_a = bli_obj_dt( a );
num_t dt_b = bli_obj_dt( b );
trans_t transa = bli_obj_conjtrans_status( a );
dim_t m = bli_obj_length( b );
dim_t n = bli_obj_width( b );
void* buf_a = bli_obj_buffer_at_off( a );
inc_t rs_a = bli_obj_row_stride( a );
inc_t cs_a = bli_obj_col_stride( a );
void* buf_b = bli_obj_buffer_at_off( b );
inc_t rs_b = bli_obj_row_stride( b );
inc_t cs_b = bli_obj_col_stride( b );
FUNCPTR_T f;
// Check parameters.
if ( bli_error_checking_is_enabled() )
bli_castnzm_check( a, b );
#if 0
if ( bli_obj_dt( a ) == bli_obj_dt( b ) )
{
// If a and b share the same datatype, we can simply use copym.
bli_copym( a, b );
return;
}
#endif
// Index into the type combination array to extract the correct
// function pointer.
f = ftypes[dt_a][dt_b];
// Invoke the void pointer-based function.
f
(
transa,
m,
n,
buf_a, rs_a, cs_a,
buf_b, rs_b, cs_b
);
}
// -----------------------------------------------------------------------------
//
// Define BLAS-like interfaces with typed operands.
//
#undef GENTFUNC2
#define GENTFUNC2( ctype_a, ctype_b, cha, chb, opname ) \
\
void PASTEMAC2(cha,chb,opname) \
( \
trans_t transa, \
dim_t m, \
dim_t n, \
void* restrict a, inc_t rs_a, inc_t cs_a, \
void* restrict b, inc_t rs_b, inc_t cs_b \
) \
{ \
ctype_a* restrict a_cast = a; \
ctype_b* restrict b_cast = b; \
conj_t conja; \
dim_t n_iter; \
dim_t n_elem; \
inc_t lda, inca; \
inc_t ldb, incb; \
dim_t j, i; \
\
/* Set various loop parameters. */ \
bli_set_dims_incs_2m \
( \
transa, \
m, n, rs_a, cs_a, rs_b, cs_b, \
&n_elem, &n_iter, &inca, &lda, &incb, &ldb \
); \
\
/* Extract the conjugation component from the transa parameter. */ \
conja = bli_extract_conj( transa ); \
\
if ( bli_is_conj( conja ) ) \
{ \
if ( inca == 1 && incb == 1 ) \
{ \
for ( j = 0; j < n_iter; ++j ) \
{ \
ctype_a* restrict a1 = a_cast + (j )*lda + (0 )*inca; \
ctype_b* restrict b1 = b_cast + (j )*ldb + (0 )*incb; \
\
for ( i = 0; i < n_elem; ++i ) \
{ \
PASTEMAC2(cha,chb,copyjnzs)( a1[i], b1[i] ); \
} \
} \
} \
else \
{ \
for ( j = 0; j < n_iter; ++j ) \
{ \
ctype_a* restrict a1 = a_cast + (j )*lda + (0 )*inca; \
ctype_b* restrict b1 = b_cast + (j )*ldb + (0 )*incb; \
\
for ( i = 0; i < n_elem; ++i ) \
{ \
PASTEMAC2(cha,chb,copyjnzs)( *a1, *b1 ); \
\
a1 += inca; \
b1 += incb; \
} \
} \
} \
} \
else \
{ \
if ( inca == 1 && incb == 1 ) \
{ \
for ( j = 0; j < n_iter; ++j ) \
{ \
ctype_a* restrict a1 = a_cast + (j )*lda + (0 )*inca; \
ctype_b* restrict b1 = b_cast + (j )*ldb + (0 )*incb; \
\
for ( i = 0; i < n_elem; ++i ) \
{ \
PASTEMAC2(cha,chb,copynzs)( a1[i], b1[i] ); \
} \
} \
} \
else \
{ \
for ( j = 0; j < n_iter; ++j ) \
{ \
ctype_a* restrict a1 = a_cast + (j )*lda + (0 )*inca; \
ctype_b* restrict b1 = b_cast + (j )*ldb + (0 )*incb; \
\
for ( i = 0; i < n_elem; ++i ) \
{ \
PASTEMAC2(cha,chb,copynzs)( *a1, *b1 ); \
\
a1 += inca; \
b1 += incb; \
} \
} \
} \
} \
}
INSERT_GENTFUNC2_BASIC0( castnzm )
INSERT_GENTFUNC2_MIXDP0( castnzm )
// -----------------------------------------------------------------------------
//
// Define object-based _check() function.
//
void bli_castnzm_check
(
obj_t* a,
obj_t* b
)
{
err_t e_val;
// Check object datatypes.
e_val = bli_check_floating_object( a );
bli_check_error_code( e_val );
e_val = bli_check_floating_object( b );
bli_check_error_code( e_val );
// Check structure.
// NOTE: We enforce general structure for now in order to simplify the
// implementation.
bli_check_general_object( a );
bli_check_error_code( e_val );
bli_check_general_object( b );
bli_check_error_code( e_val );
// Check object dimensions.
e_val = bli_check_matrix_object( a );
bli_check_error_code( e_val );
e_val = bli_check_matrix_object( b );
bli_check_error_code( e_val );
e_val = bli_check_conformal_dims( a, b );
bli_check_error_code( e_val );
// Check object buffers (for non-NULLness).
e_val = bli_check_object_buffer( a );
bli_check_error_code( e_val );
e_val = bli_check_object_buffer( b );
bli_check_error_code( e_val );
}

View File

@@ -0,0 +1,73 @@
/*
BLIS
An object-based framework for developing high-performance BLAS-like
libraries.
Copyright (C) 2014, The University of Texas at Austin
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
- Neither the name of The University of Texas at Austin nor the names
of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
//
// Prototype object-based interface.
//
void bli_castnzm
(
obj_t* a,
obj_t* b
);
//
// Prototype BLAS-like interfaces with heterogeneous-typed operands.
//
#undef GENTPROT2
#define GENTPROT2( ctype_a, ctype_b, cha, chb, opname ) \
\
void PASTEMAC2(cha,chb,opname) \
( \
trans_t transa, \
dim_t m, \
dim_t n, \
void* a, inc_t rs_a, inc_t cs_a, \
void* b, inc_t rs_b, inc_t cs_b \
);
INSERT_GENTPROT2_BASIC0( castnzm )
INSERT_GENTPROT2_MIXDP0( castnzm )
//
// Prototype object-based _check() function.
//
void bli_castnzm_check
(
obj_t* a,
obj_t* b
);

View File

@@ -67,7 +67,7 @@ void PASTEF772(chx,cha,blasname) \
that is, we just always sub-optimally implement those cases
by casting alpha to ctype_x (potentially the complex domain) and
using the homogeneous datatype instance according to that type. */ \
PASTEMAC2(cha,chx,cast)( (ftype_a*)alpha, alpha_cast ); \
PASTEMAC2(cha,chx,copys)( *alpha, alpha_cast ); \
\
/* Call BLIS interface. */ \
PASTEMAC2(chx,blisname,BLIS_TAPI_EX_SUF) \

View File

@@ -99,6 +99,25 @@
#endif
// -- MIXED DATATYPE SUPPORT ---------------------------------------------------
// Enable mixed datatype support?
#ifdef BLIS_DISABLE_MIXED_DT
#undef BLIS_ENABLE_GEMM_MD
#else
// Default behavior is enabled.
#define BLIS_ENABLE_GEMM_MD
#endif
// Enable memory-intensive optimizations for mixed datatype support?
#ifdef BLIS_DISABLE_MIXED_DT_EXTRA_MEM
#undef BLIS_ENABLE_GEMM_MD_EXTRA_MEM
#else
// Default behavior is enabled.
#define BLIS_ENABLE_GEMM_MD_EXTRA_MEM
#endif
// -- MISCELLANEOUS OPTIONS ----------------------------------------------------
// Do NOT require the cross-blocksize constraints. That is, do not enforce

View File

@@ -65,6 +65,18 @@ static tname PASTECH(opname,_fpa)[BLIS_NUM_FP_TYPES+1] = \
// -- "Smart" two-operand macro --
#define GENARRAY_FPA2(tname,op) \
\
static tname PASTECH(op,_fpa2)[BLIS_NUM_FP_TYPES][BLIS_NUM_FP_TYPES] = \
{ \
{ ( tname )PASTEMAC2(s,s,op), ( tname )PASTEMAC2(s,c,op), ( tname )PASTEMAC2(s,d,op), ( tname )PASTEMAC2(s,z,op) }, \
{ ( tname )PASTEMAC2(c,s,op), ( tname )PASTEMAC2(c,c,op), ( tname )PASTEMAC2(c,d,op), ( tname )PASTEMAC2(c,z,op) }, \
{ ( tname )PASTEMAC2(d,s,op), ( tname )PASTEMAC2(d,c,op), ( tname )PASTEMAC2(d,d,op), ( tname )PASTEMAC2(d,z,op) }, \
{ ( tname )PASTEMAC2(z,s,op), ( tname )PASTEMAC2(z,c,op), ( tname )PASTEMAC2(z,d,op), ( tname )PASTEMAC2(z,z,op) } \
}
// -- "Smart" two-operand macro --
/*
#define GENARRAY2_VFP(arrayname,op) \
\

View File

@@ -529,6 +529,52 @@ GENTFUNC2R( dcomplex, scomplex, double, z, c, d, tfuncname, varname )
// -- Mixed domain/precision (all) two-operand macro with real projection of first operand --
// -- (no auxiliary arguments) --
#define INSERT_GENTFUNC2R_MIXDP0( tfuncname ) \
\
GENTFUNC2( float, double, s, d, tfuncname ) \
GENTFUNC2( float, scomplex, s, c, tfuncname ) \
GENTFUNC2( float, dcomplex, s, z, tfuncname ) \
\
GENTFUNC2( double, float, d, s, tfuncname ) \
GENTFUNC2( double, scomplex, d, c, tfuncname ) \
GENTFUNC2( double, dcomplex, d, z, tfuncname ) \
\
GENTFUNC2( scomplex, float, c, s, tfuncname ) \
GENTFUNC2( scomplex, double, c, d, tfuncname ) \
GENTFUNC2( scomplex, dcomplex, c, z, tfuncname ) \
\
GENTFUNC2( dcomplex, float, z, s, tfuncname ) \
GENTFUNC2( dcomplex, double, z, d, tfuncname ) \
GENTFUNC2( dcomplex, scomplex, z, c, tfuncname )
// -- (one auxiliary argument) --
#define INSERT_GENTFUNC2R_MIX_DP( tfuncname, varname ) \
\
GENTFUNC2( float, double, s, d, tfuncname, varname ) \
GENTFUNC2( float, scomplex, s, c, tfuncname, varname ) \
GENTFUNC2( float, dcomplex, s, z, tfuncname, varname ) \
\
GENTFUNC2( double, float, d, s, tfuncname, varname ) \
GENTFUNC2( double, scomplex, d, c, tfuncname, varname ) \
GENTFUNC2( double, dcomplex, d, z, tfuncname, varname ) \
\
GENTFUNC2( scomplex, float, c, s, tfuncname, varname ) \
GENTFUNC2( scomplex, double, c, d, tfuncname, varname ) \
GENTFUNC2( scomplex, dcomplex, c, z, tfuncname, varname ) \
\
GENTFUNC2( dcomplex, float, z, s, tfuncname, varname ) \
GENTFUNC2( dcomplex, double, z, d, tfuncname, varname ) \
GENTFUNC2( dcomplex, scomplex, z, c, tfuncname, varname )
// -- Macros for functions with three primary operands -------------------------

View File

@@ -136,6 +136,20 @@ static void bli_toggle_bool( bool_t* b )
#define bli_ctype ( BLIS_SCOMPLEX )
#define bli_ztype ( BLIS_DCOMPLEX )
// return C type for char
#define bli_sctype float
#define bli_dctype double
#define bli_cctype scomplex
#define bli_zctype dcomplex
// return real proj of C type for char
#define bli_sctyper float
#define bli_dctyper double
#define bli_cctyper float
#define bli_zctyper double
// return default format specifier for char

View File

@@ -122,13 +122,15 @@ static num_t bli_obj_dt_proj_to_double_prec( obj_t* obj )
static bool_t bli_obj_is_real( obj_t* obj )
{
return ( bool_t )
( bli_obj_domain( obj ) == BLIS_BITVAL_REAL );
( bli_obj_domain( obj ) == BLIS_BITVAL_REAL &&
!bli_obj_is_const( obj ) );
}
static bool_t bli_obj_is_complex( obj_t* obj )
{
return ( bool_t )
( bli_obj_domain( obj ) == BLIS_BITVAL_COMPLEX );
( bli_obj_domain( obj ) == BLIS_BITVAL_COMPLEX &&
!bli_obj_is_const( obj ) );
}
static num_t bli_obj_dt_proj_to_real( obj_t* obj )
@@ -179,6 +181,24 @@ static prec_t bli_obj_exec_prec( obj_t* obj )
( ( obj->info & BLIS_EXEC_PREC_BIT ) >> BLIS_EXEC_DT_SHIFT );
}
static num_t bli_obj_comp_dt( obj_t* obj )
{
return ( num_t )
( ( obj->info & BLIS_COMP_DT_BITS ) >> BLIS_COMP_DT_SHIFT );
}
static dom_t bli_obj_comp_domain( obj_t* obj )
{
return ( dom_t )
( ( obj->info & BLIS_COMP_DOMAIN_BIT ) >> BLIS_COMP_DT_SHIFT );
}
static prec_t bli_obj_comp_prec( obj_t* obj )
{
return ( prec_t )
( ( obj->info & BLIS_COMP_PREC_BIT ) >> BLIS_COMP_DT_SHIFT );
}
static trans_t bli_obj_conjtrans_status( obj_t* obj )
{
return ( trans_t )
@@ -454,6 +474,24 @@ static void bli_obj_set_exec_prec( prec_t dt, obj_t* obj )
( obj->info & ~BLIS_EXEC_PREC_BIT ) | ( dt << BLIS_EXEC_DT_SHIFT );
}
static void bli_obj_set_comp_dt( num_t dt, obj_t* obj )
{
obj->info = ( objbits_t )
( obj->info & ~BLIS_COMP_DT_BITS ) | ( dt << BLIS_COMP_DT_SHIFT );
}
static void bli_obj_set_comp_domain( dom_t dt, obj_t* obj )
{
obj->info = ( objbits_t )
( obj->info & ~BLIS_COMP_DOMAIN_BIT ) | ( dt << BLIS_COMP_DT_SHIFT );
}
static void bli_obj_set_comp_prec( prec_t dt, obj_t* obj )
{
obj->info = ( objbits_t )
( obj->info & ~BLIS_COMP_PREC_BIT ) | ( dt << BLIS_COMP_DT_SHIFT );
}
static void bli_obj_set_pack_schema( pack_t schema, obj_t* obj )
{
obj->info = ( objbits_t )
@@ -1183,9 +1221,11 @@ static void bli_obj_real_part( obj_t* c, obj_t* r )
const num_t dt_stor_r = bli_dt_proj_to_real( bli_obj_dt( c ) );
const num_t dt_targ_r = bli_dt_proj_to_real( bli_obj_target_dt( c ) );
const num_t dt_exec_r = bli_dt_proj_to_real( bli_obj_exec_dt( c ) );
const num_t dt_comp_r = bli_dt_proj_to_real( bli_obj_comp_dt( c ) );
bli_obj_set_dt( dt_stor_r, r );
bli_obj_set_target_dt( dt_targ_r, r );
bli_obj_set_exec_dt( dt_exec_r, r );
bli_obj_set_comp_dt( dt_comp_r, r );
// Update the element size.
siz_t es_c = bli_obj_elem_size( c );
@@ -1212,9 +1252,11 @@ static void bli_obj_imag_part( obj_t* c, obj_t* i )
const num_t dt_stor_r = bli_dt_proj_to_real( bli_obj_dt( c ) );
const num_t dt_targ_r = bli_dt_proj_to_real( bli_obj_target_dt( c ) );
const num_t dt_exec_r = bli_dt_proj_to_real( bli_obj_exec_dt( c ) );
const num_t dt_comp_r = bli_dt_proj_to_real( bli_obj_comp_dt( c ) );
bli_obj_set_dt( dt_stor_r, i );
bli_obj_set_target_dt( dt_targ_r, i );
bli_obj_set_exec_dt( dt_exec_r, i );
bli_obj_set_comp_dt( dt_comp_r, i );
// Update the element size.
siz_t es_c = bli_obj_elem_size( c );
@@ -1251,13 +1293,24 @@ static void bli_obj_scalar_set_dt_buffer( obj_t* obj, num_t dt_aux, num_t* dt, v
}
}
// Swap object contents.
// Swap all object fields (metadata/properties).
static void bli_obj_swap( obj_t* a, obj_t* b )
{
obj_t t = *b; *b = *a; *a = t;
}
// Swap object pack schemas.
static void bli_obj_swap_pack_schemas( obj_t* a, obj_t* b )
{
const pack_t schema_a = bli_obj_pack_schema( a );
const pack_t schema_b = bli_obj_pack_schema( b );
bli_obj_set_pack_schema( schema_b, a );
bli_obj_set_pack_schema( schema_a, b );
}
// Induce a transposition on an object: swap dimensions, increments, and
// offsets, then clear the trans bit.

View File

@@ -140,14 +140,15 @@
#include "bli_axmys.h"
#include "bli_cast.h"
#include "bli_conjs.h"
#include "bli_copys.h"
#include "bli_copyjs.h"
#include "bli_copycjs.h"
#include "bli_copynzs.h"
#include "bli_copyjnzs.h"
#include "bli_dots.h"
#include "bli_dotjs.h"
@@ -191,8 +192,8 @@
// Inlined scalar macros in loops
#include "bli_adds_mxn.h"
#include "bli_adds_mxn_uplo.h"
#include "bli_copys_mxn.h"
#include "bli_set0s_mxn.h"
#include "bli_copys_mxn.h"
#include "bli_xpbys_mxn.h"
#include "bli_xpbys_mxn_uplo.h"

View File

@@ -257,6 +257,10 @@ typedef dcomplex f77_dcomplex;
- 1 == Hermitian
- 2 == symmetric
- 3 == triangular
31 ~ 29 Execution numerical datatype
- 29: domain (0 == real, 1 == complex)
- 30: precision (0 == single, 1 == double)
- 31: used to encode integer, constant types
*/
#define BLIS_DATATYPE_SHIFT 0
@@ -286,6 +290,9 @@ typedef dcomplex f77_dcomplex;
#define BLIS_PACK_REV_IF_LOWER_SHIFT 24
#define BLIS_PACK_BUFFER_SHIFT 25
#define BLIS_STRUC_SHIFT 27
#define BLIS_COMP_DT_SHIFT 29
#define BLIS_COMP_DOMAIN_SHIFT 29
#define BLIS_COMP_PREC_SHIFT 30
//
// -- BLIS info bit field masks ------------------------------------------------
@@ -318,6 +325,9 @@ typedef dcomplex f77_dcomplex;
#define BLIS_PACK_REV_IF_LOWER_BIT ( 0x1 << BLIS_PACK_REV_IF_LOWER_SHIFT )
#define BLIS_PACK_BUFFER_BITS ( 0x3 << BLIS_PACK_BUFFER_SHIFT )
#define BLIS_STRUC_BITS ( 0x3 << BLIS_STRUC_SHIFT )
#define BLIS_COMP_DT_BITS ( 0x7 << BLIS_COMP_DT_SHIFT )
#define BLIS_COMP_DOMAIN_BIT ( 0x1 << BLIS_COMP_DOMAIN_SHIFT )
#define BLIS_COMP_PREC_BIT ( 0x1 << BLIS_COMP_PREC_SHIFT )
//
@@ -603,13 +613,15 @@ typedef enum
typedef enum
{
BLIS_3MH = 0,
BLIS_3MH = 0,
BLIS_3M1,
BLIS_4MH,
BLIS_4M1B,
BLIS_4M1A,
BLIS_1M,
BLIS_NAT
BLIS_NAT,
BLIS_IND_FIRST = 0,
BLIS_IND_LAST = BLIS_NAT
} ind_t;
#define BLIS_NUM_IND_METHODS (BLIS_NAT+1)
@@ -1003,7 +1015,7 @@ typedef struct
inc_t is_b;
// The type to convert to on output.
num_t dt_on_output;
//num_t dt_on_output;
} auxinfo_t;

View File

@@ -671,12 +671,68 @@
#define MOVD(_0, _1) INSTR_(movd, _0, _1)
#define MOVL(_0, _1) INSTR_(movl, _0, _1)
#define MOVQ(_0, _1) INSTR_(movq, _0, _1)
#define CMOVA(_0, _1) INSTR_(cmova, _0, _1)
#define CMOVAE(_0, _1) INSTR_(cmovae, _0, _1)
#define CMOVB(_0, _1) INSTR_(cmovb, _0, _1)
#define CMOVBE(_0, _1) INSTR_(cmovbe, _0, _1)
#define CMOVC(_0, _1) INSTR_(cmovc, _0, _1)
#define CMOVP(_0, _1) INSTR_(cmovp, _0, _1)
#define CMOVO(_0, _1) INSTR_(cmovo, _0, _1)
#define CMOVS(_0, _1) INSTR_(cmovs, _0, _1)
#define CMOVE(_0, _1) INSTR_(cmove, _0, _1)
#define CMOVZ(_0, _1) INSTR_(cmovz, _0, _1)
#define CMOVG(_0, _1) INSTR_(cmovg, _0, _1)
#define CMOVGE(_0, _1) INSTR_(cmovge, _0, _1)
#define CMOVL(_0, _1) INSTR_(cmovl, _0, _1)
#define CMOVLE(_0, _1) INSTR_(cmovle, _0, _1)
#define CMOVNA(_0, _1) INSTR_(cmovna, _0, _1)
#define CMOVNAE(_0, _1) INSTR_(cmovnae, _0, _1)
#define CMOVNB(_0, _1) INSTR_(cmovnb, _0, _1)
#define CMOVNBE(_0, _1) INSTR_(cmovnbe, _0, _1)
#define CMOVNC(_0, _1) INSTR_(cmovnc, _0, _1)
#define CMOVNP(_0, _1) INSTR_(cmovnp, _0, _1)
#define CMOVNO(_0, _1) INSTR_(cmovno, _0, _1)
#define CMOVNS(_0, _1) INSTR_(cmovns, _0, _1)
#define CMOVNE(_0, _1) INSTR_(cmovne, _0, _1)
#define CMOVNZ(_0, _1) INSTR_(cmovnz, _0, _1)
#define CMOVNG(_0, _1) INSTR_(cmovng, _0, _1)
#define CMOVNGE(_0, _1) INSTR_(cmovnge, _0, _1)
#define CMOVNL(_0, _1) INSTR_(cmovnl, _0, _1)
#define CMOVNLE(_0, _1) INSTR_(cmovnle, _0, _1)
#define lea(_0, _1) LEA(_0, _1)
#define mov(_0, _1) MOV(_0, _1)
#define movd(_0, _1) MOVD(_0, _1)
#define movl(_0, _1) MOVL(_0, _1)
#define movq(_0, _1) MOVQ(_0, _1)
#define cmova(_0, _1) CMOVA(_0, _1)
#define cmovae(_0, _1) CMOVAE(_0, _1)
#define cmovb(_0, _1) CMOVB(_0, _1)
#define cmovbe(_0, _1) CMOVBE(_0, _1)
#define cmovc(_0, _1) CMOVC(_0, _1)
#define cmovp(_0, _1) CMOVP(_0, _1)
#define cmovo(_0, _1) CMOVO(_0, _1)
#define cmovs(_0, _1) CMOVS(_0, _1)
#define cmove(_0, _1) CMOVE(_0, _1)
#define cmovz(_0, _1) CMOVZ(_0, _1)
#define cmovg(_0, _1) CMOVG(_0, _1)
#define cmovge(_0, _1) CMOVGE(_0, _1)
#define cmovl(_0, _1) CMOVL(_0, _1)
#define cmovle(_0, _1) CMOVLE(_0, _1)
#define cmovna(_0, _1) CMOVNA(_0, _1)
#define cmovnae(_0, _1) CMOVNAE(_0, _1)
#define cmovnb(_0, _1) CMOVNB(_0, _1)
#define cmovnbe(_0, _1) CMOVNBE(_0, _1)
#define cmovnc(_0, _1) CMOVNC(_0, _1)
#define cmovnp(_0, _1) CMOVNP(_0, _1)
#define cmovno(_0, _1) CMOVNO(_0, _1)
#define cmovns(_0, _1) CMOVNS(_0, _1)
#define cmovne(_0, _1) CMOVNE(_0, _1)
#define cmovnz(_0, _1) CMOVNZ(_0, _1)
#define cmovng(_0, _1) CMOVNG(_0, _1)
#define cmovnge(_0, _1) CMOVNGE(_0, _1)
#define cmovnl(_0, _1) CMOVNL(_0, _1)
#define cmovnle(_0, _1) CMOVNLE(_0, _1)
// Vector moves
@@ -1038,6 +1094,28 @@
#define v4fnmaddss(_0, _1, _2) V4FNMADDSS(_0, _1, _2)
#define v4fnmaddps(_0, _1, _2) V4FNMADDPS(_0, _1, _2)
// Conversions
#define CVTSS2SD(_0, _1) INSTR_(cvtss2sd, _0, _1)
#define CVTSD2SS(_0, _1) INSTR_(cvtsd2ss, _0, _1)
#define CVTPS2PD(_0, _1) INSTR_(cvtps2pd, _0, _1)
#define CVTPD2PS(_0, _1) INSTR_(cvtpd2ps, _0, _1)
#define cvtss2sd(_0, _1) CVTSS2SD(_0, _1)
#define cvtsd2ss(_0, _1) CVTSD2SS(_0, _1)
#define cvtps2pd(_0, _1) CVTPS2PD(_0, _1)
#define cvtpd2ps(_0, _1) CVTPD2PS(_0, _1)
#define VCVTSS2SD(_0, _1) INSTR_(vcvtss2sd, _0, _1)
#define VCVTSD2SS(_0, _1) INSTR_(vcvtsd2ss, _0, _1)
#define VCVTPS2PD(_0, _1) INSTR_(vcvtps2pd, _0, _1)
#define VCVTPD2PS(_0, _1) INSTR_(vcvtpd2ps, _0, _1)
#define vcvtss2sd(_0, _1) VCVTSS2SD(_0, _1)
#define vcvtsd2ss(_0, _1) VCVTSD2SS(_0, _1)
#define vcvtps2pd(_0, _1) VCVTPS2PD(_0, _1)
#define vcvtpd2ps(_0, _1) VCVTPD2PS(_0, _1)
// Vector shuffles
#define PSHUFD(_0, _1, _2) INSTR_(pshufd, _0, _1, _2)

View File

@@ -126,6 +126,7 @@ extern "C" {
#include "bli_setri.h"
#include "bli_castm.h"
#include "bli_castnzm.h"
#include "bli_castv.h"
#include "bli_projm.h"
#include "bli_projv.h"

View File

@@ -37,6 +37,18 @@
// set1ms_mxn
#define bli_sset1ms_mxn( schema, offm, offn, m, n, a, y, rs_y, cs_y, ld_y ) \
{ \
/* Include real domain version to facilitate macro-izing mixed-datatype
components of packm. */ \
}
#define bli_dset1ms_mxn( schema, offm, offn, m, n, a, y, rs_y, cs_y, ld_y ) \
{ \
/* Include real domain version to facilitate macro-izing mixed-datatype
components of packm. */ \
}
#define bli_cset1ms_mxn( schema, offm, offn, m, n, a, y, rs_y, cs_y, ld_y ) \
{ \
inc_t offm_local = offm; \

View File

@@ -41,62 +41,473 @@
// - The first char encodes the type of x.
// - The second char encodes the type of y.
#define bli_ssadds_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y ) \
{ \
dim_t _i, _j; \
\
for ( _j = 0; _j < n; ++_j ) \
for ( _i = 0; _i < m; ++_i ) \
bli_ssadds( *(x + _i*rs_x + _j*cs_x), \
*(y + _i*rs_y + _j*cs_y) ); \
// xy = ?s
static void bli_ssadds_mxn( const dim_t m, const dim_t n, float* restrict x, const inc_t rs_x, const inc_t cs_x,
float* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_ssadds( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_ssadds( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_ssadds( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_dsadds_mxn( const dim_t m, const dim_t n, double* restrict x, const inc_t rs_x, const inc_t cs_x,
float* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_dsadds( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_dsadds( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_dsadds( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_csadds_mxn( const dim_t m, const dim_t n, scomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
float* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_csadds( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_csadds( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_csadds( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_zsadds_mxn( const dim_t m, const dim_t n, dcomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
float* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_zsadds( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_zsadds( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_zsadds( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
#define bli_ddadds_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y ) \
{ \
dim_t _i, _j; \
\
for ( _j = 0; _j < n; ++_j ) \
for ( _i = 0; _i < m; ++_i ) \
bli_ddadds( *(x + _i*rs_x + _j*cs_x), \
*(y + _i*rs_y + _j*cs_y) ); \
// xy = ?d
static void bli_sdadds_mxn( const dim_t m, const dim_t n, float* restrict x, const inc_t rs_x, const inc_t cs_x,
double* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_sdadds( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_sdadds( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_sdadds( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_ddadds_mxn( const dim_t m, const dim_t n, double* restrict x, const inc_t rs_x, const inc_t cs_x,
double* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_ddadds( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_ddadds( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_ddadds( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_cdadds_mxn( const dim_t m, const dim_t n, scomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
double* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_cdadds( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_cdadds( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_cdadds( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_zdadds_mxn( const dim_t m, const dim_t n, dcomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
double* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_zdadds( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_zdadds( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_zdadds( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
#define bli_ccadds_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y ) \
{ \
dim_t _i, _j; \
\
for ( _j = 0; _j < n; ++_j ) \
for ( _i = 0; _i < m; ++_i ) \
bli_ccadds( *(x + _i*rs_x + _j*cs_x), \
*(y + _i*rs_y + _j*cs_y) ); \
// xy = ?c
static void bli_scadds_mxn( const dim_t m, const dim_t n, float* restrict x, const inc_t rs_x, const inc_t cs_x,
scomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_scadds( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_scadds( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_scadds( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_dcadds_mxn( const dim_t m, const dim_t n, double* restrict x, const inc_t rs_x, const inc_t cs_x,
scomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_dcadds( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_dcadds( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_dcadds( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_ccadds_mxn( const dim_t m, const dim_t n, scomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
scomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_ccadds( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_ccadds( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_ccadds( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_zcadds_mxn( const dim_t m, const dim_t n, dcomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
scomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_zcadds( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_zcadds( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_zcadds( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
#define bli_zzadds_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y ) \
{ \
dim_t _i, _j; \
\
for ( _j = 0; _j < n; ++_j ) \
for ( _i = 0; _i < m; ++_i ) \
bli_zzadds( *(x + _i*rs_x + _j*cs_x), \
*(y + _i*rs_y + _j*cs_y) ); \
// xy = ?z
static void bli_szadds_mxn( const dim_t m, const dim_t n, float* restrict x, const inc_t rs_x, const inc_t cs_x,
dcomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_szadds( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_szadds( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_szadds( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_dzadds_mxn( const dim_t m, const dim_t n, double* restrict x, const inc_t rs_x, const inc_t cs_x,
dcomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_dzadds( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_dzadds( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_dzadds( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_czadds_mxn( const dim_t m, const dim_t n, scomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
dcomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_czadds( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_czadds( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_czadds( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_zzadds_mxn( const dim_t m, const dim_t n, dcomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
dcomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_zzadds( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_zzadds( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_zzadds( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
#define bli_sadds_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y ) \
{ \
bli_ssadds_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y ); \
static void bli_sadds_mxn( const dim_t m, const dim_t n, float* restrict x, const inc_t rs_x, const inc_t cs_x,
float* restrict y, const inc_t rs_y, const inc_t cs_y )
{
bli_ssadds_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y );
}
#define bli_dadds_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y ) \
{ \
bli_ddadds_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y ); \
static void bli_dadds_mxn( const dim_t m, const dim_t n, double* restrict x, const inc_t rs_x, const inc_t cs_x,
double* restrict y, const inc_t rs_y, const inc_t cs_y )
{
bli_ddadds_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y );
}
#define bli_cadds_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y ) \
{ \
bli_ccadds_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y ); \
static void bli_cadds_mxn( const dim_t m, const dim_t n, scomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
scomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
bli_ccadds_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y );
}
#define bli_zadds_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y ) \
{ \
bli_zzadds_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y ); \
static void bli_zadds_mxn( const dim_t m, const dim_t n, dcomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
dcomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
bli_zzadds_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y );
}
#endif

View File

@@ -0,0 +1,80 @@
/*
BLIS
An object-based framework for developing high-performance BLAS-like
libraries.
Copyright (C) 2014, The University of Texas at Austin
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
- Neither the name of The University of Texas at Austin nor the names
of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifndef BLIS_COPYJNZS_H
#define BLIS_COPYJNZS_H
// copyjnzs
// Notes:
// - The first char encodes the type of x.
// - The second char encodes the type of y.
#define bli_sscopyjnzs( x, y ) bli_scopyjris( bli_sreal(x), bli_simag(x), bli_sreal(y), bli_simag(y) )
#define bli_dscopyjnzs( x, y ) bli_scopyjris( bli_dreal(x), bli_dimag(x), bli_sreal(y), bli_simag(y) )
#define bli_cscopyjnzs( x, y ) bli_scopyjris( bli_creal(x), bli_cimag(x), bli_sreal(y), bli_simag(y) )
#define bli_zscopyjnzs( x, y ) bli_scopyjris( bli_zreal(x), bli_zimag(x), bli_sreal(y), bli_simag(y) )
#define bli_sdcopyjnzs( x, y ) bli_dcopyjris( bli_sreal(x), bli_simag(x), bli_dreal(y), bli_dimag(y) )
#define bli_ddcopyjnzs( x, y ) bli_dcopyjris( bli_dreal(x), bli_dimag(x), bli_dreal(y), bli_dimag(y) )
#define bli_cdcopyjnzs( x, y ) bli_dcopyjris( bli_creal(x), bli_cimag(x), bli_dreal(y), bli_dimag(y) )
#define bli_zdcopyjnzs( x, y ) bli_dcopyjris( bli_zreal(x), bli_zimag(x), bli_dreal(y), bli_dimag(y) )
// NOTE: Use of scopyjris() (implemented in terms of scopyris()), is so we
// don't touch the imaginary part of y.
#define bli_sccopyjnzs( x, y ) bli_scopyjris( bli_sreal(x), bli_simag(x), bli_creal(y), bli_cimag(y) )
#define bli_dccopyjnzs( x, y ) bli_scopyjris( bli_dreal(x), bli_dimag(x), bli_creal(y), bli_cimag(y) )
#define bli_cccopyjnzs( x, y ) bli_ccopyjris( bli_creal(x), bli_cimag(x), bli_creal(y), bli_cimag(y) )
#define bli_zccopyjnzs( x, y ) bli_ccopyjris( bli_zreal(x), bli_zimag(x), bli_creal(y), bli_cimag(y) )
// NOTE: Use of dcopyjris() (implemented in terms of dcopyris()), is so we
// don't touch the imaginary part of y.
#define bli_szcopyjnzs( x, y ) bli_dcopyjris( bli_sreal(x), bli_simag(x), bli_zreal(y), bli_zimag(y) )
#define bli_dzcopyjnzs( x, y ) bli_dcopyjris( bli_dreal(x), bli_dimag(x), bli_zreal(y), bli_zimag(y) )
#define bli_czcopyjnzs( x, y ) bli_zcopyjris( bli_creal(x), bli_cimag(x), bli_zreal(y), bli_zimag(y) )
#define bli_zzcopyjnzs( x, y ) bli_zcopyjris( bli_zreal(x), bli_zimag(x), bli_zreal(y), bli_zimag(y) )
#define bli_iicopyjnzs( x, y ) { (y) = ( gint_t ) (x); }
#define bli_scopyjnzs( x, y ) bli_sscopyjnzs( x, y )
#define bli_dcopyjnzs( x, y ) bli_ddcopyjnzs( x, y )
#define bli_ccopyjnzs( x, y ) bli_cccopyjnzs( x, y )
#define bli_zcopyjnzs( x, y ) bli_zzcopyjnzs( x, y )
#define bli_icopyjnzs( x, y ) bli_iicopyjnzs( x, y )
#endif

View File

@@ -0,0 +1,78 @@
/*
BLIS
An object-based framework for developing high-performance BLAS-like
libraries.
Copyright (C) 2014, The University of Texas at Austin
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
- Neither the name of The University of Texas at Austin nor the names
of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifndef BLIS_COPYNZS_H
#define BLIS_COPYNZS_H
// copynzs
// Notes:
// - The first char encodes the type of x.
// - The second char encodes the type of y.
#define bli_sscopynzs( x, y ) bli_scopyris( bli_sreal(x), bli_simag(x), bli_sreal(y), bli_simag(y) )
#define bli_dscopynzs( x, y ) bli_scopyris( bli_dreal(x), bli_dimag(x), bli_sreal(y), bli_simag(y) )
#define bli_cscopynzs( x, y ) bli_scopyris( bli_creal(x), bli_cimag(x), bli_sreal(y), bli_simag(y) )
#define bli_zscopynzs( x, y ) bli_scopyris( bli_zreal(x), bli_zimag(x), bli_sreal(y), bli_simag(y) )
#define bli_sdcopynzs( x, y ) bli_dcopyris( bli_sreal(x), bli_simag(x), bli_dreal(y), bli_dimag(y) )
#define bli_ddcopynzs( x, y ) bli_dcopyris( bli_dreal(x), bli_dimag(x), bli_dreal(y), bli_dimag(y) )
#define bli_cdcopynzs( x, y ) bli_dcopyris( bli_creal(x), bli_cimag(x), bli_dreal(y), bli_dimag(y) )
#define bli_zdcopynzs( x, y ) bli_dcopyris( bli_zreal(x), bli_zimag(x), bli_dreal(y), bli_dimag(y) )
// NOTE: Use of scopyris() is so we don't touch the imaginary part of y.
#define bli_sccopynzs( x, y ) bli_scopyris( bli_sreal(x), bli_simag(x), bli_creal(y), bli_cimag(y) )
#define bli_dccopynzs( x, y ) bli_scopyris( bli_dreal(x), bli_dimag(x), bli_creal(y), bli_cimag(y) )
#define bli_cccopynzs( x, y ) bli_ccopyris( bli_creal(x), bli_cimag(x), bli_creal(y), bli_cimag(y) )
#define bli_zccopynzs( x, y ) bli_ccopyris( bli_zreal(x), bli_zimag(x), bli_creal(y), bli_cimag(y) )
// NOTE: Use of dcopyris() is so we don't touch the imaginary part of y.
#define bli_szcopynzs( x, y ) bli_dcopyris( bli_sreal(x), bli_simag(x), bli_zreal(y), bli_zimag(y) )
#define bli_dzcopynzs( x, y ) bli_dcopyris( bli_dreal(x), bli_dimag(x), bli_zreal(y), bli_zimag(y) )
#define bli_czcopynzs( x, y ) bli_zcopyris( bli_creal(x), bli_cimag(x), bli_zreal(y), bli_zimag(y) )
#define bli_zzcopynzs( x, y ) bli_zcopyris( bli_zreal(x), bli_zimag(x), bli_zreal(y), bli_zimag(y) )
#define bli_iicopynzs( x, y ) { (y) = ( gint_t ) (x); }
#define bli_scopynzs( x, y ) bli_sscopynzs( x, y )
#define bli_dcopynzs( x, y ) bli_ddcopynzs( x, y )
#define bli_ccopynzs( x, y ) bli_cccopynzs( x, y )
#define bli_zcopynzs( x, y ) bli_zzcopynzs( x, y )
#define bli_icopynzs( x, y ) bli_iicopynzs( x, y )
#endif

View File

@@ -51,32 +51,18 @@
#define bli_cdcopys( x, y ) bli_dcopyris( bli_creal(x), bli_cimag(x), bli_dreal(y), bli_dimag(y) )
#define bli_zdcopys( x, y ) bli_dcopyris( bli_zreal(x), bli_zimag(x), bli_dreal(y), bli_dimag(y) )
#ifndef BLIS_ENABLE_C99_COMPLEX
// NOTE: Use of ccopyris() means the imaginary part of y will be overwritten with zero.
#define bli_sccopys( x, y ) bli_ccopyris( bli_sreal(x), bli_simag(x), bli_creal(y), bli_cimag(y) )
#define bli_dccopys( x, y ) bli_ccopyris( bli_dreal(x), bli_dimag(x), bli_creal(y), bli_cimag(y) )
#define bli_cccopys( x, y ) bli_ccopyris( bli_creal(x), bli_cimag(x), bli_creal(y), bli_cimag(y) )
#define bli_zccopys( x, y ) bli_ccopyris( bli_zreal(x), bli_zimag(x), bli_creal(y), bli_cimag(y) )
// NOTE: Use of zcopyris() means the imaginary part of y will be overwritten with zero.
#define bli_szcopys( x, y ) bli_zcopyris( bli_sreal(x), bli_simag(x), bli_zreal(y), bli_zimag(y) )
#define bli_dzcopys( x, y ) bli_zcopyris( bli_dreal(x), bli_dimag(x), bli_zreal(y), bli_zimag(y) )
#define bli_czcopys( x, y ) bli_zcopyris( bli_creal(x), bli_cimag(x), bli_zreal(y), bli_zimag(y) )
#define bli_zzcopys( x, y ) bli_zcopyris( bli_zreal(x), bli_zimag(x), bli_zreal(y), bli_zimag(y) )
#else // ifdef BLIS_ENABLE_C99_COMPLEX
#define bli_sccopys( x, y ) { (y) = (x); }
#define bli_dccopys( x, y ) { (y) = (x); }
#define bli_cccopys( x, y ) { (y) = (x); }
#define bli_zccopys( x, y ) { (y) = (x); }
#define bli_szcopys( x, y ) { (y) = (x); }
#define bli_dzcopys( x, y ) { (y) = (x); }
#define bli_czcopys( x, y ) { (y) = (x); }
#define bli_zzcopys( x, y ) { (y) = (x); }
#endif // BLIS_ENABLE_C99_COMPLEX
#define bli_iicopys( x, y ) { (y) = ( gint_t ) (x); }

View File

@@ -41,62 +41,470 @@
// - The first char encodes the type of x.
// - The second char encodes the type of y.
#define bli_sscopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y ) \
{ \
dim_t _i, _j; \
\
for ( _j = 0; _j < n; ++_j ) \
for ( _i = 0; _i < m; ++_i ) \
bli_sscopys( *(x + _i*rs_x + _j*cs_x), \
*(y + _i*rs_y + _j*cs_y) ); \
// xy = ?s
static void bli_sscopys_mxn( const dim_t m, const dim_t n, float* restrict x, const inc_t rs_x, const inc_t cs_x,
float* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_sscopys( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_sscopys( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_sscopys( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_dscopys_mxn( const dim_t m, const dim_t n, double* restrict x, const inc_t rs_x, const inc_t cs_x,
float* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_dscopys( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_dscopys( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_dscopys( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_cscopys_mxn( const dim_t m, const dim_t n, scomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
float* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_cscopys( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_cscopys( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_cscopys( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_zscopys_mxn( const dim_t m, const dim_t n, dcomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
float* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_zscopys( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_zscopys( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_zscopys( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
#define bli_ddcopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y ) \
{ \
dim_t _i, _j; \
\
for ( _j = 0; _j < n; ++_j ) \
for ( _i = 0; _i < m; ++_i ) \
bli_ddcopys( *(x + _i*rs_x + _j*cs_x), \
*(y + _i*rs_y + _j*cs_y) ); \
// xy = ?d
static void bli_sdcopys_mxn( const dim_t m, const dim_t n, float* restrict x, const inc_t rs_x, const inc_t cs_x,
double* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_sdcopys( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_sdcopys( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_sdcopys( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_ddcopys_mxn( const dim_t m, const dim_t n, double* restrict x, const inc_t rs_x, const inc_t cs_x,
double* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_ddcopys( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_ddcopys( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_ddcopys( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_cdcopys_mxn( const dim_t m, const dim_t n, scomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
double* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_cdcopys( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_cdcopys( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_cdcopys( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_zdcopys_mxn( const dim_t m, const dim_t n, dcomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
double* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_zdcopys( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_zdcopys( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_zdcopys( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
#define bli_cccopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y ) \
{ \
dim_t _i, _j; \
\
for ( _j = 0; _j < n; ++_j ) \
for ( _i = 0; _i < m; ++_i ) \
bli_cccopys( *(x + _i*rs_x + _j*cs_x), \
*(y + _i*rs_y + _j*cs_y) ); \
// xy = ?c
static void bli_sccopys_mxn( const dim_t m, const dim_t n, float* restrict x, const inc_t rs_x, const inc_t cs_x,
scomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_sccopys( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_sccopys( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_sccopys( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_dccopys_mxn( const dim_t m, const dim_t n, double* restrict x, const inc_t rs_x, const inc_t cs_x,
scomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_dccopys( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_dccopys( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_dccopys( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_cccopys_mxn( const dim_t m, const dim_t n, scomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
scomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_cccopys( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_cccopys( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_cccopys( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_zccopys_mxn( const dim_t m, const dim_t n, dcomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
scomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_zccopys( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_zccopys( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_zccopys( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
#define bli_zzcopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y ) \
{ \
dim_t _i, _j; \
\
for ( _j = 0; _j < n; ++_j ) \
for ( _i = 0; _i < m; ++_i ) \
bli_zzcopys( *(x + _i*rs_x + _j*cs_x), \
*(y + _i*rs_y + _j*cs_y) ); \
// xy = ?c
static void bli_szcopys_mxn( const dim_t m, const dim_t n, float* restrict x, const inc_t rs_x, const inc_t cs_x,
dcomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_szcopys( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_szcopys( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_szcopys( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_dzcopys_mxn( const dim_t m, const dim_t n, double* restrict x, const inc_t rs_x, const inc_t cs_x,
dcomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_dzcopys( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_dzcopys( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_dzcopys( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_czcopys_mxn( const dim_t m, const dim_t n, scomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
dcomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_czcopys( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_czcopys( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_czcopys( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_zzcopys_mxn( const dim_t m, const dim_t n, dcomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
dcomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_zzcopys( *(x + ii + jj*cs_x),
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_zzcopys( *(x + ii*rs_x + jj),
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_zzcopys( *(x + ii*rs_x + jj*cs_x),
*(y + ii*rs_y + jj*cs_y) );
}
}
#define bli_scopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y ) \
{ \
bli_sscopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y ); \
static void bli_scopys_mxn( const dim_t m, const dim_t n, float* restrict x, const inc_t rs_x, const inc_t cs_x,
float* restrict y, const inc_t rs_y, const inc_t cs_y )
{
bli_sscopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y );
}
#define bli_dcopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y ) \
{ \
bli_ddcopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y ); \
static void bli_dcopys_mxn( const dim_t m, const dim_t n, double* restrict x, const inc_t rs_x, const inc_t cs_x,
double* restrict y, const inc_t rs_y, const inc_t cs_y )
{
bli_ddcopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y );
}
#define bli_ccopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y ) \
{ \
bli_cccopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y ); \
static void bli_ccopys_mxn( const dim_t m, const dim_t n, scomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
scomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
bli_cccopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y );
}
#define bli_zcopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y ) \
{ \
bli_zzcopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y ); \
static void bli_zcopys_mxn( const dim_t m, const dim_t n, dcomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
dcomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
bli_zzcopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y );
}
#endif

View File

@@ -42,106 +42,605 @@
// - The second char encodes the type of b.
// - The third char encodes the type of y.
#define bli_sssxpbys_mxn( m, n, x, rs_x, cs_x, beta, y, rs_y, cs_y ) \
{ \
/* If beta is zero, overwrite y with x (in case y has infs or NaNs). */ \
if ( bli_seq0( *beta ) ) \
{ \
bli_sscopys_mxn( m, n, \
x, rs_x, cs_x, \
y, rs_y, cs_y ); \
} \
else \
{ \
dim_t _i, _j; \
\
for ( _j = 0; _j < n; ++_j ) \
for ( _i = 0; _i < m; ++_i ) \
bli_sssxpbys( *(x + _i*rs_x + _j*cs_x), \
*(beta), \
*(y + _i*rs_y + _j*cs_y) ); \
} \
// xby = ?ss
static void bli_sssxpbys_mxn( const dim_t m, const dim_t n, float* restrict x, const inc_t rs_x, const inc_t cs_x,
float* restrict beta,
float* restrict y, const inc_t rs_y, const inc_t cs_y )
{
// If beta is zero, overwrite y with x (in case y has infs or NaNs).
if ( bli_seq0( *beta ) )
{
bli_sscopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y );
return;
}
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_sssxpbys( *(x + ii + jj*cs_x), *beta,
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_sssxpbys( *(x + ii*rs_x + jj), *beta,
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_sssxpbys( *(x + ii*rs_x + jj*cs_x), *beta,
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_dssxpbys_mxn( const dim_t m, const dim_t n, double* restrict x, const inc_t rs_x, const inc_t cs_x,
float* restrict beta,
float* restrict y, const inc_t rs_y, const inc_t cs_y )
{
// If beta is zero, overwrite y with x (in case y has infs or NaNs).
if ( bli_seq0( *beta ) )
{
bli_dscopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y );
return;
}
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_dssxpbys( *(x + ii + jj*cs_x), *beta,
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_dssxpbys( *(x + ii*rs_x + jj), *beta,
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_dssxpbys( *(x + ii*rs_x + jj*cs_x), *beta,
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_cssxpbys_mxn( const dim_t m, const dim_t n, scomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
float* restrict beta,
float* restrict y, const inc_t rs_y, const inc_t cs_y )
{
// If beta is zero, overwrite y with x (in case y has infs or NaNs).
if ( bli_seq0( *beta ) )
{
bli_cscopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y );
return;
}
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_cssxpbys( *(x + ii + jj*cs_x), *beta,
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_cssxpbys( *(x + ii*rs_x + jj), *beta,
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_cssxpbys( *(x + ii*rs_x + jj*cs_x), *beta,
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_zssxpbys_mxn( const dim_t m, const dim_t n, dcomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
float* restrict beta,
float* restrict y, const inc_t rs_y, const inc_t cs_y )
{
// If beta is zero, overwrite y with x (in case y has infs or NaNs).
if ( bli_seq0( *beta ) )
{
bli_zscopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y );
return;
}
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_zssxpbys( *(x + ii + jj*cs_x), *beta,
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_zssxpbys( *(x + ii*rs_x + jj), *beta,
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_zssxpbys( *(x + ii*rs_x + jj*cs_x), *beta,
*(y + ii*rs_y + jj*cs_y) );
}
}
#define bli_dddxpbys_mxn( m, n, x, rs_x, cs_x, beta, y, rs_y, cs_y ) \
{ \
/* If beta is zero, overwrite y with x (in case y has infs or NaNs). */ \
if ( bli_deq0( *beta ) ) \
{ \
bli_ddcopys_mxn( m, n, \
x, rs_x, cs_x, \
y, rs_y, cs_y ); \
} \
else \
{ \
dim_t _i, _j; \
\
for ( _j = 0; _j < n; ++_j ) \
for ( _i = 0; _i < m; ++_i ) \
bli_dddxpbys( *(x + _i*rs_x + _j*cs_x), \
*(beta), \
*(y + _i*rs_y + _j*cs_y) ); \
} \
// xby = ?dd
static void bli_sddxpbys_mxn( const dim_t m, const dim_t n, float* restrict x, const inc_t rs_x, const inc_t cs_x,
double* restrict beta,
double* restrict y, const inc_t rs_y, const inc_t cs_y )
{
// If beta is zero, overwrite y with x (in case y has infs or NaNs).
if ( bli_deq0( *beta ) )
{
bli_sdcopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y );
return;
}
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_sddxpbys( *(x + ii + jj*cs_x), *beta,
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_sddxpbys( *(x + ii*rs_x + jj), *beta,
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_sddxpbys( *(x + ii*rs_x + jj*cs_x), *beta,
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_dddxpbys_mxn( const dim_t m, const dim_t n, double* restrict x, const inc_t rs_x, const inc_t cs_x,
double* restrict beta,
double* restrict y, const inc_t rs_y, const inc_t cs_y )
{
// If beta is zero, overwrite y with x (in case y has infs or NaNs).
if ( bli_deq0( *beta ) )
{
bli_ddcopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y );
return;
}
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_dddxpbys( *(x + ii + jj*cs_x), *beta,
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_dddxpbys( *(x + ii*rs_x + jj), *beta,
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_dddxpbys( *(x + ii*rs_x + jj*cs_x), *beta,
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_cddxpbys_mxn( const dim_t m, const dim_t n, scomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
double* restrict beta,
double* restrict y, const inc_t rs_y, const inc_t cs_y )
{
// If beta is zero, overwrite y with x (in case y has infs or NaNs).
if ( bli_deq0( *beta ) )
{
bli_cdcopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y );
return;
}
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_cddxpbys( *(x + ii + jj*cs_x), *beta,
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_cddxpbys( *(x + ii*rs_x + jj), *beta,
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_cddxpbys( *(x + ii*rs_x + jj*cs_x), *beta,
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_zddxpbys_mxn( const dim_t m, const dim_t n, dcomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
double* restrict beta,
double* restrict y, const inc_t rs_y, const inc_t cs_y )
{
// If beta is zero, overwrite y with x (in case y has infs or NaNs).
if ( bli_deq0( *beta ) )
{
bli_zdcopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y );
return;
}
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_zddxpbys( *(x + ii + jj*cs_x), *beta,
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_zddxpbys( *(x + ii*rs_x + jj), *beta,
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_zddxpbys( *(x + ii*rs_x + jj*cs_x), *beta,
*(y + ii*rs_y + jj*cs_y) );
}
}
#define bli_cccxpbys_mxn( m, n, x, rs_x, cs_x, beta, y, rs_y, cs_y ) \
{ \
/* If beta is zero, overwrite y with x (in case y has infs or NaNs). */ \
if ( bli_ceq0( *beta ) ) \
{ \
bli_cccopys_mxn( m, n, \
x, rs_x, cs_x, \
y, rs_y, cs_y ); \
} \
else \
{ \
dim_t _i, _j; \
\
for ( _j = 0; _j < n; ++_j ) \
for ( _i = 0; _i < m; ++_i ) \
bli_cccxpbys( *(x + _i*rs_x + _j*cs_x), \
*(beta), \
*(y + _i*rs_y + _j*cs_y) ); \
} \
// xby = ?cc
static void bli_sccxpbys_mxn( const dim_t m, const dim_t n, float* restrict x, const inc_t rs_x, const inc_t cs_x,
scomplex* restrict beta,
scomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
// If beta is zero, overwrite y with x (in case y has infs or NaNs).
if ( bli_ceq0( *beta ) )
{
bli_sccopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y );
return;
}
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_sccxpbys( *(x + ii + jj*cs_x), *beta,
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_sccxpbys( *(x + ii*rs_x + jj), *beta,
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_sccxpbys( *(x + ii*rs_x + jj*cs_x), *beta,
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_dccxpbys_mxn( const dim_t m, const dim_t n, double* restrict x, const inc_t rs_x, const inc_t cs_x,
scomplex* restrict beta,
scomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
// If beta is zero, overwrite y with x (in case y has infs or NaNs).
if ( bli_ceq0( *beta ) )
{
bli_dccopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y );
return;
}
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_dccxpbys( *(x + ii + jj*cs_x), *beta,
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_dccxpbys( *(x + ii*rs_x + jj), *beta,
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_dccxpbys( *(x + ii*rs_x + jj*cs_x), *beta,
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_cccxpbys_mxn( const dim_t m, const dim_t n, scomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
scomplex* restrict beta,
scomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
// If beta is zero, overwrite y with x (in case y has infs or NaNs).
if ( bli_ceq0( *beta ) )
{
bli_cccopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y );
return;
}
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_cccxpbys( *(x + ii + jj*cs_x), *beta,
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_cccxpbys( *(x + ii*rs_x + jj), *beta,
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_cccxpbys( *(x + ii*rs_x + jj*cs_x), *beta,
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_zccxpbys_mxn( const dim_t m, const dim_t n, dcomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
scomplex* restrict beta,
scomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
// If beta is zero, overwrite y with x (in case y has infs or NaNs).
if ( bli_ceq0( *beta ) )
{
bli_zccopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y );
return;
}
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_zccxpbys( *(x + ii + jj*cs_x), *beta,
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_zccxpbys( *(x + ii*rs_x + jj), *beta,
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_zccxpbys( *(x + ii*rs_x + jj*cs_x), *beta,
*(y + ii*rs_y + jj*cs_y) );
}
}
#define bli_zzzxpbys_mxn( m, n, x, rs_x, cs_x, beta, y, rs_y, cs_y ) \
{ \
/* If beta is zero, overwrite y with x (in case y has infs or NaNs). */ \
if ( bli_zeq0( *beta ) ) \
{ \
bli_zzcopys_mxn( m, n, \
x, rs_x, cs_x, \
y, rs_y, cs_y ); \
} \
else \
{ \
dim_t _i, _j; \
\
for ( _j = 0; _j < n; ++_j ) \
for ( _i = 0; _i < m; ++_i ) \
bli_zzzxpbys( *(x + _i*rs_x + _j*cs_x), \
*(beta), \
*(y + _i*rs_y + _j*cs_y) ); \
} \
// xby = ?zz
static void bli_szzxpbys_mxn( const dim_t m, const dim_t n, float* restrict x, const inc_t rs_x, const inc_t cs_x,
dcomplex* restrict beta,
dcomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
// If beta is zero, overwrite y with x (in case y has infs or NaNs).
if ( bli_zeq0( *beta ) )
{
bli_szcopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y );
return;
}
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_szzxpbys( *(x + ii + jj*cs_x), *beta,
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_szzxpbys( *(x + ii*rs_x + jj), *beta,
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_szzxpbys( *(x + ii*rs_x + jj*cs_x), *beta,
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_dzzxpbys_mxn( const dim_t m, const dim_t n, double* restrict x, const inc_t rs_x, const inc_t cs_x,
dcomplex* restrict beta,
dcomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
// If beta is zero, overwrite y with x (in case y has infs or NaNs).
if ( bli_zeq0( *beta ) )
{
bli_dzcopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y );
return;
}
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_dzzxpbys( *(x + ii + jj*cs_x), *beta,
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_dzzxpbys( *(x + ii*rs_x + jj), *beta,
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_dzzxpbys( *(x + ii*rs_x + jj*cs_x), *beta,
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_czzxpbys_mxn( const dim_t m, const dim_t n, scomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
dcomplex* restrict beta,
dcomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
// If beta is zero, overwrite y with x (in case y has infs or NaNs).
if ( bli_zeq0( *beta ) )
{
bli_czcopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y );
return;
}
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_czzxpbys( *(x + ii + jj*cs_x), *beta,
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_czzxpbys( *(x + ii*rs_x + jj), *beta,
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_czzxpbys( *(x + ii*rs_x + jj*cs_x), *beta,
*(y + ii*rs_y + jj*cs_y) );
}
}
static void bli_zzzxpbys_mxn( const dim_t m, const dim_t n, dcomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
dcomplex* restrict beta,
dcomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
// If beta is zero, overwrite y with x (in case y has infs or NaNs).
if ( bli_zeq0( *beta ) )
{
bli_zzcopys_mxn( m, n, x, rs_x, cs_x, y, rs_y, cs_y );
return;
}
#ifdef BLIS_ENABLE_CR_CASES
if ( rs_x == 1 && rs_y == 1 )
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_zzzxpbys( *(x + ii + jj*cs_x), *beta,
*(y + ii + jj*cs_y) );
}
else if ( cs_x == 1 && cs_y == 1 )
{
for ( dim_t ii = 0; ii < m; ++ii )
for ( dim_t jj = 0; jj < n; ++jj )
bli_zzzxpbys( *(x + ii*rs_x + jj), *beta,
*(y + ii*rs_y + jj) );
}
else
#endif
{
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
bli_zzzxpbys( *(x + ii*rs_x + jj*cs_x), *beta,
*(y + ii*rs_y + jj*cs_y) );
}
}
#define bli_sxpbys_mxn( m, n, x, rs_x, cs_x, beta, y, rs_y, cs_y ) \
{\
bli_sssxpbys_mxn( m, n, x, rs_x, cs_x, beta, y, rs_y, cs_y ); \
static void bli_sxpbys_mxn( const dim_t m, const dim_t n, float* restrict x, const inc_t rs_x, const inc_t cs_x,
float* restrict beta,
float* restrict y, const inc_t rs_y, const inc_t cs_y )
{
bli_sssxpbys_mxn( m, n, x, rs_x, cs_x, beta, y, rs_y, cs_y );
}
#define bli_dxpbys_mxn( m, n, x, rs_x, cs_x, beta, y, rs_y, cs_y ) \
{\
bli_dddxpbys_mxn( m, n, x, rs_x, cs_x, beta, y, rs_y, cs_y ); \
static void bli_dxpbys_mxn( const dim_t m, const dim_t n, double* restrict x, const inc_t rs_x, const inc_t cs_x,
double* restrict beta,
double* restrict y, const inc_t rs_y, const inc_t cs_y )
{
bli_dddxpbys_mxn( m, n, x, rs_x, cs_x, beta, y, rs_y, cs_y );
}
#define bli_cxpbys_mxn( m, n, x, rs_x, cs_x, beta, y, rs_y, cs_y ) \
{\
bli_cccxpbys_mxn( m, n, x, rs_x, cs_x, beta, y, rs_y, cs_y ); \
static void bli_cxpbys_mxn( const dim_t m, const dim_t n, scomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
scomplex* restrict beta,
scomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
bli_cccxpbys_mxn( m, n, x, rs_x, cs_x, beta, y, rs_y, cs_y );
}
#define bli_zxpbys_mxn( m, n, x, rs_x, cs_x, beta, y, rs_y, cs_y ) \
{\
bli_zzzxpbys_mxn( m, n, x, rs_x, cs_x, beta, y, rs_y, cs_y ); \
static void bli_zxpbys_mxn( const dim_t m, const dim_t n, dcomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
dcomplex* restrict beta,
dcomplex* restrict y, const inc_t rs_y, const inc_t cs_y )
{
bli_zzzxpbys_mxn( m, n, x, rs_x, cs_x, beta, y, rs_y, cs_y );
}
#endif

View File

@@ -42,5 +42,25 @@
#define bli_ccopyjris( ar, ai, br, bi ) bli_ccopyris( (ar), -(ai), (br), (bi) )
#define bli_zcopyjris( ar, ai, br, bi ) bli_zcopyris( (ar), -(ai), (br), (bi) )
#define bli_sscopyjris( ar, ai, br, bi ) bli_scopyjris( ar, 0.0F, br, bi )
#define bli_dscopyjris( ar, ai, br, bi ) bli_scopyjris( ar, 0.0, br, bi )
#define bli_cscopyjris( ar, ai, br, bi ) bli_scopyjris( ar, ai, br, bi )
#define bli_zscopyjris( ar, ai, br, bi ) bli_scopyjris( ar, ai, br, bi )
#define bli_sdcopyjris( ar, ai, br, bi ) bli_dcopyjris( ar, 0.0F, br, bi )
#define bli_ddcopyjris( ar, ai, br, bi ) bli_dcopyjris( ar, 0.0, br, bi )
#define bli_cdcopyjris( ar, ai, br, bi ) bli_dcopyjris( ar, ai, br, bi )
#define bli_zdcopyjris( ar, ai, br, bi ) bli_dcopyjris( ar, ai, br, bi )
#define bli_sccopyjris( ar, ai, br, bi ) bli_ccopyjris( ar, 0.0F, br, bi )
#define bli_dccopyjris( ar, ai, br, bi ) bli_ccopyjris( ar, 0.0, br, bi )
#define bli_cccopyjris( ar, ai, br, bi ) bli_ccopyjris( ar, ai, br, bi )
#define bli_zccopyjris( ar, ai, br, bi ) bli_ccopyjris( ar, ai, br, bi )
#define bli_szcopyjris( ar, ai, br, bi ) bli_zcopyjris( ar, 0.0F, br, bi )
#define bli_dzcopyjris( ar, ai, br, bi ) bli_zcopyjris( ar, 0.0, br, bi )
#define bli_czcopyjris( ar, ai, br, bi ) bli_zcopyjris( ar, ai, br, bi )
#define bli_zzcopyjris( ar, ai, br, bi ) bli_zcopyjris( ar, ai, br, bi )
#endif

View File

@@ -59,5 +59,24 @@
(bi) = (ai); \
}
#endif
#define bli_sscopyris( ar, ai, br, bi ) bli_scopyris( ar, 0.0F, br, bi )
#define bli_dscopyris( ar, ai, br, bi ) bli_scopyris( ar, 0.0, br, bi )
#define bli_cscopyris( ar, ai, br, bi ) bli_scopyris( ar, ai, br, bi )
#define bli_zscopyris( ar, ai, br, bi ) bli_scopyris( ar, ai, br, bi )
#define bli_sdcopyris( ar, ai, br, bi ) bli_dcopyris( ar, 0.0F, br, bi )
#define bli_ddcopyris( ar, ai, br, bi ) bli_dcopyris( ar, 0.0, br, bi )
#define bli_cdcopyris( ar, ai, br, bi ) bli_dcopyris( ar, ai, br, bi )
#define bli_zdcopyris( ar, ai, br, bi ) bli_dcopyris( ar, ai, br, bi )
#define bli_sccopyris( ar, ai, br, bi ) bli_ccopyris( ar, 0.0F, br, bi )
#define bli_dccopyris( ar, ai, br, bi ) bli_ccopyris( ar, 0.0, br, bi )
#define bli_cccopyris( ar, ai, br, bi ) bli_ccopyris( ar, ai, br, bi )
#define bli_zccopyris( ar, ai, br, bi ) bli_ccopyris( ar, ai, br, bi )
#define bli_szcopyris( ar, ai, br, bi ) bli_zcopyris( ar, 0.0F, br, bi )
#define bli_dzcopyris( ar, ai, br, bi ) bli_zcopyris( ar, 0.0, br, bi )
#define bli_czcopyris( ar, ai, br, bi ) bli_zcopyris( ar, ai, br, bi )
#define bli_zzcopyris( ar, ai, br, bi ) bli_zcopyris( ar, ai, br, bi )
#endif

View File

@@ -49,7 +49,8 @@
\
/* If beta is in the real domain, and c is row- or column-stored,
then we may proceed with the optimization. */ \
if ( bli_obj_imag_equals( &beta, &BLIS_ZERO ) && \
if ( /*bli_obj_imag_equals( &beta, &BLIS_ZERO ) &&*/ \
bli_obj_imag_is_zero( &beta ) && \
!bli_is_gen_stored( rs_c, cs_c ) ) \
{ \
dt_exec = bli_dt_proj_to_real( dt_exec ); \

View File

@@ -97,19 +97,6 @@ void blx_gemm_front
bli_obj_induce_trans( &c_local );
}
// Parse and interpret the contents of the rntm_t object to properly
// set the ways of parallelism for each loop, and then make any
// additional modifications necessary for the current operation.
bli_rntm_set_ways_for_op
(
BLIS_GEMM,
BLIS_LEFT, // ignored for gemm
bli_obj_length( &c_local ),
bli_obj_width( &c_local ),
bli_obj_width( &a_local ),
rntm
);
{
// A sort of hack for communicating the desired pach schemas for A and
// B to bli_gemm_cntl_create() (via bli_l3_thread_decorator() and
@@ -131,6 +118,19 @@ void blx_gemm_front
}
}
// Parse and interpret the contents of the rntm_t object to properly
// set the ways of parallelism for each loop, and then make any
// additional modifications necessary for the current operation.
bli_rntm_set_ways_for_op
(
BLIS_GEMM,
BLIS_LEFT, // ignored for gemm
bli_obj_length( &c_local ),
bli_obj_width( &c_local ),
bli_obj_width( &a_local ),
rntm
);
// Invoke the internal back-end via the thread handler.
blx_gemm_thread
(

View File

@@ -254,6 +254,9 @@ void PASTECH2(blx_,ch,varname) \
/* Save the imaginary stride of A and B to the auxinfo_t object. */ \
bli_auxinfo_set_is_a( is_a, &aux ); \
bli_auxinfo_set_is_b( is_b, &aux ); \
\
/* Save the desired output datatype (indicating no typecasting). */ \
bli_auxinfo_set_dt_on_output( dt, &aux ); \
\
thrinfo_t* caucus = bli_thrinfo_sub_node( thread ); \
dim_t jr_num_threads = bli_thread_n_way( thread ); \

View File

@@ -200,9 +200,9 @@ STR_ST := -DTHR_STR=\"st\"
STR_MT := -DTHR_STR=\"mt\"
# Problem size specification
PDEF_ST := -DP_BEGIN=40 \
PDEF_ST := -DP_BEGIN=100 \
-DP_END=2000 \
-DP_INC=40
-DP_INC=100
PDEF_MT := -DP_BEGIN=200 \
-DP_END=10000 \

View File

@@ -35,9 +35,6 @@
#include <unistd.h>
#include "blis.h"
void zgemm3m_( f77_char*, f77_char*, f77_int*, f77_int*, f77_int*, dcomplex*, dcomplex*, f77_int*, dcomplex*, f77_int*, dcomplex*, dcomplex*, f77_int* );
//#define PRINT
int main( int argc, char** argv )
@@ -148,9 +145,6 @@ int main( int argc, char** argv )
bli_obj_create( dt, m, k, 0, 0, &a );
bli_obj_create( dt, k, n, 0, 0, &b );
bli_obj_create( dt, m, n, 0, 0, &c );
//bli_obj_create( dt, m, k, 2, 2*m, &a );
//bli_obj_create( dt, k, n, 2, 2*k, &b );
//bli_obj_create( dt, m, n, 2, 2*m, &c );
bli_obj_create( dt, m, n, 0, 0, &c_save );
bli_randm( &a );
@@ -177,7 +171,6 @@ int main( int argc, char** argv )
{
bli_copym( &c_save, &c );
dtime = bli_clock();

401
test/mixeddt/Makefile Normal file
View File

@@ -0,0 +1,401 @@
#!/bin/bash
#
# BLIS
# An object-based framework for developing high-performance BLAS-like
# libraries.
#
# Copyright (C) 2014, The University of Texas at Austin
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are
# met:
# - Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# - Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# - Neither the name of The University of Texas at Austin nor the names
# of its contributors may be used to endorse or promote products
# derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
# HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#
#
# Makefile
#
# Field G. Van Zee
#
# Makefile for standalone BLIS test drivers.
#
#
# --- Makefile PHONY target definitions ----------------------------------------
#
.PHONY: all all-st all-mt \
blis blis-st blis-mt \
blis-nat blis-nat-st blis-nat-mt \
openblas openblas-st openblas-mt \
mkl mkl-st mkl-mt \
blis-gemm-st blis-gemm-mt \
blis-gemm-nat-st blis-gemm-nat-mt \
openblas-gemm-st openblas-gemm-mt \
mkl-gemm-st mkl-gemm-mt \
clean cleanx
#
# --- Determine makefile fragment location -------------------------------------
#
# Comments:
# - DIST_PATH is assumed to not exist if BLIS_INSTALL_PATH is given.
# - We must use recursively expanded assignment for LIB_PATH and INC_PATH in
# the second case because CONFIG_NAME is not yet set.
ifneq ($(strip $(BLIS_INSTALL_PATH)),)
LIB_PATH := $(BLIS_INSTALL_PATH)/lib
INC_PATH := $(BLIS_INSTALL_PATH)/include/blis
SHARE_PATH := $(BLIS_INSTALL_PATH)/share/blis
else
DIST_PATH := ../..
LIB_PATH = ../../lib/$(CONFIG_NAME)
INC_PATH = ../../include/$(CONFIG_NAME)
SHARE_PATH := ../..
endif
#
# --- Include common makefile definitions --------------------------------------
#
# Include the common makefile fragment.
-include $(SHARE_PATH)/common.mk
#
# --- BLAS and LAPACK implementations ------------------------------------------
#
# BLIS library and header path. This is simply wherever it was installed.
#BLIS_LIB_PATH := $(INSTALL_PREFIX)/lib
#BLIS_INC_PATH := $(INSTALL_PREFIX)/include/blis
# BLIS library.
#BLIS_LIB := $(BLIS_LIB_PATH)/libblis.a
# BLAS library path(s). This is where the BLAS libraries reside.
HOME_LIB_PATH := $(HOME)/flame/lib
#MKL_LIB_PATH := /opt/apps/intel/13/composer_xe_2013.2.146/mkl/lib/intel64
#MKL_LIB_PATH := $(HOME)/intel/mkl/lib/intel64
MKL_LIB_PATH := ${MKLROOT}/lib/intel64
#ICC_LIB_PATH := /opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64
# OpenBLAS
OPENBLAS_LIB := $(HOME_LIB_PATH)/libopenblas.a
OPENBLASP_LIB := $(HOME_LIB_PATH)/libopenblasp.a
# ATLAS
ATLAS_LIB := $(HOME_LIB_PATH)/libf77blas.a \
$(HOME_LIB_PATH)/libatlas.a
# MKL
MKL_LIB := -L$(MKL_LIB_PATH) \
-lmkl_intel_lp64 \
-lmkl_core \
-lmkl_sequential \
-lpthread -lm -ldl
#MKLP_LIB := -L$(MKL_LIB_PATH) \
# -lmkl_intel_thread \
# -lmkl_core \
# -lmkl_intel_ilp64 \
# -L$(ICC_LIB_PATH) \
# -liomp5
MKLP_LIB := -L$(MKL_LIB_PATH) \
-lmkl_intel_lp64 \
-lmkl_core \
-lmkl_gnu_thread \
-lpthread -lm -ldl -fopenmp
#-L$(ICC_LIB_PATH) \
#-lgomp
#
# --- General build definitions ------------------------------------------------
#
TEST_SRC_PATH := .
TEST_OBJ_PATH := .
# Gather all local object files.
TEST_OBJS := $(sort $(patsubst $(TEST_SRC_PATH)/%.c, \
$(TEST_OBJ_PATH)/%.o, \
$(wildcard $(TEST_SRC_PATH)/*.c)))
# Override the value of CINCFLAGS so that the value of CFLAGS returned by
# get-frame-cflags-for() is not cluttered up with include paths needed only
# while building BLIS.
CINCFLAGS := -I$(INC_PATH)
# Use the "framework" CFLAGS for the configuration family.
CFLAGS := $(call get-user-cflags-for,$(CONFIG_NAME))
# Add local header paths to CFLAGS.
CFLAGS += -I$(TEST_SRC_PATH)
# Locate the libblis library to which we will link.
LIBBLIS_LINK := $(LIB_PATH)/$(LIBBLIS_L)
# Datatypes for A, B, and C.
#DTA_S := -DDTA=BLIS_FLOAT
#DTA_D := -DDTA=BLIS_DOUBLE
#DTA_C := -DDTA=BLIS_SCOMPLEX
#DTA_Z := -DDTA=BLIS_DCOMPLEX
#
#DTB_S := -DDTB=BLIS_FLOAT
#DTB_D := -DDTB=BLIS_DOUBLE
#DTB_C := -DDTB=BLIS_SCOMPLEX
#DTB_Z := -DDTB=BLIS_DCOMPLEX
#
#DTC_S := -DDTC=BLIS_FLOAT
#DTC_D := -DDTC=BLIS_DOUBLE
#DTC_C := -DDTC=BLIS_SCOMPLEX
#DTC_Z := -DDTC=BLIS_DCOMPLEX
#
#DTX_S := -DDTC=BLIS_FLOAT
#DTX_D := -DDTC=BLIS_DOUBLE
# Which library?
BLI_DEF := -DBLIS
BLA_DEF := -DBLAS
# Implementation string
STR_BLI := -DSTR=\"asm_blis\"
STR_OBL := -DSTR=\"openblas\"
STR_MKL := -DSTR=\"mkl\"
# Single or multithreaded string
STR_ST := -DTHR_STR=\"st\"
STR_MT := -DTHR_STR=\"mt\"
# Problem size specification
PDEF_ST := -DP_BEGIN=96 \
-DP_END=1200 \
-DP_INC=96
PDEF_MT := -DP_BEGIN=80 \
-DP_END=4000 \
-DP_INC=80
# Enumerate possible datatypes and computation precisions.
dts := s d c z
prs := s d
# Various functions that help us construct the datatype combinations and then
# extract the needed datatype strings and C preprocessor define flags.
get-char-c = $(word 1,$(subst _, ,$(1)))
get-char-a = $(word 2,$(subst _, ,$(1)))
get-char-b = $(word 3,$(subst _, ,$(1)))
get-char-x = $(word 4,$(subst _, ,$(1)))
get-cstr = $(call get-char-c,$(1))$(call get-char-a,$(1))$(call get-char-b,$(1))$(call get-char-x,$(1))
get-cdef-a = $(strip $(subst s,-DDTA=BLIS_FLOAT, \
$(subst d,-DDTA=BLIS_DOUBLE, \
$(subst c,-DDTA=BLIS_SCOMPLEX, \
$(subst z,-DDTA=BLIS_DCOMPLEX,$(call get-char-a,$(1)))))))
get-cdef-b = $(strip $(subst s,-DDTB=BLIS_FLOAT, \
$(subst d,-DDTB=BLIS_DOUBLE, \
$(subst c,-DDTB=BLIS_SCOMPLEX, \
$(subst z,-DDTB=BLIS_DCOMPLEX,$(call get-char-b,$(1)))))))
get-cdef-c = $(strip $(subst s,-DDTC=BLIS_FLOAT, \
$(subst d,-DDTC=BLIS_DOUBLE, \
$(subst c,-DDTC=BLIS_SCOMPLEX, \
$(subst z,-DDTC=BLIS_DCOMPLEX,$(call get-char-c,$(1)))))))
get-cdef-x = $(strip $(subst s,-DDTX=BLIS_FLOAT, \
$(subst d,-DDTX=BLIS_DOUBLE,$(call get-char-x,$(1)))))
get-cdefs = $(call get-cdef-c,$(1)) $(call get-cdef-a,$(1)) $(call get-cdef-b,$(1)) $(call get-cdef-x,$(1))
# Define a function to return the appropriate -DSTR= and -D[BLIS|BLAS] flags.
get-idefs = $(strip $(subst asm_blis,-DSTR=\"$(1)\" -DBLIS, \
$(subst openblas,-DSTR=\"$(1)\" -DBLAS, \
$(subst mkl,-DSTR=\"$(1)\" -DBLAS,$(1)))))
# Enumerate all possible datatype combinations.
DT_CODES := $(foreach dt0,$(dts),$(foreach dt1,$(dts),$(foreach dt2,$(dts),$(foreach pr,$(prs),$(dt0)_$(dt1)_$(dt2)_$(pr)))))
# Build a list of the datatype strings.
DT_COMBOS := $(foreach code,$(DT_CODES),$(call get-cstr,$(code)))
# Build a list of BLIS, OpenBLAS, and MKL executables.
BLIS_OBJS_ST := $(foreach combo,$(DT_COMBOS),test_$(combo)gemm_asm_blis_st.o)
BLIS_BINS_ST := $(patsubst %.o,%.x,$(BLIS_OBJS_ST))
OPENBLAS_OBJS_ST := $(foreach combo,$(DT_COMBOS),test_$(combo)gemm_openblas_st.o)
OPENBLAS_BINS_ST := $(patsubst %.o,%.x,$(OPENBLAS_OBJS_ST))
BLIS_OBJS_MT := $(foreach combo,$(DT_COMBOS),test_$(combo)gemm_asm_blis_mt.o)
BLIS_BINS_MT := $(patsubst %.o,%.x,$(BLIS_OBJS_MT))
OPENBLAS_OBJS_MT := $(foreach combo,$(DT_COMBOS),test_$(combo)gemm_openblas_mt.o)
OPENBLAS_BINS_MT := $(patsubst %.o,%.x,$(OPENBLAS_OBJS_MT))
#MKL_OBJS_ST := $(foreach combo,$(DT_COMBOS),test_$(combo)gemm_mkl_st.o)
#BLIS_OBJS_MT := $(foreach combo,$(DT_COMBOS),test_$(combo)gemm_asm_blis_mt.o)
#OPENBLAS_OBJS_MT := $(foreach combo,$(DT_COMBOS),test_$(combo)gemm_openblas_mt.o)
#MKL_OBJS_MT := $(foreach combo,$(DT_COMBOS),test_$(combo)gemm_mkl_mt.o)
#
# --- Targets/rules ------------------------------------------------------------
#
all: st
st: blis-st openblas-st
mt: blis-mt openblas-mt
blis-st: $(BLIS_BINS_ST)
openblas-st: $(OPENBLAS_BINS_ST)
blis-mt: $(BLIS_BINS_MT)
openblas-mt: $(OPENBLAS_BINS_MT)
#blis: test_ssssgemm_asm_blis_st.x \
# test_sssdgemm_asm_blis_st.x \
# test_ssdsgemm_asm_blis_st.x \
# test_sdssgemm_asm_blis_st.x \
# test_dsssgemm_asm_blis_st.x \
# test_dddsgemm_asm_blis_st.x \
# test_ddddgemm_asm_blis_st.x
#openblas: test_ssssgemm_openblas_st.x \
# test_sssdgemm_openblas_st.x \
# test_ssdsgemm_openblas_st.x \
# test_sdssgemm_openblas_st.x \
# test_dsssgemm_openblas_st.x \
# test_dddsgemm_openblas_st.x \
# test_ddddgemm_openblas_st.x
# --Object file rules --
# Define the function that will be used to instantiate compilation rules
# for the various implementations.
define make-st-rule
test_$(call get-cstr,$(1))gemm_$(2)_st.o: test_gemm.c Makefile
ifeq ($(ENABLE_VERBOSE),yes)
$(CC) $(CFLAGS) $(PDEF_ST) $(call get-cdefs,$(1)) $(call get-idefs,$(2)) $(STR_ST) -c $$< -o $$@
else
@echo "Compiling $$@"
@$(CC) $(CFLAGS) $(PDEF_ST) $(call get-cdefs,$(1)) $(call get-idefs,$(2)) $(STR_ST) -c $$< -o $$@
endif
endef
define make-mt-rule
test_$(call get-cstr,$(1))gemm_$(2)_mt.o: test_gemm.c Makefile
ifeq ($(ENABLE_VERBOSE),yes)
$(CC) $(CFLAGS) $(PDEF_MT) $(call get-cdefs,$(1)) $(call get-idefs,$(2)) $(STR_MT) -c $$< -o $$@
else
@echo "Compiling $$@"
@$(CC) $(CFLAGS) $(PDEF_MT) $(call get-cdefs,$(1)) $(call get-idefs,$(2)) $(STR_MT) -c $$< -o $$@
endif
endef
# Define the implementations for which we will instantiate compilation rules.
IMPLS := asm_blis openblas
# Instantiate the rule function make-st-rule() and make-mt-rule for each
# implementation in IMPLS and each of the datatype "codes" in DT_CODES.
$(foreach impl,$(IMPLS), \
$(foreach code,$(DT_CODES),$(eval $(call make-st-rule,$(code),$(impl)))))
$(foreach impl,$(IMPLS), \
$(foreach code,$(DT_CODES),$(eval $(call make-mt-rule,$(code),$(impl)))))
# -- Executable file rules --
# NOTE: For the BLAS test drivers, we place the BLAS libraries before BLIS
# on the link command line in case BLIS was configured with the BLAS
# compatibility layer. This prevents BLIS from inadvertently getting called
# for the BLAS routines we are trying to test with.
test_%_openblas_st.x: test_%_openblas_st.o $(LIBBLIS_LINK)
ifeq ($(ENABLE_VERBOSE),yes)
$(LINKER) $< $(OPENBLAS_LIB) $(LIBBLIS_LINK) $(LDFLAGS) -o $@
$(RM_F) $<
else
@@echo "Linking $@ to '$(notdir $(OPENBLAS_LIB)) $(LIBBLIS_LINK)'"
@$(LINKER) $< $(OPENBLAS_LIB) $(LIBBLIS_LINK) $(LDFLAGS) -o $@
@$(RM_F) $<
endif
test_%_openblas_mt.x: test_%_openblas_mt.o $(LIBBLIS_LINK)
ifeq ($(ENABLE_VERBOSE),yes)
$(LINKER) $< $(OPENBLASP_LIB) $(LIBBLIS_LINK) $(LDFLAGS) -o $@
$(RM_F) $<
else
@@echo "Linking $@ to '$(notdir $(OPENBLAS_LIB)) $(LIBBLIS_LINK)'"
@$(LINKER) $< $(OPENBLASP_LIB) $(LIBBLIS_LINK) $(LDFLAGS) -o $@
@$(RM_F) $<
endif
#test_%_mkl_st.x: test_%_mkl_st.o $(LIBBLIS_LINK)
#ifeq ($(ENABLE_VERBOSE),yes)
# $(LINKER) $< $(MKL_LIB) $(LIBBLIS_LINK) $(LDFLAGS) -o $@
# $(RM_F) $<
#else
# @@echo "Linking $@ to '$(notdir $(MKL_LIB)) $(LIBBLIS_LINK)'"
# @$(LINKER) $< $(MKL_LIB) $(LIBBLIS_LINK) $(LDFLAGS) -o $@
# @$(RM_F) $<
#endif
#test_%_mkl_mt.x: test_%_mkl_mt.o $(LIBBLIS_LINK)
# $(LINKER) $< $(MKLP_LIB) $(LIBBLIS_LINK) $(LDFLAGS) -o $@
test_%_blis_st.x: test_%_blis_st.o $(LIBBLIS_LINK)
ifeq ($(ENABLE_VERBOSE),yes)
$(LINKER) $< $(LIBBLIS_LINK) $(LDFLAGS) -o $@
$(RM_F) $<
else
@@echo "Linking $@ to '$(LIBBLIS_LINK)'"
@$(LINKER) $< $(LIBBLIS_LINK) $(LDFLAGS) -o $@
@$(RM_F) $<
endif
test_%_blis_mt.x: test_%_blis_mt.o $(LIBBLIS_LINK)
ifeq ($(ENABLE_VERBOSE),yes)
$(LINKER) $< $(LIBBLIS_LINK) $(LDFLAGS) -o $@
$(RM_F) $<
else
@@echo "Linking $@ to '$(LIBBLIS_LINK)'"
@$(LINKER) $< $(LIBBLIS_LINK) $(LDFLAGS) -o $@
@$(RM_F) $<
endif
# -- Clean rules --
clean: cleanx
cleanx:
- $(RM_F) *.o *.x
cleanout:
- $(RM_F) *.m

Binary file not shown.

View File

@@ -0,0 +1,165 @@
function r_val = gen_dt_combos()
dt_chars = [ 's' 'd' 'c' 'z' ];
pr_chars = [ 's' 'd' ];
if 0
i = 1;
for dtc = dt_chars
for dta = dt_chars
for dtb = dt_chars
for pr = pr_chars
dt_combos( i, : ) = sprintf( '%c%c%c%c', dtc, dta, dtb, pr );
i = i + 1;
end
end
end
end
end
%n_combos = size(temp,1);
if 1
dt_combos( 1, : ) = 'ssss';
dt_combos( 2, : ) = 'ssds';
dt_combos( 3, : ) = 'sscs';
dt_combos( 4, : ) = 'sszs';
dt_combos( 5, : ) = 'sdss';
dt_combos( 6, : ) = 'sdds';
dt_combos( 7, : ) = 'sdcs';
dt_combos( 8, : ) = 'sdzs';
dt_combos( 9, : ) = 'sssd';
dt_combos( 10, : ) = 'ssdd';
dt_combos( 11, : ) = 'sscd';
dt_combos( 12, : ) = 'sszd';
dt_combos( 13, : ) = 'sdsd';
dt_combos( 14, : ) = 'sddd';
dt_combos( 15, : ) = 'sdcd';
dt_combos( 16, : ) = 'sdzd';
dt_combos( 17, : ) = 'scss';
dt_combos( 18, : ) = 'scds';
dt_combos( 19, : ) = 'sccs';
dt_combos( 20, : ) = 'sczs';
dt_combos( 21, : ) = 'szss';
dt_combos( 22, : ) = 'szds';
dt_combos( 23, : ) = 'szcs';
dt_combos( 24, : ) = 'szzs';
dt_combos( 25, : ) = 'scsd';
dt_combos( 26, : ) = 'scdd';
dt_combos( 27, : ) = 'sccd';
dt_combos( 28, : ) = 'sczd';
dt_combos( 29, : ) = 'szsd';
dt_combos( 30, : ) = 'szdd';
dt_combos( 31, : ) = 'szcd';
dt_combos( 32, : ) = 'szzd';
dt_combos( 33, : ) = 'dsss';
dt_combos( 34, : ) = 'dsds';
dt_combos( 35, : ) = 'dscs';
dt_combos( 36, : ) = 'dszs';
dt_combos( 37, : ) = 'ddss';
dt_combos( 38, : ) = 'ddds';
dt_combos( 39, : ) = 'ddcs';
dt_combos( 40, : ) = 'ddzs';
dt_combos( 41, : ) = 'dssd';
dt_combos( 42, : ) = 'dsdd';
dt_combos( 43, : ) = 'dscd';
dt_combos( 44, : ) = 'dszd';
dt_combos( 45, : ) = 'ddsd';
dt_combos( 46, : ) = 'dddd';
dt_combos( 47, : ) = 'ddcd';
dt_combos( 48, : ) = 'ddzd';
dt_combos( 49, : ) = 'dcss';
dt_combos( 50, : ) = 'dcds';
dt_combos( 51, : ) = 'dccs';
dt_combos( 52, : ) = 'dczs';
dt_combos( 53, : ) = 'dzss';
dt_combos( 54, : ) = 'dzds';
dt_combos( 55, : ) = 'dzcs';
dt_combos( 56, : ) = 'dzzs';
dt_combos( 57, : ) = 'dcsd';
dt_combos( 58, : ) = 'dcdd';
dt_combos( 59, : ) = 'dccd';
dt_combos( 60, : ) = 'dczd';
dt_combos( 61, : ) = 'dzsd';
dt_combos( 62, : ) = 'dzdd';
dt_combos( 63, : ) = 'dzcd';
dt_combos( 64, : ) = 'dzzd';
dt_combos( 65, : ) = 'csss';
dt_combos( 66, : ) = 'csds';
dt_combos( 67, : ) = 'cscs';
dt_combos( 68, : ) = 'cszs';
dt_combos( 69, : ) = 'cdss';
dt_combos( 70, : ) = 'cdds';
dt_combos( 71, : ) = 'cdcs';
dt_combos( 72, : ) = 'cdzs';
dt_combos( 73, : ) = 'cssd';
dt_combos( 74, : ) = 'csdd';
dt_combos( 75, : ) = 'cscd';
dt_combos( 76, : ) = 'cszd';
dt_combos( 77, : ) = 'cdsd';
dt_combos( 78, : ) = 'cddd';
dt_combos( 79, : ) = 'cdcd';
dt_combos( 80, : ) = 'cdzd';
dt_combos( 81, : ) = 'ccss';
dt_combos( 82, : ) = 'ccds';
dt_combos( 83, : ) = 'cccs';
dt_combos( 84, : ) = 'cczs';
dt_combos( 85, : ) = 'czss';
dt_combos( 86, : ) = 'czds';
dt_combos( 87, : ) = 'czcs';
dt_combos( 88, : ) = 'czzs';
dt_combos( 89, : ) = 'ccsd';
dt_combos( 90, : ) = 'ccdd';
dt_combos( 91, : ) = 'cccd';
dt_combos( 92, : ) = 'cczd';
dt_combos( 93, : ) = 'czsd';
dt_combos( 94, : ) = 'czdd';
dt_combos( 95, : ) = 'czcd';
dt_combos( 96, : ) = 'czzd';
dt_combos( 97, : ) = 'zsss';
dt_combos( 98, : ) = 'zsds';
dt_combos( 99, : ) = 'zscs';
dt_combos( 100, : ) = 'zszs';
dt_combos( 101, : ) = 'zdss';
dt_combos( 102, : ) = 'zdds';
dt_combos( 103, : ) = 'zdcs';
dt_combos( 104, : ) = 'zdzs';
dt_combos( 105, : ) = 'zssd';
dt_combos( 106, : ) = 'zsdd';
dt_combos( 107, : ) = 'zscd';
dt_combos( 108, : ) = 'zszd';
dt_combos( 109, : ) = 'zdsd';
dt_combos( 110, : ) = 'zddd';
dt_combos( 111, : ) = 'zdcd';
dt_combos( 112, : ) = 'zdzd';
dt_combos( 113, : ) = 'zcss';
dt_combos( 114, : ) = 'zcds';
dt_combos( 115, : ) = 'zccs';
dt_combos( 116, : ) = 'zczs';
dt_combos( 117, : ) = 'zzss';
dt_combos( 118, : ) = 'zzds';
dt_combos( 119, : ) = 'zzcs';
dt_combos( 120, : ) = 'zzzs';
dt_combos( 121, : ) = 'zcsd';
dt_combos( 122, : ) = 'zcdd';
dt_combos( 123, : ) = 'zccd';
dt_combos( 124, : ) = 'zczd';
dt_combos( 125, : ) = 'zzsd';
dt_combos( 126, : ) = 'zzdd';
dt_combos( 127, : ) = 'zzcd';
dt_combos( 128, : ) = 'zzzd';
end
r_val = dt_combos;
end

View File

@@ -0,0 +1,139 @@
function r_val = plot_all_md( is_mt )
if is_mt == 1
thr_str = 'mt';
else
thr_str = 'st';
end
if 1
dt_combos = gen_dt_combos();
else
dt_combos( 1, : ) = [ 'ssss' ];
dt_combos( 2, : ) = [ 'sssd' ];
dt_combos( 3, : ) = [ 'ssds' ];
dt_combos( 4, : ) = [ 'sdss' ];
dt_combos( 5, : ) = [ 'dsss' ];
dt_combos( 6, : ) = [ 'ddds' ];
dt_combos( 7, : ) = [ 'dddd' ];
end
n_combos = size(dt_combos,1);
filetemp_blis = '../output_%s_%sgemm_asm_blis.m';
filetemp_open = '../output_%s_%sgemm_openblas.m';
% Construct filenames for the "reference" (single real) data, then load
% the data files, and finally save the results to different variable names.
file_blis_sref = sprintf( filetemp_blis, thr_str, 'ssss' );
file_open_sref = sprintf( filetemp_open, thr_str, 'ssss' );
%str = sprintf( ' Loading %s', file_blis_sref ); disp(str);
run( file_blis_sref )
%str = sprintf( ' Loading %s', file_open_sref ); disp(str);
run( file_open_sref )
data_gemm_asm_blis_sref( :, : ) = data_gemm_asm_blis( :, : );
data_gemm_openblas_sref( :, : ) = data_gemm_openblas( :, : );
% Construct filenames for the "reference" (double real) data, then load
% the data files, and finally save the results to different variable names.
file_blis_dref = sprintf( filetemp_blis, thr_str, 'dddd' );
file_open_dref = sprintf( filetemp_open, thr_str, 'dddd' );
%str = sprintf( ' Loading %s', file_blis_dref ); disp(str);
run( file_blis_dref )
%str = sprintf( ' Loading %s', file_open_dref ); disp(str);
run( file_open_dref )
data_gemm_asm_blis_dref( :, : ) = data_gemm_asm_blis( :, : );
data_gemm_openblas_dref( :, : ) = data_gemm_openblas( :, : );
% Construct filenames for the "reference" (single complex) data, then load
% the data files, and finally save the results to different variable names.
file_blis_cref = sprintf( filetemp_blis, thr_str, 'cccs' );
file_open_cref = sprintf( filetemp_open, thr_str, 'cccs' );
%str = sprintf( ' Loading %s', file_blis_cref ); disp(str);
run( file_blis_cref )
%str = sprintf( ' Loading %s', file_open_cref ); disp(str);
run( file_open_cref )
data_gemm_asm_blis_cref( :, : ) = data_gemm_asm_blis( :, : );
data_gemm_openblas_cref( :, : ) = data_gemm_openblas( :, : );
% Construct filenames for the "reference" (double complex) data, then load
% the data files, and finally save the results to different variable names.
file_blis_zref = sprintf( filetemp_blis, thr_str, 'zzzd' );
file_open_zref = sprintf( filetemp_open, thr_str, 'zzzd' );
%str = sprintf( ' Loading %s', file_blis_zref ); disp(str);
run( file_blis_zref )
%str = sprintf( ' Loading %s', file_open_zref ); disp(str);
run( file_open_zref )
data_gemm_asm_blis_zref( :, : ) = data_gemm_asm_blis( :, : );
data_gemm_openblas_zref( :, : ) = data_gemm_openblas( :, : );
fig = figure;
orient( fig, 'landscape' );
set(gcf,'Position',[0 0 2000 900]);
set(gcf,'PaperUnits', 'inches');
set(gcf,'PaperSize', [64 33]);
set(gcf,'PaperPosition', [0 0 64 33]);
%set(gcf,'PaperPositionMode','auto');
set(gcf,'PaperPositionMode','manual');
set(gcf,'PaperOrientation','landscape');
for dti = 1:n_combos
%for dti = 1:1
% Grab the current datatype combination.
combo = dt_combos( dti, : );
str = sprintf( 'Plotting %d: %s', dti, combo ); disp(str);
if combo(4) == 's'
data_gemm_asm_blis_ref( :, : ) = data_gemm_asm_blis_sref( :, : );
data_gemm_openblas_ref( :, : ) = data_gemm_openblas_sref( :, : );
elseif combo(4) == 'd'
data_gemm_asm_blis_ref( :, : ) = data_gemm_asm_blis_dref( :, : );
data_gemm_openblas_ref( :, : ) = data_gemm_openblas_dref( :, : );
end
if ( combo(1) == 'c' || combo(1) == 'z' ) && ...
( combo(2) == 'c' || combo(2) == 'z' ) && ...
( combo(3) == 'c' || combo(3) == 'z' )
if combo(4) == 's'
data_gemm_asm_blis_ref( :, : ) = data_gemm_asm_blis_cref( :, : );
data_gemm_openblas_ref( :, : ) = data_gemm_openblas_cref( :, : );
elseif combo(4) == 'd'
data_gemm_asm_blis_ref( :, : ) = data_gemm_asm_blis_zref( :, : );
data_gemm_openblas_ref( :, : ) = data_gemm_openblas_zref( :, : );
end
end
% Construct filenames for the data files from templates.
file_blis = sprintf( filetemp_blis, thr_str, combo );
file_open = sprintf( filetemp_open, thr_str, combo );
% Load the data files.
%str = sprintf( ' Loading %s', file_blis ); disp(str);
run( file_blis )
%str = sprintf( ' Loading %s', file_open ); disp(str);
run( file_open )
% Plot the result.
plot_gemm_perf( combo, ...
data_gemm_asm_blis, ...
data_gemm_asm_blis_ref, ...
data_gemm_openblas, ...
data_gemm_openblas_ref, ...
is_mt, dti );
end
if 0
set(gcf,'Position',[0 0 2000 900]);
set(gcf,'PaperUnits', 'inches');
set(gcf,'PaperSize', [48 22]);
set(gcf,'PaperPosition', [0 0 48 22]);
%set(gcf,'PaperPositionMode','auto');
set(gcf,'PaperPositionMode','manual');
set(gcf,'PaperOrientation','landscape');
end
print(gcf, 'gemm_md','-bestfit','-dpdf');
%print(gcf, 'gemm_md','-fillpage','-dpdf');

View File

@@ -0,0 +1,181 @@
function r_val = plot_gemm_perf( dt_str, ...
data_blis, ...
data_blis_ref, ...
data_open, ...
data_open_ref, ...
is_mt, ...
theid )
if 1
ax1 = subplot( 8, 16, theid );
hold( ax1, 'on' );
end
color_blis_ref = 'b'; lines_blis_ref = ':'; markr_blis_ref = '';
color_open_ref = 'k'; lines_open_ref = ':'; markr_open_ref = 'o';
color_mkl_ref = 'r'; lines_mkl_ref = ':'; markr_mkl_ref = '.';
color_blis = 'b'; lines_blis = '-'; markr_blis = '';
color_open = 'k'; lines_open = '-'; markr_open = 'o';
color_mkl = 'r'; lines_mkl = '-'; markr_mkl = '.';
if dt_str(4) == 's'
flopspercycle = 32;
else
flopspercycle = 16;
end
if is_mt == 1
titlename = '%sgemm';
yaxisname = 'GFLOPS/core';
filename_pdf = 'fig_%sgemm_m1p_k1p_n1p_has_mt_perf.pdf';
filename_png = 'fig_%sgemm_m1p_k1p_n1p_has_mt_perf.png';
nth = 4;
x_end = 4000;
max_perf_core = (flopspercycle * 3.6) * 1;
else
titlename = '%sgemm';
yaxisname = 'GFLOPS';
filename_pdf = 'fig_%sgemm_m1p_k1p_n1p_has_st_perf.pdf';
filename_png = 'fig_%sgemm_m1p_k1p_n1p_has_st_perf.png';
nth = 1;
x_end = 2000;
max_perf_core = (flopspercycle * 3.6) * 1;
end
titlename = sprintf( titlename, dt_str );
filename_pdf = sprintf( filename_pdf, dt_str );
filename_png = sprintf( filename_png, dt_str );
%dt0_str = [ dt_str(4), dt_str(4), dt_str(4), dt_str(4) ];
dt0_str = dt_str(4);
blis_sref_legend = sprintf( 'BLIS [sc]gemm' );
blis_dref_legend = sprintf( 'BLIS [dz]gemm' );
blis_legend = sprintf( 'BLIS mixed' );
open_sref_legend = sprintf( 'OBLA [sc]gemm' );
open_dref_legend = sprintf( 'OBLA [dz]gemm' );
open_legend = sprintf( 'OBLA mixed' );
y_scale = 1.00;
%xaxisname = 'problem size (m = n = k)';
xaxisname = ' m = n = k';
colorflag = '-rgb';
x_begin = 0;
y_begin = 0;
y_end = max_perf_core * y_scale;
flopscol = 4;
msize = 5;
if 1
fontsize = 12;
else
fontsize = 16;
end
linesize = 0.7;
legend_loc = 'SouthEast';
% --------------------------------------------------------------------
%fig = figure;
%hold on; ax1 = gca;
x_axis( :, 1 ) = data_blis( :, 1 );
data_peak( 1, 1:2 ) = [ 0 max_perf_core ];
data_peak( 2, 1:2 ) = [ x_end max_perf_core ];
blis_ref = line( x_axis( :, 1 ), data_blis_ref( :, flopscol ) / nth, ...
'Color',color_blis_ref, 'LineStyle',lines_blis_ref, ...
'LineWidth',linesize );
blis_md = line( x_axis( :, 1 ), data_blis( :, flopscol ) / nth, ...
'Color',color_blis, 'LineStyle',lines_blis, ...
'LineWidth',linesize );
open_ref = line( x_axis( :, 1 ), data_open_ref( :, flopscol ) / nth, ...
'Color',color_open_ref, 'LineStyle',lines_open_ref, ...
'LineWidth',linesize );
open_md = line( x_axis( :, 1 ), data_open( :, flopscol ) / nth, ...
'Color',color_open, 'LineStyle',lines_open, ...
'LineWidth',linesize );
%hold on; ax1 = gca;
%'Parent',ax1, ...
xlim( ax1, [x_begin x_end] );
ylim( ax1, [y_begin y_end] );
if theid == 1
leg = legend( ...
[ ...
blis_ref ...
blis_md ...
open_ref ...
open_md ...
], ...
blis_sref_legend, ...
blis_legend, ...
open_sref_legend, ...
open_legend, ...
'Location', 'best' );
%'Location', legend_loc );
set( leg,'Box','off' );
set( leg,'Color','none' );
set( leg,'FontSize',fontsize-2 );
set( leg,'Units','inches' );
elseif theid == 9
leg = legend( ...
[ ...
blis_ref ...
blis_md ...
open_ref ...
open_md ...
], ...
blis_dref_legend, ...
blis_legend, ...
open_dref_legend, ...
open_legend, ...
'Location', 'best' );
%'Location', legend_loc );
set( leg,'Box','off' );
set( leg,'Color','none' );
set( leg,'FontSize',fontsize-2 );
set( leg,'Units','inches' );
end
set( ax1,'FontSize',fontsize );
set( ax1,'TitleFontSizeMultiplier',1.0 ); % default is 1.1.
box( ax1, 'on' );
titl = title( titlename );
set( titl, 'FontWeight', 'normal' ); % default font style is now 'bold'.
tpos = get( titl, 'Position' ); % default is to align across whole figure, not box.
%tpos(1) = tpos(1) + 100;
tpos(1) = tpos(1) + 40;
set( titl, 'Position', tpos ); % here we nudge it back to centered with box.
if theid > 112
xlab = xlabel( ax1,xaxisname );
%tpos = get( xlab, 'Position' )
%tpos(2) = tpos(2) + 10;
%set( xlab, 'Position', tpos );
end
if mod(theid-1,16) == 0
ylab = ylabel( ax1,yaxisname );
end
%export_fig( filename, colorflag, '-pdf', '-m2', '-painters', '-transparent' );
%saveas( fig, filename_png );
%hold( ax1, 'off' );
r_val = 0;
end

View File

@@ -0,0 +1,44 @@
fig1 = figure(1);
clf;
%orient(fig1,'landscape')
orient(gcf,'landscape')
for i = 1:128
subplot(8,16,i);
xx = 400:400:2000;
aa = rand(size(xx));
plot(xx,aa);
end
% broken.
if 0
set(gcf, 'PaperUnits', 'inches');
set(gcf, 'PaperSize', [60 36]);
set(fig1,'PaperUnits','normalized');
set(fig1,'PaperPosition', [0 0 1 1]);
print(fig1, 'testrand', '-dpdf');
end
if 0
% works okay.
set(gcf,'PaperUnits', 'inches');
set(gcf,'PaperSize', [72 36]);
set(gcf,'PaperPositionMode','auto');
set(gcf,'PaperOrientation','landscape');
set(gcf,'Position',[50 50 4000 1800]);
print(gcf, 'testrand','-bestfit','-dpdf');
end
if 1
% works better?
set(gcf,'Position',[0 0 2000 900]);
set(gcf,'PaperUnits', 'inches');
set(gcf,'PaperSize', [48 22]);
set(gcf,'PaperPosition', [0 0 48 22]);
%set(gcf,'PaperPositionMode','auto');
set(gcf,'PaperPositionMode','manual');
set(gcf,'PaperOrientation','landscape');
print(gcf, 'testrand','-bestfit','-dpdf');
end

Binary file not shown.

168
test/mixeddt/runme.sh Executable file
View File

@@ -0,0 +1,168 @@
#!/bin/bash
# File pefixes.
exec_root="test"
out_root="output"
sys="blis"
#sys="stampede2"
#sys="lonestar5"
# Bind threads to processors.
#export OMP_PROC_BIND=true
#export GOMP_CPU_AFFINITY="0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15"
#export GOMP_CPU_AFFINITY="0 1 2 3 4 5 6 7"
#export GOMP_CPU_AFFINITY="0 1 2 3 4 5 6 7"
#export GOMP_CPU_AFFINITY="0 2 4 6 1 3 5 7"
#export GOMP_CPU_AFFINITY="0 4 1 5 2 6 3 7"
#export GOMP_CPU_AFFINITY="0 1 4 5 8 9 12 13 16 17 20 21 24 25 28 29 32 33 36 37 40 41 44 45"
#export GOMP_CPU_AFFINITY="0 2 4 6 8 10 12 14 16 18 20 22 1 3 5 7 9 11 13 15 17 19 21 23"
#export GOMP_CPU_AFFINITY="0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23"
export GOMP_CPU_AFFINITY="0 1 2 3"
# Modify LD_LIBRARY_PATH.
if [ ${sys} = "blis" ]; then
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH"
elif [ ${sys} = "stampede2" ]; then
:
elif [ ${sys} = "lonestar5" ]; then
# A hack to use libiomp5 with gcc.
#export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/opt/apps/intel/16.0.1.150/compilers_and_libraries_2016.1.150/linux/compiler/lib/intel64"
:
fi
# Threading scheme to use when multithreading
if [ ${sys} = "blis" ]; then
jc_nt=2 # 5th loop
ic_nt=2 # 3rd loop
jr_nt=1 # 2nd loop
ir_nt=1 # 1st loop
nt=4
elif [ ${sys} = "stampede2" ]; then
jc_nt=2 # 5th loop
ic_nt=8 # 3rd loop
jr_nt=1 # 2nd loop
ir_nt=1 # 1st loop
nt=16
elif [ ${sys} = "lonestar5" ]; then
jc_nt=4 # 5th loop
ic_nt=6 # 3rd loop
jr_nt=1 # 2nd loop
ir_nt=1 # 1st loop
nt=24
fi
# Complex domain implementations to test.
if [ ${sys} = "blis" ]; then
test_impls="openblas asm_blis"
elif [ ${sys} = "stampede2" ]; then
test_impls="openblas asm_blis mkl"
elif [ ${sys} = "lonestar5" ]; then
test_impls="openblas mkl asm_blis"
fi
# Datatypes to test.
#dts="s d c z"
# Operations to test.
l3_ops="gemm"
test_ops="${l3_ops}"
# Define the list of datatype chars and precision chars.
dt_chars="s d c z"
pr_chars="s d"
# Construct the datatype combination strings.
dt_combos=""
for dtc in ${dt_chars}; do
for dta in ${dt_chars}; do
for dtb in ${dt_chars}; do
for pre in ${pr_chars}; do
dt_combos="${dt_combos} ${dtc}${dta}${dtb}${pre}"
done
done
done
done
# Threadedness to test.
threads="mt"
#threads="st"
test_impls="openblas"
#dt_combos="ssss sssd ssds sdss dsss ddds dddd"
#dt_combos="csss csds cdss cdds zsss zsds zdss zdds cssd csdd cdsd cddd zssd zsdd zdsd zddd"
#dt_combos="cssd csdd cdsd cddd zsss zsds zdss zdds"
#dt_combos="cdsd cddd zsss zsds zdss zdds"
#test_impls="asm_blis"
# Now perform complex test cases.
for th in ${threads}; do
for dt in ${dt_combos}; do
for im in ${test_impls}; do
for op in ${test_ops}; do
# Set the number of threads according to th.
if [ ${th} = "mt" ]; then
export BLIS_JC_NT=${jc_nt}
export BLIS_IC_NT=${ic_nt}
export BLIS_JR_NT=${jr_nt}
export BLIS_IR_NT=${ir_nt}
export OMP_NUM_THREADS=${nt}
export OPENBLAS_NUM_THREADS=${nt}
# Unset GOMP_CPU_AFFINITY for OpenBLAS, as it causes the library
# to execute sequentially.
if [ ${im} = "openblas" ]; then
unset GOMP_CPU_AFFINITY
else
export GOMP_CPU_AFFINITY="0 1 2 3"
fi
else
export BLIS_JC_NT=1
export BLIS_IC_NT=1
export BLIS_JR_NT=1
export BLIS_IR_NT=1
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
fi
# Construct the name of the test executable.
exec_name="${exec_root}_${dt}${op}_${im}_${th}.x"
# Construct the name of the output file.
out_file="${out_root}_${th}_${dt}${op}_${im}.m"
echo "Running (nt = ${OMP_NUM_THREADS}) ./${exec_name} > ${out_file}"
# Run executable.
./${exec_name} > ${out_file}
#sleep 1
done
done
done
done

580
test/mixeddt/test_gemm.c Normal file
View File

@@ -0,0 +1,580 @@
/*
BLIS
An object-based framework for developing high-performance BLAS-like
libraries.
Copyright (C) 2014, The University of Texas at Austin
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
- Neither the name of The University of Texas nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include <unistd.h>
#include "blis.h"
void blas_gemm_md( obj_t* alpha, obj_t* a, obj_t* b, obj_t* beta, obj_t* c );
void blas_gemm( trans_t transa, trans_t transb, num_t dt, obj_t* ao, obj_t* alpha, obj_t* bo, obj_t* beta, obj_t* co );
//#define PRINT
int main( int argc, char** argv )
{
obj_t a, b, c;
obj_t c_save;
obj_t* alphao;
obj_t* betao;
dim_t m, n, k;
dim_t p;
dim_t p_begin, p_end, p_inc;
int m_input, n_input, k_input;
num_t dta, dtb, dtc, dtx;
char dta_ch, dtb_ch, dtc_ch;
char dtx_ch;
int r, n_repeats;
trans_t transa;
trans_t transb;
double dtime;
double dtime_save;
double gflops;
double flopsmul;
//bli_init();
//bli_error_checking_level_set( BLIS_NO_ERROR_CHECKING );
n_repeats = 3;
dta = DTA;
dtb = DTB;
dtc = DTC;
dtx = DTX;
// Extract the precision component of the computation datatype.
prec_t comp_prec = bli_dt_prec( dtx );
( void )dta_ch;
( void )dtb_ch;
( void )dtc_ch;
( void )dtx_ch;
p_begin = P_BEGIN;
p_end = P_END;
p_inc = P_INC;
m_input = -1;
n_input = -1;
k_input = -1;
#if 0
k_input = 256;
#endif
// Choose the char corresponding to the requested datatype.
if ( bli_is_float( dta ) ) dta_ch = 's';
else if ( bli_is_double( dta ) ) dta_ch = 'd';
else if ( bli_is_scomplex( dta ) ) dta_ch = 'c';
else dta_ch = 'z';
if ( bli_is_float( dtb ) ) dtb_ch = 's';
else if ( bli_is_double( dtb ) ) dtb_ch = 'd';
else if ( bli_is_scomplex( dtb ) ) dtb_ch = 'c';
else dtb_ch = 'z';
if ( bli_is_float( dtc ) ) dtc_ch = 's';
else if ( bli_is_double( dtc ) ) dtc_ch = 'd';
else if ( bli_is_scomplex( dtc ) ) dtc_ch = 'c';
else dtc_ch = 'z';
if ( bli_is_float( dtx ) ) dtx_ch = 's';
else dtx_ch = 'd';
transa = BLIS_NO_TRANSPOSE;
transb = BLIS_NO_TRANSPOSE;
// Begin with initializing the last entry to zero so that
// matlab allocates space for the entire array once up-front.
for ( p = p_begin; p + p_inc <= p_end; p += p_inc ) ;
//printf( "data_%s_%c%c%c%cgemm_%s", THR_STR, dtc_ch, dta_ch, dtb_ch, dtx_ch, STR );
printf( "data_gemm_%s", STR );
printf( "( %2lu, 1:4 ) = [ %4lu %4lu %4lu %7.2f ];\n",
( unsigned long )(p - p_begin + 1)/p_inc + 1,
( unsigned long )0,
( unsigned long )0,
( unsigned long )0, 0.0 );
// Adjust the flops scaling based on which domain case is being executed.
if ( bli_is_real( dtc ) && bli_is_real( dta ) && bli_is_real( dtb ) )
flopsmul = 2.0;
else if ( bli_is_real( dtc ) && bli_is_real( dta ) && bli_is_complex( dtb ) )
flopsmul = 2.0;
else if ( bli_is_real( dtc ) && bli_is_complex( dta ) && bli_is_real( dtb ) )
flopsmul = 2.0;
else if ( bli_is_real( dtc ) && bli_is_complex( dta ) && bli_is_complex( dtb ) )
#ifdef BLIS
flopsmul = 4.0;
#else
flopsmul = 4.0; // executes 8.0, but only gets "credit" for 4.0
#endif
else if ( bli_is_complex( dtc ) && bli_is_real( dta ) && bli_is_real( dtb ) )
flopsmul = 2.0;
else if ( bli_is_complex( dtc ) && bli_is_real( dta ) && bli_is_complex( dtb ) )
#ifdef BLIS
flopsmul = 4.0;
#else
flopsmul = 4.0; // executes 8.0, but only gets "credit" for 4.0
#endif
else if ( bli_is_complex( dtc ) && bli_is_complex( dta ) && bli_is_real( dtb ) )
flopsmul = 4.0;
else if ( bli_is_complex( dtc ) && bli_is_complex( dta ) && bli_is_complex( dtb ) )
flopsmul = 8.0;
for ( p = p_begin; p <= p_end; p += p_inc )
{
if ( m_input < 0 ) m = p * ( dim_t )abs(m_input);
else m = ( dim_t ) m_input;
if ( n_input < 0 ) n = p * ( dim_t )abs(n_input);
else n = ( dim_t ) n_input;
if ( k_input < 0 ) k = p * ( dim_t )abs(k_input);
else k = ( dim_t ) k_input;
bli_obj_create( dta, m, k, 0, 0, &a );
bli_obj_create( dtb, k, n, 0, 0, &b );
bli_obj_create( dtc, m, n, 0, 0, &c );
bli_obj_create( dtc, m, n, 0, 0, &c_save );
bli_obj_set_comp_prec( comp_prec, &c );
alphao = &BLIS_ONE;
betao = &BLIS_ONE;
bli_randm( &a );
bli_randm( &b );
bli_randm( &c );
bli_obj_set_conjtrans( transa, &a );
bli_obj_set_conjtrans( transb, &b );
bli_copym( &c, &c_save );
dtime_save = DBL_MAX;
for ( r = 0; r < n_repeats; ++r )
{
bli_copym( &c_save, &c );
dtime = bli_clock();
#ifdef PRINT
bli_printm( "a", &a, "%4.1f", "" );
bli_printm( "b", &b, "%4.1f", "" );
bli_printm( "c", &c, "%4.1f", "" );
#endif
#ifdef BLIS
bli_gemm
(
alphao,
&a,
&b,
betao,
&c
);
#else
blas_gemm_md
(
alphao,
&a,
&b,
betao,
&c
);
#endif
#ifdef PRINT
bli_printm( "c after", &c, "%4.1f", "" );
exit(1);
#endif
dtime_save = bli_clock_min_diff( dtime_save, dtime );
}
gflops = ( flopsmul * m * k * n ) / ( dtime_save * 1.0e9 );
//printf( "data_%s_%c%c%c%cgemm_%s", THR_STR, dtc_ch, dta_ch, dtb_ch, dtx_ch, STR );
printf( "data_gemm_%s", STR );
printf( "( %2lu, 1:4 ) = [ %4lu %4lu %4lu %7.2f ];\n",
( unsigned long )(p - p_begin + 1)/p_inc + 1,
( unsigned long )m,
( unsigned long )k,
( unsigned long )n, gflops );
bli_obj_free( &a );
bli_obj_free( &b );
bli_obj_free( &c );
bli_obj_free( &c_save );
}
//bli_finalize();
return 0;
}
void blas_gemm_md( obj_t* alpha, obj_t* a, obj_t* b, obj_t* beta, obj_t* c )
{
trans_t transa = bli_obj_conjtrans_status( a );
trans_t transb = bli_obj_conjtrans_status( b );
prec_t comp_prec = bli_obj_comp_prec( c );
if ( bli_obj_dt( a ) == bli_obj_dt( b ) &&
bli_obj_dt( b ) == bli_obj_dt( c ) &&
bli_obj_dt( c ) == ( num_t )comp_prec )
{
blas_gemm( transa, transb, bli_obj_dt( c ), alpha, a, b, beta, c );
return;
}
num_t dtc = bli_obj_dt( c );
num_t dta = bli_obj_dt( a );
num_t dtb = bli_obj_dt( b );
dim_t m = bli_obj_length( c );
dim_t n = bli_obj_width( c );
dim_t k = bli_obj_width_after_trans( a );
obj_t* ao = a;
obj_t* bo = b;
obj_t* co = c;
num_t targ_dt_c, targ_dt_a, targ_dt_b;
dom_t targ_dom_c, targ_dom_a, targ_dom_b;
num_t dt_comp;
dom_t comp_dom;
obj_t at, bt, ct;
obj_t ar, cr;
bool_t needacc;
bool_t force_proj_a = FALSE;
bool_t force_proj_b = FALSE;
if ( bli_is_real( dtc ) && bli_is_real( dta ) && bli_is_real( dtb ) )
{
// rrr
comp_dom = BLIS_REAL;
targ_dom_c = BLIS_REAL; targ_dom_a = BLIS_REAL; targ_dom_b = BLIS_REAL;
needacc = FALSE;
}
else if ( bli_is_real( dtc ) && bli_is_real( dta ) && bli_is_complex( dtb ) )
{
// rrc
comp_dom = BLIS_REAL;
targ_dom_c = BLIS_REAL; targ_dom_a = BLIS_REAL; targ_dom_b = BLIS_REAL;
needacc = FALSE;
force_proj_b = TRUE;
}
else if ( bli_is_real( dtc ) && bli_is_complex( dta ) && bli_is_real( dtb ) )
{
// rcr
comp_dom = BLIS_REAL;
targ_dom_c = BLIS_REAL; targ_dom_a = BLIS_REAL; targ_dom_b = BLIS_REAL;
needacc = FALSE;
force_proj_a = TRUE;
}
else if ( bli_is_real( dtc ) && bli_is_complex( dta ) && bli_is_complex( dtb ) )
{
// rcc
comp_dom = BLIS_COMPLEX;
targ_dom_c = BLIS_COMPLEX; targ_dom_a = BLIS_COMPLEX; targ_dom_b = BLIS_COMPLEX;
needacc = TRUE;
}
else if ( bli_is_complex( dtc ) && bli_is_real( dta ) && bli_is_real( dtb ) )
{
// crr
comp_dom = BLIS_REAL;
targ_dom_c = BLIS_REAL; targ_dom_a = BLIS_REAL; targ_dom_b = BLIS_REAL;
needacc = TRUE;
}
else if ( bli_is_complex( dtc ) && bli_is_real( dta ) && bli_is_complex( dtb ) )
{
// crc
comp_dom = BLIS_COMPLEX;
targ_dom_c = BLIS_COMPLEX; targ_dom_a = BLIS_COMPLEX; targ_dom_b = BLIS_COMPLEX;
needacc = FALSE;
force_proj_a = TRUE;
}
else if ( bli_is_complex( dtc ) && bli_is_complex( dta ) && bli_is_real( dtb ) )
{
// ccr
comp_dom = BLIS_REAL;
targ_dom_c = BLIS_COMPLEX; targ_dom_a = BLIS_COMPLEX; targ_dom_b = BLIS_REAL;
needacc = FALSE;
}
else if ( bli_is_complex( dtc ) && bli_is_complex( dta ) && bli_is_complex( dtb ) )
{
// ccc
comp_dom = BLIS_COMPLEX;
targ_dom_c = BLIS_COMPLEX; targ_dom_a = BLIS_COMPLEX; targ_dom_b = BLIS_COMPLEX;
needacc = FALSE;
}
else
{
comp_dom = BLIS_REAL;
targ_dom_c = BLIS_REAL; targ_dom_a = BLIS_REAL; targ_dom_b = BLIS_REAL;
needacc = FALSE;
}
// ----------------------------------------------------------------------------
// Merge the computation domain with the computation precision.
dt_comp = comp_dom | comp_prec;
targ_dt_a = targ_dom_a | comp_prec;
targ_dt_b = targ_dom_b | comp_prec;
targ_dt_c = targ_dom_c | comp_prec;
// Copy-cast A, if needed.
if ( bli_dt_prec( dta ) != comp_prec || force_proj_a )
{
bli_obj_create( targ_dt_a, m, k, 0, 0, &at );
bli_castm( ao, &at );
ao = &at;
}
// Copy-cast B, if needed.
if ( bli_dt_prec( dtb ) != comp_prec || force_proj_b )
{
bli_obj_create( targ_dt_b, k, n, 0, 0, &bt );
bli_castm( bo, &bt );
bo = &bt;
}
if ( bli_dt_prec( dtc ) != comp_prec )
{
needacc = TRUE;
}
// Copy-cast C, if needed.
if ( needacc )
{
//bli_obj_create( dt_comp, m, n, 0, 0, &ct );
bli_obj_create( targ_dt_c, m, n, 0, 0, &ct );
bli_castm( c, &ct );
co = &ct;
}
// ----------------------------------------------------------------------------
if ( bli_is_real( dtc ) && bli_is_real( dta ) && bli_is_real( dtb ) )
{
}
else if ( bli_is_real( dtc ) && bli_is_real( dta ) && bli_is_complex( dtb ) )
{
}
else if ( bli_is_real( dtc ) && bli_is_complex( dta ) && bli_is_real( dtb ) )
{
}
else if ( bli_is_real( dtc ) && bli_is_complex( dta ) && bli_is_complex( dtb ) )
{
}
else if ( bli_is_complex( dtc ) && bli_is_real( dta ) && bli_is_real( dtb ) )
{
}
else if ( bli_is_complex( dtc ) && bli_is_real( dta ) && bli_is_complex( dtb ) )
{
}
else if ( bli_is_complex( dtc ) && bli_is_complex( dta ) && bli_is_real( dtb ) )
{
inc_t rsa = bli_obj_row_stride( ao );
inc_t csa = bli_obj_col_stride( ao );
inc_t ma = bli_obj_length( ao );
inc_t na = bli_obj_width( ao );
siz_t ela = bli_obj_elem_size( ao );
num_t dtap = bli_obj_dt_proj_to_real( ao );
bli_obj_alias_to( ao, &ar ); ao = &ar;
bli_obj_set_strides( rsa, 2*csa, ao );
bli_obj_set_dims( 2*ma, na, ao );
bli_obj_set_dt( dtap, ao );
bli_obj_set_elem_size( ela/2, ao );
inc_t rsc = bli_obj_row_stride( co );
inc_t csc = bli_obj_col_stride( co );
inc_t mc = bli_obj_length( co );
inc_t nc = bli_obj_width( co );
siz_t elc = bli_obj_elem_size( co );
num_t dtcp = bli_obj_dt_proj_to_real( co );
bli_obj_alias_to( co, &cr ); co = &cr;
bli_obj_set_strides( rsc, 2*csc, co );
bli_obj_set_dims( 2*mc, nc, co );
bli_obj_set_dt( dtcp, co );
bli_obj_set_elem_size( elc/2, co );
}
else if ( bli_is_complex( dtc ) && bli_is_complex( dta ) && bli_is_complex( dtb ) )
{
}
else
{
}
// ----------------------------------------------------------------------------
// Call the BLAS.
blas_gemm( transa, transb, dt_comp, alpha, ao, bo, beta, co );
// Accumulate back to C, if needed.
if ( needacc )
{
bli_castm( &ct, c );
}
if ( bli_dt_prec( dta ) != comp_prec || force_proj_a ) { bli_obj_free( &at ); }
if ( bli_dt_prec( dtb ) != comp_prec || force_proj_b ) { bli_obj_free( &bt ); }
if ( needacc ) { bli_obj_free( &ct ); }
}
void blas_gemm( trans_t transa, trans_t transb, num_t dt, obj_t* alpha, obj_t* a, obj_t* b, obj_t* beta, obj_t* c )
{
char f77_transa = 'N';
char f77_transb = 'N';
//bli_param_map_blis_to_netlib_trans( transa, &f77_transa );
//bli_param_map_blis_to_netlib_trans( transb, &f77_transb );
if ( bli_is_float( dt ) )
{
f77_int mm = bli_obj_length( c );
f77_int kk = bli_obj_width_after_trans( a );
f77_int nn = bli_obj_width( c );
f77_int lda = bli_obj_col_stride( a );
f77_int ldb = bli_obj_col_stride( b );
f77_int ldc = bli_obj_col_stride( c );
float* alphap = bli_obj_buffer_for_1x1( dt, alpha );
float* ap = bli_obj_buffer( a );
float* bp = bli_obj_buffer( b );
float* betap = bli_obj_buffer_for_1x1( dt, beta );
float* cp = bli_obj_buffer( c );
sgemm_( &f77_transa,
&f77_transb,
&mm,
&nn,
&kk,
alphap,
ap, &lda,
bp, &ldb,
betap,
cp, &ldc );
}
else if ( bli_is_double( dt ) )
{
f77_int mm = bli_obj_length( c );
f77_int kk = bli_obj_width_after_trans( a );
f77_int nn = bli_obj_width( c );
f77_int lda = bli_obj_col_stride( a );
f77_int ldb = bli_obj_col_stride( b );
f77_int ldc = bli_obj_col_stride( c );
double* alphap = bli_obj_buffer_for_1x1( dt, alpha );
double* ap = bli_obj_buffer( a );
double* bp = bli_obj_buffer( b );
double* betap = bli_obj_buffer_for_1x1( dt, beta );
double* cp = bli_obj_buffer( c );
dgemm_( &f77_transa,
&f77_transb,
&mm,
&nn,
&kk,
alphap,
ap, &lda,
bp, &ldb,
betap,
cp, &ldc );
}
else if ( bli_is_scomplex( dt ) )
{
f77_int mm = bli_obj_length( c );
f77_int kk = bli_obj_width_after_trans( a );
f77_int nn = bli_obj_width( c );
f77_int lda = bli_obj_col_stride( a );
f77_int ldb = bli_obj_col_stride( b );
f77_int ldc = bli_obj_col_stride( c );
scomplex* alphap = bli_obj_buffer_for_1x1( dt, alpha );
scomplex* ap = bli_obj_buffer( a );
scomplex* bp = bli_obj_buffer( b );
scomplex* betap = bli_obj_buffer_for_1x1( dt, beta );
scomplex* cp = bli_obj_buffer( c );
cgemm_( &f77_transa,
&f77_transb,
&mm,
&nn,
&kk,
alphap,
ap, &lda,
bp, &ldb,
betap,
cp, &ldc );
}
else if ( bli_is_dcomplex( dt ) )
{
f77_int mm = bli_obj_length( c );
f77_int kk = bli_obj_width_after_trans( a );
f77_int nn = bli_obj_width( c );
f77_int lda = bli_obj_col_stride( a );
f77_int ldb = bli_obj_col_stride( b );
f77_int ldc = bli_obj_col_stride( c );
dcomplex* alphap = bli_obj_buffer_for_1x1( dt, alpha );
dcomplex* ap = bli_obj_buffer( a );
dcomplex* bp = bli_obj_buffer( b );
dcomplex* betap = bli_obj_buffer_for_1x1( dt, beta );
dcomplex* cp = bli_obj_buffer( c );
zgemm_( &f77_transa,
&f77_transb,
&mm,
&nn,
&kk,
alphap,
ap, &lda,
bp, &ldb,
betap,
cp, &ldc );
}
}

View File

@@ -25,6 +25,8 @@ cj # Vector storage scheme(s) to test:
sdcz # Datatype(s) to test:
# 's' = single real; 'c' = single complex;
# 'd' = double real; 'z' = double complex
0 # Test gemm with mixed-domain operands?
0 # Test gemm with mixed-precision operands?
100 # Problem size: first to test
500 # Problem size: maximum to test
100 # Problem size: increment between experiments

View File

@@ -25,6 +25,8 @@ cj # Vector storage scheme(s) to test:
sdcz # Datatype(s) to test:
# 's' = single real; 'c' = single complex;
# 'd' = double real; 'z' = double complex
0 # Test gemm with mixed-domain operands?
0 # Test gemm with mixed-precision operands?
100 # Problem size: first to test
100 # Problem size: maximum to test
100 # Problem size: increment between experiments

View File

@@ -190,6 +190,10 @@
-1 -2 # dimensions: m n
? # parameters: transa
1 # xpbym
-1 -1 # dimensions: m n
? # parameters: transa
# --- Level-1f kernels -----------------------------------------------------

Some files were not shown because too many files have changed in this diff Show More