diff --git a/CREDITS b/CREDITS index a79eb192b..17e9e14f2 100644 --- a/CREDITS +++ b/CREDITS @@ -16,6 +16,7 @@ but many others have contributed code and feedback, including Vernon Austel (IBM, T.J. Watson Research Center) Jed Brown @jedbrown (Argonne National Laboratory) Robin Christ @robinchrist + Kay Dewhurst @jkd2016 (Max Planck Institute, Halle, Germany) Johannes Dieterich @iotamudelta Krzysztof Drewniak @krzysz00 Victor Eijkhout @VictorEijkhout (Texas Advanced Computing Center) diff --git a/README.md b/README.md index fed238494..565efb09e 100644 --- a/README.md +++ b/README.md @@ -46,6 +46,20 @@ advance in dense linear algebra computation. While BLIS remains a work-in-progress, we are excited to continue its development and further cultivate its use within the community. +The BLIS framework is primarily developed and maintained by individuals in the +[Science of High-Performance Computing](http://shpc.ices.utexas.edu/) +(SHPC) group in the +[Institute for Computational Engineering and Sciences](https://www.ices.utexas.edu/) +at [The University of Texas at Austin](https://www.utexas.edu/). +Please visit the [SHPC](http://shpc.ices.utexas.edu/) website for more +information about our research group, such as a list of +[people](http://shpc.ices.utexas.edu/people.html) +and [collaborators](http://shpc.ices.utexas.edu/collaborators.html), +[funding sources](http://shpc.ices.utexas.edu/funding.html), +[publications](http://shpc.ices.utexas.edu/publications.html), +and [other educational projects](http://www.ulaff.net/) (such as MOOCs). + + Key Features ------------ @@ -163,81 +177,109 @@ such as `gemm`. 
Getting Started --------------- -If you just want to build/install a sequential (not parallelized) version of -BLIS in a hurry and come back and explore other topics later, you can configure +If you just want to build a sequential (not parallelized) version of BLIS +in a hurry and come back and explore other topics later, you can configure and build BLIS as follows: ``` $ ./configure auto $ make [-j] -$ make install ``` You can then verify your build by running BLAS- and BLIS-specific test drivers via `make check`: ``` $ make check [-j] ``` +And if you would like to install BLIS to the directory specified to `configure` +via the `--prefix` option, run the `install` target: +``` +$ make install +``` Please read the output of `./configure --help` for a full list of configure-time options. -A more detailed walkthrough of the build system can be found in our -[Build System](docs/BuildSystem.md) guide. +If/when you have time, we *strongly* encourage you to read the detailed +walkthrough of the build system found in our [Build System](docs/BuildSystem.md) +guide. -We provide comprehensive documentation on BLIS's two primarily APIs: -the [object API](docs/BLISObjectAPI.md) and -the [typed API](docs/BLISTypedAPI.md). -These documents provide brief descriptions of each operation interface as -well as some more general information needed when developing an application -with BLIS. +Documentation +------------- -If you want to begin using the object API in BLIS, please step through the -example code tutorial in the [examples/oapi](examples/oapi) directory. -We also have the equivalent code examples for the typed API available in -[examples/tapi](examples/tapi). +We provide extensive documentation on the BLIS build system, APIs, test +infrastructure, and other important topics. All documentation is formatted in +markdown and included in the BLIS source distribution (usually in the `docs` +directory). 
Slightly longer descriptions of each document may be found in +the project's [wiki](https://github.com/flame/blis/wiki) section. -Users interested in using BLIS to obtain multithreaded parallelism should -read the [Multithreading](docs/Multithreading.md) documentation. +**Documents for everyone:** + * **[Build System](docs/BuildSystem.md).** This document covers the basics of +configuring and building BLIS libraries, as well as related topics. + * **[Testsuite](docs/Testsuite.md).** This document describes how to run +BLIS's highly parameterized and configurable test suite, as well as the +included BLAS test drivers. + * **[BLIS Typed API Reference](docs/BLISTypedAPI.md).** Here we document the +so-called "typed" (or BLAS-like) API. This is the API that many users who are +already familiar with the BLAS will likely want to use. You can find lots of +example code for the typed API in the [examples/tapi](examples/tapi) directory +included in the BLIS source distribution. + * **[BLIS Object API Reference](docs/BLISObjectAPI.md).** Here we document +the object API. This API abstracts away properties of vectors and matrices +within `obj_t` structs that can be queried with accessor functions. Many +developers and experts prefer this API over the typed API. You can find lots of +example code for the object API in the [examples/oapi](examples/oapi) directory +included in the BLIS source distribution. + * **[Hardware Support](docs/HardwareSupport.md).** This document maintains a +table of supported microarchitectures. + * **[Multithreading](docs/Multithreading.md).** This document describes how to +use the multithreading features of BLIS. + * **[Release Notes](docs/ReleaseNotes.md).** This document tracks a summary of +changes included with each new version of BLIS, along with contributor credits +for key features. + * **[Frequently Asked Questions](docs/FAQ.md).** If you have general questions +about BLIS, please read this FAQ. 
If you can't find the answer to your question, +please feel free to join the [blis-devel](https://groups.google.com/group/blis-devel) +mailing list and post a question. We also have a +[blis-discuss](https://groups.google.com/group/blis-discuss) mailing list that +anyone can post to (even without joining). -Have a quick question? You may find the answer in our list of [frequently asked -questions](docs/FAQ.md). +**Documents for github contributors:** + * **[Contributing bug reports, feature requests, PRs, etc](CONTRIBUTING.md).** +Interested in contributing to BLIS? Please read this document before getting +started. It provides a general overview of how best to report bugs, propose new +features, and offer code patches. + * **[Coding Conventions](docs/CodingConventions.md).** If you are interested in or +planning on contributing code to BLIS, please read this document so that you can +format your code in accordance with BLIS's standards. -Does BLIS contain kernels optimized for your favorite architecture? Please see -our [Hardware Support](docs/HardwareSupport.md) guide -for a full list of optimized kernels. - -The [Release Notes](docs/ReleaseNotes.md) contain a summary of new features -provided by each new tagged version (release) of BLIS, along with the date -the release. - -We also provide documentation on the following topics, which will likely be of -interest to more advanced users and developers: - * [Configurations](docs/ConfigurationHowTo.md). -This document describes how the configuration system works in BLIS, and also -provides step-by-step instructions for creating a new configuration. -(In BLIS, a "configuration" captures all of the details necessary to build -BLIS for a specific hardware architecture.) Configurations specify things -like cache blocksizes and kernel functions, as well as various optional -configuration settings. - * [Kernels](docs/KernelsHowTo.md). 
-This document describes each of the BLIS kernel operations in detail and should -provide developers with most of the information needed to get started with -writing and optimizing their own kernels. - * [Test suite](docs/Testsuite.md). -This document contains detailed instructions on running the BLIS test suite, -located in the top-level directory testsuite. Also included: a walkthrough -of the BLAS test drivers, which exercise the BLAS compatibility layer that -is, by default, included in BLIS. - -A full listing of all documentation may be found via in the project's -[wiki](https://github.com/flame/blis/wiki) section. +**Documents for BLIS developers:** + * **[Kernels Guide](docs/KernelsHowTo.md).** If you would like to learn more +about the types of kernels that BLIS exposes, their semantics, the operations +that each kernel accelerates, and various implementation issues, please read +this guide. + * **[Configuration Guide](docs/ConfigurationHowTo.md).** If you would like to +learn how to add new sub-configurations or configuration families, or are simply +interested in learning how BLIS organizes its configurations and kernel sets, +please read this thorough walkthrough of the configuration system. + * **[Sandbox Guide](docs/Sandboxes.md).** If you are interested in learning +about using sandboxes in BLIS--that is, providing alternative implementations +of the `gemm` operation--please read this document. External Linux packages ----------------------- -Generally speaking, we **highly recommend** building from source whenever possible using the latest `git` clone. (Tarballs of each [tagged release](https://github.com/flame/blis/releases) are also available, but are not preferred since they are more difficult to upgrade from than a git clone.) +Generally speaking, we **highly recommend** building from source whenever +possible using the latest `git` clone. 
(Tarballs of each +[tagged release](https://github.com/flame/blis/releases) are also available, but +are not preferred since they are more difficult to upgrade from than a git +clone.) -If you prefer (or need) binary packages, please check out the following offerings available thanks to generous involvement/contributions from two of our community members. +If you prefer (or need) binary packages, please check out the following offerings +available thanks to generous involvement/contributions from two of our community +members. - * Red Hat/Fedora. Dave Love provides rpm packages for x86_64, which he maintains at [Fedora Copr](https://copr.fedorainfracloud.org/coprs/loveshack/blis/). - * Ubuntu/Debian. Nico Schlömer provides apt packages for various architectures, which he maintains at the PPA [launchpad.net](https://launchpad.net/%7Enschloe/+archive/ubuntu/blis-devel). + * Red Hat/Fedora. Dave Love provides rpm packages for x86_64, which he maintains +at [Fedora Copr](https://copr.fedorainfracloud.org/coprs/loveshack/blis/). + * Ubuntu/Debian. Nico Schlömer provides apt packages for various architectures, +which he maintains at the PPA +[launchpad.net](https://launchpad.net/%7Enschloe/+archive/ubuntu/blis-devel). Discussion ---------- @@ -245,24 +287,27 @@ Discussion You can keep in touch with developers and other users of the project by joining one of the following mailing lists: - * [blis-devel](http://groups.google.com/group/blis-devel): Please join and + * [blis-devel](https://groups.google.com/group/blis-devel): Please join and post to this mailing list if you are a BLIS developer, or if you are trying to use BLIS beyond simply linking to it as a BLAS library. **Note:** Most of the interesting discussions happen here; don't be afraid to join! 
If you would like to submit a bug report, or discuss a possible bug, -please consider opening a [new issue](http://github.com/flame/blis/issues) on +please consider opening a [new issue](https://github.com/flame/blis/issues) on github. - * [blis-discuss](http://groups.google.com/group/blis-discuss): Please join and + * [blis-discuss](https://groups.google.com/group/blis-discuss): Please join and post to this mailing list if you have general questions or feedback regarding BLIS. Application developers (end users) may wish to post here, unless they have bug reports, in which case they should open a -[new issue](http://github.com/flame/blis/issues) on github. +[new issue](https://github.com/flame/blis/issues) on github. Contributing ------------ -For information on how to contribute to our project, including preferred [coding conventions](docs/CodingConventions), please refer to the [CONTRIBUTING](CONTRIBUTING.md) file at the top-level of the BLIS source distribution. +For information on how to contribute to our project, including preferred +[coding conventions](docs/CodingConventions.md), please refer to the +[CONTRIBUTING](CONTRIBUTING.md) file at the top-level of the BLIS source +distribution. Citations --------- diff --git a/blastest/f2c/open.c b/blastest/f2c/open.c index e58b445ae..2834fd946 100644 --- a/blastest/f2c/open.c +++ b/blastest/f2c/open.c @@ -245,7 +245,12 @@ int fk_open(int seq, int fmt, ftnint n) { char nbuf[10]; olist a; - (void) sprintf(nbuf,"fort.%ld",(long)n); + // FGVZ: gcc 7.3 outputs a warning that the integer value corresponding + // to the "%ld" format specifier could (in theory) use up 11 bytes in a + // string that only allows for five additional bytes. I use the modulo + // operator to reassure gcc that the integer will be very small. 
+ //(void) sprintf(nbuf,"fort.%ld",(long)n); + (void) sprintf(nbuf,"fort.%ld",(long)n % 20); a.oerr=1; a.ounit=n; a.ofnm=nbuf; diff --git a/build/add-copyright.py b/build/add-copyright.py index 0d5e52d5e..9a18b95fc 100755 --- a/build/add-copyright.py +++ b/build/add-copyright.py @@ -187,6 +187,8 @@ def main(): else: filename = git_words[1] + #my_echo( "-debug---- %s" % filename ) + # Start by opening the file. (We can assume it exists since it # was found by 'git status', so no need to check for existence.) # Read all lines in the file and then close it. @@ -203,7 +205,7 @@ def main(): # If the file does not have any copyright notice in it already, we # assume we don't need to update it. if not has_cr: - my_echo( "[skipped] %s" % filename ) + my_echo( "[nocrline] %s" % filename ) continue # Check whether the file already has a copyright for the_org. We may @@ -214,7 +216,7 @@ def main(): mod_file_lines = [] # At this point we know that the file has at least one copyright, and - # has_org_cr encodes whether already has a copyright for the_org. + # has_org_cr encodes whether it already has a copyright for the_org. # We process the files that we know already have copyrights for the_org # differently from the files that do not yet have them. @@ -240,12 +242,15 @@ def main(): repl_line = ' %s, ' % cur_year line_ny = re.sub( find_line, repl_line, line ) - my_echo( "[updated] %s" % filename ) + my_echo( "[updated ] %s" % filename ) # Add the updated line to the running list. mod_file_lines += line_ny else: + + my_echo( "[up2date ] %s" % filename ) + # Add the unchanged line to the running list. mod_file_lines += line @@ -262,7 +267,7 @@ def main(): # Don't go any further if we're only updating existing copyright # lines. 
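The arithmetic behind that gcc warning, and why the `% 20` workaround silences it, can be double-checked directly. A quick illustrative sketch (in Python rather than the file's C, with the buffer sizes taken from the hunk above):

```python
# fk_open() writes sprintf(nbuf, "fort.%ld", n) into char nbuf[10]:
# "fort." uses 5 bytes and the terminating NUL uses 1, leaving only 4
# bytes for the formatted integer.
prefix = "fort."

# Even a 32-bit long's most negative value formats to 11 characters
# (sign plus ten digits), which matches gcc's "11 bytes" complaint...
worst_32 = len(str(-2**31))            # "-2147483648"
assert worst_32 == 11

# ...so the worst case would need 5 + 11 + 1 = 17 bytes in a 10-byte buffer.
assert len(prefix) + worst_32 + 1 > 10

# With the patch's "% 20", the value lies in -19..19 (C's % truncates
# toward zero, preserving the dividend's sign), at most 3 characters,
# so the result always fits.
worst_mod = max(len(str(v)) for v in range(-19, 20))
assert worst_mod == 3
assert len(prefix) + worst_mod + 1 <= 10
```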
if update_only: - my_echo( "[skipped] %s" % filename ) + my_echo( "[nocrline] %s" % filename ) continue num_file_lines = len( file_lines ) @@ -313,7 +318,7 @@ def main(): mod_file_lines += line mod_file_lines += line_nyno - my_echo( "[added ] %s" % filename ) + my_echo( "[added ] %s" % filename ) # endif resnext diff --git a/build/bli_config.h.in b/build/bli_config.h.in index 97b2fcca0..dbc00a2bd 100644 --- a/build/bli_config.h.in +++ b/build/bli_config.h.in @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -52,6 +53,14 @@ #define BLIS_ENABLE_PTHREADS #endif +#if @enable_jrir_slab@ +#define BLIS_ENABLE_JRIR_SLAB +#endif + +#if @enable_jrir_rr@ +#define BLIS_ENABLE_JRIR_RR +#endif + #if @enable_packbuf_pools@ #define BLIS_ENABLE_PACKBUF_POOLS #endif diff --git a/build/irun.py b/build/irun.py new file mode 100755 index 000000000..97cc39c2f --- /dev/null +++ b/build/irun.py @@ -0,0 +1,309 @@ +#!/usr/bin/env python3 +# +# BLIS +# An object-based framework for developing high-performance BLAS-like +# libraries. +# +# Copyright (C) 2018, The University of Texas at Austin +# Copyright (C) 2018, Advanced Micro Devices, Inc. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions are +# met: +# - Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# - Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. 
+# - Neither the name of The University of Texas at Austin nor the names +# of its contributors may be used to endorse or promote products +# derived from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS +# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT +# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR +# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT +# HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, +# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT +# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, +# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY +# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +# +# +# Import modules +import os +import sys +import getopt +import re +import subprocess +import time +import statistics + + +def print_usage(): + + my_print( " " ) + my_print( " %s" % script_name ) + my_print( " " ) + my_print( " Field G. Van Zee" ) + my_print( " " ) + my_print( " Repeatedly run a test driver and accumulate statistics for the" ) + my_print( " output." ) + my_print( " " ) + my_print( " Usage:" ) + my_print( " " ) + my_print( " %s [options] drivername" % script_name ) + my_print( " " ) + my_print( " Arguments:" ) + my_print( " " ) + my_print( " drivername The filename/path of the test driver to run. The" ) + my_print( " test driver must output its performance data to" ) + my_print( " standard output." ) + my_print( " " ) + my_print( " The following options are accepted:" ) + my_print( " " ) + my_print( " -c num performance column index" ) + my_print( " Find the performance result in column index <num> of" ) + my_print( " the test driver's output. 
Here, a column is defined" ) + my_print( " as a contiguous sequence of non-whitespace characters," ) + my_print( " with the column indices beginning at 0. By default," ) + my_print( " the second-to-last column index in the output is used." ) + my_print( " " ) + my_print( " -d delay sleep() delay" ) + my_print( " Wait <delay> seconds after each execution of the" ) + my_print( " test driver. The default delay is 0." ) + my_print( " " ) + my_print( " -n niter number of iterations" ) + my_print( " Execute the test driver <niter> times. The default" ) + my_print( " value is 10." ) + my_print( " " ) + my_print( " -q quiet; summary only" ) + my_print( " Do not output statistics after every new execution of" ) + my_print( " the test driver; instead, only output the final values" ) + my_print( " after all iterations are complete. The default is to" ) + my_print( " output updated statistics after each iteration." ) + my_print( " " ) + my_print( " -h help" ) + my_print( " Output this information and exit." ) + my_print( " " ) + + +# ------------------------------------------------------------------------------ + +def my_print( s ): + + sys.stdout.write( "%s\n" % s ) + #sys.stdout.flush() + +# ------------------------------------------------------------------------------ + +# Global variables. +script_name = None +output_name = None + +def main(): + + global script_name + global output_name + + # Obtain the script name. + path, script_name = os.path.split(sys.argv[0]) + + output_name = script_name + + # Default values for optional arguments. + #perf_col = 9 + perf_col = -1 + delay = 0 + niter = 10 + quiet = False + + # Process our command line options. 
+ try: + opts, args = getopt.getopt( sys.argv[1:], "c:d:n:hq" ) + + except getopt.GetoptError as err: + # print help information and exit: + my_print( str(err) ) # will print something like "option -a not recognized" + print_usage() + sys.exit(2) + + for opt, optarg in opts: + if opt == "-c": + perf_col = optarg + elif opt == "-d": + delay = optarg + elif opt == "-n": + niter = optarg + elif opt == "-q": + quiet = True + elif opt == "-h": + print_usage() + sys.exit() + else: + print_usage() + sys.exit() + + # Print usage if we don't have exactly one argument. + if len( args ) != 1: + print_usage() + sys.exit() + + # Acquire our only mandatory argument: the name of the test driver. + driverfile = args[0] + + #my_print( "test driver: %s" % driverfile ) + #my_print( "column num: %s" % perf_col ) + #my_print( "delay: %s" % delay ) + #my_print( "num iter: %s" % niter ) + + # Build a list of iterations. + iters = range( int(niter) ) + + # Run the test driver once to detect the number of lines of output. + p = subprocess.run( driverfile, stdout=subprocess.PIPE ) + lines0 = p.stdout.decode().splitlines() + num_lines0 = int(len(lines0)) + + # Initialize the list of lists (one list per performance result). + aperf = [] + for i in range( num_lines0 ): + aperf.append( [] ) + + for it in iters: + + # Run the test driver. + p = subprocess.run( driverfile, stdout=subprocess.PIPE ) + + # Acquire the lines of output. + lines = p.stdout.decode().splitlines() + + # Accumulate the test driver's latest results into aperf. + for i in range( num_lines0 ): + + # Parse the current line to find the performance value. + line = lines[i] + words = line.split() + if perf_col == -1: + perf = words[ len(words)-2 ] + else: + perf = words[ int(perf_col) ] + + # As unlikely as it is, guard against Inf and NaN. + if float(perf) == float('Inf') or \ + float(perf) == -float('Inf') or \ + float(perf) == float('NaN'): perf = 0.0 + + # Add the performance value to the list at the ith entry of aperf. 
+ aperf[i].append( float(perf) ) + + # Compute stats for the current line. + avgp = statistics.mean( aperf[i] ) + maxp = max( aperf[i] ) + minp = min( aperf[i] ) + + # Only compute stdev() when we have two or more data points. + if len( aperf[i] ) > 1: stdp = statistics.stdev( aperf[i] ) + else: stdp = 0.0 + + # Construct a string to match the performance value and then + # use that string to search-and-replace with four format specs + # for the min, avg, max, and stdev values computed above. + search = '%8s' % perf + newline = re.sub( str(search), ' %7.2f %7.2f %7.2f %6.2f', line ) + + # Search for the column index range that would be present if this were + # matlab-compatible output. The index range will typically be 1:n, + # where n is the number of columns of data. + found_index = False + for word in words: + if re.match( '1:', word ): + index_str = word + found_index = True + break + + # If we find the column index range, we need to update it to reflect + # the replacement of one column of data with four, for a net increase + # of columns. We do so via another instance of re.sub() in which we + # search for the old index string and replace it with the new one. + if found_index: + last_col = int(index_str[2]) + 3 + new_index_str = '1:%1s' % last_col + newline = re.sub( index_str, new_index_str, newline ) + + # If the quiet flag was not given, output the intermediate results. + if not quiet: + print( newline % ( float(minp), float(avgp), float(maxp), float(stdp) ) ) + + # Flush stdout after each set of output prior to sleeping. + sys.stdout.flush() + + # Sleep for a bit until the next iteration. + time.sleep( int(delay) ) + + # If the quiet flag was given, output the final results. + if quiet: + + for i in range( num_lines0 ): + + # Parse the current line to find the performance value (only + # needed for call to re.sub() below). 
+ line = lines0[i] + words = line.split() + if perf_col == -1: + perf = words[ len(words)-2 ] + else: + perf = words[ int(perf_col) ] + + # Compute stats for the current line. + avgp = statistics.mean( aperf[i] ) + maxp = max( aperf[i] ) + minp = min( aperf[i] ) + + # Only compute stdev() when we have two or more data points. + if len( aperf[i] ) > 1: stdp = statistics.stdev( aperf[i] ) + else: stdp = 0.0 + + # Construct a string to match the performance value and then + # use that string to search-and-replace with four format specs + # for the min, avg, max, and stdev values computed above. + search = '%8s' % perf + newline = re.sub( str(search), ' %7.2f %7.2f %7.2f %6.2f', line ) + + # Search for the column index range that would be present if this were + # matlab-compatible output. The index range will typically be 1:n, + # where n is the number of columns of data. + found_index = False + for word in words: + if re.match( '1:', word ): + index_str = word + found_index = True + break + + # If we find the column index range, we need to update it to reflect + # the replacement of one column of data with four, for a net increase + # of columns. We do so via another instance of re.sub() in which we + # search for the old index string and replace it with the new one. + if found_index: + last_col = int(index_str[2]) + 3 + new_index_str = '1:%1s' % last_col + newline = re.sub( index_str, new_index_str, newline ) + + # Output the results for the current line. + print( newline % ( float(minp), float(avgp), float(maxp), float(stdp) ) ) + + # Flush stdout afterwards. + sys.stdout.flush() + + + # Return from main(). 
+ return 0 + + + + +if __name__ == "__main__": + main() diff --git a/config/knl/bli_family_knl.h b/config/knl/bli_family_knl.h index cc9c5304c..d784aed5c 100644 --- a/config/knl/bli_family_knl.h +++ b/config/knl/bli_family_knl.h @@ -38,11 +38,11 @@ // -- THREADING PARAMETERS ----------------------------------------------------- -#define BLIS_DEFAULT_M_THREAD_RATIO 4 -#define BLIS_DEFAULT_N_THREAD_RATIO 1 +#define BLIS_THREAD_RATIO_M 4 +#define BLIS_THREAD_RATIO_N 1 -#define BLIS_DEFAULT_MR_THREAD_MAX 1 -#define BLIS_DEFAULT_NR_THREAD_MAX 1 +#define BLIS_THREAD_MAX_IR 1 +#define BLIS_THREAD_MAX_JR 1 // -- MEMORY ALLOCATION -------------------------------------------------------- diff --git a/config/skx/bli_family_skx.h b/config/skx/bli_family_skx.h index 96fdc12aa..d5071baa8 100644 --- a/config/skx/bli_family_skx.h +++ b/config/skx/bli_family_skx.h @@ -37,11 +37,11 @@ // -- THREADING PARAMETERS ----------------------------------------------------- -#define BLIS_DEFAULT_M_THREAD_RATIO 3 -#define BLIS_DEFAULT_N_THREAD_RATIO 2 +#define BLIS_THREAD_RATIO_M 3 +#define BLIS_THREAD_RATIO_N 2 -#define BLIS_DEFAULT_MR_THREAD_MAX 1 -#define BLIS_DEFAULT_NR_THREAD_MAX 4 +#define BLIS_THREAD_MAX_IR 1 +#define BLIS_THREAD_MAX_JR 4 // -- MEMORY ALLOCATION -------------------------------------------------------- diff --git a/config/zen/bli_family_zen.h b/config/zen/bli_family_zen.h index 9bd44d16e..02e628017 100644 --- a/config/zen/bli_family_zen.h +++ b/config/zen/bli_family_zen.h @@ -39,8 +39,8 @@ // By default, it is effective to parallelize the outer loops. // Setting these macros to 1 will force JR and IR inner loops // to be not paralleized. 
-#define BLIS_DEFAULT_MR_THREAD_MAX 1 -#define BLIS_DEFAULT_NR_THREAD_MAX 1 +#define BLIS_THREAD_MAX_IR 1 +#define BLIS_THREAD_MAX_JR 1 #define BLIS_ENABLE_ZEN_BLOCK_SIZES //#define BLIS_ENABLE_SMALL_MATRIX diff --git a/config/zen/bli_kernel.h b/config/zen/old/bli_kernel.h similarity index 100% rename from config/zen/bli_kernel.h rename to config/zen/old/bli_kernel.h diff --git a/configure b/configure index d509e8e8c..6df86f3c3 100755 --- a/configure +++ b/configure @@ -163,9 +163,6 @@ print_usage() echo " incur additional overhead in some (but not all)" echo " situations." echo " " - echo " -q, --quiet Suppress informational output. By default, configure" - echo " is verbose. (NOTE: -q is not yet implemented)" - echo " " echo " -i SIZE, --int-size=SIZE" echo " " echo " Set the size (in bits) of internal BLIS integers and" @@ -212,6 +209,19 @@ print_usage() echo " detects the presence of libmemkind, libmemkind is used" echo " by default, and otherwise it is not used by default." echo " " + echo " -r METHOD, --thread-part-jrir=METHOD" + echo " " + echo " Request a method of assigning micropanels to threads in" + echo " the JR and IR loops. Valid options are 'slab' and 'rr'." + echo " Using 'slab' assigns (as much as possible) contiguous" + echo " regions of micropanels to each thread while 'rr'" + echo " assigns micropanels to threads in a round-robin fashion." + echo " (NOTE: Specifying this option constitutes a *request*," + echo " which may be ignored in select situations if the" + echo " implementation has a good reason to do so.) The chosen" + echo " method also applies during the packing of A and B. The" + echo " default method is 'slab'." + echo " " echo " --force-version=STRING" echo " " echo " Force configure to use an arbitrary version string" @@ -226,6 +236,9 @@ print_usage() echo " a sanity check to make sure these lists are constituted" echo " as expected." echo " " + echo " -q, --quiet Suppress informational output. 
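The difference between the two methods named in the help text above can be sketched with a toy partitioning of micropanel indices among threads. This is purely illustrative (hypothetical helper names, Python rather than BLIS's C threading code), but it shows what "contiguous regions" versus "round-robin" means in practice:

```python
def slab(n_panels, n_threads, t):
    """Slab assignment: thread t gets one contiguous, nearly equal chunk."""
    base, extra = divmod(n_panels, n_threads)
    start = t * base + min(t, extra)
    return list(range(start, start + base + (1 if t < extra else 0)))

def round_robin(n_panels, n_threads, t):
    """Round-robin assignment: thread t gets every n_threads-th micropanel."""
    return list(range(t, n_panels, n_threads))

# 8 micropanels distributed over 3 threads:
assert [slab(8, 3, t) for t in range(3)] == [[0, 1, 2], [3, 4, 5], [6, 7]]
assert [round_robin(8, 3, t) for t in range(3)] == [[0, 3, 6], [1, 4, 7], [2, 5]]
```

Slab keeps each thread's writes contiguous in the packed buffer, while round-robin interleaves them, which is the trade-off the option exposes.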
By default, configure" + echo " is verbose. (NOTE: -q is not yet implemented)" + echo " " echo " -h, --help Output this information and quit." echo " " echo " Environment Variables:" @@ -1127,8 +1140,9 @@ get_compiler_version() # The last part ({ read first rest ; echo $first ; }) is a workaround # to OS X's egrep only returning the first match. cc_vendor=$(echo "${vendor_string}" | egrep -o 'icc|gcc|clang|emcc|pnacl|IBM' | { read first rest ; echo $first ; }) - if [ "$cc_vendor" = "icc" -o "$cc_vendor" = "gcc" -o "$cc_vendor" = "clang" ] - then + if [ "${cc_vendor}" = "icc" -o \ + "${cc_vendor}" = "gcc" -o \ + "${cc_vendor}" = "clang" ]; then cc_version=$(${cc} -dumpversion) else cc_version=$(echo "${vendor_string}" | egrep -o '[0-9]+\.[0-9]+\.?[0-9]*' | { read first rest ; echo ${first} ; }) @@ -1140,6 +1154,23 @@ get_compiler_version() cc_minor=$(echo "${cc_version}" | cut -d. -f2) cc_revision=$(echo "${cc_version}" | cut -d. -f3) + # gcc 7 introduced new behavior to -dumpversion whereby only the major + # version component is output. However, as part of this change, gcc 7 + # also introduced a new option, -dumpfullversion, which is guaranteed to + # always output the major, minor, and revision numbers. Thus, if we're + # using gcc and its version is 7 or later, we re-query and re-parse the + # version string. + if [ "${cc_vendor}" = "gcc" -a ${cc_major} -ge 7 ]; then + + # Re-query the version number using -dumpfullversion. + cc_version=$(${cc} -dumpfullversion) + + # And parse the result. + cc_major=$(echo "${cc_version}" | cut -d. -f1) + cc_minor=$(echo "${cc_version}" | cut -d. -f2) + cc_revision=$(echo "${cc_version}" | cut -d. -f3) + fi + echo "${script_name}: found ${cc_vendor} version ${cc_version} (maj: ${cc_major}, min: ${cc_minor}, rev: ${cc_revision})." } @@ -1576,6 +1607,9 @@ main() # The threading flag. threading_model='no' + # The method of assigning micropanels to threads in the JR and IR loops. + thread_part_jrir='slab' + # Option variables. 
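The -dumpversion quirk documented in the comment above is easy to see in miniature. A hedged Python sketch of the same three-field parse (the helper name is mine, not configure's; note that `cut -d. -f2` would actually echo a delimiter-less "7" whole, so minor and revision would be misparsed, which is part of why the re-query with -dumpfullversion is needed):

```python
def parse_cc_version(version_string):
    """Split a compiler version string into (major, minor, revision).
    Missing fields come back as empty strings to make the gap visible."""
    parts = version_string.split(".")
    parts += [""] * (3 - len(parts))
    return tuple(parts[:3])

# gcc <= 6 (and most other compilers) report all three components:
assert parse_cc_version("5.4.0") == ("5", "4", "0")

# gcc >= 7 reports only the major version via -dumpversion...
assert parse_cc_version("7") == ("7", "", "")

# ...so configure re-queries with -dumpfullversion, which restores all three:
assert parse_cc_version("7.3.0") == ("7", "3", "0")
```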
quiet_flag='' show_config_list='' @@ -1630,7 +1664,7 @@ main() # Process our command line options. unset OPTIND - while getopts ":hp:d:s:t:qci:b:-:" opt; do + while getopts ":hp:d:s:t:r:qci:b:-:" opt; do case $opt in -) case "$OPTARG" in @@ -1694,6 +1728,9 @@ main() enable-threading=*) threading_model=${OPTARG#*=} ;; + thread-part-jrir=*) + thread_part_jrir=${OPTARG#*=} + ;; disable-threading) threading_model='no' ;; @@ -1765,6 +1802,9 @@ main() t) threading_model=$OPTARG ;; + r) + thread_part_jrir=$OPTARG + ;; i) int_type_size=$OPTARG ;; @@ -2335,7 +2375,7 @@ main() elif [ "x${threading_model}" = "xpthreads" ] || [ "x${threading_model}" = "xpthread" ] || [ "x${threading_model}" = "xposix" ]; then - echo "${script_name}: using Pthreads for threading." + echo "${script_name}: using POSIX threads for threading." enable_pthreads='yes' enable_pthreads_01=1 threading_model="pthreads" # Standardize the value. @@ -2346,7 +2386,22 @@ main() echo "${script_name}: *** Unsupported threading model: ${threading_model}." exit 1 fi - + + # Check the method of assigning micropanels to threads in the JR and IR + # loops. + enable_jrir_slab_01=0 + enable_jrir_rr_01=0 + if [ "x${thread_part_jrir}" = "xslab" ]; then + echo "${script_name}: requesting slab threading in jr and ir loops." + enable_jrir_slab_01=1 + elif [ "x${thread_part_jrir}" = "xrr" ]; then + echo "${script_name}: requesting round-robin threading in jr and ir loops." + enable_jrir_rr_01=1 + else + echo "${script_name}: *** Unsupported method of thread partitioning in jr and ir loops: ${thread_part_jrir}." + exit 1 + fi + # Convert 'yes' and 'no' flags to booleans. if [ "x${enable_packbuf_pools}" = "xyes" ]; then echo "${script_name}: internal memory pools for packing buffers are enabled." @@ -2398,7 +2453,7 @@ main() echo "${script_name}: the CBLAS compatibility layer is disabled." 
enable_cblas_01=0 fi - + # Report integer sizes if [ "x${int_type_size}" = "x32" ]; then echo "${script_name}: the internal integer size is 32-bit." @@ -2578,6 +2633,8 @@ main() | perl -pe "s/\@kernel_list_defines\@/${kernel_list_defines}/g" \ | sed -e "s/@enable_openmp@/${enable_openmp_01}/g" \ | sed -e "s/@enable_pthreads@/${enable_pthreads_01}/g" \ + | sed -e "s/@enable_jrir_slab@/${enable_jrir_slab_01}/g" \ + | sed -e "s/@enable_jrir_rr@/${enable_jrir_rr_01}/g" \ | sed -e "s/@enable_packbuf_pools@/${enable_packbuf_pools_01}/g" \ | sed -e "s/@int_type_size@/${int_type_size}/g" \ | sed -e "s/@blas_int_type_size@/${blas_int_type_size}/g" \ @@ -2669,7 +2726,7 @@ main() # -- Mirror source directory hierarchies to object directories ------------- - + # Combine the config_list with the config_name and then remove duplicates. config_list_plus_name=$(rm_duplicate_words "${config_list} ${config_name}") diff --git a/frame/1m/packm/bli_packm.h b/frame/1m/packm/bli_packm.h index a336cf9f2..c1c30be1c 100644 --- a/frame/1m/packm/bli_packm.h +++ b/frame/1m/packm/bli_packm.h @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -39,9 +40,7 @@ #include "bli_packm_part.h" -#include "bli_packm_unb_var1.h" - -#include "bli_packm_blk_var1.h" +#include "bli_packm_var.h" #include "bli_packm_struc_cxk.h" #include "bli_packm_struc_cxk_4mi.h" diff --git a/frame/1m/packm/bli_packm_blk_var1.c b/frame/1m/packm/bli_packm_blk_var1.c index 383462726..fe134c176 100644 --- a/frame/1m/packm/bli_packm_blk_var1.c +++ b/frame/1m/packm/bli_packm_blk_var1.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. 
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -34,71 +35,6 @@ #include "blis.h" -#define FUNCPTR_T packm_fp - -typedef void (*FUNCPTR_T)( - struc_t strucc, - doff_t diagoffc, - diag_t diagc, - uplo_t uploc, - trans_t transc, - pack_t schema, - bool_t invdiag, - bool_t revifup, - bool_t reviflo, - dim_t m, - dim_t n, - dim_t m_max, - dim_t n_max, - void* kappa, - void* c, inc_t rs_c, inc_t cs_c, - void* p, inc_t rs_p, inc_t cs_p, - inc_t is_p, - dim_t pd_p, inc_t ps_p, - void* packm_ker, - cntx_t* cntx, - thrinfo_t* thread - ); - -static FUNCPTR_T GENARRAY(ftypes,packm_blk_var1); - - -static func_t packm_struc_cxk_kers[BLIS_NUM_PACK_SCHEMA_TYPES] = -{ - /* float (0) scomplex (1) double (2) dcomplex (3) */ -// 0000 row/col panels - { { bli_spackm_struc_cxk, bli_cpackm_struc_cxk, - bli_dpackm_struc_cxk, bli_zpackm_struc_cxk, } }, -// 0001 row/col panels: 4m interleaved - { { NULL, bli_cpackm_struc_cxk_4mi, - NULL, bli_zpackm_struc_cxk_4mi, } }, -// 0010 row/col panels: 3m interleaved - { { NULL, bli_cpackm_struc_cxk_3mis, - NULL, bli_zpackm_struc_cxk_3mis, } }, -// 0011 row/col panels: 4m separated (NOT IMPLEMENTED) - { { NULL, NULL, - NULL, NULL, } }, -// 0100 row/col panels: 3m separated - { { NULL, bli_cpackm_struc_cxk_3mis, - NULL, bli_zpackm_struc_cxk_3mis, } }, -// 0101 row/col panels: real only - { { NULL, bli_cpackm_struc_cxk_rih, - NULL, bli_zpackm_struc_cxk_rih, } }, -// 0110 row/col panels: imaginary only - { { NULL, bli_cpackm_struc_cxk_rih, - NULL, bli_zpackm_struc_cxk_rih, } }, -// 0111 row/col panels: real+imaginary only - { { NULL, bli_cpackm_struc_cxk_rih, - NULL, bli_zpackm_struc_cxk_rih, } }, -// 1000 row/col panels: 1m-expanded (1e) - { { NULL, bli_cpackm_struc_cxk_1er, - NULL, bli_zpackm_struc_cxk_1er, } }, -// 1001 row/col panels: 1m-reordered (1r) - { { NULL, bli_cpackm_struc_cxk_1er, - NULL, bli_zpackm_struc_cxk_1er, } }, -}; - - void 
bli_packm_blk_var1 ( obj_t* c, @@ -108,619 +44,14 @@ void bli_packm_blk_var1 thrinfo_t* t ) { - num_t dt_cp = bli_obj_dt( c ); +#ifdef BLIS_ENABLE_JRIR_SLAB - struc_t strucc = bli_obj_struc( c ); - doff_t diagoffc = bli_obj_diag_offset( c ); - diag_t diagc = bli_obj_diag( c ); - uplo_t uploc = bli_obj_uplo( c ); - trans_t transc = bli_obj_conjtrans_status( c ); - pack_t schema = bli_obj_pack_schema( p ); - bool_t invdiag = bli_obj_has_inverted_diag( p ); - bool_t revifup = bli_obj_is_pack_rev_if_upper( p ); - bool_t reviflo = bli_obj_is_pack_rev_if_lower( p ); + bli_packm_blk_var1sl( c, p, cntx, cntl, t ); - dim_t m_p = bli_obj_length( p ); - dim_t n_p = bli_obj_width( p ); - dim_t m_max_p = bli_obj_padded_length( p ); - dim_t n_max_p = bli_obj_padded_width( p ); +#else // BLIS_ENABLE_JRIR_RR - void* buf_c = bli_obj_buffer_at_off( c ); - inc_t rs_c = bli_obj_row_stride( c ); - inc_t cs_c = bli_obj_col_stride( c ); + bli_packm_blk_var1rr( c, p, cntx, cntl, t ); - void* buf_p = bli_obj_buffer_at_off( p ); - inc_t rs_p = bli_obj_row_stride( p ); - inc_t cs_p = bli_obj_col_stride( p ); - inc_t is_p = bli_obj_imag_stride( p ); - dim_t pd_p = bli_obj_panel_dim( p ); - inc_t ps_p = bli_obj_panel_stride( p ); - - obj_t kappa; - obj_t* kappa_p; - void* buf_kappa; - - func_t* packm_kers; - void* packm_ker; - - FUNCPTR_T f; - - - // Treatment of kappa (ie: packing during scaling) depends on - // whether we are executing an induced method. - if ( bli_is_nat_packed( schema ) ) - { - // This branch is for native execution, where we assume that - // the micro-kernel will always apply the alpha scalar of the - // higher-level operation. Thus, we use BLIS_ONE for kappa so - // that the underlying packm implementation does not perform - // any scaling during packing. 
- buf_kappa = bli_obj_buffer_for_const( dt_cp, &BLIS_ONE ); - } - else // if ( bli_is_ind_packed( schema ) ) - { - // The value for kappa we use will depend on whether the scalar - // attached to A has a nonzero imaginary component. If it does, - // then we will apply the scalar during packing to facilitate - // implementing induced complex domain algorithms in terms of - // real domain micro-kernels. (In the aforementioned situation, - // applying a real scalar is easy, but applying a complex one is - // harder, so we avoid the need altogether with the code below.) - if ( bli_obj_scalar_has_nonzero_imag( p ) ) - { - //printf( "applying non-zero imag kappa\n" ); - - // Detach the scalar. - bli_obj_scalar_detach( p, &kappa ); - - // Reset the attached scalar (to 1.0). - bli_obj_scalar_reset( p ); - - kappa_p = κ - } - else - { - // If the internal scalar of A has only a real component, then - // we will apply it later (in the micro-kernel), and so we will - // use BLIS_ONE to indicate no scaling during packing. - kappa_p = &BLIS_ONE; - } - - // Acquire the buffer to the kappa chosen above. - buf_kappa = bli_obj_buffer_for_1x1( dt_cp, kappa_p ); - } - - - // Choose the correct func_t object based on the pack_t schema. -#if 0 - if ( bli_is_4mi_packed( schema ) ) packm_kers = packm_struc_cxk_4mi_kers; - else if ( bli_is_3mi_packed( schema ) || - bli_is_3ms_packed( schema ) ) packm_kers = packm_struc_cxk_3mis_kers; - else if ( bli_is_ro_packed( schema ) || - bli_is_io_packed( schema ) || - bli_is_rpi_packed( schema ) ) packm_kers = packm_struc_cxk_rih_kers; - else packm_kers = packm_struc_cxk_kers; -#else - // The original idea here was to read the packm_ukr from the context - // if it is non-NULL. The problem is, it requires that we be able to - // assume that the packm_ukr field is initialized to NULL, which it - // currently is not. 
- - //func_t* cntx_packm_kers = bli_cntx_get_packm_ukr( cntx ); - - //if ( bli_func_is_null_dt( dt_cp, cntx_packm_kers ) ) - { - // If the packm structure-aware kernel func_t in the context is - // NULL (which is the default value after the context is created), - // we use the default lookup table to determine the right func_t - // for the current schema. - const dim_t i = bli_pack_schema_index( schema ); - - packm_kers = &packm_struc_cxk_kers[ i ]; - } -#if 0 - else // cntx's packm func_t overrides - { - // If the packm structure-aware kernel func_t in the context is - // non-NULL (ie: assumed to be valid), we use that instead. - //packm_kers = bli_cntx_packm_ukrs( cntx ); - packm_kers = cntx_packm_kers; - } #endif -#endif - - // Query the datatype-specific function pointer from the func_t object. - packm_ker = bli_func_get_dt( dt_cp, packm_kers ); - - // Index into the type combination array to extract the correct - // function pointer. - f = ftypes[dt_cp]; - - // Invoke the function. 
- f( strucc, - diagoffc, - diagc, - uploc, - transc, - schema, - invdiag, - revifup, - reviflo, - m_p, - n_p, - m_max_p, - n_max_p, - buf_kappa, - buf_c, rs_c, cs_c, - buf_p, rs_p, cs_p, - is_p, - pd_p, ps_p, - packm_ker, - cntx, - t ); } - -#undef GENTFUNCR -#define GENTFUNCR( ctype, ctype_r, ch, chr, opname, varname ) \ -\ -void PASTEMAC(ch,varname) \ - ( \ - struc_t strucc, \ - doff_t diagoffc, \ - diag_t diagc, \ - uplo_t uploc, \ - trans_t transc, \ - pack_t schema, \ - bool_t invdiag, \ - bool_t revifup, \ - bool_t reviflo, \ - dim_t m, \ - dim_t n, \ - dim_t m_max, \ - dim_t n_max, \ - void* kappa, \ - void* c, inc_t rs_c, inc_t cs_c, \ - void* p, inc_t rs_p, inc_t cs_p, \ - inc_t is_p, \ - dim_t pd_p, inc_t ps_p, \ - void* packm_ker, \ - cntx_t* cntx, \ - thrinfo_t* thread \ - ) \ -{ \ - PASTECH2(ch,opname,_ker_ft) packm_ker_cast = packm_ker; \ -\ - ctype* restrict kappa_cast = kappa; \ - ctype* restrict c_cast = c; \ - ctype* restrict p_cast = p; \ - ctype* restrict c_begin; \ - ctype* restrict p_begin; \ -\ - dim_t iter_dim; \ - dim_t num_iter; \ - dim_t it, ic, ip; \ - dim_t ic0, ip0; \ - doff_t ic_inc, ip_inc; \ - doff_t diagoffc_i; \ - doff_t diagoffc_inc; \ - dim_t panel_len_full; \ - dim_t panel_len_i; \ - dim_t panel_len_max; \ - dim_t panel_len_max_i; \ - dim_t panel_dim_i; \ - dim_t panel_dim_max; \ - dim_t panel_off_i; \ - inc_t vs_c; \ - inc_t ldc; \ - inc_t ldp, p_inc; \ - dim_t* m_panel_full; \ - dim_t* n_panel_full; \ - dim_t* m_panel_use; \ - dim_t* n_panel_use; \ - dim_t* m_panel_max; \ - dim_t* n_panel_max; \ - conj_t conjc; \ - bool_t row_stored; \ - bool_t col_stored; \ - inc_t is_p_use; \ - dim_t ss_num; \ - dim_t ss_den; \ -\ - ctype* restrict c_use; \ - ctype* restrict p_use; \ - doff_t diagoffp_i; \ -\ -\ - /* If C is zeros and part of a triangular matrix, then we don't need - to pack it. 
*/ \ - if ( bli_is_zeros( uploc ) && \ - bli_is_triangular( strucc ) ) return; \ -\ - /* Extract the conjugation bit from the transposition argument. */ \ - conjc = bli_extract_conj( transc ); \ -\ - /* If c needs a transposition, induce it so that we can more simply - express the remaining parameters and code. */ \ - if ( bli_does_trans( transc ) ) \ - { \ - bli_swap_incs( &rs_c, &cs_c ); \ - bli_negate_diag_offset( &diagoffc ); \ - bli_toggle_uplo( &uploc ); \ - bli_toggle_trans( &transc ); \ - } \ -\ - /* Create flags to incidate row or column storage. Note that the - schema bit that encodes row or column is describing the form of - micro-panel, not the storage in the micro-panel. Hence the - mismatch in "row" and "column" semantics. */ \ - row_stored = bli_is_col_packed( schema ); \ - col_stored = bli_is_row_packed( schema ); \ -\ - /* If the row storage flag indicates row storage, then we are packing - to column panels; otherwise, if the strides indicate column storage, - we are packing to row panels. */ \ - if ( row_stored ) \ - { \ - /* Prepare to pack to row-stored column panels. */ \ - iter_dim = n; \ - panel_len_full = m; \ - panel_len_max = m_max; \ - panel_dim_max = pd_p; \ - ldc = rs_c; \ - vs_c = cs_c; \ - diagoffc_inc = -( doff_t )panel_dim_max; \ - ldp = rs_p; \ - m_panel_full = &m; \ - n_panel_full = &panel_dim_i; \ - m_panel_use = &panel_len_i; \ - n_panel_use = &panel_dim_i; \ - m_panel_max = &panel_len_max_i; \ - n_panel_max = &panel_dim_max; \ - } \ - else /* if ( col_stored ) */ \ - { \ - /* Prepare to pack to column-stored row panels. 
*/ \ - iter_dim = m; \ - panel_len_full = n; \ - panel_len_max = n_max; \ - panel_dim_max = pd_p; \ - ldc = cs_c; \ - vs_c = rs_c; \ - diagoffc_inc = ( doff_t )panel_dim_max; \ - ldp = cs_p; \ - m_panel_full = &panel_dim_i; \ - n_panel_full = &n; \ - m_panel_use = &panel_dim_i; \ - n_panel_use = &panel_len_i; \ - m_panel_max = &panel_dim_max; \ - n_panel_max = &panel_len_max_i; \ - } \ -\ - /* Compute the storage stride scaling. Usually this is just 1. However, - in the case of interleaved 3m, we need to scale by 3/2, and in the - cases of real-only, imag-only, or summed-only, we need to scale by - 1/2. In both cases, we are compensating for the fact that pointer - arithmetic occurs in terms of complex elements rather than real - elements. */ \ - if ( bli_is_3mi_packed( schema ) ) { ss_num = 3; ss_den = 2; } \ - else if ( bli_is_3ms_packed( schema ) ) { ss_num = 1; ss_den = 2; } \ - else if ( bli_is_rih_packed( schema ) ) { ss_num = 1; ss_den = 2; } \ - else { ss_num = 1; ss_den = 1; } \ -\ - /* Compute the total number of iterations we'll need. */ \ - num_iter = iter_dim / panel_dim_max + ( iter_dim % panel_dim_max ? 1 : 0 ); \ -\ - /* Set the initial values and increments for indices related to C and P - based on whether reverse iteration was requested. 
*/ \ - if ( ( revifup && bli_is_upper( uploc ) && bli_is_triangular( strucc ) ) || \ - ( reviflo && bli_is_lower( uploc ) && bli_is_triangular( strucc ) ) ) \ - { \ - ic0 = (num_iter - 1) * panel_dim_max; \ - ic_inc = -panel_dim_max; \ - ip0 = num_iter - 1; \ - ip_inc = -1; \ - } \ - else \ - { \ - ic0 = 0; \ - ic_inc = panel_dim_max; \ - ip0 = 0; \ - ip_inc = 1; \ - } \ -\ - p_begin = p_cast; \ -\ -/* -if ( row_stored ) \ -PASTEMAC(ch,fprintm)( stdout, "packm_var2: b", m, n, \ - c_cast, rs_c, cs_c, "%4.1f", "" ); \ -if ( col_stored ) \ -PASTEMAC(ch,fprintm)( stdout, "packm_var2: a", m, n, \ - c_cast, rs_c, cs_c, "%4.1f", "" ); \ -*/ \ -\ - for ( ic = ic0, ip = ip0, it = 0; it < num_iter; \ - ic += ic_inc, ip += ip_inc, it += 1 ) \ - { \ - panel_dim_i = bli_min( panel_dim_max, iter_dim - ic ); \ -\ - diagoffc_i = diagoffc + (ip )*diagoffc_inc; \ - c_begin = c_cast + (ic )*vs_c; \ -\ - if ( bli_is_triangular( strucc ) && \ - bli_is_unstored_subpart_n( diagoffc_i, uploc, *m_panel_full, *n_panel_full ) ) \ - { \ - /* This case executes if the panel belongs to a triangular - matrix AND is completely unstored (ie: zero). If the panel - is unstored, we do nothing. (Notice that we don't even - increment p_begin.) */ \ -\ - continue; \ - } \ - else if ( bli_is_triangular( strucc ) && \ - bli_intersects_diag_n( diagoffc_i, *m_panel_full, *n_panel_full ) ) \ - { \ - /* This case executes if the panel belongs to a triangular - matrix AND is diagonal-intersecting. Notice that we - cannot bury the following conditional logic into - packm_struc_cxk() because we need to know the value of - panel_len_max_i so we can properly increment p_inc. */ \ -\ - /* Sanity check. Diagonals should not intersect the short end of - a micro-panel. If they do, then somehow the constraints on - cache blocksizes being a whole multiple of the register - blocksizes was somehow violated. 
*/ \ - if ( ( col_stored && diagoffc_i < 0 ) || \ - ( row_stored && diagoffc_i > 0 ) ) \ - bli_check_error_code( BLIS_NOT_YET_IMPLEMENTED ); \ -\ - if ( ( row_stored && bli_is_upper( uploc ) ) || \ - ( col_stored && bli_is_lower( uploc ) ) ) \ - { \ - panel_off_i = 0; \ - panel_len_i = bli_abs( diagoffc_i ) + panel_dim_i; \ - panel_len_max_i = bli_min( bli_abs( diagoffc_i ) + panel_dim_max, \ - panel_len_max ); \ - diagoffp_i = diagoffc_i; \ - } \ - else /* if ( ( row_stored && bli_is_lower( uploc ) ) || \ - ( col_stored && bli_is_upper( uploc ) ) ) */ \ - { \ - panel_off_i = bli_abs( diagoffc_i ); \ - panel_len_i = panel_len_full - panel_off_i; \ - panel_len_max_i = panel_len_max - panel_off_i; \ - diagoffp_i = 0; \ - } \ -\ - c_use = c_begin + (panel_off_i )*ldc; \ - p_use = p_begin; \ -\ - /* We need to re-compute the imaginary stride as a function of - panel_len_max_i since triangular packed matrices have panels - of varying lengths. NOTE: This imaginary stride value is - only referenced by the packm kernels for induced methods. */ \ - is_p_use = ldp * panel_len_max_i; \ -\ - /* We nudge the imaginary stride up by one if it is odd. */ \ - is_p_use += ( bli_is_odd( is_p_use ) ? 1 : 0 ); \ -\ - if( packm_thread_my_iter( it, thread ) ) \ - { \ - packm_ker_cast( strucc, \ - diagoffp_i, \ - diagc, \ - uploc, \ - conjc, \ - schema, \ - invdiag, \ - *m_panel_use, \ - *n_panel_use, \ - *m_panel_max, \ - *n_panel_max, \ - kappa_cast, \ - c_use, rs_c, cs_c, \ - p_use, rs_p, cs_p, \ - is_p_use, \ - cntx ); \ - } \ -\ - /* NOTE: This value is usually LESS than ps_p because triangular - matrices usually have several micro-panels that are shorter - than a "full" micro-panel. */ \ - p_inc = ( is_p_use * ss_num ) / ss_den; \ - } \ - else if ( bli_is_herm_or_symm( strucc ) ) \ - { \ - /* This case executes if the panel belongs to a Hermitian or - symmetric matrix, which includes stored, unstored, and - diagonal-intersecting panels. 
*/ \ -\ - c_use = c_begin; \ - p_use = p_begin; \ -\ - panel_len_i = panel_len_full; \ - panel_len_max_i = panel_len_max; \ -\ - is_p_use = is_p; \ -\ - if( packm_thread_my_iter( it, thread ) ) \ - { \ - packm_ker_cast( strucc, \ - diagoffc_i, \ - diagc, \ - uploc, \ - conjc, \ - schema, \ - invdiag, \ - *m_panel_use, \ - *n_panel_use, \ - *m_panel_max, \ - *n_panel_max, \ - kappa_cast, \ - c_use, rs_c, cs_c, \ - p_use, rs_p, cs_p, \ - is_p_use, \ - cntx ); \ - } \ -\ - p_inc = ps_p; \ - } \ - else \ - { \ - /* This case executes if the panel is general, or, if the - panel is part of a triangular matrix and is neither unstored - (ie: zero) nor diagonal-intersecting. */ \ -\ - c_use = c_begin; \ - p_use = p_begin; \ -\ - panel_len_i = panel_len_full; \ - panel_len_max_i = panel_len_max; \ -\ - is_p_use = is_p; \ -\ - if( packm_thread_my_iter( it, thread ) ) \ - { \ - packm_ker_cast( BLIS_GENERAL, \ - 0, \ - diagc, \ - BLIS_DENSE, \ - conjc, \ - schema, \ - invdiag, \ - *m_panel_use, \ - *n_panel_use, \ - *m_panel_max, \ - *n_panel_max, \ - kappa_cast, \ - c_use, rs_c, cs_c, \ - p_use, rs_p, cs_p, \ - is_p_use, \ - cntx ); \ - } \ -\ - /* NOTE: This value is equivalent to ps_p. 
*/ \ - p_inc = ps_p; \ - } \ -\ -/* -if ( col_stored ) { \ - if ( bli_thread_work_id( thread ) == 0 ) \ - { \ - printf( "packm_blk_var1: thread %lu (a = %p, ap = %p)\n", bli_thread_work_id( thread ), c_use, p_use ); \ - fflush( stdout ); \ - PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: a", *m_panel_use, *n_panel_use, \ - ( ctype* )c_use, rs_c, cs_c, "%4.1f", "" ); \ - PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: ap", *m_panel_max, *n_panel_max, \ - ( ctype* )p_use, rs_p, cs_p, "%4.1f", "" ); \ - fflush( stdout ); \ - } \ -bli_thread_obarrier( thread ); \ - if ( bli_thread_work_id( thread ) == 1 ) \ - { \ - printf( "packm_blk_var1: thread %lu (a = %p, ap = %p)\n", bli_thread_work_id( thread ), c_use, p_use ); \ - fflush( stdout ); \ - PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: a", *m_panel_use, *n_panel_use, \ - ( ctype* )c_use, rs_c, cs_c, "%4.1f", "" ); \ - PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: ap", *m_panel_max, *n_panel_max, \ - ( ctype* )p_use, rs_p, cs_p, "%4.1f", "" ); \ - fflush( stdout ); \ - } \ -bli_thread_obarrier( thread ); \ -} \ -else { \ - if ( bli_thread_work_id( thread ) == 0 ) \ - { \ - printf( "packm_blk_var1: thread %lu (b = %p, bp = %p)\n", bli_thread_work_id( thread ), c_use, p_use ); \ - fflush( stdout ); \ - PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: b", *m_panel_use, *n_panel_use, \ - ( ctype* )c_use, rs_c, cs_c, "%4.1f", "" ); \ - PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: bp", *m_panel_max, *n_panel_max, \ - ( ctype* )p_use, rs_p, cs_p, "%4.1f", "" ); \ - fflush( stdout ); \ - } \ -bli_thread_obarrier( thread ); \ - if ( bli_thread_work_id( thread ) == 1 ) \ - { \ - printf( "packm_blk_var1: thread %lu (b = %p, bp = %p)\n", bli_thread_work_id( thread ), c_use, p_use ); \ - fflush( stdout ); \ - PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: b", *m_panel_use, *n_panel_use, \ - ( ctype* )c_use, rs_c, cs_c, "%4.1f", "" ); \ - PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: bp", *m_panel_max, *n_panel_max, \ - ( ctype* 
)p_use, rs_p, cs_p, "%4.1f", "" ); \ - fflush( stdout ); \ - } \ -bli_thread_obarrier( thread ); \ -} \ -*/ \ -\ -/* - if ( bli_is_4mi_packed( schema ) ) { \ - printf( "packm_var2: is_p_use = %lu\n", is_p_use ); \ - if ( col_stored ) { \ - if ( 0 ) \ - PASTEMAC(chr,fprintm)( stdout, "packm_var2: a_r", *m_panel_use, *n_panel_use, \ - ( ctype_r* )c_use, 2*rs_c, 2*cs_c, "%4.1f", "" ); \ - PASTEMAC(chr,fprintm)( stdout, "packm_var2: ap_r", *m_panel_max, *n_panel_max, \ - ( ctype_r* )p_use, rs_p, cs_p, "%4.1f", "" ); \ - PASTEMAC(chr,fprintm)( stdout, "packm_var2: ap_i", *m_panel_max, *n_panel_max, \ - ( ctype_r* )p_use + is_p_use, rs_p, cs_p, "%4.1f", "" ); \ - } \ - if ( row_stored ) { \ - if ( 0 ) \ - PASTEMAC(chr,fprintm)( stdout, "packm_var2: b_r", *m_panel_use, *n_panel_use, \ - ( ctype_r* )c_use, 2*rs_c, 2*cs_c, "%4.1f", "" ); \ - PASTEMAC(chr,fprintm)( stdout, "packm_var2: bp_r", *m_panel_max, *n_panel_max, \ - ( ctype_r* )p_use, rs_p, cs_p, "%4.1f", "" ); \ - PASTEMAC(chr,fprintm)( stdout, "packm_var2: bp_i", *m_panel_max, *n_panel_max, \ - ( ctype_r* )p_use + is_p_use, rs_p, cs_p, "%4.1f", "" ); \ - } \ - } \ -*/ \ -/* -*/ \ -\ -/* -*/ \ -/* - PASTEMAC(chr,fprintm)( stdout, "packm_var2: bp_rpi", *m_panel_max, *n_panel_max, \ - ( ctype_r* )p_use, rs_p, cs_p, "%4.1f", "" ); \ -*/ \ -\ -\ -/* - if ( row_stored ) { \ - PASTEMAC(chr,fprintm)( stdout, "packm_var2: b_r", *m_panel_max, *n_panel_max, \ - ( ctype_r* )c_use, 2*rs_c, 2*cs_c, "%4.1f", "" ); \ - PASTEMAC(chr,fprintm)( stdout, "packm_var2: b_i", *m_panel_max, *n_panel_max, \ - (( ctype_r* )c_use)+rs_c, 2*rs_c, 2*cs_c, "%4.1f", "" ); \ - PASTEMAC(chr,fprintm)( stdout, "packm_var2: bp_r", *m_panel_max, *n_panel_max, \ - ( ctype_r* )p_use, rs_p, cs_p, "%4.1f", "" ); \ - inc_t is_b = rs_p * *m_panel_max; \ - PASTEMAC(chr,fprintm)( stdout, "packm_var2: bp_i", *m_panel_max, *n_panel_max, \ - ( ctype_r* )p_use + is_b, rs_p, cs_p, "%4.1f", "" ); \ - } \ -*/ \ -\ -\ -/* - if ( col_stored ) { \ - 
PASTEMAC(chr,fprintm)( stdout, "packm_var2: a_r", *m_panel_max, *n_panel_max, \ - ( ctype_r* )c_use, 2*rs_c, 2*cs_c, "%4.1f", "" ); \ - PASTEMAC(chr,fprintm)( stdout, "packm_var2: a_i", *m_panel_max, *n_panel_max, \ - (( ctype_r* )c_use)+rs_c, 2*rs_c, 2*cs_c, "%4.1f", "" ); \ - PASTEMAC(chr,fprintm)( stdout, "packm_var2: ap_r", *m_panel_max, *n_panel_max, \ - ( ctype_r* )p_use, rs_p, cs_p, "%4.1f", "" ); \ - PASTEMAC(chr,fprintm)( stdout, "packm_var2: ap_i", *m_panel_max, *n_panel_max, \ - ( ctype_r* )p_use + p_inc, rs_p, cs_p, "%4.1f", "" ); \ - } \ -*/ \ -\ - p_begin += p_inc; \ -\ - } \ -} - -INSERT_GENTFUNCR_BASIC( packm, packm_blk_var1 ) - diff --git a/frame/1m/packm/bli_packm_blk_var1.c.old b/frame/1m/packm/bli_packm_blk_var1.c.old deleted file mode 100644 index 4b18302f4..000000000 --- a/frame/1m/packm/bli_packm_blk_var1.c.old +++ /dev/null @@ -1,463 +0,0 @@ -/* - - BLIS - An object-based framework for developing high-performance BLAS-like - libraries. - - Copyright (C) 2014, The University of Texas at Austin - - Redistribution and use in source and binary forms, with or without - modification, are permitted provided that the following conditions are - met: - - Redistributions of source code must retain the above copyright - notice, this list of conditions and the following disclaimer. - - Redistributions in binary form must reproduce the above copyright - notice, this list of conditions and the following disclaimer in the - documentation and/or other materials provided with the distribution. - - Neither the name of The University of Texas at Austin nor the names - of its contributors may be used to endorse or promote products - derived from this software without specific prior written permission. - - THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS - "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT - LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR - A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT - HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, - SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT - LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, - DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY - THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE - OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - -*/ - -#include "blis.h" - -#define FUNCPTR_T packm_fp - -typedef void (*FUNCPTR_T)( - struc_t strucc, - doff_t diagoffc, - diag_t diagc, - uplo_t uploc, - trans_t transc, - pack_t schema, - bool_t invdiag, - bool_t revifup, - bool_t reviflo, - dim_t m, - dim_t n, - dim_t m_max, - dim_t n_max, - void* kappa, - void* c, inc_t rs_c, inc_t cs_c, - void* p, inc_t rs_p, inc_t cs_p, - inc_t is_p, - dim_t pd_p, inc_t ps_p, - void* packm_ker, - packm_thrinfo_t* thread - ); - -static FUNCPTR_T GENARRAY(ftypes,packm_blk_var1); - -extern func_t* packm_struc_cxk_kers; - - -void bli_packm_blk_var1( obj_t* c, - obj_t* p, - packm_thrinfo_t* t ) -{ - num_t dt_cp = bli_obj_dt( c ); - - struc_t strucc = bli_obj_struc( c ); - doff_t diagoffc = bli_obj_diag_offset( c ); - diag_t diagc = bli_obj_diag( c ); - uplo_t uploc = bli_obj_uplo( c ); - trans_t transc = bli_obj_conjtrans_status( c ); - pack_t schema = bli_obj_pack_schema( p ); - bool_t invdiag = bli_obj_has_inverted_diag( p ); - bool_t revifup = bli_obj_is_pack_rev_if_upper( p ); - bool_t reviflo = bli_obj_is_pack_rev_if_lower( p ); - - dim_t m_p = bli_obj_length( p ); - dim_t n_p = bli_obj_width( p ); - dim_t m_max_p = bli_obj_padded_length( p ); - dim_t n_max_p = bli_obj_padded_width( p ); - - void* buf_c = bli_obj_buffer_at_off( c ); - inc_t rs_c = bli_obj_row_stride( c ); - inc_t cs_c = bli_obj_col_stride( c ); - - void* buf_p = bli_obj_buffer_at_off( p ); - inc_t rs_p = bli_obj_row_stride( p ); - inc_t 
cs_p = bli_obj_col_stride( p ); - inc_t is_p = bli_obj_imag_stride( p ); - dim_t pd_p = bli_obj_panel_dim( p ); - inc_t ps_p = bli_obj_panel_stride( p ); - - void* buf_kappa; - - func_t* packm_kers; - void* packm_ker; - - FUNCPTR_T f; - - // This variant assumes that the micro-kernel will always apply the - // alpha scalar of the higher-level operation. Thus, we use BLIS_ONE - // for kappa so that the underlying packm implementation does not - // scale during packing. - buf_kappa = bli_obj_buffer_for_const( dt_cp, &BLIS_ONE ); - - // Choose the correct func_t object. - packm_kers = packm_struc_cxk_kers; - - // Query the datatype-specific function pointer from the func_t object. - packm_ker = bli_func_obj_query( dt_cp, packm_kers ); - - - // Index into the type combination array to extract the correct - // function pointer. - f = ftypes[dt_cp]; - - // Invoke the function. - f( strucc, - diagoffc, - diagc, - uploc, - transc, - schema, - invdiag, - revifup, - reviflo, - m_p, - n_p, - m_max_p, - n_max_p, - buf_kappa, - buf_c, rs_c, cs_c, - buf_p, rs_p, cs_p, - is_p, - pd_p, ps_p, - packm_ker, - t ); -} - - -#undef GENTFUNC -#define GENTFUNC( ctype, ch, varname, kertype ) \ -\ -void PASTEMAC(ch,varname) \ - struc_t strucc, \ - doff_t diagoffc, \ - diag_t diagc, \ - uplo_t uploc, \ - trans_t transc, \ - pack_t schema, \ - bool_t invdiag, \ - bool_t revifup, \ - bool_t reviflo, \ - dim_t m, \ - dim_t n, \ - dim_t m_max, \ - dim_t n_max, \ - void* kappa, \ - void* c, inc_t rs_c, inc_t cs_c, \ - void* p, inc_t rs_p, inc_t cs_p, \ - inc_t is_p, \ - dim_t pd_p, inc_t ps_p, \ - void* packm_ker, \ - packm_thrinfo_t* thread \ - ) \ -{ \ - PASTECH(ch,kertype) packm_ker_cast = packm_ker; \ -\ - ctype* restrict kappa_cast = kappa; \ - ctype* restrict c_cast = c; \ - ctype* restrict p_cast = p; \ - ctype* restrict c_begin; \ - ctype* restrict p_begin; \ -\ - dim_t iter_dim; \ - dim_t num_iter; \ - dim_t it, ic, ip; \ - dim_t ic0, ip0; \ - doff_t ic_inc, ip_inc; \ - doff_t 
diagoffc_i; \ - doff_t diagoffc_inc; \ - dim_t panel_len_full; \ - dim_t panel_len_i; \ - dim_t panel_len_max; \ - dim_t panel_len_max_i; \ - dim_t panel_dim_i; \ - dim_t panel_dim_max; \ - dim_t panel_off_i; \ - inc_t vs_c; \ - inc_t ldc; \ - inc_t ldp, p_inc; \ - dim_t* m_panel_full; \ - dim_t* n_panel_full; \ - dim_t* m_panel_use; \ - dim_t* n_panel_use; \ - dim_t* m_panel_max; \ - dim_t* n_panel_max; \ - conj_t conjc; \ - bool_t row_stored; \ - bool_t col_stored; \ -\ - ctype* restrict c_use; \ - ctype* restrict p_use; \ - doff_t diagoffp_i; \ -\ -\ - /* If C is zeros and part of a triangular matrix, then we don't need - to pack it. */ \ - if ( bli_is_zeros( uploc ) && \ - bli_is_triangular( strucc ) ) return; \ -\ - /* Extract the conjugation bit from the transposition argument. */ \ - conjc = bli_extract_conj( transc ); \ -\ - /* If c needs a transposition, induce it so that we can more simply - express the remaining parameters and code. */ \ - if ( bli_does_trans( transc ) ) \ - { \ - bli_swap_incs( &rs_c, &cs_c ); \ - bli_negate_diag_offset( &diagoffc ); \ - bli_toggle_uplo( &uploc ); \ - bli_toggle_trans( &transc ); \ - } \ -\ - /* Create flags to incidate row or column storage. Note that the - schema bit that encodes row or column is describing the form of - micro-panel, not the storage in the micro-panel. Hence the - mismatch in "row" and "column" semantics. */ \ - row_stored = bli_is_col_packed( schema ); \ - col_stored = bli_is_row_packed( schema ); \ -\ - /* If the row storage flag indicates row storage, then we are packing - to column panels; otherwise, if the strides indicate column storage, - we are packing to row panels. */ \ - if ( row_stored ) \ - { \ - /* Prepare to pack to row-stored column panels. 
*/ \ - iter_dim = n; \ - panel_len_full = m; \ - panel_len_max = m_max; \ - panel_dim_max = pd_p; \ - ldc = rs_c; \ - vs_c = cs_c; \ - diagoffc_inc = -( doff_t )panel_dim_max; \ - ldp = rs_p; \ - m_panel_full = &m; \ - n_panel_full = &panel_dim_i; \ - m_panel_use = &panel_len_i; \ - n_panel_use = &panel_dim_i; \ - m_panel_max = &panel_len_max_i; \ - n_panel_max = &panel_dim_max; \ - } \ - else /* if ( col_stored ) */ \ - { \ - /* Prepare to pack to column-stored row panels. */ \ - iter_dim = m; \ - panel_len_full = n; \ - panel_len_max = n_max; \ - panel_dim_max = pd_p; \ - ldc = cs_c; \ - vs_c = rs_c; \ - diagoffc_inc = ( doff_t )panel_dim_max; \ - ldp = cs_p; \ - m_panel_full = &panel_dim_i; \ - n_panel_full = &n; \ - m_panel_use = &panel_dim_i; \ - n_panel_use = &panel_len_i; \ - m_panel_max = &panel_dim_max; \ - n_panel_max = &panel_len_max_i; \ - } \ -\ - /* Compute the total number of iterations we'll need. */ \ - num_iter = iter_dim / panel_dim_max + ( iter_dim % panel_dim_max ? 1 : 0 ); \ -\ - /* Set the initial values and increments for indices related to C and P - based on whether reverse iteration was requested. 
*/ \ - if ( ( revifup && bli_is_upper( uploc ) && bli_is_triangular( strucc ) ) || \ - ( reviflo && bli_is_lower( uploc ) && bli_is_triangular( strucc ) ) ) \ - { \ - ic0 = (num_iter - 1) * panel_dim_max; \ - ic_inc = -panel_dim_max; \ - ip0 = num_iter - 1; \ - ip_inc = -1; \ - } \ - else \ - { \ - ic0 = 0; \ - ic_inc = panel_dim_max; \ - ip0 = 0; \ - ip_inc = 1; \ - } \ -\ - p_begin = p_cast; \ -\ - for ( ic = ic0, ip = ip0, it = 0; it < num_iter; \ - ic += ic_inc, ip += ip_inc, it += 1 ) \ - { \ - panel_dim_i = bli_min( panel_dim_max, iter_dim - ic ); \ -\ - diagoffc_i = diagoffc + (ip )*diagoffc_inc; \ - c_begin = c_cast + (ic )*vs_c; \ -\ - if ( bli_is_triangular( strucc ) && \ - bli_is_unstored_subpart_n( diagoffc_i, uploc, *m_panel_full, *n_panel_full ) ) \ - { \ - /* This case executes if the panel belongs to a triangular - matrix AND is completely unstored (ie: zero). If the panel - is unstored, we do nothing. (Notice that we don't even - increment p_begin.) */ \ -\ - continue; \ - } \ - else if ( bli_is_triangular( strucc ) && \ - bli_intersects_diag_n( diagoffc_i, *m_panel_full, *n_panel_full ) ) \ - { \ - /* This case executes if the panel belongs to a triangular - matrix AND is diagonal-intersecting. Notice that we - cannot bury the following conditional logic into - packm_struc_cxk() because we need to know the value of - panel_len_max_i so we can properly increment p_inc. */ \ -\ - /* Sanity check. Diagonals should not intersect the short end of - a micro-panel. If they do, then somehow the constraints on - cache blocksizes being a whole multiple of the register - blocksizes was somehow violated. 
*/ \ - if ( ( col_stored && diagoffc_i < 0 ) || \ - ( row_stored && diagoffc_i > 0 ) ) \ - bli_check_error_code( BLIS_NOT_YET_IMPLEMENTED ); \ -\ - if ( ( row_stored && bli_is_upper( uploc ) ) || \ - ( col_stored && bli_is_lower( uploc ) ) ) \ - { \ - panel_off_i = 0; \ - panel_len_i = bli_abs( diagoffc_i ) + panel_dim_i; \ - panel_len_max_i = bli_min( bli_abs( diagoffc_i ) + panel_dim_max, \ - panel_len_max ); \ - diagoffp_i = diagoffc_i; \ - } \ - else /* if ( ( row_stored && bli_is_lower( uploc ) ) || \ - ( col_stored && bli_is_upper( uploc ) ) ) */ \ - { \ - panel_off_i = bli_abs( diagoffc_i ); \ - panel_len_i = panel_len_full - panel_off_i; \ - panel_len_max_i = panel_len_max - panel_off_i; \ - diagoffp_i = 0; \ - } \ -\ - c_use = c_begin + (panel_off_i )*ldc; \ - p_use = p_begin; \ -\ - if( packm_thread_my_iter( it, thread ) ) \ - { \ - packm_ker_cast( strucc, \ - diagoffp_i, \ - diagc, \ - uploc, \ - conjc, \ - schema, \ - invdiag, \ - *m_panel_use, \ - *n_panel_use, \ - *m_panel_max, \ - *n_panel_max, \ - kappa_cast, \ - c_use, rs_c, cs_c, \ - p_use, rs_p, cs_p, \ - is_p ); \ - } \ -\ - /* NOTE: This value is usually LESS than ps_p because triangular - matrices usually have several micro-panels that are shorter - than a "full" micro-panel. */ \ - p_inc = ldp * panel_len_max_i; \ -\ - /* We nudge the panel increment up by one if it is odd. */ \ - p_inc += ( bli_is_odd( p_inc ) ? 1 : 0 ); \ - } \ - else if ( bli_is_herm_or_symm( strucc ) ) \ - { \ - /* This case executes if the panel belongs to a Hermitian or - symmetric matrix, which includes stored, unstored, and - diagonal-intersecting panels. 
*/ \ -\ - panel_len_i = panel_len_full; \ - panel_len_max_i = panel_len_max; \ -\ - if( packm_thread_my_iter( it, thread ) ) \ - { \ - packm_ker_cast( strucc, \ - diagoffc_i, \ - diagc, \ - uploc, \ - conjc, \ - schema, \ - invdiag, \ - *m_panel_use, \ - *n_panel_use, \ - *m_panel_max, \ - *n_panel_max, \ - kappa_cast, \ - c_begin, rs_c, cs_c, \ - p_begin, rs_p, cs_p, \ - is_p ); \ - } \ -\ - /* NOTE: This value is equivalent to ps_p. */ \ - /*p_inc = ldp * panel_len_max_i;*/ \ - p_inc = ps_p; \ - } \ - else \ - { \ - /* This case executes if the panel is general, or, if the - panel is part of a triangular matrix and is neither unstored - (ie: zero) nor diagonal-intersecting. */ \ -\ - panel_len_i = panel_len_full; \ - panel_len_max_i = panel_len_max; \ -\ - if( packm_thread_my_iter( it, thread ) ) \ - { \ - packm_ker_cast( BLIS_GENERAL, \ - 0, \ - diagc, \ - BLIS_DENSE, \ - conjc, \ - schema, \ - invdiag, \ - *m_panel_use, \ - *n_panel_use, \ - *m_panel_max, \ - *n_panel_max, \ - kappa_cast, \ - c_begin, rs_c, cs_c, \ - p_begin, rs_p, cs_p, \ - is_p ); \ - } \ -/* - if ( row_stored ) \ - PASTEMAC(ch,fprintm)( stdout, "packm_var1: bp copied", panel_len_max_i, panel_dim_max, \ - p_begin, rs_p, cs_p, "%9.2e", "" ); \ - else if ( col_stored ) \ - PASTEMAC(ch,fprintm)( stdout, "packm_var1: ap copied", panel_dim_max, panel_len_max_i, \ - p_begin, rs_p, cs_p, "%9.2e", "" ); \ -*/ \ -\ - /* NOTE: This value is equivalent to ps_p. */ \ - /*p_inc = ldp * panel_len_max_i;*/ \ - p_inc = ps_p; \ - } \ -\ -\ - p_begin += p_inc; \ - } \ -} - -INSERT_GENTFUNC_BASIC( packm_blk_var1, packm_ker_t ) - diff --git a/frame/1m/packm/bli_packm_blk_var1rr.c b/frame/1m/packm/bli_packm_blk_var1rr.c new file mode 100644 index 000000000..cb364f276 --- /dev/null +++ b/frame/1m/packm/bli_packm_blk_var1rr.c @@ -0,0 +1,737 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. 
+ + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +*/ + +#include "blis.h" + +#define FUNCPTR_T packm_fp + +typedef void (*FUNCPTR_T) + ( + struc_t strucc, + doff_t diagoffc, + diag_t diagc, + uplo_t uploc, + trans_t transc, + pack_t schema, + bool_t invdiag, + bool_t revifup, + bool_t reviflo, + dim_t m, + dim_t n, + dim_t m_max, + dim_t n_max, + void* kappa, + void* c, inc_t rs_c, inc_t cs_c, + void* p, inc_t rs_p, inc_t cs_p, + inc_t is_p, + dim_t pd_p, inc_t ps_p, + void* packm_ker, + cntx_t* cntx, + thrinfo_t* thread + ); + +static FUNCPTR_T GENARRAY(ftypes,packm_blk_var1rr); + + +static func_t packm_struc_cxk_kers[BLIS_NUM_PACK_SCHEMA_TYPES] = +{ + /* float (0) scomplex (1) double (2) dcomplex (3) */ +// 0000 row/col panels + { { bli_spackm_struc_cxk, bli_cpackm_struc_cxk, + bli_dpackm_struc_cxk, bli_zpackm_struc_cxk, } }, +// 0001 row/col panels: 4m interleaved + { { NULL, bli_cpackm_struc_cxk_4mi, + NULL, bli_zpackm_struc_cxk_4mi, } }, +// 0010 row/col panels: 3m interleaved + { { NULL, bli_cpackm_struc_cxk_3mis, + NULL, bli_zpackm_struc_cxk_3mis, } }, +// 0011 row/col panels: 4m separated (NOT IMPLEMENTED) + { { NULL, NULL, + NULL, NULL, } }, +// 0100 row/col panels: 3m separated + { { NULL, bli_cpackm_struc_cxk_3mis, + NULL, bli_zpackm_struc_cxk_3mis, } }, +// 0101 row/col panels: real only + { { NULL, bli_cpackm_struc_cxk_rih, + NULL, bli_zpackm_struc_cxk_rih, } }, +// 0110 row/col panels: imaginary only + { { NULL, bli_cpackm_struc_cxk_rih, + NULL, bli_zpackm_struc_cxk_rih, } }, +// 0111 row/col panels: real+imaginary only + { { NULL, bli_cpackm_struc_cxk_rih, + NULL, bli_zpackm_struc_cxk_rih, } }, +// 1000 row/col panels: 1m-expanded (1e) + { { NULL, bli_cpackm_struc_cxk_1er, + NULL, bli_zpackm_struc_cxk_1er, } }, +// 1001 row/col panels: 1m-reordered (1r) + { { NULL, bli_cpackm_struc_cxk_1er, + NULL, bli_zpackm_struc_cxk_1er, } }, +}; + + +void bli_packm_blk_var1rr + ( + obj_t* c, + obj_t* p, + cntx_t* cntx, + cntl_t* cntl, + thrinfo_t* t + ) +{ + num_t dt_cp = bli_obj_dt( c ); + + struc_t strucc = 
bli_obj_struc( c ); + doff_t diagoffc = bli_obj_diag_offset( c ); + diag_t diagc = bli_obj_diag( c ); + uplo_t uploc = bli_obj_uplo( c ); + trans_t transc = bli_obj_conjtrans_status( c ); + pack_t schema = bli_obj_pack_schema( p ); + bool_t invdiag = bli_obj_has_inverted_diag( p ); + bool_t revifup = bli_obj_is_pack_rev_if_upper( p ); + bool_t reviflo = bli_obj_is_pack_rev_if_lower( p ); + + dim_t m_p = bli_obj_length( p ); + dim_t n_p = bli_obj_width( p ); + dim_t m_max_p = bli_obj_padded_length( p ); + dim_t n_max_p = bli_obj_padded_width( p ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c = bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + void* buf_p = bli_obj_buffer_at_off( p ); + inc_t rs_p = bli_obj_row_stride( p ); + inc_t cs_p = bli_obj_col_stride( p ); + inc_t is_p = bli_obj_imag_stride( p ); + dim_t pd_p = bli_obj_panel_dim( p ); + inc_t ps_p = bli_obj_panel_stride( p ); + + obj_t kappa; + obj_t* kappa_p; + void* buf_kappa; + + func_t* packm_kers; + void* packm_ker; + + FUNCPTR_T f; + + + // Treatment of kappa (ie: packing during scaling) depends on + // whether we are executing an induced method. + if ( bli_is_nat_packed( schema ) ) + { + // This branch is for native execution, where we assume that + // the micro-kernel will always apply the alpha scalar of the + // higher-level operation. Thus, we use BLIS_ONE for kappa so + // that the underlying packm implementation does not perform + // any scaling during packing. + buf_kappa = bli_obj_buffer_for_const( dt_cp, &BLIS_ONE ); + } + else // if ( bli_is_ind_packed( schema ) ) + { + // The value for kappa we use will depend on whether the scalar + // attached to A has a nonzero imaginary component. If it does, + // then we will apply the scalar during packing to facilitate + // implementing induced complex domain algorithms in terms of + // real domain micro-kernels. 
(In the aforementioned situation, + // applying a real scalar is easy, but applying a complex one is + // harder, so we avoid the need altogether with the code below.) + if ( bli_obj_scalar_has_nonzero_imag( p ) ) + { + //printf( "applying non-zero imag kappa\n" ); + + // Detach the scalar. + bli_obj_scalar_detach( p, &kappa ); + + // Reset the attached scalar (to 1.0). + bli_obj_scalar_reset( p ); + + kappa_p = &kappa; + } + else + { + // If the internal scalar of A has only a real component, then + // we will apply it later (in the micro-kernel), and so we will + // use BLIS_ONE to indicate no scaling during packing. + kappa_p = &BLIS_ONE; + } + + // Acquire the buffer to the kappa chosen above. + buf_kappa = bli_obj_buffer_for_1x1( dt_cp, kappa_p ); + } + + + // Choose the correct func_t object based on the pack_t schema. +#if 0 + if ( bli_is_4mi_packed( schema ) ) packm_kers = packm_struc_cxk_4mi_kers; + else if ( bli_is_3mi_packed( schema ) || + bli_is_3ms_packed( schema ) ) packm_kers = packm_struc_cxk_3mis_kers; + else if ( bli_is_ro_packed( schema ) || + bli_is_io_packed( schema ) || + bli_is_rpi_packed( schema ) ) packm_kers = packm_struc_cxk_rih_kers; + else packm_kers = packm_struc_cxk_kers; +#else + // The original idea here was to read the packm_ukr from the context + // if it is non-NULL. The problem is, it requires that we be able to + // assume that the packm_ukr field is initialized to NULL, which it + // currently is not. + + //func_t* cntx_packm_kers = bli_cntx_get_packm_ukr( cntx ); + + //if ( bli_func_is_null_dt( dt_cp, cntx_packm_kers ) ) + { + // If the packm structure-aware kernel func_t in the context is + // NULL (which is the default value after the context is created), + // we use the default lookup table to determine the right func_t + // for the current schema.
+ const dim_t i = bli_pack_schema_index( schema ); + + packm_kers = &packm_struc_cxk_kers[ i ]; + } +#if 0 + else // cntx's packm func_t overrides + { + // If the packm structure-aware kernel func_t in the context is + // non-NULL (ie: assumed to be valid), we use that instead. + //packm_kers = bli_cntx_packm_ukrs( cntx ); + packm_kers = cntx_packm_kers; + } +#endif +#endif + + // Query the datatype-specific function pointer from the func_t object. + packm_ker = bli_func_get_dt( dt_cp, packm_kers ); + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_cp]; + + // Invoke the function. + f( strucc, + diagoffc, + diagc, + uploc, + transc, + schema, + invdiag, + revifup, + reviflo, + m_p, + n_p, + m_max_p, + n_max_p, + buf_kappa, + buf_c, rs_c, cs_c, + buf_p, rs_p, cs_p, + is_p, + pd_p, ps_p, + packm_ker, + cntx, + t ); +} + + +#undef GENTFUNCR +#define GENTFUNCR( ctype, ctype_r, ch, chr, opname, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + struc_t strucc, \ + doff_t diagoffc, \ + diag_t diagc, \ + uplo_t uploc, \ + trans_t transc, \ + pack_t schema, \ + bool_t invdiag, \ + bool_t revifup, \ + bool_t reviflo, \ + dim_t m, \ + dim_t n, \ + dim_t m_max, \ + dim_t n_max, \ + void* kappa, \ + void* c, inc_t rs_c, inc_t cs_c, \ + void* p, inc_t rs_p, inc_t cs_p, \ + inc_t is_p, \ + dim_t pd_p, inc_t ps_p, \ + void* packm_ker, \ + cntx_t* cntx, \ + thrinfo_t* thread \ + ) \ +{ \ + PASTECH2(ch,opname,_ker_ft) packm_ker_cast = packm_ker; \ +\ + ctype* restrict kappa_cast = kappa; \ + ctype* restrict c_cast = c; \ + ctype* restrict p_cast = p; \ + ctype* restrict c_begin; \ + ctype* restrict p_begin; \ +\ + dim_t iter_dim; \ + dim_t n_iter; \ + dim_t it, ic, ip; \ + dim_t ic0, ip0; \ + doff_t ic_inc, ip_inc; \ + doff_t diagoffc_i; \ + doff_t diagoffc_inc; \ + dim_t panel_len_full; \ + dim_t panel_len_i; \ + dim_t panel_len_max; \ + dim_t panel_len_max_i; \ + dim_t panel_dim_i; \ + dim_t panel_dim_max; \ + dim_t 
panel_off_i; \ + inc_t vs_c; \ + inc_t ldc; \ + inc_t ldp, p_inc; \ + dim_t* m_panel_full; \ + dim_t* n_panel_full; \ + dim_t* m_panel_use; \ + dim_t* n_panel_use; \ + dim_t* m_panel_max; \ + dim_t* n_panel_max; \ + conj_t conjc; \ + bool_t row_stored; \ + bool_t col_stored; \ + inc_t is_p_use; \ + dim_t ss_num; \ + dim_t ss_den; \ +\ + ctype* restrict c_use; \ + ctype* restrict p_use; \ + doff_t diagoffp_i; \ +\ +\ + /* If C is zeros and part of a triangular matrix, then we don't need + to pack it. */ \ + if ( bli_is_zeros( uploc ) && \ + bli_is_triangular( strucc ) ) return; \ +\ + /* Extract the conjugation bit from the transposition argument. */ \ + conjc = bli_extract_conj( transc ); \ +\ + /* If c needs a transposition, induce it so that we can more simply + express the remaining parameters and code. */ \ + if ( bli_does_trans( transc ) ) \ + { \ + bli_swap_incs( &rs_c, &cs_c ); \ + bli_negate_diag_offset( &diagoffc ); \ + bli_toggle_uplo( &uploc ); \ + bli_toggle_trans( &transc ); \ + } \ +\ + /* Create flags to indicate row or column storage. Note that the + schema bit that encodes row or column is describing the form of + micro-panel, not the storage in the micro-panel. Hence the + mismatch in "row" and "column" semantics. */ \ + row_stored = bli_is_col_packed( schema ); \ + col_stored = bli_is_row_packed( schema ); \ +\ + /* If the row storage flag indicates row storage, then we are packing + to column panels; otherwise, if the strides indicate column storage, + we are packing to row panels. */ \ + if ( row_stored ) \ + { \ + /* Prepare to pack to row-stored column panels.
*/ \ + iter_dim = n; \ + panel_len_full = m; \ + panel_len_max = m_max; \ + panel_dim_max = pd_p; \ + ldc = rs_c; \ + vs_c = cs_c; \ + diagoffc_inc = -( doff_t )panel_dim_max; \ + ldp = rs_p; \ + m_panel_full = &m; \ + n_panel_full = &panel_dim_i; \ + m_panel_use = &panel_len_i; \ + n_panel_use = &panel_dim_i; \ + m_panel_max = &panel_len_max_i; \ + n_panel_max = &panel_dim_max; \ + } \ + else /* if ( col_stored ) */ \ + { \ + /* Prepare to pack to column-stored row panels. */ \ + iter_dim = m; \ + panel_len_full = n; \ + panel_len_max = n_max; \ + panel_dim_max = pd_p; \ + ldc = cs_c; \ + vs_c = rs_c; \ + diagoffc_inc = ( doff_t )panel_dim_max; \ + ldp = cs_p; \ + m_panel_full = &panel_dim_i; \ + n_panel_full = &n; \ + m_panel_use = &panel_dim_i; \ + n_panel_use = &panel_len_i; \ + m_panel_max = &panel_dim_max; \ + n_panel_max = &panel_len_max_i; \ + } \ +\ + /* Compute the storage stride scaling. Usually this is just 1. However, + in the case of interleaved 3m, we need to scale by 3/2, and in the + cases of real-only, imag-only, or summed-only, we need to scale by + 1/2. In both cases, we are compensating for the fact that pointer + arithmetic occurs in terms of complex elements rather than real + elements. */ \ + if ( bli_is_3mi_packed( schema ) ) { ss_num = 3; ss_den = 2; } \ + else if ( bli_is_3ms_packed( schema ) ) { ss_num = 1; ss_den = 2; } \ + else if ( bli_is_rih_packed( schema ) ) { ss_num = 1; ss_den = 2; } \ + else { ss_num = 1; ss_den = 1; } \ +\ + /* Compute the total number of iterations we'll need. */ \ + n_iter = iter_dim / panel_dim_max + ( iter_dim % panel_dim_max ? 1 : 0 ); \ +\ + /* Set the initial values and increments for indices related to C and P + based on whether reverse iteration was requested. 
*/ \ + if ( ( revifup && bli_is_upper( uploc ) && bli_is_triangular( strucc ) ) || \ + ( reviflo && bli_is_lower( uploc ) && bli_is_triangular( strucc ) ) ) \ + { \ + ic0 = (n_iter - 1) * panel_dim_max; \ + ic_inc = -panel_dim_max; \ + ip0 = n_iter - 1; \ + ip_inc = -1; \ + } \ + else \ + { \ + ic0 = 0; \ + ic_inc = panel_dim_max; \ + ip0 = 0; \ + ip_inc = 1; \ + } \ +\ + p_begin = p_cast; \ +\ +\ + /* Query the number of threads and thread ids from the current thread's + packm thrinfo_t node. */ \ + const dim_t nt = bli_thread_n_way( thread ); \ + const dim_t tid = bli_thread_work_id( thread ); \ +\ + dim_t it_start, it_end, it_inc; \ +\ + /* Determine the thread range and increment using the current thread's + packm thrinfo_t node. */ \ + bli_thread_range_jrir_rr( thread, n_iter, 1, FALSE, &it_start, &it_end, &it_inc ); \ +\ + /* Iterate over every logical micropanel in the source matrix. */ \ + for ( ic = ic0, ip = ip0, it = 0; it < n_iter; \ + ic += ic_inc, ip += ip_inc, it += 1 ) \ + { \ + panel_dim_i = bli_min( panel_dim_max, iter_dim - ic ); \ +\ + diagoffc_i = diagoffc + (ip )*diagoffc_inc; \ + c_begin = c_cast + (ic )*vs_c; \ +\ + if ( bli_is_triangular( strucc ) && \ + bli_is_unstored_subpart_n( diagoffc_i, uploc, *m_panel_full, *n_panel_full ) ) \ + { \ + /* This case executes if the panel belongs to a triangular + matrix AND is completely unstored (ie: zero). If the panel + is unstored, we do nothing. (Notice that we don't even + increment p_begin.) */ \ +\ + continue; \ + } \ + else if ( bli_is_triangular( strucc ) && \ + bli_intersects_diag_n( diagoffc_i, *m_panel_full, *n_panel_full ) ) \ + { \ + /* This case executes if the panel belongs to a triangular + matrix AND is diagonal-intersecting. Notice that we + cannot bury the following conditional logic into + packm_struc_cxk() because we need to know the value of + panel_len_max_i so we can properly increment p_inc. */ \ +\ + /* Sanity check. 
Diagonals should not intersect the short end of + a micro-panel. If they do, then the constraint that + cache blocksizes be whole multiples of the register + blocksizes was somehow violated. */ \ + if ( ( col_stored && diagoffc_i < 0 ) || \ + ( row_stored && diagoffc_i > 0 ) ) \ + bli_check_error_code( BLIS_NOT_YET_IMPLEMENTED ); \ +\ + if ( ( row_stored && bli_is_upper( uploc ) ) || \ + ( col_stored && bli_is_lower( uploc ) ) ) \ + { \ + panel_off_i = 0; \ + panel_len_i = bli_abs( diagoffc_i ) + panel_dim_i; \ + panel_len_max_i = bli_min( bli_abs( diagoffc_i ) + panel_dim_max, \ + panel_len_max ); \ + diagoffp_i = diagoffc_i; \ + } \ + else /* if ( ( row_stored && bli_is_lower( uploc ) ) || \ + ( col_stored && bli_is_upper( uploc ) ) ) */ \ + { \ + panel_off_i = bli_abs( diagoffc_i ); \ + panel_len_i = panel_len_full - panel_off_i; \ + panel_len_max_i = panel_len_max - panel_off_i; \ + diagoffp_i = 0; \ + } \ +\ + c_use = c_begin + (panel_off_i )*ldc; \ + p_use = p_begin; \ +\ + /* We need to re-compute the imaginary stride as a function of + panel_len_max_i since triangular packed matrices have panels + of varying lengths. NOTE: This imaginary stride value is + only referenced by the packm kernels for induced methods. */ \ + is_p_use = ldp * panel_len_max_i; \ +\ + /* We nudge the imaginary stride up by one if it is odd. */ \ + is_p_use += ( bli_is_odd( is_p_use ) ? 1 : 0 ); \ +\ + if ( bli_packm_my_iter_rr( it, it_start, it_end, tid, nt ) ) \ + { \ + packm_ker_cast( strucc, \ + diagoffp_i, \ + diagc, \ + uploc, \ + conjc, \ + schema, \ + invdiag, \ + *m_panel_use, \ + *n_panel_use, \ + *m_panel_max, \ + *n_panel_max, \ + kappa_cast, \ + c_use, rs_c, cs_c, \ + p_use, rs_p, cs_p, \ + is_p_use, \ + cntx ); \ + } \ +\ + /* NOTE: This value is usually LESS than ps_p because triangular + matrices usually have several micro-panels that are shorter + than a "full" micro-panel.
*/ \ + p_inc = ( is_p_use * ss_num ) / ss_den; \ + } \ + else if ( bli_is_herm_or_symm( strucc ) ) \ + { \ + /* This case executes if the panel belongs to a Hermitian or + symmetric matrix, which includes stored, unstored, and + diagonal-intersecting panels. */ \ +\ + c_use = c_begin; \ + p_use = p_begin; \ +\ + panel_len_i = panel_len_full; \ + panel_len_max_i = panel_len_max; \ +\ + is_p_use = is_p; \ +\ + if ( bli_packm_my_iter_rr( it, it_start, it_end, tid, nt ) ) \ + { \ + packm_ker_cast( strucc, \ + diagoffc_i, \ + diagc, \ + uploc, \ + conjc, \ + schema, \ + invdiag, \ + *m_panel_use, \ + *n_panel_use, \ + *m_panel_max, \ + *n_panel_max, \ + kappa_cast, \ + c_use, rs_c, cs_c, \ + p_use, rs_p, cs_p, \ + is_p_use, \ + cntx ); \ + } \ +\ + p_inc = ps_p; \ + } \ + else \ + { \ + /* This case executes if the panel is general, or, if the + panel is part of a triangular matrix and is neither unstored + (ie: zero) nor diagonal-intersecting. */ \ +\ + c_use = c_begin; \ + p_use = p_begin; \ +\ + panel_len_i = panel_len_full; \ + panel_len_max_i = panel_len_max; \ +\ + is_p_use = is_p; \ +\ + if ( bli_packm_my_iter_rr( it, it_start, it_end, tid, nt ) ) \ + { \ +/* +printf( "thread %d: packing micropanel iteration %3d\n", (int)tid, (int)it ); \ +*/ \ + packm_ker_cast( BLIS_GENERAL, \ + 0, \ + diagc, \ + BLIS_DENSE, \ + conjc, \ + schema, \ + invdiag, \ + *m_panel_use, \ + *n_panel_use, \ + *m_panel_max, \ + *n_panel_max, \ + kappa_cast, \ + c_use, rs_c, cs_c, \ + p_use, rs_p, cs_p, \ + is_p_use, \ + cntx ); \ + } \ +\ + /* NOTE: This value is equivalent to ps_p. 
*/ \ + p_inc = ps_p; \ + } \ +\ + p_begin += p_inc; \ +\ + } \ +/* +printf( "thread %d: done\n", (int)tid ); \ +*/ \ +} + +INSERT_GENTFUNCR_BASIC( packm, packm_blk_var1rr ) + + + +/* +if ( row_stored ) \ +PASTEMAC(ch,fprintm)( stdout, "packm_var2: b", m, n, \ + c_cast, rs_c, cs_c, "%4.1f", "" ); \ +if ( col_stored ) \ +PASTEMAC(ch,fprintm)( stdout, "packm_var2: a", m, n, \ + c_cast, rs_c, cs_c, "%4.1f", "" ); \ +*/ +/* +if ( col_stored ) { \ + if ( bli_thread_work_id( thread ) == 0 ) \ + { \ + printf( "packm_blk_var1: thread %lu (a = %p, ap = %p)\n", bli_thread_work_id( thread ), c_use, p_use ); \ + fflush( stdout ); \ + PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: a", *m_panel_use, *n_panel_use, \ + ( ctype* )c_use, rs_c, cs_c, "%4.1f", "" ); \ + PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: ap", *m_panel_max, *n_panel_max, \ + ( ctype* )p_use, rs_p, cs_p, "%4.1f", "" ); \ + fflush( stdout ); \ + } \ +bli_thread_obarrier( thread ); \ + if ( bli_thread_work_id( thread ) == 1 ) \ + { \ + printf( "packm_blk_var1: thread %lu (a = %p, ap = %p)\n", bli_thread_work_id( thread ), c_use, p_use ); \ + fflush( stdout ); \ + PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: a", *m_panel_use, *n_panel_use, \ + ( ctype* )c_use, rs_c, cs_c, "%4.1f", "" ); \ + PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: ap", *m_panel_max, *n_panel_max, \ + ( ctype* )p_use, rs_p, cs_p, "%4.1f", "" ); \ + fflush( stdout ); \ + } \ +bli_thread_obarrier( thread ); \ +} \ +else { \ + if ( bli_thread_work_id( thread ) == 0 ) \ + { \ + printf( "packm_blk_var1: thread %lu (b = %p, bp = %p)\n", bli_thread_work_id( thread ), c_use, p_use ); \ + fflush( stdout ); \ + PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: b", *m_panel_use, *n_panel_use, \ + ( ctype* )c_use, rs_c, cs_c, "%4.1f", "" ); \ + PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: bp", *m_panel_max, *n_panel_max, \ + ( ctype* )p_use, rs_p, cs_p, "%4.1f", "" ); \ + fflush( stdout ); \ + } \ +bli_thread_obarrier( thread ); \ + if ( 
bli_thread_work_id( thread ) == 1 ) \ + { \ + printf( "packm_blk_var1: thread %lu (b = %p, bp = %p)\n", bli_thread_work_id( thread ), c_use, p_use ); \ + fflush( stdout ); \ + PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: b", *m_panel_use, *n_panel_use, \ + ( ctype* )c_use, rs_c, cs_c, "%4.1f", "" ); \ + PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: bp", *m_panel_max, *n_panel_max, \ + ( ctype* )p_use, rs_p, cs_p, "%4.1f", "" ); \ + fflush( stdout ); \ + } \ +bli_thread_obarrier( thread ); \ +} \ +*/ +/* + if ( bli_is_4mi_packed( schema ) ) { \ + printf( "packm_var2: is_p_use = %lu\n", is_p_use ); \ + if ( col_stored ) { \ + if ( 0 ) \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: a_r", *m_panel_use, *n_panel_use, \ + ( ctype_r* )c_use, 2*rs_c, 2*cs_c, "%4.1f", "" ); \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: ap_r", *m_panel_max, *n_panel_max, \ + ( ctype_r* )p_use, rs_p, cs_p, "%4.1f", "" ); \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: ap_i", *m_panel_max, *n_panel_max, \ + ( ctype_r* )p_use + is_p_use, rs_p, cs_p, "%4.1f", "" ); \ + } \ + if ( row_stored ) { \ + if ( 0 ) \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: b_r", *m_panel_use, *n_panel_use, \ + ( ctype_r* )c_use, 2*rs_c, 2*cs_c, "%4.1f", "" ); \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: bp_r", *m_panel_max, *n_panel_max, \ + ( ctype_r* )p_use, rs_p, cs_p, "%4.1f", "" ); \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: bp_i", *m_panel_max, *n_panel_max, \ + ( ctype_r* )p_use + is_p_use, rs_p, cs_p, "%4.1f", "" ); \ + } \ + } \ +*/ +/* + PASTEMAC(chr,fprintm)( stdout, "packm_var2: bp_rpi", *m_panel_max, *n_panel_max, \ + ( ctype_r* )p_use, rs_p, cs_p, "%4.1f", "" ); \ +*/ +/* + if ( row_stored ) { \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: b_r", *m_panel_max, *n_panel_max, \ + ( ctype_r* )c_use, 2*rs_c, 2*cs_c, "%4.1f", "" ); \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: b_i", *m_panel_max, *n_panel_max, \ + (( ctype_r* )c_use)+rs_c, 2*rs_c, 2*cs_c, "%4.1f", "" ); \ + 
PASTEMAC(chr,fprintm)( stdout, "packm_var2: bp_r", *m_panel_max, *n_panel_max, \ + ( ctype_r* )p_use, rs_p, cs_p, "%4.1f", "" ); \ + inc_t is_b = rs_p * *m_panel_max; \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: bp_i", *m_panel_max, *n_panel_max, \ + ( ctype_r* )p_use + is_b, rs_p, cs_p, "%4.1f", "" ); \ + } \ +*/ +/* + if ( col_stored ) { \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: a_r", *m_panel_max, *n_panel_max, \ + ( ctype_r* )c_use, 2*rs_c, 2*cs_c, "%4.1f", "" ); \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: a_i", *m_panel_max, *n_panel_max, \ + (( ctype_r* )c_use)+rs_c, 2*rs_c, 2*cs_c, "%4.1f", "" ); \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: ap_r", *m_panel_max, *n_panel_max, \ + ( ctype_r* )p_use, rs_p, cs_p, "%4.1f", "" ); \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: ap_i", *m_panel_max, *n_panel_max, \ + ( ctype_r* )p_use + p_inc, rs_p, cs_p, "%4.1f", "" ); \ + } \ +*/ diff --git a/frame/1m/packm/bli_packm_blk_var1sl.c b/frame/1m/packm/bli_packm_blk_var1sl.c new file mode 100644 index 000000000..6fbe3b211 --- /dev/null +++ b/frame/1m/packm/bli_packm_blk_var1sl.c @@ -0,0 +1,737 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. 
+ - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +*/ + +#include "blis.h" + +#define FUNCPTR_T packm_fp + +typedef void (*FUNCPTR_T) + ( + struc_t strucc, + doff_t diagoffc, + diag_t diagc, + uplo_t uploc, + trans_t transc, + pack_t schema, + bool_t invdiag, + bool_t revifup, + bool_t reviflo, + dim_t m, + dim_t n, + dim_t m_max, + dim_t n_max, + void* kappa, + void* c, inc_t rs_c, inc_t cs_c, + void* p, inc_t rs_p, inc_t cs_p, + inc_t is_p, + dim_t pd_p, inc_t ps_p, + void* packm_ker, + cntx_t* cntx, + thrinfo_t* thread + ); + +static FUNCPTR_T GENARRAY(ftypes,packm_blk_var1sl); + + +static func_t packm_struc_cxk_kers[BLIS_NUM_PACK_SCHEMA_TYPES] = +{ + /* float (0) scomplex (1) double (2) dcomplex (3) */ +// 0000 row/col panels + { { bli_spackm_struc_cxk, bli_cpackm_struc_cxk, + bli_dpackm_struc_cxk, bli_zpackm_struc_cxk, } }, +// 0001 row/col panels: 4m interleaved + { { NULL, bli_cpackm_struc_cxk_4mi, + NULL, bli_zpackm_struc_cxk_4mi, } }, +// 0010 row/col panels: 3m interleaved + { { NULL, bli_cpackm_struc_cxk_3mis, + NULL, 
bli_zpackm_struc_cxk_3mis, } }, +// 0011 row/col panels: 4m separated (NOT IMPLEMENTED) + { { NULL, NULL, + NULL, NULL, } }, +// 0100 row/col panels: 3m separated + { { NULL, bli_cpackm_struc_cxk_3mis, + NULL, bli_zpackm_struc_cxk_3mis, } }, +// 0101 row/col panels: real only + { { NULL, bli_cpackm_struc_cxk_rih, + NULL, bli_zpackm_struc_cxk_rih, } }, +// 0110 row/col panels: imaginary only + { { NULL, bli_cpackm_struc_cxk_rih, + NULL, bli_zpackm_struc_cxk_rih, } }, +// 0111 row/col panels: real+imaginary only + { { NULL, bli_cpackm_struc_cxk_rih, + NULL, bli_zpackm_struc_cxk_rih, } }, +// 1000 row/col panels: 1m-expanded (1e) + { { NULL, bli_cpackm_struc_cxk_1er, + NULL, bli_zpackm_struc_cxk_1er, } }, +// 1001 row/col panels: 1m-reordered (1r) + { { NULL, bli_cpackm_struc_cxk_1er, + NULL, bli_zpackm_struc_cxk_1er, } }, +}; + + +void bli_packm_blk_var1sl + ( + obj_t* c, + obj_t* p, + cntx_t* cntx, + cntl_t* cntl, + thrinfo_t* t + ) +{ + num_t dt_cp = bli_obj_dt( c ); + + struc_t strucc = bli_obj_struc( c ); + doff_t diagoffc = bli_obj_diag_offset( c ); + diag_t diagc = bli_obj_diag( c ); + uplo_t uploc = bli_obj_uplo( c ); + trans_t transc = bli_obj_conjtrans_status( c ); + pack_t schema = bli_obj_pack_schema( p ); + bool_t invdiag = bli_obj_has_inverted_diag( p ); + bool_t revifup = bli_obj_is_pack_rev_if_upper( p ); + bool_t reviflo = bli_obj_is_pack_rev_if_lower( p ); + + dim_t m_p = bli_obj_length( p ); + dim_t n_p = bli_obj_width( p ); + dim_t m_max_p = bli_obj_padded_length( p ); + dim_t n_max_p = bli_obj_padded_width( p ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c = bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + void* buf_p = bli_obj_buffer_at_off( p ); + inc_t rs_p = bli_obj_row_stride( p ); + inc_t cs_p = bli_obj_col_stride( p ); + inc_t is_p = bli_obj_imag_stride( p ); + dim_t pd_p = bli_obj_panel_dim( p ); + inc_t ps_p = bli_obj_panel_stride( p ); + + obj_t kappa; + obj_t* kappa_p; + void* buf_kappa; + + func_t* 
packm_kers; + void* packm_ker; + + FUNCPTR_T f; + + + // Treatment of kappa (ie: packing during scaling) depends on + // whether we are executing an induced method. + if ( bli_is_nat_packed( schema ) ) + { + // This branch is for native execution, where we assume that + // the micro-kernel will always apply the alpha scalar of the + // higher-level operation. Thus, we use BLIS_ONE for kappa so + // that the underlying packm implementation does not perform + // any scaling during packing. + buf_kappa = bli_obj_buffer_for_const( dt_cp, &BLIS_ONE ); + } + else // if ( bli_is_ind_packed( schema ) ) + { + // The value for kappa we use will depend on whether the scalar + // attached to A has a nonzero imaginary component. If it does, + // then we will apply the scalar during packing to facilitate + // implementing induced complex domain algorithms in terms of + // real domain micro-kernels. (In the aforementioned situation, + // applying a real scalar is easy, but applying a complex one is + // harder, so we avoid the need altogether with the code below.) + if ( bli_obj_scalar_has_nonzero_imag( p ) ) + { + //printf( "applying non-zero imag kappa\n" ); + + // Detach the scalar. + bli_obj_scalar_detach( p, &kappa ); + + // Reset the attached scalar (to 1.0). + bli_obj_scalar_reset( p ); + + kappa_p = &kappa; + } + else + { + // If the internal scalar of A has only a real component, then + // we will apply it later (in the micro-kernel), and so we will + // use BLIS_ONE to indicate no scaling during packing. + kappa_p = &BLIS_ONE; + } + + // Acquire the buffer to the kappa chosen above. + buf_kappa = bli_obj_buffer_for_1x1( dt_cp, kappa_p ); + } + + + // Choose the correct func_t object based on the pack_t schema.
+#if 0 + if ( bli_is_4mi_packed( schema ) ) packm_kers = packm_struc_cxk_4mi_kers; + else if ( bli_is_3mi_packed( schema ) || + bli_is_3ms_packed( schema ) ) packm_kers = packm_struc_cxk_3mis_kers; + else if ( bli_is_ro_packed( schema ) || + bli_is_io_packed( schema ) || + bli_is_rpi_packed( schema ) ) packm_kers = packm_struc_cxk_rih_kers; + else packm_kers = packm_struc_cxk_kers; +#else + // The original idea here was to read the packm_ukr from the context + // if it is non-NULL. The problem is, it requires that we be able to + // assume that the packm_ukr field is initialized to NULL, which it + // currently is not. + + //func_t* cntx_packm_kers = bli_cntx_get_packm_ukr( cntx ); + + //if ( bli_func_is_null_dt( dt_cp, cntx_packm_kers ) ) + { + // If the packm structure-aware kernel func_t in the context is + // NULL (which is the default value after the context is created), + // we use the default lookup table to determine the right func_t + // for the current schema. + const dim_t i = bli_pack_schema_index( schema ); + + packm_kers = &packm_struc_cxk_kers[ i ]; + } +#if 0 + else // cntx's packm func_t overrides + { + // If the packm structure-aware kernel func_t in the context is + // non-NULL (ie: assumed to be valid), we use that instead. + //packm_kers = bli_cntx_packm_ukrs( cntx ); + packm_kers = cntx_packm_kers; + } +#endif +#endif + + // Query the datatype-specific function pointer from the func_t object. + packm_ker = bli_func_get_dt( dt_cp, packm_kers ); + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_cp]; + + // Invoke the function. 
+ f( strucc, + diagoffc, + diagc, + uploc, + transc, + schema, + invdiag, + revifup, + reviflo, + m_p, + n_p, + m_max_p, + n_max_p, + buf_kappa, + buf_c, rs_c, cs_c, + buf_p, rs_p, cs_p, + is_p, + pd_p, ps_p, + packm_ker, + cntx, + t ); +} + + +#undef GENTFUNCR +#define GENTFUNCR( ctype, ctype_r, ch, chr, opname, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + struc_t strucc, \ + doff_t diagoffc, \ + diag_t diagc, \ + uplo_t uploc, \ + trans_t transc, \ + pack_t schema, \ + bool_t invdiag, \ + bool_t revifup, \ + bool_t reviflo, \ + dim_t m, \ + dim_t n, \ + dim_t m_max, \ + dim_t n_max, \ + void* kappa, \ + void* c, inc_t rs_c, inc_t cs_c, \ + void* p, inc_t rs_p, inc_t cs_p, \ + inc_t is_p, \ + dim_t pd_p, inc_t ps_p, \ + void* packm_ker, \ + cntx_t* cntx, \ + thrinfo_t* thread \ + ) \ +{ \ + PASTECH2(ch,opname,_ker_ft) packm_ker_cast = packm_ker; \ +\ + ctype* restrict kappa_cast = kappa; \ + ctype* restrict c_cast = c; \ + ctype* restrict p_cast = p; \ + ctype* restrict c_begin; \ + ctype* restrict p_begin; \ +\ + dim_t iter_dim; \ + dim_t n_iter; \ + dim_t it, ic, ip; \ + dim_t ic0, ip0; \ + doff_t ic_inc, ip_inc; \ + doff_t diagoffc_i; \ + doff_t diagoffc_inc; \ + dim_t panel_len_full; \ + dim_t panel_len_i; \ + dim_t panel_len_max; \ + dim_t panel_len_max_i; \ + dim_t panel_dim_i; \ + dim_t panel_dim_max; \ + dim_t panel_off_i; \ + inc_t vs_c; \ + inc_t ldc; \ + inc_t ldp, p_inc; \ + dim_t* m_panel_full; \ + dim_t* n_panel_full; \ + dim_t* m_panel_use; \ + dim_t* n_panel_use; \ + dim_t* m_panel_max; \ + dim_t* n_panel_max; \ + conj_t conjc; \ + bool_t row_stored; \ + bool_t col_stored; \ + inc_t is_p_use; \ + dim_t ss_num; \ + dim_t ss_den; \ +\ + ctype* restrict c_use; \ + ctype* restrict p_use; \ + doff_t diagoffp_i; \ +\ +\ + /* If C is zeros and part of a triangular matrix, then we don't need + to pack it. 
*/ \ + if ( bli_is_zeros( uploc ) && \ + bli_is_triangular( strucc ) ) return; \ +\ + /* Extract the conjugation bit from the transposition argument. */ \ + conjc = bli_extract_conj( transc ); \ +\ + /* If c needs a transposition, induce it so that we can more simply + express the remaining parameters and code. */ \ + if ( bli_does_trans( transc ) ) \ + { \ + bli_swap_incs( &rs_c, &cs_c ); \ + bli_negate_diag_offset( &diagoffc ); \ + bli_toggle_uplo( &uploc ); \ + bli_toggle_trans( &transc ); \ + } \ +\ + /* Create flags to indicate row or column storage. Note that the + schema bit that encodes row or column is describing the form of + the micro-panel, not the storage in the micro-panel. Hence the + mismatch in "row" and "column" semantics. */ \ + row_stored = bli_is_col_packed( schema ); \ + col_stored = bli_is_row_packed( schema ); \ +\ + /* If the row storage flag indicates row storage, then we are packing + to column panels; otherwise, if the column storage flag indicates + column storage, we are packing to row panels. */ \ + if ( row_stored ) \ + { \ + /* Prepare to pack to row-stored column panels. */ \ + iter_dim = n; \ + panel_len_full = m; \ + panel_len_max = m_max; \ + panel_dim_max = pd_p; \ + ldc = rs_c; \ + vs_c = cs_c; \ + diagoffc_inc = -( doff_t )panel_dim_max; \ + ldp = rs_p; \ + m_panel_full = &m; \ + n_panel_full = &panel_dim_i; \ + m_panel_use = &panel_len_i; \ + n_panel_use = &panel_dim_i; \ + m_panel_max = &panel_len_max_i; \ + n_panel_max = &panel_dim_max; \ + } \ + else /* if ( col_stored ) */ \ + { \ + /* Prepare to pack to column-stored row panels.
*/ \ + iter_dim = m; \ + panel_len_full = n; \ + panel_len_max = n_max; \ + panel_dim_max = pd_p; \ + ldc = cs_c; \ + vs_c = rs_c; \ + diagoffc_inc = ( doff_t )panel_dim_max; \ + ldp = cs_p; \ + m_panel_full = &panel_dim_i; \ + n_panel_full = &n; \ + m_panel_use = &panel_dim_i; \ + n_panel_use = &panel_len_i; \ + m_panel_max = &panel_dim_max; \ + n_panel_max = &panel_len_max_i; \ + } \ +\ + /* Compute the storage stride scaling. Usually this is just 1. However, + in the case of interleaved 3m, we need to scale by 3/2, and in the + cases of real-only, imag-only, or summed-only, we need to scale by + 1/2. In both cases, we are compensating for the fact that pointer + arithmetic occurs in terms of complex elements rather than real + elements. */ \ + if ( bli_is_3mi_packed( schema ) ) { ss_num = 3; ss_den = 2; } \ + else if ( bli_is_3ms_packed( schema ) ) { ss_num = 1; ss_den = 2; } \ + else if ( bli_is_rih_packed( schema ) ) { ss_num = 1; ss_den = 2; } \ + else { ss_num = 1; ss_den = 1; } \ +\ + /* Compute the total number of iterations we'll need. */ \ + n_iter = iter_dim / panel_dim_max + ( iter_dim % panel_dim_max ? 1 : 0 ); \ +\ + /* Set the initial values and increments for indices related to C and P + based on whether reverse iteration was requested. */ \ + if ( ( revifup && bli_is_upper( uploc ) && bli_is_triangular( strucc ) ) || \ + ( reviflo && bli_is_lower( uploc ) && bli_is_triangular( strucc ) ) ) \ + { \ + ic0 = (n_iter - 1) * panel_dim_max; \ + ic_inc = -panel_dim_max; \ + ip0 = n_iter - 1; \ + ip_inc = -1; \ + } \ + else \ + { \ + ic0 = 0; \ + ic_inc = panel_dim_max; \ + ip0 = 0; \ + ip_inc = 1; \ + } \ +\ + p_begin = p_cast; \ +\ +\ + /* Query the number of threads and thread ids from the current thread's + packm thrinfo_t node. 
*/ \ + const dim_t nt = bli_thread_n_way( thread ); \ + const dim_t tid = bli_thread_work_id( thread ); \ +\ + dim_t it_start, it_end, it_inc; \ +\ + /* Determine the thread range and increment using the current thread's + packm thrinfo_t node. */ \ + bli_thread_range_jrir_sl( thread, n_iter, 1, FALSE, &it_start, &it_end, &it_inc ); \ +\ + /* Iterate over every logical micropanel in the source matrix. */ \ + for ( ic = ic0, ip = ip0, it = 0; it < n_iter; \ + ic += ic_inc, ip += ip_inc, it += 1 ) \ + { \ + panel_dim_i = bli_min( panel_dim_max, iter_dim - ic ); \ +\ + diagoffc_i = diagoffc + (ip )*diagoffc_inc; \ + c_begin = c_cast + (ic )*vs_c; \ +\ + if ( bli_is_triangular( strucc ) && \ + bli_is_unstored_subpart_n( diagoffc_i, uploc, *m_panel_full, *n_panel_full ) ) \ + { \ + /* This case executes if the panel belongs to a triangular + matrix AND is completely unstored (ie: zero). If the panel + is unstored, we do nothing. (Notice that we don't even + increment p_begin.) */ \ +\ + continue; \ + } \ + else if ( bli_is_triangular( strucc ) && \ + bli_intersects_diag_n( diagoffc_i, *m_panel_full, *n_panel_full ) ) \ + { \ + /* This case executes if the panel belongs to a triangular + matrix AND is diagonal-intersecting. Notice that we + cannot bury the following conditional logic into + packm_struc_cxk() because we need to know the value of + panel_len_max_i so we can properly increment p_inc. */ \ +\ + /* Sanity check. Diagonals should not intersect the short end of + a micro-panel. If they do, then the constraint that cache + blocksizes be whole multiples of the register blocksizes + was somehow violated.
*/ \ + if ( ( col_stored && diagoffc_i < 0 ) || \ + ( row_stored && diagoffc_i > 0 ) ) \ + bli_check_error_code( BLIS_NOT_YET_IMPLEMENTED ); \ +\ + if ( ( row_stored && bli_is_upper( uploc ) ) || \ + ( col_stored && bli_is_lower( uploc ) ) ) \ + { \ + panel_off_i = 0; \ + panel_len_i = bli_abs( diagoffc_i ) + panel_dim_i; \ + panel_len_max_i = bli_min( bli_abs( diagoffc_i ) + panel_dim_max, \ + panel_len_max ); \ + diagoffp_i = diagoffc_i; \ + } \ + else /* if ( ( row_stored && bli_is_lower( uploc ) ) || \ + ( col_stored && bli_is_upper( uploc ) ) ) */ \ + { \ + panel_off_i = bli_abs( diagoffc_i ); \ + panel_len_i = panel_len_full - panel_off_i; \ + panel_len_max_i = panel_len_max - panel_off_i; \ + diagoffp_i = 0; \ + } \ +\ + c_use = c_begin + (panel_off_i )*ldc; \ + p_use = p_begin; \ +\ + /* We need to re-compute the imaginary stride as a function of + panel_len_max_i since triangular packed matrices have panels + of varying lengths. NOTE: This imaginary stride value is + only referenced by the packm kernels for induced methods. */ \ + is_p_use = ldp * panel_len_max_i; \ +\ + /* We nudge the imaginary stride up by one if it is odd. */ \ + is_p_use += ( bli_is_odd( is_p_use ) ? 1 : 0 ); \ +\ + if ( bli_packm_my_iter_rr( it, it_start, it_end, tid, nt ) ) \ + { \ + packm_ker_cast( strucc, \ + diagoffp_i, \ + diagc, \ + uploc, \ + conjc, \ + schema, \ + invdiag, \ + *m_panel_use, \ + *n_panel_use, \ + *m_panel_max, \ + *n_panel_max, \ + kappa_cast, \ + c_use, rs_c, cs_c, \ + p_use, rs_p, cs_p, \ + is_p_use, \ + cntx ); \ + } \ +\ + /* NOTE: This value is usually LESS than ps_p because triangular + matrices usually have several micro-panels that are shorter + than a "full" micro-panel. */ \ + p_inc = ( is_p_use * ss_num ) / ss_den; \ + } \ + else if ( bli_is_herm_or_symm( strucc ) ) \ + { \ + /* This case executes if the panel belongs to a Hermitian or + symmetric matrix, which includes stored, unstored, and + diagonal-intersecting panels. 
*/ \ +\ + c_use = c_begin; \ + p_use = p_begin; \ +\ + panel_len_i = panel_len_full; \ + panel_len_max_i = panel_len_max; \ +\ + is_p_use = is_p; \ +\ + if ( bli_packm_my_iter_sl( it, it_start, it_end, tid, nt ) ) \ + { \ + packm_ker_cast( strucc, \ + diagoffc_i, \ + diagc, \ + uploc, \ + conjc, \ + schema, \ + invdiag, \ + *m_panel_use, \ + *n_panel_use, \ + *m_panel_max, \ + *n_panel_max, \ + kappa_cast, \ + c_use, rs_c, cs_c, \ + p_use, rs_p, cs_p, \ + is_p_use, \ + cntx ); \ + } \ +\ + p_inc = ps_p; \ + } \ + else \ + { \ + /* This case executes if the panel is general, or, if the + panel is part of a triangular matrix and is neither unstored + (ie: zero) nor diagonal-intersecting. */ \ +\ + c_use = c_begin; \ + p_use = p_begin; \ +\ + panel_len_i = panel_len_full; \ + panel_len_max_i = panel_len_max; \ +\ + is_p_use = is_p; \ +\ + if ( bli_packm_my_iter_sl( it, it_start, it_end, tid, nt ) ) \ + { \ +/* +printf( "thread %d: packing micropanel iteration %3d\n", (int)tid, (int)it ); \ +*/ \ + packm_ker_cast( BLIS_GENERAL, \ + 0, \ + diagc, \ + BLIS_DENSE, \ + conjc, \ + schema, \ + invdiag, \ + *m_panel_use, \ + *n_panel_use, \ + *m_panel_max, \ + *n_panel_max, \ + kappa_cast, \ + c_use, rs_c, cs_c, \ + p_use, rs_p, cs_p, \ + is_p_use, \ + cntx ); \ + } \ +\ + /* NOTE: This value is equivalent to ps_p. 
*/ \ + p_inc = ps_p; \ + } \ +\ + p_begin += p_inc; \ +\ + } \ +/* +printf( "thread %d: done\n", (int)tid ); \ +*/ \ +} + +INSERT_GENTFUNCR_BASIC( packm, packm_blk_var1sl ) + + + +/* +if ( row_stored ) \ +PASTEMAC(ch,fprintm)( stdout, "packm_var2: b", m, n, \ + c_cast, rs_c, cs_c, "%4.1f", "" ); \ +if ( col_stored ) \ +PASTEMAC(ch,fprintm)( stdout, "packm_var2: a", m, n, \ + c_cast, rs_c, cs_c, "%4.1f", "" ); \ +*/ +/* +if ( col_stored ) { \ + if ( bli_thread_work_id( thread ) == 0 ) \ + { \ + printf( "packm_blk_var1: thread %lu (a = %p, ap = %p)\n", bli_thread_work_id( thread ), c_use, p_use ); \ + fflush( stdout ); \ + PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: a", *m_panel_use, *n_panel_use, \ + ( ctype* )c_use, rs_c, cs_c, "%4.1f", "" ); \ + PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: ap", *m_panel_max, *n_panel_max, \ + ( ctype* )p_use, rs_p, cs_p, "%4.1f", "" ); \ + fflush( stdout ); \ + } \ +bli_thread_obarrier( thread ); \ + if ( bli_thread_work_id( thread ) == 1 ) \ + { \ + printf( "packm_blk_var1: thread %lu (a = %p, ap = %p)\n", bli_thread_work_id( thread ), c_use, p_use ); \ + fflush( stdout ); \ + PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: a", *m_panel_use, *n_panel_use, \ + ( ctype* )c_use, rs_c, cs_c, "%4.1f", "" ); \ + PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: ap", *m_panel_max, *n_panel_max, \ + ( ctype* )p_use, rs_p, cs_p, "%4.1f", "" ); \ + fflush( stdout ); \ + } \ +bli_thread_obarrier( thread ); \ +} \ +else { \ + if ( bli_thread_work_id( thread ) == 0 ) \ + { \ + printf( "packm_blk_var1: thread %lu (b = %p, bp = %p)\n", bli_thread_work_id( thread ), c_use, p_use ); \ + fflush( stdout ); \ + PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: b", *m_panel_use, *n_panel_use, \ + ( ctype* )c_use, rs_c, cs_c, "%4.1f", "" ); \ + PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: bp", *m_panel_max, *n_panel_max, \ + ( ctype* )p_use, rs_p, cs_p, "%4.1f", "" ); \ + fflush( stdout ); \ + } \ +bli_thread_obarrier( thread ); \ + if ( 
bli_thread_work_id( thread ) == 1 ) \ + { \ + printf( "packm_blk_var1: thread %lu (b = %p, bp = %p)\n", bli_thread_work_id( thread ), c_use, p_use ); \ + fflush( stdout ); \ + PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: b", *m_panel_use, *n_panel_use, \ + ( ctype* )c_use, rs_c, cs_c, "%4.1f", "" ); \ + PASTEMAC(ch,fprintm)( stdout, "packm_blk_var1: bp", *m_panel_max, *n_panel_max, \ + ( ctype* )p_use, rs_p, cs_p, "%4.1f", "" ); \ + fflush( stdout ); \ + } \ +bli_thread_obarrier( thread ); \ +} \ +*/ +/* + if ( bli_is_4mi_packed( schema ) ) { \ + printf( "packm_var2: is_p_use = %lu\n", is_p_use ); \ + if ( col_stored ) { \ + if ( 0 ) \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: a_r", *m_panel_use, *n_panel_use, \ + ( ctype_r* )c_use, 2*rs_c, 2*cs_c, "%4.1f", "" ); \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: ap_r", *m_panel_max, *n_panel_max, \ + ( ctype_r* )p_use, rs_p, cs_p, "%4.1f", "" ); \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: ap_i", *m_panel_max, *n_panel_max, \ + ( ctype_r* )p_use + is_p_use, rs_p, cs_p, "%4.1f", "" ); \ + } \ + if ( row_stored ) { \ + if ( 0 ) \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: b_r", *m_panel_use, *n_panel_use, \ + ( ctype_r* )c_use, 2*rs_c, 2*cs_c, "%4.1f", "" ); \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: bp_r", *m_panel_max, *n_panel_max, \ + ( ctype_r* )p_use, rs_p, cs_p, "%4.1f", "" ); \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: bp_i", *m_panel_max, *n_panel_max, \ + ( ctype_r* )p_use + is_p_use, rs_p, cs_p, "%4.1f", "" ); \ + } \ + } \ +*/ +/* + PASTEMAC(chr,fprintm)( stdout, "packm_var2: bp_rpi", *m_panel_max, *n_panel_max, \ + ( ctype_r* )p_use, rs_p, cs_p, "%4.1f", "" ); \ +*/ +/* + if ( row_stored ) { \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: b_r", *m_panel_max, *n_panel_max, \ + ( ctype_r* )c_use, 2*rs_c, 2*cs_c, "%4.1f", "" ); \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: b_i", *m_panel_max, *n_panel_max, \ + (( ctype_r* )c_use)+rs_c, 2*rs_c, 2*cs_c, "%4.1f", "" ); \ + 
PASTEMAC(chr,fprintm)( stdout, "packm_var2: bp_r", *m_panel_max, *n_panel_max, \ + ( ctype_r* )p_use, rs_p, cs_p, "%4.1f", "" ); \ + inc_t is_b = rs_p * *m_panel_max; \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: bp_i", *m_panel_max, *n_panel_max, \ + ( ctype_r* )p_use + is_b, rs_p, cs_p, "%4.1f", "" ); \ + } \ +*/ +/* + if ( col_stored ) { \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: a_r", *m_panel_max, *n_panel_max, \ + ( ctype_r* )c_use, 2*rs_c, 2*cs_c, "%4.1f", "" ); \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: a_i", *m_panel_max, *n_panel_max, \ + (( ctype_r* )c_use)+rs_c, 2*rs_c, 2*cs_c, "%4.1f", "" ); \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: ap_r", *m_panel_max, *n_panel_max, \ + ( ctype_r* )p_use, rs_p, cs_p, "%4.1f", "" ); \ + PASTEMAC(chr,fprintm)( stdout, "packm_var2: ap_i", *m_panel_max, *n_panel_max, \ + ( ctype_r* )p_use + p_inc, rs_p, cs_p, "%4.1f", "" ); \ + } \ +*/ diff --git a/frame/1m/packm/bli_packm_thrinfo.h b/frame/1m/packm/bli_packm_thrinfo.h index 41d68d356..bb1a8e159 100644 --- a/frame/1m/packm/bli_packm_thrinfo.h +++ b/frame/1m/packm/bli_packm_thrinfo.h @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -36,7 +37,22 @@ // thrinfo_t macros specific to packm. // -#define packm_thread_my_iter( index, thread ) ( index % thread->n_way == thread->work_id % thread->n_way ) +/* +#define bli_packm_thread_my_iter( index, thread ) \ +\ + ( index % thread->n_way == thread->work_id % thread->n_way ) +*/ + +#define bli_packm_my_iter_rr( i, start, end, work_id, n_way ) \ +\ + ( i % n_way == work_id % n_way ) + +#define bli_packm_my_iter_sl( i, start, end, work_id, n_way ) \ +\ + ( start <= i && i < end ) + + + // // thrinfo_t APIs specific to packm. 
diff --git a/frame/1m/packm/bli_packm_blk_var1.h b/frame/1m/packm/bli_packm_var.h similarity index 69% rename from frame/1m/packm/bli_packm_blk_var1.h rename to frame/1m/packm/bli_packm_var.h index 396160da5..2da2e1e32 100644 --- a/frame/1m/packm/bli_packm_blk_var1.h +++ b/frame/1m/packm/bli_packm_var.h @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -32,15 +33,52 @@ */ -void bli_packm_blk_var1 - ( - obj_t* c, - obj_t* p, - cntx_t* cntx, - cntl_t* cntl, - thrinfo_t* t +// +// Prototype object-based interfaces. +// + +#undef GENPROT +#define GENPROT( opname ) \ +\ +void PASTEMAC0(opname) \ + ( \ + obj_t* c, \ + obj_t* p, \ + cntx_t* cntx, \ + cntl_t* cntl, \ + thrinfo_t* t \ ); +GENPROT( packm_unb_var1 ) +GENPROT( packm_blk_var1 ) +GENPROT( packm_blk_var1sl ) +GENPROT( packm_blk_var1rr ) + +// +// Prototype BLAS-like interfaces with void pointer operands. +// + +#undef GENTPROT +#define GENTPROT( ctype, ch, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + struc_t strucc, \ + doff_t diagoffc, \ + diag_t diagc, \ + uplo_t uploc, \ + trans_t transc, \ + dim_t m, \ + dim_t n, \ + dim_t m_max, \ + dim_t n_max, \ + void* kappa, \ + void* c, inc_t rs_c, inc_t cs_c, \ + void* p, inc_t rs_p, inc_t cs_p, \ + cntx_t* cntx \ + ); + +INSERT_GENTPROT_BASIC0( packm_unb_var1 ) #undef GENTPROT #define GENTPROT( ctype, ch, varname ) \ @@ -70,5 +108,6 @@ void PASTEMAC(ch,varname) \ thrinfo_t* thread \ ); -INSERT_GENTPROT_BASIC0( packm_blk_var1 ) +INSERT_GENTPROT_BASIC0( packm_blk_var1sl ) +INSERT_GENTPROT_BASIC0( packm_blk_var1rr ) diff --git a/frame/3/bli_l3_thrinfo.h b/frame/3/bli_l3_thrinfo.h index 228f22714..b66190dbc 100644 --- a/frame/3/bli_l3_thrinfo.h +++ b/frame/3/bli_l3_thrinfo.h @@ -5,6 +5,7 @@ libraries. 
Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -38,24 +39,28 @@ // gemm -#define bli_gemm_get_next_a_upanel( thread, a1, step ) ( a1 + step * thread->n_way ) -#define bli_gemm_get_next_b_upanel( thread, b1, step ) ( b1 + step * thread->n_way ) +#define bli_gemm_get_next_a_upanel( a1, step, inc ) ( a1 + step * inc ) +#define bli_gemm_get_next_b_upanel( b1, step, inc ) ( b1 + step * inc ) // herk -#define bli_herk_get_next_a_upanel( thread, a1, step ) ( a1 + step * thread->n_way ) -#define bli_herk_get_next_b_upanel( thread, b1, step ) ( b1 + step * thread->n_way ) +#define bli_herk_get_next_a_upanel( a1, step, inc ) ( a1 + step * inc ) +#define bli_herk_get_next_b_upanel( b1, step, inc ) ( b1 + step * inc ) // trmm -#define bli_trmm_r_ir_my_iter( index, thread ) ( index % thread->n_way == thread->work_id % thread->n_way ) -#define bli_trmm_r_jr_my_iter( index, thread ) ( index % thread->n_way == thread->work_id % thread->n_way ) -#define bli_trmm_l_ir_my_iter( index, thread ) ( index % thread->n_way == thread->work_id % thread->n_way ) -#define bli_trmm_l_jr_my_iter( index, thread ) ( index % thread->n_way == thread->work_id % thread->n_way ) +#define bli_trmm_get_next_a_upanel( a1, step, inc ) ( a1 + step * inc ) +#define bli_trmm_get_next_b_upanel( b1, step, inc ) ( b1 + step * inc ) + +#define bli_trmm_my_iter( index, thread ) \ +\ + ( index % thread->n_way == thread->work_id % thread->n_way ) // trsm -#define bli_trsm_my_iter( index, thread ) ( index % thread->n_way == thread->work_id % thread->n_way ) +#define bli_trsm_my_iter( index, thread ) \ +\ + ( index % thread->n_way == thread->work_id % thread->n_way ) // // thrinfo_t APIs specific to level-3 operations. 
diff --git a/frame/3/gemm/bli_gemm_blk_var1.c b/frame/3/gemm/bli_gemm_blk_var1.c index 0c62b69ac..73b8bed06 100644 --- a/frame/3/gemm/bli_gemm_blk_var1.c +++ b/frame/3/gemm/bli_gemm_blk_var1.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -60,7 +61,7 @@ void bli_gemm_blk_var1 bli_l3_prune_unref_mparts_m( a, b, c, cntl ); // Determine the current thread's subpartition range. - bli_thread_get_range_mdim + bli_thread_range_mdim ( direct, thread, a, b, c, cntl, cntx, &my_start, &my_end diff --git a/frame/3/gemm/bli_gemm_blk_var2.c b/frame/3/gemm/bli_gemm_blk_var2.c index 6a19e1bdb..3c25d7fa8 100644 --- a/frame/3/gemm/bli_gemm_blk_var2.c +++ b/frame/3/gemm/bli_gemm_blk_var2.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -60,7 +61,7 @@ void bli_gemm_blk_var2 bli_l3_prune_unref_mparts_n( a, b, c, cntl ); // Determine the current thread's subpartition range. - bli_thread_get_range_ndim + bli_thread_range_ndim ( direct, thread, a, b, c, cntl, cntx, &my_start, &my_end diff --git a/frame/3/gemm/bli_gemm_cntl.c b/frame/3/gemm/bli_gemm_cntl.c index 2332a6cf7..9263b4e51 100644 --- a/frame/3/gemm/bli_gemm_cntl.c +++ b/frame/3/gemm/bli_gemm_cntl.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. 
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -53,11 +54,34 @@ cntl_t* bli_gemmbp_cntl_create pack_t schema_b ) { - void* macro_kernel_p = bli_gemm_ker_var2; + void* macro_kernel_fp; + void* packa_fp; + void* packb_fp; - // Change the macro-kernel if the operation family is herk or trmm. - if ( family == BLIS_HERK ) macro_kernel_p = bli_herk_x_ker_var2; - else if ( family == BLIS_TRMM ) macro_kernel_p = bli_trmm_xx_ker_var2; +#ifdef BLIS_ENABLE_JRIR_SLAB + + // Use the function pointers to the macrokernels that use slab + // assignment of micropanels to threads in the jr and ir loops. + if ( family == BLIS_GEMM ) macro_kernel_fp = bli_gemm_ker_var2sl; + else if ( family == BLIS_HERK ) macro_kernel_fp = bli_herk_x_ker_var2sl; + else if ( family == BLIS_TRMM ) macro_kernel_fp = bli_trmm_xx_ker_var2sl; + else macro_kernel_fp = NULL; + + packa_fp = bli_packm_blk_var1sl; + packb_fp = bli_packm_blk_var1sl; + +#else // BLIS_ENABLE_JRIR_RR + + // Use the function pointers to the macrokernels that use round-robin + // assignment of micropanels to threads in the jr and ir loops. + if ( family == BLIS_GEMM ) macro_kernel_fp = bli_gemm_ker_var2rr; + else if ( family == BLIS_HERK ) macro_kernel_fp = bli_herk_x_ker_var2rr; + else if ( family == BLIS_TRMM ) macro_kernel_fp = bli_trmm_xx_ker_var2rr; + else macro_kernel_fp = NULL; + + packa_fp = bli_packm_blk_var1rr; + packb_fp = bli_packm_blk_var1rr; +#endif // Create two nodes for the macro-kernel. 
cntl_t* gemm_cntl_bu_ke = bli_gemm_cntl_create_node @@ -72,7 +96,7 @@ cntl_t* bli_gemmbp_cntl_create ( family, BLIS_NR, // not used by macro-kernel, but needed for bli_thrinfo_rgrow() - macro_kernel_p, + macro_kernel_fp, gemm_cntl_bu_ke ); @@ -80,7 +104,7 @@ cntl_t* bli_gemmbp_cntl_create cntl_t* gemm_cntl_packa = bli_packm_cntl_create_node ( bli_gemm_packa, // pack the left-hand operand - bli_packm_blk_var1, + packa_fp, BLIS_MR, BLIS_KR, FALSE, // do NOT invert diagonal @@ -104,7 +128,7 @@ cntl_t* bli_gemmbp_cntl_create cntl_t* gemm_cntl_packb = bli_packm_cntl_create_node ( bli_gemm_packb, // pack the right-hand operand - bli_packm_blk_var1, + packb_fp, BLIS_KR, BLIS_NR, FALSE, // do NOT invert diagonal diff --git a/frame/3/gemm/bli_gemm_int.c b/frame/3/gemm/bli_gemm_int.c index 81552893a..a8e06df45 100644 --- a/frame/3/gemm/bli_gemm_int.c +++ b/frame/3/gemm/bli_gemm_int.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -114,7 +115,9 @@ void bli_gemm_int if ( im != BLIS_NAT ) { - if ( im == BLIS_4M1B && f == bli_gemm_ker_var2 ) f = bli_gemm4mb_ker_var2; + if ( im == BLIS_4M1B ) + if ( f == bli_gemm_ker_var2sl || + f == bli_gemm_ker_var2rr ) f = bli_gemm4mb_ker_var2; } } diff --git a/frame/3/gemm/bli_gemm_ker_var1.c b/frame/3/gemm/bli_gemm_ker_var1.c index f7038584a..e60c78a5a 100644 --- a/frame/3/gemm/bli_gemm_ker_var1.c +++ b/frame/3/gemm/bli_gemm_ker_var1.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. 
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -32,6 +33,8 @@ */ +#if 0 + #include "blis.h" void bli_gemm_ker_var1 @@ -55,3 +58,5 @@ void bli_gemm_ker_var1 bli_gemm_ker_var2( b, a, c, cntx, rntm, cntl, thread ); } +#endif + diff --git a/frame/3/gemm/bli_gemm_ker_var2rr.c b/frame/3/gemm/bli_gemm_ker_var2rr.c new file mode 100644 index 000000000..3cb108eea --- /dev/null +++ b/frame/3/gemm/bli_gemm_ker_var2rr.c @@ -0,0 +1,380 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +*/ + +#include "blis.h" + +#define FUNCPTR_T gemm_fp + +typedef void (*FUNCPTR_T) + ( + pack_t schema_a, + pack_t schema_b, + dim_t m, + dim_t n, + dim_t k, + void* alpha, + void* a, inc_t cs_a, inc_t is_a, + dim_t pd_a, inc_t ps_a, + void* b, inc_t rs_b, inc_t is_b, + dim_t pd_b, inc_t ps_b, + void* beta, + void* c, inc_t rs_c, inc_t cs_c, + cntx_t* cntx, + rntm_t* rntm, + thrinfo_t* thread + ); + +static FUNCPTR_T GENARRAY(ftypes,gemm_ker_var2rr); + +// +// -- Macrokernel functions for round-robin partitioning ----------------------- +// + +void bli_gemm_ker_var2rr + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + num_t dt_exec = bli_obj_exec_dt( c ); + + pack_t schema_a = bli_obj_pack_schema( a ); + pack_t schema_b = bli_obj_pack_schema( b ); + + dim_t m = bli_obj_length( c ); + dim_t n = bli_obj_width( c ); + dim_t k = bli_obj_width( a ); + + void* buf_a = bli_obj_buffer_at_off( a ); + inc_t cs_a = bli_obj_col_stride( a ); + inc_t is_a = bli_obj_imag_stride( a ); + dim_t pd_a = bli_obj_panel_dim( a ); + inc_t ps_a = bli_obj_panel_stride( a ); + + void* buf_b = bli_obj_buffer_at_off( b ); + inc_t rs_b = bli_obj_row_stride( b ); + inc_t is_b = bli_obj_imag_stride( b ); + dim_t pd_b = bli_obj_panel_dim( b ); + inc_t ps_b = bli_obj_panel_stride( b ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c = bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + 
obj_t scalar_a; + obj_t scalar_b; + + void* buf_alpha; + void* buf_beta; + + FUNCPTR_T f; + + // Detach and multiply the scalars attached to A and B. + bli_obj_scalar_detach( a, &scalar_a ); + bli_obj_scalar_detach( b, &scalar_b ); + bli_mulsc( &scalar_a, &scalar_b ); + + // Grab the addresses of the internal scalar buffers for the scalar + // merged above and the scalar attached to C. + buf_alpha = bli_obj_internal_scalar_buffer( &scalar_b ); + buf_beta = bli_obj_internal_scalar_buffer( c ); + + // If 1m is being employed on a column- or row-stored matrix with a + // real-valued beta, we can use the real domain macro-kernel, which + // eliminates a little overhead associated with the 1m virtual + // micro-kernel. + if ( bli_is_1m_packed( schema_a ) ) + { + bli_l3_ind_recast_1m_params + ( + dt_exec, + schema_a, + c, + m, n, k, + pd_a, ps_a, + pd_b, ps_b, + rs_c, cs_c + ); + } + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_exec]; + + // Invoke the function. + f( schema_a, + schema_b, + m, + n, + k, + buf_alpha, + buf_a, cs_a, is_a, + pd_a, ps_a, + buf_b, rs_b, is_b, + pd_b, ps_b, + buf_beta, + buf_c, rs_c, cs_c, + cntx, + rntm, + thread ); +} + + +#undef GENTFUNC +#define GENTFUNC( ctype, ch, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + pack_t schema_a, \ + pack_t schema_b, \ + dim_t m, \ + dim_t n, \ + dim_t k, \ + void* alpha, \ + void* a, inc_t cs_a, inc_t is_a, \ + dim_t pd_a, inc_t ps_a, \ + void* b, inc_t rs_b, inc_t is_b, \ + dim_t pd_b, inc_t ps_b, \ + void* beta, \ + void* c, inc_t rs_c, inc_t cs_c, \ + cntx_t* cntx, \ + rntm_t* rntm, \ + thrinfo_t* thread \ + ) \ +{ \ + const num_t dt = PASTEMAC(ch,type); \ +\ + /* Alias some constants to simpler names. */ \ + const dim_t MR = pd_a; \ + const dim_t NR = pd_b; \ + /*const dim_t PACKMR = cs_a;*/ \ + /*const dim_t PACKNR = rs_b;*/ \ +\ + /* Query the context for the micro-kernel address and cast it to its + function pointer type. 
*/ \ + PASTECH(ch,gemm_ukr_ft) \ + gemm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx ); \ +\ + /* Temporary C buffer for edge cases. Note that the strides of this + temporary buffer are set so that they match the storage of the + original C matrix. For example, if C is column-stored, ct will be + column-stored as well. */ \ + ctype ct[ BLIS_STACK_BUF_MAX_SIZE \ + / sizeof( ctype ) ] \ + __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \ + const bool_t col_pref = bli_cntx_l3_vir_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \ + const inc_t rs_ct = ( col_pref ? 1 : NR ); \ + const inc_t cs_ct = ( col_pref ? MR : 1 ); \ +\ + ctype* restrict zero = PASTEMAC(ch,0); \ + ctype* restrict a_cast = a; \ + ctype* restrict b_cast = b; \ + ctype* restrict c_cast = c; \ + ctype* restrict alpha_cast = alpha; \ + ctype* restrict beta_cast = beta; \ + ctype* restrict b1; \ + ctype* restrict c1; \ +\ + dim_t m_iter, m_left; \ + dim_t n_iter, n_left; \ + dim_t i, j; \ + dim_t m_cur; \ + dim_t n_cur; \ + inc_t rstep_a; \ + inc_t cstep_b; \ + inc_t rstep_c, cstep_c; \ + auxinfo_t aux; \ +\ + /* + Assumptions/assertions: + rs_a == 1 + cs_a == PACKMR + pd_a == MR + ps_a == stride to next micro-panel of A + rs_b == PACKNR + cs_b == 1 + pd_b == NR + ps_b == stride to next micro-panel of B + rs_c == (no assumptions) + cs_c == (no assumptions) + */ \ +\ + /* If any dimension is zero, return immediately. */ \ + if ( bli_zero_dim3( m, n, k ) ) return; \ +\ + /* Clear the temporary C buffer in case it has any infs or NaNs. */ \ + PASTEMAC(ch,set0s_mxn)( MR, NR, \ + ct, rs_ct, cs_ct ); \ +\ + /* Compute number of primary and leftover components of the m and n + dimensions. */ \ + n_iter = n / NR; \ + n_left = n % NR; \ +\ + m_iter = m / MR; \ + m_left = m % MR; \ +\ + if ( n_left ) ++n_iter; \ + if ( m_left ) ++m_iter; \ +\ + /* Determine some increments used to step through A, B, and C. 
*/ \ + rstep_a = ps_a; \ +\ + cstep_b = ps_b; \ +\ + rstep_c = rs_c * MR; \ + cstep_c = cs_c * NR; \ +\ + /* Save the pack schemas of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_schema_a( schema_a, &aux ); \ + bli_auxinfo_set_schema_b( schema_b, &aux ); \ +\ + /* Save the imaginary stride of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_is_a( is_a, &aux ); \ + bli_auxinfo_set_is_b( is_b, &aux ); \ +\ + /* The 'thread' argument points to the thrinfo_t node for the 2nd (jr) + loop around the microkernel. Here we query the thrinfo_t node for the + 1st (ir) loop around the microkernel. */ \ + thrinfo_t* caucus = bli_thrinfo_sub_node( thread ); \ +\ + /* Query the number of threads and thread ids for each loop. */ \ + dim_t jr_nt = bli_thread_n_way( thread ); \ + dim_t jr_tid = bli_thread_work_id( thread ); \ + dim_t ir_nt = bli_thread_n_way( caucus ); \ + dim_t ir_tid = bli_thread_work_id( caucus ); \ +\ + dim_t jr_start, jr_end; \ + dim_t ir_start, ir_end; \ + dim_t jr_inc, ir_inc; \ +\ + /* Determine the thread range and increment for each thrinfo_t node. */ \ + bli_thread_range_jrir_rr( thread, n_iter, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \ + bli_thread_range_jrir_rr( caucus, m_iter, 1, FALSE, &ir_start, &ir_end, &ir_inc ); \ +\ + /* Loop over the n dimension (NR columns at a time). */ \ + for ( j = jr_start; j < jr_end; j += jr_inc ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b2; \ +\ + b1 = b_cast + j * cstep_b; \ + c1 = c_cast + j * cstep_c; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. */ \ + b2 = b1; \ +\ + /* Loop over the m dimension (MR rows at a time). */ \ + for ( i = ir_start; i < ir_end; i += ir_inc ) \ + { \ + ctype* restrict a2; \ +\ + a1 = a_cast + i * rstep_a; \ + c11 = c1 + i * rstep_c; \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? 
MR : m_left ); \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = bli_gemm_get_next_a_upanel( a1, rstep_a, ir_inc ); \ + if ( bli_is_last_iter_rr( i, ir_end, ir_tid, ir_nt ) ) \ + { \ + a2 = a_cast; \ + b2 = bli_gemm_get_next_b_upanel( b1, cstep_b, jr_inc ); \ + if ( bli_is_last_iter_rr( j, jr_end, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + beta_cast, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Scale the bottom edge of C and add the result from above. */ \ + PASTEMAC(ch,xpbys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + beta_cast, \ + c11, rs_c, cs_c ); \ + } \ + } \ + } \ +\ +/* +PASTEMAC(ch,fprintm)( stdout, "gemm_ker_var2rr: b1", k, NR, b1, NR, 1, "%4.1f", "" ); \ +PASTEMAC(ch,fprintm)( stdout, "gemm_ker_var2rr: a1", MR, k, a1, 1, MR, "%4.1f", "" ); \ +PASTEMAC(ch,fprintm)( stdout, "gemm_ker_var2rr: c after", m_cur, n_cur, c11, rs_c, cs_c, "%4.1f", "" ); \ +*/ \ +} + +INSERT_GENTFUNC_BASIC0( gemm_ker_var2rr ) + diff --git a/frame/3/gemm/bli_gemm_ker_var2sl.c b/frame/3/gemm/bli_gemm_ker_var2sl.c new file mode 100644 index 000000000..3e9e28835 --- /dev/null +++ b/frame/3/gemm/bli_gemm_ker_var2sl.c @@ -0,0 +1,380 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. 
+ + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +*/ + +#include "blis.h" + +#define FUNCPTR_T gemm_fp + +typedef void (*FUNCPTR_T) + ( + pack_t schema_a, + pack_t schema_b, + dim_t m, + dim_t n, + dim_t k, + void* alpha, + void* a, inc_t cs_a, inc_t is_a, + dim_t pd_a, inc_t ps_a, + void* b, inc_t rs_b, inc_t is_b, + dim_t pd_b, inc_t ps_b, + void* beta, + void* c, inc_t rs_c, inc_t cs_c, + cntx_t* cntx, + rntm_t* rntm, + thrinfo_t* thread + ); + +static FUNCPTR_T GENARRAY(ftypes,gemm_ker_var2sl); + +// +// -- Macrokernel functions for slab partitioning ------------------------------ +// + +void bli_gemm_ker_var2sl + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + num_t dt_exec = bli_obj_exec_dt( c ); + + pack_t schema_a = bli_obj_pack_schema( a ); + pack_t schema_b = bli_obj_pack_schema( b ); + + dim_t m = bli_obj_length( c ); + dim_t n = bli_obj_width( c ); + dim_t k = bli_obj_width( a ); + + void* buf_a = bli_obj_buffer_at_off( a ); + inc_t cs_a = bli_obj_col_stride( a ); + inc_t is_a = bli_obj_imag_stride( a ); + dim_t pd_a = bli_obj_panel_dim( a ); + inc_t ps_a = bli_obj_panel_stride( a ); + + void* buf_b = bli_obj_buffer_at_off( b ); + inc_t rs_b = bli_obj_row_stride( b ); + inc_t is_b = bli_obj_imag_stride( b ); + dim_t pd_b = bli_obj_panel_dim( b ); + inc_t ps_b = bli_obj_panel_stride( b ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c = bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + obj_t scalar_a; + obj_t scalar_b; + + void* buf_alpha; + void* buf_beta; + + FUNCPTR_T f; + + // Detach and multiply the scalars attached to A and B. + bli_obj_scalar_detach( a, &scalar_a ); + bli_obj_scalar_detach( b, &scalar_b ); + bli_mulsc( &scalar_a, &scalar_b ); + + // Grab the addresses of the internal scalar buffers for the scalar + // merged above and the scalar attached to C. 
+ buf_alpha = bli_obj_internal_scalar_buffer( &scalar_b ); + buf_beta = bli_obj_internal_scalar_buffer( c ); + + // If 1m is being employed on a column- or row-stored matrix with a + // real-valued beta, we can use the real domain macro-kernel, which + // eliminates a little overhead associated with the 1m virtual + // micro-kernel. + if ( bli_is_1m_packed( schema_a ) ) + { + bli_l3_ind_recast_1m_params + ( + dt_exec, + schema_a, + c, + m, n, k, + pd_a, ps_a, + pd_b, ps_b, + rs_c, cs_c + ); + } + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_exec]; + + // Invoke the function. + f( schema_a, + schema_b, + m, + n, + k, + buf_alpha, + buf_a, cs_a, is_a, + pd_a, ps_a, + buf_b, rs_b, is_b, + pd_b, ps_b, + buf_beta, + buf_c, rs_c, cs_c, + cntx, + rntm, + thread ); +} + + +#undef GENTFUNC +#define GENTFUNC( ctype, ch, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + pack_t schema_a, \ + pack_t schema_b, \ + dim_t m, \ + dim_t n, \ + dim_t k, \ + void* alpha, \ + void* a, inc_t cs_a, inc_t is_a, \ + dim_t pd_a, inc_t ps_a, \ + void* b, inc_t rs_b, inc_t is_b, \ + dim_t pd_b, inc_t ps_b, \ + void* beta, \ + void* c, inc_t rs_c, inc_t cs_c, \ + cntx_t* cntx, \ + rntm_t* rntm, \ + thrinfo_t* thread \ + ) \ +{ \ + const num_t dt = PASTEMAC(ch,type); \ +\ + /* Alias some constants to simpler names. */ \ + const dim_t MR = pd_a; \ + const dim_t NR = pd_b; \ + /*const dim_t PACKMR = cs_a;*/ \ + /*const dim_t PACKNR = rs_b;*/ \ +\ + /* Query the context for the micro-kernel address and cast it to its + function pointer type. */ \ + PASTECH(ch,gemm_ukr_ft) \ + gemm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx ); \ +\ + /* Temporary C buffer for edge cases. Note that the strides of this + temporary buffer are set so that they match the storage of the + original C matrix. For example, if C is column-stored, ct will be + column-stored as well. 
*/ \ + ctype ct[ BLIS_STACK_BUF_MAX_SIZE \ + / sizeof( ctype ) ] \ + __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \ + const bool_t col_pref = bli_cntx_l3_vir_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \ + const inc_t rs_ct = ( col_pref ? 1 : NR ); \ + const inc_t cs_ct = ( col_pref ? MR : 1 ); \ +\ + ctype* restrict zero = PASTEMAC(ch,0); \ + ctype* restrict a_cast = a; \ + ctype* restrict b_cast = b; \ + ctype* restrict c_cast = c; \ + ctype* restrict alpha_cast = alpha; \ + ctype* restrict beta_cast = beta; \ + ctype* restrict b1; \ + ctype* restrict c1; \ +\ + dim_t m_iter, m_left; \ + dim_t n_iter, n_left; \ + dim_t i, j; \ + dim_t m_cur; \ + dim_t n_cur; \ + inc_t rstep_a; \ + inc_t cstep_b; \ + inc_t rstep_c, cstep_c; \ + auxinfo_t aux; \ +\ + /* + Assumptions/assertions: + rs_a == 1 + cs_a == PACKMR + pd_a == MR + ps_a == stride to next micro-panel of A + rs_b == PACKNR + cs_b == 1 + pd_b == NR + ps_b == stride to next micro-panel of B + rs_c == (no assumptions) + cs_c == (no assumptions) + */ \ +\ + /* If any dimension is zero, return immediately. */ \ + if ( bli_zero_dim3( m, n, k ) ) return; \ +\ + /* Clear the temporary C buffer in case it has any infs or NaNs. */ \ + PASTEMAC(ch,set0s_mxn)( MR, NR, \ + ct, rs_ct, cs_ct ); \ +\ + /* Compute number of primary and leftover components of the m and n + dimensions. */ \ + n_iter = n / NR; \ + n_left = n % NR; \ +\ + m_iter = m / MR; \ + m_left = m % MR; \ +\ + if ( n_left ) ++n_iter; \ + if ( m_left ) ++m_iter; \ +\ + /* Determine some increments used to step through A, B, and C. */ \ + rstep_a = ps_a; \ +\ + cstep_b = ps_b; \ +\ + rstep_c = rs_c * MR; \ + cstep_c = cs_c * NR; \ +\ + /* Save the pack schemas of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_schema_a( schema_a, &aux ); \ + bli_auxinfo_set_schema_b( schema_b, &aux ); \ +\ + /* Save the imaginary stride of A and B to the auxinfo_t object. 
*/ \ + bli_auxinfo_set_is_a( is_a, &aux ); \ + bli_auxinfo_set_is_b( is_b, &aux ); \ +\ + /* The 'thread' argument points to the thrinfo_t node for the 2nd (jr) + loop around the microkernel. Here we query the thrinfo_t node for the + 1st (ir) loop around the microkernel. */ \ + thrinfo_t* caucus = bli_thrinfo_sub_node( thread ); \ +\ + /* Query the number of threads and thread ids for each loop. */ \ + dim_t jr_nt = bli_thread_n_way( thread ); \ + dim_t jr_tid = bli_thread_work_id( thread ); \ + dim_t ir_nt = bli_thread_n_way( caucus ); \ + dim_t ir_tid = bli_thread_work_id( caucus ); \ +\ + dim_t jr_start, jr_end; \ + dim_t ir_start, ir_end; \ + dim_t jr_inc, ir_inc; \ +\ + /* Determine the thread range and increment for each thrinfo_t node. */ \ + bli_thread_range_jrir_sl( thread, n_iter, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \ + bli_thread_range_jrir_sl( caucus, m_iter, 1, FALSE, &ir_start, &ir_end, &ir_inc ); \ +\ + /* Loop over the n dimension (NR columns at a time). */ \ + for ( j = jr_start; j < jr_end; j += jr_inc ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b2; \ +\ + b1 = b_cast + j * cstep_b; \ + c1 = c_cast + j * cstep_c; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. */ \ + b2 = b1; \ +\ + /* Loop over the m dimension (MR rows at a time). */ \ + for ( i = ir_start; i < ir_end; i += ir_inc ) \ + { \ + ctype* restrict a2; \ +\ + a1 = a_cast + i * rstep_a; \ + c11 = c1 + i * rstep_c; \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + /* Compute the addresses of the next panels of A and B. 
*/ \ + a2 = bli_gemm_get_next_a_upanel( a1, rstep_a, ir_inc ); \ + if ( bli_is_last_iter_sl( i, ir_end, ir_tid, ir_nt ) ) \ + { \ + a2 = a_cast; \ + b2 = bli_gemm_get_next_b_upanel( b1, cstep_b, jr_inc ); \ + if ( bli_is_last_iter_sl( j, jr_end, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + beta_cast, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Scale the bottom edge of C and add the result from above. */ \ + PASTEMAC(ch,xpbys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + beta_cast, \ + c11, rs_c, cs_c ); \ + } \ + } \ + } \ +\ +/* +PASTEMAC(ch,fprintm)( stdout, "gemm_ker_var2sl: b1", k, NR, b1, NR, 1, "%4.1f", "" ); \ +PASTEMAC(ch,fprintm)( stdout, "gemm_ker_var2sl: a1", MR, k, a1, 1, MR, "%4.1f", "" ); \ +PASTEMAC(ch,fprintm)( stdout, "gemm_ker_var2sl: c after", m_cur, n_cur, c11, rs_c, cs_c, "%4.1f", "" ); \ +*/ \ +} + +INSERT_GENTFUNC_BASIC0( gemm_ker_var2sl ) + diff --git a/frame/3/gemm/bli_gemm_var.h b/frame/3/gemm/bli_gemm_var.h index 9baee6187..15f39e77a 100644 --- a/frame/3/gemm/bli_gemm_var.h +++ b/frame/3/gemm/bli_gemm_var.h @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. 
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -58,7 +59,9 @@ GENPROT( gemm_packa ) GENPROT( gemm_packb ) GENPROT( gemm_ker_var1 ) -GENPROT( gemm_ker_var2 ) + +GENPROT( gemm_ker_var2sl ) +GENPROT( gemm_ker_var2rr ) // Headers for induced algorithms: GENPROT( gemm4mb_ker_var2 ) // 4m1b @@ -90,7 +93,8 @@ void PASTEMAC(ch,varname) \ thrinfo_t* thread \ ); -INSERT_GENTPROT_BASIC0( gemm_ker_var2 ) +INSERT_GENTPROT_BASIC0( gemm_ker_var2sl ) +INSERT_GENTPROT_BASIC0( gemm_ker_var2rr ) // Headers for induced algorithms: INSERT_GENTPROT_BASIC0( gemm4mb_ker_var2 ) // 4m1b diff --git a/frame/3/gemm/ind/bli_gemm4mb_ker_var2.c b/frame/3/gemm/ind/bli_gemm4mb_ker_var2.c index 878889d2a..08992145a 100644 --- a/frame/3/gemm/ind/bli_gemm4mb_ker_var2.c +++ b/frame/3/gemm/ind/bli_gemm4mb_ker_var2.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -251,6 +252,9 @@ void PASTEMAC(ch,varname) \ dim_t jr_thread_id = bli_thread_work_id( thread ); \ dim_t ir_num_threads = bli_thread_n_way( caucus ); \ dim_t ir_thread_id = bli_thread_work_id( caucus ); \ +\ + dim_t jr_inc = jr_num_threads; \ + dim_t ir_inc = ir_num_threads; \ \ /* Loop over the n dimension (NR columns at a time). */ \ for ( j = jr_thread_id; j < n_iter; j += jr_num_threads ) \ @@ -295,12 +299,12 @@ void PASTEMAC(ch,varname) \ m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ \ /* Compute the addresses of the next panels of A and B. 
*/ \ - a2 = bli_gemm_get_next_a_upanel( caucus, a1, rstep_a ); \ - if ( bli_is_last_iter( i, m_iter, ir_thread_id, ir_num_threads ) ) \ + a2 = bli_gemm_get_next_a_upanel( a1, rstep_a, ir_inc ); \ + if ( bli_is_last_iter_rr( i, m_iter, ir_thread_id, ir_num_threads ) ) \ { \ a2 = a_cast; \ - b2 = bli_gemm_get_next_b_upanel( thread, b1, cstep_b ); \ - if ( bli_is_last_iter( j, n_iter, jr_thread_id, jr_num_threads ) ) \ + b2 = bli_gemm_get_next_b_upanel( b1, cstep_b, jr_inc ); \ + if ( bli_is_last_iter_rr( j, n_iter, jr_thread_id, jr_num_threads ) ) \ b2 = b_cast; \ } \ \ diff --git a/frame/3/gemm/bli_gemm_ker_var2.c b/frame/3/gemm/other/bli_gemm_ker_var2.c similarity index 99% rename from frame/3/gemm/bli_gemm_ker_var2.c rename to frame/3/gemm/other/bli_gemm_ker_var2.c index 1967c6ce4..b48f46bc0 100644 --- a/frame/3/gemm/bli_gemm_ker_var2.c +++ b/frame/3/gemm/other/bli_gemm_ker_var2.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are diff --git a/frame/3/herk/bli_herk_l_ker_var2rr.c b/frame/3/herk/bli_herk_l_ker_var2rr.c new file mode 100644 index 000000000..7393f8e1b --- /dev/null +++ b/frame/3/herk/bli_herk_l_ker_var2rr.c @@ -0,0 +1,555 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. 
+ - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +*/ + +#include "blis.h" + +#define FUNCPTR_T herk_fp + +typedef void (*FUNCPTR_T) + ( + doff_t diagoffc, + pack_t schema_a, + pack_t schema_b, + dim_t m, + dim_t n, + dim_t k, + void* alpha, + void* a, inc_t cs_a, inc_t is_a, + dim_t pd_a, inc_t ps_a, + void* b, inc_t rs_b, inc_t is_b, + dim_t pd_b, inc_t ps_b, + void* beta, + void* c, inc_t rs_c, inc_t cs_c, + cntx_t* cntx, + rntm_t* rntm, + thrinfo_t* thread + ); + +static FUNCPTR_T GENARRAY(ftypes,herk_l_ker_var2rr); + +// +// -- Macrokernel functions for round-robin partitioning ----------------------- +// + +void bli_herk_l_ker_var2rr + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + num_t dt_exec = bli_obj_exec_dt( c ); + + doff_t diagoffc = bli_obj_diag_offset( c ); + + pack_t schema_a = bli_obj_pack_schema( a ); + pack_t schema_b = bli_obj_pack_schema( b ); + + dim_t m = bli_obj_length( c ); + dim_t n = bli_obj_width( c ); + dim_t k = bli_obj_width( a ); + + void* buf_a = bli_obj_buffer_at_off( a ); + inc_t cs_a = bli_obj_col_stride( a ); + inc_t is_a = bli_obj_imag_stride( a ); + dim_t pd_a = bli_obj_panel_dim( a ); + inc_t ps_a = bli_obj_panel_stride( a ); + + void* buf_b = bli_obj_buffer_at_off( b ); + inc_t rs_b = bli_obj_row_stride( b ); + inc_t is_b = bli_obj_imag_stride( b ); + dim_t pd_b = bli_obj_panel_dim( b ); + inc_t ps_b = bli_obj_panel_stride( b ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c = bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + obj_t scalar_a; + obj_t scalar_b; + + void* buf_alpha; + void* buf_beta; + + FUNCPTR_T f; + + // Detach and multiply the scalars attached to A and B. + bli_obj_scalar_detach( a, &scalar_a ); + bli_obj_scalar_detach( b, &scalar_b ); + bli_mulsc( &scalar_a, &scalar_b ); + + // Grab the addresses of the internal scalar buffers for the scalar + // merged above and the scalar attached to C. 
+ buf_alpha = bli_obj_internal_scalar_buffer( &scalar_b ); + buf_beta = bli_obj_internal_scalar_buffer( c ); + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_exec]; + + // Invoke the function. + f( diagoffc, + schema_a, + schema_b, + m, + n, + k, + buf_alpha, + buf_a, cs_a, is_a, + pd_a, ps_a, + buf_b, rs_b, is_b, + pd_b, ps_b, + buf_beta, + buf_c, rs_c, cs_c, + cntx, + rntm, + thread ); +} + + +#undef GENTFUNC +#define GENTFUNC( ctype, ch, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + doff_t diagoffc, \ + pack_t schema_a, \ + pack_t schema_b, \ + dim_t m, \ + dim_t n, \ + dim_t k, \ + void* alpha, \ + void* a, inc_t cs_a, inc_t is_a, \ + dim_t pd_a, inc_t ps_a, \ + void* b, inc_t rs_b, inc_t is_b, \ + dim_t pd_b, inc_t ps_b, \ + void* beta, \ + void* c, inc_t rs_c, inc_t cs_c, \ + cntx_t* cntx, \ + rntm_t* rntm, \ + thrinfo_t* thread \ + ) \ +{ \ + const num_t dt = PASTEMAC(ch,type); \ +\ + /* Alias some constants to simpler names. */ \ + const dim_t MR = pd_a; \ + const dim_t NR = pd_b; \ + /*const dim_t PACKMR = cs_a;*/ \ + /*const dim_t PACKNR = rs_b;*/ \ +\ + /* Query the context for the micro-kernel address and cast it to its + function pointer type. */ \ + PASTECH(ch,gemm_ukr_ft) \ + gemm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx ); \ +\ + /* Temporary C buffer for edge cases. Note that the strides of this + temporary buffer are set so that they match the storage of the + original C matrix. For example, if C is column-stored, ct will be + column-stored as well. */ \ + ctype ct[ BLIS_STACK_BUF_MAX_SIZE \ + / sizeof( ctype ) ] \ + __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \ + const bool_t col_pref = bli_cntx_l3_vir_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \ + const inc_t rs_ct = ( col_pref ? 1 : NR ); \ + const inc_t cs_ct = ( col_pref ? 
MR : 1 ); \
+\
+	ctype* restrict zero = PASTEMAC(ch,0); \
+	ctype* restrict a_cast = a; \
+	ctype* restrict b_cast = b; \
+	ctype* restrict c_cast = c; \
+	ctype* restrict alpha_cast = alpha; \
+	ctype* restrict beta_cast = beta; \
+	ctype* restrict b1; \
+	ctype* restrict c1; \
+\
+	doff_t diagoffc_ij; \
+	dim_t m_iter, m_left; \
+	dim_t n_iter, n_left; \
+	dim_t m_cur; \
+	dim_t n_cur; \
+	dim_t i, j, ip; \
+	inc_t rstep_a; \
+	inc_t cstep_b; \
+	inc_t rstep_c, cstep_c; \
+	auxinfo_t aux; \
+\
+	/*
+	   Assumptions/assertions:
+	     rs_a == 1
+	     cs_a == PACKMR
+	     pd_a == MR
+	     ps_a == stride to next micro-panel of A
+	     rs_b == PACKNR
+	     cs_b == 1
+	     pd_b == NR
+	     ps_b == stride to next micro-panel of B
+	     rs_c == (no assumptions)
+	     cs_c == (no assumptions)
+	*/ \
+\
+	/* If any dimension is zero, return immediately. */ \
+	if ( bli_zero_dim3( m, n, k ) ) return; \
+\
+	/* Safeguard: If the current panel of C is entirely above the diagonal,
+	   it is not stored. So we do nothing. */ \
+	if ( bli_is_strictly_above_diag_n( diagoffc, m, n ) ) return; \
+\
+	/* If there is a zero region above where the diagonal of C intersects
+	   the left edge of the panel, adjust the pointers to C and A and treat
+	   this case as if the diagonal offset were zero. */ \
+	if ( diagoffc < 0 ) \
+	{ \
+		ip       = -diagoffc / MR; \
+		i        = ip * MR; \
+		m        = m - i; \
+		diagoffc = -diagoffc % MR; \
+		c_cast   = c_cast + (i  )*rs_c; \
+		a_cast   = a_cast + (ip )*ps_a; \
+	} \
+\
+	/* If there is a zero region to the right of where the diagonal
+	   of C intersects the bottom of the panel, shrink it to prevent
+	   "no-op" iterations from executing. */ \
+	if ( diagoffc + m < n ) \
+	{ \
+		n = diagoffc + m; \
+	} \
+\
+	/* Clear the temporary C buffer in case it has any infs or NaNs. */ \
+	PASTEMAC(ch,set0s_mxn)( MR, NR, \
+	                        ct, rs_ct, cs_ct ); \
+\
+	/* Compute number of primary and leftover components of the m and n
+	   dimensions. 
*/ \ + n_iter = n / NR; \ + n_left = n % NR; \ +\ + m_iter = m / MR; \ + m_left = m % MR; \ +\ + if ( n_left ) ++n_iter; \ + if ( m_left ) ++m_iter; \ +\ + /* Determine some increments used to step through A, B, and C. */ \ + rstep_a = ps_a; \ +\ + cstep_b = ps_b; \ +\ + rstep_c = rs_c * MR; \ + cstep_c = cs_c * NR; \ +\ + /* Save the pack schemas of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_schema_a( schema_a, &aux ); \ + bli_auxinfo_set_schema_b( schema_b, &aux ); \ +\ + /* Save the imaginary stride of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_is_a( is_a, &aux ); \ + bli_auxinfo_set_is_b( is_b, &aux ); \ +\ + /* The 'thread' argument points to the thrinfo_t node for the 2nd (jr) + loop around the microkernel. Here we query the thrinfo_t node for the + 1st (ir) loop around the microkernel. */ \ + thrinfo_t* caucus = bli_thrinfo_sub_node( thread ); \ +\ + /* Query the number of threads and thread ids for each loop. */ \ + dim_t jr_nt = bli_thread_n_way( thread ); \ + dim_t jr_tid = bli_thread_work_id( thread ); \ + dim_t ir_nt = bli_thread_n_way( caucus ); \ + dim_t ir_tid = bli_thread_work_id( caucus ); \ +\ + dim_t jr_start, jr_end; \ + dim_t ir_start, ir_end; \ + dim_t jr_inc, ir_inc; \ +\ + /* Note that we partition the 2nd loop into two regions: the rectangular + part of C, and the triangular portion. */ \ + dim_t n_iter_rct; \ + dim_t n_iter_tri; \ +\ + if ( bli_is_strictly_below_diag_n( diagoffc, m, n ) ) \ + { \ + /* If the entire panel of C does not intersect the diagonal, there is + no triangular region, and therefore we can skip the second set of + loops. */ \ + n_iter_rct = n_iter; \ + n_iter_tri = 0; \ + } \ + else \ + { \ + /* If the panel of C does intersect the diagonal, compute the number of + iterations in the rectangular region by dividing NR into the diagonal + offset. Any remainder from this integer division is discarded, which + is what we want. 
That is, we want the rectangular region to contain + as many columns of whole microtiles as possible without including any + microtiles that intersect the diagonal. The number of iterations in + the triangular (or trapezoidal) region is computed as the remaining + number of iterations in the n dimension. */ \ + n_iter_rct = diagoffc / NR; \ + n_iter_tri = n_iter - n_iter_rct; \ + } \ +\ + /* Use round-robin assignment of micropanels to threads in the 2nd and 1st + loops for the initial rectangular region of C (if it exists). */ \ + bli_thread_range_jrir_rr( thread, n_iter_rct, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \ + bli_thread_range_jrir_rr( caucus, m_iter, 1, FALSE, &ir_start, &ir_end, &ir_inc ); \ +\ + /* Loop over the n dimension (NR columns at a time). */ \ + for ( j = jr_start; j < jr_end; j += jr_inc ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b2; \ +\ + b1 = b_cast + j * cstep_b; \ + c1 = c_cast + j * cstep_c; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. */ \ + b2 = b1; \ +\ + /* Interior loop over the m dimension (MR rows at a time). */ \ + for ( i = ir_start; i < ir_end; i += ir_inc ) \ + { \ + ctype* restrict a2; \ +\ + a1 = a_cast + i * rstep_a; \ + c11 = c1 + i * rstep_c; \ +\ + /* No need to compute the diagonal offset for the rectangular + region. */ \ + /*diagoffc_ij = diagoffc - (doff_t)j*NR + (doff_t)i*MR;*/ \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = bli_herk_get_next_a_upanel( a1, rstep_a, ir_inc ); \ + if ( bli_is_last_iter_rr( i, m_iter, ir_tid, ir_nt ) ) \ + { \ + a2 = a_cast; \ + b2 = bli_herk_get_next_b_upanel( b1, cstep_b, jr_inc ); \ + if ( bli_is_last_iter_rr( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. 
*/ \
+			bli_auxinfo_set_next_a( a2, &aux ); \
+			bli_auxinfo_set_next_b( b2, &aux ); \
+\
+			/* If the diagonal intersects the current MR x NR submatrix, we
+			   compute it in the temporary buffer and then add in the elements
+			   on or below the diagonal.
+			   Otherwise, if the submatrix is strictly below the diagonal,
+			   we compute and store as we normally would.
+			   And if we're strictly above the diagonal, we do nothing and
+			   continue. */ \
+			{ \
+				/* Handle interior and edge cases separately. */ \
+				if ( m_cur == MR && n_cur == NR ) \
+				{ \
+					/* Invoke the gemm micro-kernel. */ \
+					gemm_ukr \
+					( \
+					  k, \
+					  alpha_cast, \
+					  a1, \
+					  b1, \
+					  beta_cast, \
+					  c11, rs_c, cs_c, \
+					  &aux, \
+					  cntx \
+					); \
+				} \
+				else \
+				{ \
+					/* Invoke the gemm micro-kernel. */ \
+					gemm_ukr \
+					( \
+					  k, \
+					  alpha_cast, \
+					  a1, \
+					  b1, \
+					  zero, \
+					  ct, rs_ct, cs_ct, \
+					  &aux, \
+					  cntx \
+					); \
+\
+					/* Scale the edge of C and add the result. */ \
+					PASTEMAC(ch,xpbys_mxn)( m_cur, n_cur, \
+					                        ct,  rs_ct, cs_ct, \
+					                        beta_cast, \
+					                        c11, rs_c,  cs_c ); \
+				} \
+			} \
+		} \
+	} \
+\
+	/* If there is no triangular region, then we're done. */ \
+	if ( n_iter_tri == 0 ) return; \
+\
+	/* Use round-robin assignment of micropanels to threads in the 2nd and
+	   1st loops for the remaining triangular region of C. */ \
+	bli_thread_range_jrir_rr( thread, n_iter_tri, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \
+\
+	/* Advance the start and end iteration offsets for the triangular region
+	   by the number of iterations used for the rectangular region. */ \
+	jr_start += n_iter_rct; \
+	jr_end   += n_iter_rct; \
+\
+	/* Loop over the n dimension (NR columns at a time). */ \
+	for ( j = jr_start; j < jr_end; j += jr_inc ) \
+	{ \
+		ctype* restrict a1; \
+		ctype* restrict c11; \
+		ctype* restrict b2; \
+\
+		b1 = b_cast + j * cstep_b; \
+		c1 = c_cast + j * cstep_c; \
+\
+		n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \
+\
+		/* Initialize our next panel of B to be the current panel of B. 
*/ \
+		b2 = b1; \
+\
+		/* Interior loop over the m dimension (MR rows at a time). */ \
+		for ( i = ir_start; i < ir_end; i += ir_inc ) \
+		{ \
+			ctype* restrict a2; \
+\
+			a1 = a_cast + i * rstep_a; \
+			c11 = c1 + i * rstep_c; \
+\
+			/* Compute the diagonal offset for the submatrix at (i,j). */ \
+			diagoffc_ij = diagoffc - (doff_t)j*NR + (doff_t)i*MR; \
+\
+			m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \
+\
+			/* Compute the addresses of the next panels of A and B. */ \
+			a2 = bli_herk_get_next_a_upanel( a1, rstep_a, ir_inc ); \
+			if ( bli_is_last_iter_rr( i, m_iter, ir_tid, ir_nt ) ) \
+			{ \
+				a2 = a_cast; \
+				b2 = bli_herk_get_next_b_upanel( b1, cstep_b, jr_inc ); \
+				if ( bli_is_last_iter_rr( j, n_iter, jr_tid, jr_nt ) ) \
+					b2 = b_cast; \
+			} \
+\
+			/* Save addresses of next panels of A and B to the auxinfo_t
+			   object. */ \
+			bli_auxinfo_set_next_a( a2, &aux ); \
+			bli_auxinfo_set_next_b( b2, &aux ); \
+\
+			/* If the diagonal intersects the current MR x NR submatrix, we
+			   compute it in the temporary buffer and then add in the elements
+			   on or below the diagonal.
+			   Otherwise, if the submatrix is strictly below the diagonal,
+			   we compute and store as we normally would.
+			   And if we're strictly above the diagonal, we do nothing and
+			   continue. */ \
+			if ( bli_intersects_diag_n( diagoffc_ij, m_cur, n_cur ) ) \
+			{ \
+				/* Invoke the gemm micro-kernel. */ \
+				gemm_ukr \
+				( \
+				  k, \
+				  alpha_cast, \
+				  a1, \
+				  b1, \
+				  zero, \
+				  ct, rs_ct, cs_ct, \
+				  &aux, \
+				  cntx \
+				); \
+\
+				/* Scale C and add the result to only the stored part. */ \
+				PASTEMAC(ch,xpbys_mxn_l)( diagoffc_ij, \
+				                          m_cur, n_cur, \
+				                          ct, rs_ct, cs_ct, \
+				                          beta_cast, \
+				                          c11, rs_c, cs_c ); \
+			} \
+			else if ( bli_is_strictly_below_diag_n( diagoffc_ij, m_cur, n_cur ) ) \
+			{ \
+				/* Handle interior and edge cases separately. */ \
+				if ( m_cur == MR && n_cur == NR ) \
+				{ \
+					/* Invoke the gemm micro-kernel.
*/ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + beta_cast, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Scale the edge of C and add the result. */ \ + PASTEMAC(ch,xpbys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + beta_cast, \ + c11, rs_c, cs_c ); \ + } \ + } \ + } \ + } \ +} + +INSERT_GENTFUNC_BASIC0( herk_l_ker_var2rr ) + diff --git a/frame/3/herk/bli_herk_l_ker_var2sl.c b/frame/3/herk/bli_herk_l_ker_var2sl.c new file mode 100644 index 000000000..569684bf7 --- /dev/null +++ b/frame/3/herk/bli_herk_l_ker_var2sl.c @@ -0,0 +1,556 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +*/ + +#include "blis.h" + +#define FUNCPTR_T herk_fp + +typedef void (*FUNCPTR_T) + ( + doff_t diagoffc, + pack_t schema_a, + pack_t schema_b, + dim_t m, + dim_t n, + dim_t k, + void* alpha, + void* a, inc_t cs_a, inc_t is_a, + dim_t pd_a, inc_t ps_a, + void* b, inc_t rs_b, inc_t is_b, + dim_t pd_b, inc_t ps_b, + void* beta, + void* c, inc_t rs_c, inc_t cs_c, + cntx_t* cntx, + rntm_t* rntm, + thrinfo_t* thread + ); + +static FUNCPTR_T GENARRAY(ftypes,herk_l_ker_var2sl); + +// +// -- Macrokernel functions for slab partitioning ------------------------------ +// + +void bli_herk_l_ker_var2sl + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + num_t dt_exec = bli_obj_exec_dt( c ); + + doff_t diagoffc = bli_obj_diag_offset( c ); + + pack_t schema_a = bli_obj_pack_schema( a ); + pack_t schema_b = bli_obj_pack_schema( b ); + + dim_t m = bli_obj_length( c ); + dim_t n = bli_obj_width( c ); + dim_t k = bli_obj_width( a ); + + void* buf_a = bli_obj_buffer_at_off( a ); + inc_t cs_a = bli_obj_col_stride( a ); + inc_t is_a = bli_obj_imag_stride( a ); + dim_t pd_a = bli_obj_panel_dim( a ); + inc_t ps_a = bli_obj_panel_stride( a ); + + void* buf_b = bli_obj_buffer_at_off( b ); + inc_t rs_b = bli_obj_row_stride( b ); + inc_t is_b = bli_obj_imag_stride( b ); + dim_t pd_b = bli_obj_panel_dim( b ); + inc_t ps_b = bli_obj_panel_stride( b ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c 
= bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + obj_t scalar_a; + obj_t scalar_b; + + void* buf_alpha; + void* buf_beta; + + FUNCPTR_T f; + + // Detach and multiply the scalars attached to A and B. + bli_obj_scalar_detach( a, &scalar_a ); + bli_obj_scalar_detach( b, &scalar_b ); + bli_mulsc( &scalar_a, &scalar_b ); + + // Grab the addresses of the internal scalar buffers for the scalar + // merged above and the scalar attached to C. + buf_alpha = bli_obj_internal_scalar_buffer( &scalar_b ); + buf_beta = bli_obj_internal_scalar_buffer( c ); + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_exec]; + + // Invoke the function. + f( diagoffc, + schema_a, + schema_b, + m, + n, + k, + buf_alpha, + buf_a, cs_a, is_a, + pd_a, ps_a, + buf_b, rs_b, is_b, + pd_b, ps_b, + buf_beta, + buf_c, rs_c, cs_c, + cntx, + rntm, + thread ); +} + + +#undef GENTFUNC +#define GENTFUNC( ctype, ch, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + doff_t diagoffc, \ + pack_t schema_a, \ + pack_t schema_b, \ + dim_t m, \ + dim_t n, \ + dim_t k, \ + void* alpha, \ + void* a, inc_t cs_a, inc_t is_a, \ + dim_t pd_a, inc_t ps_a, \ + void* b, inc_t rs_b, inc_t is_b, \ + dim_t pd_b, inc_t ps_b, \ + void* beta, \ + void* c, inc_t rs_c, inc_t cs_c, \ + cntx_t* cntx, \ + rntm_t* rntm, \ + thrinfo_t* thread \ + ) \ +{ \ + const num_t dt = PASTEMAC(ch,type); \ +\ + /* Alias some constants to simpler names. */ \ + const dim_t MR = pd_a; \ + const dim_t NR = pd_b; \ + /*const dim_t PACKMR = cs_a;*/ \ + /*const dim_t PACKNR = rs_b;*/ \ +\ + /* Query the context for the micro-kernel address and cast it to its + function pointer type. */ \ + PASTECH(ch,gemm_ukr_ft) \ + gemm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx ); \ +\ + /* Temporary C buffer for edge cases. Note that the strides of this + temporary buffer are set so that they match the storage of the + original C matrix. 
For example, if C is column-stored, ct will be + column-stored as well. */ \ + ctype ct[ BLIS_STACK_BUF_MAX_SIZE \ + / sizeof( ctype ) ] \ + __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \ + const bool_t col_pref = bli_cntx_l3_vir_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \ + const inc_t rs_ct = ( col_pref ? 1 : NR ); \ + const inc_t cs_ct = ( col_pref ? MR : 1 ); \ +\ + ctype* restrict zero = PASTEMAC(ch,0); \ + ctype* restrict a_cast = a; \ + ctype* restrict b_cast = b; \ + ctype* restrict c_cast = c; \ + ctype* restrict alpha_cast = alpha; \ + ctype* restrict beta_cast = beta; \ + ctype* restrict b1; \ + ctype* restrict c1; \ +\ + doff_t diagoffc_ij; \ + dim_t m_iter, m_left; \ + dim_t n_iter, n_left; \ + dim_t m_cur; \ + dim_t n_cur; \ + dim_t i, j, ip; \ + inc_t rstep_a; \ + inc_t cstep_b; \ + inc_t rstep_c, cstep_c; \ + auxinfo_t aux; \ +\ + /* + Assumptions/assertions: + rs_a == 1 + cs_a == PACKMR + pd_a == MR + ps_a == stride to next micro-panel of A + rs_b == PACKNR + cs_b == 1 + pd_b == NR + ps_b == stride to next micro-panel of B + rs_c == (no assumptions) + cs_c == (no assumptions) + */ \ +\ + /* If any dimension is zero, return immediately. */ \ + if ( bli_zero_dim3( m, n, k ) ) return; \ +\ + /* Safeguard: If the current panel of C is entirely above the diagonal, + it is not stored. So we do nothing. */ \ + if ( bli_is_strictly_above_diag_n( diagoffc, m, n ) ) return; \ +\ + /* If there is a zero region above where the diagonal of C intersects + the left edge of the panel, adjust the pointer to C and A and treat + this case as if the diagonal offset were zero. */ \ + if ( diagoffc < 0 ) \ + { \ + ip = -diagoffc / MR; \ + i = ip * MR; \ + m = m - i; \ + diagoffc = -diagoffc % MR; \ + c_cast = c_cast + (i )*rs_c; \ + a_cast = a_cast + (ip )*ps_a; \ + } \ +\ + /* If there is a zero region to the right of where the diagonal + of C intersects the bottom of the panel, shrink it to prevent + "no-op" iterations from executing. 
*/ \ + if ( diagoffc + m < n ) \ + { \ + n = diagoffc + m; \ + } \ +\ + /* Clear the temporary C buffer in case it has any infs or NaNs. */ \ + PASTEMAC(ch,set0s_mxn)( MR, NR, \ + ct, rs_ct, cs_ct ); \ +\ + /* Compute number of primary and leftover components of the m and n + dimensions. */ \ + n_iter = n / NR; \ + n_left = n % NR; \ +\ + m_iter = m / MR; \ + m_left = m % MR; \ +\ + if ( n_left ) ++n_iter; \ + if ( m_left ) ++m_iter; \ +\ + /* Determine some increments used to step through A, B, and C. */ \ + rstep_a = ps_a; \ +\ + cstep_b = ps_b; \ +\ + rstep_c = rs_c * MR; \ + cstep_c = cs_c * NR; \ +\ + /* Save the pack schemas of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_schema_a( schema_a, &aux ); \ + bli_auxinfo_set_schema_b( schema_b, &aux ); \ +\ + /* Save the imaginary stride of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_is_a( is_a, &aux ); \ + bli_auxinfo_set_is_b( is_b, &aux ); \ +\ + /* The 'thread' argument points to the thrinfo_t node for the 2nd (jr) + loop around the microkernel. Here we query the thrinfo_t node for the + 1st (ir) loop around the microkernel. */ \ + thrinfo_t* caucus = bli_thrinfo_sub_node( thread ); \ +\ + /* Query the number of threads and thread ids for each loop. */ \ + dim_t jr_nt = bli_thread_n_way( thread ); \ + dim_t jr_tid = bli_thread_work_id( thread ); \ + dim_t ir_nt = bli_thread_n_way( caucus ); \ + dim_t ir_tid = bli_thread_work_id( caucus ); \ +\ + dim_t jr_start, jr_end; \ + dim_t ir_start, ir_end; \ + dim_t jr_inc, ir_inc; \ +\ + /* Note that we partition the 2nd loop into two regions: the rectangular + part of C, and the triangular portion. */ \ + dim_t n_iter_rct; \ + dim_t n_iter_tri; \ +\ + if ( bli_is_strictly_below_diag_n( diagoffc, m, n ) ) \ + { \ + /* If the entire panel of C does not intersect the diagonal, there is + no triangular region, and therefore we can skip the second set of + loops. 
*/ \ + n_iter_rct = n_iter; \ + n_iter_tri = 0; \ + } \ + else \ + { \ + /* If the panel of C does intersect the diagonal, compute the number of + iterations in the rectangular region by dividing NR into the diagonal + offset. Any remainder from this integer division is discarded, which + is what we want. That is, we want the rectangular region to contain + as many columns of whole microtiles as possible without including any + microtiles that intersect the diagonal. The number of iterations in + the triangular (or trapezoidal) region is computed as the remaining + number of iterations in the n dimension. */ \ + n_iter_rct = diagoffc / NR; \ + n_iter_tri = n_iter - n_iter_rct; \ + } \ +\ + /* Use slab assignment of micropanels to threads in the 2nd and 1st + loops for the initial rectangular region of C (if it exists). */ \ + bli_thread_range_jrir_sl( thread, n_iter_rct, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \ + bli_thread_range_jrir_sl( caucus, m_iter, 1, FALSE, &ir_start, &ir_end, &ir_inc ); \ +\ + /* Loop over the n dimension (NR columns at a time). */ \ + for ( j = jr_start; j < jr_end; j += jr_inc ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b2; \ +\ + b1 = b_cast + j * cstep_b; \ + c1 = c_cast + j * cstep_c; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. */ \ + b2 = b1; \ +\ + /* Interior loop over the m dimension (MR rows at a time). */ \ + for ( i = ir_start; i < ir_end; i += ir_inc ) \ + { \ + ctype* restrict a2; \ +\ + a1 = a_cast + i * rstep_a; \ + c11 = c1 + i * rstep_c; \ +\ + /* No need to compute the diagonal offset for the rectangular + region. */ \ + /*diagoffc_ij = diagoffc - (doff_t)j*NR + (doff_t)i*MR;*/ \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + /* Compute the addresses of the next panels of A and B. 
*/ \
+			a2 = bli_herk_get_next_a_upanel( a1, rstep_a, ir_inc ); \
+			if ( bli_is_last_iter_sl( i, m_iter, ir_tid, ir_nt ) ) \
+			{ \
+				a2 = a_cast; \
+				b2 = bli_herk_get_next_b_upanel( b1, cstep_b, jr_inc ); \
+				if ( bli_is_last_iter_sl( j, n_iter, jr_tid, jr_nt ) ) \
+					b2 = b_cast; \
+			} \
+\
+			/* Save addresses of next panels of A and B to the auxinfo_t
+			   object. */ \
+			bli_auxinfo_set_next_a( a2, &aux ); \
+			bli_auxinfo_set_next_b( b2, &aux ); \
+\
+			/* If the diagonal intersects the current MR x NR submatrix, we
+			   compute it in the temporary buffer and then add in the elements
+			   on or below the diagonal.
+			   Otherwise, if the submatrix is strictly below the diagonal,
+			   we compute and store as we normally would.
+			   And if we're strictly above the diagonal, we do nothing and
+			   continue. */ \
+			{ \
+				/* Handle interior and edge cases separately. */ \
+				if ( m_cur == MR && n_cur == NR ) \
+				{ \
+					/* Invoke the gemm micro-kernel. */ \
+					gemm_ukr \
+					( \
+					  k, \
+					  alpha_cast, \
+					  a1, \
+					  b1, \
+					  beta_cast, \
+					  c11, rs_c, cs_c, \
+					  &aux, \
+					  cntx \
+					); \
+				} \
+				else \
+				{ \
+					/* Invoke the gemm micro-kernel. */ \
+					gemm_ukr \
+					( \
+					  k, \
+					  alpha_cast, \
+					  a1, \
+					  b1, \
+					  zero, \
+					  ct, rs_ct, cs_ct, \
+					  &aux, \
+					  cntx \
+					); \
+\
+					/* Scale the edge of C and add the result. */ \
+					PASTEMAC(ch,xpbys_mxn)( m_cur, n_cur, \
+					                        ct, rs_ct, cs_ct, \
+					                        beta_cast, \
+					                        c11, rs_c, cs_c ); \
+				} \
+			} \
+		} \
+	} \
+\
+	/* If there is no triangular region, then we're done. */ \
+	if ( n_iter_tri == 0 ) return; \
+\
+	/* Use round-robin assignment of micropanels to threads in the 2nd
+	   loop and slab partitioning in the 1st loop for the remaining
+	   triangular region of C. */ \
+	bli_thread_range_jrir_rr( thread, n_iter_tri, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \
+\
+	/* Advance the start and end iteration offsets for the triangular region
+	   by the number of iterations used for the rectangular region.
*/ \
+	jr_start += n_iter_rct; \
+	jr_end += n_iter_rct; \
+\
+	/* Loop over the n dimension (NR columns at a time). */ \
+	for ( j = jr_start; j < jr_end; j += jr_inc ) \
+	{ \
+		ctype* restrict a1; \
+		ctype* restrict c11; \
+		ctype* restrict b2; \
+\
+		b1 = b_cast + j * cstep_b; \
+		c1 = c_cast + j * cstep_c; \
+\
+		n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \
+\
+		/* Initialize our next panel of B to be the current panel of B. */ \
+		b2 = b1; \
+\
+		/* Interior loop over the m dimension (MR rows at a time). */ \
+		for ( i = ir_start; i < ir_end; i += ir_inc ) \
+		{ \
+			ctype* restrict a2; \
+\
+			a1 = a_cast + i * rstep_a; \
+			c11 = c1 + i * rstep_c; \
+\
+			/* Compute the diagonal offset for the submatrix at (i,j). */ \
+			diagoffc_ij = diagoffc - (doff_t)j*NR + (doff_t)i*MR; \
+\
+			m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \
+\
+			/* Compute the addresses of the next panels of A and B. */ \
+			a2 = bli_herk_get_next_a_upanel( a1, rstep_a, ir_inc ); \
+			if ( bli_is_last_iter_rr( i, m_iter, ir_tid, ir_nt ) ) \
+			{ \
+				a2 = a_cast; \
+				b2 = bli_herk_get_next_b_upanel( b1, cstep_b, jr_inc ); \
+				if ( bli_is_last_iter_rr( j, n_iter, jr_tid, jr_nt ) ) \
+					b2 = b_cast; \
+			} \
+\
+			/* Save addresses of next panels of A and B to the auxinfo_t
+			   object. */ \
+			bli_auxinfo_set_next_a( a2, &aux ); \
+			bli_auxinfo_set_next_b( b2, &aux ); \
+\
+			/* If the diagonal intersects the current MR x NR submatrix, we
+			   compute it in the temporary buffer and then add in the elements
+			   on or below the diagonal.
+			   Otherwise, if the submatrix is strictly below the diagonal,
+			   we compute and store as we normally would.
+			   And if we're strictly above the diagonal, we do nothing and
+			   continue. */ \
+			if ( bli_intersects_diag_n( diagoffc_ij, m_cur, n_cur ) ) \
+			{ \
+				/* Invoke the gemm micro-kernel.
*/ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Scale C and add the result to only the stored part. */ \ + PASTEMAC(ch,xpbys_mxn_l)( diagoffc_ij, \ + m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + beta_cast, \ + c11, rs_c, cs_c ); \ + } \ + else if ( bli_is_strictly_below_diag_n( diagoffc_ij, m_cur, n_cur ) ) \ + { \ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + beta_cast, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Scale the edge of C and add the result. */ \ + PASTEMAC(ch,xpbys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + beta_cast, \ + c11, rs_c, cs_c ); \ + } \ + } \ + } \ + } \ +} + +INSERT_GENTFUNC_BASIC0( herk_l_ker_var2sl ) + diff --git a/frame/3/herk/bli_herk_u_ker_var2rr.c b/frame/3/herk/bli_herk_u_ker_var2rr.c new file mode 100644 index 000000000..e0ac82745 --- /dev/null +++ b/frame/3/herk/bli_herk_u_ker_var2rr.c @@ -0,0 +1,557 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. 
+ - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +*/ + +#include "blis.h" + +#define FUNCPTR_T herk_fp + +typedef void (*FUNCPTR_T) + ( + doff_t diagoffc, + pack_t schema_a, + pack_t schema_b, + dim_t m, + dim_t n, + dim_t k, + void* alpha, + void* a, inc_t cs_a, inc_t is_a, + dim_t pd_a, inc_t ps_a, + void* b, inc_t rs_b, inc_t is_b, + dim_t pd_b, inc_t ps_b, + void* beta, + void* c, inc_t rs_c, inc_t cs_c, + cntx_t* cntx, + rntm_t* rntm, + thrinfo_t* thread + ); + +static FUNCPTR_T GENARRAY(ftypes,herk_u_ker_var2rr); + +// +// -- Macrokernel functions for round-robin partitioning ----------------------- +// + +void bli_herk_u_ker_var2rr + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + num_t dt_exec = bli_obj_exec_dt( c ); + + doff_t diagoffc = bli_obj_diag_offset( c ); + + pack_t schema_a = bli_obj_pack_schema( a ); + pack_t schema_b = bli_obj_pack_schema( b ); + + dim_t m = bli_obj_length( c ); + dim_t n = bli_obj_width( c ); + dim_t k = bli_obj_width( a ); + + void* 
buf_a = bli_obj_buffer_at_off( a ); + inc_t cs_a = bli_obj_col_stride( a ); + inc_t is_a = bli_obj_imag_stride( a ); + dim_t pd_a = bli_obj_panel_dim( a ); + inc_t ps_a = bli_obj_panel_stride( a ); + + void* buf_b = bli_obj_buffer_at_off( b ); + inc_t rs_b = bli_obj_row_stride( b ); + inc_t is_b = bli_obj_imag_stride( b ); + dim_t pd_b = bli_obj_panel_dim( b ); + inc_t ps_b = bli_obj_panel_stride( b ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c = bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + obj_t scalar_a; + obj_t scalar_b; + + void* buf_alpha; + void* buf_beta; + + FUNCPTR_T f; + + // Detach and multiply the scalars attached to A and B. + bli_obj_scalar_detach( a, &scalar_a ); + bli_obj_scalar_detach( b, &scalar_b ); + bli_mulsc( &scalar_a, &scalar_b ); + + // Grab the addresses of the internal scalar buffers for the scalar + // merged above and the scalar attached to C. + buf_alpha = bli_obj_internal_scalar_buffer( &scalar_b ); + buf_beta = bli_obj_internal_scalar_buffer( c ); + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_exec]; + + // Invoke the function. + f( diagoffc, + schema_a, + schema_b, + m, + n, + k, + buf_alpha, + buf_a, cs_a, is_a, + pd_a, ps_a, + buf_b, rs_b, is_b, + pd_b, ps_b, + buf_beta, + buf_c, rs_c, cs_c, + cntx, + rntm, + thread ); +} + + +#undef GENTFUNC +#define GENTFUNC( ctype, ch, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + doff_t diagoffc, \ + pack_t schema_a, \ + pack_t schema_b, \ + dim_t m, \ + dim_t n, \ + dim_t k, \ + void* alpha, \ + void* a, inc_t cs_a, inc_t is_a, \ + dim_t pd_a, inc_t ps_a, \ + void* b, inc_t rs_b, inc_t is_b, \ + dim_t pd_b, inc_t ps_b, \ + void* beta, \ + void* c, inc_t rs_c, inc_t cs_c, \ + cntx_t* cntx, \ + rntm_t* rntm, \ + thrinfo_t* thread \ + ) \ +{ \ + const num_t dt = PASTEMAC(ch,type); \ +\ + /* Alias some constants to simpler names. 
*/ \ + const dim_t MR = pd_a; \ + const dim_t NR = pd_b; \ + /*const dim_t PACKMR = cs_a;*/ \ + /*const dim_t PACKNR = rs_b;*/ \ +\ + /* Query the context for the micro-kernel address and cast it to its + function pointer type. */ \ + PASTECH(ch,gemm_ukr_ft) \ + gemm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx ); \ +\ + /* Temporary C buffer for edge cases. Note that the strides of this + temporary buffer are set so that they match the storage of the + original C matrix. For example, if C is column-stored, ct will be + column-stored as well. */ \ + ctype ct[ BLIS_STACK_BUF_MAX_SIZE \ + / sizeof( ctype ) ] \ + __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \ + const bool_t col_pref = bli_cntx_l3_vir_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \ + const inc_t rs_ct = ( col_pref ? 1 : NR ); \ + const inc_t cs_ct = ( col_pref ? MR : 1 ); \ +\ + ctype* restrict zero = PASTEMAC(ch,0); \ + ctype* restrict a_cast = a; \ + ctype* restrict b_cast = b; \ + ctype* restrict c_cast = c; \ + ctype* restrict alpha_cast = alpha; \ + ctype* restrict beta_cast = beta; \ + ctype* restrict b1; \ + ctype* restrict c1; \ +\ + doff_t diagoffc_ij; \ + dim_t m_iter, m_left; \ + dim_t n_iter, n_left; \ + dim_t m_cur; \ + dim_t n_cur; \ + dim_t i, j, jp; \ + inc_t rstep_a; \ + inc_t cstep_b; \ + inc_t rstep_c, cstep_c; \ + auxinfo_t aux; \ +\ + /* + Assumptions/assertions: + rs_a == 1 + cs_a == PACKMR + pd_a == MR + ps_a == stride to next micro-panel of A + rs_b == PACKNR + cs_b == 1 + pd_b == NR + ps_b == stride to next micro-panel of B + rs_c == (no assumptions) + cs_c == (no assumptions) + */ \ +\ + /* If any dimension is zero, return immediately. */ \ + if ( bli_zero_dim3( m, n, k ) ) return; \ +\ + /* Safeguard: If the current panel of C is entirely below the diagonal, + it is not stored. So we do nothing. 
*/ \ + if ( bli_is_strictly_below_diag_n( diagoffc, m, n ) ) return; \ +\ + /* If there is a zero region to the left of where the diagonal of C + intersects the top edge of the panel, adjust the pointer to C and B + and treat this case as if the diagonal offset were zero. + NOTE: It's possible that after this pruning that the diagonal offset + is still positive (though it is guaranteed to be less than NR). */ \ + if ( diagoffc > 0 ) \ + { \ + jp = diagoffc / NR; \ + j = jp * NR; \ + n = n - j; \ + diagoffc = diagoffc % NR; \ + c_cast = c_cast + (j )*cs_c; \ + b_cast = b_cast + (jp )*ps_b; \ + } \ +\ + /* If there is a zero region below where the diagonal of C intersects + the right edge of the panel, shrink it to prevent "no-op" iterations + from executing. */ \ + if ( -diagoffc + n < m ) \ + { \ + m = -diagoffc + n; \ + } \ +\ + /* Clear the temporary C buffer in case it has any infs or NaNs. */ \ + PASTEMAC(ch,set0s_mxn)( MR, NR, \ + ct, rs_ct, cs_ct ); \ +\ + /* Compute number of primary and leftover components of the m and n + dimensions. */ \ + n_iter = n / NR; \ + n_left = n % NR; \ +\ + m_iter = m / MR; \ + m_left = m % MR; \ +\ + if ( n_left ) ++n_iter; \ + if ( m_left ) ++m_iter; \ +\ + /* Determine some increments used to step through A, B, and C. */ \ + rstep_a = ps_a; \ +\ + cstep_b = ps_b; \ +\ + rstep_c = rs_c * MR; \ + cstep_c = cs_c * NR; \ +\ + /* Save the pack schemas of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_schema_a( schema_a, &aux ); \ + bli_auxinfo_set_schema_b( schema_b, &aux ); \ +\ + /* Save the imaginary stride of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_is_a( is_a, &aux ); \ + bli_auxinfo_set_is_b( is_b, &aux ); \ +\ + /* The 'thread' argument points to the thrinfo_t node for the 2nd (jr) + loop around the microkernel. Here we query the thrinfo_t node for the + 1st (ir) loop around the microkernel. 
*/ \ + thrinfo_t* caucus = bli_thrinfo_sub_node( thread ); \ +\ + /* Query the number of threads and thread ids for each loop. */ \ + dim_t jr_nt = bli_thread_n_way( thread ); \ + dim_t jr_tid = bli_thread_work_id( thread ); \ + dim_t ir_nt = bli_thread_n_way( caucus ); \ + dim_t ir_tid = bli_thread_work_id( caucus ); \ +\ + dim_t jr_start, jr_end; \ + dim_t ir_start, ir_end; \ + dim_t jr_inc, ir_inc; \ +\ + /* Note that we partition the 2nd loop into two regions: the triangular + part of C, and the rectangular portion. */ \ + dim_t n_iter_tri; \ + dim_t n_iter_rct; \ +\ + if ( bli_is_strictly_above_diag_n( diagoffc, m, n ) ) \ + { \ + /* If the entire panel of C does not intersect the diagonal, there is + no triangular region, and therefore we can skip the first set of + loops. */ \ + n_iter_tri = 0; \ + n_iter_rct = n_iter; \ + } \ + else \ + { \ + /* If the panel of C does intersect the diagonal, compute the number of + iterations in the triangular (or trapezoidal) region by dividing NR + into the number of rows in C. A non-zero remainder means we need to + add one additional iteration. That is, we want the triangular region + to contain as few columns of whole microtiles as possible while still + including all microtiles that intersect the diagonal. The number of + iterations in the rectangular region is computed as the remaining + number of iterations in the n dimension. */ \ + n_iter_tri = ( m + diagoffc ) / NR + ( ( m + diagoffc ) % NR ? 1 : 0 ); \ + n_iter_rct = n_iter - n_iter_tri; \ + } \ +\ + /* Use round-robin assignment of micropanels to threads in the 2nd and 1st + loops for the initial triangular region of C (if it exists). */ \ + bli_thread_range_jrir_rr( thread, n_iter_tri, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \ + bli_thread_range_jrir_rr( caucus, m_iter, 1, FALSE, &ir_start, &ir_end, &ir_inc ); \ +\ + /* Loop over the n dimension (NR columns at a time). 
*/ \
+	for ( j = jr_start; j < jr_end; j += jr_inc ) \
+	{ \
+		ctype* restrict a1; \
+		ctype* restrict c11; \
+		ctype* restrict b2; \
+\
+		b1 = b_cast + j * cstep_b; \
+		c1 = c_cast + j * cstep_c; \
+\
+		n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \
+\
+		/* Initialize our next panel of B to be the current panel of B. */ \
+		b2 = b1; \
+\
+		/* Interior loop over the m dimension (MR rows at a time). */ \
+		for ( i = ir_start; i < ir_end; i += ir_inc ) \
+		{ \
+			ctype* restrict a2; \
+\
+			a1 = a_cast + i * rstep_a; \
+			c11 = c1 + i * rstep_c; \
+\
+			/* Compute the diagonal offset for the submatrix at (i,j). */ \
+			diagoffc_ij = diagoffc - (doff_t)j*NR + (doff_t)i*MR; \
+\
+			m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \
+\
+			/* Compute the addresses of the next panels of A and B. */ \
+			a2 = bli_herk_get_next_a_upanel( a1, rstep_a, ir_inc ); \
+			if ( bli_is_last_iter_rr( i, m_iter, ir_tid, ir_nt ) ) \
+			{ \
+				a2 = a_cast; \
+				b2 = bli_herk_get_next_b_upanel( b1, cstep_b, jr_inc ); \
+				if ( bli_is_last_iter_rr( j, n_iter, jr_tid, jr_nt ) ) \
+					b2 = b_cast; \
+			} \
+\
+			/* Save addresses of next panels of A and B to the auxinfo_t
+			   object. */ \
+			bli_auxinfo_set_next_a( a2, &aux ); \
+			bli_auxinfo_set_next_b( b2, &aux ); \
+\
+			/* If the diagonal intersects the current MR x NR submatrix, we
+			   compute it in the temporary buffer and then add in the elements
+			   on or above the diagonal.
+			   Otherwise, if the submatrix is strictly above the diagonal,
+			   we compute and store as we normally would.
+			   And if we're strictly below the diagonal, we do nothing and
+			   continue. */ \
+			if ( bli_intersects_diag_n( diagoffc_ij, m_cur, n_cur ) ) \
+			{ \
+				/* Invoke the gemm micro-kernel. */ \
+				gemm_ukr \
+				( \
+				  k, \
+				  alpha_cast, \
+				  a1, \
+				  b1, \
+				  zero, \
+				  ct, rs_ct, cs_ct, \
+				  &aux, \
+				  cntx \
+				); \
+\
+				/* Scale C and add the result to only the stored part.
*/ \
+				PASTEMAC(ch,xpbys_mxn_u)( diagoffc_ij, \
+				                          m_cur, n_cur, \
+				                          ct, rs_ct, cs_ct, \
+				                          beta_cast, \
+				                          c11, rs_c, cs_c ); \
+			} \
+			else if ( bli_is_strictly_above_diag_n( diagoffc_ij, m_cur, n_cur ) ) \
+			{ \
+				/* Handle interior and edge cases separately. */ \
+				if ( m_cur == MR && n_cur == NR ) \
+				{ \
+					/* Invoke the gemm micro-kernel. */ \
+					gemm_ukr \
+					( \
+					  k, \
+					  alpha_cast, \
+					  a1, \
+					  b1, \
+					  beta_cast, \
+					  c11, rs_c, cs_c, \
+					  &aux, \
+					  cntx \
+					); \
+				} \
+				else \
+				{ \
+					/* Invoke the gemm micro-kernel. */ \
+					gemm_ukr \
+					( \
+					  k, \
+					  alpha_cast, \
+					  a1, \
+					  b1, \
+					  zero, \
+					  ct, rs_ct, cs_ct, \
+					  &aux, \
+					  cntx \
+					); \
+\
+					/* Scale the edge of C and add the result. */ \
+					PASTEMAC(ch,xpbys_mxn)( m_cur, n_cur, \
+					                        ct, rs_ct, cs_ct, \
+					                        beta_cast, \
+					                        c11, rs_c, cs_c ); \
+				} \
+			} \
+		} \
+	} \
+\
+	/* If there is no rectangular region, then we're done. */ \
+	if ( n_iter_rct == 0 ) return; \
+\
+	/* Use round-robin assignment of micropanels to threads in the 2nd and 1st
+	   loops for the remaining rectangular region of C. */ \
+	bli_thread_range_jrir_rr( thread, n_iter_rct, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \
+\
+	/* Advance the start and end iteration offsets for the rectangular region
+	   by the number of iterations used for the triangular region. */ \
+	jr_start += n_iter_tri; \
+	jr_end += n_iter_tri; \
+\
+	/* Loop over the n dimension (NR columns at a time). */ \
+	for ( j = jr_start; j < jr_end; j += jr_inc ) \
+	{ \
+		ctype* restrict a1; \
+		ctype* restrict c11; \
+		ctype* restrict b2; \
+\
+		b1 = b_cast + j * cstep_b; \
+		c1 = c_cast + j * cstep_c; \
+\
+		n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \
+\
+		/* Initialize our next panel of B to be the current panel of B. */ \
+		b2 = b1; \
+\
+		/* Interior loop over the m dimension (MR rows at a time).
*/ \ + for ( i = ir_start; i < ir_end; i += ir_inc ) \ + { \ + ctype* restrict a2; \ +\ + a1 = a_cast + i * rstep_a; \ + c11 = c1 + i * rstep_c; \ +\ + /* No need to compute the diagonal offset for the rectangular + region. */ \ + /*diagoffc_ij = diagoffc - (doff_t)j*NR + (doff_t)i*MR;*/ \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = bli_herk_get_next_a_upanel( a1, rstep_a, ir_inc ); \ + if ( bli_is_last_iter_rr( i, m_iter, ir_tid, ir_nt ) ) \ + { \ + a2 = a_cast; \ + b2 = bli_herk_get_next_b_upanel( b1, cstep_b, jr_inc ); \ + if ( bli_is_last_iter_rr( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* This is the rectangular region, which lies strictly above the + diagonal, so we compute and store as we normally would. */ \ + { \ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + beta_cast, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Scale the edge of C and add the result. 
*/ \ + PASTEMAC(ch,xpbys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + beta_cast, \ + c11, rs_c, cs_c ); \ + } \ + } \ + } \ + } \ +} + +INSERT_GENTFUNC_BASIC0( herk_u_ker_var2rr ) + diff --git a/frame/3/herk/bli_herk_u_ker_var2sl.c b/frame/3/herk/bli_herk_u_ker_var2sl.c new file mode 100644 index 000000000..b182561d7 --- /dev/null +++ b/frame/3/herk/bli_herk_u_ker_var2sl.c @@ -0,0 +1,558 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +*/ + +#include "blis.h" + +#define FUNCPTR_T herk_fp + +typedef void (*FUNCPTR_T) + ( + doff_t diagoffc, + pack_t schema_a, + pack_t schema_b, + dim_t m, + dim_t n, + dim_t k, + void* alpha, + void* a, inc_t cs_a, inc_t is_a, + dim_t pd_a, inc_t ps_a, + void* b, inc_t rs_b, inc_t is_b, + dim_t pd_b, inc_t ps_b, + void* beta, + void* c, inc_t rs_c, inc_t cs_c, + cntx_t* cntx, + rntm_t* rntm, + thrinfo_t* thread + ); + +static FUNCPTR_T GENARRAY(ftypes,herk_u_ker_var2sl); + +// +// -- Macrokernel functions for slab partitioning ------------------------------ +// + +void bli_herk_u_ker_var2sl + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + num_t dt_exec = bli_obj_exec_dt( c ); + + doff_t diagoffc = bli_obj_diag_offset( c ); + + pack_t schema_a = bli_obj_pack_schema( a ); + pack_t schema_b = bli_obj_pack_schema( b ); + + dim_t m = bli_obj_length( c ); + dim_t n = bli_obj_width( c ); + dim_t k = bli_obj_width( a ); + + void* buf_a = bli_obj_buffer_at_off( a ); + inc_t cs_a = bli_obj_col_stride( a ); + inc_t is_a = bli_obj_imag_stride( a ); + dim_t pd_a = bli_obj_panel_dim( a ); + inc_t ps_a = bli_obj_panel_stride( a ); + + void* buf_b = bli_obj_buffer_at_off( b ); + inc_t rs_b = bli_obj_row_stride( b ); + inc_t is_b = bli_obj_imag_stride( b ); + dim_t pd_b = bli_obj_panel_dim( b ); + inc_t ps_b = bli_obj_panel_stride( b ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c 
= bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + obj_t scalar_a; + obj_t scalar_b; + + void* buf_alpha; + void* buf_beta; + + FUNCPTR_T f; + + // Detach and multiply the scalars attached to A and B. + bli_obj_scalar_detach( a, &scalar_a ); + bli_obj_scalar_detach( b, &scalar_b ); + bli_mulsc( &scalar_a, &scalar_b ); + + // Grab the addresses of the internal scalar buffers for the scalar + // merged above and the scalar attached to C. + buf_alpha = bli_obj_internal_scalar_buffer( &scalar_b ); + buf_beta = bli_obj_internal_scalar_buffer( c ); + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_exec]; + + // Invoke the function. + f( diagoffc, + schema_a, + schema_b, + m, + n, + k, + buf_alpha, + buf_a, cs_a, is_a, + pd_a, ps_a, + buf_b, rs_b, is_b, + pd_b, ps_b, + buf_beta, + buf_c, rs_c, cs_c, + cntx, + rntm, + thread ); +} + + +#undef GENTFUNC +#define GENTFUNC( ctype, ch, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + doff_t diagoffc, \ + pack_t schema_a, \ + pack_t schema_b, \ + dim_t m, \ + dim_t n, \ + dim_t k, \ + void* alpha, \ + void* a, inc_t cs_a, inc_t is_a, \ + dim_t pd_a, inc_t ps_a, \ + void* b, inc_t rs_b, inc_t is_b, \ + dim_t pd_b, inc_t ps_b, \ + void* beta, \ + void* c, inc_t rs_c, inc_t cs_c, \ + cntx_t* cntx, \ + rntm_t* rntm, \ + thrinfo_t* thread \ + ) \ +{ \ + const num_t dt = PASTEMAC(ch,type); \ +\ + /* Alias some constants to simpler names. */ \ + const dim_t MR = pd_a; \ + const dim_t NR = pd_b; \ + /*const dim_t PACKMR = cs_a;*/ \ + /*const dim_t PACKNR = rs_b;*/ \ +\ + /* Query the context for the micro-kernel address and cast it to its + function pointer type. */ \ + PASTECH(ch,gemm_ukr_ft) \ + gemm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx ); \ +\ + /* Temporary C buffer for edge cases. Note that the strides of this + temporary buffer are set so that they match the storage of the + original C matrix. 
For example, if C is column-stored, ct will be + column-stored as well. */ \ + ctype ct[ BLIS_STACK_BUF_MAX_SIZE \ + / sizeof( ctype ) ] \ + __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \ + const bool_t col_pref = bli_cntx_l3_vir_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \ + const inc_t rs_ct = ( col_pref ? 1 : NR ); \ + const inc_t cs_ct = ( col_pref ? MR : 1 ); \ +\ + ctype* restrict zero = PASTEMAC(ch,0); \ + ctype* restrict a_cast = a; \ + ctype* restrict b_cast = b; \ + ctype* restrict c_cast = c; \ + ctype* restrict alpha_cast = alpha; \ + ctype* restrict beta_cast = beta; \ + ctype* restrict b1; \ + ctype* restrict c1; \ +\ + doff_t diagoffc_ij; \ + dim_t m_iter, m_left; \ + dim_t n_iter, n_left; \ + dim_t m_cur; \ + dim_t n_cur; \ + dim_t i, j, jp; \ + inc_t rstep_a; \ + inc_t cstep_b; \ + inc_t rstep_c, cstep_c; \ + auxinfo_t aux; \ +\ + /* + Assumptions/assertions: + rs_a == 1 + cs_a == PACKMR + pd_a == MR + ps_a == stride to next micro-panel of A + rs_b == PACKNR + cs_b == 1 + pd_b == NR + ps_b == stride to next micro-panel of B + rs_c == (no assumptions) + cs_c == (no assumptions) + */ \ +\ + /* If any dimension is zero, return immediately. */ \ + if ( bli_zero_dim3( m, n, k ) ) return; \ +\ + /* Safeguard: If the current panel of C is entirely below the diagonal, + it is not stored. So we do nothing. */ \ + if ( bli_is_strictly_below_diag_n( diagoffc, m, n ) ) return; \ +\ + /* If there is a zero region to the left of where the diagonal of C + intersects the top edge of the panel, adjust the pointer to C and B + and treat this case as if the diagonal offset were zero. + NOTE: It's possible that after this pruning that the diagonal offset + is still positive (though it is guaranteed to be less than NR). 
*/ \ + if ( diagoffc > 0 ) \ + { \ + jp = diagoffc / NR; \ + j = jp * NR; \ + n = n - j; \ + diagoffc = diagoffc % NR; \ + c_cast = c_cast + (j )*cs_c; \ + b_cast = b_cast + (jp )*ps_b; \ + } \ +\ + /* If there is a zero region below where the diagonal of C intersects + the right edge of the panel, shrink it to prevent "no-op" iterations + from executing. */ \ + if ( -diagoffc + n < m ) \ + { \ + m = -diagoffc + n; \ + } \ +\ + /* Clear the temporary C buffer in case it has any infs or NaNs. */ \ + PASTEMAC(ch,set0s_mxn)( MR, NR, \ + ct, rs_ct, cs_ct ); \ +\ + /* Compute number of primary and leftover components of the m and n + dimensions. */ \ + n_iter = n / NR; \ + n_left = n % NR; \ +\ + m_iter = m / MR; \ + m_left = m % MR; \ +\ + if ( n_left ) ++n_iter; \ + if ( m_left ) ++m_iter; \ +\ + /* Determine some increments used to step through A, B, and C. */ \ + rstep_a = ps_a; \ +\ + cstep_b = ps_b; \ +\ + rstep_c = rs_c * MR; \ + cstep_c = cs_c * NR; \ +\ + /* Save the pack schemas of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_schema_a( schema_a, &aux ); \ + bli_auxinfo_set_schema_b( schema_b, &aux ); \ +\ + /* Save the imaginary stride of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_is_a( is_a, &aux ); \ + bli_auxinfo_set_is_b( is_b, &aux ); \ +\ + /* The 'thread' argument points to the thrinfo_t node for the 2nd (jr) + loop around the microkernel. Here we query the thrinfo_t node for the + 1st (ir) loop around the microkernel. */ \ + thrinfo_t* caucus = bli_thrinfo_sub_node( thread ); \ +\ + /* Query the number of threads and thread ids for each loop. 
*/ \ + dim_t jr_nt = bli_thread_n_way( thread ); \ + dim_t jr_tid = bli_thread_work_id( thread ); \ + dim_t ir_nt = bli_thread_n_way( caucus ); \ + dim_t ir_tid = bli_thread_work_id( caucus ); \ +\ + dim_t jr_start, jr_end; \ + dim_t ir_start, ir_end; \ + dim_t jr_inc, ir_inc; \ +\ + /* Note that we partition the 2nd loop into two regions: the triangular + part of C, and the rectangular portion. */ \ + dim_t n_iter_tri; \ + dim_t n_iter_rct; \ +\ + if ( bli_is_strictly_above_diag_n( diagoffc, m, n ) ) \ + { \ + /* If the entire panel of C does not intersect the diagonal, there is + no triangular region, and therefore we can skip the first set of + loops. */ \ + n_iter_tri = 0; \ + n_iter_rct = n_iter; \ + } \ + else \ + { \ + /* If the panel of C does intersect the diagonal, compute the number of + iterations in the triangular (or trapezoidal) region by dividing NR + into the number of rows in C. A non-zero remainder means we need to + add one additional iteration. That is, we want the triangular region + to contain as few columns of whole microtiles as possible while still + including all microtiles that intersect the diagonal. The number of + iterations in the rectangular region is computed as the remaining + number of iterations in the n dimension. */ \ + n_iter_tri = ( m + diagoffc ) / NR + ( ( m + diagoffc ) % NR ? 1 : 0 ); \ + n_iter_rct = n_iter - n_iter_tri; \ + } \ +\ + /* Use round-robin assignment of micropanels to threads in the 2nd loop + and slab partitioning in the 1st loop for the initial triangular region + of C (if it exists). */ \ + bli_thread_range_jrir_rr( thread, n_iter_tri, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \ + bli_thread_range_jrir_sl( caucus, m_iter, 1, FALSE, &ir_start, &ir_end, &ir_inc ); \ +\ + /* Loop over the n dimension (NR columns at a time). 
*/ \ + for ( j = jr_start; j < jr_end; j += jr_inc ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b2; \ +\ + b1 = b_cast + j * cstep_b; \ + c1 = c_cast + j * cstep_c; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. */ \ + b2 = b1; \ +\ + /* Interior loop over the m dimension (MR rows at a time). */ \ + for ( i = ir_start; i < ir_end; i += ir_inc ) \ + { \ + ctype* restrict a2; \ +\ + a1 = a_cast + i * rstep_a; \ + c11 = c1 + i * rstep_c; \ +\ + /* Compute the diagonal offset for the submatrix at (i,j). */ \ + diagoffc_ij = diagoffc - (doff_t)j*NR + (doff_t)i*MR; \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = bli_herk_get_next_a_upanel( a1, rstep_a, ir_inc ); \ + if ( bli_is_last_iter_sl( i, m_iter, ir_tid, ir_nt ) ) \ + { \ + a2 = a_cast; \ + b2 = bli_herk_get_next_b_upanel( b1, cstep_b, jr_inc ); \ + if ( bli_is_last_iter_rr( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* If the diagonal intersects the current MR x NR submatrix, we + compute it in the temporary buffer and then add in the elements + on or above the diagonal. + Otherwise, if the submatrix is strictly above the diagonal, + we compute and store as we normally would. + And if we're strictly below the diagonal, we do nothing and + continue. */ \ + if ( bli_intersects_diag_n( diagoffc_ij, m_cur, n_cur ) ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Scale C and add the result to only the stored part. 
*/ \ + PASTEMAC(ch,xpbys_mxn_u)( diagoffc_ij, \ + m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + beta_cast, \ + c11, rs_c, cs_c ); \ + } \ + else if ( bli_is_strictly_above_diag_n( diagoffc_ij, m_cur, n_cur ) ) \ + { \ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + beta_cast, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Scale the edge of C and add the result. */ \ + PASTEMAC(ch,xpbys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + beta_cast, \ + c11, rs_c, cs_c ); \ + } \ + } \ + } \ + } \ +\ + /* If there is no rectangular region, then we're done. */ \ + if ( n_iter_rct == 0 ) return; \ +\ + /* Use slab assignment of micropanels to threads in the 2nd and 1st + loops for the remaining rectangular region of C. */ \ + bli_thread_range_jrir_sl( thread, n_iter_rct, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \ +\ + /* Advance the start and end iteration offsets for the rectangular region + by the number of iterations used for the triangular region. */ \ + jr_start += n_iter_tri; \ + jr_end += n_iter_tri; \ +\ + /* Loop over the n dimension (NR columns at a time). */ \ + for ( j = jr_start; j < jr_end; j += jr_inc ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b2; \ +\ + b1 = b_cast + j * cstep_b; \ + c1 = c_cast + j * cstep_c; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. */ \ + b2 = b1; \ +\ + /* Interior loop over the m dimension (MR rows at a time). 
*/ \ + for ( i = ir_start; i < ir_end; i += ir_inc ) \ + { \ + ctype* restrict a2; \ +\ + a1 = a_cast + i * rstep_a; \ + c11 = c1 + i * rstep_c; \ +\ + /* No need to compute the diagonal offset for the rectangular + region. */ \ + /*diagoffc_ij = diagoffc - (doff_t)j*NR + (doff_t)i*MR;*/ \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = bli_herk_get_next_a_upanel( a1, rstep_a, ir_inc ); \ + if ( bli_is_last_iter_sl( i, m_iter, ir_tid, ir_nt ) ) \ + { \ + a2 = a_cast; \ + b2 = bli_herk_get_next_b_upanel( b1, cstep_b, jr_inc ); \ + if ( bli_is_last_iter_sl( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* This is the rectangular region, which lies strictly above the + diagonal, so we compute and store as we normally would. */ \ + { \ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + beta_cast, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Scale the edge of C and add the result. 
*/ \ + PASTEMAC(ch,xpbys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + beta_cast, \ + c11, rs_c, cs_c ); \ + } \ + } \ + } \ + } \ +} + +INSERT_GENTFUNC_BASIC0( herk_u_ker_var2sl ) + diff --git a/frame/3/herk/bli_herk_var.h b/frame/3/herk/bli_herk_var.h index 58061a8dd..9e4a42d6a 100644 --- a/frame/3/herk/bli_herk_var.h +++ b/frame/3/herk/bli_herk_var.h @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -55,9 +56,13 @@ void PASTEMAC0(opname) \ //GENPROT( herk_blk_var2 ) //GENPROT( herk_blk_var3 ) -GENPROT( herk_x_ker_var2 ) -GENPROT( herk_l_ker_var2 ) -GENPROT( herk_u_ker_var2 ) +GENPROT( herk_x_ker_var2sl ) +GENPROT( herk_x_ker_var2rr ) + +GENPROT( herk_l_ker_var2sl ) +GENPROT( herk_l_ker_var2rr ) +GENPROT( herk_u_ker_var2sl ) +GENPROT( herk_u_ker_var2rr ) //GENPROT( herk_packa ) //GENPROT( herk_packb ) @@ -89,6 +94,8 @@ void PASTEMAC(ch,varname) \ thrinfo_t* thread \ ); -INSERT_GENTPROT_BASIC0( herk_l_ker_var2 ) -INSERT_GENTPROT_BASIC0( herk_u_ker_var2 ) +INSERT_GENTPROT_BASIC0( herk_l_ker_var2sl ) +INSERT_GENTPROT_BASIC0( herk_l_ker_var2rr ) +INSERT_GENTPROT_BASIC0( herk_u_ker_var2sl ) +INSERT_GENTPROT_BASIC0( herk_u_ker_var2rr ) diff --git a/frame/3/herk/bli_herk_x_ker_var2.c b/frame/3/herk/bli_herk_x_ker_var2.c index 10b6ab826..911c65b31 100644 --- a/frame/3/herk/bli_herk_x_ker_var2.c +++ b/frame/3/herk/bli_herk_x_ker_var2.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. 
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -34,12 +35,12 @@ #include "blis.h" -static gemm_var_oft vars[2] = +static gemm_var_oft vars_sl[2] = { - bli_herk_l_ker_var2, bli_herk_u_ker_var2, + bli_herk_l_ker_var2sl, bli_herk_u_ker_var2sl, }; -void bli_herk_x_ker_var2 +void bli_herk_x_ker_var2sl ( obj_t* a, obj_t* ah, @@ -58,7 +59,48 @@ void bli_herk_x_ker_var2 else uplo = 1; // Index into the variant array to extract the correct function pointer. - f = vars[uplo]; + f = vars_sl[uplo]; + + // Call the macrokernel. + f + ( + a, + ah, + c, + cntx, + rntm, + cntl, + thread + ); +} + +// ----------------------------------------------------------------------------- + +static gemm_var_oft vars_rr[2] = +{ + bli_herk_l_ker_var2rr, bli_herk_u_ker_var2rr, +}; + +void bli_herk_x_ker_var2rr + ( + obj_t* a, + obj_t* ah, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + bool_t uplo; + gemm_var_oft f; + + // Set a bool based on the uplo field of C's root object. + if ( bli_obj_root_is_lower( c ) ) uplo = 0; + else uplo = 1; + + // Index into the variant array to extract the correct function pointer. + f = vars_rr[uplo]; // Call the macrokernel. f diff --git a/frame/3/herk/other/bli_herk_l_ker_var2.1looprr.c b/frame/3/herk/other/bli_herk_l_ker_var2.1looprr.c new file mode 100644 index 000000000..bd7b69e81 --- /dev/null +++ b/frame/3/herk/other/bli_herk_l_ker_var2.1looprr.c @@ -0,0 +1,420 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. 
+ + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +*/ + +#include "blis.h" + +#define FUNCPTR_T herk_fp + +typedef void (*FUNCPTR_T) + ( + doff_t diagoffc, + pack_t schema_a, + pack_t schema_b, + dim_t m, + dim_t n, + dim_t k, + void* alpha, + void* a, inc_t cs_a, inc_t is_a, + dim_t pd_a, inc_t ps_a, + void* b, inc_t rs_b, inc_t is_b, + dim_t pd_b, inc_t ps_b, + void* beta, + void* c, inc_t rs_c, inc_t cs_c, + cntx_t* cntx, + rntm_t* rntm, + thrinfo_t* thread + ); + +static FUNCPTR_T GENARRAY(ftypes,herk_l_ker_var2); + + +void bli_herk_l_ker_var2 + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + num_t dt_exec = bli_obj_exec_dt( c ); + + doff_t diagoffc = bli_obj_diag_offset( c ); + + pack_t schema_a = bli_obj_pack_schema( a ); + pack_t schema_b = bli_obj_pack_schema( b ); + + dim_t m = bli_obj_length( c ); + dim_t n = bli_obj_width( c ); + dim_t k = bli_obj_width( a ); + + void* buf_a = bli_obj_buffer_at_off( a ); + inc_t cs_a = bli_obj_col_stride( a ); + inc_t is_a = bli_obj_imag_stride( a ); + dim_t pd_a = bli_obj_panel_dim( a ); + inc_t ps_a = bli_obj_panel_stride( a ); + + void* buf_b = bli_obj_buffer_at_off( b ); + inc_t rs_b = bli_obj_row_stride( b ); + inc_t is_b = bli_obj_imag_stride( b ); + dim_t pd_b = bli_obj_panel_dim( b ); + inc_t ps_b = bli_obj_panel_stride( b ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c = bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + obj_t scalar_a; + obj_t scalar_b; + + void* buf_alpha; + void* buf_beta; + + FUNCPTR_T f; + + // Detach and multiply the scalars attached to A and B. + bli_obj_scalar_detach( a, &scalar_a ); + bli_obj_scalar_detach( b, &scalar_b ); + bli_mulsc( &scalar_a, &scalar_b ); + + // Grab the addresses of the internal scalar buffers for the scalar + // merged above and the scalar attached to C. 
+ buf_alpha = bli_obj_internal_scalar_buffer( &scalar_b ); + buf_beta = bli_obj_internal_scalar_buffer( c ); + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_exec]; + + // Invoke the function. + f( diagoffc, + schema_a, + schema_b, + m, + n, + k, + buf_alpha, + buf_a, cs_a, is_a, + pd_a, ps_a, + buf_b, rs_b, is_b, + pd_b, ps_b, + buf_beta, + buf_c, rs_c, cs_c, + cntx, + rntm, + thread ); +} + + +#undef GENTFUNC +#define GENTFUNC( ctype, ch, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + doff_t diagoffc, \ + pack_t schema_a, \ + pack_t schema_b, \ + dim_t m, \ + dim_t n, \ + dim_t k, \ + void* alpha, \ + void* a, inc_t cs_a, inc_t is_a, \ + dim_t pd_a, inc_t ps_a, \ + void* b, inc_t rs_b, inc_t is_b, \ + dim_t pd_b, inc_t ps_b, \ + void* beta, \ + void* c, inc_t rs_c, inc_t cs_c, \ + cntx_t* cntx, \ + rntm_t* rntm, \ + thrinfo_t* thread \ + ) \ +{ \ + const num_t dt = PASTEMAC(ch,type); \ +\ + /* Alias some constants to simpler names. */ \ + const dim_t MR = pd_a; \ + const dim_t NR = pd_b; \ + /*const dim_t PACKMR = cs_a;*/ \ + /*const dim_t PACKNR = rs_b;*/ \ +\ + /* Query the context for the micro-kernel address and cast it to its + function pointer type. */ \ + PASTECH(ch,gemm_ukr_ft) \ + gemm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx ); \ +\ + /* Temporary C buffer for edge cases. Note that the strides of this + temporary buffer are set so that they match the storage of the + original C matrix. For example, if C is column-stored, ct will be + column-stored as well. */ \ + ctype ct[ BLIS_STACK_BUF_MAX_SIZE \ + / sizeof( ctype ) ] \ + __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \ + const bool_t col_pref = bli_cntx_l3_vir_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \ + const inc_t rs_ct = ( col_pref ? 1 : NR ); \ + const inc_t cs_ct = ( col_pref ? 
MR : 1 ); \ +\ + ctype* restrict zero = PASTEMAC(ch,0); \ + ctype* restrict a_cast = a; \ + ctype* restrict b_cast = b; \ + ctype* restrict c_cast = c; \ + ctype* restrict alpha_cast = alpha; \ + ctype* restrict beta_cast = beta; \ + ctype* restrict b1; \ + ctype* restrict c1; \ +\ + doff_t diagoffc_ij; \ + dim_t m_iter, m_left; \ + dim_t n_iter, n_left; \ + dim_t m_cur; \ + dim_t n_cur; \ + dim_t i, j, ip; \ + inc_t rstep_a; \ + inc_t cstep_b; \ + inc_t rstep_c, cstep_c; \ + auxinfo_t aux; \ +\ + /* + Assumptions/assertions: + rs_a == 1 + cs_a == PACKMR + pd_a == MR + ps_a == stride to next micro-panel of A + rs_b == PACKNR + cs_b == 1 + pd_b == NR + ps_b == stride to next micro-panel of B + rs_c == (no assumptions) + cs_c == (no assumptions) + */ \ +\ + /* If any dimension is zero, return immediately. */ \ + if ( bli_zero_dim3( m, n, k ) ) return; \ +\ + /* Safeguard: If the current panel of C is entirely above the diagonal, + it is not stored. So we do nothing. */ \ + if ( bli_is_strictly_above_diag_n( diagoffc, m, n ) ) return; \ +\ + /* If there is a zero region above where the diagonal of C intersects + the left edge of the panel, adjust the pointer to C and A and treat + this case as if the diagonal offset were zero. */ \ + if ( diagoffc < 0 ) \ + { \ + ip = -diagoffc / MR; \ + i = ip * MR; \ + m = m - i; \ + diagoffc = -diagoffc % MR; \ + c_cast = c_cast + (i )*rs_c; \ + a_cast = a_cast + (ip )*ps_a; \ + } \ +\ + /* If there is a zero region to the right of where the diagonal + of C intersects the bottom of the panel, shrink it to prevent + "no-op" iterations from executing. */ \ + if ( diagoffc + m < n ) \ + { \ + n = diagoffc + m; \ + } \ +\ + /* Clear the temporary C buffer in case it has any infs or NaNs. */ \ + PASTEMAC(ch,set0s_mxn)( MR, NR, \ + ct, rs_ct, cs_ct ); \ +\ + /* Compute number of primary and leftover components of the m and n + dimensions. 
*/ \ + n_iter = n / NR; \ + n_left = n % NR; \ +\ + m_iter = m / MR; \ + m_left = m % MR; \ +\ + if ( n_left ) ++n_iter; \ + if ( m_left ) ++m_iter; \ +\ + /* Determine some increments used to step through A, B, and C. */ \ + rstep_a = ps_a; \ +\ + cstep_b = ps_b; \ +\ + rstep_c = rs_c * MR; \ + cstep_c = cs_c * NR; \ +\ + /* Save the pack schemas of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_schema_a( schema_a, &aux ); \ + bli_auxinfo_set_schema_b( schema_b, &aux ); \ +\ + /* Save the imaginary stride of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_is_a( is_a, &aux ); \ + bli_auxinfo_set_is_b( is_b, &aux ); \ +\ + /* The 'thread' argument points to the thrinfo_t node for the 2nd (jr) + loop around the microkernel. Here we query the thrinfo_t node for the + 1st (ir) loop around the microkernel. */ \ + thrinfo_t* caucus = bli_thrinfo_sub_node( thread ); \ +\ + /* Query the number of threads and thread ids for each loop. */ \ + dim_t jr_nt = bli_thread_n_way( thread ); \ + dim_t jr_tid = bli_thread_work_id( thread ); \ + dim_t ir_nt = bli_thread_n_way( caucus ); \ + dim_t ir_tid = bli_thread_work_id( caucus ); \ +\ + dim_t jr_start, jr_end; \ + dim_t ir_start, ir_end; \ + dim_t jr_inc, ir_inc; \ +\ + /* Use interleaved (round robin) assignment of micropanels to threads in + the 2nd and 1st loops. */ \ + bli_thread_range_jrir_rr( thread, n_iter, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \ + bli_thread_range_jrir_rr( caucus, m_iter, 1, FALSE, &ir_start, &ir_end, &ir_inc ); \ +\ + /* Loop over the n dimension (NR columns at a time). */ \ + for ( j = jr_start; j < jr_end; j += jr_inc ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b2; \ +\ + b1 = b_cast + j * cstep_b; \ + c1 = c_cast + j * cstep_c; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. */ \ + b2 = b1; \ +\ + /* Interior loop over the m dimension (MR rows at a time). 
*/ \ + for ( i = ir_start; i < ir_end; i += ir_inc ) \ + { \ + ctype* restrict a2; \ +\ + a1 = a_cast + i * rstep_a; \ + c11 = c1 + i * rstep_c; \ +\ + /* Compute the diagonal offset for the submatrix at (i,j). */ \ + diagoffc_ij = diagoffc - (doff_t)j*NR + (doff_t)i*MR; \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = bli_herk_get_next_a_upanel( a1, rstep_a, ir_inc ); \ + if ( bli_is_last_iter( i, m_iter, ir_tid, ir_nt ) ) \ + { \ + a2 = a_cast; \ + b2 = bli_herk_get_next_b_upanel( b1, cstep_b, jr_inc ); \ + if ( bli_is_last_iter( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* If the diagonal intersects the current MR x NR submatrix, we + compute it in the temporary buffer and then add in the elements + on or below the diagonal. + Otherwise, if the submatrix is strictly below the diagonal, + we compute and store as we normally would. + And if we're strictly above the diagonal, we do nothing and + continue. */ \ + if ( bli_intersects_diag_n( diagoffc_ij, m_cur, n_cur ) ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Scale C and add the result to only the stored part. */ \ + PASTEMAC(ch,xpbys_mxn_l)( diagoffc_ij, \ + m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + beta_cast, \ + c11, rs_c, cs_c ); \ + } \ + else if ( bli_is_strictly_below_diag_n( diagoffc_ij, m_cur, n_cur ) ) \ + { \ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. 
*/ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + beta_cast, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Scale the edge of C and add the result. */ \ + PASTEMAC(ch,xpbys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + beta_cast, \ + c11, rs_c, cs_c ); \ + } \ + } \ + } \ + } \ +} + +INSERT_GENTFUNC_BASIC0( herk_l_ker_var2 ) + diff --git a/frame/3/herk/bli_herk_l_ker_var2.c b/frame/3/herk/other/bli_herk_l_ker_var2.c similarity index 99% rename from frame/3/herk/bli_herk_l_ker_var2.c rename to frame/3/herk/other/bli_herk_l_ker_var2.c index 93c014051..832421813 100644 --- a/frame/3/herk/bli_herk_l_ker_var2.c +++ b/frame/3/herk/other/bli_herk_l_ker_var2.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are diff --git a/frame/3/herk/other/bli_herk_u_ker_var2.1looprr.c b/frame/3/herk/other/bli_herk_u_ker_var2.1looprr.c new file mode 100644 index 000000000..398213282 --- /dev/null +++ b/frame/3/herk/other/bli_herk_u_ker_var2.1looprr.c @@ -0,0 +1,420 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. 
+ - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +*/ + +#include "blis.h" + +#define FUNCPTR_T herk_fp + +typedef void (*FUNCPTR_T) + ( + doff_t diagoffc, + pack_t schema_a, + pack_t schema_b, + dim_t m, + dim_t n, + dim_t k, + void* alpha, + void* a, inc_t cs_a, inc_t is_a, + dim_t pd_a, inc_t ps_a, + void* b, inc_t rs_b, inc_t is_b, + dim_t pd_b, inc_t ps_b, + void* beta, + void* c, inc_t rs_c, inc_t cs_c, + cntx_t* cntx, + rntm_t* rntm, + thrinfo_t* thread + ); + +static FUNCPTR_T GENARRAY(ftypes,herk_u_ker_var2); + + +void bli_herk_u_ker_var2 + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + num_t dt_exec = bli_obj_exec_dt( c ); + + doff_t diagoffc = bli_obj_diag_offset( c ); + + pack_t schema_a = bli_obj_pack_schema( a ); + pack_t schema_b = bli_obj_pack_schema( b ); + + dim_t m = bli_obj_length( c ); + dim_t n = bli_obj_width( c ); + dim_t k = bli_obj_width( a ); + + void* buf_a = bli_obj_buffer_at_off( a ); + inc_t cs_a = bli_obj_col_stride( a ); + inc_t is_a = bli_obj_imag_stride( a ); + dim_t pd_a = bli_obj_panel_dim( a ); + inc_t ps_a = bli_obj_panel_stride( a ); + + void* buf_b = bli_obj_buffer_at_off( b ); + inc_t rs_b = bli_obj_row_stride( b ); + inc_t is_b = bli_obj_imag_stride( b ); + dim_t pd_b = bli_obj_panel_dim( b ); + inc_t ps_b = bli_obj_panel_stride( b ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c = bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + obj_t scalar_a; + obj_t scalar_b; + + void* buf_alpha; + void* buf_beta; + + FUNCPTR_T f; + + // Detach and multiply the scalars attached to A and B. + bli_obj_scalar_detach( a, &scalar_a ); + bli_obj_scalar_detach( b, &scalar_b ); + bli_mulsc( &scalar_a, &scalar_b ); + + // Grab the addresses of the internal scalar buffers for the scalar + // merged above and the scalar attached to C. 
+ buf_alpha = bli_obj_internal_scalar_buffer( &scalar_b ); + buf_beta = bli_obj_internal_scalar_buffer( c ); + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_exec]; + + // Invoke the function. + f( diagoffc, + schema_a, + schema_b, + m, + n, + k, + buf_alpha, + buf_a, cs_a, is_a, + pd_a, ps_a, + buf_b, rs_b, is_b, + pd_b, ps_b, + buf_beta, + buf_c, rs_c, cs_c, + cntx, + rntm, + thread ); +} + + +#undef GENTFUNC +#define GENTFUNC( ctype, ch, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + doff_t diagoffc, \ + pack_t schema_a, \ + pack_t schema_b, \ + dim_t m, \ + dim_t n, \ + dim_t k, \ + void* alpha, \ + void* a, inc_t cs_a, inc_t is_a, \ + dim_t pd_a, inc_t ps_a, \ + void* b, inc_t rs_b, inc_t is_b, \ + dim_t pd_b, inc_t ps_b, \ + void* beta, \ + void* c, inc_t rs_c, inc_t cs_c, \ + cntx_t* cntx, \ + rntm_t* rntm, \ + thrinfo_t* thread \ + ) \ +{ \ + const num_t dt = PASTEMAC(ch,type); \ +\ + /* Alias some constants to simpler names. */ \ + const dim_t MR = pd_a; \ + const dim_t NR = pd_b; \ + /*const dim_t PACKMR = cs_a;*/ \ + /*const dim_t PACKNR = rs_b;*/ \ +\ + /* Query the context for the micro-kernel address and cast it to its + function pointer type. */ \ + PASTECH(ch,gemm_ukr_ft) \ + gemm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx ); \ +\ + /* Temporary C buffer for edge cases. Note that the strides of this + temporary buffer are set so that they match the storage of the + original C matrix. For example, if C is column-stored, ct will be + column-stored as well. */ \ + ctype ct[ BLIS_STACK_BUF_MAX_SIZE \ + / sizeof( ctype ) ] \ + __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \ + const bool_t col_pref = bli_cntx_l3_vir_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \ + const inc_t rs_ct = ( col_pref ? 1 : NR ); \ + const inc_t cs_ct = ( col_pref ? 
MR : 1 ); \ +\ + ctype* restrict zero = PASTEMAC(ch,0); \ + ctype* restrict a_cast = a; \ + ctype* restrict b_cast = b; \ + ctype* restrict c_cast = c; \ + ctype* restrict alpha_cast = alpha; \ + ctype* restrict beta_cast = beta; \ + ctype* restrict b1; \ + ctype* restrict c1; \ +\ + doff_t diagoffc_ij; \ + dim_t m_iter, m_left; \ + dim_t n_iter, n_left; \ + dim_t m_cur; \ + dim_t n_cur; \ + dim_t i, j, jp; \ + inc_t rstep_a; \ + inc_t cstep_b; \ + inc_t rstep_c, cstep_c; \ + auxinfo_t aux; \ +\ + /* + Assumptions/assertions: + rs_a == 1 + cs_a == PACKMR + pd_a == MR + ps_a == stride to next micro-panel of A + rs_b == PACKNR + cs_b == 1 + pd_b == NR + ps_b == stride to next micro-panel of B + rs_c == (no assumptions) + cs_c == (no assumptions) + */ \ +\ + /* If any dimension is zero, return immediately. */ \ + if ( bli_zero_dim3( m, n, k ) ) return; \ +\ + /* Safeguard: If the current panel of C is entirely below the diagonal, + it is not stored. So we do nothing. */ \ + if ( bli_is_strictly_below_diag_n( diagoffc, m, n ) ) return; \ +\ + /* If there is a zero region to the left of where the diagonal of C + intersects the top edge of the panel, adjust the pointer to C and B + and treat this case as if the diagonal offset were zero. */ \ + if ( diagoffc > 0 ) \ + { \ + jp = diagoffc / NR; \ + j = jp * NR; \ + n = n - j; \ + diagoffc = diagoffc % NR; \ + c_cast = c_cast + (j )*cs_c; \ + b_cast = b_cast + (jp )*ps_b; \ + } \ +\ + /* If there is a zero region below where the diagonal of C intersects + the right edge of the panel, shrink it to prevent "no-op" iterations + from executing. */ \ + if ( -diagoffc + n < m ) \ + { \ + m = -diagoffc + n; \ + } \ +\ + /* Clear the temporary C buffer in case it has any infs or NaNs. */ \ + PASTEMAC(ch,set0s_mxn)( MR, NR, \ + ct, rs_ct, cs_ct ); \ +\ + /* Compute number of primary and leftover components of the m and n + dimensions. 
*/ \ + n_iter = n / NR; \ + n_left = n % NR; \ +\ + m_iter = m / MR; \ + m_left = m % MR; \ +\ + if ( n_left ) ++n_iter; \ + if ( m_left ) ++m_iter; \ +\ + /* Determine some increments used to step through A, B, and C. */ \ + rstep_a = ps_a; \ +\ + cstep_b = ps_b; \ +\ + rstep_c = rs_c * MR; \ + cstep_c = cs_c * NR; \ +\ + /* Save the pack schemas of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_schema_a( schema_a, &aux ); \ + bli_auxinfo_set_schema_b( schema_b, &aux ); \ +\ + /* Save the imaginary stride of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_is_a( is_a, &aux ); \ + bli_auxinfo_set_is_b( is_b, &aux ); \ +\ + /* The 'thread' argument points to the thrinfo_t node for the 2nd (jr) + loop around the microkernel. Here we query the thrinfo_t node for the + 1st (ir) loop around the microkernel. */ \ + thrinfo_t* caucus = bli_thrinfo_sub_node( thread ); \ +\ + /* Query the number of threads and thread ids for each loop. */ \ + dim_t jr_nt = bli_thread_n_way( thread ); \ + dim_t jr_tid = bli_thread_work_id( thread ); \ + dim_t ir_nt = bli_thread_n_way( caucus ); \ + dim_t ir_tid = bli_thread_work_id( caucus ); \ +\ + dim_t jr_start, jr_end; \ + dim_t ir_start, ir_end; \ + dim_t jr_inc, ir_inc; \ +\ + /* Use interleaved (round robin) assignment of micropanels to threads in + the 2nd and 1st loops. */ \ + bli_thread_range_jrir_rr( thread, n_iter, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \ + bli_thread_range_jrir_rr( caucus, m_iter, 1, FALSE, &ir_start, &ir_end, &ir_inc ); \ +\ + /* Loop over the n dimension (NR columns at a time). */ \ + for ( j = jr_start; j < jr_end; j += jr_inc ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b2; \ +\ + b1 = b_cast + j * cstep_b; \ + c1 = c_cast + j * cstep_c; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. */ \ + b2 = b1; \ +\ + /* Interior loop over the m dimension (MR rows at a time). 
*/ \ + for ( i = ir_start; i < ir_end; i += ir_inc ) \ + { \ + ctype* restrict a2; \ +\ + a1 = a_cast + i * rstep_a; \ + c11 = c1 + i * rstep_c; \ +\ + /* Compute the diagonal offset for the submatrix at (i,j). */ \ + diagoffc_ij = diagoffc - (doff_t)j*NR + (doff_t)i*MR; \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = bli_herk_get_next_a_upanel( a1, rstep_a, ir_inc ); \ + if ( bli_is_last_iter( i, m_iter, ir_tid, ir_nt ) ) \ + { \ + a2 = a_cast; \ + b2 = bli_herk_get_next_b_upanel( b1, cstep_b, jr_inc ); \ + if ( bli_is_last_iter( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* If the diagonal intersects the current MR x NR submatrix, we + compute it in the temporary buffer and then add in the elements + on or above the diagonal. + Otherwise, if the submatrix is strictly above the diagonal, + we compute and store as we normally would. + And if we're strictly below the diagonal, we do nothing and + continue. */ \ + if ( bli_intersects_diag_n( diagoffc_ij, m_cur, n_cur ) ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Scale C and add the result to only the stored part. */ \ + PASTEMAC(ch,xpbys_mxn_u)( diagoffc_ij, \ + m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + beta_cast, \ + c11, rs_c, cs_c ); \ + } \ + else if ( bli_is_strictly_above_diag_n( diagoffc_ij, m_cur, n_cur ) ) \ + { \ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. 
*/ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + beta_cast, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Scale the edge of C and add the result. */ \ + PASTEMAC(ch,xpbys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + beta_cast, \ + c11, rs_c, cs_c ); \ + } \ + } \ + } \ + } \ +} + +INSERT_GENTFUNC_BASIC0( herk_u_ker_var2 ) + diff --git a/frame/3/herk/bli_herk_u_ker_var2.c b/frame/3/herk/other/bli_herk_u_ker_var2.c similarity index 99% rename from frame/3/herk/bli_herk_u_ker_var2.c rename to frame/3/herk/other/bli_herk_u_ker_var2.c index 5875c3317..8d1a3021d 100644 --- a/frame/3/herk/bli_herk_u_ker_var2.c +++ b/frame/3/herk/other/bli_herk_u_ker_var2.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are diff --git a/frame/3/trmm/bli_trmm_front.c b/frame/3/trmm/bli_trmm_front.c index 3778c7302..4d6b49a25 100644 --- a/frame/3/trmm/bli_trmm_front.c +++ b/frame/3/trmm/bli_trmm_front.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -85,6 +86,10 @@ void bli_trmm_front } #if 0 + // NOTE: This case casts right-side trmm in terms of left side. This + // reduces the number of macrokernels exercised to two (trmm_ll and + // trmm_lu) but can lead to the microkernel being executed with an + // output matrix that is stored counter to its output preference. 
// If A is being multiplied from the right, transpose all operands // so that we can perform the computation as if A were being multiplied @@ -98,6 +103,11 @@ void bli_trmm_front } #else + // NOTE: This case computes right-side trmm natively with trmm_rl and + // trmm_ru macrokernels. This code path always gives us the opportunity + // to transpose the entire operation so that the effective storage format + // of the output matrix matches the microkernel's output preference. + // Thus, from a performance perspective, this case is preferred. // An optimization: If C is stored by rows and the micro-kernel prefers // contiguous columns, or if C is stored by columns and the micro-kernel diff --git a/frame/3/trmm/bli_trmm_ll_ker_var2rr.c b/frame/3/trmm/bli_trmm_ll_ker_var2rr.c new file mode 100644 index 000000000..a940fdb6f --- /dev/null +++ b/frame/3/trmm/bli_trmm_ll_ker_var2rr.c @@ -0,0 +1,535 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. 
+ + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +*/ + +#include "blis.h" + +#define FUNCPTR_T gemm_fp + +typedef void (*FUNCPTR_T) + ( + doff_t diagoffa, + pack_t schema_a, + pack_t schema_b, + dim_t m, + dim_t n, + dim_t k, + void* alpha, + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, + void* beta, + void* c, inc_t rs_c, inc_t cs_c, + cntx_t* cntx, + rntm_t* rntm, + thrinfo_t* thread + ); + +static FUNCPTR_T GENARRAY(ftypes,trmm_ll_ker_var2rr); + +// +// -- Macrokernel functions for round-robin partitioning ----------------------- +// + +void bli_trmm_ll_ker_var2rr + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + num_t dt_exec = bli_obj_exec_dt( c ); + + doff_t diagoffa = bli_obj_diag_offset( a ); + + pack_t schema_a = bli_obj_pack_schema( a ); + pack_t schema_b = bli_obj_pack_schema( b ); + + dim_t m = bli_obj_length( c ); + dim_t n = bli_obj_width( c ); + dim_t k = bli_obj_width( a ); + + void* buf_a = bli_obj_buffer_at_off( a ); + inc_t cs_a = bli_obj_col_stride( a ); + dim_t pd_a = bli_obj_panel_dim( a ); + inc_t ps_a = bli_obj_panel_stride( a ); + + void* buf_b = bli_obj_buffer_at_off( b ); + inc_t rs_b = bli_obj_row_stride( b 
); + dim_t pd_b = bli_obj_panel_dim( b ); + inc_t ps_b = bli_obj_panel_stride( b ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c = bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + obj_t scalar_a; + obj_t scalar_b; + + void* buf_alpha; + void* buf_beta; + + FUNCPTR_T f; + + // Detach and multiply the scalars attached to A and B. + bli_obj_scalar_detach( a, &scalar_a ); + bli_obj_scalar_detach( b, &scalar_b ); + bli_mulsc( &scalar_a, &scalar_b ); + + // Grab the addresses of the internal scalar buffers for the scalar + // merged above and the scalar attached to C. + buf_alpha = bli_obj_internal_scalar_buffer( &scalar_b ); + buf_beta = bli_obj_internal_scalar_buffer( c ); + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_exec]; + + // Invoke the function. + f( diagoffa, + schema_a, + schema_b, + m, + n, + k, + buf_alpha, + buf_a, cs_a, pd_a, ps_a, + buf_b, rs_b, pd_b, ps_b, + buf_beta, + buf_c, rs_c, cs_c, + cntx, + rntm, + thread ); +} + + +#undef GENTFUNC +#define GENTFUNC( ctype, ch, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + doff_t diagoffa, \ + pack_t schema_a, \ + pack_t schema_b, \ + dim_t m, \ + dim_t n, \ + dim_t k, \ + void* alpha, \ + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, \ + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, \ + void* beta, \ + void* c, inc_t rs_c, inc_t cs_c, \ + cntx_t* cntx, \ + rntm_t* rntm, \ + thrinfo_t* thread \ + ) \ +{ \ + const num_t dt = PASTEMAC(ch,type); \ +\ + /* Alias some constants to simpler names. */ \ + const dim_t MR = pd_a; \ + const dim_t NR = pd_b; \ + const dim_t PACKMR = cs_a; \ + const dim_t PACKNR = rs_b; \ +\ + /* Query the context for the micro-kernel address and cast it to its + function pointer type. */ \ + PASTECH(ch,gemm_ukr_ft) \ + gemm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx ); \ +\ + /* Temporary C buffer for edge cases. 
Note that the strides of this + temporary buffer are set so that they match the storage of the + original C matrix. For example, if C is column-stored, ct will be + column-stored as well. */ \ + ctype ct[ BLIS_STACK_BUF_MAX_SIZE \ + / sizeof( ctype ) ] \ + __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \ + const bool_t col_pref = bli_cntx_l3_vir_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \ + const inc_t rs_ct = ( col_pref ? 1 : NR ); \ + const inc_t cs_ct = ( col_pref ? MR : 1 ); \ +\ + ctype* restrict one = PASTEMAC(ch,1); \ + ctype* restrict zero = PASTEMAC(ch,0); \ + ctype* restrict a_cast = a; \ + ctype* restrict b_cast = b; \ + ctype* restrict c_cast = c; \ + ctype* restrict alpha_cast = alpha; \ + ctype* restrict beta_cast = beta; \ + ctype* restrict b1; \ + ctype* restrict c1; \ +\ + doff_t diagoffa_i; \ + dim_t k_full; \ + dim_t m_iter, m_left; \ + dim_t n_iter, n_left; \ + dim_t m_cur; \ + dim_t n_cur; \ + dim_t k_a1011; \ + dim_t off_a1011; \ + dim_t i, j; \ + inc_t rstep_a; \ + inc_t cstep_b; \ + inc_t rstep_c, cstep_c; \ + inc_t istep_a; \ + inc_t istep_b; \ + inc_t off_scl; \ + inc_t ss_a_num; \ + inc_t ss_a_den; \ + inc_t ps_a_cur; \ + inc_t is_a_cur; \ + auxinfo_t aux; \ +\ + /* + Assumptions/assertions: + rs_a == 1 + cs_a == PACKMR + pd_a == MR + ps_a == stride to next micro-panel of A + rs_b == PACKNR + cs_b == 1 + pd_b == NR + ps_b == stride to next micro-panel of B + rs_c == (no assumptions) + cs_c == (no assumptions) + */ \ +\ + /* Safety trap: Certain indexing within this macro-kernel does not + work as intended if both MR and NR are odd. */ \ + if ( ( bli_is_odd( PACKMR ) && bli_is_odd( NR ) ) || \ + ( bli_is_odd( PACKNR ) && bli_is_odd( MR ) ) ) bli_abort(); \ +\ + /* If any dimension is zero, return immediately. */ \ + if ( bli_zero_dim3( m, n, k ) ) return; \ +\ + /* Safeguard: If the current block of A is entirely above the diagonal, + it is implicitly zero. So we do nothing. 
*/ \ + if ( bli_is_strictly_above_diag_n( diagoffa, m, k ) ) return; \ +\ + /* Compute k_full. For all trmm, k_full is simply k. This is + needed because some parameter combinations of trmm reduce k + to advance past zero regions in the triangular matrix, and + when computing the imaginary stride of B (the non-triangular + matrix), which is used by 4m1/3m1 implementations, we need + this unreduced value of k. */ \ + k_full = k; \ +\ + /* Compute indexing scaling factor for 4m or 3m. This is + needed because one of the packing register blocksizes (PACKMR + or PACKNR) is used to index into the micro-panels of the non- + triangular matrix when computing with a diagonal-intersecting + micro-panel of the triangular matrix. In the case of 4m or 3m, + real values are stored in both sub-panels, and so the indexing + needs to occur in units of real values. The value computed + here is divided into the complex pointer offset to cause the + pointer to be advanced by the correct value. */ \ + if ( bli_is_4mi_packed( schema_a ) || \ + bli_is_3mi_packed( schema_a ) || \ + bli_is_rih_packed( schema_a ) ) off_scl = 2; \ + else off_scl = 1; \ +\ + /* Compute the storage stride scaling. Usually this is just 1. + However, in the case of interleaved 3m, we need to scale the + offset by 3/2. And if we are packing real-only, imag-only, or + summed-only, we need to scale the computed panel sizes by 1/2 + to compensate for the fact that the pointer arithmetic occurs + in terms of complex elements rather than real elements. */ \ + if ( bli_is_3mi_packed( schema_a ) ) { ss_a_num = 3; ss_a_den = 2; } \ + else if ( bli_is_rih_packed( schema_a ) ) { ss_a_num = 1; ss_a_den = 2; } \ + else { ss_a_num = 1; ss_a_den = 1; } \ +\ + /* If there is a zero region above where the diagonal of A intersects the + left edge of the block, adjust the pointer to C and treat this case as + if the diagonal offset were zero. This skips over the region that was + not packed. 
(Note we assume the diagonal offset is a multiple of MR; + this assumption will hold as long as the cache blocksizes are each a + multiple of MR and NR.) */ \ + if ( diagoffa < 0 ) \ + { \ + i = -diagoffa; \ + m = m - i; \ + diagoffa = 0; \ + c_cast = c_cast + (i )*rs_c; \ + } \ +\ + /* Clear the temporary C buffer in case it has any infs or NaNs. */ \ + PASTEMAC(ch,set0s_mxn)( MR, NR, \ + ct, rs_ct, cs_ct ); \ +\ + /* Compute number of primary and leftover components of the m and n + dimensions. */ \ + n_iter = n / NR; \ + n_left = n % NR; \ +\ + m_iter = m / MR; \ + m_left = m % MR; \ +\ + if ( n_left ) ++n_iter; \ + if ( m_left ) ++m_iter; \ +\ + /* Determine some increments used to step through A, B, and C. */ \ + rstep_a = ps_a; \ +\ + cstep_b = ps_b; \ +\ + rstep_c = rs_c * MR; \ + cstep_c = cs_c * NR; \ +\ + istep_a = PACKMR * k; \ + istep_b = PACKNR * k_full; \ +\ + if ( bli_is_odd( istep_a ) ) istep_a += 1; \ + if ( bli_is_odd( istep_b ) ) istep_b += 1; \ +\ + /* Save the pack schemas of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_schema_a( schema_a, &aux ); \ + bli_auxinfo_set_schema_b( schema_b, &aux ); \ +\ + /* Save the imaginary stride of B to the auxinfo_t object. */ \ + bli_auxinfo_set_is_b( istep_b, &aux ); \ +\ + /* The 'thread' argument points to the thrinfo_t node for the 2nd (jr) + loop around the microkernel. Here we query the thrinfo_t node for the + 1st (ir) loop around the microkernel. */ \ + /*thrinfo_t* ir_thread = bli_thrinfo_sub_node( thread );*/ \ +\ + /* Query the number of threads and thread ids for each loop. 
*/ \ + dim_t jr_nt = bli_thread_n_way( thread ); \ + dim_t jr_tid = bli_thread_work_id( thread ); \ + /*dim_t ir_nt = bli_thread_n_way( ir_thread ); \ + dim_t ir_tid = bli_thread_work_id( ir_thread );*/ \ +\ + dim_t jr_start, jr_end; \ + /*dim_t ir_start, ir_end;*/ \ + dim_t jr_inc; \ +\ + /* Use round-robin assignment of micropanels to threads in the 2nd loop for + the initial rectangular region of C (if it exists). + NOTE: Parallelism in the 1st loop is disabled for now. */ \ + bli_thread_range_jrir_rr( thread, n_iter, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \ + /*bli_thread_range_jrir_rr( caucus, m_iter, 1, FALSE, &ir_start, &ir_end, &ir_inc );*/ \ +\ + /* Loop over the n dimension (NR columns at a time). */ \ + for ( j = jr_start; j < jr_end; j += jr_inc ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b2; \ +\ + b1 = b_cast + j * cstep_b; \ + c1 = c_cast + j * cstep_c; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. */ \ + b2 = b1; \ +\ + a1 = a_cast; \ + c11 = c1; \ +\ + /* Loop over the m dimension (MR rows at a time). */ \ + for ( i = 0; i < m_iter; ++i ) \ + { \ + diagoffa_i = diagoffa + ( doff_t )i*MR; \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + /* If the current panel of A intersects the diagonal, scale C + by beta. If it is strictly below the diagonal, scale by one. + This allows the current macro-kernel to work for both trmm + and trmm3. */ \ + if ( bli_intersects_diag_n( diagoffa_i, MR, k ) ) \ + { \ + ctype* restrict b1_i; \ + ctype* restrict a2; \ +\ + /* Determine the offset to and length of the panel that was + packed so we can index into the corresponding location in + b1. */ \ + off_a1011 = 0; \ + k_a1011 = bli_min( diagoffa_i + MR, k ); \ +\ + /* Compute the panel stride for the current diagonal- + intersecting micro-panel. 
*/ \ + is_a_cur = k_a1011 * PACKMR; \ + is_a_cur += ( bli_is_odd( is_a_cur ) ? 1 : 0 ); \ + ps_a_cur = ( is_a_cur * ss_a_num ) / ss_a_den; \ +\ + /* NOTE: ir loop parallelism disabled for now. */ \ + /*if ( bli_trmm_my_iter( i, ir_thread ) ) {*/ \ +\ + b1_i = b1 + ( off_a1011 * PACKNR ) / off_scl; \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = a1; \ + if ( bli_is_last_iter_rr( i, m_iter, 0, 1 ) ) \ + { \ + a2 = a_cast; \ + b2 = b1; \ + if ( bli_is_last_iter_rr( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Save the 4m1/3m1 imaginary stride of A to the auxinfo_t + object. */ \ + bli_auxinfo_set_is_a( is_a_cur, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k_a1011, \ + alpha_cast, \ + a1, \ + b1_i, \ + beta_cast, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Copy edge elements of C to the temporary buffer. */ \ + PASTEMAC(ch,copys_mxn)( m_cur, n_cur, \ + c11, rs_c, cs_c, \ + ct, rs_ct, cs_ct ); \ +\ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k_a1011, \ + alpha_cast, \ + a1, \ + b1_i, \ + beta_cast, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Copy the result to the edge of C. */ \ + PASTEMAC(ch,copys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + c11, rs_c, cs_c ); \ + } \ + /*}*/ \ +\ + a1 += ps_a_cur; \ + } \ + else if ( bli_is_strictly_below_diag_n( diagoffa_i, MR, k ) ) \ + { \ + /* NOTE: ir loop parallelism disabled for now. */ \ + /*if ( bli_trmm_my_iter( i, ir_thread ) ) {*/ \ +\ + ctype* restrict a2; \ +\ + /* Compute the addresses of the next panels of A and B. 
*/ \ + a2 = a1; \ + if ( bli_is_last_iter_rr( i, m_iter, 0, 1 ) ) \ + { \ + a2 = a_cast; \ + b2 = b1; \ + if ( bli_is_last_iter_rr( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Save the 4m1/3m1 imaginary stride of A to the auxinfo_t + object. */ \ + bli_auxinfo_set_is_a( istep_a, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + one, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Add the result to the edge of C. */ \ + PASTEMAC(ch,adds_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + c11, rs_c, cs_c ); \ + } \ + /*}*/ \ +\ + a1 += rstep_a; \ + } \ +\ + c11 += rstep_c; \ + } \ + } \ +/*PASTEMAC(ch,fprintm)( stdout, "trmm_ll_ker_var2rr: a1", MR, k_a1011, a1, 1, MR, "%4.1f", "" );*/ \ +/*PASTEMAC(ch,fprintm)( stdout, "trmm_ll_ker_var2rr: b1", k_a1011, NR, b1_i, NR, 1, "%4.1f", "" );*/ \ +} + +INSERT_GENTFUNC_BASIC0( trmm_ll_ker_var2rr ) + diff --git a/frame/3/trmm/bli_trmm_ll_ker_var2sl.c b/frame/3/trmm/bli_trmm_ll_ker_var2sl.c new file mode 100644 index 000000000..718c6fba1 --- /dev/null +++ b/frame/3/trmm/bli_trmm_ll_ker_var2sl.c @@ -0,0 +1,535 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. 
+ + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +*/ + +#include "blis.h" + +#define FUNCPTR_T gemm_fp + +typedef void (*FUNCPTR_T) + ( + doff_t diagoffa, + pack_t schema_a, + pack_t schema_b, + dim_t m, + dim_t n, + dim_t k, + void* alpha, + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, + void* beta, + void* c, inc_t rs_c, inc_t cs_c, + cntx_t* cntx, + rntm_t* rntm, + thrinfo_t* thread + ); + +static FUNCPTR_T GENARRAY(ftypes,trmm_ll_ker_var2sl); + +// +// -- Macrokernel functions for slab partitioning ------------------------------ +// + +void bli_trmm_ll_ker_var2sl + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + num_t dt_exec = bli_obj_exec_dt( c ); + + doff_t diagoffa = bli_obj_diag_offset( a ); + + pack_t schema_a = bli_obj_pack_schema( a ); + pack_t schema_b = bli_obj_pack_schema( b ); + + dim_t m = bli_obj_length( c ); + dim_t n = bli_obj_width( c ); + dim_t k = bli_obj_width( a ); + + void* buf_a = bli_obj_buffer_at_off( a ); + inc_t cs_a = bli_obj_col_stride( a ); + dim_t pd_a = bli_obj_panel_dim( a ); + inc_t ps_a = bli_obj_panel_stride( a ); + + void* buf_b = bli_obj_buffer_at_off( b ); + inc_t rs_b = bli_obj_row_stride( b ); + dim_t pd_b = bli_obj_panel_dim( b ); + inc_t ps_b = bli_obj_panel_stride( b ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c = bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + obj_t scalar_a; + obj_t scalar_b; + + void* buf_alpha; + void* buf_beta; + + FUNCPTR_T f; + + // Detach and multiply the scalars attached to A and B. + bli_obj_scalar_detach( a, &scalar_a ); + bli_obj_scalar_detach( b, &scalar_b ); + bli_mulsc( &scalar_a, &scalar_b ); + + // Grab the addresses of the internal scalar buffers for the scalar + // merged above and the scalar attached to C. 
+ buf_alpha = bli_obj_internal_scalar_buffer( &scalar_b ); + buf_beta = bli_obj_internal_scalar_buffer( c ); + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_exec]; + + // Invoke the function. + f( diagoffa, + schema_a, + schema_b, + m, + n, + k, + buf_alpha, + buf_a, cs_a, pd_a, ps_a, + buf_b, rs_b, pd_b, ps_b, + buf_beta, + buf_c, rs_c, cs_c, + cntx, + rntm, + thread ); +} + + +#undef GENTFUNC +#define GENTFUNC( ctype, ch, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + doff_t diagoffa, \ + pack_t schema_a, \ + pack_t schema_b, \ + dim_t m, \ + dim_t n, \ + dim_t k, \ + void* alpha, \ + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, \ + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, \ + void* beta, \ + void* c, inc_t rs_c, inc_t cs_c, \ + cntx_t* cntx, \ + rntm_t* rntm, \ + thrinfo_t* thread \ + ) \ +{ \ + const num_t dt = PASTEMAC(ch,type); \ +\ + /* Alias some constants to simpler names. */ \ + const dim_t MR = pd_a; \ + const dim_t NR = pd_b; \ + const dim_t PACKMR = cs_a; \ + const dim_t PACKNR = rs_b; \ +\ + /* Query the context for the micro-kernel address and cast it to its + function pointer type. */ \ + PASTECH(ch,gemm_ukr_ft) \ + gemm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx ); \ +\ + /* Temporary C buffer for edge cases. Note that the strides of this + temporary buffer are set so that they match the storage of the + original C matrix. For example, if C is column-stored, ct will be + column-stored as well. */ \ + ctype ct[ BLIS_STACK_BUF_MAX_SIZE \ + / sizeof( ctype ) ] \ + __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \ + const bool_t col_pref = bli_cntx_l3_vir_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \ + const inc_t rs_ct = ( col_pref ? 1 : NR ); \ + const inc_t cs_ct = ( col_pref ? 
MR : 1 ); \ +\ + ctype* restrict one = PASTEMAC(ch,1); \ + ctype* restrict zero = PASTEMAC(ch,0); \ + ctype* restrict a_cast = a; \ + ctype* restrict b_cast = b; \ + ctype* restrict c_cast = c; \ + ctype* restrict alpha_cast = alpha; \ + ctype* restrict beta_cast = beta; \ + ctype* restrict b1; \ + ctype* restrict c1; \ +\ + doff_t diagoffa_i; \ + dim_t k_full; \ + dim_t m_iter, m_left; \ + dim_t n_iter, n_left; \ + dim_t m_cur; \ + dim_t n_cur; \ + dim_t k_a1011; \ + dim_t off_a1011; \ + dim_t i, j; \ + inc_t rstep_a; \ + inc_t cstep_b; \ + inc_t rstep_c, cstep_c; \ + inc_t istep_a; \ + inc_t istep_b; \ + inc_t off_scl; \ + inc_t ss_a_num; \ + inc_t ss_a_den; \ + inc_t ps_a_cur; \ + inc_t is_a_cur; \ + auxinfo_t aux; \ +\ + /* + Assumptions/assertions: + rs_a == 1 + cs_a == PACKMR + pd_a == MR + ps_a == stride to next micro-panel of A + rs_b == PACKNR + cs_b == 1 + pd_b == NR + ps_b == stride to next micro-panel of B + rs_c == (no assumptions) + cs_c == (no assumptions) + */ \ +\ + /* Safety trap: Certain indexing within this macro-kernel does not + work as intended if both MR and NR are odd. */ \ + if ( ( bli_is_odd( PACKMR ) && bli_is_odd( NR ) ) || \ + ( bli_is_odd( PACKNR ) && bli_is_odd( MR ) ) ) bli_abort(); \ +\ + /* If any dimension is zero, return immediately. */ \ + if ( bli_zero_dim3( m, n, k ) ) return; \ +\ + /* Safeguard: If the current block of A is entirely above the diagonal, + it is implicitly zero. So we do nothing. */ \ + if ( bli_is_strictly_above_diag_n( diagoffa, m, k ) ) return; \ +\ + /* Compute k_full. For all trmm, k_full is simply k. This is + needed because some parameter combinations of trmm reduce k + to advance past zero regions in the triangular matrix, and + when computing the imaginary stride of B (the non-triangular + matrix), which is used by 4m1/3m1 implementations, we need + this unreduced value of k. */ \ + k_full = k; \ +\ + /* Compute indexing scaling factor for 4m or 3m.
This is + needed because one of the packing register blocksizes (PACKMR + or PACKNR) is used to index into the micro-panels of the non- + triangular matrix when computing with a diagonal-intersecting + micro-panel of the triangular matrix. In the case of 4m or 3m, + real values are stored in both sub-panels, and so the indexing + needs to occur in units of real values. The value computed + here is divided into the complex pointer offset to cause the + pointer to be advanced by the correct value. */ \ + if ( bli_is_4mi_packed( schema_a ) || \ + bli_is_3mi_packed( schema_a ) || \ + bli_is_rih_packed( schema_a ) ) off_scl = 2; \ + else off_scl = 1; \ +\ + /* Compute the storage stride scaling. Usually this is just 1. + However, in the case of interleaved 3m, we need to scale the + offset by 3/2. And if we are packing real-only, imag-only, or + summed-only, we need to scale the computed panel sizes by 1/2 + to compensate for the fact that the pointer arithmetic occurs + in terms of complex elements rather than real elements. */ \ + if ( bli_is_3mi_packed( schema_a ) ) { ss_a_num = 3; ss_a_den = 2; } \ + else if ( bli_is_rih_packed( schema_a ) ) { ss_a_num = 1; ss_a_den = 2; } \ + else { ss_a_num = 1; ss_a_den = 1; } \ +\ + /* If there is a zero region above where the diagonal of A intersects the + left edge of the block, adjust the pointer to C and treat this case as + if the diagonal offset were zero. This skips over the region that was + not packed. (Note we assume the diagonal offset is a multiple of MR; + this assumption will hold as long as the cache blocksizes are each a + multiple of MR and NR.) */ \ + if ( diagoffa < 0 ) \ + { \ + i = -diagoffa; \ + m = m - i; \ + diagoffa = 0; \ + c_cast = c_cast + (i )*rs_c; \ + } \ +\ + /* Clear the temporary C buffer in case it has any infs or NaNs. */ \ + PASTEMAC(ch,set0s_mxn)( MR, NR, \ + ct, rs_ct, cs_ct ); \ +\ + /* Compute number of primary and leftover components of the m and n + dimensions. 
*/ \ + n_iter = n / NR; \ + n_left = n % NR; \ +\ + m_iter = m / MR; \ + m_left = m % MR; \ +\ + if ( n_left ) ++n_iter; \ + if ( m_left ) ++m_iter; \ +\ + /* Determine some increments used to step through A, B, and C. */ \ + rstep_a = ps_a; \ +\ + cstep_b = ps_b; \ +\ + rstep_c = rs_c * MR; \ + cstep_c = cs_c * NR; \ +\ + istep_a = PACKMR * k; \ + istep_b = PACKNR * k_full; \ +\ + if ( bli_is_odd( istep_a ) ) istep_a += 1; \ + if ( bli_is_odd( istep_b ) ) istep_b += 1; \ +\ + /* Save the pack schemas of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_schema_a( schema_a, &aux ); \ + bli_auxinfo_set_schema_b( schema_b, &aux ); \ +\ + /* Save the imaginary stride of B to the auxinfo_t object. */ \ + bli_auxinfo_set_is_b( istep_b, &aux ); \ +\ + /* The 'thread' argument points to the thrinfo_t node for the 2nd (jr) + loop around the microkernel. Here we query the thrinfo_t node for the + 1st (ir) loop around the microkernel. */ \ + /*thrinfo_t* ir_thread = bli_thrinfo_sub_node( thread );*/ \ +\ + /* Query the number of threads and thread ids for each loop. */ \ + dim_t jr_nt = bli_thread_n_way( thread ); \ + dim_t jr_tid = bli_thread_work_id( thread ); \ + /*dim_t ir_nt = bli_thread_n_way( ir_thread ); \ + dim_t ir_tid = bli_thread_work_id( ir_thread );*/ \ +\ + dim_t jr_start, jr_end; \ + /*dim_t ir_start, ir_end;*/ \ + dim_t jr_inc; \ +\ + /* Use slab assignment of micropanels to threads in the 2nd loop for + the initial rectangular region of C (if it exists). + NOTE: Parallelism in the 1st loop is disabled for now. */ \ + bli_thread_range_jrir_sl( thread, n_iter, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \ + /*bli_thread_range_jrir_rr( caucus, m_iter, 1, FALSE, &ir_start, &ir_end, &ir_inc );*/ \ +\ + /* Loop over the n dimension (NR columns at a time). 
*/ \ + for ( j = jr_start; j < jr_end; j += jr_inc ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b2; \ +\ + b1 = b_cast + j * cstep_b; \ + c1 = c_cast + j * cstep_c; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. */ \ + b2 = b1; \ +\ + a1 = a_cast; \ + c11 = c1; \ +\ + /* Loop over the m dimension (MR rows at a time). */ \ + for ( i = 0; i < m_iter; ++i ) \ + { \ + diagoffa_i = diagoffa + ( doff_t )i*MR; \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + /* If the current panel of A intersects the diagonal, scale C + by beta. If it is strictly below the diagonal, scale by one. + This allows the current macro-kernel to work for both trmm + and trmm3. */ \ + if ( bli_intersects_diag_n( diagoffa_i, MR, k ) ) \ + { \ + ctype* restrict b1_i; \ + ctype* restrict a2; \ +\ + /* Determine the offset to and length of the panel that was + packed so we can index into the corresponding location in + b1. */ \ + off_a1011 = 0; \ + k_a1011 = bli_min( diagoffa_i + MR, k ); \ +\ + /* Compute the panel stride for the current diagonal- + intersecting micro-panel. */ \ + is_a_cur = k_a1011 * PACKMR; \ + is_a_cur += ( bli_is_odd( is_a_cur ) ? 1 : 0 ); \ + ps_a_cur = ( is_a_cur * ss_a_num ) / ss_a_den; \ +\ + /* NOTE: ir loop parallelism disabled for now. */ \ + /*if ( bli_trmm_my_iter( i, ir_thread ) ) {*/ \ +\ + b1_i = b1 + ( off_a1011 * PACKNR ) / off_scl; \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = a1; \ + if ( bli_is_last_iter_rr( i, m_iter, 0, 1 ) ) \ + { \ + a2 = a_cast; \ + b2 = b1; \ + if ( bli_is_last_iter_sl( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Save the 4m1/3m1 imaginary stride of A to the auxinfo_t + object. 
*/ \ + bli_auxinfo_set_is_a( is_a_cur, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k_a1011, \ + alpha_cast, \ + a1, \ + b1_i, \ + beta_cast, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Copy edge elements of C to the temporary buffer. */ \ + PASTEMAC(ch,copys_mxn)( m_cur, n_cur, \ + c11, rs_c, cs_c, \ + ct, rs_ct, cs_ct ); \ +\ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k_a1011, \ + alpha_cast, \ + a1, \ + b1_i, \ + beta_cast, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Copy the result to the edge of C. */ \ + PASTEMAC(ch,copys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + c11, rs_c, cs_c ); \ + } \ + /*}*/ \ +\ + a1 += ps_a_cur; \ + } \ + else if ( bli_is_strictly_below_diag_n( diagoffa_i, MR, k ) ) \ + { \ + /* NOTE: ir loop parallelism disabled for now. */ \ + /*if ( bli_trmm_my_iter( i, ir_thread ) ) {*/ \ +\ + ctype* restrict a2; \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = a1; \ + if ( bli_is_last_iter_rr( i, m_iter, 0, 1 ) ) \ + { \ + a2 = a_cast; \ + b2 = b1; \ + if ( bli_is_last_iter_sl( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Save the 4m1/3m1 imaginary stride of A to the auxinfo_t + object. */ \ + bli_auxinfo_set_is_a( istep_a, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + one, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the gemm micro-kernel. 
*/ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Add the result to the edge of C. */ \ + PASTEMAC(ch,adds_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + c11, rs_c, cs_c ); \ + } \ + /*}*/ \ +\ + a1 += rstep_a; \ + } \ +\ + c11 += rstep_c; \ + } \ + } \ +/*PASTEMAC(ch,fprintm)( stdout, "trmm_ll_ker_var2sl: a1", MR, k_a1011, a1, 1, MR, "%4.1f", "" );*/ \ +/*PASTEMAC(ch,fprintm)( stdout, "trmm_ll_ker_var2sl: b1", k_a1011, NR, b1_i, NR, 1, "%4.1f", "" );*/ \ +} + +INSERT_GENTFUNC_BASIC0( trmm_ll_ker_var2sl ) + diff --git a/frame/3/trmm/bli_trmm_lu_ker_var2rr.c b/frame/3/trmm/bli_trmm_lu_ker_var2rr.c new file mode 100644 index 000000000..ab1efa46d --- /dev/null +++ b/frame/3/trmm/bli_trmm_lu_ker_var2rr.c @@ -0,0 +1,542 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +*/ + +#include "blis.h" + +#define FUNCPTR_T gemm_fp + +typedef void (*FUNCPTR_T) + ( + doff_t diagoffa, + pack_t schema_a, + pack_t schema_b, + dim_t m, + dim_t n, + dim_t k, + void* alpha, + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, + void* beta, + void* c, inc_t rs_c, inc_t cs_c, + cntx_t* cntx, + rntm_t* rntm, + thrinfo_t* thread + ); + +static FUNCPTR_T GENARRAY(ftypes,trmm_lu_ker_var2rr); + +// +// -- Macrokernel functions for round-robin partitioning ----------------------- +// + +void bli_trmm_lu_ker_var2rr + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + num_t dt_exec = bli_obj_exec_dt( c ); + + doff_t diagoffa = bli_obj_diag_offset( a ); + + pack_t schema_a = bli_obj_pack_schema( a ); + pack_t schema_b = bli_obj_pack_schema( b ); + + dim_t m = bli_obj_length( c ); + dim_t n = bli_obj_width( c ); + dim_t k = bli_obj_width( a ); + + void* buf_a = bli_obj_buffer_at_off( a ); + inc_t cs_a = bli_obj_col_stride( a ); + dim_t pd_a = bli_obj_panel_dim( a ); + inc_t ps_a = bli_obj_panel_stride( a ); + + void* buf_b = bli_obj_buffer_at_off( b ); + inc_t rs_b = bli_obj_row_stride( b ); + dim_t pd_b = bli_obj_panel_dim( b ); + inc_t ps_b = bli_obj_panel_stride( b ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c = bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + obj_t scalar_a; + obj_t scalar_b; + 
+ void* buf_alpha; + void* buf_beta; + + FUNCPTR_T f; + + // Detach and multiply the scalars attached to A and B. + bli_obj_scalar_detach( a, &scalar_a ); + bli_obj_scalar_detach( b, &scalar_b ); + bli_mulsc( &scalar_a, &scalar_b ); + + // Grab the addresses of the internal scalar buffers for the scalar + // merged above and the scalar attached to C. + buf_alpha = bli_obj_internal_scalar_buffer( &scalar_b ); + buf_beta = bli_obj_internal_scalar_buffer( c ); + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_exec]; + + // Invoke the function. + f( diagoffa, + schema_a, + schema_b, + m, + n, + k, + buf_alpha, + buf_a, cs_a, pd_a, ps_a, + buf_b, rs_b, pd_b, ps_b, + buf_beta, + buf_c, rs_c, cs_c, + cntx, + rntm, + thread ); +} + + +#undef GENTFUNC +#define GENTFUNC( ctype, ch, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + doff_t diagoffa, \ + pack_t schema_a, \ + pack_t schema_b, \ + dim_t m, \ + dim_t n, \ + dim_t k, \ + void* alpha, \ + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, \ + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, \ + void* beta, \ + void* c, inc_t rs_c, inc_t cs_c, \ + cntx_t* cntx, \ + rntm_t* rntm, \ + thrinfo_t* thread \ + ) \ +{ \ + const num_t dt = PASTEMAC(ch,type); \ +\ + /* Alias some constants to simpler names. */ \ + const dim_t MR = pd_a; \ + const dim_t NR = pd_b; \ + const dim_t PACKMR = cs_a; \ + const dim_t PACKNR = rs_b; \ +\ + /* Query the context for the micro-kernel address and cast it to its + function pointer type. */ \ + PASTECH(ch,gemm_ukr_ft) \ + gemm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx ); \ +\ + /* Temporary C buffer for edge cases. Note that the strides of this + temporary buffer are set so that they match the storage of the + original C matrix. For example, if C is column-stored, ct will be + column-stored as well. 
*/ \ + ctype ct[ BLIS_STACK_BUF_MAX_SIZE \ + / sizeof( ctype ) ] \ + __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \ + const bool_t col_pref = bli_cntx_l3_vir_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \ + const inc_t rs_ct = ( col_pref ? 1 : NR ); \ + const inc_t cs_ct = ( col_pref ? MR : 1 ); \ +\ + ctype* restrict one = PASTEMAC(ch,1); \ + ctype* restrict zero = PASTEMAC(ch,0); \ + ctype* restrict a_cast = a; \ + ctype* restrict b_cast = b; \ + ctype* restrict c_cast = c; \ + ctype* restrict alpha_cast = alpha; \ + ctype* restrict beta_cast = beta; \ + ctype* restrict b1; \ + ctype* restrict c1; \ +\ + doff_t diagoffa_i; \ + dim_t k_full; \ + dim_t m_iter, m_left; \ + dim_t n_iter, n_left; \ + dim_t m_cur; \ + dim_t n_cur; \ + dim_t k_a1112; \ + dim_t off_a1112; \ + dim_t i, j; \ + inc_t rstep_a; \ + inc_t cstep_b; \ + inc_t rstep_c, cstep_c; \ + inc_t istep_a; \ + inc_t istep_b; \ + inc_t off_scl; \ + inc_t ss_a_num; \ + inc_t ss_a_den; \ + inc_t ps_a_cur; \ + inc_t is_a_cur; \ + auxinfo_t aux; \ +\ + /* + Assumptions/assertions: + rs_a == 1 + cs_a == PACKMR + pd_a == MR + ps_a == stride to next micro-panel of A + rs_b == PACKNR + cs_b == 1 + pd_b == NR + ps_b == stride to next micro-panel of B + rs_c == (no assumptions) + cs_c == (no assumptions) + */ \ +\ + /* Safety trap: Certain indexing within this macro-kernel does not + work as intended if both MR and NR are odd. */ \ + if ( ( bli_is_odd( PACKMR ) && bli_is_odd( NR ) ) || \ + ( bli_is_odd( PACKNR ) && bli_is_odd( MR ) ) ) bli_abort(); \ +\ + /* If any dimension is zero, return immediately. */ \ + if ( bli_zero_dim3( m, n, k ) ) return; \ +\ + /* Safeguard: If the current block of A is entirely below the diagonal, + it is implicitly zero. So we do nothing. */ \ + if ( bli_is_strictly_below_diag_n( diagoffa, m, k ) ) return; \ +\ + /* Compute k_full. For all trmm, k_full is simply k. 
This is + needed because some parameter combinations of trmm reduce k + to advance past zero regions in the triangular matrix, and + when computing the imaginary stride of B (the non-triangular + matrix), which is used by 4m1/3m1 implementations, we need + this unreduced value of k. */ \ + k_full = k; \ +\ + /* Compute indexing scaling factor for 4m or 3m. This is + needed because one of the packing register blocksizes (PACKMR + or PACKNR) is used to index into the micro-panels of the non- + triangular matrix when computing with a diagonal-intersecting + micro-panel of the triangular matrix. In the case of 4m or 3m, + real values are stored in both sub-panels, and so the indexing + needs to occur in units of real values. The value computed + here is divided into the complex pointer offset to cause the + pointer to be advanced by the correct value. */ \ + if ( bli_is_4mi_packed( schema_a ) || \ + bli_is_3mi_packed( schema_a ) || \ + bli_is_rih_packed( schema_a ) ) off_scl = 2; \ + else off_scl = 1; \ +\ + /* Compute the storage stride scaling. Usually this is just 1. + However, in the case of interleaved 3m, we need to scale the + offset by 3/2. And if we are packing real-only, imag-only, or + summed-only, we need to scale the computed panel sizes by 1/2 + to compensate for the fact that the pointer arithmetic occurs + in terms of complex elements rather than real elements. */ \ + if ( bli_is_3mi_packed( schema_a ) ) { ss_a_num = 3; ss_a_den = 2; } \ + else if ( bli_is_rih_packed( schema_a ) ) { ss_a_num = 1; ss_a_den = 2; } \ + else { ss_a_num = 1; ss_a_den = 1; } \ +\ + /* If there is a zero region to the left of where the diagonal of A + intersects the top edge of the block, adjust the pointer to B and + treat this case as if the diagonal offset were zero. Note that we + don't need to adjust the pointer to A since packm would have simply + skipped over the region that was not stored.
*/ \ + if ( diagoffa > 0 ) \ + { \ + i = diagoffa; \ + k = k - i; \ + diagoffa = 0; \ + b_cast = b_cast + ( i * PACKNR ) / off_scl; \ + } \ +\ + /* If there is a zero region below where the diagonal of A intersects the + right side of the block, shrink it to prevent "no-op" iterations from + executing. */ \ + if ( -diagoffa + k < m ) \ + { \ + m = -diagoffa + k; \ + } \ +\ + /* Clear the temporary C buffer in case it has any infs or NaNs. */ \ + PASTEMAC(ch,set0s_mxn)( MR, NR, \ + ct, rs_ct, cs_ct ); \ +\ + /* Compute number of primary and leftover components of the m and n + dimensions. */ \ + n_iter = n / NR; \ + n_left = n % NR; \ +\ + m_iter = m / MR; \ + m_left = m % MR; \ +\ + if ( n_left ) ++n_iter; \ + if ( m_left ) ++m_iter; \ +\ + /* Determine some increments used to step through A, B, and C. */ \ + rstep_a = ps_a; \ +\ + cstep_b = ps_b; \ +\ + rstep_c = rs_c * MR; \ + cstep_c = cs_c * NR; \ +\ + istep_a = PACKMR * k; \ + istep_b = PACKNR * k_full; \ +\ + if ( bli_is_odd( istep_a ) ) istep_a += 1; \ + if ( bli_is_odd( istep_b ) ) istep_b += 1; \ +\ + /* Save the pack schemas of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_schema_a( schema_a, &aux ); \ + bli_auxinfo_set_schema_b( schema_b, &aux ); \ +\ + /* Save the imaginary stride of B to the auxinfo_t object. */ \ + bli_auxinfo_set_is_b( istep_b, &aux ); \ +\ + /* The 'thread' argument points to the thrinfo_t node for the 2nd (jr) + loop around the microkernel. Here we query the thrinfo_t node for the + 1st (ir) loop around the microkernel. */ \ + /*thrinfo_t* ir_thread = bli_thrinfo_sub_node( thread );*/ \ +\ + /* Query the number of threads and thread ids for each loop. 
*/ \ + dim_t jr_nt = bli_thread_n_way( thread ); \ + dim_t jr_tid = bli_thread_work_id( thread ); \ + /*dim_t ir_nt = bli_thread_n_way( ir_thread ); \ + dim_t ir_tid = bli_thread_work_id( ir_thread );*/ \ +\ + dim_t jr_start, jr_end; \ + /*dim_t ir_start, ir_end;*/ \ + dim_t jr_inc; \ +\ + /* Use round-robin assignment of micropanels to threads in the 2nd loop for + the initial rectangular region of C (if it exists). */ \ + bli_thread_range_jrir_rr( thread, n_iter, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \ + /*bli_thread_range_jrir_rr( caucus, m_iter, 1, FALSE, &ir_start, &ir_end, &ir_inc );*/ \ +\ + /* Loop over the n dimension (NR columns at a time). */ \ + for ( j = jr_start; j < jr_end; j += jr_inc ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b2; \ +\ + b1 = b_cast + j * cstep_b; \ + c1 = c_cast + j * cstep_c; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. */ \ + b2 = b1; \ +\ + a1 = a_cast; \ + c11 = c1; \ +\ + /* Loop over the m dimension (MR rows at a time). */ \ + for ( i = 0; i < m_iter; ++i ) \ + { \ + diagoffa_i = diagoffa + ( doff_t )i*MR; \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + /* If the current panel of A intersects the diagonal, scale C + by beta. If it is strictly above the diagonal, scale by one. + This allows the current macro-kernel to work for both trmm + and trmm3. */ \ + if ( bli_intersects_diag_n( diagoffa_i, MR, k ) ) \ + { \ + ctype* restrict b1_i; \ + ctype* restrict a2; \ +\ + /* Determine the offset to and length of the panel that was + packed so we can index into the corresponding location in + b1. */ \ + off_a1112 = diagoffa_i; \ + k_a1112 = k - off_a1112; \ +\ + /* Compute the panel stride for the current diagonal- + intersecting micro-panel. */ \ + is_a_cur = k_a1112 * PACKMR; \ + is_a_cur += ( bli_is_odd( is_a_cur ) ? 
1 : 0 ); \ + ps_a_cur = ( is_a_cur * ss_a_num ) / ss_a_den; \ +\ + /* NOTE: ir loop parallelism disabled for now. */ \ + /*if ( bli_trmm_my_iter( i, ir_thread ) ) {*/ \ +\ + b1_i = b1 + ( off_a1112 * PACKNR ) / off_scl; \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = a1; \ + if ( bli_is_last_iter_rr( i, m_iter, 0, 1 ) ) \ + { \ + a2 = a_cast; \ + b2 = b1; \ + if ( bli_is_last_iter_rr( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Save the 4m1/3m1 imaginary stride of A to the auxinfo_t + object. */ \ + bli_auxinfo_set_is_a( is_a_cur, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k_a1112, \ + alpha_cast, \ + a1, \ + b1_i, \ + beta_cast, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Copy edge elements of C to the temporary buffer. */ \ + PASTEMAC(ch,copys_mxn)( m_cur, n_cur, \ + c11, rs_c, cs_c, \ + ct, rs_ct, cs_ct ); \ +\ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k_a1112, \ + alpha_cast, \ + a1, \ + b1_i, \ + beta_cast, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Copy the result to the edge of C. */ \ + PASTEMAC(ch,copys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + c11, rs_c, cs_c ); \ + } \ + /*}*/ \ +\ + a1 += ps_a_cur; \ + } \ + else if ( bli_is_strictly_above_diag_n( diagoffa_i, MR, k ) ) \ + { \ + /* NOTE: ir loop parallelism disabled for now. */ \ + /*if ( bli_trmm_my_iter( i, ir_thread ) ) {*/ \ +\ + ctype* restrict a2; \ +\ + /* Compute the addresses of the next panels of A and B. 
*/ \ + a2 = a1; \ + if ( bli_is_last_iter_rr( i, m_iter, 0, 1 ) ) \ + { \ + a2 = a_cast; \ + b2 = b1; \ + if ( bli_is_last_iter_rr( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Save the 4m1/3m1 imaginary stride of A to the auxinfo_t + object. */ \ + bli_auxinfo_set_is_a( istep_a, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + one, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Add the result to the edge of C. */ \ + PASTEMAC(ch,adds_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + c11, rs_c, cs_c ); \ + } \ + /*}*/ \ +\ + a1 += rstep_a; \ + } \ +\ + c11 += rstep_c; \ + } \ + } \ +\ +/*PASTEMAC(ch,fprintm)( stdout, "trmm_lu_ker_var2rr: a1", MR, k_a1112, a1, 1, MR, "%4.1f", "" );*/ \ +/*PASTEMAC(ch,fprintm)( stdout, "trmm_lu_ker_var2rr: b1", k_a1112, NR, b1_i, NR, 1, "%4.1f", "" );*/ \ +} + +INSERT_GENTFUNC_BASIC0( trmm_lu_ker_var2rr ) + diff --git a/frame/3/trmm/bli_trmm_lu_ker_var2sl.c b/frame/3/trmm/bli_trmm_lu_ker_var2sl.c new file mode 100644 index 000000000..1bb4e1b6d --- /dev/null +++ b/frame/3/trmm/bli_trmm_lu_ker_var2sl.c @@ -0,0 +1,542 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. 
+ + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +*/ + +#include "blis.h" + +#define FUNCPTR_T gemm_fp + +typedef void (*FUNCPTR_T) + ( + doff_t diagoffa, + pack_t schema_a, + pack_t schema_b, + dim_t m, + dim_t n, + dim_t k, + void* alpha, + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, + void* beta, + void* c, inc_t rs_c, inc_t cs_c, + cntx_t* cntx, + rntm_t* rntm, + thrinfo_t* thread + ); + +static FUNCPTR_T GENARRAY(ftypes,trmm_lu_ker_var2sl); + +// +// -- Macrokernel functions for slab partitioning ------------------------------ +// + +void bli_trmm_lu_ker_var2sl + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + num_t dt_exec = bli_obj_exec_dt( c ); + + doff_t diagoffa = bli_obj_diag_offset( a ); + + pack_t schema_a = bli_obj_pack_schema( a ); + pack_t schema_b = bli_obj_pack_schema( b ); + + dim_t m = bli_obj_length( c ); + dim_t n = bli_obj_width( c ); + dim_t k = bli_obj_width( a ); + + void* buf_a = bli_obj_buffer_at_off( a ); + inc_t cs_a = bli_obj_col_stride( a ); + dim_t pd_a = bli_obj_panel_dim( a ); + inc_t ps_a = bli_obj_panel_stride( a ); + + void* buf_b = bli_obj_buffer_at_off( b ); + inc_t rs_b = bli_obj_row_stride( b ); + dim_t pd_b = bli_obj_panel_dim( b ); + inc_t ps_b = bli_obj_panel_stride( b ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c = bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + obj_t scalar_a; + obj_t scalar_b; + + void* buf_alpha; + void* buf_beta; + + FUNCPTR_T f; + + // Detach and multiply the scalars attached to A and B. + bli_obj_scalar_detach( a, &scalar_a ); + bli_obj_scalar_detach( b, &scalar_b ); + bli_mulsc( &scalar_a, &scalar_b ); + + // Grab the addresses of the internal scalar buffers for the scalar + // merged above and the scalar attached to C. 
+ buf_alpha = bli_obj_internal_scalar_buffer( &scalar_b ); + buf_beta = bli_obj_internal_scalar_buffer( c ); + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_exec]; + + // Invoke the function. + f( diagoffa, + schema_a, + schema_b, + m, + n, + k, + buf_alpha, + buf_a, cs_a, pd_a, ps_a, + buf_b, rs_b, pd_b, ps_b, + buf_beta, + buf_c, rs_c, cs_c, + cntx, + rntm, + thread ); +} + + +#undef GENTFUNC +#define GENTFUNC( ctype, ch, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + doff_t diagoffa, \ + pack_t schema_a, \ + pack_t schema_b, \ + dim_t m, \ + dim_t n, \ + dim_t k, \ + void* alpha, \ + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, \ + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, \ + void* beta, \ + void* c, inc_t rs_c, inc_t cs_c, \ + cntx_t* cntx, \ + rntm_t* rntm, \ + thrinfo_t* thread \ + ) \ +{ \ + const num_t dt = PASTEMAC(ch,type); \ +\ + /* Alias some constants to simpler names. */ \ + const dim_t MR = pd_a; \ + const dim_t NR = pd_b; \ + const dim_t PACKMR = cs_a; \ + const dim_t PACKNR = rs_b; \ +\ + /* Query the context for the micro-kernel address and cast it to its + function pointer type. */ \ + PASTECH(ch,gemm_ukr_ft) \ + gemm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx ); \ +\ + /* Temporary C buffer for edge cases. Note that the strides of this + temporary buffer are set so that they match the storage of the + original C matrix. For example, if C is column-stored, ct will be + column-stored as well. */ \ + ctype ct[ BLIS_STACK_BUF_MAX_SIZE \ + / sizeof( ctype ) ] \ + __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \ + const bool_t col_pref = bli_cntx_l3_vir_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \ + const inc_t rs_ct = ( col_pref ? 1 : NR ); \ + const inc_t cs_ct = ( col_pref ? 
MR : 1 ); \ +\ + ctype* restrict one = PASTEMAC(ch,1); \ + ctype* restrict zero = PASTEMAC(ch,0); \ + ctype* restrict a_cast = a; \ + ctype* restrict b_cast = b; \ + ctype* restrict c_cast = c; \ + ctype* restrict alpha_cast = alpha; \ + ctype* restrict beta_cast = beta; \ + ctype* restrict b1; \ + ctype* restrict c1; \ +\ + doff_t diagoffa_i; \ + dim_t k_full; \ + dim_t m_iter, m_left; \ + dim_t n_iter, n_left; \ + dim_t m_cur; \ + dim_t n_cur; \ + dim_t k_a1112; \ + dim_t off_a1112; \ + dim_t i, j; \ + inc_t rstep_a; \ + inc_t cstep_b; \ + inc_t rstep_c, cstep_c; \ + inc_t istep_a; \ + inc_t istep_b; \ + inc_t off_scl; \ + inc_t ss_a_num; \ + inc_t ss_a_den; \ + inc_t ps_a_cur; \ + inc_t is_a_cur; \ + auxinfo_t aux; \ +\ + /* + Assumptions/assertions: + rs_a == 1 + cs_a == PACKMR + pd_a == MR + ps_a == stride to next micro-panel of A + rs_b == PACKNR + cs_b == 1 + pd_b == NR + ps_b == stride to next micro-panel of B + rs_c == (no assumptions) + cs_c == (no assumptions) + */ \ +\ + /* Safety trap: Certain indexing within this macro-kernel does not + work as intended if both MR and NR are odd. */ \ + if ( ( bli_is_odd( PACKMR ) && bli_is_odd( NR ) ) || \ + ( bli_is_odd( PACKNR ) && bli_is_odd( MR ) ) ) bli_abort(); \ +\ + /* If any dimension is zero, return immediately. */ \ + if ( bli_zero_dim3( m, n, k ) ) return; \ +\ + /* Safeguard: If the current block of A is entirely below the diagonal, + it is implicitly zero. So we do nothing. */ \ + if ( bli_is_strictly_below_diag_n( diagoffa, m, k ) ) return; \ +\ + /* Compute k_full. For all trmm, k_full is simply k. This is + needed because some parameter combinations of trmm reduce k + to advance past zero regions in the triangular matrix, and + when computing the imaginary stride of B (the non-triangular + matrix), which is used by 4m1/3m1 implementations, we need + this unreduced value of k. */ \ + k_full = k; \ +\ + /* Compute indexing scaling factor for 4m or 3m.
This is + needed because one of the packing register blocksizes (PACKMR + or PACKNR) is used to index into the micro-panels of the non- + triangular matrix when computing with a diagonal-intersecting + micro-panel of the triangular matrix. In the case of 4m or 3m, + real values are stored in both sub-panels, and so the indexing + needs to occur in units of real values. The value computed + here is divided into the complex pointer offset to cause the + pointer to be advanced by the correct value. */ \ + if ( bli_is_4mi_packed( schema_a ) || \ + bli_is_3mi_packed( schema_a ) || \ + bli_is_rih_packed( schema_a ) ) off_scl = 2; \ + else off_scl = 1; \ +\ + /* Compute the storage stride scaling. Usually this is just 1. + However, in the case of interleaved 3m, we need to scale the + offset by 3/2. And if we are packing real-only, imag-only, or + summed-only, we need to scale the computed panel sizes by 1/2 + to compensate for the fact that the pointer arithmetic occurs + in terms of complex elements rather than real elements. */ \ + if ( bli_is_3mi_packed( schema_a ) ) { ss_a_num = 3; ss_a_den = 2; } \ + else if ( bli_is_rih_packed( schema_a ) ) { ss_a_num = 1; ss_a_den = 2; } \ + else { ss_a_num = 1; ss_a_den = 1; } \ +\ + /* If there is a zero region to the left of where the diagonal of A + intersects the top edge of the block, adjust the pointer to B and + treat this case as if the diagonal offset were zero. Note that we + don't need to adjust the pointer to A since packm would have simply + skipped over the region that was not stored. */ \ + if ( diagoffa > 0 ) \ + { \ + i = diagoffa; \ + k = k - i; \ + diagoffa = 0; \ + b_cast = b_cast + ( i * PACKNR ) / off_scl; \ + } \ +\ + /* If there is a zero region below where the diagonal of A intersects the + right side of the block, shrink it to prevent "no-op" iterations from + executing. 
*/ \ + if ( -diagoffa + k < m ) \ + { \ + m = -diagoffa + k; \ + } \ +\ + /* Clear the temporary C buffer in case it has any infs or NaNs. */ \ + PASTEMAC(ch,set0s_mxn)( MR, NR, \ + ct, rs_ct, cs_ct ); \ +\ + /* Compute number of primary and leftover components of the m and n + dimensions. */ \ + n_iter = n / NR; \ + n_left = n % NR; \ +\ + m_iter = m / MR; \ + m_left = m % MR; \ +\ + if ( n_left ) ++n_iter; \ + if ( m_left ) ++m_iter; \ +\ + /* Determine some increments used to step through A, B, and C. */ \ + rstep_a = ps_a; \ +\ + cstep_b = ps_b; \ +\ + rstep_c = rs_c * MR; \ + cstep_c = cs_c * NR; \ +\ + istep_a = PACKMR * k; \ + istep_b = PACKNR * k_full; \ +\ + if ( bli_is_odd( istep_a ) ) istep_a += 1; \ + if ( bli_is_odd( istep_b ) ) istep_b += 1; \ +\ + /* Save the pack schemas of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_schema_a( schema_a, &aux ); \ + bli_auxinfo_set_schema_b( schema_b, &aux ); \ +\ + /* Save the imaginary stride of B to the auxinfo_t object. */ \ + bli_auxinfo_set_is_b( istep_b, &aux ); \ +\ + /* The 'thread' argument points to the thrinfo_t node for the 2nd (jr) + loop around the microkernel. Here we query the thrinfo_t node for the + 1st (ir) loop around the microkernel. */ \ + /*thrinfo_t* ir_thread = bli_thrinfo_sub_node( thread );*/ \ +\ + /* Query the number of threads and thread ids for each loop. */ \ + dim_t jr_nt = bli_thread_n_way( thread ); \ + dim_t jr_tid = bli_thread_work_id( thread ); \ + /*dim_t ir_nt = bli_thread_n_way( ir_thread ); \ + dim_t ir_tid = bli_thread_work_id( ir_thread );*/ \ +\ + dim_t jr_start, jr_end; \ + /*dim_t ir_start, ir_end;*/ \ + dim_t jr_inc; \ +\ + /* Use slab assignment of micropanels to threads in the 2nd loop for + the initial rectangular region of C (if it exists). 
*/ \ + bli_thread_range_jrir_sl( thread, n_iter, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \ + /*bli_thread_range_jrir_rr( caucus, m_iter, 1, FALSE, &ir_start, &ir_end, &ir_inc );*/ \ +\ + /* Loop over the n dimension (NR columns at a time). */ \ + for ( j = jr_start; j < jr_end; j += jr_inc ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b2; \ +\ + b1 = b_cast + j * cstep_b; \ + c1 = c_cast + j * cstep_c; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. */ \ + b2 = b1; \ +\ + a1 = a_cast; \ + c11 = c1; \ +\ + /* Loop over the m dimension (MR rows at a time). */ \ + for ( i = 0; i < m_iter; ++i ) \ + { \ + diagoffa_i = diagoffa + ( doff_t )i*MR; \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + /* If the current panel of A intersects the diagonal, scale C + by beta. If it is strictly above the diagonal, scale by one. + This allows the current macro-kernel to work for both trmm + and trmm3. */ \ + if ( bli_intersects_diag_n( diagoffa_i, MR, k ) ) \ + { \ + ctype* restrict b1_i; \ + ctype* restrict a2; \ +\ + /* Determine the offset to and length of the panel that was + packed so we can index into the corresponding location in + b1. */ \ + off_a1112 = diagoffa_i; \ + k_a1112 = k - off_a1112; \ +\ + /* Compute the panel stride for the current diagonal- + intersecting micro-panel. */ \ + is_a_cur = k_a1112 * PACKMR; \ + is_a_cur += ( bli_is_odd( is_a_cur ) ? 1 : 0 ); \ + ps_a_cur = ( is_a_cur * ss_a_num ) / ss_a_den; \ +\ + /* NOTE: ir loop parallelism disabled for now. */ \ + /*if ( bli_trmm_my_iter( i, ir_thread ) ) {*/ \ +\ + b1_i = b1 + ( off_a1112 * PACKNR ) / off_scl; \ +\ + /* Compute the addresses of the next panels of A and B. 
*/ \ + a2 = a1; \ + if ( bli_is_last_iter_rr( i, m_iter, 0, 1 ) ) \ + { \ + a2 = a_cast; \ + b2 = b1; \ + if ( bli_is_last_iter_sl( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Save the 4m1/3m1 imaginary stride of A to the auxinfo_t + object. */ \ + bli_auxinfo_set_is_a( is_a_cur, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k_a1112, \ + alpha_cast, \ + a1, \ + b1_i, \ + beta_cast, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Copy edge elements of C to the temporary buffer. */ \ + PASTEMAC(ch,copys_mxn)( m_cur, n_cur, \ + c11, rs_c, cs_c, \ + ct, rs_ct, cs_ct ); \ +\ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k_a1112, \ + alpha_cast, \ + a1, \ + b1_i, \ + beta_cast, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Copy the result to the edge of C. */ \ + PASTEMAC(ch,copys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + c11, rs_c, cs_c ); \ + } \ + /*}*/ \ +\ + a1 += ps_a_cur; \ + } \ + else if ( bli_is_strictly_above_diag_n( diagoffa_i, MR, k ) ) \ + { \ + /* NOTE: ir loop parallelism disabled for now. */ \ + /*if ( bli_trmm_my_iter( i, ir_thread ) ) {*/ \ +\ + ctype* restrict a2; \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = a1; \ + if ( bli_is_last_iter_rr( i, m_iter, 0, 1 ) ) \ + { \ + a2 = a_cast; \ + b2 = b1; \ + if ( bli_is_last_iter_sl( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Save the 4m1/3m1 imaginary stride of A to the auxinfo_t + object. 
*/ \ + bli_auxinfo_set_is_a( istep_a, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + one, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Add the result to the edge of C. */ \ + PASTEMAC(ch,adds_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + c11, rs_c, cs_c ); \ + } \ + /*}*/ \ +\ + a1 += rstep_a; \ + } \ +\ + c11 += rstep_c; \ + } \ + } \ +\ +/*PASTEMAC(ch,fprintm)( stdout, "trmm_lu_ker_var2sl: a1", MR, k_a1112, a1, 1, MR, "%4.1f", "" );*/ \ +/*PASTEMAC(ch,fprintm)( stdout, "trmm_lu_ker_var2sl: b1", k_a1112, NR, b1_i, NR, 1, "%4.1f", "" );*/ \ +} + +INSERT_GENTFUNC_BASIC0( trmm_lu_ker_var2sl ) + diff --git a/frame/3/trmm/bli_trmm_rl_ker_var2rr.c b/frame/3/trmm/bli_trmm_rl_ker_var2rr.c new file mode 100644 index 000000000..1b1549951 --- /dev/null +++ b/frame/3/trmm/bli_trmm_rl_ker_var2rr.c @@ -0,0 +1,598 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. 
+ - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +*/ + +#include "blis.h" + +#define FUNCPTR_T gemm_fp + +typedef void (*FUNCPTR_T) + ( + doff_t diagoffb, + pack_t schema_a, + pack_t schema_b, + dim_t m, + dim_t n, + dim_t k, + void* alpha, + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, + void* beta, + void* c, inc_t rs_c, inc_t cs_c, + cntx_t* cntx, + rntm_t* rntm, + thrinfo_t* thread + ); + +static FUNCPTR_T GENARRAY(ftypes,trmm_rl_ker_var2rr); + +// +// -- Macrokernel functions for round-robin partitioning ----------------------- +// + +void bli_trmm_rl_ker_var2rr + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + num_t dt_exec = bli_obj_exec_dt( c ); + + doff_t diagoffb = bli_obj_diag_offset( b ); + + pack_t schema_a = bli_obj_pack_schema( a ); + pack_t schema_b = bli_obj_pack_schema( b ); + + dim_t m = bli_obj_length( c ); + dim_t n = bli_obj_width( c ); + dim_t k = bli_obj_width( a ); + + void* buf_a = bli_obj_buffer_at_off( 
a ); + inc_t cs_a = bli_obj_col_stride( a ); + dim_t pd_a = bli_obj_panel_dim( a ); + inc_t ps_a = bli_obj_panel_stride( a ); + + void* buf_b = bli_obj_buffer_at_off( b ); + inc_t rs_b = bli_obj_row_stride( b ); + dim_t pd_b = bli_obj_panel_dim( b ); + inc_t ps_b = bli_obj_panel_stride( b ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c = bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + obj_t scalar_a; + obj_t scalar_b; + + void* buf_alpha; + void* buf_beta; + + FUNCPTR_T f; + + // Detach and multiply the scalars attached to A and B. + bli_obj_scalar_detach( a, &scalar_a ); + bli_obj_scalar_detach( b, &scalar_b ); + bli_mulsc( &scalar_a, &scalar_b ); + + // Grab the addresses of the internal scalar buffers for the scalar + // merged above and the scalar attached to C. + buf_alpha = bli_obj_internal_scalar_buffer( &scalar_b ); + buf_beta = bli_obj_internal_scalar_buffer( c ); + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_exec]; + + // Invoke the function. + f( diagoffb, + schema_a, + schema_b, + m, + n, + k, + buf_alpha, + buf_a, cs_a, pd_a, ps_a, + buf_b, rs_b, pd_b, ps_b, + buf_beta, + buf_c, rs_c, cs_c, + cntx, + rntm, + thread ); +} + + +#undef GENTFUNC +#define GENTFUNC( ctype, ch, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + doff_t diagoffb, \ + pack_t schema_a, \ + pack_t schema_b, \ + dim_t m, \ + dim_t n, \ + dim_t k, \ + void* alpha, \ + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, \ + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, \ + void* beta, \ + void* c, inc_t rs_c, inc_t cs_c, \ + cntx_t* cntx, \ + rntm_t* rntm, \ + thrinfo_t* thread \ + ) \ +{ \ + const num_t dt = PASTEMAC(ch,type); \ +\ + /* Alias some constants to simpler names. */ \ + const dim_t MR = pd_a; \ + const dim_t NR = pd_b; \ + const dim_t PACKMR = cs_a; \ + const dim_t PACKNR = rs_b; \ +\ + /* Query the context for the micro-kernel address and cast it to its + function pointer type. 
*/ \ + PASTECH(ch,gemm_ukr_ft) \ + gemm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx ); \ +\ + /* Temporary C buffer for edge cases. Note that the strides of this + temporary buffer are set so that they match the storage of the + original C matrix. For example, if C is column-stored, ct will be + column-stored as well. */ \ + ctype ct[ BLIS_STACK_BUF_MAX_SIZE \ + / sizeof( ctype ) ] \ + __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \ + const bool_t col_pref = bli_cntx_l3_vir_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \ + const inc_t rs_ct = ( col_pref ? 1 : NR ); \ + const inc_t cs_ct = ( col_pref ? MR : 1 ); \ +\ + ctype* restrict one = PASTEMAC(ch,1); \ + ctype* restrict zero = PASTEMAC(ch,0); \ + ctype* restrict a_cast = a; \ + ctype* restrict b_cast = b; \ + ctype* restrict c_cast = c; \ + ctype* restrict alpha_cast = alpha; \ + ctype* restrict beta_cast = beta; \ + ctype* restrict b1; \ + ctype* restrict c1; \ +\ + doff_t diagoffb_j; \ + dim_t k_full; \ + dim_t m_iter, m_left; \ + dim_t n_iter, n_left; \ + dim_t m_cur; \ + dim_t n_cur; \ + dim_t k_b1121; \ + dim_t off_b1121; \ + dim_t i, j; \ + inc_t rstep_a; \ + inc_t cstep_b; \ + inc_t rstep_c, cstep_c; \ + inc_t istep_a; \ + inc_t istep_b; \ + inc_t off_scl; \ + inc_t ss_b_num; \ + inc_t ss_b_den; \ + inc_t ps_b_cur; \ + inc_t is_b_cur; \ + auxinfo_t aux; \ +\ + /* + Assumptions/assertions: + rs_a == 1 + cs_a == PACKMR + pd_a == MR + ps_a == stride to next micro-panel of A + rs_b == PACKNR + cs_b == 1 + pd_b == NR + ps_b == stride to next micro-panel of B + rs_c == (no assumptions) + cs_c == (no assumptions) + */ \ +\ + /* Safety trap: Certain indexing within this macro-kernel does not + work as intended if both MR and NR are odd. */ \ + if ( ( bli_is_odd( PACKMR ) && bli_is_odd( NR ) ) || \ + ( bli_is_odd( PACKNR ) && bli_is_odd( MR ) ) ) bli_abort(); \ +\ + /* If any dimension is zero, return immediately. 
*/ \ + if ( bli_zero_dim3( m, n, k ) ) return; \ +\ + /* Safeguard: If the current panel of B is entirely above the diagonal, + it is implicitly zero. So we do nothing. */ \ + if ( bli_is_strictly_above_diag_n( diagoffb, k, n ) ) return; \ +\ + /* Compute k_full. For all trmm, k_full is simply k. This is + needed because some parameter combinations of trmm reduce k + to advance past zero regions in the triangular matrix, and + when computing the imaginary stride of A (the non-triangular + matrix), which is used by 4m1/3m1 implementations, we need + this unreduced value of k. */ \ + k_full = k; \ +\ + /* Compute indexing scaling factor for 4m or 3m. This is + needed because one of the packing register blocksizes (PACKMR + or PACKNR) is used to index into the micro-panels of the non- + triangular matrix when computing with a diagonal-intersecting + micro-panel of the triangular matrix. In the case of 4m or 3m, + real values are stored in both sub-panels, and so the indexing + needs to occur in units of real values. The value computed + here is divided into the complex pointer offset to cause the + pointer to be advanced by the correct value. */ \ + if ( bli_is_4mi_packed( schema_b ) || \ + bli_is_3mi_packed( schema_b ) || \ + bli_is_rih_packed( schema_b ) ) off_scl = 2; \ + else off_scl = 1; \ +\ + /* Compute the storage stride scaling. Usually this is just 1. + However, in the case of interleaved 3m, we need to scale the + offset by 3/2. And if we are packing real-only, imag-only, or + summed-only, we need to scale the computed panel sizes by 1/2 + to compensate for the fact that the pointer arithmetic occurs + in terms of complex elements rather than real elements.
*/ \ + if ( bli_is_3mi_packed( schema_b ) ) { ss_b_num = 3; ss_b_den = 2; } \ + else if ( bli_is_rih_packed( schema_b ) ) { ss_b_num = 1; ss_b_den = 2; } \ + else { ss_b_num = 1; ss_b_den = 1; } \ +\ + /* If there is a zero region above where the diagonal of B intersects + the left edge of the panel, adjust the pointer to A and treat this + case as if the diagonal offset were zero. Note that we don't need to + adjust the pointer to B since packm would have simply skipped over + the region that was not stored. */ \ + if ( diagoffb < 0 ) \ + { \ + j = -diagoffb; \ + k = k - j; \ + diagoffb = 0; \ + a_cast = a_cast + ( j * PACKMR ) / off_scl; \ + } \ +\ + /* If there is a zero region to the right of where the diagonal + of B intersects the bottom of the panel, shrink it to prevent + "no-op" iterations from executing. */ \ + if ( diagoffb + k < n ) \ + { \ + n = diagoffb + k; \ + } \ +\ + /* Clear the temporary C buffer in case it has any infs or NaNs. */ \ + PASTEMAC(ch,set0s_mxn)( MR, NR, \ + ct, rs_ct, cs_ct ); \ +\ + /* Compute number of primary and leftover components of the m and n + dimensions. */ \ + n_iter = n / NR; \ + n_left = n % NR; \ +\ + m_iter = m / MR; \ + m_left = m % MR; \ +\ + if ( n_left ) ++n_iter; \ + if ( m_left ) ++m_iter; \ +\ + /* Determine some increments used to step through A, B, and C. */ \ + rstep_a = ps_a; \ +\ + cstep_b = ps_b; \ +\ + rstep_c = rs_c * MR; \ + cstep_c = cs_c * NR; \ +\ + istep_a = PACKMR * k_full; \ + istep_b = PACKNR * k; \ +\ + if ( bli_is_odd( istep_a ) ) istep_a += 1; \ + if ( bli_is_odd( istep_b ) ) istep_b += 1; \ +\ + /* Save the pack schemas of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_schema_a( schema_a, &aux ); \ + bli_auxinfo_set_schema_b( schema_b, &aux ); \ +\ + /* Save the imaginary stride of A to the auxinfo_t object. 
*/ \ + bli_auxinfo_set_is_a( istep_a, &aux ); \ +\ + thrinfo_t* caucus = bli_thrinfo_sub_node( thread ); \ +\ + dim_t jr_nt = bli_thread_n_way( thread ); \ + dim_t jr_tid = bli_thread_work_id( thread ); \ + dim_t ir_nt = bli_thread_n_way( caucus ); \ + dim_t ir_tid = bli_thread_work_id( caucus ); \ +\ + dim_t jr_start, jr_end; \ + dim_t ir_start, ir_end; \ + dim_t jr_inc, ir_inc; \ +\ + /* Note that we partition the 2nd loop into two regions: the rectangular + part of B, and the triangular portion. */ \ + dim_t n_iter_rct; \ + dim_t n_iter_tri; \ +\ + if ( bli_is_strictly_below_diag_n( diagoffb, m, n ) ) \ + { \ + /* If the entire panel of B does not intersect the diagonal, there is + no triangular region, and therefore we can skip the second set of + loops. */ \ + n_iter_rct = n_iter; \ + n_iter_tri = 0; \ + } \ + else \ + { \ + /* If the panel of B does intersect the diagonal, compute the number of + iterations in the rectangular region by dividing NR into the diagonal + offset. (There should never be any remainder in this division.) The + number of iterations in the triangular (or trapezoidal) region is + computed as the remaining number of iterations in the n dimension. */ \ + n_iter_rct = diagoffb / NR; \ + n_iter_tri = n_iter - n_iter_rct; \ + } \ +\ + /* Use round-robin assignment of micropanels to threads in the 2nd and 1st + loops for the initial rectangular region of B (if it exists). */ \ + bli_thread_range_jrir_rr( thread, n_iter_rct, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \ + bli_thread_range_jrir_rr( caucus, m_iter, 1, FALSE, &ir_start, &ir_end, &ir_inc ); \ +\ + /* Loop over the n dimension (NR columns at a time). */ \ + for ( j = jr_start; j < jr_end; j += jr_inc ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b2; \ +\ + b1 = b_cast + j * cstep_b; \ + c1 = c_cast + j * cstep_c; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. 
*/ \ + b2 = b1; \ +\ + { \ + /* Save the 4m1/3m1 imaginary stride of B to the auxinfo_t + object. */ \ + bli_auxinfo_set_is_b( istep_b, &aux ); \ +\ + /* Loop over the m dimension (MR rows at a time). */ \ + for ( i = ir_start; i < ir_end; i += ir_inc ) \ + { \ + ctype* restrict a2; \ +\ + a1 = a_cast + i * rstep_a; \ + c11 = c1 + i * rstep_c; \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = bli_trmm_get_next_a_upanel( a1, rstep_a, ir_inc ); \ + if ( bli_is_last_iter_rr( i, m_iter, ir_tid, ir_nt ) ) \ + { \ + a2 = a_cast; \ + b2 = bli_trmm_get_next_b_upanel( b1, cstep_b, jr_inc ); \ + if ( bli_is_last_iter_rr( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + one, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Add the result to the edge of C. */ \ + PASTEMAC(ch,adds_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + c11, rs_c, cs_c ); \ + } \ + } \ + } \ + } \ +\ + /* If there is no triangular region, then we're done. */ \ + if ( n_iter_tri == 0 ) return; \ +\ + /* Use round-robin assignment of micropanels to threads in the 2nd loop + for the remaining triangular region of B (if it exists). 
+ NOTE: We don't need to call bli_thread_range_jrir*() here since we + employ a hack that calls for each thread to execute every iteration + of the jr and ir loops but skip all but the pointer increment for + iterations that are not assigned to it. */ \ +\ + /* Advance the starting b1 and c1 pointers to the positions corresponding + to the start of the triangular region of B. */ \ + jr_start = n_iter_rct; \ + b1 = b_cast + jr_start * cstep_b; \ + c1 = c_cast + jr_start * cstep_c; \ +\ + /* Loop over the n dimension (NR columns at a time). */ \ + for ( j = jr_start; j < n_iter; ++j ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b2; \ +\ + diagoffb_j = diagoffb - ( doff_t )j*NR; \ +\ + /* Determine the offset to the beginning of the panel that + was packed so we can index into the corresponding location + in A. Then compute the length of that panel. */ \ + off_b1121 = bli_max( -diagoffb_j, 0 ); \ + k_b1121 = k - off_b1121; \ +\ + a1 = a_cast; \ + c11 = c1; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. */ \ + b2 = b1; \ +\ + /* If the current panel of B intersects the diagonal, scale C + by beta. If it is strictly below the diagonal, scale by one. + This allows the current macro-kernel to work for both trmm + and trmm3. */ \ + { \ + /* Compute the panel stride for the current diagonal- + intersecting micro-panel. */ \ + is_b_cur = k_b1121 * PACKNR; \ + is_b_cur += ( bli_is_odd( is_b_cur ) ? 1 : 0 ); \ + ps_b_cur = ( is_b_cur * ss_b_num ) / ss_b_den; \ +\ + if ( bli_trmm_my_iter( j, thread ) ) { \ +\ + /* Save the 4m1/3m1 imaginary stride of B to the auxinfo_t + object. */ \ + bli_auxinfo_set_is_b( is_b_cur, &aux ); \ +\ + /* Loop over the m dimension (MR rows at a time). 
*/ \ + for ( i = 0; i < m_iter; ++i ) \ + { \ + if ( bli_trmm_my_iter( i, caucus ) ) { \ +\ + ctype* restrict a1_i; \ + ctype* restrict a2; \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + a1_i = a1 + ( off_b1121 * PACKMR ) / off_scl; \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = a1; \ + if ( bli_is_last_iter_rr( i, m_iter, 0, 1 ) ) \ + { \ + a2 = a_cast; \ + b2 = b1; \ + if ( bli_is_last_iter_rr( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k_b1121, \ + alpha_cast, \ + a1_i, \ + b1, \ + beta_cast, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Copy edge elements of C to the temporary buffer. */ \ + PASTEMAC(ch,copys_mxn)( m_cur, n_cur, \ + c11, rs_c, cs_c, \ + ct, rs_ct, cs_ct ); \ +\ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k_b1121, \ + alpha_cast, \ + a1_i, \ + b1, \ + beta_cast, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Copy the result to the edge of C. 
*/ \ + PASTEMAC(ch,copys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + c11, rs_c, cs_c ); \ + } \ + } \ +\ + a1 += rstep_a; \ + c11 += rstep_c; \ + } \ + } \ +\ + b1 += ps_b_cur; \ + } \ +\ + c1 += cstep_c; \ + } \ +\ +/*PASTEMAC(ch,fprintm)( stdout, "trmm_rl_ker_var2rr: a1", MR, k_b1121, a1, 1, MR, "%4.1f", "" );*/ \ +/*PASTEMAC(ch,fprintm)( stdout, "trmm_rl_ker_var2rr: b1", k_b1121, NR, b1_i, NR, 1, "%4.1f", "" );*/ \ +} + +INSERT_GENTFUNC_BASIC0( trmm_rl_ker_var2rr ) + diff --git a/frame/3/trmm/bli_trmm_rl_ker_var2sl.c b/frame/3/trmm/bli_trmm_rl_ker_var2sl.c new file mode 100644 index 000000000..80e9c7f2f --- /dev/null +++ b/frame/3/trmm/bli_trmm_rl_ker_var2sl.c @@ -0,0 +1,598 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +*/ + +#include "blis.h" + +#define FUNCPTR_T gemm_fp + +typedef void (*FUNCPTR_T) + ( + doff_t diagoffb, + pack_t schema_a, + pack_t schema_b, + dim_t m, + dim_t n, + dim_t k, + void* alpha, + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, + void* beta, + void* c, inc_t rs_c, inc_t cs_c, + cntx_t* cntx, + rntm_t* rntm, + thrinfo_t* thread + ); + +static FUNCPTR_T GENARRAY(ftypes,trmm_rl_ker_var2sl); + +// +// -- Macrokernel functions for slab partitioning ------------------------------ +// + +void bli_trmm_rl_ker_var2sl + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + num_t dt_exec = bli_obj_exec_dt( c ); + + doff_t diagoffb = bli_obj_diag_offset( b ); + + pack_t schema_a = bli_obj_pack_schema( a ); + pack_t schema_b = bli_obj_pack_schema( b ); + + dim_t m = bli_obj_length( c ); + dim_t n = bli_obj_width( c ); + dim_t k = bli_obj_width( a ); + + void* buf_a = bli_obj_buffer_at_off( a ); + inc_t cs_a = bli_obj_col_stride( a ); + dim_t pd_a = bli_obj_panel_dim( a ); + inc_t ps_a = bli_obj_panel_stride( a ); + + void* buf_b = bli_obj_buffer_at_off( b ); + inc_t rs_b = bli_obj_row_stride( b ); + dim_t pd_b = bli_obj_panel_dim( b ); + inc_t ps_b = bli_obj_panel_stride( b ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c = bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + obj_t scalar_a; + obj_t scalar_b; + 
+ void* buf_alpha; + void* buf_beta; + + FUNCPTR_T f; + + // Detach and multiply the scalars attached to A and B. + bli_obj_scalar_detach( a, &scalar_a ); + bli_obj_scalar_detach( b, &scalar_b ); + bli_mulsc( &scalar_a, &scalar_b ); + + // Grab the addresses of the internal scalar buffers for the scalar + // merged above and the scalar attached to C. + buf_alpha = bli_obj_internal_scalar_buffer( &scalar_b ); + buf_beta = bli_obj_internal_scalar_buffer( c ); + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_exec]; + + // Invoke the function. + f( diagoffb, + schema_a, + schema_b, + m, + n, + k, + buf_alpha, + buf_a, cs_a, pd_a, ps_a, + buf_b, rs_b, pd_b, ps_b, + buf_beta, + buf_c, rs_c, cs_c, + cntx, + rntm, + thread ); +} + + +#undef GENTFUNC +#define GENTFUNC( ctype, ch, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + doff_t diagoffb, \ + pack_t schema_a, \ + pack_t schema_b, \ + dim_t m, \ + dim_t n, \ + dim_t k, \ + void* alpha, \ + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, \ + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, \ + void* beta, \ + void* c, inc_t rs_c, inc_t cs_c, \ + cntx_t* cntx, \ + rntm_t* rntm, \ + thrinfo_t* thread \ + ) \ +{ \ + const num_t dt = PASTEMAC(ch,type); \ +\ + /* Alias some constants to simpler names. */ \ + const dim_t MR = pd_a; \ + const dim_t NR = pd_b; \ + const dim_t PACKMR = cs_a; \ + const dim_t PACKNR = rs_b; \ +\ + /* Query the context for the micro-kernel address and cast it to its + function pointer type. */ \ + PASTECH(ch,gemm_ukr_ft) \ + gemm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx ); \ +\ + /* Temporary C buffer for edge cases. Note that the strides of this + temporary buffer are set so that they match the storage of the + original C matrix. For example, if C is column-stored, ct will be + column-stored as well. 
*/ \ + ctype ct[ BLIS_STACK_BUF_MAX_SIZE \ + / sizeof( ctype ) ] \ + __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \ + const bool_t col_pref = bli_cntx_l3_vir_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \ + const inc_t rs_ct = ( col_pref ? 1 : NR ); \ + const inc_t cs_ct = ( col_pref ? MR : 1 ); \ +\ + ctype* restrict one = PASTEMAC(ch,1); \ + ctype* restrict zero = PASTEMAC(ch,0); \ + ctype* restrict a_cast = a; \ + ctype* restrict b_cast = b; \ + ctype* restrict c_cast = c; \ + ctype* restrict alpha_cast = alpha; \ + ctype* restrict beta_cast = beta; \ + ctype* restrict b1; \ + ctype* restrict c1; \ +\ + doff_t diagoffb_j; \ + dim_t k_full; \ + dim_t m_iter, m_left; \ + dim_t n_iter, n_left; \ + dim_t m_cur; \ + dim_t n_cur; \ + dim_t k_b1121; \ + dim_t off_b1121; \ + dim_t i, j; \ + inc_t rstep_a; \ + inc_t cstep_b; \ + inc_t rstep_c, cstep_c; \ + inc_t istep_a; \ + inc_t istep_b; \ + inc_t off_scl; \ + inc_t ss_b_num; \ + inc_t ss_b_den; \ + inc_t ps_b_cur; \ + inc_t is_b_cur; \ + auxinfo_t aux; \ +\ + /* + Assumptions/assertions: + rs_a == 1 + cs_a == PACKMR + pd_a == MR + ps_a == stride to next micro-panel of A + rs_b == PACKNR + cs_b == 1 + pd_b == NR + ps_b == stride to next micro-panel of B + rs_c == (no assumptions) + cs_c == (no assumptions) + */ \ +\ + /* Safety trap: Certain indexing within this macro-kernel does not + work as intended if both MR and NR are odd. */ \ + if ( ( bli_is_odd( PACKMR ) && bli_is_odd( NR ) ) || \ + ( bli_is_odd( PACKNR ) && bli_is_odd( MR ) ) ) bli_abort(); \ +\ + /* If any dimension is zero, return immediately. */ \ + if ( bli_zero_dim3( m, n, k ) ) return; \ +\ + /* Safeguard: If the current panel of B is entirely above the diagonal, + it is implicitly zero. So we do nothing. */ \ + if ( bli_is_strictly_above_diag_n( diagoffb, k, n ) ) return; \ +\ + /* Compute k_full. For all trmm, k_full is simply k. 
This is + needed because some parameter combinations of trmm reduce k + to advance past zero regions in the triangular matrix, and + when computing the imaginary stride of A (the non-triangular + matrix), which is used by 4m1/3m1 implementations, we need + this unreduced value of k. */ \ + k_full = k; \ +\ + /* Compute indexing scaling factor for 4m or 3m. This is + needed because one of the packing register blocksizes (PACKMR + or PACKNR) is used to index into the micro-panels of the non- + triangular matrix when computing with a diagonal-intersecting + micro-panel of the triangular matrix. In the case of 4m or 3m, + real values are stored in both sub-panels, and so the indexing + needs to occur in units of real values. The value computed + here is divided into the complex pointer offset to cause the + pointer to be advanced by the correct value. */ \ + if ( bli_is_4mi_packed( schema_b ) || \ + bli_is_3mi_packed( schema_b ) || \ + bli_is_rih_packed( schema_b ) ) off_scl = 2; \ + else off_scl = 1; \ +\ + /* Compute the storage stride scaling. Usually this is just 1. + However, in the case of interleaved 3m, we need to scale the + offset by 3/2. And if we are packing real-only, imag-only, or + summed-only, we need to scale the computed panel sizes by 1/2 + to compensate for the fact that the pointer arithmetic occurs + in terms of complex elements rather than real elements. */ \ + if ( bli_is_3mi_packed( schema_b ) ) { ss_b_num = 3; ss_b_den = 2; } \ + else if ( bli_is_rih_packed( schema_b ) ) { ss_b_num = 1; ss_b_den = 2; } \ + else { ss_b_num = 1; ss_b_den = 1; } \ +\ + /* If there is a zero region above where the diagonal of B intersects + the left edge of the panel, adjust the pointer to A and treat this + case as if the diagonal offset were zero. Note that we don't need to + adjust the pointer to B since packm would have simply skipped over + the region that was not stored.
*/ \ + if ( diagoffb < 0 ) \ + { \ + j = -diagoffb; \ + k = k - j; \ + diagoffb = 0; \ + a_cast = a_cast + ( j * PACKMR ) / off_scl; \ + } \ +\ + /* If there is a zero region to the right of where the diagonal + of B intersects the bottom of the panel, shrink it to prevent + "no-op" iterations from executing. */ \ + if ( diagoffb + k < n ) \ + { \ + n = diagoffb + k; \ + } \ +\ + /* Clear the temporary C buffer in case it has any infs or NaNs. */ \ + PASTEMAC(ch,set0s_mxn)( MR, NR, \ + ct, rs_ct, cs_ct ); \ +\ + /* Compute number of primary and leftover components of the m and n + dimensions. */ \ + n_iter = n / NR; \ + n_left = n % NR; \ +\ + m_iter = m / MR; \ + m_left = m % MR; \ +\ + if ( n_left ) ++n_iter; \ + if ( m_left ) ++m_iter; \ +\ + /* Determine some increments used to step through A, B, and C. */ \ + rstep_a = ps_a; \ +\ + cstep_b = ps_b; \ +\ + rstep_c = rs_c * MR; \ + cstep_c = cs_c * NR; \ +\ + istep_a = PACKMR * k_full; \ + istep_b = PACKNR * k; \ +\ + if ( bli_is_odd( istep_a ) ) istep_a += 1; \ + if ( bli_is_odd( istep_b ) ) istep_b += 1; \ +\ + /* Save the pack schemas of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_schema_a( schema_a, &aux ); \ + bli_auxinfo_set_schema_b( schema_b, &aux ); \ +\ + /* Save the imaginary stride of A to the auxinfo_t object. */ \ + bli_auxinfo_set_is_a( istep_a, &aux ); \ +\ + thrinfo_t* caucus = bli_thrinfo_sub_node( thread ); \ +\ + dim_t jr_nt = bli_thread_n_way( thread ); \ + dim_t jr_tid = bli_thread_work_id( thread ); \ + dim_t ir_nt = bli_thread_n_way( caucus ); \ + dim_t ir_tid = bli_thread_work_id( caucus ); \ +\ + dim_t jr_start, jr_end; \ + dim_t ir_start, ir_end; \ + dim_t jr_inc, ir_inc; \ +\ + /* Note that we partition the 2nd loop into two regions: the rectangular + part of B, and the triangular portion. 
*/ \ + dim_t n_iter_rct; \ + dim_t n_iter_tri; \ +\ + if ( bli_is_strictly_below_diag_n( diagoffb, m, n ) ) \ + { \ + /* If the entire panel of B does not intersect the diagonal, there is + no triangular region, and therefore we can skip the second set of + loops. */ \ + n_iter_rct = n_iter; \ + n_iter_tri = 0; \ + } \ + else \ + { \ + /* If the panel of B does intersect the diagonal, compute the number of + iterations in the rectangular region by dividing NR into the diagonal + offset. (There should never be any remainder in this division.) The + number of iterations in the triangular (or trapezoidal) region is + computed as the remaining number of iterations in the n dimension. */ \ + n_iter_rct = diagoffb / NR; \ + n_iter_tri = n_iter - n_iter_rct; \ + } \ +\ + /* Use slab assignment of micropanels to threads in the 2nd and 1st + loops for the initial rectangular region of B (if it exists). */ \ + bli_thread_range_jrir_sl( thread, n_iter_rct, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \ + bli_thread_range_jrir_sl( caucus, m_iter, 1, FALSE, &ir_start, &ir_end, &ir_inc ); \ +\ + /* Loop over the n dimension (NR columns at a time). */ \ + for ( j = jr_start; j < jr_end; j += jr_inc ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b2; \ +\ + b1 = b_cast + j * cstep_b; \ + c1 = c_cast + j * cstep_c; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. */ \ + b2 = b1; \ +\ + { \ + /* Save the 4m1/3m1 imaginary stride of B to the auxinfo_t + object. */ \ + bli_auxinfo_set_is_b( istep_b, &aux ); \ +\ + /* Loop over the m dimension (MR rows at a time). */ \ + for ( i = ir_start; i < ir_end; i += ir_inc ) \ + { \ + ctype* restrict a2; \ +\ + a1 = a_cast + i * rstep_a; \ + c11 = c1 + i * rstep_c; \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + /* Compute the addresses of the next panels of A and B. 
*/ \ + a2 = bli_trmm_get_next_a_upanel( a1, rstep_a, ir_inc ); \ + if ( bli_is_last_iter_sl( i, m_iter, ir_tid, ir_nt ) ) \ + { \ + a2 = a_cast; \ + b2 = bli_trmm_get_next_b_upanel( b1, cstep_b, jr_inc ); \ + if ( bli_is_last_iter_sl( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + one, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Add the result to the edge of C. */ \ + PASTEMAC(ch,adds_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + c11, rs_c, cs_c ); \ + } \ + } \ + } \ + } \ +\ + /* If there is no triangular region, then we're done. */ \ + if ( n_iter_tri == 0 ) return; \ +\ + /* Use round-robin assignment of micropanels to threads in the 2nd loop + for the remaining triangular region of B (if it exists). + NOTE: We don't need to call bli_thread_range_jrir*() here since we + employ a hack that calls for each thread to execute every iteration + of the jr and ir loops but skip all but the pointer increment for + iterations that are not assigned to it. */ \ +\ + /* Advance the starting b1 and c1 pointers to the positions corresponding + to the start of the triangular region of B. */ \ + jr_start = n_iter_rct; \ + b1 = b_cast + jr_start * cstep_b; \ + c1 = c_cast + jr_start * cstep_c; \ +\ + /* Loop over the n dimension (NR columns at a time). 
*/ \ + for ( j = jr_start; j < n_iter; ++j ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b2; \ +\ + diagoffb_j = diagoffb - ( doff_t )j*NR; \ +\ + /* Determine the offset to the beginning of the panel that + was packed so we can index into the corresponding location + in A. Then compute the length of that panel. */ \ + off_b1121 = bli_max( -diagoffb_j, 0 ); \ + k_b1121 = k - off_b1121; \ +\ + a1 = a_cast; \ + c11 = c1; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. */ \ + b2 = b1; \ +\ + /* If the current panel of B intersects the diagonal, scale C + by beta. If it is strictly below the diagonal, scale by one. + This allows the current macro-kernel to work for both trmm + and trmm3. */ \ + { \ + /* Compute the panel stride for the current diagonal- + intersecting micro-panel. */ \ + is_b_cur = k_b1121 * PACKNR; \ + is_b_cur += ( bli_is_odd( is_b_cur ) ? 1 : 0 ); \ + ps_b_cur = ( is_b_cur * ss_b_num ) / ss_b_den; \ +\ + if ( bli_trmm_my_iter( j, thread ) ) { \ +\ + /* Save the 4m1/3m1 imaginary stride of B to the auxinfo_t + object. */ \ + bli_auxinfo_set_is_b( is_b_cur, &aux ); \ +\ + /* Loop over the m dimension (MR rows at a time). */ \ + for ( i = 0; i < m_iter; ++i ) \ + { \ + if ( bli_trmm_my_iter( i, caucus ) ) { \ +\ + ctype* restrict a1_i; \ + ctype* restrict a2; \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + a1_i = a1 + ( off_b1121 * PACKMR ) / off_scl; \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = a1; \ + if ( bli_is_last_iter_rr( i, m_iter, 0, 1 ) ) \ + { \ + a2 = a_cast; \ + b2 = b1; \ + if ( bli_is_last_iter_rr( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. 
*/ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k_b1121, \ + alpha_cast, \ + a1_i, \ + b1, \ + beta_cast, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Copy edge elements of C to the temporary buffer. */ \ + PASTEMAC(ch,copys_mxn)( m_cur, n_cur, \ + c11, rs_c, cs_c, \ + ct, rs_ct, cs_ct ); \ +\ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k_b1121, \ + alpha_cast, \ + a1_i, \ + b1, \ + beta_cast, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Copy the result to the edge of C. */ \ + PASTEMAC(ch,copys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + c11, rs_c, cs_c ); \ + } \ + } \ +\ + a1 += rstep_a; \ + c11 += rstep_c; \ + } \ + } \ +\ + b1 += ps_b_cur; \ + } \ +\ + c1 += cstep_c; \ + } \ +\ +/*PASTEMAC(ch,fprintm)( stdout, "trmm_rl_ker_var2sl: a1", MR, k_b1121, a1, 1, MR, "%4.1f", "" );*/ \ +/*PASTEMAC(ch,fprintm)( stdout, "trmm_rl_ker_var2sl: b1", k_b1121, NR, b1_i, NR, 1, "%4.1f", "" );*/ \ +} + +INSERT_GENTFUNC_BASIC0( trmm_rl_ker_var2sl ) + diff --git a/frame/3/trmm/bli_trmm_ru_ker_var2rr.c b/frame/3/trmm/bli_trmm_ru_ker_var2rr.c new file mode 100644 index 000000000..ff118ab6d --- /dev/null +++ b/frame/3/trmm/bli_trmm_ru_ker_var2rr.c @@ -0,0 +1,618 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. 
+ - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +*/ + +#include "blis.h" + +#define FUNCPTR_T gemm_fp + +typedef void (*FUNCPTR_T) + ( + doff_t diagoffb, + pack_t schema_a, + pack_t schema_b, + dim_t m, + dim_t n, + dim_t k, + void* alpha, + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, + void* beta, + void* c, inc_t rs_c, inc_t cs_c, + cntx_t* cntx, + rntm_t* rntm, + thrinfo_t* thread + ); + +static FUNCPTR_T GENARRAY(ftypes,trmm_ru_ker_var2rr); + +// +// -- Macrokernel functions for round-robin partitioning ----------------------- +// + +void bli_trmm_ru_ker_var2rr + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + num_t dt_exec = bli_obj_exec_dt( c ); + + doff_t diagoffb = bli_obj_diag_offset( b ); + + pack_t schema_a = bli_obj_pack_schema( a ); + pack_t schema_b = bli_obj_pack_schema( b ); + + dim_t m = bli_obj_length( c ); + dim_t n = bli_obj_width( c ); + dim_t k = bli_obj_width( a ); + + void* buf_a = bli_obj_buffer_at_off( a ); + inc_t cs_a = bli_obj_col_stride( a ); + dim_t pd_a = bli_obj_panel_dim( a ); + inc_t ps_a = bli_obj_panel_stride( a ); + + void* buf_b = bli_obj_buffer_at_off( b ); + inc_t rs_b = bli_obj_row_stride( b ); + dim_t pd_b = bli_obj_panel_dim( b ); + inc_t ps_b = bli_obj_panel_stride( b ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c = bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + obj_t scalar_a; + obj_t scalar_b; + + void* buf_alpha; + void* buf_beta; + + FUNCPTR_T f; + + // Detach and multiply the scalars attached to A and B. + bli_obj_scalar_detach( a, &scalar_a ); + bli_obj_scalar_detach( b, &scalar_b ); + bli_mulsc( &scalar_a, &scalar_b ); + + // Grab the addresses of the internal scalar buffers for the scalar + // merged above and the scalar attached to C. 
+ buf_alpha = bli_obj_internal_scalar_buffer( &scalar_b ); + buf_beta = bli_obj_internal_scalar_buffer( c ); + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_exec]; + + // Invoke the function. + f( diagoffb, + schema_a, + schema_b, + m, + n, + k, + buf_alpha, + buf_a, cs_a, pd_a, ps_a, + buf_b, rs_b, pd_b, ps_b, + buf_beta, + buf_c, rs_c, cs_c, + cntx, + rntm, + thread ); +} + + +#undef GENTFUNC +#define GENTFUNC( ctype, ch, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + doff_t diagoffb, \ + pack_t schema_a, \ + pack_t schema_b, \ + dim_t m, \ + dim_t n, \ + dim_t k, \ + void* alpha, \ + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, \ + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, \ + void* beta, \ + void* c, inc_t rs_c, inc_t cs_c, \ + cntx_t* cntx, \ + rntm_t* rntm, \ + thrinfo_t* thread \ + ) \ +{ \ + const num_t dt = PASTEMAC(ch,type); \ +\ + /* Alias some constants to simpler names. */ \ + const dim_t MR = pd_a; \ + const dim_t NR = pd_b; \ + const dim_t PACKMR = cs_a; \ + const dim_t PACKNR = rs_b; \ +\ + /* Query the context for the micro-kernel address and cast it to its + function pointer type. */ \ + PASTECH(ch,gemm_ukr_ft) \ + gemm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx ); \ +\ + /* Temporary C buffer for edge cases. Note that the strides of this + temporary buffer are set so that they match the storage of the + original C matrix. For example, if C is column-stored, ct will be + column-stored as well. */ \ + ctype ct[ BLIS_STACK_BUF_MAX_SIZE \ + / sizeof( ctype ) ] \ + __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \ + const bool_t col_pref = bli_cntx_l3_vir_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \ + const inc_t rs_ct = ( col_pref ? 1 : NR ); \ + const inc_t cs_ct = ( col_pref ? 
MR : 1 ); \ +\ + ctype* restrict one = PASTEMAC(ch,1); \ + ctype* restrict zero = PASTEMAC(ch,0); \ + ctype* restrict a_cast = a; \ + ctype* restrict b_cast = b; \ + ctype* restrict c_cast = c; \ + ctype* restrict alpha_cast = alpha; \ + ctype* restrict beta_cast = beta; \ + ctype* restrict b1; \ + ctype* restrict c1; \ +\ + doff_t diagoffb_j; \ + dim_t k_full; \ + dim_t m_iter, m_left; \ + dim_t n_iter, n_left; \ + dim_t m_cur; \ + dim_t n_cur; \ + dim_t k_b0111; \ + dim_t off_b0111; \ + dim_t i, j, jb0; \ + inc_t rstep_a; \ + inc_t cstep_b; \ + inc_t rstep_c, cstep_c; \ + inc_t istep_a; \ + inc_t istep_b; \ + inc_t off_scl; \ + inc_t ss_b_num; \ + inc_t ss_b_den; \ + inc_t ps_b_cur; \ + inc_t is_b_cur; \ + auxinfo_t aux; \ +\ + /* + Assumptions/assertions: + rs_a == 1 + cs_a == PACKMR + pd_a == MR + ps_a == stride to next micro-panel of A + rs_b == PACKNR + cs_b == 1 + pd_b == NR + ps_b == stride to next micro-panel of B + rs_c == (no assumptions) + cs_c == (no assumptions) + */ \ +\ + /* Safety trap: Certain indexing within this macro-kernel does not + work as intended if both MR and NR are odd. */ \ + if ( ( bli_is_odd( PACKMR ) && bli_is_odd( NR ) ) || \ + ( bli_is_odd( PACKNR ) && bli_is_odd( MR ) ) ) bli_abort(); \ +\ + /* If any dimension is zero, return immediately. */ \ + if ( bli_zero_dim3( m, n, k ) ) return; \ +\ + /* Safeguard: If the current panel of B is entirely below its diagonal, + it is implicitly zero. So we do nothing. */ \ + if ( bli_is_strictly_below_diag_n( diagoffb, k, n ) ) return; \ +\ + /* Compute k_full. For all trmm, k_full is simply k. This is + needed because some parameter combinations of trmm reduce k + to advance past zero regions in the triangular matrix, and + when computing the imaginary stride of A (the non-triangular + matrix), which is used by 4m1/3m1 implementations, we need + this unreduced value of k. */ \ + k_full = k; \ +\ + /* Compute indexing scaling factor for 4m or 3m.
This is + needed because one of the packing register blocksizes (PACKMR + or PACKNR) is used to index into the micro-panels of the non- + triangular matrix when computing with a diagonal-intersecting + micro-panel of the triangular matrix. In the case of 4m or 3m, + real values are stored in both sub-panels, and so the indexing + needs to occur in units of real values. The value computed + here is divided into the complex pointer offset to cause the + pointer to be advanced by the correct value. */ \ + if ( bli_is_4mi_packed( schema_b ) || \ + bli_is_3mi_packed( schema_b ) || \ + bli_is_rih_packed( schema_b ) ) off_scl = 2; \ + else off_scl = 1; \ +\ + /* Compute the storage stride scaling. Usually this is just 1. + However, in the case of interleaved 3m, we need to scale the + offset by 3/2. And if we are packing real-only, imag-only, or + summed-only, we need to scale the computed panel sizes by 1/2 + to compensate for the fact that the pointer arithmetic occurs + in terms of complex elements rather than real elements. */ \ + if ( bli_is_3mi_packed( schema_b ) ) { ss_b_num = 3; ss_b_den = 2; } \ + else if ( bli_is_rih_packed( schema_b ) ) { ss_b_num = 1; ss_b_den = 2; } \ + else { ss_b_num = 1; ss_b_den = 1; } \ +\ + /* If there is a zero region to the left of where the diagonal of B + intersects the top edge of the panel, adjust the pointer to C and + treat this case as if the diagonal offset were zero. This skips over + the region that was not packed. (Note we assume the diagonal offset + is a multiple of MR; this assumption will hold as long as the cache + blocksizes are each a multiple of MR and NR.) */ \ + if ( diagoffb > 0 ) \ + { \ + j = diagoffb; \ + n = n - j; \ + diagoffb = 0; \ + c_cast = c_cast + (j )*cs_c; \ + } \ +\ + /* If there is a zero region below where the diagonal of B intersects the + right side of the block, shrink it to prevent "no-op" iterations from + executing. 
*/ \ + if ( -diagoffb + n < k ) \ + { \ + k = -diagoffb + n; \ + } \ +\ + /* Clear the temporary C buffer in case it has any infs or NaNs. */ \ + PASTEMAC(ch,set0s_mxn)( MR, NR, \ + ct, rs_ct, cs_ct ); \ +\ + /* Compute number of primary and leftover components of the m and n + dimensions. */ \ + n_iter = n / NR; \ + n_left = n % NR; \ +\ + m_iter = m / MR; \ + m_left = m % MR; \ +\ + if ( n_left ) ++n_iter; \ + if ( m_left ) ++m_iter; \ +\ + /* Determine some increments used to step through A, B, and C. */ \ + rstep_a = ps_a; \ +\ + cstep_b = ps_b; \ +\ + rstep_c = rs_c * MR; \ + cstep_c = cs_c * NR; \ +\ + istep_a = PACKMR * k_full; \ + istep_b = PACKNR * k; \ +\ + if ( bli_is_odd( istep_a ) ) istep_a += 1; \ + if ( bli_is_odd( istep_b ) ) istep_b += 1; \ +\ + /* Save the pack schemas of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_schema_a( schema_a, &aux ); \ + bli_auxinfo_set_schema_b( schema_b, &aux ); \ +\ + /* Save the imaginary stride of A to the auxinfo_t object. */ \ + bli_auxinfo_set_is_a( istep_a, &aux ); \ +\ + /* The 'thread' argument points to the thrinfo_t node for the 2nd (jr) + loop around the microkernel. Here we query the thrinfo_t node for the + 1st (ir) loop around the microkernel. */ \ + thrinfo_t* caucus = bli_thrinfo_sub_node( thread ); \ +\ + /* Query the number of threads and thread ids for each loop. */ \ + dim_t jr_nt = bli_thread_n_way( thread ); \ + dim_t jr_tid = bli_thread_work_id( thread ); \ + dim_t ir_nt = bli_thread_n_way( caucus ); \ + dim_t ir_tid = bli_thread_work_id( caucus ); \ +\ + dim_t jr_start, jr_end; \ + dim_t ir_start, ir_end; \ + dim_t jr_inc, ir_inc; \ +\ + /* Note that we partition the 2nd loop into two regions: the triangular + part of B, and the rectangular portion.
*/ \ + dim_t n_iter_tri; \ + dim_t n_iter_rct; \ +\ + if ( bli_is_strictly_above_diag_n( diagoffb, k, n ) ) \ + { \ + /* If the entire panel of B does not intersect the diagonal, there is + no triangular region, and therefore we can skip the first set of + loops. */ \ + n_iter_tri = 0; \ + n_iter_rct = n_iter; \ + } \ + else \ + { \ + /* If the panel of B does intersect the diagonal, compute the number of + iterations in the triangular (or trapezoidal) region by dividing NR + into the number of rows in B, rounding up in case this division + leaves a remainder. The number of iterations in the rectangular region + is computed as the remaining number of iterations in the n dimension. */ \ + n_iter_tri = ( k + diagoffb ) / NR + ( ( k + diagoffb ) % NR ? 1 : 0 ); \ + n_iter_rct = n_iter - n_iter_tri; \ + } \ +\ + /* Use round-robin assignment of micropanels to threads in the 2nd loop + for the initial triangular region of B (if it exists). + NOTE: We don't need to call bli_thread_range_jrir*() here since we + employ a hack that calls for each thread to execute every iteration + of the jr and ir loops but skip all but the pointer increment for + iterations that are not assigned to it. */ \ +\ + b1 = b_cast; \ + c1 = c_cast; \ +\ + /* Loop over the n dimension (NR columns at a time). */ \ + for ( j = 0; j < n_iter_tri; ++j ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b2; \ +\ + diagoffb_j = diagoffb - ( doff_t )j*NR; \ +\ + /* Determine the offset to and length of the panel that was packed + so we can index into the corresponding location in A. */ \ + off_b0111 = 0; \ + k_b0111 = bli_min( k, -diagoffb_j + NR ); \ +\ + a1 = a_cast; \ + c11 = c1; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. */ \ + b2 = b1; \ +\ + /* If the current panel of B intersects the diagonal, scale C + by beta. If it is strictly above the diagonal, scale by one.
+ This allows the current macro-kernel to work for both trmm + and trmm3. */ \ + { \ + /* Compute the panel stride for the current diagonal- + intersecting micro-panel. */ \ + is_b_cur = k_b0111 * PACKNR; \ + is_b_cur += ( bli_is_odd( is_b_cur ) ? 1 : 0 ); \ + ps_b_cur = ( is_b_cur * ss_b_num ) / ss_b_den; \ +\ + if ( bli_trmm_my_iter( j, thread ) ) { \ +\ + /* Save the 4m1/3m1 imaginary stride of B to the auxinfo_t + object. */ \ + bli_auxinfo_set_is_b( is_b_cur, &aux ); \ +\ + /* Loop over the m dimension (MR rows at a time). */ \ + for ( i = 0; i < m_iter; ++i ) \ + { \ + if ( bli_trmm_my_iter( i, caucus ) ) { \ +\ + ctype* restrict a1_i; \ + ctype* restrict a2; \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + a1_i = a1 + ( off_b0111 * PACKMR ) / off_scl; \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = a1; \ + if ( bli_is_last_iter_rr( i, m_iter, 0, 1 ) ) \ + { \ + a2 = a_cast; \ + b2 = b1; \ + if ( bli_is_last_iter_rr( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k_b0111, \ + alpha_cast, \ + a1_i, \ + b1, \ + beta_cast, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Copy edge elements of C to the temporary buffer. */ \ + PASTEMAC(ch,copys_mxn)( m_cur, n_cur, \ + c11, rs_c, cs_c, \ + ct, rs_ct, cs_ct ); \ +\ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k_b0111, \ + alpha_cast, \ + a1_i, \ + b1, \ + beta_cast, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Copy the result to the edge of C. 
*/ \ + PASTEMAC(ch,copys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + c11, rs_c, cs_c ); \ + } \ + } \ +\ + a1 += rstep_a; \ + c11 += rstep_c; \ + } \ + } \ +\ + b1 += ps_b_cur; \ + } \ +\ + c1 += cstep_c; \ + } \ +\ + /* If there is no rectangular region, then we're done. */ \ + if ( n_iter_rct == 0 ) return; \ +\ + /* Use round-robin assignment of micropanels to threads in the 2nd and 1st + loops for the remaining rectangular region of B. */ \ + bli_thread_range_jrir_rr( thread, n_iter_rct, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \ + bli_thread_range_jrir_rr( caucus, m_iter, 1, FALSE, &ir_start, &ir_end, &ir_inc ); \ +\ + /* Advance the start and end iteration offsets for the rectangular region + by the number of iterations used for the triangular region. */ \ + jr_start += n_iter_tri; \ + jr_end += n_iter_tri; \ + jb0 = n_iter_tri; \ +\ + /* Save the resulting value of b1 from the previous loop since it represents + the starting point for the rectangular region. */ \ + b_cast = b1; \ +\ + /* Loop over the n dimension (NR columns at a time). */ \ + for ( j = jr_start; j < jr_end; j += jr_inc ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b2; \ +\ + /* NOTE: We must index through b_cast differently since it contains + the starting address of the rectangular region (which is already + n_iter_tri logical iterations through B). */ \ + b1 = b_cast + (j-jb0) * cstep_b; \ + c1 = c_cast + j * cstep_c; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. */ \ + b2 = b1; \ +\ + /* If the current panel of B intersects the diagonal, scale C + by beta. If it is strictly above the diagonal, scale by one. + This allows the current macro-kernel to work for both trmm + and trmm3. */ \ + { \ + /* Save the 4m1/3m1 imaginary stride of B to the auxinfo_t + object. */ \ + bli_auxinfo_set_is_b( istep_b, &aux ); \ +\ + /* Loop over the m dimension (MR rows at a time).
*/ \ + for ( i = ir_start; i < ir_end; i += ir_inc ) \ + { \ + ctype* restrict a2; \ +\ + a1 = a_cast + i * rstep_a; \ + c11 = c1 + i * rstep_c; \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = bli_trmm_get_next_a_upanel( a1, rstep_a, ir_inc ); \ + if ( bli_is_last_iter_rr( i, m_iter, ir_tid, ir_nt ) ) \ + { \ + a2 = a_cast; \ + b2 = bli_trmm_get_next_b_upanel( b1, cstep_b, jr_inc ); \ + if ( bli_is_last_iter_rr( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + one, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Add the result to the edge of C. */ \ + PASTEMAC(ch,adds_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + c11, rs_c, cs_c ); \ + } \ + } \ + } \ + } \ +\ +\ +\ +/*PASTEMAC(ch,fprintm)( stdout, "trmm_ru_ker_var2rr: a1", MR, k_b0111, a1, 1, MR, "%4.1f", "" );*/ \ +/*PASTEMAC(ch,fprintm)( stdout, "trmm_ru_ker_var2rr: b1", k_b0111, NR, b1_i, NR, 1, "%4.1f", "" );*/ \ +} + +INSERT_GENTFUNC_BASIC0( trmm_ru_ker_var2rr ) + diff --git a/frame/3/trmm/bli_trmm_ru_ker_var2sl.c b/frame/3/trmm/bli_trmm_ru_ker_var2sl.c new file mode 100644 index 000000000..0fc2d514a --- /dev/null +++ b/frame/3/trmm/bli_trmm_ru_ker_var2sl.c @@ -0,0 +1,618 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. 
+ + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +*/ + +#include "blis.h" + +#define FUNCPTR_T gemm_fp + +typedef void (*FUNCPTR_T) + ( + doff_t diagoffb, + pack_t schema_a, + pack_t schema_b, + dim_t m, + dim_t n, + dim_t k, + void* alpha, + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, + void* beta, + void* c, inc_t rs_c, inc_t cs_c, + cntx_t* cntx, + rntm_t* rntm, + thrinfo_t* thread + ); + +static FUNCPTR_T GENARRAY(ftypes,trmm_ru_ker_var2sl); + +// +// -- Macrokernel functions for slab partitioning ------------------------------ +// + +void bli_trmm_ru_ker_var2sl + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + num_t dt_exec = bli_obj_exec_dt( c ); + + doff_t diagoffb = bli_obj_diag_offset( b ); + + pack_t schema_a = bli_obj_pack_schema( a ); + pack_t schema_b = bli_obj_pack_schema( b ); + + dim_t m = bli_obj_length( c ); + dim_t n = bli_obj_width( c ); + dim_t k = bli_obj_width( a ); + + void* buf_a = bli_obj_buffer_at_off( a ); + inc_t cs_a = bli_obj_col_stride( a ); + dim_t pd_a = bli_obj_panel_dim( a ); + inc_t ps_a = bli_obj_panel_stride( a ); + + void* buf_b = bli_obj_buffer_at_off( b ); + inc_t rs_b = bli_obj_row_stride( b ); + dim_t pd_b = bli_obj_panel_dim( b ); + inc_t ps_b = bli_obj_panel_stride( b ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c = bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + obj_t scalar_a; + obj_t scalar_b; + + void* buf_alpha; + void* buf_beta; + + FUNCPTR_T f; + + // Detach and multiply the scalars attached to A and B. + bli_obj_scalar_detach( a, &scalar_a ); + bli_obj_scalar_detach( b, &scalar_b ); + bli_mulsc( &scalar_a, &scalar_b ); + + // Grab the addresses of the internal scalar buffers for the scalar + // merged above and the scalar attached to C. 
+ buf_alpha = bli_obj_internal_scalar_buffer( &scalar_b ); + buf_beta = bli_obj_internal_scalar_buffer( c ); + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_exec]; + + // Invoke the function. + f( diagoffb, + schema_a, + schema_b, + m, + n, + k, + buf_alpha, + buf_a, cs_a, pd_a, ps_a, + buf_b, rs_b, pd_b, ps_b, + buf_beta, + buf_c, rs_c, cs_c, + cntx, + rntm, + thread ); +} + + +#undef GENTFUNC +#define GENTFUNC( ctype, ch, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + doff_t diagoffb, \ + pack_t schema_a, \ + pack_t schema_b, \ + dim_t m, \ + dim_t n, \ + dim_t k, \ + void* alpha, \ + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, \ + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, \ + void* beta, \ + void* c, inc_t rs_c, inc_t cs_c, \ + cntx_t* cntx, \ + rntm_t* rntm, \ + thrinfo_t* thread \ + ) \ +{ \ + const num_t dt = PASTEMAC(ch,type); \ +\ + /* Alias some constants to simpler names. */ \ + const dim_t MR = pd_a; \ + const dim_t NR = pd_b; \ + const dim_t PACKMR = cs_a; \ + const dim_t PACKNR = rs_b; \ +\ + /* Query the context for the micro-kernel address and cast it to its + function pointer type. */ \ + PASTECH(ch,gemm_ukr_ft) \ + gemm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx ); \ +\ + /* Temporary C buffer for edge cases. Note that the strides of this + temporary buffer are set so that they match the storage of the + original C matrix. For example, if C is column-stored, ct will be + column-stored as well. */ \ + ctype ct[ BLIS_STACK_BUF_MAX_SIZE \ + / sizeof( ctype ) ] \ + __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \ + const bool_t col_pref = bli_cntx_l3_vir_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \ + const inc_t rs_ct = ( col_pref ? 1 : NR ); \ + const inc_t cs_ct = ( col_pref ? 
MR : 1 ); \
+\
+ ctype* restrict one = PASTEMAC(ch,1); \
+ ctype* restrict zero = PASTEMAC(ch,0); \
+ ctype* restrict a_cast = a; \
+ ctype* restrict b_cast = b; \
+ ctype* restrict c_cast = c; \
+ ctype* restrict alpha_cast = alpha; \
+ ctype* restrict beta_cast = beta; \
+ ctype* restrict b1; \
+ ctype* restrict c1; \
+\
+ doff_t diagoffb_j; \
+ dim_t k_full; \
+ dim_t m_iter, m_left; \
+ dim_t n_iter, n_left; \
+ dim_t m_cur; \
+ dim_t n_cur; \
+ dim_t k_b0111; \
+ dim_t off_b0111; \
+ dim_t i, j, jb0; \
+ inc_t rstep_a; \
+ inc_t cstep_b; \
+ inc_t rstep_c, cstep_c; \
+ inc_t istep_a; \
+ inc_t istep_b; \
+ inc_t off_scl; \
+ inc_t ss_b_num; \
+ inc_t ss_b_den; \
+ inc_t ps_b_cur; \
+ inc_t is_b_cur; \
+ auxinfo_t aux; \
+\
+ /*
+ Assumptions/assertions:
+ rs_a == 1
+ cs_a == PACKMR
+ pd_a == MR
+ ps_a == stride to next micro-panel of A
+ rs_b == PACKNR
+ cs_b == 1
+ pd_b == NR
+ ps_b == stride to next micro-panel of B
+ rs_c == (no assumptions)
+ cs_c == (no assumptions)
+ */ \
+\
+ /* Safety trap: Certain indexing within this macro-kernel does not
+ work as intended if both MR and NR are odd. */ \
+ if ( ( bli_is_odd( PACKMR ) && bli_is_odd( NR ) ) || \
+ ( bli_is_odd( PACKNR ) && bli_is_odd( MR ) ) ) bli_abort(); \
+\
+ /* If any dimension is zero, return immediately. */ \
+ if ( bli_zero_dim3( m, n, k ) ) return; \
+\
+ /* Safeguard: If the current panel of B is entirely below its diagonal,
+ it is implicitly zero. So we do nothing. */ \
+ if ( bli_is_strictly_below_diag_n( diagoffb, k, n ) ) return; \
+\
+ /* Compute k_full. For all trmm, k_full is simply k. This is
+ needed because some parameter combinations of trmm reduce k
+ to advance past zero regions in the triangular matrix, and
+ when computing the imaginary stride of A (the non-triangular
+ matrix), which is used by 4m1/3m1 implementations, we need
+ this unreduced value of k. */ \
+ k_full = k; \
+\
+ /* Compute indexing scaling factor for 4m or 3m. 
This is + needed because one of the packing register blocksizes (PACKMR + or PACKNR) is used to index into the micro-panels of the non- + triangular matrix when computing with a diagonal-intersecting + micro-panel of the triangular matrix. In the case of 4m or 3m, + real values are stored in both sub-panels, and so the indexing + needs to occur in units of real values. The value computed + here is divided into the complex pointer offset to cause the + pointer to be advanced by the correct value. */ \ + if ( bli_is_4mi_packed( schema_b ) || \ + bli_is_3mi_packed( schema_b ) || \ + bli_is_rih_packed( schema_b ) ) off_scl = 2; \ + else off_scl = 1; \ +\ + /* Compute the storage stride scaling. Usually this is just 1. + However, in the case of interleaved 3m, we need to scale the + offset by 3/2. And if we are packing real-only, imag-only, or + summed-only, we need to scale the computed panel sizes by 1/2 + to compensate for the fact that the pointer arithmetic occurs + in terms of complex elements rather than real elements. */ \ + if ( bli_is_3mi_packed( schema_b ) ) { ss_b_num = 3; ss_b_den = 2; } \ + else if ( bli_is_rih_packed( schema_b ) ) { ss_b_num = 1; ss_b_den = 2; } \ + else { ss_b_num = 1; ss_b_den = 1; } \ +\ + /* If there is a zero region to the left of where the diagonal of B + intersects the top edge of the panel, adjust the pointer to C and + treat this case as if the diagonal offset were zero. This skips over + the region that was not packed. (Note we assume the diagonal offset + is a multiple of MR; this assumption will hold as long as the cache + blocksizes are each a multiple of MR and NR.) */ \ + if ( diagoffb > 0 ) \ + { \ + j = diagoffb; \ + n = n - j; \ + diagoffb = 0; \ + c_cast = c_cast + (j )*cs_c; \ + } \ +\ + /* If there is a zero region below where the diagonal of B intersects the + right side of the block, shrink it to prevent "no-op" iterations from + executing. 
*/ \ + if ( -diagoffb + n < k ) \ + { \ + k = -diagoffb + n; \ + } \ +\ + /* Clear the temporary C buffer in case it has any infs or NaNs. */ \ + PASTEMAC(ch,set0s_mxn)( MR, NR, \ + ct, rs_ct, cs_ct ); \ +\ + /* Compute number of primary and leftover components of the m and n + dimensions. */ \ + n_iter = n / NR; \ + n_left = n % NR; \ +\ + m_iter = m / MR; \ + m_left = m % MR; \ +\ + if ( n_left ) ++n_iter; \ + if ( m_left ) ++m_iter; \ +\ + /* Determine some increments used to step through A, B, and C. */ \ + rstep_a = ps_a; \ +\ + cstep_b = ps_b; \ +\ + rstep_c = rs_c * MR; \ + cstep_c = cs_c * NR; \ +\ + istep_a = PACKMR * k_full; \ + istep_b = PACKNR * k; \ +\ + if ( bli_is_odd( istep_a ) ) istep_a += 1; \ + if ( bli_is_odd( istep_b ) ) istep_b += 1; \ +\ + /* Save the pack schemas of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_schema_a( schema_a, &aux ); \ + bli_auxinfo_set_schema_b( schema_b, &aux ); \ +\ + /* Save the imaginary stride of A to the auxinfo_t object. */ \ + bli_auxinfo_set_is_a( istep_a, &aux ); \ +\ + /* The 'thread' argument points to the thrinfo_t node for the 2nd (jr) + loop around the microkernel. Here we query the thrinfo_t node for the + 1st (ir) loop around the microkernel. */ \ + thrinfo_t* caucus = bli_thrinfo_sub_node( thread ); \ +\ + /* Query the number of threads and thread ids for each loop. */ \ + dim_t jr_nt = bli_thread_n_way( thread ); \ + dim_t jr_tid = bli_thread_work_id( thread ); \ + dim_t ir_nt = bli_thread_n_way( caucus ); \ + dim_t ir_tid = bli_thread_work_id( caucus ); \ +\ + dim_t jr_start, jr_end; \ + dim_t ir_start, ir_end; \ + dim_t jr_inc, ir_inc; \ +\ + /* Note that we partition the 2nd loop into two regions: the triangular + part of C, and the rectangular portion. 
*/ \ + dim_t n_iter_tri; \ + dim_t n_iter_rct; \ +\ + if ( bli_is_strictly_above_diag_n( diagoffb, k, n ) ) \ + { \ + /* If the entire panel of B does not intersect the diagonal, there is + no triangular region, and therefore we can skip the first set of + loops. */ \ + n_iter_tri = 0; \ + n_iter_rct = n_iter; \ + } \ + else \ + { \ + /* If the panel of B does intersect the diagonal, compute the number of + iterations in the triangular (or trapezoidal) region by dividing NR + into the number of rows in B. (There should never be any remainder + in this division.) The number of iterations in the rectangular region + is computed as the remaining number of iterations in the n dimension. */ \ + n_iter_tri = ( k + diagoffb ) / NR + ( ( k + diagoffb ) % NR ? 1 : 0 ); \ + n_iter_rct = n_iter - n_iter_tri; \ + } \ +\ + /* Use round-robin assignment of micropanels to threads in the 2nd loop + for the initial triangular region of B (if it exists). + NOTE: We don't need to call bli_thread_range_jrir*() here since we + employ a hack that calls for each thread to execute every iteration + of the jr and ir loops but skip all but the pointer increment for + iterations that are not assigned to it. */ \ +\ + b1 = b_cast; \ + c1 = c_cast; \ +\ + /* Loop over the n dimension (NR columns at a time). */ \ + for ( j = 0; j < n_iter_tri; ++j ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b2; \ +\ + diagoffb_j = diagoffb - ( doff_t )j*NR; \ +\ + /* Determine the offset to and length of the panel that was packed + so we can index into the corresponding location in A. */ \ + off_b0111 = 0; \ + k_b0111 = bli_min( k, -diagoffb_j + NR ); \ +\ + a1 = a_cast; \ + c11 = c1; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. */ \ + b2 = b1; \ +\ + /* If the current panel of B intersects the diagonal, scale C + by beta. If it is strictly below the diagonal, scale by one. 
+ This allows the current macro-kernel to work for both trmm + and trmm3. */ \ + { \ + /* Compute the panel stride for the current diagonal- + intersecting micro-panel. */ \ + is_b_cur = k_b0111 * PACKNR; \ + is_b_cur += ( bli_is_odd( is_b_cur ) ? 1 : 0 ); \ + ps_b_cur = ( is_b_cur * ss_b_num ) / ss_b_den; \ +\ + if ( bli_trmm_my_iter( j, thread ) ) { \ +\ + /* Save the 4m1/3m1 imaginary stride of B to the auxinfo_t + object. */ \ + bli_auxinfo_set_is_b( is_b_cur, &aux ); \ +\ + /* Loop over the m dimension (MR rows at a time). */ \ + for ( i = 0; i < m_iter; ++i ) \ + { \ + if ( bli_trmm_my_iter( i, caucus ) ) { \ +\ + ctype* restrict a1_i; \ + ctype* restrict a2; \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + a1_i = a1 + ( off_b0111 * PACKMR ) / off_scl; \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = a1; \ + if ( bli_is_last_iter_rr( i, m_iter, 0, 1 ) ) \ + { \ + a2 = a_cast; \ + b2 = b1; \ + if ( bli_is_last_iter_rr( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k_b0111, \ + alpha_cast, \ + a1_i, \ + b1, \ + beta_cast, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Copy edge elements of C to the temporary buffer. */ \ + PASTEMAC(ch,copys_mxn)( m_cur, n_cur, \ + c11, rs_c, cs_c, \ + ct, rs_ct, cs_ct ); \ +\ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k_b0111, \ + alpha_cast, \ + a1_i, \ + b1, \ + beta_cast, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Copy the result to the edge of C. 
*/ \
+ PASTEMAC(ch,copys_mxn)( m_cur, n_cur, \
+ ct, rs_ct, cs_ct, \
+ c11, rs_c, cs_c ); \
+ } \
+ } \
+\
+ a1 += rstep_a; \
+ c11 += rstep_c; \
+ } \
+ } \
+\
+ b1 += ps_b_cur; \
+ } \
+\
+ c1 += cstep_c; \
+ } \
+\
+ /* If there is no rectangular region, then we're done. */ \
+ if ( n_iter_rct == 0 ) return; \
+\
+ /* Use slab assignment of micropanels to threads in the 2nd and 1st
+ loops for the remaining rectangular region of B. */ \
+ bli_thread_range_jrir_sl( thread, n_iter_rct, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \
+ bli_thread_range_jrir_sl( caucus, m_iter, 1, FALSE, &ir_start, &ir_end, &ir_inc ); \
+\
+ /* Advance the start and end iteration offsets for the rectangular region
+ by the number of iterations used for the triangular region. */ \
+ jr_start += n_iter_tri; \
+ jr_end += n_iter_tri; \
+ jb0 = n_iter_tri; \
+\
+ /* Save the resulting value of b1 from the previous loop since it represents
+ the starting point for the rectangular region. */ \
+ b_cast = b1; \
+\
+ /* Loop over the n dimension (NR columns at a time). */ \
+ for ( j = jr_start; j < jr_end; j += jr_inc ) \
+ { \
+ ctype* restrict a1; \
+ ctype* restrict c11; \
+ ctype* restrict b2; \
+\
+ /* NOTE: We must index through b_cast differently since it contains
+ the starting address of the rectangular region (which is already
+ n_iter_tri logical iterations through B). */ \
+ b1 = b_cast + (j-jb0) * cstep_b; \
+ c1 = c_cast + j * cstep_c; \
+\
+ n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \
+\
+ /* Initialize our next panel of B to be the current panel of B. */ \
+ b2 = b1; \
+\
+ /* If the current panel of B intersects the diagonal, scale C
+ by beta. If it is strictly below the diagonal, scale by one.
+ This allows the current macro-kernel to work for both trmm
+ and trmm3. */ \
+ { \
+ /* Save the 4m1/3m1 imaginary stride of B to the auxinfo_t
+ object. */ \
+ bli_auxinfo_set_is_b( istep_b, &aux ); \
+\
+ /* Loop over the m dimension (MR rows at a time). 
*/ \ + for ( i = ir_start; i < ir_end; i += ir_inc ) \ + { \ + ctype* restrict a2; \ +\ + a1 = a_cast + i * rstep_a; \ + c11 = c1 + i * rstep_c; \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = bli_trmm_get_next_a_upanel( a1, rstep_a, ir_inc ); \ + if ( bli_is_last_iter_sl( i, m_iter, ir_tid, ir_nt ) ) \ + { \ + a2 = a_cast; \ + b2 = bli_trmm_get_next_b_upanel( b1, cstep_b, jr_inc ); \ + if ( bli_is_last_iter_sl( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + one, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Add the result to the edge of C. */ \ + PASTEMAC(ch,adds_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + c11, rs_c, cs_c ); \ + } \ + } \ + } \ + } \ +\ +\ +\ +/*PASTEMAC(ch,fprintm)( stdout, "trmm_ru_ker_var2sl: a1", MR, k_b0111, a1, 1, MR, "%4.1f", "" );*/ \ +/*PASTEMAC(ch,fprintm)( stdout, "trmm_ru_ker_var2sl: b1", k_b0111, NR, b1_i, NR, 1, "%4.1f", "" );*/ \ +} + +INSERT_GENTFUNC_BASIC0( trmm_ru_ker_var2sl ) + diff --git a/frame/3/trmm/bli_trmm_var.h b/frame/3/trmm/bli_trmm_var.h index bde7977b5..9283bcdb3 100644 --- a/frame/3/trmm/bli_trmm_var.h +++ b/frame/3/trmm/bli_trmm_var.h @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. 
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -55,11 +56,17 @@ void PASTEMAC0(opname) \ //GENPROT( trmm_blk_var2 ) //GENPROT( trmm_blk_var3 ) -GENPROT( trmm_xx_ker_var2 ) -GENPROT( trmm_ll_ker_var2 ) -GENPROT( trmm_lu_ker_var2 ) -GENPROT( trmm_rl_ker_var2 ) -GENPROT( trmm_ru_ker_var2 ) +GENPROT( trmm_xx_ker_var2sl ) +GENPROT( trmm_xx_ker_var2rr ) + +GENPROT( trmm_ll_ker_var2sl ) +GENPROT( trmm_ll_ker_var2rr ) +GENPROT( trmm_lu_ker_var2sl ) +GENPROT( trmm_lu_ker_var2rr ) +GENPROT( trmm_rl_ker_var2sl ) +GENPROT( trmm_rl_ker_var2rr ) +GENPROT( trmm_ru_ker_var2sl ) +GENPROT( trmm_ru_ker_var2rr ) // @@ -89,8 +96,12 @@ void PASTEMAC(ch,varname) \ thrinfo_t* thread \ ); -INSERT_GENTPROT_BASIC0( trmm_ll_ker_var2 ) -INSERT_GENTPROT_BASIC0( trmm_lu_ker_var2 ) -INSERT_GENTPROT_BASIC0( trmm_rl_ker_var2 ) -INSERT_GENTPROT_BASIC0( trmm_ru_ker_var2 ) +INSERT_GENTPROT_BASIC0( trmm_ll_ker_var2sl ) +INSERT_GENTPROT_BASIC0( trmm_ll_ker_var2rr ) +INSERT_GENTPROT_BASIC0( trmm_lu_ker_var2sl ) +INSERT_GENTPROT_BASIC0( trmm_lu_ker_var2rr ) +INSERT_GENTPROT_BASIC0( trmm_rl_ker_var2sl ) +INSERT_GENTPROT_BASIC0( trmm_rl_ker_var2rr ) +INSERT_GENTPROT_BASIC0( trmm_ru_ker_var2sl ) +INSERT_GENTPROT_BASIC0( trmm_ru_ker_var2rr ) diff --git a/frame/3/trmm/bli_trmm_xx_ker_var2.c b/frame/3/trmm/bli_trmm_xx_ker_var2.c index d0e157877..cc6f210ae 100644 --- a/frame/3/trmm/bli_trmm_xx_ker_var2.c +++ b/frame/3/trmm/bli_trmm_xx_ker_var2.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. 
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -34,13 +35,13 @@ #include "blis.h" -static gemm_var_oft vars[2][2] = +static gemm_var_oft vars_sl[2][2] = { - { bli_trmm_ll_ker_var2, bli_trmm_lu_ker_var2 }, - { bli_trmm_rl_ker_var2, bli_trmm_ru_ker_var2 } + { bli_trmm_ll_ker_var2sl, bli_trmm_lu_ker_var2sl }, + { bli_trmm_rl_ker_var2sl, bli_trmm_ru_ker_var2sl } }; -void bli_trmm_xx_ker_var2 +void bli_trmm_xx_ker_var2sl ( obj_t* a, obj_t* b, @@ -72,7 +73,62 @@ void bli_trmm_xx_ker_var2 } // Index into the variant array to extract the correct function pointer. - f = vars[side][uplo]; + f = vars_sl[side][uplo]; + + // Call the macrokernel. + f + ( + a, + b, + c, + cntx, + rntm, + cntl, + thread + ); +} + +// ----------------------------------------------------------------------------- + +static gemm_var_oft vars_rr[2][2] = +{ + { bli_trmm_ll_ker_var2rr, bli_trmm_lu_ker_var2rr }, + { bli_trmm_rl_ker_var2rr, bli_trmm_ru_ker_var2rr } +}; + +void bli_trmm_xx_ker_var2rr + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + bool_t side; + bool_t uplo; + gemm_var_oft f; + + // Set two bools: one based on the implied side parameter (the structure + // of the root object) and one based on the uplo field of the triangular + // matrix's root object (whether that is matrix A or matrix B). + if ( bli_obj_root_is_triangular( a ) ) + { + side = 0; + if ( bli_obj_root_is_lower( a ) ) uplo = 0; + else uplo = 1; + } + else // if ( bli_obj_root_is_triangular( b ) ) + { + side = 1; + if ( bli_obj_root_is_lower( b ) ) uplo = 0; + else uplo = 1; + } + + // Index into the variant array to extract the correct function pointer. + f = vars_rr[side][uplo]; // Call the macrokernel. 
f diff --git a/frame/3/trmm/bli_trmm_ll_ker_var2.c b/frame/3/trmm/other/bli_trmm_ll_ker_var2.c similarity index 98% rename from frame/3/trmm/bli_trmm_ll_ker_var2.c rename to frame/3/trmm/other/bli_trmm_ll_ker_var2.c index ff64501aa..fbbbb7b2f 100644 --- a/frame/3/trmm/bli_trmm_ll_ker_var2.c +++ b/frame/3/trmm/other/bli_trmm_ll_ker_var2.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -327,7 +328,7 @@ void PASTEMAC(ch,varname) \ /* Loop over the n dimension (NR columns at a time). */ \ for ( j = 0; j < n_iter; ++j ) \ { \ - if ( bli_trmm_l_jr_my_iter( j, jr_thread ) ) { \ + if ( bli_trmm_my_iter( j, jr_thread ) ) { \ \ ctype* restrict a1; \ ctype* restrict c11; \ @@ -369,7 +370,7 @@ void PASTEMAC(ch,varname) \ is_a_cur += ( bli_is_odd( is_a_cur ) ? 1 : 0 ); \ ps_a_cur = ( is_a_cur * ss_a_num ) / ss_a_den; \ \ - if ( bli_trmm_l_ir_my_iter( i, ir_thread ) ) { \ + if ( bli_trmm_my_iter( i, ir_thread ) ) { \ \ b1_i = b1 + ( off_a1011 * PACKNR ) / off_scl; \ \ @@ -439,7 +440,7 @@ void PASTEMAC(ch,varname) \ } \ else if ( bli_is_strictly_below_diag_n( diagoffa_i, MR, k ) ) \ { \ - if ( bli_trmm_l_ir_my_iter( i, ir_thread ) ) { \ + if ( bli_trmm_my_iter( i, ir_thread ) ) { \ \ ctype* restrict a2; \ \ diff --git a/frame/3/trmm/bli_trmm_lu_ker_var2.c b/frame/3/trmm/other/bli_trmm_lu_ker_var2.c similarity index 98% rename from frame/3/trmm/bli_trmm_lu_ker_var2.c rename to frame/3/trmm/other/bli_trmm_lu_ker_var2.c index bfe57ba16..2fe01d0e2 100644 --- a/frame/3/trmm/bli_trmm_lu_ker_var2.c +++ b/frame/3/trmm/other/bli_trmm_lu_ker_var2.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. 
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -334,7 +335,7 @@ void PASTEMAC(ch,varname) \ /* Loop over the n dimension (NR columns at a time). */ \ for ( j = 0; j < n_iter; ++j ) \ { \ - if ( bli_trmm_l_jr_my_iter( j, jr_thread ) ) { \ + if ( bli_trmm_my_iter( j, jr_thread ) ) { \ \ ctype* restrict a1; \ ctype* restrict c11; \ @@ -376,7 +377,7 @@ void PASTEMAC(ch,varname) \ is_a_cur += ( bli_is_odd( is_a_cur ) ? 1 : 0 ); \ ps_a_cur = ( is_a_cur * ss_a_num ) / ss_a_den; \ \ - if ( bli_trmm_l_ir_my_iter( i, ir_thread ) ) { \ + if ( bli_trmm_my_iter( i, ir_thread ) ) { \ \ b1_i = b1 + ( off_a1112 * PACKNR ) / off_scl; \ \ @@ -446,7 +447,7 @@ void PASTEMAC(ch,varname) \ } \ else if ( bli_is_strictly_above_diag_n( diagoffa_i, MR, k ) ) \ { \ - if ( bli_trmm_l_ir_my_iter( i, ir_thread ) ) { \ + if ( bli_trmm_my_iter( i, ir_thread ) ) { \ \ ctype* restrict a2; \ \ diff --git a/frame/3/trmm/bli_trmm_rl_ker_var2.c b/frame/3/trmm/other/bli_trmm_rl_ker_var2.c similarity index 98% rename from frame/3/trmm/bli_trmm_rl_ker_var2.c rename to frame/3/trmm/other/bli_trmm_rl_ker_var2.c index e2eef964e..860295c4c 100644 --- a/frame/3/trmm/bli_trmm_rl_ker_var2.c +++ b/frame/3/trmm/other/bli_trmm_rl_ker_var2.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -366,7 +367,7 @@ void PASTEMAC(ch,varname) \ is_b_cur += ( bli_is_odd( is_b_cur ) ? 1 : 0 ); \ ps_b_cur = ( is_b_cur * ss_b_num ) / ss_b_den; \ \ - if ( bli_trmm_r_jr_my_iter( j, jr_thread ) ) { \ + if ( bli_trmm_my_iter( j, jr_thread ) ) { \ \ /* Save the 4m1/3m1 imaginary stride of B to the auxinfo_t object. */ \ @@ -375,7 +376,7 @@ void PASTEMAC(ch,varname) \ /* Loop over the m dimension (MR rows at a time). 
*/ \ for ( i = 0; i < m_iter; ++i ) \ { \ - if ( bli_trmm_r_ir_my_iter( i, ir_thread ) ) { \ + if ( bli_trmm_my_iter( i, ir_thread ) ) { \ \ ctype* restrict a1_i; \ ctype* restrict a2; \ @@ -451,7 +452,7 @@ void PASTEMAC(ch,varname) \ } \ else if ( bli_is_strictly_below_diag_n( diagoffb_j, k, NR ) ) \ { \ - if ( bli_trmm_r_jr_my_iter( j, jr_thread ) ) { \ + if ( bli_trmm_my_iter( j, jr_thread ) ) { \ \ /* Save the 4m1/3m1 imaginary stride of B to the auxinfo_t object. */ \ @@ -460,7 +461,7 @@ void PASTEMAC(ch,varname) \ /* Loop over the m dimension (MR rows at a time). */ \ for ( i = 0; i < m_iter; ++i ) \ { \ - if ( bli_trmm_r_ir_my_iter( i, ir_thread ) ) { \ + if ( bli_trmm_my_iter( i, ir_thread ) ) { \ \ ctype* restrict a2; \ \ diff --git a/frame/3/trmm/bli_trmm_ru_ker_var2.c b/frame/3/trmm/other/bli_trmm_ru_ker_var2.c similarity index 98% rename from frame/3/trmm/bli_trmm_ru_ker_var2.c rename to frame/3/trmm/other/bli_trmm_ru_ker_var2.c index c76bc535f..e0adf4cf2 100644 --- a/frame/3/trmm/bli_trmm_ru_ker_var2.c +++ b/frame/3/trmm/other/bli_trmm_ru_ker_var2.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -366,7 +367,7 @@ void PASTEMAC(ch,varname) \ is_b_cur += ( bli_is_odd( is_b_cur ) ? 1 : 0 ); \ ps_b_cur = ( is_b_cur * ss_b_num ) / ss_b_den; \ \ - if ( bli_trmm_r_jr_my_iter( j, jr_thread ) ) { \ + if ( bli_trmm_my_iter( j, jr_thread ) ) { \ \ /* Save the 4m1/3m1 imaginary stride of B to the auxinfo_t object. */ \ @@ -375,7 +376,7 @@ void PASTEMAC(ch,varname) \ /* Loop over the m dimension (MR rows at a time). 
*/ \ for ( i = 0; i < m_iter; ++i ) \ { \ - if ( bli_trmm_r_ir_my_iter( i, ir_thread ) ) { \ + if ( bli_trmm_my_iter( i, ir_thread ) ) { \ \ ctype* restrict a1_i; \ ctype* restrict a2; \ @@ -451,7 +452,7 @@ void PASTEMAC(ch,varname) \ } \ else if ( bli_is_strictly_above_diag_n( diagoffb_j, k, NR ) ) \ { \ - if ( bli_trmm_r_jr_my_iter( j, jr_thread ) ) { \ + if ( bli_trmm_my_iter( j, jr_thread ) ) { \ \ /* Save the 4m1/3m1 imaginary stride of B to the auxinfo_t object. */ \ @@ -460,7 +461,7 @@ void PASTEMAC(ch,varname) \ /* Loop over the m dimension (MR rows at a time). */ \ for ( i = 0; i < m_iter; ++i ) \ { \ - if ( bli_trmm_r_ir_my_iter( i, ir_thread ) ) { \ + if ( bli_trmm_my_iter( i, ir_thread ) ) { \ \ ctype* restrict a2; \ \ diff --git a/frame/3/trsm/bli_trsm_blk_var1.c b/frame/3/trsm/bli_trsm_blk_var1.c index 8b666b3f4..783572944 100644 --- a/frame/3/trsm/bli_trsm_blk_var1.c +++ b/frame/3/trsm/bli_trsm_blk_var1.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -60,7 +61,7 @@ void bli_trsm_blk_var1 bli_l3_prune_unref_mparts_m( a, b, c, cntl ); // Determine the current thread's subpartition range. - bli_thread_get_range_mdim + bli_thread_range_mdim ( direct, thread, a, b, c, cntl, cntx, &my_start, &my_end diff --git a/frame/3/trsm/bli_trsm_blk_var2.c b/frame/3/trsm/bli_trsm_blk_var2.c index 6be5965a3..7286ba7e0 100644 --- a/frame/3/trsm/bli_trsm_blk_var2.c +++ b/frame/3/trsm/bli_trsm_blk_var2.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. 
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -60,7 +61,7 @@ void bli_trsm_blk_var2 bli_l3_prune_unref_mparts_n( a, b, c, cntl ); // Determine the current thread's subpartition range. - bli_thread_get_range_ndim + bli_thread_range_ndim ( direct, thread, a, b, c, cntl, cntx, &my_start, &my_end diff --git a/frame/3/trsm/bli_trsm_cntl.c b/frame/3/trsm/bli_trsm_cntl.c index ee40189e5..72dd9f68b 100644 --- a/frame/3/trsm/bli_trsm_cntl.c +++ b/frame/3/trsm/bli_trsm_cntl.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -53,7 +54,28 @@ cntl_t* bli_trsm_l_cntl_create pack_t schema_b ) { - void* macro_kernel_p = bli_trsm_xx_ker_var2; + void* macro_kernel_p; + void* packa_fp; + void* packb_fp; + +#ifdef BLIS_ENABLE_JRIR_SLAB + + // Use the function pointer to the macrokernels that use slab + // assignment of micropanels to threads in the jr and ir loops. + macro_kernel_p = bli_trsm_xx_ker_var2sl; + + packa_fp = bli_packm_blk_var1sl; + packb_fp = bli_packm_blk_var1sl; + +#else // BLIS_ENABLE_JRIR_RR + + // Use the function pointer to the macrokernels that use round-robin + // assignment of micropanels to threads in the jr and ir loops. 
+ macro_kernel_p = bli_trsm_xx_ker_var2rr; + + packa_fp = bli_packm_blk_var1rr; + packb_fp = bli_packm_blk_var1rr; +#endif const opid_t family = BLIS_TRSM; @@ -78,7 +100,7 @@ cntl_t* bli_trsm_l_cntl_create cntl_t* trsm_cntl_packa = bli_packm_cntl_create_node ( bli_trsm_packa, - bli_packm_blk_var1, + packa_fp, BLIS_MR, BLIS_MR, TRUE, // do NOT invert diagonal @@ -102,7 +124,7 @@ cntl_t* bli_trsm_l_cntl_create cntl_t* trsm_cntl_packb = bli_packm_cntl_create_node ( bli_trsm_packb, - bli_packm_blk_var1, + packb_fp, BLIS_MR, BLIS_NR, FALSE, // do NOT invert diagonal @@ -140,7 +162,16 @@ cntl_t* bli_trsm_r_cntl_create pack_t schema_b ) { - void* macro_kernel_p = bli_trsm_xx_ker_var2; + // trsm macrokernels are presently disabled for right-side execution, + // so it doesn't matter which function pointer we use here (sl or rr). + // To be safe, we'll insert an abort() guard to alert the developers + // of this should right-side macrokernels ever be re-enabled. + void* macro_kernel_p = bli_trsm_xx_ker_var2sl; + + void* packa_fp = bli_packm_blk_var1sl; + void* packb_fp = bli_packm_blk_var1sl; + + bli_abort(); const opid_t family = BLIS_TRSM; @@ -165,7 +196,7 @@ cntl_t* bli_trsm_r_cntl_create cntl_t* trsm_cntl_packa = bli_packm_cntl_create_node ( bli_trsm_packa, - bli_packm_blk_var1, + packa_fp, BLIS_NR, BLIS_MR, FALSE, // do NOT invert diagonal @@ -189,7 +220,7 @@ cntl_t* bli_trsm_r_cntl_create cntl_t* trsm_cntl_packb = bli_packm_cntl_create_node ( bli_trsm_packb, - bli_packm_blk_var1, + packb_fp, BLIS_MR, BLIS_MR, TRUE, // do NOT invert diagonal diff --git a/frame/3/trsm/bli_trsm_ll_ker_var2rr.c b/frame/3/trsm/bli_trsm_ll_ker_var2rr.c new file mode 100644 index 000000000..844d76ab7 --- /dev/null +++ b/frame/3/trsm/bli_trsm_ll_ker_var2rr.c @@ -0,0 +1,605 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. 
+ + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +*/ + +#include "blis.h" + +#define FUNCPTR_T gemm_fp + +typedef void (*FUNCPTR_T) + ( + doff_t diagoffa, + pack_t schema_a, + pack_t schema_b, + dim_t m, + dim_t n, + dim_t k, + void* alpha1, + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, + void* alpha2, + void* c, inc_t rs_c, inc_t cs_c, + cntx_t* cntx, + rntm_t* rntm, + thrinfo_t* thread + ); + +static FUNCPTR_T GENARRAY(ftypes,trsm_ll_ker_var2rr); + +// +// -- Macrokernel functions for round-robin partitioning ----------------------- +// + +void bli_trsm_ll_ker_var2rr + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + num_t dt_exec = bli_obj_exec_dt( c ); + + doff_t diagoffa = bli_obj_diag_offset( a ); + + pack_t schema_a = bli_obj_pack_schema( a ); + pack_t schema_b = bli_obj_pack_schema( b ); + + dim_t m = bli_obj_length( c ); + dim_t n = bli_obj_width( c ); + dim_t k = bli_obj_width( a ); + + void* buf_a = bli_obj_buffer_at_off( a ); + inc_t cs_a = bli_obj_col_stride( a ); + dim_t pd_a = bli_obj_panel_dim( a ); + inc_t ps_a = bli_obj_panel_stride( a ); + + void* buf_b = bli_obj_buffer_at_off( b ); + inc_t rs_b = bli_obj_row_stride( b ); + dim_t pd_b = bli_obj_panel_dim( b ); + inc_t ps_b = bli_obj_panel_stride( b ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c = bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + void* buf_alpha1; + void* buf_alpha2; + + FUNCPTR_T f; + + // Grab the address of the internal scalar buffer for the scalar + // attached to B (the non-triangular matrix). This will be the alpha + // scalar used in the gemmtrsm subproblems (ie: the scalar that would + // be applied to the packed copy of B prior to it being updated by + // the trsm subproblem). This scalar may be unit, if for example it + // was applied during packing. 
+ buf_alpha1 = bli_obj_internal_scalar_buffer( b ); + + // Grab the address of the internal scalar buffer for the scalar + // attached to C. This will be the "beta" scalar used in the gemm-only + // subproblems that correspond to micro-panels that do not intersect + // the diagonal. We need this separate scalar because it's possible + // that the alpha attached to B was reset, if it was applied during + // packing. + buf_alpha2 = bli_obj_internal_scalar_buffer( c ); + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_exec]; + + // Invoke the function. + f( diagoffa, + schema_a, + schema_b, + m, + n, + k, + buf_alpha1, + buf_a, cs_a, pd_a, ps_a, + buf_b, rs_b, pd_b, ps_b, + buf_alpha2, + buf_c, rs_c, cs_c, + cntx, + rntm, + thread ); +} + + +#undef GENTFUNC +#define GENTFUNC( ctype, ch, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + doff_t diagoffa, \ + pack_t schema_a, \ + pack_t schema_b, \ + dim_t m, \ + dim_t n, \ + dim_t k, \ + void* alpha1, \ + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, \ + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, \ + void* alpha2, \ + void* c, inc_t rs_c, inc_t cs_c, \ + cntx_t* cntx, \ + rntm_t* rntm, \ + thrinfo_t* thread \ + ) \ +{ \ + const num_t dt = PASTEMAC(ch,type); \ +\ + /* Alias some constants to simpler names. */ \ + const dim_t MR = pd_a; \ + const dim_t NR = pd_b; \ + const dim_t PACKMR = cs_a; \ + const dim_t PACKNR = rs_b; \ +\ + /* Cast the micro-kernel address to its function pointer type. */ \ + PASTECH(ch,gemmtrsm_ukr_ft) \ + gemmtrsm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMMTRSM_L_UKR, cntx ); \ + PASTECH(ch,gemm_ukr_ft) \ + gemm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx ); \ +\ + /* Temporary C buffer for edge cases. Note that the strides of this + temporary buffer are set so that they match the storage of the + original C matrix. For example, if C is column-stored, ct will be + column-stored as well. 
*/ \ + ctype ct[ BLIS_STACK_BUF_MAX_SIZE \ + / sizeof( ctype ) ] \ + __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \ + const bool_t col_pref = bli_cntx_l3_vir_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \ + const inc_t rs_ct = ( col_pref ? 1 : NR ); \ + const inc_t cs_ct = ( col_pref ? MR : 1 ); \ +\ + ctype* restrict zero = PASTEMAC(ch,0); \ + ctype* restrict minus_one = PASTEMAC(ch,m1); \ + ctype* restrict a_cast = a; \ + ctype* restrict b_cast = b; \ + ctype* restrict c_cast = c; \ + ctype* restrict alpha1_cast = alpha1; \ + ctype* restrict alpha2_cast = alpha2; \ + ctype* restrict b1; \ + ctype* restrict c1; \ +\ + doff_t diagoffa_i; \ + dim_t k_full; \ + dim_t m_iter, m_left; \ + dim_t n_iter, n_left; \ + dim_t m_cur; \ + dim_t n_cur; \ + dim_t k_a1011; \ + dim_t k_a10; \ + dim_t off_a10; \ + dim_t off_a11; \ + dim_t i, j; \ + inc_t rstep_a; \ + inc_t cstep_b; \ + inc_t rstep_c, cstep_c; \ + inc_t istep_a; \ + inc_t istep_b; \ + inc_t off_scl; \ + inc_t ss_a_num; \ + inc_t ss_a_den; \ + inc_t ps_a_cur; \ + inc_t is_a_cur; \ + auxinfo_t aux; \ +\ + /* + Assumptions/assertions: + rs_a == 1 + cs_a == PACKMR + pd_a == MR + ps_a == stride to next micro-panel of A + rs_b == PACKNR + cs_b == 1 + pd_b == NR + ps_b == stride to next micro-panel of B + rs_c == (no assumptions) + cs_c == (no assumptions) + */ \ +\ + /* Safety trap: Certain indexing within this macro-kernel does not + work as intended if both MR and NR are odd. */ \ + if ( ( bli_is_odd( PACKMR ) && bli_is_odd( NR ) ) || \ + ( bli_is_odd( PACKNR ) && bli_is_odd( MR ) ) ) bli_abort(); \ +\ + /* If any dimension is zero, return immediately. */ \ + if ( bli_zero_dim3( m, n, k ) ) return; \ +\ + /* Safeguard: If matrix A is above the diagonal, it is implicitly zero. + So we do nothing. */ \ + if ( bli_is_strictly_above_diag_n( diagoffa, m, k ) ) return; \ +\ + /* Compute k_full as k inflated up to a multiple of MR. 
This is + needed because some parameter combinations of trsm reduce k + to advance past zero regions in the triangular matrix, and + when computing the imaginary stride of B (the non-triangular + matrix), which is used by 4m1/3m1 implementations, we need + this unreduced value of k. */ \ + k_full = ( k % MR != 0 ? k + MR - ( k % MR ) : k ); \ +\ + /* Compute indexing scaling factor for 4m or 3m. This is + needed because one of the packing register blocksizes (PACKMR + or PACKNR) is used to index into the micro-panels of the non- + triangular matrix when computing with a diagonal-intersecting + micro-panel of the triangular matrix. In the case of 4m or 3m, + real values are stored in both sub-panels, and so the indexing + needs to occur in units of real values. The value computed + here is divided into the complex pointer offset to cause the + pointer to be advanced by the correct value. */ \ + if ( bli_is_4mi_packed( schema_a ) || \ + bli_is_3mi_packed( schema_a ) || \ + bli_is_rih_packed( schema_a ) ) off_scl = 2; \ + else off_scl = 1; \ +\ + /* Compute the storage stride scaling. Usually this is just 1. + However, in the case of interleaved 3m, we need to scale the + offset by 3/2. Note that real-only, imag-only, and summed-only + packing formats are not applicable here since trsm is a two- + operand operation only (unlike trmm, which is capable of three- + operand). */ \ + if ( bli_is_3mi_packed( schema_a ) ) { ss_a_num = 3; ss_a_den = 2; } \ + else { ss_a_num = 1; ss_a_den = 1; } \ +\ + /* If there is a zero region above where the diagonal of A intersects the + left edge of the block, adjust the pointer to C and treat this case as + if the diagonal offset were zero. This skips over the region that was + not packed. (Note we assume the diagonal offset is a multiple of MR; + this assumption will hold as long as the cache blocksizes are each a + multiple of MR and NR.)
*/ \ + if ( diagoffa < 0 ) \ + { \ + i = -diagoffa; \ + m = m - i; \ + diagoffa = 0; \ + c_cast = c_cast + (i )*rs_c; \ + } \ +\ + /* Check the k dimension, which needs to be a multiple of MR. If k + isn't a multiple of MR, we adjust it higher to satisfy the micro- + kernel, which is expecting to perform an MR x MR triangular solve. + This adjustment of k is consistent with what happened when A was + packed: all of its bottom/right edges were zero-padded, and + furthermore, the panel that stores the bottom-right corner of the + matrix has its diagonal extended into the zero-padded region (as + identity). This allows the trsm of that bottom-right panel to + proceed without producing any infs or NaNs that would infect the + "good" values of the corresponding block of B. */ \ + if ( k % MR != 0 ) k += MR - ( k % MR ); \ +\ + /* NOTE: We don't need to check that m is a multiple of PACKMR since we + know that the underlying buffer was already allocated to have an m + dimension that is a multiple of PACKMR, with the region between the + last row and the next multiple of MR zero-padded accordingly. */ \ +\ + /* Clear the temporary C buffer in case it has any infs or NaNs. */ \ + PASTEMAC(ch,set0s_mxn)( MR, NR, \ + ct, rs_ct, cs_ct ); \ +\ + /* Compute number of primary and leftover components of the m and n + dimensions. */ \ + n_iter = n / NR; \ + n_left = n % NR; \ +\ + m_iter = m / MR; \ + m_left = m % MR; \ +\ + if ( n_left ) ++n_iter; \ + if ( m_left ) ++m_iter; \ +\ + /* Determine some increments used to step through A, B, and C. */ \ + rstep_a = ps_a; \ +\ + cstep_b = ps_b; \ +\ + rstep_c = rs_c * MR; \ + cstep_c = cs_c * NR; \ +\ + istep_a = PACKMR * k; \ + istep_b = PACKNR * k_full; \ +\ + if ( bli_is_odd( istep_a ) ) istep_a += 1; \ + if ( bli_is_odd( istep_b ) ) istep_b += 1; \ +\ + /* Save the pack schemas of A and B to the auxinfo_t object. 
*/ \ + bli_auxinfo_set_schema_a( schema_a, &aux ); \ + bli_auxinfo_set_schema_b( schema_b, &aux ); \ +\ + /* Save the imaginary stride of B to the auxinfo_t object. */ \ + bli_auxinfo_set_is_b( istep_b, &aux ); \ +\ + /* We don't bother querying the thrinfo_t node for the 1st loop because + we can't parallelize that loop in trsm due to the inter-iteration + dependencies that exist. */ \ + /*thrinfo_t* caucus = bli_thrinfo_sub_node( thread );*/ \ +\ + /* Query the number of threads and thread ids for each loop. */ \ + dim_t jr_nt = bli_thread_n_way( thread ); \ + dim_t jr_tid = bli_thread_work_id( thread ); \ +\ + dim_t jr_start, jr_end; \ + dim_t jr_inc; \ +\ + /* Use round-robin assignment of micropanels to threads in the 2nd loop. + NOTE: Parallelism in the 1st loop is unattainable due to the + inter-iteration dependencies present in trsm. */ \ + bli_thread_range_jrir_rr( thread, n_iter, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \ +\ + /* Loop over the n dimension (NR columns at a time). */ \ + for ( j = jr_start; j < jr_end; j += jr_inc ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b2; \ +\ + b1 = b_cast + j * cstep_b; \ + c1 = c_cast + j * cstep_c; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. */ \ + b2 = b1; \ +\ + a1 = a_cast; \ + c11 = c1 + (0 )*rstep_c; \ +\ + /* Loop over the m dimension (MR rows at a time). */ \ + for ( i = 0; i < m_iter; ++i ) \ + { \ + diagoffa_i = diagoffa + ( doff_t )i*MR; \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + /* If the current panel of A intersects the diagonal, use a + special micro-kernel that performs a fused gemm and trsm. + If the current panel of A resides below the diagonal, use + a regular gemm micro-kernel. Otherwise, if it is above the + diagonal, it was not packed (because it is implicitly zero) + and so we do nothing.
*/ \ + if ( bli_intersects_diag_n( diagoffa_i, MR, k ) ) \ + { \ + ctype* restrict a10; \ + ctype* restrict a11; \ + ctype* restrict b01; \ + ctype* restrict b11; \ + ctype* restrict a2; \ +\ + /* Compute various offsets into and lengths of parts of A. */ \ + off_a10 = 0; \ + k_a1011 = diagoffa_i + MR; \ + k_a10 = k_a1011 - MR; \ + off_a11 = k_a10; \ +\ + /* Compute the panel stride for the current diagonal- + intersecting micro-panel. */ \ + is_a_cur = k_a1011 * PACKMR; \ + is_a_cur += ( bli_is_odd( is_a_cur ) ? 1 : 0 ); \ + ps_a_cur = ( is_a_cur * ss_a_num ) / ss_a_den; \ +\ + /* Compute the addresses of the panel A10 and the triangular + block A11. */ \ + a10 = a1; \ + /* a11 = a1 + ( k_a10 * PACKMR ) / off_scl; */ \ + a11 = bli_ptr_inc_by_frac( a1, sizeof( ctype ), k_a10 * PACKMR, off_scl ); \ +\ + /* Compute the addresses of the panel B01 and the block + B11. */ \ + b01 = b1 + ( off_a10 * PACKNR ) / off_scl; \ + b11 = b1 + ( off_a11 * PACKNR ) / off_scl; \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = a1 + ps_a_cur; \ + if ( bli_is_last_iter_rr( i, m_iter, 0, 1 ) ) \ + { \ + a2 = a_cast; \ + b2 = b1; \ + if ( bli_is_last_iter_rr( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Save the 4m1/3m1 imaginary stride of A to the auxinfo_t + object. */ \ + bli_auxinfo_set_is_a( is_a_cur, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the fused gemm/trsm micro-kernel. */ \ + gemmtrsm_ukr \ + ( \ + k_a10, \ + alpha1_cast, \ + a10, \ + a11, \ + b01, \ + b11, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the fused gemm/trsm micro-kernel. 
*/ \ + gemmtrsm_ukr \ + ( \ + k_a10, \ + alpha1_cast, \ + a10, \ + a11, \ + b01, \ + b11, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Copy the result to the bottom edge of C. */ \ + PASTEMAC(ch,copys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + c11, rs_c, cs_c ); \ + } \ +\ + a1 += ps_a_cur; \ + } \ + else if ( bli_is_strictly_below_diag_n( diagoffa_i, MR, k ) ) \ + { \ + ctype* restrict a2; \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = a1 + rstep_a; \ + if ( bli_is_last_iter_rr( i, m_iter, 0, 1 ) ) \ + { \ + a2 = a_cast; \ + b2 = b1; \ + if ( bli_is_last_iter_rr( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Save the 4m1/3m1 imaginary stride of A to the auxinfo_t + object. */ \ + bli_auxinfo_set_is_a( istep_a, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + minus_one, \ + a1, \ + b1, \ + alpha2_cast, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + minus_one, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Add the result to the edge of C. 
*/ \ + PASTEMAC(ch,xpbys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + alpha2_cast, \ + c11, rs_c, cs_c ); \ + } \ +\ + a1 += rstep_a; \ + } \ +\ + c11 += rstep_c; \ + } \ + } \ +\ +/* +if ( bli_is_4mi_packed( schema_a ) ){ \ +PASTEMAC(d,fprintm)( stdout, "trsm4m1_ll_ker_var2: b_r before", k, n, \ + ( double* )b, rs_b, 1, "%4.1f", "" ); \ +PASTEMAC(d,fprintm)( stdout, "trsm4m1_ll_ker_var2: b_i before", k, n, \ + ( double* )b+72, rs_b, 1, "%4.1f", "" ); \ +}else{ \ +PASTEMAC(d,fprintm)( stdout, "trsmnat_ll_ker_var2: b_r before", k, n, \ + ( double* )b, 2*rs_b, 2, "%4.1f", "" ); \ +PASTEMAC(d,fprintm)( stdout, "trsmnat_ll_ker_var2: b_i before", k, n, \ + ( double* )b+1, 2*rs_b, 2, "%4.1f", "" ); \ +} \ +*/ \ +\ +/* +PASTEMAC(d,fprintm)( stdout, "trsm_ll_ker_var2: a11p_r computed", MR, MR, \ + ( double* )a11, 1, PACKMR, "%4.1f", "" ); \ +*/ \ +\ +/* +if ( bli_is_4mi_packed( schema_a ) ){ \ +PASTEMAC(d,fprintm)( stdout, "trsm4m1_ll_ker_var2: b_r after", k, n, \ + ( double* )b, rs_b, 1, "%4.1f", "" ); \ +PASTEMAC(d,fprintm)( stdout, "trsm4m1_ll_ker_var2: b_i after", k, n, \ + ( double* )b+72, rs_b, 1, "%4.1f", "" ); \ +}else{ \ +PASTEMAC(d,fprintm)( stdout, "trsmnat_ll_ker_var2: b_r after", k, n, \ + ( double* )b, 2*rs_b, 2, "%4.1f", "" ); \ +PASTEMAC(d,fprintm)( stdout, "trsmnat_ll_ker_var2: b_i after", k, n, \ + ( double* )b+1, 2*rs_b, 2, "%4.1f", "" ); \ +} \ + +PASTEMAC(d,fprintm)( stdout, "trsm_ll_ker_var2: b_r", m, n, \ + ( double* )c, 1, cs_c, "%4.1f", "" ); \ +PASTEMAC(d,fprintm)( stdout, "trsm_ll_ker_var2: b_i", m, n, \ + ( double* )c + 8*9, 1, cs_c, "%4.1f", "" ); \ +*/ \ +\ +/* +PASTEMAC(ch,fprintm)( stdout, "trsm_ll_ker_var2: a1 (diag)", MR, k_a1011, a1, 1, MR, "%5.2f", "" ); \ +PASTEMAC(ch,fprintm)( stdout, "trsm_ll_ker_var2: a11 (diag)", MR, MR, a11, 1, MR, "%5.2f", "" ); \ +PASTEMAC(ch,fprintm)( stdout, "trsm_ll_ker_var2: b1 (diag)", k_a1011, NR, bp_i, NR, 1, "%5.2f", "" ); \ +PASTEMAC(ch,fprintm)( stdout, "trsm_ll_ker_var2: bp11 (diag)", MR, NR, bp11, 
NR, 1, "%5.2f", "" ); \ +*/ \ +\ +/* +PASTEMAC(ch,fprintm)( stdout, "trsm_ll_ker_var2: a1 (ndiag)", MR, k, a1, 1, MR, "%5.2f", "" ); \ +PASTEMAC(ch,fprintm)( stdout, "trsm_ll_ker_var2: b1 (ndiag)", k, NR, bp, NR, 1, "%5.2f", "" ); \ +*/ \ +} + +INSERT_GENTFUNC_BASIC0( trsm_ll_ker_var2rr ) + diff --git a/frame/3/trsm/bli_trsm_ll_ker_var2sl.c b/frame/3/trsm/bli_trsm_ll_ker_var2sl.c new file mode 100644 index 000000000..e67de28fe --- /dev/null +++ b/frame/3/trsm/bli_trsm_ll_ker_var2sl.c @@ -0,0 +1,605 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +*/ + +#include "blis.h" + +#define FUNCPTR_T gemm_fp + +typedef void (*FUNCPTR_T) + ( + doff_t diagoffa, + pack_t schema_a, + pack_t schema_b, + dim_t m, + dim_t n, + dim_t k, + void* alpha1, + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, + void* alpha2, + void* c, inc_t rs_c, inc_t cs_c, + cntx_t* cntx, + rntm_t* rntm, + thrinfo_t* thread + ); + +static FUNCPTR_T GENARRAY(ftypes,trsm_ll_ker_var2sl); + +// +// -- Macrokernel functions for slab partitioning ------------------------------ +// + +void bli_trsm_ll_ker_var2sl + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + num_t dt_exec = bli_obj_exec_dt( c ); + + doff_t diagoffa = bli_obj_diag_offset( a ); + + pack_t schema_a = bli_obj_pack_schema( a ); + pack_t schema_b = bli_obj_pack_schema( b ); + + dim_t m = bli_obj_length( c ); + dim_t n = bli_obj_width( c ); + dim_t k = bli_obj_width( a ); + + void* buf_a = bli_obj_buffer_at_off( a ); + inc_t cs_a = bli_obj_col_stride( a ); + dim_t pd_a = bli_obj_panel_dim( a ); + inc_t ps_a = bli_obj_panel_stride( a ); + + void* buf_b = bli_obj_buffer_at_off( b ); + inc_t rs_b = bli_obj_row_stride( b ); + dim_t pd_b = bli_obj_panel_dim( b ); + inc_t ps_b = bli_obj_panel_stride( b ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c = bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + void* buf_alpha1; + void* 
buf_alpha2; + + FUNCPTR_T f; + + // Grab the address of the internal scalar buffer for the scalar + // attached to B (the non-triangular matrix). This will be the alpha + // scalar used in the gemmtrsm subproblems (ie: the scalar that would + // be applied to the packed copy of B prior to it being updated by + // the trsm subproblem). This scalar may be unit, if for example it + // was applied during packing. + buf_alpha1 = bli_obj_internal_scalar_buffer( b ); + + // Grab the address of the internal scalar buffer for the scalar + // attached to C. This will be the "beta" scalar used in the gemm-only + // subproblems that correspond to micro-panels that do not intersect + // the diagonal. We need this separate scalar because it's possible + // that the alpha attached to B was reset, if it was applied during + // packing. + buf_alpha2 = bli_obj_internal_scalar_buffer( c ); + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_exec]; + + // Invoke the function. + f( diagoffa, + schema_a, + schema_b, + m, + n, + k, + buf_alpha1, + buf_a, cs_a, pd_a, ps_a, + buf_b, rs_b, pd_b, ps_b, + buf_alpha2, + buf_c, rs_c, cs_c, + cntx, + rntm, + thread ); +} + + +#undef GENTFUNC +#define GENTFUNC( ctype, ch, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + doff_t diagoffa, \ + pack_t schema_a, \ + pack_t schema_b, \ + dim_t m, \ + dim_t n, \ + dim_t k, \ + void* alpha1, \ + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, \ + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, \ + void* alpha2, \ + void* c, inc_t rs_c, inc_t cs_c, \ + cntx_t* cntx, \ + rntm_t* rntm, \ + thrinfo_t* thread \ + ) \ +{ \ + const num_t dt = PASTEMAC(ch,type); \ +\ + /* Alias some constants to simpler names. */ \ + const dim_t MR = pd_a; \ + const dim_t NR = pd_b; \ + const dim_t PACKMR = cs_a; \ + const dim_t PACKNR = rs_b; \ +\ + /* Cast the micro-kernel address to its function pointer type. 
*/ \ + PASTECH(ch,gemmtrsm_ukr_ft) \ + gemmtrsm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMMTRSM_L_UKR, cntx ); \ + PASTECH(ch,gemm_ukr_ft) \ + gemm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx ); \ +\ + /* Temporary C buffer for edge cases. Note that the strides of this + temporary buffer are set so that they match the storage of the + original C matrix. For example, if C is column-stored, ct will be + column-stored as well. */ \ + ctype ct[ BLIS_STACK_BUF_MAX_SIZE \ + / sizeof( ctype ) ] \ + __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \ + const bool_t col_pref = bli_cntx_l3_vir_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \ + const inc_t rs_ct = ( col_pref ? 1 : NR ); \ + const inc_t cs_ct = ( col_pref ? MR : 1 ); \ +\ + ctype* restrict zero = PASTEMAC(ch,0); \ + ctype* restrict minus_one = PASTEMAC(ch,m1); \ + ctype* restrict a_cast = a; \ + ctype* restrict b_cast = b; \ + ctype* restrict c_cast = c; \ + ctype* restrict alpha1_cast = alpha1; \ + ctype* restrict alpha2_cast = alpha2; \ + ctype* restrict b1; \ + ctype* restrict c1; \ +\ + doff_t diagoffa_i; \ + dim_t k_full; \ + dim_t m_iter, m_left; \ + dim_t n_iter, n_left; \ + dim_t m_cur; \ + dim_t n_cur; \ + dim_t k_a1011; \ + dim_t k_a10; \ + dim_t off_a10; \ + dim_t off_a11; \ + dim_t i, j; \ + inc_t rstep_a; \ + inc_t cstep_b; \ + inc_t rstep_c, cstep_c; \ + inc_t istep_a; \ + inc_t istep_b; \ + inc_t off_scl; \ + inc_t ss_a_num; \ + inc_t ss_a_den; \ + inc_t ps_a_cur; \ + inc_t is_a_cur; \ + auxinfo_t aux; \ +\ + /* + Assumptions/assertions: + rs_a == 1 + cs_a == PACKMR + pd_a == MR + ps_a == stride to next micro-panel of A + rs_b == PACKNR + cs_b == 1 + pd_b == NR + ps_b == stride to next micro-panel of B + rs_c == (no assumptions) + cs_c == (no assumptions) + */ \ +\ + /* Safety trap: Certain indexing within this macro-kernel does not + work as intended if both MR and NR are odd. 
*/ \ + if ( ( bli_is_odd( PACKMR ) && bli_is_odd( NR ) ) || \ + ( bli_is_odd( PACKNR ) && bli_is_odd( MR ) ) ) bli_abort(); \ +\ + /* If any dimension is zero, return immediately. */ \ + if ( bli_zero_dim3( m, n, k ) ) return; \ +\ + /* Safeguard: If matrix A is above the diagonal, it is implicitly zero. + So we do nothing. */ \ + if ( bli_is_strictly_above_diag_n( diagoffa, m, k ) ) return; \ +\ + /* Compute k_full as k inflated up to a multiple of MR. This is + needed because some parameter combinations of trsm reduce k + to advance past zero regions in the triangular matrix, and + when computing the imaginary stride of B (the non-triangular + matrix), which is used by 4m1/3m1 implementations, we need + this unreduced value of k. */ \ + k_full = ( k % MR != 0 ? k + MR - ( k % MR ) : k ); \ +\ + /* Compute indexing scaling factor for 4m or 3m. This is + needed because one of the packing register blocksizes (PACKMR + or PACKNR) is used to index into the micro-panels of the non- + triangular matrix when computing with a diagonal-intersecting + micro-panel of the triangular matrix. In the case of 4m or 3m, + real values are stored in both sub-panels, and so the indexing + needs to occur in units of real values. The value computed + here is divided into the complex pointer offset to cause the + pointer to be advanced by the correct value. */ \ + if ( bli_is_4mi_packed( schema_a ) || \ + bli_is_3mi_packed( schema_a ) || \ + bli_is_rih_packed( schema_a ) ) off_scl = 2; \ + else off_scl = 1; \ +\ + /* Compute the storage stride scaling. Usually this is just 1. + However, in the case of interleaved 3m, we need to scale the + offset by 3/2. Note that real-only, imag-only, and summed-only + packing formats are not applicable here since trsm is a two- + operand operation only (unlike trmm, which is capable of three- + operand).
*/ \ + if ( bli_is_3mi_packed( schema_a ) ) { ss_a_num = 3; ss_a_den = 2; } \ + else { ss_a_num = 1; ss_a_den = 1; } \ +\ + /* If there is a zero region above where the diagonal of A intersects the + left edge of the block, adjust the pointer to C and treat this case as + if the diagonal offset were zero. This skips over the region that was + not packed. (Note we assume the diagonal offset is a multiple of MR; + this assumption will hold as long as the cache blocksizes are each a + multiple of MR and NR.) */ \ + if ( diagoffa < 0 ) \ + { \ + i = -diagoffa; \ + m = m - i; \ + diagoffa = 0; \ + c_cast = c_cast + (i )*rs_c; \ + } \ +\ + /* Check the k dimension, which needs to be a multiple of MR. If k + isn't a multiple of MR, we adjust it higher to satisfy the micro- + kernel, which is expecting to perform an MR x MR triangular solve. + This adjustment of k is consistent with what happened when A was + packed: all of its bottom/right edges were zero-padded, and + furthermore, the panel that stores the bottom-right corner of the + matrix has its diagonal extended into the zero-padded region (as + identity). This allows the trsm of that bottom-right panel to + proceed without producing any infs or NaNs that would infect the + "good" values of the corresponding block of B. */ \ + if ( k % MR != 0 ) k += MR - ( k % MR ); \ +\ + /* NOTE: We don't need to check that m is a multiple of PACKMR since we + know that the underlying buffer was already allocated to have an m + dimension that is a multiple of PACKMR, with the region between the + last row and the next multiple of MR zero-padded accordingly. */ \ +\ + /* Clear the temporary C buffer in case it has any infs or NaNs. */ \ + PASTEMAC(ch,set0s_mxn)( MR, NR, \ + ct, rs_ct, cs_ct ); \ +\ + /* Compute number of primary and leftover components of the m and n + dimensions. 
*/ \ + n_iter = n / NR; \ + n_left = n % NR; \ +\ + m_iter = m / MR; \ + m_left = m % MR; \ +\ + if ( n_left ) ++n_iter; \ + if ( m_left ) ++m_iter; \ +\ + /* Determine some increments used to step through A, B, and C. */ \ + rstep_a = ps_a; \ +\ + cstep_b = ps_b; \ +\ + rstep_c = rs_c * MR; \ + cstep_c = cs_c * NR; \ +\ + istep_a = PACKMR * k; \ + istep_b = PACKNR * k_full; \ +\ + if ( bli_is_odd( istep_a ) ) istep_a += 1; \ + if ( bli_is_odd( istep_b ) ) istep_b += 1; \ +\ + /* Save the pack schemas of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_schema_a( schema_a, &aux ); \ + bli_auxinfo_set_schema_b( schema_b, &aux ); \ +\ + /* Save the imaginary stride of B to the auxinfo_t object. */ \ + bli_auxinfo_set_is_b( istep_b, &aux ); \ +\ + /* We don't bother querying the thrinfo_t node for the 1st loop because + we can't parallelize that loop in trsm due to the inter-iteration + dependencies that exist. */ \ + /*thrinfo_t* caucus = bli_thrinfo_sub_node( thread );*/ \ +\ + /* Query the number of threads and thread ids for each loop. */ \ + dim_t jr_nt = bli_thread_n_way( thread ); \ + dim_t jr_tid = bli_thread_work_id( thread ); \ +\ + dim_t jr_start, jr_end; \ + dim_t jr_inc; \ +\ + /* Use slab assignment of micropanels to threads in the 2nd loop. + NOTE: Parallelism in the 1st loop is unattainable due to the + inter-iteration dependencies present in trsm. */ \ + bli_thread_range_jrir_sl( thread, n_iter, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \ +\ + /* Loop over the n dimension (NR columns at a time). */ \ + for ( j = jr_start; j < jr_end; j += jr_inc ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b2; \ +\ + b1 = b_cast + j * cstep_b; \ + c1 = c_cast + j * cstep_c; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. 
*/ \ + b2 = b1; \ +\ + a1 = a_cast; \ + c11 = c1 + (0 )*rstep_c; \ +\ + /* Loop over the m dimension (MR rows at a time). */ \ + for ( i = 0; i < m_iter; ++i ) \ + { \ + diagoffa_i = diagoffa + ( doff_t )i*MR; \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + /* If the current panel of A intersects the diagonal, use a + special micro-kernel that performs a fused gemm and trsm. + If the current panel of A resides below the diagonal, use + a regular gemm micro-kernel. Otherwise, if it is above the + diagonal, it was not packed (because it is implicitly zero) + and so we do nothing. */ \ + if ( bli_intersects_diag_n( diagoffa_i, MR, k ) ) \ + { \ + ctype* restrict a10; \ + ctype* restrict a11; \ + ctype* restrict b01; \ + ctype* restrict b11; \ + ctype* restrict a2; \ +\ + /* Compute various offsets into and lengths of parts of A. */ \ + off_a10 = 0; \ + k_a1011 = diagoffa_i + MR; \ + k_a10 = k_a1011 - MR; \ + off_a11 = k_a10; \ +\ + /* Compute the panel stride for the current diagonal- + intersecting micro-panel. */ \ + is_a_cur = k_a1011 * PACKMR; \ + is_a_cur += ( bli_is_odd( is_a_cur ) ? 1 : 0 ); \ + ps_a_cur = ( is_a_cur * ss_a_num ) / ss_a_den; \ +\ + /* Compute the addresses of the panel A10 and the triangular + block A11. */ \ + a10 = a1; \ + /* a11 = a1 + ( k_a10 * PACKMR ) / off_scl; */ \ + a11 = bli_ptr_inc_by_frac( a1, sizeof( ctype ), k_a10 * PACKMR, off_scl ); \ +\ + /* Compute the addresses of the panel B01 and the block + B11. */ \ + b01 = b1 + ( off_a10 * PACKNR ) / off_scl; \ + b11 = b1 + ( off_a11 * PACKNR ) / off_scl; \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = a1 + ps_a_cur; \ + if ( bli_is_last_iter_rr( i, m_iter, 0, 1 ) ) \ + { \ + a2 = a_cast; \ + b2 = b1; \ + if ( bli_is_last_iter_sl( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object.
*/ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Save the 4m1/3m1 imaginary stride of A to the auxinfo_t + object. */ \ + bli_auxinfo_set_is_a( is_a_cur, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the fused gemm/trsm micro-kernel. */ \ + gemmtrsm_ukr \ + ( \ + k_a10, \ + alpha1_cast, \ + a10, \ + a11, \ + b01, \ + b11, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the fused gemm/trsm micro-kernel. */ \ + gemmtrsm_ukr \ + ( \ + k_a10, \ + alpha1_cast, \ + a10, \ + a11, \ + b01, \ + b11, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Copy the result to the bottom edge of C. */ \ + PASTEMAC(ch,copys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + c11, rs_c, cs_c ); \ + } \ +\ + a1 += ps_a_cur; \ + } \ + else if ( bli_is_strictly_below_diag_n( diagoffa_i, MR, k ) ) \ + { \ + ctype* restrict a2; \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = a1 + rstep_a; \ + if ( bli_is_last_iter_rr( i, m_iter, 0, 1 ) ) \ + { \ + a2 = a_cast; \ + b2 = b1; \ + if ( bli_is_last_iter_sl( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Save the 4m1/3m1 imaginary stride of A to the auxinfo_t + object. */ \ + bli_auxinfo_set_is_a( istep_a, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + minus_one, \ + a1, \ + b1, \ + alpha2_cast, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + minus_one, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Add the result to the edge of C. 
*/ \ + PASTEMAC(ch,xpbys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + alpha2_cast, \ + c11, rs_c, cs_c ); \ + } \ +\ + a1 += rstep_a; \ + } \ +\ + c11 += rstep_c; \ + } \ + } \ +\ +/* +if ( bli_is_4mi_packed( schema_a ) ){ \ +PASTEMAC(d,fprintm)( stdout, "trsm4m1_ll_ker_var2: b_r before", k, n, \ + ( double* )b, rs_b, 1, "%4.1f", "" ); \ +PASTEMAC(d,fprintm)( stdout, "trsm4m1_ll_ker_var2: b_i before", k, n, \ + ( double* )b+72, rs_b, 1, "%4.1f", "" ); \ +}else{ \ +PASTEMAC(d,fprintm)( stdout, "trsmnat_ll_ker_var2: b_r before", k, n, \ + ( double* )b, 2*rs_b, 2, "%4.1f", "" ); \ +PASTEMAC(d,fprintm)( stdout, "trsmnat_ll_ker_var2: b_i before", k, n, \ + ( double* )b+1, 2*rs_b, 2, "%4.1f", "" ); \ +} \ +*/ \ +\ +/* +PASTEMAC(d,fprintm)( stdout, "trsm_ll_ker_var2: a11p_r computed", MR, MR, \ + ( double* )a11, 1, PACKMR, "%4.1f", "" ); \ +*/ \ +\ +/* +if ( bli_is_4mi_packed( schema_a ) ){ \ +PASTEMAC(d,fprintm)( stdout, "trsm4m1_ll_ker_var2: b_r after", k, n, \ + ( double* )b, rs_b, 1, "%4.1f", "" ); \ +PASTEMAC(d,fprintm)( stdout, "trsm4m1_ll_ker_var2: b_i after", k, n, \ + ( double* )b+72, rs_b, 1, "%4.1f", "" ); \ +}else{ \ +PASTEMAC(d,fprintm)( stdout, "trsmnat_ll_ker_var2: b_r after", k, n, \ + ( double* )b, 2*rs_b, 2, "%4.1f", "" ); \ +PASTEMAC(d,fprintm)( stdout, "trsmnat_ll_ker_var2: b_i after", k, n, \ + ( double* )b+1, 2*rs_b, 2, "%4.1f", "" ); \ +} \ + +PASTEMAC(d,fprintm)( stdout, "trsm_ll_ker_var2: b_r", m, n, \ + ( double* )c, 1, cs_c, "%4.1f", "" ); \ +PASTEMAC(d,fprintm)( stdout, "trsm_ll_ker_var2: b_i", m, n, \ + ( double* )c + 8*9, 1, cs_c, "%4.1f", "" ); \ +*/ \ +\ +/* +PASTEMAC(ch,fprintm)( stdout, "trsm_ll_ker_var2: a1 (diag)", MR, k_a1011, a1, 1, MR, "%5.2f", "" ); \ +PASTEMAC(ch,fprintm)( stdout, "trsm_ll_ker_var2: a11 (diag)", MR, MR, a11, 1, MR, "%5.2f", "" ); \ +PASTEMAC(ch,fprintm)( stdout, "trsm_ll_ker_var2: b1 (diag)", k_a1011, NR, bp_i, NR, 1, "%5.2f", "" ); \ +PASTEMAC(ch,fprintm)( stdout, "trsm_ll_ker_var2: bp11 (diag)", MR, NR, bp11, 
NR, 1, "%5.2f", "" ); \ +*/ \ +\ +/* +PASTEMAC(ch,fprintm)( stdout, "trsm_ll_ker_var2: a1 (ndiag)", MR, k, a1, 1, MR, "%5.2f", "" ); \ +PASTEMAC(ch,fprintm)( stdout, "trsm_ll_ker_var2: b1 (ndiag)", k, NR, bp, NR, 1, "%5.2f", "" ); \ +*/ \ +} + +INSERT_GENTFUNC_BASIC0( trsm_ll_ker_var2sl ) + diff --git a/frame/3/trsm/bli_trsm_lu_ker_var2rr.c b/frame/3/trsm/bli_trsm_lu_ker_var2rr.c new file mode 100644 index 000000000..3d2792508 --- /dev/null +++ b/frame/3/trsm/bli_trsm_lu_ker_var2rr.c @@ -0,0 +1,586 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +*/ + +#include "blis.h" + +#define FUNCPTR_T gemm_fp + +typedef void (*FUNCPTR_T) + ( + doff_t diagoffa, + pack_t schema_a, + pack_t schema_b, + dim_t m, + dim_t n, + dim_t k, + void* alpha1, + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, + void* alpha2, + void* c, inc_t rs_c, inc_t cs_c, + cntx_t* cntx, + rntm_t* rntm, + thrinfo_t* thread + ); + +static FUNCPTR_T GENARRAY(ftypes,trsm_lu_ker_var2rr); + +// +// -- Macrokernel functions for round-robin partitioning ----------------------- +// + +void bli_trsm_lu_ker_var2rr + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + num_t dt_exec = bli_obj_exec_dt( c ); + + doff_t diagoffa = bli_obj_diag_offset( a ); + + pack_t schema_a = bli_obj_pack_schema( a ); + pack_t schema_b = bli_obj_pack_schema( b ); + + dim_t m = bli_obj_length( c ); + dim_t n = bli_obj_width( c ); + dim_t k = bli_obj_width( a ); + + void* buf_a = bli_obj_buffer_at_off( a ); + inc_t cs_a = bli_obj_col_stride( a ); + dim_t pd_a = bli_obj_panel_dim( a ); + inc_t ps_a = bli_obj_panel_stride( a ); + + void* buf_b = bli_obj_buffer_at_off( b ); + inc_t rs_b = bli_obj_row_stride( b ); + dim_t pd_b = bli_obj_panel_dim( b ); + inc_t ps_b = bli_obj_panel_stride( b ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c = bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + void* buf_alpha1; + void* 
buf_alpha2; + + FUNCPTR_T f; + + // Grab the address of the internal scalar buffer for the scalar + // attached to B (the non-triangular matrix). This will be the alpha + // scalar used in the gemmtrsm subproblems (ie: the scalar that would + // be applied to the packed copy of B prior to it being updated by + // the trsm subproblem). This scalar may be unit, if for example it + // was applied during packing. + buf_alpha1 = bli_obj_internal_scalar_buffer( b ); + + // Grab the address of the internal scalar buffer for the scalar + // attached to C. This will be the "beta" scalar used in the gemm-only + // subproblems that correspond to micro-panels that do not intersect + // the diagonal. We need this separate scalar because it's possible + // that the alpha attached to B was reset, if it was applied during + // packing. + buf_alpha2 = bli_obj_internal_scalar_buffer( c ); + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_exec]; + + // Invoke the function. + f( diagoffa, + schema_a, + schema_b, + m, + n, + k, + buf_alpha1, + buf_a, cs_a, pd_a, ps_a, + buf_b, rs_b, pd_b, ps_b, + buf_alpha2, + buf_c, rs_c, cs_c, + cntx, + rntm, + thread ); +} + + +#undef GENTFUNC +#define GENTFUNC( ctype, ch, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + doff_t diagoffa, \ + pack_t schema_a, \ + pack_t schema_b, \ + dim_t m, \ + dim_t n, \ + dim_t k, \ + void* alpha1, \ + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, \ + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, \ + void* alpha2, \ + void* c, inc_t rs_c, inc_t cs_c, \ + cntx_t* cntx, \ + rntm_t* rntm, \ + thrinfo_t* thread \ + ) \ +{ \ + const num_t dt = PASTEMAC(ch,type); \ +\ + /* Alias some constants to simpler names. */ \ + const dim_t MR = pd_a; \ + const dim_t NR = pd_b; \ + const dim_t PACKMR = cs_a; \ + const dim_t PACKNR = rs_b; \ +\ + /* Cast the micro-kernel address to its function pointer type. 
*/ \ + PASTECH(ch,gemmtrsm_ukr_ft) \ + gemmtrsm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMMTRSM_U_UKR, cntx ); \ + PASTECH(ch,gemm_ukr_ft) \ + gemm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx ); \ +\ + /* Temporary C buffer for edge cases. Note that the strides of this + temporary buffer are set so that they match the storage of the + original C matrix. For example, if C is column-stored, ct will be + column-stored as well. */ \ + ctype ct[ BLIS_STACK_BUF_MAX_SIZE \ + / sizeof( ctype ) ] \ + __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \ + const bool_t col_pref = bli_cntx_l3_vir_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \ + const inc_t rs_ct = ( col_pref ? 1 : NR ); \ + const inc_t cs_ct = ( col_pref ? MR : 1 ); \ +\ + ctype* restrict zero = PASTEMAC(ch,0); \ + ctype* restrict minus_one = PASTEMAC(ch,m1); \ + ctype* restrict a_cast = a; \ + ctype* restrict b_cast = b; \ + ctype* restrict c_cast = c; \ + ctype* restrict alpha1_cast = alpha1; \ + ctype* restrict alpha2_cast = alpha2; \ + ctype* restrict b1; \ + ctype* restrict c1; \ +\ + doff_t diagoffa_i; \ + dim_t k_full; \ + dim_t m_iter, m_left; \ + dim_t n_iter, n_left; \ + dim_t m_cur; \ + dim_t n_cur; \ + dim_t k_a1112; \ + dim_t k_a11; \ + dim_t k_a12; \ + dim_t off_a11; \ + dim_t off_a12; \ + dim_t i, j, ib; \ + inc_t rstep_a; \ + inc_t cstep_b; \ + inc_t rstep_c, cstep_c; \ + inc_t istep_a; \ + inc_t istep_b; \ + inc_t off_scl; \ + inc_t ss_a_num; \ + inc_t ss_a_den; \ + inc_t ps_a_cur; \ + inc_t is_a_cur; \ + auxinfo_t aux; \ +\ + /* + Assumptions/assertions: + rs_a == 1 + cs_a == PACKMR + pd_a == MR + ps_a == stride to next micro-panel of A + rs_b == PACKNR + cs_b == 1 + pd_b == NR + ps_b == stride to next micro-panel of B + rs_c == (no assumptions) + cs_c == (no assumptions) + */ \ +\ + /* Safety trap: Certain indexing within this macro-kernel does not + work as intended if both MR and NR are odd. 
*/ \ + if ( ( bli_is_odd( PACKMR ) && bli_is_odd( NR ) ) || \ + ( bli_is_odd( PACKNR ) && bli_is_odd( MR ) ) ) bli_abort(); \ +\ + /* If any dimension is zero, return immediately. */ \ + if ( bli_zero_dim3( m, n, k ) ) return; \ +\ + /* Safeguard: If matrix A is below the diagonal, it is implicitly zero. + So we do nothing. */ \ + if ( bli_is_strictly_below_diag_n( diagoffa, m, k ) ) return; \ +\ + /* Compute k_full as k inflated up to a multiple of MR. This is + needed because some parameter combinations of trsm reduce k + to advance past zero regions in the triangular matrix, and + when computing the imaginary stride of B (the non-triangular + matrix), which is used by 4m1/3m1 implementations, we need + this unreduced value of k. */ \ + k_full = ( k % MR != 0 ? k + MR - ( k % MR ) : k ); \ +\ + /* Compute indexing scaling factor for 4m or 3m. This is + needed because one of the packing register blocksizes (PACKMR + or PACKNR) is used to index into the micro-panels of the non- + triangular matrix when computing with a diagonal-intersecting + micro-panel of the triangular matrix. In the case of 4m or 3m, + real values are stored in both sub-panels, and so the indexing + needs to occur in units of real values. The value computed + here is divided into the complex pointer offset to cause the + pointer to be advanced by the correct value. */ \ + if ( bli_is_4mi_packed( schema_a ) || \ + bli_is_3mi_packed( schema_a ) || \ + bli_is_rih_packed( schema_a ) ) off_scl = 2; \ + else off_scl = 1; \ +\ + /* Compute the storage stride scaling. Usually this is just 1. + However, in the case of interleaved 3m, we need to scale the + offset by 3/2. Note that real-only, imag-only, and summed-only + packing formats are not applicable here since trsm is a two- + operand operation only (unlike trmm, which is capable of three- + operand).
*/ \ + if ( bli_is_3mi_packed( schema_a ) ) { ss_a_num = 3; ss_a_den = 2; } \ + else { ss_a_num = 1; ss_a_den = 1; } \ +\ + /* If there is a zero region to the left of where the diagonal of A + intersects the top edge of the block, adjust the pointer to B and + treat this case as if the diagonal offset were zero. Note that we + don't need to adjust the pointer to A since packm would have simply + skipped over the region that was not stored. */ \ + if ( diagoffa > 0 ) \ + { \ + i = diagoffa; \ + k = k - i; \ + diagoffa = 0; \ + b_cast = b_cast + ( i * PACKNR ) / off_scl; \ + } \ +\ + /* If there is a zero region below where the diagonal of A intersects the + right side of the block, shrink it to prevent "no-op" iterations from + executing. */ \ + if ( -diagoffa + k < m ) \ + { \ + m = -diagoffa + k; \ + } \ +\ + /* Check the k dimension, which needs to be a multiple of MR. If k + isn't a multiple of MR, we adjust it higher to satisfy the micro- + kernel, which is expecting to perform an MR x MR triangular solve. + This adjustment of k is consistent with what happened when A was + packed: all of its bottom/right edges were zero-padded, and + furthermore, the panel that stores the bottom-right corner of the + matrix has its diagonal extended into the zero-padded region (as + identity). This allows the trsm of that bottom-right panel to + proceed without producing any infs or NaNs that would infect the + "good" values of the corresponding block of B. */ \ + if ( k % MR != 0 ) k += MR - ( k % MR ); \ +\ + /* NOTE: We don't need to check that m is a multiple of PACKMR since we + know that the underlying buffer was already allocated to have an m + dimension that is a multiple of PACKMR, with the region between the + last row and the next multiple of MR zero-padded accordingly. */ \ +\ + /* Clear the temporary C buffer in case it has any infs or NaNs. 
*/ \ + PASTEMAC(ch,set0s_mxn)( MR, NR, \ + ct, rs_ct, cs_ct ); \ +\ + /* Compute number of primary and leftover components of the m and n + dimensions. */ \ + n_iter = n / NR; \ + n_left = n % NR; \ +\ + m_iter = m / MR; \ + m_left = m % MR; \ +\ + if ( n_left ) ++n_iter; \ + if ( m_left ) ++m_iter; \ +\ + /* Determine some increments used to step through A, B, and C. */ \ + rstep_a = ps_a; \ +\ + cstep_b = ps_b; \ +\ + rstep_c = rs_c * MR; \ + cstep_c = cs_c * NR; \ +\ + istep_a = PACKMR * k; \ + istep_b = PACKNR * k_full; \ +\ + if ( bli_is_odd( istep_a ) ) istep_a += 1; \ + if ( bli_is_odd( istep_b ) ) istep_b += 1; \ +\ + /* Save the pack schemas of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_schema_a( schema_a, &aux ); \ + bli_auxinfo_set_schema_b( schema_b, &aux ); \ +\ + /* Save the imaginary stride of B to the auxinfo_t object. */ \ + bli_auxinfo_set_is_b( istep_b, &aux ); \ +\ + /* We don't bother querying the thrinfo_t node for the 1st loop because + we can't parallelize that loop in trsm due to the inter-iteration + dependencies that exist. */ \ + /*thrinfo_t* caucus = bli_thrinfo_sub_node( thread );*/ \ +\ + /* Query the number of threads and thread ids for each loop. */ \ + dim_t jr_nt = bli_thread_n_way( thread ); \ + dim_t jr_tid = bli_thread_work_id( thread ); \ +\ + dim_t jr_start, jr_end; \ + dim_t jr_inc; \ +\ + /* Use round-robin assignment of micropanels to threads in the 2nd loop. + NOTE: Parallelism in the 1st loop is unattainable due to the + inter-iteration dependencies present in trsm. */ \ + bli_thread_range_jrir_rr( thread, n_iter, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \ +\ + /* Loop over the n dimension (NR columns at a time). */ \ + for ( j = jr_start; j < jr_end; j += jr_inc ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b2; \ +\ + b1 = b_cast + j * cstep_b; \ + c1 = c_cast + j * cstep_c; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? 
NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. */ \ + b2 = b1; \ +\ + a1 = a_cast; \ + c11 = c1 + (m_iter-1)*rstep_c; \ +\ + /* Loop over the m dimension (MR rows at a time). */ \ + for ( ib = 0; ib < m_iter; ++ib ) \ + { \ + i = m_iter - 1 - ib; \ + diagoffa_i = diagoffa + ( doff_t )i*MR; \ +\ + m_cur = ( bli_is_not_edge_b( ib, m_iter, m_left ) ? MR : m_left ); \ +\ + /* If the current panel of A intersects the diagonal, use a + special micro-kernel that performs a fused gemm and trsm. + If the current panel of A resides above the diagonal, use a + regular gemm micro-kernel. Otherwise, if it is below the + diagonal, it was not packed (because it is implicitly zero) + and so we do nothing. */ \ + if ( bli_intersects_diag_n( diagoffa_i, MR, k ) ) \ + { \ + ctype* restrict a11; \ + ctype* restrict a12; \ + ctype* restrict b11; \ + ctype* restrict b21; \ + ctype* restrict a2; \ +\ + /* Compute various offsets into and lengths of parts of A. */ \ + off_a11 = diagoffa_i; \ + k_a1112 = k - off_a11; \ + k_a11 = MR; \ + k_a12 = k_a1112 - MR; \ + off_a12 = off_a11 + k_a11; \ +\ + /* Compute the panel stride for the current diagonal- + intersecting micro-panel. */ \ + is_a_cur = k_a1112 * PACKMR; \ + is_a_cur += ( bli_is_odd( is_a_cur ) ? 1 : 0 ); \ + ps_a_cur = ( is_a_cur * ss_a_num ) / ss_a_den; \ +\ + /* Compute the addresses of the triangular block A11 and the + panel A12. */ \ + a11 = a1; \ + /* a12 = a1 + ( k_a11 * PACKMR ) / off_scl; */ \ + a12 = bli_ptr_inc_by_frac( a1, sizeof( ctype ), k_a11 * PACKMR, off_scl ); \ +\ + /* Compute the addresses of the block B11 and the panel + B21. */ \ + b11 = b1 + ( off_a11 * PACKNR ) / off_scl; \ + b21 = b1 + ( off_a12 * PACKNR ) / off_scl; \ +\ + /* Compute the addresses of the next panels of A and B.
*/ \ + a2 = a1 + ps_a_cur; \ + if ( bli_is_last_iter_rr( ib, m_iter, 0, 1 ) ) \ + { \ + a2 = a_cast; \ + b2 = b1; \ + if ( bli_is_last_iter_rr( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Save the 4m1/3m1 imaginary stride of A to the auxinfo_t + object. */ \ + bli_auxinfo_set_is_a( is_a_cur, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the fused gemm/trsm micro-kernel. */ \ + gemmtrsm_ukr \ + ( \ + k_a12, \ + alpha1_cast, \ + a12, \ + a11, \ + b21, \ + b11, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the fused gemm/trsm micro-kernel. */ \ + gemmtrsm_ukr \ + ( \ + k_a12, \ + alpha1_cast, \ + a12, \ + a11, \ + b21, \ + b11, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Copy the result to the bottom edge of C. */ \ + PASTEMAC(ch,copys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + c11, rs_c, cs_c ); \ + } \ +\ + a1 += ps_a_cur; \ + } \ + else if ( bli_is_strictly_above_diag_n( diagoffa_i, MR, k ) ) \ + { \ + ctype* restrict a2; \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = a1 + rstep_a; \ + if ( bli_is_last_iter_rr( ib, m_iter, 0, 1 ) ) \ + { \ + a2 = a_cast; \ + b2 = b1; \ + if ( bli_is_last_iter_rr( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Save the 4m1/3m1 imaginary stride of A to the auxinfo_t + object. */ \ + bli_auxinfo_set_is_a( istep_a, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. 
*/ \ + gemm_ukr \ + ( \ + k, \ + minus_one, \ + a1, \ + b1, \ + alpha2_cast, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + minus_one, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Add the result to the edge of C. */ \ + PASTEMAC(ch,xpbys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + alpha2_cast, \ + c11, rs_c, cs_c ); \ + } \ +\ + a1 += rstep_a; \ + } \ +\ + c11 -= rstep_c; \ + } \ + } \ +\ +/* +PASTEMAC(ch,fprintm)( stdout, "trsm_lu_ker_var2: a1 (diag)", MR, k_a1112, a1, 1, MR, "%5.2f", "" ); \ +PASTEMAC(ch,fprintm)( stdout, "trsm_lu_ker_var2: b11 (diag)", MR, NR, b11, NR, 1, "%6.3f", "" ); \ +printf( "m_iter = %lu\n", m_iter ); \ +printf( "m_cur = %lu\n", m_cur ); \ +printf( "k = %lu\n", k ); \ +printf( "diagoffa_i = %lu\n", diagoffa_i ); \ +printf( "off_a1112 = %lu\n", off_a1112 ); \ +printf( "k_a1112 = %lu\n", k_a1112 ); \ +printf( "k_a12 = %lu\n", k_a12 ); \ +printf( "k_a11 = %lu\n", k_a11 ); \ +printf( "rs_c,cs_c = %lu %lu\n", rs_c, cs_c ); \ +printf( "rs_ct,cs_ct= %lu %lu\n", rs_ct, cs_ct ); \ +*/ \ +\ +/* +PASTEMAC(ch,fprintm)( stdout, "trsm_lu_ker_var2: b11 after (diag)", MR, NR, b11, NR, 1, "%5.2f", "" ); \ +PASTEMAC(ch,fprintm)( stdout, "trsm_lu_ker_var2: b11 after (diag)", MR, NR, b11, NR, 1, "%5.2f", "" ); \ +PASTEMAC(ch,fprintm)( stdout, "trsm_lu_ker_var2: ct after (diag)", m_cur, n_cur, ct, rs_ct, cs_ct, "%5.2f", "" ); \ +*/ \ +} + +INSERT_GENTFUNC_BASIC0( trsm_lu_ker_var2rr ) + diff --git a/frame/3/trsm/bli_trsm_lu_ker_var2sl.c b/frame/3/trsm/bli_trsm_lu_ker_var2sl.c new file mode 100644 index 000000000..486294352 --- /dev/null +++ b/frame/3/trsm/bli_trsm_lu_ker_var2sl.c @@ -0,0 +1,586 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. 
+ + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +*/ + +#include "blis.h" + +#define FUNCPTR_T gemm_fp + +typedef void (*FUNCPTR_T) + ( + doff_t diagoffa, + pack_t schema_a, + pack_t schema_b, + dim_t m, + dim_t n, + dim_t k, + void* alpha1, + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, + void* alpha2, + void* c, inc_t rs_c, inc_t cs_c, + cntx_t* cntx, + rntm_t* rntm, + thrinfo_t* thread + ); + +static FUNCPTR_T GENARRAY(ftypes,trsm_lu_ker_var2sl); + +// +// -- Macrokernel functions for slab partitioning ------------------------------ +// + +void bli_trsm_lu_ker_var2sl + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + num_t dt_exec = bli_obj_exec_dt( c ); + + doff_t diagoffa = bli_obj_diag_offset( a ); + + pack_t schema_a = bli_obj_pack_schema( a ); + pack_t schema_b = bli_obj_pack_schema( b ); + + dim_t m = bli_obj_length( c ); + dim_t n = bli_obj_width( c ); + dim_t k = bli_obj_width( a ); + + void* buf_a = bli_obj_buffer_at_off( a ); + inc_t cs_a = bli_obj_col_stride( a ); + dim_t pd_a = bli_obj_panel_dim( a ); + inc_t ps_a = bli_obj_panel_stride( a ); + + void* buf_b = bli_obj_buffer_at_off( b ); + inc_t rs_b = bli_obj_row_stride( b ); + dim_t pd_b = bli_obj_panel_dim( b ); + inc_t ps_b = bli_obj_panel_stride( b ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c = bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + void* buf_alpha1; + void* buf_alpha2; + + FUNCPTR_T f; + + // Grab the address of the internal scalar buffer for the scalar + // attached to B (the non-triangular matrix). This will be the alpha + // scalar used in the gemmtrsm subproblems (ie: the scalar that would + // be applied to the packed copy of B prior to it being updated by + // the trsm subproblem). This scalar may be unit, if for example it + // was applied during packing. 
+ buf_alpha1 = bli_obj_internal_scalar_buffer( b ); + + // Grab the address of the internal scalar buffer for the scalar + // attached to C. This will be the "beta" scalar used in the gemm-only + // subproblems that correspond to micro-panels that do not intersect + // the diagonal. We need this separate scalar because it's possible + // that the alpha attached to B was reset, if it was applied during + // packing. + buf_alpha2 = bli_obj_internal_scalar_buffer( c ); + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_exec]; + + // Invoke the function. + f( diagoffa, + schema_a, + schema_b, + m, + n, + k, + buf_alpha1, + buf_a, cs_a, pd_a, ps_a, + buf_b, rs_b, pd_b, ps_b, + buf_alpha2, + buf_c, rs_c, cs_c, + cntx, + rntm, + thread ); +} + + +#undef GENTFUNC +#define GENTFUNC( ctype, ch, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + doff_t diagoffa, \ + pack_t schema_a, \ + pack_t schema_b, \ + dim_t m, \ + dim_t n, \ + dim_t k, \ + void* alpha1, \ + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, \ + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, \ + void* alpha2, \ + void* c, inc_t rs_c, inc_t cs_c, \ + cntx_t* cntx, \ + rntm_t* rntm, \ + thrinfo_t* thread \ + ) \ +{ \ + const num_t dt = PASTEMAC(ch,type); \ +\ + /* Alias some constants to simpler names. */ \ + const dim_t MR = pd_a; \ + const dim_t NR = pd_b; \ + const dim_t PACKMR = cs_a; \ + const dim_t PACKNR = rs_b; \ +\ + /* Cast the micro-kernel address to its function pointer type. */ \ + PASTECH(ch,gemmtrsm_ukr_ft) \ + gemmtrsm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMMTRSM_U_UKR, cntx ); \ + PASTECH(ch,gemm_ukr_ft) \ + gemm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx ); \ +\ + /* Temporary C buffer for edge cases. Note that the strides of this + temporary buffer are set so that they match the storage of the + original C matrix. For example, if C is column-stored, ct will be + column-stored as well. 
*/ \ + ctype ct[ BLIS_STACK_BUF_MAX_SIZE \ + / sizeof( ctype ) ] \ + __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \ + const bool_t col_pref = bli_cntx_l3_vir_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \ + const inc_t rs_ct = ( col_pref ? 1 : NR ); \ + const inc_t cs_ct = ( col_pref ? MR : 1 ); \ +\ + ctype* restrict zero = PASTEMAC(ch,0); \ + ctype* restrict minus_one = PASTEMAC(ch,m1); \ + ctype* restrict a_cast = a; \ + ctype* restrict b_cast = b; \ + ctype* restrict c_cast = c; \ + ctype* restrict alpha1_cast = alpha1; \ + ctype* restrict alpha2_cast = alpha2; \ + ctype* restrict b1; \ + ctype* restrict c1; \ +\ + doff_t diagoffa_i; \ + dim_t k_full; \ + dim_t m_iter, m_left; \ + dim_t n_iter, n_left; \ + dim_t m_cur; \ + dim_t n_cur; \ + dim_t k_a1112; \ + dim_t k_a11; \ + dim_t k_a12; \ + dim_t off_a11; \ + dim_t off_a12; \ + dim_t i, j, ib; \ + inc_t rstep_a; \ + inc_t cstep_b; \ + inc_t rstep_c, cstep_c; \ + inc_t istep_a; \ + inc_t istep_b; \ + inc_t off_scl; \ + inc_t ss_a_num; \ + inc_t ss_a_den; \ + inc_t ps_a_cur; \ + inc_t is_a_cur; \ + auxinfo_t aux; \ +\ + /* + Assumptions/assertions: + rs_a == 1 + cs_a == PACKMR + pd_a == MR + ps_a == stride to next micro-panel of A + rs_b == PACKNR + cs_b == 1 + pd_b == NR + ps_b == stride to next micro-panel of B + rs_c == (no assumptions) + cs_c == (no assumptions) + */ \ +\ + /* Safety trap: Certain indexing within this macro-kernel does not + work as intended if both MR and NR are odd. */ \ + if ( ( bli_is_odd( PACKMR ) && bli_is_odd( NR ) ) || \ + ( bli_is_odd( PACKNR ) && bli_is_odd( MR ) ) ) bli_abort(); \ +\ + /* If any dimension is zero, return immediately. */ \ + if ( bli_zero_dim3( m, n, k ) ) return; \ +\ + /* Safeguard: If matrix A is below the diagonal, it is implicitly zero. + So we do nothing. */ \ + if ( bli_is_strictly_below_diag_n( diagoffa, m, k ) ) return; \ +\ + /* Compute k_full as k inflated up to a multiple of MR. 
This is + needed because some parameter combinations of trsm reduce k + to advance past zero regions in the triangular matrix, and + when computing the imaginary stride of B (the non-triangular + matrix), which is used by 4m1/3m1 implementations, we need + this unreduced value of k. */ \ + k_full = ( k % MR != 0 ? k + MR - ( k % MR ) : k ); \ +\ + /* Compute indexing scaling factor for 4m or 3m. This is + needed because one of the packing register blocksizes (PACKMR + or PACKNR) is used to index into the micro-panels of the non- + triangular matrix when computing with a diagonal-intersecting + micro-panel of the triangular matrix. In the case of 4m or 3m, + real values are stored in both sub-panels, and so the indexing + needs to occur in units of real values. The value computed + here is divided into the complex pointer offset to cause the + pointer to be advanced by the correct value. */ \ + if ( bli_is_4mi_packed( schema_a ) || \ + bli_is_3mi_packed( schema_a ) || \ + bli_is_rih_packed( schema_a ) ) off_scl = 2; \ + else off_scl = 1; \ +\ + /* Compute the storage stride scaling. Usually this is just 1. + However, in the case of interleaved 3m, we need to scale the + offset by 3/2. Note that real-only, imag-only, and summed-only + packing formats are not applicable here since trsm is a two- + operand operation only (unlike trmm, which is capable of three- + operand). */ \ + if ( bli_is_3mi_packed( schema_a ) ) { ss_a_num = 3; ss_a_den = 2; } \ + else { ss_a_num = 1; ss_a_den = 1; } \ +\ + /* If there is a zero region to the left of where the diagonal of A + intersects the top edge of the block, adjust the pointer to B and + treat this case as if the diagonal offset were zero. Note that we + don't need to adjust the pointer to A since packm would have simply + skipped over the region that was not stored.
*/ \ + if ( diagoffa > 0 ) \ + { \ + i = diagoffa; \ + k = k - i; \ + diagoffa = 0; \ + b_cast = b_cast + ( i * PACKNR ) / off_scl; \ + } \ +\ + /* If there is a zero region below where the diagonal of A intersects the + right side of the block, shrink it to prevent "no-op" iterations from + executing. */ \ + if ( -diagoffa + k < m ) \ + { \ + m = -diagoffa + k; \ + } \ +\ + /* Check the k dimension, which needs to be a multiple of MR. If k + isn't a multiple of MR, we adjust it higher to satisfy the micro- + kernel, which is expecting to perform an MR x MR triangular solve. + This adjustment of k is consistent with what happened when A was + packed: all of its bottom/right edges were zero-padded, and + furthermore, the panel that stores the bottom-right corner of the + matrix has its diagonal extended into the zero-padded region (as + identity). This allows the trsm of that bottom-right panel to + proceed without producing any infs or NaNs that would infect the + "good" values of the corresponding block of B. */ \ + if ( k % MR != 0 ) k += MR - ( k % MR ); \ +\ + /* NOTE: We don't need to check that m is a multiple of PACKMR since we + know that the underlying buffer was already allocated to have an m + dimension that is a multiple of PACKMR, with the region between the + last row and the next multiple of MR zero-padded accordingly. */ \ +\ + /* Clear the temporary C buffer in case it has any infs or NaNs. */ \ + PASTEMAC(ch,set0s_mxn)( MR, NR, \ + ct, rs_ct, cs_ct ); \ +\ + /* Compute number of primary and leftover components of the m and n + dimensions. */ \ + n_iter = n / NR; \ + n_left = n % NR; \ +\ + m_iter = m / MR; \ + m_left = m % MR; \ +\ + if ( n_left ) ++n_iter; \ + if ( m_left ) ++m_iter; \ +\ + /* Determine some increments used to step through A, B, and C. 
*/ \ + rstep_a = ps_a; \ +\ + cstep_b = ps_b; \ +\ + rstep_c = rs_c * MR; \ + cstep_c = cs_c * NR; \ +\ + istep_a = PACKMR * k; \ + istep_b = PACKNR * k_full; \ +\ + if ( bli_is_odd( istep_a ) ) istep_a += 1; \ + if ( bli_is_odd( istep_b ) ) istep_b += 1; \ +\ + /* Save the pack schemas of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_schema_a( schema_a, &aux ); \ + bli_auxinfo_set_schema_b( schema_b, &aux ); \ +\ + /* Save the imaginary stride of B to the auxinfo_t object. */ \ + bli_auxinfo_set_is_b( istep_b, &aux ); \ +\ + /* We don't bother querying the thrinfo_t node for the 1st loop because + we can't parallelize that loop in trsm due to the inter-iteration + dependencies that exist. */ \ + /*thrinfo_t* caucus = bli_thrinfo_sub_node( thread );*/ \ +\ + /* Query the number of threads and thread ids for each loop. */ \ + dim_t jr_nt = bli_thread_n_way( thread ); \ + dim_t jr_tid = bli_thread_work_id( thread ); \ +\ + dim_t jr_start, jr_end; \ + dim_t jr_inc; \ +\ + /* Use slab assignment of micropanels to threads in the 2nd loop. + NOTE: Parallelism in the 1st loop is unattainable due to the + inter-iteration dependencies present in trsm. */ \ + bli_thread_range_jrir_sl( thread, n_iter, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \ +\ + /* Loop over the n dimension (NR columns at a time). */ \ + for ( j = jr_start; j < jr_end; j += jr_inc ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b2; \ +\ + b1 = b_cast + j * cstep_b; \ + c1 = c_cast + j * cstep_c; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. */ \ + b2 = b1; \ +\ + a1 = a_cast; \ + c11 = c1 + (m_iter-1)*rstep_c; \ +\ + /* Loop over the m dimension (MR rows at a time). */ \ + for ( ib = 0; ib < m_iter; ++ib ) \ + { \ + i = m_iter - 1 - ib; \ + diagoffa_i = diagoffa + ( doff_t )i*MR; \ +\ + m_cur = ( bli_is_not_edge_b( ib, m_iter, m_left ) ? 
MR : m_left ); \
+\
+ /* If the current panel of A intersects the diagonal, use a
+ special micro-kernel that performs a fused gemm and trsm.
+ If the current panel of A resides above the diagonal, use a
+ regular gemm micro-kernel. Otherwise, if it is below the
+ diagonal, it was not packed (because it is implicitly zero)
+ and so we do nothing. */ \
+ if ( bli_intersects_diag_n( diagoffa_i, MR, k ) ) \
+ { \
+ ctype* restrict a11; \
+ ctype* restrict a12; \
+ ctype* restrict b11; \
+ ctype* restrict b21; \
+ ctype* restrict a2; \
+\
+ /* Compute various offsets into and lengths of parts of A. */ \
+ off_a11 = diagoffa_i; \
+ k_a1112 = k - off_a11; \
+ k_a11 = MR; \
+ k_a12 = k_a1112 - MR; \
+ off_a12 = off_a11 + k_a11; \
+\
+ /* Compute the panel stride for the current diagonal-
+ intersecting micro-panel. */ \
+ is_a_cur = k_a1112 * PACKMR; \
+ is_a_cur += ( bli_is_odd( is_a_cur ) ? 1 : 0 ); \
+ ps_a_cur = ( is_a_cur * ss_a_num ) / ss_a_den; \
+\
+ /* Compute the addresses of the triangular block A11 and the
+ panel A12. */ \
+ a11 = a1; \
+ /* a12 = a1 + ( k_a11 * PACKMR ) / off_scl; */ \
+ a12 = bli_ptr_inc_by_frac( a1, sizeof( ctype ), k_a11 * PACKMR, off_scl ); \
+\
+ /* Compute the addresses of the block B11 and the panel
+ B21. */ \
+ b11 = b1 + ( off_a11 * PACKNR ) / off_scl; \
+ b21 = b1 + ( off_a12 * PACKNR ) / off_scl; \
+\
+ /* Compute the addresses of the next panels of A and B. */ \
+ a2 = a1 + ps_a_cur; \
+ if ( bli_is_last_iter_rr( ib, m_iter, 0, 1 ) ) \
+ { \
+ a2 = a_cast; \
+ b2 = b1; \
+ if ( bli_is_last_iter_sl( j, n_iter, jr_tid, jr_nt ) ) \
+ b2 = b_cast; \
+ } \
+\
+ /* Save addresses of next panels of A and B to the auxinfo_t
+ object. */ \
+ bli_auxinfo_set_next_a( a2, &aux ); \
+ bli_auxinfo_set_next_b( b2, &aux ); \
+\
+ /* Save the 4m1/3m1 imaginary stride of A to the auxinfo_t
+ object. */ \
+ bli_auxinfo_set_is_a( is_a_cur, &aux ); \
+\
+ /* Handle interior and edge cases separately.
*/ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the fused gemm/trsm micro-kernel. */ \ + gemmtrsm_ukr \ + ( \ + k_a12, \ + alpha1_cast, \ + a12, \ + a11, \ + b21, \ + b11, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the fused gemm/trsm micro-kernel. */ \ + gemmtrsm_ukr \ + ( \ + k_a12, \ + alpha1_cast, \ + a12, \ + a11, \ + b21, \ + b11, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Copy the result to the bottom edge of C. */ \ + PASTEMAC(ch,copys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + c11, rs_c, cs_c ); \ + } \ +\ + a1 += ps_a_cur; \ + } \ + else if ( bli_is_strictly_above_diag_n( diagoffa_i, MR, k ) ) \ + { \ + ctype* restrict a2; \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = a1 + rstep_a; \ + if ( bli_is_last_iter_rr( ib, m_iter, 0, 1 ) ) \ + { \ + a2 = a_cast; \ + b2 = b1; \ + if ( bli_is_last_iter_sl( j, n_iter, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Save the 4m1/3m1 imaginary stride of A to the auxinfo_t + object. */ \ + bli_auxinfo_set_is_a( istep_a, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + minus_one, \ + a1, \ + b1, \ + alpha2_cast, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + minus_one, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Add the result to the edge of C. 
*/ \ + PASTEMAC(ch,xpbys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + alpha2_cast, \ + c11, rs_c, cs_c ); \ + } \ +\ + a1 += rstep_a; \ + } \ +\ + c11 -= rstep_c; \ + } \ + } \ +\ +/* +PASTEMAC(ch,fprintm)( stdout, "trsm_lu_ker_var2: a1 (diag)", MR, k_a1112, a1, 1, MR, "%5.2f", "" ); \ +PASTEMAC(ch,fprintm)( stdout, "trsm_lu_ker_var2: b11 (diag)", MR, NR, b11, NR, 1, "%6.3f", "" ); \ +printf( "m_iter = %lu\n", m_iter ); \ +printf( "m_cur = %lu\n", m_cur ); \ +printf( "k = %lu\n", k ); \ +printf( "diagoffa_i = %lu\n", diagoffa_i ); \ +printf( "off_a1112 = %lu\n", off_a1112 ); \ +printf( "k_a1112 = %lu\n", k_a1112 ); \ +printf( "k_a12 = %lu\n", k_a12 ); \ +printf( "k_a11 = %lu\n", k_a11 ); \ +printf( "rs_c,cs_c = %lu %lu\n", rs_c, cs_c ); \ +printf( "rs_ct,cs_ct= %lu %lu\n", rs_ct, cs_ct ); \ +*/ \ +\ +/* +PASTEMAC(ch,fprintm)( stdout, "trsm_lu_ker_var2: b11 after (diag)", MR, NR, b11, NR, 1, "%5.2f", "" ); \ +PASTEMAC(ch,fprintm)( stdout, "trsm_lu_ker_var2: b11 after (diag)", MR, NR, b11, NR, 1, "%5.2f", "" ); \ +PASTEMAC(ch,fprintm)( stdout, "trsm_lu_ker_var2: ct after (diag)", m_cur, n_cur, ct, rs_ct, cs_ct, "%5.2f", "" ); \ +*/ \ +} + +INSERT_GENTFUNC_BASIC0( trsm_lu_ker_var2sl ) + diff --git a/frame/3/trsm/bli_trsm_rl_ker_var2.c b/frame/3/trsm/bli_trsm_rl_ker_var2.c index 8045fe09d..5921dc275 100644 --- a/frame/3/trsm/bli_trsm_rl_ker_var2.c +++ b/frame/3/trsm/bli_trsm_rl_ker_var2.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -441,12 +442,12 @@ void PASTEMAC(ch,varname) \ \ /* Compute the addresses of the next panels of A and B. 
*/ \ a2 = a1; \ - /*if ( bli_is_last_iter( i, m_iter, 0, 1 ) ) */\ + /*if ( bli_is_last_iter_rr( i, m_iter, 0, 1 ) ) */\ if ( i + bli_thread_num_threads(thread) >= m_iter ) \ { \ a2 = a_cast; \ b2 = b1 + ps_b_cur; \ - if ( bli_is_last_iter( jb, n_iter, 0, 1 ) ) \ + if ( bli_is_last_iter_rr( jb, n_iter, 0, 1 ) ) \ b2 = b_cast; \ } \ \ @@ -521,12 +522,12 @@ void PASTEMAC(ch,varname) \ \ /* Compute the addresses of the next panels of A and B. */ \ a2 = a1; \ - /*if ( bli_is_last_iter( i, m_iter, 0, 1 ) ) */\ + /*if ( bli_is_last_iter_rr( i, m_iter, 0, 1 ) ) */\ if ( i + bli_thread_num_threads(thread) >= m_iter ) \ { \ a2 = a_cast; \ b2 = b1 + cstep_b; \ - if ( bli_is_last_iter( jb, n_iter, 0, 1 ) ) \ + if ( bli_is_last_iter_rr( jb, n_iter, 0, 1 ) ) \ b2 = b_cast; \ } \ \ diff --git a/frame/3/trsm/bli_trsm_ru_ker_var2.c b/frame/3/trsm/bli_trsm_ru_ker_var2.c index e1279813c..df24d8129 100644 --- a/frame/3/trsm/bli_trsm_ru_ker_var2.c +++ b/frame/3/trsm/bli_trsm_ru_ker_var2.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -434,12 +435,12 @@ void PASTEMAC(ch,varname) \ \ /* Compute the addresses of the next panels of A and B. */ \ a2 = a1; \ - /*if ( bli_is_last_iter( i, m_iter, 0, 1 ) ) */\ + /*if ( bli_is_last_iter_rr( i, m_iter, 0, 1 ) ) */\ if ( i + bli_thread_num_threads(thread) >= m_iter ) \ { \ a2 = a_cast; \ b2 = b1 + ps_b_cur; \ - if ( bli_is_last_iter( j, n_iter, 0, 1 ) ) \ + if ( bli_is_last_iter_rr( j, n_iter, 0, 1 ) ) \ b2 = b_cast; \ } \ \ @@ -514,12 +515,12 @@ void PASTEMAC(ch,varname) \ \ /* Compute the addresses of the next panels of A and B. 
*/ \ a2 = a1; \ - /*if ( bli_is_last_iter( i, m_iter, 0, 1 ) ) */\ + /*if ( bli_is_last_iter_rr( i, m_iter, 0, 1 ) ) */\ if ( i + bli_thread_num_threads(thread) >= m_iter ) \ { \ a2 = a_cast; \ b2 = b1 + cstep_b; \ - if ( bli_is_last_iter( j, n_iter, 0, 1 ) ) \ + if ( bli_is_last_iter_rr( j, n_iter, 0, 1 ) ) \ b2 = b_cast; \ } \ \ diff --git a/frame/3/trsm/bli_trsm_var.h b/frame/3/trsm/bli_trsm_var.h index 5ac72c28c..197bff82e 100644 --- a/frame/3/trsm/bli_trsm_var.h +++ b/frame/3/trsm/bli_trsm_var.h @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -57,9 +58,14 @@ GENPROT( trsm_blk_var3 ) GENPROT( trsm_packa ) GENPROT( trsm_packb ) -GENPROT( trsm_xx_ker_var2 ) -GENPROT( trsm_ll_ker_var2 ) -GENPROT( trsm_lu_ker_var2 ) +GENPROT( trsm_xx_ker_var2sl ) +GENPROT( trsm_xx_ker_var2rr ) + +GENPROT( trsm_ll_ker_var2sl ) +GENPROT( trsm_ll_ker_var2rr ) +GENPROT( trsm_lu_ker_var2sl ) +GENPROT( trsm_lu_ker_var2rr ) + GENPROT( trsm_rl_ker_var2 ) GENPROT( trsm_ru_ker_var2 ) @@ -91,8 +97,11 @@ void PASTEMAC(ch,varname) \ thrinfo_t* thread \ ); -INSERT_GENTPROT_BASIC0( trsm_ll_ker_var2 ) -INSERT_GENTPROT_BASIC0( trsm_lu_ker_var2 ) +INSERT_GENTPROT_BASIC0( trsm_ll_ker_var2sl ) +INSERT_GENTPROT_BASIC0( trsm_ll_ker_var2rr ) +INSERT_GENTPROT_BASIC0( trsm_lu_ker_var2sl ) +INSERT_GENTPROT_BASIC0( trsm_lu_ker_var2rr ) + INSERT_GENTPROT_BASIC0( trsm_rl_ker_var2 ) INSERT_GENTPROT_BASIC0( trsm_ru_ker_var2 ) diff --git a/frame/3/trsm/bli_trsm_xx_ker_var2.c b/frame/3/trsm/bli_trsm_xx_ker_var2.c index 24d55af24..1ad6bf1ed 100644 --- a/frame/3/trsm/bli_trsm_xx_ker_var2.c +++ b/frame/3/trsm/bli_trsm_xx_ker_var2.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. 
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -34,13 +35,13 @@ #include "blis.h" -static trsm_var_oft vars[2][2] = +static trsm_var_oft vars_sl[2][2] = { - { bli_trsm_ll_ker_var2, bli_trsm_lu_ker_var2 }, - { bli_trsm_rl_ker_var2, bli_trsm_ru_ker_var2 } + { bli_trsm_ll_ker_var2sl, bli_trsm_lu_ker_var2sl }, + { bli_trsm_rl_ker_var2 , bli_trsm_ru_ker_var2 } }; -void bli_trsm_xx_ker_var2 +void bli_trsm_xx_ker_var2sl ( obj_t* a, obj_t* b, @@ -72,7 +73,62 @@ void bli_trsm_xx_ker_var2 } // Index into the variant array to extract the correct function pointer. - f = vars[side][uplo]; + f = vars_sl[side][uplo]; + + // Call the macrokernel. + f + ( + a, + b, + c, + cntx, + rntm, + cntl, + thread + ); +} + +// ----------------------------------------------------------------------------- + +static trsm_var_oft vars_rr[2][2] = +{ + { bli_trsm_ll_ker_var2rr, bli_trsm_lu_ker_var2rr }, + { bli_trsm_rl_ker_var2 , bli_trsm_ru_ker_var2 } +}; + +void bli_trsm_xx_ker_var2rr + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + bool_t side; + bool_t uplo; + trsm_var_oft f; + + // Set two bools: one based on the implied side parameter (the structure + // of the root object) and one based on the uplo field of the triangular + // matrix's root object (whether that is matrix A or matrix B). + if ( bli_obj_root_is_triangular( a ) ) + { + side = 0; + if ( bli_obj_root_is_lower( a ) ) uplo = 0; + else uplo = 1; + } + else // if ( bli_obj_root_is_triangular( b ) ) + { + side = 1; + if ( bli_obj_root_is_lower( b ) ) uplo = 0; + else uplo = 1; + } + + // Index into the variant array to extract the correct function pointer. + f = vars_rr[side][uplo]; // Call the macrokernel. 
f diff --git a/frame/3/trsm/bli_trsm_ll_ker_var2.c b/frame/3/trsm/other/bli_trsm_ll_ker_var2.c similarity index 99% rename from frame/3/trsm/bli_trsm_ll_ker_var2.c rename to frame/3/trsm/other/bli_trsm_ll_ker_var2.c index 34fc6a2b6..4e7e1b850 100644 --- a/frame/3/trsm/bli_trsm_ll_ker_var2.c +++ b/frame/3/trsm/other/bli_trsm_ll_ker_var2.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are diff --git a/frame/3/trsm/bli_trsm_lu_ker_var2.c b/frame/3/trsm/other/bli_trsm_lu_ker_var2.c similarity index 99% rename from frame/3/trsm/bli_trsm_lu_ker_var2.c rename to frame/3/trsm/other/bli_trsm_lu_ker_var2.c index 78e2a7a15..a8978df86 100644 --- a/frame/3/trsm/bli_trsm_lu_ker_var2.c +++ b/frame/3/trsm/other/bli_trsm_lu_ker_var2.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are diff --git a/frame/3/trsm/other/bli_trsm_rl_ker_var2.c b/frame/3/trsm/other/bli_trsm_rl_ker_var2.c new file mode 100644 index 000000000..70b3e456d --- /dev/null +++ b/frame/3/trsm/other/bli_trsm_rl_ker_var2.c @@ -0,0 +1,591 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. 
+ - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +*/ + +#include "blis.h" + +#define FUNCPTR_T gemm_fp + +typedef void (*FUNCPTR_T) + ( + doff_t diagoffb, + pack_t schema_a, + pack_t schema_b, + dim_t m, + dim_t n, + dim_t k, + void* alpha1, + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, + void* alpha2, + void* c, inc_t rs_c, inc_t cs_c, + cntx_t* cntx, + rntm_t* rntm, + thrinfo_t* thread + ); + +static FUNCPTR_T GENARRAY(ftypes,trsm_rl_ker_var2); + + +void bli_trsm_rl_ker_var2 + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + num_t dt_exec = bli_obj_exec_dt( c ); + + doff_t diagoffb = bli_obj_diag_offset( b ); + + pack_t schema_a = bli_obj_pack_schema( a ); + pack_t schema_b = bli_obj_pack_schema( b ); + + dim_t m = bli_obj_length( c ); + dim_t n = bli_obj_width( c ); + dim_t k = bli_obj_width( a ); + + void* buf_a = bli_obj_buffer_at_off( a ); + inc_t cs_a = bli_obj_col_stride( a ); + dim_t pd_a = bli_obj_panel_dim( a ); + inc_t ps_a = bli_obj_panel_stride( a ); + + void* buf_b = bli_obj_buffer_at_off( b ); + inc_t rs_b = bli_obj_row_stride( b ); + dim_t pd_b = bli_obj_panel_dim( b ); + inc_t ps_b = bli_obj_panel_stride( b ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c = bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + void* buf_alpha1; + void* buf_alpha2; + + FUNCPTR_T f; + + // Grab the address of the internal scalar buffer for the scalar + // attached to A (the non-triangular matrix). This will be the alpha + // scalar used in the gemmtrsm subproblems (ie: the scalar that would + // be applied to the packed copy of A prior to it being updated by + // the trsm subproblem). This scalar may be unit, if for example it + // was applied during packing. + buf_alpha1 = bli_obj_internal_scalar_buffer( a ); + + // Grab the address of the internal scalar buffer for the scalar + // attached to C. 
This will be the "beta" scalar used in the gemm-only + // subproblems that correspond to micro-panels that do not intersect + // the diagonal. We need this separate scalar because it's possible + // that the alpha attached to B was reset, if it was applied during + // packing. + buf_alpha2 = bli_obj_internal_scalar_buffer( c ); + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_exec]; + + // Invoke the function. + f( diagoffb, + schema_a, + schema_b, + m, + n, + k, + buf_alpha1, + buf_a, cs_a, pd_a, ps_a, + buf_b, rs_b, pd_b, ps_b, + buf_alpha2, + buf_c, rs_c, cs_c, + cntx, + rntm, + thread ); +} + + +#undef GENTFUNC +#define GENTFUNC( ctype, ch, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + doff_t diagoffb, \ + pack_t schema_a, \ + pack_t schema_b, \ + dim_t m, \ + dim_t n, \ + dim_t k, \ + void* alpha1, \ + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, \ + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, \ + void* alpha2, \ + void* c, inc_t rs_c, inc_t cs_c, \ + cntx_t* cntx, \ + rntm_t* rntm, \ + thrinfo_t* thread \ + ) \ +{ \ + const num_t dt = PASTEMAC(ch,type); \ +\ + /* Alias some constants to simpler names. */ \ + const dim_t MR = pd_a; \ + const dim_t NR = pd_b; \ + const dim_t PACKMR = cs_a; \ + const dim_t PACKNR = rs_b; \ +\ + /* Cast the micro-kernel address to its function pointer type. */ \ + /* NOTE: We use the upper-triangular gemmtrsm ukernel because, while + the current macro-kernel targets the "rl" case (right-side/lower- + triangular), it becomes upper-triangular after the kernel operation + is transposed so that all kernel instances are of the "left" + variety (since those are the only trsm ukernels that exist). */ \ + PASTECH(ch,gemmtrsm_ukr_ft) \ + gemmtrsm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMMTRSM_U_UKR, cntx ); \ + PASTECH(ch,gemm_ukr_ft) \ + gemm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx ); \ +\ + /* Temporary C buffer for edge cases. 
Note that the strides of this + temporary buffer are set so that they match the storage of the + original C matrix. For example, if C is column-stored, ct will be + column-stored as well. */ \ + ctype ct[ BLIS_STACK_BUF_MAX_SIZE \ + / sizeof( ctype ) ] \ + __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \ + const bool_t col_pref = bli_cntx_l3_vir_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \ + const inc_t rs_ct = ( col_pref ? 1 : NR ); \ + const inc_t cs_ct = ( col_pref ? MR : 1 ); \ +\ + ctype* restrict zero = PASTEMAC(ch,0); \ + ctype* restrict minus_one = PASTEMAC(ch,m1); \ + ctype* restrict a_cast = a; \ + ctype* restrict b_cast = b; \ + ctype* restrict c_cast = c; \ + ctype* restrict alpha1_cast = alpha1; \ + ctype* restrict alpha2_cast = alpha2; \ + ctype* restrict b1; \ + ctype* restrict c1; \ +\ + doff_t diagoffb_j; \ + dim_t k_full; \ + dim_t m_iter, m_left; \ + dim_t n_iter, n_left; \ + dim_t m_cur; \ + dim_t n_cur; \ + dim_t k_b1121; \ + dim_t k_b11; \ + dim_t k_b21; \ + dim_t off_b11; \ + dim_t off_b21; \ + dim_t i, j, jb; \ + inc_t rstep_a; \ + inc_t cstep_b; \ + inc_t rstep_c, cstep_c; \ + inc_t istep_a; \ + inc_t istep_b; \ + inc_t off_scl; \ + inc_t ss_b_num; \ + inc_t ss_b_den; \ + inc_t ps_b_cur; \ + inc_t is_b_cur; \ + auxinfo_t aux; \ +\ + /* + Assumptions/assertions: + rs_a == 1 + cs_a == PACKNR + pd_a == NR + ps_a == stride to next micro-panel of A + rs_b == PACKMR + cs_b == 1 + pd_b == MR + ps_b == stride to next micro-panel of B + rs_c == (no assumptions) + cs_c == (no assumptions) + + Note that MR/NR and PACKMR/PACKNR have been swapped to reflect the + swapping of values in the control tree (ie: those values used when + packing). This swapping is needed since we cast right-hand trsm in + terms of transposed left-hand trsm. So, if we're going to be + transposing the operation, then A needs to be packed with NR and B + needs to be packed with MR (remember: B is the triangular matrix in + the right-hand side parameter case). 
+
+*/ \
+\
+ /* Safety trap: Certain indexing within this macro-kernel does not
+ work as intended if both MR and NR are odd. */ \
+ if ( ( bli_is_odd( PACKMR ) && bli_is_odd( NR ) ) || \
+ ( bli_is_odd( PACKNR ) && bli_is_odd( MR ) ) ) bli_abort(); \
+\
+ /* If any dimension is zero, return immediately. */ \
+ if ( bli_zero_dim3( m, n, k ) ) return; \
+\
+ /* Safeguard: If the current panel of B is entirely above its diagonal,
+ it is implicitly zero. So we do nothing. */ \
+ if ( bli_is_strictly_above_diag_n( diagoffb, k, n ) ) return; \
+\
+ /* Compute k_full as k inflated up to a multiple of NR. This is
+ needed because some parameter combinations of trsm reduce k
+ to advance past zero regions in the triangular matrix, and
+ when computing the imaginary stride of B (the non-triangular
+ matrix), which is used by 4m1/3m1 implementations, we need
+ this unreduced value of k. */ \
+ k_full = ( k % NR != 0 ? k + NR - ( k % NR ) : k ); \
+\
+ /* Compute indexing scaling factor for 4m or 3m. This is
+ needed because one of the packing register blocksizes (PACKMR
+ or PACKNR) is used to index into the micro-panels of the non-
+ triangular matrix when computing with a diagonal-intersecting
+ micro-panel of the triangular matrix. In the case of 4m or 3m,
+ real values are stored in both sub-panels, and so the indexing
+ needs to occur in units of real values. The value computed
+ here is divided into the complex pointer offset to cause the
+ pointer to be advanced by the correct value. */ \
+ if ( bli_is_4mi_packed( schema_b ) || \
+ bli_is_3mi_packed( schema_b ) || \
+ bli_is_rih_packed( schema_b ) ) off_scl = 2; \
+ else off_scl = 1; \
+\
+ /* Compute the storage stride scaling. Usually this is just 1.
+ However, in the case of interleaved 3m, we need to scale the
+ offset by 3/2.
Note that real-only, imag-only, and summed-only + packing formats are not applicable here since trsm is a two- + operand operation only (unlike trmm, which is capable of three- + operand). */ \ + if ( bli_is_3mi_packed( schema_b ) ) { ss_b_num = 3; ss_b_den = 2; } \ + else { ss_b_num = 1; ss_b_den = 1; } \ +\ + /* If there is a zero region above where the diagonal of B intersects + the left edge of the panel, adjust the pointer to A and treat this + case as if the diagonal offset were zero. Note that we don't need to + adjust the pointer to B since packm would have simply skipped over + the region that was not stored. */ \ + if ( diagoffb < 0 ) \ + { \ + j = -diagoffb; \ + k = k - j; \ + diagoffb = 0; \ + a_cast = a_cast + ( j * PACKMR ) / off_scl; \ + } \ +\ + /* If there is a zero region to the right of where the diagonal + of B intersects the bottom of the panel, shrink it so that + we can index to the correct place in C (corresponding to the + part of the panel of B that was packed). + NOTE: This is NOT being done to skip over "no-op" iterations, + as with the trsm_lu macro-kernel. This MUST be done for correct + execution because we use n (via n_iter) to compute diagonal and + index offsets for backwards movement through B. */ \ + if ( diagoffb + k < n ) \ + { \ + n = diagoffb + k; \ + } \ +\ + /* Check the k dimension, which needs to be a multiple of NR. If k + isn't a multiple of NR, we adjust it higher to satisfy the micro- + kernel, which is expecting to perform an NR x NR triangular solve. + This adjustment of k is consistent with what happened when B was + packed: all of its bottom/right edges were zero-padded, and + furthermore, the panel that stores the bottom-right corner of the + matrix has its diagonal extended into the zero-padded region (as + identity). This allows the trsm of that bottom-right panel to + proceed without producing any infs or NaNs that would infect the + "good" values of the corresponding block of A. 
*/ \ + if ( k % NR != 0 ) k += NR - ( k % NR ); \ +\ + /* NOTE: We don't need to check that n is a multiple of PACKNR since we + know that the underlying buffer was already allocated to have an n + dimension that is a multiple of PACKNR, with the region between the + last column and the next multiple of NR zero-padded accordingly. */ \ +\ + /* Clear the temporary C buffer in case it has any infs or NaNs. */ \ + PASTEMAC(ch,set0s_mxn)( MR, NR, \ + ct, rs_ct, cs_ct ); \ +\ + /* Compute number of primary and leftover components of the m and n + dimensions. */ \ + n_iter = n / NR; \ + n_left = n % NR; \ +\ + m_iter = m / MR; \ + m_left = m % MR; \ +\ + if ( n_left ) ++n_iter; \ + if ( m_left ) ++m_iter; \ +\ + /* Determine some increments used to step through A, B, and C. */ \ + rstep_a = ps_a; \ +\ + cstep_b = ps_b; \ +\ + rstep_c = rs_c * MR; \ + cstep_c = cs_c * NR; \ +\ + istep_a = PACKMR * k_full; \ + istep_b = PACKNR * k; \ +\ + if ( bli_is_odd( istep_a ) ) istep_a += 1; \ + if ( bli_is_odd( istep_b ) ) istep_b += 1; \ +\ + /* Save the pack schemas of A and B to the auxinfo_t object. + NOTE: We swap the values for A and B since the triangular + "A" matrix is actually contained within B. */ \ + bli_auxinfo_set_schema_a( schema_b, &aux ); \ + bli_auxinfo_set_schema_b( schema_a, &aux ); \ +\ + /* Save the imaginary stride of A to the auxinfo_t object. + NOTE: We swap the values for A and B since the triangular + "A" matrix is actually contained within B. */ \ + bli_auxinfo_set_is_b( istep_a, &aux ); \ +\ + b1 = b_cast; \ + c1 = c_cast; \ +\ + /* Loop over the n dimension (NR columns at a time). */ \ + for ( jb = 0; jb < n_iter; ++jb ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b11; \ + ctype* restrict b21; \ + ctype* restrict b2; \ +\ + j = n_iter - 1 - jb; \ + diagoffb_j = diagoffb - ( doff_t )j*NR; \ + a1 = a_cast; \ + c11 = c1 + (n_iter-1)*cstep_c; \ +\ + n_cur = ( bli_is_not_edge_b( jb, n_iter, n_left ) ? 
NR : n_left ); \
+\
+ /* Initialize our next panel of B to be the current panel of B. */ \
+ b2 = b1; \
+\
+ /* If the current panel of B intersects the diagonal, use a
+ special micro-kernel that performs a fused gemm and trsm.
+ If the current panel of B resides below the diagonal, use a
+ regular gemm micro-kernel. Otherwise, if it is above the
+ diagonal, it was not packed (because it is implicitly zero)
+ and so we do nothing. */ \
+ if ( bli_intersects_diag_n( diagoffb_j, k, NR ) ) \
+ { \
+ /* Determine the offset to and length of the panel that was packed
+ so we can index into the corresponding location in A. */ \
+ off_b11 = bli_max( -diagoffb_j, 0 ); \
+ k_b1121 = k - off_b11; \
+ k_b11 = NR; \
+ k_b21 = k_b1121 - NR; \
+ off_b21 = off_b11 + k_b11; \
+\
+ /* Compute the addresses of the triangular block B11 and the
+ panel B21. */ \
+ b11 = b1; \
+ /* b21 = b1 + ( k_b11 * PACKNR ) / off_scl; */ \
+ b21 = bli_ptr_inc_by_frac( b1, sizeof( ctype ), k_b11 * PACKNR, off_scl ); \
+\
+ /* Compute the panel stride for the current micro-panel. */ \
+ is_b_cur = k_b1121 * PACKNR; \
+ is_b_cur += ( bli_is_odd( is_b_cur ) ? 1 : 0 ); \
+ ps_b_cur = ( is_b_cur * ss_b_num ) / ss_b_den; \
+\
+ /* Save the 4m1/3m1 imaginary stride of B to the auxinfo_t
+ object.
+ NOTE: We swap the values for A and B since the triangular
+ "A" matrix is actually contained within B. */ \
+ bli_auxinfo_set_is_a( is_b_cur, &aux ); \
+\
+ /* Loop over the m dimension (MR rows at a time). */ \
+ for ( i = 0; i < m_iter; ++i ) \
+ { \
+ if( bli_trsm_my_iter( i, thread ) ){ \
+\
+ ctype* restrict a11; \
+ ctype* restrict a12; \
+ ctype* restrict a2; \
+\
+ m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \
+\
+ /* Compute the addresses of the A11 block and A12 panel. */ \
+ a11 = a1 + ( off_b11 * PACKMR ) / off_scl; \
+ a12 = a1 + ( off_b21 * PACKMR ) / off_scl; \
+\
+ /* Compute the addresses of the next panels of A and B.
*/ \ + a2 = a1; \ + /*if ( bli_is_last_iter( i, m_iter, 0, 1 ) ) */\ + if ( i + bli_thread_num_threads(thread) >= m_iter ) \ + { \ + a2 = a_cast; \ + b2 = b1 + ps_b_cur; \ + if ( bli_is_last_iter( jb, n_iter, 0, 1 ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. NOTE: We swap the values for A and B since the + triangular "A" matrix is actually contained within B. */ \ + bli_auxinfo_set_next_a( b2, &aux ); \ + bli_auxinfo_set_next_b( a2, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the fused gemm/trsm micro-kernel. */ \ + gemmtrsm_ukr \ + ( \ + k_b21, \ + alpha1_cast, \ + b21, \ + b11, \ + a12, \ + a11, \ + c11, cs_c, rs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the fused gemm/trsm micro-kernel. */ \ + gemmtrsm_ukr \ + ( \ + k_b21, \ + alpha1_cast, \ + b21, \ + b11, \ + a12, \ + a11, \ + ct, cs_ct, rs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Copy the result to the bottom edge of C. */ \ + PASTEMAC(ch,copys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + c11, rs_c, cs_c ); \ + } \ + } \ +\ + a1 += rstep_a; \ + c11 += rstep_c; \ + } \ +\ + b1 += ps_b_cur; \ + } \ + else if ( bli_is_strictly_below_diag_n( diagoffb_j, k, NR ) ) \ + { \ + /* Save the 4m1/3m1 imaginary stride of B to the auxinfo_t + object. + NOTE: We swap the values for A and B since the triangular + "A" matrix is actually contained within B. */ \ + bli_auxinfo_set_is_a( istep_b, &aux ); \ +\ + /* Loop over the m dimension (MR rows at a time). */ \ + for ( i = 0; i < m_iter; ++i ) \ + { \ + if( bli_trsm_my_iter( i, thread ) ){ \ +\ + ctype* restrict a2; \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + /* Compute the addresses of the next panels of A and B. 
*/ \ + a2 = a1; \ + /*if ( bli_is_last_iter( i, m_iter, 0, 1 ) ) */\ + if ( i + bli_thread_num_threads(thread) >= m_iter ) \ + { \ + a2 = a_cast; \ + b2 = b1 + cstep_b; \ + if ( bli_is_last_iter( jb, n_iter, 0, 1 ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. NOTE: We swap the values for A and B since the + triangular "A" matrix is actually contained within B. */ \ + bli_auxinfo_set_next_a( b2, &aux ); \ + bli_auxinfo_set_next_b( a2, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + minus_one, \ + b1, \ + a1, \ + alpha2_cast, \ + c11, cs_c, rs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + minus_one, \ + b1, \ + a1, \ + zero, \ + ct, cs_ct, rs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Add the result to the edge of C. */ \ + PASTEMAC(ch,xpbys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + alpha2_cast, \ + c11, rs_c, cs_c ); \ + } \ + } \ +\ + a1 += rstep_a; \ + c11 += rstep_c; \ + } \ +\ + b1 += cstep_b; \ + } \ +\ + c1 -= cstep_c; \ + } \ +} + +INSERT_GENTFUNC_BASIC0( trsm_rl_ker_var2 ) + diff --git a/frame/3/trsm/other/bli_trsm_ru_ker_var2.c b/frame/3/trsm/other/bli_trsm_ru_ker_var2.c new file mode 100644 index 000000000..289bb5d9f --- /dev/null +++ b/frame/3/trsm/other/bli_trsm_ru_ker_var2.c @@ -0,0 +1,584 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. 
+ - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +*/ + +#include "blis.h" + +#define FUNCPTR_T gemm_fp + +typedef void (*FUNCPTR_T) + ( + doff_t diagoffb, + pack_t schema_a, + pack_t schema_b, + dim_t m, + dim_t n, + dim_t k, + void* alpha1, + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, + void* alpha2, + void* c, inc_t rs_c, inc_t cs_c, + cntx_t* cntx, + rntm_t* rntm, + thrinfo_t* thread + ); + +static FUNCPTR_T GENARRAY(ftypes,trsm_ru_ker_var2); + + +void bli_trsm_ru_ker_var2 + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + num_t dt_exec = bli_obj_exec_dt( c ); + + doff_t diagoffb = bli_obj_diag_offset( b ); + + pack_t schema_a = bli_obj_pack_schema( a ); + pack_t schema_b = bli_obj_pack_schema( b ); + + dim_t m = bli_obj_length( c ); + dim_t n = bli_obj_width( c ); + dim_t k = bli_obj_width( a ); + + void* buf_a = bli_obj_buffer_at_off( a ); + inc_t cs_a = bli_obj_col_stride( a ); + dim_t pd_a = bli_obj_panel_dim( a ); + inc_t ps_a = bli_obj_panel_stride( a ); + + void* buf_b = bli_obj_buffer_at_off( b ); + inc_t rs_b = bli_obj_row_stride( b ); + dim_t pd_b = bli_obj_panel_dim( b ); + inc_t ps_b = bli_obj_panel_stride( b ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c = bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + void* buf_alpha1; + void* buf_alpha2; + + FUNCPTR_T f; + + // Grab the address of the internal scalar buffer for the scalar + // attached to A (the non-triangular matrix). This will be the alpha + // scalar used in the gemmtrsm subproblems (ie: the scalar that would + // be applied to the packed copy of A prior to it being updated by + // the trsm subproblem). This scalar may be unit, if for example it + // was applied during packing. + buf_alpha1 = bli_obj_internal_scalar_buffer( a ); + + // Grab the address of the internal scalar buffer for the scalar + // attached to C. 
This will be the "beta" scalar used in the gemm-only + // subproblems that correspond to micro-panels that do not intersect + // the diagonal. We need this separate scalar because it's possible + // that the alpha attached to B was reset, if it was applied during + // packing. + buf_alpha2 = bli_obj_internal_scalar_buffer( c ); + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_exec]; + + // Invoke the function. + f( diagoffb, + schema_a, + schema_b, + m, + n, + k, + buf_alpha1, + buf_a, cs_a, pd_a, ps_a, + buf_b, rs_b, pd_b, ps_b, + buf_alpha2, + buf_c, rs_c, cs_c, + cntx, + rntm, + thread ); +} + + +#undef GENTFUNC +#define GENTFUNC( ctype, ch, varname ) \ +\ +void PASTEMAC(ch,varname) \ + ( \ + doff_t diagoffb, \ + pack_t schema_a, \ + pack_t schema_b, \ + dim_t m, \ + dim_t n, \ + dim_t k, \ + void* alpha1, \ + void* a, inc_t cs_a, dim_t pd_a, inc_t ps_a, \ + void* b, inc_t rs_b, dim_t pd_b, inc_t ps_b, \ + void* alpha2, \ + void* c, inc_t rs_c, inc_t cs_c, \ + cntx_t* cntx, \ + rntm_t* rntm, \ + thrinfo_t* thread \ + ) \ +{ \ + const num_t dt = PASTEMAC(ch,type); \ +\ + /* Alias some constants to simpler names. */ \ + const dim_t MR = pd_a; \ + const dim_t NR = pd_b; \ + const dim_t PACKMR = cs_a; \ + const dim_t PACKNR = rs_b; \ +\ + /* Cast the micro-kernel address to its function pointer type. */ \ + /* NOTE: We use the lower-triangular gemmtrsm ukernel because, while + the current macro-kernel targets the "ru" case (right-side/upper- + triangular), it becomes lower-triangular after the kernel operation + is transposed so that all kernel instances are of the "left" + variety (since those are the only trsm ukernels that exist). */ \ + PASTECH(ch,gemmtrsm_ukr_ft) \ + gemmtrsm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMMTRSM_L_UKR, cntx ); \ + PASTECH(ch,gemm_ukr_ft) \ + gemm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx ); \ +\ + /* Temporary C buffer for edge cases. 
Note that the strides of this + temporary buffer are set so that they match the storage of the + original C matrix. For example, if C is column-stored, ct will be + column-stored as well. */ \ + ctype ct[ BLIS_STACK_BUF_MAX_SIZE \ + / sizeof( ctype ) ] \ + __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \ + const bool_t col_pref = bli_cntx_l3_vir_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \ + const inc_t rs_ct = ( col_pref ? 1 : NR ); \ + const inc_t cs_ct = ( col_pref ? MR : 1 ); \ +\ + ctype* restrict zero = PASTEMAC(ch,0); \ + ctype* restrict minus_one = PASTEMAC(ch,m1); \ + ctype* restrict a_cast = a; \ + ctype* restrict b_cast = b; \ + ctype* restrict c_cast = c; \ + ctype* restrict alpha1_cast = alpha1; \ + ctype* restrict alpha2_cast = alpha2; \ + ctype* restrict b1; \ + ctype* restrict c1; \ +\ + doff_t diagoffb_j; \ + dim_t k_full; \ + dim_t m_iter, m_left; \ + dim_t n_iter, n_left; \ + dim_t m_cur; \ + dim_t n_cur; \ + dim_t k_b0111; \ + dim_t k_b01; \ + dim_t off_b01; \ + dim_t off_b11; \ + dim_t i, j; \ + inc_t rstep_a; \ + inc_t cstep_b; \ + inc_t rstep_c, cstep_c; \ + inc_t istep_a; \ + inc_t istep_b; \ + inc_t off_scl; \ + inc_t ss_b_num; \ + inc_t ss_b_den; \ + inc_t ps_b_cur; \ + inc_t is_b_cur; \ + auxinfo_t aux; \ +\ + /* + Assumptions/assertions: + rs_a == 1 + cs_a == PACKNR + pd_a == NR + ps_a == stride to next micro-panel of A + rs_b == PACKMR + cs_b == 1 + pd_b == MR + ps_b == stride to next micro-panel of B + rs_c == (no assumptions) + cs_c == (no assumptions) + + Note that MR/NR and PACKMR/PACKNR have been swapped to reflect the + swapping of values in the control tree (ie: those values used when + packing). This swapping is needed since we cast right-hand trsm in + terms of transposed left-hand trsm. So, if we're going to be + transposing the operation, then A needs to be packed with NR and B + needs to be packed with MR (remember: B is the triangular matrix in + the right-hand side parameter case). 
+ */ \ +\ + /* Safety trap: Certain indexing within this macro-kernel does not + work as intended if both MR and NR are odd. */ \ + if ( ( bli_is_odd( PACKMR ) && bli_is_odd( NR ) ) || \ + ( bli_is_odd( PACKNR ) && bli_is_odd( MR ) ) ) bli_abort(); \ +\ + /* If any dimension is zero, return immediately. */ \ + if ( bli_zero_dim3( m, n, k ) ) return; \ +\ + /* Safeguard: If the current panel of B is entirely below its diagonal, + it is implicitly zero. So we do nothing. */ \ + if ( bli_is_strictly_below_diag_n( diagoffb, k, n ) ) return; \ +\ + /* Compute k_full as k inflated up to a multiple of NR. This is + needed because some parameter combinations of trsm reduce k + to advance past zero regions in the triangular matrix, and + when computing the imaginary stride of A (the non-triangular + matrix), which is used by 4m1/3m1 implementations, we need + this unreduced value of k. */ \ + k_full = ( k % NR != 0 ? k + NR - ( k % NR ) : k ); \ +\ + /* Compute indexing scaling factor for 4m or 3m. This is + needed because one of the packing register blocksizes (PACKMR + or PACKNR) is used to index into the micro-panels of the non- + triangular matrix when computing with a diagonal-intersecting + micro-panel of the triangular matrix. In the case of 4m or 3m, + real values are stored in both sub-panels, and so the indexing + needs to occur in units of real values. The value computed + here is divided into the complex pointer offset to cause the + pointer to be advanced by the correct value. */ \ + if ( bli_is_4mi_packed( schema_b ) || \ + bli_is_3mi_packed( schema_b ) || \ + bli_is_rih_packed( schema_b ) ) off_scl = 2; \ + else off_scl = 1; \ +\ + /* Compute the storage stride scaling. Usually this is just 1. + However, in the case of interleaved 3m, we need to scale the + offset by 3/2.
Note that real-only, imag-only, and summed-only + packing formats are not applicable here since trsm is a two- + operand operation only (unlike trmm, which is capable of three- + operand). */ \ + if ( bli_is_3mi_packed( schema_b ) ) { ss_b_num = 3; ss_b_den = 2; } \ + else { ss_b_num = 1; ss_b_den = 1; } \ +\ + /* If there is a zero region to the left of where the diagonal of B + intersects the top edge of the panel, adjust the pointer to C and + treat this case as if the diagonal offset were zero. This skips over + the region that was not packed. (Note we assume the diagonal offset + is a multiple of MR; this assumption will hold as long as the cache + blocksizes are each a multiple of MR and NR.) */ \ + if ( diagoffb > 0 ) \ + { \ + j = diagoffb; \ + n = n - j; \ + diagoffb = 0; \ + c_cast = c_cast + (j )*cs_c; \ + } \ +\ + /* If there is a zero region below where the diagonal of B intersects the + right side of the block, shrink it to prevent "no-op" iterations from + executing. */ \ + if ( -diagoffb + n < k ) \ + { \ + k = -diagoffb + n; \ + } \ +\ + /* Check the k dimension, which needs to be a multiple of NR. If k + isn't a multiple of NR, we adjust it higher to satisfy the micro- + kernel, which is expecting to perform an NR x NR triangular solve. + This adjustment of k is consistent with what happened when B was + packed: all of its bottom/right edges were zero-padded, and + furthermore, the panel that stores the bottom-right corner of the + matrix has its diagonal extended into the zero-padded region (as + identity). This allows the trsm of that bottom-right panel to + proceed without producing any infs or NaNs that would infect the + "good" values of the corresponding block of A. 
*/ \ + if ( k % NR != 0 ) k += NR - ( k % NR ); \ +\ + /* NOTE: We don't need to check that n is a multiple of PACKNR since we + know that the underlying buffer was already allocated to have an n + dimension that is a multiple of PACKNR, with the region between the + last column and the next multiple of NR zero-padded accordingly. */ \ +\ + /* Clear the temporary C buffer in case it has any infs or NaNs. */ \ + PASTEMAC(ch,set0s_mxn)( MR, NR, \ + ct, rs_ct, cs_ct ); \ +\ + /* Compute number of primary and leftover components of the m and n + dimensions. */ \ + n_iter = n / NR; \ + n_left = n % NR; \ +\ + m_iter = m / MR; \ + m_left = m % MR; \ +\ + if ( n_left ) ++n_iter; \ + if ( m_left ) ++m_iter; \ +\ + /* Determine some increments used to step through A, B, and C. */ \ + rstep_a = ps_a; \ +\ + cstep_b = ps_b; \ +\ + rstep_c = rs_c * MR; \ + cstep_c = cs_c * NR; \ +\ + istep_a = PACKMR * k_full; \ + istep_b = PACKNR * k; \ +\ + if ( bli_is_odd( istep_a ) ) istep_a += 1; \ + if ( bli_is_odd( istep_b ) ) istep_b += 1; \ +\ + /* Save the pack schemas of A and B to the auxinfo_t object. + NOTE: We swap the values for A and B since the triangular + "A" matrix is actually contained within B. */ \ + bli_auxinfo_set_schema_a( schema_b, &aux ); \ + bli_auxinfo_set_schema_b( schema_a, &aux ); \ +\ + /* Save the imaginary stride of A to the auxinfo_t object. + NOTE: We swap the values for A and B since the triangular + "A" matrix is actually contained within B. */ \ + bli_auxinfo_set_is_b( istep_a, &aux ); \ +\ + b1 = b_cast; \ + c1 = c_cast; \ +\ + /* Loop over the n dimension (NR columns at a time). */ \ + for ( j = 0; j < n_iter; ++j ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b01; \ + ctype* restrict b11; \ + ctype* restrict b2; \ +\ + diagoffb_j = diagoffb - ( doff_t )j*NR; \ + a1 = a_cast; \ + c11 = c1; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? 
NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. */ \ + b2 = b1; \ +\ + /* If the current panel of B intersects the diagonal, use a + special micro-kernel that performs a fused gemm and trsm. + If the current panel of B resides above the diagonal, use a + regular gemm micro-kernel. Otherwise, if it is below the + diagonal, it was not packed (because it is implicitly zero) + and so we do nothing. */ \ + if ( bli_intersects_diag_n( diagoffb_j, k, NR ) ) \ + { \ + /* Determine the offset to and length of the panel that was packed + so we can index into the corresponding location in A. */ \ + off_b01 = 0; \ + k_b0111 = bli_min( k, -diagoffb_j + NR ); \ + k_b01 = k_b0111 - NR; \ + off_b11 = k_b01; \ +\ + /* Compute the addresses of the panel B01 and the triangular + block B11. */ \ + b01 = b1; \ + /* b11 = b1 + ( k_b01 * PACKNR ) / off_scl; */ \ + b11 = bli_ptr_inc_by_frac( b1, sizeof( ctype ), k_b01 * PACKNR, off_scl ); \ +\ + /* Compute the panel stride for the current micro-panel. */ \ + is_b_cur = k_b0111 * PACKNR; \ + is_b_cur += ( bli_is_odd( is_b_cur ) ? 1 : 0 ); \ + ps_b_cur = ( is_b_cur * ss_b_num ) / ss_b_den; \ +\ + /* Save the 4m1/3m1 imaginary stride of B to the auxinfo_t + object. + NOTE: We swap the values for A and B since the triangular + "A" matrix is actually contained within B. */ \ + bli_auxinfo_set_is_a( is_b_cur, &aux ); \ +\ + /* Loop over the m dimension (MR rows at a time). */ \ + for ( i = 0; i < m_iter; ++i ) \ + { \ + if( bli_trsm_my_iter( i, thread ) ){ \ +\ + ctype* restrict a10; \ + ctype* restrict a11; \ + ctype* restrict a2; \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + /* Compute the addresses of the A10 panel and A11 block. */ \ + a10 = a1 + ( off_b01 * PACKMR ) / off_scl; \ + a11 = a1 + ( off_b11 * PACKMR ) / off_scl; \ +\ + /* Compute the addresses of the next panels of A and B.
*/ \ + a2 = a1; \ + /*if ( bli_is_last_iter( i, m_iter, 0, 1 ) ) */\ + if ( i + bli_thread_num_threads(thread) >= m_iter ) \ + { \ + a2 = a_cast; \ + b2 = b1 + ps_b_cur; \ + if ( bli_is_last_iter( j, n_iter, 0, 1 ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. NOTE: We swap the values for A and B since the + triangular "A" matrix is actually contained within B. */ \ + bli_auxinfo_set_next_a( b2, &aux ); \ + bli_auxinfo_set_next_b( a2, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the fused gemm/trsm micro-kernel. */ \ + gemmtrsm_ukr \ + ( \ + k_b01, \ + alpha1_cast, \ + b01, \ + b11, \ + a10, \ + a11, \ + c11, cs_c, rs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the fused gemm/trsm micro-kernel. */ \ + gemmtrsm_ukr \ + ( \ + k_b01, \ + alpha1_cast, \ + b01, \ + b11, \ + a10, \ + a11, \ + ct, cs_ct, rs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Copy the result to the bottom edge of C. */ \ + PASTEMAC(ch,copys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + c11, rs_c, cs_c ); \ + } \ + } \ +\ + a1 += rstep_a; \ + c11 += rstep_c; \ + } \ +\ + b1 += ps_b_cur; \ + } \ + else if ( bli_is_strictly_above_diag_n( diagoffb_j, k, NR ) ) \ + { \ + /* Save the 4m1/3m1 imaginary stride of B to the auxinfo_t + object. + NOTE: We swap the values for A and B since the triangular + "A" matrix is actually contained within B. */ \ + bli_auxinfo_set_is_a( istep_b, &aux ); \ +\ + /* Loop over the m dimension (MR rows at a time). */ \ + for ( i = 0; i < m_iter; ++i ) \ + { \ + if( bli_trsm_my_iter( i, thread ) ){ \ +\ + ctype* restrict a2; \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + /* Compute the addresses of the next panels of A and B. 
*/ \ + a2 = a1; \ + /*if ( bli_is_last_iter( i, m_iter, 0, 1 ) ) */\ + if ( i + bli_thread_num_threads(thread) >= m_iter ) \ + { \ + a2 = a_cast; \ + b2 = b1 + cstep_b; \ + if ( bli_is_last_iter( j, n_iter, 0, 1 ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. NOTE: We swap the values for A and B since the + triangular "A" matrix is actually contained within B. */ \ + bli_auxinfo_set_next_a( b2, &aux ); \ + bli_auxinfo_set_next_b( a2, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + minus_one, \ + b1, \ + a1, \ + alpha2_cast, \ + c11, cs_c, rs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + minus_one, \ + b1, \ + a1, \ + zero, \ + ct, cs_ct, rs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Add the result to the edge of C. */ \ + PASTEMAC(ch,xpbys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + alpha2_cast, \ + c11, rs_c, cs_c ); \ + } \ + } \ +\ + a1 += rstep_a; \ + c11 += rstep_c; \ + } \ +\ + b1 += cstep_b; \ + } \ +\ + c1 += cstep_c; \ + } \ +} + +INSERT_GENTFUNC_BASIC0( trsm_ru_ker_var2 ) + diff --git a/frame/base/bli_clock.c b/frame/base/bli_clock.c index 6f92d907b..120704145 100644 --- a/frame/base/bli_clock.c +++ b/frame/base/bli_clock.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -55,11 +56,9 @@ double bli_clock_min_diff( double time_min, double time_start ) // Assume that anything: // - under or equal to zero, - // - over an hour, or // - under a nanosecond // is actually garbled due to the clocks being taken too closely together. 
if ( time_min <= 0.0 ) time_min = time_min_prev; - else if ( time_min > 3600.0 ) time_min = time_min_prev; else if ( time_min < 1.0e-9 ) time_min = time_min_prev; return time_min; diff --git a/frame/base/bli_info.c b/frame/base/bli_info.c index 344a07447..42ed83bc5 100644 --- a/frame/base/bli_info.c +++ b/frame/base/bli_info.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -94,6 +95,60 @@ gint_t bli_info_get_enable_packbuf_pools( void ) return 0; #endif } +gint_t bli_info_get_enable_threading( void ) +{ + if ( bli_info_get_enable_openmp() || + bli_info_get_enable_pthreads() ) return 1; + else return 0; +} +gint_t bli_info_get_enable_openmp( void ) +{ +#ifdef BLIS_ENABLE_OPENMP + return 1; +#else + return 0; +#endif +} +gint_t bli_info_get_enable_pthreads( void ) +{ +#ifdef BLIS_ENABLE_PTHREADS + return 1; +#else + return 0; +#endif +} +gint_t bli_info_get_thread_part_jrir_slab( void ) +{ +#ifdef BLIS_ENABLE_JRIR_SLAB + return 1; +#else + return 0; +#endif +} +gint_t bli_info_get_thread_part_jrir_rr( void ) +{ +#ifdef BLIS_ENABLE_JRIR_RR + return 1; +#else + return 0; +#endif +} +gint_t bli_info_get_enable_memkind( void ) +{ +#ifdef BLIS_ENABLE_MEMKIND + return 1; +#else + return 0; +#endif +} +gint_t bli_info_get_enable_sandbox( void ) +{ +#ifdef BLIS_ENABLE_SANDBOX + return 1; +#else + return 0; +#endif +} diff --git a/frame/base/bli_info.h b/frame/base/bli_info.h index 82ff86b03..96aeade85 100644 --- a/frame/base/bli_info.h +++ b/frame/base/bli_info.h @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. 
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -58,6 +59,13 @@ gint_t bli_info_get_enable_blas( void ); gint_t bli_info_get_enable_cblas( void ); gint_t bli_info_get_blas_int_type_size( void ); gint_t bli_info_get_enable_packbuf_pools( void ); +gint_t bli_info_get_enable_threading( void ); +gint_t bli_info_get_enable_openmp( void ); +gint_t bli_info_get_enable_pthreads( void ); +gint_t bli_info_get_thread_part_jrir_slab( void ); +gint_t bli_info_get_thread_part_jrir_rr( void ); +gint_t bli_info_get_enable_memkind( void ); +gint_t bli_info_get_enable_sandbox( void ); // -- Kernel implementation-related -------------------------------------------- diff --git a/frame/base/bli_prune.c b/frame/base/bli_prune.c index 9b5803d9f..1f40933b0 100644 --- a/frame/base/bli_prune.c +++ b/frame/base/bli_prune.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -45,7 +46,7 @@ void bli_prune_unref_mparts( obj_t* p, mdim_t mdim_p, // matrix is empty. This is not strictly needed but rather a minor // optimization, as it would prevent threads that would otherwise get // subproblems on BLIS_ZEROS operands from calling the macro-kernel, - // because bli_thread_get_range*() would return empty ranges, which would + // because bli_thread_range*() would return empty ranges, which would // cause the variant's for loop from executing any iterations. 
// NOTE: this should only ever execute if the primary object is // triangular because that is the only structure type with subpartitions diff --git a/frame/base/bli_rntm.c b/frame/base/bli_rntm.c index 935a17a7f..6ccea7277 100644 --- a/frame/base/bli_rntm.c +++ b/frame/base/bli_rntm.c @@ -101,16 +101,16 @@ bli_rntm_print( rntm ); } else if ( l3_op == BLIS_TRSM ) { - // For trsm_l, we extract all parallelism from the jr loop, and - // for trsm_r, we extract all parallelism from the ic loop. + // For trsm_l, we extract all parallelism from the jc and jr loops. + // For trsm_r, we extract all parallelism from the ic loop. if ( bli_is_left( side ) ) { bli_rntm_set_ways_only ( + jc, 1, 1, - 1, - ic * pc * jc * jr * ir, + ic * pc * jr * ir, 1, rntm ); @@ -198,15 +198,15 @@ void bli_rntm_set_ways_from_rntm pc = 1; - bli_partition_2x2( nt, m*BLIS_DEFAULT_M_THREAD_RATIO, - n*BLIS_DEFAULT_N_THREAD_RATIO, &ic, &jc ); + bli_partition_2x2( nt, m*BLIS_THREAD_RATIO_M, + n*BLIS_THREAD_RATIO_N, &ic, &jc ); - for ( ir = BLIS_DEFAULT_MR_THREAD_MAX ; ir > 1 ; ir-- ) + for ( ir = BLIS_THREAD_MAX_IR ; ir > 1 ; ir-- ) { if ( ic % ir == 0 ) { ic /= ir; break; } } - for ( jr = BLIS_DEFAULT_NR_THREAD_MAX ; jr > 1 ; jr-- ) + for ( jr = BLIS_THREAD_MAX_JR ; jr > 1 ; jr-- ) { if ( jc % jr == 0 ) { jc /= jr; break; } } diff --git a/frame/compat/bli_blas.h b/frame/compat/bli_blas.h index a0217a117..f5365379e 100644 --- a/frame/compat/bli_blas.h +++ b/frame/compat/bli_blas.h @@ -174,6 +174,10 @@ #include "bla_trmm_check.h" #include "bla_trsm_check.h" +// -- Fortran-compatible APIs to BLIS functions -- + +#include "b77_thread.h" + #endif // BLIS_ENABLE_BLAS #endif // BLIS_VIA_BLASTEST diff --git a/frame/compat/blis/thread/b77_thread.c b/frame/compat/blis/thread/b77_thread.c new file mode 100644 index 000000000..bd94652e8 --- /dev/null +++ b/frame/compat/blis/thread/b77_thread.c @@ -0,0 +1,93 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. 
+ + Copyright (C) 2018, The University of Texas at Austin + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +*/ + +#include "blis.h" + + +// +// Define Fortran-compatible BLIS interfaces. +// + +void PASTEF770(bli_thread_set_ways) + ( + const f77_int* jc, + const f77_int* pc, + const f77_int* ic, + const f77_int* jr, + const f77_int* ir + ) +{ + dim_t jc0 = *jc; + dim_t pc0 = *pc; + dim_t ic0 = *ic; + dim_t jr0 = *jr; + dim_t ir0 = *ir; + + // Initialize BLIS. + bli_init_auto(); + + // Convert/typecast negative values to zero. 
+ //bli_convert_blas_dim1( *jc, jc0 ); + //bli_convert_blas_dim1( *pc, pc0 ); + //bli_convert_blas_dim1( *ic, ic0 ); + //bli_convert_blas_dim1( *jr, jr0 ); + //bli_convert_blas_dim1( *ir, ir0 ); + + // Call the BLIS function. + bli_thread_set_ways( jc0, pc0, ic0, jr0, ir0 ); + + // Finalize BLIS. + bli_finalize_auto(); +} + +void PASTEF770(bli_thread_set_num_threads) + ( + const f77_int* nt + ) +{ + dim_t nt0 = *nt; + + // Initialize BLIS. + bli_init_auto(); + + // Convert/typecast negative values to zero. + //bli_convert_blas_dim1( *nt, nt0 ); + + // Call the BLIS function. + bli_thread_set_num_threads( nt0 ); + + // Finalize BLIS. + bli_finalize_auto(); +} + diff --git a/frame/compat/blis/thread/b77_thread.h b/frame/compat/blis/thread/b77_thread.h new file mode 100644 index 000000000..0004536a8 --- /dev/null +++ b/frame/compat/blis/thread/b77_thread.h @@ -0,0 +1,53 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +*/ + + +// +// Prototype Fortran-compatible BLIS interfaces. +// + +void PASTEF770(bli_thread_set_ways) + ( + const f77_int* jc, + const f77_int* pc, + const f77_int* ic, + const f77_int* jr, + const f77_int* ir + ); + +void PASTEF770(bli_thread_set_num_threads) + ( + const f77_int* nt + ); + diff --git a/frame/include/bli_kernel_macro_defs.h b/frame/include/bli_kernel_macro_defs.h index 137d3b375..1b67988d7 100644 --- a/frame/include/bli_kernel_macro_defs.h +++ b/frame/include/bli_kernel_macro_defs.h @@ -38,20 +38,20 @@ // -- Define default threading parameters -------------------------------------- -#ifndef BLIS_DEFAULT_M_THREAD_RATIO -#define BLIS_DEFAULT_M_THREAD_RATIO 2 +#ifndef BLIS_THREAD_RATIO_M +#define BLIS_THREAD_RATIO_M 2 #endif -#ifndef BLIS_DEFAULT_N_THREAD_RATIO -#define BLIS_DEFAULT_N_THREAD_RATIO 1 +#ifndef BLIS_THREAD_RATIO_N +#define BLIS_THREAD_RATIO_N 1 #endif -#ifndef BLIS_DEFAULT_MR_THREAD_MAX -#define BLIS_DEFAULT_MR_THREAD_MAX 1 +#ifndef BLIS_THREAD_MAX_IR +#define BLIS_THREAD_MAX_IR 1 #endif -#ifndef BLIS_DEFAULT_NR_THREAD_MAX -#define BLIS_DEFAULT_NR_THREAD_MAX 4 +#ifndef BLIS_THREAD_MAX_JR +#define BLIS_THREAD_MAX_JR 4 #endif diff --git a/frame/include/bli_param_macro_defs.h b/frame/include/bli_param_macro_defs.h index eb92f08b0..790f3427b 100644 --- a/frame/include/bli_param_macro_defs.h +++ b/frame/include/bli_param_macro_defs.h @@ -5,6 +5,7 @@ libraries. 
Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -638,6 +639,13 @@ static bool_t bli_intersects_diag_n( doff_t diagoff, dim_t m, dim_t n ) !bli_is_strictly_below_diag_n( diagoff, m, n ) ); } +static bool_t bli_is_outside_diag_n( doff_t diagoff, dim_t m, dim_t n ) +{ + return ( bool_t ) + ( bli_is_strictly_above_diag_n( diagoff, m, n ) || + bli_is_strictly_below_diag_n( diagoff, m, n ) ); +} + static bool_t bli_is_stored_subpart_n( doff_t diagoff, uplo_t uplo, dim_t m, dim_t n ) { return ( bool_t ) @@ -784,10 +792,16 @@ static bool_t bli_is_not_edge_b( dim_t i, dim_t n_iter, dim_t n_left ) ( i != 0 || n_left == 0 ); } -static bool_t bli_is_last_iter( dim_t i, dim_t n_iter, dim_t tid, dim_t nth ) +static bool_t bli_is_last_iter_sl( dim_t i, dim_t end_iter, dim_t tid, dim_t nth ) { return ( bool_t ) - ( i == n_iter - 1 - ( ( n_iter - tid - 1 ) % nth ) ); + ( i == end_iter - 1 ); +} + +static bool_t bli_is_last_iter_rr( dim_t i, dim_t end_iter, dim_t tid, dim_t nth ) +{ + return ( bool_t ) + ( i == end_iter - 1 - ( ( end_iter - tid - 1 ) % nth ) ); } diff --git a/frame/thread/bli_thread.c b/frame/thread/bli_thread.c index 2931d0951..886dc15f5 100644 --- a/frame/thread/bli_thread.c +++ b/frame/thread/bli_thread.c @@ -59,9 +59,35 @@ void bli_thread_finalize( void ) { } +// ----------------------------------------------------------------------------- +#if 0 +void bli_thread_range_jrir + ( + thrinfo_t* thread, + dim_t n, + dim_t bf, + bool_t handle_edge_low, + dim_t* start, + dim_t* end, + dim_t* inc + ) +{ +//#ifdef BLIS_JRIR_INTERLEAVE +#if 1 + // Use interleaved partitioning of jr/ir loops. + *start = bli_thread_work_id( thread ); + *inc = bli_thread_n_way( thread ); + *end = n; +#else + // Use contiguous slab partitioning for jr/ir loops. 
+ bli_thread_range_sub( thread, n, bf, handle_edge_low, start, end ); + *inc = 1; +#endif +} +#endif // ----------------------------------------------------------------------------- -void bli_thread_get_range_sub +void bli_thread_range_sub ( thrinfo_t* thread, dim_t n, @@ -72,6 +98,9 @@ void bli_thread_get_range_sub ) { dim_t n_way = bli_thread_n_way( thread ); + + if ( n_way == 1 ) { *start = 0; *end = n; return; } + dim_t work_id = bli_thread_work_id( thread ); dim_t all_start = 0; @@ -202,7 +231,7 @@ void bli_thread_get_range_sub } } -siz_t bli_thread_get_range_l2r +siz_t bli_thread_range_l2r ( thrinfo_t* thr, obj_t* a, @@ -216,13 +245,13 @@ siz_t bli_thread_get_range_l2r dim_t n = bli_obj_width_after_trans( a ); dim_t bf = bli_blksz_get_def( dt, bmult ); - bli_thread_get_range_sub( thr, n, bf, - FALSE, start, end ); + bli_thread_range_sub( thr, n, bf, + FALSE, start, end ); return m * ( *end - *start ); } -siz_t bli_thread_get_range_r2l +siz_t bli_thread_range_r2l ( thrinfo_t* thr, obj_t* a, @@ -236,13 +265,13 @@ siz_t bli_thread_get_range_r2l dim_t n = bli_obj_width_after_trans( a ); dim_t bf = bli_blksz_get_def( dt, bmult ); - bli_thread_get_range_sub( thr, n, bf, - TRUE, start, end ); + bli_thread_range_sub( thr, n, bf, + TRUE, start, end ); return m * ( *end - *start ); } -siz_t bli_thread_get_range_t2b +siz_t bli_thread_range_t2b ( thrinfo_t* thr, obj_t* a, @@ -256,13 +285,13 @@ siz_t bli_thread_get_range_t2b dim_t n = bli_obj_width_after_trans( a ); dim_t bf = bli_blksz_get_def( dt, bmult ); - bli_thread_get_range_sub( thr, m, bf, - FALSE, start, end ); + bli_thread_range_sub( thr, m, bf, + FALSE, start, end ); return n * ( *end - *start ); } -siz_t bli_thread_get_range_b2t +siz_t bli_thread_range_b2t ( thrinfo_t* thr, obj_t* a, @@ -276,15 +305,15 @@ siz_t bli_thread_get_range_b2t dim_t n = bli_obj_width_after_trans( a ); dim_t bf = bli_blksz_get_def( dt, bmult ); - bli_thread_get_range_sub( thr, m, bf, - TRUE, start, end ); + bli_thread_range_sub( thr, 
m, bf, + TRUE, start, end ); return n * ( *end - *start ); } // ----------------------------------------------------------------------------- -dim_t bli_thread_get_range_width_l +dim_t bli_thread_range_width_l ( doff_t diagoff_j, dim_t m, @@ -495,17 +524,17 @@ siz_t bli_find_area_trap_l // ----------------------------------------------------------------------------- -siz_t bli_thread_get_range_weighted_sub +siz_t bli_thread_range_weighted_sub ( - thrinfo_t* thread, - doff_t diagoff, - uplo_t uplo, - dim_t m, - dim_t n, - dim_t bf, - bool_t handle_edge_low, - dim_t* j_start_thr, - dim_t* j_end_thr + thrinfo_t* restrict thread, + doff_t diagoff, + uplo_t uplo, + dim_t m, + dim_t n, + dim_t bf, + bool_t handle_edge_low, + dim_t* restrict j_start_thr, + dim_t* restrict j_end_thr ) { dim_t n_way = bli_thread_n_way( thread ); @@ -570,7 +599,7 @@ siz_t bli_thread_get_range_weighted_sub // Compute the width of the jth subpartition, taking the // current diagonal offset into account, if needed. width_j = - bli_thread_get_range_width_l + bli_thread_range_width_l ( diagoff_j, m, n_left, j, n_way, @@ -614,7 +643,7 @@ siz_t bli_thread_get_range_weighted_sub bli_toggle_bool( &handle_edge_low ); // Compute the appropriate range for the rotated trapezoid. 
- area = bli_thread_get_range_weighted_sub + area = bli_thread_range_weighted_sub ( thread, diagoff, uplo, m, n, bf, handle_edge_low, @@ -632,7 +661,7 @@ siz_t bli_thread_get_range_weighted_sub return area; } -siz_t bli_thread_get_range_mdim +siz_t bli_thread_range_mdim ( dir_t direct, thrinfo_t* thr, @@ -678,20 +707,20 @@ siz_t bli_thread_get_range_mdim if ( use_weighted ) { if ( direct == BLIS_FWD ) - return bli_thread_get_range_weighted_t2b( thr, x, bmult, start, end ); + return bli_thread_range_weighted_t2b( thr, x, bmult, start, end ); else - return bli_thread_get_range_weighted_b2t( thr, x, bmult, start, end ); + return bli_thread_range_weighted_b2t( thr, x, bmult, start, end ); } else { if ( direct == BLIS_FWD ) - return bli_thread_get_range_t2b( thr, x, bmult, start, end ); + return bli_thread_range_t2b( thr, x, bmult, start, end ); else - return bli_thread_get_range_b2t( thr, x, bmult, start, end ); + return bli_thread_range_b2t( thr, x, bmult, start, end ); } } -siz_t bli_thread_get_range_ndim +siz_t bli_thread_range_ndim ( dir_t direct, thrinfo_t* thr, @@ -737,20 +766,20 @@ siz_t bli_thread_get_range_ndim if ( use_weighted ) { if ( direct == BLIS_FWD ) - return bli_thread_get_range_weighted_l2r( thr, x, bmult, start, end ); + return bli_thread_range_weighted_l2r( thr, x, bmult, start, end ); else - return bli_thread_get_range_weighted_r2l( thr, x, bmult, start, end ); + return bli_thread_range_weighted_r2l( thr, x, bmult, start, end ); } else { if ( direct == BLIS_FWD ) - return bli_thread_get_range_l2r( thr, x, bmult, start, end ); + return bli_thread_range_l2r( thr, x, bmult, start, end ); else - return bli_thread_get_range_r2l( thr, x, bmult, start, end ); + return bli_thread_range_r2l( thr, x, bmult, start, end ); } } -siz_t bli_thread_get_range_weighted_l2r +siz_t bli_thread_range_weighted_l2r ( thrinfo_t* thr, obj_t* a, @@ -782,7 +811,7 @@ siz_t bli_thread_get_range_weighted_l2r } area = - bli_thread_get_range_weighted_sub + 
bli_thread_range_weighted_sub ( thr, diagoff, uplo, m, n, bf, FALSE, start, end @@ -790,7 +819,7 @@ siz_t bli_thread_get_range_weighted_l2r } else // if dense or zeros { - area = bli_thread_get_range_l2r + area = bli_thread_range_l2r ( thr, a, bmult, start, end @@ -800,7 +829,7 @@ siz_t bli_thread_get_range_weighted_l2r return area; } -siz_t bli_thread_get_range_weighted_r2l +siz_t bli_thread_range_weighted_r2l ( thrinfo_t* thr, obj_t* a, @@ -834,7 +863,7 @@ siz_t bli_thread_get_range_weighted_r2l bli_rotate180_trapezoid( &diagoff, &uplo, &m, &n ); area = - bli_thread_get_range_weighted_sub + bli_thread_range_weighted_sub ( thr, diagoff, uplo, m, n, bf, TRUE, start, end @@ -842,7 +871,7 @@ siz_t bli_thread_get_range_weighted_r2l } else // if dense or zeros { - area = bli_thread_get_range_r2l + area = bli_thread_range_r2l ( thr, a, bmult, start, end @@ -852,7 +881,7 @@ siz_t bli_thread_get_range_weighted_r2l return area; } -siz_t bli_thread_get_range_weighted_t2b +siz_t bli_thread_range_weighted_t2b ( thrinfo_t* thr, obj_t* a, @@ -886,7 +915,7 @@ siz_t bli_thread_get_range_weighted_t2b bli_reflect_about_diag( &diagoff, &uplo, &m, &n ); area = - bli_thread_get_range_weighted_sub + bli_thread_range_weighted_sub ( thr, diagoff, uplo, m, n, bf, FALSE, start, end @@ -894,7 +923,7 @@ siz_t bli_thread_get_range_weighted_t2b } else // if dense or zeros { - area = bli_thread_get_range_t2b + area = bli_thread_range_t2b ( thr, a, bmult, start, end @@ -904,7 +933,7 @@ siz_t bli_thread_get_range_weighted_t2b return area; } -siz_t bli_thread_get_range_weighted_b2t +siz_t bli_thread_range_weighted_b2t ( thrinfo_t* thr, obj_t* a, @@ -939,7 +968,7 @@ siz_t bli_thread_get_range_weighted_b2t bli_rotate180_trapezoid( &diagoff, &uplo, &m, &n ); - area = bli_thread_get_range_weighted_sub + area = bli_thread_range_weighted_sub ( thr, diagoff, uplo, m, n, bf, TRUE, start, end @@ -947,7 +976,7 @@ siz_t bli_thread_get_range_weighted_b2t } else // if dense or zeros { - area = 
bli_thread_get_range_b2t + area = bli_thread_range_b2t ( thr, a, bmult, start, end diff --git a/frame/thread/bli_thread.h b/frame/thread/bli_thread.h index ffed93106..e55244435 100644 --- a/frame/thread/bli_thread.h +++ b/frame/thread/bli_thread.h @@ -6,6 +6,7 @@ Copyright (C) 2014, The University of Texas at Austin Copyright (C) 2016, Hewlett Packard Enterprise Development LP + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -56,7 +57,21 @@ void bli_thread_finalize( void ); #endif // Thread range-related prototypes. -void bli_thread_get_range_sub +#if 0 +void bli_thread_range_jrir + ( + thrinfo_t* thread, + dim_t n, + dim_t bf, + bool_t handle_edge_low, + dim_t* start, + dim_t* end, + dim_t* inc + ); +#endif +// ----------------------------------------------------------------------------- + +void bli_thread_range_sub ( thrinfo_t* thread, dim_t n, @@ -82,8 +97,8 @@ siz_t PASTEMAC0( opname ) \ dim_t* end \ ); -GENPROT( thread_get_range_mdim ) -GENPROT( thread_get_range_ndim ) +GENPROT( thread_range_mdim ) +GENPROT( thread_range_ndim ) #undef GENPROT #define GENPROT( opname ) \ @@ -97,18 +112,18 @@ siz_t PASTEMAC0( opname ) \ dim_t* end \ ); -GENPROT( thread_get_range_l2r ) -GENPROT( thread_get_range_r2l ) -GENPROT( thread_get_range_t2b ) -GENPROT( thread_get_range_b2t ) +GENPROT( thread_range_l2r ) +GENPROT( thread_range_r2l ) +GENPROT( thread_range_t2b ) +GENPROT( thread_range_b2t ) -GENPROT( thread_get_range_weighted_l2r ) -GENPROT( thread_get_range_weighted_r2l ) -GENPROT( thread_get_range_weighted_t2b ) -GENPROT( thread_get_range_weighted_b2t ) +GENPROT( thread_range_weighted_l2r ) +GENPROT( thread_range_weighted_r2l ) +GENPROT( thread_range_weighted_t2b ) +GENPROT( thread_range_weighted_b2t ) -dim_t bli_thread_get_range_width_l +dim_t bli_thread_range_width_l ( doff_t diagoff_j, dim_t m, @@ -126,17 +141,17 @@ siz_t 
bli_find_area_trap_l dim_t n, doff_t diagoff ); -siz_t bli_thread_get_range_weighted_sub +siz_t bli_thread_range_weighted_sub ( - thrinfo_t* thread, - doff_t diagoff, - uplo_t uplo, - dim_t m, - dim_t n, - dim_t bf, - bool_t handle_edge_low, - dim_t* j_start_thr, - dim_t* j_end_thr + thrinfo_t* restrict thread, + doff_t diagoff, + uplo_t uplo, + dim_t m, + dim_t n, + dim_t bf, + bool_t handle_edge_low, + dim_t* restrict j_start_thr, + dim_t* restrict j_end_thr ); @@ -204,16 +219,102 @@ dim_t bli_thread_get_jr_nt( void ); dim_t bli_thread_get_ir_nt( void ); dim_t bli_thread_get_num_threads( void ); -void bli_thread_set_jc_nt( dim_t value ); -void bli_thread_set_pc_nt( dim_t value ); -void bli_thread_set_ic_nt( dim_t value ); -void bli_thread_set_jr_nt( dim_t value ); -void bli_thread_set_ir_nt( dim_t value ); +void bli_thread_set_ways( dim_t jc, dim_t pc, dim_t ic, dim_t jr, dim_t ir ); void bli_thread_set_num_threads( dim_t value ); void bli_thread_init_rntm( rntm_t* rntm ); void bli_thread_init_rntm_from_env( rntm_t* rntm ); +// ----------------------------------------------------------------------------- + +static void bli_thread_range_jrir_rr + ( + thrinfo_t* thread, + dim_t n, + dim_t bf, + bool_t handle_edge_low, + dim_t* start, + dim_t* end, + dim_t* inc + ) +{ + // Use interleaved partitioning of jr/ir loops. + *start = bli_thread_work_id( thread ); + *inc = bli_thread_n_way( thread ); + *end = n; +} + +static void bli_thread_range_jrir_sl + ( + thrinfo_t* thread, + dim_t n, + dim_t bf, + bool_t handle_edge_low, + dim_t* start, + dim_t* end, + dim_t* inc + ) +{ + // Use contiguous slab partitioning of jr/ir loops. 
+ bli_thread_range_sub( thread, n, bf, handle_edge_low, start, end ); + *inc = 1; +} + +#if 0 +static void bli_thread_range_jrir + ( + thrinfo_t* thread, + dim_t n, + dim_t bf, + bool_t handle_edge_low, + dim_t* start, + dim_t* end, + dim_t* inc + ) +{ +#ifdef BLIS_ENABLE_JRIR_SLAB + bli_thread_range_jrir_sl( thread, n, bf, handle_edge_low, start, end, inc ); +#else + bli_thread_range_jrir_rr( thread, n, bf, handle_edge_low, start, end, inc ); +#endif +} + +static void bli_thread_range_weighted_jrir + ( + thrinfo_t* thread, + doff_t diagoff, + uplo_t uplo, + dim_t m, + dim_t n, + dim_t bf, + bool_t handle_edge_low, + dim_t* start, + dim_t* end, + dim_t* inc + ) +{ +#ifdef BLIS_ENABLE_JRIR_SLAB + + // Use contiguous slab partitioning for jr/ir loops. + bli_thread_range_weighted_sub( thread, diagoff, uplo, m, n, bf, + handle_edge_low, start, end ); + + *start = *start / bf; *inc = 1; + + if ( *end % bf ) *end = *end / bf + 1; + else *end = *end / bf; + +#else + + // Use interleaved partitioning of jr/ir loops. + *start = bli_thread_work_id( thread ); + *inc = bli_thread_n_way( thread ); + *end = n; + +#endif +} +#endif + #endif diff --git a/sandbox/ref99/blx_gemm_int.c b/sandbox/ref99/blx_gemm_int.c index 4937095a9..febb8040a 100644 --- a/sandbox/ref99/blx_gemm_int.c +++ b/sandbox/ref99/blx_gemm_int.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -46,10 +47,10 @@ void blx_gemm_int thrinfo_t* thread ) { - obj_t a_local; - obj_t b_local; - obj_t c_local; - gemm_voft f; + obj_t a_local; + obj_t b_local; + obj_t c_local; + gemm_var_oft f; // Alias A, B, and C in case we need to update attached scalars. 
bli_obj_alias_to( a, &a_local ); diff --git a/sandbox/ref99/cntl/blx_gemm_cntl.c b/sandbox/ref99/cntl/blx_gemm_cntl.c index ebcf6da30..c6c7e61f9 100644 --- a/sandbox/ref99/cntl/blx_gemm_cntl.c +++ b/sandbox/ref99/cntl/blx_gemm_cntl.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -54,7 +55,28 @@ cntl_t* blx_gemmbp_cntl_create pack_t schema_b ) { - void* macro_kernel_p = blx_gemm_ker_var2; + void* macro_kernel_fp; + void* packa_fp; + void* packb_fp; + +#ifdef BLIS_ENABLE_JRIR_SLAB + + // Use the function pointers to the macrokernels that use slab + // assignment of micropanels to threads in the jr and ir loops. + macro_kernel_fp = blx_gemm_ker_var2sl; + + packa_fp = bli_packm_blk_var1sl; + packb_fp = bli_packm_blk_var1sl; + +#else // BLIS_ENABLE_JRIR_RR + + // Use the function pointers to the macrokernels that use round-robin + // assignment of micropanels to threads in the jr and ir loops. + macro_kernel_fp = blx_gemm_ker_var2rr; + + packa_fp = bli_packm_blk_var1rr; + packb_fp = bli_packm_blk_var1rr; +#endif // Create two nodes for the macro-kernel.
cntl_t* gemm_cntl_bu_ke = blx_gemm_cntl_create_node @@ -69,7 +91,7 @@ cntl_t* blx_gemmbp_cntl_create ( family, BLIS_NR, // not used by macro-kernel, but needed for bli_thrinfo_rgrow() - macro_kernel_p, + macro_kernel_fp, gemm_cntl_bu_ke ); @@ -77,7 +99,7 @@ cntl_t* blx_gemmbp_cntl_create cntl_t* gemm_cntl_packa = blx_packm_cntl_create_node ( blx_gemm_packa, // pack the left-hand operand - bli_packm_blk_var1, + packa_fp, BLIS_MR, BLIS_KR, FALSE, // do NOT invert diagonal @@ -101,7 +123,7 @@ cntl_t* blx_gemmbp_cntl_create cntl_t* gemm_cntl_packb = blx_packm_cntl_create_node ( blx_gemm_packb, // pack the right-hand operand - bli_packm_blk_var1, + packb_fp, BLIS_KR, BLIS_NR, FALSE, // do NOT invert diagonal diff --git a/sandbox/ref99/vars/blx_gemm_blk_var1.c b/sandbox/ref99/vars/blx_gemm_blk_var1.c index 43eb40bef..70482ede1 100644 --- a/sandbox/ref99/vars/blx_gemm_blk_var1.c +++ b/sandbox/ref99/vars/blx_gemm_blk_var1.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -52,7 +53,7 @@ void blx_gemm_blk_var1 dim_t my_start, my_end; // Determine the current thread's subpartition range. - bli_thread_get_range_mdim + bli_thread_range_mdim ( BLIS_FWD, thread, a, b, c, cntl, cntx, &my_start, &my_end diff --git a/sandbox/ref99/vars/blx_gemm_blk_var2.c b/sandbox/ref99/vars/blx_gemm_blk_var2.c index debcb2dfc..00a19ceef 100644 --- a/sandbox/ref99/vars/blx_gemm_blk_var2.c +++ b/sandbox/ref99/vars/blx_gemm_blk_var2.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. 
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -52,7 +53,7 @@ void blx_gemm_blk_var2 dim_t my_start, my_end; // Determine the current thread's subpartition range. - bli_thread_get_range_ndim + bli_thread_range_ndim ( BLIS_FWD, thread, a, b, c, cntl, cntx, &my_start, &my_end diff --git a/sandbox/ref99/vars/blx_gemm_ker_var2.c b/sandbox/ref99/vars/blx_gemm_ker_var2rr.c similarity index 85% rename from sandbox/ref99/vars/blx_gemm_ker_var2.c rename to sandbox/ref99/vars/blx_gemm_ker_var2rr.c index 2a1cbe6b6..eff1ecc85 100644 --- a/sandbox/ref99/vars/blx_gemm_ker_var2.c +++ b/sandbox/ref99/vars/blx_gemm_ker_var2rr.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -58,14 +59,14 @@ typedef void (*gemm_fp) // Function pointer array for datatype-specific functions. static gemm_fp ftypes[BLIS_NUM_FP_TYPES] = { - PASTECH2(blx_,s,gemm_ker_var2), - PASTECH2(blx_,c,gemm_ker_var2), - PASTECH2(blx_,d,gemm_ker_var2), - PASTECH2(blx_,z,gemm_ker_var2) + PASTECH2(blx_,s,gemm_ker_var2rr), + PASTECH2(blx_,c,gemm_ker_var2rr), + PASTECH2(blx_,d,gemm_ker_var2rr), + PASTECH2(blx_,z,gemm_ker_var2rr) }; -void blx_gemm_ker_var2 +void blx_gemm_ker_var2rr ( obj_t* a, obj_t* b, @@ -255,14 +256,27 @@ void PASTECH2(blx_,ch,varname) \ bli_auxinfo_set_is_a( is_a, &aux ); \ bli_auxinfo_set_is_b( is_b, &aux ); \ \ - thrinfo_t* caucus = bli_thrinfo_sub_node( thread ); \ - dim_t jr_num_threads = bli_thread_n_way( thread ); \ - dim_t jr_thread_id = bli_thread_work_id( thread ); \ - dim_t ir_num_threads = bli_thread_n_way( caucus ); \ - dim_t ir_thread_id = bli_thread_work_id( caucus ); \ + /* The 'thread' argument points to the thrinfo_t node for the 2nd (jr) + loop around the microkernel. 
Here we query the thrinfo_t node for the + 1st (ir) loop around the microkernel. */ \ + thrinfo_t* caucus = bli_thrinfo_sub_node( thread ); \ +\ + /* Query the number of threads and thread ids for each loop. */ \ + dim_t jr_nt = bli_thread_n_way( thread ); \ + dim_t jr_tid = bli_thread_work_id( thread ); \ + dim_t ir_nt = bli_thread_n_way( caucus ); \ + dim_t ir_tid = bli_thread_work_id( caucus ); \ +\ + dim_t jr_start, jr_end; \ + dim_t ir_start, ir_end; \ + dim_t jr_inc, ir_inc; \ +\ + /* Determine the thread range and increment for each thrinfo_t node. */ \ + bli_thread_range_jrir_rr( thread, n_iter, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \ + bli_thread_range_jrir_rr( caucus, m_iter, 1, FALSE, &ir_start, &ir_end, &ir_inc ); \ \ /* Loop over the n dimension (NR columns at a time). */ \ - for ( j = jr_thread_id; j < n_iter; j += jr_num_threads ) \ + for ( j = jr_start; j < jr_end; j += jr_inc ) \ { \ ctype* restrict a1; \ ctype* restrict c11; \ @@ -277,7 +291,7 @@ void PASTECH2(blx_,ch,varname) \ b2 = b1; \ \ /* Loop over the m dimension (MR rows at a time). */ \ - for ( i = ir_thread_id; i < m_iter; i += ir_num_threads ) \ + for ( i = ir_start; i < ir_end; i += ir_inc ) \ { \ ctype* restrict a2; \ \ @@ -287,12 +301,12 @@ void PASTECH2(blx_,ch,varname) \ m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ \ /* Compute the addresses of the next panels of A and B. 
*/ \ - a2 = bli_gemm_get_next_a_upanel( caucus, a1, rstep_a ); \ - if ( bli_is_last_iter( i, m_iter, ir_thread_id, ir_num_threads ) ) \ + a2 = bli_gemm_get_next_a_upanel( a1, rstep_a, ir_inc ); \ + if ( bli_is_last_iter_rr( i, ir_end, ir_tid, ir_nt ) ) \ { \ a2 = a_cast; \ - b2 = bli_gemm_get_next_b_upanel( thread, b1, cstep_b ); \ - if ( bli_is_last_iter( j, n_iter, jr_thread_id, jr_num_threads ) ) \ + b2 = bli_gemm_get_next_b_upanel( b1, cstep_b, jr_inc ); \ + if ( bli_is_last_iter_rr( j, jr_end, jr_tid, jr_nt ) ) \ b2 = b_cast; \ } \ \ @@ -349,11 +363,11 @@ PASTEMAC(ch,fprintm)( stdout, "gemm_ker_var2: c after", m_cur, n_cur, c11, rs_c, } #if 0 -GENTFUNC( float, s, gemm_ker_var2 ) -GENTFUNC( double, d, gemm_ker_var2 ) -GENTFUNC( scomplex, c, gemm_ker_var2 ) -GENTFUNC( dcomplex, z, gemm_ker_var2 ) +GENTFUNC( float, s, gemm_ker_var2rr ) +GENTFUNC( double, d, gemm_ker_var2rr ) +GENTFUNC( scomplex, c, gemm_ker_var2rr ) +GENTFUNC( dcomplex, z, gemm_ker_var2rr ) #else -INSERT_GENTFUNC_BASIC0( gemm_ker_var2 ) +INSERT_GENTFUNC_BASIC0( gemm_ker_var2rr ) #endif diff --git a/sandbox/ref99/vars/blx_gemm_ker_var2sl.c b/sandbox/ref99/vars/blx_gemm_ker_var2sl.c new file mode 100644 index 000000000..31f51df92 --- /dev/null +++ b/sandbox/ref99/vars/blx_gemm_ker_var2sl.c @@ -0,0 +1,373 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. 
+ - Neither the name of The University of Texas at Austin nor the names + of its contributors may be used to endorse or promote products + derived from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +*/ + +#include "blis.h" +#include "blix.h" + +// Function pointer type for datatype-specific functions. +typedef void (*gemm_fp) + ( + pack_t schema_a, + pack_t schema_b, + dim_t m, + dim_t n, + dim_t k, + void* alpha, + void* a, inc_t cs_a, inc_t is_a, + dim_t pd_a, inc_t ps_a, + void* b, inc_t rs_b, inc_t is_b, + dim_t pd_b, inc_t ps_b, + void* beta, + void* c, inc_t rs_c, inc_t cs_c, + cntx_t* cntx, + rntm_t* rntm, + thrinfo_t* thread + ); + +// Function pointer array for datatype-specific functions. 
+static gemm_fp ftypes[BLIS_NUM_FP_TYPES] = +{ + PASTECH2(blx_,s,gemm_ker_var2sl), + PASTECH2(blx_,c,gemm_ker_var2sl), + PASTECH2(blx_,d,gemm_ker_var2sl), + PASTECH2(blx_,z,gemm_ker_var2sl) +}; + + +void blx_gemm_ker_var2sl + ( + obj_t* a, + obj_t* b, + obj_t* c, + cntx_t* cntx, + rntm_t* rntm, + cntl_t* cntl, + thrinfo_t* thread + ) +{ + num_t dt_exec = bli_obj_exec_dt( c ); + + pack_t schema_a = bli_obj_pack_schema( a ); + pack_t schema_b = bli_obj_pack_schema( b ); + + dim_t m = bli_obj_length( c ); + dim_t n = bli_obj_width( c ); + dim_t k = bli_obj_width( a ); + + void* buf_a = bli_obj_buffer_at_off( a ); + inc_t cs_a = bli_obj_col_stride( a ); + inc_t is_a = bli_obj_imag_stride( a ); + dim_t pd_a = bli_obj_panel_dim( a ); + inc_t ps_a = bli_obj_panel_stride( a ); + + void* buf_b = bli_obj_buffer_at_off( b ); + inc_t rs_b = bli_obj_row_stride( b ); + inc_t is_b = bli_obj_imag_stride( b ); + dim_t pd_b = bli_obj_panel_dim( b ); + inc_t ps_b = bli_obj_panel_stride( b ); + + void* buf_c = bli_obj_buffer_at_off( c ); + inc_t rs_c = bli_obj_row_stride( c ); + inc_t cs_c = bli_obj_col_stride( c ); + + obj_t scalar_a; + obj_t scalar_b; + + void* buf_alpha; + void* buf_beta; + + gemm_fp f; + + // Detach and multiply the scalars attached to A and B. + bli_obj_scalar_detach( a, &scalar_a ); + bli_obj_scalar_detach( b, &scalar_b ); + bli_mulsc( &scalar_a, &scalar_b ); + + // Grab the addresses of the internal scalar buffers for the scalar + // merged above and the scalar attached to C. + buf_alpha = bli_obj_internal_scalar_buffer( &scalar_b ); + buf_beta = bli_obj_internal_scalar_buffer( c ); + + // Index into the type combination array to extract the correct + // function pointer. + f = ftypes[dt_exec]; + + // Invoke the function. 
+ f( schema_a, + schema_b, + m, + n, + k, + buf_alpha, + buf_a, cs_a, is_a, + pd_a, ps_a, + buf_b, rs_b, is_b, + pd_b, ps_b, + buf_beta, + buf_c, rs_c, cs_c, + cntx, + rntm, + thread ); +} + + +#undef GENTFUNC +#define GENTFUNC( ctype, ch, varname ) \ +\ +void PASTECH2(blx_,ch,varname) \ + ( \ + pack_t schema_a, \ + pack_t schema_b, \ + dim_t m, \ + dim_t n, \ + dim_t k, \ + void* alpha, \ + void* a, inc_t cs_a, inc_t is_a, \ + dim_t pd_a, inc_t ps_a, \ + void* b, inc_t rs_b, inc_t is_b, \ + dim_t pd_b, inc_t ps_b, \ + void* beta, \ + void* c, inc_t rs_c, inc_t cs_c, \ + cntx_t* cntx, \ + rntm_t* rntm, \ + thrinfo_t* thread \ + ) \ +{ \ + const num_t dt = PASTEMAC(ch,type); \ +\ + /* Alias some constants to simpler names. */ \ + const dim_t MR = pd_a; \ + const dim_t NR = pd_b; \ + /*const dim_t PACKMR = cs_a;*/ \ + /*const dim_t PACKNR = rs_b;*/ \ +\ + /* Query the context for the micro-kernel address and cast it to its + function pointer type. */ \ + PASTECH(ch,gemm_ukr_ft) \ + gemm_ukr = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx ); \ +\ + /* Temporary C buffer for edge cases. Note that the strides of this + temporary buffer are set so that they match the storage of the + original C matrix. For example, if C is column-stored, ct will be + column-stored as well. */ \ + ctype ct[ BLIS_STACK_BUF_MAX_SIZE \ + / sizeof( ctype ) ] \ + __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \ + const bool_t col_pref = bli_cntx_l3_vir_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \ + const inc_t rs_ct = ( col_pref ? 1 : NR ); \ + const inc_t cs_ct = ( col_pref ? 
MR : 1 ); \ +\ + ctype* restrict zero = PASTEMAC(ch,0); \ + ctype* restrict a_cast = a; \ + ctype* restrict b_cast = b; \ + ctype* restrict c_cast = c; \ + ctype* restrict alpha_cast = alpha; \ + ctype* restrict beta_cast = beta; \ + ctype* restrict b1; \ + ctype* restrict c1; \ +\ + dim_t m_iter, m_left; \ + dim_t n_iter, n_left; \ + dim_t i, j; \ + dim_t m_cur; \ + dim_t n_cur; \ + inc_t rstep_a; \ + inc_t cstep_b; \ + inc_t rstep_c, cstep_c; \ + auxinfo_t aux; \ +\ + /* + Assumptions/assertions: + rs_a == 1 + cs_a == PACKMR + pd_a == MR + ps_a == stride to next micro-panel of A + rs_b == PACKNR + cs_b == 1 + pd_b == NR + ps_b == stride to next micro-panel of B + rs_c == (no assumptions) + cs_c == (no assumptions) + */ \ +\ + /* If any dimension is zero, return immediately. */ \ + if ( bli_zero_dim3( m, n, k ) ) return; \ +\ + /* Clear the temporary C buffer in case it has any infs or NaNs. */ \ + PASTEMAC(ch,set0s_mxn)( MR, NR, \ + ct, rs_ct, cs_ct ); \ +\ + /* Compute number of primary and leftover components of the m and n + dimensions. */ \ + n_iter = n / NR; \ + n_left = n % NR; \ +\ + m_iter = m / MR; \ + m_left = m % MR; \ +\ + if ( n_left ) ++n_iter; \ + if ( m_left ) ++m_iter; \ +\ + /* Determine some increments used to step through A, B, and C. */ \ + rstep_a = ps_a; \ +\ + cstep_b = ps_b; \ +\ + rstep_c = rs_c * MR; \ + cstep_c = cs_c * NR; \ +\ + /* Save the pack schemas of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_schema_a( schema_a, &aux ); \ + bli_auxinfo_set_schema_b( schema_b, &aux ); \ +\ + /* Save the imaginary stride of A and B to the auxinfo_t object. */ \ + bli_auxinfo_set_is_a( is_a, &aux ); \ + bli_auxinfo_set_is_b( is_b, &aux ); \ +\ + /* The 'thread' argument points to the thrinfo_t node for the 2nd (jr) + loop around the microkernel. Here we query the thrinfo_t node for the + 1st (ir) loop around the microkernel. 
*/ \ + thrinfo_t* caucus = bli_thrinfo_sub_node( thread ); \ +\ + /* Query the number of threads and thread ids for each loop. */ \ + dim_t jr_nt = bli_thread_n_way( thread ); \ + dim_t jr_tid = bli_thread_work_id( thread ); \ + dim_t ir_nt = bli_thread_n_way( caucus ); \ + dim_t ir_tid = bli_thread_work_id( caucus ); \ +\ + dim_t jr_start, jr_end; \ + dim_t ir_start, ir_end; \ + dim_t jr_inc, ir_inc; \ +\ + /* Determine the thread range and increment for each thrinfo_t node. */ \ + bli_thread_range_jrir_sl( thread, n_iter, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \ + bli_thread_range_jrir_sl( caucus, m_iter, 1, FALSE, &ir_start, &ir_end, &ir_inc ); \ +\ + /* Loop over the n dimension (NR columns at a time). */ \ + for ( j = jr_start; j < jr_end; j += jr_inc ) \ + { \ + ctype* restrict a1; \ + ctype* restrict c11; \ + ctype* restrict b2; \ +\ + b1 = b_cast + j * cstep_b; \ + c1 = c_cast + j * cstep_c; \ +\ + n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \ +\ + /* Initialize our next panel of B to be the current panel of B. */ \ + b2 = b1; \ +\ + /* Loop over the m dimension (MR rows at a time). */ \ + for ( i = ir_start; i < ir_end; i += ir_inc ) \ + { \ + ctype* restrict a2; \ +\ + a1 = a_cast + i * rstep_a; \ + c11 = c1 + i * rstep_c; \ +\ + m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \ +\ + /* Compute the addresses of the next panels of A and B. */ \ + a2 = bli_gemm_get_next_a_upanel( a1, rstep_a, ir_inc ); \ + if ( bli_is_last_iter_sl( i, ir_end, ir_tid, ir_nt ) ) \ + { \ + a2 = a_cast; \ + b2 = bli_gemm_get_next_b_upanel( b1, cstep_b, jr_inc ); \ + if ( bli_is_last_iter_sl( j, jr_end, jr_tid, jr_nt ) ) \ + b2 = b_cast; \ + } \ +\ + /* Save addresses of next panels of A and B to the auxinfo_t + object. */ \ + bli_auxinfo_set_next_a( a2, &aux ); \ + bli_auxinfo_set_next_b( b2, &aux ); \ +\ + /* Handle interior and edge cases separately. */ \ + if ( m_cur == MR && n_cur == NR ) \ + { \ + /* Invoke the gemm micro-kernel. 
*/ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + beta_cast, \ + c11, rs_c, cs_c, \ + &aux, \ + cntx \ + ); \ + } \ + else \ + { \ + /* Invoke the gemm micro-kernel. */ \ + gemm_ukr \ + ( \ + k, \ + alpha_cast, \ + a1, \ + b1, \ + zero, \ + ct, rs_ct, cs_ct, \ + &aux, \ + cntx \ + ); \ +\ + /* Scale the bottom edge of C and add the result from above. */ \ + PASTEMAC(ch,xpbys_mxn)( m_cur, n_cur, \ + ct, rs_ct, cs_ct, \ + beta_cast, \ + c11, rs_c, cs_c ); \ + } \ + } \ + } \ +\ +/* +PASTEMAC(ch,fprintm)( stdout, "gemm_ker_var2: b1", k, NR, b1, NR, 1, "%4.1f", "" ); \ +PASTEMAC(ch,fprintm)( stdout, "gemm_ker_var2: a1", MR, k, a1, 1, MR, "%4.1f", "" ); \ +PASTEMAC(ch,fprintm)( stdout, "gemm_ker_var2: c after", m_cur, n_cur, c11, rs_c, cs_c, "%4.1f", "" ); \ +*/ \ +} + +#if 0 +GENTFUNC( float, s, gemm_ker_var2sl ) +GENTFUNC( double, d, gemm_ker_var2sl ) +GENTFUNC( scomplex, c, gemm_ker_var2sl ) +GENTFUNC( dcomplex, z, gemm_ker_var2sl ) +#else +INSERT_GENTFUNC_BASIC0( gemm_ker_var2sl ) +#endif + diff --git a/sandbox/ref99/vars/blx_gemm_var.h b/sandbox/ref99/vars/blx_gemm_var.h index 22911eda2..0ba824f8c 100644 --- a/sandbox/ref99/vars/blx_gemm_var.h +++ b/sandbox/ref99/vars/blx_gemm_var.h @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -57,7 +58,8 @@ GENPROT( gemm_blk_var3 ) GENPROT( gemm_packa ) GENPROT( gemm_packb ) -GENPROT( gemm_ker_var2 ) +GENPROT( gemm_ker_var2sl ) +GENPROT( gemm_ker_var2rr ) // // Prototype BLAS-like interfaces with void pointer operands. 
@@ -85,5 +87,6 @@ void PASTECH2(blx_,ch,varname) \ thrinfo_t* thread \ ); -INSERT_GENTPROT_BASIC0( gemm_ker_var2 ) +INSERT_GENTPROT_BASIC0( gemm_ker_var2sl ) +INSERT_GENTPROT_BASIC0( gemm_ker_var2rr ) diff --git a/test/3m4m/Makefile b/test/3m4m/Makefile index 3c2a52124..e91b100b2 100644 --- a/test/3m4m/Makefile +++ b/test/3m4m/Makefile @@ -5,6 +5,7 @@ # libraries. # # Copyright (C) 2014, The University of Texas at Austin +# Copyright (C) 2018, Advanced Micro Devices, Inc. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions are @@ -200,13 +201,13 @@ STR_ST := -DTHR_STR=\"st\" STR_MT := -DTHR_STR=\"mt\" # Problem size specification -PDEF_ST := -DP_BEGIN=40 \ +PDEF_ST := -DP_BEGIN=96 \ -DP_END=2000 \ - -DP_INC=40 + -DP_INC=96 -PDEF_MT := -DP_BEGIN=200 \ - -DP_END=10000 \ - -DP_INC=200 +PDEF_MT := -DP_BEGIN=192 \ + -DP_END=3000 \ + -DP_INC=192 @@ -226,9 +227,6 @@ all-mt: blis-mt openblas-mt mkl-mt blis-st: blis-gemm-st blis-mt: blis-gemm-mt -blis-nat-st: blis-gemm-nat-st -blis-nat-mt: blis-gemm-nat-mt - openblas-st: openblas-gemm-st openblas-mt: openblas-gemm-mt @@ -240,6 +238,42 @@ blis-gemm-st: blis-gemm-nat-st \ blis-gemm-mt: blis-gemm-nat-mt \ blis-gemm-ind-mt +blis-nat-st: \ + test_sgemm_asm_blis_st.x \ + test_dgemm_asm_blis_st.x \ + test_cgemm_asm_blis_st.x \ + test_zgemm_asm_blis_st.x \ + test_sherk_asm_blis_st.x \ + test_dherk_asm_blis_st.x \ + test_cherk_asm_blis_st.x \ + test_zherk_asm_blis_st.x \ + test_strmm_asm_blis_st.x \ + test_dtrmm_asm_blis_st.x \ + test_ctrmm_asm_blis_st.x \ + test_ztrmm_asm_blis_st.x \ + test_strsm_asm_blis_st.x \ + test_dtrsm_asm_blis_st.x \ + test_ctrsm_asm_blis_st.x \ + test_ztrsm_asm_blis_st.x + +blis-nat-mt: \ + test_sgemm_asm_blis_mt.x \ + test_dgemm_asm_blis_mt.x \ + test_cgemm_asm_blis_mt.x \ + test_zgemm_asm_blis_mt.x \ + test_sherk_asm_blis_mt.x \ + test_dherk_asm_blis_mt.x \ + test_cherk_asm_blis_mt.x \ + test_zherk_asm_blis_mt.x \ + 
test_strmm_asm_blis_mt.x \ + test_dtrmm_asm_blis_mt.x \ + test_ctrmm_asm_blis_mt.x \ + test_ztrmm_asm_blis_mt.x \ + test_strsm_asm_blis_mt.x \ + test_dtrsm_asm_blis_mt.x \ + test_ctrsm_asm_blis_mt.x \ + test_ztrsm_asm_blis_mt.x + blis-gemm-nat-st: \ test_sgemm_asm_blis_st.x \ test_dgemm_asm_blis_st.x \ @@ -390,28 +424,28 @@ test_c%_1m_blis_mt.o: test_%.c $(CC) $(CFLAGS) $(PDEF_MT) $(DT_C) $(BLI_DEF) $(D1M) $(STR_1M) $(STR_MT) -c $< -o $@ # blis asm -test_d%_asm_blis_st.o: test_%.c +test_d%_asm_blis_st.o: test_%.c Makefile $(CC) $(CFLAGS) $(PDEF_ST) $(DT_D) $(BLI_DEF) $(DNAT) $(STR_NAT) $(STR_ST) -c $< -o $@ -test_s%_asm_blis_st.o: test_%.c +test_s%_asm_blis_st.o: test_%.c Makefile $(CC) $(CFLAGS) $(PDEF_ST) $(DT_S) $(BLI_DEF) $(DNAT) $(STR_NAT) $(STR_ST) -c $< -o $@ -test_z%_asm_blis_st.o: test_%.c +test_z%_asm_blis_st.o: test_%.c Makefile $(CC) $(CFLAGS) $(PDEF_ST) $(DT_Z) $(BLI_DEF) $(DNAT) $(STR_NAT) $(STR_ST) -c $< -o $@ -test_c%_asm_blis_st.o: test_%.c +test_c%_asm_blis_st.o: test_%.c Makefile $(CC) $(CFLAGS) $(PDEF_ST) $(DT_C) $(BLI_DEF) $(DNAT) $(STR_NAT) $(STR_ST) -c $< -o $@ -test_d%_asm_blis_mt.o: test_%.c +test_d%_asm_blis_mt.o: test_%.c Makefile $(CC) $(CFLAGS) $(PDEF_MT) $(DT_D) $(BLI_DEF) $(DNAT) $(STR_NAT) $(STR_MT) -c $< -o $@ -test_s%_asm_blis_mt.o: test_%.c +test_s%_asm_blis_mt.o: test_%.c Makefile $(CC) $(CFLAGS) $(PDEF_MT) $(DT_S) $(BLI_DEF) $(DNAT) $(STR_NAT) $(STR_MT) -c $< -o $@ -test_z%_asm_blis_mt.o: test_%.c +test_z%_asm_blis_mt.o: test_%.c Makefile $(CC) $(CFLAGS) $(PDEF_MT) $(DT_Z) $(BLI_DEF) $(DNAT) $(STR_NAT) $(STR_MT) -c $< -o $@ -test_c%_asm_blis_mt.o: test_%.c +test_c%_asm_blis_mt.o: test_%.c Makefile $(CC) $(CFLAGS) $(PDEF_MT) $(DT_C) $(BLI_DEF) $(DNAT) $(STR_NAT) $(STR_MT) -c $< -o $@ # openblas diff --git a/test/3m4m/test_herk.c b/test/3m4m/test_herk.c new file mode 100644 index 000000000..66a057a59 --- /dev/null +++ b/test/3m4m/test_herk.c @@ -0,0 +1,314 @@ +/* + + BLIS + An object-based framework for developing high-performance 
BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas nor the names of its + contributors may be used to endorse or promote products derived + from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +*/ + +#include <unistd.h> +#include "blis.h" + + +//#define PRINT + +int main( int argc, char** argv ) +{ + obj_t a, c; + obj_t c_save; + obj_t alpha, beta; + dim_t m, k; + dim_t p; + dim_t p_begin, p_end, p_inc; + int m_input, k_input; + ind_t ind; + num_t dt, dt_real; + char dt_ch; + int r, n_repeats; + uplo_t uploc; + trans_t transa; + f77_char f77_uploc; + f77_char f77_transa; + + double dtime; + double dtime_save; + double gflops; + + //bli_init(); + + //bli_error_checking_level_set( BLIS_NO_ERROR_CHECKING ); + + n_repeats = 3; + + dt = DT; + dt_real = bli_dt_proj_to_real( DT ); + + ind = IND; + + p_begin = P_BEGIN; + p_end = P_END; + p_inc = P_INC; + + m_input = -1; + k_input = -1; + + + // Suppress compiler warnings about unused variable 'ind'. + ( void )ind; + +#if 0 + + cntx_t* cntx; + + ind_t ind_mod = ind; + + // A hack to use 3m1 as 1mpb (with 1m as 1mbp). + if ( ind == BLIS_3M1 ) ind_mod = BLIS_1M; + + // Initialize a context for the current induced method and datatype. + cntx = bli_gks_query_ind_cntx( ind_mod, dt ); + + // Set k to the kc blocksize for the current datatype. + k_input = bli_cntx_get_blksz_def_dt( dt, BLIS_KC, cntx ); + +#elif 1 + + //k_input = 256; + +#endif + + // Choose the char corresponding to the requested datatype. + if ( bli_is_float( dt ) ) dt_ch = 's'; + else if ( bli_is_double( dt ) ) dt_ch = 'd'; + else if ( bli_is_scomplex( dt ) ) dt_ch = 'c'; + else dt_ch = 'z'; + + uploc = BLIS_LOWER; + transa = BLIS_NO_TRANSPOSE; + + bli_param_map_blis_to_netlib_uplo( uploc, &f77_uploc ); + bli_param_map_blis_to_netlib_trans( transa, &f77_transa ); + + // Begin with initializing the last entry to zero so that + // matlab allocates space for the entire array once up-front.
+ for ( p = p_begin; p + p_inc <= p_end; p += p_inc ) ; +#ifdef BLIS + printf( "data_%s_%cherk_%s_blis", THR_STR, dt_ch, STR ); +#else + printf( "data_%s_%cherk_%s", THR_STR, dt_ch, STR ); +#endif + printf( "( %2lu, 1:4 ) = [ %4lu %4lu %7.2f ];\n", + ( unsigned long )(p - p_begin + 1)/p_inc + 1, + ( unsigned long )0, + ( unsigned long )0, 0.0 ); + + + for ( p = p_begin; p <= p_end; p += p_inc ) + { + + if ( m_input < 0 ) m = p / ( dim_t )abs(m_input); + else m = ( dim_t ) m_input; + if ( k_input < 0 ) k = p / ( dim_t )abs(k_input); + else k = ( dim_t ) k_input; + + bli_obj_create( dt_real, 1, 1, 0, 0, &alpha ); + bli_obj_create( dt, 1, 1, 0, 0, &beta ); + + if ( bli_does_trans( transa ) ) + bli_obj_create( dt, k, m, 0, 0, &a ); + else + bli_obj_create( dt, m, k, 0, 0, &a ); + bli_obj_create( dt, m, m, 0, 0, &c ); + //bli_obj_create( dt, m, k, 2, 2*m, &a ); + //bli_obj_create( dt, k, n, 2, 2*k, &b ); + //bli_obj_create( dt, m, n, 2, 2*m, &c ); + bli_obj_create( dt, m, m, 0, 0, &c_save ); + + bli_randm( &a ); + bli_randm( &c ); + + bli_obj_set_struc( BLIS_HERMITIAN, &c ); + bli_obj_set_uplo( uploc, &c ); + + bli_obj_set_conjtrans( transa, &a ); + + bli_setsc( (2.0/1.0), 0.0, &alpha ); + bli_setsc( (1.0/1.0), 0.0, &beta ); + + + bli_copym( &c, &c_save ); + +#ifdef BLIS + bli_ind_disable_all_dt( dt ); + bli_ind_enable_dt( ind, dt ); +#endif + + dtime_save = DBL_MAX; + + for ( r = 0; r < n_repeats; ++r ) + { + bli_copym( &c_save, &c ); + + + dtime = bli_clock(); + + +#ifdef PRINT + bli_printm( "a", &a, "%4.1f", "" ); + bli_printm( "c", &c, "%4.1f", "" ); +#endif + +#ifdef BLIS + + bli_herk( &alpha, + &a, + &beta, + &c ); + +#else + + if ( bli_is_float( dt ) ) + { + f77_int mm = bli_obj_length( &c ); + f77_int kk = bli_obj_width_after_trans( &a ); + f77_int lda = bli_obj_col_stride( &a ); + f77_int ldc = bli_obj_col_stride( &c ); + float* alphap = bli_obj_buffer( &alpha ); + float* ap = bli_obj_buffer( &a ); + float* betap = bli_obj_buffer( &beta ); + float* cp = 
bli_obj_buffer( &c ); + + ssyrk_( &f77_uploc, + &f77_transa, + &mm, + &kk, + alphap, + ap, &lda, + betap, + cp, &ldc ); + } + else if ( bli_is_double( dt ) ) + { + f77_int mm = bli_obj_length( &c ); + f77_int kk = bli_obj_width_after_trans( &a ); + f77_int lda = bli_obj_col_stride( &a ); + f77_int ldc = bli_obj_col_stride( &c ); + double* alphap = bli_obj_buffer( &alpha ); + double* ap = bli_obj_buffer( &a ); + double* betap = bli_obj_buffer( &beta ); + double* cp = bli_obj_buffer( &c ); + + dsyrk_( &f77_uploc, + &f77_transa, + &mm, + &kk, + alphap, + ap, &lda, + betap, + cp, &ldc ); + } + else if ( bli_is_scomplex( dt ) ) + { + f77_int mm = bli_obj_length( &c ); + f77_int kk = bli_obj_width_after_trans( &a ); + f77_int lda = bli_obj_col_stride( &a ); + f77_int ldc = bli_obj_col_stride( &c ); + float* alphap = bli_obj_buffer( &alpha ); + scomplex* ap = bli_obj_buffer( &a ); + scomplex* betap = bli_obj_buffer( &beta ); + scomplex* cp = bli_obj_buffer( &c ); + + cherk_( &f77_uploc, + &f77_transa, + &mm, + &kk, + alphap, + ap, &lda, + betap, + cp, &ldc ); + } + else if ( bli_is_dcomplex( dt ) ) + { + f77_int mm = bli_obj_length( &c ); + f77_int kk = bli_obj_width_after_trans( &a ); + f77_int lda = bli_obj_col_stride( &a ); + f77_int ldc = bli_obj_col_stride( &c ); + double* alphap = bli_obj_buffer( &alpha ); + dcomplex* ap = bli_obj_buffer( &a ); + dcomplex* betap = bli_obj_buffer( &beta ); + dcomplex* cp = bli_obj_buffer( &c ); + + zherk_( &f77_uploc, + &f77_transa, + &mm, + &kk, + alphap, + ap, &lda, + betap, + cp, &ldc ); + } +#endif + +#ifdef PRINT + bli_printm( "c after", &c, "%4.1f", "" ); + exit(1); +#endif + + + dtime_save = bli_clock_min_diff( dtime_save, dtime ); + } + + gflops = ( 1.0 * m * k * m ) / ( dtime_save * 1.0e9 ); + + if ( bli_is_complex( dt ) ) gflops *= 4.0; + +#ifdef BLIS + printf( "data_%s_%cherk_%s_blis", THR_STR, dt_ch, STR ); +#else + printf( "data_%s_%cherk_%s", THR_STR, dt_ch, STR ); +#endif + printf( "( %2lu, 1:4 ) = [ %4lu %4lu %7.2f 
];\n", + ( unsigned long )(p - p_begin + 1)/p_inc + 1, + ( unsigned long )m, + ( unsigned long )k, gflops ); + + bli_obj_free( &alpha ); + bli_obj_free( &beta ); + + bli_obj_free( &a ); + bli_obj_free( &c ); + bli_obj_free( &c_save ); + } + + //bli_finalize(); + + return 0; +} + diff --git a/test/3m4m/test_trmm.c b/test/3m4m/test_trmm.c new file mode 100644 index 000000000..06ed38539 --- /dev/null +++ b/test/3m4m/test_trmm.c @@ -0,0 +1,328 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas nor the names of its + contributors may be used to endorse or promote products derived + from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +*/ + +#include <unistd.h> +#include "blis.h" + + +//#define PRINT + +int main( int argc, char** argv ) +{ + obj_t a, c; + obj_t c_save; + obj_t alpha; + dim_t m, n; + dim_t p; + dim_t p_begin, p_end, p_inc; + int m_input, n_input; + ind_t ind; + num_t dt; + char dt_ch; + int r, n_repeats; + side_t side; + uplo_t uploa; + trans_t transa; + diag_t diaga; + f77_char f77_side; + f77_char f77_uploa; + f77_char f77_transa; + f77_char f77_diaga; + + double dtime; + double dtime_save; + double gflops; + + //bli_init(); + + //bli_error_checking_level_set( BLIS_NO_ERROR_CHECKING ); + + n_repeats = 3; + + dt = DT; + + ind = IND; + + p_begin = P_BEGIN; + p_end = P_END; + p_inc = P_INC; + + m_input = -1; + n_input = -1; + + + // Suppress compiler warnings about unused variable 'ind'. + ( void )ind; + +#if 0 + + cntx_t* cntx; + + ind_t ind_mod = ind; + + // A hack to use 3m1 as 1mpb (with 1m as 1mbp). + if ( ind == BLIS_3M1 ) ind_mod = BLIS_1M; + + // Initialize a context for the current induced method and datatype. + cntx = bli_gks_query_ind_cntx( ind_mod, dt ); + + // Set k to the kc blocksize for the current datatype. + k_input = bli_cntx_get_blksz_def_dt( dt, BLIS_KC, cntx ); + +#elif 1 + + //k_input = 256; + +#endif + + // Choose the char corresponding to the requested datatype.
+ if ( bli_is_float( dt ) ) dt_ch = 's'; + else if ( bli_is_double( dt ) ) dt_ch = 'd'; + else if ( bli_is_scomplex( dt ) ) dt_ch = 'c'; + else dt_ch = 'z'; + +#if 0 + side = BLIS_LEFT; +#else + side = BLIS_RIGHT; +#endif +#if 0 + uploa = BLIS_LOWER; +#else + uploa = BLIS_UPPER; +#endif + transa = BLIS_NO_TRANSPOSE; + diaga = BLIS_NONUNIT_DIAG; + + bli_param_map_blis_to_netlib_side( side, &f77_side ); + bli_param_map_blis_to_netlib_uplo( uploa, &f77_uploa ); + bli_param_map_blis_to_netlib_trans( transa, &f77_transa ); + bli_param_map_blis_to_netlib_diag( diaga, &f77_diaga ); + + // Begin with initializing the last entry to zero so that + // matlab allocates space for the entire array once up-front. + for ( p = p_begin; p + p_inc <= p_end; p += p_inc ) ; +#ifdef BLIS + printf( "data_%s_%ctrmm_%s_blis", THR_STR, dt_ch, STR ); +#else + printf( "data_%s_%ctrmm_%s", THR_STR, dt_ch, STR ); +#endif + printf( "( %2lu, 1:4 ) = [ %4lu %4lu %7.2f ];\n", + ( unsigned long )(p - p_begin + 1)/p_inc + 1, + ( unsigned long )0, + ( unsigned long )0, 0.0 ); + + + for ( p = p_begin; p <= p_end; p += p_inc ) + { + + if ( m_input < 0 ) m = p / ( dim_t )abs(m_input); + else m = ( dim_t ) m_input; + if ( n_input < 0 ) n = p / ( dim_t )abs(n_input); + else n = ( dim_t ) n_input; + + bli_obj_create( dt, 1, 1, 0, 0, &alpha ); + + if ( bli_does_trans( side ) ) + bli_obj_create( dt, m, m, 0, 0, &a ); + else + bli_obj_create( dt, n, n, 0, 0, &a ); + bli_obj_create( dt, m, n, 0, 0, &c ); + bli_obj_create( dt, m, n, 0, 0, &c_save ); + + bli_randm( &a ); + bli_randm( &c ); + + bli_obj_set_struc( BLIS_TRIANGULAR, &a ); + bli_obj_set_uplo( uploa, &a ); + bli_obj_set_conjtrans( transa, &a ); + bli_obj_set_diag( diaga, &a ); + + bli_randm( &a ); + bli_mktrim( &a ); + + bli_setsc( (2.0/1.0), 0.0, &alpha ); + + bli_copym( &c, &c_save ); + +#ifdef BLIS + bli_ind_disable_all_dt( dt ); + bli_ind_enable_dt( ind, dt ); +#endif + + dtime_save = DBL_MAX; + + for ( r = 0; r < n_repeats; ++r ) + { + bli_copym( 
&c_save, &c ); + + + dtime = bli_clock(); + + +#ifdef PRINT + bli_printm( "a", &a, "%4.1f", "" ); + bli_printm( "c", &c, "%4.1f", "" ); +#endif + +#ifdef BLIS + + bli_trmm( side, + &alpha, + &a, + &c ); + +#else + + if ( bli_is_float( dt ) ) + { + f77_int mm = bli_obj_length( &c ); + f77_int kk = bli_obj_width( &c ); + f77_int lda = bli_obj_col_stride( &a ); + f77_int ldc = bli_obj_col_stride( &c ); + float* alphap = bli_obj_buffer( &alpha ); + float* ap = bli_obj_buffer( &a ); + float* cp = bli_obj_buffer( &c ); + + strmm_( &f77_side, + &f77_uploa, + &f77_transa, + &f77_diaga, + &mm, + &kk, + alphap, + ap, &lda, + cp, &ldc ); + } + else if ( bli_is_double( dt ) ) + { + f77_int mm = bli_obj_length( &c ); + f77_int kk = bli_obj_width( &c ); + f77_int lda = bli_obj_col_stride( &a ); + f77_int ldc = bli_obj_col_stride( &c ); + double* alphap = bli_obj_buffer( &alpha ); + double* ap = bli_obj_buffer( &a ); + double* cp = bli_obj_buffer( &c ); + + dtrmm_( &f77_side, + &f77_uploa, + &f77_transa, + &f77_diaga, + &mm, + &kk, + alphap, + ap, &lda, + cp, &ldc ); + } + else if ( bli_is_scomplex( dt ) ) + { + f77_int mm = bli_obj_length( &c ); + f77_int kk = bli_obj_width( &c ); + f77_int lda = bli_obj_col_stride( &a ); + f77_int ldc = bli_obj_col_stride( &c ); + scomplex* alphap = bli_obj_buffer( &alpha ); + scomplex* ap = bli_obj_buffer( &a ); + scomplex* cp = bli_obj_buffer( &c ); + + ctrmm_( &f77_side, + &f77_uploa, + &f77_transa, + &f77_diaga, + &mm, + &kk, + alphap, + ap, &lda, + cp, &ldc ); + } + else if ( bli_is_dcomplex( dt ) ) + { + f77_int mm = bli_obj_length( &c ); + f77_int kk = bli_obj_width( &c ); + f77_int lda = bli_obj_col_stride( &a ); + f77_int ldc = bli_obj_col_stride( &c ); + dcomplex* alphap = bli_obj_buffer( &alpha ); + dcomplex* ap = bli_obj_buffer( &a ); + dcomplex* cp = bli_obj_buffer( &c ); + + ztrmm_( &f77_side, + &f77_uploa, + &f77_transa, + &f77_diaga, + &mm, + &kk, + alphap, + ap, &lda, + cp, &ldc ); + } +#endif + +#ifdef PRINT + bli_printm( "c 
after", &c, "%4.1f", "" ); + exit(1); +#endif + + + dtime_save = bli_clock_min_diff( dtime_save, dtime ); + } + + if ( bli_is_left( side ) ) + gflops = ( 1.0 * m * m * n ) / ( dtime_save * 1.0e9 ); + else + gflops = ( 1.0 * m * n * n ) / ( dtime_save * 1.0e9 ); + + if ( bli_is_complex( dt ) ) gflops *= 4.0; + +#ifdef BLIS + printf( "data_%s_%ctrmm_%s_blis", THR_STR, dt_ch, STR ); +#else + printf( "data_%s_%ctrmm_%s", THR_STR, dt_ch, STR ); +#endif + printf( "( %2lu, 1:4 ) = [ %4lu %4lu %7.2f ];\n", + ( unsigned long )(p - p_begin + 1)/p_inc + 1, + ( unsigned long )m, + ( unsigned long )n, gflops ); + + bli_obj_free( &alpha ); + + bli_obj_free( &a ); + bli_obj_free( &c ); + bli_obj_free( &c_save ); + } + + //bli_finalize(); + + return 0; +} + diff --git a/test/3m4m/test_trsm.c b/test/3m4m/test_trsm.c new file mode 100644 index 000000000..f417a5361 --- /dev/null +++ b/test/3m4m/test_trsm.c @@ -0,0 +1,338 @@ +/* + + BLIS + An object-based framework for developing high-performance BLAS-like + libraries. + + Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + - Neither the name of The University of Texas nor the names of its + contributors may be used to endorse or promote products derived + from this software without specific prior written permission. 
+ + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +*/ + +#include <unistd.h> +#include "blis.h" + + +//#define PRINT + +int main( int argc, char** argv ) +{ + obj_t a, c, d; + obj_t c_save; + obj_t alpha; + dim_t m, n; + dim_t p; + dim_t p_begin, p_end, p_inc; + int m_input, n_input; + ind_t ind; + num_t dt; + char dt_ch; + int r, n_repeats; + side_t side; + uplo_t uploa; + trans_t transa; + diag_t diaga; + f77_char f77_side; + f77_char f77_uploa; + f77_char f77_transa; + f77_char f77_diaga; + + double dtime; + double dtime_save; + double gflops; + + //bli_init(); + + //bli_error_checking_level_set( BLIS_NO_ERROR_CHECKING ); + + n_repeats = 3; + + dt = DT; + + ind = IND; + + p_begin = P_BEGIN; + p_end = P_END; + p_inc = P_INC; + + m_input = -1; + n_input = -1; + + + // Suppress compiler warnings about unused variable 'ind'. + ( void )ind; + +#if 0 + + cntx_t* cntx; + + ind_t ind_mod = ind; + + // A hack to use 3m1 as 1mpb (with 1m as 1mbp). + if ( ind == BLIS_3M1 ) ind_mod = BLIS_1M; + + // Initialize a context for the current induced method and datatype. + cntx = bli_gks_query_ind_cntx( ind_mod, dt ); + + // Set k to the kc blocksize for the current datatype.
+ k_input = bli_cntx_get_blksz_def_dt( dt, BLIS_KC, cntx ); + +#elif 1 + + //k_input = 256; + +#endif + + // Choose the char corresponding to the requested datatype. + if ( bli_is_float( dt ) ) dt_ch = 's'; + else if ( bli_is_double( dt ) ) dt_ch = 'd'; + else if ( bli_is_scomplex( dt ) ) dt_ch = 'c'; + else dt_ch = 'z'; + +#if 0 + side = BLIS_LEFT; +#else + side = BLIS_RIGHT; +#endif +#if 0 + uploa = BLIS_LOWER; +#else + uploa = BLIS_UPPER; +#endif + transa = BLIS_NO_TRANSPOSE; + diaga = BLIS_NONUNIT_DIAG; + + bli_param_map_blis_to_netlib_side( side, &f77_side ); + bli_param_map_blis_to_netlib_uplo( uploa, &f77_uploa ); + bli_param_map_blis_to_netlib_trans( transa, &f77_transa ); + bli_param_map_blis_to_netlib_diag( diaga, &f77_diaga ); + + // Begin with initializing the last entry to zero so that + // matlab allocates space for the entire array once up-front. + for ( p = p_begin; p + p_inc <= p_end; p += p_inc ) ; +#ifdef BLIS + printf( "data_%s_%ctrsm_%s_blis", THR_STR, dt_ch, STR ); +#else + printf( "data_%s_%ctrsm_%s", THR_STR, dt_ch, STR ); +#endif + printf( "( %2lu, 1:4 ) = [ %4lu %4lu %7.2f ];\n", + ( unsigned long )(p - p_begin + 1)/p_inc + 1, + ( unsigned long )0, + ( unsigned long )0, 0.0 ); + + + for ( p = p_begin; p <= p_end; p += p_inc ) + { + + if ( m_input < 0 ) m = p / ( dim_t )abs(m_input); + else m = ( dim_t ) m_input; + if ( n_input < 0 ) n = p / ( dim_t )abs(n_input); + else n = ( dim_t ) n_input; + + bli_obj_create( dt, 1, 1, 0, 0, &alpha ); + + if ( bli_does_trans( side ) ) + bli_obj_create( dt, m, m, 0, 0, &a ); + else + bli_obj_create( dt, n, n, 0, 0, &a ); + bli_obj_create( dt, m, n, 0, 0, &c ); + //bli_obj_create( dt, m, n, n, 1, &c ); + bli_obj_create( dt, m, n, 0, 0, &c_save ); + + if ( bli_does_trans( side ) ) + bli_obj_create( dt, m, m, 0, 0, &d ); + else + bli_obj_create( dt, n, n, 0, 0, &d ); + + bli_randm( &a ); + bli_randm( &c ); + + bli_obj_set_struc( BLIS_TRIANGULAR, &a ); + bli_obj_set_uplo( uploa, &a ); + 
bli_obj_set_conjtrans( transa, &a ); + bli_obj_set_diag( diaga, &a ); + + bli_randm( &a ); + bli_mktrim( &a ); + + bli_setd( &BLIS_TWO, &d ); + bli_addd( &d, &a ); + + bli_setsc( (2.0/1.0), 0.0, &alpha ); + + bli_copym( &c, &c_save ); + +#ifdef BLIS + bli_ind_disable_all_dt( dt ); + bli_ind_enable_dt( ind, dt ); +#endif + + dtime_save = DBL_MAX; + + for ( r = 0; r < n_repeats; ++r ) + { + bli_copym( &c_save, &c ); + + + dtime = bli_clock(); + + +#ifdef PRINT + bli_printm( "a", &a, "%4.1f", "" ); + bli_printm( "c", &c, "%4.1f", "" ); +#endif + +#ifdef BLIS + + bli_trsm( side, + &alpha, + &a, + &c ); + +#else + + if ( bli_is_float( dt ) ) + { + f77_int mm = bli_obj_length( &c ); + f77_int kk = bli_obj_width( &c ); + f77_int lda = bli_obj_col_stride( &a ); + f77_int ldc = bli_obj_col_stride( &c ); + float* alphap = bli_obj_buffer( &alpha ); + float* ap = bli_obj_buffer( &a ); + float* cp = bli_obj_buffer( &c ); + + strsm_( &f77_side, + &f77_uploa, + &f77_transa, + &f77_diaga, + &mm, + &kk, + alphap, + ap, &lda, + cp, &ldc ); + } + else if ( bli_is_double( dt ) ) + { + f77_int mm = bli_obj_length( &c ); + f77_int kk = bli_obj_width( &c ); + f77_int lda = bli_obj_col_stride( &a ); + f77_int ldc = bli_obj_col_stride( &c ); + double* alphap = bli_obj_buffer( &alpha ); + double* ap = bli_obj_buffer( &a ); + double* cp = bli_obj_buffer( &c ); + + dtrsm_( &f77_side, + &f77_uploa, + &f77_transa, + &f77_diaga, + &mm, + &kk, + alphap, + ap, &lda, + cp, &ldc ); + } + else if ( bli_is_scomplex( dt ) ) + { + f77_int mm = bli_obj_length( &c ); + f77_int kk = bli_obj_width( &c ); + f77_int lda = bli_obj_col_stride( &a ); + f77_int ldc = bli_obj_col_stride( &c ); + scomplex* alphap = bli_obj_buffer( &alpha ); + scomplex* ap = bli_obj_buffer( &a ); + scomplex* cp = bli_obj_buffer( &c ); + + ctrsm_( &f77_side, + &f77_uploa, + &f77_transa, + &f77_diaga, + &mm, + &kk, + alphap, + ap, &lda, + cp, &ldc ); + } + else if ( bli_is_dcomplex( dt ) ) + { + f77_int mm = bli_obj_length( &c ); + 
f77_int kk = bli_obj_width( &c ); + f77_int lda = bli_obj_col_stride( &a ); + f77_int ldc = bli_obj_col_stride( &c ); + dcomplex* alphap = bli_obj_buffer( &alpha ); + dcomplex* ap = bli_obj_buffer( &a ); + dcomplex* cp = bli_obj_buffer( &c ); + + ztrsm_( &f77_side, + &f77_uploa, + &f77_transa, + &f77_diaga, + &mm, + &kk, + alphap, + ap, &lda, + cp, &ldc ); + } +#endif + +#ifdef PRINT + bli_printm( "c after", &c, "%4.1f", "" ); + exit(1); +#endif + + + dtime_save = bli_clock_min_diff( dtime_save, dtime ); + } + + if ( bli_is_left( side ) ) + gflops = ( 1.0 * m * m * n ) / ( dtime_save * 1.0e9 ); + else + gflops = ( 1.0 * m * n * n ) / ( dtime_save * 1.0e9 ); + + if ( bli_is_complex( dt ) ) gflops *= 4.0; + +#ifdef BLIS + printf( "data_%s_%ctrsm_%s_blis", THR_STR, dt_ch, STR ); +#else + printf( "data_%s_%ctrsm_%s", THR_STR, dt_ch, STR ); +#endif + printf( "( %2lu, 1:4 ) = [ %4lu %4lu %7.2f ];\n", + ( unsigned long )(p - p_begin + 1)/p_inc + 1, + ( unsigned long )m, + ( unsigned long )n, gflops ); + + bli_obj_free( &alpha ); + + bli_obj_free( &a ); + bli_obj_free( &c ); + bli_obj_free( &c_save ); + bli_obj_free( &d ); + } + + //bli_finalize(); + + return 0; +} + diff --git a/test/studies/thunderx2/Makefile b/test/studies/thunderx2/Makefile index b0693e198..28fdcb727 100644 --- a/test/studies/thunderx2/Makefile +++ b/test/studies/thunderx2/Makefile @@ -213,8 +213,8 @@ PDEF_MT := -DP_BEGIN=200 \ # --- Targets/rules ------------------------------------------------------------ # -all-st: blis-st openblas-st mkl-st -all-mt: blis-mt openblas-mt mkl-mt +all-st: blis-st openblas-st armpl-st +all-mt: blis-mt openblas-mt armpl-mt blis-st: blis-gemm-st blis-syrk-st blis-hemm-st blis-trmm-st blis-mt: blis-gemm-mt blis-syrk-mt blis-hemm-mt blis-trmm-mt diff --git a/test/studies/thunderx2/runme.sh b/test/studies/thunderx2/runme.sh index fff08d313..709ff6a35 100755 --- a/test/studies/thunderx2/runme.sh +++ b/test/studies/thunderx2/runme.sh @@ -117,7 +117,10 @@ for nc in ${cores_r}; 
do th="mt" else - export BLIS_NUM_THREADS=1 + export BLIS_JC_NT=1 + export BLIS_IC_NT=1 + export BLIS_JR_NT=1 + export BLIS_IR_NT=1 export OMP_NUM_THREADS=1 out_dir="${out_rootdir}/st" mkdir -p $out_rootdir/st @@ -181,7 +184,10 @@ for nc in ${cores}; do th="mt" else - export BLIS_NUM_THREADS=1 + export BLIS_JC_NT=1 + export BLIS_IC_NT=1 + export BLIS_JR_NT=1 + export BLIS_IR_NT=1 export OMP_NUM_THREADS=1 out_dir="${out_rootdir}/st" th="st" diff --git a/test/thread_ranges/test_ranges.c b/test/thread_ranges/test_ranges.c index 68ffe7fec..9bf293ca5 100644 --- a/test/thread_ranges/test_ranges.c +++ b/test/thread_ranges/test_ranges.c @@ -5,6 +5,7 @@ libraries. Copyright (C) 2014, The University of Texas at Austin + Copyright (C) 2018, Advanced Micro Devices, Inc. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are @@ -290,13 +291,13 @@ int main( int argc, char** argv ) thrinfo.work_id = t; if ( part_n_dim && go_fwd ) - area = bli_thread_get_range_weighted_l2r( &thrinfo, &a, &bfs, &start, &end ); + area = bli_thread_range_weighted_l2r( &thrinfo, &a, &bfs, &start, &end ); else if ( part_n_dim && go_bwd ) - area = bli_thread_get_range_weighted_r2l( &thrinfo, &a, &bfs, &start, &end ); + area = bli_thread_range_weighted_r2l( &thrinfo, &a, &bfs, &start, &end ); else if ( part_m_dim && go_fwd ) - area = bli_thread_get_range_weighted_t2b( &thrinfo, &a, &bfs, &start, &end ); + area = bli_thread_range_weighted_t2b( &thrinfo, &a, &bfs, &start, &end ); else // ( part_m_dim && go_bwd ) - area = bli_thread_get_range_weighted_b2t( &thrinfo, &a, &bfs, &start, &end ); + area = bli_thread_range_weighted_b2t( &thrinfo, &a, &bfs, &start, &end ); width = end - start; diff --git a/testsuite/src/test_libblis.c b/testsuite/src/test_libblis.c index d7f5825be..b97b25358 100644 --- a/testsuite/src/test_libblis.c +++ b/testsuite/src/test_libblis.c @@ -732,19 +732,73 @@ void libblis_test_output_params_struct( 
FILE* os, test_params_t* params ) // If bli_info_get_int_type_size() returns 32 or 64, the size is forced. // Otherwise, the size is chosen automatically. We query the result of // that automatic choice via sizeof(gint_t). -/* - if ( bli_info_get_int_type_size() == 32 || - bli_info_get_int_type_size() == 64 ) - sprintf( int_type_size_str, "%d", ( int )bli_info_get_int_type_size() ); - else - sprintf( int_type_size_str, "%d", ( int )sizeof(gint_t) * 8 ); -*/ if ( bli_info_get_int_type_size() == 32 || bli_info_get_int_type_size() == 64 ) int_type_size = bli_info_get_int_type_size(); else int_type_size = sizeof(gint_t) * 8; + char impl_str[16]; + char jrir_str[16]; + + // Describe the threading implementation. + if ( bli_info_get_enable_openmp() ) sprintf( impl_str, "openmp" ); + else if ( bli_info_get_enable_pthreads() ) sprintf( impl_str, "pthreads" ); + else /* threading disabled */ sprintf( impl_str, "disabled" ); + + // Describe the status of jrir thread partitioning. + if ( bli_info_get_thread_part_jrir_slab() ) sprintf( jrir_str, "slab" ); + else /*bli_info_get_thread_part_jrir_rr()*/ sprintf( jrir_str, "round-robin" ); + + char nt_str[16]; + char jc_nt_str[16]; + char pc_nt_str[16]; + char ic_nt_str[16]; + char jr_nt_str[16]; + char ir_nt_str[16]; + + // Query the number of ways of parallelism per loop (and overall) and + // convert these values into strings, with "unset" being used if the + // value returned was -1 (indicating the environment variable was unset). 
+ dim_t nt = bli_thread_get_num_threads(); + dim_t jc_nt = bli_thread_get_jc_nt(); + dim_t pc_nt = bli_thread_get_pc_nt(); + dim_t ic_nt = bli_thread_get_ic_nt(); + dim_t jr_nt = bli_thread_get_jr_nt(); + dim_t ir_nt = bli_thread_get_ir_nt(); + + if ( nt == -1 ) sprintf( nt_str, "unset" ); + else sprintf( nt_str, "%d", ( int ) nt ); + if ( jc_nt == -1 ) sprintf( jc_nt_str, "unset" ); + else sprintf( jc_nt_str, "%d", ( int )jc_nt ); + if ( pc_nt == -1 ) sprintf( pc_nt_str, "unset" ); + else sprintf( pc_nt_str, "%d", ( int )pc_nt ); + if ( ic_nt == -1 ) sprintf( ic_nt_str, "unset" ); + else sprintf( ic_nt_str, "%d", ( int )ic_nt ); + if ( jr_nt == -1 ) sprintf( jr_nt_str, "unset" ); + else sprintf( jr_nt_str, "%d", ( int )jr_nt ); + if ( ir_nt == -1 ) sprintf( ir_nt_str, "unset" ); + else sprintf( ir_nt_str, "%d", ( int )ir_nt ); + + // Set up rntm_t objects for each of the four families: + // gemm, herk, trmm, trsm. + rntm_t gemm, herk, trmm_l, trmm_r, trsm_l, trsm_r; + dim_t m = 1000, n = 1000, k = 1000; + + bli_thread_init_rntm( &gemm ); + bli_thread_init_rntm( &herk ); + bli_thread_init_rntm( &trmm_l ); + bli_thread_init_rntm( &trmm_r ); + bli_thread_init_rntm( &trsm_l ); + bli_thread_init_rntm( &trsm_r ); + + bli_rntm_set_ways_for_op( BLIS_GEMM, BLIS_LEFT, m, n, k, &gemm ); + bli_rntm_set_ways_for_op( BLIS_HERK, BLIS_LEFT, m, n, k, &herk ); + bli_rntm_set_ways_for_op( BLIS_TRMM, BLIS_LEFT, m, n, k, &trmm_l ); + bli_rntm_set_ways_for_op( BLIS_TRMM, BLIS_RIGHT, m, n, k, &trmm_r ); + bli_rntm_set_ways_for_op( BLIS_TRSM, BLIS_LEFT, m, n, k, &trsm_l ); + bli_rntm_set_ways_for_op( BLIS_TRSM, BLIS_RIGHT, m, n, k, &trsm_r ); + // Output some system parameters. 
libblis_test_fprintf_c( os, "\n" ); libblis_test_fprintf_c( os, "--- BLIS library info -------------------------------------\n" ); @@ -779,12 +833,62 @@ void libblis_test_output_params_struct( FILE* os, test_params_t* params ) libblis_test_fprintf_c( os, "CBLAS compatibility layer \n" ); libblis_test_fprintf_c( os, " enabled? %d\n", ( int )bli_info_get_enable_cblas() ); libblis_test_fprintf_c( os, "\n" ); + libblis_test_fprintf_c( os, "libmemkind \n" ); + libblis_test_fprintf_c( os, " enabled? %d\n", ( int )bli_info_get_enable_memkind() ); + libblis_test_fprintf_c( os, "\n" ); + libblis_test_fprintf_c( os, "gemm sandbox \n" ); + libblis_test_fprintf_c( os, " enabled? %d\n", ( int )bli_info_get_enable_sandbox() ); + libblis_test_fprintf_c( os, "\n" ); libblis_test_fprintf_c( os, "floating-point types s d c z \n" ); libblis_test_fprintf_c( os, " sizes (bytes) %7u %7u %7u %7u\n", sizeof(float), sizeof(double), sizeof(scomplex), sizeof(dcomplex) ); libblis_test_fprintf_c( os, "\n" ); + libblis_test_fprintf_c( os, "\n" ); + libblis_test_fprintf_c( os, "--- BLIS parallelization info ---\n" ); + libblis_test_fprintf_c( os, "\n" ); + libblis_test_fprintf_c( os, "multithreading %s\n", impl_str ); + libblis_test_fprintf_c( os, "\n" ); + libblis_test_fprintf_c( os, "thread auto-factorization \n" ); + libblis_test_fprintf_c( os, " m dim thread ratio %d\n", ( int )BLIS_THREAD_RATIO_M ); + libblis_test_fprintf_c( os, " n dim thread ratio %d\n", ( int )BLIS_THREAD_RATIO_N ); + libblis_test_fprintf_c( os, " jr max threads %d\n", ( int )BLIS_THREAD_MAX_JR ); + libblis_test_fprintf_c( os, " ir max threads %d\n", ( int )BLIS_THREAD_MAX_IR ); + libblis_test_fprintf_c( os, "\n" ); + libblis_test_fprintf_c( os, "ways of parallelism nt jc pc ic jr ir\n" ); + libblis_test_fprintf_c( os, " environment %5s %5s %5s %5s %5s %5s\n", + nt_str, jc_nt_str, pc_nt_str, + ic_nt_str, jr_nt_str, ir_nt_str ); + libblis_test_fprintf_c( os, " gemm (m,n,k=1000) %5d %5d %5d %5d %5d\n", + ( int 
)bli_rntm_jc_ways( &gemm ), ( int )bli_rntm_pc_ways( &gemm ), + ( int )bli_rntm_ic_ways( &gemm ), + ( int )bli_rntm_jr_ways( &gemm ), ( int )bli_rntm_ir_ways( &gemm ) ); + libblis_test_fprintf_c( os, " herk (m,k=1000) %5d %5d %5d %5d %5d\n", + ( int )bli_rntm_jc_ways( &herk ), ( int )bli_rntm_pc_ways( &herk ), + ( int )bli_rntm_ic_ways( &herk ), + ( int )bli_rntm_jr_ways( &herk ), ( int )bli_rntm_ir_ways( &herk ) ); + libblis_test_fprintf_c( os, " trmm_l (m,n=1000) %5d %5d %5d %5d %5d\n", + ( int )bli_rntm_jc_ways( &trmm_l ), ( int )bli_rntm_pc_ways( &trmm_l ), + ( int )bli_rntm_ic_ways( &trmm_l ), + ( int )bli_rntm_jr_ways( &trmm_l ), ( int )bli_rntm_ir_ways( &trmm_l ) ); + libblis_test_fprintf_c( os, " trmm_r (m,n=1000) %5d %5d %5d %5d %5d\n", + ( int )bli_rntm_jc_ways( &trmm_r ), ( int )bli_rntm_pc_ways( &trmm_r ), + ( int )bli_rntm_ic_ways( &trmm_r ), + ( int )bli_rntm_jr_ways( &trmm_r ), ( int )bli_rntm_ir_ways( &trmm_r ) ); + libblis_test_fprintf_c( os, " trsm_l (m,n=1000) %5d %5d %5d %5d %5d\n", + ( int )bli_rntm_jc_ways( &trsm_l ), ( int )bli_rntm_pc_ways( &trsm_l ), + ( int )bli_rntm_ic_ways( &trsm_l ), + ( int )bli_rntm_jr_ways( &trsm_l ), ( int )bli_rntm_ir_ways( &trsm_l ) ); + libblis_test_fprintf_c( os, " trsm_r (m,n=1000) %5d %5d %5d %5d %5d\n", + ( int )bli_rntm_jc_ways( &trsm_r ), ( int )bli_rntm_pc_ways( &trsm_r ), + ( int )bli_rntm_ic_ways( &trsm_r ), + ( int )bli_rntm_jr_ways( &trsm_r ), ( int )bli_rntm_ir_ways( &trsm_r ) ); + libblis_test_fprintf_c( os, "\n" ); + libblis_test_fprintf_c( os, "thread partitioning \n" ); + //libblis_test_fprintf_c( os, " jc/ic loops %s\n", "slab" ); + libblis_test_fprintf_c( os, " jr/ir loops %s\n", jrir_str ); + libblis_test_fprintf_c( os, "\n" ); libblis_test_fprintf_c( os, "\n" ); libblis_test_fprintf_c( os, "--- BLIS default implementations ---\n" ); diff --git a/windows/build/libblis-symbols.def b/windows/build/libblis-symbols.def index 983292b05..13ae1c60c 100644 --- a/windows/build/libblis-symbols.def +++ 
b/windows/build/libblis-symbols.def @@ -1797,19 +1797,19 @@ bli_thread_get_jc_nt bli_thread_get_jr_nt bli_thread_get_num_threads bli_thread_get_pc_nt -bli_thread_get_range_b2t -bli_thread_get_range_l2r -bli_thread_get_range_mdim -bli_thread_get_range_ndim -bli_thread_get_range_r2l -bli_thread_get_range_sub -bli_thread_get_range_t2b -bli_thread_get_range_weighted_b2t -bli_thread_get_range_weighted_l2r -bli_thread_get_range_weighted_r2l -bli_thread_get_range_weighted_sub -bli_thread_get_range_weighted_t2b -bli_thread_get_range_width_l +bli_thread_range_b2t +bli_thread_range_l2r +bli_thread_range_mdim +bli_thread_range_ndim +bli_thread_range_r2l +bli_thread_range_sub +bli_thread_range_t2b +bli_thread_range_weighted_b2t +bli_thread_range_weighted_l2r +bli_thread_range_weighted_r2l +bli_thread_range_weighted_sub +bli_thread_range_weighted_t2b +bli_thread_range_width_l bli_thread_init bli_thread_init_rntm bli_thread_init_rntm_from_env
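A note on the `runme.sh` hunks above: instead of setting `BLIS_NUM_THREADS=1`, the script now pins each loop's ways of parallelism individually via the per-loop environment variables. A minimal sketch of forcing a single-threaded run this way (the `echo` is only illustrative and not part of the script):

```shell
# Pin each of BLIS's parallelized loops to one way of parallelism so the
# library runs single-threaded, independent of any inherited BLIS_NUM_THREADS.
export BLIS_JC_NT=1   # jc (outermost) loop
export BLIS_IC_NT=1   # ic loop
export BLIS_JR_NT=1   # jr loop
export BLIS_IR_NT=1   # ir loop
export OMP_NUM_THREADS=1

echo "ways: ${BLIS_JC_NT} ${BLIS_IC_NT} ${BLIS_JR_NT} ${BLIS_IR_NT}"
```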