Files
blis/kernels/haswell/bli_kernels_haswell.h
Field G. Van Zee 1c719c91a3 Bugfixes, cleanup of sup dgemm ukernels.
Details:
- Fixed a few not-really-bugs:
  - Previously, the d6x8m kernels were still prefetching the next upanel
    of A using MR*rs_a instead of ps_a (same for prefetching of next
    upanel of B in d6x8n kernels using NR*cs_b instead of ps_b). Given
    that the upanels might be packed, using ps_a or ps_b is the correct
    way to compute the prefetch address.
  - Fixed an obscure bug in the rd_d6x8m kernel that, by dumb luck,
    executed as intended even though it was based on faulty pointer
    management. Basically, in the rd_d6x8m kernel, the pointer for B
    (stored in rdx) was loaded only once, outside of the jj loop, and in
    the second iteration its new position was calculated by incrementing
    rdx by the *absolute* offset (four columns), which happened to be the
    same as the relative offset (also four columns) that was needed. It
    worked only because that loop only executed twice. A similar issue
    was fixed in the rd_d6x8n kernels.
- Various cleanups and additions, including:
  - Factored out the loading of rs_c into rdi in rd_d6x8[mn] kernels so
    that it is loaded only once outside of the loops rather than
    multiple times inside the loops.
  - Changed outer loop in rd kernels so that the jump/comparison and
    loop bounds more closely mimic what you'd see in higher-level source
    code. That is, something like:
      for( i = 0; i < 6; i+=3 )
    rather than something like:
      for( i = 0; i <= 3; i+=3 )
  - Switched row-based IO to use byte offsets instead of byte column
    strides (e.g., via the rsi register), which were known to be 8 anyway
    since otherwise that conditional branch wouldn't have executed.
  - Cleaned up and homogenized prefetching a bit.
  - Updated the comments that show the before and after of the
    in-register transpositions.
  - Added comments to column-based IO cases to indicate which columns
    are being accessed/updated.
  - Added rbp register to clobber lists.
  - Removed some dead (commented out) code.
  - Fixed some copy-paste typos in comments in the rv_6x8n kernels.
  - Cleaned up whitespace (including leading ws -> tabs).
  - Moved edge case (non-milli) kernels to their own directory, d6x8,
    and split them into separate files based on the "NR" value of the
    kernels (Mx8, Mx4, Mx2, etc.).
  - Moved config-specific reference Mx1 kernels into their own file
    (e.g. bli_gemmsup_r_haswell_ref_dMx1.c) inside the d6x8 directory.
  - Added rd_dMx1 assembly kernels, which seem marginally faster than
    the corresponding reference kernels.
  - Updated comments in ref_kernels/bli_cntx_ref.c and changed to using
    the row-oriented reference kernels for all storage combos.
2020-06-04 17:21:08 -05:00

/*

   BLIS
   An object-based framework for developing high-performance BLAS-like
   libraries.

   Copyright (C) 2014, The University of Texas at Austin
   Copyright (C) 2019, Advanced Micro Devices, Inc.

   Redistribution and use in source and binary forms, with or without
   modification, are permitted provided that the following conditions are
   met:
    - Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    - Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.
    - Neither the name(s) of the copyright holder(s) nor the names of its
      contributors may be used to endorse or promote products derived
      from this software without specific prior written permission.

   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
   HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

*/
// -- level-3 ------------------------------------------------------------------
// gemm (asm d6x8)
GEMM_UKR_PROT( float, s, gemm_haswell_asm_6x16 )
GEMM_UKR_PROT( double, d, gemm_haswell_asm_6x8 )
GEMM_UKR_PROT( scomplex, c, gemm_haswell_asm_3x8 )
GEMM_UKR_PROT( dcomplex, z, gemm_haswell_asm_3x4 )
// gemm (asm d8x6)
GEMM_UKR_PROT( float, s, gemm_haswell_asm_16x6 )
GEMM_UKR_PROT( double, d, gemm_haswell_asm_8x6 )
GEMM_UKR_PROT( scomplex, c, gemm_haswell_asm_8x3 )
GEMM_UKR_PROT( dcomplex, z, gemm_haswell_asm_4x3 )
// gemmtrsm_l (asm d6x8)
GEMMTRSM_UKR_PROT( float, s, gemmtrsm_l_haswell_asm_6x16 )
GEMMTRSM_UKR_PROT( double, d, gemmtrsm_l_haswell_asm_6x8 )
// gemmtrsm_u (asm d6x8)
GEMMTRSM_UKR_PROT( float, s, gemmtrsm_u_haswell_asm_6x16 )
GEMMTRSM_UKR_PROT( double, d, gemmtrsm_u_haswell_asm_6x8 )
// -- level-3 sup --------------------------------------------------------------
// -- double real --
// gemmsup_r
GEMMSUP_KER_PROT( double, d, gemmsup_r_haswell_ref_6x1 )
GEMMSUP_KER_PROT( double, d, gemmsup_r_haswell_ref_5x1 )
GEMMSUP_KER_PROT( double, d, gemmsup_r_haswell_ref_4x1 )
GEMMSUP_KER_PROT( double, d, gemmsup_r_haswell_ref_3x1 )
GEMMSUP_KER_PROT( double, d, gemmsup_r_haswell_ref_2x1 )
GEMMSUP_KER_PROT( double, d, gemmsup_r_haswell_ref_1x1 )
// gemmsup_rv
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_6x8 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_5x8 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_4x8 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_3x8 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_2x8 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_1x8 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_6x6 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_5x6 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_4x6 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_3x6 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_2x6 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_1x6 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_6x4 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_5x4 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_4x4 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_3x4 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_2x4 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_1x4 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_6x2 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_5x2 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_4x2 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_3x2 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_2x2 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_1x2 )
// gemmsup_rv (mkernel in m dim)
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_6x8m )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_6x6m )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_6x4m )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_6x2m )
// gemmsup_rv (mkernel in n dim)
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_6x8n )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_5x8n )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_4x8n )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_3x8n )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_2x8n )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_1x8n )
// gemmsup_rd
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_6x8 )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_2x8 )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_1x8 )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_6x4 )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_2x4 )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_1x4 )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_6x2 )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_3x2 )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_2x2 )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_1x2 )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_6x1 )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_3x1 )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_2x1 )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_1x1 )
// gemmsup_rd (mkernel in m dim)
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_6x8m )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_6x4m )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_6x2m )
// gemmsup_rd (mkernel in n dim)
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_6x8n )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_3x8n )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_2x8n )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_1x8n )