Details:
- Fixed a few not-really-bugs:
  - Previously, the d6x8m kernels were still prefetching the next upanel
    of A using MR*rs_a instead of ps_a (and likewise, the d6x8n kernels
    prefetched the next upanel of B using NR*cs_b instead of ps_b). Given
    that the upanels might be packed, using ps_a or ps_b is the correct
    way to compute the prefetch address. (See the prefetch sketch after
    this list.)
  - Fixed an obscure bug in the rd_d6x8m kernel that, by dumb luck,
    executed as intended even though its pointer management was faulty.
    In that kernel, the pointer for B (stored in rdx) was loaded only
    once, outside of the jj loop, and in the second iteration its new
    position was calculated by incrementing rdx by the *absolute* offset
    (four columns), which happened to equal the *relative* offset (also
    four columns) that was actually needed. This worked only because the
    loop executed exactly twice. A similar issue was fixed in the
    rd_d6x8n kernels. (See the pointer-update sketch after this list.)
- Various cleanups and additions, including:
  - Factored out the loading of rs_c into rdi in the rd_d6x8[mn] kernels
    so that it is loaded only once, outside of the loops, rather than
    multiple times inside them. (See the hoisting sketch after this
    list.)
  - Changed the outer loop in the rd kernels so that the jump/comparison
    and loop bounds more closely mimic what you'd see in higher-level
    source code. That is, something like:
      for ( i = 0; i < 6; i += 3 )
    rather than something like:
      for ( i = 0; i <= 3; i += 3 )
  - Switched row-based IO to use immediate byte offsets instead of byte
    column strides held in a register (e.g. rsi), since the stride was
    known to be 8 anyway (otherwise that conditional branch would not
    have executed).
  - Cleaned up and homogenized prefetching a bit.
  - Updated the comments that show the before and after of the
    in-register transpositions.
  - Added comments to the column-based IO cases to indicate which
    columns are being accessed/updated.
  - Added the rbp register to clobber lists.
  - Removed some dead (commented-out) code.
  - Fixed some copy-paste typos in comments in the rv_6x8n kernels.
  - Cleaned up whitespace (including converting leading whitespace to
    tabs).
  - Moved the edge-case (non-milli) kernels to their own directory,
    d6x8, and split them into separate files based on the "NR" value of
    the kernels (Mx8, Mx4, Mx2, etc.).
  - Moved the config-specific reference Mx1 kernels into their own file
    (e.g. bli_gemmsup_r_haswell_ref_dMx1.c) inside the d6x8 directory.
  - Added rd_dMx1 assembly kernels, which seem marginally faster than
    the corresponding reference kernels.
  - Updated the comments in ref_kernels/bli_cntx_ref.c and switched to
    using the row-oriented reference kernels for all storage combos.
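
A minimal C sketch of the prefetch fix (the kernels themselves are x86
assembly; the helper name and the use of GCC's __builtin_prefetch here
are illustrative assumptions, not BLIS code):

    #include <stddef.h>

    // Hypothetical helper: prefetch the start of the next upanel of A.
    // MR, rs_a, and ps_a follow the naming used above.
    void prefetch_next_upanel_a( const double* a,
                                 size_t MR, size_t rs_a, size_t ps_a )
    {
        // Old (wrong once A is packed): assumes the next upanel begins
        // MR rows further along in general storage.
        const double* a_next_old = a + MR * rs_a;

        // New (correct): ps_a is the panel stride, i.e. the element
        // distance between the starts of consecutive (possibly packed)
        // upanels.
        const double* a_next = a + ps_a;

        (void)a_next_old;
        __builtin_prefetch( a_next, 0, 3 );  // read access, keep in cache
    }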
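
A plausible C rendering of the rd_d6x8m pointer-update bug (the loop
body, helper, and variable names are hypothetical stand-ins for the
assembly):

    #include <stddef.h>

    void process_four_columns( const double* b );  // stand-in for the loop body

    void jj_loop_sketch( const double* b, size_t cs_b, size_t n_jj )
    {
        const double* b_cur = b;  // plays the role of rdx; loaded once
        for ( size_t jj = 0; jj < n_jj; ++jj )  // n_jj was 2 in this kernel
        {
            process_four_columns( b_cur );

            // Buggy update: adds the *absolute* offset of the next group
            // of four columns (measured from the start of B) to the
            // *current* pointer. With n_jj == 2 the single update adds
            // 4*cs_b, which equals the correct *relative* step, so the
            // kernel behaved as intended anyway.
            b_cur += ( jj + 1 ) * 4 * cs_b;

            // Fix: step relative to the current position instead:
            //   b_cur += 4 * cs_b;
        }
    }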
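
And a C analogue of the rs_c hoisting (in the kernels the "load" is a
mov of rs_c into rdi; all names here are illustrative):

    #include <stddef.h>

    double row_sum_hoisted( const double* c, const size_t* rs_c, size_t m )
    {
        double total = 0.0;
        // Loaded once, outside the loop; previously the kernels reloaded
        // rs_c from memory inside their loops.
        size_t rs = *rs_c;
        for ( size_t i = 0; i < m; ++i )
            total += c[ i * rs ];
        return total;
    }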
158 lines
6.6 KiB
C
/*

   BLIS
   An object-based framework for developing high-performance BLAS-like
   libraries.

   Copyright (C) 2014, The University of Texas at Austin
   Copyright (C) 2019, Advanced Micro Devices, Inc.

   Redistribution and use in source and binary forms, with or without
   modification, are permitted provided that the following conditions are
   met:
    - Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    - Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.
    - Neither the name(s) of the copyright holder(s) nor the names of its
      contributors may be used to endorse or promote products derived
      from this software without specific prior written permission.

   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
   HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

*/

// -- level-3 ------------------------------------------------------------------

// gemm (asm d6x8)
GEMM_UKR_PROT( float, s, gemm_haswell_asm_6x16 )
GEMM_UKR_PROT( double, d, gemm_haswell_asm_6x8 )
GEMM_UKR_PROT( scomplex, c, gemm_haswell_asm_3x8 )
GEMM_UKR_PROT( dcomplex, z, gemm_haswell_asm_3x4 )

// gemm (asm d8x6)
GEMM_UKR_PROT( float, s, gemm_haswell_asm_16x6 )
GEMM_UKR_PROT( double, d, gemm_haswell_asm_8x6 )
GEMM_UKR_PROT( scomplex, c, gemm_haswell_asm_8x3 )
GEMM_UKR_PROT( dcomplex, z, gemm_haswell_asm_4x3 )

// gemmtrsm_l (asm d6x8)
GEMMTRSM_UKR_PROT( float, s, gemmtrsm_l_haswell_asm_6x16 )
GEMMTRSM_UKR_PROT( double, d, gemmtrsm_l_haswell_asm_6x8 )

// gemmtrsm_u (asm d6x8)
GEMMTRSM_UKR_PROT( float, s, gemmtrsm_u_haswell_asm_6x16 )
GEMMTRSM_UKR_PROT( double, d, gemmtrsm_u_haswell_asm_6x8 )


// gemm (asm d8x6)
//GEMM_UKR_PROT( float, s, gemm_haswell_asm_16x6 )
//GEMM_UKR_PROT( double, d, gemm_haswell_asm_8x6 )
//GEMM_UKR_PROT( scomplex, c, gemm_haswell_asm_8x3 )
//GEMM_UKR_PROT( dcomplex, z, gemm_haswell_asm_4x3 )

// -- level-3 sup --------------------------------------------------------------

// -- double real --

// gemmsup_r
GEMMSUP_KER_PROT( double, d, gemmsup_r_haswell_ref_6x1 )
GEMMSUP_KER_PROT( double, d, gemmsup_r_haswell_ref_5x1 )
GEMMSUP_KER_PROT( double, d, gemmsup_r_haswell_ref_4x1 )
GEMMSUP_KER_PROT( double, d, gemmsup_r_haswell_ref_3x1 )
GEMMSUP_KER_PROT( double, d, gemmsup_r_haswell_ref_2x1 )
GEMMSUP_KER_PROT( double, d, gemmsup_r_haswell_ref_1x1 )

// gemmsup_rv
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_6x8 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_5x8 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_4x8 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_3x8 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_2x8 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_1x8 )

GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_6x6 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_5x6 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_4x6 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_3x6 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_2x6 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_1x6 )

GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_6x4 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_5x4 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_4x4 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_3x4 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_2x4 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_1x4 )

GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_6x2 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_5x2 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_4x2 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_3x2 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_2x2 )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_1x2 )

// gemmsup_rv (mkernel in m dim)
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_6x8m )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_6x6m )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_6x4m )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_6x2m )

// gemmsup_rv (mkernel in n dim)
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_6x8n )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_5x8n )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_4x8n )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_3x8n )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_2x8n )
GEMMSUP_KER_PROT( double, d, gemmsup_rv_haswell_asm_1x8n )

// gemmsup_rd
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_6x8 )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_2x8 )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_1x8 )

GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_6x4 )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_2x4 )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_1x4 )

GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_6x2 )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_3x2 )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_2x2 )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_1x2 )

GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_6x1 )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_3x1 )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_2x1 )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_1x1 )

// gemmsup_rd (mkernel in m dim)
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_6x8m )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_6x4m )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_6x2m )

// gemmsup_rd (mkernel in n dim)
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_6x8n )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_3x8n )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_2x8n )
GEMMSUP_KER_PROT( double, d, gemmsup_rd_haswell_asm_1x8n )
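
For reference, each GEMMSUP_KER_PROT( ctype, ch, name ) line above
expands to a prototype for a kernel function named bli_<ch><name>. As a
sketch, the generated prototype for one of the rv kernels looks roughly
like the following (parameter names follow BLIS's sup kernel interface
and may differ slightly across versions):

    void bli_dgemmsup_rv_haswell_asm_6x8
         (
           conj_t              conja,
           conj_t              conjb,
           dim_t               m0,
           dim_t               n0,
           dim_t               k0,
           double*    restrict alpha,
           double*    restrict a, inc_t rs_a0, inc_t cs_a0,
           double*    restrict b, inc_t rs_b0, inc_t cs_b0,
           double*    restrict beta,
           double*    restrict c, inc_t rs_c0, inc_t cs_c0,
           auxinfo_t* restrict data,
           cntx_t*    restrict cntx
         );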