mirror of
https://github.com/amd/blis.git
synced 2026-05-03 05:51:13 +00:00
Details:
- Added a highly configurable, unified test suite.
- Removed DUPB configuration constant from bl2_kernel.h and macro-kernel header files. Now, instead, DUPB is computed as (NDUP != 1) within each macro-kernel. This fixes a bug in trmm/trsm whereby bp was indexed into incorrectly when DUPB was set to FALSE but NDUP was still non-unit. By encoding both pieces of information into one constant in _kernel.h, it seems somewhat less likely others will encounter this bug in the future.
- Added level-2 cache blocksizes to _kernel.h for reference configuration, and defined blocksizes in _cntl.c files to these default values.
- Changed semantics of her2k and syr2k such that these operations no longer expect the B matrix to already be conjugate-transposed (or just transposed for syr2k). However, these semantics are preserved for the internal mechanics of the implementations, including the internal back-end and all blocked variants.
- Inserted checks for real-valued alpha and beta for herk/her2k and herk, respectively.
- Relaxed general object structure constraints in _basic_check() for gemv, ger.
- Changed her front-end to NOT copy-cast to real projection; instead, this is replaced by selecting either the real part or both parts within the unblocked algorithm implementation, depending on the value of conjh.
- Added conjh to all _check routines for her so that the code knows when to verify that alpha has an imaginary component equal to zero (for her, but not syr).
- Changed control tree for her to forgo packing.
- Added unit diagonal support to fnormm.
- Redefined real versions of abval2s macros in terms of fabs(), fabsf().
- Redefined complex versions of sqrt2s macros using the actual "complex square root" formula.
- Created new level-0 object-based routines, suffixed with "sc" (for "scalar").
- Defined new level-1v, -1d, and -1m versions of add and sub operations (two-operand add and subtract).
- Added new scalar macros:
  - getris: acquire real and imaginary components.
  - setris: set real and imaginary components.
  - addjs: addition with conjugated x.
  - subjs: subtraction with conjugated x.
- Defined new utility operations:
  - absumv: element-wise sum of absolute values for vector elements.
  - absumm: element-wise sum of absolute values for matrix elements.
  - mkherm: convert existing matrix to Hermitian.
  - mksymm: convert existing matrix to symmetric.
  - mktrim: convert existing matrix to triangular.
- Added various error checking routines.
- Added bl2_clock_min_diff(), which is used to more cleanly measure the wall clock time of a code block.
- Added general stride support to bl2_obj_alloc_buffer().
- Added bl2_obj_init_scalar().
- Updated parameter mapping in bl2_param_map.c.
- Added support for queriable version string.
- Fixed a bug in the her2k macro-kernels (which currently are simply implemented in terms of two invocations of herk) whereby beta was being applied to both the first and second rank-k updates, rather than only the first.
- Fixed a bug in trmm/trsm whereby transpose and right side cases were not properly implemented due to erroneous assumptions regarding aliasing and root objects.
- Fixed a bug in the upper triangular trsm macro-kernel in which the wrong MR x NR block of B was being updated.
- Fixed a bug in the inverts macro in the double real case whereby the value was typecast to float before inversion. This affected non-unit cases of dtrsm.
- Fixed a bug in the reference kernels for gemmtrsm whereby the minus one constant was being applied incorrectly.
- Fixed a bug in the overall treatment of non-unit alpha for trsm. The code now mimics the rank-k strategy of gemm, whereby alpha is applied during the first iteration of variant 3, with BLIS_ONE passed in instead for subsequent iterations. This also required passing alpha into the macro-kernels as well as the fused gemmtrsm micro-kernels.
- Fixed a bug in trsm_u_blk_var1 whereby the gemm macro-kernel was being called for blocks strictly above the diagonal. While this sounds good in theory, this cannot be done because gemm_ker_var2 expects row panels of A to be packed from top to bottom, while for trsm_u, A is actually packed from bottom to top due to the reverse (BR->TL) nature of the algorithm.
- Fixed a bug in packm_cxk() whereby panel packings with unit panel dimensions were mishandled due to incorrect arguments to the copyv kernel. Also changed the copyv kernel invocation to scal2v so that these edge cases are properly handled when scaling is requested.
- Fixed a bug in packv_int() whereby an uninitialized object is passed in instead of the source object.
- Fixed a bug whereby level-2 code could allocate memory dynamically via bl2_malloc() and then attempt to free it via bl2_mm_release(). Also fixed a potential future bug whereby a mem_t object that is actually no longer "allocated" from the static pool is mistaken for being allocated due to failure to NULLify the buffer when the block was most recently released.
- Fixed a bug in bl2_acquire_mpart_*() whereby the uplo field was mistakenly toggled when the requested subpartition needed to be "reflected" due to it residing in an unstored region.
322 lines
10 KiB
C
/*

   BLIS
   An object-based framework for developing high-performance BLAS-like
   libraries.

   Copyright (C) 2012, The University of Texas

   Redistribution and use in source and binary forms, with or without
   modification, are permitted provided that the following conditions are
   met:
    - Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    - Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.
    - Neither the name of The University of Texas nor the names of its
      contributors may be used to endorse or promote products derived
      from this software without specific prior written permission.

   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
   HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

*/

#ifndef BLIS_KERNEL_H
#define BLIS_KERNEL_H


// -- LEVEL-3 MICRO-KERNEL CONSTANTS -------------------------------------------

// -- Default cache blocksizes --

// Constraints:
//
// (1) MC must be a multiple of:
//     (a) MR (for zero-padding purposes) and
//     (b) NR.
// (2) NC must be a multiple of:
//     (a) NR (for zero-padding purposes) and
//     (b) MR.
// (3) KC does not need to be a multiple of anything, unless the micro-kernel
//     specifically requires it (and typically it does not).
//
// NOTE: For BLIS libraries built on block-panel macro-kernels, constraint
// (2b) is relaxed. In this case, (1b) is needed for operation implementations
// involving matrices with diagonals (trmm, trsm). In these cases, we want
// any panel of packed matrix A to have a diagonal offset that is a multiple
// of MR. If, instead, the library were to be built on panel-block
// macro-kernels, matrix B would be the one with structure, not A, and thus
// it would be constraint (2b) that would be needed instead of (1b).
//

#define BLIS_DEFAULT_MC_S              128
#define BLIS_DEFAULT_KC_S              256
#define BLIS_DEFAULT_NC_S              8192

#define BLIS_DEFAULT_MC_D              128
#define BLIS_DEFAULT_KC_D              256
#define BLIS_DEFAULT_NC_D              8192

#define BLIS_DEFAULT_MC_C              128
#define BLIS_DEFAULT_KC_C              256
#define BLIS_DEFAULT_NC_C              8192

#define BLIS_DEFAULT_MC_Z              128
#define BLIS_DEFAULT_KC_Z              256
#define BLIS_DEFAULT_NC_Z              8192

// -- Default register blocksizes for inner kernel --

// NOTE: When using the reference configuration, these register blocksizes
// in the m and n dimensions should all be equal to the size expected by
// the reference micro-kernel(s).

#define BLIS_DEFAULT_MR_S              4
#define BLIS_DEFAULT_NR_S              4

#define BLIS_DEFAULT_MR_D              4
#define BLIS_DEFAULT_NR_D              4

#define BLIS_DEFAULT_MR_C              4
#define BLIS_DEFAULT_NR_C              4

#define BLIS_DEFAULT_MR_Z              4
#define BLIS_DEFAULT_NR_Z              4

// NOTE: If the micro-kernel, which is typically unrolled to a factor
// of f, handles leftover edge cases (i.e., when k % f > 0), then these
// register blocksizes in the k dimension can be defined to 1.

#define BLIS_DEFAULT_KR_S              1
#define BLIS_DEFAULT_KR_D              1
#define BLIS_DEFAULT_KR_C              1
#define BLIS_DEFAULT_KR_Z              1

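// The KR note above can be illustrated with a minimal sketch (not BLIS code):
// a dot product unrolled by a hypothetical factor UNROLL_F = 4 that also
// handles the k % f leftover iterations itself, which is the situation in
// which KR may be left at 1. The names dot_unrolled and UNROLL_F are
// assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical unroll factor f; not a BLIS constant. */
#define UNROLL_F 4

/* Dot product over k elements, unrolled by UNROLL_F. The second loop
   handles the k % UNROLL_F leftover iterations, so the caller may pass
   any k without padding it to a multiple of the unroll factor. */
static double dot_unrolled(const double *a, const double *b, size_t k)
{
    assert(a != NULL && b != NULL);

    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
    size_t k_main = k - (k % UNROLL_F);
    size_t i;

    /* Main loop: UNROLL_F iterations per trip. */
    for (i = 0; i < k_main; i += UNROLL_F) {
        acc0 += a[i + 0] * b[i + 0];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }

    /* Edge case: the remaining k % UNROLL_F iterations. */
    for (; i < k; ++i)
        acc0 += a[i] * b[i];

    return (acc0 + acc1) + (acc2 + acc3);
}
```

// A kernel without the second loop would instead need k rounded up to a
// multiple of f, which is what a KR value larger than 1 would express.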
// -- Number of elements per vector register --

// NOTE: These constants are typically only used to determine the amount
// of duplication needed when configuring level-3 macro-kernels that
// copy and duplicate elements of B to a temporary duplication buffer
// (so that element-wise vector multiplication and addition instructions
// can be used).

#define BLIS_NUM_ELEM_PER_REG_S        4
#define BLIS_NUM_ELEM_PER_REG_D        2
#define BLIS_NUM_ELEM_PER_REG_C        2
#define BLIS_NUM_ELEM_PER_REG_Z        1

// -- Default switch for duplication of B --

// NOTE: Setting these values to 1 disables duplication. Any value
// d > 1 results in d-1 duplicates being created within a special
// macro-kernel buffer of dimension k x NR*d.

//#define BLIS_DEFAULT_NUM_DUPL_S        BLIS_NUM_ELEM_PER_REG_S
//#define BLIS_DEFAULT_NUM_DUPL_D        BLIS_NUM_ELEM_PER_REG_D
//#define BLIS_DEFAULT_NUM_DUPL_C        BLIS_NUM_ELEM_PER_REG_C
//#define BLIS_DEFAULT_NUM_DUPL_Z        BLIS_NUM_ELEM_PER_REG_Z
#define BLIS_DEFAULT_NUM_DUPL_S        1
#define BLIS_DEFAULT_NUM_DUPL_D        1
#define BLIS_DEFAULT_NUM_DUPL_C        1
#define BLIS_DEFAULT_NUM_DUPL_Z        1

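// As a rough illustration of the k x NR*d buffer described above, the
// sketch below copies a k x nr panel of B into a duplication buffer,
// writing each element d times in a row. This is not the dupl_unb_var1
// kernel; the row-major layout, the placement of the copies next to each
// other, and the names dupl_buffer_len and dupl_panel are all assumptions
// made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements in a k x (nr*d) duplication buffer. */
static size_t dupl_buffer_len(size_t k, size_t nr, size_t d)
{
    return k * nr * d;
}

/* Copy the k x nr row-major panel b into bd, storing every element
   d times consecutively. With d == 1 this is a plain copy, i.e. no
   duplicates are created. */
static void dupl_panel(const double *b, double *bd,
                       size_t k, size_t nr, size_t d)
{
    assert(b != NULL && bd != NULL);
    assert(d >= 1);

    for (size_t p = 0; p < k; ++p)
        for (size_t j = 0; j < nr; ++j)
            for (size_t r = 0; r < d; ++r)
                bd[(p * nr + j) * d + r] = b[p * nr + j];
}
```

// With d equal to the number of elements per vector register, each run of
// d identical values could fill one register, which is the use case the
// note on element-wise vector instructions hints at.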
// -- Default incremental packing blocksizes (n dimension) --

// NOTE: These incremental packing blocksizes (for the n dimension) are only
// used by certain blocked variants. But when they *are* used, they MUST
// be an integer multiple of NR!

#define BLIS_DEFAULT_NI_FAC            16
#define BLIS_DEFAULT_NI_S              (BLIS_DEFAULT_NI_FAC * BLIS_DEFAULT_NR_S)
#define BLIS_DEFAULT_NI_D              (BLIS_DEFAULT_NI_FAC * BLIS_DEFAULT_NR_D)
#define BLIS_DEFAULT_NI_C              (BLIS_DEFAULT_NI_FAC * BLIS_DEFAULT_NR_C)
#define BLIS_DEFAULT_NI_Z              (BLIS_DEFAULT_NI_FAC * BLIS_DEFAULT_NR_Z)



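// The divisibility rules stated for the level-3 blocksizes can be checked
// mechanically. The sketch below is a hypothetical helper (blocksizes_ok is
// not part of BLIS) that tests constraints (1a), (1b) and (2a) together with
// the NI rule; (2b) is left out because the note on block-panel
// macro-kernels describes it as relaxed.

```c
#include <assert.h>

/* Returns 1 if the blocksizes satisfy
     (1a) mc % mr == 0, (1b) mc % nr == 0,
     (2a) nc % nr == 0, and ni % nr == 0,
   and 0 otherwise. */
static int blocksizes_ok(int mc, int nc, int mr, int nr, int ni)
{
    assert(mr > 0 && nr > 0);

    return mc % mr == 0 &&
           mc % nr == 0 &&
           nc % nr == 0 &&
           ni % nr == 0;
}
```

// For the defaults in this file (MC = 128, NC = 8192, MR = NR = 4,
// NI = 16 * 4 = 64), such a check would pass for all four datatypes.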
// -- LEVEL-2 KERNEL CONSTANTS -------------------------------------------------

// NOTE: These values determine high-level cache blocking for level-2
// operations ONLY. So, if gemv is performed with a 2000x2000 matrix A and
// MC = NC = 1000, then a total of four unblocked (or unblocked fused)
// gemv subproblems are called. The blocked algorithms are only useful in
// that they provide the opportunity for packing vectors. (Matrices can also
// be packed here, but this tends to be much too expensive in practice to
// actually employ.)

#define BLIS_DEFAULT_L2_MC_S           1000
#define BLIS_DEFAULT_L2_NC_S           1000

#define BLIS_DEFAULT_L2_MC_D           1000
#define BLIS_DEFAULT_L2_NC_D           1000

#define BLIS_DEFAULT_L2_MC_C           1000
#define BLIS_DEFAULT_L2_NC_C           1000

#define BLIS_DEFAULT_L2_MC_Z           1000
#define BLIS_DEFAULT_L2_NC_Z           1000



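// The subproblem count in the level-2 note follows from simple ceiling
// division: the m dimension is cut into ceil(m / MC) blocks and the n
// dimension into ceil(n / NC) blocks. The helper names below are
// hypothetical and only serve to reproduce that arithmetic.

```c
#include <assert.h>

/* Number of blocks needed to cover dim elements with the given blocksize
   (ceiling division; the last block may be smaller). */
static long num_blocks(long dim, long blocksize)
{
    assert(dim >= 0 && blocksize > 0);
    return (dim + blocksize - 1) / blocksize;
}

/* Number of unblocked gemv subproblems for an m x n matrix that is
   partitioned with cache blocksizes mc and nc. */
static long num_gemv_subproblems(long m, long n, long mc, long nc)
{
    return num_blocks(m, mc) * num_blocks(n, nc);
}
```

// For m = n = 2000 and MC = NC = 1000 this gives 2 * 2 = 4, matching the
// figure quoted in the note.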
// -- LEVEL-1F KERNEL CONSTANTS ------------------------------------------------

// -- Default fusing factors for level-1f operations --

// NOTE: Default fusing factors are not used by the reference implementations
// of level-1f operations. They are here only for use when these operations
// are optimized.

#define BLIS_DEFAULT_FUSING_FACTOR_S   8
#define BLIS_DEFAULT_FUSING_FACTOR_D   4
#define BLIS_DEFAULT_FUSING_FACTOR_C   4
#define BLIS_DEFAULT_FUSING_FACTOR_Z   2



// -- LEVEL-1V KERNEL CONSTANTS ------------------------------------------------

// -- Default register blocksizes for vectors --

// NOTE: Register blocksizes for vectors are used when packing
// non-contiguous vectors. Similar to that of KR, they can
// typically be set to 1.

#define BLIS_DEFAULT_VR_S              1
#define BLIS_DEFAULT_VR_D              1
#define BLIS_DEFAULT_VR_C              1
#define BLIS_DEFAULT_VR_Z              1



// -- LEVEL-3 KERNEL DEFINITIONS -----------------------------------------------

// -- dupl --

#define DUPL_KERNEL          dupl_unb_var1

// -- gemm --

#define GEMM_UKERNEL         gemm_ref_4x4

// -- trsm-related --

#define GEMMTRSM_L_UKERNEL   gemmtrsm_l_ref_4x4
#define GEMMTRSM_U_UKERNEL   gemmtrsm_u_ref_4x4

#define TRSM_L_UKERNEL       trsm_l_ref_4x4
#define TRSM_U_UKERNEL       trsm_u_ref_4x4



// -- LEVEL-1M KERNEL DEFINITIONS ----------------------------------------------

// -- packm --

#define PACKM_2XK_KERNEL     packm_ref_2xk
#define PACKM_4XK_KERNEL     packm_ref_4xk
#define PACKM_6XK_KERNEL     packm_ref_6xk
#define PACKM_8XK_KERNEL     packm_ref_8xk
#define PACKM_10XK_KERNEL    packm_ref_10xk
#define PACKM_12XK_KERNEL    packm_ref_12xk
#define PACKM_14XK_KERNEL    packm_ref_14xk
#define PACKM_16XK_KERNEL    packm_ref_16xk

// -- unpackm --

#define UNPACKM_2XK_KERNEL   unpackm_ref_2xk
#define UNPACKM_4XK_KERNEL   unpackm_ref_4xk
#define UNPACKM_6XK_KERNEL   unpackm_ref_6xk
#define UNPACKM_8XK_KERNEL   unpackm_ref_8xk
#define UNPACKM_10XK_KERNEL  unpackm_ref_10xk
#define UNPACKM_12XK_KERNEL  unpackm_ref_12xk
#define UNPACKM_14XK_KERNEL  unpackm_ref_14xk
#define UNPACKM_16XK_KERNEL  unpackm_ref_16xk



// -- LEVEL-1F KERNEL DEFINITIONS ----------------------------------------------

// -- axpy2v --

#define AXPY2V_KERNEL        axpy2v_unb_var1

// -- dotaxpyv --

#define DOTAXPYV_KERNEL      dotaxpyv_unb_var1

// -- axpyf --

#define AXPYF_KERNEL         axpyf_unb_var1

// -- dotxf --

#define DOTXF_KERNEL         dotxf_unb_var1

// -- dotxaxpyf --

#define DOTXAXPYF_KERNEL     dotxaxpyf_unb_var1



// -- LEVEL-1V KERNEL DEFINITIONS ----------------------------------------------

// -- addv --

#define ADDV_KERNEL          addv_unb_var1

// -- axpyv --

#define AXPYV_KERNEL         axpyv_unb_var1

// -- copynzv --

#define COPYNZV_KERNEL       copynzv_unb_var1

// -- copyv --

#define COPYV_KERNEL         copyv_unb_var1

// -- dotv --

#define DOTV_KERNEL          dotv_unb_var1

// -- dotxv --

#define DOTXV_KERNEL         dotxv_unb_var1

// -- invertv --

#define INVERTV_KERNEL       invertv_unb_var1

// -- scal2v --

#define SCAL2V_KERNEL        scal2v_unb_var1

// -- scalv --

#define SCALV_KERNEL         scalv_unb_var1

// -- setv --

#define SETV_KERNEL          setv_unb_var1

// -- subv --

#define SUBV_KERNEL          subv_unb_var1



#endif