amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-12 01:59:59 +00:00

Author	SHA1	Message	Date
Field G. Van Zee	e9899be090	Added high-level implementations of 4m, 3m. Details: - Added "4mh" and "3mh" APIs, which implement the 4m and 3m methods at high levels, respectively. APIs for trmm and trsm were NOT added due to the fact that these approaches are inherently incompatible with implementing 4m or 3m at high levels (because the input right-hand side matrix is overwritten). - Added 4mh, 3mh virtual micro-kernels, and updated the existing 4m and 3m so that all are stylistically consistent. - Added new "rih" packing kernels (both low-level and structure-aware) to support both 4mh and 3mh. - Defined new pack_t schemas to support real-only, imaginary-only, and real+imaginary packing formats. - Added various level0 scalar macros to support the rih packm kernels. - Minor tweaks to trmm macro-kernels to facilitate 4mh and 3mh. - Added the ability to enable/disable 4mh, 3m, and 3mh, and adjusted level-3 front-ends to check enabledness of 3mh, 3m, 4mh, and 4m (in that order) and execute the first one that is enabled, or the native implementation if none are enabled. - Added implementation query functions for each level-3 operation so that the user can query a string that describes the implementation that is currently enabled. - Updated test suite to output implementation types for each level-3 operation, as well as micro-kernel types for each of the five micro-kernels. - Renamed BLIS_ENABLE_?COMPLEX_VIA_4M macros to _ENABLE_VIRTUAL_?COMPLEX. - Fixed an obscure bug when packing Hermitian matrices (regular packing type) whereby the diagonal elements of the packed micro-panels could get tainted if the source matrix's imaginary diagonal part contained garbage.	2014-09-16 18:19:32 -05:00
Field G. Van Zee	c6793cecb7	Reorganized #includes for scalar macro headers. Details: - Reordered the #include statements in bli_scalar_macro_defs.h so that conventional, ri-, and ri3-based macros are grouped together. - Renamed bli_eqri.h (and macros within) to end with 'ris' suffix.	2014-08-28 17:14:48 -05:00
Field G. Van Zee	7ed415824d	Updated copyright headers (continued). Details: - Inserted "at Austin" into third clause of license declarations. Meant to include this change in previous commit.	2014-07-14 16:14:33 -05:00
Field G. Van Zee	5c2c6c8561	Updated copyright headers to contain "at Austin". Details: - Updated copyright headers to include "at Austin" in the name of the University of Texas. - Updated the copyright years of a few headers to 2014 (from 2011 and 2012).	2014-07-14 16:05:03 -05:00
Field G. Van Zee	cb12e456f9	Fixed possible level-3 inf/NaN issue when beta=0. Details: - Redefined xpbys_mxn and xpbys_mxn_u/_l macros to employ a copy (instead of scaling by beta) when beta is zero. This will stamp out any possible infs or NaNs in the output matrix, if it happens to be uninitialized. Thanks to Tony Kelman for isolating this bug.	2014-07-08 10:07:46 -05:00
Field G. Van Zee	c663ce3b51	Fixed various bugs when C99 complex is enabled. Details: - Fixed various bugs in packm_*_cxk(), the 4m/3m micro-kernels, and elsewhere in the framework that were not yet set up to work properly when BLIS_ENABLE_C99_COMPLEX is defined in bli_config.h. - Extensive changes to f2c-derived files in frame/compat/f2c to allow C99 complex storage. Most of these changes center around accessing real and imaginary components via bli_?real()/bli_?imag() accessor macros, and setting of values via bli_?sets() assignment macros. (Thanks to Vladimir Sukharev for pointing out that _ENABLE_C99_COMPLEX was broken.)	2014-02-27 16:32:57 -06:00
Field G. Van Zee	6363a9f658	Added level-3 support for complex via 4m/3m. Details: - Added the ability to induce complex domain level-3 operations via new virtual complex micro-kernels which are implemented via only real domain micro-kernels. Two new implementations are provided: 4m and 3m. 4m implements complex matrix multiplication in terms of four real matrix multiplications, whereas 3m uses only three and thus is capable of even higher (than peak) performance. However, the 3m method has somewhat weaker numerical properties, making it less desirable in general. - Further refined packing routines, which were recently revamped, and added packing functionality for 4m and 3m. - Some modifications to trmm and trsm macro-kernels to facilitate indexing into micro-panels which were packed for 4m/3m virtual kernels. - Added 4m and 3m interfaces for each level-3 operation. - Various other minor changes to facilitate 4m/3m methods.	2014-02-19 17:00:52 -06:00
Field G. Van Zee	2cb13600f9	Updated year in copyright headers to 2014.	2014-01-03 12:29:13 -06:00
Field G. Van Zee	392428dea4	Added "ri" scalar macros. Details: - Added set of basic scalar macros that take arguments' real and imaginary components separately, named like the previous set except with the "ris" (instead of "s") suffix. - Redefined the previous set of scalar macros (those that take arguments "whole") in terms of the new "ri" set. - Renamed setris and getris macros to sets and gets. - Renamed setimag0 macros to seti0s. - Use bli_?1 macro instead of a local constant in bla_trmv.c, bla_trsv.c.	2013-12-12 19:01:47 -06:00
Field G. Van Zee	d70f2b089d	Added scaling to abval2s, sqrt2s macros. Details: - Re-defined abval2s and sqrt2s macros to use scaling to avoid underflow and overflow from squaring the real and imaginary components. (This is the same technique used to fix recent bugs in invscals/invscaljs and inverts.)	2013-11-02 17:19:40 -05:00
Field G. Van Zee	97f89fbcf2	Fixed bug in complex invscals. Details: - Fixed complex inversion in invscals and invscaljs whereby the imaginary component was being computed incorrectly. - Use bli_fmaxabs() instead of bli_fabs() when choosing the scalar in inverts, invscals, and invscaljs. - Changed bli_abs() and bli_fabs() macro definitions to use "<=" operator instead of "<".	2013-11-01 10:16:39 -05:00
Field G. Van Zee	2807013a47	Fixed over/under-flow in complex inversion. Details: - Fixed the complex bli_?inverts() macros, which were inverting elements in an "unsafe" manner, such that very large and very small values were unnecessarily over/under-flowing. Thanks to Vladimir Sukharev for reporting this bug. - Comment update to bli_sumsqv_unb_var1.c. - Removed redundant bli_min() macro in bli_scalar_macro_defs.h. - Changed 1.0F to 1.0 for bli_drands() macro.	2013-10-24 14:32:20 -05:00
Field G. Van Zee	4e80ad28c9	Added support for C99 complex types/arithmetic. Details: - Added support for C99 complex types to bli_type_defs.h and overloaded complex arithmetic to the scalar-level macros in include/level0. This includes a somewhat substantial reorganization and re-layering of much of the existing machinery present in the level0 macros. - Added new #define for BLIS_ENABLE_C99_COMPLEX to bli_config.h files, commented-out by default, which optionally enables the use of built-in C99 complex types and arithmetic. - Minor changes to clarksville and reference configs' make_defs.mk files. - Removed a macro definition from bli_param_macro_defs.h that was not being used (bli_proj_dt_to_real_if_imag_eq0).	2013-07-18 17:53:31 -05:00
Field G. Van Zee	aec12d90f5	Removed copynzv, copynzm and related codes. Details: - Removed copynzv and copynzm operation directories. These operations implemented a variation of copyv/m that, in the case of real source and complex destination operands, leaves the imaginary component untouched (rather than setting it to zero). I realize now that the special case(s) (e.g. gemm with real A and B but complex C) that I thought required this operation actually can be handled more simply. - Removed level0 scalar macros implementing copynzs, copynzjs.	2013-07-10 13:33:30 -05:00
Field G. Van Zee	4b7e7970f1	Migrated integer usage to stdint.h types. Details: - Changed the way bli_type_defs.h defines integer types so that dim_t, inc_t, doff_t, etc. are all defined in terms of gint_t (general signed integer) or guint_t (general unsigned integer). - Renamed Fortran types fchar and fint to f77_char and f77_int. - Define f77_int as int64_t if a new configuration variable, BLIS_ENABLE_BLIS2BLAS_INT64, is defined, and int32_t otherwise. These types are defined in stdint.h, which is now included in blis.h. - Renamed "complex" type in f2c files to "singlecomplex" and typedef'ed in terms of scomplex. - Renamed "char" type in f2c files to "character" and typedef'ed in terms of char. - Updated bla_amax() wrappers so that the return type is defined directly as f77_int, rather than letting the prototype-generating macro decide the type. This was the only use of GENTFUNC2I/GENTPROT2I-related macros, so I removed them. Also, changed the body of the wrapper so that a gint_t is passed into abmaxv, which is THEN typecast to an f77_int before returning the value. - Updated f2c code that accessed .r and .i fields of complex and doublecomplex types so that they use .real and .imag instead (now that we are using scomplex and dcomplex).	2013-07-08 15:20:34 -05:00
Field G. Van Zee	b6e24b23cb	Use PASTEMAC in macro-kernels (over MAC2 or MAC3). Details: - Replaced multi-type invocations of copys_mxn, xpbys_mxn, etc. (PASTEMAC2 and PASTEMAC3) with those that only use a single type (PASTEMAC). - Added extra macros to bli_adds_mxn_uplo.h and bli_xpbys_mxn_uplo.h to accommodate above change. - Fixed comment typo in bli_config.h files. - Added .nfs* pattern to .gitignore.	2013-04-25 12:06:12 -05:00
Field G. Van Zee	4afe3bfd82	Renamed/moved object scalar constant macros. Details: - Replaced scalar constant macro definitions in bli_const_defs.h with a single, simpler macro in bli_obj_macro_defs.h. - Updated invocations of old macros accordingly. - Removed bli_const_defs.h.	2013-04-09 17:45:39 -05:00
Field G. Van Zee	7cbda15291	Added reference microkernels for arbitrary MR, NR. Details: - Added a new set of reference gemm, gemmtrsm, and trsm micro-kernels that contain explicit loops over MR and NR, thus allowing them to be used unmodified by developers who want to build a reference library with custom register blocksizes. - Changed config/reference/bli_kernel.h to use above ukernels by default. - Changed interfaces of new and existing gemm, gemmtrsm, and trsm micro-kernels to use 'restrict' keyword. - Added -funroll-loops option to config/reference/make_defs.mk. - Updated comments in bli_kernel.h describing constraints on register and cache blocksizes. - Updated _adds_mxn.h, _copys_mxn.h, and _xpbys_mxn.h macros files so that single-char macros are also defined.	2013-04-04 15:25:43 -05:00
Field G. Van Zee	6684b73d55	Implemented amax operation and related changes. Details: - Implemented amax operation in BLIS. - Activated BLAS2BLIS routine mapping for new amax BLIS implementation. - Added integer support to [f]printv, [f]printm. - Added integer support to level-0 copys macros. - Updated printing of configuration information in test suite driver. - Comment changes to _config.h files. - Added comments to bla_dot.c to remind the reader what sdsdot()/dsdot() are used for.	2013-04-02 13:06:20 -05:00
Field G. Van Zee	b65cdc57d9	Migrated 'bl2' prefix to 'bli'. Details: - Changed all filename and function prefixes from 'bl2' to 'bli'. - Changed the "blis2.h" header filename to "blis.h" and changed all corresponding #include statements accordingly. - Fixed incorrect association for Fran in CREDITS file.	2013-03-24 20:01:49 -05:00
Field G. Van Zee	132bffcef7	Removed several 'old' directories and files. Details: - Removed most of the 'old' directories scattered throughout the framework, which includes alternate/half-baked/broken implementations.	2013-03-24 18:49:36 -05:00
Field G. Van Zee	ede75693e5	Implemented blas2blis compatibility layer. Details: - Added the blas2blis compatibility layer, located in frame/compat. This includes virtually all of the BLAS, including banded and packed level-2 operations. - Defined bl2_init_safe(), bl2_finalize_safe(). The former allows a conditional initialization, which stores the "exit status" in an err_t, which is then read by the latter function to determine whether finalization should actually take place. - Added calls to bl2_init_safe(), bl2_finalize_safe() to all level-2 and level-3 BLAS-like wrappers. - Added configuration option to instruct BLIS to remain initialized whenever it automatically initializes itself (via bl2_init_safe()), until/unless the application code explicitly calls bl2_finalize(). - Added INSERT_GENTFUNC* and INSERT_GENTPROT* macros to facilitate type templatization of blas2blis wrappers. - Defined level-0 scalar macro bl2_??swaps(). - Defined level-1v operation bl2_swapv(). - Defined some "Fortran" types to bl2_type_defs.h for use with BLAS wrappers.	2013-02-22 12:11:24 -06:00
Field G. Van Zee	e6ac623a90	Properly implemented beta == 0 semantics. Details: - Changed name of set0 and set0_mxn macros to set0s and set0s_mxn, respectively. - Added code to the following operations that sets the output operand to zero if the corresponding scalar is zero (rather than performing the floating-point multiply, or in the case of setv, copying the value). This will prevent nan's and inf's from creeping into results from uninitialized memory. - axpy - dotxv - scalv - scal2v - setv - gemv - ger - hemv - her - her2 - gemm reference ukernels	2013-02-13 18:44:59 -06:00
Field G. Van Zee	474bac30c9	Removed level-0 macros projrs, grabis. Details: - Replaced instances of projrs and grabis macros with newer, more general-purpose getris.	2013-02-12 12:23:48 -06:00
Field G. Van Zee	1274e12437	Updated copyright headers from 2012 to 2013.	2013-02-11 14:37:47 -06:00
Field G. Van Zee	768fcebaa8	Added unified test suite, and many fixes. Details: - Added a highly configurable, unified test suite. - Removed DUPB configuration constant from bl2_kernel.h and macro-kernel header files. Now, instead, DUPB is computed as (NDUP != 1) within each macro-kernel. This fixes a bug in trmm/trsm whereby bp was indexed into incorrectly when DUPB was set to FALSE but the NDUP was still non-unit. By encoding both pieces of information into one constant in _kernel.h, it seems somewhat less likely others will encounter this bug in the future. - Added level-2 cache blocksizes to _kernel.h for reference configuration, and defined blocksizes in _cntl.c files to these default values. - Changed semantics of her2k and syr2k such that these operations no longer expect the B matrix to already be conjugate-transposed (or just transposed for syr2k). However, these semantics are preserved for the internal mechanics of the implementations, including the internal back-end and all blocked variants. - Inserted checks for real-valued alpha and beta for herk/her2k and herk, respectively. - Relaxed general object structure constraints in _basic_check() for gemv, ger. - Changed her front-end to NOT copy-cast to real projection; instead, this is replaced by selecting either the real part or both parts within the unblocked algorithm implementation, depending on the value of conjh. - Added conjh to all _check routines for her so that the code knows when to verify that alpha has an imaginary component equal to zero (for her, but not syr). - Changed control tree for her to forgo packing. - Added unit diagonal support to fnormm. - Redefined real versions of abval2s macros in terms of fabs(), fabsf(). - Redefined complex versions of sqrt2s macros using the actual "complex square root" formula. - Created new level-0 object-based routines, suffixed with "sc" (for "scalar"). - Defined new level-1v, -1d, and -1m versions of add and sub operations (two-operand add and subtract). 
- Added new scalar macros: - getris: acquire real and imaginary components. - setris: set real and imaginary components. - addjs: addition with conjugated x. - subjs: subtraction with conjugated x. - Defined new utility operations: - absumv: element-wise sum of absolute values for vector elements. - absumm: element-wise sum of absolute values for matrix elements. - mkherm: convert existing matrix to Hermitian. - mksymm: convert existing matrix to symmetric. - mktrim: convert existing matrix to triangular. - Added various error checking routines. - Added bl2_clock_min_diff(), which is used to more cleanly measure the wall clock time of a code block. - Added general stride support to bl2_obj_alloc_buffer(). - Added bl2_obj_init_scalar(). - Updated parameter mapping in bl2_param_map.c. - Added support for queriable version string. - Fixed a bug in the her2k macro-kernels (which currently are simply implemented in terms of two invocations of herk) whereby beta was being applied to both the first and second rank-k updates, rather than only the first. - Fixed a bug in trmm/trsm whereby transpose and right side cases were not properly implemented due to erroneous assumptions regarding aliasing and root objects. - Fixed a bug in the upper triangular trsm macro-kernel in which the wrong MR x NR block of B was being updated. - Fixed a bug in the inverts macro in the double real case whereby the value was typecast to float before inversion. This affected non-unit cases of dtrsm. - Fixed a bug in the reference kernels for gemmtrsm whereby the minus one constant was being applied incorrectly. - Fixed a bug in the overall treatment of non-unit alpha for trsm. The code now mimics the rank-k strategy of gemm, whereby alpha is applied during the first iteration of variant 3, with BLIS_ONE passed in instead for subsequent iterations. This also required passing alpha into the macro-kernels as well as the fused gemmtrsm micro-kernels. 
- Fixed a bug in trsm_u_blk_var1 whereby the gemm macro-kernel was being called for blocks strictly above the diagonal. While this sounds good in theory, this cannot be done because gemm_ker_var2 expects row panels of A to be packed from top to bottom, while for trsm_u, A is actually packed from bottom to top due to the reverse (BR->TL) nature of the algorithm. - Fixed a bug in packm_cxk() whereby panel packings with unit panel dimensions were mishandled due to incorrect arguments to the copyv kernel. Also changed the copyv kernel invocation to scal2v so that these edge cases are properly handled when scaling is requested. - Fixed a bug in packv_int() whereby an uninitialized object is passed in instead of the source object. - Fixed a bug whereby level-2 code could allocate memory dynamically via bl2_malloc() and then attempt to free it via bl2_mm_release(). Also fixed a potential future bug whereby a mem_t object that is actually no longer "allocated" from the static pool is mistaken for being allocated due to failure to NULLify the buffer when the block was most recently released. - Fixed a bug in bl2_acquire_mpart_*() whereby the uplo field was mistakenly toggled when the requested subpartition needed to be "reflected" due to it residing in an unstored region.	2013-02-11 13:20:44 -06:00
Field G. Van Zee	806e74beb4	Defined Frobenius norm operations. Details: - Added level-0 grabis macro operation to grab imaginary component of one variable and copy it to the real component of another variable. - Defined sumsqv operation, which computes the sum of the absolute squares of the elements of a vector. This implementation is modeled after ?lassq in netlib LAPACK. - Defined fnormv and fnormm operations, which compute the Frobenius norm on vectors and matrices, respectively. These operations are treated as one-operand operations where the output norm value is the real projection of the datatype of the input operand. Both operations are implemented in terms of sumsqv.	2012-12-20 17:07:50 -06:00
Field G. Van Zee	00f3498a89	Initial commit.	2012-12-03 12:36:11 -06:00

28 Commits
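
Note on commit 6363a9f658: the message describes 4m as forming a complex product from four real multiplications and 3m as using only three. The sketch below illustrates that arithmetic on single scalars only. It is not code from the repository: the type cplx_t and the functions mul_4m() and mul_3m() are hypothetical names chosen for this illustration, and the actual BLIS kernels apply the same idea to whole matrix products rather than to individual elements.

```c
/* Illustration only: cplx_t, mul_4m, and mul_3m are hypothetical names,
   not BLIS types or functions. */
typedef struct { double real; double imag; } cplx_t;

/* 4m-style product: four real multiplications, two additions. */
static cplx_t mul_4m(cplx_t a, cplx_t b)
{
    cplx_t c;
    c.real = a.real * b.real - a.imag * b.imag;
    c.imag = a.real * b.imag + a.imag * b.real;
    return c;
}

/* 3m-style product: three real multiplications, five additions.
   The imaginary part is recovered by subtracting p1 and p2 from p3,
   which is where the extra rounding error can enter. */
static cplx_t mul_3m(cplx_t a, cplx_t b)
{
    double p1 = a.real * b.real;
    double p2 = a.imag * b.imag;
    double p3 = (a.real + a.imag) * (b.real + b.imag);
    cplx_t c;
    c.real = p1 - p2;
    c.imag = p3 - p1 - p2;
    return c;
}
```

For a = 1 + 2i and b = 3 + 4i, both functions return -5 + 10i. With operands of very different magnitudes the two results can differ in the last bits, which is consistent with the commit's remark that 3m has somewhat weaker numerical properties.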