amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-05 15:01:13 +00:00

Author	SHA1	Message	Date
Field G. Van Zee	e5f90f3a8d	Removed copynz defs from bli_kernel.h files. Details: - Removed COPYNZ_KERNEL definition from the bli_kernel.h files in each configuration. (Meant to include this in previous commit.)	2013-07-10 13:40:12 -05:00
Field G. Van Zee	4b7e7970f1	Migrated integer usage to stdint.h types. Details: - Changed the way bli_type_defs.h defines integer types so that dim_t, inc_t, doff_t, etc. are all defined in terms of gint_t (general signed integer) or guint_t (general unsigned integer). - Renamed Fortran types fchar and fint to f77_char and f77_int. - Define f77_int as int64_t if a new configuration variable, BLIS_ENABLE_BLIS2BLAS_INT64, is defined, and int32_t otherwise. These types are defined in stdint.h, which is now included in blis.h. - Renamed "complex" type in f2c files to "singlecomplex" and typedef'ed in terms of scomplex. - Renamed "char" type in f2c files to "character" and typedef'ed in terms of char. - Updated bla_amax() wrappers so that the return type is defined directly as f77_int, rather than letting the prototype-generating macro decide the type. This was the only use of GENTFUNC2I/GENTPROT2I-related macros, so I removed them. Also, changed the body of the wrapper so that a gint_t is passed into abmaxv, which is THEN typecast to an f77_int before returning the value. - Updated f2c code that accessed .r and .i fields of complex and doublecomplex types so that they use .real and .imag instead (now that we are using scomplex and dcomplex).	2013-07-08 15:20:34 -05:00
Field G. Van Zee	02002ef6f3	Added row-storage optimizations for trmm, trsm. Details: - Implemented algorithmic optimizations for trmm and trsm whereby the right side case is now handled explicitly, rather than induced indirectly by transposing and swapping strides on operands. This allows us to walk through the output matrix with favorable access patterns no matter how it is stored, for all parameter combinations. - Renamed trmm and trsm blocked variants so that there is no longer a lower/upper distinction. Instead, we simply label the variants by which dimension is partitioned and whether the variant marches forwards or backwards through the corresponding partitioned operands. - Added support for row-stored packing of lower and upper triangular matrices (as provided by bli_packm_blk_var3.c). - Fixed a performance bug in bli_determine_blocksize_b() whereby the cache blocksize extensions (if non-zero) were not being used to appropriately size the first iteration (ie: the bottom/right edge case). - Updated comments in bli_kernel.h to indicate that both MC and NC must be whole multiples of MR AND NR. This is needed for the case of trsm_r where, in order to reuse existing left-side gemmtrsm fused micro-kernels, the packing of A (left-hand operand) and B (right-hand operand) is done with NR and MR, respectively (instead of MR and NR).	2013-06-24 17:08:14 -05:00
Field G. Van Zee	5b641c3bab	Use separate CFLAGS for "kernels" directories. Details: - Added a new "special" directory type: any source code within directories named "kernels" will be compiled with a separate CFLAGS_KERNELS set of compiler flags. This allows the developer to specify a separate set of flags (e.g. optimization flags) for compiling kernels while maintaining a standard set for regular framework code. - Fixed a bug in the top-level Makefile that was causing "noopt" code to be compiled with the standard set of compilation flags. - Updated make_defs.mk in reference, flame, and clarksville configurations according to above changes.	2013-06-12 16:02:12 -05:00
Field G. Van Zee	6bfa96f848	Absorbed blocksize extensions into main objects. Details: - Revamped some parts of commit `b6ef84fad1` by adding blocksize extension fields to the blksz_t object rather than have them as separate structs. - Updated all packm interfaces/invocations according to above change. - Generalized bli_determine_blocksize_?() so that edge case optimization happens if and only if cache blocksizes are created with non-zero extensions. - Updated comments in bli_kernel.h files to indicate that the edge case blocksize extension mechanism is now available for use.	2013-04-30 19:35:54 -05:00
Field G. Van Zee	b6e24b23cb	Use PASTEMAC in macro-kernels (over MAC2 or MAC3). Details: - Replaced multi-type invocations of copys_mxn, xpbys_mxn, etc. (PASTEMAC2 and PASTEMAC3) with those that only use a single type (PASTEMAC). - Added extra macros to bli_adds_mxn_uplo.h and bli_xpbys_mxn_uplo.h to accommodate above change. - Fixed comment typo in bli_config.h files. - Added .nfs* pattern to .gitignore.	2013-04-25 12:06:12 -05:00
Field G. Van Zee	4fe1435f20	Updated dupl implementation to use PACKNR and NR. Details: - Updated frame/util/dupl/bli_dupl_unb_var1.c to utilize PACKNR and NR explicitly to navigate b1, so that situations where PACKNR > NR are supported. - Moved the 4x2 and 4x4 reference micro-kernels in frame/3/gemm/ukernels and frame/3/trsm/ukernels to kernels/c99/. - Updated clarksville and flame configurations.	2013-04-22 19:00:43 -05:00
Field G. Van Zee	b6ef84fad1	Allow ldim of packed micro-panels != MR, NR. Details: - Made substantial changes throughout the framework to decouple the leading dimension (row or column stride) used within each packed micro-panel from the corresponding register blocksize. It appears advantageous on some systems to use, for example, packed micro-panels of A where the column stride is greater than MR (whereas previously it was always equal to MR). - Changes include: - Added BLIS_EXTEND_[MNK]R_? macros, which specify how much extra padding to use when packing micro-panels of A and B. - Adjusted all packing routines and macro-kernels to use PACKMR and PACKNR where appropriate, instead of MR and NR. - Added pd field (panel dimension) to obj_t. - New interface to bli_packm_cntl_obj_create(). - Renamed bli_obj_packed_length()/_width() macros to bli_obj_padded_length()/_width(). - Removed local #defines for cache/register blocksizes in level-3 *_cntl.c. - Print out new cache and register blocksize extensions in test suite. - Also added new BLIS_EXTEND_[MNK]C_? macros for future use in using a larger blocksize for edge cases, which can improve performance at the margins.	2013-04-21 15:00:24 -05:00
Field G. Van Zee	1a9f427b85	Added/renamed alignment constants to _config.h. Details: - Added new memory alignment constants: BLIS_HEAP_STRIDE_ALIGN_SIZE (previously assumed to be same as SYSTEM_MEM) BLIS_CONTIG_ADDR_ALIGN_SIZE (previously assumed to be same as PAGE_SIZE) BLIS_STACK_BUF_ALIGN_SIZE (previously not enforced) and renamed existing ones BLIS_SYSTEM_MEM_ALIGN_SIZE -> BLIS_HEAP_ADDR_ALIGN_SIZE BLIS_CONTIG_MEM_ALIGN_SIZE -> BLIS_CONTIG_STRIDE_ALIGN_SIZE to better convey what the alignment factor is used for (and what it is not used for). - Removed BLIS_ENABLE_SYSTEM_MEM_ALIGN. Dynamic memory alignment is now disabled by setting BLIS_HEAP_STRIDE_ALIGN_SIZE to 1. - Inserted instances of __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))) into macro-kernels to specify stack alignment of temporary buffers. - Modified test suite driver to output new constants. - Removed bli_align_dim_to_sys() and bli_align_dim_to_cmem(). Instead, we now use bli_align_dim_to_size(), which takes a third argument (the desired alignment).	2013-04-12 15:25:54 -05:00
Field G. Van Zee	0495bd1d6d	Moved _POSIX_C_SOURCE def to compiler cmd line. Details: - Removed the #define of _POSIX_C_SOURCE in bli_config.h (for both reference and clarksville configurations) and added "-D_POSIX_C_SOURCE=200112L" to the compiler command line arguments in make_defs.mk (for both configs). Thanks to Devin Matthews for suggesting this change.	2013-04-11 16:39:25 -05:00
Field G. Van Zee	874707c1b1	Fixed edge case handling bug in herk macrokernels. Details: - Fixed a bug present in bli_herk_l_ker_var2() and bli_herk_u_ker_var2() that only manifests when BLIS is configured such that MR != NR. The bug involves incorrectly detecting edge cases, which resulted in some parts of matrix C potentially being skipped and not updated, depending on the problem size. - Updated the default values of MR and NR in config/reference/bli_kernel.h to 8 and 4, respectively, so that I can better stress the framework on a day-to-day basis. (The fact that they were both equal to 4 for so long is why I did not stumble upon this bug much sooner.)	2013-04-05 17:19:43 -05:00
Field G. Van Zee	7cbda15291	Added reference microkernels for arbitrary MR, NR. Details: - Added a new set of reference gemm, gemmtrsm, and trsm micro-kernels that contain explicit loops over MR and NR, thus allowing them to be used unmodified by developers who want to build a reference library with custom register blocksizes. - Changed config/reference/bli_kernel.h to use above ukernels by default. - Changed interfaces of new and existing gemm, gemmtrsm, and trsm micro-kernels to use 'restrict' keyword. - Added -funroll-loops option to config/reference/make_defs.mk. - Updated comments in bli_kernel.h describing constraints on register and cache blocksizes. - Updated _adds_mxn.h, _copys_mxn.h, and _xpbys_mxn.h macros files so that single-char macros are also defined.	2013-04-04 15:25:43 -05:00
Field G. Van Zee	6684b73d55	Implemented amax operation and related changes. Details: - Implemented amax operation in BLIS. - Activated BLAS2BLIS routine mapping for new amax BLIS implementation. - Added integer support to [f]printv, [f]printm. - Added integer support to level-0 copys macros. - Updated printing of configuration information in test suite driver. - Comment changes to _config.h files. - Added comments to bla_dot.c to remind the reader what sdsdot()/dsdot() are used for.	2013-04-02 13:06:20 -05:00
Field G. Van Zee	fb68087f87	More memory alignment-related tweaks. Details: - Renamed BLIS_MEMORY_ALIGNMENT_SIZE to BLIS_CONTIG_MEM_ALIGN_SIZE. - Renamed BLIS_ENABLE_MEMORY_ALIGNMENT to BLIS_ENABLE_SYSTEM_MEM_ALIGN. - Added BLIS_SYSTEM_MEM_ALIGN_SIZE, which controls only the alignment passed into posix_memalign() or equivalent. - Defined new function, bli_align_dim_to_cmem(), which applies the contiguous memory alignment (rather than the system/malloc alignment).	2013-03-26 15:10:16 -05:00
Field G. Van Zee	9682ef61db	Always define memory alignment size cpp constant. Details: - Removed guard around #define for memory alignment size constant. Memory alignment should always be enabled, and so this value should always be defined.	2013-03-26 14:14:53 -05:00
Field G. Van Zee	3a787cccaa	Renamed memory alignment macro constant. Details: - Renamed all occurrences of BLIS_MEMORY_ALIGNMENT_BOUNDARY to BLIS_MEMORY_ALIGNMENT_SIZE.	2013-03-26 13:59:19 -05:00
Field G. Van Zee	37308f9a50	Align packed panel strides with system alignment. Details: - Pass panel strides through bli_align_dim_to_sys() to ensure that each subsequent packed panel of A and B begins at an aligned address. (The first panel is presumably aligned to system alignment because it is aligned to a page boundary, which is typically much larger.) - Rearranged code in packm_init_pack() to prevent additional conditional blocks as a result of the aforementioned change. - Adjusted contiguous memory allocator so that the system memory alignment is used to allocate enough space for each block no matter what kind of register blocking is used (even if register blocksize is unit and every row/column needs maximal padding). - Adjusted default blocksizes in reference configuration so that MCKC and KCNC result in identical footprints for all datatypes.	2013-03-26 12:43:14 -05:00
Field G. Van Zee	b65cdc57d9	Migrated 'bl2' prefix to 'bli'. Details: - Changed all filename and function prefixes from 'bl2' to 'bli'. - Changed the "blis2.h" header filename to "blis.h" and changed all corresponding #include statements accordingly. - Fixed incorrect association for Fran in CREDITS file.	2013-03-24 20:01:49 -05:00
Field G. Van Zee	f469907503	Renamed MAX_PREFETCH_BYTE_OFFSET to MAX_PRELOAD_. Details: - Renamed BLIS_MAX_PREFETCH_BYTE_OFFSET to BLIS_MAX_PRELOAD_BYTE_OFFSET since "prefetch" is kind of a loaded word (e.g. "prefetch" instructions, which are different than the particular kind of prefetching/preloading referred to by this constant).	2013-03-22 15:20:15 -05:00
Field G. Van Zee	e7d41229d3	Re-implemented contiguous memory allocator. Details: - Completely re-wrote the contiguous memory allocator (bl2_mem.c). The new allocator instantiates and initializes three separate memory pool objects, each one associated with a separate array of contiguous memory blocks, each block of fixed and uniform size. (The three pools are for allocating mc-by-kc blocks of A, kc-by-nc panels of B, and mc-by-nc panels of C.) The pool objects use a stack structure internally to track which blocks in the region have been "checked out" to a thread and which are still available. Critical regions are now clearly marked and adaptable to parallel environments (e.g. OpenMP). Memory pools are set up when bl2_init() is called. - Added a new field to the packm control tree node, which indicates what kind of packed buffer is being allocated. The enumerated type for this argument is defined as packbuf_t in bl2_type_defs.h. - Updated level-3 _cntl.c files to pass in the appropriate value for a new packbuf_t argument to bl2_packm_cntl_obj_create(). - Moved some macros called by packm_init_pack() from bl2_obj_macro_defs.h to bl2_mem_macro_defs.h. - Added BLIS_MAX_NUM_THREADS to bl2_config.h, which we use as the default number of blocks of A reserved for the memory allocator. - Deprecated bl2_align_dim(). Replaced usage with that of bl2_align_dim_to_mult(). Turns out that typically we don't need to align a dimension to the system alignment, since that value has to do with starting addresses, whereas the values we are dealing with are unitless dimensions.	2013-03-15 17:12:36 -05:00
Field G. Van Zee	1454c1a142	Moved Fortran name-mangling macro to bl2_config.h. Details: - Moved the Fortran-77 name-mangling macros from bl2_blas_macro_defs.h to the configuration directory (bl2_config.h, specifically) given that it can be expected to be tweaked by some developers.	2013-02-22 12:38:45 -06:00
Field G. Van Zee	ede75693e5	Implemented blas2blis compatibility layer. Details: - Added the blas2blis compatibility layer, located in frame/compat. This includes virtually all of the BLAS, including banded and packed level-2 operations. - Defined bl2_init_safe(), bl2_finalize_safe(). The former allows a conditional initialization, which stores the "exit status" in an err_t, which is then read by the latter function to determine whether finalization should actually take place. - Added calls to bl2_init_safe(), bl2_finalize_safe() to all level-2 and level-3 BLAS-like wrappers. - Added configuration option to instruct BLIS to remain initialized whenever it automatically initializes itself (via bl2_init_safe()), until/unless the application code explicitly calls bl2_finalize(). - Added INSERT_GENTFUNC* and INSERT_GENTPROT* macros to facilitate type templatization of blas2blis wrappers. - Defined level-0 scalar macro bl2_??swaps(). - Defined level-1v operation bl2_swapv(). - Defined some "Fortran" types to bl2_type_defs.h for use with BLAS wrappers.	2013-02-22 12:11:24 -06:00
Field G. Van Zee	1274e12437	Updated copyright headers from 2012 to 2013.	2013-02-11 14:37:47 -06:00
Field G. Van Zee	768fcebaa8	Added unified test suite, and many fixes. Details: - Added a highly configurable, unified test suite. - Removed DUPB configuration constant from bl2_kernel.h and macro-kernel header files. Now, instead, DUPB is computed as (NDUP != 1) within each macro-kernel. This fixes a bug in trmm/trsm whereby bp was indexed into incorrectly when DUPB was set to FALSE but the NDUP was still non-unit. By encoding both pieces of information into one constant in _kernel.h, it seems somewhat less likely others will encounter this bug in the future. - Added level-2 cache blocksizes to _kernel.h for reference configuration, and defined blocksizes in _cntl.c files to these default values. - Changed semantics of her2k and syr2k such that these operations no longer expect the B matrix to already be conjugate-transposed (or just transposed for syr2k). However, these semantics are preserved for the internal mechanics of the implementations, including the internal back-end and all blocked variants. - Inserted checks for real-valued alpha and beta for herk/her2k and herk, respectively. - Relaxed general object structure constraints in _basic_check() for gemv, ger. - Changed her front-end to NOT copy-cast to real projection; instead, this is replaced by selecting either the real part or both parts within the unblocked algorithm implementation, depending on the value of conjh. - Added conjh to all _check routines for her so that the code knows when to verify that alpha has an imaginary component equal to zero (for her, but not syr). - Changed control tree for her to forgo packing. - Added unit diagonal support to fnormm. - Redefined real versions of abval2s macros in terms of fabs(), fabsf(). - Redefined complex versions of sqrt2s macros using the actual "complex square root" formula. - Created new level-0 object-based routines, suffixed with "sc" (for "scalar"). - Defined new level-1v, -1d, and -1m versions of add and sub operations (two-operand add and subtract). - Added new scalar macros: - getris: acquire real and imaginary components. - setris: set real and imaginary components. - addjs: addition with conjugated x. - subjs: subtraction with conjugated x. - Defined new utility operations: - absumv: element-wise sum of absolute values for vector elements. - absumm: element-wise sum of absolute values for matrix elements. - mkherm: convert existing matrix to Hermitian. - mksymm: convert existing matrix to symmetric. - mktrim: convert existing matrix to triangular. - Added various error checking routines. - Added bl2_clock_min_diff(), which is used to more cleanly measure the wall clock time of a code block. - Added general stride support to bl2_obj_alloc_buffer(). - Added bl2_obj_init_scalar(). - Updated parameter mapping in bl2_param_map.c. - Added support for queriable version string. - Fixed a bug in the her2k macro-kernels (which currently are simply implemented in terms of two invocations of herk) whereby beta was being applied to both the first and second rank-k updates, rather than only the first. - Fixed a bug in trmm/trsm whereby transpose and right side cases were not properly implemented due to erroneous assumptions regarding aliasing and root objects. - Fixed a bug in the upper triangular trsm macro-kernel in which the wrong MR x NR block of B was being updated. - Fixed a bug in the inverts macro in the double real case whereby the value was typecast to float before inversion. This affected non-unit cases of dtrsm. - Fixed a bug in the reference kernels for gemmtrsm whereby the minus one constant was being applied incorrectly. - Fixed a bug in the overall treatment of non-unit alpha for trsm. The code now mimics the rank-k strategy of gemm, whereby alpha is applied during the first iteration of variant 3, with BLIS_ONE passed in instead for subsequent iterations. This also required passing alpha into the macro-kernels as well as the fused gemmtrsm micro-kernels. - Fixed a bug in trsm_u_blk_var1 whereby the gemm macro-kernel was being called for blocks strictly above the diagonal. While this sounds good in theory, this cannot be done because gemm_ker_var2 expects row panels of A to be packed from top to bottom, while for trsm_u, A is actually packed from bottom to top due to the reverse (BR->TL) nature of the algorithm. - Fixed a bug in packm_cxk() whereby panel packings with unit panel dimensions were mishandled due to incorrect arguments to the copyv kernel. Also changed the copyv kernel invocation to scal2v so that these edge cases are properly handled when scaling is requested. - Fixed a bug in packv_int() whereby an uninitialized object is passed in instead of the source object. - Fixed a bug whereby level-2 code could allocate memory dynamically via bl2_malloc() and then attempt to free it via bl2_mm_release(). Also fixed a potential future bug whereby a mem_t object that is actually no longer "allocated" from the static pool is mistaken for being allocated due to failure to NULLify the buffer when the block was most recently released. - Fixed a bug in bl2_acquire_mpart_*() whereby the uplo field was mistakenly toggled when the requested subpartition needed to be "reflected" due to it residing in an unstored region.	2013-02-11 13:20:44 -06:00
Field G. Van Zee	8945db6ec9	Renamed x86,x86_64 kernels to indicate 'd' fusing. Details: - Renamed x86 and x86_64 kernels to contain a 'd' before the fusing shape to emphasize that the fusing shape is not for all datatype instances, but rather just for one (that of double-precision real). Other fusing shapes would be proportional to their precision and domain "byte footprints". - Corresponding changes to config/clarksville/bl2_kernel.h.	2012-12-18 15:07:36 -06:00
Field G. Van Zee	6fbbdd4e19	More tweaks to _config.h, _kernel.h; smem tweaks. Details: - Moved kernel-related definitions from bl2_config.h to bl2_kernel.h. - Replaced #define of _GNU_SOURCE with #define of _POSIX_C_SOURCE. This accomplishes the same thing (enabling posix_memalign()) without enabling all of the GNU extensions we don't need. - Defined the size of the static memory pool in terms of MC, KC, and NC, as well as two new constants that determine how many MCxKC blocks and how many KCxNC blocks should be allocated (defined in bl2_config.h). - In the case of static memory pool exhaustion, replaced the generic bl2_abort() with a specific error code call.	2012-12-18 14:34:02 -06:00
Field G. Van Zee	5d8bdb21c4	Minor reordering of bl2_config.h definitions.	2012-12-17 16:07:36 -06:00
Field G. Van Zee	4a83f67490	Consolidated configuration headers. Details: - Merged contents of bl2_arch.h into bl2_config.h for reference and clarksville configurations. - Updated CREDITS, INSTALL, LICENSE, README files.	2012-12-17 12:35:54 -06:00
Field G. Van Zee	e2e7cb2fbe	Expanded reference packm/unpackm kernel set to 16. Details: - Added 10xk, 12xk, 14xk, and 16xk reference kernels for packm and unpackm. - Updated bl2_[un]packm_cxk() to silently use scal2m if "out of range" kernel size is requested. (Thanks to Tyler for finding this bug.) - Updated bl2_kernel.h to contain new _KERNEL definitions, according to above changes, for 'reference' and 'clarksville' configurations. - Updated CHANGELOG. - Removed "output*.m" from .gitignore.	2012-12-13 18:17:54 -06:00
Field G. Van Zee	714c527b0e	Added 'changelog' make target; other tweaks. Details: - Updated CHANGELOG. - Added 'changelog' target to Makefile that runs 'git log --decorate' and overwrites CHANGELOG with the output. - Other trivial changes.	2012-12-07 19:54:04 -06:00
Field G. Van Zee	e4e5404d26	Define static memory pool size in bl2_config.h.	2012-12-07 17:34:53 -06:00
Field G. Van Zee	2f272b40f4	Added build system and continued reorganization. Details: - Added/renamed packm, unpackm kernels. - Added machine value routines. - Added param_map facility. - Renamed AUTHORS to CREDITS. - Added Makefile; continued to expand upon existing configure script. - #define fuse_fac macros in operation headers if not defined already (by the user in bl2_kernels.h).	2012-12-04 19:22:14 -06:00
Field G. Van Zee	00f3498a89	Initial commit.	2012-12-03 12:36:11 -06:00

33 Commits
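
Commit e7d41229d3 above describes memory pools that hand out fixed, uniform-size blocks and use a stack internally to track which blocks are checked out and which are still available. The following is a minimal sketch of that idea only, not the code in bl2_mem.c: the names blk_pool_t, blk_pool_init, blk_pool_checkout, and blk_pool_release, as well as the block count and block size, are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes; a real configuration would derive them from the
   cache blocksizes (e.g. MC x KC elements per block of A). */
#define POOL_NUM_BLOCKS 4
#define POOL_BLOCK_SIZE 1024 /* bytes per block, fixed and uniform */

typedef struct
{
    unsigned char storage[POOL_NUM_BLOCKS][POOL_BLOCK_SIZE]; /* backing memory */
    void*         avail[POOL_NUM_BLOCKS]; /* stack of blocks not checked out */
    int           top;                    /* index of top entry; -1 when empty */
} blk_pool_t;

/* Push every block onto the stack so that all are initially available. */
void blk_pool_init(blk_pool_t* pool)
{
    for (int i = 0; i < POOL_NUM_BLOCKS; ++i)
        pool->avail[i] = pool->storage[i];
    pool->top = POOL_NUM_BLOCKS - 1;
}

/* Check out a block by popping the stack; returns NULL when the pool is
   exhausted. In a threaded build this body would be a critical region. */
void* blk_pool_checkout(blk_pool_t* pool)
{
    if (pool->top < 0)
        return NULL;
    return pool->avail[pool->top--];
}

/* Return a previously checked-out block by pushing it back on the stack.
   This too would need to be a critical region in a threaded build. */
void blk_pool_release(blk_pool_t* pool, void* block)
{
    assert(block != NULL);
    assert(pool->top < POOL_NUM_BLOCKS - 1); /* cannot release more than exist */
    pool->avail[++pool->top] = block;
}
```

Because checkout and release are a pop and a push, both run in constant time, and the only shared state that would need protection in a parallel environment is the top index and the avail array.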