26 Commits

Author SHA1 Message Date
Field G. Van Zee
b02e0aae8c Updated test drivers to iterate backwards.
Details:
- Updated test driver source in test, test/3, test/1m4m, and
  test/mixeddt to iterate through the problem space backwards. This
  can help avoid certain situations where the CPU frequency does not
  immediately throttle up to its maximum. Thanks to Robert van de
  Geijn for recommending this fix (originally made to test/sup drivers
  in 57e422a).
- Applied off-by-one matlab output bugfix from b6017e5 to test drivers
  in test, test/3, test/1m4m, and test/mixeddt directories.
2019-08-27 14:37:46 -05:00
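The backward iteration described in b02e0aae8c can be sketched roughly as follows. This is an illustration only, not the actual driver source: the helper name `sizes_descending` and its parameters are hypothetical. The point is simply that the largest problem size is visited first, so the CPU has a long-running workload in which to reach its maximum frequency.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of BLIS): records the problem sizes a
   driver would visit when walking from p_end down to p_begin in steps
   of p_inc. Returns the number of sizes written to sizes[]. */
size_t sizes_descending(long p_begin, long p_end, long p_inc,
                        long *sizes, size_t cap)
{
    assert(sizes != NULL);
    assert(p_begin > 0 && p_inc > 0 && p_begin <= p_end);

    size_t n = 0;
    /* Largest problem first, smallest problem last. */
    for (long p = p_end; p >= p_begin && n < cap; p -= p_inc)
        sizes[n++] = p;
    return n;
}
```

With `p_begin = 100`, `p_end = 500`, and `p_inc = 100`, the sizes are visited in the order 500, 400, 300, 200, 100.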
Field G. Van Zee
4fa4cb0734 Trivial comment header updates.
Details:
- Removed four trailing spaces after "BLIS" that occur in most files'
  commented-out license headers.
- Added UT copyright lines to some files. (These files previously had
  only AMD copyright lines but were contributed to by both UT and AMD.)
- In some files' copyright lines, expanded 'The University of Texas' to
  'The University of Texas at Austin'.
- Fixed various typos/misspellings in some license headers.
2018-08-29 18:06:41 -05:00
Field G. Van Zee
addce08966 Format spec and other updates in test, test/3m4m.
Details:
- Removed the dtime (delta time, or wallclock time) column from the
  matlab output of all test drivers in test, test/3m4m, test/studies.
  This value was rarely (if ever) really needed and usually only served
  to take up screen space.
- Updated format specifier in test/studies/skx to use %7.2f instead of
  %6.3f.
- For the test drivers in 'test' directory, added an initial line of
  output that sets last entry of matlab matrix to zero in order to
  induce a pre-allocation of the entire array of performance results.
2018-08-06 13:18:20 -05:00
Field G. Van Zee
4b36e85be9 Converted function-like macros to static functions.
Details:
- Converted most C preprocessor macros in bli_param_macro_defs.h and
  bli_obj_macro_defs.h to static functions.
- Reshuffled some functions/macros to bli_misc_macro_defs.h and also
  between bli_param_macro_defs.h and bli_obj_macro_defs.h.
- Changed obj_t-initializing macros in bli_type_defs.h to static
  functions.
- Removed some old references to BLIS_TWO and BLIS_MINUS_TWO from
  bli_constants.h.
- Whitespace changes in select files (four spaces to single tab).
2018-05-08 14:26:30 -05:00
Field G. Van Zee
70640a3710 Implemented library self-initialization.
Details:
- Defined two new functions in bli_init.c: bli_init_once() and
  bli_finalize_once(). Each is implemented with pthread_once(), which
  guarantees that, among the threads that pass in the same pthread_once_t
  data structure, exactly one thread will execute a user-defined function.
  (Thus, there is now a runtime dependency against libpthread even when
  multithreading is not enabled at configure-time.)
- Added calls to bli_init_once() to top-level user APIs for all
  computational operations as well as many other functions in BLIS to
  all but guarantee that BLIS will self-initialize through the normal
  use of its functions.
- Rewrote and simplified bli_init() and bli_finalize() and related
  functions.
- Added -lpthread to LDFLAGS in common.mk.
- Modified the bli_init_auto()/_finalize_auto() functions used by the
  BLAS compatibility layer to take and return no arguments. (The
  previous API that tracked whether BLIS was initialized, and then
  only finalized if it was initialized in the same function, was too
  cute by half and borderline useless because by default BLIS stays
  initialized when auto-initialized via the compatibility layer.)
- Removed static variables that track initialization of the sub-APIs in
  bli_const.c, bli_error.c, bli_init.c, bli_memsys.c, bli_thread, and
  bli_ind.c. We don't need to track initialization at the sub-API level,
  especially now that BLIS can self-initialize.
- Added a critical section around the changing of the error checking
  level in bli_error.c.
- Deprecated bli_ind_oper_has_avail() as well as all functions
  bli_<opname>_ind_get_avail(), where <opname> is a level-3 operation
  name. These functions had no use cases within BLIS and likely none
  outside of BLIS.
- Commented out calls to bli_init() and bli_finalize() in testsuite's
  main() function, and likewise for standalone test drivers in 'test'
  directory, so that self-initialization is exercised by default.
2017-12-11 17:18:43 -06:00
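The self-initialization pattern described in 70640a3710 can be sketched with `pthread_once()` as below. This is a minimal sketch under stated assumptions, not the contents of bli_init.c: every name here (`lib_init_once`, `lib_init_body`, `lib_scale`, `lib_init_call_count`) is hypothetical, and the "initialization" merely increments a counter so that the once-only behavior is observable.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical library state; all names are illustrative only. */
static pthread_once_t lib_once = PTHREAD_ONCE_INIT;
static int lib_init_calls = 0;

/* The user-defined function handed to pthread_once(). It runs at most
   once for a given pthread_once_t, however many threads call in. */
static void lib_init_body(void)
{
    lib_init_calls++;
}

/* Safe to call from every public entry point. */
void lib_init_once(void)
{
    int rc = pthread_once(&lib_once, lib_init_body);
    assert(rc == 0);
    (void)rc; /* silence unused warning when NDEBUG is defined */
}

/* A stand-in for a top-level computational API: it self-initializes
   before doing any work, so callers never need an explicit init. */
double lib_scale(double x)
{
    lib_init_once();
    return 2.0 * x;
}

int lib_init_call_count(void)
{
    return lib_init_calls;
}
```

Calling `lib_scale()` any number of times runs `lib_init_body()` only once. On some toolchains the program must be linked with `-lpthread`, which matches the LDFLAGS change noted in the commit.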
Field G. Van Zee
c31b1e7b9d Relax alignment restrictions for sandybridge ukrs.
Details:
- Relaxed the base pointer and leading dimension alignment restrictions
  in the sandybridge gemm microkernels, allowing the use of vmovups/vmovupd
  instead of vmovaps/vmovapd. These changes mimic those made to the haswell
  microkernels in e0d2fa0 and ee2c139.
- Updated testsuite modules as well as standalone test drivers in 'test'
  directory to use DBL_MAX as the initial time candidate. Thanks to Devin
  Matthews for suggesting this change.
- Inserted #include "float.h" into bli_system.h (to gain access to DBL_MAX).
- Minor update (vis-a-vis contexts) to driver code in test/3m4m.
2016-07-27 15:58:07 -05:00
Field G. Van Zee
cc8d2b8277 Updated old test drivers in 'test'. 2014-09-09 13:48:22 -05:00
Field G. Van Zee
9526ce9881 Updated copyright headers of emscripten configuration files. 2014-08-06 14:15:34 -05:00
Field G. Van Zee
7ed415824d Updated copyright headers (continued).
Details:
- Inserted "at Austin" into third clause of license declarations.
  Meant to include this change in previous commit.
2014-07-14 16:14:33 -05:00
Field G. Van Zee
5c2c6c8561 Updated copyright headers to contain "at Austin".
Details:
- Updated copyright headers to include "at Austin" in the name of the
  University of Texas.
- Updated the copyright years of a few headers to 2014 (from 2011 and
  2012).
2014-07-14 16:05:03 -05:00
Field G. Van Zee
caa6507ff3 Minor cleanup to standalone test drivers.
Details:
- Very minor code changes to standalone test drivers in 'test' directory.
- Added *.so files to '.gitignore'.
2014-07-08 10:25:27 -05:00
Field G. Van Zee
6363a9f658 Added level-3 support for complex via 4m-/3m.
Details:
- Added the ability to induce complex domain level-3 operations via new
  virtual complex micro-kernels which are implemented via only real
  domain micro-kernels. Two new implementations are provided: 4m and 3m.
  4m implements complex matrix multiplication in terms of four real
  matrix multiplications, whereas 3m uses only three and thus is
  capable of even higher (than peak) performance. However, the 3m method
  has somewhat weaker numerical properties, making it less desirable
  in general.
- Further refined packing routines, which were recently revamped, and
  added packing functionality for 4m and 3m.
- Some modifications to trmm and trsm macro-kernels to facilitate indexing
  into micro-panels which were packed for 4m/3m virtual kernels.
- Added 4m and 3m interfaces for each level-3 operation.
- Various other minor changes to facilitate 4m/3m methods.
2014-02-19 17:00:52 -06:00
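The difference between the 4m and 3m methods in 6363a9f658 can be illustrated on scalars. This is a sketch only: the commit applies the idea to matrix products inside virtual micro-kernels, the type `cplx_t` and both function names are hypothetical, and the particular three-product formulation shown is one common choice that may differ from the one BLIS uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical complex type for illustration. */
typedef struct { double re, im; } cplx_t;

/* "4m" style: four real multiplications. */
void cmul_4m(const cplx_t *a, const cplx_t *b, cplx_t *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    c->re = a->re * b->re - a->im * b->im;
    c->im = a->re * b->im + a->im * b->re;
}

/* "3m" style: three real multiplications, at the cost of extra
   additions. The imaginary part is recovered by subtraction, which is
   where the weaker numerical behavior comes from. */
void cmul_3m(const cplx_t *a, const cplx_t *b, cplx_t *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    double t1 = a->re * b->re;
    double t2 = a->im * b->im;
    double t3 = (a->re + a->im) * (b->re + b->im);
    c->re = t1 - t2;
    c->im = t3 - t1 - t2;
}
```

For (1 + 2i)(3 + 4i), both routines produce -5 + 10i. When the real operations are matrix multiplications, trading one multiplication for a few additions is what lets 3m exceed the performance of a conventional complex implementation.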
Field G. Van Zee
2cb13600f9 Updated year in copyright headers to 2014. 2014-01-03 12:29:13 -06:00
Field G. Van Zee
680188d46b Cleaned up old test drivers.
Details:
- Minor updates to old test drivers in preparation for our participation
  in ACM TOMS's replicated results initiative.
2013-10-10 13:23:37 -05:00
Field G. Van Zee
7ae4d7a41d Various changes to treatment of integers.
Details:
- Added a new cpp macro in bli_config.h, BLIS_INT_TYPE_SIZE, which can be
  assigned values of 32, 64, or some other value. The former two result in
  defining gint_t/guint_t in terms of 32- or 64-bit integers, while the latter
  causes integers to be defined in terms of a default type (e.g. long int).
- Updated bli_config.h in reference and clarksville configurations according
  to above changes.
- Updated test drivers in test and testsuite to avoid type warnings associated
  with format specifiers not matching the types of their arguments to printf()
  and scanf().
- Inserted missing #include "bli_system.h" into blis.h (which was slated for
  inclusion in d141f9eeb6).
- Added explicit typecasting of dim_t and inc_t to macros in
  bli_blas_macro_defs.h (which are used in BLAS compatibility layer).
- Slight changes to CREDITS and INSTALL files.
- Slight tweaks to Windows build system, mostly in the form of switching to
  Windows-style CRLF newlines for certain files.
2013-09-10 16:35:12 -05:00
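The integer-size selection described in 7ae4d7a41d might look roughly like the following. Only the names BLIS_INT_TYPE_SIZE, gint_t, and guint_t come from the commit message; the exact conditionals, the fallback to `long int`, and the default value of 64 chosen here are assumptions made for the sake of a compilable sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: default to 64 when the configuration does not set it. */
#ifndef BLIS_INT_TYPE_SIZE
#define BLIS_INT_TYPE_SIZE 64
#endif

#if BLIS_INT_TYPE_SIZE == 32
typedef int32_t  gint_t;
typedef uint32_t guint_t;
#elif BLIS_INT_TYPE_SIZE == 64
typedef int64_t  gint_t;
typedef uint64_t guint_t;
#else
/* Any other value falls back to a default type. */
typedef long int          gint_t;
typedef unsigned long int guint_t;
#endif

/* Compile-time sanity check: signed and unsigned variants must match. */
static_assert(sizeof(gint_t) == sizeof(guint_t),
              "gint_t and guint_t must have the same width");
```

Because the width of gint_t now depends on configuration, printf() and scanf() format specifiers have to be chosen (or arguments cast) to match, which is the source of the type warnings the commit addresses in the test drivers.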
Field G. Van Zee
8abfe55f2a Miscellaneous updates.
Details:
- Changed the BLIS_HEAP_STRIDE_ALIGN_SIZE in the configurations from 16 to
  BLIS_CACHE_LINE_SIZE (typically 64).
- Changed the use of nr in sizing of bd buffer to packnr in level-3 macro-
  kernels.
- Reformulated gemm_ker_var2 to look more like the other level-3 macro-
  kernels, in that the interior and edge-case handling is expressed once
  inside the loops in the n and m dimensions, rather than the edge-case
  handling being "unrolled" and expressed as distinct code regions. The
  previous macro-kernel now lives in retired form in the subdirectory
  other/bli_gemm_ker_var2.c.old.
- Updated experimental gemm_ker_var5 according to above change.
- Fixed bug in bli_her2k.c whereby incorrect transformations were being
  applied to optimize the macro-kernel access pattern on C when C is
  row-stored.
- Various updates inside of test/exec_sizes.
2013-08-08 13:30:19 -05:00
Field G. Van Zee
0680916fdd Added BLAS error checking to compatibility layer.
Details:
- Added frame/compat/check directory, which now houses companion _check()
  routines for each of the BLAS wrappers in frame/compat. These _check()
  routines are called from the compatibility wrappers and mimic the
  error-checking present in the netlib BLAS.
- Edited bla_xerbla.c so that xerbla() translates the operation string to
  uppercase before printing.
- Redefined util routines in frame/compat/f2c/util in terms of level0
  macros.
- Added prototypes for util routines, f2c routines, lsame(), and xerbla().
- Commented out prototypes in test/test_*.c since Fortran integers are now
  int64_t by default (and the prototypes that were present in the files
  used int).
- Removed redundant #include "bli_f2c.h" in bli_?lamch.c and bli_lsame.c,
  since blis.h was already being included.
- Other minor changes to code in frame/compat/f2c.
2013-07-18 18:04:34 -05:00
Field G. Van Zee
22b06cfcd2 Updated level-1/-1f [vector intrinsic] kernels.
Details:
- Updated level-1/-1f kernels so that non-unit and un-aligned cases are
  handled by reference implementation (rather than aborted).
- Added -fomit-frame-pointer to default make_defs.mk for clarksville
  configuration.
- Defined bli_offset_from_alignment() macro.
- Minor edits to old test drivers.
2013-06-03 16:54:52 -05:00
Field G. Van Zee
0288c827d3 Updated ukernels for x86_64.
Details:
- Tweaked micro-kernels and configuration for clarksville.
- Updated/cleaned up old test drivers in test directory.
- Fixed syntax bug in trsv_unb_var1 and trsv_unf_var1 (introduced
  recently).
2013-06-01 08:02:23 -05:00
Field G. Van Zee
b6ef84fad1 Allow ldim of packed micro-panels != MR, NR.
Details:
- Made substantial changes throughout the framework to decouple the leading
  dimension (row or column stride) used within each packed micro-panel from
  the corresponding register blocksize. It appears advantageous on some
  systems to use, for example, packed micro-panels of A where the column
  stride is greater than MR (whereas previously it was always equal to MR).
- Changes include:
  - Added BLIS_EXTEND_[MNK]R_? macros, which specify how much extra padding
    to use when packing micro-panels of A and B.
  - Adjusted all packing routines and macro-kernels to use PACKMR and PACKNR
    where appropriate, instead of MR and NR.
  - Added pd field (panel dimension) to obj_t.
  - New interface to bli_packm_cntl_obj_create().
  - Renamed bli_obj_packed_length()/_width() macros to
    bli_obj_padded_length()/_width().
  - Removed local #defines for cache/register blocksizes in level-3 *_cntl.c.
  - Print out new cache and register blocksize extensions in test suite.
- Also added new BLIS_EXTEND_[MNK]C_? macros for future use in using a larger
  blocksize for edge cases, which can improve performance at the margins.
2013-04-21 15:00:24 -05:00
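The decoupling introduced in b6ef84fad1 can be pictured with a small packing routine in which the column stride of the packed micro-panel (called `packmr` here) may exceed the number of rows actually stored. This is a simplified sketch, not the BLIS packm code: the function name, its signature, and the choice to zero-fill the padding rows are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical routine: packs an m-by-k block of A (element (i,j) at
   a[i*rs_a + j*cs_a]) into p so that consecutive columns of the packed
   micro-panel are packmr elements apart, where packmr >= m. */
void pack_micropanel(size_t m, size_t k,
                     const double *a, size_t rs_a, size_t cs_a,
                     double *p, size_t packmr)
{
    assert(a != NULL && p != NULL);
    assert(m <= packmr);

    for (size_t j = 0; j < k; ++j) {
        for (size_t i = 0; i < m; ++i)
            p[j * packmr + i] = a[i * rs_a + j * cs_a];
        /* Assumption: padding rows are zero-filled. */
        for (size_t i = m; i < packmr; ++i)
            p[j * packmr + i] = 0.0;
    }
}
```

Packing a 2-by-2 column-major matrix with `packmr = 3` yields two columns of three elements each, the third element of each column being padding. A micro-kernel reading this buffer must step by `packmr`, not by the register blocksize, between columns.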
Field G. Van Zee
09eacbd1ab Changed old level3 test drivers to call front-ends.
Details:
- Changed old level-3 test drivers, in 'test' directory, to always call the
  front-end object API instead of the internal back-end with the locally
  defined control tree.
2013-04-18 19:39:13 -05:00
Field G. Van Zee
ca9e435c57 Fixed a bug in reference implementation of dupl.
Details:
- Fixed a bug in reference implementation of dupl (bli_dupl_unb_var1.c),
  which resulted in incorrect duplication.
- Updated old test drivers according to recently updated packm control tree
  creation interface.
- Added 'restrict' to x86 gemm microkernel interface.
2013-04-15 09:59:46 -05:00
Field G. Van Zee
b65cdc57d9 Migrated 'bl2' prefix to 'bli'.
Details:
- Changed all filename and function prefixes from 'bl2' to 'bli'.
- Changed the "blis2.h" header filename to "blis.h" and changed all
  corresponding #include statements accordingly.
- Fixed incorrect association for Fran in CREDITS file.
2013-03-24 20:01:49 -05:00
Field G. Van Zee
1274e12437 Updated copyright headers from 2012 to 2013. 2013-02-11 14:37:47 -06:00
Field G. Van Zee
768fcebaa8 Added unified test suite, and many fixes.
Details:
- Added a highly configurable, unified test suite.

- Removed DUPB configuration constant from bl2_kernel.h and macro-kernel
  header files. Now, instead, DUPB is computed as (NDUP != 1) within each
  macro-kernel. This fixes a bug in trmm/trsm whereby bp was indexed into
  incorrectly when DUPB was set to FALSE but the NDUP was still non-unit.
  By encoding both pieces of information into one constant in _kernel.h,
  it seems somewhat less likely others will encounter this bug in the
  future.
- Added level-2 cache blocksizes to _kernel.h for reference configuration,
  and defined blocksizes in _cntl.c files to these default values.

- Changed semantics of her2k and syr2k such that these operations no longer
  expect the B matrix to already be conjugate-transposed (or just transposed
  for syr2k). However, these semantics are preserved for the internal
  mechanics of the implementations, including the internal back-end and all
  blocked variants.
- Inserted checks for real-valued alpha and beta for herk/her2k and herk,
  respectively.

- Relaxed general object structure constraints in _basic_check() for gemv, ger.
- Changed her front-end to NOT copy-cast to real projection; instead, this is
  replaced by selecting either the real part or both parts within the unblocked
  algorithm implementation, depending on the value of conjh.
- Added conjh to all _check routines for her so that the code knows when to
  verify that alpha has an imaginary component equal to zero (for her, but
  not syr).
- Changed control tree for her to forgo packing.

- Added unit diagonal support to fnormm.
- Redefined real versions of abval2s macros in terms of fabs(), fabsf().
- Redefined complex versions of sqrt2s macros using the actual "complex square
  root" formula.
- Created new level-0 object-based routines, suffixed with "sc" (for "scalar").
- Defined new level-1v, -1d, and -1m versions of add and sub operations
  (two-operand add and subtract).
- Added new scalar macros:
  - getris: acquire real and imaginary components.
  - setris: set real and imaginary components.
  - addjs: addition with conjugated x.
  - subjs: subtraction with conjugated x.
- Defined new utility operations:
  - absumv: element-wise sum of absolute values for vector elements.
  - absumm: element-wise sum of absolute values for matrix elements.
  - mkherm: convert existing matrix to Hermitian.
  - mksymm: convert existing matrix to symmetric.
  - mktrim: convert existing matrix to triangular.

- Added various error checking routines.
- Added bl2_clock_min_diff(), which is used to more cleanly measure the
  wall clock time of a code block.
- Added general stride support to bl2_obj_alloc_buffer().
- Added bl2_obj_init_scalar().
- Updated parameter mapping in bl2_param_map.c.
- Added support for queriable version string.

- Fixed a bug in the her2k macro-kernels (which currently are simply
  implemented in terms of two invocations of herk) whereby beta was being
  applied to both the first and second rank-k updates, rather than only
  the first.
- Fixed a bug in trmm/trsm whereby transpose and right side cases were not
  properly implemented due to erroneous assumptions regarding aliasing and
  root objects.
- Fixed a bug in the upper triangular trsm macro-kernel in which the wrong
  MR x NR block of B was being updated.
- Fixed a bug in the inverts macro in the double real case whereby the
  value was typecast to float before inversion. This affected non-unit cases
  of dtrsm.
- Fixed a bug in the reference kernels for gemmtrsm whereby the minus one
  constant was being applied incorrectly.
- Fixed a bug in the overall treatment of non-unit alpha for trsm. The code
  now mimics the rank-k strategy of gemm, whereby alpha is applied during
  the first iteration of variant 3, with BLIS_ONE passed in instead for
  subsequent iterations. This also required passing alpha into the macro-
  kernels as well as the fused gemmtrsm micro-kernels.
- Fixed a bug in trsm_u_blk_var1 whereby the gemm macro-kernel was being
  called for blocks strictly above the diagonal. While this sounds good in
  theory, this cannot be done because gemm_ker_var2 expects row panels of
  A to be packed from top to bottom, while for trsm_u, A is actually packed
  from bottom to top due to the reverse (BR->TL) nature of the algorithm.
- Fixed a bug in packm_cxk() whereby panel packings with unit panel
  dimensions were mishandled due to incorrect arguments to the copyv kernel.
  Also changed the copyv kernel invocation to scal2v so that these edge
  cases are properly handled when scaling is requested.
- Fixed a bug in packv_int() whereby an uninitialized object is passed in
  instead of the source object.
- Fixed a bug whereby level-2 code could allocate memory dynamically via
  bl2_malloc() and then attempt to free it via bl2_mm_release(). Also fixed
  a potential future bug whereby a mem_t object that is actually no longer
  "allocated" from the static pool is mistaken for being allocated due to
  failure to NULLify the buffer when the block was most recently released.
- Fixed a bug in bl2_acquire_mpart_*() whereby the uplo field was mistakenly
  toggled when the requested subpartition needed to be "reflected" due to it
  residing in an unstored region.
2013-02-11 13:20:44 -06:00
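The her2k fix in 768fcebaa8 (beta applied only to the first of the two rank-k updates) can be illustrated with a real-valued analogue. The sketch below is not the BLIS macro-kernel: both function names are hypothetical, the matrices are small dense row-major arrays, and the whole of C is updated rather than a single triangle.

```c
#include <assert.h>
#include <stddef.h>

/* C := beta*C + alpha * A * B^T, with A and B n-by-k and C n-by-n,
   all stored row-major. Illustrative helper only. */
static void rank_k_update(size_t n, size_t k, double alpha,
                          const double *a, const double *b,
                          double beta, double *c)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t p = 0; p < k; ++p)
                sum += a[i * k + p] * b[j * k + p];
            c[i * n + j] = beta * c[i * n + j] + alpha * sum;
        }
    }
}

/* C := beta*C + alpha*A*B^T + alpha*B*A^T, built from two rank-k
   updates. beta scales C in the first update only; the second update
   must use 1.0, otherwise C would be scaled by beta twice. */
void rank_2k_update(size_t n, size_t k, double alpha,
                    const double *a, const double *b,
                    double beta, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    rank_k_update(n, k, alpha, a, b, beta, c);
    rank_k_update(n, k, alpha, b, a, 1.0, c);
}
```

With 1-by-1 operands A = 2, B = 3, C = 4, alpha = 1, and beta = 0.5, the correct result is 0.5 * 4 + 6 + 6 = 14. Passing beta to both updates would instead give 10.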
Field G. Van Zee
00f3498a89 Initial commit. 2012-12-03 12:36:11 -06:00