Commit Graph

63 Commits

Author SHA1 Message Date
Field G. Van Zee
00e14cb6d8 Replaced use of bool_t type with C99 bool.
Details:
- Textually replaced nearly all non-comment instances of bool_t with the
  C99 bool type. A few remaining instances, such as those in the files
  bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and
  bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being
  used not for boolean purposes but to index into an array.
- This commit constitutes the third phase of a transition toward using
  C99's bool instead of bool_t, which was raised in issue #420. The first
  phase, which cleaned up various typecasts in preparation for using
  bool as the basis for bool_t (instead of gint_t), was implemented by
  commit a69a4d7. The second phase, which redefined the bool_t typedef
  in terms of bool (from gint_t), was implemented by commit 2c554c2.
2020-07-29 14:24:34 -05:00
Field G. Van Zee
a69a4d7e2f Cleaned up bool_t usage and various typecasts.
Details:
- Fixed various typecasts in

    frame/base/bli_cntx.h
    frame/base/bli_mbool.h
    frame/base/bli_rntm.h
    frame/include/bli_misc_macro_defs.h
    frame/include/bli_obj_macro_defs.h
    frame/include/bli_param_macro_defs.h

  that were missing or being done improperly/incompletely. For example,
  many return values were being typecast as
    (bool_t)x && y
  rather than
    (bool_t)(x && y)
  Thankfully, none of these deficiencies had manifested as actual bugs
  at the time of this commit.
- Changed the return type of bli_env_get_var() from dim_t to gint_t.
  This reflects the fact that bli_env_get_var() needs to be able to
  return a signed integer, and even though dim_t is currently defined
  as a signed integer, it does not intuitively appear to necessarily be
  signed by inspection (i.e., an integer named "dim_t" for matrix
  "dimension"). Also, updated use of bli_env_get_var() within
  bli_pack.c to reflect the changed return type.
- Redefined type of thrcomm_t.barrier_sense field from bool_t to gint_t
  and added comments to the bli_thrcomm_*.h files that will explain a
  planned replacement of bool_t with C99's bool type.
- Note: These changes are being made to facilitate the substitution of
  'bool' for 'bool_t', which will eliminate the namespace conflict with
  arm_sve.h as reported in issue #420. This commit implements the first
  phase of that transition. Thanks to RuQing Xu for reporting this
  issue.
- CREDITS file update.
2020-07-22 16:13:09 -05:00
Field G. Van Zee
72f6ed0637 Declare/define static functions via BLIS_INLINE.
Details:
- Updated all static function definitions to use the cpp macro
  BLIS_INLINE instead of the static keyword. This allows blis.h to
  use a different keyword (inline) to define these functions when
  compiling with C++, which might otherwise trigger "defined but
  not used" warning messages. Thanks to Giorgos Margaritis for
  reporting this issue and Devin Matthews for suggesting the fix.
- Updated the following files, which are used by configure's
  hardware auto-detection facility, to unconditionally #define
  BLIS_INLINE to the static keyword (since we know BLIS will be
  compiled with C, not C++):
    build/detect/config/config_detect.c
    frame/base/bli_arch.c
    frame/base/bli_cpuid.c
- CREDITS file update.
2020-07-03 17:55:54 -05:00
Field G. Van Zee
29b0e1ef4e Code review + tweaks to AMD's AOCL 2.0 PR (#349).
Details:
- NOTE: This is a merge commit of 'master' of git://github.com/amd/blis
  into 'amd-master' of flame/blis.
- Fixed a bug in the downstream value of BLIS_NUM_ARCHS, which was
  inadvertantly not incremented when the Zen2 subconfiguration was
  added.
- In bli_gemm_front(), added a missing conditional constraint around the
  call to bli_gemm_small() that ensures that the computation precision
  of C matches the storage precision of C.
- In bli_syrk_front(), reorganized and relocated the notrans/trans logic
  that existed around the call to bli_syrk_small() into bli_syrk_small()
  to minimize the calling code footprint and also to bring that code
  into stylistic harmony with similar code in bli_gemm_front() and
  bli_trsm_front(). Also, replaced direct accessing of obj_t fields with
  proper accessor static functions (e.g. 'a->dim[0]' becomes
  'bli_obj_length( a )').
- Added #ifdef BLIS_ENABLE_SMALL_MATRIX guard around prototypes for
  bli_gemm_small(), bli_syrk_small(), and bli_trsm_small(). This is
  strictly speaking unnecessary, but it serves as a useful visual cue to
  those who may be reading the files.
- Removed cpp macro-protected small matrix debugging code from
  bli_trsm_front.c.
- Added a GCC_OT_9_1_0 variable to build/config.mk.in to facilitate gcc
  version check for availability of -march=znver2, and added appropriate
  support to configure script.
- Cleanups to compiler flags common to recent AMD microarchitectures in
  config/zen/amd_config.mk, including: removal of -march=znver1 et al.
  from CKVECFLAGS (since the -march flag is added within make_defs.mk);
  setting CRVECFLAGS similarly to CKVECFLAGS.
- Cleanups to config/zen/bli_cntx_init_zen.c.
- Cleanups, added comments to config/zen/make_defs.mk.
- Cleanups to config/zen2/make_defs.mk, including making use of newly-
  added GCC_OT_9_1_0 and existing GCC_OT_6_1_0 to choose the correct
  set of compiler flags based on the version of gcc being used.
- Reverted downstream changes to test/test_gemm.c.
- Various whitespace/comment changes.
2019-10-11 10:24:24 -05:00
kdevraje
13806ba3b0 This check in has changes w.r.t Copyright information, which is changed to (start year) - 2019
Change-Id: Ide3c8f7172210b8d3538d3c36e88634ab1ba9041
2019-05-27 16:24:43 +05:30
Field G. Van Zee
b9c9f03502 Implemented gemm on skinny/unpacked matrices.
Details:
- Implemented a new sub-framework within BLIS to support the management
  of code and kernels that specifically target matrix problems for which
  at least one dimension is deemed to be small, which can result in long
  and skinny matrix operands that are ill-suited for the conventional
  level-3 implementations in BLIS. The new framework tackles the problem
  in two ways. First the stripped-down algorithmic loops forgo the
  packing that is famously performed in the classic code path. That is,
  the computation is performed by a new family of kernels tailored
  specifically for operating on the source matrices as-is (unpacked).
  Second, these new kernels will typically (and in the case of haswell
  and zen, do in fact) include separate assembly sub-kernels for
  handling of edge cases, which helps smooth performance when performing
  problems whose m and n dimension are not naturally multiples of the
  register blocksizes. In a reference to the sub-framework's purpose of
  supporting skinny/unpacked level-3 operations, the "sup" operation
  suffix (e.g. gemmsup) is typically used to denote a separate namespace
  for related code and kernels. NOTE: Since the sup framework does not
  perform any packing, it targets row- and column-stored matrices A, B,
  and C. For now, if any matrix has non-unit strides in both dimensions,
  the problem is computed by the conventional implementation.
- Implemented the default sup handler as a front-end to two variants.
  bli_gemmsup_ref_var2() provides a block-panel variant (in which the
  2nd loop around the microkernel iterates over n and the 1st loop
  iterates over m), while bli_gemmsup_ref_var1() provides a panel-block
  variant (2nd loop over m and 1st loop over n). However, these variants
  are not used by default and provided for reference only. Instead, the
  default sup handler calls _var2m() and _var1n(), which are similar
  to _var2() and _var1(), respectively, except that they defer to the
  sup kernel itself to iterate over the m and n dimension, respectively.
  In other words, these variants rely not on microkernels, but on
  so-called "millikernels" that iterate along m and k, or n and k.
  The benefit of using millikernels is a reduction of function call
  and related (local integer typecast) overhead as well as the ability
  for the kernel to know which micropanel (A or B) will change during
  the next iteration of the 1st loop, which allows it to focus its
  prefetching on that micropanel. (In _var2m()'s millikernel, the upanel
  of A changes while the same upanel of B is reused. In _var1n()'s, the
  upanel of B changes while the upanel of A is reused.)
- Added a new configure option, --[en|dis]able-sup-handling, which is
  enabled by default. However, the default thresholds at which the
  default sup handler is activated are set to zero for each of the m, n,
  and k dimensions, which effectively disables the implementation. (The
  default sup handler only accepts the problem if at least one dimension
  is smaller than or equal to its corresponding threshold. If all
  dimensions are larger than their thresholds, the problem is rejected
  by the sup front-end and control is passed back to the conventional
  implementation, which proceeds normally.)
- Added support to the cntx_t structure to track new fields related to
  the sup framework, most notably:
  - sup thresholds: the thresholds at which the sup handler is called.
  - sup handlers: the address of the function to call to implement
    the level-3 skinny/unpacked matrix implementation.
  - sup blocksizes: the register and cache blocksizes used by the sup
    implementation (which may be the same or different from those used
    by the conventional packm-based approach).
  - sup kernels: the kernels that the handler will use in implementing
    the sup functionality.
  - sup kernel prefs: the IO preference of the sup kernels, which may
    differ from the preferences of the conventional gemm microkernels'
    IO preferences.
- Added a bool_t to the rntm_t structure that indicates whether sup
  handling should be enabled/disabled. This allows per-call control
  of whether the sup implementation is used, which is useful for test
  drivers that wish to switch between the conventional and sup codes
  without having to link to different copies of BLIS. The corresponding
  accessor functions for this new bool_t are defined in bli_rntm.h.
- Implemented several row-preferential gemmsup kernels in a new
  directory, kernels/haswell/3/sup. These kernels include two general
  implementation types--'rd' and 'rv'--for the 6x8 base shape, with
  two specialized millikernels that embed the 1st loop within the kernel
  itself.
- Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference
  gemmsup microkernels. NOTE: These microkernels, unlike the current
  crop of conventional (pack-based) microkernels, do not use constant
  loop bounds. Additionally, their inner loop iterates over the k
  dimension.
- Defined new typedef enums:
  - stor3_t: captures the effective storage combination of the level-3
    problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A
    special value of BLIS_XXX is used to denote an arbitrary combination
    which, in practice, means that at least one of the operands is
    stored according to general stride.
  - threshid_t: captures each of the three dimension thresholds.
- Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create()
  can be passed "-1, -1" as a lazy request for row storage. (Note that
  "0, 0" is still accepted as a lazy request for column storage.)
- Added support for various instructions to bli_x86_asm_macros.h,
  including imul, vhaddps/pd, and other instructions related to integer
  vectors.
- Disabled the older small matrix handling code inserted by AMD in
  bli_gemm_front.c, since the sup framework introduced in this commit
  is intended to provide a more generalized solution.
- Added test/sup directory, which contains standalone performance test
  drivers, a Makefile, a runme.sh script, and an 'octave' directory
  containing scripts compatible with GNU Octave. (They also may work
  with matlab, but if not, they are probably close to working.)
- Reinterpret the storage combination string (sc_str) in the various
  level-3 testsuite modules (e.g. src/test_gemm.c) so that the order
  of each matrix storage char is "cab" rather than "abc".
- Comment updates in level-3 BLAS API wrappers in frame/compat.
2019-04-27 18:44:50 -05:00
Field G. Van Zee
945928c650 Merge branch 'amd' of github.com:flame/blis into amd 2019-04-17 15:58:56 -05:00
Field G. Van Zee
89cd650e7b Use void_fp for function pointers instead of void*.
Change void*-typed function pointers to void_fp.
- Updated all instances of void* variables that store function pointers
  to variables of a new type, void_fp. Originally, I wanted to define
  the type of void_fp as "void (*void_fp)( void )"--that is, a pointer
  to a function with no return value and no arguments. However, once
  I did this, I realized that gcc complains with incompatible pointer
  type (-Wincompatible-pointer-types) warnings every time any such a
  pointer is being assigned to its final, type-accurate function
  pointer type. That is, gcc will silently typecast a void* to
  another defined function pointer type (e.g. dscalv_ker_ft) during
  an assignment from the former to the latter, but the same statement
  will trigger a warning when typecasting from a void_fp type. I suspect
  an explicit typecast is needed in order to avoid the warning, which
  I'm not willing to insert at this time.
- Added a typedef to bli_type_defs.h defining void_fp as void*, along
  with a commented-out version of the aborted definition described
  above. (Note that POSIX requires that void* and function pointers
  be interchangeable; it is the C standard that does not provide this
  guarantee.)
- Comment updates to various _oapi.c files.
2019-04-02 17:23:55 -05:00
Field G. Van Zee
4ed39c0971 Merge branch 'amd' of github.com:flame/blis into amd 2019-03-08 11:56:58 -06:00
Isuru Fernando
f0dcc8944f Add symbol export macro for all functions (#302)
* initial export of blis functions

* Regenerate def file for master

* restore bli_extern_defs exporting for now
2019-02-27 17:27:23 -06:00
Field G. Van Zee
540ec1b479 Updated level-3 BLAS to call object API directly.
Details:
- Updated the BLAS compatibility layer for level-3 operations so that
  the corresponding BLIS object API is called directly rather than first
  calling the typed BLIS API. The previous code based on the typed BLIS
  API calls is still available in a deactivated cpp macro branch, which
  may be re-activated by #defining BLIS_BLAS3_CALLS_TAPI. (This does not
  yet correspond to a configure option. If it seems like people might
  want to toggle this behavior more regularly, a configure option can be
  added in the future.)
- Updated the BLIS typed API to statically "pre-initialize" objects via
  new initializor macros. Initialization is then finished via calls to
  static functions bli_obj_init_finish_1x1() and bli_obj_init_finish(),
  which are similar to the previously-called functions,
  bli_obj_create_1x1_with_attached_buffer() and
  bli_obj_create_with_attached_buffer(), respectively. (The BLAS
  compatibility layer updates mentioned above employ this new technique
  as well.)
- Transformed certain routines in bli_param_map.c--specifically, the
  ones that convert netlib-style parameters to BLIS equivalents--into
  static functions, now in bli_param_map.h. (The remaining three classes
  of conversation routines were left unchanged.)
- Added the aforementioned pre-initializor macros to bli_type_defs.h.
- Relocated bli_obj_init_const() and bli_obj_init_constdata() from
  bli_obj_macro_defs.h to bli_type_defs.h.
- Added a few macros to bli_param_macro_defs.h for testing domains for
  real/complexness and precisions for single/double-ness.
2019-02-24 19:09:10 -06:00
Field G. Van Zee
0645f239fb Remove UT-Austin from copyright headers' clause 3.
Details:
- Removed explicit reference to The University of Texas at Austin in the
  third clause of the license comment blocks of all relevant files and
  replaced it with a more all-encompassing "copyright holder(s)".
- Removed duplicate words ("derived") from a few kernels' license
  comment blocks.
- Homogenized license comment block in kernels/zen/3/bli_gemm_small.c
  with format of all other comment blocks.
2018-12-04 14:31:06 -06:00
Field G. Van Zee
71c5832d5f Consolidated slab/rr-explicit level-3 macrokernels.
Details:
- Consolidated the *sl.c and *rr.c level-3 macrokernels into a single
  file per sl/rr pair, with those files named as they were before
  c92762e. The consolidation does not take away the *option* of using
  slab or round-robin assignment of micropanels to threads; it merely
  *hides* the choice within the definitions of functions such as
  bli_thread_range_jrir(), bli_packm_my_iter(), and bli_is_last_iter()
  rather than expose that choice explicitly in the code. The choice of
  slab or rr is not always hidden, however; there are some cases
  involving herk and trmm, for example, that require some part of the
  computation to use rr unconditionally. (The --thread-part-jrir option
  controls the partitioning in all other cases.)
- Note: Originally, the sl and rr macrokernels were separated out for
  clarity. However, aside from the additional binary code bloat, I later
  deemed that clarity not worth the price of maintaining the additional
  (mostly similar) codes.
2018-10-17 14:11:01 -05:00
Field G. Van Zee
c92762ecdc Added option of slab or rr partitioning in jr/ir.
Details:
- Updated existing macrokernel function names and definitions to
  explicitly use slab assignment of micropanels to threads, then created
  duplicate versions of macrokernels that explicitly use round-robin
  assignment instead of slab. NOTE: As in ac18949, trsm_r macrokernels
  were not substantially updated in this commit because they are
  currently disabled in bli_trsm_front.c.
- Updated existing packing function (in blk_packm_blk_var1.c) to
  explicitly use slab partitioning, and then duplicated for round-robin.
- Updated control tree initialization to use the appropriate macrokernel
  and packm function pointers depending on which method (slab or rr) was
  enabled at configure-time.
- Updated configure script to accept new --thread-part-jrir=[slab|rr]
  option (-m [slab|rr] for short), which allows the user to explicitly
  request either slab or round-robin assignment (partitioning) of
  micropanels to threads.
- Updated sandbox/ref99 according to above changes.
- Minor updates to build/add-copyright.py.
2018-10-07 20:30:32 -05:00
Field G. Van Zee
ac18949a4b Multithreading optimizations for l3 macrokernels.
Details:
- Adjusted the method by which micropanels are assigned to threads in
  the 2nd (jr) and 1st (ir) loops around the microkernel to (mostly)
  employ contiguous "slab" partitioning rather than interleaved (round
  robin) partitioning. The new partitioning schemes and related details
  for specific families of operations are listed below:
  - gemm: slab partitioning.
  - herk: slab partitioning for region corresponding to non-triangular
          region of C; round robin partitioning for triangular region.
  - trmm: slab partitioning for region corresponding to non-triangular
          region of B; round robin partitioning for triangular region.
          (NOTE: This affects both left- and right-side macrokernels:
          trmm_ll, trmm_lu, trmm_rl, trmm_ru.)
  - trsm: slab partitioning.
          (NOTE: This only affects only left-side macrokernels trsm_ll,
          trsm_lu; right-side macrokernels were not touched.)
  Also note that the previous macrokernels were preserved inside of
  the 'other' directory of each operation family directory (e.g.
  frame/3/gemm/other, frame/3/herk/other, etc).
- Updated gemm macrokernel in sandbox/ref99 in light of above changes
  and fixed a stale function pointer type in blx_gemm_int.c
  (gemm_voft -> gemm_var_oft).
- Added standalone test drivers in test/3m4m for herk, trmm, and trsm
  and minor changes to test/3m4m/Makefile.
- Updated the arguments and definitions of bli_*_get_next_[ab]_upanel()
  and bli_trmm_?_?r_my_iter() macros defined in bli_l3_thrinfo.h.
- Renamed bli_thread_get_range*() APIs to bli_thread_range*().
2018-09-30 18:54:56 -05:00
Field G. Van Zee
4fa4cb0734 Trivial comment header updates.
Details:
- Removed four trailing spaces after "BLIS" that occurs in most files'
  commented-out license headers.
- Added UT copyright lines to some files. (These files previously had
  only AMD copyright lines but were contributed to by both UT and AMD.)
- In some files' copyright lines, expanded 'The University of Texas' to
  'The University of Texas at Austin'.
- Fixed various typos/misspellings in some license headers.
2018-08-29 18:06:41 -05:00
Field G. Van Zee
b7db293323 Explicitly typecast return vals in static funcs.
Details:
- Added explicit typecasting to various functions (mostly static
  functions), primarily those in bli_param_macro_defs.h,
  bli_obj_macro_defs.h, bli_cntx.h, bli_cntl.h, and a few other header
  files.
- This change was prompted by feedback from Jacob Gorm Hansen, who
  reported that #including "blis.h" from his application caused a
  gcc to output error messages (relating to types being returned
  mismatching the declared return types) when used via the C++ compiler
  front-end. This is the first pass of fixes, and we may need to
  iterate with additional follow-up commits (#233).
2018-07-19 11:14:30 -05:00
Field G. Van Zee
52d80b5f09 Fixed static funcs related to target and exec dts.
Details:
- Fixed incorrect bit shifts in the following static functions:
    bli_obj_set_target_domain()
    bli_obj_set_target_prec()
    bli_obj_set_exec_domain()
    bli_obj_set_exec_prec()
- Fixed incorrect bitmask in bli_dt_proj_to_single_prec().
- Updated bli_obj_real_part() and bli_obj_imag_part() so that it updates
  the target and exec datatypes (in addition to the storage datatypes).
2018-06-29 12:30:44 -05:00
Field G. Van Zee
17928b1c99 Added static funcs bli_dt_domain(), bli_dt_prec().
Details:
- Added definitions of static functions bli_dt_domain()/bli_dt_prec(),
  which extract a dom_t domain or prec_t precision value, respectively,
  from a num_t datatype.
- Changed the return types of bli_obj_domain() and bli_obj_prec() from
  objbits_t to dom_t and prec_t. (Not sure why they were ever set to
  return objbits_t.)
2018-06-19 17:59:03 -05:00
Field G. Van Zee
5f7fbb7115 Static funcs for projecting dt to single/double.
Details:
- Added static functions for projecting a datatype to single precision
  or double precision, both for obj_t's storage datatypes and standalone
  datatypes.
2018-06-19 15:38:55 -05:00
Field G. Van Zee
e88a5b8da8 Implemented castm, castv operations.
Details:
- Implemented castm and castv operations, which behave like copym and
  copyv except where the obj_t operands can be of different datatypes.
  These new operations, however, unlike copym/copyv, do not build upon
  existing level-1v kernels.
- Reorganized projm, projv into a 'proj' subdirectory of frame/base (to
  match the newly added frame/base/cast directory).
- Added new macros to bli_gentfunc_macro_defs.h, _gentprot_macro_defs.h
  that insert GENTFUNC2/GENTPROT2 macros for all non-homogeneous datatype
  combinations. Previously, one had to invoke two additional macros--one
  which mixed domains only and another that included all remaining
  cases--in order to get full type combination coverage.
- Defined a new static function, bli_set_dims_incs_2m(), to aid in the
  setting of various variables in the implementations of bli_??castm().
  This static function joins others like it in bli_param_macro_defs.h.
- Comment update to bli_copysc.h.
2018-06-18 15:56:26 -05:00
Field G. Van Zee
0a4a27e1a4 Defined/implemented bli_projm().
Details:
- Defined a new operation in frame/base/bli_proj.c, bli_projm(), which
  behaves like bli_copym(), except that operands a and b are allowed to
  contain data of differing domains (e.g. a is real while b is complex,
  or vice versa). The file is named bli_proj.c, rather than bli_projm.c,
  with the intention that a 'v' vector version of the function may be
  added to the same file (at some point in the future).
- Added supporting bli_check_*() functions in bli_check.c to confirm
  consistent precisions between to datatypes/objects, as well as the
  appropriate error message in bli_error.c and a new error code in
  bli_type_defs.h.
- Wrote a bli_projm_check() function to go along with bli_projm().
- Defined static function bli_obj_real_part() in bli_obj_macro_defs.h,
  which will initialize an obj_t alias to the real part of the source
  object.
- Fixed a bug in the static function bli_dt_proj_to_complex(), found
  in bli_param_macro_defs.h. Thankfully, there were no calls to the
  function to produce buggy behavior.
2018-06-06 19:02:29 -05:00
Field G. Van Zee
5140ee3424 Updated types of bli_is_[un]aligned_to() functions.
Details:
- Changed the void* arguments of the following static functions:
    bli_is_aligned_to()
    bli_is_unaligned_to()
    bli_offset_past_alignment()
  to siz_t, and the return type of bli_offset_past_alignment() from
  guint_t to siz_t. This allows for more versatile usage of these
  functions (e.g. when aligning both pointers and leading dimension).
- Updated all invocations of these functions, mostly in kernels/penryn
  but also in kernels/bgq, to include explicit typecasts to siz_t when
  pointer arguments are passed in.
- Thanks to Devin Matthews for pointing out this potential bug (via issue
  #211).
- Deleted a few trailing spaces in various penryn kernels.
- Removed duplicate instances of the words "derived" and "THEORY" from
  various kernel license headers, likely from a malformed recursive sed
  performed long ago.
2018-05-23 16:56:14 -05:00
Field G. Van Zee
4b36e85be9 Converted function-like macros to static functions.
Details:
- Converted most C preprocessor macros in bli_param_macro_defs.h and
  bli_obj_macro_defs.h to static functions.
- Reshuffled some functions/macros to bli_misc_macro_defs.h and also
  between bli_param_macro_defs.h and bli_obj_macro_defs.h.
- Changed obj_t-initializing macros in bli_type_defs.h to static
  functions.
- Removed some old references to BLIS_TWO and BLIS_MINUS_TWO from
  bli_constants.h.
- Whitespace changes in select files (four spaces to single tab).
2018-05-08 14:26:30 -05:00
Field G. Van Zee
75d0d1057d Renamed various datatype-related macros/functions.
Details:
- Renamed the following macros in bli_obj_macro_defs.h and
  bli_param_macro_defs.h:
  - bli_obj_datatype()                 -> bli_obj_dt()
  - bli_obj_target_datatype()          -> bli_obj_target_dt()
  - bli_obj_execution_datatype()       -> bli_obj_exec_dt()
  - bli_obj_set_datatype()             -> bli_obj_set_dt()
  - bli_obj_set_target_datatype()      -> bli_obj_set_target_dt()
  - bli_obj_set_execution_datatype()   -> bli_obj_set_exec_dt()
  - bli_obj_datatype_proj_to_real()    -> bli_obj_dt_proj_to_real()
  - bli_obj_datatype_proj_to_complex() -> bli_obj_dt_proj_to_complex()
  - bli_datatype_proj_to_real()        -> bli_dt_proj_to_real()
  - bli_datatype_proj_to_complex()     -> bli_dt_proj_to_complex()
- Renamed the following functions in bli_obj.c:
  - bli_datatype_size()                -> bli_dt_size()
  - bli_datatype_string()              -> bli_dt_string()
  - bli_datatype_union()               -> bli_dt_union()
- Removed a pair of old level-1f penryn intrinsics kernels that were no
  longer in use.
2018-04-30 14:57:33 -05:00
Field G. Van Zee
126482a3b6 Implemented the 1m method.
Details:
- Implemented the 1m method for inducing complex domain matrix
  multiplication. 1m support has been added to all level-3 operations,
  including trsm, and is now the default induced method when native
  complex domain gemm microkernels are omitted from the configuration.
- Updated _cntx_init() operations to take a datatype parameter. This was
  needed for the corresponding function for 1m (because 1m requires us
  to choose between column-oriented or row-oriented execution, which
  requires us to query the context for the storage preference of the
  gemm microkernel, which requires knowing the datatype) but I decided
  that it made sense for consistency to add the parameter to all other
  cntx initialization functions as well, even though those functions
  don't use the parameter.
- Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take
  a second scalar for each blocksize entry. The semantic meaning of the
  two scalars now is that the first will scale the default blocksize
  while the second will scale the maximum blocksize. This allows scaling
  the two independently, and was needed to support 1m, which requires
  scaling for a register blocksize but not the register storage
  blocksize (ie: "packdim") analogue.
- Deprecated bli_blksz_reduce_dt_to() and defined two new functions,
  bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing
  default and maximum blocksizes to some desired blocksize multiple.
  These functions are needed in the updated definitions of
  bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs().
- Added support for the 1e and 1r packing schemas to packm, including
  1e/1r packing kernels.
- Added a minor optimization to bli_gemm_ker_var2() that allows, under
  certain circumstances (specifically, real domain beta and row- or
  column-stored matrix C), the real domain macrokernel and microkernel
  to be called directly, rather than using the virtual microkernel
  via the complex domain macrokernel, which carries a slight additional
  amount of overhead.
- Added 1m support to the testsuite.
- Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified
  some code in test_gemm.c driver.
2016-11-25 18:29:49 -06:00
Field G. Van Zee
86969873b5 Reclassified amaxv operation as a level-1v kernel.
Details:
- Moved amaxv from being a utility operation to being a level-1v operation.
  This includes the establishment of a new amaxv kernel to live beside all
  of the other level-1v kernels.
- Added two new functions to bli_part.c:
    bli_acquire_mij()
    bli_acquire_vi()
  The first acquires a scalar object for the (i,j) element of a matrix,
  and the second acquires a scalar object for the ith element of a vector.
- Added integer support to bli_getsc level-0 operation. This involved
  adding integer support to the bli_*gets level-0 scalar macros.
- Added a new test module to test amaxv as a level-1v operation. The test
  module works by comparing the value identified by bli_amaxv() to the
  the value found from a reference-like code local to the test module
  source file. In other words, it (intentionally) does not guarantee the
  same index is found; only the same value. This allows for different
  implementations in the case where a vector contains two or more elements
  containing exactly the same floating point value (or values, in the case
  of the complex domain).
- Removed the directory frame/include/old/.
2016-10-04 14:24:59 -05:00
Field G. Van Zee
9dcd6f05c4 Implemented developer-configurable malloc()/free().
Details:
- Replaced all instances of bli_malloc() and bli_free() with one of:
  - bli_malloc_pool()/bli_free_pool()
  - bli_malloc_user()/bli_free_user()
  - bli_malloc_intl()/bli_free_intl()
  each of which can be configured to call malloc()/free() substitutes,
  so long as the substitute functions have the same function type
  signatures as malloc() and free() defined by C's stdlib.h. The _pool()
  function is called when allocating blocks for the memory pools (used
  for packing buffers, primarily), the _user() function is called when
  obj_t's are created (via bli_obj_create() and friends), and the _intl()
  function is called for internal use by BLIS, such as when creating
  control tree nodes or temporary buffers for manipulating internal data
  structures. Substitutes for any of the three types of bli_malloc() may
  be specified by #defining the following pairs of cpp macros in
  bli_kernel.h:
  - BLIS_MALLOC_POOL/BLIS_FREE_POOL
  - BLIS_MALLOC_USER/BLIS_FREE_USER
  - BLIS_MALLOC_INTL/BLIS_FREE_INTL
  to be the name of the substitute functions. (Obviously, the object
  code that contains these functions must be provided at link-time.)
  These macros default to malloc() and free(). Subsitute functions are
  also automatically prototyped by BLIS (in bli_malloc_prototypes.h).
- Removed definitions for bli_malloc() and bli_free().
- Note that bli_malloc_pool() and bli_malloc_user() are now defined in
  terms of a new function, bli_malloc_align(), which aligns memory to an
  arbitrary (power of two) alignment boundary, but does so manually,
  whereas before alignment was performed behind the scenes by
  posix_memalign(). Currently, bli_malloc_intl() is defined in terms
  of bli_malloc_noalign(), which serves as a simple wrapper to the
  designated function that is passed in (e.g. BLIS_MALLOC_INTL).
  Similarly, there are bli_free_align() and bli_free_noalign(), which
  are used in concert with their bli_malloc_*() counterparts.
2016-05-24 13:15:32 -05:00
Field G. Van Zee
537a1f4f85 Implemented runtime contexts and reorganized code.
Details:
- Retrofitted a new data structure, known as a context, into virtually
  all internal APIs for computational operations in BLIS. The structure
  is now present within the type-aware APIs, as well as many supporting
  utility functions that require information stored in the context. User-
  level object APIs were unaffected and continue to be "context-free,"
  however, these APIs were duplicated/mirrored so that "context-aware"
  APIs now also exist, differentiated with an "_ex" suffix (for "expert").
  These new context-aware object APIs (along with the lower-level, type-
  aware, BLAS-like APIs) contain the the address of a context as a last
  parameter, after all other operands. Contexts, or specifically, cntx_t
  object pointers, are passed all the way down the function stack into
  the kernels and allow the code at any level to query information about
  the runtime, such as kernel addresses and blocksizes, in a thread-
  friendly manner--that is, one that allows thread-safety, even if the
  original source of the information stored in the context changes at
  run-time; see next bullet for more on this "original source" of info).
  (Special thanks go to Lee Killough for suggesting the use of this kind
  of data structure in discussions that transpired during the early
  planning stages of BLIS, and also for suggesting such a perfectly
  appropriate name.)
- Added a new API, in frame/base/bli_gks.c, to define a "global kernel
  structure" (gks). This data structure and API will allow the caller to
  initialize a context with the kernel addresses, blocksizes, and other
  information associated with the currently active kernel configuration.
  The currently active kernel configuration within the gks cannot be
  changed (for now), and is initialized with the traditional cpp macros
  that define kernel function names, blocksizes, and the like. However,
  in the future, the gks API will be expanded to allow runtime management
  of kernels and runtime parameters. The most obvious application of this
  new infrastructure is the runtime detection of hardware (and the
  implied selection of appropriate kernels). With contexts in place,
  kernels may even be "hot swapped" at runtime within the gks. Once
  execution enters a level-3 _front() function, the memory allocator will
  be reinitialized on-the-fly, if necessary, to accommodate the new
  kernels' blocksizes. If another application thread is executing with
  another (previously loaded) kernel, it will finish in a deterministic
  fashion because its kernel information was loaded into its context
  before computation began, and also because the blocks it checked out
  from the internal memory pools will be unaffected by the newer threads'
  reinitialization of the allocator.
- Reorganized and streamlined the 'ind' directory, which contains much of
  the code enabling use of induced methods for complex domain matrix
  multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as
  those APIs' functionality is now mostly subsumed within the global
  kernel structure.
- Updated bli_pool.c to define a new function, bli_pool_reinit_if(),
  that will reinitialize a memory pool if the necessary pool block size
  has increased.
- Updated bli_mem.c to use bli_pool_reinit_if() instead of
  bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed
  usage of contexts where appropriate to communicate cache and register
  blocksizes to bli_mem_compute_pool_block_sizes().
- Simplified control trees now that much of the information resides in
  the context and/or the global kernel structure:
  - Removed blocksize object pointers (blksz_t*) fields from all control
    tree node definitions and replaced them with blocksize id (bszid_t)
    values instead, which may be passed into a context query routine in
    order to extract the corresponding blocksize from the given context.
  - Removed micro-kernel function pointers (func_t*) fields from all
    control tree node definitions. Now, any code that needs these function
    pointers can query them from the local context, as identified by a
    level-3 micro-kernel id (l3ukr_t), level-1f kernel id, (l1fkr_t), or
    level-1v kernel id (l1vkr_t).
  - Removed blksz_t object creation and initialization, as well as kernel
    function object creation and initialization, from all operation-
    specific control tree initialization files (bli_*_cntl.c), since this
    information will now live in the gks and, secondarily, in the context.
- Removed blocksize multiples from blksz_t objects. Now, we track
  blocksize multiples for each blocksize id (bszid_t) in the context
  object.
- Removed the bool_t's that were required when a func_t was initialized.
  These bools are meant to allow one to track the micro-kernel's storage
  preferences (by rows or columns). This preference is now tracked
  separately within the gks and contexts.
- Merged and reorganized many separate-but-related functions into single
  files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and
  util directories, but has the most obvious effect of allowing BLIS
  to compile noticeably faster.
- Reorganized execution paths for level-1v, -1d, -1m, and -2 operations
  in an attempt to reduce overhead for memory-bound operations. This
  includes removal of default use of object-based variants for level-2
  operations. Now, by default, level-2 operations will directly call a
  low-level (non-object based) loop over a level-1v or -1f kernel.
- Converted many common query functions in blk_blksz.c (renamed from
  bli_blocksize.c) and bli_func.c into cpp macros, now defined in their
  respective header files.
- Defined bli_mbool.c API to create and query "multi-bools", or
  heterogeneous bool_t's (one for each floating-point datatype), in the
  same spirit as blksz_t and func_t.
- Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS
  and BLIS_SIMD_SIZE. These values are needed in order to compute a third
  new parameter, which may be set indirectly via the aforementioned
  macros or directly: BLIS_STACK_BUF_MAX_SIZE. This value is used to
  statically allocate memory in macro-kernels and the induced methods'
  virtual kernels to be used as temporary space to hold a single
  micro-tile. These values are now output by the testsuite. The default
  value of BLIS_STACK_BUF_MAX_SIZE is computed as
  "2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE".
- Cleaned up top-level 'kernels' directory (for example, renaming the
  embarrassingly misleading "avx" and "avx2" directories to "sandybridge"
  and "haswell," respectively, and gave more consistent and meaningful
  names to many kernel files (as well as updating their interfaces to
  conform to the new context-aware kernel APIs).
- Updated the testsuite to query blocksizes from a locally-initialized
  context for test modules that need those values: axpyf, dotxf,
  dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr.
- Reformatted many function signatures into a standard format that will
  more easily facilitate future API-wide changes.
- Updated many "mxn" level-0 macros (ie: those used to inline double loops
  for level-1m-like operations on small matrices) in frame/include/level0
  to use more obscure local variable names in an effort to avoid variable
  shaddowing. (Thanks to Devin Matthews for pointing these gcc warnings,
  which are only output using -Wshadow.)
- Added a conj argument to setm, so that its interface now mirrors that
  of scalm. The semantic meaning of the conj argument is to optionally
  allow implicit conjugation of the scalar prior to being populated into
  the object.
- Deprecated all type-aware mixed domain and mixed precision APIs. Note
  that this does not preclude supporting mixed types via the object APIs,
  where it produces absolutely zero API code bloat.
2016-04-11 17:21:28 -05:00
Field G. Van Zee
0b126de134 Consolidated packm_blk_var1 and packm_blk_var2.
Details:
- Consolidated the two blocked variants for packm into a single
  implementation (packm_blk_var1) and removed the other variant.
- Updated all induced method _cntl_init() functions in frame/cntl/ind/
  to use the new blocked variant 1.
- Defined two new macros, bli_is_ind_packed() and bli_is_nat_packed(),
  to detect pack_t schemas for induced methods and native execution,
  respectively.
2015-11-13 16:29:12 -06:00
Field G. Van Zee
30e5eb29e0 Minor changes to treatment of rs, cs in bli_obj.c.
Details:
- Applied a patch submitted by Devin Matthews that:
  - implements subtle changes to handling of somewhat unusual cases of
    row and column strides to accommodate certail tensor cases, which
    includes adding dimension parameters to _is_col_tilted() and
    _is_row_tilted() macros,
  - simplifies how buffers are sized when requested BLIS-allocated
    objects,
  - re-consolidates bli_adjust_strides_*() into one function, and
  - defines 'restrict' keyword as a "nothing" macro for C++ and pre-C99
    environments.
2015-11-13 12:14:19 -06:00
Field G. Van Zee
37e55ca39b Fixed obscure 3m1/4m1a bugs in trmm[3] and trsm.
Details:
- Fixed a family of bugs in the triangular level-3 operations for
  certain complex implementations (3m1 and 4m1a) that only manifest if
  one of the register blocksizes (PACKMR/PACKNR, actually) is odd:
  - Fixed incorrect imaginary stride computation in bli_packm_blk_var2()
    for the triangular case.
  - Fixed the incorrect computation of imaginary stride, as stored in
    the auxinfo_t struct in trmm and trsm macro-kernels.
  - Fixed incorrect pointer arithmetic in the trsm macro-kernels in the
    cases where the the register blocksize for the triangular matrix is
    odd. Introduced a new byte-granular pointer arithmetic macro,
    bli_ptr_add(), that computes the correct value.
- Added cpp macro to bli_macro_defs.h for typeof() operator, defined in
  terms of __typeof__, which is used by bli_ptr_add() macro.
- Disabled the row- vs. column-storage optimization in bli_trmm_front()
  for singleton problems because the inherent ambiguity of whether a
  scalar is row-stored or column-stored causes the wrong parameter
  combination code to be executed (by dumb luck of our checking for
  row storage first).
- Added commented-out debugging lines to 3m1/4m1a and reference
  micro-kernels, and trsm_ll macro-kernel.
2015-10-30 18:25:04 -05:00
Field G. Van Zee
e2e9d64a63 Load balance thread ranges for arbitrary diagonals.
Details:
- Expanded/updated interface for bli_get_range_weighted() and
  bli_get_range() so that the direction of movement is specified in the
  function name (e.g. bli_get_range_l2r(), bli_get_range_weighted_t2b())
  and also so that the object being partitioned is passed instead of an
  uplo parameter. Updated invocations in level-3 blocked variants, as
  appropriate.
- (Re)implemented bli_get_range_*() and bli_get_range_weighted_*() to
  carefully take into account the location of the diagonal when computing
  ranges so that the area of each subpartition (which, in all present
  level-3 operations, is proportional to the amount of computation
  engendered) is as equal as possible.
- Added calls to a new class of routines to all non-gemm level-3 blocked
  variants:
    bli_<oper>_prune_unref_mparts_[mnk]()
  where <oper> is herk, trmm, or trsm and [mnk] is chosen based on which
  dimension is being partitioned. These routines call a more basic
  routine, bli_prune_unref_mparts(), to prune unreferenced/unstored
  regions from matrices and simultaneously adjust other matrices which
  share the same dimension accordingly.
- Simplified herk_blk_var2f, trmm_blk_var1f/b as a result of more the
  new pruning routines.
- Fixed incorrect blocking factors passed into bli_get_range_*() in
  bli_trsm_blk_var[12][fb].c
- Added a new test driver in test/thread_ranges that can exercise the new
  bli_get_range_*() and bli_get_range_weighted_*() under a range of
  conditions.
- Reimplemented m and n fields of obj_t as elements in a "dim"
  array field so that dimensions could be queried via index constant
  (e.g. BLIS_M, BLIS_N). Adjusted/added query and modification
  macros accordingly.
- Defined mdim_t type to enumerate BLIS_M and BLIS_N indexing values.
- Added bli_round() macro, which calls C math library function round(),
  and bli_round_to_mult(), which rounds a value to the nearest multiple
  of some other value.
- Added miscellaneous pruning- and mdim_t-related macros.
- Renamed bli_obj_row_offset(), bli_obj_col_offset() macros to
  bli_obj_row_off(), bli_obj_col_off().
2015-09-24 12:14:03 -05:00
Field G. Van Zee
4dd9dd3e1d Fixed minor alignment ambiguity bug in bli_pool.c.
Details:
- Fixed a typecasting ambiguity in bli_pool_alloc_block() in which
  pointer arithmetic was performed on a void* as if it were a byte
  pointer (such as char*). Some compilers may have already been
  interpreting this situation as intended, despite the sloppiness.
  Thanks to Aleksei Rechinskii for reporting this issue.
- Redefined pointer alignment macros to typecast to uintptr_t instead of
  siz_t.
2015-08-21 11:52:37 -05:00
Field G. Van Zee
26a4b8f6f9 Implemented 3m2, 3m3 induced algorithms (gemm only).
Details:
- Defined a new "3ms" (separated 3m) pack schema and added appropriate
  support in packm_init(), packm_blk_var2().
- Generalized packm_struc_cxk_3mi to take the imaginary stride (is_p)
  as an argument instead of computing it locally. Exception: for trmm,
  is_p must be computed locally, since it changes for triangular
  packed matrices. Also exposed is_p in interface to dt-specific
  packm_blk_var2 (and _var1, even though it does not use imaginary
  stride).
- Renamed many functions/variables from _3mi to _3mis to indicate that
  they work for either interleaved or separated 3m pack schemas.
- Generalized gemm and herk macro-kernels to pass in imaginary stride
  rather than compute them locally.
- Added support for 3m2 and 3m3 algorithms to frame/ind, including 3m2-
  and 3m3-specific virtual micro-kernels.
- Added special gemm macro-kernels to support 3m2 and 3m3.
- Added support for 3m2 and 3m3 to testsuite.
- Corrected the type of the panel dimension (pd_) in various macro-
  kernels from inc_t to dim_t.
- Renamed many functions defined in bli_blocksize.c.
- Moved most induced-related macro defs from frame/include to
  frame/ind/include.
- Updated the _ukernel.c files so that the micro-kernel function pointers
  are obtained from the func_t objects rather than the cpp macros that
  define the function names.
- Updated test/3m4m driver, Makefile, and run script.
2015-04-01 10:44:54 -05:00
Field G. Van Zee
441d47542a Renamed 3m and 4m symbols/macros to 3mi and 4mi.
Details:
- Renamed several variables and macros from 3m/4m to 3mi/4mi. This is
  because those packing schemas were always implicitly "interleaved".
  This new naming scheme will make way for new schemas that separate
  instead of interleve the real and imaginary (and summed) parts.
- Expanded the pack format sub-field of the pack schema field of the
  info_t to 4 bits (from 3). This will allow for more schema types
  going forward.
- Removed old _cntl.c files for herk3m, herk4m, trmm3m, trmm4m.
2015-02-19 17:06:10 -06:00
Field G. Van Zee
e171504a72 Use correct definition of bli_is_last_iter().
Details:
- As intended for previous commit, the new definition of
  bli_is_last_iter() is now disabled in favor of the old
  definition.
2014-10-17 11:25:59 -05:00
Field G. Van Zee
0d954087b2 Minor changes and fixes.
Details:
- Redefined bli_is_last_iter() to take thread_id and num_thread
  arguments, which allows the macro to correctly compute whether a
  given iteration is the last that the thread will compute in that
  particular loop. The new definition, however, remains disabled
  (commented out) until someone can look at this more closely, as
  the new definition seems to actually hurt performance slightly.
- Whitespace and related updates to level-3 macro-kernels.
- Updated test suite so that performance results in the hundreds of
  gigaflops does not disrupt the column alignment of the output.
2014-10-17 11:19:34 -05:00
Field G. Van Zee
e9899be090 Added high-level implementations of 4m, 3m.
Details:
- Added "4mh" and "3mh" APIs, which implement the 4m and 3m methods at
  high levels, respectively. APIs for trmm and trsm were NOT added due
  to the fact that these approaches are inherently incompatible with
  implementing 4m or 3m at high levels (because the input right-hand
  side matrix is overwritten).
- Added 4mh, 3mh virtual micro-kernels, and updated the existing 4m and
  3m so that all are stylistically consistent.
- Added new "rih" packing kernels (both low-level and structure-aware)
  to support both 4mh and 3mh.
- Defined new pack_t schemas to support real-only, imaginary-only, and
  real+imaginary packing formats.
- Added various level0 scalar macros to support the rih packm kernels.
- Minor tweaks to trmm macro-kernels to facilitate 4mh and 3mh.
- Added the ability to enable/disable 4mh, 3m, and 3mh, and adjusted
  level-3 front-ends to check enabledness of 3mh, 3m, 4mh, and 4m (in
  that order) and execute the first one that is enabled, or the native
  implementation if none are enabled.
- Added implementation query functions for each level-3 operation so
  that the user can query a string that describes the implementation
  that is currently enabled.
- Updated test suite to output implementation types for reach level-3
  operation, as well as micro-kernel types for each of the five micro-
  kernels.
- Renamed BLIS_ENABLE_?COMPLEX_VIA_4M macros to _ENABLE_VIRTUAL_?COMPLEX.
- Fixed an obscure bug when packing Hermitian matrices (regular packing
  type) whereby the diagonal elements of the packed micro-panels could
  get tainted if the source matrix's imaginary diagonal part contained
  garbage.
2014-09-16 18:19:32 -05:00
Field G. Van Zee
7fc48a7d92 Combined 4m/3m bits into an expanded bitfield.
Details:
- Combined the 4m/3m bits into an expanded bitfield, which will encode
  the packing "format" of the micro-panels. This will allow for more
  easily and compactly encoding additional formats.
- Other minor comment/whitespace updates to bli_type_defs.h.
- Updated bli_obj_macro_defs.h and bli_param_macro_defs.h to use the new
  format bitfield.
- Comment update to bli_kernel_post_macro_defs.h.
- Whitespace changes to bli_kernel_3m_macro_defs.h, _4m_macro_defs.h.
2014-08-23 16:50:58 -05:00
Field G. Van Zee
137143345d Reimplemented unit blocksize fix in prev commit.
Details:
- Instead of inferring the storage format of the micro-panels from within
  the packm variants, we now pass in a bool_t value that denotes whether
  the packed matrix contains row-stored column panels or column-stored
  row panels. This value can then be tested more easily inside the main
  packm variant loop.
- Renumbered pack_t schema values in bli_type_defs.h so that there are
  now five bits, each with different meaning:
  - 4: packed or not packed?
  - 3: packed for 3m?
  - 2: packed for 4m?
  - 1: packed to panels?
  - 0: stored by rows or columns?
- Added new macros that test for status of above bits in schema bit
  subfield, and renamed some existing macros related to 4m/3m.
2014-07-31 12:12:45 -05:00
Field G. Van Zee
a51e32ec06 Fixed unit register blocksize brokenness.
Details:
- Fixed a breakdown in BLIS's ability to differentiate between row-stored
  and column-stored micro-panels when MR or NR is unit. When either
  register blocksize (or both) is equal to one, inspecting the strides of
  the affected packed micro-panel is no longer sufficient to determine
  whether the micro-panel is a row-stored column panel or a column-stored
  row panel (because both strides are unit). At that point, dimension
  information is necessary when invoking the bli_is_row_stored_f() and
  bli_is_col_stored_f() macros (and their "obj" counterparts). Thanks to
  Ilya Polkovnichenko for reporting this bug.
- Added panel dimensions (m and n) to obj_t, which are set in
  packm_init() and then passed into the blocked variants to support the
  aforementioned update.
2014-07-30 10:41:48 -05:00
Field G. Van Zee
7ed415824d Updated copyright headers (continued).
Details:
- Inserted "at Austin" into third clause of license declarations.
  Meant to include this change in previous commit.
2014-07-14 16:14:33 -05:00
Field G. Van Zee
5c2c6c8561 Updated copyright headers to contain "at Austin".
Details:
- Updated copyright headers to include "at Austin" in the name of the
  University of Texas.
- Updated the copyright years of a few headers to 2014 (from 2011 and
  2012).
2014-07-14 16:05:03 -05:00
Field G. Van Zee
47a90e69df Attempted to fix uninitialized variable warnings.
Details:
- Added initialization statements to various macros used in level 1m and
  1m-like operations. I wasn't able to reproduce the reported behavior,
  so hopefully this takes care of it. Thanks to Jeff Hammond for the
  report.
2014-04-01 14:34:31 -05:00
Field G. Van Zee
a3902750b9 Reorganized norm operations.
Details:
- Completely reoganized norm operations:
  - Renames:
    - fnormsc, fnormv, fnormm -> normfsc, normfv, normfm (2-norm)
    - absumv -> norm1v (vector 1-norm)
  - New operations:
    - norm1m (matrix 1-norm)
    - normiv, normim (infinity-norm)
    - amaxv (BLAS-like absolute maximum value index)
    - asumv (BLAS-like absolute sum)
- Deprecated absumm, as it did not correspond to any actual norm.
  (However, an inlined version now exists in the testsuite module for
  randm.)
2014-03-19 12:35:17 -05:00
Field G. Van Zee
c663ce3b51 Fixed various bugs when C99 complex is enabled.
Details:
- Fixed various bugs in packm_*_cxk(), the 4m/3m micro-kernels, and
  elsewhere in the framework that were not yet set up to work properly
  when BLIS_ENABLE_C99_COMPLEX is defined in bli_config.h
- Extensive changes to f2c-derived files in frame/compat/f2c to allow
  C99 complex storage. Most of these changes center around accessing
  real and imaginary components via bli_?real()/bli_?imag() accessor
  macros, and setting of values via bli_?sets() assignment macros.
  (Thanks to Vladimir Sukarev for pointing out that _ENABLE_C99_COMPLEX
  was broken.)
2014-02-27 16:32:57 -06:00
Field G. Van Zee
c255a293e2 Consolidated packm_blk_var2 and var3.
Details:
- Consolidated the functionality previously supported by packm_blk_var2()
  and packm_blk_var3() into a new variant, packm_blk_var1().
- Updates to packm_gen_cxk(), packm_herm_cxk.c(), and packm_tri_cxk()
  to accommodate above changes.
- Removed packm_blk_var3() and retired packm_blk_var2() to
  frame/1m/packm/old.
- Updated all level-3 _cntl_init() functions so that the new, more
  versatile packm_blk_var1 is used for all level-3 matrix packing.
2014-02-10 14:31:24 -06:00
Field G. Van Zee
2cb13600f9 Updated year in copyright headers to 2014. 2014-01-03 12:29:13 -06:00
Field G. Van Zee
e3a6c7e776 Macroized conditionals for a2/b2 in macro-kernels.
Details:
- Replaced conditional expressions in macro-kernels related to computing
  the addresses a2 and b2 (a_next and b_next) with a preprocessor macro
  invocation, bli_is_last_iter(), that tests the same condition.
- Updated gemm_ukr module to use auxinfo_t argument.
- Whitespace changes in test suite ukr modules.
2013-12-19 16:29:31 -06:00