Commit Graph

228 Commits

Author SHA1 Message Date
Field G. Van Zee
e88a5b8da8 Implemented castm, castv operations.
Details:
- Implemented castm and castv operations, which behave like copym and
  copyv except where the obj_t operands can be of different datatypes.
  These new operations, however, unlike copym/copyv, do not build upon
  existing level-1v kernels.
- Reorganized projm, projv into a 'proj' subdirectory of frame/base (to
  match the newly added frame/base/cast directory).
- Added new macros to bli_gentfunc_macro_defs.h, _gentprot_macro_defs.h
  that insert GENTFUNC2/GENTPROT2 macros for all non-homogeneous datatype
  combinations. Previously, one had to invoke two additional macros--one
  which mixed domains only and another that included all remaining
  cases--in order to get full type combination coverage.
- Defined a new static function, bli_set_dims_incs_2m(), to aid in the
  setting of various variables in the implementations of bli_??castm().
  This static function joins others like it in bli_param_macro_defs.h.
- Comment update to bli_copysc.h.
2018-06-18 15:56:26 -05:00
Field G. Van Zee
f88c2e7a53 Defined static function bli_blksz_scale_def_max().
Details:
- Added a new static function to bli_blksz.h that scales both the default
  (regular) blocksize as well as the maximum blocksize in the blksz_t
  object. Reminder: maximum blocksizes have different meanings in
  different contexts. For register blocksizes, they refer to the packing
  register blocksizes (PACKMR or PACKNR) while for cache blocksizes, they
  refer to the maximum blocksize to use during the final iteration of a
  loop.
2018-06-13 18:27:46 -05:00
Field G. Van Zee
87db5c048e Changed usage of virtual microkernel slots in cntx.
Details:
- Changed the way virtual microkernels are handled in the context.
  Previously, there were query routines such as bli_cntx_get_l3_ukr_dt()
  which returned the native ukernel for a datatype if the method was
  equal to BLIS_NAT, or the virtual ukernel for that datatype if the
  method was some other value. Going forward, the context native and
  virtual ukernel slots will both be initialized to native ukernel
  function pointers for native execution, and for non-native execution
  the virtual ukernel pointer will be something else. This allows us
  to always query the virtual ukernel slot (from within, say, the
  macrokernel) without needing any logic in the query routine to decide
  which function pointer (native or virtual) to return. (Essentially,
  the logic has been shifted to init-time instead of compute-time.)
  This scheme will also allow generalized virtual ukernels as a way
  to insert extra logic in between the macrokernel and the native
  microkernel.
- Initialize native contexts (in bli_cntx_ref.c) with native ukernel
  function addresses stored to the virtual ukernel slots pursuant to
  the above policy change.
- Renamed all static functions that were native/virtual-ambiguous, such
  as bli_cntx_get_l3_ukr_dt() or bli_cntx_l3_ukr_prefers_cols_dt()
  pursuant to the above polilcy change. Those routines now use the
  substring "get_l3_vir_ukr" in their name instead of "get_l3_ukr". All
  of these functions were static functions defined in bli_cntx.h, and
  most uses were in level-3 front-ends and macrokernels.
- Deprecated anti_pref bool_t in context, along with related functions
  such as bli_cntx_l3_ukr_eff_dislikes_storage_of(), now that 1m's
  panel-block execution is disabled.
2018-06-12 19:38:37 -05:00
Field G. Van Zee
dbaf440540 Merge branch 'master' into dev 2018-06-11 12:37:04 -05:00
Field G. Van Zee
043d0cd37e Implemented bli_acquire_mpart(), added example code.
Details:
- Implemented bli_acquire_mpart(), a general-purpose submatrix view
  function that will alias an obj_t to be a submatrix "view" of an
  existing obj_t.
- Renumbered examples in examples/oapi and inserted a new example file,
  03obj_view.c, which shows how to use bli_acquire_mpart() to obtain
  submatrix views of existing objects, which can then be used to
  indirectly modify the parent object.
2018-06-09 13:46:49 -05:00
Field G. Van Zee
65fae95074 Implemented bli_setrm, _setim, _setrv, _setiv.
Details:
- Defined new wrappers to setm/setv operations in frame/base/bli_setri.c
  that will target only the real or only the imaginary parts of a
  matrix/vector object.
- Updated bli_obj_real_part() so that the complex-specific portions of
  the function are not executed if the object is real.
- Defined bli_obj_imag_part().
  - Caveat: If bli_obj_imag_part() is called on a real object, it does
    nothing, leaving the destination object untouched. The caller must
    take care to only call the function on complex objects.
- Reordered some of the static functions in bli_obj_macro_defs.h related
  to aliasing.
2018-06-07 17:41:09 -05:00
Field G. Van Zee
513138b1a1 Defined/implemented bli_projv().
Details:
- Added an implementation for bli_projv() to go along with the
  implementation of bli_projm() added in 0a4a27e. The only difference
  between the two is that bli_projv() may only be used on vectors,
  whereas bli_projm() is general-purpose.
- Added a _check() function corresponding to bli_projv().
2018-06-07 12:24:47 -05:00
Field G. Van Zee
b5a641e968 Added char-to-dt and dt-to-char mapping functions.
Details:
- Defined additional functions in bli_param_map.c:
    bli_param_map_char_to_blis_dt()
    bli_param_map_blis_to_char_dt()
  which will map a char to its corresponding num_t, or vice versa.
2018-06-06 19:05:37 -05:00
Field G. Van Zee
0a4a27e1a4 Defined/implemented bli_projm().
Details:
- Defined a new operation in frame/base/bli_proj.c, bli_projm(), which
  behaves like bli_copym(), except that operands a and b are allowed to
  contain data of differing domains (e.g. a is real while b is complex,
  or vice versa). The file is named bli_proj.c, rather than bli_projm.c,
  with the intention that a 'v' vector version of the function may be
  added to the same file (at some point in the future).
- Added supporting bli_check_*() functions in bli_check.c to confirm
  consistent precisions between to datatypes/objects, as well as the
  appropriate error message in bli_error.c and a new error code in
  bli_type_defs.h.
- Wrote a bli_projm_check() function to go along with bli_projm().
- Defined static function bli_obj_real_part() in bli_obj_macro_defs.h,
  which will initialize an obj_t alias to the real part of the source
  object.
- Fixed a bug in the static function bli_dt_proj_to_complex(), found
  in bli_param_macro_defs.h. Thankfully, there were no calls to the
  function to produce buggy behavior.
2018-06-06 19:02:29 -05:00
Field G. Van Zee
5df201260f Merge branch 'master' into dev 2018-06-05 16:14:19 -05:00
Tyler Michael Smith
96d2774b4c Make bli_auxinfo_next_b() return b_next, not a_next (#216) 2018-06-05 07:17:39 -05:00
Field G. Van Zee
ed7dedfd4a Merge branch 'master' into dev 2018-06-02 20:29:53 -05:00
Field G. Van Zee
f97a86f322 Updated setting/querying pack schema (cntx->cntl).
- Query pack schemas in level-3 bli_*_front() functions and store those
  values in the schema bitfields of the correponding obj_t's when the
  cntx's method is not BLIS_NAT. (When method is BLIS_NAT, the default
  native schemas are stored to the obj_t's.)
- In bli_l3_cntl_create_if(), query the schemas stored to the obj_t's in
  bli_*_front(), clear the schema bitfields, and pass the queried values
  into bli_gemm_cntl_create() and bli_trsm_cntl_create().
- Updated APIs for bli_gemm_cntl_create() and bli_trsm_cntl_create() to
  take schemas for A and B, and use these values to initialize the
  appropriate control tree nodes. (Also cpp-disabled the panel-block cntl
  tree creation variant, bli_gemmpb_cntl_create(), as it has not been
  employed by BLIS in quite some time.)
- Simplified querying of schema in bli_packm_init() thanks to above
  changes.
- Updated openmp and pthreads definitions of bli_l3_thread_decorator()
  so that thread-local aliases of matrix operands are guaranteed, even
  if aliasing is disabled within the internal back-end functions (e.g.
  bli_gemm_int.c). Also added a comment to bli_thrcomm_single.c
  explaining why the extra aliasing is not needed there.
- Change bli_gemm() and level-3 friends so that the operation's ind()
  function is called only if all matrix operands have the same datatype,
  and only if that datatype is complex. The former condition is needed
  in preparation for work related to mixed domain operands, while the
  latter helps with readability, especially for those who don't want to
  venture into frame/ind.
- Reshuffled arguments in bli_cntx_set_thrloop_from_env() to be
  consistent with BLIS calling conventions (modified argument(s) are
  last), and updated all invocations in the level-3 _front() functions.
- Comment updates to bli_cntx_set_thrloop_from_env().
2018-06-02 20:28:20 -05:00
Field G. Van Zee
965db85d29 Updated macro invocations in bli_gemm_ker_var2.c.
Details:
- Updated "get next a/b micropanel" macro invocations in
  bli_gemm_ker_var2.c according to changes in 9588625.
- Comment update in bli_cntx.c.
2018-06-01 12:32:15 -05:00
Field G. Van Zee
469727d4f8 Very minor comment updates. 2018-05-25 16:17:13 -05:00
Field G. Van Zee
5140ee3424 Updated types of bli_is_[un]aligned_to() functions.
Details:
- Changed the void* arguments of the following static functions:
    bli_is_aligned_to()
    bli_is_unaligned_to()
    bli_offset_past_alignment()
  to siz_t, and the return type of bli_offset_past_alignment() from
  guint_t to siz_t. This allows for more versatile usage of these
  functions (e.g. when aligning both pointers and leading dimension).
- Updated all invocations of these functions, mostly in kernels/penryn
  but also in kernels/bgq, to include explicit typecasts to siz_t when
  pointer arguments are passed in.
- Thanks to Devin Matthews for pointing out this potential bug (via issue
  #211).
- Deleted a few trailing spaces in various penryn kernels.
- Removed duplicate instances of the words "derived" and "THEORY" from
  various kernel license headers, likely from a malformed recursive sed
  performed long ago.
2018-05-23 16:56:14 -05:00
Field G. Van Zee
962a706a6f Updated LICENSE file to mention HP Enterprise.
Details:
- Added HP Enterprise to the LICENSE file. Previously, only the source
  files touched by HPE contained the corresponding copyright notices.
  (This oversight was unintentional.)
- Updated file-level copyright notices to include a comma, to match
  the formatting used for UT and AMD copyrights.
2018-05-18 18:19:40 -05:00
Field G. Van Zee
af244194e7 Removed explicit critical sec. from bli_memsys.c.
Details:
- Removed critical sections protecting the initialization/finalization of
  bli_memsys.c. These synchronization mechanisms are no longer needed now
  that BLIS initializes all APIs via pthread_once().
2018-05-17 15:38:02 -05:00
Field G. Van Zee
10c9e8f952 Cache hardware's arch_t id after querying once.
Details:
- Added logic to bli_arch.c that will call what was previously the body
  of bli_arch_query_id() only once and then cache the value in a static
  variable local to the file. (Previously, the arch_t associated with
  the hardware/configuration was queried every time bli_arch_query_id()
  was called, which was at least once per level-3 function call. Thanks
  to Devin Matthews for suggesting this feature via issue #175.
- Added -lpthread to the compile/link command line of the compiler
  invocation that compiles build/detect/config/config_detect.c, which
  prints the string identifying the detected configuration, since it
  is now needed due to new pthread_once() logic in bli_arch.c.
- Implementation note: I chose to implement this arch_t caching feature
  via pthread_once(), using a separate pthread_once_t variable local to
  the file, rather than calling bli_init_once(). The reason is that I
  did not want to require bli_init() as a prerequisite to this function.
  bli_init() already calls several sub-components, some of which make use
  of bli_arch_query_id(), and therefore it would be easy to fall into a
  circular self-init situation (which usually causes pthreads to hang
  indefinitely).
2018-05-17 15:22:51 -05:00
Field G. Van Zee
4fb353bd90 Merge branch 'master' into dev 2018-05-13 17:50:51 -05:00
Field G. Van Zee
bf03503059 Renamed (shortened) a few build system variables.
Details:
- Renamed the following variables in config.mk (via build/config.mk.in):
    BLIS_ENABLE_VERBOSE_MAKE_OUTPUT -> ENABLE_VERBOSE
    BLIS_ENABLE_STATIC_BUILD        -> MK_ENABLE_STATIC
    BLIS_ENABLE_SHARED_BUILD        -> MK_ENABLE_SHARED
    BLIS_ENABLE_BLAS2BLIS           -> MK_ENABLE_BLAS
    BLIS_ENABLE_CBLAS               -> MK_ENABLE_CBLAS
    BLIS_ENABLE_MEMKIND             -> MK_ENABLE_MEMKIND
  and also renamed all uses of these variables in makefiles and makefile
  fragments. Notice that we use the "MK_" prefix so that those variables
  can be easily differentiated (such as via grep) from their "BLIS_" C
  preprocessor macro counterparts.
- Other whitespace changes to build/config.mk.in.
- Renamed the following C preprocessor macros in bli_config.h (via
  build/bli_config.h.in):
    BLIS_ENABLE_BLAS2BLIS        -> BLIS_ENABLE_BLAS
    BLIS_DISABLE_BLAS2BLIS       -> BLIS_DISABLE_BLAS
    BLIS_BLAS2BLIS_INT_TYPE_SIZE -> BLIS_BLAS_INT_TYPE_SIZE
  and also renamed all relevant uses of these macros in BLIS source
  files.
- Renamed "blas2blis" variable occurrences in configure to "blas", as
  was done in build/config.mk.in and build/bli_config.h.in.
- Renamed the following functions in frame/base/bli_info.c:
    bli_info_get_enable_blas2blis() -> bli_info_get_enable_blas()
    bli_info_get_blas2blis_int_type_size()
                                    -> bli_info_get_blas_int_type_size()
- Remove bli_config.h during 'make cleanh' target of top-level Makefile.
2018-05-08 16:49:22 -05:00
Field G. Van Zee
4b36e85be9 Converted function-like macros to static functions.
Details:
- Converted most C preprocessor macros in bli_param_macro_defs.h and
  bli_obj_macro_defs.h to static functions.
- Reshuffled some functions/macros to bli_misc_macro_defs.h and also
  between bli_param_macro_defs.h and bli_obj_macro_defs.h.
- Changed obj_t-initializing macros in bli_type_defs.h to static
  functions.
- Removed some old references to BLIS_TWO and BLIS_MINUS_TWO from
  bli_constants.h.
- Whitespace changes in select files (four spaces to single tab).
2018-05-08 14:26:30 -05:00
Field G. Van Zee
75d0d1057d Renamed various datatype-related macros/functions.
Details:
- Renamed the following macros in bli_obj_macro_defs.h and
  bli_param_macro_defs.h:
  - bli_obj_datatype()                 -> bli_obj_dt()
  - bli_obj_target_datatype()          -> bli_obj_target_dt()
  - bli_obj_execution_datatype()       -> bli_obj_exec_dt()
  - bli_obj_set_datatype()             -> bli_obj_set_dt()
  - bli_obj_set_target_datatype()      -> bli_obj_set_target_dt()
  - bli_obj_set_execution_datatype()   -> bli_obj_set_exec_dt()
  - bli_obj_datatype_proj_to_real()    -> bli_obj_dt_proj_to_real()
  - bli_obj_datatype_proj_to_complex() -> bli_obj_dt_proj_to_complex()
  - bli_datatype_proj_to_real()        -> bli_dt_proj_to_real()
  - bli_datatype_proj_to_complex()     -> bli_dt_proj_to_complex()
- Renamed the following functions in bli_obj.c:
  - bli_datatype_size()                -> bli_dt_size()
  - bli_datatype_string()              -> bli_dt_string()
  - bli_datatype_union()               -> bli_dt_union()
- Removed a pair of old level-1f penryn intrinsics kernels that were no
  longer in use.
2018-04-30 14:57:33 -05:00
Field G. Van Zee
d6ab25a323 Add setijm, getijm operations.
Details:
- Added bli_setgetijm.c, which defines bli_setijm(), bli_getijm(), and
  related functions that can be used to read and write individual
  elements of an obj_t.
- Defined a new function, bli_obj_create_conf_to(), in bli_obj.c that will
  create a new object with dimensions conformal to an existing object.
  Transposition and conjugation states on the existing object are ignored,
  as are structure and uplo fields.
- Defined a new function, bli_datatype_string(), in bli_obj.c that returns
  a char* to a string representation of the name of each num_t datatype.
  For example, BLIS_DOUBLE is "double" and BLIS_DCOMPLEX is "dcomplex".
  BLIS_INT is included (as "int"), but BLIS_CONSTANT is not, and thus is
  not a valid input argument to bli_datatype_string().
- Added calls to bli_init_once() to various functions in bli_obj.c, the
  most important of which was bli_obj_create_without_buffer().
- Removed unintended/extra newline from the end of printv output.
- Whitespace changes to
  - frame/base/bli_machval.c
  - frame/base/bli_machval.h
  - frame/0/copysc/bli_copysc.c
- Trivial changes to README.md and common.mk.
2018-04-24 18:43:03 -05:00
Field G. Van Zee
cb7ed90752 Convert op names to uppercase before calling xerbla_().
Details:
- Defined a new function, bli_string_mkupper(), that calls toupper() on
  every non-NULL character in a string.
- Call bli_string_mkupper() prior to calling xerbla_() in the level-2/-3
  BLAS _check() macros. This prevents the BLAS testsuite from complaining
  that the operation name (e.g. "dgemm") does not match the expected
  value (e.g. "DGEMM"). Thanks to Dave Love for reporting this issue.
2018-03-16 13:05:56 -05:00
Devin Matthews
8f2fabec80 Make arm32 and arm64 families work. (#176) 2018-03-14 17:43:42 -05:00
Field G. Van Zee
1a3031740f Updates to ARM hardware detection support.
Details:
- Updated/clarified the ARM preprocessor macro branch of bli_cpuid.c.
  Going forward, cortexa57 (64-bit), cortexa15, and cortexa9 (32-bit)
  sub-configurations are supported. However, the functions that detect
  features specific to a15 and a9 are identical, and since a15 is tested
  first, it will always be chosen for arm32 hardware (even if both
  sub-configurations were enabled at configure-time and the library is
  linked and run on an a9). Thus, more work needs to be done to
  distinguish these two.
- Added cpp guard around x86_64 portions of bli_cpuid.c. Now, either
  the x86_64 or ARM code will be compiled (or neither, if neither
  environment is detected).
- In bli_arch_query_id(), call bli_cpuid_query_id() when the
  BLIS_FAMILY_ARM64 or BLIS_FAMILY_ARM32 macros are defined.
- Added arm64 and arm32 configuration families to config_registry.
- Added a note to the arch_t typedef enum in bli_type_defs.h reminding
  the developer to update the string array in bli_arch.c whenever new
  enum values are added or existing values are reordered.
2018-03-13 16:04:40 -05:00
Field G. Van Zee
1a8350f705 Fixed cache blocksize bug in knl configuration.
Details:
- Changed the mc blocksize for double real execution in the knl sub-
  configuration from 160 to 148. The old value was not a multiple of
  mr (which is 24), and thus the safeguards in bli_gks_register_cntx()
  were tripping. Thanks for Dave Love for reporting this issue.
- Switch knl sub-configuration to use default blocksizes for datatypes
  not supported by native kernels.
- Fixed typos in bli_error.c that prevented certain error strings
  (which report maximum cache blocksizes not being multiples of their
  corresponding register blocksize) from properly initializing.
2018-03-05 13:32:00 -06:00
Field G. Van Zee
16813335bd Merge branch 'amd' into rt
Details:
- Merged contributions made by AMD via 'amd' branch (see summary below).
  Special thanks to AMD for their contributions to-date, especially with
  regard to intrinsic- and assembly-based kernels.
- Added column storage output cases to microkernels in
  bli_gemm_zen_asm_d6x8.c and bli_gemmtrsm_l_zen_asm_d6x8.c. Even with
  the extra cost of transposing the microtile in registers, this is
  much faster than using the general storage case when the underlying
  matrix is column-stored.
- Added s and d assembly-based zen gemmtrsm_u microkernel (including
  column storage optimization mentioned above).
- Updated zen sub-configuration to reflect presence of new native
  kernels.
- Temporarily reverted zen sub-configuration's level-3 cache blocksizes
  to smaller haswell values.
- Temporarily disabled small matrix handling for zen configuration
  family in config/zen/bli_family_zen.h.
- Updated zen CFLAGS according to changes in 1e4365b.
- Updated haswell microkernels such that:
  - only one vzeroupper instruction is called prior to returning
  - movapd/movupd are used in leiu of movaps/movups for double-real
    microkernels. (Note that single-real microkernels still use
    movaps/movups.)
- Added kernel prototypes to kernels/zen/bli_kernels_zen.h, which is
  now included via frame/include/bli_arch_config.h.
- Minor updates to bli_amaxv_ref.c (and to inlined "test" implementation
  in testsuite/src/test_amaxv.c).
- Added early return for alpha == 0 in bli_dotxv_ref.c.
- Integrated changes from f07b176, including a fix for undefined
  behavior when executing the 1m method under certain conditions.
- Updated config_registry; no longer need haswell kernels for zen
  sub-configuration.
- Tweaked marginal and pass thresholds for dotxf.
- Reformatted level-1v, -1f, and -3 amd kernels and inserted additional
  comments.
- Updated LICENSE file to explicitly mention that parts are copyright
  UT-Austin and AMD.
- Added AMD copyright to header templates in build/templates.

Summary of previous changes from 'amd' branch.
- Added s and d assembly-based zen gemm microkernels (d6x8 and d8x6) and
  s and d assembly-based zen gemmtrsm_l microkernels (d6x8).
- Added s and d intrinsics-based zen kernels for amaxv, axpyv, dotv, dotxv,
  and scalv, with extra-unrolling variants for axpyv and scalv.
- Added a small matrix handler to bli_gemm_front(), with the handler
  implemented in kernels/zen/3/bli_gemm_small_matrix.c.
- Added additional logic to sumsqv that first attempts to compute the
  sum of the squares via dotv(). If there is a floating-point exception
  (FE_OVERFLOW), then the previous (numerically conservative) code is
  used; otherwise, the result of dotv() is square-rooted and stored as
  the result. This new implementation is only enabled when FE_OVERFLOW
  is #defined. If the macro is not #defined, then the previous
  implementation is used.
- Added axpyv and dotv standalone test drivers to test directory.
- Added zen support to old cpuid_x86.c driver in build/auto-detect/old.
- Added thread-local and __attribute__-related macros to bli_macro_defs.h.
2018-02-21 17:43:32 -06:00
Field G. Van Zee
0b9c5127e9 Enabled C99, added stdint.h to auto-detect build.
Details:
- Added "-std=c99" to compiler arguments when building auto-detection
  driver in configure script.
- Added #include <stdint.h> to all three source files needed by auto-
  detection program.
2017-12-23 15:53:44 -06:00
Field G. Van Zee
0ce5e19c31 Reimplemented configure-time hardware detection.
Details:
- Reimplemented the hardware detection functionality invoked when running
  "./configure auto". Previously, a standalone script in build/auto-detect
  that used CPUID was used. However, the script attempted to enumerate all
  models for each microarchitecture supported. The new approach recycles
  the same code used for runtime hardware detection introduced in 2c51356.
  This has two immediate benefits. First, it reduces and consolidates the
  code required to detect microarchitectures via the CPUID instruction.
  Second, it provides an indirect way of testing at configure-time the
  code that is used to detect hardware at runtime. This code is (a) only
  activated when targeting a configuration family (such as intel64 or
  amd64) at configure-time and (b) somewhat difficult to test in
  practice, since it relies on having access to older microarchitectures.
- The above change required placing conditional cpp macro blocks in
  bli_arch.c and bli_cpuid.c which either #include "blis.h" or #include
  a bare-bones set of headers that does not rely on the presence of a
  bli_config.h header. This is needed because bli_config.h has not been
  created yet when configure-time auto-detection takes places.
- Defined a new function in bli_arch.c, bli_arch_string(), which takes
  an arch_t id and returns a pointer to a string that contains the
  lowercase name of the corresponding microarchitecture. This function
  is used by the auto-detection script to printf() the name of the
  sub-configuration corresponding to the detected hardware.
2017-12-23 15:32:03 -06:00
Field G. Van Zee
9804adfd40 Added option to disable pack buffer memory pools.
Details:
- Added a new configure option, --[en|dis]able-packbuf-pools, which will
  enable or disable the use of internal memory pools for managing buffers
  used for packing. When disabled, the function specified by the cpp
  macro BLIS_MALLOC_POOL is called whenever a packing buffer is needed
  (and BLIS_FREE_POOL is called when the buffer is ready to be released,
  usually at the end of a loop). When enabled, which was the status quo
  prior to this commit, a memory pool data structure is created and
  managed to provide threads with packing buffers. The memory pool
  minimizes calls to bli_malloc_pool() (i.e., the wrapper that calls
  BLIS_MALLOC_POOL), but does so through a somewhat more complex
  mechanism that may incur additional overhead in some (but not all)
  situations. The new option defaults to --enable-packbuf-pools.
- Removed the reinitialization of the memory pools from the level-3
  front-ends and replaced it with automatic reinitialization within the
  pool API's implementation. This required an extra argument to
  bli_pool_checkout_block() in the form of a requested size, but hides
  the complexity entirely from BLIS. And since bli_pool_checkout_block()
  is only ever called within a critical section, this change fixes a
  potential race condition in which threads using contexts with different
  cache blocksizes--most likely a heterogeneous environment--can check
  out pool blocks that are too small for the submatrices it wishes to
  pack. Thanks to Nisanth Padinharepatt for reporting this potential
  issue.
- Removed several functions in light of the relocation of pool reinit,
  including bli_membrk_reinit_pools(), bli_memsys_reinit(),
  bli_pool_reinit_if(), and bli_check_requested_block_size_for_pool().
- Updated the testsuite to print whether the memory pools are enabled or
  disabled.
2017-12-21 19:22:57 -06:00
Field G. Van Zee
83316485ce Simplified/fixed self-initialization.
Details:
- Fixed a race condition in self-initialization whereby the bli_is_init
  static variable could be erroneously read as TRUE by thread 1 while
  thread 0 is still executing bli_init_apis(), thus allowing thread 1 to
  use the library before it is actually ready. Thanks to to Minh Quan Ho
  and Devin Matthews for pointing out this issue.
- Part of the solution to the aforementioned race condition was involved
  replacing the runtime initialization of the global scalar constants
  (e.g., BLIS_ONE, BLIS_ZERO, etc.) in bli_const.c with a static
  initialization of those same constants. This eliminates the need for
  bli_const_init() altogether. (The static initialization is made concise
  via preprocess macros.)
- Defined bli_gks_query_cntx_noinit(), which behaves just like
  bli_gks_query_cntx(), except that it does not call bli_init_once(). This
  function is called in lieu of bli_gks_query_cntx() in bli_ind_init() and
  bli_memsys_init() so as to not result in any recursion into
  bli_init_once().
- Removed BLIS_ONE_HALF, BLIS_MINUS_ONE_HALF global scalar constants.
  They have no use in BLIS or its test products, and we have little reason
  to believe they are used by others.
- Removed testsuite/out file, which was accidentally committed as part
  of 70640a3.
2017-12-13 14:14:50 -06:00
Field G. Van Zee
70640a3710 Implemented library self-initialization.
Details:
- Defined two new functions in bli_init.c: bli_init_once() and
  bli_finalize_once(). Each is implemented with pthread_once(), which
  guarantees that, among the threads that pass in the same pthread_once_t
  data structure, exactly one thread will execute a user-defined function.
  (Thus, there is now a runtime dependency against libpthread even when
  multithreading is not enabled at configure-time.)
- Added calls to bli_init_once() to top-level user APIs for all
  computational operations as well as many other functions in BLIS to
  all but guarantee that BLIS will self-initialize through the normal
  use of its functions.
- Rewrote and simplified bli_init() and bli_finalize() and related
  functions.
- Added -lpthread to LDFLAGS in common.mk.
- Modified the bli_init_auto()/_finalize_auto() functions used by the
  BLAS compatibility layer to take and return no arguments. (The
  previous API that tracked whether BLIS was initialized, and then
  only finalized if it was initialized in the same function, was too
  cute by half and borderline useless because by default BLIS stays
  initialized when auto-initialized via the compatibility layer.)
- Removed static variables that track initialization of the sub-APIs in
  bli_const.c, bli_error.c, bli_init.c, bli_memsys.c, bli_thread, and
  bli_ind.c. We don't need to track initialization at the sub-API level,
  especially now that BLIS can self-initialize.
- Added a critical section around the changing of the error checking
  level in bli_error.c.
- Deprecated bli_ind_oper_has_avail() as well as all functions
  bli_<opname>_ind_get_avail(), where <opname> is a level-3 operation
  name. These functions had no use cases within BLIS and likely none
  outside of BLIS.
- Commented out calls to bli_init() and bli_finalize() in testsuite's
  main() function, and likewise for standalone test drivers in 'test'
  directory, so that self-initialization is exercised by default.
2017-12-11 17:18:43 -06:00
Field G. Van Zee
70a64432ee Fixed off-by-one indexing in bli_cpuid.c.
Details:
- In bli_cpuid.c, fixed an off-by-one indexing statement in vpu_count()
  whereby a string-terminating NULL character, '\0', is written beyond
  the bounds of the model_num string.
- Minor whitespace and formatting edits to bli_cpuid.c.
2017-12-11 13:14:20 -06:00
Field G. Van Zee
513ef4d040 Various typecasting fixes, mis-typed enums, etc.
Details:
- Fixed implicit typecasting of conj_t to trans_t in bli_[un]packm_cxk.c.
- Properly typecast integer arguments to match format specifier in various
  calls to printf() in bli_l3_thrinfo.c, bli_cntx.c, bli_pool.c, and
  bli_util_oapi.c.
- Fixed "unsigned less-than-comparison with zero" checks in bli_check.c,
  bli_cntx.h.
- Fixed mis-typed enums in bli_cntx.c (e.g., l1mkr_t that should have been
  l1fkr_t or l1vkr_t).
- Fixed instances of opid_t value BLIS_GEMM that should have been l3ukr_t
  value BLIS_GEMM_UKR in bli_cntx_ref.c.
- NOTE: These issues were identified via compiler warnings when building
  BLIS with clang on a rather old installation of OS X:
    $ clang --version
    Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn)
    Target: x86_64-apple-darwin15.2.0
    Thread model: posix
2017-12-11 12:35:59 -06:00
Nisanth M P
3a44118398 Added AMD copyright line to the changed files in last 3 commits
Change-Id: I37d5dbbbe1b199e07529610a5e9cc9e49d067c66
2017-12-11 12:41:02 +05:30
Nisanth M P
9c0a3c4c02 Thread Safety: Move bli_init() before and bli_finalize() after main()
BLIS provides APIs to initialize and finalize its global context.
One application thread can finalize BLIS, while other threads
in the application are stil using BLIS.

This issue can be solved by removing bli_finalize() from API.
One way to do this is by getting bli_finalize() to execute by default
after application exits from main().

GCC supports this behaviour with the help of __attribute__((destructor))
added to the function that need to be executed after main exits.

Similarly bli_init() can be made to run before application enters main()
so that application need not call it.

Change-Id: I7ce6cfa28b384e92c0bdf772f3baea373fd9feac
2017-12-11 12:12:29 +05:30
Field G. Van Zee
a64c15de19 Fixed a pthread typo in previous commit.
Details:
- Misnamed 'pthread_mutex_t' type in bli_memsys.c as 'thread_mutex_t'.
2017-12-11 12:12:29 +05:30
Field G. Van Zee
42dcd589c3 Fixed bugs in gemm/gemmtrsm ukr tests in testsuite.
Details:
- Fixed a bug in gemmtrsm test module that was due to improper partitioning
  into a k x k triangular matrix for the purposes of obtaining an mr x k
  micropanel of A with which to test.
- Fixed a bug in gemm and gemmtrsm test modules that would only manifest for
  very large k (depending on the product of mr x kc on that architecture).
  The bug arose from the fact that the test module was triggering the
  allocation of blocks from the internal memory pools, which are limited in
  size. This allocation imposes an implicit assumption that the micro-
  panel being tested with will fit inside, and this assumption is violated
  for large values of k. Arbitrarily large k may now be tested for both
  operation tests.
- Added OpenMP/pthread critical sections around the setting or getting of
  statuses from the induced method operation lookup table in bli_l3_ind.c.
- Added the 'static' keyword to all pthread_mutex_t global variables in BLIS.
- Thanks to Nisanth Padinharepatt of AMD for reporting the first and third
  issues.
2017-12-11 12:12:29 +05:30
Field G. Van Zee
a666fd4e26 Added edge handling to _determine_blocksize_b().
Details:
- Added explicit handling of situations where i == dim to
  bli_determine_blocksize_b_sub(). This isn't actually needed by any
  current use case within BLIS, but handling the situation is nonetheless
  prudent. Thanks to Minh Quan for reporting this issue and requesting
  the fix.
2017-12-11 12:12:29 +05:30
Field G. Van Zee
95adc43d80 Moved 'family' field from cntx_t to cntl_t.
Details:
- Removed the family field inside the cntx_t struct and re-added it to the
  cntl_t struct. Updated all accessor functions/macros accordingly, as well
  as all consumers and intermediaries of the family parameter (such as
  bli_l3_thread_decorator(), bli_l3_direct(), and bli_l3_prune_*()). This
  change was motivated by the desire to keep the context limited, as much
  as possible, to information about the computing environment. (The family
  field, by contrast, is a descriptor about the operation being executed.)
- Added additional functions to bli_blksz_*() API.
- Added additional functions to bli_cntx_*() API.
- Minor updates to bli_func.c, bli_mbool.c.
- Removed 'obj' from bli_blksz_*() API names.
- Removed 'obj' from bli_cntx_*() API names.
- Removed 'obj' from bli_cntl_*(), bli_*_cntl_*() API names. Renamed routines
  that operate only on a single struct to contain the "_node" suffix to
  differentiate with those routines that operate on the entire tree.
- Added enums for packm and unpackm kernels to bli_type_defs.h.
- Removed BLIS_1F and BLIS_VF from bszid_t definition in bli_type_defs.h.
  They weren't being used and probably never will be.
2017-12-11 12:12:29 +05:30
Field G. Van Zee
8f739cc847 Added API to set mt environment variables.
Details:
- Renamed bli_env_get_nway() -> bli_thread_get_env().
- Added bli_thread_set_env() to allow setting environment variables
  pertaining to multithreading, such as BLIS_JC_NT or BLIS_NUM_THREADS.
- Added the following convenience wrapper routines:
    bli_thread_get_jc_nt()
    bli_thread_get_ic_nt()
    bli_thread_get_jr_nt()
    bli_thread_get_ir_nt()
    bli_thread_get_num_threads()
    bli_thread_set_jc_nt()
    bli_thread_set_ic_nt()
    bli_thread_set_jr_nt()
    bli_thread_set_ir_nt()
    bli_thread_set_num_threads()
- Added #include "errno.h" to bli_system.h.
- This commit addresses issue #140.
- Thanks to Chris Goodyer for inspiring these updates.
2017-12-11 12:08:58 +05:30
Minh Quan HO
c09b30d115 set missing free_fp in bli_membrk_init for free-ing GEN_USE buffers
The membrk's free_fp is called when releasing GEN_USE buffers, but this free_fp is
not set in bli_membrk_init
2017-12-11 12:08:58 +05:30
Devin Matthews
cc3107ae1c Setting any one of BLIS_NT_[IJ][CR] overrides BLIS_NUM_THEADS. Missing BLIS_NT_XX's are defaulted to 1. Fixes #123. 2017-12-11 12:05:22 +05:30
Field G. Van Zee
4f61528d56 Added 1m-specific APIs for bp, pb gemm algorithms.
Details:
- Defined bli_gemmbp_cntl_create(), bli_gemmpb_cntl_create(), with the
  body of bli_gemm_cntl_create() replaced with a call to the former.
- Defined bli_cntl_free_w_thrinfo(), bli_cntl_free_wo_thrinfo(). Now,
  bli_cntl_free() can check if the thread parameter is NULL, and if so,
  call the latter, and otherwise call the former.
- Defined bli_gemm1mbp_cntx_init(), bli_gemm1mpb_cntx_init(), both in
  terms of bli_gemm1mxx_cntx_init(), which behaves the same as
  bli_gemm1m_cntx_init() did before, except that an extra bool parameter
  (is_pb) is used to support both bp and pb algorithms (including to
  support the anti-preference field described below).
- Added support for "anti-preference" in context. The anti_pref field,
  when true, will toggle the boolean return value of routines such as
  bli_cntx_l3_ukr_eff_prefers_storage_of(), which has the net effect of
  causing BLIS to transpose the operation to achieve disagreement (rather
  than agreement) between the storage of C and the micro-kernel output
  preference. This disagreement is needed for panel-block implementations,
  since they induce a transposition of the suboperation immediately before
  the macro-kernel is called, which changes the apparent storage of C. For
  now, anti-preference is used only with the pb algorithm for 1m (and not
  with any other non-1m implementation).
- Defined new functions,
    bli_cntx_l3_ukr_eff_prefers_storage_of()
    bli_cntx_l3_ukr_eff_dislikes_storage_of()
    bli_cntx_l3_nat_ukr_eff_prefers_storage_of()
    bli_cntx_l3_nat_ukr_eff_dislikes_storage_of()
  which are identical to their non-"eff" (effectively) counterparts except
  that they take the anti-preference field of the context into account.
- Explicitly initialize the anti-pref field to FALSE in
  bli_gks_cntx_set_l3_nat_ukr_prefs().
- Added bli_gemm_ker_var1.c, which implements a panel-block macro-kernel
  in terms of the existing block-panel macro-kernel _ker_var2(). This
  technique requires inducing transposes on all operands and swapping
  the A and B.
- Changed bli_obj_induce_trans() macro so that pack-related fields are
  also changed to reflect the induced transposition.
- Added a temporary hack to bli_l3_3m4m1m_oapi.c that allows us to easily
  specify the 1m algorithm (block-panel or panel-block).
- Renamed the following cntx_t-related macros:
    bli_cntx_get_pack_schema_a() -> bli_cntx_get_pack_schema_a_block()
    bli_cntx_get_pack_schema_b() -> bli_cntx_get_pack_schema_b_panel()
    bli_cntx_get_pack_schema_c() -> bli_cntx_get_pack_schema_c_panel()
  and updated all instantiations. Also updated the field names in the
  cntx_t struct.
- Comment updates.
2017-12-11 11:58:33 +05:30
Field G. Van Zee
1d728ccb23 Implemented the 1m method.
Details:
- Implemented the 1m method for inducing complex domain matrix
  multiplication. 1m support has been added to all level-3 operations,
  including trsm, and is now the default induced method when native
  complex domain gemm microkernels are omitted from the configuration.
- Updated _cntx_init() operations to take a datatype parameter. This was
  needed for the corresponding function for 1m (because 1m requires us
  to choose between column-oriented or row-oriented execution, which
  requires us to query the context for the storage preference of the
  gemm microkernel, which requires knowing the datatype) but I decided
  that it made sense for consistency to add the parameter to all other
  cntx initialization functions as well, even though those functions
  don't use the parameter.
- Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take
  a second scalar for each blocksize entry. The semantic meaning of the
  two scalars now is that the first will scale the default blocksize
  while the second will scale the maximum blocksize. This allows scaling
  the two independently, and was needed to support 1m, which requires
  scaling for a register blocksize but not the register storage
  blocksize (ie: "packdim") analogue.
- Deprecated bli_blksz_reduce_dt_to() and defined two new functions,
  bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing
  default and maximum blocksizes to some desired blocksize multiple.
  These functions are needed in the updated definitions of
  bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs().
- Added support for the 1e and 1r packing schemas to packm, including
  1e/1r packing kernels.
- Added a minor optimization to bli_gemm_ker_var2() that allows, under
  certain circumstances (specifically, real domain beta and row- or
  column-stored matrix C), the real domain macrokernel and microkernel
  to be called directly, rather than using the virtual microkernel
  via the complex domain macrokernel, which carries a slight additional
  amount of overhead.
- Added 1m support to the testsuite.
- Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified
  some code in test_gemm.c driver.
2017-12-11 11:55:31 +05:30
Field G. Van Zee
b150870397 Removed most "old" directories.
Details:
- Removed the vast majority of directories named "old", which contained
  deprecated code that I wasn't quite ready to jettison from the source
  tree.
2017-12-08 16:08:41 -06:00
Field G. Van Zee
270c65985d Modified bli_getopt() for thread-safety.
Details:
- Changed the interface of bli_getopt() to take a new argument, a getopt_t
  struct, that stores the values of optarg, optind, opterr, and optopt,
  and updated the implementation accordingly. (Previously,  these
  variables were assumed to be global.)
- Added a function for initializing a getopt_t struct.
- Changed test_libblis.c--currently the only consumer of bli_getopt()--to
  utilize the new getopt_t state object.
2017-12-08 15:21:18 -06:00
Field G. Van Zee
ce4d8fabc2 Merge branch 'master' of github.com:flame/blis 2017-12-07 17:36:44 -06:00