Commit Graph

778 Commits

Author SHA1 Message Date
Kiran Varaganti
b80bd5bcb2 config/zen/bli_cntx_init_zen.c: removed BLIS_ENBLE_ZEN_BLOCK_SIZES macro. We have different configurations for both zen and zen2
config/zen/bli_family_zen.h: deleted macro BLIS_ENBLE_ZEN_BLOCK_SIZES
config/zen/make_defs.mk: removed compiler flag -mno-avx256-split-unaligned-store
frame/base/bli_cpuid.c: ROME family is 17H but model # is from 0x30H.
test/test_gemm.c - commented out #define FILE_IN_OUT (some compilation error when BLIS is configured as amd64)
Now we can use single configuration has ./configure amd64 - this will work both for ROME & Naples

Change-Id: I91b4fc35380f8a35b4f4c345da040c6b5910b4a2
2019-05-22 05:51:22 -04:00
Kiran Varaganti
a23f92594c config_registry: New AMD zen2 architecture configuration added.
frame/base/bli_arch.c: #ifdef BLIS_FAMILY_ZEN2 id = BLIS_ARCH_ZEN2; #endif added. zen2 is added in config_name[BLIS_NUM_ARCHS]
  frame/base/bli_cpuid.c : #ifdef BLIS_CONFIG_ZEN2 if ( bli_cpuid_is_zen2( family, model, features ) ) return BLIS_ARCH_ZEN2; #endif, defined new function bool bli_cpuid_is_zen2(...).
  frame/base/bli_cpuid.h : declared bli_cpuid_is_zen2(..).
  frame/base/bli_gks.c : #ifdef BLIS_CONFIG_ZEN2 bli_gks_register_cntx(BLIS_ARCH_ZEN2, bli_cntx_init_zen2, bli_cntx_init_zen2_ref, bli_cntx_init_zen2_ind); #endif
  frame/include/bli_arch_config.h : #ifdef BLIS_CONFIG_ZEN2 CNTX_INIT_PROTS(zen2) #endif #ifdef BLIS_FAMILY_ZEN2 #include "bli_family_zen2.h" #endif
  frame/include/bli_type_defs.h : added BLIS_ARCH_ZEN2 in arch_t enum. BLIS_NUM_ARCHS 20

Change-Id: I2a2d9b7266673e78a4f8543b1bfb5425b0aa7866
2019-05-22 05:28:16 -04:00
kdevraje
df755848b8 Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis into rome2.0
Change-Id: Ie8aad1ab810f0f3c0b90ec67f9dd3dfb8dcc74cc
2019-05-22 13:30:07 +05:30
Field G. Van Zee
fa7e6b182b Define _POSIX_C_SOURCE in bli_system.h.
Details:
- Added
    #ifndef _POSIX_C_SOURCE
    #define _POSIX_C_SOURCE 200809L
    #endif
  to bli_system.h so that an application that uses BLIS (specifically,
  an application that #includes blis.h) does not need to remember to
  #define the macro itself (either on the command line or in the code
  that includes blis.h) in order to activate things like the pthreads.
  Thanks to Christos Psarras for reporting this issue and suggesting
  this fix.
- Commented out #include <sys/time.h> in bli_system.h, since I don't
  think this header is used/needed anymore.
- Comment update to function macro for bli_?normiv_unb_var1() in
  frame/util/bli_util_unb_var1.c.
2019-05-01 19:13:00 -05:00
Field G. Van Zee
3df84f1b5d Minor bugfixes in sup dgemm implementation.
Details:
- Fixed an obscure but in the bli_dgemmsup_rv_haswell_asm_5x8n() kernel
  that only affected the beta == 0, column-storage output case. Thanks
  to the BLAS test drivers for catching this bug.
- Previously, bli_gemmsup_ref_var1n() and _var2m() were returning if
  k = 0, when the correct action would be to scale by beta (and then
  return). Thanks to the BLAS test drivers to catching this bug.
- Changed the sup threshold behavior such that the sup implementation
  only kicks in if a matrix dimension is strictly less than (rather than
  less than or equal to) the threshold in question.
- Initialize all thresholds to zero (instead of 10) by default in
  ref_kernels/bli_cntx_ref.c. This, combined with the above change to
  threshold testing means that calls to BLIS or BLAS with one or more
  matrix dimensions of zero will no longer trigger the sup
  implementation.
- Added disabled debugging output to frame/3/bli_l3_sup.c (for future
  use, perhaps).
2019-04-27 21:27:32 -05:00
Field G. Van Zee
ecbdd1c42d Ceased use of BLIS_ENABLE_SUP_MR/NR_EXT macros.
Details:
- Removed already limited use of the BLIS_ENABLE_SUP_MR_EXT and
  BLIS_ENABLE_SUP_NR_EXT macros in bli_gemmsup_ref_var1n() and
  bli_gemmsup_ref_var2m(). Their purpose was merely to avoid a long
  conditional that would determine whether to allow the last iteration
  to be merged with the second-to-last iteration. Functionally, the
  macros were not needed, and they ended up causing problems when
  building configuration families such as intel64 and x86_64.
2019-04-27 19:38:11 -05:00
Field G. Van Zee
aa8a6bec30 Fixed typo in --disable-sup-handling macro guard.
Details:
- Fixed an incorrectly-named macro guard that is intended to allow
  disabling of the sup framework via the configure option
  --disable-sup-handling. In this case, the preprocessor macro,
  BLIS_DISABLE_SUP_HANDLING, was still named by its name from an older
  uncommitted version of the code (BLIS_DISABLE_SM_HANDLING).
2019-04-27 18:53:33 -05:00
Field G. Van Zee
b9c9f03502 Implemented gemm on skinny/unpacked matrices.
Details:
- Implemented a new sub-framework within BLIS to support the management
  of code and kernels that specifically target matrix problems for which
  at least one dimension is deemed to be small, which can result in long
  and skinny matrix operands that are ill-suited for the conventional
  level-3 implementations in BLIS. The new framework tackles the problem
  in two ways. First the stripped-down algorithmic loops forgo the
  packing that is famously performed in the classic code path. That is,
  the computation is performed by a new family of kernels tailored
  specifically for operating on the source matrices as-is (unpacked).
  Second, these new kernels will typically (and in the case of haswell
  and zen, do in fact) include separate assembly sub-kernels for
  handling of edge cases, which helps smooth performance when performing
  problems whose m and n dimension are not naturally multiples of the
  register blocksizes. In a reference to the sub-framework's purpose of
  supporting skinny/unpacked level-3 operations, the "sup" operation
  suffix (e.g. gemmsup) is typically used to denote a separate namespace
  for related code and kernels. NOTE: Since the sup framework does not
  perform any packing, it targets row- and column-stored matrices A, B,
  and C. For now, if any matrix has non-unit strides in both dimensions,
  the problem is computed by the conventional implementation.
- Implemented the default sup handler as a front-end to two variants.
  bli_gemmsup_ref_var2() provides a block-panel variant (in which the
  2nd loop around the microkernel iterates over n and the 1st loop
  iterates over m), while bli_gemmsup_ref_var1() provides a panel-block
  variant (2nd loop over m and 1st loop over n). However, these variants
  are not used by default and provided for reference only. Instead, the
  default sup handler calls _var2m() and _var1n(), which are similar
  to _var2() and _var1(), respectively, except that they defer to the
  sup kernel itself to iterate over the m and n dimension, respectively.
  In other words, these variants rely not on microkernels, but on
  so-called "millikernels" that iterate along m and k, or n and k.
  The benefit of using millikernels is a reduction of function call
  and related (local integer typecast) overhead as well as the ability
  for the kernel to know which micropanel (A or B) will change during
  the next iteration of the 1st loop, which allows it to focus its
  prefetching on that micropanel. (In _var2m()'s millikernel, the upanel
  of A changes while the same upanel of B is reused. In _var1n()'s, the
  upanel of B changes while the upanel of A is reused.)
- Added a new configure option, --[en|dis]able-sup-handling, which is
  enabled by default. However, the default thresholds at which the
  default sup handler is activated are set to zero for each of the m, n,
  and k dimensions, which effectively disables the implementation. (The
  default sup handler only accepts the problem if at least one dimension
  is smaller than or equal to its corresponding threshold. If all
  dimensions are larger than their thresholds, the problem is rejected
  by the sup front-end and control is passed back to the conventional
  implementation, which proceeds normally.)
- Added support to the cntx_t structure to track new fields related to
  the sup framework, most notably:
  - sup thresholds: the thresholds at which the sup handler is called.
  - sup handlers: the address of the function to call to implement
    the level-3 skinny/unpacked matrix implementation.
  - sup blocksizes: the register and cache blocksizes used by the sup
    implementation (which may be the same or different from those used
    by the conventional packm-based approach).
  - sup kernels: the kernels that the handler will use in implementing
    the sup functionality.
  - sup kernel prefs: the IO preference of the sup kernels, which may
    differ from the preferences of the conventional gemm microkernels'
    IO preferences.
- Added a bool_t to the rntm_t structure that indicates whether sup
  handling should be enabled/disabled. This allows per-call control
  of whether the sup implementation is used, which is useful for test
  drivers that wish to switch between the conventional and sup codes
  without having to link to different copies of BLIS. The corresponding
  accessor functions for this new bool_t are defined in bli_rntm.h.
- Implemented several row-preferential gemmsup kernels in a new
  directory, kernels/haswell/3/sup. These kernels include two general
  implementation types--'rd' and 'rv'--for the 6x8 base shape, with
  two specialized millikernels that embed the 1st loop within the kernel
  itself.
- Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference
  gemmsup microkernels. NOTE: These microkernels, unlike the current
  crop of conventional (pack-based) microkernels, do not use constant
  loop bounds. Additionally, their inner loop iterates over the k
  dimension.
- Defined new typedef enums:
  - stor3_t: captures the effective storage combination of the level-3
    problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A
    special value of BLIS_XXX is used to denote an arbitrary combination
    which, in practice, means that at least one of the operands is
    stored according to general stride.
  - threshid_t: captures each of the three dimension thresholds.
- Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create()
  can be passed "-1, -1" as a lazy request for row storage. (Note that
  "0, 0" is still accepted as a lazy request for column storage.)
- Added support for various instructions to bli_x86_asm_macros.h,
  including imul, vhaddps/pd, and other instructions related to integer
  vectors.
- Disabled the older small matrix handling code inserted by AMD in
  bli_gemm_front.c, since the sup framework introduced in this commit
  is intended to provide a more generalized solution.
- Added test/sup directory, which contains standalone performance test
  drivers, a Makefile, a runme.sh script, and an 'octave' directory
  containing scripts compatible with GNU Octave. (They also may work
  with matlab, but if not, they are probably close to working.)
- Reinterpret the storage combination string (sc_str) in the various
  level-3 testsuite modules (e.g. src/test_gemm.c) so that the order
  of each matrix storage char is "cab" rather than "abc".
- Comment updates in level-3 BLAS API wrappers in frame/compat.
2019-04-27 18:44:50 -05:00
Field G. Van Zee
945928c650 Merge branch 'amd' of github.com:flame/blis into amd 2019-04-17 15:58:56 -05:00
Field G. Van Zee
89cd650e7b Use void_fp for function pointers instead of void*.
Change void*-typed function pointers to void_fp.
- Updated all instances of void* variables that store function pointers
  to variables of a new type, void_fp. Originally, I wanted to define
  the type of void_fp as "void (*void_fp)( void )"--that is, a pointer
  to a function with no return value and no arguments. However, once
  I did this, I realized that gcc complains with incompatible pointer
  type (-Wincompatible-pointer-types) warnings every time any such a
  pointer is being assigned to its final, type-accurate function
  pointer type. That is, gcc will silently typecast a void* to
  another defined function pointer type (e.g. dscalv_ker_ft) during
  an assignment from the former to the latter, but the same statement
  will trigger a warning when typecasting from a void_fp type. I suspect
  an explicit typecast is needed in order to avoid the warning, which
  I'm not willing to insert at this time.
- Added a typedef to bli_type_defs.h defining void_fp as void*, along
  with a commented-out version of the aborted definition described
  above. (Note that POSIX requires that void* and function pointers
  be interchangeable; it is the C standard that does not provide this
  guarantee.)
- Comment updates to various _oapi.c files.
2019-04-02 17:23:55 -05:00
Isuru Fernando
044df9506f Test with shared on windows (#306)
Export macros can't support both shared and static at the same time.
When blis is built with both shared and static, headers assume that
shared is used at link time and dllimports the symbols with __imp_
prefix.

To use the headers with static libraries a user can give
-DBLIS_EXPORT= to import the symbol without the __imp_ prefix
2019-03-27 12:39:31 -05:00
Field G. Van Zee
feefcab442 Allow disabling of BLAS prototypes at compile-time.
Details:
- Modified bli_blas.h so that:
  - By default, if the BLAS layer is enabled at configure-time, BLAS
    prototypes are also enabled within blis.h;
  - But if the user #defines BLIS_DISABLE_BLAS_DEFS prior to including
    blis.h, BLAS prototypes are skipped over entirely so that, for
    example, the application or some other header pulled in by the
    application may prototype the BLAS functions without causing any
    duplication.
- Updated docs/BuildSystem.md to document the feature above, and
  related text.
2019-03-21 18:11:20 -05:00
Field G. Van Zee
663f662932 Merge branch 'amd' of github.com:flame/blis into amd 2019-03-16 16:17:12 -05:00
Field G. Van Zee
938c05ef86 Merge branch 'amd' of github.com:flame/blis into amd 2019-03-16 16:01:43 -05:00
Field G. Van Zee
809395649c Annotated additional symbols for export.
Details:
- Added export annotations to additional function prototypes in order to
  accommodate the testsuite.
- Disabled calling bli_amaxv_check() from within the testsuite's
  test_amaxv.c.
2019-03-13 18:21:35 -05:00
Field G. Van Zee
e095926c64 Support shared lib export of only public symbols.
Details:
- Introduced a new configure option, --enable-export-all, which will
  cause all shared library symbols to be exported by default, or,
  alternatively, --disable-export-all, which will cause all symbols to
  be hidden by default, with only those symbols that are annotated for
  visibility, via BLIS_EXPORT_BLIS (and BLIS_EXPORT_BLAS for BLAS
  symbols), to be exported. The default for this configure option is
  --disable-export-all. Thanks to Isuru Fernando for consulting on
  this commit.
- Removed BLIS_EXPORT_BLIS annotations from frame/1m/bli_l1m_unb_var1.h,
  which was intended for 5a5f494.
- Relocated BLIS_EXPORT-related cpp logic from bli_config.h.in to
  frame/include/bli_config_macro_defs.h.
- Provided appropriate logic within common.mk to implement variable
  symbol visibility for gcc, clang, and icc (to the extend that each of
  these compilers allow).
- Relocated --help text associated with debug option (-d) to configure
  slightly further down in the list.
2019-03-13 17:35:18 -05:00
Field G. Van Zee
5a5f494e42 Removed export macros from all internal prototypes.
Details:
- After merging PR #303, at Isuru's request, I removed the use of
  BLIS_EXPORT_BLIS from all function prototypes *except* those that we
  potentially wish to be exported in shared/dynamic libraries. In other
  words, I removed the use of BLIS_EXPORT_BLIS from all prototypes of
  functions that can be considered private or for internal use only.
  This is likely the last big modification along the path towards
  implementing the functionality spelled out in issue #248. Thanks
  again to Isuru Fernando for his initial efforts of sprinkling the
  export macros throughout BLIS, which made removing them where
  necessary relatively painless. Also, I'd like to thank Tony Kelman,
  Nathaniel Smith, Ian Henriksen, Marat Dukhan, and Matthew Brett for
  participating in the initial discussion in issue #37 that was later
  summarized and restated in issue #248.
- CREDITS file update.
2019-03-12 18:45:09 -05:00
Field G. Van Zee
3dc18920b6 Merge branch 'master' into dev 2019-03-12 11:20:25 -05:00
Isuru Fernando
766769eeb9 Export functions without def file (#303)
* Revert "restore bli_extern_defs exporting for now"

This reverts commit 09fb07c350b2acee17645e8e9e1b8d829c73dca8.

* Remove symbols not intended to be public

* No need of def file anymore

* Fix whitespace

* No need of configure option

* Remove export macro from definitions

* Remove blas export macro from definitions
2019-03-11 19:05:32 -05:00
Field G. Van Zee
4ed39c0971 Merge branch 'amd' of github.com:flame/blis into amd 2019-03-08 11:56:58 -06:00
Kiran Varaganti
f5ed95ecd7 Merged BLIS Release 1.3
Modified config/zen/make_defs.mk, now CKVECFLAGS     := -mavx2 -mfpmath=sse -mfma -march=znver1

Change-Id: Ia0942d285a21447cd0c470de1bc021fe63e80d81
2019-03-05 15:03:57 +05:30
Field G. Van Zee
3bdab823fa Merge branch 'master' into dev 2019-02-28 14:07:24 -06:00
Isuru Fernando
f0dcc8944f Add symbol export macro for all functions (#302)
* initial export of blis functions

* Regenerate def file for master

* restore bli_extern_defs exporting for now
2019-02-27 17:27:23 -06:00
Field G. Van Zee
540ec1b479 Updated level-3 BLAS to call object API directly.
Details:
- Updated the BLAS compatibility layer for level-3 operations so that
  the corresponding BLIS object API is called directly rather than first
  calling the typed BLIS API. The previous code based on the typed BLIS
  API calls is still available in a deactivated cpp macro branch, which
  may be re-activated by #defining BLIS_BLAS3_CALLS_TAPI. (This does not
  yet correspond to a configure option. If it seems like people might
  want to toggle this behavior more regularly, a configure option can be
  added in the future.)
- Updated the BLIS typed API to statically "pre-initialize" objects via
  new initializor macros. Initialization is then finished via calls to
  static functions bli_obj_init_finish_1x1() and bli_obj_init_finish(),
  which are similar to the previously-called functions,
  bli_obj_create_1x1_with_attached_buffer() and
  bli_obj_create_with_attached_buffer(), respectively. (The BLAS
  compatibility layer updates mentioned above employ this new technique
  as well.)
- Transformed certain routines in bli_param_map.c--specifically, the
  ones that convert netlib-style parameters to BLIS equivalents--into
  static functions, now in bli_param_map.h. (The remaining three classes
  of conversation routines were left unchanged.)
- Added the aforementioned pre-initializor macros to bli_type_defs.h.
- Relocated bli_obj_init_const() and bli_obj_init_constdata() from
  bli_obj_macro_defs.h to bli_type_defs.h.
- Added a few macros to bli_param_macro_defs.h for testing domains for
  real/complexness and precisions for single/double-ness.
2019-02-24 19:09:10 -06:00
Field G. Van Zee
565fa3853b Redirect trsm pc, ir parallelism to ic, jr loops.
Details:
- trsm parallelization was temporarily simplifed in 075143d to entirely
  ignore any parallelism specified via the pc or ir loops. Now, any
  parallelism specified to the pc loop will be redirected to the ic
  loop, and any parallelism specified to the ir loop will be redirected
  to the jr loop. (Note that because of inter-iteration dependencies,
  trsm cannot parallelize the ir loop. Parallelism via the pc loop is
  at least somewhat feasible in theory, but it would require tracking
  dependencies between blocks--something for which BLIS currently lacks
  the necessary supporting infrastructure.)
2019-02-18 11:43:58 -06:00
Field G. Van Zee
075143dfd9 Added support for IC loop parallelism to trsm.
Details:
- Parallelism within the IC loop (3rd loop around the microkernel) is
  now supported within the trsm operation. This is done via a new branch
  on each of the control and thread trees, which guide execution of a
  new trsm-only subproblem from within bli_trsm_blk_var1(). This trsm
  subproblem corresponds to the macrokernel computation on only the
  block of A that contains the diagonal (labeled as A11 in algorithms
  with FLAME-like partitioning), and the corresponding row panel of C.
  During the trsm subproblem, all threads within the JC communicator
  participate and parallelize along the JR loop, including any
  parallelism that was specified for the IC loop. (IR loop parallelism
  is not supported for trsm due to inter-iteration dependencies.) After
  this trsm subproblem is complete, a barrier synchronizes all
  participating threads and then they proceed to apply the prescribed
  BLIS_IC_NT (or equivalent) ways of parallelism (and any BLIS_JR_NT
  parallelism specified within) to the remaining gemm subproblem (the
  rank-k update that is performed using the newly updated row-panel of
  B). Thus, trsm now supports JC, IC, and JR loop parallelism.
- Modified bli_trsm_l_cntl_create() to create the new "prenode" branch
  of the trsm_l cntl_t tree. The trsm_r tree was left unchanged, for
  now, since it is not currently used. (All trsm problems are cast in
  terms of left-side trsm.)
- Updated bli_cntl_free_w_thrinfo() to be able to free the newly shaped
  trsm cntl_t trees. Fixed a potentially latent bug whereby a cntl_t
  subnode is only recursed upon if there existed a corresponding
  thrinfo_t node, which may not always exist (for problems too small
  to employ full parallelization due to the minimum granularity imposed
  by micropanels).
- Updated other functions in frame/base/bli_cntl.c, such as
  bli_cntl_copy() and bli_cntl_mark_family(), to recurse on sub-prenodes
  if they exist.
- Updated bli_thrinfo_free() to recurse into sub-nodes and prenodes
  when they exist, and added support for growing a prenode branch to
  bli_thrinfo_grow() via a corresponding set of help functions named
  with the _prenode() suffix.
- Added a bszid_t field thrinfo_t nodes. This field comes in handy when
  debugging the allocation/release of thrinfo_t nodes, as it helps trace
  the "identity" of each nodes as it is created/destroyed.
- Renamed
    bli_l3_thrinfo_print_paths() -> bli_l3_thrinfo_print_gemm_paths()
  and created a separate bli_l3_thrinfo_print_trsm_paths() function to
  print out the newly reconfigured thrinfo_t trees for the trsm
  operation.
- Trival changes to bli_gemm_blk_var?.c and bli_trsm_blk_var?.c
  regarding variable declarations.
- Removed subpart_t enum values BLIS_SUBPART1T, BLIS_SUBPART1B,
  BLIS_SUBPART1L, BLIS_SUBPART1R. Then added support for two new labels
  (semantically speaking): BLIS_SUBPART1A and BLIS_SUBPART1B, which
  represent the subpartition ahead of and behind, respectively,
  BLIS_SUBPART1. Updated check functions in bli_check.c accordingly.
- Shuffled layering/APIs for bli_acquire_mpart_[mn]dim() and
  bli_acquire_mpart_t2b/b2t(), _l2r/r2l().
- Deprecated old functions in frame/3/bli_l3_thrinfo.c.
2019-02-14 18:52:45 -06:00
Nicholai Tukanov
78bc0bc8b6 Power9 sub-configuration (#298)
Formally registered power9 sub-configuration.

Details:
- Added and registered power9 sub-configuration into the build system.
  Thanks to Nicholai Tukanov and Devangi Parikh for these contributions.
- Note: The sub-configuration does not yet have a corresponding
  architecture-specific kernel set registered, and so for now the 
  sub-config is using the generic kernel set.
2019-02-14 13:29:02 -06:00
Field G. Van Zee
6b83273126 Generalized ref kernels' pragma omp simd usage.
Details:
- Replaced direct usage of _Pragma( "omp simd" ) in reference kernels
  with PRAGMA_SIMD, which is defined as a function of the compiler being
  used in a new bli_pragma_macro_defs.h file. That definition is cleared
  when BLIS detects that the -fopenmp-simd command line option is
  unsupported. Thanks to Devin Matthews and Jeff Hammond for suggestions
  that guided this commit.
- Updated configure and bli_config.h.in so that the appropriate anchor
  is substituted in (when the corresponding pragma omp simd support is
  present).
2019-02-12 16:01:28 -06:00
M. Zhou
1aa280d052 Amend OS detection for kFreeBSD. (#295) 2019-01-27 15:40:48 -06:00
Field G. Van Zee
bdd46f9ee8 Rewrote reference kernels to use #pragma omp simd.
Details:
- Rewrote level-1v, -1f, and -3 reference kernels in terms of simplified
  indexing annotated by the #pragma omp simd directive, which a compiler
  can use to vectorize certain constant-bounded loops. (The new kernels
  actually use _Pragma("omp simd") since the kernels are defined via
  templatizing macros.) Modest speedup was observed in most cases using
  gcc 5.4.0, which may improve with newer versions. Thanks to Devin
  Matthews for suggesting this via issue #286 and #259.
- Updated default blocksizes defined in ref_kernels/bli_cntx_ref.c to
  be 4x16, 4x8, 4x8, and 4x4 for single, double, scomplex and dcomplex,
  respectively, with a default row preference for the gemm ukernel. Also
  updated axpyf, dotxf, and dotxaxpyf fusing factors to 8, 6, and 4,
  respectively, for all datatypes.
- Modified configure to verify that -fopenmp-simd is a valid compiler
  option (via a new detect/omp_simd/omp_simd_detect.c file).
- Added a new header in which prefetch macros are defined according to
  which compiler is detected (via macros such as __GNUC__). These
  prefetch macros are not yet employed anywhere, though.
- Updated the year in copyrights of template license headers in
  build/templates and removed AMD as a default copyright holder.
2019-01-24 17:23:18 -06:00
Field G. Van Zee
adf5c17f08 Formally registered thunderx2 subconfiguration.
Details:
- Added a separate subconfiguration for thunderx2, which now uses
  different optimization flags than cortexa57/cortexa53.
2019-01-18 15:14:45 -06:00
M. Zhou
094cfdf7df Port BLIS to GNU Hurd OS. (#294)
Prevent blis.h from misidentifying Hurd as OSX.
2019-01-18 12:46:13 -06:00
Field G. Van Zee
706cbd9d56 Minor tweaks/cleanups to bli_malloc.c, _apool.c.
Details:
- Removed malloc_ft and free_ft function pointer arguments from the
  interface to bli_apool_init() after deciding that there is no need to
  specify the malloc()/free() for blocks within the apool. (The apool
  blocks are actually just array_t structs.) Instead, we simply call
  bli_malloc_intl()/_free_intl() directly. This has the added benefit
  of allowing additional output when memory tracing is enabled via
  --enable-mem-tracing. Also made corresponding changes elsewhere in
  the apool API.
- Changed the inner pools (elements of the array_t within the apool_t)
  to use BLIS_MALLOC_POOL and BLIS_FREE_POOL instead of BLIS_MALLOC_INTL
  and BLIS_FREE_INTL.
- Disabled definitions of bli_malloc_pool() and bli_free_pool() since
  there are no longer any consumers of these functions.
- Very minor comment / printf() updates.
2019-01-07 18:28:19 -06:00
Minh Quan Ho
579145039d Initialize error messages at compile time (#289)
* Initialize error messages at compile time

- Assigning strings directly to the bli_error_string array, instead of
snprintf() at execution-time.

* Retired bli_error_init(), _finalize().

Details:
- Removed functions obviated by changes in 80e8dc6: bli_error_init(),
  bli_error_finalize(), and bli_error_init_msgs(), as well as calls to
  the former two in bli_init.c.

* Regenerated symbols in build/libblis-symbols.def.

Details:
- Reran ./build/regen-symbols.sh after running
  'configure --enable-cblas auto'.
2019-01-07 16:00:15 -06:00
Field G. Van Zee
7052fca5ae Apply f272c289 to bli_fmalloc_noalign().
Details:
- Perform the same check for NULL return values and error message output
  in bli_fmalloc_noalign() as is performed by bli_fmalloc_align(). (This
  change was intended for f272c289.)
2019-01-02 13:48:40 -06:00
Field G. Van Zee
f272c2899a Add error message to malloc() check for NULL.
Details:
- Output an error message if and when the malloc()-equivalent called by
  bli_fmalloc_align() ever returns NULL. Everything was already in place
  for this to happen, including the error return code, the error string
  sprintf(), the error checking function bli_check_valid_malloc_buf()
  definition, and its prototype. Thanks to Minh Quan Ho for pointing out
  the missing error message.
- Increased the default block_ptrs_len for each inner pool stored in the
  small block allocator from 10 to 25. Under normal execution, each
  thread uses only 21 blocks, so this change will prevent the sba from
  needing to resize the block_ptrs array of any given inner pool as
  threads initially populate the pool with small blocks upon first
  execution of a level-3 operation.
- Nix stray newline echo in configure.
2019-01-02 12:34:15 -06:00
Field G. Van Zee
eb97f778a1 Added missing AMD copyrights to previous commit.
Details:
- Forgot to add AMD copyrights to several touched files that did not
  already have them in 2f31743.
2018-12-25 20:17:09 -06:00
Field G. Van Zee
2f3174330f Implemented a pool-based small block allocator.
Details:
- Implemented a sophisticated data structure and set of APIs that track
  the small blocks of memory (around 80-100 bytes each) used when
  creating nodes for control and thread trees (cntl_t and thrinfo_t) as
  well as thread communicators (thrcomm_t). The purpose of the small
  block allocator, or sba, is to allow the library to transition into a
  runtime state in which it does not perform any calls to malloc() or
  free() during normal execution of level-3 operations, regardless of
  the threading environment (potentially multiple application threads
  as well as multiple BLIS threads). The functionality relies on a new
  data structure, apool_t, which is (roughly speaking) a pool of
  arrays, where each array element is a pool of small blocks. The outer
  pool, which is protected by a mutex, provides separate arrays for each
  application thread while the arrays each handle multiple BLIS threads
  for any given application thread. The design minimizes the potential
  for lock contention, as only concurrent application threads would
  need to fight for the apool_t lock, and only if they happen to begin
  their level-3 operations at precisely the same time. Thanks to Kiran
  Varaganti and AMD for requesting this feature.
- Added a configure option to disable the sba pools, which are enabled
  by default; renamed the --[dis|en]able-packbuf-pools option to
  --[dis|en]able-pba-pools; and rewrote the --help text associated with
  this new option and consolidated it with the --help text for the
  option associated with the sba (--[dis|en]able-sba-pools).
- Moved the membrk field from the cntx_t to the rntm_t. We now pass in
  a rntm_t* to the bli_membrk_acquire() and _release() APIs, just as we
  do for bli_sba_acquire() and _release().
- Replaced all calls to bli_malloc_intl() and bli_free_intl() that are
  used for small blocks with calls to bli_sba_acquire(), which takes a
  rntm (in addition to the bytes requested), and bli_sba_release().
  These latter two functions reduce to the former two when the sba pools
  are disabled at configure-time.
- Added rntm_t* arguments to various cntl_t and thrinfo_t functions, as
  required by the new usage of bli_sba_acquire() and _release().
- Moved the freeing of "old" blocks (those allocated prior to a change
  in the block_size) from bli_membrk_acquire_m() to the implementation
  of the pool_t checkout function.
- Miscellaneous improvements to the pool_t API.
- Added a block_size field to the pblk_t.
- Harmonized the way that the trsm_ukr testsuite module performs packing
  relative to that of gemmtrsm_ukr, in part to avoid the need to create
  a packm control tree node, which now requires a rntm_t that has been
  initialized with an sba and membrk.
- Re-enable explicit call bli_finalize() in testsuite so that users who
  run the testsuite with memory tracing enabled can check for memory
  leaks.
- Manually imported the compact/minor changes from 61441b24 that cause
  the rntm to be copied locally when it is passed in via one of the
  expert APIs.
- Reordered parameters to various bli_thrcomm_*() functions so that the
  thrcomm_t* to the comm being modified is last, not first.
- Added more descriptive tracing for allocating/freeing small blocks and
  formalized via a new configure option: --[dis|en]able-mem-tracing.
- Moved some unused scalm code and headers into frame/1m/other.
- Whitespace changes to bli_pthread.c.
- Regenerated build/libblis-symbols.def.
2018-12-25 19:35:01 -06:00
Field G. Van Zee
e809b5d2f1 Merge branch 'master' into amd 2018-12-20 16:27:26 -06:00
sraut
1f4eeee517 Fixed BLAS test failures of small matrix SYRK for single and double precision.
Details:
- SYRK for small matrix was implemented by reusing small GEMM routine. This was
  resulting in output written to the full C matrix, and C being symmetric the
  lower and upper triangles of C matrix contained same results. BLAS SYRK API
  spec demands either lower or upper triangle of C matrix to be written with
  results. So, this was resulting in BLAS test failures, even though testsuite
  of BLIS was passing small SYRK operation.
- To fix BLAS test failures of small matrix SYRK, separate kernel routines are
  implemented for small SYRK for both single and double precision. The newly
  added small SYRK routines are in file kernels/zen/3/bli_syrk_small.c.
  Now the intermediate results of matrix C are written to a scratch buffer.
  Final results are written from scratch buffer to matrix C using SIMD
  copy to either lower or upper traingle part of matrix C.
- Source and header files frame/3/syrk/bli_syrk_front.c and
  frame/3/syrk/bli_syrk_front.h are changed to invoke new small SYRK routines.

Change-Id: I9cfb1116c93d150aefac673fca033952ecac97cb
2018-12-19 21:23:05 +05:30
Field G. Van Zee
93d56319f2 Added missing bli_init_once() in bli_thread API.
Details:
- Fixed an issue with specifying threading globally at runtime via
  bli_thread_set_num_threads() (the automatic way) or via
  bli_thread_set_ways() (the manual way), with bli_thread_init_rntm()
  also affected. These functions were not calling bli_init_once() prior
  to acting, and therefore their effects on the global rntm_t structure
  were being wiped out by the eventual call to bli_init_once(), by some
  other BLIS function. Thanks to Ali Emre Gülcü for reporting the
  behavior associated with this bug.
- Added additional content to docs/Multithreading.md covering topics of
  choosing between OpenMP and pthreads, and specifying affinity via
  OpenMP.
- CREDITS file update.
2018-12-17 19:17:30 -06:00
Field G. Van Zee
76016691e2 Improvements to bli_pool; malloc()/free() tracing.
Details:
- Added malloc_ft and free_ft fields to pool_t, which are provided when
  the pool is initialized, to allow bli_pool_alloc_block() and
  bli_pool_free_block() to call bli_fmalloc_align()/bli_ffree_align()
  with arbitrary align_size values (according to how the pool_t was
  initialized).
- Added a block_ptrs_len argument to bli_pool_init(), which allows the
  caller to specify an initial length for the block_ptrs array, which
  previously suffered the cost of being reallocated, copied, and freed
  each time a new block was added to the pool.
- Consolidated the "buf_sys" and "buf_align" pointer fields in pblk_t
  into a single "buf" field. Consolidated the bli_pblk API accordingly
  and also updated the bli_mem API implementation. This was done
  because I'd previously already implemented opaque alignment via
  bli_malloc_align(), which allocates extra space and stores the
  original pointer returned by malloc() one element before the element
  whose address is aligned.
- Tweaked bli_membrk_acquire_m() and bli_membrk_release() to call
  bli_fmalloc_align() and bli_ffree_align(), which required adding an
  align_size field to the membrk_t struct.
- Pass the pack schemas directly into bli_l3_cntl_create_if() rather
  than transmit them via objects for A and B.
- Simplified bli_l3_cntl_free_if() and renamed to bli_l3_cntl_free().
  The function had not been conditionally freeing control trees for
  quite some time. Also, removed obj_t* parameters since they aren't
  needed anymore (or never were).
- Spun-off OpenMP nesting code in bli_l3_thread_decorator() to a
  separate function, bli_l3_thread_decorator_thread_check().
- Renamed:
    bli_malloc_align()   -> bli_fmalloc_align()
    bli_free_align()     -> bli_ffree_align()
    bli_malloc_noalign() -> bli_fmalloc_noalign()
    bli_free_noalign()   -> bli_ffree_noalign()
  The 'f' is for "function" since they each take a malloc_ft or free_ft
  function pointer argument.
- Inserted various printf() calls for the purposes of tracing memory
  allocation and freeing, guarded by cpp macro ENABLE_MEM_DEBUG, which,
  for now, is intended to be a "hidden" feature rather than one hooked
  up to a configure-time option.
- Defined bli_rntm_equals(), which compares two rntm_t for equality.
  (There are no use cases for this function yet, but there may be soon.)
- Whitespace changes to function parameter lists in bli_pool.c, .h.
2018-12-13 17:23:09 -06:00
Field G. Van Zee
f808d829c5 Handle edge cases, zero-filling in packm kernels.
Details:
- Updated the API and semantics of packm kernels such that they must now
  handle edge cases, meaning that a c-by-k packm kernel must be able to
  pack edge cases that are fewer than c rows/columns and be able to
  zero-fill the remaining elements. They must also be able to zero-fill
  the equivalent region when copying fewer than k columns/rows (which is
  needed by trsm). The new packm kernel API is generally:

    void packm_kernel
         (
           conj_t           conja,
           dim_t            cdim,
           dim_t            n,
           dim_t            n_max,
           ctype*  restrict kappa,
           ctype*  restrict a, inc_t inca, inc_t lda,
           ctype*  restrict p,             inc_t ldp,
           cntx_t* restrict cntx
         );

  where cdim and n are the dimensions (short and long, respectively) of
  the submatrix being copied from the source matrix A, and n_max is the
  "full" long dimension (corresponding to the k dimension in gemm) of
  the micropanel. The "full" short dimension (corresponding to the
  register blocksize MR or NR) is not part of the API because it is
  known intrinsically by the packm kernel implementation. Thanks to
  Devin Matthews for prompting us to make this change (#282).
- Updated all reference packm kernels in ref_kernels/1m according to
  above changes, as well as all optimized packm kernels (which only
  consisted of those for knl).
- Bumped the major soname version number in 'so_version' to 2. At first
  I was considering leaving it unchanged, but I couldn't escape the
  reality that the packm kernel API is much closer to an expert API
  than it is some obscure helper function interface within the framework
  that nobody would ever notice.
- Removed reference packm kernels for mr/nr = 30. The only sub-config
  that would have been using those kernels is knc, which is likely no
  longer being used by very many people (if any). (This also mostly
  offset the larger object code footprint incurred by moving the edge-
  case handling into the individual packm kernels.)
- Fixed an obscure race condition for 3mh and 4mh induced methods in
  which those implementations were modifying the contexts stored in the
  gks rather than a local copy.
- Fixed a minor bug in the testsuite that prevented non-1m-based induced
  method implementations of trsm from executing.
2018-12-12 15:22:59 -06:00
Field G. Van Zee
02ec0be3ba Merge branch 'master' into amd 2018-12-05 19:33:53 -06:00
Field G. Van Zee
0645f239fb Remove UT-Austin from copyright headers' clause 3.
Details:
- Removed explicit reference to The University of Texas at Austin in the
  third clause of the license comment blocks of all relevant files and
  replaced it with a more all-encompassing "copyright holder(s)".
- Removed duplicate words ("derived") from a few kernels' license
  comment blocks.
- Homogenized license comment block in kernels/zen/3/bli_gemm_small.c
  with format of all other comment blocks.
2018-12-04 14:31:06 -06:00
Field G. Van Zee
375eb30b0a Added mixed-precision support to 1m method.
Details:
- Lifted the constraint that 1m only be used when all operands' storage
  datatypes (along with the computation datatype) are equal. Now, 1m may
  be used as long as all operands are stored in the complex domain. This
  change largely consisted of adding the ability to pack to 1e and 1r
  formats from one precision to another. It also required adding logic
  for handling complex values of alpha to bli_packm_blk_var1_md()
  (similar to the logic in bli_packm_blk_var1()).
- Fixed a bug in several virtual microkernels (bli_gemm_md_c2r_ref.c,
  bli_gemm1m_ref.c, and bli_gemmtrsm1m_ref.c) that resulted in the wrong
  ukernel output preference field being read. Previously, the preference
  for the native complex ukernel was being read instead of the pref for
  the native real domain ukernel. This bug would not manifest if the
  preference for the native complex ukernel happened to be equal to that
  of the native real ukernel.
- Added support for testing mixed-precision 1m execution via the gemm
  module of the testsuite.
- Tweaked/simplified bli_gemm_front() and bli_gemm_md.c so that pack
  schemas are always read from the context, rather than trying to
  sometimes embed them directly to the A and B objects. (They are still
  embedded, but now uniformly only after reading the schemas from the
  context.)
- Redefined cpp macro bli_l3_ind_recast_1m_params() as a static function
  and renamed to bli_gemm_ind_recast_1m_params() (since gemm is the only
  consumer).
- Added 1m optimization logic (via bli_gemm_ind_recast_1m_params()) to
  bli_gemm_ker_var2_md().
- Added explicit handling for beta == 1 and beta == 0 in the reference
  gemm1m virtual microkernel in ref_kernels/ind/bli_gemm1m_ref.c.
- Rewrote various level-0 macro defs, including axpyris, axpbyris,
  scal2ris, and xpbyris (and their conjugating counterparts) to
  explicitly support three operand types and updated invocations to
  xpbyris in bli_gemmtrsm1m_ref.c.
- Query and use the storage datatype of the packed object instead of the
  storage datatype of the source object in bli_packm_blk_var1().
- Relocated and renamed frame/ind/misc/bli_l3_ind_opt.h to
  frame/3/gemm/ind/bli_gemm_ind_opt.h.
- Various whitespace/comment updates.
2018-12-03 17:49:52 -06:00
Field G. Van Zee
e275def30a Merge branch 'master' into amd 2018-11-30 15:39:50 -06:00
Field G. Van Zee
6a4885f8be Merge branch 'master' into dev 2018-11-27 13:22:59 -06:00
Isuru Fernando
9ddffba584 Fix MinGW build failure
Fixes https://github.com/flame/blis/issues/278
2018-11-21 00:23:34 -06:00
Field G. Van Zee
1d8aae220b Track internal scalar datatypes.
Details:
- Added a num_t datatype bitfield to the obj_t in the form of a new
  info2 field in the obj_t. This change was made primarily so that in
  the case of mixed-datatype gemm, the alpha scalar would not need to
  be cast to the storage datatype of B (or A) before then being cast to
  the computation datatype just before the macrokernel is called. This
  double-casting regime could result in loss of precision if the storage
  datatype of B (or A) is less than the computation precision. In
  practice, it was likely not going to be a big deal since most usage of
  alpha is for -1.0, 0.0, and 1.0 (or integer multiples thereof), which
  can all be represented exactly in single or double precision.
- The type of objbits_t was changed to uint32_t, so the new format
  potentially takes up the same space as the previous obj_t definition,
  assuming no padding inserted by the compiler. Shrinking info to 32
  bits and spilling over into a second field was chosen over using the
  high 32 bits of a single 64-bit objbits_t info field because many of
  the bitwise operations are performed with enums such as num_t, dom_t,
  and prec_t, which may take on the type of 32-bit ints. It's easier to
  just keep all of those bitwise operations in 32 bits than perform a
  million typecasts throughout bli_type_defs.h and bli_obj_macro_defs.h
  to ensure that the integers are treated as 64-bit for the purposes of
  the ANDs, ORs, and bitshifts.
- Many comment updates.
- Thanks to Devin Matthews and Devangi Parikh for their feedback and
  involvement during this commit cycle.
2018-11-20 18:42:07 -06:00