778 Commits

Author SHA1 Message Date
Field G. Van Zee
00e14cb6d8 Replaced use of bool_t type with C99 bool.
Details:
- Textually replaced nearly all non-comment instances of bool_t with the
  C99 bool type. A few remaining instances, such as those in the files
  bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and
  bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being
  used not for boolean purposes but to index into an array.
- This commit constitutes the third phase of a transition toward using
  C99's bool instead of bool_t, which was raised in issue #420. The first
  phase, which cleaned up various typecasts in preparation for using
  bool as the basis for bool_t (instead of gint_t), was implemented by
  commit a69a4d7. The second phase, which redefined the bool_t typedef
  in terms of bool (from gint_t), was implemented by commit 2c554c2.
2020-07-29 14:24:34 -05:00
Field G. Van Zee
2c554c2fce Redefined bool_t typedef in terms of C99 bool.
Details:
- Changed the typedef that defines bool_t from:

    typedef gint_t bool_t;

  where gint_t is a signed integer that forms the basis of most other
  integers in BLIS, to:

    typedef bool bool_t;

- Changed BLIS's TRUE and FALSE macro definitions from being in terms of
  integer literals:

    #define TRUE  1
    #define FALSE 0

  to being in terms of C99 boolean constants:

    #define TRUE  true
    #define FALSE false

  which are provided by stdbool.h.
- This commit constitutes the second phase of a transition toward using
  C99's bool instead of bool_t, which will address issue #420. The first
  phase, which cleaned up various typecasts in preparation for using
  bool as the basis for bool_t (instead of gint_t), was implemented by
  commit a69a4d7.
2020-07-24 15:57:19 -05:00
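Note on commit 2c554c2fce: one observable consequence of basing a boolean type on C99 bool rather than on an integer type is that conversion to bool maps every non-zero value to 1, whereas conversion to an integer type keeps only the bits that fit. The sketch below illustrates this. The type names are hypothetical stand-ins, and the integer-based type is deliberately narrow (unsigned char) so the difference is visible; BLIS's gint_t is much wider.

```c
#include <stdbool.h>

/* Hypothetical stand-ins, not BLIS definitions: a narrow integer-based
   boolean type and a C99 bool-based one. */
typedef unsigned char int_bool_t;
typedef bool          c99_bool_t;

/* Converting to an unsigned integer type keeps only the low-order bits. */
int_bool_t as_int_bool( unsigned int flags )
{
    return ( int_bool_t )flags;
}

/* Converting to bool yields 1 for any non-zero value and 0 otherwise. */
c99_bool_t as_c99_bool( unsigned int flags )
{
    return ( c99_bool_t )flags;
}
```

With flags equal to 0x100, the integer-based conversion produces 0 on platforms with 8-bit char, while the bool-based conversion produces true.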
Devin Matthews
b4f47f7540 Add BLIS_EXPORT_BLIS to bli_abort. (#429)
Fixes #428.
2020-07-24 13:56:13 -05:00
Field G. Van Zee
a69a4d7e2f Cleaned up bool_t usage and various typecasts.
Details:
- Fixed various typecasts in

    frame/base/bli_cntx.h
    frame/base/bli_mbool.h
    frame/base/bli_rntm.h
    frame/include/bli_misc_macro_defs.h
    frame/include/bli_obj_macro_defs.h
    frame/include/bli_param_macro_defs.h

  that were missing or being done improperly/incompletely. For example,
  many return values were being typecast as
    (bool_t)x && y
  rather than
    (bool_t)(x && y)
  Thankfully, none of these deficiencies had manifested as actual bugs
  at the time of this commit.
- Changed the return type of bli_env_get_var() from dim_t to gint_t.
  This reflects the fact that bli_env_get_var() needs to be able to
  return a signed integer, and even though dim_t is currently defined
  as a signed integer, it does not intuitively appear to necessarily be
  signed by inspection (i.e., an integer named "dim_t" for matrix
  "dimension"). Also, updated use of bli_env_get_var() within
  bli_pack.c to reflect the changed return type.
- Redefined type of thrcomm_t.barrier_sense field from bool_t to gint_t
  and added comments to the bli_thrcomm_*.h files that will explain a
  planned replacement of bool_t with C99's bool type.
- Note: These changes are being made to facilitate the substitution of
  'bool' for 'bool_t', which will eliminate the namespace conflict with
  arm_sve.h as reported in issue #420. This commit implements the first
  phase of that transition. Thanks to RuQing Xu for reporting this
  issue.
- CREDITS file update.
2020-07-22 16:13:09 -05:00
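Note on commit a69a4d7e2f: the typecast issue described above comes from operator precedence. A cast binds more tightly than &&, so (bool_t)x && y converts x alone and then applies &&. The sketch below uses a hypothetical, deliberately narrow type (not BLIS's bool_t, which was based on the wider gint_t at the time) to show a case where the two forms disagree. One plausible reason no bugs had manifested is that a wide integer type does not truncate the values involved, but that is an assumption rather than something the log states.

```c
/* Hypothetical narrow boolean type, used only to make the effect visible. */
typedef unsigned char narrow_bool_t;

/* The cast applies to x only; the truncated x is then ANDed with y. */
int cast_then_and( int x, int y )
{
    return ( narrow_bool_t )x && y;
}

/* Intended form: evaluate the whole predicate, then cast its result. */
int and_then_cast( int x, int y )
{
    return ( narrow_bool_t )( x && y );
}
```

For x = 256 and y = 1 (with 8-bit char), the first form truncates x to 0 before the && and returns 0, while the second returns 1.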
Field G. Van Zee
a6437a5c11 Replaced broken ref99 sandbox w/ simpler version.
Details:
- The 'ref99' sandbox was broken by multiple refactorings and internal
  API changes over the last two years. Rather than try to fix it, I've
  replaced it with a much simpler version based on var2 of gemmsup.
  Why not fix the previous implementation? It occurred to me that the
  old implementation was trying to be a lightly simplified duplication
  of what exists in the framework. Duplication aside, this sandbox
  would have worked fine if it had been completely independent of the
  framework code. The problem was that it was only partially
  independent, with many function calls calling a function in BLIS
  rather than a duplicated/simplified version within the sandbox. (And
  the reason I didn't make it fully independent to begin with was that
  it seemed unnecessarily duplicative at the time.) Maintaining two
  versions of the same implementation is problematic for obvious
  reasons, especially when it wasn't even done properly to begin with.
  This explains the reimplementation in this commit. The only catch is
  that the newer implementation is single-threaded only and does not
  perform any packing on either input matrix (A or B). Basically, it's
  only meant to be a simple placeholder that shows how you could plug
  in your own implementation. Thanks to Francisco Igual for reporting
  this brokenness.
- Updated the three reference gemmsup kernels (defined in
  ref_kernels/3/bli_gemmsup_ref.c) so that they properly handle
  conjugation of conja and/or conjb. The general storage kernel, which
  is currently identical to the column-storage kernel, is used in the
  new ref99 sandbox to provide basic support for all datatypes
  (including scomplex and dcomplex).
- Minor updates to docs/Sandboxes.md, including adding the threading
  and packing limitations to the Caveats section.
- Fixed a comment typo in bli_l3_sup_var1n2m.c (upon which the new
  sandbox implementation is based).
2020-07-20 19:21:07 -05:00
Field G. Van Zee
72f6ed0637 Declare/define static functions via BLIS_INLINE.
Details:
- Updated all static function definitions to use the cpp macro
  BLIS_INLINE instead of the static keyword. This allows blis.h to
  use a different keyword (inline) to define these functions when
  compiling with C++, which might otherwise trigger "defined but
  not used" warning messages. Thanks to Giorgos Margaritis for
  reporting this issue and Devin Matthews for suggesting the fix.
- Updated the following files, which are used by configure's
  hardware auto-detection facility, to unconditionally #define
  BLIS_INLINE to the static keyword (since we know BLIS will be
  compiled with C, not C++):
    build/detect/config/config_detect.c
    frame/base/bli_arch.c
    frame/base/bli_cpuid.c
- CREDITS file update.
2020-07-03 17:55:54 -05:00
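Note on commit 72f6ed0637: the idea of routing the storage keyword through a macro can be sketched as follows. The macro name SKETCH_INLINE and the helper function are hypothetical; the real macro is BLIS_INLINE, and its exact definition in blis.h may differ from this sketch.

```c
/* Hypothetical macro: choose the keyword based on the compiling language. */
#ifdef __cplusplus
  #define SKETCH_INLINE inline
#else
  #define SKETCH_INLINE static
#endif

/* A header-style helper defined through the macro rather than with a
   hard-coded 'static' keyword. */
SKETCH_INLINE int sketch_min( int a, int b )
{
    return ( a < b ? a : b );
}
```

Files that are known to be compiled only as C can define the macro to static unconditionally before any such helper is seen, which is what the second bullet above describes for the auto-detection sources.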
Field G. Van Zee
6af59b7057 Fixed disabled edge case optimization in gemmsup.
Details:
- Fixed an inadvertently disabled edge case optimization in the two
  gemmsup variants in bli_l3_sup_var1n2m.c. Background: These edge case
  optimizations allow the last millikernel operation in the jr loop to
  be executed with an inflated register blocksize if it is the last
  (or only) iteration. For example, suppose mr=6 and nr=8 and the gemmsup
  problem is m=8, n=100, k=100. (In this case, the panel-block variant
  (var1n) is executed, which places the jr loop in the m dimension.)
  In principle, this problem could be executed as two millikernels: one
  with dimensions 6x100x100, and one as 2x100x100. However, with the
  support for inflated blocksizes in the kernel, the entire 8x100x100
  problem can be passed to the millikernel function, which will then
  execute it more favorably as two 4x100x100 millikernel sub-calls.
  Now, this optimization is disabled under certain circumstances, such
  as when multithreading. Previously, the is_mt predicate was being set
  incorrectly such that it was non-zero even when running
  single-threaded.
- Upon fixing the is_mt issue above, another bit of code needed to be
  moved so that the result of the optimization could have an impact on
  the assignment of loop bounds ranges to threads.
2020-07-01 14:54:23 -05:00
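Note on commit 6af59b7057: the arithmetic in the m=8, mr=6 example can be sketched as below. Both functions are hypothetical. The log only states that the 8x100x100 problem is executed as two 4x100x100 sub-calls; the general "split as evenly as possible" rule used here is an assumption made for illustration, not the kernel's documented policy.

```c
#include <assert.h>

/* Piece sizes when an extent with mr < m <= 2*mr is split at the register
   blocksize: one full block followed by the remainder. */
void split_at_mr( int m, int mr, int* first, int* second )
{
    assert( mr < m && m <= 2 * mr );
    *first  = mr;
    *second = m - mr;
}

/* Piece sizes when the same extent is instead split as evenly as possible
   (the first piece takes the extra row when m is odd). */
void split_evenly( int m, int mr, int* first, int* second )
{
    assert( mr < m && m <= 2 * mr );
    *first  = ( m + 1 ) / 2;
    *second = m / 2;
}
```

For m=8 and mr=6, the first function yields 6 and 2, and the second yields 4 and 4, matching the two scenarios in the commit message.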
Field G. Van Zee
b5b604e106 Ensure random objects' 1-norms are non-zero.
Details:
- Fixed an innocuous bug that manifested when running the testsuite on
  extremely small matrices with randomization via the "powers of 2 in
  narrow precision range" option enabled. When the randomization
  function emits a perfect 0.0 to fill a 1x1 matrix, the testsuite will
  then compute 0.0/0.0 during the normalization process, which leads to
  NaN residuals. The solution entails smarter implementations of randv,
  randnv, randm, and randnm, each of which will compute the 1-norm of
  the vector or matrix in question. If the object has a 1-norm of 0.0,
  the object is re-randomized until the 1-norm is not 0.0. Thanks to
  Kiran Varaganti for reporting this issue (#413).
- Updated the implementation of randm_unb_var1() so that it loops over
  a call to the randv_unb_var1() implementation directly rather than
  calling it indirectly via randv(). This was done to avoid the overhead
  of multiple calls to norm1v() when randomizing the rows/columns of a
  matrix.
- Updated comments.
2020-06-17 16:42:24 -05:00
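Note on commit b5b604e106: the "re-randomize until the 1-norm is non-zero" approach can be sketched as below. All names are hypothetical and the code is independent of BLIS. The generator is passed in as a callback so that the retry loop can be exercised deterministically; toy_gen exists only for demonstration.

```c
#include <stddef.h>

/* 1-norm of a vector: the sum of the absolute values of its elements. */
double norm1v( size_t n, const double* x )
{
    double sum = 0.0;
    for ( size_t i = 0; i < n; ++i )
        sum += ( x[i] < 0.0 ? -x[i] : x[i] );
    return sum;
}

/* Fill x from the caller-supplied generator, repeating until the 1-norm is
   non-zero. The loop terminates only if gen eventually returns a non-zero
   value. */
void randv_nonzero( size_t n, double* x, double ( *gen )( void ) )
{
    if ( n == 0 ) return; /* An empty vector always has a 1-norm of zero. */
    do
    {
        for ( size_t i = 0; i < n; ++i ) x[i] = gen();
    }
    while ( norm1v( n, x ) == 0.0 );
}

/* Demonstration generator: returns 0.0 on its first two calls, then 0.5. */
double toy_gen( void )
{
    static int calls = 0;
    return ( calls++ < 2 ? 0.0 : 0.5 );
}
```

Filling a one-element vector with toy_gen takes three attempts and leaves 0.5 in the vector, so a later normalization by the 1-norm no longer computes 0.0/0.0.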
Field G. Van Zee
787adad73b Defined netlib equivalent of xerbla_array().
Details:
- Added a function definition for xerbla_array_(), which largely mirrors
  its netlib implementation. Thanks to Isuru Fernando for suggesting the
  addition of this function.
2020-05-08 16:18:20 -05:00
Guodong Xu
f032d5d4a6 New kernel set for Arm SVE using assembly (#396)
This commit adds two kernels for the Arm SVE vector extension.
1. a gemm  kernel for double at sizes 8x8.
2. a packm kernel for double at dimension 8xk.

To achieve the best performance, vector-length-agnostic programming
is not used. A vector length (VL) of 256 bits is mandated in both kernels.
Kernels to support other VLs can be added later.

"SVE is a vector extension for AArch64 execution mode for the A64
instruction set of the Armv8 architecture. Unlike other SIMD architectures,
SVE does not define the size of the vector registers, but constrains into
a range of possible values, from a minimum of 128 bits up to a maximum of
2048 in 128-bit wide units. Therefore, any CPU vendor can implement the
extension by choosing the vector register size that better suits the
workloads the CPU is targeting. Instructions are provided specifically
to query an implementation for its register size, to guarantee that
the applications can run on different implementations of the ISA without
the need to recompile the code."  [1]

[1] https://developer.arm.com/solutions/hpc/resources/hpc-white-papers/arm-scalable-vector-extensions-and-application-to-machine-learning

Signed-off-by: Guodong Xu <guodong.xu@linaro.org>
2020-04-29 12:08:46 -05:00
Field G. Van Zee
477ce91c52 Moved #include "cpuid.h" to bli_cpuid.c.
Details:
- Relocated the #include "cpuid.h" directive from bli_cpuid.h to
  bli_cpuid.c. This was done because cpuid.h (which is pulled into
  the post-build blis.h developer header) doesn't protect its
  definitions with a preprocessor guard of the form:

    #ifndef FOOBAR_H
    #define FOOBAR_H
    // header contents.
    #endif

  and as a result, applications (previously) could not #include both
  blis.h and cpuid.h (since the former was already including the
  latter). Thanks to Bhaskar Nallani for raising this issue via #393
  and to Devin Matthews for suggesting this fix.
- CREDITS file update.
2020-04-22 14:26:49 -05:00
Field G. Van Zee
976902406b Disable packing by default in expert rntm_t init.
Details:
- Changed the behavior of bli_rntm_init() as well as the static
  initializer, BLIS_RNTM_INITIALIZER, so that user-initialized rntm_t
  objects by default specify the disabling of packing for A and B.
  Packing of A/B was already disabled by default when calling non-expert
  APIs (and enabled only when the user set environment variables
  BLIS_PACK_A or BLIS_PACK_B). With this commit, the default behavior of
  using user-initialized rntm_t objects with expert APIs comes into line
  with the default behavior of non-expert APIs--that is, they now both
  lead to the avoidance of packing in the sup code path. (Note: The
  conventional code path is unaffected by the environment variables
  BLIS_PACK_A/BLIS_PACK_B and/or the disabling of packing in a rntm_t
  object when calling an expert API.) This addresses issue #392. Thanks
  to Kiran Varaganti for bringing this inconsistency to our attention.
- The above change was accomplished by changing the definitions of
  static functions bli_rntm_clear_pack_a() and bli_rntm_clear_pack_b()
  in bli_rntm.h, which are both for internal use only.
2020-04-17 15:11:10 -05:00
Field G. Van Zee
2cb604ba47 Rename more bli_thread_obarrier(), _obroadcast().
Details:
- Renamed instances of bli_thread_obarrier() and bli_thread_obroadcast()
  that were made in the supmt-specific code committed to the 'amd'
  branch, which has now been merged with 'master'. Prior to the merge,
  'master' received commit c01d249, which applied these renamings to
  the existing, non-sup codebase.
2020-04-06 16:42:14 -05:00
Field G. Van Zee
2e3b3782cf Merge branch 'master' into amd 2020-04-06 14:55:35 -05:00
Field G. Van Zee
9f3a8d4d85 Added missing return to bli_thread_partition_2x2().
Details:
- Added a missing return statement to the body of an early case handling
  branch in bli_thread_partition_2x2(). This bug only affected cases
  where n_threads < 4, and even then, the code meant to handle cases
  where n_threads >= 4 executes and does the right thing, albeit using
  more CPU cycles than needed. Nonetheless, thanks to Kiran Varaganti
  for reporting this bug via issue #377.
- Whitespace changes to bli_thread.c (spaces -> tabs).
2020-03-14 17:48:43 -05:00
Field G. Van Zee
c01d249d7c Renamed bli_thread_obarrier(), _obroadcast().
Details:
- Renamed two bli_thread_*() APIs:
    bli_thread_obarrier()   -> bli_thread_barrier()
    bli_thread_obroadcast() -> bli_thread_broadcast()
  The 'o' was a leftover from when thrcomm_t objects tracked both
  "inner" and "outer" communicators. They have long since been
  simplified to only support the latter, and thus the 'o' is
  superfluous.
2020-02-25 14:50:53 -06:00
Field G. Van Zee
9e5f7296cc Skip building thrinfo_t tree when mt is disabled.
Details:
- Return early from bli_thrinfo_sup_grow() if the thrinfo_t object
  address is equal to either &BLIS_GEMM_SINGLE_THREADED or
  &BLIS_PACKM_SINGLE_THREADED.
- Added preprocessor logic to bli_l3_sup_thread_decorator() in
  bli_l3_sup_decor_single.c that (by default) disables code that
  creates and frees the thrinfo_t tree and instead passes
  &BLIS_GEMM_SINGLE_THREADED as the thrinfo_t pointer into the
  sup implementation.
- The net effect of the above changes is that a small amount of
  thrinfo_t overhead is avoided when running small/skinny dgemm
  problems when BLIS is compiled with multithreading disabled.
2020-02-18 15:16:03 -06:00
Field G. Van Zee
90081e6a64 Fixed bug(s) in mt sup when single-threaded.
Details:
- Fixed a syntax bug in bli_l3_sup_decor_single.c as a result of
  changing function interface for the thread entry point function
  (of type l3supint_t).
- Unfortunately, fixing the interface was not enough, as it caused
  a memory leak in the sba at bli_finalize() time. It turns out that,
  due to the new multithreading-capable variant code using thrinfo_t
  objects--specifically, their calling of bli_thrinfo_grow()--we
  have to pass in a real thrinfo_t object rather than the global
  objects &BLIS_PACKM_SINGLE_THREADED or &BLIS_GEMM_SINGLE_THREADED.
  Thus, I inserted the appropriate logic from the OpenMP and pthreads
  versions so that single-threaded execution would work as intended
  with the newly upgraded variants.
2020-02-17 14:57:25 -06:00
Field G. Van Zee
c0558fde45 Support multithreading within the sup framework.
Details:
- Added multithreading support to the sup framework (via either OpenMP
  or pthreads). Both variants 1n and 2m now have the appropriate
  threading infrastructure, including data partitioning logic, to
  parallelize computation. This support handles all four combinations
  of packing on matrices A and B (neither, A only, B only, or both).
  This implementation tries to be a little smarter when automatic
  threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
  recalculate the factorization in units of micropanels (rather than
  using the raw dimensions) in bli_l3_sup_int.c, when the final
  problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
  or column-stored matrices. (This is used for the rrc and crc storage
  cases.) Previously, copym was used, but that would no longer suffice
  because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
  bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
  instead of from the variant functions. This has the effect of making
  the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
  and inserted usage of these functions within bli_thrinfo_init(), which
  previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
  whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
  tests, as well as appropriate octave/matlab scripts to plot the
  resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
  that specifying any BLIS_*_NT variable, even if it is set to 1, will
  be considered manual specification for the purposes of determining
  whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.
2020-02-17 14:08:08 -06:00
Field G. Van Zee
d7a7679182 Fixed int-to-packbuf_t conversion error (C++ only).
Details:
- Fixed an error that manifests only when using C++ (specifically,
  modern versions of g++) to compile drivers in 'test' (and likely most
  other application code that #includes blis.h). Thanks to Ajay Panyala
  for reporting this issue (#374).
2020-02-07 17:37:03 -06:00
Dave Love
f391b3e2e7 Fix parsing in vpu_count on workstation SKX (#351)
* Fix parsing in vpu_count on workstation SKX

* Document Skylake-X as Haswell for single FMA

* Update vpu_count for Skylake and Cascade Lake models

* Support printing the configuration selected, controlled by the environment

Intended particularly for diagnosing mis-selection of SKX through
unknown, or incorrect, number of VPUs.

* Move bli_log outside the cpp condition, and use it where intended

* Add Fixme comment (Skylake D)

* Mostly superficial edits to commits towards #351.

Details:
- Moved architecture/sub-config logging-related code from bli_cpuid.c
  to bli_arch.c, tweaked names, and added more set/get layering.
- Tweaked log messages output from bli_cpuid_is_skx() in bli_cpuid.c.
- Content, whitespace changes to new bullet in HardwareSupport.md that
  relates to single-VPU Skylake-Xs.

* Fix comment typos

Co-authored-by: Field G. Van Zee <field@cs.utexas.edu>
2020-01-06 14:15:48 -06:00
Field G. Van Zee
5271107378 Fixed bugs in cblas_sdsdot(), sdsdot_().
Details:
- Fixed a bug in sdsdot_sub() that redundantly added the "alpha" scalar,
  named 'sb'. This value was already being added by the underlying
  sdsdot_() function. Thus, we no longer add 'sb' within sdsdot_sub().
  Thanks to Simon Lukas Märtens for reporting this bug via #367.
- Fixed a second bug in order of typecasting intermediate products in
  sdsdot_(). Previously, the "alpha" scalar was being added after the
  "outer" typecast to float. However, the operation is supposed to first
  add the dot product to the (promoted) scalar and THEN downcast the sum
  to float. Thanks to Devin Matthews for catching this bug.
2019-12-16 16:30:26 -06:00
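Note on commit 5271107378: the second fix concerns where the rounding to float happens. The sketch below contrasts the two orderings. Both functions are hypothetical, assume unit strides, and are not the BLIS or netlib implementations.

```c
/* Intended order: accumulate in double starting from the promoted scalar
   sb, and round to float only once, at the end. */
float sdsdot_sketch( int n, float sb, const float* x, const float* y )
{
    double acc = ( double )sb;
    for ( int i = 0; i < n; ++i )
        acc += ( double )x[i] * ( double )y[i];
    return ( float )acc;
}

/* Ordering described as buggy: the dot product is rounded to float first,
   and sb is added afterwards in single precision. */
float sdsdot_rounded_early( int n, float sb, const float* x, const float* y )
{
    double acc = 0.0;
    for ( int i = 0; i < n; ++i )
        acc += ( double )x[i] * ( double )y[i];
    return ( float )acc + sb;
}
```

With x = y = { 1, 2^-15 } and sb = -1, the dot product is 1 + 2^-30 in double precision. The first function returns 2^-30, while the second rounds the dot product to 1.0f before adding sb and returns 0.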
Field G. Van Zee
fe2560a4b1 Annotated missing thread-related symbols for export.
Details:
- Added BLIS_EXPORT_BLIS annotation to function prototypes for

    bli_thrcomm_bcast()
    bli_thrcomm_barrier()
    bli_thread_range_sub()

  so that these functions are exported to shared libraries by default.
  This (hopefully) fixes issue #366. Thanks to Kyungmin Lee for
  reporting this bug.
- CREDITS file update.
2019-12-06 17:12:44 -06:00
Field G. Van Zee
efa61a6c8b Added missing bli_l3_sup_thread_decorator() symbol.
Details:
- Defined dummy versions of bli_l3_sup_thread_decorator() for OpenMP
  and pthreads so that those builds don't fail when performing shared
  library linking (especially for Windows DLLs via AppVeyor). For now,
  these dummy implementations of bli_l3_sup_thread_decorator() are
  merely carbon-copies of the implementation provided for single-
  threaded execution (i.e., the one found in bli_l3_sup_decor_single.c).
  Thus, an OpenMP or pthreads build will be able to use the gemmsup
  code (including the new selective packing functionality), as it did
  before 39fa7136, even though it will not actually employ any
  multithreaded parallelism.
2019-11-29 16:17:04 -06:00
Field G. Van Zee
39fa7136f4 Added support for selective packing to gemmsup.
Details:
- Implemented optional packing for A or B (or both) within the sup
  framework (which currently only supports gemm). The request for
  packing either matrix A or matrix B can be made via setting
  environment variables BLIS_PACK_A or BLIS_PACK_B (to any
  non-zero value; if set, zero means "disable packing"). It can also
  be made globally at runtime via bli_pack_set_pack_a() and
  bli_pack_set_pack_b() or with individual rntm_t objects via
  bli_rntm_set_pack_a() and bli_rntm_set_pack_b() if using the expert
  interface of either the BLIS typed or object APIs. (If using the
  BLAS API, environment variables are the only way to communicate the
  packing request.)
- One caveat (for now) with the current implementation of selective
  packing is that any blocksize extension registered in the _cntx_init
  function (such as is currently used by haswell and zen subconfigs)
  will be ignored if the affected matrix is packed. The reason is
  simply that I didn't get around to implementing the necessary logic
  to pack a larger edge-case micropanel, though this is entirely
  possible and should be done in the future.
- Spun off the variant-choosing portion of bli_gemmsup_ref() into
  bli_gemmsup_int(), in bli_l3_sup_int.c.
- Added new files, bli_l3_sup_packm_a.c, bli_l3_sup_packm_b.c, along
  with corresponding headers, in which higher-level packm-related
  functions are defined for use within the sup framework. The actual
  packm variant code resides in bli_l3_sup_packm_var.c.
- Pass the following new parameters into var1n and var2m: packa, packb
  bool_t's, pointer to a rntm_t, pointer to a cntl_t (which is for now
  always NULL), and pointer to a thrinfo_t* (which for now is the address
  of the global single-threaded packm thread control node).
- Added panel strides ps_a and ps_b to the auxinfo_t structure so that
  the millikernel can query the panel stride of the packed matrix and
  step through it accordingly. If the matrix isn't packed, the panel
  stride of interest for the given millikernel will be set to the
  appropriate value so that the mkernel may step through the unpacked
  matrix as it normally would.
- Modified the rv_6x8m and rv_6x8n millikernels to read the appropriate
  panel strides (ps_a and ps_b, respectively) instead of computing them
  on the fly.
- Spun off the environment variable getting and setting functions into
  a new file, bli_env.c (with a corresponding prototype header). These
  functions are now used by the threading infrastructure (e.g.
  BLIS_NUM_THREADS, BLIS_JC_NT, etc.) as well as the selective packing
  infrastructure (e.g. BLIS_PACK_A, BLIS_PACK_B).
- Added a static initializer for mem_t objects, BLIS_MEM_INITIALIZER.
- Added a static initializer for pblk_t objects, BLIS_PBLK_INITIALIZER,
  for use within the definition of BLIS_MEM_INITIALIZER.
- Moved the global_rntm object to bli_rntm.c and extern it where needed.
  This means that the function bli_thread_init_rntm() was renamed to
  bli_rntm_init_from_global() and relocated accordingly.
- Added a new file, bli_pack.c, which serves as the home for
  functions that manage the pack_a and pack_b fields of the global
  rntm_t, including from environment variables, just as we have
  functions to manage the threading fields of the global rntm_t in
  bli_thread.c.
- Reorganized naming for files in frame/thread, which mostly involved
  spinning off the bli_l3_thread_decorator() functions into their own
  files. This change makes more sense when considering the further
  addition of bli_l3_sup_thread_decorator() functions (for now limited
  only to the single-threaded form found in the  _single.c file).
- Explicitly initialize the reference sup handlers in both
  bli_cntx_init_haswell.c and bli_cntx_init_zen.c so that it's more
  obvious how to customize to a different handler, if desired.
- Removed various snippets of disabled code.
- Various comment updates.
2019-11-29 15:27:07 -06:00
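Note on commit 39fa7136f4: the environment-variable rule for BLIS_PACK_A and BLIS_PACK_B (unset keeps the default, zero disables packing, any non-zero value requests it) can be sketched with the standard getenv() and strtol() functions. The helper names below are hypothetical and are not the functions defined in bli_env.c. Treating a non-numeric value as zero is an assumption of this sketch.

```c
#include <stdlib.h>

/* Parse a decimal integer from s. Returns fallback if s is NULL or does
   not begin with a number. */
long parse_int_or( const char* s, long fallback )
{
    if ( s == NULL ) return fallback;
    char* end = NULL;
    long  val = strtol( s, &end, 10 );
    return ( end == s ? fallback : val );
}

/* Interpret a packing request: NULL (variable unset) keeps the default,
   zero disables packing, and any non-zero value requests it. */
int pack_requested( const char* value, int default_request )
{
    if ( value == NULL ) return default_request;
    return ( parse_int_or( value, 0 ) != 0 );
}

/* Convenience wrapper that looks the variable up in the environment. */
int pack_requested_from_env( const char* name, int default_request )
{
    return pack_requested( getenv( name ), default_request );
}
```

A caller would evaluate, for example, pack_requested_from_env( "BLIS_PACK_A", 0 ) once at initialization and store the result in its runtime settings.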
Field G. Van Zee
881b05ecd4 Fixed blastest failure for 'generic' subconfig.
Details:
- Fixed a subtle and complicated bug that only manifested via the BLAS
  test drivers in the generic subconfiguration, and possibly any other
  subconfiguration that did not register complex-domain gemm ukernels,
  or registered ONLY real-domain ukernels as row-preferential. This is
  a long story, but it boils down to an exception to the "transpose the
  operation to bring storage of C into agreement with ukernel pref"
  optimization in bli_hemm_front.c and bli_symm_front.c sabotaging the
  proper functioning of the 1m method, but only when the imaginary
  component of beta is zero. See the comments in issue #342 for more
  details. Thanks to Dave Love for identifying the commit in which this
  bug was introduced, and other feedback related to this bug.
2019-11-21 16:34:27 -06:00
Field G. Van Zee
0c7165fb01 Fixed obscure bug in bli_acquire_mpart_[mn]dim().
Details:
- Fixed a bug in bli_acquire_mpart_mdim(), bli_acquire_mpart_ndim(),
  and bli_acquire_mpart_mndim() that allowed the use of a blocksize b
  that is too large given the current row/column index (i.e., the i/j
  argument) and the size of the dimension being partitioned (i.e., the
  m/n argument). This bug only affected backwards partitioning/motion
  through the dimension and was the result of a misplaced conditional
  check-and-redirect to the backwards code path. It should be noted
  that this bug did not actually manifest within BLIS (thanks to the
  callers in BLIS making sure to always pass in the "correct"
  blocksize b), but it could have manifested if the functions were
  used by 3rd party callers. Thanks to Minh Quan Ho for
  reporting the bug via issue #363.
2019-11-14 16:48:14 -06:00
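Note on commit 0c7165fb01: the constraint at issue is that the blocksize must never exceed what remains in the dimension at the current index, in either direction of motion. The sketch below shows one way to express that. The struct, the function, and the convention that a backward index i counts elements already consumed from the end are all assumptions made for illustration, not the actual bli_acquire_mpart_*() logic.

```c
#include <assert.h>

/* Hypothetical description of a partition: where it starts and how many
   rows (or columns) it spans. */
typedef struct { int offset; int size; } part_t;

/* Clamp the requested blocksize b to the m - i elements that remain, and
   only then compute the offset for the requested direction. */
part_t acquire_part( int m, int i, int b, int backward )
{
    assert( 0 <= i && i <= m && b >= 0 );
    part_t p;
    p.size   = ( b > m - i ? m - i : b );
    p.offset = ( backward ? m - i - p.size : i );
    return p;
}
```

The point of the ordering is that the clamp is applied before the direction is considered, so the backward path cannot bypass it.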
Field G. Van Zee
bdc7ee3394 Various fixes to support packing duplication in B.
Details:
- Added cpp macros to trmm and trmm3 front-ends to optionally force
  those operations to be cast so the structured matrix is on the left.
  symm and hemm already had such macros, but these too were renamed so
  that the macros were individual to the operation. We now have four
  such macros:
    #define BLIS_DISABLE_HEMM_RIGHT
    #define BLIS_DISABLE_SYMM_RIGHT
    #define BLIS_DISABLE_TRMM_RIGHT
    #define BLIS_DISABLE_TRMM3_RIGHT
  Also, updated the comments in the symm and hemm front-ends related to
  the first two macro guards, and added corresponding comments to the
  trmm and trmm3 front-ends for the latter two guards. (They all
  functionally do the same thing, just for their specific operations.)
  Thanks to Jeff Hammond for reporting the bugs that led me to this
  change (via #359).
- Updated config/old/haswellbb subconfiguration (used to debug issues
  related to duplicating B during packing) to register: a packing
  kernel for single-precision real; gemmbb ukernels for s, c, and z;
  trsmbb ukernels for s, c, and z; gemmtrsmbb virtual ukernels for s, c
  and z; and to use non-default cache and register blocksizes for s, c,
  and z datatypes. Also declared prototypes for all of the gemmbb,
  trsmbb, and gemmtrsmbb ukernel functions within the
  bli_cntx_init_haswellbb() function. This should, once applied to the
  power9 configuration, fix the remaining issues in #359.
- Defined bli_spackm_6xk_bb4_ref(), which packs single reals with a
  duplication factor of 4. This function is defined in the same file as
  bli_dpackm_6xk_bb2_ref() (bli_packm_cxk_bb_ref.c).
2019-11-11 15:47:17 -06:00
Jérôme Duval
f377bb4485 Add Haiku to the known OS list (#361) 2019-11-07 16:39:29 -06:00
Field G. Van Zee
e29b1f9706 Fixed failing testsuite gemmtrsm_ukr for power9.
Details:
- Added code that fixes false failures in the gemmtrsm_ukr module of the
  testsuite. The tests were failing because the computation (bli_gemv())
  that performs the numerical check was not able to properly traverse
  the matrix operands bx1 and b11 that are views into the micropanel of
  B, which has duplicated/broadcast elements under the power9 subconfig.
  (For example, a micropanel of B with duplication factor of 2 needs to
  use a column stride of 2; previously, the column stride was being
  interpreted as 1.)
- Defined separate bli_obj_set_row_stride() and bli_obj_set_col_stride()
  static functions in bli_obj_macro_defs.h. (Previously, only the
  function bli_obj_set_strides() was defined. Amazing to think that we
  got this far without these former functions.)
- Updated/expounded upon comments.
2019-11-05 17:15:19 -06:00
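Note on commit e29b1f9706: the column-stride issue can be illustrated with a small packing sketch in which every element is stored d times in a row. The functions and the row-major packed layout are hypothetical simplifications; the actual micropanel layout used by BLIS under the power9 subconfig may differ.

```c
#include <stddef.h>

/* Pack a k x n row-major block b (leading dimension ldb) into p so that
   each element appears d times consecutively. Each packed row therefore
   holds n*d values. */
void pack_bb( size_t k, size_t n, size_t d,
              const double* b, size_t ldb, double* p )
{
    for ( size_t i = 0; i < k; ++i )
        for ( size_t j = 0; j < n; ++j )
            for ( size_t r = 0; r < d; ++r )
                p[ i * n * d + j * d + r ] = b[ i * ldb + j ];
}

/* Read logical element (i,j) back from the packed buffer. The column
   stride is d, not 1, and the row stride is n*d. */
double packed_elem( size_t n, size_t d, const double* p, size_t i, size_t j )
{
    return p[ i * ( n * d ) + j * d ];
}
```

Packing the 2x2 block { 1, 2; 3, 4 } with d = 2 gives the buffer 1 1 2 2 3 3 4 4. Reading element (1,1) with a column stride of 2 returns 4, whereas assuming a column stride of 1 lands on a duplicate of element (1,0) and returns 3, which is the kind of mismatch the testsuite fix addresses.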
Field G. Van Zee
c84391314d Reverted minor temp/wspace changes from b426f9e.
Details:
- Added missing license header to bli_pwr9_asm_macros_12x6.h.
- Reverted temporary changes to various files in 'test' and 'testsuite'
  directories.
- Moved testsuite/jobscripts into testsuite/old.
- Minor whitespace/comment changes across various files.
2019-11-04 13:57:12 -06:00
Nicholai Tukanov
b426f9e04e POWER9 DGEMM (#355)
Implemented and registered power9 dgemm ukernel.

Details:
- Implemented 12x6 dgemm microkernel for power9. This microkernel 
  assumes that elements of B have been duplicated/broadcast during the
  packing step. The microkernel uses a column orientation for its 
  microtile vector registers and thus implements column storage and 
  general stride IO cases. (A row storage IO case via in-register
  transposition may be added at a future date.) It should be noted that 
  we recommend using this microkernel with gcc and *not* xlc, as issues 
  with the latter cropped up during development, including but not 
  limited to slightly incompatible vector register mnemonics in the GNU 
  extended inline assembly clobber list.
2019-11-01 17:57:03 -05:00
Field G. Van Zee
b9bc222bfc Call bli_syrk_small() before error checking.
Details:
- In bli_syrk_front(), moved the conditional call to bli_syrk_check()
  (if error checking is enabled) and the conditional scaling of C by
  beta (if alpha is zero) so that they occur after, instead of before,
  the call to bli_syrk_small(). This sequencing now matches that of
  bli_gemm_small() in bli_gemm_front() and bli_trsm_small() in
  bli_trsm_front().
2019-10-14 16:38:15 -05:00
Field G. Van Zee
6218ac95a5 Merge branch 'master' into amd 2019-10-11 11:53:51 -05:00
Field G. Van Zee
29b0e1ef4e Code review + tweaks to AMD's AOCL 2.0 PR (#349).
Details:
- NOTE: This is a merge commit of 'master' of git://github.com/amd/blis
  into 'amd-master' of flame/blis.
- Fixed a bug in the downstream value of BLIS_NUM_ARCHS, which was
  inadvertently not incremented when the Zen2 subconfiguration was
  added.
- In bli_gemm_front(), added a missing conditional constraint around the
  call to bli_gemm_small() that ensures that the computation precision
  of C matches the storage precision of C.
- In bli_syrk_front(), reorganized and relocated the notrans/trans logic
  that existed around the call to bli_syrk_small() into bli_syrk_small()
  to minimize the calling code footprint and also to bring that code
  into stylistic harmony with similar code in bli_gemm_front() and
  bli_trsm_front(). Also, replaced direct accessing of obj_t fields with
  proper accessor static functions (e.g. 'a->dim[0]' becomes
  'bli_obj_length( a )').
- Added #ifdef BLIS_ENABLE_SMALL_MATRIX guard around prototypes for
  bli_gemm_small(), bli_syrk_small(), and bli_trsm_small(). This is
  strictly speaking unnecessary, but it serves as a useful visual cue to
  those who may be reading the files.
- Removed cpp macro-protected small matrix debugging code from
  bli_trsm_front.c.
- Added a GCC_OT_9_1_0 variable to build/config.mk.in to facilitate gcc
  version check for availability of -march=znver2, and added appropriate
  support to configure script.
- Cleanups to compiler flags common to recent AMD microarchitectures in
  config/zen/amd_config.mk, including: removal of -march=znver1 et al.
  from CKVECFLAGS (since the -march flag is added within make_defs.mk);
  setting CRVECFLAGS similarly to CKVECFLAGS.
- Cleanups to config/zen/bli_cntx_init_zen.c.
- Cleanups, added comments to config/zen/make_defs.mk.
- Cleanups to config/zen2/make_defs.mk, including making use of newly-
  added GCC_OT_9_1_0 and existing GCC_OT_6_1_0 to choose the correct
  set of compiler flags based on the version of gcc being used.
- Reverted downstream changes to test/test_gemm.c.
- Various whitespace/comment changes.
2019-10-11 10:24:24 -05:00
Field G. Van Zee
c60db26aee Fixed bad loop counter in bli_[cz]scal2bbs_mxn().
Details:
- Fixed a typo in the loop counter for the 'd' (duplication) dimension
  in the complex macros of frame/include/level0/bb/bli_scal2bbs_mxn.h.
  They shouldn't be used by anyone yet, but thankfully clang via
  AppVeyor spit out warnings that alerted me to the issue.
2019-09-17 18:04:17 -05:00
Field G. Van Zee
31c8657f1d Added support for pre-broadcast when packing B.
Details:
- Added support for duplicating (broadcasting) elements in memory when
  packing matrix B (i.e., the right-hand operand) in level-3 operations.
  This turns out to be advantageous for some architectures that can
  afford the cost of the extra bandwidth and benefit from the
  pre-broadcast elements (and thus can avoid using broadcast-style load
  instructions on micro-rows of B in the gemm microkernel).
- Support optionally disabling right-side hemm and symm. If this occurs,
  hemm_r is implemented in terms of hemm_l (and symm_r in terms of
  symm_l). This is needed when broadcasting during packing because the
  alternative--supporting the broadcast of B while also allowing matrix
  B to be Hermitian/symmetric--would be an absolute mess.
- Support alignment factors for packed blocks of A, B, and C separately
  (as well as for general-purpose buffers). In addition, we support
  byte offsets from those alignment values (which is different from
  aligning by align+offset bytes to begin with). The default alignment
  values are BLIS_PAGE_SIZE in all four cases, with the offset values
  defaulting to zero.
- Pass pack_t schema into bli_?packm_cxk() so that it can be then passed
  into the packm kernel, where it will be needed by packm kernels that
  perform broadcasts of B, since the idea is that we *only* want to
  broadcast when packing micropanels of B and not A.
- Added definition for variadic bli_cntx_set_l3_vir_ukrs(), which can be
  used to set custom virtual level-3 microkernels in the cntx_t, which
  would typically be done in the bli_cntx_init_*() function defined in
  the subconfiguration of interest.
- Added a "broadcast B" kernel function for use with NP/NR = 12/6,
  defined in ref_kernels/1m/bli_packm_cxk_bb_ref.c.
- Added gemm, gemmtrsm, and trsm "broadcast B" reference kernels,
  defined in ref_kernels/3/bb. (These kernels have been tested with
  double real with NP/NR = 12/6.)
- Added #ifndef ... #endif guards around several macro constants defined
  in frame/include/bli_kernel_macro_defs.h.
- Defined a few "broadcast B" static functions in
  frame/include/level0/bb for use by "broadcast B"-style packm reference
  kernels. For now, only the real domain kernels are tested and fully
  defined.
- Output the alignment and offset values for packed blocks of A and B
  in the testsuite's "BLIS configuration info" section.
- Comment updates to various files.
- Bumped so_version to 3.0.0.
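
  The pre-broadcast packing described in the first bullet above can be
  sketched as follows. This is only an illustration under stated
  assumptions, not the code added by this commit: the helper name
  pack_row_bcast() and its parameters (nr, d, ldb) are hypothetical, and
  the real packm kernels handle scaling, conjugation, and edge cases that
  are omitted here.

```c
#include <stddef.h>

/* Hypothetical helper (not a BLIS function): packs one micro-row of B
   into p, storing each element d consecutive times.
     nr  - number of elements in the micro-row
     d   - duplication (broadcast) factor
     b   - first element of the micro-row
     ldb - stride, in elements, between consecutive elements of the row
     p   - destination buffer with room for nr * d elements */
static void pack_row_bcast( size_t nr, size_t d,
                            const double* b, size_t ldb,
                            double* p )
{
    for ( size_t j = 0; j < nr; ++j )
    {
        /* The inner loop runs over the duplication dimension. */
        for ( size_t k = 0; k < d; ++k )
        {
            p[ j * d + k ] = b[ j * ldb ];
        }
    }
}
```

  With nr = 3 and d = 2, a row { 1, 2, 3 } would be packed as
  { 1, 1, 2, 2, 3, 3 }, so a microkernel could read each value with an
  ordinary vector load rather than a broadcast-style load.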
2019-09-17 17:42:10 -05:00
Field G. Van Zee
1cfe8e2562 Reimplemented bli_cpuid_query() for ARM.
Details:
- Rewrote bli_cpuid_query() for ARM architectures to use stdio-based
  functions such as fopen() and fgets() instead of popen(). The new code
  does more or less the same thing as before--searches /proc/cpuinfo for
  various strings, which are then parsed in order to determine the
  model, part number, and features. Thanks to Dave Love for suggesting
  this change in issue #335.
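
  A minimal sketch of the stdio-based approach described above, assuming
  the field of interest appears on a "key : value" line. The helper name
  cpuinfo_find_field() is hypothetical and is not part of
  bli_cpuid_query(); it takes an already-open stream so that it does not
  depend on any particular path.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: scans an open text stream for the first line that
   begins with key and copies the text following the ':' into out
   (at most out_len - 1 characters, always null-terminated).
   Returns 1 if the key was found, 0 otherwise. */
static int cpuinfo_find_field( FILE* fp, const char* key,
                               char* out, size_t out_len )
{
    char   line[ 512 ];
    size_t key_len = strlen( key );

    if ( out_len == 0 ) return 0;

    while ( fgets( line, sizeof( line ), fp ) != NULL )
    {
        /* Only consider lines that start with the requested key. */
        if ( strncmp( line, key, key_len ) != 0 ) continue;

        char* val = strchr( line + key_len, ':' );
        if ( val == NULL ) continue;

        /* Skip the colon and any blanks that follow it. */
        ++val;
        while ( *val == ' ' || *val == '\t' ) ++val;

        /* Drop the trailing newline, if any. */
        val[ strcspn( val, "\r\n" ) ] = '\0';

        strncpy( out, val, out_len - 1 );
        out[ out_len - 1 ] = '\0';
        return 1;
    }

    return 0;
}
```

  A caller would typically open the file with fopen( "/proc/cpuinfo", "r" ),
  look up a key (for example "CPU part", assuming that is how the field is
  labeled on the target system), and then fclose() the stream.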
2019-09-05 16:08:30 -05:00
Devin Matthews
7c78191457 Always use sqsumv to compute normfv. (#334)
* Always use sqsumv to compute normfv on MacOS.

* Unconditionally disable the "dot trick" in normfv.

* Added explanatory comment to normfv definition.

Details:
- Added a comment above the unconditional disabling of the dotv-based
  implementation of normfv. Thanks to Roman Yurchak, Devin Matthews,
  and Isuru Fernando for helping with this improvement.
- CREDITS file update.
2019-08-30 16:52:09 -05:00
figual
bfddf67132 Fixed context registration for Cortex A53 (#329). 2019-08-26 12:01:33 +02:00
Field G. Van Zee
c4cc6fa702 New cntx_t blksz "set" functions + misc tweaks.
Details:
- Defined two new static functions in bli_cntx.h:
    bli_cntx_set_blksz_def_dt()
    bli_cntx_set_blksz_max_dt()
  which developers may find convenient when experimenting with different
  values of cache blocksizes.
- Updated one- and two-socket multithreaded problem size range and
  increment values in test/3/Makefile.
- Changed default to column storage in test/3/test_gemm.c.
- Fixed typo in comment in testsuite/src/test_subm.c.
2019-07-16 13:00:35 -05:00
Field G. Van Zee
ceee2f973e Fixed thrinfo_t printing bug for small problems.
Details:
- Fixed a bug in bli_l3_thrinfo_print_gemm_paths() and
  bli_l3_thrinfo_print_trsm_paths(), defined in bli_l3_thrinfo.c,
  whereby subnodes of the thrinfo_t tree are "dereferenced" near the
  beginning of the functions, which may lead to segfaults in certain
  situations where the thread tree was not fully formed because the
  matrix problem was too small for the level of parallelism specified.
  (That is, too small because some problems were assigned no work due
  to the smallest units in the m and n dimensions being defined by the
  register blocksizes mr and nr.) The fix requires several nested levels
  of if statements, and this is one of those few instances where use of
  goto statements results in (mostly) prettier code, especially in the
  case of _gemm_paths(). And while it wasn't necessary, I ported this
  goto usage to the loop body that prints the thrinfo_t work_id and
  comm_id values for each thread. Thanks to Nicholai Tukanov for helping
  to find this bug.
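
  The goto-based pattern mentioned above can be sketched as follows. This
  is an illustration only: node_t and collect_ways() are hypothetical
  stand-ins and do not reflect the actual thrinfo_t definition or the
  printing functions in bli_l3_thrinfo.c. Each level of the tree is
  checked before it is dereferenced, and a single label replaces what
  would otherwise be several nested if statements.

```c
#include <stddef.h>

/* Hypothetical stand-in for a node in a thread-info tree. */
typedef struct node_s
{
    int            n_way; /* ways of parallelism at this level */
    struct node_s* sub;   /* next level down; NULL if never formed */
} node_t;

/* Records n_way for up to three levels of the tree into ways[], leaving
   0 for any level that does not exist. Returns the number of levels
   actually present. */
static int collect_ways( const node_t* root, int ways[ 3 ] )
{
    const node_t* n     = root;
    int           depth = 0;

    ways[ 0 ] = ways[ 1 ] = ways[ 2 ] = 0;

    /* Check each node before touching its fields; bail out at the first
       missing level instead of nesting the remaining work. */
    if ( n == NULL ) goto done;
    ways[ 0 ] = n->n_way; depth = 1;

    n = n->sub;
    if ( n == NULL ) goto done;
    ways[ 1 ] = n->n_way; depth = 2;

    n = n->sub;
    if ( n == NULL ) goto done;
    ways[ 2 ] = n->n_way; depth = 3;

done:
    return depth;
}
```

  A partially formed tree (for example, one whose second level has no
  subnode) then yields zeros for the missing levels rather than a
  segfault.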
2019-06-24 17:47:40 -05:00
kdevraje
cac127182d Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis
with public repo commit id 565fa3853b.

Change-Id: I68b9824b110cf14df248217a24a6191b3df79d42
2019-06-24 14:05:54 +05:30
Field G. Van Zee
ad937db950 Added missing #include "bli_family_thunderx2.h".
Details:
- Added a cpp-conditional directive block to bli_arch_config.h that
  #includes "bli_family_thunderx2.h". The code has been missing since
  adf5c17f. However, this never manifested as an error because the file
  is virtually empty and not needed for thunderx2 (or most subconfigs).
  Thanks to Jeff Diamond for helping to spot this.
2019-06-07 11:34:08 -05:00
Field G. Van Zee
6bf449cc69 Merge branch 'amd' 2019-05-31 17:42:40 -05:00
Kiran Varaganti
b69fb0b74a Added back the BLIS_ENABLE_ZEN_BLOCK_SIZES macro to the zen configuration, the same as in release 1.3. It was originally added to improve multithreaded DGEMM scalability on Naples when the number of threads is greater than 16, and was deleted by mistake among the many changes made for the 2.0 release. Also includes code cleanup in bli_gemm_front.c.
Change-Id: I9f5d8225254676a99c6f2b09a0825e545206d0fc
2019-05-31 15:14:22 +05:30
kdevraje
13806ba3b0 This check-in updates the copyright information, which is changed to (start year) - 2019
Change-Id: Ide3c8f7172210b8d3538d3c36e88634ab1ba9041
2019-05-27 16:24:43 +05:30
kdevraje
02920f5c48 make checkblis fails with the matrix dimension check at the beginning, hence reverting it
Change-Id: Ibd2ee8c2d4914598b72003fbfc5845be9c9c1e87
2019-05-23 15:29:59 +05:30
kdevraje
84215022f2 Adding threshold condition to dgemm small matrix kernels, defining the constants in zen2 configuration
Change-Id: I53a58b5d734925a6fcb8d8bea5a02ddb8971fcd5
2019-05-23 14:33:47 +05:30
kdevraje
a3554eb1dc Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis to configure zen2
Change-Id: I97e17bca9716b80b862925f97bb513c07b4b0cae
2019-05-23 11:53:32 +05:30