Commit Graph

1969 Commits

Author SHA1 Message Date
Field G. Van Zee
e56d9f2d94 ReleaseNotes.md update in advance of next version. 2021-03-22 17:40:50 -05:00
Field G. Van Zee
ca83f955d4 CREDITS file update. 2021-03-22 17:21:21 -05:00
Field G. Van Zee
57ef61f6cd Merge branch 'master' of github.com:flame/blis 2021-03-19 13:05:43 -05:00
Field G. Van Zee
bf1b578ea3 Reduced KC on skx from 384 to 256.
Details:
- Reduced the KC cache blocksize for double real on the skx subconfig
  from 384 to 256. The maximum (extended) KC was also reduced
  accordingly from 480 to 320. Thanks to Tze Meng Low for suggesting
  this change.
2021-03-19 13:03:17 -05:00
Nicholai Tukanov
e7a4a8edc9 Fix calculation of new pb size (#487)
Details:
- Added missing parentheses to the i8 and i4 instantiations of the
  GENERIC_GEMM macro in sandbox/power10/generic_gemm.c.
2021-03-17 19:43:31 -05:00
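The kind of precedence hazard fixed in e7a4a8edc9 can be sketched as follows. The macro and the size expression are hypothetical stand-ins, not the actual GENERIC_GEMM macro or its arguments; the sketch only shows why an unparenthesized expression passed to a macro can yield a different block size.

```c
/* Hypothetical macro (not from BLIS): scales a block size by 4.
   The parameter is deliberately left unparenthesized in the body,
   so the caller is responsible for parenthesizing the argument. */
#define EXAMPLE_SCALE_PB(new_pb) (4 * new_pb)

/* Argument passed without parentheses: expands to
   4 * pb / 4 + (pb % 4 > 0), which is not the intended value. */
static int example_pb_unparenthesized(int pb)
{
    return EXAMPLE_SCALE_PB(pb / 4 + (pb % 4 > 0));
}

/* Argument wrapped in parentheses: expands to
   4 * (pb / 4 + (pb % 4 > 0)), i.e. pb rounded up to a multiple of 4. */
static int example_pb_parenthesized(int pb)
{
    return EXAMPLE_SCALE_PB((pb / 4 + (pb % 4 > 0)));
}
```

With pb = 10, the unparenthesized form evaluates 40 / 4 + 1, while the parenthesized form evaluates 4 * (2 + 1).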
Field G. Van Zee
4493cf516e Redefined BLIS_NUM_ARCHS to update automatically.
Details:
- Changed BLIS_NUM_ARCHS from a cpp macro definition to the last enum
  value in the arch_t enum. This means that it no longer needs to get
  updated manually whenever new subconfigurations are added to BLIS.
  Also removed the explicit initial index assignment of 0 from the
  first enum value, which was unnecessary due to how the C language
  standard mandates indexing of enum values. Thanks to Devin Matthews
  for originally submitting this as a PR in #446.
- Updated docs/ConfigurationHowTo.md to reflect the aforementioned
  change.
2021-03-15 13:12:49 -05:00
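The enum technique described in 4493cf516e can be sketched as below. The type and enumerator names are illustrative assumptions, not the actual arch_t definition; only the pattern (a trailing count enumerator and an implicit starting value of 0) is taken from the commit message.

```c
/* Illustrative enum (names are assumptions, not BLIS's arch_t). */
typedef enum
{
    EXAMPLE_ARCH_SKX,      /* implicitly 0; no "= 0" is required */
    EXAMPLE_ARCH_HASWELL,  /* 1 */
    EXAMPLE_ARCH_ZEN,      /* 2 */
    EXAMPLE_ARCH_GENERIC,  /* 3 */
    EXAMPLE_NUM_ARCHS      /* always last: equals the number of entries above */
} example_arch_t;

/* A table sized by the trailing enumerator grows automatically when a
   new entry is added before EXAMPLE_NUM_ARCHS. */
static const char* example_arch_name(example_arch_t id)
{
    static const char* const names[EXAMPLE_NUM_ARCHS] =
    {
        "skx", "haswell", "zen", "generic"
    };
    return ((unsigned)id < (unsigned)EXAMPLE_NUM_ARCHS) ? names[id] : "unknown";
}
```

Adding a new enumerator before EXAMPLE_NUM_ARCHS updates the count without touching any separate macro definition.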
Field G. Van Zee
a4b73de84c Disabled _self() and _equal() in bli_pthread API.
Details:
- Disabled the _self() and _equal() extensions to the bli_pthread API
  introduced in d479654. These functions were disabled after I realized
  that they aren't actually needed yet. Thanks to Devin Matthews for
  helping me reason through the appropriate consumer code that will
  appear in BLIS (eventually) in a future commit. (Also, I could never
  get the Windows branch to link properly in clang builds in AppVeyor.
  See the comment I left in the code, and #485, for more info.)
2021-03-12 19:47:39 -06:00
Field G. Van Zee
f9d604679d Added _self() and _equal() to bli_pthread API.
Details:
- Expanded the bli_pthread API to include equivalents to pthread_self()
  and pthread_equal(). Implemented these two functions for all three cpp
  branches present within bli_pthread.c: systemless, Windows, and
  Linux/BSD.
2021-03-12 19:47:39 -06:00
Field G. Van Zee
fa9b3c8f6b Shuffled code in Windows branch of bli_pthreads.c.
Details:
- Reordered the definitions in the cpp branch in bli_pthreads.c that
  defines the bli_pthreads API in terms of Windows API calls. Also added
  missing comments that mark sections of the API, which brings the code
  into harmony with other cpp branches (as well as bli_pthread.h).
2021-03-11 15:13:51 -06:00
Field G. Van Zee
95d4f3934d Moved cpp macro redef of strerror_r to bli_env.c.
Details:
- Relocated the _MSC_VER-guarded cpp macro re-definition of strerror_r
  (in terms of strerror_s) from bli_thread.h to bli_env.c. It was
  likely left behind in bli_thread.h in a previous commit, when code
  that now resides in bli_env.c was moved from bli_thread.c. (I couldn't
  find any other instance of strerror_r being used in BLIS, so I moved
  the #define directly to bli_env.c rather than place it in bli_env.h.)
  The code that uses strerror_r is currently disabled, though, so this
  commit should have no effect on BLIS.
2021-03-11 13:50:40 -06:00
Field G. Van Zee
8a3066c315 Relocated gemmsup_ref general stride handling.
Details:
- Moved the logic that checks for general stridedness in any of the
  matrix operands in a gemmsup problem. The logic previously resided
  near the top of bli_gemmsup_int(), which is the thread entry point
  for the parallel region of the current gemmsup implementation. The
  problem with this setup was that the code would attempt to reject
  problems with any general-strided operands by returning BLIS_FAILURE,
  and that return value was then being ignored by the l3_sup thread
  decorator, which unconditionally returns BLIS_SUCCESS. To solve this
  issue, rather than try to manage n return values, one from each of n
  threads, I simply moved the logic into bli_gemmsup_ref(). I didn't
  move it any higher (e.g. bli_gemmsup()) because I still want the
  logic to be part of the current gemmsup handler implementation. That
  is, perhaps someone else will create a different handler, and that
  author wants to handle general stride differently. (We don't want to
  force them into a particular way of handling general stride.)
- Removed the general stride handling from bli_gemmtsup_int(), even
  though this function is inoperative for now.
- This commit addresses issue #484. Thanks to RuQing Xu for reporting
  this issue.
2021-03-09 17:52:59 -06:00
Nicholai Tukanov
670bc7b60f Add low-precision POWER10 gemm kernels (#467)
Details:
- This commit adds a new BLIS sandbox that (1) provides implementations 
  based on low-precision gemm kernels, and (2) extends the BLIS typed 
  API for those new implementations. Currently, these new kernels can 
  only be used for the POWER10 microarchitecture; however, they may 
  provide a template for developing similar kernels for other 
  microarchitectures (even those beyond POWER), as changes would likely 
  be limited to select places in the microkernel and possibly the 
  packing routines. The new low-precision operations that are now 
  supported include: shgemm, sbgemm, i16gemm, i8gemm, i4gemm. For more 
  information, refer to the POWER10.md document that is included in 
  'sandbox/power10'.
2021-03-05 13:53:43 -06:00
RuQing Xu
b8dcc5bc75 Fixed typed API definition for gemmt (#476)
Details:
- Fixed incorrect definition and prototype of bli_?gemmt() in 
  frame/3/bli_l3_tapi.c and .h, respectively. gemmt was previously
  defined identically to gemm, which was wrong because it did not
  take into account the uplo property of C.
- Fixed incorrect API documentation for her2k/syr2k in BLISTypedAPI.md.
  Specifically, the document erroneously listed only a single transab
  parameter instead of transa and transb.
2021-03-01 16:58:24 -06:00
Ilknur
a0e4fe2340 Fixed double free() in level1v example (#482)
Details:
- In examples/tapi/00level1v.c, pointer 'z' was being freed twice and
  pointer 'a' was not being freed at all. This commit correctly frees 
  each pointer exactly once.
2021-03-01 16:06:56 -06:00
Field G. Van Zee
f5871c7e06 Added complex asm packm kernels for 'haswell' set.
Details:
- Implemented assembly-based packm kernels for single- and double-
  precision complex domain (c and z) and housed them in the 'haswell'
  kernel set. This means c3xk, c8xk, z3xk, and z4xk are now all
  optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
  and zen2 subconfigs.
- Minor modifications to the corresponding s and d packm kernels that
  were introduced in 426ad67.
- Thanks to AMD, who originally contributed the double-precision real
  packm kernels (d6xk and d8xk), upon which these complex kernels are
  partially based.
2021-02-28 17:03:57 -06:00
Field G. Van Zee
426ad679f5 Added assembly packm kernels for 'haswell' set.
Details:
- Implemented assembly-based packm kernels for single- and double-
  precision real domain (s and d) and housed them in the 'haswell'
  kernel set. This means s6xk, s16xk, d6xk, and d8xk are now all
  optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
  and zen2 subconfigs.
- Thanks to AMD, who originally contributed the double-precision real
  packm kernels (d6xk and d8xk), which I have now tweaked and used to
  create comparable single-precision real kernels (s6xk and s16xk).
2021-02-27 18:39:56 -06:00
Devin Matthews
f50c1b7e58 Merge pull request #473 from ajaypanyala/pkgconfig
build: generate pkgconfig file
2021-02-01 11:55:51 -06:00
Field G. Van Zee
8f39aea11f Merge branch 'dev' 2021-01-30 17:59:56 -06:00
Field G. Van Zee
f8db9fb33b Fixed missing parentheses in README.md Citations. 2021-01-28 08:04:52 -06:00
Ajay Panyala
b3953b938e drop CFLAGS in the generated pkgconfig file 2021-01-12 17:07:04 -08:00
Ajay Panyala
b02d9376ba add datadir 2021-01-12 11:47:58 -08:00
Ajay Panyala
d8d8deeb6d generate pkgconfig file 2021-01-11 17:47:50 -08:00
Devin Matthews
8c65411c7c Merge pull request #471 from flame/fix-470
Fix kernel-to-config mapping for intel64
2021-01-11 16:01:45 -06:00
Devin Matthews
874c3f04ec Update configure
Choose last sub-config in the kernel-to-config map if the config list doesn't contain the name of the kernel set. E.g. for "zen: skx knl haswell" pick "haswell" instead of "skx" which was chosen previously. Fixes #470.
2021-01-08 13:56:30 -06:00
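The selection rule described in 874c3f04ec lives in the configure script; the following is only a C sketch of the same rule, with a hypothetical function name and signature, to make the before/after behavior concrete.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: given a kernel set name and the list of
   sub-configs mapped to it, return the sub-config whose name matches
   the kernel set, or the LAST entry in the list if none matches. */
static const char* example_pick_config(const char* kernel_set,
                                       const char* const* configs,
                                       size_t n_configs)
{
    if (n_configs == 0) return NULL;

    for (size_t i = 0; i < n_configs; ++i)
    {
        if (strcmp(configs[i], kernel_set) == 0) return configs[i];
    }

    /* No name match: fall back to the last sub-config in the list. */
    return configs[n_configs - 1];
}
```

For the mapping "zen: skx knl haswell" quoted in the commit message, this rule yields "haswell" rather than the first entry, "skx".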
Field G. Van Zee
2a815d5b36 Support trsm pre-inversion in 1m, bb, ref kernels.
Details:
- Expanded support for disabling trsm diagonal pre-inversion to other
  microkernel types, including the reference microkernel as well as the
  kernel implementations for 1m and the pre-broadcast B (bb) format used
  by the power9 subconfig. This builds on the 'haswell' and 'penryn'
  kernel support added in 7038bba. Thanks to Bhaskar Nallani for
  reminding me, in #461 (post-closure), that 1m support was missing from
  that commit.
- Removed cpp branch of ref_kernels/3/bli_trsm_ref.c that contained the
  omp simd implementation after making a stripped-down copy in 'old'.
  This code has been disabled for some time and it seemed better suited
  to rot away out of sight rather than clutter up a file that is already
  cluttered by the presence of lower and upper versions.
- Minor comment update to bli_ind_init().
2021-01-04 18:03:39 -06:00
Field G. Van Zee
c3ed2cbb9f Enable 1m only if real domain ukr is not reference.
Details:
- Previously, BLIS would automatically enable use of the 1m method
  for a given precision if the complex domain microkernel was a
  reference kernel. This commit adds an additional constraint so that
  1m is only enabled if the corresponding real domain microkernel is
  NOT reference. That is, BLIS now forgoes use of 1m if both the real and
  complex domain kernels are reference implementations. Note that this
  does not prevent 1m from being enabled manually under those
  conditions; it only means that 1m will not be enabled automatically
  at initialization-time.
2021-01-04 16:16:32 -06:00
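The automatic-enable condition described in c3ed2cbb9f reduces to a two-input predicate. The function below is a hypothetical sketch of that condition, not the BLIS initialization code.

```c
#include <stdbool.h>

/* Hypothetical predicate: should 1m be enabled automatically for a
   given precision at initialization time?
   - Before the commit: enabled whenever the complex microkernel is a
     reference kernel.
   - After the commit: additionally requires that the real microkernel
     is NOT a reference kernel. */
static bool example_auto_enable_1m(bool complex_ukr_is_ref,
                                   bool real_ukr_is_ref)
{
    return complex_ukr_is_ref && !real_ukr_is_ref;
}
```

Manual enabling is unaffected by this predicate, as the commit message notes.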
Field G. Van Zee
ed50c94738 Merge branch 'master' into dev 2021-01-04 14:31:44 -06:00
Devin Matthews
328b4f8872 Shared object (dylib) was not built correctly for partial build.
The SO build rule used $? instead of $^. Observed on macOS, not sure if it affected Linux or not.
2020-12-30 17:54:18 -06:00
Devin Matthews
ae6ef66ef8 bli_diag_offset_with_trans had wrong return type. Fixes #468. 2020-12-30 17:34:55 -06:00
Devin Matthews
ebcf197fb8 Merge pull request #466 from isuruf/patch-3
fix cc_vendor for crosstool-ng toolchains
2020-12-05 22:26:27 -06:00
Isuru Fernando
21aa67e11c fix cc_vendor for crosstool-ng toolchains 2020-12-05 21:59:13 -06:00
Field G. Van Zee
472f138cb9 Fixed typo in README.md to CodingConventions.md. 2020-12-05 14:13:52 -06:00
Field G. Van Zee
0cef09aa92 Consolidated code in level-3 _front() functions.
Details:
- Reduced a code segment that appears in all of the bli_*_front()
  functions except for bli_gemm_front(). Previously, the code looked
  like this (taken from bli_herk_front()):

    if ( bli_cntx_method( cntx ) == BLIS_NAT )
    {
        bli_obj_set_pack_schema( BLIS_PACKED_ROW_PANELS, &a_local );
        bli_obj_set_pack_schema( BLIS_PACKED_COL_PANELS, &ah_local );
    }
    else // if ( bli_cntx_method( cntx ) != BLIS_NAT )
    {
        pack_t schema_a = bli_cntx_schema_a_block( cntx );
        pack_t schema_b = bli_cntx_schema_b_panel( cntx );

        bli_obj_set_pack_schema( schema_a, &a_local );
        bli_obj_set_pack_schema( schema_b, &ah_local );
    }

  This code segment is part of a sort-of-hack that allows us to
  communicate the pack schemas into the level-3 thread decorator, which
  needs them so that they can be passed into bli_l3_cntl_create_if(),
  where the control tree is created. However, the first conditional case
  above is unnecessary because the second case is fully generalized.
  That is, even in the native case, the context contains correct,
  queryable schemas. Thus, these code segments were reduced to something
  like:

    pack_t schema_a = bli_cntx_schema_a_block( cntx );
    pack_t schema_b = bli_cntx_schema_b_panel( cntx );

    bli_obj_set_pack_schema( schema_a, &a_local );
    bli_obj_set_pack_schema( schema_b, &ah_local );

  There's always a small chance that the seemingly unnecessary code
  in the first branch case has some special use that is not apparent to
  me, but the testsuite's default input parameters seem to think this
  commit will be fine.
2020-12-04 16:40:59 -06:00
Field G. Van Zee
7038bbaa05 Optionally disable trsm diagonal pre-inversion.
Details:
- Implemented a configure-time option, --disable-trsm-preinversion, that
  optionally disables the pre-inversion of diagonal elements of the
  triangular matrix in the trsm operation and instead uses division
  instructions within the gemmtrsm microkernels. Pre-inversion is
  enabled by default. When it is disabled, performance may suffer
  slightly, but numerical robustness should improve for certain
  pathological cases involving denormal (subnormal) numbers that would
  otherwise result in overflow in the pre-inverted value. Thanks to
  Bhaskar Nallani for reporting this issue via #461.
- Added preprocessor macro guards to bli_trsm_cntl.c as well as the
  gemmtrsm microkernels for 'haswell' and 'penryn' kernel sets pursuant
  to the aforementioned feature.
- Added macros to frame/include/bli_x86_asm_macros.h related to division
  instructions.
2020-12-04 16:08:15 -06:00
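The numerical issue behind 7038bbaa05 can be illustrated with a scalar sketch. The two functions are hypothetical and solve a single equation a * x = b. The example assumes IEEE 754 double arithmetic with subnormals enabled (no flush-to-zero).

```c
/* Pre-inversion: compute 1/a once, then multiply. If a is a small
   subnormal, 1/a exceeds the largest finite double and becomes +inf,
   so the product is +inf even when b/a is representable. */
static double example_solve_preinverted(double a, double b)
{
    double inv_a = 1.0 / a;
    return b * inv_a;
}

/* Division: computes b/a directly, so the result stays finite whenever
   the true quotient is representable. */
static double example_solve_divide(double a, double b)
{
    return b / a;
}
```

For a = 1e-310 (subnormal) and b = 1e-300, the true quotient is about 1e10, which the division variant returns, while the pre-inverted variant overflows.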
Field G. Van Zee
78aee79452 Allow amaxv testsuite module to run with dim = 0.
Details:
- Exit early from libblis_test_amaxv_check() when the vector dimension
  (length) of x is 0. This allows the module to run when the testsuite
  driver passes in a problem size of 0. Thanks to Meghana Vankadari for
  alerting us to this issue via #459.
- Note: All other testsuite modules appear to work with problem sizes
  of 0, except for the microkernel modules. I chose not to "fix" those
  modules because a failure (or segmentation fault, as happens in this
  case) is actually meaningful in that it alerts the developer that some
  microkernels cannot be used with k = 0. Specifically, the 'haswell'
  kernel set contains microkernels that preload elements of B. Those
  microkernels would need to be restructured to avoid preloading in
  order to support usage when k = 0.
2020-12-02 13:02:36 -06:00
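The early exit described in 78aee79452 can be sketched with a hypothetical checker. This is not libblis_test_amaxv_check(); the name, signature, and the 0.0/1.0 residual convention are assumptions made for illustration.

```c
#include <stddef.h>

static double example_abs(double v) { return v < 0.0 ? -v : v; }

/* Hypothetical checker: returns 0.0 if x[reported] has the largest
   magnitude in x, and 1.0 otherwise. */
static double example_amaxv_check(size_t n, const double* x, size_t reported)
{
    /* Early exit: with an empty vector there is nothing to verify, and
       x must not be dereferenced. */
    if (n == 0) return 0.0;

    if (reported >= n) return 1.0;

    double max_abs = 0.0;
    for (size_t i = 0; i < n; ++i)
    {
        if (example_abs(x[i]) > max_abs) max_abs = example_abs(x[i]);
    }
    return (example_abs(x[reported]) == max_abs) ? 0.0 : 1.0;
}
```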
Field G. Van Zee
92d2b12a44 Fixed obscure testsuite gemmt dependency bug.
Details:
- Fixed a bug in the gemmt testsuite module that only manifested when
  testing of gemmt is enabled but testing of gemv is disabled. The bug
  was due to a copy-paste error dating back to the introduction of gemmt
  in 88ad841.
2020-12-02 13:02:00 -06:00
Field G. Van Zee
b43dae9a5d Fixed copy-paste bugs in edge-case sup kernels.
Details:
- Fixed bugs in two sup kernels, bli_dgemmsup_rv_haswell_asm_1x6() and
  bli_dgemmsup_rd_haswell_asm_1x4(), which involved extraneous assembly
  instructions that were left over from when the kernels were first
  written. These instructions would cause segmentation faults in some
  situations where extra memory was not allocated beyond the end of
  the matrix buffers. Thanks to Kiran Varaganti for reporting these
  bugs and to Bhaskar Nallani for identifying the cause and solution.
2020-12-01 16:44:38 -06:00
Field G. Van Zee
11dfc176a3 Reorganized thread auto-factorization logic.
Details:
- Reorganized logic of bli_thread_partition_2x2() so that the primary
  guts were factored out into "fast" and "slow" variants. Then added
  logic to the "fast" variant that allows for more optimal thread
  factorizations in some situations where there is at least one factor
  of 2.
- Changed BLIS_THREAD_RATIO_M from 2 to 1 in bli_kernel_macro_defs.h and
  added comments to that file describing BLIS_THREAD_RATIO_? and
  BLIS_THREAD_MAX_?R.
- In bli_family_zen.h and bli_family_zen2.h, preprocessed out several
  macros not used in vanilla BLIS and removed the unused macro
  BLIS_ENABLE_ZEN_BLOCK_SIZES from the former file.
- Disabled AMD's small matrix handling entry points in bli_syrk_front.c
  and bli_trsm_front.c. (These branches of small matrix handling have
  not been reviewed by vanilla BLIS developers.)
- Added commented-out calls to printf() to bli_rntm.c.
- Whitespace changes to bli_thread.c.
2020-12-01 19:51:27 +00:00
Devin Matthews
6d3bafacd7 Update BuildSystem.md
Add git version >= 1.8.5 requirement (see #462).
2020-11-28 17:17:56 -06:00
Field G. Van Zee
64856ea5a6 Auto-reduce (by default) prime numbers of threads.
Details:
- When requesting multithreaded parallelism by specifying the total
  number of threads (whether it be via environment variable, globally at
  runtime, or locally at runtime), reduce the number of threads actually
  used by one if the original value (a) is prime and (b) exceeds a
  minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which is set
  to 11 by default. If, when specifying the total number of threads (and
  not the individual ways of parallelism for each loop), prime numbers
  of threads are desired, this feature may be overridden by defining the
  BLIS_ENABLE_AUTO_PRIME_NUM_THREADS macro in the bli_family_*.h that
  corresponds to the configuration family targeted at configure-time.
  (For now, there are no configure options to control this feature.)
  Thanks to Jeff Diamond for suggesting this change.
- Defined a new function in bli_thread.c, bli_is_prime(), that returns a
  bool indicating whether an integer is prime. This function is
  implemented in terms of existing functions in bli_thread.c.
- Updated docs/Multithreading.md to document the above feature, along
  with unrelated minor edits.
2020-11-23 16:54:51 -06:00
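The thread-count adjustment described in 64856ea5a6 can be sketched as below. The names are hypothetical, and the primality test uses plain trial division rather than the existing bli_thread.c helpers that bli_is_prime() is said to build on. The sketch reads "exceeds a minimum threshold" as a strict greater-than comparison, which is an assumption.

```c
#include <stdbool.h>

/* Stand-in for the BLIS_NT_MAX_PRIME default mentioned above. */
#define EXAMPLE_NT_MAX_PRIME 11

/* Trial-division primality test (a swapped-in technique for this sketch). */
static bool example_is_prime(int n)
{
    if (n < 2) return false;
    for (int d = 2; d <= n / d; ++d)
    {
        if (n % d == 0) return false;
    }
    return true;
}

/* Reduce the requested thread count by one if it is prime and larger
   than the threshold; otherwise return it unchanged. */
static int example_adjust_num_threads(int nt)
{
    if (nt > EXAMPLE_NT_MAX_PRIME && example_is_prime(nt)) return nt - 1;
    return nt;
}
```

Under these assumptions, a request for 13 threads becomes 12 (which factors as 3 x 4 or 2 x 6), while requests for 11 or 12 threads are left alone.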
Field G. Van Zee
55933b6ff6 Added missing attribution to docs/ReleaseNotes.md. 2020-11-20 10:39:32 -06:00
Field G. Van Zee
e310f57b4b CHANGELOG update (0.8.0) 2020-11-19 13:33:37 -06:00
Field G. Van Zee
9b387f6d5a Version file update (0.8.0) 0.8.0 2020-11-19 13:33:37 -06:00
Field G. Van Zee
2928ec750d ReleaseNotes.md update in advance of next version.
Details:
- Updated docs/ReleaseNotes.md in preparation for next version.
2020-11-18 18:31:35 -06:00
Field G. Van Zee
b9899bedff CREDITS file update. 2020-11-18 16:52:41 -06:00
Field G. Van Zee
9bb23e6c2a Added support for systemless build (no pthreads).
Details:
- Added a configure option, --[enable|disable]-system, which determines
  whether the modest operating system dependencies in BLIS are included.
  The most notable example of this on Linux and BSD/OSX is the use of
  POSIX threads to ensure thread safety for when application-level
  threads call BLIS. When --disable-system is given, the bli_pthreads
  implementation is dummied out entirely, allowing the calling code
  within BLIS to remain unchanged. Why would anyone want to build BLIS
  like this? The motivating example was submitted via #454 in which a
  user wanted to build BLIS for a simulator such as gem5 where thread
  safety may not be a concern (and where the operating system is largely
  absent anyway). Thanks to Stepan Nassyr for suggesting this feature.
- Another, more minor side effect of the --disable-system option is that
  the implementation of bli_clock() unconditionally returns 0.0 instead
  of the time elapsed since some fixed point in the past. The reasoning
  for this is that if the operating system is truly minimal, the system
  function call upon which bli_clock() would normally be implemented
  (e.g. clock_gettime()) may not be available.
- Refactored preprocess-guarded code in bli_pthread.c and bli_pthread.h
  to remove redundancies.
- Removed old comments and commented #include of "bli_pthread_wrap.h"
  from bli_system.h.
- Documented bli_clock() and bli_clock_min_diff() in BLISObjectAPI.md
  and BLISTypedAPI.md, with a note that both are non-functional when
  BLIS is configured with --disable-system.
2020-11-16 15:55:45 -06:00
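The "dummied out" approach described in 9bb23e6c2a can be sketched as follows. Every identifier here is hypothetical (including the EXAMPLE_DISABLE_SYSTEM macro standing in for the effect of --disable-system); the point is only that calling code stays identical while the implementation behind it is swapped by the preprocessor.

```c
/* Hypothetical switch standing in for a --disable-system build. */
#define EXAMPLE_DISABLE_SYSTEM

#ifdef EXAMPLE_DISABLE_SYSTEM

/* Systemless branch: no OS dependencies; locking is a no-op and the
   clock always reads 0.0. */
typedef int example_mutex_t;
#define EXAMPLE_MUTEX_INITIALIZER 0
static int example_mutex_lock(example_mutex_t* m)   { (void)m; return 0; }
static int example_mutex_unlock(example_mutex_t* m) { (void)m; return 0; }
static double example_clock(void) { return 0.0; }

#else

/* POSIX branch: thin wrappers over pthreads and clock_gettime(). */
#include <pthread.h>
#include <time.h>
typedef pthread_mutex_t example_mutex_t;
#define EXAMPLE_MUTEX_INITIALIZER PTHREAD_MUTEX_INITIALIZER
static int example_mutex_lock(example_mutex_t* m)   { return pthread_mutex_lock(m); }
static int example_mutex_unlock(example_mutex_t* m) { return pthread_mutex_unlock(m); }
static double example_clock(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

#endif

/* Calling code is the same in both branches. */
static example_mutex_t example_counter_lock = EXAMPLE_MUTEX_INITIALIZER;
static int example_counter = 0;

static int example_increment(void)
{
    example_mutex_lock(&example_counter_lock);
    int value = ++example_counter;
    example_mutex_unlock(&example_counter_lock);
    return value;
}
```

Removing the EXAMPLE_DISABLE_SYSTEM definition selects the POSIX branch without any change to example_increment().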
Field G. Van Zee
88ad841434 Squash-merge 'pr' into 'squash'. (#457)
Merged contributions from AMD's AOCL BLIS (#448).
  
Details:
- Added support for level-3 operation gemmt, which performs a gemm on
  only the lower or upper triangle of a square matrix C. For now, only
  the conventional/large code path will be supported (in vanilla BLIS).
  This was accomplished by leveraging the existing variant logic for
  herk. However, some of the infrastructure to support a gemmtsup is
  included in this commit, including
  - A bli_gemmtsup() front-end, similar to bli_gemmsup().
  - A bli_gemmtsup_ref() reference handler function.
  - A bli_gemmtsup_int() variant chooser function (with variant calls
    commented out).
- Added support for inducing complex domain gemmt via the 1m method.
- Added gemmt APIs to the BLAS and CBLAS compatibility layers.
- Added gemmt test module to testsuite.
- Added standalone gemmt test driver to 'test' directory.
- Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md.
- Added a C++ template header (blis.hh) containing a BLAS-inspired
  wrapper to a set of polymorphic CBLAS-like function wrappers defined
  in another header (cblas.hh). These two headers are installed if
  running the 'install' target with INSTALL_HH is set to 'yes'. (Also
  added a set of unit tests that exercise blis.hh, although they are
  disabled for now because they aren't compatible with out-of-tree
  builds.) These files now live in the 'vendor' top-level directory.
- Various updates to 'zen' and 'zen2' subconfigurations, particularly
  within the context initialization functions.
- Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and
  various minor updates to dotv and scalv kernels. Also added various
  sup kernels contributed by AMD to kernels/zen/3. However, these
  kernels are (for now) not yet used, in part because they caused
  AppVeyor clang failures, and also because I have not found time to
  review and vet them.
- Output the python found during configure into the definition of PYTHON
  in build/config.mk (via build/config.mk.in).
- Added early-return checks (A, B, or C with zero dimension; alpha = 0)
  to bli_gemm_front.c.
- Implemented explicit beta = 0 handling for the sgemm ukernel in
  bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent
  bug surfaced because the gemmt module verifies its computation using
  gemm with its beta parameter set to zero, which, on a cortexa15 system
  caused the gemm kernel code to unconditionally multiply the
  uninitialized C data by beta. The C matrix likely contained
  non-numeric values such as NaN, which then would have resulted in a
  false failure.
- Fixed a bug whereby the implementation for bli_herk_determine_kc(),
  in bli_l3_blocksize.c, was inadvertently being defined in terms of
  helper functions meant for trmm. This bug was probably harmless since
  the trmm code should have also done the right thing for herk.
- Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in
  kernels/zen/3/bli_gemm_small.c since those macros are not used in
  vanilla BLIS.
- Added cpp guard to definition of bli_mem_clear() in bli_mem.h to
  accommodate C++'s stricter type checking.
- Added cpp guard to test/*.c drivers that facilitates compilation on
  Windows systems.
- Various whitespace changes.
2020-11-14 09:39:48 -06:00
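Two points from 88ad841434 can be shown in one naive reference sketch: gemmt updates only the lower or upper triangle of a square C, and beta = 0 must overwrite C rather than multiply it, since C may hold NaN or uninitialized data. The function below is a hypothetical row-major illustration, not the BLIS API or its kernels.

```c
#include <stddef.h>

typedef enum { EXAMPLE_LOWER, EXAMPLE_UPPER } example_uplo_t;

/* Computes C := beta * C + alpha * A * B on one triangle of C only.
   A is n x k, B is k x n, C is n x n, all row-major and contiguous. */
static void example_gemmt(example_uplo_t uplo, size_t n, size_t k,
                          double alpha, const double* a, const double* b,
                          double beta, double* c)
{
    for (size_t i = 0; i < n; ++i)
    {
        /* Column range of row i that lies in the stored triangle. */
        size_t j_begin = (uplo == EXAMPLE_LOWER) ? 0     : i;
        size_t j_end   = (uplo == EXAMPLE_LOWER) ? i + 1 : n;

        for (size_t j = j_begin; j < j_end; ++j)
        {
            double ab = 0.0;
            for (size_t p = 0; p < k; ++p) ab += a[i * k + p] * b[p * n + j];

            /* Explicit beta == 0 case: never read C, because
               0 * NaN is NaN and would contaminate the result. */
            if (beta == 0.0) c[i * n + j] = alpha * ab;
            else             c[i * n + j] = beta * c[i * n + j] + alpha * ab;
        }
    }
}
```

Elements outside the selected triangle are never read or written, which is the difference from plain gemm.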
Field G. Van Zee
234b8b0cf4 Increased dotxaxpyf testsuite thresholds.
Details:
- Increased the test thresholds used by the dotxaxpyf testsuite module
  by a factor of five in order to avoid residuals that unnecessarily
  fall in the MARGINAL range. This commit should fix #455. Thanks to
  @nagsingh for reporting this issue.
2020-11-12 19:11:16 -06:00
Field G. Van Zee
ed612dd82c Updated README.md with sgemmsup blurb.
Details:
- Added an entry to the "What's New" section of the README.md to
  announce the availability of sgemmsup.
2020-11-07 13:09:42 -06:00
Field G. Van Zee
e14424f55b Merge branch 'dev' 2020-11-07 13:02:50 -06:00