Commit Graph

1958 Commits

Author SHA1 Message Date
Nicholai Tukanov
670bc7b60f Add low-precision POWER10 gemm kernels (#467)
Details:
- This commit adds a new BLIS sandbox that (1) provides implementations 
  based on low-precision gemm kernels, and (2) extends the BLIS typed 
  API for those new implementations. Currently, these new kernels can 
  only be used for the POWER10 microarchitecture; however, they may 
  provide a template for developing similar kernels for other 
  microarchitectures (even those beyond POWER), as changes would likely 
  be limited to select places in the microkernel and possibly the 
  packing routines. The new low-precision operations that are now 
  supported include: shgemm, sbgemm, i16gemm, i8gemm, i4gemm. For more 
  information, refer to the POWER10.md document that is included in 
  'sandbox/power10'.
2021-03-05 13:53:43 -06:00
RuQing Xu
b8dcc5bc75 Fixed typed API definition for gemmt (#476)
Details:
- Fixed incorrect definition and prototype of bli_?gemmt() in 
  frame/3/bli_l3_tapi.c and .h, respectively. gemmt was previously
  defined identically to gemm, which was wrong because it did not
  take into account the uplo property of C.
- Fixed incorrect API documentation for her2k/syr2k in BLISTypedAPI.md.
  Specifically, the document erroneously listed only a single transab
  parameter instead of transa and transb.
2021-03-01 16:58:24 -06:00
Ilknur
a0e4fe2340 Fixed double free() in level1v example (#482)
Details:
- In exampls/tapi/00level1v.c, pointer 'z' was being freed twice and
  pointer 'a' was not being freed at all. This commit correctly frees 
  each pointer exactly once.
2021-03-01 16:06:56 -06:00
Field G. Van Zee
f5871c7e06 Added complex asm packm kernels for 'haswell' set.
Details:
- Implemented assembly-based packm kernels for single- and double-
  precision complex domain (c and z) and housed them in the 'haswell'
  kernel set. This means c3xk, c8xk, z3xk, and z4xk are now all
  optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
  and zen2 subconfigs.
- Minor modifications to the corresponding s and d packm kernels that
  were introduced in 426ad67.
- Thanks to AMD, who originally contributed the double-precision real
  packm kernels (d6xk and d8xk), upon which these complex kernels are
  partially based.
2021-02-28 17:03:57 -06:00
Field G. Van Zee
426ad679f5 Added assembly packm kernels for 'haswell' set.
Details:
- Implemented assembly-based packm kernels for single- and double-
  precision real domain (s and d) and housed them in the 'haswell'
  kernel set. This means s6xk, s16xk, d6xk, and d8xk are now all
  optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
  and zen2 subconfigs.
- Thanks to AMD, who originally contributed the double-precision real
  packm kernels (d6xk and d8xk), which I have now tweaked and used to
  create comparable single-precision real kernels (s6xk and s16xk).
2021-02-27 18:39:56 -06:00
Devin Matthews
f50c1b7e58 Merge pull request #473 from ajaypanyala/pkgconfig
build: generate pkgconfig file
2021-02-01 11:55:51 -06:00
Field G. Van Zee
8f39aea11f Merge branch 'dev' 2021-01-30 17:59:56 -06:00
Field G. Van Zee
f8db9fb33b Fixed missing parentheses in README.md Citations. 2021-01-28 08:04:52 -06:00
Ajay Panyala
b3953b938e drop CFLAGS in the generated pkgconfig file 2021-01-12 17:07:04 -08:00
Ajay Panyala
b02d9376ba add datadir 2021-01-12 11:47:58 -08:00
Ajay Panyala
d8d8deeb6d generate pkgconfig file 2021-01-11 17:47:50 -08:00
Devin Matthews
8c65411c7c Merge pull request #471 from flame/fix-470
Fix kernel-to-config mapping for intel64
2021-01-11 16:01:45 -06:00
Devin Matthews
874c3f04ec Update configure
Choose last sub-config in the kernel-to-config map if the config list doesn't contain the name of the kernel set. E.g. for "zen: skx knl haswell" pick "haswell" instead of "skx" which was chosen previously. Fixes #470.
2021-01-08 13:56:30 -06:00
Field G. Van Zee
2a815d5b36 Support trsm pre-inversion in 1m, bb, ref kernels.
Details:
- Expanded support for disabling trsm diagonal pre-inversion to other
  microkernel types, including the reference microkernel as well as the
  kernel implementations for 1m and the pre-broadcast B (bb) format used
  by the power9 subconfig. This builds on the 'haswell' and 'penryn'
  kernel support added in 7038bba. Thanks to Bhaskar Nallani for
  reminding me, in #461 (post-closure), that 1m support was missing from
  that commit.
- Removed cpp branch of ref_kernels/3/bli_trsm_ref.c that contained the
  omp simd implementation after making a stripped-down copy in 'old'.
  This code has been disabled for some time and it seemed better suited
  to rot away out of sight rather than clutter up a file that is already
  cluttered by the presence of lower and upper versions.
- Minor comment update to bli_ind_init().
2021-01-04 18:03:39 -06:00
Field G. Van Zee
c3ed2cbb9f Enable 1m only if real domain ukr is not reference.
Details:
- Previously, BLIS would automatically enable use of the 1m method
  for a given precision if the complex domain microkernel was a
  reference kernel. This commit adds an additional constraint so that
  1m is only enabled if the corresponding real domain microkernel is
  NOT reference. That is, BLIS now forgos use of 1m if both the real and
  complex domain kernels are reference implementations. Note that this
  does not prevent 1m from being enabled manually under those
  conditions; it only means that 1m will not be enabled automatically
  at initialization-time.
2021-01-04 16:16:32 -06:00
Field G. Van Zee
ed50c94738 Merge branch 'master' into dev 2021-01-04 14:31:44 -06:00
Devin Matthews
328b4f8872 Shared object (dylib) was not built correctly for partial build.
The SO build rule used $? instead of $^. Observed on macOS, not sure if it affected Linux or not.
2020-12-30 17:54:18 -06:00
Devin Matthews
ae6ef66ef8 bli_diag_offset_with_trans had wrong return type. Fixes #468. 2020-12-30 17:34:55 -06:00
Devin Matthews
ebcf197fb8 Merge pull request #466 from isuruf/patch-3
fix cc_vendor for crosstool-ng toolchains
2020-12-05 22:26:27 -06:00
Isuru Fernando
21aa67e11c fix cc_vendor for crosstool-ng toolchains 2020-12-05 21:59:13 -06:00
Field G. Van Zee
472f138cb9 Fixed typo in README.md to CodingConventions.md. 2020-12-05 14:13:52 -06:00
Field G. Van Zee
0cef09aa92 Consolidated code in level-3 _front() functions.
Details:
- Reduced a code segment that appears in all of the bli_*_front()
  functions except for bli_gemm_front(). Previously, the code looked
  like this (taken from bli_herk_front()):

    if ( bli_cntx_method( cntx ) == BLIS_NAT )
    {
        bli_obj_set_pack_schema( BLIS_PACKED_ROW_PANELS, &a_local );
        bli_obj_set_pack_schema( BLIS_PACKED_COL_PANELS, &ah_local );
    }
    else // if ( bli_cntx_method( cntx ) != BLIS_NAT )
    {
        pack_t schema_a = bli_cntx_schema_a_block( cntx );
        pack_t schema_b = bli_cntx_schema_b_panel( cntx );

        bli_obj_set_pack_schema( schema_a, &a_local );
        bli_obj_set_pack_schema( schema_b, &ah_local );
    }

  This code segment is part of a sort-of-hack that allows us to
  communicate the pack schemas into the level-3 thread decorator, which
  needs them so that they can be passed into bli_l3_cntl_create_if(),
  where the control tree is created. However, the first conditional case
  above is unnecessary because the second case is fully generalized.
  That is, even in the native case, the context contains correct,
  queryable schemas. Thus, these code segments were reduced to something
  like:

    pack_t schema_a = bli_cntx_schema_a_block( cntx );
    pack_t schema_b = bli_cntx_schema_b_panel( cntx );

    bli_obj_set_pack_schema( schema_a, &a_local );
    bli_obj_set_pack_schema( schema_b, &ah_local );

  There's always a small chance that the seemingly unnecessary code
  in the first branch case has some special use that is not apparent to
  me, but the testsuite's default input parameters seem to think this
  commit will be fine.
2020-12-04 16:40:59 -06:00
Field G. Van Zee
7038bbaa05 Optionally disable trsm diagonal pre-inversion.
Details:
- Implemented a configure-time option, --disable-trsm-preinversion, that
  optionally disables the pre-inversion of diagonal elements of the
  triangular matrix in the trsm operation and instead uses division
  instructions within the gemmtrsm microkernels. Pre-inversion is
  enabled by default. When it is disabled, performance may suffer
  slightly, but numerical robustness should improve for certain
  pathological cases involving denormal (subnormal) numbers that would
  otherwise result in overflow in the pre-inverted value. Thanks to
  Bhaskar Nallani for reporting this issue via #461.
- Added preprocessor macro guards to bli_trsm_cntl.c as well as the
  gemmtrsm microkernels for 'haswell' and 'penryn' kernel sets pursuant
  to the aforementioned feature.
- Added macros to frame/include/bli_x86_asm_macros.h related to division
  instructions.
2020-12-04 16:08:15 -06:00
Field G. Van Zee
78aee79452 Allow amaxv testsuite module to run with dim = 0.
Details:
- Exit early from libblis_test_amaxv_check() when the vector dimension
  (length) of x is 0. This allows the module to run when the testsuite
  driver passes in a problem size of 0. Thanks to Meghana Vankadari for
  alerting us to this issue via #459.
- Note: All other testsuite modules appear to work with problem sizes
  of 0, except for the microkernel modules. I chose not to "fix" those
  modules because a failure (or segmentation fault, as happens in this
  case) is actually meaningful in that it alerts the developer that some
  microkernels cannot be used with k = 0. Specifically, the 'haswell'
  kernel set contains microkernels that preload elements of B. Those
  microkernels would need to be restructured to avoid preloading in
  order to support usage when k = 0.
2020-12-02 13:02:36 -06:00
Field G. Van Zee
92d2b12a44 Fixed obscure testsuite gemmt dependency bug.
Details:
- Fixed a bug in the gemmt testsuite module that only manifested when
  testing of gemmt is enabled but testing of gemv is disabled. The bug
  was due to a copy-paste error dating back to the introduction of gemmt
  in 88ad841.
2020-12-02 13:02:00 -06:00
Field G. Van Zee
b43dae9a5d Fixed copy-paste bugs in edge-case sup kernels.
Details:
- Fixed bugs in two sup kernels, bli_dgemmsup_rv_haswell_asm_1x6() and
  bli_dgemmsup_rd_haswell_asm_1x4(), which involved extraneous assembly
  instructions that were left over from when the kernels were first
  written. These instructions would cause segmentation faults in some
  situations where extra memory was not allocated beyond the end of
  the matrix buffers. Thanks to Kiran Varaganti for reporting these
  bugs and to Bhaskar Nallani for identifying the cause and solution.
2020-12-01 16:44:38 -06:00
Field G. Van Zee
11dfc176a3 Reorganized thread auto-factorization logic.
Details:
- Reorganized logic of bli_thread_partition_2x2() so that the primary
  guts were factored out into "fast" and "slow" variants. Then added
  logic to the "fast" variant that allows for more optimal thread
  factorizations in some situations where there is at least one factor
  of 2.
- Changed BLIS_THREAD_RATIO_M from 2 to 1 in bli_kernel_macro_defs.h and
  added comments to that file describing BLIS_THREAD_RATIO_? and
  BLIS_THREAD_MAX_?R.
- In bli_family_zen.h and bli_family_zen2.h, preprocessed out several
  macros not used in vanilla BLIS and removed the unused macro
  BLIS_ENABLE_ZEN_BLOCK_SIZES from the former file.
- Disabled AMD's small matrix handling entry points in bli_syrk_front.c
  and bli_trsm_front.c. (These branches of small matrix handling have
  not been reviewed by vanilla BLIS developers.)
- Added commented-out calls printf() to bli_rntm.c.
- Whitespace changes to bli_thread.c.
2020-12-01 19:51:27 +00:00
Devin Matthews
6d3bafacd7 Update BuildSystem.md
Add git version >= 1.8.5 requirement (see #462).
2020-11-28 17:17:56 -06:00
Field G. Van Zee
64856ea5a6 Auto-reduce (by default) prime numbers of threads.
Details:
- When requesting multithreaded parallelism by specifying the total
  number of threads (whether it be via environment variable, globally at
  runtime, or locally at runtime), reduce the number of threads actually
  used by one if the original value (a) is prime and (b) exceeds a
  minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which is set
  to 11 by default. If, when specifying the total number of threads (and
  not the individual ways of parallelism for each loop), prime numbers
  of threads are desired, this feature may be overridden by defining the
  BLIS_ENABLE_AUTO_PRIME_NUM_THREADS macro in the bli_family_*.h that
  corresponds to the configuration family targeted at configure-time.
  (For now, there is no configure option(s) to control this feature.)
  Thanks to Jeff Diamond for suggesting this change.
- Defined a new function in bli_thread.c, bli_is_prime(), that returns a
  bool that determines whether an integer is prime. This function is
  implemented in terms of existing functions in bli_thread.c.
- Updated docs/Multithreading.md to document the above feature, along
  with unrelated minor edits.
2020-11-23 16:54:51 -06:00
Field G. Van Zee
55933b6ff6 Added missing attribution to docs/ReleaseNotes.md. 2020-11-20 10:39:32 -06:00
Field G. Van Zee
e310f57b4b CHANGELOG update (0.8.0) 2020-11-19 13:33:37 -06:00
Field G. Van Zee
9b387f6d5a Version file update (0.8.0) 0.8.0 2020-11-19 13:33:37 -06:00
Field G. Van Zee
2928ec750d ReleaseNotes.md update in advance of next version.
Details:
- Updated docs/ReleaseNotes.md in preparation for next version.
2020-11-18 18:31:35 -06:00
Field G. Van Zee
b9899bedff CREDITS file update. 2020-11-18 16:52:41 -06:00
Field G. Van Zee
9bb23e6c2a Added support for systemless build (no pthreads).
Details:
- Added a configure option, --[enable|disable]-system, which determines
  whether the modest operating system dependencies in BLIS are included.
  The most notable example of this on Linux and BSD/OSX is the use of
  POSIX threads to ensure thread safety for when application-level
  threads call BLIS. When --disable-system is given, the bli_pthreads
  implementation is dummied out entirely, allowing the calling code
  within BLIS to remain unchanged. Why would anyone want to build BLIS
  like this? The motivating example was submitted via #454 in which a
  user wanted to build BLIS for a simulator such as gem5 where thread
  safety may not be a concern (and where the operating system is largely
  absent anyway). Thanks to Stepan Nassyr for suggesting this feature.
- Another, more minor side effect of the --disable-system option is that
  the implementation of bli_clock() unconditionally returns 0.0 instead
  of the time elapsed since some fixed point in the past. The reasoning
  for this is that if the operating system is truly minimal, the system
  function call upon which bli_clock() would normally be implemented
  (e.g. clock_gettime()) may not be available.
- Refactored preprocess-guarded code in bli_pthread.c and bli_pthread.h
  to remove redundancies.
- Removed old comments and commented #include of "bli_pthread_wrap.h"
  from bli_system.h.
- Documented bli_clock() and bli_clock_min_diff() in BLISObjectAPI.md
  and BLISTypedAPI.md, with a note that both are non-functional when
  BLIS is configured with --disable-system.
2020-11-16 15:55:45 -06:00
Field G. Van Zee
88ad841434 Squash-merge 'pr' into 'squash'. (#457)
Merged contributions from AMD's AOCL BLIS (#448).
  
Details:
- Added support for level-3 operation gemmt, which performs a gemm on
  only the lower or upper triangle of a square matrix C. For now, only
  the conventional/large code path will be supported (in vanilla BLIS).
  This was accomplished by leveraging the existing variant logic for
  herk. However, some of the infrastructure to support a gemmtsup is
  included in this commit, including
  - A bli_gemmtsup() front-end, similar to bli_gemmsup().
  - A bli_gemmtsup_ref() reference handler function.
  - A bli_gemmtsup_int() variant chooser function (with variant calls
    commented out).
- Added support for inducing complex domain gemmt via the 1m method.
- Added gemmt APIs to the BLAS and CBLAS compatiblity layers.
- Added gemmt test module to testsuite.
- Added standalone gemmt test driver to 'test' directory.
- Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md.
- Added a C++ template header (blis.hh) containing a BLAS-inspired
  wrapper to a set of polymorphic CBLAS-like function wrappers defined
  in another header (cblas.hh). These two headers are installed if
  running the 'install' target with INSTALL_HH is set to 'yes'. (Also
  added a set of unit tests that exercise blis.hh, although they are
  disabled for now because they aren't compatible with out-of-tree
  builds.) These files now live in the 'vendor' top-level directory.
- Various updates to 'zen' and 'zen2' subconfigurations, particularly
  within the context initialization functions.
- Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and
  various minor updates to dotv and scalv kernels. Also added various
  sup kernels contributed by AMD to kernels/zen/3. However, these
  kernels are (for now) not yet used, in part because they caused
  AppVeyor clang failures, and also because I have not found time to
  review and vet them.
- Output the python found during configure into the definition of PYTHON
  in build/config.mk (via build/config.mk.in).
- Added early-return checks (A, B, or C with zero dimension; alpha = 0)
  to bli_gemm_front.c.
- Implemented explicit beta = 0 handling in for the sgemm ukernel in
  bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent
  bug surfaced because the gemmt module verifies its computation using
  gemm with its beta parameter set to zero, which, on a cortexa15 system
  caused the gemm kernel code to unconditionally multiply the
  uninitialized C data by beta. The C matrix likely contained
  non-numeric values such as NaN, which then would have resulted in a
  false failure.
- Fixed a bug whereby the implementation for bli_herk_determine_kc(),
  in bli_l3_blocksize.c, was inadvertantly being defined in terms of
  helper functions meant for trmm. This bug was probably harmless since
  the trmm code should have also done the right thing for herk.
- Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in
  kernels/zen/3/bli_gemm_small.c since those macros are not used in
  vanilla BLIS.
- Added cpp guard to definition of bli_mem_clear() in bli_mem.h to
  accommodate C++'s stricter type checking.
- Added cpp guard to test/*.c drivers that facilitate compilation on
  Windows systems.
- Various whitespace changes.
2020-11-14 09:39:48 -06:00
Field G. Van Zee
234b8b0cf4 Increased dotxaxpyf testsuite thresholds.
Details:
- Increased the test thresholds used by the dotxaxpyf testsuite module
  by a factor of five in order to avoid residuals that unnecessarily
  fall in the MARGINAL range. This commit should fix #455. Thanks to
  @nagsingh for reporting this issue.
2020-11-12 19:11:16 -06:00
Field G. Van Zee
ed612dd82c Updated README.md with sgemmsup blurb.
Details:
- Added an entry to the "What's New" section of the README.md to
  announce the availability of sgemmsup.
2020-11-07 13:09:42 -06:00
Field G. Van Zee
e14424f55b Merge branch 'dev' 2020-11-07 13:02:50 -06:00
Field G. Van Zee
0cfe1aac22 Relocated operation index to ToC in API docs.
Details:
- Moved the "Operation index" section of both the BLISObjectAPI.md and
  BLISTypedAPI.md docs to appear immediately after the table of contents
  of each document. This allows the reader to quickly jump to the
  documentation for any operation without having to scroll through much
  of the document (when rendered via a web browser).
- Fixed a mistake in the BLISObjectAPI.md for the setd operation, which
  does *not* observe the diag property of its matrix argument. Thanks to
  Jeff Diamond for reporting this.
2020-10-30 17:10:36 -05:00
Field G. Van Zee
2a0682f8e5 Implemented runtime subconfig selection (#451).
Details:
- Implemented support for the user manually overriding the automatic
  subconfiguration selection that happens at runtime. This override
  can be requested by setting the BLIS_ARCH_TYPE environment variable.
  The variable must be set to the arch_t id (as enumerated in
  bli_type_defs.h) corresponding to the desired subconfiguration. If a
  value outside this enumerated range is given, BLIS will abort with an
  error message. If the value is in the valid range but corresponds to a
  subconfiguration that was not activated at configure-time/compile-time,
  BLIS will abort with a (different) error message. Thanks to decandia50
  for suggesting this feature via issue #451.
- Defined a new function bli_gks_lookup_id to return the address of an
  internal data structure within the gks. If this address is NULL, then
  it indicates that the subconfig corresponding to the arch_t id passed
  into the function was not compiled into BLIS. This function is used
  in the second of the two abort scenarios described above.
- Defined the enumerated error code BLIS_UNINITIALIZED_GKS_CNTX, which
  is returned for the latter of the two abort scenarios mentioned above,
  along with a corresponding error message and a function to perform
  the error check.
- Added cpp macro branching to bli_env.c to support compilation of the
  auto-detect.x executable during configure-time. This cpp branch is
  similar to the cpp code already found in bli_arch.c and bli_cpuid.c.
- Cleaned up the auto_detect() function to facilitate easier maintenance
  going forward. Also added a convenient debug switch that outputs the
  compilation command for the auto-detect.x executable and exits.
2020-10-18 18:04:03 -05:00
Field G. Van Zee
eccdd75a2d Whitespace tweak in docs/PerformanceSmall.md. 2020-10-09 15:44:16 -05:00
Field G. Van Zee
7677e9ba60 Merge branch 'dev' of github.com:flame/blis into dev 2020-10-09 15:41:25 -05:00
Field G. Van Zee
addcd46b05 Added Epyc 7742 Zen2 ("Rome") sup perf results.
Details:
- Added single-threaded and multithreaded sup performance results to
  docs/PerformanceSmall.md for both sgemm and dgemm. These results were
  gathered on an Epyc 7742 "Rome" server featuring AMD's Zen2
  microarchitecture. Special thanks to Jeff Diamond for facilitating
  access to the system via the Oracle Cloud.
- Updates to octave scripts in test/sup/octave for use with Octave 5.2
  and for use with subplot_tight().
- Minor updates to octave scripts in test/3/octave.
- Renamed files containing the previous Zen performance results for
  consistency with the new results.
- Decreased line thickness slightly in large/conventional Zen2 graphs.
  I'm done tweaking those this time. Really.
- Added missing line regarding eigen header installation for each
  microarchitecture section.
2020-10-09 15:41:09 -05:00
Field G. Van Zee
a0849d390d Register l3 sup kernels in zen2 subconfig.
Details:
- Registered full suite of sgemm and dgemm sup millikernels, blocksizes,
  and crossover thresholds in bli_cntx_init_zen2.c.
- Minor updates to test/sup/runme.sh for running on Zen2 Epyc 7742
  system.
2020-10-09 20:22:17 +00:00
Field G. Van Zee
d98368c32d Another tweak to line thickness of Zen2 graphs. 2020-10-08 19:05:51 -05:00
Field G. Van Zee
1855dfbdaa Tweaked line thickness in Zen2 graphs once more.
Details:
- Decreased (relative to previous commit) line thickness in recent Zen2
  graphs.
2020-10-08 19:01:00 -05:00
Field G. Van Zee
0991611e7e Increased line thickness in recent Zen2 graphs.
Details:
- Increased the width of the lines in the graphs introduced in 74ec6b8.
2020-10-08 18:54:49 -05:00
Field G. Van Zee
8273cbacd7 README.md, docs/FAQ.md updates.
Details:
- Added a frequently asked question to docs/FAQ.md regarding the
  difference between upstream (vanilla) BLIS and AMD BLIS.
- Updated the name of ICES in the README.md to reflect the Oden
  rebranding.
2020-10-07 14:51:33 -05:00
Field G. Van Zee
a178a822ad Added Zen2 links to docs/Performance.md Contents. 2020-09-30 16:00:52 -05:00