2239 Commits

Author SHA1 Message Date
Field G. Van Zee
bbaf29abd9 Very minor variable updates to common.mk.
Details:
- Fixed a harmless bug that would have allowed C++ headers into the list
  of header suffixes specifically reserved for C99 headers. In practice,
  this would have had no substantive effect on anything since the core
  BLIS framework does not use C++ headers.
2022-08-04 17:51:37 -05:00
Field G. Van Zee
a48e29d799 CREDITS file update.
Details:
- Thanks to Kihiro Bando for assisting with issue #644.
2022-07-28 10:11:07 -05:00
Field G. Van Zee
5b298935de Removed buggy cruft from power10 subconfig.
Details:
- Removed #defines for BLIS_BBN_s and BLIS_BBN_d from
  bli_kernel_defs_power10.h. These were inadvertently set in ae10d949
  because the power10 subconfig was registering bb packm ukernels, but
  only for 6xk (power10 uses s8x16 and d8x8 ukernels) and only because
  the original author (probably) copy-pasted from power9 when getting
  started. That 6xk packm registration was effectively "dead code"
  prior to ae10d949, but was then mistaken as not-dead code during the
  ae10d949 refactor. These improper bb factors may have been causing
  bugs in power10 builds. Thanks to Nicholai Tukanov for helping remind
  me what the power10 subconfig was supposed to look like.
- Removed extraneous microkernel preference registrations from power10
  subconfig. Preferences for single and double complex gemm were being
  registered despite there being no complex gemm ukernels registered to
  go with them. Similarly, there were trsm preferences registered
  without any trsm ukernels registered (and BLIS doesn't actually use a
  preference for the trsm ukernel anyway). These extraneous
  registrations were almost surely not hurting anything, even if they
  were quite misleading.
2022-07-27 19:14:15 -05:00
Devin Matthews
56de31b00f Disable modification of KC in the gemmsup kernels. (#648)
This led to a ~50% performance reduction for certain gemm operations (but not others?). See #644 for example.
2022-07-27 13:54:17 -05:00
Field G. Van Zee
4dde947e2e Fixed out-of-bounds bug in sup s6x16m haswell kernel.
Details:
- Fixed another out-of-bounds read access bug in the haswell sup
  assembly kernels. This bug is similar to the one fixed in 17b0caa
  and affects bli_sgemmsup_rv_haswell_asm_6x2m(). Thanks to Madeesh
  Kannan for reporting this bug (and a suitable fix) in #635.
- CREDITS file update.
2022-07-26 17:29:32 -05:00
Devin Matthews
6826c1cdfb Add #line directives to flattened blis.h. (#643)
Details:
- Modified flatten-headers.py so that #line directives are inserted into
  the flattened blis.h file. This facilitates easier debugging when
  something is amiss in the flattened blis.h because the compiler will
  be able to refer to the line number within the original constituent
  header file (which is where the fix would go) rather than the line
  number within the flattened header (which is not as helpful).
2022-07-25 18:21:05 -05:00
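Note on 6826c1cdfb: the effect of a #line directive can be seen in a small C sketch. The path and line number below are made up for illustration and are not taken from the BLIS sources; the function names are likewise hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Pretend the two functions below were copied into a flattened header from
   line 120 of a constituent header. The path is a made-up example. */
#line 120 "constituent/example.h"
int reported_line(void) { return __LINE__; }
const char *reported_file(void) { return __FILE__; }

/* Self-check: the compiler now attributes the code to the original file. */
void check_line_directive(void)
{
    assert(reported_line() == 120);
    assert(strcmp(reported_file(), "constituent/example.h") == 0);
}
```

Any diagnostic the compiler emits for these functions would name the constituent file and line rather than the position inside the flattened header.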
Alexander Grund
af3a41e025 Add autodetection for POWER7, POWER9 & POWER10 (#647)
Read from `/proc/cpuinfo` as done for ARM.
Fixes #501
2022-07-21 11:05:48 -05:00
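Note on af3a41e025: a minimal sketch of this kind of detection, assuming that POWER systems report a line such as "cpu : POWER9, altivec supported" in /proc/cpuinfo. The helper name, the returned strings, and the fallback value are hypothetical; the actual detection code may differ. The helper takes the file contents as a string so that it can be exercised without a POWER machine.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: scan /proc/cpuinfo-style text for a "cpu" line and
   map it to a subconfiguration name. Returns "generic" if nothing matches. */
const char *detect_power_from_cpuinfo(const char *cpuinfo_text)
{
    assert(cpuinfo_text != NULL);

    const char *line = cpuinfo_text;
    while (line != NULL && *line != '\0') {
        /* Only inspect lines whose key begins with "cpu". */
        if (strncmp(line, "cpu", 3) == 0 &&
            (line[3] == '\t' || line[3] == ' ' || line[3] == ':')) {
            /* Copy the current line so strstr() cannot run into later lines. */
            size_t len = strcspn(line, "\n");
            char buf[256];
            if (len >= sizeof buf) len = sizeof buf - 1;
            memcpy(buf, line, len);
            buf[len] = '\0';
            if (strstr(buf, "POWER10")) return "power10";
            if (strstr(buf, "POWER9"))  return "power9";
            if (strstr(buf, "POWER7"))  return "power7";
        }
        const char *nl = strchr(line, '\n');
        line = nl ? nl + 1 : NULL;
    }
    return "generic";
}
```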
Field G. Van Zee
17b0caa2b2 Fixed out-of-bounds read in haswell gemmsup kernels.
Details:
- Fixed memory access bugs in the bli_sgemmsup_rv_haswell_asm_Mx2()
  kernels, where M = {1,2,3,4,5,6}. The bugs were caused by loading four
  single-precision elements of C, via instructions such as:

	vfmadd231ps(mem(rcx, 0*32), xmm3, xmm4)

  in situations where only two elements are guaranteed to exist. (These
  bugs may not have manifested in earlier tests due to the leading
  dimension alignment that BLIS employs by default.) The issue was fixed
  by replacing lines like the one above with:

	vmovsd(mem(rcx), xmm0)
	vfmadd231ps(xmm0, xmm3, xmm4)

  Thus, we use vmovsd to explicitly load only two elements of C into
  registers, and then operate on those values using register addressing.
  (Thanks to Daniël de Kok for reporting these bugs in #635, and to
  Bhaskar Nallani for proposing the fix.)
- CREDITS file update.
2022-07-14 17:55:34 -05:00
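Note on 17b0caa2b2: the idea behind the fix can be emulated in portable C. This is only a sketch of the technique (load exactly two elements, then operate on registers), not the assembly used in the kernels; the type and function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Portable stand-in for a 128-bit register holding four floats. */
typedef struct { float lane[4]; } vec4f;

/* Emulates a 64-bit load: reads exactly two floats from p and zeroes the
   upper two lanes, so no memory past p[1] is touched. */
vec4f load2_zero_upper(const float *p)
{
    assert(p != NULL);
    vec4f v = { { 0.0f, 0.0f, 0.0f, 0.0f } };
    memcpy(v.lane, p, 2 * sizeof(float));
    return v;
}

/* acc += x * y, lane by lane, using register operands only. */
vec4f fma4(vec4f x, vec4f y, vec4f acc)
{
    for (int i = 0; i < 4; ++i) acc.lane[i] += x.lane[i] * y.lane[i];
    return acc;
}
```

A fused multiply-add with a memory operand would instead read all four lanes from memory, which is where the out-of-bounds access came from when only two elements of C exist.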
Field G. Van Zee
cc260fd706 Allow uniform max problem sizes in test/3/runme.sh.
Details:
- Tweaked test/3/runme.sh so that the test driver binaries for single-
  threaded (st), single-socket (1s), and dual-socket (2s) execution can
  be built using identical problem size ranges. Previously, this was not
  possible because runme.sh used the maximum problem size, which was
  embedded into the binary filename, to tell the three classes of
  binaries apart from one another. Now, runme.sh uses the binary suffix
  ("st", "1s", or "2s") to tell them apart. This required only a few
  changes to the logic, but it also required a change in format to the
  threading config strings themselves (replacing the max problem size
  with "st", "1s", or "2s"). Thanks to Jeff Diamond for inspiring this
  improvement.
- Comment updates.
2022-07-13 16:16:01 -05:00
bartoldeman
9b1beec60b Use BLIS_ENABLE_COMPLEX_RETURN_INTEL in blastest files (#636)
Details:
- Fixed a crash that occurs when either cblat1 or zblat1 is linked
  with a build of BLIS that was compiled with '--complex-return=intel'.
  This fix involved inserting preprocessor macro guards based on
  BLIS_ENABLE_COMPLEX_RETURN_INTEL into blastest/src/cblat1.c and 
  blastest/src/zblat1.c to correctly handle situations where BLIS is 
  compiled with Intel/f2c-style calling conventions for complex numbers.
- Updated blastest/src/fortran/run-f2c.sh so that future executions
  will insert the aforementioned cpp macro conditional where
  appropriate.
2022-07-11 19:15:12 -05:00
bartoldeman
98d467891b Change complex_return='intel' for ifx. (#637)
Details:
- When checking the version string of the Fortran compiler for the
  purposes of determining a default return convention for complex
  domain values, grep for "IFORT" instead of "ifort" since that string
  is common to both the 'ifx' and 'ifort' binaries provided by Intel:

    $ ifx --version
    ifx (IFORT) 2022.1.0 20220316
    Copyright (C) 1985-2022 Intel Corporation. All rights reserved.

    $ ifort --version
    ifort (IFORT) 2021.6.0 20220226
    Copyright (C) 1985-2022 Intel Corporation. All rights reserved.
2022-07-11 18:40:53 -05:00
jdiamondGitHub
ffde54cc5c Minor changes to .gitignore and LICENSE files. (#642)
Details:
- Macs create .DS_Store files in every directory visited. Updated
  .gitignore file so these files won't be reported as untracked by
  'git status'.
- Added Oracle Corporation to the LICENSE file.
- Updated UT copyright on behalf of SHPC.
2022-07-11 16:47:30 -05:00
Field G. Van Zee
7cba7ce3dd Minor cleanups, comment updates to bli_gks.c.
Details:
- Removed a redundant registration of 'a64fx' subconfig in
  bli_gks_init().
- Reordered registration of 'armsve', 'a64fx', and 'firestorm'
  subconfigs. Thanks to Jeff Diamond for his input on this reordering.
- Comment updates to bli_gks.c and arch_t enum in bli_type_defs.h.
2022-07-08 11:15:18 -05:00
Field G. Van Zee
667f201b78 Fixed type bug in bli_cntx_set_ukr_prefs().
Details:
- Fixed a bug in bli_cntx_set_ukr_prefs() which erroneously typecast the
  num_t value read from va_args() down to a bool before being stored
  within the cntx_t. This bug was introduced on April 6th 2022, in
  ae10d94. This caused the ukernel preferences for double real and
  double complex to go unchanged while the preferences for single real
  and single complex were corrupted by the former datatypes'
  preference values. The bug manifested as degraded performance for
  subconfigurations that registered column-preferential ukernels. The
  reason is that the erroneous preferences trigger unnecessary
  transpositions in the operation, which forces the gemm ukernel to
  compute on matrices that are not stored according to its preference.
  Thanks to Devin Matthews, Jeff Diamond, and Leick Robinson for their
  extensive efforts and assistance in tracking down this issue.
- Augmented the informational header that is output by the testsuite to
  include ukernel preferences for gemm, gemmtrsm_[lu], and trsm_[lu].
- CREDITS file update.
2022-07-07 16:44:21 -05:00
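Note on 667f201b78: the class of bug described here can be reproduced in a few lines. The enum, its numeric values, and the function below are hypothetical stand-ins, not the BLIS definitions; the sketch only shows how narrowing a datatype id to bool sends an update to the wrong slot.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>

/* Hypothetical stand-in for a datatype id enum with four values. */
typedef enum { DT_FLOAT = 0, DT_SCOMPLEX = 1, DT_DOUBLE = 2, DT_DCOMPLEX = 3 } dt_id;

/* Stores `count` (datatype, preference) pairs into prefs[4].
   With buggy != 0, the datatype id is narrowed to bool before use, so ids
   2 and 3 both collapse to 1 and overwrite the wrong slot. */
void set_prefs(bool prefs[4], int buggy, int count, ...)
{
    va_list args;
    va_start(args, count);
    for (int i = 0; i < count; ++i) {
        int  dt_raw = va_arg(args, int);      /* ids are passed as int */
        bool pref   = va_arg(args, int) != 0;
        int  dt     = buggy ? (int)(bool)dt_raw : dt_raw;
        assert(dt >= 0 && dt < 4);
        prefs[dt] = pref;
    }
    va_end(args);
}
```

With the narrowing in place, a preference intended for DT_DOUBLE lands in the DT_SCOMPLEX slot and the DT_DOUBLE slot keeps its old value.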
Isuru Fernando
d429b6bfce Support clang targeting MinGW (#639)
* Support clang targeting MinGW

* Fix pthread linking
2022-06-28 16:34:10 -04:00
Field G. Van Zee
d93df02334 Removed unused dt arg in bli_gks_query_ind_cntx().
Details:
- Removed the num_t datatype argument from bli_gks_query_ind_cntx().
  This argument stopped being needed by the function in commit e9da642.
  Its only use in bli_gks_query_ind_cntx() was to be passed through to
  the context initialization function for the chosen induced method,
  but even then, commit log notes from e9da642 indicate that I could not
  recall why the datatype argument was ever needed by the context init
  function to begin with.
- Updated all invocations of bli_gks_query_ind_cntx() to omit the dt
  argument. Most of these invocations resided in various standalone test
  drivers (and the testsuite).
2022-06-15 14:09:49 -05:00
Field G. Van Zee
5677289245 Added SMU citation to README.md intro.
Details:
- Added a citation to SMU and the Matthews Research Group to the general
  attribution of maintainership and development in the Introduction of
  the README.md file. Thanks to Robert van de Geijn and Devin Matthews
  for suggesting this change.
2022-06-01 10:49:33 -05:00
Field G. Van Zee
4603324eb0 Init/finalize via bli_pthread_switch_t API (#634).
Details:
- Defined and implemented a new pthread-like abstract datatype and API
  in bli_pthread.c. The new type, bli_pthread_switch_t, is similar to
  bli_pthread_once_t in some respects. The idea is that like a switch in
  your home that controls a light or ceiling fan, it can either be on or 
  off. The switch starts in the off state. Moving from one state to the 
  other (on to off; off to on) causes some action (i.e., a startup or
  shutdown function) to be executed. Trying to move from one state to 
  the same state (on to on; off to off) is safe in that it results in
  no action. Unlike bli_pthread_once(), the API for bli_pthread_switch_t 
  contains both _on() and _off() interfaces. Also, unlike the _once()
  function, the _on() and _off() functions return error codes so that
  the 'int' error code returned from the startup or shutdown functions
  may be passed back to the caller. Thanks to Devin Matthews for his
  input and feedback on this feature.
- Replaced the previous implementation of bli_init_once() and 
  bli_finalize_once() -- both of which used bli_pthread_once() -- with 
  ones that rely upon bli_pthread_switch_on() and _switch_off(),
  respectively. This also required updating the return types of 
  _init_apis() and _finalize_apis() to match the function pointer type 
  required by bli_pthread_switch_on()/_switch_off().
- Comment updates.
2022-05-19 14:07:03 -05:00
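Note on 4603324eb0: the on/off switch described above can be sketched with a mutex and a state flag. This is an illustration under assumptions, not the bli_pthread_switch_t implementation: the type, function, and callback names are hypothetical, and the choice to change state only when the callback returns 0 is an assumption of this sketch.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical on/off switch; not the actual bli_pthread_switch_t layout. */
typedef struct {
    pthread_mutex_t mutex;
    int             is_on;
} onoff_switch;

#define ONOFF_SWITCH_INIT { PTHREAD_MUTEX_INITIALIZER, 0 }

/* Runs action() only on an off->on transition and returns its error code
   (0 if the switch was already on). State changes only on success. */
int onoff_switch_on(onoff_switch *sw, int (*action)(void))
{
    assert(sw != NULL && action != NULL);
    int r = 0;
    pthread_mutex_lock(&sw->mutex);
    if (!sw->is_on) {
        r = action();
        if (r == 0) sw->is_on = 1;
    }
    pthread_mutex_unlock(&sw->mutex);
    return r;
}

/* Runs action() only on an on->off transition and returns its error code
   (0 if the switch was already off). State changes only on success. */
int onoff_switch_off(onoff_switch *sw, int (*action)(void))
{
    assert(sw != NULL && action != NULL);
    int r = 0;
    pthread_mutex_lock(&sw->mutex);
    if (sw->is_on) {
        r = action();
        if (r == 0) sw->is_on = 0;
    }
    pthread_mutex_unlock(&sw->mutex);
    return r;
}

/* Example callbacks that count how often they run. */
int startup_calls = 0, shutdown_calls = 0;
int example_startup(void)  { ++startup_calls;  return 0; }
int example_shutdown(void) { ++shutdown_calls; return 0; }
```

Calling onoff_switch_on() twice in a row runs example_startup() once; the second call is a no-op that returns 0, and the same holds for the off direction.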
Field G. Van Zee
64a9b061f6 Fixed misspelling of 'xpbys' in gemm macrokernel.
Details:
- Fixed a functionally harmless typo in bli_gemm_ker_var2.c where a few
  instances of the substring "xpbys" were misspelled as "xbpys". The
  misspellings were harmless because they were consistent, and because
  they referenced only local symbols.
2022-05-10 14:54:22 -05:00
Jed Brown
1c733402a9 Fix version check for znver3, which needs gcc >= 10.3 (#628)
Apple's clang-12 lacks znver3 support, unlike upstream clang-12.
2022-04-28 12:58:44 -05:00
Field G. Van Zee
6431c9e13b Added missing 'const' to zen bli_gemm_small.c.
Details:
- Added missing 'const' qualifiers to signatures of functions defined in
  kernels/zen/3/bli_gemm_small.c. This fixes compile-time errors when
  targeting 'zen3' subconfig (which apparently is enabling AMD's
  gemm_small code path by default). Thanks to Devin Matthews for
  reporting this error.
2022-04-14 13:01:24 -05:00
Devin Matthews
9fea633748 Partial addition of 'const' to all interfaces above the (micro)kernels. (#625)
Details:
- Added 'const' qualifier to applicable function arguments wherever the
  pointed-to object is not internally modified. This change affects
  all interfaces that reside above the level of the (micro)kernels.
- Typecast certain function return values to discard 'const' qualifier.
- Removed 'restrict' from various arguments, including cntx_t*,
  auxinfo_t*, rntm_t*, thrinfo_t*, mem_t*, and others.
- Removed parts of some APIs, such as bli_cntx_*(), due to limited use.
- Merged some variable declarations with their corresponding 
  initialization statements.
- Whitespace changes.
2022-04-13 15:59:06 -05:00
Devin Matthews
ae10d94954 Simplify and rewrite reference packm kernels. (#610)
Details:
- Reorganized the way kernels are stored within the cntx_t structure so
  that rather than having a function pointer for every supported size of
  unrolled packm kernel (2xk, 3xk, 4xk, etc.), we store only two packm
  kernels per datatype: one to pack MRxk micropanels and one to pack
  NRxk micropanels.
  - NOTE: The "bb" (broadcast B) reference kernels have been merged into
    the "standard" kernels (packm [including 1er and unpackm], gemm, 
    trsm, gemmtrsm). This replication factor is controlled by 
    BLIS_BB[MN]_[sdcz] etc. Power9/10 needs testing since only a 
    replication factor of 1 has been tested. armsve also needs testing 
    since the MR value isn't available as a macro.
- Simplified the bli_cntx_*() APIs to conform to the new unified kernel
  array within the cntx_t. Updated existing bli_cntx_init_<subconfig>()
  function definitions for all subconfigurations.
- Consolidated all kernel id types (e.g. l1vkr_t, l1mkr_t, l3ukr_t,
  etc.) into one kernel id type: ukr_t.
- Various edits, updates, and rewrites of reference kernels pursuant to 
  the aforementioned changes.
- Define compile-time macro constants (BLIS_MR_[sdcz], BLIS_NR_[sdcz], 
  and friends) in bli_kernel_macro_defs.h, but only when the macro
  BLIS_IN_REF_KERNEL is defined by the build system.
- Loose ends:
  - Still need to update documentation, including:
    - docs/ConfigurationHowTo.md
    - docs/KernelsHowTo.md
    to reflect changes made in this commit.
2022-04-06 20:31:11 -05:00
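Note on ae10d94954: a single packing routine that handles any panel width, with an optional replication ("broadcast") factor, might look like the sketch below. The function name, argument order, and panel layout are assumptions made for illustration; the reference kernels in BLIS are more general.

```c
#include <assert.h>
#include <stddef.h>

/* Packs an m x k block of A (row stride rs, column stride cs) into a
   micropanel p with mr rows per column, stored column by column. Rows
   m..mr-1 are zero-filled, and every element is replicated bb times. */
void pack_micropanel(size_t m, size_t k, size_t mr, size_t bb,
                     const double *a, size_t rs, size_t cs, double *p)
{
    assert(a != NULL && p != NULL);
    assert(m <= mr && bb >= 1);
    for (size_t j = 0; j < k; ++j) {
        double *col = p + j * mr * bb;
        for (size_t i = 0; i < mr; ++i) {
            double v = (i < m) ? a[i * rs + j * cs] : 0.0;
            for (size_t d = 0; d < bb; ++d) col[i * bb + d] = v;
        }
    }
}
```

Because mr is a runtime argument, the same routine serves both MRxk and NRxk micropanels, which is the simplification the commit describes in place of one unrolled kernel per panel size.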
Field G. Van Zee
b3e674db3c README.md update to link to releases page.
2022-04-04 17:31:02 -05:00
Field G. Van Zee
69fa915464 Fixed broken "tagged releases" link in README.md.
2022-04-01 08:47:46 -05:00
Field G. Van Zee
88cab8383c CHANGELOG update (0.9.0)
2022-04-01 08:12:06 -05:00
Field G. Van Zee
14c86f66b2 Version file update (0.9.0)
2022-04-01 08:12:06 -05:00
Field G. Van Zee
99bb9002f1 ReleaseNotes.md update in advance of next version.
2022-04-01 08:10:59 -05:00
Field G. Van Zee
bee7678b25 CREDITS file update.
2022-03-31 14:09:39 -05:00
Field G. Van Zee
cf06364327 Fixed typo in BLAS gemm3m call to _check().
Details:
- Fixed an unresolved symbol issue leftover from #590 whereby ?gemm3m_()
  as defined in bla_gemm3m.c was referencing bla_gemm3m_check(), which
  does not exist. It should have simply called the _check() function for
  gemm.
2022-03-29 16:18:25 -05:00
Dipal M Zambare
1ec020b33e AMD kernel updates; frame-specific AMD updates. (#597)
Details:
- Allow building BLIS with certain framework files (each with the '_amd'
  suffix) that have been customized by AMD for Zen-based hardware. These
  customized files were derived from portable versions of the same files
  (i.e., those without the '_amd' suffix). Whether the portable or AMD-
  specific files are compiled is now controlled by a new configure
  option, --[en|dis]able-amd-frame-tweaks. This option is disabled by
  default in vanilla BLIS, though AMD may choose to enable it by default
  in their fork. For now, the added AMD-specific files are:
  - bli_gemv_unf_var2_amd.c
  - bla_copy_amd.c
  - bla_gemv_amd.c
  These files reside in 'amd' subdirectories found within the directory
  housing their generic counterparts.
- Register optimized real-domain copyv, setv, and swapv kernels in
  bli_cntx_init_zen.c.
- Various minor updates to level-1v kernels in 'zen' kernel set.
- Added caxpyf kernel as well as saxpyf and multiple daxpyf kernels to
  the 'zen' kernel set.
- If the problem passed to ?gemm_() in bla_gemm.c has a unit m or n dim,
  call gemv instead and return early.
- Combined variable declarations with their initialization in various
  level-2 and level-3 BLAS compatibility files, and also inserted
  'const' qualifier in those same declaration statements.
- Moved frame/compat/bla_gemmt.c and .h to frame/compat/extra/ .
- Added copyv and swapv test drivers to 'test' directory.
- Whitespace, comment changes.
2022-03-29 16:15:36 -05:00
Bhaskar Nallani
0db2bd5341 Added BLAS/CBLAS APIs for gemm3m. (#590)
Details:
- Created ?gemm3m_() and cblas_?gemm3m() APIs that (for now) simply
  invoke the 1m implementation unconditionally. (Note that these APIs
  bypass sup handling.)
- Added BLAS prototypes for gemm3m in frame/compat/bla_gemm3m.h.
- Added CBLAS prototypes for gemm3m in frame/compat/cblas/src/cblas.h.
- Relocated: 
    frame/compat/cblas/src/cblas_?gemmt.c 
  files into
    frame/compat/cblas/src/extra/ 
- Relocated frame/compat/bla_gemmt.? into frame/compat/extra/ .
- Minor reorganization of prototypes and cpp macro directives in 
  bli_blas.h, cblas.h, and cblas_f77.h.
- Trivial whitespace change to cblas_zgemm.c.
2022-03-24 18:41:55 -05:00
Devin Matthews
d6810000e9 Update Multithreading.md
Add notes about `BLIS_IR_NT` (should typically be 1) and `BLIS_JR_NT` (should typically be small, e.g. <= 4). [ci skip]
2022-03-14 10:29:54 -05:00
Field G. Van Zee
f1dbb0e514 Trivial whitespace change; commit log addendum.
Details:
- A co-attribution to Mithun Mohan was inadvertently omitted from the
  commit log for headline change in the previous commit, 7c07b47.
2022-03-11 13:38:28 -06:00
Field G. Van Zee
7c07b477e4 Avoid gemmsup barriers when not packing A or B. (#622)
Details:
- Implemented a multithreaded optimization for the special (and common)
  case of employing the gemmsup code path when the user requests
  (implicitly or explicitly) that neither A nor B be packed during
  computation. This optimization takes the form of a greatly reduced
  code branch in bli_thrinfo_sup_create_for_cntl(), which avoids a
  broadcast and two barriers, and results in higher performance when
  obtaining two-way or higher parallelism within BLIS. Thanks to
  Bhaskar Nallani of AMD for proposing this change via issue #605.
- Added an early return branch to bli_thrinfo_create_for_cntl() that
  detects and quickly handles cases where no parallelism is being
  obtained within BLIS (i.e., single-threaded execution). Note that
  this special case handling was/is already present in
  bli_thrinfo_sup_create_for_cntl().
- CREDITS file update.
2022-03-11 13:28:50 -06:00
Ivan Korostelev
cad10410b2 POWER10: edge cases in microkernel (#620)
Use new API for POWER10 gemm microkernel
2022-03-10 09:58:14 -06:00
Field G. Van Zee
71851a0549 Fixed level-3 performance bug in haswell ukernels.
Details:
- Fixed a performance regression affecting nearly all level-3 operations
  that use the 'haswell' sgemm and dgemm microkernels. This regression
  was introduced in 54fa28b, caused by an ill-formed conditional
  expression in the assembly code that controls whether cache lines of C
  should be prefetched as rows or as columns. Essentially, the two
  branches were reversed, causing incomplete prefetching to occur for
  both row- and column-stored instances of matrix C. Thanks to Devin
  Matthews for his help finding and fixing this bug.
2022-03-08 17:38:09 -06:00
Field G. Van Zee
84732bf956 Revamp how tools are handled/checked by configure.
Details:
- Consolidate handling of tools that are specifiable via CC, CXX, FC, 
  PYTHON, AR, and RANLIB into one bash function, select_tool_w_env().
  - If the user specifies a tool via an environment variable (e.g. 
    CC=gcc) and that tool does not seem valid, print an error message 
    and abort configure, unless the tool is optional (e.g. CXX or FC), 
    in which case a warning message is printed instead.
  - The definition of "seems valid" above amounts to:
    - responding to at least one of a basic set of command line options 
      (e.g. --version, -V, -h) if the os_name is Linux (since GNU tools 
      tend to respond to flags such as --version) or if the tool in 
      question is CC, CXX, FC, or PYTHON (which tend to respond to the 
      expected flags regardless of OS)
    - the binary merely existing for AR and RANLIB on Darwin/OSX/BSD. 
      (These OSes tend to have non-GNU versions of ar and ranlib, which 
      typically do not respond to --version and friends.)
- This PR addresses #584. Thanks to Devin Matthews for suggesting some
  of the changes in this commit.
2022-02-28 12:19:31 -06:00
RuQing Xu
d5146582b1 ArmSVE Ensure Non-zero Block Size (#615)
Fixes #613. There are several macros/environment variables which need to be tuned to get good cache block sizes. It would be nice to have a way of getting values automatically.
2022-02-22 12:35:46 -06:00
RuQing Xu
4d83523097 Add armsve to arm64 Metaconfig (#614)
Availability of the `armsve` subconfig is controlled by the compiler version (gcc/clang). Tested for SVE and non-SVE. Fixes #612.
2022-02-22 10:03:47 -06:00
Field G. Van Zee
c9700f369a Renamed SIMD-related macro constants for clarity.
Details:
- Renamed the following macros defined in bli_kernel_macro_defs.h:

    BLIS_SIMD_NUM_REGISTERS -> BLIS_SIMD_MAX_NUM_REGISTERS
    BLIS_SIMD_SIZE          -> BLIS_SIMD_MAX_SIZE

  Also updated all instances of these macros elsewhere, including
  subconfigurations, source code, and documentation. Thanks to Devin
  Matthews for suggesting this change.
2022-02-15 15:36:52 -06:00
Field G. Van Zee
ee9ff988c4 Move edge cases to gemmtrsm ukrs; doc updates.
Details:
- Moved edge-case handling into the gemmtrsm microkernel. This required
  changing the microkernel API to take m and n dimension parameters as
  well as updating all existing gemmtrsm microkernel function pointer
  types, function signatures, and related definitions to take m and n
  dimensions. Also updated all existing gemmtrsm kernels in the
  'kernels' directory (which for now is limited to haswell and penryn
  kernel sets, plus native and 1m-based reference kernels in
  'ref_kernels') to take m and n dimensions, and implemented edge-case
  handling within those microkernels via a collection of new C
  preprocessor macros defined within bli_edge_case_macro_defs.h. Note
  that the edge-case handling for gemm-like operations had already
  been relocated into the gemm microkernel in 54fa28b.
- Added descriptive comments to GEMM_UKR_SETUP_CT() and related macros in
  bli_edge_case_macro_defs.h to allow for easier reading.
- Updated docs/KernelsHowTo.md to reflect above changes. Also cleaned up
  the bullet under "Implementation Notes for gemm" that covers alignment
  issues. (Thanks to Ivan Korostelev for pointing out the confusing and
  outdated language in issue #591.)
- Other minor tweaks to KernelsHowTo.md.
2022-02-15 15:01:51 -06:00
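Note on ee9ff988c4: handling edge cases inside a microkernel usually means computing a full register tile and storing only the part that exists. The sketch below shows that pattern for a plain gemm-like microkernel. It does not use the macros from bli_edge_case_macro_defs.h, and the names, block sizes, and signature are hypothetical.

```c
#include <assert.h>

enum { MR = 4, NR = 4 };  /* hypothetical register block sizes */

/* Computes C(0:m,0:n) = beta*C + alpha*A*B, where A is a packed MR x k
   micropanel and B is a packed k x NR micropanel. The full MR x NR product
   is formed in a local tile; only the leading m x n part is written to C. */
void gemm_ukr(int m, int n, int k, double alpha,
              const double *a, const double *b,
              double beta, double *c, int rs_c, int cs_c)
{
    assert(0 <= m && m <= MR && 0 <= n && n <= NR && k >= 0);
    double ab[MR * NR] = { 0.0 };

    /* Full-tile rank-1 updates, one per column of A and row of B. */
    for (int p = 0; p < k; ++p)
        for (int j = 0; j < NR; ++j)
            for (int i = 0; i < MR; ++i)
                ab[i + j * MR] += a[i + p * MR] * b[j + p * NR];

    /* Edge-case handling: touch only the m x n elements that exist. */
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            double *cij = &c[i * rs_c + j * cs_c];
            *cij = beta * (*cij) + alpha * ab[i + j * MR];
        }
}
```

Passing m and n to the microkernel lets the caller drop its own temporary-tile logic for partial blocks, which is the API change the commit describes.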
Devin Matthews
2506159346 Don't use -Wl,-flat-namespace.
Flat namespaces can cause problems due to conflicting system libraries,
etc., so just mark `xerbla_` as a weak symbol on macOS instead.
2022-02-13 20:11:55 -06:00
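Note on 2506159346: a weak definition lets an application supply its own error handler without a flat namespace. The sketch below uses the GCC/Clang weak attribute on a hypothetical function name (chosen so that it cannot clash with a real `xerbla_`); the exact macro or attribute used in BLIS is not shown in this log.

```c
#include <assert.h>
#include <stdio.h>

int default_handler_calls = 0;

/* Weak default definition (GCC/Clang attribute). If the application links
   its own strong example_xerbla(), the linker picks that one instead. */
__attribute__((weak)) void example_xerbla(const char *routine, int info)
{
    assert(routine != NULL);
    ++default_handler_calls;
    fprintf(stderr, "parameter %d of %s had an illegal value\n", info, routine);
}
```

When no other definition is linked in, calls resolve to this default.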
Devin Matthews
5a4d3f5208 Use -flat_namespace option to link on macOS
Fixes #611.
2022-02-13 17:28:30 -06:00
Devin Matthews
26742910a0 Update CC_VENDOR logic
Look for `GCC` in addition to `gcc` to handle weird conda version strings. [ci skip]
2022-02-13 16:53:45 -06:00
RuQing Xu
2f3872e01d ArmSVE Adopts Label Wrapper
For clang (& armclang?) compilation.

Hopefully solves #609.
2022-02-07 09:54:11 -06:00
RuQing Xu
72089bb291 ArmSVE Use Predicate in M-Direction
No need to query MR during kernel runtime.
2022-02-07 09:54:11 -06:00
Ruqing Xu
9cc897f374 Fix SVE Compil.
2022-02-07 09:54:11 -06:00
RuQing Xu
b5df1811f1 Armv8a, ArmSVE: Simplify Gen-C
2022-02-07 09:54:11 -06:00
Devin Matthews
35195bb5ce Add armclang detection to configure.
armclang is treated as regular clang. Fixes #606. [ci skip]
2022-01-31 10:29:50 -06:00