Commit Graph

44 Commits

Author SHA1 Message Date
Devin Matthews
138d403b6b Use -funsafe-math-optimizations and -ffp-contract=fast for all reference kernels when using gcc or clang. (#331) 2019-08-26 18:11:27 -05:00
Field G. Van Zee
e8c6281f13 Add -march support for specific gcc version ranges.
Details:
- Added logic to configure that checks the version of the compiler
  against known version ranges that could cause problems later in the
  build process. For example, versions of gcc older than 4.9.0 use
  different -march labels than version 4.9.0 or later
  ('-march=corei7-avx' vs '-march=sandybridge', respectively).
  Similarly, before 6.1, compilation on Zen was possible, but you
  need to start with -march=bdver4 and then disable instruction sets
  that were discarded during the transition from Excavator to Zen. So
  now, configure substitutes 'yes'/'no' values into anchors in
  config.mk.in, which sets various make variables (e.g. GCC_OT_4_9_0),
  which can be accessed and branched upon by the various
  configurations' make_defs.mk files when setting their compiler flags.
- Updated config/haswell/make_defs.mk to branch on GCC_OT_4_9_0.
- Updated config/sandybridge/make_defs.mk to branch on GCC_OT_4_9_0.
- Updated config/zen/make_defs.mk to branch on GCC_OT_6_1_0.
2019-08-21 12:38:53 -05:00
Field G. Van Zee
deda4ca8a0 Added test/1m4m driver directory.
Details:
- Added a new standalone test driver directory named '1m4m' that can
  build and run performance experiments for BLIS 1m, 4m1a, assembly,
  OpenBLAS, and the vendor library (MKL). This new driver directory
  was used to regenerate performance results for the 1m paper.
- Added alternate (commented-out) cache blocksizes to
  config/haswell/bli_cntx_init_haswell.c. These blocksizes tend to
  work well on an a 12-core Intel Xeon E5-2650 v3.
2019-07-22 13:59:05 -05:00
Field G. Van Zee
af17bca26a Updated haswell MC cache blocksizes.
Details:
- Updated the default MC cache blocksizes used by the haswell subconfig
  for both row-preferential (the default) and column-preferential
  microkernels.
2019-07-19 14:46:23 -05:00
Field G. Van Zee
b5e9bce4dd Updated -march flags for sandybridge, haswell.
Details:
- Updated the '-march=corei7-avx' flag in the sandybridge subconfig
  to '-march=sandybridge' and the '-march=core-avx2' flag in the
  haswell subconfig to '-march=haswell'. The older flags were used
  by older versions of gcc and should have been updated to the newer
  forms a long time ago. (The older flags were clearly working, even
  though they are no longer documented in the gcc man page.)
2019-07-19 14:42:37 -05:00
Field G. Van Zee
a4e8801d08 Increased MT sup threshold for double to 201.
Details:
- Fine-tuned the double-precision real MT threshold (which controls
  whether the sup implementation kicks for smaller m dimension values)
  from 180 to 201 for haswell and 180 to 256 for zen.
- Updated octave scripts in test/sup/octave to include a seventh column
  to display performance for m = n = k.
2019-05-31 17:30:51 -05:00
Field G. Van Zee
cb788ffc89 Increased MT sup threshold for double to 180.
Details:
- Increased the double-precision real MT threshold (which controls
  whether the sup implementation kicks for smaller m dimension values)
  from 80 to 180, and this change was made for both haswell and zen
  subconfigurations. This is less about the m dimension in particular
  and more about facilitating a smoother performance transition when
  m = n = k.
2019-05-23 13:00:53 -05:00
Field G. Van Zee
ecbdd1c42d Ceased use of BLIS_ENABLE_SUP_MR/NR_EXT macros.
Details:
- Removed already limited use of the BLIS_ENABLE_SUP_MR_EXT and
  BLIS_ENABLE_SUP_NR_EXT macros in bli_gemmsup_ref_var1n() and
  bli_gemmsup_ref_var2m(). Their purpose was merely to avoid a long
  conditional that would determine whether to allow the last iteration
  to be merged with the second-to-last iteration. Functionally, the
  macros were not needed, and they ended up causing problems when
  building configuration families such as intel64 and x86_64.
2019-04-27 19:38:11 -05:00
Field G. Van Zee
b9c9f03502 Implemented gemm on skinny/unpacked matrices.
Details:
- Implemented a new sub-framework within BLIS to support the management
  of code and kernels that specifically target matrix problems for which
  at least one dimension is deemed to be small, which can result in long
  and skinny matrix operands that are ill-suited for the conventional
  level-3 implementations in BLIS. The new framework tackles the problem
  in two ways. First the stripped-down algorithmic loops forgo the
  packing that is famously performed in the classic code path. That is,
  the computation is performed by a new family of kernels tailored
  specifically for operating on the source matrices as-is (unpacked).
  Second, these new kernels will typically (and in the case of haswell
  and zen, do in fact) include separate assembly sub-kernels for
  handling of edge cases, which helps smooth performance when performing
  problems whose m and n dimension are not naturally multiples of the
  register blocksizes. In a reference to the sub-framework's purpose of
  supporting skinny/unpacked level-3 operations, the "sup" operation
  suffix (e.g. gemmsup) is typically used to denote a separate namespace
  for related code and kernels. NOTE: Since the sup framework does not
  perform any packing, it targets row- and column-stored matrices A, B,
  and C. For now, if any matrix has non-unit strides in both dimensions,
  the problem is computed by the conventional implementation.
- Implemented the default sup handler as a front-end to two variants.
  bli_gemmsup_ref_var2() provides a block-panel variant (in which the
  2nd loop around the microkernel iterates over n and the 1st loop
  iterates over m), while bli_gemmsup_ref_var1() provides a panel-block
  variant (2nd loop over m and 1st loop over n). However, these variants
  are not used by default and provided for reference only. Instead, the
  default sup handler calls _var2m() and _var1n(), which are similar
  to _var2() and _var1(), respectively, except that they defer to the
  sup kernel itself to iterate over the m and n dimension, respectively.
  In other words, these variants rely not on microkernels, but on
  so-called "millikernels" that iterate along m and k, or n and k.
  The benefit of using millikernels is a reduction of function call
  and related (local integer typecast) overhead as well as the ability
  for the kernel to know which micropanel (A or B) will change during
  the next iteration of the 1st loop, which allows it to focus its
  prefetching on that micropanel. (In _var2m()'s millikernel, the upanel
  of A changes while the same upanel of B is reused. In _var1n()'s, the
  upanel of B changes while the upanel of A is reused.)
- Added a new configure option, --[en|dis]able-sup-handling, which is
  enabled by default. However, the default thresholds at which the
  default sup handler is activated are set to zero for each of the m, n,
  and k dimensions, which effectively disables the implementation. (The
  default sup handler only accepts the problem if at least one dimension
  is smaller than or equal to its corresponding threshold. If all
  dimensions are larger than their thresholds, the problem is rejected
  by the sup front-end and control is passed back to the conventional
  implementation, which proceeds normally.)
- Added support to the cntx_t structure to track new fields related to
  the sup framework, most notably:
  - sup thresholds: the thresholds at which the sup handler is called.
  - sup handlers: the address of the function to call to implement
    the level-3 skinny/unpacked matrix implementation.
  - sup blocksizes: the register and cache blocksizes used by the sup
    implementation (which may be the same or different from those used
    by the conventional packm-based approach).
  - sup kernels: the kernels that the handler will use in implementing
    the sup functionality.
  - sup kernel prefs: the IO preference of the sup kernels, which may
    differ from the preferences of the conventional gemm microkernels'
    IO preferences.
- Added a bool_t to the rntm_t structure that indicates whether sup
  handling should be enabled/disabled. This allows per-call control
  of whether the sup implementation is used, which is useful for test
  drivers that wish to switch between the conventional and sup codes
  without having to link to different copies of BLIS. The corresponding
  accessor functions for this new bool_t are defined in bli_rntm.h.
- Implemented several row-preferential gemmsup kernels in a new
  directory, kernels/haswell/3/sup. These kernels include two general
  implementation types--'rd' and 'rv'--for the 6x8 base shape, with
  two specialized millikernels that embed the 1st loop within the kernel
  itself.
- Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference
  gemmsup microkernels. NOTE: These microkernels, unlike the current
  crop of conventional (pack-based) microkernels, do not use constant
  loop bounds. Additionally, their inner loop iterates over the k
  dimension.
- Defined new typedef enums:
  - stor3_t: captures the effective storage combination of the level-3
    problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A
    special value of BLIS_XXX is used to denote an arbitrary combination
    which, in practice, means that at least one of the operands is
    stored according to general stride.
  - threshid_t: captures each of the three dimension thresholds.
- Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create()
  can be passed "-1, -1" as a lazy request for row storage. (Note that
  "0, 0" is still accepted as a lazy request for column storage.)
- Added support for various instructions to bli_x86_asm_macros.h,
  including imul, vhaddps/pd, and other instructions related to integer
  vectors.
- Disabled the older small matrix handling code inserted by AMD in
  bli_gemm_front.c, since the sup framework introduced in this commit
  is intended to provide a more generalized solution.
- Added test/sup directory, which contains standalone performance test
  drivers, a Makefile, a runme.sh script, and an 'octave' directory
  containing scripts compatible with GNU Octave. (They also may work
  with matlab, but if not, they are probably close to working.)
- Reinterpret the storage combination string (sc_str) in the various
  level-3 testsuite modules (e.g. src/test_gemm.c) so that the order
  of each matrix storage char is "cab" rather than "abc".
- Comment updates in level-3 BLAS API wrappers in frame/compat.
2019-04-27 18:44:50 -05:00
Field G. Van Zee
70f12f209b Changed unsafe-loop to unsafe-math optimizations.
Details:
- Changed -funsafe-loop-optimizations (re-)introduced in 7690855 for
  make_defs.mk files' CRVECFLAGS to -funsafe-math-optimizations (to
  account for a miscommunication in issue #300). Thanks to Dave Love
  for this suggestion and Jeff Hammond for his feedback on the topic.
2019-02-20 16:10:10 -06:00
Field G. Van Zee
7690855c51 Restored -funsafe-loop-optimizations to subconfigs.
Details:
- Restored use of -funsafe-loop-optimizations in the definitions of
  CRVECFLAGS (when using gcc), but only for sub-configurations (and
  not configuration families such as amd64, intel64, and x86_64).
  This more or less reverts 5190d05 and 6cf1550.
2019-02-18 19:16:01 -06:00
Field G. Van Zee
6cf1550491 Removed -funsafe-loop-optimizations from all configs.
Details:
- Error persists. Removed -funsafe-loop-optimizations from all remaining
  sub-configurations.
2019-02-18 17:29:51 -06:00
Field G. Van Zee
6a014a3377 Standardized optimization flags in make_defs.mk.
Details:
- Per Dave Love's recommendation in issue #300, this commit defines
    COPTFLAGS := -03
  and
    CRVECFLAGS := $(CKVECFLAGS) -funsafe-loop-optimizations
  in the make_defs.mk for all Intel- and AMD-based configurations.
2019-02-18 14:52:29 -06:00
Field G. Van Zee
0645f239fb Remove UT-Austin from copyright headers' clause 3.
Details:
- Removed explicit reference to The University of Texas at Austin in the
  third clause of the license comment blocks of all relevant files and
  replaced it with a more all-encompassing "copyright holder(s)".
- Removed duplicate words ("derived") from a few kernels' license
  comment blocks.
- Homogenized license comment block in kernels/zen/3/bli_gemm_small.c
  with format of all other comment blocks.
2018-12-04 14:31:06 -06:00
Field G. Van Zee
3c52725693 Renamed/moved l3 zen ukernels to haswell kernel set.
Details:
- Renamed the microkernels in kernels/zen/3 to kernels/haswell/3 and
  then updated the file contents to use the 'haswell' infix.
- Updated bli_cntx_init_zen.c and bli_cntx_init_haswell.c according to
  above function renames.
- Moved/updated the corresponding prototypes in bli_kernels_zen.h to
  bli_kernels_haswell.h.
- Updated config_registry according to above changes.
- NOTE: This rename reflects the fact that haswell microkernels are
  specifically written to overcome the floating-point latency for FMA
  instructions on Intel Haswell-like architectures, which can issue two
  FMA instructions per cycle. These ukernels happen to work fine on AMD
  Zen-based architectures. However, Zen only issues one FMA per cycle,
  which, while halving its floating-point throughput, gives it extra
  flexibility in the design of its microkernels--namely, mr and nr can
  be smaller and still overcome the floating-point latency for those
  single-issue cores. A smaller value of mr and nr allows for a larger
  value of kc, which may be useful in some situations. In the future,
  we may write such Zen-specific microkernels to take advantage of this
  additional flexibility.
2018-10-17 14:56:22 -05:00
Field G. Van Zee
4fa4cb0734 Trivial comment header updates.
Details:
- Removed four trailing spaces after "BLIS" that occurs in most files'
  commented-out license headers.
- Added UT copyright lines to some files. (These files previously had
  only AMD copyright lines but were contributed to by both UT and AMD.)
- In some files' copyright lines, expanded 'The University of Texas' to
  'The University of Texas at Austin'.
- Fixed various typos/misspellings in some license headers.
2018-08-29 18:06:41 -05:00
Field G. Van Zee
d4a22702c7 Set up haswell config for optional col-pref ukrs.
Details:
- Added two presently-disabled cpp blocks in bli_cntx_init_haswell.c to
  easily allow one to switch to a set of column-preferential gemm
  microkernels (in the haswell subconfiguration). The second column-
  preferring block sets the the register blocksizes to their appropriate
  values. However, cache blocksizes are left unchanged, and therefore are
  likely suboptimal. This should be addressed later.
2018-06-19 14:54:57 -05:00
Field G. Van Zee
ad67dc4e34 Communicate cc, cc_vendor to make via config.mk.
Details:
- Historically, the compiler selection has happened statically in the
  various make_defs.mk and would only be overriden by setting CC (either
  prior to running configure or as a configure argument). However, in
  the last couple months, configure has evolved to contain rather
  sophisticated compiler detection logic for the purposes of blacklisting
  sub-configurations. It only makes sense that configure now fully take
  over the responsibility of selecting a compiler from the GNU make side
  of the build system. Thanks to Alex Arslan for his help exposing this
  issue.
- Substitute found_cc into CC in config.mk via configure.
- Set a new variable, CC_VENDOR, in config.mk via substitution from
  configure, and disable the corresponding CC_VENDOR code in common.mk.
- Disabled default compiler selection (usually gcc) in the sub-configs'
  various make_def.mk files.
2018-05-14 18:35:28 -05:00
Field G. Van Zee
45fbe66b3e Fixed libmemkind dependency for x86_64.
Details:
- Removed some old conditional code in config/knl/make_defs.mk that
  added -lmemkind to LDFLAGS if DEBUG_TYPE was not 'sde' and inserted
  code into common.mk that affirmatively filters out -lmemkind from
  LDFLAGS if DEBUG_TYPE is 'sde'. (Thanks to Dave Love for reporting
  this issue.) Other minor cleanups to neighboring code in common.mk.
- Updated CRVECFLAGS in knl/make_defs.mk to be based on -march=knl,
  and then AVX-512 functionality is manually removed via various
  -mno-avx512* flags. Also, make the setting of CRVECFLAGS conditional
  on CC_VENDOR. Similar change to skx/make_defs.mk.
- Comment/whitespace updates.
2018-04-09 14:01:08 -05:00
Field G. Van Zee
bd0276752c Track separate ref kernel flags for each sub-config.
Details:
- Renamed CVECFLAGS variables in sub-configurations' make_defs.mk files
  to CKVECFLAGS.
- Added default defintions of two new make variables to most sub-
  configurations' make_defs.mk files--CROPTFLAGS and CRVECFLAGS--
  which correspond to reference kernel analogues of the CKOPTFLAGS
  and CKVECFLAGS, which track optimization and vectorization flags for
  optimized kernels. Currently, two sub-configurations (knl and skx)
  explicitly set CRVECFLAGS to non-default values (using AVX2 instead of
  AVX-512 for reference kernels. Thanks to Jeff Hammond, whose feedback
  prompted me to make this change (issue #187).
- Changed common.mk so that the get-refkern-cflags-for function returns
  the flags associated with the given sub-configuration's CROPTFLAGS
  and CRVECFLAGS (instead of CKOPTFLAGS and CKVECFLAGS).
2018-04-06 18:51:43 -05:00
Field G. Van Zee
34862aed89 Use zen kernels in haswell sub-configuration.
Details:
- Register use of level-1v zen intrinsic kernels for amaxv, axpyv, dotv,
  dotxv, and scalv, as well asl level-1f zen intrinsic kernels for axpyf
  and dotxf. This works because these kernels simply target AVX/AVX2,
  and therefore work without modification on haswell hardware.
- Switch to use of zen microkernels in bli_cntx_init_haswell.c. The zen
  kernels are essentially identical to those used by haswell, except that
  now zen kernels are a bit more up-to-date. In the future, I may
  continue to maintain duplicates, or I may keep the kernels named after
  one architecture (zen or haswell) but used by both sub-configurations.
- In config_registry, enable use of both haswell and zen kernels for the
  haswell sub-configuration. This is necessary in order to make zen
  kernels visible when registering kernels in bli_cntx_init_haswell.c.
- Enable use of assembly-based complex gemm microkernels for zen,
  bli_cgemm_zen_asm_3x8() and bli_zgemm_zen_asm_3x4(), in
  bli_cntx_init_zen.c. This was actually intended for 1681333.
2018-02-28 15:30:14 -06:00
Field G. Van Zee
0b3ca3cfb6 Intelligently select compiler for auto-detection.
Details:
- Rewrote code that selects the compiler for the purposes of compiling
  the auto-detection executable. CC (if specified) is tried first. Then
  gcc. Then clang. The absolute fallback is cc. The previous code was
  sort of broken, and seemed to unintentionally always use gcc.
- Moved various configuration-agnostic flags from config/*/make_defs.mk
  files to common.mk. The new mechanism appends the configuration-
  agnostic flags to the various compiler flag variables initialized in
  make_defs.mk. Flags specific to the sub-configuration are still set
  in make_defs.mk.
- Added -Wno-tautological-compare to CMISCFLAGS when clang is in use.
  Also added the flag to the compiler instantiation during configure-
  time hardware detection (when clang is selected).
- Added some missing (but mostly-optional) quotes to configure script.
2018-01-04 20:51:35 -06:00
Field G. Van Zee
07c352188b Added "generic" configuration.
Details:
- Added a "generic" configuration that leaves the default blocksizes and
  kernels unchanged. This replaces the older "reference" configuration.
  Updated auto-detect script and code accordingly.
- Added support for generic configuration to arch_t (bli_type_defs.h),
  bli_gks_init() (bli_gks.c), and bli_arch_config.h
- Moved bli_arch_query_id() to bli_arch.c (and prototype to bli_arch.h).
- Whitespace changes to configurations' make_defs.mk files.
2017-10-23 16:59:22 -05:00
Field G. Van Zee
75b9383f01 Minor header renaming ahead of bli_arch.c.
Details:
- Renamed the various configurations' "bli_arch_<configname>.h" header files
  (replacing "arch" with "family") to free up the 'bli_arch' namespace for a
  different purpose (hardware detection).
- Renamed "bli_arch.h" and "bli_arch_pre_macro_defs.h" in frame/include to
  "bli_arch_config.h" and "bli_arch_config_pre.h", respectively.
2017-10-20 16:41:22 -05:00
Field G. Van Zee
453deb2906 Implemented runtime kernel management.
Details:
- Reworked the build system around a configuration registry file, named
  config_registry', that identifies valid configuration targets, their
  constituent sub-configurations, and the kernel sets that are needed by
  those sub-configurations. The build system now facilitates the building
  of a single library that can contains kernels and cache/register
  blocksizes for multiple configurations (microarchitectures). Reference
  kernels are also built on a per-configuration basis.
- Updated the Makefile to use new variables set by configure via the
  config.mk.in template, such as CONFIG_LIST, KERNEL_LIST, and KCONFIG_MAP,
  in determining which sub-configurations (CONFIG_LIST) and kernel sets
  (KERNEL_LIST) are included in the library, and which make_defs.mk files'
  CFLAGS (KCONFIG_MAP) are used when compiling kernels.
- Reorganized 'kernels' directory into a "flat" structure. Renamed kernel
  functions into a standard format that includes the kernel set name
  (e.g. 'haswell'). Created a "bli_kernels_<kernelset>.h" file in each
  kernels sub-directory. These files exist to provide prototypes for the
  kernels present in those directories.
- Reorganized reference kernels into a top-level 'ref_kernels' directory.
  This directory includes a new source file, bli_cntx_ref.c (compiled on
  a per-configuration basis), that defines the code needed to initialize
  a reference context and a context for induced methods for the
  microarchitecture in question.
- Rewrote make_defs.mk files in each configuration so that the compiler
  variables (e.g. CFLAGS) are "stored" (renamed) on a per-configuration
  basis.
- Modified bli_config.h.in template so that bli_config.h is generated with
  #defines for the config (family) name, the sub-configurations that are
  associated with the family, and the kernel sets needed by those
  sub-configurations.
- Deprecated all kernel-related information in bli_kernel.h and transferred
  what remains to new header files named "bli_arch_<configname>.h", which
  are conditionally #included from a new header bli_arch.h. These files
  are still needed to set library-wide parameters such as custom
  malloc()/free() functions or SIMD alignment values.
- Added bli_cntx_init_<configname>.c files to each configuration directory.
  The files contain a function, named the same as the file, that initializes
  a "native" context for a particular configuration (microarchitecture). The
  idea is that optimized kernels, if available, will be initialized into
  these contexts. Other fields will retain pointers to reference functions,
  which will be compiled on a per-configuration basis. These bli_cntx_init_*()
  functions will be called during the initialization of the global kernel
  structure. They are thought of as initializing for "native" execution, but
  they also form the basis for contexts that use induced methods. These
  functions are prototyped, along with their _ref() and _ind() brethren, by
  prototype-generating macros in bli_arch.h.
- Added a new typedef enum in bli_type_defs.h to define an arch_t, which
  identifies the various sub-configurations.
- Redesigned the global kernel structure (gks) around a 2D array of cntx_t
  structures (pointers to cntx_t, actually). The first dimension is indexed
  over arch_t and the inner dimension is the ind_t (induced method) for
  each microarchitecture. When a microarchitecture (configuration) is
  "registered" at init-time, the inner array for that configuration in the
  2D array is initialized (and allocated, if it hasn't been already). The
  cntx_t slot for BLIS_NAT is initialized immediately and those for other
  induced method types are initialized and cached on-demand, as needed. At
  cntx_t registration, we also store function pointers to cntx_init functions
  that will initialize (a) "reference" contexts and (b) contexts for use with
  induced methods. We don't cache the full contexts for reference contexts
  since they are rarely needed. The functions that initialize these two kinds
  of contexts are generated automatically for each targeted sub-configuration
  from cpp-templatized code at compile-time. Induced method contexts that
  need "stage" adjustments can still obtain them via functions in
  bli_cntx_ind_stage.c.
- Added new functions and functionality to bli_cntx.c, such as for setting
  the level-1f, level-1v, and packm kernels, and for converting a native
  context into one for executing an induced method.
- Moved the checking of register/cache blocksize consistency from being cpp
  macros in bli_kernel_macro_defs.h to being runtime checks defined in
  bli_check.c and called from bli_gks_register_cntx() at the time that the
  global kernel structure's internal context is initialized for a given
  microarchitecture/configuration.
- Deprecated all of the old per-operation bli_*_cntx.c files and removed
  the previous operation-level cntx_t_init()/_finalize() invocations.
  Instead, we now query the gks for a suitable context, usually via
  bli_gks_query_cntx().
- Deprecated support for the 3m2 and 3m3 induced methods. (They required
  hackery that I was no longer willing to support.)
- Consolidated the 1e and 1r packm kernels for any given register blocksize
  into a single kernel that will branch on the schema and support packing
  to both formats.
- Added the cntx_t* argument to all packm kernel signatures.
- Deprecated the local function pointer array in all bli_packm_cxk*.c files
  and instead obtain the packm kernel from the cntx_t.
- Added bli_calloc_intl(), which serves as the calloc-equivalent to to
  bli_malloc_intl(). Useful when we wish to allocate and initialize to
  zero/NULL.
- Converted existing cpp macro functions defined in bli_blksz.h, bli_func.h,
  bli_cntx.h into static functions.
2017-10-18 13:29:32 -05:00
Field G. Van Zee
1f1ec0db93 Updated ar option list used by all configurations.
Details:
- Dropped 'u' from the list of modifiers passed into the library archiver
  ar. Previously, "cru" was used, while now we employ only "cr". This
  change was prompted by a warning observed on Ubuntu 16.04:

    ar: `u' modifier ignored since `D' is the default (see `U')

  This caused me to realize that the default mode causes timestamps to be
  zero, and thus the 'u' option, which causes only changed object files to
  be inserted, is not applicable.
2017-07-19 15:40:48 -05:00
Field G. Van Zee
6e04f9df01 Restored deleted lines from makefile fragments. 2017-05-17 13:03:52 -05:00
Devin Matthews
555ddc30d4 Remove shebangs from makefiles. 2017-05-17 12:27:14 -05:00
J M Dieterich
5fa4e9439c A bunch of shebang fixes from unportable /bin/bash to portable /usr/bin/env bash 2017-05-16 21:50:49 -04:00
Field G. Van Zee
b88542591d Merge pull request #107 from jeffhammond/intel-compilers-no-use-libm
never use libm with Intel compilers
2017-05-02 19:22:41 -05:00
Field G. Van Zee
126482a3b6 Implemented the 1m method.
Details:
- Implemented the 1m method for inducing complex domain matrix
  multiplication. 1m support has been added to all level-3 operations,
  including trsm, and is now the default induced method when native
  complex domain gemm microkernels are omitted from the configuration.
- Updated _cntx_init() operations to take a datatype parameter. This was
  needed for the corresponding function for 1m (because 1m requires us
  to choose between column-oriented or row-oriented execution, which
  requires us to query the context for the storage preference of the
  gemm microkernel, which requires knowing the datatype) but I decided
  that it made sense for consistency to add the parameter to all other
  cntx initialization functions as well, even though those functions
  don't use the parameter.
- Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take
  a second scalar for each blocksize entry. The semantic meaning of the
  two scalars now is that the first will scale the default blocksize
  while the second will scale the maximum blocksize. This allows scaling
  the two independently, and was needed to support 1m, which requires
  scaling for a register blocksize but not the register storage
  blocksize (ie: "packdim") analogue.
- Deprecated bli_blksz_reduce_dt_to() and defined two new functions,
  bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing
  default and maximum blocksizes to some desired blocksize multiple.
  These functions are needed in the updated definitions of
  bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs().
- Added support for the 1e and 1r packing schemas to packm, including
  1e/1r packing kernels.
- Added a minor optimization to bli_gemm_ker_var2() that allows, under
  certain circumstances (specifically, real domain beta and row- or
  column-stored matrix C), the real domain macrokernel and microkernel
  to be called directly, rather than using the virtual microkernel
  via the complex domain macrokernel, which carries a slight additional
  amount of overhead.
- Added 1m support to the testsuite.
- Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified
  some code in test_gemm.c driver.
2016-11-25 18:29:49 -06:00
Field G. Van Zee
b3e58ee303 Reimplemented 4x12 haswell ukernels (real only).
Details:
- Replaced permutation-based implementations in bli_gemm_asm_d4x12.c, which
  defines 4x24 single real and 4x12 double real gemm microkernels, with
  broadcast-based implementations. (The previous microkernel file has been
  moved to an 'old' subdirectory.)
2016-11-23 17:58:26 -06:00
Field G. Van Zee
8a11a2174a Updates to non-default haswell microkernels.
Details:
- Updated s and d microkernels in bli_gemm_asm_d8x6.c to relax alignment
  constraints.
- Added missing c and z microkernels, which are based on the corresponding
  kernels in the d6x8 set.
- This completes the d8x6 set (which may be used for situations when it
  is desirable to have a microkernel with a column preference).
2016-10-31 19:07:55 -05:00
Jeff Hammond
c2c91e09b4 never use libm with Intel compilers
Intel compilers include a highly optimized math library (libimf) that
should be used instead of GNU libm.

yes, this change is for ALL targets, including those that are not
supported by the Intel compiler.  there is no harm in doing this, and it
is future-proof in the event that the Intel compilers support other
architectures.
2016-10-25 21:15:26 -07:00
Field G. Van Zee
121c39d455 Added complex gemm micro-kernels for haswell.
Details:
- Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based
  architectures. As with their real domain brethren, these kernels perfer
  row storage, (though this doesn't affect most users due to high-level
  optimizations in most level-3 operations that induce a transpose to
  whatever storage preference the kernel may have).
2016-09-05 13:11:42 -05:00
Field G. Van Zee
c3a4d39d03 Updates to haswell gemm micro-kernels.
Details:
- Added two new sets of [sd]gemm micro-kernels for haswell architectures,
  one that is 4x24/4x12 (s and d) and one that is 6x16/6x8.
- Changed the haswell configuration to use the 6x16/6x8 micro-kernels
  by default.
- Updated various Makefiles, in test, test/3m4m, and testsuite.
2016-05-04 17:22:56 -05:00
Devin Matthews
0e1a9821d8 Add configure options and generate bli_config.h automatically.
Options to configure have been added for:
- Setting the internal BLIS and BLAS/CBLAS integer sizes.
- Enabling and disabling the BLAS and CBLAS layers.

Additionally, configure options which require defining macros (the above plus the threading model), write their macros to the automatically-generated bli_config.h file in the top-level build directory. The old bli_config.h files in the config dirs were removed, and any kernel-related macros (SIMD size and alignment etc.) were moved to bli_kernel.h. The Makefiles were also modified to find the new bli_config.h file.

Lastly, support for OMP in clang has been added (closes #56).
2016-04-19 11:44:37 -05:00
Field G. Van Zee
537a1f4f85 Implemented runtime contexts and reorganized code.
Details:
- Retrofitted a new data structure, known as a context, into virtually
  all internal APIs for computational operations in BLIS. The structure
  is now present within the type-aware APIs, as well as many supporting
  utility functions that require information stored in the context. User-
  level object APIs were unaffected and continue to be "context-free,"
  however, these APIs were duplicated/mirrored so that "context-aware"
  APIs now also exist, differentiated with an "_ex" suffix (for "expert").
  These new context-aware object APIs (along with the lower-level, type-
  aware, BLAS-like APIs) contain the the address of a context as a last
  parameter, after all other operands. Contexts, or specifically, cntx_t
  object pointers, are passed all the way down the function stack into
  the kernels and allow the code at any level to query information about
  the runtime, such as kernel addresses and blocksizes, in a thread-
  friendly manner--that is, one that allows thread-safety, even if the
  original source of the information stored in the context changes at
  run-time; see next bullet for more on this "original source" of info).
  (Special thanks go to Lee Killough for suggesting the use of this kind
  of data structure in discussions that transpired during the early
  planning stages of BLIS, and also for suggesting such a perfectly
  appropriate name.)
- Added a new API, in frame/base/bli_gks.c, to define a "global kernel
  structure" (gks). This data structure and API will allow the caller to
  initialize a context with the kernel addresses, blocksizes, and other
  information associated with the currently active kernel configuration.
  The currently active kernel configuration within the gks cannot be
  changed (for now), and is initialized with the traditional cpp macros
  that define kernel function names, blocksizes, and the like. However,
  in the future, the gks API will be expanded to allow runtime management
  of kernels and runtime parameters. The most obvious application of this
  new infrastructure is the runtime detection of hardware (and the
  implied selection of appropriate kernels). With contexts in place,
  kernels may even be "hot swapped" at runtime within the gks. Once
  execution enters a level-3 _front() function, the memory allocator will
  be reinitialized on-the-fly, if necessary, to accommodate the new
  kernels' blocksizes. If another application thread is executing with
  another (previously loaded) kernel, it will finish in a deterministic
  fashion because its kernel information was loaded into its context
  before computation began, and also because the blocks it checked out
  from the internal memory pools will be unaffected by the newer threads'
  reinitialization of the allocator.
- Reorganized and streamlined the 'ind' directory, which contains much of
  the code enabling use of induced methods for complex domain matrix
  multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as
  those APIs' functionality is now mostly subsumed within the global
  kernel structure.
- Updated bli_pool.c to define a new function, bli_pool_reinit_if(),
  that will reinitialize a memory pool if the necessary pool block size
  has increased.
- Updated bli_mem.c to use bli_pool_reinit_if() instead of
  bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed
  usage of contexts where appropriate to communicate cache and register
  blocksizes to bli_mem_compute_pool_block_sizes().
- Simplified control trees now that much of the information resides in
  the context and/or the global kernel structure:
  - Removed blocksize object pointers (blksz_t*) fields from all control
    tree node definitions and replaced them with blocksize id (bszid_t)
    values instead, which may be passed into a context query routine in
    order to extract the corresponding blocksize from the given context.
  - Removed micro-kernel function pointers (func_t*) fields from all
    control tree node definitions. Now, any code that needs these function
    pointers can query them from the local context, as identified by a
    level-3 micro-kernel id (l3ukr_t), level-1f kernel id, (l1fkr_t), or
    level-1v kernel id (l1vkr_t).
  - Removed blksz_t object creation and initialization, as well as kernel
    function object creation and initialization, from all operation-
    specific control tree initialization files (bli_*_cntl.c), since this
    information will now live in the gks and, secondarily, in the context.
- Removed blocksize multiples from blksz_t objects. Now, we track
  blocksize multiples for each blocksize id (bszid_t) in the context
  object.
- Removed the bool_t's that were required when a func_t was initialized.
  These bools are meant to allow one to track the micro-kernel's storage
  preferences (by rows or columns). This preference is now tracked
  separately within the gks and contexts.
- Merged and reorganized many separate-but-related functions into single
  files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and
  util directories, but has the most obvious effect of allowing BLIS
  to compile noticeably faster.
- Reorganized execution paths for level-1v, -1d, -1m, and -2 operations
  in an attempt to reduce overhead for memory-bound operations. This
  includes removal of default use of object-based variants for level-2
  operations. Now, by default, level-2 operations will directly call a
  low-level (non-object based) loop over a level-1v or -1f kernel.
- Converted many common query functions in blk_blksz.c (renamed from
  bli_blocksize.c) and bli_func.c into cpp macros, now defined in their
  respective header files.
- Defined bli_mbool.c API to create and query "multi-bools", or
  heterogeneous bool_t's (one for each floating-point datatype), in the
  same spirit as blksz_t and func_t.
- Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS
  and BLIS_SIMD_SIZE. These values are needed in order to compute a third
  new parameter, which may be set indirectly via the aforementioned
  macros or directly: BLIS_STACK_BUF_MAX_SIZE. This value is used to
  statically allocate memory in macro-kernels and the induced methods'
  virtual kernels to be used as temporary space to hold a single
  micro-tile. These values are now output by the testsuite. The default
  value of BLIS_STACK_BUF_MAX_SIZE is computed as
  "2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE".
- Cleaned up top-level 'kernels' directory (for example, renaming the
  embarrassingly misleading "avx" and "avx2" directories to "sandybridge"
  and "haswell," respectively, and gave more consistent and meaningful
  names to many kernel files (as well as updating their interfaces to
  conform to the new context-aware kernel APIs).
- Updated the testsuite to query blocksizes from a locally-initialized
  context for test modules that need those values: axpyf, dotxf,
  dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr.
- Reformatted many function signatures into a standard format that will
  more easily facilitate future API-wide changes.
- Updated many "mxn" level-0 macros (ie: those used to inline double loops
  for level-1m-like operations on small matrices) in frame/include/level0
  to use more obscure local variable names in an effort to avoid variable
  shaddowing. (Thanks to Devin Matthews for pointing these gcc warnings,
  which are only output using -Wshadow.)
- Added a conj argument to setm, so that its interface now mirrors that
  of scalm. The semantic meaning of the conj argument is to optionally
  allow implicit conjugation of the scalar prior to being populated into
  the object.
- Deprecated all type-aware mixed domain and mixed precision APIs. Note
  that this does not preclude supporting mixed types via the object APIs,
  where it produces absolutely zero API code bloat.
2016-04-11 17:21:28 -05:00
Devin Matthews
0171ad5899 Add icc and clang support for Intel architectures, fixes #47. 2bd036f fixes #49 BTW. 2016-03-28 13:55:06 -05:00
Devin Matthews
8442d65c9e Replace -march=native with specific architecture flags to support cross-compiling, and add icc support for Intel architectures. 2016-03-25 20:06:48 -05:00
Devin Matthews
76099f20be Add threading option to configure. 2016-03-25 17:22:58 -05:00
Devin Matthews
2bd036f1f9 Fix configuration issue where instruction set flags are not specified for debug builds. 2016-03-25 12:16:49 -05:00
Devin Matthews
d226dfa051 Add several changes to the build system.
1) Add -- options.
2) Add -d/--enable-debug option to enable debugging symbols with and without optimization.
3) Allow user to specify CC at configure time, and determine vendor (gcc/icc/etc.). For now configurations enforce a particular vendor.
4) Add make V=[0,1] option to control build verbosity.
2016-03-05 16:18:14 -06:00
Field G. Van Zee
fdfe14f1e1 Added support for Intel Haswell/Broadwell.
Details:
- Added sgemm and dgemm micro-kernels, which employ 256-bit AVX vectors
  and FMA instructions. (Complex support is currently provided by default
  induced method, 4m1a.)
- Added a 'haswell' configuration, which uses the aforementioned kernels.
- Inserted auto-detection support for haswell configuration in
  build/auto-detect/cpuid_x86.c.
- Modified configure script to explicitly echo when automatic or manual
  configuration is in progress.
- Changed beta scalar in test_gemm.c module of test suite to -1.0 to 0.9.
2015-07-09 13:52:39 -05:00