Commit Graph

1745 Commits

Author SHA1 Message Date
Kiran Varaganti
d605a19e42 config_registry: New AMD zen2 architecture configuration added.
frame/base/bli_arch.c: #ifdef BLIS_FAMILY_ZEN2 id = BLIS_ARCH_ZEN2; #endif added. zen2 is added in config_name[BLIS_NUM_ARCHS]
  frame/base/bli_cpuid.c : #ifdef BLIS_CONFIG_ZEN2 if ( bli_cpuid_is_zen2( family, model, features ) ) return BLIS_ARCH_ZEN2; #endif, defined new function bool bli_cpuid_is_zen2(...).
  frame/base/bli_cpuid.h : declared bli_cpuid_is_zen2(..).
  frame/base/bli_gks.c : #ifdef BLIS_CONFIG_ZEN2 bli_gks_register_cntx(BLIS_ARCH_ZEN2, bli_cntx_init_zen2, bli_cntx_init_zen2_ref, bli_cntx_init_zen2_ind); #endif
  frame/include/bli_arch_config.h : #ifdef BLIS_CONFIG_ZEN2 CNTX_INIT_PROTS(zen2) #endif #ifdef BLIS_FAMILY_ZEN2 #include "bli_family_zen2.h" #endif
  frame/include/bli_type_defs.h : added BLIS_ARCH_ZEN2 in arch_t enum. BLIS_NUM_ARCHS 20

Change-Id: I2a2d9b7266673e78a4f8543b1bfb5425b0aa7866
2019-08-23 14:18:55 +05:30
Kiran Varaganti
34c2c22ae8 Disabled BLIS_ENABLE_ZEN_BLOCK_SIZES in bli_family_zen.h for ROME tuning
Change-Id: Iec47fcf51f4d4396afef1ce3958e58cf02c59a57
2019-08-23 14:18:09 +05:30
Kiran Varaganti
016acd387c Merged BLIS Release 1.3
Modified config/zen/make_defs.mk, now CKVECFLAGS     := -mavx2 -mfpmath=sse -mfma -march=znver1

Change-Id: Ia0942d285a21447cd0c470de1bc021fe63e80d81
2019-08-23 14:18:09 +05:30
sraut
d6bb56d088 Fixed BLAS test failures of small matrix SYRK for single and double precision.
Details:
- SYRK for small matrix was implemented by reusing small GEMM routine. This was
  resulting in output written to the full C matrix, and C being symmetric the
  lower and upper triangles of C matrix contained same results. BLAS SYRK API
  spec demands either lower or upper triangle of C matrix to be written with
  results. So, this was resulting in BLAS test failures, even though testsuite
  of BLIS was passing small SYRK operation.
- To fix BLAS test failures of small matrix SYRK, separate kernel routines are
  implemented for small SYRK for both single and double precision. The newly
  added small SYRK routines are in file kernels/zen/3/bli_syrk_small.c.
  Now the intermediate results of matrix C are written to a scratch buffer.
  Final results are written from scratch buffer to matrix C using SIMD
  copy to either lower or upper traingle part of matrix C.
- Source and header files frame/3/syrk/bli_syrk_front.c and
  frame/3/syrk/bli_syrk_front.h are changed to invoke new small SYRK routines.

Change-Id: I9cfb1116c93d150aefac673fca033952ecac97cb
2019-08-23 14:18:09 +05:30
sraut
2752b51c37 Fix on EPYC machine for multi instance performance issue,
Issue: For the default values of mc, kc and nc with multi instance mode the performance across the cores dip drastically.
Fix: After experimentation found different set of values (mc, kc and nc) which fits in the cache size, and performance across the remains same across all the cores.

Change-Id: I98265e3b7e61cd7602a0cc5596240e86c08c03fe
2019-08-23 14:18:09 +05:30
pradeeptrgit
1720efe630 Update version number to 1.2
Change-Id: Ibb31f6683cdecca6b218bc2f0c14701d7e92ebf3
2019-08-23 14:18:09 +05:30
Kiran V
d805fdf169 This is a fix to floating-point exception error for BLIS SGEMM with larger matrix sizes.
BUG No: CPUPL-197 fixed by Thangaraj Santanu
The bli_clock_min_diff() function in BLIS assumed that if the time taken is greater than 1 hour then the reading must be wrong. However this is not the case in general, while the other checks such as time taken closer to zero or nsec is ofcourse valid.
gerrit review: http://git.amd.com:8080/#/c/118694/1/frame/base/bli_clock.c

Change-Id: I9dc313d7c5fdc20684f67a516bf3237de3e0694a
2019-08-23 14:18:09 +05:30
sraut
73ddc58df0 Small TRSM optimization changes :- 1) single precision small trsm kernels for XAt=B case are further optimized for performance. 2) double precision small trsm kernels for AX=B and XAtB cases are implemented. 3) single precision small trsm kernels for AutX=B are implemented in intrinsics to improve the current performance.
Change-Id: Ic9d67ae6d8522615257dde018903f049dcffa2cf
2019-08-23 14:18:09 +05:30
sraut
bc9dbce512 AMD Copyright information changed to 2018
Change-Id: Idfd11afd5d252f8063d0158680d24bf7e2854469
2019-08-23 14:18:09 +05:30
sraut
d56ca14589 small matrix trsm intrinsics optimization code for AX=B and XA'=B
Change-Id: I90123c4d9adbd314c867995cd19dc975150b448c
2019-08-23 14:18:09 +05:30
Nisanth M P
ca6f5b762d Re-enabling the small matrix gemm optimization for target zen
Change-Id: I13872784586984634d728cd99a00f71c3f904395
2019-08-23 14:18:09 +05:30
Nisanth M P
d9c0b8b4aa Re-enabling Zen optimized cache block sizes for config target zen
Change-Id: I8191421b876755b31590323c66156d4a814575f1
2019-08-23 14:18:09 +05:30
Field G. Van Zee
f85d3363ec CHANGELOG update (0.3.0)
Change-Id: Id038b00a62de51c9818ad249651ec5dc662f4415
2019-08-23 14:18:09 +05:30
Field G. Van Zee
9034c885fc Added "Education and Learning" ToC entry to README. 2019-08-23 14:18:09 +05:30
Field G. Van Zee
99c7d15f1e Added "Education and Learning" section to README.
Details:
- Added a short section after the Intro of the README.md file titled
  "Education and Learning" that directs interested readers to the
  "LAFF-On Programming for High-Performance" massive open online course
  (MOOC) hosted via edX.
2019-08-23 14:18:09 +05:30
Field G. Van Zee
06c5a5c4a9 Added test/1m4m driver directory.
Details:
- Added a new standalone test driver directory named '1m4m' that can
  build and run performance experiments for BLIS 1m, 4m1a, assembly,
  OpenBLAS, and the vendor library (MKL). This new driver directory
  was used to regenerate performance results for the 1m paper.
- Added alternate (commented-out) cache blocksizes to
  config/haswell/bli_cntx_init_haswell.c. These blocksizes tend to
  work well on an a 12-core Intel Xeon E5-2650 v3.
2019-08-23 14:18:09 +05:30
Field G. Van Zee
d44c42dce7 Updated haswell MC cache blocksizes.
Details:
- Updated the default MC cache blocksizes used by the haswell subconfig
  for both row-preferential (the default) and column-preferential
  microkernels.
2019-08-23 14:18:09 +05:30
Field G. Van Zee
ba4a77177c Updated -march flags for sandybridge, haswell.
Details:
- Updated the '-march=corei7-avx' flag in the sandybridge subconfig
  to '-march=sandybridge' and the '-march=core-avx2' flag in the
  haswell subconfig to '-march=haswell'. The older flags were used
  by older versions of gcc and should have been updated to the newer
  forms a long time ago. (The older flags were clearly working, even
  though they are no longer documented in the gcc man page.)
2019-08-23 14:18:09 +05:30
Field G. Van Zee
0e3f0ce634 More updates to comments in testsuite modules.
Details:
- Updated most comments in testsuite modules that describe how the
  correctness test is performed so that it is clear whether the vector
  (normfv) or matrix (normfm) form of Frobenius norm is used.
2019-08-23 14:18:09 +05:30
Field G. Van Zee
b3974dafac New cntx_t blksz "set" functions + misc tweaks.
Details:
- Defined two new static functions in bli_cntx.h:
    bli_cntx_set_blksz_def_dt()
    bli_cntx_set_blksz_max_dt()
  which developers may find convenient when experimenting with different
  values of cache blocksizes.
- Updated one- and two-socket multithreaded problem size range and
  increment values in test/3/Makefile.
- Changed default to column storage in test/3/test_gemm.c.
- Fixed typo in comment in testsuite/src/test_subm.c.
2019-08-23 14:18:09 +05:30
Field G. Van Zee
7366bf25aa Fixed thrinfo_t printing bug for small problems.
Details:
- Fixed a bug in bli_l3_thrinfo_print_gemm_paths() and
  bli_l3_thrinfo_print_trsm_paths(), defined in bli_l3_thrinfo.c,
  whereby subnodes of the thrinfo_t tree are "dereferenced" near the
  beginning of the functions, which may lead to segfaults in certain
  situations where the thread tree was not fully formed because the
  matrix problem was too small for the level of parallelism specified.
  (That is, too small because some problems were assigned no work due
  to the smallest units in the m and n dimensions being defined by the
  register blocksizes mr and nr.) The fix requires several nested levels
  of if statements, and this is one of those few instances where use of
  goto statements results in (mostly) prettier code, especially in the
  case of _gemm_paths(). And while it wasn't necessary, I ported this
  goto usage to the loop body that prints the thrinfo_t work_id and
  comm_id values for each thread. Thanks to Nicholai Tukanov for helping
  to find this bug.
2019-08-23 14:18:09 +05:30
Field G. Van Zee
66c43ca427 Updated BLASFEO results in PerformanceSmall.md.
Details:
- Updated the BLASFEO performance graphs shown in PerformanceSmall.md
  using a new commit of BLASFEO (2c9f312); updated PerformanceSmall.md
  accordingly.
- Updated test/sup/octave/plot_l3sup_perf.m so that the .m files
  containing the mpnpkp results do not need to be preprocessed in order
  to plot half the problem size range (ie: up to 400 instead of the
  800 range of the other shape cases).
- Trivial updates to runme.m.
2019-08-23 14:18:09 +05:30
Field G. Van Zee
d192d47945 Trivial change to MixedDatatypes.md link text. 2019-08-23 14:18:09 +05:30
Field G. Van Zee
428921c2c5 Fixed typo in README.md's MixedDatatypes.md link. 2019-08-23 14:18:09 +05:30
Field G. Van Zee
6e9b1e0244 Adjust -fopenmp-simd for icc's preferred syntax.
Details:
- Use -qopenmp-simd instead of -fopenmp-simd when compiling with Intel
  icc. Recall that this option is used for SIMD auto-vectorization in
  reference kernels only. Support for the -f option has been completely
  deprecated and removed in newer versions of icc in favor of -q. Thanks
  to Victor Eijkhout for reporting this issue and suggesting the fix.
2019-08-23 14:18:09 +05:30
Field G. Van Zee
78adbe9846 Added missing #include "bli_family_thunderx2.h".
Details:
- Added a cpp-conditional directive block to bli_arch_config.h that
  #includes "bli_family_thunderx2.h". The code has been missing since
  adf5c17f. However, this never manifested as an error because the file
  is virtually empty and not needed for thunderx2 (or most subconfigs).
  Thanks to Jeff Diamond for helping to spot this.
2019-08-23 14:18:09 +05:30
Field G. Van Zee
2bf1ad11a7 Fixed formatting/typo in docs/PerformanceSmall.md. 2019-08-23 14:18:09 +05:30
Field G. Van Zee
df67302896 Tweaked language in README.md related to sup/AMD. 2019-08-23 14:18:09 +05:30
Field G. Van Zee
bb4a01f130 Added BLASFEO results to docs/PerformanceSmall.md.
Details:
- Updated the graphs linked in PerformanceSmall.md with BLASFEO results,
  and added documenting language accordingly.
- Updated scripts in test/sup/octave to plot BLASFEO data.
- Minor tweak to language re: how OpenBLAS was configured for
  docs/Performance.md.
2019-08-23 14:18:09 +05:30
Field G. Van Zee
ecfc223e62 Minor tweaks to test/sup.
Details:
- Changed starting problem and increment from 16 to 4.
- Added 'lll' (square problems) to list of problem size shapes to
  compile and run with.
- Define BLASFEO location and added BLASFEO-related definitions.
2019-08-23 14:18:08 +05:30
Field G. Van Zee
5052917179 CHANGELOG update (0.6.0) 2019-08-23 14:18:08 +05:30
Field G. Van Zee
2ae8faa252 ReleaseNotes.md update in advance of next version.
Details:
- Updated ReleaseNotes.md in preparation for next version.
- CREDITS file update.
2019-08-23 14:18:08 +05:30
Field G. Van Zee
d5903a8393 Minor edits to docs/PerformanceSmall.md.
Details:
- Added performance analysis to "Comments" section of both Kaby Lake and
  Epyc sections.
- Added emphasis to certain passages.
2019-08-23 14:18:08 +05:30
Field G. Van Zee
55e7b045c3 Added sup performance graphs/document to 'docs'.
Details:
- Added a new markdown document, docs/PerformanceSmall.md, which
  publishes new performance graphs for Kaby Lake and Epyc showcasing
  the new BLIS sup (small/skinny/unpacked) framework logic and kernels.
  For now, only single-threaded dgemm performance is shown.
- Reorganized graphs in docs/graphs into docs/graphs/large, with new
  graphs being placed in docs/graphs/sup.
- Updates to scripts in test/sup/octave, mostly to allow decent output
  in both GNU octave and Matlab.
- Updated README.md to mention and refer to the new PerformanceSmall.md
  document.
2019-08-23 14:18:08 +05:30
Field G. Van Zee
5e03ca6fc7 Increased MT sup threshold for double to 201.
Details:
- Fine-tuned the double-precision real MT threshold (which controls
  whether the sup implementation kicks for smaller m dimension values)
  from 180 to 201 for haswell and 180 to 256 for zen.
- Updated octave scripts in test/sup/octave to include a seventh column
  to display performance for m = n = k.
2019-08-23 14:18:08 +05:30
Field G. Van Zee
c90a1a8dca Inadvertantly hidden xerbla_() in blastest (#313).
Details:
- Attempted a fix to issue #313, which reports that when building only
  a shared library (ie: static library build is disabled), running the
  BLAS test drivers can fail because those drivers provide their own
  local version of xerbla_() as a clever (albeit still rather hackish)
  way of checking the error codes that result from the individual tests.
  This local xerbla_() function is never found at link-time because the
  BLAS test drivers' Makefile imports BLIS compilation flags via the
  get-user-cflags-for() function, which currently conveys the
  -fvisibility=hidden flag, which hides symbols unless they are
  explicitly annotated for export. The -fvisibility=hidden flag was
  only ever intended for use when building BLIS (not for applications),
  and so the attempted solution here is to omit the symbol export
  flag(s) from get-user-cflags-for() by storing the symbol export
  flag(s) to a new BULID_SYMFLAGS variable instead of appending it
  to the subconfigurations' CMISCFLAGS variable (which is returned by
  every get-*-cflags-for() function). Thanks to M. Zhou for reporting
  this issue and also to Isuru Fernando for suggesting the fix.
- Renamed BUILD_FLAGS to BUILD_CPPFLAGS to harmonize with the newly
  created BUILD_SYMFLAGS.
- Fixed typo in entry for --export-shared flag in 'configure --help'
  text.
2019-08-23 14:18:08 +05:30
Field G. Van Zee
defe789b8c Minor rewording of language around mt env. vars. 2019-08-23 14:18:08 +05:30
Field G. Van Zee
73970bf124 Added BLIS theading info to Performance.md.
Details:
- Documented the BLIS environment variables that were set
  (e.g. BLIS_JC_NT, BLIS_IC_NT, BLIS_JR_NT) for each machine and
  threading configuration in order to achieve the parallelism reported
  on in docs/Performance.md.
2019-08-23 14:18:08 +05:30
Field G. Van Zee
d488d7acb6 Increased MT sup threshold for double to 180.
Details:
- Increased the double-precision real MT threshold (which controls
  whether the sup implementation kicks for smaller m dimension values)
  from 80 to 180, and this change was made for both haswell and zen
  subconfigurations. This is less about the m dimension in particular
  and more about facilitating a smoother performance transition when
  m = n = k.
2019-08-23 14:18:08 +05:30
Field G. Van Zee
fb305d0837 Minor build system housekeeping.
Details:
- Commented out redundant setting of LIBBLIS_LINK within all driver-
  level Makefiles. This variable is already set within common.mk, and
  so the only time it should be overridden is if the user wants to link
  to a different copy of libblis.
- Very minor changes to build/gen-make-frags/gen-make-frag.sh.
- Whitespace and inconsequential quoting change to configure.
- Moved top-level 'windows' directory into a new 'attic' directory.
2019-08-23 14:18:08 +05:30
Jeff Hammond
cd8e74a69f add info about CXX in configure (#311) 2019-08-23 14:18:08 +05:30
Field G. Van Zee
e394c7459c Define _POSIX_C_SOURCE in bli_system.h.
Details:
- Added
    #ifndef _POSIX_C_SOURCE
    #define _POSIX_C_SOURCE 200809L
    #endif
  to bli_system.h so that an application that uses BLIS (specifically,
  an application that #includes blis.h) does not need to remember to
  #define the macro itself (either on the command line or in the code
  that includes blis.h) in order to activate things like the pthreads.
  Thanks to Christos Psarras for reporting this issue and suggesting
  this fix.
- Commented out #include <sys/time.h> in bli_system.h, since I don't
  think this header is used/needed anymore.
- Comment update to function macro for bli_?normiv_unb_var1() in
  frame/util/bli_util_unb_var1.c.
2019-08-23 14:18:08 +05:30
Field G. Van Zee
20391abca2 Minor bugfixes in sup dgemm implementation.
Details:
- Fixed an obscure but in the bli_dgemmsup_rv_haswell_asm_5x8n() kernel
  that only affected the beta == 0, column-storage output case. Thanks
  to the BLAS test drivers for catching this bug.
- Previously, bli_gemmsup_ref_var1n() and _var2m() were returning if
  k = 0, when the correct action would be to scale by beta (and then
  return). Thanks to the BLAS test drivers to catching this bug.
- Changed the sup threshold behavior such that the sup implementation
  only kicks in if a matrix dimension is strictly less than (rather than
  less than or equal to) the threshold in question.
- Initialize all thresholds to zero (instead of 10) by default in
  ref_kernels/bli_cntx_ref.c. This, combined with the above change to
  threshold testing means that calls to BLIS or BLAS with one or more
  matrix dimensions of zero will no longer trigger the sup
  implementation.
- Added disabled debugging output to frame/3/bli_l3_sup.c (for future
  use, perhaps).
2019-08-23 14:18:08 +05:30
Field G. Van Zee
5c4bb0cc24 Ceased use of BLIS_ENABLE_SUP_MR/NR_EXT macros.
Details:
- Removed already limited use of the BLIS_ENABLE_SUP_MR_EXT and
  BLIS_ENABLE_SUP_NR_EXT macros in bli_gemmsup_ref_var1n() and
  bli_gemmsup_ref_var2m(). Their purpose was merely to avoid a long
  conditional that would determine whether to allow the last iteration
  to be merged with the second-to-last iteration. Functionally, the
  macros were not needed, and they ended up causing problems when
  building configuration families such as intel64 and x86_64.
2019-08-23 14:18:08 +05:30
Field G. Van Zee
7d54fc5eb5 Fixed typo in --disable-sup-handling macro guard.
Details:
- Fixed an incorrectly-named macro guard that is intended to allow
  disabling of the sup framework via the configure option
  --disable-sup-handling. In this case, the preprocessor macro,
  BLIS_DISABLE_SUP_HANDLING, was still named by its name from an older
  uncommitted version of the code (BLIS_DISABLE_SM_HANDLING).
2019-08-23 14:18:08 +05:30
Field G. Van Zee
4f08619855 Implemented gemm on skinny/unpacked matrices.
Details:
- Implemented a new sub-framework within BLIS to support the management
  of code and kernels that specifically target matrix problems for which
  at least one dimension is deemed to be small, which can result in long
  and skinny matrix operands that are ill-suited for the conventional
  level-3 implementations in BLIS. The new framework tackles the problem
  in two ways. First the stripped-down algorithmic loops forgo the
  packing that is famously performed in the classic code path. That is,
  the computation is performed by a new family of kernels tailored
  specifically for operating on the source matrices as-is (unpacked).
  Second, these new kernels will typically (and in the case of haswell
  and zen, do in fact) include separate assembly sub-kernels for
  handling of edge cases, which helps smooth performance when performing
  problems whose m and n dimension are not naturally multiples of the
  register blocksizes. In a reference to the sub-framework's purpose of
  supporting skinny/unpacked level-3 operations, the "sup" operation
  suffix (e.g. gemmsup) is typically used to denote a separate namespace
  for related code and kernels. NOTE: Since the sup framework does not
  perform any packing, it targets row- and column-stored matrices A, B,
  and C. For now, if any matrix has non-unit strides in both dimensions,
  the problem is computed by the conventional implementation.
- Implemented the default sup handler as a front-end to two variants.
  bli_gemmsup_ref_var2() provides a block-panel variant (in which the
  2nd loop around the microkernel iterates over n and the 1st loop
  iterates over m), while bli_gemmsup_ref_var1() provides a panel-block
  variant (2nd loop over m and 1st loop over n). However, these variants
  are not used by default and provided for reference only. Instead, the
  default sup handler calls _var2m() and _var1n(), which are similar
  to _var2() and _var1(), respectively, except that they defer to the
  sup kernel itself to iterate over the m and n dimension, respectively.
  In other words, these variants rely not on microkernels, but on
  so-called "millikernels" that iterate along m and k, or n and k.
  The benefit of using millikernels is a reduction of function call
  and related (local integer typecast) overhead as well as the ability
  for the kernel to know which micropanel (A or B) will change during
  the next iteration of the 1st loop, which allows it to focus its
  prefetching on that micropanel. (In _var2m()'s millikernel, the upanel
  of A changes while the same upanel of B is reused. In _var1n()'s, the
  upanel of B changes while the upanel of A is reused.)
- Added a new configure option, --[en|dis]able-sup-handling, which is
  enabled by default. However, the default thresholds at which the
  default sup handler is activated are set to zero for each of the m, n,
  and k dimensions, which effectively disables the implementation. (The
  default sup handler only accepts the problem if at least one dimension
  is smaller than or equal to its corresponding threshold. If all
  dimensions are larger than their thresholds, the problem is rejected
  by the sup front-end and control is passed back to the conventional
  implementation, which proceeds normally.)
- Added support to the cntx_t structure to track new fields related to
  the sup framework, most notably:
  - sup thresholds: the thresholds at which the sup handler is called.
  - sup handlers: the address of the function to call to implement
    the level-3 skinny/unpacked matrix implementation.
  - sup blocksizes: the register and cache blocksizes used by the sup
    implementation (which may be the same or different from those used
    by the conventional packm-based approach).
  - sup kernels: the kernels that the handler will use in implementing
    the sup functionality.
  - sup kernel prefs: the IO preference of the sup kernels, which may
    differ from the preferences of the conventional gemm microkernels'
    IO preferences.
- Added a bool_t to the rntm_t structure that indicates whether sup
  handling should be enabled/disabled. This allows per-call control
  of whether the sup implementation is used, which is useful for test
  drivers that wish to switch between the conventional and sup codes
  without having to link to different copies of BLIS. The corresponding
  accessor functions for this new bool_t are defined in bli_rntm.h.
- Implemented several row-preferential gemmsup kernels in a new
  directory, kernels/haswell/3/sup. These kernels include two general
  implementation types--'rd' and 'rv'--for the 6x8 base shape, with
  two specialized millikernels that embed the 1st loop within the kernel
  itself.
- Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference
  gemmsup microkernels. NOTE: These microkernels, unlike the current
  crop of conventional (pack-based) microkernels, do not use constant
  loop bounds. Additionally, their inner loop iterates over the k
  dimension.
- Defined new typedef enums:
  - stor3_t: captures the effective storage combination of the level-3
    problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A
    special value of BLIS_XXX is used to denote an arbitrary combination
    which, in practice, means that at least one of the operands is
    stored according to general stride.
  - threshid_t: captures each of the three dimension thresholds.
- Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create()
  can be passed "-1, -1" as a lazy request for row storage. (Note that
  "0, 0" is still accepted as a lazy request for column storage.)
- Added support for various instructions to bli_x86_asm_macros.h,
  including imul, vhaddps/pd, and other instructions related to integer
  vectors.
- Disabled the older small matrix handling code inserted by AMD in
  bli_gemm_front.c, since the sup framework introduced in this commit
  is intended to provide a more generalized solution.
- Added test/sup directory, which contains standalone performance test
  drivers, a Makefile, a runme.sh script, and an 'octave' directory
  containing scripts compatible with GNU Octave. (They also may work
  with matlab, but if not, they are probably close to working.)
- Reinterpret the storage combination string (sc_str) in the various
  level-3 testsuite modules (e.g. src/test_gemm.c) so that the order
  of each matrix storage char is "cab" rather than "abc".
- Comment updates in level-3 BLAS API wrappers in frame/compat.
2019-08-23 14:18:08 +05:30
Isuru Fernando
e253bfcda4 make unix friendly archives on appveyor (#310) 2019-08-23 14:18:08 +05:30
Field G. Van Zee
f73cef483e Support row storage in Eigen gemm test/3 driver.
Details:
- Added preprocessor branches to test/3/test_gemm.c to explicitly
  support row-stored matrices. Column-stored matrices are also still
  supported (and is the default for now). (This is mainly residual work
  leftover from initial integration of Eigen into the test drivers, so
  if we ever want to test Eigen with row-stored matrices, the code will
  be ready to use, even if it is not yet integrated into the Makefile
  in test/3.)
2019-08-23 14:18:08 +05:30
Field G. Van Zee
f205eeab64 Applied forgotten variable rename from 89a70cc.
Details:
- Somehow the variable name change (root_file_name -> root_inputname)
  in flatten-headers.py mentioned in the commit log entry for 89a70cc
  didn't make it into the actual commit. This commit applies that
  change.
2019-08-23 14:18:08 +05:30
Field G. Van Zee
184ba1b3d5 GNU-like handling of installation prefix et al.
Details:
- Changed the default installation prefix from $HOME/lib to /usr/local.
- Modified the way configure internally handles the prefix, libdir,
  includedir, and sharedir (and also added an --exec-prefix option).
  The defaults to these variables are set as follows:
    prefix:      /usr/local
    exec_prefix: ${prefix}
    libdir:      ${exec_prefix}/lib
    includedir:  ${prefix}/include
    sharedir:    ${prefix}/share
  The key change, aside from the addition of exec_prefix and its use to
  define the default to libdir, is that the variables are substituted
  into config.mk with quoting that delays evaluation, meaning the
  substituted values may contain unevaluated references to other
  variables (namely, ${prefix} and ${exec_prefix}). This more closely
  follows GNU conventions, including those used by GNU autoconf, and
  also allows make to override any one of the variables *after*
  configure has already been run (e.g. during 'make install').
- Updates to build/config.mk.in pursuant to above changes.
- Updates to output of 'configure --help' pursuant to above changes.
- Updated docs/BuildSystem.md to reflect the new default installation
  prefix, as well as mention EXECPREFIX and SHAREDIR.
- Changed the definitions of the UNINSTALL_OLD_* variables in the
  top-level Makefile to use $(wildcard ...) instead of 'find'. This
  was motivated by the new way of handling prefix and friends, which
  leads to the 'find' command being run on /usr/local (by default),
  which can take a while almost never yielding any benefit (since the
  user will very rarely use the uninstall-old targets).
- Removed periods from the end of descriptive output statements (i.e.,
  non-verbose output) since those statements often end with file or
  directory paths, which get confusing to read when puctuated by a
  period.
- Trival change to 'make showconfig' output.
- Removed my name from 'configure --help'. (Many have contributed to it
  over the years.)
- In configure script, changed the default state of threading_model
  variable from 'no' to 'off' to match that of debug_type, where there
  are similarly more than two valid states. ('no' is still accepted
  if given via the --enable-debug= option, though it will be
  standardized to 'off' prior to config.mk being written out.)
- Minor variable name change in flatten-headers.py that was intended for
  32812ff.
- CREDITS file update.
2019-08-23 14:18:08 +05:30