Commit Graph

807 Commits

Author SHA1 Message Date
J M Dieterich
7541d46e2b Mark bulldozer compilable w/ clang. 2017-05-16 22:12:12 -04:00
J M Dieterich
91f897073e Correct error message. 2017-05-16 22:06:59 -04:00
J M Dieterich
f5131e1e49 Indeed one can compile for carrizo also using clang. 2017-05-16 22:03:23 -04:00
J M Dieterich
5fa4e9439c A bunch of shebang fixes from unportable /bin/bash to portable /usr/bin/env bash 2017-05-16 21:50:49 -04:00
Tyler Michael Smith
cbf8710a1b Merge pull request #127 from devinamatthews/fix_blis_nt_xx
Setting any one of BLIS_NT_[IJ][CR] overrides BLIS_NUM_THREADS
2017-05-08 11:21:20 -05:00
Field G. Van Zee
cf39d3ef3b Fixed a bug in norm1v, norm1m.
Details:
- Fixed a bug that manifested as improperly-computed 1-norm for vectors
  and matrices. This is one of the few operations in BLIS that does not
  have its own test module within the testsuite, hence why it went
  undetected for so long. The bad 1-norms were being used to normalize
  matrices in the testsuite after initialization, which led to some
  matrices containing a combination of "large" and "small" values. This
  tended to push the residuals computed after each test away from zero.
  In some cases, they were off *just* enough for the testsuite to label
  it a "failure". Many thanks to Jeff Hammond for reporting this bug.
  (Wonky details: the bug was due to improperly-defined level-0 scalar
  macros for abval2, an operation that computes the absolute square,
  or complex magnitude/modulus. Certain complex domain instances of
  abval2 were being incorrectly defined in terms of real-only solutions,
  leading to bad results. This level-0 operation forms the basis of
  norm1v/norm1m. absq2 was also affected, but almost nothing uses
  this operation.)
2017-05-05 15:06:56 -05:00
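The log does not show the faulty macro definitions, so the following is only a minimal sketch of the class of error described in cf39d3ef3b: applying a real-only formula to a complex value. The type `zsketch_t` and both function names are hypothetical and are not BLIS identifiers.

```c
/* Hypothetical stand-in for a double-precision complex type;
   not the BLIS dcomplex definition. */
typedef struct { double real; double imag; } zsketch_t;

/* Absolute square of a complex number: re^2 + im^2. */
double abs_square_complex(zsketch_t z)
{
    return z.real * z.real + z.imag * z.imag;
}

/* A real-only formula misapplied to a complex value: the imaginary
   part is silently ignored, so the result is too small whenever
   z.imag is nonzero. */
double abs_square_real_only(zsketch_t z)
{
    return z.real * z.real;
}
```

For z = 3 + 4i the first function gives 25 while the second gives 9, which is the kind of discrepancy that would skew any norm built on top of it.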
Devin Matthews
799485124f Merge pull request #121 from jeffhammond/not-real-knl
allow KNL build without hbwmalloc (i.e. emulated)
2017-05-04 10:52:09 -05:00
Devin Matthews
fdc66f12d4 Setting any one of BLIS_NT_[IJ][CR] overrides BLIS_NUM_THREADS. Missing BLIS_NT_XX's are defaulted to 1. Fixes #123. 2017-05-04 10:36:04 -05:00
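The override rule in fdc66f12d4 can be sketched as a small pure function. This is an illustration under assumptions, not the BLIS implementation: the struct, the function name, and the convention that a value of 0 or less means "not set" are all hypothetical, and reading the environment variables is left to the caller.

```c
#include <stdbool.h>

/* Hypothetical container for the four per-loop thread counts.
   A value <= 0 means "not set by the user". */
typedef struct { int jc, ic, jr, ir; } loop_nt_sketch_t;

/* Returns true when at least one per-loop value was set. In that case
   the global thread count is ignored and every unset loop gets 1.
   Returns false when nothing was set, leaving *out untouched so the
   caller can fall back to the global thread count. */
bool resolve_loop_threads(int num_threads, loop_nt_sketch_t req,
                          loop_nt_sketch_t *out)
{
    bool any_set = req.jc > 0 || req.ic > 0 || req.jr > 0 || req.ir > 0;

    (void)num_threads; /* overridden whenever any per-loop value is set */
    if (!any_set)
        return false;

    out->jc = req.jc > 0 ? req.jc : 1;
    out->ic = req.ic > 0 ? req.ic : 1;
    out->jr = req.jr > 0 ? req.jr : 1;
    out->ir = req.ir > 0 ? req.ir : 1;
    return true;
}
```

With a global count of 16 and only the ic value set to 4, the sketch yields 1, 4, 1, 1 for jc, ic, jr, ir, so the run uses 4 threads rather than 16.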
Field G. Van Zee
773a24efb2 Merge branch 'master' of github.com:flame/blis 2017-05-03 15:07:59 -05:00
Field G. Van Zee
dd58c9545c Disable complex 3m/4m in testsuite by default.
Details:
- Disabled testsuite tests of all level-3 implementations based on 3m
  and 4m. This will improve testing runtime on Travis CI as well as for
  anyone manually running the testsuite using default test parameters.
  Thanks to Devin Matthews for suggesting this change.
2017-05-03 15:04:51 -05:00
Jeff Hammond
0df3541f54 allow KNL build without hbwmalloc.h (i.e. emulated)
we want to be able to run BLIS KNL binaries on non-KNL machines via SDE.
although it is possible to install hbwmalloc implementation on such
systems, it is easier not to, since obviously the performance of SDE
execution is not representative so there is no reason to emulate HBW
allocation.
2017-05-02 19:35:38 -07:00
Field G. Van Zee
b88542591d Merge pull request #107 from jeffhammond/intel-compilers-no-use-libm
never use libm with Intel compilers
2017-05-02 19:22:41 -05:00
Field G. Van Zee
43007f7b65 Fixed stray parentheses in README citations. 2017-05-02 16:48:43 -05:00
Field G. Van Zee
a4f1d0b880 CHANGELOG update (0.2.2) 2017-05-02 16:38:43 -05:00
Field G. Van Zee
940a707ac7 Version file update (0.2.2) 0.2.2 2017-05-02 16:38:42 -05:00
Field G. Van Zee
d5a5e003ea Fixed a trsm1m bug that affected right-side cases.
Details:
- Fixed a bug introduced in 1c732d3 that affected trsm1m_r. The result
  was nondeterministic behavior (usually segmentation faults) for certain
  problem sizes beyond the 1m instance of kc (e.g. 128 on haswell). The
  cause of the bug was my commenting out lines in bli_gemm1m_ukr_ref.c
  which explicitly directed the virtual gemm micro-kernel to use temporary
  space if the storage preference of the [real domain] gemm ukernel did
  not match the storage of the output matrix C. In the context of gemm,
  this handling is not needed because agreement between the storage pref
  and the matrix is guaranteed by a high-level optimization in BLIS.
  However, this optimization is not applied to trsm because the storage
  of C is not necessarily the same as the storage of the micro-panels of
  B--both of which are updated by the micro-kernel during a trsm
  operation. Thus, the guarantee of storage/preference agreement is not
  in place for trsm, which means we must handle that case within the
  virtual gemm micro-kernel.
- Comment updates and a minor macro change to bli_trsm*_cntx_init() for
  3m1, 4m1a, and 1m.
2017-05-02 15:48:30 -05:00
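The check that d5a5e003ea restores can be pictured as follows. This is a hedged sketch of the decision only, assuming that a row-stored tile has unit column stride and a column-stored tile has unit row stride; the enum and function names are hypothetical, and the actual logic in bli_gemm1m_ukr_ref.c may differ.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical encoding of the real-domain gemm ukernel's output
   storage preference. */
typedef enum { PREFERS_ROWS, PREFERS_COLS } ukr_pref_sketch_t;

/* rs and cs are the row and column strides of the output tile of C.
   Returns true when the ukernel cannot write to C directly and a
   temporary tile in the preferred storage must be used instead. */
bool needs_temp_tile(ukr_pref_sketch_t pref, ptrdiff_t rs, ptrdiff_t cs)
{
    bool row_stored = (cs == 1);
    bool col_stored = (rs == 1);

    if (pref == PREFERS_ROWS)
        return !row_stored;
    return !col_stored;
}
```

For gemm the mismatch branch should never trigger because of the high-level optimization mentioned in the commit; for trsm it can, which is why the branch has to stay.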
Field G. Van Zee
e80993e71f Merge branch 'master' into 1m 2017-05-02 12:30:28 -05:00
Field G. Van Zee
ca3a792477 README.md update.
Details:
- Updated bibtex entries for 4th BLIS paper, and added entries for 5th
  and 6th BLIS papers.
2017-05-02 12:09:39 -05:00
Field G. Van Zee
6e7de6ef84 Minor updates to test/3m4m.
Details:
- Updated initial problem size and increment in Makefile.
- Updated code in test_gemm.c to correctly query kc from context.
2017-03-17 12:10:24 -05:00
Field G. Van Zee
f484c6cd43 Whitespace reformatting to armv8a kernels file.
Details:
- Updated formatting of function signature/header in
  kernels/armv8a/3/bli_gemm_opt_4x4.c.
2017-03-17 12:07:27 -05:00
Field G. Van Zee
a509fbd5ac Merge branch 'master' into 1m 2017-02-21 17:06:16 -06:00
Field G. Van Zee
69b4846ae9 Disabled experiment-related 1m code.
Details:
- Commented out code in frame/ind/oapi/bli_l3_3m4m1m_oapi.c that was
  specifically inserted to facilitate the benchmarking of 1m block-panel
  and panel-block algorithms.
- Updates to test/3m4m/Makefile, runme.sh script, and test_gemm.c to
  reflect changes used/needed during benchmarking.
2017-02-21 15:33:39 -06:00
Devin Matthews
513944e4a9 Merge pull request #118 from devinamatthews/master
Handle k=0 correctly in KNL dgemm ukernel.
2017-02-20 10:04:33 -05:00
Devin Matthews
0e18f68cf1 Handle k=0 correctly in KNL dgemm ukernel. 2017-02-20 09:03:21 -06:00
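The log does not say how the KNL assembly kernel mishandled k=0, but the expected result is easy to state: with no rank-1 updates to accumulate, the microkernel should reduce to C := beta * C. A plain C reference loop, given here as a sketch with a hypothetical name and an assumed packing layout (A as k columns of length m, B as k rows of length n), gets this right without a special case.

```c
#include <stddef.h>

/* Reference-style microkernel sketch: C := beta*C + alpha*A*B.
   a holds k packed columns of length m, b holds k packed rows of
   length n, and c is an m x n tile with strides rs_c and cs_c. */
void dgemm_ukr_ref_sketch(size_t m, size_t n, size_t k,
                          double alpha, const double *a, const double *b,
                          double beta, double *c,
                          size_t rs_c, size_t cs_c)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double ab = 0.0;

            /* When k == 0 this loop body never runs, so a and b are
               never dereferenced and ab stays 0. */
            for (size_t p = 0; p < k; ++p)
                ab += a[i + p * m] * b[j + p * n];

            double *cij = &c[i * rs_c + j * cs_c];
            *cij = beta * (*cij) + alpha * ab;
        }
    }
}
```

An optimized kernel that unrolls the k loop has to reproduce the same behavior explicitly, typically by skipping straight to the beta scaling when k is zero.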
Devin Matthews
8b462a0e8c Merge pull request #117 from devinamatthews/master
Cast dim_t and inc_t parameters to 64-bit in KNL microkernels.
2017-02-19 23:03:03 -05:00
Devin Matthews
7d42fc0796 Cast dim_t and inc_t parameters to 64-bit in KNL microkernels. 2017-02-19 21:10:55 -05:00
Field G. Van Zee
c362afc525 Added missing "level-0" BLAS [sd]cabs1_().
Details:
- Fixed issue #115 by adding implementations for scabs1_() and dcabs1_()
  to the BLAS compatibility layer. Thanks to heroxbd for pointing out
  their absence.
2017-02-09 11:54:59 -06:00
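For context, the reference BLAS defines dcabs1 as the sum of the absolute values of the real and imaginary parts, not the complex modulus. The sketch below shows that definition only; the struct and the function names are hypothetical and do not reproduce the Fortran-compatible signature of the routine added in c362afc525.

```c
/* Hypothetical stand-in for a double-precision complex value. */
typedef struct { double real; double imag; } zsketch_t;

/* Absolute value without relying on the math library. */
static double abs_d(double x)
{
    return x < 0.0 ? -x : x;
}

/* |re(z)| + |im(z)|, the quantity dcabs1 returns. */
double dcabs1_sketch(const zsketch_t *z)
{
    return abs_d(z->real) + abs_d(z->imag);
}
```

For z = -3 + 4i this gives 7, whereas the modulus would be 5.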
Field G. Van Zee
018180c938 Fixed a minor bug in configure (issue #114).
Details:
- Fixed a bug in the configure script whereby a non-preferred value for
  --enable-threading would cause problems in common.mk vis-a-vis detecting
  which threading model was chosen. Thanks to heroxbd for reporting this
  issue.
2017-02-08 11:20:52 -06:00
Devin Matthews
ddf45e7177 Merge pull request #113 from devinamatthews/knl_thread_params
Change default threading parameters for KNL.
2017-01-27 14:25:40 -06:00
Devin Matthews
78e1b16e16 Change default threading parameters for KNL. 2017-01-27 14:22:20 -06:00
Field G. Van Zee
1c732d3ddc Added 1m-specific APIs for bp, pb gemm algorithms.
Details:
- Defined bli_gemmbp_cntl_create(), bli_gemmpb_cntl_create(), with the
  body of bli_gemm_cntl_create() replaced with a call to the former.
- Defined bli_cntl_free_w_thrinfo(), bli_cntl_free_wo_thrinfo(). Now,
  bli_cntl_free() can check if the thread parameter is NULL, and if so,
  call the latter, and otherwise call the former.
- Defined bli_gemm1mbp_cntx_init(), bli_gemm1mpb_cntx_init(), both in
  terms of bli_gemm1mxx_cntx_init(), which behaves the same as
  bli_gemm1m_cntx_init() did before, except that an extra bool parameter
  (is_pb) is used to support both bp and pb algorithms (including to
  support the anti-preference field described below).
- Added support for "anti-preference" in context. The anti_pref field,
  when true, will toggle the boolean return value of routines such as
  bli_cntx_l3_ukr_eff_prefers_storage_of(), which has the net effect of
  causing BLIS to transpose the operation to achieve disagreement (rather
  than agreement) between the storage of C and the micro-kernel output
  preference. This disagreement is needed for panel-block implementations,
  since they induce a transposition of the suboperation immediately before
  the macro-kernel is called, which changes the apparent storage of C. For
  now, anti-preference is used only with the pb algorithm for 1m (and not
  with any other non-1m implementation).
- Defined new functions,
    bli_cntx_l3_ukr_eff_prefers_storage_of()
    bli_cntx_l3_ukr_eff_dislikes_storage_of()
    bli_cntx_l3_nat_ukr_eff_prefers_storage_of()
    bli_cntx_l3_nat_ukr_eff_dislikes_storage_of()
  which are identical to their non-"eff" (effectively) counterparts except
  that they take the anti-preference field of the context into account.
- Explicitly initialize the anti-pref field to FALSE in
  bli_gks_cntx_set_l3_nat_ukr_prefs().
- Added bli_gemm_ker_var1.c, which implements a panel-block macro-kernel
  in terms of the existing block-panel macro-kernel _ker_var2(). This
  technique requires inducing transposes on all operands and swapping
  A and B.
- Changed bli_obj_induce_trans() macro so that pack-related fields are
  also changed to reflect the induced transposition.
- Added a temporary hack to bli_l3_3m4m1m_oapi.c that allows us to easily
  specify the 1m algorithm (block-panel or panel-block).
- Renamed the following cntx_t-related macros:
    bli_cntx_get_pack_schema_a() -> bli_cntx_get_pack_schema_a_block()
    bli_cntx_get_pack_schema_b() -> bli_cntx_get_pack_schema_b_panel()
    bli_cntx_get_pack_schema_c() -> bli_cntx_get_pack_schema_c_panel()
  and updated all instantiations. Also updated the field names in the
  cntx_t struct.
- Comment updates.
2017-01-25 16:25:46 -06:00
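The "anti-preference" toggle described in 1c732d3ddc amounts to negating the preference query when the flag is set. A minimal sketch, assuming a boolean row/column preference and using hypothetical names throughout (the real cntx_t layout is not shown in the log):

```c
#include <stdbool.h>

/* Hypothetical, heavily reduced context: only the two fields needed
   for the illustration. */
typedef struct {
    bool ukr_prefers_rows; /* storage preference of the gemm ukernel */
    bool anti_pref;        /* when true, invert the preference query */
} cntx_sketch_t;

/* Plain query: does the ukernel prefer the storage of C? */
bool prefers_storage_of(const cntx_sketch_t *cntx, bool c_is_row_stored)
{
    return cntx->ukr_prefers_rows == c_is_row_stored;
}

/* "Effective" query: same answer, toggled when anti_pref is set, so
   that the caller ends up arranging disagreement instead of agreement. */
bool eff_prefers_storage_of(const cntx_sketch_t *cntx, bool c_is_row_stored)
{
    bool prefers = prefers_storage_of(cntx, c_is_row_stored);
    return cntx->anti_pref ? !prefers : prefers;
}
```

Keeping the toggle inside the query means callers that decide whether to transpose the operation need no knowledge of which algorithm (block-panel or panel-block) is in use.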
Field G. Van Zee
a6ab91bc61 Merge pull request #111 from figual/master
Fixed missing cntx argument in ARMv8 microkernels.
2016-11-30 09:26:58 -06:00
Francisco Igual
7f31a6307b Fixed missing cntx argument in ARMv8 microkernels. 2016-11-27 14:40:47 +01:00
Field G. Van Zee
126482a3b6 Implemented the 1m method.
Details:
- Implemented the 1m method for inducing complex domain matrix
  multiplication. 1m support has been added to all level-3 operations,
  including trsm, and is now the default induced method when native
  complex domain gemm microkernels are omitted from the configuration.
- Updated _cntx_init() operations to take a datatype parameter. This was
  needed for the corresponding function for 1m (because 1m requires us
  to choose between column-oriented or row-oriented execution, which
  requires us to query the context for the storage preference of the
  gemm microkernel, which requires knowing the datatype) but I decided
  that it made sense for consistency to add the parameter to all other
  cntx initialization functions as well, even though those functions
  don't use the parameter.
- Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take
  a second scalar for each blocksize entry. The semantic meaning of the
  two scalars now is that the first will scale the default blocksize
  while the second will scale the maximum blocksize. This allows scaling
  the two independently, and was needed to support 1m, which requires
  scaling for a register blocksize but not the register storage
  blocksize (i.e., "packdim") analogue.
- Deprecated bli_blksz_reduce_dt_to() and defined two new functions,
  bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing
  default and maximum blocksizes to some desired blocksize multiple.
  These functions are needed in the updated definitions of
  bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs().
- Added support for the 1e and 1r packing schemas to packm, including
  1e/1r packing kernels.
- Added a minor optimization to bli_gemm_ker_var2() that allows, under
  certain circumstances (specifically, real domain beta and row- or
  column-stored matrix C), the real domain macrokernel and microkernel
  to be called directly, rather than using the virtual microkernel
  via the complex domain macrokernel, which carries a slight additional
  amount of overhead.
- Added 1m support to the testsuite.
- Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified
  some code in test_gemm.c driver.
2016-11-25 18:29:49 -06:00
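The idea behind inducing a complex product from a real one can be sketched with the standard identity that maps a + bi to the 2x2 real block [a -b; b a]. The code below expands A that way, expands B into stacked (real, imag) pairs, and runs an ordinary real matrix product, which yields the real and imaginary parts of C interleaved. This is an illustration of the identity only, assuming column-major storage; the names are hypothetical, the buffers are materialized naively, and it does not reproduce the 1e/1r packing kernels or the blocking used in BLIS.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a double-precision complex value. */
typedef struct { double real; double imag; } zsketch_t;

/* Expand complex A (m x k) into real Ae (2m x 2k): each a+bi becomes
   the 2x2 block [a -b; b a]. */
static void expand_a(size_t m, size_t k, const zsketch_t *a, double *ae)
{
    size_t ld = 2 * m;
    for (size_t j = 0; j < k; ++j)
        for (size_t i = 0; i < m; ++i) {
            zsketch_t v = a[i + j * m];
            ae[(2 * i)     + (2 * j)     * ld] =  v.real;
            ae[(2 * i + 1) + (2 * j)     * ld] =  v.imag;
            ae[(2 * i)     + (2 * j + 1) * ld] = -v.imag;
            ae[(2 * i + 1) + (2 * j + 1) * ld] =  v.real;
        }
}

/* Expand complex B (k x n) into real Br (2k x n): each c+di becomes
   the 2x1 block [c; d]. */
static void expand_b(size_t k, size_t n, const zsketch_t *b, double *br)
{
    size_t ld = 2 * k;
    for (size_t j = 0; j < n; ++j)
        for (size_t p = 0; p < k; ++p) {
            br[(2 * p)     + j * ld] = b[p + j * k].real;
            br[(2 * p + 1) + j * ld] = b[p + j * k].imag;
        }
}

/* C := A * B for complex operands, computed with real arithmetic only.
   Returns 0 on success and -1 if the scratch buffers cannot be
   allocated. */
int zgemm_via_real_sketch(size_t m, size_t n, size_t k,
                          const zsketch_t *a, const zsketch_t *b,
                          zsketch_t *c)
{
    double *ae = malloc(sizeof(double) * 4 * m * k);
    double *br = malloc(sizeof(double) * 2 * k * n);
    if (ae == NULL || br == NULL) {
        free(ae);
        free(br);
        return -1;
    }
    expand_a(m, k, a, ae);
    expand_b(k, n, b, br);

    /* Real product of (2m x 2k) by (2k x n). Even rows of the result
       are real parts of C, odd rows are imaginary parts. */
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < 2 * m; ++i) {
            double sum = 0.0;
            for (size_t p = 0; p < 2 * k; ++p)
                sum += ae[i + p * 2 * m] * br[p + j * 2 * k];
            if (i % 2 == 0)
                c[i / 2 + j * m].real = sum;
            else
                c[i / 2 + j * m].imag = sum;
        }

    free(ae);
    free(br);
    return 0;
}
```

Because the result rows alternate real and imaginary parts, they line up with the usual interleaved layout of a column-stored complex matrix, which is what lets a real-domain microkernel write the complex output directly.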
Field G. Van Zee
145a551d52 Switched to simpler trsm_r implementation.
Details:
- Disabled the implementation of trsm_r that allows the right-hand matrix
  B to be triangular, and switched to the implementation that simply
  transposes the operation (and thus the storage of C) in order to recast
  the operation as trsm_l. This avoids the need to use trsm_rl and trsm_ru
  macrokernels, which require an awkward swapping of MR and NR. For now,
  the support for trsm_r macrokernels, via separate control trees, remains.
- Modified bli_config_macro_defs.h so that BLIS_RELAX_MCNR_NCMR_CONSTRAINTS
  is defined by default. This is mostly a safety precaution in case someone
  tries to switch back to the previous trsm_r implementation, but also
  serves as a convenience on some systems where one does not naturally
  choose blocksizes in a way that satisfies MC % NR = 0 and NC % MR = 0.
2016-11-23 17:59:06 -06:00
Field G. Van Zee
b3e58ee303 Reimplemented 4x12 haswell ukernels (real only).
Details:
- Replaced permutation-based implementations in bli_gemm_asm_d4x12.c, which
  defines 4x24 single real and 4x12 double real gemm microkernels, with
  broadcast-based implementations. (The previous microkernel file has been
  moved to an 'old' subdirectory.)
2016-11-23 17:58:26 -06:00
Field G. Van Zee
bdc0a264d2 Adjusted stride selection of ct in macrokernels.
Details:
- Updated the changes introduced in 618f433 so that the strides of the
  temporary microtile ct used in the macrokernels are determined based
  on the storage preference of the microkernel (via the new functions
  below), rather than the strides of c. In almost all cases, presently,
  this change results in no net effect, as a high-level optimization
  in the _front() functions aligns the storage of c to that of the
  microkernel's preference. However, I encountered some cases where
  this is not always the case in some development code that has yet
  to be committed, and therefore I'm generalizing the framework code
  in advance.
- Defined two new functions in bli_cntx.c:
    bli_cntx_l3_ukr_prefers_rows_dt()
    bli_cntx_l3_ukr_prefers_cols_dt()
  which return bool_t's based on the current micro-kernel's storage
  preferences. For induced methods, the preference of the underlying
  real domain microkernel is returned.
- Updated definition of bli_cntx_l3_ukr_dislikes_storage_of(), and
  by proxy bli_cntx_l3_ukr_prefers_storage_of(), to be in terms of
  the above functions, rather than querying the preferences of the
  native microkernel directly (which did the wrong thing for induced
  methods).
2016-11-16 14:13:08 -06:00
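Choosing the strides of the temporary microtile from the microkernel's preference, as bdc0a264d2 describes, could look roughly like this. It is a sketch under the assumption that ct is a dense MR x NR tile; the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Row and column strides of a tile. */
typedef struct { size_t rs; size_t cs; } strides_sketch_t;

/* Pick strides for a dense mr x nr temporary tile so that its storage
   matches what the microkernel prefers to write. */
strides_sketch_t ct_strides(bool ukr_prefers_rows, size_t mr, size_t nr)
{
    strides_sketch_t s;

    if (ukr_prefers_rows) {
        s.rs = nr; /* consecutive rows are nr elements apart */
        s.cs = 1;  /* elements within a row are contiguous   */
    } else {
        s.rs = 1;  /* elements within a column are contiguous   */
        s.cs = mr; /* consecutive columns are mr elements apart */
    }
    return s;
}
```

Deriving the strides from the preference rather than from c keeps the temporary tile on the microkernel's fast path even when c itself is stored the other way.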
Field G. Van Zee
031978d264 Fixed inactive trsm_r blocksize constraint code.
Details:
- Changed a cpp macro that was meant to prevent using certain trsm_r code
  if BLIS_RELAX_MCNR_NCMR_CONSTRAINTS was defined. It was actually coded
  incorrectly at first. I've now fixed its location and changed its
  consequence to a compile-time #error message.
2016-11-16 14:04:33 -06:00
Field G. Van Zee
6b5a4032d2 Merge pull request #109 from devinamatthews/omp_num_threads
Add automatic loop thread assignment.
2016-11-10 15:28:24 -06:00
Devin Matthews
a8220e3a86 - Fix typo in bli_cntx.c
- Bump BLIS_DEFAULT_NR_THREAD_MAX to 4
2016-11-10 14:19:34 -06:00
Devin Matthews
c05b3862f6 Add automatic loop thread assignment.
- Number of threads is determined by BLIS_NUM_THREADS or OMP_NUM_THREADS, but can be overridden by BLIS_XX_NT as before.
- Threads are assigned to loops (ic, jc, ir, and jr) automatically by weighted partitioning and heuristics, both of which are tunable via bli_kernel.h.
- All level-3 BLAS covered.
2016-11-04 15:48:02 -05:00
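The log does not spell out the partitioning heuristics, so the following is only one plausible way to picture a weighted split of a thread count between two loops: try every factorization nt = nt1 * nt2 and keep the one that best balances work per thread. The function name, the work weights, and the selection rule are assumptions made for illustration, not the code added in c05b3862f6.

```c
/* Split nt threads into nt1 * nt2 == nt so that work1/nt1 and
   work2/nt2 are as close as possible. Ties keep the smaller nt1.
   For nt < 1 the outputs are left at 1 and nt. */
void partition_2way_sketch(int nt, long work1, long work2,
                           int *nt1, int *nt2)
{
    long best = -1;

    *nt1 = 1;
    *nt2 = nt;

    for (int f = 1; f <= nt; ++f) {
        if (nt % f != 0)
            continue; /* only exact factorizations are considered */

        /* work1/f == work2/(nt/f) exactly when this difference is 0. */
        long imbalance = work1 * (nt / f) - work2 * f;
        if (imbalance < 0)
            imbalance = -imbalance;

        if (best < 0 || imbalance < best) {
            best = imbalance;
            *nt1 = f;
            *nt2 = nt / f;
        }
    }
}
```

With 6 threads and work weights 3 and 2, the sketch picks 3 x 2; with 4 threads and equal weights it picks 2 x 2.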
Field G. Van Zee
3b524a08e3 Consolidated 3m1/4m1 gemmtrsm, trsm ukernel code.
Details:
- Consolidated the macros that define the lower and upper versions of the
  gemmtrsm microkernels into a single macro that is instantiated twice.
  Did this for both 3m1 and 4m1 microkernels.
- Consolidated lower and upper versions of the trsm microkernels for 3m1
  and 4m1 into single files (each).
2016-11-02 17:45:18 -05:00
Field G. Van Zee
ead231aca6 Merge pull request #108 from devinamatthews/patch-2
Update .travis.yml with additional tests
2016-11-02 13:03:50 -05:00
Devin Matthews
62987f60a6 Allow KNL to fail 2016-11-02 11:20:37 -05:00
Devin Matthews
8f9010542c Fix some problems with OSX builds:
- Update CPU detection for Intel archs (esp. Skylake)
- Allow clang for the reference config
2016-11-02 11:18:32 -05:00
Field G. Van Zee
d25e6f8b63 Can disable trsm_r-specific blocksize constraints.
Details:
- Added cpp guards around the constraints in bli_kernel_macro_defs.h
  that enforce MC % NR = 0 and NC % MR = 0. These constraints are ONLY
  needed when handling right-side trsm by allowing the matrix on the
  right (matrix B) to be triangular, because it involves swapping
  register, but not cache, blocksizes (packing A by NR and B by MR)
  and then swapping the operands to gemmtrsm just before that kernel
  is called. It may be useful to disable these constraints if, for
  example, the developer wishes to test the configuration with
  a different set of cache blocksizes where only MC % MR = 0 and
  NC % NR = 0 are enforced.
- In summary, #defining BLIS_RELAX_MCNR_NCMR_CONSTRAINTS will bypass
  the enforcement of MC % NR = 0 and NC % MR = 0.
2016-11-01 14:35:15 -05:00
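The constraints discussed in d25e6f8b63 can be written out as a runtime check. In BLIS they are enforced with preprocessor guards at compile time; the function below is a hypothetical sketch that mirrors the same conditions, with a boolean standing in for BLIS_RELAX_MCNR_NCMR_CONSTRAINTS being defined. It assumes mr and nr are nonzero.

```c
#include <stdbool.h>

/* Returns true when the cache blocksizes (mc, nc) are compatible with
   the register blocksizes (mr, nr). MC % MR == 0 and NC % NR == 0 are
   always required. MC % NR == 0 and NC % MR == 0 are required only
   when the trsm_r-specific constraints are not relaxed. */
bool blocksizes_ok(unsigned mc, unsigned nc, unsigned mr, unsigned nr,
                   bool relax_mcnr_ncmr)
{
    bool basic = (mc % mr == 0) && (nc % nr == 0);

    if (relax_mcnr_ncmr)
        return basic;
    return basic && (mc % nr == 0) && (nc % mr == 0);
}
```

For example, MC = 78 with MR = 6 and NR = 8 satisfies MC % MR = 0 but not MC % NR = 0, so it passes only when the extra constraints are relaxed.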
Devin Matthews
1a67e3688e Bogus commit
Need to trigger another Travis build.
2016-11-01 13:53:18 -05:00
Devin Matthews
2cd82d67b3 Some fixes for .travis.yml
- Switch to gcc-5 to support knl
- Don't run tests in parallel -- it is super slow.
- Use clang on OSX since gcc is only a zombie husk.
2016-11-01 13:25:50 -05:00
Devin Matthews
a3db4e6bdf Update .travis.yml with additional tests
- Test knl configuration (without running of course).
- Test openmp and pthreads threading for auto configuration with 4 threads.
- Test auto configuration with and without pthreads on OSX.
- Also, run make in parallel.

I don't know how the `addons:` section works on OSX; hopefully it is just ignored.
2016-11-01 10:33:18 -05:00
Field G. Van Zee
8a11a2174a Updates to non-default haswell microkernels.
Details:
- Updated s and d microkernels in bli_gemm_asm_d8x6.c to relax alignment
  constraints.
- Added missing c and z microkernels, which are based on the corresponding
  kernels in the d6x8 set.
- This completes the d8x6 set (which may be used for situations when it
  is desirable to have a microkernel with a column preference).
2016-10-31 19:07:55 -05:00