1886 Commits

Author SHA1 Message Date
praveeng
c25a9205fd Merge master code till Switched to simpler trsm_r 2016_11_25 to amd-staging
Change-Id: Ibf71d224d8fb6cf0bc497f84d50c27d276512cc1
2016-11-25 17:08:22 +05:30
Field G. Van Zee
145a551d52 Switched to simpler trsm_r implementation.
Details:
- Disabled the implementation of trsm_r that allows the right-hand matrix
  B to be triangular, and switched to the implementation that simply
  transposes the operation (and thus the storage of C) in order to recast
  the operation as trsm_l. This avoids the need to use trsm_rl and trsm_ru
  macrokernels, which require an awkward swapping of MR and NR. For now,
  the support for trsm_r macrokernels, via separate control trees, remains.
- Modified bli_config_macro_defs.h so that BLIS_RELAX_MCNR_NCMR_CONSTRAINTS
  is defined by default. This is mostly a safety precaution in case someone
  tries to switch back to the previous trsm_r implementation, but also
  serves as a convenience on some systems where one does not naturally
  choose blocksizes in a way that satisfies MC % NR = 0 and NC % MR = 0.
2016-11-23 17:59:06 -06:00
Field G. Van Zee
b3e58ee303 Reimplemented 4x12 haswell ukernels (real only).
Details:
- Replaced permutation-based implementations in bli_gemm_asm_d4x12.c, which
  defines 4x24 single real and 4x12 double real gemm microkernels, with
  broadcast-based implementations. (The previous microkernel file has been
  moved to an 'old' subdirectory.)
2016-11-23 17:58:26 -06:00
sthangar
65298762ff removed a redundant copy operation in DNRM2
Change-Id: I673b08efde4480e871779716f7715566740ad9ce
2016-11-22 12:15:33 +05:30
sthangar
d6863e851a checked-in DNRM2 optimizations
Change-Id: I3b31d768bd7f4fbf43042aa5a0762995c73c4522
2016-11-21 11:30:30 +05:30
Field G. Van Zee
bdc0a264d2 Adjusted stride selection of ct in macrokernels.
Details:
- Updated the changes introduced in 618f433 so that the strides of the
  temporary microtile ct used in the macrokernels are determined based
  on the storage preference of the microkernel (via the new functions
  below), rather than the strides of c. In almost all cases, presently,
  this change results in no net effect, as a high-level optimization
  in the _front() functions aligns the storage of c to that of the
  microkernel's preference. However, I encountered some cases where
  this is not always the case in some development code that has yet
  to be committed, and therefore I'm generalizing the framework code
  in advance.
- Defined two new functions in bli_cntx.c:
    bli_cntx_l3_ukr_prefers_rows_dt()
    bli_cntx_l3_ukr_prefers_cols_dt()
  which return bool_t's based on the current micro-kernel's storage
  preferences. For induced methods, the preference of the underlying
  real domain microkernel is returned.
- Updated definition of bli_cntx_l3_ukr_dislikes_storage_of(), and
  by proxy bli_cntx_l3_ukr_prefers_storage_of(), to be in terms of
  the above functions, rather than querying the preferences of the
  native microkernel directly (which did the wrong thing for induced
  methods).
2016-11-16 14:13:08 -06:00
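A minimal sketch of the stride selection described in bdc0a264d2, assuming an MR x NR temporary microtile. The type `sketch_strides_t` and the function `sketch_ct_strides()` are hypothetical names used only for illustration; they are not BLIS functions, and the boolean argument stands in for whatever bli_cntx_l3_ukr_prefers_rows_dt() would report.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for a (row stride, column stride) pair. */
typedef struct { long rs; long cs; } sketch_strides_t;

/* Choose strides for an MR x NR temporary microtile from the microkernel's
   storage preference rather than from the strides of C.
   - column preference: contiguous columns, so rs = 1 and cs = MR
   - row preference:    contiguous rows,    so rs = NR and cs = 1 */
static sketch_strides_t sketch_ct_strides( bool ukr_prefers_rows, long mr, long nr )
{
    sketch_strides_t s;

    if ( ukr_prefers_rows ) { s.rs = nr; s.cs = 1;  }
    else                    { s.rs = 1;  s.cs = mr; }

    return s;
}
```

With a column-preferring microkernel this reproduces the earlier unconditional choice of 1 and MR; only the row-preferring case differs.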
Field G. Van Zee
031978d264 Fixed inactive trsm_r blocksize constraint code.
Details:
- Changed a cpp macro that was meant to prevent using certain trsm_r code
  if BLIS_RELAX_MCNR_NCMR_CONSTRAINTS was defined. It was actually coded
  incorrectly at first. I've now fixed its location and changed its
  consequence to a compile-time #error message.
2016-11-16 14:04:33 -06:00
sthangar
9772218cae Added optimized DAMAX routines for Zen
Change-Id: I499c0c8f0f4ce6c19235c47b86d5608db6ba50f8
2016-11-16 15:19:19 +05:30
Santanu Thangaraj
9c448e3017 Merge "Added new optimized micro-kernel for dotxv routine" into amd-staging
2016-11-16 04:18:57 -05:00
praveeng
998d824044 Merge master code till devinamatthews/omp_num_thrds 2016_11_16 to amd-staging
Change-Id: I601ff1d3ec8a680e1be039ffc7b299744e8a27c5
2016-11-16 14:24:15 +05:30
Field G. Van Zee
6b5a4032d2 Merge pull request #109 from devinamatthews/omp_num_threads
Add automatic loop thread assignment.
2016-11-10 15:28:24 -06:00
Devin Matthews
a8220e3a86 - Fix typo in bli_cntx.c
- Bump BLIS_DEFAULT_NR_THREAD_MAX to 4
2016-11-10 14:19:34 -06:00
Kiran Varaganti
e35d3c23f2 Added new optimized micro-kernel for dotxv routine
Change-Id: I2c544e9b25a454d971ad690353502a55cd668391
2016-11-10 14:30:53 +05:30
praveeng
0d13e9a4f6 bli_kernel.h
Change-Id: I425d089f79497a0de7d1622e829c3ca9edf7f091
2016-11-07 14:40:41 +05:30
Devin Matthews
c05b3862f6 Add automatic loop thread assignment.
- Number of threads is determined by BLIS_NUM_THREADS or OMP_NUM_THREADS, but can be overridden by BLIS_XX_NT as before.
- Threads are assigned to loops (ic, jc, ir, and jr) automatically by weighted partitioning and heuristics, both of which are tunable via bli_kernel.h.
- All level-3 BLAS covered.
2016-11-04 15:48:02 -05:00
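A rough sketch of how a total thread count might be read from the two environment variables named in c05b3862f6. The commit message does not state which variable wins when both are set, so the precedence below (BLIS_NUM_THREADS first, then OMP_NUM_THREADS, then a default of 1) is an assumption, and all `sketch_` function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Parse a strictly positive decimal integer; return -1 for NULL, empty,
   non-numeric, or non-positive input (including lists such as "4,2"). */
static long sketch_parse_positive( const char* s )
{
    if ( s == NULL || *s == '\0' ) return -1;

    char* end = NULL;
    long  v   = strtol( s, &end, 10 );

    if ( *end != '\0' || v <= 0 ) return -1;
    return v;
}

/* Assumed precedence: the BLIS-specific value, then the OpenMP value,
   then a single thread. */
static long sketch_pick_num_threads( const char* blis_nt, const char* omp_nt )
{
    long nt = sketch_parse_positive( blis_nt );
    if ( nt < 0 ) nt = sketch_parse_positive( omp_nt );
    if ( nt < 0 ) nt = 1;
    return nt;
}

/* Convenience wrapper that consults the process environment. */
static long sketch_num_threads( void )
{
    return sketch_pick_num_threads( getenv( "BLIS_NUM_THREADS" ),
                                    getenv( "OMP_NUM_THREADS" ) );
}
```

Separating the parsing from the getenv() calls keeps the selection logic easy to exercise without touching the environment.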
Field G. Van Zee
3b524a08e3 Consolidated 3m1/4m1 gemmtrsm, trsm ukernel code.
Details:
- Consolidated the macros that define the lower and upper versions of the
  gemmtrsm microkernels into a single macro that is instantiated twice.
  Did this for both 3m1 and 4m1 microkernels.
- Consolidated lower and upper versions of the trsm microkernels for 3m1
  and 4m1 into single files (each).
2016-11-02 17:45:18 -05:00
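The "single macro instantiated twice" pattern from 3b524a08e3 can be illustrated on a much simpler problem. The sketch below is not the BLIS gemmtrsm code; it assumes a dense, row-major n x n triangular matrix and an in-place vector solve, and the names `SKETCH_GENTRSV`, `sketch_trsv_l`, and `sketch_trsv_u` are hypothetical.

```c
#include <assert.h>

/* One macro body generates both the lower (forward substitution) and the
   upper (backward substitution) solver. On entry x holds the right-hand
   side; on exit it holds the solution of A x = b. */
#define SKETCH_GENTRSV( fname, IS_LOWER ) \
static void fname( int n, const double* a, double* x ) \
{ \
    for ( int k = 0; k < n; ++k ) \
    { \
        /* Lower walks rows top-down; upper walks them bottom-up. */ \
        int i  = (IS_LOWER) ? k : n - 1 - k; \
        /* Columns whose unknowns are already solved. */ \
        int j0 = (IS_LOWER) ? 0 : i + 1; \
        int j1 = (IS_LOWER) ? i : n; \
        double sum = x[i]; \
        for ( int j = j0; j < j1; ++j ) sum -= a[i*n + j] * x[j]; \
        x[i] = sum / a[i*n + i]; \
    } \
}

SKETCH_GENTRSV( sketch_trsv_l, 1 )
SKETCH_GENTRSV( sketch_trsv_u, 0 )
```

Because `IS_LOWER` is a compile-time constant in each instantiation, a compiler can typically fold the conditionals away, so the shared body need not cost anything at run time.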
Field G. Van Zee
ead231aca6 Merge pull request #108 from devinamatthews/patch-2
Update .travis.yml with additional tests
2016-11-02 13:03:50 -05:00
Devin Matthews
62987f60a6 Allow KNL to fail
2016-11-02 11:20:37 -05:00
Devin Matthews
8f9010542c Fix some problems with OSX builds:
- Update CPU detection for Intel archs (esp. Skylake)
- Allow clang for the reference config
2016-11-02 11:18:32 -05:00
Field G. Van Zee
d25e6f8b63 Can disable trsm_r-specific blocksize constraints.
Details:
- Added cpp guards around the constraints in bli_kernel_macro_defs.h
  that enforce MC % NR = 0 and NC % MR = 0. These constraints are ONLY
  needed when handling right-side trsm by allowing the matrix on the
  right (matrix B) to be triangular, because it involves swapping
  register, but not cache, blocksizes (packing A by NR and B by MR)
  and then swapping the operands to gemmtrsm just before that kernel
  is called. It may be useful to disable these constraints if, for
  example, the developer wishes to test the configuration with
  a different set of cache blocksizes where only MC % MR = 0 and
  NC % NR = 0 are enforced.
- In summary, #defining BLIS_RELAX_MCNR_NCMR_CONSTRAINTS will bypass
  the enforcement of MC % NR = 0 and NC % MR = 0.
2016-11-01 14:35:15 -05:00
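A sketch of the kind of cpp guard described in d25e6f8b63. Only the macro name BLIS_RELAX_MCNR_NCMR_CONSTRAINTS comes from the commit message; the `SKETCH_` blocksize values are made-up numbers chosen so that every check passes, and `sketch_blocksizes_ok()` is a hypothetical runtime equivalent added for illustration.

```c
#include <assert.h>

/* Hypothetical blocksizes for illustration only. */
#define SKETCH_MR 6
#define SKETCH_NR 8
#define SKETCH_MC 72
#define SKETCH_NC 4080

/* Always enforced: cache blocksizes are multiples of their own register
   blocksizes. */
#if ( SKETCH_MC % SKETCH_MR != 0 )
  #error "MC must be a multiple of MR."
#endif
#if ( SKETCH_NC % SKETCH_NR != 0 )
  #error "NC must be a multiple of NR."
#endif

/* Only needed when right-side trsm swaps the register blocksizes. */
#ifndef BLIS_RELAX_MCNR_NCMR_CONSTRAINTS
  #if ( SKETCH_MC % SKETCH_NR != 0 )
    #error "MC must be a multiple of NR."
  #endif
  #if ( SKETCH_NC % SKETCH_MR != 0 )
    #error "NC must be a multiple of MR."
  #endif
#endif

/* Runtime version of the same rules; returns 1 if the blocksizes pass. */
static int sketch_blocksizes_ok( long mc, long nc, long mr, long nr, int relax )
{
    if ( mc % mr != 0 || nc % nr != 0 ) return 0;
    if ( !relax && ( mc % nr != 0 || nc % mr != 0 ) ) return 0;
    return 1;
}
```

For example, MC = 78 with MR = 6 and NR = 8 satisfies MC % MR = 0 but not MC % NR = 0, so it is accepted only when the constraints are relaxed.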
Devin Matthews
1a67e3688e Bogus commit
Need to trigger another Travis build.
2016-11-01 13:53:18 -05:00
Devin Matthews
2cd82d67b3 Some fixes for .travis.yml
- Switch to gcc-5 to support knl
- Don't run tests in parallel -- it is super slow.
- Use clang on OSX since gcc is only a zombie husk.
2016-11-01 13:25:50 -05:00
Devin Matthews
a3db4e6bdf Update .travis.yml with additional tests
- Test knl configuration (without running of course).
- Test openmp and pthreads threading for auto configuration with 4 threads.
- Test auto configuration with and without pthreads on OSX.
- Also, run make in parallel.

I don't know how the `addons:` section works on OSX; hopefully it is just ignored.
2016-11-01 10:33:18 -05:00
Field G. Van Zee
8a11a2174a Updates to non-default haswell microkernels.
Details:
- Updated s and d microkernels in bli_gemm_asm_d8x6.c to relax alignment
  constraints.
- Added missing c and z microkernels, which are based on the corresponding
  kernels in the d6x8 set.
- This completes the d8x6 set (which may be used for situations when it
  is desirable to have a microkernel with a column preference).
2016-10-31 19:07:55 -05:00
Field G. Van Zee
618f4331eb Align strides of ct in macrokernels to that of c.
Details:
- Previously, rs_ct and cs_ct, the strides of the temporary microtile used
  primarily in the macrokernels' edge case handling, were unconditionally
  set to 1 and MR, respectively. However, Devin Matthews noted that this
  ought to be changed so that the strides of ct were in agreement with the
  strides of C. (That is, if C was row-stored, then ct should be accessed
  by rows as well.) The implicit assumption is that the strides of C
  have already been adjusted, via induced transposition, if the storage
  preference of the microkernel is at odds with the storage of C. So, if
  the microkernel prefers row storage, the macrokernel's interior cases
  would present row-stored (ideal) microkernel subproblems to the
  microkernel, but for edge cases, it would still see column-stored
  subproblems (not ideal). This commit fixes this issue. Thanks to Devin
  for his suggestion.
2016-10-31 14:40:51 -05:00
Jeff Hammond
c2c91e09b4 never use libm with Intel compilers
Intel compilers include a highly optimized math library (libimf) that
should be used instead of GNU libm.

yes, this change is for ALL targets, including those that are not
supported by the Intel compiler.  there is no harm in doing this, and it
is future-proof in the event that the Intel compilers support other
architectures.
2016-10-25 21:15:26 -07:00
Field G. Van Zee
6303910023 Merge pull request #105 from devinamatthews/knl
Support for Intel Knights Landing.
2016-10-25 19:34:51 -05:00
Devin Matthews
216206c1d3 Fix up for merge to master.
2016-10-25 13:56:18 -05:00
Devin Matthews
11eb7957ab Merge branch 'master' into knl
# Conflicts:
#	frame/thread/bli_thread.h
2016-10-25 13:51:07 -05:00
Devin Matthews
cd5b668183 Don't use %rbp in KNL packing kernels.
2016-10-25 13:49:27 -05:00
Field G. Van Zee
956b3edf8e Merge pull request #104 from devinamatthews/misspellings
Add flexible options for thread model (pthread/posix for pthreads etc.).
2016-10-25 13:02:57 -05:00
Devin Matthews
0662a3c1b1 Add flexible options for thread model (pthread/posix for pthreads etc.).
2016-10-25 12:42:44 -05:00
Kiran Varaganti
e044fa6240 Changed double precision trsm kernel macro definition to bli_dtrsm_l_int_6x8 from 6x16 : it fixes the seg fault
Change-Id: Ia8c1de5fe13a370d691570a50136d55ffb18908a
2016-10-25 13:03:05 +05:30
Kiran Varaganti
b3ed4933aa Changed double precision trsm kernel macro definition to bli_dtrsm_l_int_6x8 from 6x16 : it fixes the seg fault
Change-Id: Ia8c1de5fe13a370d691570a50136d55ffb18908a
alpharelease
2016-10-25 09:51:27 +05:30
Field G. Van Zee
b7e41d71b0 Merge pull request #103 from devinamatthews/patch-1
Change .align to .p2align in Bulldozer ukernels.
2016-10-24 16:47:46 -05:00
Devin Matthews
5117d444f7 Change .align to .p2align in Bulldozer ukernels
Apparently OSX doesn't allow .align directives for >16B, so I've changed these to their .p2align counterparts.
2016-10-24 16:20:47 -05:00
Field G. Van Zee
4bd905bd45 Merge pull request #93 from ShadenSmith/config_check
Adds sanity check to configuration choice.
2016-10-21 14:48:44 -05:00
Field G. Van Zee
936d5fdc26 Fixed multithreading compilation bug in 970745a.
Details:
- Moved the definition of the cpp macro BLIS_ENABLE_MULTITHREADING
  from bli_thread.h to bli_config_macro_defs.h. Also moved the
  sanity check that OpenMP and POSIX threads are not both enabled.
- Thanks to Krzysztof Drewniak for reporting this bug.
2016-10-21 14:34:27 -05:00
Kiran Varaganti
d250e6a3af Merged TRSM and scalv routines into zen folder
Change-Id: Ice897bc83e8fb70b90f23cc3ce892c39883aceb9
2016-10-20 14:34:39 +05:30
Field G. Van Zee
8feb0f85a6 Removed auto-prototyping of malloc()/free() substitutes.
Details:
- Removed the header file, bli_malloc_prototypes.h, which automatically
  generated prototypes for the functions specified by the following
  cpp macros:
    BLIS_MALLOC_INTL
    BLIS_FREE_INTL
    BLIS_MALLOC_POOL
    BLIS_FREE_POOL
    BLIS_MALLOC_USER
    BLIS_FREE_USER
  These prototypes were originally provided primarily as a convenience
  to those developers who specified their own malloc()/free() substitutes
  for one or more of the macros above. However, we generated these prototypes
  regardless, even when the default values (malloc and free) of the
  macros above were used. A problem arose under certain circumstances
  (e.g., gcc in C++ mode on Linux with glibc) when including blis.h that
  stemmed from the "throw" specification which was added to the glibc's
  malloc() prototype, resulting in a prototype mismatch. Therefore, going
  forward, developers who specify their own custom malloc()/free()
  substitutes must also prototype those substitutes via bli_kernel.h.
  Thanks to Krzysztof Drewniak for reporting this bug, and Devin Matthews
  for researching the nature and potential solutions.
2016-10-19 16:05:41 -05:00
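A sketch of what 8feb0f85a6 asks of developers who supply their own allocator: declare the substitutes yourself, then point the macros at them. BLIS_MALLOC_USER and BLIS_FREE_USER are the macro names from the commit message; `my_malloc`, `my_free`, and the counter are hypothetical, and the malloc()-like signatures are an assumption. In a real configuration the prototypes and #defines would live in bli_kernel.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-supplied substitutes. Their prototypes must be visible
   wherever the macros below are expanded, since they are no longer
   generated automatically. */
void* my_malloc( size_t size );
void  my_free( void* p );

#define BLIS_MALLOC_USER my_malloc
#define BLIS_FREE_USER   my_free

/* Illustrative bookkeeping: number of outstanding allocations. */
static size_t sketch_live_allocs = 0;

void* my_malloc( size_t size )
{
    void* p = malloc( size );
    if ( p != NULL ) sketch_live_allocs++;
    return p;
}

void my_free( void* p )
{
    if ( p == NULL ) return;
    sketch_live_allocs--;
    free( p );
}
```

Because the user writes the prototypes, they can match whatever exception specification or attributes the underlying functions carry, which avoids the mismatch the commit describes.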
Field G. Van Zee
970745a5fc Reorganized typedefs to avoid compiler warnings.
Details:
- Relocated membrk_t definition from bli_membrk.h to bli_type_defs.h.
- Moved #include of bli_malloc.h from blis.h to bli_type_defs.h.
- Removed standalone mtx_t and mutex_t typedefs in bli_type_defs.h.
- Moved #include of bli_mutex.h from bli_thread.h to bli_type_defs.h.
- The redundant typedefs of membrk_t and mtx_t caused a warning on some C
  compilers. Thanks to Tyler Smith for reporting this issue.
2016-10-19 15:58:03 -05:00
sthangar
1c2f7b57d5 Removed symlinks to zen kernels from haswell kernel folder and also modified the bli_kernel.h file accordingly
Change-Id: Ib3736af48e851c8243bbe10d937fb942c49ad048
2016-10-18 15:06:35 +05:30
praveeng
d864ea9f4f Merge master code 2016_10_14 till Added disabled code thrinfo_t structures
Change-Id: If7db98d286c1471fcd30f00757abee9b253ef987
2016-10-14 17:01:31 +05:30
Field G. Van Zee
28b2af8a71 Added disabled code to print thrinfo_t structures.
Details:
- Added cpp-guarded code to bli_thrcomm_openmp.c that allows a curious
  developer to print the contents of the thrinfo_t structures of each
  thread, for verification purposes or just to study the way thread
  information and communicators are used in BLIS.
- Enabled some previously-disabled code in bli_l3_thrinfo.c for freeing
  an array of thrinfo_t* values that is used in the new, cpp-guarded code
  mentioned above.
- Removed some old commented lines from bli_gemm_front.c.
2016-10-13 14:50:08 -05:00
Field G. Van Zee
11eed3f683 Fixed a configure -t omp/openmp bug from fd04869.
Details:
- Forgot to update certain occurrences of "omp" in common.mk during
  commit fd04869, which changed the preferred configure option string
  for enabling OpenMP from "omp" to "openmp".
2016-10-13 14:23:23 -05:00
praveeng
7045fcbf0b Merge master code 2016_10_13 Removed previously renamed/old files
Change-Id: I8106d371afaa0af474a8967388d44481b05de923
2016-10-13 12:03:24 +05:30
sthangar
7e04490002 Checked in the SAMAX optimizations
Change-Id: I7faf8c3adf52ff01432188ad3b9866ee4b9a9dfd
2016-10-13 10:07:51 +05:30
Field G. Van Zee
9cda6057ea Removed previously renamed/old files.
Details:
- Removed frame/base/bli_mem.c and frame/include/bli_auxinfo_macro_defs.h,
  both of which were renamed/removed in 701b9aa. For some reason, these
  files survived when the compose branch was merged back into master.
  (Clearly, git's merging algorithm is not perfect.)
- Removed frame/base/bli_mem.c.prev (an artifact of the long-ago changed
  memory allocator that I was keeping around for no particular reason).
2016-10-11 13:21:26 -05:00
Field G. Van Zee
22377abd84 Fixed bli_gemm() segfault on empty C matrices.
Details:
- Fixed a bug that would manifest in the form of a segmentation fault
  in bli_cntl_free() when calling any level-3 operation on an empty
  output matrix (i.e., m = n = 0). Specifically, the code previously
  assumed that the entire control tree was built prior to it being
  freed. However, if the level-3 operation performs an early exit, the
  control tree will be incomplete, and this scenario is now handled.
  Thanks to Elmar Peise for reporting this bug.
2016-10-10 13:43:56 -05:00
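The failure mode in 22377abd84 can be sketched with a simplified tree: if the free routine assumes every level exists, an early exit that leaves the tree partially built leads to a NULL dereference. The node type and function below are hypothetical stand-ins, not the bli_cntl_free() implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified control-tree node. */
typedef struct sketch_node
{
    struct sketch_node* sub_node;  /* next level of the tree, or NULL     */
    void*               params;    /* optional per-node data, or NULL     */
} sketch_node_t;

/* Free whatever portion of the tree exists and return the number of nodes
   released. The loop stops at the first missing level, so an empty or
   partially built tree is handled the same way as a complete one. */
static int sketch_tree_free( sketch_node_t* node )
{
    int freed = 0;

    while ( node != NULL )
    {
        sketch_node_t* next = node->sub_node;

        free( node->params );  /* free(NULL) is a no-op */
        free( node );

        freed++;
        node = next;
    }

    return freed;
}
```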
Field G. Van Zee
0b571cd94d Fixed segfault in bli_free_align() for NULL ptrs.
Details:
- Fixed a bug in bli_free_align() caused by failing to handle NULL pointers
  up-front, which led to performing pointer arithmetic on NULL pointers in
  order to free the address immediately before the pointer. Thanks to Devin
  Matthews for reporting this bug.
2016-10-06 14:48:15 -05:00
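A minimal sketch of the aligned allocation scheme implied by 0b571cd94d, in which the address returned by malloc() is stored immediately before the aligned pointer handed to the caller. The function names are hypothetical and this is not the BLIS source; it assumes `align` is a power of two.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate `size` bytes aligned to `align` (assumed to be a power of two).
   The raw pointer from malloc() is stashed in the slot just before the
   aligned address so that it can be recovered at free time. */
static void* sketch_malloc_align( size_t size, size_t align )
{
    /* Extra room: worst-case alignment shift plus one pointer slot. */
    void* raw = malloc( size + align + sizeof( void* ) );
    if ( raw == NULL ) return NULL;

    uintptr_t addr = ( uintptr_t )raw + sizeof( void* );
    addr = ( addr + align - 1 ) & ~( ( uintptr_t )align - 1 );

    ( ( void** )addr )[-1] = raw;
    return ( void* )addr;
}

static void sketch_free_align( void* p )
{
    /* Handle NULL up front. Without this check, the lookup below would
       perform pointer arithmetic on NULL and read an invalid address. */
    if ( p == NULL ) return;

    free( ( ( void** )p )[-1] );
}
```

The early return mirrors the behavior of free() itself, which accepts NULL, so callers can release an aligned pointer unconditionally.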