Commit Graph

357 Commits

Author SHA1 Message Date
Kiran Varaganti
bc828f7f8e Added new axpyv (single precision) microkernel that performs 10 FMAs per loop iteration. This gives better performance than all other implementations of axpyv
Change-Id: Ic4f0e4c67e367d67d0b24febcf34f81a70a39972
2017-03-03 14:45:35 +05:30
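
The unrolling idea behind the axpyv commits in this log (several FMAs per loop iteration) can be sketched as follows. This is a minimal, portable illustration and not the Zen kernel itself, whose source is not shown here and which presumably uses vector instructions. The name `saxpy_unrolled` and the unroll factor of 4 are assumptions chosen for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative unroll factor (hypothetical; the commit mentions 10 FMAs). */
enum { UNROLL = 4 };

/* y := y + alpha * x for n contiguous single-precision elements.
 * Each main-loop iteration issues UNROLL independent multiply-adds,
 * which gives the hardware more work to overlap per loop branch. */
static void saxpy_unrolled(size_t n, float alpha, const float *x, float *y)
{
    assert(n == 0 || (x != NULL && y != NULL));

    size_t i = 0;

    /* Main loop: UNROLL independent updates per iteration. */
    for (; i + UNROLL <= n; i += UNROLL) {
        y[i + 0] += alpha * x[i + 0];
        y[i + 1] += alpha * x[i + 1];
        y[i + 2] += alpha * x[i + 2];
        y[i + 3] += alpha * x[i + 3];
    }

    /* Remainder loop: handles the last n % UNROLL elements. */
    for (; i < n; ++i)
        y[i] += alpha * x[i];
}
```

The updates within one iteration do not depend on each other, so a larger unroll factor mainly trades code size for more independent work in flight; the best factor depends on the target core.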
sthangar
c9949f4603 Checked in DGEMMTRSM and edge case handling routine in DDOTXF
Change-Id: I65f00661af6c09b2507294fd43e0a10641c0597e
2017-03-01 11:14:34 +05:30
Field G. Van Zee
a509fbd5ac Merge branch 'master' into 1m 2017-02-21 17:06:16 -06:00
Kiran Varaganti
04245c9ff7 Reoptimized scalv routines - two vector multiplies are done per iteration, and these routines are enabled in bli_kernel.h
Change-Id: Ic5654508573d1f6bde2edef06aefe117e581feb5
2017-02-10 14:24:30 +05:30
Kiran Varaganti
3fa53e8af3 Merged axpyv and gemm small in bli_kernel.h
Merge branch 'amd-staging' of ssh://git.amd.com:29418/cpulibraries/er/blis into amd-staging

	modified:   config/zen/bli_kernel.h
	modified:   frame/3/gemm/bli_gemm_front.c
	modified:   kernels/x86_64/zen/3/bli_gemm_small_matrix.c

Change-Id: If181cf9345178c448b3530beb8bef453917fe295
2017-02-08 11:51:57 +05:30
sthangar
95be7b0470 Added logic for packing matrix A and prefetching matrix C in Unpacked SGEMM code
Change-Id: I99efeca9eb5b4449286ec0ec133fd554ef1bb4f0
2017-02-08 11:24:10 +05:30
Kiran Varaganti
b5291a445b Added optimization variant 4 for axpyv single precision - this performs 5 FMA per loop, keeping the IPC always full
Change-Id: Ie77ed22584271136a257e673bcd3b1ba71136bc9
2017-02-07 12:39:31 +05:30
Kiran Varaganti
f4bfc1662a New routines implemented for axpyv to improve performance for small vector sizes. Vectorization is done for vectors as small as 8 (single precision) or 4 (double precision) elements. Since this operation has a low compute-to-memory ratio, memory operations dominate at larger sizes, so there is not much gain there. This still needs some work. Added saxpyv and daxpyv var 3 routines in the file bli_axpyv_opt_var1.c
Change-Id: Ic1b33bd5516e10113b00e44ab41b97eb19d46072
2017-02-06 15:04:27 +05:30
Devin Matthews
78e1b16e16 Change default threading parameters for KNL. 2017-01-27 14:22:20 -06:00
sthangar
574472ba5a checked in unpacked SGEMM optimization
Change-Id: I8e4ea374415c0c402c660b656fb076af15354181
2017-01-27 14:32:02 +05:30
sthangar
d625c49e20 checked-in SGEMMTRSM microkernel for Zen
Change-Id: Ib61936418dea911b2154aa99f703b66e9669f94f
2016-12-01 16:17:09 +05:30
Field G. Van Zee
126482a3b6 Implemented the 1m method.
Details:
- Implemented the 1m method for inducing complex domain matrix
  multiplication. 1m support has been added to all level-3 operations,
  including trsm, and is now the default induced method when native
  complex domain gemm microkernels are omitted from the configuration.
- Updated _cntx_init() operations to take a datatype parameter. This was
  needed for the corresponding function for 1m (because 1m requires us
  to choose between column-oriented or row-oriented execution, which
  requires us to query the context for the storage preference of the
  gemm microkernel, which requires knowing the datatype) but I decided
  that it made sense for consistency to add the parameter to all other
  cntx initialization functions as well, even though those functions
  don't use the parameter.
- Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take
  a second scalar for each blocksize entry. The semantic meaning of the
  two scalars now is that the first will scale the default blocksize
  while the second will scale the maximum blocksize. This allows scaling
  the two independently, and was needed to support 1m, which requires
  scaling for a register blocksize but not the register storage
  blocksize (i.e., "packdim") analogue.
- Deprecated bli_blksz_reduce_dt_to() and defined two new functions,
  bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing
  default and maximum blocksizes to some desired blocksize multiple.
  These functions are needed in the updated definitions of
  bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs().
- Added support for the 1e and 1r packing schemas to packm, including
  1e/1r packing kernels.
- Added a minor optimization to bli_gemm_ker_var2() that allows, under
  certain circumstances (specifically, real domain beta and row- or
  column-stored matrix C), the real domain macrokernel and microkernel
  to be called directly, rather than using the virtual microkernel
  via the complex domain macrokernel, which carries a slight additional
  amount of overhead.
- Added 1m support to the testsuite.
- Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified
  some code in test_gemm.c driver.
2016-11-25 18:29:49 -06:00
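
The core idea of inducing a complex matrix product from a single real-domain product, as in the 1m commit above, can be sketched as follows. This is a simplified illustration under stated assumptions, not BLIS's implementation: it assumes the "1e" schema expands each complex element a+bi into the 2x2 real block [a -b; b a] and the "1r" schema stores real and imaginary parts in separate rows. The function names are hypothetical, and the layout ignores BLIS's micropanel structure.

```c
#include <assert.h>
#include <stddef.h>

/* Reference real gemm: C (m x n) += A (m x k) * B (k x n), row-major. */
static void real_gemm(size_t m, size_t n, size_t k,
                      const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) {
            double acc = c[i * n + j];
            for (size_t p = 0; p < k; ++p)
                acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
}

/* Expand complex A (m x k, row-major, interleaved re/im) into a real
 * 2m x 2k matrix where each element a+bi becomes [a -b; b a]. */
static void pack_a_1e_like(size_t m, size_t k, const double *a, double *ae)
{
    for (size_t i = 0; i < m; ++i)
        for (size_t p = 0; p < k; ++p) {
            double re = a[2 * (i * k + p)];
            double im = a[2 * (i * k + p) + 1];
            ae[(2 * i)     * (2 * k) + 2 * p]     =  re;
            ae[(2 * i)     * (2 * k) + 2 * p + 1] = -im;
            ae[(2 * i + 1) * (2 * k) + 2 * p]     =  im;
            ae[(2 * i + 1) * (2 * k) + 2 * p + 1] =  re;
        }
}

/* Split complex B (k x n, row-major, interleaved re/im) into a real
 * 2k x n matrix: row 2p holds real parts, row 2p+1 imaginary parts. */
static void pack_b_1r_like(size_t k, size_t n, const double *b, double *br)
{
    for (size_t p = 0; p < k; ++p)
        for (size_t j = 0; j < n; ++j) {
            br[(2 * p)     * n + j] = b[2 * (p * n + j)];
            br[(2 * p + 1) * n + j] = b[2 * (p * n + j) + 1];
        }
}
```

After packing, calling `real_gemm(2*m, n, 2*k, ae, br, c)` on a zeroed 2m x n buffer leaves the real part of each result element in row 2i and its imaginary part in row 2i+1. Only one real product is needed, at the cost of doubling the storage for the expanded operand.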
praveeng
d8f13beeea Merge master code till 2016_11_25 to amd-staging 2016-11-25 17:31:08 +05:30
Field G. Van Zee
b3e58ee303 Reimplemented 4x12 haswell ukernels (real only).
Details:
- Replaced permutation-based implementations in bli_gemm_asm_d4x12.c, which
  defines 4x24 single real and 4x12 double real gemm microkernels, with
  broadcast-based implementations. (The previous microkernel file has been
  moved to an 'old' subdirectory.)
2016-11-23 17:58:26 -06:00
sthangar
9772218cae Added optimized DAMAX routines for Zen
Change-Id: I499c0c8f0f4ce6c19235c47b86d5608db6ba50f8
2016-11-16 15:19:19 +05:30
Kiran Varaganti
e35d3c23f2 Added new optimized micro-kernel for dotxv routine
Change-Id: I2c544e9b25a454d971ad690353502a55cd668391
2016-11-10 14:30:53 +05:30
praveeng
0d13e9a4f6 bli_kernel.h
Change-Id: I425d089f79497a0de7d1622e829c3ca9edf7f091
2016-11-07 14:40:41 +05:30
Devin Matthews
8f9010542c Fix some problems with OSX builds:
- Update CPU detection for Intel archs (esp. Skylake)
- Allow clang for the reference config
2016-11-02 11:18:32 -05:00
Field G. Van Zee
8a11a2174a Updates to non-default haswell microkernels.
Details:
- Updated s and d microkernels in bli_gemm_asm_d8x6.c to relax alignment
  constraints.
- Added missing c and z microkernels, which are based on the corresponding
  kernels in the d6x8 set.
- This completes the d8x6 set (which may be used for situations when it
  is desirable to have a microkernel with a column preference).
2016-10-31 19:07:55 -05:00
Jeff Hammond
c2c91e09b4 never use libm with Intel compilers
Intel compilers include a highly optimized math library (libimf) that
should be used instead of GNU libm.

yes, this change is for ALL targets, including those that are not
supported by the Intel compiler.  there is no harm in doing this, and it
is future-proof in the event that the Intel compilers support other
architectures.
2016-10-25 21:15:26 -07:00
Devin Matthews
11eb7957ab Merge branch 'master' into knl
# Conflicts:
#	frame/thread/bli_thread.h
2016-10-25 13:51:07 -05:00
Kiran Varaganti
e044fa6240 Changed double precision trsm kernel macro definition to bli_dtrsm_l_int_6x8 from 6x16 : it fixes the seg fault
Change-Id: Ia8c1de5fe13a370d691570a50136d55ffb18908a
2016-10-25 13:03:05 +05:30
Kiran Varaganti
d250e6a3af Merged TRSM and scalv routines into zen folder
Change-Id: Ice897bc83e8fb70b90f23cc3ce892c39883aceb9
2016-10-20 14:34:39 +05:30
sthangar
1c2f7b57d5 Removed symlinks to zen kernels from haswell kernel folder and also modified the bli_kernel.h file accordingly
Change-Id: Ib3736af48e851c8243bbe10d937fb942c49ad048
2016-10-18 15:06:35 +05:30
sthangar
7e04490002 Checked in the SAMAX optimizations
Change-Id: I7faf8c3adf52ff01432188ad3b9866ee4b9a9dfd
2016-10-13 10:07:51 +05:30
praveeng
f2e7ea113a conflicts merge for bli_kernel.h
Change-Id: I15d846bd34e11f86ebfd7ed091ff671a1f3366a0
2016-10-06 12:35:30 +05:30
sthangar
133983c36f code clean up in bli_kernel.h
Change-Id: I11d9cdf2af8e8199209eb084f6c3a7c910b83d5d
2016-10-06 11:26:22 +05:30
Field G. Van Zee
b922d75634 Avoid compiling BLAS/CBLAS files when disabled.
Details:
- Updated the top-level Makefile, build/config.mk.in template, and
  configure script so that object files corresponding to source files
  belonging to the BLAS compatibility layer are not compiled (or archived)
  when the compatibility layer is disabled. (Same for CBLAS.) Thanks
  to Devin Matthews for suggesting this optimization.
- Slight change to the way configure handles internal variables. Instead
  of converting (overwriting) some, such as enable_blas2blis and
  enable_cblas, from a "yes" or "no" to a "1" or "0" value, the latter are
  now stored in new variables that live alongside the originals (with the
  suffix "_01").  This is convenient since some values need to be
  sed-substituted into the config.mk.in template, which requires "yes" or
  "no", while some need to be written to the bli_config.h.in template,
  which requires "0" or "1".

Updated BLIS4 TOMS citation in README.md.

Added complex gemm micro-kernels for haswell.

Details:
- Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based
  architectures. As with their real domain brethren, these kernels prefer
  row storage (though this doesn't affect most users due to high-level
  optimizations in most level-3 operations that induce a transpose to
  whatever storage preference the kernel may have).

Change-Id: I512ab90784ecbb7cdaee24928d2ccebb544ba5c1
2016-09-15 12:24:07 +05:30
Pradeep Rao
69826110ba Merge "Implemented trsm single precision for lower triangular matrices; files added: bli_trsm_l_int_6x16.c; files modified: bli_kernel.h to enable optimized trsm microkernel, and test_trsm.c to test trsm single precision" into amd-staging 2016-09-14 03:26:25 -04:00
Field G. Van Zee
121c39d455 Added complex gemm micro-kernels for haswell.
Details:
- Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based
  architectures. As with their real domain brethren, these kernels prefer
  row storage (though this doesn't affect most users due to high-level
  optimizations in most level-3 operations that induce a transpose to
  whatever storage preference the kernel may have).
2016-09-05 13:11:42 -05:00
sthangar
64598ee4cf fixed the symlink issue
Change-Id: I2186d529f295c576597c189e1ae219bc1a83f955
2016-08-31 12:54:50 +05:30
sthangar
fdc6639023 Placed 1 and 1f AMD optimized AVX routines under zen folder
Change-Id: I26795211ef11d232ed794ce36dd0a9c1f8706328
2016-08-29 10:43:38 +05:30
Kiran Varaganti
a58dd35ed7 Implemented trsm single precision for lower triangular matrices; files added: bli_trsm_l_int_6x16.c; files modified: bli_kernel.h to enable optimized trsm microkernel, and test_trsm.c to test trsm single precision
Change-Id: Ibddf989f4aad577e89558673e1038cf6ece654d9
2016-08-26 14:55:12 +05:30
Devin Matthews
c8e4ef9395 Add prefetchw to 30x8 kernel. 2016-08-03 16:13:03 -05:00
Devin Matthews
380736bfe9 Add (new) 30x8 KNL kernel and fix non-scatter prefetch bug. 2016-08-03 16:08:28 -05:00
Devin Matthews
8945a1512d This version gets ~1550 GFLOPs on KNL with 16x4. 2016-08-03 11:28:24 -05:00
Devin Matthews
6ce4c022eb Switch back to 24x8. I could only squeeze 24.5GFLOP out of 8x24, and scalability is not improved. 2016-07-27 16:26:36 -05:00
Devin Matthews
b8f2b55532 Try an 8x24 kernel for the hell of it. 2016-07-27 15:22:55 -05:00
Devin Matthews
7ede5863ae Allocate pack buffer on MCDRAM for KNL. 2016-07-27 13:42:32 -06:00
Devin Matthews
81e2b05f31 Add optimized packing kernels for KNL. 2016-07-27 11:39:05 -05:00
Devin Matthews
65735bbedf Switch to 24x8 kernel, unrolled by 16. 2016-07-24 21:50:32 -05:00
Devin Matthews
8c6e621c09 Make knl conform to new kernel dir structure. 2016-07-22 15:05:15 -05:00
Devin Matthews
119d039942 Add 8x24 KNL kernel. 2016-07-22 10:23:31 -05:00
praveeng
1aa77dfc1d Merge master code as on 2016_07_21 to amd-staging branch by praveeng
Change-Id: Ic7d0a21101358f08147736e7f1884e7409937344
2016-07-21 14:23:41 +05:30
Devin Matthews
b58cda9eba Merge remote-tracking branch 'origin/master' into knl
# Conflicts:
#	frame/base/bli_threading.h
#	frame/include/blis.h
#	frame/thread/bli_thread.c
2016-07-19 14:09:09 -05:00
sthangar
9101a9c880 Checked in optimized 1V kernels along with benchmark codes. Also incorporated review comments for 1F kernels
Change-Id: I035c0d39e6b0bed28e6e2041242186c49f6ed55b
2016-07-13 16:51:14 +05:30
Devin Matthews
318f063dcb Add new KNL microkernel derived from Haswell. 2016-06-08 17:46:50 -05:00
Field G. Van Zee
096895c5d5 Reorganized code, APIs related to multithreading.
Details:
- Reorganized code and renamed files defining APIs related to multithreading.
  All code that is not specific to a particular operation is now located in a
  new directory: frame/thread. Code is now organized, roughly, by the
  namespace to which it belongs (see below).
- Consolidated all operation-specific *_thrinfo_t object types into a single
  thrinfo_t object type. Operation-specific level-3 *_thrinfo_t APIs were
  also consolidated, leaving bli_l3_thrinfo_*() and bli_packm_thrinfo_*()
  functions (aside from a few general purpose bli_thrinfo_*() functions).
- Renamed thread_comm_t object type to thrcomm_t.
- Renamed many of the routines and functions (and macros) for multithreading.
  We now have the following API namespaces:
  - bli_thrinfo_*(): functions related to thrinfo_t objects
  - bli_thrcomm_*(): functions related to thrcomm_t objects.
  - bli_thread_*(): general-purpose functions, such as initialization,
    finalization, and computing ranges. (For now, some macros, such as
    bli_thread_[io]broadcast() and bli_thread_[io]barrier() use the
    bli_thread_ namespace prefix, even though bli_thrinfo_ may be more
    appropriate.)
- Renamed thread-related macros so that they use a bli_ prefix.
- Renamed control tree-related macros so that they use a bli_ prefix (to be
  consistent with the thread-related macros that were also renamed).
- Removed #undef BLIS_SIMD_ALIGN_SIZE from dunnington's bli_kernel.h. This
  #undef was a temporary fix to some macro defaults which were being applied
  in the wrong order, which was recently fixed.
2016-06-06 13:32:04 -05:00
Devin Matthews
e3bd5ca64a Fix SIMD definitions in KNL config, and a couple of fixes to C update. 2016-05-12 20:54:13 -05:00
Tyler Smith
4dcd37eb1b fixing knc simd align size 2016-05-10 16:28:59 -05:00