Commit Graph

162 Commits

Author SHA1 Message Date
Kiran Varaganti
0b19029342 Code cleanup, removed warnings from trsm, removed unused routines in axpyv & scalv
Change-Id: I02867f394c5f416194c4b1769a6c75f39243ec81
2017-03-14 14:51:31 +05:30
praveeng
825363bd2a Merge code from master to amd-staging as on 2017_03_08 by praveeng
Change-Id: I80740081b2cb54c9b77a3e78b9fe540e170be23d
2017-03-08 15:43:42 +05:30
sthangar
093bdb80c8 Checked in Unpacked DGEMM code
Change-Id: I39dcc7b238b328f73ee2675d21a5e521d0488723
2017-03-07 13:35:50 +05:30
Kiran Varaganti
33923da9a1 Added variant 10 for double precision axpyv microkernel
Change-Id: I7a20cc113a422603250bc450825c965136354974
2017-03-06 14:31:31 +05:30
Kiran Varaganti
bc828f7f8e Added new axpyv (single precision) microkernel that performs 10 FMAs per loop. This gives better performance than all other implementations of axpyv
Change-Id: Ic4f0e4c67e367d67d0b24febcf34f81a70a39972
2017-03-03 14:45:35 +05:30
sthangar
c9949f4603 Checked in DGEMMTRSM and edge case handling routine in DDOTXF
Change-Id: I65f00661af6c09b2507294fd43e0a10641c0597e
2017-03-01 11:14:34 +05:30
Devin Matthews
0e18f68cf1 Handle k=0 correctly in KNL dgemm ukernel.
2017-02-20 09:03:21 -06:00
Devin Matthews
7d42fc0796 Cast dim_t and inc_t parameters to 64-bit in KNL microkernels.
2017-02-19 21:10:55 -05:00
Kiran Varaganti
04245c9ff7 Reoptimized scalv routines - two vector multiplies are done per iteration, and these routines are enabled in bli_kernel.h
Change-Id: Ic5654508573d1f6bde2edef06aefe117e581feb5
2017-02-10 14:24:30 +05:30
Kiran Varaganti
58b5b77e5f Fixed a bug in axpyv, the arguments passed to intrinsic fmad instruction are corrected
Change-Id: If12f24c6bc74b22ac9e4acd6b9378e06d79f2f5e
2017-02-08 21:43:34 +05:30
Kiran Varaganti
85de4ebf74 variant 4 axpyv single precision modified: explicitly used FMA intrinsics, replaced vector multiply and add operations
Change-Id: I975feef56696d479d2b9e9441b0660021cf4f6ff
2017-02-08 14:41:04 +05:30
Kiran Varaganti
3fa53e8af3 Merged axpyv and gemm small in bli_kernel.h
Merge branch 'amd-staging' of ssh://git.amd.com:29418/cpulibraries/er/blis into amd-staging

	modified:   config/zen/bli_kernel.h
	modified:   frame/3/gemm/bli_gemm_front.c
	modified:   kernels/x86_64/zen/3/bli_gemm_small_matrix.c

Change-Id: If181cf9345178c448b3530beb8bef453917fe295
2017-02-08 11:51:57 +05:30
sthangar
95be7b0470 Added logic for packing matrix A and prefetching matrix C in Unpacked SGEMM code
Change-Id: I99efeca9eb5b4449286ec0ec133fd554ef1bb4f0
2017-02-08 11:24:10 +05:30
Kiran Varaganti
b5291a445b Added optimization variant 4 for axpyv single precision - this performs 5 FMA per loop, keeping the IPC always full
Change-Id: Ie77ed22584271136a257e673bcd3b1ba71136bc9
2017-02-07 12:39:31 +05:30
Kiran Varaganti
f4bfc1662a New routines implemented for axpyv to improve performance for small vector sizes. Vectorization is done for vectors as small as 8 (single precision) or 4 (double precision). Since this operation has a low compute-to-memory ratio, memory operations dominate at larger sizes and hence there is not much gain there. This still needs some work. Added saxpyv and daxpyv var 3 routines in the file bli_axpyv_opt_var1.c
Change-Id: Ic1b33bd5516e10113b00e44ab41b97eb19d46072
2017-02-06 15:04:27 +05:30
sthangar
574472ba5a checked in unpacked SGEMM optimization
Change-Id: I8e4ea374415c0c402c660b656fb076af15354181
2017-01-27 14:32:02 +05:30
praveeng
41595e98ee Merge master code as on 2016_12_07 to amd-staging
Change-Id: I5d9ecef9bff960aeb9b51ca4e4b21714e789e44f
2016-12-07 15:14:02 +05:30
sthangar
d625c49e20 checked-in SGEMMTRSM microkernel for Zen
Change-Id: Ib61936418dea911b2154aa99f703b66e9669f94f
2016-12-01 16:17:09 +05:30
Francisco Igual
7f31a6307b Fixed missing cntx argument in ARMv8 microkernels.
2016-11-27 14:40:47 +01:00
praveeng
d8f13beeea Merge master code till 2016_11_25 to amd-staging
2016-11-25 17:31:08 +05:30
Field G. Van Zee
b3e58ee303 Reimplemented 4x12 haswell ukernels (real only).
Details:
- Replaced permutation-based implementations in bli_gemm_asm_d4x12.c, which
  defines 4x24 single real and 4x12 double real gemm microkernels, with
  broadcast-based implementations. (The previous microkernel file has been
  moved to an 'old' subdirectory.)
2016-11-23 17:58:26 -06:00
sthangar
9772218cae Added optimized DAMAX routines for Zen
Change-Id: I499c0c8f0f4ce6c19235c47b86d5608db6ba50f8
2016-11-16 15:19:19 +05:30
Kiran Varaganti
e35d3c23f2 Added new optimized micro-kernel for dotxv routine
Change-Id: I2c544e9b25a454d971ad690353502a55cd668391
2016-11-10 14:30:53 +05:30
praveeng
0d13e9a4f6 bli_kernel.h
Change-Id: I425d089f79497a0de7d1622e829c3ca9edf7f091
2016-11-07 14:40:41 +05:30
Field G. Van Zee
8a11a2174a Updates to non-default haswell microkernels.
Details:
- Updated s and d microkernels in bli_gemm_asm_d8x6.c to relax alignment
  constraints.
- Added missing c and z microkernels, which are based on the corresponding
  kernels in the d6x8 set.
- This completes the d8x6 set (which may be used for situations when it
  is desirable to have a microkernel with a column preference).
2016-10-31 19:07:55 -05:00
Devin Matthews
11eb7957ab Merge branch 'master' into knl
# Conflicts:
#	frame/thread/bli_thread.h
2016-10-25 13:51:07 -05:00
Devin Matthews
cd5b668183 Don't use %rbp in KNL packing kernels.
2016-10-25 13:49:27 -05:00
Devin Matthews
5117d444f7 Change .align to .p2align in Bulldozer ukernels
Apparently OSX doesn't allow .align directives for >16B, so I've changed these to their .p2align counterparts.
2016-10-24 16:20:47 -05:00
Kiran Varaganti
d250e6a3af Merged TRSM and scalv routines into zen folder
Change-Id: Ice897bc83e8fb70b90f23cc3ce892c39883aceb9
2016-10-20 14:34:39 +05:30
sthangar
1c2f7b57d5 Removed symlinks to zen kernels from haswell kernel folder and also modified the bli_kernel.h file accordingly
Change-Id: Ib3736af48e851c8243bbe10d937fb942c49ad048
2016-10-18 15:06:35 +05:30
sthangar
7e04490002 Checked in the SAMAX optimizations
Change-Id: I7faf8c3adf52ff01432188ad3b9866ee4b9a9dfd
2016-10-13 10:07:51 +05:30
Field G. Van Zee
b922d75634 Avoid compiling BLAS/CBLAS files when disabled.
Details:
- Updated the top-level Makefile, build/config.mk.in template, and
  configure script so that object files corresponding to source files
  belonging to the BLAS compatibility layer are not compiled (or archived)
  when the compatibility layer is disabled. (Same for CBLAS.) Thanks
  to Devin Matthews for suggesting this optimization.
- Slight change to the way configure handles internal variables. Instead
  of converting (overwriting) some, such as enable_blas2blis and
  enable_cblas, from a "yes" or "no" to a "1" or "0" value, the latter are
  now stored in new variables that live alongside the originals (with the
  suffix "_01").  This is convenient since some values need to be
  sed-substituted into the config.mk.in template, which requires "yes" or
  "no", while some need to be written to the bli_config.h.in template,
  which requires "0" or "1".

Updated BLIS4 TOMS citation in README.md.

Added complex gemm micro-kernels for haswell.

Details:
- Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based
  architectures. As with their real domain brethren, these kernels prefer
  row storage (though this doesn't affect most users due to high-level
  optimizations in most level-3 operations that induce a transpose to
  whatever storage preference the kernel may have).

Change-Id: I512ab90784ecbb7cdaee24928d2ccebb544ba5c1
2016-09-15 12:24:07 +05:30
Pradeep Rao
69826110ba Merge "Implemented trsm single precision for lower triangular matrices, files added bli_trsm_l_int_6x16.c; files modified bli_kernel.h to enable optimized trsm microkernel, and test_trsm.c is modified to test trsm single precision" into amd-staging
2016-09-14 03:26:25 -04:00
Field G. Van Zee
121c39d455 Added complex gemm micro-kernels for haswell.
Details:
- Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based
  architectures. As with their real domain brethren, these kernels prefer
  row storage (though this doesn't affect most users due to high-level
  optimizations in most level-3 operations that induce a transpose to
  whatever storage preference the kernel may have).
2016-09-05 13:11:42 -05:00
sthangar
64598ee4cf fixed the symlink issue
Change-Id: I2186d529f295c576597c189e1ae219bc1a83f955
2016-08-31 12:54:50 +05:30
sthangar
fdc6639023 Placed 1 and 1f AMD optimized AVX routines under zen folder
Change-Id: I26795211ef11d232ed794ce36dd0a9c1f8706328
2016-08-29 10:43:38 +05:30
Kiran Varaganti
a58dd35ed7 Implemented trsm single precision for lower triangular matrices, files added bli_trsm_l_int_6x16.c; files modified bli_kernel.h to enable optimized trsm microkernel, and test_trsm.c is modified to test trsm single precision
Change-Id: Ibddf989f4aad577e89558673e1038cf6ece654d9
2016-08-26 14:55:12 +05:30
Devin Matthews
c8e4ef9395 Add prefetchw to 30x8 kernel.
2016-08-03 16:13:03 -05:00
Devin Matthews
4b5a2f3d6e Merge remote-tracking branch 'origin/knl' into knl
# Conflicts:
#	kernels/x86_64/knl/3/bli_dgemm_opt_24x8.c
2016-08-03 16:09:51 -05:00
Devin Matthews
380736bfe9 Add (new) 30x8 KNL kernel and fix non-scatter prefetch bug.
2016-08-03 16:08:28 -05:00
Devin Matthews
9f52a587de Try prefetchw[t1] instead of regular prefetch for C.
2016-08-03 16:03:53 -05:00
Devin Matthews
8945a1512d This version gets ~1550 GFLOPs on KNL with 16x4.
2016-08-03 11:28:24 -05:00
praveeng
cdfb3c3f29 Merge master code as on 2016_07_29 to amd-staging branch by praveeng
Change-Id: Ic78b84d8b8d10158fb2a612f9a64bbc7b1f9b486
2016-07-29 12:46:21 +05:30
Devin Matthews
6ce4c022eb Switch back to 24x8. I could only squeeze 24.5GFLOP out of 8x24, and scalability is not improved.
2016-07-27 16:26:36 -05:00
Field G. Van Zee
c31b1e7b9d Relax alignment restrictions for sandybridge ukrs.
Details:
- Relaxed the base pointer and leading dimension alignment restrictions
  in the sandybridge gemm microkernels, allowing the use of vmovups/vmovupd
  instead of vmovaps/vmovapd. These changes mimic those made to the haswell
  microkernels in e0d2fa0 and ee2c139.
- Updated testsuite modules as well as standalone test drivers in 'test'
  directory to use DBL_MAX as the initial time candidate. Thanks to Devin
  Matthews for suggesting this change.
- Inserted #include "float.h" into bli_system.h (to gain access to DBL_MAX).
- Minor update (vis-a-vis contexts) to driver code in test/3m4m.
2016-07-27 15:58:07 -05:00
Devin Matthews
b8f2b55532 Try an 8x24 kernel for the hell of it.
2016-07-27 15:22:55 -05:00
Devin Matthews
ad89ed2e82 Merge branch 'knl' of github.com:devinamatthews/blis into knl
2016-07-27 11:45:40 -05:00
Devin Matthews
2c9de740ed This version gets ~26GF on one core.
2016-07-27 11:44:54 -05:00
Devin Matthews
81e2b05f31 Add optimized packing kernels for KNL.
2016-07-27 11:39:05 -05:00
Devin Matthews
a7d8ca97b8 All fixed.
2016-07-25 15:15:13 -05:00