amd/blis - blis - Public git mirror


mirror of https://github.com/amd/blis.git synced 2026-05-11 09:39:59 +00:00

Author	SHA1	Message	Date
Kiran Varaganti	0b19029342	Code cleanup, removed warnings from trsm, removed unused routines in axpyv & scalv Change-Id: I02867f394c5f416194c4b1769a6c75f39243ec81	2017-03-14 14:51:31 +05:30
praveeng	825363bd2a	Merge code from master to amd-staging as on 2017_03_08 by praveeng Change-Id: I80740081b2cb54c9b77a3e78b9fe540e170be23d	2017-03-08 15:43:42 +05:30
sthangar	093bdb80c8	Checked in Unpacked DGEMM code Change-Id: I39dcc7b238b328f73ee2675d21a5e521d0488723	2017-03-07 13:35:50 +05:30
Kiran Varaganti	33923da9a1	Added variant 10 for double precision axpyv microkernel Change-Id: I7a20cc113a422603250bc450825c965136354974	2017-03-06 14:31:31 +05:30
Kiran Varaganti	bc828f7f8e	Added new axpyv (single precision) microkernel where it performs 10 FMAs per loop- This gives better performance than all other implementations of axpyv Change-Id: Ic4f0e4c67e367d67d0b24febcf34f81a70a39972	2017-03-03 14:45:35 +05:30
sthangar	c9949f4603	Checked in DGEMMTRSM and edge case handling routine in DDOTXF Change-Id: I65f00661af6c09b2507294fd43e0a10641c0597e	2017-03-01 11:14:34 +05:30
Devin Matthews	0e18f68cf1	Handle k=0 correctly in KNL dgemm ukernel.	2017-02-20 09:03:21 -06:00
Devin Matthews	7d42fc0796	Cast dim_t and inc_t parameters to 64-bit in KNL microkernels.	2017-02-19 21:10:55 -05:00
Kiran Varaganti	04245c9ff7	Reoptimized scalv routines - two vector multiplies are done per iteration, and these routines are enabled in bli_kernel.h Change-Id: Ic5654508573d1f6bde2edef06aefe117e581feb5	2017-02-10 14:24:30 +05:30
Kiran Varaganti	58b5b77e5f	Fixed a bug in axpyv, the arguments passed to intrinsic fmad instruction are corrected Change-Id: If12f24c6bc74b22ac9e4acd6b9378e06d79f2f5e	2017-02-08 21:43:34 +05:30
Kiran Varaganti	85de4ebf74	variant 4 axpyv single precision modified: explicitly used FMA intrinsics, replaced vector multiply and add operations Change-Id: I975feef56696d479d2b9e9441b0660021cf4f6ff	2017-02-08 14:41:04 +05:30
Kiran Varaganti	3fa53e8af3	Merged axpyv and gemm small in bli_kernel.h Merge branch 'amd-staging' of ssh://git.amd.com:29418/cpulibraries/er/blis into amd-staging modified: config/zen/bli_kernel.h modified: frame/3/gemm/bli_gemm_front.c modified: kernels/x86_64/zen/3/bli_gemm_small_matrix.c Change-Id: If181cf9345178c448b3530beb8bef453917fe295	2017-02-08 11:51:57 +05:30
sthangar	95be7b0470	Added logic for packing matrix A and prefetching matrix C in Unpacked SGEMM code Change-Id: I99efeca9eb5b4449286ec0ec133fd554ef1bb4f0	2017-02-08 11:24:10 +05:30
Kiran Varaganti	b5291a445b	Added optimization variant 4 for axpyv single precision - this performs 5 FMA per loop, keeping the IPC always full Change-Id: Ie77ed22584271136a257e673bcd3b1ba71136bc9	2017-02-07 12:39:31 +05:30
Kiran Varaganti	f4bfc1662a	New routines implemented for axpyv to improve performance for small vector sizes, vectorization is done for vectors as small as 8 (single precision) 4(double precision), since this operation has low compute to memory ratio, higher matrix sizes memory operations are dominating and hence not much gain - This still needs some work- added saxpyv and daxpyv var 3 routines in the file bli_axpyv_opt_var1.c Change-Id: Ic1b33bd5516e10113b00e44ab41b97eb19d46072	2017-02-06 15:04:27 +05:30
sthangar	574472ba5a	checked in unpacked SGEMM optimization Change-Id: I8e4ea374415c0c402c660b656fb076af15354181	2017-01-27 14:32:02 +05:30
praveeng	41595e98ee	Merge master code as on 2016_12_07 to amd-staging Change-Id: I5d9ecef9bff960aeb9b51ca4e4b21714e789e44f	2016-12-07 15:14:02 +05:30
sthangar	d625c49e20	checked-in SGEMMTRSM microkernel for Zen Change-Id: Ib61936418dea911b2154aa99f703b66e9669f94f	2016-12-01 16:17:09 +05:30
Francisco Igual	7f31a6307b	Fixed missing cntx argument in ARMv8 microkernels.	2016-11-27 14:40:47 +01:00
praveeng	d8f13beeea	Merge master code till 2016_11_25 to amd-staging	2016-11-25 17:31:08 +05:30
Field G. Van Zee	b3e58ee303	Reimplemented 4x12 haswell ukernels (real only). Details: - Replaced permutation-based implementations in bli_gemm_asm_d4x12.c, which defines 4x24 single real and 4x12 double real gemm microkernels, with broadcast-based implementations. (The previous microkernel file has been moved to an 'old' subdirectory.)	2016-11-23 17:58:26 -06:00
sthangar	9772218cae	Added optimized DAMAX routines for Zen Change-Id: I499c0c8f0f4ce6c19235c47b86d5608db6ba50f8	2016-11-16 15:19:19 +05:30
Kiran Varaganti	e35d3c23f2	Added new optimized micro-kernel for dotxv routine Change-Id: I2c544e9b25a454d971ad690353502a55cd668391	2016-11-10 14:30:53 +05:30
praveeng	0d13e9a4f6	bli_kernel.h Change-Id: I425d089f79497a0de7d1622e829c3ca9edf7f091	2016-11-07 14:40:41 +05:30
Field G. Van Zee	8a11a2174a	Updates to non-default haswell microkernels. Details: - Updated s and d microkernels in bli_gemm_asm_d8x6.c to relax alignment constraints. - Added missing c and z microkernels, which are based on the corresponding kernels in the d6x8 set. - This completes the d8x6 set (which may be used for situations when it is desirable to have a microkernel with a column preference).	2016-10-31 19:07:55 -05:00
Devin Matthews	11eb7957ab	Merge branch 'master' into knl # Conflicts: # frame/thread/bli_thread.h	2016-10-25 13:51:07 -05:00
Devin Matthews	cd5b668183	Don't use %rbp in KNL packing kernels.	2016-10-25 13:49:27 -05:00
Devin Matthews	5117d444f7	Change .align to .p2align in Bulldozer ukernels Apparently OSX doesn't allow .align directives for >16B, so I've changed these to their .p2align counterparts.	2016-10-24 16:20:47 -05:00
Kiran Varaganti	d250e6a3af	Merged TRSM and scalv routines into zen folder Change-Id: Ice897bc83e8fb70b90f23cc3ce892c39883aceb9	2016-10-20 14:34:39 +05:30
sthangar	1c2f7b57d5	Removed symlinks to zen kernels from haswell kernel folder and also modified the bli_kernel.h file accordingly Change-Id: Ib3736af48e851c8243bbe10d937fb942c49ad048	2016-10-18 15:06:35 +05:30
sthangar	7e04490002	Checked in the SAMAX optimizations Change-Id: I7faf8c3adf52ff01432188ad3b9866ee4b9a9dfd	2016-10-13 10:07:51 +05:30
Field G. Van Zee	b922d75634	Avoid compiling BLAS/CBLAS files when disabled. Details: - Updated the top-level Makefile, build/config.mk.in template, and configure script so that object files corresponding to source files belonging to the BLAS compatibility layer are not compiled (or archived) when the compatibility layer is disabled. (Same for CBLAS.) Thanks to Devin Matthews for suggesting this optimization. - Slight change to the way configure handles internal variables. Instead of converting (overwriting) some, such as enable_blas2blis and enable_cblas, from a "yes" or "no" to a "1" or "0" value, the latter are now stored in new variables that live alongside the originals (with the suffix "_01"). This is convenient since some values need to be sed-substituted into the config.mk.in template, which requires "yes" or "no", while some need to be written to the bli_config.h.in template, which requires "0" or "1". Updated BLIS4 TOMS citation in README.md. Added complex gemm micro-kernels for haswell. Details: - Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based architectures. As with their real domain brethren, these kernels prefer row storage (though this doesn't affect most users due to high-level optimizations in most level-3 operations that induce a transpose to whatever storage preference the kernel may have). Change-Id: I512ab90784ecbb7cdaee24928d2ccebb544ba5c1	2016-09-15 12:24:07 +05:30
Pradeep Rao	69826110ba	Merge "Implemented trsm single precision for lower triangular matrices; files added: bli_trsm_l_int_6x16.c; files modified: bli_kernel.h to enable optimized trsm microkernel, and test_trsm.c to test trsm single precision" into amd-staging	2016-09-14 03:26:25 -04:00
Field G. Van Zee	121c39d455	Added complex gemm micro-kernels for haswell. Details: - Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based architectures. As with their real domain brethren, these kernels prefer row storage (though this doesn't affect most users due to high-level optimizations in most level-3 operations that induce a transpose to whatever storage preference the kernel may have).	2016-09-05 13:11:42 -05:00
sthangar	64598ee4cf	fixed the symlink issue Change-Id: I2186d529f295c576597c189e1ae219bc1a83f955	2016-08-31 12:54:50 +05:30
sthangar	fdc6639023	Placed 1 and 1f AMD optimized AVX routines under zen folder Change-Id: I26795211ef11d232ed794ce36dd0a9c1f8706328	2016-08-29 10:43:38 +05:30
Kiran Varaganti	a58dd35ed7	Implemented trsm single precision for lower triangular matrices; files added: bli_trsm_l_int_6x16.c; files modified: bli_kernel.h to enable optimized trsm microkernel, and test_trsm.c to test trsm single precision Change-Id: Ibddf989f4aad577e89558673e1038cf6ece654d9	2016-08-26 14:55:12 +05:30
Devin Matthews	c8e4ef9395	Add prefetchw to 30x8 kernel.	2016-08-03 16:13:03 -05:00
Devin Matthews	4b5a2f3d6e	Merge remote-tracking branch 'origin/knl' into knl # Conflicts: # kernels/x86_64/knl/3/bli_dgemm_opt_24x8.c	2016-08-03 16:09:51 -05:00
Devin Matthews	380736bfe9	Add (new) 30x8 KNL kernel and fix non-scatter prefetch bug.	2016-08-03 16:08:28 -05:00
Devin Matthews	9f52a587de	Try prefetchw[t1] instead of regular prefetch for C.	2016-08-03 16:03:53 -05:00
Devin Matthews	8945a1512d	This version gets ~1550 GFLOPs on KNL with 16x4.	2016-08-03 11:28:24 -05:00
praveeng	cdfb3c3f29	Merge master code as on 2016_07_29 to amd-staging branch by praveeng Change-Id: Ic78b84d8b8d10158fb2a612f9a64bbc7b1f9b486	2016-07-29 12:46:21 +05:30
Devin Matthews	6ce4c022eb	Switch back to 24x8. I could only squeeze 24.5GFLOP out of 8x24, and scalability is not improved.	2016-07-27 16:26:36 -05:00
Field G. Van Zee	c31b1e7b9d	Relax alignment restrictions for sandybridge ukrs. Details: - Relaxed the base pointer and leading dimension alignment restrictions in the sandybridge gemm microkernels, allowing the use of vmovups/vmovupd instead of vmovaps/vmovapd. These changes mimic those made to the haswell microkernels in `e0d2fa0` and `ee2c139`. - Updated testsuite modules as well as standalone test drivers in 'test' directory to use DBL_MAX as the initial time candidate. Thanks to Devin Matthews for suggesting this change. - Inserted #include "float.h" into bli_system.h (to gain access to DBL_MAX). - Minor update (vis-a-vis contexts) to driver code in test/3m4m.	2016-07-27 15:58:07 -05:00
Devin Matthews	b8f2b55532	Try an 8x24 kernel for the hell of it.	2016-07-27 15:22:55 -05:00
Devin Matthews	ad89ed2e82	Merge branch 'knl' of github.com:devinamatthews/blis into knl	2016-07-27 11:45:40 -05:00
Devin Matthews	2c9de740ed	This version gets ~26GF on one core.	2016-07-27 11:44:54 -05:00
Devin Matthews	81e2b05f31	Add optimized packing kernels for KNL.	2016-07-27 11:39:05 -05:00
Devin Matthews	a7d8ca97b8	All fixed.	2016-07-25 15:15:13 -05:00


162 Commits