amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-13 18:52:14 +00:00

Author	SHA1	Message	Date
Kiran Varaganti	bc828f7f8e	Added new axpyv (single precision) microkernel where it performs 10 FMAs per loop- This gives better performance than all other implementations of axpyv Change-Id: Ic4f0e4c67e367d67d0b24febcf34f81a70a39972	2017-03-03 14:45:35 +05:30
sthangar	c9949f4603	Checked in DGEMMTRSM and edge case handling routine in DDOTXF Change-Id: I65f00661af6c09b2507294fd43e0a10641c0597e	2017-03-01 11:14:34 +05:30
Field G. Van Zee	a509fbd5ac	Merge branch 'master' into 1m	2017-02-21 17:06:16 -06:00
Kiran Varaganti	04245c9ff7	Reoptimized scalv routines - two vector multiplies are done per iteration, and these routines are enabled in bli_kernel.h Change-Id: Ic5654508573d1f6bde2edef06aefe117e581feb5	2017-02-10 14:24:30 +05:30
Kiran Varaganti	3fa53e8af3	Merged axpyv and gemm small in bli_kernel.h Merge branch 'amd-staging' of ssh://git.amd.com:29418/cpulibraries/er/blis into amd-staging modified: config/zen/bli_kernel.h modified: frame/3/gemm/bli_gemm_front.c modified: kernels/x86_64/zen/3/bli_gemm_small_matrix.c Change-Id: If181cf9345178c448b3530beb8bef453917fe295	2017-02-08 11:51:57 +05:30
sthangar	95be7b0470	Added logic for packing matrix A and prefetching matrix C in Unpacked SGEMM code Change-Id: I99efeca9eb5b4449286ec0ec133fd554ef1bb4f0	2017-02-08 11:24:10 +05:30
Kiran Varaganti	b5291a445b	Added optimization variant 4 for axpyv single precision - this performs 5 FMA per loop, keeping the IPC always full Change-Id: Ie77ed22584271136a257e673bcd3b1ba71136bc9	2017-02-07 12:39:31 +05:30
Kiran Varaganti	f4bfc1662a	New routines implemented for axpyv to improve performance for small vector sizes, vectorization is done for vectors as small as 8 (single precision) 4(double precision), since this operation has low compute to memory ratio, higher matrix sizes memory operations are dominating and hence not much gain - This still needs some work- added saxpyv and daxpyv var 3 routines in the file bli_axpyv_opt_var1.c Change-Id: Ic1b33bd5516e10113b00e44ab41b97eb19d46072	2017-02-06 15:04:27 +05:30
Devin Matthews	78e1b16e16	Change default threading parameters for KNL.	2017-01-27 14:22:20 -06:00
sthangar	574472ba5a	checked in unpacked SGEMM optimization Change-Id: I8e4ea374415c0c402c660b656fb076af15354181	2017-01-27 14:32:02 +05:30
sthangar	d625c49e20	checked-in SGEMMTRSM microkernel for Zen Change-Id: Ib61936418dea911b2154aa99f703b66e9669f94f	2016-12-01 16:17:09 +05:30
Field G. Van Zee	126482a3b6	Implemented the 1m method. Details: - Implemented the 1m method for inducing complex domain matrix multiplication. 1m support has been added to all level-3 operations, including trsm, and is now the default induced method when native complex domain gemm microkernels are omitted from the configuration. - Updated _cntx_init() operations to take a datatype parameter. This was needed for the corresponding function for 1m (because 1m requires us to choose between column-oriented or row-oriented execution, which requires us to query the context for the storage preference of the gemm microkernel, which requires knowing the datatype) but I decided that it made sense for consistency to add the parameter to all other cntx initialization functions as well, even though those functions don't use the parameter. - Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take a second scalar for each blocksize entry. The semantic meaning of the two scalars now is that the first will scale the default blocksize while the second will scale the maximum blocksize. This allows scaling the two independently, and was needed to support 1m, which requires scaling for a register blocksize but not the register storage blocksize (ie: "packdim") analogue. - Deprecated bli_blksz_reduce_dt_to() and defined two new functions, bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing default and maximum blocksizes to some desired blocksize multiple. These functions are needed in the updated definitions of bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs(). - Added support for the 1e and 1r packing schemas to packm, including 1e/1r packing kernels. - Added a minor optimization to bli_gemm_ker_var2() that allows, under certain circumstances (specifically, real domain beta and row- or column-stored matrix C), the real domain macrokernel and microkernel to be called directly, rather than using the virtual microkernel via the complex domain macrokernel, which carries a slight additional amount of overhead. - Added 1m support to the testsuite. - Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified some code in test_gemm.c driver.	2016-11-25 18:29:49 -06:00
praveeng	d8f13beeea	Merge master code till 2016_11_25 to amd-staging	2016-11-25 17:31:08 +05:30
Field G. Van Zee	b3e58ee303	Reimplemented 4x12 haswell ukernels (real only). Details: - Replaced permutation-based implementations in bli_gemm_asm_d4x12.c, which defines 4x24 single real and 4x12 double real gemm microkernels, with broadcast-based implementations. (The previous microkernel file has been moved to an 'old' subdirectory.)	2016-11-23 17:58:26 -06:00
sthangar	9772218cae	Added optimized DAMAX routines for Zen Change-Id: I499c0c8f0f4ce6c19235c47b86d5608db6ba50f8	2016-11-16 15:19:19 +05:30
Kiran Varaganti	e35d3c23f2	Added new optimized micro-kernel for dotxv routine Change-Id: I2c544e9b25a454d971ad690353502a55cd668391	2016-11-10 14:30:53 +05:30
praveeng	0d13e9a4f6	bli_kernel.h Change-Id: I425d089f79497a0de7d1622e829c3ca9edf7f091	2016-11-07 14:40:41 +05:30
Devin Matthews	8f9010542c	Fix some problems with OSX builds: - Update CPU detection for Intel archs (esp. Skylake) - Allow clang for the reference config	2016-11-02 11:18:32 -05:00
Field G. Van Zee	8a11a2174a	Updates to non-default haswell microkernels. Details: - Updated s and d microkernels in bli_gemm_asm_d8x6.c to relax alignment constraints. - Added missing c and z microkernels, which are based on the corresponding kernels in the d6x8 set. - This completes the d8x6 set (which may be used for situations when it is desirable to have a microkernel with a column preference).	2016-10-31 19:07:55 -05:00
Jeff Hammond	c2c91e09b4	never use libm with Intel compilers Intel compilers include a highly optimized math library (libimf) that should be used instead of GNU libm. yes, this change is for ALL targets, including those that are not supported by the Intel compiler. there is no harm in doing this, and it is future-proof in the event that the Intel compilers support other architectures.	2016-10-25 21:15:26 -07:00
Devin Matthews	11eb7957ab	Merge branch 'master' into knl # Conflicts: # frame/thread/bli_thread.h	2016-10-25 13:51:07 -05:00
Kiran Varaganti	e044fa6240	Changed double precision trsm kernel macro definition to bli_dtrsm_l_int_6x8 from 6x16 : it fixes the seg fault Change-Id: Ia8c1de5fe13a370d691570a50136d55ffb18908a	2016-10-25 13:03:05 +05:30
Kiran Varaganti	d250e6a3af	Merged TRSM and scalv routines into zen folder Change-Id: Ice897bc83e8fb70b90f23cc3ce892c39883aceb9	2016-10-20 14:34:39 +05:30
sthangar	1c2f7b57d5	Removed symlinks to zen kernels from haswell kernel folder and also modified the bli_kernel.h file accordingly Change-Id: Ib3736af48e851c8243bbe10d937fb942c49ad048	2016-10-18 15:06:35 +05:30
sthangar	7e04490002	Checked in the SAMAX optimizations Change-Id: I7faf8c3adf52ff01432188ad3b9866ee4b9a9dfd	2016-10-13 10:07:51 +05:30
praveeng	f2e7ea113a	conflicts merge for bli_kernel.h Change-Id: I15d846bd34e11f86ebfd7ed091ff671a1f3366a0	2016-10-06 12:35:30 +05:30
sthangar	133983c36f	code clean up in bli_kernel.h Change-Id: I11d9cdf2af8e8199209eb084f6c3a7c910b83d5d	2016-10-06 11:26:22 +05:30
Field G. Van Zee	b922d75634	Avoid compiling BLAS/CBLAS files when disabled. Details: - Updated the top-level Makefile, build/config.mk.in template, and configure script so that object files corresponding to source files belonging to the BLAS compatibility layer are not compiled (or archived) when the compatibility layer is disabled. (Same for CBLAS.) Thanks to Devin Matthews for suggesting this optimization. - Slight change to the way configure handles internal variables. Instead of converting (overwriting) some, such as enable_blas2blis and enable_cblas, from a "yes" or "no" to a "1" or "0" value, the latter are now stored in new variables that live alongside the originals (with the suffix "_01"). This is convenient since some values need to be sed-substituted into the config.mk.in template, which requires "yes" or "no", while some need to be written to the bli_config.h.in template, which requires "0" or "1". Updated BLIS4 TOMS citation in README.md. Added complex gemm micro-kernels for haswell. Details: - Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based architectures. As with their real domain brethren, these kernels prefer row storage, (though this doesn't affect most users due to high-level optimizations in most level-3 operations that induce a transpose to whatever storage preference the kernel may have). Change-Id: I512ab90784ecbb7cdaee24928d2ccebb544ba5c1	2016-09-15 12:24:07 +05:30
Pradeep Rao	69826110ba	Merge "Implemented trsm single precision for lower triangular matrices, files added bli_trsm_l_int_6x16.c files modified bli_kernel.h to enable optimized trsm microkernel and test_trsm.c is modified to test trsm single precision" into amd-staging	2016-09-14 03:26:25 -04:00
Field G. Van Zee	121c39d455	Added complex gemm micro-kernels for haswell. Details: - Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based architectures. As with their real domain brethren, these kernels prefer row storage, (though this doesn't affect most users due to high-level optimizations in most level-3 operations that induce a transpose to whatever storage preference the kernel may have).	2016-09-05 13:11:42 -05:00
sthangar	64598ee4cf	fixed the symlink issue Change-Id: I2186d529f295c576597c189e1ae219bc1a83f955	2016-08-31 12:54:50 +05:30
sthangar	fdc6639023	Placed 1 and 1f AMD optimized AVX routines under zen folder Change-Id: I26795211ef11d232ed794ce36dd0a9c1f8706328	2016-08-29 10:43:38 +05:30
Kiran Varaganti	a58dd35ed7	Implemented trsm single precision for lower triangular matrices, files added bli_trsm_l_int_6x16.c files modified bli_kernel.h to enable optimized trsm microkernel and test_trsm.c is modified to test trsm single precision Change-Id: Ibddf989f4aad577e89558673e1038cf6ece654d9	2016-08-26 14:55:12 +05:30
Devin Matthews	c8e4ef9395	Add prefetchw to 30x8 kernel.	2016-08-03 16:13:03 -05:00
Devin Matthews	380736bfe9	Add (new) 30x8 KNL kernel and fix non-scatter prefetch bug.	2016-08-03 16:08:28 -05:00
Devin Matthews	8945a1512d	This version gets ~1550 GFLOPs on KNL with 16x4.	2016-08-03 11:28:24 -05:00
Devin Matthews	6ce4c022eb	Switch back to 24x8. I could only squeeze 24.5GFLOP out of 8x24, and scalability is not improved.	2016-07-27 16:26:36 -05:00
Devin Matthews	b8f2b55532	Try an 8x24 kernel for the hell of it.	2016-07-27 15:22:55 -05:00
Devin Matthews	7ede5863ae	Allocate pack buffer on MCDRAM for KNL.	2016-07-27 13:42:32 -06:00
Devin Matthews	81e2b05f31	Add optimized packing kernels for KNL.	2016-07-27 11:39:05 -05:00
Devin Matthews	65735bbedf	Switch to 24x8 kernel, unrolled by 16.	2016-07-24 21:50:32 -05:00
Devin Matthews	8c6e621c09	Make knl conform to new kernel dir structure.	2016-07-22 15:05:15 -05:00
Devin Matthews	119d039942	Add 8x24 KNL kernel.	2016-07-22 10:23:31 -05:00
praveeng	1aa77dfc1d	Merge master code as on 2016_07_21 to amd-staging branch by praveeng Change-Id: Ic7d0a21101358f08147736e7f1884e7409937344	2016-07-21 14:23:41 +05:30
Devin Matthews	b58cda9eba	Merge remote-tracking branch 'origin/master' into knl # Conflicts: # frame/base/bli_threading.h # frame/include/blis.h # frame/thread/bli_thread.c	2016-07-19 14:09:09 -05:00
sthangar	9101a9c880	Checked in optimized 1V kernels along with benchmark codes. Also incorporated review comments for 1F kernels Change-Id: I035c0d39e6b0bed28e6e2041242186c49f6ed55b	2016-07-13 16:51:14 +05:30
Devin Matthews	318f063dcb	Add new KNL microkernel derived from Haswell.	2016-06-08 17:46:50 -05:00
Field G. Van Zee	096895c5d5	Reorganized code, APIs related to multithreading. Details: - Reorganized code and renamed files defining APIs related to multithreading. All code that is not specific to a particular operation is now located in a new directory: frame/thread. Code is now organized, roughly, by the namespace to which it belongs (see below). - Consolidated all operation-specific _thrinfo_t object types into a single thrinfo_t object type. Operation-specific level-3 _thrinfo_t APIs were also consolidated, leaving bli_l3_thrinfo_() and bli_packm_thrinfo_() functions (aside from a few general purpose bli_thrinfo_() functions). - Renamed thread_comm_t object type to thrcomm_t. - Renamed many of the routines and functions (and macros) for multithreading. We now have the following API namespaces: - bli_thrinfo_(): functions related to thrinfo_t objects - bli_thrcomm_(): functions related to thrcomm_t objects. - bli_thread_(): general-purpose functions, such as initialization, finalization, and computing ranges. (For now, some macros, such as bli_thread_[io]broadcast() and bli_thread_[io]barrier() use the bli_thread_ namespace prefix, even though bli_thrinfo_ may be more appropriate.) - Renamed thread-related macros so that they use a bli_ prefix. - Renamed control tree-related macros so that they use a bli_ prefix (to be consistent with the thread-related macros that were also renamed). - Removed #undef BLIS_SIMD_ALIGN_SIZE from dunnington's bli_kernel.h. This #undef was a temporary fix to some macro defaults which were being applied in the wrong order, which was recently fixed.	2016-06-06 13:32:04 -05:00
Devin Matthews	e3bd5ca64a	Fix SIMD definitions in KNL config, and a couple of fixes to C update.	2016-05-12 20:54:13 -05:00
Tyler Smith	4dcd37eb1b	fixing knc simd align size	2016-05-10 16:28:59 -05:00

357 Commits