amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-12 01:59:59 +00:00

Author	SHA1	Message	Date
Marat Dukhan	1016383307	Fix Emscripten builds	2017-12-11 12:08:58 +05:30
Kiran Varaganti	ee86906616	Improved efficiency of dGEMM for large matrices by reducing TLB load misses and majorly L3 cache misses. This is achieved by changing the packed block sizes of matrix A & B. Now the optimum values are MC_D = 510 and KC_D = 1024. Change-Id: I2d8bdd5f62f2d1f8782ae2997f3d7a26587d1ca4	2017-12-11 12:08:58 +05:30
Devin Matthews	25d0e61854	Revert "Change PACKDIM_MR (double) for haswell to 8." This reverts commit `681eec913d`.	2017-12-11 12:07:31 +05:30
Devin Matthews	c5bdd84b35	Change PACKDIM_MR (double) for haswell to 8.	2017-12-11 12:07:31 +05:30
Field G. Van Zee	172789d562	Restored deleted lines from makefile fragments.	2017-12-11 12:07:31 +05:30
Devin Matthews	49438409ee	Remove shebangs from makefiles.	2017-12-11 12:07:31 +05:30
J M Dieterich	497e264047	Fix if/else structure. Thanks to TravisCI.	2017-12-11 12:07:31 +05:30
J M Dieterich	835035c56a	Mark piledriver compilable w/ clang.	2017-12-11 12:06:40 +05:30
J M Dieterich	6cdb533472	Mark bulldozer compilable w/ clang.	2017-12-11 12:06:40 +05:30
J M Dieterich	a85697d622	Correct error message.	2017-12-11 12:06:40 +05:30
J M Dieterich	e0c64cad27	Indeed one can compile for carrizo also using clang.	2017-12-11 12:06:40 +05:30
J M Dieterich	4aafe0505d	A bunch of shebang fixes from unportable /bin/bash to portable /usr/bin/env bash	2017-12-11 12:06:40 +05:30
Jeff Hammond	9700f0e578	allow KNL build without hbwmalloc.h (i.e. emulated) we want to be able to run BLIS KNL binaries on non-KNL machines via SDE. although it is possible to install hbwmalloc implementation on such systems, it is easier not to, since obviously the performance of SDE execution is not representative so there is no reason to emulate HBW allocation.	2017-12-11 12:05:22 +05:30
Field G. Van Zee	1d728ccb23	Implemented the 1m method. Details: - Implemented the 1m method for inducing complex domain matrix multiplication. 1m support has been added to all level-3 operations, including trsm, and is now the default induced method when native complex domain gemm microkernels are omitted from the configuration. - Updated _cntx_init() operations to take a datatype parameter. This was needed for the corresponding function for 1m (because 1m requires us to choose between column-oriented or row-oriented execution, which requires us to query the context for the storage preference of the gemm microkernel, which requires knowing the datatype) but I decided that it made sense for consistency to add the parameter to all other cntx initialization functions as well, even though those functions don't use the parameter. - Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take a second scalar for each blocksize entry. The semantic meaning of the two scalars now is that the first will scale the default blocksize while the second will scale the maximum blocksize. This allows scaling the two independently, and was needed to support 1m, which requires scaling for a register blocksize but not the register storage blocksize (ie: "packdim") analogue. - Deprecated bli_blksz_reduce_dt_to() and defined two new functions, bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing default and maximum blocksizes to some desired blocksize multiple. These functions are needed in the updated definitions of bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs(). - Added support for the 1e and 1r packing schemas to packm, including 1e/1r packing kernels. - Added a minor optimization to bli_gemm_ker_var2() that allows, under certain circumstances (specifically, real domain beta and row- or column-stored matrix C), the real domain macrokernel and microkernel to be called directly, rather than using the virtual microkernel via the complex domain macrokernel, which carries a slight additional amount of overhead. - Added 1m support to the testsuite. - Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified some code in test_gemm.c driver.	2017-12-11 11:55:31 +05:30
Jeff Hammond	0d1b90286e	never use libm with Intel compilers Intel compilers include a highly optimized math library (libimf) that should be used instead of GNU libm. yes, this change is for ALL targets, including those that are not supported by the Intel compiler. there is no harm in doing this, and it is future-proof in the event that the Intel compilers support other architectures.	2017-12-11 11:52:25 +05:30
Kiran Varaganti	0b19029342	Code cleanup, removed warnings from trsm, removed unused routines in axpyv & scalv Change-Id: I02867f394c5f416194c4b1769a6c75f39243ec81	2017-03-14 14:51:31 +05:30
praveeng	825363bd2a	Merge code from master to amd-staging as on 2017_03_08 by praveeng Change-Id: I80740081b2cb54c9b77a3e78b9fe540e170be23d	2017-03-08 15:43:42 +05:30
sthangar	093bdb80c8	Checked in Unpacked DGEMM code Change-Id: I39dcc7b238b328f73ee2675d21a5e521d0488723	2017-03-07 13:35:50 +05:30
Kiran Varaganti	33923da9a1	Added variant 10 for double precision axpyv microkernel Change-Id: I7a20cc113a422603250bc450825c965136354974	2017-03-06 14:31:31 +05:30
Kiran Varaganti	bc828f7f8e	Added new axpyv (single precision) microkernel where it performs 10 FMAs per loop- This gives better performance than all other implementations of axpyv Change-Id: Ic4f0e4c67e367d67d0b24febcf34f81a70a39972	2017-03-03 14:45:35 +05:30
sthangar	c9949f4603	Checked in DGEMMTRSM and edge case handling routine in DDOTXF Change-Id: I65f00661af6c09b2507294fd43e0a10641c0597e	2017-03-01 11:14:34 +05:30
Kiran Varaganti	04245c9ff7	Reoptimized scalv routines - two vector multiplies are done per iteration, and these routines are enabled in bli_kernel.h Change-Id: Ic5654508573d1f6bde2edef06aefe117e581feb5	2017-02-10 14:24:30 +05:30
Kiran Varaganti	3fa53e8af3	Merged axpyv and gemm small in bli_kernel.h Merge branch 'amd-staging' of ssh://git.amd.com:29418/cpulibraries/er/blis into amd-staging modified: config/zen/bli_kernel.h modified: frame/3/gemm/bli_gemm_front.c modified: kernels/x86_64/zen/3/bli_gemm_small_matrix.c Change-Id: If181cf9345178c448b3530beb8bef453917fe295	2017-02-08 11:51:57 +05:30
sthangar	95be7b0470	Added logic for packing matrix A and prefetching matrix C in Unpacked SGEMM code Change-Id: I99efeca9eb5b4449286ec0ec133fd554ef1bb4f0	2017-02-08 11:24:10 +05:30
Kiran Varaganti	b5291a445b	Added optimization variant 4 for axpyv single precision - this performs 5 FMA per loop, keeping the IPC always full Change-Id: Ie77ed22584271136a257e673bcd3b1ba71136bc9	2017-02-07 12:39:31 +05:30
Kiran Varaganti	f4bfc1662a	New routines implemented for axpyv to improve performance for small vector sizes, vectorization is done for vectors as small as 8 (single precision) 4(double precision), since this operation has low compute to memory ratio, higher matrix sizes memory operations are dominating and hence not much gain - This still needs some work- added saxpyv and daxpyv var 3 routines in the file bli_axpyv_opt_var1.c Change-Id: Ic1b33bd5516e10113b00e44ab41b97eb19d46072	2017-02-06 15:04:27 +05:30
Devin Matthews	78e1b16e16	Change default threading parameters for KNL.	2017-01-27 14:22:20 -06:00
sthangar	574472ba5a	checked in unpacked SGEMM optimization Change-Id: I8e4ea374415c0c402c660b656fb076af15354181	2017-01-27 14:32:02 +05:30
sthangar	d625c49e20	checked-in SGEMMTRSM microkernel for Zen Change-Id: Ib61936418dea911b2154aa99f703b66e9669f94f	2016-12-01 16:17:09 +05:30
praveeng	d8f13beeea	Merge master code till 2016_11_25 to amd-staging	2016-11-25 17:31:08 +05:30
Field G. Van Zee	b3e58ee303	Reimplemented 4x12 haswell ukernels (real only). Details: - Replaced permutation-based implementations in bli_gemm_asm_d4x12.c, which defines 4x24 single real and 4x12 double real gemm microkernels, with broadcast-based implementations. (The previous microkernel file has been moved to an 'old' subdirectory.)	2016-11-23 17:58:26 -06:00
sthangar	9772218cae	Added optimized DAMAX routines for Zen Change-Id: I499c0c8f0f4ce6c19235c47b86d5608db6ba50f8	2016-11-16 15:19:19 +05:30
Kiran Varaganti	e35d3c23f2	Added new optimized micro-kernel for dotxv routine Change-Id: I2c544e9b25a454d971ad690353502a55cd668391	2016-11-10 14:30:53 +05:30
praveeng	0d13e9a4f6	bli_kernel.h Change-Id: I425d089f79497a0de7d1622e829c3ca9edf7f091	2016-11-07 14:40:41 +05:30
Devin Matthews	8f9010542c	Fix some problems with OSX builds: - Update CPU detection for Intel archs (esp. Skylake) - Allow clang for the reference config	2016-11-02 11:18:32 -05:00
Field G. Van Zee	8a11a2174a	Updates to non-default haswell microkernels. Details: - Updated s and d microkernels in bli_gemm_asm_d8x6.c to relax alignment constraints. - Added missing c and z microkernels, which are based on the corresponding kernels in the d6x8 set. - This completes the d8x6 set (which may be used for situations when it is desirable to have a microkernel with a column preference).	2016-10-31 19:07:55 -05:00
Devin Matthews	11eb7957ab	Merge branch 'master' into knl # Conflicts: # frame/thread/bli_thread.h	2016-10-25 13:51:07 -05:00
Kiran Varaganti	e044fa6240	Changed double precision trsm kernel macro definition to bli_dtrsm_l_int_6x8 from 6x16 : it fixes the seg fault Change-Id: Ia8c1de5fe13a370d691570a50136d55ffb18908a	2016-10-25 13:03:05 +05:30
Kiran Varaganti	d250e6a3af	Merged TRSM and scalv routines into zen folder Change-Id: Ice897bc83e8fb70b90f23cc3ce892c39883aceb9	2016-10-20 14:34:39 +05:30
sthangar	1c2f7b57d5	Removed symlinks to zen kernels from haswell kernel folder and also modified the bli_kernel.h file accordingly Change-Id: Ib3736af48e851c8243bbe10d937fb942c49ad048	2016-10-18 15:06:35 +05:30
sthangar	7e04490002	Checked in the SAMAX optimizations Change-Id: I7faf8c3adf52ff01432188ad3b9866ee4b9a9dfd	2016-10-13 10:07:51 +05:30
praveeng	f2e7ea113a	conflicts merge for bli_kernel.h Change-Id: I15d846bd34e11f86ebfd7ed091ff671a1f3366a0	2016-10-06 12:35:30 +05:30
sthangar	133983c36f	code clean up in bli_kernel.h Change-Id: I11d9cdf2af8e8199209eb084f6c3a7c910b83d5d	2016-10-06 11:26:22 +05:30
Field G. Van Zee	b922d75634	Avoid compiling BLAS/CBLAS files when disabled. Details: - Updated the top-level Makefile, build/config.mk.in template, and configure script so that object files corresponding to source files belonging to the BLAS compatibility layer are not compiled (or archived) when the compatibility layer is disabled. (Same for CBLAS.) Thanks to Devin Matthews for suggesting this optimization. - Slight change to the way configure handles internal variables. Instead of converting (overwriting) some, such as enable_blas2blis and enable_cblas, from a "yes" or "no" to a "1" or "0" value, the latter are now stored in new variables that live alongside the originals (with the suffix "_01"). This is convenient since some values need to be sed-substituted into the config.mk.in template, which requires "yes" or "no", while some need to be written to the bli_config.h.in template, which requires "0" or "1". Updated BLIS4 TOMS citation in README.md. Added complex gemm micro-kernels for haswell. Details: - Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based architectures. As with their real domain brethren, these kernels prefer row storage, (though this doesn't affect most users due to high-level optimizations in most level-3 operations that induce a transpose to whatever storage preference the kernel may have). Change-Id: I512ab90784ecbb7cdaee24928d2ccebb544ba5c1	2016-09-15 12:24:07 +05:30
Pradeep Rao	69826110ba	Merge "Implemented trsm single precision for lower triangular matrices, files added bli_trsm_l_int_6x16.c files modified bli_kernel.h to enable optimized trsm microkernel and test_trsm.c is modified to test trsm single precision" into amd-staging	2016-09-14 03:26:25 -04:00
Field G. Van Zee	121c39d455	Added complex gemm micro-kernels for haswell. Details: - Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based architectures. As with their real domain brethren, these kernels prefer row storage, (though this doesn't affect most users due to high-level optimizations in most level-3 operations that induce a transpose to whatever storage preference the kernel may have).	2016-09-05 13:11:42 -05:00
sthangar	64598ee4cf	fixed the symlink issue Change-Id: I2186d529f295c576597c189e1ae219bc1a83f955	2016-08-31 12:54:50 +05:30
sthangar	fdc6639023	Placed 1 and 1f AMD optimized AVX routines under zen folder Change-Id: I26795211ef11d232ed794ce36dd0a9c1f8706328	2016-08-29 10:43:38 +05:30
Kiran Varaganti	a58dd35ed7	Implemented trsm single precision for lower triangular matrices, files added bli_trsm_l_int_6x16.c files modified bli_kernel.h to enable optimized trsm microkernel and test_trsm.c is modified to test trsm single precision Change-Id: Ibddf989f4aad577e89558673e1038cf6ece654d9	2016-08-26 14:55:12 +05:30
Devin Matthews	c8e4ef9395	Add prefetchw to 30x8 kernel.	2016-08-03 16:13:03 -05:00

223 Commits