We were using the add_compile_options(-Xclang -fopenmp) statement to set
the OpenMP compiler flags for MSVC builds via CMake. A performance
regression was observed due to the compiler version used in MSVC
(clang 10), so the statement is removed from the Windows build system;
the compiler version (clang 13) and compiler options are instead
configured manually in the MSVC GUI to recover performance on the MATLAB bench.
Change-Id: I37d778abdceb7c1fae9b1caaeea8adb114677dd2
All AMD-specific optimizations in BLIS are enclosed in the BLIS_CONFIG_EPYC
preprocessor macro. This macro was not defined in the CMake build, which
resulted in overall lower performance.
Updated version number to 3.1.1
Change-Id: I9848b695a599df07da44e77e71a64414b28c75b9
Enabled the AVX-512 Skylake kernels in the zen4 configuration.
AVX-512 kernels are added for the float and double types.
AMD-Internal: [CPUPL-2108]
Change-Id: Idfe3f64a037db019cbdf43318954db52ad241a51
Details:
- Added a new sandbox called 'gemmlike', which implements sequential and
multithreaded gemm in the style of gemmsup but also unconditionally
employs packing. The purpose of this sandbox is to
(1) avoid select abstractions, such as objects and control trees, in
order to allow readers to better understand how a real-world
implementation of high-performance gemm can be constructed;
(2) provide a starting point for expert users who wish to build
something that is gemm-like without "reinventing the wheel."
Thanks to Jeff Diamond, Tze Meng Low, Nicholai Tukanov, and Devangi
Parikh for requesting and inspiring this work.
- The functions defined in this sandbox currently use the "bls_" prefix
instead of "bli_" in order to avoid any symbol collisions in the main
library.
- The sandbox contains two variants, each of which implements gemm via a
block-panel algorithm. The only difference between the two is that
variant 1 calls the microkernel directly while variant 2 calls the
microkernel indirectly, via a function wrapper, which allows the edge
case handling to be abstracted away from the classic five loops.
- This sandbox implementation utilizes the conventional gemm microkernel
(not the skinny/unpacked gemmsup kernels).
- Fixed some typos in the comments of a few files in the main
  framework.
Change-Id: Ifc3c50e9fd0072aada38eace50c57552c88cc6cf
Details:
- Implemented a new feature called addons, which are similar to
sandboxes except that there is no requirement to define gemm or any
other particular operation.
- Updated configure to accept --enable-addon=<name> or -a <name> syntax
for requesting an addon be included within a BLIS build. configure now
outputs the list of enabled addons into config.mk. It also outputs the
corresponding #include directives for the addons' headers to a new
companion to the bli_config.h header file named bli_addon.h. Because
addons may wish to make use of existing BLIS types within their own
definitions, the addons' headers must be included sometime after that
of bli_config.h (which currently is #included before bli_type_defs.h).
This is why the #include directives needed to go into a new top-level
header file rather than the existing bli_config.h file.
- Added a markdown document, docs/Addons.md, to explain addons, how to
build with them, and what assumptions their authors should keep in
mind as they create them.
- Added a gemmlike-like implementation of sandwich gemm called 'gemmd'
as an addon in addon/gemmd. The code uses a 'bao_' prefix for local
functions, including the user-level object and typed APIs.
- Updated .gitignore so that git ignores bli_addon.h files.
Change-Id: Ie7efdea366481ce25075cb2459bdbcfd52309717
- Altered the framework to use 2 more fused kernels for
better problem decomposition
- Increased unroll factor in AXPYF5 and AXPYF8 kernels
to improve register usage
AMD-Internal: [CPUPL-1970]
Change-Id: I79750235d9554466def5ff93898f832834990343
We added a new API to check whether the CPU architecture supports
AVX instructions. This API invoked the CPUID instruction every time
it was called. However, since this information does not change at
runtime, it is sufficient to determine it once and use the cached
result for subsequent calls. This optimization is needed to improve
performance for small-size matrix and vector operations.
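The caching described above can be sketched as follows. This is an illustrative example, not the actual BLIS code: the function name bls_cpu_has_avx is hypothetical, and the compiler builtin __builtin_cpu_supports (GCC/Clang, x86) stands in for the raw CPUID query.

```c
#include <stdbool.h>

/* Hypothetical sketch: query AVX support once and cache the result.
   bls_cpu_has_avx is an illustrative name, not a real BLIS symbol. */
static bool avx_checked = false;  /* has the query been performed yet? */
static bool avx_cached  = false;  /* cached answer from the first query */

bool bls_cpu_has_avx(void)
{
    if (!avx_checked)
    {
        /* The expensive CPUID-based query runs only on the first call. */
        avx_cached  = __builtin_cpu_supports("avx") != 0;
        avx_checked = true;
    }
    return avx_cached;
}
```

Subsequent calls return the cached flag without re-executing CPUID, which is what makes the check cheap enough for small-size hot paths.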
AMD-Internal: [CPUPL-2009]
Change-Id: If6697e1da6dd6b7f28fbfed45215ea3fdd569c5f
Details: Changes made to the 4.0 branch to enable the wrapper code by
default, and also removed the ENABLE_API_WRAPPER macro.
Change-Id: I5c9ede7ae959d811bc009073a266e66cbf07ef1a
- Removed the BLIS_CONFIG_EPYC macro
- The code dependent on this macro is now handled in
one of three ways:
-- It is updated to work across platforms.
-- Architecture/feature-specific runtime checks are added.
-- It is duplicated in AMD-specific files. The build system is updated to
pick the AMD-specific files when the library is built for any of the
zen architectures.
AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
- Optimized the dotxf implementation for the double-precision and
single-precision complex datatypes by handling the dot product
computation in 2x6 and 4x6 tiles, processing 6 columns at a time
and rows in multiples of 2 and 4.
- The dot product computation is arranged such that multiple rho
vector registers hold the temporary results until the end of the
loop, after which a horizontal addition produces the final dot
product result.
- Corner cases are handled serially.
- Vector registers are used optimally and reused for
faster computation.
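The accumulation scheme described above can be illustrated with a scalar sketch (the real kernel uses vector registers and tiles over columns; here a single dot product with four partial "rho" accumulators shows the idea):

```c
/* Scalar sketch of the accumulation scheme: partial rho accumulators
   hold running sums until the end of the loop, then a final
   "horizontal addition" produces the dot product. */
double dot_multi_acc(const double *x, const double *y, int n)
{
    double rho0 = 0.0, rho1 = 0.0, rho2 = 0.0, rho3 = 0.0;
    int i = 0;
    for (; i + 4 <= n; i += 4)  /* rows in multiples of 4 */
    {
        rho0 += x[i + 0] * y[i + 0];
        rho1 += x[i + 1] * y[i + 1];
        rho2 += x[i + 2] * y[i + 2];
        rho3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; ++i)          /* corner cases handled serially */
        rho0 += x[i] * y[i];
    /* horizontal addition of the partial accumulators */
    return (rho0 + rho1) + (rho2 + rho3);
}
```

Keeping several independent accumulators breaks the loop-carried dependence on a single sum, which is what lets the vectorized kernel keep multiple rho registers busy per iteration.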
AMD-Internal: [CPUPL-1975]
Change-Id: I7dd305e73adf54100d54661769c7d5aada9b0098
-The current gemm SUP path uses bli_thrinfo_sup_grow and bli_thread_range_sub
to generate per-thread data ranges at each loop of the gemm algorithm.
bli_thrinfo_sup_grow involves multiple barriers for cross-thread
synchronization. These barriers are necessary in cases where
either the A or B matrix is packed, for centralized pack buffer
allocation/deallocation (by the bli_thread_am_ochief thread).
-However, for cases where both the A and B matrices are unpacked, these
barriers result in overhead for smaller dimensions. Creation of
unnecessary communicators is now avoided, and consequently the
barriers are eliminated when packing is disabled for both input
matrices in the SUP path.
Change-Id: Ic373dfd2d6b08b8f577dc98399a83bb08f794afa
- Implemented her2 framework calls for the transposed and
non-transposed kernel variants.
- The dher2 kernel operates on 4 columns at a time. It computes
the 4x4 triangular part of the matrix first, and the remainder
is computed in chunks of 4x4 tiles up to m rows.
- Remainder cases (m < 4) are handled serially.
AMD-Internal: [CPUPL-1968]
Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313
- Optimized the axpy2v implementation for the double
datatype by handling rows in multiples of 4
and storing the final computed result only at the
end of the computation, avoiding unnecessary
intermediate stores to improve performance.
- Vector registers are used optimally and reused for
faster computation.
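A scalar sketch of the idea (the actual kernel is vectorized): axpy2v fuses two axpy operations, z := z + alphax*x + alphay*y, with the main loop unrolled by 4 and each z element written exactly once.

```c
/* Scalar sketch of fused axpy2v: z := z + alphax*x + alphay*y.
   Each z element is accumulated in a local and stored once. */
void daxpy2v_sketch(int n, double alphax, double alphay,
                    const double *x, const double *y, double *z)
{
    int i = 0;
    for (; i + 4 <= n; i += 4)  /* rows in multiples of 4 */
    {
        /* accumulate in (register) locals; single store per element
           at the end avoids redundant intermediate stores to z */
        double z0 = z[i+0] + alphax * x[i+0] + alphay * y[i+0];
        double z1 = z[i+1] + alphax * x[i+1] + alphay * y[i+1];
        double z2 = z[i+2] + alphax * x[i+2] + alphay * y[i+2];
        double z3 = z[i+3] + alphax * x[i+3] + alphay * y[i+3];
        z[i+0] = z0; z[i+1] = z1; z[i+2] = z2; z[i+3] = z3;
    }
    for (; i < n; ++i)          /* remainder handled serially */
        z[i] += alphax * x[i] + alphay * y[i];
}
```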
AMD-Internal: [CPUPL-1973]
Change-Id: I7b8ef94d0f67c1c666fdce26e9b2b7291365d2e9
- Implemented an alternate method of performing the
multiplication and addition operations on the
double-precision complex datatype by separating
out the real and imaginary parts of the complex numbers.
- Vector registers are used optimally and reused for
faster computation.
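The split-real/imaginary formulation can be sketched in scalar form (in the vectorized kernel, the real and imaginary parts live in separate vector registers):

```c
/* Complex multiply-accumulate with real and imaginary parts kept
   separate: (ar + ai*i)*(br + bi*i) = (ar*br - ai*bi) + (ar*bi + ai*br)*i */
void zmac_split(double ar, double ai, double br, double bi,
                double *cr, double *ci)
{
    *cr += ar * br - ai * bi;  /* real part */
    *ci += ar * bi + ai * br;  /* imaginary part */
}
```

Keeping the parts separate avoids the shuffle/permute operations needed when real and imaginary components are interleaved in the same register.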
AMD-Internal: [CPUPL-1969]
Change-Id: Ib181f193c05740d5f6b9de3930e1995dea4a50f2
Details:
- Intrinsic implementation of axpbyv for AVX2
- Bench written for axpbyv
- Added definitions in zen contexts
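For reference, the axpbyv operation computes y := alpha*x + beta*y. A minimal scalar sketch of these semantics follows (the commit's actual kernel is an AVX2 intrinsic implementation of the same operation):

```c
/* Scalar reference for axpbyv: y := alpha*x + beta*y. */
void axpbyv_ref(int n, double alpha, const double *x,
                double beta, double *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = alpha * x[i] + beta * y[i];
}
```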
AMD-Internal: [CPUPL-1963]
Change-Id: I9bc21a6170f5c944eb6e9e9f0e994b9992f8b539
- Unrolled the loop by a greater factor. Incorporated a switch
statement to decide the unrolling factor according to the input size.
- Removed unused structs.
AMD-Internal: [CPUPL-1974]
Change-Id: Iee9d7defcc8c582ca0420f84c4fb2c202dabe3e7
1. Removed the hardcoded libomp.lib from the CMake scripts and made it user-configurable. By default, libomp.lib is used as the OpenMP library.
2. Added the STATIC_LIBRARY_OPTIONS property to the set_target_properties CMake command to link the OpenMP library when building the static-mt BLIS library.
3. Updated blis_ref_kernel_mirror.py to support the zen4 architecture.
AMD-Internal: CPUPL-1630
Change-Id: I54b04cde2fa6a1ddc4b4303f1da808c1efe0484a
- Increased the unroll factor of the loop by 15 in SAXPYV
- Increased the unroll factor of the loop by 12 in DAXPYV
- The above changes were made for better register
utilization
Change-Id: I69ad1fec2fcf958dbd1bfd71378641274b43a6aa
- The number of threads is reduced to 1 when the dimensions
are very small.
- Removed an uninitialized xmm compilation warning in trsm small
Change-Id: I23262fb82729af5b98ded5d36f5eed45d5255d5b
- Introduced two new ddotxf functions with a lower fuse
factor.
- Changed the DGEMV framework to use the new kernels to
improve problem decomposition.
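For context, dotxf fuses several dot products: with fuse factor b, it computes y[j] := beta*y[j] + alpha*dot(A(:,j), x) for j = 0..b-1, loading each x element once and reusing it across all b columns. A scalar sketch under that definition (the function name and the b <= 8 bound are illustrative, not the BLIS kernel itself):

```c
/* Scalar sketch of dotxf with fuse factor b (assumed b <= 8):
   y[j] := beta*y[j] + alpha * dot(A(:,j), x), with x reused across
   the b fused columns. A is a column-major m-by-b panel. */
void ddotxf_sketch(int m, int b, double alpha,
                   const double *a, int lda,
                   const double *x, double beta, double *y)
{
    double rho[8] = {0};                 /* one accumulator per column */
    for (int i = 0; i < m; ++i)
    {
        double xi = x[i];                /* x loaded once per row */
        for (int j = 0; j < b; ++j)
            rho[j] += a[i + j * lda] * xi;
    }
    for (int j = 0; j < b; ++j)
        y[j] = beta * y[j] + alpha * rho[j];
}
```

A lower fuse factor gives the DGEMV framework more flexibility in splitting the column dimension, at the cost of reloading x more often; the commit balances these by adding kernels at both fuse factors.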
Change-Id: I523e158fd33260d06224118fbf74f2314e03a617
-Implemented hemv framework calls for the lower and upper
kernel variants.
-The hemv computation is implemented in two parts:
one part operates on the triangular part of the matrix, and
the remaining part is computed by the dotxfaxpyf kernel.
-The first part performs the dotxf and axpyf operations on the
triangular part of the matrix in chunks of 8x8.
Two separate helper functions are implemented
for the lower and upper kernels, respectively.
-The second part is the fused ddotxaxpyf kernel, which performs
the dotxf and axpyf operations together on the non-triangular
part of the matrix in chunks of 4x8.
-The implementation uses cache memory efficiently
for optimal performance.
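The fused dotxf/axpyf idea exploits symmetry: since only one triangle of the matrix is stored, each stored element can serve both an axpyf-style update and a dotxf-style update. A scalar sketch for a real symmetric matrix (the actual kernel is Hermitian, blocked 4x8, and vectorized; this illustrative function is not a BLIS symbol):

```c
/* Scalar sketch of fused dot/axpy for y += A*x with A symmetric and
   only the lower triangle stored (column-major, leading dim lda).
   Each stored off-diagonal element is loaded exactly once and
   contributes to two entries of y. */
void dsymv_lower_sketch(int n, const double *a, int lda,
                        const double *x, double *y)
{
    for (int j = 0; j < n; ++j)
    {
        y[j] += a[j + j * lda] * x[j];   /* diagonal element */
        for (int i = j + 1; i < n; ++i)
        {
            double aij = a[i + j * lda]; /* loaded exactly once */
            y[i] += aij * x[j];          /* axpyf-style contribution */
            y[j] += aij * x[i];          /* dotxf-style contribution */
        }
    }
}
```

This halves the memory traffic relative to computing the two contributions in separate passes, which is the cache-efficiency benefit the commit describes.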
Change-Id: Id603031b4578e87a92c6b77f710c647acc195c8e
- Added configuration option for zen4 architecture
- Added auto-detection of zen4 architecture
- Added zen4 configuration for all checks related
to AMD specific optimizations
AMD-Internal: [CPUPL-1937]
Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a
The small gemm implementation is called from the gemmnat path; when
the library is built as multi-threaded, small gemm is completely
disabled. For single-threaded builds, the crash is fixed by disabling
small gemm on the generic architecture.
AMD-Internal: [CPUPL-1930]
Change-Id: If718870d89909cef908a1c23918b7ef6f7d80f7a
Summary:
1. This commit fixes an issue in the gemv and axpy APIs.
2. The BLIS binary with the dynamic dispatch feature was
crashing on non-zen CPUs (specifically, CPUs without
AVX2 support).
3. The crash was caused by unsupported instructions
in zen-optimized kernels. The issue is fixed by calling
only reference kernels if the architecture detected at
runtime is not zen, zen2, or zen3.
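The dispatch pattern this fix establishes can be sketched as follows. The function names are illustrative, not actual BLIS symbols, and __builtin_cpu_supports (GCC/Clang, x86) is a simplified stand-in for BLIS's runtime architecture-id check:

```c
/* Reference kernel: safe on every architecture. */
static double dot_ref(int n, const double *x, const double *y)
{
    double rho = 0.0;
    for (int i = 0; i < n; ++i) rho += x[i] * y[i];
    return rho;
}

/* Stand-in for an AVX2-optimized zen kernel (same semantics here;
   the real kernel would use AVX2 instructions). */
static double dot_zen(int n, const double *x, const double *y)
{
    return dot_ref(n, x, y);
}

/* Dispatch: call the optimized kernel only when the CPU supports the
   instructions it uses; otherwise fall back to the reference kernel. */
double dot_dispatch(int n, const double *x, const double *y)
{
    if (__builtin_cpu_supports("avx2"))
        return dot_zen(n, x, y);
    return dot_ref(n, x, y);
}
```

Routing every call through such a check is what prevents illegal-instruction crashes on CPUs without AVX2.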
Change-Id: Icc6f7fdc80bc58fac1a97b1502b6f269e5e89aa4
Removed direct calling of zen kernels in the ctrsv and ztrsv interfaces.
The BLIS binary with the dynamic dispatch feature was crashing on non-zen CPUs
(specifically, CPUs without AVX2 support). The crash was caused by unsupported
instructions in zen-optimized kernels.
AMD-Internal: [CPUPL-1930]
Change-Id: I21f25a09cd6ffb013d16c66ea10aa9a42f7cad5b
Direct calls to zen kernels are replaced by architecture-dependent
calls for the dotv and amaxv kernels. For non-zen architectures,
the generic function is called via the BLIS interface. For zen
architectures, direct calls to zen-optimized kernels are made.
Change-Id: I49fc9abc813434d6a49a23f49e47d16e95b7899f
Removed direct calling of zen kernels in the BLIS interface for
trsm, scalv, and swapv.
The BLIS binary with the dynamic dispatch feature was crashing on non-zen CPUs
(specifically, CPUs without AVX2 support). The crash was caused by unsupported
instructions in zen-optimized kernels. The issue is fixed by calling only
reference kernels if the architecture detected at runtime is not zen, zen2, or zen3.
AMD-Internal: [CPUPL-1930]
Change-Id: I7944d131d376e2c4e778fe441a8b030674952b81
Removed direct calling of zen kernels from the cblas source itself.
Similar optimizations are performed by the functions directly invoked
from the CBLAS layer.
The BLIS binary with the dynamic dispatch feature was crashing on non-zen CPUs
(specifically, CPUs without AVX2 support). The crash was caused by unsupported
instructions in zen-optimized kernels. The issue is fixed by calling only
reference kernels if the architecture detected at runtime is not zen, zen2, or zen3.
AMD-Internal: [CPUPL-1930]
Change-Id: I9178b7a98f2563dee2817064f37fcbb84073eeea
This commit fixes an issue in the gemm and copy APIs.
The BLIS binary with the dynamic dispatch feature was crashing on non-zen
CPUs (specifically, CPUs without AVX2 support).
The crash was caused by unsupported instructions in zen-optimized kernels.
The issue is fixed by calling only reference kernels if the architecture
detected at runtime is not zen, zen2, or zen3.
AMD-Internal: [CPUPL-1930]
Change-Id: Ief57cd457b87542aa1a7bad64dc36c01f0d1a366
Details:
- A perf regression is observed for certain m,n,k inputs where (m,n,k > 512)
and (m > 4 * n) in BLIS 3.1. The root cause was traced to commit
11dfc176a3, where BLIS_THREAD_RATIO_M was
updated from 2 to 1. This change was not part of BLIS 3.0.6 and hence
resulted in the new perf drop in 3.1.
- This workaround doubles the m dimension that is passed as an
argument to bli_rntm_set_ways_for_op, which is used to determine the ic,jc
work split across threads. BLIS_THREAD_RATIO_M is not updated (to 2);
rather, its effect is induced via the doubled m dimension.
AMD-Internal: [CPUPL-1909]
Change-Id: I3b6ec4d4a22154289cb56d8f7db4cb60e5f34afe
Details:
- Accuracy failures are observed when fast math and ILP64 are enabled.
- Disabling the feature with the macro BLIS_ENABLE_FAST_MATH.
AMD-Internal: [CPUPL-1907]
Change-Id: I92c661647fb8cc5f1d0af8f6c4eae0fac1df5f16
-- Reverted changes made to include LP/ILP info in the binary name.
This reverts commit c5e6f885f0.
-- Included the BLAS integer size in 'make showconfig'
-- Renamed the amdepyc configuration to amdzen
Change-Id: Ie87ec1c03e105f606aef1eac397ba0d8338906a6
Details:
AMD Internal Id: CPUPL-1702
- For the case where A is of 1x1 dimension, on both the left and
right hand sides, A's only element is conjugate transposed by
negating its imaginary component.
Change-Id: I696ae982d9d60e0e702edaba98acbe9a5b0cd44c
Details:
AMD Internal Id: CPUPL-1702
- While performing the trsm function, A's imaginary
part needs to be conjugated as per the conjugate
transpose.
- So in the case of conjugate transpose, A's imaginary
part is negated before performing trsm.
Change-Id: Ic736733a483eeadf6356952b434128c0af988e36
Details:
- The axpyf-based implementation incurs function-call (axpyf) overhead.
- The new implementation reduces this function-call overhead.
- This implementation uses a kernel of size 8x4.
- This implementation gives better performance for smaller sizes when
compared to the axpyf-based implementation.
AMD-Internal: [CPUPL-1402]
Change-Id: Ic9a5e59363290caf26284548638da9065952fd48
Details:
-- AMD Internal Id: CPUPL-1702
-- Used an 8x3 CGEMM kernel with vector fma, utilizing ymm registers
efficiently to produce 24 scomplex outputs at a time
-- Used packing of matrix A to cache and reuse it effectively
-- Implemented kernels using a macro-based modular approach
-- Added ctrsm_small in the ctrsm_ BLAS path for single-threaded
runs when (m,n) < 1000 and multithreaded runs when (m+n) < 320
-- Took care of the --disable_pre_inversion configuration
-- Achieved a 13% average performance improvement for sizes less than 1000
-- Modularized all 16 combinations of trsm into 4 kernels
Change-Id: I557c5bcd8cb7c034acd99ce0666bc411e9c4fe64
Details:
-- AMD Internal Id: [CPUPL-1702]
-- Used a 16x6 SGEMM kernel with vector fma, utilizing ymm registers
-- Used packing of matrix A to cache and reuse it effectively
-- Implemented kernels using a macro-based modular approach
-- Took care of the --disable_pre_inversion configuration
-- Modularized the 16 combinations of strsm into 4 kernels
Change-Id: I30a1551967c36f6bae33be3b7ae5b7fcc7c905ea