amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-25 10:54:33 +00:00

Author	SHA1	Message	Date
Edward Smyth	7e50ba669b	Code cleanup: No newline at end of file Some text files were missing a newline at the end of the file. One has been added. Also correct file format of windows/tests/inputs.yaml, which was missed in commit `0f0277e104` AMD-Internal: [CPUPL-2870] Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549	2023-04-21 10:02:48 -04:00
Edward Smyth	6835205ba8	Code cleanup: spelling corrections Corrections for spelling and other mistakes in code comments and doc files. AMD-Internal: [CPUPL-2870] Change-Id: Ifbb5df7df2d6312fe73e06ee6d41c00b16c593ce	2023-04-19 12:44:56 -04:00
Aayush Kumar	71272ab574	Fixed compiler warnings for GCC 12 and AOCC 4.0 - Set the variables to zero to avoid the compiler warning (-Wmaybe-uninitialized) in bli_dgemm_ref_k1.c, bli_gemm_small.c, bli_trsm_small.c, bli_zgemm_ref_k1.c and bli_trsm_small_AVX512.c - Changed the datatype from dim_t to siz_t for i,k,j in bli_hemv_unf_var1_amd.c and bli_hemv_unf_var3_amd.c to avoid the compiler warning (-Waggressive-loop-optimizations) AMD-Internal: [CPUPL-2870] Change-Id: Ib2bc050fa47cb8a280d719283ab4539c70e19d03	2023-04-14 13:29:17 +00:00
Harihara Sudhan S	2e6724262e	ZGEMV var 2 bug fix - Fixed segmentation fault that was seen on non zen and non avx2 machines. - cntx object was not passed to the invoked kernel causing a seg fault. AMD-Internal: [CPUPL-3167] Change-Id: I2640d3f905e78398935cf6ed667b04a6418baa5d	2023-04-05 01:31:24 -04:00
Edward Smyth	1ac03e64b5	BLIS cpuid tidy and bugfix. Improvements to BLIS cpuid functionality: - Tidy names of avx support test functions, especially rename bli_cpuid_is_avx_supported() to bli_cpuid_is_avx2fma3_supported() to more accurately describe what it tests. - Fix bug in frame/base/bli_check.c related to changes in commit `6861fcae91` AMD-Internal: [CPUPL-3031] Change-Id: Iacd8fb0ffbd45288e536fc6314660709055ea2d5	2023-04-03 08:46:37 -04:00
Harihara Sudhan S	4b36529a8b	Added vector packing logic to ZGEMV variant 2 - In cases when incy != 1, a buffer is created for y vector. The contents of vector y are scaled by beta and stored in this buffer. - After performing the compute using ZAXPYF kernel, the results in the y buffer are copied back to the original buffer using ZCOPYV. - In cases when alpha is zero, we only scale the y vector by beta without using the buffer and return. - The kernels are picked based on the architecture ID. For any zen based architecture, AVX2 kernels are invoked. For others, the kernels are invoked based on the context. - In ZSCAL2V, query for the context if NULL pointer is passed. AMD-Internal: [CPUPL-2773] Change-Id: If409ca5c438fc2eebe73480c011577088d52c65f	2023-03-22 03:19:18 -04:00
Edward Smyth	7f86561d26	BLIS-Nov2022: HPL memory issues with GCC. HPL script was using BLIS manual way to set threading, i.e. setting BLIS_IC_NT explicitly. This causes bli_rntm_num_threads() to return -1, which wasn't trapped in parallelised BLAS1 and BLAS2 routines. Fix: if this occurs, set local number of threads based on product of BLIS_JC_NT * BLIS_PC_NT * BLIS_IC_NT * BLIS_JR_NT * BLIS_IR_NT values. Note: BLIS_PC_NT should always be 1, but this environment variable is currently being read (contrary to documentation), so include it for now. Other changes: * implement _Pragma convention in all code used on AMD * frame/2/gemv/bli_gemv_unf_var1_amd.c: Remove is_omp_mt_enabled flag AMD-Internal: [CPUPL-2803] Change-Id: I37e8b038e5640d6693a87be0609888186322b465	2022-12-06 05:10:34 -05:00
Harihara Sudhan S	42d631bced	Copyright modification - Added copyright information to modified/newly created files missing them Change-Id: If4e73b680246d0363de09587d6dc54bee00ecd71	2022-10-14 12:43:35 +05:30
Edward Smyth	abf848ad12	Code cleanup and warnings fixes - Removed some additional compiler warnings reported by GCC 12.1 - Fixed a couple of typos in comments - frame/3/bli_l3_sup.c: routines were returning before final call to AOCL_DTL_TRACE_EXIT - frame/2/gemv/bli_gemv_unf_var1_amd.c: bli_multi_sgemv_4x2 is only defined in header file if BLIS_ENABLE_OPENMP is defined AMD-Internal: [CPUPL-2460] Change-Id: I2eacd5687f2548d8f40c24bd1b930859eefbbcde	2022-08-29 08:22:30 -04:00
Arnav Sharma	eb83a0fe9d	Enabled ZHER Optimized Path - While calculating the diagonal and corner elements, the combined operation of calculating the product of x and x hermitian and simultaneously scaling it with alpha and adding the result to the matrix was the cause of increased underflow and overflow errors in netlib tests. - So the above calculation is now being done in three steps: scaling x vector with alpha, then calculating its product with x hermitian and later adding the final result to the matrix. AMD-Internal: [CPUPL-2213] Change-Id: I32df572b013bc3189340662dbf17eddcaec9f0f8	2022-08-29 08:09:42 -04:00
Arnav Sharma	035ed98b51	Temporarily disabling optimized ZHER - Disabling optimized ZHER pending verification with netlib BLAS test. AMD-Internal: [CPUPL-2416] Change-Id: I74c4d16e1c99ddeb1df91130a8e14feafd0952d0	2022-08-19 12:16:46 -04:00
Harihara Sudhan S	d4bb906094	Exception handling in GEMV smart-threading - Added condition to check if n or m is 0 in smart threading logic AMD-Internal: [CPUPL2219] Change-Id: Idd58cd13a11aa5bdb4117b4c9262f38ef3c1afc4	2022-06-29 01:52:18 -04:00
Dipal M Zambare	c87b9aab75	Added support for AVX512 for Windows and AMAXV - Completed zen4 configuration support on windows - Enabled AVX512 kernels for AMAXV - Added zen4 configuration in amdzen for windows - Moved all zen4 kernels inside kernels/zen4 folder AMD-Internal: [CPUPL-2108] Change-Id: I9d2336998bbcdb8e2c4ca474977b5939bfa578ba	2022-06-08 11:09:48 +05:30
Arnav Sharma	66b2231b65	Fixed CMake files for HER - Removed subdirectory addition Change-Id: I419085db0b9034777409207a7d79b7ffa91eb8f1	2022-06-01 12:25:43 +05:30
Arnav Sharma	e5d5a43eab	Optimized ZHER Implementation - Implemented optimized her framework calls for double precision complex numbers. - The zher kernel operates over 4 columns at a time. Initially, it computes the diagonal elements of the matrix, then the 4x4 triangular part is computed and finally the remaining part is computed as 4x4 tiles of the matrix up to m rows. AMD-Internal: [CPUPL-2151] Change-Id: I27430ee33ffb901b3ef4bdd97b034e3f748e9cca	2022-05-25 14:03:01 +05:30
Dipal M Zambare	6e2f536590	Removed Arch specific code from BLIS framework. - Removed BLIS_CONFIG_EPYC macro - The code dependent on this macro is handled in one of the three ways -- It is updated to work across platforms. -- Added in architecture/feature specific runtime checks. -- Duplicated in AMD specific files. Build system is updated to pick AMD specific files when library is built for any of the zen architecture AMD-Internal: [CPUPL-1960] Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b	2022-05-17 20:35:40 +05:30
Harsh Dave	f48ced0811	Optimized dher2 implementation - Implemented her2 framework calls for transposed and non transposed kernel variants. - dher2 kernel operates over 4 columns at a time. It computes the 4x4 triangular part of the matrix first, and the remainder is computed in 4x4 tiles up to m rows. - Remainder cases (m < 4) are handled serially. AMD-Internal: [CPUPL-1968] Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313	2022-05-17 18:13:07 +05:30
mkadavil	8670992c3d	Default sgemv kernel to be used in single-threaded scenarios. - sgemv calls a multi-threading friendly kernel whenever it is compiled with open mp and multi-threading enabled. However it was observed that this kernel is not suited for scenarios where sgemv is invoked in a single-threaded context (eg: sgemv from ST sgemm fringe kernels and with matrix blocking). Falling back to the default single-threaded sgemv kernel resulted in better performance for this scenario. AMD-Internal: [CPUPL-2136] Change-Id: Ic023db4d20b2503ea45e56a839aa35de0337d5a6	2022-05-17 18:10:40 +05:30
Nallani Bhaskar	2acb3f6ed0	Tuned aocl dynamic for specific range in dgemm Description: 1. Decision logic to choose optimal number of threads for given input dgemm dimensions under aocl dynamic feature were retuned based on latest code. 2. Updated code in few file to avoid compilation warnings. 3. Added a min check for nt in bli_sgemv_var1_smart_threading function AMD-Internal: [ CPUPL-2100 ] Change-Id: I2bc70cc87c73505dd5d2bdafb06193f664760e02	2022-05-17 18:10:39 +05:30
S, HariharaSudhan	a8bc55c373	Multithreaded SGEMV var 1 with smart threading - Implemented an OpenMP based stand alone SGEMV kernel for row-major (var 1) for multithread scenarios - Smart threading is enabled when AOCL DYNAMIC is defined - Number of threads are decided based on the input dims using smart threading AMD-Internal: [CPUPL-1984] Change-Id: I9b191e965ba7468e95aabcce21b35a533017502e	2022-05-17 18:10:39 +05:30
Dipal M Zambare	06e386f054	Updated Windows build system to pick AMD specific sources. The framework cleanup was done for linux as part of `f63f78d7` Removed Arch specific code from BLIS framework. This commit adds changes needed for windows build. AMD-Internal: [CPUPL-2052] Change-Id: Ibd503a0adeea66850de156fb95657b124e1c4b9d	2022-05-17 18:09:20 +05:30
Harsh Dave	d50d607995	dher2 API in blis make check fails on non avx2 platform - dher2 did not have avx check for platform. It was calling avx kernel regardless of platform support. Which resulted in core dump. - Added avx based platform check in both variant of dher2 for fixing the issue. AMD-Internal: [CPUPL-2043] Change-Id: I1fd1dcc9336980bfb7ffa9376f491f107c889c0b	2022-05-17 18:08:57 +05:30
Dipal M Zambare	f69f59c32c	Removed Arch specific code from BLIS framework. - Removed BLIS_CONFIG_EPYC macro - The code dependent on this macro is handled in one of the three ways -- It is updated to work across platforms. -- Added in architecture/feature specific runtime checks. -- Duplicated in AMD specific files. Build system is updated to pick AMD specific files when library is built for any of the zen architecture AMD-Internal: [CPUPL-1960] Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b	2022-05-17 18:08:56 +05:30
Harsh Dave	d116780616	Optimized dher2 implementation - Implemented her2 framework calls for transposed and non transposed kernel variants. - dher2 kernel operates over 4 columns at a time. It computes the 4x4 triangular part of the matrix first, and the remainder is computed in 4x4 tiles up to m rows. - Remainder cases (m < 4) are handled serially. AMD-Internal: [CPUPL-1968] Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313	2022-05-17 18:05:08 +05:30
Harihara Sudhan S	6696f91f41	Improved DGEMV performance for column-major cases - Altered the framework to use 2 more fused kernels for better problem decomposition - Increased unroll factor in AXPYF5 and AXPYF8 kernels to improve register usage AMD-Internal: [CPUPL-1970] Change-Id: I79750235d9554466def5ff93898f832834990343	2022-02-02 23:13:10 -05:00
Dipal M Zambare	f63f78d783	Removed Arch specific code from BLIS framework. - Removed BLIS_CONFIG_EPYC macro - The code dependent on this macro is handled in one of the three ways -- It is updated to work across platforms. -- Added in architecture/feature specific runtime checks. -- Duplicated in AMD specific files. Build system is updated to pick AMD specific files when library is built for any of the zen architecture AMD-Internal: [CPUPL-1960] Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b	2022-01-18 11:51:08 +05:30
Harsh Dave	351269219f	Optimized dher2 implementation - Implemented her2 framework calls for transposed and non transposed kernel variants. - dher2 kernel operates over 4 columns at a time. It computes the 4x4 triangular part of the matrix first, and the remainder is computed in 4x4 tiles up to m rows. - Remainder cases (m < 4) are handled serially. AMD-Internal: [CPUPL-1968] Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313	2022-01-05 05:51:15 -06:00
Harihara Sudhan S	8201bcfdaf	Improved DGEMV performance for smaller sizes - Introduced two new ddotxf functions with lower fuse factor. - Changed the DGEMV framework to use new kernels to improve problem decomposition. Change-Id: I523e158fd33260d06224118fbf74f2314e03a617	2021-12-15 13:17:14 +05:30
Harsh Dave	0f43db8347	Optimized dsymv implementation - Implemented hemv framework calls for lower and upper kernel variants. - hemv computation is implemented in two parts. One part operates on the triangular part of the matrix and the remaining part is computed by dotxfaxpyf kernel. - First part performs dotxf and axpyf operations on the triangular part of the matrix in chunks of 8x8. Two separate helper functions for doing so are implemented for lower and upper kernels respectively. - Second part is ddotxaxpyf fused kernel, which performs dotxf and axpyf operations together on the non-triangular part of the matrix in chunks of 4x8. - Implementation efficiently uses cache memory while computing for optimal performance. Change-Id: Id603031b4578e87a92c6b77f710c647acc195c8e	2021-12-14 23:34:20 -06:00
Dipal M Zambare	fd8a3aace9	Added support for zen4 architecture - Added configuration option for zen4 architecture - Added auto-detection of zen4 architecture - Added zen4 configuration for all checks related to AMD specific optimizations AMD-Internal: [CPUPL-1937] Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a	2021-11-23 10:29:15 +05:30
mkurumel	235071690a	Fixed dynamic dispatch crash issue on non-zen architecture for gemv and axpy routines. Summary: 1. This commit fixes the issue for the gemv and axpy APIs. 2. The BLIS binary with dynamic dispatch feature was crashing on non-zen CPUs (specifically CPUs without AVX2 support). 3. The crash was caused by unsupported instructions in zen optimized kernels. The issue is fixed by calling only reference kernels if the architecture detected at runtime is not zen, zen2 or zen3. Change-Id: Icc6f7fdc80bc58fac1a97b1502b6f269e5e89aa4	2021-11-12 08:58:59 +05:30
Harsh Dave	1b0a7e1c89	Fixed dynamic dispatch crash issue on non-zen architecture. Removed direct calling of zen kernels in ctrsv, ztrsv interface. The BLIS binary with dynamic dispatch feature was crashing on non-zen CPUs (specifically CPUs without AVX2 support). The crash was caused by un-supported instructions in zen optimized kernels. AMD-Internal: [CPUPL-1930] Change-Id: I21f25a09cd6ffb013d16c66ea10aa9a42f7cad5b	2021-11-12 08:58:58 +05:30
Harsh Dave	8f297f6267	Fixed dynamic dispatch crash issue on non-zen architecture. Removed direct calling of zen kernels in blis interface for trsm, scalv, swapv. The BLIS binary with dynamic dispatch feature was crashing on non-zen CPUs (specifically CPUs without AVX2 support). The crash was caused by un-supported instructions in zen optimized kernels. The issue is fixed by calling only reference kernels if the architecture detected at runtime is not zen, zen2 or zen3. AMD-Internal: [CPUPL-1930] Change-Id: I7944d131d376e2c4e778fe441a8b030674952b81	2021-11-12 08:58:58 +05:30
lcpu	30038af896	Reverted: To fix accuracy issues for complex datatypes Details: -- reverted cscalv,zscalv,ctrsm,ztrsm changes to address accuracy issues observed by libflame and scalapack application testing. -- AMD-Internal: [CPUPL-1906], [CPUPL-1914] Change-Id: Ic364eacbdf49493dd3a166a66880c12ee84c2204	2021-11-12 08:58:57 +05:30
Nageshwar Singh	cbd9ea76af	Complex single standalone gemv implementation independent of axpyf. Details - The axpyf-based implementation incurs function (axpyf) call overhead. - The new implementation reduces this function call overhead. - This implementation uses kernel of size 8x4. - This implementation gives better performance for smaller sizes when compared to axpyf based implementation AMD-Internal: [CPUPL-1402] Change-Id: Ic9a5e59363290caf26284548638da9065952fd48	2021-11-12 08:58:55 +05:30
Nageshwar Singh	a3d04a21a0	Complex double standalone gemv implementation independent of axpyf. Details - The axpyf-based implementation incurs function (axpyf) call overhead. - The new implementation reduces this function call overhead. - This implementation uses kernel of size 4x4. - This implementation gives better performance for smaller sizes when compared to axpyf based implementation AMD-Internal: [CPUPL-1402] Change-Id: I5fa421b8c1d2b44c991c2a05e8f5b01b83eb4b37	2021-11-12 08:58:54 +05:30
mkurumel	ce75a86e9b	BLIS : DGEMV performance improvement for incy/incx greater than 1 Details : - Added packing Of Y for incy >1 cases for dgemv_unf_var2. - Added packing Of X for incx >1 cases for dgemv_unf_var1. AMD-Internal: [SWLCSG-735] Change-Id: Ib395f478ba984a85533e4f79b3521d0b2500c30c	2021-07-21 17:55:05 +05:30
mkurumel	9afbb11b4f	DTL Logging bug in GEMV Details : - Fixed Incorrect Macro used in dgemv and cgemv Trace logging exit. AMD-Internal: [CPUPL-1403] Change-Id: Icac502d8d4adad112754d9c764a30d3db56a743f	2021-06-02 21:21:00 +05:30
mkurumel	99e3bce065	SGEMV : single Precision axpyf kernel optimization for SGEMV Details : - Implemented saxpyf kernel with fuse factor=6 for sgemv. AMD-Internal: [CPUPL-1403] Change-Id: I72fd30c08a789603267cf58910138549d45d231a	2021-06-02 07:55:48 -04:00
Nageshwar Singh	2e1a5bc1dd	Optimized double complex axpyf kernel for zgemv Details: - Implemented zaxpyf kernel with fuse factor=4 for zgemv. - Modified BLAS interface call for zgemv to reduce framework overhead. - Directed gemv to dotv in the case where dimension of y vector is 1. - when alpha = 0, gemv becomes scalv of Y with beta. Added code to return early after scaling Y vector with beta. AMD-Internal: [CPUPL-1402] Change-Id: I2231285fe3060982d4434466346a040b7ab803fc	2021-06-01 18:03:29 +05:30
Meghana Vankadari	8c9a7c21b4	Optimized axpyf kernel for scomplex datatype Details: - Implemented axpyf kernel with fuse factor=4 for scomplex datatype. - Modified BLAS interface call for cgemv to reduce framework overhead. - Directed gemv to dotv in the case where dimension of y vector is 1. - when alpha = 0, gemv becomes scalv of Y with beta. Added code to return early after scaling Y vector with beta. AMD-Internal: [CPUPL-1402] Change-Id: Ibaab078008d76953332ba4da3515993578c0e586	2021-05-24 14:40:17 +05:30
Meghana Vankadari	33ddf2e448	Fixed blastest failure for haswell configuration Details: - Placed optimized version of BLAS DGEMM, ZGEMM definitions under BLIS_CONFIG_EPYC as they use gemm small which are defined only for zen family configurations. - Added code to query and set cntx in gemv and trsv framework before cntx is referred for any function pointers to avoid querying from NULL pointer. AMD-Internal: [CPUPL-1562] Change-Id: I977d028ec4ddb57dcdc70e443e7708f36c01cca9	2021-05-07 01:49:54 -04:00
nphaniku	b3628cdfd3	AOCL Windows: 3.1 BLIS changes 1. CMake script changes for build with Clang compiler. 2. CMake script changes for build test and testsuite based on the lib type ST/MT 3. CMake script changes for testcpp and blastest 4. Added python scripts to support library build and testsuite build. AMD Internal : [CPUPL-1422] Change-Id: Ie34c3e60e9f8fbf7ea69b47fd1b50ee90099c898	2021-03-08 19:04:17 +05:30
managalv	1ff4981203	Modified blas interface of TRSM to call TRSV whenever m=1 or n=1. Case1: Call TRSV when matrix C & B are vector & A is matrix, When n = 1 for left side and when m = 1 for right side Case2: Divide B/A when matrix C & B are vector & A is scalar(Diagonal element), When m = 1 for left side and when n = 1 for right side For right side, Transpose complete operation, Change upper to lower and vice versa when A is being transposed Change-Id: Ie87e4a263c287ba554832ccc56b629f982e3ac4c	2021-02-08 19:02:25 +05:30
Meghana Vankadari	2e7cf8d82f	Added 16x4 AXPYF kernel for zen2 config Details: - Added a new AXPYF kernel with fuse_factor = 4 and iter_unroll = 4. - Modified blas interface of GEMM to call GEMV whenever m=1 or n=1. Change-Id: I3f5acd37b009f53cf63f462cec79fd3e73676dbc	2021-02-02 21:22:44 +05:30
Dipal M Zambare	0a3d94c9a2	Updated test drivers for dotv, scalv and swapv. Added traces in cblas layer for these API's. These test drivers didn't have calls for complex data types, the drivers are updated to support them. AMD-Internal : [CPUPL-1315] Change-Id: Ia52ecca68ea17314315d626b57c46a2f5973985b	2020-11-24 10:26:32 +05:30
managalv	fdc0e70cd8	Added debug log and trace for gemv and dotv for blis and cblas interface AMD Internal: [CPUPL-1314] Change-Id: I2708fd9c73419c968c8e02ff11545645dc639052	2020-11-23 19:55:21 +05:30
managalv	aae48c2221	Optimised AXPYF routine for complex float and complex double Details: - Added SIMD code - Processing 5 rows at a time in SIMD loop to improve performance AMD-Internal: [CPUPL-1054] Change-Id: I2ac93f25895dccfc42e14be0689e6d4e655d6a0a	2020-11-06 18:42:13 +05:30
Meghana Vankadari	0775f09b41	Added debug trace and log support for copy and ger routines Change-Id: Id7fb64c0a626b2f8f53e89ee7df4391693eb4f4c	2020-11-02 22:56:58 -05:00
Meghana Vankadari	47744663d9	Enabling framework optimizations for zen family architectures. Details: - Introduced a new macro 'BLIS_CONFIG_EPYC' to enable blas and cblas framework optimizations for zen family configurations. - The macro needs to be defined in family.h files of respective arch configs. - Moved zen2-specific optimized kernels to zen folder, in order to be accessible to all zen family architectures. Change-Id: I8da2db6b7ab22ef350a01d86c214006e812eb06d	2020-10-07 13:10:50 +05:30
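
Commit 7f86561d26 above describes a fallback for the case where the total thread count is reported as -1: the local thread count is taken as the product of the BLIS_JC_NT, BLIS_PC_NT, BLIS_IC_NT, BLIS_JR_NT and BLIS_IR_NT values. The sketch below illustrates that idea only. The function names `read_ways_from_env` and `effective_num_threads` are hypothetical and are not BLIS APIs, and treating an unset or invalid variable as 1 is an assumption, not something the commit message states.

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_LOOPS 5

/* Hypothetical helper (not a BLIS function): read the five per-loop
 * environment variables named in the commit message into ways[].
 * Assumption: an unset or non-positive value counts as 1. */
void read_ways_from_env(long ways[NUM_LOOPS])
{
    static const char *const names[NUM_LOOPS] = {
        "BLIS_JC_NT", "BLIS_PC_NT", "BLIS_IC_NT", "BLIS_JR_NT", "BLIS_IR_NT"
    };
    assert(ways != NULL);
    for (size_t i = 0; i < NUM_LOOPS; ++i) {
        const char *s = getenv(names[i]);
        char *end = NULL;
        long v = (s != NULL) ? strtol(s, &end, 10) : 1;
        ways[i] = (s != NULL && end != s && v > 0) ? v : 1;
    }
}

/* Hypothetical fallback (not a BLIS function): keep a positive reported
 * thread count; otherwise (e.g. -1) use the product of the per-loop ways. */
long effective_num_threads(long reported_nt, const long ways[], size_t n)
{
    assert(ways != NULL);
    if (reported_nt > 0)
        return reported_nt;
    long nt = 1;
    for (size_t i = 0; i < n; ++i)
        nt *= (ways[i] > 0) ? ways[i] : 1;
    return nt;
}
```

With ways of {2, 1, 4, 1, 1} and a reported count of -1, this sketch yields 8 threads; a positive reported count is returned unchanged.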

119 Commits
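
Commit 4b36529a8b in the log above describes packing a strided y vector for GEMV variant 2: scale y by beta into a contiguous buffer, accumulate into that buffer, copy the result back, and return early when alpha is zero. The sketch below shows that control flow in simplified form. It is an illustration under stated assumptions, not the BLIS code: it uses real doubles instead of double complex, a plain column-by-column axpy instead of the fused ZAXPYF kernel, a unit-stride x, and a hypothetical function name `gemv_packed_y`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sketch (not a BLIS function):
 * y := beta*y + alpha*A*x, with A an m x n column-major matrix of leading
 * dimension lda, x of unit stride, and y of stride incy >= 1.
 * Returns 0 on success, -1 if the packing buffer cannot be allocated. */
int gemv_packed_y(size_t m, size_t n, double alpha,
                  const double *a, size_t lda, const double *x,
                  double beta, double *y, size_t incy)
{
    assert(incy >= 1);
    if (m == 0)
        return 0;

    /* alpha == 0: only scale y by beta, no buffer needed. */
    if (alpha == 0.0) {
        for (size_t i = 0; i < m; ++i)
            y[i * incy] *= beta;
        return 0;
    }

    /* Strided y: work in a contiguous buffer; unit-stride y: work in place. */
    double *buf = y;
    if (incy != 1) {
        buf = malloc(m * sizeof *buf);
        if (buf == NULL)
            return -1;
    }

    /* Scale y by beta while packing it into the buffer. */
    for (size_t i = 0; i < m; ++i)
        buf[i] = beta * y[i * incy];

    /* Accumulate alpha * x[j] * A(:, j) one column at a time. */
    for (size_t j = 0; j < n; ++j) {
        const double ax = alpha * x[j];
        for (size_t i = 0; i < m; ++i)
            buf[i] += ax * a[i + j * lda];
    }

    /* Copy the packed result back to the strided y. */
    if (incy != 1) {
        for (size_t i = 0; i < m; ++i)
            y[i * incy] = buf[i];
        free(buf);
    }
    return 0;
}
```

The point of the packing step is that the inner loop always runs over contiguous memory, whatever stride the caller passes for y.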