amd/blis - blis - Public git mirror


mirror of https://github.com/amd/blis.git synced 2026-05-13 10:35:38 +00:00

Author	SHA1	Message	Date
Minh Quan Ho	6d4d6a7514	Alloc at least 1 elem in pool_t block_ptrs. (#560 ) Details: - Previously, the block_ptrs field of the pool_t was allowed to be initialized as any unsigned integer, including 0. However, a length of 0 could be problematic given that malloc(0) is undefined and therefore variable across implementations. As a safety measure, we check for block_ptrs array lengths of 0 and, in that case, increase them to 1. - Co-authored-by: Minh Quan Ho <minh-quan.ho@kalray.eu> Change-Id: I1e885d887aaba5e73df091ef52e6c327fd6418de	2022-08-01 08:45:00 +05:30
Minh Quan Ho	3c01fcb9fc	Fix insufficient pool-growing logic in bli_pool.c. (#559 ) Details: - The current mechanism for growing a pool_t doubles the length of the block_ptrs array every time the array length needs to be increased due to new blocks being added. However, that logic did not take in account the new total number of blocks, and the fact that the caller may be requesting more blocks that would fit even after doubling the current length of block_ptrs. The code comments now contain two illustrating examples that show why, even after doubling, we must always have at least enough room to fit all of the old blocks plus the newly requested blocks. - This commit also happens to fix a memory corruption issue that stems from growing any pool_t that is initialized with a block_ptrs length of 0. (Previously, the memory pool for packed buffers of C was initialized with a block_ptrs length of 0, but because it is unused this bug did not manifest by default.) - Co-authored-by: Minh Quan Ho <minh-quan.ho@kalray.eu> Change-Id: Ie4963c56e03cbc197d26e29f2def6494f0a6046d	2022-08-01 08:19:30 +05:30
Chandrashekara K R	fde812015f	Updated blis library version from 4.0 to 3.2.1 AMD-Internal: [CPUPL-2322] Change-Id: I3a6a61543dd2754e2590d7f5f22442c9fdeaee95	2022-07-29 15:55:10 +05:30
Arnav Sharma	4f96bb712e	AOCL Dynamic Optimization for DGEMMT - Fine-tuned the thread allocation logic for parallelizing DGEMMT for the cases where n <= 220. This results in performance improvement in multi-threaded DGEMMT for small values of n. AMD-Internal: [CPUPL-2215] Change-Id: I2654bc64d2dc43c2db911e0c9175755be3aa8ba5	2022-07-18 06:52:48 -04:00
Arnav Sharma	25cf7517ab	AOCL Dynamic Optimization for DGEMMT - Optimized thread allocation for cases with n <= 220 for DGEMMT. AMD-Internal: [CPUPL-2215] Change-Id: Id01edf268a90fd96a41ef947db54f6afc490548f	2022-06-30 15:00:34 +05:30
Dipal M Zambare	2ba2fb2b63	Add AVX2 path for TRSM+GEMM combination. - Enabled AVX2 TRSM + GEMM kernel path, when GEMM is called from TRSM context it will invoke AVX2 GEMM kernels instead of the default AVX-512 GEMM kernels. - The default context has the block sizes for AVX512 GEMM kernels, however, TRSM uses AVX2 GEMM kernels and they need different block sizes. - Added new API bli_zen4_override_trsm_blkszs(). It overrides default block sizes in context with block sizes needed for AVX2 GEMM kernels. - Added new API bli_zen4_restore_default_blkszs(). It restores the block sizes to their default values (as needed by default AVX512 GEMM kernels). - Updated bli_trsm_front() to override the block sizes in the context needed by TRSM + AVX2 GEMM kernels and restore them to the default values at the end of this function. It is done in bli_trsm_front() so that we override the context before creating different threads. AMD-Internal: [CPUPL-2225] Change-Id: Ie92d0fc40f94a32dfb865fe3771dc14ed7884c55	2022-06-29 10:16:24 +00:00
mkadavil	6c112632a7	Low precision gemm integrated as aocl_gemm addon. - Multi-Threaded int8 GEMM (Input - uint8_t, int8_t, Output - int32_t). AVX512_vnni based micro-kernel for int8 gemm. Paralellization supported along m and n dimensions. - Multi-Threaded B matrix reorder support for sgemm. Reordering B matrix is packing entire B matrix upfront before sgemm. It allows sgemm to take advantage of packed B matrix without incurring packing costs during runtime. - Makefile updates to addon make rules to compile avx512 code for selected files in addon folder. - CPU features query enhancements to check for AVX512_VNNI flag. - Bench for int8 gemm and sgemm with B matrix reorder. Supports performance mode for benchmarking and accuracy mode for testing code correctness. AMD-Internal: [CPUPL-2102] Change-Id: I8fb25f5c2fbd97d756f95b623332cb29e3b8d182	2022-06-09 10:28:38 -04:00
mkadavil	31f8820bab	Bug fixes for open mp based multi-threaded GEMM/GEMMT SUP path. - auto_factor to be disabled if BLIS_IC_NT/BLIS_JC_NT is set irrespective of whether num_threads (BLIS_NUM_THREADS) is modified at runtime. Currently the auto_factor is enabled if num_threads > 0 and not reverted if ic/jc/pc/jr/ir ways are set in bli_rntm_set_ways_from_rntm. This results in gemm/gemmt SUP path applying 2x2 factorization of num_threads, and thereby modifying the preset factorization. This issue is not observed in native path since factorization happens without checking auto_factor value. - Setting omp threads to n_threads using omp_set_num_threads after the global_rntm n_threads update in bli_thread_set_num_threads. This ensures that in bli_rntm_init_from_global, omp_get_max_threads returns the same value as set previously. AMD-Internal: [CPUPL-2137] Change-Id: I6c5de0462c5837cfb64793c3e6d49ec3ac2b6426	2022-05-17 18:10:40 +05:30
Nallani Bhaskar	7658067107	Added AOCL Dynamic feature for dtrmm Description: 1. Tuned number of threads to achieve better performance for dtrmm AMD-Internal: [ CPUPL-2100 ] Change-Id: Ib2e3df224ba76d86185721bef1837cd7855dd593	2022-05-17 18:10:39 +05:30
Dipal M Zambare	4ccb438c18	Updated Zen3 architecture detection for Ryzen 5000 - Added support to detect Ryzen 5000 Desktop and APUs AMD-Internal: [CPUPL-2117] Change-Id: I312a7de1a84cf368b74ba20e58192803a9f7dace	2022-05-17 18:10:39 +05:30
Nallani Bhaskar	2acb3f6ed0	Tuned aocl dynamic for specific range in dgemm Description: 1. Decision logic to choose optimal number of threads for given input dgemm dimensions under aocl dynamic feature were retuned based on latest code. 2. Updated code in few file to avoid compilation warnings. 3. Added a min check for nt in bli_sgemv_var1_smart_threading function AMD-Internal: [ CPUPL-2100 ] Change-Id: I2bc70cc87c73505dd5d2bdafb06193f664760e02	2022-05-17 18:10:39 +05:30
mkadavil	a3836a560d	Smart Threading for GEMM (sgemm) v1. - Cache aware factorization. Experiments shows that ic,jc factorization based on m,n gives better results compared to mu,nu on a generic data set in SUP path. Also slight adjustments in the factorizations w.r.t matrix data loads can help in improving perf further. - Moving native path inputs to SUP path. Experiments shows that in multi-threaded scenarios if the per thread data falls under SUP thresholds, taking SUP path instead of native path results in improved performance. This is the case even if the original matrix dimensions falls in native path. This is not applicable if A matrix transpose is required. - Enabling B matrix packing in SUP path. Performance improvement is observed when B matrix is packed in cases where gemm takes SUP path instead of native path based on per thread matrix dimensions. AMD-Internal: [CPUPL-659] Change-Id: I3b8fc238a0ece1ababe5d64aebab63092f7c6914	2022-05-17 18:10:39 +05:30
satish kumar nuggu	1a3428ddfc	Parallelization of dtrsm_small routine 1. Parallelized dtrsm_small across m-dimension or n-dimension based on side(Left/Right). 2. Fine-tuning with AOCL_DYNAMIC to achieve better performance. AMD-Internal: [CPUPL-2103] Change-Id: I6be6a2b579de7df9a3141e0d68bdf3e8a869a005	2022-05-17 18:10:39 +05:30
Sireesha Sanga	cc3069fb5e	Performance Improvement for ztrsm small sizes Details: - Optimization of ztrsm for Non-unit Diag Variants. - Handled Overflow and Underflow Vulnerabilites in ztrsm small implementations. - Fixed failures observed in libflame testing. - Fine-tuned ztrsm small implementations for specific sizes 64<= m,n <= 256, by keeping the number of threads to the optimum value, under AOCL_DYNAMIC flag. - For small sizes, ztrsm small implementation is used for all variants. AMD-Internal: [SWLCSG-1194] Change-Id: I066491bb03e5cda390cb699182af4350ae60be2d	2022-05-17 18:10:39 +05:30
satish kumar nuggu	fe7f0a9085	Changes to enable zgemm small from BLAS Layer 1. Removed small gemm call from native path to avoid Single threaded calls as a part of MultiThreaded scenarios. 2. SUP and INDUCED Method path disabled. 3. Added AOCL Dynamic for optimum number of threads to achieve higher performance. Change-Id: I3c41641bef4906bdbdb5f05e67c0f61e86025d92	2022-05-17 18:10:38 +05:30
Sireesha Sanga	6a2c4acc66	Runtime Thread Control using OpenMP API Details: - During runtime, Application can set the desired number of threads using standard OpenMP API omp_set_num_threads(nt). - BLIS Library uses standard OpenMP API omp_get_max_threads() internally, to fetch the latest value set by the application. - This value will be used to decide the number of threads in the subsequent BLAS calls. - At the time of BLIS Initialization, BLIS_NUM_THREADS environment variable will be given precedence, over the OpenMP standard API omp_set_num_threads(nt) and OMP_NUM_THREADS environment variable. - Order of precedence followed during BLIS Initialization is as follows 1. Valid value of BLIS_NUM_THREADS 2. omp_set_num_threads(nt) 3. valid value of OMP_NUM_THREADS 4. Number of cores - After BLIS initialization, if the Application issues omp_set_num_threads(nt) during runtime, number of threads set during BLIS Initialization, is overridden by the latest value set by the Application. - Existing precedence of BLIS_*_NT environment variables and the decision of optimal number of threads over the number of threads derived from the above process remains as it is. AMD-Internal: [CPUPL-2076] Change-Id: I935ba0246b1c256d0fee7d386eac0f5940fabff8	2022-05-17 18:09:22 +05:30
Field G. Van Zee	7a0ba4194f	Added support for addons. Details: - Implemented a new feature called addons, which are similar to sandboxes except that there is no requirement to define gemm or any other particular operation. - Updated configure to accept --enable-addon=<name> or -a <name> syntax for requesting an addon be included within a BLIS build. configure now outputs the list of enabled addons into config.mk. It also outputs the corresponding #include directives for the addons' headers to a new companion to the bli_config.h header file named bli_addon.h. Because addons may wish to make use of existing BLIS types within their own definitions, the addons' headers must be included sometime after that of bli_config.h (which currently is #included before bli_type_defs.h). This is why the #include directives needed to go into a new top-level header file rather than the existing bli_config.h file. - Added a markdown document, docs/Addons.md, to explain addons, how to build with them, and what assumptions their authors should keep in mind as they create them. - Added a gemmlike-like implementation of sandwich gemm called 'gemmd' as an addon in addon/gemmd. The code uses a 'bao_' prefix for local functions, including the user-level object and typed APIs. - Updated .gitignore so that git ignores bli_addon.h files. Change-Id: Ie7efdea366481ce25075cb2459bdbcfd52309717	2022-03-31 12:03:27 +05:30
Dipal M Zambare	6d1edca727	Optimized CPU feature determination. We added new API to check if the CPU architecture has support for AVX instruction. This API was calling CPUID instruction every time it is invoked. However, since this information does not change at runtime, it is sufficient to determine it once and use the cached results for subsequent calls. This optimization is needed to improve performance for small size matrix vector operations. AMD-Internal: [CPUPL-2009] Change-Id: If6697e1da6dd6b7f28fbfed45215ea3fdd569c5f	2022-02-01 11:15:55 +05:30
Dipal M Zambare	f63f78d783	Removed Arch specific code from BLIS framework. - Removed BLIS_CONFIG_EPYC macro - The code dependent on this macro is handled in one of the three ways -- It is updated to work across platforms. -- Added in architecture/feature specific runtime checks. -- Duplicated in AMD specific files. Build system is updated to pick AMD specific files when library is built for any of the zen architecture AMD-Internal: [CPUPL-1960] Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b	2022-01-18 11:51:08 +05:30
Nallani Bhaskar	c2df5eac1c	Reduced number of threads in dgemm for small dimensions - Number of threads are reduced to 1 when the dimensions are very low. - Removed uninitialized xmm compilation warning in trsm small Change-Id: I23262fb82729af5b98ded5d36f5eed45d5255d5b	2021-12-15 15:11:08 +05:30
Dipal M Zambare	fd8a3aace9	Added support for zen4 architecture - Added configuration option for zen4 architecture - Added auto-detection of zen4 architecture - Added zen4 configuration for all checks related to AMD specific optimizations AMD-Internal: [CPUPL-1937] Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a	2021-11-23 10:29:15 +05:30
nphaniku	4af525a313	AOCL Windows BLIS : Windows build for dynamic dispatch library Change-Id: Ie05eafbeacbd5589b514d9353517330515104939	2021-11-12 08:58:57 +05:30
Dipal M Zambare	3364c0e4eb	Binary and dynamic dispatch configuration name change -- Reverted changes made to include lp/ilp info in binary name This reverts commit `c5e6f885f0`. -- Included BLAS int size in 'make showconfig' -- Renamed amdepyc configuration to amdzen Change-Id: Ie87ec1c03e105f606aef1eac397ba0d8338906a6	2021-11-12 08:58:56 +05:30
Madan mohan Manokar	d6fcfe7345	gemmt SUP limitThread count for small sizes 1. Max thread cap added for small dimension based on product(n*k). AMD-Internal: [CPUPL-1388] Change-Id: I34412a1374bb58a9c4b3fd8e40949a69006cf057	2021-11-12 08:58:55 +05:30
mkurumel	9f1ce594a5	BLIS : Compiler warning fixes Details : - Fixed warnings with AOCC and GCC compilers. AMD-Internal: [CPUPL-1662] Change-Id: Ia0e298a169d4dd4664b11e03a4e3cd340e9fdfce	2021-11-12 08:58:52 +05:30
Kiran Varaganti	5ecb4fd0db	Removed dead code This sanity check (checking top_index != 0) which has been disabled earlier in bli_pool_reinit() by commenting out bli_abort() was unnecessarily computing top_index - this whole statement is commented out. Change-Id: If296754ca8cba3a69d023d4a7ec891f1cbce1d6a	2021-11-12 08:58:51 +05:30
Kiran Varaganti	628f4a41b8	Fixed bug in packing API bli_pack_set_pack_b() API wrongly inits pack_a element of rntm object instead of pack_b. Fixed this bug. Change-Id: I267493ab3ff0bade478d1157799a6fa5b51c7970	2021-11-12 08:58:50 +05:30
Dipal M Zambare	bcd9591b3f	Added support for amdepyc fat binary -- Created new configuration amdepyc to include fat binary which includes zen, zen2, zen3 and generic architecture for fallback. -- Updated amdepyc family makefiles to include macros needed in amdepyc family binary. This file must include all macros, compiler options to be used for non architecture specific code. -- Added 'workaround' to exclude ZEN family specific code in some of the framework files. There are still lot of places were ZEN family specific code is added in framework files. They will be addressed with proper design later. - Moved definition of BLIS_CONFIG_EPYC from header files to makefile so that it is enabled only for framework and kernels -- Removed redundant flag AOCL_BLIS_ZEN, used BLIS_CONFIG_EPYC wherever it was needed. -- Removed un-used, obsolete macros, some of them may be needed for debugging which can be added in the individual workspaces. - BLIS_DEFAULT_MR_THREAD_MAX - BLIS_DEFAULT_NR_THREAD_MAX - BLIS_ENABLE_ZEN_BLOCK_SIZES - BLIS_SMALL_MATRIX_THRES_TRSM - BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES - BLIS_ENABLE_SUP_MR_EXT - BLIS_ENABLE_SUP_NR_EXT -- Corrected implementation of exiting amd64_legacy configuration. AMD-Internal: [CPUPL-1626, CPUPL-1628] Change-Id: I46b0ab3ea3ac7d9ff737fef66c462e85601ee29c	2021-08-26 15:13:49 +05:30
Meghana Vankadari	5cf260f1ec	Fixed a bug in trsm blocksize determining function. - The function "bli_determine_blocksize_b_sub" uses a modulo operation using b_alg. In cases where trsm blocksizes are not set, using this function with trsm blocksizes being zero will lead to FPE. - Added a check to avoid calling the above function with zeroes as parameters. Change-Id: I770aaa13125b55320a68ff9fc3da782111e0978a	2021-08-26 15:10:41 +05:30
Meghana Vankadari	107eaea237	Removed dependency on AOCL_BLIS_ZEN for TRSM blocksizes. Details: - The field "trsm_blkszs" in cntx enables us to set and use blocksizes that are specific to trsm only instead of using commom L3 blocksizes. - While querying blocksizes for TRSM, Added a check to see if trsm-specific blocksizes are set, if they are not set, we continue trsm execution by using common Level-3 blocksizes. - TRSM-specific blocksizes( if required ) needs to be set in cntx_init function of respective configuration. Change-Id: I06afe2b1d29004e428d2fb3bf2be2958cbbd1a97	2021-08-16 00:12:33 -04:00
Meghana Vankadari	6bad157754	Added a new field in cntx to store l3 threshold function pointers Details: - Adding threshold function pointers to cntx gives flexibility to choose different threshold functions for different configurations. - In case of fat binary where configuration is decided at run-time, adding threshold functions under a macro enables these functions for all the configs under a family. This can be avoided by adding function pointers to cntx which can be queried from cntx during run-time based on the config chosen. Change-Id: Iaf7e69e45ae5bb60e4d0f75c7542a91e1609773f	2021-08-16 00:10:01 -04:00
Dipal Madhukar Zambare	0cb552c8f8	Merge "Updated memory pool implementation to contain buffers of different sizes." into amd-staging-milan-3.1	2021-08-12 09:01:02 -04:00
Dipal M Zambare	2eede504b5	Updated memory pool implementation to contain buffers of different sizes. -- In existing design memory pool supports buffers of only one size, This size is determined at compile time to support buffer needed for biggest block size and data type. However, the size calculation is not generic as it considers sizes only for GEMM. Also it assumes that the buffer will not be used for any other purpose than packing operations. -- If the new buffer is requested whose size is bigger than existing size, The pool is re-initialized to contain buffers of new size, however, this is done only if the pool is empty. If the pool is not empty the execution is aborted. This is undesirable as in mulithreaded scenarios it is possible that different threads needs buffers of different sizes at the same time (i.e. while other thread is still in middle of the operation). -- This commit removes the restriction of single size buffer, when new buffer is requested whose size if bigger than exiting one, no re-init is done. New buffer of required size is allocated, added to the pool and returned to the user. Change-Id: I20acdb60eb06ab2e53366d51713aa83c4b2df0da	2021-08-12 10:46:27 +05:30
Kiran Varaganti	adfd569591	DGEMM Optimizations Improve DGEMM performance for smaller sizes. AOCL DYNAMIC is incorporated at blas interface to enable calling bli_dgemm_small when optimum number of threads implied is 1 for (n and k < 10). Improved smart threading logic for dgemm, Additional conditions at the blas interface added to invoke bli_dgemm_small. Removed N > 3 condition from bli_dgemm_small. Change-Id: Id751528dfe9de37800b02ffaf765b6c82487093e	2021-08-10 12:34:43 -04:00
Meghana Vankadari	170719e647	Fixed few bugs in GEMMT for non-zen configs Details: - Added a check condition in GEMMT native path to choose update_triang routines based on whether the kernels are row-preferred or column preferred. - Moved the zen-specific SUP thresholds under BLIS_CONFIG_EPYC macro. For non-zen architectures, it falls back to L3 SUP thresholds. - Modified SUP code path to always choose 2m variant. 1n variant is not implemented for GEMMT. Change-Id: Ifdd55815c588f645e337de80b5b9d1864f6b5dd3	2021-08-10 02:38:17 -04:00
Nallani Bhaskar	75f72b7f6e	Added aocl dynamic feature for dtrsm for small sizes Details: 1. Added aocl-dynamic for dtrsm native path When (m,n)<512 better performance observed for nthreads=4 2. Updated trsm_small threshold such that when (m+n)<320 trsm_small is doing better than native irrespective of number of threads Change-Id: Ic2c50f14db257a05e323cc97c5d1c9b73b68f487	2021-06-18 08:46:47 -04:00
Meghana Vankadari	d5ff5e5f50	Added dynamic threading support for SYRK SUP code path Details: - when AOCL dynamic is enabled, the decision to choose ST Vs MT to solve SYRK is taken based on dimensions of matrices. - Decisions to choose optimum number of threads will be updated in the subsequent commits. - Only local copy of rntm is modified by AOCL Dynamic feature. global_rntm data structure remains unchanged in order to keep track of original number of threads set by application. - Added an early-exit condition in bli_nthreads_optimum when nt =1 or nt=-1. This ensures that AOCL dynamic feature is not used when threading is set using BLIS_IC_NT or BLIS_JC_NT. Change-Id: I8bb0d123e006f82b321ba47fe230ab9039742ce0	2021-06-16 02:08:11 -04:00
Meghana Vankadari	887ecb46e0	Added threshold logic for SYRK Details: - Added decision logic to choose between SUP and native implementations of SYRK for zen2 architectures. - For architectures other than zen2 it will be redirected to gemm threshold function. Change-Id: I350578cc4f930e85b9581e4d9aed220e71a2171d	2021-05-31 05:34:38 -04:00
Meghana Vankadari	eea347b02e	Added dynamic threading support for GEMM SUP code path Details: - Introduced new feature called AOCL_DYNAMIC. - When this macro is defined, Optimum number of threads to solve DGEMM is estimated based on the dimensions (M,N,K). - Range of optimum number of threads will be [1, num_threads], where "num_threads" is number of threads set by the application. - Num_threads is derived from either environment variable "OMP_NUM_THREADS or BLIS_NUM_THREADS' or bli_set_num_threads() API. - Only local copy of rntm is modified by AOCL_DYNAMIC feature. global_rntm data structure remains unchanged in order to keep track of original number of threads set by application. - Optimum number of threads calculation is done only for SUP. - Since 'native' code path handles larger problem sizes, we use max number of threads recommended by the application. AMD-Internal: [CPUPL-1376] Change-Id: I665ce14543d6719857d70325c4a9f959c08e66e3	2021-05-07 09:52:51 +05:30
Meghana Vankadari	1303732e83	Added sup functionality for SYRK Details: - Added bli_syrksup function that internally uses gemmt implementation. - Modified OAPI of syrk to call SUP before proceeding to the conventional implementation. - Copied gemmsup threshold function for syrk temporarily. Thresholds are yet to be derived for syrk. Change-Id: I751c6bd62bc76a3e4717f77c5cb33f19b759151d	2021-04-29 12:35:30 +05:30
lcpu	7401effc03	BLIS:merge: Merge conflicts araised has been fixed while downstreaming BLIS code from master to milan-3.1 branch Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond) Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations. Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations. Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger to conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu) Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu) Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs. Minor code consolidation in all level-3 _front() functions. Reorganized Windows cpp branch of bli_pthreads.c. Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS. Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion. Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv. AMD-internal-[CPUPL-1523] Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd	2021-04-27 11:09:48 +05:30
Field G. Van Zee	95d4f3934d	Moved cpp macro redef of strerror_r to bli_env.c. Details: - Relocated the _MSC_VER-guarded cpp macro re-definition of strerror_r (in terms of strerror_s) from bli_thread.h to bli_env.c. It was likely left behind in bli_thread.h in a previous commit, when code that now resides in bli_env.c was moved from bli_thread.c. (I couldn't find any other instance of strerror_r being used in BLIS, so I moved the #define directly to bli_env.c rather than place it in bli_env.h.) The code that uses strerror_r is currently disabled, though, so this commit should have no affect on BLIS.	2021-03-11 13:50:40 -06:00
nphaniku	b3628cdfd3	AOCL Windows: 3.1 BLIS changes 1. CMake script changes for build with Clang compiler. 2. CMake script changes for build test and testsuite based on the lib type ST/MT 3. CMake script changes for testcpp and blastest 4. Added python scripts to support library build and testsuite build. AMD Internal : [CPUPL-1422] Change-Id: Ie34c3e60e9f8fbf7ea69b47fd1b50ee90099c898	2021-03-08 19:04:17 +05:30
Madan mohan Manokar	3ab9104dae	Handling zgemm real(+/-1) alpha and beta 1.Improved performance when zgemm's alpha and beta are real and equal to +/-1. 2.change done in bli_zgemmsup_rv_zen_asm_3x4n. 3.change done in bli_zgemmsup_rv_zen_asm_3x4m. 4.change done in bli_zgemm_haswell_asm_3x4. Change-Id: Ic14d8507b264c24a8748febf6bc73eb60e476430 AMD-Internal: [CPUPL-1352]	2021-02-10 02:58:58 -05:00
Field G. Van Zee	11dfc176a3	Reorganized thread auto-factorization logic. Details: - Reorganized logic of bli_thread_partition_2x2() so that the primary guts were factored out into "fast" and "slow" variants. Then added logic to the "fast" variant that allows for more optimal thread factorizations in some situations where there is at least one factor of 2. - Changed BLIS_THREAD_RATIO_M from 2 to 1 in bli_kernel_macro_defs.h and added comments to that file describing BLIS_THREAD_RATIO_? and BLIS_THREAD_MAX_?R. - In bli_family_zen.h and bli_family_zen2.h, preprocessed out several macros not used in vanilla BLIS and removed the unused macro BLIS_ENABLE_ZEN_BLOCK_SIZES from the former file. - Disabled AMD's small matrix handling entry points in bli_syrk_front.c and bli_trsm_front.c. (These branches of small matrix handling have not been reviewed by vanilla BLIS developers.) - Added commented-out calls printf() to bli_rntm.c. - Whitespace changes to bli_thread.c.	2020-12-01 19:51:27 +00:00
Field G. Van Zee	64856ea5a6	Auto-reduce (by default) prime numbers of threads. Details: - When requesting multithreaded parallelism by specifying the total number of threads (whether it be via environment variable, globally at runtime, or locally at runtime), reduce the number of threads actually used by one if the original value (a) is prime and (b) exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which is set to 11 by default. If, when specifying the total number of threads (and not the individual ways of parallelism for each loop), prime numbers of threads are desired, this feature may be overridden by defining the BLIS_ENABLE_AUTO_PRIME_NUM_THREADS macro in the bli_family_*.h that corresponds to the configuration family targeted at configure-time. (For now, there is no configure option(s) to control this feature.) Thanks to Jeff Diamond for suggesting this change. - Defined a new function in bli_thread.c, bli_is_prime(), that returns a bool that determines whether an integer is prime. This function is implemented in terms of existing functions in bli_thread.c. - Updated docs/Multithreading.md to document the above feature, along with unrelated minor edits.	2020-11-23 16:54:51 -06:00
Field G. Van Zee	9bb23e6c2a	Added support for systemless build (no pthreads). Details: - Added a configure option, --[enable\|disable]-system, which determines whether the modest operating system dependencies in BLIS are included. The most notable example of this on Linux and BSD/OSX is the use of POSIX threads to ensure thread safety for when application-level threads call BLIS. When --disable-system is given, the bli_pthreads implementation is dummied out entirely, allowing the calling code within BLIS to remain unchanged. Why would anyone want to build BLIS like this? The motivating example was submitted via #454 in which a user wanted to build BLIS for a simulator such as gem5 where thread safety may not be a concern (and where the operating system is largely absent anyway). Thanks to Stepan Nassyr for suggesting this feature. - Another, more minor side effect of the --disable-system option is that the implementation of bli_clock() unconditionally returns 0.0 instead of the time elapsed since some fixed point in the past. The reasoning for this is that if the operating system is truly minimal, the system function call upon which bli_clock() would normally be implemented (e.g. clock_gettime()) may not be available. - Refactored preprocess-guarded code in bli_pthread.c and bli_pthread.h to remove redundancies. - Removed old comments and commented #include of "bli_pthread_wrap.h" from bli_system.h. - Documented bli_clock() and bli_clock_min_diff() in BLISObjectAPI.md and BLISTypedAPI.md, with a note that both are non-functional when BLIS is configured with --disable-system.	2020-11-16 15:55:45 -06:00
Field G. Van Zee	88ad841434	Squash-merge 'pr' into 'squash'. (#457 ) Merged contributions from AMD's AOCL BLIS (#448). Details: - Added support for level-3 operation gemmt, which performs a gemm on only the lower or upper triangle of a square matrix C. For now, only the conventional/large code path will be supported (in vanilla BLIS). This was accomplished by leveraging the existing variant logic for herk. However, some of the infrastructure to support a gemmtsup is included in this commit, including - A bli_gemmtsup() front-end, similar to bli_gemmsup(). - A bli_gemmtsup_ref() reference handler function. - A bli_gemmtsup_int() variant chooser function (with variant calls commented out). - Added support for inducing complex domain gemmt via the 1m method. - Added gemmt APIs to the BLAS and CBLAS compatiblity layers. - Added gemmt test module to testsuite. - Added standalone gemmt test driver to 'test' directory. - Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md. - Added a C++ template header (blis.hh) containing a BLAS-inspired wrapper to a set of polymorphic CBLAS-like function wrappers defined in another header (cblas.hh). These two headers are installed if running the 'install' target with INSTALL_HH is set to 'yes'. (Also added a set of unit tests that exercise blis.hh, although they are disabled for now because they aren't compatible with out-of-tree builds.) These files now live in the 'vendor' top-level directory. - Various updates to 'zen' and 'zen2' subconfigurations, particularly within the context initialization functions. - Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and various minor updates to dotv and scalv kernels. Also added various sup kernels contributed by AMD to kernels/zen/3. However, these kernels are (for now) not yet used, in part because they caused AppVeyor clang failures, and also because I have not found time to review and vet them. 
- Output the python found during configure into the definition of PYTHON in build/config.mk (via build/config.mk.in). - Added early-return checks (A, B, or C with zero dimension; alpha = 0) to bli_gemm_front.c. - Implemented explicit beta = 0 handling in for the sgemm ukernel in bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent bug surfaced because the gemmt module verifies its computation using gemm with its beta parameter set to zero, which, on a cortexa15 system caused the gemm kernel code to unconditionally multiply the uninitialized C data by beta. The C matrix likely contained non-numeric values such as NaN, which then would have resulted in a false failure. - Fixed a bug whereby the implementation for bli_herk_determine_kc(), in bli_l3_blocksize.c, was inadvertantly being defined in terms of helper functions meant for trmm. This bug was probably harmless since the trmm code should have also done the right thing for herk. - Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in kernels/zen/3/bli_gemm_small.c since those macros are not used in vanilla BLIS. - Added cpp guard to definition of bli_mem_clear() in bli_mem.h to accommodate C++'s stricter type checking. - Added cpp guard to test/*.c drivers that facilitate compilation on Windows systems. - Various whitespace changes.	2020-11-14 09:39:48 -06:00
Field G. Van Zee	2a0682f8e5	Implemented runtime subconfig selection (#451 ). Details: - Implemented support for the user manually overriding the automatic subconfiguration selection that happens at runtime. This override can be requested by setting the BLIS_ARCH_TYPE environment variable. The variable must be set to the arch_t id (as enumerated in bli_type_defs.h) corresponding to the desired subconfiguration. If a value outside this enumerated range is given, BLIS will abort with an error message. If the value is in the valid range but corresponds to a subconfiguration that was not activated at configure-time/compile-time, BLIS will abort with a (different) error message. Thanks to decandia50 for suggesting this feature via issue #451. - Defined a new function bli_gks_lookup_id to return the address of an internal data structure within the gks. If this address is NULL, then it indicates that the subconfig corresponding to the arch_t id passed into the function was not compiled into BLIS. This function is used in the second of the two abort scenarios described above. - Defined the enumerated error code BLIS_UNINITIALIZED_GKS_CNTX, which is returned for the latter of the two abort scenarios mentioned above, along with a corresponding error message and a function to perform the error check. - Added cpp macro branching to bli_env.c to support compilation of the auto-detect.x executable during configure-time. This cpp branch is similar to the cpp code already found in bli_arch.c and bli_cpuid.c. - Cleaned up the auto_detect() function to facilitate easier maintenance going forward. Also added a convenient debug switch that outputs the compilation command for the auto-detect.x executable and exits.	2020-10-18 18:04:03 -05:00
Meghana Vankadari	029ed033f1	Added decision logic for zgemmt AMD-Internal: [CPUPL-1032] Change-Id: I8ba1c66b06cd91a864b16a249b263b3694ac1d5e	2020-10-09 12:11:53 +05:30
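
The pool-growing fix described in commits 3c01fcb9fc and 6d4d6a7514 can be illustrated with a small sketch. This is not the code from bli_pool.c; the helper names pool_min_len() and pool_grown_len() and their parameters are hypothetical, chosen only to show the two rules the messages state: a block_ptrs length of 0 is raised to 1, and a grown array must hold all old blocks plus the newly requested ones even when doubling alone would not.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a BLIS function): never allow a block_ptrs
   array of length 0, so that no zero-sized allocation is requested. */
static size_t pool_min_len(size_t requested_len)
{
    return requested_len == 0 ? 1 : requested_len;
}

/* Hypothetical helper (not a BLIS function): pick the new block_ptrs
   length when num_blocks_add blocks are added to a pool that currently
   tracks num_blocks blocks in an array of length cur_len. */
static size_t pool_grown_len(size_t cur_len, size_t num_blocks,
                             size_t num_blocks_add)
{
    /* The pool cannot track more blocks than the array has slots. */
    assert(num_blocks <= cur_len);

    size_t needed  = num_blocks + num_blocks_add; /* old + new blocks */
    size_t doubled = 2 * cur_len;                 /* usual growth step */

    /* Doubling is not enough when the caller asks for many blocks at
       once, or when cur_len is 0 (doubling 0 stays 0). */
    size_t new_len = doubled > needed ? doubled : needed;

    return pool_min_len(new_len);
}
```

Under these assumptions, a pool of length 4 holding 4 blocks grows to 8 when one block is added, but to 14 when ten blocks are added, since doubling would leave room for only 8.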

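The order of precedence listed in commit 6a2c4acc66 for choosing the thread count at BLIS initialization can be sketched as a plain selection function. The function name initial_num_threads() and its parameters are hypothetical, and the convention that a value of 0 or less means "not set or invalid" is an assumption made for this sketch; the real library reads these values from the environment and from the OpenMP runtime.

```c
#include <assert.h>

/* Hypothetical sketch (not a BLIS function). Each argument carries one
   candidate source for the thread count; a value <= 0 is assumed to
   mean "not set or invalid". */
static long initial_num_threads(long blis_num_threads_env,
                                long omp_api_value,
                                long omp_num_threads_env,
                                long num_cores)
{
    assert(num_cores >= 1);

    /* 1. A valid BLIS_NUM_THREADS value takes precedence. */
    if (blis_num_threads_env > 0) return blis_num_threads_env;

    /* 2. Next, a value set through omp_set_num_threads(nt). */
    if (omp_api_value > 0) return omp_api_value;

    /* 3. Next, a valid OMP_NUM_THREADS value. */
    if (omp_num_threads_env > 0) return omp_num_threads_env;

    /* 4. Otherwise fall back to the number of cores. */
    return num_cores;
}
```

The sketch covers only the initialization step. Per the same commit message, a later omp_set_num_threads(nt) call by the application overrides the value chosen here.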

416 Commits
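
The prime-thread-count reduction described in commit 64856ea5a6 can be sketched as follows. The log states that bli_is_prime() is built from existing functions in bli_thread.c; the trial-division is_prime() below is a stand-in, and the names NT_MAX_PRIME and reduce_prime_nt() are hypothetical. The threshold of 11 mirrors the default the message gives for BLIS_NT_MAX_PRIME.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror the default of BLIS_NT_MAX_PRIME stated in the log. */
#define NT_MAX_PRIME 11

/* Stand-in primality test by trial division (not the BLIS implementation). */
static bool is_prime(long n)
{
    if (n < 2) return false;
    for (long d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

/* Hypothetical helper: reduce the requested thread count by one when it
   is prime and exceeds the threshold, so that the result can be
   factored across more than one loop. */
static long reduce_prime_nt(long nt)
{
    assert(nt >= 1);
    if (nt > NT_MAX_PRIME && is_prime(nt))
        return nt - 1;
    return nt;
}
```

With this sketch, a request for 13 threads yields 12, while 11 and 7 are left unchanged because they do not exceed the threshold.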