amd/blis - blis - Public git mirror

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-05-20 16:38:57 +00:00

Author	SHA1	Message	Date
Shubham Sharma	9607f207da	AOCL Dynamic tuning for DAXPYV - Existing logic is not picking the ideal number of threads for some problem sizes. - Problem size and their corresponding ideal number of threads are retuned for daxpy in aocl dynamic. AMD-Internal: [CPUPL-3484] Change-Id: Ice874ceef0a1815383f74f1a4b9677677b276af7	2023-08-01 10:34:04 +05:30
Meghana Vankadari	79e174ff0a	Level-3 triangular routines now use different block sizes and kernels. Details: - Eliminated the need for override function in SUP for GEMMT/SYRK. - New set of block sizes, kernels and kernel preferences are added to cntx data structure for level-3 triangular routines. - Added supporting functions to set and get the above parameters from cntx. - Modified GEMMT/SYRK SUP code to use these new block sizes/kernels. In case they are not set, use the default block sizes/kernels of Level-3 SUP. AMD-Internal: [CPUPL-3649] Change-Id: Iee11bd4c4f1d8fbbb749c296258d1b8121c009a0	2023-07-26 01:26:11 -04:00
Harihara Sudhan S	9ee95e171a	Control flow issue reported during static code analysis - Missing break statement will result in unexpected control flow. This function will not launch the threads for the API in question according to the AOCL dynamic logic without the break statement. AMD-Internal: [CPUPL-3436] Change-Id: Ic47d773169c09e84086a27b50cd59dba33529698	2023-05-18 04:53:03 -04:00
Edward Smyth	7e50ba669b	Code cleanup: No newline at end of file Some text files were missing a newline at the end of the file. One has been added. Also correct file format of windows/tests/inputs.yaml, which was missed in commit `0f0277e104` AMD-Internal: [CPUPL-2870] Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549	2023-04-21 10:02:48 -04:00
Edward Smyth	b531022bac	BLIS cpuid: distinguish submodels within a microarchitecture Incorporate a means of detecting submodels of a microarchitecture, so that different optimizations e.g. block sizes or kernel choices can be used. The details are as follows: - Different models are currently only enabled for zen3 and zen4 architectures (for server parts). - There is a single enumeration (model_t) for all models for all architectures, but function bli_check_valid_model_id() should check the provided model_id against the suitable range within the enumeration for the provided arch_id. - To enable the model_id to be used within the cntx setup functions, checking of a user specified value of BLIS_ARCH_TYPE against the enabled configurations is delayed to a separate function, bli_arch_check_id(). - Default selection based on hardware can be overridden using the BLIS_MODEL_TYPE environment variable. Valid values are: Genoa, Bergamo, Genoa-X, Milan, Milan-X Values are case-insensitive and -X can also be specified as _X or X - Specifying an incorrect value for BLIS_MODEL_TYPE is not an error, but will result in the default option for that architecture being selected. This is different to specifying an incorrect value of BLIS_ARCH_TYPE, which is an error. - The environment variable BLIS_MODEL_TYPE can be renamed using the --rename-blis-model-type argument to configure (or cmake equivalent), in a similar way to renaming BLIS_ARCH_TYPE with --rename-blis-arch-type. - Configure option --disable-blis-arch-type will disable both BLIS_ARCH_TYPE and BLIS_MODEL_TYPE environment variables. - Added code in bli_cpuid.c to detect L1, L2 and L3 cache sizes, currently only for AMD cpus. Functions are provided to query these from other parts of the code, namely: uint32_t bli_cpuid_query_{l1d,l1i,l2,l3}_cache_size() AMD-Internal: [CPUPL-3033] Change-Id: I37a3741abfd59a95e0e905d926c6ede9a0143702	2023-04-20 10:47:44 -04:00
Edward Smyth	6835205ba8	Code cleanup: spelling corrections Corrections for spelling and other mistakes in code comments and doc files. AMD-Internal: [CPUPL-2870] Change-Id: Ifbb5df7df2d6312fe73e06ee6d41c00b16c593ce	2023-04-19 12:44:56 -04:00
Harihara Sudhan S	32bbd96652	Moving AOCL Dynamic logic from BLIS impli layer Threading related changes -------------------------- - Created function bli_nthreads_l1 that dispatches the AOCL dynamic logic for a L1 function based on the kernel ID and input datatypes. - bli_nthreads_l1 gets the number of threads to be launched from the rntm variable. - Added aocl_'ker?'_dynamic function for DAXPYV, DSCALV, ZDSCALV and DDOTV. This function contains the AOCL dynamic logic for the respective kernels. - Added handling for cases when number of elements (n) is less than number of threads spawned (nt) in AOCL dynamic. - Added function bli_thread_vector_partition that calculates the amount of work the calling thread is supposed to perform on a vector. Interface changes ----------------- - In BLIS impli layer of DSCALV, ZDSCALV and AXPYV, added logic to pick kernel based on architecture ID and removed AVX2 flag check. - Modified function signature of ZDSCALV. Alpha is passed as dcomplex and only the real part of the alpha passed is used inside the kernel. The change was done to facilitate kernel dispatch based on arch ID. - Added n <= 0, BLAS exception in BLAS layer of DAXPYV and DDOTV. Without this multithreaded code might crash because of minimum work calculation. Misc ----- - Removed unused variables from ZSCAL2V and AXPYV kernels. AMD-Internal: [CPUPL-3095] Change-Id: I4fc7ef53d21f2d86846e86d88ed853deb8fe59e9	2023-04-14 02:05:38 -04:00
Harsh Dave	9aff4f23b3	Dynamic threading for dgemm sup. - The following sequence is used to get the number of threads for a given input. - The total input range is divided into 3 categories (m < n, m > n, m = n). - Each category is further divided into 4 subcategories (K <= 32, K <= 64, K <= 128, K > 128). - The number of threads is decided according to the input range. AMD-Internal: [CPUPL-2966] Change-Id: I0b04e9de1615e87acb189b228544afac74664f02	2023-04-13 10:57:43 -04:00
Edward Smyth	16592f79ee	BLIS cpuid: better support Intel and future AMD processors Where a processor has not been targeted for optimization, the slow generic code path may be selected by default. Another code path could perform much better, even if not specifically optimized for this processor. Changes to enable this: - Always call bli_cpuid_query_id() to get actual hardware info, even if the user is overriding the choice at build time or by setting BLIS_ARCH_TYPE. - Use bli_cpuid_query_id() to gather availability of AVX2 and AVX512 instructions. - Include AVX-512 fallback test to select zen4 code path on future AMD processors (that don't yet have a specific codepath). - Also use AVX-512 and AVX2 tests on Intel processors to select zen4 or zen3 code paths over the generic code path. If both zen and skx/haswell code paths are enabled, the zen code path will be given preference as zen paths have additional optimizations that should, in general, benefit Intel processors too. AMD-Internal: [CPUPL-3031] Change-Id: Ib7d0ebdb02fec872f9443a1d20070026f2020516	2023-04-06 11:29:33 -04:00
Edward Smyth	9b9142644f	BLIS cpuid: Bugfix for BLAS1/BLAS2 when BLIS_ARCH_TYPE is set BLAS1 and BLAS2 routines may not immediately call bli_init_auto, as the full cntx and other global data structures may not be required for all code paths. This may cause a problem if the user sets BLIS_ARCH_TYPE, as it needs to check the requested value against the available options configured in cntx. Solution: add a separate pthread_once call to run bli_gks_init(). AMD-Internal: [CPUPL-3031] Change-Id: Icd73a8dd161b34b23cc336623d675248f28ed23f	2023-04-04 07:54:31 -04:00
Edward Smyth	1ac03e64b5	BLIS cpuid tidy and bugfix. Improvements to BLIS cpuid functionality: - Tidy names of avx support test functions, especially rename bli_cpuid_is_avx_supported() to bli_cpuid_is_avx2fma3_supported() to more accurately describe what it tests. - Fix bug in frame/base/bli_check.c related to changes in commit `6861fcae91` AMD-Internal: [CPUPL-3031] Change-Id: Iacd8fb0ffbd45288e536fc6314660709055ea2d5	2023-04-03 08:46:37 -04:00
mkadavil	7dca25d056	Fixed decision logic in choosing between sup vs native GEMM code paths -Currently the storage preference of native kernel is used to determine appropriate "m" and "n" which are used in native/SUP threshold logic. This assumption works only when "sup" & native kernel preferences are same. However, this need not have to be the case. This fix calculates the correct "m" and "n" irrespective of the native/sup kernel preferences thereby improving accuracy of this decision logic. AMD-Internal: [CPUPL-3039] Change-Id: I88c5a6838aeed867effdbbdfc41fbf2e88d4e52a	2023-03-02 02:12:42 -05:00
Edward Smyth	2592774fe8	BLIS: Nested Parallelism issues (3) Bugfix for parallel BLAS1 and BLAS2 routines. Threading information was not being set correctly when initializing local rntm from global. Also ensure th_rntm is initialized along with global_rntm by updating it in bli_thread_init(), called by bli_init_once() AMD-Internal: [CPUPL-2433] Change-Id: Iba658f87ae13fe16a57ca1fc279e149b7fa294cf	2022-12-13 12:38:40 -05:00
Edward Smyth	34730a1e4c	BLIS: Nested parallelism issues 1. Check OpenMP active level against max active levels when setting number of threads for starting a new parallel region in ./frame/thread/bli_thread.c to ensure the correct number of threads is used when BLIS is called within nested OpenMP parallelism. 2. In subsequent BLIS calls, threading choices could be incorrectly set based on values used and stored in global_rntm by a previous call. This could apply when the OpenMP number of threads differ from call to call, different nested parallelism is used in different parts of a user's code, or different threads at the user level request different numbers of OpenMP threads for BLIS calls. Keep threading information in both global_rntm and a new Thread Local Storage copy tl_rntm. Update tl_rntm from OpenMP runtime environment (as appropriate) during bli_init_auto() calls in each BLIS routine. The details are: * global_rntm is initialized on first BLIS call based on OpenMP and BLIS threading environment variables. * global_rntm is updated by any BLIS threading function calls. * In bli_thread_update_tl(), called by bli_init_auto(), sync with any BLIS values set or updated in global_rntm. Then, if BLIS threading control is not used, check OpenMP ICVs and set thread count and auto_factor appropriately. * Setting BLIS threading locally (using expert interfaces to pass a user defined rntm data structure) should work as before. 3. bli_thread_get_is_parallel can now only be called outside of parallelism within BLIS routines. Change calls in trsm to reflect this. 4. Ensure blis_mt is set to TRUE in bli_thread_init_rntm_from_env() if any BLIS_*_NT environment variables are set. 5. Set auto_factor = FALSE when the number of threads is 1. 6. bli_rntm_set_num_threads() and bli_rntm_set_ways() set blis_mt=TRUE. 7. Set blis_mt=FALSE in BLIS_RNTM_INITIALIZER and bli_rntm_init(). 8. For debugging, internal information on the rntm threading data can be printed by defining "PRINT_THREADING" at the top of bli_rntm.h 9. bli_rntm_print() now also prints the value of blis_mt. 10. Function prototypes in bli_rntm.h moved to top of file, so that bli_rntm_print() can be used within inline functions defined in this header file. 11. Comment out bli_init_auto() and bli_finalize_auto() calls in Fortran interfaces in frame/compat/blis/thread/b77_thread.c 12. In frame/3/bli_l3_sup_int_amd.c move two calls to set_pack_a and set_pack_b functions outside of the auto_factor if statements. 13. Misc code tidying. AMD-Internal: [CPUPL-2433] Change-Id: I8342c37fb4e280118e5e55164fbd6ea636f858ee	2022-10-21 07:38:39 -04:00
Harihara Sudhan S	42d631bced	Copyright modification - Added copyright information to modified/newly created files missing them Change-Id: If4e73b680246d0363de09587d6dc54bee00ecd71	2022-10-14 12:43:35 +05:30
Edward Smyth	d69473c8f7	Update zen4 test in bli_cpuid.c Remove test on is_arch in bli_cpuid_is_zen4() so it will report true for all Zen4 models, based on family and AVX512 feature tests. AMD-Internal: [CPUPL-2474] Change-Id: I85d2e230b33391d5c9779df4585ae2a358788e72	2022-09-01 05:12:24 -04:00
Edward Smyth	abf848ad12	Code cleanup and warnings fixes - Removed some additional compiler warnings reported by GCC 12.1 - Fixed a couple of typos in comments - frame/3/bli_l3_sup.c: routines were returning before final call to AOCL_DTL_TRACE_EXIT - frame/2/gemv/bli_gemv_unf_var1_amd.c: bli_multi_sgemv_4x2 is only defined in header file if BLIS_ENABLE_OPENMP is defined AMD-Internal: [CPUPL-2460] Change-Id: I2eacd5687f2548d8f40c24bd1b930859eefbbcde	2022-08-29 08:22:30 -04:00
Harihara Sudhan S	5ca632e0f0	Added API to check for BF16 ISA support - Checking for AVX512 bfloat 16 instructions support in architecture using the CPUID AMD-Internal: [CPUPL-2446] Change-Id: I088a8aa46b037af837b2e58a96b59eae70c1dbf0	2022-08-25 11:00:35 -04:00
satish kumar nuggu	0b81f53074	Fixed bug in DZGEMM 1. In zen4, dgemm and sgemm native kernels are column-prefer kernels, while cgemm and zgemm native kernels are row-prefer kernels. zen3 and older architectures use row-prefer kernels for all datatypes, hence the induced transpose is carried out based on a kernel preference check. Added a condition so that the output matrix storage format is checked along with the kernel preference, to avoid the induced transpose for zen4. 2. Added functions bli_cntx_l3_vir_ukr_dislikes_storage_of_md, bli_cntx_l3_vir_ukr_prefers_storage_of_md for checking output matrix storage format and micro kernel preference of mixed datatypes. AMD-Internal: [CPUPL-2347] Change-Id: Ib77676f4e2152f7876ad7dc91de716547f5ba3a5	2022-08-19 12:37:01 -04:00
Edward Smyth	6861fcae91	BLIS: Improve architecture selection at runtime Make BLIS_ARCH_TYPE=0 be an error, so that incorrect meaningful names will get an error rather than "skx" code path. BLIS_ARCH_TYPE=1 is now "generic", so that it should be constant as new code paths are added. Thus all other code path enum values have increased by 2. Also added new options to BLIS configure program to allow: 1. BLIS_ARCH_TYPE functionality to be disabled, e.g.: ./configure --disable-blis-arch-type amdzen 2. Renaming the environment variable tested from "BLIS_ARCH_TYPE" to a specified value, e.g.: ./configure --rename-blis-arch-type=MY_NAME_FOR_ARCH_TYPE amdzen On Windows, these can be enabled with e.g.: cmake ... -DDISABLE_BLIS_ARCH_TYPE=ON or cmake ... -DRENAME_BLIS_ARCH_TYPE=MY_NAME_FOR_ARCH_TYPE This implements changes 2 and 3 in the Jira ticket below. AMD-Internal: [CPUPL-2235] Change-Id: Ie42906bd909f9d83f00a90c5bef9c5bf3ef5adb4	2022-08-19 10:59:35 -04:00
Sireesha Sanga	22af681a11	Runtime Thread Control Feature Update Details: 1. Runtime Thread Control Feature is enhanced to create a provision for the application to allocate a different number of threads to BLIS from the number of threads application is using for itself. 2. In the previous implementation, if application sets BLIS_NUM_THREADS with a valid value, BLIS internally calls omp_set_num_threads() API with same value. Due to this, application could not differentiate between the number of threads used in BLIS library and the application. 3. With the current solution, if Application wants to allocate different number of threads for BLIS API and application, Application can choose either BLIS_NUM_THREADS environment variable or bli_thread_set_num_threads(nt) API for BLIS, and OpenMP APIs or environment variables for itself, respectively. 4. If BLIS_NUM_THREADS is set with a valid value, same value will be used in the subsequent parallel regions unless bli_thread_set_num_threads() API is used by the Application to modify the desired number of threads during BLIS API execution. 5. Once BLIS_NUM_THREADS environment variable or bli_thread_set_num_threads(nt) API is used by the application, BLIS module would always give precedence to these values. BLIS API would not consider the values set using OpenMP API omp_set_num_threads(nt) API or OMP_NUM_THREADS environment variable. 6. If BLIS_NUM_THREADS is not set, then if Application is multithreaded and issued omp_set_num_threads(nt) with desired number of threads, omp_get_max_threads() API will fetch the number of threads set earlier. 7. If BLIS_NUM_THREADS is not set, omp_set_num_threads(nt) is not called by the application, but only OMP_NUM_THREADS is set, omp_get_max_threads() API will fetch the value of OMP_NUM_THREADS. 8. If both environment variables are not set, or if they are set with invalid values, and omp_set_num_threads(nt) is not issued by application, omp_get_max_threads() API will return the number of the cores in the current context. 9. BLIS will initialize rntm->num_threads with the same value. However, if omp_set_nested is false, BLIS APIs called from parallel threads will run sequentially. But if nested parallelism is enabled, then each application will launch MT BLIS. 10. Order of precedence used for number of threads: 0. value set using bli_thread_set_num_threads(nt) by the application 1. valid value set for BLIS_NUM_THREADS environment variable 2. omp_set_num_threads(nt) issued by the application 3. valid value set for OMP_NUM_THREADS environment variable 4. Number of cores 11. If nt is not a valid value for omp_set_num_threads(nt) API, number of threads would be set to 1. omp_get_max_threads() API will return 1. 12. OMP_NUM_THREADS env. variable is applicable only when OpenMP is enabled. AMD-Internal: [CPUPL-2342] Change-Id: I2041ac1d824f0b57a23a2a69abd6017c800f21b6	2022-08-19 05:43:01 -04:00
Vignesh Balasubramanian	cf31fcd020	Fine tuned threshold and aocl dynamic for zgemm for skinny matrices. -Updated optimal threads in zgemm sup path for skinny matrices. -Fine tuned the threshold values for small and sup paths to improve overall zgemm. -Zgemm small is selected for inputs with transb as N. -Redirection of input among small, sup and native path was fine tuned. AMD-Internal : [CPUPL-1900] Change-Id: Ide37c8255def770b4b74bc6e7c6edb5ee15d3b1f	2022-08-19 01:19:14 -04:00
Harsh Dave	46e7727ea8	DGEMM Improvements - In case of DGEMM, when m, n and leading dimensions are large, packing of the A and B matrices is required for optimal performance. - Modified decision logic to choose between sup vs native; apart from matrix dimensions, we now also incorporate matrix leading dimensions into this decision. AMD-Internal: [CPUPL-2366] Change-Id: I255db5f7049d783e22d7c912edf8bbf023e32ed8	2022-08-18 01:14:40 -04:00
Dipal M. Zambare	c85bbfdb50	Updated BLIS version string format - Updated version string to match the recommended format “AOCL-BLIS 3.2.1 Build 20220727”. - Fixed issues with include paths which were preventing the compile-time version string definition from being passed via build commands. - Removed version string determination based on git tag using ‘git describe’; the version string will always be taken from the version file. AMD-Internal: [CPUPL-2324] Change-Id: Idc7edf1211f66d348ec3b5b43f2507c2b810f088	2022-08-12 05:53:35 +00:00
Edward Smyth	737e08cd7a	BLIS: Improve architecture selection at runtime Enable meaningful names as options for BLIS_ARCH_TYPE environment variable. For example, BLIS_ARCH_TYPE=zen4 or BLIS_ARCH_TYPE='ZEN4' or BLIS_ARCH_TYPE=6 will select the same code path (in this release). The meaningful names are not case sensitive. This implements change 1 in the Jira ticket below. Following review comments: 1. Use names from arch_t enum in function bli_env_get_var_arch_type() rather than directly using numbers. 2. AMD copyrights updated. AMD-Internal: [CPUPL-2235] Change-Id: I8cfd43d34765d5e8c7e35680d18825d9934753ad	2022-08-10 08:26:49 -04:00
Minh Quan Ho	6d4d6a7514	Alloc at least 1 elem in pool_t block_ptrs. (#560) Details: - Previously, the block_ptrs field of the pool_t was allowed to be initialized as any unsigned integer, including 0. However, a length of 0 could be problematic given that the behavior of malloc(0) is implementation-defined and therefore variable across implementations. As a safety measure, we check for block_ptrs array lengths of 0 and, in that case, increase them to 1. - Co-authored-by: Minh Quan Ho <minh-quan.ho@kalray.eu> Change-Id: I1e885d887aaba5e73df091ef52e6c327fd6418de	2022-08-01 08:45:00 +05:30
Minh Quan Ho	3c01fcb9fc	Fix insufficient pool-growing logic in bli_pool.c. (#559) Details: - The current mechanism for growing a pool_t doubles the length of the block_ptrs array every time the array length needs to be increased due to new blocks being added. However, that logic did not take into account the new total number of blocks, and the fact that the caller may be requesting more blocks than would fit even after doubling the current length of block_ptrs. The code comments now contain two illustrating examples that show why, even after doubling, we must always have at least enough room to fit all of the old blocks plus the newly requested blocks. - This commit also happens to fix a memory corruption issue that stems from growing any pool_t that is initialized with a block_ptrs length of 0. (Previously, the memory pool for packed buffers of C was initialized with a block_ptrs length of 0, but because it is unused this bug did not manifest by default.) - Co-authored-by: Minh Quan Ho <minh-quan.ho@kalray.eu> Change-Id: Ie4963c56e03cbc197d26e29f2def6494f0a6046d	2022-08-01 08:19:30 +05:30
Chandrashekara K R	fde812015f	Updated blis library version from 4.0 to 3.2.1 AMD-Internal: [CPUPL-2322] Change-Id: I3a6a61543dd2754e2590d7f5f22442c9fdeaee95	2022-07-29 15:55:10 +05:30
Arnav Sharma	4f96bb712e	AOCL Dynamic Optimization for DGEMMT - Fine-tuned the thread allocation logic for parallelizing DGEMMT for the cases where n <= 220. This results in performance improvement in multi-threaded DGEMMT for small values of n. AMD-Internal: [CPUPL-2215] Change-Id: I2654bc64d2dc43c2db911e0c9175755be3aa8ba5	2022-07-18 06:52:48 -04:00
Arnav Sharma	25cf7517ab	AOCL Dynamic Optimization for DGEMMT - Optimized thread allocation for cases with n <= 220 for DGEMMT. AMD-Internal: [CPUPL-2215] Change-Id: Id01edf268a90fd96a41ef947db54f6afc490548f	2022-06-30 15:00:34 +05:30
Dipal M Zambare	2ba2fb2b63	Add AVX2 path for TRSM+GEMM combination. - Enabled AVX2 TRSM + GEMM kernel path, when GEMM is called from TRSM context it will invoke AVX2 GEMM kernels instead of the default AVX-512 GEMM kernels. - The default context has the block sizes for AVX512 GEMM kernels, however, TRSM uses AVX2 GEMM kernels and they need different block sizes. - Added new API bli_zen4_override_trsm_blkszs(). It overrides default block sizes in context with block sizes needed for AVX2 GEMM kernels. - Added new API bli_zen4_restore_default_blkszs(). It restores the block sizes to their default values (as needed by default AVX512 GEMM kernels). - Updated bli_trsm_front() to override the block sizes in the context needed by TRSM + AVX2 GEMM kernels and restore them to the default values at the end of this function. It is done in bli_trsm_front() so that we override the context before creating different threads. AMD-Internal: [CPUPL-2225] Change-Id: Ie92d0fc40f94a32dfb865fe3771dc14ed7884c55	2022-06-29 10:16:24 +00:00
mkadavil	6c112632a7	Low precision gemm integrated as aocl_gemm addon. - Multi-Threaded int8 GEMM (Input - uint8_t, int8_t, Output - int32_t). AVX512_vnni based micro-kernel for int8 gemm. Parallelization supported along m and n dimensions. - Multi-Threaded B matrix reorder support for sgemm. Reordering B matrix is packing entire B matrix upfront before sgemm. It allows sgemm to take advantage of packed B matrix without incurring packing costs during runtime. - Makefile updates to addon make rules to compile avx512 code for selected files in addon folder. - CPU features query enhancements to check for AVX512_VNNI flag. - Bench for int8 gemm and sgemm with B matrix reorder. Supports performance mode for benchmarking and accuracy mode for testing code correctness. AMD-Internal: [CPUPL-2102] Change-Id: I8fb25f5c2fbd97d756f95b623332cb29e3b8d182	2022-06-09 10:28:38 -04:00
mkadavil	31f8820bab	Bug fixes for open mp based multi-threaded GEMM/GEMMT SUP path. - auto_factor to be disabled if BLIS_IC_NT/BLIS_JC_NT is set irrespective of whether num_threads (BLIS_NUM_THREADS) is modified at runtime. Currently the auto_factor is enabled if num_threads > 0 and not reverted if ic/jc/pc/jr/ir ways are set in bli_rntm_set_ways_from_rntm. This results in gemm/gemmt SUP path applying 2x2 factorization of num_threads, and thereby modifying the preset factorization. This issue is not observed in native path since factorization happens without checking auto_factor value. - Setting omp threads to n_threads using omp_set_num_threads after the global_rntm n_threads update in bli_thread_set_num_threads. This ensures that in bli_rntm_init_from_global, omp_get_max_threads returns the same value as set previously. AMD-Internal: [CPUPL-2137] Change-Id: I6c5de0462c5837cfb64793c3e6d49ec3ac2b6426	2022-05-17 18:10:40 +05:30
Nallani Bhaskar	7658067107	Added AOCL Dynamic feature for dtrmm Description: 1. Tuned number of threads to achieve better performance for dtrmm AMD-Internal: [ CPUPL-2100 ] Change-Id: Ib2e3df224ba76d86185721bef1837cd7855dd593	2022-05-17 18:10:39 +05:30
Dipal M Zambare	4ccb438c18	Updated Zen3 architecture detection for Ryzen 5000 - Added support to detect Ryzen 5000 Desktop and APUs AMD-Internal: [CPUPL-2117] Change-Id: I312a7de1a84cf368b74ba20e58192803a9f7dace	2022-05-17 18:10:39 +05:30
Nallani Bhaskar	2acb3f6ed0	Tuned aocl dynamic for specific range in dgemm Description: 1. Decision logic to choose optimal number of threads for given input dgemm dimensions under aocl dynamic feature was retuned based on latest code. 2. Updated code in a few files to avoid compilation warnings. 3. Added a min check for nt in bli_sgemv_var1_smart_threading function AMD-Internal: [ CPUPL-2100 ] Change-Id: I2bc70cc87c73505dd5d2bdafb06193f664760e02	2022-05-17 18:10:39 +05:30
mkadavil	a3836a560d	Smart Threading for GEMM (sgemm) v1. - Cache aware factorization. Experiments show that ic,jc factorization based on m,n gives better results compared to mu,nu on a generic data set in SUP path. Also slight adjustments in the factorizations w.r.t matrix data loads can help in improving perf further. - Moving native path inputs to SUP path. Experiments show that in multi-threaded scenarios if the per thread data falls under SUP thresholds, taking SUP path instead of native path results in improved performance. This is the case even if the original matrix dimensions fall in native path. This is not applicable if A matrix transpose is required. - Enabling B matrix packing in SUP path. Performance improvement is observed when B matrix is packed in cases where gemm takes SUP path instead of native path based on per thread matrix dimensions. AMD-Internal: [CPUPL-659] Change-Id: I3b8fc238a0ece1ababe5d64aebab63092f7c6914	2022-05-17 18:10:39 +05:30
satish kumar nuggu	1a3428ddfc	Parallelization of dtrsm_small routine 1. Parallelized dtrsm_small across m-dimension or n-dimension based on side(Left/Right). 2. Fine-tuning with AOCL_DYNAMIC to achieve better performance. AMD-Internal: [CPUPL-2103] Change-Id: I6be6a2b579de7df9a3141e0d68bdf3e8a869a005	2022-05-17 18:10:39 +05:30
Sireesha Sanga	cc3069fb5e	Performance Improvement for ztrsm small sizes Details: - Optimization of ztrsm for Non-unit Diag Variants. - Handled Overflow and Underflow Vulnerabilities in ztrsm small implementations. - Fixed failures observed in libflame testing. - Fine-tuned ztrsm small implementations for specific sizes 64<= m,n <= 256, by keeping the number of threads to the optimum value, under AOCL_DYNAMIC flag. - For small sizes, ztrsm small implementation is used for all variants. AMD-Internal: [SWLCSG-1194] Change-Id: I066491bb03e5cda390cb699182af4350ae60be2d	2022-05-17 18:10:39 +05:30
satish kumar nuggu	fe7f0a9085	Changes to enable zgemm small from BLAS Layer 1. Removed small gemm call from native path to avoid Single threaded calls as a part of MultiThreaded scenarios. 2. SUP and INDUCED Method path disabled. 3. Added AOCL Dynamic for optimum number of threads to achieve higher performance. Change-Id: I3c41641bef4906bdbdb5f05e67c0f61e86025d92	2022-05-17 18:10:38 +05:30
Sireesha Sanga	6a2c4acc66	Runtime Thread Control using OpenMP API Details: - During runtime, Application can set the desired number of threads using standard OpenMP API omp_set_num_threads(nt). - BLIS Library uses standard OpenMP API omp_get_max_threads() internally, to fetch the latest value set by the application. - This value will be used to decide the number of threads in the subsequent BLAS calls. - At the time of BLIS Initialization, BLIS_NUM_THREADS environment variable will be given precedence, over the OpenMP standard API omp_set_num_threads(nt) and OMP_NUM_THREADS environment variable. - Order of precedence followed during BLIS Initialization is as follows 1. Valid value of BLIS_NUM_THREADS 2. omp_set_num_threads(nt) 3. valid value of OMP_NUM_THREADS 4. Number of cores - After BLIS initialization, if the Application issues omp_set_num_threads(nt) during runtime, number of threads set during BLIS Initialization, is overridden by the latest value set by the Application. - Existing precedence of BLIS_*_NT environment variables and the decision of optimal number of threads over the number of threads derived from the above process remains as it is. AMD-Internal: [CPUPL-2076] Change-Id: I935ba0246b1c256d0fee7d386eac0f5940fabff8	2022-05-17 18:09:22 +05:30
Field G. Van Zee	7a0ba4194f	Added support for addons. Details: - Implemented a new feature called addons, which are similar to sandboxes except that there is no requirement to define gemm or any other particular operation. - Updated configure to accept --enable-addon=<name> or -a <name> syntax for requesting an addon be included within a BLIS build. configure now outputs the list of enabled addons into config.mk. It also outputs the corresponding #include directives for the addons' headers to a new companion to the bli_config.h header file named bli_addon.h. Because addons may wish to make use of existing BLIS types within their own definitions, the addons' headers must be included sometime after that of bli_config.h (which currently is #included before bli_type_defs.h). This is why the #include directives needed to go into a new top-level header file rather than the existing bli_config.h file. - Added a markdown document, docs/Addons.md, to explain addons, how to build with them, and what assumptions their authors should keep in mind as they create them. - Added a gemmlike-like implementation of sandwich gemm called 'gemmd' as an addon in addon/gemmd. The code uses a 'bao_' prefix for local functions, including the user-level object and typed APIs. - Updated .gitignore so that git ignores bli_addon.h files. Change-Id: Ie7efdea366481ce25075cb2459bdbcfd52309717	2022-03-31 12:03:27 +05:30
Dipal M Zambare	6d1edca727	Optimized CPU feature determination. We added new API to check if the CPU architecture has support for AVX instruction. This API was calling CPUID instruction every time it is invoked. However, since this information does not change at runtime, it is sufficient to determine it once and use the cached results for subsequent calls. This optimization is needed to improve performance for small size matrix vector operations. AMD-Internal: [CPUPL-2009] Change-Id: If6697e1da6dd6b7f28fbfed45215ea3fdd569c5f	2022-02-01 11:15:55 +05:30
Dipal M Zambare	f63f78d783	Removed Arch specific code from BLIS framework. - Removed BLIS_CONFIG_EPYC macro - The code dependent on this macro is handled in one of the three ways -- It is updated to work across platforms. -- Added in architecture/feature specific runtime checks. -- Duplicated in AMD specific files. Build system is updated to pick AMD specific files when library is built for any of the zen architecture AMD-Internal: [CPUPL-1960] Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b	2022-01-18 11:51:08 +05:30
Nallani Bhaskar	c2df5eac1c	Reduced number of threads in dgemm for small dimensions - Number of threads are reduced to 1 when the dimensions are very low. - Removed uninitialized xmm compilation warning in trsm small Change-Id: I23262fb82729af5b98ded5d36f5eed45d5255d5b	2021-12-15 15:11:08 +05:30
Dipal M Zambare	fd8a3aace9	Added support for zen4 architecture - Added configuration option for zen4 architecture - Added auto-detection of zen4 architecture - Added zen4 configuration for all checks related to AMD specific optimizations AMD-Internal: [CPUPL-1937] Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a	2021-11-23 10:29:15 +05:30
nphaniku	4af525a313	AOCL Windows BLIS : Windows build for dynamic dispatch library Change-Id: Ie05eafbeacbd5589b514d9353517330515104939	2021-11-12 08:58:57 +05:30
Dipal M Zambare	3364c0e4eb	Binary and dynamic dispatch configuration name change -- Reverted changes made to include lp/ilp info in binary name This reverts commit `c5e6f885f0`. -- Included BLAS int size in 'make showconfig' -- Renamed amdepyc configuration to amdzen Change-Id: Ie87ec1c03e105f606aef1eac397ba0d8338906a6	2021-11-12 08:58:56 +05:30
Madan mohan Manokar	d6fcfe7345	gemmt SUP limit thread count for small sizes 1. Max thread cap added for small dimension based on product(n*k). AMD-Internal: [CPUPL-1388] Change-Id: I34412a1374bb58a9c4b3fd8e40949a69006cf057	2021-11-12 08:58:55 +05:30
mkurumel	9f1ce594a5	BLIS : Compiler warning fixes Details : - Fixed warnings with AOCC and GCC compilers. AMD-Internal: [CPUPL-1662] Change-Id: Ia0e298a169d4dd4664b11e03a4e3cd340e9fdfce	2021-11-12 08:58:52 +05:30
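
Commit 22af681a11 in the table above lists an order of precedence for the number of threads (item 10). As a rough illustration only, the order can be sketched as a first-valid-value lookup. The function name `resolve_num_threads` and its parameters are hypothetical and are not part of BLIS; each argument stands for a value the caller has already read from the corresponding source, with a value of 0 or less meaning "not set or invalid". The sketch does not model item 11 (an invalid `omp_set_num_threads(nt)` value forcing a count of 1).

```c
#include <stddef.h>

/* Hypothetical helper, not a BLIS function. Returns the first candidate
   that is set (> 0), in the precedence order given in commit 22af681a11:
   0. bli_thread_set_num_threads(nt), 1. BLIS_NUM_THREADS,
   2. omp_set_num_threads(nt), 3. OMP_NUM_THREADS, 4. number of cores. */
int resolve_num_threads(int blis_api_nt, int blis_env_nt,
                        int omp_api_nt, int omp_env_nt, int num_cores)
{
    const int candidates[] = {
        blis_api_nt, blis_env_nt, omp_api_nt, omp_env_nt, num_cores
    };
    const size_t count = sizeof candidates / sizeof candidates[0];

    for (size_t i = 0; i < count; ++i) {
        if (candidates[i] > 0) {
            return candidates[i];
        }
    }
    /* Assumption of this sketch: fall back to one thread if nothing is set. */
    return 1;
}
```

In this sketch a BLIS-specific setting always wins over an OpenMP one, which matches item 5 of the commit message.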

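
Commits 3c01fcb9fc and 6d4d6a7514 in the table above describe two rules for sizing the `block_ptrs` array of a `pool_t`: doubling alone is not enough when many blocks are requested at once, and a length of 0 is bumped to 1. The following is a minimal sketch of that sizing rule, not the code in bli_pool.c; the function name `grown_block_ptrs_len` and its parameters are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical sketch, not the bli_pool.c implementation.
   cur_len:    current length of the block_ptrs array
   num_blocks: number of blocks already stored
   num_new:    number of blocks the caller wants to add */
size_t grown_block_ptrs_len(size_t cur_len, size_t num_blocks, size_t num_new)
{
    size_t needed  = num_blocks + num_new; /* old blocks plus new blocks */
    size_t doubled = 2 * cur_len;          /* the usual growth step */
    size_t new_len = doubled > needed ? doubled : needed;

    /* Never return 0, so a later allocation is not a zero-size request. */
    return new_len == 0 ? 1 : new_len;
}
```

For example, with a current length of 4 holding 4 blocks, a request for 10 more blocks needs room for 14 entries, which doubling to 8 would not provide.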

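
Commit b531022bac in the table above states that BLIS_MODEL_TYPE values are case-insensitive, that `-X` may also be written `_X` or `X`, and that an unrecognized value selects the default rather than raising an error. One way such matching could work is sketched below. The enum `model_sketch_t` and the function `parse_model_type` are hypothetical and do not reflect the actual `model_t` enumeration or parsing code in BLIS. The sketch is also slightly more permissive than the commit describes, since it ignores `-` and `_` anywhere in the string.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical enumeration for this sketch only. */
typedef enum {
    MODEL_DEFAULT, MODEL_GENOA, MODEL_BERGAMO,
    MODEL_GENOA_X, MODEL_MILAN, MODEL_MILAN_X
} model_sketch_t;

/* Hypothetical parser: lower-cases the value and drops '-' and '_' so that
   "Genoa-X", "genoa_x" and "GENOAX" all compare equal. Unknown or missing
   values yield MODEL_DEFAULT instead of an error. */
model_sketch_t parse_model_type(const char *value)
{
    char buf[16];
    size_t n = 0;

    if (value == NULL) return MODEL_DEFAULT;

    for (const char *p = value; *p != '\0'; ++p) {
        if (*p == '-' || *p == '_') continue;
        if (n + 1 >= sizeof buf) return MODEL_DEFAULT; /* too long to match */
        buf[n++] = (char)tolower((unsigned char)*p);
    }
    buf[n] = '\0';

    if (strcmp(buf, "genoa") == 0)   return MODEL_GENOA;
    if (strcmp(buf, "bergamo") == 0) return MODEL_BERGAMO;
    if (strcmp(buf, "genoax") == 0)  return MODEL_GENOA_X;
    if (strcmp(buf, "milan") == 0)   return MODEL_MILAN;
    if (strcmp(buf, "milanx") == 0)  return MODEL_MILAN_X;
    return MODEL_DEFAULT;
}
```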
441 Commits