amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-06-29 10:47:16 +00:00

Author	SHA1	Message	Date
S, Hari Govind	e097346658	Implemented Multithreading Support and Optimization of DGEMV API (#10 ) - Implemented multithreading framework for the DGEMV API on Zen architectures. Architecture specific AOCL-dynamic logic determines the optimal number of threads for improved performance. - The condition check for the value of beta is optimized by utilizing masked operations. The mask value is set based on value of beta, and the masked operations are applied when the vector y is loaded or scaled with beta. AMD-Internal: [CPUPL-6746]	2025-06-17 12:39:48 +05:30
Hari Govind S	29f30c7863	Optimisation for DCOPY API - Introduced new assembly kernel that copies data from source to destination from the front and back of the vector at the same time. This kernel provides better performance for larger input sizes. - Added a wrapper function responsible for selecting the kernel used by DCOPYV API to handle the given input for zen5 architecture. - Updated AOCL-dynamic threshold for DCOPYV API in zen4 and zen5 architectures. - New unit-tests were included in the gtestsuite for the new kernel. AMD-Internal: [CPUPL-6650] Change-Id: Ie2af88b8e97196b6aa02c089e59247742002f568	2025-04-28 05:58:21 -04:00
Hari Govind S	b6e9cde317	Revert "Optimisation of AOCL-dynamic for dotv API" This reverts commit `a028108cbb`. Reason for revert: With libgomp, scalability issues were observed with a higher number of threads, leading to the use of fewer threads. However, with different OpenMP libraries like libomp, this scalability issue was not observed, and using fewer threads resulted in performance loss. The AOCL dynamic logic has been updated to select a higher number of threads, considering the iomp OpenMP library. Change-Id: I2432b715eff01fc99b2c0f8b60bdecfaf5a6568f	2025-02-05 06:33:06 -05:00
Shubham Sharma	9fd2aebd25	Tuned DTRSM thresholds for ZEN5 - Tuned AOCL_dynamic thresholds for DTRSM for ZEN5. - Tuned thresholds for better selection of code path. AMD-Internal: [CPUPL-5408] Change-Id: Ic40b5c8d276c8ce8399fd49ce0d0569f79ec98be	2025-01-28 07:33:10 -05:00
Arnav Sharma	66461b8df3	Improved Multi-threaded Performance of DSCALV - Added AOCL_DYNAMIC thresholds for DSCALV for Zen4 and Zen5 architectures, since earlier they were using the Zen thresholds. - Also updated ST_THRESH for Zen4 and Zen5 to avoid the OpenMP overheads incurred when the single-threaded path is optimally performant. AMD-Internal: [CPUPL-5934] Change-Id: I2d89cf5392516206fab83b672498fb8d98a5b033	2025-01-22 03:55:38 -05:00
Hari Govind S	a028108cbb	Optimisation of AOCL-dynamic for dotv API - Adding AOCL-dynamic logic for dotv on zen4 and zen5 architecture. AMD-Internal: [CPUPL-4877] Change-Id: I421ad5e48cf001d1fc5c13e9f1cbbf4db0bf5f37	2024-12-16 22:09:59 -05:00
Edward Smyth	0c6d006225	Changes to rntm to reduce mutex operations Change usage of global_rntm and tl_rntm to eliminate need for mutex operations when accessing global_rntm. Usage of these data structures is now as follows: * global_rntm is set once during bli_init_apis and includes all getenv calls to check BLIS threading and error printing environment variables. global_rntm is then read-only. * tl_rntm is initialized once from global_rntm on each application thread. Any calls to BLIS set threading/ways APIs will update tl_rntm for that application thread only (Previously they updated global_rntm for all application threads). * Re-initialize info_value in tl_rntm in every call to bli_init APIs. * In bli_rntm_init_from_global() we initialize the local (per API call) rntm as a copy of tl_rntm and then update threading values in bli_thread_update_rntm_from_env() to reflect the current status of OpenMP runtime ICVs. AMD-Internal: [CPUPL-6168][SWLCSG-3143] Change-Id: Ib9387ee2b51f507ed08cc38267057109acea14a6	2024-12-16 04:45:26 -05:00
Edward Smyth	5c946e10dc	Always define function bli_nthreads_optimum bli_nthreads_optimum is exported and called directly by AOCL libFLAME, however it was only defined if building a multithreaded BLIS library with AOCL_DYNAMIC enabled. Change to always define this function. If BLIS is serial or if AOCL_DYNAMIC is disabled, this function returns without modifying the supplied rntm. Change-Id: Ie65690e9e6ec2a8ea77b3778f96676a68e6260be	2024-12-12 12:43:43 -05:00
Hari Govind S	6dd8f06aff	Bug Fix: When calculating number of threads for level1 APIs when BLIS_IC_* or BLIS_JC_* are set - Reverted the change done for tuning ddotv API. When number of threads is mentioned using BLIS_IC_NT or BLIS_JC_NT, ... number of threads are not calculated and as a result number of threads value is -1. OpenMP threads are launched with -1 value. This results in crash. This bug is fixed by correctly calculating number of threads. AMD-Internal: [SWLCSG-3028][CPUPL-5689] Change-Id: Ib9284dca02bdb115752926109beb28dc342e300a	2024-08-29 05:42:03 -04:00
Edward Smyth	1f18eeb267	Determine AMD FP/SIMD execution datapath width Different Zen processors may have a 512-bit, 256-bit or 128-bit FP/SIMD execution datapath width (FP512, FP256, FP128). Zen5 allows a selection of FP512 or FP256 width in BIOS settings. Add cpuid code to detect the width and store an indication of it in the global variable bli_fp_datapath. This should be accessed internally via the function bli_cpuid_query_fp_datapath(). This functionality is currently only enabled on x86_64 platforms and only currently reports a value for AMD CPUs. Also add Zen3 as a fallback path for any unknown AMD processors if AVX512 is not supported or has been disabled. AMD-Internal: [CPUPL-4415] Change-Id: Idf3fb5a697b43bc035ce110e86f60706dcc67f2a	2024-08-28 12:25:57 -04:00
Edward Smyth	7fff7b4026	Code cleanup: Miscellaneous fixes - Delete unused cmake files. - Add guards around call to bli_cpuid_is_avx2fma3_supported in frame/3/bli_l3_sup.c, currently assumes that non-x86 platforms will not use bli_gemmtsup. - Correct variable in frame/base/bli_arch.c on non-x86 builds. - Add guards around omp pragma to avoid possible gcc compiler warning in kernels/zen/2/bli_gemv_zen_int_4.c. - Add missing registers in clobber list in kernels/zen4/1/bli_dotv_zen_int_avx512.c. - Add gtestsuite ERS_IIT tests for TRMV, copied from TRSV. - Correct calls to cblas_{c,z}swap in gtestsuite. - Correct test name in ddotxf gtestsuite program. AMD-Internal: [CPUPL-4415] Change-Id: I69ad56390017676cc609b4d3aba3244a2df6a6b5	2024-08-06 06:56:01 -04:00
Edward Smyth	82bdf7c8c7	Code cleanup: Copyright notices - Standardize formatting (spacing etc). - Add full copyright to cmake files (excluding .json) - Correct copyright and disclaimer text for frame and zen, skx and a couple of other kernels to cover all contributors, as is commonly used in other files. - Fixed some typos and missing lines in copyright statements. AMD-Internal: [CPUPL-4415] Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371	2024-08-05 15:35:08 -04:00
Hari Govind S	f2acd4fd49	AOCL Dynamic for zen3 dcopy - Create separate AOCL Dynamic values for multithreading dcopy API for zen1, zen2 and zen3 AMD-Internal: [CPUPL-5238] Change-Id: I42f56393716edeeace8bfe71d7adab0ba7325b47	2024-08-05 00:44:14 -04:00
Vignesh Balasubramanian	f23b8e636b	AVX2 and AVX512 optimizations for DAXPYV - Removed some of the unrolling factors that affected the performance of AVX2 DAXPYV kernel. In addition to improving the current performance on sizes compatible to single-threaded runs, this will now perform better for tiny sizes as well since the overhead to reach the computation is less. - Updated the vector partitioning logic, by using bli_thread_range_sub( ... ), which ensures that there is no false sharing among multiple threads. - Updated the AOCL-DYNAMIC logic for the API, to include thresholds for zen4 and zen5 micro-architectures. AMD-Internal: [CPUPL-5514] Change-Id: Iee9edddac685334213cd6694421ab3df3547e930	2024-07-31 09:24:36 -04:00
Hari Govind S	eacad443e3	Optimization for DCOPY and SCOPY API - Replaced "vmovupd" with "vmovups" for "bli_scopyv_zen4_asm_avx512" kernel. - Optimization of loop unrolling for "bli_dcopyv_zen4_asm_avx512" and "bli_scopyv_zen4_asm_avx512" kernels. - Replaced existing load balancing algorithm for dcopy API with "bli_thread_range_sub" algorithm. - Included AOCL-dynamic values for optimal number of threads for zen5 architecture. AMD-Internal: [CPUPL-5238] Change-Id: Ic82bdfad9478c8f75dc5a3dcfed0df85fbcae957	2024-07-24 08:23:07 -04:00
Hari Govind S	627bf0b1ba	Implemented Multithreading and Enabled AVX512 Kernel for ZAXPY API - Replaced 'bli_zaxpyv_zen_int5' kernel with optimised 'bli_zaxpyv_zen_int_avx512' kernel for zen4 and zen5 config. - Implemented multithreading support and AOCL-dynamic for ZAXPY API. - Utilized 'bli_thread_range_sub' function to achieve better work distribution and avoid false sharing. AMD-Internal: [CPUPL-5250] Change-Id: I46ad8f01f9d639e0baa78f4475d6e86458d8069b	2024-07-09 01:29:53 -04:00
Edward Smyth	2ee46a3a3a	Merge commit 'cfa3db3f' into amd-main * commit 'cfa3db3f': Fixed bug in mixed-dt gemm introduced in `e9da642`. Removed support for 3m, 4m induced methods. Updated do_sde.sh to get SDE from GitHub. Disable SDE testing of old AMD microarchitectures. Fixed substitution bug in configure. Allow use of 1m with mixing of row/col-pref ukrs. AMD-Internal: [CPUPL-2698] Change-Id: I961f0066243cf26aeb2e174e388b470133cc4a5f	2024-07-08 06:09:11 -04:00
Edward Smyth	8de8dc2961	Merge commit '81e10346' into amd-main * commit '81e10346': Alloc at least 1 elem in pool_t block_ptrs. (#560) Fix insufficient pool-growing logic in bli_pool.c. (#559) Arm SVE C/ZGEMM Fix FMOV 0 Mistake SH Kernel Unused Eigher Arm SVE C/ZGEMM Support beta==0 Arm SVE Config armsve Use ZGEMM/CGEMM Arm SVE: Update Perf. Graph Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0 Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0 A64FX Config Use ZGEMM/CGEMM Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg Arm SVE Add SGEMM 2Vx10 Unindexed Arm SVE ZGEMM Support Gather Load / Scatt. St. Arm SVE Add ZGEMM 2Vx10 Unindexed Arm SVE Add ZGEMM 2Vx7 Unindexed Arm SVE Add ZGEMM 2Vx8 Unindexed Update Travis CI badge Armv8 Trash New Bulk Kernels Enable testing 1m in `make check`. Config ArmSVE Unregister 12xk. Move 12xk to Old Revert __has_include(). Distinguish w/ BLIS_FAMILY_* Register firestorm into arm64 Metaconfig Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo Add test for Apple M1 (firestorm) Firestorm CPUID Dispatcher Armv8 GEMMSUP Edge Cases Require Signed Ints Make error checking level a thread-local variable. Fix data race in testsuite. Update .appveyor.yml Firestorm Block Size Fixes Armv8 Handle beta == 0 for GEMMSUP ??r Case. Move unused ARM SVE kernels to "old" directory. Add an option to control whether or not to use @rpath. Fix $ORIGIN usage on linux. Arm micro-architecture dispatch (#344) Use @path-based install name on MacOS and use relocatable RPATH entries for testsuite inaries. Armv8 Handle beta == 0 for GEMMSUP ?rc Case. Armv8 Fix 6x8 Row-Maj Ukr Apply patch from @xrq-phys. Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs. 
bli_error: more cleanup on the error strings array Arm SVE Exclude SVE-Intrinsic Kernels for GCC 8-9 Arm SVE: Correct PACKM Ker Name: Intrinsic Kers Fix config_name in bli_arch.c Arm Whole GEMMSUP Call Route is Asm/Int Optimized Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref Header Typo Arm: DGEMMSUP ??r(rv) Invoke Edge Size Arm: DGEMMSUP ?rc(rd) Invoke Edge Size Arm: Implement GEMMSUP Fallback Method Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin Added Apple Firestorm (A14/M1) Subconfig Arm64 8x4 Kernel Use Less Regs Armv8-A Supplimentary GEMMSUP Sizes for RD Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm Armv8-A Adjust Types for PACKM Kernels Armv8-A GEMMSUP-RD 6x8m Armv8-A GEMMSUP-RD 6x8n Armv8-A s/d Packing Kernels Fix Typo Armv8-A Introduced s/d Packing Kernels Armv8-A DGEMMSUP 6x8m Kernel Armv8-A DGEMMSUP Adjustments Armv8-A Add More DGEMMSUP Armv8-A Add GEMMSUP 4x8n Kernel Armv8-A Add Part of GEMMSUP 8x4m Kernel Armv8A DGEMM 4x4 Kernel WIP. Slow Armv8-A Add 8x4 Kernel WIP AMD-Internal: [CPUPL-2698] Change-Id: I194ff69356740bb36ca189fd1bf9fef02eec3803	2024-06-25 05:48:46 -04:00
Vignesh Balasubramanian	02da190560	AVX512 optimizations for DNRM2 - Implemented bli_dnorm2fv_unb_var1_avx512( ... ) AVX512 computational kernel for DNRM2 API. - Updated the header to include this kernel signature, as well as the framework layer to use this function in case of ZEN4 and ZEN5 configurations. - Updated the tipping points for ideal thread setting in DNRM2 for ZEN5 micro-architecture. These thresholds are specific to the library's linkage to LLVM's OpenMP or GNU's OpenMp. - Further abstracted the AOCL-DYNAMIC logic to separate functions for ?NRM2 APIs that currently support it(namely, DNRM2 and ZNRM2). - Further updated the ?NRM2 framework to accommodate the necessary changes to invoke the newer AOCL-DYNAMIC functions and the AVX512 kernel, when needed. - Added micro-kernel and memory tests for this kernel in GTestsuite, to validate accuracy and out-of-bounds read and write. AMD-Internal: [CPUPL-5265] Change-Id: I4fc0d0f1e6906bf27d46562ca387c338cc4d2049	2024-06-24 08:50:36 -04:00
Moripalli Chitra	cb915c241d	Tuning ddotv API - Modifying threading framework for L1 APIs to update only number of threads from runtime env and avoid overhead of reading other ICVs. - Removing bli_arch_set_id_once() from bli_arch_set_id_once() flow as bli_arch_check_id_once() calls it. AMD-Internal: [CPUPL-4877] Change-Id: I87b346825a96d74e746a41530b6d22ae162f19ba	2024-06-18 19:31:17 +05:30
Edward Smyth	ca7ba707e7	AOCL_ENABLE_INSTRUCTIONS improvements Changes to how AOCL_ENABLE_INSTRUCTIONS handles requests for different ISAs (i.e. BLIS sub-configurations): - Add missing SSE and AVX options. These will all choose the generic option in amdzen builds. - For unsupported ISAs (e.g. AVX512 on Milan), select the hardware's default sub-configuration instead of trying to step down through alternative choices. - For invalid options, or options not implemented in the BLIS build (e.g. skx in amdzen build), select the hardware's default sub-configuration instead of aborting. Currently BLIS_ARCH_TYPE behaviour is not affected by these changes. AMD-Internal: [CPUPL-5078] Change-Id: Idbd00d2806b1679889a9249878c51981c8d23b3f	2024-06-17 09:58:30 -04:00
Arnav Sharma	38f65c28fc	Correction for DSCALV AOCL_DYNAMIC Thresholds - Updated nt_ideal to 12 for n_elem <= 2500000 and nt_ideal to 16 for n_elem <= 4000000 AMD-Internal: [CPUPL-4408] Change-Id: I97c143ab0d9b97e797358af93181c71d948757cc	2024-06-14 11:51:47 +05:30
Shubham Sharma	b4bc71f3ac	Bug fix IN DAXPYF MT and Code Cleanup - Fixed bug in DAXPYF MT kernel when incx != inca. - Added AOCL Dynamic function for 1f kernels. - Moved all DOTXF and AXPYF kernels into one file. AMD-Internal: [CPUPL-4880] Change-Id: I7d9f44625bc42fad4a9e5b218ecc382efdf22cbe	2024-05-14 06:44:10 -04:00
Edward Smyth	62c886feee	Export some BLIS internal symbols AOCL libFLAME optimizations directly call some internal BLIS symbols. Export them to enable this to work with the BLIS shared library. AMD-Internal: [CPUPL-5044] Change-Id: Icb62dcb51e12d72dde8434593ab17de3c227c93d	2024-05-08 12:51:32 -04:00
Arnav Sharma	1dbeee4d19	ZDOTV AVX512 Kernel with MT Support - Added AVX512 kernel for ZDOTV. - Multithreaded both ZDOTC and ZDOTU with AOCL_DYNAMIC support. AMD-Internal: [CPUPL-5011] Change-Id: I56df9c07ab3b8df06267a99835b088dcada81bd8	2024-05-08 04:54:05 -04:00
Mangala V	e6cc2a3e22	ZGEMMT SUP Optimizations for AVX512 Existing Design: - GEMM AVX2 kernel performs computation and updates temporary C buffer - Portion of temporary C buffer is copied to output C buffer based on UPLO parameter - For diagonal blocks, using GEMM kernels is not efficient New Design: Implemented in current patch when UPLO='L' - GEMMT kernel used for computation, temporary buffer is not required. - Only required elements are computed using mask load store for all fringe cases - Exception: AVX2 code path is used when storage format is RRC, CRR, CRC - AOCL-Dynamic is added based on dimension - Check for AVX platform is added in SUP interface. It returns to native implementation if hardware does not support AVX platform - SUP ref_var2m is expanded for dcomplex datatype to avoid condition check which exists for double datatype AMD_Internal: [CPUPL-5006] Change-Id: I3e21404b732b8f2df9cbdba394303752fdf36286	2024-05-07 23:00:29 +05:30
Shubham Sharma	b9e21e8701	Added ZTRSM AVX512 small code path - Kernel dimensions are 4x4. - Two kernels are implemented, Right Upper and Right lower. - In case of Left variants of TRSM, transpose is induced so that Right variant kernels can be used. - No packing is performed in these kernels. - Changes are made in the threshold to pick ZTRSM small code path. - BLIS_INLINE is removed from signature of "TRSMSMALL_KER_PROT". - These kernels do not support "ENABLE_TRSM_PREINVERSION". - Newly added kernels do not support conjugate transpose. - Added multithreading to ZTRSM small code path. AMD-Internal: [CPUPL-4324] Change-Id: I683b1d5239593e54f433e7f27497d72dfbd9141c	2024-05-03 05:10:41 -04:00
Shubham Sharma	1d983e6124	Added AVX512 kernels for DAXPYF and DDOTXF - Added DAXPYF and DDOTXF AVX512 kernels. - Fuse factor for ddotxf kernel is 8. - 2 DAXPYF kernels are added, with fuse factor 8 and 32. - Multithreading is also added to the DAXPYf kernel with fuse factor 32. - These kernels are internally used by TRSM. - Added changes in TRSV to call these kernels in ZEN4 AMD-Internal: [CPUPL-4880] Change-Id: I12850de974b437bbca07677b68bc3d6a35858770	2024-05-03 05:10:22 -04:00
Hari Govind	9c26de1a18	Optimisation of COPYV APIs - Implemented AVX512 kernels for scopyv_, dcopyv_ and zcopyv_ using respective AVX512 intrinsics including masked load and store operations. - Implemented AVX512 kernels for scopy_, dcopy_ and zcopy_ using assembly language to prevent loss of performance during the translation of intrinsics. - Updated the dcopy_blis_impl( ... ) and zcopy_blis_impl( ... ) function to support multithreaded calls to the respective computational kernels, if and when the OpenMP support is enabled. - Implemented OpenMP parallelization for dcopyv_ and zcopyv_ APIs, while scopyv_ and ccopyv_ only support single thread. AMD-Internal: [CPUPL-4854] Change-Id: I5fbd0bcca4e59001fbe2b1168b624d0c33242b3e	2024-05-01 00:23:01 +05:30
Edward Smyth	ccf3910209	BLIS: bli_cpuid.c incorrectly selecting zen5 on zen4 hardware Correct the order of tests in bli_cpuid.c to test all known zen AVX512 platforms before considering fallback tests on AVX512 support. This avoids builds with "configure auto" or "cmake -DBLIS_CONFIG_FAMILY=auto" incorrectly selecting zen5 sub-configuration on zen4 systems. AMD-Internal: [CPUPL-4966] Change-Id: I8706382e2df7c9ae4bb456e3a7f465053e15beea	2024-04-22 03:26:06 -04:00
Shubham Sharma	ea010c5dc2	Improve perf of bli_obj_equals for 1x1 matrices - Comparison using bli_eqsc is slower than direct comparison. - Changed comparison logic for 1x1 matrix from bli_eqsc to direct comparison. AMD-Internal: [CPUPL-4324] Change-Id: Ifb2d0ad7a97c8bf33b66d624a7ecc53e38c1c803	2024-04-16 00:43:28 -04:00
Edward Smyth	2450a1813b	BLIS: Implement zen5 sub-configuration Implement full support for zen5 as a separate BLIS sub-configuration and code path within amdzen configuration family. AMD-Internal: [CPUPL-3518] Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09	2024-04-12 07:26:31 -04:00
Edward Smyth	f93ccb0cea	BLIS: zen5 cpuid and arch changes Implement initial support for Zen5 systems: - Detect new Zen5 AVXVNNI, AVX512VP2INTERSECT, MOVDIRI and MOVDIR64B instructions. - Assume for now that Zen5 will use Zen4 code path. BLIS_ARCH_TYPE=zen5 will therefore function as an alias for BLIS_ARCH_TYPE=zen4, but different hardware model will still be detected. AMD-Internal: [CPUPL-3518] Change-Id: I00fb413d743f152a5412ace3e740df1fd39a1600	2024-01-17 11:41:15 -05:00
Edward Smyth	ed5010d65b	Code cleanup: AMD copyright notice Standardize format of AMD copyright notice. AMD-Internal: [CPUPL-3519] Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0	2023-11-23 08:54:31 -05:00
Edward Smyth	dc41fa3829	User selection of code path in single architecture builds User control over code path using AOCL_ENABLE_INSTRUCTIONS or BLIS_ARCH_TYPE only makes sense for fat binary builds. Thus this functionality is now disabled by default for single architecture builds. User can still override the default selections by using configure options --enable-blis-arch-type or --disable-blis-arch-type. Other changes: - include x86_64 family as using zen codepaths in cmake build system. - Update help and error messages to include AOCL_ENABLE_INSTRUCTIONS. AMD-Internal: [CPUPL-4202] Change-Id: I7aa5fcf89df8675bcc12d81f81781de647e0fcf8	2023-11-22 10:48:44 -05:00
Edward Smyth	c6f3340125	Merge commit '5013a6cb' into amd-main * commit '5013a6cb': More edits and fixes to docs/FAQ.md. Fixed newly broken link to CREDITS in FAQ.md. More minor fixes to FAQ.md and Sandboxes.md. Updates to FAQ.md, Sandboxes.md, and README.md. Safelist 'master', 'dev', 'amd' branches. Re-enable and fix `fb93d24`. Reverted `fb93d24`. Re-enable and fix `8e0c425` (BLIS_ENABLE_SYSTEM). Removed last vestige of #define BLIS_NUM_ARCHS. Added new packm var3 to 'gemmlike'. Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell. Fix more copy-paste errors in the haswell gemmsup code. Do a fast test on OSX. [ci skip] Fix AArch64 tests and consolidate some other tests. Use C++ cross-compiler for ARM tests. Attempt to fix cxx-test for OOT builds. Updated travis-ci.org link in README.md to .com. Disabled (at least temporarily) commit `8e0c425`. Define BLIS_OS_NONE when using --disable-system. Updated stale calls to malloc_intl() in gemmlike. Blacklist clang10/gcc9 and older for 'armsve'. Add test to Travis using C++ compiler to make sure blis.h is C++-compatible. Moved lang defs from _macro_def.h to _lang_defs.h. Minor tweaks to gemmlike sandbox. Added local _check() code to gemmlike sandbox. README.md citation updates (e.g. BLIS7 bibtex). Tweaks to gemmlike to facilitate 3rd party mods. Whitespace tweaks. Add row- and column-strides for A/B in obj_ukr_fn_t. Clean up some warnings that show up on clang/OSX. Remove schema field on obj_t (redundant) and add new API functions. Add dependency on the "flat" blis.h file for the BLIS and BLAS testsuite objects. Disabled sanity check in bli_pool_finalize(). Implement proposed new function pointer fields for obj_t. AMD-Internal: [CPUPL-2698] Change-Id: I6fc33351fa824580cf4f25b63f0370383cd9422d	2023-11-10 13:05:12 -05:00
Eleni Vlachopoulou	75a4d2f72f	CMake: Adding new portable CMake system. - A completely new system, made to be closer to Make system. AMD-Internal: [CPUPL-2748] Change-Id: I83232786406cdc4f0a0950fb6ac8f551e5968529	2023-11-09 15:49:45 +05:30
Harsh Dave	75356d45e5	DGEMM improvement for very tiny sizes less than 24. - This commit helps improve performance for very small inputs by reducing framework checks and routing all such inputs to bli_dgemm_tiny_6x8_kernel. It forces single-threaded computation for such sizes. - It invokes bli_dgemm_tiny_6x8_kernel for the ZEN, ZEN2, ZEN3 and ZEN4 code paths, except when the AOCL_ENABLE_INSTRUCTIONS environment variable is set to avx512. In that case, such small inputs are routed to the bli_dgemm_tiny_24x8_kernel AVX512 kernel. AMD-Internal: [CPUPL-1701] Change-Id: Idf59f4a8ee76ee8f2514a33be2b618e3ce02383e	2023-11-08 23:45:57 -05:00
Harihara Sudhan S	bed6bd4941	Modified DSCALV AOCL dynamic - AOCL dynamic logic that determines the number of threads to be launched has been modified. AMD-Internal: [CPUPL-3956] Change-Id: Ia6c052515bd24e93660f020a7d0894fc75a229fc	2023-11-06 22:35:14 -05:00
Eashan Dash	c3d1a3878c	Parallelized Pack and Compute Extension APIs 1. OpenMP based multi-threading parallelism is added for BLAS extension APIs of Pack and Compute 2. Both pack and compute APIs are parallelized. 3. Multi-threading of pack and compute APIs done with different number of threads can lead to inconsistent results due to output difference of the full packed matrix buffer when packed with different number of threads. 4. In multi-threaded execution, we ensure output of packed buffer is exactly the same as in single threaded execution. 5. Similarly for compute API, read of packed buffer in multi-threaded execution is exactly the same as in single-threaded execution. 6. Routines are added to compute the offsets for thread workload distribution for MT execution. 1. The offsets are calculated in such a way that it resembles the reorder buffer traversal in single threaded reordering. 2. The panel boundaries (KCxNC) remain as it is accessed in single thread, and as a consequence a thread with jc_start inside the panel cannot consider NC range for reorder. 3. It has to work with NC' < NC, and the offset is calculated using prev NC panels spanning k dim + cur NC panel spanning pc loop cur iteration + (NC - NC') spanning current kc0 (<= KC). 7. Routines to ensure the same are added for MT execution 1. frame/base/bli_pack_compute_utils.c 2. frame/base/bli_pack_compute_utils.h AMD-Internal: [CPUPL-3560] Change-Id: I0dad33e0062519de807c32f6071e61fba976d9ac	2023-11-03 08:47:17 -04:00
Edward Smyth	248dc2af9a	Implement AOCL_ENABLE_INSTRUCTIONS environment variable Add AOCL_ENABLE_INSTRUCTIONS environment variable as an alternative to BLIS_ARCH_TYPE. The details are: 1. AOCL_ENABLE_INSTRUCTIONS and BLIS_ARCH_TYPE env vars are both supported, with BLIS_ARCH_TYPE taking precedence if both are set. 2. Values of "avx2" and "avx512" are aliases for "zen3" and "zen4" code paths respectively in AMD focused builds, or for "haswell" and "skx" respectively in Intel focused builds. These names are not case-sensitive. 3. BLIS_ARCH_TYPE specifies the code path to use. If this is unsupported, e.g. zen4 code path on a Milan or earlier system, that code path is still executed, likely resulting in an illegal instruction error. 4. By contrast, AOCL_ENABLE_INSTRUCTIONS will check ISA support on the system (for AVX2 and AVX512), and try a "lower" ISA option if the desired one is not supported, i.e. AVX512->AVX2, AVX2->generic. 5. Appropriate messages are printed if BLIS_ARCH_DEBUG=1 is set. AMD-Internal: [CPUPL-4105] Change-Id: Ia941b41d4b7d11f5589d7c5e16f607618baed315	2023-10-27 14:59:33 -04:00
Edward Smyth	f5505be9f3	Merge commit 'e366665c' into amd-main * commit 'e366665c': Fixed stale API calls to membrk API in gemmlike. Fixed bli_init.c compile-time error on OSX clang. Fixed configure breakage on OSX clang. Fixed one-time use property of bli_init() (#525). CREDITS file update. Added Graviton2 Neoverse N1 performance results. Remove unnecesary windows/zen2 directory. Add vzeroupper to Haswell microkernels. (#524) Fix Win64 AVX512 bug. Add comment about make checkblas on Windows CREDITS file update. Test installation in Travis CI Add symlink to blis.pc.in for out-of-tree builds Revert "Always run `make check`." Always run `make check`. Fixed configure script bug. Details: - Fixed kernel list string substitution error by adding function substitute_words in configure script. if the string contains zen and zen2, and zen need to be replaced with another string, then zen2 also be incorrectly replaced. Update POWER10.md Rework POWER10 sandbox Skip clearing temp microtile in gemmlike sandbox. Fix asm warning Sandbox header edits trigger full library rebuild. Add vhsubpd/vhsubpd. Fixed bugs in cpackm kernels, gemmlike code. Armv8A Rename Regs for Safe Darwin Compile Armv8A Rename Regs for Clang Compile: FP32 Part Armv8A Rename Regs for Clang Compile: FP64 Part Asm Flag Mingling for Darwin_Aarch64 Added a new 'gemmlike' sandbox. Updated Fugaku (a64fx) performance results. Add explicit compiler check for Windows. Remove `rm-dupls` function in common.mk. Travis CI Revert Unnecessary Extras from `91d3636` Adjust TravisCI Travis Support Arm SVE Added 512b SVE-based a64fx subconfig + SVE kernels. Replace bli_dlamch with something less archaic (#498) Allow clang for ThunderX2 config AMD-Internal: [CPUPL-2698] Change-Id: I561ca3959b7049a00cc128dee3617be51ae11bc4	2023-10-18 09:09:54 -04:00
Edward Smyth	6d0444497f	Improvements to xerbla functionality The following improvements have been implemented: - Option to stop in xerbla on error. This is controlled by setting the environment variable BLIS_STOP_ON_ERROR=1 - Option to disable printing of error message from BLIS. This is controlled by setting the environment variable BLIS_PRINT_ON_ERROR=0 - Added a function to return the value of INFO passed to xerbla, assuming xerbla was not set to stop on error. Example call is info = bli_info_get_info_value(); The default behaviour remains to print but don't stop on error, i.e. the equivalent to export BLIS_PRINT_ON_ERROR=1 BLIS_STOP_ON_ERROR=0 Implementation details: - Values of the environment variables are stored and retrieved from global_rntm. - Info value is stored and retrieved from tl_rntm. It is set to 0 during initialization for all calls and updated by xerbla if an error has occurred. - Call to bli_init_auto before calling PASTEBLACHK macro (which calls xerbla) will reinitialize info_value to 0 via call to bli_thread_update_rntm_from_env AMD-Internal: [CPUPL-3520] Change-Id: I151f6de9b5a437c3a6e3fcf453d5b8fa9c579b9d	2023-10-16 08:48:51 -04:00
Arnav Sharma	c8f14edcf5	BLAS Extension API - ?gemm_compute() - Added support for 2 new APIs: 1. sgemm_compute() 2. dgemm_compute() These are dependent on the ?gemm_pack_get_size() and ?gemm_pack() APIs. - ?gemm_compute() takes the packed matrix buffer (represented by the packed matrix identifier) and performs the GEMM operation: C := A * B + beta * C. - Whenever the kernel storage preference and the matrix storage scheme isn't matching, and the respective matrix being loaded isn't packed either, on-the-go packing has been enabled for such cases to pack that matrix. - Note: If both the matrices are packed using the ?gemm_pack() API, it is the responsibility of the user to pack only one matrix with alpha scalar and the other with a unit scalar. - Note: Support is presently limited to Single Thread only. Both, pack and compute APIs are forced to take n_threads=1. AMD-Internal: [CPUPL-3560] Change-Id: I825d98a0a5038d31668d2a4b84b3ccc204e6c158	2023-10-16 08:18:52 -04:00
Edward Smyth	9d19effec5	Fix for non-x86 builds: cpuid query functions Ensure functions bli_cpuid_query_id() and bli_cpuid_query_model_id() are defined for all architectures in bli_cpuid.c AMD-Internal: [CPUPL-3838] Change-Id: I7b0582a4d63d9f28076761749cf5c24d87316f3e	2023-09-11 04:19:14 -04:00
Edward Smyth	bb4c158e63	Merge commit 'b683d01b' into amd-main * commit 'b683d01b': Use extra #undef when including ba/ex API headers. Minor preprocessor/header cleanup. Fixed typo in cpp guard in bli_util_ft.h. Defined eqsc, eqv, eqm to test object equality. Defined setijv, getijv to set/get vector elements. Minor API breakage in bli_pack API. Add err_t* "return" parameter to malloc functions. Always stay initialized after BLAS compat calls. Renamed membrk files/vars/functions to pba. Switch allocator mutexes to static initialization. AMD-Internal: [CPUPL-2698] Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df	2023-08-21 07:01:38 -04:00
Shubham Sharma	0000cc88de	Removed local copy of cntx in TRSM - TRSM and GEMM has different blocksizes in zen4, in order to accommodate this, a local copy of cntx was created in TRSM. - Local copy of cntx has been removed and TRSM blocksizes are stored in cntx->trsmblkszs. - Functions to override and restore default blocksizes for TRSM are removed. Instead of overriding the default blocksizes, TRSM blocksizes are stored separately in cntx. - Pack buffers for TRSM have to be packed with TRSM blocksizes and GEMM pack buffers have to be packed with default blocksizes. To check if we are packing for TRSM, "family" argument is added in bli_packm_init_pack function. - BLIS_GEMM_FOR_TRSM_UKR has to be used for TRSM if it is set, if it is not set then BLIS_GEMM_UKR has to be used. This functionality has been added to all TRSM macro kernels. - Methods to retrieve TRSM blocksizes from cntx are added to bli_cntx.h. - Tests for micro kernels are modified to accommodate the change in signature of bli_packm_init_pack. AMD-Internal: [CPUPL-3781] Change-Id: Ia567215d6d1aa0f14eae5d3177f4a3dd63b4b20a	2023-08-16 08:09:01 -04:00
Shubham Sharma	9607f207da	AOCL Dynamic tuning for DAXPYV - Existing logic is not picking the ideal number of threads for some problem sizes. - Problem size and their corresponding ideal number of threads are retuned for daxpy in aocl dynamic. AMD-Internal: [CPUPL-3484] Change-Id: Ice874ceef0a1815383f74f1a4b9677677b276af7	2023-08-01 10:34:04 +05:30
Meghana Vankadari	79e174ff0a	Level-3 triangular routines now use different block sizes and kernels. Details: - Eliminated the need for override function in SUP for GEMMT/SYRK. - New set of block sizes, kernels and kernel preferences are added to cntx data structure for level-3 triangular routines. - Added supporting functions to set and get the above parameters from cntx. - Modified GEMMT/SYRK SUP code to use these new block sizes/kernels. In case they are not set, use the default block sizes/kernels of Level-3 SUP. AMD-Internal: [CPUPL-3649] Change-Id: Iee11bd4c4f1d8fbbb749c296258d1b8121c009a0	2023-07-26 01:26:11 -04:00
Harihara Sudhan S	9ee95e171a	Control flow issue reported during static code analysis - Missing break statement will result in unexpected control flow. This function will not launch the threads for the API in question according to the AOCL dynamic logic without the break statement. AMD-Internal: [CPUPL-3436] Change-Id: Ic47d773169c09e84086a27b50cd59dba33529698	2023-05-18 04:53:03 -04:00
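
The selection rules described in commit 248dc2af9a (BLIS_ARCH_TYPE taking precedence, case-insensitive "avx2"/"avx512" aliases, and AOCL_ENABLE_INSTRUCTIONS stepping down AVX512->AVX2->generic) can be sketched roughly as below. This is only an illustration of the behaviour stated in that commit message for AMD focused builds, not the BLIS implementation: `code_path_t`, `select_code_path`, `parse_path`, `eq_nocase` and the `PATH_HW_DEFAULT` fallback are hypothetical names, the environment values are passed in as plain strings, and treating an unrecognised value as "not set" is an assumption. Commit ca7ba707e7 later changes the handling of unsupported ISAs to select the hardware default instead of stepping down.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical code-path identifiers for illustration; not BLIS's own types. */
typedef enum { PATH_GENERIC, PATH_ZEN3, PATH_ZEN4, PATH_HW_DEFAULT } code_path_t;

/* Case-insensitive string equality, since the names are not case-sensitive. */
static int eq_nocase(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) return 0;
        ++a;
        ++b;
    }
    return *a == '\0' && *b == '\0';
}

/* Map a requested name to a code path (AMD focused build aliases).
   Returns 1 if the name was recognised, 0 otherwise. */
static int parse_path(const char *s, code_path_t *out)
{
    if (eq_nocase(s, "avx512") || eq_nocase(s, "zen4")) { *out = PATH_ZEN4; return 1; }
    if (eq_nocase(s, "avx2") || eq_nocase(s, "zen3")) { *out = PATH_ZEN3; return 1; }
    if (eq_nocase(s, "generic")) { *out = PATH_GENERIC; return 1; }
    return 0; /* assumption: unrecognised values are treated as unset */
}

/* Arguments stand in for the two environment variables (NULL if unset)
   and for the ISA support detected on the running system. */
code_path_t select_code_path(const char *blis_arch_type,
                             const char *aocl_enable_instructions,
                             int has_avx2, int has_avx512)
{
    code_path_t p;

    /* BLIS_ARCH_TYPE takes precedence and is used without an ISA check. */
    if (blis_arch_type && parse_path(blis_arch_type, &p)) return p;

    /* AOCL_ENABLE_INSTRUCTIONS checks ISA support and tries a lower option. */
    if (aocl_enable_instructions && parse_path(aocl_enable_instructions, &p)) {
        if (p == PATH_ZEN4 && !has_avx512) p = PATH_ZEN3;    /* AVX512 -> AVX2 */
        if (p == PATH_ZEN3 && !has_avx2)   p = PATH_GENERIC; /* AVX2 -> generic */
        return p;
    }

    /* Neither variable set: leave the choice to hardware detection. */
    return PATH_HW_DEFAULT;
}
```

For example, on a system with AVX2 but no AVX512, `select_code_path(NULL, "AVX512", 1, 0)` steps down to the zen3 path, while `select_code_path("zen4", NULL, 1, 0)` still returns the zen4 path, which is the case the commit message warns is likely to end in an illegal instruction error.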

516 Commits