amd/blis - blis - Public git mirror

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-05-11 17:50:00 +00:00

Author	SHA1	Message	Date
Harihara Sudhan S	8d7593a940	AVX2 vectorized SCALV for double complex - Added ZSCALV that uses AVX2 and SSE instructions for vectorization. - Return early when the vector dimension is zero. When alpha is 1 there is no need to perform computation hence return early. - When alpha is zero expert interface of ZSETV is invoked. In this case, all the elements of the input vector are set 0. - Invocation of expert interface means that NULL pointer can be passed to the function in place of context. Expert interface of ZSETV will query the context and get the appropriate function pointer. - Added BLAS interface for ZSCALV. The architecture ID is used to decide the function that is to be invoked. - Created a new macro INSERT_GENTFUNCSCAL_BLAS_C to instantiate SCALV BLAS macro interface only for single complex type and single complex, float mixed type AMD-Internal: [CPUPL-2773] Change-Id: I0d6995bce883c0ebdc5da0046608fc59d03f6050	2023-02-05 09:36:12 -05:00
Kiran Varaganti	201db7883c	Integrated 32x6 DGEMM kernel for zen4 and its related changes are added. Details: - Now AOCL BLIS uses AVX512 - 32x6 DGEMM kernel for native code path. Thanks to Moore, Branden <Branden.Moore@amd.com> for suggesting and implementing these optimizations. - In the initial version of 32x6 DGEMM kernel, to broadcast elements of B packed we perform load into xmm (2 elements), broadcast into zmm from xmm and then to get the next element, we do vpermilpd(xmm). This logic is replaced with direct broadcast from memory, since the elements of Bpack are stored contiguously, the first broadcast fetches the cacheline and then subsequent broadcasts happen faster. We use two registers for broadcast and interleave broadcast operation with FMAs to hide any memory latencies. - Native dTRSM uses 16x14 dgemm - therefore we need to override the default blkszs (MR,NR,..) when executing trsm. we call bli_zen4_override_trsm_blkszs(cntx_local) on a local cntx_t object for double data-type as well in the function bli_trsm_front(), bli_trsm_xx_ker_var2, xx = {ll,lu,rl,ru}. Renamed "BLIS_GEMM_AVX2_UKR" to "BLIS_GEMM_FOR_TRSM_UKR" and in the bli_cntx_init_zen4() we replaced dgemm kernel for TRSM with 16x14 dgemm kernel. - New packm kernels - 16xk, 24xk and 32xk are added. - New 32xk packm reference kernel is added in bli_packm_cxk_ref.c and it is enabled for zen4 config (bli_dpackm_32xk_zen4_ref() ) - Copyright year updated for modified files. - cleaned up code for "zen" config - removed unused packm kernels declaration in kernels/zen/bli_kernels.h - [SWLCSG-1374], [CPUPL-2918] Change-Id: I576282382504b72072a6db068eabd164c8943627	2023-01-19 23:11:36 +05:30
Harsh Dave	0a699c45f0	Corrected VEXTRACTF64X2 macro in bli_x86_asm_macro file. - Previously VEXTRACTF64X2 macro was defined as vextractf64x4. Change-Id: I79727a85b7d6da3b4d524064e297fc8c71d4f466	2023-01-16 23:04:48 -05:00
Vignesh Balasubramanian	dbd0c069d4	Implemented 3x4 RD kernels for ZGEMM in SUP path -Implemented (r)ow preferential (d)ot product milli-kernels (m and n variants) for dcomplex datatype along SUP path. -These computational kernels extend the support for handling RRC and CRC storage schemes along the SUP path. In case of BLAS api call, it corresponds to the input cases with transa equal to T and transb equal to N. -In case of the B matrix being packed(conditionally), the inputs are redirected to the existing (r)ow preferential (v)ector load optimized kernels due to better performance. -Added macro for vhsubpd assembly instruction, to support the arithmetic for complex datatype in its interleaved storage. AMD-Internal: [CPUPL-2593] Change-Id: If90834e55e9e31aa87d3d5b711efad9ef2458da8	2022-12-22 18:31:46 +05:30
Harihara Sudhan S	42d631bced	Copyright modification - Added copyright information to modified/newly created files missing them Change-Id: If4e73b680246d0363de09587d6dc54bee00ecd71	2022-10-14 12:43:35 +05:30
Shubham Sharma	2d72d4c862	DTRSM Native path AVX 512 kernels are developed - Added DTRSM AVX512 kernels for lower and upper variants in the native path. - Changes in framework are made to accommodate these kernels. AMD-Internal: [CPUPL-2588] Change-Id: I1f74273ef2389018343c0645870290373ce25efe	2022-10-01 15:58:24 +00:00
Arnav Sharma	90f915d3a9	Vectorized and parallelized zdscal routine - Implemented optimized intrinsic kernel for zdscalv for the cases where AVX2 is supported. - Also added multithreaded support for the same. - The optimal number of threads is being calculated on the basis of input size. AMD-Internal: [CPUPL-2602] Change-Id: I4d05c3b1cc365a7770703286a89c6dce3875c067	2022-09-30 06:11:07 -04:00
Dipal M Zambare	de3247b0da	Removed extra prototypes for ?gemm3m APIs - Removed prototypes for float(sgemm3m) and double(dgemm3m) types, as BLIS implements this API only for scomplex(cgemm3m) and dcomplex(zgemm3m) AMD-Internal: [SWLCSG-1477] Change-Id: Ifad86a74b4c939ed240743894b85bb4fa5e6d754	2022-09-20 15:51:56 +05:30
Dipal M Zambare	866e8de7bf	CBLAS/BLAS interface decoupling for the level 2 APIs -In BLIS, the CBLAS interface is implemented as a wrapper around the BLAS interface. For example the CBLAS API ‘cblas_dgemv’ internally invokes the BLAS API ‘dgemv_’. -This coupling between CBLAS and BLAS interface prevents the end user from overriding them individually by the application or other libraries. -This change separates the CBLAS and BLAS implementation by adding an additional level of abstraction. The implementation of the API is moved to the new function which is invoked directly from the CBLAS and BLAS wrappers. AMD-Internal: [SWLCSG-1477] Change-Id: Ie7cbbac86bbfa1075a5064b31b365e911f67786c	2022-09-15 17:51:05 +05:30
Dipal M Zambare	e18db8a172	CBLAS/BLAS interface decoupling for the level 3 APIs -In BLIS, the CBLAS interface is implemented as a wrapper around the BLAS interface. For example the CBLAS API ‘cblas_dgemm’ internally invokes the BLAS API ‘dgemm_’. -This coupling between CBLAS and BLAS interface prevents the end user from overriding them individually by the application or other libraries. -This change separates the CBLAS and BLAS implementation by adding an additional level of abstraction. The implementation of the API is moved to the new function which is invoked directly from the CBLAS and BLAS wrappers. AMD-Internal: [SWLCSG-1477] Change-Id: Id9e307154342d2c17b0ac6db580c36f1a9ee6409	2022-09-15 06:23:46 -04:00
Dipal M Zambare	5c42afada8	Revert "CBLAS/BLAS interface decoupling for level 3 APIs" This reverts commit `d925ebeb06`. Change-Id: I2e842b29c1fedbe14bf913949cf978f3e7515ff3	2022-08-30 14:50:38 +05:30
Dipal M Zambare	7e42b3d2e0	Revert "CBLAS/BLAS interface decoupling for level 2 APIs" This reverts commit `192f5313a1`. Change-Id: I876cad90902970ebc61550f109eb0ce32539ea1c	2022-08-30 11:53:46 +05:30
Dipal M Zambare	40c71dd2e1	Revert "CBLAS/BLAS interface decoupling for swap api" This reverts commit `2beaa6a0e6`. Reverting it as it is planned for the next release. Change-Id: Ib9271acd0b5b4cfd10c8f8b7bbb6ef93a3d594ea	2022-08-30 10:10:06 +05:30
jagar	2beaa6a0e6	CBLAS/BLAS interface decoupling for swap api - In BLIS the cblas interface is implemented as a wrapper around the blas interface. For example the CBLAS api ‘cblas_dgemm’ internally invokes BLAS API ‘dgemm_’. - If the end user wants to use the different libraries for CBLAS and BLAS, current implementation of BLIS doesn’t allow it. - This change separates the CBLAS and BLAS implementation by adding an additional level of abstraction. The implementation of the API is moved to the new function which is invoked directly from the CBLAS and BLAS wrappers. AMD-Internal: [SWLCSG-1477] Change-Id: I8d81072aaca739f175318b82f6510d386103c24b	2022-08-29 16:26:01 +05:30
jagar	192f5313a1	CBLAS/BLAS interface decoupling for level 2 APIs - In BLIS the cblas interface is implemented as a wrapper around the blas interface. For example the CBLAS api ‘cblas_dgemm’ internally invokes BLAS API ‘dgemm_’. - If the end user wants to use the different libraries for CBLAS and BLAS, current implementation of BLIS doesn’t allow it and may result in recursion - This change separates the CBLAS and BLAS implementation by adding an additional level of abstraction. The implementation of the API is moved to the new function which is invoked directly from the CBLAS and BLAS wrappers. AMD-Internal: [SWLCSG-1477] Change-Id: I8380b6468683028035f2aece48916939e0fede8a	2022-08-29 09:47:19 +05:30
Chandrashekara K R	d925ebeb06	CBLAS/BLAS interface decoupling for level 3 APIs ->In BLIS the cblas interface is implemented as a wrapper around the blas interface. For example the CBLAS api ‘cblas_dgemm’ internally invokes BLAS API ‘dgemm_’. ->If the end user wants to use the different libraries for CBLAS and BLAS, current implementation of BLIS doesn’t allow it and may result in recursion ->This change separates the CBLAS and BLAS implementation by adding an additional level of abstraction. The implementation of the API is moved to the new function which is invoked directly from the CBLAS and BLAS wrappers. AMD-Internal: [SWLCSG-1477] Change-Id: I6218a3e81060fc8045f4de0ace87f708465dfae5	2022-08-26 05:54:29 -04:00
Edward Smyth	6861fcae91	BLIS: Improve architecture selection at runtime Make BLIS_ARCH_TYPE=0 be an error, so that incorrect meaningful names will get an error rather than "skx" code path. BLIS_ARCH_TYPE=1 is now "generic", so that it should be constant as new code paths are added. Thus all other code path enum values have increased by 2. Also added new options to BLIS configure program to allow: 1. BLIS_ARCH_TYPE functionality to be disabled, e.g.: ./configure --disable-blis-arch-type amdzen 2. Renaming the environment variable tested from "BLIS_ARCH_TYPE" to a specified value, e.g.: ./configure --rename-blis-arch-type=MY_NAME_FOR_ARCH_TYPE amdzen On Windows, these can be enabled with e.g.: cmake ... -DDISABLE_BLIS_ARCH_TYPE=ON or cmake ... -DRENAME_BLIS_ARCH_TYPE=MY_NAME_FOR_ARCH_TYPE This implements changes 2 and 3 in the Jira ticket below. AMD-Internal: [CPUPL-2235] Change-Id: Ie42906bd909f9d83f00a90c5bef9c5bf3ef5adb4	2022-08-19 10:59:35 -04:00
Sireesha Sanga	22af681a11	Runtime Thread Control Feature Update Details: 1. Runtime Thread Control Feature is enhanced to create a provision for the application to allocate a different number of threads to BLIS from the number of threads application is using for itself. 2. In the previous implementation, if application sets BLIS_NUM_THREADS with a valid value, BLIS internally calls omp_set_num_threads() API with same value. Due to this, application could not differentiate between the number of threads used in BLIS library and the application. 3. With the current solution, if Application wants to allocate different number of threads for BLIS API and application, Application can choose either BLIS_NUM_THREADS environment variable or bli_thread_set_num_threads(nt) API for BLIS, and OpenMP APIs or environment variables for itself, respectively. 4. If BLIS_NUM_THREADS is set with a valid value, same value will be used in the subsequent parallel regions unless bli_thread_set_num_threads() API is used by the Application to modify the desired number of threads during BLIS API execution. 5. Once BLIS_NUM_THREADS environment variable or bli_thread_set_num_threads(nt) API is used by the application, BLIS module would always give precedence to these values. BLIS API would not consider the values set using OpenMP API omp_set_num_threads(nt) API or OMP_NUM_THREADS environment variable. 6. If BLIS_NUM_THREADS is not set, then if Application is multithreaded and issued omp_set_num_threads(nt) with desired number of threads, omp_get_max_threads() API will fetch the number of threads set earlier. 7. If BLIS_NUM_THREADS is not set, omp_set_num_threads(nt) is not called by the application, but only OMP_NUM_THREADS is set, omp_get_max_threads() API will fetch the value of OMP_NUM_THREADS. 8. If both environment variables are not set, or if they are set with invalid values, and omp_set_num_threads(nt) is not issued by application, omp_get_max_threads() API will return the number of the cores in the current context. 9. BLIS will initialize rntm->num_threads with the same value. However if omp_set_nested is false - BLIS APIs called from parallel threads will run sequentially. But if nested parallelism is enabled, then each application will launch MT BLIS. 10. Order of precedence used for number of threads: 0. value set using bli_thread_set_num_threads(nt) by the application 1. valid value set for BLIS_NUM_THREADS environment variable 2. omp_set_num_threads(nt) issued by the application 3. valid value set for OMP_NUM_THREADS environment variable 4. Number of cores 11. If nt is not a valid value for omp_set_num_threads(nt) API, number of threads would be set to 1. omp_get_max_threads() API will return 1. 12. OMP_NUM_THREADS env. variable is applicable only when OpenMP is enabled. AMD-Internal: [CPUPL-2342] Change-Id: I2041ac1d824f0b57a23a2a69abd6017c800f21b6	2022-08-19 05:43:01 -04:00
Edward Smyth	737e08cd7a	BLIS: Improve architecture selection at runtime Enable meaningful names as options for BLIS_ARCH_TYPE environment variable. For example, BLIS_ARCH_TYPE=zen4 or BLIS_ARCH_TYPE='ZEN4' or BLIS_ARCH_TYPE=6 will select the same code path (in this release). The meaningful names are not case sensitive. This implements change 1 in the Jira ticket below. Following review comments: 1. Use names from arch_t enum in function bli_env_get_var_arch_type() rather than directly using numbers. 2. AMD copyrights updated. AMD-Internal: [CPUPL-2235] Change-Id: I8cfd43d34765d5e8c7e35680d18825d9934753ad	2022-08-10 08:26:49 -04:00
Dipal M Zambare	2ba2fb2b63	Add AVX2 path for TRSM+GEMM combination. - Enabled AVX2 TRSM + GEMM kernel path, when GEMM is called from TRSM context it will invoke AVX2 GEMM kernels instead of the default AVX-512 GEMM kernels. - The default context has the block sizes for AVX512 GEMM kernels, however, TRSM uses AVX2 GEMM kernels and they need different block sizes. - Added new API bli_zen4_override_trsm_blkszs(). It overrides default block sizes in context with block sizes needed for AVX2 GEMM kernels. - Added new API bli_zen4_restore_default_blkszs(). It restores the block sizes to their default values (as needed by default AVX512 GEMM kernels). - Updated bli_trsm_front() to override the block sizes in the context needed by TRSM + AVX2 GEMM kernels and restore them to the default values at the end of this function. It is done in bli_trsm_front() so that we override the context before creating different threads. AMD-Internal: [CPUPL-2225] Change-Id: Ie92d0fc40f94a32dfb865fe3771dc14ed7884c55	2022-06-29 10:16:24 +00:00
Dipal M Zambare	c87b9aab75	Added support for AVX512 for Windows and AMAXV - Completed zen4 configuration support on windows - Enabled AVX512 kernels for AMAXV - Added zen4 configuration in amdzen for windows - Moved all zen4 kernels inside kernels/zen4 folder AMD-Internal: [CPUPL-2108] Change-Id: I9d2336998bbcdb8e2c4ca474977b5939bfa578ba	2022-06-08 11:09:48 +05:30
Dipal M Zambare	8cc15107ed	Enabled AVX-512 kernels for Zen4 config - Enabled AVX-512 skylake kernels in zen4 configuration. AVX-512 kernels are added for GEMM float and double types. - Enabled reference kernel for TRSM native path AMD-Internal: [CPUPL-2108] Change-Id: I66f3468346085c17183cbcbf4f2c8cfe07579b6f	2022-06-03 06:34:35 +00:00
Chandrashekara K R	8e6da6b844	Added checks to avoid defining the bool type for C++ code in windows, to avoid a redefinition build time error. AMD-Internal: [CPUPL-2037] Change-Id: I065da9206ab06f60876324f258ee12fb9fe83f88	2022-05-17 18:10:39 +05:30
Dipal M Zambare	e712ffe139	Added AOCL progress support for BLIS -- AOCL libraries are used for lengthy computations which can go on for hours or days, once the operation is started, the user doesn’t get any update on current state of the computation. This (AOCL progress) feature enables user to receive a periodic update from the libraries. -- User registers a callback with the library if it is interested in receiving the periodic update. -- The library invokes this callback periodically with information about current state of the operation. -- The update frequency is statically set in the code, it can be modified as needed if the library is built from source. -- This feature is supported for GEMM and TRSM operations. -- Added example for GEMM and TRSM. -- Cleaned up and reformatted test_gemm.c and test_trsm.c to remove warnings and making indentation consistent across the file. AMD-Internal: [CPUPL-2082] Change-Id: I2aacdd8fb76f52e19e3850ee0295df49a8b7a90e	2022-05-17 18:10:39 +05:30
Sireesha Sanga	9621ef3067	Performance Improvement for ztrsm small sizes Details: - Enable ztrsm small implementation - For small sizes, Right Variants and Left Unit Diag Variants are using ztrsm_small implementations. - Optimization of Left Non-Unit Diagonal Variants, Work In Progress AMD-Internal: [SWLCSG-1194] Change-Id: Ib3cce6e2e4ac0817ccd4dff4bb0fa4a23e231ca4	2022-05-17 18:09:22 +05:30
Meghana Vankadari	c11fd5a8f6	Added functionality support for dzgemm AMD-Internal: [SWLCSG-1012] Change-Id: I2eac3131d2dcd534f84491289cbd3fe7fb7de3da	2022-05-17 18:01:55 +05:30
Dipal M. Zambare	b90420627a	Revert "Enabled AVX-512 kernels for Zen4 config" This reverts commit `62c96a4190`. Was committed without review.	2022-04-21 06:46:00 +00:00
Dipal M. Zambare	62c96a4190	Enabled AVX-512 kernels for Zen4 config Enabled AVX-512 skylake kernels in zen4 configuration. AVX-512 kernels are added for float and double types. AMD-Internal: [CPUPL-2108]	2022-04-21 06:28:29 +00:00
Field G. Van Zee	a4abb10831	Added a new 'gemmlike' sandbox. Details: - Added a new sandbox called 'gemmlike', which implements sequential and multithreaded gemm in the style of gemmsup but also unconditionally employs packing. The purpose of this sandbox is to (1) avoid select abstractions, such as objects and control trees, in order to allow readers to better understand how a real-world implementation of high-performance gemm can be constructed; (2) provide a starting point for expert users who wish to build something that is gemm-like without "reinventing the wheel." Thanks to Jeff Diamond, Tze Meng Low, Nicholai Tukanov, and Devangi Parikh for requesting and inspiring this work. - The functions defined in this sandbox currently use the "bls_" prefix instead of "bli_" in order to avoid any symbol collisions in the main library. - The sandbox contains two variants, each of which implements gemm via a block-panel algorithm. The only difference between the two is that variant 1 calls the microkernel directly while variant 2 calls the microkernel indirectly, via a function wrapper, which allows the edge case handling to be abstracted away from the classic five loops. - This sandbox implementation utilizes the conventional gemm microkernel (not the skinny/unpacked gemmsup kernels). - Updated some typos in the comments of a few files in the main framework. Change-Id: Ifc3c50e9fd0072aada38eace50c57552c88cc6cf	2022-04-01 13:55:30 +05:30
Field G. Van Zee	7a0ba4194f	Added support for addons. Details: - Implemented a new feature called addons, which are similar to sandboxes except that there is no requirement to define gemm or any other particular operation. - Updated configure to accept --enable-addon=<name> or -a <name> syntax for requesting an addon be included within a BLIS build. configure now outputs the list of enabled addons into config.mk. It also outputs the corresponding #include directives for the addons' headers to a new companion to the bli_config.h header file named bli_addon.h. Because addons may wish to make use of existing BLIS types within their own definitions, the addons' headers must be included sometime after that of bli_config.h (which currently is #included before bli_type_defs.h). This is why the #include directives needed to go into a new top-level header file rather than the existing bli_config.h file. - Added a markdown document, docs/Addons.md, to explain addons, how to build with them, and what assumptions their authors should keep in mind as they create them. - Added a gemmlike-like implementation of sandwich gemm called 'gemmd' as an addon in addon/gemmd. The code uses a 'bao_' prefix for local functions, including the user-level object and typed APIs. - Updated .gitignore so that git ignores bli_addon.h files. Change-Id: Ie7efdea366481ce25075cb2459bdbcfd52309717	2022-03-31 12:03:27 +05:30
Dipal M Zambare	fd8a3aace9	Added support for zen4 architecture - Added configuration option for zen4 architecture - Added auto-detection of zen4 architecture - Added zen4 configuration for all checks related to AMD specific optimizations AMD-Internal: [CPUPL-1937] Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a	2021-11-23 10:29:15 +05:30
lcpu	30038af896	Reverted: To fix accuracy issues for complex datatypes Details: -- reverted cscalv,zscalv,ctrsm,ztrsm changes to address accuracy issues observed by libflame and scalapack application testing. -- AMD-Internal: [CPUPL-1906], [CPUPL-1914] Change-Id: Ic364eacbdf49493dd3a166a66880c12ee84c2204	2021-11-12 08:58:57 +05:30
Dipal M Zambare	3364c0e4eb	Binary and dynamic dispatch configuration name change -- Reverted changes made to include lp/ilp info in binary name This reverts commit `c5e6f885f0`. -- Included BLAS int size in 'make showconfig' -- Renamed amdepyc configuration to amdzen Change-Id: Ie87ec1c03e105f606aef1eac397ba0d8338906a6	2021-11-12 08:58:56 +05:30
Nageshwar Singh	a263146a4c	Optimized scalv for complex data-types c and z (cscalv and zscalv) AMD-Internal: [CPUPL-1551] Change-Id: Ie6855409d89f1edfd2a27f9e5f9efa6cd94bc0c9	2021-11-12 08:58:53 +05:30
Saitharun	ffcb338531	Adding ENABLE_WRAPPER CMAKE OPTION details: Wrapper code will be enabled when selecting the cmake option ENABLE_WRAPPER; this commit also fixes the ScaLAPACK build error on windows. AMD-Internal: [CPUPL-1848] Change-Id: I3d687cbc00e7603fdfb45937a00daf86bd07878e	2021-11-12 08:58:49 +05:30
Saitharun	5d6012c50d	Added additional symbols for BLIS APIs using wrapper functions Details: BLIS currently supports BLAS and CBLAS interfaces with lowercase. With this commit - we also support uppercase with and without trailing underscore, lowercase without trailing underscore symbol names. Change-Id: Ibb06121821ab937b25d492409625916f542b2135	2021-11-12 08:58:48 +05:30
Dipal M Zambare	bcd9591b3f	Added support for amdepyc fat binary -- Created new configuration amdepyc to include fat binary which includes zen, zen2, zen3 and generic architecture for fallback. -- Updated amdepyc family makefiles to include macros needed in amdepyc family binary. This file must include all macros, compiler options to be used for non architecture specific code. -- Added 'workaround' to exclude ZEN family specific code in some of the framework files. There are still a lot of places where ZEN family specific code is added in framework files. They will be addressed with proper design later. - Moved definition of BLIS_CONFIG_EPYC from header files to makefile so that it is enabled only for framework and kernels -- Removed redundant flag AOCL_BLIS_ZEN, used BLIS_CONFIG_EPYC wherever it was needed. -- Removed un-used, obsolete macros, some of them may be needed for debugging which can be added in the individual workspaces. - BLIS_DEFAULT_MR_THREAD_MAX - BLIS_DEFAULT_NR_THREAD_MAX - BLIS_ENABLE_ZEN_BLOCK_SIZES - BLIS_SMALL_MATRIX_THRES_TRSM - BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES - BLIS_ENABLE_SUP_MR_EXT - BLIS_ENABLE_SUP_NR_EXT -- Corrected implementation of existing amd64_legacy configuration. AMD-Internal: [CPUPL-1626, CPUPL-1628] Change-Id: I46b0ab3ea3ac7d9ff737fef66c462e85601ee29c	2021-08-26 15:13:49 +05:30
Meghana Vankadari	107eaea237	Removed dependency on AOCL_BLIS_ZEN for TRSM blocksizes. Details: - The field "trsm_blkszs" in cntx enables us to set and use blocksizes that are specific to trsm only instead of using common L3 blocksizes. - While querying blocksizes for TRSM, added a check to see if trsm-specific blocksizes are set; if they are not set, we continue trsm execution by using common Level-3 blocksizes. - TRSM-specific blocksizes (if required) need to be set in cntx_init function of respective configuration. Change-Id: I06afe2b1d29004e428d2fb3bf2be2958cbbd1a97	2021-08-16 00:12:33 -04:00
Meghana Vankadari	6bad157754	Added a new field in cntx to store l3 threshold function pointers Details: - Adding threshold function pointers to cntx gives flexibility to choose different threshold functions for different configurations. - In case of fat binary where configuration is decided at run-time, adding threshold functions under a macro enables these functions for all the configs under a family. This can be avoided by adding function pointers to cntx which can be queried from cntx during run-time based on the config chosen. Change-Id: Iaf7e69e45ae5bb60e4d0f75c7542a91e1609773f	2021-08-16 00:10:01 -04:00
Chandrashekara K R	f94e3ad237	AOCL-Windows: Update BLIS build system 1. Added support in cmake scripts for linking libomp for blis multithreading build. 2. Added ${CMAKE_CURRENT_SOURCE_DIR}/bli_axpyf_zen_int_6.c statement in blis\kernels\zen\1f cmake file to build newly added file. 3. Added the new macros in blis/frame/include/bli_macro_defs.h for ENABLE_NO_UNDERSCORE_API support for gemm_batch and axpby API's. 4. Modified the file open mode from binary to text mode in blis/testsuite/src/test_libblis.c file to avoid the line ending issue on different OS. 5. Added the definition for the macro BLIS_DISABLE_TRSM_PREINVERSION in main CmakeLists.txt file. AMD Internal : [CPUPL-1630] Change-Id: Iba1b7b6d014a4317de7cbaf42f812cad20111e4f	2021-06-15 16:49:08 +05:30
Nagarapu Phanikumar	7ea32e6d0b	Merge " Unifying BLIS Windows and Linux codebase" into amd-staging-milan-3.1	2021-06-03 06:03:26 -04:00
nphaniku	2bdee3cd6c	Unifying BLIS Windows and Linux codebase 1. Removed dependency on bli_config.h inclusion in blis.h 2. Provided AOCL DYNAMIC / TRSM PRE INVERSION / COMPLEX RETURN configuration flags. 3. CMAKE changes to incorporate new changes as per 3.1 code base. 4. Removed zen2 folder from Windows directory. AMD Internal : [CPUPL-1532] Change-Id: I9261851087d10f73ab563d466fa3f7bb72ddee47	2021-06-03 15:28:10 +05:30
Nageshwar Singh	2e1a5bc1dd	Optimized double complex axpyf kernel for zgemv Details: - Implemented zaxpyf kernel with fuse factor=4 for zgemv. - Modified BLAS interface call for zgemv to reduce framework overhead. - Directed gemv to dotv in the case where dimension of y vector is 1. - when alpha = 0, gemv becomes scalv of Y with beta. Added code to return early after scaling Y vector with beta. AMD-Internal: [CPUPL-1402] Change-Id: I2231285fe3060982d4434466346a040b7ab803fc	2021-06-01 18:03:29 +05:30
Meghana Vankadari	8c9a7c21b4	Optimized axpyf kernel for scomplex datatype Details: - Implemented axpyf kernel with fuse factor=4 for scomplex datatype. - Modified BLAS interface call for cgemv to reduce framework overhead. - Directed gemv to dotv in the case where dimension of y vector is 1. - when alpha = 0, gemv becomes scalv of Y with beta. Added code to return early after scaling Y vector with beta. AMD-Internal: [CPUPL-1402] Change-Id: Ibaab078008d76953332ba4da3515993578c0e586	2021-05-24 14:40:17 +05:30
lcpu	7401effc03	BLIS:merge: Merge conflicts araised has been fixed while downstreaming BLIS code from master to milan-3.1 branch Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond) Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations. Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations. Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger to conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu) Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu) Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs. Minor code consolidation in all level-3 _front() functions. Reorganized Windows cpp branch of bli_pthreads.c. Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS. Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion. Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv. AMD-internal-[CPUPL-1523] Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd	2021-04-27 11:09:48 +05:30
Field G. Van Zee	4493cf516e	Redefined BLIS_NUM_ARCHS to update automatically. Details: - Changed BLIS_NUM_ARCHS from a cpp macro definition to the last enum value in the arch_t enum. This means that it no longer needs to get updated manually whenever new subconfigurations are added to BLIS. Also removed the explicit initial index assignment of 0 from the first enum value, which was unnecessary due to how the C language standard mandates indexing of enum values. Thanks to Devin Matthews for originally submitting this as a PR in #446. - Updated docs/ConfigurationHowTo.md to reflect the aforementioned change.	2021-03-15 13:12:49 -05:00
nphaniku	e3cc577ec1	AOCL Windows: 3.1 BLIS changes 1. Incorporated code review comments . 2. Updated Copyright to 2021. AMD Internal : [CPUPL-1422] Change-Id: I722b0f71daae029a3dcc2cbd029524ea39ca78e6	2021-03-09 17:35:57 +05:30
nphaniku	d78defa0fc	AOCL Windows: 3.1 BLIS changes 1. CMake script changes for adding new files to the build. 2. Added Upper case support for couple of API's. 3. bool is not supported in clang so defined it. AMD Internal : [CPUPL-1422] Change-Id: I4cac8fb8ef86cd6bacfd29e3b1a84c5da1310f61	2021-03-08 22:32:13 +05:30
Kiran Varaganti	a7d43cf720	DGEMM Optimizations for smaller dimensions Modified dgemm_ to be able to call small_gemm 16x3 kernel. small_gemm will be called if((m + n -k) < 2000 && (m + k-n) < 2000 && n + k-m < 2000) && n > 2. small_gemm kernel - if m or n or k = 0 we return and this case will be handled by sup or native kernel. [CPUPL - 1376] Change-Id: I61c2b36ad0ae4fb3dd23bc37c2b6c78556b3105b	2021-02-11 11:05:42 +05:30
Madan mohan Manokar	f1ea1f1d34	Adaptive zgemm 1. 3m1 chosen for (m<=128) & (68>n<=128) & (k<=128) 2. Default blis3.1 path for rest of the sizes. Change-Id: I1e50dece013e72a67f1162faef5cbeb9bfbbc23a AMD-Internal: [CPUPL-1352]	2021-02-03 12:43:57 +05:30
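
The dispatch condition quoted in commit a7d43cf720 can be read as a simple predicate. The following is a minimal sketch only: the function name `use_small_gemm` and its signature are hypothetical and are not taken from the BLIS sources, and treating a zero dimension as "do not use small_gemm" is one interpretation of the commit's note that such cases are left to the sup or native kernels.

```c
#include <stdbool.h>

/* Hypothetical helper, not a BLIS function: returns true when the
 * dimension test quoted in commit a7d43cf720 would select small_gemm. */
bool use_small_gemm(long m, long n, long k)
{
    /* Assumption: a zero dimension is left to the sup/native path. */
    if (m == 0 || n == 0 || k == 0)
        return false;

    /* Condition as quoted in the commit message. */
    return (m + n - k) < 2000 &&
           (m + k - n) < 2000 &&
           (n + k - m) < 2000 &&
           n > 2;
}
```

Under this reading, a 100x100x100 problem qualifies, while any problem with n of 2 or less does not.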

452 Commits
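
The thread-count order of precedence listed in commit 22af681a11 (item 10) can be sketched as a first-match lookup. This is an illustrative model under stated assumptions, not the BLIS implementation: the `nt_sources_t` struct and `resolve_num_threads` function are hypothetical names, a value of zero or less stands for "not set or invalid", and the special case in item 11 (an invalid `nt` passed to `omp_set_num_threads` yielding 1 thread) is not modeled.

```c
#include <stddef.h>

/* Hypothetical container for the candidate sources; not a BLIS type.
 * A value <= 0 means "not set or invalid" in this sketch. */
typedef struct {
    int blis_api_nt;  /* 0. bli_thread_set_num_threads(nt) */
    int blis_env_nt;  /* 1. BLIS_NUM_THREADS               */
    int omp_api_nt;   /* 2. omp_set_num_threads(nt)        */
    int omp_env_nt;   /* 3. OMP_NUM_THREADS                */
    int num_cores;    /* 4. number of cores                */
} nt_sources_t;

/* Returns the first set value in the precedence order from the commit. */
int resolve_num_threads(const nt_sources_t *s)
{
    const int candidates[] = {
        s->blis_api_nt, s->blis_env_nt, s->omp_api_nt,
        s->omp_env_nt,  s->num_cores
    };
    const size_t count = sizeof(candidates) / sizeof(candidates[0]);

    for (size_t i = 0; i < count; ++i) {
        if (candidates[i] > 0)
            return candidates[i];
    }
    return 1; /* Assumption: fall back to a single thread. */
}
```

For example, with `BLIS_NUM_THREADS` set to 8 and `omp_set_num_threads(4)` issued by the application, this model picks 8, matching item 5 of the commit message.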