amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-13 02:25:39 +00:00

Author	SHA1	Message	Date
Edward Smyth	c6f3340125	Merge commit '5013a6cb' into amd-main * commit '5013a6cb': More edits and fixes to docs/FAQ.md. Fixed newly broken link to CREDITS in FAQ.md. More minor fixes to FAQ.md and Sandboxes.md. Updates to FAQ.md, Sandboxes.md, and README.md. Safelist 'master', 'dev', 'amd' branches. Re-enable and fix `fb93d24`. Reverted `fb93d24`. Re-enable and fix `8e0c425` (BLIS_ENABLE_SYSTEM). Removed last vestige of #define BLIS_NUM_ARCHS. Added new packm var3 to 'gemmlike'. Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell. Fix more copy-paste errors in the haswell gemmsup code. Do a fast test on OSX. [ci skip] Fix AArch64 tests and consolidate some other tests. Use C++ cross-compiler for ARM tests. Attempt to fix cxx-test for OOT builds. Updated travis-ci.org link in README.md to .com. Disabled (at least temporarily) commit `8e0c425`. Define BLIS_OS_NONE when using --disable-system. Updated stale calls to malloc_intl() in gemmlike. Blacklist clang10/gcc9 and older for 'armsve'. Add test to Travis using C++ compiler to make sure blis.h is C++-compatible. Moved lang defs from _macro_def.h to _lang_defs.h. Minor tweaks to gemmlike sandbox. Added local _check() code to gemmlike sandbox. README.md citation updates (e.g. BLIS7 bibtex). Tweaks to gemmlike to facilitate 3rd party mods. Whitespace tweaks. Add row- and column-strides for A/B in obj_ukr_fn_t. Clean up some warnings that show up on clang/OSX. Remove schema field on obj_t (redundant) and add new API functions. Add dependency on the "flat" blis.h file for the BLIS and BLAS testsuite objects. Disabled sanity check in bli_pool_finalize(). Implement proposed new function pointer fields for obj_t. AMD-Internal: [CPUPL-2698] Change-Id: I6fc33351fa824580cf4f25b63f0370383cd9422d	2023-11-10 13:05:12 -05:00
Mangala V	f6046784ce	Re-Designed SGEMM SUP kernel to use mask load/store instruction Added all fringe kernels with mask load store support Fringe kernels cover m direction from 5 to 1 and n direction from 15 to 1 for row storage format - New edge kernels that uses masked load-store instructions for handling corner cases. - Mask load-store instruction macros are added. vmaskmovps, VMASKMOVPS for masked load-store. - It improves performance by reducing branching overhead and by being more cache friendly. - Mask load-store is added only for row storage format AMD-Internal: [CPUPL-4041] Change-Id: I563c036c79bf8e476a8ebde37f8f6db751fb3456	2023-11-10 01:23:48 -05:00
Eleni Vlachopoulou	75a4d2f72f	CMake: Adding new portable CMake system. - A completely new system, made to be closer to Make system. AMD-Internal: [CPUPL-2748] Change-Id: I83232786406cdc4f0a0950fb6ac8f551e5968529	2023-11-09 15:49:45 +05:30
Arnav Sharma	8885510db2	Fix for Missing Symbols for gemm_pack_get_size - Symbols for gemm_pack_get_size were not being exported properly when BLIS was built as a shared library. - Correctly assigned the BLIS_EXPORT_BLAS macro to ?gemm_pack_get_size_ function declaration. - Added missing gemm_pack and gemm_pack_get_size macros to bli_macro_defs.h file. - Removed an unnecessary BLIS_EXPORT_BLAS macro from dgemm_compute function definition. - Updated bli_util_api_wrap with no underscore API wrappers for pack and compute set of BLAS Extension APIs: 1. ?gemm_pack_get_size 2. ?gemm_pack 3. ?gemm_compute AMD-Internal: [CPUPL-4083] Change-Id: I78cd7642c2fcbfdf02676e654a377ad2aa5295c1	2023-11-03 08:58:59 -04:00
Eashan Dash	c3d1a3878c	Parallelized Pack and Compute Extension APIs 1. OpenMP based multi-threading parallelism is added for BLAS extension APIs of Pack and Compute 2. Both pack and compute APIs are parallelized. 3. Multi-threading of pack and compute APIs done with different number of threads can lead to inconsistent results due to output difference of the full packed matrix buffer when packed with different number of threads. 4. In multi-threaded execution, we ensure output of packed buffer is exactly the same as in single threaded execution. 5. Similarly for compute API, read of packed buffer in multi-threaded execution is exactly the same as in single-threaded execution. 6. Routines are added to compute the offsets for thread workload distribution for MT execution. 1. The offsets are calculated in such a way that it resembles the reorder buffer traversal in single threaded reordering. 2. The panel boundaries (KCxNC) remain as it is accessed in single thread, and as a consequence a thread with jc_start inside the panel cannot consider NC range for reorder. 3. It has to work with NC' < NC, and the offset is calculated using prev NC panels spanning k dim + cur NC panel spanning pc loop cur iteration + (NC - NC') spanning current kc0 (<= KC). 7. Routines to ensure the same are added for MT execution 1. frame/base/bli_pack_compute_utils.c 2. frame/base/bli_pack_compute_utils.h AMD-Internal: [CPUPL-3560] Change-Id: I0dad33e0062519de807c32f6071e61fba976d9ac	2023-11-03 08:47:17 -04:00
Edward Smyth	f5505be9f3	Merge commit 'e366665c' into amd-main * commit 'e366665c': Fixed stale API calls to membrk API in gemmlike. Fixed bli_init.c compile-time error on OSX clang. Fixed configure breakage on OSX clang. Fixed one-time use property of bli_init() (#525). CREDITS file update. Added Graviton2 Neoverse N1 performance results. Remove unnecesary windows/zen2 directory. Add vzeroupper to Haswell microkernels. (#524) Fix Win64 AVX512 bug. Add comment about make checkblas on Windows CREDITS file update. Test installation in Travis CI Add symlink to blis.pc.in for out-of-tree builds Revert "Always run `make check`." Always run `make check`. Fixed configure script bug. Details: - Fixed kernel list string substitution error by adding function substitute_words in configure script. if the string contains zen and zen2, and zen need to be replaced with another string, then zen2 also be incorrectly replaced. Update POWER10.md Rework POWER10 sandbox Skip clearing temp microtile in gemmlike sandbox. Fix asm warning Sandbox header edits trigger full library rebuild. Add vhsubpd/vhsubpd. Fixed bugs in cpackm kernels, gemmlike code. Armv8A Rename Regs for Safe Darwin Compile Armv8A Rename Regs for Clang Compile: FP32 Part Armv8A Rename Regs for Clang Compile: FP64 Part Asm Flag Mingling for Darwin_Aarch64 Added a new 'gemmlike' sandbox. Updated Fugaku (a64fx) performance results. Add explicit compiler check for Windows. Remove `rm-dupls` function in common.mk. Travis CI Revert Unnecessary Extras from `91d3636` Adjust TravisCI Travis Support Arm SVE Added 512b SVE-based a64fx subconfig + SVE kernels. Replace bli_dlamch with something less archaic (#498) Allow clang for ThunderX2 config AMD-Internal: [CPUPL-2698] Change-Id: I561ca3959b7049a00cc128dee3617be51ae11bc4	2023-10-18 09:09:54 -04:00
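The configure fix described in the merge above (replacing `zen` must not also rewrite `zen2`) comes down to whole-word rather than substring substitution. A minimal Python sketch of that idea follows; the real `substitute_words` is a shell function in the configure script, and the kernel list and replacement string here are made up for illustration.

```python
def substitute_words(text, old, new):
    """Replace `old` with `new` only where it is a whole space-separated word.

    Illustrative model only; not the shell code from the configure script.
    """
    return " ".join(new if word == old else word for word in text.split())


kernel_list = "haswell zen zen2"  # hypothetical kernel list

# Naive substring replacement also rewrites the "zen" inside "zen2".
naive = kernel_list.replace("zen", "zen3")

# Whole-word replacement leaves "zen2" alone.
fixed = substitute_words(kernel_list, "zen", "zen3")
```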
Edward Smyth	6d0444497f	Improvements to xerbla functionality The following improvements have been implemented: - Option to stop in xerbla on error. This is controlled by setting the environment variable BLIS_STOP_ON_ERROR=1 - Option to disable printing of error message from BLIS. This is controlled by setting the environment variable BLIS_PRINT_ON_ERROR=0 - Added a function to return the value of INFO passed to xerbla, assuming xerbla was not set to stop on error. Example call is info = bli_info_get_info_value(); The default behaviour remains to print but don't stop on error, i.e. the equivalent to export BLIS_PRINT_ON_ERROR=1 BLIS_STOP_ON_ERROR=0 Implementation details: - Values of the environment variables are stored and retrieved from global_rntm. - Info value is stored and retrieved from tl_rntm. It is set to 0 during initialization for all calls and updated by xerbla if an error has occurred. - Call to bli_init_auto before calling PASTEBLACHK macro (which calls xerbla) will reinitialize info_value to 0 via call to bli_thread_update_rntm_from_env AMD-Internal: [CPUPL-3520] Change-Id: I151f6de9b5a437c3a6e3fcf453d5b8fa9c579b9d	2023-10-16 08:48:51 -04:00
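The defaults stated in the commit above (print on error, do not stop) can be summarised in a small Python model. This is a sketch, not BLIS code: the function name is hypothetical, and the treatment of values other than "0" and "1" is an assumption, since the commit message does not say how they are handled.

```python
def xerbla_settings(env):
    """Return (print_on_error, stop_on_error) from a mapping of environment variables.

    Assumptions: only BLIS_PRINT_ON_ERROR=0 disables printing, and only
    BLIS_STOP_ON_ERROR=1 enables stopping. Other values keep the defaults.
    """
    print_on_error = env.get("BLIS_PRINT_ON_ERROR", "1") != "0"
    stop_on_error = env.get("BLIS_STOP_ON_ERROR", "0") == "1"
    return print_on_error, stop_on_error
```

In a real process the mapping would be `os.environ`; a plain dict is used here so the behaviour is easy to check.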
Arnav Sharma	c8f14edcf5	BLAS Extension API - ?gemm_compute() - Added support for 2 new APIs: 1. sgemm_compute() 2. dgemm_compute() These are dependent on the ?gemm_pack_get_size() and ?gemm_pack() APIs. - ?gemm_compute() takes the packed matrix buffer (represented by the packed matrix identifier) and performs the GEMM operation: C := A * B + beta * C. - Whenever the kernel storage preference and the matrix storage scheme isn't matching, and the respective matrix being loaded isn't packed either, on-the-go packing has been enabled for such cases to pack that matrix. - Note: If both the matrices are packed using the ?gemm_pack() API, it is the responsibility of the user to pack only one matrix with alpha scalar and the other with a unit scalar. - Note: Support is presently limited to Single Thread only. Both, pack and compute APIs are forced to take n_threads=1. AMD-Internal: [CPUPL-3560] Change-Id: I825d98a0a5038d31668d2a4b84b3ccc204e6c158	2023-10-16 08:18:52 -04:00
Harihara Sudhan S	105de694cf	Optimized ZGEMV variant 1 - Added an explicit function definition for ZGEMV var 1. This removes the need to query the context for Zen architectures. - Added a new INSERT_GENTFUNC to generate the definition only for scomplex type. - Rewrote ZDOTXF kernel and added the function name for ZDOTV instead of querying it. - With this change fringe loop is vectorized using SSE instructions. AMD-Internal:[CPUPL-3997] Change-Id: I790214d528f9e39f63387bc95bf611f84d3faca3	2023-10-13 05:03:53 -04:00
eashdash	30bdeecbcc	Added BLAS Extension APIs - Get Size and Pack API 1. 4 new APIs are added to support packed compute GEMM operations 1. dgemm_pack_get_size 2. sgemm_pack_get_size 3. dgemm_pack 4. sgemm_pack 2. Pack_get_size API 1. Returns size in bytes required for packing of input 2. Requires identifier to identify the input matrix to be packed 3. Additionally requires 3 integer parameters for input dimensions 3. Packed buffer is allocated using the pack size computed 4. Pack API: 1. Performs full matrix packing of the input 2. Additionally, performs the alpha scaling 3. Packed buffer created contains the full packed matrix 5. The GEMM compute calls are required to be operated on the packed buffer with alpha = 1 since alpha scaling is already done by the Pack API 6. GEMM Pack API eliminates the cost of packing the input matrices by avoiding on the go pack in the GEMM 5 loop. Packing of input matrices is done when there is reuse of matrices across different GEMM calls. AMD-Internal: [CPUPL-3560] Change-Id: Ieeb5df2d2f3b10ebf2d00dab6f455cf64a047de3	2023-10-04 06:43:59 -04:00
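Point 5 of the commit above (compute with alpha = 1 because the pack step already applied alpha) rests on the identity (alpha * A) * B = alpha * (A * B). The Python sketch below checks that identity on a tiny example. `pack_with_alpha` and `matmul` are hypothetical stand-ins: they model only the scaling, not the packed buffer layout or the real `?gemm_pack` signature.

```python
def matmul(a, b):
    """Plain triple-loop matrix product on lists of lists."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def pack_with_alpha(alpha, a):
    """Stand-in for the pack step: only the alpha scaling is modelled."""
    return [[alpha * x for x in row] for row in a]


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
alpha = 2.0

# Scale once while "packing", then multiply with an implicit alpha of 1.
result = matmul(pack_with_alpha(alpha, A), B)

# Reference: multiply first, then scale by alpha.
reference = [[alpha * x for x in row] for row in matmul(A, B)]
```

This is also why the later `?gemm_compute` commit asks the user to pack only one of the two matrices with alpha and the other with a unit scalar: scaling both would apply alpha twice.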
Edward Smyth	ccb8dd26fd	Compiler warnings when using --int-size=32 Correct compiler warnings when building with configure --int-size=32 - bla_imatcopy.c: Cast ints to longs to match %ld format specification in error printf statement and change this to fprintf to stderr. Also copy this additional fprintf statement to other variants of this function. - bli_type_defs.h: siz_t should always be the same size as a pointer. This corrects an issue in bli_malloc.c when casting from a pointer to a siz_t integer value. AMD-Internal: [CPUPL-3519] Change-Id: Ic87cd6142b8a6fed177b7c55bc0bb6013c5b69ab	2023-09-19 06:08:19 -04:00
Edward Smyth	bb4c158e63	Merge commit 'b683d01b' into amd-main * commit 'b683d01b': Use extra #undef when including ba/ex API headers. Minor preprocessor/header cleanup. Fixed typo in cpp guard in bli_util_ft.h. Defined eqsc, eqv, eqm to test object equality. Defined setijv, getijv to set/get vector elements. Minor API breakage in bli_pack API. Add err_t* "return" parameter to malloc functions. Always stay initialized after BLAS compat calls. Renamed membrk files/vars/functions to pba. Switch allocator mutexes to static initialization. AMD-Internal: [CPUPL-2698] Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df	2023-08-21 07:01:38 -04:00
Harsh Dave	5bdf5e2aaa	Optimized AVX2 DGEMM SUP and small edge kernels. - Re-designed the new edge kernels that uses masked load-store instructions for handling corner cases. - Mask load-store instruction macros are added. vmovdqu, VMOVDQU for setting up the mask. vmaskmovpd, VMASKMOVPD for masked load-store - Following edge kernels are added for 6x8m dgemm sup. n-left edge kernels - bli_dgemmsup_rv_haswell_asm_6x7m - bli_dgemmsup_rv_haswell_asm_6x5m - bli_dgemmsup_rv_haswell_asm_6x3m m-left edge kernels - bli_dgemmsup_rv_haswell_asm_5x7 - bli_dgemmsup_rv_haswell_asm_4x7 - bli_dgemmsup_rv_haswell_asm_3x7 - bli_dgemmsup_rv_haswell_asm_2x7 - bli_dgemmsup_rv_haswell_asm_1x7 - bli_dgemmsup_rv_haswell_asm_5x5 - bli_dgemmsup_rv_haswell_asm_4x5 - bli_dgemmsup_rv_haswell_asm_3x5 - bli_dgemmsup_rv_haswell_asm_2x5 - bli_dgemmsup_rv_haswell_asm_1x5 - bli_dgemmsup_rv_haswell_asm_5x3 - bli_dgemmsup_rv_haswell_asm_4x3 - bli_dgemmsup_rv_haswell_asm_3x3 - bli_dgemmsup_rv_haswell_asm_2x3 - bli_dgemmsup_rv_haswell_asm_1x3 - For 16x3 dgemm_small, m_left computation is handled with masked load-store instructions avoid overhead of conditional checks for edge cases. - It improves performance by reducing branching overhead and by being more cache friendly. AMD-Internal: [CPUPL-3574] Change-Id: I976d6a9209d2a1a02b2830d03d21d200a5aad173	2023-08-07 07:30:50 -04:00
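The masked load-store approach in the commit above can be pictured with a small Python model of a 4-lane (256-bit, double precision) register. It is only a sketch of the idea: the helper names are hypothetical and the real kernels are x86 assembly using vmaskmovpd. Lanes whose mask bit is clear read as zero on load and are left untouched on store, so a partial vector at the edge of a row needs no per-element branch.

```python
VECTOR_WIDTH = 4  # doubles per 256-bit AVX2 register


def make_mask(n_left):
    """Mask with the first n_left lanes enabled."""
    return [lane < n_left for lane in range(VECTOR_WIDTH)]


def masked_load(memory, offset, mask):
    """Disabled lanes yield 0.0 and their memory is never read."""
    return [memory[offset + lane] if on else 0.0
            for lane, on in enumerate(mask)]


def masked_store(memory, offset, values, mask):
    """Only enabled lanes are written back."""
    for lane, on in enumerate(mask):
        if on:
            memory[offset + lane] = values[lane]


row = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]  # n = 7: one full vector plus 3 left over
mask = make_mask(3)
tail = masked_load(row, 4, mask)  # does not read past the end of the row
```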
Meghana Vankadari	79e174ff0a	Level-3 triangular routines now use different block sizes and kernels. Details: - Eliminated the need for override function in SUP for GEMMT/SYRK. - New set of block sizes, kernels and kernel preferences are added to cntx data structure for level-3 triangular routines. - Added supporting functions to set and get the above parameters from cntx. - Modified GEMMT/SYRK SUP code to use these new block sizes/kernels. In case they are not set, use the default block sizes/kernels of Level-3 SUP. AMD-Internal: [CPUPL-3649] Change-Id: Iee11bd4c4f1d8fbbb749c296258d1b8121c009a0	2023-07-26 01:26:11 -04:00
vignbala	9164427e86	Code cleanup: Mismatch in assembly macros - In the bli_x86_asm_macros.h file, the set of vinsertf?x? and vextractf?x? instructions are facing macro expansion errors due to ambiguous macro redirection. The lower-case macro definitions of these instructions are not properly redirected to their corresponding upper-case macro definitions. - This error occurs due to ambiguity in the upper-case macro name. At the place of lower-case macro definition, the redirection is to macros of the form VINSERTF?x? and VEXTRACTF?x?, while at the place of upper-case macro definition, they are of the form VINSERTF?X? and VEXTRACTF?X?. This causes a mismatch of the upper-case macro due to different case sensitive 'x' being used. - This patch corrects this issue, by changing the lower-case 'x' to upper-case, among the upper case macros at the place of redirection. This provides uniformity and facilitates the expected macro-expansion. AMD-Internal: [CPUPL-3276] Change-Id: Id1f45f8e4bb083cd4b87632b713ff6baba616ff2	2023-05-04 08:49:58 -04:00
Edward Smyth	b531022bac	BLIS cpuid: distinguish submodels within a microarchitecture Incorporate a means of detecting submodels of a microarchitecture, so that different optimizations e.g. block sizes or kernel choices can be used. The details are as follows: - Different models are currently only enabled for zen3 and zen4 architectures (for server parts). - There is a single enumeration (model_t) for all models for all architectures, but function bli_check_valid_model_id() should check the provided model_id against the suitable range within the enumeration for the provided arch_id. - To enable the model_id to be used within the cntx setup functions, checking of a user specified value of BLIS_ARCH_TYPE against the enabled configurations is delayed to a separate function, bli_arch_check_id(). - Default selection based on hardware can be overridden using the BLIS_MODEL_TYPE environment variable. Valid values are: Genoa, Bergamo, Genoa-X, Milan, Milan-X Values are case-insensitive and -X can also be specified as _X or X - Specifying an incorrect value for BLIS_MODEL_TYPE is not an error, but will result in the default option for that architecture being selected. This is different to specifying an incorrect value of BLIS_ARCH_TYPE, which is an error. - The environment variable BLIS_MODEL_TYPE can be renamed using the --rename-blis-model-type argument to configure (or cmake equivalent), in a similar way to renaming BLIS_ARCH_TYPE with --rename-blis-arch-type. - Configure option --disable-blis-arch-type will disable both BLIS_ARCH_TYPE and BLIS_MODEL_TYPE environment variables. - Added code in bli_cpuid.c to detect L1, L2 and L3 cache sizes, currently only for AMD cpus. Functions are provided to query these from other parts of the code, namely: uint32_t bli_cpuid_query_{l1d,l1i,l2,l3}_cache_size() AMD-Internal: [CPUPL-3033] Change-Id: I37a3741abfd59a95e0e905d926c6ede9a0143702	2023-04-20 10:47:44 -04:00
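The BLIS_MODEL_TYPE rules in the commit above (case-insensitive names, `-X` also accepted as `_X` or `X`, unknown values falling back to the default) can be sketched as follows. This is a Python model with a hypothetical function name, not the code in bli_cpuid.c; which model is the default for a given architecture is not stated in the commit message, so the caller supplies it.

```python
MODELS = {
    "genoa": "Genoa",
    "bergamo": "Bergamo",
    "genoax": "Genoa-X",
    "milan": "Milan",
    "milanx": "Milan-X",
}


def parse_model_type(value, default):
    """Map a BLIS_MODEL_TYPE string to a canonical model name.

    An unrecognised value is not an error: the supplied default is returned.
    """
    key = value.lower()
    # Accept "-X" and "_X" as spellings of a trailing "X".
    for suffix in ("-x", "_x"):
        if key.endswith(suffix):
            key = key[:-2] + "x"
    return MODELS.get(key, default)
```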
Aayush Kumar	8c537b0cd5	Added DTRSM Small Path AVX512 based LLNN/LUTN Variant Kernels - 8x8 kernels are used for DTRSM SMALL - Implemented fringe cases with below block sizes 8x8, 8x4, 8x3, 8x2, 8x1 4x8, 4x4, 4x3, 4x2, 4x1 3x8, 3x4, 3x3, 3x2, 3x1 2x8, 2x4, 2x3, 2x2, 2x1 1x8, 1x4, 1x3, 1x2, 1x1 AMD-Internal: [CPUPL-2745] Change-Id: I58d28912bddbaadb404052c0f3449ebbe3c97b68	2023-04-07 08:50:28 +00:00
Aayush Kumar	5bd2a777ba	Fixed Compilation Fails when configured with --disable-blas - Moved _blis_impl function declaration outside the BLIS_ENABLE_BLAS guard. - Changed Makefile to continue to compile bla_ files to get _blis_impl interfaces. - Modify CBLAS headers, bli_macro_defs.h and bli_util_api_wrap.{c,h} to add BLIS_ENABLE_CBLAS guards. - Comment out BLIS_ENABLE_BLAS guards in various headers and utility functions. - Define BLIS Fortran-style functions lsame_blis_impl and xerbla_blis_impl. New macros PASTE_LSAME and PASTE_XERBLA are used in bla_*_check headers and some other places to select whether to call lsame and xerbla, or the _blis_impl versions. - Defined various other missing _blis_impl functions. - In bli_util_api_wrap.c, only define any functions if BLIS_ENABLE_BLAS is defined, and only define the subroutine versions of functions like dot, nrm2, etc if BLIS_ENABLE_CBLAS is defined. - BLAS layer is needed if CBLAS layer is enabled. Changed header files build/bli_config.h.in and bli_blas.h, and configure program to help ensure consistency in generated blis.h header and configure output. Undefining BLIS_ENABLE_BLAS_DEFS appears to be broken in UTA BLIS too, thus BLIS_ENABLE_BLAS_DEFS is currently permanently defined. AMD-Internal: [CPUPL-3015] Change-Id: I7c0fe07db85781db46f2c690e174451860b37635	2023-03-23 06:11:52 -04:00
Harihara Sudhan S	8d7593a940	AVX2 vectorized SCALV for double complex - Added ZSCALV that uses AVX2 and SSE instructions for vectorization. - Return early when the vector dimension is zero. When alpha is 1 there is no need to perform computation hence return early. - When alpha is zero expert interface of ZSETV is invoked. In this case, all the elements of the input vector are set to 0. - Invocation of expert interface means that NULL pointer can be passed to the function in place of context. Expert interface of ZSETV will query the context and get the appropriate function pointer. - Added BLAS interface for ZSCALV. The architecture ID is used to decide the function that is to be invoked. - Created a new macro INSERT_GENTFUNCSCAL_BLAS_C to instantiate SCALV BLAS macro interface only for single complex type and single complex, float mixed type AMD-Internal: [CPUPL-2773] Change-Id: I0d6995bce883c0ebdc5da0046608fc59d03f6050	2023-02-05 09:36:12 -05:00
Kiran Varaganti	201db7883c	Integrated 32x6 DGEMM kernel for zen4 and its related changes are added. Details: - Now AOCL BLIS uses AVX512 - 32x6 DGEMM kernel for native code path. Thanks to Moore, Branden <Branden.Moore@amd.com> for suggesting and implementing these optimizations. - In the initial version of 32x6 DGEMM kernel, to broadcast elements of B packed we perform load into xmm (2 elements), broadcast into zmm from xmm and then to get the next element, we do vpermilpd(xmm). This logic is replaced with direct broadcast from memory, since the elements of Bpack are stored contiguously, the first broadcast fetches the cacheline and then subsequent broadcasts happen faster. We use two registers for broadcast and interleave broadcast operation with FMAs to hide any memory latencies. - Native dTRSM uses 16x14 dgemm - therefore we need to override the default blkszs (MR,NR,..) when executing trsm. we call bli_zen4_override_trsm_blkszs(cntx_local) on a local cntx_t object for double data-type as well in the function bli_trsm_front(), bli_trsm_xx_ker_var2, xx = {ll,lu,rl,ru}. Renamed "BLIS_GEMM_AVX2_UKR" to "BLIS_GEMM_FOR_TRSM_UKR" and in the bli_cntx_init_zen4() we replaced dgemm kernel for TRSM with 16x14 dgemm kernel. - New packm kernels - 16xk, 24xk and 32xk are added. - New 32xk packm reference kernel is added in bli_packm_cxk_ref.c and it is enabled for zen4 config (bli_dpackm_32xk_zen4_ref() ) - Copyright year updated for modified files. - cleaned up code for "zen" config - removed unused packm kernels declaration in kernels/zen/bli_kernels.h - [SWLCSG-1374], [CPUPL-2918] Change-Id: I576282382504b72072a6db068eabd164c8943627	2023-01-19 23:11:36 +05:30
Harsh Dave	0a699c45f0	Corrected VEXTRACTF64X2 macro in bli_x86_asm_macro file. - Previously VEXTRACTF64X2 macro was defined as vextractf64x4. Change-Id: I79727a85b7d6da3b4d524064e297fc8c71d4f466	2023-01-16 23:04:48 -05:00
Vignesh Balasubramanian	dbd0c069d4	Implemented 3x4 RD kernels for ZGEMM in SUP path -Implemented (r)ow preferential (d)ot product milli-kernels (m and n variants) for dcomplex datatype along SUP path. -These computational kernels extend the support for handling RRC and CRC storage schemes along the SUP path. In case of BLAS api call, it corresponds to the input cases with transa equal to T and transb equal to N. -In case of the B matrix being packed(conditionally), the inputs are redirected to the existing (r)ow preferential (v)ector load optimized kernels due to better performance. -Added macro for vhsubpd assembly instruction, to support the arithmetic for complex datatype in its interleaved storage. AMD-Internal: [CPUPL-2593] Change-Id: If90834e55e9e31aa87d3d5b711efad9ef2458da8	2022-12-22 18:31:46 +05:30
Harihara Sudhan S	42d631bced	Copyright modification - Added copyright information to modified/newly created files missing them Change-Id: If4e73b680246d0363de09587d6dc54bee00ecd71	2022-10-14 12:43:35 +05:30
Shubham Sharma	2d72d4c862	DTRSM Native path AVX 512 kernels are developed - Added DTRSM AVX512 kernels for lower and upper variants in the native path. - Changes in framework are made to accommodate these kernels. AMD-Internal: [CPUPL-2588] Change-Id: I1f74273ef2389018343c0645870290373ce25efe	2022-10-01 15:58:24 +00:00
Arnav Sharma	90f915d3a9	Vectorized and parallelized zdscal routine - Implemented optimized intrinsic kernel for zdscalv for the cases where AVX2 is supported. - Also added multithreaded support for the same. - The optimal number of threads is being calculated on the basis of input size. AMD-Internal: [CPUPL-2602] Change-Id: I4d05c3b1cc365a7770703286a89c6dce3875c067	2022-09-30 06:11:07 -04:00
Dipal M Zambare	de3247b0da	Removed extra prototypes for ?gemm3m APIs - Removed prototypes for float(sgemm3m) and double(dgemm3m) types, as BLIS implements this API only for scomplex(cgemm3m) and dcomplex(zgemm3m) AMD-Internal: [SWLCSG-1477] Change-Id: Ifad86a74b4c939ed240743894b85bb4fa5e6d754	2022-09-20 15:51:56 +05:30
Dipal M Zambare	866e8de7bf	CBLAS/BLAS interface decoupling for the level 2 APIs -In BLIS, the CBLAS interface is implemented as a wrapper around the BLAS interface. For example the CBLAS API ‘cblas_dgemv’ internally invokes the BLAS API ‘dgemv_’. -This coupling between CBLAS and BLAS interface prevents the end user from overriding them individually by the application or other libraries. -This change separates the CBLAS and BLAS implementation by adding an additional level of abstraction. The implementation of the API is moved to the new function which is invoked directly from the CBLAS and BLAS wrappers. AMD-Internal: [SWLCSG-1477] Change-Id: Ie7cbbac86bbfa1075a5064b31b365e911f67786c	2022-09-15 17:51:05 +05:30
Dipal M Zambare	e18db8a172	CBLAS/BLAS interface decoupling for the level 3 APIs -In BLIS, the CBLAS interface is implemented as a wrapper around the BLAS interface. For example the CBLAS API ‘cblas_dgemm’ internally invokes the BLAS API ‘dgemm_’. -This coupling between CBLAS and BLAS interface prevents the end user from overriding them individually by the application or other libraries. -This change separates the CBLAS and BLAS implementation by adding an additional level of abstraction. The implementation of the API is moved to the new function which is invoked directly from the CBLAS and BLAS wrappers. AMD-Internal: [SWLCSG-1477] Change-Id: Id9e307154342d2c17b0ac6db580c36f1a9ee6409	2022-09-15 06:23:46 -04:00
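The decoupling described in the two commits above can be pictured as two thin wrappers that both call one shared implementation, instead of the CBLAS wrapper calling the BLAS wrapper. The Python sketch below shows only that call structure. The `_blis_impl` suffix appears elsewhere in this log, but the exact function name and the simplified two-argument signature here are assumptions; the real routines take the full BLAS argument lists.

```python
def dgemv_blis_impl(a, x):
    """Shared implementation: y = A * x on lists (simplified signature)."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def dgemv_(a, x):
    """BLAS-style wrapper: forwards straight to the shared implementation."""
    return dgemv_blis_impl(a, x)


def cblas_dgemv(a, x):
    """CBLAS-style wrapper: also forwards to the shared implementation,
    so it no longer depends on whichever dgemv_ symbol gets linked in."""
    return dgemv_blis_impl(a, x)
```

Because neither wrapper calls the other, an application or another library can override one of them without changing what the other does.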
Dipal M Zambare	5c42afada8	Revert "CBLAS/BLAS interface decoupling for level 3 APIs" This reverts commit `d925ebeb06`. Change-Id: I2e842b29c1fedbe14bf913949cf978f3e7515ff3	2022-08-30 14:50:38 +05:30
Dipal M Zambare	7e42b3d2e0	Revert "CBLAS/BLAS interface decoupling for level 2 APIs" This reverts commit `192f5313a1`. Change-Id: I876cad90902970ebc61550f109eb0ce32539ea1c	2022-08-30 11:53:46 +05:30
Dipal M Zambare	40c71dd2e1	Revert "CBLAS/BLAS interface decoupling for swap api" This reverts commit `2beaa6a0e6`. Reverting it as it is planned for the next release. Change-Id: Ib9271acd0b5b4cfd10c8f8b7bbb6ef93a3d594ea	2022-08-30 10:10:06 +05:30
jagar	2beaa6a0e6	CBLAS/BLAS interface decoupling for swap api - In BLIS the cblas interface is implemented as a wrapper around the blas interface. For example the CBLAS api ‘cblas_dgemm’ internally invokes BLAS API ‘dgemm_’. - If the end user wants to use the different libraries for CBLAS and BLAS, current implementation of BLIS doesn’t allow it. - This change separates the CBLAS and BLAS implementation by adding an additional level of abstraction. The implementation of the API is moved to the new function which is invoked directly from the CBLAS and BLAS wrappers. AMD-Internal: [SWLCSG-1477] Change-Id: I8d81072aaca739f175318b82f6510d386103c24b	2022-08-29 16:26:01 +05:30
jagar	192f5313a1	CBLAS/BLAS interface decoupling for level 2 APIs - In BLIS the cblas interface is implemented as a wrapper around the blas interface. For example the CBLAS api ‘cblas_dgemm’ internally invokes BLAS API ‘dgemm_’. - If the end user wants to use the different libraries for CBLAS and BLAS, current implementation of BLIS doesn’t allow it and may result in recursion - This change separates the CBLAS and BLAS implementation by adding an additional level of abstraction. The implementation of the API is moved to the new function which is invoked directly from the CBLAS and BLAS wrappers. AMD-Internal: [SWLCSG-1477] Change-Id: I8380b6468683028035f2aece48916939e0fede8a	2022-08-29 09:47:19 +05:30
Chandrashekara K R	d925ebeb06	CBLAS/BLAS interface decoupling for level 3 APIs ->In BLIS the cblas interface is implemented as a wrapper around the blas interface. For example the CBLAS api ‘cblas_dgemm’ internally invokes BLAS API ‘dgemm_’. ->If the end user wants to use the different libraries for CBLAS and BLAS, current implementation of BLIS doesn’t allow it and may result in recursion ->This change separates the CBLAS and BLAS implementation by adding an additional level of abstraction. The implementation of the API is moved to the new function which is invoked directly from the CBLAS and BLAS wrappers. AMD-Internal: [SWLCSG-1477] Change-Id: I6218a3e81060fc8045f4de0ace87f708465dfae5	2022-08-26 05:54:29 -04:00
Edward Smyth	6861fcae91	BLIS: Improve architecture selection at runtime Make BLIS_ARCH_TYPE=0 be an error, so that incorrect meaningful names will get an error rather than "skx" code path. BLIS_ARCH_TYPE=1 is now "generic", so that it should be constant as new code paths are added. Thus all other code path enum values have increased by 2. Also added new options to BLIS configure program to allow: 1. BLIS_ARCH_TYPE functionality to be disabled, e.g.: ./configure --disable-blis-arch-type amdzen 2. Renaming the environment variable tested from "BLIS_ARCH_TYPE" to a specified value, e.g.: ./configure --rename-blis-arch-type=MY_NAME_FOR_ARCH_TYPE amdzen On Windows, these can be enabled with e.g.: cmake ... -DDISABLE_BLIS_ARCH_TYPE=ON or cmake ... -DRENAME_BLIS_ARCH_TYPE=MY_NAME_FOR_ARCH_TYPE This implements changes 2 and 3 in the Jira ticket below. AMD-Internal: [CPUPL-2235] Change-Id: Ie42906bd909f9d83f00a90c5bef9c5bf3ef5adb4	2022-08-19 10:59:35 -04:00
Sireesha Sanga	22af681a11	Runtime Thread Control Feature Update Details: 1. Runtime Thread Control Feature is enhanced to create a provision for the application to allocate a different number of threads to BLIS from the number of threads application is using for itself. 2. In the previous implementation, if application sets BLIS_NUM_THREADS with a valid value, BLIS internally calls omp_set_num_threads() API with same value. Due to this, application could not differentiate between the number of threads used in BLIS library and the application. 3. With the current solution, if Application wants to allocate different number of threads for BLIS API and application, Application can choose either BLIS_NUM_THREADS environment variable or bli_thread_set_num_threads(nt) API for BLIS, and OpenMP APIs or environment variables for itself, respectively. 4. If BLIS_NUM_THREADS is set with a valid value, same value will be used in the subsequent parallel regions unless bli_thread_set_num_threads() API is used by the Application to modify the desired number of threads during BLIS API execution. 5. Once BLIS_NUM_THREADS environment variable or bli_thread_set_num_threads(nt) API is used by the application, BLIS module would always give precedence to these values. BLIS API would not consider the values set using OpenMP API omp_set_num_threads(nt) API or OMP_NUM_THREADS environment variable. 6. If BLIS_NUM_THREADS is not set, then if Application is multithreaded and issued omp_set_num_threads(nt) with desired number of threads, omp_get_max_threads() API will fetch the number of threads set earlier. 7. If BLIS_NUM_THREADS is not set, omp_set_num_threads(nt) is not called by the application, but only OMP_NUM_THREADS is set, omp_get_max_threads() API will fetch the value of OMP_NUM_THREADS. 8. If both environment variables are not set, or if they are set with invalid values, and omp_set_num_threads(nt) is not issued by application, omp_get_max_threads() API will return the number of the cores in the current context. 9. BLIS will initialize rntm->num_threads with the same value. However if omp_set_nested is false - BLIS APIs called from parallel threads will run in sequential. But if nested parallelism is enabled Then each application will launch MT BLIS. 10. Order of precedence used for number of threads: 0. value set using bli_thread_set_num_threads(nt) by the application 1. valid value set for BLIS_NUM_THREADS environment variable 2. omp_set_num_threads(nt) issued by the application 3. valid value set for OMP_NUM_THREADS environment variable 4. Number of cores 11. If nt is not a valid value for omp_set_num_threads(nt) API, number of threads would be set to 1. omp_get_max_threads() API will return 1. 12. OMP_NUM_THREADS env. variable is applicable only when OpenMP is enabled. AMD-Internal: [CPUPL-2342] Change-Id: I2041ac1d824f0b57a23a2a69abd6017c800f21b6	2022-08-19 05:43:01 -04:00
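The order of precedence in item 10 of the commit above, together with the fallback in item 11, can be written as one lookup. The Python sketch below is a model of the described rules, not BLIS source. The function and parameter names are hypothetical, and treating "valid" as "parses to a positive integer" is an assumption.

```python
def _parse(value):
    """Return a positive int, or None when the value is missing or invalid."""
    try:
        n = int(value)
    except (TypeError, ValueError):
        return None
    return n if n > 0 else None


def resolve_num_threads(blis_api=None, blis_env=None,
                        omp_api=None, omp_env=None, num_cores=1):
    """Pick the thread count following the precedence list (items 10 and 11)."""
    # 0 and 1: BLIS-specific settings always win when valid.
    for candidate in (blis_api, blis_env):
        n = _parse(candidate)
        if n is not None:
            return n
    # 2: omp_set_num_threads(nt); an invalid nt results in 1 thread (item 11).
    if omp_api is not None:
        return _parse(omp_api) or 1
    # 3 and 4: OMP_NUM_THREADS if valid, otherwise the number of cores.
    return _parse(omp_env) or num_cores
```

For example, `resolve_num_threads(blis_env="8", omp_env="16", num_cores=32)` gives 8, because a valid BLIS_NUM_THREADS outranks OMP_NUM_THREADS.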
Edward Smyth	737e08cd7a	BLIS: Improve architecture selection at runtime Enable meaningful names as options for BLIS_ARCH_TYPE environment variable. For example, BLIS_ARCH_TYPE=zen4 or BLIS_ARCH_TYPE='ZEN4' or BLIS_ARCH_TYPE=6 will select the same code path (in this release). The meaningful names are not case sensitive. This implements change 1 in the Jira ticket below. Following review comments: 1. Use names from arch_t enum in function bli_env_get_var_arch_type() rather than directly using numbers. 2. AMD copyrights updated. AMD-Internal: [CPUPL-2235] Change-Id: I8cfd43d34765d5e8c7e35680d18825d9934753ad	2022-08-10 08:26:49 -04:00
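The name handling in the commit above (a number or a case-insensitive name select the same code path) can be sketched in Python as below. The table holds only the one pair the commit message states, zen4 = 6 "in this release". The function name is hypothetical, and returning None for an unknown name is a placeholder, since a later commit in this log changes how bad values are treated.

```python
ARCH_IDS = {"zen4": 6}  # only this pair is given in the commit message


def parse_arch_type(value):
    """Accept either a numeric id or a case-insensitive architecture name.

    Returns None for an unrecognised name (placeholder behaviour).
    """
    try:
        return int(value)
    except ValueError:
        return ARCH_IDS.get(value.lower())
```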
Dipal M Zambare	2ba2fb2b63	Add AVX2 path for TRSM+GEMM combination. - Enabled AVX2 TRSM + GEMM kernel path: when GEMM is called from TRSM context, it will invoke AVX2 GEMM kernels instead of the default AVX-512 GEMM kernels. - The default context has the block sizes for AVX512 GEMM kernels; however, TRSM uses AVX2 GEMM kernels and they need different block sizes. - Added new API bli_zen4_override_trsm_blkszs(). It overrides default block sizes in context with block sizes needed for AVX2 GEMM kernels. - Added new API bli_zen4_restore_default_blkszs(). It restores the block sizes to their default values (as needed by default AVX512 GEMM kernels). - Updated bli_trsm_front() to override the block sizes in the context needed by TRSM + AVX2 GEMM kernels and restore them to the default values at the end of this function. It is done in bli_trsm_front() so that we override the context before creating different threads. AMD-Internal: [CPUPL-2225] Change-Id: Ie92d0fc40f94a32dfb865fe3771dc14ed7884c55	2022-06-29 10:16:24 +00:00
Dipal M Zambare	c87b9aab75	Added support for AVX512 for Windows and AMAXV - Completed zen4 configuration support on Windows - Enabled AVX512 kernels for AMAXV - Added zen4 configuration in amdzen for Windows - Moved all zen4 kernels inside kernels/zen4 folder AMD-Internal: [CPUPL-2108] Change-Id: I9d2336998bbcdb8e2c4ca474977b5939bfa578ba	2022-06-08 11:09:48 +05:30
Dipal M Zambare	8cc15107ed	Enabled AVX-512 kernels for Zen4 config - Enabled AVX-512 skylake kernels in zen4 configuration. AVX-512 kernels are added for GEMM float and double types. - Enabled reference kernel for TRSM native path AMD-Internal: [CPUPL-2108] Change-Id: I66f3468346085c17183cbcbf4f2c8cfe07579b6f	2022-06-03 06:34:35 +00:00
Chandrashekara K R	8e6da6b844	Added checks to not define the bool type for C++ code on Windows, to avoid a redefinition build-time error. AMD-Internal: [CPUPL-2037] Change-Id: I065da9206ab06f60876324f258ee12fb9fe83f88	2022-05-17 18:10:39 +05:30
Dipal M Zambare	e712ffe139	Added AOCL progress support for BLIS -- AOCL libraries are used for lengthy computations which can go on for hours or days; once the operation is started, the user doesn’t get any update on the current state of the computation. This (AOCL progress) feature enables the user to receive a periodic update from the libraries. -- The user registers a callback with the library if it is interested in receiving the periodic update. -- The library invokes this callback periodically with information about the current state of the operation. -- The update frequency is statically set in the code; it can be modified as needed if the library is built from source. -- This feature is supported for GEMM and TRSM operations. -- Added example for GEMM and TRSM. -- Cleaned up and reformatted test_gemm.c and test_trsm.c to remove warnings and make indentation consistent across the file. AMD-Internal: [CPUPL-2082] Change-Id: I2aacdd8fb76f52e19e3850ee0295df49a8b7a90e	2022-05-17 18:10:39 +05:30
Sireesha Sanga	9621ef3067	Performance Improvement for ztrsm small sizes Details: - Enable ztrsm small implementation - For small sizes, Right Variants and Left Unit Diag Variants are using ztrsm_small implementations. - Optimization of Left Non-Unit Diagonal Variants, Work In Progress AMD-Internal: [SWLCSG-1194] Change-Id: Ib3cce6e2e4ac0817ccd4dff4bb0fa4a23e231ca4	2022-05-17 18:09:22 +05:30
Meghana Vankadari	c11fd5a8f6	Added functionality support for dzgemm AMD-Internal: [SWLCSG-1012] Change-Id: I2eac3131d2dcd534f84491289cbd3fe7fb7de3da	2022-05-17 18:01:55 +05:30
Dipal M. Zambare	b90420627a	Revert "Enabled AVX-512 kernels for Zen4 config" This reverts commit `62c96a4190`. Was committed without review.	2022-04-21 06:46:00 +00:00
Dipal M. Zambare	62c96a4190	Enabled AVX-512 kernels for Zen4 config Enabled AVX-512 skylake kernels in zen4 configuration. AVX-512 kernels are added for float and double types. AMD-Internal: [CPUPL-2108]	2022-04-21 06:28:29 +00:00
Field G. Van Zee	a4abb10831	Added a new 'gemmlike' sandbox. Details: - Added a new sandbox called 'gemmlike', which implements sequential and multithreaded gemm in the style of gemmsup but also unconditionally employs packing. The purpose of this sandbox is to (1) avoid select abstractions, such as objects and control trees, in order to allow readers to better understand how a real-world implementation of high-performance gemm can be constructed; (2) provide a starting point for expert users who wish to build something that is gemm-like without "reinventing the wheel." Thanks to Jeff Diamond, Tze Meng Low, Nicholai Tukanov, and Devangi Parikh for requesting and inspiring this work. - The functions defined in this sandbox currently use the "bls_" prefix instead of "bli_" in order to avoid any symbol collisions in the main library. - The sandbox contains two variants, each of which implements gemm via a block-panel algorithm. The only difference between the two is that variant 1 calls the microkernel directly while variant 2 calls the microkernel indirectly, via a function wrapper, which allows the edge case handling to be abstracted away from the classic five loops. - This sandbox implementation utilizes the conventional gemm microkernel (not the skinny/unpacked gemmsup kernels). - Updated some typos in the comments of a few files in the main framework. Change-Id: Ifc3c50e9fd0072aada38eace50c57552c88cc6cf	2022-04-01 13:55:30 +05:30
Field G. Van Zee	7a0ba4194f	Added support for addons. Details: - Implemented a new feature called addons, which are similar to sandboxes except that there is no requirement to define gemm or any other particular operation. - Updated configure to accept --enable-addon=<name> or -a <name> syntax for requesting an addon be included within a BLIS build. configure now outputs the list of enabled addons into config.mk. It also outputs the corresponding #include directives for the addons' headers to a new companion to the bli_config.h header file named bli_addon.h. Because addons may wish to make use of existing BLIS types within their own definitions, the addons' headers must be included sometime after that of bli_config.h (which currently is #included before bli_type_defs.h). This is why the #include directives needed to go into a new top-level header file rather than the existing bli_config.h file. - Added a markdown document, docs/Addons.md, to explain addons, how to build with them, and what assumptions their authors should keep in mind as they create them. - Added a gemmlike-like implementation of sandwich gemm called 'gemmd' as an addon in addon/gemmd. The code uses a 'bao_' prefix for local functions, including the user-level object and typed APIs. - Updated .gitignore so that git ignores bli_addon.h files. Change-Id: Ie7efdea366481ce25075cb2459bdbcfd52309717	2022-03-31 12:03:27 +05:30
Dipal M Zambare	fd8a3aace9	Added support for zen4 architecture - Added configuration option for zen4 architecture - Added auto-detection of zen4 architecture - Added zen4 configuration for all checks related to AMD specific optimizations AMD-Internal: [CPUPL-1937] Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a	2021-11-23 10:29:15 +05:30
lcpu	30038af896	Reverted: To fix accuracy issues for complex datatypes Details: -- reverted cscalv,zscalv,ctrsm,ztrsm changes to address accuracy issues observed by libflame and scalapack application testing. -- AMD-Internal: [CPUPL-1906], [CPUPL-1914] Change-Id: Ic364eacbdf49493dd3a166a66880c12ee84c2204	2021-11-12 08:58:57 +05:30
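
The thread-count order of precedence listed in commit 22af681a11 (items 10 and 11 of its message) can be illustrated with a small standalone sketch. This is not BLIS code: `parse_env` and `resolve_num_threads` are hypothetical names introduced here, the inputs stand in for values BLIS would obtain from the environment and from OpenMP, and skipping an invalid `bli_thread_set_num_threads(nt)` value is an assumption, since the commit message does not say how that case is handled.

```python
from typing import Optional


def parse_env(value: Optional[str]) -> Optional[int]:
    """Return a positive thread count, or None if unset or invalid (hypothetical helper)."""
    if value is None:
        return None
    try:
        n = int(value.strip())
    except ValueError:
        return None
    return n if n > 0 else None


def resolve_num_threads(
    api_nt: Optional[int],      # value passed to bli_thread_set_num_threads(nt); None if never called
    blis_env: Optional[str],    # BLIS_NUM_THREADS
    omp_api_nt: Optional[int],  # value passed to omp_set_num_threads(nt); None if never called
    omp_env: Optional[str],     # OMP_NUM_THREADS
    num_cores: int,             # fallback: number of cores in the current context
) -> int:
    """Model of the precedence order in the commit message; not the library's implementation."""
    # 0. Value set through the BLIS API (assumption: an invalid value is skipped).
    if api_nt is not None and api_nt > 0:
        return api_nt
    # 1. Valid BLIS_NUM_THREADS.
    blis = parse_env(blis_env)
    if blis is not None:
        return blis
    # 2. omp_set_num_threads(nt); per item 11, an invalid nt results in 1 thread.
    if omp_api_nt is not None:
        return omp_api_nt if omp_api_nt > 0 else 1
    # 3. Valid OMP_NUM_THREADS.
    omp = parse_env(omp_env)
    if omp is not None:
        return omp
    # 4. Number of cores.
    return num_cores
```

For instance, under this model an application that sets `BLIS_NUM_THREADS=4` and calls `omp_set_num_threads(2)` for its own parallel regions would have BLIS resolve to 4 threads, which is the separation between library and application thread counts that the commit describes.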


492 Commits