amd/blis - blis - Public git mirror

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-05-13 10:35:38 +00:00

Author	SHA1	Message	Date
jagar	29711dd5a3	Gtestsuite: Updated testings_basics.* to print matrix/vector name AMD-Internal: [CPUPL-2732] Change-Id: I89b4ffc97ea852e66f42b82058af67c16144fbf6	2023-09-26 08:27:19 -04:00
jagar	f96e20b894	Update in CMakeLists.txt to install on windows Updated CMakeLists.txt to copy library and headers into folder mentioned during cmake configuration. Steps to install 1. cmake .. -G ........ -DCMAKE_INSTALL_PREFIX=path_to_install 2. cmake --build . --config Release 3. cmake --install . (install lib and headers) Change-Id: Ic2728209a2e1d181cc92bab08b82a748bec583d4	2023-09-26 07:56:03 -04:00
Harsh Dave	e437469a99	Optimized AVX2 DGEMM SUP edge kernels - For edge kernels which handles the corner cases and specially for cases where there is really small amount of computation to be done, executing FMA efficiently becomes very crucial. - In previous implementation, edge kernels were using same, limited number of vector register to hold FMA result, which indirectly creates dependency on previous FMA to complete before CPU can issue new FMA. - This commit address this issue by using different vector registers that are available at disposal to hold FMA result. - That way we hold FMA results in two sets of vector registers, so that sub-sequent FMA won't have to wait for previous FMA to complete. - At the end of un-rolled K loop these two sets of vector registers are added together to store correct result in intended vector registers. AMD-Internal: [CPUPL-3574] Change-Id: I48fa9e29b6650a785321097b9feeddc3326e3c54	2023-09-22 03:43:47 -04:00
Edward Smyth	ccb8dd26fd	Compiler warnings when using --int-size=32 Correct compiler warnings when building with configure --int-size=32 - bla_imatcopy.c: Cast ints to longs to match %ld format specification in error printf statement and change this to fprintf to stderr. Also copy this additional fprintf statement to other variants of this function. - bli_type_defs.h: siz_t should always be the same size as a pointer. This corrects an issue in bli_malloc.c when casting from a pointer to a siz_t integer value. AMD-Internal: [CPUPL-3519] Change-Id: Ic87cd6142b8a6fed177b7c55bc0bb6013c5b69ab	2023-09-19 06:08:19 -04:00
Edward Smyth	15f9a747af	Fix for non-x86 builds: bli_gemmt_sup_var1n2m.c bli_gemmt_sup_var1n2m.c contained x86 specific code. Move to frame/3/gemmt/bli_gemmt_sup_var1n2m_amd.c and restore bli_gemmt_sup_var1n2m.c as of commit `10ca8710f0` as variant for non-AMD codepath builds. AMD-Internal: [CPUPL-3838] Change-Id: I88db20b93b2dbcbbf5092a4cb78f14dd1179975f	2023-09-13 16:03:53 +05:30
Edward Smyth	0d16d952dc	BLIS: DTL enhancements Several improvements to BLIS DTL functionality - For APIs that report performance statistics, test for time=0.0 before dividing by time when calculating GFLOPS. - Call AOCL_DTL_TRACE_EXIT in the parameter checking functions inlined from ./frame/compat/check/bla_*_check.h - Correct flop count for complex routines. AMD-Internal: [CPUPL-3736] Change-Id: Icc515d88810dd79e66e22ea8c47d84649ca9f768	2023-09-11 08:42:11 -04:00
orequest	09e34fd2bd	Added optimised CGEMM function pointers in zen4 cntx 1. Two CGEMM function pointers are added for different storage schemes 1. bli_cgemmsup_rv_zen_asm_3x8m 2. bli_cgemmsup_rv_zen_asm_3x8n 2. In previous commit: (Level-3 triangular routines now use different block sizes and kernels Commit Id: `79e174ff0a`) 1. bli_cntx_set_l3_sup_tri_kers cntx function was created 2. Function holds optimised function pointers for GEMMT/SYRK API's 3. It avoids over riding default block sizes which improves the performance 4. This function did not include optimised CGEMM function pointers leading to regression as reference kernels were invoked 3. With this commit, 2 optimized CGEMM function pointers are added in bli_cntx_set_l3_sup_tri_kers 1. This fixes the regression as optimized CGEMM functions are invoked AMD-Internal: [CPUPL-3831] [CPUPL-3830] Change-Id: Ie8b41a5e62439de2a65e7df0b07d63ee2383e51e	2023-09-11 06:38:31 -04:00
Edward Smyth	9d19effec5	Fix for non-x86 builds: cpuid query functions Ensure functions bli_cpuid_query_id() and bli_cpuid_query_model_id() are defined for all architectures in bli_cpuid.c AMD-Internal: [CPUPL-3838] Change-Id: I7b0582a4d63d9f28076761749cf5c24d87316f3e	2023-09-11 04:19:14 -04:00
Vignesh Balasubramanian	32104c400c	GTestSuite : Designing test cases for ZGEMM - Designed test cases for unit testing of ZGEMM compute kernel for handling inputs when k == 1. The design uses value-parameterized testing for checking accuracy, and verifying the mandate in case of exception values on the inputs/output. - The design uses type-parameterized testing for verifying BLAS standard for invalid input cases, and also for early return scenarios. - Added the function template set_ev_mat( ... ) as part of testinghelpers. This function is used as a helper for inducing exception values onto indices specified as arguments to the test_gemm( ... ) interface. - Abstracted the function definition of getValueString( ... ) from the NRM2 testing interface to testinghelpers(renamed as get_value_string( ... ) for naming consistency), in order to use it as a helper function across all APIs in case of exception value testing. AMD-Internal: [CPUPL-3823] Change-Id: I0fea21f9c8759bbbdc88ba0a016202753e28f2a7	2023-09-08 17:36:57 +05:30
mkadavil	e5e9127a68	Fixes for aocl_gemm addon compilation issues Certain functions were updated recently and now takes extra arguments for error handling. Usage of the same are now updated in aocl_gemm. Change-Id: I7daca4fd1f284d57034d564f0a08cc6410ccfd5c	2023-09-06 16:00:34 +05:30
Eleni Vlachopoulou	a6641dec0b	Updating GTestSuite CMake system to enable testing BLIS libraries on Windows. - Renaming ELEMENT_TYPE to BLIS_ELEMENT_TYPE, since the first is defined on a Windows header. - Updating refCBLAS object to have different implementation depending on the platform. - Removing dlfcn.h from all reference headers since it's linux specific and adding it conditionally on a higher level. - Changes on all CMakeLists.txt files to enable building on Windows. AMD-Internal: [CPUPL-2732] Change-Id: I6e35656a3779b35dc815a2409cf84c22dd27f3e7	2023-08-29 16:11:22 +05:30
Edward Smyth	bb4c158e63	Merge commit 'b683d01b' into amd-main * commit 'b683d01b': Use extra #undef when including ba/ex API headers. Minor preprocessor/header cleanup. Fixed typo in cpp guard in bli_util_ft.h. Defined eqsc, eqv, eqm to test object equality. Defined setijv, getijv to set/get vector elements. Minor API breakage in bli_pack API. Add err_t* "return" parameter to malloc functions. Always stay initialized after BLAS compat calls. Renamed membrk files/vars/functions to pba. Switch allocator mutexes to static initialization. AMD-Internal: [CPUPL-2698] Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df	2023-08-21 07:01:38 -04:00
Shubham Sharma	0000cc88de	Removed local copy of cntx in TRSM - TRSM and GEMM has different blocksizes in zen4, in order to accommodate this, a local copy of cntx was created in TRSM. - Local copy of cntx has been removed and TRSM blocksizes are stored in cntx->trsmblkszs. - Functions to override and restore default blocksizes for TRSM are removed. Instead of overriding the default blocksizes, TRSM blocksizes are stored separately in cntx. - Pack buffers for TRSM have to be packed with TRSM blocksizes and GEMM pack buffers have to be packed with default blocksizes. To check if we are packing for TRSM, "family" argument is added in bli_packm_init_pack function. - BLIS_GEMM_FOR_TRSM_UKR has to be used for TRSM if it is set, if it is not set then BLIS_GEMM_UKR has to be used. This functionality has been added to all TRSM macro kernels. - Methods to retrieve TRSM blocksizes from cntx are added to bli_cntx.h. - Tests for micro kernels are modified to accommodate the change in signature of bli_packm_init_pack. AMD-Internal: [CPUPL-3781] Change-Id: Ia567215d6d1aa0f14eae5d3177f4a3dd63b4b20a	2023-08-16 08:09:01 -04:00
Harihara Sudhan S	278ca71706	Fixes for GEMV Functionality Issues - Added call to dsetv in dscalv. When DSCALV is invoked by DGEMV the SCAL function is expected to SET the vector to zero when alpha is 0. This change is done to ensure BLAS compatibility of DGEMV. - Fixed bug in DGEMV var 1. Reverted changes in DGEMV var 1 to remove packing and dispatch logic. - CMAKE now builds with _amd files for unf_var2 of GEMV. AMD-Internal: [CPUPL-3772] Change-Id: I0d60c9e1025a3a56419d6ae47ded509d50e5eade	2023-08-14 13:54:48 +05:30
Chandrashekara K R	248d09c722	Version String Update AOCL-BLIS: Updated version string to AOCL-BLIS 4.1.1 Build <YYMMDD> Change-Id: Iced62a66d0859b3c7d4bcfe6f0e0527922e41cae	2023-08-08 07:27:41 -04:00
Harihara Sudhan S	03fa660792	Optimized xGEMV for non-unit stride X vector - In GEMV variant 1, the input matrix A is in row major. X vector has to be of unit stride if the operation is to be vectorized. - In cases when X vector is non-unit stride, vectorization of the GEMV operation inside the kernel has been ensured by packing the input X vector to a temporary buffer with unit stride. Currently, the packing is done using the SCAL2V. - In case of DGEMV, X vector is scaled by alpha as part of packing. In CGEMV and ZGEMV, alpha is passed as 1 while packing. - The temporary buffer created is released once the GEMV operation is complete. - In DGEMV variant 1, moved problem decomposition for Zen architecture to the DOTXF kernel. - Removed flag check based kernel dispatch logic from DGEMV. Now, kernels will be picked from the context for non-avx machines. For avx machines, the kernel(s) to be dispatched is(are) assigned to the function pointer in the unf_var layer. AMD-Internal: [CPUPL-3475] Change-Id: Icd9fd91eccd831f1fcb9fbf0037fcbbc2e34268e	2023-08-08 01:01:22 -04:00
Edward Smyth	c445f192d5	BLIS: Missing clobbers (batch 6) More missing clobbers in skx and zen4 kernels, missed in previous commits. AMD-Internal: [CPUPL-3521] Change-Id: I838240f0539af4bf977a10d20302a40c34710858	2023-08-07 10:52:23 -04:00
Harihara Sudhan S	3be43d264f	Optimized xGEMV for non-unit stride Y vector - In variant 2 of GEMV, A matrix is in column major. Y vector has to be of unit stride if the operation is to be vectorized. - In cases when Y vector is non-unit stride, vectorization of the GEMV operation inside the kernel has been ensured by packing the input Y vector to a temporary buffer with unit stride. As part of the packing Y is scaled by beta to reduce the number of times Y vector is to be loaded. - After performing the GEMV operation, the results in the temporary buffer are copied to the original buffer and the temporary one is released. - In DGEMV var 2, moved problem decomposition for Zen architecture to the AXPYF kernel. - Removed flag check based kernel dispatch logic from DGEMV. Now, kernels will be picked from the context for non-avx machines. For avx machines, the kernel(s) to be dispatched is(are) assigned to the function pointer in the unf_var layer. AMD-Internal: [CPUPL-3485] Change-Id: I7b2efb00a9fa9abca65abca07ee80f38229bf654	2023-08-07 08:12:44 -04:00
Harsh Dave	5bdf5e2aaa	Optimized AVX2 DGEMM SUP and small edge kernels. - Re-designed the new edge kernels that uses masked load-store instructions for handling corner cases. - Mask load-store instruction macros are added. vmovdqu, VMOVDQU for setting up the mask. vmaskmovpd, VMASKMOVPD for masked load-store - Following edge kernels are added for 6x8m dgemm sup. n-left edge kernels - bli_dgemmsup_rv_haswell_asm_6x7m - bli_dgemmsup_rv_haswell_asm_6x5m - bli_dgemmsup_rv_haswell_asm_6x3m m-left edge kernels - bli_dgemmsup_rv_haswell_asm_5x7 - bli_dgemmsup_rv_haswell_asm_4x7 - bli_dgemmsup_rv_haswell_asm_3x7 - bli_dgemmsup_rv_haswell_asm_2x7 - bli_dgemmsup_rv_haswell_asm_1x7 - bli_dgemmsup_rv_haswell_asm_5x5 - bli_dgemmsup_rv_haswell_asm_4x5 - bli_dgemmsup_rv_haswell_asm_3x5 - bli_dgemmsup_rv_haswell_asm_2x5 - bli_dgemmsup_rv_haswell_asm_1x5 - bli_dgemmsup_rv_haswell_asm_5x3 - bli_dgemmsup_rv_haswell_asm_4x3 - bli_dgemmsup_rv_haswell_asm_3x3 - bli_dgemmsup_rv_haswell_asm_2x3 - bli_dgemmsup_rv_haswell_asm_1x3 - For 16x3 dgemm_small, m_left computation is handled with masked load-store instructions avoid overhead of conditional checks for edge cases. - It improves performance by reducing branching overhead and by being more cache friendly. AMD-Internal: [CPUPL-3574] Change-Id: I976d6a9209d2a1a02b2830d03d21d200a5aad173	2023-08-07 07:30:50 -04:00
Vignesh Balasubramanian	758ec3b5ca	ZGEMM optimizations for cases with k = 1 - Implemented bli_zgemm_4x4_avx2_k1_nn( ... ) kernel to replace bli_zgemm_4x6_avx2_k1_nn( ... ) kernel in the BLAS layer of ZGEMM. The kernel is built for handling the GEMM computation with inputs having k = 1, and the transpose values for A and B as N. - The kernel dimension has been changed from 4x6 to 4x4, due to the following reasons : - The 1xNR block of B in the n-loop can be reused over multiple MRx1 blocks of A in the m-loop during computation. Similar analogy exists for the fringe cases. - Every 1xNR block of B was scaled with alpha and stored in registers before traversing in the m-dimension. Similar change was done for fringe cases in n-dimension. - These registers should not be modified during compute, hence the kernel dimension was changed from 4x6 to 4x4. - The check for early exit(with regards to BLAS mandate) has been removed, since it is already present in the BLAS layer. - The check for parallel ZGEMM has been moved post the redirection to this kernel, since the kernel is single-threaded. - The bli_kernels_zen.h file was updated with the new kernel signature. AMD-Internal: [CPUPL-3622] Change-Id: Iaf03b00d5075dd74cc412290d77a401986ba0bea	2023-08-07 15:10:08 +05:30
Harihara Sudhan S	c97471dce0	Added AVX512 ZDSCALV kernel - Added AVX512-based kernel for ZDSCAL. This will be dispatched from the BLAS layer for machines that have AVX512 flags. - In AVX2 kernel for ZDSCALV, vectorized fringe compute using SSE instructions. - Removed the negative incx handling checks from the blis_impli layer of ZDSCAL as BLAS expects early return for incx <= 0. AMD-Internal: [CPUPL-3648] Change-Id: I820808e3158036502b78b703f5f7faa799e5f7d9	2023-08-06 01:51:47 -04:00
Harihara Sudhan S	b126c9943b	ZSCALV kernel optimization - ZSCALV kernel now uses fmaddsub intrinsics instead of mul followed by addsub instrinsics. - Removed the negative incx handling checks from the BLAS impli layer as BLAS expects early return for incx <= 0. - Moved all exceptions in the kernel to the BLAS impli layer. AMD-Internal: [SWLCSG-2224] Change-Id: I03b968d21ca5128cb78ddcef5acfd5e579b22674	2023-08-04 06:57:18 -04:00
Shubham Sharma	9607f207da	AOCL Dynamic tuning for DAXPYV - Existing logic is not picking the ideal number of threads for some problem sizes. - Problem size and their corresponding ideal number of threads are retuned for daxpy in aocl dynamic. AMD-Internal: [CPUPL-3484] Change-Id: Ice874ceef0a1815383f74f1a4b9677677b276af7	2023-08-01 10:34:04 +05:30
Eleni Vlachopoulou	fa77d0415a	Updating nrm2 GTestSuite testing - Adding default template parameter for the type of the returned value from nrm2. - Bugfix on NaN/Inf comparator for scalars. - Tuning sizes of vector x to exercise the different paths for vectorized and scalar code. - Adding wrong parameters and extreme value testing. - Adding tests for overflow and underflow using max and min representable numbers for vectorized and scalar code. AMD-Internal: [CPUPL-2732] Change-Id: Ice8ee65095ecaa7b30ebd5f90ed2a890178533db	2023-07-28 05:03:00 -04:00
Shubham Sharma	954c97f858	Added NT in DTL logs for GEMMT, TRSM and NRM2 - Number of threads and gflops are added in the DTL logs for GEMMT, TRSM and NRM2 AMD-Internal: [CPUPL-2144] Change-Id: If68887a5150bd0feda351180f379996497a1e678	2023-07-27 05:15:08 -04:00
Meghana Vankadari	79e174ff0a	Level-3 triangular routines now use different block sizes and kernels. Details: - Eliminated the need for override function in SUP for GEMMT/SYRK. - New set of block sizes, kernels and kernel preferences are added to cntx data structure for level-3 triangular routines. - Added supporting functions to set and get the above parameters from cntx. - Modified GEMMT/SYRK SUP code to use these new block sizes/kernels. In case they are not set, use the default block sizes/kernels of Level-3 SUP. AMD-Internal: [CPUPL-3649] Change-Id: Iee11bd4c4f1d8fbbb749c296258d1b8121c009a0	2023-07-26 01:26:11 -04:00
Chandrashekara K R	7b78d93282	Removing omp library linking to static multithreaded library build. Description: We have seen the library dependency issue when we are linking the libomp.lib or libiomp5md.lib while building the library for static multithreaded scenario. So we are removing the linking of openmp library for static multithreaded blis library build. So that user can link any openmp library(libomp.lib or libiomp5md.lib) while building their applications by linking static multithreaded blis library. AMD-Internal: [SWLCSG-2196] Change-Id: I96722f3587ee555af12de664957c211c56fcf03d	2023-07-13 06:54:02 -04:00
Eleni Vlachopoulou	660cd6d1b2	Adding nrm2 target for benchmarking on Windows. Modifying blis/bench/CMakeLists.txt to include nrm2 target and produce the corresponding executable. AMD-Internal: [CPUPL-3625] Change-Id: I7945416142e07ac99510ed9500a2c620053c7e13	2023-07-10 14:03:05 -04:00
Harihara Sudhan S	ffbb0e83e5	ZGEMM optimization for cases when m = 1 or n = 1 - When n = 1 and A matrix is transposed ZGEMV row major variant is invoked. - When m = 1 and B matrix is not transposed ZGEMV row major variant is invoked. - This redirection happens before parallel ZGEMM check. This is done to avoid the unneccesary condition check. Any parallelization check is expected to happen in the invoked ZGEMV interface. AMD-Internal: [CPUPL-2773] Change-Id: I6b7b31db712edc682c089475d12e98730a960138	2023-06-30 04:54:42 -04:00
jagar	fb6f1380b2	Gtestsuite:Added util functions - Functions to print matrix and vector elements. - Functions to convert matrix to symmetric, hermitian triangular matrix and set diagonal elements in matrix. AMD-Internal: [CPUPL-2732] Change-Id: I1ffa5289329cbb8a9581bf545bdd157801cf5baa	2023-06-27 16:33:57 +05:30
Chandrashekara K R	cdba2db827	BLIS: Added address sanitizer flag for blis library on windows. Description: Added cmake option to test address related issues using address sanitizer(-fsanitizer=address) on windows. When the user enable the ENABLE_ASAN_TESTS option, cmake will add related compiler and linker flags along with dependent libraries. AMD-Internal: [CPUPL-2984] Change-Id: I6d2a0cfe84fe122fc6c40e3023d8c79211d5fa71	2023-06-22 13:42:38 -04:00
jagar	003d1e9ae6	GTestSuite: Using ELEMENT_TYPE to specify generation of random numbers in tests. Since random numbers are specified from ELEMENT_TYPE and we never generate tests for both integer and floating point numbers at the same time, we update code as described below: - random vector/matrix generators are updated to use ELEMENT_TYPE as a default parameter. - ::testing::Values(ELEMENT_TYPE) is removed from all test generators. AMD-Internal: [CPUPL-2732] Change-Id: Ibc6b05044502f541c9e8a7687931b1ca2903fb0c	2023-06-21 11:30:15 -04:00
Eleni Vlachopoulou	7b35a1283b	Updating CMake to select the correct Windows runtime libraries. - Upgrated to 3.15 as minimum version of CMake. - Used CMAKE_MSVC_RUNTIME_LIBRARY instead of CMAKE_C_FLAGS to set MT and MD flags correctly. AMD-Internal: [CPUPL-3559] Change-Id: Ib82821d245b6acaa1399166219168ad2535d8d92	2023-06-16 22:04:09 +05:30
Edward Smyth	94a4abe2e5	BLIS: Incorrect ifdef in cblas.h and cblas_f77.h Remove unnecessary ifdef BLIS_ENABLE_CBLAS statement from cblas.h and cblas_f77.h. These were erroneously added when fixing the --disable-blas functionality but are not needed in the CBLAS headers, as these files will not be generated when BLAS or CBLAS is disabled. This is a fix to commit `5bd2a777ba` AMD-Internal: [CPUPL-3541] Change-Id: If38bd795d31098a7023d575672b0a913338c0d2d	2023-06-07 06:52:57 -04:00
Eleni Vlachopoulou	7b2924c079	Updating object library targets in CMakeLists.txt for zen4 based on configuration AMD-Internal: [CPUPL-3516] Change-Id: Ibfe66f50fa77d4011829d8386f0a91f140d38335	2023-06-01 17:29:37 +05:30
sireesha.sanga	85eb7880f7	README File Update Updated with latest and relevant details. AMD-Internal: [CPUPL-3007] Change-Id: I6d86c5f0c49fd8739c656bcc8187a5f8a4dc9beb	2023-05-25 14:46:33 +00:00
Harsh Dave	655955dd3b	Doxygen document generation from cmake build - Added support to generate doxygen documentation from cmake build. - If doxygen is already installed on machine, it will generate documentation and promtps the path for documentation. AMD-Internal: [CPUPL-3188] Change-Id: I6047f62df63844aa71836fd481b4df246b793696	2023-05-25 07:41:40 -04:00
Eleni Vlachopoulou	9c613c4c03	Windows CMake bugfix in object libraries for shared library option Defining BLIS_IS_BUILDING_LIBRARY if BUILD_SHARED_LIBS=ON for the object libraries created in kernels/ directory. The macro definition was not propagated from high level CMake, so we need to define explicitly for the object libraries. AMD-Internal: [CPUPL-3241] Change-Id: Ifc5243861eb94670e7581367ef4bc7467c664d52	2023-05-24 17:30:16 +05:30
Edward Smyth	dea5fe4d12	BLIS: Missing clobbers (batch 5) Add missing clobbers for AVX512 mask registers k0-k7 in zen4 kernels. AMD-Internal: [CPUPL-3456] Change-Id: I5f28c725d7af1466df4db4cdfa2d456bbc6ab36d	2023-05-23 15:40:29 -04:00
Edward Smyth	a3adfb68cf	BLIS: Missing clobbers (batch 4) Add missing clobbers haswell (sup) kernels. AMD-Internal: [CPUPL-3456] Change-Id: I19fa97b85f75c8b8fe15d31b13768f937cc5e4cc	2023-05-23 14:57:08 -04:00
Edward Smyth	03965a4f07	BLIS: Missing clobbers (batch 3) Add missing clobbers in haswell (non-sup) kernels. AMD-Internal: [CPUPL-3456] Change-Id: I68f6ad0c01557fcde73b1775d250d48b5162c521	2023-05-23 14:37:31 -04:00
Edward Smyth	e960141fe2	BLIS: Missing clobbers (batch 2) Add missing clobbers in other zen4 kernels. AMD-Internal: [CPUPL-3456] Change-Id: I5cceb44fe100e03269cfe21d8c4c0d2171b921c3	2023-05-23 13:12:20 -04:00
Edward Smyth	ea2eea5097	BLIS: Missing clobbers (batch 1) Add missing clobbers in first batch of assembly kernels: - zen3 bli_gemmsup* - bli_zgemm_zen4_asm_12x4 - bli_gemmsup_rv_haswell_asm_sMx6 AMD-Internal: [CPUPL-3456] Change-Id: I33c321043a197b2b885cfd6cd589532fc633a6a1	2023-05-23 11:51:18 -04:00
Edward Smyth	6911d2dd21	zen config make_defs.mk improvements Improvements to zen make_defs.mk files: * Add -znver4 flag for GCC 13 and later. * Add AVX512 flags or -znver4 as appropriate for upstream LLVM in config/zen4/make_defs.mk to enable BLIS to be build with LLVM rather than AOCC. * zen make_defs.mk files were inheriting settings from the previous one (zen->zen2->zen3->zen4), when they should be independent of each other. Correct by including config/zen/amd_config.mk in all zen make_defs.mk files to reinitialize the compiler flags. * Update zen2 and zen3 make_defs.mk for recent AOCC compiler releases, rather than rely on LLVM settings. * Remove -mfpmath=sse flag in config/zen4/make_defs.mk as this is already specified in amd_config.mk (and should be the default setting anyway). * Tidy files to simplify nested if structures and be more consistent with one another. AMD-Internal: [CPUPL-3399] Change-Id: Ice64ccedd90c2660fdee8b485348a6b405cfc5ac	2023-05-22 07:51:41 -04:00
Mangala V	5f5bc24989	Bug fix: AVX2 code being invoked on non-avx2 machine for ZGEMM API Prevented calling avx2 based bli_zgemm_ref_k1_nn code on non-supported systems. Changed the name of the function bli_zgemm_ref_k1_nn to bli_zgemm_4x6_avx2_k1_nn(). Changed the name of the function bli_dgemm_ref_k1_nn to bli_dgemm_8x6_avx2_k1_nn(). Thanks to Kiran Varaganti <Kiran.Varaganti@amd.com> for identifying and helping to fix the issue. AMD-Internal: [CPUPL-3352] Change-Id: I02530ab197ed84c96cbad4f7dd56eedca0109c35	2023-05-21 23:13:46 +05:30
eashdash	2c4f032e0f	Fix for lack of BF16 instruction when compiled with GCC-11 GCC-11 and below support AVX512-BF16. However, it doesn't support all the bf16 instructions required. For bf16 downscale APIs, when beta scaling is done, C output elements must be upscaled from BF16 type to Float type for beta scaling operation. For this upscaling operation of bf16 to float, _mm512_cvtpbh_ps is used. This however is not supported by GCC-11 and below (but is supported on GCC 12 onwards) Lack of this instruction support in gcc11, and below leads to compilation issues with this instruction (_mm512_cvtpbh_ps) not being recognized. To fix, this, we use a set of instructions: 1. register containing bf16 type __m256bh a1 2. Convert bf16 to float with shift left ops __m512 float_a1 = (__m512) (_mm512_sllv_epi32 (_mm512_cvtepi16_epi32 ((__m256i) a1), _mm512_set1_epi32 (16))); AMD-Internal: [CPUPL-3454] Change-Id: Ie4a9f04881c59ced088608633774b27f22b4ab8e	2023-05-19 10:15:08 +00:00
eashdash	061a68ff0d	BF16 Downscale and Performance fix for bf16 API This change contains the following: 1. Downscale optimization fix a. Similar to downscale optimizations made for s32 and s16 gemm, the following optimizations are done to improve the downscale performance for BF16 gemm b. The store to temporary float buffer can be avoided when k < KC since intermediate accumulation will not be required for the pc loop (only 1 iteration). The downscaled values (bf16) are written directly to the output C matrix. c. Within the micro-kernel when beta != 0, the bf16 data from the original C output matrix is loaded to a register, converted to float and beta scaling is applied on it at register level. This eliminates the requirement of previous design of copying the bf16 value to the temporary float buffer inside jc loop. 2. Alpha scaling a. Alpha scaling (multiply instruction) by default was resulting in performance regression when k dimension is small and alpha=1 in bf16 micro-kernels. b. Alpha scaling is now only done when alpha != 1. 3. K Fringe optimization a. Previously memcpy was used for K fringe case to load elements from A matrix in the microkernels b. Now, masked stores are used to store the downscaled and non-downscaled outputs without the need to use memcpy functions 4. N LT-16 fringe optimization a. Previously memcpy was used for N LT 16 fringe case in the microkernelsfor storing the downscaled and non-downscaled output. b. Now, masked stores are used to store the downscaled and non-downscaled outputs of BF16 without the need to use memcpy functions 5. Framework updates to avoid unnecessary pack buffer allocation a. The default allocation of the temporary pack buffer is removed and the pack buffer is now only allocated if k > KC. AMD-Internal: [CPUPL-3437] Change-Id: I71ff862e7d250559409a12a3533678c7a7951044	2023-05-18 10:02:56 -04:00
Shubham Sharma	26e120ea25	Fixed diagonal packing for C/Z TRSM small - In C/Z TRSM small, packing in case of unit diagonal is not handled properly. - Diagonal elements are still being read even in case of unit diagonal. - This causes "Conditional jump or move depends on uninitialised value" error during valgrind tests. - To fix this, diagonal elements should not be read in case of unit diagonal. AMD-Internal: [CPUPL-3406] Change-Id: If3d6965299998a83d87f3a032f654fc7f8c43d4e	2023-05-18 07:57:21 -04:00
Harihara Sudhan S	9ee95e171a	Control flow issue reported during static code analysis - Missing break statement will result in unexpected control flow. This function will not launch the threads for the API in question according to the AOCL dynamic logic without the break statement. AMD-Internal: [CPUPL-3436] Change-Id: Ic47d773169c09e84086a27b50cd59dba33529698	2023-05-18 04:53:03 -04:00
mkadavil	1e266bbcbc	LPGEMM framework updates to avoid unnecessary pack buffer allocation. -Currently when any of the downscale API is called, a temporary pack buffer is allocated (with bli_membrk_acquire_m) by each thread. It is used to persist intermediate higher precision output accumulated by the micro-kernel across pc loop when the number of pc iterations is more than 1 (k > KC). The bli_membrk_acquire_m is a thread safe operation and uses locks (pthread_mutex) to ensure thread safe checkout of memory/ block from the memory pool. -However when k < KC, this temporary buffer is not required. But since this pack buffer is allocated by default in downscale API, the overhead from locks affects performance when k < KC, m or n is sufficiently small and the number of threads involved is high. This default allocation is removed and the pack buffer is now only allocated if k > KC. AMD-Internal: [CPUPL-3430] Change-Id: I492586ff4c47bc7480d364efb7af3674e31bd2c1	2023-05-17 19:16:02 +05:30

1 2 3 4 5 ...

2972 Commits