- Partial completion of the compute was happening because BLIS was
unable to launch the required number of threads: the rntm was
returning a thread count greater than the maximum number of
threads that could be launched in the subsequent parallel region.
- Added 'omp_get_num_threads' inside the parallel regions to get the
actual number of threads spawned. Work distribution now happens
based on the actual number of threads launched in that region.
AMD-Internal: [CPUPL-3268]
Change-Id: I086ad4b9b644f966b7bab439e43222396f0c2bf0
Some text files were missing a newline at the end of the file.
One has been added.
Also corrected the file format of windows/tests/inputs.yaml, which
was missed in commit 0f0277e104
AMD-Internal: [CPUPL-2870]
Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549
Source and other files in some directories were a mixture of
Unix and DOS file formats. Convert all relevant files to Unix
format for consistency. Some Windows-specific files remain in
DOS format.
AMD-Internal: [CPUPL-2870]
Change-Id: Ic9a0fddb2dba6dc8bcf0ad9b3cc93774a46caeeb
- In the Zen 4 context, there was a mismatch between the fuse factor
initialized in the block size parameter and the fuse factor of the
corresponding kernel.
AMD-Internal: [SWLCSG-2051]
Change-Id: I65f71532692a1459605abb860b91a2a360bcca5d
1. New LPGEMM types - s8s8s16os16 and s8s8s16os8 - are added.
2. New interface, frame and kernel files are added.
3. Frame and kernel level files added and modified for s8s8s16.
4. The s8s8s16 type involves design changes to 2 operations -
Pack B and Mat Mul.
5. Pack B kernel routines pack the B matrix for s16 FMA and compute the
sum of every column of the B matrix, which is needed to implement the
s8s8s16 operation using the s16 FMA instructions.
6. Mat Mul kernel files compute the GEMM output using s16 FMA.
Here the A matrix elements are converted from int8 to uint8 (s16 FMA
works with A matrix type uint8 only) by adding 128 to
every A matrix element.
7. Post GEMM computation, additional operations are performed on the
accumulated outputs to get the correct results:
Final C = C - ( (sum of column of B matrix) * 128 )
This compensates for the 128 added to every A matrix element.
8. With this change, two new APIs are introduced in LPGEMM -
s8s8s16os16 and s8s8s16os8.
9. All previously added post-ops are also supported on s8s8os16/os8.
AMD-Internal: [CPUPL-3234]
Change-Id: I3cc23e3dcf27f215151dda7c8db29b3a7505f05c
- Softmax is often used as the last activation function in a neural
network: softmax(xi) = exp(xi)/(exp(x0) + exp(x1) + ... + exp(xn)).
This step happens after the final low precision gemm computation,
and it helps to have softmax functionality that can be invoked
as part of the lpgemm workflow. To support this, a new API,
aocl_softmax_f32, is introduced as part of aocl_gemm. This API
computes the element-wise softmax of a matrix/vector of floats. It
invokes ISA-specific vectorized micro-kernels (vectorized only when
incx=1), and a cntx based mechanism (similar to lpgemm_cntx) is used
to dispatch to the appropriate kernel.
AMD-Internal: [CPUPL-3247]
Change-Id: If15880360947435985fa87b6436e475571e4684a
- Added Smart Threading logic for AVX-512 based SGEMM SUP.
- Calculating ic and jc for optimal work distribution to the allocated
threads based on logic similar to Zen3.
- A Zen4 architecture specific Native-to-SUP check has been added to
redirect a few Native inputs to the SUP path, based on the observation
that in a multi-threaded environment some Native cases perform better
as SUP.
- Accordingly, the SUP thresholds, namely BLIS_MT and BLIS_NT, have
been increased from 512 and 200 to 682 and 512, respectively.
- Further optimizations to the work distribution logic will be added
subsequently.
AMD-Internal: [CPUPL-3248]
Change-Id: Ibccbbefef251010ec94bd37ffc86c35b7866a5ca
Details:
- Added Doxyfile, a configuration file in docs directory for generating Doxygen document from source files.
- Currently only the CBLAS interface of the (batched gemm and gemmt)
extension APIs is included.
- Support for the BLAS interface is yet to be added.
- To generate the Doxygen-based document for the extension APIs, use
the command below:
$ doxygen docs/Doxyfile
AMD-Internal: [CPUPL-3188]
Change-Id: I76e70b08f0114a528e86514bcb01d666acc591e8
Incorporate a means of detecting submodels of a microarchitecture,
so that different optimizations e.g. block sizes or kernel choices
can be used. The details are as follows:
- Different models are currently only enabled for zen3 and zen4
architectures (for server parts).
- There is a single enumeration (model_t) for all models for all
architectures, but function bli_check_valid_model_id() should
check the provided model_id against the suitable range within
the enumeration for the provided arch_id.
- To enable the model_id to be used within the cntx setup functions,
checking of a user specified value of BLIS_ARCH_TYPE against
the enabled configurations is delayed to a separate function,
bli_arch_check_id().
- Default selection based on hardware can be overridden using the
BLIS_MODEL_TYPE environment variable. Valid values are:
Genoa, Bergamo, Genoa-X, Milan, Milan-X.
Values are case-insensitive, and -X can also be specified as _X or X.
- Specifying an incorrect value for BLIS_MODEL_TYPE is not an error,
but will result in the default option for that architecture being
selected. This is different to specifying an incorrect value of
BLIS_ARCH_TYPE, which is an error.
- The environment variable BLIS_MODEL_TYPE can be renamed using
the --rename-blis-model-type argument to configure (or cmake
equivalent), in a similar way to renaming BLIS_ARCH_TYPE with
--rename-blis-arch-type.
- Configure option --disable-blis-arch-type will disable both
BLIS_ARCH_TYPE and BLIS_MODEL_TYPE environment variables.
- Added code in bli_cpuid.c to detect L1, L2 and L3 cache sizes,
currently only for AMD cpus. Functions are provided to query
these from other parts of the code, namely:
uint32_t bli_cpuid_query_{l1d,l1i,l2,l3}_cache_size()
AMD-Internal: [CPUPL-3033]
Change-Id: I37a3741abfd59a95e0e905d926c6ede9a0143702
Details:
- Overriding of blocksizes with avx-2 specific ones (6x8) is done
for gemmt/syrk because a near-to-square shaped kernel performs
better than a skewed/rectangular shaped kernel.
- Overriding is done for the S, D and Z datatypes.
AMD-Internal: [CPUPL-3060]
Change-Id: I304ff4264ff735b7c31f7b803b046e1c49c9ad53
- Currently only one eltwise post-op (one of relu/prelu/gelu_tanh/
gelu_erf) is supported in the post-op struct along with bias or
downscale. This setup was sufficient when only activation functions
were supported as eltwise post-ops. But with the introduction of the
clip post-op (a type of non-activation eltwise operation), it has
become necessary to extend the post-ops framework to support multiple
eltwise operations, often used in the form: activation eltwise op +
non-activation eltwise ops. The aocl post-op struct is modified and
the post-op parser is updated to support this use case.
- The lpgemm_bench is updated to support testing/benchmarking of the
multiple eltwise operations use case. The function for accuracy
checking is modified to support correctness testing irrespective of
the order and count of post-ops. Additionally, the help message is
updated to better describe the capabilities of lpgemm_bench.
AMD-Internal: [CPUPL-3244]
Change-Id: If4ce8d7261d32073da8fa4757ed4f2ea0e94249f
Thanks to Moore, Branden <Branden.Moore@amd.com> for identifying the
race condition and suggesting the fix.
Existing design:
- The AOCL progress callback pointer is a global pointer which is
shared across all threads.
Existing design challenges:
- The callback function cannot safely disable the progress mechanism:
another thread may have already checked whether the function pointer
is set, and then re-reads the pointer upon invocation of the callback.
If one thread sets the callback to NULL in between, the other thread
will attempt to call a NULL function pointer, leading to a segfault.
New Design :
- Each thread maintains a local copy of progress pointer
AMD-Internal: [SWLCSG-1971]
Change-Id: I282989805a4a2a8a759a7373b645f3569bf42ed4
Details:
- Added logic to display CMAKE_BUILD_TYPE while configuring
through cmake gui.
- Added logic to set values for BLIS_ENABLE_JRIR_SLAB,
BLIS_ENABLE_JRIR_RR mutually exclusive variables.
AMD-Internal: [SWLCSG-2041, SWLCSG-2042]
Change-Id: I81c96a9941418a0810d554ddc89056ca8420b064
Corrections for spelling and other mistakes in code comments
and doc files.
AMD-Internal: [CPUPL-2870]
Change-Id: Ifbb5df7df2d6312fe73e06ee6d41c00b16c593ce
- Similar to the downscale optimizations made for u8s8s32 gemm, the
following optimizations are made to improve the downscale performance
for u8s8s16 gemm:
a. The store to the temporary s16 buffer can be avoided when k < KC,
since intermediate accumulation will not be required for the pc loop
(only 1 iteration). The downscaled values (s8) are written directly
to the output C matrix.
b. Within the micro-kernel, when beta != 0, the s8 data from the
original C output matrix is loaded to a register, converted to s16
and beta scaling applied on it. The previous design of copying the
s8 value to the s16 temporary buffer inside the jc loop and using
the same in beta scaling is removed.
- Alpha scaling (a multiply instruction) by default was resulting in
performance regression when the k dimension is small and alpha=1 in
s16 micro-kernels. Alpha scaling is now only done when alpha != 1.
AMD-Internal: [CPUPL-3237]
Change-Id: If25f9d1de8b9b8ffbe1bd7bce3b7b0b5094e51ef
- Added documentation to the README on setting the INT64 option.
- Fixed template instantiation in a helper function; updated to use
gtint_t instead of int.
AMD-Internal: [CPUPL-2732]
Change-Id: Ia52407a1ef3fdd06e905c2e3d4aa5befb80e82d6
1. Custom Clip is a post-op which is used to clip the
accumulated GEMM output within a certain range.
2. This post-op is implemented for u8s8s32os32/os8 and
s8s8s32os32/os8 LPGEMM types.
3. Changes are done at the microkernel level for these
2 APIs to support Clip Post-Op
AMD-Internal: [CPUPL-3207]
Change-Id: I8b4da5807de6a93711b0ae9343970c55192f75d4
Correct argument alpha in call to ZDSCAL kernel function in serial
code path. This resolves numerous instances of incorrect results
in ACML LAPACK test programs when BLIS_ARCH_TYPE=generic.
AMD-Internal: [CPUPL-3227]
Change-Id: Ibf5ee79392e80c2d93a0d336a7b0e2568e149f94
- In level-1 kernels, with multi-threading enabled, only part of the
job was getting executed.
- The bug was in bli_thread_vector_partition and occurred only when
the minimum work for a thread is >= 1, i.e., when the number of
threads launched is less than the number of elements and the number
of elements is not a multiple of the number of threads launched.
AMD-Internal: [CPUPL-3231]
Change-Id: Ie20abb93468282cd6ac2372267714fb80c26d7cc
- Added AVX512 functions to the BLAS layer of daxpy, dscal and
ddot.
- Added BLAS exceptions for incx <= 0 to DSCALV
- Added BLIS_KERNELS_ZEN4 macro check to guard AVX512 kernels
as they will not be available in other contexts.
AMD-Internal: [CPUPL-2766][CPUPL-2765][CPUPL-2793][CPUPL-2800]
Change-Id: I68860c2ff6b65624907cc1b590173f0e909bd271
- In gemmt and normf, #ifdef BLIS_KERNELS_* is added
to make sure only compiled kernels are used.
- In bal_copy and bla_swap, missing '\' is added.
AMD-Internal: [CPUPL-2870]
Change-Id: I83452dff761f60db6957f557321ce210ab72c037
Details:
- Added a new function for choosing between SUP and
native implementation for a given size.
- This function pointer is stored in cntx for zen4 config.
- Divided the total combinations of sizes into 3 categories:
- One dimension is small
- Two dimensions are small
- All dimensions are small
- Added different threshold conditions for each of the
categories.
AMD-Internal: [CPUPL-2755]
Change-Id: Iae4bf96bb7c9bf9f68fd909fb757d7fe13bc6caf
- Currently in aocl_gemm, gelu (both tanh and erf based) computation is
only supported as a post-op within a low precision gemm API call (done
at the micro-kernel level). However, gelu computation alone, without
gemm, is required in certain cases for users of aocl_gemm.
- To support this, two new APIs - aocl_gelu_tanh_f32 and
aocl_gelu_erf_f32 - are introduced as part of aocl_gemm. These APIs
compute the element-wise gelu_tanh and gelu_erf, respectively, of a
matrix/vector of floats. Both APIs invoke ISA-specific vectorized
micro-kernels (vectorized only when incx=1), and a cntx based
mechanism (similar to lpgemm_cntx) is used to dispatch to the
appropriate kernel.
AMD-Internal: [CPUPL-3218]
Change-Id: Ifebbaf5566d7462288a9a67f479104268b0cc704
1. Custom Clip is an element-wise post-op which is used to
clip the accumulated GEMM output within a certain range.
2. The Clip Post-Op is used in downscaled and non-downscaled
LPGEMM APIs and SGEMM.
3. Changes are done at frame and microkernel level to implement
this post-op.
4. Different versions are implemented - AVX-512, AVX-2, SSE-2
to enable custom clipping for various LPGEMM types and SGEMM
AMD-Internal: [CPUPL-3207]
Change-Id: I71c60be69e5a0dc47ca9336d58181c097b9aa0c6
- Set the variables to zero to avoid the compiler warning
(-Wmaybe-uninitialized) in bli_dgemm_ref_k1.c,
bli_gemm_small.c, bli_trsm_small.c, bli_zgemm_ref_k1.c and
bli_trsm_small_AVX512.c
- Changed the datatype from dim_t to siz_t for i, k and j
in bli_hemv_unf_var1_amd.c and bli_hemv_unf_var3_amd.c to
avoid the compiler warning (-Waggressive-loop-optimizations)
AMD-Internal: [CPUPL-2870]
Change-Id: Ib2bc050fa47cb8a280d719283ab4539c70e19d03
Updated the CMakeLists.txt to check whether the specified
libraries are present, and abort the cmake build if not.
AMD-Internal: [CPUPL-2732]
Change-Id: I90115217c228430095aa53a82dc26d16935b320f
Threading related changes
--------------------------
- Created the function bli_nthreads_l1 that dispatches the AOCL dynamic
logic for an L1 function based on the kernel ID and input datatypes.
- bli_nthreads_l1 gets the number of threads to be launched from the
rntm variable.
- Added aocl_'ker?'_dynamic function for DAXPYV, DSCALV, ZDSCALV and
DDOTV. This function contains the AOCL dynamic logic for the
respective kernels.
- Added handling for cases when number of elements (n) is less than
number of threads spawned (nt) in AOCL dynamic.
- Added function bli_thread_vector_partition that calculates the
amount of work the calling thread is supposed to perform on a
vector.
Interface changes
-----------------
- In the BLIS implementation layer of DSCALV, ZDSCALV and AXPYV, added
logic to pick the kernel based on architecture ID and removed the
AVX2 flag check.
- Modified the function signature of ZDSCALV. Alpha is passed as
dcomplex and only the real part of the alpha passed is used inside
the kernel. The change was done to facilitate kernel dispatch based
on arch ID.
- Added an n <= 0 BLAS exception in the BLAS layer of DAXPYV and
DDOTV. Without this, multithreaded code might crash because of the
minimum work calculation.
Misc
-----
- Removed unused variables from ZSCAL2V and AXPYV kernels.
AMD-Internal: [CPUPL-3095]
Change-Id: I4fc7ef53d21f2d86846e86d88ed853deb8fe59e9
- Added AVX512 based double and float AXPYV which will be used in
Zen4 context.
- Added n <= 0 check and alpha == 0 check to the BLAS layer of
SAXPY.
- Modified BLAS framework of float AXPYV to remove flag check and
pick kernels based on architecture ID.
- AVX512 kernel is disabled for other Zen configurations using
BLIS_KERNELS_ZEN4 macro.
AMD-Internal: [CPUPL-2793]
Change-Id: Ie6a0976c2cfcf81ae5125f5f9aad14477d4ebbd1
- The following sequence is used to decide the number of threads
for a given input:
- The total input range is divided into 3 categories (m < n, m > n,
m = n).
- Each category is further divided into 4 sub-categories (K <= 32,
K <= 64, K <= 128, K > 128).
- Based on the input range, the number of threads is decided.
AMD-Internal: [CPUPL-2966]
Change-Id: I0b04e9de1615e87acb189b228544afac74664f02
- As part of an earlier optimization, the memcpy function calls in the
k fringe ((k % 4) != 0 case, to utilize the vpdpbusd instruction) and
n fringe (n < 16 - beta scale and C store) were replaced with copy
macros specifically optimized for fewer than 4 and 16 elements each.
However, upon further analysis it was observed that masked
load/broadcast and masked store performed better on average than the
copy macros. The copy macros contained more if conditions, which
resulted in more branching and thus perf variations. It was also
noted that code generation varied a lot across compilers when using
the copy macros, due to the extra conditional code.
- As part of this change, the copy macros are completely replaced with
masked load/broadcast/store. Performance was observed to be better
and less prone to variations for the k fringe and n fringe (< 16)
cases.
AMD-Internal: [CPUPL-3173]
Change-Id: I73e6e65302ecf02e1397541b4a32b2a536f19503
- 8x8 kernels are used for DTRSM SMALL.
- Matrix A (a10) is packed for GEMM operations.
- The packed matrix A will be re-used across all the column blocks
along the N-dimension.
- The diagonal elements of the A matrix are packed (a11) for
TRSM operations.
- Implemented fringe cases with the below block sizes:
8x8, 8x4, 8x3, 8x2, 8x1
4x8, 4x4, 4x3, 4x2, 4x1
3x8, 3x4, 3x3, 3x2, 3x1
2x8, 2x4, 2x3, 2x2, 2x1
1x8, 1x4, 1x3, 1x2, 1x1
AMD-Internal: [CPUPL-2745]
Change-Id: I5bb57501f6d3783eb654e375d63901467dd14734
Details:
- The macros BLIS_ENABLE_BLAS and BLIS_ENABLE_CBLAS were not
defined as part of the blis.h header, so they could not be used
in the cblas.h header file, leaving a few CBLAS enums undefined
in the Windows build.
- Defined the above two macros based on the configuration in blis.h.
- This change fixes the Windows build.
AMD-Internal: [SWLCSG-2041, SWLCSG-2042]
Change-Id: Ibd854108d7e4cbdcaadc2b0f2843bbeb2ab789e7
- Functions to convert to cblas enums from char.
- Functions to print matrix and vector elements.
- Functions to set matrix and vector elements with
the given value.
AMD-Internal: [CPUPL-2732]
Change-Id: I1046b9578c8456e89eddba4a4e8577016b9361ca
- Added AVX512 based double and float DOTV which will be used in
Zen4 context.
- Added n <= 0 check to the BLAS layer of SDOTV.
- Modified BLAS framework of float DOTV to remove flag check and
pick kernels based on architecture ID.
- AVX512 kernel is disabled for other Zen configurations using
BLIS_KERNELS_ZEN4 macro.
AMD-Internal: [CPUPL-2800]
Change-Id: I550fbcbb17d6d887b9ecbea23237dc806b208702
- Fixing thresholds to be more appropriate.
- Updating the way random entries of A and B are generated so that A is diagonally dominant and the algorithm doesn't diverge.
AMD-Internal: [CPUPL-2732]
Change-Id: I6d5691d744ecc623f66c45e94461bd88625d7179
- Added AVX512 based double and float SCALV which will be used in
Zen4 context.
- Added incx <= 0 check and alpha == 1 check to the BLAS layer of
SSCAL.
- Modified BLAS framework of float SCAL to remove flag check and
pick kernels based on architecture ID.
- AVX512 kernel is disabled for other Zen configurations using
BLIS_KERNELS_ZEN4 macro.
AMD-Internal: [CPUPL-2766],[CPUPL-2765]
Change-Id: I4cdd93c9adbfbf8f7632730b8606ddcf70edd1dc
- Added C++ interfaces for all four types of AXPBY.
- The newly added interfaces will call the already present
cblas interfaces.
AMD-Internal: [CPUPL-3038]
Change-Id: I59421fa0a8958b7cfb5c73c337b2dbad0f134705
- Reverted the SUP blocksizes and kernels to use AVX2 SUP kernels for
SGEMM. This can be updated once GEMMT specific optimization are added
for AVX-512.
- Updated 'bli_zen4_override_gemm_blkszs()' in zen4 context to override
blocksize and kernels for SGEMM SUP to enable AVX-512 kernels for
SGEMM operation.
AMD-Internal: [CPUPL-3060]
Change-Id: Ic9b3037363b6e5b59e5035c81651c97ce95d6d9a
- The tools used for code coverage are Gcov and Lcov.
- We need to use the macros specified by gcov during
compilation of blis and gtestsuite.
- Lcov will generate coverage reports in HTML format.
AMD-Internal: [CPUPL-2732]
Change-Id: I17b30b4a322b8771f2d6a4ba28986cf0ccf3fba6
- On Windows, the 24xk kernel is compiled without the avx512 flag,
which causes out-of-bounds writes for DTRSM.
- To fix this, the avx512 flag has been added to the CMakeLists.txt
file for the 24xk kernel.
AMD-Internal: [CPUPL-3186]
Change-Id: I0314dea88302fc4964a303853a4b9b719ecd8064
- Added a header with correct default values to be used in tests.
- Updated README to include information on how to test for wrong parameters and some explanation on how lda increments work.
AMD-Internal: [CPUPL-2732]
Change-Id: I4f540d46013ffe91b4acb30da2b437251c09d3bc
Where a processor has not been targeted for optimization, the slow
generic code path may be selected by default. Another code path
could perform much better, even if not specifically optimized for
this processor.
Changes to enable this:
- Always call bli_cpuid_query_id() to get actual hardware info, even
if the user is overriding the choice at build time or by setting
BLIS_ARCH_TYPE.
- Use bli_cpuid_query_id() to gather availability of AVX2 and AVX512
instructions.
- Include AVX-512 fallback test to select zen4 code path on future
AMD processors (that don't yet have a specific codepath).
- Also use AVX-512 and AVX2 tests on Intel processors to select
zen4 or zen3 code paths over the generic code path. If both zen
and skx/haswell code paths are enabled, the zen code path will be
given preference as zen paths have additional optimizations that
should, in general, benefit Intel processors too.
AMD-Internal: [CPUPL-3031]
Change-Id: Ib7d0ebdb02fec872f9443a1d20070026f2020516
Modify code to correct some warning messages from GCC 12.2 or
AOCC 4.0:
- Increase size of nbuf in blastest/f2c/endfile.c
- Remove unused variables in kernels/zen/1/bli_scal2v_zen_int.c
and kernels/zen/1/bli_axpyv_zen_int10.c
- Remove extraneous parentheses in frame/compat/bla_trsm_amd.c
and kernels/zen4/3/bli_zgemm_zen4_asm_12x4.c
- Add __attribute__ ((unused)) to several variables in
frame/1m/packm/bli_packm_struc_cxk.c and
frame/1m/packm/bli_packm_struc_cxk_md.c
AMD-Internal: [CPUPL-2870]
Change-Id: I595e46f0a3d737beb393c3ab531717565220b10d
- Implemented 12x4m column preferential SUP kernels (main and fringe
cases). The main kernel dimension is 12x4, and the associated fringe
kernel dimensions are: 12x3m, 12x2m, 12x1m
8x4, 8x3, 8x2, 8x1
4x4, 4x3, 4x2, 4x1
2x4, 2x3, 2x2, 2x1.
- Included in-register transposition support for C, thus extending
the storage scheme support to CCC, CCR, RCC and RCR inside the
milli-kernel.
- Integrated conditional packing of A into the SUP front end for the
dcomplex datatype. This redirects the RRC and CRC storage schemes
onto the preceding set of SUP kernels through storage scheme
transformation (RCC and CCC respectively).
- Updated the zen4 context file with the new set of SUP kernels, to
get enabled appropriately. Furthermore, the context file was updated
with the AVX-2 dotxv signatures for dcomplex datatype. This redirects
the fringe cases of type 1x? to the pre-existing AVX-2 GEMV routines.
- Added C prefetching onto L2-cache, and an unroll factor of 4 for the
k loop in all the kernels.
- Work is in progress to include conjugate support and input spectrum
extension for the AVX-512 SUP kernels. The current thresholds in the
zen4 context are the same as the zen3 thresholds for ZGEMM SUP.
AMD-Internal: [CPUPL-3122]
Change-Id: If40bc4409c6eb188765329508cf1f24c0eb12d1e
Fixes for a couple of issues:
- Modify alpha argument in kernel call in zscal_blis_impl to
avoid compiler warning message about discarding 'const'
qualifier from pointer target type.
- When BLIS_ARCH_TYPE=generic, call to bli_cntx_get_l1v_ker_dt
in ref_kernels/1/bli_scalv_ref.c for ZSCAL was causing a
segmentation fault. This was because the cntx was NULL on
entry to this function. Corrected by having zscal_blis_impl
pass the cntx it has initialized for non-Zen codepaths.
This change has also been applied to idamax_blis_impl for
consistency.
AMD-Internal: [CPUPL-2773]
Change-Id: Ib02e4c1a2a7bf30c208732241d4959f7a2696179
- Fix in README.md.
- Updating abs overload for scomplex and dcomplex to avoid overflow by using std::abs.
- Updating comparators to take into account NaNs and Infs when measuring error.
AMD-Internal: [CPUPL-2732]
Change-Id: I8c12bacd9d63b2e914d0a79f337f7525dc16b733
cmake-generated files and executables are cleaned within the
build directory by the "make distclean" command.
Change-Id: I4fd5193e92958122ff10ecc634b42096f3b3716e
- Fixed segmentation fault that was seen on non zen and non avx2
machines.
- cntx object was not passed to the invoked kernel causing a seg
fault.
AMD-Internal: [CPUPL-3167]
Change-Id: I2640d3f905e78398935cf6ed667b04a6418baa5d
BLAS1 and BLAS2 routines may not immediately call bli_init_auto, as
the full cntx and other global data structures may not be required
for all code paths. This may cause a problem if the user sets
BLIS_ARCH_TYPE, as BLIS needs to check the requested value against
the available options configured in the cntx. Solution: add a
separate pthread_once call to run bli_gks_init().
AMD-Internal: [CPUPL-3031]
Change-Id: Icd73a8dd161b34b23cc336623d675248f28ed23f