amd/blis - blis - Public git mirror

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-05-11 17:50:00 +00:00

Author	SHA1	Message	Date
Shubham Sharma	8adef27aca	Optimization of DGEMMT SUP kernel for beta zero cases. Details: 1. In kernels for non-transpose variants, changes are made to optimize the cases of beta zero. 2. Validated the changes with BLIS Testsuite, GTestSuite(Functionality, Valgrind, Integer Tests) and Netlib Tests. 3. Fixed warnings during the build process. AMD-Internal: [CPUPL-2341] Change-Id: I8bb53ad619eb2413c999fe18eafd67c75fe1f83a	2022-08-18 08:05:58 -04:00
Edward Smyth	7f322da01d	BLIS: BLAS3 quick return functionality Implement netlib BLAS style quick return functionality for when no work is required. Similar functionality was already in HERK and HER2K routines. AMD copyrights updated. AMD-Internal: [CPUPL-2373] Change-Id: I0ebe9d76465b0e48b2ff5c2f1cc2a75763fe187c	2022-08-18 04:09:56 -04:00
mkadavil	171fb7358d	SGEMM Optimization -sup GEMM - 2 variants var2m (block-panel) and var1n (panel-block). We added decision logic to choose between var1n and var2m for single thread SGEMM.var1n is favorable option when "n" is very large compared to "m". -Also fixed a bug related to fetching "MR" "NR" values in bli_gemmsup_int(). We replaced "bli_cntx_get_blksz_def_dt()(used for Native)" with "bli_cntx_get_l3_sup_blksz_def_dt()". AMD-Internal: [CPUPL-2406] Change-Id: If36529015b1c5f8f87eb40c05ebcf433c471d4d5	2022-08-18 01:39:42 -04:00
Harsh Dave	46e7727ea8	DGEMM Improvements - Incase of DGEMM when m, n and leading dimensions are large packing of A and B matrixes are required for optimal performance. - Modified decision logic to choose between sup vs native, now apart from matrix dimensions, we also incorporate matrix leading dimensions into this decision. AMD-Internal: [CPUPL-2366] Change-Id: I255db5f7049d783e22d7c912edf8bbf023e32ed8	2022-08-18 01:14:40 -04:00
Nallani Bhaskar	39196d163e	Enable packing of A & B dynamically in dgemmsup. Details: - When work distributed for each thread is larger than caches, it is advisable to perform packing of B for sup dgemm. - Work distribution per thread is calculated based on the values of jc_nt and ic_nt. - For RRC and CRC cases we want to avoid rd kernels which are not efficient in performance compared to rv kernels. Therefore we perform packing of A as well so that rv kernels are invoked for these cases. - These changes result in improved DGEMM performance. - Dynamic packing is done using the API "bli_rntm_set_pack_b( 1, rntm )" Change-Id: I8344520b4a2591e57518bb54183a15957f60f94b	2022-08-17 12:36:15 -04:00
Harsh Dave	e2e1dadee1	DGEMM Improvements - We prefetch next panel while packing 8xk panel. - Modified prefetch offsets for dgemm native and dgemm_small kernel. AMD-Internal: [CPUPL-2366] Change-Id: Ife609e789c8b87169c73bb0a30d6f1af20fb30ed	2022-08-16 08:07:30 -04:00
satish kumar nuggu	88e44c64e3	Fixed Memory Leaks in TRSM 1. Fixed the memory leaks in corner cases which caused due to extra loads in all datatypes(s,d,c,z). 2. In remainder cases instead of loading required number of elements, loaded extra elements which lead to memory leaks. Fixed memory leaks by restricting number of loads to required number of elements. AMD-Internal: [CPUPL-2280] Change-Id: Ia49a02565e01d5ed05e98090b7773a444587cd8a	2022-08-16 05:08:40 -04:00
mkadavil	6fbdfc3cf2	Low precision gemm refactoring and bug fixes. -The micro-kernel function signatures follow a common pattern. These functions can be represented as an instantiation of a MACRO as is done in BLIS, and thus the number of micro-kernel header files can be brought down. A new single header file containing all the MACRO definitions with the instantiation is added, and the existing unnecessary header files are removed. -The bias addition in micro-kernel for n remaining < 16 reads the bias array assuming it contains 16 elements. This can result in seg-faults, since out of bound memory is accessed. It is fixed by copying required elements to an intermediate buffer and using that buffer for loading. -Input matrix storage type parameter is added to lpgemm APIs. It can be either row or column major, denoted by r and c respectively. Currently only row major input matrices are supported. -Bug fix in s16 fringe micro-kernel to use correct offset while storing output. AMD-Internal: [CPUPL-2386] Change-Id: Idfa23e69d54ad7e06a67b1e36a5b5558fbff03a3	2022-08-14 17:39:00 +05:30
Shubham Sharma	f5ef30a44a	Fix in DGEMMT SUP kernel Details: 1. Due to error in C output buffer address computation in kernel bli_dgemmsup_rv_haswell_asm_6x8m_6x8_L, invalid memory is being accessed. This is causing seg fault in libflame netlib testing. 2. Validated the fix with libflame netlib testing. AMD-Internal: [CPUPL-2341] Change-Id: I9ca0cf09cf2d177ade73f840054b5028eae3a0ed	2022-08-12 21:03:36 +05:30
Arnav Sharma	a226e54421	AVX512 based SGEMM Optimizations - Updated with optimal cache-blocking sizes for MC, KC and NC for AVX512 Native SGEMM kernel. AMD-Internal: [CPUPL-2385] Change-Id: I1feae5ac79e960c6b26df24756d460243820b797	2022-08-12 02:33:39 -04:00
Dipal M. Zambare	c85bbfdb50	Updated BLIS version string format - Updated version string to match the recommended format “AOCL-BLIS 3.2.1 Build 20220727”. - Fixed issues with include paths which was preventing compile time version sting definition passing via build commands. - Removed version string determination based on git tag using ‘git describe’, version string will always be taken from the version file. AMD-Internal: [CPUPL-2324] Change-Id: Idc7edf1211f66d348ec3b5b43f2507c2b810f088	2022-08-12 05:53:35 +00:00
Shubham Sharma	4bca7f6f4a	DGEMMT optimizations Details: 1. For lower and upper, non-transpose variants of gemmt, new kernels are developed and optimized to compute only the required outputs in the diagonal blocks. 2. In the previous implementation, all the 48 outputs of the given 6x8 block of C matrix are computed and stored into a temporary buffer. Later,the required elements are copied into the final C output buffer. 3. Changes are made to compute only the required outputs of the 6x8 block of C matrix and directly stored in the final C output buffer. 4. With this optimization, we are avoiding copy operation and also reducing the number of computations. 5. Kernels specific to compute Lower and Upper Variant diagonal outputs have been added. 6. SUP Framework changes to integrate the new kernels have been added. 7. These kernels are part of the SUP framework. AMD-Internal: [CPUPL-2341] Change-Id: I0ec8f24a0fb19d9b1ef7254732b8e09f06e1486a	2022-08-11 06:16:33 -04:00
Mangala.V	8504ef013d	Optimisation of DTRSM and ZTRSM 1. Extract instruction replaced with cast when accessing first 128bit, as cast inst needs no cycle but extract takes few cycles 2. Added prefetch of A buffer when computing gemm operation 3. Added prefetch of C11 buffer before TRSM operation, with offset of 7 to cs_c With above changes performance improvements observed in case of Single thread Change-Id: Id377c490ddac8b06384acfa9a6d89dbe11bbc7be	2022-08-11 01:39:40 -04:00
Edward Smyth	737e08cd7a	BLIS: Improve architecture selection at runtime Enable meaningful names as options for BLIS_ARCH_TYPE environment variable. For example, BLIS_ARCH_TYPE=zen4 or BLIS_ARCH_TYPE='ZEN4' or BLIS_ARCH_TYPE=6 will select the same code path (in this release). The meaningful names are not case sensitive. This implements change 1 in the Jira ticket below. Following review comments: 1. Use names from arch_t enum in function bli_env_get_var_arch_type() rather than directly using numbers. 2. AMD copyrights updated. AMD-Internal: [CPUPL-2235] Change-Id: I8cfd43d34765d5e8c7e35680d18825d9934753ad	2022-08-10 08:26:49 -04:00
Harihara Sudhan S	d1eaf65a26	Post-Ops for u8s8s16os16 Functionality - Post-ops is an operation performed on every element of the output matrix after GEMM operation is completed. - Post-ops relu and bias added to all the compute kernels of u8s8s16os16 - Post-ops are done on the value loaded into the register to avoid reloading of C matrix elements - Minor bug fixes in openmp thread decorator of lpgemm - Added test cases to lpgemm bench input file AMD-Internal: [CPUPL-2171] Change-Id: If49f763fdfac19749f6665c172348691165d8631	2022-08-09 14:52:41 +05:30
Harihara Sudhan S	60de0a1856	Multithreading and support for unpacked B matrix in u8s8s16os16 Fucntionality - When the B matrix is not reordered before the u8s8s16os16 compute kernel call packing of B matrix is done as part of the five loop algorithm. The state of B matrix (packed or unpacked) is given as an user input. - Packing of B matrix is done as part of the five loop compute. - Temprorary buffer for pack B is allocated in the five loop algorithm - Multithreading for computation kernel - Configuration constants for u8s8s16os16 are part of the lpgemm config AMD-Internal: [CPUPL-2171] Change-Id: I22b4f0ec7fc29a2add4be0cff7d75f92dd3e60b8	2022-08-05 19:28:37 +05:30
mkadavil	828d3cd3d3	Post operations support for low precision gemm. - Low precision gemm is often used in ML/DNN workloads and is used in conjunction with pre and post operations. Performing gemm and ops together at the micro kernel level results in better overall performance due to cache/register reuse of output matrix. The provision for defining the post-operations and invoking the micro-kernel with it from the framework is added as part of this change. This includes adding new data structures/functions to define the post-ops to be applied and an extensible template using which new post-ops can easily be integrated. As for the post-operations, RELU and Bias Add for u8s8s32 is implemented in this first cut. - aocl_gemm bench modifications to test/benchmark RELU and Bias Add. AMD-Internal: [CPUPL-2316] Change-Id: Iad5fe9e54965bb52d5381ae459a69800946c7d18	2022-08-05 11:53:05 +05:30
Harihara Sudhan S	e5d4fc2a70	Added low precision GEMM (u8s8s16os16) Feature Addition : Added low precision GEMM to addon. The kernel takes unsigned int8 and signed int8 as inputs and performs GEMM operation. The intermediate accumulation and output are in signed int16. - The compute kernel will perform computation only if B matrix reordered to suit the usage of AVX2 instruction vpmaddubsw. - Kernel for packing the B matrix is provided. - LPGEMM bench code was modified to test the performance and accuracy of the new variant. AMD-Internal: [CPUPL-2171] Change-Id: Id9a6d90b79f4bf82fb2e2f3093974dbf37275f9b	2022-08-02 02:20:00 -04:00
Devin Matthews	76fbf1233d	Add vzeroupper to Haswell microkernels. (#524 ) Details: - Added vzeroupper instruction to the end of all 'gemm' and 'gemmtrsm' microkernels so as to avoid a performance penalty when mixing AVX and SSE instructions. These vzeroupper instructions were once part of the haswell kernels, but were inadvertently removed during a source code shuffle some time ago when we were managing duplicate 'haswell' and 'zen' kernel sets. Thanks to Devin Matthews for tracking this down and re-inserting the missing instructions. Change-Id: I418fea9fed27ba3ad7d395cf96d1be507955d8e9	2022-08-01 09:29:04 +05:30
Field G. Van Zee	2a81437bd8	Fixed bugs in cpackm kernels, gemmlike code. Details: - Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the kappa scalar was incorrectly loaded at an offset of 8 bytes (instead of 4 bytes) from the real component. This was almost certainly a copy- paste bug carried over from the corresonding zpackm kernels. Thanks to Devin Matthews for bringing this to my attention. - Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and bls_gemm_bp_var2.c that initializes the elements of the temporary microtile to zero. (This bug was never observed in output but rather noticed analytically. It probably would have also manifested as intermittent failures, this time involving edge cases.) - Minor commented-out/disabled changes to testsuite/src/test_gemm.c relating to debugging. Change-Id: I899e20df203806717fb5270b5f3dd0bf1f685011	2022-08-01 09:11:58 +05:30
Minh Quan Ho	6d4d6a7514	Alloc at least 1 elem in pool_t block_ptrs. (#560 ) Details: - Previously, the block_ptrs field of the pool_t was allowed to be initialized as any unsigned integer, including 0. However, a length of 0 could be problematic given that malloc(0) is undefined and therefore variable across implementations. As a safety measure, we check for block_ptrs array lengths of 0 and, in that case, increase them to 1. - Co-authored-by: Minh Quan Ho <minh-quan.ho@kalray.eu> Change-Id: I1e885d887aaba5e73df091ef52e6c327fd6418de	2022-08-01 08:45:00 +05:30
Minh Quan Ho	3c01fcb9fc	Fix insufficient pool-growing logic in bli_pool.c. (#559 ) Details: - The current mechanism for growing a pool_t doubles the length of the block_ptrs array every time the array length needs to be increased due to new blocks being added. However, that logic did not take in account the new total number of blocks, and the fact that the caller may be requesting more blocks that would fit even after doubling the current length of block_ptrs. The code comments now contain two illustrating examples that show why, even after doubling, we must always have at least enough room to fit all of the old blocks plus the newly requested blocks. - This commit also happens to fix a memory corruption issue that stems from growing any pool_t that is initialized with a block_ptrs length of 0. (Previously, the memory pool for packed buffers of C was initialized with a block_ptrs length of 0, but because it is unused this bug did not manifest by default.) - Co-authored-by: Minh Quan Ho <minh-quan.ho@kalray.eu> Change-Id: Ie4963c56e03cbc197d26e29f2def6494f0a6046d	2022-08-01 08:19:30 +05:30
Devin Matthews	3d655a951b	Fix data race in testsuite. Change-Id: I7704037bad0f7485e7b352de68c2c4535d364226	2022-08-01 07:49:19 +05:30
Devin Matthews	9495401b73	Fix more copy-paste errors in the haswell gemmsup code. Fixes #486. Change-Id: I568386b5d67a698ea9c0b6b17f133df86c2894bd	2022-07-31 21:36:21 +05:30
Devin Matthews	ea163fc23b	Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell. The fix is to use the same (valid) source register twice in the horizontal addition. Change-Id: I96ed39e289aaeeb44be9117074b32bd8d4c19de6	2022-07-31 21:15:28 +05:30
Field G. Van Zee	faff30b46a	Fixed out-of-bounds bug in sup s6x16m haswell kernel. Details: - Fixed another out-of-bounds read access bug in the haswell sup assembly kernels. This bug is similar to the one fixed in `17b0caa` and affects bli_sgemmsup_rv_haswell_asm_6x2m(). Thanks to Madeesh Kannan for reporting this bug (and a suitable fix) in #635. - CREDITS file update. Change-Id: I10ccf4d4f471d93e8c8cc4df422c686438fb04e9	2022-07-31 21:10:58 +05:30
Field G. Van Zee	4b1663213c	Fixed out-of-bounds read in haswell gemmsup kernels. Details: - Fixed memory access bugs in the bli_sgemmsup_rv_haswell_asm_Mx2() kernels, where M = {1,2,3,4,5,6}. The bugs were caused by loading four single-precision elements of C, via instructions such as: vfmadd231ps(mem(rcx, 0*32), xmm3, xmm4) in situations where only two elements are guaranteed to exist. (These bugs may not have manifested in earlier tests due to the leading dimension alignment that BLIS employs by default.) The issue was fixed by replacing lines like the one above with: vmovsd(mem(rcx), xmm0) vfmadd231ps(xmm0, xmm3, xmm4) Thus, we use vmovsd to explicitly load only two elements of C into registers, and then operate on those values using register addressing. Thanks to Daniël de Kok for reporting these bugs in #635, and to Bhaskar Nallani for proposing the fix). - CREDITS file update. Change-Id: Ib525c36bcbf20b2bbbe380da3d74d142b338fe9b	2022-07-31 21:06:08 +05:30
Nallani Bhaskar	1d31386c02	Fixed few out of bound memory reads in sgemmsup kernels Details: Fixed memory access bugs in the bli_sgemmsup_rd_zen_asm_s1x16() kernel. The bugs were caused by loading four single-precision elements of C, via instructions such as: vfmadd231ps(mem(rcx, 032), ymm3, ymm4) or vfmadd231ps(mem(rcx, 032), xmm3, xmm4) in situations where only two elements are guaranteed to exist. (These bugs may not have manifested in earlier tests due to the leading dimension alignment that BLIS employs by default.) The issue was fixed by replacing lines like the one above with: vmovsd(mem(rcx), xmm0) vfmadd231ps(xmm0, xmm3, xmm4) Thus, we use vmovsd to explicitly load only two elements of C into registers, and then operate on those values using register addressing. AMD_CPUPLID: CPUPL-2279 Change-Id: Ic39290d651f5218b2e548351a87ac5e4b5b79c68	2022-07-29 09:09:12 -04:00
Chandrashekara K R	fde812015f	Updated blis library version from 4.0 to 3.2.1 AMD-Internal: [CPUPL-2322] Change-Id: I3a6a61543dd2754e2590d7f5f22442c9fdeaee95	2022-07-29 15:55:10 +05:30
Mangala V	6c1acc74c8	ZGEMM optimizations -- Conditionally packing of B matrix is enabled in zgemmsup path which is performing better when B matrix is large -- Incorporated decision logic to choose between zgemm_small vs zgemm sup based on matrix dimensions "m, n and k". -- Calling of ZGEMV when matrix dimension m or n = 1. Very good performance improvement is observed. Change-Id: I7c64020f4f78a6a51617b184cc88076213b5527d	2022-07-28 14:55:24 +05:30
Vignesh Balasubramanian	808d79a610	Implemented efficient ZGEMM algorithm when k=1 Problem statement : To improve the performance of the zgemm kernel for dealing with input sizes with k=1 by fine tuning its previous implementation. In the previous implementation, usage of SIMD parallelism along m and n dimensions instead of the k dimension proved to provide a better performance to the zgemm kernel. This code was subjected to further improvements along the following lines: - Cases to deal with alpha=0 and beta!=0 (i.e. just scaling of C) were handled at the beginning separately, using the bli_zscalm api. - Register blocking was further improved, resulting in the kernel size to increase from 4x5 to 4x6. - Prefetching was added to the code, by empirically finding out a suitable value to be added to the pointer. Overall, it provided a mild improvement to the performance. - Conditional statements were removed from the kernel loop, and a logic was deduced to allow such removal without affecting the output. The performance improvement of this single threaded implementation also proved to compete with that of the default implementation for multiple threads, as long as m and n are under 128. An improvement to this patch would be to find out a suitable feature which would establish a relationship between the number of threads and the input size constraints, thereby providing a unique size constraint for different number of threads. AMD-Internal: [CPUPL-2236] Change-Id: I3d401c8fd78bec80ce62eef390fa85e6287df847	2022-07-28 02:09:45 -04:00
Kiran Varaganti	6054b888fb	Fixed Bug in bench_trsm.c When bli_trsm() API is called, we make sure the "side" argument is "side_t" and not f77_char and argument is passed by value and not by its address. Change-Id: I5a616eb054c034be2d67640b8ab3b9615706a8c9	2022-07-25 15:38:30 +00:00
Kiran Varaganti	eff436c653	Bug Fix to replace vzeroall Fixed syntax in AVX512 dgemm native kernel. zen4 configuration follows Intel ASM syntax whereas other AMD configs follow AT&T ASM syntax. Bug was introduced due to following AT&T syntax in AVX512 dgemm kernel. In this commit we changed the syntax to Intel ASM format. src and dst operands are interchanged. Change-Id: Ie61dc7c5e8309b79437d471331318f3104bcd447	2022-07-22 03:42:17 -04:00
Dipal M Zambare	5d617429f4	Enabled znver4 support for GCC version >= 12 - Updated zen4 configuration to add -march=znver4 flag in the compiler options if the gcc version is above or equal to 12 AMD-Internal: [CPUPL-1937] Change-Id: Ic11470b92f71e49ee193a3a5406cf6045d66bd2f	2022-07-22 12:46:15 +05:30
mkadavil	f63e699c08	Fix for segmentation fault in low precision gemm. - Low precision gemm sets thread meta data (lpgemm_thrinfo_t) to NULL when compiled without open mp threading support. Subsequently the code is executed as if it is single-threaded. However, when B matrix needs to be packed, communicators are required (irrespective of single or multi-threaded), and the code accesses lpgemm_thrinfo_t for the same without NULL check. This results in seg fault. For the fix, a non-open mp thread decorator layer is added, which creates a placeholder lpgemm_thrinfo_t object with a communicator before invoking the 5 loop algorithm. This object will be used for packing. - Makefile for compilation of aocl_gemm bench. AMD-Internal: [CPUPL-2304] Change-Id: Id505235c8421792240b84f93942ca62dac78f3dc	2022-07-21 11:51:40 +05:30
Kiran Varaganti	86134c7278	Replaced vzeroall Replaced vzeroall instruction with vxorpd and vmovapd for dgemm kernels -both AVX2 and AVX512. vzeroall is expensive instruction and replaced it with faster version of zeroing all registers. vzeroupper() instruction is also added at the end of AVX2 kernels to avoid any AVX2/SSE transition penalities. Kindly note only the main kernels are modified. Change-Id: Ieb9bc629db01f0f94dd0e8e55550940d3d7eb2a4	2022-07-20 01:16:59 -04:00
Arnav Sharma	4f96bb712e	AOCL Dynamic Optimization for DGEMMT - Fine-tuned the thread allocation logic for parallelizing DGEMMT for the cases where n <= 220. This results in performance improvement in multi-threaded DGEMMT for small values of n. AMD-Internal: [CPUPL-2215] Change-Id: I2654bc64d2dc43c2db911e0c9175755be3aa8ba5	2022-07-18 06:52:48 -04:00
Vignesh Balasubramanian	2ad25a7180	ZGEMM kernel performance improvement for k=1 sizes: The current implementation for handling zgemm exploits SIMD parallelism along the k dimension. This would give great performance in cases of k being large. But for input sizes with k=1, it is better to exploit SIMD parallelism along the m and n dimensions, thereby giving better performance. This commit does the same through loop reordering, by loading column vectors from A. AMD-Internal: [CPUPL-2236] Change-Id: Ibfa29f271395497b6e2d0127c319ecb4b883d19f	2022-06-30 07:19:52 -04:00
Arnav Sharma	25cf7517ab	AOCL Dynamic Optimization for DGEMMT - Optimized thread allocation for cases with n <= 220 for DGEMMT. AMD-Internal: [CPUPL-2215] Change-Id: Id01edf268a90fd96a41ef947db54f6afc490548f	2022-06-30 15:00:34 +05:30
Dipal M Zambare	2ba2fb2b63	Add AVX2 path for TRSM+GEMM combination. - Enabled AVX2 TRSM + GEMM kernel path, when GEMM is called from TRSM context it will invoke AVX2 GEMM kernels instead of the default AVX-512 GEMM kernels. - The default context has the block sizes for AVX512 GEMM kernels, however, TRSM uses AVX2 GEMM kernels and they need different block sizes. - Added new API bli_zen4_override_trsm_blkszs(). It overrides default block sizes in context with block sizes needed for AVX2 GEMM kernels. - Added new API bli_zen4_restore_default_blkszs(). It restores The block sizes to there default values (as needed by default AVX512 GEMM kernels). - Updated bli_trsm_front() to override the block sizes in the context needed by TRSM + AVX2 GEMM kernels and restore them to the default values at the end of this function. It is done in bli_trsm_front() so that we override the context before creating different threads. AMD-Internal: [CPUPL-2225] Change-Id: Ie92d0fc40f94a32dfb865fe3771dc14ed7884c55	2022-06-29 10:16:24 +00:00
Harihara Sudhan S	d4bb906094	Exception handling in GEMV smart-threading - Added condition to check if n or m is 0 in smart threading logic AMD-Internal: [CPUPL2219] Change-Id: Idd58cd13a11aa5bdb4117b4c9262f38ef3c1afc4	2022-06-29 01:52:18 -04:00
Harish	2e4ed37e97	Added missing endif for the AOCC 4.0 verion check Change-Id: I1c77ae795c398aec685152b491b838a75e7ce318	2022-06-28 13:01:53 +05:30
Harish	77e8492cbd	Added znver4 flag for config builds with AOCC 4.0 compiler version Change-Id: I45f1031ed4c5ea2e3f594713f3821d6bbbecd4df	2022-06-28 02:47:48 -04:00
Chandrashekara K R	ff2ee0ae3f	AOCL-WINDOWS: Added the windows build system to build bench folder on windows. 1. Added the checks in .c files of the bench folder to read the input parameters from the given input files on windows using fscanf. Change-Id: Ie0497696304d318f345a646ab0ce3ba84debd4e2	2022-06-27 22:32:39 -04:00
Kiran Varaganti	47c344bc2e	DGEMM Benchmark Optimizations Updated with optimal cache-blocking sizes for MC, KC and NC for AVX512 dgemm kernel Change-Id: I56b3df238b6d85a6f6861448c0c6f907c972146a	2022-06-27 09:32:39 -04:00
Dipal M Zambare	00c5048f65	Fixed high impact static analysis issues Initialized ymm and xmm registers to zero to address un-inilizaed variable errors reported in static analsys. AMD-Internal: [CPUPL-2078] Change-Id: Icfcc008a0f244278efd8145d7feef764ed5fcc04	2022-06-27 08:19:26 +00:00
satish kumar nuggu	aaf840d86e	Disabled zgemm SUP path - Need to identify new Thresholds for zgemm SUP path to avoid performance regression. AMD-Internal: [CPUPL-2148] Change-Id: I0baa2b415dc5e296780566ba7450249445b93d43	2022-06-27 08:19:12 +00:00
satish kumar nuggu	13c71ca976	BugFix of AOCL_DYNAMIC in TRSM multithreaded small. - Added initialization of rntm object before aocl_dynamic. - Bugfixes in dtrsm right-side kernels, avoided accessing extra memory while using store for corner cases. AMD-Internal: [CPUPL-2193] [CPUPL-2194] Change-Id: I1c9d10edda93621626957d4de2f53d249ad531ba	2022-06-14 15:46:17 +05:30
mkadavil	e073e8b669	DAMAXV AXX512 micro kernel bug fix. -DAMAXV AVX512 is giving wrong results when max element is present at index in [n-u, n), where u < 32. This is a fallout of using wrong start offset for the non-loop unrolled code. -Functions for replacing NaN with negative numbers is replaced with MACRO to avoid function call overhead and to remove static variables used for stateful replacement numbers for NaN. AMD-Internal: [CPUPL-2190] Change-Id: Ie1435c38b264a271f869782793d0b52bbe6e1b2a	2022-06-13 10:52:53 +05:30
mkadavil	6c112632a7	Low precision gemm integrated as aocl_gemm addon. - Multi-Threaded int8 GEMM (Input - uint8_t, int8_t, Output - int32_t). AVX512_vnni based micro-kernel for int8 gemm. Paralellization supported along m and n dimensions. - Multi-Threaded B matrix reorder support for sgemm. Reordering B matrix is packing entire B matrix upfront before sgemm. It allows sgemm to take advantage of packed B matrix without incurring packing costs during runtime. - Makefile updates to addon make rules to compile avx512 code for selected files in addon folder. - CPU features query enhancements to check for AVX512_VNNI flag. - Bench for int8 gemm and sgemm with B matrix reorder. Supports performance mode for benchmarking and accuracy mode for testing code correctness. AMD-Internal: [CPUPL-2102] Change-Id: I8fb25f5c2fbd97d756f95b623332cb29e3b8d182	2022-06-09 10:28:38 -04:00

1 2 3 4 5 ...

2699 Commits