amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-11 17:50:00 +00:00

Author	SHA1	Message	Date
Shubham Sharma	8adef27aca	Optimization of DGEMMT SUP kernel for beta zero cases. Details: 1. In kernels for non-transpose variants, changes are made to optimize the cases of beta zero. 2. Validated the changes with BLIS Testsuite, GTestSuite(Functionality, Valgrind, Integer Tests) and Netlib Tests. 3. Fixed warnings during the build process. AMD-Internal: [CPUPL-2341] Change-Id: I8bb53ad619eb2413c999fe18eafd67c75fe1f83a	2022-08-18 08:05:58 -04:00
Harsh Dave	e2e1dadee1	DGEMM Improvements - We prefetch next panel while packing 8xk panel. - Modified prefetch offsets for dgemm native and dgemm_small kernel. AMD-Internal: [CPUPL-2366] Change-Id: Ife609e789c8b87169c73bb0a30d6f1af20fb30ed	2022-08-16 08:07:30 -04:00
satish kumar nuggu	88e44c64e3	Fixed Memory Leaks in TRSM 1. Fixed the memory leaks in corner cases, which were caused by extra loads in all datatypes (s,d,c,z). 2. In remainder cases, extra elements were loaded instead of only the required number of elements, which led to memory leaks. Fixed the memory leaks by restricting the number of loads to the required number of elements. AMD-Internal: [CPUPL-2280] Change-Id: Ia49a02565e01d5ed05e98090b7773a444587cd8a	2022-08-16 05:08:40 -04:00
Shubham Sharma	f5ef30a44a	Fix in DGEMMT SUP kernel Details: 1. Due to error in C output buffer address computation in kernel bli_dgemmsup_rv_haswell_asm_6x8m_6x8_L, invalid memory is being accessed. This is causing seg fault in libflame netlib testing. 2. Validated the fix with libflame netlib testing. AMD-Internal: [CPUPL-2341] Change-Id: I9ca0cf09cf2d177ade73f840054b5028eae3a0ed	2022-08-12 21:03:36 +05:30
Shubham Sharma	4bca7f6f4a	DGEMMT optimizations Details: 1. For lower and upper, non-transpose variants of gemmt, new kernels are developed and optimized to compute only the required outputs in the diagonal blocks. 2. In the previous implementation, all the 48 outputs of the given 6x8 block of C matrix are computed and stored into a temporary buffer. Later,the required elements are copied into the final C output buffer. 3. Changes are made to compute only the required outputs of the 6x8 block of C matrix and directly stored in the final C output buffer. 4. With this optimization, we are avoiding copy operation and also reducing the number of computations. 5. Kernels specific to compute Lower and Upper Variant diagonal outputs have been added. 6. SUP Framework changes to integrate the new kernels have been added. 7. These kernels are part of the SUP framework. AMD-Internal: [CPUPL-2341] Change-Id: I0ec8f24a0fb19d9b1ef7254732b8e09f06e1486a	2022-08-11 06:16:33 -04:00
Mangala.V	8504ef013d	Optimisation of DTRSM and ZTRSM 1. Extract instruction replaced with cast when accessing first 128bit, as cast inst needs no cycle but extract takes few cycles 2. Added prefetch of A buffer when computing gemm operation 3. Added prefetch of C11 buffer before TRSM operation, with offset of 7 to cs_c With above changes performance improvements observed in case of Single thread Change-Id: Id377c490ddac8b06384acfa9a6d89dbe11bbc7be	2022-08-11 01:39:40 -04:00
Devin Matthews	76fbf1233d	Add vzeroupper to Haswell microkernels. (#524) Details: - Added vzeroupper instruction to the end of all 'gemm' and 'gemmtrsm' microkernels so as to avoid a performance penalty when mixing AVX and SSE instructions. These vzeroupper instructions were once part of the haswell kernels, but were inadvertently removed during a source code shuffle some time ago when we were managing duplicate 'haswell' and 'zen' kernel sets. Thanks to Devin Matthews for tracking this down and re-inserting the missing instructions. Change-Id: I418fea9fed27ba3ad7d395cf96d1be507955d8e9	2022-08-01 09:29:04 +05:30
Field G. Van Zee	2a81437bd8	Fixed bugs in cpackm kernels, gemmlike code. Details: - Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the kappa scalar was incorrectly loaded at an offset of 8 bytes (instead of 4 bytes) from the real component. This was almost certainly a copy-paste bug carried over from the corresponding zpackm kernels. Thanks to Devin Matthews for bringing this to my attention. - Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and bls_gemm_bp_var2.c that initializes the elements of the temporary microtile to zero. (This bug was never observed in output but rather noticed analytically. It probably would have also manifested as intermittent failures, this time involving edge cases.) - Minor commented-out/disabled changes to testsuite/src/test_gemm.c relating to debugging. Change-Id: I899e20df203806717fb5270b5f3dd0bf1f685011	2022-08-01 09:11:58 +05:30
Devin Matthews	9495401b73	Fix more copy-paste errors in the haswell gemmsup code. Fixes #486. Change-Id: I568386b5d67a698ea9c0b6b17f133df86c2894bd	2022-07-31 21:36:21 +05:30
Devin Matthews	ea163fc23b	Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell. The fix is to use the same (valid) source register twice in the horizontal addition. Change-Id: I96ed39e289aaeeb44be9117074b32bd8d4c19de6	2022-07-31 21:15:28 +05:30
Field G. Van Zee	faff30b46a	Fixed out-of-bounds bug in sup s6x16m haswell kernel. Details: - Fixed another out-of-bounds read access bug in the haswell sup assembly kernels. This bug is similar to the one fixed in `17b0caa` and affects bli_sgemmsup_rv_haswell_asm_6x2m(). Thanks to Madeesh Kannan for reporting this bug (and a suitable fix) in #635. - CREDITS file update. Change-Id: I10ccf4d4f471d93e8c8cc4df422c686438fb04e9	2022-07-31 21:10:58 +05:30
Field G. Van Zee	4b1663213c	Fixed out-of-bounds read in haswell gemmsup kernels. Details: - Fixed memory access bugs in the bli_sgemmsup_rv_haswell_asm_Mx2() kernels, where M = {1,2,3,4,5,6}. The bugs were caused by loading four single-precision elements of C, via instructions such as: vfmadd231ps(mem(rcx, 0*32), xmm3, xmm4) in situations where only two elements are guaranteed to exist. (These bugs may not have manifested in earlier tests due to the leading dimension alignment that BLIS employs by default.) The issue was fixed by replacing lines like the one above with: vmovsd(mem(rcx), xmm0) vfmadd231ps(xmm0, xmm3, xmm4) Thus, we use vmovsd to explicitly load only two elements of C into registers, and then operate on those values using register addressing. Thanks to Daniël de Kok for reporting these bugs in #635, and to Bhaskar Nallani for proposing the fix). - CREDITS file update. Change-Id: Ib525c36bcbf20b2bbbe380da3d74d142b338fe9b	2022-07-31 21:06:08 +05:30
Nallani Bhaskar	1d31386c02	Fixed few out of bound memory reads in sgemmsup kernels Details: Fixed memory access bugs in the bli_sgemmsup_rd_zen_asm_s1x16() kernel. The bugs were caused by loading four single-precision elements of C, via instructions such as: vfmadd231ps(mem(rcx, 0*32), ymm3, ymm4) or vfmadd231ps(mem(rcx, 0*32), xmm3, xmm4) in situations where only two elements are guaranteed to exist. (These bugs may not have manifested in earlier tests due to the leading dimension alignment that BLIS employs by default.) The issue was fixed by replacing lines like the one above with: vmovsd(mem(rcx), xmm0) vfmadd231ps(xmm0, xmm3, xmm4) Thus, we use vmovsd to explicitly load only two elements of C into registers, and then operate on those values using register addressing. AMD_CPUPLID: CPUPL-2279 Change-Id: Ic39290d651f5218b2e548351a87ac5e4b5b79c68	2022-07-29 09:09:12 -04:00
Vignesh Balasubramanian	808d79a610	Implemented efficient ZGEMM algorithm when k=1 Problem statement : To improve the performance of the zgemm kernel for dealing with input sizes with k=1 by fine tuning its previous implementation. In the previous implementation, usage of SIMD parallelism along m and n dimensions instead of the k dimension proved to provide a better performance to the zgemm kernel. This code was subjected to further improvements along the following lines: - Cases to deal with alpha=0 and beta!=0 (i.e. just scaling of C) were handled at the beginning separately, using the bli_zscalm api. - Register blocking was further improved, resulting in the kernel size to increase from 4x5 to 4x6. - Prefetching was added to the code, by empirically finding out a suitable value to be added to the pointer. Overall, it provided a mild improvement to the performance. - Conditional statements were removed from the kernel loop, and a logic was deduced to allow such removal without affecting the output. The performance improvement of this single threaded implementation also proved to compete with that of the default implementation for multiple threads, as long as m and n are under 128. An improvement to this patch would be to find out a suitable feature which would establish a relationship between the number of threads and the input size constraints, thereby providing a unique size constraint for different number of threads. AMD-Internal: [CPUPL-2236] Change-Id: I3d401c8fd78bec80ce62eef390fa85e6287df847	2022-07-28 02:09:45 -04:00
Kiran Varaganti	eff436c653	Bug Fix to replace vzeroall Fixed syntax in AVX512 dgemm native kernel. zen4 configuration follows Intel ASM syntax whereas other AMD configs follow AT&T ASM syntax. Bug was introduced due to following AT&T syntax in AVX512 dgemm kernel. In this commit we changed the syntax to Intel ASM format. src and dst operands are interchanged. Change-Id: Ie61dc7c5e8309b79437d471331318f3104bcd447	2022-07-22 03:42:17 -04:00
Kiran Varaganti	86134c7278	Replaced vzeroall Replaced vzeroall instruction with vxorpd and vmovapd for dgemm kernels, both AVX2 and AVX512. vzeroall is an expensive instruction and was replaced with a faster way of zeroing all registers. vzeroupper() instruction is also added at the end of AVX2 kernels to avoid any AVX2/SSE transition penalties. Kindly note only the main kernels are modified. Change-Id: Ieb9bc629db01f0f94dd0e8e55550940d3d7eb2a4	2022-07-20 01:16:59 -04:00
Vignesh Balasubramanian	2ad25a7180	ZGEMM kernel performance improvement for k=1 sizes: The current implementation for handling zgemm exploits SIMD parallelism along the k dimension. This would give great performance in cases of k being large. But for input sizes with k=1, it is better to exploit SIMD parallelism along the m and n dimensions, thereby giving better performance. This commit does the same through loop reordering, by loading column vectors from A. AMD-Internal: [CPUPL-2236] Change-Id: Ibfa29f271395497b6e2d0127c319ecb4b883d19f	2022-06-30 07:19:52 -04:00
Dipal M Zambare	00c5048f65	Fixed high impact static analysis issues Initialized ymm and xmm registers to zero to address uninitialized variable errors reported in static analysis. AMD-Internal: [CPUPL-2078] Change-Id: Icfcc008a0f244278efd8145d7feef764ed5fcc04	2022-06-27 08:19:26 +00:00
satish kumar nuggu	13c71ca976	BugFix of AOCL_DYNAMIC in TRSM multithreaded small. - Added initialization of rntm object before aocl_dynamic. - Bugfixes in dtrsm right-side kernels, avoided accessing extra memory while using store for corner cases. AMD-Internal: [CPUPL-2193] [CPUPL-2194] Change-Id: I1c9d10edda93621626957d4de2f53d249ad531ba	2022-06-14 15:46:17 +05:30
mkadavil	e073e8b669	DAMAXV AVX512 micro kernel bug fix. - DAMAXV AVX512 is giving wrong results when max element is present at index in [n-u, n), where u < 32. This is a fallout of using wrong start offset for the non-loop unrolled code. - Functions for replacing NaN with negative numbers are replaced with MACRO to avoid function call overhead and to remove static variables used for stateful replacement numbers for NaN. AMD-Internal: [CPUPL-2190] Change-Id: Ie1435c38b264a271f869782793d0b52bbe6e1b2a	2022-06-13 10:52:53 +05:30
Dipal M Zambare	c87b9aab75	Added support for AVX512 for Windows and AMAXV - Completed zen4 configuration support on windows - Enabled AVX512 kernels for AMAXV - Added zen4 configuration in amdzen for windows - Moved all zen4 kernels inside kernels/zen4 folder AMD-Internal: [CPUPL-2108] Change-Id: I9d2336998bbcdb8e2c4ca474977b5939bfa578ba	2022-06-08 11:09:48 +05:30
Dipal M. Zambare	e61ec820f9	Fixed windows build issue for BLIS 4.0 - Removed extra AVX512 typedef from bli_amaxv_zen_int.c. AMD-Internal: [CPUPL-2154] Change-Id: Ieaa827c7d81b8d101f3a827827b99433570219f2	2022-06-02 17:52:43 +05:30
Arnav Sharma	66b2231b65	Fixed CMake files for HER - Removed subdirectory addition Change-Id: I419085db0b9034777409207a7d79b7ffa91eb8f1	2022-06-01 12:25:43 +05:30
Arnav Sharma	e5d5a43eab	Optimized ZHER Implementation - Implemented optimized her framework calls for double precision complex numbers. - The zher kernel operates over 4 columns at a time. Initially, it computes the diagonal elements of the matrix, then the 4x4 triangular part is computed and finally the remaining part is computed as 4x4 tiles of the matrix up to m rows. AMD-Internal: [CPUPL-2151] Change-Id: I27430ee33ffb901b3ef4bdd97b034e3f748e9cca	2022-05-25 14:03:01 +05:30
Dipal M Zambare	6e2f536590	Removed Arch specific code from BLIS framework. - Removed BLIS_CONFIG_EPYC macro - The code dependent on this macro is handled in one of the three ways -- It is updated to work across platforms. -- Added in architecture/feature specific runtime checks. -- Duplicated in AMD specific files. Build system is updated to pick AMD specific files when library is built for any of the zen architecture AMD-Internal: [CPUPL-1960] Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b	2022-05-17 20:35:40 +05:30
Harsh Dave	f48ced0811	Optimized dher2 implementation - Implemented her2 framework calls for transposed and non transposed kernel variants. - dher2 kernel operates over 4 columns at a time. It computes 4x4 triangular part of matrix first and remainder part is computed in chunks of 4x4 tiles up to m rows. - remainder cases(m < 4) are handled serially. AMD-Internal: [CPUPL-1968] Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313	2022-05-17 18:13:07 +05:30
Harsh Dave	718c6bc024	Optimized daxpy2v implementation - Optimized axpy2v implementation for double datatype by handling rows in multiples of 4 and storing the final computed result at the end of computation, preventing unnecessary stores for improving the performance. - Optimal and reuse of vector registers for faster computation. AMD-Internal: [CPUPL-1973] Change-Id: I7b8ef94d0f67c1c666fdce26e9b2b7291365d2e9	2022-05-17 18:13:05 +05:30
Harihara Sudhan S	0fddd9eb0d	Level 1 Kernel: damaxv AVX512 Details: - Developed damaxv for AVX512 extension - Implemented removeNAN function that converts NAN values to negative values based on the location - Usage COMPARE256/COMPARE128 avoided in AVX512 implementation for better performance - Unrolled the loop by order of 4. Change-Id: Icf2a3606cf311ecc646aeb3db0628b293b9a3326	2022-05-17 18:12:16 +05:30
satish kumar nuggu	09b70de635	Performance Improvement for ztrsm small sizes Details: - Handled Overflow and Underflow Vulnerabilities in ztrsm small right implementations. - Fixed failures observed in Scalapack testing. AMD-Internal: [CPUPL-2115] Change-Id: I22c1ba583e0ba14d1a4684a85fa1ca6e152e8439	2022-05-17 18:10:39 +05:30
Nallani Bhaskar	2acb3f6ed0	Tuned aocl dynamic for specific range in dgemm Description: 1. Decision logic to choose optimal number of threads for given input dgemm dimensions under aocl dynamic feature were retuned based on latest code. 2. Updated code in few file to avoid compilation warnings. 3. Added a min check for nt in bli_sgemv_var1_smart_threading function AMD-Internal: [ CPUPL-2100 ] Change-Id: I2bc70cc87c73505dd5d2bdafb06193f664760e02	2022-05-17 18:10:39 +05:30
S, HariharaSudhan	a8bc55c373	Multithreaded SGEMV var 1 with smart threading - Implemented an OpenMP based stand alone SGEMV kernel for row-major (var 1) for multithread scenarios - Smart threading is enabled when AOCL DYNAMIC is defined - Number of threads are decided based on the input dims using smart threading AMD-Internal: [CPUPL-1984] Change-Id: I9b191e965ba7468e95aabcce21b35a533017502e	2022-05-17 18:10:39 +05:30
satish kumar nuggu	1a3428ddfc	Parallelization of dtrsm_small routine 1. Parallelized dtrsm_small across m-dimension or n-dimension based on side(Left/Right). 2. Fine-tuning with AOCL_DYNAMIC to achieve better performance. AMD-Internal: [CPUPL-2103] Change-Id: I6be6a2b579de7df9a3141e0d68bdf3e8a869a005	2022-05-17 18:10:39 +05:30
Harsh Dave	f17d043e1c	Implemented optimal dotxv kernel Details: - Intrinsic implementation of zdotxv, cdotxv kernel - Unrolling in multiple of 8, remaining corner cases are handled serially for zdotxv kernel - Unrolling in multiple of 16, remaining corner cases are handled serially for cdotxv kernel - Added declaration in zen contexts AMD-Internal: [CPUPL-2050] Change-Id: Id58b0dbfdb7a782eb50eecc7142f051b630d9211	2022-05-17 18:10:39 +05:30
Harsh Dave	52e4fd0f11	Performance Improvement for ctrsm small sizes Details: - Enable ctrsm small implementation - Handled Overflow and Underflow Vulnerabilities in ctrsm small implementations. - Fixed failures observed in libflame testing. - For small sizes, ctrsm small implementation is used for all variants. Change-Id: I17b862dcb794a5af0ec68f585992131fef57b179	2022-05-17 18:10:39 +05:30
Arnav Sharma	caa5b37005	Optimized S/DCOMPLEX DOTXAXPYF using AVX2 Intrinsics Details: - Optimized implementation of DOTXAXPYF fused kernel for single and double precision complex datatype using AVX2 Intrinsics - Updated definitions zen context AMD-Internal: [CPUPL-2059] Change-Id: Ic657e4b66172ae459173626222af2756a4125565	2022-05-17 18:10:39 +05:30
Sireesha Sanga	cc3069fb5e	Performance Improvement for ztrsm small sizes Details: - Optimization of ztrsm for Non-unit Diag Variants. - Handled Overflow and Underflow Vulnerabilities in ztrsm small implementations. - Fixed failures observed in libflame testing. - Fine-tuned ztrsm small implementations for specific sizes 64<= m,n <= 256, by keeping the number of threads to the optimum value, under AOCL_DYNAMIC flag. - For small sizes, ztrsm small implementation is used for all variants. AMD-Internal: [SWLCSG-1194] Change-Id: I066491bb03e5cda390cb699182af4350ae60be2d	2022-05-17 18:10:39 +05:30
satish kumar nuggu	fe7f0a9085	Changes to enable zgemm small from BLAS Layer 1. Removed small gemm call from native path to avoid Single threaded calls as a part of MultiThreaded scenarios. 2. SUP and INDUCED Method path disabled. 3. Added AOCL Dynamic for optimum number of threads to achieve higher performance. Change-Id: I3c41641bef4906bdbdb5f05e67c0f61e86025d92	2022-05-17 18:10:38 +05:30
Harsh Dave	015bcb88d4	Fixed ztrsm computational failure - Fixed memory access for edge cases such that all load are within memory boundary only. - Corrected ztrsm utility APIs for dcomplex multiplication and division. AMD-Internal: [CPUPL-2093] Change-Id: Ib2c65e7921f6391b530cd20d6ea6b50f24bd705e	2022-05-17 18:09:22 +05:30
Harsh Dave	0976ed9ce5	Implement zgemm_small kernel Details: - Intrinsic implementation of zgemm_small nn kernel. - Intrinsic implementation of zgemm_small_At kernel. - Added support for conjugate and hermitian transpose - Main loop operates in multiple of 4x3 tile. - Edge cases are handled separately. AMD-Internal: [CPUPL-2084] Change-Id: I512da265e4d4ceec904877544f1d15cddc147a66	2022-05-17 18:09:22 +05:30
mkurumel	ab06f17689	DGEMMT : Tuning SUP threshold to improve ST and MT performance. Details : - SUP Threshold change for native vs SUP - Improved the ST performances for sizes n<800 - Introduce PACKB in SUP to improve ST performance between 320<n<800 - 16T SUP Tuning for n<1600. AMD-Internal: [CPUPL-1981] Change-Id: Ie59afa4d31570eb0edccf760c088deaa2e10cdda	2022-05-17 18:09:22 +05:30
Dipal M Zambare	06e386f054	Updated Windows build system to pick AMD specific sources. The framework cleanup was done for linux as part of `f63f78d7` Removed Arch specific code from BLIS framework. This commit adds changes needed for windows build. AMD-Internal: [CPUPL-2052] Change-Id: Ibd503a0adeea66850de156fb95657b124e1c4b9d	2022-05-17 18:09:20 +05:30
Arnav Sharma	393effbb0c	Optimized ZAXPY2V using AVX2 Intrinsics Details: - Intrinsic implementation of ZAXPY2V fused kernel for AVX2 - Updated definitions in zen contexts AMD-Internal: [CPUPL-2023] Change-Id: I8889ae08c826d26e66ae607c416c4282136937fa	2022-05-17 18:08:57 +05:30
Arnav Sharma	86690f9fd3	Optimized AXPBYV Kernel using AVX2 Intrinsics Details: - Intrinsic implementation of axpbyv for AVX2 - Bench written for axpbyv - Added definitions in zen contexts AMD-Internal: [CPUPL-1963] Change-Id: I9bc21a6170f5c944eb6e9e9f0e994b9992f8b539	2022-05-17 18:03:42 +05:30
Dipal M Zambare	31921b9974	Updated windows build system to define BLIS_CONFIG_EPYC flag. All AMD specific optimization in BLIS are enclosed in BLIS_CONFIG_EPYC pre-preprocessor, this was not defined in CMake which are resulting in overall lower performance. Updated version number to 3.1.1 Change-Id: I9848b695a599df07da44e77e71a64414b28c75b9	2022-05-17 18:03:09 +05:30
Harihara Sudhan S	6696f91f41	Improved DGEMV performance for column-major cases - Altered the framework to use 2 more fused kernels for better problem decomposition - Increased unroll factor in AXPYF5 and AXPYF8 kernels to improve register usage AMD-Internal: [CPUPL-1970] Change-Id: I79750235d9554466def5ff93898f832834990343	2022-02-02 23:13:10 -05:00
Harihara Sudhan	14fb31c0d5	Improved performance of DOTXV kernel for float and double - Vectorized sections of code that were not vectorized AMD Internal: [CPUPL-1980] Change-Id: I08528d054442a5e728f631142f244f1624170136	2022-01-24 23:08:38 -05:00
Dipal M Zambare	f63f78d783	Removed Arch specific code from BLIS framework. - Removed BLIS_CONFIG_EPYC macro - The code dependent on this macro is handled in one of the three ways -- It is updated to work across platforms. -- Added in architecture/feature specific runtime checks. -- Duplicated in AMD specific files. Build system is updated to pick AMD specific files when library is built for any of the zen architecture AMD-Internal: [CPUPL-1960] Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b	2022-01-18 11:51:08 +05:30
Harsh Dave	79c6aa5643	Implemented optimal S/DCOMPLEX dotxf kernel - Optimized dotxf implementation for double and single precision complex datatype by handling dot product computation in tile 2x6 and 4x6 handling 6 columns at a time, and rows in multiple of 2 and 4. - Dot product computation is arranged such a way that multiple rho vector register will hold the temporary result till the end of loop and finally does horizontal addition to get final dot product result. - Corner cases are handled serially. - Optimal and reuse of vector registers for faster computation. AMD-Internal: [CPUPL-1975] Change-Id: I7dd305e73adf54100d54661769c7d5aada9b0098	2022-01-06 02:22:52 -05:00
Harsh Dave	351269219f	Optimized dher2 implementation - Implemented her2 framework calls for transposed and non transposed kernel variants. - dher2 kernel operates over 4 columns at a time. It computes 4x4 triangular part of matrix first and remainder part is computed in chunks of 4x4 tiles up to m rows. - remainder cases(m < 4) are handled serially. AMD-Internal: [CPUPL-1968] Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313	2022-01-05 05:51:15 -06:00
Harsh Dave	8b5b2707c1	Optimized daxpy2v implementation - Optimized axpy2v implementation for double datatype by handling rows in multiples of 4 and storing the final computed result at the end of computation, preventing unnecessary stores for improving the performance. - Optimal and reuse of vector registers for faster computation. AMD-Internal: [CPUPL-1973] Change-Id: I7b8ef94d0f67c1c666fdce26e9b2b7291365d2e9	2022-01-05 06:37:22 -05:00

481 Commits