amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-21 17:08:17 +00:00

Author	SHA1	Message	Date
Mangala V	6c1acc74c8	ZGEMM optimizations -- Conditionally packing of B matrix is enabled in zgemmsup path which is performing better when B matrix is large -- Incorporated decision logic to choose between zgemm_small vs zgemm sup based on matrix dimensions "m, n and k". -- Calling of ZGEMV when matrix dimension m or n = 1. Very good performance improvement is observed. Change-Id: I7c64020f4f78a6a51617b184cc88076213b5527d	2022-07-28 14:55:24 +05:30
Vignesh Balasubramanian	808d79a610	Implemented efficient ZGEMM algorithm when k=1 Problem statement : To improve the performance of the zgemm kernel for dealing with input sizes with k=1 by fine tuning its previous implementation. In the previous implementation, usage of SIMD parallelism along m and n dimensions instead of the k dimension proved to provide a better performance to the zgemm kernel. This code was subjected to further improvements along the following lines: - Cases to deal with alpha=0 and beta!=0 (i.e. just scaling of C) were handled at the beginning separately, using the bli_zscalm api. - Register blocking was further improved, resulting in the kernel size to increase from 4x5 to 4x6. - Prefetching was added to the code, by empirically finding out a suitable value to be added to the pointer. Overall, it provided a mild improvement to the performance. - Conditional statements were removed from the kernel loop, and a logic was deduced to allow such removal without affecting the output. The performance improvement of this single threaded implementation also proved to compete with that of the default implementation for multiple threads, as long as m and n are under 128. An improvement to this patch would be to find out a suitable feature which would establish a relationship between the number of threads and the input size constraints, thereby providing a unique size constraint for different number of threads. AMD-Internal: [CPUPL-2236] Change-Id: I3d401c8fd78bec80ce62eef390fa85e6287df847	2022-07-28 02:09:45 -04:00
Vignesh Balasubramanian	2ad25a7180	ZGEMM kernel performance improvement for k=1 sizes: The current implementation for handling zgemm exploits SIMD parallelism along the k dimension. This would give great performance in cases of k being large. But for input sizes with k=1, it is better to exploit SIMD parallelism along the m and n dimensions, thereby giving better performance. This commit does the same through loop reordering, by loading column vectors from A. AMD-Internal: [CPUPL-2236] Change-Id: Ibfa29f271395497b6e2d0127c319ecb4b883d19f	2022-06-30 07:19:52 -04:00
satish kumar nuggu	aaf840d86e	Disabled zgemm SUP path - Need to identify new Thresholds for zgemm SUP path to avoid performance regression. AMD-Internal: [CPUPL-2148] Change-Id: I0baa2b415dc5e296780566ba7450249445b93d43	2022-06-27 08:19:12 +00:00
Dipal M Zambare	c87b9aab75	Added support for AVX512 for Windows and AMAXV - Completed zen4 configuration support on windows - Enabled AVX512 kernels for AMAXV - Added zen4 configuration in amdzen for windows - Moved all zen4 kernels inside kernels/zen4 folder AMD-Internal: [CPUPL-2108] Change-Id: I9d2336998bbcdb8e2c4ca474977b5939bfa578ba	2022-06-08 11:09:48 +05:30
Dipal M Zambare	7247e6a150	Fixed crash issue in TRSM on non-avx platform. - Ensured that FMA, AVX2 based kernels are called only on platforms supporting these instructions, otherwise standard ‘C’ kernels will be called. - Code cleanup for optimization and consistency AMD-Internal: [CPUPL-2126] Change-Id: I203270892b2fad2ccc9301fb55e2bae75508e050	2022-05-17 18:10:39 +05:30
Dave	963a6aa099	Enabled zgemm_sup path and removed sqp path - Previously zgemm computation failures were due to status variable did not have pre-defined initial value which resulted in zgemm computation to return without being computed by any kernel. Reflected same change in dgemm_ function as well. - Enabled sup zgemm as the issue is fixed with status variable with bli_zgemm_small call. -Removed calling sqp method as it is disabled Change-Id: I0f4edfd619bc4877ebfc5cb6532c26c3888f919d	2022-05-17 18:10:39 +05:30
satish kumar nuggu	1a3428ddfc	Parallelization of dtrsm_small routine 1. Parallelized dtrsm_small across m-dimension or n-dimension based on side(Left/Right). 2. Fine-tuning with AOCL_DYNAMIC to achieve better performance. AMD-Internal: [CPUPL-2103] Change-Id: I6be6a2b579de7df9a3141e0d68bdf3e8a869a005	2022-05-17 18:10:39 +05:30
Harsh Dave	52e4fd0f11	Performance Improvement for ctrsm small sizes Details: - Enable ctrsm small implementation - Handled Overflow and Underflow Vulnerabilities in ctrsm small implementations. - Fixed failures observed in libflame testing. - For small sizes, ctrsm small implementation is used for all variants. Change-Id: I17b862dcb794a5af0ec68f585992131fef57b179	2022-05-17 18:10:39 +05:30
Sireesha Sanga	cc3069fb5e	Performance Improvement for ztrsm small sizes Details: - Optimization of ztrsm for Non-unit Diag Variants. - Handled Overflow and Underflow Vulnerabilities in ztrsm small implementations. - Fixed failures observed in libflame testing. - Fine-tuned ztrsm small implementations for specific sizes 64<= m,n <= 256, by keeping the number of threads to the optimum value, under AOCL_DYNAMIC flag. - For small sizes, ztrsm small implementation is used for all variants. AMD-Internal: [SWLCSG-1194] Change-Id: I066491bb03e5cda390cb699182af4350ae60be2d	2022-05-17 18:10:39 +05:30
satish kumar nuggu	fe7f0a9085	Changes to enable zgemm small from BLAS Layer 1. Removed small gemm call from native path to avoid Single threaded calls as a part of MultiThreaded scenarios. 2. SUP and INDUCED Method path disabled. 3. Added AOCL Dynamic for optimum number of threads to achieve higher performance. Change-Id: I3c41641bef4906bdbdb5f05e67c0f61e86025d92	2022-05-17 18:10:38 +05:30
Sireesha Sanga	9621ef3067	Performance Improvement for ztrsm small sizes Details: - Enable ztrsm small implementation - For small sizes, Right Variants and Left Unit Diag Variants are using ztrsm_small implementations. - Optimization of Left Non-Unit Diagonal Variants, Work In Progress AMD-Internal: [SWLCSG-1194] Change-Id: Ib3cce6e2e4ac0817ccd4dff4bb0fa4a23e231ca4	2022-05-17 18:09:22 +05:30
Harsh Dave	0976ed9ce5	Implement zgemm_small kernel Details: - Intrinsic implementation of zgemm_small nn kernel. - Intrinsic implementation of zgemm_small_At kernel. - Added support for conjugate and Hermitian transpose - Main loop operates in multiples of a 4x3 tile. - Edge cases are handled separately. AMD-Internal: [CPUPL-2084] Change-Id: I512da265e4d4ceec904877544f1d15cddc147a66	2022-05-17 18:09:22 +05:30
Nallani Bhaskar	eb0ff01871	Fine-tuning dynamic threading logic of DGEMM for small dimensions Description: 1. For small dimensions, single-threaded dgemm_small performs better than the dgemmsup and native paths. 2. Irrespective of the given number of threads, we redirect to single-threaded dgemm_small AMD-Internal: [CPUPL-2053] Change-Id: If591152d18282c2544249f70bd2f0a8cd816b94e	2022-05-17 18:09:22 +05:30
Dipal M Zambare	06e386f054	Updated Windows build system to pick AMD specific sources. The framework cleanup was done for linux as part of `f63f78d7` Removed Arch specific code from BLIS framework. This commit adds changes needed for windows build. AMD-Internal: [CPUPL-2052] Change-Id: Ibd503a0adeea66850de156fb95657b124e1c4b9d	2022-05-17 18:09:20 +05:30
Meghana Vankadari	c11fd5a8f6	Added functionality support for dzgemm AMD-Internal: [SWLCSG-1012] Change-Id: I2eac3131d2dcd534f84491289cbd3fe7fb7de3da	2022-05-17 18:01:55 +05:30
Dipal M Zambare	f63f78d783	Removed Arch specific code from BLIS framework. - Removed BLIS_CONFIG_EPYC macro - The code dependent on this macro is handled in one of the three ways -- It is updated to work across platforms. -- Added in architecture/feature specific runtime checks. -- Duplicated in AMD specific files. Build system is updated to pick AMD specific files when library is built for any of the zen architecture AMD-Internal: [CPUPL-1960] Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b	2022-01-18 11:51:08 +05:30
Dipal M Zambare	fd8a3aace9	Added support for zen4 architecture - Added configuration option for zen4 architecture - Added auto-detection of zen4 architecture - Added zen4 configuration for all checks related to AMD specific optimizations AMD-Internal: [CPUPL-1937] Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a	2021-11-23 10:29:15 +05:30
mkurumel	235071690a	Fixed dynamic dispatch crash issue on non-zen architecture for gemv and axpy routines. Summary: 1. This commit fixed issue for gemv and axpy API’s. 2. The BLIS binary with dynamic dispatch feature was crashing on non-zen CPUs (specifically CPUs without AVX2 support). 3. The crash was caused by un-supported instructions in zen optimized kernels.The issue is fixed by calling only reference kernels if the architecture detected at runtime is not zen, zen2 or zen3. Change-Id: Icc6f7fdc80bc58fac1a97b1502b6f269e5e89aa4	2021-11-12 08:58:59 +05:30
Harihara Sudhan S	4de6f2ca6d	Fixed dynamic dispatch crash issue on non-zen architecture. Direct calls to zen kernels replaced by architecture dependent calls for dotv and amaxv kernels. For non-zen architecture, generic function is called using the BLIS interface. For zen architecture, direct calls to zen optimized kernels are made. Change-Id: I49fc9abc813434d6a49a23f49e47d16e95b7899f	2021-11-12 08:58:58 +05:30
Harsh Dave	8f297f6267	Fixed dynamic dispatch crash issue on non-zen architecture. Removed direct calling of zen kernels in blis interface for trsm, scalv, swapv. The BLIS binary with dynamic dispatch feature was crashing on non-zen CPUs (specifically CPUs without AVX2 support). The crash was caused by un-supported instructions in zen optimized kernels. The issue is fixed by calling only reference kernels if the architecture detected at runtime is not zen, zen2 or zen3. AMD-Internal: [CPUPL-1930] Change-Id: I7944d131d376e2c4e778fe441a8b030674952b81	2021-11-12 08:58:58 +05:30
Dipal M Zambare	61b5b9c4d0	Fixed dynamic dispatch crash issue on non-zen architecture. Removed direct calling of zen kernels in cblas source itself. Similar optimizations are done by the function directly invoked from Cblas layer. The BLIS binary with dynamic dispatch feature was crashing on non-zen CPUs (specifically CPUs without AVX2 support). The crash was caused by un-supported instructions in zen optimized kernels. The issue is fixed by calling only reference kernels if the architecture detected at runtime is not zen, zen2 or zen3. AMD-Internal: [CPUPL-1930] Change-Id: I9178b7a98f2563dee2817064f37fcbb84073eeea	2021-11-12 08:58:58 +05:30
Dipal M Zambare	ddbdfd0ba4	Fixed dynamic dispatch crash issue on non-zen architecture. This commit fixed issue for gemm and copy API’s. The BLIS binary with dynamic dispatch feature was crashing on non-zen CPUs (specifically CPUs without AVX2 support). The crash was caused by un-supported instructions in zen optimized kernels. The issue is fixed by calling only reference kernels if the architecture detected at runtime is not zen, zen2 or zen3. AMD-Internal: [CPUPL-1930] Change-Id: Ief57cd457b87542aa1a7bad64dc36c01f0d1a366	2021-11-12 08:58:57 +05:30
lcpu	30038af896	Reverted: To fix accuracy issues for complex datatypes Details: -- reverted cscalv,zscalv,ctrsm,ztrsm changes to address accuracy issues observed by libflame and scalapack application testing. -- AMD-Internal: [CPUPL-1906], [CPUPL-1914] Change-Id: Ic364eacbdf49493dd3a166a66880c12ee84c2204	2021-11-12 08:58:57 +05:30
Harsh Dave	53e1d0539f	Fixed conjugate transpose kernel issue Details: AMD Internal Id: CPUPL-1702 - For the cases of A being of 1x1 dimension and of left and right hand side, A's only element is conjugate transposed by negating its imaginary component. Change-Id: I696ae982d9d60e0e702edaba98acbe9a5b0cd44c	2021-11-12 08:58:56 +05:30
Harsh Dave	2b6faf21a1	Fixed conjugate transpose issue for zscalv and cscalv Details: AMD Internal Id: CPUPL-1702 - While performing trsm function A's imaginary part needed to be complimented as per conjugate transpose. -So in the case of conjugate transpose A's imaginary part is negated before doing trsm. Change-Id: Ic736733a483eeadf6356952b434128c0af988e36	2021-11-12 08:58:55 +05:30
Harsh Dave	590c763e22	Implemented ctrsm small kernels Details: -- AMD Internal Id: CPUPL-1702 -- Used 8x3 CGEMM kernel with vector fma by utilizing ymm registers efficiently to produce 24 scomplex outputs at a time -- Used packing of matrix A to effectively cache and reuse -- Implemented kernels using macro based modular approach -- Added ctrsm_small for in ctrsm_ BLAS path for single thread when (m,n)<1000 and multithread (m+n)<320 -- Taken care of --disable_pre_inversion configuration -- Achieved 13% average performance improvement for sizes less than 1000 -- modularized all 16 combinations of trsm into 4 kernels Change-Id: I557c5bcd8cb7c034acd99ce0666bc411e9c4fe64	2021-11-12 08:58:55 +05:30
satish kumar nuggu	23278627f4	STRSM small kernel implementation Details: -- AMD Internal Id: [CPUPL-1702] -- Used 16x6 SGEMM kernel with vector fma by utilizing ymm registers -- Used packing of matrix A to effectively cache and reuse -- Implemented kernels using macro based modular approach -- Taken care of --disable_pre_inversion configuration -- modularized strsm 16 combinations of trsm into 4 kernels Change-Id: I30a1551967c36f6bae33be3b7ae5b7fcc7c905ea	2021-11-12 08:58:55 +05:30
Nageshwar Singh	a3d04a21a0	Complex double standalone gemv implementation independent of axpyf. Details - For axpyf implementation there are function(axpyf) calling overhead. - New implementations reduces function calling overhead. - This implementation uses kernel of size 4x4. - This implementation gives better performance for smaller sizes when compared to axpyf based implementation AMD-Internal: [CPUPL-1402] Change-Id: I5fa421b8c1d2b44c991c2a05e8f5b01b83eb4b37	2021-11-12 08:58:54 +05:30
Dipal M Zambare	8f310c3384	AOCL DTL - Added thread and execution time details in logs -- Added number of threads used in DTL logs -- Added support for timestamps in DTL traces -- Added time taken by API at BLAS layer in the DTL logs -- Added GFLOPS achieved in DTL logs -- Added support to enable/disable execution time and gflops printing for individual API's. We may not want it for all API's. Also it will help us migrate API's to execution time and gflops logs in stages. -- Updated GEMM bench to match new logs -- Refactored aocldtl_blis.c to remove code duplication. -- Clean up logs generation and reading to use spaces consistently to separate various fields. -- Updated AOCL_gettid() to return correct thread id when using pthreads. AMD-Internal: [CPUPL-1691] Change-Id: Iddb8a3be2a5cd624a07ccdbf5ae0695799d8ae8e	2021-11-12 08:58:54 +05:30
Harsh Dave	a7f600b3a4	Implemented ztrsm small kernels Details: -- AMD Internal Id: CPUPL-1702 -- Used 4x3 ZGEMM kernel with vector fma by utilizing ymm registers efficiently to produce 12 dcomplex outputs at a time -- Used packing of matrix A to effectively cache and reuse -- Implemented kernels using macro based modular approach -- Added ztrsm_small for in ztrsm_ BLAS path for single thread when (m,n)<500 and multithread (m+n)<128 -- Taken care of --disable_pre_inversion configuration -- Achieved 10% average performance improvement for sizes less than 500 -- modularized all 16 combinations of trsm into 4 kernels Change-Id: I3cb42a1385f6b3b82d6c470912242675789cce75	2021-11-12 08:58:54 +05:30
Nallani Bhaskar	40e1fd1860	Enabled processing of zero inputs of x in tbsv to provide NaN Details: The basic idea is that the inverse of a singular matrix doesn't exist, therefore we should be returning NAN. The BLAS standard and BLIS optimize by not doing any compute when x[j] == 0. As a result BLIS is generating finite values for inverse calculation of singular matrices, which in reality is not the right answer. Fix is provided in this commit to generate NAN/INF values in case this API is called to compute inverses of singular matrices. But according to the standard, this API shouldn't be called in the first place; the check for singularity or near singularity should be done by the calling application Change-Id: Iccdbc07744de3892626f4066ee4a63eb30bc06cd	2021-11-12 08:58:53 +05:30
Nageshwar Singh	a263146a4c	Optimized scalv for complex data-types c and z (cscalv and zscalv) AMD-Internal: [CPUPL-1551] Change-Id: Ie6855409d89f1edfd2a27f9e5f9efa6cd94bc0c9	2021-11-12 08:58:53 +05:30
Madan mohan Manokar	3dda4ebf22	Induced method turned off, fix for beta=0 & C = NAN 1. Induced Method turned off, till the path fully tested for different alpha,beta conditions. 2. Fix for Beta =0, and C = NAN done. Change-Id: I5a7bd1393ac245c2ebb72f9a634728af4c0d4000	2021-11-12 08:58:50 +05:30
Meghana Vankadari	f1291fc957	Disabled dzgemm blas interfaces AMD-Internal: [CPUPL-1825] Change-Id: I786a2ad83be418bd4c55d24e30ae6c3a86fd8844	2021-11-12 08:58:50 +05:30
Saitharun	ffcb338531	Adding ENABLE_WRAPPER CMAKE OPTION details: Wrapper code will be enabled when selecting the cmake option ENABLE_WRAPPER and also this commit will fixing the ScaLAPACK build error on windows. AMD-Internal: [CPUPL-1848] Change-Id: I3d687cbc00e7603fdfb45937a00daf86bd07878e	2021-11-12 08:58:49 +05:30
Saitharun	5d6012c50d	Added additional symbols for BLIS APIs using wrapper functions Details: BLIS currently supports BLAS and CBLAS interfaces with lowercase. With this commit - we also supports uppercase with and without trailing underscore, lowercase without trailing underscore symbol names. Change-Id: Ibb06121821ab937b25d492409625916f542b2135	2021-11-12 08:58:48 +05:30
Dipal M Zambare	bcd9591b3f	Added support for amdepyc fat binary -- Created new configuration amdepyc to include fat binary which includes zen, zen2, zen3 and generic architecture for fallback. -- Updated amdepyc family makefiles to include macros needed in amdepyc family binary. This file must include all macros, compiler options to be used for non architecture specific code. -- Added 'workaround' to exclude ZEN family specific code in some of the framework files. There are still lot of places were ZEN family specific code is added in framework files. They will be addressed with proper design later. - Moved definition of BLIS_CONFIG_EPYC from header files to makefile so that it is enabled only for framework and kernels -- Removed redundant flag AOCL_BLIS_ZEN, used BLIS_CONFIG_EPYC wherever it was needed. -- Removed un-used, obsolete macros, some of them may be needed for debugging which can be added in the individual workspaces. - BLIS_DEFAULT_MR_THREAD_MAX - BLIS_DEFAULT_NR_THREAD_MAX - BLIS_ENABLE_ZEN_BLOCK_SIZES - BLIS_SMALL_MATRIX_THRES_TRSM - BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES - BLIS_ENABLE_SUP_MR_EXT - BLIS_ENABLE_SUP_NR_EXT -- Corrected implementation of exiting amd64_legacy configuration. AMD-Internal: [CPUPL-1626, CPUPL-1628] Change-Id: I46b0ab3ea3ac7d9ff737fef66c462e85601ee29c	2021-08-26 15:13:49 +05:30
Kiran Varaganti	adfd569591	DGEMM Optimizations Improve DGEMM performance for smaller sizes. AOCL DYNAMIC is incorporated at blas interface to enable calling bli_dgemm_small when optimum number of threads implied is 1 for (n and k < 10). Improved smart threading logic for dgemm, Additional conditions at the blas interface added to invoke bli_dgemm_small. Removed N > 3 condition from bli_dgemm_small. Change-Id: Id751528dfe9de37800b02ffaf765b6c82487093e	2021-08-10 12:34:43 -04:00
Madan mohan Manokar	4b90ae3112	single instance zgemm tuning 1. single instance case sup is enabled. 2. Env BLIS_SINGLE_INSTANCE should be set to 1 to enable single instance tuning. AMD-Internal: [CPUPL-1743] Change-Id: Iadb05a6e9313ac41271c0522da243fd47d80abec	2021-07-29 13:36:14 +05:30
Madan mohan Manokar	d3542ff0e0	3m_sqp conjugate support added 1. 3m_sqp support for A matrix with conjugate_no_transpose and conjugate_transpose added. AMD-Internal: [CPUPL-1521] Change-Id: Ie6e5c49cf86f7d3b95d78705cf445e57f20b3d1f	2021-07-05 18:44:55 +05:30
Madan mohan Manokar	70e9d327a2	squarePacked(sqp) framework and multi-instance handling 1. kx partitions added to k loop for dgemm and zgemm. 2. mx loop based threading model added for dgemm as prototype of zgemm. 3. nx loop added for 3m_sqp and dgemm_sqp. 4. single 3m_sqp workspace allocation with smaller memory footprint. 5. sqp framework done from dgemm and zgemm. 6. sqp kernels moved to separate kernel file. 7. residue kernel core added to handle mx<8. 8. multi-instance tuning for 3m_sqp done. 9. user can set env "BLIS_MULTI_INSTANCE" to 1 for better multi-instance behavior of 3m_sqp. AMD-Internal: [CPUPL-1521] Change-Id: Ibef50a8a37fe99f164edb4621acb44fc0c86514c	2021-06-28 15:40:11 +05:30
Meghana Vankadari	cb3a40ab9d	Added blas interface for dzgemm - Added blas interface for dzgemm. This function will call native implementation of gemm. - Mixed datatype support is already present in BLIS. But this implementation requires alpha_imag value to be 0. - Modified test_gemm.c to support testing of dzgemm. Change-Id: I496fffdede9f0f778b9a33b405eb6861c6dcc334	2021-06-27 09:34:18 -04:00
Nallani Bhaskar	75f72b7f6e	Added aocl dynamic feature for dtrsm for small sizes Details: 1. Added aocl-dynamic for dtrsm native path When (m,n)<512 better performance observed for nthreads=4 2. Updated trsm_small threshold such that when (m+n)<320 trsm_small is doing better than native irrespective of number of threads Change-Id: Ic2c50f14db257a05e323cc97c5d1c9b73b68f487	2021-06-18 08:46:47 -04:00
Nallani Bhaskar	e328bdc549	Added prefetch in left cases of dtrsm small Details: 1. Added prefetching next micro-panel of A and B in dgemm block, which are helping in reducing load latency and improved performance. 2. Removed unnecessary unrolls in gemm loops and moved 8x6,6x8 core dgemm into macros and made it more modular 3. Packing and diagonal packing in main dgemm loops are modularized. Fringe cases are yet to modularize. 4. Updated dtrsm small thresholds for single and multi thread cases 5. Updated div/scale based on disable/enable of trsm pre-inversion 6. Code clean up Change-Id: I5de16805ff050a31d2b424bb3f6ae0a4019332df	2021-06-15 23:15:22 +05:30
Kiran Varaganti	c2abbcab96	Fix dgemm_ Multi-thread running as Single Thread Details: When parallelization is enabled in BLIS through environment variables BLIS_?C_NT or BLIS_?R_NT - dgemm_ is running as Single thread. This is fixed. Reason: when OMP_NUM_THREADS or BLIS_NUM_THREADS is not set, the num_threads parameter in rntm is -1 irrespective of BLIS_IC_NT or BLIS_JC_NT values; as a result the dgemm_ interface assumes single thread and calls small_gemm, which ends up running sequentially. Fix: added a new function bli_thread_is_parallel() in bli_thread.c; it returns 1 if parallelization is enabled either through BLIS_?C_NT values or BLIS_NUM_THREADS. It returns zero if sequential dgemm is needed. This function is called from dgemm_ to decide whether to call parallel dgemm_ or sequential one. Add fix for zgemm_ also. Change-Id: Ia3064647fdd977cf7531ed52191a5a9704478573	2021-06-15 12:14:11 +05:30
Dipal M Zambare	2f344f5df1	DTL logs corrections -- Fixed issues in printing the values of side, uploa and diaga parameters for hemm, hemv, her, her2, her2k, herk, symm, symv, syr, syr2, syr2k, syrk, trmm, trmv, trsm, trsv. -- For above API's logging was called with MKSTR() for side, uploa and diaga parameters. MKSTR is needed only for macro arguments but not for function's arguments. -- Added space between function name and data type where it was missing. Bench expects logs in this format. AMD-Internal: [CPUPL-1585] Change-Id: Ib6ab66890e68cfa52860f869d6a1c34e78036a2d	2021-06-04 15:24:13 +05:30
Nageshwar Singh	2e1a5bc1dd	Optimized double complex axpyf kernel for zgemv Details: - Implemented zaxpyf kernel with fuse factor=4 for zgemv. - Modified BLAS interface call for zgemv to reduce framework overhead. - Directed gemv to dotv in the case where dimension of y vector is 1. - when alpha = 0, gemv becomes scalv of Y with beta. Added code to return early after scaling Y vector with beta. AMD-Internal: [CPUPL-1402] Change-Id: I2231285fe3060982d4434466346a040b7ab803fc	2021-06-01 18:03:29 +05:30
Meghana Vankadari	4446395047	Redirecting dgemv to axpyf based implementation for smaller sizes. AMD-Internal: [CPUPL-1403] Change-Id: I0ff2763c41c5ae598c58bc250adc317d7f8a4994	2021-05-25 01:39:12 -04:00
satish kumar nuggu	82087773a0	Optimized single threaded dtrsm small for right cases Details: 1. Added optimized dtrsm kernels for all 8 right side cases Below are few notable optimizations which improved performance a. Loading, transposing (for transa cases), packing and reusing of a01 block required for GEMM operation. The block size increases from 0 to 6X(n-6) in steps of 6x6 while solving TRSM from one end of A to other end of triangular A b. Packing of 6 diagonal elements in one location helped to utilize cache line efficiently AMD-Internal: [CPUPL-1563] Change-Id: Iabd37536216d5215fc69ee1f8ec671b52f1be9d3	2021-05-25 01:09:50 -04:00
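
The two k=1 ZGEMM commits in this log (808d79a610 and 2ad25a7180) describe moving the parallelism from the k dimension to the m and n dimensions, and handling the alpha=0 scaling case separately. When k=1 the product reduces to a rank-1 update, C := beta*C + alpha*a*b^T, where a is the single column of A and b is the single row of B. The following is a minimal scalar sketch of that idea only. It is not the code from the commits: the function name `zgemm_k1_sketch`, its signature, and the column-major layout are assumptions made for illustration, and it uses no intrinsics, register blocking, or prefetching.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical helper (not a BLIS API): for k = 1, computes
 *   C := beta*C + alpha * a * b^T
 * where C is m x n in column-major order with leading dimension ldc,
 * a holds the single column of A (length m) and b holds the single
 * row of B (length n). */
void zgemm_k1_sketch(size_t m, size_t n,
                     double complex alpha,
                     const double complex *a,
                     const double complex *b,
                     double complex beta,
                     double complex *c, size_t ldc)
{
    assert(ldc >= m);

    for (size_t j = 0; j < n; ++j) {
        double complex *cj = c + j * ldc;

        /* Hoist alpha * b[j] out of the inner loop: it is constant for
         * the whole column, so the inner loop runs along m, which is
         * contiguous in memory and therefore easy to vectorize. */
        const double complex ab = alpha * b[j];

        if (beta == 0.0) {
            /* Overwrite C without reading it, so stale values in C
             * (including NaN) do not leak into the result. */
            for (size_t i = 0; i < m; ++i)
                cj[i] = ab * a[i];
        } else {
            for (size_t i = 0; i < m; ++i)
                cj[i] = beta * cj[i] + ab * a[i];
        }
    }
}
```

With alpha = 0 the hoisted factor `ab` is zero, so the loop degenerates to scaling C by beta. A tuned kernel would likely branch to a dedicated scaling routine up front, as the 808d79a610 message describes, rather than run the multiply-add loop.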

205 Commits