amd/blis - blis - Public git mirror

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-05-24 18:34:40 +00:00

Author	SHA1	Message	Date
Harihara Sudhan S	8201bcfdaf	Improved DGEMV performance for smaller sizes - Introduced two new ddotxf functions with lower fuse factor. - Changed the DGEMV framework to use new kernels to improve problem decomposition. Change-Id: I523e158fd33260d06224118fbf74f2314e03a617	2021-12-15 13:17:14 +05:30
Harsh Dave	0f43db8347	Optimized dsymv implementation - Implemented hemv framework calls for lower and upper kernel variants. - hemv computation is implemented in two parts. One part operates on the triangular part of the matrix and the remaining part is computed by the dotxfaxpyf kernel. - The first part performs dotxf and axpyf operations on the triangular part of the matrix in chunks of 8x8. Two separate helper functions for doing so are implemented, for the lower and upper kernels respectively. - The second part is the ddotxaxpyf fused kernel, which performs the dotxf and axpyf operations together on the non-triangular part of the matrix in chunks of 4x8. - The implementation uses cache memory efficiently for optimal performance. Change-Id: Id603031b4578e87a92c6b77f710c647acc195c8e	2021-12-14 23:34:20 -06:00
Dipal M Zambare	fd8a3aace9	Added support for zen4 architecture - Added configuration option for zen4 architecture - Added auto-detection of zen4 architecture - Added zen4 configuration for all checks related to AMD specific optimizations AMD-Internal: [CPUPL-1937] Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a	2021-11-23 10:29:15 +05:30
Dipal M Zambare	7a15aa9c87	Fixed xGEMM dynamic dispatch crash on ST library. The small gemm implementation is called from the gemmnat path. When the library is built as multi-threaded, small gemm is completely disabled. For single-threaded builds, the crash is fixed by disabling small gemm on the generic architecture. AMD-Internal: [CPUPL-1930] Change-Id: If718870d89909cef908a1c23918b7ef6f7d80f7a	2021-11-12 08:58:59 +05:30
mkurumel	235071690a	Fixed dynamic dispatch crash issue on non-zen architecture for gemv and axpy routines. Summary: 1. This commit fixes the issue for the gemv and axpy APIs. 2. The BLIS binary with the dynamic dispatch feature was crashing on non-zen CPUs (specifically CPUs without AVX2 support). 3. The crash was caused by unsupported instructions in zen-optimized kernels. The issue is fixed by calling only reference kernels if the architecture detected at runtime is not zen, zen2 or zen3. Change-Id: Icc6f7fdc80bc58fac1a97b1502b6f269e5e89aa4	2021-11-12 08:58:59 +05:30
Harsh Dave	1b0a7e1c89	Fixed dynamic dispatch crash issue on non-zen architecture. Removed direct calling of zen kernels in ctrsv, ztrsv interface. The BLIS binary with dynamic dispatch feature was crashing on non-zen CPUs (specifically CPUs without AVX2 support). The crash was caused by un-supported instructions in zen optimized kernels. AMD-Internal: [CPUPL-1930] Change-Id: I21f25a09cd6ffb013d16c66ea10aa9a42f7cad5b	2021-11-12 08:58:58 +05:30
Harihara Sudhan S	4de6f2ca6d	Fixed dynamic dispatch crash issue on non-zen architecture. Direct calls to zen kernels replaced by architecture dependent calls for dotv and amaxv kernels. For non-zen architecture, generic function is called using the BLIS interface. For zen architecture, direct calls to zen optimized kernels are made. Change-Id: I49fc9abc813434d6a49a23f49e47d16e95b7899f	2021-11-12 08:58:58 +05:30
Harsh Dave	8f297f6267	Fixed dynamic dispatch crash issue on non-zen architecture. Removed direct calling of zen kernels in blis interface for trsm, scalv, swapv. The BLIS binary with dynamic dispatch feature was crashing on non-zen CPUs (specifically CPUs without AVX2 support). The crash was caused by un-supported instructions in zen optimized kernels. The issue is fixed by calling only reference kernels if the architecture detected at runtime is not zen, zen2 or zen3. AMD-Internal: [CPUPL-1930] Change-Id: I7944d131d376e2c4e778fe441a8b030674952b81	2021-11-12 08:58:58 +05:30
Dipal M Zambare	61b5b9c4d0	Fixed dynamic dispatch crash issue on non-zen architecture. Removed direct calling of zen kernels in cblas source itself. Similar optimizations are done by the function directly invoked from Cblas layer. The BLIS binary with dynamic dispatch feature was crashing on non-zen CPUs (specifically CPUs without AVX2 support). The crash was caused by un-supported instructions in zen optimized kernels. The issue is fixed by calling only reference kernels if the architecture detected at runtime is not zen, zen2 or zen3. AMD-Internal: [CPUPL-1930] Change-Id: I9178b7a98f2563dee2817064f37fcbb84073eeea	2021-11-12 08:58:58 +05:30
Dipal M Zambare	ddbdfd0ba4	Fixed dynamic dispatch crash issue on non-zen architecture. This commit fixed issue for gemm and copy API’s. The BLIS binary with dynamic dispatch feature was crashing on non-zen CPUs (specifically CPUs without AVX2 support). The crash was caused by un-supported instructions in zen optimized kernels. The issue is fixed by calling only reference kernels if the architecture detected at runtime is not zen, zen2 or zen3. AMD-Internal: [CPUPL-1930] Change-Id: Ief57cd457b87542aa1a7bad64dc36c01f0d1a366	2021-11-12 08:58:57 +05:30
mkadavil	d683c224e8	Workaround for perf regression observed for sgemm Details: - Perf regression is observed for certain m,n,k inputs where (m,n,k > 512) and (m > 4 * n) in BLIS 3.1. The root cause was traced to commit `11dfc176a3` where BLIS_THREAD_RATIO_M was updated from 2 to 1. This change was not part of BLIS 3.0.6 and hence resulted in the new perf drop in 3.1. - This workaround updates the m dimension (doubles it) that is passed as argument to bli_rntm_set_ways_for_op which is used to determine the ic,jc work split in the threads. The BLIS_THREAD_RATIO_M is not updated (to 2) and rather the effect is induced using the doubled m dimension. AMD-Internal: [CPUPL-1909] Change-Id: I3b6ec4d4a22154289cb56d8f7db4cb60e5f34afe	2021-11-12 08:58:57 +05:30
lcpu	30038af896	Reverted: To fix accuracy issues for complex datatypes Details: -- reverted cscalv,zscalv,ctrsm,ztrsm changes to address accuracy issues observed by libflame and scalapack application testing. -- AMD-Internal: [CPUPL-1906], [CPUPL-1914] Change-Id: Ic364eacbdf49493dd3a166a66880c12ee84c2204	2021-11-12 08:58:57 +05:30
nphaniku	4af525a313	AOCL Windows BLIS : Windows build for dynamic dispatch library Change-Id: Ie05eafbeacbd5589b514d9353517330515104939	2021-11-12 08:58:57 +05:30
mkurumel	366ab66134	DNRM2 : Disable dnrm2 Fast math implementation. Details : - Accuracy failures observed when fast math and ILP64 are enabled. - Disabling the feature with macro BLIS_ENABLE_FAST_MATH . AMD-Internal: [CPUPL-1907] Change-Id: I92c661647fb8cc5f1d0af8f6c4eae0fac1df5f16	2021-11-12 08:58:56 +05:30
Dipal M Zambare	3364c0e4eb	Binary and dynamic dispatch configuration name change -- Reverted changes made to include lp/ilp info in binary name This reverts commit `c5e6f885f0`. -- Included BLAS int size in 'make showconfig' -- Renamed amdepyc configuration to amdzen Change-Id: Ie87ec1c03e105f606aef1eac397ba0d8338906a6	2021-11-12 08:58:56 +05:30
Dipal M Zambare	ae844f475c	Fixed build issue in DTL when only traces are enabled. AMD-Internal: [CPUPL-1691] Change-Id: Idc273666054529db5a2fb96a7d7ebbf7a3f5b008	2021-11-12 08:58:56 +05:30
Harsh Dave	53e1d0539f	Fixed conjugate transpose kernel issue Details: AMD Internal Id: CPUPL-1702 - For the cases of A being of 1x1 dimension and of left and right hand side, A's only element is conjugate transposed by negating its imaginary component. Change-Id: I696ae982d9d60e0e702edaba98acbe9a5b0cd44c	2021-11-12 08:58:56 +05:30
Harsh Dave	2b6faf21a1	Fixed conjugate transpose issue for zscalv and cscalv Details: AMD Internal Id: CPUPL-1702 - While performing the trsm function, A's imaginary part needed to be complemented as per conjugate transpose. - So in the case of conjugate transpose, A's imaginary part is negated before doing trsm. Change-Id: Ic736733a483eeadf6356952b434128c0af988e36	2021-11-12 08:58:55 +05:30
Nageshwar Singh	cbd9ea76af	Complex single standalone gemv implementation independent of axpyf. Details - The axpyf-based implementation incurs function (axpyf) calling overhead. - The new implementation reduces function calling overhead. - This implementation uses a kernel of size 8x4. - This implementation gives better performance for smaller sizes when compared to the axpyf-based implementation. AMD-Internal: [CPUPL-1402] Change-Id: Ic9a5e59363290caf26284548638da9065952fd48	2021-11-12 08:58:55 +05:30
Madan mohan Manokar	d6fcfe7345	gemmt SUP: limit thread count for small sizes 1. Max thread cap added for small dimensions based on the product (n*k). AMD-Internal: [CPUPL-1388] Change-Id: I34412a1374bb58a9c4b3fd8e40949a69006cf057	2021-11-12 08:58:55 +05:30
Harsh Dave	590c763e22	Implemented ctrsm small kernels Details: -- AMD Internal Id: CPUPL-1702 -- Used 8x3 CGEMM kernel with vector fma by utilizing ymm registers efficiently to produce 24 scomplex outputs at a time -- Used packing of matrix A to effectively cache and reuse -- Implemented kernels using macro based modular approach -- Added ctrsm_small for in ctrsm_ BLAS path for single thread when (m,n)<1000 and multithread (m+n)<320 -- Taken care of --disable_pre_inversion configuration -- Achieved 13% average performance improvement for sizes less than 1000 -- modularized all 16 combinations of trsm into 4 kernels Change-Id: I557c5bcd8cb7c034acd99ce0666bc411e9c4fe64	2021-11-12 08:58:55 +05:30
satish kumar nuggu	23278627f4	STRSM small kernel implementation Details: -- AMD Internal Id: [CPUPL-1702] -- Used 16x6 SGEMM kernel with vector fma by utilizing ymm registers -- Used packing of matrix A to effectively cache and reuse -- Implemented kernels using macro based modular approach -- Taken care of --disable_pre_inversion configuration -- modularized strsm 16 combinations of trsm into 4 kernels Change-Id: I30a1551967c36f6bae33be3b7ae5b7fcc7c905ea	2021-11-12 08:58:55 +05:30
Nageshwar Singh	a3d04a21a0	Complex double standalone gemv implementation independent of axpyf. Details - The axpyf-based implementation incurs function (axpyf) calling overhead. - The new implementation reduces function calling overhead. - This implementation uses a kernel of size 4x4. - This implementation gives better performance for smaller sizes when compared to the axpyf-based implementation. AMD-Internal: [CPUPL-1402] Change-Id: I5fa421b8c1d2b44c991c2a05e8f5b01b83eb4b37	2021-11-12 08:58:54 +05:30
Dipal M Zambare	8f310c3384	AOCL DTL - Added thread and execution time details in logs -- Added number of threads used in DTL logs -- Added support for timestamps in DTL traces -- Added time taken by API at BLAS layer in the DTL logs -- Added GFLOPS achieved in DTL logs -- Added support to enable/disable execution time and gflops printing for individual API's. We may not want it for all API's. Also it will help us migrate API's to execution time and gflops logs in stages. -- Updated GEMM bench to match new logs -- Refactored aocldtl_blis.c to remove code duplication. -- Clean up logs generation and reading to use spaces consistently to separate various fields. -- Updated AOCL_gettid() to return correct thread id when using pthreads. AMD-Internal: [CPUPL-1691] Change-Id: Iddb8a3be2a5cd624a07ccdbf5ae0695799d8ae8e	2021-11-12 08:58:54 +05:30
Harsh Dave	a7f600b3a4	Implemented ztrsm small kernels Details: -- AMD Internal Id: CPUPL-1702 -- Used 4x3 ZGEMM kernel with vector fma by utilizing ymm registers efficiently to produce 12 dcomplex outputs at a time -- Used packing of matrix A to effectively cache and reuse -- Implemented kernels using macro based modular approach -- Added ztrsm_small for in ztrsm_ BLAS path for single thread when (m,n)<500 and multithread (m+n)<128 -- Taken care of --disable_pre_inversion configuration -- Achieved 10% average performance improvement for sizes less than 500 -- modularized all 16 combinations of trsm into 4 kernels Change-Id: I3cb42a1385f6b3b82d6c470912242675789cce75	2021-11-12 08:58:54 +05:30
Nallani Bhaskar	40e1fd1860	Enabled processing of zero inputs of x in tbsv to provide NaN Details: The basic idea is that the inverse of a singular matrix doesn't exist, therefore we should be returning NaN. The BLAS standard and BLIS optimize by not doing any compute when x[j] == 0. As a result, BLIS generates finite values for the inverse calculation of singular matrices, which is not the right answer. This commit provides a fix to generate NaN/Inf values in case this API is called to compute inverses of singular matrices. According to the standard, however, this API shouldn't be called in that case in the first place; the check for singularity or near-singularity should be done by the calling application. Change-Id: Iccdbc07744de3892626f4066ee4a63eb30bc06cd	2021-11-12 08:58:53 +05:30
Nageshwar Singh	a263146a4c	Optimized scalv for complex data-types c and z (cscalv and zscalv) AMD-Internal: [CPUPL-1551] Change-Id: Ie6855409d89f1edfd2a27f9e5f9efa6cd94bc0c9	2021-11-12 08:58:53 +05:30
satish kumar nuggu	3bafdf3923	DGEMM kernel implementation for case k = 1. Details : - DGEMM kernel implementation for case k = 1, vectorized with 8x6 block implementation (Rank-1 update in DGEMM Optimization). Change-Id: I7d06378adeb8bcc5b965e2a94314d731629d0b4c	2021-11-12 08:58:53 +05:30
mkurumel	595f7b7edf	dnrm2 optimization with dot method 1. Added new kernel bli_dnorm2fv_unb_var1 to compute the norm with a dot operation. 2. Added vectorization to compute the squares of vector X in blocks of 32 double elements. 3. Defined a new macro BLIS_ENABLE_DNRM2_FAST under the config header to compute nrm2 using the new kernel. 4. The dot kernel definitions and implementation have a possibility of accuracy issues. We can switch to the traditional implementation by disabling the macro BLIS_ENABLE_DNRM2_FAST to compute the L2-norm of vector X. AMD-Internal: [CPUPL-1757] Change-Id: I1adcaf1b3b4e33837758593c998c25705ff0fe11	2021-11-12 08:58:53 +05:30
mkurumel	9f1ce594a5	BLIS : Compiler warning fixes Details : - Fixed warnings with AOCC and GCC compilers. AMD-Internal: [CPUPL-1662] Change-Id: Ia0e298a169d4dd4664b11e03a4e3cd340e9fdfce	2021-11-12 08:58:52 +05:30
Meghana Vankadari	9a5b15da68	Disabled dzgemm_ function call in test_gemm.c Change-Id: I31da12cbaf50cf7fe44baf97de3c39896c4ccfb1	2021-11-12 08:58:52 +05:30
Meghana Vankadari	10ca8710f0	Optimized SUP code for GEMMT Details: - Eliminated the IR loop in ref_var2m functions. - Handled the rectangular and triangular portions of C matrix separately. - Added a condition to check and eliminate zero regions inside IC loop. - modified kc selection logic to choose optimal KC in SUP - Updated thresholds to choose between SUP and native. Change-Id: I21908eaa6bc3a8f37bdea29f7bfca7e6fcfee724	2021-11-12 08:58:52 +05:30
Nageshwar Singh	faeb79f2b9	Trsm bench utility mismatch between DTL logs and bench AOCL-Internal: [CPUPL-1585] Change-Id: I2896d695e6bb40ec39a4f840240499927de16962	2021-11-12 08:58:52 +05:30
Dipal M Zambare	5d287fdba0	Include LP64/ILP64 in BLIS binary name Binary name will be chosen based on multi-threading and BLAS integer size configuration as given below. libblis-[mt]-lp64 - when configured to use 32 bit integers libblis-[mt]-ilp64 - when configured to use 64 bit integers AMD-Internal: [CPUPL-1879] Change-Id: I865023c63235a0a72bdfce7057b2cfb8158b1d87	2021-11-12 08:58:51 +05:30
Dipal M Zambare	a8e47a82c0	Fixed build issue with clang in 'make checkcpp' The makefile in vendor/testcpp folder has hardcoded g++ as cpp compiler and linker. Updated the makefile to take the compiler which is used to build the library. AMD-Internal: [CPUPL-1873] Change-Id: Ib0bcbb8fccd0ff6f90b49b3b1e4a272cf3bad361	2021-11-12 08:58:51 +05:30
Kiran Varaganti	7196b86f05	Removed packm kernels of zen. Intrinsic-optimized packm kernels written for zen are no longer used, so they are removed. Currently packm kernels from the haswell configuration are used for the zen2 and zen3 configs.	2021-11-12 08:58:51 +05:30
Kiran Varaganti	5ecb4fd0db	Removed dead code This sanity check (checking top_index != 0) which has been disabled earlier in bli_pool_reinit() by commenting out bli_abort() was unnecessarily computing top_index - this whole statement is commented out. Change-Id: If296754ca8cba3a69d023d4a7ec891f1cbce1d6a	2021-11-12 08:58:51 +05:30
Kiran Varaganti	09bfe3e372	Improve DGEMM sup for RCR case When m and n are very large values, larger KC is preferred to reduce L3 cache misses, therefore increased KC value to KC0. This improved performance of DGEMM considerably on EPYC processors Kindly note (PACKA and PACKB disabled). Change-Id: I7b3f5b53a01fca0e55bc4479eeb644c7001d3463	2021-11-12 08:58:51 +05:30
Kiran Varaganti	628f4a41b8	Fixed bug in packing API bli_pack_set_pack_b() API wrongly inits pack_a element of rntm object instead of pack_b. Fixed this bug. Change-Id: I267493ab3ff0bade478d1157799a6fa5b51c7970	2021-11-12 08:58:50 +05:30
Madan mohan Manokar	3dda4ebf22	Induced method turned off, fix for beta=0 & C = NAN 1. Induced Method turned off, till the path fully tested for different alpha,beta conditions. 2. Fix for Beta =0, and C = NAN done. Change-Id: I5a7bd1393ac245c2ebb72f9a634728af4c0d4000	2021-11-12 08:58:50 +05:30
Meghana Vankadari	f1291fc957	Disabled dzgemm blas interfaces AMD-Internal: [CPUPL-1825] Change-Id: I786a2ad83be418bd4c55d24e30ae6c3a86fd8844	2021-11-12 08:58:50 +05:30
Nallani Bhaskar	15b7fff159	Fixed reading C when beta=0 in few sgemm asm kernels Description: 1. When beta is zero we should not do any arithmetic operation on C data and should not assume anything about the values of the C matrix. 2. This is already taken care of in all sgemmsup kernels except bli_sgemmsup_rv_zen_asm_2x8 and bli_sgemmsup_rv_zen_asm_3x8, in the case where beta is zero and C is column-stored. Fixed this issue by removing the read of the C matrix in these kernels. 3. When C has NaN or Inf and we multiply NaN or Inf by zero (beta), the result is NaN. Change-Id: I3fb8c0cd37cf1d52a7909f6b402aa9c40c7c3846	2021-11-12 08:58:50 +05:30
Saitharun	ffcb338531	Adding ENABLE_WRAPPER CMAKE OPTION Details: Wrapper code will be enabled when the cmake option ENABLE_WRAPPER is selected. This commit also fixes the ScaLAPACK build error on Windows. AMD-Internal: [CPUPL-1848] Change-Id: I3d687cbc00e7603fdfb45937a00daf86bd07878e	2021-11-12 08:58:49 +05:30
Meghana Vankadari	5c770cafee	Removed syrk_small code - The current implementation of syrk_small computes the entire C matrix rather than computing triangular part. This implementation is not efficient. AMD-Internal: [CPUPL-1571] Change-Id: I9a153207471a55e52634429062d18ba1a225fed9	2021-11-12 08:58:49 +05:30
Meghana Vankadari	7bbb7ee7f2	Added weighted thread distribution for SUP GEMMT/SYRK Change-Id: Ia080b8a76e788d923bb3545b3f8f97e39f85cebf	2021-11-12 08:58:49 +05:30
Dipal M Zambare	d1fb770f1c	Removed duplicate cpp and testcpp folders. The cpp and testcpp folder exists in root directory as well as vendor directory. Only folder in vendor directory are needed. Removed duplicate directories and updated makefiles to pick the sources from vendor folder. AMD-Internal: [CPUPL-1834] Change-Id: I178043a09fd746660938b89ecce73c53d6c53409	2021-11-12 08:58:49 +05:30
Dipal M Zambare	41a86c2463	Enabled znver3 flag for gcc version above 11 -- Added -march=znver3 flag if the library is built for zen3 configuration with gcc compiler version 11 or above. -- Replaced hardcoded compiler names 'gcc' and 'clang' with variable $CC so that options are chosen as per the compiler specified at configure time (instead of compiler in path). AMD-Internal: [CPUPL-1823] Change-Id: I2659349c998201ebd4480735c544e48a5ed76bb4	2021-11-12 08:58:48 +05:30
Dipal M Zambare	f698afc567	Updated version number to 3.1.0 AMD-Internal: [CPUPL-1811] Change-Id: I6b485e7622e526791094ae621d9f84d2526e6569	2021-11-12 08:58:48 +05:30
Saitharun	5d6012c50d	Added additional symbols for BLIS APIs using wrapper functions Details: BLIS currently supports BLAS and CBLAS interfaces with lowercase symbol names. With this commit, we also support uppercase with and without a trailing underscore, and lowercase without a trailing underscore. Change-Id: Ibb06121821ab937b25d492409625916f542b2135	2021-11-12 08:58:48 +05:30
Abhiram S	7787bc79b1	Level1 samaxv: AVX512 implementation Details: 1. Unrolled by a factor of 5. This gave around 1 GFLOPS gain. 2. Changed CMP to subs and remove nan. CMP uses many compares, which have higher latency and need more instructions. Replacing it with subs and remove nan reduced it to 3 lighter instructions. 3. Added remove nan function. 4. Added AVX512 definition in skx context. 5. Disabled code in the AMAXV kernel depending on whether the AVX512 flag exists. Change-Id: I191725a55bc33edf8d537156292cf997d6a5fe35	2021-09-27 16:10:08 +05:30
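
Several of the commits listed above (15b7fff159 and 3dda4ebf22) deal with the rule that C must not be read when beta is zero, because 0 * NaN and 0 * Inf both evaluate to NaN. The sketch below illustrates that rule on a simple vector update y := alpha * x + beta * y. It is not BLIS code: axpby_sketch and axpby_naive are hypothetical helper names chosen for this illustration, and the real fix was made in assembly sgemmsup kernels.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (not a BLIS function): y := alpha * x + beta * y.
 * When beta == 0 the old contents of y are never read, so any NaN or
 * Inf already stored in y cannot leak into the result. */
static void axpby_sketch(size_t n, double alpha, const double *x,
                         double beta, double *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (beta == 0.0)
            y[i] = alpha * x[i];               /* overwrite, do not read y */
        else
            y[i] = alpha * x[i] + beta * y[i]; /* general case */
    }
}

/* Naive variant for comparison: always reads y, so with beta == 0
 * a NaN or Inf in y turns into NaN in the output (0 * NaN = NaN). */
static void axpby_naive(size_t n, double alpha, const double *x,
                        double beta, double *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        y[i] = alpha * x[i] + beta * y[i];
}
```

Calling both helpers with beta = 0 on an output buffer that initially holds NaN or Inf shows the difference: the first returns alpha * x, the second returns NaN in every affected element.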


2571 Commits