All AMD-specific optimizations in BLIS are enclosed in the BLIS_CONFIG_EPYC
preprocessor macro; this was not defined in the CMake build, resulting in
overall lower performance.
Updated version number to 3.1.1
Change-Id: I9848b695a599df07da44e77e71a64414b28c75b9
- Altered the framework to use 2 more fused kernels for
better problem decomposition
- Increased unroll factor in AXPYF5 and AXPYF8 kernels
to improve register usage
AMD-Internal: [CPUPL-1970]
Change-Id: I79750235d9554466def5ff93898f832834990343
- Removed the BLIS_CONFIG_EPYC macro
- Code that depended on this macro is handled in
  one of three ways:
  -- It is updated to work across platforms.
  -- Architecture/feature-specific runtime checks are added.
  -- It is duplicated in AMD-specific files. The build system is updated to
     pick the AMD-specific files when the library is built for any of the
     Zen architectures.
AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
- Optimized the dotxf implementation for the double-precision
  and single-precision complex datatypes by computing the dot
  products in 2x6 and 4x6 tiles, handling 6 columns at a time
  and rows in multiples of 2 and 4.
- The dot product computation is arranged so that multiple rho
  vector registers hold the partial results until the end of the
  loop; a final horizontal addition then produces the dot product
  result.
- Corner cases are handled serially.
- Vector registers are reused optimally for faster computation.
AMD-Internal: [CPUPL-1975]
Change-Id: I7dd305e73adf54100d54661769c7d5aada9b0098
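The multiple-accumulator scheme described above can be sketched in scalar C. This is an illustrative sketch only, not the actual vectorized BLIS kernel; the function name and fuse factor of 6 are assumptions for the example. Each of the 6 fused columns keeps one running "rho" accumulator until the end of the row loop, mirroring how the kernel keeps one rho vector register per column.

```c
#include <stddef.h>

/* Hypothetical scalar sketch of a fused dotxf with fuse factor 6:
 * y[j] := beta*y[j] + alpha * dot(A(:,j), x) for 6 columns of a
 * column-major m-by-6 panel of A (leading dimension lda). */
void dotxf6_sketch(int m, double alpha, const double *a, int lda,
                   const double *x, double beta, double *y)
{
    double rho[6] = {0};                 /* one accumulator per column */
    for (int i = 0; i < m; ++i)          /* rows */
        for (int j = 0; j < 6; ++j)      /* 6 fused columns */
            rho[j] += a[i + (size_t)j*lda] * x[i];
    for (int j = 0; j < 6; ++j)          /* scale and combine at the end */
        y[j] = beta*y[j] + alpha*rho[j];
}
```

The vectorized kernel does the same thing with SIMD registers, reducing each rho register horizontally only once after the loop.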
- Implemented her2 framework calls for the transposed and
  non-transposed kernel variants.
- The dher2 kernel operates on 4 columns at a time. It computes the
  4x4 triangular part of the matrix first; the remainder is
  computed in chunks of 4x4 tiles up to m rows.
- Remainder cases (m < 4) are handled serially.
AMD-Internal: [CPUPL-1968]
Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313
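For reference, the rank-2 update that dher2 performs on the lower triangle can be sketched in scalar C. This is a minimal sketch of the mathematical operation (A := A + alpha*(x*y' + y*x'), lower triangle only, real double case); the function name is hypothetical and the actual kernel processes 4x4 tiles as described above.

```c
#include <stddef.h>

/* Scalar sketch of a real her2-style update on the lower triangle of
 * a column-major n-by-n matrix A (leading dimension lda). */
void dher2_lower_sketch(int n, double alpha, const double *x,
                        const double *y, double *a, int lda)
{
    for (int j = 0; j < n; ++j)
        for (int i = j; i < n; ++i)   /* lower triangle only */
            a[i + (size_t)j*lda] += alpha*(x[i]*y[j] + y[i]*x[j]);
}
```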
- Optimized the axpy2v implementation for the double
  datatype by handling rows in multiples of 4 and storing
  the final computed result only at the end of the
  computation, avoiding unnecessary stores and improving
  performance.
- Vector registers are reused optimally for faster computation.
AMD-Internal: [CPUPL-1973]
Change-Id: I7b8ef94d0f67c1c666fdce26e9b2b7291365d2e9
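The fused axpy2v operation above (z := z + alphax*x + alphay*y) can be sketched in scalar C; this is an illustrative sketch under the structure described in the commit (rows in multiples of 4, serial remainder), not the actual vectorized kernel, and the function name is an assumption.

```c
/* Scalar sketch of a fused daxpy2v: one pass over z applies both
 * axpy updates, with a single combined store per element. */
void daxpy2v_sketch(int n, double ax, double ay,
                    const double *x, const double *y, double *z)
{
    int i = 0;
    for (; i + 4 <= n; i += 4)       /* rows in multiples of 4 */
        for (int k = 0; k < 4; ++k)
            z[i+k] += ax*x[i+k] + ay*y[i+k];
    for (; i < n; ++i)               /* remainder handled serially */
        z[i] += ax*x[i] + ay*y[i];
}
```

Fusing the two axpy operations halves the traffic on z compared to two separate axpyv calls.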
- Implemented an alternate method of performing
  multiplication and addition on the double-precision
  complex datatype by separating out the real and
  imaginary parts of the complex numbers.
- Vector registers are reused optimally for faster computation.
AMD-Internal: [CPUPL-1969]
Change-Id: Ib181f193c05740d5f6b9de3930e1995dea4a50f2
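The split real/imaginary multiply-accumulate above follows the usual complex identity (a+bi)(c+di) = (ac - bd) + (ad + bc)i. A minimal scalar sketch, with a hypothetical local struct (the actual kernel keeps these parts in separate vector registers):

```c
typedef struct { double re, im; } dcomplex_t;  /* illustrative local type */

/* r += x*y with the real and imaginary parts computed in separate
 * accumulators, mirroring the split-register layout described above. */
dcomplex_t zmac_split_sketch(dcomplex_t r, dcomplex_t x, dcomplex_t y)
{
    r.re += x.re*y.re - x.im*y.im;
    r.im += x.re*y.im + x.im*y.re;
    return r;
}
```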
Details:
- Intrinsic implementation of axpbyv for AVX2
- Bench written for axpbyv
- Added definitions in zen contexts
AMD-Internal: [CPUPL-1963]
Change-Id: I9bc21a6170f5c944eb6e9e9f0e994b9992f8b539
- Unrolled the loop by a greater factor. Incorporated a switch
  case to choose the unrolling factor according to the input size.
- Removed unused structs.
AMD-Internal: [CPUPL-1974]
Change-Id: Iee9d7defcc8c582ca0420f84c4fb2c202dabe3e7
- Increased the unroll factor of the loop by 15 in SAXPYV
- Increased the unroll factor of the loop by 12 in DAXPYV
- The above changes were made for better register
utilization
Change-Id: I69ad1fec2fcf958dbd1bfd71378641274b43a6aa
- The number of threads is reduced to 1 when the dimensions
  are very small.
- Removed an uninitialized-xmm compilation warning in trsm small.
Change-Id: I23262fb82729af5b98ded5d36f5eed45d5255d5b
- Introduced two new ddotxf functions with lower fuse
factor.
- Changed the DGEMV framework to use new kernels to
improve problem decomposition.
Change-Id: I523e158fd33260d06224118fbf74f2314e03a617
- Implemented hemv framework calls for the lower and upper
  kernel variants.
- The hemv computation is implemented in two parts:
  one part operates on the triangular part of the matrix, and
  the remaining part is computed by the dotxfaxpyf kernel.
- The first part performs the dotxf and axpyf operations on the
  triangular part of the matrix in 8x8 chunks.
  Two separate helper functions are implemented
  for the lower and upper kernels respectively.
- The second part is the fused ddotxaxpyf kernel, which performs
  the dotxf and axpyf operations together on the non-triangular
  part of the matrix in 4x8 chunks.
- The implementation uses cache memory efficiently for
  optimal performance.
Change-Id: Id603031b4578e87a92c6b77f710c647acc195c8e
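The cache benefit of the fused dotxf+axpyf step comes from touching each column block of A only once for both operations. A scalar sketch of that idea (function name and signature are illustrative, not the BLIS API; b is the column fuse factor):

```c
#include <stddef.h>

/* One pass over a column-major m-by-b panel of A both accumulates
 * dot products (into w, the dotxf part) and applies an axpy update
 * (into z, the axpyf part), so A is read from cache only once. */
void dotxaxpyf_sketch(int m, int b, const double *a, int lda,
                      const double *x, const double *y,
                      double *w, double *z)
{
    for (int j = 0; j < b; ++j) {
        double rho = 0.0;
        for (int i = 0; i < m; ++i) {
            rho  += a[i + (size_t)j*lda] * x[i];   /* dotxf part */
            z[i] += a[i + (size_t)j*lda] * y[j];   /* axpyf part */
        }
        w[j] += rho;
    }
}
```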
- Added configuration option for zen4 architecture
- Added auto-detection of zen4 architecture
- Added zen4 configuration for all checks related
to AMD specific optimizations
AMD-Internal: [CPUPL-1937]
Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a
The small gemm implementation is called from the gemmnat path.
When the library is built as multi-threaded, small gemm
is completely disabled.
For the single-threaded build, the crash is fixed by disabling
small gemm on the generic architecture.
AMD-Internal: [CPUPL-1930]
Change-Id: If718870d89909cef908a1c23918b7ef6f7d80f7a
Details:
AMD Internal Id: CPUPL-1702
- While performing the trsm function, A's imaginary
  part needs to be conjugated for the conjugate-transpose
  case.
- So, in the case of conjugate transpose, A's imaginary
  part is negated before doing trsm.
Change-Id: Ic736733a483eeadf6356952b434128c0af988e36
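Negating the imaginary part is exactly an in-place conjugation of A. A minimal sketch of that preprocessing step (the struct and function name are illustrative, not the BLIS types):

```c
#include <stddef.h>

typedef struct { double re, im; } dcx_t;  /* illustrative local type */

/* Conjugate a column-major m-by-n matrix A in place by negating the
 * imaginary part of each element, before a conjugate-transpose trsm. */
void conj_matrix_sketch(int m, int n, dcx_t *a, int lda)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            a[i + (size_t)j*lda].im = -a[i + (size_t)j*lda].im;
}
```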
Details
- The axpyf-based implementation incurs function-call (axpyf) overhead.
- The new implementation reduces this function-call overhead.
- This implementation uses a kernel of size 8x4.
- It gives better performance for smaller sizes when
  compared to the axpyf-based implementation.
AMD-Internal: [CPUPL-1402]
Change-Id: Ic9a5e59363290caf26284548638da9065952fd48
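For context, an axpyf kernel fuses several axpyv operations: y := y + alpha * A * x over a few columns of A in a single pass. A scalar sketch with fuse factor 4 (function name is an assumption, not the BLIS kernel symbol):

```c
#include <stddef.h>

/* Fused axpyf, fuse factor 4: one pass over y instead of four
 * separate axpyv calls, which is where the call-overhead saving
 * described above comes from. A is column-major m-by-4. */
void daxpyf4_sketch(int m, double alpha, const double *a, int lda,
                    const double *x, double *y)
{
    for (int i = 0; i < m; ++i) {
        double t = 0.0;
        for (int j = 0; j < 4; ++j)
            t += a[i + (size_t)j*lda] * x[j];
        y[i] += alpha * t;
    }
}
```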
Details:
-- AMD Internal Id: CPUPL-1702
-- Used 8x3 CGEMM kernel with vector fma by utilizing ymm registers
efficiently to produce 24 scomplex outputs at a time
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Added ctrsm_small in the ctrsm_ BLAS path for single-threaded
   runs when (m,n) < 1000 and multi-threaded runs when (m+n) < 320
-- Taken care of the --disable_pre_inversion configuration
-- Achieved 13% average performance improvement for sizes less than 1000
-- Modularized all 16 combinations of trsm into 4 kernels
Change-Id: I557c5bcd8cb7c034acd99ce0666bc411e9c4fe64
Details:
-- AMD Internal Id: [CPUPL-1702]
-- Used 16x6 SGEMM kernel with vector fma by utilizing ymm registers
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Taken care of the --disable_pre_inversion configuration
-- Modularized all 16 combinations of strsm into 4 kernels
Change-Id: I30a1551967c36f6bae33be3b7ae5b7fcc7c905ea
Details
- The axpyf-based implementation incurs function-call (axpyf) overhead.
- The new implementation reduces this function-call overhead.
- This implementation uses a kernel of size 4x4.
- It gives better performance for smaller sizes when
  compared to the axpyf-based implementation.
AMD-Internal: [CPUPL-1402]
Change-Id: I5fa421b8c1d2b44c991c2a05e8f5b01b83eb4b37
Details:
-- AMD Internal Id: CPUPL-1702
-- Used 4x3 ZGEMM kernel with vector fma by utilizing ymm registers
efficiently to produce 12 dcomplex outputs at a time
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Added ztrsm_small in the ztrsm_ BLAS path for single-threaded
   runs when (m,n) < 500 and multi-threaded runs when (m+n) < 128
-- Taken care of the --disable_pre_inversion configuration
-- Achieved 10% average performance improvement for sizes less than 500
-- Modularized all 16 combinations of trsm into 4 kernels
Change-Id: I3cb42a1385f6b3b82d6c470912242675789cce75
Details :
- DGEMM kernel implementation for the k = 1 case, vectorized with an 8x6 block (rank-1 update optimization in DGEMM).
Change-Id: I7d06378adeb8bcc5b965e2a94314d731629d0b4c
1. Added a new kernel, bli_dnorm2fv_unb_var1, to compute the
   norm with a dot operation.
2. Added vectorization to compute the squares of 32-double-element
   blocks of vector X.
3. Defined a new macro, BLIS_ENABLE_DNRM2_FAST, under the config header
   to compute nrm2 using the new kernel.
4. The dot-based kernel has a possibility of accuracy issues; we can
   switch back to the traditional implementation of the L2-norm of
   vector X by disabling the macro BLIS_ENABLE_DNRM2_FAST.
AMD-Internal: [CPUPL-1757]
Change-Id: I1adcaf1b3b4e33837758593c998c25705ff0fe11
Details:
- Eliminated the IR loop in ref_var2m functions.
- Handled the rectangular and triangular portions of C matrix
separately.
- Added a condition to check and eliminate zero regions inside IC loop.
- Modified the kc selection logic to choose an optimal KC in SUP.
- Updated thresholds to choose between SUP and native.
Change-Id: I21908eaa6bc3a8f37bdea29f7bfca7e6fcfee724
The intrinsic-optimized packm kernels written for zen are no longer used,
so they are removed. Currently the packm kernels from the haswell
configuration are used for the zen2 and zen3 configs.
1. Induced method turned off until the path is fully tested for different alpha/beta conditions.
2. Fixed the beta = 0 and C = NaN case.
Change-Id: I5a7bd1393ac245c2ebb72f9a634728af4c0d4000
Description:
1. When beta is zero we should not perform any arithmetic operation
   on the C data, and should not assume anything about the values of
   the C matrix.
2. This is already taken care of in all sgemmsup kernels except
   bli_sgemmsup_rv_zen_asm_2x8 and bli_sgemmsup_rv_zen_asm_3x8,
   for the case of beta zero with column-stored C. Fixed
   this issue by removing the reads of the C matrix in these kernels.
3. When C has NaN or Inf, multiplying NaN or Inf by zero
   (beta) still yields NaN.
Change-Id: I3fb8c0cd37cf1d52a7909f6b402aa9c40c7c3846
- The current implementation of syrk_small computes the entire C matrix
  rather than only the triangular part. This implementation is not
  efficient.
AMD-Internal: [CPUPL-1571]
Change-Id: I9a153207471a55e52634429062d18ba1a225fed9
Details:
1. Unrolled by a factor of 5. This gave around a 1 GFLOPS gain.
2. Replaced CMP with subs and remove-NaN. The CMP approach uses many
   compare instructions, which have higher latency; replacing it
   with subs and remove-NaN reduced this to 3 lighter instructions.
3. Added a remove-NaN function.
4. Added the AVX512 definition in the skx context.
5. Guarded code in the AMAXV kernel depending on whether the AVX512
   flag exists.
Change-Id: I191725a55bc33edf8d537156292cf997d6a5fe35
Details:
- Developed damaxv for the AVX512 extension
- Implemented a removeNAN function that converts NaN values
  to negative values based on their location
- Avoided use of COMPARE256/COMPARE128 in the AVX512
  implementation for better performance
- Unrolled the loop by a factor of 4.
Change-Id: Icf2a3606cf311ecc646aeb3db0628b293b9a3326
1. Left Lower non-trans, Left Upper trans
2. Left Upper non-trans, Left Lower trans
3. Right Lower non-trans, Right Upper trans
4. Right Upper non-trans, Right Lower trans
Change-Id: I0b0155d7c3a55ec74d53c8f1f49f1bceb63b15f5
Details:
- Adding threshold function pointers to cntx gives flexibility to choose
different threshold functions for different configurations.
- In the case of a fat binary, where the configuration is decided at
  run time, adding threshold functions under a macro enables them for
  all the configs in a family. This can be avoided by adding function
  pointers to the cntx, which can be queried at run time based on the
  chosen config.
Change-Id: Iaf7e69e45ae5bb60e4d0f75c7542a91e1609773f
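The function-pointer-in-context pattern described above can be sketched as follows. The type and struct names here are illustrative, not the actual BLIS cntx_t API; the point is that each config installs its own threshold function and the framework queries it at run time.

```c
/* Hypothetical sketch of per-config threshold dispatch via a function
 * pointer stored in the context. */
typedef int (*small_gemm_thresh_ft)(int m, int n, int k);

typedef struct { small_gemm_thresh_ft is_small; } cntx_sketch_t;

/* Example threshold a config might register (values are made up). */
static int zen3_thresh(int m, int n, int k)
{
    return (long)m * n * k < 64L * 64 * 64;
}

/* Framework-side query: no compile-time macro, just a run-time call. */
int use_small_gemm(const cntx_sketch_t *cntx, int m, int n, int k)
{
    return cntx->is_small ? cntx->is_small(m, n, k) : 0;
}
```

This keeps fat binaries correct: only the config actually selected at run time contributes its threshold logic.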
Improved DGEMM performance for smaller sizes. AOCL DYNAMIC is incorporated at the BLAS interface to
call bli_dgemm_small when the optimal number of threads implied is 1 (for n and k < 10).
Improved the smart threading logic for dgemm.
Added conditions at the BLAS interface to invoke bli_dgemm_small.
Removed the N > 3 condition from bli_dgemm_small.
Change-Id: Id751528dfe9de37800b02ffaf765b6c82487093e
Description:
1. While processing remainder cases in the bli_trsm_small algorithm,
   a few loads and stores accessed memory beyond the given matrix
   buffer because of vectorized instructions.
2. Modified the 256-bit vector loads at the edges into 128-bit or
   64-bit loads/stores so that no read/write happens beyond the
   matrix boundary.
AMD-Internal: [CPUPL-1759] [SWLCSG-819]
Change-Id: Iba51d0ed9bb28d1b0948a219755b8dbcc86a7fa9
1. 3m_sqp support for A matrix with conjugate_no_transpose and conjugate_transpose added.
AMD-Internal: [CPUPL-1521]
Change-Id: Ie6e5c49cf86f7d3b95d78705cf445e57f20b3d1f
1. kx partitions added to k loop for dgemm and zgemm.
2. mx loop based threading model added for dgemm as prototype of zgemm.
3. nx loop added for 3m_sqp and dgemm_sqp.
4. single 3m_sqp workspace allocation with smaller memory footprint.
5. sqp framework done from dgemm and zgemm.
6. sqp kernels moved to a separate kernel file.
7. residue kernel core added to handle mx<8.
8. multi-instance tuning for 3m_sqp done.
9. user can set env "BLIS_MULTI_INSTANCE" to 1 for better multi-instance behavior of 3m_sqp.
AMD-Internal: [CPUPL-1521]
Change-Id: Ibef50a8a37fe99f164edb4621acb44fc0c86514c
Details:
1. Added prefetching of the next micro-panels of A and B in the dgemm
   block, which helps reduce load latency and improves performance.
2. Removed unnecessary unrolls in the gemm loops and moved the 8x6 and
   6x8 core dgemm into macros, making it more modular.
3. Modularized the packing and diagonal packing in the main dgemm
   loops. Fringe cases are yet to be modularized.
4. Updated the dtrsm small thresholds for single- and multi-threaded cases.
5. Updated div/scale based on whether trsm pre-inversion is disabled/enabled.
6. Code cleanup.
Change-Id: I5de16805ff050a31d2b424bb3f6ae0a4019332df
1. Added support in the CMake scripts for linking libomp for the BLIS multithreading build.
2. Added ${CMAKE_CURRENT_SOURCE_DIR}/bli_axpyf_zen_int_6.c to the blis/kernels/zen/1f CMake file to build the newly added file.
3. Added new macros in blis/frame/include/bli_macro_defs.h for ENABLE_NO_UNDERSCORE_API support for the gemm_batch and axpby APIs.
4. Modified the file open mode from binary to text in blis/testsuite/src/test_libblis.c to avoid line-ending issues on different OSes.
5. Added the definition of the macro BLIS_DISABLE_TRSM_PREINVERSION in the main CMakeLists.txt file.
AMD Internal : [CPUPL-1630]
Change-Id: Iba1b7b6d014a4317de7cbaf42f812cad20111e4f
Details:
1. Adding prefetch in the gemm module gives an average gain of 10% in dtrsm right cases.
2. For skinny sizes with m <= 2000 and n <= 1000, performance is equivalent to MKL.
Change-Id: I6a5f4b676aa133eb71edb249eccc4644d97da605
Details:
- Implemented a zaxpyf kernel with fuse factor 4 for zgemv.
- Modified the BLAS interface call for zgemv to reduce framework overhead.
- Directed gemv to dotv when the dimension of the y vector is 1.
- When alpha = 0, gemv reduces to scalv of Y with beta. Added code to
  return early after scaling the Y vector with beta.
AMD-Internal: [CPUPL-1402]
Change-Id: I2231285fe3060982d4434466346a040b7ab803fc
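The fast-path dispatch above can be sketched in scalar C. Shown for real doubles for simplicity (the commit applies the same logic to zgemv); the function name is illustrative and the y-dimension-1 dotv redirection is noted only in a comment.

```c
#include <stddef.h>

/* gemv sketch, y := beta*y + alpha*A*x, column-major m-by-n A.
 * Fast path: when alpha == 0 the operation degenerates to scalv of
 * y by beta, so we scale and return early.
 * (A y-dimension of 1 similarly degenerates to a single dotv.) */
void gemv_dispatch_sketch(int m, int n, double alpha, const double *a,
                          int lda, const double *x, double beta, double *y)
{
    if (alpha == 0.0) {                    /* early exit: scalv only */
        for (int i = 0; i < m; ++i) y[i] *= beta;
        return;
    }
    for (int i = 0; i < m; ++i) {          /* general path */
        double t = 0.0;
        for (int j = 0; j < n; ++j)
            t += a[i + (size_t)j*lda] * x[j];
        y[i] = beta*y[i] + alpha*t;
    }
}
```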
Details:
1. Added optimized dtrsm kernels for all 8 right-side cases.
   Below are a few notable optimizations which improved performance:
   a. Loading, transposing (for transa cases), packing and reusing
      the a01 block required for the GEMM operation. The block size
      increases from 0 to 6x(n-6) in steps of 6x6 while solving TRSM
      from one end of the triangular A to the other.
   b. Packing the 6 diagonal elements in one location helped utilize
      cache lines efficiently.
AMD-Internal: [CPUPL-1563]
Change-Id: Iabd37536216d5215fc69ee1f8ec671b52f1be9d3