amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-12 18:15:37 +00:00

Author	SHA1	Message	Date
Dipal Madhukar Zambare	0cb552c8f8	Merge "Updated memory pool implementation to contain buffers of different sizes." into amd-staging-milan-3.1	2021-08-12 09:01:02 -04:00
Dipal M Zambare	2eede504b5	Updated memory pool implementation to contain buffers of different sizes. -- In the existing design, the memory pool supports buffers of only one size. This size is determined at compile time to support the buffer needed for the biggest block size and data type. However, the size calculation is not generic as it considers sizes only for GEMM. Also, it assumes that the buffer will not be used for any purpose other than packing operations. -- If a new buffer is requested whose size is bigger than the existing size, the pool is re-initialized to contain buffers of the new size; however, this is done only if the pool is empty. If the pool is not empty, the execution is aborted. This is undesirable, as in multithreaded scenarios it is possible that different threads need buffers of different sizes at the same time (i.e. while another thread is still in the middle of an operation). -- This commit removes the restriction of a single buffer size: when a new buffer is requested whose size is bigger than the existing one, no re-init is done. A new buffer of the required size is allocated, added to the pool, and returned to the user. Change-Id: I20acdb60eb06ab2e53366d51713aa83c4b2df0da	2021-08-12 10:46:27 +05:30
Kiran Varaganti	adfd569591	DGEMM Optimizations Improve DGEMM performance for smaller sizes. AOCL DYNAMIC is incorporated at blas interface to enable calling bli_dgemm_small when optimum number of threads implied is 1 for (n and k < 10). Improved smart threading logic for dgemm, Additional conditions at the blas interface added to invoke bli_dgemm_small. Removed N > 3 condition from bli_dgemm_small. Change-Id: Id751528dfe9de37800b02ffaf765b6c82487093e	2021-08-10 12:34:43 -04:00
Meghana Vankadari	170719e647	Fixed few bugs in GEMMT for non-zen configs Details: - Added a check condition in GEMMT native path to choose update_triang routines based on whether the kernels are row-preferred or column preferred. - Moved the zen-specific SUP thresholds under BLIS_CONFIG_EPYC macro. For non-zen architectures, it falls back to L3 SUP thresholds. - Modified SUP code path to always choose 2m variant. 1n variant is not implemented for GEMMT. Change-Id: Ifdd55815c588f645e337de80b5b9d1864f6b5dd3	2021-08-10 02:38:17 -04:00
Nallani Bhaskar	29e7f08eb2	Updated aocldtl logging with fflush Description: 1. Added fflush after fprintf in aocldtl to avoid missing any log info when the application crashes 2. Removed one redundant if def statement Change-Id: I79a645060fa36bd8c7e5eaca4d9183dc944329ea	2021-08-09 19:14:50 +05:30
Nallani Bhaskar	40a5a614b2	Updated vector loads/stores in remainder cases to avoid access beyond the matrix boundary Description: 1. While processing remainder cases in the bli_trsm_small algorithm, a few loads and stores were accessing beyond the given matrix buffer because of vectorized instructions. 2. Modified 256bit vector loads at edges into 128bit or 64 bit loads/stores such that no read/write happens beyond the matrix boundary. AMD-Internal: [CPUPL-1759] [SWLCSG-819] Change-Id: Iba51d0ed9bb28d1b0948a219755b8dbcc86a7fa9	2021-08-09 10:51:26 +05:30
Madan mohan Manokar	4b90ae3112	single instance zgemm tuning 1. single instance case sup is enabled. 2. Env BLIS_SINGLE_INSTANCE should be set to 1 to enable single instance tuning. AMD-Internal: [CPUPL-1743] Change-Id: Iadb05a6e9313ac41271c0522da243fd47d80abec	2021-07-29 13:36:14 +05:30
Manideep Kurumella	33fd2f7398	Merge " BLIS : DGEMV performance improvement for incy/incx greater than 1" into amd-staging-milan-3.1	2021-07-21 10:23:47 -04:00
mkurumel	ce75a86e9b	BLIS : DGEMV performance improvement for incy/incx greater than 1 Details : - Added packing Of Y for incy >1 cases for dgemv_unf_var2. - Added packing Of X for incx >1 cases for dgemv_unf_var1. AMD-Internal: [SWLCSG-735] Change-Id: Ib395f478ba984a85533e4f79b3521d0b2500c30c	2021-07-21 17:55:05 +05:30
Chandrashekara K R	a906ffa8c6	AOCL-Windows: Update BLIS build system. 1. Added the compiler flags for the clang-cl compiler to build blis multithreading using openmp library. 2. Updated format of presenting version string. AMD Internal : [CPUPL-1630] Change-Id: I979de541fa57415c08c20b0d5b684ae6bd242d19	2021-07-20 18:56:52 +05:30
Madan mohan Manokar	d3542ff0e0	3m_sqp conjugate support added 1. 3m_sqp support for A matrix with conjugate_no_transpose and conjugate_transpose added. AMD-Internal: [CPUPL-1521] Change-Id: Ie6e5c49cf86f7d3b95d78705cf445e57f20b3d1f	2021-07-05 18:44:55 +05:30
Meghana Vankadari	4e246b20c7	Merge "Fixed a bug in Level-3 bench files" into amd-staging-milan-3.1	2021-07-04 23:41:17 -04:00
Dipal M Zambare	333fe4ca8b	Makefile cleanup Removed unused function rm-dupls() from common.mk Removed code from patch-ld-so.py which is not needed for AMD codebase. AMD-Internal: [CPUPL-1539] Change-Id: If1812d5aa87c1e3a9d0c4706d571223d56f2fc20	2021-07-02 01:20:01 -04:00
Dipal M Zambare	d2313bb4e6	Update show config to include missing info. -- Ignore aocl dynamic configuration if multithreading is disabled. AOCL Dynamic will also be disabled in this case. -- Added following configuration settings in showconfig output 1. Complex return scheme 2. TRSM preinversion status 3. AOCL dynamic active status AOCL-Internal: [CPUPL-1565] Change-Id: Id5a31b233fc08dcd871de4a693aab0b2a5d9f1c4	2021-06-29 12:03:47 +05:30
Madan mohan Manokar	70e9d327a2	squarePacked(sqp) framework and multi-instance handling 1. kx partitions added to k loop for dgemm and zgemm. 2. mx loop based threading model added for dgemm as prototype of zgemm. 3. nx loop added for 3m_sqp and dgemm_sqp. 4. single 3m_sqp workspace allocation with smaller memory footprint. 5. sqp framework done from dgemm and zgemm. 6. sqp kernels moved to separate kernel file. 7. residue kernel core added to handle mx<8. 8. multi-instance tuning for 3m_sqp done. 9. user can set env "BLIS_MULTI_INSTANCE" to 1 for better multi-instance behavior of 3m_sqp. AMD-Internal: [CPUPL-1521] Change-Id: Ibef50a8a37fe99f164edb4621acb44fc0c86514c	2021-06-28 15:40:11 +05:30
Meghana Vankadari	cb3a40ab9d	Added blas interface for dzgemm - Added blas interface for dzgemm. This function will call native implementation of gemm. - Mixed datatype support is already present in BLIS. But this implementation requires alpha_imag value to be 0. - Modified test_gemm.c to support testing of dzgemm. Change-Id: I496fffdede9f0f778b9a33b405eb6861c6dcc334	2021-06-27 09:34:18 -04:00
Nallani Bhaskar	650005e6fe	Enabled optional packing of B in sgemm sup Details: - Enabling packing of B helps performance in sgemm when all m,n,k dimensions are above 240 irrespective of the lda alignment. - We may extend this optional enablement further for other skinny types and in case of multithread scenarios. Change-Id: Icb2a21e458cdcb0f8fdce373d8d0860c51be8d21	2021-06-25 15:15:42 +05:30
Dipal M Zambare	fe3384b3c6	Enable AOCL Dynamic feature by default. It can be disabled by configuration option --disable-aocl-dynamic. AOCL-Internal: [CPUPL-1565] Change-Id: I15ea5964dcd479f16dc9edc72957af3bcf4bc0e2	2021-06-22 14:17:52 +05:30
Meghana Vankadari	1944de1cfa	Fixed a bug in Level-3 bench files Details: - BLIS has reserved rs = cs = 1 case only for 1x1 scalars. - For vectors, even though rs = cs = 1 is a valid input, BLIS adjusts the strides to satisfy the error checking. - For an mxn matrix, if m > 1 and n = 1, BLIS sets cs = m to indicate that this is a column vector stored in column major order. Similarly BLIS sets rs = n in case of m = 1 and n > 1. - So determining storage-scheme based on row-stride could lead to errors if one of the matrices becomes vector. - Modified bench files to determine storage scheme based on stor_scheme character instead of checking row-strides. Change-Id: Id2dc0ea11f0e549ce8e49eb2c393442b33851527	2021-06-22 10:38:11 +05:30
Nallani Bhaskar	75f72b7f6e	Added aocl dynamic feature for dtrsm for small sizes Details: 1. Added aocl-dynamic for dtrsm native path When (m,n)<512 better performance observed for nthreads=4 2. Updated trsm_small threshold such that when (m+n)<320 trsm_small is doing better than native irrespective of number of threads Change-Id: Ic2c50f14db257a05e323cc97c5d1c9b73b68f487	2021-06-18 08:46:47 -04:00
Chandrashekara KR	d7377f967c	Merge "AOCL-Windows: Update BLIS build system" into amd-staging-milan-3.1	2021-06-17 08:49:55 -04:00
Kiran Varaganti	d26089c665	Multi-threaded BLIS - OpenMP Apart from "BLIS_NUM_THREADS" or OMP_NUM_THREADS, the number of threads can also be set by the application by calling omp_set_num_threads(int ); In the function "bli_thread_init_rntm_from_env()", when environment variables are not set, the number of threads is inferred by calling the API - omp_get_max_threads(). Now by default if OMP_NUM_THREADS or BLIS_NUM_THREADS are not set - it will run with omp_get_max_threads() threads. This feature is only enabled when BLIS is configured with openmp parallelization. Change-Id: Ic2b48bfcd33368e14758f2bb914c1545f7b0c3e6	2021-06-17 05:17:37 -04:00
Meghana Vankadari	d5ff5e5f50	Added dynamic threading support for SYRK SUP code path Details: - when AOCL dynamic is enabled, the decision to choose ST Vs MT to solve SYRK is taken based on dimensions of matrices. - Decisions to choose optimum number of threads will be updated in the subsequent commits. - Only local copy of rntm is modified by AOCL Dynamic feature. global_rntm data structure remains unchanged in order to keep track of original number of threads set by application. - Added an early-exit condition in bli_nthreads_optimum when nt =1 or nt=-1. This ensures that AOCL dynamic feature is not used when threading is set using BLIS_IC_NT or BLIS_JC_NT. Change-Id: I8bb0d123e006f82b321ba47fe230ab9039742ce0	2021-06-16 02:08:11 -04:00
Nallani Bhaskar	e328bdc549	Added prefetch in left cases of dtrsm small Details: 1. Added prefetching of the next micro-panel of A and B in the dgemm block, which helps reduce load latency and improves performance. 2. Removed unnecessary unrolls in gemm loops and moved 8x6,6x8 core dgemm into macros and made it more modular 3. Packing and diagonal packing in main dgemm loops are modularized. Fringe cases are yet to be modularized. 4. Updated dtrsm small thresholds for single and multi thread cases 5. Updated div/scale based on disable/enable of trsm pre-inversion 6. Code clean up Change-Id: I5de16805ff050a31d2b424bb3f6ae0a4019332df	2021-06-15 23:15:22 +05:30
Chandrashekara K R	f94e3ad237	AOCL-Windows: Update BLIS build system 1. Added support in cmake scripts for linking libomp for blis multithreading build. 2. Added ${CMAKE_CURRENT_SOURCE_DIR}/bli_axpyf_zen_int_6.c statement in blis\kernels\zen\1f cmake file to build newly added file. 3. Added the new macros in blis/frame/include/bli_macro_defs.h for ENABLE_NO_UNDERSCORE_API support for gemm_batch and axpby API's. 4. Modified the file open mode from binary to text mode in blis/testsuite/src/test_libblis.c file to avoid the line ending issue on different OS. 5. Added the definition for the macro BLIS_DISABLE_TRSM_PREINVERSION in main CmakeLists.txt file. AMD Internal : [CPUPL-1630] Change-Id: Iba1b7b6d014a4317de7cbaf42f812cad20111e4f	2021-06-15 16:49:08 +05:30
Kiran Varaganti	c2abbcab96	Fix dgemm_ Multi-thread running as Single Thread Details: When parallelization is enabled in BLIS through environment variables BLIS_?C_NT or BLIS_?R_NT - dgemm_ is running as Single thread. This is fixed. Reason: when OMP_NUM_THREADS or BLIS_NUM_THREADS is not set, the num_threads parameter in rntm is -1 irrespective of BLIS_IC_NT or BLIS_JC_NT values; as a result, the dgemm_ interface assumes single thread and calls small_gemm, which ends up running sequentially. Fix: added a new function bli_thread_is_parallel() in bli_thread.c; it returns 1 if parallelization is enabled either through BLIS_?C_NT values or BLIS_NUM_THREADS. It returns zero if sequential dgemm is needed. This function is called from dgemm_ to decide whether to call parallel dgemm_ or the sequential one. Add fix for zgemm_ also. Change-Id: Ia3064647fdd977cf7531ed52191a5a9704478573	2021-06-15 12:14:11 +05:30
Nageshwar Singh	3002239f83	Added bench utility for swapv API AMD-Internal: [CPUPL-1591] Change-Id: I5619d402db49d1f325e4293f3be7a8bc0dde6f15	2021-06-09 17:05:00 +05:30
Nageshwar Singh	6ca50e1b72	Added bench utility for copyv API AOCL-Internal: [CPUPL-1591] Change-Id: I00ddad565cb87cd9371d7b1df2b57394fef437e0	2021-06-09 12:29:49 +05:30
satish kumar nuggu	8885136786	Added prefetch in gemm module for single threaded dtrsm small for right cases Details: 1. By adding prefetch in gemm module we observed average gain of 10% in dtrsm right cases. 2. For skinny sizes with sizes m<=2000 and n<=1000, performance is equivalent to MKL. Change-Id: I6a5f4b676aa133eb71edb249eccc4644d97da605	2021-06-08 17:39:23 +05:30
Nageshwar Singh	6842c2a30e	Bench trsv logging error Details - Passing enum rather than char for uplo, transa, and diaga - Deleting log file, and other temp files, merged in the codebase from amax AOCL-Internal: [CPUPL-1591] Change-Id: Ife85a388b45659aa608a552d18a65fe828b046b2	2021-06-08 11:54:55 +05:30
Dipal Madhukar Zambare	1638ff7605	Merge "DTL logs corrections" into amd-staging-milan-3.1	2021-06-06 23:20:22 -04:00
Nageshwar Singh	61b7584580	Bench addition for amaxv API AOCL-Internal: [CPUPL-1591] Change-Id: Ia9754dfed1a7302d5c267858f9005c8f64e28b46	2021-06-04 17:45:04 +05:30
Nageshwar Singh	ecfbdd16a8	Added bench utility for trsv API AOCL-Internal: [CPUPL-1591] Change-Id: I5953e13e9c75f620987ea92d92d1b1d7b5bfd043	2021-06-04 08:05:37 -04:00
Dipal M Zambare	2f344f5df1	DTL logs corrections -- Fixed issues in printing the values of side, uploa and diaga parameters for hemm, hemv, her, her2, her2k, herk, symm, symv, syr, syr2, syr2k, syrk, trmm, trmv, trsm, trsv. -- For above API's logging was called with MKSTR() for side, uploa and diaga parameters. MKSTR is needed only for macro arguments but not for function's arguments. -- Added space between function name and data type where it was missing. Bench expects logs in this format. AMD-Internal: [CPUPL-1585] Change-Id: Ib6ab66890e68cfa52860f869d6a1c34e78036a2d	2021-06-04 15:24:13 +05:30
Dipal M Zambare	849e1cee0a	Updated version number to 3.0.1. Change-Id: I07d5c26bb96b590854e1f81d41ed49a5e320f60e	2021-06-03 15:48:05 +05:30
Nagarapu Phanikumar	7ea32e6d0b	Merge " Unifying BLIS Windows and Linux codebase" into amd-staging-milan-3.1	2021-06-03 06:03:26 -04:00
nphaniku	2bdee3cd6c	Unifying BLIS Windows and Linux codebase 1. Removed dependency on bli_config.h inclusion in blis.h 2. Provided AOCL DYNAMIC / TRSM PRE INVERSION / COMPLEX RETURN configuration flags. 3. CMAKE changes to incorporate new changes as per 3.1 code base. 4. Removed zen2 folder from Windows directory. AMD Internal : [CPUPL-1532] Change-Id: I9261851087d10f73ab563d466fa3f7bb72ddee47	2021-06-03 15:28:10 +05:30
mkurumel	9afbb11b4f	DTL Logging bug in GEMV Details : - Fixed Incorrect Macro used in dgemv and cgemv Trace logging exit. AMD-Internal: [CPUPL-1403] Change-Id: Icac502d8d4adad112754d9c764a30d3db56a743f	2021-06-02 21:21:00 +05:30
mkurumel	99e3bce065	SGEMV : single Precision axpyf kernel optimization for SGEMV Details : - Implemented saxpyf kernel with fuse factor=6 for sgemv. AMD-Internal: [CPUPL-1403] Change-Id: I72fd30c08a789603267cf58910138549d45d231a	2021-06-02 07:55:48 -04:00
Nageshwar Singh	2e1a5bc1dd	Optimized double complex axpyf kernel for zgemv Details: - Implemented zaxpyf kernel with fuse factor=4 for zgemv. - Modified BLAS interface call for zgemv to reduce framework overhead. - Directed gemv to dotv in the case where dimension of y vector is 1. - when alpha = 0, gemv becomes scalv of Y with beta. Added code to return early after scaling Y vector with beta. AMD-Internal: [CPUPL-1402] Change-Id: I2231285fe3060982d4434466346a040b7ab803fc	2021-06-01 18:03:29 +05:30
Meghana Vankadari	3804e301c9	Fixed a bug in Level-3 bench files where ldc = 1 Details: - To determine whether matrices are col-stored, we were checking ldc == 1. This is incorrect as a matrix can be col-stored with ldc = 1 if dimension is 1. - Modified the condition to check row_stride instead of col stride. if row-stride != 1, we can assume that matrices are not col-stored and ignore those inputs by printing an error message. Change-Id: Id4d5b971104eb11cbcdd6d22c5c620febefd3a87	2021-06-01 10:57:18 +05:30
Kiran Varaganti	ff84d37930	Merge "SUP GEMM - Enable only block panel (var2m)" into amd-staging-milan-3.1	2021-05-31 06:46:04 -04:00
Meghana Vankadari	887ecb46e0	Added threshold logic for SYRK Details: - Added decision logic to choose between SUP and native implementations of SYRK for zen2 architectures. - For architectures other than zen2 it will be redirected to gemm threshold function. Change-Id: I350578cc4f930e85b9581e4d9aed220e71a2171d	2021-05-31 05:34:38 -04:00
Kiran Varaganti	aa9f5b8b37	SUP GEMM - Enable only block panel (var2m) Completely disabling supvar1n (Panel Block) gemm to simplify things. supvar1n performs better only when m >> and n=k=small (<10). This simplification will improve performance for m = n shape dgemm. Change-Id: I523fcb211e8ab92718ea7367f9707a38275e24b1	2021-05-30 21:22:44 +05:30
Madan mohan Manokar	6d6f746190	3m1 turning OFF since 3m1 is turned off in bla_gemm.c, setting FALSE for 3m1 in bli_l3_ind_oper_st AMD-Internal: [CPUPL-1592] Change-Id: I80dfe7c993f9edfbf752b7351cfdaa22a9e60035	2021-05-26 10:06:54 +05:30
Kiran Varaganti	ae6b6a7b7c	Merge "Fix a bug in bench_gemm.c" into amd-staging-milan-3.1	2021-05-25 05:00:05 -04:00
Meghana Vankadari	4446395047	Redirecting dgemv to axpyf based implementation for smaller sizes. AMD-Internal: [CPUPL-1403] Change-Id: I0ff2763c41c5ae598c58bc250adc317d7f8a4994	2021-05-25 01:39:12 -04:00
satish kumar nuggu	82087773a0	Optimized single threaded dtrsm small for right cases Details: 1. Added optimized dtrsm kernels for all 8 right side cases Below are few notable optimizations which improved performance a. Loading, transposing (for transa cases), packing and reusing of a01 block required for GEMM operation. The block size increases from 0 to 6X(n-6) in steps of 6x6 while solving TRSM from one end of A to other end of triangular A b. Packing of 6 diagonal elements in one location helped to utilize cache line efficiently AMD-Internal: [CPUPL-1563] Change-Id: Iabd37536216d5215fc69ee1f8ec671b52f1be9d3	2021-05-25 01:09:50 -04:00
Meghana Vankadari	8c9a7c21b4	Optimized axpyf kernel for scomplex datatype Details: - Implemented axpyf kernel with fuse factor=4 for scomplex datatype. - Modified BLAS interface call for cgemv to reduce framework overhead. - Directed gemv to dotv in the case where dimension of y vector is 1. - when alpha = 0, gemv becomes scalv of Y with beta. Added code to return early after scaling Y vector with beta. AMD-Internal: [CPUPL-1402] Change-Id: Ibaab078008d76953332ba4da3515993578c0e586	2021-05-24 14:40:17 +05:30
Kiran Varaganti	492f54fb5e	Fix a bug in bench_gemm.c When op(A) or op(B) = transpose - the leading dimensions of these matrices were altered. Commented out the statements "if(transa) lda = ..." and similarly for matrix B, and corrected this mistake in both column and row storages. Added a provision to call BLIS interfaces when row-major inputs are used. Change-Id: Id2041af219a64567471c14190f283274d1df2f7f	2021-05-24 12:59:28 +05:30

2515 Commits