amd/blis - blis - Public git mirror

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-05-11 17:50:00 +00:00

Author	SHA1	Message	Date
Shubham	1faee9f89e	Fixed 32xk AVX512 double precision pack kernel - Currently the pointer received as function argument is used for packing which causes only a partial copy of input buffer to output buffer due to strange optimizations by compiler. - To fix this, instead of using a normal pointer for output buffer, we define a "restrict" local pointer variable. - "restrict" keyword tells the compiler that the pointer is the only way to access the object pointed by the pointer. - By defining "restrict" local pointer pointing to output buffer, the mysterious problem of incomplete copy has been solved. Change-Id: Ie2355beb1d43ff4b60b940dd88c4e2bf6f361646	2023-02-20 21:18:54 -05:00
Arnav Sharma	46965dfc57	Developed 6x64 SGEMM row-preferred kernels - Added kernels for all rv and rd variants. - Main kernel is of size 6x64, and the associated fringe kernels added are - 4x64, 2x64, 1x64 - 6x32, 4x32, 2x32, 1x32 - 6x16, 4x16, 2x16, 1x16 - Updated the zen4 config to enable these kernels in zen4 path. - Added C-prefetching to 6x? row-stored main kernels. - C-prefetching for column storage yet to be added. - K-loop unrolling for fringe kernels yet to be added. AMD-Internal: [CPUPL-3002] Change-Id: Ide18412cc6178b43a12a3bc7a608ce9d298fb2e4	2023-02-19 23:56:58 -05:00
Harihara Sudhan S	535b49f150	Vectorized COPYV for double complex - Added ZCOPYV kernel that uses AVX2 and SSE instructions for vectorization. - The routine returns early when the vector dimension is zero. - The kernel takes one among the two available paths based on conjugation requirement of X vector. - VZEROUPPER is added before transitioning from AVX2 to SSE. - Added function pointer to ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts. AMD-Internal: [CPUPL-2773] Change-Id: Ibd8a2de42060716395ef698d753c8462654cc0f0	2023-02-16 09:32:10 -05:00
Shubham	18569b42ee	Added DTRSM small RUNN/RLTN variant AVX512 kernels - 8x8 kernels are used for DTRSM SMALL - Matrix A(a10) is packed for GEMM operations. - Packed matrix A will be re-used in all the col-block along N-dimension. - Diagonal elements of A matrix are packed(a11) for TRSM operations. - Implemented fringe cases with following block sizes 8x8, 8x4, 8x3, 8x2, 8x1 4x8, 4x4, 4x3, 4x2, 4x1 3x8, 3x4, 3x3, 3x2, 3x1 2x8, 2x4, 2x3, 2x2, 2x1 1x8, 1x4, 1x3, 1x2, 1x1 AMD-Internal: [CPUPL-2745] Change-Id: I6a174e7f88a4c2c5778052525879552a1e82f6ad	2023-02-16 06:49:46 -05:00
Harihara Sudhan S	8d7593a940	AVX2 vectorized SCALV for double complex - Added ZSCALV that uses AVX2 and SSE instructions for vectorization. - Return early when the vector dimension is zero. When alpha is 1 there is no need to perform computation hence return early. - When alpha is zero expert interface of ZSETV is invoked. In this case, all the elements of the input vector are set 0. - Invocation of expert interface means that NULL pointer can be passed to the function in place of context. Expert interface of ZSETV will query the context and get the appropriate function pointer. - Added BLAS interface for ZSCALV. The architecture ID is used to decide the function that is to be invoked. - Created a new macro INSERT_GENTFUNCSCAL_BLAS_C to instantiate SCALV BLAS macro interface only for single complex type and single complex, float mixed type AMD-Internal: [CPUPL-2773] Change-Id: I0d6995bce883c0ebdc5da0046608fc59d03f6050	2023-02-05 09:36:12 -05:00
Harihara Sudhan S	18ae57305e	ZAXPYF4 optimization - Vectorized alpha scaling of X vector using SSE instructions. This can be done irrespective of incx. - Added code to prefetch A matrix and Y vector to L1 cache - Vectorized fringe case computation and non-unit stride computation with SSE instructions. - Increased unroll in unit stride cases for better register utilization. AMD-Internal: [CPUPL-2773] Change-Id: I217e6ce9e3f5753ebe271c684abd9a2274fd2715	2023-02-04 12:34:50 -05:00
Kiran Varaganti	201db7883c	Integrated 32x6 DGEMM kernel for zen4 and its related changes are added. Details: - Now AOCL BLIS uses AVX512 - 32x6 DGEMM kernel for native code path. Thanks to Moore, Branden <Branden.Moore@amd.com> for suggesting and implementing these optimizations. - In the initial version of 32x6 DGEMM kernel, to broadcast elements of B packed we perform load into xmm (2 elements), broadcast into zmm from xmm and then to get the next element, we do vpermilpd(xmm). This logic is replaced with direct broadcast from memory, since the elements of Bpack are stored contiguously, the first broadcast fetches the cacheline and then subsequent broadcasts happen faster. We use two registers for broadcast and interleave broadcast operation with FMAs to hide any memory latencies. - Native dTRSM uses 16x14 dgemm - therefore we need to override the default blkszs (MR,NR,..) when executing trsm. we call bli_zen4_override_trsm_blkszs(cntx_local) on a local cntx_t object for double data-type as well in the function bli_trsm_front(), bli_trsm_xx_ker_var2, xx = {ll,lu,rl,ru}. Renamed "BLIS_GEMM_AVX2_UKR" to "BLIS_GEMM_FOR_TRSM_UKR" and in the bli_cntx_init_zen4() we replaced dgemm kernel for TRSM with 16x14 dgemm kernel. - New packm kernels - 16xk, 24xk and 32xk are added. - New 32xk packm reference kernel is added in bli_packm_cxk_ref.c and it is enabled for zen4 config (bli_dpackm_32xk_zen4_ref() ) - Copyright year updated for modified files. - cleaned up code for "zen" config - removed unused packm kernels declaration in kernels/zen/bli_kernels.h - [SWLCSG-1374], [CPUPL-2918] Change-Id: I576282382504b72072a6db068eabd164c8943627	2023-01-19 23:11:36 +05:30
Kiran Varaganti	d8d4499e54	AVX2 dgemm kernel optimization for AOCC Details: k0 is always positive in bli_dgemm_haswell_asm_6x8(), the operation involved with k0 is typecasted to uint64_t to enable AOCC to generate optimized code. Thanks to Jini Susan (jinisusan.george@amd.com) from compiler team for suggesting this change. Similar change was applied to sgemm, cgemm and zgemm kernels. Change-Id: I423c949e0c1835652142a6931dadf4a7d190aeb9	2023-01-09 07:49:41 -05:00
Edward Smyth	82c2eb4e8e	Code cleanup and warnings fixes Corrections for some occurrences of: - Compiler warnings about initialization of float from double - Spelling mistakes in comments - Incorrect indentation of code and comments AMD-Internal: [CPUPL-2870] Change-Id: Icb68c789687bd0684844331d43071bfffecac9fc	2023-01-09 04:34:52 -05:00
Vignesh Balasubramanian	dbd0c069d4	Implemented 3x4 RD kernels for ZGEMM in SUP path -Implemented (r)ow preferential (d)ot product milli-kernels (m and n variants) for dcomplex datatype along SUP path. -These computational kernels extend the support for handling RRC and CRC storage schemes along the SUP path. In case of BLAS api call, it corresponds to the input cases with transa equal to T and transb equal to N. -In case of the B matrix being packed(conditionally), the inputs are redirected to the existing (r)ow preferential (v)ector load optimized kernels due to better performance. -Added macro for vhsubpd assembly instruction, to support the arithmetic for complex datatype in its interleaved storage. AMD-Internal: [CPUPL-2593] Change-Id: If90834e55e9e31aa87d3d5b711efad9ef2458da8	2022-12-22 18:31:46 +05:30
Nithya V S	6e9defe1b5	Developed AVX512 based "sup" kernels for SGEMM - Only for RRR case, var2m kernels are added - Main kernel is of 12x32 (AVX512), associated fringe kernels of - 8x32, 4x32, 2x32, 1x32 (AVX512) - 12x16, 8x16, 4x16, 2x16, 1x16 (AVX512) - 12x8, 8x8 (AVX2) - 12x4, 8x4 (SSE4) - 12x2, 8x2 (SSE4) - existing AVX2/SSE4 kernels are used for other fringe cases - Currently, these kernels are not invoked in zen4 path - Once all AVX512 kernels (n and rd) are done, invoke all of them together in zen4 config AMD-Internal: [CPUPL-2801] Change-Id: I7a206fee9151e92319d83dcc5f3eed61d3bf1196	2022-12-16 10:18:09 +05:30
Eleni Vlachopoulou	13aa3c8cd0	Adding AVX2 support for SNRM2 and SCNRM2 - For the cases where AVX2 is available, an optimized function is called, based on Blue's algorithm. The fallback method based on sumsqv is used otherwise. - Scaling is used to avoid overflow and underflow. - Works correctly for negative increments. AMD-Internal: [SWLCSG-1080] Change-Id: I6bf2f42652ba6b8a8631a0a9e6f6297d5b3ea5d9	2022-12-14 04:25:45 -05:00
Edward Smyth	7f86561d26	BLIS-Nov2022: HPL memory issues with GCC. HPL script was using BLIS manual way to set threading, i.e. setting BLIS_IC_NT explicitly. This causes bli_rntm_num_threads() to return -1, which wasn't trapped in parallelised BLAS1 and BLAS2 routines. Fix: if this occurs, set local number of threads based on product of BLIS_JC_NT * BLIS_PC_NT * BLIS_IC_NT * BLIS_JR_NT * BLIS_IR_NT values. Note: BLIS_PC_NT should always be 1, but this environment variable is currently being read (contrary to documentation), so include it for now. Other changes: * implement _Pragma convention in all code used on AMD * frame/2/gemv/bli_gemv_unf_var1_amd.c: Remove is_omp_mt_enabled flag AMD-Internal: [CPUPL-2803] Change-Id: I37e8b038e5640d6693a87be0609888186322b465	2022-12-06 05:10:34 -05:00
Edward Smyth	34730a1e4c	BLIS: Nested parallelism issues 1. Check OpenMP active level against max active levels when setting number of threads for starting a new parallel region in ./frame/thread/bli_thread.c to ensure the correct number of threads is used when BLIS is called within nested OpenMP parallelism. 2. In subsequent BLIS calls, threading choices could be incorrectly set based on values used and stored in global_rntm by a previous call. This could apply when the OpenMP number of threads differ from call to call, different nested parallelism is used in different parts of a user's code, or different threads at the user level request different numbers of OpenMP threads for BLIS calls. Keep threading information in both global_rntm and a new Thread Local Storage copy tl_rntm. Update tl_rntm from OpenMP runtime environment (as appropriate) during bli_init_auto() calls in each BLIS routine. The details are: * global_rntm is initialized on first BLIS call based on OpenMP and BLIS threading environment variables. * global_rntm is updated by any BLIS threading function calls. * In bli_thread_update_tl(), called by bli_init_auto(), sync with any BLIS values set or updated in global_rntm. Then, if BLIS threading control is not used, check OpenMP ICVs and set thread count and auto_factor appropriately. * Setting BLIS threading locally (using expert interfaces to pass a user defined rntm data structure) should work as before. 3. bli_thread_get_is_parallel can now only be called outside of parallelism within BLIS routines. Change calls in trsm to reflect this. 4. Ensure blis_mt is set to TRUE in bli_thread_init_rntm_from_env() if any BLIS_*_NT environment variables are set. 5. Set auto_factor = FALSE when the number of threads is 1. 6. bli_rntm_set_num_threads() and bli_rntm_set_ways() set blis_mt=TRUE. 7. Set blis_mt=FALSE in BLIS_RNTM_INITIALIZER and bli_rntm_init(). 8. For debugging, internal information on the rntm threading data can be printed by defining "PRINT_THREADING" at the top of bli_rntm.h 9. bli_rntm_print() now also prints the value of blis_mt. 10. Function prototypes in bli_rntm.h moved to top of file, so that bli_rntm_print() can be used within inline functions defined in this header file. 11. Comment out bli_init_auto() and bli_finalize_auto() calls in Fortran interfaces in frame/compat/blis/thread/b77_thread.c 12. In frame/3/bli_l3_sup_int_amd.c move two calls to set_pack_a and set_pack_b functions outside of the auto_factor if statements. 13. Misc code tidying. AMD-Internal: [CPUPL-2433] Change-Id: I8342c37fb4e280118e5e55164fbd6ea636f858ee	2022-10-21 07:38:39 -04:00
Harihara Sudhan S	42d631bced	Copyright modification - Added copyright information to modified/newly created files missing them Change-Id: If4e73b680246d0363de09587d6dc54bee00ecd71	2022-10-14 12:43:35 +05:30
Shubham Sharma	2d72d4c862	DTRSM Native path AVX 512 kernels are developed - Added DTRSM AVX512 kernels for lower and upper variants in the native path. - Changes in framework are made to accommodate these kernels. AMD-Internal: [CPUPL-2588] Change-Id: I1f74273ef2389018343c0645870290373ce25efe	2022-10-01 15:58:24 +00:00
Nallani Bhaskar	f1caa8e630	Fixed initialization issues in ddot multithread implementation Description: 1. n value per thread and offset are defined outside omp loop which can vary for each thread. Moved them to inside the omp loop so that the output 2. Fixed few compilation warnings in dnrm2 avx2 implementation AMD Internal:[CPUPL-2606] Change-Id: Ifba9b3707c3c1a66f31b5e1906ecb68eabef4f81	2022-10-01 14:43:39 +05:30
Mangala V	e440cbc91a	Fixed ASAN reported issues in bli_dgemmsup_rd_haswell_asm_6x8m Address sanitizer reports error when rbp register is modified. Register rbp was stored with rs_a which was used during prefetch of Matrix A. Usage of rbp is avoided by using rcx register as a temporary storage register. Hence rcx is updated with Matrix C address before storing the computed data. This fix addresses the issue reported by GEQP3 API of libflame AMD-Internal: [CPUPL-2587] Change-Id: Ica790259010d8e71528c3d0ab1cd49069c56fc1d	2022-09-30 11:52:29 -04:00
Eleni Vlachopoulou	863b73dfaf	Adding AVX2 support for DZNRM2 - For the cases where AVX2 is available, an optimized function is called, based on Blue's algorithm. The fallback method based on sumsqv is used otherwise. - Scaling is used to avoid overflow and underflow. - Works correctly for negative increments. - Cleaned up some white space in the AVX2 implementation for DNRM2. AMD-Internal: [CPUPL-2551] Change-Id: I0875234ea735540307168fe7efc3f10fe6c40ffc	2022-09-30 07:51:04 -04:00
Arnav Sharma	90f915d3a9	Vectorized and parallelized zdscal routine - Implemented optimized intrinsic kernel for zdscalv for the cases where AVX2 is supported. - Also added multithreaded support for the same. - The optimal number of threads is being calculated on the basis of input size. AMD-Internal: [CPUPL-2602] Change-Id: I4d05c3b1cc365a7770703286a89c6dce3875c067	2022-09-30 06:11:07 -04:00
satish kumar nuggu	9c292b79e2	Fixed ASAN reported issues in [s/c]trsm small kernels Details: 1. Fixed the memory access partial overflows for the variables AlphaVal,ones reported by ASAN. 2. Using 128 bit packed broadcast with the 64 bit data types after type casting would cause the garbage data to be filled in the destination register. 3. Fixed this issue by using set_ps instruction instead of broadcast. 4. In cases of n remainder being 1, extra elements were accessed that could cause out of memory access. Removed the extra element access. AMD-Internal: [CPUPL-2578][CPUPL-2587] Change-Id: Iaa918060c66287f2f46bcb9f69e9323f6707cf75	2022-09-30 03:02:07 -04:00
Eleni Vlachopoulou	1c1a0027a8	Bugfix in DNRM2 AVX path Description: Enabled DNRM2 AVX path and fixed bug that caused numerical accuracy errors. AMD-Internal: [CPUPL-2576] Change-Id: Ic9fda9d9668bdfe233621f79db6acce518b4d10e	2022-09-29 04:46:04 -04:00
satish kumar nuggu	754627eae9	Addressed uninitialized variables in trsm small algo 1. Addressed uninitialized variables reported in coverity for all datatypes of trsm small algo. AMD-Internal: [CPUPL-2542] Change-Id: Ifae57ef6435493942732526720e6a9d6bec70e71	2022-09-29 01:49:22 -04:00
Mangala V	42434fbc31	Bug fix in TRSM small for S, D, C & Z datatype in case of MT 1. Corrected B buffer accessing to access by its offset instead of starting address which is required in case of MT. 3. When num_threads > 1, B buffer is divided into blocks in m or n dimension based on side right or left. Hence need to access by its offset to access starting of the block. 4. Currently B Matrix is divided into blocks for each thread and complete matrix A is used by all threads. In case of design change in future, modified A buffer accessing by its offset to support partition of matrix A for MT AMD-Internal:[CPUPL-2520] Change-Id: Ic09e9e945417b86e2bc2e2d4548f65db308cd2ea	2022-09-23 03:30:25 -04:00
Eleni Vlachopoulou	a5891f7ead	Adding AVX2 support for DNRM2 - For the cases where AVX2 is available, an optimized function is called, based on Blue's algorithm. The fallback method based on sumsqv is used otherwise. - Scaling is used to avoid overflow and underflow. - Works correctly for negative increments. AMD-Internal: [CPUPL-2551] Change-Id: I5d8976b29b5af463a8981061b2be907ea647123c	2022-09-20 06:05:01 -04:00
Nallani Bhaskar	ea79efa915	Fixed out of bound memory access in sgemmsup zen rv kernels Details: 1. In sgemmsup_zen_rv_?x2 kernels "vmovps" instruction is used to load B matrix in k loop and k last loop, which loads 128 bits into xmm rather than the expected 64 bits. 2. Changed vmovps instruction to vmovsd instructions which load only 64 bits into the xmm register 3. Avoided C memory access by vfma instruction when multiplying with non-beta at corner cases with required access to 128 bit which leads to out of bound. Replaced with vmovq first to get 64 bit data then performed vfma on xmm register in rv_6x8m and rv_6x4m AMD-Internal: [CPUPL-2472] Change-Id: Iad397f8f5b5cc607b4278b603b1e0ea3f6b082f2	2022-08-30 01:13:14 -04:00
Arnav Sharma	eb83a0fe9d	Enabled ZHER Optimized Path - While calculating the diagonal and corner elements, the combined operation of calculating the product of x and x hermitian and simultaneously scaling it with alpha and adding the result to the matrix was the cause of increased underflow and overflow errors in netlib tests. - So the above calculation is now being done in three steps: scaling x vector with alpha, then calculating its product with x hermitian and later adding the final result to the matrix. AMD-Internal: [CPUPL-2213] Change-Id: I32df572b013bc3189340662dbf17eddcaec9f0f8	2022-08-29 08:09:42 -04:00
Dipal M Zambare	d3b503bbf2	Code cleanup and warnings fixes - Removed all compiler warnings as reported by GCC 11 and AOCC 3.2 - Removed unused files - Removed commented and disabled code (#if 0, #if 1) from some files AMD-Internal: [CPUPL-2460] Change-Id: Ifc976f6fe585b09e2e387b6793961ad6ef05bb4a	2022-08-29 15:15:40 +05:30
satish kumar nuggu	2114a43df8	Fixes to avoid Out of Bound Memory Access in TRSM small algorithm Details: 1. Fixed the issues corresponding to Out of bound memory access during load and store. 2. In Intrinsic code: i. AVX2 Registers can hold 4 double elements. ii. In case of remainder when number of elements is less than vectorised register. Though the required number of elements are less than 4, we are reading and writing in chunks of 4 elements due to vectorization. This might cause out of bound memory access. 3. Redesigned code to restrict out of bound access by loading and storing the exact number of elements required. AMD-Internal: [SWLCSG-1470] Change-Id: I786f8023cf5a5f3e5343bea413c59bd0e764df9b	2022-08-26 01:06:10 -04:00
satish kumar nuggu	219c41ded9	ZTRSM Improvements Details: 1. Optimized ztrsm for small sizes upto 500 in multi thread scenarios. 2. Enabled multithreading execution for bli_trsm_small implementation for double complex data type. 3. Added decision logic to choose between native vs multi-threaded small path for sizes upto 500 and threads upto 8. AMD-Internal: [CPUPL-2340] Change-Id: I4df9d7e6ee152baa9cf33e58d36e1c17f75a00c1	2022-08-20 05:08:47 -04:00
Shubham Sharma	b8b339416a	DGEMMT optimizations Details: 1. For lower and upper, "B" column major storage variants of gemmt, new kernels are developed and optimized to compute only the required outputs in the diagonal blocks. 2. In the previous implementation, all the 48 outputs of the given 6x8 block of C matrix are computed and stored into a temporary buffer. Later,the required elements are copied into the final C output buffer. 3. Changes are made to compute only the required outputs of the 6x8 block of C matrix and directly stored in the final C output buffer. 4. With this optimization, we are avoiding copy operation and also reducing the number of computations. 5. Customized bli_dgemmsup_rd_haswell_asm_6x8m Kernels specific to compute Lower and Upper Variant diagonal outputs have been added. 6. SUP Framework changes to integrate the new kernels have been added. 7. These kernels are part of the SUP framework. AMD-Internal: [CPUPL-2341] Change-Id: I9748b2b52557718e7497ecf046530d3031636a63	2022-08-19 12:31:35 -04:00
Shubham Sharma	32c9239c7f	Optimization of DGEMMT SUP kernels Details: 1. Optimized the kernels by replacing the macros with the actual computation of required output elements. AMD-Internal: [CPUPL-2341] Change-Id: Ieefb80ac9b2dc2955b683710e259cf45d581e1b5	2022-08-18 08:30:19 -04:00
Shubham Sharma	8adef27aca	Optimization of DGEMMT SUP kernel for beta zero cases. Details: 1. In kernels for non-transpose variants, changes are made to optimize the cases of beta zero. 2. Validated the changes with BLIS Testsuite, GTestSuite(Functionality, Valgrind, Integer Tests) and Netlib Tests. 3. Fixed warnings during the build process. AMD-Internal: [CPUPL-2341] Change-Id: I8bb53ad619eb2413c999fe18eafd67c75fe1f83a	2022-08-18 08:05:58 -04:00
Harsh Dave	e2e1dadee1	DGEMM Improvements - We prefetch next panel while packing 8xk panel. - Modified prefetch offsets for dgemm native and dgemm_small kernel. AMD-Internal: [CPUPL-2366] Change-Id: Ife609e789c8b87169c73bb0a30d6f1af20fb30ed	2022-08-16 08:07:30 -04:00
satish kumar nuggu	88e44c64e3	Fixed Memory Leaks in TRSM 1. Fixed the memory leaks in corner cases which were caused by extra loads in all datatypes(s,d,c,z). 2. In remainder cases, instead of loading the required number of elements, extra elements were loaded, which led to memory leaks. Fixed memory leaks by restricting number of loads to required number of elements. AMD-Internal: [CPUPL-2280] Change-Id: Ia49a02565e01d5ed05e98090b7773a444587cd8a	2022-08-16 05:08:40 -04:00
Shubham Sharma	f5ef30a44a	Fix in DGEMMT SUP kernel Details: 1. Due to error in C output buffer address computation in kernel bli_dgemmsup_rv_haswell_asm_6x8m_6x8_L, invalid memory is being accessed. This is causing seg fault in libflame netlib testing. 2. Validated the fix with libflame netlib testing. AMD-Internal: [CPUPL-2341] Change-Id: I9ca0cf09cf2d177ade73f840054b5028eae3a0ed	2022-08-12 21:03:36 +05:30
Shubham Sharma	4bca7f6f4a	DGEMMT optimizations Details: 1. For lower and upper, non-transpose variants of gemmt, new kernels are developed and optimized to compute only the required outputs in the diagonal blocks. 2. In the previous implementation, all the 48 outputs of the given 6x8 block of C matrix are computed and stored into a temporary buffer. Later,the required elements are copied into the final C output buffer. 3. Changes are made to compute only the required outputs of the 6x8 block of C matrix and directly stored in the final C output buffer. 4. With this optimization, we are avoiding copy operation and also reducing the number of computations. 5. Kernels specific to compute Lower and Upper Variant diagonal outputs have been added. 6. SUP Framework changes to integrate the new kernels have been added. 7. These kernels are part of the SUP framework. AMD-Internal: [CPUPL-2341] Change-Id: I0ec8f24a0fb19d9b1ef7254732b8e09f06e1486a	2022-08-11 06:16:33 -04:00
Mangala.V	8504ef013d	Optimisation of DTRSM and ZTRSM 1. Extract instruction replaced with cast when accessing first 128bit, as cast inst needs no cycle but extract takes few cycles 2. Added prefetch of A buffer when computing gemm operation 3. Added prefetch of C11 buffer before TRSM operation, with offset of 7 to cs_c With above changes performance improvements observed in case of Single thread Change-Id: Id377c490ddac8b06384acfa9a6d89dbe11bbc7be	2022-08-11 01:39:40 -04:00
Devin Matthews	76fbf1233d	Add vzeroupper to Haswell microkernels. (#524) Details: - Added vzeroupper instruction to the end of all 'gemm' and 'gemmtrsm' microkernels so as to avoid a performance penalty when mixing AVX and SSE instructions. These vzeroupper instructions were once part of the haswell kernels, but were inadvertently removed during a source code shuffle some time ago when we were managing duplicate 'haswell' and 'zen' kernel sets. Thanks to Devin Matthews for tracking this down and re-inserting the missing instructions. Change-Id: I418fea9fed27ba3ad7d395cf96d1be507955d8e9	2022-08-01 09:29:04 +05:30
Field G. Van Zee	2a81437bd8	Fixed bugs in cpackm kernels, gemmlike code. Details: - Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the kappa scalar was incorrectly loaded at an offset of 8 bytes (instead of 4 bytes) from the real component. This was almost certainly a copy-paste bug carried over from the corresponding zpackm kernels. Thanks to Devin Matthews for bringing this to my attention. - Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and bls_gemm_bp_var2.c that initializes the elements of the temporary microtile to zero. (This bug was never observed in output but rather noticed analytically. It probably would have also manifested as intermittent failures, this time involving edge cases.) - Minor commented-out/disabled changes to testsuite/src/test_gemm.c relating to debugging. Change-Id: I899e20df203806717fb5270b5f3dd0bf1f685011	2022-08-01 09:11:58 +05:30
Devin Matthews	9495401b73	Fix more copy-paste errors in the haswell gemmsup code. Fixes #486. Change-Id: I568386b5d67a698ea9c0b6b17f133df86c2894bd	2022-07-31 21:36:21 +05:30
Devin Matthews	ea163fc23b	Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell. The fix is to use the same (valid) source register twice in the horizontal addition. Change-Id: I96ed39e289aaeeb44be9117074b32bd8d4c19de6	2022-07-31 21:15:28 +05:30
Field G. Van Zee	faff30b46a	Fixed out-of-bounds bug in sup s6x16m haswell kernel. Details: - Fixed another out-of-bounds read access bug in the haswell sup assembly kernels. This bug is similar to the one fixed in `17b0caa` and affects bli_sgemmsup_rv_haswell_asm_6x2m(). Thanks to Madeesh Kannan for reporting this bug (and a suitable fix) in #635. - CREDITS file update. Change-Id: I10ccf4d4f471d93e8c8cc4df422c686438fb04e9	2022-07-31 21:10:58 +05:30
Field G. Van Zee	4b1663213c	Fixed out-of-bounds read in haswell gemmsup kernels. Details: - Fixed memory access bugs in the bli_sgemmsup_rv_haswell_asm_Mx2() kernels, where M = {1,2,3,4,5,6}. The bugs were caused by loading four single-precision elements of C, via instructions such as: vfmadd231ps(mem(rcx, 0*32), xmm3, xmm4) in situations where only two elements are guaranteed to exist. (These bugs may not have manifested in earlier tests due to the leading dimension alignment that BLIS employs by default.) The issue was fixed by replacing lines like the one above with: vmovsd(mem(rcx), xmm0) vfmadd231ps(xmm0, xmm3, xmm4) Thus, we use vmovsd to explicitly load only two elements of C into registers, and then operate on those values using register addressing. Thanks to Daniël de Kok for reporting these bugs in #635, and to Bhaskar Nallani for proposing the fix). - CREDITS file update. Change-Id: Ib525c36bcbf20b2bbbe380da3d74d142b338fe9b	2022-07-31 21:06:08 +05:30
Nallani Bhaskar	1d31386c02	Fixed few out of bound memory reads in sgemmsup kernels Details: Fixed memory access bugs in the bli_sgemmsup_rd_zen_asm_s1x16() kernel. The bugs were caused by loading four single-precision elements of C, via instructions such as: vfmadd231ps(mem(rcx, 0*32), ymm3, ymm4) or vfmadd231ps(mem(rcx, 0*32), xmm3, xmm4) in situations where only two elements are guaranteed to exist. (These bugs may not have manifested in earlier tests due to the leading dimension alignment that BLIS employs by default.) The issue was fixed by replacing lines like the one above with: vmovsd(mem(rcx), xmm0) vfmadd231ps(xmm0, xmm3, xmm4) Thus, we use vmovsd to explicitly load only two elements of C into registers, and then operate on those values using register addressing. AMD_CPUPLID: CPUPL-2279 Change-Id: Ic39290d651f5218b2e548351a87ac5e4b5b79c68	2022-07-29 09:09:12 -04:00
Vignesh Balasubramanian	808d79a610	Implemented efficient ZGEMM algorithm when k=1 Problem statement : To improve the performance of the zgemm kernel for dealing with input sizes with k=1 by fine tuning its previous implementation. In the previous implementation, usage of SIMD parallelism along m and n dimensions instead of the k dimension proved to provide a better performance to the zgemm kernel. This code was subjected to further improvements along the following lines: - Cases to deal with alpha=0 and beta!=0 (i.e. just scaling of C) were handled at the beginning separately, using the bli_zscalm api. - Register blocking was further improved, resulting in the kernel size to increase from 4x5 to 4x6. - Prefetching was added to the code, by empirically finding out a suitable value to be added to the pointer. Overall, it provided a mild improvement to the performance. - Conditional statements were removed from the kernel loop, and a logic was deduced to allow such removal without affecting the output. The performance improvement of this single threaded implementation also proved to compete with that of the default implementation for multiple threads, as long as m and n are under 128. An improvement to this patch would be to find out a suitable feature which would establish a relationship between the number of threads and the input size constraints, thereby providing a unique size constraint for different number of threads. AMD-Internal: [CPUPL-2236] Change-Id: I3d401c8fd78bec80ce62eef390fa85e6287df847	2022-07-28 02:09:45 -04:00
Kiran Varaganti	eff436c653	Bug Fix to replace vzeroall Fixed syntax in AVX512 dgemm native kernel. zen4 configuration follows Intel ASM syntax whereas other AMD configs follow AT&T ASM syntax. Bug was introduced due to following AT&T syntax in AVX512 dgemm kernel. In this commit we changed the syntax to Intel ASM format. src and dst operands are interchanged. Change-Id: Ie61dc7c5e8309b79437d471331318f3104bcd447	2022-07-22 03:42:17 -04:00
Kiran Varaganti	86134c7278	Replaced vzeroall Replaced vzeroall instruction with vxorpd and vmovapd for dgemm kernels -both AVX2 and AVX512. vzeroall is expensive instruction and replaced it with faster version of zeroing all registers. vzeroupper() instruction is also added at the end of AVX2 kernels to avoid any AVX2/SSE transition penalties. Kindly note only the main kernels are modified. Change-Id: Ieb9bc629db01f0f94dd0e8e55550940d3d7eb2a4	2022-07-20 01:16:59 -04:00
Vignesh Balasubramanian	2ad25a7180	ZGEMM kernel performance improvement for k=1 sizes: The current implementation for handling zgemm exploits SIMD parallelism along the k dimension. This would give great performance in cases of k being large. But for input sizes with k=1, it is better to exploit SIMD parallelism along the m and n dimensions, thereby giving better performance. This commit does the same through loop reordering, by loading column vectors from A. AMD-Internal: [CPUPL-2236] Change-Id: Ibfa29f271395497b6e2d0127c319ecb4b883d19f	2022-06-30 07:19:52 -04:00
Dipal M Zambare	00c5048f65	Fixed high impact static analysis issues Initialized ymm and xmm registers to zero to address uninitialized variable errors reported in static analysis. AMD-Internal: [CPUPL-2078] Change-Id: Icfcc008a0f244278efd8145d7feef764ed5fcc04	2022-06-27 08:19:26 +00:00
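
Commit 7f86561d26 above describes a fallback for the case where the thread-count query returns -1 because threading was set manually through the BLIS_*_NT variables: the local thread count is taken as the product of the five per-loop values. The following is only a minimal sketch of that arithmetic, not the code from the commit. The names `ways_product` and `effective_num_threads` are hypothetical, and the five values are assumed to be passed in the order JC, PC, IC, JR, IR, already read from the environment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: multiply the five per-loop "ways" values
   (assumed order: JC, PC, IC, JR, IR). Values below 1 are treated
   as 1 so that an unset or invalid entry does not zero the product. */
static long ways_product(const long ways[5])
{
    assert(ways != NULL);
    long product = 1;
    for (size_t i = 0; i < 5; ++i) {
        long w = (ways[i] < 1) ? 1 : ways[i];
        product *= w;
    }
    return product;
}

/* Hypothetical wrapper: `reported` stands in for the value returned
   by the library's thread-count query. A positive value is used
   as-is; otherwise (for example -1) fall back to the product of the
   per-loop ways, as the commit message describes. */
static long effective_num_threads(long reported, const long ways[5])
{
    if (reported > 0) {
        return reported;
    }
    return ways_product(ways);
}
```

With the ways set to {2, 1, 4, 1, 1} and a reported count of -1, the sketch yields 8 threads; a positive reported count is returned unchanged.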


513 Commits