amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-11 17:50:00 +00:00

Author	SHA1	Message	Date
mkadavil	d2713d3dc0	GCC compiler optimization flag (kernel) update for zen3 and zen4 config. -Inefficient assembly is generated for s16 gemm micro-kernel(intrinsics code) when compiled using gcc. The presence of -fschedule-insns + -fschedule-insns2 + -ftree-pre in O2 compiler optimization flags results in the code being optimized to reduce data stalls, and results in the usage of stack to store intermediate C register output. Disabling -ftree-pre in gcc fixes the issue, even in the presence of the other two flags. AMD-Internal: [CPUPL-2971] Change-Id: Ibf0dcde20b5a18708a05faad34e684eb0a9a5463	2023-02-02 23:58:14 -05:00
eashdash	672544bc04	GeLU Activation Function Post-Op for LPGEMM S16, S32 and BF16 1. Added Tanh approximation based GeLU Post-Op for S16, S32 and BF16 2. Changes are done at frame and micro-kernel level to implement this post-op. 3. Efficient AVX-512 and AVX-2 vector versions of TANHF and EXPF functions are implemented for the GeLU post-operation. 4. TANH and EXPF math functions are efficiently implemented in macro-based fashion to exploit register level fusion of GeLU with GEMM operations for improved performance 5. LPGEMM bench is changed to pass GeLU post-op as input and support accuracy check to verify functional correctness AMD-Internal: [CPUPL-2978] Change-Id: I472ac35c00a4ea1ab983cc5f6ff6a123c8035f28	2023-02-02 08:25:04 -05:00
sireesha.sanga	fe999a9653	Removal of Prohibited Source Links as per Legal Scan AMD-Internal: [SWCSD-5866] Change-Id: I03011b114d37fcee8a399af8d631aee610b1f924	2023-01-27 07:56:01 -05:00
sireesha.sanga	71a25a5152	DAXPY Parallelization Details: - DAXPYV is parallelized based on the number of threads set by Application. - Using the empirical data collected on genoa, optimal number of threads is set. - If intended number of threads to be spawned is not equal to actual number of threads, AXPY routine will execute sequentially. Change-Id: I48015a712ada6428056af6320aa5570596a02b1f AMD-Internal: [CPUPL-2747]	2023-01-24 07:32:16 -05:00
Kiran Varaganti	201db7883c	Integrated 32x6 DGEMM kernel for zen4 and its related changes are added. Details: - Now AOCL BLIS uses AVX512 - 32x6 DGEMM kernel for native code path. Thanks to Moore, Branden <Branden.Moore@amd.com> for suggesting and implementing these optimizations. - In the initial version of 32x6 DGEMM kernel, to broadcast elements of B packed we perform load into xmm (2 elements), broadcast into zmm from xmm and then to get the next element, we do vpermilpd(xmm). This logic is replaced with direct broadcast from memory, since the elements of Bpack are stored contiguously, the first broadcast fetches the cacheline and then subsequent broadcasts happen faster. We use two registers for broadcast and interleave broadcast operation with FMAs to hide any memory latencies. - Native dTRSM uses 16x14 dgemm - therefore we need to override the default blkszs (MR,NR,..) when executing trsm. We call bli_zen4_override_trsm_blkszs(cntx_local) on a local cntx_t object for double data-type as well in the function bli_trsm_front(), bli_trsm_xx_ker_var2, xx = {ll,lu,rl,ru}. Renamed "BLIS_GEMM_AVX2_UKR" to "BLIS_GEMM_FOR_TRSM_UKR" and in the bli_cntx_init_zen4() we replaced dgemm kernel for TRSM with 16x14 dgemm kernel. - New packm kernels - 16xk, 24xk and 32xk are added. - New 32xk packm reference kernel is added in bli_packm_cxk_ref.c and it is enabled for zen4 config (bli_dpackm_32xk_zen4_ref() ) - Copyright year updated for modified files. - cleaned up code for "zen" config - removed unused packm kernels declaration in kernels/zen/bli_kernels.h - [SWLCSG-1374], [CPUPL-2918] Change-Id: I576282382504b72072a6db068eabd164c8943627	2023-01-19 23:11:36 +05:30
Harsh Dave	0a699c45f0	Corrected VEXTRACTF64X2 macro in bli_x86_asm_macro file. - Previously VEXTRACTF64X2 macro was defined as vextractf64x4. Change-Id: I79727a85b7d6da3b4d524064e297fc8c71d4f466	2023-01-16 23:04:48 -05:00
mkadavil	4b5e24d0d9	Column major input support for f32 gemm (sgemm for lpgemm). -The f32 gemm framework is modified to swap input column major matrices and compute gemm for the transposed matrices (now row major) using the existing row-major kernels. The output is written to C matrix assuming it is transposed. -Framework changes to support leading dimensions that are greater than matrix widths. AMD-Internal: [CPUPL-2919] Change-Id: I805f1cb9ff934bb3106e01eb74e528915ffb90a3	2023-01-16 04:04:21 -05:00
Sireesha Sanga	540509f374	Enabling AVX2 path for SCNRM2	2023-01-13 10:27:54 +05:30
Eleni Vlachopoulou	758d68467f	Disabling AVX2 path for SNRM2 and SCNRM2. AMD-Internal: [CPUPL-2865] Change-Id: I09c67115801a6b9446c7930c54fc937bd17908a3	2023-01-12 09:58:28 -05:00
Meghana Vankadari	f39dba9fd8	Added dzgemm, DZGEMM, DZGEMM_ prototypes to wrapper file. AMD-Internal: [CPUPL-2199] Change-Id: Ied814cb7be60d30b8217ec42ac436b4e628ea6d2	2023-01-12 01:35:29 -05:00
mkadavil	3870792e62	Low precision gemm s32 downscale optimization. -The post operations attributes are moved to a new struct lpgemm_post_op_attr, and an object of this struct is passed to the low precision gemm kernels in place of the multiple parameters. -The u8s8s32s8 api (downscale api) performance is low when the k value is small (k < KC). Two scenarios are observed here: a. beta = 0: Currently, for downscale api, a temporary buffer is used to accumulate intermediate s32 output, so that it can be used in later iterations of pc loop (k dim). The usage of this buffer (store) can be avoided if k < KC. Here intermediate accumulation is not required, since after the first iteration of the pc loop, the output can be downscaled and stored. b. beta != 0: In this case the existing values of the original s8 C output matrix need to be converted to s32 and beta scaled. Currently the s8 values are converted to s32 and stored in temporary buffer in pc loop (5 loop algorithm) in blocks of mxNC. This temporary buffer is passed to the micro kernel and beta scaling is applied on this. However the mxNC block copy is costly and can be avoided if a new condition is introduced for beta scaling in the micro kernel, whereby the original s8 data is loaded instead of from the temporary buffer to a register, converted to s32 and beta scaling applied on it. AMD-Internal: [CPUPL-2884] Change-Id: Id9b4650d500e1b553e48c4f1e4c902b3f553211c	2023-01-10 13:15:22 +05:30
Kiran Varaganti	d8d4499e54	AVX2 dgemm kernel optimization for AOCC Details: k0 is always positive in bli_dgemm_haswell_asm_6x8(), the operation involved with k0 is typecast to uint64_t to enable AOCC to generate optimized code. Thanks to Jini Susan (jinisusan.george@amd.com) from compiler team for suggesting this change. Similar change was applied to sgemm, cgemm and zgemm kernels. Change-Id: I423c949e0c1835652142a6931dadf4a7d190aeb9	2023-01-09 07:49:41 -05:00
Shubham	9e8595356f	Fixed bug related to block size override for TRSM - In zen4 TRSM and GEMM have different blocksizes, when trsm is called, blocksizes are changed in global cntx object. If GEMM and TRSM are called in parallel, blocksizes in global cntx will not be correct for GEMM which will cause a seg fault. - To fix this, a local copy of cntx is created and blocksizes are changed only in the local copy. AMD-Internal:[CPUPL-2896] Change-Id: I0e724520a92fc3b2ed0becf385ec41ab5d1b4490	2023-01-09 05:30:36 -05:00
Edward Smyth	82c2eb4e8e	Code cleanup and warnings fixes Corrections for some occurrences of: - Compiler warnings about initialization of float from double - Spelling mistakes in comments - Incorrect indentation of code and comments AMD-Internal: [CPUPL-2870] Change-Id: Icb68c789687bd0684844331d43071bfffecac9fc	2023-01-09 04:34:52 -05:00
Meghana Vankadari	fbd9dde869	Fix bug in DZGEMM Details: - In case of dzgemm, if the microkernel prefers column output, We will induce a transposition and perform C += A*B where A (formerly B) becomes complex and B(formerly A) becomes real. Hence attach complex alpha object to A instead of B. - This commit reverts all the changes made by `d62f12a18a` and `0b81f53074` as they are causing failures in make checkblis-md. AMD-Internal: [CPUPL-2893] Change-Id: I56b94ac136fb96003302c568ae2587142c836620	2023-01-09 03:30:57 -05:00
Vignesh Balasubramanian	dbd0c069d4	Implemented 3x4 RD kernels for ZGEMM in SUP path -Implemented (r)ow preferential (d)ot product milli-kernels (m and n variants) for dcomplex datatype along SUP path. -These computational kernels extend the support for handling RRC and CRC storage schemes along the SUP path. In case of BLAS api call, it corresponds to the input cases with transa equal to T and transb equal to N. -In case of the B matrix being packed(conditionally), the inputs are redirected to the existing (r)ow preferential (v)ector load optimized kernels due to better performance. -Added macro for vhsubpd assembly instruction, to support the arithmetic for complex datatype in its interleaved storage. AMD-Internal: [CPUPL-2593] Change-Id: If90834e55e9e31aa87d3d5b711efad9ef2458da8	2022-12-22 18:31:46 +05:30
Nithya V S	6e9defe1b5	Developed AVX512 based "sup" kernels for SGEMM - Only for RRR case, var2m kernels are added - Main kernel is of 12x32 (AVX512), associated fringe kernels of - 8x32, 4x32, 2x32, 1x32 (AVX512) - 12x16, 8x16, 4x16, 2x16, 1x16 (AVX512) - 12x8, 8x8 (AVX2) - 12x4, 8x4 (SSE4) - 12x2, 8x2 (SSE4) - existing AVX2/SSE4 kernels are used for other fringe cases - Currently, these kernels are not invoked in zen4 path - Once all AVX512 kernels (n and rd) are done, invoke all of them together in zen4 config AMD-Internal: [CPUPL-2801] Change-Id: I7a206fee9151e92319d83dcc5f3eed61d3bf1196	2022-12-16 10:18:09 +05:30
Eleni Vlachopoulou	13aa3c8cd0	Adding AVX2 support for SNRM2 and SCNRM2 - For the cases where AVX2 is available, an optimized function is called, based on Blue's algorithm. The fallback method based on sumsqv is used otherwise. - Scaling is used to avoid overflow and underflow. - Works correctly for negative increments. AMD-Internal: [SWLCSG-1080] Change-Id: I6bf2f42652ba6b8a8631a0a9e6f6297d5b3ea5d9	2022-12-14 04:25:45 -05:00
Edward Smyth	2592774fe8	BLIS: Nested Parallelism issues (3) Bugfix for parallel BLAS1 and BLAS2 routines. Threading information was not being set correctly when initializing local rntm from global. Also ensure th_rntm is initialized along with global_rntm by updating it in bli_thread_init(), called by bli_init_once() AMD-Internal: [CPUPL-2433] Change-Id: Iba658f87ae13fe16a57ca1fc279e149b7fa294cf	2022-12-13 12:38:40 -05:00
Eleni Vlachopoulou	88e549e7bd	Using CMake as the build system for both Linux and Windows: - GoogleTest headers removed. GoogleTest gets fetched at configuration time. - BLIS headers removed. A BLIS installation path is required at configuration time. - Windows has been temporarily disabled. AMD-Internal: [CPUPL-2732] Change-Id: I9e55c8e43b2733f96cd8b6e5449d79623decad5c	2022-12-13 19:09:23 +05:30
jagar	cff29bde76	Added gtestsuite folder into blis repo Moved blis gtestsuite from lib-confscript to blis repo (branch: amd-main) Change-Id: If7ad391eef66bac6d26cf5223e6043d52b746072	2022-12-07 23:57:13 -05:00
Edward Smyth	345aacf806	BLIS: Nested parallelism issues (2) Improvements to recent parallelism changes: 1. BLIS specific threading options: In bli_thread_update_rntm_from_env() set threading variables in tl_rntm to serial values when OpenMP level for parallelism within BLIS will not be active. User supplied BLIS threading values remain unchanged in global_rntm. 2. Simplify code structure in bli_thread_update_rntm_from_env(). 3. Change variable declarations in bli_thread_init_rntm_from_env() and bli_thread_update_rntm_from_env() to avoid unused variable warnings in non-OpenMP builds. AMD-Internal: [CPUPL-2433] Change-Id: I5505657e3d2722e69bc4a1c1bb9fd8df55407fdd	2022-12-07 04:34:07 -05:00
Edward Smyth	7f86561d26	BLIS-Nov2022: HPL memory issues with GCC. HPL script was using BLIS manual way to set threading, i.e. setting BLIS_IC_NT explicitly. This causes bli_rntm_num_threads() to return -1, which wasn't trapped in parallelised BLAS1 and BLAS2 routines. Fix: if this occurs, set local number of threads based on product of BLIS_JC_NT * BLIS_PC_NT * BLIS_IC_NT * BLIS_JR_NT * BLIS_IR_NT values. Note: BLIS_PC_NT should always be 1, but this environment variable is currently being read (contrary to documentation), so include it for now. Other changes: * implement _Pragma convention in all code used on AMD * frame/2/gemv/bli_gemv_unf_var1_amd.c: Remove is_omp_mt_enabled flag AMD-Internal: [CPUPL-2803] Change-Id: I37e8b038e5640d6693a87be0609888186322b465	2022-12-06 05:10:34 -05:00
Chandrashekara K R	5f1ea3246a	Updated blis library version string to 4.0.1 Change-Id: I1f3eb9081bcc05ccec300e00860d905b23d9709a	2022-11-24 10:35:34 +05:30
Arnav Sharma	de7b7b8eb8	Fix for ZDSCAL Multi-thread implementation - In the existing implementation, when the actual number of threads spawned is different from the indicated number of threads for the parallel region, partial job is being executed. - Fix added to identify actual number of threads spawned and allocate the work load to single thread in case of discrepancy in the number of threads spawned vs indicated. AMD-Internal: [CPUPL-2761] Change-Id: I9eb9baccaef7f2149346be705ee151ae7af0a0e5	2022-11-22 11:03:07 +05:30
Arnav Sharma	2bc2d11e8a	Fix for DSCAL Multi-thread implementation - In the existing implementation, when the actual number of threads spawned is different from the indicated number of threads for the parallel region, partial job is being executed. - Fix added to identify actual number of threads spawned and allocate the work load to single thread in case of discrepancy in the number of threads spawned vs indicated. AMD-Internal: [CPUPL-2761] Change-Id: Ife36e6e4993bdcc5a506349b54b2177173866e32	2022-11-21 08:49:07 -05:00
Arnav Sharma	69be3b0557	Fix for DDOT Multi-thread implementation - In the existing implementation, when the actual number of threads spawned is different from the indicated number of threads for the parallel region, partial job is being executed. - Fix added to identify actual number of threads spawned and allocate the work load to single thread in case of discrepancy in the number of threads spawned vs indicated. AMD-Internal: [SWLCSG-1659] Change-Id: I882d0af608009e502189a4469eb3483d659c2b3f	2022-11-21 06:17:26 -05:00
Harihara Sudhan S	11c42ce1d3	C matrix prefetch for BF16 GEMM - Broke down the KR loop inside the compute kernel into two pieces - Added C matrix prefetch between the two decomposed pieces of KR loop AMD-Internal: [CPUPL-2693] Change-Id: Ib73bc2145de4c75bc8153d7d7d20fb057270c94e	2022-11-21 04:57:19 -05:00
Harihara Sudhan S	2f6c26eff6	Dynamic block tuning for LPGEMM u8s8s16os16 - NC value is adjusted based on the n dimension of the B matrix. - KC value is adjusted based on the k dimension input matrices. - In order to ensure a handshake between the reorder function and compute function, the same dynamic block tuning logic is called in both places. - Reorder followed by compute scenarios mean that tuning KC value based on MC value and vice versa is not feasible. AMD-Internal: [CPUPL-2638] Change-Id: Ie515da7de83e3dfb59946e873c92d67f11559429	2022-11-21 01:39:35 -05:00
Vignesh Balasubramanian	078b0bc626	Enabling ZGEMM SUP path for single thread -Enabled the sup path for zgemm in case of single thread. -Fine tuned the thresholds for small path in order to handle certain spectrum of skinny matrices. -The inputs are now redirected among small, sup and native paths in case of single thread. AMD-Internal: [CPUPL-2632] Change-Id: I3eca315bcb8bfa8e015349a7b2bc4ebd5448e065	2022-11-04 16:05:48 +05:30
Edward Smyth	34730a1e4c	BLIS: Nested parallelism issues 1. Check OpenMP active level against max active levels when setting number of threads for starting a new parallel region in ./frame/thread/bli_thread.c to ensure the correct number of threads is used when BLIS is called within nested OpenMP parallelism. 2. In subsequent BLIS calls, threading choices could be incorrectly set based on values used and stored in global_rntm by a previous call. This could apply when the OpenMP number of threads differ from call to call, different nested parallelism is used in different parts of a user's code, or different threads at the user level request different numbers of OpenMP threads for BLIS calls. Keep threading information in both global_rntm and a new Thread Local Storage copy tl_rntm. Update tl_rntm from OpenMP runtime environment (as appropriate) during bli_init_auto() calls in each BLIS routine. The details are: * global_rntm is initialized on first BLIS call based on OpenMP and BLIS threading environment variables. * global_rntm is updated by any BLIS threading function calls. * In bli_thread_update_tl(), called by bli_init_auto(), sync with any BLIS values set or updated in global_rntm. Then, if BLIS threading control is not used, check OpenMP ICVs and set thread count and auto_factor appropriately. * Setting BLIS threading locally (using expert interfaces to pass a user defined rntm data structure) should work as before. 3. bli_thread_get_is_parallel can now only be called outside of parallelism within BLIS routines. Change calls in trsm to reflect this. 4. Ensure blis_mt is set to TRUE in bli_thread_init_rntm_from_env() if any BLIS_*_NT environment variables are set. 5. Set auto_factor = FALSE when the number of threads is 1. 6. bli_rntm_set_num_threads() and bli_rntm_set_ways() set blis_mt=TRUE. 7. Set blis_mt=FALSE in BLIS_RNTM_INITIALIZER and bli_rntm_init(). 8. For debugging, internal information on the rntm threading data can be printed by defining "PRINT_THREADING" at the top of bli_rntm.h 9. bli_rntm_print() now also prints the value of blis_mt. 10. Function prototypes in bli_rntm.h moved to top of file, so that bli_rntm_print() can be used within inline functions defined in this header file. 11. Comment out bli_init_auto() and bli_finalize_auto() calls in Fortran interfaces in frame/compat/blis/thread/b77_thread.c 12. In frame/3/bli_l3_sup_int_amd.c move two calls to set_pack_a and set_pack_b functions outside of the auto_factor if statements. 13. Misc code tidying. AMD-Internal: [CPUPL-2433] Change-Id: I8342c37fb4e280118e5e55164fbd6ea636f858ee	2022-10-21 07:38:39 -04:00
Mangala V	1290c27b2a	Revert "Enabled MT path in DTRSM Small" This reverts commit `bdbdd20996`. Reason for revert: Functionality issue in TRSM MT Path on Genoa Change-Id: I197ce4d58641196de9fb32828b3cc91c568b5de7	2022-10-21 05:13:38 -04:00
Harihara Sudhan S	42d631bced	Copyright modification - Added copyright information to modified/newly created files missing them Change-Id: If4e73b680246d0363de09587d6dc54bee00ecd71	2022-10-14 12:43:35 +05:30
Vignesh Balasubramanian	f6185f2e53	Disabling ZGEMM SUP for single thread -Disabled the sup path for zgemm with single thread temporarily until we tune the performance and update the thresholds. The redirection of inputs in zgemm is now among small and native path for single thread. -Fine tuned the input size constraints for entry to small path in zgemm in case of single thread. AMD-Internal: [CPUPL-2495] Change-Id: Ifa19116757ca132d9636c3f2a0661524d2c1d3ec	2022-10-11 08:09:53 -04:00
eashdash	63864d7dfb	Added clipping while downscaling for u8s8s32os8 and u8s8s16os8. Clipping is done during the downscaling of the accumulated result from s32 to s8 for u8s8s32os8 and from s16 to s8 for u8s8s16os8, to saturate the final output values between [-128,127] AMD-Internal: [LWPZENDNN-493] Change-Id: Ica9bba5044e87b815e2b4e35809bf440bb9dd41f	2022-10-11 07:28:06 -04:00
Mangala V	bdbdd20996	Enabled MT path in DTRSM Small 1. Fixed accuracy issues in dtrsm small multi-thread. 2. Fixed out of bound memory accesses in the patch "Fixes to avoid Out of Bound Memory Access in TRSM small algorithm" 3. Re-enabled DTRSM small MT with above fixes. AMD-Internal: [CPUPL-2567] Change-Id: Ibf2949b25fde4007a92cc635526bed0e0d897800	2022-10-10 09:47:45 -04:00
Harihara Sudhan S	492555785a	Fixed bench accuracy issue in LPGEMM Description: - When the values of the result in s8 for u8s8s32 and u8s8s16 are close to 0, they are getting ceiled to 1. - Used nearbyintf to round the downscaled values in bench reference. This fixed the result mismatch issue between the vectorized kernel implementation and reference implementation in bench accuracy test. AMD-Internal: [CPUPL-2617] Change-Id: Ie42d612b1933bf622e6bd80eaf3db4bcb7a3ce82	2022-10-07 09:48:21 +00:00
Edward Smyth	991f2ec0e8	BLIS: Errors in Netlib LAPACK Tests Zen4 kernel bli_damaxv_zen_int_avx512 is causing incorrect results in the netlib LAPACK tests, specifically in: ./xlintstd < ../dtest.in > dtest.out in the TESTING/LIN directory. Given time constraints, i.e. the need to finalize code for AOCL 4.0 release, disable calls to AVX512 kernel (i.e. always use the AVX2 kernel) for now, and aim to correct bli_damaxv_zen_int_avx512 for AOCL 4.1. AMD-Internal: [CPUPL-2590] Change-Id: I2603dd97c3931acb9730563e8126b109ec2b2572	2022-10-03 10:55:49 +00:00
Shubham Sharma	2d72d4c862	DTRSM Native path AVX 512 kernels are developed - Added DTRSM AVX512 kernels for lower and upper variants in the native path. - Changes in framework are made to accommodate these kernels. AMD-Internal: [CPUPL-2588] Change-Id: I1f74273ef2389018343c0645870290373ce25efe	2022-10-01 15:58:24 +00:00
Nallani Bhaskar	f1caa8e630	Fixed initialization issues in ddot multithread implementation Description: 1. n value per thread and offset are defined outside omp loop which can vary for each thread. Moved them to inside the omp loop so that the output 2. Fixed a few compilation warnings in dnrm2 avx2 implementation AMD Internal:[CPUPL-2606] Change-Id: Ifba9b3707c3c1a66f31b5e1906ecb68eabef4f81	2022-10-01 14:43:39 +05:30
Mangala V	e440cbc91a	Fixed ASAN reported issues in bli_dgemmsup_rd_haswell_asm_6x8m Address sanitizer reports an error when the rbp register is modified. Register rbp was stored with rs_a which was used during prefetch of Matrix A. Usage of rbp is avoided by using rcx register as a temporary storage register. Hence rcx is updated with Matrix C address before storing the computed data. This fix addresses the issue reported by GEQP3 API of libflame AMD-Internal: [CPUPL-2587] Change-Id: Ica790259010d8e71528c3d0ab1cd49069c56fc1d	2022-09-30 11:52:29 -04:00
Eleni Vlachopoulou	863b73dfaf	Adding AVX2 support for DZNRM2 - For the cases where AVX2 is available, an optimized function is called, based on Blue's algorithm. The fallback method based on sumsqv is used otherwise. - Scaling is used to avoid overflow and underflow. - Works correctly for negative increments. - Cleaned up some white space in the AVX2 implementation for DNRM2. AMD-Internal: [CPUPL-2551] Change-Id: I0875234ea735540307168fe7efc3f10fe6c40ffc	2022-09-30 07:51:04 -04:00
Chandrashekara K R	e36e13ec78	Parallelized DDOT using OpenMP. Description: 1. Calculate the optimal number of threads required based on input dimension. 2. Request dynamic memory from memory pool to store the each thread output. 3. Create optimal number of threads and divide the given input data among all threads. 4. Accumulate the results in rho variable. 5. Free the dynamically allocated memory back to memory pool. AMD-Internal: [CPUPL-2560] Change-Id: I30982b622d531b29d3570b2f5f7569bd1d951d68	2022-09-30 07:36:17 -04:00
Arnav Sharma	90f915d3a9	Vectorized and parallelized zdscal routine - Implemented optimized intrinsic kernel for zdscalv for the cases where AVX2 is supported. - Also added multithreaded support for the same. - The optimal number of threads is being calculated on the basis of input size. AMD-Internal: [CPUPL-2602] Change-Id: I4d05c3b1cc365a7770703286a89c6dce3875c067	2022-09-30 06:11:07 -04:00
satish kumar nuggu	9c292b79e2	Fixed ASAN reported issues in [s/c]trsm small kernels Details: 1. Fixed the memory access partial overflows for the variables AlphaVal,ones reported by ASAN. 2. Using 128 bit packed broadcast with the 64 bit data types after type casting would cause the garbage data to be filled in the destination register. 3. Fixed this issue by using set_ps instruction instead of broadcast. 4. In cases of n remainder being 1, extra elements were accessed that could cause out of memory access. Removed the extra element access. AMD-Internal: [CPUPL-2578][CPUPL-2587] Change-Id: Iaa918060c66287f2f46bcb9f69e9323f6707cf75	2022-09-30 03:02:07 -04:00
mkadavil	f4702debb9	Zen4 compilation flag updates to support low precision gemm. - BFloat16 flags added to zen4 make_defs in order to enable compilation of low precision gemm by using zen4 config. - Avoid -ftree-partial-pre optimization flag with gcc due to non optimal code generation for intrinsics based kernels in low precision gemm. - Enable only Zen3 specific low precision gemm kernels (s16) compilation when aocl_gemm addon is compiled on Zen3 machines. AMD-Internal: [CPUPL-1545] Change-Id: Id3be3410bfbf141bb6fc4b4e3391115a4e0bb79f	2022-09-29 08:19:40 -04:00
Eleni Vlachopoulou	1c1a0027a8	Bugfix in DNRM2 AVX path Description: Enabled DNRM2 AVX path and fixed bug that caused numerical accuracy errors. AMD-Internal: [CPUPL-2576] Change-Id: Ic9fda9d9668bdfe233621f79db6acce518b4d10e	2022-09-29 04:46:04 -04:00
satish kumar nuggu	754627eae9	Addressed uninitialized variables in trsm small algo 1. Addressed uninitialized variables reported in coverity for all datatypes of trsm small algo. AMD-Internal: [CPUPL-2542] Change-Id: Ifae57ef6435493942732526720e6a9d6bec70e71	2022-09-29 01:49:22 -04:00
Arnav Sharma	985c3f0fe1	Multithreaded DSCALV - Added multithreaded support for dscalv. - Optimal number of threads is calculated based on the input size. AMD-Internal: [CPUPL-2583] Change-Id: I0a253c28cb389a4f5214439dbca62304888ca5ae	2022-09-28 07:42:49 -04:00
Field G. Van Zee	4c3dd93eba	Use kernel CFLAGS for 'kernels' subdirs in addons. (#658) Details: - Updated Makefile and common.mk so that the targeted configuration's kernel CFLAGS are applied to source files that are found in a 'kernels' subdirectory within an enabled addon. For now, this behavior only applies when the 'kernels' directory is at the top level of the addon directory structure. For example, if there is an addon named 'foobar', the source code must be located in addon/foobar/kernels/ in order for it to be compiled with the target configuration's kernel CFLAGS. Any other source code within addon/foobar/ will be compiled with general-purpose CFLAGS (the same ones that were used on all addon code prior to this commit). Thanks to AMD (esp. Mithun Mohan) for suggesting this change and catching an intermediate bug in the PR. - Comment/whitespace updates. (cherry picked from commit `fd885cf98f`) Change-Id: I9a678f78bde90b23a6293ce90377004876f51067	2022-09-27 06:37:38 -04:00
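
Commits 63864d7dfb (clipping to [-128,127] while downscaling) and 492555785a (rounding with nearbyintf in the bench reference) in the list above describe how an accumulated integer result is reduced to s8. The following is a minimal Python sketch of that idea, not the BLIS code: the function name `downscale_to_s8` and the `scale` parameter are hypothetical, and it assumes round-to-nearest-even, which is what nearbyintf does under the default C rounding mode and what Python's `round` does for ties.

```python
def downscale_to_s8(acc: int, scale: float) -> int:
    """Scale an integer accumulator, round to nearest, and saturate to int8.

    Hypothetical helper for illustration only; not part of BLIS.
    """
    # round() resolves ties to the nearest even integer, mirroring nearbyintf
    # in the default rounding mode (assumed here). A ceiling would instead
    # turn small positive values such as 0.25 into 1.
    rounded = round(acc * scale)
    # Saturate rather than wrap, so out-of-range values clamp to the s8 limits.
    return max(-128, min(127, rounded))


if __name__ == "__main__":
    for acc, scale in [(1000, 1.0), (-1000, 1.0), (1, 0.25), (5, 0.5)]:
        print(acc, scale, downscale_to_s8(acc, scale))
```

Rounding before clamping keeps near-zero results at 0 instead of pushing them to 1, which is the mismatch the bench fix describes; the real kernels operate on vectors of s32 or s16 values rather than one scalar at a time.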

2803 Commits