amd/blis - blis - Public git mirror

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-05-24 10:24:34 +00:00

Author	SHA1	Message	Date
Harihara Sudhan S	11c42ce1d3	C matrix prefetch for BF16 GEMM - Broke down the KR loop inside the compute kernel into two pieces - Added C matrix prefetch between the two decomposed pieces of KR loop AMD-Internal: [CPUPL-2693] Change-Id: Ib73bc2145de4c75bc8153d7d7d20fb057270c94e	2022-11-21 04:57:19 -05:00
Harihara Sudhan S	2f6c26eff6	Dynamic block tuning for LPGEMM u8s8s16os16 - NC value is adjusted based on the n dimension of the B matrix. - KC value is adjusted based on the k dimension input matrices. - In order to ensure a handshake between the reorder function and compute function, the same dynamic block tuning logic is called in both places. - Reorder followed by compute scenarios mean that tuning KC value based on MC value and vice versa is not feasible. AMD-Internal: [CPUPL-2638] Change-Id: Ie515da7de83e3dfb59946e873c92d67f11559429	2022-11-21 01:39:35 -05:00
Vignesh Balasubramanian	078b0bc626	Enabling ZGEMM SUP path for single thread -Enabled the sup path for zgemm in case of single thread. -Fine tuned the thresholds for small path in order to handle certain spectrum of skinny matrices. -The inputs are now redirected among small, sup and native paths in case of single thread. AMD-Internal: [CPUPL-2632] Change-Id: I3eca315bcb8bfa8e015349a7b2bc4ebd5448e065	2022-11-04 16:05:48 +05:30
Edward Smyth	34730a1e4c	BLIS: Nested parallelism issues 1. Check OpenMP active level against max active levels when setting number of threads for starting a new parallel region in ./frame/thread/bli_thread.c to ensure the correct number of threads is used when BLIS is called within nested OpenMP parallelism. 2. In subsequent BLIS calls, threading choices could be incorrectly set based on values used and stored in global_rntm by a previous call. This could apply when the OpenMP number of threads differ from call to call, different nested parallelism is used in different parts of a user's code, or different threads at the user level request different numbers of OpenMP threads for BLIS calls. Keep threading information in both global_rntm and a new Thread Local Storage copy tl_rntm. Update tl_rntm from OpenMP runtime environment (as appropriate) during bli_init_auto() calls in each BLIS routine. The details are: * global_rntm is initialized on first BLIS call based on OpenMP and BLIS threading environment variables. * global_rntm is updated by any BLIS threading function calls. * In bli_thread_update_tl(), called by bli_init_auto(), sync with any BLIS values set or updated in global_rntm. Then, if BLIS threading control is not used, check OpenMP ICVs and set thread count and auto_factor appropriately. * Setting BLIS threading locally (using expert interfaces to pass a user defined rntm data structure) should work as before. 3. bli_thread_get_is_parallel can now only be called outside of parallelism within BLIS routines. Change calls in trsm to reflect this. 4. Ensure blis_mt is set to TRUE in bli_thread_init_rntm_from_env() if any BLIS_*_NT environment variables are set. 5. Set auto_factor = FALSE when the number of threads is 1. 6. bli_rntm_set_num_threads() and bli_rntm_set_ways() set blis_mt=TRUE. 7. Set blis_mt=FALSE in BLIS_RNTM_INITIALIZER and bli_rntm_init(). 8. For debugging, internal information on the rntm threading data can be printed by defining "PRINT_THREADING" at the top of bli_rntm.h 9. bli_rntm_print() now also prints the value of blis_mt. 10. Function prototypes in bli_rntm.h moved to top of file, so that bli_rntm_print() can be used within inline functions defined in this header file. 11. Comment out bli_init_auto() and bli_finalize_auto() calls in Fortran interfaces in frame/compat/blis/thread/b77_thread.c 12. In frame/3/bli_l3_sup_int_amd.c move two calls to set_pack_a and set_pack_b functions outside of the auto_factor if statements. 13. Misc code tidying. AMD-Internal: [CPUPL-2433] Change-Id: I8342c37fb4e280118e5e55164fbd6ea636f858ee	2022-10-21 07:38:39 -04:00
Mangala V	1290c27b2a	Revert "Enabled MT path in DTRSM Small" This reverts commit `bdbdd20996`. Reason for revert: Functionality issue in TRSM MT Path on Genoa Change-Id: I197ce4d58641196de9fb32828b3cc91c568b5de7	2022-10-21 05:13:38 -04:00
Harihara Sudhan S	42d631bced	Copyright modification - Added copyright information to modified/newly created files missing them Change-Id: If4e73b680246d0363de09587d6dc54bee00ecd71	2022-10-14 12:43:35 +05:30
Vignesh Balasubramanian	f6185f2e53	Disabling ZGEMM SUP for single thread -Disabled the sup path for zgemm with single thread temporarily until we tune the performance and update the thresholds. The redirection of inputs in zgemm is now among small and native path for single thread. -Fine tuned the input size constraints for entry to small path in zgemm in case of single thread. AMD-Internal: [CPUPL-2495] Change-Id: Ifa19116757ca132d9636c3f2a0661524d2c1d3ec	2022-10-11 08:09:53 -04:00
eashdash	63864d7dfb	Added clipping while downscaling for u8s8s32os8 and u8s8s16os8. Clipping is done during the downscaling of the accumulated result from s32 to s8 for u8s8s32os8 and from s16 to s8 for u8s8s16os8, to saturate the final output values between [-128,127] AMD-Internal: [LWPZENDNN-493] Change-Id: Ica9bba5044e87b815e2b4e35809bf440bb9dd41f	2022-10-11 07:28:06 -04:00
Mangala V	bdbdd20996	Enabled MT path in DTRSM Small 1. Fixed accuracy issues in dtrsm small multi-thread. 2. Fixed out of bound memory accesses in the patch "Fixes to avoid Out of Bound Memory Access in TRSM small algorithm" 3. Re-enabled DTRSM small MT with above fixes. AMD-Internal: [CPUPL-2567] Change-Id: Ibf2949b25fde4007a92cc635526bed0e0d897800	2022-10-10 09:47:45 -04:00
Harihara Sudhan S	492555785a	Fixed bench accuracy issue in LPGEMM Description: - When the s8 result values for u8s8s32 and u8s8s16 are close to 0, they are ceiled to 1. - Used nearbyintf to round the downscaled values in bench reference. This fixed the result mismatch issue between the vectorized kernel implementation and reference implementation in bench accuracy test. AMD-Internal: [CPUPL-2617] Change-Id: Ie42d612b1933bf622e6bd80eaf3db4bcb7a3ce82	2022-10-07 09:48:21 +00:00
Edward Smyth	991f2ec0e8	BLIS: Errors in Netlib LAPACK Tests Zen4 kernel bli_damaxv_zen_int_avx512 is causing incorrect results in the netlib LAPACK tests, specifically in: ./xlintstd < ../dtest.in > dtest.out in the TESTING/LIN directory. Given time constraints, i.e. the need to finalize code for AOCL 4.0 release, disable calls to AVX512 kernel (i.e. always use the AVX2 kernel) for now, and aim to correct bli_damaxv_zen_int_avx512 for AOCL 4.1. AMD-Internal: [CPUPL-2590] Change-Id: I2603dd97c3931acb9730563e8126b109ec2b2572	2022-10-03 10:55:49 +00:00
Shubham Sharma	2d72d4c862	DTRSM Native path AVX 512 kernels are developed - Added DTRSM AVX512 kernels for lower and upper variants in the native path. - Changes in framework are made to accommodate these kernels. AMD-Internal: [CPUPL-2588] Change-Id: I1f74273ef2389018343c0645870290373ce25efe	2022-10-01 15:58:24 +00:00
Nallani Bhaskar	f1caa8e630	Fixed initialization issues in ddot multithread implementation Description: 1. n value per thread and offset are defined outside omp loop which can vary for each thread. Moved them to inside the omp loop so that the output 2. Fixed a few compilation warnings in dnrm2 avx2 implementation AMD-Internal: [CPUPL-2606] Change-Id: Ifba9b3707c3c1a66f31b5e1906ecb68eabef4f81	2022-10-01 14:43:39 +05:30
Mangala V	e440cbc91a	Fixed ASAN reported issues in bli_dgemmsup_rd_haswell_asm_6x8m Address sanitizer reports error when rbp register is modified. Register rbp was stored with rs_a which was used during prefetch of Matrix A. Usage of rbp is avoided by using rcx register as a temporary storage register. Hence rcx is updated with Matrix C address before storing the computed data. This fix addresses the issue reported by GEQP3 API of libflame AMD-Internal: [CPUPL-2587] Change-Id: Ica790259010d8e71528c3d0ab1cd49069c56fc1d	2022-09-30 11:52:29 -04:00
Eleni Vlachopoulou	863b73dfaf	Adding AVX2 support for DZNRM2 - For the cases where AVX2 is available, an optimized function is called, based on Blue's algorithm. The fallback method based on sumsqv is used otherwise. - Scaling is used to avoid overflow and underflow. - Works correctly for negative increments. - Cleaned up some white space in the AVX2 implementation for DNRM2. AMD-Internal: [CPUPL-2551] Change-Id: I0875234ea735540307168fe7efc3f10fe6c40ffc	2022-09-30 07:51:04 -04:00
Chandrashekara K R	e36e13ec78	Parallelized DDOT using OpenMP. Description: 1. Calculate the optimal number of threads required based on input dimension. 2. Request dynamic memory from memory pool to store the each thread output. 3. Create optimal number of threads and divide the given input data among all threads. 4. Accumulate the results in rho variable. 5. Free the dynamically allocated memory back to memory pool. AMD-Internal: [CPUPL-2560] Change-Id: I30982b622d531b29d3570b2f5f7569bd1d951d68	2022-09-30 07:36:17 -04:00
Arnav Sharma	90f915d3a9	Vectorized and parallelized zdscal routine - Implemented optimized intrinsic kernel for zdscalv for the cases where AVX2 is supported. - Also added multithreaded support for the same. - The optimal number of threads is being calculated on the basis of input size. AMD-Internal: [CPUPL-2602] Change-Id: I4d05c3b1cc365a7770703286a89c6dce3875c067	2022-09-30 06:11:07 -04:00
satish kumar nuggu	9c292b79e2	Fixed ASAN reported issues in [s/c]trsm small kernels Details: 1. Fixed the memory access partial overflows for the variables AlphaVal, ones reported by ASAN. 2. Using 128 bit packed broadcast with the 64 bit data types after type casting would cause the garbage data to be filled in the destination register. 3. Fixed this issue by using set_ps instruction instead of broadcast. 4. In cases of n remainder being 1, extra elements were accessed that could cause out of memory access. Removed the extra element access. AMD-Internal: [CPUPL-2578][CPUPL-2587] Change-Id: Iaa918060c66287f2f46bcb9f69e9323f6707cf75	2022-09-30 03:02:07 -04:00
mkadavil	f4702debb9	Zen4 compilation flag updates to support low precision gemm. - BFloat16 flags added to zen4 make_defs in order to enable compilation of low precision gemm by using zen4 config. - Avoid -ftree-partial-pre optimization flag with gcc due to non optimal code generation for intrinsics based kernels in low precision gemm. - Enable only Zen3 specific low precision gemm kernels (s16) compilation when aocl_gemm addon is compiled on Zen3 machines. AMD-Internal: [CPUPL-1545] Change-Id: Id3be3410bfbf141bb6fc4b4e3391115a4e0bb79f	2022-09-29 08:19:40 -04:00
Eleni Vlachopoulou	1c1a0027a8	Bugfix in DNRM2 AVX path Description: Enabled DNRM2 AVX path and fixed bug that caused numerical accuracy errors. AMD-Internal: [CPUPL-2576] Change-Id: Ic9fda9d9668bdfe233621f79db6acce518b4d10e	2022-09-29 04:46:04 -04:00
satish kumar nuggu	754627eae9	Addressed uninitialized variables in trsm small algo 1. Addressed uninitialized variables reported in coverity for all datatypes of trsm small algo. AMD-Internal: [CPUPL-2542] Change-Id: Ifae57ef6435493942732526720e6a9d6bec70e71	2022-09-29 01:49:22 -04:00
Arnav Sharma	985c3f0fe1	Multithreaded DSCALV - Added multithreaded support for dscalv. - Optimal number of threads is calculated based on the input size. AMD-Internal: [CPUPL-2583] Change-Id: I0a253c28cb389a4f5214439dbca62304888ca5ae	2022-09-28 07:42:49 -04:00
Field G. Van Zee	4c3dd93eba	Use kernel CFLAGS for 'kernels' subdirs in addons. (#658) Details: - Updated Makefile and common.mk so that the targeted configuration's kernel CFLAGS are applied to source files that are found in a 'kernels' subdirectory within an enabled addon. For now, this behavior only applies when the 'kernels' directory is at the top level of the addon directory structure. For example, if there is an addon named 'foobar', the source code must be located in addon/foobar/kernels/ in order for it to be compiled with the target configuration's kernel CFLAGS. Any other source code within addon/foobar/ will be compiled with general-purpose CFLAGS (the same ones that were used on all addon code prior to this commit). Thanks to AMD (esp. Mithun Mohan) for suggesting this change and catching an intermediate bug in the PR. - Comment/whitespace updates. (cherry picked from commit `fd885cf98f`) Change-Id: I9a678f78bde90b23a6293ce90377004876f51067	2022-09-27 06:37:38 -04:00
Dipal M Zambare	1f345f87f5	Fixed ASAN reported issues in bli_l3_packm.c The local_mem_s is allocated on stack but in the inner scope of the “if” block, however, it can be accessed through cntl_mem_p outside the if block. This error is flagged by address sanitizer. Fixed this issue by moving the variable declarations to the function scope. This fix addresses the issue reported for the following libflame APIs geqp3, geqrf, gerq2, gerqf, gesvd, ggev, ggevx, potrf, potrs, stedc, steqr, syevd AMD-Internal: [CPUPL-2587] Change-Id: I63749c7d406c7339d2b45b0488108ccd3f90a248	2022-09-23 16:41:00 +05:30
eashdash	d21cd51fde	Accumulation type for alpha, beta values and BF16 bench integration 1. Correcting the type of alpha, and beta values from C_type (output type) to accumulation type. For the downscaled LPGEMM APIs, C_type will be the downscaled type but the required type for alpha and beta values should be the accumulation type. 2. BF16 bench integration with the LPGEMM bench for both the BF16 (bf16bf16f32of32 and bf16bf16f32obf16) APIs AMD-Internal: [CPUPL-2561] Change-Id: I3a99336c743f3880be1b96605ceeeae7c3bd4797	2022-09-23 05:00:49 -04:00
Dipal M Zambare	3a3472e4e5	Revert "Fixed ASAN reported issues in bli_l3_packm.c" This reverts commit `3401a0d202`. Got pushed without review.	2022-09-23 14:27:12 +05:30
Dipal M Zambare	3401a0d202	Fixed ASAN reported issues in bli_l3_packm.c The local_mem_s is allocated on stack but in the inner scope of the “if” block, however, it can be accessed through cntl_mem_p outside the if block. This error is flagged by address sanitizer. Fixed this issue by moving the variable declarations to the function scope. This fix addresses the issue reported for the following libflame APIs geqp3, geqrf, gerq2, gerqf, gesvd, ggev, ggevx, potrf, potrs, stedc, steqr, syevd AMD-Internal: [CPUPL-2587] Change-Id: Ib1e9f89bc4b6911e4b7e9b910d1ecc2ba00b286a	2022-09-23 14:24:55 +05:30
Mangala V	42434fbc31	Bug fix in TRSM small for S, D, C & Z datatype in case of MT 1. Corrected B buffer accessing to access by its offset instead of starting address which is required in case of MT. 3. When num_threads > 1, B buffer is divided into blocks in m or n dimension based on side right or left. Hence need to access by its offset to access starting of the block. 4. Currently B Matrix is divided into blocks for each thread and complete matrix A is used by all threads. In case of design change in future, modified A buffer accessing by its offset to support partition of matrix A for MT AMD-Internal: [CPUPL-2520] Change-Id: Ic09e9e945417b86e2bc2e2d4548f65db308cd2ea	2022-09-23 03:30:25 -04:00
Dipal M. Zambare	17694b6ca5	Enabled AVX512 compiler flags for the reference kernels. - Updated zen4 configuration to enable AVX512 flags for the reference kernels - Reference and vector kernels will use the same compiler flags AMD-Internal: [CPUPL-2533] Change-Id: I5a2ba7e584dc3fb93625df12cca6b6c18f514ea8	2022-09-23 11:39:33 +05:30
Harihara Sudhan S	a45827b3f9	u8s8s16os16 bug fix for downscale operation - Removed some read code from the macros for downscale - Store permute correction - Simplified macros for edge cases and corrected intermediate operation AMD-Internal:[CPUPL-2171] Change-Id: Ifd2ff6b3d1c3874ac5cb8a545ff6daa7fb40ee68	2022-09-22 05:02:17 -04:00
Nallani Bhaskar	1e5d98322d	Disabled DNRM2 AVX path temporarily Description: Disabled AVX2 optimized path for DNRM2 to avoid accuracy issues in netlib blas test. AMD-Internal: [CPUPL-2576] Change-Id: I0764725d4f6b1e4e0b5f60a255bc681bb698560e	2022-09-22 13:00:04 +05:30
mkadavil	bf4d1da1b9	Column major input support for BFloat16 gemm. -The bf16 gemm framework is modified to swap input column major matrices and compute gemm for the transposed matrices (now row major) using the existing row-major kernels. The output is written to C matrix assuming it is transposed. -Framework changes to support leading dimensions that are greater than matrix widths. -Bench changes to test low precision gemm for column major inputs. AMD-Internal: [CPUPL-2570] Change-Id: I22c76f52619fd76d0c0e41531828b437a1935495	2022-09-22 02:50:46 -04:00
Dipal M Zambare	de3247b0da	Removed extra prototypes for ?gemm3m APIs - Removed prototypes for float(sgemm3m) and double(dgemm3m) types, as BLIS implements this API only for scomplex(cgemm3m) and dcomplex(zgemm3m) AMD-Internal: [SWLCSG-1477] Change-Id: Ifad86a74b4c939ed240743894b85bb4fa5e6d754	2022-09-20 15:51:56 +05:30
Eleni Vlachopoulou	a5891f7ead	Adding AVX2 support for DNRM2 - For the cases where AVX2 is available, an optimized function is called, based on Blue's algorithm. The fallback method based on sumsqv is used otherwise. - Scaling is used to avoid overflow and underflow. - Works correctly for negative increments. AMD-Internal: [CPUPL-2551] Change-Id: I5d8976b29b5af463a8981061b2be907ea647123c	2022-09-20 06:05:01 -04:00
satish kumar nuggu	b7bc5f204f	Fix in DTRSM Small MT Details: 1. Changes are made in dtrsm small MT path,to avoid accuracy issues. AMD-Internal: [SWLCSG-1470] Change-Id: I65237225892f97b7222fe71f66b02841b5956560	2022-09-20 05:54:48 -04:00
Dipal M Zambare	2cdeea3c66	CBLAS/BLAS interface decoupling for the level 1 APIs -In BLIS, the CBLAS interface is implemented as a wrapper around the BLAS interface. For example the CBLAS API ‘cblas_dscal’ internally invokes the BLAS API ‘dscal_’. -This coupling between CBLAS and BLAS interface prevents the end user from overriding them individually by the application or other libraries. -This change separates the CBLAS and BLAS implementation by adding an additional level of abstraction. The implementation of the API is moved to the new function which is invoked directly from the CBLAS and BLAS wrappers. AMD-Internal: [SWLCSG-1477] Change-Id: I0e80071398af29c9313296d2a92e61e3897ac28e	2022-09-19 21:50:29 +05:30
Dipal M Zambare	866e8de7bf	CBLAS/BLAS interface decoupling for the level 2 APIs -In BLIS, the CBLAS interface is implemented as a wrapper around the BLAS interface. For example the CBLAS API ‘cblas_dgemv’ internally invokes the BLAS API ‘dgemv_’. -This coupling between CBLAS and BLAS interface prevents the end user from overriding them individually by the application or other libraries. -This change separates the CBLAS and BLAS implementation by adding an additional level of abstraction. The implementation of the API is moved to the new function which is invoked directly from the CBLAS and BLAS wrappers. AMD-Internal: [SWLCSG-1477] Change-Id: Ie7cbbac86bbfa1075a5064b31b365e911f67786c	2022-09-15 17:51:05 +05:30
Dipal M Zambare	e18db8a172	CBLAS/BLAS interface decoupling for the level 3 APIs -In BLIS, the CBLAS interface is implemented as a wrapper around the BLAS interface. For example the CBLAS API ‘cblas_dgemm’ internally invokes the BLAS API ‘dgemm_’. -This coupling between CBLAS and BLAS interface prevents the end user from overriding them individually by the application or other libraries. -This change separates the CBLAS and BLAS implementation by adding an additional level of abstraction. The implementation of the API is moved to the new function which is invoked directly from the CBLAS and BLAS wrappers. AMD-Internal: [SWLCSG-1477] Change-Id: Id9e307154342d2c17b0ac6db580c36f1a9ee6409	2022-09-15 06:23:46 -04:00
Dipal M Zambare	61232d540c	AOCL progress callback hardening - BLIS uses callback function to report the progress of the operation. The callback is implemented in the user application and is invoked by BLIS. - Updated callback function prototype to make all arguments const. This will ensure that any attempt to write using callback’s argument is prevented at the compile time itself. AMD-Internal: [CPUPL-2504] Change-Id: I8ceb671242365d2a9155b485301cd8c75043e667	2022-09-14 15:32:10 +05:30
mkadavil	9bc59cc500	Low Precision GEMM framework fixes for downscaling. - The temporary buffer allocated for C matrix when downscaling is enabled is not filled properly. This results in wrong gemm accumulation when beta != 0, and thus wrong output after downscaling. The C panel iterators used for filling the temporary buffer are updated to fix it. - Low precision gemm bench updated for testing/benchmarking downscaling. AMD-Internal: [CPUPL-2514] Change-Id: Ib1ba25ba9df2d2997edaaf0763ff0113fb35d6eb	2022-09-13 07:42:29 -04:00
Harihara Sudhan S	5b6cc5d39d	Bug fix in s16 downscale operation - Store operations were done to c matrix and not to c buffer AMD-Internal: [CPUPL-2171] Change-Id: Ic0897a20850fdae96db52f0ccc6fa087c84239fa	2022-09-13 06:01:48 -04:00
eashdash	e1349c0c71	LPGEMM BF16 MT panel based balancing Introduced multi-thread panel based balancing for BF16 to improve the overall MT performance. AMD-Internal: [CPUPL-2502] Change-Id: Iddce9548fa96e5f57bd3d3eb3e8268855ca47f25	2022-09-07 03:20:50 -04:00
Dipal M Zambare	6b71fea1f4	GCC 8 support for zen4 (and amdzen) configuration - Added check for GCC version 8, - Added AVX512 compiler flags needed for zen4 build using GCC 8. AMD-Internal: [CPUPL-2494] Change-Id: I7fd72e4b197fdd754633f674a8b87f01da8dd320	2022-09-06 16:58:02 +05:30
Edward Smyth	d69473c8f7	Update zen4 test in bli_cpuid.c Remove test on is_arch in bli_cpuid_is_zen4() so it will report true for all Zen4 models, based on family and AVX512 feature tests. AMD-Internal: [CPUPL-2474] Change-Id: I85d2e230b33391d5c9779df4585ae2a358788e72	2022-09-01 05:12:24 -04:00
eashdash	32a9e735f1	BF16 Output downscaling functionality - BF16 instructions output is accumulated at a higher precision of FP32 which needs to be converted to a lower precision of bf16 post the GEMM operations. This is required in AI workloads where both input and output are in BF16 format. - BF16 downscaling is implemented as post-ops inside the GEMM microkernels. Change-Id: Id1606746e3db4f3ed88cba385a7709c8604002a8	2022-08-30 13:46:09 -04:00
Harihara Sudhan S	5faab43e66	Downscaling as part of u8s8s16os16 - int16 c matrix intermediate values are converted to int32, then the int32 values are converted to fp32. On these fp32 values scaling is done - The resultant value is down scaled to int8 and stored in a separate buffer AMD-Internal: [2171] Change-Id: I76ff04098def04d55d1bd88ac8c8d3f267964cab	2022-08-30 13:41:36 -04:00
mkadavil	958c9238ac	Output downscaling support for low precision GEMM. - Downscaling is used when GEMM output is accumulated at a higher precision and needs to be converted to a lower precision afterwards. This is required in AI workloads where quantization/dequantization routines are used. - New GEMM APIs are introduced specifically to support this use case. Currently downscaling support is added for s32, s16 and bfloat16 GEMM. AMD-Internal: [CPUPL-2475] Change-Id: I81c3ee1ba5414f62427a7a0abb6ecef0c5ff71bf	2022-08-30 10:27:19 -04:00
satish kumar nuggu	d62f12a18a	Fixed bug in DZGEMM Details: 1. In zen4 dgemm and sgemm native kernels are column-prefer kernels, cgemm and zgemm kernels are row-prefer kernels. zen3 and older arch (uses row-prefer kernels for all datatypes). Induced-transpose carried out based on kernel preference check in both crc and ccr. we don't require kernel preference check to apply Induced-transpose. 2. In case of ccr, B matrix is real. We must Induce a transposition and perform C+=A*B (crc). where A (formerly B) is real. 3. In case of crc, A matrix is real. We don't require any Induced-transpose. AMD-Internal: [CPUPL-2440] [CPUPL-2449] Change-Id: I44c53a20c8def7ddbb84797ba20260acec2086a2	2022-08-30 09:34:05 -04:00
satish kumar nuggu	99dc9066fb	Fix in ZTRSM Small MT Details: 1. Changes are made in ztrsm small MT path,to avoid accuracy issues reported in libflame tests. AMD-Internal: [CPUPL-2476] Change-Id: Ic279106343fb1744e89ff4c920023adbe1d0158a	2022-08-30 18:02:19 +05:30
Dipal M Zambare	5c42afada8	Revert "CBLAS/BLAS interface decoupling for level 3 APIs" This reverts commit `d925ebeb06`. Change-Id: I2e842b29c1fedbe14bf913949cf978f3e7515ff3	2022-08-30 14:50:38 +05:30

2776 Commits