amd/blis - blis - Public git mirror

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-05-11 09:39:59 +00:00

Author	SHA1	Message	Date
Field G. Van Zee	faff30b46a	Fixed out-of-bounds bug in sup s6x16m haswell kernel. Details: - Fixed another out-of-bounds read access bug in the haswell sup assembly kernels. This bug is similar to the one fixed in `17b0caa` and affects bli_sgemmsup_rv_haswell_asm_6x2m(). Thanks to Madeesh Kannan for reporting this bug (and a suitable fix) in #635. - CREDITS file update. Change-Id: I10ccf4d4f471d93e8c8cc4df422c686438fb04e9	2022-07-31 21:10:58 +05:30
Field G. Van Zee	4b1663213c	Fixed out-of-bounds read in haswell gemmsup kernels. Details: - Fixed memory access bugs in the bli_sgemmsup_rv_haswell_asm_Mx2() kernels, where M = {1,2,3,4,5,6}. The bugs were caused by loading four single-precision elements of C, via instructions such as: vfmadd231ps(mem(rcx, 0*32), xmm3, xmm4) in situations where only two elements are guaranteed to exist. (These bugs may not have manifested in earlier tests due to the leading dimension alignment that BLIS employs by default.) The issue was fixed by replacing lines like the one above with: vmovsd(mem(rcx), xmm0) vfmadd231ps(xmm0, xmm3, xmm4) Thus, we use vmovsd to explicitly load only two elements of C into registers, and then operate on those values using register addressing. Thanks to Daniël de Kok for reporting these bugs in #635, and to Bhaskar Nallani for proposing the fix). - CREDITS file update. Change-Id: Ib525c36bcbf20b2bbbe380da3d74d142b338fe9b	2022-07-31 21:06:08 +05:30
Nallani Bhaskar	1d31386c02	Fixed few out of bound memory reads in sgemmsup kernels Details: Fixed memory access bugs in the bli_sgemmsup_rd_zen_asm_s1x16() kernel. The bugs were caused by loading four single-precision elements of C, via instructions such as: vfmadd231ps(mem(rcx, 032), ymm3, ymm4) or vfmadd231ps(mem(rcx, 032), xmm3, xmm4) in situations where only two elements are guaranteed to exist. (These bugs may not have manifested in earlier tests due to the leading dimension alignment that BLIS employs by default.) The issue was fixed by replacing lines like the one above with: vmovsd(mem(rcx), xmm0) vfmadd231ps(xmm0, xmm3, xmm4) Thus, we use vmovsd to explicitly load only two elements of C into registers, and then operate on those values using register addressing. AMD_CPUPLID: CPUPL-2279 Change-Id: Ic39290d651f5218b2e548351a87ac5e4b5b79c68	2022-07-29 09:09:12 -04:00
Chandrashekara K R	fde812015f	Updated blis library version from 4.0 to 3.2.1 AMD-Internal: [CPUPL-2322] Change-Id: I3a6a61543dd2754e2590d7f5f22442c9fdeaee95	2022-07-29 15:55:10 +05:30
Mangala V	6c1acc74c8	ZGEMM optimizations -- Conditionally packing of B matrix is enabled in zgemmsup path which is performing better when B matrix is large -- Incorporated decision logic to choose between zgemm_small vs zgemm sup based on matrix dimensions "m, n and k". -- Calling of ZGEMV when matrix dimension m or n = 1. Very good performance improvement is observed. Change-Id: I7c64020f4f78a6a51617b184cc88076213b5527d	2022-07-28 14:55:24 +05:30
Vignesh Balasubramanian	808d79a610	Implemented efficient ZGEMM algorithm when k=1 Problem statement : To improve the performance of the zgemm kernel for dealing with input sizes with k=1 by fine tuning its previous implementation. In the previous implementation, usage of SIMD parallelism along m and n dimensions instead of the k dimension proved to provide a better performance to the zgemm kernel. This code was subjected to further improvements along the following lines: - Cases to deal with alpha=0 and beta!=0 (i.e. just scaling of C) were handled at the beginning separately, using the bli_zscalm api. - Register blocking was further improved, resulting in the kernel size to increase from 4x5 to 4x6. - Prefetching was added to the code, by empirically finding out a suitable value to be added to the pointer. Overall, it provided a mild improvement to the performance. - Conditional statements were removed from the kernel loop, and a logic was deduced to allow such removal without affecting the output. The performance improvement of this single threaded implementation also proved to compete with that of the default implementation for multiple threads, as long as m and n are under 128. An improvement to this patch would be to find out a suitable feature which would establish a relationship between the number of threads and the input size constraints, thereby providing a unique size constraint for different number of threads. AMD-Internal: [CPUPL-2236] Change-Id: I3d401c8fd78bec80ce62eef390fa85e6287df847	2022-07-28 02:09:45 -04:00
Kiran Varaganti	6054b888fb	Fixed Bug in bench_trsm.c When bli_trsm() API is called, we make sure the "side" argument is "side_t" and not f77_char and argument is passed by value and not by its address. Change-Id: I5a616eb054c034be2d67640b8ab3b9615706a8c9	2022-07-25 15:38:30 +00:00
Kiran Varaganti	eff436c653	Bug Fix to replace vzeroall Fixed syntax in AVX512 dgemm native kernel. zen4 configuration follows Intel ASM syntax whereas other AMD configs follow AT&T ASM syntax. Bug was introduced due to following AT&T syntax in AVX512 dgemm kernel. In this commit we changed the syntax to Intel ASM format. src and dst operands are interchanged. Change-Id: Ie61dc7c5e8309b79437d471331318f3104bcd447	2022-07-22 03:42:17 -04:00
Dipal M Zambare	5d617429f4	Enabled znver4 support for GCC version >= 12 - Updated zen4 configuration to add -march=znver4 flag in the compiler options if the gcc version is above or equal to 12 AMD-Internal: [CPUPL-1937] Change-Id: Ic11470b92f71e49ee193a3a5406cf6045d66bd2f	2022-07-22 12:46:15 +05:30
mkadavil	f63e699c08	Fix for segmentation fault in low precision gemm. - Low precision gemm sets thread meta data (lpgemm_thrinfo_t) to NULL when compiled without open mp threading support. Subsequently the code is executed as if it is single-threaded. However, when B matrix needs to be packed, communicators are required (irrespective of single or multi-threaded), and the code accesses lpgemm_thrinfo_t for the same without NULL check. This results in seg fault. For the fix, a non-open mp thread decorator layer is added, which creates a placeholder lpgemm_thrinfo_t object with a communicator before invoking the 5 loop algorithm. This object will be used for packing. - Makefile for compilation of aocl_gemm bench. AMD-Internal: [CPUPL-2304] Change-Id: Id505235c8421792240b84f93942ca62dac78f3dc	2022-07-21 11:51:40 +05:30
Kiran Varaganti	86134c7278	Replaced vzeroall Replaced vzeroall instruction with vxorpd and vmovapd for dgemm kernels -both AVX2 and AVX512. vzeroall is expensive instruction and replaced it with faster version of zeroing all registers. vzeroupper() instruction is also added at the end of AVX2 kernels to avoid any AVX2/SSE transition penalities. Kindly note only the main kernels are modified. Change-Id: Ieb9bc629db01f0f94dd0e8e55550940d3d7eb2a4	2022-07-20 01:16:59 -04:00
Arnav Sharma	4f96bb712e	AOCL Dynamic Optimization for DGEMMT - Fine-tuned the thread allocation logic for parallelizing DGEMMT for the cases where n <= 220. This results in performance improvement in multi-threaded DGEMMT for small values of n. AMD-Internal: [CPUPL-2215] Change-Id: I2654bc64d2dc43c2db911e0c9175755be3aa8ba5	2022-07-18 06:52:48 -04:00
Vignesh Balasubramanian	2ad25a7180	ZGEMM kernel performance improvement for k=1 sizes: The current implementation for handling zgemm exploits SIMD parallelism along the k dimension. This would give great performance in cases of k being large. But for input sizes with k=1, it is better to exploit SIMD parallelism along the m and n dimensions, thereby giving better performance. This commit does the same through loop reordering, by loading column vectors from A. AMD-Internal: [CPUPL-2236] Change-Id: Ibfa29f271395497b6e2d0127c319ecb4b883d19f	2022-06-30 07:19:52 -04:00
Arnav Sharma	25cf7517ab	AOCL Dynamic Optimization for DGEMMT - Optimized thread allocation for cases with n <= 220 for DGEMMT. AMD-Internal: [CPUPL-2215] Change-Id: Id01edf268a90fd96a41ef947db54f6afc490548f	2022-06-30 15:00:34 +05:30
Dipal M Zambare	2ba2fb2b63	Add AVX2 path for TRSM+GEMM combination. - Enabled AVX2 TRSM + GEMM kernel path, when GEMM is called from TRSM context it will invoke AVX2 GEMM kernels instead of the default AVX-512 GEMM kernels. - The default context has the block sizes for AVX512 GEMM kernels, however, TRSM uses AVX2 GEMM kernels and they need different block sizes. - Added new API bli_zen4_override_trsm_blkszs(). It overrides default block sizes in context with block sizes needed for AVX2 GEMM kernels. - Added new API bli_zen4_restore_default_blkszs(). It restores The block sizes to there default values (as needed by default AVX512 GEMM kernels). - Updated bli_trsm_front() to override the block sizes in the context needed by TRSM + AVX2 GEMM kernels and restore them to the default values at the end of this function. It is done in bli_trsm_front() so that we override the context before creating different threads. AMD-Internal: [CPUPL-2225] Change-Id: Ie92d0fc40f94a32dfb865fe3771dc14ed7884c55	2022-06-29 10:16:24 +00:00
Harihara Sudhan S	d4bb906094	Exception handling in GEMV smart-threading - Added condition to check if n or m is 0 in smart threading logic AMD-Internal: [CPUPL2219] Change-Id: Idd58cd13a11aa5bdb4117b4c9262f38ef3c1afc4	2022-06-29 01:52:18 -04:00
Harish	2e4ed37e97	Added missing endif for the AOCC 4.0 verion check Change-Id: I1c77ae795c398aec685152b491b838a75e7ce318	2022-06-28 13:01:53 +05:30
Harish	77e8492cbd	Added znver4 flag for config builds with AOCC 4.0 compiler version Change-Id: I45f1031ed4c5ea2e3f594713f3821d6bbbecd4df	2022-06-28 02:47:48 -04:00
Chandrashekara K R	ff2ee0ae3f	AOCL-WINDOWS: Added the windows build system to build bench folder on windows. 1. Added the checks in .c files of the bench folder to read the input parameters from the given input files on windows using fscanf. Change-Id: Ie0497696304d318f345a646ab0ce3ba84debd4e2	2022-06-27 22:32:39 -04:00
Kiran Varaganti	47c344bc2e	DGEMM Benchmark Optimizations Updated with optimal cache-blocking sizes for MC, KC and NC for AVX512 dgemm kernel Change-Id: I56b3df238b6d85a6f6861448c0c6f907c972146a	2022-06-27 09:32:39 -04:00
Dipal M Zambare	00c5048f65	Fixed high impact static analysis issues Initialized ymm and xmm registers to zero to address un-inilizaed variable errors reported in static analsys. AMD-Internal: [CPUPL-2078] Change-Id: Icfcc008a0f244278efd8145d7feef764ed5fcc04	2022-06-27 08:19:26 +00:00
satish kumar nuggu	aaf840d86e	Disabled zgemm SUP path - Need to identify new Thresholds for zgemm SUP path to avoid performance regression. AMD-Internal: [CPUPL-2148] Change-Id: I0baa2b415dc5e296780566ba7450249445b93d43	2022-06-27 08:19:12 +00:00
satish kumar nuggu	13c71ca976	BugFix of AOCL_DYNAMIC in TRSM multithreaded small. - Added initialization of rntm object before aocl_dynamic. - Bugfixes in dtrsm right-side kernels, avoided accessing extra memory while using store for corner cases. AMD-Internal: [CPUPL-2193] [CPUPL-2194] Change-Id: I1c9d10edda93621626957d4de2f53d249ad531ba	2022-06-14 15:46:17 +05:30
mkadavil	e073e8b669	DAMAXV AXX512 micro kernel bug fix. -DAMAXV AVX512 is giving wrong results when max element is present at index in [n-u, n), where u < 32. This is a fallout of using wrong start offset for the non-loop unrolled code. -Functions for replacing NaN with negative numbers is replaced with MACRO to avoid function call overhead and to remove static variables used for stateful replacement numbers for NaN. AMD-Internal: [CPUPL-2190] Change-Id: Ie1435c38b264a271f869782793d0b52bbe6e1b2a	2022-06-13 10:52:53 +05:30
mkadavil	6c112632a7	Low precision gemm integrated as aocl_gemm addon. - Multi-Threaded int8 GEMM (Input - uint8_t, int8_t, Output - int32_t). AVX512_vnni based micro-kernel for int8 gemm. Paralellization supported along m and n dimensions. - Multi-Threaded B matrix reorder support for sgemm. Reordering B matrix is packing entire B matrix upfront before sgemm. It allows sgemm to take advantage of packed B matrix without incurring packing costs during runtime. - Makefile updates to addon make rules to compile avx512 code for selected files in addon folder. - CPU features query enhancements to check for AVX512_VNNI flag. - Bench for int8 gemm and sgemm with B matrix reorder. Supports performance mode for benchmarking and accuracy mode for testing code correctness. AMD-Internal: [CPUPL-2102] Change-Id: I8fb25f5c2fbd97d756f95b623332cb29e3b8d182	2022-06-09 10:28:38 -04:00
Chandrashekara K R	c03699b97a	AOCL-Windows: Updating windows build system to add /arch:AVX512 compiler flag only to zen4 specific source files. Change-Id: Ia4fa65a831a00ce37f97075db6812be048bfe0bc	2022-06-09 14:32:54 +05:30
Dipal M Zambare	c87b9aab75	Added support for AVX512 for Windows and AMAVX - Completed zen4 configuration support on windows - Enabled AVX512 kernels for AMAXV - Added zen4 configuration in amdzen for windows - Moved all zen4 kernels inside kernels/zen4 folder AMD-Internal: [CPUPL-2108] Change-Id: I9d2336998bbcdb8e2c4ca474977b5939bfa578ba	2022-06-08 11:09:48 +05:30
Dipal M Zambare	8cc15107ed	Enabled AVX-512 kernels for Zen4 config - Enabled AVX-512 skylake kernels in zen4 configuration. AVX-512 kernels are added for GEMM float and double types. - Enabled reference kernel for TRSM native path AMD-Internal: [CPUPL-2108] Change-Id: I66f3468346085c17183cbcbf4f2c8cfe07579b6f	2022-06-03 06:34:35 +00:00
Dipal M. Zambare	e61ec820f9	Fixed windows build issue for BLIS 4.0 - Removed extra AVX512 typedef from bli_amaxv_zen_int.c. AMD-Internal: [CPUPL-2154] Change-Id: Ieaa827c7d81b8d101f3a827827b99433570219f2	2022-06-02 17:52:43 +05:30
Arnav Sharma	66b2231b65	Fixed CMake files for HER - Removed subdirectory addition Change-Id: I419085db0b9034777409207a7d79b7ffa91eb8f1	2022-06-01 12:25:43 +05:30
Arnav Sharma	e5d5a43eab	Optimized ZHER Implementation - Implemented optimized her framework calls for double precision complex numbers. - The zher kernel operates over 4 columns at a time. Initially, it computes the diagonal elements of the matrix, then the 4x4 triangular part is computed and finally the remaining part is computed as 4x4 tiles of the matrix upto m rows. AMD-Internal: [CPUPL-2151] Change-Id: I27430ee33ffb901b3ef4bdd97b034e3f748e9cca	2022-05-25 14:03:01 +05:30
Dipal M Zambare	6e2f536590	Removed Arch specific code from BLIS framework. - Removed BLIS_CONFIG_EPYC macro - The code dependent on this macro is handled in one of the three ways -- It is updated to work across platforms. -- Added in architecture/feature specific runtime checks. -- Duplicated in AMD specific files. Build system is updated to pick AMD specific files when library is built for any of the zen architecture AMD-Internal: [CPUPL-1960] Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b	2022-05-17 20:35:40 +05:30
Harsh Dave	f48ced0811	Optimized dher2 implementation - Impplemented her2 framework calls for transposed and non transposed kernel variants. - dher2 kernel operate over 4 columns at a time. It computes 4x4 triangular part of matrix first and remainder part is computed in chunk of 4x4 tile upto m rows. - remainder cases(m < 4) are handled serially. AMD-Internal: [CPUPL-1968] Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313	2022-05-17 18:13:07 +05:30
Harsh Dave	718c6bc024	Optimized daxpy2v implementation - Optimized axpy2v implementation for double datatype by handling rows in mulitple of 4 and store the final computed result at the end of computation, preventing unnecessary stores for improving the performance. - Optimal and reuse of vector registers for faster computation. AMD-Internal: [CPUPL-1973] Change-Id: I7b8ef94d0f67c1c666fdce26e9b2b7291365d2e9	2022-05-17 18:13:05 +05:30
Chandrashekara K R	43c16d8e08	AOCL-Windows: Updating the blis windows build system. 1. Removed the libomp.lib hardcoded from cmake scripts and made it user configurable. By default libomp.lib is used as an omp library. 2. Added the STATIC_LIBRARY_OPTIONS property in set_target_properties cmake command to link omp library to build static-mt blis library. 3. Updated the blis_ref_kernel_mirror.py to give support for zen4 architecture. AMD-Internal: CPUPL-1630 Change-Id: I54b04cde2fa6a1ddc4b4303f1da808c1efe0484a	2022-05-17 18:12:51 +05:30
Dipal M Zambare	9c6c76613c	Added support for zen4 architecture - Added configuration option for zen4 architecture - Added auto-detection of zen4 architecture - Added zen4 configuration for all checks related to AMD specific optimizations AMD-Internal: [CPUPL-1937] Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a	2022-05-17 18:12:49 +05:30
Harihara Sudhan S	0fddd9eb0d	Level 1 Kernel: damaxv AVX512 Details: - Developed damaxv for AVX512 extension - Implemented removeNAN function that converts NAN values to negative values based on the location - Usage COMPARE256/COMPARE128 avoided in AVX512 implementation for better performance - Unrolled the loop by order of 4. Change-Id: Icf2a3606cf311ecc646aeb3db0628b293b9a3326	2022-05-17 18:12:16 +05:30
Dipal M Zambare	8f9be1766b	Disable AOCL_VERBOSE feature - AOCL_VERBOSE implementation is causing breakage in libFLAME. Currently DTL code is duplicated in BLIS and libFLAME, Which results in duplicate symbol errors when DTL is enabled in both the libraries. - It will be addressed by making DTL as separate library. - The input logs can still be enabled by setting AOCL_DTL_LOG_ENABLE = 1 in aocldtlcf.h and recompiling the BLIS library. AMD-Internal: [CPUPL-2101] Change-Id: I8e69b68d53940e306a1d16ffbb65019def7e655a	2022-05-17 18:10:40 +05:30
mkadavil	31f8820bab	Bug fixes for open mp based multi-threaded GEMM/GEMMT SUP path. - auto_factor to be disabled if BLIS_IC_NT/BLIS_JC_NT is set irrespective of whether num_threads (BLIS_NUM_THREADS) is modified at runtime. Currently the auto_factor is enabled if num_threads > 0 and not reverted if ic/jc/pc/jr/ir ways are set in bli_rntm_set_ways_from_rntm. This results in gemm/gemmt SUP path applying 2x2 factorization of num_threads, and thereby modifying the preset factorization. This issue is not observed in native path since factorization happens without checking auto_factor value. - Setting omp threads to n_threads using omp_set_num_threads after the global_rntm n_threads update in bli_thread_set_num_threads. This ensures that in bli_rntm_init_from_global, omp_get_max_threads returns the same value as set previously. AMD-Internal: [CPUPL-2137] Change-Id: I6c5de0462c5837cfb64793c3e6d49ec3ac2b6426	2022-05-17 18:10:40 +05:30
mkadavil	8670992c3d	Default sgemv kernel to be used in single-threaded scenarios. - sgemv calls a multi-threading friendly kernel whenever it is compiled with open mp and multi-threading enabled. However it was observed that this kernel is not suited for scenarios where sgemv is invoked in a single-threaded context (eg: sgemv from ST sgemm fringe kernels and with matrix blocking). Falling back to the default single-threaded sgemv kernel resulted in better performance for this scenario. AMD-Internal: [CPUPL-2136] Change-Id: Ic023db4d20b2503ea45e56a839aa35de0337d5a6	2022-05-17 18:10:40 +05:30
Harsh Dave	349dcc459a	Fixed scalapack xcsep failer due to cdotxv kernel. -Failure was observed in zen configuration as gcc flag safe-math-optimization was being used for reference kernel compilation. - Optmized kernels were being compiled without this gcc flag resulted in computation difference resulting in test case failure. AMD-Internal: [CPUPL-2121] Change-Id: I5d86e589cdea633220aecadbcab84d9b88b31f57	2022-05-17 18:10:40 +05:30
Dipal M Zambare	7247e6a150	Fixed crash issue in TRSM on non-avx platform. - Ensured that FMA, AVX2 based kernels are called only on platforms supporting these instructions, otherwise standard ‘C’ kernels will be called. - Code cleanup for optimization and consistency AMD-Internal: [CPUPL-2126] Change-Id: I203270892b2fad2ccc9301fb55e2bae75508e050	2022-05-17 18:10:39 +05:30
Nallani Bhaskar	7658067107	Added AOCL Dynamic feature for dtrmm Description: 1. Tuned number of threads to achive better performance for dtrmm AMD-Internal: [ CPUPL-2100 ] Change-Id: Ib2e3df224ba76d86185721bef1837cd7855dd593	2022-05-17 18:10:39 +05:30
Dipal M Zambare	4ccb438c18	Updated Zen3 architecture detection for Ryzen 5000 - Added support to detect Ryzen 5000 Desktop and APUs AMD-Internal: [CPUPL-2117] Change-Id: I312a7de1a84cf368b74ba20e58192803a9f7dace	2022-05-17 18:10:39 +05:30
satish kumar nuggu	09b70de635	Performance Improvement for ztrsm small sizes Details: - Handled Overflow and Underflow Vulnerabilites in ztrsm small right implementations. - Fixed failures observed in Scalapack testing. AMD-Internal: [CPUPL-2115] Change-Id: I22c1ba583e0ba14d1a4684a85fa1ca6e152e8439	2022-05-17 18:10:39 +05:30
Nallani Bhaskar	2acb3f6ed0	Tuned aocl dynamic for specific range in dgemm Description: 1. Decision logic to choose optimal number of threads for given input dgemm dimensions under aocl dynamic feature were retuned based on latest code. 2. Updated code in few file to avoid compilation warnings. 3. Added a min check for nt in bli_sgemv_var1_smart_threading function AMD-Internal: [ CPUPL-2100 ] Change-Id: I2bc70cc87c73505dd5d2bdafb06193f664760e02	2022-05-17 18:10:39 +05:30
mkadavil	a3836a560d	Smart Threading for GEMM (sgemm) v1. - Cache aware factorization. Experiments shows that ic,jc factorization based on m,n gives better results compared to mu,nu on a generic data set in SUP path. Also slight adjustments in the factorizations w.r.t matrix data loads can help in improving perf further. - Moving native path inputs to SUP path. Experiments shows that in multi-threaded scenarios if the per thread data falls under SUP thresholds, taking SUP path instead of native path results in improved performance. This is the case even if the original matrix dimensions falls in native path. This is not applicable if A matrix transpose is required. - Enabling B matrix packing in SUP path. Performance improvement is observed when B matrix is packed in cases where gemm takes SUP path instead of native path based on per thread matrix dimensions. AMD-Internal: [CPUPL-659] Change-Id: I3b8fc238a0ece1ababe5d64aebab63092f7c6914	2022-05-17 18:10:39 +05:30
S, HariharaSudhan	a8bc55c373	Multithreaded SGEMV var 1 with smart threading - Implemented an OpenMP based stand alone SGEMV kernel for row-major (var 1) for multithread scenarios - Smart threading is enabled when AOCL DYNAMIC is defined - Number of threads are decided based on the input dims using smart threading AMD-Internal: [CPUPL-1984] Change-Id: I9b191e965ba7468e95aabcce21b35a533017502e	2022-05-17 18:10:39 +05:30
Dipal M Zambare	16de63c818	Updated version and copyright notice. Changed AMD-BLIS version to 3.1.2 AMD-Internal: [CPUPL-2111] Change-Id: Id8fc3fbc112f08bd5e5def646c472047352e65b5	2022-05-17 18:10:39 +05:30
Dave	963a6aa099	Enabled zgemm_sup path and removed sqp path - Previously zgemm computation failures were due to status variable did not have pre-defined initial value which resulted in zgemm computation to return without being computed by any kernel. Reflected same change in dgemm_ function as well. - Enabled sup zgemm as the issue is fixed with status variable with bli_zgemm_small call. -Removed calling sqp method as it is disabled Change-Id: I0f4edfd619bc4877ebfc5cb6532c26c3888f919d	2022-05-17 18:10:39 +05:30

1 2 3 4 5 ...

2674 Commits