amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-07-03 21:57:51 +00:00

Author	SHA1	Message	Date
Smyth, Edward	05e837d176	BLIS: Implement zen6 sub-configuration Implement zen6 cpuid and arch changes, and add zen6 as a separate BLIS sub-configuration and code path within amdzen configuration family. Currently all optimization choices are copies of zen5 sub-configuration. AMD-Internal: [CPUPL-7162]	2026-03-05 13:33:56 +00:00
Smyth, Edward	011c75dddb	Remove unnecessary OpenMP include (AOCL) Copy of similar change in upstream BLIS (843a5e8) to fix issues https://github.com/flame/blis/issues/873 and https://github.com/amd/blis/issues/50 Details: - Previously, `<omp.h>` was included in `bli_thrcomm_openmp.h` so that the framework could access the necessary OpenMP functions. - As @melven reported (#873), this causes issues when `blis.h` is included in C++ code since the `<omp.h>` include happens with `extern "C"`. - Move the include from the header to the necessary .c files so that it does not "pollute" `blis.h`. Thanks to @DaAwesomeP and @bartoldeman for reporting this issue in AOCL BLIS AMD-Internal: [CPUPL-7303]	2026-02-06 10:41:38 +00:00
Smyth, Edward	8310b2d5d3	Optimize bli_arch_query_id and related functions bli_arch_query_id() is used to select kernels in optimized BLAS APIs. Previous implementation incurred the overhead of multiple function calls. This has been reduced by: - Changing the function to be defined in a header file so it can be inlined. - Avoiding call to bli_arch_check_id_once that was a wrapper for a call to bli_pthread_once. Instead bli_pthread_once is called directly. - For builds with a single BLIS sub-configuration, correct arch_id is taken directly from a header file in the corresponding config subdirectory, avoiding the bli_pthread_once call and making the value explicit at compile time, which may enable additional optimizations. To enable these changes, the variables arch_id and model_id defined in frame/base/bli_arch.c are no longer static, as they must be accessed in multiple files (i.e. they are now global variables). Rename to g_arch_id and g_model_id to distinguish from any locally defined arch_id or model_id variables.	2026-02-04 13:16:46 +00:00
S, Hari Govind	4ecfbde082	Fix extreme values handling in GEMV - When alpha == 0, we are expected to only scale y vector with beta and not read A or X at all. - This scenario is not handled properly in all code paths which causes NAN and INF from A and X being wrongly propagated. For example, for non-zen architecture (default block in switch case) no such check is present, similarly some of the avx512 kernels are also missing these checks. - When beta == 0, we are not expected to read Y at all, this also is not handled correctly in one of the avx512 kernel. - To fix these, early return condition for alpha == 0 is added to bla layer itself so that each kernel does not have to implement the logic. - DGEMV AVX512 transpose kernel has been fixed to load vector Y only when beta != 0. AMD-Internal: [CPUPL-7585]	2025-11-08 12:30:03 +05:30
Varaganti, Kiran	49961aa569	Fix DTL dynamic thread logging in BLAS operations (#230 ) - Remove redundant AOCL_DTL_LOG_NUM_THREADS calls from early return paths - Update thread count logging to use AOCL_get_requested_threads_count() for early exits - Clean up duplicate DTL logging in gemv_unf_var1 and gemv_unf_var2 implementations - Remove thread count logging from bli_dgemv_n_zen4_int kernel variants - Simplify aocldtl_blis.c AOCL_DTL_log_gemv_sizes by removing redundant conditional - Standardize DTL trace exit patterns across axpy, scal, and gemv operations - Remove commented-out DTL logging code in zen4 gemv kernel This patch ensures thread count is logged only once per operation and uses the correct API (AOCL_get_requested_threads_count) for early exit scenarios where the actual execution thread count may differ from requested threads.	2025-10-24 13:34:00 +01:00
Rayan, Rohan	dc4e0f72c1	Fixing an integer division in GEMV that was supposed to be a double operation (#218 ) --------- Co-authored-by: Rayan <rohrayan@amd.com>	2025-09-30 14:04:39 +05:30
Varaganti, Kiran	807de2a990	DTL Log update * DTL Log update Updates logs with nt and AOCL Dynamic selected nt for axpy, scal and dgemv Modified bench_gemv.c to able to process modified dtl logs. * Updated DTL log for copy routine with actual nt and dynamic nt * Refactor OpenMP pragmas and clean up code Removed unnecessary nested OpenMP pragma and cleaned up function end comment. * Fixed DTL log for sequential build * Added thread logging in bla_gemv_check for invalid inputs --------- Co-authored-by: Smyth, Edward <Edward.Smyth@amd.com>	2025-09-22 11:32:00 +05:30
Smyth, Edward	ae6c7d86df	Tidying code - AMD specific BLAS1 and BLAS2 framework: changes to make variants more consistent with each other - Initialize kernel pointers to NULL where not immediately set - Fix code indentation and other whitespace changes in DTL code and addon/aocl_gemm/frame/s8s8s32/lpgemm_s8s8s32_sym_quant.c - Fix typos in DTL comments - Add missing newline at end of test/CMakeLists.txt - Standardize on using arch_id variable name AMD-Internal: [CPUPL-6579]	2025-09-16 14:52:54 +01:00
Smyth, Edward	fb2a682725	Miscellaneous changes - Change begin_asm and end_asm comments and unused code in files kernels/haswell/3/sup/s6x16/bli_gemmsup_rv_haswell_asm_sMx6.c kernels/zen4/3/sup/bli_gemmsup_cd_zen4_asm_z12x4m.c to avoid problems in clobber checking script. - Add missing clobbers in files kernels/zen4/1m/bli_packm_zen4_asm_d24xk.c kernels/zen4/1m/bli_packm_zen4_asm_z12xk.c kernels/zen4/3/sup/bli_gemmsup_cv_zen4_asm_z12x4m.c - Add missing newline at end of files. - Update some copyright years for recent changes. - Standardize license text formatting. AMD-Internal: [CPUPL-6579]	2025-08-26 16:37:43 +01:00
Smyth, Edward	509aa07785	Standardize Zen kernel names Naming of Zen kernels and associated files was inconsistent with BLIS conventions for other sub-configurations and between different Zen generations. Other anomalies existed, e.g. dgemmsup 24x column preferred kernels names with _rv_ instead of _cv_. This patch renames kernels and file names to address these issues. AMD-Internal: [CPUPL-6579]	2025-08-19 18:19:51 +01:00
Sharma, Shubham	b0a4914417	Added DGEMV no transpose multithreaded Implementations (#12 ) * Added DGEMV no transpose multithreaded Implementations - Added new avx512 M and N kernels for DGEMV. - Added multiple MT implementations for same kernels. - Added AOCL_dynamic logic for L2 apis. - Tuned AOCL_dynamic and code path selection for DGEMV on ZEN5. - Added same kernels for SGEMV, but these kernels are not enabled yet. - Added SGEMV reference kernel. AMD-Internal: [SWLCSG-3408] Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>	2025-08-12 10:39:12 +05:30
S, Hari Govind	273a05f0bd	Fix for performance regression caused by non-unit stride y in DGEMV API (#91 ) - Temporary fix for regression in DGEMV for non-unit stride y inputs. The code section responsible for handling non-unit stride y has been removed from the frame. - The kernel code is extended with if condition to handle both unit and non-unit stride y. AMD-Internal: [CPUPL-6869]	2025-07-25 10:57:57 +05:30
S, Hari Govind	8d41565822	Fix build failure when AOCL_DYNAMIC is disabled (#57 ) - The build was failing when AOCL_DYNAMIC was disabled because `fast_path_thresh` was only declared when both AOCL_DYNAMIC and OpenMP were enabled. This variable was used in an `if` condition for single-thread execution without an AOCL_DYNAMIC guard. - To resolve this, the test expression for single-thread execution has been replaced with a macro. This macro is set to 0 when AOCL_DYNAMIC is disabled, ensuring the condition is handled correctly. AMD-Internal: [CPUPL-6854]	2025-06-23 15:56:15 +05:30
S, Hari Govind	e097346658	Implemented Multithreading Support and Optimization of DGEMV API (#10 ) - Implemented multithreading framework for the DGEMV API on Zen architectures. Architecture specific AOCL-dynamic logic determines the optimal number of threads for improved performance. - The condition check for the value of beta is optimized by utilizing masked operations. The mask value is set based on value of beta, and the masked operations are applied when the vector y is loaded or scaled with beta. AMD-Internal: [CPUPL-6746]	2025-06-17 12:39:48 +05:30
Smyth, Edward	49ae7db89a	Avoid including .c files (#40 ) Including a C file directly in another C file is not recommended, and some build systems (e.g. Bazel and Buck) do not allow .c files to include other .c files. This commit changes the tapi and oapi framework files that are included from the _ex and _ba file variants from .c filenames to .h filenames. AMD-Internal: [CPUPL-6784] Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>	2025-06-10 11:33:33 +05:30
Hari Govind S	29f30c7863	Optimisation for DCOPY API - Introduced new assembly kernel that copies data from source to destination from the front and back of the vector at the same time. This kernel provides better performance for larger input sizes. - Added a wrapper function responsible for selecting the kernel used by DCOPYV API to handle the given input for zen5 architecture. - Updated AOCL-dynamic threshold for DCOPYV API in zen4 and zen5 architectures. - New unit-tests were included in the gtestsuite for the new kernel. AMD-Internal: [CPUPL-6650] Change-Id: Ie2af88b8e97196b6aa02c089e59247742002f568	2025-04-28 05:58:21 -04:00
Hari Govind S	8998839c71	Optimisation of DGEMV Transpose Case for unit stride - Included a new code section to handle input having non-unit strided y vector for dgemv transpose case. Removed the same from the respective kernels to avoid repeated branching caused by condition checks within the 'for' loop. - The condition check for beta is equal to zero in the primary kernels are moved outside the for loop to avoid repeated branching. - The '_mm512_reduce_pd' operations in the primary kernel is replaced by a series of operations to reduce the number of instructions required to reduce the 8 registers. - Changing naming convention for DGEMV transpose kernels. - Modified unit kernel test to avoid y increment for dgemv transpose kernels during the test. AMD-Internal: [CPUPL-6565] Change-Id: I1ac516d6b8f156ac53ac9f6eb18badd50e152e05	2025-03-06 05:15:58 -05:00
Arnav Sharma	b4c1026ec2	Added Support for General Stride in DGEMV - Updated the bli_dgemv_zen_ref( ... ) kernel to support general stride. - Since the latest dgemv kernels don't support general stride, added checks to invoke bli_dgemv_zen_ref( ... ) when A matrix has a general stride. - Thanks to Vignesh Balasubramanian <vignesh.balasubramanian@amd.com> for finding this issue. AMD-Internal: [CPUPL-6492] Change-Id: Ia987ce7674cb26cb32eea4a6e9bd6623f2027328	2025-02-27 12:47:21 -05:00
Shubham Sharma	26bd265cfd	Optimized DTRSV for tiny sizes - Replaced switch case with if else, lookup table for switch case is placed at the end of .text section which causes a huge jump. - Reduced number of branches for tiny sizes. - Cpuid query is slow, therefore added a new if statement which avoids cpuid query for tiny sizes(<200). - Redirected tiny sizes to AVX2 kernel. AMD-Internal: [CPUPL-5407] Change-Id: I8e73777b2f00c9dcff9775ddfcb7ca3f74fa901c	2025-01-30 01:23:09 -05:00
Hari Govind S	349fc47ec5	DGEMV Optimizations for TRANSPOSE Cases - Developed new AVX512 DGEMV kernels for Zen4/5 architectures and AVX2 kernels for Zen1/2/3 architectures. These kernels are written from the ground up and are independent of fused kernels. - The DGEMV primary kernel processes the calculation in chunks of 8 columns. Fringe columns (sizes 1 to 7) are handled by fringe kernels, which are invoked by the primary kernel as needed. - Implemented the kernels by computing the dot product of matrix A columns with vector x in chunks of 32 elements, storing the results in accumulator registers. Fringe elements are handled in chunks of 16, 8, etc. The data in the accumulator registers is then reduced and added to vector y. AMD-Internal: [CPUPL-5835] Change-Id: I5cb9eb1330db095931586a7028fd7676fbbecc61	2025-01-24 00:38:34 -05:00
Arnav Sharma	25e59fcbb9	DGEMV Optimizations for NO_TRANSPOSE Cases - AVX512 specific DGEMV native kernels are added for Zen4/5 architectures to handle the NO_TRANSPOSE cases and are independent of the AXPYF fused kernels. - The following set of kernels biased towards the n-dimension perform beta scaling of y vector within the kernel itself and handle cases where n is less than 5: - bli_dgemv_n_zen_int_32x8n_avx512( ... ) - bli_dgemv_n_zen_int_32x4n_avx512( ... ) - bli_dgemv_n_zen_int_32x2n_avx512( ... ) - bli_dgemv_n_zen_int_32x1n_avx512( ... ) - The bli_dgemv_n_zen_int_16mx8_avx512( ... ) is biased towards the m-dimension and for this kernel beta scaling is handled beforehand within the framework. - Added unit-tests for the new kernels. - AVX2 path for Zen/2/3 architectures still follows the old approach of using fused kernel, namely AXPYF, to perform the GEMV operation. AMD-Internal: [CPUPL-5560] Change-Id: I22bc2a865cd28b9cdcb383e17d1ff38bdd28de79	2024-12-12 10:26:50 -05:00
Shubham Sharma	d322cc11f8	Tiny size optimization for DTRSV var2 - Use AVX2 kernels for tiny sizes on genoa. - Removed the runtime init overhead for small sizes. AMD-Internal: [CPUPL-5407] Change-Id: I0db7d93abc659012916ef706f22528c7fabb4e30	2024-08-20 00:40:25 -04:00
Arnav Sharma	9583ee2e23	DGEMV Optimizations for NO_TRANSPOSE cases - Enabled AVX512 DAXPYF kernels for DGEMV var2 for NO_TRANSPOSE cases. - Added DAXPYF kernels with fuse factors of 2, 4, 6 and 16. - Added a wrapper for DAXPYF kernels for redirection to kernels with a smaller fuse factor than 32. - Also added UKR tests for the new fused kernels. AMD-Internal: [CPUPL-5098] Change-Id: I0b102b67c6c068873393bac0494284f379c253f2	2024-07-24 15:59:36 +05:30
Hari Govind S	38824244d5	Implementation of AXPYF Kernels for DTRSV - Implemented two new axpyf kernels for fused factors 8 and 12 by manually unrolling the loops. Used to achieve better performance in var2 case. AMD-Internal: [CPUPL-5184] Change-Id: I40d2930d003c6ce90323b5c8a52564563d1f23f5	2024-07-16 06:23:01 -04:00
Shubham Sharma	7553abad8e	Fixed compilation error with AOCC in TRSV - Added a {} around zen4 switch case to avoid AOCC error. - The error occurs because in C a declaration is not a statement and therefore cannot be labeled, so the compiler cannot create a label for the jump. AMD-Internal: [CPUPL-4880] Change-Id: Icfeedafd80bf9a955e430ca967b6a93dcbbf075e	2024-05-03 21:08:38 +05:30
Shubham Sharma	1d983e6124	Added AVX512 kernels for DAXPYF and DDOTXF - Added DAXPYF and DDOTXF AVX512 kernels. - Fuse factor for ddotxf kernel is 8. - 2 DAXPYF kernels are added, with fuse factor 8 and 32. - Multithreading is also added to the DAXPYf kernel with fuse factor 32. - These kernels are internally used by TRSM. - Added changes in TRSV to call these kernels in ZEN4 AMD-Internal: [CPUPL-4880] Change-Id: I12850de974b437bbca07677b68bc3d6a35858770	2024-05-03 05:10:22 -04:00
Vignesh Balasubramanian	4e2966f9b0	AVX512 optimizations for ZGEMV API with transpose case - Implemented AVX512 kernels for handling the calls to ZGEMV with transpose to A matrix. - This includes the set of ZDOTXF and ZDOTXV kernels. ZDOTXF kernels include those with fuse-factor 8 (main kernel), 4 and 2(fringe kernels). - Updated the bli_zgemv_unf_var1( ... ) function to update the function pointers to these kernels, based on the configuration. AMD-Internal: [CPUPL-4974] Change-Id: I313ae0abe9dc119de849da42f9825b71f11b1fda	2024-05-03 04:38:52 -04:00
Vignesh Balasubramanian	53cb83d0cc	AVX512 optimizations for ZGEMV API with no-transpose case - Implemented AVX512 kernels for handling the calls to ZGEMV with no-transpose to A matrix. - This includes the ZAXPYF, ZAXPYV and ZSETV kernels. The set of ZAXPYF kernels include those with fuse-factor 8 (main kernel), 4 and 2(fringe kernels). - Updated the bli_zgemv_unf_var2( ... ) function to set the function pointers to these kernels, based on the configuration. Further added the call to ZSETV at this layer in case beta is 0. AMD-Internal: [CPUPL-4974] Change-Id: Iee4b724719e49023138bb16479765be44d677cd9	2024-05-03 07:04:47 +00:00
Shubham Sharma	632c32767b	Avoid alpha scaling in ZTRSV/ZTRSM when alpha = 1 - Scaling vector X is skipped when alpha is 1 in ZTRSV. - Scaling matrix A is skipped when alpha is 1 in ZTRSM. AMD-Internal: [CPUPL-4324] Change-Id: I03c5a454ed1f5be36dac0f121408749bfc9cfc81	2024-04-16 02:24:02 -04:00
Edward Smyth	2450a1813b	BLIS: Implement zen5 sub-configuration Implement full support for zen5 as a separate BLIS sub-configuration and code path within amdzen configuration family. AMD-Internal: [CPUPL-3518] Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09	2024-04-12 07:26:31 -04:00
Edward Smyth	ed5010d65b	Code cleanup: AMD copyright notice Standardize format of AMD copyright notice. AMD-Internal: [CPUPL-3519] Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0	2023-11-23 08:54:31 -05:00
Eleni Vlachopoulou	75a4d2f72f	CMake: Adding new portable CMake system. - A completely new system, made to be closer to Make system. AMD-Internal: [CPUPL-2748] Change-Id: I83232786406cdc4f0a0950fb6ac8f551e5968529	2023-11-09 15:49:45 +05:30
mangala v	fa355c0049	Removed warning during compilation of gemv api for non-zen config - When configured for haswell config "Warning unused variable 'zero'" was thrown during compilation. - Removed zero variable which is not being used AMD-Internal: [CPUPL-3973] Change-Id: I45a1f16b4c50307b07148bba63ca5332c48648b8	2023-11-08 01:43:33 -05:00
Vignesh Balasubramanian	ef545b928e	Bugfix : Changing fuse factor for the call to vectorized SAXPYF kernel - The call to the bli_saxpyf_zen_int_6( ... ) is explicitly present in the bli_gemv_unf_var2_amd.c file, as part of the bli_sgemv_unf_var2( ... ) function. This was changed to bli_saxpyf_zen_int_5( ... )( thereby changing the fuse factor from 6 to 5 ), in accordance to the function pointer present in the zen3 and zen4 context files. - Changed the accumulator type to double from float, inside the fringe loop for unit-strides(vectorized path) and non-unit strides (scalar code). AMD-Internal: [CPUPL-4028] Change-Id: Iab1a0318f461cba9a7041093c6865ae8396d231e	2023-11-03 01:37:43 -04:00
Harihara Sudhan S	106342f402	ZGEMV optimization for special cases in beta - Avoiding scaling of y vector by beta when beta is 1. AMD-Internal: [CPUPL-3829] Change-Id: I9cf46f44c5f1c2da3653937ff035594b4046b4a1	2023-11-02 08:21:46 -04:00
Harihara Sudhan S	105de694cf	Optimized ZGEMV variant 1 - Added an explicit function definition for ZGEMV var 1. This removes the need to query the context for Zen architectures. - Added a new INSERT_GENTFUNC to generate the definition only for scomplex type. - Rewrote ZDOTXF kernel and added the function name for ZDOTV instead of querying it. - With this change fringe loop is vectorized using SSE instructions. AMD-Internal:[CPUPL-3997] Change-Id: I790214d528f9e39f63387bc95bf611f84d3faca3	2023-10-13 05:03:53 -04:00
Edward Smyth	bb4c158e63	Merge commit 'b683d01b' into amd-main * commit 'b683d01b': Use extra #undef when including ba/ex API headers. Minor preprocessor/header cleanup. Fixed typo in cpp guard in bli_util_ft.h. Defined eqsc, eqv, eqm to test object equality. Defined setijv, getijv to set/get vector elements. Minor API breakage in bli_pack API. Add err_t* "return" parameter to malloc functions. Always stay initialized after BLAS compat calls. Renamed membrk files/vars/functions to pba. Switch allocator mutexes to static initialization. AMD-Internal: [CPUPL-2698] Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df	2023-08-21 07:01:38 -04:00
Harihara Sudhan S	278ca71706	Fixes for GEMV Functionality Issues - Added call to dsetv in dscalv. When DSCALV is invoked by DGEMV the SCAL function is expected to SET the vector to zero when alpha is 0. This change is done to ensure BLAS compatibility of DGEMV. - Fixed bug in DGEMV var 1. Reverted changes in DGEMV var 1 to remove packing and dispatch logic. - CMAKE now builds with _amd files for unf_var2 of GEMV. AMD-Internal: [CPUPL-3772] Change-Id: I0d60c9e1025a3a56419d6ae47ded509d50e5eade	2023-08-14 13:54:48 +05:30
Harihara Sudhan S	03fa660792	Optimized xGEMV for non-unit stride X vector - In GEMV variant 1, the input matrix A is in row major. X vector has to be of unit stride if the operation is to be vectorized. - In cases when X vector is non-unit stride, vectorization of the GEMV operation inside the kernel has been ensured by packing the input X vector to a temporary buffer with unit stride. Currently, the packing is done using the SCAL2V. - In case of DGEMV, X vector is scaled by alpha as part of packing. In CGEMV and ZGEMV, alpha is passed as 1 while packing. - The temporary buffer created is released once the GEMV operation is complete. - In DGEMV variant 1, moved problem decomposition for Zen architecture to the DOTXF kernel. - Removed flag check based kernel dispatch logic from DGEMV. Now, kernels will be picked from the context for non-avx machines. For avx machines, the kernel(s) to be dispatched is(are) assigned to the function pointer in the unf_var layer. AMD-Internal: [CPUPL-3475] Change-Id: Icd9fd91eccd831f1fcb9fbf0037fcbbc2e34268e	2023-08-08 01:01:22 -04:00
Harihara Sudhan S	3be43d264f	Optimized xGEMV for non-unit stride Y vector - In variant 2 of GEMV, A matrix is in column major. Y vector has to be of unit stride if the operation is to be vectorized. - In cases when Y vector is non-unit stride, vectorization of the GEMV operation inside the kernel has been ensured by packing the input Y vector to a temporary buffer with unit stride. As part of the packing Y is scaled by beta to reduce the number of times Y vector is to be loaded. - After performing the GEMV operation, the results in the temporary buffer are copied to the original buffer and the temporary one is released. - In DGEMV var 2, moved problem decomposition for Zen architecture to the AXPYF kernel. - Removed flag check based kernel dispatch logic from DGEMV. Now, kernels will be picked from the context for non-avx machines. For avx machines, the kernel(s) to be dispatched is(are) assigned to the function pointer in the unf_var layer. AMD-Internal: [CPUPL-3485] Change-Id: I7b2efb00a9fa9abca65abca07ee80f38229bf654	2023-08-07 08:12:44 -04:00
Edward Smyth	7e50ba669b	Code cleanup: No newline at end of file Some text files were missing a newline at the end of the file. One has been added. Also correct file format of windows/tests/inputs.yaml, which was missed in commit `0f0277e104` AMD-Internal: [CPUPL-2870] Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549	2023-04-21 10:02:48 -04:00
Edward Smyth	6835205ba8	Code cleanup: spelling corrections Corrections for spelling and other mistakes in code comments and doc files. AMD-Internal: [CPUPL-2870] Change-Id: Ifbb5df7df2d6312fe73e06ee6d41c00b16c593ce	2023-04-19 12:44:56 -04:00
Aayush Kumar	71272ab574	Fixed Compiler warnings for GCC 12 and AOCC 4.0 - Set the variables to zero to avoid the compiler warning (-Wmaybe-uninitialized) in bli_dgemm_ref_k1.c, bli_gemm_small.c, bli_trsm_small.c, bli_zgemm_ref_k1.c and bli_trsm_small_AVX512.c - Changed the datatype from dim_t to siz_t for i,k,j in bli_hemv_unf_var1_amd.c and bli_hemv_unf_var3_amd.c to avoid the compiler warning (-Waggressive-loop-optimizations) AMD-Internal: [CPUPL-2870] Change-Id: Ib2bc050fa47cb8a280d719283ab4539c70e19d03	2023-04-14 13:29:17 +00:00
Harihara Sudhan S	2e6724262e	ZGEMV var 2 bug fix - Fixed segmentation fault that was seen on non zen and non avx2 machines. - cntx object was not passed to the invoked kernel causing a seg fault. AMD-Internal: [CPUPL-3167] Change-Id: I2640d3f905e78398935cf6ed667b04a6418baa5d	2023-04-05 01:31:24 -04:00
Edward Smyth	1ac03e64b5	BLIS cpuid tidy and bugfix. Improvements to BLIS cpuid functionality: - Tidy names of avx support test functions, especially rename bli_cpuid_is_avx_supported() to bli_cpuid_is_avx2fma3_supported() to more accurately describe what it tests. - Fix bug in frame/base/bli_check.c related to changes in commit `6861fcae91` AMD-Internal: [CPUPL-3031] Change-Id: Iacd8fb0ffbd45288e536fc6314660709055ea2d5	2023-04-03 08:46:37 -04:00
Harihara Sudhan S	4b36529a8b	Added vector packing logic to ZGEMV variant 2 - In cases when incy != 1, a buffer is created for y vector. The contents of vector y is scaled by beta and stored in this buffer. - After performing the compute using ZAXPYF kernel, the results in y buffer memory is copied back to the original buffer using ZCOPYV. - In cases when alpha is zero, we only scale the y vector by beta without using the buffer and return. - The kernels are picked based on the architecture ID. For any zen based architecture, AVX2 kernels are invoked. For other, the kernels are invoked based on the context. - In ZSCAL2V, query for the context if NULL pointer is passed. AMD-Internal: [CPUPL-2773] Change-Id: If409ca5c438fc2eebe73480c011577088d52c65f	2023-03-22 03:19:18 -04:00
Edward Smyth	7f86561d26	BLIS-Nov2022: HPL memory issues with GCC. HPL script was using BLIS manual way to set threading, i.e. setting BLIS_IC_NT explicitly. This causes bli_rntm_num_threads() to return -1, which wasn't trapped in parallelised BLAS1 and BLAS2 routines. Fix: if this occurs, set local number of threads based on product of BLIS_JC_NT * BLIS_PC_NT * BLIS_IC_NT * BLIS_JR_NT * BLIS_IR_NT values. Note: BLIS_PC_NT should always be 1, but this environment variable is currently being read (contrary to documentation), so include it for now. Other changes: * implement _Pragma convention in all code used on AMD * frame/2/gemv/bli_gemv_unf_var1_amd.c: Remove is_omp_mt_enabled flag AMD-Internal: [CPUPL-2803] Change-Id: I37e8b038e5640d6693a87be0609888186322b465	2022-12-06 05:10:34 -05:00
Harihara Sudhan S	42d631bced	Copyright modification - Added copyright information to modified/newly created files missing them Change-Id: If4e73b680246d0363de09587d6dc54bee00ecd71	2022-10-14 12:43:35 +05:30
Edward Smyth	abf848ad12	Code cleanup and warnings fixes - Removed some additional compiler warnings reported by GCC 12.1 - Fixed a couple of typos in comments - frame/3/bli_l3_sup.c: routines were returning before final call to AOCL_DTL_TRACE_EXIT - frame/2/gemv/bli_gemv_unf_var1_amd.c: bli_multi_sgemv_4x2 is only defined in header file if BLIS_ENABLE_OPENMP is defined AMD-Internal: [CPUPL-2460] Change-Id: I2eacd5687f2548d8f40c24bd1b930859eefbbcde	2022-08-29 08:22:30 -04:00
Arnav Sharma	eb83a0fe9d	Enabled ZHER Optimized Path - While calculating the diagonal and corner elements, the combined operation of calculating the product of x and x hermitian and simultaneously scaling it with alpha and adding the result to the matrix was the cause of increased underflow and overflow errors in netlib tests. - So the above calculation is now being done in three steps: scaling x vector with alpha, then calculating its product with x hermitian and later adding the final result to the matrix. AMD-Internal: [CPUPL-2213] Change-Id: I32df572b013bc3189340662dbf17eddcaec9f0f8	2022-08-29 08:09:42 -04:00
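
Commit 4ecfbde082 in the list above describes the expected GEMV behaviour for extreme values: when alpha == 0 the routine should only scale y by beta without reading A or x, and when beta == 0 it should not read y at all. The following is a minimal sketch of that rule, not the BLIS implementation. The function name `ref_dgemv_n`, its signature, and the column-major, unit-stride layout are assumptions made for illustration.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical reference routine (not a BLIS API):
   y := beta*y + alpha*A*x, with A an m x n column-major matrix,
   leading dimension lda, and unit-stride x and y. */
static void ref_dgemv_n(size_t m, size_t n, double alpha, const double *a,
                        size_t lda, const double *x, double beta, double *y)
{
    assert(lda >= m);

    /* Scale y first. When beta == 0, y is overwritten rather than read,
       so NaN or Inf already stored in y cannot reach the result. */
    for (size_t i = 0; i < m; ++i)
        y[i] = (beta == 0.0) ? 0.0 : beta * y[i];

    /* Early return: with alpha == 0, A and x are never touched, so NaN
       or Inf in them cannot propagate into y. */
    if (alpha == 0.0)
        return;

    /* Column-wise accumulation: y += (alpha * x[j]) * A(:, j). */
    for (size_t j = 0; j < n; ++j) {
        const double ax = alpha * x[j];
        for (size_t i = 0; i < m; ++i)
            y[i] += a[i + j * lda] * ax;
    }
}
```

Placing the alpha == 0 check once in a wrapper like this, rather than in every kernel, mirrors the approach the commit message describes for the BLAS layer.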

166 Commits
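
Commit 7553abad8e in the list above fixes a compiler error by wrapping a switch case in braces. Before C23, a label (including a `case` label) must be followed by a statement, and a declaration is not a statement. A small sketch of the pattern follows; the enum, the function name `pick_fuse_factor`, and the returned values are hypothetical and are not taken from the BLIS sources.

```c
/* Hypothetical architecture IDs for illustration only. */
typedef enum { DEMO_ZEN3, DEMO_ZEN4, DEMO_GENERIC } demo_arch;

static int pick_fuse_factor(demo_arch id)
{
    int result;

    switch (id) {
    case DEMO_ZEN4: {
        /* The braces open a compound statement, which is a statement,
           so the case label is legal. Without them, the declaration
           below would directly follow the label and could be rejected. */
        int avx512_fuse = 32; /* illustrative value */
        result = avx512_fuse;
        break;
    }
    default:
        result = 8; /* illustrative value */
        break;
    }
    return result;
}
```

The braces also limit the scope of the variable to its own case, which avoids jumping past an initialised declaration from another label.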