- The current build systems behave as follows when building the
"aocl_gemm" addon codebase (LPGEMM) with "amdzen" (fat binary)
given as the target architecture:
- Make: Attempts to compile LPGEMM kernels using the same
compiler flags that the makefile fragments set for BLIS
kernels, based on the compiler version.
- CMake: With presets, it always enables the addon compilation
unless explicitly specified with the ENABLE_ADDON variable.
- This causes a build failure with older compilers, which do not
support compiling BF16 or INT8 intrinsics.
- This patch adds the functionality to check for GCC and Clang compiler versions,
and disables LPGEMM compilation if GCC < 11.2 or Clang < 12.0.
- Make: Updated the configure script to check the compiler version
if the addon is specified.
- CMake: Updated the main CMakeLists.txt to check the compiler version
if the addon is specified, and to force-update the associated
cache variable. Also updated kernels/CMakeLists.txt to
check whether "aocl_gemm" remains in the ENABLE_ADDONS list after
all the checks in the earlier layers.
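The version gate described above can be sketched in portable shell. This is an illustrative assumption, not the actual configure-script code: the function names and the `sort -V` comparison are stand-ins for however the script really parses compiler versions.

```shell
# Hypothetical sketch of the configure-time gate: disable the
# aocl_gemm addon when the compiler is too old for the BF16/INT8
# intrinsics it needs (GCC < 11.2, Clang < 12.0).
version_ge() {
    # True when $1 >= $2 under version ordering (GNU sort -V).
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

lpgemm_supported() {
    cc="$1"; ver="$2"
    case "$cc" in
        gcc)   version_ge "$ver" "11.2" ;;
        clang) version_ge "$ver" "12.0" ;;
        *)     return 1 ;;
    esac
}

if ! lpgemm_supported gcc "$(gcc -dumpfullversion 2>/dev/null)"; then
    echo "aocl_gemm addon disabled: compiler lacks BF16/INT8 intrinsic support"
fi
```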
AMD-Internal: [CPUPL-7850]
Signed-off-by: Vignesh Balasubramanian <Vignesh.Balasubramanian@amd.com>
Fixing some inefficiencies in the zen4 SUP RD kernel for SGEMM
The 8- and 1-iteration loops of the K-loop were performing loads into ymm/xmm registers while computing on zmm registers.
This caused multiple unnecessary iterations in the kernel for matrices with certain k-values.
Fixed by introducing masked loads and computations for these cases.
AMD-Internal: https://amd.atlassian.net/browse/CPUPL-7762
Co-authored-by: Rohan Rayan <rohrayan@amd.com>
* Adding a model to determine which matrices enter the SGEMM tiny path
* This extends the sizes of matrices that enter the tiny path, which was constrained to the L1 cache size previously
* Now matrices that fit in L2 are also allowed into the tiny path, provided they are determined to be faster than the SUP path
* Adding thresholds based on the SUP path sizes
* Added for Zen4 and Zen5
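The L1-to-L2 extension of the tiny-path gate can be modeled as a footprint check. This is a hypothetical sketch: the function name, the cache sizes, and the simple byte-count model are assumptions for illustration, not the tuned Zen4/Zen5 thresholds this patch adds.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative model of the tiny-path gate: admit an SGEMM problem
 * when the combined footprint of A, B and C fits in L2 (the previous
 * rule was limited to L1). Sizes are placeholders. */
enum { L1_BYTES = 32 * 1024, L2_BYTES = 1024 * 1024 };

static bool sgemm_tiny_path(size_t m, size_t n, size_t k)
{
    size_t footprint = sizeof(float) * (m * k + k * n + m * n);
    return footprint <= L2_BYTES;
}
```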
---------
AMD-Internal: CPUPL-7555
Co-authored-by: Rohan Rayan <rohrayan@amd.com>
- Updated the BF16-to-F32 conversion function (for the case of
receiving column-stored inputs) to use the correct strides
while storing.
- Conversion of B is potentially multithreaded using the threads
meant for IC compute. With the wrong strides in the kernel,
this gave rise to incorrect writes into the miscellaneous
buffer.
AMD-Internal: [CPUPL-7675]
Co-authored-by: Vishal-A <Vishal.Akula@amd.com>
Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
- If blis is compiled as a multithreaded library, BLIS_NUM_THREADS is set to 1, and the sizes are large enough for the multithreaded path to be optimal, we take the multithreaded path even though only one thread can be spawned. This adds OpenMP overhead.
- A check has been added inside the multithreaded kernels to fall back to the single-threaded code path if only one thread can be spawned.
AMD-Internal : [SWLCSG-3408]
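The guard described above can be sketched in plain C. The names `use_single_threaded_path` and `dot_mt` are hypothetical; the real check lives inside the BLIS multithreaded kernels and consults the library's own thread API.

```c
#include <stdbool.h>

/* Sketch of the added guard: even in a multithreaded build, take the
 * single-threaded code path when only one thread can be spawned,
 * so no OpenMP parallel region (and its overhead) is opened. */
static bool use_single_threaded_path(int nt_requested)
{
    return nt_requested <= 1;
}

static double dot_mt(const double *x, const double *y, int n, int nt)
{
    if (use_single_threaded_path(nt)) {
        /* Plain serial loop: no parallel region at all. */
        double acc = 0.0;
        for (int i = 0; i < n; i++) acc += x[i] * y[i];
        return acc;
    }
    /* ... multithreaded path would go here; serial stand-in ... */
    double acc = 0.0;
    for (int i = 0; i < n; i++) acc += x[i] * y[i];
    return acc;
}
```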
- When alpha == 0, we are expected to only scale the y vector by beta and not read A or X at all.
- This scenario is not handled properly in all code paths, which causes NaN and Inf values from A and X to be wrongly propagated. For example, for non-zen architectures (the default block in the switch case) no such check is present; similarly, some of the avx512 kernels are also missing these checks.
- When beta == 0, we are not expected to read Y at all; this also is not handled correctly in one of the avx512 kernels.
- To fix these, an early-return condition for alpha == 0 is added to the bla layer itself so that each kernel does not have to implement the logic.
- The DGEMV AVX512 transpose kernel has been fixed to load vector Y only when beta != 0.
AMD-Internal: [CPUPL-7585]
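The alpha == 0 early return can be sketched with a simplified scalar GEMV. This is an illustration of the rule, not the BLIS bla-layer code; the function names are hypothetical.

```c
#include <math.h>
#include <stddef.h>

/* alpha == 0: y := beta*y without ever reading A or x, so NaN/Inf
 * in A or x cannot propagate. beta == 0 additionally avoids using
 * the existing contents of y. */
static void gemv_scale_only(size_t m, double beta, double *y)
{
    for (size_t i = 0; i < m; i++)
        y[i] = (beta == 0.0) ? 0.0 : beta * y[i];
}

static void gemv(size_t m, size_t n, double alpha, const double *a,
                 const double *x, double beta, double *y)
{
    if (alpha == 0.0) {           /* early return: A and x untouched */
        gemv_scale_only(m, beta, y);
        return;
    }
    for (size_t i = 0; i < m; i++) {
        double acc = 0.0;
        for (size_t j = 0; j < n; j++)
            acc += a[i * n + j] * x[j];
        y[i] = alpha * acc + ((beta == 0.0) ? 0.0 : beta * y[i]);
    }
}
```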
- Remove redundant AOCL_DTL_LOG_NUM_THREADS calls from early return paths
- Update thread count logging to use AOCL_get_requested_threads_count() for early exits
- Clean up duplicate DTL logging in gemv_unf_var1 and gemv_unf_var2 implementations
- Remove thread count logging from bli_dgemv_n_zen4_int kernel variants
- Simplify aocldtl_blis.c AOCL_DTL_log_gemv_sizes by removing redundant conditional
- Standardize DTL trace exit patterns across axpy, scal, and gemv operations
- Remove commented-out DTL logging code in zen4 gemv kernel
This patch ensures thread count is logged only once per operation and uses
the correct API (AOCL_get_requested_threads_count) for early exit scenarios
where the actual execution thread count may differ from requested threads.
- Add explicit parentheses around (n <= 1520) && (k <= 128) to clarify
operator precedence and resolve compiler warning. The intended logic
is (m <= 1380) OR (n <= 1520 AND k <= 128).
- This change eliminates the compiler warning about mixing || and &&
operators without explicit grouping.
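The grouped condition can be written as a small predicate. The threshold values come from the commit text; the function name is illustrative.

```c
#include <stdbool.h>

/* The routing condition with explicit grouping: the intended logic is
 * (m <= 1380) OR ((n <= 1520) AND (k <= 128)). Without the inner
 * parentheses, compilers warn because `&&` binds tighter than `||`
 * and a reader may parse the expression differently. */
static bool within_threshold(int m, int n, int k)
{
    return (m <= 1380) || ((n <= 1520) && (k <= 128));
}
```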
Adding SGEMM tiny path for Zen architectures.
Needed to cover some performance gaps observed relative to MKL
Only allowing matrices that all fit into the L1 cache to the tiny path
Only tuned for single threaded operation at the moment
Todo: Tune cases where AVX2 performs better than AVX512 on Zen4
Todo: The current ranges are very conservative, there may be scope to increase the matrix sizes that go into the tiny path
AMD-Internal: CPUPL-7555
Co-authored-by: Rohan Rayan <rohrayan@amd.com>
- In the current implementation of bf16-to-f32 conversion for packed data,
we handle both the GEMM and GEMV conditions in the same function,
separated by conditionals.
- But when n = (NC+1), the function would execute the GEMV conversion logic
and write back the data incorrectly, leading to accuracy issues.
- Hence, modified the convert and reorder functions to have
separate conversion logic, to make the code cleaner and avoid confusion.
- Also updated the API calls to adhere to these changes.
[AMD-Internal: CPUPL-7540]
- Revert the logical operator from OR (||) to AND (&&) in the DCOMPLEX
(ZGEMM) SUP threshold condition for k <= 128. The previous change to
OR logic was causing performance regressions for certain input sizes
by incorrectly routing cases to the SUP path when the native path
would be more optimal.
- The current kernel uses masked AVX512 instructions to handle fringe cases.
- These instructions are slow on Genoa.
- To handle sizes less than 8, AVX2 and SSE code has been added.
- The existing masked AVX512 code performs better when n > 8, so it is still kept for handling larger sizes where n % 8 != 0.
AMD-Internal: [CPUPL-7467]
Threshold tuning that determines whether SUP or native path should
be used for given input matrix size.
This tuning forces skinny matrices to take SUP path to ensure better
performance.
AMD-Internal: [CPUPL-7369]
Co-authored-by: harsh dave <harsdave@amd.com>
* DTL Log update
Updates logs with nt and AOCL Dynamic selected nt for axpy, scal and dgemv
Modified bench_gemv.c to be able to process the modified dtl logs.
* Updated DTL log for copy routine with actual nt and dynamic nt
* Refactor OpenMP pragmas and clean up code
Removed unnecessary nested OpenMP pragma and cleaned up function end comment.
* Fixed DTL log for sequential build
* Added thread logging in bla_gemv_check for invalid inputs
---------
Co-authored-by: Smyth, Edward <Edward.Smyth@amd.com>
- Fixed a register assignment bug in lpgemv_m_kernel_f32_avx512 where zmm3
was incorrectly used instead of zmm4 in the BF16_F32_BETA_OP_NLT16F_MASK macro.
- Replaced hardware-specific BF16 conversion intrinsics with manual
rounding and bit manipulation using F32 instructions, for compatibility
on hardware without native BF16 support.
- Added AVX512_BF16 ISA support checks for s8s8s32obf16 and u8s8s32obf16
GEMM operations to ensure processor compatibility before execution.
AMD-Internal: [CPUPL-7410]
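The manual-rounding fallback mentioned above typically looks like the following round-to-nearest-even truncation. This is a minimal portable sketch of that scheme, not the AOCL kernel code (NaN handling is omitted); the function names are illustrative.

```c
#include <stdint.h>
#include <string.h>

/* F32 -> BF16 without AVX512_BF16 instructions: keep the top 16 bits
 * of the IEEE-754 float, rounding to nearest even. */
static uint16_t f32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));
    /* Bias of 0x7FFF plus the LSB of the kept part implements
     * round-to-nearest-even on the truncated low 16 bits. */
    uint32_t lsb = (bits >> 16) & 1u;
    bits += 0x7FFFu + lsb;
    return (uint16_t)(bits >> 16);
}

static float bf16_to_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}
```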
The current DAXPY kernel uses C code to handle cases where n % 8 != 0.
This results in the compiled code using separate SSE MUL+ADD instructions instead of an FMA instruction.
This causes inconsistent numerical results.
To fix this, the AVX2 and C code is replaced with masked AVX512 instructions to compute the fringe cases.
AMD-Internal : [CPUPL-7315]
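The core of the masked approach is forming a k-mask for the leftover elements. The sketch below shows only that mask computation in portable C (the function name is an assumption); in the kernel itself the mask would drive `_mm512_maskz_loadu_pd` / `_mm512_fmadd_pd` / `_mm512_mask_storeu_pd` so the fringe uses the same FMA as the main loop.

```c
#include <stdint.h>

/* Build an 8-bit k-mask with the low (n % 8) bits set, covering the
 * elements left over after the full 8-wide zmm iterations. */
static uint8_t daxpy_fringe_mask(int n)
{
    int rem = n % 8;
    return (uint8_t)((1u << rem) - 1u);   /* e.g. rem=3 -> 0b00000111 */
}
```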
Memory was not freed for GEMV when the MT kernel was called with NT = 1.
Fixed this by adding an extra check to make sure the memory is freed.
AMD-Internal: [CPUPL-7352]
Fix compiler warning messages in LPGEMM code:
- Removed extraneous parentheses in aocl_batch_gemm_s8s8s32os32.c
- Removed unused variables in lpgemv_{m,n}_kernel_s8_grp_amd512vnni.c
- Changed ERR_UBOUND in math_utils_avx2.h and math_utils_avx512.h
to match how it is specified in AOCL libm erff.c
AMD-Internal: [CPUPL-6579]
* Fixed coverity issue in ztrsm small code path
---------
Co-authored-by: harsh dave <harsdave@amd.com>
* Fixes Coverity static analysis issue in the DTRSM
- Initializes ps_a_use variable and calls bli_auxinfo_set_ps_a() to set
pack stride in auxinfo structure.
* Fixed uninitialized variable issue in the DTRSM
- Initializes ps_a_use variable and calls bli_auxinfo_set_ps_a() to set
pack stride in auxinfo structure.
---------
Co-authored-by: harsdave <harsdave@amd.com>
- AMD-specific BLAS1 and BLAS2 framework: changes to make variants
more consistent with each other
- Initialize kernel pointers to NULL where not immediately set
- Fix code indentation and other whitespace issues in DTL
code and addon/aocl_gemm/frame/s8s8s32/lpgemm_s8s8s32_sym_quant.c
- Fix typos in DTL comments
- Add missing newline at end of test/CMakeLists.txt
- Standardize on using arch_id variable name
AMD-Internal: [CPUPL-6579]
Previous commit (30c42202d7) for this problem turned off
-ftree-slp-vectorize optimizations for all kernels. Instead, copy
the approach of upstream BLIS commit 36effd70b6a323856d98 and disable
these optimizations only for the affected files by using GCC pragmas
AMD-Internal: [CPUPL-6579]
- Changed begin_asm and end_asm comments and removed unused code in files
kernels/haswell/3/sup/s6x16/bli_gemmsup_rv_haswell_asm_sMx6.c
kernels/zen4/3/sup/bli_gemmsup_cd_zen4_asm_z12x4m.c
to avoid problems in the clobber-checking script.
- Add missing clobbers in files
kernels/zen4/1m/bli_packm_zen4_asm_d24xk.c
kernels/zen4/1m/bli_packm_zen4_asm_z12xk.c
kernels/zen4/3/sup/bli_gemmsup_cv_zen4_asm_z12x4m.c
- Add missing newline at end of files.
- Update some copyright years for recent changes.
- Standardize license text formatting.
AMD-Internal: [CPUPL-6579]
- Disabled topology detection as libgomp is not honoring
the standard function omp_get_place_proc_ids
- Added B prefetch in bf16 B packing kernels
AMD-Internal: SWLCSG-3761
- Corrected the BF16 data handling in post-ops for F32 API.
- Verified and ensured that mask-loads are used wherever necessary.
AMD-Internal: CPUPL-7221
Details:
- Fixed loading of the matadd and matmul pointers in the GEMV
lt16 kernel for the AVX2 M=1 case.
- Hard-set the row stride of B to 1 (inside GEMV) when it has
already been reordered.
AMD-Internal: CPUPL-7197, CPUPL-7221
Co-authored-by: Balasubramanian, Vignesh <Vignesh.Balasubramanian@amd.com>
Replace fused multiply-add (FMA) intrinsics with explicit multiply and add/subtract operations in bli_cscalv_zen_int to resolve incorrect results with GCC 12 and later compilers.
The original code used register reuse pattern with _mm256_fmaddsub_ps() that causes GCC 12+ instruction scheduler to generate assembly with corrupted intermediate values due to register allocation conflicts. GCC 11 and earlier handled the same pattern correctly.
Changes:
- Replace _mm256_fmaddsub_ps() with _mm256_mul_ps() + _mm256_addsub_ps()
- Eliminate temp register reuse to fix instruction scheduling conflicts
AMD-Internal: [CPUPL-6445]
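A scalar model of the rewritten arithmetic is shown below. It mirrors the `_mm256_mul_ps` + `_mm256_addsub_ps` pairing on interleaved (re, im) data; the function name and loop shape are illustrative, not the bli_cscalv_zen_int kernel itself.

```c
/* Compute x := alpha*x for single-precision complex data stored as
 * interleaved (re, im) pairs, as separate multiply and add/subtract
 * steps instead of a fused multiply-addsub that reused a temp
 * register. Even lanes subtract, odd lanes add, matching addsub. */
static void cscal_mul_addsub(int n, float ar, float ai, float *x)
{
    for (int i = 0; i < n; i++) {
        float re = x[2 * i], im = x[2 * i + 1];
        /* multiply step: ar * (re, im)  (the _mm256_mul_ps part) */
        float pr = ar * re, pi = ar * im;
        /* addsub step: subtract ai*im in the even lane, add ai*re
         * in the odd lane (the _mm256_addsub_ps part) */
        x[2 * i]     = pr - ai * im;
        x[2 * i + 1] = pi + ai * re;
    }
}
```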
* Optimized avx512 ZGEMM kernel and edge-case handling
Edge kernel implementation:
- Refactored all of the zgemm kernels to process micro-tiles efficiently
- Specialized sub-kernels are added to handle the leftover m dimension:
12MASK, 8, 8MASK, 4, 4MASK, 2.
- 12MASK edge kernel handles 11, 10, 9 m_left using 2 full zmm
load/store and 1 masked load/store.
- Similarly 8MASK handles 7, 6, 5 m_left using 1 full zmm load/store and
1 masked load/store.
- 4MASK handles 3, 1 m_left using 1 masked load/store.
- ZGEMM kernel now internally decomposes the m dimension into the following.
The main kernel is 12x4, which has the following edge kernels to
handle the leftover m dimension:
edge kernels:
12MASKx4 (handles 11x4, 10x4, 9x4)
8x4 (handles 8x4)
8MASKx4 (handles 7x4, 6x4, 5x4)
4x4 (handles 4x4)
4MASKx4 (handles 3x4, 1x4)
2x4 (handles 2x4)
- Similarly, it decomposes for the (12x3, 12x2 and 12x1) n_left kernels, under
which the edge kernels 12MASKxN_LEFT(3, 2, 1), 8xN_LEFT(3, 2, 1),
8MASKxN_LEFT(3, 2, 1), 4xN_LEFT(3, 2, 1), 4MASKxN_LEFT(3, 2, 1) and
2xN_LEFT(3, 2, 1) handle the leftover m dimension.
Threshold tuning:
- Enforced odd m dimensions onto the avx512 kernels in the tiny path, as the avx2
kernels invoke gemv calls for m_left=1 (odd m dimension of the matrix).
The gemv function call adds overhead for very small sizes and results
in suboptimal performance.
- The condition check "m % 2 == 0" is added along with the threshold checks to
force inputs with an odd m dimension to use the avx512 zgemm kernel.
- Threshold change to route all of the inputs to the tiny path, eliminating
the dependency on the avx2 zgemm_small path if the A, B matrix storage is
'N' (no transpose) or 'T' (transpose).
- However, tiny re-uses the zgemm sup kernels, which do not support
conjugate-transpose storage of matrices. For such storage of the
A, B matrices we still rely on the avx2 zgemm_small kernel.
gtest changes:
- Removed the zgemm edge kernel functions (8x4, 4x4, 2x4 and fx4) and their
respective testing instances from gtest.
AMD-Internal: [CPUPL-7203]
---------
Co-authored-by: harsdave <harsdave@amd.com>
- The GEMV transpose kernels lack the ability to compute directly on non-unit-stride inputs.
- This limitation prevents libflame from using the blis kernels directly instead of going through the framework.
- Added the ability to handle non-unit incx in the kernel by packing x into a temporary buffer.
AMD-Internal: [CPUPL-6903]
Naming of Zen kernels and associated files was inconsistent with BLIS
conventions for other sub-configurations and between different Zen
generations. Other anomalies existed, e.g. dgemmsup 24x column
preferred kernels names with _rv_ instead of _cv_. This patch renames
kernels and file names to address these issues.
AMD-Internal: [CPUPL-6579]
Previously, the ZGEMM implementation used `zscalv` for cases
where the M dimension of matrix A is not a multiple of 24,
resulting in a ~40% performance drop.
This commit introduces specialized edge-case handling in the pack
kernel to optimize performance for these cases.
The new packing support significantly improves performance.
- Removed the reliance on `zscalv` for edge cases, addressing the
performance bottleneck.
AMD-Internal: [CPUPL-6677]
Co-authored-by: harsh dave <harsdave@amd.com>
Introduced support for GEMV operations with group-level symmetric quantization for the S8S8S32OS32 API.
Framework Changes:
- Added macro definitions and function prototypes for GEMV with symmetric quantization in lpgemm_5loop_interface_apis.h and lpgemm_kernels.h.
- LPGEMV_M_EQ1_KERN2 for the lpgemv_m_one_s8s8s32os32_sym_quant kernel, and
- LPGEMV_N_EQ1_KERN2 for the lpgemv_n_one_s8s8s32os32_sym_quant kernel.
- Implemented the main GEMV framework for symmetric quantization in lpgemm_s8s8s32_sym_quant.c.
Kernel Changes:
- lpgemv_m_one_s8s8s32os32_sym_quant for handling the case where M = 1 and implemented in lpgemv_m_kernel_s8_grp_amd512vnni.c.
- lpgemv_n_one_s8s8s32os32_sym_quant for handling the case where N = 1 and implemented in lpgemv_n_kernel_s8_grp_amd512vnni.c.
- Updated the buffer reordering logic for group quantization for N=1 cases in aocl_gemm_s8s8s32os32_utils.c.
Notes:
- Ensure that group_size is a factor of both K (and KC when K > KC).
- The B matrix must be provided in reordered format (mtag_b == REORDERED).
AMD-Internal: [SWLCSG-3604]
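The group_size precondition noted above can be expressed as a small validation check. This sketch mirrors the commit text only; the function name is hypothetical and the real check lives in the framework layer.

```c
#include <stdbool.h>

/* group_size must evenly divide K, and also KC whenever K > KC
 * (so every cache-blocked panel contains whole quantization groups). */
static bool valid_group_size(int k, int kc, int group_size)
{
    if (group_size <= 0 || k % group_size != 0)
        return false;
    if (k > kc && kc % group_size != 0)
        return false;
    return true;
}
```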
- Replaced separate real and imaginary accumulators (real_acc, imag_acc) with a column-wise accumulator array (row_acc[2]), making accumulation and updates to the target Y vector more direct, concise, and unified.
- Leveraged AVX-512 fused multiply-add/subtract operations (_mm512_fmaddsub_pd, _mm512_fmsubadd_pd) and efficient permutations (_mm512_permute_pd) to enable accurate and efficient computation of real and imaginary components in a single instruction, while reducing code complexity for both code paths.
- Removed redundant instructions (such as unnecessary permutations and zero-register operations) and simplified the control flow.
AMD-Internal: [CPUPL-7015]
* Bugfix: Tuned zgemm threshold for zen4
Threshold tuning that determines whether SUP or native path should
be used for given input matrix size.
This tuning forces skinny matrices to take SUP path to ensure better
performance.
* Bugfix: Tuned zgemm threshold for zen4 and zen5
---------
Co-authored-by: harsdave <harsdave@amd.com>
* Added DGEMV no-transpose multithreaded implementations
- Added new avx512 M and N kernels for DGEMV.
- Added multiple MT implementations for the same kernels.
- Added AOCL_dynamic logic for the L2 APIs.
- Tuned AOCL_dynamic and code-path selection for DGEMV on ZEN5.
- Added the same kernels for SGEMV, but these kernels are not enabled yet.
- Added an SGEMV reference kernel.
AMD-Internal: [SWLCSG-3408]
Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
Fixed static analysis issues in the ZTRSM (triangular solve with matrix) kernels for the Zen5 architecture by initializing variables to prevent potential use of uninitialized values.
- Initialized the loop variables i, j, and k_iter to 0 to prevent potential uninitialized access.
- Initialized the mask and remainder variables to 0 across multiple kernel functions.
- Updated the thresholds to enter the AVX512 SUP codepath in
ZGEMM (on ZEN5). This caters to inputs that scale well with
multithreaded execution (in the SUP path).
- Also updated the thresholds to decide ideal threads, based on
'm', 'n' and 'k' values. The thread-setting logic involves
determining the number of tiles for computation, and using them
to further tune for the optimal number of threads.
- This logic builds on the assumption that the current thread
factorization logic is optimal. Thus, an additional data analysis
was performed (on the existing ZEN4 and the new ZEN5 thresholds)
to also cover the corner cases where this assumption doesn't hold
true.
- As part of the future work, we could reimplement the thread
factorization for GEMM, which would additionally require a new
set of threshold tuning for every datatype.
AMD-Internal: [CPUPL-7028]
Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
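The tile-counting idea behind the thread selection can be modeled as follows. This is an illustrative assumption, not the tuned ZEN5 tables: MR/NR and the "at least one tile per thread" cap are placeholders for the actual per-size thresholds.

```c
/* Count micro-tiles of the output matrix and cap the thread count so
 * each thread has at least one tile to work on. */
enum { MR = 12, NR = 4 };   /* assumed micro-kernel tile, as in the 12x4 ZGEMM kernel */

static int ideal_threads(int m, int n, int nt_requested)
{
    int tiles = ((m + MR - 1) / MR) * ((n + NR - 1) / NR);
    return (tiles < nt_requested) ? tiles : nt_requested;
}
```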