amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-06-29 10:47:16 +00:00

Author	SHA1	Message	Date
Edward Smyth	1f0fb05277	Code cleanup: Copyright notices (2) More changes to standardize copyright formatting and correct years for some files modified in recent commits. AMD-Internal: [CPUPL-5895] Change-Id: Ie95d599710c1e0605f14bbf71467ca5f5352af12	2025-02-07 05:41:44 -05:00
Shubham Sharma	f8c83fedb6	Added new ZTRSM small code path for ZEN5 - Added new ZTRSM kernels for right and left variants. - Kernel dimensions are 12x4. - 12x4 ZGEMM SUP kernels are used internally for solving GEMM subproblem. - These kernels do not support conjugate transpose. - Only column major inputs are supported. - Tuned thresholds to pick efficient code path for ZEN5. AMD-Internal: [CPUPL-6356] Change-Id: I33ba3d337b0fcd972ca9cfe4668cb23d2b279b6e	2025-02-06 18:01:10 +05:30
Hari Govind S	b6e9cde317	Revert "Optimisation of AOCL-dynamic for dotv API" This reverts commit `a028108cbb`. Reason for revert: With libgomp, scalability issues were observed with a higher number of threads, leading to the use of fewer threads. However, with different OpenMP libraries like libomp, this scalability issue was not observed, and using fewer threads resulted in performance loss. The AOCL dynamic logic has been updated to select a higher number of threads, considering the iomp OpenMP library. Change-Id: I2432b715eff01fc99b2c0f8b60bdecfaf5a6568f	2025-02-05 06:33:06 -05:00
Hari Govind S	fe73445813	Introduced fast-path in DCOPYV API and fix compiler warning for AXPYV - Added a conditional check to invoke the vectorized DCOPYV kernels directly(fast-path), without incurring any additional framework overhead. - The fast-path is taken when the input size is ideal for single-threaded execution. Thus, we avoid the call to bli_nthreads_l1() function to set the ideal number of threads. - Used macros to protect the declaration of fast_path_thresh in DAXPYV API to avoid compiler warnings. AMD-Internal: [CPUPL-4875][CPUPL-5895] Change-Id: Id4141cd22e2382ece9e36fc02934bf6c11bd02cb	2025-02-05 04:41:55 -05:00
Shubham Sharma	bac0fed3cf	Fixed Bug in Dynamic Blocksizes - Mixed precision datatypes use a modified cntx. - For some variants of mixed precision, complex and real blocksizes need to be the same. This is achieved by creating a local copy of cntx and copying complex blocksizes onto real blocksizes. - By using the dynamic blocksizes, the changes made to the blocksizes for mixed precision are overwritten by changes made by dynamic blocksizes. - This mismatch between complex and real blocksizes is causing an issue where the pack buffer is allocated based on complex blocksizes but the amount of data packed is based on real blocksizes. - This makes the pack buffer sizes smaller than the required sizes. - To fix this, dynamic blocksizes are disabled for mixed precision. AMD-Internal: [CPUPL-6384] Change-Id: Ib9792f90b4ea113e54059a0da8fb4241622b5f83	2025-02-05 01:09:24 -05:00
Hari Govind S	3d2653f1ab	DDOTV Optimization for ZEN3 Architecture - Reduced the blocking size of 'bli_ddotv_zen_int10' kernel from 40 elements to 20 elements for better utilization of vector registers - Replaced redundant 'for' loops in 'bli_ddotv_zen_int10' kernel with 'if' conditions to handle remainder iterations, as only a single iteration is used when the remainder is less than the primary unroll factor. - Added a conditional check to invoke the vectorized DDOTV kernels directly(fast-path), without incurring any additional framework overhead. - The fast-path is taken when the input size is ideal for single-threaded execution. Thus, we avoid the call to bli_nthreads_l1() function to set the ideal number of threads. - Updated gtestsuite ukr tests for 'bli_ddotv_zen_int10' kernel. AMD-Internal: [CPUPL-4877] Change-Id: If43f0fcff1c5b1563ad233005717398b5b6fb8f2	2025-02-04 06:01:04 -05:00
Shubham Sharma	26bd265cfd	Optimized DTRSV for tiny sizes - Replaced switch case with if else; the lookup table for the switch case is placed at the end of the .text section, which causes a huge jump. - Reduced number of branches for tiny sizes. - Cpuid query is slow, therefore added a new if statement which avoids cpuid query for tiny sizes(<200). - Redirected tiny sizes to AVX2 kernel. AMD-Internal: [CPUPL-5407] Change-Id: I8e73777b2f00c9dcff9775ddfcb7ca3f74fa901c	2025-01-30 01:23:09 -05:00
harsh dave	c5e842e8d3	Revert changes to force dgemm inputs under threshold to arch-specific kernels in single-threaded mode - This patch reverts the previous changes that removed the enforcement of dgemm inputs under a certain threshold to be processed by kernels selected based on architecture ID and handled in single-threaded mode. - This change is now forcing such small inputs to be computed in tiny path. Previously when this check was not there, it was routing these inputs to SUP path and causing performance regression due to framework overhead. AMD-Internal: [CPUPL-5927] Change-Id: I4a4b21fdcf7c3ffaa09efa46ba12798eca0f10bb	2025-01-29 04:22:16 -05:00
Shubham Sharma	9fd2aebd25	Tuned DTRSM thresholds for ZEN5 - Tuned AOCL_dynamic thresholds for DTRSM for ZEN5. - Tuned thresholds for better selection of code path. AMD-Internal: [CPUPL-5408] Change-Id: Ic40b5c8d276c8ce8399fd49ce0d0569f79ec98be	2025-01-28 07:33:10 -05:00
Vignesh Balasubramanian	fb6dcc4edb	Support for Tiny-GEMM interface(ZGEMM) - As part of AOCL-BLAS, there exists a set of vectorized SUP kernels for GEMM, that are performant when invoked in a bare-metal fashion. - Designed a macro-based interface for handling tiny sizes in GEMM, that would utilize these kernels. This is currently instantiated for 'Z' datatype(double-precision complex). - Design breakdown : - Tiny path requires the usage of AVX2 and/or AVX512 SUP kernels, based on the micro-architecture. The decision logic for invoking tiny-path is specific to the micro-architecture. These thresholds are defined in their respective configuration directories(header files). - List of AVX2/AVX512 SUP kernels(lookup table), and their lookup functions are defined in the base-architecture from which the support starts. Since we need to support backward compatibility when defining the lookup table/functions, they are present in the kernels folder(base-architecture). - Defined a new type to be used to create the lookup table and its entries. This type holds the kernel pointer, blocking dimensions and the storage preference. - This design would only require the appropriate thresholds and the associated lookup table to be defined for the other datatypes and micro-architecture support. Thus, it is extensible. - NOTE : The SUP kernels that are listed for Tiny GEMM are m-var kernels. Thus, the blocking in framework is done accordingly. In case of adding the support for n-var, the variant information could be encoded in the object definition. - Added test-cases to validate the interface for functionality(API level tests). Also added exception value tests, which have been disabled due to the SUP kernel optimizations. AMD-Internal: [CPUPL-6040][CPUPL-6018][CPUPL-5319][CPUPL-3799] Change-Id: I84f734f8e683c90efa63f2fa79d2c03484e07956	2025-01-24 12:59:26 -05:00
Hari Govind S	349fc47ec5	DGEMV Optimizations for TRANSPOSE Cases - Developed new AVX512 DGEMV kernels for Zen4/5 architectures and AVX2 kernels for Zen1/2/3 architectures. These kernels are written from the ground up and are independent of fused kernels. - The DGEMV primary kernel processes the calculation in chunks of 8 columns. Fringe columns (sizes 1 to 7) are handled by fringe kernels, which are invoked by the primary kernel as needed. - Implemented the kernels by computing the dot product of matrix A columns with vector x in chunks of 32 elements, storing the results in accumulator registers. Fringe elements are handled in chunks of 16, 8, etc. The data in the accumulator registers is then reduced and added to vector y. AMD-Internal: [CPUPL-5835] Change-Id: I5cb9eb1330db095931586a7028fd7676fbbecc61	2025-01-24 00:38:34 -05:00
Arnav Sharma	66461b8df3	Improved Multi-threaded Performance of DSCALV - Added AOCL_DYNAMIC thresholds for DSCALV for Zen4 and Zen5 architectures, since earlier they were using the Zen thresholds. - Also updated ST_THRESH for Zen4 and Zen5 to avoid the OpenMP overheads incurred when the single-threaded path is optimally performant. AMD-Internal: [CPUPL-5934] Change-Id: I2d89cf5392516206fab83b672498fb8d98a5b033	2025-01-22 03:55:38 -05:00
Edward Smyth	3c2dedb13c	Restore or add update from env in bli_thread_get APIs We want bli_thread_get_num_threads() and bli_thread_get__nt() to report the threading values modified to reflect what will be in effect given OpenMP nesting and active levels. This was lost in commit `0c6d006225` for bli_thread_get_num_threads() and wasn't previously implemented in bli_thread_get__nt() AMD-Internal: [CPUPL-6168] Change-Id: Ife2d281546d2f79fc17cd712e574f29b06c30ccd	2025-01-20 08:58:22 -05:00
Vignesh Balasubramanian	8e660215c3	Introduced fast-path in DAXPYV API - Added a conditional check to invoke the vectorized DAXPYV kernels directly(fast-path), without incurring any additional framework overhead. - The fast-path is taken when the input size is ideal for single-threaded execution. Thus, we avoid the call to bli_nthreads_l1() function to set the ideal number of threads. AMD-Internal: [CPUPL-4878] Change-Id: I001fd1b8bbd2d691ecb3e2423ec7998e130850bb	2025-01-10 09:19:38 -05:00
Vignesh Balasubramanian	345204d69b	Additional updates to the thresholds for ZGEMM small path - Further updated the thresholds for entry to ZGEMM small path(AVX2), when the execution is multithreaded. The newer thresholds account for skinnier inputs, compatible with single-threaded small path, as opposed to multithreaded SUP path. AMD-Internal: [CPUPL-6040][CPUPL-5930] Change-Id: I333f97d8af49733310e4ae48b12baba15ef828d6	2025-01-10 08:29:31 -05:00
Edward Smyth	567039a7fe	Fortran interfaces for bli_thread_get APIs Create and export Fortran interfaces for bli_thread_get_num_threads() and bli_thread_get_{jc,pc,ic,jr,ir}_nt() APIs. bli_thread_get_is_parallel() is intended for internal BLIS usage, so not adding a Fortran interface for it at this time. AMD-Internal: [CPUPL-6168] Change-Id: Ieba2537e5455cc289536aec3de5d4b5866e607f1	2025-01-10 05:07:33 -05:00
harsh dave	aeda581539	Fix: Rename bli_tiny_gemm.c to bli_tiny_gemm_amd.c When compiling with config generic (or any non-zen build), the bli_dgemm_tiny_6x8 kernel is not defined. Since bli_dgemm_tiny() is only used within amd specific file, bli_tiny_gemm.c has been renamed to bli_tiny_gemm_amd.c to reflect its specific usage. Thanks to Smyth, Edward<edward.smyth@amd.com> for identifying and helping to fix the issue. Change-Id: If5d134aeba6d30d0a51e6d7d6fa9b3c4450a3307	2025-01-07 23:29:32 -05:00
Vignesh Balasubramanian	cdaa2ac7fd	Bugfix and optimizations for AVX512 AMAXV micro-kernels - Bug : The current {S/D}AMAXV AVX512 kernels produced incorrect results when the input had multiple absolute maximums. They returned the last index when there were multiple occurrences, instead of the first one. - Implemented a bug-fix to handle this issue on these AVX512 kernels. Also ensured that the kernels are compliant with the standard when handling exception values. - Further optimized the code by decoupling the logic to find the maximum element and its search space for index. This way, we use lower-latency instructions to compute the maximum first. - Updated the unit-tests, exception value tests and early return tests for the API to ensure code-coverage. AMD-Internal: [CPUPL-4745] Change-Id: I2f44d33dbaf89fe19e255af1f934877816940c6f	2025-01-07 22:56:20 +05:30
Vignesh Balasubramanian	f548f42607	Fixing compiler warnings on ZGEMM - Scoped some of the variables used in zgemm_blis_impl() when determining the thresholds to small path. These variables will be used only when the architecture is ZEN5 or ZEN4. AMD-Internal: [CPUPL-5895] Change-Id: I6f90856f34454423ac777e33c74fe5ec6bb94e13	2025-01-07 10:59:43 +05:30
Edward Smyth	4ce708c316	Move some BLAS extension APIs to extra subdirectories In preparation for merging next group of changes from upstream BLIS, move some BLAS extension APIs to new extra subdirectories in frame/compat and frame/compat/cblas/src. Other extension APIs will be moved in later commits. Some tidying up to better match upstream BLIS code has also been done. AMD-Internal: [CPUPL-2698] Change-Id: I0780a775d37242fba562c3f13666da0ad2b2cdfb	2024-12-17 04:54:39 -05:00
Hari Govind S	a028108cbb	Optimisation of AOCL-dynamic for dotv API - Adding AOCL-dynamic logic for dotv on zen4 and zen5 architecture. AMD-Internal: [CPUPL-4877] Change-Id: I421ad5e48cf001d1fc5c13e9f1cbbf4db0bf5f37	2024-12-16 22:09:59 -05:00
Edward Smyth	0c6d006225	Changes to rntm to reduce mutex operations Change usage of global_rntm and tl_rntm to eliminate need for mutex operations when accessing global_rntm. Usage of these data structures is now as follows: * global_rntm is set once during bli_init_apis and includes all getenv calls to check BLIS threading and error printing environment variables. global_rntm is then read-only. * tl_rntm is initialized once from global_rntm on each application thread. Any calls to BLIS set threading/ways APIs will update tl_rntm for that application thread only (Previously they updated global_rntm for all application threads). * Re-initialize info_value in tl_rntm in every call to bli_init APIs. * In bli_rntm_init_from_global() we initialize the local (per API call) rntm as a copy of tl_rntm and then update threading values in bli_thread_update_rntm_from_env() to reflect the current status of OpenMP runtime ICVs. AMD-Internal: [CPUPL-6168][SWLCSG-3143] Change-Id: Ib9387ee2b51f507ed08cc38267057109acea14a6	2024-12-16 04:45:26 -05:00
Shubham Sharma	beaea1b88f	Added new DTRSM small code path for ZEN5 - Added new DTRSM kernels for right and left variants. - Kernel dimensions are 24x8. - 24x8 DGEMM SUP kernels are used internally for solving GEMM subproblem. - Tuned thresholds to pick efficient code path for ZEN5. AMD-Internal: [CPUPL-6016] Change-Id: I743d6dc47717952c2913085c0db3454ae9d046db	2024-12-16 10:38:45 +05:30
Vignesh Balasubramanian	609af9bfe2	Threshold tuning for ZGEMM small path - Updated the threshold check for ZGEMM small path to include runtime checks for redirection, specific to the micro-architecture. - The current ZGEMM small path has only its AVX2 variant available. Post implementing an AVX512(same/different algorithm), the thresholds will further be fine-tuned. - Included the dot-product based AVX512 ZGEMM kernels in the ZEN5 context. It will be used as part of handling RRC and CRC storage schemes of C, A and B matrices in both single-thread and multi-thread runs. AMD-Internal: [CPUPL-5949] Change-Id: Ic8b7cf0e00b7c477f748669f160c4b01df995c75	2024-12-13 12:51:22 -05:00
harsdave	54b46ec1ed	Enhance 24x8 DGEMM SUP/Tiny Kernel Performance with Optimized Loops and Edge Kernels This patch introduces comprehensive optimizations to the DGEMM kernel, focusing on loop efficiency and edge kernel performance. The following technical improvements have been implemented: 1. IR Loop Optimization: - The IR loop has been re-implemented in hand-written assembly to eliminate the overhead associated with `begin_asm` and `end_asm` calls, resulting in more efficient execution. 2. JR Loop Integration: - The JR loop is now incorporated into the micro kernel. This integration avoids the repetitive overhead of stack frame management for each JR iteration, thereby enhancing loop performance. 3. Kernel Decomposition Strategy: - The m dimension is decomposed into specific sizes: 20, 18, 17, 16, 12, 11, 10, 9, 8, 4, 2, and 1. - For remaining cases, masked variants of edge kernels are utilized to handle the decomposition efficiently. 1. Interleaved Scaling by Alpha: - Scaling by the alpha factor is interleaved with load instructions to optimize the instruction pipeline and reduce latency. 2. Efficient Mask Preparation: - Masks are prepared within inline assembly code only at points where masked load-store operations are necessary, minimizing unnecessary overhead. 3. Broadcast Instruction Optimization: - In edge kernels where each FMA (Fused Multiply-Add) operation requires a broadcast without subsequent reuse, the broadcast instruction is replaced with `mem_1to8`. - This allows the compiler to optimize by assigning separate vector registers for broadcasting, thus avoiding dependency chains and improving execution efficiency. 4. C Matrix Update Optimization: - During the update of the C matrix in edge kernels, columns are pre-loaded into multiple vector registers. This approach breaks dependency chains during FMA operations following the scaling by alpha, thereby mitigating performance bottlenecks and enhancing throughput. 
These optimizations collectively improve the performance of the DGEMM kernel, particularly in handling edge cases and reducing overhead in critical loops. The changes are expected to yield significant performance gains in matrix multiplication operations. This patch also involves changes for tiny gemm interface. A light interface for calling kernels and removing calls to avx2 dgemm kernels as we use avx512 dgemm kernels for all the sizes for zen4 and zen5. For zen4 and zen5 when A matrix transposed(CRC, RRC), tiny kernel does not have the support to handle such inputs and thus such inputs are routed to gemm_small path. AMD-Internal: [CPUPL-6054] Change-Id: I57b430f9969ca39aa111b54fa169e4225b900c4a	2024-12-13 00:03:00 -05:00
Edward Smyth	5c946e10dc	Always define function bli_nthreads_optimum bli_nthreads_optimum is exported and called directly by AOCL libFLAME, however it was only defined if building a multithreaded BLIS library with AOCL_DYNAMIC enabled. Change to always define this function. If BLIS is serial or if AOCL_DYNAMIC is disabled, this function returns without modifying the supplied rntm. Change-Id: Ie65690e9e6ec2a8ea77b3778f96676a68e6260be	2024-12-12 12:43:43 -05:00
Arnav Sharma	25e59fcbb9	DGEMV Optimizations for NO_TRANSPOSE Cases - AVX512 specific DGEMV native kernels are added for Zen4/5 architectures to handle the NO_TRANSPOSE cases and are independent of the AXPYF fused kernels. - The following set of kernels biased towards the n-dimension perform beta scaling of y vector within the kernel itself and handle cases where n is less than 5: - bli_dgemv_n_zen_int_32x8n_avx512( ... ) - bli_dgemv_n_zen_int_32x4n_avx512( ... ) - bli_dgemv_n_zen_int_32x2n_avx512( ... ) - bli_dgemv_n_zen_int_32x1n_avx512( ... ) - The bli_dgemv_n_zen_int_16mx8_avx512( ... ) is biased towards the m-dimension and for this kernel beta scaling is handled beforehand within the framework. - Added unit-tests for the new kernels. - AVX2 path for Zen/2/3 architectures still follows the old approach of using fused kernel, namely AXPYF, to perform the GEMV operation. AMD-Internal: [CPUPL-5560] Change-Id: I22bc2a865cd28b9cdcb383e17d1ff38bdd28de79	2024-12-12 10:26:50 -05:00
Vignesh Balasubramanian	da6e9defcb	Dynamic selection of AVX2 or AVX512 DNRM2 kernels - Added a kernel selection logic based on the input dimension(runtime parameter), to choose between deploying AVX2 or AVX512 computational kernel for single-thread execution. - An empirical analysis was conducted to arrive at the thresholds, for ZEN4 and ZEN5 architectures. - Updated the fast-path threshold for ZEN4 to be in hand with the tipping points of its dynamic thread-setter(used when AOCL_DYNAMIC is enabled). AMD-Internal: [CPUPL-5937] Change-Id: I96d7f167658c9e25a0098c4c67e12e4ba673e228	2024-12-10 10:53:54 +05:30
Shubham Sharma.	be6fbadd95	BlockSize Tuning for ZEN4 and ZEN5 - Enabled dynamic blocksizes for DGEMM in ZEN4 and ZEN5 systems. - MC, KC and NC are dynamically selected at runtime for DGEMM native. - A local copy of cntx is created and blocksizes are updated in the local cntx. - Updated threshold for picking DGEMM SUP kernel for ZEN4. AMD-Internal: [CPUPL-5912] Change-Id: Ic12a1a48bfa59af26cc17ccfa47a2a33fadde1f6	2024-11-29 03:19:16 -05:00
Shubham Sharma	f2320a1fef	Enabled DGEMM row major kernel for ZEN4 - Merged ZEN4 and ZEN5 DGEMM 8x24 kernel. - Replaced 32x6 kernel with 8x24. Now same kernel is used for ZEN4 and ZEN5. - Blocksizes have been tuned for genoa only. - DGEMM kernel for DTRSM native code path is replaced with 8x24 kernel. - Enabled alpha scaling during packing for ZEN4. - ZEN4 8x24 kernel has been removed. AMD-Internal: [CPUPL-5912] Change-Id: I89a16a7e3355af037d21d453aabf53c5ecccb754	2024-11-29 08:18:48 +00:00
Shubham Sharma	266bd32dea	Enable fringe case handling in DGEMM ZEN5 macro kernel - Generic kernel is used if N is not multiple of NR or M is not multiple of MR. - This limits the maximum values of NR that can be used. - Support for fringe case handling is added in DGEMM macro kernel so that macro kernel can be used for all problem sizes. AMD-Internal: [CPUPL-5912] Change-Id: I85c17e91d7511bb35ffed0f346d6ff0376baf62f	2024-11-29 00:22:33 -05:00
Kiran Varaganti	3e2795f406	OpenMP barrier overhead bug fix In the function bli_thread_update_rntm_from_env(), a mutex is used for reading global_rntm: "bli_pthread_mutex_lock( &global_rntm_mutex );". This causes a regression when the application is multithreaded. The regression is due to these mutexes. Imagine a scenario where two threads are launched: one thread acquires this mutex and the second thread stalls until the mutex is freed by the first thread. As a result, the second thread is slower to arrive at the OpenMP barrier in the application, thereby increasing the OpenMP barrier overhead. Things get worse when more threads are launched. Thanks to rocHPL for sharing the standalone panelfact application to reproduce this issue. Thanks to Edward Smyth (edward.smyth@amd.com) for finding this bug. [SWLCSG-3143]	2024-11-22 15:36:30 +05:30
Vignesh Balasubramanian	4da1ad2cd9	Added CBLAS wrappers for complex precision ?ROT and ?ROTG APIs - Added the appropriate CBLAS wrappers for CROTG, CSROT, ZROTG and ZDROT APIs. These would internally call their ?_blis_impl() layer. AMD-Internal: [CPUPL-5813] Change-Id: I6037f20092f99cc5a5e2794d03bbe76d6a55eb97	2024-09-19 08:49:46 -04:00
Shubham Sharma	b3b56ae3bb	BugFix: Fixed GEMM mixed precision failure in ZEN5 - Optimized DGEMM macro kernel does not support mixed precision. - This kernel was being used for solving some of the mixed precision problems. - Currently only ( bli_obj_elem(A) == 8 ) is used for checking if the problem being solved is mixed precision. - bli_obj_elem(A) will be equal to 8 for both double precision data type and mixed precision case single-complex. - Added extra checks (bli_obj_is_real( a )) to make sure that A and B are real and DGEMM macro kernel is being used only for DDDGEMM. AMD-Internal: [CPUPL-5804] Change-Id: Iaa1accf8d851d11533f8ba31dc0235fbc14f89a9	2024-09-19 04:54:53 -04:00
Edward Smyth	a07e041b1f	SCALV alpha=zero BLAS compliance SCALV is used directly by BLAS, CBLAS and BLIS scal{v} APIs but also within many other APIs to handle special cases. In general it is preferred to use SETV when alpha=0, but BLAS and CBLAS continue to multiply all vector elements by alpha. This has different behaviour for propagating NaNs or Infs. Changes in this commit: - Standardize early returns from SCALV reference and optimized kernels. - User supplied N<0 is handled at the top level API layer. Use negative values of N in kernel calls to signify that SETV should _not_ be used when alpha=0. This should only be required in SCALV. - Include serial threshold in zdscal (as in dscal) to reduce overhead for small problem sizes. - Code tidying to make different variants more consistent. - More standardization of tests in SCALV gtestsuite programs. - Remove scalv_extreme_cases.cpp as it is now redundant. AMD-Internal: [CPUPL-4415] Change-Id: I42e98875ceaea224cc98d0cdfe0133c9abc3edae	2024-09-16 07:10:28 -04:00
Edward Smyth	711dce14d0	Export full set of _blis_impl interfaces The _blis_impl layer provides a BLAS-like API for use in builds where BLAS and CBLAS interfaces are not desirable. This patch generates interfaces in uppercase and with and without trailing underscores, to match what is generated for the regular BLAS interface. AMD-Internal: [CPUPL-5650] Change-Id: I3ba9d0992291b0977479ab479acb71e42277c7c2	2024-09-03 04:13:06 -04:00
Hari Govind S	6dd8f06aff	Bug Fix: When calculating number of threads for level1 APIs when BLIS_IC_* or BLIS_JC_* are set - Reverted the change done for tuning ddotv API. When number of threads is mentioned using BLIS_IC_NT or BLIS_JC_NT, ... number of threads are not calculated and as a result number of threads value is -1. OpenMP threads are launched with -1 value. This results in crash. This bug is fixed by correctly calculating number of threads. AMD-Internal: [SWLCSG-3028][CPUPL-5689] Change-Id: Ib9284dca02bdb115752926109beb28dc342e300a	2024-08-29 05:42:03 -04:00
Edward Smyth	1f18eeb267	Determine AMD FP/SIMD execution datapath width Different Zen processors may have a 512-bit, 256-bit or 128-bit FP/SIMD execution datapath width (FP512, FP256, FP128). Zen5 allows a selection of FP512 or FP256 width in BIOS settings. Add cpuid code to detect the width and store an indication of it in the global variable bli_fp_datapath. This should be accessed internally via the function bli_cpuid_query_fp_datapath(). This functionality is currently only enabled on x86_64 platforms and only currently reports a value for AMD CPUs. Also add Zen3 as a fallback path for any unknown AMD processors if AVX512 is not supported or has been disabled. AMD-Internal: [CPUPL-4415] Change-Id: Idf3fb5a697b43bc035ce110e86f60706dcc67f2a	2024-08-28 12:25:57 -04:00
Vignesh Balasubramanian	189a0b7224	Bugfix for {D/C/Z}AXPBY and ZAXPY BLAS APIs - Bug : For non-zen architectures, {D/C/Z}AXPBY had incorrect datatypes passed when querying the computational kernel from context. The right datatype is now passed to each variant. - Bug : For ZAXPY, a NULL context was passed to the kernel when using the single-threaded path. In case of further using the context inside the kernel, this would be an issue. We now pass the context instead of a null pointer. AMD-Internal: [CPUPL-5643] Change-Id: I01bb78bda6be61c43543b16fda0ac02a988a07bf	2024-08-22 14:12:14 +05:30
Shubham Sharma	d322cc11f8	Tiny size optimization for DTRSV var2 - Use AVX2 kernels for tiny sizes on genoa. - Removed the runtime init overhead for small sizes. AMD-Internal: [CPUPL-5407] Change-Id: I0db7d93abc659012916ef706f22528c7fabb4e30	2024-08-20 00:40:25 -04:00
Shubham Sharma.	1a17e95332	Fixed failures for mixed precision on ZEN5 in GEMM - Optimized macro kernel (bli_dgemm_avx512_asm_8x24_macro_kernel) for zen5 does not support alpha scaling. Alpha scaling is supported by zen5 micro kernel (bli_dgemm_avx512_asm_8x24). - Optimized macro kernel expects alpha scaling to be done during packing. The packing kernel used for mixed precision does not support alpha scaling. Therefore, the optimized Zen5 macro kernel is not compatible with existing packing logic. - Changes have been made to use the generic macro kernel, which in turn uses the zen5 micro kernel that supports alpha scaling, for mixed precision. AMD-Internal: [CPUPL-5058] Change-Id: I1bfeb32ae07eedafadad7dd2c62d63913a46e446	2024-08-20 00:40:12 -04:00
Edward Smyth	7fff7b4026	Code cleanup: Miscellaneous fixes - Delete unused cmake files. - Add guards around call to bli_cpuid_is_avx2fma3_supported in frame/3/bli_l3_sup.c, currently assumes that non-x86 platforms will not use bli_gemmtsup. - Correct variable in frame/base/bli_arch.c on non-x86 builds. - Add guards around omp pragma to avoid possible gcc compiler warning in kernels/zen/2/bli_gemv_zen_int_4.c. - Add missing registers in clobber list in kernels/zen4/1/bli_dotv_zen_int_avx512.c. - Add gtestsuite ERS_IIT tests for TRMV, copied from TRSV. - Correct calls to cblas_{c,z}swap in gtestsuite. - Correct test name in ddotxf gtestsuite program. AMD-Internal: [CPUPL-4415] Change-Id: I69ad56390017676cc609b4d3aba3244a2df6a6b5	2024-08-06 06:56:01 -04:00
Hari Govind S	d349f89df6	Fix warning caused by dscalv - Setting the value for ST_THRESH for default code path in dscalv API to avoid warning message. Change-Id: I8ace2070350267904faa498197b8356de9af58d1	2024-08-06 12:13:23 +05:30
Edward Smyth	89f52a6df5	Code cleanup: spelling corrections Corrections for spelling and other mistakes in code comments and doc files. AMD-Internal: [CPUPL-4500] Change-Id: I33e28932b0e26bbed850c55602dee12fd002da7f	2024-08-05 16:18:51 -04:00
Edward Smyth	82bdf7c8c7	Code cleanup: Copyright notices - Standardize formatting (spacing etc). - Add full copyright to cmake files (excluding .json) - Correct copyright and disclaimer text for frame and zen, skx and a couple of other kernels to cover all contributors, as is commonly used in other files. - Fixed some typos and missing lines in copyright statements. AMD-Internal: [CPUPL-4415] Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371	2024-08-05 15:35:08 -04:00
Edward Smyth	09c45525f4	Missing early returns (2) Add missing early return in axpyv. AMD-Internal: [CPUPL-5540] Change-Id: I522fd6f5551a4dab24e8c164fa38818c900b89f8	2024-08-05 12:18:33 -04:00
Edward Smyth	591a3a7395	Code cleanup: file formats and permissions - Remove execute file permission from source and make files. - dos2unix conversion. - Add missing eol at end of files. Also update .gitignore to not exclude build directory but to exclude any build_* created by cmake builds. AMD-Internal: [CPUPL-4415] Change-Id: I5403290d49fe212659a8015d5e94281fe41eb124	2024-08-05 11:52:33 -04:00
Shubham Sharma.	dac0524ac8	BugFix in AVX512 DGEMMT SUP ST RRC variant - C<- alpha * op(A) op(B) + beta C. C(nxn) - A(n x k) * B(k x n) For ZEN4 and ZEN5 DGEMM is col-preferred kernel DGEMMT = DGEMM + DGEMMT DGEMM is col-preferred and DGEMMT is row-preferred. DGEMM is evaluated as C = AB (all col-storage) whereas DGEMMT is evaluated as C = B A (row-storage). When A is packed it is packed as row-panels with col-stored elements. So DGEMM is evaluated as C = AB (A is col-stored) it aligns with col-stored preference. For DGEMMT: C = B A, here A will become col-stored because of packing and as a result it will break the DGEMMT kernel assumption that A is row-storage. - Fixed this by disabling this optimization for ZEN4 and ZEN5. AMD-Internal: [CPUPL-5542] Change-Id: I9645624be009d1050ecb908d65c04aadcfa04379	2024-08-05 05:21:24 -04:00
Arnav Sharma	0a5c057475	DGEMV Optimizations for Tiny Sizes - Added reference kernel for dgemv that handles computation for tiny sizes (m < 8 && n < 8). - The reference kernel, bli_dgemv_zen_ref( ... ), supports both row/column storage schemes as well as transpose and no transpose cases. - Added additional unit-tests for functional verification. AMD-Internal: [CPUPL-5098] Change-Id: I66fdf0a40e90bdb3fed40152c45ab28a17a87ada	2024-08-05 12:19:42 +05:30
Hari Govind S	3ae466697b	Fixed performance drop of multi-threaded dscalv - Avoid performance degradation of dscalv for ST when OpenMP is enabled by using fast-path to skip the overhead caused by 'bli_nthreads_l1' function if the input size is less than a particular threshold. - Replaced 'bli_thread_vector_partition' work distribution function with 'bli_thread_range_sub'. AMD-Internal: [CPUPL-5522] Change-Id: I4ad0041d6e448c4a26fcd47ce44e0321a41b8b9f	2024-08-05 01:51:30 -04:00

1457 Commits