amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-06-30 03:07:23 +00:00

Author	SHA1	Message	Date
Shubham Sharma	f8c83fedb6	Added new ZTRSM small code path for ZEN5 - Added new ZTRSM kernels for right and left variants. - Kernel dimensions are 12x4. - 12x4 ZGEMM SUP kernels are used internally for solving GEMM subproblem. - These kernels do not support conjugate transpose. - Only column major inputs are supported. - Tuned thresholds to pick efficient code path for ZEN5. AMD-Internal: [CPUPL-6356] Change-Id: I33ba3d337b0fcd972ca9cfe4668cb23d2b279b6e	2025-02-06 18:01:10 +05:30
Hari Govind S	fe73445813	Introduced fast-path in DCOPYV API and fix compiler warning for AXPYV - Added a conditional check to invoke the vectorized DCOPYV kernels directly(fast-path), without incurring any additional framework overhead. - The fast-path is taken when the input size is ideal for single-threaded execution. Thus, we avoid the call to bli_nthreads_l1() function to set the ideal number of threads. - Used macros to protect the declaration of fast_path_thresh in DAXPYV API to avoid compiler warnings. AMD-Internal: [CPUPL-4875][CPUPL-5895] Change-Id: Id4141cd22e2382ece9e36fc02934bf6c11bd02cb	2025-02-05 04:41:55 -05:00
Hari Govind S	3d2653f1ab	DDOTV Optimization for ZEN3 Architecture - Reduced the blocking size of 'bli_ddotv_zen_int10' kernel from 40 elements to 20 elements for better utilization of vector registers - Replaced redundant 'for' loops in 'bli_ddotv_zen_int10' kernel with 'if' conditions to handle remainder iterations, as only a single iteration is used when the remainder is less than the primary unroll factor. - Added a conditional check to invoke the vectorized DDOTV kernels directly(fast-path), without incurring any additional framework overhead. - The fast-path is taken when the input size is ideal for single-threaded execution. Thus, we avoid the call to bli_nthreads_l1() function to set the ideal number of threads. - Updated gtestsuite ukr tests for 'bli_ddotv_zen_int10' kernel. AMD-Internal: [CPUPL-4877] Change-Id: If43f0fcff1c5b1563ad233005717398b5b6fb8f2	2025-02-04 06:01:04 -05:00
Shubham Sharma	9fd2aebd25	Tuned DTRSM thresholds for ZEN5 - Tuned AOCL_dynamic thresholds for DTRSM for ZEN5. - Tuned thresholds for better selection of code path. AMD-Internal: [CPUPL-5408] Change-Id: Ic40b5c8d276c8ce8399fd49ce0d0569f79ec98be	2025-01-28 07:33:10 -05:00
Vignesh Balasubramanian	fb6dcc4edb	Support for Tiny-GEMM interface(ZGEMM) - As part of AOCL-BLAS, there exists a set of vectorized SUP kernels for GEMM, that are performant when invoked in a bare-metal fashion. - Designed a macro-based interface for handling tiny sizes in GEMM, that would utilize these kernels. This is currently instantiated for 'Z' datatype(double-precision complex). - Design breakdown : - Tiny path requires the usage of AVX2 and/or AVX512 SUP kernels, based on the micro-architecture. The decision logic for invoking tiny-path is specific to the micro-architecture. These thresholds are defined in their respective configuration directories(header files). - List of AVX2/AVX512 SUP kernels(lookup table), and their lookup functions are defined in the base-architecture from which the support starts. Since we need to support backward compatibility when defining the lookup table/functions, they are present in the kernels folder(base-architecture). - Defined a new type to be used to create the lookup table and its entries. This type holds the kernel pointer, blocking dimensions and the storage preference. - This design would only require the appropriate thresholds and the associated lookup table to be defined for the other datatypes and micro-architecture support. Thus, it is extensible. - NOTE : The SUP kernels that are listed for Tiny GEMM are m-var kernels. Thus, the blocking in framework is done accordingly. In case of adding the support for n-var, the variant information could be encoded in the object definition. - Added test-cases to validate the interface for functionality(API level tests). Also added exception value tests, which have been disabled due to the SUP kernel optimizations. AMD-Internal: [CPUPL-6040][CPUPL-6018][CPUPL-5319][CPUPL-3799] Change-Id: I84f734f8e683c90efa63f2fa79d2c03484e07956	2025-01-24 12:59:26 -05:00
Hari Govind S	349fc47ec5	DGEMV Optimizations for TRANSPOSE Cases - Developed new AVX512 DGEMV kernels for Zen4/5 architectures and AVX2 kernels for Zen1/2/3 architectures. These kernels are written from the ground up and are independent of fused kernels. - The DGEMV primary kernel processes the calculation in chunks of 8 columns. Fringe columns (sizes 1 to 7) are handled by fringe kernels, which are invoked by the primary kernel as needed. - Implemented the kernels by computing the dot product of matrix A columns with vector x in chunks of 32 elements, storing the results in accumulator registers. Fringe elements are handled in chunks of 16, 8, etc. The data in the accumulator registers is then reduced and added to vector y. AMD-Internal: [CPUPL-5835] Change-Id: I5cb9eb1330db095931586a7028fd7676fbbecc61	2025-01-24 00:38:34 -05:00
Arnav Sharma	66461b8df3	Improved Multi-threaded Performance of DSCALV - Added AOCL_DYNAMIC thresholds for DSCALV for Zen4 and Zen5 architectures, since earlier they were using the Zen thresholds. - Also updated ST_THRESH for Zen4 and Zen5 to avoid the OpenMP overheads incurred when the single-threaded path is optimally performant. AMD-Internal: [CPUPL-5934] Change-Id: I2d89cf5392516206fab83b672498fb8d98a5b033	2025-01-22 03:55:38 -05:00
Vignesh Balasubramanian	8e660215c3	Introduced fast-path in DAXPYV API - Added a conditional check to invoke the vectorized DAXPYV kernels directly(fast-path), without incurring any additional framework overhead. - The fast-path is taken when the input size is ideal for single-threaded execution. Thus, we avoid the call to bli_nthreads_l1() function to set the ideal number of threads. AMD-Internal: [CPUPL-4878] Change-Id: I001fd1b8bbd2d691ecb3e2423ec7998e130850bb	2025-01-10 09:19:38 -05:00
Vignesh Balasubramanian	345204d69b	Additional updates to the thresholds for ZGEMM small path - Further updated the thresholds for entry to ZGEMM small path(AVX2), when the execution is multithreaded. The newer thresholds account for skinnier inputs, compatible with single-threaded small path, as opposed to multithreaded SUP path. AMD-Internal: [CPUPL-6040][CPUPL-5930] Change-Id: I333f97d8af49733310e4ae48b12baba15ef828d6	2025-01-10 08:29:31 -05:00
Edward Smyth	567039a7fe	Fortran interfaces for bli_thread_get APIs Create and export Fortran interfaces for bli_thread_get_num_threads() and bli_thread_get_{jc,pc,ic,jr,ir}_nt() APIs. bli_thread_get_is_parallel() is intended for internal BLIS usage, so not adding a Fortran interface for it at this time. AMD-Internal: [CPUPL-6168] Change-Id: Ieba2537e5455cc289536aec3de5d4b5866e607f1	2025-01-10 05:07:33 -05:00
Vignesh Balasubramanian	cdaa2ac7fd	Bugfix and optimizations for AVX512 AMAXV micro-kernels - Bug : The current {S/D}AMAXV AVX512 kernels produced incorrect results when the input had multiple absolute maximums. They returned the last index when having multiple occurrences, instead of the first one. - Implemented a bug-fix to handle this issue on these AVX512 kernels. Also ensured that the kernels are compliant with the standard when handling exception values. - Further optimized the code by decoupling the logic to find the maximum element and its search space for index. This way, we use lower-latency instructions to compute the maximum first. - Updated the unit-tests, exception value tests and early return tests for the API to ensure code-coverage. AMD-Internal: [CPUPL-4745] Change-Id: I2f44d33dbaf89fe19e255af1f934877816940c6f	2025-01-07 22:56:20 +05:30
Vignesh Balasubramanian	f548f42607	Fixing compiler warnings on ZGEMM - Scoped some of the variables used in zgemm_blis_impl() when determining the thresholds to small path. These variables will be used only when the architecture is ZEN5 or ZEN4. AMD-Internal: [CPUPL-5895] Change-Id: I6f90856f34454423ac777e33c74fe5ec6bb94e13	2025-01-07 10:59:43 +05:30
Edward Smyth	4ce708c316	Move some BLAS extension APIs to extra subdirectories In preparation for merging next group of changes from upstream BLIS, move some BLAS extension APIs to new extra subdirectories in frame/compat and frame/compat/cblas/src. Other extension APIs will be moved in later commits. Some tidying up to better match upstream BLIS code has also been done. AMD-Internal: [CPUPL-2698] Change-Id: I0780a775d37242fba562c3f13666da0ad2b2cdfb	2024-12-17 04:54:39 -05:00
Edward Smyth	0c6d006225	Changes to rntm to reduce mutex operations Change usage of global_rntm and tl_rntm to eliminate need for mutex operations when accessing global_rntm. Usage of these data structures is now as follows: * global_rntm is set once during bli_init_apis and includes all getenv calls to check BLIS threading and error printing environment variables. global_rntm is then read-only. * tl_rntm is initialized once from global_rntm on each application thread. Any calls to BLIS set threading/ways APIs will update tl_rntm for that application thread only (Previously they updated global_rntm for all application threads). * Re-initialize info_value in tl_rntm in every call to bli_init APIs. * In bli_rntm_init_from_global() we initialize the local (per API call) rntm as a copy of tl_rntm and then update threading values in bli_thread_update_rntm_from_env() to reflect the current status of OpenMP runtime ICVs. AMD-Internal: [CPUPL-6168][SWLCSG-3143] Change-Id: Ib9387ee2b51f507ed08cc38267057109acea14a6	2024-12-16 04:45:26 -05:00
Shubham Sharma	beaea1b88f	Added new DTRSM small code path for ZEN5 - Added new DTRSM kernels for right and left variants. - Kernel dimensions are 24x8. - 24x8 DGEMM SUP kernels are used internally for solving GEMM subproblem. - Tuned thresholds to pick efficient code path for ZEN5. AMD-Internal: [CPUPL-6016] Change-Id: I743d6dc47717952c2913085c0db3454ae9d046db	2024-12-16 10:38:45 +05:30
Vignesh Balasubramanian	609af9bfe2	Threshold tuning for ZGEMM small path - Updated the threshold check for ZGEMM small path to include runtime checks for redirection, specific to the micro-architecture. - The current ZGEMM small path has only its AVX2 variant available. Post implementing an AVX512(same/different algorithm), the thresholds will further be fine-tuned. - Included the dot-product based AVX512 ZGEMM kernels in the ZEN5 context. It will be used as part of handling RRC and CRC storage schemes of C, A and B matrices in both single-thread and multi-thread runs. AMD-Internal: [CPUPL-5949] Change-Id: Ic8b7cf0e00b7c477f748669f160c4b01df995c75	2024-12-13 12:51:22 -05:00
Arnav Sharma	25e59fcbb9	DGEMV Optimizations for NO_TRANSPOSE Cases - AVX512 specific DGEMV native kernels are added for Zen4/5 architectures to handle the NO_TRANSPOSE cases and are independent of the AXPYF fused kernels. - The following set of kernels biased towards the n-dimension perform beta scaling of y vector within the kernel itself and handle cases where n is less than 5: - bli_dgemv_n_zen_int_32x8n_avx512( ... ) - bli_dgemv_n_zen_int_32x4n_avx512( ... ) - bli_dgemv_n_zen_int_32x2n_avx512( ... ) - bli_dgemv_n_zen_int_32x1n_avx512( ... ) - The bli_dgemv_n_zen_int_16mx8_avx512( ... ) is biased towards the m-dimension and for this kernel beta scaling is handled beforehand within the framework. - Added unit-tests for the new kernels. - AVX2 path for Zen/2/3 architectures still follows the old approach of using fused kernel, namely AXPYF, to perform the GEMV operation. AMD-Internal: [CPUPL-5560] Change-Id: I22bc2a865cd28b9cdcb383e17d1ff38bdd28de79	2024-12-12 10:26:50 -05:00
Vignesh Balasubramanian	4da1ad2cd9	Added CBLAS wrappers for complex precision ?ROT and ?ROTG APIs - Added the appropriate CBLAS wrappers for CROTG, CSROT, ZROTG and ZDROT APIs. These would internally call their ?_blis_impl() layer. AMD-Internal: [CPUPL-5813] Change-Id: I6037f20092f99cc5a5e2794d03bbe76d6a55eb97	2024-09-19 08:49:46 -04:00
Edward Smyth	a07e041b1f	SCALV alpha=zero BLAS compliance SCALV is used directly by BLAS, CBLAS and BLIS scal{v} APIs but also within many other APIs to handle special cases. In general it is preferred to use SETV when alpha=0, but BLAS and CBLAS continue to multiply all vector elements by alpha. This has different behaviour for propagating NaNs or Infs. Changes in this commit: - Standardize early returns from SCALV reference and optimized kernels. - User supplied N<0 is handled at the top level API layer. Use negative values of N in kernel calls to signify that SETV should _not_ be used when alpha=0. This should only be required in SCALV. - Include serial threshold in zdscal (as in dscal) to reduce overhead for small problem sizes. - Code tidying to make different variants more consistent. - More standardization of tests in SCALV gtestsuite programs. - Remove scalv_extreme_cases.cpp as it is now redundant. AMD-Internal: [CPUPL-4415] Change-Id: I42e98875ceaea224cc98d0cdfe0133c9abc3edae	2024-09-16 07:10:28 -04:00
Vignesh Balasubramanian	189a0b7224	Bugfix for {D/C/Z}AXPBY and ZAXPY BLAS APIs - Bug : For non-zen architectures, {D/C/Z}AXPBY had incorrect datatypes passed when querying the computational kernel from context. The right datatype is now passed to each variant. - Bug : For ZAXPY, a NULL context was passed to the kernel when using the single-threaded path. In case of further using the context inside the kernel, this would be an issue. We now pass the context instead of a null pointer. AMD-Internal: [CPUPL-5643] Change-Id: I01bb78bda6be61c43543b16fda0ac02a988a07bf	2024-08-22 14:12:14 +05:30
Hari Govind S	d349f89df6	Fix warning caused by dscalv - Setting the value for ST_THRESH for default code path in dscalv API to avoid warning message. Change-Id: I8ace2070350267904faa498197b8356de9af58d1	2024-08-06 12:13:23 +05:30
Edward Smyth	89f52a6df5	Code cleanup: spelling corrections Corrections for spelling and other mistakes in code comments and doc files. AMD-Internal: [CPUPL-4500] Change-Id: I33e28932b0e26bbed850c55602dee12fd002da7f	2024-08-05 16:18:51 -04:00
Edward Smyth	82bdf7c8c7	Code cleanup: Copyright notices - Standardize formatting (spacing etc). - Add full copyright to cmake files (excluding .json) - Correct copyright and disclaimer text for frame and zen, skx and a couple of other kernels to cover all contributors, as is commonly used in other files. - Fixed some typos and missing lines in copyright statements. AMD-Internal: [CPUPL-4415] Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371	2024-08-05 15:35:08 -04:00
Edward Smyth	09c45525f4	Missing early returns (2) Add missing early return in axpyv. AMD-Internal: [CPUPL-5540] Change-Id: I522fd6f5551a4dab24e8c164fa38818c900b89f8	2024-08-05 12:18:33 -04:00
Edward Smyth	591a3a7395	Code cleanup: file formats and permissions - Remove execute file permission from source and make files. - dos2unix conversion. - Add missing eol at end of files. Also update .gitignore to not exclude build directory but to exclude any build_* created by cmake builds. AMD-Internal: [CPUPL-4415] Change-Id: I5403290d49fe212659a8015d5e94281fe41eb124	2024-08-05 11:52:33 -04:00
Arnav Sharma	0a5c057475	DGEMV Optimizations for Tiny Sizes - Added reference kernel for dgemv that handles computation for tiny sizes (m < 8 && n < 8). - The reference kernel, bli_dgemv_zen_ref( ... ), supports both row/column storage schemes as well as transpose and no transpose cases. - Added additional unit-tests for functional verification. AMD-Internal: [CPUPL-5098] Change-Id: I66fdf0a40e90bdb3fed40152c45ab28a17a87ada	2024-08-05 12:19:42 +05:30
Hari Govind S	3ae466697b	Fixed performance drop of multi-threaded dscalv - Avoid performance degradation of dscalv for ST when OpenMP is enabled by using fast-path to skip the overhead caused by 'bli_nthreads_l1' function if the input size is less than a particular threshold. - Replaced 'bli_thread_vector_partition' work distribution function with 'bli_thread_range_sub'. AMD-Internal: [CPUPL-5522] Change-Id: I4ad0041d6e448c4a26fcd47ce44e0321a41b8b9f	2024-08-05 01:51:30 -04:00
Edward Smyth	0151ea748a	Missing early returns Add missing early returns in amax, asum, gemm_compute and gemv. AMD-Internal: [CPUPL-5540] Change-Id: I3ed682cae954331e48da5e8ef5c7f27dd4f11c5e	2024-08-02 10:16:19 -04:00
Moripalli Chitra	448702a1b4	Coverity issue fix Out-of-bound access fix in malloc failure case for following APIs: ddot_, zdotc_, zdotu_ AMD-Internal: [CPUPL-4686] Change-Id: I676697223604fbb2a8d03421d98ed0d8d706f8c7	2024-08-02 09:31:38 -04:00
Shubham Sharma.	45d82a1ebf	Threshold tuning for DTRSM on zen5 - Added new decision logic to choose between native TRSM vs unpacked small TRSM for double precision. - The changes are made for zen5 processor. AMD-Internal: [CPUPL-5534] Change-Id: I5204f6df111edec27d006daeb1c2b535a67b3e46	2024-08-01 11:27:28 -04:00
Vignesh Balasubramanian	f23b8e636b	AVX2 and AVX512 optimizations for DAXPYV - Removed some of the unrolling factors that affected the performance of AVX2 DAXPYV kernel. In addition to improving the current performance on sizes compatible to single-threaded runs, this will now perform better for tiny sizes as well since the overhead to reach the computation is less. - Updated the vector partitioning logic, by using bli_thread_range_sub( ... ), which ensures that there is no false sharing among multiple threads. - Updated the AOCL-DYNAMIC logic for the API, to include thresholds for zen4 and zen5 micro-architectures. AMD-Internal: [CPUPL-5514] Change-Id: Iee9edddac685334213cd6694421ab3df3547e930	2024-07-31 09:24:36 -04:00
Edward Smyth	8848ecb103	Improvements to CBLAS xerbla functionality Currently the CBLAS xerbla always prints and always stops on error. This commit adds similar functionality to the regular BLAS xerbla to match the changes in `6d0444497f`, namely: - Option to stop in xerbla on error. This is controlled by setting the environment variable BLIS_STOP_ON_ERROR=1 - Option to disable printing of error message from BLIS. This is controlled by setting the environment variable BLIS_PRINT_ON_ERROR=0 - Added a function to return the value of INFO passed to xerbla, assuming xerbla was not set to stop on error. Example call is info = bli_info_get_info_value(); The default behaviour remains to print but has been changed to not stop on error, i.e. the equivalent to export BLIS_PRINT_ON_ERROR=1 BLIS_STOP_ON_ERROR=0 AMD-Internal: [CPUPL-5361] Change-Id: Icd6125fd60da139e3ec0969e52337a1ed515f0a2	2024-07-26 10:36:37 -04:00
Hari Govind S	eacad443e3	Optimization for DCOPY and SCOPY API - Replaced "vmovupd" with "vmovups" for "bli_scopyv_zen4_asm_avx512" kernel. - Optimization of loop unrolling for "bli_dcopyv_zen4_asm_avx512" and "bli_scopyv_zen4_asm_avx512" kernels. - Replaced existing load balancing algorithm for dcopy API with "bli_thread_range_sub" algorithm. - Included AOCL-dynamic values for optimal number of threads for zen5 architecture. AMD-Internal: [CPUPL-5238] Change-Id: Ic82bdfad9478c8f75dc5a3dcfed0df85fbcae957	2024-07-24 08:23:07 -04:00
Vignesh Balasubramanian	b48e864e82	AVX512 optimizations for DAXPBYV API - Implemented AVX512 computational kernel for DAXPBYV with optimal unrolling. Further implemented the other missing kernels that would be required to decompose the computation in special cases, namely the AVX512 DADDV and DSCAL2V kernels. - Updated the zen4 and zen5 contexts to ensure any query to acquire the kernel pointer for DAXPBYV returns the address of the new kernel. - Added micro-kernel unit tests to GTestsuite to check for functionality and out-of-bounds reads and writes. AMD-Internal: [CPUPL-5406][CPUPL-5421] Change-Id: I127ab21174ddd9e6de2c30a320e62a8b042cbde6	2024-07-22 11:32:19 +05:30
Vignesh Balasubramanian	cec9fdcc6e	Framework enhancements for ?AXPBYV APIs - Implemented a new front-end for the BLAS/CBLAS calls to ?AXPBYV(BLAS-extension API), that is intended to be compiled only on Zen micro-architectures(as per the existing build system). - This new front-end makes the framework lightweight for BLAS/CBLAS calls to ?AXPBYV, by directly querying the architecture ID and deploying the associated computational kernel. - Further updated the rerouting to other L1 kernels based on alpha and beta value. This was initially present in the Typed-API interface. It has been moved inside the respective kernels, and only necessary rerouting is done to specific L1 kernels to avoid redundant checks. AMD-Internal: [CPUPL-5406] Change-Id: I4af943d477a25dcdab4ee6009ad3dfa6a5c2b37e	2024-07-18 10:06:31 -04:00
Arnav Sharma	d5e29e3c7b	CSCALV Framework Bugfix - Fixed bug for non-zen architecture where CSCALV framework incorrectly fetches the dcomplex (ZSCALV) kernel pointer. AMD-Internal: [CPUPL-5299] Change-Id: I1d16588aa9dffd8b9dca69860026e377fa74d547	2024-07-17 00:27:47 +05:30
Arnav Sharma	4aa66f108e	Added CSCALV AVX512 Kernel - Added CSCALV kernel utilizing the AVX512 ISA. - Added function pointers for the same to zen4 and zen5 contexts. - Updated the BLAS interface to invoke respective CSCALV kernels based on the architecture. - Added UKR tests for bli_cscalv_zen_int_avx512( ... ). AMD-Internal: [CPUPL-5299] Change-Id: I189d87a1ec1a6e30c16e05582dcb57a8510a27f3	2024-07-15 07:17:43 -04:00
vignbala	236d092656	AVX512 optimizations for ZGEMM to handle k = 1 cases - Implemented bli_zgemm_16x4_avx512_k1_nn( ... ) AVX512 kernel to be used as part of BLAS/CBLAS calls to ZGEMM. The kernel is built for handling the GEMM computation with inputs having k = 1, with the transpose values being N(for column-major) and T(for row-major). - Updated the zgemm_blis_impl( ... ) layer to query the architecture ID and invoke the AVX2 or AVX512 kernel accordingly. - Added API level tests for accuracy and code-coverage, as well as micro-kernel tests for verifying functionality and out-of-bounds memory accesses. AMD-Internal: [CPUPL-5249] Change-Id: Id1f8bebff3e0da83c7febe86299564fd658b2e84	2024-07-09 07:07:24 -04:00
Hari Govind S	627bf0b1ba	Implemented Multithreading and Enabled AVX512 Kernel for ZAXPY API - Replaced 'bli_zaxpyv_zen_int5' kernel with optimised 'bli_zaxpyv_zen_int_avx512' kernel for zen4 and zen5 config. - Implemented multithreading support and AOCL-dynamic for ZAXPY API. - Utilized 'bli_thread_range_sub' function to achieve better work distribution and avoid false sharing. AMD-Internal: [CPUPL-5250] Change-Id: I46ad8f01f9d639e0baa78f4475d6e86458d8069b	2024-07-09 01:29:53 -04:00
Edward Smyth	2ee46a3a3a	Merge commit 'cfa3db3f' into amd-main * commit 'cfa3db3f': Fixed bug in mixed-dt gemm introduced in `e9da642`. Removed support for 3m, 4m induced methods. Updated do_sde.sh to get SDE from GitHub. Disable SDE testing of old AMD microarchitectures. Fixed substitution bug in configure. Allow use of 1m with mixing of row/col-pref ukrs. AMD-Internal: [CPUPL-2698] Change-Id: I961f0066243cf26aeb2e174e388b470133cc4a5f	2024-07-08 06:09:11 -04:00
Vignesh Balasubramanian	947811a429	Bugfix for ?OMATCOPY2 and ?IMATCOPY APIs - Updated the parameter check for leading dimensions in the functions handling transpose case of matrix A. - Updated the logic to perform ?IMATCOPY operation. The new logic uses an auxiliary buffer to copy and scale in place, if and when needed. This is done in order to avoid overwriting any subsequent reads that might follow(specifically in case of having different leading dimensions for reading and writing). - Updated xerbla_() to throw memory allocation failure based on INFO parameter being -10. This value is specific to its use-case in ?IMATCOPY, where it is set to -10. - Updated the Extreme Value Tests(EVT) logger for ?IMATCOPY for uniformity. - Cleaned up the files to follow coding conventions. AMD-Internal: [CPUPL-4862][SWLCSG-2706] Change-Id: I34dfa2bcb66b821315e11f7ab2139c41a79ef780	2024-05-21 11:13:28 +05:30
Arnav Sharma	cb27fad49c	ZSCALV AVX512 Kernel - Implemented ZSCALV kernel utilizing AVX512 intrinsics. - Gtestsuite: Added ukr tests for the new kernel. AMD-Internal: [CPUPL-5012] Change-Id: I75c7f4448ddd60b0f9afa53936eed37f5f99eeb2	2024-05-08 11:55:13 -04:00
Arnav Sharma	1dbeee4d19	ZDOTV AVX512 Kernel with MT Support - Added AVX512 kernel for ZDOTV. - Multithreaded both ZDOTC and ZDOTU with AOCL_DYNAMIC support. AMD-Internal: [CPUPL-5011] Change-Id: I56df9c07ab3b8df06267a99835b088dcada81bd8	2024-05-08 04:54:05 -04:00
Arnav Sharma	b1d69180f9	Updated DOTV DTL in bla_dot.c - Updated DOTV DTL entry to include conjugate parameter. AMD-Internal: [CPUPL-5059] Change-Id: Id66be02fc06ff2faa18325dffe76559af2c6a5cf	2024-05-08 01:46:17 -04:00
Kiran Varaganti	fd61c69778	Fixed bug in omatcopy for when trans="t" Thanks to Zhenyu Zhu ajz34 for pointing out this bug. When trans="t" or "conjugate transpose" in the case of complex data-types the ldb should be greater than or equal to cols. In the bug it was checked against "rows". Fixed this bug. Some minor code formatting is also done. [CPUPL-4810][SWLCSG-2706] Change-Id: Ie796d25a361b2ba72eda80e8c5867d6352af901f	2024-05-06 12:57:38 -04:00
Shubham Sharma	be34169001	Fixed Matlab Failure in ZTRSM - In AVX512 ZTRSM kernel, vectorized division code is causing failures in Matlab. - The logic is identical in reference C code and intrinsics code, but intrinsics code is causing failure - Replaced optimized intrinsics code with C code. AMD-Internal: [CPUPL-5052] Change-Id: Iea184330b22c46d979867b870486066ef980eb84	2024-05-06 06:56:45 -04:00
Shubham Sharma	b9e21e8701	Added ZTRSM AVX512 small code path - Kernel dimensions are 4x4. - Two kernels are implemented, Right Upper and Right lower. - In case of Left variants of TRSM, transpose is induced so that Right variant kernels can be used. - No packing is performed in these kernels. - Changes are made in the threshold to pick ZTRSM small code path. - BLIS_INLINE is removed from signature of "TRSMSMALL_KER_PROT". - These kernels do not support "ENABLE_TRSM_PREINVERSION". - Newly added kernels do not support conjugate transpose. - Added multithreading to ZTRSM small code path. AMD-Internal: [CPUPL-4324] Change-Id: I683b1d5239593e54f433e7f27497d72dfbd9141c	2024-05-03 05:10:41 -04:00
Hari Govind	9c26de1a18	Optimisation of COPYV APIs - Implemented AVX512 kernels for scopyv_, dcopyv_ and zcopyv_ using respective AVX512 intrinsics including masked load and store operations. - Implemented AVX512 kernels for scopy_, dcopy_ and zcopy_ using assembly language to prevent loss of performance during the translation of intrinsics. - Updated the dcopy_blis_impl( ... ) and zcopy_blis_impl( ... ) function to support multithreaded calls to the respective computational kernels, if and when the OpenMP support is enabled. - Implemented OpenMP parallelization for dcopyv_ and zcopyv_ APIs, while scopyv_ and ccopyv_ only support single thread. AMD-Internal: [CPUPL-4854] Change-Id: I5fbd0bcca4e59001fbe2b1168b624d0c33242b3e	2024-05-01 00:23:01 +05:30
srigovin	2c838dadfb	Updated return type of xerbla and xerbla_array APIs to void Return type of xerbla and xerbla_array APIs are defined as int in BLIS, but according to netlib it should be void. Updated the definition and declaration accordingly. Signed-off-by: Sridhar Govindaswamy <Sridhar.Govindaswamy@amd.com> Change-Id: I3072ba76111189de5c5cf08df83ea154163dd34d	2024-04-29 00:51:10 -04:00
Shubham Sharma	632c32767b	Avoid alpha scaling in ZTRSV/ZTRSM when alpha = 1 - Scaling vector X is skipped when alpha is 1 in ZTRSV. - Scaling matrix A is skipped when alpha is 1 in ZTRSM. AMD-Internal: [CPUPL-4324] Change-Id: I03c5a454ed1f5be36dac0f121408749bfc9cfc81	2024-04-16 02:24:02 -04:00
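
The SCALV alpha=zero commit (a07e041b1f) notes that multiplying by alpha=0 and overwriting with zero (SETV) propagate NaNs and Infs differently. A minimal C sketch of that difference follows. The helper names scalv_multiply, scalv_setv_on_zero and scalv_demo are hypothetical and are not BLIS functions; the sketch assumes IEEE 754 arithmetic (no fast-math style compiler flags).

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: BLAS-style scaling that always multiplies by alpha,
   even when alpha is zero. */
static void scalv_multiply(size_t n, double alpha, double *x)
{
    for (size_t i = 0; i < n; ++i)
        x[i] *= alpha;
}

/* Hypothetical helper: SETV-style shortcut that overwrites x with zeros
   when alpha is zero, and otherwise falls back to plain scaling. */
static void scalv_setv_on_zero(size_t n, double alpha, double *x)
{
    if (alpha == 0.0) {
        for (size_t i = 0; i < n; ++i)
            x[i] = 0.0;
        return;
    }
    scalv_multiply(n, alpha, x);
}

/* Runs both variants on the same input with alpha = 0 and checks that
   they differ only on the NaN and Inf entries. Returns 0 on success. */
static int scalv_demo(void)
{
    double a[3] = { 1.0, NAN, INFINITY };
    double b[3] = { 1.0, NAN, INFINITY };

    scalv_multiply(3, 0.0, a);      /* 0*NaN and 0*Inf both yield NaN */
    scalv_setv_on_zero(3, 0.0, b);  /* every entry is overwritten with 0 */

    assert(a[0] == 0.0 && isnan(a[1]) && isnan(a[2]));
    assert(b[0] == 0.0 && b[1] == 0.0 && b[2] == 0.0);
    return 0;
}
```

Calling scalv_demo() from a test program exercises both paths. Which behaviour a given BLIS entry point uses is determined by the commit itself, not by this sketch.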

350 Commits