amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-24 18:34:40 +00:00

Author	SHA1	Message	Date
Varaganti, Kiran	145e706992	Fixed auxiliary cache block sizes for Native and SUP DGEMM kernels for ZEN4 and ZEN5 configs. Auxiliary blocksize values for cache blocksizes are interpreted as the maximum cache blocksizes. The maximum cache blocksizes are a convenient and portable way of smoothing performance of the level-3 operations when computing with a matrix operand that is just slightly larger than a multiple of the preferred cache blocksize in that dimension. In these "edge cases," iterations run with highly sub-optimal blocking. We can address this problem by merging the "edge case" iteration with the second-to-last iteration, such that the cache blocksizes are slightly larger--rather than significantly smaller--than optimal. The maximum cache blocksizes allow the developer to specify the maximum size of this merged iteration; if the edge case causes the merged iteration to exceed this maximum, then the edge case is not merged and instead it is computed upon in a separate (final) iteration. (https://github.com/flame/blis/blob/master/docs/ConfigurationHowTo.md). In bli_cntx_init_zen4 and zen5, the auxiliary blocksize for KC was less than the primary blocksize. These are fixed. Code cleanup of the files bli_family_zen4.h and bli_family_zen5.h: removed unused constants. Thanks to Igor Kozachenko <igork@berkeley.edu> for pointing out these two bugs. Change-Id: I44fc564d5d91cb978d062c413e70751aeaa07f2c	2024-08-05 10:29:43 +05:30
Hari Govind S	f2acd4fd49	AOCL Dynamic for zen3 dcopy - Create separate AOCL Dynamic values for multithreading dcopy API for zen1, zen2 and zen3 AMD-Internal: [CPUPL-5238] Change-Id: I42f56393716edeeace8bfe71d7adab0ba7325b47	2024-08-05 00:44:14 -04:00
Vignesh Balasubramanian	9843bd0317	Tuning the decision logic to choose SUP vs Native for ZGEMM - Added an additional decision logic to choose between SUP and Native paths for zen4 and zen5 micro-architectures, based on the input dimensions. This logic has been added to the architecture-specific thresholds functions, that are registered in the context. - The decision logic will overrule the discrete thresholds present in the zen4 and zen5 contexts. AMD-Internal: [CPUPL-5547] Change-Id: I475f19b110064b3b9eef2e03bbdc21f4dd826c03	2024-08-03 19:08:07 +05:30
Ruchika Ashtankar	bdb94fb218	GTestSuite: Added tests for DGEMM SUP kernel - Added dgemmGenericSUP test for the new 24x8 DGEMM SUP kernel for zen5. AMD-Internal: [CPUPL-4404] Change-Id: I150ca310655a495bdcf5ea9d5a16746483a17b68	2024-08-02 11:37:29 -04:00
Edward Smyth	0151ea748a	Missing early returns Add missing early returns in amax, asum, gemm_compute and gemv. AMD-Internal: [CPUPL-5540] Change-Id: I3ed682cae954331e48da5e8ef5c7f27dd4f11c5e	2024-08-02 10:16:19 -04:00
Mangala V	0a4f9d5ac1	Removed -fno-tree-loop-vectorize from kernel flags - This change is made in the Make build system. - Removed -fno-tree-loop-vectorize from global kernel flags, instead added it to lpgemm specific kernels only. - If this flag is not used, then gcc tries to auto-vectorize the code, which results in usage of vector registers; if the auto-vectorized function uses intrinsics, then the total number of vector registers used by intrinsic and auto-vectorized code exceeds the registers available in the machine, which causes reads and writes to the stack and, in turn, a regression in lpgemm. - If this flag is enabled globally, then the files which do not use any intrinsic code do not get auto-vectorized. - To get optimal performance for both blis and lpgemm, this flag is enabled for lpgemm kernels only. Previous commit (`75df1ef218`) contains similar changes for the CMake build system AMD-Internal: [CPUPL-5544] Change-Id: I796e89f3fb2116d64c3a78af2069de20ce92d506	2024-08-02 09:40:06 -04:00
Moripalli Chitra	448702a1b4	Coverity issue fix Out-of-bound access fix in malloc failure case for following APIs: ddot_, zdotc_, zdotu_ AMD-Internal: [CPUPL-4686] Change-Id: I676697223604fbb2a8d03421d98ed0d8d706f8c7	2024-08-02 09:31:38 -04:00
Shubham Sharma	0d95fcf20c	Revert "DGEMM Native AVX512 updates" This reverts commit `f378fc57b5`. Reason for revert: Causing Failure AMD-Internal: [CPUPL-5262] Change-Id: I15860eabf2461fae3d0f7cedd436d4db2df5b82f	2024-08-02 07:32:28 -04:00
Moripalli Chitra	8b486e8d14	Added new decision logic to choose between 6x8 dgemm kernel vs 24x8 kernel. The decision is based on the values of "m, n and k". Change-Id: I307ff002797ccef5bd61106b808cecb069b91fd6	2024-08-02 14:18:58 +05:30
Ruchika Ashtankar	92fbd04238	DGEMM SUP Optimizations for Turin - Introduced a new 24x8 column preferred DGEMM sup kernel for zen5. - A prefetch logic is modified compared to zen4 24x8 sup kernels. - Earlier, next panel of A is prefetched into L2 cache, which is now modified to prefetching the second next column of the current panel of A into L1 cache. - B and C prefetches are enabled and unchanged. - Tuned MC, KC and NC block sizes for new kernel. AMD-Internal: [CPUPL-5262] Change-Id: If933537e50f43f5560e0fe18a716aa1e36ced64d	2024-08-02 04:00:51 -04:00
Ruchika Ashtankar	5760e06100	Threshold tuning for DGEMM SUP for zen5 - New Decision threshold constants are added to decide between double precision sup vs native dgemm code-path for zen5 processors. - The decision is based on the values of m, n and k. AMD-Internal: [CPUPL-5262] Change-Id: I87b8ff9eb603d6fda0875e000f7ab83b22d22040	2024-08-02 11:34:32 +05:30
Shubham Sharma.	45d82a1ebf	Threshold tuning for DTRSM on zen5 - Added new decision logic to choose between native TRSM vs unpacked small TRSM for double precision. - The changes are made for zen5 processor. AMD-Internal: [CPUPL-5534] Change-Id: I5204f6df111edec27d006daeb1c2b535a67b3e46	2024-08-01 11:27:28 -04:00
Vignesh Balasubramanian	4ec2bad744	Updating reduction step of AVX512 DNRM2 API - Updated the final reduction of partial sums to use scalar accumulation entirely, instead of using the _mm512_reduce_add_pd( ... ) intrinsic. This will in turn change the associativity and the rounding-off pattern in the reduction step. - Defined a union data-type to do the same, by having a 512-bit register and a double-precision array as its members. - Updated the declaration and usage of the register variable according to the union definition, for uniformity. AMD-Internal: [CPUPL-5472] Change-Id: I997464a6ec47e4054dca48a000fbd4ac0cfcc679	2024-08-01 17:09:17 +05:30
Edward Smyth	75f21182bd	GTestSuite: IIT and ERS test improvements Various improvements: - Where appropriate, test both: - with nullptr for suitable arguments that should never be touched. - with all arguments correct except the one we want to test, to check we are not returning early because another argument is a nullptr. - Test incorrect values for order argument in CBLAS calls. - Test early exits with limited data changes, e.g. set C to 0 or scale C in GEMM when alpha = 0. - Bugfix in gemmt test when alpha is 0 and beta is 1. - Use reference library gemmt for comparison when library is not netlib BLAS. AMD-Internal: [CPUPL-4500] Change-Id: Ibde7eaba5a484a87674044ca44855c6f6ee4ff4b	2024-07-31 15:36:01 -04:00
Shubham Sharma.	f378fc57b5	DGEMM Native AVX512 updates - In the initial patch - for m, n non-multiple of MR and NR respectively we are calling bli_dgemm_ker_var2. Now we have implemented macro-kernel for these fringe cases as well. - Replaced RBP register with R11 in the macro-kernel. - Retuned MC, KC and NC with these new changes. This will result in better performance for matrix sizes like m=4000 or greater when running on single thread. AMD-Internal: [CPUPL-5262] Change-Id: I66c111ceb7feee776703339680d57e8d6d5c809a	2024-07-31 12:23:34 -04:00
Eleni Vlachopoulou	20d6a9a9f3	CMake: Add installation of .pc files. AMD-Internal: [CPUPL-4938] Change-Id: Iaf1ad702e61d8a81ee9ae6496ff3ba0dda21eceb	2024-07-31 11:54:57 -04:00
Vignesh Balasubramanian	f23b8e636b	AVX2 and AVX512 optimizations for DAXPYV - Removed some of the unrolling factors that affected the performance of AVX2 DAXPYV kernel. In addition to improving the current performance on sizes compatible to single-threaded runs, this will now perform better for tiny sizes as well since the overhead to reach the computation is less. - Updated the vector partitioning logic, by using bli_thread_range_sub( ... ), which ensures that there is no false sharing among multiple threads. - Updated the AOCL-DYNAMIC logic for the API, to include thresholds for zen4 and zen5 micro-architectures. AMD-Internal: [CPUPL-5514] Change-Id: Iee9edddac685334213cd6694421ab3df3547e930	2024-07-31 09:24:36 -04:00
Hari Govind S	e2e95a09b0	Fixing missing registers in end_asm for copyv APIs - Added the missing registers in end_asm for scopy, dcopy and zcopy APIs. - Removed unnecessary registers from end_asm for scopy and dcopy APIs. - Corrected mistakes in the comments. Change-Id: I5ebe2ff9cb2c72ca7c71a67419281f73462f9498	2024-07-30 15:09:52 +05:30
Meghana Vankadari	d5b4d3aa5e	Fixing control flow in aocl_gemm_bf16s4f32of32\|bf16 - Fixed framework of bf16s4f32of32 API to correct pointer updations. - Modified pre_op structure to exclude pre-op-offset. Now offset is passed as a separate parameter to the scale-pack functions. - Fixed work-distribution among threads in MT scenario. - Added Blocksizes and kernel-pointers and verified functionality for the new API. AMD-Internal: [SWLCSG-2943] Change-Id: I58fece240d62c798c880a2b2b7fa64e560cc753d	2024-07-29 05:12:09 -04:00
Edward Smyth	b90e12dfa4	GTestSuite: copyright notice Standardize format of copyright notice. AMD-Internal: [CPUPL-4500] Change-Id: I6bde64c15ff639492dd0de95423c660112a37e2c	2024-07-26 15:34:41 -04:00
Edward Smyth	ea286cf6f6	GTestSuite: whitespace at end of lines Unnecessary whitespace (spaces, tabs) at the end of lines has been removed. AMD-Internal: [CPUPL-4500] Change-Id: Ice5f5504232cb22460c14ac47e6a3a43309cba22	2024-07-26 12:12:56 -04:00
Edward Smyth	4183efa722	GTestSuite: No newline at end of file Add missing newline at the end of these files. AMD-Internal: [CPUPL-4500] Change-Id: I835cc73de0008b66ae3cf77fbb3daa1c8fcaaa7f	2024-07-26 11:42:57 -04:00
Edward Smyth	46fe3f3dcb	GTestSuite: dos2unix file conversion Source and other files in some directories were a mixture of Unix and DOS file formats. Convert all relevant files to Unix format for consistency. AMD-Internal: [CPUPL-4500] Change-Id: Ia3e479643b0bed4ae8a9107bde6e2cddf32d5bd8	2024-07-26 11:09:06 -04:00
Edward Smyth	8848ecb103	Improvements to CBLAS xerbla functionality Currently the CBLAS xerbla always prints and always stops on error. This commit adds similar functionality to the regular BLAS xerbla to match the changes in `6d0444497f`, namely: - Option to stop in xerbla on error. This is controlled by setting the environment variable BLIS_STOP_ON_ERROR=1 - Option to disable printing of error message from BLIS. This is controlled by setting the environment variable BLIS_PRINT_ON_ERROR=0 - Added a function to return the value of INFO passed to xerbla, assuming xerbla was not set to stop on error. Example call is info = bli_info_get_info_value(); The default behaviour remains to print but has been changed to not stop on error, i.e. the equivalent to export BLIS_PRINT_ON_ERROR=1 BLIS_STOP_ON_ERROR=0 AMD-Internal: [CPUPL-5361] Change-Id: Icd6125fd60da139e3ec0969e52337a1ed515f0a2	2024-07-26 10:36:37 -04:00
Vignesh Balasubramanian	68c54297bd	Fixing compiler warnings when configuring BLIS without OpenMP - Adjusted the macro-guards for variables specific to multithreading, when BLIS is configured with OpenMP. - This included calling the single-threaded kernel directly if increment is 0 as well, since this would remove an unnecessary dependency on one of the variables used only when we enable OpenMP. - Further updated the condition to pack the vector, to avoid it when increment is 0. In this case, we directly call the kernel. AMD-Internal: [CPUPL-5480] Change-Id: I31a9c6e3ffc3c4f9d5b03ed8745919ad65c99c79	2024-07-25 10:29:33 -04:00
mkadavil	ec8c39541e	Test/benchmark framework updates to test WOQ workflow. -To enable Weight-only-Quantization (WOQ) workflow, new LPGEMM APIs have been developed where data types are A:bf16, B:int4 and C:f32/bf16. The testing and benchmarking framework for the same are added. AMD-Internal: [SWLCSG-2943] Change-Id: Icdc1d60819a23dd9f41382499d1a3c055c5edc17	2024-07-25 06:44:37 +05:30
Nallani Bhaskar	c6dd7c1b4b	Added new API in aocl_gemm to support A bf16 data type and B s4 data type Description: 1. Added a new API aocl_gemm_bf16s4f32of32 to support WoQ (Weight-only-Quantization) in LLMs 2. The API supports only reordered B matrix of data size signed 4 bits (S4). 3. Subtracting zero point and multiplying with scale on B matrix is performed in packing B. 4. zero point and scale data should be passed by user through pre-ops data structure. 5. The API is still in experimental state and NOT tested. AMD-Internal: SWLCSG-2943 Change-Id: I10b159b64c2e2aaf39da5462685618ba8cc800ee	2024-07-25 11:59:03 +00:00
Meghana Vankadari	49949f488f	Implemented on-the-go pack kernel for s4->bf16 Details: - To enable Weight-only-Quantization(WOQ) workflow, new LPGEMM APIs are added where datatypes are A: bf16, B: int4, C: f32/bf16. To support this, B matrix will be reordered with type still being int4. New pack kernels that packs the reordered B matrix after converting the data from int4 to bf16 and applying zero-point and scale are added. AMD-Internal: [SWLCSG-2943] Change-Id: Iabe23dab607913c0114b97cb2b91248babeaac03	2024-07-25 04:13:05 +05:30
mkadavil	7114376519	New kernels for int4 B matrix reordering following BF16 kernel schema. -To enable Weight-only-Quantization (WOQ) workflow, new LPGEMM APIs are required where data types are A:bf16, B:int4 and C:f32/bf16. It is expected that the BF16 kernels will be reused within this API and subsequently the B matrix needs to be reordered following the BF16 kernel schema, but with the reordered matrix type still being int4. To address this, new BF16 reorder kernels enabling the same are added. AMD-Internal: [SWLCSG-2943] Change-Id: Ib770ecbf90a3d906deafece94b1a96e0b9412738	2024-07-25 01:10:13 -04:00
Hari Govind S	eacad443e3	Optimization for DCOPY and SCOPY API - Replaced "vmovupd" with "vmovups" for "bli_scopyv_zen4_asm_avx512" kernel. - Optimization of loop unrolling for "bli_dcopyv_zen4_asm_avx512" and "bli_scopyv_zen4_asm_avx512" kernels. - Replaced existing load balancing algorithm for dcopy API with "bli_thread_range_sub" algorithm. - Included AOCL-dynamic values for optimal number of threads for zen5 architecture. AMD-Internal: [CPUPL-5238] Change-Id: Ic82bdfad9478c8f75dc5a3dcfed0df85fbcae957	2024-07-24 08:23:07 -04:00
Arnav Sharma	9583ee2e23	DGEMV Optimizations for NO_TRANSPOSE cases - Enabled AVX512 DAXPYF kernels for DGEMV var2 for NO_TRANSPOSE cases. - Added DAXPYF kernels with fuse factors of 2, 4, 6 and 16. - Added a wrapper for DAXPYF kernels for redirection to kernels with a smaller fuse factor than 32. - Also added UKR tests for the new fused kernels. AMD-Internal: [CPUPL-5098] Change-Id: I0b102b67c6c068873393bac0494284f379c253f2	2024-07-24 15:59:36 +05:30
Shubham Sharma	15ef6532e9	BugFix in DGEMMT SUP AVX512 code path - Logic to calculate the kernel index in AVX512 DGEMMT SUP framework is incorrect. - The granularity for workload distribution along N dimension is NR(8), whereas current logic to pick diagonal kernel assumes the granularity to be MR (24). - To fix this, the logic to determine the kernel index is changed: instead of relying solely on n_offset, the kernel index is derived depending on distance from the diagonal. - If distance from diagonal is greater than LCM of (MR and NR) - NR, then that means the current micro panel is not a diagonal micro panel. - If the micro panel is a diagonal micro panel, then the distance from diagonal is equal to the M dimension for initial full GEMM region or empty region of diagonal kernel. This info can be used to determine the kernel index. AMD-Internal: [CPUPL-5440] Change-Id: I640d3a1b43e63b24bc9f0ed4a67cced45f6fa3b3	2024-07-24 06:36:34 +00:00
mkadavil	42e539b878	Quantization (scale + zero point) updates/fixes for BF16 LPGEMM api. -_mm512_cvtpbh_ps intrinsic is not supported in older versions of gcc (<gcc 12.2) and subsequently throws a compilation error. This is fixed by replacing this intrinsic with a macro that achieves the bf16 to f32 conversion via shift operations. -Bug fixes in the vector scale factor load in fringe kernels. AMD-Internal: [SWLCSG-2945] Change-Id: I8eac4c4b34b043e7a8116dc465723d8f85b28018	2024-07-23 04:39:14 +05:30
Shubham Sharma	16c56e0101	Added 24x8 triangular kernels for DGEMMT SUP - In order to reuse 24x8 AVX512 DGEMM SUP kernels, 24x8 triangular AVX512 DGEMMT SUP kernels are added. - Since the LCM of MR(24) and NR(8) is 24, therefore the diagonal pattern repeats every 24x24 block of C. To cover this 24x24 block, 3 kernels are needed for one variant of DGEMMT. A total of 6 kernels are needed to cover both upper and lower variants. - In order to maximize code reuse, the 24x8 kernels are broken into two parts, 8x8 diagonal GEMM and 16x8 full GEMM. The 8x8 diagonal GEMM is computed by 8x8 diagonal kernel, and 16x8 full GEMM part is computed by 24x8 DGEMM SUP kernel. - Changes are made in framework to enable the use of these kernels. AMD-Internal: [CPUPL-5338] Change-Id: I8e7007031e906f786b0c4fe12377ee439075207a	2024-07-22 12:02:30 -04:00
Vignesh Balasubramanian	b48e864e82	AVX512 optimizations for DAXPBYV API - Implemented AVX512 computational kernel for DAXPBYV with optimal unrolling. Further implemented the other missing kernels that would be required to decompose the computation in special cases, namely the AVX512 DADDV and DSCAL2V kernels. - Updated the zen4 and zen5 contexts to ensure any query to acquire the kernel pointer for DAXPBYV returns the address of the new kernel. - Added micro-kernel unit tests to GTestsuite to check for functionality and out-of-bounds reads and writes. AMD-Internal: [CPUPL-5406][CPUPL-5421] Change-Id: I127ab21174ddd9e6de2c30a320e62a8b042cbde6	2024-07-22 11:32:19 +05:30
Shubham Sharma	75df1ef218	Removed -fno-tree-loop-vectorize from kernel flags - This change is made in the CMake build system only. - Removed -fno-tree-loop-vectorize from global kernel flags, instead added it to lpgemm specific kernels only. - If this flag is not used, then gcc tries to auto-vectorize the code, which results in usage of vector registers; if the auto-vectorized function uses intrinsics, then the total number of vector registers used by intrinsic and auto-vectorized code exceeds the registers available in the machine, which causes reads and writes to the stack and, in turn, a regression in lpgemm. - If this flag is enabled globally, then the files which do not use any intrinsic code do not get auto-vectorized. - To get optimal performance for both blis and lpgemm, this flag is enabled for lpgemm kernels only. Change-Id: I14e5c18cd53b058bfc9d764a8eaf825b4d0a81c4	2024-07-19 00:49:52 -04:00
Vignesh Balasubramanian	cec9fdcc6e	Framework enhancements for ?AXPBYV APIs - Implemented a new front-end for the BLAS/CBLAS calls to ?AXPBYV(BLAS-extension API), that is intended to be compiled only on Zen micro-architectures(as per the existing build system). - This new front-end makes the framework lightweight for BLAS/CBLAS calls to ?AXPBYV, by directly querying the architecture ID and deploying the associated computational kernel. - Further updated the rerouting to other L1 kernels based on alpha and beta value. This was initially present in the Typed-API interface. It has been moved inside the respective kernels, and only necessary rerouting is done to specific L1 kernels to avoid redundant checks. AMD-Internal: [CPUPL-5406] Change-Id: I4af943d477a25dcdab4ee6009ad3dfa6a5c2b37e	2024-07-18 10:06:31 -04:00
mkadavil	d37c91dffa	Quantization (scale + zero point) support for BF16 LPGEMM api. -Quantization of f32 to bf16 (bf16 = (f32 * scale_factor) + zero_point) instead of just type conversion in aocl_gemm_bf16bf16f32obf16. -Support for multiple scale/sum/matrix_add/bias post-ops in a single LPGEMM api call. -Post-ops mask related fixes in lpgemv kernels . -Additional scale post-ops sanity checks. AMD-Internal: [SWLCSG-2945] Change-Id: I3b35cc413c176bb50bfdbd6acd4839a5ba7e94bb	2024-07-18 05:32:51 -04:00
Arnav Sharma	d5e29e3c7b	CSCALV Framework Bugfix - Fixed bug for non-zen architecture where CSCALV framework incorrectly fetches the dcomplex (ZSCALV) kernel pointer. AMD-Internal: [CPUPL-5299] Change-Id: I1d16588aa9dffd8b9dca69860026e377fa74d547	2024-07-17 00:27:47 +05:30
Hari Govind S	38824244d5	Implementation of AXPYF Kernels for DTRSV - Implemented two new axpyf kernels for fused factors 8 and 12 by manually unrolling the loops. Used to achieve better performance in var2 case. AMD-Internal: [CPUPL-5184] Change-Id: I40d2930d003c6ce90323b5c8a52564563d1f23f5	2024-07-16 06:23:01 -04:00
Arnav Sharma	4aa66f108e	Added CSCALV AVX512 Kernel - Added CSCALV kernel utilizing the AVX512 ISA. - Added function pointers for the same to zen4 and zen5 contexts. - Updated the BLAS interface to invoke respective CSCALV kernels based on the architecture. - Added UKR tests for bli_cscalv_zen_int_avx512( ... ). AMD-Internal: [CPUPL-5299] Change-Id: I189d87a1ec1a6e30c16e05582dcb57a8510a27f3	2024-07-15 07:17:43 -04:00
srikanth pogula	1d7f6d414f	Bench APPs - change in Print statement for more params >Made changes in the print statements in bench files to print all the params of the individual APIs > Ex : removing tab & adding Func param "Dt\t n\t incx\t incy\t gflops\n" --> "Func Dt n incx incy gflops\n" > Ex : adding func, incx, incy params "dt_ch, n, alpha_r, alpha_i, beta_r, beta_i, gflops" --> "tmp, dt_ch, n, alpha_r, alpha_i, incx, beta_r, beta_i, incy, gflops" Change-Id: Ib5d151d7472d3f88c13a85a615a447dfa5e6b528	2024-07-11 02:04:19 -04:00
Nallani Bhaskar	d5133e4363	Fixed linking issue with bli_print_msg function Description: In recent changes bli_print_msg is used in lpgemm test application file bench_lpgemm.c for printing error message. bli_print_msg is a blis library function which is not exported for the usage of applications, because of which linking failed when blis shared library is used to build. Updated bli_print_msg with printf in the bench_lpgemm.c AMD Internal: CPUPL-5326 Change-Id: I021849baa6881bd997013e42013db1c5c711627f	2024-07-09 23:18:53 +05:30
Shubham Sharma.	a7744361e4	DGEMM optimizations for Turin Classic - Introduced new 8x24 macro kernels. - 4 new kernels are added for beta 0, beta 1, beta -1 and beta N. - IR and JR loop moved to ASM region. - Kernels support row major storage scheme. - Prefetch of current micro panel of C is enabled. - Kernel supports negative offsets for A and B matrices. - Moved alpha scaling from DGEMM kernel to B pack kernel. - Tuned blocksizes for new kernel. - Added support for alpha scaling in 24xk pack kernel. - Reverted back to old b_next computation in gemm_ker_var2. - BugFix in 8x24 DGEMM kernel for beta 1, comparison for jmp conditions was done using integer instructions, which caused beta 1 path to never be taken. Fixed this by changing the comparison to double. AMD-Internal: [CPUPL-5262] Change-Id: Ieec207eea2a164603c8a8ea88e0b1d3095c29a3f	2024-07-09 07:53:27 -04:00
vignbala	236d092656	AVX512 optimizations for ZGEMM to handle k = 1 cases - Implemented bli_zgemm_16x4_avx512_k1_nn( ... ) AVX512 kernel to be used as part of BLAS/CBLAS calls to ZGEMM. The kernel is built for handling the GEMM computation with inputs having k = 1, with the transpose values being N(for column-major) and T(for row-major). - Updated the zgemm_blis_impl( ... ) layer to query the architecture ID and invoke the AVX2 or AVX512 kernel accordingly. - Added API level tests for accuracy and code-coverage, as well as micro-kernel tests for verifying functionality and out-of-bounds memory accesses. AMD-Internal: [CPUPL-5249] Change-Id: Id1f8bebff3e0da83c7febe86299564fd658b2e84	2024-07-09 07:07:24 -04:00
Hari Govind S	627bf0b1ba	Implemented Multithreading and Enabled AVX512 Kernel for ZAXPY API - Replaced 'bli_zaxpyv_zen_int5' kernel with optimised 'bli_zaxpyv_zen_int_avx512' kernel for zen4 and zen5 config. - Implemented multithreading support and AOCL-dynamic for ZAXPY API. - Utilized 'bli_thread_range_sub' function to achieve better work distribution and avoid false sharing. AMD-Internal: [CPUPL-5250] Change-Id: I46ad8f01f9d639e0baa78f4475d6e86458d8069b	2024-07-09 01:29:53 -04:00
Varaganti, Kiran	2ac24d1f9c	Avoided extra copy of "c" matrix Initialized c_save instead of "c" and then removed copying c to c_save, because at the start of every n_repeats iteration we copy c_save back to c. Therefore if we initialize c_save, we can avoid the extra copy of "c" to c_save before calling GEMM. For very large sizes matrix initialization takes a considerable amount of time. This can be reduced now. Change-Id: I2c6ffe169e991607314897cb0c1fbfc0d74ef179	2024-07-09 00:54:03 -04:00
Edward Smyth	2ee46a3a3a	Merge commit 'cfa3db3f' into amd-main * commit 'cfa3db3f': Fixed bug in mixed-dt gemm introduced in `e9da642`. Removed support for 3m, 4m induced methods. Updated do_sde.sh to get SDE from GitHub. Disable SDE testing of old AMD microarchitectures. Fixed substitution bug in configure. Allow use of 1m with mixing of row/col-pref ukrs. AMD-Internal: [CPUPL-2698] Change-Id: I961f0066243cf26aeb2e174e388b470133cc4a5f	2024-07-08 06:09:11 -04:00
Eleni Vlachopoulou	db2e353362	CMake: Adding presets for zen5 configuration with clang & gcc compiler. Change-Id: Ieacc5eeaf8e9f4e1c77e2ff5c6fb455f7ff93393	2024-07-04 05:22:28 -04:00
Meghana Vankadari	4e6fa17c08	Bug fix in LPGEMV for INT8 APIs Details: - Corrected the usage of vpdpbusd instruction in GEMV implementation for INT8 APIs. - Modified bench to fill matrices with values ranging between -5 and +5 whenever the datatype is a signed integer. Change-Id: I457462b888b667d8a34c53de762e9b4aee784ecc	2024-06-27 04:22:04 +05:30
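
The edge-case merging described in the message of commit 145e706992 can be illustrated with a small sketch. This is not BLIS code: the helper name `partition` and the example sizes are assumptions made for illustration only. It walks one dimension in steps of the preferred cache blocksize and, once the remainder fits within the maximum (auxiliary) blocksize, takes the whole remainder as a single merged iteration. It also rejects a maximum smaller than the preferred blocksize, which is the kind of misconfiguration that commit reports fixing for KC.

```python
def partition(dim, b_alg, b_max):
    """Split `dim` into block sizes; hypothetical helper, not a BLIS function.

    b_alg is the preferred cache blocksize, b_max the maximum (auxiliary) one.
    """
    if b_alg <= 0:
        raise ValueError("preferred blocksize must be positive")
    if b_max < b_alg:
        # A maximum below the preferred blocksize makes no sense for merging.
        raise ValueError("maximum blocksize must be >= preferred blocksize")

    blocks = []
    left = dim
    while left > 0:
        # If all remaining work fits within the maximum, do it in one
        # (possibly merged) iteration; otherwise take a preferred-size block.
        b_now = left if left <= b_max else b_alg
        blocks.append(b_now)
        left -= b_now
    return blocks


if __name__ == "__main__":
    # Edge case of 16 merged into the last full block (256 + 16 = 272 <= 320).
    print(partition(1040, 256, 320))
    # No headroom above the preferred size: the edge case runs on its own.
    print(partition(1040, 256, 256))
```

With a preferred blocksize of 256 and a maximum of 320, a dimension of 1040 ends in one block of 272 instead of a full block followed by a block of 16; a dimension of 1100 leaves an edge of 76, which would exceed the maximum if merged, so it stays a separate final iteration.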

3465 Commits