amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-25 10:54:33 +00:00

Author	SHA1	Message	Date
Vignesh Balasubramanian	f23b8e636b	AVX2 and AVX512 optimizations for DAXPYV - Removed some of the unrolling factors that affected the performance of AVX2 DAXPYV kernel. In addition to improving the current performance on sizes compatible to single-threaded runs, this will now perform better for tiny sizes as well since the overhead to reach the computation is less. - Updated the vector partitioning logic, by using bli_thread_range_sub( ... ), which ensures that there is no false sharing among multiple threads. - Updated the AOCL-DYNAMIC logic for the API, to include thresholds for zen4 and zen5 micro-architectures. AMD-Internal: [CPUPL-5514] Change-Id: Iee9edddac685334213cd6694421ab3df3547e930	2024-07-31 09:24:36 -04:00
Vignesh Balasubramanian	cec9fdcc6e	Framework enhancements for ?AXPBYV APIs - Implemented a new front-end for the BLAS/CBLAS calls to ?AXPBYV(BLAS-extension API), that is intended to be compiled only on Zen micro-architectures(as per the existing build system). - This new front-end makes the framework lightweight for BLAS/CBLAS calls to ?AXPBYV, by directly querying the architecture ID and deploying the associated computational kernel. - Further updated the rerouting to other L1 kernels based on alpha and beta value. This was initially present in the Typed-API interface. It has been moved inside the respective kernels, and only necessary rerouting is done to specific L1 kernels to avoid redundant checks. AMD-Internal: [CPUPL-5406] Change-Id: I4af943d477a25dcdab4ee6009ad3dfa6a5c2b37e	2024-07-18 10:06:31 -04:00
vignbala	236d092656	AVX512 optimizations for ZGEMM to handle k = 1 cases - Implemented bli_zgemm_16x4_avx512_k1_nn( ... ) AVX512 kernel to be used as part of BLAS/CBLAS calls to ZGEMM. The kernel is built for handling the GEMM computation with inputs having k = 1, with the transpose values being N(for column-major) and T(for row-major). - Updated the zgemm_blis_impl( ... ) layer to query the architecture ID and invoke the AVX2 or AVX512 kernel accordingly. - Added API level tests for accuracy and code-coverage, as well as micro-kernel tests for verifying functionality and out-of-bounds memory accesses. AMD-Internal: [CPUPL-5249] Change-Id: Id1f8bebff3e0da83c7febe86299564fd658b2e84	2024-07-09 07:07:24 -04:00
Edward Smyth	43d36b9f66	AOCL_ENABLE_INSTRUCTIONS improvements 2 Use of AOCL_ENABLE_INSTRUCTIONS in dgemm tiny code path is unnecessary and incorrectly caused AVX512 code to be run on zen4 and later processors when AOCL_ENABLE_INSTRUCTIONS=avx2 or an equivalent option was selected. Replace with code to select kernel in a similar way to other dgemm code paths and other APIs. Note that at present AVX2 code is used for the smallest matrix sizes on all zen platforms. AMD-Internal: [CPUPL-5078] Change-Id: Ie6b4895461cbbb915d2b48b92fc063f5cd6adb85	2024-06-25 04:57:38 -04:00
Vignesh Balasubramanian	6165001658	Bugfix and optimizations for ?AXPBYV API - Updated the existing code-path for ?AXPBYV to reroute the inputs to the appropriate L1 kernel, based on the alpha and beta value. This is done in order to utilize sensible optimizations with regards to the compute and memory operations. - Updated the typed API interface for ?AXPBYV to include an early exit condition(when n is 0, or when alpha is 0 and beta is 1). Further updated this layer to query the right kernel from context, based on the input values of alpha and beta. - Added the necessary L1 vector kernels(i.e, ?SETV, ?ADDV, ?SCALV, ?SCAL2V and ?COPYV) to be used as part of special case handling in ?AXPBYV. - Moved the early return with negative increments from ?SCAL2V kernels to its typed API interface. - Updated the zen, zen2 and zen3 context to include function pointers for all these vector kernels. - Updated the existing ?AXPBYV vector kernels to handle only the required computation. Additional cleanup was done to these kernels. - Added accuracy and memory tests for AVX2 kernels of ?SETV ?COPYV, ?ADDV, ?SCALV, ?SCAL2V, ?AXPYV and ?AXPBYV APIs - Updated the existing thresholds in ?AXPBYV tests for complex types. This is due to the fact that every complex multiplication involves two mul ops and one add op. Further added test-cases for API level accuracy check, that includes special cases of alpha and beta. - Decomposed the reference call to ?AXPBYV with several other L1 BLAS APIs(in case of the reference not supporting its own ?AXPBYV API). The decomposition is done to match the exact operations that is done in BLIS based on alpha and/or beta values. This ensures that we test for our own compliance. AMD-Internal: [CPUPL-4861] Change-Id: Ia6d48f12f059f52b31c0bef6c75f47fd364952c6	2024-06-20 16:22:07 +05:30
Arnav Sharma	aa3adb8d69	Updated DOTXF and AXPYF Kernels - Updated the fused kernels (DOTXF and AXPYF) to properly handle cases when b_n > fuse_factor. - The fused kernels are expected to invoke respective Level-1 kernels iteratively when b_n > fuse_factor. AMD-Internal: [CPUPL-5246] Change-Id: Ie7a0f4e61ede088663e3491269b3f1398d028095	2024-06-20 04:41:41 -04:00
Mangala V	e9124ffca7	BUGFIX: Updated ZGEMM microkernel to handle alpha = 0 case BUG: When alpha real and imaginary is zero Output is computed as C= Beta * C + A * B instead of C = Beta * C FIX: Updated kernel to scale A * B product with alpha in case of alpha=0 Existing framework design: - When alpha real and imaginary value is zero, framework handles to skip kernel call to avoid alpha * A * B operation - SCALM is invoked to perform Beta * C - Accuracy issue was not observed as alpha=0 was handled in framework - If we call kernel directly with alpha=0, results would be wrong - Issue was figured out during microkernel testing using gtestsuite AMD-Internal: [CPUPL-4454] Change-Id: Ib6113f5226cd7c26a63781cdd20d35660f453803	2024-06-20 02:58:43 -04:00
Meghana Vankadari	c9254bd9e9	Implemented LPGEMV(n=1) for AVX2-INT8 variants - When n=1, reorder of B matrix is avoided to efficiently process data. A dot-product based kernel is implemented to perform gemv when n=1. AMD-Internal: [SWLCSG-2354] Change-Id: If5f74651ab11232d0b87d34bd05f65aacaea94f1	2024-06-18 12:09:18 +05:30
Edward Smyth	1f60b7c366	Export some BLIS internal symbols 2 Export more symbols for BLIS kernels so that AOCL libFLAME optimizations can call them directly. AMD-Internal: [CPUPL-5044] Change-Id: I45392b8a2a14ac2816141521b90b7ddb1216c733	2024-05-15 06:59:56 -04:00
Edward Smyth	62c886feee	Export some BLIS internal symbols AOCL libFLAME optimizations directly call some internal BLIS symbols. Export them to enable this to work with the BLIS shared library. AMD-Internal: [CPUPL-5044] Change-Id: Icb62dcb51e12d72dde8434593ab17de3c227c93d	2024-05-08 12:51:32 -04:00
mkadavil	118e955a22	SWISH post-op support for all LPGEMM APIs. SWISH post-op computes swish(x) = x / (1 + exp(-1 * alpha * x)). SiLU = SWISH with alpha = 1. AMD-Internal: [SWLCSG-2387] Change-Id: I55f50c74a8583a515f7ea58fa0878ccbcdd6cc26	2024-05-06 06:05:11 -04:00
Shubham Sharma	b9e21e8701	Added ZTRSM AVX512 small code path - Kernel dimensions are 4x4. - Two kernels are implemented, Right Upper and Right lower. - In case of Left variants of TRSM, transpose is induced so that Right variant kernels can be used. - No packing is performed in these kernels. - Changes are made in the threshold to pick ZTRSM small code path. - BLIS_INLINE is removed from signature of "TRSMSMALL_KER_PROT". - These kernels do not support "ENABLE_TRSM_PREINVERSION". - Newly added kernels do not support conjugate transpose. - Added multithreading to ZTRSM small code path. AMD-Internal: [CPUPL-4324] Change-Id: I683b1d5239593e54f433e7f27497d72dfbd9141c	2024-05-03 05:10:41 -04:00
Vignesh Balasubramanian	53cb83d0cc	AVX512 optimizations for ZGEMV API with no-transpose case - Implemented AVX512 kernels for handling the calls to ZGEMV with no-transpose to A matrix. - This includes the ZAXPYF, ZAXPYV and ZSETV kernels. The set of ZAXPYF kernels include those with fuse-factor 8 (main kernel), 4 and 2(fringe kernels). - Updated the bli_zgemv_unf_var2( ... ) function to set the function pointers to these kernels, based on the configuration. Further added the call to ZSETV at this layer in case beta is 0. AMD-Internal: [CPUPL-4974] Change-Id: Iee4b724719e49023138bb16479765be44d677cd9	2024-05-03 07:04:47 +00:00
Shubham Sharma	14bab0eb17	Fixed out of bounds read in CTRSM small kernel - In 2x1 fringe case in [RUN/RLT] kernel, 3 scomplex precision numbers are being read instead of 1 scomplex. - Fixed the code to read only one scomplex. AMD-Internal: [CPUPL-4403] Change-Id: If3ac03ed864618382d3a382a8cdff7ff8a94eb7d	2024-04-16 02:42:34 -04:00
Edward Smyth	2450a1813b	BLIS: Implement zen5 sub-configuration Implement full support for zen5 as a separate BLIS sub-configuration and code path within amdzen configuration family. AMD-Internal: [CPUPL-3518] Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09	2024-04-12 07:26:31 -04:00
Nallani Bhaskar	5070343318	Fixed load intrinsic in aocl-gemm addon f32 api Description: 1. Replaced aligned load intrinsics _mm512_load_ps with unaligned load intrinsics _mm512_loadu_ps. 2. There is no guarantee that the memory address can be aligned everywhere. The changes are under beta multiplication. Copy paste error. Change-Id: I978231b556e17ad7e66c5028ed1cd904c653e0a8	2024-03-20 06:24:32 -04:00
Bhaskar Nallani	2ce47e6f5e	Implemented optimal AVX512-variant of f32 LPGEMV 1. The 5 LOOP LPGEMM path is inefficient when A or B is a vector (i.e, m == 1 or n == 1). 2. An efficient implementation of lpgemv_rowvar_f32 is developed considering the b matrix reorder in case of m=1 and post-ops fusion. 3. When m = 1 the algorithm divides the GEMM workload in n dimension intelligently at a granularity of NR. Each thread works on A:1xk B:kx(>=NR) and produces C=1x(>NR). K is unrolled by 4 along with remainder loop. 4. When n = 1 the algorithm divides the GEMM workload in m dimension intelligently at a granularity of MR. Each thread works on A:(>=MR)xk B:kx1 and produces C = (>=MR)x1. When n=1 reordering of B is avoided to efficiently process in n one kernel. 5. Fixed a few warnings while loading 2 f32 bias elements using _mm_load_sd using float pointer. Typecasted to (const double *) AMD-Internal: [SWLCSG-2391, SWLCSG-2353] Change-Id: If1d0b8d59e0278f5f16b499de1d629e63da5b599	2024-03-04 23:53:23 +05:30
mkadavil	d00e84ced3	Matrix Add post-operation support for float(bf16|f32) LPGEMM APIs. -This post-operation computes C = (betaC + alphaA*B) + D, where D is a matrix with dimensions and data type the same as that of C matrix. AMD-Internal: [SWLCSG-2424] Change-Id: I9464d1f514e3b04275fe93441489b4503a08937a	2024-02-23 02:02:33 -05:00
mkadavil	01b7f8c945	Matrix Add post-operation support for integer(s16|s32) LPGEMM APIs. -This post-operation computes C = (betaC + alphaA*B) + D, where D is a matrix with dimensions and data type the same as that of C matrix. -For clang compilers (including aocc), -march=znver1 is not enabled for zen kernels. Have updated CKVECFLAGS to capture the same. AMD-Internal: [SWLCSG-2424] Change-Id: Ie369f7ea5c80ab69eea3f3e03a8d9546e14f5c09	2024-02-12 23:51:36 +05:30
Shubham Sharma	fc91932b4a	Fixed out of bounds read in DTRSM small kernels - In 3x1 fringe case in [RLN/RUT] kernel, 4 double precision floats are being read instead of 3 doubles. - Fixed the code to read only 3 double. AMD-Internal: [CPUPL-4403] Change-Id: If0afb155efefabe13487cf322d479981f1838aa2	2024-02-02 10:31:12 +05:30
eashdash	ef134dc49f	Added Trans A feature for all INT8 LPGEMM APIs 1. Added Trans A feature to handle column major inputs for A matrix. 2. Trans A is enabled by on-the-go pack of A matrix. 3. The on-the-go pack of A converts a column storage MCxKC block of A into row storage MCxKC block as LPGEMM kernels are row major kernels. 4. New pack routines are added for conversion of A matrix from column major storage to row major storage. 5. LPGEMM Cntx is updated with pack kernel function pointers. 6. Packing of A matrix: - Converts column major input A to row major in blocks of MCxKC with newly added pack A functions when cs_a > 1. 7. Pack routines are added for AVX512 and AVX2 INT8 LPGEMM APIs. 8. Trans A feature is now supported in: 1. u8s8s32os32/os8 2. u8s8s16os16/os8/ou8 3. s8s8s32os32/os8 4. s8s8s16os16/os8 AMD-Internal: SWLCSG-2582 Change-Id: I7ce331545525a9a09f3853280615b55fcf2edabf	2024-01-30 03:40:56 -05:00
mkadavil	864170f5cb	Scalar value support for zero-point and scale-factor. -As it stands, in LPGEMM, users are expected to pass an array of values with length the same as N dimension as inputs for zero point or scale factor. However at times, a single scalar value is used as zero point or scale factor for the entire downscaling operation. The mandate to pass an array requires the user to allocate extra memory and fill it with the scalar value so as to be used in downscaling. This limitation is lifted as part of this commit, and now scalar values can be passed as zero point or scale factor. -LPGEMM bench enhancements along with new input format to improve readability as well as flexibility. AMD-Internal: [SWLCSG-2581] Change-Id: Ibd0d89f03e1acadd099382dffcabfec324ceb50f	2024-01-12 04:37:35 +05:30
Edward Smyth	ed5010d65b	Code cleanup: AMD copyright notice Standardize format of AMD copyright notice. AMD-Internal: [CPUPL-3519] Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0	2023-11-23 08:54:31 -05:00
Edward Smyth	50608f28df	BLIS: Missing clobbers (batch 7) Add missing clobbers in: - bli_gemmsup_rv_haswell kernels - spare copies of kernels in old, other and broken subdirectories - misc kernels for legacy platforms AMD-Internal: [CPUPL-3521] Change-Id: I7cdb7fd1cb29630d8b7fa914b1002a270dfe9ef5	2023-11-22 17:51:46 -05:00
Edward Smyth	f471615c66	Code cleanup: No newline at end of file Some text files were missing a newline at the end of the file. One has been added. AMD-Internal: [CPUPL-3519] Change-Id: I4b00876b1230b036723d6b56755c6ca844a7ffce	2023-11-22 17:11:10 -05:00
mangala v	e0df20806a	Updated prefetching in SGEMM SUP (mask load/store) kernels 1. Prefetch only MR rows or rows required for fringe cases 2. Specify prefetching offset - the least column address supported by masked functions 3. Removed unnecessary prefetches in fringe case for mx4 kernels Updated gtestsuite for sgemm calls AMD_Internal: [CPUPL-4221] Change-Id: I1e2e7d3ebce37dc54a2f0a5c1c70ce0a6d4c8d6c	2023-11-21 06:31:47 -05:00
Harsh Dave	e91d23ff05	Re-implements ddotv edge kernel using masked instructions - This commit uses avx2 and avx512 masked load instructions for handling edge case where vector size is not exact multiple of avx2/avx512 vector register size. - Thanks to Shubham, Sharma <shubham.sharma3@amd.com> for avx512 ddotv kernel changes Change-Id: I998651eeb1083caf3308f1b45bd7d55b7974bcb4	2023-11-21 02:25:00 -05:00
mangala v	3256a7b074	BugFix: Re-Designed SGEMM SUP kernel to use mask load/store instruction Segfault was reported through nightly jenkins job. Issue was observed when running in MT mode. Issue was due to extra broadcast being used. Extra broadcast would access out-of-bounds memory on the input buffer. Cleaned up clobber list by removing unused registers. AMD_Internal: [CPUPL-4180] Change-Id: I1c8715b2850ef855328f2ef12f215987299bdb2b	2023-11-17 18:14:34 +05:30
Mangala V	f6046784ce	Re-Designed SGEMM SUP kernel to use mask load/store instruction Added all fringe kernels with mask load store support Fringe kernels cover m direction from 5 to 1 and n direction from 15 to 1 for row storage format - New edge kernels that uses masked load-store instructions for handling corner cases. - Mask load-store instruction macros are added. vmaskmovps, VMASKMOVPS for masked load-store. - It improves performance by reducing branching overhead and by being more cache friendly. - Mask load-store is added only for row storage format AMD-Internal: [CPUPL-4041] Change-Id: I563c036c79bf8e476a8ebde37f8f6db751fb3456	2023-11-10 01:23:48 -05:00
Eleni Vlachopoulou	75a4d2f72f	CMake: Adding new portable CMake system. - A completely new system, made to be closer to Make system. AMD-Internal: [CPUPL-2748] Change-Id: I83232786406cdc4f0a0950fb6ac8f551e5968529	2023-11-09 15:49:45 +05:30
Edward Smyth	9500cbee63	Code cleanup: spelling corrections Corrections for some spelling mistakes in comments. AMD-Internal: [CPUPL-3519] Change-Id: I9a82518cde6476bc77fc3861a4b9f8729c6380ba	2023-11-09 00:16:30 -05:00
Harsh Dave	75356d45e5	DGEMM improvement for very tiny sizes less than 24. - This commit helps improve performance for very small inputs by reducing framework checks and routing all such inputs to bli_dgemm_tiny_6x8_kernel. It forces single threaded computation for such sizes. - It invokes bli_dgemm_tiny_6x8_kernel for ZEN, ZEN2, ZEN3 and ZEN4 code paths, except when the AOCL_ENABLE_INSTRUCTIONS environment variable is set to avx512. In that case, such small inputs are routed to the bli_dgemm_tiny_24x8_kernel avx512 kernel. AMD-Internal: [CPUPL-1701] Change-Id: Idf59f4a8ee76ee8f2514a33be2b618e3ce02383e	2023-11-08 23:45:57 -05:00
Vignesh Balasubramanian	5f9c8c6929	Bugfix : Fallback mechanism in SNRM2 and SCNRM2 kernels if packing fails - Abstracted packing from the vectorized kernels for SNRM2 and SCNRM2 to a layer higher. - Added a scalar loop to handle compute in case of non-unit strides. This loop ensures functionality in case packing fails at the framework level. AMD-Internal: [CPUPL-3633] Change-Id: I555aea519d7434d43c541bb0f661f81105135b98	2023-11-08 15:16:10 +05:30
Vignesh Balasubramanian	06f23c4fd4	Bugfix : Functional correctness of DNRM2_ and DZNRM2_ APIs - Updated the final reduction of partial sums( AVX-2 code section ) to use scalar accumulation entirely, instead of using the _mm256_hadd_pd( ... ) intrinsic. This will in turn change the associativity in the reduction step. - Reverted to using scalar code on the fringe cases in AVX-2 kernel for DNRM2 and DZNRM2, for improving functional correctness. AMD-Internal: [CPUPL-4049] Change-Id: I9d320b39d23a0cbcc77fb24d951fced778ea5ea5	2023-11-07 10:21:41 -05:00
Harsh Dave	0de10cc86c	Added k=1 avx512 dgemm kernel. - This commit implements avx512 dgemm kernel for k=1 cases, which gets called for zen4 codepath. - Added architecture check for k=1 kernel in dgemm code path to pick correct kernel based on cpu architecture since now blis is having avx2 and avx512 dgemm kernels for k=1 case. - Previously in dgemm path bli_dgemm_8x6_avx2_k1_nn kernel was being called irrespective of architecture type. - Added architecture check before calling the kernel for case where k=1, so only for respective architectures this kernel is invoked. AMD-Internal: [CPUPL-4017] Change-Id: I418bbc933b41db41d323b331c6d89893868a6971	2023-11-07 01:10:09 -05:00
Vignesh Balasubramanian	ef545b928e	Bugfix : Changing fuse factor for the call to vectorized SAXPYF kernel - The call to the bli_saxpyf_zen_int_6( ... ) is explicitly present in the bli_gemv_unf_var2_amd.c file, as part of the bli_sgemv_unf_var2( ... ) function. This was changed to bli_saxpyf_zen_int_5( ... )( thereby changing the fuse factor from 6 to 5 ), in accordance to the function pointer present in the zen3 and zen4 context files. - Changed the accumulator type to double from float, inside the fringe loop for unit-strides(vectorized path) and non-unit strides (scalar code). AMD-Internal: [CPUPL-4028] Change-Id: Iab1a0318f461cba9a7041093c6865ae8396d231e	2023-11-03 01:37:43 -04:00
mkadavil	d1844678f4	LPGEMM <u|s>8s8s16ou8 fixes for incorrect zero point addition. -The zero point data type is different based on the downscale data type. For int8_t downscale type, zero point type is int8_t whereas for uint8_t downscale type, it is uint8_t. During downscale post-op, the micro-kernels upscales the zero point from its data type (int8_t or uint8_t) to that of the accumulation data type and then performs the zero point addition. The accumulated output is then stored as downscaled type in a later storage phase. For the <u|s>8s8s16 micro-kernels, the upscaling to int16_t (accumulation type) is always performed assuming the zero point is int8_t using the _mm256_cvtepi8_epi16 instruction. However this will result in incorrect upscaled zero point values if the downscale type is uint8_t and the associated zero point type is also uint8_t. This issue is corrected by switching between the correct upscale instruction based on the zero point type. AMD-Internal: [SWLCSG-2500] Change-Id: I92eed4aed686c447d29312836b9e551d6dd4b076	2023-11-02 01:30:48 -04:00
Nallani Bhaskar	b3391ef5da	Updated ERF threshold and packa changes in bf16 Description: 1. Updated ERF function threshold from 3.91920590400 to 3.553 to match with the reference erf float implementation which reduced errors at the borders and also clipped the output to 1.0 2. Updated packa function call with pack function ptr in bf16 api to avoid compilation issues for non avx512bf16 archs 3. Updated lpgemm bench [AMD-Internal: SWLCSG-2423 ] Change-Id: Id432c0669521285e6e6a151739d9a72a7340381d	2023-10-29 23:55:46 +05:30
Harsh Dave	7bcb701b79	Fixed functionality failure for dgemm tiny kernel. - For k > KC, C matrix is getting scaled by beta on each iteration. It should be scaled only once. Fixed the scaling of C matrix by beta in K loop. - Corrected A and B matrix buffer offsets, for cases where k > KC. AMD-Internal: [CPUPL-4078] AMD-Internal: [CPUPL-4079] AMD-Internal: [CPUPL-4081] AMD-Internal: [CPUPL-4080] AMD-Internal: [CPUPL-4087] Change-Id: I27f426caf48e094fd75f1f719acb4ac37d9daeaa	2023-10-26 15:11:59 +05:30
Vignesh Balasubramanian	81161066e5	Multithreading the DNRM2 and DZNRM2 API - Updated the bli_dnormfv_unb_var1( ... ) and bli_znormfv_unb_var1( ... ) function to support multithreaded calls to the respective computational kernels, if and when the OpenMP support is enabled. - Added the logic to distribute the job among the threads such that only one thread has to deal with fringe case(if required). The remaining threads will execute only the AVX-2 code section of the computational kernel. - Added reduction logic post parallel region, to handle overflow and/or underflow conditions as per the mandate. The reduction for both the APIs involve calling the vectorized kernel of dnormfv operation. - Added changes to the kernel to have the scaling factors and thresholds prebroadcasted onto the registers, instead of broadcasting every time on a need basis. - Non-unit stride cases are packed to be redirected to the vectorized implementation. In case the packing fails, the input is handled by the fringe case loop in the kernel. - Added the SSE implementation in bli_dnorm2fv_unb_var1_avx2( ... ) and bli_dznorm2fv_unb_var1_avx2( ... ) kernels, to handle fringe cases of size = 2 ( and ) size = 1 or non-unit strides respectively. AMD-Internal: [CPUPL-3916][CPUPL-3633] Change-Id: Ib9131568d4c048b7e5f2b82526145622a5e8f93d	2023-10-16 07:26:27 -04:00
Harsh Dave	7a4f84fbac	Optimized dgemm for tiny input sizes. - This commit focused on enhancing the performance of dgemm for matrices with very small dimensions. - blis_dgemm_tiny function re-uses dgemm sup kernels, bypassing the conventional SUP framework code path. The SUP framework code path requires the creation and initialization of blis objects, accessing all the needed meta-information from objects, and querying contexts, which adds a performance penalty while computing for matrices with very small dimensions. - To avoid such a performance penalty, blis_dgemm_tiny function implements lightweight support code so that it can re-use dgemm SUP kernels in such a way that it directly operates on input buffers. It avoids the framework overhead of creating and initializing blis objects, context initialization, and accessing other large framework data structures. - blis_dgemm_tiny function checks for threshold condition to match before picking the kernel. For zen, zen2, zen3 architecture tiny kernel is invoked for any shape as long as m < 8 and k <= 1500 or m < 1000 and n <= 24 and k <=1500. While for zen4 as long as dimensions are less than 1500 for m,n,k tiny kernel is invoked. -blis_dgemm_tiny function supports single threaded computation as of now. AMD-Internal: [CPUPL-3574] Change-Id: Ife66d35b51add4fccbeebd29911e0c957e59a05f	2023-10-16 05:52:49 -04:00
Shubham Sharma	9a2a4151ac	Added improved ZTRSM AVX2 kernels - Added 2x6 ZGEMM row-preferred kernel. - Kernel supports prefetch_a, prefetch_b, prefetch_a_next and prefetch_b_next. - Multiple Ways to prefetch c are supported. - prefetch_a and prefetch_c are enabled by default. - K loop is divided into multiple subloops for better c prefetch. - Added 2x6 ZTRSM row-preferred lower and upper kernels using AVX2 ISA. - These kernels are used for ZTRSM only, zgemm still uses 3x4 kernel. - Kernels support row/col/gen storage. - Updated the zen3 and zen4 config to enable use of these kernels for TRSM in zen3 and zen4 path. - Updated CMakeLists.txt with ZGEMM kernels for windows build. AMD-Internal: [CPUPL-3781] Change-Id: I236205f63a7f6b60bf1a5127a677d27425511e73	2023-10-13 07:43:33 -04:00
Harihara Sudhan S	105de694cf	Optimized ZGEMV variant 1 - Added an explicit function definition for ZGEMV var 1. This removes the need to query the context for Zen architectures. - Added a new INSERT_GENTFUNC to generate the definition only for scomplex type. - Rewrote ZDOTXF kernel and added the function name for ZDOTV instead of querying it. - With this change fringe loop is vectorized using SSE instructions. AMD-Internal:[CPUPL-3997] Change-Id: I790214d528f9e39f63387bc95bf611f84d3faca3	2023-10-13 05:03:53 -04:00
mkadavil	ea0324ab95	Multi data type downscaling support for u8s8s16 - u8s8s16<u8|s8> Downscaling is used when GEMM output is accumulated at a higher precision and needs to be converted to a lower precision afterwards. Currently the u8s8s16 flavor of api only supports downscaling to s8 (int8_t) via aocl_gemm_u8s8s16os8 after results are accumulated at int16_t. LPGEMM is modified to support downscaling to different data types, like u8, s16, apart from s8. The framework (5 loop) passes the downscale data type to the micro-kernels. Within the micro-kernel, based on the downscale type, appropriate beta scaling and output buffer store logic is executed. This support is only enabled for u8s8s16 flavor of api's. The LPGEMM bench is also modified to support passing downscale data type for performance and accuracy testing. AMD-Internal: [SWLCSG-2313] Change-Id: I723d0802baf8649e5e41236b239880a6043bfd30	2023-10-12 09:19:56 -04:00
Vignesh Balasubramanian	a6a67fea2d	ZAXPBYV optimizations for handling unit and non-unit strides - Updated the bli_zaxpbyv_zen_int( ... ) kernel's computational logic. The kernel performs two different sets of compute based on the value of alpha, for both unit and non-unit strides. There are no constraints on beta scaling of the 'y' vector. - Updated the logic to support 'x' conjugate in the computation. The kernel supports conjugate/no conjugate operation through the usage of _mm256_fmsubadd_pd( ... ) and _mm256_addsub_pd( ... ) intrinsics. - Updated the early return condition in the kernel to adhere to the standard compliance. - Updated the scalar computation with vector computation(using 128 bit registers), in case of dealing with a single element(fringe case) in unit-stride or vectors with non-unit strides. A single dcomplex element occupies 128 bits in memory, thereby providing scope for this optimization. - Added accuracy and extreme value testing with sufficient sizes and initializations, to test the required main and fringe cases of the computation. AMD-Internal: [CPUPL-3623] Change-Id: I7ae918856e7aba49424162290f3e3d592c244826	2023-10-12 06:31:08 -04:00
bhaskarn	5fd24c27a7	Updated expf max min precision to fix nan issue in Tanh Description: The expf_max and expf_min have more precision than the computation, which leads to crossing the clipping at the edge case and causes nan's in the tanh output. Updated the thresholds to less precision to clip the edge cases and avoid nan's in the tanh output. AMD-Internal: [SWLCSG-2423 ] Change-Id: I25a665475692f47443f30ca5dd09e8e06a0bfe29	2023-10-12 01:04:59 -04:00
mkadavil	c3b97559c1	Zero Point support for <u|s>8s8s<32|16>os8 LPGEMM APIs -Downscaled / quantized value is calculated using the formula x' = (x / scale_factor) + zero_point. As it stands, the micro-kernels for these APIs only support scaling. Zero point addition is implemented as part of this commit, with it being fused as part of the downscale post-op in the micro-kernel. The zero point input is a vector of int8 values, and currently only vector based zero point addition is supported. -Bench enhancements to test/benchmark zero point addition. AMD-Internal: [SWLCSG-2332] Change-Id: I96b4b1e5a384a4683b50ca310dcfb63debb1ebea	2023-10-10 12:05:47 +05:30
Edward Smyth	bb4c158e63	Merge commit 'b683d01b' into amd-main * commit 'b683d01b': Use extra #undef when including ba/ex API headers. Minor preprocessor/header cleanup. Fixed typo in cpp guard in bli_util_ft.h. Defined eqsc, eqv, eqm to test object equality. Defined setijv, getijv to set/get vector elements. Minor API breakage in bli_pack API. Add err_t* "return" parameter to malloc functions. Always stay initialized after BLAS compat calls. Renamed membrk files/vars/functions to pba. Switch allocator mutexes to static initialization. AMD-Internal: [CPUPL-2698] Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df	2023-08-21 07:01:38 -04:00
Harihara Sudhan S	03fa660792	Optimized xGEMV for non-unit stride X vector - In GEMV variant 1, the input matrix A is in row major. X vector has to be of unit stride if the operation is to be vectorized. - In cases when X vector is non-unit stride, vectorization of the GEMV operation inside the kernel has been ensured by packing the input X vector to a temporary buffer with unit stride. Currently, the packing is done using the SCAL2V. - In case of DGEMV, X vector is scaled by alpha as part of packing. In CGEMV and ZGEMV, alpha is passed as 1 while packing. - The temporary buffer created is released once the GEMV operation is complete. - In DGEMV variant 1, moved problem decomposition for Zen architecture to the DOTXF kernel. - Removed flag check based kernel dispatch logic from DGEMV. Now, kernels will be picked from the context for non-avx machines. For avx machines, the kernel(s) to be dispatched is(are) assigned to the function pointer in the unf_var layer. AMD-Internal: [CPUPL-3475] Change-Id: Icd9fd91eccd831f1fcb9fbf0037fcbbc2e34268e	2023-08-08 01:01:22 -04:00
Harihara Sudhan S	3be43d264f	Optimized xGEMV for non-unit stride Y vector - In variant 2 of GEMV, A matrix is in column major. Y vector has to be of unit stride if the operation is to be vectorized. - In cases when Y vector is non-unit stride, vectorization of the GEMV operation inside the kernel has been ensured by packing the input Y vector to a temporary buffer with unit stride. As part of the packing Y is scaled by beta to reduce the number of times Y vector is to be loaded. - After performing the GEMV operation, the results in the temporary buffer are copied to the original buffer and the temporary one is released. - In DGEMV var 2, moved problem decomposition for Zen architecture to the AXPYF kernel. - Removed flag check based kernel dispatch logic from DGEMV. Now, kernels will be picked from the context for non-avx machines. For avx machines, the kernel(s) to be dispatched is(are) assigned to the function pointer in the unf_var layer. AMD-Internal: [CPUPL-3485] Change-Id: I7b2efb00a9fa9abca65abca07ee80f38229bf654	2023-08-07 08:12:44 -04:00
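
Note on commit d1844678f4 (zero point upscaling): the message describes a zero point that was always widened to int16_t as if it were int8_t, which is wrong when the zero point is actually uint8_t. The scalar C sketch below illustrates that difference for a single byte. It is only an illustration, not code from the BLIS repository: the helper name widen_zero_point and its is_unsigned flag are hypothetical, and the real micro-kernels work on AVX2 vectors (the commit names _mm256_cvtepi8_epi16 for the signed case; the instruction used for the unsigned case is not stated in the message).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: widen one raw zero-point byte to the int16_t
 * accumulation type. is_unsigned must be 0 (int8_t zero point, sign
 * extension) or 1 (uint8_t zero point, zero extension). */
static int16_t widen_zero_point(uint8_t raw, int is_unsigned)
{
    assert(is_unsigned == 0 || is_unsigned == 1);

    if (is_unsigned) {
        /* Zero extension: the byte keeps its value in 0..255. */
        return (int16_t)raw;
    }

    /* Reinterpret the same bits as a signed byte, then sign-extend. */
    int8_t as_signed;
    memcpy(&as_signed, &raw, sizeof as_signed);
    return (int16_t)as_signed;
}
```

With a uint8_t zero point of 200 (bit pattern 0xC8), sign extension yields -56 while zero extension yields 200, so treating every zero point as int8_t shifts the downscaled output by 256 for any value above 127.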

317 Commits