amd/blis - blis - Public git mirror

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-07-03 13:47:29 +00:00

Author	SHA1	Message	Date
Edward Smyth	4eabb5cd2e	SCALV alpha=zero BLAS compliance SCALV is used directly by BLAS, CBLAS and BLIS scal{v} APIs but also within many other APIs to handle special cases. In general it is preferred to use SETV when alpha=0, but BLAS and CBLAS continue to multiple all vector element by alpha. This has different behaviour for propagating NaNs or Infs. Changes in this commit: - Standardize early returns from SCALV reference and optimized kernels. - User supplied N<0 is handled at the top level API layer. Use negative values of N in kernel calls to signify that SETV should _not_ be used when alpha=0. This should only be required in SCALV. - Include serial threshold in zdscal (as in dscal) to reduce overhead for small problem sizes. - Code tidying to make different variants more consistent. - More standardization of tests in SCALV gtestsuite programs. - Remove scalv_extreme_cases.cpp as it is now redundant. AMD-Internal: [CPUPL-4415] Change-Id: I42e98875ceaea224cc98d0cdfe0133c9abc3edae (cherry picked from commit `a07e041b1f`)	2024-09-20 01:05:03 -04:00
Vignesh Balasubramanian	601e4fbb87	Added CBLAS wrappers for complex precision ?ROT and ?ROTG APIs - Added the appropriate CBLAS wrappers for CROTG, CSROT, ZROTG and ZDROT APIs. These would internally call their ?_blis_impl() layer. AMD-Internal: [CPUPL-5813] Change-Id: I6037f20092f99cc5a5e2794d03bbe76d6a55eb97	2024-09-19 18:49:38 +05:30
Arnav Sharma	c31620d85e	Bugfix: Fix for gemmsup_r Reference Kernels - The existing row-preferred reference kernels for GEMM SUP path were not taking into consideration the packing state of matrices A or B. Thus, whenever either or both A and B matrices were packed the kernel was unable to iterate appropriately through the matrices thereby calculating incorrect values resulting in failures. - Though, for generic configuration, the SUP path is disabled by default the set of Pack and Compute Extension APIs use these kernels thus, this issue resulted in their failures as well. - With this patch, the loops being used in these kernels have been fixed to iterate over steps of MR and NR while also accounting for the fringe cases. Within the updated loops, temporary pointers used to point to the correct block/panel of the matrices are incremented with panel strides of respective matrices. AMD-Internal: [CPUPL-5674] Change-Id: Ic3939877c79ebb9ccf9e53b1d1672cea4b8c5959	2024-09-19 11:09:04 +05:30
Chandrashekara K R	62dcb157da	Added logic to use right format specifier to read integer value. Updated logic to use "%ld" and "%lld" format specifiers to read 64-bit integer from input files using fscanf function on Linux and Windows respectively when the user set INT_SIZE='auto' on 64-bit machine or INT_SIZE='64'. Otherwise "%d" on both windows and Linux for benchmarking blis and LPGEMM. Change-Id: I4762c4c1b3fcd09cf66d0cc9572d38766be6be60 (cherry picked from commit `e4eed817aa`)	2024-09-17 05:27:04 -04:00
Vignesh Balasubramanian	86798b03ae	Fixed compiler warnings due to prefetch in AVX2 and AVX512 kernels - Added explicit typecast to the pointers that are passed to the _mm_prefetch( ... ) intrinsic, to avoid compiler warnings. AMD-Internal: [CPUPL-4415] Change-Id: I1c1398b7b5abe81848d33cb6df107f7f077588ea	2024-09-16 06:33:21 -04:00
Chandrashekara K R	ccdccc5d97	Updated format specifier for fscanf to read double values. Updated format specifier to read signed double("%lld") and unsigned double("%llu") from file using fscanf from both windows and Linux. AMD-Internal: [CPUPL-5787] Change-Id: Ibef50b0df708f474e22f703240e264eff1de3994 (cherry picked from commit `91d4337b8b`)	2024-09-13 09:24:16 -04:00
varshav2	a87c8dbca3	Bug Fixes in the F32F32 m == 1 transpose scenario - added the missing stride updates in B reorder case in GEMV - added the missing stride updates for the cast of transA with B reordered case. Change-Id: Ic89781dfa7c0d9380ea523796958f795828a1ade	2024-09-11 02:54:42 -04:00
Mithun Mohan	baa1ff98da	Fixes for bfloat16 accumulation rounding errors in bench. For the bf16bf16of32bf16 lpgemm api, inside the micro-kernels in order to convert the accumulated float values to bfloat16 before storing, the _mm512_cvtneps_pbh intrinsic (vcvtneps2bf16) is used. This intrinsic rounds the value based on a rounding bias logic. Replicating the same rounding logic inside the bf16 bench accuracy check function to get proper one to one comparison of output values. AMD Internal: [SWLCSG-2948] Change-Id: I135ac39ac8484769b6c0fe5b3e351dd22d7ca1d8	2024-09-10 23:54:56 +05:30
Meghana Vankadari	5855874df3	Developed all WoQ kernels for bf16s4f32o<f32\|bf16> Description: 1. Written 6x64 main and other fringe kernels for WoQ where scaling s4 weights into bf16 performed in the kernel itself to reduce bandwidth. 2. These kernels are performing better compared to bf16 weights when m is small and n is large. 3. Established a threshold to do quantization support at packing of B (KCXNC) level or WoQ kernel level. Change-Id: I4f8265b8b58c276ff2590cc948d1f920aa0bb289 (cherry picked from commit `5120f98e12`)	2024-09-10 08:31:53 -04:00
varshav2	c47470c0bd	Add TransA and TransB support for F32F32F32oF32 - Added support for TransA and transB in f32f32of32 APIs - Modified the GEMV case(m == 1) to support PACKB feature - Redirecting the operations to GEMM instead of GEMV in case of n == 1 conditions, with storage scheme r/transA and c/transB to avoid the packing errors which would lead to failures in computation. Change-Id: I0eb8c31485af4e33c53fd36b5e5788d75d3a67a9	2024-09-09 22:15:07 +05:30
Chandrashekara K R	a8e1684861	Clang version requirement is updated in cmake Description: Due to the latest VNNI instructions are supported only from Clang Version 18 and above, updated clang version check from 17 to 18. AMD-Internal: [CPUPL-5744] Change-Id: I4a3ecec65bd88d9dccfe1018fb25cb7be29946f0 (cherry picked from commit `5ada963b4c`)	2024-09-09 06:52:24 -04:00
Mithun Mohan	1295c4973a	Disabling smart threading for small input dimensions. -It has been observed that reduction of threads as part of smart threading for smaller input dimensions hampers the performance of the other inputs with larger dimensions due to lower operating frequency of the newly launched threads (apart from the existing ones). Disabling smart threading for these bandwidth bound input patterns (small m and n) fixes this issue. -Bug fixes related to work split in LPGEMV for n < NR and m < MR cases. AMD Internal: [SWLCSG-2948] Change-Id: I0117dc0ea6820a9fac8e14f93374b54a7d80c121	2024-09-08 23:04:03 +05:30
jagar	8ff27d23bd	pkg-config file for blis MT library Updated Makefile to update st/mt library in blis.pc Change-Id: Idc61d7652ee6380cf2d73f08caaf9e9216fbb77a	2024-09-06 15:17:20 +05:30
Meghana Vankadari	64ca52b5b8	Bug fix in WOQ kernel for m=4 case. - Updated pre_op_off computation for nr0 < NR cases. - Fixed warnings in bench file. Change-Id: Iae30fa84b6b47ebd94ab05d2139056aee24546d7 (cherry picked from commit `687abe4c96`)	2024-09-05 09:31:58 +00:00
Meghana Vankadari	1007f00632	Added bf16s4f32 kernels to handle m=4 cases Details: - In WOQ, if m = 4, special case kernels are added where s4->bf16 conversion happens inside the compute kernel and packing is avoided. For all other cases, B matrix is dequantized and packed at KC loop level and native bf16 kernels are re-used at compute level. - Fixes in bench to avoid accuracy failures when datatype of output is bf16. Change-Id: Ie8db42da536891693d5e82a5336b66514a50ccb2 (cherry picked from commit `2e1cc2f14a`)	2024-09-05 06:25:28 +00:00
Edward Smyth	9beda4a43c	Export full set of _blis_impl interfaces The _blis_impl layer provide a BLAS-like API for use in builds where BLAS and CBLAS interfaces are not desirable. This patch generates interfaces in uppercase and with and without trailing underscores, to match what is generated for the regular BLAS interface. AMD-Internal: [CPUPL-5650] Change-Id: I3ba9d0992291b0977479ab479acb71e42277c7c2 (cherry picked from commit `711dce14d0`)	2024-09-03 04:34:58 -04:00
Mangala V	e955201717	Revert "Using znver2 flags for building zen/zen2/zen3 kernels on amdzen builds." This reverts commit `7d379c7879`. Reason for revert: < Perf regression is observed for GEMM(gemm_small_At) as fma uses memory operand > Change-Id: I0ec3a22acaacfaade860c67858be6a2ba6296bce (cherry picked from commit `705755bb5c`)	2024-09-02 09:55:45 -04:00
Deepak Negi	043a7e0baa	Replaced int_32 with dim_t in lpgemm bench Replaced int32_t with dim_t (int64_t) to avoid overflow. Change-Id: I4132b72fcbffd9dbd2242b3638922931bcdb1b80	2024-09-02 09:16:17 -04:00
Edward Smyth	493e306982	Determine AMD FP/SIMD execution datapath width Different Zen processors may have a 512-bit, 256-bit or 128-bit FP/SIMD execution datapath width (FP512, FP256, FP128). Zen5 allows a selection of FP512 or FP256 width in BIOS settings. Add cpuid code to detect the width and store an indication of it in the global variable bli_fp_datapath. This should be accessed internally via the function bli_cpuid_query_fp_datapath(). This functionality is currently only enabled on x86_64 platforms and only currently reports a value for AMD CPUs. Also add Zen3 as a fallback path for any unknown AMD processors if AVX512 is not supported or has been disabled. AMD-Internal: [CPUPL-4415] Change-Id: Idf3fb5a697b43bc035ce110e86f60706dcc67f2a (cherry picked from commit `1f18eeb267`)	2024-09-02 08:28:49 -04:00
mkadavil	f7701cfd45	Disabling smart threading for bandwidth bound input patterns. For some applications, one of the input dimension is mostly m < MR or n < NR with the other dimension being small for the most part, with intermittent large ones. Currently in these cases (m < MR or n < NR), the number of threads used is reduced (as part of smart threading) if the other dimension (n or m) is also small. For larger dimensions all the threads are used. However its been observed that this reduction of threads hampers the performance of the larger inputs due to lower operating frequency of the newly launched threads (apart from the existing ones). Disabling smart threading for these bandwidth bound input patterns (m < MR or n < NR) fixes this issue. AMD Internal: [SWLCSG-2948] Change-Id: I5334860cf4411ea4504d2e6bc598b9904780bbbf	2024-09-02 03:40:44 +05:30
varshav2	d67f9cfe5c	Revert duplicate check and fix bug in the check for post-ops - Revert of patch 1110983 - Duplicate check removal and early return for s8s8s32/u8s8s32 - Add fix - Added check to see if post-ops is enabled with col-major storage and return early in that case. Change-Id: Id3b8c97b6d1425dfb06f3b196e5acd60caee8fca	2024-08-29 04:46:02 +05:30
varshav2	880de5e2e4	Fix duplicate check and early return in s8s8s32/u8s8s32 - removed the duplicate check for col-major inputs in s8s8s32/u8s8s32 APIs - Fixed the print in bench_lpgemm Change-Id: If40837b89927dd82d8aa6f620d1a7f2c24aed53c	2024-08-29 04:26:13 +05:30
Hari Govind S	6900d742a3	Bug Fix: When calculating number of threads for level1 APIs when BLIS_IC_* or BLIS_JC_* are set - Reverted the change done for tuning ddotv API. When number of threads is mentioned using BLIS_IC_NT or BLIS_JC_NT, ... number of threads are not calculated and as a result number of threads value is -1. OpenMP threads are launched with -1 value. This results in crash. This bug is fixed by correctly calculating number of threads. AMD-Internal: [SWLCSG-3028][CPUPL-5689] Change-Id: Ib9284dca02bdb115752926109beb28dc342e300a	2024-08-29 16:15:10 +05:30
Deepak Negi	1a388b1e1d	Element wise operations API for float(f32) input matrix in LPGEMM. This API supports applying element wise operations (eg: post-ops) on a float(f32) input matrix to get an output matrix of the same (float(f32)). Change-Id: I387a544f0d33d2231f5f6a92e212f17b1103dd24 AMD Internal: [SWLCSG-2947] Change-Id: I387a544f0d33d2231f5f6a92e212f17b1103dd24	2024-08-27 08:17:29 -04:00
Chandrashekara K R	09d5021a30	CMake: Enabled ADDON(aocl_gemm) feature for Windows. 1. Updated datatype from __int64_t to int64_t. Since __int64_t was not defined for Windows 2. Updated CMake build system to build lpgemm on windows Change-Id: I5fc5ed93ecc54e4a9931b7b40b790d37c7ead4b8 (cherry picked from commit `2ff0125f11`)	2024-08-26 02:03:29 -04:00
Vignesh Balasubramanian	694d6c9493	Bugfix for {D/C/Z}AXPBY and ZAXPY BLAS APIs - Bug : For non-zen architectures, {D/C/Z}AXPBY had incorrect datatypes passed when querying the computational kernel from context. The right datatype is now passed to each variant. - Bug : For ZAXPY, a NULL context was passed to the kernel when using the single-threaded path. In case of further using the context inside the kernel, this would be an issue. We now pass the context instead of a null pointer. AMD-Internal: [CPUPL-5643] Change-Id: I01bb78bda6be61c43543b16fda0ac02a988a07bf	2024-08-22 06:48:14 -04:00
Edward Smyth	cf029f4a9c	Disable disabling sba pools The disable sba pools functionality currently gives incorrect results at runtime when multiple threads are used. Fixes and improvements are present in the upstream version of BLIS, so until these are downstreamed only allow builds where sba pools are enabled. AMD-Internal: [CPUPL-5512] Change-Id: I9ccd654477fb714a2fb5f38a138b7e9b5e55e33d	2024-08-22 06:42:03 -04:00
Edward Smyth	cb3d2878d2	GTestSuite: Fix TRSM ukr tests in non-zen builds Add guards around bli_trsm_small kernel tests to only call them if BLIS_ENABLE_SMALL_MATRIX_TRSM is defined. This fixes missing symbol errors in tests of non-zen builds, e.g. generic or skx. AMD-Internal: [CPUPL-4500] Change-Id: I7a822a41b5f686b5e38b0c63dd1871963e990407	2024-08-22 16:06:48 +05:30
Chandrashekara K R	10e4dce660	CMake: Updated cmake minimum version to be supported to 3.22.0 to maintain uniform across all AOCL libraries. AMD Internal : [CPUPL-5616] Change-Id: Ic53532ff9883b1bba39e859ea2523c20c1ac383b (cherry picked from commit `545f9ee44e`)	2024-08-21 06:48:44 -04:00
Shubham Sharma	ecf984359b	Tiny size optimization for DTRSV var2 - Use AVX2 kernels for tiny sizes on genoa. - Removed the runtime init overhead for small sizes. AMD-Internal: [CPUPL-5407] Change-Id: I0db7d93abc659012916ef706f22528c7fabb4e30	2024-08-20 02:57:57 -04:00
Shubham Sharma.	bee1640b38	Fixed failures for mixed precision on ZEN5 in GEMM - Optimized macro kernel (bli_dgemm_avx512_asm_8x24_macro_kernel) for zen5 do not support alpha scaling. Alpha scaling is supported by zen5 micro kernel (bli_dgemm_avx512_asm_8x24). - Optimized macro kernel expects alpha scaling to be done during packing. The packing kernel used for mixed precision do not support alpha scaling. Therefore, the optimized Zen5 macro kernel is not compatible with existing packing logic. - Changes have been made to use the generic macro kernel which in turn used zen5 micro kernel for mixed precision which supports alpha scaling. AMD-Internal: [CPUPL-5058] Change-Id: I1bfeb32ae07eedafadad7dd2c62d63913a46e446	2024-08-20 02:22:26 -04:00
Vignesh Balasubramanian	a29efd12b1	Bugfix : Fixed memory accesses in AVX512 SGEMMSUP RD kernels - Bug: Among the list of AVX512 SGEMMSUP RD kernels, the ones handling m_fringe = 3 had incorrect usage of ZMM on a vector-load instruction that strictly needed YMMs. - Further updated the existing micro-kernel test cases to simulate these issues and validate the fix. AMD-Internal: [CPUPL-5353] Change-Id: Id86e60ce36bb9f8433a1a203cfe0b8c6347df2c1	2024-08-20 02:22:16 -04:00
Arnav Sharma	f9a606f00d	Gtestsuite: Fix for GEMM_COMPUTE IIT_ERS Test - The IIT_ERS test for GEMM_COMPUTE where alpha = 0 and beta = 0 was failing since neither of the matrices was being packed and thus, missing the scaling by alpha resulting in a non-zero output for C matrix (C := A * B). - Enabled packing of A matrix for the ZeroAlpha_ZeroBeta IIT_ERS test which handles the alpha scaling. AMD-Internal: [CPUPL-5598] Change-Id: Id9179ec6150d1bc5a0274edce727ce6cc4172213	2024-08-20 02:20:15 -04:00
Vignesh Balasubramanian	f8f070f6ec	Fixing compiler warnings when exporting internal BLIS kernels - Added the attribute to export symbols, in the header file that contains the L1 kernel declarations. This attribute was previously added as part of the kernel definitions. AMD-Internal: [CPUPL-4415] Change-Id: I375246f47d53c220f885644f9b75c7d7991ae710	2024-08-20 02:11:11 -04:00
Meghana Vankadari	b1c046ef1d	Added LPGEMV(n=1) kernels for s8s8s32os32\|s8 and s8s8s16os16\|s8 APIs - When n=1, reorder of B matrix is avoided to efficiently process data. A dot-product based kernel is implemented to perform gemv when n==1. AMD-Internal: [SWLCSG-2354] Change-Id: I6b73dfddd9a15e7b914d031646a1d913a7ab4761	2024-08-20 11:17:24 +05:30
jagar	ef011c9d92	CMake : Added cmake support for "test" in blis. command to build test $ "cmake --build . --target test_blis" AMD-Internal: [CPUPL-2748] Change-Id: I088af68115f3d9fdd007203dd22a42f53478a44f	2024-08-06 19:34:57 +05:30
Edward Smyth	7fff7b4026	Code cleanup: Miscellaneous fixes - Delete unused cmake files. - Add guards around call to bli_cpuid_is_avx2fma3_supported in frame/3/bli_l3_sup.c, currently assumes that non-x86 platforms will not use bli_gemmtsup. - Correct variable in frame/base/bli_arch.c on non-x86 builds. - Add guards around omp pragma to avoid possible gcc compiler warning in kernels/zen/2/bli_gemv_zen_int_4.c. - Add missing registers in clobber list in kernels/zen4/1/bli_dotv_zen_int_avx512.c. - Add gtestsuite ERS_IIT tests for TRMV, copied from TRSV. - Correct calls to cblas_{c,z}swap in gtestsuite. - Correct test name in ddotxf gtestsuite program. AMD-Internal: [CPUPL-4415] Change-Id: I69ad56390017676cc609b4d3aba3244a2df6a6b5	2024-08-06 06:56:01 -04:00
Hari Govind S	d349f89df6	Fix warning caused by dscalv - Setting the value for ST_THRESH for default code path in dscalv API to avoid warning message. Change-Id: I8ace2070350267904faa498197b8356de9af58d1	2024-08-06 12:13:23 +05:30
Edward Smyth	89f52a6df5	Code cleanup: spelling corrections Corrections for spelling and other mistakes in code comments and doc files. AMD-Internal: [CPUPL-4500] Change-Id: I33e28932b0e26bbed850c55602dee12fd002da7f	2024-08-05 16:18:51 -04:00
Edward Smyth	82bdf7c8c7	Code cleanup: Copyright notices - Standardize formatting (spacing etc). - Add full copyright to cmake files (excluding .json) - Correct copyright and disclaimer text for frame and zen, skx and a couple of other kernels to cover all contributors, as is commonly used in other files. - Fixed some typos and missing lines in copyright statements. AMD-Internal: [CPUPL-4415] Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371	2024-08-05 15:35:08 -04:00
Nallani Bhaskar	e712673ab7	Peformance fixes for gcc compiler in fringe kernels Description: 1. GCC avoiding loading b into registers in m fringe kenrels of int8 kernels. Instead gcc generating fma with memory as an operand for B input. 2. This is causing performance regression for larger n where each fma needs to load the input from memory again and again. 3. This is observed with gcc but not with clang. 4. Inserted dummy shuffle instructions for b data to further explicitly tell compiler that b needs to be in registers. AMD-Internal: SWLCSG-2948 Change-Id: Ibbf186fe6569e6265e2c2bb4ec3141ef323ea3e6	2024-08-05 14:31:22 -04:00
Edward Smyth	09c45525f4	Missing early returns (2) Add missing early return in axpyv. AMD-Internal: [CPUPL-5540] Change-Id: I522fd6f5551a4dab24e8c164fa38818c900b89f8	2024-08-05 12:18:33 -04:00
Edward Smyth	591a3a7395	Code cleanup: file formats and permissions - Remove execute file permission from source and make files. - dos2unix conversion. - Add missing eol at end of files. Also update .gitignore to not exclude build directory but to exclude any build_* created by cmake builds. AMD-Internal: [CPUPL-4415] Change-Id: I5403290d49fe212659a8015d5e94281fe41eb124	2024-08-05 11:52:33 -04:00
mkadavil	9f5fec7713	Matrix MUL op support in element wise operations API for bfloat16. -Matrix MUL op support added in main as well as fringe bfloat16 element wise operations kernels. -Benchmarking/testing framework for the same is added. -Fixed issues in setting up post-ops node index. AMD Internal: [SWLCSG-2947, SWLCSG-2953] Change-Id: Iba7561a6a60df41211efbf06fab1b4900207bcf8	2024-08-05 08:29:42 +05:30
Edward Smyth	b964308e50	GTestSuite: option to check input arguments Add tests to check input arguments have not been modified by BLIS routine. These tests add a large runtime overhead, so they are disabled by default. To enable them, configure gtestsuite with: cmake -DTEST_INPUT_ARGS=ON ... and run desired tests as normal. Also: - Correct testinghelpers::chktrans to handle upper case values of argument trns. - Change testinghelpers::matsize to return size 0 if m, n or leading dimension are 0, or if leading dimension is too small. AMD-Internal: [CPUPL-4379] Change-Id: I9494af800f9383195272ce99f622104a38fd0ed8	2024-08-05 09:58:17 -04:00
Edward Smyth	6393cb9d7c	GTestSuite: misc corrections 3 - Set threshold to epsilon for early return cases where we are just scaling a matrix. - Add this threshold to IIT_ERS files for appropriate tests. - In IIT_ERS for gemm_compute, remove tests on null A and B when we are expecting to set or scale C. More thought is required in gemm_compute tests to handle these cases and look at cases where A or B has been packed. AMD-Internal: [CPUPL-4500] Change-Id: Ia649cc340ca1df6511388f9c43a31e53296cb2bf	2024-08-05 09:31:18 -04:00
Deepak Negi	80bf6249f0	Matrix MUL post-operation support for float(bf16\|f32) LPGEMM APIs. This post-operation computes C = (betaC + alphaAB) D, where D is a matrix with dimensions and data type the same as that of C matrix. AMD-Internal: [SWLCSG-2953] Change-Id: Id4df2ca76a8f696cb16edbd02c25f621f9a828fd	2024-08-05 08:25:32 -04:00
Nallani Bhaskar	4c2f436cce	Peformance fixes for gcc compiler in fringe kernels Description: 1. GCC avoiding loading b into registers in m fringe kenrels of int8 kernels. Instead gcc generating fma with memory as an operand for B input. 2. This is causing performance regression for larger n where each fma needs to load the input from memory again and again. 3. This is observed with gcc but not with clang. 4. Inserted dummy shuffle instructions for b data to further explicitly tell compiler that b needs to be in registers. 5. Moved packb_s4_to_bf16 under JIT macro to resovle compilation issue with gcc version < 11.2 AMD-Internal: SWLCSG-2948 Change-Id: I5bd1bad7ad129e0dde91ed78d49a4ede3bff456a	2024-08-05 08:13:06 -04:00
mkadavil	f040ba617f	Element wise operations API for bfloat16 input matrix in LPGEMM. -This API supports applying element wise operations (eg: post-ops) on a bfloat16 input matrix to get an output matrix of the same(bfloat16) or upscaled data type (float). -Benchmarking/testing framework for the same is added. AMD Internal: SWLCSG-2947 Change-Id: I43f1c269be1a1997d4912d8a3a97be5e5f3442d2	2024-08-05 07:17:08 -04:00
Shubham Sharma.	dac0524ac8	BugFix in AVX512 DGEMMT SUP ST RRC variant - C<- alpha * op(A) op(B) + beta C. C(nxn) - A(n x k) * B(k x n) For ZEN4 and ZEN5 DGEMM is col-preferred kernel DGEMMT = DGEMM + DGEMMT DGEMM is col-preferred and DGEMMT is row-preferred. DGEMM is evaluated as C = AB (all col-storage) whereas DGEMMT is evaluated as C = B A (row-storage). When A is packed it is packed as row-panels with col-stored elements. So DGEMM is evaluated as C = AB (A is col-stored) it aligns with col-stored preference. For DGEMMT: C = B A, here A will become col-stored because of packingand as result it will break the DGEMMT kernel assumption that A is row-storage. - Fixed this by disabling this optimization for ZEN4 and ZEN5. AMD-Internal: [CPUPL-5542} Change-Id: I9645624be009d1050ecb908d65c04aadcfa04379	2024-08-05 05:21:24 -04:00

1 2 3 4 5 ...

3518 Commits