amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-21 17:08:17 +00:00

Author	SHA1	Message	Date
Meghana Vankadari	fbb72d047f	Added group quantization and zero-point support for WOQ kernels Description: 1. Added group quantization and zero-point (zp) in aocl_gemm_bf16s4f32o<bf16|f32> API. 2. Group quantization is a technique to improve accuracy where the scale factors used to dequantize weights vary at group level instead of at per-channel or per-tensor level. 3. Added zp and scaling in woq packb kernels so that for large M values zp and scaling are performed at pack-b stage and bf16 kernels are called. 4. Adding zp support and scaling to default path in WoQ kernels created some performance overhead when M value is very small. 5. Added string group_size to lpgemm bench to read group size from bench_input.txt and tested for various combinations of matrix dimensions. 6. The scale factors could be of type float or bf16 and the zero-point values are expected to be in int8 format. AMD-Internal: [SWLCSG-3168, SWLCSG-3172] Change-Id: Iff07b54d76edc7408eb2ea0b29ce8b4a04a38f57	2024-12-02 06:46:13 +00:00
Shubham Sharma.	be6fbadd95	BlockSize Tuning for ZEN4 and ZEN5 - Enabled dynamic blocksizes for DGEMM in ZEN4 and ZEN5 systems. - MC, KC and NC are dynamically selected at runtime for DGEMM native. - A local copy of cntx is created and blocksizes are updated in the local cntx. - Updated threshold for picking DGEMM SUP kernel for ZEN4. AMD-Internal: [CPUPL-5912] Change-Id: Ic12a1a48bfa59af26cc17ccfa47a2a33fadde1f6	2024-11-29 03:19:16 -05:00
Shubham Sharma	f2320a1fef	Enabled DGEMM row major kernel for ZEN4 - Merged ZEN4 and ZEN5 DGEMM 8x24 kernel. - Replaced 32x6 kernel with 8x24. Now same kernel is used for ZEN4 and ZEN5. - Blocksizes have been tuned for genoa only. - DGEMM kernel for DTRSM native code path is replaced with 8x24 kernel. - Enabled alpha scaling during packing for ZEN4. - ZEN4 8x24 kernel has been removed. AMD-Internal: [CPUPL-5912] Change-Id: I89a16a7e3355af037d21d453aabf53c5ecccb754	2024-11-29 08:18:48 +00:00
Shubham Sharma	082081658f	BugFix: Fixed extreme value handling in AVX512 DGEMM kernel - Extreme values are not handled correctly when beta == 0 and C is column major stored. - For checking if beta is zero, VCOMISD(XMM(1), XMM(2)) is used, beta(XMM1) is compared with zero(XMM2), for column major C, setting of xmm2 to zero was missed. - XMM2 is set to zero after the jump to column major stored C code is made, this skips the setting of XMM2 to zero for column major C. - This is fixed by setting XMM2 to zero before the column major jump. AMD-Internal: [CPUPL-5851] Change-Id: Ic511071fbc82a082fa48a1543c0c7325eaf75cb8	2024-11-29 08:13:57 +00:00
Shubham Sharma.	bc3238e21e	BugFixes in ZEN5 DGEMM kernel - Changed fringe cases to use ZEN5 DGEMM kernel instead of ZEN4 kernel. - ASAN reporting error when RBP is used even when -fno-stack-pointer flag is used, therefore replaced RBP register with R11 register. - Added missing RDX register in clobber list which is causing failures with AOCC compiler. Thanks to harsh.dave@amd.com for debugging some of the issues. AMD-Internal: [CPUPL-5851] Change-Id: I0ee412c97c9dbfb3e7a736a10bfd93d775779b5b	2024-11-29 00:22:41 -05:00
Shubham Sharma	266bd32dea	Enable fringe case handling in DGEMM ZEN5 macro kernel - Generic kernel is used if N is not a multiple of NR or M is not a multiple of MR. - This limits the maximum values of NR that can be used. - Support for fringe case handling is added in DGEMM macro kernel so that macro kernel can be used for all problem sizes. AMD-Internal: [CPUPL-5912] Change-Id: I85c17e91d7511bb35ffed0f346d6ff0376baf62f	2024-11-29 00:22:33 -05:00
Kiran Varaganti	3e2795f406	OpenMP barrier overhead bug fix In the function bli_thread_update_rntm_from_env(), a mutex is used for reading global_rntm: "bli_pthread_mutex_lock( &global_rntm_mutex );" This causes a regression when the application is multithreaded. The regression is due to this mutex: imagine a scenario where two threads are launched. One thread acquires the mutex and the second thread stalls until the mutex is freed by the first thread. As a result the second thread is slower to arrive at the OpenMP barrier in the application, thereby increasing the OpenMP barrier overhead. Things get worse as more threads are launched. Thanks to rocHPL for sharing standalone panelfact application to reproduce this issue. Thanks to @Edward Smyth (edward.smyth@amd.com) for finding this bug. [SWLCSG-3143]	2024-11-22 15:36:30 +05:30
Deepak Negi	04ae01aeab	Added support to specify bias data type in bf16 API's Description: 1. The bias type was supported only based on output data type. 2. The option is added in the pre-ops structure to select the bias data type irrespective of the storage data type in bf16 and WoQ API's AMD-Internal: SWLCSG-3171 Change-Id: Iac10b946c2d4a5c405b2dc857362be0058615abf	2024-11-19 05:30:02 -05:00
Edward Smyth	971c890fc6	GTestSuite: Select ukr tests by BLIS version Add definitions in gtestsuite header to list available kernel by AOCL BLIS version. Check these definitions in ukr test programs to avoid missing symbol errors when testing with an older version of BLIS. Currently AOCL_41, AOCL_42, AOCL_50 and AOCL_DEV are supported, with AOCL_DEV inferred from the version being later than the value of AOCL_BLAS_LATEST_VERSION set in CMakeLists.txt. Thanks to Eleni Vlachopoulou for the cmake functionality to automatically detect the version from the library. AMD-Internal: [CPUPL-4500] Change-Id: I40ffd3d3789324fbb1dabfbf5e1dd4e0c94d54d9	2024-11-15 10:07:29 -05:00
Deepak Negi	60a8c71a1a	Sigmoid and Tanh post-operation support for int8 API's. Description: Implemented sigmoid, tanh as fused post-ops in aocl_gemm_<s8|u8>s8<s32|s16>o<s8|u8|s32> API's Sigmoid(x) = 1/(1+e^(-x)) Tanh(x) = (1-e^(-2x))/(1+e^(-2x)) Updated bench_lpgemm to recognize sigmoid, tanh as options for post-ops from bench_input and verified. AMD-Internal: [SWLCSG-3178] Change-Id: I9df3aab02222f728ff9d1f292c7bc549f30176f0 AOCL-Nov2024-b2	2024-11-15 05:36:31 -05:00
Deepak Negi	146f3b2eb2	Sigmoid and Tanh post-operation support for f32 API. Description: Implemented sigmoid, tanh as fused post-ops in aocl_gemm_f32f32f32of32 API's Sigmoid(x) = 1/(1+e^(-x)) Tanh(x) = (1-e^(-2x))/(1+e^(-2x)) Updated bench_lpgemm to recognize sigmoid, tanh as options for post-ops from bench_input and verified. AMD-Internal: [SWLCSG-3178] Change-Id: Iac0a907f6dea1d9cb82d9fd8716bfdbf1c33921d	2024-11-15 04:20:20 -04:00
Deepak Negi	b5c1b6055a	Sigmoid and Tanh post-operation support for bf16 API. Description: Implemented sigmoid, tanh as fused post-ops in aocl_gemm_bf16bf16f32o<f32|bf16> API's Sigmoid(x) = 1/(1+e^(-x)) Tanh(x) = (1-e^(-2x))/(1+e^(-2x)) Updated bench_lpgemm to recognize sigmoid, tanh as options for post-ops from bench_input and verified. AMD-Internal: [SWLCSG-3178] Change-Id: I78a3ba4a67ab63f9d671fbe315f977b016a0d969	2024-11-15 01:13:31 -04:00
Edward Smyth	0249f57022	GTestSuite: Correct blis_impl calls for gemm_compute gemm_compute currently has differences in the interface to the blis_impl layer compared to the top-level API. Modify gtestsuite wrapper to account for this. AMD-Internal: [CPUPL-4500] Change-Id: Ie96c9ac3b23128ae8e03af34ad11e65910dec594	2024-11-12 06:57:59 -05:00
Edward Smyth	ef5cbf7c9a	GTestSuite: More threshold changes Various changes from testing code paths with both gcc and aocc. AMD-Internal: [CPUPL-4378] Change-Id: I8964d8ab4e1f5669026af606598c2eb3dfddde16	2024-11-11 12:34:05 -05:00
Vignesh Balasubramanian	06d776b025	AVX512 ZGEMM SUP Inner product kernels - Implemented a set of column preferential dot-product based ZGEMM kernels(main and fringe) in AVX512(for SUP code-path). These kernels perform matrix multiplication as a sequence of inner products(i.e, dot-products). - These standalone kernels are expected to strictly handle the CRC storage scheme for C, A and B matrices. RRC is also supported through operation transpose, at the framework level. - Added unit-tests to test all the kernels(main and fringe), as well as the redirection between these kernels. AMD-Internal: [CPUPL-5949] Change-Id: I858257ac2658ed9ce4980635874baa1474b79c38	2024-11-06 04:18:57 -05:00
Nallani Bhaskar	3e6f98f046	Bug fixes f32 to bf16 reorder API Description: _mm512_cvtne2ps_pbh(a, b) instruction takes b when j<16 but the code was developed assuming the reverse order. Fixed some indentation issues. Changed the file name and made it uniform. Change-Id: I7b45b4c35931d8febde7b7b5d9604ea953046f97	2024-11-05 12:58:15 +00:00
Nallani Bhaskar	9735391e1d	Implemented f32tobf16 reorder function Description: aocl_reorder_f32obf16 function is implemented to reorder input weight matrix of data type float to bfloat16. The reordering is done to match the input requirements of API aocl_gemm_bf16bf16f32o<f32|bf16>. The objective of the API is to convert a model/matrix of type f32 to bf16 and process when machine supports bf16 FMA instruction _mm512_dpbf16_ps but the model is still in float Change-Id: Ib7c743d52d01a1ac09e84ac120577ec9e02f90f5	2024-11-04 04:32:01 +00:00
Mithun Mohan	097cda9f9e	Adding support for AOCL_ENABLE_INSTRUCTIONS for f32 LPGEMM API. -Currently lpgemm sets the context (block sizes and micro-kernels) based on the ISA of the machine it is being executed on. However this approach does not give the flexibility to select a different context at runtime. In order to enable runtime selection of context, the context initialization is modified to read the AOCL_ENABLE_INSTRUCTIONS env variable and set the context based on the same. As part of this commit, only f32 context selection is enabled. -Bug fixes in scale ops in f32 micro-kernels and GEMV path selection. -Added vectorized f32 packing kernels for NR=16(AVX2) and NR=64(AVX512). This is only for B matrix and helps remove dependency of f32 lpgemm api on the BLIS packing framework. AMD Internal: [CPUPL-5959] Change-Id: I4b459aaf33c54423952f89905ba43cf119ce20f6	2024-10-30 08:52:22 +00:00
Edward Smyth	9ce2696fc9	GTestSuite: Fix builds testing against MKL Correction to CMakeLists.txt to fix problem building executables when testing against MKL. AMD-Internal: [CPUPL-5928] Change-Id: Ie427fff0afb48be6ce6d940b1db2c9d1c7a40e5b	2024-10-29 06:32:27 -04:00
Edward Smyth	cffb501e00	GTestSuite: ILP64 build fix Cast literal 0 to match integer size in std::max tests. AMD-Internal: [CPUPL-4500] Change-Id: I330aafd8669884c5e1900b95742b5d1e4ce8ddfa	2024-10-29 06:10:49 -04:00
Chandrashekara K R	62da6e6163	Added "NOTICES" file. Change-Id: I3022942f46983a7b80c2b1ca39ce6b013e303768	2024-10-25 14:41:49 +05:30
Meghana Vankadari	b04b8f22c9	Introduced un-reorder API for bf16bf16f32of32 Details: - Added a new API called unreorder that converts a matrix from reordered format to its original format (row-major or col-major). - Currently this API only supports bf16 datatype. - Added corresponding bench and input file to test accuracy of the API. - The new API is only supported for 'B' matrix. - Modified input validation checks in reorder API to account for row vs col storage of matrix and transposes for bf16 datatype. Change-Id: Ifb9c53b7e6da6f607939c164eb016e82514581b7	2024-10-23 07:49:24 -04:00
Deepak Negi	6ddde390aa	Added support for column major B matrix in BF16S4F32F32 reorder API. -Added new pack kernels that packs/reorders B matrix (odd strides) from column-major input format. This also supports the transB scenario if input B matrix is row major. Change-Id: Ia0fe7e5f19ae9eba5c418f4089c7e6df11091853	2024-10-22 03:01:39 -04:00
varshav2	dabfdf484a	Add Scale post-op for F32 API - Implemented the Scale post-op for the F32 API for all kernels - f32_scale = (f32 * scale_factor) + offset - Added the bench inputs Change-Id: Ib0f25f870eafe695d8b2a2c434c8cb3ec4f7db4c	2024-10-21 06:08:31 -04:00
Chandrashekara K R	cfbe202868	Updated version string from 4.2.1 to 5.0.1 Change-Id: I4cbd8d9ae7e35fa235a6707fe7ddbd157eb63b98 (cherry picked from commit `7f1824b8ee`)	2024-10-17 03:07:17 -04:00
Chandrashekara K R	8cf0f92559	Updated AOCL-BLAS 5.0 EULA. Change-Id: Ibc85b2df01f5d118087b2459f890e14a20dd680a	2024-10-09 11:56:05 +05:30
Eleni Vlachopoulou	d6a411d6b6	GTestSuite: Reorganizing some tests - Breaking tests to smaller executables. - Removing some redundant tests. AMD-Internal: [CPUPL-4500] Change-Id: I6288c3fcf5194ccb5de3485ca1ad95a20414208c	2024-10-02 11:48:18 -04:00
Shubham Sharma.	2b02024404	Fixed bug in bli_zdotv_zen4_asm_avx512 kernel. - Data-type of n and conj is dim_t, which will be int32_t for LP64 case. - When loading 64-bit registers using "mov" instructions, mov(rax, var(n)), the "n" should be 64-bit, otherwise incorrect values get loaded. Fix: We typecast these variables to int64_t before loading into registers. Thanks to mangala.v@amd.com for finding this bug. Change-Id: I8542dc1ea434ca9030f3c56d9a681135055f8ba5	2024-09-24 02:33:44 -04:00
Shubham Sharma	1833ee70cd	Fixed bug in bli_dgemm_avx512 8x24 native kernel. - Data-type of m, n, k, ldc is dim_t, which will be int32_t for LP64 case. - When loading 64-bit registers using "mov" instructions, mov(rax, var(m)), the "m" should be 64-bit, otherwise incorrect values get loaded. Fix: We typecast these variables to int64_t before loading into registers. AMD-Internal: [CPUPL-5819] Change-Id: I16043ac168a79ff9358c0c1768989a81e3c6b0e0	2024-09-23 04:54:04 -04:00
Deepak Negi	16653ed208	Added support for column major B matrix in BF16S4F32F32 reorder API. -Added new pack kernels that packs/reorders B matrix from column-major input format. This also supports the transB scenario if input B matrix is row major. Change-Id: I4c75b6e81016331fd7e7f95ad4212e6d38dc586f	2024-09-20 01:11:21 +05:30
Eleni Vlachopoulou	72536e56ba	GTestSuite: Reducing gemm tests. Since there is thorough kernel testing, we reduce the number of "Black Box" test cases so that CI is faster. AMD-Internal: [CPUPL-4500] Change-Id: Ie57eeccff8103c0051eb1904162d6447da0ef102	2024-09-19 12:17:20 -04:00
Edward Smyth	6330ac6a52	GTestSuite: Misc changes - Correct matsize and NumericalComparison functions for tests with first matrix dimension <= 0. - BLAS1: - Fix for BLAS vs CBLAS differences in amaxv IIT_ERS tests. - Threshold adjustments in ddotxf and zaxpy. - Break axpyv and scalv into separate executables for each data type. - BLAS2: - Threshold adjustments in symv and hemv. - Break ger into separate executables for each data type. - UKR: - Break gemm and trsm ukr test into separate executables for each data type. - Threshold adjustments in daxpyf - Disable {z,c}trsm ukr tests when BLIS_INT_ELEMENT_TYPE is used, as matrix generator is not currently suitable for this. AMD-Internal: [CPUPL-4500] Change-Id: I1d9e7acc11025f1478b8b511c14def5517ef0ae6	2024-09-19 10:17:36 -04:00
Vignesh Balasubramanian	4da1ad2cd9	Added CBLAS wrappers for complex precision ?ROT and ?ROTG APIs - Added the appropriate CBLAS wrappers for CROTG, CSROT, ZROTG and ZDROT APIs. These would internally call their ?_blis_impl() layer. AMD-Internal: [CPUPL-5813] Change-Id: I6037f20092f99cc5a5e2794d03bbe76d6a55eb97	2024-09-19 08:49:46 -04:00
varshav2	605517964b	Add Transpose Kernel for A matrix in F32F32f32Of32 - Implemented the AVX512 packA kernel for col major inputs in F32 API - Removed the work arounds for n = 1, mtag_a = PACK case, where the execution was being directed to GEMM instead of GEMV. Change-Id: I6fb700d96069213a762e8a83a209c5388a91050f	2024-09-19 06:37:11 -04:00
Shubham Sharma	b3b56ae3bb	BugFix: Fixed GEMM mixed precision failure in ZEN5 - Optimized DGEMM macro kernel does not support mixed precision. - This kernel was being used for solving some of the mixed precision problems. - Currently only ( bli_obj_elem(A) == 8 ) is used for checking if the problem being solved is mixed precision. - bli_obj_elem(A) will be equal to 8 for both double precision data type and mixed precision case single-complex. - Added extra checks (bli_obj_is_real( a )) to make sure that A and B are real and DGEMM macro kernel is being used only for DDDGEMM. AMD-Internal: [CPUPL-5804] Change-Id: Iaa1accf8d851d11533f8ba31dc0235fbc14f89a9	2024-09-19 04:54:53 -04:00
Arnav Sharma	df0ce5a799	Bugfix: Fix for gemmsup_r Reference Kernels - The existing row-preferred reference kernels for GEMM SUP path were not taking into consideration the packing state of matrices A or B. Thus, whenever either or both A and B matrices were packed the kernel was unable to iterate appropriately through the matrices thereby calculating incorrect values resulting in failures. - Though, for generic configuration, the SUP path is disabled by default the set of Pack and Compute Extension APIs use these kernels thus, this issue resulted in their failures as well. - With this patch, the loops being used in these kernels have been fixed to iterate over steps of MR and NR while also accounting for the fringe cases. Within the updated loops, temporary pointers used to point to the correct block/panel of the matrices are incremented with panel strides of respective matrices. AMD-Internal: [CPUPL-5674] Change-Id: Ic3939877c79ebb9ccf9e53b1d1672cea4b8c5959	2024-09-18 13:52:28 -04:00
Eleni Vlachopoulou	c7a5d04d4d	GTestSuite: Disabling failing tests. Those can be run if --gtest_also_run_disabled_tests is used. Bugs will be addressed and resolved in the future. AMD-Internal: [CPUPL-4500] Change-Id: I7a5443606ea8ef20f18ff8beec14bece5f6ee661	2024-09-18 13:12:35 +01:00
Edward Smyth	54f8fb951e	GTestSuite: BLAS2 test case selection Various changes to BLAS2 test cases: - GEMV: Reduce number of tests to make runtime more reasonable. - TRSV: - Standardize tests across different data types, including adding memory testing for all variants. - Improve scaling when making matrix A diagonally dominant and avoid singular matrix when BLIS_INT_ELEMENT_TYPE is used. - TRMV: Copy TRSV generic tests. - Expand set of tests for HEMV, HER, HER2, SYMV, SYR, SYR2 and make lda contribution to test names consistent with other routines. - Various adjustments to thresholds added. Update gtestsuite documentation to describe using GTEST_FILTER environment variable to select tests to run or exclude. This works particularly well when using ctest, as we do not enumerate all the tests at this level and so need to pass the selection down to the individual executables. AMD-Internal: [CPUPL-4500] Change-Id: Ifcb6410455b7f91e58b555f94b9fd7920d7ad9d9	2024-09-17 09:35:29 -04:00
Edward Smyth	61c6f1ad78	GTestSuite: Fix alpha and beta input argument tests Check if alpha and beta are null before testing values. This avoids possible seg faults if alpha or beta have not been defined in IIT tests. AMD-Internal: [CPUPL-4500] Change-Id: Ibbf2d6a8fb38d9a95033f3fec3d06c3441e98689	2024-09-17 09:00:09 -04:00
Chandrashekara K R	e4eed817aa	Added logic to use right format specifier to read integer value. Updated logic to use "%ld" and "%lld" format specifiers to read 64-bit integer from input files using fscanf function on Linux and Windows respectively when the user set INT_SIZE='auto' on 64-bit machine or INT_SIZE='64'. Otherwise "%d" on both windows and Linux for benchmarking blis and LPGEMM. Change-Id: I4762c4c1b3fcd09cf66d0cc9572d38766be6be60	2024-09-17 04:48:59 -04:00
Edward Smyth	8d4881c4fd	GTestSuite: add option to test blis_impl layer Add BLAS_TEST_IMPL option for TEST_INTERFACE to test the wrapper layer underneath BLAS and CBLAS interfaces. This is particularly useful if building a BLIS library with these interfaces disabled, e.g. ./configure --disable-blas amdzen or cmake . -DENABLE_BLAS=OFF -DBLIS_CONFIG_FAMILY=amdzen The ?_blis_impl wrappers should have the same arguments as the BLAS interfaces, thus we define TEST_BLAS_LIKE as an additional definition for convenience when selecting tests and options in the C++ files. AMD-Internal: [CPUPL-5650] Change-Id: I0275a387563f3efc2b40029950c8569956f2df7b	2024-09-16 09:53:56 -04:00
Edward Smyth	a07e041b1f	SCALV alpha=zero BLAS compliance SCALV is used directly by BLAS, CBLAS and BLIS scal{v} APIs but also within many other APIs to handle special cases. In general it is preferred to use SETV when alpha=0, but BLAS and CBLAS continue to multiply all vector elements by alpha. This has different behaviour for propagating NaNs or Infs. Changes in this commit: - Standardize early returns from SCALV reference and optimized kernels. - User supplied N<0 is handled at the top level API layer. Use negative values of N in kernel calls to signify that SETV should _not_ be used when alpha=0. This should only be required in SCALV. - Include serial threshold in zdscal (as in dscal) to reduce overhead for small problem sizes. - Code tidying to make different variants more consistent. - More standardization of tests in SCALV gtestsuite programs. - Remove scalv_extreme_cases.cpp as it is now redundant. AMD-Internal: [CPUPL-4415] Change-Id: I42e98875ceaea224cc98d0cdfe0133c9abc3edae	2024-09-16 07:10:28 -04:00
Chandrashekara K R	91d4337b8b	Updated format specifier for fscanf to read double values. Updated format specifier to read signed double("%lld") and unsigned double("%llu") from file using fscanf from both windows and Linux. AMD-Internal: [CPUPL-5787] Change-Id: Ibef50b0df708f474e22f703240e264eff1de3994	2024-09-13 14:57:28 +05:30
Vignesh Balasubramanian	ed682a429d	Fixed compiler warnings due to prefetch in AVX2 and AVX512 kernels - Added explicit typecast to the pointers that are passed to the _mm_prefetch( ... ) intrinsic, to avoid compiler warnings. AMD-Internal: [CPUPL-4415] Change-Id: I1c1398b7b5abe81848d33cb6df107f7f077588ea	2024-09-12 11:37:06 +05:30
varshav2	7c78b9991f	Bug Fixes in the F32F32 m == 1 transpose scenario - Added the missing stride updates in B reorder case in GEMV. - Added the missing stride updates for the case of transA with B reordered. Change-Id: Ic89781dfa7c0d9380ea523796958f795828a1ade	2024-09-11 02:08:50 -04:00
Mithun Mohan	453c9f0084	Fixes for bfloat16 accumulation rounding errors in bench. For the bf16bf16of32bf16 lpgemm api, inside the micro-kernels in order to convert the accumulated float values to bfloat16 before storing, the _mm512_cvtneps_pbh intrinsic (vcvtneps2bf16) is used. This intrinsic rounds the value based on a rounding bias logic. Replicating the same rounding logic inside the bf16 bench accuracy check function to get proper one to one comparison of output values. AMD Internal: [SWLCSG-2948] Change-Id: I135ac39ac8484769b6c0fe5b3e351dd22d7ca1d8	2024-09-11 01:39:11 -04:00
Meghana Vankadari	5120f98e12	Developed all WoQ kernels for bf16s4f32o<f32|bf16> Description: 1. Wrote 6x64 main and other fringe kernels for WoQ where scaling of s4 weights into bf16 is performed in the kernel itself to reduce bandwidth. 2. These kernels perform better than bf16 weights when m is small and n is large. 3. Established a threshold to do quantization support at packing of B (KCXNC) level or WoQ kernel level. Change-Id: I4f8265b8b58c276ff2590cc948d1f920aa0bb289	2024-09-10 12:00:10 +00:00
varshav2	298a165718	Add TransA and TransB support for F32F32F32oF32 - Added support for TransA and transB in f32f32of32 APIs - Modified the GEMV case(m == 1) to support PACKB feature - Redirecting the operations to GEMM instead of GEMV in case of n == 1 conditions, with storage scheme r/transA and c/transB to avoid the packing errors which would lead to failures in computation. Change-Id: I0eb8c31485af4e33c53fd36b5e5788d75d3a67a9	2024-09-09 05:19:49 +05:30
Chandrashekara K R	5ada963b4c	Clang version requirement is updated in cmake Description: Because the latest VNNI instructions are supported only from Clang version 18 onward, updated the clang version check from 17 to 18. AMD-Internal: [CPUPL-5744] Change-Id: I4a3ecec65bd88d9dccfe1018fb25cb7be29946f0	2024-09-09 06:42:36 -04:00
Mithun Mohan	cf123aa926	Disabling smart threading for small input dimensions. -It has been observed that reduction of threads as part of smart threading for smaller input dimensions hampers the performance of the other inputs with larger dimensions due to lower operating frequency of the newly launched threads (apart from the existing ones). Disabling smart threading for these bandwidth bound input patterns (small m and n) fixes this issue. -Bug fixes related to work split in LPGEMV for n < NR and m < MR cases. AMD Internal: [SWLCSG-2948] Change-Id: I0117dc0ea6820a9fac8e14f93374b54a7d80c121	2024-09-06 09:20:42 -04:00
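
Commit fbb72d047f in the list above describes group quantization with zero-points: the scale factor and zero-point used to dequantize a weight vary per group rather than per channel or per tensor. The following is a minimal reference sketch of that idea in plain Python, not the AOCL kernel code. It assumes groups are formed along the K (row) dimension of B with one scale and one zero-point per group and column; the function name `dequantize_grouped` and the argument layout are hypothetical and chosen only for illustration.

```python
def dequantize_grouped(w_q, scales, zero_points, group_size):
    """Dequantize a K x N matrix of small signed integers (e.g. int4 values).

    Assumed layout (illustrative, not taken from the AOCL sources):
      w_q[k][n]          quantized weight
      scales[g][n]       scale factor for group g, column n
      zero_points[g][n]  int8 zero-point for group g, column n
    where g = k // group_size, i.e. groups run along K.
    """
    out = []
    for k, row in enumerate(w_q):
        g = k // group_size  # every group_size rows share one scale/zp row
        out.append([
            (q - zp) * s
            for q, zp, s in zip(row, zero_points[g], scales[g])
        ])
    return out


# Usage: K = 4, N = 2, two groups of two rows each.
w_q = [[1, -2], [3, 4], [-8, 7], [0, 5]]
scales = [[0.5, 2.0], [0.25, 1.0]]
zero_points = [[1, 0], [-1, 2]]
w = dequantize_grouped(w_q, scales, zero_points, group_size=2)
```

With a group size equal to K this reduces to per-channel scaling, which is the case the commit message contrasts against.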

3556 Commits
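
Commit 453c9f0084 mentions replicating, in the bench accuracy check, the rounding-bias logic used when accumulated float values are converted to bfloat16. A common software emulation of round-to-nearest-even for that conversion is sketched below in Python. It is an assumption that the bench code follows this exact form, and the helper names `f32_to_bf16_bits` and `bf16_bits_to_f32` are hypothetical. The sketch does not model denormal handling.

```python
import struct


def f32_to_bf16_bits(x: float) -> int:
    """Convert a float32 value to bfloat16 bits with round-to-nearest-even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # NaN: truncate and force a quiet-NaN mantissa bit so it stays a NaN.
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0:
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Rounding bias: 0x7FFF plus the lowest bit that will be kept.
    # Ties (low half == 0x8000) then round toward an even result.
    bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + bias) >> 16) & 0xFFFF


def bf16_bits_to_f32(b: int) -> float:
    """Widen bfloat16 bits back to float32 by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


# Usage: two exact ties that round in opposite directions.
tie_down = bf16_bits_to_f32(f32_to_bf16_bits(1.00390625))  # 0x3F808000
tie_up = bf16_bits_to_f32(f32_to_bf16_bits(1.01171875))    # 0x3F818000
```

Comparing a reference result rounded this way against the kernel output avoids spurious mismatches that plain truncation of the low 16 bits would produce.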