amd/blis - blis - Public git mirror

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-06-29 10:47:16 +00:00

Author	SHA1	Message	Date
Mithun Mohan	4cfbb47b87	Initialize block sizes for F32 element wise post-op APIs. -The block sizes and micro kernel dimensions for the F32OF32 group of APIs are updated in the element wise operations cntx map. AMD-Internal: [SWLCSG-3390] Change-Id: Ic5690b7eb4f7b2559d893f374dd811b00e31e329	2025-02-11 06:47:24 -05:00
varshav	f4e3a4b1c3	AVX2 Support for BF16 Kernels - Bug fixes - Added early return checks for A/B transpose cases and Column major support, as it is not currently supported. - Enabled the JIT kernels for the Zen4 architecture. AMD Internal: [SWLCSG - 3281] Change-Id: Ie671676c51c739dd18709892414fd34d26a540df	2025-02-11 12:40:43 +05:30
Nallani Bhaskar	0acb5eb9a4	Implemented reference unreorder bf16 function Description: Implemented a C reference for the aocl_gemm_unreorder_bf16bf16f32of32 function. The implementation works for row major; column major is yet to be enabled. AMD-Internal: [ SWLCSG-3279 ] Change-Id: Ibcce4180bb897a40252140012d8d6886c38cb77a	2025-02-11 02:04:42 +00:00
varshav2	ef04388a44	Added AVX2 support for BF16 kernels: Row major - Currently the BF16 kernels use the AVX512 VNNI instructions. In order to support AVX2 kernels, the BF16 input has to be converted to F32 and then the F32 kernels have to be executed. - Added un-pack function for the B-Matrix, which does the unpacking of the Re-ordered BF16 B-Matrix and converts it to Float. - Added a kernel to convert the matrix data from BF16 to F32 for the given input. - Added a new path to the BF16 5LOOP to work with the BF16 data, where the packed/unpacked A matrix is converted from BF16 to F32. The packed B matrix is converted from BF16 to F32 and the re-ordered B matrix is unre-ordered and converted to F32 before feeding to the F32 micro kernels. - Removed AVX512 condition checks in BF16 code path. - Added the Re-order reference code path to support BF16 AVX2. - Currently the F32 AVX-2 kernels support only F32 BIAS. Added BF16 support for BIAS post-op in F32 AVX2 kernels. - Bug fix in the test input generation script. AMD Internal : [SWLCSG - 3281] Change-Id: I1f9d59bfae4d874bf9fdab9bcfec5da91eadb0fb	2025-02-10 08:18:52 -05:00
Meghana Vankadari	da3d0c6034	Added new Int8 batch_gemm APIs Details: - Added u8s8s32of32\|bf16\|u8 batch_gemm APIs. - Fixed some bugs in bench file for bf16 API. Change-Id: I55380238869350a848f2deec0641d7b9b416b192	2025-02-10 11:19:02 +00:00
Deepak Negi	3a7523b51b	Element wise post-op APIs are upgraded with new post-ops Description: 1. Added new output types for f32 element wise APIs to support s8, u8, s32, bf16 outputs. 2. Updated the base f32 API to support all the post-ops supported in gemm APIs AMD Internal: [SWLCSG-3384] Change-Id: I1a7caac76876ddc5a121840b4e585ded37ca81e8	2025-02-10 01:06:39 -05:00
Edward Smyth	0bae96d7ac	BLIS: Missing clobbers (batch 8) - Add missing xmm, ymm and k registers to clobber lists in bli_dgemmsup_rv_zen4_asm_24x8m.c - Add missing ymm1 in bli_dgemmsup_rv_zen4_asm_24x8m.c bli_gemmsup_rv_haswell_asm_d6x8m.c and bli_gemmsup_rd_zen_s6x64.c - Also change formatting in bli_copyv_zen4_asm_avx512.c bli_dgemm_avx512_asm_8x24.c and bli_zero_zmm.c to make automatic processing of clobber lists easier. AMD-Internal: [CPUPL-5895] Change-Id: If05a3f00e6c0f9033eeced5de165ba4c3128b3e5	2025-02-07 10:39:24 -05:00
Edward Smyth	eee3fe1b54	GTestSuite: nested parallelism tests Optionally enable parallelism inside gtestsuite so that we can check BLIS functions perform correctly when nested parallelism is in operation. Enable with: cmake ... -DOPENMP_NESTED={0,1,2,1diff} where in gtestsuite - 0 is the default choice with no parallelism. - 1 and 2 are simple nested parallelism. - 1diff has one level of parallelism setting different numbers of threads to be used by BLIS and reference library calls from each gtestsuite thread. Note: OMP_NUM_THREADS must be set appropriately to enable or disable parallelism at each level in the test programs as desired. OMP_NUM_THREADS will also define the parallelism used within the BLIS library (if it is multithreaded), unless BLIS-specific ways of specifying parallelism have been used. If a BLIS-specific parallelism option has been set, the same mechanism will be used in the 1diff option to vary the number of threads in BLIS per application thread. AMD-Internal: [CPUPL-3902] Change-Id: I89f9edb4125c64ef03e025a9f6ccb84960ba8771	2025-02-07 08:49:25 -05:00
Mithun Mohan	bffa92ec93	Deprecate S16 LPGEMM APIs. -The following S16 APIs are removed: 1. aocl_gemm_u8s8s16os16 2. aocl_gemm_u8s8s16os8 3. aocl_gemm_u8s8s16ou8 4. aocl_gemm_s8s8s16os16 5. aocl_gemm_s8s8s16os8 along with the associated reorder APIs and corresponding framework elements. AMD-Internal: [CPUPL-6412] Change-Id: I251f8b02a4cba5110615ddeb977d86f5c949363b	2025-02-07 11:43:28 +00:00
Edward Smyth	1f0fb05277	Code cleanup: Copyright notices (2) More changes to standardize copyright formatting and correct years for some files modified in recent commits. AMD-Internal: [CPUPL-5895] Change-Id: Ie95d599710c1e0605f14bbf71467ca5f5352af12	2025-02-07 05:41:44 -05:00
Edward Smyth	c74faac80f	Fix compiler warning messages Various occurrences of the following compiler warnings have been fixed: * Type mismatch * Misleading code indentation * Array bounds violation warning in blastest when using gcc 11 without -fPIC flag AMD-Internal: [CPUPL-5895] Change-Id: Ia5d5310b76a66e87ad3953a72e8472ed5b01e588	2025-02-07 05:03:49 -05:00
Mithun Mohan	b9f6286731	Tiny GEMM path for BF16 LPGEMM API. -Currently the BF16 API uses the 5 loop algorithm inside the OMP loop to compute the results, irrespective of the input sizes. However it was observed that for very tiny sizes (n <= 128, m <= 36), this OMP loop and NC,MC,KC loops were turning out to be overheads. -In order to address this, a new path without OMP loop and just the NR loop over the micro-kernel is introduced for tiny inputs. This is only applied when the num threads set for GEMM is 1. -Only row major inputs are allowed to proceed with tiny GEMM. AMD-Internal: [SWLCSG-3380, SWLCSG-3258] Change-Id: I9dfa6b130f3c597ca7fcf5f1bc1231faf39de031	2025-02-07 04:37:11 -05:00
Meghana Vankadari	c47f0f499f	Fixed bug in testing matrix_mul post_op Details: - Added a new python script that can test all microkernels along with post-ops. - Modified post_op freeing function to avoid memory leaks. Change-Id: Iedba84e8233a88ca9261596c4c7e0a65c196b7e7	2025-02-07 02:27:14 +05:30
Deepak Negi	86e52783e4	Tiny GEMM path for F32 LPGEMM API. -Currently the F32 API uses the 5 loop algorithm inside the OMP loop to compute the results, irrespective of the input sizes. However it was observed that for very tiny sizes (n <= 128, m <= 36), this OMP loop and NC,MC,KC loops were turning out to be overheads. -In order to address this, a new path without OMP loop and just the NR loop over the micro-kernel is introduced for tiny inputs. This is only applied when the num threads set for GEMM is 1. AMD-Internal: [SWLCSG-3380] Change-Id: Ia712a0df19206b57efe4c97e9764d4b37ad7e275	2025-02-06 23:36:44 -05:00
Vignesh Balasubramanian	8abb37a0ad	Update to AOCL-BLAS bench application for logging outputs - Updated the format specifiers to have a leading space, in order to delimit the outputs appropriately in the output file. - Further updated every source file to have a leading space in its format string occurring after the macros. AMD-Internal: [CPUPL-5895] Change-Id: If856f55363bb811de0be6fdd1d7bbc8ec5c76c15	2025-02-06 22:59:59 +05:30
Arnav Sharma	5a4739d288	DGEMV NO_TRANSPOSE Optimizations and Unit Tests - Added 32x3n n-biased kernels to directly handle the cases where n=3 which were earlier being handled by the primary n-biased, 32x8n, kernel. - Modified the n-biased fringe kernels to further handle the smaller m-fringe cases. Thus, now the kernels handle the following range of m for any value of n: - 16x8n : m = [16, 31) - 8x8n : m = [8, 15) - m_leftx8n : m = [1, 7] - Updated the function pointer map for n-biased kernels with added granularity to invoke the smaller fringe cases directly on the basis of m-dimension. - Added micro-kernel unit tests for all the dgemv_n kernels. AMD-Internal: [CPUPL-6231] Change-Id: Ibe88848c2c1bbb65b3e79fbc90a2800dc15f5119	2025-02-06 18:52:32 +05:30
Shubham Sharma	f8c83fedb6	Added new ZTRSM small code path for ZEN5 - Added new ZTRSM kernels for right and left variants. - Kernel dimensions are 12x4. - 12x4 ZGEMM SUP kernels are used internally for solving GEMM subproblem. - These kernels do not support conjugate transpose. - Only column major inputs are supported. - Tuned thresholds to pick efficient code path for ZEN5. AMD-Internal: [CPUPL-6356] Change-Id: I33ba3d337b0fcd972ca9cfe4668cb23d2b279b6e	2025-02-06 18:01:10 +05:30
Deepak Negi	2e687d8847	Updated all post-ops in s8s8s32 API to operate in float precision Description: 1. Changed all post-ops in s8s8s32o<s32\|s8\|u8\|f32\|bf16> to operate on float data. All the post-ops are updated to operate on f32 by converting s32 accumulator registers to float at the end of k loop. Changed all post-ops to operate on float data. 2. Added s8s8s32ou8 API which uses s8s8s32os32 kernels but store the output in u8 AMD-Internal - SWLCSG-3366 Change-Id: Iadfd9bfb98fc3bf21e675acb95553fe967b806a6	2025-02-06 07:31:28 -05:00
jagar	2ece628a4d	Build System: Remove absolute path of files appearing in the library Updated Makefile and main CMakelists.txt to replace absolute path within library object files with whatever path specified. In files where __FILE__ macro is used, the absolute path was appearing in the library. Now this will be replaced with relative paths. AMD-Internal: [CPUPL-5910] Change-Id: Iac63645348a7f8214123fdcd3675670eedd887e3	2025-02-06 06:15:56 -05:00
Meghana Vankadari	13e7ada3f2	Modified bench to test different types of post-ops - Modified bench to support testing of different types of buffers for bias, mat_add and mat_mul postops. - Added support for testing integer APIs with float accumulation type. Change-Id: I72364e9ad25e6148042b93ec6d152ff82ea03e96	2025-02-06 02:38:08 +05:30
Mithun Mohan	0701a4388a	Thread factorization improvements (ic ways) for BF16 LPGEMM API. -Currently when m is small compared to n, even if MR blks (m / MR) > 1, and total work blocks (MR blks * NR blks) < available threads, the number of threads assigned for m dimension (ic ways) is 1. This results in sub par performance in bandwidth bound cases. To address this, the thread factorization is updated to increase ic ways for these cases. AMD-Internal: [SWLCSG-3333] Change-Id: Ife3eafc282a2b62eb212af615edb7afa40d09ae9	2025-02-06 00:51:10 -05:00
AngryLoki	ea93d2e2c9	Fix SyntaxWarning messages from python 3.12 (#809) Details: - When using regexes in Python, certain characters need backslash escaping, e.g.: ```python regex = re.compile( '^[\s]#include (["<])([\w\.\-/])([">])' ) ``` However, technically escape sequences like `\s` are not valid and should actually be double-escaped: `\\s`. Python 3.12 now warns about such escape sequences, and in a later version these warnings will be promoted to errors. See also: https://docs.python.org/dev/whatsnew/3.12.html#other-language-changes. The fix here is to use Python's "raw strings" to avoid double-escaping. This issue can be checked for all files in the current directory with the command `python -m compileall -d . -f -q .` - Thanks to @AngryLoki for the fix. AMD-Internal: [CPUPL-5895] Change-Id: I7ab564beef1d1b81e62d985c5cb30ab6b9a937f2 (cherry picked from commit `729c57c15a`)	2025-02-05 07:13:42 -05:00
Hari Govind S	b6e9cde317	Revert "Optimisation of AOCL-dynamic for dotv API" This reverts commit `a028108cbb`. Reason for revert: With libgomp, scalability issues were observed with a higher number of threads, leading to the use of fewer threads. However, with different OpenMP libraries like libomp, this scalability issue was not observed, and using fewer threads resulted in performance loss. The AOCL dynamic logic has been updated to select a higher number of threads, considering the iomp OpenMP library. Change-Id: I2432b715eff01fc99b2c0f8b60bdecfaf5a6568f	2025-02-05 06:33:06 -05:00
jagar	8d0bf148ee	Added separate PC for mt blis library Added separate package configuration file for st and mt library in blis Makefile and CMakeLists.txt Change-Id: I8d851fac10d63983358e1f4c67fd9451246056bf	2025-02-05 05:10:11 -05:00
Hari Govind S	fe73445813	Introduced fast-path in DCOPYV API and fix compiler warning for AXPYV - Added a conditional check to invoke the vectorized DCOPYV kernels directly(fast-path), without incurring any additional framework overhead. - The fast-path is taken when the input size is ideal for single-threaded execution. Thus, we avoid the call to bli_nthreads_l1() function to set the ideal number of threads. - Used macros to protect the declaration of fast_path_thresh in DAXPYV API to avoid compiler warnings. AMD-Internal: [CPUPL-4875][CPUPL-5895] Change-Id: Id4141cd22e2382ece9e36fc02934bf6c11bd02cb	2025-02-05 04:41:55 -05:00
Shubham Sharma	bac0fed3cf	Fixed Bug in Dynamic Blocksizes - Mixed precision datatypes use a modified cntx. - For some variants of mixed precision, complex and real blocksizes are needed to be same. This is achieved by creating a local copy of cntx and copying complex blocksizes onto real blocksizes. - By using the dynamic blocksizes, the changes made to the blocksizes for mixed precision are overwritten by changes made by dynamic blocksizes. - This mismatch between complex and real blocksizes is causing an issue where the pack buffer is allocated based on complex blocksizes but amount of data packed is based on real blocksizes. - This makes the pack buffer sizes smaller than the required sizes. - To fix this, dynamic blocksizes are disabled for mixed precision. AMD-Internal: [CPUPL-6384] Change-Id: Ib9792f90b4ea113e54059a0da8fb4241622b5f83	2025-02-05 01:09:24 -05:00
Hari Govind S	3d2653f1ab	DDOTV Optimization for ZEN3 Architecture - Reduced the blocking size of 'bli_ddotv_zen_int10' kernel from 40 elements to 20 elements for better utilization of vector registers - Replaced redundant 'for' loops in 'bli_ddotv_zen_int10' kernel with 'if' conditions to handle remainder iterations, as only a single iteration is used when the remainder is less than the primary unroll factor. - Added a conditional check to invoke the vectorized DDOTV kernels directly(fast-path), without incurring any additional framework overhead. - The fast-path is taken when the input size is ideal for single-threaded execution. Thus, we avoid the call to bli_nthreads_l1() function to set the ideal number of threads. - Updated gtestsuite ukr tests for 'bli_ddotv_zen_int10' kernel. AMD-Internal: [CPUPL-4877] Change-Id: If43f0fcff1c5b1563ad233005717398b5b6fb8f2	2025-02-04 06:01:04 -05:00
Edward Smyth	bec9406996	Export some BLIS internal symbols (3) libFLAME calls DAMAX kernel directly. Now that AVX512 version has been enabled in BLIS cntx, export this symbol. AMD-Internal: [CPUPL-5895] Change-Id: I4c74150578f49eb643b0f68c6cc32ee2bb23bec2	2025-02-03 06:30:14 -05:00
Shubham Sharma	7695561f4e	Tuned DGEMM blocksizes for ZEN5 - In the existing code, blocksizes for sizes where M >> K, N >> K and K < 500 were not tuned properly for cases when application would use more than one instance of blis in parallel. - Improved DGEMM performance for sizes where M, N >> k by retuning blocksizes. Such sizes are used by applications like HPL. AMD-Internal: [SWLCSG-3338] Change-Id: Iec17ecc53a6fabf50eedacaf208e4e74a4e21418	2025-02-03 05:40:07 -05:00
Shubham Sharma	50306c4854	Tuned ZEN4 DGEMM blocksizes - Blocksizes for sizes where M >> K, N >> K and K < 500 were tuned by running blis bench on only one MPI rank. Blocksizes tuned this way are not performing well for all configurations. - Retuned the blocksizes so that performance is good for such skinny sizes. AMD-Internal: [CPUPL-6362] Change-Id: I89c61889df2443ef6bf0e87bf89263768b5c00c1	2025-02-03 03:05:50 -05:00
Hari Govind S	67322416d3	Added support to benchmark ASUMV APIs - Implemented the feature to benchmark ?ASUMV APIs for the supported datatypes. The feature allows to benchmark BLAS, CBLAS or the native BLIS API, based on the macro definition. - Added a sample input file to provide examples to benchmark ASUMV for all its datatype supports. AMD-Internal: [CPUPL-5984] Change-Id: Iff512166545687d12504babda1bd52d71a3a5755 AOCL-Feb2025-b1	2025-01-31 06:04:16 -05:00
Vignesh Balasubramanian	0e71d28c01	Additional bug-fix for AOCL-BLAS bench - Corrected the format specifier setting(as macro) to not include additional spaces, since this would cause incorrect parsing of input files(in case they have exactly the expected number of parameters and not more). - Updated the inputgemm.txt file to contain some inputs that have the exact parameters, to validate this fix. AMD-Internal: [CPUPL-6365] Change-Id: Ie9a83d4ed7e750ff1380d00c9c182b0c9ed42c49	2025-01-30 08:28:14 -05:00
Deepak Negi	4ade159800	Buffer scale support for matrix add and matrix mul post-ops in f32 API. Description: 1. Support has been added to scale buffer values using both scalar and vector scale factors before matrix add or matrix mul post-ops. AMD-Internal: CPUPL-6340 Change-Id: Ie023d5963689897509ef3d5784c3592791e57125	2025-01-30 07:09:18 -05:00
Shubham Sharma	26bd265cfd	Optimized DTRSV for tiny sizes - Replaced switch case with if else, lookup table for switch case is placed at the end of .text section which causes a huge jump. - Reduced number of branches for tiny sizes. - Cpuid query is slow, therefore added a new if statement which avoids cpuid query for tiny sizes(<200). - Redirected tiny sizes to AVX2 kernel. AMD-Internal: [CPUPL-5407] Change-Id: I8e73777b2f00c9dcff9775ddfcb7ca3f74fa901c	2025-01-30 01:23:09 -05:00
harsh dave	c5e842e8d3	Revert changes to force dgemm inputs under threshold to arch-specific kernels in single-threaded mode - This patch reverts the previous changes that removed the enforcement of dgemm inputs under a certain threshold to be processed by kernels selected based on architecture ID and handled in single-threaded mode. - This change is now forcing such small inputs to be computed in tiny path. Previously when this check was not there, it was routing these inputs to SUP path and causing performance regression due to framework overhead. AMD-Internal: [CPUPL-5927] Change-Id: I4a4b21fdcf7c3ffaa09efa46ba12798eca0f10bb AOCL-LPGEMM-012925	2025-01-29 04:22:16 -05:00
Nallani Bhaskar	805bd10353	Updated all post-ops in u8s8s32 API to operate in float precision Description: 1. Changed all post-ops in u8s8s32o<s32\|s8\|u8\|f32\|bf16> to operate on float data. All the post-ops are updated to operate on f32 by converting s32 accumulator registers to float at the end of k loop. Changed all post-ops to operate on float data. 2. Added u8s8s32ou8 API which uses u8s8s32os32 kernels but store the output in u8 AMD-Internal - SWLCSG-3366 Change-Id: Iab1db696d3c457fb06045cbd15ea496fd4b732a5	2025-01-29 04:21:17 -05:00
Vignesh Balasubramanian	445327f255	Bugfix for AOCL-BLAS bench application - Bug : When configuring our library with the native BLIS integer size being 32, the bench application would crash or read an invalid value when parsing the input file. This is because of a mismatch of format specifier, that we hardset in the Makefile. - Fix : Defined a header that sets the format specifiers as macros with the right matching, based on how we configure and build the library. It is expected to include this header in every source file for benchmarking. AMD-Internal: [CPUPL-5895] Change-Id: I9718c36a1a9fe3eba4d5da419823c16097902d89	2025-01-29 03:25:57 -05:00
Shubham Sharma	9fd2aebd25	Tuned DTRSM thresholds for ZEN5 - Tuned AOCL_dynamic thresholds for DTRSM for ZEN5. - Tuned thresholds for better selection of code path. AMD-Internal: [CPUPL-5408] Change-Id: Ic40b5c8d276c8ce8399fd49ce0d0569f79ec98be	2025-01-28 07:33:10 -05:00
Edward Smyth	d9cabce0ba	GTestSuite: Catch BLIS version executable errors In case the executable to obtain the BLIS library version fails, catch and report common errors to help with debugging. Also correct the test for bli_info_get_info() support to mark that it is not available in any AOCL version <= 4.1 AMD-Internal: [CPUPL-4500] Change-Id: Ie8f728b49faa60e0469562dbf77d67f86b415cd8	2025-01-28 16:54:05 +05:30
Vignesh Balasubramanian	327142395b	Cleanup for readability and uniformity of Tiny-ZGEMM - Guarded the inclusion of thresholds(configuration headers) using macros, to maintain uniformity in the design principles. - Updated the threshold macro names for every micro-architecture. AMD-Internal: [CPUPL-5895] Change-Id: I9fd193371c41469d9ef38c37f9c055c21457b56c	2025-01-27 15:48:31 +05:30
Deepak Negi	db407fd202	Added F32 bias type support, F32, BF16 output type support in int8 APIs Description: 1. Added u8s8s32of32,u8s8s32obf16, s8s8s32of32 and s8s8s32obf16 APIs. Where the inputs are uint8/int8 and the processing is done using VNNI but the output is stored in f32 and bf16 formats. All the int8 kernels are reused and updated with the new output data types. 2. Added F32 data type support in bias. 3. Updated the bench and bench input file to support validation. AMD-Internal: SWLCSG-3335 Change-Id: Ibe2474b4b8188763a3bdb005a0084787c42a93dd	2025-01-26 11:38:30 -05:00
Vignesh Balasubramanian	fb6dcc4edb	Support for Tiny-GEMM interface(ZGEMM) - As part of AOCL-BLAS, there exists a set of vectorized SUP kernels for GEMM, that are performant when invoked in a bare-metal fashion. - Designed a macro-based interface for handling tiny sizes in GEMM, that would utilize these kernels. This is currently instantiated for 'Z' datatype(double-precision complex). - Design breakdown : - Tiny path requires the usage of AVX2 and/or AVX512 SUP kernels, based on the micro-architecture. The decision logic for invoking tiny-path is specific to the micro-architecture. These thresholds are defined in their respective configuration directories(header files). - List of AVX2/AVX512 SUP kernels(lookup table), and their lookup functions are defined in the base-architecture from which the support starts. Since we need to support backward compatibility when defining the lookup table/functions, they are present in the kernels folder(base-architecture). - Defined a new type to be used to create the lookup table and its entries. This type holds the kernel pointer, blocking dimensions and the storage preference. - This design would only require the appropriate thresholds and the associated lookup table to be defined for the other datatypes and micro-architecture support. Thus, it is extensible. - NOTE : The SUP kernels that are listed for Tiny GEMM are m-var kernels. Thus, the blocking in framework is done accordingly. In case of adding the support for n-var, the variant information could be encoded in the object definition. - Added test-cases to validate the interface for functionality(API level tests). Also added exception value tests, which have been disabled due to the SUP kernel optimizations. AMD-Internal: [CPUPL-6040][CPUPL-6018][CPUPL-5319][CPUPL-3799] Change-Id: I84f734f8e683c90efa63f2fa79d2c03484e07956	2025-01-24 12:59:26 -05:00
Hari Govind S	c149b5a98b	Optimisation of DCOPY kernel in zen3 - Using 'if' condition instead of 'for' loop to handle fringe cases. 'for' loop is redundant for handling remainder iterations as only a single iteration is used when the remainder is less than the primary unroll factor. AMD-Internal: [CPUPL-5594] Change-Id: I8cebc037742ee47961869e22e2471e550fcd99e9	2025-01-24 05:55:59 -05:00
Hari Govind S	106a2b1fe1	Gtestsuite: UKR test for GEMV kernels - Added support for gemv kernels unit test in gtestsuite. - Added micro-kernel tests and memory tests for DGEMV transpose case kernels. AMD-Internal: [CPUPL-5835] Change-Id: I7d2d3cdbfea436f6c9b2cce9f2e85bfc5c51f201	2025-01-24 05:09:33 -05:00
Hari Govind S	349fc47ec5	DGEMV Optimizations for TRANSPOSE Cases - Developed new AVX512 DGEMV kernels for Zen4/5 architectures and AVX2 kernels for Zen1/2/3 architectures. These kernels are written from the ground up and are independent of fused kernels. - The DGEMV primary kernel processes the calculation in chunks of 8 columns. Fringe columns (sizes 1 to 7) are handled by fringe kernels, which are invoked by the primary kernel as needed. - Implemented the kernels by computing the dot product of matrix A columns with vector x in chunks of 32 elements, storing the results in accumulator registers. Fringe elements are handled in chunks of 16, 8, etc. The data in the accumulator registers is then reduced and added to vector y. AMD-Internal: [CPUPL-5835] Change-Id: I5cb9eb1330db095931586a7028fd7676fbbecc61	2025-01-24 00:38:34 -05:00
Mithun Mohan	39289858b7	Packed A matrix stride update to account for fringe cases. -When A matrix is packed, it is packed in blocks of MRxKC, to form a whole packed MCxKC block. If the m value is not a multiple of MR, then the m % MR block is packed in a different manner as opposed to the MR blocks. Subsequently the strides of the packed MR block and m % MR blocks are different and the same needs to be updated when calling the GEMV kernels with packed A matrix. -Fixes to address compiler warnings. AMD-Internal: [SWLCSG-3359] Change-Id: I7f47afbc9cd92536cb375431d74d9b8bca7bab44	2025-01-22 05:42:30 -05:00
Arnav Sharma	66461b8df3	Improved Multi-threaded Performance of DSCALV - Added AOCL_DYNAMIC thresholds for DSCALV for Zen4 and Zen5 architectures, since earlier they were using the Zen thresholds. - Also updated ST_THRESH for Zen4 and Zen5 to avoid the OpenMP overheads incurred when the single-threaded path is optimally performant. AMD-Internal: [CPUPL-5934] Change-Id: I2d89cf5392516206fab83b672498fb8d98a5b033	2025-01-22 03:55:38 -05:00
Meghana Vankadari	69ca5dbcd6	Fixed compilation errors for gcc versions < 11.2 Details: - Disabled intrinsics code of f32obf16 pack function for gcc < 11.2 as the instructions used in kernels are not supported by the compiler versions. - Added early-return check for WOQ APIs when compiling with gcc < 11.2 - Fixed code to check whether JIT kernels are generated inside batch_gemm API for bf16 datatype. AMD Internal: [CPUPL-6327] Change-Id: I0a017c67eb9d9d22a14e095e435dc397e265fb0a	2025-01-21 07:13:31 -05:00
Edward Smyth	3c2dedb13c	Restore or add update from env in bli_thread_get APIs We want bli_thread_get_num_threads() and bli_thread_get__nt() to report the threading values modified to reflect what will be in effect given OpenMP nesting and active levels. This was lost in commit `0c6d006225` for bli_thread_get_num_threads() and wasn't previously implemented in bli_thread_get__nt() AMD-Internal: [CPUPL-6168] Change-Id: Ife2d281546d2f79fc17cd712e574f29b06c30ccd	2025-01-20 08:58:22 -05:00
Deepak Negi	994098dd35	Fix for a bias data type in u8s8s32/s8s8s32 gemv APIs. Description: Added _mm512_cvtps_epi32 for bf16 to s32 conversion in gemv APIs. AMD-Internal: SWLCSG-3302 Change-Id: I7e3e6da8f50d1f7177629cb68ac21e3bbce40bee AOCL-Jan2025-b2	2025-01-16 01:43:38 -05:00
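
Note on commit ea93d2e2c9 (raw strings for regexes): the following is a minimal sketch of the technique that commit message describes, not the code from the commit itself. The pattern is an illustrative include-matching regex with quantifiers assumed, and the names `raw_pattern`, `escaped_pattern` and `include_re` are hypothetical.

```python
import re

# Raw string: backslashes are passed to the regex engine untouched,
# so Python 3.12 has no invalid escape sequence to warn about.
raw_pattern = r'^[\s]*#include (["<])([\w\.\-/]*)([">])'

# Equivalent non-raw spelling: every backslash must be doubled.
escaped_pattern = '^[\\s]*#include (["<])([\\w\\.\\-/]*)([">])'

# Both spellings produce the same pattern text.
assert raw_pattern == escaped_pattern

include_re = re.compile(raw_pattern)
match = include_re.match('#include "blis.h"')
# Group 2 captures the header name between the delimiters.
header = match.group(2) if match else None
print(header)
```

The raw-string form is usually easier to review because the pattern reads exactly as the regex engine sees it.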


3645 Commits