amd/blis - blis - Public git mirror

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-06-29 10:47:16 +00:00

Author	SHA1	Message	Date
Mithun Mohan	4cfbb47b87	Initialize block sizes for F32 element wise post-op APIs. -The block sizes and micro kernel dimensions for the F32OF32 group of APIs are updated in the element wise operations cntx map. AMD-Internal: [SWLCSG-3390] Change-Id: Ic5690b7eb4f7b2559d893f374dd811b00e31e329	2025-02-11 06:47:24 -05:00
varshav	f4e3a4b1c3	AVX2 Support for BF16 Kernels - Bug fixes - Added early return checks for A/B transpose cases and Column major support, as it is not currently supported. - Enabled the JIT kernels for the Zen4 architecture. AMD Internal: [SWLCSG - 3281] Change-Id: Ie671676c51c739dd18709892414fd34d26a540df	2025-02-11 12:40:43 +05:30
Nallani Bhaskar	0acb5eb9a4	Implemented reference unreorder bf16 function Description: Implemented a C reference for the aocl_gemm_unreorder_bf16bf16f32of32 function. The implementation works for row major; column major is yet to be enabled. AMD-Internal: [ SWLCSG-3279 ] Change-Id: Ibcce4180bb897a40252140012d8d6886c38cb77a	2025-02-11 02:04:42 +00:00
varshav2	ef04388a44	Added AVX2 support for BF16 kernels: Row major - Currently the BF16 kernels use the AVX512 VNNI instructions. In order to support AVX2 kernels, the BF16 input has to be converted to F32 and then the F32 kernels have to be executed. - Added un-pack function for the B-Matrix, which does the unpacking of the Re-ordered BF16 B-Matrix and converts it to Float. - Added a kernel to convert the matrix data from BF16 to F32 for the given input. - Added a new path to the BF16 5LOOP to work with the BF16 data, where the packed/unpacked A matrix is converted from BF16 to F32. The packed B matrix is converted from BF16 to F32 and the re-ordered B matrix is un-reordered and converted to F32 before feeding to the F32 micro kernels. - Removed AVX512 condition checks in BF16 code path. - Added the Re-order reference code path to support BF16 AVX2. - Currently the F32 AVX2 kernels support only F32 BIAS. Added BF16 support for BIAS post-op in F32 AVX2 kernels. - Bug fix in the test input generation script. AMD Internal : [SWLCSG - 3281] Change-Id: I1f9d59bfae4d874bf9fdab9bcfec5da91eadb0fb	2025-02-10 08:18:52 -05:00
Meghana Vankadari	da3d0c6034	Added new Int8 batch_gemm APIs Details: - Added u8s8s32of32\|bf16\|u8 batch_gemm APIs. - Fixed some bugs in bench file for bf16 API. Change-Id: I55380238869350a848f2deec0641d7b9b416b192	2025-02-10 11:19:02 +00:00
Deepak Negi	3a7523b51b	Element wise post-op APIs are upgraded with new post-ops Description: 1. Added new output types for f32 element wise API's to support s8, u8, s32 , bf16 outputs. 2. Updated the base f32 API to support all the post-ops supported in gemm API's AMD Internal: [SWLCSG-3384] Change-Id: I1a7caac76876ddc5a121840b4e585ded37ca81e8	2025-02-10 01:06:39 -05:00
Mithun Mohan	bffa92ec93	Deprecate S16 LPGEMM APIs. -The following S16 APIs are removed: 1. aocl_gemm_u8s8s16os16 2. aocl_gemm_u8s8s16os8 3. aocl_gemm_u8s8s16ou8 4. aocl_gemm_s8s8s16os16 5. aocl_gemm_s8s8s16os8 along with the associated reorder APIs and corresponding framework elements. AMD-Internal: [CPUPL-6412] Change-Id: I251f8b02a4cba5110615ddeb977d86f5c949363b	2025-02-07 11:43:28 +00:00
Edward Smyth	1f0fb05277	Code cleanup: Copyright notices (2) More changes to standardize copyright formatting and correct years for some files modified in recent commits. AMD-Internal: [CPUPL-5895] Change-Id: Ie95d599710c1e0605f14bbf71467ca5f5352af12	2025-02-07 05:41:44 -05:00
Edward Smyth	c74faac80f	Fix compiler warning messages Various occurrences of the following compiler warnings have been fixed: * Type mismatch * Misleading code indentation * Array bounds violation warning in blastest when using gcc 11 without -fPIC flag AMD-Internal: [CPUPL-5895] Change-Id: Ia5d5310b76a66e87ad3953a72e8472ed5b01e588	2025-02-07 05:03:49 -05:00
Mithun Mohan	b9f6286731	Tiny GEMM path for BF16 LPGEMM API. -Currently the BF16 API uses the 5 loop algorithm inside the OMP loop to compute the results, irrespective of the input sizes. However it was observed that for very tiny sizes (n <= 128, m <= 36), this OMP loop and NC,MC,KC loops were turning out to be overheads. -In order to address this, a new path without OMP loop and just the NR loop over the micro-kernel is introduced for tiny inputs. This is only applied when the num threads set for GEMM is 1. -Only row major inputs are allowed to proceed with tiny GEMM. AMD-Internal: [SWLCSG-3380, SWLCSG-3258] Change-Id: I9dfa6b130f3c597ca7fcf5f1bc1231faf39de031	2025-02-07 04:37:11 -05:00
Deepak Negi	86e52783e4	Tiny GEMM path for F32 LPGEMM API. -Currently the F32 API uses the 5 loop algorithm inside the OMP loop to compute the results, irrespective of the input sizes. However it was observed that for very tiny sizes (n <= 128, m <= 36), this OMP loop and NC,MC,KC loops were turning out to be overheads. -In order to address this, a new path without OMP loop and just the NR loop over the micro-kernel is introduced for tiny inputs. This is only applied when the num threads set for GEMM is 1. AMD-Internal: [SWLCSG-3380] Change-Id: Ia712a0df19206b57efe4c97e9764d4b37ad7e275	2025-02-06 23:36:44 -05:00
Deepak Negi	2e687d8847	Updated all post-ops in s8s8s32 API to operate in float precision Description: 1. Changed all post-ops in s8s8s32o<s32\|s8\|u8\|f32\|bf16> to operate on float data. All the post-ops are updated to operate on f32 by converting s32 accumulator registers to float at the end of k loop. Changed all post-ops to operate on float data. 2. Added s8s8s32ou8 API which uses s8s8s32os32 kernels but store the output in u8 AMD-Internal - SWLCSG-3366 Change-Id: Iadfd9bfb98fc3bf21e675acb95553fe967b806a6	2025-02-06 07:31:28 -05:00
Mithun Mohan	0701a4388a	Thread factorization improvements (ic ways) for BF16 LPGEMM API. -Currently when m is small compared to n, even if MR blks (m / MR) > 1, and total work blocks (MR blks * NR blks) < available threads, the number of threads assigned for m dimension (ic ways) is 1. This results in sub par performance in bandwidth bound cases. To address this, the thread factorization is updated to increase ic ways for these cases. AMD-Internal: [SWLCSG-3333] Change-Id: Ife3eafc282a2b62eb212af615edb7afa40d09ae9	2025-02-06 00:51:10 -05:00
Nallani Bhaskar	805bd10353	Updated all post-ops in u8s8s32 API to operate in float precision Description: 1. Changed all post-ops in u8s8s32o<s32\|s8\|u8\|f32\|bf16> to operate on float data. All the post-ops are updated to operate on f32 by converting s32 accumulator registers to float at the end of k loop. Changed all post-ops to operate on float data. 2. Added u8s8s32ou8 API which uses u8s8s32os32 kernels but store the output in u8 AMD-Internal - SWLCSG-3366 Change-Id: Iab1db696d3c457fb06045cbd15ea496fd4b732a5	2025-01-29 04:21:17 -05:00
Deepak Negi	db407fd202	Added F32 bias type support, F32, BF16 output type support in int8 APIs Description: 1. Added u8s8s32of32,u8s8s32obf16, s8s8s32of32 and s8s8s32obf16 APIs. Where the inputs are uint8/int8 and the processing is done using VNNI but the output is stored in f32 and bf16 formats. All the int8 kernels are reused and updated with the new output data types. 2. Added F32 data type support in bias. 3. Updated the bench and bench input file to support validation. AMD-Internal: SWLCSG-3335 Change-Id: Ibe2474b4b8188763a3bdb005a0084787c42a93dd	2025-01-26 11:38:30 -05:00
Mithun Mohan	39289858b7	Packed A matrix stride update to account for fringe cases. -When A matrix is packed, it is packed in blocks of MRxKC, to form a whole packed MCxKC block. If the m value is not a multiple of MR, then the m % MR block is packed in a different manner as opposed to the MR blocks. Subsequently the strides of the packed MR block and m % MR blocks are different and the same needs to be updated when calling the GEMV kernels with packed A matrix. -Fixes to address compiler warnings. AMD-Internal: [SWLCSG-3359] Change-Id: I7f47afbc9cd92536cb375431d74d9b8bca7bab44	2025-01-22 05:42:30 -05:00
Meghana Vankadari	69ca5dbcd6	Fixed compilation errors for gcc versions < 11.2 Details: - Disabled intrinsics code of f32obf16 pack function for gcc < 11.2 as the instructions used in kernels are not supported by the compiler versions. - Added early-return check for WOQ APIs when compiling with gcc < 11.2 - Fixed code to check whether JIT kernels are generated inside batch_gemm API for bf16 datatype. AMD Internal: [CPUPL-6327] Change-Id: I0a017c67eb9d9d22a14e095e435dc397e265fb0a	2025-01-21 07:13:31 -05:00
Mithun Mohan	7a25505f5c	Simulation of spread like pattern in worker thread to core binding in LPGEMM. -In multi-threaded cases if a packed/close pattern thread to core binding is used (e.g.: OMP_PROC_BIND=close and OMP_PLACES=core\|threads), LPGEMM (OMP framework) launches threads such that threads with adjacent id's are bound to nearby (even adjacent) cores. Depending on the processor architecture, multiple threads with adjacent id's can be bound to cores sharing the same last level cache. However it was observed that when these threads (with adjacent id's) access the B reorder buffer, the last level cache access was suboptimal. This can be attributed to the per thread reorder buffer block accesses and how it maps to the last level cache. -In these cases, m is small (<= 4 * MR) and n value is such that number of NR blocks (n/NR) is less than available threads nt (like < 0.5 * nt). In such cases, id's of the threads can be modified such that the number of threads with adjacent id's bound to the last level cache can be reduced. This looks similar to the spread pattern used in thread to core binding. This reduces the load on last level cache due to reorder buffer access and improves performance in these cases. A heuristic method is used to detect whether thread to core binding follows close pattern before applying the thread id modifications. AMD-Internal: [SWLCSG-3185] Change-Id: Ie3c87d56e0f7b59161a381f382cf4e2d5d02a591	2025-01-10 06:02:06 -05:00
Meghana Vankadari	852cdc6a9a	Implemented batch_matmul for f32 & int8 datatypes Details: - The batch matmul performs a series of matmuls, processing more than one GEMM problem at once. - Introduced a new parameter called batch_size for the user to indicate number of GEMM problems in a batch/group. - This operation supports processing GEMM problems with different parameters including dims,post-ops,stor-schemes etc., - This operation is optimized for problems where all the GEMMs in a batch are of same size and shape. - For now, the threads are distributed among different GEMM problems equally irrespective of their dimensions which leads to better performance for batches with identical GEMMs but performs sub-optimally for batches with non-identical GEMMs. - Optimizations for batches with non-identical GEMMs is in progress. - Added bench and input files for batch_matmul. - Added logger functionality for batch_matmul APIs. AMD-Internal: [SWLCSG-2944] Change-Id: I83e26c1f30a5dd5a31139f6706ac74be0aa6bd9a	2025-01-10 04:10:53 -05:00
Mithun Mohan	ef4286a97e	Multi-data type buffer and scale support for matrix add\|mul post-ops in s32 API. -As it stands the buffer type in matrix add\|mul post-ops is expected to be the same as that of the output C matrix type. This limitation is now removed and user can specify the buffer type by setting the stor_type attribute in add\|mul post-op struct. As of now int8, int32, bfloat16 and float types are supported for the buffer in s32 micro-kernels. The same support is also added for bf16 micro-kernels, with bfloat16 and float supported for now. -Additionally the values (from buffer) are added/multiplied as is to the output registers while performing the matrix add\|mul post-ops. Support is added for scaling these values before using them in the post-ops. Both scalar and vector scale_factors are supported. -The bias_stor_type attribute is renamed to stor_type in bias post-ops. AMD-Internal: [SWLCSG-3319] Change-Id: I4046ab84481b02c55a71ebb7038e38aec840c0fa	2025-01-10 02:11:12 -05:00
Meghana Vankadari	051c9ac7a2	Bug fixes in F32 and INT8 APIs Details: - Fixed few bugs in downscale post-op for f32 datatype. - Fixed a bug in setting strides of packB buffer in int8 APIs. Change-Id: Idb3019cc4593eace3bd5475dd1463dea32dbe75c	2025-01-09 04:07:26 -05:00
varshav	7b9d29f9b3	Adding post-ops for JIT kernels - Added Downscale, tanh and sigmoid post-op support to the JIT kernels - Mask bf16s4 kernel call while JIT kernels are enabled to avoid compile-time error. - Added the optional support for B-prefetch in the JIT kernels - Resolved the visibility issues in global variable jit_krnels_generated - Modified the array generation for scale and zp values in the bench Change-Id: I09b8afc843f51ac23645e02f210a2c13d3af804d	2025-01-08 12:55:27 +00:00
Mithun Mohan	4a95f44d39	Buffer scale support for matrix add and matrix mul post-ops in bf16 API. -Currently the values (from buffer) are added/multiplied as is to the output registers while performing the matrix add/mul post-ops. Support is added for scaling these values before using them in the post-ops. Both scalar and vector scale_factors are supported. AMD-Internal: [SWLCSG-3181] Change-Id: Ifdb7160a1ea4f5ecccfa3ef31ecfed432898c14d	2025-01-08 10:35:50 +00:00
Meghana Vankadari	c9f0240679	Returning early for col-major inputs in u8s8s32os32\|s8 APIs Details: - For u8s8s32os32\|s8 APIs, A & B matrices are of different datatypes. Hence col-major inputs cannot be supported by swapping the matrices internally. Added a check to return early in such cases. Change-Id: I99fbebe811c3d05310f30f7fc978f5084b5a51ba	2025-01-05 23:46:06 +05:30
Mithun Mohan	8d8a8e2f19	Light-weight logging framework for LPGEMM. -A light-weight mechanism/framework to log input details and a stringified version of the post-ops structure is added to LPGEMM. Additionally the runtime of the API is also logged. The logging framework logs to a file with filename following the format aocl_gemm_log_<PID>_<TID>.txt. -To enable this feature, the AOCL_LPGEMM_LOGGER_SUPPORT=1 macro needs to be defined when compiling BLIS (with aocl_gemm addon enabled) by passing CFLAGS="-DAOCL_LPGEMM_LOGGER_SUPPORT=1" to ./configure. Additionally AOCL_ENABLE_LPGEMM_LOGGER=1 has to be exported in the environment during LPGEMM runtime. AMD-Internal: [SWLCSG-3280] Change-Id: I30bfb35b2dc412df70044601b335938fc9f49cfb	2025-01-03 11:28:57 +00:00
Nallani Bhaskar	6cb1acf3c3	Fixed out-of-memory read access in bf16 reorder reference Description: Loop count was taken as 16 instead of n0_partial_rem in packb_nrlt16_bf16bf16f32of32_col_major_ref function. Updated comments on reference reorder functionality. AMD Internal: SWLCSG-3279 Change-Id: Idfc3b92906bc2b24651c7923e395fe10db56166b	2025-01-03 04:09:08 -05:00
Meghana Vankadari	bfc512d3e1	Implemented batch_gemm for bf16bf16f32of32\|bf16 Details: - The batch matmul performs a series of matmuls, processing more than one GEMM problem at once. - Introduced a new parameter called batch_size for the user to indicate number of GEMM problems in a batch/group. - This operation supports processing GEMM problems with different parameters including dims,post-ops,stor-schemes etc., - This operation is optimized for problems where all the GEMMs in a batch are of same size and shape. - For now, the threads are distributed among different GEMM problems equally irrespective of their dimensions which leads to better performance for batches with identical GEMMs but performs sub-optimally for batches with non-identical GEMMs. - Optimizations for batches with non-identical GEMMs is in progress. - Added bench and input files for batch_matmul. AMD-Internal: [SWLCSG-2944] Change-Id: Idc59db5b8c5794bf19f6f86bcb8455cd2599c155	2025-01-03 03:28:32 -05:00
Nallani Bhaskar	40719e0438	Implemented reference function for bf16 reorder function Description: Implemented a reference version for aocl_gemm_reorder_bf16bf16f32of32 function to make the code cpu architecture independent. AMD-Internal: [ SWLCSG-3279 ] Change-Id: I0c715864c0ab3e5afea2ee6ee9546b75c3dbf9ec	2024-12-17 05:46:39 +00:00
Deepak Negi	615789e196	Fixed compilation issue with clang 18 on windows Description -In enum AOCL_PARAMS_STORAGE_TYPES the member FLOAT was declared, and the clang 18 compiler in msvc was throwing a multiple-definition error. FLOAT and BFLOAT16 are replaced with AOCL_GEMM_<F32/BF16>. AMD-Internal: CPUPL-6174 Change-Id: Ic061af068854d51629b82b495efd0eb54543f329	2024-12-12 06:37:06 -05:00
Deepak Negi	baeebe75c9	Support for standard AutoAWQ storage format. Description: 1. AutoAWQ uses an int32 buffer to store 8 elements of 4 bits each in this format [0, 2, 4, 6, 1, 3, 5, 7]. 2. Support is added to convert the above format back to the original sequential order [0, 1, 2, 3, 4, 5, 6, 7] before reordering in the AWQ API. AMD-Internal: SWLCSG-3169 Change-Id: I5395766060c200ab81d0b8be94356678a169ac13	2024-12-02 04:02:27 -05:00
Meghana Vankadari	fbb72d047f	Added group quantization and zero-point support for WOQ kernels Description: 1. Added group quantization and zero-point (zp) in aocl_gemm_bf16s4f32o<bf16\|f32> API. 2. Group quantization is a technique to improve accuracy where the scale factors used to dequantize weights vary at group level instead of per channel and per tensor level. 3. Added zp and scaling in woq packb kernels so that for large M values zp and scaling are performed at pack-b stage and bf16 kernels are called. 4. Adding zp support and scaling to default path in WoQ kernels created some performance overhead when M value is very small. 5. Added string group_size to lpgemm bench to read group size from bench_input.txt and tested for various combinations of matrix dimensions. 6. The scale factors could be of type float or bf16 and the zero-point values are expected to be in int8 format. AMD-Internal: [SWLCSG-3168, SWLCSG-3172] Change-Id: Iff07b54d76edc7408eb2ea0b29ce8b4a04a38f57	2024-12-02 06:46:13 +00:00
Deepak Negi	04ae01aeab	Added support to specify bias data type in bf16 API's Description: 1. The bias type was supported only based on output data type. 2. The option is added in the pre-ops structure to select the bias data type irrespective of the storage data type in bf16 and WoQ API's AMD-Internal: SWLCSG-3171 Change-Id: Iac10b946c2d4a5c405b2dc857362be0058615abf	2024-11-19 05:30:02 -05:00
Deepak Negi	b5c1b6055a	Sigmoid and Tanh post-operation support for bf16 API. Description: Implemented sigmoid, tanh as fused post-ops in aocl_gemm_bf16bf16f32o<f32\|bf16> API's Sigmoid(x) = 1/(1+e^(-x)) Tanh(x) = (1-e^(-2x))/(1+e^(-2x)) Updated bench_lpgemm to recognize sigmoid, tanh as options for post-ops from bench_input and verified. AMD-Internal: [SWLCSG-3178] Change-Id: I78a3ba4a67ab63f9d671fbe315f977b016a0d969	2024-11-15 01:13:31 -04:00
Nallani Bhaskar	9735391e1d	Implemented f32tobf16 reorder function Description: aocl_reorder_f32obf16 function is implemented to reorder input weight matrix of data type float to bfloat16. The reordering is done to match the input requirements of API aocl_gemm_bf16bf16f32o<f32\|bf16>. The objective of the API is to convert a model/matrix of type f32 to bf16 and process when machine supports bf16 FMA instruction _mm512_dpbf16_ps but the model is still in float Change-Id: Ib7c743d52d01a1ac09e84ac120577ec9e02f90f5	2024-11-04 04:32:01 +00:00
Mithun Mohan	097cda9f9e	Adding support for AOCL_ENABLE_INSTRUCTIONS for f32 LPGEMM API. -Currently lpgemm sets the context (block sizes and micro-kernels) based on the ISA of the machine it is being executed on. However this approach does not give the flexibility to select a different context at runtime. In order to enable runtime selection of context, the context initialization is modified to read the AOCL_ENABLE_INSTRUCTIONS env variable and set the context based on the same. As part of this commit, only f32 context selection is enabled. -Bug fixes in scale ops in f32 micro-kernels and GEMV path selection. -Added vectorized f32 packing kernels for NR=16(AVX2) and NR=64(AVX512). This is only for B matrix and helps remove dependency of f32 lpgemm api on the BLIS packing framework. AMD Internal: [CPUPL-5959] Change-Id: I4b459aaf33c54423952f89905ba43cf119ce20f6	2024-10-30 08:52:22 +00:00
Meghana Vankadari	b04b8f22c9	Introduced un-reorder API for bf16bf16f32of32 Details: - Added a new API called unreorder that converts a matrix from reordered format to its original format (row-major or col-major). - Currently this API only supports bf16 datatype. - Added corresponding bench and input file to test accuracy of the API. - The new API is only supported for 'B' matrix. - Modified input validation checks in reorder API to account for row vs. col storage of matrix and transposes for bf16 datatype. Change-Id: Ifb9c53b7e6da6f607939c164eb016e82514581b7	2024-10-23 07:49:24 -04:00
varshav2	605517964b	Add Transpose Kernel for A matrix in F32F32f32Of32 - Implemented the AVX512 packA kernel for col major inputs in F32 API - Removed the work arounds for n = 1, mtag_a = PACK case, where the execution was being directed to GEMM instead of GEMV. Change-Id: I6fb700d96069213a762e8a83a209c5388a91050f	2024-09-19 06:37:11 -04:00
varshav2	7c78b9991f	Bug Fixes in the F32F32 m == 1 transpose scenario - added the missing stride updates in B reorder case in GEMV - added the missing stride updates for the case of transA with B reordered. Change-Id: Ic89781dfa7c0d9380ea523796958f795828a1ade	2024-09-11 02:08:50 -04:00
Meghana Vankadari	5120f98e12	Developed all WoQ kernels for bf16s4f32o<f32\|bf16> Description: 1. Written 6x64 main and other fringe kernels for WoQ where scaling s4 weights into bf16 performed in the kernel itself to reduce bandwidth. 2. These kernels are performing better compared to bf16 weights when m is small and n is large. 3. Established a threshold to do quantization support at packing of B (KCXNC) level or WoQ kernel level. Change-Id: I4f8265b8b58c276ff2590cc948d1f920aa0bb289	2024-09-10 12:00:10 +00:00
varshav2	298a165718	Add TransA and TransB support for F32F32F32oF32 - Added support for TransA and transB in f32f32of32 APIs - Modified the GEMV case(m == 1) to support PACKB feature - Redirecting the operations to GEMM instead of GEMV in case of n == 1 conditions, with storage scheme r/transA and c/transB to avoid the packing errors which would lead to failures in computation. Change-Id: I0eb8c31485af4e33c53fd36b5e5788d75d3a67a9	2024-09-09 05:19:49 +05:30
Mithun Mohan	cf123aa926	Disabling smart threading for small input dimensions. -It has been observed that reduction of threads as part of smart threading for smaller input dimensions hampers the performance of the other inputs with larger dimensions due to lower operating frequency of the newly launched threads (apart from the existing ones). Disabling smart threading for these bandwidth bound input patterns (small m and n) fixes this issue. -Bug fixes related to work split in LPGEMV for n < NR and m < MR cases. AMD Internal: [SWLCSG-2948] Change-Id: I0117dc0ea6820a9fac8e14f93374b54a7d80c121	2024-09-06 09:20:42 -04:00
Meghana Vankadari	2e1cc2f14a	Added bf16s4f32 kernels to handle m=4 cases Details: - In WOQ, if m = 4, special case kernels are added where s4->bf16 conversion happens inside the compute kernel and packing is avoided. For all other cases, B matrix is dequantized and packed at KC loop level and native bf16 kernels are re-used at compute level. - Fixes in bench to avoid accuracy failures when datatype of output is bf16. Change-Id: Ie8db42da536891693d5e82a5336b66514a50ccb2	2024-09-04 07:36:57 -04:00
mkadavil	1257eaf72d	Disabling smart threading for bandwidth bound input patterns. For some applications, one of the input dimensions is mostly m < MR or n < NR with the other dimension being small for the most part, with intermittent large ones. Currently in these cases (m < MR or n < NR), the number of threads used is reduced (as part of smart threading) if the other dimension (n or m) is also small. For larger dimensions all the threads are used. However it has been observed that this reduction of threads hampers the performance of the larger inputs due to lower operating frequency of the newly launched threads (apart from the existing ones). Disabling smart threading for these bandwidth bound input patterns (m < MR or n < NR) fixes this issue. AMD Internal: [SWLCSG-2948] Change-Id: I5334860cf4411ea4504d2e6bc598b9904780bbbf	2024-09-02 02:18:45 +05:30
varshav2	d4e0fa9b4c	Revert duplicate check and fix bug in the check for post-ops - Revert of patch 1110983 - Duplicate check removal and early return for s8s8s32/u8s8s32 - Add fix - Added check to see if post-ops is enabled with col-major storage and return early in that case. Change-Id: Id3b8c97b6d1425dfb06f3b196e5acd60caee8fca	2024-08-29 06:52:14 -04:00
Deepak Negi	6dcf500703	Element wise operations API for float(f32) input matrix in LPGEMM. This API supports applying element wise operations (eg: post-ops) on a float(f32) input matrix to get an output matrix of the same type (float(f32)). AMD Internal: [SWLCSG-2947] Change-Id: I387a544f0d33d2231f5f6a92e212f17b1103dd24	2024-08-27 03:28:52 -04:00
varshav2	e3c434080a	Fix duplicate check and early return in s8s8s32/u8s8s32 - removed the duplicate check for col-major inputs in s8s8s32/u8s8s32 APIs - Fixed the print in bench_lpgemm Change-Id: If40837b89927dd82d8aa6f620d1a7f2c24aed53c	2024-08-23 02:32:20 +05:30
Meghana Vankadari	5514c7a75f	Added LPGEMV(n=1) kernels for s8s8s32os32\|s8 and s8s8s16os16\|s8 APIs - When n=1, reorder of B matrix is avoided to efficiently process data. A dot-product based kernel is implemented to perform gemv when n==1. AMD-Internal: [SWLCSG-2354] Change-Id: I6b73dfddd9a15e7b914d031646a1d913a7ab4761	2024-08-09 06:17:52 -04:00
Edward Smyth	82bdf7c8c7	Code cleanup: Copyright notices - Standardize formatting (spacing etc). - Add full copyright to cmake files (excluding .json) - Correct copyright and disclaimer text for frame and zen, skx and a couple of other kernels to cover all contributors, as is commonly used in other files. - Fixed some typos and missing lines in copyright statements. AMD-Internal: [CPUPL-4415] Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371	2024-08-05 15:35:08 -04:00
Edward Smyth	591a3a7395	Code cleanup: file formats and permissions - Remove execute file permission from source and make files. - dos2unix conversion. - Add missing eol at end of files. Also update .gitignore to not exclude build directory but to exclude any build_* created by cmake builds. AMD-Internal: [CPUPL-4415] Change-Id: I5403290d49fe212659a8015d5e94281fe41eb124	2024-08-05 11:52:33 -04:00
mkadavil	9f5fec7713	Matrix MUL op support in element wise operations API for bfloat16. -Matrix MUL op support added in main as well as fringe bfloat16 element wise operations kernels. -Benchmarking/testing framework for the same is added. -Fixed issues in setting up post-ops node index. AMD Internal: [SWLCSG-2947, SWLCSG-2953] Change-Id: Iba7561a6a60df41211efbf06fab1b4900207bcf8	2024-08-05 08:29:42 +05:30


141 Commits