composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-07 00:04:37 +00:00

Author	SHA1	Message	Date
apoorva	bdfdb0c11e	Fixes applied according to review comments	2025-07-08 12:17:20 +00:00
apoorva	669befb25a	Updated copyrights and added wrappers.	2025-07-08 12:06:38 +00:00
apoorva	9b64da2298	Added wrapper and renamed the wmma_v3 instances	2025-07-08 11:26:01 +00:00
apoorva	86ca6b827d	Removed the old wmma instances.	2025-07-08 11:23:33 +00:00
apoorva	5c491e7a4b	Fixing typo to resolve build errors.	2025-07-02 18:04:10 +00:00
apoorva	ba9c637c0b	Added examples for gemm_add_relu	2025-07-02 14:12:52 +00:00
apoorva	6ec0ad2758	Added test for gemm_add_relu wmma instance	2025-07-01 13:44:18 +00:00
apoorva	f5843dd22b	Added v3 instances for gemm_add_relu	2025-07-01 12:37:46 +00:00
apoorva	bb7f6650f7	Fixed typo in profiler	2025-07-01 12:02:28 +00:00
apoorva	6a116fa958	Modified the template parameters to make the instances work.	2025-07-01 11:23:44 +00:00
apoorva	cdaff7f210	Added instances to Cmake	2025-07-01 11:23:43 +00:00
apoorva	6f89183179	Added f16 wmma instance and corrected bf16 instance errors.	2025-07-01 11:23:17 +00:00
apoorva	35aab35d96	Added bf16 wmma instance for add_relu	2025-07-01 11:23:17 +00:00
Zoltán Lakatos	4f1910117b	Merge branch '63-implement-device_gemm_bilinear-for-rdna4' into 'feature/multiple-d-gemms' Resolve "Implement `device_gemm_bilinear` for RDNA4" See merge request amd/ai/composable_kernel!26	2025-06-26 06:48:38 +00:00
Zoltán Lakatos	686df332e2	Resolve "Implement `device_gemm_bilinear` for RDNA4"	2025-06-26 06:48:38 +00:00
Zoltán Lakatos	055bc0227c	Merge branch '10-implement-device_gemm_add_fastgelu-for-rdna4' into 'feature/multiple-d-gemms' Resolve "Implement device_gemm_add_fastgelu for RDNA4" See merge request amd/ai/composable_kernel!23	2025-06-18 14:12:31 +00:00
Zoltan Lakatos	7424b4a0f8	fixed ONLY_XDL_AND_WMMA_KERNELS tag	2025-06-17 19:50:07 +00:00
Zoltan Lakatos	61b6e9a606	splitk xdl test fixes	2025-06-13 09:37:37 +00:00
Zoltan Lakatos	b4d3e4112c	added tests for all splitk fastgelus	2025-06-12 11:43:26 +00:00
Zoltan Lakatos	264e1b238a	created fastegelu instances	2025-06-11 14:14:43 +00:00
Zoltan Lakatos	4c8ea9517d	created gemm_add_add_fastgelu instances	2025-06-11 12:46:42 +00:00
Zoltan Lakatos	aeca8efdea	revert unintended change in xdl add_fastgelu	2025-06-11 07:50:53 +00:00
Zoltan Lakatos	ea9805b580	added add_fastgelu instances	2025-06-10 17:23:44 +00:00
Anton Gorenko	881bc3fbb0	Merge branch '52-implement-multipled-in-gemm-universal' into 'feature/multiple-d-gemms' DeviceGemmMultipleD_Wmma_CShuffleV3 See merge request amd/ai/composable_kernel!21	2025-06-04 16:18:14 +05:00
Anton Gorenko	9912e5f09c	Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use DeviceGemmMultipleDSplitK instances there.	2025-06-04 12:34:26 +05:00
Anton Gorenko	381c02d06a	Add gemm_add_fastgelu instances and test	2025-06-04 12:34:26 +05:00
Anton Gorenko	bcf93e292c	Prepare gemma_add tests for adding wmma	2025-06-04 12:29:34 +05:00
Anton Gorenko	e36a176e38	Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3	2025-06-04 12:29:34 +05:00
Anton Gorenko	137efa743d	Implement DeviceGemmMultipleD_Wmma_CShuffleV3	2025-06-04 12:29:34 +05:00
Anton Gorenko	89ac60de6d	Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma	2025-06-04 12:29:34 +05:00
Anton Gorenko	7dff5fe4ff	Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support	2025-06-04 12:29:34 +05:00
Anton Gorenko	deebe1ea13	Use ThreadGroupTensorSliceTransfer_v7r3	2025-06-04 12:29:34 +05:00
Anton Gorenko	ed047d08b4	Support multiple D in GridwiseGemm_wmma_cshuffle_v3 DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters.	2025-06-04 12:29:33 +05:00
Anton Gorenko	52b4860a30	WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230 ) * Fixed cmake errors related to gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8" * Fixed cmake build errors related to test_fp8 * Updates to support mixed precision * Adding support for RRR, F8xF16xF16 gemm_universal_wmma - wip * Added support for F8xF16xF16 to gemm_wmma_universal * Added support for F16xF8xF16 to gemm_wmma_universal * Added support for BF16xI4xBF16 to gemm_wmma_universal * Added support for F16xI4xF16 to gemm_wmma_universal * Fixed IsSupportedArgument to check ComputeTypeA, ComputeTypeB instead of ADataType, BDataType * Added missing test class for FP16_KM_NK * Pre-commit hooks fixes * Added padding instances for f16xf16xf16 * Fixed cmake errors related to gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8" * Fixed cmake build errors related to test_fp8 * Ammending changes for adding support for padding instances for f16xf16xf16 * Fixes for padding instances for f16xf16xf16 * Added padding instances for bf16xbf16, f8xf8 * Added packed instances for bf16xi4xbf16 * Added padding instances for f8xf16xf16 * Added padding instances for f16xf8xf16, f16xi4xf16 * Fixed typos for bf16xbf16xbf16 padding instances * Fixed typos for padded instances * Added tests for fp16, KM_KN and KM_NK * Padding not supported for when BDataType is pk_i4_t. Added fix for correct check and removed padding instances. * Fixed typos * Updated the set of tests for FP16 * Updated the set of tests for FP16 * Fix typo * Moved f16xi4 test under the correct data layout group * example for gemm_universal_bf16 * Adding examples for gemm_wmma instances * Added the missing parameters * Fixed review comments and added executable to cmakeLists * Fixing clang format * Fixing build erros * Fixed compilation failure. * Modified some code as per gemm_universal_examples * Fixed the gemm specialization error * Fixed the build errors. * Fix strides of a/b_thread_desc The descriptors are larger than needed (even though the compiler don't alloc registers for unused values). * Load in M/NRepeat dims with thread copy's slice instead of a loop * Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation * Implement Intrawave and Interwave variants of pipeline v1 * Add instances for Interwave and Intrawave v1 * Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0 * Remove instances that are too slow (mostly because of register spilling) * Add a workaround for fp8/bf8->f32 packed conversion issue * Add instances for Interwave and Intrawave v1 * Enable profiling of mixed precision with f8 and int4 on WMMA * Fix segfault in profiler when B is pk_i4_t b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds. * Remove instances that are too slow (mostly because of register spilling) * Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations * Add test case for bf16_i4 * Add missing Regular tests * Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS They take more than 30 seconds * Fix a bug that fp16_i4 validation passes only with PermuteB A permutation required by conversion from pk_i4_t to half_t does not depend on PermuteB, they can be used independently. * Use PermuteB with f16_i4 in most instances (as xdl) Some instances use PermuteB = false for checking correctness. See also the previous commit. * Fix cache flushing for pk_i4 * Add mixed precision examples * Disable all tests and instances with f8 on gfx11 Even though f8_f16 and f16_f8 don't require f8 WMMA instructions, gfx11 still lacks hardware instructions for fast f8->f32 conversion. * Add FP16 KM_NK and KM_KN test suites for XDL These tests were added to common .inc for better testing of WMMA instances * Fix int8 DTYPES check for gemm_bilinear --------- Co-authored-by: Anca Hamuraru <anca@streamhpc.com> Co-authored-by: Apoorva Kalyani <apoorva@streamhpc.com>	2025-06-04 12:22:33 +06:00
Khushbu Agarwal	59a85cb4bc	[CK_Tile] Fix gemm kernel for 4,64,16 and 64,4,16 warp tile sizes (#2262 ) * debugging issue * debugging issue * debugging * debugging * reverting debugging code * clang formatted * updating default_config.json * fix ci failure * clang formatted	2025-06-03 20:16:10 -07:00
Khushbu Agarwal	1037b21cfe	Update changelog for Rotating buffer (#2283 ) * Updating changelog for Rotating buffer * address review comment.	2025-06-03 20:14:22 -07:00
Bartłomiej Kocot	6e5acee0f9	Add Clamp/Relu bf16/fp16 cast fixes (#2279 ) * Add Clamp/Relu bf16/fp16 fixes * fix	2025-06-03 18:31:46 +02:00
Xiaodong Wang	7f9eef40b0	Move pragma ahead (#2231 )	2025-06-03 07:27:51 -07:00
Aviral Goel	11f6c14e03	Add 0 as an acceptable arguement for strides in CK GEMM example (Issue 2037) (#2268 ) * add 0 as valid default arguement for strides * add 0 as valid default arguement for strides # Conflicts: # example/01_gemm/common.hpp	2025-06-03 07:26:58 -07:00
Illia Silin	b76fdbe47f	Upgrade to ROCm6.4.1 and use generic targets for gfx1x. (#2274 ) * upgrade to rocm6.4.1 and use gfx1x-generic targets * add rocm version parsing * fix the gfx10-3-generic syntax in cmake	2025-06-03 07:17:35 -07:00
Khushbu Agarwal	2e38eb4f1c	Rotating buffer PR CI fix (#2257 ) * Revert "Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256)" This reverts commit `bbdaf79a52`. * fix regression	2025-06-02 10:25:01 -07:00
dependabot[bot]	cffe8fa2a4	Bump rocm-docs-core[api_reference] from 1.19.1 to 1.20.0 in /docs/sphinx (#2272 ) Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.19.1 to 1.20.0. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.19.1...v1.20.0) --- updated-dependencies: - dependency-name: rocm-docs-core[api_reference] dependency-version: 1.20.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-06-02 06:44:10 -07:00
valarLip	0fdbf6bcd1	extend buffer load for fp16/bf16x16 (#2270 ) * extend buffer load for fp16/bf16x16 * format	2025-06-02 10:29:54 +08:00
Kiefer van Teutem	2215a9edf0	Explicitly set the LINKER_LANGUAGE for the gemm_template_instances target to avoid Ninja build config failure. (#2265 ) Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>	2025-05-30 13:32:28 -07:00
Illia Silin	654956bb02	Add a daily CI build on GFX950. (#2261 ) * add CI build for gfx950 * make sure gfx950 CI always uses special docker and compiler * enable codegen tests by default	2025-05-30 12:50:08 -07:00
Mirza Halilčević	fbce6c7bb6	Define CHAR_BIT during hipRTC (#2264 ) * Fix failing codegen tests. * fix clang format --------- Co-authored-by: illsilin <Illia.Silin@amd.com>	2025-05-30 08:23:44 -07:00
dependabot[bot]	61e6c382c6	Bump rocm-docs-core[api_reference] from 1.19.0 to 1.19.1 in /docs/sphinx (#2263 ) Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.19.0 to 1.19.1. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.19.0...v1.19.1) --- updated-dependencies: - dependency-name: rocm-docs-core[api_reference] dependency-version: 1.19.1 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-05-30 05:56:59 -07:00
slippedJim	57f497452a	remove restriction of group mode hd192 no lse (#2252 ) Co-authored-by: Jim <jimguo12@amd.com>	2025-05-30 10:14:21 +08:00
Illia Silin	4e561af18c	Revert "add CShuffleM/NXdlPerWavePerShuffle in cshuffle_epilogue (#2185 )" (#2260 ) This reverts commit `fd6a859b44`.	2025-05-29 16:22:16 -07:00
Paul Fultz II	306f4c537e	Export codegen targets (#2259 )	2025-05-29 11:03:51 -07:00

1 2 3 4 5 ...

1988 Commits