composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-14 10:09:41 +00:00

Author	SHA1	Message	Date
linqunAMD	511f170dab	[CK_TILE] Refine fp8 support in flatmm (#2239 ) * [CK_TILE] Refine fp8 in flatmm 1. Replace USING_MFMA_16x16x32 & USING_MFMA_16x16x32 with constexpr 2. Add an additional const check to avoid build error in HotLoopScheduler 3. Refine shuffleb to support both tile 32x32 and 16x16 4. Support command option -init 5. Move Gemm warp definition to a separate struct * fix clang format * fix clang format * keep default behavior unchanged (warp tile = 16x16) * fix tile engine build error * fix a typo in codegen_utils.py * address review comments * address review comments --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `37e1a27537`]	2025-06-25 01:07:45 -07:00
Po Yen Chen	b86c92c84e	[CK_TILE] Add missing parameter 'min_seqlen_q' to the FMHA fwd kernel MakeKargs() interface (#2403 ) * Rename batch_prerfill interface * Add min_seqlen_q parameter in MakeKargs() [ROCm/composable_kernel commit: `50fad03524`]	2025-06-25 15:19:21 +08:00
Xiao Li	b3b4aa8d57	Fix amd_ck_fp8.hpp macro definitions (#2325 ) * Fix amd_ck_fp8.hpp macro definitions 1. Define CK_USE_FNUZ_FP8 and CK_USE_OCP_FP8 definitions only if they were not defined before. 2. Prefix __assert_fnuz_support and __assert_ocp_support with namespace fp8_impl to avoid redefined error when building with rocm 6.4+ (rocm/6.4.0/include/hip/amd_detail/amd_hip_fp8.h) Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> [ROCm/composable_kernel commit: `bac51b6ec0`]	2025-06-24 22:46:15 -06:00
Yi DING	820ba182a0	Fix unmatched K size of WarpGemmMfmaBf16Bf16F32M16N16K32TransposedCDistribution on gfx950 (#2393 ) [ROCm/composable_kernel commit: `c5d9181e1b`]	2025-06-24 16:35:54 -07:00
JiaLuo-CAN	d07dd533b3	add a mx_fp8 client example (#2380 ) * add a mx_fp8 client example * remove verify code and fix date * remove verify code and fix date, type --------- Co-authored-by: root <root@bg-1w300-e1-2a.mkm.dcgpu> Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com> Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> [ROCm/composable_kernel commit: `778ac24376`]	2025-06-24 12:13:18 -04:00
Anton Gorenko	e156b5aebb	Improve fmha_bwd tests performance (#2376 ) * Avoid passing indices (std::vector) by value to host tensor's operator() Each access requires 2 allocations and copies of the vector. * Remove 1 unneeded vector copy from the slowest part of fmha_bwd's verification * Compute ds_hp_host_ref in parallel This sequential ForEach is the slowest part of validation and it benefits from parallel computation. * Do not use ForEach for simple copy and conversion of large tensors These tensors all have the same shape {nhead, real_seqlen_q, real_seqlen_k} and can be copied/converted without complex computations of linear indices. [ROCm/composable_kernel commit: `77123600ee`]	2025-06-24 07:45:24 -07:00
JonathanLichtnerAMD	1c8b1cee57	Do not build "other" library for MIOpen (#2382 ) MIOpen only needs the static CK library for convolutions. [ROCm/composable_kernel commit: `87fdb368a7`]	2025-06-24 07:32:16 -07:00
JonathanLichtnerAMD	79c30fbb3b	Fix build error when building with MIOPEN_REQ_LIBS_ONLY=ON (#2383 ) Co-authored-by: John Shumway <john.shumwayjr@gmail.com> [ROCm/composable_kernel commit: `42e246e90f`]	2025-06-24 07:30:42 -07:00
Kiefer van Teutem	eb4b7c65ff	Implement batched gemm wmma (RDNA batched gemm) based on wmma cshuffle v3 (#2319 ) * Some prep work for adding batched_gemm_wmma_universal. Moved batched_gemm in general to gfx11 and gfx12 categories, and split existing batched_gemm test into xdl and wmma versions. Updated profiler and instance factory. For now only adding f16-row-row-row-GemmDefault. For now actual device instance list is empty. * Add DeviceBatchedGemm_Wmma_CShuffleV3 based on DeviceGemm_Wmma_CShuffleV3 and make sure it's used in the instance factory and tests. Currently the new batched device level struct cannot actually handle batching, but it does pass tests with a trivial batch size of 1, meaning that the overall structure is good. * Add custom kernel and Argument type to DeviceBatchedGemm_Wmma_CShuffleV3. Batching arguments not passed to kernel yet. * Implement kernel-level batching logic for DeviceBatchedGemm_Wmma_CShuffleV3. In principle the whole thing works now, just need to add other data types and perhaps do some cleanup. * Add other layouts for batched gemm wmma cshuffle v3 f16 f16 f16. Now matching XDL (for f16). * Add bf16 bf16 bf16 support for batched gemm wmma cshuffle v3 for all layouts. * Fixup comments and TODOs * Expand test cases for batched gemm wmma cshuffle v3 with more unusual shapes. Some of the original test cases for batched gemm do not work based on cshuffle v3 because the dimensions are too small. * Fix argument order for calls to profile_batched_gemm_impl() ONLY in wmma tests. * Take batching into account when using rotating memory or clearing the C tensor. * Implement small refactors / comments etc. from review. * Port recent gemm wmma updates to batched gemm wmma: V1 pipeline, non-main-k-block-loop, check compute type, packed buffer size calc. Ported new instance lists. * Add MNKPadding instances to batched gemm wmma cshuffle v3, remove incompatible test problems. 
* Put clearing the C matrix in a pre-process lambda for the non-flush case + small fixups. * Once again switch order of strides and batch strides in calls to profile_batched_gemm_impl() from test_batched_gemm_wmma to match latest definition of that function. --------- Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com> [ROCm/composable_kernel commit: `9e74ae7c89`]	2025-06-24 07:28:13 -07:00
lalala-sh	b6c780fc7f	fix moe i4 bug from aiter (#2339 ) [ROCm/composable_kernel commit: `bb571a0330`]	2025-06-24 14:51:29 +08:00
Yi DING	9f0d3497c3	[CK_TILE] FMHA Support hdim_v as a Multiple of 32 (#2114 ) * 160+192 * Add splitkv d160 * cleanup * fix * Add change log * Fix CHANGELOG * Use static_cast * Update ignored instance --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com> [ROCm/composable_kernel commit: `b8212864cf`]	2025-06-24 01:33:31 +08:00
Rostyslav Geyyer	de3cfbab9a	Add accelerated stochastic rounding on gfx950 (#2355 ) * Add native prand generation support for gfx950 * Update seed calculation [ROCm/composable_kernel commit: `dbfe70e72a`]	2025-06-23 09:31:46 -05:00
John Shumway	7c57c4f045	Shard several of the most costly targets. (#2373 ) * Shard several of the most costly targets. Introduces a filter_tuple_by_modulo to break up tuples. Drops build time of target from 21 minutes to under 14 minutes with 64 build processes, or 11 minutes with 128 build processes. time ninja -j 64 device_grouped_conv3d_fwd_instance * fix clang format * Fix build errors in instantiation code. I wasn't sure how to test the header-only instantiation code on my initial commit. From Jenkins CI test results, I see that there is a test target that depends on these headers: ninja -j 128 test_grouped_convnd_fwd This allowed me to test the build locally. I found three mistakes I made, mostly related to early experiments I tried on the code. This was hard to find earlier because this PR is really too large. I also discovered that there are five 2D convolution targets that now dominate the compilation time. I will likely address those in a later PR, rather than adding even more changes to this PR. * Fix link errors from mismatched declarations. Our pattern for instantiating MIOpen templates uses duplicate declarations (instead of headers). This is fragile, and I didn't notice that my last commit had a bunch of link errors. I fixed these mistakes, and the bin/test_grouped_conv_fwd test target binary now links correctly. * Migrate the design to a code-generation approach. Use a CMake function with template files to generate the source files for instantiating the kernels and to generate the calling function. * Shard the longest 2D convolution builds Now that we have automated the shard instantiation, we can shard the 2D convolution targets that take the longest to build. The target test_grouped_conv2d_fwd now compiles in 15 minutes. * Use PROJECT_SOURCE_DIR for submodule compatibility I used CMAKE_SOURCE_DIR to refer to the top-level source directory in the ShardInstantiation.cmake file, but this can cause issues with git submodules. 
Instead, we should use PROJECT_SOURCE_DIR to ensure compatibility when this project is used as a submodule in another project. * Remove accidental copy of a file * Remove accidental copies of template files. --------- Co-authored-by: illsilin <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `47ae4b0955`]	2025-06-23 07:24:36 -07:00
Linjun-AMD	17346f2c91	update the way to compute fmha fwd tflop, include mask type (#2386 ) * update the way to compute fwd tflop, include mask type Signed-off-by: JL-underdog <Jun.Lin@amd.com> * remove unnecessary comment * add necessary comment * remove some comment --------- Signed-off-by: JL-underdog <Jun.Lin@amd.com> Co-authored-by: root <root@GT-SC-DI16-08.dh144.dcgpu> [ROCm/composable_kernel commit: `61eb622e85`]	2025-06-23 15:53:58 +08:00
Po Yen Chen	7001322416	[CK_TILE] Fix compilation errors introduced in #2320 , #2219 and #2214 (#2388 ) * Fix compilation errors * Fix more ck_tile example compilation errors [ROCm/composable_kernel commit: `7d669440a6`]	2025-06-23 12:29:15 +08:00
Max Podkorytov	0bb4daa71b	Update for xformers (#2372 ) * update api * update kernel api * clang-format [ROCm/composable_kernel commit: `0366fb2abc`]	2025-06-22 00:28:30 -07:00
Bartłomiej Kocot	29cfe38b42	[CK TILE] Grouped Convolution Forward Kernel (#2188 ) * [CK TILE] Grouped Convolution Forward Kernel * custom vector size * fixes * refactor * rebase fixes * fixes * fixes [ROCm/composable_kernel commit: `cebdee4d9e`]	2025-06-20 15:44:36 -07:00
Illia Silin	bc61ff620d	update code owners list (#2381 ) [ROCm/composable_kernel commit: `7378a51b4c`]	2025-06-20 14:03:20 -07:00
Thomas Ning	5c2009c852	fix the mi350 error (#2378 ) [ROCm/composable_kernel commit: `df6023e305`]	2025-06-20 12:50:13 -07:00
Illia Silin	3d10c98abe	Introduce dependency-based CI test selection. (#2377 ) * Selective test filter initial commit. * Expanded folder paths for parsing ninja dependencies. * Fixing default branch name in the test evaluation script. * Fixing paths for robustness and adding ctest command to the launch script. * change jenkins file and few tests to upgrade CI * Setting ninja build path. * Fixing typo in Jenkinsfile, and wrong paths. * Fixing typo in launch script. * add few more tests to check CI logic * Fixing header for shell script. * turn off performance test by default, add option to run all unit tests * revert dummy changes in source code to trigger tests * make sure develop branch runs all unit tests --------- Co-authored-by: Vidyasagar Ananthan <vidyasagar.ananthan@amd.com> [ROCm/composable_kernel commit: `c3c8c6a10f`]	2025-06-20 12:48:00 -07:00
Thomas Ning	3414888f92	Transpose builtin macro defense (#2374 ) * add the macro defense * add the static assert check [ROCm/composable_kernel commit: `107e3623c7`]	2025-06-20 11:24:54 -07:00
Bartłomiej Kocot	9e27236fb7	Grouped conv bias clamp fp32/fp16 support (#2366 ) [ROCm/composable_kernel commit: `663992e99b`]	2025-06-20 11:41:04 +02:00
Max Podkorytov	7c10189a27	Reland fix default epilogue (#2367 ) * Revert "Revert "Fix default epilogue (#2358)" (#2364)" This reverts commit `f85c70b31e`. * add operator() with old signature [ROCm/composable_kernel commit: `11eb9f1c77`]	2025-06-19 10:39:30 -07:00
dependabot[bot]	83fa5f32c6	Bump sphinxcontrib-bibtex from 2.6.3 to 2.6.4 in /docs/sphinx (#2365 ) Bumps [sphinxcontrib-bibtex](https://github.com/mcmtroffaes/sphinxcontrib-bibtex) from 2.6.3 to 2.6.4. - [Changelog](https://github.com/mcmtroffaes/sphinxcontrib-bibtex/blob/develop/CHANGELOG.rst) - [Commits](https://github.com/mcmtroffaes/sphinxcontrib-bibtex/compare/2.6.3...2.6.4) --- updated-dependencies: - dependency-name: sphinxcontrib-bibtex dependency-version: 2.6.4 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> [ROCm/composable_kernel commit: `c8b247c55c`]	2025-06-18 08:15:59 -07:00
Muhammed Emin Ozturk	9c035fb203	Stream-K CkProfiler Update ( Replace CPU Validation with GPU Validation and Add Dynamic Grid Size Calculation for Stream-K GEMM Profiler) (#2333 ) * Stream-K Ckprofiler Update * new grid list based on sm number * clang * update for review * Update profile_gemm_universal_streamk.cpp --------- Co-authored-by: root <root@ctr-ubbsmc16.amd.com> [ROCm/composable_kernel commit: `bfb33bc1e9`]	2025-06-18 07:49:22 -07:00
joyeamd	3cb0dd8506	transpose load api development (#2177 ) * add transpose load; no real logic * fix some compile errors * fix some issues * update transpose load logic * add some fixes * fix a distribution issue * update some codes * add some fix * can pass; but no logic * transpose load enable * update tile transpose * miss output tile distribution mapping * hack for transpose 16x16 * update output tensor distribution * delete unused variables * fix transpose related codes * update transpose load example * exchange the iteration order * fix 16x16 related dimension transpose * fix a transpose index issue * fix a transpose index issue * fix clang format check * update load tile transpose related codes * fix compile errors and pass 16x16 tests * fix a typo * update logic * check other data types * add transpose load api * update transpose load api * fix clang format check * change file name * refactor codes * update code name * delete some unused codes * delete the unused oob flag for transpose load * update tensor view api for transpose load * update for testing * fix a typo error * move transpose ops to example directory * update transpose api * update include file * fix for pr review * fix compile errors * change directory name * delete the duplicated directory * update cmakelists file * delete the unused codes * update function names * update transpose policy * update code after remod.py * update codes * add some comment * Polish the instr infrastructure * build up the fixed instr * redesign the transpose api, currently it has numerical error * add the bf16 transpose * fix some issues * add some comments * update document * Finished the refactor of API and pass through the verification * fix the merging issue --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `a2f01141aa`]	2025-06-18 01:28:34 -07:00
Thomas Ning	f85c70b31e	Revert "Fix default epilogue (#2358 )" (#2364 ) This reverts commit `b29e3830a6`. [ROCm/composable_kernel commit: `64a2fda713`]	2025-06-17 22:43:05 -07:00
linqunAMD	cd0bf60645	[CK_TILE] fix build error in tile_add_rmsnorm2d_rdquant_fwd (#2243 ) * [CK_TILE] fix build error in tile_add_rmsnorm2d_rdquant_fwd * fix error with the latest develop code. [ROCm/composable_kernel commit: `7aeec9a901`]	2025-06-17 21:37:59 -07:00
carlushuang	f540c6ccb4	[CK_TILE] moe_sorting support "local_tokens" feature for EP case (#2335 ) * support local_token for hipgraph * update README * fix comment * fix fmoe example [ROCm/composable_kernel commit: `a4e1248dba`]	2025-06-18 10:49:43 +08:00
Kiefer van Teutem	609cb2c3ad	Fix argument order for calls to profile_batched_gemm_impl() (#2277 ) * Fix argument order for calls to profile_batched_gemm_impl() * Revert previous and swap the order of the profile_batched_gemm_impl() function arguments instead. * Revert copyright years for unchanged files. * Remove test_batched_gemm from REGRESSION_TESTS since it no longer takes more than 30 seconds to run. --------- Co-authored-by: Kiefer van Teutem <kiefer.van.teutem@streamhpc.com> [ROCm/composable_kernel commit: `c7c6a0ccb3`]	2025-06-17 19:29:09 -07:00
Max Podkorytov	b29e3830a6	Fix default epilogue (#2358 ) * [ck-tile] fix default epilogue in gemm universal * argument validation needs vector size D * operator() needs to specify dram windows * copy/paste from cshuffle epilogue * clang-format * mark unused argument --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `cd606f72c1`]	2025-06-17 17:30:21 -07:00
linqunAMD	af00674037	[CK_TILE] Support multi-config in tile_example_gemm_universal (#2240 ) * [CK_TILE] Support multi-config in tile_example_gemm_universal Add GemmConfig in run_gemm_example to support multiple tile configs. - It is useful when you need to compare gemm perf with different tile/pipeline configs - We can also use it to simplify the code for wmma support in the future. * [CK_TILE] Support multi-config in tile_example_gemm_universal Address review comments * rebase code and fix clang format. * fix clang format * support pipeline v5. * fix merge conflict * address review comment * add missing file * address review comment v2 * fix build error [ROCm/composable_kernel commit: `0eb8974502`]	2025-06-17 17:27:46 -07:00
John Afaganis	3ef7712ee3	Add missing copyright headers (#2359 ) * Add missing copyright headers * empty commit [ROCm/composable_kernel commit: `df54667102`]	2025-06-17 14:29:45 -07:00
Illia Silin	073bb8d588	Revert "Shard several of the most costly targets. (#2266 )" (#2361 ) This reverts commit `c1285aaada`. [ROCm/composable_kernel commit: `cdfd7722bf`]	2025-06-17 13:56:30 -07:00
Bartłomiej Kocot	d9316dfbeb	Fix Add in dynamic buffer for fp32/i8 (#2351 ) * Fix Add in dynamic buffer for fp32/i8 * fixes * Fix [ROCm/composable_kernel commit: `cc98a41f46`]	2025-06-17 22:25:56 +02:00
Satyanvesh Dittakavi	bde406245a	Do not use warpSize as compile time constant as it is removed (#2320 ) * Do not use warpSize as compile time constant as it is removed * Update tile_image_to_column_shape.hpp update warpSize usage. * clean-up all use of warpSize, make sure code builds * fix --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Bartlomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `4c57157d50`]	2025-06-17 11:54:30 -07:00
Aviral Goel	66afddf431	add script to pre commit hooks for checking file permissions (#2322 ) [ROCm/composable_kernel commit: `3af66e99ab`]	2025-06-17 07:07:08 -07:00
Thomas Ning	bc6af0fa49	Fix the CK Tile related operators (#2356 ) * fix the flatmm * Fix the pipeline * address the comment [ROCm/composable_kernel commit: `3c4cdfac4f`]	2025-06-16 17:38:52 -07:00
rahjain-amd	456719c9bc	Add cmake flag to enable Assembly dump (#2347 ) This flag makes it easy to dump assembly for the example kernels. [ROCm/composable_kernel commit: `6589f50bc9`]	2025-06-16 09:29:35 -07:00
Illia Silin	0f4d68633b	Revert "fix the flatmm (#2349 )" (#2352 ) This reverts commit `fc65195605`. [ROCm/composable_kernel commit: `5523df4b2d`]	2025-06-16 07:54:55 -07:00
Bartłomiej Kocot	4ae33b454f	Grouped convolution forward with clamp (#2334 ) * Grouped convolution forward with clamp * Optimize clamp * unary fixes * test gk bias * Revert "test gk bias" This reverts commit `8e42e29d7b`. * Revert "Revert "test gk bias"" This reverts commit `e73c0550ce`. * workaround comment [ROCm/composable_kernel commit: `f6c2ff9dce`]	2025-06-16 15:36:53 +02:00
Thomas Ning	fc65195605	fix the flatmm (#2349 ) [ROCm/composable_kernel commit: `d996bc78be`]	2025-06-16 02:17:53 -07:00
ruanjm	1fdac8b8fe	Add support for specifying valid flag when fetching elements for tile_scatter_gather (#2332 ) * Add support for specifying valid flag when fetching elements for tile_scatter_gather Add constexpr for operator[] of TrueGenerator * Use different path when valid is enabled [ROCm/composable_kernel commit: `b34c234f51`]	2025-06-16 17:17:03 +08:00
carlushuang	370dd01230	hot fix block_gemm fail with pipeline_problem by adding NumWaveGroups inside block gemm problem (#2348 ) [ROCm/composable_kernel commit: `fb97f75099`]	2025-06-15 22:49:04 -07:00
Illia Silin	7eaa398458	Fix direct lds load for gfx950 and clang20 (#2346 ) * fix direct lds load for gfx950 and clang20 * Update include/ck/utility/amd_buffer_addressing_builtins.hpp * Fix format --------- Co-authored-by: Aviral Goel <aviral.goel@amd.com> Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> [ROCm/composable_kernel commit: `2d8a804152`]	2025-06-15 15:22:34 -07:00
Illia Silin	d15d80a68b	Limit the threads to build ck_tile engine, use ninja. (#2342 ) * limit the threads to build ck_tile engine, use ninja * disable ck_tile engine until it can be built safely [ROCm/composable_kernel commit: `56f654a826`]	2025-06-13 14:13:07 -07:00
Illia Silin	6c4973f734	check for if misched-bottomup flag is valid (#2341 ) [ROCm/composable_kernel commit: `a0f4db8d9c`]	2025-06-13 13:34:22 -07:00
Mateusz Ozga	6b3ddd0e23	[CK_TILE] Multiple-D GEMM example (#2219 ) * Multiple d, initial commit * Check Ds Layout * Readme and clang format * Update branch & conflicts * Multiple D - fix clang-formatter * Rename elemetwise_op * Fix CI * Code review part1 * Remove printf * Remove unnecessary comment * Add new tests with Col layout * Review part 2 * Added support for Multiple D GEMM * Update comment * Remove maybe_unused * Clang-format * Review part 3 * Add comment to function * Add comment to function: another * Take number of params for a reference function * Remove additional d param for 0 tensor * Change name of function * Fix CI fails [ROCm/composable_kernel commit: `bd96ac9742`]	2025-06-13 19:39:11 +02:00
John Shumway	c1285aaada	Shard several of the most costly targets. (#2266 ) * Shard several of the most costly targets. Introduces a filter_tuple_by_modulo to break up tuples. Drops build time of target from 21 minutes to under 14 minutes with 64 build processes, or 11 minutes with 128 build processes. time ninja -j 64 device_grouped_conv3d_fwd_instance * fix clang format * Fix build errors in instantiation code. I wasn't sure how to test the header-only instantiation code on my initial commit. From Jenkins CI test results, I see that there is a test target that depends on these headers: ninja -j 128 test_grouped_convnd_fwd This allowed me to test the build locally. I found three mistakes I made, mostly related to early experiments I tried on the code. This was hard to find earlier because this PR is really too large. I also discovered that there are five 2D convolution targets that now dominate the compilation time. I will likely address those in a later PR, rather than adding even more changes to this PR. * Fix link errors from mismatched declarations. Our pattern for instantiating MIOpen templates uses duplicate declarations (instead of headers). This is fragile, and I didn't notice that my last commit had a bunch of link errors. I fixed these mistakes, and the bin/test_grouped_conv_fwd test target binary now links correctly. * Migrate the design to a code-generation approach. Use a CMake function with template files to generate the source files for instantiating the kernels and to generate the calling function. * Shard the longest 2D convolution builds Now that we have automated the shard instantiation, we can shard the 2D convolution targets that take the longest to build. The target test_grouped_conv2d_fwd now compiles in 15 minutes. * Use PROJECT_SOURCE_DIR for submodule compatibility I used CMAKE_SOURCE_DIR to refer to the top-level source directory in the ShardInstantiation.cmake file, but this can cause issues with git submodules. 
Instead, we should use PROJECT_SOURCE_DIR to ensure compatibility when this project is used as a submodule in another project. --------- Co-authored-by: illsilin <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `3a0cb27966`]	2025-06-13 03:58:50 -07:00
kylasa	afbc0625f4	Code drop for 2 warp ping pong scheduler along K dimension. (#2276 ) * Code drop for 2 warp ping pong scheduler along K dimension. * Addressing code review comments. * Addressing Clang formatting issues. * Addressing build issues. * Addressing build issues of other GEMM pipelines with ping pong scheduler code drop. * Fix for LDS memory size for GEMM pipelines. * Addressing code review feedback comments. * Change log update. * Addressing code review comments and build issues. * Added new policy for pipeline specific logic about LDS needs. * Clang Fix during build. [ROCm/composable_kernel commit: `5f1ad09b61`]	2025-06-12 18:24:02 -07:00

2034 Commits