composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-03-27 18:47:39 +00:00

Author	SHA1	Message	Date
Bartłomiej Kocot	c28d0033d7	[rocm-libraries] ROCm/rocm-libraries#5785 (commit d8ecfc1) [CK] Fix min k_batch calculation in conv kernels ## Motivation Avoid division by 0 and remove not needed "-1". ## Technical Details Our div up implementation return lower value if input is divisible. There is no need to subtract 1. ## Test Plan test_grouped_conv_bwd_weight ## Test Result Passed locally. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-1019	2026-03-27 15:38:21 +00:00
Bartłomiej Kocot	9c414d2e59	[rocm-libraries] ROCm/rocm-libraries#5454 (commit 8dade31) [CK][CK Tile] Grouped Convolution backward weight profiler flush cache (#5454) ## Motivation Flush cache to get more stable results during profiling old ck and ck tile. ## Technical Details Flush cache before each kernel call and one more first run. ## Test Plan test_grouped_conv_bwd_weight_tile ## Test Result pass ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-966	2026-03-16 17:47:07 +00:00
lalala-sh	a3ccd5dca1	[rocm-libraries] ROCm/rocm-libraries#5225 (commit 880166b) [CK] fix moe memset size which is bigger than alloc ## Motivation Fix an out-of-bounds hipMemsetAsync in DeviceMoeGemmBlockScale that crashes split-K MOE GEMM with "HIP runtime error: invalid argument". When KBatch > 1, the invoker zeroes the output buffer using arg.M * arg.N as the byte count. However, arg.M is the padded sorted-token-id length from MOE routing, which can be much larger than the actual output allocation (NumTokens * TopK * N). This causes hipMemsetAsync to write beyond the buffer, and the silently-swallowed HIP error propagates to the subsequent kernel launch via hipGetLastError(). This patch replaces arg.M with arg.NumTokens * arg.TopK so the memset matches the actual output size. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-16 09:30:57 +00:00
Bartłomiej Kocot	2169367735	[rocm-libraries] ROCm/rocm-libraries#5114 (commit 59b8cb5) [CK][CK Tile] Improvements for grouped conv fwd tile profiling (#5114) ## Motivation Improve profiling for grouped convolution forward for better comparison between CK and CK Tile ## Technical Details - Include preprocessing time for ck tile - Add flush cache for conv fwd profiler - Switch configs to builder reflect - Add KPerXdl deduce - Add non-grouped ported instances ## Test Plan test_grouped_convnd_fwd_tile ## Test Result pass ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-786	2026-03-11 22:39:20 +00:00
JP-Fernando	d8ee107a47	[rocm-libraries] ROCm/rocm-libraries#4421 (commit 5bb5769) [CK] Unify the grouped convolution gridwise Run() functions (#4421) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation There are currently three different grouped convolution related Run() function overloads that exist in `gridwise_gemm_wmma_cshuffle_v3.hpp`. These are used for the different types of grouped convolution: Forward, Backward weights, and Backward data. The functions are very similar and should be unified to a single `Run()` function for all types of grouped convolution. ## Technical Details The three old `Run<>()` functions were replaced with a single unified function. The new `Run<>()` function is run from device implementations: - DeviceGroupedConvFwdMultipleABD_Wmma_CShuffle_V3 - DeviceGroupedConvBwdDataMultipleD_Wmma_CShuffleV3 - DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3 - DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3 - DeviceGroupedConvBwdWeight_Wmma_CShuffleV3 The DeviceGroupedConvFwdMultipleD_Wmma_CShuffle_V3_Large_Tensor implementation uses a different `Run<>()` overload and was therefore not modified. ## Test Plan Run the following grouped convolution tests on `gfx1201`, as this architecture is WMMA-capable: - `test_grouped_convnd_fwd` - `test_grouped_convnd_bwd_weight` - `test_grouped_convnd_bwd_data` Compilation and testing were also executed on `gfx1100` to avoid CI problems. ## Test Result First part (unification of `Run<>()` function): All tests successful. Second part (integration of single `Run<>()` function as a direct call): All tests successful. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-11 16:40:12 +00:00
John Shumway	9f47b8a63d	[rocm-libraries] ROCm/rocm-libraries#5284 (commit 76b5b15) [CK_BUILDER] Add DeviceGroupedConvFwdMultipleABD_Wmma_CShuffle_V3 to CK Builder (#5284) Add factory, InstanceTraits, and conv traits support for the WMMA V3 forward convolution kernel, enabling the CK Builder to generate and dispatch this kernel variant used by MIOpen on gfx11/gfx12 GPUs. ## Motivation As reported in issue #4944, MIOpen includes WMMA V3 forward convolution kernels, so this PR adds support for those kernels similarly to other supported kernels. ## Technical Details This follows the same implementation as the other kernels. I added some support for reflection, but I left a few todos since we need to generalize our convolution traits to generalize across WMMA/MFMA and CK/CKTile. ## Test Plan Added faster tests to `ninja smoke-builder` that check the instance-traits logic, and I added longer tests that instantiate kernels, following the existing pattern in other kernals. ## Test Result I tested all code with `ninja check-builder` on a gfx1101 build and ran on gfx1101. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 23:43:03 +00:00
Márton Bidlek	683865895e	[rocm-libraries] ROCm/rocm-libraries#5135 (commit 5ccc138) Proof of concept for removing forward declarations MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation Currently, we forward declare CK device operation templates in CK-Builder's reflection code: `9b168082b7/experimental/builder/include/ck_tile/builder/reflect/instance_traits_device_grouped_conv_bwd_weight_xdl_cshuffle.hpp (L13-L57)` This is mainly required to break a circular dependency in reflection. The architecture of that is as follows: MyDeviceOp implements GetInstanceString(). This is typically defined directly in the class definition (no forward declaration). GetInstanceString() calls instance_string<MyDeviceOp>() instance_string<MyDeviceOp>() calls InstanceTraits<MyDeviceOp>::instance_string() InstanceTraits has a specialization for MyDeviceOp which implements instance_string() So order for GetInstanceString() to work properly, InstanceTraits must already be defined. And for InstanceTraits to be defined, the device op needs to be defined. In order to do that, we are currently using aforementioned forward declaration. ## Technical Details C++'s lazy template evaluation is used by calling into an as-of-yet undefined function static member function of `InstanceTraits<MyDeviceOp>` in `GetInstanceString()`, and then specializing `InstanceTraits` only _after that_. The caveat here is that both the device op itself as well as the instance traits specialization must be in scope, otherwise there would be an undefined function error. In practise, we can solve that either by placing the instance traits directly into the file that defines `MyDeviceOp`, or possibly by using a `.inc` file to keep the concerns separated. ## Test Plan The results were verified by running the existing regression tests for CK Builder ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-09 16:35:26 +00:00
Max Podkorytov	1dd47118e2	[rocm-libraries] ROCm/rocm-libraries#4828 (commit 7de19bb) Add generate_identity_sequences helper and replace lambdas with named functors (#4828) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary - Add `generate_identity_sequences<N>()` helper that returns `Tuple<Sequence<0>, Sequence<1>, ..., Sequence<N-1>>` - Replace lambdas with named functors in `transform_tensor_descriptor` - Add `unpack_and_merge_sequences` helper functor - Reduces `transform_tensor_descriptor` instantiations from 388 to 32 (92% reduction) ## Motivation Multiple call sites use `generate_tuple([](auto i) { return Sequence<i>{}; }, Number<N>{})` pattern. A named helper reduces lambda instantiations. Additionally, each lambda in `transform_tensor_descriptor` creates a unique closure type, causing the function to be instantiated separately for every call site. Named functors share a single type, so the compiler reuses the same instantiation. ## Changes ### Part 1: generate_identity_sequences helper - Replaces common lambda pattern for generating identity sequences - Each lambda expression creates a unique closure type, causing separate template instantiations at every call site - Named helper shares a single type across all uses ### Part 2: Named functors in transform_tensor_descriptor - Add `unpack_and_merge_sequences` helper to replace lambda in `GetNumOfHiddenDimension` - Use `generate_identity_sequences` in `matrix_padder.hpp` ## Test Plan - [x] Added 7 unit tests: - 4 tests for `generate_identity_sequences` - 3 tests for `unpack_and_merge_sequences` - [ ] Waiting for full CI ## Related PRs This PR merges the functionality from: - ROCm/composable_kernel#3588 (generate_identity_sequences helper) - ROCm/composable_kernel#3589 (Named functors in transform_tensor_descriptor) Part of PR stack for issue #4229 (Reduce CK/CKTile Build Times) Note: This PR supersedes #4283, ROCm/composable_kernel#3588 and ROCm/composable_kernel#3589, which can be closed once this is merged.	2026-02-28 20:11:11 +00:00
kabrahamAMD	5e06874aae	[rocm-libraries] ROCm/rocm-libraries#4582 (commit 990a00d) [CK_Builder] added bwd data kernels to builder factory (#4582) This PR adds bwd data wmma and xdl kernels to the ck builder, their instance and conv traits as well as tests for the above.	2026-02-27 03:06:29 +00:00
Yung-sheng Tu	75aea70c2c	[rocm-libraries] ROCm/rocm-libraries#4340 (commit 70a312f) Implement device_grouped_gemm_fixed_nk_bias for RDNA4 ## Proposed changes Summary: - Modified implementation for grouped_gemm_fixed_nk_bias - FP16 WMMA examples - WMMA instances - Profiler for grouped_gemm_fixed_nk_bias - Add WMMA instances to existing tests This PR depends on PR https://github.com/ROCm/rocm-libraries/pull/4299 and should be merged after it. Only the last 6 commits are in the scope of this PR. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [x] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [x] I have added inline documentation which enables the maintainers with understanding the motivation - [x] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-26 00:28:58 +00:00
Zoltán Lakatos	a32d704d89	[rocm-libraries] ROCm/rocm-libraries#4425 (commit 513cf9f) [CK] Implement device grouped gemm fixed nk multi abd for rdna4 (#4425) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation Add support for grouped gemm multi ABD fixed NK. MR ## Technical Details Changes from the reverted PR: - Device struct for grouped gemm with multiple ABD and fixed NK (DeviceGroupedGemm_Wmma_Multi_ABD_Fixed_NK). - Wmma versions of existing example codes: 59_grouped_gemm_multi_ABD - Unit tests for both new wmma implementation and the reference xdl code (previously missing) - Note: Some Xdl instances were commented out because of unit test failures. As mentioned apparently for xdl this feature was missing tests so our assumption is either there is an implemenetation bug or these instances were not set up correctly. Has the potential for a follow-up issue. - Generic ck profiler interface with the purpose of calling unit tests. - Gemm instances with specific elementwise operations for gemm bias gelu calculations. - Added class for grouped gemm multi ABD reference calculations. Fix epilogue selection in device implementation that caused unit test failures ## Test Plan Covered by added unit tests ## Test Result CI successfully passing ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-25 05:17:08 +00:00
Illia Silin	1915cdfcc2	[rocm-libraries] ROCm/rocm-libraries#4762 (commit 5598eb5) Revert "[ck] Support VGPR estimate in GridwiseGemm_wmma_cshuffle_v3" (#4762) Reverts ROCm/rocm-libraries#4638 unfortunately, this PR interfered with the PR#4299 and caused build errors for gfx11: In file included from /rocm-libraries/projects/composablekernel/library/src/tensor_operation_instance/gpu/grouped_gemm_fixed_nk/device_grouped_gemm_wmma_fixed_nk_bf16_bf16_bf16_mk_kn_mn_instance.cpp:7: In file included from /rocm-libraries/projects/composablekernel/library/include/ck/library/tensor_operation_instance/gpu/grouped_gemm/device_grouped_gemm_wmma_fixed_nk_instance.hpp:11: /rocm-libraries/projects/composablekernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_gemm_wmma_fixed_nk.hpp:553:21: error: no matching function for call to 'CheckValidity' 553 \| if(!GridwiseGemm::CheckValidity( \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~	2026-02-20 22:41:34 +00:00
linqunAMD	29781f2ac4	[rocm-libraries] ROCm/rocm-libraries#4638 (commit 305ec71) [ck] Support VGPR estimate in GridwiseGemm_wmma_cshuffle_v3 (#4638) 1. Add GetEstimateVgprCount to estimate the VGPR usage in GridwiseGemm_wmma_cshuffle_v3 2. Add IsValidCompilationParameter to disable kernel which use too many vgprs. - Currently, the threashold is AvailableVgprCount * 1.25 3. Modify examples to avoid test is disabled on gfx11 It is port from internal repo PR[#192](https://github.com/ROCm/composable_kernel/issues/192) ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-20 15:57:18 +00:00
Márton Bidlek	7b97e197ef	[rocm-libraries] ROCm/rocm-libraries#4299 (commit 668cd49) 173 implement device grouped gemm fixed nk for rdna4 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Proposed changes This PR adds an RDNA4 implementation of the device_grouped_gemm_fixed_nk instance library using for WMMA. The implementation is based on the existing DeviceGroupedGemm_Xdl_Fixed_NK design and reuses the same high-level structure, but replaces the XDL kernel with a WMMA-based one. It uses the GridwiseGemm_wmma_cshuffle_v3 kernel. At this stage, the focus is functional correctness and compatibility, not performance tuning. ## Technical Details - Device struct for grouped gemm fixed NK - Example code for the WMMA version - Unit tests for both new wmma implementation and the reference XDL code (previously missing) - Generic ck profiler interface with the purpose of calling unit tests. ## Checklist Please put an into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [x] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [x] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run on all changed files - [x] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered	2026-02-19 08:13:46 +00:00
lalala-sh	dae352e8dc	[rocm-libraries] ROCm/rocm-libraries#4282 (commit 2050f93) add memsetasync for ck moe splitk MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Proposed changes add memsetasync for ck moe splitk to fix ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [ ] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered	2026-02-12 17:45:52 +00:00
Johannes Graner	e88f139c6c	[rocm-libraries] ROCm/rocm-libraries#4271 (commit 6fce58e) [Conv] Add NumGroupsToMerge to BwdWeight type string MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Proposed changes Add parameter to bwd weight V3 type string showing the number of groups to merge. This is required for MIOpen to be properly tuned since it uses type strings for performance database entries. In order to not break existing tuning databases, the parameter is added as a named suffix and only when group merging is enabled. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [ ] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered	2026-02-11 09:08:38 +00:00
Ville Pietilä	57d26db844	[rocm-libraries] ROCm/rocm-libraries#4273 (commit 591f504) [CK] Add fwd conv group merging to v3 conv instances MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Proposed changes Added conv group merging to the (universal) V3 fwd conv pipeline. The new instance improves fwd conv performance when the number of input/output channel per group is low. On MI300 (`gfx942`) we get \| CK prof command \| Baseline (TFLOPS) \| V3 group merging (TFLOPS) \| \|:-----\|:------:\|------:\| \| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 4 4 3 3 200 200 1 1 1 1 1 1 1 1 \| 3.86035 \| 8.36796 \| \| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 200 200 2 2 1 1 1 1 1 1 \| 10.1867 \| 13.4677 \| \| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 100 100 1 2 1 1 1 1 1 1 \| 11.7875 \| 16.3657 \|	2026-02-08 11:35:56 +00:00
Illia Silin	569640dc70	Revert "Implement device grouped gemm fixed nk multi abd for rdna4 (#3619 )" (#3705 ) This reverts commit `301eb5cf08`.	2026-02-03 09:52:14 -08:00
Zoltán Lakatos	301eb5cf08	Implement device grouped gemm fixed nk multi abd for rdna4 (#3619 ) * device struct implementation * added xdl grouped multi abd fixed nk testing * wmma implementation fixed * avoid unnecessary device mem allocation and code cleanups * cleanup instances definitions * wmma examples added * code cleanups * fix clang format * typo and compilation fixes related to reference gemm * fix compilation error due to std::remove_cvref_t * added missing hip_check_error includes * correction to example instances * review commentes addressed * removed split-k from testing * code formatting --------- Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2026-02-02 13:58:11 -08:00
Jan Patrick Lehr	069500464d	[Compiler] Addressing new compiler warnings (#3640 ) * [Compiler] Addressing new compiler warnings Clang enables new lifetime warnings in production and we see build errors due to this with the staging compiler. The attributes added in this PR are suggested by the compiler. However, I'm not very familiar with the code base, so the changes may be incorrect. * Update some more instances * Adds file-level ignores via clang diagnostic pragma The number of instances was large, so I decided to use file-level scope to disable the warning via pragma clang diagnostic ignored. It also showed this warning coming from the gtest dependency. For that, I did add the respective command line flag to the CMake variables. I don't know if this is acceptable or not. * This adds the remaining instances For a build on gfx90a. * fix clang format * Adding couple more instances from gfx1200 build * Fixed another few instances --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2026-02-02 09:39:48 -08:00
Johannes Graner	fabac7e2c3	[Conv] Enable bwd weight splitk autodeduction with cap (#3656 ) * Enable bwd weight splitk autodeduction with cap * Fix error threshold calculations * Add missing logic to wmma multiple d kernel * Fix threshold calculation * Update test with new applicability	2026-01-29 17:40:28 +00:00
Bartłomiej Kocot	83b58bb0c3	Grouped Conv Bwd Weight Direct Load (#3648 ) * Grouped Conv Bwd Weight Direct Load * Update gridwise_gemm_xdl_cshuffle_conv_v3.hpp * Implement group merging for bwd_weight and add instances * Link direct load instances * builder fixes * fix * fixes * fix --------- Co-authored-by: Graner, Johannes <johannes.graner@amd.com>	2026-01-28 15:31:54 -06:00
linqunAMD	23cefda140	[ck] add gridwise base class for in all xdl kernel (#186 ) (#3544 ) 1. Add base class GridwiseGemm_xdl_cshuffle_base for all gridwise_gemm_xdl classes. - to select correct LDS layout and epilogue behavior , three additional parameters is added. - ForceNaiveLdsLayout: disable XOR based LDS layout when it is true - DirectLoad: pipeline only use directload, we need force naive layout and ignore any padding on gfx9 - IsMxGemm: epilogue has two addtional dimensions 2. Move all LDS descriptor layout related fucntion to base class, including - GetABlockDescriptor_AK0PerBlock_MPerBlock_AK1 - GetBBlockDescriptor_BK0PerBlock_NPerBlock_BK1 - GetCShuffleBlockDescriptor_MBlock_MPerBlock_NBlock_NPerBlock 3. Move several LDS related helper funtions to base class, including - GetSharedMemoryNumberOfByte - GetABlockDescriptor_AKB_AK0PerBlock_MPerBlock_AK1 - GetBBlockDescriptor_BKB_BK0PerBlock_NPerBlock_BK1 - GetCBlockDescriptor_MBlock_NXdlPerWave_MWaveMPerXdl_NBlock_NXdlPerWave_NWaveNPerXdl 4. Move all c epilogue related code to base class, and 4 kind of implementation are provided - RunEpilogueNoShuffle - RunEpilogue - RunMultiDEpilogue - RunMoeEpilogue	2026-01-27 12:49:47 -08:00
Enrico Degregori	2e49b6b2f7	Padding support for wave transfer (#3537 ) * Add padding support with transpose Also move check before writing storing is_src_valid during reading * Add/modify instances to use wave transfer for gemm universal Condition is changed so now the vectorsize of vmem reading and lds writing must be equal to 8 in order to use the wave transfer * Fix clang format * Modify example * Fix bwd data * Add restriction for wave transfer with padding and transpose Add test case which shows this limitation * Fix validity checks 8 bit types * Add validity check gemm_bias_add_reduce * Add validity check grouped gemm tile loop * Fix validity checks new flavours * Minor fixes * Fix clang format	2026-01-26 12:57:09 -08:00
chris-tsiaousis-hpc	917f35553a	Remove code duplications in batched gemm (multi D) gemm (multi D) wmma (#3617 ) * Added common struct to enable code reduction in gemm gemm and gemm multi_d gemm multi_d wmma implementation This file includes all shared components. The (shared between the two implementations) kernel, the pointer offset computation struct, the grid descriptor creator and definitions, the invoker struct and the argument struct. Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Used the common struct in the batched gemm gemm wmma cshuffle v3 implementation Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Used the shared structs in the gemm multiple D gemm multiple D wmma cshuffle v3 implementation Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Boy-scout: IWYU paradigm in the gemm gemm and gemm multiple D gemm multiple D wmma cshuffle v3 implementations Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>	2026-01-26 10:20:30 -08:00
Ville Pietilä	7ac3794284	Add new instances for merging multiple fwd conv groups into a single GEMM batch. Allow group merging for C > 1 when vector load/store size is 1 for the output tensor. (#3639 ) Co-authored-by: Ville Pietilä <>	2026-01-25 13:42:23 +01:00
chris-tsiaousis-hpc	e1c46ff548	Remove code duplications in batched gemm wmma (#3580 ) * Moved device struct for batched gemm wmma to a common file Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Use the common device struct in the scaled batched gemm wmma implementation Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Boy-scout: Remove unused includes and ambiguous comment Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Moved pointer offset calculation and gridwise argument to common struct This change enables further code reduction by re-using the common structs for the batched gemm and batched gemm b scale wmma implementations. Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Moved type string to the common struct of DeviceBatchedGemm_Wmma_CShuffleV3_Common" Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>	2026-01-23 12:39:03 -08:00
Wojciech Laskowski	81ee19bd2c	WMMA grouped conv fwd large tensor extra flavors (#3582 ) * Additional flavors for WMMA conv fwd large tensor - added F16/BF16 clamp operation - added F16/BF16 bias_clamp operation - small modification to the device code to accomodate extra tensors * changed strategy to handle GemmArgs array * Adding generic instance * Added generic instance to clamp and bias_clamp ops	2026-01-23 12:19:51 +01:00
Erwin Terpstra	d5ae81b292	Implement batched gemm add relu gemm add for rdna4 (#3391 ) * wip: test suite for batched gemm multiple d gemm multiple d, working on gridwise implenentation * wip: many fixes in implementation of batched gemm gemm multiple d * wip: batched gemm gemm multiple d gridwise op compiling, not working yet * fix: incorrect d0 grid indexing in batched gemm gemm multipled * feat: add instances for batched gemm add relu gemm add * chore: configure instance with low vector transfer size for odd sizes * chore: add some more validation to device batched gemm gemm multiple d, and removed template parameter that didn't really make sense * fix: upate device_batched_gemm_gemm_wmma to work with new gridwise changes * fix: disable odd size tests on XDL archs * chore: removed temporary logging * chore: update some references to C tensor to E tensor * Tentative fix for example template params * Tentative fix for non-multi-D batched gemm gemm device impl. * Tentative fix for xdl example template params * Tentative fix for profiler build on gfx90a * chore: improve device batched gemm gemm multi D comment to include all ops and dimensions * chore: explicitly call ck::make_tuple to prevent issues when std::make_tuple would apply * fix: make the gemm1 data types match what happens in the device op * feat: add d0s/d1s datatypes and layouts to the device op type string * chore: change element-wise op so addition happens in fp32 * chore: add static asserts for gemm0/gemm1 calculated wave sizes * chore: also updated other element-wise ops to use fp32 calculations * chore: log number of supported instances * chore: update instance comment * chore: disable kernel timing in example by default * fix: gemm1 wave size calculation * fix: make sure batched gemm multiple d gemm multiple d profiler performs correct type conversions * chore: remove increased tolerance in batched gemm gemm multiple d example * chore: add comment explaining that verification fails for certain input values * chore: clarify instance comment --------- Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>	2026-01-20 13:06:59 -08:00
music-dino	6300ad3c62	Batched gemm softmax gemm descriptor fix (#3564 ) * Add rocm to prefix path for codegen * Fix issue with c0_matrix_mask construction	2026-01-20 07:25:30 -08:00
Wojciech Laskowski	b09121f860	WMMA support for batched_gemm_reduce (#3332 ) Summary: - added new device impl of Batched GEMM Reduce for WMMA - added instance library - added WMMA impl to the Batched GEMM Reduce tests	2026-01-20 10:50:46 +01:00
Erwin Terpstra	fe40a5d139	Implement batched gemm bias permute for RDNA4 (#3534 ) * feat: test setup for batched contraction (aka batched gemm multiple d e permute) * wip: device struct for WMMA batched contraction multiple d based on new gridwise op * feat: working batched contraction on RDNA, non-naive tensor descriptors for gridwise_gemm_wmma_cshuffle_v3, test setup for odd cases * fix: failure to resolve template parameters when calling new function overload * fix: passing reference type as parameter instead of underlying types * fix: merge error caused duplicate definitions * fix: make sure constness of template and parameters types match * fix: don't compile batched contraction test on unsupported architectures * feat: add example for new wmma implementation, and consolidate example code between platforms * style: return inline instead of with branch * chore: add extra assert on vector memory access sizes * chore: clean up some unused variables * fix: correct tail number calculation, added small cases and extra instances to the test * fix: properly support wave transfer by generating correct grid descriptors dependent on the transfer method	2026-01-17 08:30:27 +01:00
Yung-sheng Tu	6df2d70143	Implement device_gemm_universal_preshuffle_instance for RDNA4 (#3429 ) * add device_gemm_wmma_cshuffle_v3_b_preshuffle.hpp * add examples * add instances to test * remove duplicate code between examples	2026-01-15 07:19:31 -08:00
John Shumway	5122637215	[CK_BUILDER] Convert convolution traits to a struct with factory functions (#3547 ) * Factor helpers out of conv_traits.hpp * Create a non-templated conv_traits struct * Migrate to new instance-specific instance_to_conv_traits functions * Clean up reflection concepts * Clean up ConvTraits helpers * Update testing for convolution traits This is a lot of cleanup on tests to have verbose coverage of feature extraction, explicit tests for each supported device kernel, and simple, readable test code. * Address reviewer comments and resolve merge conflict	2026-01-15 10:03:21 +01:00
Bartłomiej Kocot	a346cfa960	Disable ActiveWorkgroupsPerCU for different arch in wmma kernels (#3566 )	2026-01-14 12:37:12 -08:00
Bartłomiej Kocot	a07c8e38bd	Fix grouped conv bwd data wmma check (#3562 )	2026-01-14 11:04:37 -08:00
Enrico Degregori	693ff3bbb3	Add support for direct store in epilogue and padding support for wave transfer without transpose (#3465 ) - Add support for direct store in epilogue instead of cshuffle - Add padding support for wave transfer without transpose - Add wave transfer with interleaved layout to support direct store - Enable new functionalities on GEMMs - Add optional new functionality support for grouped convolution fwd - Add some fast instances for grouped convolution fwd with new functionalities (proper tuning needed)	2026-01-14 11:02:19 +01:00
Ville Pietilä	9908a87c31	[CK_BUILDER] Add bwd weight factories (#3509 ) * Add placeholder test. * Initial conv bwd weight factory. * Conv builder test refactoring. * Add missing pieces to bwd weight factory. * Improve compile time erros message when no matching factory is found. * Use amcro to ensure automatic macthing between concepts are their string representations. * Improve compile time diagnostics. * Small improvements. * Improve missing member/wrong type compile-time errors. * Improve compile time diagnostics. * Concept bug fixes. * Remove debug assert. * Update algorithm signature diagnostics. * Factory bug fixes. * First functional version of bwd weight conv factory. * Refactor handing of GEMM-K batch template parameter in conv bwd weight factory. * Concept improvements. * Improve concept diagnostics. * Introduve a common size type for concepts. * Update compiletime diagnostics to use the size type. * Update conv specialization enum. * Fix fwd conv builder tests. * Fix smoke tests. * Separate bwd weigth and bwd data tests into separate targets. * Clean-up CK Tile builder tests. * Add bwd weight XDL CShuffle V3 factory. * Build conv bwd weigth v3 instances successfully. * Add instance traits for DeviceGroupedConvBwdWeight_Xdl_CShuffleV3. * Test fix. * Add instance traits for bwd weight algorithms. * Add unit tests for instance strings. * Build new instance traits unit tests but exclude WMMA for now. * Added factory for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle. * Conv bwd weight DL factory. * Final implementation for bwd weight DL factory. * Add test for creating DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle instance. * Add factory for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle * Treat ref algorithm the same way as real algorithms in the dispatcher. * Refactor large tensor support and WMMA configuration. * Add factory and tests for DeviceGroupedConvBwdWeight_Wmma_CShuffleV3. * Update Readme. * Fix WMMA bwd weight tests. * Added factory and tests for DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3. * Factory and tests for DeviceGroupedConvBwdWeight_Wmma_CShuffle. * Dispatching for DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffle. * Add factory for DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3 * Fix DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3 factory and compute types for input and output tensor in bwd weigth convs. * Fix fwd factories after refactoring. * clang-format * Move compile-time diagnostics to a separate branch. * Fix ref algorithm dispatching. * Fix smoke tests. * clang-format * Fix factory for regular WMMA conv bwd weight. * Clarify builder Readme. * Remove obsolete test file. * Fix test after merge. * clang-format * Remove the C++26 extensions. * Unify conv elementwise ops and layout definitions for fwd and bwd directions. * Remove old layout and elementwise ops. * Unify handling of conv tensor types between fwd and bwd directions. * Unify block transfer for fwd and bwd directions. Rename ThreadSliceDim to ThreadClusterRank. * Make BlockTransferDescriptor concept parametrized. Introduce a common TileTransferParameters concept for conv algorithms. * clang-format --------- Co-authored-by: Ville Pietilä <>	2026-01-13 18:12:38 +02:00
Erwin Terpstra	eb041079a3	Implement grouped gemm tile loop for RDNA4 (#3304 ) * feat: grouped gemm tile loop support for RDNA4 * fix: removed extra parameter from grouped gemm example instance * fix: FP8 check incorrectly enabling FP8 on RDNA3	2026-01-13 07:14:23 +01:00
yadaish	32408c8bc0	moe fp8 blockscale use nt (#3524 ) * nt on fp8 blockscale * some improve and tests needs to be fixed * update * fix format * revert useless change * revert any change in amd_buffer_coherence	2026-01-12 10:48:10 +08:00
Johannes Graner	ee2c35b92d	[CK] Allow tensors larger than 2GB in grouped conv bwd weight (#3169 ) * Take split_k into account when checking 2GB tensor limit. * Revert "Take split_k into account when checking 2GB tensor limit." This reverts commit `adf35c91be`. * Optimize grouped conv bwd wei split_k off calc (cherry picked from commit `6f61dd56c5`) * Update gridwise_gemm_xdl_cshuffle_conv_v3.hpp (cherry picked from commit `b33877c10f`) * Fix tensor descriptors and stride calculations * Don't miss half of the elements * Fix buffer size calculations * Disable hack if stride not divisible by k_batch * Clean up comments * Disallow hack in non-contiguous edge cases * Index -> Dim * Fix broken test * Refactor applicability checks into separate function * fix missed variable name * Fix variable name in info print * update V3 2GB check * No more regression, use templates instead * Code deduplication * Regression fix for cshuffle * arch-guarded atomic_add implementations for gfx11 * Similar for half(4\|8)_t as well * Only use both offset hacks at the same time * Revert "arch-guarded atomic_add implementations for gfx11" This reverts commit `3883fe6935`. This reverts commit `5311ec608d`. * Reapply "arch-guarded atomic_add implementations for gfx11" This reverts commit `1972adeddc`. * Only remove float4 atomic_add * Refactor to single flag * Consolidate template parameters * Consolidate flag in transformers --------- Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>	2026-01-08 08:02:02 +01:00
Bartłomiej Kocot	f449a5faaa	Disable fp32 atomic adds on gfx11 (#3510 ) * Disable fp32 atomic adds on gfx11 * Fixes is supported	2026-01-07 15:32:04 -08:00
Enrico Degregori	aad4cf0985	Wmma support for gemm_bias_add_reduce (#3316 ) * Add tests for gemm_bias_add_reduce * Initial working implementation * Generalize implementation of reduce epilogue * Add tests for all layouts * Add instances * Fix test archs * Fix xdl bug * Remove library/profiler duplications * Fix num_byted error profiler * Fix typos * Fix copyright	2026-01-07 10:27:16 -08:00
Erwin Terpstra	f9c6ba0403	Implement grouped gemm fastgelu for RDNA4 (#3303 ) * Implement grouped gemm fastgelu for RDNA4 * chore: some cleanup and minor inconsistencies in grouped gemm profiler * chore: clarified logic and reporting of supported instance warnings	2026-01-07 10:20:44 -08:00
Ville Pietilä	6e8c401e33	[CK_BUILDER] Instance traits for conv bwd weight algorithms (#3498 ) Added instance traits for the following bwd weight conv algorithms DeviceGroupedConvBwdWeight_Xdl_CShuffleV3 DeviceGroupedConvBwdWeight_Wmma_CShuffleV3 DeviceGroupedConvBwdWeight_Wmma_CShuffle DeviceGroupedConvBwdWeight_TwoStage_Xdl_CShuffle DeviceGroupedConvBwdWeight_TwoStage_Wmma_CShuffleV3 DeviceGroupedConvBwdWeight_DL DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3 Added also unit tests for instance traits of those bwd weigth algorithms that are currently exposed by the narrow CK build for MIOpen. --------- Co-authored-by: Ville Pietilä <>	2025-12-31 15:41:15 -08:00
Bartłomiej Kocot	2b8302eb6d	Fix grouped conv wrw kernels names (#3494 )	2025-12-30 16:45:39 +01:00
ApoorvaKalyani	53a1e4f551	Grouped convolution backward data WMMA v3 implementation (#3460 ) * Added device level implementation for bwd_data_wmma_v3. * Added first instance of bwd_data_wmma_v3(f16). * Add support for bwd data in gridwise implementation Some changes are general for convolution and some are specific for bwd data. We need to generalize them once we have fwd, bwd data and bwd weight * Initial device implementation of bwd data * Remove unused template parameters in device impl * Add one instance for different layout initial check of device implementation * Add tests for splitk and for different layouts * Appended more instances to wmma_v3_f16. * Added conv_2d bf16 wmma_v3 instances. * Added conv_3d_bf16 wmma_v3_instances. * Added conv_3d_f16_wmma_v3_instances. * Added SplitN test cases for wmma. * Conv3d_bwd_data_scale_wmma_v3 instances. * Conv3d_bwd_data_bilinear_wmma_v3_instances * Renaming the device level instances file to common name , since it is defined for different DataTypes. * Renaming the instances and fixing typo * Added the test cases to regression test list * NCHW support for wmma_v3 * Examples for bf16 and f16 bwd_data_wmma_v3 * Added transpose conditons for device impl * fixing bugs * Added the gemm_args array implmentation * WIP debug conv bwd * fix splitk * Grouped gemm fix * Update CmakeLists with EOF * Added more instances for tests * Fixed the run time error in examples and removed 3d conv examples. * Fixed a typo. * Updated CmakeLists to removed the 3d convultion deleted files * Added print error statements for unsupoorted argument * Added the merge conflict related changes * Fixed compilation error * Fixed the InstanceFactory duplication error. * Removed the print statements and added logs to Arg function * All the merge conflict related errors resolved * Added d_tensor tests. * Added the missing example types of wmm_v3 * Merge error fix * Corrected the instance name * Reverted the bias relu change * Revereted the transpose load local change * Updated the regression test list with bwd_data_scale * Revert "Revereted the transpose load local change" This reverts commit 0b7281edb2bf008e407006690a00621174d9d19b. * Revert "Merge error fix" This reverts commit f3c85daa474b1b83d10c8a3ce077354e71d91a2b. * Reverting the local change * Added merge error fix * Build error fix due to merge conflicts * Added bias_relu example for wmma_v3 * Modified the main method in dtensor tests * Updated the dtensor tests to pick all the shapes * Updated the dtensor test shapes. * Updated the mem operations in tests. * Added reference func * Fixed typos in device impl * Added new header file and modified the include file for 3d tests * Renamed the test file and added reference func call. * clang format fix * Added ignore params * Modified device impl and tests * Removed debug print statements and updated dtensor test shapes * Fixing merge conflicts * Fixing more merge conflicts * Fixed copyrights * Updated the tuned instances to bilinear and scale. * Adding tuned instances to vanilla wmma_v3 * Removed all unused instances and modified test layouts. * Cleaned up all instances , reverted back fwd fp16 instances and updated tuned fp16 instances. * Fix clang format * Updated tuned f16/-genric instances * Formatting the instances file * Fixed copyrights and clang issues * Nonsense commit to force git to force * Removed the transpose instances * Added verified genric instances * Fixing namespace errors * Added todo for failing shapes * Formatting instance file * Fix instance list formatting * Removing unnecessary formats * Renamed the common file * Unification of xdl and wmma bwd_data tests * Updated Cmake * Added all layout types and deleted code. * Updated Cmake to add the condition to all tests. --------- Co-authored-by: Enrico Degregori <enrico@streamhpc.com> Co-authored-by: Anton Gorenko <anton@streamhpc.com> Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>	2025-12-30 16:25:08 +01:00
Kiefer van Teutem	88ae445580	Replace grouped conv bwd wei wmmaV3 bilin/scale bf16f32bf16 support with bf16bf16bf16 (#3470 ) * Replace grouped convolution bwd weight wmma v3 bilinear and scale bf16f32bf16 support with bf16bf16bf16 support. Update tests. * Tentative fix for bwd weight bilinear bf16bf16bf16, seems like the bilinear elementwise overload for this case (bf16, f32 accu, bf16) was wrong.	2025-12-29 12:58:29 +01:00
Johannes Graner	e1381d6a71	[CK grouped gemm] Fix grouped gemm two stage HasMainK0BlockLoop (#3466 ) * Re-enable two stage kernel * Only disable on HasMainKBlockLoop mismatch * Address PR comments	2025-12-23 11:33:09 +01:00
Bartłomiej Kocot	2955d77f3c	Fix grouped conv fwd wmma porting (#3479 ) * Fix grouped conv fwd wmma porting * add more limitations	2025-12-22 21:32:48 +01:00

1 2 3 4 5 ...

526 Commits